Advances in Synthetic Data Generation for Public Health Research

By Maeve Smillie

Graduate Student Fellow Spotlight: Sarah Milligan (SPH)

Sarah Milligan, PhD Candidate, Biostatistics (SPH) and Hariri Institute Graduate Student Fellow

The Hariri Institute welcomes Sarah Milligan as part of the 2025 Graduate Student Fellow cohort. Milligan is a PhD candidate in the Department of Biostatistics, where she is advised by Fatema Shafie Khorassani, assistant professor of biostatistics, and Janice Weinberg, professor of biostatistics.

Building on her M.S. in Biostatistics from Boston University, Milligan is developing secure methods for data sharing in public health research. By adapting machine learning algorithms like Variational Autoencoders—traditionally used for image and audio data—to tabular health datasets, she generates realistic synthetic data that aims to protect privacy while enabling collaboration among researchers.

In this Q&A, Milligan discusses what inspired her work, the real‐world impact of her methods, and how the Hariri Institute Fellowship will help expand her research’s reach across disciplines.

Can you describe your research focus and its applications?

My research lies at the intersection of machine learning and clinical development, with a focus on enabling secure data exchange among public health researchers while safeguarding protected health information (PHI). I am extending existing machine learning techniques, mainly used on image and audio data, to tabular public health datasets.

This framework facilitates data sharing and fosters interdisciplinary collaboration.

How did you become interested in this? Was there something that inspired this area of interest?

It is well recognized that the quality of research hinges on the accessibility of data. However, obtaining data can be challenging and often delays the progress of independent research.

Recognizing how these delays impede the timely advancement of novel research, I am working to develop methods to streamline data acquisition. As I pursued my dissertation, I became particularly interested in how common machine learning methods could be applied and extended to public health research questions.

To address this need, I am integrating biostatistical principles with machine learning techniques, creating a framework that facilitates efficient data sharing.

What are the main goals or objectives of your research?

My research aims to extend machine learning methods that generate realistic synthetic datasets. These methods are sometimes limited to datasets with a single distribution type. Extending them to multiple distribution types allows public health researchers to explore these methods and accelerate interdisciplinary clinical research.

My main goal is to make complex methods accessible and reliable for public health researchers. Some machine learning algorithms are available only in Python. As I extend machine learning methods to public health applications, I am also creating a package for R, an open source programming language, so that researchers can use the method there.

Has there been a recent development or finding that you find particularly exciting?

The machine learning framework I am extending is the Variational Autoencoder (VAE), which is most commonly applied to image and audio data.

Adapting VAEs to tabular datasets, where features follow heterogeneous distributions, presents unique challenges. I am working on addressing these challenges. A high-level example is data preprocessing. In standard VAE implementations, data are typically standardized; however, when features span different distributions, min-max scaling is more appropriate.

Applying min-max scaling during preprocessing markedly improved the fidelity of the synthetic data relative to the original dataset.
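For readers unfamiliar with the two preprocessing options mentioned in this answer, the following is a minimal illustrative sketch in Python. It is not Milligan's implementation or her R package. The column names and values are hypothetical toy data, and the helper functions `standardize` and `min_max_scale` are written here only for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Min-max scaling: map the observed range of a column onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy table with mixed feature types (not real data).
columns = {
    "age_years": [34, 51, 47, 62, 29],  # roughly continuous
    "smoker": [0, 1, 0, 0, 1],          # binary indicator
    "er_visits": [0, 2, 1, 7, 0],       # skewed count
}

standardized = {name: standardize(col) for name, col in columns.items()}
min_max_scaled = {name: min_max_scale(col) for name, col in columns.items()}

for name in columns:
    print(name)
    print("  z-score:", [round(v, 2) for v in standardized[name]])
    print("  min-max:", [round(v, 2) for v in min_max_scaled[name]])
```

In this toy example, min-max scaling places every column on the same 0 to 1 range and leaves the binary column at exactly 0 and 1, whereas standardization maps the binary column to values that depend on how common each category is.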

What do you feel is most rewarding about your work, either as a professor or researcher?

One of the most rewarding aspects of my work is seeing each milestone bring me closer to real‐world applications.

By streamlining access to individual‐level data through synthetic dataset generation, this approach can accelerate independent and collaborative research. In turn, that acceleration helps identify clinically meaningful results and serve the community faster.

How do you plan on using this fellowship opportunity?

This fellowship will allow me to share my work at conferences and team up with researchers through collaborative events. I am also excited to potentially extend the utility of this method in new fields by collaborating with other Hariri Institute Fellows.