PhD Thesis Defenses

2025

March 11th
Kelley Anderson

Title: Molecular Subtypes of Airway Field Gene Expression Alterations Are Associated with Lung Cancer Phenotypes

Major Professors: Jennifer Beane & Marc Lenburg

Abstract

Lung cancer is the leading cause of cancer mortality. This is primarily due to a lack of interventions to prevent lung premalignant lesions from progressing to invasive cancer, and due to the challenges of early cancer detection when disease is at more treatable stages. Despite past research conducting multi-omic profiling of lung premalignant lesions and lung tumors, the earliest molecular changes associated with lung carcinogenesis are not fully elucidated. To improve lung cancer detection using relatively non-invasive samples, we sought to understand the relationship between gene signatures associated with lung cancer and lung squamous premalignant severity in airway biopsies and bronchial brushings. To improve lung cancer prevention, we studied the molecular alterations associated with lung adenocarcinoma premalignant lesions and assessed their similarity to aggressive tumors.

First, we analyzed bulk mRNA sequencing of samples collected from bronchial brushings and endobronchial biopsies from a cohort of ever smokers with indeterminate pulmonary nodules. Previously published lung cancer signatures were concordantly enriched among genes associated with lung cancer in the bronchial brushings and biopsies; however, only the results in the bronchial brushings were statistically significant. We found that biopsy samples could be classified into previously described lung squamous premalignant lesion molecular subtypes. There was a significant difference in smoking status among the subtypes, similar to prior characterization of the subtypes. There was no significant enrichment based on cancer diagnosis among the subtypes. We found enrichment for cell type-specific molecular signatures for stromal and immune cells in the biopsies compared to the bronchial brushings. While a cancer signature able to discriminate between patients with benign and cancerous lesions was challenging to derive in the biopsies, our results suggest that pathways related to cell metabolism, oxidative stress, and cell proliferation are upregulated in the biopsy samples from patients with cancer.

Next, we investigated gene expression changes in lung adenocarcinoma premalignant lesions using lung resection cases with normal, premalignant, and tumor histology. RNA and DNA were isolated from laser capture microdissected tissue (representing a specific histology) and underwent mRNA and exome sequencing. Using multiple gene expression datasets of lung adenocarcinoma premalignant lesions, we derived a set of recurrent gene co-expression modules. Using these modules, we identified four archetypes. In archetype analysis, samples closest to each vertex typically represent unique biological features. Analyzing the expression of published gene signatures suggests that distances from the archetypes are differentially associated with signatures related to the immune environment, cell signaling, and proliferation. Based on these enrichment patterns, the archetypes were named: normal-like, inflammation, cell adhesion, and proliferation. Samples closer to the inflammation and proliferation archetypes were associated with unique driver mutations, and samples closer to the proliferation archetype had increased features relating to genome instability. To more fully characterize the archetypes, we projected them into independent lung adenocarcinoma tumor datasets. Across multiple studies, samples closest to the proliferation archetype were associated with decreased disease-free survival and with tumors with aggressive histologic patterns. Archetype analysis allowed for identification of distinct features among lung adenocarcinoma premalignant lesions, and helped characterize them based on their relationship to aggressive disease features.

In summary, I investigated gene expression changes in the airway to improve our understanding of lung cancer pathogenesis. Past research demonstrated that squamous premalignant lesions in the airway are a marker of risk of any type of lung cancer. To better understand these findings, we analyzed the relationship between cancer-associated signatures and premalignant lesion-associated signatures of airway biopsies. Past studies of lung adenocarcinoma demonstrated transcriptional differences between tumors and premalignant lesions. Our archetyping approach uniquely characterizes premalignant lesion-specific features that relate to pathway alterations and prognosis. Characterizing these premalignant lesion features may improve our understanding of divergent mechanisms of carcinogenesis and enable identification of aggressive lesions capable of developing into invasive cancer.

March 14th
Regan Conrad

Title: Dissecting the Airway Field of Injury and Immune Microenvironment in Lung Squamous Premalignancy at a Single Cell Resolution

Major Professor: Jennifer Beane

Abstract

Lung cancer is the leading cause of cancer deaths globally, but early detection significantly improves survival, as evidenced by the success of low-dose computed tomography (LDCT) in reducing mortality rates. While LDCT identifies lung cancer at earlier stages, further survival improvements may be achieved by detecting and treating premalignant lesions (PMLs), the abnormal airway changes that precede lung cancer. Not all PMLs progress to invasive cancer; some remain stable or regress, emphasizing the need to understand the mechanisms behind these outcomes. Beyond localized lesions, molecular alterations in the airway field reflect processes occurring throughout the lung, providing a potential avenue for noninvasive biomarker development. Studies show gene expression in bronchial and nasal brushings can distinguish between benign and malignant lung nodules, suggesting airway profiling can aid lung cancer risk assessments. This thesis employs single-cell RNA sequencing (scRNA-seq) and bulk transcriptomic analysis to examine immune and epithelial changes linked to PML severity and to identify molecular shifts in the airway field that may serve as early detection biomarkers and inform prevention strategies.

Previous studies identified immune changes associated with PML progression but lacked the resolution needed to determine the contributions of specific immune populations. I analyzed immune cells from endobronchial biopsies across histological grades, identifying 14 subpopulations that differed in abundance between low-grade (LGB) and high-grade (HGB) samples. A cluster of CD4+ T regulatory cells and a subset of plasma cells were enriched in HGB, while multiple CD8+ T cell subpopulations and a neutrophil subpopulation were more abundant in LGB.

To examine airway-wide molecular alterations, I analyzed scRNA-seq data from bronchial and nasal brushes along with the biopsies. Nasal brush cells were enriched in gene modules associated with detoxification and mucus production, indicating upper airway specialization. Smoking-related transcriptional changes exhibited a gradient, being most pronounced in bronchial biopsies, followed by bronchial brushes, and least evident in nasal brushes. Notably, a bronchial brush sample clustered with high-grade basal cells, and an upregulated gene module in those cells correlated with the most severe lesion histology from bronchial brushes analyzed via bulk RNA-seq.

Together, these findings highlight how both immune and epithelial cell populations are altered in PMLs and illustrate how molecular changes throughout the airway reflect the biology of nearby lesions. By leveraging profiling of the immune microenvironment in PMLs and samples collected using noninvasive brush techniques, this work offers new insights into the early stages of lung carcinogenesis and the potential for developing biomarkers to improve PML detection and risk stratification.

March 27th
Mengze (Vanessa) Li

Title: Advancing Aging Research: From Computational Tools to Molecular and Cellular Insights

Major Professor: Stefano Monti

Abstract

Aging is associated with dynamic changes across multiple biological systems. This project aims to advance aging research by investigating genotype information, molecular characteristics, and cell type composition.

The first section presents a computational pipeline for testing genotype-phenotype associations through the identification of quantitative trait loci (QTL). This pipeline integrates key steps in the QTL discovery process, including input data quality control, association testing, and results aggregation and visualization. It is applicable to datasets with or without family structure, leveraging linear and linear mixed-effects models. Developed using Nextflow, the pipeline supports process parallelization and automated workflow management.

The second section examines the relationship between aging and molecular features, with a particular focus on gene expression. RNA sequencing data from the Long Life Family Study was analyzed using linear mixed-effects models to identify transcriptomic signatures associated with aging, extreme longevity, and mortality risk. Enrichment analysis revealed multiple gene sets reflecting key hallmarks of aging, including increased inflammatory signaling and reduced protein synthesis. Additionally, a transcriptomic aging clock was developed using an elastic net model, with its predicted biological age showing a significant association with mortality risk.

The final section of this thesis project investigates age-related changes in cell type composition using flow cytometry data from the same cohort. Additionally, cell type deconvolution was applied to the gene expression data mentioned above to evaluate state-of-the-art deconvolution methods. Notable trends in cell type composition changes were observed, including a decrease in naïve CD4 T cells and an increase in monocytes with age.

An important deliverable of this research project was the derivation and annotation of signatures of age-associated phenotypes, and their comparison and integration with previous results. To address challenges in storing and organizing biological signatures across research projects, an R6 object, OmicSignature, was developed to streamline the management and retrieval of biological signatures in the R programming environment. OmicSignature facilitates structured organization and efficient access to pre-existing signatures, enhancing workflow efficiency in biological research.

In summary, this project advances our understanding of the molecular and cellular mechanisms underlying human aging, and provides computational and analytical resources for future aging research.

June 30th
Anastasia Leshchyk

Title: Developing Computational Methods for Multi-Omics Integration to Reveal Mechanisms of Healthy Aging

Major Professor: Paola Sebastiani

Abstract

This research project investigates complementary dimensions of aging through three integrated aims, each addressing a critical gap in our understanding of aging mechanisms. In Aim 1, we investigated the role of somatic genomic instability by analyzing mosaic chromosomal alterations (mCAs) in two cohorts enriched for longevity: the New England Centenarian Study (NECS) and the Long Life Family Study (LLFS). We found that while the prevalence of mCAs increases with age, it plateaus after 102 years, suggesting that high mCA burden may be incompatible with extreme longevity. Moreover, individuals with familial longevity and carriers of protective APOE alleles exhibited significantly fewer mCAs, reinforcing the hypothesis that genetic and familial factors may protect against somatic alterations.

In Aim 2, we addressed the challenge of biological interpretability in aging clocks by developing a system-specific framework using pathway-based transcriptomic clocks. Instead of training a single global clock, we constructed 50 clocks based on Hallmark gene sets and identified distinct pathways — such as G2M checkpoint, inflammatory response, and hypoxia — as strong predictors of mortality. Replication in an independent cohort (ILO) and network analysis revealed a modular structure of transcriptomic aging, suggesting that aging is driven by interacting but semi-independent biological subsystems.

In Aim 3, we focused on multi-omic integration to develop models that explain some biological mechanisms of aging. To this end, we developed the Bayesian Community of Networks (BCoN), a novel framework that overcomes the limitations of traditional Bayesian network learning from incomplete data. BCoN leverages multiple imputation and ensemble structure learning to generate a community of networks. This method enables more robust probabilistic inference, and we validated it on both benchmark and synthetic omics datasets. We further implemented BCoN as part of a software pipeline, facilitating reproducible analysis and interactive network exploration.

Together, these three aims contribute to a more nuanced and mechanistic understanding of aging by linking somatic genomic alterations, system-specific biological clocks, and probabilistic modeling into a unified framework for studying human longevity.

August 14th
Rebekah Miller

Title: High-Throughput Assessment of Multi-Protein Complexes Demonstrates Rules Governing Transcription Regulation

Major Professor: Trevor Siggers

Abstract

Regulation of the spatiotemporal expression of genes is vital for the survival of cells in dynamic environments and for the existence of multicellular organisms comprising multiple distinct cell types. Thus, there is a need for high-throughput approaches to characterize the gene regulatory mechanisms that define distinct cell types and states. Gene expression changes are primarily coordinated by transcription factors (TFs) that bind throughout the genome and recruit cofactors (COFs) to establish and maintain the epigenome. These TF–COF interactions can be cell type-specific, can change in response to signals, and are dysregulated in many diseases. To investigate how TF–COF interactions control transcriptional and epigenetic differences across cell types and states, we developed several protein-binding microarray (PBM)-based approaches to characterize TF–COF complexes active in cell extracts. We developed CoRec (Cofactor Recruitment) as a high-throughput method for characterizing TF–COF complexes and demonstrated that it can be used to profile diverse TF–COF complexes across different cell types. To examine TF–COF interactions central to histone acetylation and gene expression, we profiled seven lysine acetyltransferases (KATs) in resting and activated Jurkat T cells. We demonstrated that TF–COF networks are highly dynamic, with 35% of interactions identified in resting T cells altered after 45 minutes of T-cell receptor (TCR) stimulation. Additionally, we found that heterotypic clusters of TF binding sites that recruit the KATs P300 and CBP are associated with increased promoter H3K27ac levels. To examine how TFs work together in the recruitment of COFs, we developed the Cooperative Cofactor Recruitment (CCR) assay to profile synergistic recruitment of COFs by pairs of TF binding sites (TFBSs).
We applied the CCR assay to four KATs in resting Jurkat T cells and found that 45% of profiled TFBS pairs exhibit cooperative KAT recruitment, but only 1% of these configurations cooperatively recruit more than one KAT. We also demonstrated that cooperative recruitment of the KAT P300 is associated with increased reporter gene expression, suggesting that combinatorial TF logic may play an important role in regulating gene expression. Finally, we used the CoRec method to investigate the concept of indirect TF recruitment, where a TF is recruited to DNA by another DNA-bound TF, much like a COF is recruited to DNA. We examined indirect recruitment of 12 TFs in K562 cells that we predicted might operate by both direct DNA binding and indirect recruitment. We found numerous examples of indirect TF recruitment, including several previously unreported interactions. We also determined that the presence of indirect recruitment sites in addition to a TF’s own binding site is predictive of increased TF occupancy at genomic loci. Overall, this work describes new methodological approaches to examine TF–COF and TF–TF interactions and provides insights into the mechanisms of gene regulation, emphasizing the dynamic nature of TF–COF interactions and highlighting the complexity of combinatorial TF logic.

August 26th
Yusuke Koga

Title: Discovering Molecular Subgroups and Cellular Heterogeneity Associated with Pulmonary Disease

Major Professor: Joshua Campbell

Abstract

September 11th
Kathryn Atherton

Title: The Impact of Urbanization on the Tree-Associated Microbiome

Major Professor: Jennifer M. Bhatnagar

Abstract

Urbanization profoundly reshapes ecosystems, altering both abiotic conditions and the interactions between organisms, yet its effects on tree-associated microbial communities and their potential implications for urban forest health remain poorly understood. This dissertation examines how urbanization influences the tree holobiont – including trees and their associated microbiomes – and how we can select city tree planting locations to maximize the ecological and social benefits of urban trees. I first demonstrate that urbanization increases overall forest soil microbial connectivity while simultaneously disrupting key ecological associations between microorganisms, notably reducing connectivity between ectomycorrhizal fungi that form mutualistic relationships with tree roots and the rest of the soil microbiome. Next, I show that urbanization shifts the structure and composition of oak tree microbiomes, decreasing mutualistic symbionts and increasing decomposers and pathogens, with consequences for biogeochemical cycling, including higher potential nitrogenous greenhouse gas emissions and reduced methane consumption. These microbiome shifts correlate with urban stressors such as heat, drought, and inorganic nutrient deposition. Integrating microbial and environmental data, I also show that urban tree growth and survival are shaped by the composition of the tree microbiome and soil properties: growth is positively associated with soil microbial diversity and abundances of saprotrophs and root endophytes, whereas mortality is linked to pathogenic and wood-decomposing taxa. These findings highlight the importance of studying urbanization effects on the whole tree holobiont to understand its potential impacts on tree longevity and health in cities.
To better predict functional outcomes from microbiome analysis, I developed Fun2FITS, a computational pipeline that links fungal ITS amplicon data to predicted gene content, enabling scalable inference of fungal functional potential across ecosystems. Fun2FITS captures patterns in fungal functional gene abundances in soil, particularly for ectomycorrhizal fungi, that help explain critical ecosystem processes like soil nitrogen cycling. Finally, I translate these insights into applied urban forestry strategies through GIS-based analyses that identify optimal front-yard planting locations in Boston to maximize ecological and social benefits of trees. Collectively, this work provides a mechanistic understanding of how urbanization reshapes tree-microbe interactions, microbial functional potential, and urban forest health, offering tools for evidence-based management of resilient, sustainable urban tree populations.

2024

January 26th
William Hackett

Title: Improving Reproducibility and Standards in Quantitative N-Glycoproteomic Data

Major Professor: Joseph Zaia

Abstract

More than half of all human proteins are glycosylated, making glycosylation one of the most abundant post-translational modifications in proteomics. N-glycosylation is a prevalent and diverse type of glycosylation with key roles in regulating systems such as protein folding and host-pathogen recognition. Without proper understanding of the heterogeneities of N-glycosylation, efforts to understand biological systems and to combat the maladies that affect those systems will be hindered, knowingly and unknowingly. N-glycosylation is a semi-stochastic process governed by local chemistries and enzymatic availability, and it is regulated by end-process evaluation, making modeling infeasible. This drives glycoproteomics to rely on observational data from tandem mass spectrometry, a powerful tool that comes with logistical and technical limitations on the availability and compatibility of data. N-glycopeptides can be identified in tandem mass spectrometry data, but with greater uncertainty than in traditional proteomics, owing to a variety of factors. This uncertainty propagates into the quantification of these molecules, generating interdependent datasets with small sample sizes and high missing value rates. N-glycans are inherently interrelated by the biosynthetic network in which they are processed; as a result, they share much information and many chemical properties, which makes identification and quantification more difficult. While advances in N-glycoproteomics continue, much is still needed for a true and reliable understanding of quantitative N-glycoproteomics. To make use of the existing data, an R package called RAMZIS (Relative Assessment of m/z Identifications by Similarity) was developed. This toolkit focuses on data quality assessment and identifying broad differences between glycosylation sites.
Using a series of permutation tests with a weighted Tanimoto similarity assessment, RAMZIS provides researchers with information on their ability to use their data, the presence of outliers, the probable differentiability of glycosylation sites, and how to improve their future experiments. Data Independent Acquisition (DIA) has enabled vast improvements in the ability of proteomics to quantify and identify proteins in complex samples, but these improvements cannot be directly applied to glycoproteomics. Glycoproteomic datasets are more heterogeneous than deglycosylated proteomic datasets and have lower overall signal, the latter compounding the issues made by the former. For glycoproteomics to make full use of the power of DIA and account for its idiosyncrasies, a large number of bioinformatic advancements need to be made in glycopeptide identification, validation, and quantification. To this end, we developed a Python package called GlyLine as a framework to assess glycoproteomic DIA data; it tracks coeluting product ions of identified glycopeptides, splitting the signal from shared product ions in order to produce MS2-level quantifications of the identified glycopeptides and provide databases of information for further analysis. As glycoproteomics advances and comes into greater prominence, it is vital that experiments and bioinformatic workflows be repeatable, as quantitative glycoproteomic data is reported in many different ways that are often incompatible. We have worked with the MIRAGE Commission to develop a community-based minimum reporting guideline for glycoproteomic experiments.

March 21st
Aubrey Odom

Title: Methods and Tools for Characterizing Microbial Communities in the Context of Chronic Diseases

Major Professor: W. Evan Johnson

Abstract

The human microbiome, a complex ecosystem of microorganisms inhabiting various body sites, plays a crucial role in the immune system and overall human health. A comprehensive understanding of the microbiome and its interactions with the host is essential for advancing scientific knowledge and potential therapeutic interventions. This work focuses on two aspects of the microbiome space: taxonomic profiling for the identification of resident microbes, and longitudinal analysis to unravel the dynamics of microbial communities over time.

A fundamental step in microbiome analysis is taxonomic profiling: the identification of resident microbes in samples. Numerous tools have been developed to cater to different sequencing types (e.g. 16S versus WGS) and contexts. However, despite significant advances in the profiling field, further work is needed to establish optimal methods for metagenomic classification. To address this gap, we introduce MetaScope, a comprehensive R-based package for accurate microbial composition identification at a strain-level resolution within a sample. We have performed benchmarking against mock microbial communities to validate MetaScope’s performance against popular competitors using 16S datasets.

Microbial time-series data presents unique challenges, including intricate covariate dependencies and diverse longitudinal study designs. Existing methods often fall short in addressing these challenges, lacking versatility, data type specificity, or the ability to account for the compositional nature of the data. In response, this work introduces LegATo, an open-source suite comprising modeling, visualization, and statistical tools tailored for analyzing microbiome dynamics. LegATo, with its user-friendly interface, accommodates various study structures and incorporates Generalized Estimating Equation (GEE) models, Hotelling’s T-squared tests, and several visualization functions. This toolkit enables researchers to identify microbial taxa affected by perturbations over time, such as the onset of disease or lifestyle changes, and predict their effects on the composition or stability of commensal bacteria. To illustrate the practical application of LegATo, we present two case studies focusing on the nasopharyngeal microbiomes of Zambian infants exposed to HIV and experiencing fatal acute febrile illness. These applications showcase the efficacy of LegATo for unraveling the complex dynamics of microbial communities, providing insight into the impact of specific perturbations on the microbiome.

In conclusion, this research contributes to the advancement of microbiome analysis by enhancing taxonomic profiling methodologies and addressing the challenges posed by longitudinal data. The presented tools, MetaScope and LegATo, provide valuable resources for researchers exploring the intricate interactions between the microbiome and host over time, paving the way for a deeper understanding of microbial dynamics and their implications for human health.

July 10th
Shruthi Bandyadka

Title: Multimodal Investigation of Cell Death and Clearance in Drosophila Melanogaster

Major Professor: Kimberly McCall

Abstract

Cell death shapes and sustains life. Over the past decade we have begun to understand the breadth of physiological and biochemical diversity in cell death and clearance pathways, which play vital roles in organismal development and health. While apoptosis and necrosis have been studied extensively across many model systems and contexts, the discovery of non-apoptotic paradigms of cell death and their roles in disease has greatly expanded the field. Collectively called Regulated Cell Death (RCD), these death pathways are actuated in a tissue- and context-dependent manner (e.g. disease state). This dissertation is a culmination of multiple projects investigating cell death and clearance events spanning the ovary and the brain of the pliable and reliable Drosophila melanogaster. We undertook the first multi-modal, high-throughput survey, involving single-cell RNA-seq, TRAP-seq, and proteomics, to compare two different archetypes of germline death in the fly egg chamber – apoptosis and phagoptosis. Our analysis identified several important candidates and pathways that are either unique to or shared between the germline death modalities and the striking consequences to oogenesis upon their disruption. We also observed that V-ATPases, proton pumps required for germline phagoptosis, are differentially localized throughout oogenesis and identified the specific subunits upregulated in phagoptosis. Further, we identified a novel exon splicing event in the ‘a’ subunit isoform of V-ATPases that may facilitate its sub-cellular localization. Using a novel image analysis method involving image segmentation and spatial statistical inference, we determined that circulating immune cells agglomerate at specific niches within the abdomen in response to egg chamber degeneration resulting from the physiological stress of protein deprivation. We then turned our focus to cell clearance in the fly brain.
Phagocytosis by glial cells is essential for pruning synapses and for the removal of dying neurons and misfolded proteins. Disruptions to glial phagocytosis result in a range of age-dependent neurodegenerative phenotypes, primarily exemplified by vacuolization of brain tissue. Vacuolization can also result from other sources of physiological insult, such as traumatic brain injury. To date, quantification of vacuole size and severity has been a tedious, non-replicable exercise. We addressed this need by utilizing a pre-trained deep-learning model to perform image segmentation and reconstruct vacuoles in 3D, quickly and reproducibly, for semi-automated volumetry. We then applied this method to characterize the severity of neurodegeneration in brains lacking the phagocytic receptor Draper in glia and further demonstrated that this phenotype is attenuated by RNAi of the NF-κB transcription factor Relish in flies lacking glial Draper. Finally, by analyzing single-nucleus RNA-sequencing data of wildtype and draper RNAi fly brains, we characterized changes to the composition and transcriptional profiles of cell populations in the brain and identified several key pathways disrupted during neurodegeneration. Collectively, the methods and results described herein will have applications beyond the Drosophila model and the field of cell death, with important implications in understanding fertility and the underpinnings of cognitive disorders.

November 18th
Nicholas O’Neill

Title: Characterizing Cell-Type and Neuron Subtype Activity and Abundance in Asymptomatic Alzheimer Disease

Major Professor: Xiaoling Zhang

Abstract

December 11th
Devlin Moyer

Title: The Fundamentals of Genome-Scale Metabolic Models and their Application to the Study of Evolution and Cancer

Major Professor: Juan Fuxman-Bass

Abstract

December 11th
Michael Silverstein

Title: Principles of Microbiome Structure and the Implications for Climate Change Mitigation

Major Professor: Daniel Segrè

Abstract

Microorganisms assemble into communities of various structures across virtually all of Earth’s environments, where they drive biogeochemical processes that range from microscopic to climatic scales. Only in recent years has the study of these communities, or microbiomes, moved beyond demographic surveys of where microbes reside towards a more systematic perspective that seeks to identify principles that govern microbiome structure generically. Uncovering these principles could enable unprecedented control over microbiome structure and the environments that communities reside in through microbiome engineering. In this dissertation, I first reviewed the legacy of environmental microbiome engineering, considered its prospects for climate change mitigation, and proposed approaches for overcoming outstanding challenges to its implementation. These proposals include using directed evolution to design microbial communities as inocula to boost the carbon stabilization capacity of soils and calling for further research into the principles that govern the establishment of new microbial communities into existing ecosystems. Second, I addressed this latter proposal by performing a study which uncovered a novel principle of microbiome structure: that environmental metabolic complexity drives the taxonomic divergence of microbial communities, that is, the extent to which taxa differ between communities. This suggests that complex environments may be more susceptible to microbiome engineering since these environments can host a larger diversity of types of microbial communities, which may include communities with higher capacity than the resident one to perform climate change mitigating activities, for example. Finally, I explored how taxonomic divergence relates to functional divergence to understand what types of environments can host communities that vary in function. Ultimately, these projects contribute to growing efforts to understand microbiome structure and inform microbiome engineering efforts to combat today’s greatest challenges, including the mitigation of climate change.

2023

January 4th
Kritika Karri

Title: Computational Characterization of Long Non-Coding RNAs (LncRNAs) and Study Their Role in Rodent Liver Disease, Xenobiotic Exposure, and Sex-Specific Responses Using Bulk and Single Cell RNA-Sequencing

Major Professor: David Waxman

Abstract

LncRNAs comprise a heterogeneous class of thousands of RNA-encoding genes whose functions are largely unknown. This thesis describes systematic computational approaches to discover liver-expressed lncRNAs globally and then deduce their regulatory roles in response to foreign chemical and hormonal exposures. In a first study, bulk liver RNA-seq data was used to discover liver-expressed lncRNAs responsive to multiple xenobiotics in a rat model. Ortholog analysis combined with co-expression data and causal inference methods was used to infer lncRNA function and deduce gene regulatory networks, including causal effects of lncRNAs on biological pathways. This work provides a framework for understanding the widespread transcriptome-altering actions of foreign chemicals in a key-responsive mammalian tissue. In a second study, single-cell RNA-seq was employed to develop a reference catalog of 48,261 mouse liver-expressed lncRNAs, a majority novel, by transcriptome reconstruction from > 2,000 bulk public mouse liver RNA-seq datasets. Single cell RNA-seq was sufficiently sensitive to detect >30,000 mouse liver lncRNAs and characterize their dysregulation in mouse models of high fat diet-induced non-alcoholic steatohepatitis (NASH), carbon tetrachloride-induced liver fibrosis, and hepatotoxicity induced by the Ah receptor agonist TCDD. Trajectory inference algorithms uncovered lncRNA zonation patterns in five major hepatic cell populations and their dysregulation in diseased states. LncRNAs expressed in NASH-associated macrophages, closely linked to disease progression, and in collagen-producing myofibroblasts, a key source of the fibrous scar in fibrotic liver, were identified. Regulatory network analysis linked individual lncRNAs with key biological pathways and gene centrality metrics identified network-essential regulatory lncRNAs in each liver disease model. 
In a third study, single nucleus RNA-seq combined with single nucleus ATAC-seq mapping of open chromatin regions elucidated functional linkages between cis- and trans-regulatory elements and their downstream gene targets, notably genes showing sex differences in expression that impact metabolism and disease risk. Liver cell type-specific chromatin accessibility signatures were identified, as were sex-specific accessibility signatures for hepatocytes and their associated DNA regulatory region motifs. Integrative modalities were employed to elucidate transcription factor-based mechanisms involved in sex-specific growth hormone-regulated gene expression by identifying transcriptional and epigenetic changes during feminization of mouse liver. Together, these studies characterize lncRNA function and can motivate future experiments.

January 17th
Jamie Strampe

Title: Utilizing Blood-based Biomarkers to Characterize Pathogenesis and Predict Mortality in Viral Hemorrhagic Fevers

Major Professor: John H. Connor

Abstract

Hemorrhagic fever viruses are a major public health threat in Sub-Saharan Africa. These kinds of viruses cause symptoms ranging from non-specific fevers and body aches to severe, life-threatening bleeding, shock, and multi-organ failure. Previously discovered hemorrhagic fever viruses can cause recurrent or seasonal outbreaks, but new ones continue to emerge. In order to combat these viruses, we need to better understand the aspects of pathogenesis that lead to mortality or survival. I will present analyses of the host immune response to two hemorrhagic fever viruses, Lassa and Bundibugyo, and how the host response can be used to predict mortality in these diseases.

Lassa virus (LASV) was identified over 50 years ago, but it remains understudied and has hence been denoted a “Neglected Tropical Disease”. Longitudinal blood samples were collected by our collaborators from over two hundred Nigerian Lassa Fever patients, and concentrations of over 60 proteins were analyzed. I processed the datasets, performed statistical testing, and created logistic regression models for each protein. This modeling allowed me to determine which proteins could be used as predictive biomarkers of mortality and the level of each protein that best separated patients who died from those who survived. I then produced an application using RShiny that incorporated these and other exploratory analyses of the data, allowing users to visualize the full dataset in addition to the plots that were published.

The filovirus Bundibugyo ebolavirus (BDBV), a relative of the more well-known Ebola (Zaire) virus (EBOV), first caused an outbreak in people fifteen years ago. Animal models are still being developed and characterized for this virus. Our collaborators in Texas experimentally infected cynomolgus macaques with BDBV and gave them post-exposure treatment with a VSV-based vaccine, performed RNA-Seq on longitudinal samples from the infected macaques, and sent these data to me for analysis. I wrote pipelines to perform RNA-Seq and differential expression analyses on over 600 samples, of which I will focus on a subset here. I found differentially expressed genes for different subsets of the data, and I examined these gene lists using gene set enrichment analysis. I then created models to predict mortality at either late or early timepoints, or over the entire course of disease.

April 10th
Conor Shea

Title: Identification of Gene Programs Associated with Histology and Progression of Lung Squamous Premalignant Lesions at Single Cell Resolution

Major Professor: Marc Lenburg

Abstract

Squamous cell carcinoma of the bronchus is the second most common and fatal subtype of lung cancer. In the process of squamous carcinogenesis, the normal bronchial epithelium undergoes a series of histologic transformations known as the metaplasia-dysplasia-carcinoma sequence. These intermediate histologic patterns are called premalignant lesions, and occur prior to the development of cancer. Compared to early stage cancer, survival following resection of premalignant lesions approaches 100%, highlighting the promise of lung cancer interception. However, because of our lack of understanding of the molecular events during squamous carcinogenesis, we are currently unable to predict which lesions will progress to cancer, and we do not have molecular targets for noninvasive treatment. The work in this thesis seeks to improve our understanding of the changes associated with grades of premalignant histology and progression at the level of single cells.

We performed single cell RNA sequencing on a cohort of 41 lesions from 26 patients, encompassing the normal-appearing bronchus, premalignant lesions, and early stage carcinoma. First, we described histology-associated changes in basal cells. Basal cells from low grade lesions expressed genes related to the maintenance of the normal epithelium, while basal cells from high grade lesions expressed genes related to the cell cycle and detoxification of the airway from smoking toxicants. Secondly, we identified a high grade lesion undergoing the epithelial-to-mesenchymal transition. These cells transitioned from a high grade basal cell state, lost their expression of basal cell markers, and expressed canonical EMT genes, including SPARC, FN1, and MMP2. Finally, we identified shifts in T cell subtypes and widespread expression of exhaustion markers PD-1, CTLA4, LAG3, and TIGIT co-occurring with high grade basal cells.

Next, in a novel computational framework, we sub-clustered histology- and progression-associated modules from previous bulk transcriptomic studies based on their expression patterns in our single cell data to elucidate the contributions of individual cell types to histology- and progression-associated gene expression changes. Further, we modeled the expression of cell type-specific modules across histology in the single cell data set, in order to disentangle changes in cell type expression and proportion in the bulk RNA sequencing data. Through this analysis, we identified a module of genes expressed in B and dendritic cells involved in antigen presentation through the MHC II pathway whose expression was decreased in progressive lesions. We also identified a module of stromal-expressed genes that were less expressed in progressive lesions, which had not previously been identified. Associations between module expression, histology, and progression were validated in a second data set.

This work improves our understanding of the signaling and interactions between cell types associated with histology and progression of premalignant lesions. These findings may be used to improve our prognostication and treatment of premalignant lesions.

April 12th
Dakota Hawkins

Title: Bioinformatic Approaches for Understanding Cell-Type Diversification During Development in an Urchin Model

Major Professor: Cynthia Bradham

Abstract

From the discovery of developmental gradients to pioneering some of the first gene regulatory models, the sea urchin model has played a foundational role in deciphering the complex molecular mechanisms behind the phenomena that underlie pattern formation during embryonic development. Of particular interest to our lab, primary mesenchyme cells (PMCs), a skeletogenic lineage, provide an excellent system for understanding the mechanisms behind skeletal pattern formation. Sea urchin skeletal patterning is driven by ectodermal cues that are differentially expressed in space and time; these cues instruct the PMCs. Originating as a homogeneous population, PMCs diversify in response to patterning cue reception, then produce distinct skeletal elements as a function of the cues that they have received from the ectoderm. However, the exact mechanisms underpinning PMC diversification and the role that individual ectodermal cues play in mediating this diversification process are poorly understood. To bridge that knowledge gap, this work leverages multiple data modalities, including single-cell RNA sequencing (scRNA-seq) and 3D visualization of gene expression in normal and perturbed embryos, to not only present an exhaustive description of PMC diversification, but also offer novel computational approaches and the resources necessary for these studies.

First, we present the novel algorithm ICAT. Created to correctly identify cell states from mixed-condition scRNA-seq experiments, ICAT plays a necessary role in identifying PMC subpopulations affected by ectodermal cue disruption. Using simulated and real datasets, we benchmark ICAT against several state-of-the-art workflows, and find ICAT provides more robust and sensitive performance compared to current practices. We further validate ICAT in vivo using single molecule fluorescent in situ hybridization (FISH) and show that, compared to leading algorithms, ICAT uniquely and correctly characterizes the effects of patterning cue disruption on PMC subpopulation composition.

Finally, by combining temporal scRNA-seq data throughout skeletal patterning with a newly generated spatial gene expression reference map, we not only identify distinct PMC subpopulations, but also provide spatial and temporal coherence to each of their developmental trajectories during skeletal pattern formation. We complement this work by inferring the gene regulatory networks underlying PMC diversification and thereby identifying the transcriptional regulators that function as network hubs. We empirically demonstrate that these hubs are required for skeletal patterning, and spatially map their expression within the PMCs. Sequencing single PMCs isolated from embryos in which ectodermal cue function was inhibited, we show that functional loss of each cue uniquely disrupts the PMC gene regulatory network and characterize the subsequent effects on PMC subpopulation composition. Taken together, this work defines the spatiotemporal details of PMC diversification in normal embryos and in embryos with individual cue losses, and offers numerous novel computational methods and resources necessary for these advances.

June 7th
Ahmed Youssef

Title: Computational Methods to Uncover Cell State Proteomes and Profile Protein Interaction Dynamics

Major Professors: Andrew Emili & Mark Crovella

Abstract

Proteins, through their networks of interactions, carry out most essential biological processes governing cellular functions, yet the proteome remains largely unexplored at single-cell resolution and existing models of protein interactions do not capture the dynamic nature of the interactome, representing crucial gaps in our understanding of proteome organization and function. Despite the critical position that the proteome occupies in the functional landscape of the cell, proteomics has lagged behind other data-driven systems biology subfields when it comes to the development of tailored computational strategies for interrogating its complexities. The research projects discussed in this dissertation aim to build upon the rapidly evolving computational proteomics toolkit by developing novel algorithms to address existing analysis gaps with regard to single-cell proteomics and protein interaction dynamics.

In this dissertation, we present DESP, a novel algorithm that leverages independent readouts of cellular proportions, such as those from single-cell RNA-sequencing, to resolve the relative contributions of cell states to bulk molecular measurements, most notably quantitative proteomics, recorded in parallel. DESP provides a generalizable computational framework for modeling the relationship between bulk and single-cell molecular measurements, enabling the study of proteomes and other molecular profiles at the cell state-level using established bulk-level workflows. We applied DESP to an in-vitro model of the epithelial-to-mesenchymal transition and demonstrated its ability to accurately reconstruct cell state signatures from bulk-level measurements of both the proteome and transcriptome while providing insights into transient regulatory mechanisms.

This dissertation also describes the development of a novel analysis pipeline for modeling protein interaction remodeling from dynamic CF/MS data. Protein interactions can be disrupted by many triggers, such as pathogen infection or mutations in protein-coding genes, yet most studies in the field have focused on characterizing the interactome in a static manner, with few devoted to investigating the dynamic nature of these interactions. As an application of our pipeline, we profiled the dynamics of the Escherichia coli interactome in response to changes in its growth environment. Our results shed light on the mechanisms governing protein interaction remodeling, while also providing a rigorous analytical framework for quantifying interaction dynamics on an interactome-wide scale, representing an important step towards accurate modeling of dynamic biological systems.

September 19th
Jacquelyn Turcinovic

Title: Host Responses to Viral Infection and Genomic Variation During Pandemic Transmission

Major Professor: John Connor

Abstract

This dissertation is a tale of two emerging human pathogens. The first is a genus of viruses, ebolaviruses, which periodically cause outbreaks in humans in central and western Africa following spillover from animal reservoirs. Ebolavirus outbreaks have high rates of morbidity and mortality and can cause symptoms ranging from vomiting and diarrhea to hemorrhage. Understanding both how the virus evolves to fit its host as well as how the host reacts to viral infection is paramount to understanding what determines whether an infected patient will die or survive ebolavirus infection.

To understand how ebolavirus genomic plasticity allows the virus to optimize itself to its host, I analyzed viral genomic sequencing data from 2 ebolavirus species during serial passage in tissue culture: Ebola virus and Sudan virus. In low-passage Sudan virus, I discovered a true viral quasispecies in which 3-4 viral genotypes circulated within the same stock. I then examined how that quasispecies reacted when put into a nonhuman primate model of infection; unexpectedly we saw that the mix of genotypes that went in matched the mix of genotypes that came out.

To begin to understand what a successful immune response to ebolavirus infection entails, I characterized the circulating transcriptomic response in 2 survival models of Ebola virus disease. In a uniform survival model where nonhuman primates (NHPs) were challenged with Bombali virus, I showed that the animals have a clear and robust response to infection despite varying symptom severity. In a Taï Forest virus challenge model with ~44% survival, I showed that NHPs that succumb do so in a uniform manner consistent with other models of Ebola virus disease. In contrast, survivors were highly variable in their response to infection: some mimicked the non-survivor response but recovered in time, while others hardly responded at all.

After covering ebolavirus genomic plasticity and the host response to ebolavirus infection in the first and second sections, respectively, I will then shift to the other focus of my dissertation work: SARS-CoV-2 and molecular epidemiology. This coronavirus swept the globe in 2020 following spillover into humans from an animal reservoir in late 2019, and surveillance sequencing of viral genomes early in the pandemic showed it was rapidly adapting to its new host. I leveraged this high mutation rate to spin up a molecular epidemiology operation for Boston Medical Center (BMC) and Boston University (BU). From mid-2020 through spring 2022, I catalogued, processed, sequenced, and analyzed samples and viral genomes from over 7000 SARS-CoV-2 patient swabs. I worked with contact tracing teams, physicians, and infection control from BU and BMC to quantify viral introductions, identify transmission chains, and integrate the genetic linkages with traditional epidemiological data.

2022

June 10th
Aaron Chevalier

Title: Tools for Mutational Signature Discovery and Methods for Prediction of Drug Response

Major Professor: Joshua Campbell

Abstract

Mutational signatures are patterns of somatic alterations in the genome caused by carcinogenic exposures or aberrant cellular processes. This dissertation focuses on the analysis of mutational signatures in human cancer and their application to the stratification of patients for drug response.

To provide a comprehensive workflow for preprocessing, analysis, and visualization of mutational signatures, I created the Mutational Signature Comprehensive Analysis Toolkit (musicatk) package. musicatk enables users to select different schemas for counting mutation types and easily combine count tables from different schemas. Multiple distinct methods are available to deconvolute signatures and exposures or to predict exposures in individual samples given a pre-existing set of signatures. Additional exploratory features include the ability to compare signatures to the COSMIC database, embed tumors in two dimensions with UMAP, cluster tumors into subgroups based on exposure frequencies, identify differentially active exposures between tumor subgroups, and plot exposure distributions across user-defined annotations such as tumor type.

I then use musicatk to analyze the largest tumor sequencing dataset from a Chinese population to date. I identified differences in the levels of signature exposures compared to similar data from a Western cohort. Specifically, COSMIC signature SBS25 was higher in the Chinese dataset for Melanoma and Renal Cell Carcinoma patients, and Melanoma patients had lower levels of SBS7a/b (Ultraviolet Light). My analysis also revealed a putative novel signature enriched in pancreatic cancers.

Lastly, I assess the ability of mutational signatures to identify patients who may respond to irofulven, a drug for late-stage cancer patients who have defects in the Transcription Coupled Nucleotide Excision Repair (TC-NER) pathway. As the functional understanding of which mutations successfully disrupt this pathway is incomplete, I develop an approach that classifies patients by evidence of pathway disruption, as inferred from levels of mutational signatures. I build a model that successfully predicts patients who will respond to treatment without a known relevant mutation in the TC-NER pathway.

The work from this study furthers our understanding of mutational signatures in different populations and demonstrates the feasibility of using mutational signatures to identify patients eligible for drug trials.

August 19th
Lucas Schiffer

Title: Multimodal, Longitudinal, and Mega-Analysis of Biomedical Data

Major Professor: W. Evan Johnson

Abstract

Biomedical data science is a multi-disciplinary field concerned with the collection, storage, and interpretation of biomedical data that uses annotation, algorithms, and analysis to extract knowledge and insights from structured and unstructured data to be used in the development and evaluation of diagnostic tests, prognostic predictions, and therapeutic interventions. Biomedical data scientists perform this work using biomedical data that arises when samples are subjected to biochemical assays to quantitatively or qualitatively investigate their pathophysiological characteristics. Increasingly, biomedical data are generated at single-cell resolution and have consequently become far more hierarchical and multimodal in nature – that is, levels of organization encapsulate one another (e.g., samples belonging to subjects are made up of cells) and multiple biological modalities are profiled simultaneously. The paradigm shift adds significant complexity to the collection, storage, management, and analysis of biomedical data, but brings with it the promise of unprecedented insights to be gained from integrative analyses. These analyses are the focus of this dissertation, where the challenges of integrating biomedical data across multiple modalities, timepoints, and studies are examined through three research projects.

Challenges related to multimodal analysis of biomedical data will be explored through the development of MultimodalExperiment, a data structure that appropriately and efficiently represents multiomics data that is hierarchical, multimodal, and/or longitudinal in nature. A schematic of and methods for the data structure will be presented along with example usage to demonstrate how current challenges of alternative data structures are overcome, ease of data management is improved, and computational/storage efficiency is optimized.

Challenges related to longitudinal analysis of biomedical data will be explored in the context of a cohort study of cancer patients being treated with anti-programmed cell death protein 1/programmed cell death ligand 1 immunotherapies at Boston Medical Center. The progression-free survival status of study participants will be analyzed using linear mixed effects models which incorporate longitudinal high-dimensional metabolomics data. Maps of metabolic pathways and a hypothesis will be presented to explain serum metabolites that are associated with progression-free survival status and possibly therapeutic efficacy.

Challenges related to mega-analysis of biomedical data will be explored through the creation of a pipeline to preprocess transcriptomics data from human hosts infected with tuberculosis to support machine learning and other tasks. The details of original software developed to provide more than 10,000 samples of clean, high-quality, machine learning-ready data from all related and eligible studies in the Gene Expression Omnibus repository will be illustrated. The importance of improving diagnostic testing and therapeutic interventions for tuberculosis disease will be highlighted in the context of these data, and the specifics of why they represent a key ingredient for machine learning that helps overcome current challenges in the field will be explained.

August 24th
Boting Ning

Title: Leveraging Transcriptomic Regulation to Understand, Diagnose and Intercept Early Lung Cancer Pathogenesis

Major Professor: Marc Lenburg

Abstract

Lung cancer is the leading cause of cancer death in the US, largely due to the lack of treatment options to intercept the progression of early lung cancers and of methods to diagnose lung cancer at early stages. Prior studies indicated that the lack of immune surveillance is associated with the progression of bronchial premalignant lesions (PMLs) and that gene alterations in the nasal epithelium can be leveraged for the early detection of lung cancer. Yet, the regulatory mechanisms behind these gene expression alterations remain poorly understood. Thus, there is an unmet need to study gene expression regulation for better disease management of early lung cancer, including further understanding the biology of early lung cancer development, identifying potential interception strategies, and improving lung cancer diagnosis.

My dissertation addresses these challenges by investigating the transcriptional and post-transcriptional gene expression regulators, including transcription factors and microRNAs (miRNAs), to facilitate the understanding, interception, and diagnosis of early lung cancer. First, I explored the miRNA regulatory landscape to identify miRNA-gene regulatory relationships associated with bronchial PML progression and molecular subtypes. Using matched gene and microRNA expression profiles from patients with bronchial premalignant lesions, I identified epithelial miR-149-5p to be a key regulator of gene expression contributing to PML progression. By suppressing NLRC5, miR-149-5p inhibits MHC-I gene expression of epithelial cells, promoting early immune depletion and lesion progression. I also developed a novel statistical framework, Differential Regulation Analysis of miRNA (DReAmiR), that characterizes miRNA-mediated gene regulatory network rewiring across multiple groups from transcriptomic profiles, and identified regulatory network differences across PML molecular subtypes. Secondly, I investigated the alterations in the Hippo pathway to identify potential drug targets to intercept the progression of bronchial PMLs. I found that Hippo pathway effectors YAP/TAZ, together with transcription factors TEAD and TP63, cooperatively promote basal cell proliferation and repress signals associated with interferon responses and immune cell communication. Further in silico drug screening with external datasets identified small compounds that can reverse the directly regulated gene signature to potentially intercept bronchial PML progression. Lastly, I integrated miRNA and gene expression profiles in the nasal epithelium to distinguish malignant from benign indeterminate pulmonary nodules. I built an ensemble classifier consisting of nasal epithelial miRNA expression features, miRNA-gene top scoring pairs, and clinical features. The performance of the ensemble classifier exceeded that of the classifier built with clinical features alone.

Collectively, my thesis investigated the gene expression regulation mechanisms to facilitate the understanding, interception, and diagnosis of early lung cancer pathogenesis.

November 17th
Rebecca Panitch

Title: Understanding the Mechanisms and Pathways of Alzheimer’s Disease in APOE Genotype Sub-Populations

Major Professor: Lindsay Farrer

Abstract

November 21st
Dileep Kishore

Title: Computational Study of Microbe-Microbe Interactions and Their Interplay with Their Environment

Major Professor: Daniel Segrè

Abstract

Microbial communities play important roles in human health and disease, are essential components of terrestrial and marine ecosystems, and are crucial for producing commercially valuable molecules in industrial processes. These communities consist of hundreds of species involved in complex interactions. Mapping the interrelationships between different species in a microbial community is vital for understanding and controlling ecosystem structure and function. Advances in sequencing and other omics technologies have led to thousands of datasets containing information about microbial composition, gene expression, and metabolism in microbial communities associated with human hosts and other environments. These provide valuable information for understanding how microbes interact with each other and how their interactions affect the health of their host (e.g., human or plant). Furthermore, understanding these interactions paves the way for the rational design and modulation of synthetic communities for producing antibiotics, biofuels, and pharmaceutical products.

The first part of my thesis is focused on improving the workflow for the inference of microbial co-occurrence relationships from abundance data. Toward this goal, we developed Microbial Co-occurrence Network Explorer or MiCoNE, a pipeline that infers microbial co-occurrences from 16S ribosomal RNA (16S rRNA) amplicon data. The second part of my thesis focuses on microbe-host interactions rather than microbe-microbe associations. In particular, we sought to predict the effects of microbial metabolites on human receptors and their associated regulatory pathways. In the final part of my thesis, we turn to the question of whether computational algorithms can help control microbial community growth to achieve specific objectives. We describe the development of a reinforcement learning algorithm to learn optimal environmental control strategies to steer a microbial community towards a particular goal, such as reaching a specific taxonomic distribution or producing desired metabolites.

Overall, the work presented in this thesis demonstrates how microbe-microbe and microbe-environment (including microbe-host) interactions represent plastic system-level properties whose understanding can help unravel the role of microbial communities in specific diseases. Correspondingly, manipulating these interactions, e.g., by appropriately modifying environmental conditions, can serve as a promising strategy for steering communities towards desired states, including producing valuable molecular products.

December 9th
Rui Hong

Title: Building an Analytical Framework for Quality Control and Meta-Analysis of Single-Cell Data to Understand Heterogeneity in Lung Cancer Cells

Major Professor: Joshua Campbell

Abstract

Single-cell RNA sequencing (scRNA-seq) has been a powerful technique for characterizing transcriptional heterogeneity related to tumor development and disease pathogenesis. Despite the advances of the technology, there is still a lack of software to systematically and easily assess the quality and different types of artifacts present in scRNA-seq data, and a lack of statistical frameworks for understanding heterogeneity in the gene programs of cancer cells.

In this dissertation, I first introduced novel computational software, called SCTK-QC, to enhance and streamline the process of quality control for scRNA-seq data. SCTK-QC is a pipeline that performs comprehensive quality control (QC) of scRNA-seq data and runs a multitude of tools to assess various types of noise present in scRNA-seq data as well as quantification of general QC metrics. These metrics are displayed in a user-friendly HTML report and the pipeline has been implemented on two cloud-based platforms.

Most scRNA-seq studies have profiled only a small number of tumors and provided a narrow view of the transcriptome in tumor tissue. Next, I developed a novel framework to perform a large-scale meta-analysis of cancer cells from 12 studies with scRNA-seq data from patients with non-small-cell lung cancer (NSCLC). I discovered interpretable gene co-expression modules with celda and demonstrated that the activity of gene modules accounted for both inter- and intra-tumor heterogeneity of NSCLC samples. Furthermore, I used CaDrA to determine that the levels of some gene modules were significantly associated with combinations of underlying genetic alterations. I also showed that other gene modules are associated with immune cell signatures and may be important for communication between the cancer cells and the immune microenvironment.

Finally, I presented a novel computational method to study the association between copy number variation (CNV) and gene expression at the single-cell level. Diverse CNV profiles were identified among tumor subclones within each sample, and I discovered cis and trans gene signatures whose expression values are associated with specific somatic CNV status. This study helped us prioritize the potential cancer driver genes within each CNV region.

Collectively, this work addressed limitations in the quality control of scRNA-seq data and provided insights for understanding the heterogeneity of NSCLC samples.

2021

December 2
Emma Briars

Title: Development Of Methods To Diagnose And Predict Antibiotic Resistance Using Synthetic Biology And Computational Approaches

Major Professor: Ahmad (Mo) Khalil

Abstract

Antibiotic resistance is a quickly emerging public health crisis, accounting for more than 700,000 annual global deaths. Global human antibiotic overuse and misuse have significantly expedited the rate at which bacteria become resistant to antibiotics. A renewed focus on discovering new antibiotics is one approach to addressing this crisis. However, it alone cannot solve the problem: historically, the introduction of a new antibiotic has consistently, and at times rapidly, been followed by the appearance and dissemination of resistant bacteria. It is thus crucial to develop strategies to improve how we select and deploy antibiotics so that we can control and prevent the emergence and transmission of antibiotic resistance. Current gold-standard antibiotic susceptibility tests measure bacterial growth, which can take up to 72 hours. However, bacteria exhibit more immediate measurable phenotypes of antibiotic susceptibility, including changes in transcription, after brief antibiotic exposure. In this dissertation I develop a framework for building a paper-based cell-free toehold sensor antibiotic susceptibility test that can detect differential mRNA expression. I also explore how long-term lab evolution experiments can be used to prospectively uncover transcriptional signatures of antibiotic susceptibility.

Paper-based cell-free systems provide an opportunity for developing clinically tractable nucleic acid-based diagnostics that are low-cost, rapid, and sensitive. I develop a computational workflow to rapidly and easily design toehold switch sensors, amplification primers, and synthetic RNAs. I develop an experimental workflow, based on existing paper-based cell-free technology, for screening toehold sensors, amplifying bacterial mRNA, and deploying sensors for differential mRNA detection. I combine this work to introduce a paper-based cell-free toehold sensor antibiotic susceptibility test that can detect fluoroquinolone-susceptible E. coli. Next, I describe a methodology for long-term lab evolution and how it can be used to explore the relationship between a phenotype, such as gene expression, and antibiotic resistance acquisition. Using a set of E. coli strains evolved to acquire tetracycline resistance, I explore how each strain's transcriptome changes as resistance increases. Together, this work provides a set of computational and experimental methods that can be used to study the emergence of antibiotic resistance and to improve upon available methods for properly selecting and deploying antibiotics.

November 18
Anthony Federico

Title: Development of Methods for Omics Network Inference and Analysis and Their Application to Disease Modeling

Major Professor: Stefano Monti

Abstract

With the advent of Next Generation Sequencing (NGS) technologies and the emergence of large publicly available genomics data comes an unprecedented opportunity to model biological networks through a holistic lens using a systems-based approach. Networks provide a mathematical framework for representing biological phenomena that goes beyond standard one-gene-at-a-time analyses. Networks can model system-level patterns and the molecular rewiring (i.e., changes in connectivity) occurring in response to perturbations or between distinct phenotypic groups or cell types. This in turn supports the identification of putative mechanisms of action of the biological processes under study, and thus has the potential to advance prevention and therapy. However, researchers face major challenges. Inference of biological network structures is often performed on high-dimensional data, yet is hindered by the limited sample size of high-throughput omics data. Furthermore, modeling biological networks involves complex analyses capable of integrating multiple omics layers and summarizing large amounts of information.

My dissertation aims to address these challenges by presenting new approaches for high-dimensional network inference with limited samples as well as methods and tools for integrated network analysis applied to multiple research domains in cancer genomics. First, I introduce a novel method for reconstructing gene regulatory networks called SHINE (Structure Learning for Hierarchical Networks) and present an evaluation on simulated and real datasets including a Pan-Cancer analysis using The Cancer Genome Atlas (TCGA) data. Next, I summarize the challenges with executing and managing data processing workflows for large omics datasets on high performance computing environments and present multiple strategies for using Nextflow for reproducible scientific workflows, including shine-nf, a collection of Nextflow modules for structure learning. Lastly, I introduce the methods, objects, and tools developed for the analysis of biological networks used throughout my dissertation work. Together, these contributions were used in focused analyses aimed at understanding the molecular mechanisms of tumor maintenance and progression in subtype networks of Breast Cancer and Head and Neck Squamous Cell Carcinoma.

August 4
Brian Haas

Title: Bioinformatic Tool Developments with Applications to RNA-Seq Data Analysis and Clinical Cancer Research

Major Professors: Simon Kasif & Aviv Regev

Abstract

July 29
Tanya Karagiannis

Title: Single Cell Analysis and Methods To Characterize Peripheral Blood Immune Cell Types in Disease and Aging

Major Professors: Stefano Monti & Paola Sebastiani

Abstract

In the past decade, RNA-sequencing (RNA-seq)-based genome-wide expression studies have contributed to major advances in understanding human biology and disease. However, for heterogeneous tissues such as peripheral blood, RNA-sequencing masks the expression of different populations of cells that may be important in understanding different conditions and disease progression. With the advent of single cell RNA-sequencing (scRNA-seq), it has become possible to study the gene expression of each single cell and to explore cellular heterogeneity in the context of disease and under the influence of medications or other substances. In this dissertation, I will present three projects that demonstrate how single cell sequencing methods can be used to characterize novel changes in the peripheral immune system in human disease and aging. I will also describe novel methodological approaches I created to analyze cell type composition and gene expression level changes.

First, I investigated the cell type-specific changes due to opioid use in human peripheral blood. Utilizing single cell transcriptomic methods, I identified a genome-wide suppression of antiviral gene expression across immune cell types of chronic opioid users, and similarly under acute exposure to morphine.

Second, I investigated the immune cell type-specific changes in gene expression and composition in the context of human aging and longevity. I developed novel approaches to measure and compare overall cell type composition between samples, and identified significant overall differences in immune cell type composition, including pro-inflammatory cell populations, between extreme longevity and younger ages. In addition, I generated cell type-specific signatures associated with longevity after accounting for age-related changes. These signatures demonstrate an upregulation of immune response and metabolic processes important in the activation of immune cells in extreme long-lived individuals compared to normally aging individuals.

Finally, I investigated whether aging of the immune system is accelerated in opioid-dependent individuals. I utilized the unique aging signatures generated in the aging project and discovered higher expression of aging signatures in specific cell types of opioid-dependent individuals, suggesting chronic opioid use causes premature aging of the immune system that may contribute to the increased susceptibility to infections in these individuals.

March 24th
Marzie Rasekh

Title: Characterizing VNTRs in Human Populations

Major Professor: Gary Benson

Abstract

Over half the human genome consists of repetitive sequences. One major class is the tandem repeats (TRs), which are defined by their location in the genome, repeat unit, and copy number. TR loci that exhibit variant copy numbers are called Variable Number Tandem Repeats (VNTRs). High VNTR mutation rates of approximately 10⁻⁴ per generation make them suitable for forensic studies, and of interest for potential roles in gene regulation and disease. TRs are generally divided into three classes: 1) microsatellites or short tandem repeats (STRs) with patterns <7 bp; 2) minisatellites with patterns of seven to hundreds of base pairs; and 3) macrosatellites with patterns of >100 bp. To date, mini- and macrosatellites have been poorly characterized, mainly due to a lack of computational tools. In this thesis, I utilized a tool, VNTRseek, to identify human minisatellite VNTRs using short read sequencing data from nearly 2,800 individuals and developed a new computational tool, MaSUD, to identify human macrosatellite VNTRs using data from 2,504 individuals. MaSUD is the first high-throughput tool to genotype macrosatellites using short reads.

I identified over 35,000 minisatellite VNTRs and over 4,000 macrosatellite VNTRs, most previously unknown. A small subset in each VNTR class was validated experimentally and in silico. The detected VNTRs were further studied for their effects on gene expression, ability to distinguish human populations, and functional enrichment. Unlike STRs, mini- and macrosatellite VNTRs are enriched in regions with functional importance, e.g., introns, promoters, and transcription factor binding sites. A study of VNTRs across 26 populations shows that minisatellite VNTR genotypes can be used to predict super-populations with >90% accuracy. In addition, genotypes for 195 minisatellite VNTRs and 24 macrosatellite VNTRs were shown to be associated with differential expression in nearby genes (eQTLs).

Finally, I developed a computational tool, mlZ, to infer undetected VNTR alleles and to detect false positive predictions. mlZ is applicable to other tools that use read support for predicting short variants.

Overall, these studies provide the most comprehensive analysis of mini- and macrosatellites in human populations and will facilitate the application of VNTRs for clinical purposes.

April 8th
Zhe Wang

Title: Enhancing Preprocessing and Clustering of Single-Cell RNA Sequencing Data

Major Professor: Joshua Campbell

Abstract

Single-cell RNA sequencing (scRNA-seq) is the leading technique for characterizing cellular heterogeneity in biological samples. Various scRNA-seq protocols have been developed that can measure the transcriptome from thousands of cells in a single experiment. With these methods readily available, the ability to transform raw data into biological understanding of complex systems is now a rate-limiting step. In this dissertation, I introduce novel computational software and tools which enhance preprocessing and clustering of scRNA-seq data and evaluate their performance compared to existing methods.

First, I present scruff, an R/Bioconductor package that preprocesses data generated from scRNA-seq protocols including CEL-Seq or CEL-Seq2 and reports comprehensive data quality metrics and visualizations. scruff rapidly demultiplexes, aligns, and counts the reads mapped to genomic features with deduplication of unique molecular identifier (UMI) tags and provides novel and extensive functions to visualize both pre- and post-alignment data quality metrics for cells from multiple experiments.

Second, I present Celda, a novel Bayesian hierarchical model that can perform simultaneous co-clustering of genes into transcriptional modules and cells into subpopulations for scRNA-seq data. Celda identified novel cell subpopulations in a publicly available peripheral blood mononuclear cell (PBMC) dataset and outperformed a PCA-based approach for gene clustering on simulated data.

Third, I extend the application of Celda by developing a multimodal clustering method that utilizes both mRNA and protein expression information generated from single-cell sequencing datasets with multiple modalities, and demonstrate that Celda multimodal clustering captured meaningful biological patterns which are missed by transcriptome- or protein-only clustering methods.

Collectively, this work addresses limitations present in the computational analyses of scRNA-seq data by providing novel methods and solutions that enhance scRNA-seq data preprocessing and clustering.

April 8th
Ke Xu

Title: Airway Gene Expression Alterations in Association with Radiographic Abnormalities of the Lung

Major Professor: Marc Lenburg

Abstract

High-resolution computed tomography (HRCT) of the chest is commonly used in the diagnosis of a variety of lung diseases. Structural changes associated with clinical characteristics of disease may also define specific disease-associated physiologic states that may provide insights into disease pathophysiology. Gene expression profiling is potentially a useful adjunct to HRCT to identify molecular correlates of the observed structural changes. However, it is difficult to directly access diseased distal airway or lung parenchyma routinely for profiling studies.

Previously, we profiled normal-appearing bronchial airway epithelial cells at the mainstem bronchus, detecting distinct gene expression alterations related to the clinical diagnosis of chronic obstructive pulmonary disease (COPD) and lung cancer. These gene expression alterations offer insights into the molecular events related to diseased tissue at more distal airways and in the parenchyma, which we hypothesize are due to a field-of-injury effect. Here, we expand this prior work by correlating airway gene expression to COPD and bronchiectasis phenotypes defined by HRCT to better understand the pathophysiology of these diseases. Additionally, we classified pulmonary nodules as malignant or benign by combining HRCT nodule imaging characteristics with gene expression profiling of the nasal airway.

First, we collected brushing samples from the mainstem bronchus and assessed gene expression alterations associated with COPD phenotypes defined by K-means clustering of HRCT-based imaging features. We found three imaging clusters, which correlated with incremental severity of COPD: normal, interstitial predominant, and emphysema predominant. Forty-one genes were differentially expressed between the normal and the emphysema predominant clusters. Functional analysis of the differentially expressed genes suggests a possible induction of inflammatory processes and repression of T-cell related biologic pathways in the emphysema predominant cluster.

We then discovered gene expression alterations associated with radiographic evidence of bronchiectasis (BE), an underdiagnosed obstructive pulmonary disease with unclear pathophysiology. We found 655 genes were differentially expressed in bronchial epithelium from individuals with radiographic evidence of BE despite none of the study participants having a clinical BE diagnosis. In addition to biological pathways that had been previously associated with BE, novel pathways that may play important roles in BE initiation were also discovered. Furthermore, we leveraged an independent single-cell RNA-sequencing dataset of the bronchial epithelium to explore whether the observed gene expression alterations might be cell-type dependent. We computationally detected an increased presence of ciliated and deuterosomal cells, as well as a decreased presence of basal cells in subjects with widespread radiographic BE, which may reflect a shift in the cellular landscape of the airway during BE initiation.

Finally, we identified gene expression alterations within the nasal epithelium associated with the presence of malignant pulmonary nodules. We constructed a computational model that combines gene expression and imaging features extracted from HRCT to determine whether a nodule is malignant or benign. Leveraging data from single-cell RNA sequencing, we found that genes with increased expression in patients with lung cancer are expressed at higher levels within a novel cluster of nasal epithelial cells, termed keratinizing epithelial cells.

In summary, we leveraged gene expression profiling of the proximal airway and discovered novel biological pathways that potentially drive the structural changes representative of physiologic states defined by chest HRCT in COPD and BE. This approach may also be combined with chest HRCT to detect weak signals related to malignant pulmonary nodules.

2020

December 3rd
Tyler Faits

Title: The Evaluation, Application, and Expansion of 16S Amplicon Metagenomics

Major Professor: W. Evan Johnson

Abstract

Since the invention of high-throughput sequencing, the majority of experiments studying bacterial microbiomes have relied on the PCR amplification of all or part of the gene for the 16S rRNA subunit, which serves as a biomarker for identifying and quantifying the various taxa present in a microbiomic sample. Several computational methods exist for analyzing 16S amplicon based metagenomics, but the most commonly used bioinformatics tools are unable to produce quality genus-level or species-level taxonomic calls and may underestimate the degree to which such calls are possible. In this thesis, I have used 16S sequencing data from mock bacterial communities to evaluate the sensitivity and specificity of several bioinformatics pipelines and genomic reference libraries used for microbiome analyses, with a focus on measuring the accuracy of species-level taxonomic assignments of 16S amplicon reads. With the efficacy of these tools established, I then applied them in the analysis of data from two studies into human microbiomes. I evaluated the metagenomics analysis tools Qiime 2, Mothur, PathoScope 2, and Kraken, in conjunction with reference libraries from GreenGenes, Silva, Kraken, and RefSeq, using publicly available mock community data from several sources, comprising 137 samples spanning a range of taxonomic diversity, amplicon regions, and sequencing methods. PathoScope and Kraken, both tools designed for whole genome metagenomics, outperformed Qiime 2 and Mothur, which are theoretically specialized in 16S analyses. I used PathoScope 2 to analyze longitudinal 16S data from infants in Zambia, exploring the maturation of nasopharyngeal microbiomes in healthy infants, establishing a range of typical healthy taxonomic profiles, and identifying dysbiotic patterns which are associated with the development of severe lower respiratory tract infections in early childhood. 
With more data, these dysbiotic patterns may help identify infants at high risk of developing respiratory disease.

I used Qiime 2 to analyze 16S data from human subjects in a controlled dietary intervention study with a focus on dietary carbohydrate quality. I correlated alterations in the gut microbiome with various cardiometabolic risk factors, and identified increases in some butyrate-producing bacteria in response to complex carbohydrates. I also constructed a metatranscriptomics pipeline to analyze paired rRNA-depleted RNA-seq data.

October 14th
Alan Pacheco

Title: Environmental Modulation of Microbial Ecosystems

Major Professor: Daniel Segre

Abstract

Natural microbiota are essential to the health of living systems – from the human gut to coral reefs. Although advances in DNA sequencing have allowed us to catalogue many of the different organisms that make up these microbial communities, significant challenges remain in understanding the complex networks of interspecies metabolic interactions they exhibit. These interactions are crucial to community stability and function, and are highly context-dependent: the availability of different nutrients can determine whether a set of microbes will interact cooperatively or competitively, which can drastically change a community’s structure. Disentangling the environmental factors that determine these behaviors will not only fundamentally enhance our knowledge of their ecological properties, but will also bring us closer to the rational engineering of synthetic microbiomes with novel functions. Here, I integrate modeling and experimental approaches to quantify the dependence of microbial communities on environmental composition. I then show how this relationship can be leveraged to facilitate the design of synthetic consortia.

The first chapter of this dissertation is a review article that introduces a framework for cataloguing interaction mechanisms, which enables quantitative comparisons and predictive models of these complex phenomena. The second chapter is a computational study that explores one such attribute – metabolic cost – in high detail. It demonstrates how a large variety of molecules can be secreted without imposing a fitness cost on microbial organisms, allowing for the emergence of beneficial interspecies interactions. The third chapter is an experimental study that determines how the number of unique environmental nutrients affects microbial community growth and taxonomic diversity. The integration of stoichiometric and consumer resource models enabled the discovery of basic ecological principles that govern this environment-phenotype relationship. The fourth chapter applies these principles to the design of engineered communities via a search algorithm that identifies environmental compositions that yield specific ecosystem properties. This dissertation then concludes with extensions of the modeling methods used throughout this work to additional model systems.

Future work could further quantify how microbial community phenotypes depend on each of the individual factors explored in this thesis, while also leveraging emerging knowledge on interaction mechanisms to design synthetic consortia.

August 24th
Devanshi Patel

Title: Tissue-Dependent Analysis of Common and Rare Genetic Variants for Alzheimer’s Disease Using Multi-Omics Data

Major Professor: Lindsay Farrer

Abstract

Alzheimer’s disease (AD) is a complex neurodegenerative disease characterized by progressive memory loss and caused by a combination of genetic, environmental, and lifestyle factors. AD susceptibility is highly heritable at 58-79%, but only about one third of the AD genetic component is accounted for by common variants discovered through genome-wide association studies (GWAS). Rare variants may contribute to some of the unexplained heritability of AD and have been demonstrated to contribute to large gene expression changes across tissues, but conventional analytical approaches pose challenges because of low statistical power even for large sample sizes. Recent studies have demonstrated by expression quantitative trait locus (eQTL) analysis that changes in gene expression could play a key role in the pathogenesis of AD. However, regulation of gene expression has been shown to be context-specific (e.g., tissue and cell types), motivating a context-dependent approach to achieve more precise and statistically significant associations. To address these issues, I applied a strategy to identify new AD risk or protective rare variants by examining mutations occurring only in cases or only in controls, observing that different mutations in the same gene or variable dose of a mutation may result in distinct dementias. I also evaluated the impact of rare variation on expression at the gene and gene pathway levels in blood and brain tissue, further strengthening the rare variant findings with functional evidence and finding evidence for a large immune and inflammatory component to AD. Lastly, I identified cell type-specific eQTLs in blood and brain tissue to explain underlying genetic associations of common variants in AD, and also discovered additional evidence for the role of myeloid cells in AD risk and potential novel blood and brain AD biomarkers. 
Collectively, these findings further explain the genetic basis of AD risk and provide insight into the mechanisms leading to this disorder.