PhD Thesis Defenses
2022
January 4th
Kritika Karri
Title: Computational Characterization of Long Non-Coding RNAs (LncRNAs) and Study Their Role in Rodent Liver Disease, Xenobiotic Exposure, and Sex-Specific Responses Using Bulk and Single Cell RNA-Sequencing
Major Professor: David Waxman
ABSTRACT:
LncRNAs comprise a heterogeneous class of thousands of RNA-encoding genes whose functions are largely unknown. This thesis describes systematic computational approaches to discover liver-expressed lncRNAs globally and then deduce their regulatory roles in response to foreign chemical and hormonal exposures. In a first study, bulk liver RNA-seq data was used to discover liver-expressed lncRNAs responsive to multiple xenobiotics in a rat model. Ortholog analysis combined with co-expression data and causal inference methods was used to infer lncRNA function and deduce gene regulatory networks, including causal effects of lncRNAs on biological pathways. This work provides a framework for understanding the widespread transcriptome-altering actions of foreign chemicals in a key-responsive mammalian tissue. In a second study, single-cell RNA-seq was employed to develop a reference catalog of 48,261 mouse liver-expressed lncRNAs, a majority novel, by transcriptome reconstruction from > 2,000 bulk public mouse liver RNA-seq datasets. Single cell RNA-seq was sufficiently sensitive to detect >30,000 mouse liver lncRNAs and characterize their dysregulation in mouse models of high fat diet-induced non-alcoholic steatohepatitis (NASH), carbon tetrachloride-induced liver fibrosis, and hepatotoxicity induced by the Ah receptor agonist TCDD. Trajectory inference algorithms uncovered lncRNA zonation patterns in five major hepatic cell populations and their dysregulation in diseased states. LncRNAs expressed in NASH-associated macrophages, closely linked to disease progression, and in collagen-producing myofibroblasts, a key source of the fibrous scar in fibrotic liver, were identified. Regulatory network analysis linked individual lncRNAs with key biological pathways and gene centrality metrics identified network-essential regulatory lncRNAs in each liver disease model. 
In a third study, single nucleus RNA-seq combined with single nucleus ATAC-seq mapping of open chromatin regions elucidated functional linkages between cis- and trans-regulatory elements and their downstream gene targets, notably genes showing sex differences in expression that impact metabolism and disease risk. Liver cell type-specific chromatin accessibility signatures were identified, as were sex-specific accessibility signatures for hepatocytes and their associated DNA regulatory region motifs. Integrative modalities were employed to elucidate transcription factor-based mechanisms involved in sex-specific growth hormone-regulated gene expression by identifying transcriptional and epigenetic changes during feminization of mouse liver. Together, these studies characterize lncRNA function and can motivate future experiments.
June 10
Aaron Chevalier
Title: Tools for Mutational Signature Discovery and Methods for Prediction of Drug Response
Major Professor: Joshua Campbell
ABSTRACT:
Mutational signatures are patterns of somatic alterations in the genome caused by carcinogenic exposures or aberrant cellular processes. Specifically, this dissertation focuses on the analysis of mutational signatures in human cancer and its application to stratification of patients for drug response.
To provide a comprehensive workflow for preprocessing, analysis, and visualization of mutational signatures, I created the Mutational Signature Comprehensive Analysis Toolkit (musicatk) package. musicatk enables users to select different schemas for counting mutation types and easily combine count tables from different schemas. Multiple distinct methods are available to deconvolute signatures and exposures or to predict exposures in individual samples given a pre-existing set of signatures. Additional exploratory features include the ability to compare signatures to the COSMIC database, embed tumors in two dimensions with UMAP, cluster tumors into subgroups based on exposure frequencies, identify differentially active exposures between tumor subgroups, and plot exposure distributions across user-defined annotations such as tumor type.
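As an illustration of what a schema for counting mutation types can look like, the following minimal Python sketch tallies single-base substitutions into the widely used 96-channel trinucleotide scheme (the format behind COSMIC SBS signatures). It is not musicatk code and does not reflect that package's API; the function names and toy mutations are hypothetical.

```python
from collections import Counter

# Watson-Crick complements, used to flip purine-reference mutations
# onto the pyrimidine strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sbs96_label(five_prime: str, ref: str, alt: str, three_prime: str) -> str:
    """Return a label such as 'A[C>T]G' for one single-base substitution."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    if ref in ("A", "G"):
        # Reverse-complement the whole trinucleotide so the mutated
        # base is always reported as C or T.
        five_prime, three_prime = COMPLEMENT[three_prime], COMPLEMENT[five_prime]
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{five_prime}[{ref}>{alt}]{three_prime}"


def count_sbs96(mutations):
    """Tally labels for an iterable of (5' base, ref, alt, 3' base) tuples."""
    return Counter(sbs96_label(*m) for m in mutations)


# Hypothetical toy mutations for a single tumor sample.
toy_mutations = [
    ("A", "C", "T", "G"),  # already pyrimidine-centered
    ("C", "G", "A", "T"),  # purine reference: flipped to A[C>T]G
    ("T", "T", "C", "A"),
]
toy_counts = count_sbs96(toy_mutations)
```

A count table for a cohort would stack one such tally per sample; tables built under other schemas could then be combined column-wise.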
I then use musicatk to analyze the largest tumor sequencing dataset from a Chinese population to date. I identified differences in the levels of signature exposures compared to similar data from a Western cohort. Specifically, COSMIC signature SBS25 was higher in the Chinese dataset for Melanoma and Renal Cell Carcinoma patients, and Melanoma patients had lower levels of SBS7a/b (Ultraviolet Light). My analysis also revealed a putative novel signature enriched in pancreatic cancers.
Lastly, I assess the ability of mutational signatures to identify patients who may respond to irofulven, a drug for late-stage cancer patients who have defects in the Transcription Coupled Nucleotide Excision Repair (TC-NER) pathway. As the functional understanding of which mutations successfully disrupt this pathway is incomplete, I develop an approach that classifies patients by evidence of pathway disruption, as inferred from levels of mutational signatures. I build a model that successfully predicts patients who will respond to treatment without a known relevant mutation in the TC-NER pathway.
The work from this study furthers our understanding of mutational signatures in different populations and demonstrates the feasibility of using mutational signatures to identify patients eligible for drug trials.
August 19
Lucas Schiffer
Title: Multimodal, Longitudinal, and Mega-Analysis of Biomedical Data
Major Professor: W. Evan Johnson
ABSTRACT:
Biomedical data science is a multi-disciplinary field concerned with the collection, storage, and interpretation of biomedical data that uses annotation, algorithms, and analysis to extract knowledge and insights from structured and unstructured data to be used in the development and evaluation of diagnostic tests, prognostic predictions, and therapeutic interventions. Biomedical data scientists perform this work using biomedical data that arises when samples are subjected to biochemical assays to quantitatively or qualitatively investigate their pathophysiological characteristics. Increasingly, biomedical data are generated at single-cell resolution and have consequently become far more hierarchical and multimodal in nature – that is, levels of organization encapsulate one another (e.g., samples belonging to subjects are made up of cells) and multiple biological modalities are profiled simultaneously. The paradigm shift adds significant complexity to the collection, storage, management, and analysis of biomedical data, but brings with it the promise of unprecedented insights to be gained from integrative analyses. These analyses are the focus of this dissertation, where the challenges of integrating biomedical data across multiple modalities, timepoints, and studies are examined through three research projects.
Challenges related to multimodal analysis of biomedical data will be explored through the development of MultimodalExperiment, a data structure that appropriately and efficiently represents multiomics data that is hierarchical, multimodal, and/or longitudinal in nature. A schematic of and methods for the data structure will be presented along with example usage to demonstrate how current challenges of alternative data structures are overcome, ease of data management is improved, and computational/storage efficiency is optimized.
Challenges related to longitudinal analysis of biomedical data will be explored in the context of a cohort study of cancer patients being treated with anti-programmed cell death protein 1/programmed cell death ligand 1 immunotherapies at Boston Medical Center. The progression-free survival status of study participants will be analyzed using linear mixed effects models which incorporate longitudinal high-dimensional metabolomics data. Maps of metabolic pathways and a hypothesis will be presented to explain serum metabolites that are associated with progression-free survival status and possibly therapeutic efficacy.
Challenges related to mega-analysis of biomedical data will be explored through the creation of a pipeline to preprocess transcriptomics data from human hosts infected with tuberculosis to support machine learning and other tasks. The details of original software developed to provide more than 10,000 samples of clean, high-quality, machine learning-ready data from all related and eligible studies in the Gene Expression Omnibus repository will be illustrated. The importance of improving diagnostic testing and therapeutic interventions for tuberculosis disease will be highlighted in the context of these data, and the specifics of why they represent a key ingredient for machine learning that helps overcome current challenges in the field will be explained.
August 24
Boting Ning
Title: Leveraging Transcriptomic Regulation to Understand, Diagnose and Intercept Early Lung Cancer Pathogenesis
Major Professor: Marc Lenburg
ABSTRACT:
Lung cancer is the leading cause of cancer death in the US, largely due to the lack of treatment options to intercept the progression of early lung cancers and of methods to diagnose lung cancer at early stages. Prior studies indicated that the lack of immune surveillance is associated with the progression of bronchial premalignant lesions (PMLs) and that gene alterations in the nasal epithelium can be leveraged for the early detection of lung cancer. Yet the regulatory mechanisms behind these gene expression alterations are still poorly understood. Thus, there is an unmet need to study gene expression regulation for better disease management of early lung cancer, including further understanding the biology of early lung cancer development, identifying potential interception strategies, and improving lung cancer diagnosis.
My dissertation addresses these challenges by investigating the transcriptional and post-transcriptional gene expression regulators, including transcription factors and microRNAs (miRNAs), to facilitate the understanding, interception, and diagnosis of early lung cancer. First, I explored the miRNA regulatory landscape to identify miRNA-gene regulatory relationships associated with bronchial PML progression and molecular subtypes. Using matched gene and microRNA expression profiles from patients with bronchial premalignant lesions, I identified epithelial miR-149-5p to be a key regulator of gene expression contributing to PML progression. By suppressing NLRC5, miR-149-5p inhibits MHC-I gene expression of epithelial cells, promoting early immune depletion and lesion progression. I also developed a novel statistical framework, Differential Regulation Analysis of miRNA (DReAmiR), that characterizes miRNA-mediated gene regulatory network rewiring across multiple groups from transcriptomic profiles, and identified regulatory network differences across PML molecular subtypes. Secondly, I investigated the alterations in the Hippo pathway to identify potential drug targets to intercept the progression of bronchial PMLs. I found that Hippo pathway effectors YAP/TAZ, together with transcription factors TEAD and TP63, cooperatively promote basal cell proliferation and repress signals associated with interferon responses and immune cell communication. Further in silico drug screening with external datasets identified small compounds that can reverse the directly regulated gene signature to potentially intercept bronchial PML progression. Lastly, I integrated miRNA and gene expression profiles in the nasal epithelium to distinguish malignant from benign indeterminate pulmonary nodules. I built an ensemble classifier consisting of nasal epithelial miRNA expression features, miRNA-gene top scoring pairs, and clinical features. The performance of the ensemble classifier exceeded that of the classifier built with clinical features alone.
Collectively, my thesis investigated the gene expression regulation mechanisms to facilitate the understanding, interception, and diagnosis of early lung cancer pathogenesis.
November 17th
Rebecca Panitch
Title: Understanding the Mechanisms and Pathways of Alzheimer’s Disease in APOE Genotype Sub-Populations
Major Professor: Lindsay Farrer
ABSTRACT:
Alzheimer’s disease (AD) is a neurodegenerative disease classified pathologically by the presence of tau tangles and amyloid plaques. The largest genetic risk factor for AD is the APOE ε4 allele, while the APOE ε2 allele has been linked to a protective effect for AD. Recent studies demonstrated that APOE genotypes are linked to unique omics signatures and pathological features relating to AD, such as blood-brain barrier breakage. To investigate the role of APOE genotype in AD, I analyzed different levels of omic data in blood and brain. I analyzed transcriptomic data derived from autopsied brains using network and differential gene expression approaches to identify genes and pathways involved in the APOE ε2 protective mechanism for AD. Additionally, I identified APOE genotype-specific pathways and networks involved in both blood and brain function in AD using blood and brain tissue gene expression from mostly the same individuals. Lastly, I analyzed the association of methylation of DNA from blood and brain samples with AD to identify APOE and AD specific methylation signatures and potential drug targets. Collectively, this thesis emphasizes the utility of investigating APOE genotypes individually to identify novel pathways and potential drug targets within AD subpopulations.
November 21st
Dileep Kishore
Title: Computational Study of Microbe-Microbe Interactions and Their Interplay with Their Environment
Major Professor: Daniel Segrè
ABSTRACT:
Microbial communities play important roles in human health and disease, are essential components of terrestrial and marine ecosystems, and are crucial for producing commercially valuable molecules in industrial processes. These communities consist of hundreds of species involved in complex interactions. Mapping the interrelationships between different species in a microbial community is vital for understanding and controlling ecosystem structure and function. Advances in sequencing and other omics technologies have led to thousands of datasets containing information about microbial composition, gene expression, and metabolism in microbial communities associated with human hosts and other environments. These provide valuable information in understanding how microbes interact with each other and how their interactions affect the health of their host (e.g., human or plant). Furthermore, understanding these interactions paves the way for the rational design and modulation of synthetic communities for producing antibiotics, biofuels, and pharmaceutical products.
The first part of my thesis is focused on improving the workflow for the inference of microbial co-occurrence relationships from abundance data. Toward this goal, we developed Microbial Co-occurrence Network Explorer or MiCoNE, a pipeline that infers microbial co-occurrences from 16S ribosomal RNA (16S rRNA) amplicon data. The second part of my thesis focuses on microbe-host interactions rather than microbe-microbe associations. In particular, we sought to predict the effects of microbial metabolites on human receptors and their associated regulatory pathways. In the final part of my thesis, we turn to the question of whether computational algorithms can help control microbial community growth to achieve specific objectives. We describe the development of a reinforcement learning algorithm to learn optimal environmental control strategies to steer a microbial community towards a particular goal, such as reaching a specific taxonomic distribution or producing desired metabolites.
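To make the idea of inferring co-occurrence relationships from abundance data concrete, here is a minimal Python sketch that links two taxa when the Spearman rank correlation of their abundances across samples is strong. This is not MiCoNE's implementation: the threshold, function names, and toy table are assumptions for illustration, and because relative abundance data are compositional, plain correlation can be misleading on real data.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cooccurrence_edges(abundance, threshold=0.8):
    """Return (taxon_a, taxon_b, rho) for pairs with |rho| >= threshold."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        rho = spearman(abundance[a], abundance[b])
        if abs(rho) >= threshold:
            edges.append((a, b, round(rho, 3)))
    return edges


# Hypothetical abundances of four taxa across five samples.
toy_abundance = {
    "taxon_A": [10, 20, 30, 40, 50],
    "taxon_B": [1, 3, 4, 8, 9],       # rises with taxon_A
    "taxon_C": [50, 40, 30, 20, 10],  # falls as taxon_A rises
    "taxon_D": [5, 1, 4, 2, 3],       # no strong relationship
}
toy_edges = cooccurrence_edges(toy_abundance)
```

Positive edges suggest co-occurrence and negative edges suggest mutual exclusion; neither establishes a direct ecological interaction.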
Overall, the work presented in this thesis demonstrates how microbe-microbe and microbe-environment (including microbe-host) interactions represent plastic system-level properties whose understanding can help unravel the role of microbial communities in specific diseases. Correspondingly, manipulating these interactions, e.g., by appropriately modifying environmental conditions, can serve as a promising strategy for steering communities towards desired states, including producing valuable molecular products.
December 9th
Rui Hong
Title: Building an Analytical Framework for Quality Control and Meta-Analysis of Single-Cell Data to Understand Heterogeneity in Lung Cancer Cells
Major Professor: Joshua Campbell
ABSTRACT:
Single-cell RNA sequencing (scRNA-seq) has been a powerful technique for characterizing transcriptional heterogeneity related to tumor development and disease pathogenesis. Despite the advances of the technology, there is still a lack of software to systematically and easily assess the quality and different types of artifacts present in scRNA-seq data and lack of statistical frameworks for understanding heterogeneity in the gene programs of cancer cells.
In this dissertation, I first introduced novel computational software to enhance and streamline the process of quality control for scRNA-seq data called SCTK-QC. SCTK-QC is a pipeline that performs comprehensive quality control (QC) of scRNA-seq data and runs a multitude of tools to assess various types of noise present in scRNA-seq data as well as quantification of general QC metrics. These metrics are displayed in a user-friendly HTML report and the pipeline has been implemented in two cloud-based platforms.
Most scRNA-seq studies only profiled a small number of tumors and provided a narrow view of the transcriptome in tumor tissue. Next, I developed a novel framework to perform a large-scale meta-analysis of cancer cells from 12 studies with scRNA-seq data from patients with non-small-cell lung cancer (NSCLC). I discovered interpretable gene co-expression modules with celda and demonstrated that the activity of gene modules accounted for both inter- and intra-tumor heterogeneity of NSCLC samples. Furthermore, I used CaDRa to determine that the levels of some gene modules were significantly associated with combinations of underlying genetic alterations. I also showed that other gene modules are associated with immune cell signatures and may be important for communication between the cancer cells and the immune microenvironment.
Finally, I presented a novel computational method to study the association between copy number variation (CNV) and gene expression at the single-cell level. Diverse CNV profiles were identified in tumor subclones within each sample, and I discovered cis and trans gene signatures whose expression values are associated with specific somatic CNV status. This study helped us prioritize the potential cancer driver genes within each CNV region.
Collectively, this work addressed the limitation in the quality control of scRNA-seq data and provided insights for understanding the heterogeneity of NSCLC samples.
2021
December 2
Emma Briars
Title: Development Of Methods To Diagnose And Predict Antibiotic Resistance Using Synthetic Biology And Computational Approaches
Major Professor: Ahmad (Mo) Khalil
ABSTRACT:
Antibiotic resistance is a quickly emerging public health crisis, accounting for more than 700,000 annual global deaths. Global human antibiotic overuse and misuse has significantly expedited the rate at which bacteria become resistant to antibiotics. A renewed focus on discovering new antibiotics is one approach to addressing this crisis. However, it alone cannot solve the problem: historically, the introduction of a new antibiotic has consistently, and at times rapidly, been followed by the appearance and dissemination of resistant bacteria. It is thus crucial to develop strategies to improve how we select and deploy antibiotics so that we can control and prevent the emergence and transmission of antibiotic resistance. Current gold-standard antibiotic susceptibility tests measure bacterial growth, which can take up to 72 hours. However, bacteria exhibit more immediate measurable phenotypes of antibiotic susceptibility, including changes in transcription, after brief antibiotic exposure. In this dissertation I develop a framework for building a paper-based cell-free toehold sensor antibiotic susceptibility test that can detect differential mRNA expression. I also explore how long-term lab evolution experiments can be used to prospectively uncover transcriptional signatures of antibiotic susceptibility.
Paper-based cell-free systems provide an opportunity for developing clinically tractable nucleic acid-based diagnostics that are low-cost, rapid, and sensitive. I develop a computational workflow to rapidly and easily design toehold switch sensors, amplification primers, and synthetic RNAs. I develop an experimental workflow, based on existing paper-based cell-free technology, for screening toehold sensors, amplifying bacterial mRNA, and deploying sensors for differential mRNA detection. I combine this work to introduce a paper-based cell-free toehold sensor antibiotic susceptibility test that can detect fluoroquinolone-susceptible E. coli. Next, I describe a methodology for long-term lab evolution and how it can be used to explore the relationship between a phenotype, such as gene expression, and antibiotic resistance acquisition. Using a set of E. coli strains evolved to acquire tetracycline resistance, I explore how each strain's transcriptome changes as resistance increases. Together, this work provides a set of computational and experimental methods that can be used to study the emergence of antibiotic resistance, and improve upon available methods for properly selecting and deploying antibiotics.
November 18
Anthony Federico
Title: Development of Methods for Omics Network Inference and Analysis and Their Application to Disease Modeling
Major Professor: Stefano Monti
ABSTRACT:
With the advent of Next Generation Sequencing (NGS) technologies and the emergence of large publicly available genomics datasets comes an unprecedented opportunity to model biological networks through a holistic lens using a systems-based approach. Networks provide a mathematical framework for representing biological phenomena that goes beyond standard one-gene-at-a-time analyses. Networks can model system-level patterns and the molecular rewiring (i.e., changes in connectivity) occurring in response to perturbations or between distinct phenotypic groups or cell types. This in turn supports the identification of putative mechanisms of action of the biological processes under study, and thus has the potential to advance prevention and therapy. However, there are major challenges faced by researchers. Inference of biological network structures is often performed on high-dimensional data, yet is hindered by the limited sample size of high throughput omics data. Furthermore, modeling biological networks involves complex analyses capable of integrating multiple sources of omics layers and summarizing large amounts of information.
My dissertation aims to address these challenges by presenting new approaches for high-dimensional network inference with limited samples as well as methods and tools for integrated network analysis applied to multiple research domains in cancer genomics. First, I introduce a novel method for reconstructing gene regulatory networks called SHINE (Structure Learning for Hierarchical Networks) and present an evaluation on simulated and real datasets including a Pan-Cancer analysis using The Cancer Genome Atlas (TCGA) data. Next, I summarize the challenges with executing and managing data processing workflows for large omics datasets on high performance computing environments and present multiple strategies for using Nextflow for reproducible scientific workflows, including shine-nf, a collection of Nextflow modules for structure learning. Lastly, I introduce the methods, objects, and tools developed for the analysis of biological networks used throughout my dissertation work. Together, these contributions were used in focused analyses aimed at understanding the molecular mechanisms of tumor maintenance and progression in subtype networks of Breast Cancer and Head and Neck Squamous Cell Carcinoma.
August 4
Brian Haas
Title: Bioinformatic Tool Developments with Applications to RNA-Seq Data Analysis and Clinical Cancer Research
Major Professors: Simon Kasif & Aviv Regev
ABSTRACT:
Modern advances in sequencing technologies have enabled exploration of molecular biology at unprecedented scale and resolution. Transcriptome sequencing (RNA-seq), in particular, has been widely adopted as a routine cost-effective method for assaying both genetic and functional characteristics of biological systems with resolution down to individual cells. Clinical research and applications leveraging these technologies have largely targeted tumor biology, where transcriptome sequencing can capture tumor genetic and epigenetic characteristics and aid with understanding the etiology or guide treatments. Specialized computational methods and bioinformatic software tools are essential for processing and analyzing RNA-seq to explore various aspects of tumor biology including driver mutations, genome rearrangements, and aneuploidy. With single cell resolution, such methods can yield insights into tumor cellular composition and heterogeneity. Here, we developed methods and tools to support cancer transcriptome studies for bulk and single cell tumor transcriptomes, focusing primarily on fusion transcript detection and predicting large-scale copy number alterations from RNA-seq. These efforts culminated in the development of STAR-Fusion for fast and accurate detection of fusion transcripts, FusionInspector for further characterizing predicted fusion transcripts and discriminating likely artifacts, and TrinityFusion for de novo reconstruction of fusion transcripts and tumor viruses. We also developed advanced methods for predicting copy number alterations and subclonal architecture from tumor and normal single cell RNA-seq data, as incorporated into our InferCNV software. In addition to this bioinformatic method and software development, we applied our fusion detection methods to thousands of tumor and normal samples and gained novel insights that should further help guide researchers with clinical applications of fusion transcript discovery.
July 29
Tanya Karagiannis
Title: Single Cell Analysis and Methods To Characterize Peripheral Blood Immune Cell Types in Disease and Aging
Major Professors: Stefano Monti & Paola Sebastiani
ABSTRACT:
In the past decade, RNA-sequencing (RNA-seq)-based genome-wide expression studies have contributed to major advances in understanding human biology and disease. However, for heterogeneous tissues such as peripheral blood, RNA-sequencing masks the expression of different populations of cells that may be important in understanding different conditions and disease progression. With the advent of single cell RNA-sequencing (scRNA-seq), it has become possible to study the gene expression of each single cell and to explore cellular heterogeneity in the context of disease and under the influence of medications or other substances. In this dissertation, I will present three projects that demonstrate how single cell sequencing methods can be used to characterize novel changes in the peripheral immune system in human disease and aging. I will also describe novel methodological approaches I created to analyze cell type composition and gene expression level changes.
First, I investigated the cell type specific changes due to opioid use in human peripheral blood. Utilizing single cell transcriptomic methods, I identified a genome-wide suppression of antiviral gene expression across immune cell types of chronic opioid users, and similarly under acute exposure to morphine.
Second, I investigated the immune cell type specific changes of gene expression and composition in the context of human aging and longevity. I developed novel approaches to measure and compare overall cell type composition between samples, and identified significant overall differences in immune cell type composition, including pro-inflammatory cell populations, between extreme longevity and younger ages. In addition, I generated cell type-specific signatures associated with longevity after accounting for age-related changes that demonstrate an upregulation in immune response and metabolic processes important in the activation of immune cells in extreme long-lived individuals compared to normally aging individuals.
Finally, I investigated whether aging of the immune system is accelerated in opioid-dependent individuals. I utilized the unique aging signatures generated in the aging project and discovered higher expression of aging signatures in specific cell types of opioid-dependent individuals, suggesting chronic opioid use causes premature aging of the immune system that may contribute to the increased susceptibility to infections in these individuals.
March 24th
Marzie Rasekh
Title: Characterizing VNTRs in Human Populations
Major Professor: Gary Benson
ABSTRACT:
Over half the human genome consists of repetitive sequences. One major class is the tandem repeats (TRs), which are defined by their location in the genome, repeat unit, and copy number. TR loci which exhibit variant copy numbers are called Variable Number Tandem Repeats (VNTRs). High VNTR mutation rates of approximately 10⁻⁴ per generation make them suitable for forensic studies, and of interest for potential roles in gene regulation and disease. TRs are generally divided into three classes: 1) microsatellites or short tandem repeats (STRs) with patterns <7 bp; 2) minisatellites with patterns of seven to hundreds of base pairs; and 3) macrosatellites with patterns of >100 bp. To date, mini- and macrosatellites have been poorly characterized, mainly due to a lack of computational tools. In this thesis, I utilize a tool, VNTRseek, to identify human minisatellite VNTRs using short read sequencing data from nearly 2,800 individuals and developed a new computational tool, MaSUD, to identify human macrosatellite VNTRs using data from 2,504 individuals. MaSUD is the first high-throughput tool to genotype macrosatellites using short reads.
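The definitions above can be sketched as a small Python data structure. This is only an illustration, not code from VNTRseek or MaSUD: the class and function names are hypothetical, and because the stated minisatellite and macrosatellite ranges overlap, the cut at 100 bp between them is an assumption made here for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TandemRepeat:
    """A TR defined by its genomic location, repeat unit, and copy number."""
    chrom: str
    start: int
    pattern: str             # the repeat unit
    reference_copies: float  # copy number in the reference genome


def size_class(tr: TandemRepeat) -> str:
    """Classify a TR by pattern length (100 bp cut is an assumption)."""
    length = len(tr.pattern)
    if length < 7:
        return "microsatellite"
    if length <= 100:
        return "minisatellite"
    return "macrosatellite"


def is_vntr(observed_copies) -> bool:
    """A locus is a VNTR if more than one copy number is observed."""
    return len(set(observed_copies)) > 1


# Hypothetical loci and genotypes, for illustration only.
str_locus = TandemRepeat("chr1", 1_000, "CAG", 12.0)
mini_locus = TandemRepeat("chr2", 5_000, "ACGTTGCATGCA", 8.5)
macro_locus = TandemRepeat("chr4", 9_000, "ACGT" * 30, 3.0)

classes = [size_class(t) for t in (str_locus, mini_locus, macro_locus)]
variable = is_vntr([8, 8, 9, 11])  # copy numbers seen across individuals
fixed = is_vntr([3, 3, 3])
```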
I identified over 35,000 minisatellite VNTRs and over 4,000 macrosatellite VNTRs, most previously unknown. A small subset in each VNTR class was validated experimentally and in silico. The detected VNTRs were further studied for their effects on gene expression, ability to distinguish human populations, and functional enrichment. Unlike STRs, mini- and macrosatellite VNTRs are enriched in regions with functional importance, e.g., introns, promoters, and transcription factor binding sites. A study of VNTRs across 26 populations shows that minisatellite VNTR genotypes can be used to predict super-populations with >90% accuracy. In addition, genotypes for 195 minisatellite VNTRs and 24 macrosatellite VNTRs were shown to be associated with differential expression in nearby genes (eQTLs).
Finally, I developed a computational tool, mlZ, to infer undetected VNTR alleles and to detect false positive predictions. mlZ is applicable to other tools that use read support for predicting short variants.
Overall, these studies provide the most comprehensive analysis of mini- and macrosatellites in human populations and will facilitate the application of VNTRs for clinical purposes.
April 8th
Zhe Wang
Title: Enhancing Preprocessing and Clustering of Single-Cell RNA Sequencing Data
Major Professor: Joshua Campbell
ABSTRACT:
Single-cell RNA sequencing (scRNA-seq) is the leading technique for characterizing cellular heterogeneity in biological samples. Various scRNA-seq protocols have been developed that can measure the transcriptome from thousands of cells in a single experiment. With these methods readily available, the ability to transform raw data into biological understanding of complex systems is now a rate-limiting step. In this dissertation, I introduce novel computational software and tools which enhance preprocessing and clustering of scRNA-seq data and evaluate their performance compared to existing methods.
First, I present scruff, an R/Bioconductor package that preprocesses data generated from scRNA-seq protocols including CEL-Seq or CEL-Seq2 and reports comprehensive data quality metrics and visualizations. scruff rapidly demultiplexes, aligns, and counts the reads mapped to genomic features with deduplication of unique molecular identifier (UMI) tags and provides novel and extensive functions to visualize both pre- and post-alignment data quality metrics for cells from multiple experiments.
Second, I present Celda, a novel Bayesian hierarchical model that can perform simultaneous co-clustering of genes into transcriptional modules and cells into subpopulations for scRNA-seq data. Celda identified novel cell subpopulations in a publicly available peripheral blood mononuclear cell (PBMC) dataset and outperformed a PCA-based approach for gene clustering on simulated data.
Third, I extend the application of Celda by developing a multimodal clustering method that utilizes both mRNA and protein expression information generated from single-cell sequencing datasets with multiple modalities, and demonstrate that Celda multimodal clustering captured meaningful biological patterns which are missed by transcriptome- or protein-only clustering methods.
Collectively, this work addresses limitations present in the computational analyses of scRNA-seq data by providing novel methods and solutions that enhance scRNA-seq data preprocessing and clustering.
April 8th
Ke Xu
Title: Airway Gene Expression Alterations in Association with Radiographic Abnormalities of the Lung
Major Professor: Marc Lenburg
ABSTRACT
High-resolution computed tomography (HRCT) of the chest is commonly used in the diagnosis of a variety of lung diseases. Structural changes associated with clinical characteristics of disease may also define specific disease-associated physiologic states that may provide insights into disease pathophysiology. Gene expression profiling is potentially a useful adjunct to HRCT to identify molecular correlates of the observed structural changes. However, it is difficult to directly access diseased distal airway or lung parenchyma routinely for profiling studies.
Previously, we profiled normal-appearing bronchial airway epithelial cells at the mainstem bronchus, detecting distinct gene expression alterations related to the clinical diagnosis of chronic obstructive pulmonary disease (COPD) and lung cancer. These gene expression alterations offer insights into the molecular events related to diseased tissue at more distal airways and in the parenchyma, which we hypothesize are due to a field-of-injury effect. Here, we expand this prior work by correlating airway gene expression to COPD and bronchiectasis phenotypes defined by HRCT to better understand the pathophysiology of these diseases. Additionally, we classified pulmonary nodules as malignant or benign by combining HRCT nodule imaging characteristics with gene expression profiling of the nasal airway.
First, we collected brushing samples from the mainstem bronchus and assessed gene expression alterations associated with COPD phenotypes defined by K-means clustering of HRCT-based imaging features. We found three imaging clusters, which correlated with incremental severity of COPD: normal, interstitial predominant, and emphysema predominant. Forty-one genes were differentially expressed between the normal and the emphysema predominant clusters. Functional analysis of the differentially expressed genes suggests a possible induction of inflammatory processes and repression of T-cell related biologic pathways in the emphysema predominant cluster.
We then discovered gene expression alterations associated with radiographic evidence of bronchiectasis (BE), an underdiagnosed obstructive pulmonary disease with unclear pathophysiology. We found 655 genes were differentially expressed in bronchial epithelium from individuals with radiographic evidence of BE despite none of the study participants having a clinical BE diagnosis. In addition to biological pathways that had been previously associated with BE, novel pathways that may play important roles in BE initiation were also discovered. Furthermore, we leveraged an independent single-cell RNA-sequencing dataset of the bronchial epithelium to explore whether the observed gene expression alterations might be cell-type dependent. We computationally detected an increased presence of ciliated and deuterosomal cells, as well as a decreased presence of basal cells in subjects with widespread radiographic BE, which may reflect a shift in the cellular landscape of the airway during BE initiation.
Finally, we identified gene expression alterations within the nasal epithelium associated with the presence of malignant pulmonary nodules. A computational model that combines gene expression and imaging features extracted from HRCT was constructed to determine whether a nodule is malignant or benign. Leveraging data from single-cell RNA sequencing, we found that genes with increased expression in patients with lung cancer are expressed at higher levels within a novel cluster of nasal epithelial cells, termed keratinizing epithelial cells.
In summary, we leveraged gene expression profiling of the proximal airway and discovered novel biological pathways that potentially drive the structural changes representative of physiologic states defined by chest HRCT in COPD and BE. This approach may also be combined with chest HRCT to detect weak signals related to malignant pulmonary nodules.
2020
December 3rd
Tyler Faits
Title: The Evaluation, Application, and Expansion of 16S Amplicon Metagenomics
Major Professor: W. Evan Johnson
ABSTRACT
Since the invention of high-throughput sequencing, the majority of experiments studying bacterial microbiomes have relied on the PCR amplification of all or part of the gene for the 16S rRNA subunit, which serves as a biomarker for identifying and quantifying the various taxa present in a microbiomic sample. Several computational methods exist for analyzing 16S amplicon-based metagenomics, but the most commonly used bioinformatics tools are unable to produce quality genus-level or species-level taxonomic calls and may underestimate the degree to which such calls are possible. In this thesis, I have used 16S sequencing data from mock bacterial communities to evaluate the sensitivity and specificity of several bioinformatics pipelines and genomic reference libraries used for microbiome analyses, with a focus on measuring the accuracy of species-level taxonomic assignments of 16S amplicon reads. With the efficacy of these tools established, I then applied them in the analysis of data from two studies into human microbiomes.
I evaluated the metagenomics analysis tools Qiime 2, Mothur, PathoScope 2, and Kraken, in conjunction with reference libraries from GreenGenes, Silva, Kraken, and RefSeq, using publicly available mock community data from several sources, comprising 137 samples spanning a range of taxonomic diversity, amplicon regions, and sequencing methods. PathoScope and Kraken, both tools designed for whole genome metagenomics, outperformed Qiime 2 and Mothur, which are theoretically specialized in 16S analyses.
I used PathoScope 2 to analyze longitudinal 16S data from infants in Zambia, exploring the maturation of nasopharyngeal microbiomes in healthy infants, establishing a range of typical healthy taxonomic profiles, and identifying dysbiotic patterns which are associated with the development of severe lower respiratory tract infections in early childhood. With more data, these dysbiotic patterns may help identify infants at high risk of developing respiratory disease.
I used Qiime 2 to analyze 16S data from human subjects in a controlled dietary intervention study with a focus on dietary carbohydrate quality. I correlated alterations in the gut microbiome with various cardiometabolic risk factors, and identified increases in some butyrate-producing bacteria in response to complex carbohydrates. I also constructed a metatranscriptomics pipeline to analyze paired rRNA-depleted RNAseq data.
October 14th
Alan Pacheco
Title: Environmental Modulation of Microbial Ecosystems
Major Professor: Daniel Segre
ABSTRACT
Natural microbiota are essential to the health of living systems – from the human gut to coral reefs. Although advances in DNA sequencing have allowed us to catalogue many of the different organisms that make up these microbial communities, significant challenges remain in understanding the complex networks of interspecies metabolic interactions they exhibit. These interactions are crucial to community stability and function, and are highly context-dependent: the availability of different nutrients can determine whether a set of microbes will interact cooperatively or competitively, which can drastically change a community’s structure. Disentangling the environmental factors that determine these behaviors will not only fundamentally enhance our knowledge of their ecological properties, but will also bring us closer to the rational engineering of synthetic microbiomes with novel functions. Here, I integrate modeling and experimental approaches to quantify the dependence of microbial communities on environmental composition. I then show how this relationship can be leveraged to facilitate the design of synthetic consortia.
The first chapter of this dissertation is a review article that introduces a framework for cataloguing interaction mechanisms, which enables quantitative comparisons and predictive models of these complex phenomena. The second chapter is a computational study that explores one such attribute – metabolic cost – in high detail. It demonstrates how a large variety of molecules can be secreted without imposing a fitness cost on microbial organisms, allowing for the emergence of beneficial interspecies interactions. The third chapter is an experimental study that determines how the number of unique environmental nutrients affects microbial community growth and taxonomic diversity. The integration of stoichiometric and consumer resource models enabled the discovery of basic ecological principles that govern this environment-phenotype relationship. The fourth chapter applies these principles to the design of engineered communities via a search algorithm that identifies environmental compositions that yield specific ecosystem properties. This dissertation then concludes with extensions of the modeling methods used throughout this work to additional model systems.
Future work could further quantify how microbial community phenotypes depend on each of the individual factors explored in this thesis, while also leveraging emerging knowledge on interaction mechanisms to design synthetic consortia.
August 24th
Devanshi Patel
Title: Tissue-Dependent Analysis of Common and Rare Genetic Variants for Alzheimer’s Disease Using Multi-Omics Data
Major Professor: Lindsay Farrer
ABSTRACT
Alzheimer’s disease (AD) is a complex neurodegenerative disease characterized by progressive memory loss and caused by a combination of genetic, environmental, and lifestyle factors. AD susceptibility is highly heritable at 58-79%, but only about one third of the AD genetic component is accounted for by common variants discovered through genome-wide association studies (GWAS). Rare variants may contribute to some of the unexplained heritability of AD and have been demonstrated to contribute to large gene expression changes across tissues, but conventional analytical approaches pose challenges because of low statistical power even for large sample sizes. Recent studies have demonstrated by expression quantitative trait locus (eQTL) analysis that changes in gene expression could play a key role in the pathogenesis of AD. However, regulation of gene expression has been shown to be context-specific (e.g., tissue and cell-types), motivating a context-dependent approach to achieve more precise and statistically significant associations. To address these issues, I applied a strategy to identify new AD risk or protective rare variants by examining mutations occurring only in cases or only in controls, observing that different mutations in the same gene or variable dose of a mutation may result in distinct dementias. I also evaluated the impact of rare variation on expression at the gene and gene pathway levels in blood and brain tissue, further strengthening the rare variant findings with functional evidence and finding evidence for a large immune and inflammatory component to AD. Lastly, I identified cell-type specific eQTLs in blood and brain tissue to explain underlying genetic associations of common variants in AD, and also discovered additional evidence for the role of myeloid cells in AD risk and potential novel blood and brain AD biomarkers.
Collectively, these findings further explain the genetic basis of AD risk and provide insight into the mechanisms leading to this disorder.