Please join us for a Data Science Colloquium featuring Nisheeth Vishnoi, professor in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne (Switzerland).
Jim Kurose’s featured presentation will discuss federal funding opportunities in data science. Dr. Kurose is the Assistant Director of the National Science Foundation (NSF) for Computer and Information Science and Engineering (CISE). He leads the CISE Directorate, with an annual budget of more than $900 million, in its mission to uphold the nation’s leadership in scientific discovery and engineering innovation through support of fundamental research in computer and information science and engineering and transformative advances in cyberinfrastructure. Dr. Kurose also serves as co-chair of the Networking and Information Technology Research and Development Subcommittee of the National Science and Technology Council’s Committee on Technology, facilitating the coordination of networking and information technology research and development efforts across federal agencies.
His research interests include computer network protocols and architecture, network measurement, sensor networks, multimedia communication, and modeling and performance evaluation. He was one of the founders of the Commonwealth Information Technology Initiative (CITI) and helped lead the founding of the Massachusetts Green High Performance Computing Center. Dr. Kurose has served on many national and international boards and panels, including the Board of Directors of the Computing Research Association and the Board of Governors of the IEEE Communications Society. With Keith Ross, he is co-author of the textbook Computer Networking: A Top-Down Approach (7th edition, 2016), published by Pearson. He has received numerous awards for his research and teaching, including several conference best paper awards, the IEEE INFOCOM Achievement Award, the ACM SIGCOMM Lifetime Achievement Award, the ACM SIGCOMM Test of Time Award, a number of outstanding teaching awards, and the IEEE Computer Society Taylor L. Booth Education Award. He has twice received an IBM Faculty Development Award and a Lilly Teaching Fellowship.
BU Research Lightning Sessions
BU Research Lightning Session: Networks and Contagions
Panel Moderator: Eric Kolaczyk, Professor, Mathematics and Statistics, Boston University College of Arts & Sciences
Boston University College of Arts & Sciences
Assistant Professor, Finance
Questrom School of Business
Boston University College of Arts & Sciences
Professor, Bioinformatics & Biomedical Engineering
Boston University College of Engineering
Assistant Professor, Sociology
Boston University College of Arts & Sciences
BU Research Lightning Session: Methods Across Disciplines
Panel Moderator: Anita DeStefano, Professor, Biostatistics and Neurology, Boston University School of Public Health; Associate Director, BUMC Genome Science Institute
Associate Professor, Mathematics and Statistics
Boston University College of Arts & Sciences
Professor, Electrical and Computer Engineering
Boston University College of Engineering
Assistant Professor, Psychological & Brain Sciences
Boston University College of Arts & Sciences
Director, Visual Neuroscience Lab
Research Director, NSF Frontier MACS Project
“Privacy-Preserving Data Aggregation”
BU Research Lightning Session: Data Science for Social Good
Panel Moderator: Lynn Rosenberg, Sc.D., Professor, Epidemiology, Boston University School of Public Health; Senior Epidemiologist, Slone Epidemiology Center
Sandro Galea, MD, MPH, DrPH
Dean, Boston University School of Public Health
Ben Linas, MD, MPH
Assistant Professor, Epidemiology
Boston University School of Public Health
Assistant Professor of Medicine, Boston University School of Medicine
Alex Walley, MD, MS
Medical Director, Opioid Treatment Program, Boston Public Health Commission
Assistant Professor of Medicine, Boston University School of Medicine
Co-Program Director, Boston University Addiction Medicine Residency Program
Program Director, Fellow Immersion Training Program
Associate Dean of Research, Counseling Psychology
Professor, Counseling Psychology
Boston University School of Education
Public & Private Sector Panels
Industry Perspectives: Data Science as a Business Advantage
Panel Moderator: John Byers, Professor, Computer Science, Boston University College of Arts & Sciences; Founding Chief Scientist, Cogo Labs
Watson Research Scientist, IBM
Richard joined IBM Watson Health in 2014 as a research staff member on the IBM Watson for Drug Discovery product team. In this capacity, Richard leads the natural language processing team, whose work concerns enabling Watson to read and extract domain-specific information from the scientific and intellectual property literature. Watson routinely captures mentions of genes, proteins, diseases, drugs and other chemicals – as well as how they are related to one another – from tens of millions of documents, producing the large-scale data which underlies its cognitive computing abilities in the life science domain. Richard holds a PhD in chemical informatics from the University of Sheffield (UK, 2010). Prior to joining IBM, Richard was a postdoctoral research fellow at Lawrence Berkeley National Laboratory, where he helped pioneer the new field of materials informatics, enabling the computational design of nanoporous materials tailored for clean energy applications like carbon capture and natural gas storage.
Principal Scientist, AstraZeneca
Jonathan completed a BSc (Hons) in Biomedical Science at the University of Manchester where his research focused on the genetics of diabetes and exposed him to the richness of information in DNA that can be uncovered with computational tools. To pursue this interest, he was awarded a scholarship at the University of Exeter to study for an MSc in Bioinformatics and he graduated with distinction. Jonathan collaborated with GlaxoSmithKline to develop computational models determining the risk of recombinant protein breakdown by proteases in host cells. At AstraZeneca, Jonathan specialised in gene expression microarray data analysis and has supported and influenced a number of discovery programmes, including drugs targeting MEK, PARP, mTOR, PI3K and ERBB. He introduced pioneering approaches for harnessing genomic data from cancer cell lines, uncovering numerous disease associations, and discovering mechanisms of drug target dependency leading to biomarkers of drug response. Perhaps his most notable contribution to date is the identification of transcriptional readouts which demonstrate MEK activity. This was the first transcriptomic personalised healthcare (PHC) hypothesis tested in AstraZeneca clinical trials and enhanced the development of the MEK inhibitor, selumetinib.
Director, Machine Intelligence, TripAdvisor
Jeff Palmucci has been writing code professionally since he was 11 years old. A serial entrepreneur, Jeff has started several companies. He was a Founder and the VP of Software Development for Optimax Systems, a developer of scheduling systems for manufacturing operations. Optimax was acquired by i2 Technologies, where he continued on as an i2 Fellow and Lead Architect for scheduling products, doing extensive research into production scheduling. As a Founder and CTO of Percipio Capital Management, he helped lead the company to an acquisition by Link Ventures. Percipio Capital ran a programmatic hedge fund, trading in commodity futures and equities. Before Optimax, Jeff worked at BBN Laboratories, where he performed research in genetic algorithms and scheduling, machine vision, simulation, natural language, machine learning, cognitive science, and expert systems. Mr. Palmucci has worked at the Children’s Hospital in Boston, developing an educational video game for children with asthma. He worked at Caltech’s Jet Propulsion Laboratory for the Voyager 2 Neptune flyby. Jeff is currently leading the Machine Learning group at TripAdvisor, which runs various machine learning projects across the company, including natural language processing, review fraud detection, personalization, click fraud detection, and machine vision. Jeff has publications in natural language processing, machine learning, genetic algorithms, expert systems, and programming languages. Jeff has a BS in Computer Science from the Massachusetts Institute of Technology.
Principal Data Scientist, Cambridge Mobile Telematics
Bill works on modeling, inference, and visualization for vehicle data at CMT. CMT’s data science problems range from the physical (“How do I continuously orient an accelerometer?”) and the behavioral (“Is this driver unsafe?”) to the global (“How many Boston drivers bicycle to the T?”). Prior to joining CMT, he built dynamically evolving maps of the internet for Akamai, developed cryptographic and signal processing techniques at CCR-Princeton, and optimized analog error correction at Lyric Semiconductor (acquired by Analog Devices). He received his BA from Harvard, his MA from the University of Cambridge, and his PhD from MIT.
Data and the Local Community
Panel Moderator: Andrei Lapets, Research Scientist & Director of Research Development and SAIL, Hariri Institute for Computing, Boston University
Snehal Shah, MD, MPH
Director, Research and Evaluation Office, Boston Public Health Commission
Snehal Shah is the Director of the Boston Public Health Commission’s (BPHC) Research and Evaluation Office, which serves several important functions, including research, public health surveillance, analysis and interpretation of public health data, and program evaluation. Her office creates the annual Health of Boston reports, which provide data on the health status of Boston residents and inform the Commission’s work to identify and provide support to groups of individuals and communities at greatest risk for poor health outcomes. Additionally, Dr. Shah is a physician at Boston Medical Center and an Assistant Professor of Pediatrics at BU’s School of Medicine.
Research Director, MassINC
Benjamin Forman is MassINC’s Research Director and Executive Director of MassINC’s Gateway Cities Innovation Institute. Prior to MassINC, he oversaw strategic planning for the District of Columbia Department of Parks and Recreation. He also has experience as a researcher at the Brookings Institution and as a research assistant at Nathan Associates, a global economic development consulting firm.
Chief Data Officer, City of Boston
Andrew Therriault was named as the City of Boston’s first Chief Data Officer in 2016. His team uses data science to address some of the city’s most challenging problems, from homelessness and addiction to food-borne illness and traffic safety. An expert on predictive modeling, quantitative research, and data integration, Therriault previously served as Director of Data Science for the Democratic National Committee and as editor of Data and Democracy: How Political Data Science Is Shaping the 2016 Elections (O’Reilly Media). He received his PhD in political science from New York University in 2011 and completed a postdoctoral research fellowship at Vanderbilt.
Assistant Secretary for Performance Management & Innovation, Massachusetts Department of Transportation (MassDOT), Commonwealth of Massachusetts
Rachel Bain is the Assistant Secretary for Performance Management and Innovation. In that role, she oversees the performance and accountability system for MassDOT and the MBTA. Prior to that appointment, Ms. Bain served as Deputy Registrar for Operations in MassDOT’s Registry Division and as MassDOT’s project lead for the Big Data in Transportation Initiative. Ms. Bain spent six years in the Office of Transportation Planning in a variety of roles, including working with the Commonwealth’s 13 MPOs and managing policy and planning efforts.
Sahar Abi-Hassan, PhD Candidate, Political Science, CAS
Interest Group Composition, Judicial Agenda-Setting, and Dissensus on the Supreme Court
Studies of the agenda-setting process in the US Supreme Court have provided a wide range of information about the factors driving a justice’s decision to grant review, including legal considerations and the strategic pursuit of the best policy outcome. However, the bulk of these studies have limited their analyses to litigation outcomes. Little is known about how a justice’s vote on a petition for a writ of certiorari affects her behavior in the later stages of a case. Through an empirical study of Supreme Court cases between 1999 and 2010, we find that the reasoning at the writ stage affects a justice’s decision to join the majority or minority at the litigation stage. In addition, we explore how these initial evaluations influence a justice’s interpretation of the signals of litigant support conveyed in amicus briefs. The writ decision moderates the effect of amicus briefs in support of the petitioner versus the respondent. In sum, the findings of this paper suggest a richer relationship between the internal and external factors that influence judicial behavior than previously acknowledged.
Cantay Caliskan, PhD Candidate, Political Science, CAS
The Influence of Elite Networks on Green Policymaking in Advanced Economies of Europe: Evidence From Twitter and Elite Interviews
This study focuses on energy and environment policy networks and their influence on policymaking in Northern Europe, examining cases from Sweden, Denmark, Norway, Iceland, Ireland, Belgium, and the Netherlands. It looks largely at renewable-energy and climate policies, for which Europe provides a fitting setting, with a significant level of clean-energy debate and environmental activism at both the state and the public level. Specifically, the study asks the following question: how do the structures of elite networks influence the success of energy and environment policies? The mixed-methods analysis uses network analysis and elite interviews to uncover the complex nature of relationships between European elites. The main argument is that stronger connections between elites lead to better policymaking, and thus to more renewable energy production and a faster decrease in CO2 levels.
Ruidi Chen, PhD Candidate, Systems Engineering, ENG
A Distributionally Robust Optimization Approach for Outlier Detection
We propose a new Distributionally Robust Optimization (DRO) method for outlier detection in a linear regression setting, in which the closeness of probability distributions is measured using the Wasserstein metric. The robust optimization problem reduces to solving a quadratic optimization problem. We prove several generalization guarantees for our solution under mild conditions. Extensive numerical experiments demonstrate that our approach outperforms Huber’s robust regression approach in terms of both estimation accuracy and detection rate.
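As a rough illustration of the setup (a generic Wasserstein-DRO formulation for regression, not necessarily the authors’ exact model), the estimator solves a min-max problem over distributions close to the empirical one, which under suitable norm choices is known to reduce to a regularized empirical-risk problem:

```latex
% Generic Wasserstein DRO for linear regression (illustrative):
\min_{\beta}\ \sup_{\,Q:\ W(Q,\hat{P}_N)\le \varepsilon}\ \mathbb{E}_{Q}\big[\,|y - \beta^{\top}x|\,\big]
% Under suitable choices of the transport metric on (x, y), this is
% equivalent to a regularized empirical problem:
\min_{\beta}\ \frac{1}{N}\sum_{i=1}^{N} |y_i - \beta^{\top}x_i| \;+\; \varepsilon\,\|(-\beta,\,1)\|_{*}
% where \|\cdot\|_{*} is the dual of the norm defining the Wasserstein
% metric: robustness to distributional shift acts like a norm regularizer,
% which is what makes the fit resistant to outliers.
```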
Jiawei Chen, PhD Candidate, Electrical & Computer Engineering, ENG
Semi-Coupled Two-Stream Fusion ConvNets for Action Recognition at Extremely Low Resolutions
Today, there exist reliable methods for automatically recognizing human actions in standard-resolution videos. However, acquiring video at such resolutions raises privacy concerns in some environments, for example private homes, hospitals, rehabilitation centers, and some corporate offices. In this project, we use extremely low resolution videos (e.g., 16 x 12 pixels) to protect privacy and develop a novel deep convolutional neural network (ConvNet) framework for action recognition from such low resolution data. Our method outperforms state-of-the-art methods at extremely low resolutions on two public datasets.
Adam Gower, Faculty, Department of Medicine, BUSM
openSESAME: a tool for discovering biological connections between experimental conditions based on common patterns of differential gene expression
We have developed openSESAME, a “search engine” that enables researchers to easily scan large databases of gene expression data for experiments in which the genes in a query are co-regulated in a pattern similar to that of an experiment of interest. Identifying experiments or biological conditions that cause similar alterations in gene expression can provide new insights into the biological similarities between these experiments. In the same manner that modern search engines have transformed the process of acquiring information and identifying hidden connections from simple textual cues, openSESAME has the potential to transform the process by which biologists interpret the patterns of gene regulation they observe in their experiments. One particularly useful application of openSESAME is repositioning existing drugs in new disease settings to produce new treatments for intractable diagnoses: if the gene expression changes associated with a disease are reversed by a given drug treatment, or vice versa, the implication is that the drug may be able to slow, stop, or reverse biological processes set in motion by the disease. Furthermore, by inverting the openSESAME paradigm, one can produce a network of published gene expression patterns (“signatures”) based on patterns of co-expression across hundreds of thousands of publicly available gene expression datasets, allowing investigators to place their own signatures in the context of those from the scientific literature.
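The core query operation — scoring experiments by how closely their expression changes track a query signature — can be sketched in miniature. This is an illustration only: the gene names, values, and the simple correlation score below are invented, not openSESAME’s actual method.

```python
# Toy "signature search": rank experiments by how well their differential-
# expression pattern matches a query signature over the same genes.
query = {"GENE_A": 1.0, "GENE_B": -0.8, "GENE_C": 0.5}  # up/down pattern of interest

experiments = {
    "exp1_drug_X": {"GENE_A": 0.9, "GENE_B": -0.7, "GENE_C": 0.4},    # similar
    "exp2_disease_Y": {"GENE_A": -1.0, "GENE_B": 0.9, "GENE_C": -0.5},  # reversed
    "exp3_unrelated": {"GENE_A": 0.1, "GENE_B": 0.0, "GENE_C": -0.1},
}

def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def search(query, experiments):
    """Score each experiment by correlation with the query signature.
    Strongly negative scores suggest the condition *reverses* the signature,
    e.g. a candidate for drug repositioning."""
    genes = sorted(query)
    q = [query[g] for g in genes]
    scores = {name: pearson(q, [prof[g] for g in genes])
              for name, prof in experiments.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranked = search(query, experiments)
print(ranked[0][0], ranked[-1][0])
```

A real system would operate on tens of thousands of genes across hundreds of thousands of samples and use more robust similarity measures, but the ranking idea is the same.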
David Jenkins, PhD Candidate, Bioinformatics, BUSM
Interactive single cell RNA-Seq analysis with R Shiny
Single cell RNA sequencing (scRNA-seq) allows an investigator to identify differences in RNA levels between individual cells. This is particularly important for identifying tumor subclonality or differences between cell types in a particular body tissue or at a particular developmental stage. Due to the sparseness of scRNA-seq results, additional filtering and normalization steps are included in an scRNA-seq pipeline beyond what is performed for standard bulk RNA sequencing projects. As the popularity of scRNA-seq experiments increases, there is a need for a user-friendly, R-based pipeline for scRNA-seq analysis. Here, we present the Single Cell Toolkit, an interactive analysis pipeline written in R Shiny that allows users to filter, normalize, visualize, and perform differential expression analysis on scRNA-seq data.
Mark Bestavros, Undergraduate Student, Computer Science, CAS
Tyrone Huo, Undergraduate Student, Computer Science, CAS
Adrian Law, Undergraduate Student, Computer Science, CAS
Algorithmic Optimizations of Boston’s Public Bus Network
The planning of public transportation control systems involves coordination among multiple parties, including authorities, riders, and operators, as well as analysis of external factors that impact the system, such as traffic patterns and areas of high demand. A systematic optimization of such a system would be immensely beneficial for the MBTA, enabling analysis of how best to maximize bus route coverage and allocate buses.
Christy Lin, PhD Candidate, Systems Engineering, ENG
Node Embedding for Network Community Discovery
The pattern of connections in a network of interacting entities reflects the structure of their organization into communities, whether explicit or hidden. Discovering such communities is crucial for understanding the local and global interactions among the entities in the network. In this project, we developed a novel algorithm for discovering communities in a network by viewing nodes as “words”, forming sentences of node-words via random walks in the network, and transforming the words into vectors through recent advances in natural language processing. The transformed word-vectors cluster by their latent community memberships. Through extensive experimental studies on simulated and real-world networks, we demonstrate that the proposed approach consistently improves over the current state-of-the-art.
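The pipeline described above — random walks as “sentences”, vectors for node-words, clustering by similarity — can be sketched in miniature. This is a toy illustration, not the authors’ algorithm: the graph is invented, and simple window co-occurrence counts stand in for learned embeddings.

```python
import math
import random
from collections import defaultdict

random.seed(0)

# Toy graph: two dense communities (nodes 0-4 and 5-9) joined by one bridge edge.
edges = [(i, j) for i in range(5) for j in range(i + 1, 5)]
edges += [(i, j) for i in range(5, 10) for j in range(i + 1, 10)]
edges += [(4, 5)]  # bridge
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

def random_walks(adj, num_walks=50, walk_len=10):
    """Generate 'sentences' of node-words via uniform random walks."""
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            for _ in range(walk_len - 1):
                walk.append(random.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

def cooccurrence_vectors(walks, num_nodes, window=3):
    """Count window co-occurrences: a crude stand-in for learned embeddings."""
    vecs = [[0.0] * num_nodes for _ in range(num_nodes)]
    for walk in walks:
        for i, u in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if i != j:
                    vecs[u][walk[j]] += 1.0
    return vecs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

walks = random_walks(adj)
vecs = cooccurrence_vectors(walks, 10)
# Nodes in the same community end up with more similar vectors.
within = cosine(vecs[0], vecs[1])
across = cosine(vecs[0], vecs[9])
print(within > across)
```

In practice the word-to-vector step uses a learned embedding (as in word2vec-style models) rather than raw counts, and clustering is performed on the resulting vectors.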
Feng Nan, PhD Candidate, Electrical & Computer Engineering, ENG
Pruning Random Forests for Prediction on a Budget
We propose to prune a random forest (RF) for resource-constrained prediction. We first construct a RF and then prune it to optimize expected feature cost and accuracy. We pose pruning RFs as a novel 0-1 integer program with linear constraints that encourages feature re-use. We establish total unimodularity of the constraint set to prove that the corresponding LP relaxation solves the original integer program. We then exploit connections to combinatorial optimization and develop an efficient primal-dual algorithm, scalable to large datasets. In contrast to our bottom-up approach, which benefits from good RF initialization, conventional methods are top-down, acquiring features based on their utility value, and are generally intractable, requiring heuristics. Empirically, our pruning algorithm outperforms existing state-of-the-art resource-constrained algorithms.
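To see the cost/accuracy trade-off and the feature re-use incentive concretely, here is a toy greedy pruner. It is not the paper’s integer program or primal-dual algorithm — the trees, feature costs, and accuracies below are invented — but it shows why trees that share already-paid-for features are cheap to keep.

```python
# Each tree: (set of feature ids it queries, its standalone validation accuracy).
trees = [
    ({0, 1}, 0.80),
    ({0, 2}, 0.78),
    ({3, 4}, 0.82),
    ({3, 5}, 0.79),
    ({6, 7, 8}, 0.81),
]
feature_cost = {f: 1.0 for f in range(9)}

def ensemble_cost(kept):
    """Feature cost of the kept trees: shared features are paid for once."""
    used = set().union(*(trees[i][0] for i in kept)) if kept else set()
    return sum(feature_cost[f] for f in used)

def ensemble_acc(kept):
    """Crude proxy for ensemble accuracy: mean standalone accuracy."""
    return sum(trees[i][1] for i in kept) / len(kept) if kept else 0.0

def greedy_prune(budget):
    """Drop trees until the feature budget is met, each time removing the
    tree whose removal loses the least accuracy per unit of cost saved."""
    kept = set(range(len(trees)))
    while kept and ensemble_cost(kept) > budget:
        def removal_score(i):
            rest = kept - {i}
            saved = ensemble_cost(kept) - ensemble_cost(rest)
            lost = ensemble_acc(kept) - ensemble_acc(rest)
            return lost / saved if saved > 0 else float("inf")
        kept.remove(min(kept, key=removal_score))
    return kept

kept = greedy_prune(budget=5.0)
print(sorted(kept), ensemble_cost(kept))
```

Note how the two trees sharing feature 3 survive together: once one pays for a feature, the other re-uses it for free — the incentive the paper’s integer program encodes exactly.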
Supriya Sharma, PhD Candidate, Department of Medicine, BUSM
Multicomponent Signatures for Predicting Epigenomic Treatment Efficacy
In this work, we have developed an unbiased, multi-component approach, which profiles DNA methylation and gene expression signatures across breast cancer subtypes and multiple profiling platforms. The goal of this study is to predict drug sensitivity in breast cancer cell lines, mouse xenografts, and tumor-normal samples from the Cancer Genome Atlas (TCGA).
Zafar Takhirov, PhD Candidate, Electrical & Computer Engineering, ENG
Energy-Efficient Adaptive Classifier Design for Mobile Systems
With the continuous increase in the amount of data that needs to be processed by digital mobile systems, energy-efficient computation has become a critical design constraint for mobile systems. We propose an adaptive classifier that leverages the wide variability in data complexity to enable energy-efficient data classification operations for mobile systems. Our approach takes advantage of varying classification “hardness” across data to dynamically allocate resources and improve energy efficiency. On average, our adaptive classifier is ≈100× more energy efficient while having ≈1% higher error rate than a complex radial basis function classifier and is ≈10× less energy efficient but has ≈40% lower error rate than a simple linear classifier across a wide range of classification data sets.
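The adaptive idea — route easy inputs through a cheap classifier and reserve the expensive one for hard inputs — can be sketched as follows. This is a toy illustration with an invented 1-D dataset and a nearest-neighbour stand-in for the complex classifier, not the authors’ hardware design.

```python
# Toy 1-D data: class 0 clusters near -1, class 1 near +1; points near 0 are "hard".
data = [(-1.2, 0), (-1.0, 0), (-0.9, 0), (-0.1, 0),
        (0.05, 1), (0.9, 1), (1.0, 1), (1.1, 1)]

def simple_classifier(x):
    """Cheap linear rule with a confidence score (distance from the boundary)."""
    label = 1 if x > 0 else 0
    confidence = abs(x)
    return label, confidence

def complex_classifier(x, train):
    """Expensive stand-in: 1-nearest-neighbour over the training set."""
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]

def adaptive_classify(x, train, threshold=0.5):
    """Use the cheap path when confident; fall back to the expensive path."""
    label, conf = simple_classifier(x)
    if conf >= threshold:
        return label, "cheap"
    return complex_classifier(x, train), "expensive"

results = [adaptive_classify(x, data) for x, _ in data]
cheap_fraction = sum(1 for _, path in results if path == "cheap") / len(results)
print(cheap_fraction)
```

Because most real inputs are “easy”, the expensive classifier runs rarely, which is where the average energy savings come from.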
Ozan Tezcan, PhD Candidate, Electrical & Computer Engineering, ENG
Automatic Rating of Room Clutter for Improved Hoarding Disorder Assessment
Hoarding disorder (HD) is a prevalent psychiatric condition, occurring in 2-5% of adults in the United States, that is characterized by persistent difficulty discarding ordinary items and by associated clutter that impairs daily functioning. It is distinct from many mental health disorders due to its profound negative effect on the health and safety of patients, their homes, their neighbors, and the broader community (e.g., falls, fires, sanitation issues). Among various HD assessment factors, the Clutter Image Rating (CIR) is a central measure related to the visual aspect of HD. The CIR is rated by clients, practitioners, and family members by selecting one of 9 “clutter-equidistant” photos of a bedroom, living room, and kitchen; it is therefore subjective, potentially biased, and costly (when clinicians make home visits). This project explores various machine learning techniques for automatically rating room clutter from images according to the CIR scale.
Ozan Tuncer, PhD Candidate, Electrical & Computer Engineering, ENG
Diagnosing Performance Variations in High Performance Computing Using Machine Learning
With the growing complexity and scale of high performance computing (HPC) systems, application performance variation has become a significant challenge in efficient and resilient system management. To effectively alleviate this problem, system administrators must detect and identify the software- and firmware-related anomalies that are responsible for performance variation and take preventive actions. However, diagnosing anomalies can be a difficult task given the vast amount of noisy and high-dimensional data being collected via a variety of system monitoring infrastructures. In this work, we present a framework that uses machine learning to automatically diagnose a variety of performance anomalies in HPC systems. Our framework leverages resource usage and performance counter data collected during application runs to learn anomaly characteristics and to identify the types of anomalies observed while running applications.
Ata Turk, Research Scientist, Electrical & Computer Engineering, ENG
DeltaSherlock: Identifying Changes in the Cloud
To track security and compliance requirements and perform problem diagnosis, administrators of cloud computing systems need to monitor significant system changes occurring on the set of cloud instances under their supervision. Considering the large number of instances (virtual machines, containers) possibly operating under multiple configurations, this is a difficult-to-track process. We propose DeltaSherlock, a practical system change discovery framework that can capture system states on-demand and detect multiple system changes between them. We evaluate DeltaSherlock over 25,000 system changes caused by software installations collected from virtual machines deployed over a commercial cloud. DeltaSherlock can accurately identify multiple software installations with 96.8% accuracy when supplied with a non-overlapping record of system changes and with 77.8% accuracy when supplied with random irregular observations possibly containing overlapping or incomplete changes.
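The core operation — capturing system state as a snapshot and diffing snapshots to surface changes — might look like this in miniature. Paths, contents, and function names below are invented for illustration; DeltaSherlock itself goes further, identifying which software installation caused each observed change.

```python
import hashlib

def fingerprint(contents: bytes) -> str:
    """Content fingerprint, so modified files are detected even if paths match."""
    return hashlib.sha256(contents).hexdigest()

def snapshot(files: dict) -> dict:
    """Capture system state as a map from path to content fingerprint."""
    return {path: fingerprint(data) for path, data in files.items()}

def diff(before: dict, after: dict) -> dict:
    """Detect added, removed, and modified paths between two snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(p for p in set(before) & set(after)
                           if before[p] != after[p]),
    }

before = snapshot({"/etc/hosts": b"127.0.0.1 localhost\n",
                   "/usr/bin/tool": b"v1"})
after = snapshot({"/etc/hosts": b"127.0.0.1 localhost\n10.0.0.5 db\n",
                  "/usr/bin/tool": b"v1",
                  "/usr/lib/libnew.so": b"..."})
delta = diff(before, after)
print(delta["added"], delta["modified"])
```

The hard part at cloud scale, which the framework addresses, is attributing a bag of such low-level deltas to the (possibly overlapping) installations that produced them.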
Lan Wang, PhD Candidate, Organizational Behavior, Questrom
Beyond Moneyball to Social Capital Inside and Out: The Value of Differentiated Workforce Experience Ties to Performance
The differential impact of social capital among employees in strategic and support roles has received far less attention in the talent management literature than that of human capital. Building on network closure theory and differentiated workforce theory, we examine the effect of strategic and support teams’ experience ties on team performance while controlling for human capital using current Moneyball-inspired metrics for workforce quality. Using a 111-year longitudinal dataset of 15,837 Major League Baseball players from all 30 teams and 3,475,778 experience ties, we find that, after accounting for the effects of team quality, managerial stability and reputation, and era, organizational experience ties and subsequent team performance have an inverted U-shaped relationship for strategic roles and a U-shaped relationship for support roles. Competitor experience ties have an inverted U-shaped relationship with performance for strategic roles, yet the hypothesized U-shaped relationship differed across competency areas among support roles. This study highlights the value of social capital to team performance and the importance of differentiating HRM practices for strategic and support roles in different competency areas.
Yuqing Zhang, PhD Candidate, Bioinformatics, BUSM
Dataset heterogeneity in the validation of prediction models across studies
Cross-study validation of prediction models is an alternative to cross-validation that better reflects independent reproducibility. Cross-study validation performance is affected by differences between studies. Previous research has shown that cross-study validation accuracy is often worse than cross-validation accuracy, but has not identified the sources of heterogeneity responsible. We use a bootstrap method to generate realistic simulations based on publicly available breast and ovarian cancer microarray datasets, and assess the impact of three types of between-study heterogeneity on the cross-study performance of prediction models.
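The basic phenomenon — within-study accuracy exceeding cross-study accuracy when studies differ — can be simulated in a few lines. This is a toy with an invented batch shift between two simulated “studies”, not the authors’ bootstrap procedure.

```python
import random

random.seed(1)

def make_study(n, shift):
    """Simulate one study: class 1 values sit above class 0,
    plus a study-specific batch shift (the source of heterogeneity)."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        x = random.gauss(label * 2.0, 1.0) + shift
        data.append((x, label))
    return data

def fit_threshold(train):
    """Trivial classifier: midpoint of the two class means."""
    m0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    m1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    return (m0 + m1) / 2

def accuracy(threshold, test):
    return sum((x > threshold) == (y == 1) for x, y in test) / len(test)

study_a = make_study(400, shift=0.0)
study_b = make_study(400, shift=1.5)  # batch effect between studies

# Within-study (cross-validation-like): train and test inside study A.
within = accuracy(fit_threshold(study_a[:200]), study_a[200:])
# Cross-study: train on A, validate on B.
cross = accuracy(fit_threshold(study_a), study_b)
print(round(within, 2), round(cross, 2))
```

The threshold learned on study A is miscalibrated for study B’s shifted distribution, so cross-study accuracy drops even though the underlying signal is identical.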
Sarah Zheng, PhD Candidate, Operations and Technology Management, Questrom
The Impact of Internal Service Quality on Preventable Adverse Events in Hospitals
Provision of safe, timely medical care to hospital patients requires services from multiple support departments, such as environmental services and pharmacy. However, there have been few studies that examine the impact of the service quality of internal support departments on clinical performance. The lack of studies linking internal service quality (ISQ) to clinical performance creates a gap in healthcare operations management theory and—from a practice standpoint—might result in underinvestment in the quality of services delivered by hospitals’ internal support departments. To address these issues, we develop a hypothesis that higher ISQ is associated with lower adverse events. We test this hypothesis by leveraging a unique dataset from a hospital that developed its own measure of ISQ provided by support departments. Using over a year’s worth of monthly data on the average ISQ delivered by 11 support departments to five nursing units, we test the impact of ISQ on two nursing-sensitive adverse events: patient falls with injury and hospital-acquired pressure ulcers. We find support for our hypothesis that higher levels of ISQ are associated with lower rates of adverse events. Our results show that improving the overall average ISQ received by a nursing unit by 0.1 on a 5-point scale has almost the same benefit as increasing staffing on that unit by one full time equivalent nurse in terms of reducing adverse events. Our study has important implications for theory and practice as it points to a fruitful, cost effective, and yet underutilized avenue for improving quality of care.