Research on Large-Scale Computing Systems Analytics and Optimization

A diagram showing software components in the Pythia project

A Just-in-Time, Cross-Layer Instrumentation Framework for Diagnosing Performance Problems in Distributed Applications

Funding: NSF and Red Hat; Collaboration: Tufts University

Diagnosing performance problems in distributed applications is extremely challenging and time-consuming. A significant reason is that it is hard to know a priori where to enable instrumentation to help diagnose problems that may occur far in the future. In this work, we aim to create an instrumentation framework that automatically searches the space of possible instrumentation choices to enable the instrumentation needed to help diagnose a newly observed problem. Our prototype, called Pythia, builds on workflow-centric tracing, which records the order and timing of how requests are processed within and among a distributed application’s nodes (i.e., records their workflows). Pythia uses the key insight that localizing the sources of high performance variation within the workflows of requests that are expected to perform similarly reveals where additional instrumentation is needed.

AI for Cloud Ops, in collaboration with Red Hat
Project link: VAIF (Variance-driven Automated Instrumentation Framework)
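
As a minimal illustration of this insight (not Pythia’s actual implementation, and with hypothetical trace and field names), one could group traces by the workflow they are expected to follow and rank workflow edges by their relative latency variation; the highest-variance edges are the candidates for additional instrumentation:

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical input: each trace is (workflow_key, {edge_name: latency_ms}),
# where workflow_key groups requests expected to perform similarly.
def rank_instrumentation_candidates(traces, min_samples=5):
    """Rank (workflow, edge) pairs by coefficient of variation of latency."""
    samples = defaultdict(list)
    for workflow_key, edge_latencies in traces:
        for edge, latency in edge_latencies.items():
            samples[(workflow_key, edge)].append(latency)

    candidates = []
    for key, latencies in samples.items():
        if len(latencies) < min_samples:
            continue  # not enough evidence to estimate variation
        mu = mean(latencies)
        if mu == 0:
            continue
        cv = stdev(latencies) / mu  # coefficient of variation
        candidates.append((cv, key))

    # Edges with the highest relative variation among "similar" requests
    # are where additional instrumentation would be most informative.
    return sorted(candidates, reverse=True)
```

Using the coefficient of variation rather than raw variance keeps fast and slow edges comparable on the same scale.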


Data flow diagram for automatic anomaly detection.

Automated Analytics for Improving Efficiency, Safety, and Security of HPC Systems

Funding: Sandia National Laboratories

Performance variations are becoming more prominent with new generations of large-scale HPC systems. Understanding these variations and developing resilience to anomalous performance behavior are critical challenges for reaching extreme-scale computing. To help address these emerging performance variation challenges, there is increasing interest in designing data analytics methods to make sense of the telemetry data collected from computing systems. Existing methods, however, rely heavily on manual analysis and/or are specific to a certain type of application, system, or performance anomaly. In contrast, the aims of this project are (1) identifying information available on production HPC systems that helps explain performance characteristics, performance variations, inefficiencies, and anomalous behaviors indicative of software problems, component degradation, or malicious activity; (2) conducting this identification through automated techniques that can work with a broad range of systems, applications, and conditions that create performance variations; and (3) designing methods that leverage this system information to improve the efficiency, resilience, and security of HPC systems.

AI4HPC: AI-based Scalable Analytics for Improving Performance, Resilience, and Security of HPC Systems
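
As a minimal sketch of the automated, application-agnostic direction this project pursues, an unsupervised detector can flag anomalous telemetry windows without labeled failures or application-specific rules; the features and data below are illustrative stand-ins, not the project’s actual pipeline:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Illustrative telemetry: one row per (node, time window), columns are
# summary statistics of counters such as CPU utilization, memory use,
# and network bytes (stand-ins for real monitoring data).
normal = rng.normal(loc=[50, 30, 100], scale=[5, 3, 10], size=(500, 3))
anomalous = rng.normal(loc=[90, 80, 400], scale=[5, 3, 10], size=(5, 3))
telemetry = np.vstack([normal, anomalous])

# Unsupervised detection requires no labeled anomalies, which matters
# because labeled failures are rare on production HPC systems.
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(telemetry)  # -1 = anomaly, 1 = normal

print("flagged windows:", np.where(labels == -1)[0])
```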

Time-series plot explaining vmstat contributions toward a machine-learning decision

Scalable and Explainable Machine Learning Analytics for Understanding HPC Systems

Funding: Sandia National Laboratories

The goal of this project is to design scalable and explainable analytics methods to diagnose performance anomalies in high-performance computing (HPC) systems, so as to help sustain the performance and efficiency increases necessary for achieving exascale computing and beyond. Specific tasks include (1) designing and building techniques for training a performance analysis framework and making sufficiently accurate predictions with less data; and (2) investigating the integration of existing methods and the design of new methods to substantially improve the explainability of the performance analytics framework's decision-making process.
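
To make the explainability task concrete, here is a minimal sketch (with synthetic data and hypothetical vmstat-style feature names) of attributing a model's anomaly decisions to its input features via permutation importance; this is one simple technique among those the project investigates, not the project's framework itself:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)

# Illustrative features derived from vmstat-style counters; in this
# synthetic data, only "run_queue" actually drives the anomaly label.
feature_names = ["run_queue", "free_mem", "swap_in", "context_switches"]
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0.5).astype(int)  # anomaly depends on run_queue only

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance attributes the model's decisions to input
# features, one simple form of explainability for performance analytics.
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
for name, score in sorted(zip(feature_names, result.importances_mean),
                          key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```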

Research on Designing Future Energy-Efficient Computing Systems

Architecting the COSMOS: A Combined System of Optical Phase Change Memory and Optical Links

Funding: NSF

Today’s data-intensive applications that use graph processing, machine learning, or privacy-preserving paradigms demand memory sizes on the order of hundreds of gigabytes and bandwidths on the order of terabytes per second. To support the ever-growing memory needs of applications, Dynamic Random Access Memory (DRAM) systems have evolved over the decades. However, DRAM will not be able to support these large capacity and bandwidth demands in the future in an efficient and scalable way. While there is research on a number of alternative memory technologies (such as phase-change memory, magnetic memory, and resistive memory), there is no clear winner among these technologies to replace DRAM. Moreover, neither these alternative memory technologies nor DRAM efficiently complements the silicon-photonic link technology that is expected to replace high-speed electrical links for processor-to-memory communication in the near future. This project aims to address the problems arising from limited memory capacity and bandwidth, which are significant limiting factors in application performance, through a unified network and memory system called COSMOS. COSMOS integrates Optically-controlled Phase Change Memory (OPCM) technology and silicon-photonic link technology to achieve a “one-stop-shop” solution that provides seamless high-bandwidth access from the processor to a high-density memory. The project goes beyond solely using OPCM as a DRAM replacement, and aims to demonstrate the true potential of OPCM as non-volatile memory and in processing-in-memory (PIM) design scenarios. At a broader level, the project seeks to improve the performance of many societal data-driven applications in various domains, including healthcare, scientific computing, transportation, and finance.

The research goals of this project are to design the first full system architecture for COSMOS, and then demonstrate its benefits using realistic application kernels when OPCM is used as a DRAM replacement, in an OPCM+DRAM combination with persistence properties, and in OPCM-as-PIM scenarios. To achieve these ambitious goals, the project is organized into three research thrusts and a cross-cutting thrust. Thrust 1 designs a full-system architecture using OPCM and silicon-photonic links to address the memory capacity and bandwidth requirements of data-centric applications. Thrust 2 investigates the use of COSMOS for PIM, where the stored data is processed at the speed of light. Thrust 3 creates mechanisms and methods for application developers and the operating system to profile and instrument applications so they make effective use of COSMOS. The cross-cutting thrust builds a simulation methodology to accurately evaluate the benefits of OPCM as a DRAM replacement as well as of the combined OPCM+DRAM and OPCM-as-PIM designs.
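
To put the bandwidth argument in perspective, here is a back-of-envelope calculation of the aggregate bandwidth a wavelength-division-multiplexed (WDM) silicon-photonic link could deliver; all channel counts and data rates below are illustrative assumptions, not COSMOS design parameters:

```python
# Back-of-envelope aggregate bandwidth of a WDM silicon-photonic link.
# All numbers are illustrative assumptions, not COSMOS design parameters.
wavelengths_per_waveguide = 16   # WDM channels multiplexed on one waveguide
data_rate_gbps = 32              # per-wavelength data rate (Gb/s)
waveguides = 16                  # parallel waveguides to the memory

aggregate_gbps = wavelengths_per_waveguide * data_rate_gbps * waveguides
print(f"aggregate bandwidth: {aggregate_gbps / 1000:.1f} Tb/s "
      f"(~{aggregate_gbps / 8:.0f} GB/s)")
# 16 * 32 * 16 = 8192 Gb/s, i.e., ~1 TB/s -- the terabyte-per-second
# regime that electrical DDR interfaces struggle to reach efficiently.
```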


A diagram of 3D chip structure

Managing Thermal Integrity in Monolithic 3D Integrated Systems

Funding: NSF; Collaboration: Stony Brook University and CEA-LETI, France

The integrated circuit (IC) research community has witnessed highly encouraging developments in reliably fabricating monolithic three-dimensional (Mono3D) chips. Unlike through-silicon via (TSV) based vertical integration, where multiple wafers are thinned, aligned, and bonded, Mono3D ICs form multiple device tiers on a single substrate through a sequential fabrication process. Vertical interconnects, referred to as monolithic inter-tier vias (MIVs), are orders of magnitude smaller than TSVs (nanometers vs. micrometers), enabling unprecedented integration density with superior power and performance characteristics. This fine-grained connectivity is particularly important because modern transistors have reached sub-10 nm dimensions. Despite the growing interest in various aspects of Mono3D technology, a reliable framework for ensuring thermal integrity in dense Mono3D systems does not yet exist. This research fills this gap, with its primary emphasis on leveraging Mono3D-specific characteristics during both efficient thermal analysis and temperature optimization. Our objective is to facilitate future progress on both the design and fabrication aspects of Mono3D technology by developing a comprehensive framework for managing thermal issues. The results of this research will provide a better understanding of the unique thermal characteristics of Mono3D ICs and help mitigate thermal issues through efficient analysis and optimization.
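
As a minimal sketch of the kind of compact thermal analysis involved (a steady-state 1D thermal resistance chain, with illustrative layer values rather than measured Mono3D parameters), one can estimate how temperature accumulates from the heat sink down through the tiers:

```python
# Minimal sketch of a steady-state compact thermal model for a two-tier
# Mono3D stack as a 1D thermal resistance chain. Layer values are
# illustrative assumptions, not measured Mono3D parameters.

T_AMBIENT = 45.0  # deg C at the heat sink

# (name, power dissipated in layer [W], thermal resistance to layer above [K/W])
# Listed from the heat sink down to the bottom tier; heat flows upward.
layers = [
    ("top tier",              20.0, 0.10),
    ("inter-tier dielectric",  0.0, 0.30),
    ("bottom tier",           15.0, 0.05),
]

# In a series chain, the heat crossing each resistance is the total power
# of all layers below it, so temperature accumulates as we descend.
temperature = T_AMBIENT
power_below = sum(p for _, p, _ in layers)
for name, power, r_th in layers:
    temperature += power_below * r_th   # delta-T across this resistance
    print(f"{name}: {temperature:.1f} C")
    power_below -= power                # this layer's heat exits above it
```

The bottom tier ends up hottest even though it dissipates less power, which hints at why Mono3D-specific thermal optimization is needed.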


A diagram illustrating the project architecture to model hybrid cooling systems

Modeling the Next-Generation Hybrid Cooling Systems for High-Performance Processors

Funding: NSF; Collaboration: MIT and Brown University

The design of future high-performance chips is hindered by severe temperature challenges, because existing cooling mechanisms are not equipped to efficiently cool the power densities expected in exascale systems, which reach hundreds to several thousand watts per square centimeter. Several highly efficient emerging cooling technologies are being developed by thermomechanical engineers; however, these technologies are not easily accessible to computer engineers for experimentation, which prevents co-designing and optimizing aggressive processor architectures together with the cooling subsystem. To close this gap, this project proposes to develop a software infrastructure that enables accurate modeling of cutting-edge cooling methods and, further, facilitates mutually customizing the computing and cooling systems to push dramatically beyond the system performance per watt achievable today. Specific tasks include: (1) synthesizing novel physical device-level models into compact representations, (2) validating the proposed models using measurements on prototypes and detailed simulators, and (3) developing the automation tooling needed to design and optimize customized hybrid cooling subsystems together with a given target computing system.
GitHub link
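
To illustrate in miniature what co-designing computing and cooling looks like, the sketch below sweeps one cooling knob (coolant flow rate) and picks the operating point that minimizes overhead power (pump plus temperature-dependent leakage) under a thermal limit; every model and constant is an illustrative assumption, not a validated compact model from this project:

```python
# Minimal co-design sketch: sweep a cooling "knob" (coolant flow rate) and
# pick the setting that minimizes total overhead power while meeting a
# temperature limit. All models and constants are illustrative assumptions.
import math

CHIP_POWER_W = 300.0     # heat to remove
AREA_CM2 = 4.0
T_AMBIENT_C = 30.0
T_LIMIT_C = 85.0

def chip_temp(flow_lpm):
    # Convective cooling improves sublinearly with flow (illustrative model).
    h = 2.0 * flow_lpm ** 0.8                       # W/(cm^2 K)
    return T_AMBIENT_C + CHIP_POWER_W / (h * AREA_CM2)

def pump_power(flow_lpm):
    return 0.05 * flow_lpm ** 3                     # pumping cost grows fast

def leakage_power(temp_c):
    return 10.0 * math.exp(0.02 * (temp_c - 25.0))  # leakage rises with T

best = None
for flow in [f / 10 for f in range(5, 101)]:        # 0.5 .. 10.0 L/min
    t = chip_temp(flow)
    if t > T_LIMIT_C:
        continue                                    # violates thermal limit
    overhead = pump_power(flow) + leakage_power(t)
    if best is None or overhead < best[0]:
        best = (overhead, flow, t)

overhead, flow, t = best
print(f"best flow: {flow:.1f} L/min, temp: {t:.1f} C, overhead: {overhead:.1f} W")
```

Overcooling wastes pump power while undercooling inflates leakage, so the optimum lies in between; finding it requires exactly the kind of joint computing/cooling models this project builds.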

A view of the interior of a data center

Sustainable IT and IT for Sustainability

Funding: BU College of Engineering Dean’s Catalyst Award

The computing ecosystem continues to grow at a breakneck pace and consumes a substantial portion of the world’s electricity. Currently, the vast majority of electricity is produced from fossil fuels, which is unsustainable in the long term and has a tremendous environmental impact. There is growing motivation to integrate renewables; however, the volatility of renewables creates new challenges for power grid operators, who need to dynamically balance electricity supply and demand. Wouldn’t it be appealing if computing, whose growth is contributing to increased electricity demand, could emerge as a major enabler of increased electricity generation from renewables? This would also make the growth of computing systems sustainable. This project aims to develop a framework for making this vision a reality, particularly by integrating large data centers (HPC clusters, grid engines, or other data centers) into emerging smart grid programs. We propose to develop a collaborative and distributed control framework for the computing sector that helps stabilize the grid while providing power cost incentives for data centers. Specifically, this project seeks to build computing demand response control opportunities, where computing systems regulate their power consumption in response to power provider requests, to improve the nation’s power supply efficiency and robustness while simultaneously improving the sustainability of computing.
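
As a minimal sketch of the demand response idea (with an illustrative baseline, flexibility band, and synthetic regulation signal rather than a real grid program's parameters), a data center can modulate its power cap to track the grid operator's signal:

```python
# Minimal sketch of a data center following a grid regulation signal by
# adjusting its power cap around a contracted baseline. The signal,
# baseline, and flexibility band are illustrative assumptions.
import math

BASELINE_KW = 1000.0      # average power the data center commits to draw
FLEXIBILITY_KW = 200.0    # how far it can deviate up or down
P_MIN_KW, P_MAX_KW = 600.0, 1400.0   # hard operational limits

def regulation_signal(minute):
    """Grid operator's normalized signal in [-1, 1] (synthetic stand-in)."""
    return math.sin(2 * math.pi * minute / 30)

for minute in range(0, 60, 5):
    signal = regulation_signal(minute)
    # Track the signal: raise the cap when the grid has surplus power
    # (signal > 0), lower it when the grid is short (signal < 0).
    cap = BASELINE_KW + FLEXIBILITY_KW * signal
    cap = max(P_MIN_KW, min(P_MAX_KW, cap))   # respect operational limits
    print(f"t={minute:2d} min  signal={signal:+.2f}  power cap={cap:7.1f} kW")
```

In practice the cap would be enforced by scheduling and power-management mechanisms (e.g., deferring batch jobs or capping node power), which is where the proposed control framework comes in.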