Research on Large-Scale Computing Systems Analytics and Optimization


Automated Analytics for Improving Efficiency, Safety, and Security of HPC Systems

Funding: Sandia National Laboratories

Performance variations are becoming more prominent with new generations of large-scale HPC systems. Understanding these variations and developing resilience to anomalous performance behavior are critical challenges for reaching extreme-scale computing. To help address these challenges, there is increasing interest in designing data analytics methods that make sense of the telemetry data collected from computing systems. Existing methods, however, rely heavily on manual analysis and/or are specific to a particular type of application, system, or performance anomaly. In contrast, the aims of this project are to (1) identify the information available on production HPC systems that helps explain performance characteristics, performance variations, inefficiencies, and anomalous behaviors indicative of software problems, component degradation, or malicious activity; (2) perform this identification through automated techniques that work across a broad range of systems, applications, and conditions that create performance variations; and (3) design methods that leverage this system information to improve the efficiency, resilience, and security of HPC systems.
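
As a minimal sketch of this style of automated telemetry analytics (the metrics, window length, and classifier choice below are illustrative assumptions, not the project's actual pipeline), per-node telemetry windows can be summarized into statistical features and fed to a supervised model that flags anomalous behavior:

```python
# Hypothetical sketch: classify telemetry windows as healthy vs. anomalous.
# The metrics, window size, and RandomForest choice are illustrative
# assumptions, not the framework developed in this project.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(window):
    """Summarize one telemetry window (samples x metrics) into statistics."""
    return np.concatenate([window.mean(axis=0),
                           window.std(axis=0),
                           np.percentile(window, 95, axis=0)])

rng = np.random.default_rng(0)
# Synthetic stand-in for labeled telemetry: 200 windows, 60 samples, 8 metrics
# (e.g., CPU utilization, memory bandwidth, network counters).
windows = rng.normal(size=(200, 60, 8))
labels = rng.integers(0, 2, size=200)      # 0 = healthy, 1 = anomalous (synthetic)

X = np.stack([window_features(w) for w in windows])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

new_window = rng.normal(size=(60, 8))
print("anomaly predicted:", bool(clf.predict(window_features(new_window)[None, :])[0]))
```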


Scalable and Explainable Machine Learning Analytics for Understanding HPC Systems

Funding: Sandia National Laboratories

The goal of this project is to design scalable and explainable analytics methods to diagnose performance anomalies in high-performance computing (HPC) systems, helping sustain the performance and efficiency increases needed to achieve exascale computing and beyond. Specific tasks include (1) designing and building techniques for training a performance analysis framework and making sufficiently accurate predictions with less data, and (2) investigating the integration of existing methods and the design of new methods to substantially improve the explainability of the performance analytics framework's decision-making process.
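
As a minimal sketch of the explainability angle (feature names, model choice, and the small synthetic training set below are illustrative assumptions), a tree-based classifier trained on a modest labeled subset can report which telemetry features most influenced its decision, rather than returning an opaque anomaly score:

```python
# Hypothetical sketch: rank telemetry features by how strongly they drive an
# anomaly classifier's decisions. Feature names and model are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["cpu_util_mean", "mem_bw_std", "net_retrans_p95",
                 "ipc_mean", "llc_miss_std", "io_wait_p95"]

rng = np.random.default_rng(1)
X_small = rng.normal(size=(80, len(feature_names)))   # small labeled set
y_small = rng.integers(0, 2, size=80)                 # synthetic labels

model = GradientBoostingClassifier(random_state=0).fit(X_small, y_small)

# Report feature importances so an analyst can see *why* a run was flagged.
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:18s} {imp:.3f}")
```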


A Just-in-Time, Cross-Layer Instrumentation Framework for Diagnosing Performance Problems in Distributed Applications

Funding: NSF and Red Hat; Collaboration: Tufts University

Diagnosing performance problems in distributed applications is extremely challenging and time-consuming. A significant reason is that it is hard to know a priori where to enable instrumentation that will help diagnose problems occurring far in the future. In this work, we aim to create an instrumentation framework that automatically searches the space of possible instrumentation choices to enable the instrumentation needed to diagnose a newly observed problem. Our prototype, called Pythia, builds on workflow-centric tracing, which records the order and timing of how requests are processed within and among a distributed application's nodes (i.e., records their workflows). Pythia uses the key insight that localizing the sources of high performance variation within the workflows of requests that are expected to perform similarly reveals where additional instrumentation is needed.
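
A minimal sketch of that key insight (the workflow signatures, latencies, and grouping scheme below are illustrative, not Pythia's actual implementation) is to group requests that are expected to perform similarly, find the group with the highest latency variation, and enable finer-grained instrumentation there:

```python
# Hypothetical sketch of Pythia's key insight, not its actual implementation:
# requests with the same workflow signature should perform similarly, so high
# latency variance within a group localizes where instrumentation is needed.
from collections import defaultdict
from statistics import pvariance

# Each trace: (workflow_signature, end_to_end_latency_ms). Synthetic examples.
traces = [("read/small", 12.1), ("read/small", 11.8), ("read/small", 48.7),
          ("write/large", 90.2), ("write/large", 91.0), ("write/large", 89.5)]

latencies = defaultdict(list)
for signature, latency in traces:
    latencies[signature].append(latency)

worst = max(latencies, key=lambda s: pvariance(latencies[s]))
print("enable finer-grained instrumentation for workflow:", worst)
```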
 
Project website


Scalable Software and System Analytics for a Secure and Resilient Cloud

Funding: IBM Research

The overarching goal of this project is to help automate cloud operation and achieve better security, resilience, and efficiency. Specifically, we aim to identify known software instances and versions using automated and scalable techniques, detect software configurations that lead to failures or significant performance degradation, and design integrated systems analytics and software analysis methods to determine the contents of VMs, containers, or application binaries. Towards these aims, this research designs machine-learning-based systems approaches for software discovery in the cloud, configuration analytics, binary analysis, and microservices analytics. All of the proposed methods are demonstrated on VMs, containers, or microservices running on real-world cloud systems.
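
One possible building block for software discovery, sketched below under stated assumptions (the catalog contents, digest, and filesystem path are placeholders, and a real system would learn and scale this matching rather than hard-code it), is to fingerprint the files in a VM or container filesystem and match them against known package fingerprints:

```python
# Hypothetical sketch of a software-discovery building block: hash files in a
# container/VM filesystem and match them against a catalog of known package
# fingerprints. Catalog entries and the root path are placeholders.
import hashlib
from pathlib import Path

# Toy catalog mapping file digests to (package, version). The digest below is
# a placeholder and will not match real files.
CATALOG = {
    "0" * 64: ("exampled", "1.2.3"),
}

def discover(root: str):
    found = {}
    root_path = Path(root)
    if not root_path.is_dir():          # nothing to scan
        return found
    for path in root_path.rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in CATALOG:
                found[str(path)] = CATALOG[digest]
    return found

print(discover("/tmp/example_image_rootfs"))  # e.g. {'.../bin/exampled': ('exampled', '1.2.3')}
```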

 

Research on Designing Future Energy-Efficient Computing Systems


Reclaiming Dark Silicon via 2.5D Integrated Systems with Silicon-Photonic Networks

Funding: NSF; Collaboration: CEA-LETI, France

The design of today’s leading-edge systems is fraught with power, thermal, variability, and reliability challenges. As a society, we are increasingly relying on a variety of rapidly evolving computing domains, such as cloud, internet-of-things, and high-performance computing. The applications in these domains exhibit significant diversity and require an increasing number of threads and much larger data transfers compared to applications of the past. Moreover, power and thermal constraints limit the number of transistors that can be used simultaneously, which has led to the Dark Silicon problem. Meeting the demands of next-generation applications requires exploring novel design and management approaches that allow computing nodes to operate close to their peak capacity. This project uses 2.5D integration technology with silicon-photonic networks to build heterogeneous computing systems that can provide the parallelism, heterogeneity, and network bandwidth needed to handle these demands. To this end, we investigate the complex cross-layer interactions among devices, architecture, applications, and their power/thermal characteristics, and we design a systematic framework to accurately evaluate and harness the true potential of 2.5D integration with silicon-photonic networks. Specific research tasks focus on cross-layer design automation tools and methods, including pathfinding enablement, for the design and management of 2.5D integrated systems with silicon-photonic networks.
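
To give a flavor of the kind of first-order cross-layer check such an evaluation framework automates, the sketch below tests whether a candidate 2.5D configuration meets an application's bandwidth demand within a package power budget; all numbers (chiplet power, link energy per bit, budgets) are illustrative assumptions rather than measured or project-specific values:

```python
# Hypothetical first-order cross-layer check for a 2.5D configuration with a
# silicon-photonic network. All parameter values are illustrative assumptions.
def evaluate_2_5d_config(num_chiplets, chiplet_power_w, link_bw_gbps,
                         link_energy_pj_per_bit, demand_gbps, power_budget_w):
    compute_power = num_chiplets * chiplet_power_w
    # Photonic network power scales with the traffic actually carried.
    network_power = demand_gbps * 1e9 * link_energy_pj_per_bit * 1e-12
    total_power = compute_power + network_power
    return {"total_power_w": total_power,
            "meets_bandwidth": num_chiplets * link_bw_gbps >= demand_gbps,
            "within_budget": total_power <= power_budget_w}

print(evaluate_2_5d_config(num_chiplets=8, chiplet_power_w=15.0,
                           link_bw_gbps=256, link_energy_pj_per_bit=1.0,
                           demand_gbps=1500, power_budget_w=160))
```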


Managing Thermal Integrity in Monolithic 3D Integrated Systems

Funding: NSF; Collaboration: Stony Brook University and CEA-LETI, France

The integrated circuit (IC) research community has witnessed highly encouraging developments in reliably fabricating monolithic three-dimensional (Mono3D) chips. Unlike through-silicon via (TSV) based vertical integration, where multiple wafers are thinned, aligned, and bonded, Mono3D ICs form multiple device tiers on a single substrate through a sequential fabrication process. The vertical interconnects, referred to as monolithic inter-tier vias (MIVs), are orders of magnitude smaller than TSVs (nanometers vs. micrometers), enabling unprecedented integration density with superior power and performance characteristics. This fine-grained connectivity is particularly important now that modern transistors have reached sub-10 nm dimensions. Despite the growing interest in various aspects of Mono3D technology, a reliable framework for ensuring thermal integrity in dense Mono3D systems does not yet exist. This research fills that gap, with its primary emphasis on leveraging Mono3D-specific characteristics during both efficient thermal analysis and temperature optimization. Our objective is to facilitate future progress on both the design and fabrication aspects of Mono3D technology by developing a comprehensive framework for managing thermal issues. The results of this research will provide a better understanding of the unique thermal characteristics of Mono3D ICs and help mitigate thermal issues through efficient analysis and optimization.
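
As a minimal sketch of compact thermal analysis for a two-tier stack (a simple 1D thermal-resistance chain; the resistance and power values are illustrative assumptions, not characterized Mono3D parameters), the tier farther from the heat sink runs hotter because its heat must also cross the thin inter-tier layer:

```python
# Hypothetical sketch: 1D steady-state thermal-resistance chain for a two-tier
# Mono3D stack. All values are illustrative assumptions.
T_AMBIENT_C = 45.0      # ambient/coolant temperature at the heat sink
R_SINK = 0.20           # K/W, heat sink + spreader to ambient
R_INTER_TIER = 0.05     # K/W, thin inter-tier dielectric (crossed by MIVs)

P_TOP_TIER = 40.0       # W dissipated in the tier adjacent to the sink
P_BOTTOM_TIER = 25.0    # W dissipated in the tier farther from the sink

# All heat leaves through the sink; the bottom tier's heat also crosses the
# inter-tier layer, so it runs hotter than the top tier.
T_top = T_AMBIENT_C + (P_TOP_TIER + P_BOTTOM_TIER) * R_SINK
T_bottom = T_top + P_BOTTOM_TIER * R_INTER_TIER

print(f"top tier:    {T_top:.1f} C")
print(f"bottom tier: {T_bottom:.1f} C")
```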


Modeling the Next-Generation Hybrid Cooling Systems for High-Performance Processors

Funding: NSF; Collaboration: MIT and Brown University

The design of future high-performance chips is hindered by severe temperature challenges. Existing cooling mechanisms are not equipped to efficiently cool the power densities expected in exascale systems, which reach hundreds to several thousand watts per square centimeter. Several highly efficient emerging cooling technologies are being developed by thermomechanical engineers; however, these technologies are not easily accessible to computer engineers for experimentation, preventing them from co-designing and optimizing aggressive processor architectures together with the cooling subsystem. To close this gap, this project develops a software infrastructure that enables accurate modeling of cutting-edge cooling methods and facilitates mutually customizing the computing and cooling systems to push well beyond the performance per watt achievable in today’s systems. Specific tasks include: (1) synthesizing novel physical device-level models into compact representations, (2) validating the proposed models using measurements on prototypes and detailed simulators, and (3) developing the automation tooling needed to design and optimize hybrid, customized cooling subsystems together with a given target computing system.
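
To illustrate the kind of compact representation such an infrastructure could expose (a textbook-style microchannel cold plate model; every parameter value below is an illustrative assumption, not a validated model of any specific cooler), junction temperature can be estimated from conduction, convection, and coolant heat-up resistances:

```python
# Hypothetical compact model of a microchannel cold plate. All parameter
# values are illustrative assumptions.
def microchannel_junction_temp(power_w, t_inlet_c,
                               r_conduction=0.02,     # K/W, die + TIM + base
                               h=50_000.0,            # W/(m^2 K), convection coefficient
                               wetted_area_m2=4e-4,   # effective channel surface area
                               flow_rate_kg_s=0.01,   # coolant mass flow
                               cp=4186.0):            # J/(kg K), water
    r_convection = 1.0 / (h * wetted_area_m2)
    r_fluid = 1.0 / (2.0 * flow_rate_kg_s * cp)   # mean coolant temperature rise
    return t_inlet_c + power_w * (r_conduction + r_convection + r_fluid)

# Example: 300 W over a 1 cm^2 hotspot (300 W/cm^2) with 25 C inlet water.
print(f"{microchannel_junction_temp(300.0, 25.0):.1f} C")
```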


Sustainable IT and IT for Sustainability

Funding: BU College of Engineering Dean’s Catalyst Award

The computing ecosystem continues to grow at a breakneck pace and consumes a substantial portion of the world’s electricity. Currently, the vast majority of electricity production comes from fossil fuels, which is unsustainable in the long term and has a tremendous environmental impact. There is growing motivation to integrate renewables; however, the volatility of renewables creates new challenges for power grid operators, who need to dynamically balance electricity supply and demand. Wouldn’t it be appealing if computing, whose growth is contributing to increased electricity demand, could emerge as a major enabler of increased electricity generation from renewables? This would also make the growth of computing systems sustainable. This project aims to develop a framework for making such a vision a reality, particularly by integrating large data centers (HPC clusters, grid engines, or other data centers) into emerging smart grid programs. We propose a collaborative and distributed control framework for the computing sector that helps stabilize the grid while providing power cost incentives for data centers. Specifically, this project seeks to create computing demand-response control opportunities, in which computing systems regulate their power consumption in response to power provider requests, improving the nation’s power supply efficiency and robustness while simultaneously improving the sustainability of computing.
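
A minimal sketch of the demand-response idea (the baseline, bounds, regulation signal values, and the apply_power_cap hook below are illustrative assumptions, not the proposed control framework) is for the data center to adjust its cluster-wide power cap to track a regulation signal from the power provider while staying within limits its workloads can tolerate:

```python
# Hypothetical sketch: track a grid regulation signal by adjusting the
# cluster-wide power cap within tolerable bounds. All values are illustrative.
P_BASELINE_KW = 500.0               # contracted average consumption
P_MIN_KW, P_MAX_KW = 350.0, 650.0   # range the jobs/QoS can tolerate
REGULATION_CAPACITY_KW = 150.0      # flexibility offered to the grid operator

def target_power(reg_signal):
    """reg_signal in [-1, 1]: -1 asks the site to shed load, +1 to absorb it."""
    target = P_BASELINE_KW + reg_signal * REGULATION_CAPACITY_KW
    return min(max(target, P_MIN_KW), P_MAX_KW)

def apply_power_cap(kw):
    # Placeholder for the actual enforcement mechanism (e.g., node power caps,
    # job throttling, or deferring batch work).
    print(f"setting cluster power cap to {kw:.0f} kW")

for signal in (-0.8, 0.0, 0.5, 1.0):   # sample regulation signal over time
    apply_power_cap(target_power(signal))
```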