Automated Analytics for Improving Efficiency, Safety, and Security of HPC Systems

Sponsor: Sandia National Laboratories

Co-Is/Co-PIs: Manuel Egele, Brian Kulis

Abstract:

Performance variations are becoming more prominent with new generations of large-scale High Performance Computing (HPC) systems. Understanding these variations and developing resilience to anomalous performance behavior are critical challenges for reaching extreme-scale computing. To help address these challenges, there is increasing interest in designing data analytics methods that make sense of the telemetry data collected from computing systems. Existing methods, however, rely heavily on manual analysis and/or are specific to a certain type of application, system, or performance anomaly. In contrast, the aims of this project are (1) identifying information available on production HPC systems that would help in understanding performance characteristics, performance variations, inefficiencies, and anomalous behaviors indicative of software problems, component degradation, or malicious activities; (2) conducting this identification through automated techniques that can work with a broad range of systems, applications, and conditions that create performance variations; and (3) designing methods that leverage this system information to improve the efficiency, resilience, and security of HPC systems.