AI-based Scalable Analytics for Improving Performance, Resilience, and Security of HPC Systems

Sponsor: Department of Energy (DOE) via Sandia National Laboratories

PI: Ayse K. Coskun

Co-Is/Co-PIs: Manuel Egele, Brian Kulis

Abstract:

Next-generation large-scale High Performance Computing (HPC) systems face important cost and scalability challenges due to anomalous system and application behavior, which results in wasted compute cycles, and the ever-growing difficulty of system management. There is increasing interest in the HPC community in using AI-based frameworks to tackle analytics and management problems in HPC so as to improve decision making and automation. However, several common challenges prevent such frameworks from being deployed easily on production systems. One such difficulty arises because many of these techniques use machine learning methods that require ample training data, which is difficult and costly to obtain. Another challenge is the limited scalability and feasibility of many of the proposed analytics methods. These challenges are exacerbated as the hardware in HPC systems evolves to incorporate heterogeneous resources such as Graphics Processing Units (GPUs). The overarching goal of this project is to design scalable AI-based frameworks to automatically diagnose and mitigate performance anomalies in HPC systems.
