• Starts: 10:00 am on Friday, September 22, 2023
• Ends: 12:00 pm on Friday, September 22, 2023

Title: Machine Learning-based Performance Analytics for High-Performance Computing Systems

Presenter: Burak Aksar

Advisor: Professor Ayse K. Coskun

Chair: Professor Janusz Konrad

Committee: Professor Ayse K. Coskun, Professor Manuel Egele, Professor Brian Kulis, Professor Wenchao Li

Google Scholar Link: https://scholar.google.com/citations?user=XWEar80AAAAJ&hl=en

Abstract: HPC systems play pivotal roles in societal and scientific advancements, executing up to quintillions of calculations every second. As we shift towards exascale computing and beyond, modern HPC systems emphasize resource sharing, where various applications share processors, memory, networks, and other components. While this sharing enhances power efficiency, it complicates performance prediction and introduces significant variations in application running times, affecting overall system efficiency and operational costs.

HPC systems utilize monitoring frameworks that gather numerical telemetry data on resource usage to track operational status. Given the complexity and volume of this data, manual analysis is daunting and inefficient. Machine learning (ML) techniques offer automated performance anomaly diagnosis, but the transition from successful research outcomes to production-scale deployment encounters two critical obstacles. First, the scarcity of labeled training data (i.e., runs identified as healthy or anomalous) in telemetry datasets makes it hard to train these ML systems effectively. Second, the runtime analysis required for timely detection and diagnosis of performance anomalies demands seamless integration of ML-based methods with the monitoring frameworks.

This thesis claims that ML-based performance analytics frameworks that leverage a limited amount of labeled data can match the performance of a fully supervised framework while meeting the requirements of deployment in production HPC systems. To support this claim, we undertake ML-based performance analytics on two fronts. First, we design and develop novel frameworks for anomaly diagnosis that leverage semi-supervised or unsupervised learning techniques to reduce the need for extensive labeled data. Second, we design a simple yet adaptable architecture to enable rapid deployment and demonstrate that these frameworks are feasible for real-world evaluation.

Location: PHO 339