- Starts: 3:00 pm on Monday, March 17, 2025
- Ends: 4:30 pm on Monday, March 17, 2025
ECE PhD Prospectus Defense: Efe Sencan
Title: Scalable and Robust Machine Learning Frameworks for Anomaly Detection in Large Scale Computing Systems
Presenter: Efe Sencan
Advisors: Prof. Ayse K. Coskun & Prof. Brian Kulis
Chair: Prof. Wenchao Li
Committee: Prof. Ayse K. Coskun, Prof. Brian Kulis, Prof. Manuel Egele, Prof. Wenchao Li
Google Scholar Link: https://scholar.google.com/citations?user=RwrhVIcAAAAJ&hl=tr&oi=ao
Abstract: High-Performance Computing (HPC) systems are integral to scientific and technological advancements, supporting applications ranging from climate modeling to financial forecasting. These systems often face operational inefficiencies due to performance variations caused by factors such as network contention, hardware malfunctions, software bugs, and resource contention. These issues result in energy wastage, reduced computational efficiency, and significantly increased operational costs. Machine learning (ML) has emerged as a promising solution for automating anomaly detection and diagnosis in these systems. However, several challenges hinder its widespread application: the scarcity of labeled data, which is essential for training ML models in production environments; the difficulty of integrating anomaly detection frameworks seamlessly into existing monitoring tools; and the susceptibility of anomaly detection models to performance degradation when training datasets include anomalous samples. If not properly addressed, such contaminated datasets can undermine the accuracy and reliability of ML-based anomaly detection, exacerbating inefficiencies and operational risks.
This thesis argues that scalable and robust ML frameworks are essential for improving anomaly detection in HPC systems. The research focuses on three thrusts: (1) reducing reliance on labeled data to enable deployment in real-world environments, (2) integrating ML models into production monitoring systems for scalable and efficient anomaly detection, and (3) enhancing model robustness to contamination in training datasets.
This work makes four contributions to HPC anomaly detection and performance management: (1) leveraging active learning to reduce labeled data requirements, (2) developing scalable unsupervised anomaly detection frameworks, (3) introducing iterative refinement strategies to mitigate the effects of contaminated training data, and (4) applying ML to diagnose performance bottlenecks in GPU-accelerated applications.
- Location: PHO 339, 8 St Mary's St.