Coskun and Team Will Collaborate with Sandia Labs on Applying AI to HPC

By Caroline Amato

Professors Ayse Coskun, Manuel Egele, and Brian Kulis in ECE have received a $500K grant from Sandia National Labs for their project “AI-based Scalable Analytics for Improving Performance, Resilience, and Security of HPC Systems”.

HPC stands for high-performance computing: the practice of aggregating computing power so that a large system delivers a level of performance (e.g., exascale) that an ordinary computer cannot. HPC makes it possible to solve large and difficult problems in fields such as engineering, healthcare, finance, and many others. The HPC research community has recently grown interested in using AI-based frameworks to address the challenges HPC systems face, such as optimizing performance, cost, and size. At the same time, AI has challenges of its own that make it difficult to apply in the real world at scale.

The goal of Coskun and her team is to design scalable AI-based frameworks that diagnose performance problems in HPC systems. Performance anomaly diagnosis is challenging because HPC system telemetry data is big (easily larger than TB/node/day), the systems are large, with thousands of nodes, and the applications running on them are complex, with many interacting parallel threads. The BU team will develop new AI-based frameworks that are suited for practical deployment on production systems. To achieve this goal, the proposed AI frameworks will not rely on extensive labeled data sets, which are difficult to acquire; they will be applicable to cutting-edge hardware, including supercomputers with GPUs; and they will require minimal intervention from human administrators.

Professor Ayse Coskun earned her PhD from the University of California, San Diego in 2009 and is a professor of ECE. She received the NSF CAREER Award in 2012, the IEEE CEDA Ernest Kuh Early Career Award in 2017, and several best paper awards and nominations. She serves as an Associate Editor for IEEE Transactions on Computer Aided Design and IEEE Transactions on Computers.

Manuel Egele is an Associate Professor of ECE and earned his PhD from Vienna University of Technology in 2011. He has earned many recognitions for his work, including the Distinguished Paper Award from the NDSS in 2011 and the NSF CAREER Award in 2020.

Brian Kulis is also an Associate Professor of ECE and earned his PhD from the University of Texas at Austin in 2008. In 2015, he received both the Peter J. Levine Career Development Professorship and the NSF CAREER Award.

Burak Aksar, a fourth-year PhD student at BU ECE advised by Prof. Coskun, will work on the project as well. Burak has been applying machine learning to HPC performance analytics problems and has published papers at venues such as Euro-Par and ISC-HPC. He recently completed internships at Sandia Labs and IBM Research.