- Starts: 2:30 pm on Friday, February 6, 2026
- Ends: 4:00 pm on Friday, February 6, 2026
ECE PhD Prospectus Defense: Syed Mohammad Qasim
Title: Towards Practical, Scalable and User-Friendly Performance Debugging in Large-Scale Distributed Systems
Presenter: Syed Mohammad Qasim
Advisor: Professor Ayse K. Coskun
Co-Advisor: Professor Gianluca Stringhini
Chair: Professor Gianluca Stringhini
Committee: Professor Ayse K. Coskun, Professor Gianluca Stringhini, Professor Yigong Hu, Dr. Ata Turk
Google Scholar Profile: https://scholar.google.com/citations?user=HGHBhz0AAAAJ&hl=en
Abstract: Large-scale distributed systems, such as cloud platforms and microservice applications, frequently experience performance degradation and outages caused by software, configuration, or hardware issues. These systems generate massive volumes of metrics, logs, and distributed system traces (DSTs), but manually analyzing this telemetry is infeasible, and existing tools leveraging machine learning and statistical methods remain fragmented, problem-specific, and difficult to use. Recent advances in large language models (LLMs) have made interacting with computers easier and offer an opportunity to make debugging more generalized and user-friendly. This thesis argues that integrating machine learning with reasoning-driven LLMs enables practical, scalable, and user-friendly debugging of performance issues in large-scale distributed systems.
This thesis seeks to design a user-friendly, scalable, and cost-efficient debugging framework for distributed systems, with the goal of improving the resilience of large-scale platforms. To this end, we first conduct a systematic evaluation of open-source LLMs for DST analysis. Our results show that LLMs struggle to classify and localize latency issues even in small benchmark applications; however, combining span-level statistics with reasoning-based LLMs improves anomaly localization accuracy from 50% to over 74%. We also observe that LLMs perform better on DSTs represented in YAML format than in JSON. These findings highlight current limitations of LLMs while revealing several open research questions. Building on this work, we plan to investigate why certain DST representations are easier for LLMs to understand and to explore ways of improving DST analysis, for example, through richer natural-language context. Based on these insights, we also plan to propose a general LLM-orchestrated debugging framework that autonomously invokes analytic tools, synthesizes insights across metrics and traces, and communicates root causes in natural language.
Lastly, we extend automated analysis to resource-usage metrics by designing an anomaly detection framework for widely used OpenStack services on the Chameleon Cloud testbed. We release the first dataset of resource-usage metrics for these services and show that training on only a few days of healthy data can effectively balance training cost and detection accuracy, enabling cost-optimized anomaly detection.
Together, these contributions lay the foundation for self-explaining, adaptive debugging systems that combine machine learning and large language models to support scalable and accessible cloud debugging.
- Location: PHO 339
