Making the Cloud More Reliable: How LLMs Can Improve Latency and Reliability
When a car is booked in a rideshare app, a seemingly simple request triggers thousands of messages across a distributed system, all of which must be handled with near-perfect reliability. Distributed systems are the invisible engine of today’s economy. Their full integration into our daily routines makes their reliability essential. But what can we do when these systems fail?
Syed Mohammad Qasim, a PhD student advised by CISE Director and Professor Ayse K. Coskun, aims to address the challenge of debugging in complex systems. Building on the need for dependable cloud infrastructure, Qasim’s recent research, “Limitations of Large Language Models in Analyzing Distributed System Traces,” explores whether large language models (LLMs) can effectively analyze a distributed system request chain. According to Qasim, the research investigates whether an LLM can determine if a “transaction was successful, anomalous, late, or normal, and if it possessed any particularly interesting characteristics.”
Frameworks like Jaeger capture Distributed System Traces (DSTs), but engineers often have to analyze them individually, leading to backlogs and delays. Identifying a specific bottleneck or anomalous traffic pattern among thousands of requests can be challenging.
A single transaction can contain thousands of logged events, and companies like eBay can collect upwards of 150 billion traces per day. When application code can change hourly, it is nearly impossible for humans to manually search the data to find an issue. To keep volumes manageable, companies often apply aggressive sampling to filter DSTs, which can also omit critical errors.
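To make the scale concrete, a DST can be thought of as a tree of timed spans, one per operation in the request chain. The sketch below, loosely modeled on Jaeger’s span format with invented operation names and timings, shows how even a tiny trace encodes where a request spent its time:

```python
# A minimal, hypothetical sketch of a distributed system trace (DST),
# loosely modeled on Jaeger's span format. Real traces carry far more
# metadata (tags, logs, process info); all names and timings are invented.
trace = [
    {"spanID": "a1", "operation": "POST /book_ride", "parent": None, "start_us": 0,       "duration_us": 420_000},
    {"spanID": "b2", "operation": "auth.verify",     "parent": "a1", "start_us": 5_000,   "duration_us": 30_000},
    {"spanID": "c3", "operation": "pricing.quote",   "parent": "a1", "start_us": 40_000,  "duration_us": 60_000},
    {"spanID": "d4", "operation": "driver.match",    "parent": "a1", "start_us": 105_000, "duration_us": 300_000},
]

def slowest_span(spans):
    """Return the child span that dominates the request's latency."""
    children = [s for s in spans if s["parent"] is not None]
    return max(children, key=lambda s: s["duration_us"])

print(slowest_span(trace)["operation"])  # prints "driver.match"
```

Finding the dominant span in one trace is trivial; the engineering challenge the research targets is doing this kind of analysis across billions of traces whose structure keeps changing.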
A few deep learning techniques are currently in use. Models such as TraceAnomaly and CRISP can identify irregular traces. However, while these models can detect patterns in the traces, they lack the ability to explain why the irregularities occur.
“LLMs provide a natural way to talk to systems or any computer. The entry barrier to interacting with an LLM is very low,” Qasim explained. Allowing operators to interact with the system in natural language makes debugging distributed systems significantly faster and more intuitive. Additionally, LLMs not only help identify issues but can also suggest specific fixes, providing a level of guidance that traditional automated methods lack.
While existing frameworks alert operators at all hours, sometimes waking them in the middle of the night, a framework utilizing LLMs could streamline the entire debugging process. Such a system would first attempt to debug and resolve smaller, easily fixable bugs autonomously before escalating more complex issues to human engineers. This approach would significantly improve both system reliability and the operators’ experience.
Looking ahead, Qasim aims to develop a comprehensive framework to help LLMs identify and fix performance bugs in cloud systems. His research has revealed that while standalone LLMs struggle to classify DSTs as “anomalous” or “healthy” on their own, they perform much better at localizing performance issues when provided with minimal pre-processing and aggregate statistics.
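The paper does not specify the exact pre-processing format, but the idea of condensing raw traces into aggregate statistics before prompting an LLM can be sketched as follows. The operation names, timings, and summary layout here are illustrative assumptions:

```python
from statistics import mean

# Hypothetical per-operation latency samples (ms) pooled from many traces;
# the two large values under driver.match are a planted performance anomaly.
durations_ms = {
    "auth.verify":   [12, 14, 13, 15, 12],
    "pricing.quote": [40, 42, 41, 39, 44],
    "driver.match":  [180, 2100, 190, 175, 2300],
}

def summarize(samples):
    """Condense raw span durations into compact per-operation statistics."""
    lines = []
    for op, vals in samples.items():
        lines.append(f"{op}: n={len(vals)} mean={mean(vals):.0f}ms max={max(vals)}ms")
    return "\n".join(lines)

# A summary like this, rather than raw traces, would be placed in the LLM's
# prompt alongside a question such as "which operation looks anomalous?"
print(summarize(durations_ms))
```

This kind of summary shrinks the input by orders of magnitude while preserving the signal (here, driver.match’s max far exceeding its mean) that lets the model localize the issue.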
In his next steps, Qasim is exploring fine-tuning and specialized DST format representations to create an even more robust framework for analyzing performance issues. In deployment, this agent’s analysis would inform the system’s efforts to proactively improve cloud reliability.
The potential reach of this research is vast, spanning from fintech platforms processing thousands of transactions per second to telemedicine systems where delays are unacceptable. By ensuring cloud-dependent services remain resilient against bugs, this research helps protect the modern digital experience. Qasim’s goal is to ensure these services remain fast and reliable, even under the pressure of millions of concurrent requests.
At the most recent iteration of the CISE Graduate Student Workshop (CGSW 12.0) hosted by the Center for Information and Systems Engineering (CISE), Syed Mohammad Qasim presented this research and won the workshop’s Best Presenter Award.
Syed Mohammad Qasim earned his bachelor’s degree in Computer Engineering from Aligarh Muslim University. He is currently a student in Professor Ayse Coskun’s PEACLab, where he researches end-to-end tracing, cloud computing, and distributed systems. Before pursuing his PhD, Syed served as a Manager of Platform Engineering at State Street, where he helped establish their API management and DataRobot platforms. Additionally, he worked as an Application Developer at ThoughtWorks, contributing to the development of the VAKT platform.