Beyond Overload: Rethinking Reliability in Cloud Systems
Imagine you are at a restaurant with only 10 tables, but a line of nearly 50 people out the door. You later learn that the restaurant has only 20 cups and 20 bowls, so no matter how well it manages its seating or staff, it cannot serve all the guests; it will be overwhelmed before demand is even met. This scenario aptly models the challenges cloud systems face. As demand grows rapidly, resource allocation fails to keep pace with sudden or large-scale spikes, highlighting the need for techniques that help these systems operate more efficiently.

This is where CISE Faculty Affiliate and Assistant Professor Yigong Hu’s (ECE) research comes into play. His primary focus is on building reliable, fast systems, and he has dedicated his time to developing techniques to enhance performance across machine learning systems and cloud computing platforms.
The biggest issue cloud systems face is unreliability. Software developers often spend more than half of their time debugging, and the losses from system outages can stack up to approximately $2.4 trillion. In the medical field, Professor Hu explained, "just one simple bug" in a radiation therapy machine can "send the levels 10 times higher than they should be"; in the case of the Therac-25 failures, such a bug led to the deaths of several patients.
In 2025, Professor Hu presented his team's research paper, "Mitigating Application Resource Overload with Targeted Task Cancellation," at the 31st Symposium on Operating Systems Principles. The paper examines the demands of incoming tasks against the resources allocated to them, then uses a profiler to identify improper allocations. When an overload happens, current systems typically respond by canceling incoming requests at the front door. This mechanism cannot identify which running tasks are actually monopolizing internal resources; it treats the symptom rather than the cause, often dropping innocent requests while the real culprits keep running. To address this, the team developed Atropos, a runtime overload control system that continuously monitors how each task uses internal resources and selectively cancels the ones responsible for the bottleneck. "Atropos allows all requests to first run and then it dynamically monitors resource usage before selectively canceling requests," said Professor Hu. This frees up resources for other waiting requests and restores system throughput.
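The idea can be illustrated with a minimal sketch in Python. This is a hypothetical simplification, not the actual Atropos implementation: every request is admitted, each task's consumption of a single internal resource is tracked as it runs, and on overload the heaviest consumers are canceled until usage fits within capacity. The names (`admit`, `on_usage_update`, `CAPACITY`) and the single-resource model are assumptions for illustration only.

```python
# Hypothetical sketch of targeted task cancellation (not the real Atropos).
# Admit all requests up front; track per-task usage of one internal
# resource; on overload, cancel the top consumers rather than rejecting
# new arrivals at the front door.

CAPACITY = 100  # assumed total units of the internal resource


def admit(tasks, task_id):
    """Admit every incoming request; usage starts at zero."""
    tasks[task_id] = 0


def on_usage_update(tasks, task_id, usage):
    """Record a task's current resource usage, then check for overload."""
    tasks[task_id] = usage
    return cancel_if_overloaded(tasks)


def cancel_if_overloaded(tasks):
    """While total usage exceeds capacity, cancel the heaviest consumer.

    Returns the list of canceled task IDs (empty if no overload).
    """
    canceled = []
    while sum(tasks.values()) > CAPACITY:
        culprit = max(tasks, key=tasks.get)
        canceled.append(culprit)
        del tasks[culprit]
    return canceled


if __name__ == "__main__":
    tasks = {}
    for tid in ("a", "b", "c"):
        admit(tasks, tid)
    on_usage_update(tasks, "a", 30)
    on_usage_update(tasks, "b", 20)
    # "c" balloons to 80 units: total would be 130 > 100,
    # so "c" is canceled while "a" and "b" keep running.
    print(on_usage_update(tasks, "c", 80))
```

The key contrast with front-door admission control is visible in `admit`: no request is rejected on arrival, and cancellation decisions are made only after observed usage identifies the actual culprit.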
Another profiler Professor Hu created has been conditionally accepted for presentation at the 20th USENIX Symposium on Operating Systems Design and Implementation. His research paper, "Diagnosing Performance Issues in Application-Defined Resources," presents the GiGi profiler, a tool for diagnosing performance problems that stem from the misuse of application-defined resources. The profiler aims to pinpoint the root causes of performance degradation, helping reduce latency and prevent system overloads.
As Professor Hu continues his research into resource allocation in cloud systems, he's looking to address how energy is used and wasted in code. A large part of his interest lies in teaching a machine learning system to be aware of its own energy usage and in creating a long-term solution to counteract energy waste in data centers. Hu's next iteration of this work is to build an open-source profiler that, when run, can tell engineers which parts of their code use the most resources.
Professor Yigong Hu is an Assistant Professor in the Department of Electrical and Computer Engineering at Boston University. He was previously a postdoctoral researcher at the University of Washington, and prior to that, he received his Ph.D. in Computer Science from Johns Hopkins University and his Bachelor’s degree in Computer Science from Huazhong University of Science and Technology.