{"id":43452,"date":"2026-04-24T11:46:08","date_gmt":"2026-04-24T15:46:08","guid":{"rendered":"https:\/\/www.bu.edu\/cise\/?p=43452"},"modified":"2026-04-24T11:46:08","modified_gmt":"2026-04-24T15:46:08","slug":"beyond-overload-rethinking-reliability-in-cloud-systems","status":"publish","type":"post","link":"https:\/\/www.bu.edu\/cise\/beyond-overload-rethinking-reliability-in-cloud-systems\/","title":{"rendered":"Beyond Overload: Rethinking Reliability in Cloud Systems"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Imagine you are at a restaurant, and you see that there are only 10 tables available, but there\u2019s a line of almost 50 people out the door. You later learn that the restaurant has only 20 cups and 20 bowls and cannot efficiently serve all the guests, no matter how well it manages its seating or staff. They will become overwhelmed before demand is even met. This scenario aptly models the challenges faced in cloud systems. As system demands rapidly increase, resource allocation has failed to meet sudden or large-scale demands, highlighting the need for a solution to help these systems operate more efficiently.\u00a0<\/span><\/p>\n<p><img loading=\"lazy\" src=\"\/cise\/files\/2026\/04\/AdobeStock_486877969-636x424.jpeg\" alt=\"\" width=\"341\" height=\"227\" class=\"alignright wp-image-43458 \" srcset=\"https:\/\/www.bu.edu\/cise\/files\/2026\/04\/AdobeStock_486877969-636x424.jpeg 636w, https:\/\/www.bu.edu\/cise\/files\/2026\/04\/AdobeStock_486877969-1024x683.jpeg 1024w, https:\/\/www.bu.edu\/cise\/files\/2026\/04\/AdobeStock_486877969-768x512.jpeg 768w, https:\/\/www.bu.edu\/cise\/files\/2026\/04\/AdobeStock_486877969-1536x1024.jpeg 1536w, https:\/\/www.bu.edu\/cise\/files\/2026\/04\/AdobeStock_486877969-2048x1366.jpeg 2048w\" sizes=\"(max-width: 341px) 100vw, 341px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">This is where CISE Faculty Affiliate and Assistant Professor <\/span><a 
href=\"https:\/\/www.bu.edu\/cise\/profile\/yigong-hu\/\"><span style=\"font-weight: 400;\">Yigong Hu\u2019s<\/span><\/a><span style=\"font-weight: 400;\"> (ECE) research comes into play. His primary focus is on building reliable, fast systems, and he has dedicated his time to developing techniques to enhance performance across machine learning systems and cloud computing platforms. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The biggest issue cloud systems face is reliability. <\/span><a href=\"https:\/\/newsroom.cisco.com\/c\/r\/newsroom\/en\/us\/a\/y2024\/m05\/developers-spending-more-time-firefighting-issues-than-delivering-innovation.html\"><span style=\"font-weight: 400;\">Software<\/span><\/a><span style=\"font-weight: 400;\"> developers often spend more than half of their time debugging systems, and the losses from system shutdowns can add up to approximately <\/span><a href=\"https:\/\/www.it-cisq.org\/the-cost-of-poor-quality-software-in-the-us-a-2022-report\/\"><span style=\"font-weight: 400;\">$2.4 trillion<\/span><\/a><span style=\"font-weight: 400;\">. In the medical field, Professor Hu explained, \u201cjust one simple bug\u201d in a radiation therapy machine can \u201csend the levels 10 times higher than they should be,\u201d which, in the case of the Therac-25 system failure, led to the deaths of a handful of patients.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In 2025, Professor Hu presented his team\u2019s research paper, <\/span><a href=\"https:\/\/yigonghu.github.io\/paper\/sosp25-atropos.pdf\"><span style=\"font-weight: 400;\">\u201cMitigating Application Resource Overload with Targeted Task Cancellation,\u201d<\/span><\/a><span style=\"font-weight: 400;\"> at the 31st Symposium on Operating Systems Principles. The paper examines the demands of incoming tasks and the allocated resources, then uses its profiler to mitigate improper allocations. 
When an overload happens, current systems typically respond by canceling incoming requests at the front door. This mechanism does not identify which running tasks are actually monopolizing internal resources, treating the symptom rather than the cause \u2013 often dropping innocent requests while the real culprits continue running. To address this, the team developed Atropos, a runtime overload control system that continuously monitors how each task uses internal resources and selectively cancels the ones responsible for the bottleneck. \u201cAtropos allows all requests to first run and then it dynamically monitors resource usage before selectively canceling requests,\u201d said Professor Hu. This frees up resources for other waiting requests and restores system throughput.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A second paper from Professor Hu has been conditionally accepted for presentation at the 20th USENIX Symposium on Operating Systems Design and Implementation. The paper, <\/span><a href=\"https:\/\/drive.google.com\/file\/d\/1nRacn3XkQPtFkSLEgl-358KE_7-0qZFw\/view?usp=share_link\"><span style=\"font-weight: 400;\">\u201cDiagnosing Performance Issues in Application-Defined Resources,\u201d<\/span><\/a><span style=\"font-weight: 400;\"> presents the GiGi profiler, a tool for diagnosing performance problems that stem from the misuse of application resources. The profiler aims to pinpoint the root causes of performance degradation, reducing latency and mitigating system overloads.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As Professor Hu continues his research into resource allocation in cloud systems, he\u2019s looking to address how energy is used and wasted in code. A large part of his interest centers on teaching a machine learning system to be aware of its energy usage and creating a long-term solution to counteract energy waste in data centers. 
Hu\u2019s next iteration of this work is to build an open-source profiler that, when run, can tell engineers which parts of their code use the most resources.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><img loading=\"lazy\" src=\"\/cise\/files\/2026\/02\/Screenshot-2026-02-26-at-3.17.22\u202fPM-577x636.png\" alt=\"\" width=\"131\" height=\"144\" class=\"wp-image-43292 alignleft\" srcset=\"https:\/\/www.bu.edu\/cise\/files\/2026\/02\/Screenshot-2026-02-26-at-3.17.22\u202fPM-577x636.png 577w, https:\/\/www.bu.edu\/cise\/files\/2026\/02\/Screenshot-2026-02-26-at-3.17.22\u202fPM-768x847.png 768w, https:\/\/www.bu.edu\/cise\/files\/2026\/02\/Screenshot-2026-02-26-at-3.17.22\u202fPM.png 816w\" sizes=\"(max-width: 131px) 100vw, 131px\" \/>Professor Yigong Hu is an Assistant Professor in the Department of Electrical and Computer Engineering at Boston University. He was previously a postdoctoral researcher at the University of Washington, and prior to that, he received his Ph.D. in Computer Science from Johns Hopkins University and his Bachelor\u2019s degree in Computer Science from Huazhong University of Science and Technology. <\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Imagine you are at a restaurant, and you see that there are only 10 tables available, but there\u2019s a line of almost 50 people out the door. 
You later learn that the restaurant has only 20 cups and 20 bowls and cannot efficiently serve all the guests, no matter how well it manages its seating [&hellip;]<\/p>\n","protected":false},"author":25166,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[127,204],"tags":[],"_links":{"self":[{"href":"https:\/\/www.bu.edu\/cise\/wp-json\/wp\/v2\/posts\/43452"}],"collection":[{"href":"https:\/\/www.bu.edu\/cise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.bu.edu\/cise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/cise\/wp-json\/wp\/v2\/users\/25166"}],"replies":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/cise\/wp-json\/wp\/v2\/comments?post=43452"}],"version-history":[{"count":19,"href":"https:\/\/www.bu.edu\/cise\/wp-json\/wp\/v2\/posts\/43452\/revisions"}],"predecessor-version":[{"id":43476,"href":"https:\/\/www.bu.edu\/cise\/wp-json\/wp\/v2\/posts\/43452\/revisions\/43476"}],"wp:attachment":[{"href":"https:\/\/www.bu.edu\/cise\/wp-json\/wp\/v2\/media?parent=43452"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.bu.edu\/cise\/wp-json\/wp\/v2\/categories?post=43452"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.bu.edu\/cise\/wp-json\/wp\/v2\/tags?post=43452"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}