ECE PhD Dissertation Defense: Mert Toslali

  • Starts: 10:00 am on Friday, April 7, 2023
  • Ends: 11:30 am on Friday, April 7, 2023


Presenter: Mert Toslali

Advisor: Professor Ayse K. Coskun

Chair: TBA

Committee: Professor Orran Krieger, Professor Alan (Zaoxing) Liu, Dr. Fabio A. Oliveira

Abstract: Performance unpredictability of the cloud hinders its widespread adoption and adversely impacts costs and revenue. To mitigate this challenge, cloud systems typically incorporate monitoring and tracing mechanisms to collect a diverse set of metrics on applications' state to facilitate the analysis of performance fluctuations. Drawing on this collected data, engineers devote considerable effort to diagnosing performance issues and expediting the delivery of superior-quality software to enhance performance, aligning with changing demands.

To capture unanticipated performance problems, engineers use state-of-the-art diagnostic systems to trace and record all potential behaviors of cloud applications. However, this comprehensive, detailed tracing incurs considerable storage, computation, and network overheads. Moreover, after diagnosing and resolving performance issues, engineers rely on ad-hoc gradual deployment approaches to deliver code on the cloud. Unfortunately, these delivery systems lack the statistical sophistication needed to accurately assess and compare application versions, potentially leading to further performance problems.

This thesis argues that integrating automated, statistically-driven methods is imperative to achieve substantial improvements in efficiency when diagnosing application performance and delivering new software in the cloud. This vision has the potential to enable efficient and proactive performance management beyond the state-of-the-art by reducing time, effort, and cost spent on diagnosing and updating cloud applications. To support this vision, the thesis makes two specific contributions. First, we demonstrate that dynamically adjusting instrumentation using statistically-driven techniques significantly enhances diagnosis efficiency. Our distributed tracing approach enables accurate tracing of sources of performance issues using only 3-34% of the available tracing instrumentation. Second, we demonstrate an online learning-based approach that intelligently adjusts the user traffic split among competing deployments, substantially improving code delivery efficiency. Our online experimentation approach reduces performance violations by directing user traffic to the optimal deployment 93% of the time during code delivery.
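For intuition, the traffic-splitting idea in the second contribution can be loosely illustrated with a generic multi-armed-bandit sketch (Thompson sampling). This is an assumption-laden illustration, not the thesis's actual algorithm; the `ok_rates` parameter is a hypothetical per-deployment probability that a request meets its performance objective, used only to simulate feedback:

```python
import random

def route_requests(n_requests, ok_rates, seed=0):
    """Split traffic among competing deployments via Thompson sampling.

    ok_rates: hypothetical ground-truth probability, per deployment, that a
    request meets its performance objective (simulation only).
    Returns how many requests each deployment received.
    """
    rng = random.Random(seed)
    k = len(ok_rates)
    # Beta(1, 1) priors over each deployment's success probability.
    alpha = [1] * k
    beta = [1] * k
    counts = [0] * k
    for _ in range(n_requests):
        # Sample a plausible success rate for each deployment and send
        # the request to the one that currently looks best.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        best = max(range(k), key=lambda i: samples[i])
        counts[best] += 1
        # Observe whether the request met its objective; update the posterior.
        if rng.random() < ok_rates[best]:
            alpha[best] += 1
        else:
            beta[best] += 1
    return counts
```

In such a scheme, traffic concentrates on the better-performing deployment as evidence accumulates, which mirrors the abstract's goal of directing most user traffic to the optimal deployment during code delivery.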

Location: PHO 339