ECE PhD Final Dissertation Defense: Emre Ates

<P><B>Title:</B> Automating Telemetry- and Trace-Based Analytics on Large-Scale Distributed Systems</P>
<P><B>Advisor:</B> Professor Ayse Coskun, ECE</P>
<P><B>Chair:</B> TBD</P>
<P><B>Committee:</B> Professor Manuel Egele, ECE; Professor Wenchao Li, ECE; Professor Raja Sambasivan, Tufts CS</P>
<P><B>Abstract:</B> Large-scale distributed systems, such as supercomputers, cloud computing platforms, and distributed applications, routinely suffer from slowdowns and crashes due to software and hardware problems, resulting in reduced efficiency and wasted resources. These systems typically deploy monitoring or tracing systems that gather a variety of statistics about the state of the hardware and the software. State-of-the-art methods either analyze this data manually or use a separate automated method for each specific problem. This thesis builds on the vision that generalized automated analytics methods, applied to the data sets collected from these complex computing systems, can provide critical information about the causes of problems, and that this analysis can enable proactive management to improve performance, resilience, efficiency, or security significantly beyond current limits.</P>
<P>This thesis seeks to design scalable, automated analytics methods and frameworks for large-scale distributed systems that minimize dependency on expert knowledge, reduce the time to solution, and help make systems more resilient. Besides analyzing data that is already collected from systems, our frameworks also help collect data that is useful for analytics. We focus on two data sources for conducting analytics: operating system and hardware counters, and end-to-end traces from distributed applications.</P>
<P>This thesis makes the following contributions: (1) developing a framework for accurately diagnosing previously encountered performance variations in large-scale systems, (2) designing a technique for detecting (unwanted) applications running in clusters, (3) developing a suite for reproducing performance variations in supercomputers that can be used to systematically develop analytics methods, (4) designing a method to explain predictions of black-box machine learning frameworks on large-scale systems data, and (5) developing an end-to-end tracing framework for distributed applications that can dynamically adjust instrumentation for effective diagnosis of performance problems.</P>

When: 1:00 pm to 3:00 pm on Friday, June 26, 2020