Tools and Software

AI4HPC: AI-based Scalable Analytics for Improving Performance, Resilience, and Security of HPC Systems (Website Link)

  • This web application showcases a supervised machine learning framework ([Paper Link]) designed to detect and diagnose performance anomalies in High-Performance Computing (HPC) systems. Users can engage with the application in two distinct ways:
    Sample Telemetry Data: Users can choose from a set of sample telemetry data files provided within the application. This option allows for a quick demonstration of the framework’s results.
    Upload Custom Data: Users can upload their own telemetry data, gathered through the Lightweight Distributed Metric Service (LDMS).

VAIF: Variance-driven Automated Instrumentation Framework (GitHub)

  • VAIF, also referred to as Pythia, is a variance-driven automated instrumentation framework that operates alongside distributed applications. It is designed to address performance problems by automatically searching for the appropriate instrumentation to log and diagnose the issue. The tool leverages distributed tracing and critical-path analysis to decompose the response-time variance in request traces. For more information, you can refer to our papers (2019, 2021). If you would like to contribute or report any issues, please use GitHub.
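
As a rough illustration of the variance-decomposition idea (not VAIF's actual algorithm or API), the sketch below ranks the spans on each request's critical path by how much their latency varies across requests. The trace format, span names, and the function name rank_spans_by_variance are hypothetical.

```python
from statistics import pvariance


def rank_spans_by_variance(traces):
    """Rank critical-path spans by latency variance across requests.

    traces: list of {span_name: latency_ms}, one dict per request.
    A span missing from a request is treated as 0 ms (an assumption).
    Returns [(span_name, variance), ...] with the most variable span first.
    """
    names = sorted({name for trace in traces for name in trace})
    variances = {
        name: pvariance([trace.get(name, 0.0) for trace in traces])
        for name in names
    }
    return sorted(variances.items(), key=lambda item: item[1], reverse=True)


# Hypothetical request traces: latency of each span on the critical path.
traces = [
    {"auth": 5, "db_query": 20},
    {"auth": 5, "db_query": 80},
    {"auth": 6, "db_query": 50},
]
ranked = rank_spans_by_variance(traces)
print(ranked[0][0])  # the span that would be instrumented further first
```

In this toy data the db_query span dominates the variance, so it is the natural place to add more detailed logging.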

PACT: PArallel Compact Thermal Simulator (GitHub) (User Group)

  • PACT is a SPICE-based PArallel Compact Thermal simulator that enables fast and accurate steady-state and transient parallel thermal simulation, from the standard-cell level to the architecture level. PACT exploits multi-core processing (OpenMPI) and includes several solvers to speed up both steady-state and transient simulations. PACT can be easily extended to model a variety of emerging integration and cooling technologies, such as 3D stacking and liquid cooling via microchannels, by simply modifying the thermal netlist. PACT can also be used in conjunction with popular architecture-level performance and power simulators to evaluate the thermal profiles of processors. More details can be found in the paper. Please use GitHub or the PACT Google Group for any contributions or issues.
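
To make the notion of a thermal netlist concrete, here is a minimal sketch of a steady-state solve on a tiny thermal resistance network. It is not PACT code and does not use PACT's solvers or netlist format: the node layout, resistance values, and the function names solve_linear and steady_state are assumptions chosen for illustration, and a plain Gaussian elimination stands in for a SPICE engine.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def steady_state(n_nodes, resistors, power, ambient=45.0):
    """Return node temperatures (deg C) for a thermal resistance network.

    resistors: list of (node_a, node_b, r_kelvin_per_watt);
               node_b = None connects node_a to the ambient.
    power:     heat injected at each node, in watts.
    """
    g = [[0.0] * n_nodes for _ in range(n_nodes)]  # conductance matrix
    rhs = list(power)
    for a, b, r in resistors:
        cond = 1.0 / r
        g[a][a] += cond
        if b is None:
            rhs[a] += cond * ambient  # fixed-temperature boundary
        else:
            g[b][b] += cond
            g[a][b] -= cond
            g[b][a] -= cond
    return solve_linear(g, rhs)


# Hypothetical two-node stack: die (node 0) -> spreader (node 1) -> ambient.
temps = steady_state(
    n_nodes=2,
    resistors=[(0, 1, 0.5), (1, None, 1.0)],
    power=[10.0, 0.0],
)
print(temps)  # die is about 60 deg C, spreader about 55 deg C
```

Modeling a different cooling or stacking option in this toy setting only means adding nodes and resistors, which mirrors the "modify the netlist" workflow described above.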

Iter8: Online Experimentation in the Cloud (GitHub) (Website)

  • Iter8 is an open-source system that enables practitioners to deliver code changes to cloud applications in an agile manner while minimizing risk. In Iter8, we developed a novel mathematical formulation built on online Bayesian learning and multi-armed bandit algorithms to enable online experimentation tailored for the cloud, considering both SLOs and business concerns, unlike existing solutions. Using our formulation, practitioners can safely and rapidly orchestrate various types of online experiments, gain key insights into the behavior of cloud applications, and roll out the optimal versions in an automated and statistically rigorous manner. More details can be found in the paper. Please use GitHub for any contributions or issues.
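
The combination of Bayesian learning and multi-armed bandits can be illustrated with a generic Beta-Bernoulli Thompson sampling loop. This is a textbook technique shown under simplifying assumptions, not Iter8's formulation or API: it ignores SLOs and business metrics, the outcomes are simulated, and the version names, success rates, and the function name thompson_split are hypothetical.

```python
import random


def thompson_split(true_success_rates, n_requests, seed=0):
    """Route requests among versions with Beta-Bernoulli Thompson sampling.

    true_success_rates: {version: probability a request succeeds}; used
    only to simulate outcomes, the router never reads it directly.
    Returns (requests served per version, Beta posterior per version).
    """
    rng = random.Random(seed)
    # Beta(1, 1) prior per version, stored as [alpha, beta].
    posterior = {version: [1, 1] for version in true_success_rates}
    served = {version: 0 for version in true_success_rates}
    for _ in range(n_requests):
        # Draw a plausible success rate per version; serve the best draw.
        draws = {v: rng.betavariate(a, b) for v, (a, b) in posterior.items()}
        chosen = max(draws, key=draws.get)
        served[chosen] += 1
        # Simulated request outcome updates the chosen version's posterior.
        if rng.random() < true_success_rates[chosen]:
            posterior[chosen][0] += 1
        else:
            posterior[chosen][1] += 1
    return served, posterior


served, posterior = thompson_split(
    {"baseline": 0.5, "candidate": 0.9}, n_requests=2000, seed=1
)
print(served)
```

Traffic shifts toward the better-performing version as evidence accumulates, which is what limits exposure to a poor version during the experiment.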

CoMTE: Counterfactual Explanations for Multivariate Time Series Data (GitHub)

  • CoMTE is a novel, open-source explainability technique that provides counterfactual explanations for supervised machine learning (ML) frameworks that operate on multivariate time series data. Counterfactuals help users understand why a particular sample resulted in a given prediction, and such explanations enable wider adoption in real-world settings where black-box ML methods are inadequate. More details can be found in the paper. Please use GitHub for any contributions or issues.
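
One way to picture a counterfactual for multivariate time series is as the smallest set of variables that, if replaced with their values from a sample of another class, would flip the prediction. The sketch below searches for such a set by brute force. It is a simplified illustration, not the CoMTE implementation: the toy classifier, the metric names, and the function name find_counterfactual are hypothetical, and exhaustive search is only practical for a handful of variables.

```python
from itertools import combinations


def find_counterfactual(sample, distractor, predict, target_label):
    """Find a smallest set of variables whose substitution flips the label.

    sample, distractor: {variable_name: list of values over time}.
    predict: callable mapping such a dict to a class label.
    Returns the list of substituted variable names, or None if no
    substitution yields target_label.
    """
    names = list(sample)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            candidate = dict(sample)
            for name in subset:
                candidate[name] = distractor[name]
            if predict(candidate) == target_label:
                return list(subset)
    return None


def toy_predict(series):
    """Hypothetical classifier: flags high average memory use."""
    mem = series["mem_used"]
    return "anomalous" if sum(mem) / len(mem) > 0.8 else "healthy"


sample = {"cpu": [0.5, 0.6], "mem_used": [0.9, 0.95]}      # classified anomalous
distractor = {"cpu": [0.4, 0.5], "mem_used": [0.3, 0.4]}   # a healthy run
print(find_counterfactual(sample, distractor, toy_predict, "healthy"))
```

The returned variable names read as an explanation: had those time series looked like the healthy run, the sample would have been classified as healthy.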

ACE: Just-in-time Serverless Software Component Discovery Through Approximate Concrete Execution (GitHub)

  • Approximate Concrete Execution (ACE) is a just-in-time binary analysis technique that enables automatic detection of undesirable components in executable binaries found in serverless and other cloud applications without requiring a trusted build system. With ACE, we contribute a novel method of creating function fingerprints just before serverless software is first executed: we execute an intermediate representation of the code in an approximate virtual machine and use the resulting context as the fingerprint. ACE fingerprints can then be compared to a function blocklist using simple vector distance metrics or searched for in a k-nearest-neighbor fashion. In our evaluation, we find that ACE performs these tasks with comparable accuracy while running 5.2x faster than a state-of-the-art method. More details are available in our WoSC 2020 paper. Please use GitHub for any contributions or issues.
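
The blocklist comparison step can be sketched as a nearest-neighbor lookup over fingerprint vectors. The example below is only an illustration of that matching idea: the fingerprint values, blocklist entries, distance threshold, and function names are hypothetical and are not taken from ACE, and Euclidean distance is just one possible vector distance metric.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_blocklisted(fingerprint, blocklist, k=1):
    """Return the k closest blocklist entries as (distance, name) pairs."""
    ranked = sorted(
        (euclidean(fingerprint, vector), name)
        for name, vector in blocklist.items()
    )
    return ranked[:k]


def match_blocklist(fingerprint, blocklist, threshold):
    """Return the matching blocklist name, or None if nothing is close."""
    if not blocklist:
        return None
    distance, name = nearest_blocklisted(fingerprint, blocklist, k=1)[0]
    return name if distance <= threshold else None


# Hypothetical fingerprints: one vector per known-undesirable function.
blocklist = {
    "miner_hash_loop": [4.0, 0.0, 9.0],
    "telnet_auth": [1.0, 7.0, 2.0],
}
print(match_blocklist([4.0, 0.5, 9.0], blocklist, threshold=1.0))
print(match_blocklist([10.0, 10.0, 10.0], blocklist, threshold=1.0))
```

The threshold trades false positives against missed matches; in practice it would have to be tuned on labeled fingerprints.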

Praxi: Cloud Software Discovery That Learns From Practice (GitHub)

  • Praxi is a tool for discovering software running in the cloud using machine learning. Using Praxi’s filesystem fingerprint recording tool, users can create a corpus of “changesets” representing the software they’d like to discover on their cloud systems (e.g., insecure server applications like Telnet or resource-stealing applications like cryptocurrency miners). This corpus can then be used to train a machine learning model. Users can then run Praxi as a daemon, using the trained model to predict whether current filesystem activity matches that of an application in the corpus. Thanks to Praxi’s low overhead and incremental training ability, it’s well-suited to long-term deployments on cloud systems of all sizes. More details are available in our TCC 2019 paper. Please use GitHub for any contributions or issues.

HPC Performance Anomaly Suite (HPAS) (GitHub)

  • HPAS, an HPC Performance Anomaly Suite, consists of anomaly generators for the major subsystems in HPC systems. These easy-to-use synthetic anomaly generators facilitate low-effort evaluation and comparison of various analytics methods as well as performance or resilience of applications, middleware, or systems under realistic performance variability scenarios. More details are explained in our ICPP 2019 paper. Please use GitHub for any contributions or issues.

Artifact for Taxonomist: Application Detection through Rich Monitoring Data – Best Artifact Award (Download)

  • We built a technique for detecting applications running in a supercomputer using the monitoring data that is readily available. The artifact contains the code for the machine learning technique, as well as a small dataset of monitoring data. More details are explained in our EuroPar 2018 paper. Please contact us if the download link is not accessible to you.

HotSpot Extension for Modeling 3D-Stacked Systems (Download)

  • We built an extension to HotSpot (version 5.02) to enable detailed modeling of the layers in the stack, where each layer may include blocks with a heterogeneous set of thermal resistivity and heat capacity values. More details on the extension are provided in the appendix of our DAC 2012 paper, which you may cite if you use our 3D extension.
  • The latest HotSpot (version 6.0) integrates our extension as well.