• Starts: 11:00 am on Thursday, May 30, 2024
• Ends: 1:00 pm on Thursday, May 30, 2024

ECE PhD Dissertation Defense: Anqi Guo

Title: Software and Hardware Codesign of SmartNIC-Based HPC Clusters with Machine Learning Case Studies

Presenter: Anqi Guo

Advisor: Professor Martin Herbordt

Chair: TBA

Committee: Professor Martin Herbordt, Professor Roscoe Giles, Professor Tali Moreshet, Professor Tong Geng (University of Rochester).

Google Scholar Link: https://scholar.google.com/citations?hl=en&user=hqCn2VQAAAAJ&view_op=list_works&sortby=pubdate

Abstract: Machine learning has evolved significantly in recent decades and has penetrated every aspect of science, technology, and daily life. As applications demand higher prediction accuracy and take on more complex tasks, larger models are proposed to meet these requirements. Deep learning applications such as deep learning recommendation models and large language models have grown to trillions of parameters and consume up to terabytes of memory. These models have outpaced the growth of GPU memory; GPU clusters, which aggregate GPU memory, have therefore grown exponentially to accommodate them. The memory wall exacerbates the problem: it is the point at which the demand for memory exceeds the available capacity, creating a bottleneck for training ever-larger deep learning models. Heterogeneous deep learning training has become a key approach to addressing the limitations of GPU clusters, especially as models grow in size and complexity. By combining the strengths of CPUs, GPUs, and NVMe memory, heterogeneous systems aim to reduce the required scale of GPU clusters and mitigate the memory wall by offloading model states and parameters, making it possible to train ever-larger models on limited resources. However, the performance of such heterogeneous systems is limited by the efficiency of data exchange, computation, and control.

Advanced network interface cards, known as SmartNICs, have emerged to mitigate network challenges in scale-out data centers. The placement of SmartNICs as a network-facing computational component within a node allows them to efficiently manage communication between different parts of the distributed system, offloading tasks from the central processors and reducing network traffic bottlenecks. As SmartNICs continue to evolve, they are expected to play a crucial role in enabling more scalable and efficient operations in large-scale data centers, addressing the growing demands of modern applications like machine learning and big data analytics.

In this thesis, we propose heterogeneous SmartNIC-based systems that couple software and hardware for machine learning applications. We explore the heterogeneous system design space in four steps: the practical capabilities of emerging SmartNICs, host-detached SmartNICs integrated into CPU-centric systems, facilitating SmartNICs in GPU-centric systems, and heterogeneous SmartNIC-based global control and disaggregated memory systems. Our proposal involves software-hardware codesign of SmartNIC-based systems, enhancing system performance through dynamic scheduling and control and enabling both GPUs and CPUs to focus on computation with fewer interruptions. The SmartNICs serve as an intermediary layer, breaking barriers between heterogeneous system components and facilitating seamless connectivity between GPUs and CPU offload engines. Additionally, the introduction of a caching system reduces communication workload and memory bandwidth pressure. Furthermore, SmartNICs are attached at the switch level with disaggregated memory, forming a heterogeneous global control system. This system aims to minimize system barrier and synchronization overhead while maximizing communication-computation overlap and model FLOPs utilization for higher system performance.

Location: PHO 339