BU Computer Systems (BUCS) Seminar
Join us Thursdays from 12 – 1PM for the BU Computer Systems Seminar
- In Person: Hariri Institute for Computing and Computational Science and Engineering, 665 Commonwealth Ave, Boston, MA, Room 1101 (11th floor)
- Zoom:
Meeting ID: 944 0022 9257
Passcode: 564023
- Calendar
- Join the Mailing list
- Register on Eventbrite
The BU Computer Systems Seminar seeks to bring together systems researchers in academia and industry in a forum for discussing the design, implementation, analysis, and applications of computer systems at various scales. Researchers are invited to present their own work or other significant state-of-the-art efforts in operating, distributed, and networked systems and system architectures, of the kind typically presented at conferences such as SOSP, OSDI, NDSS, USENIX Security, and NSDI.
Lunch will be provided when permitted by BU COVID guidelines.
Contacts: Vasia Kalavri, Assistant Professor, Computer Science,
John Liagouris, Assistant Professor, Computer Science
Schedule
Spring 2023
January 26 (Colloquium): Matthew Miller, Conversation with Fedora Project Leader Matthew Miller
- Location: Center for Computing & Data Sciences, 665 Commonwealth Ave, Room 1101 (11th floor)
Abstract & Bio
- Abstract: Join for a conversation about community-driven software development and the future of Linux distributions and related technology — and maybe a little reminiscing about the days of BU Linux (Boston University’s own Fedora-based distro from the early 2000s!)
- Bio: Matthew Miller is a Distinguished Engineer at Red Hat and is the leader of the Fedora Project, which creates the Fedora Linux distribution.
February 7: Bassel Mabsout, Obtaining Robustness with Reinforcement Learning in Flight Control Systems
- Location: Center for Computing & Data Sciences, 665 Commonwealth Ave, Room 1101 (11th floor)
Abstract & Bio
- Abstract:
In this talk I will mainly present the problems our research group has faced in using Reinforcement Learning (RL) for real quadrotor control, and the solutions we have contributed. I will also show how these solutions apply to a general class of robots. Starting from the work of Wil Koch, who developed Neuroflight, I will walk through the journey that took us from controllers that would burn out the drone’s motors to controllers that beat the standard methods used for such systems in both performance and power efficiency. Along the way, I will discuss the quadrotor’s embedded system, the simulation used for training, the methods used to perform RL, and the way we define robot control intent in these systems. Finally, I will present our more recent results and where we plan to take our research in the future.
- Bio:
I am Bassel Mabsout, a computer science PhD student at Boston University. My specializations include Programming Languages, Machine Learning, and Robotics. I aim to use the rigorous theoretical tools developed for programming languages to improve the robustness and compositionality of state-of-the-art machine learning methods intended for solving difficult control problems. I am presently working with Renato Mancuso (my advisor), Kate Saenko, and Siddharth Mysore on power-efficient and performant attitude control of quadrotors through Reinforcement Learning.
I am interested in: Type Theory, Metaheuristics, Category Theory, Reinforcement Learning, Agent-Based Models, Control systems, and Differentiable Computation.
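For readers unfamiliar with how control intent is typically expressed to an RL agent, the sketch below shows one generic way to combine rate-tracking error with penalties on motor-command changes and power, the kind of performance-versus-power-efficiency trade-off the abstract alludes to. It is a purely illustrative example with invented names and weights, not code from Neuroflight or the speaker's work.

```python
import numpy as np

def attitude_reward(rate_error, motor_cmd, prev_motor_cmd,
                    w_err=1.0, w_smooth=0.3, w_power=0.1):
    """Toy reward for attitude-rate tracking (illustrative only).

    rate_error:      desired minus measured angular rates (rad/s), shape (3,)
    motor_cmd:       current normalized motor commands in [0, 1], shape (4,)
    prev_motor_cmd:  previous motor commands, shape (4,)
    """
    tracking = -w_err * np.linalg.norm(rate_error)                        # follow the setpoint
    smoothness = -w_smooth * np.linalg.norm(motor_cmd - prev_motor_cmd)   # penalize oscillations that stress motors
    power = -w_power * np.sum(motor_cmd ** 2)                             # discourage wasteful thrust
    return tracking + smoothness + power

# Example usage with made-up values.
r = attitude_reward(np.array([0.1, -0.05, 0.0]),
                    np.array([0.6, 0.55, 0.58, 0.57]),
                    np.array([0.4, 0.70, 0.50, 0.65]))
print(f"reward = {r:.3f}")
```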
February 16: Zongshun Zhang, Towards Optimal Offloading of Deep Learning Tasks
- Location: Center for Computing & Data Sciences, 665 Commonwealth Ave, Room 1101 (11th floor)
Abstract & Bio
- Abstract:
Edge intelligence applications like VR/AR and surveillance have become popular given the growth in IoT and mobile devices. However, edge devices with limited capacity cannot support the requirements of increasingly large and complex deep learning (DL) models. To mitigate such challenges, researchers have proposed methods to optimize and offload partitions of DL models among user devices, edge servers, and the cloud. In this setting, each inference task sequentially passes through all partitions of a model from the user device to the cloud, and the classifier at the end of the last partition makes the corresponding predictions. A shortcoming of this method is that the intermediate data passed between partitions can be time-consuming to transmit and can reveal information about the source data. To overcome this shortcoming, recent work considers model compression, model distillation, transmission compression, and Early Exits at each partition. The goal is to trade off accuracy with computation delay, transmission delay, and privacy. In this presentation, I will summarize some of the recent developments and future directions in DL task offloading.
- Bio:
Zongshun Zhang is a fourth-year CS Ph.D. student at Boston University, advised by Professor Abraham Matta. His research interests include cloud resource orchestration and edge intelligence systems. He uses ML (neural network) techniques to enhance the performance and reduce the resource costs of neural networks deployed at the edge. Presently he is studying the tradeoff among neural network accuracy, latency, and resource cost in a hybrid cloud scenario (IaaS vs. FaaS).
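As a rough illustration of the partition-and-early-exit idea summarized in the abstract above (not the speaker's implementation), the following sketch splits a small PyTorch model into a "device" partition and a "cloud" partition, and skips offloading when an intermediate classifier is confident enough. The architecture, threshold, and names are all hypothetical.

```python
import torch
import torch.nn as nn

class PartitionedModel(nn.Module):
    """Toy two-partition model with an early exit after the on-device partition."""

    def __init__(self, threshold=0.9):
        super().__init__()
        # Partition 1: runs on the user device.
        self.device_part = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
        # Cheap classifier used as an early exit on the device.
        self.early_exit = nn.Linear(128, 10)
        # Partition 2: would run on an edge server or in the cloud.
        self.cloud_part = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
        self.threshold = threshold

    def forward(self, x):
        feat = self.device_part(x)
        early = self.early_exit(feat).softmax(dim=-1)
        conf, pred = early.max(dim=-1)
        if conf.item() >= self.threshold:
            # Confident enough: no need to transmit `feat` off the device.
            return pred, "early exit on device"
        # Otherwise the intermediate tensor (not the raw input) goes to the next partition.
        return self.cloud_part(feat).argmax(dim=-1), "offloaded to cloud"

model = PartitionedModel()
pred, path = model(torch.randn(1, 1, 28, 28))
print(int(pred), path)
```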
February 23 (Colloquium) POSTPONED: Christina Delimitrou, Designing the Next Generation Cloud Systems: An ML-Driven Approach
- Location: Center for Computing & Data Sciences, 665 Commonwealth Ave, Room 1101 (11th floor)
Abstract & Bio
- Abstract:
Cloud systems are experiencing significant shifts both in their hardware, with an increased adoption of heterogeneity, and their software, with the prevalence of microservices and serverless frameworks. These trends require fundamentally rethinking how the cloud system stack should be designed. In this talk, I will briefly describe the challenges these hardware and software trends introduce, and discuss how applying machine learning (ML) to hardware design, cluster management, and performance debugging can improve the cloud’s performance, efficiency, predictability, and ease of use. I will first discuss Dagger, a reconfigurable network accelerator for microservices that shows the advantages of tightly-coupled peripherals, and then present Seer and Sage, two performance debugging systems that leverage ML to identify and resolve the root causes of performance issues in cloud microservices. I will conclude with the open questions that remain for cloud systems, and how ML can help address them.
- Bio:
Christina Delimitrou is an Assistant Professor at MIT, where she works on computer architecture and computer systems. She focuses on improving the performance, predictability, and resource efficiency of large-scale cloud infrastructures by revisiting the way they are designed and managed. Christina is the recipient of the 2020 TCCA Young Computer Architect Award, an Intel Rising Star Award, a Microsoft Research Faculty Fellowship, an NSF CAREER Award, a Sloan Research Scholarship, two Google Faculty Research Awards, and a Facebook Faculty Research Award. Her work has also received 5 IEEE Micro Top Picks awards and several best paper awards. Before joining MIT, Christina was a professor at Cornell University, and before that, she received her PhD from Stanford University. She had previously earned an MS also from Stanford, and a diploma in Electrical and Computer Engineering from the National Technical University of Athens. More information can be found at: http://people.csail.mit.edu/delimitrou/
March 2: Mert Toslali, Efficient Navigation of Performance Unpredictability in Cloud Through Automated Analytics Systems
- Location: Center for Computing & Data Sciences, 665 Commonwealth Ave, Room 1101 (11th floor)
Abstract & Bio
- Abstract:
The cloud’s performance unpredictability is a significant obstacle to its widespread adoption and can adversely impact costs and revenue. Consequently, engineers strive to diagnose performance-related concerns and deliver high-quality software to enhance performance and meet changing demands. However, the systems employed to diagnose performance necessitate tracing all conceivable application behaviors, incurring substantial overheads. Even after performance issues have been diagnosed and addressed, the current cloud code delivery systems used by engineers lack intelligence, increasing the risk of imprecise decisions and further performance violations.
In this presentation, we will introduce automated, statistically-driven control mechanisms designed to enhance the efficiency and intelligence of diagnosis and code-delivery processes. Firstly, we will demonstrate how dynamically adjusting instrumentation using statistically-driven techniques can optimize instrumentation efficiency. This method enables precise tracing of performance issues while significantly reducing overhead costs. Secondly, we will showcase how an online learning-based approach can intelligently adjust the user traffic split among competing deployments, resulting in minimized performance violations and optimized code-delivery efficiency.
- Bio:
Mert Toslali is a fifth-year computer engineering PhD student at Boston University. His research is primarily focused on performance diagnosis and online experimentation of cloud applications. He has developed a range of automated, statistically-driven systems that are designed to enhance the performance and efficiency of cloud applications.
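The code-delivery part of the abstract describes adjusting the traffic split between competing deployments with online learning. As a generic, hedged sketch of that idea (not the system presented in the talk), the snippet below uses a simple epsilon-greedy policy that routes more traffic to whichever deployment has violated its latency SLO least often; the names, SLO, and simulated latencies are invented.

```python
import random

class TrafficSplitter:
    """Toy epsilon-greedy split between a baseline and a candidate deployment."""

    def __init__(self, arms=("baseline", "candidate"), epsilon=0.1, slo_ms=200.0):
        self.epsilon = epsilon
        self.slo_ms = slo_ms
        self.requests = {a: 0 for a in arms}
        self.violations = {a: 0 for a in arms}

    def route(self):
        # Explore occasionally, and always until every arm has been tried at least once.
        if random.random() < self.epsilon or not all(self.requests.values()):
            return random.choice(list(self.requests))
        # Exploit: pick the arm with the lowest observed violation rate.
        return min(self.requests, key=lambda a: self.violations[a] / self.requests[a])

    def record(self, arm, latency_ms):
        self.requests[arm] += 1
        if latency_ms > self.slo_ms:
            self.violations[arm] += 1

# Simulated rollout in which the candidate deployment violates its SLO more often.
splitter = TrafficSplitter()
for _ in range(1000):
    arm = splitter.route()
    latency = random.gauss(150, 40) if arm == "baseline" else random.gauss(190, 60)
    splitter.record(arm, latency)
print("requests:", splitter.requests, "violations:", splitter.violations)
```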
March 16 (Colloquium): Adam Belay, MIT, LDB: An Efficient, Full-Program Latency Profiler
- Location: Center for Computing & Data Sciences, 665 Commonwealth Ave, Room 1101 (11th floor)
Abstract & Bio
- Abstract:
Maintaining low tail latency is critical for the efficiency and performance of large-scale datacenter systems. Software bugs that cause tail latency problems, however, are notoriously difficult to debug. In this talk, I will present LDB, a new latency profiling tool that aims to overcome this challenge by precisely identifying the specific functions that are responsible for tail latency anomalies. LDB observes the latency of all functions in a running program. It uses a novel, software-only technique called stack sampling, where a busy-spinning stack scanner thread polls light-weight metadata recorded in call frames, shifting instrumentation cost away from program threads. In addition, LDB records request boundaries and inter-thread synchronization to generate per-request timelines and to find the root cause of complex tail latency problems such as lock contention in multi-threaded programs. Our results show that LDB has low overhead and can rapidly analyze recordings, making it feasible to use in production settings.
- Bio:
Adam Belay is an Associate Professor of Computer Science at the Massachusetts Institute of Technology, where he works on operating systems, runtime systems, and distributed systems. During his Ph.D. at Stanford, he developed Dune, a system that safely exposes privileged CPU instructions to userspace; and IX, a dataplane operating system that significantly accelerates I/O performance. Dr. Belay’s current research interests lie in developing systems that cut across hardware and software layers to increase datacenter efficiency and performance. He is a member of the Parallel and Distributed Operating Systems Group, and a recipient of a Google Faculty Award, a Facebook Research Award, and the Stanford Graduate Fellowship. http://abelay.me
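LDB itself is a software-only profiler whose dedicated scanner thread polls lightweight metadata in C/C++ call frames. Purely to illustrate the stack-sampling idea in a self-contained way, the Python sketch below runs a background thread that periodically samples another thread's stack and counts which function is on top; the sampling interval and workload are arbitrary, and this is not LDB's actual mechanism.

```python
import collections
import sys
import threading
import time

samples = collections.Counter()
stop = threading.Event()

def sampler(target_tid, interval=0.001):
    """Scanner thread: poll the target thread's stack and count the topmost frame."""
    while not stop.is_set():
        frame = sys._current_frames().get(target_tid)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(interval)

def slow_path():
    time.sleep(0.2)          # stands in for a tail-latency culprit

def request_handler():
    for _ in range(5):
        sum(range(100_000))  # "fast" work
        slow_path()

worker = threading.Thread(target=request_handler)
worker.start()
scanner = threading.Thread(target=sampler, args=(worker.ident,), daemon=True)
scanner.start()
worker.join()
stop.set()
print(samples.most_common(5))   # slow_path should dominate the samples
```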
March 23: Yara Awad, BU
- Location: Center for Computing & Data Sciences, 665 Commonwealth Ave, Room 1101 (11th floor)
Abstract & Bio
- Bio:
I am a PhD candidate in the Department of Computer Science at Boston University, advised by Professor Jonathan Appavoo. My main research falls under operating systems and computer architecture. My broad research agenda targets enabling self-optimization across the spectrum of computation. To that end, one of my goals is to integrate learning mechanisms effectively into a system such that trends in computation may be learned, persisted, retrieved, and shared across computing components. Toward this goal, my current work focuses on the ability to model different subsets of execution in a way that can be consistently learned from and therefore exploited for control and optimization.
- Abstract:
An overarching theme of our lab’s work is to question the ability to effect system change (i.e., control) externally: not from within the system itself, but rather from some control space that is external to the system. We propose that such exogenous control relies on the ability to externally observe relevant system state, learn useful trends as that state mutates, and ultimately discover control policies that can mutate that state in some desired fashion. Our lab’s current work focuses on exogenous energy-aware performance control of a system. In this talk, I will describe the general context of energy-aware performance control and the prior results that motivated our current research direction. These results provide evidence that for some pre-defined network software stack, execution can proceed with optimal energy efficiency when carefully controlled by two hardware-level mechanisms: NIC-level interrupt coalescing, and CPU-level frequency and voltage scaling. Our long-term goals are to 1) prove an innate correlation between these two control settings and the energy profile of a software stack, and 2) exploit this correlation in the face of a real and dynamic network. Our immediate goal is to tackle the constraints posed by the complex nature of network/system interaction, which may limit the utility of an exogenous control entity. To that end, I will present different ongoing experiments that we hope can help us answer some open questions: 1) how can a large search space of control decisions be exploited, 2) how can control respond to a dynamic network, and 3) how can control be software-agnostic (i.e., truly external)? We propose that when these system-level problems are carefully modeled, machine learning algorithms can help answer these questions.
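To make the notion of exogenous control concrete, here is a deliberately simplified, hypothetical sketch (not the group's system) in which an external controller sweeps a small grid of CPU-frequency and NIC interrupt-coalescing settings and keeps the most energy-efficient pair that still meets a latency target. The measure() function stands in for running the workload and reading real energy and latency sensors; every value is invented.

```python
import itertools
import random

FREQS_GHZ = [1.2, 1.8, 2.4, 3.0]     # DVFS settings (hypothetical)
COALESCE_US = [0, 50, 100, 200]      # NIC interrupt-coalescing delays (hypothetical)

def measure(freq, coalesce_us):
    """Stand-in for running the workload and reading energy/latency sensors."""
    latency_ms = 2.0 + 3.0 / freq + coalesce_us / 100.0 + random.uniform(0, 0.2)
    energy_j = 0.5 * freq ** 2 + 1.0 / (1 + coalesce_us / 50.0)
    return latency_ms, energy_j

def exogenous_search(slo_ms=5.0):
    """Keep the lowest-energy setting pair that still meets the latency target."""
    best = None
    for freq, coal in itertools.product(FREQS_GHZ, COALESCE_US):
        latency, energy = measure(freq, coal)
        if latency <= slo_ms and (best is None or energy < best[0]):
            best = (energy, freq, coal)
    return best

best = exogenous_search()
if best:
    energy, freq, coal = best
    print(f"chosen: {freq} GHz, coalescing {coal} us, ~{energy:.2f} J")
```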
March 30 (Colloquium): Dionisio de Niz, Carnegie Mellon University: Mixed-Trust Real-Time Computation
- Location: Center for Computing & Data Sciences, 665 Commonwealth Ave, Room 365 (3rd floor)
Abstract & Bio
- Bio:
Dionisio de Niz is a Principal Researcher and the Technical Director of the Assuring Cyber-Physical Systems directorate at the Software Engineering Institute at Carnegie Mellon University. He received a Master of Science in Information Networking and a Ph.D. in Electrical and Computer Engineering, both from Carnegie Mellon University. His research interests include Cyber-Physical Systems (CPS), Real-Time Systems, Model-Based Engineering (MBE), and Security of CPS. In the real-time arena he has focused on multicore processors and mixed-criticality scheduling and, more recently, on real-time mixed-trust computing. In MBE, he has focused on the symbolic integration of analysis using analysis contracts. Dr. de Niz co-edited and co-authored the book “Cyber-Physical Systems”, in which the authors discuss different application areas of CPS and the foundational domains including real-time scheduling, logical verification, and CPS security. He has participated in and/or helped organize multiple workshops with industry on real-time multicore systems (two co-sponsored by the FAA and three by different services of the US military) and Safety Assurance of Nuclear Energy. He is a member of the executive committee of the IEEE Technical Committee on Real-Time Systems. Dr. de Niz regularly serves on the technical program committees of real-time systems conferences such as RTSS, RTAS, and RTCSA, where he also publishes a large part of his work.
- Abstract:
Certification authorities (e.g., FAA) allow the validation of different parts of a system with different degrees of rigor depending on their level of criticality. Formal methods have been recognized as important to verify safety-critical components. Unfortunately, a verified property can be easily compromised if the verified components are not protected from misbehaviors of the unverified ones (e.g., due to bugs). Thus, trust requires that both verification and protection of components are jointly considered.
A key challenge to building trust is that the underlying operating systems (OSs) that implement protection mechanisms are so complex that they are extremely hard (if even possible) to verify thoroughly. Thus, there has been a trend to minimize the trusted computing base (TCB) by developing small verified hypervisors (HVs) and microkernels, e.g., seL4, CertiKOS, and uberXMHF. In these systems, trusted and untrusted components co-exist on a single hardware platform but in a completely isolated and disjoint manner. We thus call this approach disjoint-trust computing. The fundamental limitation of disjoint-trust computing is that it does not allow the use of untrusted components in critical functionality whose safety must be assured through verification.
In this talk, we present the real-time mixed-trust computing (RT-MTC) framework. Unlike disjoint-trust computing, it offers the flexibility to use untrusted components even for CPS critical functionality. In this framework, untrusted components are monitored by verified components, ensuring that the output of the untrusted components always leads to safe states (e.g., avoiding crashes). These monitoring components are known as logical enforcers. To ensure trust, these enforcers are protected by a verified micro-hypervisor. To preserve the timing guarantees of the system, RT-MTC uses temporal enforcers, which are small, self-contained code blocks that perform a default safety action (e.g., hover in a quadrotor) if the untrusted component has not produced a correct output by a specified time. Temporal enforcers are contained within the verified micro-hypervisor. Our framework incorporates two schedulers: (i) a preemptive fixed-priority scheduler in the VM to run the untrusted components and (ii) a non-preemptive fixed-priority scheduler within the HV to run trusted components. To verify the timing correctness of safety-critical applications in our mixed-trust framework, we develop a new task model and schedulability analysis. We also present the design and implementation of a coordination protocol between the two schedulers to preserve the synchronization between the trusted and untrusted components while preventing dependencies that can compromise the trusted component.
Finally, we discuss the extension of this framework for trusted mode degradation. While a number of real-time modal models have been proposed, they fail to address the challenges presented here in at least two important respects. First, previous models consider mode transitions as simple task parameter changes without taking into account the computation required by the transition and the synchronization between the modes and the transition. Second, previous work does not address the challenges imposed by the need to preserve safety guarantees during the transition. Our work addresses these issues by extending the RT-MTC framework to include degradation modes and creating a schedulability model based on the digraph model that supports this extension.
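As a highly simplified, software-only analogue of the temporal enforcer described above (the real mechanism lives inside a verified micro-hypervisor with its own non-preemptive scheduler), the sketch below runs an untrusted controller under a deadline and substitutes a default safe action when the deadline is missed; the names, timings, and outputs are invented for illustration.

```python
import concurrent.futures
import random
import time

def untrusted_controller():
    """Untrusted component: may be late or buggy."""
    time.sleep(random.choice([0.01, 0.2]))   # sometimes misses its deadline
    return {"thrust": 0.62, "attitude": (0.0, 0.1, 0.0)}

def safe_default():
    """Temporal enforcer's default action, e.g. hover in place."""
    return {"thrust": 0.5, "attitude": (0.0, 0.0, 0.0)}

def control_step(deadline_s=0.05):
    """Accept the untrusted output only if it arrives before the deadline."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(untrusted_controller)
        try:
            return future.result(timeout=deadline_s), "untrusted output accepted"
        except concurrent.futures.TimeoutError:
            return safe_default(), "deadline missed: enforcer applied safe action"

for _ in range(3):
    cmd, why = control_step()
    print(cmd, "--", why)
```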
Past Events
Fall 2022
- December 9: William Moses, Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation
Abstract: Automatic differentiation (AD) is key to training neural networks, Bayesian inference, and scientific computing. Applying these techniques requires rewriting code in a specific machine learning framework or manually providing derivatives. This talk presents Enzyme, a high-performance automatic differentiation compiler plugin for the low-level virtual machine (LLVM) compiler capable of synthesizing gradients of programs expressed in the LLVM intermediate representation (IR). Enzyme differentiates programs in any language whose compiler targets LLVM, including C/C++, Fortran, Julia, Rust, Swift, etc., thereby providing native AD capabilities in these languages with state-of-the-art performance. Unlike traditional tools, Enzyme performs AD on optimized IR. On a combined machine-learning and scientific computing benchmark suite, AD on optimized IR achieves a geometric mean speedup of 4.2x over AD on IR before optimization.
This talk will also include work that makes Enzyme the first fully automatic reverse-mode AD tool to generate gradients of existing GPU kernels. This includes new GPU- and AD-specific compiler optimizations, and an algorithm ensuring correctness of high-performance parallel gradient computations. We provide a detailed evaluation of five GPU-based HPC applications, executed on NVIDIA and AMD GPUs.
Bio: William Moses is a Ph.D. Candidate at MIT, where he also received his M.Eng in electrical engineering and computer science (EECS) and B.S. in EECS and physics. William’s research involves creating compilers and program representations that enable performance and use-case portability, thus enabling non-experts to leverage the latest in high-performance computing and ML. He is known as the lead developer of Enzyme (NeurIPS ’20, SC ’21, SC ’22), an automatic differentiation tool for LLVM capable of differentiating code in a variety of languages, after optimization, and for a variety of architectures, and the lead developer of Polygeist (PACT ’21, PPoPP ’23), a polyhedral compiler and C++ frontend for MLIR. He has also worked on the Tensor Comprehensions framework for synthesizing high-performance GPU kernels of ML code, the Tapir compiler for parallel programs (best paper at PPoPP ’17), and compilers that use machine learning to better optimize (AutoPhase/TransformLLVM). He is a recipient of the U.S. Department of Energy Computational Science Graduate Fellowship and the Karl Taylor Compton Prize, MIT’s highest student award.
- November 18: Jingyu Su, Secure Cross-Site Analytics on Openshift Logs
Abstract: Openshift is a Kubernetes platform developed by Red Hat to manage clients’ containerized applications. When an incident happens at a client’s cluster and the client asks for support, Red Hat engineers need access to runtime logs or monitoring metrics to locate root causes. However, obtaining access to logs isn’t easy, since the log entries may contain sensitive or proprietary data. The current practice involves a series of steps, including legal actions, before the engineers can obtain the logs, and it is hard to generalize since the obtained logs belong to a single client. Secrecy is a secure Multi-Party Computation (MPC) platform that requires low computational overhead while providing strong security guarantees; it can therefore be used to analyze logs across multiple clients. This presentation covers my work on applying Secrecy to secure cross-site analytics on Openshift logs, along with ongoing work and future perspectives.
Bio: Jingyu Su received a BS degree in Electronics and Computer Engineering, with a minor in Data Science, from Shanghai Jiaotong University in Shanghai, China. He is now a second-year Master’s student in Computer Science at Boston University. His interests lie in distributed systems, streaming, and databases.
- November 4: Red Hat Collaboratory Student Projects
Speaker: Xiteng Yao
Project Title: Practical Programming of FPGAs with Open-Source Tools: Optimizing High-level Synthesis Using Machine Learning
Mentor: Martin Herbordt, Professor, College of Engineering
Abstract: A fundamental problem in Computer Engineering is the difficulty in creating applications that are simultaneously programmable, performant, and portable (PPP). This is especially challenging when the target is hardware, rather than software, as for Field Programmable Gate Arrays (FPGAs). FPGAs are a type of processor making rapid inroads in datacenters; in FPGAs, the hardware is mapped to the application rather than vice versa. This project addresses the PPP problem for FPGAs by applying supervised and reinforcement learning to the compilation process.
High-level synthesis (HLS) tools convert standard programs to implementations that can run on FPGAs. These tools are vastly complex, applying hundreds of different methods (passes) to improve performance. What is to be learned is the optimal application of these methods, both in general and with respect to particular applications. The problems that have been addressed include generating data to be used for training and creating a framework to use the training data to train the machine learning model.
The project has tested different models on a set of programs. The results show that reinforcement learning is better suited to the problem: code optimized by reinforcement learning models can yield better performance than code optimized with the traditional approach. However, depending on the strategy used for optimization, the learning speed, performance potential, and speedup vary significantly. As a result, it is vital to choose the right strategy when using machine learning to optimize HLS.
Speaker: Quan Pham
Project Title: Real Time Quality Assurance
Mentor: Gianluca Stringhini, Assistant Professor, College of Engineering
Abstract: The goal of RTQA is to develop plugins for the Jupyter Notebook development platform capable of performing code analysis in order to provide developers with real-time feedback such as identifying performance bottlenecks, exploitable code, or outdated imported modules. The project also aims to create a foundational framework from which these and future plugins can be integrated into the Jupyter Lab plugin architecture. My role in the project has been to integrate a plugin which implements Praxi, a software discovery algorithm. Specifically, I partially completed a data pipeline to automate the generation and storage of datasets needed to train the algorithm.
- October 28: Alan Liu, Unleashing the Algorithmic Power of Approximate Computing Systems
Abstract:
Today’s computing systems, such as big data, network, and streaming analytics systems, face increasing demands on performance, reliability, and energy efficiency. In the last few decades, rapidly evolving microprocessors have largely fulfilled these demands. However, with the slow-down of Moore’s Law, existing data processing systems are ill-suited for analyzing large-scale, dynamic data and face key underlying algorithmic challenges. In this talk, I will present my research on scaling data systems with approximation techniques for dynamic connected data processing. I will discuss the efficient algorithms and implementations that enable mining complex structures in large-scale graph data. Finally, I will describe how bridging theory and practice with probabilistic and sampling theory can significantly speed up computations without specialized hardware.
Bio:
Alan (Zaoxing) Liu is an Assistant Professor in ECE at Boston University. Liu’s research spans computer systems, networks, and applied algorithms to co-design solutions across the computing stack, with a focus on systems and algorithmic design for telemetry, big-data analytics, and security. He is a recipient of the best paper award at FAST’19 and has received interdisciplinary recognitions such as ACM STOC “Best-of-Theory” and USENIX ATC “Best-of-Rest”.
- October 21: Anton Njavro, A DPU Solution to Container Overlay Networks
Abstract:
There is an increasing demand to incorporate hybrid environments as part of workflows across edge, cloud, and HPC systems. In such a converging environment of cloud and HPC, containers are starting to play a more prominent role, bringing their networking infrastructure along with them. However, the current body of work shows that container overlay networks, which are often used to connect containers across physical hosts, are ill-suited for the HPC environment. They tend to impose significant overhead and noise, resulting in degraded performance and disturbance to co-processes on the same host. This work focuses on utilizing a novel class of hardware, the Data Processing Unit (DPU), to offload the networking stack of overlay networks away from the host onto the DPU. We intend to show that such ancillary offload is possible and that it will result in decreased overhead on host nodes, which in turn will improve the performance of running processes.
- October 7: Nikolai Merkel, Automatic Graph Partitioner Selection to Optimize Distributed Graph Processing, and Jana Vatter, An Introduction to (Distributed) Systems for Graph Neural Networks
Automatic Graph Partitioner Selection to Optimize Distributed Graph Processing
Abstract:
For distributed graph processing on massive graphs, a graph is partitioned into multiple equally sized parts which are distributed among machines in a compute cluster. In the last decade, many partitioning algorithms have been developed which differ from each other with respect to the partitioning quality, the run-time of the partitioning, and the type of graph for which they work best. The plethora of graph partitioning algorithms makes it a challenging task to select a partitioner for a given scenario. In this talk we present a machine learning-based approach for automatic partitioner selection.
Bio:
Nikolai Merkel is a PhD student at the Technical University of Munich (TUM). He received an M.Sc. in Information Systems from TUM with a focus on distributed systems. His research interests are in improving the performance of large-scale Graph Processing and Graph Neural Network systems and Graph Partitioning.
An Introduction to (Distributed) Systems for Graph Neural Networks
Abstract:
Graph Neural Networks (GNNs) are a special type of machine learning algorithm capable of processing graph-structured data. As graphs are all around us, for instance in social networks, traffic grids, or molecular structures, GNNs have a broad field of applications. With the ever-growing size of real-world graph data, the need for large-scale GNN training solutions has emerged. Specialized systems have been developed to distribute and parallelize the GNN training process. This talk will first give an introduction to GNNs in general and then present hand-picked methods to optimize and distribute the GNN training process. This includes, for instance, partitioning, sampling, storage, and scheduling techniques.
Bio:
I’m a first-year PhD student at the Technical University of Munich (TUM) and received my M.Sc. in Computer Science from the Technical University of Darmstadt in 2021. During my graduate studies, I worked as a student assistant at the Ubiquitous Knowledge Processing (UKP) lab and as a teaching assistant at the Interactive Graphics Systems Group (TU Darmstadt). Currently, I’m working on (distributed) systems for Graph Neural Networks. My research interests include (dynamic) Graph Neural Networks, large-scale Deep Learning, and Distributed Systems.
- September 30: Daniel Wilson, Application-Aware HPC Power Management
Abstract:
Data centers use a lot of energy and continue to increase their demand as systems and workloads grow. Power management software enables a data center to guide power in order to achieve goals like reducing energy costs or improving performance. While a high-level power manager in a data center has visibility into system-wide metrics such as total power usage, a job-level power manager is able to dynamically respond to application-specific relationships between performance and power. Power managers may improve their energy-efficiency opportunities by leveraging knowledge from multiple power management levels to form site-wide, application-aware power management policies, but some challenges remain in putting such cross-layer solutions into practice. This talk covers my work on single-level and multi-level power management in HPC systems, along with practical challenges.
Bio:
Daniel Wilson received BS degrees in Computer Science and Computer Engineering from NC State University in Raleigh, North Carolina. He is working toward a PhD degree in Computer Engineering at Boston University. Prior to his current studies, Daniel worked at NetApp and Itron. He works as an intern at Intel while pursuing his PhD. His current research interests include energy-aware computing and systemwide optimization.
Spring 2022
- May 13: Ruoyu “Fish” Wang, 30 Years into Scientific Binary Decompilation: What We Have Achieved and What We Need to Do Next, Abstract and Bio.
- May 6: Tommy Unger, Abstraction, Programmability and Optimization.
- April 29: Marco Serafini, Graph data systems: transactions, mining, and learning, Abstract and Bio.
- April 22: Anam Farruk, FlyOS: Integrated Modular Avionics for Autonomous Multicopters, Abstract and Bio.
- April 15: Burak Aksar (Boston University), Diagnosing Performance Anomalies in HPC Systems. Abstract and Bio.
- April 8: Golsani Ghaemi, The Memory Wedding Problem, Abstract and Bio.
- April 1: Mania Abdi, Customization of general-purpose cloud platforms. Abstract and Bio.
- March 25: Ari Trachtenberg, Autopsy of a Scientific Death? Automated Exposure Notification for COVID-19. Abstract and Bio.
- March 18: Vasia Kalavri (Boston University), Open discussion. Abstract and Bio.
- March 4: Renato Mancuso, From Memory Partitioning to Management through Fine-grained Profiling and Control. Abstract and Bio.
- February 25: PhD Student lightning talks II. Abstract and Bio.
- February 18: PhD Student lightning talks I. Abstract and Bio.
- February 11: Han Dong (Boston University), Slowing Down for Performance and Energy: Building An OS-Centric Model of Network Applications, Abstract and Bio.
- February 4: Dan Schatzberg (Meta, formerly Facebook), IOCost: Block I/O Control for Containers in Datacenter, Abstract and Bio
Fall 2021
- December 17: Ali Raza (Orran Krieger’s group), Unikernel Linux, Abstract and Bio
- December 10: Alan (Zaoxing) Liu, Can Sketch-based Telemetry be Ready for Prime Time?, Abstract and Bio
- December 3: John Liagouris, Secrecy: Secure collaborative analytics on secret-shared data, Abstract and Bio
- November 19: Anthony Byrne (BU – Ayse Coskun’s group), MicroFaaS: Energy-efficient Serverless on Bare-metal Single-board Computers, Abstract and Bio
- October 29: Novak Boskov (BU – Ari Trachtenberg’s group), GenSync: A Unified Framework for Optimizing Data Reconciliation, Abstract and Bio
- October 22: Mert Toslali (BU – Ayse Coskun’s group), Iter8: Online Experimentation in the Cloud, Abstract and Bio
- October 15: Udit Gupta (Harvard & Facebook), Designing Specialized Systems for Deep Learning-based Personalized Recommendation, Abstract and Bio
- October 8: Ari Trachtenberg (BU), Some big problems of simple systems, Abstract and Bio