Older Projects

Software Optimization for Green Computing

This project develops inexpensive, widely applicable methods for generating Green Software that reduces the total cost of computing while achieving high performance and reliability. The specific aims are:

(1) Designing mechanisms for creating software variations that are plausibly optimal with respect to performance, energy, and temperature. Some of these are based on existing methods of optimizing for performance such as code transformations and autotuning. Case studies include production applications from document/media processing, scientific computing, and bioinformatics.

(2) Designing consolidation methods to manage resource sharing of software on computing clusters for enabling energy proportional computing. The methods include explicit consideration of the temperature effects and cooling as well as the equipment and energy costs.


Energy and Thermal Management of Manycore Systems

Single-chip multicore systems have become increasingly attractive in recent years because of their potential to provide higher throughput per watt than single-core systems. It has not been possible, however, to achieve the projected peak performance due to high power densities, prohibitive cooling costs, thermal gradients, suboptimal resource utilization under dynamically varying workloads, and limited memory bandwidth.

The main objective of this project is to develop a suite of run-time management techniques for manycore systems by jointly exploring key contributors to system performance and power: reconfigurable network architectures, workload scheduling policies, and microarchitectural resource management. We utilize an integrated hardware-software approach to dynamically monitor system behavior and to enforce intelligent decisions for improving performance, energy efficiency, and thermal behavior. In addition to simulation and emulation tools, we run experiments on Intel’s 48-core Single-Chip Cloud Computer (SCC) and other commercial servers as part of our research.


Architecture-Level Reliability Simulation for CMPs

Accurate evaluation of thermal and reliability management policies on chip multiprocessors (CMPs) requires a new simulation framework that can capture architecture-level effects over tens of seconds or longer, while also capturing thermal interactions among cores resulting from dynamic scheduling policies. Using a new framework based on phase analysis of applications and a set of new thermal management policies, this work shows that techniques that offer similar performance, energy, and even peak temperature can differ significantly in their effects on the expected processor lifetime.


Thermal Management of 3D Stack Architectures

Chip cross-sectional power density increases with the number of vertically stacked circuit layers in a 3D architecture. This increase exacerbates temperature-related reliability, performance, and design challenges. 3D integration complicates the implementation of dynamic thermal management techniques because of the heat transfer between vertically adjacent units and the heterogeneous cooling efficiencies of different layers (e.g., the components closer to the heat sink cool down more easily than those farther away). Therefore, traditional 2D thermal management policies are not sufficient to optimize the temperature profile of multicore 3D systems. In this work, we first investigate how the existing policies for dynamic thermal management handle the thermal hot spots and temperature gradients in 3D systems. We then propose a low-overhead policy for temperature-aware job allocation in 3D architectures. The new policy takes the thermal history of the processing cores and the 3D system characteristics into account to balance the temperature and reduce the frequency of hot spots. We evaluate the management policies on various 2- and 4-tier 3D systems, whose design is based on an extension of the UltraSPARC T1 processor.
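In the spirit of that policy (the scoring weights and the exponentially weighted average below are invented for illustration, not the published algorithm), a history- and layer-aware allocator can be sketched as:

```python
def allocate_job(history, layer, alpha=0.7, layer_penalty=2.0):
    """Pick the core with the lowest score, combining an exponentially
    weighted average of its recent temperatures with a penalty for
    sitting farther from the heat sink (layer 0 = closest, cools best).
    The weights here are illustrative assumptions."""
    def score(core):
        ewma = history[core][0]
        for t in history[core][1:]:
            ewma = alpha * ewma + (1 - alpha) * t
        return ewma + layer_penalty * layer[core]
    return min(history, key=score)
```

With equal thermal histories, the core closer to the heat sink wins; a sufficiently hot history on that core shifts the choice to a cooler core on a deeper layer.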


Proactive Thermal Management

Conventional thermal management techniques are reactive in nature; that is, they take action only after the temperature reaches a predetermined threshold. Such approaches do not always minimize and balance on-chip temperature, and furthermore, they control temperature at a noticeable performance cost. In this work, we investigate using an autoregressive moving average (ARMA) predictor to forecast future temperature, and we propose a proactive thread allocation technique for multiprocessor systems. When implemented in the Solaris kernel on an UltraSPARC T1 chip, our proactive technique achieved, on average, a 60% reduction in hot spot occurrences, an 80% reduction in spatial gradients, and a 75% reduction in thermal cycles in comparison to reactive management.
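As a rough illustration of the forecasting idea (not the kernel implementation), the sketch below fits only the autoregressive part of such a model, an AR(1) with intercept, by least squares and rolls it forward; the actual work used full ARMA models with online validation of the predictor:

```python
def fit_ar1(temps):
    """Least-squares fit of T[t] = a * T[t-1] + b over an observed window."""
    x, y = temps[:-1], temps[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    b = my - a * mx
    return a, b

def forecast(last_temp, steps, a, b):
    """Roll the fitted model forward to predict the next `steps` samples."""
    out, t = [], last_temp
    for _ in range(steps):
        t = a * t + b
        out.append(t)
    return out
```

A proactive allocator would compare the forecast against the hot-spot threshold and migrate or throttle threads before the threshold is crossed, rather than after.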


Online Learning for Thermal Management

Thermal management policies proposed in the literature have different optimization goals; thus, their advantages vary in terms of saving power, achieving better temperature profiles, or improving performance. For example, putting cores into a sleep state when they are idle can reduce thermal hot spots while saving energy; however, when workload arrivals are frequent, it can significantly increase thermal cycling. Migrating threads upon reaching a critical temperature significantly reduces hot spots, but it does not balance the workload across the chip. In this work, we propose using online learning to adapt to dynamically changing workloads and to select the policy, among a given set of expert policies, that provides the desired trade-off between performance and thermal profile.
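Purely as an illustration of the selection mechanism (the expert names and loss model here are invented), an exponential-weights learner over a set of expert policies might look like:

```python
import math

def update_weights(weights, losses, eta=0.5):
    """Multiplicative-weights update: experts whose observed loss
    (e.g., a combined performance/thermal penalty) is lower gain
    relative weight. eta is an illustrative learning rate."""
    new = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    s = sum(new)
    return [w / s for w in new]

def current_policy(weights):
    """Follow the currently best-trusted expert."""
    return max(range(len(weights)), key=lambda i: weights[i])

# Hypothetical experts: 0 = sleep-when-idle, 1 = migrate-on-threshold.
weights = [0.5, 0.5]
for losses in [[0.1, 0.7], [0.2, 0.6], [0.1, 0.8]]:  # expert 0 does better
    weights = update_weights(weights, losses)
```

Because the weights adapt every round, a workload shift that suddenly favors a different expert pulls the selection toward it within a few intervals.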


Static and Dynamic Temperature-Aware Job Scheduling

This work explores the benefits of temperature-aware task scheduling for multiprocessor systems-on-a-chip (MPSoCs). The task scheduling problem is first solved statically using integer linear programming (ILP). This solution can be utilized for embedded systems with a priori known workloads, and also as a baseline for comparison against dynamic methods. The ILP solution is guaranteed to be optimal under the given task assumptions. ILPs for minimizing energy, balancing energy, and reducing hot spots are formulated and compared against the thermally-aware optimization method. Our static solution can reduce the frequency of hot spots by 35%, spatial gradients by 85%, and thermal cycles by 61% in comparison to the energy-minimizing ILP.
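As a toy stand-in for the thermally-aware formulation (the linear thermal model and all coefficients below are illustrative assumptions, not the paper's), exhaustive search over task-to-core assignments makes the objective concrete: minimize the peak steady-state core temperature.

```python
from itertools import product

def core_temps(powers, ambient=45.0, r_self=0.8, r_cross=0.1):
    """Crude linear steady-state model: each core heats itself strongly
    and every other core weakly. Coefficients are placeholders."""
    n = len(powers)
    return [ambient + r_self * powers[i]
            + r_cross * sum(powers[j] for j in range(n) if j != i)
            for i in range(n)]

def best_assignment(task_powers, n_cores):
    """Exhaustive stand-in for the ILP: minimize the peak core temperature."""
    best, best_peak = None, float("inf")
    for assign in product(range(n_cores), repeat=len(task_powers)):
        powers = [0.0] * n_cores
        for t, c in zip(task_powers, assign):
            powers[c] += t
        peak = max(core_temps(powers))
        if peak < best_peak:
            best, best_peak = assign, peak
    return best, best_peak
```

An ILP solver reaches the same optimum without enumerating the exponentially many assignments, which is what makes the static formulation practical for realistic task sets.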

For dynamic thermal management, we introduce OS-level scheduling policies with negligible performance overhead. A novel adaptive policy, which adjusts each core's probability of receiving workload, reduces the frequency of high-magnitude thermal cycles and spatial gradients by around 50% and 90%, respectively, in comparison to state-of-the-art schedulers. Reactive thermal management strategies, such as thread migration, can be combined with this scheduling policy to further reduce hot spots, temperature variations, and the associated performance cost.
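A hypothetical miniature of the adaptive idea (the threshold and step size are invented for illustration): each core carries a probability of receiving the next thread, nudged down while the core is hot and renormalized so the probabilities stay a distribution.

```python
import random

def update_probabilities(probs, temps, threshold=75.0, step=0.05):
    """Lower a hot core's chance of receiving work, raise cool cores',
    then renormalize. Threshold and step are illustrative values."""
    new = [max(0.01, p - step) if t > threshold else p + step
           for p, t in zip(probs, temps)]
    s = sum(new)
    return [p / s for p in new]

def pick_core(probs, rng=random.random):
    """Sample a core index according to the current distribution."""
    r, acc = rng(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

Because hot cores are de-emphasized gradually rather than excluded outright, the policy avoids the abrupt on/off swings that drive thermal cycling.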


Transient Fault Prediction and Recovery

Future microprocessors will be highly susceptible to transient errors as transistor sizes decrease with CMOS scaling. Prior techniques advocated full-scale structural or temporal redundancy to achieve fault tolerance. Though they can provide complete fault coverage, they incur significant hardware and/or performance cost. It is desirable to have mechanisms that provide partial but sufficiently high fault coverage at negligible cost. To meet this goal, I helped develop a method that leverages speculative structures that already exist in modern processors. The proposed mechanism is based on the insight that when a fault occurs, the incorrect execution is likely to produce an abnormally high or low number of mispredictions (branch mispredictions, L2 misses, store-set mispredictions) compared to a correct execution. A simple transient fault predictor detects this anomalous behavior in the outcomes of the speculative structures to predict transient faults.
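A sketch of the anomaly-detection idea (window size, warmup, and threshold are illustrative choices; the real mechanism observes hardware counter outcomes inside the pipeline):

```python
from collections import deque

class TransientFaultPredictor:
    """Flag an execution interval whose misprediction count deviates
    sharply from the recent baseline, hinting at a transient fault."""

    def __init__(self, window=32, sigma=3.0, warmup=8):
        self.history = deque(maxlen=window)
        self.sigma = sigma
        self.warmup = warmup

    def observe(self, count):
        if len(self.history) >= self.warmup:
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            std = max(var ** 0.5, 1.0)  # floor to avoid zero-variance traps
            if abs(count - mean) > self.sigma * std:
                return True  # suspected fault; keep it out of the baseline
        self.history.append(count)
        return False
```

On a predicted fault, an inexpensive recovery path (e.g., replaying from an existing checkpoint) can be triggered without paying for full redundancy.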


Modeling and Optimization of MPSoC Reliability

In this work, we present a comprehensive framework for analyzing the reliability of multicore systems under permanent faults. We show that aggressive power management can degrade reliability due to temperature cycling. Our cycle-accurate simulation methodology exposes fine-grained variations in device failure rates over short time scales, enabling workload analysis and scheduling to control the reliability impact. The statistical reliability simulator and optimizer, in turn, provide a long-horizon view of reliability over the system lifetime and enable optimizing a power management policy under reliability and performance constraints. The optimization strategy can achieve large power savings while still meeting these constraints.
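The cycling-reliability link can be made concrete with a Coffin-Manson style wear model (the constants below are illustrative placeholders, not fitted values from this work): the number of thermal cycles a structure survives falls off as a power of the cycle amplitude ΔT, so a power management policy that enlarges the temperature swings can cost far more lifetime than it saves in energy.

```python
def cycles_to_failure(delta_t, n_ref=1e6, dt_ref=20.0, exponent=2.35):
    """Coffin-Manson style model: N_f = N_ref * (dT_ref / dT)^q.
    n_ref, dt_ref, and exponent are placeholder constants."""
    return n_ref * (dt_ref / delta_t) ** exponent

shallow = cycles_to_failure(20.0)  # mild sleep states, small swings
deep = cycles_to_failure(40.0)     # aggressive power gating, big swings
```

Doubling the swing amplitude here cuts the survivable cycle count by roughly 5x, which is why the optimizer must weigh power savings against cycling-induced wear.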