 
 
 
 
 


 

acoskun(at)bu.edu
 

 
 

Phone:
(617) 358-3641 (Office)


 

 

Fax:
(617) 353-7337

 
 

Address:
Boston University
ECE Department
8 St. Mary's Street
Boston, MA 02215
 

 
 

Office:
PHO 336

 


 

 

 

 

 
Ayse Kivilcim Coskun

Architecture-Level Reliability Simulation for CMPs

[SIGMETRICS 09]

               Accurate evaluation of thermal and reliability management policies on chip multiprocessors (CMPs) requires a new simulation framework that can capture architecture-level effects over tens of seconds or longer, while also capturing thermal interactions among cores resulting from dynamic scheduling policies. Using a new framework based on phase analysis of applications and a set of new thermal management policies, this work shows that techniques that offer similar performance, energy, and even peak temperature can differ significantly in their effects on the expected processor lifetime.
 
   
Thermal management of 3D Stack Architectures [DATE 09]

               Chip cross-sectional power density increases with the number of vertically stacked circuit layers in a 3D architecture. This increase exacerbates temperature-related reliability, performance, and design challenges. 3D integration complicates the implementation of dynamic thermal management techniques because of the heat transfer between vertically adjacent units and the heterogeneous cooling efficiencies of different layers (e.g., the components closer to the heat sink cool down more easily than those farther away). Therefore, traditional 2D thermal management policies are not sufficient to optimize the temperature profile of multicore 3D systems. In this work, we first investigate how the existing policies for dynamic thermal management handle the thermal hot spots and temperature gradients in 3D systems. We then propose a low-overhead policy for temperature-aware job allocation in 3D architectures. The new policy takes the thermal history of the processing cores and the 3D system characteristics into account to balance the temperature and reduce the frequency of hot spots. We evaluate the management policies on various 2- and 4-tier 3D systems, whose designs are based on an extension of the UltraSPARC T1 processor.
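As a rough illustration of temperature-aware job allocation in a 3D stack, the Python sketch below scores each idle core by its recent average temperature plus a penalty for every tier that separates it from the heat sink, and assigns the job to the lowest-scoring core. The `Core` type, the `tier_penalty` value, and the scoring rule are assumptions made for this example; they are not the policy evaluated in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Core:
    name: str
    tier: int             # 0 = tier closest to the heat sink (assumed convention)
    history: List[float]  # recent temperature samples, degrees Celsius


def allocation_score(core, tier_penalty=5.0):
    """Lower is better: recent average temperature plus a per-tier penalty."""
    average = sum(core.history) / len(core.history)
    return average + tier_penalty * core.tier


def pick_core(idle_cores, tier_penalty=5.0):
    """Return the idle core with the lowest allocation score."""
    return min(idle_cores, key=lambda c: allocation_score(c, tier_penalty))
```

With a nonzero penalty, a slightly warmer core next to the heat sink can still be preferred over a cooler core on a poorly cooled tier, which is the kind of layer-dependent trade-off a 2D policy would not consider.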
 
   
Proactive thermal management [ISLPED 08, ICCAD 08]

                Conventional thermal management techniques are reactive in nature; that is, they take action after temperature reaches a predetermined threshold value. Such approaches do not always minimize and balance the temperature on the chip, and furthermore, they control temperature at a noticeable performance cost. In this work, we investigate how to use an autoregressive moving average (ARMA) predictor for forecasting future temperature, and we propose a proactive thread allocation technique for multiprocessor systems. When implemented in the Solaris kernel on an UltraSPARC T1 chip, our proactive technique achieved a 60% reduction in hot spot occurrences, an 80% reduction in spatial gradients, and a 75% reduction in thermal cycles on average in comparison to reactive management.
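To make the idea of forecasting temperature concrete, here is a minimal Python sketch of a one-step-ahead ARMA(1,1) forecast, followed by a helper that picks the core with the lowest predicted temperature. The model order, the coefficients `phi` and `theta`, the `mean` temperature, and both function names are assumptions chosen for illustration; in practice the coefficients would be fitted to measured temperature traces, and this is not the kernel implementation described above.

```python
def arma11_forecast(history, phi, theta, mean):
    """One-step-ahead ARMA(1,1) forecast of a temperature series.

    Model on deviations from `mean`:
        y[t] = phi * y[t-1] + e[t] + theta * e[t-1]
    """
    prediction = mean  # forecast before any sample has been observed
    for observed in history:
        residual = observed - prediction  # error of the previous forecast
        prediction = mean + phi * (observed - mean) + theta * residual
    return prediction


def pick_coolest_predicted_core(histories, phi=0.9, theta=0.2, mean=60.0):
    """Return (core, forecasts), where core has the lowest forecast."""
    forecasts = {
        core: arma11_forecast(samples, phi, theta, mean)
        for core, samples in histories.items()
    }
    return min(forecasts, key=forecasts.get), forecasts
```

A proactive allocator of this shape acts on the forecast rather than waiting for a threshold crossing, so a core whose temperature is rising can be avoided before it becomes a hot spot.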
 
   
Online learning for thermal management [DAC 08]

                The policies proposed in the literature have different optimization goals; thus, their advantages vary in terms of saving power, achieving better temperature profiles, or increasing performance. For example, putting cores into a sleep state when they are idle can reduce thermal hot spots while saving energy. However, when there are frequent workload arrivals, this policy can significantly increase thermal cycling. Migrating threads upon reaching a critical temperature achieves a significant reduction in hot spots. On the other hand, this strategy does not balance the workload across the chip. In this work, we propose using online learning to adapt to dynamically changing workloads, and to select a policy (among a given set of expert policies) that provides the desired trade-off between performance and thermal profile.
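One common way to choose among a fixed set of expert policies online is an exponentially weighted (Hedge-style) scheme, sketched below in Python. This is a generic technique used here for illustration, not necessarily the learning algorithm of the paper. The class name, the learning rate, and the loss values (assumed to lie in [0, 1] and to combine thermal and performance penalties) are hypothetical.

```python
import math
import random


class PolicySelector:
    """Exponentially weighted selection among expert policies."""

    def __init__(self, policy_names, learning_rate=0.5, seed=0):
        self.weights = {name: 1.0 for name in policy_names}
        self.learning_rate = learning_rate
        self._rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights.values())
        return {name: w / total for name, w in self.weights.items()}

    def choose(self):
        """Sample a policy in proportion to its current weight."""
        names = list(self.weights)
        probs = self.probabilities()
        return self._rng.choices(names, weights=[probs[n] for n in names])[0]

    def update(self, losses):
        """Shrink each policy's weight according to its observed loss."""
        for name, loss in losses.items():
            self.weights[name] *= math.exp(-self.learning_rate * loss)
```

After each scheduling interval, the caller would evaluate every expert's loss for that interval and call `update`; policies that suit the current workload gain probability, and the selection shifts again when the workload changes.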
   
Static and dynamic temperature-aware job scheduling [DATE 07, ASPDAC 08, Transactions on VLSI 08]

                This work explores the benefits of temperature-aware task scheduling for multiprocessor system-on-a-chip (MPSoC). The task scheduling problem is first statically solved using integer linear programming (ILP). This solution can be utilized for embedded systems with a priori known workload, and also for setting a baseline of comparison to dynamic methods. The ILP solution is guaranteed to be optimal for the given assumptions for tasks. ILPs for minimizing energy, balancing energy, and reducing hot spots are formulated and compared against the thermally-aware optimization method. Our static solution can reduce the frequency of hot spots by 35%, spatial gradients by 85%, and thermal cycles by 61% in comparison to the ILP for minimizing energy.

                For dynamic thermal management, scheduling policies at the OS-level with negligible performance overhead are introduced. A novel adaptive policy, which adjusts the probability value of receiving workload for each core, reduces the frequency of high-magnitude thermal cycles and spatial gradients by around 50% and 90%, respectively, in comparison to state-of-the-art schedulers. Reactive thermal management strategies, such as thread migration, can be combined with this novel scheduling policy to further reduce hot spots, temperature variations, and the associated performance cost.
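The adaptive policy can be pictured with a small Python sketch: each core carries a probability of receiving the next job, and the probabilities are nudged toward cooler cores and then renormalized. The update rule, the `step` size, and the function name are assumptions for illustration only and are not taken from the published policy.

```python
def update_probabilities(probs, temps, step=0.01):
    """Shift workload probability from hotter cores toward cooler ones.

    probs: current per-core probabilities (sum to 1)
    temps: current per-core temperatures, degrees Celsius
    step:  probability change per degree of deviation (hypothetical value)
    """
    average = sum(temps) / len(temps)
    # Cores below the average temperature gain probability; hotter cores lose it.
    raw = [max(0.0, p + step * (average - t)) for p, t in zip(probs, temps)]
    total = sum(raw)
    if total == 0.0:
        # Degenerate case: fall back to a uniform distribution.
        return [1.0 / len(probs)] * len(probs)
    return [r / total for r in raw]
```

Because only a handful of arithmetic operations per core are needed at each update, a rule of this shape is compatible with the negligible overhead the policy aims for.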
 
   
Transient fault prediction and recovery [DATE 07]

                Future microprocessors will be highly susceptible to transient errors as the sizes of transistors decrease due to CMOS scaling. Prior techniques advocated full-scale structural or temporal redundancy to achieve fault tolerance. Though they can provide complete fault coverage, they incur significant hardware and/or performance cost. It is desirable to have mechanisms that can provide partial but sufficiently high fault coverage with negligible cost. To meet this goal, I helped develop a method that leverages speculative structures that already exist in modern processors. The proposed mechanism is based on the insight that when a fault occurs, the incorrect execution is likely to result in an abnormally high or low number of mispredictions (branch mispredictions, L2 misses, store set mispredictions) compared with a correct execution. A simple transient fault predictor is designed to detect this anomalous behavior in the outcomes of the speculative structures and thereby predict transient faults.
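The underlying insight can be sketched in software as a simple anomaly check: flag an execution interval whose misprediction count deviates from the recent average by more than a few standard deviations, in either direction. The Python function below is only an illustrative analogue; the function name, the threshold `k`, and the `min_spread` floor are assumptions, and the actual predictor is a hardware mechanism whose design is not reproduced here.

```python
import statistics


def is_anomalous(count, recent_counts, k=3.0, min_spread=1.0):
    """Return True if `count` is abnormally high or low.

    count:         mispredictions observed in the current interval
    recent_counts: counts from recent intervals assumed to be fault-free
    k:             allowed deviation in standard deviations (hypothetical)
    min_spread:    floor on the spread so a flat history is not oversensitive
    """
    mean = statistics.mean(recent_counts)
    spread = max(statistics.pstdev(recent_counts), min_spread)
    return abs(count - mean) > k * spread
```

An interval flagged this way would be a candidate for recovery, for example re-execution from a checkpoint, while unflagged intervals proceed with no extra cost.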
 
   
Modeling and optimization of MPSoC reliability [GLSVLSI 06, JOLPE 06]

           In this work, a comprehensive framework for analyzing the reliability of multi-core systems, considering permanent faults, is presented. We show that aggressive power management can have an impact on reliability due to temperature cycling. Our cycle-accurate simulation methodology shows fine-grained variations of device failure rates over short time scales, thus enabling workload analysis and scheduling to control the reliability impact. On the other hand, the statistical reliability simulator and optimizer give a view into reliability over a long time horizon (the system lifetime), and enable optimizing a power management policy under reliability and performance constraints. The optimization strategy can achieve large power savings while still meeting the reliability and performance constraints.