Andrew Lukefahr
University of Michigan
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Andrew Lukefahr.
international symposium on microarchitecture | 2012
Andrew Lukefahr; Shruti Padmanabha; Reetuparna Das; Faissal M. Sleiman; Ronald G. Dreslinski; Thomas F. Wenisch; Scott A. Mahlke
Heterogeneous multicore systems -- comprised of multiple cores with varying capabilities, performance, and energy characteristics -- have emerged as a promising approach to increasing energy efficiency. Such systems reduce energy consumption by identifying phase changes in an application and migrating execution to the most efficient core that meets its current performance requirements. However, due to the overhead of switching between cores, migration opportunities are limited to coarse-grained phases (hundreds of millions of instructions), reducing the potential to exploit energy efficient cores. We propose Composite Cores, an architecture that reduces switching overheads by bringing the notion of heterogeneity within a single core. The proposed architecture pairs big and little compute µEngines that together can achieve high performance and energy efficiency. By sharing much of the architectural state between the µEngines, the switching overhead can be reduced to near zero, enabling fine-grained switching and increasing the opportunities to utilize the little µEngine without sacrificing performance. An intelligent controller switches between the µEngines to maximize energy efficiency while constraining performance loss to a configurable bound. We evaluate Composite Cores using cycle accurate micro architectural simulations and a detailed power model. Results show that, on average, the controller is able to map 25% of the execution to the little µEngine, achieving an 18% energy savings while limiting performance loss to 5%.
international symposium on microarchitecture | 2013
Shruti Padmanabha; Andrew Lukefahr; Reetuparna Das; Scott A. Mahlke
Heterogeneous multicore systems are composed of multiple cores with varying energy and performance characteristics. A controller dynamically detects phase changes in applications and migrates execution onto the most efficient core that meets the performance requirements. In this paper, we show that existing techniques that react to performance changes break down at fine-grain intervals, as performance variations between consecutive intervals are high. We propose a predictive trace-based switching controller that predicts an upcoming phase change in a program and preemptively migrates execution onto a more suitable core. This prediction is based on a phases individual history and the current program context. Our implementation detects re-peatable code sequences to build history, uses these histories to predict an phase change, and preemptively migrates execution to the most appropriate core. We compare our method to phase prediction schemes that track the frequency of code blocks touched during execution as well as traditional reactive controllers, and demonstrate significant increases in prediction accuracy at fine-granularities. For a big-little heterogeneous system that is comprised of a high performing out-of-order core (Big) and an energy-efficient, in-order core (Little), at granularities of 300 instructions, the trace based predictor can spend 28% of execution time on the Little, while targeting a maximum performance degradation of 5%. This translates to an increased energy savings of 15% on average over running only on Big, representing a 60% increase over existing techniques.
international conference on parallel architectures and compilation techniques | 2014
Andrew Lukefahr; Shruti Padmanabha; Reetuparna Das; Ronald G. Dreslinski; Thomas F. Wenisch; Scott A. Mahlke
Heterogeneous architectures offer many potential avenues for improving energy efficiency in todays low-power cores. Two common approaches are dynamic voltage/frequency scaling (DVFS) and heterogeneous microarchitectures (HMs). Traditionally both approaches have incurred large switching overheads, which limit their applicability to coarse-grain program phases. However, recent research has demonstrated low-overhead mechanisms that enable switching at granularities as low as 1K instructions. The question remains, in this fine-grained switching regime, which form of heterogeneity offers better energy efficiency for a given level of performance? The effectiveness of these techniques depend critically on both efficient architectural implementation and accurate scheduling to maximize energy efficiency for a given level of performance. Therefore, we develop PaTH, an offline analysis tool, to compute (near-)optimal schedules, allowing us to determine Pareto-optimal energy savings for a given architecture. We leverage PaTH to study the potential energy efficiency of fine-grained DVFS and HMs, as well as a hybrid approach. We show that HMs achieve higher energy savings than DVFS for a given level of performance. While at a coarse granularity the combination of DVFS and HMs still proves beneficial, for fine-grained scheduling their combination makes little sense as HMs alone provide the bulk of the energy efficiency.
international symposium on computer architecture | 2017
Jiecao Yu; Andrew Lukefahr; David J. Palframan; Ganesh S. Dasika; Reetuparna Das; Scott A. Mahlke
As the size of Deep Neural Networks (DNNs) continues to grow to increase accuracy and solve more complex problems, their energy footprint also scales. Weight pruning reduces DNN model size and the computation by removing redundant weights. However, we implemented weight pruning for several popular networks on a variety of hardware platforms and observed surprising results. For many networks, the network sparsity caused by weight pruning will actually hurt the overall performance despite large reductions in the model size and required multiply-accumulate operations. Also, encoding the sparse format of pruned networks incurs additional storage space overhead. To overcome these challenges, we propose Scalpel that customizes DNN pruning to the underlying hardware by matching the pruned network structure to the data-parallel hardware organization. Scalpel consists of two techniques: SIMD-aware weight pruning and node pruning. For low-parallelism hardware (e.g., microcontroller), SIMD-aware weight pruning maintains weights in aligned fixed-size groups to fully utilize the SIMD units. For high-parallelism hardware (e.g., GPU), node pruning removes redundant nodes, not redundant weights, thereby reducing computation without sacrificing the dense matrix format. For hardware with moderate parallelism (e.g., desktop CPU), SIMD-aware weight pruning and node pruning are synergistically applied together. Across the microcontroller, CPU and GPU, Scalpel achieves mean speedups of 3.54x, 2.61x, and 1.25x while reducing the model sizes by 88%, 82%, and 53%. In comparison, traditional weight pruning achieves mean speedups of 1.90x, 1.06x, 0.41x across the three platforms.
international symposium on microarchitecture | 2015
Shruti Padmanabha; Andrew Lukefahr; Reetuparna Das; Scott A. Mahlke
InOrder (InO) cores achieve limited performance because their inability to dynamically reorder instructions prevents them from exploiting Instruction-Level-Parallelis Conversely, Out-of-Order (OoO) cores achieve high performance by aggressively speculating past stalled instructions and creating highly optimized issue schedules. It has been observed that these issue schedules tend to repeat for sequences of instructions with predictable control and data-flow. An equally provisioned InO core can potentially achieve OoOs performance at a fraction of the energy cost if provided with an OoO schedule. In the context of a fine-grained heterogeneous multicore system composed of a big (OoO) core and a little (InO) core, we could offload recurring issue schedules from the big to the little core, to achieve energy-efficiency while maintaining performance. To this end, we introduce the DynaMOS architecture. Recurring issue schedules may contain instructions that speculate across branches, utilize renamed registers to eliminate false dependencies, and reorder memory operations. DynaMOS provisions little with an OinO mode to replay a speculative schedule while ensuring program correctness. Any divergence from the recorded instruction sequence causes execution to restart in program order from a previously checkpointed state. On a system capable of switching between big and little cores rapidly with low overheads, DynaMOS schedules 38% of execution on the little on average, increasing utilization of the energy-efficient core by 2.9X over prior work. This amounts to energy savings of 32% over execution on only big core, with an allowable 5% performance loss.
IEEE Transactions on Computers | 2016
Andrew Lukefahr; Shruti Padmanabha; Reetuparna Das; Faissal M. Sleiman; Ronald G. Dreslinski; Thomas F. Wenisch; Scott A. Mahlke
Heterogeneous multicore systems- comprising multiple cores with varying performance and energy characteristics-have emerged as a promising approach to increasing energy efficiency. Such systems reduce energy consumption by identifying application phases and migrating execution to the most efficient core that meets performance requirements. However, the overheads of migrating between cores limit opportunities to coarse-grained phases (hundreds of millions of instructions), reducing the potential to exploit energy efficient cores. We propose Composite Cores, an architecture that reduces migration overheads by bringing heterogeneity into a core. Composite Cores pairs a big and little compute μEngine that together achieve high performance and energy efficiency. By sharing architectural state between the μEngines, the migration overhead is reduced, enabling fine-grained migration and increasing the opportunities to utilize the little μEngine without sacrificing performance. An intelligent controller migrates the application between μEngines to maximize energy efficiency while constraining performance loss to a configurable bound. We evaluate Composite Cores using cycle accurate microarchitectural simulations and a detailed power model. Results show that, on average, Composite Cores are able to map 30 percent of the execution time to the little μEngine, achieving a 21 percent energy savings while maintaining 95 percent performance.
international symposium on microarchitecture | 2017
Shruti Padmanabha; Andrew Lukefahr; Reetuparna Das; Scott A. Mahlke
Heterogenous chip multiprocessors (Het-CMPs) offer a combination of large Out-of-Order (OoO) cores optimized for high singlethreaded performance and small In-Order (InO) cores optimized for low-energy and area costs. Due to practical constraints, CMP designers must choose to either optimize for total system throughput by utilizing many InO cores or maximize single-thread execution with fewer OoO cores. We propose Mirage Cores, a novel Het-CMP design where clusters of InO cores are architected around an OoO in a manner that optimizes for both throughput and single-thread performance. The insight behind Mirage Cores is that InO cores can achieve near-OoO performance if they are provided with the dynamic instruction schedule of an OoO core. To leverage this, Mirage Cores employs an OoO core as an optimal instruction schedule generator as well as a high-performance alternative for all neighboring InO cores. We also develop intelligent runtime schedulers which orchestrate the arbitration and migration of applications between the InO cores and the central OoO. Fast and timely transfer of dynamic schedules from the OoO to InO allows Mirage Cores to create the appearance of all OoO cores to the user using underlyingIn-Order hardware. Overall, with an 8 InO per OoO configuration, Mirage Cores can achieve on average 84% of the performance of a CMP with 8 OoO cores, a 28% increase relative to current systems, while conserving 55% of energy and 25% of area costs. We find that we can scale the design to around 12 InOs per OoO before starvation for the OoO starts to hamper system performance.CCS CONCEPTS• Computer systems organization → Multicore architectures; Heterogeneous (hybrid) systems; • Hardware • Chip-level power issues;
Archive | 2014
Shruti Padmanabha; Andrew Lukefahr; Reetuparna Das; Scott A. Mahlke
Archive | 2013
Andrew Lukefahr; Reetuparna Das; Shruti Padmanabha; Scott A. Mahlke
Archive | 2017
Andrew Lukefahr; Shruti Padmanabha; Reetuparna Das; Scott A. Mahlke; Jiecao Yu