Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Tayler H. Hetherington is active.

Publication


Featured research published by Tayler H. Hetherington.


international symposium on computer architecture | 2013

GPUWattch: enabling energy optimizations in GPGPUs

Jingwen Leng; Tayler H. Hetherington; Ahmed ElTantawy; Syed Zohaib Gilani; Nam Sung Kim; Tor M. Aamodt; Vijay Janapa Reddi

General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings from dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. Finer-grained, SM cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.
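The bottom-up, activity-based modeling approach described above can be sketched in miniature. The component names, per-access energies, and the classic P ∝ V²·f DVFS scaling below are illustrative assumptions for exposition, not GPUWattch's actual parameters or equations.

```python
# A minimal sketch of a bottom-up, activity-based power model in the
# spirit described above. Component names and per-access energies are
# illustrative assumptions, not GPUWattch's calibrated parameters.

# Per-access dynamic energy (nJ), abstracted from microarchitectural
# components as the model's inputs (the bottom-up methodology).
ENERGY_PER_ACCESS_NJ = {
    "alu": 0.2,
    "register_file": 0.5,
    "l1_cache": 1.1,
    "dram": 20.0,
}

def dynamic_power_watts(activity_counts, cycles, clock_hz,
                        v_scale=1.0, f_scale=1.0):
    """Average dynamic power from activity counters, with DVFS modeled
    by the classic P_dyn proportional-to-V^2*f relationship: energy per
    access scales with v_scale**2, elapsed time shrinks with f_scale.
    """
    energy_j = 1e-9 * sum(ENERGY_PER_ACCESS_NJ[c] * n
                          for c, n in activity_counts.items())
    seconds = cycles / (clock_hz * f_scale)
    return energy_j * v_scale ** 2 / seconds
```

Doubling frequency halves the elapsed time and so doubles average dynamic power, while lowering voltage reduces power quadratically, which is why combined voltage-and-frequency scaling saves energy.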


international symposium on performance analysis of systems and software | 2012

Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems

Tayler H. Hetherington; Timothy G. Rogers; Lisa Hsu; Mike O'Connor; Tor M. Aamodt

The recent use of graphics processing units (GPUs) in several top supercomputers demonstrates their ability to consistently deliver positive results in high-performance computing (HPC). GPU support for significant amounts of parallelism would seem to make them strong candidates for non-HPC applications as well. Server workloads are inherently parallel; however, at first glance they may not seem suitable to run on GPUs due to their irregular control flow and memory access patterns. In this work, we evaluate the performance of a widely used key-value store middleware application, Memcached, on recent integrated and discrete CPU+GPU heterogeneous hardware and characterize the resulting performance. To gain greater insight, we also evaluate Memcached's performance on a GPU simulator. This work explores the challenges in porting Memcached to OpenCL and provides a detailed analysis of Memcached's behavior on a GPU to better explain the performance results observed on physical hardware. On the integrated CPU+GPU systems, we observe up to a 7.5X performance increase compared to the CPU when executing the key-value look-up handler on the GPU.
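The look-up handler pattern the abstract describes, one request per parallel lane with an irregular chain walk inside each lookup, can be sketched roughly as follows. The table layout and function names are hypothetical illustrations, not the paper's actual OpenCL port.

```python
# Illustrative sketch (not the paper's OpenCL implementation): a
# Memcached-style GET handler expressed as a data-parallel batch lookup,
# the pattern that maps each request in a batch to one GPU work-item.

def build_table(pairs, num_buckets=64):
    """Chained hash table; each bucket is a list of (key, value) entries."""
    table = [[] for _ in range(num_buckets)]
    for key, value in pairs:
        table[hash(key) % num_buckets].append((key, value))
    return table

def batch_get(table, keys):
    """Process a whole batch of GET requests. On a GPU, each iteration of
    the outer loop would be an independent work-item; the inner chain walk
    is the irregular control flow the abstract mentions."""
    results = []
    for key in keys:  # data-parallel across requests
        bucket = table[hash(key) % len(table)]
        results.append(next((v for k, v in bucket if k == key), None))
    return results
```

Batching requests is what exposes enough parallelism to keep the GPU's lanes busy, at the cost of added queueing latency per request.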


international symposium on computer architecture | 2016

Cnvlutin: ineffectual-neuron-free deep neural network computing

Jorge Albericio; Patrick Judd; Tayler H. Hetherington; Tor M. Aamodt; Natalie D. Enright Jerger; Andreas Moshovos

This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual, as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvlutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently, enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation-elimination decisions, taking them off the critical path while avoiding control divergence in the data-parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual-computation identification criterion, CNV enables further performance and energy-efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55×, and by 1.37× on average, without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (energy-delay product) and ED2P (energy-delay-squared product) on average by 1.47× and 2.01×, respectively. The average performance improvement increases to 1.52× without any loss in accuracy with a broader ineffectual-identification policy. Further improvements are demonstrated with a loss in accuracy.
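The ineffectual-computation-skipping idea can be illustrated with a toy encoding. The (value, offset) pairs below mimic the spirit of a zero-free format, but are an assumption for exposition, not CNV's actual hardware data layout.

```python
# Toy sketch of zero skipping (not CNV's hardware format): encode each
# activation lane as (value, offset) pairs for nonzero entries only, so
# multiplications with a zero input are never issued.

def encode_nonzero(activations):
    """Zero-free encoding: keep (value, original_index) for nonzero entries."""
    return [(a, i) for i, a in enumerate(activations) if a != 0]

def dot_skipping_zeros(encoded_activations, weights):
    """Inner product over only the surviving (effectual) pairs; the
    stored offset recovers which weight each activation multiplies."""
    return sum(a * weights[i] for a, i in encoded_activations)

acts = [0, 3, 0, 0, 2, 0, 0, 1]   # mostly zero, as in real DNN layers
weights = [5, 1, 7, 2, 4, 9, 8, 6]
enc = encode_nonzero(acts)        # 3 pairs instead of 8 lanes of work
```

Here 8 multiply-accumulate operations shrink to 3 with an identical result, which is the performance opportunity the abstract quantifies.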


symposium on cloud computing | 2015

MemcachedGPU: scaling-up scale-out key-value stores

Tayler H. Hetherington; Mike O'Connor; Tor M. Aamodt

This paper tackles the challenges of obtaining more efficient data center computing while maintaining low latency, low cost, programmability, and the potential for workload consolidation. We introduce GNoM, a software framework enabling energy-efficient, latency- and bandwidth-optimized UDP network and application processing on GPUs. GNoM handles the data movement and task management to facilitate the development of high-throughput UDP network services on GPUs. We use GNoM to develop MemcachedGPU, an accelerated key-value store, and evaluate the full system on contemporary hardware. MemcachedGPU achieves ~10 GbE line-rate processing of ~13 million requests per second (MRPS) while delivering an efficiency of 62 thousand RPS per watt (KRPS/W) on a high-performance GPU and 84.8 KRPS/W on a low-power GPU. This closely matches the throughput of an optimized FPGA implementation while providing up to 79% of the energy efficiency on the low-power GPU. Additionally, the low-power GPU can potentially improve cost efficiency (KRPS/$) up to 17% over a state-of-the-art CPU implementation. At 8 MRPS, MemcachedGPU achieves a 95th-percentile RTT latency under 300μs on both GPUs. An offline limit study on the low-power GPU suggests that MemcachedGPU may continue scaling throughput and energy efficiency up to 28.5 MRPS and 127 KRPS/W, respectively.


IEEE Micro | 2018

Value-Based Deep Learning Hardware Accelerators

Andreas Moshovos; Jorge Albericio; Patrick Judd; Alberto Delmas Lascorz; Sayeh Sharify; Tayler H. Hetherington; Tor M. Aamodt; Natalie D. Enright Jerger

This article summarizes our recent work on value-based hardware accelerators for image classification using Deep Convolutional Neural Networks (CNNs). The presented designs exploit runtime value properties that are difficult or impossible to discern in advance. These include values that are zero or near zero and that prove ineffectual, have reduced yet variable precision needs, or have ineffectual bits. The designs offer a spectrum of choices in terms of area cost, energy efficiency, and relative performance when embedded in server-class installations. More importantly, the accelerators reward advances in CNN design that increase the aforementioned properties.


IEEE Computer | 2018

Exploiting Typical Values to Accelerate Deep Learning

Andreas Moshovos; Jorge Albericio; Patrick Judd; Alberto Delmas Lascorz; Sayeh Sharify; Zissis Poulos; Tayler H. Hetherington; Tor M. Aamodt; Natalie D. Enright Jerger

To deliver the hardware computation power advances needed to support deep learning innovations, identifying deep learning properties that designers could potentially exploit is invaluable. This article articulates our strategy and overviews several value properties of deep learning models that we identified, along with some of our hardware designs that exploit them to reduce computation and on- and off-chip storage and communication.


arXiv: Learning | 2015

Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets

Patrick Judd; Jorge Albericio; Tayler H. Hetherington; Tor M. Aamodt; Natalie D. Enright Jerger; Raquel Urtasun; Andreas Moshovos


international symposium on microarchitecture | 2016

Stripes: bit-serial deep neural network computing

Patrick Judd; Jorge Albericio; Tayler H. Hetherington; Tor M. Aamodt; Andreas Moshovos
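As a hedged illustration of the technique the title names, the sketch below shows how a bit-serial inner product trades cycles for precision: processing p activation bits takes p passes, so layers that need fewer bits finish sooner. The function is an assumption for illustration, not the accelerator's actual datapath.

```python
# Toy model of bit-serial deep neural network computation: activations
# are consumed one bit per "cycle", so execution time scales with the
# precision p actually used instead of a fixed bit width.

def bit_serial_dot(activations, weights, p):
    """Compute sum(a * w) over p cycles. In cycle j, every lane
    contributes (bit j of its activation) * weight, shifted left by j,
    the classic shift-and-add partial-product accumulation."""
    acc = 0
    for j in range(p):                  # one cycle per activation bit
        for a, w in zip(activations, weights):
            bit = (a >> j) & 1          # serialize the activation
            acc += (bit * w) << j       # shift-and-add partial product
    return acc
```

With p equal to the full activation width the result matches an ordinary inner product; a smaller p truncates high-order activation bits, which is the precision/performance trade-off bit-serial designs expose.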


IEEE Micro | 2018

Value-Based Deep-Learning Acceleration

Andreas Moshovos; Jorge Albericio; Patrick Judd; Alberto Delmas Lascorz; Sayeh Sharify; Tayler H. Hetherington; Tor M. Aamodt; Natalie D. Enright Jerger


parallel computing | 2017

Proteus: Exploiting precision variability in deep neural networks

Patrick Judd; Jorge Albericio; Tayler H. Hetherington; Tor M. Aamodt; Natalie D. Enright Jerger; Raquel Urtasun; Andreas Moshovos

Collaboration


Dive into Tayler H. Hetherington's collaborations.

Top Co-Authors

Tor M. Aamodt

University of British Columbia


Ahmed ElTantawy

University of British Columbia
