
Publication


Featured research published by Paul N. Whatmough.


international symposium on computer architecture | 2016

Minerva: enabling low-power, highly-accurate deep neural network accelerators

Brandon Reagen; Paul N. Whatmough; Robert Adolf; Saketh Rama; Hyunkwang Lee; Sae Kyu Lee; José Miguel Hernández-Lobato; Gu-Yeon Wei; David M. Brooks

The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation enables lower SRAM voltages, saving an additional 2.7×. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low-power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.
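
To make the pruning idea concrete, the sketch below (NumPy, with made-up layer sizes and threshold; not Minerva's actual pipeline) zeroes small activations before the multiply-accumulate stage, so a hardware accelerator could skip the corresponding operations:

```python
import numpy as np

# Illustrative sketch only (not Minerva's actual pipeline): prune small
# activations before the MAC stage; an accelerator could skip those operations.
# Layer sizes, bit widths, and the threshold are invented for the example.
rng = np.random.default_rng(0)

def quantize(x, frac_bits):
    """Round to a fixed-point grid with the given number of fractional bits."""
    scale = 2 ** frac_bits
    return np.round(x * scale) / scale

def fc_layer(acts, weights, prune_threshold):
    """Fully-connected layer that zeroes (prunes) small input activations."""
    pruned = np.where(np.abs(acts) < prune_threshold, 0.0, acts)
    skipped = np.count_nonzero(pruned == 0) / pruned.size  # fraction of skippable MACs
    return np.maximum(pruned @ weights, 0.0), skipped

acts = quantize(np.abs(rng.standard_normal(256)), frac_bits=6)        # hypothetical ReLU outputs
weights = quantize(rng.standard_normal((256, 100)) * 0.1, frac_bits=6)
out, skipped = fc_layer(acts, weights, prune_threshold=0.1)
print(f"fraction of MACs skippable: {skipped:.1%}")
```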


international solid-state circuits conference | 2017

14.3 A 28nm SoC with a 1.2GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications

Paul N. Whatmough; Sae Kyu Lee; Hyunkwang Lee; Saketh Rama; David M. Brooks; Gu-Yeon Wei

Machine Learning (ML) techniques empower Internet of Things (IoT) devices with the capability to interpret the complex, noisy real-world data arising from sensor-rich systems. Achieving sufficient energy efficiency to execute ML workloads on an edge device necessitates specialized hardware with efficient digital circuits. Razor systems allow excessive worst-case VDD guardbands to be minimized down to the point where timing violations start to occur. By tracking the non-zero timing violation rate, process/voltage/temperature/aging (PVTA) variations are dynamically compensated as they change over time. Resilience to timing violations is achieved using either explicit correction (e.g., replay [1]) or algorithmic tolerance [2]. ML algorithms offer remarkable inherent error tolerance and are a natural fit for Razor timing violation detection without the burden of explicit and guaranteed error correction. Prior ML accelerators have focused either on computer vision CNNs with high power (e.g., [3] consumes 278mW) or on spiking neural networks with low accuracy (e.g., 84% on MNIST [4]). Programmable fully-connected (FC) deep-neural-network (DNN) accelerators offer flexible support for a range of general classification tasks with high accuracy [5]. However, because there is no parameter reuse in FC layers, both compute and memory resources must be optimized.
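
A minimal sketch of the Razor-style control idea described above, assuming a made-up error-rate model (real silicon measures timing violations directly in hardware): the loop keeps lowering VDD while the observed violation rate stays under the tolerated target.

```python
import numpy as np

# Minimal sketch of a Razor-style voltage control loop with a toy error-rate
# model; the critical voltage and step sizes are assumptions, not silicon data.
def measured_error_rate(vdd, vcrit=0.78):
    """Toy model: violations ramp up quickly once VDD drops toward a critical point."""
    return float(np.clip(np.exp(-(vdd - vcrit) * 200.0), 0.0, 1.0))

def tune_vdd(vdd=0.95, target_rate=0.1, step=0.005, iters=200):
    """Shave the VDD guardband while the violation rate stays under the target."""
    for _ in range(iters):
        if measured_error_rate(vdd) < target_rate:
            vdd -= step          # still margin left: lower the supply
        else:
            vdd += step          # too many violations: back off
    return vdd

print(f"settled VDD: {tune_vdd():.3f} V")   # converges near the point where rate ~= target
```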


international symposium on low power electronics and design | 2017

A case for efficient accelerator design space exploration via Bayesian optimization

Brandon Reagen; José Miguel Hernández-Lobato; Robert Adolf; Michael A. Gelbart; Paul N. Whatmough; Gu-Yeon Wei; David M. Brooks

In this paper we propose using machine learning to improve the design of deep neural network hardware accelerators. We show how to adapt multi-objective Bayesian optimization to overcome a challenging design problem: optimizing deep neural network hardware accelerators for both accuracy and energy efficiency. DNN accelerators exhibit all aspects of a challenging optimization space: the landscape is rough, evaluating designs is expensive, the objectives compete with each other, and both design spaces (algorithmic and microarchitectural) are unwieldy. With multi-objective Bayesian optimization, the design space exploration is made tractable and the design points found vastly outperform traditional methods across all metrics of interest.
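
As a rough illustration of surrogate-based exploration in this spirit, the sketch below uses a Gaussian-process surrogate with random scalarizations of two toy objectives (stand-ins for accuracy loss and energy; the paper's objectives come from trained DNNs and simulated accelerator hardware):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Sketch of multi-objective design space exploration with a GP surrogate.
# The design knobs and analytic objectives below are invented placeholders.
rng = np.random.default_rng(1)

def evaluate(x):
    bitwidth, parallelism = x              # hypothetical design knobs in [0, 1]
    acc_loss = (1.0 - bitwidth) ** 2       # lower precision -> more accuracy loss
    energy = 0.3 * bitwidth + 0.7 * parallelism
    return acc_loss, energy

def expected_improvement(mu, sigma, best):
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

X = rng.random((5, 2))                     # initial random designs
Y = np.array([evaluate(x) for x in X])
for _ in range(20):
    w = rng.random()                       # random scalarization weight per iteration
    y = w * Y[:, 0] + (1 - w) * Y[:, 1]
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    cand = rng.random((500, 2))
    mu, sigma = gp.predict(cand, return_std=True)
    x_next = cand[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.vstack([X, x_next])
    Y = np.vstack([Y, evaluate(x_next)])

pareto = [i for i, yi in enumerate(Y)
          if not any((Y[j] <= yi).all() and (Y[j] < yi).any() for j in range(len(Y)))]
print("Pareto-optimal designs found:", X[pareto].round(2))
```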


international symposium on low power electronics and design | 2015

Analysis of adaptive clocking technique for resonant supply voltage noise mitigation

Paul N. Whatmough; Shidhartha Das; David Michael Bull

Resonant supply voltage noise is emerging as a serious limitation for power efficiency in SoCs for mobile products. Increasing supply currents coupled with stagnant package inductance lead to significant AC supply impedance, which necessitates larger supply voltage margins, impacting power efficiency. Adaptive clocking offers a promising approach to reduce voltage margins by stretching the clock period to match datapath delays. However, the adaptation bandwidth and clock distribution latencies required can be very demanding. We present an analysis of the potential benefits of adaptive clocking based on measurements of supply voltage noise in a dual-core ARM Cortex-A57 cluster in a mobile SoC. By modeling an adaptive clocking system on the measured supply voltage noise dataset, we demonstrate that an adaptation latency of 1.5ns may offer a VMIN improvement of around 30mV, and a latency of 1ns an improvement of around 50mV. Benefits are workload dependent and ultimately limited by insurmountable synchronization and clock distribution latency.
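
The style of analysis can be approximated with a short script: replay a supply-noise trace through an adaptive-clocking model with a given adaptation latency and compare the residual droop (the part the clock cannot react to) against the unmitigated worst case. The trace below is synthetic, not the measured Cortex-A57 data:

```python
import numpy as np

# Rough sketch of latency-vs-margin analysis on an invented resonant noise trace.
fs = 10e9                                   # 10 GS/s sample rate
t = np.arange(0, 2e-6, 1 / fs)
rng = np.random.default_rng(2)
# ~50 MHz resonance burst plus broadband noise, in volts (parameters are made up)
noise = 0.08 * np.sin(2 * np.pi * 50e6 * t) * np.exp(-((t - 1e-6) ** 2) / (2 * (2e-7) ** 2))
noise += 0.003 * rng.standard_normal(t.size)
vdd = 0.90 - np.abs(noise)

def residual_droop(vdd, latency_s):
    """Droop the clock cannot track: supply change within the adaptation latency window."""
    lag = int(latency_s * fs)
    tracked = np.concatenate([np.full(lag, vdd[0]), vdd[:-lag]])  # delayed view of VDD
    return np.max(tracked - vdd)

worst = 0.90 - vdd.min()
for latency in (1.5e-9, 1.0e-9):
    recovered = worst - residual_droop(vdd, latency)
    print(f"latency {latency*1e9:.1f} ns: Vmin margin recovered ~{recovered*1e3:.0f} mV")
```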


international conference on computer design | 2017

Applications of Deep Neural Networks for Ultra Low Power IoT

Sreela Kodali; Patrick Hansen; Niamh Mulholland; Paul N. Whatmough; David M. Brooks; Gu-Yeon Wei

IoT devices are increasing in prevalence and popularity, becoming an indispensable part of daily life. Despite the stringent energy and computational constraints of IoT systems, specialized hardware can enable energy-efficient sensor-data classification in an increasingly diverse range of IoT applications. This paper demonstrates seven different IoT applications using a fully-connected deep neural network (FC-NN) accelerator on 28nm CMOS. The applications include audio keyword spotting, face recognition, and human activity recognition. For each application, a FC-NN model was trained from a preprocessed dataset and mapped to the accelerator. Experimental results indicate the models retained their state-of-the-art accuracy on the accelerator across a broad range of frequencies and voltages. Real-time energy results for the applications were found to be on the order of 100nJ per inference or lower.
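
A back-of-the-envelope sketch of the kind of per-inference energy accounting implied above; the layer sizes and per-operation energies are illustrative placeholders, not figures from the paper:

```python
# Hypothetical keyword-spotting FC-NN: (fan_in, fan_out) per fully-connected layer
layers = [(250, 256), (256, 256), (256, 12)]
E_MAC = 0.25e-12      # assumed energy per multiply-accumulate, in joules
E_SRAM = 1.0e-12      # assumed energy per weight fetch, in joules

macs = sum(fan_in * fan_out for fan_in, fan_out in layers)
energy = macs * (E_MAC + E_SRAM)
print(f"{macs/1e3:.0f}k MACs -> ~{energy*1e9:.0f} nJ per inference")
```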


Synthesis Lectures on Computer Architecture | 2017

Deep Learning for Computer Architects

Brandon Reagen; Robert Adolf; Paul N. Whatmough

Machine learning, and specifically deep learning, has been hugely disruptive in many fields of computer science. The success of deep learning techniques in solving notoriously difficult classification and regression problems has resulted in their rapid adoption in solving real-world problems. The emergence of deep learning is widely attributed to a virtuous cycle whereby fundamental advancements in training deeper models were enabled by the availability of massive datasets and high-performance computer hardware. This text serves as a primer for computer architects in a new and rapidly evolving field. We review how machine learning has evolved since its inception in the 1960s and track the key developments leading up to the powerful deep learning techniques that emerged in the last decade. Next we review representative workloads, including the most commonly used datasets and seminal networks across a variety of domains. In addition to discussing the workloads themselves, we also detail the most popular deep learning tools and show how aspiring practitioners can use the tools with the workloads to characterize and optimize DNNs. The remainder of the book is dedicated to the design and optimization of hardware and architectures for machine learning. As high-performance hardware was so instrumental in machine learning becoming a practical solution, we recount a variety of recently proposed optimizations to further improve future designs. Finally, we present a review of recent research published in the area as well as a taxonomy to help readers understand how various contributions fall in context.


IEEE Journal of Solid-state Circuits | 2017

Power Integrity Analysis of a 28 nm Dual-Core ARM Cortex-A57 Cluster Using an All-Digital Power Delivery Monitor

Paul N. Whatmough; Shidhartha Das; Zacharias Hadjilambrou; David Michael Bull

This paper presents a power delivery monitor (PDM) peripheral integrated in a flip-chip packaged 28 nm system-on-chip (SoC) for mobile computing. The PDM is composed entirely of digital standard cells and consists of: 1) a fully integrated VCO-based digital sampling oscilloscope; 2) a synthetic current load; and 3) an event engine for triggering, analysis, and debug. Incorporated inside an SoC, it enables rapid, automated analysis of supply impedance, as well as monitoring supply voltage droop of multi-core CPUs running full software workloads and during scan-test operations. To demonstrate these capabilities, we describe a power integrity case study of a dual-core ARM Cortex-A57 cluster in a commercial 28 nm mobile SoC. Measurements are presented of power delivery network (PDN) electrical parameters, along with waveforms of the CPU cluster running test cases and benchmarks on bare metal and Linux OS. The effect of aggressive power management techniques, such as power gating on the dominant resonant frequency and peak impedance, is highlighted. Finally, we present measurements of supply voltage noise during various scan-test operations, an often-neglected aspect of SoC power integrity.
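
The impedance-measurement idea can be illustrated with a small simulation: drive a simple second-order PDN model with a current step (standing in for the synthetic current load) and take the ratio of the voltage and current spectra. The R/L/C values and waveforms below are invented, not the measured 28 nm silicon:

```python
import numpy as np

# Sketch of extracting supply impedance from time-domain waveforms, in the
# spirit of pairing a synthetic current load with an on-die voltage sampler.
fs = 5e9                                   # 5 GS/s "sampling oscilloscope"
dt = 1 / fs
t = np.arange(0, 4e-6, dt)
i_load = np.where(t > 1e-6, 0.5, 0.0)      # 0.5 A step drawn by the synthetic load

# Simple PDN: package R and L feeding on-die decap C, integrated with forward Euler
R, L, C = 0.010, 0.1e-9, 100e-9
v_die, i_pkg = 0.90, 0.0
v = np.empty_like(t)
for k in range(t.size):
    i_pkg += (0.90 - R * i_pkg - v_die) / L * dt
    v_die += (i_pkg - i_load[k]) / C * dt
    v[k] = v_die

# Impedance vs frequency as the ratio of the AC spectra of voltage and current
V = np.fft.rfft(v - v.mean())
I = np.fft.rfft(i_load - i_load.mean())
f = np.fft.rfftfreq(t.size, dt)[1:]
Z = np.abs(V[1:] / I[1:])
band = f < 500e6
peak = np.argmax(Z[band])
print(f"peak |Z| ~ {Z[band][peak]*1e3:.0f} mOhm near {f[band][peak]/1e6:.0f} MHz")
```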


international symposium on circuits and systems | 2016

A low-power correlator for wakeup receivers with algorithm pruning through early termination

Reza Ghanaatian; Paul N. Whatmough; Jeremy Constantin; Adam Teman; Andreas Burg

A low-complexity, low-power digital correlator for wakeup receivers is presented. With the proposed algorithm, unnecessary computational cycles are dynamically pruned from the correlation using an early threshold check. We provide a rigorous mathematical analysis of the associated complexity/performance trade-offs for the algorithm. Furthermore, a low-overhead hardware architecture with early-termination capability is developed and implemented in a 0.18μm CMOS technology. The post-layout power analysis shows that the presented architecture can reduce power by up to 32% compared to the conventional architecture, with negligible degradation in detection probability and no degradation in false-alarm probability.
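
A bit-level sketch of the early-termination idea, assuming a hard-decision correlator and a made-up 32-bit wakeup pattern (the paper analyzes the detection/false-alarm trade-offs rigorously; this only shows the pruning mechanism):

```python
import numpy as np

# Illustrative early-terminating correlator; pattern, length, and threshold are assumed.
rng = np.random.default_rng(3)
SYNC = rng.integers(0, 2, 32)          # hypothetical wakeup pattern
THRESHOLD = 28                         # minimum matching bits to declare a wakeup

def correlate_early_exit(window, sync=SYNC, threshold=THRESHOLD):
    """Return (detected, cycles_used). Abort once the threshold becomes unreachable."""
    score = 0
    for i, (a, b) in enumerate(zip(window, sync)):
        score += int(a == b)
        remaining = len(sync) - (i + 1)
        if score + remaining < threshold:    # even perfect matches from here can't recover
            return False, i + 1
    return score >= threshold, len(sync)

noise = rng.integers(0, 2, 32)
hit, cycles_hit = correlate_early_exit(SYNC)
miss, cycles_miss = correlate_early_exit(noise)
print(f"sync word: detected={hit} in {cycles_hit} cycles; "
      f"random bits: detected={miss}, pruned after {cycles_miss} cycles")
```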


international symposium on computer architecture | 2018

Euphrates: algorithm-SoC co-design for low-power mobile continuous vision

Yuhao Zhu; Anand Samajdar; Matthew Mattina; Paul N. Whatmough

Continuous computer vision (CV) tasks increasingly rely on convolutional neural networks (CNNs). However, CNNs have massive compute demands that far exceed the performance and energy constraints of mobile devices. In this paper, we propose and develop an algorithm-architecture co-designed system, Euphrates, that simultaneously improves the energy efficiency and performance of continuous vision tasks. Our key observation is that changes in pixel data between consecutive frames represent visual motion. We first propose an algorithm that leverages this motion information to relax the number of expensive CNN inferences required by continuous vision applications. We co-design a mobile System-on-a-Chip (SoC) architecture to maximize the efficiency of the new algorithm. The key to our architectural augmentation is to co-optimize different SoC IP blocks in the vision pipeline collectively. Specifically, we propose to expose the motion data that is naturally generated by the Image Signal Processor (ISP) early in the vision pipeline to the CNN engine. Measurement and synthesis results show that Euphrates achieves up to 66% SoC-level energy savings (4× for the vision computations), with only 1% accuracy loss.
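
A minimal sketch of the motion-extrapolation idea: run the CNN only on key frames and shift each detected box by the average block motion vector inside it on the frames in between. The function names, block size, and motion-vector source here are illustrative, not the Euphrates SoC interfaces:

```python
import numpy as np

# Sketch of reusing ISP-style block motion vectors between CNN key frames.
def extrapolate_boxes(boxes, motion_vectors, block=16):
    """boxes: (N,4) [x0,y0,x1,y1]; motion_vectors: (H/block, W/block, 2) as (dx, dy)."""
    out = []
    for x0, y0, x1, y1 in boxes:
        bx0, bx1 = int(x0) // block, max(int(x1) // block, int(x0) // block + 1)
        by0, by1 = int(y0) // block, max(int(y1) // block, int(y0) // block + 1)
        dx, dy = motion_vectors[by0:by1, bx0:bx1].reshape(-1, 2).mean(axis=0)
        out.append([x0 + dx, y0 + dy, x1 + dx, y1 + dy])
    return np.array(out)

mv = np.zeros((45, 80, 2)); mv[..., 0] = 3.0        # scene panning 3 px right per frame
boxes = np.array([[100.0, 200.0, 180.0, 300.0]])    # detection from the last CNN key frame
for _ in range(4):                                   # extrapolate across 4 non-key frames
    boxes = extrapolate_boxes(boxes, mv)
print(boxes.round(1))                                # the box tracks the motion: x shifted by 12
```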


design automation conference | 2018

Ares: a framework for quantifying the resilience of deep neural networks

Brandon Reagen; Udit Gupta; Lillian Pentecost; Paul N. Whatmough; Sae Kyu Lee; Niamh Mulholland; David M. Brooks; Gu-Yeon Wei

As the use of deep neural networks continues to grow, so does the fraction of compute cycles devoted to their execution. This has led the CAD and architecture communities to devote considerable attention to building DNN hardware. Despite these efforts, the fault tolerance of DNNs has generally been overlooked. This paper is the first to conduct a large-scale, empirical study of DNN resilience. Motivated by the inherent algorithmic resilience of DNNs, we are interested in understanding the relationship between fault rate and model accuracy. To do so, we present Ares: a light-weight, DNN-specific fault injection framework validated within 12% of real hardware. We find that DNN fault tolerance varies by orders of magnitude with respect to model, layer type, and structure.
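
A toy fault-injection loop in the spirit of the framework described above: flip random bits in 8-bit quantized weights of a tiny synthetic classifier and observe how accuracy degrades with fault rate (Ares targets real DNNs and validated hardware error models; everything below is an assumption for illustration):

```python
import numpy as np

# Toy fault-injection sketch: synthetic linear classifier, 8-bit weights, random bit flips.
rng = np.random.default_rng(4)
X = rng.standard_normal((500, 16))
true_w = rng.standard_normal((16, 4))
labels = (X @ true_w).argmax(axis=1)

w_q = np.clip(np.round(true_w * 32), -128, 127).astype(np.int8)   # quantized weights

def inject_bit_flips(weights, fault_rate):
    """Flip each bit of each weight independently with probability fault_rate."""
    w = weights.copy().view(np.uint8)            # bit-level view of the int8 weights
    for bit in range(8):
        mask = (rng.random(w.shape) < fault_rate).astype(np.uint8) << np.uint8(bit)
        w ^= mask
    return w.view(np.int8)

def accuracy(weights):
    pred = (X @ (weights.astype(np.float32) / 32)).argmax(axis=1)
    return (pred == labels).mean()

for rate in (1e-4, 1e-3, 1e-2, 1e-1):
    accs = [accuracy(inject_bit_flips(w_q, rate)) for _ in range(20)]
    print(f"fault rate {rate:.0e}: mean accuracy {np.mean(accs):.2%}")
```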
