Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Brandon Reagen is active.

Publication


Featured research published by Brandon Reagen.


International Symposium on Computer Architecture (ISCA) | 2014

Aladdin: a Pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures

Yakun Sophia Shao; Brandon Reagen; Gu-Yeon Wei; David M. Brooks

Hardware specialization, in the form of accelerators that provide custom datapath and control for specific algorithms and applications, promises impressive performance and energy advantages compared to traditional architectures. Current research in accelerator analysis relies on RTL-based synthesis flows to produce accurate timing, power, and area estimates. Such techniques not only require significant effort and expertise but are also slow and tedious to use, making large design space exploration infeasible. To overcome this problem, we present Aladdin, a pre-RTL, power-performance accelerator modeling framework and demonstrate its application to system-on-chip (SoC) simulation. Aladdin estimates performance, power, and area of accelerators within 0.9%, 4.9%, and 6.6% with respect to RTL implementations. Integrated with architecture-level core and memory hierarchy simulators, Aladdin provides researchers an approach to model the power and performance of accelerators in an SoC environment.
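To make the pre-RTL idea concrete, the sketch below is a hypothetical, heavily simplified design-space sweep in the spirit of Aladdin-style modeling: it estimates cycles and power for an accelerator kernel from a handful of analytical assumptions (operation count, memory traffic, unrolling factor, memory ports). The cost model and every constant in it are invented for illustration and are not Aladdin's.

```python
# Illustrative, *hypothetical* pre-RTL design-space sweep in the spirit of
# Aladdin-style modeling. The cost model and constants below are invented
# for demonstration and do not reproduce Aladdin's estimates.
from itertools import product

TOTAL_OPS = 1_000_000        # dynamic operations in the kernel (assumed)
TOTAL_BYTES = 4_000_000      # bytes moved to/from local memory (assumed)

def estimate(unroll, ports):
    """Return (cycles, power_mw) for one design point."""
    compute_cycles = TOTAL_OPS / unroll            # unroll = parallel functional units
    memory_cycles = TOTAL_BYTES / (4 * ports)      # 4 bytes per memory port per cycle
    cycles = max(compute_cycles, memory_cycles)    # bound by the slower side
    # Toy power model: dynamic power grows with parallel hardware, plus leakage.
    power_mw = 0.8 * unroll + 1.5 * ports + 5.0
    return cycles, power_mw

if __name__ == "__main__":
    for unroll, ports in product([1, 2, 4, 8, 16], [1, 2, 4]):
        cycles, power = estimate(unroll, ports)
        print(f"unroll={unroll:2d} ports={ports} "
              f"cycles={cycles:10.0f} power={power:5.1f} mW")
```

Sweeping such analytical models is what makes exploring thousands of design points tractable before committing to RTL.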


International Symposium on Computer Architecture (ISCA) | 2016

Minerva: enabling low-power, highly-accurate deep neural network accelerators

Brandon Reagen; Paul N. Whatmough; Robert Adolf; Saketh Rama; Hyunkwang Lee; Sae Kyu Lee; José Miguel Hernández-Lobato; Gu-Yeon Wei; David M. Brooks

The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.
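The three optimization stages reported above compose multiplicatively (1.5 × 2.0 × 2.7 ≈ 8.1). The short sketch below checks that arithmetic and illustrates, with a toy NumPy example rather than Minerva's hardware, what predicating away small activation values looks like; the 0.05 threshold is an assumed, illustrative value.

```python
import numpy as np

# The three Minerva optimization stages are reported as successive power
# reductions; multiplying the factors recovers the paper's ~8.1x figure.
datatype_opt = 1.5
pruning_opt = 2.0
sram_voltage_opt = 2.7
print(f"combined reduction ~ {datatype_opt * pruning_opt * sram_voltage_opt:.1f}x")

# Toy illustration (not Minerva's hardware) of "predicating small activity
# values": activations below a threshold are zeroed so the downstream MAC
# operations can be skipped. The threshold here is an arbitrary, assumed value.
def predicate_small_activations(acts, threshold=0.05):
    mask = np.abs(acts) >= threshold
    skipped = 1.0 - mask.mean()
    return acts * mask, skipped

acts = np.random.randn(1000) * 0.1
pruned, frac_skipped = predicate_small_activations(acts)
print(f"fraction of MAC operations skipped: {frac_skipped:.0%}")
```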


IEEE International Symposium on Workload Characterization (IISWC) | 2014

MachSuite: Benchmarks for accelerator design and customized architectures

Brandon Reagen; Robert Adolf; Yakun Sophia Shao; Gu-Yeon Wei; David M. Brooks

Recent high-level synthesis and accelerator-related architecture papers show a great disparity in workload selection. To improve standardization within the accelerator research community, we present MachSuite, a collection of 19 benchmarks for evaluating high-level synthesis tools and accelerator-centric architectures. MachSuite spans a broad application space, captures a variety of different program behaviors, and provides implementations tailored towards the needs of accelerator designers and researchers, including support for high-level synthesis. We illustrate these aspects by characterizing each benchmark along five different dimensions, highlighting trends and salient features.


IEEE International Symposium on Workload Characterization (IISWC) | 2016

Fathom: reference workloads for modern deep learning methods

Robert Adolf; Saketh Rama; Brandon Reagen; Gu-Yeon Wei; David M. Brooks

Deep learning has been popularized by its recent successes on challenging artificial intelligence problems. One of the reasons for its dominance is also an ongoing challenge: the need for immense amounts of computational power. Hardware architects have responded by proposing a wide array of promising ideas, but to date, the majority of the work has focused on specific algorithms in somewhat narrow application domains. While their specificity does not diminish these approaches, there is a clear need for more flexible solutions. We believe the first step is to examine the characteristics of cutting-edge models from across the deep learning community. Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook's AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling.
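The sketch below, assuming TensorFlow 2.x, shows the flavor of application-level measurement the paper describes: timing an inference pass against a full training step for a small stand-in model. It is not the Fathom tooling, and the tiny MLP is not one of the eight Fathom workloads.

```python
# A minimal sketch (assuming TensorFlow 2.x) of the kind of application-level
# measurement Fathom performs: comparing where time goes in inference versus a
# full training step. The tiny MLP below is a stand-in, not a Fathom workload.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])
optimizer = tf.keras.optimizers.Adam()
x = np.random.rand(64, 784).astype("float32")
y = np.random.randint(0, 10, size=(64,))

def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(y, logits, from_logits=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

for name, fn in [("inference", lambda: model(x, training=False)),
                 ("training step", lambda: train_step(x, y))]:
    fn()                                      # warm-up (graph/kernel initialization)
    start = time.perf_counter()
    for _ in range(100):
        fn()
    print(f"{name}: {(time.perf_counter() - start) / 100 * 1e3:.2f} ms/step")
```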


Symposium on VLSI Circuits | 2015

A multi-chip system optimized for insect-scale flapping-wing robots

Xuan Zhang; Mario Lok; Tao Tong; Simon Chaput; Sae Kyu Lee; Brandon Reagen; Hyunkwang Lee; David M. Brooks; Gu-Yeon Wei

We demonstrate a battery-powered multi-chip system optimized for insect-scale flapping wing robots that meets the tight weight limit and real-time performance demands of autonomous flight. Measured results show open-loop wing flapping driven by a power electronics unit and energy efficiency improvements via hardware acceleration.


International Symposium on Low Power Electronics and Design (ISLPED) | 2013

Quantifying acceleration: power/performance trade-offs of application kernels in hardware

Brandon Reagen; Yakun Sophia Shao; Gu-Yeon Wei; David M. Brooks

As the traditional performance gains of technology scaling diminish, one of the most promising directions is building special-purpose, fixed-function hardware blocks, commonly referred to as accelerators. Accelerators have become prevalent in industrial SoC designs for their low-power, high-performance potential. In this work we explore thousands of implementations of classical software workloads in hardware. This thorough, detailed design space search of hardware accelerators gives architects a quantitative way to reason about the differences in implementations. The exploration presented in this work shows that the space is full of poor design choices. By thoroughly analyzing each benchmark, we show which benchmarks provide the most performance when implemented in hardware given a fixed power budget and explain which design techniques work best for each workload.
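The selection problem the abstract describes can be illustrated with a small, self-contained example: given many (power, performance) points for one kernel, keep the Pareto-optimal designs and pick the fastest one under a fixed power budget. The design points below are randomly generated stand-ins, not real synthesis results.

```python
# Hypothetical illustration of the selection problem the abstract describes:
# given many (power, performance) design points for one kernel, keep the
# Pareto-optimal ones and pick the fastest design under a fixed power budget.
import random

random.seed(0)
designs = [(random.uniform(1, 50), random.uniform(10, 500))   # (power mW, perf MOPS)
           for _ in range(2000)]

def pareto_front(points):
    """Keep points not dominated by another point (<= power AND >= performance)."""
    front = []
    for p, perf in points:
        dominated = any(q <= p and qperf >= perf and (q, qperf) != (p, perf)
                        for q, qperf in points)
        if not dominated:
            front.append((p, perf))
    return sorted(front)

POWER_BUDGET_MW = 20.0
front = pareto_front(designs)
feasible = [d for d in front if d[0] <= POWER_BUDGET_MW]
best = max(feasible, key=lambda d: d[1])
print(f"{len(front)} Pareto-optimal designs out of {len(designs)}")
print(f"best under {POWER_BUDGET_MW} mW: power={best[0]:.1f} mW, perf={best[1]:.0f} MOPS")
```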


IEEE Micro | 2015

The Aladdin Approach to Accelerator Design and Modeling

Yakun Sophia Shao; Brandon Reagen; Gu-Yeon Wei; David M. Brooks

Hardware specialization, in the form of datapath and control circuitry customized to particular algorithms or applications, promises impressive performance and energy advantages compared to traditional architectures. Current research in accelerators relies on RTL-based synthesis flows to produce accurate timing, power, and area estimates. Such techniques not only require significant effort and expertise but also are slow and tedious to use, making large design space exploration infeasible. To overcome this problem, the authors developed Aladdin, a pre-RTL, power-performance accelerator modeling framework and demonstrated its application to system-on-chip (SoC) simulation. Aladdin estimates performance, power, and area of accelerators within 0.9, 4.9, and 6.6 percent with respect to RTL implementations. Integrated with architecture-level general-purpose core and memory hierarchy simulators, Aladdin provides researchers with a fast but accurate way to model the power and performance of accelerators in an SoC environment.


International Symposium on Low Power Electronics and Design (ISLPED) | 2017

A case for efficient accelerator design space exploration via Bayesian optimization

Brandon Reagen; José Miguel Hernández-Lobato; Robert Adolf; Michael A. Gelbart; Paul N. Whatmough; Gu-Yeon Wei; David M. Brooks

In this paper we propose using machine learning to improve the design of deep neural network hardware accelerators. We show how to adapt multi-objective Bayesian optimization to overcome a challenging design problem: optimizing deep neural network hardware accelerators for both accuracy and energy efficiency. DNN accelerators exhibit all aspects of a challenging optimization space: the landscape is rough, evaluating designs is expensive, the objectives compete with each other, and both design spaces (algorithmic and microarchitectural) are unwieldy. With multi-objective Bayesian optimization, the design space exploration is made tractable and the design points found vastly outperform traditional methods across all metrics of interest.
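As a rough illustration of the approach, the sketch below runs a generic Bayesian-optimization loop over a scalarized two-objective problem (error vs. energy) using a Gaussian-process surrogate and expected improvement, with ParEGO-style random scalarization weights. The toy objectives, kernel choice, and candidate grid are all assumptions; this is not the authors' implementation, and it requires scikit-learn and SciPy.

```python
# Generic multi-objective Bayesian optimization sketch (NOT the authors' tool):
# scalarize (error, energy) with random weights each round, fit a GP surrogate,
# and pick the candidate with the highest expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def objectives(x):
    """Toy stand-ins: x[0] ~ normalized bit width, x[1] ~ parallelism, both in [0, 1]."""
    error = np.exp(-4 * x[0]) + 0.05 * rng.normal()          # fewer bits -> more error
    energy = 0.3 * x[0] + 0.7 * x[1] + 0.05 * rng.normal()   # more hardware -> more energy
    return error, energy

# Seed the surrogate with a few random designs, then iterate.
X = rng.random((5, 2))
Y = np.array([objectives(x) for x in X])
candidates = rng.random((500, 2))

for _ in range(20):
    w = rng.random()                              # random scalarization weight
    scalar = w * Y[:, 0] + (1 - w) * Y[:, 1]      # weighted sum of both objectives
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    gp.fit(X, scalar)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = scalar.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)     # expected improvement
    idx = int(np.argmax(ei))
    x_next = candidates[idx]
    candidates = np.delete(candidates, idx, axis=0)          # avoid re-evaluating a point
    X = np.vstack([X, x_next])
    Y = np.vstack([Y, objectives(x_next)])

print("evaluated designs:", len(X))
print(f"lowest error found: {Y[:, 0].min():.3f}  lowest energy found: {Y[:, 1].min():.3f}")
```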


Synthesis Lectures on Computer Architecture | 2017

Deep Learning for Computer Architects

Brandon Reagen; Robert Adolf; Paul N. Whatmough

Machine learning, and specifically deep learning, has been hugely disruptive in many fields of computer science. The success of deep learning techniques in solving notoriously difficult classification and regression problems has resulted in their rapid adoption in solving real-world problems. The emergence of deep learning is widely attributed to a virtuous cycle whereby fundamental advancements in training deeper models were enabled by the availability of massive datasets and high-performance computer hardware. This text serves as a primer for computer architects in a new and rapidly evolving field. We review how machine learning has evolved since its inception in the 1960s and track the key developments leading up to the powerful deep learning techniques that have emerged in the last decade. Next, we review representative workloads, including the most commonly used datasets and seminal networks across a variety of domains. In addition to discussing the workloads themselves, we also detail the most popular deep learning tools and show how aspiring practitioners can use the tools with the workloads to characterize and optimize DNNs. The remainder of the book is dedicated to the design and optimization of hardware and architectures for machine learning. As high-performance hardware was so instrumental in the success of machine learning becoming a practical solution, this chapter recounts a variety of optimizations proposed recently to further improve future designs. Finally, we present a review of recent research published in the area as well as a taxonomy to help readers understand how various contributions fall in context.


Design Automation Conference (DAC) | 2018

On-chip deep neural network storage with multi-level eNVM

Marco Donato; Brandon Reagen; Lillian Pentecost; Udit Gupta; David M. Brooks; Gu-Yeon Wei

One of the biggest performance bottlenecks of today’s neural network (NN) accelerators is off-chip memory accesses [11]. In this paper, we propose a method to use multi-level, embedded nonvolatile memory (eNVM) to eliminate all off-chip weight accesses. The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm², well within the area allocated to SRAM in modern NN accelerators [6].
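Two of the ideas in the abstract, clustering weights to a small number of levels so each weight fits a multi-level cell and tolerating occasional cell faults, can be illustrated with a short NumPy sketch. The layer size, number of levels, and fault rate below are assumed values, and the simple 1-D k-means stands in for whatever clustering the paper actually uses.

```python
# Illustrative sketch (not the paper's actual flow) of weight clustering for
# multi-level eNVM cells plus random cell-fault injection. All sizes, the
# 4-level (2-bit) cell, and the fault rate are assumed for demonstration.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.05, size=100_000)       # stand-in fully connected layer

# Cluster weights into 4 levels (2 bits per cell) with a simple 1-D k-means.
levels = np.quantile(weights, [0.125, 0.375, 0.625, 0.875])  # initial centroids
for _ in range(20):
    ids = np.argmin(np.abs(weights[:, None] - levels[None, :]), axis=1)
    levels = np.array([weights[ids == k].mean() for k in range(4)])
ids = np.argmin(np.abs(weights[:, None] - levels[None, :]), axis=1)
stored = levels[ids]                              # what the eNVM array would hold

# Inject multi-level-cell faults: a fraction of cells drift to a neighboring level.
FAULT_RATE = 0.01                                 # assumed
faulty = rng.random(ids.shape) < FAULT_RATE
drift = rng.choice([-1, 1], size=ids.shape)
ids_faulty = np.clip(ids + faulty * drift, 0, 3)
read_back = levels[ids_faulty]

print(f"quantization RMSE: {np.sqrt(np.mean((weights - stored) ** 2)):.4f}")
print(f"post-fault RMSE:   {np.sqrt(np.mean((weights - read_back) ** 2)):.4f}")
```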

Collaboration


Dive into Brandon Reagen's collaborations.
