AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs

Pengfei Xu, Xiaofan Zhang, Cong Hao, Yang Zhao, Yongan Zhang, Yue Wang, Chaojian Li, Zetong Guan, Deming Chen, Yingyan Lin

Rice University, TX, USA; University of Illinois at Urbana-Champaign, IL, USA
{eiclab, zy34, yz87, yw68, cl114, zg20, yingyan.lin}@rice.edu, {xiaofan3, congh, dchen}@illinois.edu
ABSTRACT
Recent breakthroughs in Deep Neural Networks (DNNs) have fueled a growing demand for domain-specific hardware accelerators (i.e., DNN chips). However, designing DNN chips is non-trivial because: (1) mainstream DNNs have millions of parameters and operations; (2) the design space is large due to the numerous design choices of dataflows, processing elements, memory hierarchy, etc.; and (3) an algorithm/hardware co-design is needed to allow the same DNN functionality to have a different decomposition, which would require different hardware IPs that correspond to dramatically different performance/energy/area tradeoffs. Therefore, DNN chips often take months to years to design and require a large team of cross-disciplinary experts. To enable fast and effective DNN chip design, we propose AutoDNNchip, a DNN chip generator that can automatically generate both FPGA- and ASIC-based DNN chip implementations (i.e., synthesizable RTL code with an optimized algorithm-to-hardware mapping, i.e., dataflow) given DNNs from machine learning frameworks (e.g., PyTorch) for a designated application and dataset, without humans in the loop. Specifically, AutoDNNchip consists of two integrated enablers: (1) a Chip Predictor, built on top of a graph-based accelerator representation, which can accurately and efficiently predict a DNN accelerator's energy, throughput, latency, and area based on the DNN model parameters, hardware configuration, technology-based IPs, and platform constraints; and (2) a Chip Builder, which can automatically explore the design space of DNN chips (including IP selection, block configuration, resource balance, etc.), optimize the chip design via the Chip Predictor, and then generate synthesizable RTL code with optimized dataflows to achieve the target design metrics. Experimental results show that our Chip Predictor's predicted performance differs from real-measured ones by less than 10% when validated using 15 DNN models and 4 platforms (edge-FPGA/TPU/GPU and ASIC). Furthermore, both the FPGA- and ASIC-based DNN accelerators generated by our AutoDNNchip can achieve better (up to 3.86x improvement) performance than that of expert-crafted state-of-the-art accelerators, showing the effectiveness of AutoDNNchip. Our open-source code can be found at https://github.com/RICE-EIC/AutoDNNchip.git.
FPGA '20, February 23-25, 2020, Seaside, CA, USA. https://doi.org/10.1145/3373087.3375306
INTRODUCTION

We have seen the rapid adoption of Deep Neural Networks (DNNs) for solving real-life problems, such as image classification [1, 2], object detection [3], and natural language processing [4]. Although DNNs enable high-quality inferences, they also require a large amount of computation and memory during deployment due to their inherently immense complexity [5-9]. Moreover, DNN-based applications often require not only high inference accuracy, but also aggressive hardware performance, including high throughput, low end-to-end latency, and limited energy consumption. Recently, we have seen intensive studies on DNN accelerators in hardware, which attempt to take advantage of different hardware design styles, such as GPUs, FPGAs, and ASICs, to improve the speed and efficiency of DNN inference and training [10-21].

However, developing customized DNN accelerators presents significant challenges, as it asks for cross-disciplinary knowledge in machine learning, micro-architecture, and physical chip design. Specifically, to build accelerators on FPGAs or ASICs, it is inevitable to include (1) customized architectures for running DNN workloads, (2) RTL programming for implementing accelerator prototypes, and (3) reiterative verifications for validating functional correctness. The whole task requires designers to have a deep understanding of both DNN algorithms and hardware design. In response to the intense demands and challenges of designing DNN accelerators, we have seen rapid development of high-level synthesis (HLS) design flows [22-25] and DNN design automation frameworks [16, 26-30] that improve hardware design efficiency by allowing DNN accelerator design from high-level algorithmic descriptions and using pre-defined high-quality hardware IPs. Still, they either rely on hardware experts to trim down the large design space (e.g., using pre-defined/fixed architecture templates and exploring other factors [16, 29]) or conduct merely limited design exploration and optimization, hindering the development of optimal DNN accelerators that can be deployed into various platforms.

To address the challenges above, we propose AutoDNNchip, an end-to-end automation tool for generating optimized FPGA- and ASIC-based accelerators from machine learning frameworks (e.g., PyTorch/TensorFlow) and providing fast and accurate performance estimations of hardware accelerators implemented on various targeted devices. The main contributions of this paper are as follows:

• One-for-all Design Space Description. We make use of a graph-based representation that can unify design factors in all of the three design abstraction levels (the IP, architecture, and hardware-mapping levels) of DNN accelerator design, allowing highly flexible architecture configuration, scalable architecture/IP/mapping co-optimization, and algorithm-adaptive accelerator design.

• Chip Predictor. Built on top of the above design space description, we propose a DNN Chip Predictor, a multi-grained performance estimation/simulation tool, which includes a coarse-grained, analytical-model based mode and a fine-grained, run-time-simulation based mode. Experiments using 15 DNN models and 4 platforms (edge-FPGA/TPU/GPU and ASIC) show that our Chip Predictor's prediction error is within 10% of the real-measured energy/latency/resource consumption.

• Chip Builder. We further propose a DNN Chip Builder, which features a two-stage Design Space Exploration (DSE) methodology. Specifically, our Chip Builder realizes: (1) an architecture/IP design based on the Chip Predictor's coarse-grained, analytical-model based prediction for a 1st-stage fast exploration and optimization, and (2) an IP/pipeline design based on the Chip Predictor's fine-grained, run-time-simulation based prediction as a 2nd-stage IP-pipeline co-optimization. Experiments show that the Chip Builder's 1st-stage DSE can efficiently rule out infeasible choices, while its 2nd-stage co-optimization can effectively boost the performance of the remaining design candidates, e.g., a 36.46% throughput improvement and a 2.4× idle-cycle reduction.

• AutoDNNchip. Integrating the aforementioned two enablers (i.e., the Chip Predictor and the Chip Builder), we develop AutoDNNchip, which can automatically generate an optimized DNN accelerator implementation (i.e., a synthesizable RTL implementation) given user-defined DNN models from machine learning frameworks (e.g., PyTorch), application-driven specifications (e.g., energy and latency), and a resource budget (e.g., size of the processing array and memories). Experiments demonstrate that the optimized FPGA- and ASIC-based DNN accelerators generated by AutoDNNchip outperform the recent award-winning design [31] by 11% and a state-of-the-art accelerator [32] by up to 3.86×.

As an automated DNN accelerator design tool, AutoDNNchip is the first to highlight all of the following features: (1) efficient and accurate performance prediction of DNN accelerators on 4 platforms, enabling fast optimal algorithm-to-accelerator mapping design and algorithm/accelerator co-design and co-optimization; (2) a design space description that unifies the descriptions of design factors from all of the three design abstraction levels of DNN accelerators into one directed graph, supporting arbitrary accelerator architectures (e.g., both homogeneous and heterogeneous IPs and their inter-connections); and (3) automatic generation of both FPGA- and ASIC-based DNN accelerator implementations that outperform expert-crafted state-of-the-art designs for various applications.

RELATED WORK
FPGA- and ASIC-based DNN Accelerators.
There has been intensive study of customized FPGA- and ASIC-based DNN accelerators. The accelerator in [10] uses loop tiling for accelerating convolutional layers on FPGAs. The DNNBuilder accelerator [16] applies
an optimal resource allocation strategy, a fine-grained layer-based pipeline, and a column-based cache to deliver high-quality FPGA-based DNN accelerators. The work in [13] proposes a throughput-oriented accelerator with multiple levels (i.e., task, layer, loop, and operator levels) of parallelism. The recent designs in [31, 33] introduce a hardware-efficient DNN and accelerator co-design strategy by considering both algorithm and hardware optimizations, using DNN building blocks (called Bundles) to capture hardware constraints. For ASIC-based DNN accelerators, efforts have been made in both industry and academia, where representative ones include TPU [17, 18], ShiDianNao [20], and Eyeriss [21], and different accelerators exploit different optimizations for various applications.

Figure 1: Overview of the proposed AutoDNNchip framework, which accepts user-defined DNN models/datasets and application-driven specifications to automatically generate optimized FPGA- or ASIC-based DNN accelerator designs.
DNN Accelerator Performance Prediction.
For designing FPGA-based DNN accelerators, current practice usually relies on roofline models [10] or customized analytical tools [13, 16] to estimate the achievable performance. For ASIC-based accelerators, recently published designs [21, 34, 35] introduce various performance prediction methods. Eyeriss [21] proposes an energy model for capturing the energy overhead of the customized memory and computation units and a delay model that simplifies the latency calculation. Similarly, MAESTRO [34] develops an energy estimation model that considers hardware design configurations and memory access behaviors, while Timeloop [35] adopts a loop-based description of targeted workloads and analyzes the data movement and memory accesses for latency estimation.
DNN Accelerator Generation.
The tremendous need for developing FPGA-/ASIC-based DNN accelerators motivates the development of automated DNN accelerator generation. For example, DeepBurning [26] is a design automation tool for building FPGA-based DNN accelerators with customized design parameters using a pre-constructed RTL module library. DNNBuilder [16] and FP-DNN [28] propose end-to-end tools that can automatically generate optimized FPGA-based accelerators from high-level DNN symbolic descriptions in the Caffe/TensorFlow frameworks. Caffeine [27] is another automation tool that provides guidelines for choosing FPGA hardware parameters, such as the number of processing elements (PEs), the bit precision of variables, and parallel data factors. By using these automation tools, it is easier to bridge the gap between the fast construction of DNNs in popular machine learning frameworks and the slow implementation of targeted hardware accelerators.
Figure 2: AutoDNNchip's three-step design flow for the design space exploration, optimization, and DNN-to-RTL generation.
AUTODNNCHIP
Fig. 1 shows an overview of the proposed AutoDNNchip, which can automatically generate optimized FPGA- or ASIC-based DNN accelerators as well as an optimal algorithm-to-hardware mapping (i.e., dataflow), according to three customized inputs: (1) the high-level DNN descriptions trained on the desired datasets, (2) the application-driven specifications regarding DNN inference quality and hardware performance, and (3) the available resources of the targeted platforms. The realization of AutoDNNchip is achieved by the proposed One-for-all Design Space Description (see Section 4), Chip Predictor (see Section 5), and Chip Builder (see Section 6).

One of the major challenges that AutoDNNchip needs to overcome is the lack of effective representations of DNN accelerators' large design space given the numerous design choices (e.g., dataflows, the number of pipeline stages, parallelism factors, memory hierarchy, etc.), as a precise and concise representation is a precondition of valid DNN accelerator design. To address this challenge, we propose a One-for-all Design Space Description, an object-oriented, graph-based definition for DNN accelerator design that unifies the description of design factors from all of the three design abstraction levels into one directed graph. Furthermore, AutoDNNchip features another two key enablers, the Chip Predictor and the Chip Builder. Specifically, the proposed Chip Predictor can accurately and efficiently estimate the energy, throughput, latency, and area overhead of DNN accelerators based on parameters that characterize the algorithms, hardware architectures, and technology-based IPs. The proposed Chip Builder can automatically (1) explore the design space of DNN accelerators (including IP selection, block configuration, resource balance, etc.), (2) optimize chip designs via the Chip Predictor, and (3) generate synthesizable Verilog code to achieve the target design metrics.

ONE-FOR-ALL DESIGN SPACE DESCRIPTION
Overview.
It is well known that the design space of DNN accelerators can be very large. For effective and efficient design space exploration and optimization, it is critical that the design space can be precisely and concisely described, e.g., that the different design abstraction levels of optimization in DNN accelerator design, including the architecture level, IP level, and hardware-mapping level, are all considered. To this end, we adopt a One-for-all Design Space Description that unifies the design factors of the three levels into one directed graph. Table 1 lists the design factors, which are sufficient for most cases, and its last column shows the levels of design/optimization that may influence the corresponding factors. We can see that (1) most of the design factors are related to cross-level optimization, which also reflects the fact that DNN accelerators have a large design space, and (2) optimization at merely one level (or one hardware component) does not guarantee overall system performance. We thus adopt an object-oriented directed graph for the DNN accelerator design space description, an illustrative example of which is shown in Fig. 3. Specifically, a basic directed graph is first constructed using the PE array architecture, memory architecture, and mapping/dataflow factors, where each node in the graph denotes a computation/data-path/memory IP and each directed edge denotes an inter-connection between nodes whose direction is determined by the corresponding data movement's direction. Proper attributes (e.g., those in Table 2) are then assigned to the nodes and edges of the directed graph in an object-oriented manner. In the following subsections, we briefly describe four graph-based accelerator templates corresponding to four state-of-the-art DNN accelerators, which are stored in the Hardware IP Pool (see Fig. 2 under the User Specified Inputs) of AutoDNNchip together with other templates to provide a sufficient number of design candidates, and then discuss the IP attributes for the nodes and edges.

Table 1: A summary of DNN accelerators' design factors.

Design factor | Description | Back-end (b) | Opt. level
B_W, B_A, B_Acc (a) | Bit precision | F, A | IP, Accuracy req.
Freq. | Clock frequency | F, A | Arch., IP
Arch_mem | Memory tech/hierarchy/volume | A | Arch., IP, Mapping
Arch_pe | PE array architecture | F, A | Arch., IP, Mapping
Bw | Port/bus width for data transfer | A | Arch., IP
Malloc | Memory allocation | F, A | Arch., IP, Mapping
Data schedule | DNN to accelerator mapping | F, A | Arch., IP, Mapping

(a) B_W, B_A, B_Acc: bit precision for weights, activations, and accumulations; (b) A: ASIC design, F: FPGA design.

Table 2: A summary of attributes for the nodes and edges in the graph-based description.

Compo. (a) | Hardware meaning | Attributes
Node | Memory IPs | Impl., Freq., Vol., Prec., Dt., StM., E, L (b)
Node | Computation IPs | Impl., Freq., Prec., StM., E, L
Node | Data path IPs | Impl., Freq., Bw. (c), Prec., Dt., StM., E, L
Edge | IP inter-connections (IP dependency) | Start, End (d)

(a) Compo.: graph components, i.e., nodes and directed edges; (b) Impl.: implementation, e.g., 14nm DRAM, 28nm SRAM, DSP48E, AXI-bus, sync FIFO, etc.; Freq.: clock frequency (MHz); Vol.: volume/capacity (bits); Prec.: bit precision; Dt.: data type, including weights, input activations, and partial sums; StM.: the state machine storing all the states (including the needed inputs and generated outputs) through the whole execution process; E/L: energy and latency overhead; (c) Bw.: port/bus width; (d) Start & End: the starting and ending nodes of the directed edge.
Graph-based Accelerator Templates.
Fig. 4 shows four graph-based accelerator template examples for describing DNN accelerators that can be translated into real hardware implementations by applying appropriate IP attributes. Specifically, Fig. 4 (a) shows a spatial architecture based on a single adder-tree based computation IP, which is a commonly-used architecture in FPGA-based accelerators; Fig. 4 (b) is a graph with 2 different computation IPs, including a depth-wise convolutional (denoted as DW_CONV) one and a normal convolutional (denoted as CONV) one commonly adopted in compact DNN models, together with two BRAM IPs that handle the memory data arrangement for the computation IPs; Fig. 4 (c) is an architecture template for TPU [17] type DNN accelerators using a systolic array; and Fig. 4 (d) shows the graph-based representation for DNN accelerators with Eyeriss [21] type architectures, where the data path IPs (i.e., the NoC IPs in Fig. 4 (d)) between PEs describe the local data reuse patterns of inputs, outputs, and weights.

Figure 3: An illustrative example of the graph-based design space description for a heterogeneous architecture to accelerate a residual block in ResNet [36], where M and C denote the output and input channel, respectively.
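For instance, the Fig. 4 (a) style template (an off-chip DRAM and a shared AXI bus feeding input/weight buffers around a single adder-tree computation IP, with results drained back through an output buffer) amounts to a small directed graph. The sketch below is our own illustration using plain adjacency lists and hypothetical node names, not the templates actually shipped in the Hardware IP Pool:

# Directed edges follow the data-movement direction, as in Fig. 4 (a).
adder_tree_template = {
    "offchip_dram": ["axi_bus"],
    # Shared bus: loads inputs/weights from DRAM and stores outputs back.
    "axi_bus":      ["bram_input", "bram_weight", "offchip_dram"],
    "bram_input":   ["pe_array"],
    "bram_weight":  ["pe_array"],
    "pe_array":     ["bram_output"],
    "bram_output":  ["axi_bus"],
}

# A template only fixes the topology; translating it into hardware means
# attaching the Table 2 attributes (Impl., Freq., Prec., StM., ...) to each node.
for node, successors in adder_tree_template.items():
    print(f"{node} -> {successors}")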
IP Attributes.
Table 2 summarizes the attributes for the three types of node IPs, including memory (e.g., BRAM and off-chip DRAM), data access (e.g., bus), and computation hardware, that characterize the corresponding design, as elaborated below: (1) the Implementation or Impl. attribute refers to the hardware resource required for implementing the IP, e.g., DRAM and SRAM for implementing memory IPs, and AXI-bus and NoC for implementing data path IPs; (2) the state machine or StM. attribute is used to describe when the IP will update its state between computation and loading/unloading data, where each state defines both the needed input address and the generated output address. Fig. 5 shows that different pipeline designs can be captured by the IPs' state machine attribute: Fig. 5 (b) and (c) illustrate two kinds of designs (w/o and w/ the inter-IP pipeline) and their corresponding state machine definitions, respectively, where there are more states in Fig. 5 (c) for capturing the inter-IP pipeline between the data transfer and computation IPs; (3) the data precision or Prec. attribute refers to the IP's bit precision; (4) the clock frequency or Freq. and energy/latency or E/L attributes capture the operating clock frequency and the required energy/latency of the IP; and (5) the port/bus width or Bw. and memory volume or Vol. attributes refer to the port/bus width of data path IPs and the memory volume of memory IPs, respectively.

Figure 4: An illustration of 4 architecture templates in our Hardware IP Pool, including 2 architectures each for state-of-the-art FPGA- and ASIC-based DNN accelerators.

Figure 5: A toy example of an IP's state machine attribute w/o and w/ considering inter-IP pipeline effects: (a) a simple architecture with 2 IPs, i.e., one data path IP and one computation IP; the task division, state machine, and run-time process when (b) excluding the inter-IP pipeline and (c) considering the inter-IP pipeline, where SD and SC denote the states of the data path IP and the computation IP, respectively.
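One way to picture the StM. attribute and the inter-IP pipeline effect of Fig. 5 is sketched below (our own illustrative encoding, not the paper's implementation): a state machine is an ordered list of states, each recording the data tiles the IP needs before it can run and the tiles it produces; enabling the inter-IP pipeline simply splits one coarse state into finer states so that downstream IPs can start earlier.

# Each state records which data tiles must be ready before the IP can execute it
# and which tiles it produces; the fine-grained simulator steps through these.
def make_states(num_tiles: int, inter_ip_pipeline: bool):
    if not inter_ip_pipeline:
        # One coarse state: all input tiles needed before any output appears.
        return [{"needs": set(range(num_tiles)),
                 "produces": set(range(num_tiles))}]
    # With the inter-IP pipeline, each tile becomes its own state, so the
    # next IP can consume tile i while tile i+1 is still being produced.
    return [{"needs": {i}, "produces": {i}} for i in range(num_tiles)]

print(make_states(4, inter_ip_pipeline=False))
print(make_states(4, inter_ip_pipeline=True))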
CHIP PREDICTOR
As shown in Fig. 6, the proposed Chip Predictor accepts DNN models (e.g., number of layers, layer structure, precision, etc.), hardware architectures (e.g., memory hierarchy, number of PEs, NoC design, etc.), the hardware mapping, and the IP design (e.g., unit energy/delay cost of a multiply-and-accumulate (MAC) operation and of memory accesses to various memory hierarchies), and then outputs the estimated energy consumption, latency, and resource consumption when executing the DNN on the target accelerator defined by the given hardware architecture and hardware mapping. First, to capture the large search space and consider all the design abstraction levels (including the architecture, IP, and hardware-mapping levels), we construct a graph-based description that serves as one input of the Chip Predictor. Second, to match the different tradeoff requirements of the Chip Builder's two-stage DSE, which aims for efficient and accurate design space exploration and optimization, our Chip Predictor adopts a mixed-granularity prediction: (1) a coarse-grained mode that can quickly provide IP performance estimation to enable the identification of critical paths when the inter-IP pipeline is not considered, to be used for the Chip Builder's early-stage architecture and IP exploration and selection; and (2) a fine-grained mode that can perform accurate performance prediction by considering the pipeline dependency between IPs based on run-time simulations, to be used for the Chip Builder's 2nd-stage DSE that targets IP-pipeline co-optimization.

Figure 6: An overview of the proposed Chip Predictor.

Figure 7: A toy example using a systolic array, illustrating: (a) the corresponding matrix-matrix multiplication, and (b) coarse-grained and (c) fine-grained latency estimation.
Chip Predictor's Coarse-grained Mode

Overview. The Chip Predictor's coarse-grained mode is analytical-model based, i.e., it uses equations to formulate the accelerator's energy, critical-path latency, and resource consumption given a DNN model and a graph-based hardware design description (see Fig. 6). Specifically, the energy and latency of the IPs are first calculated using (1) the analytical equations described below and (2) the attributes of each IP, where the unit energy/latency costs are obtained from single-IP RTL implementations or simulations; the energy and latency consumption of the whole DNN accelerator are then formulated by considering (1) the total energy and latency of all IPs when executing the DNN model and (2) the energy and latency overhead of the CPU and on-chip controller.
Analytical-model-based Intra-IP Modeling. If we define the energy, latency, and resource utilization as E, L, and R, respectively, and use ip_comp, ip_dp, and ip_mem to denote a computation IP, data path IP, and memory IP, respectively, the energy and latency of the computation IPs can be formulated as:

E_{ip_comp} = e_0 + #states × (e_1 + e_mac × U)    (1)
L_{ip_comp} = l_0 + #states × l_mac    (2)

where #states denotes the total number of states in the IP's state machine; U denotes the unrolling factor (PE parallelism) of the computation IP; e_mac and l_mac denote the unit energy and latency costs of a MAC operation, respectively; e_0 and l_0 are the energy and latency overhead for warming up, i.e., configuring the data path and pre-loading data; and e_1 denotes the energy overhead of the run-time control of the CPU or on-chip logic units. Meanwhile, the energy and latency of the data path IPs can be formulated as:

E_{ip_dp} = e_0 + #states × (e_1 + V × e_bit)    (3)
L_{ip_dp} = l_0 + #states × (l_1 + V / P_w × l_bit)    (4)

where V denotes the total data volume (bits) to be transferred when the IP is called; P_w denotes the port width of the corresponding data path; e_bit and l_bit denote the unit energy and latency costs of each bit of data access, respectively; e_0 and l_0 are the energy and latency overhead for warming up, respectively; and e_1 and l_1 denote the energy and latency overhead of the run-time control of the CPU or on-chip logic units, respectively.

Analytical-model-based Inter-IP Modeling. For the system performance, including the energy, latency, and resource consumption of a convolutional layer or a DNN building block (e.g., the Bundle in [31, 33]), the resource and energy consumption are obtained by summing up those of all the IPs in the graph, and the total latency is calculated by summing up the latency of all the IPs on the critical path of the graph, i.e.,

R_mem = Σ_{ip_mem ∈ G} Vol_{ip_mem}    (5)
R_mul = Σ_{ip_comp ∈ G} U_{ip_comp} + R_mul_dec    (6)
E = Σ_{ip ∈ G} E_ip    (7)
L = max_{path ∈ G} Σ_{ip ∈ path} L_ip    (8)

where G denotes the whole graph; R_mem denotes the total memory volume consumption for one type of memory; and R_mul denotes the total number of multipliers used in both the computation IPs and for decoding the memory addresses, with the latter denoted as R_mul_dec. Regarding the latency estimation, the inter-IP pipeline effects are excluded in the coarse-grained mode; they are captured in the fine-grained mode of the Chip Predictor (see Section 5.3). As a toy example, Fig. 7 (b) and (c) illustrate the latency estimation when operating a matrix-matrix multiplication on a systolic array using the coarse-grained mode and the fine-grained mode, where the resulting estimated latencies are 15 and 7 cycles, respectively.
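The coarse-grained estimate can be read directly off Eqs. (1)-(8). The following Python sketch shows how the per-IP and whole-graph estimates compose; it is a simplified illustration under our own naming, and all unit costs and the toy workload are placeholders rather than calibrated values.

def computation_ip_cost(n_states, unroll, e_mac, l_mac, e0=0.0, e1=0.0, l0=0.0):
    """Eq. (1)/(2): energy and latency of a computation IP."""
    energy = e0 + n_states * (e1 + e_mac * unroll)
    latency = l0 + n_states * l_mac
    return energy, latency

def datapath_ip_cost(n_states, volume_bits, port_width, e_bit, l_bit,
                     e0=0.0, e1=0.0, l0=0.0, l1=0.0):
    """Eq. (3)/(4): energy and latency of a data path IP."""
    energy = e0 + n_states * (e1 + volume_bits * e_bit)
    latency = l0 + n_states * (l1 + volume_bits / port_width * l_bit)
    return energy, latency

def system_cost(ip_costs, paths):
    """Eq. (7)/(8): total energy is a sum over all IPs; latency is the
    longest (critical) path when the inter-IP pipeline is ignored."""
    total_energy = sum(e for e, _ in ip_costs.values())
    critical_latency = max(sum(ip_costs[ip][1] for ip in path) for path in paths)
    return total_energy, critical_latency

# Toy numbers only: one data-path IP feeding one computation IP.
costs = {"axi": datapath_ip_cost(4, 4096, 64, e_bit=1e-12, l_bit=1e-9),
         "pe":  computation_ip_cost(4, 16, e_mac=2e-12, l_mac=5e-9)}
print(system_cost(costs, paths=[("axi", "pe")]))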
Chip Predictor's Fine-grained Mode

Overview. In the fine-grained mode of the Chip Predictor, we adopt: (1) a run-time simulation (Algorithm 1) that captures the pipeline dependency between IPs for estimating the inter-IP performance; and (2) the Chip Predictor's coarse-grained mode to get the IPs' energy and latency for estimating the intra-IP performance.
Implementation. The run-time simulation algorithm is described in Algorithm 1, where each IP (denoted as ip) has (1) its neighbour IPs on the graph defined as ip.prev and ip.next, respectively, such that ip uses the data from ip.prev as its inputs and passes its outputs to ip.next; and (2) a state machine that stores the different states (including the needed inputs and generated outputs) through the whole execution process. For each clock cycle in the simulation, ip can jump to the next state when (1) it has finished generating all the outputs of its current state (i.e., ip is in an idle status) and (2) ip.prev has generated all the inputs ip needs for the next state. If ip is in an idle status but its needed inputs are not ready from ip.prev, it continues to wait in the idle status, resulting in an increase of the idle cycles associated with this IP; if ip is in a busy status, it generates its outputs and jumps to an idle status when it finishes generating all the outputs of this state.

Algorithm 1: Run-time simulation in the fine-grained Chip Predictor
1: Input: one accelerator design described by graph G;
2: For each edge in G:
3:   ip_start ← edge's starting node;
4:   ip_end ← edge's ending node;
5:   Add ip_start to ip_end.prev;
6:   Add ip_end to ip_start.next;
7: Initialize energy and latency: E = 0, cycles = 0;
8: While not all inference outputs are stored back:
9:   cycles ← cycles + 1;
10:  For each ip in G:
11:    If (ip is idle) & (all needed inputs ∈ outputs of ip.prev):
12:      ip ← busy;
13:      ip jumps to the next state;
14:    If (ip is idle) & (not all needed inputs ∈ outputs of ip.prev):
15:      ip.idle_cycles ← ip.idle_cycles + 1;
16:    If (ip is busy) & (not all outputs of ip are ready):
17:      update the ready outputs of ip;
18:    If (ip is busy) & (all outputs of ip are ready):
19:      ip ← idle;
20: E ← Σ_{ip ∈ G} E_ip;
21: L ← cycles / global_clk_freq;
22: ip_bottleneck ← the ip with the minimum idle cycles.
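A compact Python rendering of Algorithm 1 is sketched below. It is a simplification under our own naming (it assumes every IP exposes the state-machine attribute described in the One-for-all Design Space Description plus a per-state energy cost), not the paper's implementation.

def run_time_simulation(ips, edges, clk_freq_hz):
    """Cycle-level simulation in the spirit of Algorithm 1.
    ips: dict name -> {"states": [...], "energy_per_state": float}
    edges: list of (src, dst) following the data-movement direction."""
    prev = {n: [] for n in ips}
    for src, dst in edges:                      # lines 2-6: build prev lists
        prev[dst].append(src)

    st = {n: {"state": 0, "busy_left": 0, "idle": 0, "done": set(), "energy": 0.0}
          for n in ips}
    cycles = 0

    def finished(n):
        return st[n]["state"] >= len(ips[n]["states"])

    while not all(finished(n) for n in ips):    # line 8: until all outputs stored
        cycles += 1                             # line 9
        for n in ips:                           # line 10: visit every IP
            s = st[n]
            if finished(n):
                continue
            if s["busy_left"] == 0:             # IP is idle
                needed = ips[n]["states"][s["state"]]["needs"]
                ready = (set().union(*[st[p]["done"] for p in prev[n]])
                         if prev[n] else needed)
                if needed <= ready:             # lines 11-13: inputs ready -> busy
                    s["busy_left"] = ips[n]["states"][s["state"]].get("cycles", 1)
                    s["energy"] += ips[n]["energy_per_state"]
                else:                           # lines 14-15: keep waiting
                    s["idle"] += 1
            else:                               # lines 16-19: busy -> produce outputs
                s["busy_left"] -= 1
                if s["busy_left"] == 0:
                    s["done"] |= ips[n]["states"][s["state"]]["produces"]
                    s["state"] += 1

    energy = sum(s["energy"] for s in st.values())      # line 20
    latency = cycles / clk_freq_hz                      # line 21
    bottleneck = min(st, key=lambda n: st[n]["idle"])   # line 22
    return energy, latency, bottleneck

# Toy usage: a data-path IP streaming 3 tiles into a compute IP (1 cycle/state).
ips = {"dp": {"states": [{"needs": {i}, "produces": {i}} for i in range(3)],
              "energy_per_state": 1.0},
       "pe": {"states": [{"needs": {i}, "produces": {i}} for i in range(3)],
              "energy_per_state": 2.0}}
print(run_time_simulation(ips, [("dp", "pe")], clk_freq_hz=200e6))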
For better understanding, Fig. 7 uses a toy example to show that the Chip Predictor's fine-grained mode (see Fig. 7 (c)) can more accurately estimate the required latency than its coarse-grained mode. In this 3×3 systolic-array example, the fine-grained mode accounts for the inter-IP pipeline among the MAC IPs and therefore estimates 7 cycles instead of 15. In Algorithm 2 (Section 6), the Chip Builder will launch the Chip Predictor to simulate the whole graph iteratively in order to generate an optimal design for the whole accelerator system.
CHIP BUILDER
Fig. 2 elaborates the design flow of AutoDNNchip that leverages the Chip Builder's two-stage DSE engine. To effectively explore the design space (e.g., the design factors in Table 1), AutoDNNchip involves three major steps as shown in Fig. 2: (1) the 1st-stage DSE, an early-stage architecture and IP configuration exploration to efficiently rule out infeasible designs using the Chip Predictor's coarse-grained mode; (2) the 2nd-stage DSE, an inter-IP pipeline exploration and IP optimization to effectively boost the performance of the remaining design candidates resulting from the 1st-stage DSE; and (3) a design validation through RTL generation and execution.
Step I. Early-Stage Architecture and IP Configuration Exploration.
As shown in the middle part of Fig. 2, this step considers the following exploration. First, the DNN model from a mainstream machine learning framework is applied to the DNN parser to extract the DNN layer information, e.g., layer types (CONV, Pooling, ReLU, Reorg [31], etc.), feature map inter-connections (Concat, Add, etc.), and layer shapes (shapes of the weight and feature map tensors). Second, according to the given DNN model, performance requirements (e.g., latency and throughput), and hardware budgets (e.g., the resource and power budget of the FPGA or ASIC), a design space of size N is generated by fetching commonly-used or promising hardware architecture templates and hardware IP templates from the Hardware IP Pool. For example, when the given resource budgets are tight, a folded hardware architecture will be chosen instead of a flattened one, whereas flattened structures, which facilitate IP pipelines, are preferred when there are sufficient budgets. Third, an architecture and IP configuration optimization is then performed to rule out most of the infeasible choices and trim down the design space to N_1 (N_1 < N) promising candidates, e.g., those that are more efficient with a lower latency. This fast early exploration makes use of the analytical nature of the Chip Predictor's coarse-grained mode.

Algorithm 2: IP-pipeline co-optimization using the Chip Builder
1: Input: design space D_G with N_1 graphs;
2: For each G in D_G:
3:   For each edge in G:
4:     ip_start ← edge's starting node;
5:     ip_end ← edge's ending node;
6:     Add ip_start to ip_end.prev;
7:     Add ip_end to ip_start.next;
8:   While the simulated (using Algorithm 1) latency L_G does not converge:
9:     ip ← the simulated bottleneck IP (i.e., ip_bottleneck from Algorithm 1);
10:    If the inter-IP pipeline is adopted for ip and ip.next:
11:      allocate more resource to ip;
12:    Else:
13:      adopt the inter-IP pipeline between ip and ip.next;
14:      update the state machine of ip;
15:      update the state machine of ip.next;
16: Select the top N_opt candidates in D_G.
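Reusing a fine-grained simulator such as the one sketched in Section 5, the co-optimization loop of Algorithm 2 can be outlined as follows. This is our own sketch: `simulate` is a stand-in for the Chip Predictor's fine-grained mode, and the graph methods (`has_pipeline`, `allocate_more_resource`, `insert_inter_ip_pipeline`) are hypothetical placeholders for the corresponding updates.

def co_optimize(graphs, simulate, n_opt, max_iters=50):
    """Sketch of Algorithm 2: for every candidate graph, repeatedly fix the
    bottleneck IP until the simulated latency converges, then keep the best."""
    results = []
    for g in graphs:
        prev_latency = float("inf")
        for _ in range(max_iters):
            energy, latency, bottleneck = simulate(g)   # fine-grained Chip Predictor
            if latency >= prev_latency:                 # line 8: converged
                break
            prev_latency = latency
            if g.has_pipeline(bottleneck):              # lines 10-11
                g.allocate_more_resource(bottleneck)
            else:                                       # lines 12-15
                g.insert_inter_ip_pipeline(bottleneck)
        results.append((prev_latency, energy, g))
    results.sort(key=lambda r: (r[0], r[1]))            # line 16: rank candidates
    return [g for _, _, g in results[:n_opt]]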
Step II. Inter-IP Pipeline Exploration and IP Optimization.
This step accepts the resulting N_1 designs and performs further exploration and IP optimization using Algorithm 2. First, inter-IP pipelines are inserted into different locations of the corresponding computation graphs, resulting in a new design space of size N_2, i.e., N_2 new graphs with different inter-IP pipeline designs. Second, for each of these graphs, the bottleneck IPs are recorded during the run-time simulation (Algorithm 1) and iteratively optimized according to the Chip Predictor's fine-grained mode's predicted performance, as shown in Algorithm 2. Third, the top N_opt design candidates are chosen according to the Chip Predictor's predicted energy consumption and/or latency, and then passed to the next step for validation through RTL generation and execution.
Step III. Design Validation through RTL Generation and Execution.
In this step, we generate RTL code for the top N_opt optimized designs through an automated code generation procedure: (1) for the FPGA back-end, the generated files include the testbench for a board-level implementation, the binary file for the quantized-and-reordered weights, and the C code for the HLS IP implementation. We use Vivado [22] to actually generate the bitstream and meanwhile eliminate the designs that fail in place and route (PnR) to guarantee that AutoDNNchip's generated designs are valid; (2) for the ASIC back-end, the generated files include the RTL testbench for the DNN model, the quantized-and-reordered weights, the synthesizable RTL code, and the memory specifications. The RTL code can be further passed to EDA tools such as Design Compiler and IC Compiler to generate the gate-level/layout netlist, during which a memory compiler can take the memory specifications to generate the memory design. After this step, all the output designs are fully validated with correct functionality.
EXPERIMENT RESULTS

In this section, we evaluate the proposed AutoDNNchip on 20 DNN models across 4 platforms (3 edge devices including an edge FPGA/TPU/GPU and 2 ASIC-based accelerators).
Chip Predictor
Methodology and Setup. Table 3 summarizes the details of our validation experiments for the Chip Predictor, including the platforms, performance metrics, DNN models, methods to obtain the unit parameters, employed precision of the weights and activations, and frequency of the corresponding computation core.

Methodology. In order to conduct a solid validation, we validate the Chip Predictor by comparing its predicted performance with actual device-measured results on 3 edge devices (Ultra96 FPGA [37], edge TPU [18], and Jetson TX2 [38]) and with the paper-reported results of 2 published ASIC-based accelerators (Eyeriss [21] and ShiDianNao [20]), when adopting the same experiment settings (e.g., clock frequency, DNN model and dataset, bit precision, architecture design, and dataflow).

Benchmark DNN Models and Datasets. For the 3 edge devices, we consider 15 representative compact/light-weight DNN models (see Table 4 and Table 5, where the models in Table 5 use the ImageNet dataset [39] and the models in Table 4 use the dataset of the System Design Contest of the DAC 2019 conference [40]); for the 2 published DNN accelerators, we use the same benchmark models and datasets as the original papers.

Unit Parameters. The unit energy/latency parameters are obtained through either real-device measurements or synthesized RTL implementations, as mentioned in Section 5. For the 3 edge devices, we measure the unit energy and latency by running the basic IP operations (such as memory accesses and MAC computation) over multiple sets of experiments under different settings and averaging the energy and latency values to get the unit parameters. Specifically, for memory accesses, we change the clock frequency, memory volume, port width, bit precision, and burst read length; for the MAC operations, the clock frequency, total number of MACs, and parallelism of the MACs are changed. For the ASIC-based accelerators, the unit parameters are obtained either from the paper [21] or from gate-level simulations of the synthesized RTL implementation on the same CMOS technology.
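As an illustration of how such unit costs can be distilled from the averaged measurements (this is our own toy example, not the exact calibration procedure), one can sweep a single knob, e.g., the transferred data volume, and fit the per-bit energy together with the fixed warm-up overhead by least squares:

import numpy as np

# Measured energies (J) of a memory-access micro-benchmark for several
# transfer volumes (bits); the numbers are made up for illustration.
volumes = np.array([1e3, 1e4, 1e5, 1e6])
energies = np.array([2.1e-7, 1.2e-6, 1.05e-5, 1.01e-4])

# Model: E(V) = e0 + e_bit * V  ->  linear least-squares fit.
A = np.vstack([np.ones_like(volumes), volumes]).T
(e0, e_bit), *_ = np.linalg.lstsq(A, energies, rcond=None)
print(f"warm-up energy e0 = {e0:.3e} J, unit energy e_bit = {e_bit:.3e} J/bit")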
Table 3: Experiment settings for the Chip Predictor's cross-platform/model/design/dataset validation.

Arch/Device | Metrics (a) | DNNs (b) | Unit Param. (c) | Precision (d) | Freq. (e)
Ultra96 FPGA [37] | E, L, R | Compact | Measured | <11, 9> | 220
Edge TPU [18] | E, L | Compact | Measured | <8, 8> | 500
Jetson TX2 [38] | E, L | Compact | Measured | <32, 32> | 1300
Eyeriss [21] | E, L, R | AlexNet | Reported | <16, 16> | 250
ShiDianNao [20] | E | Small | Synthesized | <16, 16> | 1000

(a) Metrics: E: energy, L: latency, R: resource; (b) DNN benchmarks: Compact: the 15 compact DNN models in Table 4 and Table 5; AlexNet [41]; Small: the DNNs used in [20] (< 5 convolutional/fully-connected layers); (c) methods to obtain the unit parameters; (d) bit precision for the different types of data; (e) clock frequency (MHz).
Table 4: The 10 model variants of the SkyNet backbone [31].

DNN | SK | SK1 | SK2 | SK3 | SK4 | SK5 | SK6 | SK7 | SK8 | SK9
Size (MB) |  |  |  |  |  |  |  |  |  | 
Layers | 14 | 14 | 14 | 14 | 17 | 14 | 16 | 14 | 14 | 17
Bypass | ✓ | ✓ | ✓ | ✓ | ✓ | - | - | - | - | -
Table 5: The 5 model variants of MobileNetV2 [42] using different channel scaling factors and input resolutions.

DNN | V-Model 1 | V-Model 2 | V-Model 3 | V-Model 4 | V-Model 5
Resolution | 128 | 128 | 224 | 224 | 224
Channel scaling | 0.5 | 1.0 | 0.5 | 1.0 | 1.4
Validation of the Predicted Energy Consumption.
We compare the Chip Predictor's predicted energy with the measured results from 3 edge devices, including the Ultra96 FPGA [37] (edge FPGA), edge TPU [18], and Jetson TX2 (edge GPU) [38], under the same settings (see Table 3). Fig. 8 summarizes the validation results and shows that the maximum prediction error of our Chip Predictor is 9.17% across all 15 DNN models and the 3 platforms.

We further validate the Chip Predictor by comparing it to 2 state-of-the-art ASIC-based accelerators: Eyeriss [21] and ShiDianNao [20]. For Eyeriss [21], we first compare the predicted energy breakdown of the first and fifth convolutional layers of AlexNet, of which the maximum error is 5.15% and 1.64%, respectively, and then compare the predicted number of DRAM and SRAM accesses of AlexNet's convolutional layers (see Fig. 9 (b)), where the maximum error is 9.64%; the Chip Predictor can be straightforwardly extended to include other stride values. Note that the prediction errors of the DRAM accesses in the last three layers are relatively large, because the input data are compressed to save DRAM accesses in [21] and we lack their information regarding the input data sparsity. The validation results over ShiDianNao [20] are listed in Table 6. By showing the average energy over the 10 DNN benchmarks of the 4 IPs in [20], we verify that the maximum prediction error is 9.59%, where the error is mainly due to the difference between our adopted commercial CMOS IP library and the one used in [20].

Figure 8: The energy prediction error of the Chip Predictor when using the 15 DNN models in Table 4 and Table 5 running on 3 edge devices including an edge FPGA [37], Edge TPU [18], and edge GPU [38].

Figure 9: The Chip Predictor's energy prediction error considering the Eyeriss architecture [43]: (a) the energy breakdown for AlexNet's 1st and 5th convolutional layers and (b) the number of DRAM and SRAM accesses of the convolutional layers.
Table 6: The energy prediction error of the Chip Predictor when using the architecture of ShiDianNao [20]: the energy breakdown over 10 benchmarks.

IP | Computation | Input SRAM | Output SRAM | Weight SRAM
Predicted (%) | 89.2 | 7.4 | 1.7 | 1.6
Paper-reported (%) | 89.0 | 8.0 | 1.6 | 1.5
Prediction error | 0.35% | -7.19% | 9.59% | 7.87%
Validation of the Predicted Latency.
The latency prediction of the Chip Predictor is validated against the measured results of the same 15 DNN models on the 3 edge devices (the edge GPU, the Ultra96 FPGA board, and the edge TPU), as shown in Fig. 10; the maximum latency prediction error of our Chip Predictor is 9.75%. We also compare the predicted latency with the paper-reported latency of Eyeriss [21] when processing the 5 convolutional layers of AlexNet, as listed in Table 7, where the prediction error is within 4.12%. The latency predicted by the Chip Predictor is smaller than the paper-reported results, as the Chip Predictor does not consider the special scenario in which the accelerator needs to access the memory multiple times for one single wordline of data. Such a scenario only happens when one wordline of data is physically stored in multiple wordlines of the memory. The Chip Predictor can be extended to include such a case by configuring the corresponding memory data arrangements.
Validation of the Predicted Resource Consumption.
Table 8 summarizes the Chip Predictor's predicted resource consumption based on the experiments using the Ultra96 FPGA [37]. Specifically, the predicted resource consumption for the 2 critical on-chip resources of FPGAs, DSP48E and BRAM18K, is validated against that obtained from the post-implementation utilization reports, and has a corresponding prediction error smaller than 4.2% and 3.2%, respectively.

Figure 10: The latency prediction error of the Chip Predictor when using the 15 DNN models on 3 edge devices including the Ultra96 FPGA board, Edge TPU, and Jetson TX2 (edge GPU).

Table 7: The latency prediction error of the Chip Predictor when using the architecture of Eyeriss [21]: the latency when processing the 5 convolutional layers of AlexNet.

AlexNet layer | CONV1 | CONV2 | CONV3 | CONV4 | CONV5
Predicted latency (ms) | 16.04 | 37.58 | 21.09 | 15.59 | 9.79
Paper-reported latency (ms) | 16.5 | 39.2 | 21.8 | 16 | 10
Prediction error | -2.88% | -4.12% | -3.24% | -2.56% | -2.14%

Table 8: The resource consumption prediction error of the Chip Predictor on the Ultra96 FPGA board when considering 6 different designs given 6 different resource budgets.

Resource type | Val. | Bg. 1 | Bg. 2 | Bg. 3 | Bg. 4 | Bg. 5 | Bg. 6
DSP48E | Predicted | 35 | 69 | 141 | 213 | 285 | 331
DSP48E | Measured | 36 | 72 | 144 | 216 | 288 | 360
DSP48E | Error | -3.2% | -4.2% | -2.4% | -1.2% | -1.0% | -0.8%
BRAM18K | Predicted | 65 | 87 | 175 | 265 | 354 | 446
BRAM18K | Measured | 64 | 86 | 173 | 259 | 346 | 432
BRAM18K | Error | +1.0% | +0.8% | +1.2% | +2.2% | +2.4% | +3.2%
Chip Builder and AutoDNNchip
In this subsection, we evaluate the performance of the proposed Chip Builder, which makes use of the time-efficient and accurate Chip Predictor to perform an effective two-stage DSE efficiently, and of AutoDNNchip. Specifically, we study the performance of the resulting DNN accelerators generated and optimized by the Chip Builder and AutoDNNchip. First, we show experiment results to visualize the Chip Builder's two-stage DSE process; second, we study the performance improvement resulting from the Chip Builder's second-stage IP-pipeline co-optimization in terms of the bottleneck blocks' latency and idle-cycle reduction; finally, we validate the effectiveness of AutoDNNchip by comparing the performance of its generated FPGA- and ASIC-based accelerators (i.e., the corresponding RTL implementations) with that of state-of-the-art designs under the same conditions.
Evaluation Setup. In this set of experiments, we consider the application-driven specifications and constraints summarized in Table 9, where the throughput requirement and power budget are set to meet real-time applications of visual recognition (e.g., image classification and object detection [3]) on edge devices. For the FPGA-based accelerator design, we use a state-of-the-art edge device, the Ultra96 FPGA board [37], whose resource budget is fixed; for the ASIC-based accelerator design, we evaluate our generated designs through RTL simulations. Regarding the design space exploration, we use Algorithm 2.
Visualizing the Chip Builder's Two-stage DSE Process. For demonstrating the effectiveness of the Chip Builder's two-stage DSE engine, here we visualize the DSE process when using AutoDNNchip to design an FPGA-based accelerator that achieves competitive performance with the award-winning state-of-the-art design in [31], given the same target performance specification/constraint, FPGA board, DNN model, and dataset. The FPGA-measured energy consumption of both the resulting design from AutoDNNchip and the reported one of [31] are marked in purple in Fig. 11. It demonstrates that: (1) the DSE engine of the Chip Builder can effectively trim down the design choices and generate optimized designs with better performance compared to the state-of-the-art design published in [31]; without humans in the loop, AutoDNNchip can indeed automatically generate DNN accelerators that achieve optimized performance; (2) most of the design choices can be efficiently ruled out by the 1st stage of the DSE engine, i.e., the early-stage exploration based on the Chip Predictor's coarse-grained analytical performance estimation; and (3) the 2nd-stage IP-pipeline co-optimization of the Chip Builder can effectively boost the performance, i.e., the throughput of the DNN accelerators here (up to a 36.46% and an average of 28.92% improvement), as compared to that of the designs resulting from the 1st-stage DSE. The final generated design candidates in the HLS code format are passed to Vivado [22] for implementation. Then, we eliminate the designs that fail in the PnR step, as shown in Fig. 11, and find an optimal design from the remaining ones. As a reference point, the 1st-stage DSE takes about 0.65 ms for each design point and only 0.8 hours for exploring a total of 4.6 million design points when running on an Intel Core i5 CPU with a single thread, thanks to the analytical nature of the Chip Predictor.

Table 9: The considered application-driven specifications (i.e., throughput requirement) and constraints (i.e., power and resource budget) when evaluating the Chip Builder's generated FPGA- and ASIC-based DNN accelerators.

Target back-end | Application | Opt. Obj. | Th./P. Req. | Res. Budget
Ultra96 FPGA | Object detection | E, L | 20 FPS / 10 W | DSP=360, FF=141120, LUT=70560, BRAM=432
ASIC | Vision tasks [20] | E, L | 15 FPS / 600 mW | On-chip SRAM=128KB
Evaluation of the Chip Builder's 2nd-stage Optimization. Fig. 12 summarizes the evaluation experiments for the Chip Builder's 2nd-stage optimization process. As described in Section 6, this stage targets an IP-pipeline co-optimization and thus can lead to a more balanced pipeline and more efficient resource allocation. From Fig. 12, we can see that the Chip Builder's 2nd-stage optimization can achieve up to a 2.4× idle-cycle reduction when optimizing the design of SkyNet's 6 blocks [31] on the Ultra96 edge FPGA board [37].

Figure 11: Visualizing the energy consumption per image and the processing latency of the resulting designs from the Chip Builder's 1st- and 2nd-stage optimization, when using AutoDNNchip to design an FPGA-based accelerator for meeting the performance of a state-of-the-art design [31] given the same performance specification/constraint, FPGA board, DNN model, and dataset (see Table 9).
Evaluation of AutoDNNchip's Generated FPGA-based Accelerators. Fig. 11 shows that the DNN accelerator generated by AutoDNNchip can apparently outperform the recent award-winning design [31]. We further conduct another set of experiments to compare the performance of AutoDNNchip's generated DNN accelerators on the Ultra96 FPGA board with that of a mobile CPU (Pixel2 XL [32]), when both designs (1) adopt the settings in Table 3 and the 10 DNN models in Table 4, and (2) try to minimize the latency for time-critical applications. Note that the DNN mapping to the mobile CPU is optimized using TensorFlow Lite [44]. Fig. 13 illustrates the corresponding latency vs. energy efficiency, where the results under the same DNN models are marked with markers of the same shape. We can see that AutoDNNchip-generated accelerators consistently achieve a smaller latency than the baselines under the same DNN model and settings while having a similar (<15% difference) energy efficiency. Specifically, AutoDNNchip-generated accelerators achieve an average latency reduction of 3.86× while having a slightly better (10%) or worse (differing by <15%) energy efficiency, demonstrating the effectiveness of AutoDNNchip in generating optimized FPGA-based accelerators.

Figure 12: The busy and idle cycles of the bottleneck IPs in SkyNet's 6 different blocks, before and after conducting the Chip Builder's 2nd-stage IP-pipeline co-optimization, when using AutoDNNchip to generate designs for the Ultra96 FPGA board with the same target performance as [31].

Figure 13: Processing latency and energy efficiency on the Ultra96 FPGA compared with a mobile device (Pixel2 XL) on 10 compact DNN models using the same bit precision.
Evaluation of AutoDNNchip's Generated ASIC-based Accelerators. Fig. 14 illustrates that AutoDNNchip can indeed generate ASIC-based accelerators that lead to an optimal tradeoff between latency and energy consumption, by visualizing the latency vs. energy consumption of the generated accelerators, where dots with different colors correspond to designs based on different hardware templates. Furthermore, we evaluate the performance of AutoDNNchip-generated ASIC-based accelerators by comparing their energy consumption with that of a state-of-the-art ASIC-based accelerator [20] on 5 shallow neural networks, which are used in [20] for performance evaluation, with both having the same throughput constraint as shown in Table 9. Fig. 15 shows the comparison, where all the energy consumption values in Fig. 15 are obtained from RTL implementation and simulation. We can see that AutoDNNchip-generated ASIC-based accelerators consistently outperform [20] on all 5 networks, with the energy consumption improvement ranging from 7.9% to 58.3%, demonstrating the effectiveness of AutoDNNchip in generating optimized ASIC-based accelerators.

For the aforementioned set of experiments, we first use the application-driven performance and constraints (see Table 9) to perform the design space exploration and then validate the generated designs using RTL simulations adopting the same clock frequency (1 GHz) and technology (65nm) as our baseline [20]. Specifically, the DSE process optimizes the accelerators' energy-delay product while considering different (1) hardware templates with three different architectures [17, 20, 21] (denoted as templates 1/2/3 in Fig. 14) and (2) memory sizes, among other design factors.

Figure 14: Visualizing the latency vs. energy consumption per image of the ASIC-based accelerators in the design space pool, when using AutoDNNchip to design an ASIC-based accelerator for meeting the performance of a state-of-the-art ASIC-based accelerator [20], with both having the same performance constraints, DNN model, and dataset (see Table 9).

Figure 15: Comparing the normalized energy consumption between the AutoDNNchip-generated ASIC-based accelerators and [20], when accelerating 5 shallow neural networks under the same throughput requirement.
CONCLUSION

To close the gap between the growing demand for DNN accelerators with various specifications and the time-consuming and challenging DNN accelerator design process, we develop AutoDNNchip, which can automatically generate both FPGA- and ASIC-based DNN accelerators. Experiments using over 20 DNN models and 4 platforms show that the DNN accelerators generated by AutoDNNchip outperform state-of-the-art designs by up to 3.86×. Specifically, AutoDNNchip is made possible by the proposed one-for-all design space description, Chip Predictor, and Chip Builder. Experiments based on 15 DNN models and 4 platforms demonstrate that the Chip Predictor's prediction error is smaller than 10% compared with real-measured results, and that the Chip Builder can effectively and efficiently perform design space exploration and optimization.
ACKNOWLEDGMENTS
This work is supported in part by the NSF RTML grant 1937592 and NSF grant 1801865, the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR), and XMotors.ai.
REFERENCES
[1] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," CoRR, vol. abs/1409.1556, 2014.
[2] Y. Wang, Z. Jiang, X. Chen, P. Xu, Y. Zhao, Y. Lin, and Z. Wang, "E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings," in Advances in Neural Information Processing Systems, pp. 5139-5151, 2019.
[3] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, pp. 91-99, 2015.
[4] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "The Microsoft 2016 conversational speech recognition system," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 5255-5259, IEEE, 2017.
[5] S. Liu, Y. Lin, Z. Zhou, K. Nan, H. Liu, and J. Du, "On-demand deep model compression for mobile devices: A usage-driven model selection framework," in Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, pp. 389-400, ACM, 2018.
[6] Y. Wang, T. Nguyen, Y. Zhao, Z. Wang, Y. Lin, and R. Baraniuk, "EnergyNet: Energy-efficient dynamic inference," in Advances in Neural Information Processing Systems (Workshop), 2018.
[7] J. Shen, Y. Fu, Y. Wang, P. Xu, Z. Wang, and Y. Lin, "Fractional Skipping: Towards Finer-Grained Dynamic Inference," in The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
[8] J. Wu, Y. Wang, Z. Wu, Z. Wang, A. Veeraraghavan, and Y. Lin, "Deep k-Means: Re-Training and Parameter Sharing with Harder Cluster Assignments for Compressing Deep Convolutions," in Thirty-fifth International Conference on Machine Learning, 2018.
[9] Y. Wang, J. Shen, T.-K. Hu, P. Xu, T. Nguyen, R. Baraniuk, Z. Wang, and Y. Lin, "Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference," IEEE Journal of Selected Topics in Signal Processing, 2019.
[10] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the International Symposium on Field-Programmable Gate Arrays, pp. 161-170, ACM, 2015.
[11] Y. Lin, S. Zhang, and N. Shanbhag, "Variation-Tolerant Architectures for Convolutional Neural Networks in the Near Threshold Voltage Regime," pp. 17-22, Oct 2016.
[12] S. Liu, A. Papakonstantinou, H. Wang, and D. Chen, "Real-time object tracking system on FPGAs," pp. 1-7, IEEE, 2011.
[13] Z. Liu, Y. Dou, J. Jiang, J. Xu, S. Li, Y. Zhou, and Y. Xu, "Throughput-optimized FPGA accelerator for deep convolutional neural networks," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 10, no. 3, p. 17, 2017.
[14] X. Zhang, X. Liu, A. Ramachandran, C. Zhuge, S. Tang, P. Ouyang, Z. Cheng, K. Rupnow, and D. Chen, "High-performance video content recognition with long-term recurrent convolutional network for FPGA," pp. 1-4, IEEE, 2017.
[15] C. Zhuge, X. Liu, X. Zhang, S. Gummadi, J. Xiong, and D. Chen, "Face recognition with hybrid efficient convolution algorithms on FPGAs," in Proceedings of the 2018 Great Lakes Symposium on VLSI, pp. 123-128, ACM, 2018.
[16] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, "DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs," in Proceedings of the International Conference on Computer-Aided Design, p. 56, ACM, 2018.
[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., "In-datacenter performance analysis of a tensor processing unit," pp. 1-12, IEEE, 2017.
[18] Google Inc., "Edge TPU." https://coral.withgoogle.com/docs/edgetpu/faq/, accessed 2019-09-01.
[19] Y. Lin and J. R. Cavallaro, "Energy-efficient convolutional neural networks via statistical error compensated near threshold computing," pp. 1-5, May 2018.
[20] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in
ACM SIGARCHComputer Architecture News , vol. 43, pp. 92–104, ACM, 2015.[21] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficientdataflow for convolutional neural networks,” in
Computer Architecture (ISCA),2016 ACM/IEEE 43th Annual International Symposium on
SRC TechCon , vol. 5, 2005. [24] D. Chen, J. Cong, Y. Fan, and L. Wan, “Lopass: A low-power architectural synthesissystem for FPGAs with interconnect estimation and optimization,”
IEEE Transac-tions on Very Large Scale Integration (VLSI) Systems , vol. 18, no. 4, pp. 564–577,2009.[25] K. Rupnow, Y. Liang, Y. Li, D. Min, M. Do, and D. Chen, “High level synthesis ofstereo matching: Productivity, performance, and software constraints,” in , pp. 1–8, IEEE, 2011.[26] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, “DeepBurning: automatic generation ofFPGA-based learning accelerators for the neural network family,” in
Proceedingsof the 53rd Annual Design Automation Conference , p. 110, ACM, 2016.[27] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards uni-formed representation and acceleration for deep convolutional neural networks,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems ,2018.[28] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong,“FP-DNN: An automated framework for mapping deep neural networks ontoFPGAs with RTL-HLS hybrid templates,” in , pp. 152–159, IEEE, 2017.[29] R. Venkatesan, Y. S. Shao, M. Wang, J. Clemons, S. Dai, M. Fojtik, B. Keller,A. Klinefelter, N. Pinckney, P. Raina, et al. , “MAGNet: A Modular AcceleratorGenerator for Neural Networks,” in
Proceedings of the International Conference onComputer-Aided Design (ICCAD) , 2019.[30] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, “Design flow of accelerat-ing hybrid extremely low bit-width neural network in embedded FPGA,” in ,2018.[31] X. Zhang, H. Lu, C. Hao, J. Li, B. Cheng, Y. Li, K. Rupnow, J. Xiong, T. Huang,H. Shi, et al. , “SkyNet: a Hardware-Efficient Method for Object Detection andTracking on Embedded Systems,” arXiv preprint arXiv:1909.09709 , 2019.[32] Google Inc., “Pixel Phone 2 XL.” https://store.google.com/product/pixel_3?srp=/product/pixel_2/, accessed 2019-09-01.[33] C. Hao, X. Zhang, Y. Li, S. Huang, J. Xiong, K. Rupnow, W.-m. Hwu, and D. Chen,“FPGA/DNN Co-Design: An efficient design methodology for IoT intelligence onthe edge,” in
Proceedings of the Design Automation Conference , p. 206, ACM, 2019.[34] H. Kwon, M. Pellauer, and T. Krishna, “MAESTRO: an open-source infrastruc-ture for modeling dataflows within deep learning accelerators,” arXiv preprintarXiv:1805.02566 , 2018.[35] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan,B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A Systematic Approach to DNNAccelerator Evaluation,” in , pp. 304–315, IEEE, 2019.[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”in
Proceedings of the IEEE conference on computer vision and pattern recognition
In CVPR , 2009.[40] J. Hu, J. Goeders, P. Brisk, Y. Wang, G. Luo, and B. Yu, “2019 DAC system designcontest on low power object detection,”
When Accuracy meets Power: 2019 DACSystem Design Contest on Low Power Object Detection , 2019.[41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with DeepConvolutional Neural Networks,” in
Advances in Neural Information ProcessingSystems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.),pp. 1097–1105, Curran Associates, Inc., 2012.[42] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2:Inverted Residuals and Linear Bottlenecks,” in
The IEEE Conference on ComputerVision and Pattern Recognition (CVPR) , June 2018.[43] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficientreconfigurable accelerator for deep convolutional neural networks,”