Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration
Zhi-Gang Liu*, Paul N. Whatmough*, and Matthew Mattina
Arm ML Research Lab, Boston, MA, USA
Abstract—Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead. In this paper, we address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware. We describe a time unrolled formulation of variable density-bound block (VDBB) sparsity that allows for a configurable number of non-zero elements per block, at constant utilization. We then describe a systolic array microarchitecture that implements this scheme, with two data reuse optimizations. Firstly, we increase reuse in both operands and partial products by increasing the number of MACs per PE. Secondly, we introduce a novel approach of moving the IM2COL transform into the hardware, which allows us to achieve a 3× data bandwidth expansion just before the operands are consumed by the datapath, reducing the SRAM power consumption. The optimizations for weight sparsity, activation sparsity and data reuse are all interrelated and therefore the optimal combination is not obvious. Therefore, we perform a design space evaluation to find the pareto-optimal design characteristics. The resulting design achieves 16.8 TOPS/W in 16nm with modest 50% model sparsity and scales with model sparsity up to 55.7 TOPS/W at 87.5%. As well as successfully demonstrating the variable DBB technique, this result significantly outperforms previously reported sparse CNN accelerators.

I. INTRODUCTION
Convolutional neural network (CNN) inference has quickly become an important workload on IoT [12], [13], [22] and mobile computing devices [14], [42], which has spurred the development of hardware accelerators in mobile SoCs [19], [36]. CNNs are fundamentally composed of many layers of multi-channel 2D convolution, interspersed with non-linear activation functions. The convolutions are usually lowered at runtime to general matrix multiplication (GEMM) by linearizing the 3D feature maps into a 2D structure using the IM2COL function [35].
The resulting GEMMs are usually compute-bound, O(N³), and heavily dominate the runtime of CNN inference. Therefore, GEMM is an obvious target for acceleration [38], and being compute bound, the speedup justifies the extra silicon real estate. For mobile computing devices, INT8 CNN inference accelerators demand high energy efficiency (TOPS/W) and area efficiency (TOPS/mm²) to achieve performance and price differentiation.

* Authors with equal contribution.

Fig. 1: Sparse matrix encodings, red denotes non-zero element. BZ is block size, and NNZ is non-zero elements per block.

Data sparsity can be exploited in CNN inference accelerators [16], [30], [32], as zeros in the data reduce the theoretical compute and storage cost significantly. However, traditional sparse matrix multiplication (sGEMM) from scientific workloads only generates speedup at very high sparsity. Density bound block (DBB) sparsity [21], [27] instead introduces a bound on the number of non-zero elements in each block (Fig. 1(c)). DBB sparsity exhibits the hardware advantages of block sparsity and the NN performance of random sparsity. DBB has even been implemented in the recently announced Nvidia A100 GPU product [28], which cites 2× speedup for 50% sparsity.

A significant limitation of the two DBB architectures published to date is that the sparsity is fixed at design time: 75% in [21], and 50% in [28]. As a result, any model that does not meet this fixed sparsity level will be forced to fall back to dense execution with no gains. Even worse, any models that achieve even higher sparsity than the fixed level will also see no further gains. Therefore, a fixed sparsity level severely limits the usefulness of DBB for broader deployment, as any commercial products are forced to choose a modest sparsity level to best suit the average customer. The challenge in supporting variable sparsity levels is that the number of MACs required per fixed amount of weights read from SRAM changes. For a fixed provisioning of hardware MACs, this leads to a proportional drop in utilization, which directly impacts energy efficiency and area efficiency.

In this paper, we introduce a novel variable DBB technique using a time unrolled architecture. We demonstrate this in a reuse optimized accelerator and demonstrate state-of-the-art results. The contributions of this paper are summarized below:

• Variable Density Bound Block (VDBB):
In previous work, the DBB compression is fixed [21], [28], which is a big impediment to broader deployment, because CNNs can vary widely in their weight sparsity. This paper describes a variable DBB (VDBB) architecture achieved through time unrolling, which supports all structured sparsity ratios from 12.5% (1/8) up to fully dense (8/8), achieving both speedup and energy efficiency as sparsity increases.

• Reuse Optimized VDBB Microarchitecture:
We describe an accelerator microarchitecture to implement time unrolled variable DBB. At the datapath array level, we implement a systolic tensor array (STA) composed of a more complex PE called a tensor PE (TPE), which increases reuse and better amortizes the cost of data movement. To decrease SRAM read power we introduce a novel hardware IM2COL unit after the SRAM and just before the datapath, which achieves a 3× bandwidth magnification for typical 3×3 convolutions.

• Design Space and Evaluation:
The combination of time unrolled VDBB and the reuse optimized implementation results in a large number of parameters, which all have an interlinked impact on performance, area and energy, such that the optimal design point is not obvious. Therefore, we finally enumerate the design space and describe the pareto-optimal design choices. Results in 16nm for INT8 at 1GHz show the optimal nominal 4 TOPS accelerator has effective throughput and energy that scale strongly with model sparsity, demonstrating 16.8 TOPS/W (50% model sparsity) up to 55.7 TOPS/W (87.5% model sparsity). This is more than 8× greater energy efficiency compared to the previously published Laconic [32].

The remainder of the paper is organized as follows. Section II gives background on DBB sparsity and the motivation for variable sparsity support. Section III introduces the time unrolled variable DBB scheme. Section IV describes the accelerator microarchitecture. Section V outlines the methodology, and Section VI presents the evaluation results. Section VII surveys related work, and Section VIII concludes.
Fig. 2: Density Bound Block (DBB) structured sparsity constraints result in a maximum of NNZ non-zero values per block of size BZ, when the weight tensors (e.g., 3×3, 1×1) are blocked depthwise over the channel dimension.

II. BACKGROUND AND MOTIVATION
The main advantage of DBB weight compression for hardware deployment is that it maintains the regularity of GEMM. This results in speedup proportional to the compression rate, high utilization, and low index storage and manipulation overheads, which are in any case amortized over the block size. In this section we give an overview of the DBB approach and discuss preliminaries. We also explain the limitations of the fixed sparsity ratio used in the previous work.
A. DBB Overview
Both the weights and activations of CNNs exhibit sparsity. However, while the weights are known in advance and can be influenced during training, activations depend on the input image and therefore their sparsity is more difficult to influence. Therefore, in this work, we apply DBB to weight tensors. On top of this, we describe clock-gating schemes to exploit activation sparsity. Density bound block [21] imposes a simple constraint on the sparsity of a block of BZ elements, such that there are at most NNZ non-zero elements per block. Fig. 2 gives a concrete example, using a block size of 8×1.
The tensor blocking is performed depthwise (i.e. over the channel dimension), such that the elements of any single block are taken along the channel dimension at a fixed spatial position of the 3×3 or 1×1 kernel.

Model          Dataset    Baseline Acc.(%)   DBB Acc.(%)   Total NNZ   Sparsity (%)
LeNet-5        MNIST      99.1               98.7          1.05K       75 (2/8)
ConvNet        CIFAR10    86.0               85.3          26.8K       75 (2/8)
ResNet-50V1    ImageNet   75.2               74.2          8.79M       62.5 (3/8)
VGG-16         ImageNet   71.5               71.4          5.39M       62.5 (3/8)
MobileNetV1    ImageNet   70.9               69.8          1.6M        50 (4/8)
(Note: convolution layers only.)
TABLE I: CNNs trained with INT8 DBB weights with a block size of 8. The maximum block sparsity achieved for these benchmark models varies from 50% (4/8) to 75% (2/8).

When the tensors are blocked in this fashion, they can be trivially compressed in two steps. First, the non-zero elements are stored by removing the zeros. Secondly, a simple bitmask index M is added to encode the presence of a non-zero element at each location in the expanded block (size BZ). The resulting compressed size is 8·NNZ + BZ bits, assuming an INT8 word size, giving a compression ratio of 8·BZ / (8·NNZ + BZ). Any blocks that have fewer than NNZ non-zero elements will include one or more zero elements in the encoded form.
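As an illustration of this encoding, the short Python sketch below (our own illustrative model, not the accelerator's hardware or the training tools) compresses and decompresses one DBB block using the bitmask index M, and evaluates the compression ratio 8·BZ/(8·NNZ + BZ) for INT8 words.

```python
def dbb_compress(block, nnz):
    """Compress one DBB block: keep the non-zero values plus a BZ-bit mask.

    block: BZ INT8 weights containing at most `nnz` non-zero entries.
    Returns (values, bitmask); values is zero-padded up to length nnz for
    blocks that are under-full, as described in Section II-A.
    """
    assert sum(1 for w in block if w != 0) <= nnz, "block violates the DBB bound"
    values = [w for w in block if w != 0]
    values += [0] * (nnz - len(values))                     # pad under-full blocks
    bitmask = sum(1 << i for i, w in enumerate(block) if w != 0)
    return values, bitmask


def dbb_decompress(values, bitmask, bz):
    """Expand a compressed block back to its dense BZ-element form."""
    it = iter(values)
    return [next(it) if (bitmask >> i) & 1 else 0 for i in range(bz)]


def compression_ratio(bz, nnz):
    """Dense bits / compressed bits for INT8 words: 8*BZ / (8*NNZ + BZ)."""
    return 8 * bz / (8 * nnz + bz)


# Example: an 8-element block with two non-zeros (2/8 density, 75% sparsity).
vals, mask = dbb_compress([0, -23, 0, 0, -4, 0, 0, 0], nnz=2)
assert dbb_decompress(vals, mask, bz=8) == [0, -23, 0, 0, -4, 0, 0, 0]
print(vals, bin(mask), round(compression_ratio(8, 2), 2))   # [-23, -4] 0b10010 2.67
```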
B. Training DBB CNN Models

CNNs must be specially trained to meet the DBB constraint. To demonstrate the feasibility of this, we trained five CNNs, applying conventional INT8 quantization and magnitude-based DBB pruning to VGG-16, MobileNetV1, ResNet-50V1, 5-layer ConvNet and LeNet-5 on the ImageNet, CIFAR10 and MNIST datasets. The DBB sparsity hyperparameter was optimized for each model. For MobileNet, we apply DBB to the pointwise (1×1) convolutions only. As Table I shows, DBB pruning achieves significant weight compression while maintaining reasonable test accuracy with INT8 quantization, and while preserving regularity.
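To make the pruning step concrete, the sketch below is our own simplified illustration (not the training code used to produce Table I; the function name and NumPy-based formulation are ours): within each depthwise block of BZ weights, only the NNZ largest-magnitude values are kept. In practice the mask is applied during fine-tuning rather than as a single post-training step.

```python
import numpy as np

def dbb_prune(weights, bz=8, nnz=4):
    """Magnitude-based DBB pruning: in every block of `bz` consecutive weights
    along the channel axis, zero all but the `nnz` largest-magnitude values.

    weights: array whose last axis is the channel dimension, with size a
    multiple of bz (e.g. a 1x1 or 3x3 kernel laid out channels-last).
    """
    blocks = weights.reshape(-1, bz).copy()
    # Positions of the (bz - nnz) smallest magnitudes in each block get zeroed.
    drop = np.argsort(np.abs(blocks), axis=1)[:, : bz - nnz]
    np.put_along_axis(blocks, drop, 0, axis=1)
    return blocks.reshape(weights.shape)

# One kernel position across 8 channels, pruned to 2/8 density (75% sparsity).
w = np.array([3, -1, 0, 7, -2, 5, 1, -4], dtype=np.int8)
print(dbb_prune(w, bz=8, nnz=2))   # [0 0 0 7 0 5 0 0]
```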
C. DBB Parameters

There are only two key parameters for DBB: the block size (BZ) and the density bound given by NNZ/BZ.
In general, a larger block size introduces a less severe constraint on the optimization process, but increases the hardware cost. A larger block size also provides a greater granularity of sparsity levels. To illustrate this, Table II shows the training sensitivity to the block size (BZ) using 8-bit quantized LeNet-5 on the MNIST dataset, following the methodology in Section V-A. For a given sparsity ratio, DBB pruned LeNet-5 models with larger block sizes clearly achieve better prediction accuracy.

(In this paper we routinely refer to the block density as a ratio of NNZ/BZ, but also use the sparsity given as a percentage.)

           BZ=2     BZ=4     BZ=8     BZ=16
NNZ=1      99.0%    98.7%    98.2%    97.9%
NNZ=2      –        99.1%    98.9%    98.6%
NNZ=4      –        –        99.1%    99.1%
TABLE II: Accuracy sensitivity to DBB block size (BZ) and number of non-zeros (NNZ) for 8-bit LeNet-5 on MNIST. Accuracy increases with block size at equal sparsity ratio. Cell colors indicate equal compression ratios of NNZ/BZ.

For example, the ratio of 1/4 (NNZ=1 and BZ=4) achieves 98.7% accuracy, but the same compression ratio with a higher block size, e.g. 4/16, gives better accuracy (99.1%). Based on the DBB pruning results in Table I and an analysis of the hardware cost, we use a block size of 8. Previous work uses a block size of 8 in [21] and 4 in [28]. Note that any models trained with a block size of 4 are guaranteed to be supported on hardware with a block size of 8, as a block 4 model will always satisfy the same sparsity ratio in block 8 format.
D. Motivation

We showed in Table I that popular CNN architectures achieve a fairly wide range of DBB pruning ratios, which vary depending on the dataset, the network architecture and even the training recipe. Smaller models such as LeNet-5 and ConvNet can be pruned down to 1/4 of their original size, so would ideally benefit from a block sparsity rate of 2/8. However, very compact models such as MobileNet are notoriously tricky to train and optimize and achieve about 50% compression at best, which requires a 4/8 block. We are also very likely to encounter a variety of sparsity levels within a single model. For example, it is very common to avoid optimizing the first layer of a large CNN, as this often damages accuracy. It is also possible to optimize sparsity per-layer or even per-channel to extract the most from the model. Therefore, all of this points towards the need to support a range of structured sparsity ratios natively in the hardware.
Previous implementations of DBB have demonstrated fixed sparsity: Kang et al. [21] implements fixed DBB with a 2/8 (75%) block, and the Nvidia A100 GPU [28] implements a 2/4 block. However, the fixed block sparsity ratios are a practical limitation, as models with more sparsity will see no further improvement, and models with less sparsity will have to fall back to dense GEMM. In this work, we demonstrate an effective approach to implement DBB with continuously variable block sparsity. We also demonstrate the first structured sparsity systolic array, which heavily emphasizes data reuse.

III. VARIABLE DENSITY BOUND BLOCK (VDBB)

As we outlined in the previous section, hardware support for variable sparsity DBB (VDBB) is highly desirable. However, varying the density bound leads to hardware under-utilization. In this section we will discuss this issue more concretely and present an architecture solution to the VDBB requirement.
Fig. 3: Spatially unrolled datapaths, all with effective throughput of 16 Ops/cycle. (a) Conventional dense datapath with no benefit from sparsity. (b) With random sparsity, we can clock gate (CG) on zero operands to proportionally reduce compute power while lowering utilization. However, this does not reduce data movement power or area. (c) A DBB datapath designed for 4/8 block sparsity has the same effective throughput as (a), but requires 62.5% operand bandwidth, and about half the area/power. The block sparsity (4/8) is fixed at silicon design time. (d) A model with higher sparsity (2/8) has little advantage, as the hardware is designed for 4/8 block sparsity. (e) A model with lower sparsity (e.g., 6/8) is not natively supported.
A conventional (dense) datapath is shown in Fig. 3(a), where a block of 8 weights (W) is multiplied by a corresponding block of activations (A). The most obvious approach to compute a sparse block is to parallelize the operations across independent hardware MACs, i.e. spatially unroll the block. For random weight sparsity, we can add a simple mechanism to detect zero operands and clock gate (CG) the relevant MAC lane [7], [31], as shown in Fig. 3(b). This reduces the compute power (proportionally to the sparsity), but also reduces the utilization of the hardware, which impacts area efficiency (TOPS/mm²). It is also challenging to reduce data storage cost with random sparsity, due to the unpredictable number of non-zero elements per fixed SRAM access size.

In contrast, DBB results in a predictable number of non-zero (NNZ) elements per block, which means we can easily reduce both compute and data movement. For example, Fig. 3(c) illustrates a datapath that supports a 4/8 density-bound block and achieves the same throughput as Fig. 3(a). The DBB version requires only four hardware MACs, each augmented with an 8:1 mux to steer the correct activation element into the MAC. The select signal for the mux is driven by the positional index metadata (M), which is an additional byte of overhead per block in this example. Note that the implementation of DBB by Kang [21] is similar to this, but with a 2/8 block (75% sparsity).

However, as we noted in Section II, real world models can exhibit a very wide variety of sparsity levels, while the fixed DBB hardware in Fig. 3(c) can only support a single fixed block sparsity ratio. If the sparsity is higher than the 4/8 design point, e.g., the 2/8 case shown in Fig. 3(d), then the utilization drops, limiting the TOPS/mm² advantage on sparser models. Conversely, lower sparsity models, such as the 6/8 example in Fig. 3(e), are not supported and it is necessary to fall back to dense execution, which offers no benefit at all. Therefore, instead of fixing the sparsity at design time, we would instead like to support all sparsity ratios from very sparse (1/8 density) up to fully dense (8/8), i.e., from 87.5% to 0% sparsity.
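For illustration, the following sketch (our own behavioral model of a Fig. 3(c) style datapath, not the RTL) computes one fixed 4/8 DBB block with four MAC lanes, each lane using the positional index metadata M to select its activation through an 8:1 mux.

```python
def fixed_dbb_block(nz_weights, index, activations):
    """One spatially unrolled fixed-DBB block (Fig. 3(c) style).

    nz_weights : the four non-zero weights of a 4/8 block (one per MAC lane).
    index      : position of each weight within the 8-element block, i.e. the
                 metadata M that drives each lane's 8:1 activation mux.
    activations: the 8 activations of the corresponding dense block.
    All four lanes operate in the same cycle, so utilization is only 100%
    when the model is exactly 4/8 sparse.
    """
    assert len(nz_weights) == len(index) == 4 and len(activations) == 8
    return sum(w * activations[i] for w, i in zip(nz_weights, index))

# Block [2, 0, -1, 0, 0, 3, 0, 5] stored as values [2, -1, 3, 5], index [0, 2, 5, 7].
acc = fixed_dbb_block([2, -1, 3, 5], [0, 2, 5, 7], [1, 1, 2, 2, 3, 3, 4, 4])
print(acc)   # 2*1 + (-1)*2 + 3*3 + 5*4 = 29
```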
B. Time Unrolled Block Architecture

The big challenge with supporting variable density bound blocks (VDBB) in hardware is that as the weight sparsity rate is increased, the hardware utilization decreases, which leads to low energy and area efficiency. If we implement fixed 4/8 DBB (50% DBB compression), a model that achieves 2/8 would result in a utilization drop of roughly 50%. On top of this, executing a model with sparsity lower than 50% is not supported, other than by treating it as a dense model.

To get around this issue with spatially unrolled DBB architectures [21], we instead implement the DBB hardware by unrolling the block in the time dimension. This simply means that we process one element of the density bound block per cycle using a single MAC per block. The key advantage of this arrangement is that we can now freely vary the block sparsity, while the datapath utilization and the operand bandwidth both remain constant. The number of cycles per block is the only thing that varies as we change the sparsity, i.e. the effective throughput increases with sparsity. For example, Fig. 4(a) shows a conventional dense block unrolled in the time dimension and requiring eight cycles to compute on a single MAC. In Fig. 4(b)–(e), we illustrate that NNZ can be freely varied, with the number of clock cycles required to compute the block being equal to NNZ. At the extreme, a very sparse model with 1/8 DBB sparsity only requires 1 cycle per block (8× speedup).

Although the illustrative diagrams in Fig. 3 and Fig. 4 show both the zero and non-zero elements of the 8-element block, the key idea with DBB is that the data can be trivially compressed (Section II), by storing only the non-zero elements and the index metadata M. Therefore, the non-zero elements of the weight block are consumed one per cycle, and the skipping of zero elements is achieved implicitly. The corresponding activation element is then muxed into the MAC. Note that a complex reordering buffer is not required to implement this, and it results in very high energy and area efficiency.
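The sketch below (again our own behavioral model, not the hardware) captures this time unrolled scheme: a single MAC consumes one non-zero weight per cycle, the bitmask index M selects the matching activation, and the cycle count per block is simply NNZ, so effective throughput grows with sparsity at constant utilization.

```python
def vdbb_block(nz_weights, bitmask, activations, bz=8):
    """One time unrolled VDBB block on a single MAC (S8DP1 style).

    nz_weights : the NNZ non-zero weights of the block (NNZ may be 1..bz).
    bitmask    : bz-bit index M; bit i set means block position i is non-zero.
    activations: the bz activations of the corresponding dense block.
    Returns (accumulator, cycles); cycles equals NNZ, while utilization and
    operand bandwidth stay constant for any block sparsity.
    """
    positions = [i for i in range(bz) if (bitmask >> i) & 1]
    assert len(positions) == len(nz_weights) <= bz
    acc = 0
    for w, pos in zip(nz_weights, positions):   # one non-zero weight per cycle
        acc += w * activations[pos]             # index M steers the mux selection
    return acc, len(nz_weights)

# A 2/8 block needs only 2 cycles (4x faster than the dense 8/8 case).
acc, cycles = vdbb_block([-23, -4], 0b00010010, [1, 2, 3, 4, 5, 6, 7, 8])
print(acc, cycles)   # -23*2 + (-4)*5 = -66, in 2 cycles
```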
Fig. 4: Time unrolled structured sparse block processing, which allows a continuously variable NNZ per block, while retaining 100% hardware utilization and constant operand bandwidth. All NNZ options are supported, from the fully dense case (a), through to 87.5% sparsity (e).
IV. ACCELERATOR
This section describes an extremely efficient VDBB implementation that aggressively optimizes five types of data reuse. The proposed accelerator (Fig. 5) includes a Systolic Tensor Array (STA), local SRAMs for weights and activations, an IM2COL activation bandwidth magnifier, and Arm Cortex-M33 microcontrollers for DMA and vector compute operations.
A. Datapath Array
The systolic array (SA) is a very efficient and scalable hardware implementation of GEMM, due to the local register-to-register operand reuse. Hence, implementing VDBB in a systolic architecture greatly improves energy and area efficiency. We achieve this by generalizing the SA into the systolic tensor array (STA), which is amenable to supporting DBB and VDBB.

Systolic Tensor Array (STA): The classic systolic array (SA) of Fig. 6(a) consists of an M×N array of PEs. Each PE is a single MAC with INT8 operand (OPR) pipeline registers and an INT32 accumulator (ACC) register. We use an output stationary dataflow, which allows the larger INT32 accumulators to remain stationary. The Systolic Tensor Array (STA) of Fig. 6(b) extends the SA concept with a more complex PE called a tensor PE (TPE). The TPE accepts a tensor (matrix) of weights and a tensor of activations per cycle, instead of a single weight and activation. Instead of computing a single MAC per cycle, each TPE essentially processes a small matrix multiplication on the input matrices, producing an A×C output tile using B-way dot products (DP). This increases the MACs-to-operands ratio, which we refer to as intra-TPE reuse, while moving from a scalar MAC to a dot product introduces accumulator reuse.
Fig. 5: VDBB accelerator microarchitecture consisting of the TPE datapath array, local SRAM, IM2COL unit, and M33 MCUs.

In the remainder of this paper, we uniquely denote an STA configuration as A×B×C M×N. Fig. 7(a) illustrates the STA dataflow, which is similar to the classic SA, but with tensor (i.e. sub-matrix) operands in place of scalar operands. (The classic SA of Fig. 6(a) is the special case 1×1×1 M×N. Dot-product architectures similar to DaDianNao [8] correspond to a single wide-dot-product TPE, i.e. M = N = 1.)

Adding DBB Support (STA-DBB): Next, we add support for DBB weight matrices. DBB allows us to reduce the number of MACs per block from BZ to NNZ, which reduces the area by the compression factor NNZ/BZ. Each dot product (DP) also requires an additional multiplexer (mux) in front of each MAC to select the activation element that corresponds to the non-zero weight, indicated by the bitmask index, M. In the example of Fig. 6(c), each S4DP2 unit accepts a dense activation input [A0, A1, A2, A3] from the left, and a 2/4 DBB compressed weight input [W0, W1] with its associated 2-bit non-zero index from the top. Compared to a conventional (dense) DP4, each S4DP2 trades two MACs for two 8-bit 4:1 datapath multiplexers (MUX), where each MUX costs significantly less than a MAC. Although this architecture still supports conventional dense GEMM, it only supports a single sparsity ratio (50% in this example). Fig. 7(a) illustrates the corresponding dataflow for this STA-DBB example, multiplying a dense activation matrix by a DBB-compressed weight matrix.

Adding VDBB Support (STA-VDBB): Finally, we implement time unrolled variable DBB (Section III-B) as an efficient STA-VDBB, by modifying the TPE (Fig. 6(d)). We retain the MUX at the input to the MAC on the activation side to select the required activation according to the bitmask index, M.
Fig. 6: (a) The systolic array (SA) is efficient because operands read from SRAM are reused many times as they propagate through PEs in the M×N array. (b) The systolic tensor array (STA) generalizes the scalar PE into a tensor PE (TPE), which accepts two tensor operands and performs a small matrix multiplication on each cycle. This allows us to introduce intra-TPE reuse and accumulator reuse, increasing the ratio of compute to data movement. (c) Fixed DBB is implemented inside the STA by adding a mux to the activation input on the dot product. (d) Finally, variable DBB is implemented by switching to multiple single MACs to allow time unrolling. Notation: A×B×C M×N denotes an M×N 2-D array of A×B×C TPEs (red box). DP2 denotes a 2-way dot-product into a single accumulator register. S4DP2 denotes a 2-way sparse dot-product (SDP) with a 4:1 mux in the activation path for DBB sparsity.

But to support VDBB, we move from a wide dot product with accumulator sharing to a single MAC with an accumulator dedicated to a single block (S8DP1). Most importantly, as we are unrolling in time, the occupancy (number of clock cycles) of the S8DP1 unit for each block depends on the compression ratio. Fig. 7(b) illustrates a corresponding dataflow for computing a 4×16 by 16×8 matrix multiplication (A × W, respectively), where W can be compressed into (4×8) in 2/8 DBB format. A and the compressed W are then partitioned into tiles matching the TPE operand dimensions, and the tiles of W are input one row per clock. For each TPE, the corresponding tensor input from the left edge is delayed until the whole block is complete, i.e., a 2-cycle occupancy for this example. Due to the DBB sparsity, all TPEs have the same occupancy in a computing stream. For higher DBB compression ratios, the TPE has a lower number of occupancy cycles, resulting in higher throughput and area/energy efficiency.

Array Design Trade-offs: Table III summarizes the key differences between the conventional SA, the STA, and the sparsity optimized STA-DBB and STA-VDBB designs. This highlights the analytical benefit we achieve in both inter- and intra-TPE reuse. However, there are some trade-offs between the items listed, which we touch upon in the summary below (a short numerical sketch of the reuse metrics follows the list):
• Inter-TPE Operand Reuse: An M×N systolic array features weight and activation operand reuse, O(M) and O(N) respectively, which amortizes the relatively high cost of reading operands from SRAM at the array edge.
• Intra-TPE Operand Reuse: STA extends the array-level reuse by introducing additional operand reuse inside the TPE itself. This further amortizes data movement, by locally performing a small matrix multiply (with many MACs) on the input tensor operands inside the TPE.

Trade-offs          SA           STA                STA-DBB             STA-VDBB
MACs per TPE        1            A×B×C              A×b×C               A×C
ACCs per TPE        1            A×C                A×C                 A×C
OPRs per TPE        2            B(A+C)             AB+bC               AB+nC
Inter-TPE Reuse     MN/(M+N)     AMCN/(AM+CN)       AbCMN/(ABM+CbN)     AnCMN/(ABM+CnN)
Intra-TPE Reuse     1/2          AC/(A+C)           AbC/(AB+bC)         AnC/(AB+nC)
ACC Reuse           1            B                  b                   1
A Sparsity CG       ✓            ✗                  ✗                   ✓
W Sparsity          ✗            ✗                  Fixed DBB           Variable DBB
(Inter-TPE reuse: array MACs / array input operands. Intra-TPE reuse: TPE MACs / TPE input operands.)

TABLE III: Summary of array design trade-offs. b indicates the number of MACs in the SDP unit of the STA-DBB datapath. n denotes NNZ for the block. The STA-VDBB array increases inter- and intra-TPE reuse, while also supporting VDBB weight sparsity and random activation sparsity clock gating (CG).
• Accumulator Reuse: STA introduces accumulator reuse whenever a wide dot product is used. Accumulator reuse increases area efficiency by increasing the MACs-to-registers ratio, using a more efficient carry-save implementation in the dot product datapath.
• Weight Structured Sparsity: To support VDBB in STA, we must switch from wide dot products to single MACs, sacrificing the area reduction from accumulator reuse. However, the advantages of VDBB far outweigh this concession.
Fig. 7: DBB and VDBB dataflow examples on small arrays: (a) example dataflow for the STA-DBB case; (b) example dataflow for the STA-VDBB case.
• Activation Sparsity: Finally, we can also exploit activation sparsity on top of VDBB weight sparsity, by clock gating (CG) the datapath on zero activations. This cannot be applied to wide dot products, as all inputs would have to be zero, which is very unlikely. However, we can apply it to STA-VDBB, which in any case uses single MACs.
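To make the reuse expressions in Table III concrete, the following back-of-the-envelope sketch (our own, with arbitrary example dimensions rather than the paper's design points) evaluates the inter- and intra-TPE reuse of a dense STA.

```python
def sta_reuse(A, B, C, M, N):
    """Reuse metrics of a dense STA, following the expressions in Table III.

    Inter-TPE reuse = array MACs / array input operands per cycle
                    = A*B*C*M*N / (B*(A*M + C*N)) = A*C*M*N / (A*M + C*N)
    Intra-TPE reuse = TPE MACs / TPE input operands
                    = A*B*C / (B*(A + C))         = A*C / (A + C)
    """
    inter = (A * C * M * N) / (A * M + C * N)
    intra = (A * C) / (A + C)
    return inter, intra

# Classic systolic array (1x1x1 TPEs): reuse comes only from the array itself.
print(sta_reuse(1, 1, 1, 8, 8))   # (4.0, 0.5)
# An illustrative tensor-PE configuration (example numbers, not the paper's design point).
print(sta_reuse(4, 4, 4, 8, 8))   # (16.0, 2.0)
```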
B. Local SRAM
As is commonplace for accelerators, we heavily leverage local software-managed SRAM [26] to provide a low-cost operand supply for the datapath array. The weight buffer (WB) is 0.5MB and the activation buffer (AB) is 2MB (Fig. 5). Due to the array architecture, the SRAM is grouped together rather than distributed, so we are able to choose large SRAM instances which fully amortize the cost of the SRAM periphery against the bitcell array. We further tune the bank muxing parameter to balance power and area. The AB and WB are both double buffered to allow them to be shared between the datapath and the local MCUs. The input image is loaded into the AB before operation begins, via DMA from the MCUs.
C. IM2COL Unit
The main drawback of GEMM (compared to native convolution) is the memory footprint overhead from IM2COL expansion. IM2COL is used to linearize 3D volumes of CNN feature map and kernel data, in order to process them using GEMM. If the stride is less than the kernel size, IM2COL results in duplicated pixels in the output. This leads to an increased storage requirement, and higher SRAM read power. We directly address this issue by adding a new hardware unit that functions as an SRAM read bandwidth magnifier (Fig. 5). To do this, we implement IM2COL in hardware on activations as late as possible in the microarchitecture: after the data is read from local SRAM, and just before it reaches the datapath (Fig. 5). This allows us to achieve the lower memory footprint of native convolution, while taking advantage of the compute regularity of GEMM, which can be more readily optimized in hardware. The net result of the late IM2COL hardware unit is a reduction in SRAM read bandwidth for typical 3×3 convolutions, giving roughly a 3× average SRAM read reduction.
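The effect can be modeled with a simple read-count comparison. The sketch below is our own illustration of one plausible reuse scheme consistent with the ~3× figure (each column of activations is fetched once per output row and reused by the three horizontally overlapping 3×3 patches); the actual hardware organization in Fig. 8 may differ.

```python
def im2col_reads(H, W, k=3, stride=1):
    """Activation SRAM reads for one H x W channel with a k x k kernel.

    expanded : reads if the IM2COL matrix itself is stored and streamed
               (every element of every patch is fetched).
    magnified: reads if each k-tall pixel column is fetched once per output
               row and reused by the k horizontally overlapping patches,
               which a late IM2COL unit can do with a small buffer.
    """
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    expanded = out_h * out_w * k * k        # full IM2COL footprint
    magnified = out_h * W * k               # one pass over the pixel columns per output row
    return expanded, magnified

expanded, magnified = im2col_reads(56, 56)  # e.g. a 56x56 feature map
print(expanded / magnified)                 # ~2.9x fewer reads for 3x3, stride 1
```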
D. Local MCU with SIMD

Although matrix multiplication represents by far the majority of the computation for CNN inference, there are a number of ancillary operators that must be supported to allow the whole model to be processed in place, without moving intermediate results back to the CPU. These operators include activation functions, pooling, scaling, batch normalization and data type casting. We implement these in software, using Arm Cortex-M33 [1] microcontrollers (MCUs). The M33 MCU has 32-bit SIMD instructions that can encode up to four INT8 operations [2]. A small 64KB SRAM is included as a program store for the M33, which has minimal impact on the area efficiency. Control and data movement tasks are also performed by the MCU. The number of M33 MCUs required depends on the peak throughput: 2 are sufficient for a design with 2 TOPS peak throughput, 4 for 4 TOPS and 8 for 16 TOPS. The silicon area of the M33 in 16nm technology is very small at 0.008 mm², and the typical power consumption is 3.9 µW/MHz [1].
V. METHODOLOGY
A. DBB CNN Training
Models must be specially trained to exploit DBB. To demonstrate this, we trained five popular benchmark CNN models with INT8 quantization and magnitude-based DBB pruning, as described in Section II-B (Table I).

Fig. 8: Overview of the hardware IM2COL unit: (a) hardware; (b) IM2COL operation for a 6×6 example input. Multiple IM2COL outputs are generated per cycle, reducing SRAM bandwidth by 3× for a typical 3×3 kernel.

B. RTL Generator
The proposed accelerator (Section IV) encompasses a scalable family of configurations. It is also regular and amenable to hierarchical implementation and validation. We implemented a parameterized Python RTL generator to rapidly and precisely explore the full design space. This generator produces synthesizable Verilog RTL for the accelerator, along with a testbench suite. Designs can be generated with arbitrary dimensions of A×B×C M×N, along with optional support for DBB and VDBB sparsity, activation clock gating options, etc. Each design is automatically validated in Synopsys VCS using the generated testbench, which can execute inference on a CNN model while logging value change dump (VCD) switching activity traces for each design on a given CNN.
C. Physical Design and Evaluation
The generated RTL was implemented in TSMC 16nm FinFET technology to evaluate circuit area, power dissipation and clock frequency. We also implemented a design in TSMC 65nm LP bulk CMOS to allow fair comparison with results reported in the older technology. The EDA tool flow used consists of Synopsys and Cadence tools, which we use with the TSMC PDK and Arm multi-Vt standard cell libraries. The single-ported SRAM instances were carefully chosen from the options available in an Arm SRAM compiler. The design was constrained for a 1GHz clock period at the slow corner, with multiple process, voltage and temperature corners used for setup and hold timing. The 65nm design achieved 500MHz at the slow corner. Power analysis was performed at the typical corner, using Synopsys PrimeTime PX, with the parasitic-annotated netlist and switching activity from VCD simulation traces.

While architecting and analyzing the performance of random-sparse accelerators [30] is challenging due to the potentially widely varying sparsity patterns, DBB sparse models have fixed sparsity and easily predictable runtime. We evaluated each design using RTL simulation in Synopsys VCS running our DBB INT8 CNN models (Table I). This generates accurate performance (throughput) metrics, which vary depending on the dimensions of the weight and activation matrices in each layer. For power consumption analysis, we capture VCD traces in RTL simulation from representative layers of ResNet50, which are then input to PrimeTime PX.
VI. EVALUATION RESULTS
In this section, we review the implementation results of the proposed accelerators and compare with previous publications.
A. Design Space
The proposed microarchitecture described in Section IV has a number of parameters to be optimized, with some trade-offs. We consider the design space for a typical 4 TOPS mobile CNN accelerator implementation. Area and power consumption metrics are shown in Fig. 9, where each design has equivalent peak throughput. Each design point shown is described by a string which includes the array dimensions, the optional hardware IM2COL unit (denoted "IM2C"), optional fixed DBB ("DBB"), and optional variable DBB ("VDBB"). Not all combinations of these parameters are valid. All the designs are configured to have the same peak throughput of 4 TOPS, and are normalized to the baseline conventional TPU-like systolic array, which we refer to as the 1×1×1 configuration. The TOPS/W and TOPS/mm² results assume a fairly typical 3/8 DBB weight sparsity and 50% activation sparsity.
Fig. 10: Effective power and area design space of the proposedaccelerators, normalized to the 1 × × × × × × × area reductioncompared to the baseline. Finally, the third group in the farbottom left are pareto-optimal VDBB designs, which benefitthe most from IM2COL. These designs improve the area by > × and the power by > × . The best design is summarizedin detail in Table IV, showing power and area breakdown foreach of the major components.Unlike with random sparse weight accelerators [30], the power consumption of proposed microarch. with DBB weightsis fairly constant. However, as we exploit random activationsparsity, the power varies from layer to layer on a real worldmodel. To illustrate this, Fig. 11 shows normalized power forthe popular ResNet-50-v1 model. Twelve designs designs areshown, which are representative of the whole design space,all with 4 TOPS peak throughput. All our accelerators takeadvantage of activation sparsity using a simple clock gatingscheme. The average activation sparsity percentage is annotatedabove each bar group.Over the whole model, the 4 × × × × B. Variable DBB Sparsity
The key advantage of our proposal is that we support structured sparsity with a variable sparsity rate (Section II), whereas in the previous work this is fixed: 2/8 in Kang [21] and 2/4 in the Nvidia A100 [28]. Fig. 12 illustrates this point in terms of effective throughput and energy efficiency over the full range of weight sparsity levels available with a block size of 8. Three designs are shown: the baseline systolic array with dense weights, a fixed 4/8 DBB design, and the proposed VDBB design. (The higher activation sparsity point in Fig. 12(b) corresponds to the blk1/unit3/conv3 layer in this ResNet50 example.)
Fig. 12: Scaling of (a) effective throughput and (b) effective energy with weight sparsity, for the baseline systolic array with clock gating, the fixed 4/8 DBB design, and the proposed VDBB design.

Component                 Power, mW (%)           Area, mm² (%)
Systolic Tensor Array     318 (65.2%)             0.732 (20.0%)
Weight SRAM (512KB)       78.5 (16.1%)            0.54 (14.4%)
Activation SRAM (2MB)     31.0 (6.4%), 93.0†      –
M33 MCUs (× 4)            50.5 (10.2%)            0.30 (8.0%)
IM2COL Unit               10.0 (2%)               0.01 (0.26%)
Total                     487.5 (100%), 539.5†    –
† IM2COL disabled

TABLE IV: Summary of the pareto-optimal VDBB design, showing the power and area breakdown for each major component.

C. Comparison with Prior Work
This work has demonstrated a time unrolled variable DBB scheme (Section III) implemented in a reuse optimized accelerator (Section IV). Here, we illustrate the benefit of these two contributions by comparing our work with previously published INT8 CNN inference accelerators in Table V. The proposed designs shown include the nominal 4 TOPS design in Table IV at multiple model sparsity points, as well as a 65nm version of the same design to allow a broader comparison with results in the older technology.

We first evaluate our results specifically against the state-of-the-art sparse CNN accelerator work, Laconic [32]. This paper includes a thorough survey against the latest work, including: DaDianNao++ [8], Eyeriss [7], SCNN [30], Pragmatic [4], and BitFusion [33], which convincingly demonstrates it is the current state-of-the-art. Therefore, this is a useful comparison point at the same INT8 precision, 1 GHz clock frequency and comparable 15nm technology. The energy efficiency reported for Laconic is just under 2 TOPS/W. Our nominal 4 TOPS VDBB design (Table IV) achieves 16.8 TOPS/W at 50% model sparsity, which is more than 8× higher. The only other DBB accelerator is Kang [21], which uses a fixed 2/8 DBB implemented in a dot-product microarchitecture, and reports 1.65 TOPS/W in 65nm technology for 75% sparse DBB. Our design implements variable DBB in a reuse optimized accelerator and in 65nm achieves 2.8 TOPS/W at the same 75% model sparsity, which is 70% higher. We also note that such a high fixed DBB sparsity would not be practical for compact models like MobileNet and ResNet on the ImageNet dataset, based on our results (Table I).

The only other sparse systolic array design we are aware of is SMT-SA [34]. To compare against this work, which was reported in 45nm, we implemented the same design ourselves, which achieves 7.4 TOPS/W compared to the proposed design at 21.9 TOPS/W for the same sparsity. This is largely due to the cost of the FIFOs required in the array, and the advantages of DBB vs. random weight sparsity (Section II). Finally, SCNN [30] and Eyeriss v2 [6] are also included in the comparison (Table V).

Design            Tech.  SRAM (A / W)   Freq. (GHz)  Throughput (TOPS)  Energy Eff. (TOPS/W)  Area Eff. (TOPS/mm²)  Weight Sparsity  Activation Sparsity
Ours              16nm   2MB / 512KB    1.0          4                  55.7                  8.52                  87.5% VDBB       50% CG
Ours              16nm   2MB / 512KB    1.0          4                  31.3                  4.29                  75% VDBB         50% CG
Ours              16nm   2MB / 512KB    1.0          4                  21.9                  2.85                  62.5% VDBB       50% CG
Ours              16nm   2MB / 512KB    1.0          4                  16.8                  2.13                  50% VDBB         50% CG
SMT-SA [34]       16nm   2MB / 512KB    1.0          4                  7.4                   1.13                  62.5% Random     50% CG
Laconic [32]      15nm   2MB / 512KB    1.0          –                  1.997                 –                     Bit-wise         Bit-wise
SCNN [30]         16nm   1.2MB / –      1.0          2                  0.79                  0.7                   Random           –
Ours              65nm   2MB / 512KB    0.5          1                  2.80                  0.26                  75% VDBB         50% CG
Ours              65nm   2MB / 512KB    0.5          1                  1.95                  0.17                  62.5% VDBB       50% CG
Kang et al. [21]  65nm   58KB           1.0          0.5                1.65                  1.01                  75% DBB          –
Laconic [32]      65nm   2MB / 512KB    1.0          –                  0.81                  –                     Bit-wise         Bit-wise
Eyeriss v2 [6]    65nm   246KB          0.2          0.40               0.96                  0.07 / 2.7M gates     Random           Random
(Notes: effective operations; SMT-SA is our re-implementation with INT8 operands in 16nm.)

TABLE V: Comparison with published sparse INT8 CNN accelerators in 16nm and 65nm technology. Published metrics for these works vary wildly; however, even at modest 50% weight and activation sparsity, our VDBB design achieves 16.8 TOPS/W in 16nm, which far exceeds previously reported results and offers strong throughput and energy scaling with weight sparsity.
In summary, we found that our work outperforms: 1) Laconic [32], the latest sparse accelerator, 2) SMT-SA [34], the only other sparse systolic array, and 3) Kang [21], the only other DBB accelerator.
VII. RELATED WORK
Clock Gating Random Sparsity: A simple and effective approach to exploiting random sparsity is to clock-gate (CG) to save power when one or more zero operands are encountered [7], [31]. However, CG schemes generally result in low utilization with no area or throughput improvement. We apply CG for activation sparsity (which is not amenable to DBB).
Random Sparsity: For RNN acceleration, EIE [16] implements a fine-grained random sparse CSR-encoded INT16 matrix-vector accelerator for dense layers, and ESE [17] extends this to LSTMs. MASR [15] also exploits random sparsity, but uses a bitmask encoding. PermDNN [10] targets sparse dense layers using permuted diagonal matrices. A number of papers target random sparse matrix multiplication for very sparse data, such as OuterSpace [29], which uses an outer product scheme, and SpArch [39], which further optimizes for locality. More specific to the lower sparsity of CNNs, Cnvlutin [5] demonstrates skipping compute for zero activations discovered at runtime, without explicit indexes. SCNN [30] implements a fully CSR-indexed sparse CNN accelerator using an outer product to exploit sparse weights and activations. FixyNN [37] demonstrates a fixed-weight accelerator that can very efficiently exploit random sparsity. We focus on CNN structured sparsity, but compare with SCNN and Laconic (Table V).
Structured Sparsity: Cambricon-S proposes a conventional block sparse accelerator [40]. A DBB accelerator described by Kang [21] implements a fixed weight sparsity of 75%. That accelerator design is also based on a dot product microarchitecture with limited data reuse. The hardware implementation details of the Nvidia structured sparsity scheme [28] are unknown, but it is fixed at 2/4 (50%) sparsity. From our pruning experiments (Table I), 2/8 (75%) is quite aggressive, but 4/8 is probably more universally useful. Nonetheless, in both cases, the deployed benefit is limited due to the fixed-sparsity ratio. In contrast, our VDBB proposal demonstrates variable-sparsity DBB using time unrolling in a reuse optimized accelerator.
Sparsity in Systolic Arrays: Most sparse CNN accelerators are based on dot-product designs reminiscent of DaDianNao [8], which typically have lower data reuse compared to systolic arrays (SAs) like the Google TPU [20]. SMT-SA [34] is a sparse SA, which focuses on random sparsity (instead of DBB). Kung et al. [24] demonstrated a preprocessing step of column combining of sparse weight matrices, before processing on a dense SA architecture. We implemented an INT8 version of SMT-SA to compare against (Table V), and found that DBB is much more efficient than the random sparsity of SMT-SA, which requires FIFOs in the array.
VIII. CONCLUSION
Structured model sparsity is a powerful optimization to enable improved throughput and energy efficiency in CNN hardware accelerators, without the overheads and unpredictable load balancing of random weight sparsity. However, unlike with random sparsity, previous demonstrations of block sparsity employ a fixed target sparsity ratio. Unfortunately, this is a serious impediment to broad deployment, because real world CNNs typically exhibit a wide range of weight sparsity ratios. With a fixed sparsity, any models that do not achieve this threshold must fall back to dense operation with no speedup. At the same time, aggressively optimized models that exceed the threshold also do not see any benefit.

In this paper, we introduce a variable density bound block (VDBB) technique, which uses a time unrolled architecture to implement a weight sparsity scheme that supports all possible block sparsity levels. This enables hardware benefits from model sparsity across the whole spectrum of models in use today. We implement VDBB in a reuse optimized accelerator microarchitecture, featuring a systolic tensor array (STA) composed of more complex PEs called tensor PEs (TPEs), which increase operand and accumulator reuse. We also describe a hardware delayed IM2COL unit that achieves a 3× activation SRAM bandwidth magnification effect to reduce SRAM power consumption. The reuse optimized accelerator design introduces a number of interdependent parameters, resulting in a non-trivial design space, which we evaluate in a 16nm process technology. The optimal design scales strongly in throughput and energy as a function of model sparsity, from 16.8 TOPS/W at 50% sparsity up to 55.7 TOPS/W at 87.5%. The advantage of both the VDBB compression and the reuse optimizations is apparent in these results, which outperform previously reported sparse CNN accelerator designs.

REFERENCES

[1] Arm Cortex-M33. [Online]. Available: https://developer.arm.com/ip-products/processors/cortex-m/cortex-m33
[2] ARMv8-M Architecture Reference Manual. [Online]. Available: https://static.docs.arm.com/ddi0553/a/DDI0553A_e_armv8m_arm.pdf
[3] MobileNet v1 TensorFlow Model. [Online]. Available: http://download.tensorflow.org/models/mobilenet_v1_2018_08_02/mobilenet_v1_1.0_224.tgz
[4] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O'Leary, R. Genov, and A. Moshovos, "Bit-pragmatic deep neural network computing," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-50 '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 382–394. [Online]. Available: https://doi.org/10.1145/3123939.3123982
[5] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA '16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 1–13. [Online]. Available: https://doi.org/10.1109/ISCA.2016.11
[6] Y. Chen, T. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.
[7] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proceedings of the 43rd International Symposium on Computer Architecture. Piscataway, NJ, USA: IEEE Press, 2016, pp. 367–379. [Online]. Available: https://doi.org/10.1109/ISCA.2016.40
[8] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-47. Washington, DC, USA: IEEE Computer Society, 2014, pp. 609–622. [Online]. Available: https://doi.org/10.1109/MICRO.2014.58
[9] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," CoRR, vol. abs/1610.02357, 2016. [Online]. Available: http://arxiv.org/abs/1610.02357
[10] C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian, and B. Yuan, "PermDNN: Efficient compressed DNN architecture with permuted diagonal matrices," in Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018, pp. 189–202.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[12] I. Fedorov, R. P. Adams, M. Mattina, and P. N. Whatmough, "SpArSe: Sparse architecture search for CNNs on resource-constrained microcontrollers," in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 4978–4990.
[13] I. Fedorov, M. Stamenovic, C. Jenson, L.-C. Yang, A. Mandell, Y. Gan, M. Mattina, and P. N. Whatmough, "TinyLSTMs: Efficient neural speech enhancement for hearing aids," in Conference of the International Speech Communication Association (INTERSPEECH), 2020.
[14] Y. Feng, P. Whatmough, and Y. Zhu, "ASV: Accelerated stereo vision system," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '52. New York, NY, USA: Association for Computing Machinery, 2019, pp. 643–656. [Online]. Available: https://doi.org/10.1145/3352460.3358253
[15] U. Gupta, B. Reagen, L. Pentecost, M. Donato, T. Tambe, A. M. Rush, G. Wei, and D. Brooks, "MASR: A modular accelerator for sparse RNNs," in Proc. International Conference on Parallel Architectures and Compilation Techniques (PACT), 2019, pp. 1–14. [Online]. Available: https://doi.org/10.1109/PACT.2019.00009
[16] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," in Proceedings of the 43rd Int. Symp. on Computer Architecture (ISCA), 2016.
[17] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. B. J. Dally, "ESE: Efficient speech recognition engine with sparse LSTM on FPGA," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 75–84. [Online]. Available: https://doi.org/10.1145/3020078.3021745
[18] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding," CoRR, vol. abs/1510.00149, 2015. [Online]. Available: http://arxiv.org/abs/1510.00149
[19] P. Hansen, A. Vilkin, Y. Khrustalev, J. Imber, D. Hanwell, M. Mattina, and P. N. Whatmough, "ISP4ML: Understanding the role of image signal processing in efficient deep learning vision systems," in International Conference on Pattern Recognition (ICPR), 2020.
[20] N. P. Jouppi, C. Young, N. Patil, D. Patterson et al., "In-datacenter performance analysis of a tensor processing unit," CoRR, vol. abs/1704.04760, 2017. [Online]. Available: http://arxiv.org/abs/1704.04760
[21] H. Kang, "Accelerator-aware pruning for convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019.
[22] S. Kodali, P. Hansen, N. Mulholland, P. Whatmough, D. Brooks, and G. Wei, "Applications of deep neural networks for ultra low power IoT," 2017, pp. 589–592.
[23] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Master's thesis, Department of Computer Science, University of Toronto, 2009.
[24] H. Kung, B. McDanel, and S. Q. Zhang, "Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization," in Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019, pp. 821–834.
[25] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[26] H. Li, M. Bhargav, P. N. Whatmough, and H.-S. Philip Wong, "On-chip memory technology design space explorations for mobile deep neural network accelerators," 2019, pp. 1–6.
[27] Z. Liu, P. N. Whatmough, and M. Mattina, "Systolic tensor array: An efficient structured-sparse GEMM accelerator for mobile CNN inference," IEEE Computer Architecture Letters, 2020.
[29] S. Pal et al., "OuterSpace: An outer product based sparse matrix multiplication accelerator," in Int. Symp. on High Performance Computer Architecture (HPCA), Feb 2018, pp. 724–736.
[30] A. Parashar et al., "SCNN: An accelerator for compressed-sparse convolutional neural networks," in Proceedings of the 44th International Symposium on Computer Architecture (ISCA), June 2017, pp. 27–40.
[31] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA, 2016.
[32] S. Sharify, A. D. Lascorz, M. Mahmoud, M. Nikolic, K. Siu, D. M. Stuart, Z. Poulos, and A. Moshovos, "Laconic deep learning inference acceleration," in Proceedings of the 46th International Symposium on Computer Architecture, ser. ISCA '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 304–317. [Online]. Available: https://doi.org/10.1145/3307650.3322255
[33] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh, "Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks," in Proceedings of the 45th Annual International Symposium on Computer Architecture, ser. ISCA '18. IEEE Press, 2018, pp. 764–775. [Online]. Available: https://doi.org/10.1109/ISCA.2018.00069
[34] G. Shomron, T. Horowitz, and U. Weiser, "SMT-SA: Simultaneous multithreading in systolic arrays," IEEE Comput. Archit. Lett., vol. 18, no. 2, pp. 99–102, Jul. 2019.
[35] P. Warden. Why GEMM is at the heart of deep learning. [Online]. Available: https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
[36] P. N. Whatmough, S. K. Lee, M. Donato, H. Hsueh, S. Xi, U. Gupta, L. Pentecost, G. G. Ko, D. Brooks, and G. Wei, "A 16nm 25mm2 SoC with a 54.5x flexibility-efficiency range from dual-core Arm Cortex-A53 to eFPGA and cache-coherent accelerators," in Symposium on VLSI Circuits, 2019, pp. C34–C35.
[37] P. N. Whatmough, C. Zhou, P. Hansen, S. K. Venkataramanaiah, J.-S. Seo, and M. Mattina, "FixyNN: Efficient hardware for mobile computer vision via transfer learning," in Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019.
[38] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, P. Raina, C. Kozyrakis, and M. Horowitz, "Interstellar: Using Halide's scheduling language to analyze DNN accelerators," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 369–383. [Online]. Available: https://doi.org/10.1145/3373376.3378514
[39] Z. Zhang, H. Wang, S. Han, and W. J. Dally, "SpArch: Efficient architecture for sparse matrix multiplication," in Int. Symp. on High Performance Computer Architecture (HPCA), 2020.
[40] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen, "Cambricon-S: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach," in Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-51. IEEE Press, 2018, pp. 15–28. [Online]. Available: https://doi.org/10.1109/MICRO.2018.00011
[41] M. Zhu and S. Gupta, "To prune, or not to prune: Exploring the efficacy of pruning for model compression," CoRR, vol. abs/1710.01878, 2017. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1710.html
[42] Y. Zhu, A. Samajdar, M. Mattina, and P. Whatmough, "Euphrates: Algorithm-SoC co-design for low-power mobile continuous vision," in Proceedings of the 45th Annual International Symposium on Computer Architecture, ser. ISCA '18. IEEE Press, 2018, pp. 547–560. [Online]. Available: https://doi.org/10.1109/ISCA.2018.00052