Self-Adaptive Reconfigurable Arrays (SARA): Using ML to Assist Scaling GEMM Acceleration
Ananda Samajdar
Georgia Tech, Atlanta, GA, [email protected]
Michael Pellauer
NVIDIA, Boston, MA, [email protected]
Tushar Krishna
Georgia Tech, Atlanta, GA, [email protected]
Abstract — With increasing diversity in Deep Neural Network (DNN) models in terms of layer shapes and sizes, the research community has been investigating flexible/reconfigurable accelerator substrates. This line of research opens up two challenges. The first is to determine the appropriate amount of flexibility within an accelerator array, trading off the performance benefits of reconfigurability against its area overheads. The second is to determine the right configuration of the array for the current DNN model and/or layer and to reconfigure the accelerator at runtime. This work introduces a new class of accelerators that we call Self-Adaptive Reconfigurable Arrays (SARA). SARA architectures comprise both a reconfigurable array and a hardware unit capable of determining an optimized configuration for the array at runtime. We demonstrate an instance of SARA with an accelerator we call SAGAR, which introduces a novel reconfigurable systolic array that can be configured to work as a distributed collection of smaller arrays of various sizes or as a single array with flexible aspect ratios. We also develop a novel recommendation neural network called ADAPTNET, which recommends an array configuration and dataflow for the current layer parameters. An integrated custom hardware unit, ADAPTNETX, runs ADAPTNET at runtime and reconfigures the array, making the entire accelerator self-sufficient. SAGAR is capable of providing the same mapping flexibility as a collection of 1024 4x4 arrays working as a distributed system, while achieving 3.5x more power efficiency and 3.2x higher compute density. Furthermore, when tested over 200K cases, the runtime (GeoMean) achieved with the parameters recommended by ADAPTNET is 99.93% of the best achievable runtime.

I. INTRODUCTION
General Matrix-Matrix Multiplication (GEMM) is at the heart of Deep Neural Network (DNN) training and inference, and has thus been the target application of many accelerator designs [4], [7], [8], [9], [13], [37]. However, these individual devices work on small matrix tiles and do not have enough computation power to work on larger networks without multiple costly passes. Recent proposals [9], [35] have demonstrated the need for scaling DNN computation engines to meet the computation demands of contemporary workloads. Despite extensive research and product development on architectures for small-tile GEMMs, designing efficient architectures for performing GEMMs at scale is still non-trivial.

The crux of the problem is that there exists a pernicious trade-off between scalability and utilization (or mapping efficiency). Scalability is a direct consequence of the simplicity and regularity of a particular design. For instance, regular designs like the TPU [13] (a systolic array) can pack a large number of MAC units.
Fig. 1. Challenges in efficiently scaling GEMM arrays: (A) a monolithic array offers high reuse potential, but its rigidity adversely affects mapping flexibility; (B) a distributed system offers high mapping flexibility, but low opportunity to exploit reuse; (C) a reconfigurable array offers both high reuse potential and mapping flexibility, at the cost of hardware overhead and a large mapping search space.
A scaled-out setting, by contrast, is a feasible alternative to gain flexibility, which is further corroborated by recent architecture proposals [9], [35]. (We use scale-up/scale-out interchangeably with monolithic/distributed.)

Figure 1 provides intuition on why scale-out is more performant than scale-up. Both the monolithic and distributed accelerators have the same number of MAC units (4N) and are executing the same workload (MatA x MatB). While the distributed accelerator (Figure 1(b)) can map the entire computation in a single step, the monolithic configuration (Figure 1(a)) needs two. The irregular size of the workloads and the mapping rigidity of the monolithic configuration are responsible for the serialization, even when compute resources are available.

Unfortunately, scaling out compromises operand reuse and thus energy efficiency. First, spatial reuse via wires is limited, since communication paths between two distinct compute units are fundamentally lower-bandwidth than paths within an array, in the worst case requiring off-chip access [35]. As a corollary, each compute unit needs its own separate, locally-placed high-bandwidth memory to store operand data. These individual memories are smaller than the aggregate memory available in a scaled-up system, and are thus less able to exploit temporal reuse. To make matters worse, some data operands must be replicated, reducing capacity even further.
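To make the Figure 1 intuition concrete, the toy model below counts serial mapping passes under an output-stationary-style tiling. It is purely our illustrative sketch; the pass-counting rule, tile shapes, and the 16-MAC budget are assumptions, not the paper's simulator.

from math import ceil

def passes(M, N, units):
    """units: a list of (rows, cols) arrays, all the same shape.
    Each pass maps one rows x cols output tile per unit, so extra
    units that fit the workload's shape directly become fewer
    serial passes."""
    rows, cols = units[0]
    tiles = ceil(M / rows) * ceil(N / cols)  # output tiles to cover
    return ceil(tiles / len(units))          # tiles mapped per pass

M, N = 2, 8                        # a short, wide output matrix
print(passes(M, N, [(4, 4)]))      # monolithic 4x4 (16 MACs): 2 passes
print(passes(M, N, [(2, 4)] * 2))  # two 2x4 arrays (16 MACs): 1 pass

With the same 16 MACs, the monolithic 4x4 array needs two serial passes for the short, wide 2x8 output, while two 2x4 arrays finish in one, mirroring Figure 1(a) versus 1(b).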
Fig. 2. The constitution and interactions of the self-adaptive (SA) and reconfigurable array (RA) components that make up the SARA accelerator called SAGAR in this work. The SA unit runs AdaptNet-858 (the recommendation model) on AdaptNetX (the recommendation hardware); given the layer shape of the workload DNN, it produces a config for the SMART systolic array (the reconfigurable accelerator), which consumes the layer weights and activations and produces the results.
Contribution 1:
In this paper, we propose a novel accelerator micro-architecture organization called SMART SYSTOLIC Array (SSA), aimed at attaining the benefits of both monolithic (i.e., scale-up) and distributed (i.e., scale-out) designs for efficient GEMM computation in a single unified substrate. We choose systolic arrays as our scale-up building blocks, since the simplicity of these arrays leads to low area and power overheads, maximizing local bandwidth while minimizing communication distance. To mitigate the under-utilization problem, we propose augmenting the traditional systolic design with a bypass interconnection network inspired by SMART [17]. These links permit us to emulate, within a monolithic compute array, mappings similar to those of a distributed cluster of accelerators (Figure 1(c)). The SMART-like links are configurable, and can therefore emulate distributed systems of different granularities: e.g., a system with 128x128 MAC (multiply-and-accumulate) units can be configured to be used as four 64x64 units, eight 64x32 units, or even thirty-two 32x16 units.
Contribution 2:
Finding the optimal configuration for such a reconfigurable array is the key to achieving the best performance. However, the best configuration depends on the workload, which means the configuration needs to be determined at runtime. A finer granularity of reconfiguration improves mapping efficiency, but at the same time increases the configuration search space [15], [27]. Extra resources are needed to ensure that configuration search does not become a bottleneck at runtime. We develop a novel lightweight neural recommendation model called ADAPTNET, which replaces costly configuration search with a constant-time inference operation at runtime. For a given workload dimension, the network recommends the best architecture configuration as well as the mapping (dataflow) strategy. The recommended configurations attain about 99.93% of the best runtime on average (GeoMean) in our tests with 200K samples.

Contribution 3:
We also design custom hardware for running ADAPTNET, called ADAPTNETX, and use it to augment the reconfigurable accelerator. The resulting design is thus self-sufficient, providing optimal performance without external inputs.
Contribution 4:
We integrate these three components into an accelerator which we call 'Shape Adaptive GEMM AcceleratoR' (SAGAR), as shown in Figure 2, and evaluate its performance across various configurations. (Sagar is a Sanskrit word that means ocean, reflecting the ability of our accelerator to take flexible shapes.)

Fig. 3. The trade-off between improved runtime and lost operand reuse in compute-equivalent monolithic and distributed systolic-array configurations, when multiplying a 256x64 matrix with a 64x... matrix. Subfigures show (a) the theoretical minimum runtime, and the runtimes obtained for stall-free operation of the monolithic and compute-normalized distributed systolic-array settings; and (b) the corresponding SRAM reads, normalized to the theoretical minimum reads required.

We show that SAGAR has 3.2x higher compute density and 3.5x improved power efficiency over an equivalent scaled-out systolic array.
The extra flexibility costs less than 10% in area and 50% in power, compared to an equivalent scaled-up systolic array. Compared to an area-normalized state-of-the-art flexible scalable accelerator [28], SAGAR incorporates 45% more compute, while, comparing compute-equivalent configurations, SAGAR consumes 43% less power and has a 30% smaller area footprint (see Section VI-B).

We believe our proposed accelerator is the first in a class of designs we name Self-Adaptive Reconfigurable Arrays (SARA) (Figure 2). To summarize, we make the following contributions: (i) We propose SSA, a reconfigurable architecture for scalable GEMM acceleration, simultaneously achieving high utilization and reuse. (ii) We develop ADAPTNET, a lean recommendation neural network which suggests optimized configurations and dataflows with high accuracy. (iii) We implement ADAPTNETX, hardware capable of running ADAPTNET in constant time and configuring SSA at runtime. (iv) We integrate the above components into a SARA accelerator called SAGAR, which achieves optimal runtime at lower power and area than the state of the art.

II. BACKGROUND AND MOTIVATION
A. Scaling DNN acceleration
Google's Tensor Processing Unit (TPU) [13], a large 256x256 systolic-array based accelerator, is one of the first datacenter-class DNN accelerators. However, as the authors report, irregular-sized GEMM operations found in recurrent workloads result in as much as 86% under-utilization of the array, attributed to the rigidity of mapping. A recent accelerator proposal called SIGMA [28] tries to alleviate this problem by introducing mapping flexibility, using a two-level interconnect to deliver data to the computation elements. Recent proposals like Simba [35] instead take the scale-out approach, building a cluster of Multi-Chip Modules (MCMs) to increase the compute capability of the design. Tangram [9], on the other hand, proposes a scaled-out design where the problem of data replication is potentially mitigated by communicating over an interconnection network and precisely timing the communication and compute.

While recent research targets both scaled-up and scaled-out approaches, it is not immediately clear whether one design direction has an inherent advantage over the other. One study [32] argues that for systolic-array based designs, a scaled-out configuration is almost always better than a scaled-up one in terms of performance. However, this benefit comes at the cost of a huge increase in external bandwidth requirement (in this case from DRAM), which renders implementing such a design impractical at scale.

To help understand the trade-offs involved in choosing a performant configuration, and the associated loss of reuse, we perform a simple experiment. We run one GEMM operation, involving operand matrices of sizes 256x64 and 64x..., on a monolithic 128x128 array and on compute-equivalent distributed configurations: 4 64x64 arrays, 16 32x32 arrays, 64 16x16 arrays, 256 8x8 arrays, and 1024 4x4 arrays. From Figure 3(a) we observe that the 32x32 configuration is the most performant, beating the monolithic configuration by about 2x. In Figure 3(b) we depict the SRAM read accesses performed by all the array configurations, normalized to the theoretical minimum number of reads possible. From this figure we observe that the 32x32 configuration performed about 4x more memory accesses than the monolithic one. The excess memory accesses, which lead to reduced energy efficiency, result from the loss of wire reuse as compared to a monolithic array.

From the discussion above we make two observations. (i) Distributed arrays are more performant than an equivalent monolithic array, even when mapping efficiency is 100% on both. However, the optimal size of each device in a distributed setting is workload dependent. (ii) Monolithic configurations are strictly more energy efficient than distributed arrays, due to the loss of spatio-temporal reuse in the latter. Mitigating the loss of reuse in a distributed setting is therefore the key to achieving both performance and energy efficiency simultaneously, and is the goal of this work.

B. Single cycle multi-hop links with SMART
SMART [17] describes a mechanism to reduce the average hop count of a network-on-chip (NoC) by introducing bypass paths which can be used to allow multi-hop traversal of a packet within a cycle. Implementing the proposed scheme involves two hardware changes. First is the addition of an alternate path from the input ports of each router to the input selector mux, bypassing the queuing buffer in the datapath. Second is the addition of dedicated control paths from each router to HPC_max (maximum possible hops per cycle) routers downstream in every direction.
Fig. 4. Multi-hop traversals in a single cycle using SMART [17]: bypass muxes and repeaters form a single-cycle multi-hop path from a source router to a destination router.
Using the bypass path happens in two steps. In the first step, the SMART path is set up by the source router, one cycle prior to the actual movement of the packet. The source router broadcasts a SMART Setup Request (SSR) to the downstream routers over dedicated SSR links. Each SSR link is log2(HPC_max) bits wide and indicates the number of hops the flit wants to travel. The request precedes the flit by one cycle, which allows the downstream routers to decide whether to accept the packet before it is launched. The downstream routers employ a fixed-priority scheme to ensure that there are no packet drops.
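As a rough illustration of the SSR handshake, the toy model below resolves competing single-direction SSRs one cycle ahead of flit traversal. It is entirely our simplification: the function names, the linear topology, and the 'nearer source wins' fixed-priority rule are assumptions, and the real SMART arbitration is more involved.

HPC_MAX = 8  # assumed maximum hops per cycle; each SSR link would be
             # log2(HPC_MAX) = 3 bits wide in this toy model

def resolve_ssrs(ssrs):
    """ssrs: list of (src_router, requested_hops) along one direction.
    Returns {src: hops granted this cycle}. Routers grant their bypass
    mux with a fixed priority (here: the nearer source wins), so a
    losing flit simply stops and is buffered early; no flit is dropped."""
    granted = {}
    claimed = set()  # routers whose bypass mux is already promised
    for src, hops in sorted(ssrs):
        hops = min(hops, HPC_MAX)
        travelled = 0
        for router in range(src + 1, src + hops + 1):
            if router in claimed:
                break                # mux taken: stop one hop earlier
            claimed.add(router)
            travelled += 1
        granted[src] = travelled
    return granted

# Routers 0 and 2 both request 4 hops; their paths overlap at router 3,
# so the lower-priority flit cannot launch and waits at its source.
print(resolve_ssrs([(0, 4), (2, 4)]))  # {0: 4, 2: 0}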
C. Recommendation Systems

Recommendation systems are widely used on social media, streaming services, online marketplaces, etc. to show the most relevant advertisements or products for a given user, increasing the click-through rate and providing meaningful personalized content. Over the years, neural networks have consistently improved upon the previous state-of-the-art methods: neural collaborative filtering (NCF) [12] over collaborative filtering [34], and neural factorization machines (NFM) [11] over factorization machines [30]. The current state-of-the-art method, the Deep Learning Recommendation Model (DLRM) [24], uses the feature-embedding and feature-interaction techniques employed by NCF and NFM, coupled with deep multi-layer perceptrons. In the DLRM architecture, a combination of sparse and dense features is used as input. The sparse features are converted into dense vectors in a learned latent space via embedding lookups. The resulting dense features are then passed through multiple dense layers, followed by feature interactions [11], [24]. Finally, the combined features are sent through a few more dense layers to get the final classification.
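The minimal sketch below mirrors this DLRM-style structure: embedding lookups for sparse features, a bottom MLP for dense features, pairwise feature interactions, and a top MLP. All sizes and the single-layer MLPs are made-up assumptions for illustration, not the configuration from the DLRM paper.

import numpy as np

rng = np.random.default_rng(0)
EMB = 16                                     # assumed embedding width

tables = [rng.normal(size=(100, EMB)) for _ in range(3)]  # one table per sparse feature
W_bottom = rng.normal(size=(4, EMB))         # dense features -> latent space
W_top = rng.normal(size=(EMB + 6, 1))        # 6 = C(4,2) pairwise interactions

def dlrm_forward(dense_x, sparse_ids):
    # 1. Sparse features become dense vectors via embedding lookup.
    embs = [t[i] for t, i in zip(tables, sparse_ids)]
    # 2. Dense features pass through a (here: one-layer) bottom MLP.
    z = np.maximum(dense_x @ W_bottom, 0.0)
    # 3. Feature interaction: pairwise dot products of all vectors.
    feats = embs + [z]
    inter = [feats[i] @ feats[j]
             for i in range(len(feats)) for j in range(i + 1, len(feats))]
    # 4. Concatenate and classify with a (one-layer) top MLP.
    score = np.concatenate([z, inter]) @ W_top
    return 1.0 / (1.0 + np.exp(-score))      # sigmoid click probability

print(dlrm_forward(rng.normal(size=4), [3, 42, 7]))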
III. RECONFIGURABLE ARRAY ARCHITECTURE
We augment a base monolithic systolic array with additional bypass paths along the rows and columns, inspired by SMART [17]. This enables us to realize a flexible, energy-efficient design which can be configured to act as a large single array (i.e., scale-up) or as a collection of smaller arrays (i.e., scale-out), whenever required.
A. Compute array
Traditional MAC units.
In Figure 5(a) we show a traditional systolic array, constructed by laying down Multiply-and-Accumulate (MAC) units (Figure 5(b)) in a 2D grid. Each MAC unit is designed to get an operand from both (Left in, Top in) ports or from either of the ports, and to perform a multiplication and addition operation. In the next cycle, the operand data received is sent to its neighbour over the peer-to-peer links. The internal registers and multiplexers enable the array to work in output-stationary (OS), weight-stationary (WS), and input-stationary (IS) modes [4]. This simple mechanism of data movement results in high wire reuse, but at the same time restricts the mapping of compute only to those operations which require the same set of operands to be mapped along a row or a column.

Fig. 5. (a) A systolic array of traditional MAC units, (b) the architecture of a traditional MAC unit, (c) addition of bypass paths to a traditional MAC to create a SMART MAC unit, to (d) enable flexible mapping in a systolic array by creating 'islands' of MAC units capable of receiving data directly from SRAM banks by bypassing their peers.
SMART MAC units.
Any mechanism that enables MAC units to accept data from sources other than the neighbouring MAC units will ease the restrictions on mapping compute and therefore enable high utilization. In Figure 5(c) we depict a modified MAC unit, augmented with multiplexers on the input and output ports. We call this a SMART MAC unit. Note that a SMART MAC denotes a MAC unit with multiplexers on one or more input/output ports to enable bypass, not necessarily on all ports. These multiplexers enable reading and writing on wires other than the peer-to-peer links, thus enabling the unit to receive and work on operands unrelated to the ones forwarded by its peers. Akin to the multi-hop bypass paths employed in the SMART NoC, we can allocate bypass buses from memory to the array to provide extra channels to move data. Connecting one of the inputs of the multiplexers to these bypass buses from memory therefore enables us to create 'islands' of MAC units. These 'islands' are collections of neighbouring MAC units forming rectangular groups which can act as independent smaller arrays. We can map computations onto such 'islands' independent of the other neighbouring MACs, which translates to improved mapping flexibility (see Figure 5(d)). However, note that this design allows for arbitrary reconfiguration, which is overkill and makes the design costlier than necessary.
Systolic Cells.
We find an alternative design, employing a mix of traditional MAC units (Figure 5(b)) and SMART MAC units, to be a more practical choice. Instead of allowing flexible connectivity at the granularity of individual MAC units, we provide reconfigurability at the scale of a smaller array, by allowing flexible connectivity at the edges of the said sub-array. For example, in Figure 6(a) we show a 4x4 systolic-cell. As can be observed in Figure 6(a), a systolic-cell is constructed by using SMART MAC units at the edges of the array and traditional MAC units (Figure 5(b)) on the inside, and then connecting them using peer-to-peer links. This helps reduce implementation cost in three ways. First, the number of SMART MACs used in the array is reduced. Second, in the remaining SMART MACs, multiplexers are not required on each port; instead, only the ports which communicate with MACs outside the given systolic-cell need multiplexers to read and write data on the bypass links. Third, the number of bypass links is also reduced, as a consequence of the reduction in the number of multiplexers. With this optimization, bypasses are now performed at the granularity of the systolic-cells.

Scale-up and Scale-out using Systolic Cells.
Larger arrays can be created by arranging and connecting systolic-cells, as depicted in Figure 6(b), using the peer-to-peer links. At the edge of each systolic-cell, the bypass paths can be connected to the bypass links. Note that dedicated bypass links are allocated to each systolic-cell to allow concurrency. Attaining flexible mapping in such a design is a matter of configuring the multiplexers of the systolic-cells (see the configuration sketch below). Depending on the mapping requirement, a user can choose not to use the bypass paths at all and use the entire array as a single monolithic unit, by setting the multiplexers to accept data only from the peer-to-peer links (this is the case depicted in Figure 6(b)), which is equivalent to a scaled-up configuration. On the other hand, the user can set all the multiplexers to accept and deliver data solely on the bypass links, therefore operating as a cluster of arrays, each the size of a systolic-cell. This configuration, depicted in Figure 6(c), is equivalent to a scaled-out configuration. Other scaled-out configurations, with sub-arrays larger than the systolic-cell size, can also be realized by logically combining a few systolic-cells, setting some of the multiplexers to connect to the bypass links and others to connect to the peer-to-peer links. The availability of such a variety of choices for reconfiguration leads to flexible and efficient mapping, hence improving the utilization and energy efficiency of the design. Note that unlike the SMART NoC [17], these muxes are configured statically, and hence do not need to worry about arbitration.

Fig. 6. (a) Construction of a 4x4 systolic-cell with bypass muxes and bypass links. (b) An 8x8 array built from 4x4 systolic-cells operating as a monolithic unit: each systolic-cell is connected to its neighbor with the peer-to-peer links as the bypass muxes are turned off, and the SRAM ports connected to bypass links are unused. (c) Configuration of the bypass muxes to enable the 8x8 array to operate as a scaled-out cluster of 4x4 systolic-cells.
B. Scratch pad memory
The array constructed from systolic-cells is backed by an SRAM scratchpad memory, constructed as two individual buffers. Each of these buffers is dedicated to one of the operand matrices. Such scratchpad SRAM buffers are common in accelerators, and are designed to reduce the number of off-chip accesses and facilitate temporal reuse. Each operand buffer is operated in a double-buffered fashion, so that the prefetch latency can be minimized. The system also comprises a third buffer, which is used to store generated output elements.

Unlike conventional accelerators, however, our systolic-cell based design has bypass links, which also need to be backed by the memory. We provision for this extra bandwidth by increasing the number of memory banks in the scratchpad SRAM buffers. In a traditional systolic-array based design, each row and column of the array is connected to one dedicated SRAM port to supply one element per cycle. In a similar fashion, we allocate one port for each row, each column, and each individual bypass link. To reduce the complexity of multiplexing data within the SRAM, we increase the number of SRAM banks to support the increased number of ports. For example, the compute array shown in Figure 6(c) is backed by two scratchpad memories, each with 16 ports, and would be constructed as a collection of 16 SRAM banks. The SRAM banks in a traditional 8x8 systolic array, on the other hand, would provide 8 ports per buffer, one per row/column. We evaluate the overheads of such a design in Section VI-B.

Despite having the same number of SRAM ports as in a distributed configuration, this approach has a couple of advantages over the latter. First, no replication of data is required; replication otherwise reduces the effective capacity of the system, adversely affecting reuse. In our design, by eliminating replication we inherently improve the temporal reuse of operands.
Second, each of the systolic-cells can access data in the entire operand buffer. Due to the unified memory control of each buffer, operations like multicast are implicit in the form of read collation, which improves energy efficiency without impacting performance. We describe the impact on reads and energy efficiency in detail in Section VI-A.

Fig. 7. Pseudocode depicting the control logic.
C. Control
Figure 7 shows the control logic executed for each GEMM workload or layer of a neural network. The control logic of our proposed system is similar to that of a distributed systolic-array based system. However, unlike other systems, in a systolic-cell based design the number of distributed units is a variable, determined at runtime based on the dataflow and operand shapes. The following steps describe the logic (a runnable sketch follows).

recNetInference(): In this work we use a recommendation model, described in Section IV. The model takes in the layer parameters and recommends the configuration which is most efficient for the workload.

setBypassMuxes(): Next, the bypass muxes are set in the compute hardware to realize the partitioned configuration. This is accomplished by writing select values to a register whose individual bits drive the select lines. These configurations stay static throughout the GEMM computation.

partitionWorkload(): The control logic then partitions the original workload by marking the portions of the original operand arrays to be used by each individual partition.

systolicController(): Finally, for each partition, an instance of a systolic-array controller is initiated to drive the GEMM operations to completion and orchestrate the required data movement. Note that, in contrast to a traditional systolic array like the TPU, multiple control units are required to work in parallel.
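The runnable sketch below strings these four steps together. It is our reconstruction of the control flow that Figure 7's pseudocode depicts; the Config fields, the fixed stand-in recommendation, and the output-tiling rule in partition_workload() are assumptions, not the paper's exact logic.

from dataclasses import dataclass

@dataclass
class Config:
    unit_rows: int   # logical sub-array height, in MACs
    unit_cols: int   # logical sub-array width, in MACs
    dataflow: str    # 'OS', 'WS', or 'IS'

def rec_net_inference(M, N, K):
    # Stand-in for the AdaptNet query; returns a fixed config here.
    return Config(unit_rows=32, unit_cols=32, dataflow='WS')

def set_bypass_muxes(cfg):
    # In hardware: write the mux-select register (static per layer).
    print(f"muxes set for {cfg.unit_rows}x{cfg.unit_cols} units")

def partition_workload(M, N, K, cfg):
    # Tile the M x N output space across the logical sub-arrays.
    tiles = []
    for m0 in range(0, M, cfg.unit_rows):
        for n0 in range(0, N, cfg.unit_cols):
            tiles.append((m0, min(M, m0 + cfg.unit_rows),
                          n0, min(N, n0 + cfg.unit_cols)))
    return tiles

def run_layer(M, N, K):
    cfg = rec_net_inference(M, N, K)          # recNetInference()
    set_bypass_muxes(cfg)                     # setBypassMuxes()
    parts = partition_workload(M, N, K, cfg)  # partitionWorkload()
    for p in parts:                           # one systolicController()
        pass                                  # per partition, in parallel
    return len(parts)

print(run_layer(256, 64, 64))  # 16 partitions, driven in parallel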
IV. ARCHITECTURE RECOMMENDATION MODEL

To fully exploit the reconfigurability offered by the compute substrate, the best configuration needs to be identified depending on the workload. Identifying the best configuration, unfortunately, can be a costly operation, given the large size of the search space of possible configurations. For example, in Figure 8(a) we show the size of the configuration space of our flexible systolic-cell based architecture as a function of the number of MAC units. A TPU v2-like system with 2^14 MAC units has 858 possible configurations, while a TPU v1-like system (2^16 MAC units) has nearly 1400 configurations. For a given workload, a configuration search over such a large space at runtime can either become a performance bottleneck in a power-limited system, or end up using a significant amount of energy compared to the execution of the workload itself. In both cases, this could undermine the benefits obtained from the flexible architecture. An alternative approach is to perform the search offline and store the configurations. However, this only works when the workload configurations are known beforehand and are limited in number. For a datacenter-like use case, where scaled systems are likely to be used, the variety of workloads is expected to be high. In this work we develop a novel technique to tackle this problem. Our solution is to replace the expensive search operation with a constant-time neural network (NN) inference to procure the best configuration at runtime.
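To give a feel for why this space grows quickly, the sketch below counts only the homogeneous slice of it: rectangular logical units whose side lengths divide the cell grid, times three dataflows. This is an assumption-laden lower bound of our own; the paper's full space (858 configurations for a 2^14-MAC array) evidently also counts options this simple model ignores.

def homogeneous_configs(cell_grid=32, dataflows=('OS', 'WS', 'IS')):
    # Unit heights/widths (in systolic-cells) that tile the grid evenly.
    divs = [d for d in range(1, cell_grid + 1) if cell_grid % d == 0]
    # Every (rows x cols) unit shape paired with every dataflow.
    return len(divs) * len(divs) * len(dataflows)

print(homogeneous_configs())  # 6 * 6 * 3 = 108 for a 32x32 cell grid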
Search as an ML problem.
The first step in designing a constant-time NN inference system is to pose the search problem as an ML problem. Out of several alternative approaches, our experiments showed that posing the problem as a recommendation system provided the best performance. The idea of a recommendation system is simple: when queried with the workload parameters (i.e., the M, N, and K dimensions in our case), the network returns a category ID. This ID is used to look up an architecture configuration which provides the optimal performance for the workload. For our flexible systolic-cell based system, the parameters captured by each category are depicted in Figure 8(b). As we depict in the figure, each category represents both the architecture configuration, in terms of the systolic-cell arrangement (systolic-cell rows and columns) and the systolic-cell dimensions, and the mapping strategy, in terms of dataflow, e.g., output-stationary (OS), weight-stationary (WS), or input-stationary (IS).
Recommendation Neural Network.
We hand-designed a recommendation neural network. Given our use case, there are two main requirements to satisfy. First, we need the network to have high accuracy in predicting the best runtime configuration, which maximizes performance by attaining an optimal mapping. Second, given that the recommendation network needs to be queried at runtime, the network should be small, to lower costs. A smaller network leads to low inference latency. In our use case, the recommendation inference for a given layer is run concurrently with the execution of the previous layer whenever possible; lower inference latency therefore moves the recommendation step out of the critical path. Moreover, a smaller network has fewer computation and storage requirements, minimizing the overheads. Honoring these requirements, our proposed network is depicted in Figure 8(c). We use an embedding table to project the input features into a latent space, as in the DLRM and NCF models [12], [24]. The embedding lookups are then passed through two dense layers for classification, which choose a category from the configuration space using a softmax activation. We call this architecture ADAPTNET. This simple model works very well for flexible systolic-cell based arrays with different numbers of MAC units. Figure 8(d) depicts the accuracies obtained on test sets when the same model is trained to recommend configurations for varying numbers of MAC units. The model is trained for about 30 epochs, on a separate dataset for each MAC count, each containing about one million samples. We observe that the test accuracies are consistently over 90%, and for a few configurations as high as 96%. We also want to point out that the correct predictions correspond to the configurations which lead to the best runtime. It is also germane to note that for the cases where the network recommends configurations other than the best possible ones, the performance is still better than what we get in our baseline configurations (see Figure 9(c)). We distinguish the variants of this network by adding the number of categories as a suffix. For example, the systolic-cell based SSA with 2^14 MACs has 858 configurations, therefore the corresponding network is called ADAPTNET-858.
V. SELF ADAPTIVE RECONFIGURABLE ARRAYS
By coupling ADAPTNET with a reconfigurable array, we can create a self-adaptive system which can be conceptually viewed as a combination of two units, a Self-Adaptive (SA) unit and a Reconfigurable Array (RA) unit, as shown in Figure 2. The SA unit encompasses the software and hardware components which recommend the optimal configurations. The RA unit is the hardware unit capable of flexibly configuring itself to the recommended configurations and hence running the workloads. It is worth pointing out that this design class is not specific to a reconfigurable core for running GEMM workloads. Instead, any Coarse-Grained Reconfigurable Array (CGRA) unit, configurable at runtime, can be augmented with a suitable SA to ensure optimal performance. We believe this results in a new class of designs, which we name Self-Adaptive Reconfigurable Arrays (SARA).

A. Hardware to run ADAPTNET

In the context of our use case, an intuitive option is to allocate a few systolic-cells from the main array to run ADAPTNET. However, this choice leads either to fewer MAC units left for the actual workloads, or to allocating additional systolic-cells for ADAPTNET, an additional overhead. An alternative to adding more systolic-cells is to add custom hardware dedicated to running ADAPTNET. We explore both the systolic-cell and custom hardware options below. For ease of discussion, we chose our RA to be a 1024 4x4 systolic-cell unit (2^14 MACs), which we call the SMART SYSTOLIC unit.
Fig. 8. (a) Size of the configuration space with respect to the number of MAC units (2^12 through 2^20) for a systolic-cell based flexible array. (b) Example of the configurations predicted by ADAPTNET, indexed by category ID: each entry captures the systolic-cell rows and columns, the systolic-cell dimensions, and the dataflow. (c) Schematic of the ADAPTNET topology: the workload parameters [M, N, K] index 10x32 feature embeddings, followed by a Dense(ReLU) layer and a Dense(Softmax) layer that outputs the category index. (d) Test accuracies obtained on test sets for ADAPTNETs trained on systolic-cell based flexible arrays with various numbers of MAC units.

Fig. 9. (a) Cycles needed to run ADAPTNET-858 on an array of systolic-cells and on the custom hardware unit (ADAPTNETX), as a function of the number of multipliers. (b) Architecture of the custom 1-D unit hardware for ADAPTNETX: stationary buffers hold the input elements next to the multipliers, whose products are reduced by an adder tree into one output element as the other operand streams through. (c) Relative performance of the configurations predicted by ADAPTNET-858 for SAGAR, over 200K test samples, compared to the runtime of the best possible configurations (GeoMean: 99.93%).

Fig. 10. Schematic of SAGAR, an instance of a SARA accelerator: ADAPTNETX (the SA unit) receives [M, N, K] and produces a recommended configuration; the configuration vectors drive the mux selects of the systolic-cell multiplexers in the reconfigurable array, whose systolic-cells are fed by SRAM banks over peer-to-peer links and horizontal/vertical bypass buses, with separate SRAM banks for the output and an external interface.
The corresponding recommendation model that we use is therefore ADAPTNET-858.

ADAPTNET Runtime on systolic-cells. Figure 9(a) shows the cycles required for a single inference of ADAPTNET as a function of the number of multipliers used in a 4x4 systolic-cell based array. Understandably, the runtime decreases in proportion to the increase in the number of multipliers as we increase the number of systolic-cells, achieving the best runtime of 1134 cycles when using 1024 multipliers, or 64 cells. When both the workloads and the recommendation engine are run on the same array, for a TPU-equivalent machine with 2^14 MAC units, about 6.25% of the array needs to be allocated for running ADAPTNET. Another choice could be allocating more hardware resources, in the form of 64 extra systolic-cells dedicated to running the recommender network. However, given that ADAPTNET consists exclusively of dense layers processing the embedding lookups, a systolic execution turns out to be sub-optimal.

ADAPTNET Runtime on ADAPTNETX.
We found a custom design tuned for ADAPTNET's layer parameters to be more efficient. For efficient execution of the dense layers, we chose a 1-D multiplier unit with a binary-tree based reduction, as shown in Figure 9(b). We found the input-stationary (IS) dataflow to be the most performant for our use case. In this mapping, the elements of the input vector are buffered near the multipliers, while elements of the weight matrix are streamed through to generate one output element/partial sum, with a sustained throughput of 1 element per cycle. Throughput can be further increased by adding more such 1-D units. We name the custom core with one or more such 1-D units ADAPTNETX. In Figure 9(a) we depict the variation of the runtime of ADAPTNET inference on an ADAPTNETX with two 1-D units, as a function of the number of multipliers. We find that 512 multipliers result in the best runtime of 576 cycles when running ADAPTNET for the 2^14-MAC systolic-cell design.
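The behavioural sketch below captures the input-stationary 1-D unit with binary-tree reduction described above. The unit width, the power-of-two restriction, and the one-output-per-step loop are our assumptions, not the verified ADAPTNETX RTL.

import numpy as np

def adder_tree(vals):
    """Pairwise log2(n)-deep reduction, as a binary adder tree would do."""
    vals = list(vals)
    assert len(vals) & (len(vals) - 1) == 0, "power-of-two width assumed"
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

def dense_layer_1d(x, W):
    """Input-stationary mapping: x sits in the buffers next to the
    multipliers; one weight column streams through per step, yielding
    one output element per step (1 element/cycle sustained)."""
    outs = []
    for col in W.T:                     # stream the weight matrix
        prods = x * col                 # all multipliers fire in parallel
        outs.append(adder_tree(prods))  # reduced by the adder tree
    return np.array(outs)

x = np.arange(8, dtype=float)           # stationary input vector
W = np.ones((8, 3))                     # a tiny dense layer's weights
print(dense_layer_1d(x, W))             # [28. 28. 28.], one per "cycle"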
B. SAGAR Accelerator

SAGAR is constructed by augmenting the 2^14-MAC SSA unit, laid out as a 32x32 grid of systolic-cells, with an ADAPTNETX running ADAPTNET-858 (see Figure 10). We chose this configuration as it has the same compute as the TPU v2, and the 4x4 systolic-cell size works best for our workloads (see Section VI-A). Since each row and column in this configuration has 31 bypass links and one direct link to a MAC, each buffer is constructed as a collection of 1024 1KB SRAM banks.

TABLE I
Architectural configuration of the distributed systolic-array based systems, the monolithic systolic-array baseline, and SAGAR

Name | Num Units | MAC/unit | Banks per SRAM buffer | Capacity per SRAM bank
Distributed 4x4 units (Baseline, GPU tensor-core like) | 1024 | 16 | 4 | 256 B
Distributed 8x8 units | 256 | 64 | 8 | 512 B
Distributed 16x16 units | 64 | 256 | 16 | 1 KB
Distributed 32x32 units | 16 | 1024 | 32 | 2 KB
Distributed 64x64 units | 4 | 4096 | 64 | 4 KB
Monolithic 128x128 (Baseline, TPU like) | 1 | 16384 | 128 | 8 KB
SAGAR | 1 | 16384 | 1024 | 1 KB
Real-time Reconfiguration.
The ADAPTNETX uses an additional 512KB SRAM bank to store the embedding table and the weight matrices for ADAPTNET-858. Each configuration corresponds to a 3968-bit vector which sets the bypass muxes once the layer is ready to be mapped.

VI. EVALUATIONS
To showcase the capabilities of our proposed design, we evaluate SAGAR in two settings. To show the benefits that arise solely from the architectural aspects, we present results obtained from analysis done with a high-level simulator. Then, to capture the implementation-dependent aspects of the design, we implement SAGAR and the baselines in RTL and capture the PPA numbers by running a Place-and-Route (PnR) flow. We also compare SAGAR with a state-of-the-art flexible accelerator architecture, SIGMA [28]. The following subsections describe our findings in detail.
A. Architectural evaluations
Methodology.
For our architecture-level studies we chose to use SCALE-Sim [33]. SCALE-Sim is a cycle-accurate simulator for systolic arrays, which generates per-cycle data accesses to and from the various memories. This enables us to estimate and compare the performance, energy consumption, power, etc. of systolic-array based components to a reasonable degree of accuracy. We created in-house scripts to generate SCALE-Sim input files that perform the workload partitioning generated by ADAPTNET-858.

Workloads.
For our evaluations we choose FasterRCNN [29], DeepSpeech2 [2], and AlphaGoZero [36] as our workloads, representative of convolutional neural networks, language modelling networks, and DNNs for reinforcement learning, respectively.
Baselines.
We have two baselines: a large monolithic systolic array, and a collection of small systolic arrays working as a single distributed system (see Table I). We modelled the monolithic systolic array with the same dimensions as Google's TPU v2, with 128x128 MAC units. However, unlike the TPU v2, which supports floating-point MAC operations, our model assumes byte-long words, the accepted operand size for inference. We allocated a total of 3MB of SRAM memory to the entire array, divided into 3 operand buffers, one for each input operand matrix and one for the output matrix. We would like to point out that SCALE-Sim assumes sufficient DRAM bandwidth is available for ideal prefetching of operands, so the computation runs in a stall-free mode; the SRAM sizes thus have no impact on the simulated runtimes. Nevertheless, previous work [33] has shown that buffer sizes of this scale lead to reasonable off-chip requests. For our second baseline, we chose a collection of 1024 4x4 arrays (see Table I).

TABLE II
Dimensions of the synthetic GEMM workloads

     G1   G2   G3   G4   G5   G6   G7   G8   G9   G10
M    128  256  512  1024 2048 128  256  512  1024 2048
K    128  256  512  1024 2048 64   64   64   64   64
N    128  256  512  1024 2048 64   64   64   64   64

     G11  G12  G13  G14  G15  G16  G17  G18  G19  G20
M    64   64   64   64   64   64   64   64   64   64
K    64   64   64   64   64   128  256  512  1024 2048
N    128  256  512  1024 2048 64   64   64   64   64
We model both of the baselinesystems and
SAGAR in SCALE-Sim and compare the perfor-mance for our workloads. In Figure 11(a) we depict the cyclestaken to run all the layers in AlphaGoZero, DeepSpeech2,and the first 10 layers of FasterRCNN networks. Among thebaselines, the distributed configuration mostly results in fasterruntime owing to higher mapping flexibility. However
SAGAR ,owing to reconfigurability is capable of matching the betterbaseline configuration. Naturally, this flexibility leads to loweraggregated runtime for
SAGAR than either of the baselines.
Favorable Configurations.
SAGAR is also capable of realizing configurations which are out of scope for either of the baselines. This allows SAGAR to achieve higher performance than both baselines on certain layers. For example, consider the synthetic GEMM operands depicted in Table II. Figure 12(a) depicts the histogram of the best configurations for these layers, obtained from simulation. For the layers favouring the 4x4 configuration, SAGAR's performance is identical to the 4x4 distributed baseline, while for the layers favouring configurations such as 8x8 or 32x32, SAGAR leads to lower runtime than both baselines. This is depicted in Figure 11(c), where we see that
SAGAR achieves a considerable speedup over the monolithic baseline when distributed configurations are preferred, while in cases where monolithic is preferred it runs faster than both baselines.

Fig. 11. (a) Simulated runtimes for the monolithic 128x128 baseline, the distributed 1024 4x4 baseline, and SAGAR, for the layers in AlphaGoZero and DeepSpeech2 and the first 10 layers of FasterRCNN. (b) SRAM reads for the same workloads for SAGAR and the baseline configurations. (c) Speedup of SAGAR and the distributed baseline over the monolithic baseline. (d) Energy consumption breakdown (SRAM read energy and compute energy) for our workloads on SAGAR and the baselines. (e) Energy-delay product (EDP) of SAGAR and the baselines, normalized to the EDP of the monolithic baseline.

Fig. 12. Distribution of the favorable array sizes for a 16384-MAC distributed system, i.e., the sizes attaining the lowest runtime, for each layer in (a) the synthetic GEMM workloads, (b) AlphaGoZero, (c) DeepSpeech2, and (d) FasterRCNN.
SRAM reads and Energy efficiency.
In designs where a large number of computations execute in parallel, reducing the number of SRAM reads can lead to high energy savings, provided it does not negatively affect performance. In general, due to the loss of reuse, distributed configurations with smaller array sizes incur more SRAM reads. We observe this trend in Figure 11(b), where we depict the number of SRAM reads performed when running our workloads on the two baselines and on SAGAR. The distributed 4x4 configuration incurs far more reads than SAGAR and the monolithic baseline. In SAGAR, this loss of reuse is mitigated by the bypass links. As shown in Figure 11(b), across all layers in our workloads, SAGAR incurs SRAM reads close to those of the monolithic baseline. In the case of DeepSpeech2, SAGAR, owing to efficient mapping, incurs even fewer reads than the monolithic baseline. To further quantify the efficiency of
SAGAR ’s energy consumptionis almost identical to the monolithic baseline. The distributedbaseline on the other hand consumes an order of magnitudehigher energy for all the three workloads, while supporting thesame mapping configurations as
SAGAR . For AlphaGoZero,which favours a distributed configuration,
SAGAR consumesabout 20% of the energy consumed by the monolithic baseline,while almost one order of magnitude lower than that of thedistributed baseline. Figure 11(d) also shows that
SAGAR ’senergy consumption for SRAM is close to that of consumedby the monolithic array for all the three workloads. Thecomputation energy consumption in
SAGAR equivalent to thebetter of the two baselines. The combined effect of improvedlatency and reuse is perhaps better represented by the energy-delay product (EDP) depicted by Figure 11(e). In this figurewe plot the EDP for
SAGAR and the two baselines normalizedto the values corresponding to the monolithic configuration.We observe that
SAGAR results in about 92% to 80% less EDPcompared to the monolithic baseline. This further demonstratesthe efficiency of our proposed architecture, resulting frompreserving reuse while simultaneously decreasing latency dueto improved mapping.
B. Implementation evaluations
Methodology.
We implemented
SAGAR in RTL, as a 32x32 array of 4x4 systolic-cells, and ran the ASIC flow through Place-and-Route (PnR) to obtain area and power. We used a 28nm library for implementing the logic. We also implemented the SRAM buffers, as a collection of 1024 1KB cells, with the SAED32 educational library from Synopsys, to quantify the power and area overheads, and then scaled the results down to a 28nm equivalent using Dennard scaling [6].

Fig. 13. (a) The post-PnR floorplan diagram of SAGAR's compute array. (b) A table detailing the architecture configuration of SAGAR (systolic-cell dims 4x4; 1024 systolic-cells; max throughput 32.768 TOPs; frequency 1 GHz; tech node 28nm) and its post-PnR area (84.89 mm2) and power (13.05 W). (c) The comparison and breakdown of post-synthesis area for the distributed systolic-array based designs, the monolithic systolic baseline, SAGAR, and SIGMA. (d) The corresponding breakdown of the power consumed by the various components (SRAM, NoC, compute array, ADAPTNETX) in the same designs.

Fig. 14. (a) The variation of the total area footprint of the SRAM banks in the various distributed systolic-array and monolithic configurations, juxtaposed with the variation in bank sizes and the number of banks required. (b) A similar variation in the power consumed by the SRAM banks. (c) The area and power of a 128x128 array when constructed using different sizes of systolic-cells, normalized to the area and power of an array constructed with traditional MAC units.

Figure 13(a) depicts the post-PnR floorplan of
SAGAR's compute logic. Figure 13(b) lists the array configuration, and the area and power consumption reported after PnR, from synthesizing the SSA and memory at an operating frequency of 1 GHz. At 32.768 TOPs (counting 1 MAC as two operations) at 1 GHz, SAGAR occupies 84.89 mm2 of real estate while consuming 13.05 Watts of power. The ADAPTNETX consumes 12.4% of the total area and 1.6% of the total power.
Baselines.
We implement the baseline monolithic 128x128 array and the distributed configurations of Table I in RTL, alongside SAGAR. In SAGAR, in addition to the links going directly from the SRAM to the edge MAC units of the array, we have to account for the bypass links as well. To get full bandwidth on these links we need to provision additional banks. Extending the design described in Figure 6: each row and column of SAGAR has 31 bypass links and one link to the first MAC unit, so we need 32 banks per row/column. Therefore each SRAM buffer is constructed with 1024 1KB banks.
Area Analysis.
In Figure 13(c) we depict the breakdown of the area for the SRAM buffers, the mesh NoC, and the compute array, for the various distributed configurations, the monolithic array, SAGAR, and SIGMA [28]. We observe that the monolithic configuration is the most efficient in terms of area, being about 5x more compact than the distributed 4x4 configuration. SAGAR, on the other hand, takes about 8% more area than the monolithic array, while consuming about 3.2x lower area than the distributed 4x4 configuration. Given that SAGAR and the distributed configuration provide the same mapping flexibility, the proposed design is strictly more efficient. SAGAR is also more compact than SIGMA, taking about 70% of its area while packing an equal number of compute units.

Across the various systolic-array configurations in Figure 13(c), the SRAM area appears to remain fairly constant. This is a direct consequence of the buffer capacity and the construction of the array. In Figure 14(a) we depict the total SRAM area obtained for the various configurations listed in Table I. We observe that the configurations vary in bank capacity and in the number of banks; since the total capacity remains the same across configurations, these factors counterbalance each other, leading to the observed trend.
Power Consumption.
In Figure 13(d) we depict the post-PnR power consumption of the various array configurations. The mesh NoC stands out as the major contributor, which naturally makes the 4x4 distributed configuration far more expensive than the monolithic configuration, with the NoC contributing about 78% of its power. Considering the power of the compute array alone, all the systolic-array based configurations consume similar power. We also depict the trend in the power consumed by the SRAM banks across the various systolic-array based configurations in Figure 14(b). As with the area breakdown, the counterbalancing effects of increasing the bank sizes and lowering the number of banks lead to similar power across the various distributed and monolithic configurations.

The SSA, however, consumes about 50% more power than the monolithic configuration, owing to the bypass links. This extra cost buys the same mapping flexibility as the 4x4 distributed configuration, which is far more expensive. SAGAR is therefore almost as scalable as a native monolithic systolic array in terms of area and, while consuming about 50% more power, provides the same mapping flexibility and performance as a distributed scaled-out configuration using 1024 4x4 systolic arrays. Furthermore, SAGAR consumes about 43% less power than SIGMA, owing to its relatively simple interconnection network.

To explore further opportunities for optimization in SAGAR's implementation, in Figure 14(c) we plot the area and power of the compute array for varying systolic-cell sizes, normalized to the area/power of the monolithic configuration. It is evident that both the power and area overheads increase as the cell size decreases. Using larger cells might be tempting, but it comes at the cost of mapping flexibility: as depicted in Figure 12(b,c,d), our workloads predominantly favour 4x4-sized arrays.

Summary.
Considering our findings from the architectural simulations and the physical implementation, we conclude that the proposed self-adaptive reconfigurable hardware achieves both high performance and energy efficiency simultaneously. The SMART SYSTOLIC compute unit delivers the high mapping efficiency of a fine-grained flexible architecture while retaining the scalability of a monolithic systolic array. The novel ADAPTNET ensures an optimal configuration at runtime with high accuracy, while the proposed ADAPTNETX minimizes its performance, power, and area overheads.

VII. RELATED WORKS
Flexible DNN Accelerators. To efficiently execute a variety of workloads, DNN accelerator designs generally come with two tiers of flexibility: architecture and dataflow. Designs like Neurocube [16], Flexflow [22], and FPGA-based designs [37] enable flexible mapping by supporting multiple dataflows. On the other hand, proposals like Planaria [10], Brainwave [7], SIGMA [28], MAERI [20], Cascades [31], and others [1], [9], [37] enable reconfiguration at the hardware level for efficient execution. The SMART SYSTOLIC array in SAGAR enables both mapping flexibility and reconfigurability. Table III depicts the standing of various such accelerators in terms of native operations supported, mapping capability, and flexibility.
TABLE III
Previous accelerator proposals categorized in terms of computation support (Convolution, GEMM), mapping capability (Homogeneous, Heterogeneous), and flexibility offered in terms of hardware reconfiguration or dataflows supported (Dataflow, Architecture)

Zhang et al. [37]  ✓ ✓ ✓ ✓
Eyeriss [4]        ✓ ✓
Alwani et al. [1]  ✓ ✓ ✓
NeuroCube [16]     ✓ ✓ ✓
MAERI [20]         ✓ ✓ ✓ ✓
TPU [13]           ✓ ✓ ✓
Flexflow [22]      ✓ ✓ ✓
Tetris [8]         ✓ ✓
Brainwave [7]      ✓ ✓ ✓
Simba [35]         ✓ ✓
Tangram [9]        ✓ ✓
Cascades [31]      ✓ ✓ ✓
Sigma [28]         ✓ ✓ ✓
Planaria [10]      ✓ ✓ ✓
SAGAR (This work)  ✓ ✓ ✓ ✓ ✓
Dataflow and Accelerator Design Space Search.
Several architecture and mapping space exploration tools have been proposed in the recent past to take advantage of the flexibility in such designs. Tools like SCALE-Sim [32], MAESTRO [18], and Tetris [8] provide analytical models for the fast cost estimation of specific configurations, while Timeloop [27], dMazeRunner [5], and others perform heuristic or exhaustive search over the architecture configuration or mapping strategy. SARA systems like SAGAR, on the other hand, use a trained recommender like ADAPTNET to circumvent the search and obtain the optimal configuration and dataflow in one shot at runtime.

ML-assisted system configuration. Recent works have demonstrated the use of ML algorithms to assist in system configuration. Gamma [15] and ConfuciuX [14] perform architecture mapping and design-space configuration search using a genetic algorithm and reinforcement learning (RL), respectively. On the systems side, work by Mirhoseini et al. [23] uses RL for task placement on a heterogeneous system, while modern compilers like AutoTVM [3] use ML models for cost prediction to improve compilation time. Nautilus [26] uses a genetic algorithm to improve FPGA place-and-route. It is worth noting that these approaches mostly enhance the search for the optimal configuration, and thus, unlike ADAPTNET, do not replace search. Perhaps the closest to our approach is the work by Kwon et al. [21], who use online tensor-based recommender systems to aid place-and-route in chip design.

VIII. CONCLUSIONS
In this paper we present a new class of accelerators named Self-Adaptive Reconfigurable Arrays (SARA). We develop SARA by augmenting a reconfigurable array with hardware running a neural-network recommender system. The recommender system provides an optimal configuration for the array at runtime, as the workload arrives, making the array self-sufficient to run optimally. We design a novel, highly accurate, and fast recommendation network called ADAPTNET, and a specialized hardware accelerator, ADAPTNETX. We also present a new design approach for building scalable GEMM accelerator architectures by augmenting traditional systolic-array units with SMART [17]-like bypass links, called 'systolic-cells'. The resulting SMART SYSTOLIC array can be configured to operate in any regime between a monolithic scaled-up and a distributed scaled-out architecture. Using a reference design named SAGAR, we show that we can simultaneously achieve high mapping flexibility and exploit spatio-temporal reuse, attaining both performance and energy efficiency.

REFERENCES
[1] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1-12.
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173-182.
[3] T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, and A. Krishnamurthy, "Learning to optimize tensor programs," in Advances in Neural Information Processing Systems, 2018, pp. 3389-3400.
[4] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 367-379, 2016.
[5] S. Dave, Y. Kim, S. Avancha, K. Lee, and A. Shrivastava, "dMazeRunner: Executing perfectly nested loops on dataflow accelerators," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, pp. 1-27, 2019.
[6] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc, "Design of ion-implanted MOSFETs with very small physical dimensions," IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256-268, 1974.
[7] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi et al., "A configurable cloud-scale DNN processor for real-time AI," in 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 1-14.
[8] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "Tetris: Scalable and efficient neural network acceleration with 3D memory," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017, pp. 751-764.
[9] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, "Tangram: Optimized coarse-grained dataflow for scalable NN accelerators," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 807-820.
[10] S. Ghodrati, B. H. Ahn, J. K. Kim, S. Kinzer, B. R. Yatham, N. Alla, H. Sharma, M. Alian, E. Ebrahimi, N. S. Kim et al., "Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks," in 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 681-697.
[11] X. He and T.-S. Chua, "Neural factorization machines for sparse predictive analytics," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 355-364.
[12] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 173-182.
[13] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1-12.
[14] S.-C. Kao, G. Jeong, and T. Krishna, "ConfuciuX: Autonomous hardware resource assignment for DNN accelerators using reinforcement learning," in 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 622-636.
[15] S.-C. Kao and T. Krishna, "GAMMA: Automating the HW mapping of DNN models on accelerators via genetic algorithm," in ICCAD, 2020.
[16] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 380-392, 2016.
[17] T. Krishna, C.-H. O. Chen, W.-C. Kwon, and L.-S. Peh, "SMART: Single-cycle multihop traversals over a shared network on chip," IEEE Micro, vol. 34, no. 3, pp. 43-56, 2014.
[18] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, "Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019.
Proceedings of the 52nd Annual IEEE/ACMInternational Symposium on Microarchitecture , 2019, pp. 754–768.[19] H. Kwon and T. Krishna, “Opensmart: Single-cycle multi-hop nocgenerator in bsv and chisel,” in . IEEE, 2017,pp. 195–204.[20] H. Kwon, A. Samajdar, and T. Krishna, “Maeri: Enabling flexible dataflowmapping over dnn accelerators via reconfigurable interconnects,”
ACMSIGPLAN Notices , vol. 53, no. 2, pp. 461–475, 2018.[21] J. Kwon, M. M. Ziegler, and L. P. Carloni, “A learning-based recom-mender system for autotuning design fiows of industrial high-performanceprocessors,” in . IEEE, 2019, pp. 1–6.[22] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “Flexflow: A flexibledataflow accelerator architecture for convolutional neural networks,” in . IEEE, 2017, pp. 553–564.[23] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou,N. Kumar, M. Norouzi, S. Bengio, and J. Dean, “Device placement opti-mization with reinforcement learning,” arXiv preprint arXiv:1706.04972 ,2017.[24] M. Naumov, D. Mudigere, H.-J. M. Shi, J. Huang, N. Sundaraman,J. Park, X. Wang, U. Gupta, C.-J. Wu, A. G. Azzolini et al. , “Deeplearning recommendation model for personalization and recommendationsystems,” arXiv preprint arXiv:1906.00091 , 2019.[25] T. NVIDIA, “Nvidia tesla v100 gpu architecture,” https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf,2017.[26] M. K. Papamichael, P. Milder, and J. C. Hoe, “Nautilus: Fast automatedip design space search using guided genetic algorithms,” in
Proceedingsof the 52nd Annual Design Automation Conference , 2015, pp. 1–6.[27] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara,R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop:A systematic approach to dnn accelerator evaluation,” in . IEEE, 2019, pp. 304–315.[28] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul,and T. Krishna, “Sigma: A sparse and irregular gemm accelerator withflexible interconnects for dnn training,” in , 2020.[29] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-timeobject detection with region proposal networks,” in
Advances in neuralinformation processing systems , 2015, pp. 91–99.[30] S. Rendle, “Factorization machines,” in . IEEE, 2010, pp. 995–1000.[31] A. Samajdar, T. Garg, T. Krishna, and N. Kapre, “Scaling the cascades:Interconnect-aware fpga implementation of machine learning problems,”in . IEEE, 2019, pp. 342–349.[32] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, andT. Krishna, “A systematic methodology for characterizing scalability ofdnn accelerators using scale-sim,” in , 2020.[33] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, “Scale-sim: Systolic cnn accelerator simulator,” arXiv preprint arXiv:1811.02883 ,2018.[34] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborativefiltering recommendation algorithms,” in
Proceedings of the 10thinternational conference on World Wide Web , 2001, pp. 285–295.[35] Y. S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang,B. Keller, A. Klinefelter, N. Pinckney, P. Raina et al. , “Simba: Scalingdeep-learning inference with multi-chip-module-based architecture,” in
Proceedings of the 52nd Annual IEEE/ACM International Symposiumon Microarchitecture , 2019, pp. 14–27.[36] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al. , “Mastering thegame of go without human knowledge,”
Nature , vol. 550, no. 7676, pp.354–359, 2017.
37] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizingfpga-based accelerator design for deep convolutional neural networks,” in
Proceedings of the 2015 ACM/SIGDA International Symposium onField-Programmable Gate Arrays , 2015, pp. 161–170., 2015, pp. 161–170.