Self-Adaptive Reconfigurable Arrays (SARA): Using ML to Assist Scaling GEMM Acceleration
Ananda Samajdar
Georgia Tech, Atlanta, GA, [email protected]
Michael Pellauer
NVIDIA, Boston, MA, [email protected]
Tushar Krishna
Georgia Tech, Atlanta, GA, [email protected]
Abstract — With increasing diversity in Deep Neural Network (DNN) models in terms of layer shapes and sizes, the research community has been investigating flexible/reconfigurable accelerator substrates. This line of research opens up two challenges. The first is to determine the appropriate amount of flexibility within an accelerator array, trading off the performance benefits of reconfigurability against its area overheads. The second is to determine the right configuration of the array for the current DNN model and/or layer and to reconfigure the accelerator at runtime. This work introduces a new class of accelerators that we call Self-Adaptive Reconfigurable Arrays (SARA). SARA architectures comprise both a reconfigurable array and a hardware unit capable of determining an optimized configuration for the array at runtime. We demonstrate an instance of SARA with an accelerator we call SAGAR, which introduces a novel reconfigurable systolic array that can be configured to work as a distributed collection of smaller arrays of various sizes or as a single array with flexible aspect ratios. We also develop a novel recommendation neural network called ADAPTNET, which recommends an array configuration and dataflow for the current layer parameters. An integrated custom hardware unit, ADAPTNETX, runs ADAPTNET at runtime and reconfigures the array, making the entire accelerator self-sufficient. SAGAR is capable of providing the same mapping flexibility as a collection of 1024 4x4 arrays working as a distributed system, while achieving 3.5x more power efficiency and 3.2x higher compute density. Furthermore, when tested over 200K cases, the runtime (GeoMean) achieved with the parameters recommended by ADAPTNET is 99.93% of the best achievable runtime.

I. INTRODUCTION
General Matrix-Matrix Multiplication (GEMM) is at the heart of Deep Neural Network (DNN) training and inference, and has thus been the target application of many accelerator designs [4], [7], [8], [9], [13], [37]. However, these individual devices work on small matrix tiles and do not have enough computation power to work on larger networks without multiple costly passes. Recent proposals [9], [35] have demonstrated the need for scaling DNN computation engines to meet the computation demands of contemporary workloads. Despite extensive research and product development on architectures for small-tile GEMMs, designing efficient architectures for performing GEMMs at scale is still non-trivial.

The crux of the problem is that there exists a pernicious trade-off between scalability and utilization (or mapping efficiency). Scalability is a direct consequence of the simplicity and regularity of a particular design. For instance, regular designs like the TPU [13] (a systolic array) can pack a large number of MAC units.
Fig. 1. Challenges in efficiently scaling GEMM arrays: (A) a monolithic array offers high reuse potential, but its rigidity adversely affects mapping flexibility; (B) a distributed system offers high mapping flexibility, but low opportunity to exploit reuse; (C) a reconfigurable array offers both high reuse potential and mapping flexibility, at the cost of hardware overhead and a large mapping search space.
A scaled-out setting, by contrast, is a feasible alternative to gain flexibility, which is further corroborated by recent architecture proposals [9], [35]. (We use scale-up/scale-out interchangeably with monolithic/distributed.)

Figure 1 provides intuition on why scale-out is more performant than scale-up. Both the monolithic and distributed accelerators have the same number of MAC units (4N) and are executing the same workload (MatA x MatB). While the distributed accelerator (Figure 1(b)) can map the entire computation in a single step, the monolithic configuration (Figure 1(a)) needs two. The irregular size of the workloads and the mapping rigidity of the monolithic configuration are responsible for the serialization, even when compute resources are available.

Unfortunately, scaling out compromises operand reuse and thus energy efficiency. First, spatial reuse via wires is limited, since communication paths between two distinct compute units are fundamentally lower-bandwidth than paths within an array, in the worst case requiring off-chip access [35]. As a corollary, each compute unit needs its own separate, locally-placed high-bandwidth memory to store operand data. These individual memories are smaller than the aggregate memory available in a scaled-up system, and are thus less able to exploit temporal reuse. To make matters worse, some data operands must be replicated, reducing capacity even further.
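To make the Figure 1 intuition concrete, the toy model below counts serial mapping passes under an output-stationary-style tiling. It is purely our illustrative sketch; the pass-counting rule, tile shapes, and the 16-MAC budget are assumptions, not the paper's simulator.

from math import ceil

def passes(M, N, units):
    """units: a list of (rows, cols) arrays, all the same shape.
    Each pass maps one rows x cols output tile per unit, so extra
    units that fit the workload's shape directly become fewer
    serial passes."""
    rows, cols = units[0]
    tiles = ceil(M / rows) * ceil(N / cols)  # output tiles to cover
    return ceil(tiles / len(units))          # tiles mapped per pass

M, N = 2, 8                        # a short, wide output matrix
print(passes(M, N, [(4, 4)]))      # monolithic 4x4 (16 MACs): 2 passes
print(passes(M, N, [(2, 4)] * 2))  # two 2x4 arrays (16 MACs): 1 pass

With the same 16 MACs, the monolithic 4x4 array needs two serial passes for the short, wide 2x8 output, while two 2x4 arrays finish in one, mirroring Figure 1(a) versus 1(b).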
Fig. 2. The constitution and interactions of the self-adaptive (SA) and reconfigurable array (RA) components that make up the SARA accelerator called SAGAR in this work. The SA unit runs AdaptNet-858 (the recommendation model) on AdaptNetX (the recommendation hardware); given the layer shape of the workload DNN, it produces a config for the SMART systolic array (the reconfigurable accelerator), which consumes the layer weights and activations and produces the results.
Contribution 1:
In this paper, we propose a novel accelerator micro-architecture organization called SMART SYSTOLIC Array (SSA), aimed at attaining the benefits of both monolithic (i.e., scale-up) and distributed (i.e., scale-out) designs for efficient GEMM computation in a single unified substrate. We choose systolic arrays as our scale-up building blocks, since the simplicity of these arrays leads to low area and power overheads, maximizing local bandwidth while minimizing communication distance. To mitigate the under-utilization problem, we propose augmenting the traditional systolic design with a bypass interconnection network inspired by SMART [17]. These links permit us to emulate, within a monolithic compute array, mappings similar to those of a distributed cluster of accelerators (Figure 1(c)). The SMART-like links are configurable, and can therefore emulate distributed systems of different granularities: e.g., a system with 128x128 MAC (multiply-and-accumulate) units can be configured to be used as four 64x64 units, eight 64x32 units, or even thirty-two 32x16 units.
Contribution 2:
Finding the optimal configuration for such a reconfigurable array is the key to achieving the best performance. However, the best configuration depends on the workload, which means the configuration needs to be determined at runtime. A finer granularity of reconfiguration improves mapping efficiency, but at the same time increases the configuration search space [15], [27]. Extra resources are needed to ensure that configuration search does not become a bottleneck at runtime. We develop a novel lightweight neural recommendation model called ADAPTNET, which replaces costly configuration search with a constant-time inference operation at runtime. For a given workload dimension, the network recommends the best architecture configuration as well as the mapping (dataflow) strategy. The recommended configurations attain about 99.93% of the best runtime on average (GeoMean) in our tests with 200K samples.

Contribution 3:
We also design custom hardware for running ADAPTNET, called ADAPTNETX, and use it to augment the reconfigurable accelerator. The resulting design is thus self-sufficient, providing optimal performance without external inputs.
Contribution 4:
We integrate these three components into an accelerator which we call 'Shape Adaptive GEMM AcceleratoR' (SAGAR), as shown in Figure 2, and evaluate its performance across various configurations. (Sagar is a Sanskrit word that means ocean, reflecting the ability of our accelerator to take flexible shapes.)

Fig. 3. The trade-off between improved runtime and lost operand reuse in compute-equivalent monolithic and distributed systolic-array configurations, when multiplying a 256x64 matrix with a 64x... matrix. Subfigures show (a) the theoretical minimum runtime, and the runtimes obtained for stall-free operation of the monolithic and compute-normalized distributed systolic-array settings; and (b) the corresponding SRAM reads, normalized to the theoretical minimum reads required.

We show that SAGAR has 3.2x higher compute density and 3.5x improved power efficiency over an equivalent scaled-out systolic array.
The extra flexibility costs less than 10% in area and 50% in power, compared to an equivalent scaled-up systolic array. Compared to an area-normalized state-of-the-art flexible scalable accelerator [28], SAGAR incorporates 45% more compute, while, comparing compute-equivalent configurations, SAGAR consumes 43% less power and has a 30% smaller area footprint (see Section VI-B).

We believe our proposed accelerator is the first in a class of designs we name Self-Adaptive Reconfigurable Arrays (SARA) (Figure 2). To summarize, we make the following contributions: (i) We propose SSA, a reconfigurable architecture for scalable GEMM acceleration, simultaneously achieving high utilization and reuse. (ii) We develop ADAPTNET, a lean recommendation neural network which suggests optimized configurations and dataflows with high accuracy. (iii) We implement ADAPTNETX, hardware capable of running ADAPTNET in constant time and configuring SSA at runtime. (iv) We integrate the above components into a SARA accelerator called SAGAR, which achieves optimal runtime at lower power and area than the state of the art.

II. BACKGROUND AND MOTIVATION
A. Scaling DNN acceleration
Google's Tensor Processing Unit (TPU) [13], a large 256x256 systolic-array based accelerator, is one of the first datacenter-class DNN accelerators. However, as the authors report, irregular-sized GEMM operations found in recurrent workloads result in as much as 86% under-utilization of the array, attributed to the rigidity of mapping. A recent accelerator proposal called SIGMA [28] tries to alleviate this problem by introducing mapping flexibility, using a two-level interconnect to deliver data to the computation elements. Recent proposals like Simba [35] instead take the scale-out approach, building a cluster of Multi-Chip Modules (MCMs) to increase the compute capability of the design. Tangram [9], on the other hand, proposes a scaled-out design where the problem of data replication is potentially mitigated by communicating over an interconnection network and precisely timing the communication and compute.

While recent research targets both scaled-up and scaled-out approaches, it is not immediately clear whether one design direction has an inherent advantage over the other. One study [32] argues that for systolic-array based designs, a scaled-out configuration is almost always better than a scaled-up one in terms of performance. However, this benefit comes at the cost of a huge increase in external bandwidth requirement (in this case from DRAM), which renders implementing such a design impractical at scale.

To help understand the trade-offs involved in choosing a performant configuration, and the associated loss of reuse, we perform a simple experiment. We run one GEMM operation, involving operand matrices of sizes 256x64 and 64x..., on a monolithic 128x128 array and on compute-equivalent distributed configurations: 4 64x64 arrays, 16 32x32 arrays, 64 16x16 arrays, 256 8x8 arrays, and 1024 4x4 arrays. From Figure 3(a) we observe that the 32x32 configuration is the most performant, beating the monolithic configuration by about 2x. In Figure 3(b) we depict the SRAM read accesses performed by all the array configurations, normalized to the theoretical minimum number of reads possible. From this figure we observe that the 32x32 configuration performed about 4x more memory accesses than the monolithic one. The excess memory accesses, which lead to reduced energy efficiency, result from the loss of wire reuse as compared to a monolithic array.

From the discussion above we make two observations. (i) Distributed arrays are more performant than an equivalent monolithic array, even when mapping efficiency is 100% on both. However, the optimal size of each device in a distributed setting is workload dependent. (ii) Monolithic configurations are strictly more energy efficient than distributed arrays, due to the loss of spatio-temporal reuse in the latter. Mitigating the loss of reuse in a distributed setting is therefore the key to achieving both performance and energy efficiency simultaneously, and is the goal of this work.

B. Single cycle multi-hop links with SMART
SMART [17] describes a mechanism to reduce the average hop count of a network-on-chip (NoC) by introducing bypass paths which can be used to allow multi-hop traversal of a packet within a cycle. Implementing the proposed scheme involves two hardware changes. First is the addition of an alternate path from the input ports of each router to the input selector mux, bypassing the queuing buffer in the datapath. Second is the addition of dedicated control paths from each router to HPC_max (maximum possible hops per cycle) routers downstream in every direction.
Fig. 4. Multi-hop traversals in a single cycle using SMART [17]: bypass muxes and repeaters form a single-cycle multi-hop path from a source router to a destination router.
Using the bypass path happens in two steps. In the first step, the SMART path is set up by the source router, one cycle prior to the actual movement of the packet. The source router broadcasts a SMART Setup Request (SSR) to the downstream routers over dedicated SSR links. Each SSR link is log2(HPC_max) bits wide and indicates the number of hops the flit wants to travel. The request precedes the flit by one cycle, which allows the downstream routers to decide whether to accept the packet before it is launched. The downstream routers employ a fixed-priority scheme to ensure that there are no packet drops.
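As a rough illustration of the SSR handshake, the toy model below resolves competing single-direction SSRs one cycle ahead of flit traversal. It is entirely our simplification: the function names, the linear topology, and the 'nearer source wins' fixed-priority rule are assumptions, and the real SMART arbitration is more involved.

HPC_MAX = 8  # assumed maximum hops per cycle; each SSR link would be
             # log2(HPC_MAX) = 3 bits wide in this toy model

def resolve_ssrs(ssrs):
    """ssrs: list of (src_router, requested_hops) along one direction.
    Returns {src: hops granted this cycle}. Routers grant their bypass
    mux with a fixed priority (here: the nearer source wins), so a
    losing flit simply stops and is buffered early; no flit is dropped."""
    granted = {}
    claimed = set()  # routers whose bypass mux is already promised
    for src, hops in sorted(ssrs):
        hops = min(hops, HPC_MAX)
        travelled = 0
        for router in range(src + 1, src + hops + 1):
            if router in claimed:
                break                # mux taken: stop one hop earlier
            claimed.add(router)
            travelled += 1
        granted[src] = travelled
    return granted

# Routers 0 and 2 both request 4 hops; their paths overlap at router 3,
# so the lower-priority flit cannot launch and waits at its source.
print(resolve_ssrs([(0, 4), (2, 4)]))  # {0: 4, 2: 0}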
C. Recommendation Systems

Recommendation systems are widely used on social media, streaming services, online marketplaces, etc. to show the most relevant advertisements or products for a given user, increasing the click-through rate and providing meaningful personalized content. Over the years, neural networks have consistently improved upon the previous state-of-the-art methods: neural collaborative filtering (NCF) [12] over collaborative filtering [34], and neural factorization machines (NFM) [11] over factorization machines [30]. The current state-of-the-art method, the Deep Learning Recommendation Model (DLRM) [24], uses the feature-embedding and feature-interaction techniques employed by NCF and NFM, coupled with deep multi-layer perceptrons. In the DLRM architecture, a combination of sparse and dense features is used as input. The sparse features are converted into dense vectors in a learned latent space via embedding lookups. The resulting dense features are then passed through multiple dense layers, followed by feature interactions [11], [24]. Finally, the combined features are sent through a few more dense layers to get the final classification.
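The minimal sketch below mirrors this DLRM-style structure: embedding lookups for sparse features, a bottom MLP for dense features, pairwise feature interactions, and a top MLP. All sizes and the single-layer MLPs are made-up assumptions for illustration, not the configuration from the DLRM paper.

import numpy as np

rng = np.random.default_rng(0)
EMB = 16                                     # assumed embedding width

tables = [rng.normal(size=(100, EMB)) for _ in range(3)]  # one table per sparse feature
W_bottom = rng.normal(size=(4, EMB))         # dense features -> latent space
W_top = rng.normal(size=(EMB + 6, 1))        # 6 = C(4,2) pairwise interactions

def dlrm_forward(dense_x, sparse_ids):
    # 1. Sparse features become dense vectors via embedding lookup.
    embs = [t[i] for t, i in zip(tables, sparse_ids)]
    # 2. Dense features pass through a (here: one-layer) bottom MLP.
    z = np.maximum(dense_x @ W_bottom, 0.0)
    # 3. Feature interaction: pairwise dot products of all vectors.
    feats = embs + [z]
    inter = [feats[i] @ feats[j]
             for i in range(len(feats)) for j in range(i + 1, len(feats))]
    # 4. Concatenate and classify with a (one-layer) top MLP.
    score = np.concatenate([z, inter]) @ W_top
    return 1.0 / (1.0 + np.exp(-score))      # sigmoid click probability

print(dlrm_forward(rng.normal(size=4), [3, 42, 7]))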
III. RECONFIGURABLE ARRAY ARCHITECTURE
We augment a base monolithic systolic array with additional bypass paths along the rows and columns, inspired by SMART [17]. This enables us to realize a flexible, energy-efficient design which can be configured to act as a large single array (i.e., scale-up) or as a collection of smaller arrays (i.e., scale-out), whenever required.
A. Compute array
Traditional MAC units.
In Figure 5(a) we show a traditional systolic array, constructed by laying down Multiply-and-Accumulate (MAC) units (Figure 5(b)) in a 2D grid. Each MAC unit is designed to get an operand from both (Left in, Top in) ports or from either of the ports, and to perform a multiplication and addition operation. In the next cycle, the operand data received is sent to its neighbour over the peer-to-peer links. The internal registers and multiplexers enable the array to work in output-stationary (OS), weight-stationary (WS), and input-stationary (IS) modes [4]. This simple mechanism of data movement results in high wire reuse, but at the same time restricts the mapping of compute only to those operations which require the same set of operands to be mapped along a row or a column.

Fig. 5. (a) A systolic array of traditional MAC units, (b) the architecture of a traditional MAC unit, (c) addition of bypass paths to a traditional MAC to create a SMART MAC unit, to (d) enable flexible mapping in a systolic array by creating 'islands' of MAC units capable of receiving data directly from SRAM banks by bypassing their peers.
SMART MAC units.
Any mechanism that enables MAC units to accept data from sources other than the neighbouring MAC units will ease the restrictions on mapping compute and therefore enable high utilization. In Figure 5(c) we depict a modified MAC unit, augmented with multiplexers on the input and output ports. We call this a SMART MAC unit. Note that a SMART MAC denotes a MAC unit with multiplexers on one or more input/output ports to enable bypass, not necessarily on all ports. These multiplexers enable reading and writing on wires other than the peer-to-peer links, thus enabling the unit to receive and work on operands unrelated to the ones forwarded by its peers. Akin to the multi-hop bypass paths employed in the SMART NoC, we can allocate bypass buses from memory to the array to provide extra channels to move data. Connecting one of the inputs of the multiplexers to these bypass buses from memory therefore enables us to create 'islands' of MAC units. These 'islands' are collections of neighbouring MAC units forming rectangular groups which can act as independent smaller arrays. We can map computations onto such 'islands' independent of the other neighbouring MACs, which translates to improved mapping flexibility (see Figure 5(d)). However, note that this design allows for arbitrary reconfiguration, which is overkill and makes the design costlier than necessary.
Systolic Cells.
We find an alternative design, employing a mix of traditional MAC units (Figure 5(b)) and SMART MAC units, to be a more practical choice. Instead of allowing flexible connectivity at the granularity of individual MAC units, we provide reconfigurability at the scale of a smaller array, by allowing flexible connectivity at the edges of the said sub-array. For example, in Figure 6(a) we show a 4x4 systolic-cell. As can be observed in Figure 6(a), a systolic-cell is constructed by using SMART MAC units at the edges of the array and traditional MAC units (Figure 5(b)) on the inside, and then connecting them using peer-to-peer links. This helps reduce implementation cost in three ways. First, the number of SMART MACs used in the array is reduced. Second, in the remaining SMART MACs, multiplexers are not required on each port; instead, only the ports which communicate with MACs outside the given systolic-cell need multiplexers to read and write data on the bypass links. Third, the number of bypass links is also reduced, as a consequence of the reduction in the number of multiplexers. With this optimization, bypasses are now performed at the granularity of the systolic-cells.

Scale-up and Scale-out using Systolic Cells.
Larger arrays can be created by arranging and connecting systolic-cells, as depicted in Figure 6(b), using the peer-to-peer links. At the edge of each systolic-cell, the bypass paths can be connected to the bypass links. Note that dedicated bypass links are allocated to each systolic-cell to allow concurrency. Attaining flexible mapping in such a design is a matter of configuring the multiplexers of the systolic-cells (see the configuration sketch below). Depending on the mapping requirement, a user can choose not to use the bypass paths at all and use the entire array as a single monolithic unit, by setting the multiplexers to accept data only from the peer-to-peer links (this is the case depicted in Figure 6(b)), which is equivalent to a scaled-up configuration. On the other hand, the user can set all the multiplexers to accept and deliver data solely on the bypass links, therefore operating as a cluster of arrays, each the size of a systolic-cell. This configuration, depicted in Figure 6(c), is equivalent to a scaled-out configuration. Other scaled-out configurations, with sub-arrays larger than the systolic-cell size, can also be realized by logically combining a few systolic-cells, setting some of the multiplexers to connect to the bypass links and others to connect to the peer-to-peer links. The availability of such a variety of choices for reconfiguration leads to flexible and efficient mapping, hence improving the utilization and energy efficiency of the design. Note that unlike the SMART NoC [17], these muxes are configured statically, and hence do not need to worry about arbitration.

Fig. 6. (a) Construction of a 4x4 systolic-cell with bypass muxes and bypass links. (b) An 8x8 array built from 4x4 systolic-cells operating as a monolithic unit: each systolic-cell is connected to its neighbor with the peer-to-peer links as the bypass muxes are turned off, and the SRAM ports connected to bypass links are unused. (c) Configuration of the bypass muxes to enable the 8x8 array to operate as a scaled-out cluster of 4x4 systolic-cells.
B. Scratch pad memory
The array constructed from systolic-cells is backed by an SRAM scratchpad memory, constructed as two individual buffers. Each of these buffers is dedicated to one of the operand matrices. Such scratchpad SRAM buffers are common in accelerators, and are designed to reduce the number of off-chip accesses and facilitate temporal reuse. Each operand buffer is operated in a double-buffered fashion, so that the prefetch latency can be minimized. The system also comprises a third buffer, which is used to store generated output elements.

Unlike conventional accelerators, however, our systolic-cell based design has bypass links, which also need to be backed by the memory. We provision for this extra bandwidth by increasing the number of memory banks in the scratchpad SRAM buffers. In a traditional systolic-array based design, each row and column of the array is connected to one dedicated SRAM port to supply one element per cycle. In a similar fashion, we allocate one port for each row, each column, and each individual bypass link. To reduce the complexity of multiplexing data within the SRAM, we increase the number of SRAM banks to support the increased number of ports. For example, the compute array shown in Figure 6(c) is backed by two scratchpad memories, each with 16 ports, and would be constructed as a collection of 16 SRAM banks. The SRAM banks in a traditional 8x8 systolic array, on the other hand, would provide 8 ports per buffer, one per row/column. We evaluate the overheads of such a design in Section VI-B.

Despite having the same number of SRAM ports as in a distributed configuration, this approach has a couple of advantages over the latter. First, no replication of data is required; replication otherwise reduces the effective capacity of the system, adversely affecting reuse. In our design, by eliminating replication we inherently improve the temporal reuse of operands.
Second, each of the systolic-cells can access data in the entire operand buffer. Due to the unified memory control of each buffer, operations like multicast are implicit in the form of read collation, which improves energy efficiency without impacting performance. We describe the impact on reads and energy efficiency in detail in Section VI-A.

Fig. 7. Pseudocode depicting the control logic.
C. Control
Figure 7 shows the control logic executed for each GEMM workload or layer of a neural network. The control logic of our proposed system is similar to that of a distributed systolic-array based system. However, unlike other systems, in a systolic-cell based design the number of distributed units is a variable, determined at runtime based on the dataflow and operand shapes. The following steps describe the logic (a runnable sketch follows).

recNetInference(): In this work we use a recommendation model, described in Section IV. The model takes in the layer parameters and recommends the configuration which is most efficient for the workload.

setBypassMuxes(): Next, the bypass muxes are set in the compute hardware to realize the partitioned configuration. This is accomplished by writing select values to a register whose individual bits drive the select lines. These configurations stay static throughout the GEMM computation.

partitionWorkload(): The control logic then partitions the original workload by marking the portions of the original operand arrays to be used by each individual partition.

systolicController(): Finally, for each partition, an instance of a systolic-array controller is initiated to drive the GEMM operations to completion and orchestrate the required data movement. Note that, in contrast to a traditional systolic array like the TPU, multiple control units are required to work in parallel.
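The runnable sketch below strings these four steps together. It is our reconstruction of the control flow that Figure 7's pseudocode depicts; the Config fields, the fixed stand-in recommendation, and the output-tiling rule in partition_workload() are assumptions, not the paper's exact logic.

from dataclasses import dataclass

@dataclass
class Config:
    unit_rows: int   # logical sub-array height, in MACs
    unit_cols: int   # logical sub-array width, in MACs
    dataflow: str    # 'OS', 'WS', or 'IS'

def rec_net_inference(M, N, K):
    # Stand-in for the AdaptNet query; returns a fixed config here.
    return Config(unit_rows=32, unit_cols=32, dataflow='WS')

def set_bypass_muxes(cfg):
    # In hardware: write the mux-select register (static per layer).
    print(f"muxes set for {cfg.unit_rows}x{cfg.unit_cols} units")

def partition_workload(M, N, K, cfg):
    # Tile the M x N output space across the logical sub-arrays.
    tiles = []
    for m0 in range(0, M, cfg.unit_rows):
        for n0 in range(0, N, cfg.unit_cols):
            tiles.append((m0, min(M, m0 + cfg.unit_rows),
                          n0, min(N, n0 + cfg.unit_cols)))
    return tiles

def run_layer(M, N, K):
    cfg = rec_net_inference(M, N, K)          # recNetInference()
    set_bypass_muxes(cfg)                     # setBypassMuxes()
    parts = partition_workload(M, N, K, cfg)  # partitionWorkload()
    for p in parts:                           # one systolicController()
        pass                                  # per partition, in parallel
    return len(parts)

print(run_layer(256, 64, 64))  # 16 partitions, driven in parallel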
IV. ARCHITECTURE RECOMMENDATION MODEL

To fully exploit the reconfigurability offered by the compute substrate, the best configuration needs to be identified depending on the workload. Identifying the best configuration, unfortunately, can be a costly operation, given the large size of the search space of possible configurations. For example, in Figure 8(a) we show the size of the configuration space of our flexible systolic-cell based architecture as a function of the number of MAC units. A TPU v2-like system with 2^14 MAC units has 858 possible configurations, while a TPU v1-like system (2^16 MAC units) has nearly 1400 configurations. For a given workload, a configuration search over such a large space at runtime can either become a performance bottleneck in a power-limited system, or end up using a significant amount of energy compared to the execution of the workload itself. In both cases, this could undermine the benefits obtained from the flexible architecture. An alternative approach is to perform the search offline and store the configurations. However, this only works when the workload configurations are known beforehand and are limited in number. For a datacenter-like use case, where scaled systems are likely to be used, the variety of workloads is expected to be high. In this work we develop a novel technique to tackle this problem. Our solution is to replace the expensive search operation with a constant-time neural network (NN) inference to procure the best configuration at runtime.
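To give a feel for why this space grows quickly, the sketch below counts only the homogeneous slice of it: rectangular logical units whose side lengths divide the cell grid, times three dataflows. This is an assumption-laden lower bound of our own; the paper's full space (858 configurations for a 2^14-MAC array) evidently also counts options this simple model ignores.

def homogeneous_configs(cell_grid=32, dataflows=('OS', 'WS', 'IS')):
    # Unit heights/widths (in systolic-cells) that tile the grid evenly.
    divs = [d for d in range(1, cell_grid + 1) if cell_grid % d == 0]
    # Every (rows x cols) unit shape paired with every dataflow.
    return len(divs) * len(divs) * len(dataflows)

print(homogeneous_configs())  # 6 * 6 * 3 = 108 for a 32x32 cell grid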
Search as an ML problem.
The first step in designing a constant-time NN inference system is to pose the search problem as an ML problem. Out of several alternative approaches, our experiments showed that posing the problem as a recommendation system provided the best performance. The idea of a recommendation system is simple: when queried with the workload parameters (i.e., the M, N, and K dimensions in our case), the network returns a category ID. This ID is used to look up an architecture configuration which provides the optimal performance for the workload. For our flexible systolic-cell based system, the parameters captured by each category are depicted in Figure 8(b). As we depict in the figure, each category represents both the architecture configuration, in terms of the systolic-cell arrangement (systolic-cell rows and columns) and the systolic-cell dimensions, and the mapping strategy, in terms of dataflow, e.g., output-stationary (OS), weight-stationary (WS), or input-stationary (IS).
Recommendation Neural Network.
We hand-designed a recommendation neural network. Given our use case, there are two main requirements to satisfy. First, we need the network to have high accuracy in predicting the best runtime configuration, which maximizes performance by attaining an optimal mapping. Second, given that the recommendation network needs to be queried at runtime, the network should be small, to lower costs. A smaller network leads to low inference latency. In our use case, the recommendation inference for a given layer is run concurrently with the execution of the previous layer whenever possible; lower inference latency therefore moves the recommendation step out of the critical path. Moreover, a smaller network has fewer computation and storage requirements, minimizing the overheads. Honoring these requirements, our proposed network is depicted in Figure 8(c). We use an embedding table to project the input features into a latent space, as in the DLRM and NCF models [12], [24]. The embedding lookups are then passed through two dense layers for classification, which choose a category from the configuration space using a softmax activation. We call this architecture ADAPTNET. This simple model works very well for flexible systolic-cell based arrays with different numbers of MAC units. Figure 8(d) depicts the accuracies obtained on test sets when the same model is trained to recommend configurations for varying numbers of MAC units. The model is trained for about 30 epochs, on a separate dataset for each MAC count, each containing about one million samples. We observe that the test accuracies are consistently over 90%, and for a few configurations as high as 96%. We also want to point out that the correct predictions correspond to the configurations which lead to the best runtime. It is also germane to note that for the cases where the network recommends configurations other than the best possible ones, the performance is still better than what we get in our baseline configurations (see Figure 9(c)). We distinguish the variants of this network by adding the number of categories as a suffix. For example, the systolic-cell based SSA with 2^14 MACs has 858 configurations, therefore the corresponding network is called ADAPTNET-858.
V. SELF ADAPTIVE RECONFIGURABLE ARRAYS
By coupling ADAPTNET with a reconfigurable array, we can create a self-adaptive system which can be conceptually viewed as a combination of two units, a Self-Adaptive (SA) unit and a Reconfigurable Array (RA) unit, as shown in Figure 2. The SA unit encompasses the software and hardware components which recommend the optimal configurations. The RA unit is the hardware unit capable of flexibly configuring itself to the recommended configurations and hence running the workloads. It is worth pointing out that this design class is not specific to a reconfigurable core for running GEMM workloads. Instead, any Coarse-Grained Reconfigurable Array (CGRA) unit, configurable at runtime, can be augmented with a suitable SA to ensure optimal performance. We believe this results in a new class of designs, which we name Self-Adaptive Reconfigurable Arrays (SARA).

A. Hardware to run ADAPTNET

In the context of our use case, an intuitive option is to allocate a few systolic-cells from the main array to run ADAPTNET. However, this choice leads either to fewer MAC units left for the actual workloads, or to allocating additional systolic-cells for ADAPTNET, an additional overhead. An alternative to adding more systolic-cells is to add custom hardware dedicated to running ADAPTNET. We explore both the systolic-cell and custom hardware options below. For ease of discussion, we chose our RA to be a 1024 4x4 systolic-cell unit (2^14 MACs), which we call the SMART SYSTOLIC unit.
Fig. 8. (a) Size of the configuration space with respect to the number of MAC units (2^12 through 2^20) for a systolic-cell based flexible array. (b) Example of the configurations predicted by ADAPTNET, indexed by category ID: each entry captures the systolic-cell rows and columns, the systolic-cell dimensions, and the dataflow. (c) Schematic of the ADAPTNET topology: the workload parameters [M, N, K] index 10x32 feature embeddings, followed by a Dense(ReLU) layer and a Dense(Softmax) layer that outputs the category index. (d) Test accuracies obtained on test sets for ADAPTNETs trained on systolic-cell based flexible arrays with various numbers of MAC units.

Fig. 9. (a) Cycles needed to run ADAPTNET-858 on an array of systolic-cells and on the custom hardware unit (ADAPTNETX), as a function of the number of multipliers. (b) Architecture of the custom 1-D unit hardware for ADAPTNETX: stationary buffers hold the input elements next to the multipliers, whose products are reduced by an adder tree into one output element as the other operand streams through. (c) Relative performance of the configurations predicted by ADAPTNET-858 for SAGAR, over 200K test samples, compared to the runtime of the best possible configurations (GeoMean: 99.93%).

Fig. 10. Schematic of SAGAR, an instance of a SARA accelerator: ADAPTNETX (the SA unit) receives [M, N, K] and produces a recommended configuration; the configuration vectors drive the mux selects of the systolic-cell multiplexers in the reconfigurable array, whose systolic-cells are fed by SRAM banks over peer-to-peer links and horizontal/vertical bypass buses, with separate SRAM banks for the output and an external interface.
The corresponding recommendation model that we use is therefore ADAPTNET-858.

ADAPTNET Runtime on systolic-cells. Figure 9(a) shows the cycles required for a single inference of ADAPTNET as a function of the number of multipliers used in a 4x4 systolic-cell based array. Understandably, the runtime decreases in proportion to the increase in the number of multipliers as we increase the number of systolic-cells, achieving the best runtime of 1134 cycles when using 1024 multipliers, or 64 cells. When both the workloads and the recommendation engine are run on the same array, for a TPU-equivalent machine with 2^14 MAC units, about 6.25% of the array needs to be allocated for running ADAPTNET. Another choice could be allocating more hardware resources, in the form of 64 extra systolic-cells dedicated to running the recommender network. However, given that ADAPTNET consists exclusively of dense layers processing the embedding lookups, a systolic execution turns out to be sub-optimal.

ADAPTNET Runtime on ADAPTNETX.
We found a custom design tuned for ADAPTNET's layer parameters to be more efficient. For efficient execution of the dense layers, we chose a 1-D multiplier unit with a binary-tree based reduction, as shown in Figure 9(b). We found the input-stationary (IS) dataflow to be the most performant for our use case. In this mapping, the elements of the input vector are buffered near the multipliers, while elements of the weight matrix are streamed through to generate one output element/partial sum, with a sustained throughput of 1 element per cycle. Throughput can be further increased by adding more such 1-D units. We name the custom core with one or more such 1-D units ADAPTNETX. In Figure 9(a) we depict the variation of the runtime of ADAPTNET inference on an ADAPTNETX with two 1-D units, as a function of the number of multipliers. We find that 512 multipliers result in the best runtime of 576 cycles when running ADAPTNET for the 2^14-MAC systolic-cell design.
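The behavioural sketch below captures the input-stationary 1-D unit with binary-tree reduction described above. The unit width, the power-of-two restriction, and the one-output-per-step loop are our assumptions, not the verified ADAPTNETX RTL.

import numpy as np

def adder_tree(vals):
    """Pairwise log2(n)-deep reduction, as a binary adder tree would do."""
    vals = list(vals)
    assert len(vals) & (len(vals) - 1) == 0, "power-of-two width assumed"
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

def dense_layer_1d(x, W):
    """Input-stationary mapping: x sits in the buffers next to the
    multipliers; one weight column streams through per step, yielding
    one output element per step (1 element/cycle sustained)."""
    outs = []
    for col in W.T:                     # stream the weight matrix
        prods = x * col                 # all multipliers fire in parallel
        outs.append(adder_tree(prods))  # reduced by the adder tree
    return np.array(outs)

x = np.arange(8, dtype=float)           # stationary input vector
W = np.ones((8, 3))                     # a tiny dense layer's weights
print(dense_layer_1d(x, W))             # [28. 28. 28.], one per "cycle"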
B. SAGAR Accelerator

SAGAR is constructed by augmenting the 2^14-MAC SSA unit, laid out as a 32x32 grid of systolic-cells, with an ADAPTNETX running ADAPTNET-858 (see Figure 10). We chose this configuration as it has the same compute as the TPU v2, and the 4x4 systolic-cell size works best for our workloads (see Section VI-A). Since each row and column in this configuration has 31 bypass links and one direct link to a MAC, each buffer is constructed as a collection of 1024 1KB SRAM banks.

TABLE I
Architectural configuration of the distributed systolic-array based systems, the monolithic systolic-array baseline, and SAGAR

Name | Num Units | MAC/unit | Banks per SRAM buffer | Capacity per SRAM bank
Distributed 4x4 units (Baseline, GPU tensor-core like) | 1024 | 16 | 4 | 256 B
Distributed 8x8 units | 256 | 64 | 8 | 512 B
Distributed 16x16 units | 64 | 256 | 16 | 1 KB
Distributed 32x32 units | 16 | 1024 | 32 | 2 KB
Distributed 64x64 units | 4 | 4096 | 64 | 4 KB
Monolithic 128x128 (Baseline, TPU like) | 1 | 16384 | 128 | 8 KB
SAGAR | 1 | 16384 | 1024 | 1 KB
Real-time Reconfiguration.
The ADAPTNETX uses an additional 512KB SRAM bank to store the embedding table and the weight matrices for ADAPTNET-858. Each configuration corresponds to a 3968-bit vector which sets the bypass muxes once the layer is ready to be mapped.

VI. EVALUATIONS
To showcase the capabilities of our proposed design, we evaluate SAGAR in two settings. To show the benefits that arise solely from the architectural aspects, we present results obtained from analysis done with a high-level simulator. Then, to capture the implementation-dependent aspects of the design, we implement SAGAR and the baselines in RTL and capture the PPA numbers by running a Place-and-Route (PnR) flow. We also compare SAGAR with a state-of-the-art flexible accelerator architecture, SIGMA [28]. The following subsections describe our findings in detail.
A. Architectural evaluations
Methodology.
For our architecture-level studies we chose to use SCALE-Sim [33]. SCALE-Sim is a cycle-accurate simulator for systolic arrays, which generates per-cycle data accesses to and from the various memories. This enables us to estimate and compare the performance, energy consumption, power, etc. of systolic-array based components to a reasonable degree of accuracy. We created in-house scripts to generate SCALE-Sim input files that perform the workload partitioning generated by ADAPTNET-858.

Workloads.
For our evaluations we choose FasterRCNN [29], DeepSpeech2 [2], and AlphaGoZero [36] as our workloads, representative of convolutional neural networks, language modelling networks, and DNNs for reinforcement learning, respectively.
Baselines.
We have two baselines: a large monolithic systolic array, and a collection of small systolic arrays working as a single distributed system (see Table I). We modelled the monolithic systolic array with the same dimensions as Google's TPU v2, with 128x128 MAC units. However, unlike the TPU v2, which supports floating-point MAC operations, our model assumes byte-long words, the accepted operand size for inference. We allocated a total of 3MB of SRAM memory to the entire array, divided into 3 operand buffers, one for each input operand matrix and one for the output matrix. We would like to point out that SCALE-Sim assumes sufficient DRAM bandwidth is available for ideal prefetching of operands, so the computation runs in a stall-free mode; the SRAM sizes thus have no impact on the simulated runtimes. Nevertheless, previous work [33] has shown that buffer sizes of this scale lead to reasonable off-chip requests. For our second baseline, we chose a collection of 1024 4x4 arrays (see Table I).

TABLE II
Dimensions of the synthetic GEMM workloads

     G1   G2   G3   G4   G5   G6   G7   G8   G9   G10
M    128  256  512  1024 2048 128  256  512  1024 2048
K    128  256  512  1024 2048 64   64   64   64   64
N    128  256  512  1024 2048 64   64   64   64   64

     G11  G12  G13  G14  G15  G16  G17  G18  G19  G20
M    64   64   64   64   64   64   64   64   64   64
K    64   64   64   64   64   128  256  512  1024 2048
N    128  256  512  1024 2048 64   64   64   64   64
We model both of the baselinesystems and
SAGAR in SCALE-Sim and compare the perfor-mance for our workloads. In Figure 11(a) we depict the cyclestaken to run all the layers in AlphaGoZero, DeepSpeech2,and the first 10 layers of FasterRCNN networks. Among thebaselines, the distributed configuration mostly results in fasterruntime owing to higher mapping flexibility. However
SAGAR ,owing to reconfigurability is capable of matching the betterbaseline configuration. Naturally, this flexibility leads to loweraggregated runtime for
SAGAR than either of the baselines.
Favorable Configurations.
SAGAR is also capable of realizing configurations which are out of scope for either of the baselines. This allows SAGAR to achieve higher performance than both baselines on certain layers. For example, consider the synthetic GEMM operands depicted in Table II. Figure 12(a) depicts the histogram of the best configurations for these layers, obtained from simulation. For the layers favouring the 4x4 configuration, SAGAR's performance is identical to the 4x4 distributed baseline, while for the layers favouring configurations such as 8x8 or 32x32, SAGAR leads to lower runtime than both baselines. This is depicted in Figure 11(c), where we see that
SAGAR achieves a considerable speedup over the monolithic baseline when distributed configurations are preferred, while in cases where monolithic is preferred it runs faster than both baselines.

Fig. 11. (a) Simulated runtimes for the monolithic 128x128 baseline, the distributed 1024 4x4 baseline, and SAGAR, for the layers in AlphaGoZero and DeepSpeech2 and the first 10 layers of FasterRCNN. (b) SRAM reads for the same workloads for SAGAR and the baseline configurations. (c) Speedup of SAGAR and the distributed baseline over the monolithic baseline. (d) Energy consumption breakdown (SRAM read energy and compute energy) for our workloads on SAGAR and the baselines. (e) Energy-delay product (EDP) of SAGAR and the baselines, normalized to the EDP of the monolithic baseline.

Fig. 12. Distribution of the favorable array sizes for a 16384-MAC distributed system, i.e., the sizes attaining the lowest runtime, for each layer in (a) the synthetic GEMM workloads, (b) AlphaGoZero, (c) DeepSpeech2, and (d) FasterRCNN.
SRAM reads and Energy efficiency.
In designs where a large number of computations execute in parallel, reducing the number of SRAM reads can lead to high energy savings, provided it does not negatively affect performance. In general, due to the loss of reuse, distributed configurations with smaller array sizes incur more SRAM reads. We observe this trend in Figure 11(b), where we depict the number of SRAM reads performed when running our workloads on the two baselines and on SAGAR. The distributed 4x4 configuration incurs far more reads than SAGAR and the monolithic baseline. In SAGAR, this loss of reuse is mitigated by the bypass links. As shown in Figure 11(b), across all layers in our workloads, SAGAR incurs SRAM reads close to those of the monolithic baseline. In the case of DeepSpeech2, SAGAR, owing to efficient mapping, incurs even fewer reads than the monolithic baseline. To further quantify the efficiency of
SAGAR ’s energy consumptionis almost identical to the monolithic baseline. The distributedbaseline on the other hand consumes an order of magnitudehigher energy for all the three workloads, while supporting thesame mapping configurations as
SAGAR . For AlphaGoZero,which favours a distributed configuration,
SAGAR consumesabout 20% of the energy consumed by the monolithic baseline,while almost one order of magnitude lower than that of thedistributed baseline. Figure 11(d) also shows that
SAGAR ’senergy consumption for SRAM is close to that of consumedby the monolithic array for all the three workloads. Thecomputation energy consumption in
SAGAR equivalent to thebetter of the two baselines. The combined effect of improvedlatency and reuse is perhaps better represented by the energy-delay product (EDP) depicted by Figure 11(e). In this figurewe plot the EDP for
SAGAR and the two baselines normalizedto the values corresponding to the monolithic configuration.We observe that
SAGAR results in about 92% to 80% less EDPcompared to the monolithic baseline. This further demonstratesthe efficiency of our proposed architecture, resulting frompreserving reuse while simultaneously decreasing latency dueto improved mapping.
B. Implementation evaluations
Methodology.
We implemented
SAGAR in RTL, as a 32x32 array of 4x4 systolic-cells, and ran the ASIC flow through Place-and-Route (PnR) to obtain area and power. We used a 28nm library for implementing the logic. We also implemented the SRAM buffers, as a collection of 1024 1KB cells, with the SAED32 educational library from Synopsys, to quantify the power and area overheads, and then scaled the results down to a 28nm equivalent using Dennard scaling [6].

Fig. 13. (a) The post-PnR floorplan diagram of SAGAR's compute array. (b) A table detailing the architecture configuration of SAGAR (systolic-cell dims 4x4; 1024 systolic-cells; max throughput 32.768 TOPs; frequency 1 GHz; tech node 28nm) and its post-PnR area (84.89 mm2) and power (13.05 W). (c) The comparison and breakdown of post-synthesis area for the distributed systolic-array based designs, the monolithic systolic baseline, SAGAR, and SIGMA. (d) The corresponding breakdown of the power consumed by the various components (SRAM, NoC, compute array, ADAPTNETX) in the same designs.

Fig. 14. (a) The variation of the total area footprint of the SRAM banks in the various distributed systolic-array and monolithic configurations, juxtaposed with the variation in bank sizes and the number of banks required. (b) A similar variation in the power consumed by the SRAM banks. (c) The area and power of a 128x128 array when constructed using different sizes of systolic-cells, normalized to the area and power of an array constructed with traditional MAC units.

Figure 13(a) depicts the post-PnR floorplan of
SAGAR's compute logic. Figure 13(b) lists the array configuration, and the area and power consumption reported after PnR, from synthesizing the SSA and memory at an operating frequency of 1 GHz. At 32.768 TOPs (counting 1 MAC as two operations) at 1 GHz, SAGAR occupies 84.89 mm2 of real estate while consuming 13.05 Watts of power. The ADAPTNETX consumes 12.4% of the total area and 1.6% of the total power.
Baselines.
We implement the baseline monolithic 128x128 array and the distributed configurations of Table I in RTL, alongside SAGAR. In SAGAR, in addition to the links going directly from the SRAM to the edge MAC units of the array, we have to account for the bypass links as well. To get full bandwidth on these links we need to provision additional banks. Extending the design described in Figure 6: each row and column of SAGAR has 31 bypass links and one link to the first MAC unit, so we need 32 banks per row/column. Therefore each SRAM buffer is constructed with 1024 1KB banks.
Area Analysis.
In Figure 13(c) we depict the breakdown of the area for the SRAM buffers, the mesh NoC, and the compute array, for the various distributed configurations, the monolithic array, SAGAR, and SIGMA [28]. We observe that the monolithic configuration is the most efficient in terms of area, being about 5x more compact than the distributed 4x4 configuration. SAGAR, on the other hand, takes about 8% more area than the monolithic array, while consuming about 3.2x lower area than the distributed 4x4 configuration. Given that SAGAR and the distributed configuration provide the same mapping flexibility, the proposed design is strictly more efficient. SAGAR is also more compact than SIGMA, taking about 70% of its area while packing an equal number of compute units.

Across the various systolic-array configurations in Figure 13(c), the SRAM area appears to remain fairly constant. This is a direct consequence of the buffer capacity and the construction of the array. In Figure 14(a) we depict the total SRAM area obtained for the various configurations listed in Table I. We observe that the configurations vary in bank capacity and in the number of banks; since the total capacity remains the same across configurations, these factors counterbalance each other, leading to the observed trend.
Power Consumption.
In Figure 13(d) we depict the post-PnR power consumption of the various array configurations. The mesh NoC stands out as the major contributor, which naturally makes the 4x4 distributed configuration far more expensive than the monolithic configuration, with the NoC contributing about 78% of its power. Considering the power of the compute array alone, all the systolic-array based configurations consume similar power. We also depict the trend in the power consumed by the SRAM banks across the various systolic-array based configurations in Figure 14(b). As with the area breakdown, the counterbalancing effects of increasing the bank sizes and lowering the number of banks lead to similar power across the various distributed and monolithic configurations.

The SSA, however, consumes about 50% more power than the monolithic configuration, owing to the bypass links. This extra cost buys the same mapping flexibility as the 4x4 distributed configuration, which is far more expensive. SAGAR is therefore almost as scalable as a native monolithic systolic array in terms of area and, while consuming about 50% more power, provides the same mapping flexibility and performance as a distributed scaled-out configuration using 1024 4x4 systolic arrays. Furthermore, SAGAR consumes about 43% less power than SIGMA, owing to its relatively simple interconnection network.

To explore further opportunities for optimization in SAGAR's implementation, in Figure 14(c) we plot the area and power of the compute array for varying systolic-cell sizes, normalized to the area/power of the monolithic configuration. It is evident that both the power and area overheads increase as the cell size decreases. Using larger cells might be tempting, but it comes at the cost of mapping flexibility: as depicted in Figure 12(b,c,d), our workloads predominantly favour 4x4-sized arrays.

Summary.
Considering our findings from the architectural simulations and the physical implementation, we conclude that the proposed self-adaptive reconfigurable hardware achieves both high performance and energy efficiency simultaneously. The SMART SYSTOLIC compute unit delivers the high mapping efficiency of a fine-grained flexible architecture while retaining the scalability of a monolithic systolic array. The novel ADAPTNET ensures an optimal configuration at runtime with high accuracy, while the proposed ADAPTNETX minimizes its performance, power, and area overheads.

VII. RELATED WORKS
Flexible DNN Accelerators. To efficiently execute a variety of workloads, DNN accelerator designs generally come with two tiers of flexibility: architecture and dataflow. Designs like Neurocube [16], Flexflow [22], and FPGA-based designs [37] enable flexible mapping by supporting multiple dataflows. On the other hand, proposals like Planaria [10], Brainwave [7], SIGMA [28], MAERI [20], Cascades [31], and others [1], [9], [37] enable reconfiguration at the hardware level for efficient execution. The SMART SYSTOLIC array in SAGAR enables both mapping flexibility and reconfigurability. Table III depicts the standing of various such accelerators in terms of native operations supported, mapping capability, and flexibility.
TABLE III
Previous accelerator proposals categorized in terms of computation support (Convolution, GEMM), mapping capability (Homogeneous, Heterogeneous), and flexibility offered in terms of hardware reconfiguration or dataflows supported (Dataflow, Architecture)

Zhang et al. [37]  ✓ ✓ ✓ ✓
Eyeriss [4]        ✓ ✓
Alwani et al. [1]  ✓ ✓ ✓
NeuroCube [16]     ✓ ✓ ✓
MAERI [20]         ✓ ✓ ✓ ✓
TPU [13]           ✓ ✓ ✓
Flexflow [22]      ✓ ✓ ✓
Tetris [8]         ✓ ✓
Brainwave [7]      ✓ ✓ ✓
Simba [35]         ✓ ✓
Tangram [9]        ✓ ✓
Cascades [31]      ✓ ✓ ✓
Sigma [28]         ✓ ✓ ✓
Planaria [10]      ✓ ✓ ✓
SAGAR (This work)  ✓ ✓ ✓ ✓ ✓
Dataflow and Accelerator Design Space Search.
Several architecture and mapping space exploration tools have been proposed in the recent past to take advantage of the flexibility in such designs. Tools like SCALE-Sim [32], MAESTRO [18], and Tetris [8] provide analytical models for the fast cost estimation of specific configurations, while Timeloop [27], dMazeRunner [5], and others perform heuristic or exhaustive search over the architecture configuration or mapping strategy. SARA systems like SAGAR, on the other hand, use a trained recommender like ADAPTNET to circumvent the search and obtain the optimal configuration and dataflow in one shot at runtime.

ML-assisted system configuration. Recent works have demonstrated the use of ML algorithms to assist in system configuration. Gamma [15] and ConfuciuX [14] perform architecture mapping and design-space configuration search using a genetic algorithm and reinforcement learning (RL), respectively. On the systems side, work by Mirhoseini et al. [23] uses RL for task placement on a heterogeneous system, while modern compilers like AutoTVM [3] use ML models for cost prediction to improve compilation time. Nautilus [26] uses a genetic algorithm to improve FPGA place-and-route. It is worth noting that these approaches mostly enhance the search for the optimal configuration, and thus, unlike ADAPTNET, do not replace search. Perhaps the closest to our approach is the work by Kwon et al. [21], who use online tensor-based recommender systems to aid place-and-route in chip design.

VIII. CONCLUSIONS
In this paper we present a new class of accelerators named Self-Adaptive Reconfigurable Arrays (SARA). We develop SARA by augmenting a reconfigurable array with hardware running a neural-network recommender system. The recommender system provides an optimal configuration for the array at runtime, as the workload arrives, making the array self-sufficient to run optimally. We design a novel, highly accurate, and fast recommendation network called ADAPTNET, and a specialized hardware accelerator, ADAPTNETX. We also present a new design approach for building scalable GEMM accelerator architectures by augmenting traditional systolic-array units with SMART [17]-like bypass links, called 'systolic-cells'. The resulting SMART SYSTOLIC array can be configured to operate in any regime between a monolithic scaled-up and a distributed scaled-out architecture. Using a reference design named SAGAR, we show that we can simultaneously achieve high mapping flexibility and exploit spatio-temporal reuse, attaining both performance and energy efficiency.

REFERENCES
[1] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1-12.
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173-182.
[3] T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, and A. Krishnamurthy, "Learning to optimize tensor programs," in Advances in Neural Information Processing Systems, 2018, pp. 3389-3400.
[4] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 367-379, 2016.
[5] S. Dave, Y. Kim, S. Avancha, K. Lee, and A. Shrivastava, "dMazeRunner: Executing perfectly nested loops on dataflow accelerators," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, pp. 1-27, 2019.
[6] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc, "Design of ion-implanted MOSFETs with very small physical dimensions," IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256-268, 1974.
[7] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi et al., "A configurable cloud-scale DNN processor for real-time AI," in 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 1-14.
[8] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "Tetris: Scalable and efficient neural network acceleration with 3D memory," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017, pp. 751-764.
[9] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, "Tangram: Optimized coarse-grained dataflow for scalable NN accelerators," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 807-820.
[10] S. Ghodrati, B. H. Ahn, J. K. Kim, S. Kinzer, B. R. Yatham, N. Alla, H. Sharma, M. Alian, E. Ebrahimi, N. S. Kim et al., "Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks," in 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 681-697.
[11] X. He and T.-S. Chua, "Neural factorization machines for sparse predictive analytics," in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 355-364.
[12] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 173-182.
[13] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1-12.
[14] S.-C. Kao, G. Jeong, and T. Krishna, "ConfuciuX: Autonomous hardware resource assignment for DNN accelerators using reinforcement learning," in 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 622-636.
[15] S.-C. Kao and T. Krishna, "GAMMA: Automating the HW mapping of DNN models on accelerators via genetic algorithm," in ICCAD, 2020.
[16] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 380-392, 2016.
[17] T. Krishna, C.-H. O. Chen, W.-C. Kwon, and L.-S. Peh, "SMART: Single-cycle multihop traversals over a shared network on chip," IEEE Micro, vol. 34, no. 3, pp. 43-56, 2014.
[18] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, "Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019.
Proceedings of the 52nd Annual IEEE/ACMInternational Symposium on Microarchitecture , 2019, pp. 754–768.[19] H. Kwon and T. Krishna, “Opensmart: Single-cycle multi-hop nocgenerator in bsv and chisel,” in . IEEE, 2017,pp. 195–204.[20] H. Kwon, A. Samajdar, and T. Krishna, “Maeri: Enabling flexible dataflowmapping over dnn accelerators via reconfigurable interconnects,”
ACMSIGPLAN Notices , vol. 53, no. 2, pp. 461–475, 2018.[21] J. Kwon, M. M. Ziegler, and L. P. Carloni, “A learning-based recom-mender system for autotuning design fiows of industrial high-performanceprocessors,” in . IEEE, 2019, pp. 1–6.[22] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “Flexflow: A flexibledataflow accelerator architecture for convolutional neural networks,” in . IEEE, 2017, pp. 553–564.[23] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou,N. Kumar, M. Norouzi, S. Bengio, and J. Dean, “Device placement opti-mization with reinforcement learning,” arXiv preprint arXiv:1706.04972 ,2017.[24] M. Naumov, D. Mudigere, H.-J. M. Shi, J. Huang, N. Sundaraman,J. Park, X. Wang, U. Gupta, C.-J. Wu, A. G. Azzolini et al. , “Deeplearning recommendation model for personalization and recommendationsystems,” arXiv preprint arXiv:1906.00091 , 2019.[25] T. NVIDIA, “Nvidia tesla v100 gpu architecture,” https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf,2017.[26] M. K. Papamichael, P. Milder, and J. C. Hoe, “Nautilus: Fast automatedip design space search using guided genetic algorithms,” in
Proceedingsof the 52nd Annual Design Automation Conference , 2015, pp. 1–6.[27] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara,R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop:A systematic approach to dnn accelerator evaluation,” in . IEEE, 2019, pp. 304–315.[28] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul,and T. Krishna, “Sigma: A sparse and irregular gemm accelerator withflexible interconnects for dnn training,” in , 2020.[29] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-timeobject detection with region proposal networks,” in
Advances in neuralinformation processing systems , 2015, pp. 91–99.[30] S. Rendle, “Factorization machines,” in . IEEE, 2010, pp. 995–1000.[31] A. Samajdar, T. Garg, T. Krishna, and N. Kapre, “Scaling the cascades:Interconnect-aware fpga implementation of machine learning problems,”in . IEEE, 2019, pp. 342–349.[32] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, andT. Krishna, “A systematic methodology for characterizing scalability ofdnn accelerators using scale-sim,” in , 2020.[33] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, “Scale-sim: Systolic cnn accelerator simulator,” arXiv preprint arXiv:1811.02883 ,2018.[34] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborativefiltering recommendation algorithms,” in
Proceedings of the 10thinternational conference on World Wide Web , 2001, pp. 285–295.[35] Y. S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang,B. Keller, A. Klinefelter, N. Pinckney, P. Raina et al. , “Simba: Scalingdeep-learning inference with multi-chip-module-based architecture,” in
Proceedings of the 52nd Annual IEEE/ACM International Symposiumon Microarchitecture , 2019, pp. 14–27.[36] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al. , “Mastering thegame of go without human knowledge,”
Nature , vol. 550, no. 7676, pp.354–359, 2017.
37] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizingfpga-based accelerator design for deep convolutional neural networks,” in
Proceedings of the 2015 ACM/SIGDA International Symposium onField-Programmable Gate Arrays , 2015, pp. 161–170., 2015, pp. 161–170.