Direct Spatial Implementation of Sparse Matrix Multipliers for Reservoir Computing
Matthew Denton and Herman Schmit
Google Brain
{myuzaki, schmit}@google.com

Abstract—Reservoir computing systems rely on the recurrent multiplication of a very large, sparse, fixed matrix. We argue that direct spatial implementation of these fixed matrices minimizes the work performed in the computation, and allows for significant reduction in latency and power through constant propagation and logic minimization. Bit-serial arithmetic enables massive static matrices to be implemented. We present the structure of our bit-serial matrix multiplier, and evaluate using canonical signed digit representation to further reduce logic utilization. We have implemented these matrices on a large FPGA and provide a cost model that is simple and extensible. These FPGA implementations, on average, reduce latency by 50x up to 86x versus GPU libraries. Comparing against a recent sparse DNN accelerator, we measure a 4.1x to 47x reduction in latency depending on matrix dimension and sparsity. Throughput of the FPGA solution is also competitive for a wide range of matrix dimensions and batch sizes. Finally, we discuss ways these techniques could be deployed in ASICs, making them applicable for dynamic sparse matrix computations.
I. INTRODUCTION
The progress of machine learning algorithms and accelerators is bidirectionally linked. Algorithms that are well-structured for available computing achieve higher quality at less cost [25]. Similarly, accelerators are benchmarked on pre-existing ML algorithms and succeed or fail accordingly [21]. In this paper, we try to escape this self-reinforcing system by considering a different form of machine learning, contemplating what kind of accelerator this approach would require, and considering whether the results have consequences for conventional machine learning systems.

Current ML accelerators use matrix multiplication as the basic building block. These matrix multiplication units are primarily:

• Dense: Dense matrix multiplication is very efficiently implemented in silicon with systolic arrays [14]. Algorithmically, density is a result of backpropagation, which tends to fill in the value of every weight with a number, even if that value means little to the end computation.

• Small: Dense matrix multipliers are easily composable to provide the computation for arbitrarily sized matrices. Therefore, a smaller matrix multiplier can be used as an efficient atom in a big matrix computation, whereas a big matrix multiplier will suffer from lower utilization on small matrix computations.

• Two-operand: Finally, most matrix multipliers have both the matrix and the vector as stored variables. There is a need to change matrix values during training, and in order to enable time-multiplexing of the multiplier unit to support different matrix dimensions. Having two operands can be costly in area and power. The primary power-saving technique for these multipliers is batching, where the matrix values are kept constant while performing multiple vector-times-matrix computations.

Reservoir computing, specifically Echo State Networks as discussed in this paper, uses very large, sparse matrices. These matrices are generated randomly and are never modified by training, and therefore the matrix is fixed for the lifetime of the computation. In order to handle these matrices, conventional ML accelerators perform indexing and tiling of the sparse matrix, which effectively transform the large sparse operation into multiple small dense operations. However, this transformation has compute and storage overhead, and tiling increases the latency of the computation.
This paper shows that it is both possible and advantageous to implement these large sparse operations directly in programmable hardware, which eliminates the cost of indexing and tiling.
We use Reservoir Computing as a motivating factor, but discuss how this could impact ML inference acceleration in the future.
A. Contributions
The primary contributions of our paper are as follows:

• We present a bit-serial architecture for fixed vector-matrix multiplication, which minimizes hardware, energy, and latency.

• We implement very large, sparse matrices using this architecture without paying the cost of indexing or tiling.

• We present simple cost and power models, which enable the quick estimation of the size and power of any fixed matrix on an FPGA.

• We show how a simple transformation can significantly improve the achievable density.

• We compare the achievable performance against state-of-the-art sparse libraries for GPUs and DNN accelerators.

• We discuss how this idea could be applied to a custom circuit that would eliminate the limitations of the FPGA.
B. Outline
The remaining content is structured as follows: In Section II, we describe reservoir computing in detail and enumerate prior art. Section III reviews the state-of-the-art in bit-serial machine learning acceleration and describes our low-latency bit-serial matrix multiplier. Section IV describes synthesis of our architecture and presents a simple area model based on our data. Section V investigates the use of canonical signed digits in our matrices. Section VI discusses experiments compiling very large matrices onto a large FPGA and measuring area, frequency, and power. Section VII compares these results against the state-of-the-art sparse libraries for comparable GPU implementations. We also compare against a sparse DNN accelerator. Finally, in Section VIII we discuss the implications of this approach for implementations of more conventional models and ASICs.

II. RESERVOIR COMPUTING
Reservoir computing is a subset of supervised machine learning which seeks to learn a non-linear transformation from an input sequence to an output sequence. It is conceptually similar to a recurrent neural network, but has one key difference: the recurrent weights are not trained. A reservoir system can be described using the following equations, adapted from [26]:

x(n) = f(W_{in} u(n) + W x(n-1))   (1)

y(n) = W_{out} x(n)   (2)

where u(n), y(n) are the input and output at time n, respectively, W_{in}, W_{out} are the input and output weight matrices, and f() is a non-linear activation function. W is a very large, fixed matrix, and x is a large vector that carries the state information from one time unit to another.

Ordinarily, W and W_{in} are initialized according to some heuristics, and remain fixed. W_{out} is trained via linear regression. It is easy to see why this approach is appealing from a compute perspective: only a linear regressor needs to be trained, which completely eliminates the need for error backpropagation and allows the designer to choose the optimizer for the linear layer, which may be gradient descent, least squares, or any other optimization technique.

A comparison of various systems for recognition of multivariate time series is presented in [5]. This paper compares reservoir computing systems against fully trainable recurrent networks. The reservoir solutions train much faster with comparable quality. In this paper, the baseline reservoir computing system consists of a fixed square matrix with a dimension of 800 with 75% of the elements being 0, which we henceforth refer to as "element sparsity".

A comprehensive summary of physical systems used to implement physical reservoirs [26] includes a section on FPGA implementations. The earliest citation is [6], which describes encoding neuron values as "stochastic bitstreams". [19] describes an FPGA implementation of RC and presents heuristics for defining the input matrix W_{in}. [24] describes an encoding system that allows for multiplication to be represented as shifts.

An FPGA design that includes both the reservoir as well as the output layer with auto-regression was presented in [3]. This system was used to perform channel equalization, which is ideal for online learning because the known patterns with expected results are presented on a periodic basis. This paper is not easily replicated, but the results state that the design used most of the block RAM on an FPGA, with low logic utilization. This paper did not emphasize performance exploration, as evidenced by the 33MHz clock frequency.

It is shown in [16] that the elements of these reservoirs can be quantized into integers. The authors show a variety of tasks where a precision of 3-4 bits leads to no accuracy loss. They evaluate their design on a ZedBoard FPGA with an integrated ARM CPU and are able to run a 300x300 reservoir at 100MHz. Studies in [10] show that sparsity should exceed 80% to maximize performance and enable rich interaction among neurons.

The idea of using large, sparse matrices has consequences beyond reservoir computing. Numenta [1] is a company creating artificial neural networks by mimicking biological networks, which are all sparse. They are also using large, sparse, random, fixed matrices with a trainable output selection layer for their investigations. Finally, [8] showed that special subnetworks with 80-95% sparsity exist within backpropagation-trained dense networks. These subnetworks have the same or better error rates than full dense networks. In other words, most of the computation performed in inference using the full matrix is wasted.
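To make the update rule concrete, the following NumPy sketch runs Equations 1 and 2 for a random reservoir (a minimal illustration under our own assumptions: tanh as f, arbitrary dimensions, and no spectral-radius scaling of W; it is not the code evaluated in this paper):

import numpy as np

rng = np.random.default_rng(0)
dim, in_dim, out_dim = 1024, 16, 4

# Large, fixed, sparse recurrent matrix W (95% element sparsity here);
# practical reservoirs also rescale W's spectral radius, omitted for brevity.
W = rng.standard_normal((dim, dim)) * (rng.random((dim, dim)) > 0.95)
W_in = rng.standard_normal((dim, in_dim))    # fixed input matrix
W_out = rng.standard_normal((out_dim, dim))  # the only trained matrix

x = np.zeros(dim)
for u in rng.standard_normal((100, in_dim)):  # input sequence u(n)
    x = np.tanh(W_in @ u + W @ x)             # Eq. 1: state update
    y = W_out @ x                             # Eq. 2: linear readout

Every step recurs through the same fixed W; this large sparse multiply dominates the computation and is what we implement spatially.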
The remainder of this paper details our approaches to implementing a large, fixed matrix multiplication on an FPGA using bit-serial arithmetic. Specifically, we accelerate the key primitive in reservoir computing:

o = a^T V   (3)

which is often called "gemv" and is an important primitive for many machine learning workloads.

III. BIT-SERIAL VECTOR-MATRIX MULTIPLIERS
Bit-serial arithmetic was an important technique decades ago, when parallel arithmetic was too costly. Simple VLSI implementations of bit-serial neural networks [17] were proposed as far back as 1988. More recently, DNN accelerators have used bit-serial arithmetic to support variable-precision arithmetic. STRIPES [15] presents a bit-serial architecture that enables different precisions on a per-layer basis, quantizing each layer to its required bit-width. Bit-fusion [23] enables dynamic precision for inputs as well. Bit-pragmatic [2] and Laconic [22] exploit bit-level sparsity. They detect the bits that are zero and skip those multiplications. In all the above cases, the DNN dimensions often exceed the compute capability of the hardware, so tiling and striding the inner product is regularly employed.

As discussed earlier, reservoir computing uses a single large, sparse weight matrix that is fixed for both training and inference. This allows us to optimize the logic implementation by performing constant propagation on the matrix to remove the logic corresponding to all zeros: both zeros as whole terms, and bits that are zeros within individual terms. It would be possible to implement these matrices in custom logic (ASIC), but in this work we use an FPGA. The FPGA consists of an array of Lookup Tables (LUTs), Registers (FFs), and programmable interconnect. FPGAs generally include other programmable resources such as embedded multipliers and SRAMs, which we will not consider in this work. The particular FPGA we are using has the capability to re-purpose some of the LUTs into small RAMs or shift registers, which are called LUTRAMs.

Our FPGA design flow takes the content of the matrices and compiles it to a physical design consisting of the programmable logic resources and interconnect. This design flow produces an achievable frequency, area, and power estimation.
A. The Microarchitecture
Figure 1 shows the design of a bit-serial adder, which is the basic building block of bit-serial arithmetic units.
Fig. 1. Bit-serial adder
In the bit-serial adder, inputs A and B are shifted into a full adder one bit each clock cycle, LSb first. The result of the addition (S) is shifted into a result register. The carry is registered as the carry input for the next cycle. This is analogous to manual long addition: add one digit at a time, with any resulting carry being added to the next most significant digit. Rather than chaining the carries through multiple adders "in space", we use a single adder and process the carry "in time". Table I shows an example of a bit-serial addition of 3 + 7 = 10 ↔ 0011 + 0111 = 1010.

TABLE I
BIT-SERIAL ADDITION EXAMPLE
Cycle | Cin | A | B | S | Cout | Result
  1   |  0  | 1 | 1 | 0 |  1   | 0000
  2   |  1  | 1 | 1 | 1 |  1   | 1000
  3   |  1  | 0 | 1 | 0 |  1   | 0100
  4   |  1  | 0 | 0 | 1 |  0   | 1010
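As a behavioral check (our own sketch, not the paper's RTL), the following Python function mimics the adder cycle by cycle and reproduces Table I:

def bit_serial_add(a_bits, b_bits):
    # a_bits, b_bits: LSb-first lists of 0/1. One full-adder evaluation
    # per cycle; the carry is registered between cycles.
    carry, result = 0, []
    for a, b in zip(a_bits, b_bits):
        s = a ^ b ^ carry                            # sum bit S
        carry = (a & b) | (a & carry) | (b & carry)  # registered carry-out
        result.append(s)                             # shifted into result reg
    return result

# 3 + 7 = 10, LSb first: 0011 -> [1,1,0,0], 0111 -> [1,1,1,0]
print(bit_serial_add([1, 1, 0, 0], [1, 1, 1, 0]))    # [0, 1, 0, 1] = 1010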
Similarly, we can create a bit-serial subtractor that performs a − b by initializing the carry bit to 1 and adding a NOT gate between b's register and the full adder. This scheme is equivalent to negating b in two's complement and then adding it to a. In the FPGA, the bit-serial adder or subtractor can be mapped to a single 6-input LUT and two registers. The timing path that determines the frequency for this design (the critical path) will be very short: a single stage of logic, the setup to the flop, and any interconnect delay between the components.

a) Single-bit dot-product: Using these primitives, we now build a single-bit dot-product, as described below:

s = \sum_{i=1}^{N} a_i \cdot b_i   (4)

where a, b are N-length vectors whose elements are 1 bit wide, and s is a scalar integer. Because we are multiplying single bits, we can realize the multiplication with a simple AND gate. After multiplying each pair of elements, we reduce the partial sums through a tree of bit-serial adders.

b) Multi-bit input: We can extend this design to support one of the operands, a, being multiple bits wide. We hold the single-bit vector b fixed, and stream the multi-bit vector through the circuit one bit at a time from LSb to MSb using a shift register for each dimension. Similarly, we capture the result in a shift register. One can think of this circuit as summing the elements of a that are selected by a Boolean vector b. This design is shown in Figure 2a.

Note that because this is merely summing the dimensions selected by b, this circuit works for both signed and unsigned a inputs. To ensure signed inputs produce the correct sign bit, we sign-extend the input a from the shift register until the computation has finished.

Given that our Boolean vector b is fixed, consider a given element b_i. If this element is set to 1, the input to the AND gate is 1, so the output of the AND gate will just be the associated a_i. In this case, we can cull the AND gate entirely and connect the input a_i directly to the bit-serial adder. On the other hand, if the element is 0, the AND gate will have a fixed 0 input and therefore its output will always be 0. Furthermore, one of the inputs to the bit-serial adder is now a fixed 0, so the adder is just passing the value forward. In this case, the adder is acting as a D flip-flop. This means we can not only cull the AND gate, but we can also replace the adder with a single flip-flop, which reduces the cost significantly. Put simply, we can greatly reduce the cost of this circuit by fixing b, and the cost should be proportional to the number of bits set in b. This optimization is the fundamental minimization technique in our design.
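Behaviorally, the culled circuit is just a masked sum; a minimal Python sketch (our illustration, with the hypothetical helper name fixed_b_dot) of what survives the optimization:

def fixed_b_dot(a_vec, b_mask):
    # Positions where the fixed b is 0 contribute no adders, only
    # flip-flops, so the result is simply the sum of the selected elements.
    return sum(a for a, b in zip(a_vec, b_mask) if b == 1)

print(fixed_b_dot([5, -2, 7, 3], [1, 1, 0, 1]))  # 5 + (-2) + 3 = 6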
An example of this reduction can be seen in Figure 2b. We will revisit this idea in Section IV.
c) Multi-bit input and weight: Finally, we can combine multiple of these multi-bit by 1-bit dot-products to create a full multi-bit dot-product. This allows the dot-product of two integer vectors, each with its own bit-width. Because each dot-product in the previous circuit occurs at the bit level in b, we merely need to create such a circuit for each bit position in b, and then combine the results for each bit position.

Luckily, the bit-serial adders already have this capability by shifting in time. For example, to make a 2-bit dot-product circuit, we make one circuit for the MSb, and one for the LSb. The results of the two are combined by feeding the LSb result into a bit-serial adder, and then feeding the MSb result, delayed by 1 cycle, into the same bit-serial adder, which then outputs into a shift register to capture the final result. Delaying the MSb result by one cycle effectively multiplies it by 2. To extend this idea to arbitrary bit-width, we just create a chain of bit-serial adders, where the bottom input is the previous link in the chain, and the top input is the corresponding bit. In this case, the result of a bit position is delayed accordingly. A multi-bit example is shown in Figure 3. Notice that the MSb is fed into a bit-serial adder along with 0, which becomes a D flip-flop.

Fig. 2. a) Multi-bit by fixed single-bit dot-product with tree reduction and shift registers (N=4) (trapezoids represent bit-serial adders, which have flip-flops between each stage but are omitted for brevity). b) Setting b = [1 1 0 1] reduces the circuit

Fig. 3. Multi-bit Dot-product

This architecture works for signed inputs and unsigned inputs, but the weights are unsigned. An easy way to implement signed weights is to separate the positive and negative terms of the b vector into two separate unsigned vectors, and simply subtract the two resultant streams. Because the number of ones in the two matrices is conserved by this transform, it makes almost no impact on the total area, and adds a single cycle to the latency.
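Numerically, the per-bit-position combination is a shift-and-add over the weight's bit planes, with the P/N split handling the sign; a Python sketch (our own model, reusing the hypothetical fixed_b_dot from above):

def multibit_dot(a_vec, weights, bw=8):
    # Split signed weights into positive (P) and negative (N) magnitudes;
    # the hardware subtracts the two streams with one final bit-serial stage.
    pos = [max(w, 0) for w in weights]
    neg = [max(-w, 0) for w in weights]

    def unsigned_dot(mags):
        total = 0
        for k in range(bw):
            mask = [(m >> k) & 1 for m in mags]     # bit plane k of weights
            total += fixed_b_dot(a_vec, mask) << k  # k-cycle delay = x 2**k
        return total

    return unsigned_dot(pos) - unsigned_dot(neg)

print(multibit_dot([5, -2, 7, 3], [3, 0, -1, 2]))  # 5*3 + 7*(-1) + 3*2 = 14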
d) Vector-matrix product: Now that we have a unit to perform dot-products, we can finish this matrix multiplier by creating a dot-product unit for each column in the matrix. Then, we just broadcast the input on each row to all columns, and that results in a matrix multiplier, which takes shape as shown in Figure 4. (Full bit-serial circuits are omitted here for brevity.)

Fig. 4. Full Vector-Matrix Multiplier Architecture. Broadcast of inputs shown in green. Multiplication at bit-level of fixed weights shown in blue. Per-column partial sum reduction shown in purple, yielding the final output vector in orange
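Putting the pieces together, the whole fixed-matrix product is one such dot-product per column, all consuming the broadcast input; a behavioral sketch (our illustration, building on the hypothetical helpers above — in hardware all columns finish simultaneously rather than looping):

def fixed_matrix_gemv(a_vec, V, bw=8):
    # o = a^T V: one culled multi-bit dot-product tree per column of V.
    columns = zip(*V)  # V stored as a list of rows
    return [multibit_dot(a_vec, col, bw) for col in columns]

V = [[1, 0], [-2, 3], [0, 0], [4, -1]]      # 4x2 fixed weight matrix
print(fixed_matrix_gemv([5, -2, 7, 3], V))  # [21, -9]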
Because the computations are parallelized across the matrix columns, and then again across each bit position, the latency for this design is described in Equation 5:

Latency = BW_i + BW_w + \log_2 R + 2   (5)

where BW_i and BW_w are the bit-widths of the input and weights, respectively, and R is the number of rows in the matrix. We incur the input width to stream the input in, the output width to stream the output out, and our adder tree is logarithmic in depth. We incur a single cycle to accumulate across bit positions and an additional cycle to subtract the positive and negative weight matrices. For example, given 8-bit inputs and weights and a 1024x1024 weight matrix, we perform the vector-matrix product in 8 + 8 + log2(1024) + 2 = 28 cycles.
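Equation 5 turns directly into a quick latency estimator (our own helper with hypothetical names):

import math

def latency_cycles(bw_in, bw_w, rows):
    # BW_i + BW_w + log2(R) + 2, per Equation 5
    return bw_in + bw_w + int(math.log2(rows)) + 2

print(latency_cycles(8, 8, 1024))  # 28 cycles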
IV. RTL SYNTHESIS RESULTS

In this section, we seek to understand the behaviour of our design by studying it on small matrices. We extend the insights here to larger matrices in Section VI. Recall the optimization from Section III: if a bit in the weight matrix is 0, we can cull the AND gate and bit-serial adder, replacing it with a simple D flip-flop. We coded our design in SystemVerilog and ran synthesis in Xilinx Vivado 2020.2 targeting Xilinx UltraScale+ devices [27]. If our design behaves as expected, the hardware cost should be proportional to the number of bits set in the weight matrix. To test this, we randomly initialized a series of 64x64 weight matrices at 8-bit precision. For each bit in the weight matrix, we sample from a Bernoulli distribution, where the p parameter is equal to (1 − bit sparsity). That is to say, the bit-sparsity of the weight matrix is the number of bits that are 0 out of the total number of bits. 0% bit-sparse means all bits are 1, 50% means the bits are uniformly random between 0 and 1, and 100% means all bits are 0. This gives us a straightforward way to measure the cost of these matrices.
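A sketch of how such a test matrix can be generated (our assumption about the setup; the paper's generator is not published):

import numpy as np

def random_bit_sparse_matrix(dim=64, bw=8, bit_sparsity=0.5, seed=0):
    # Every one of the dim*dim*bw weight bits is an independent
    # Bernoulli(1 - bit_sparsity) draw; bit planes are packed into integers.
    rng = np.random.default_rng(seed)
    bits = rng.random((dim, dim, bw)) < (1.0 - bit_sparsity)
    return sum(bits[:, :, k].astype(np.int64) << k for k in range(bw))

W = random_bit_sparse_matrix(bit_sparsity=0.75)
print(1 - np.unpackbits(W.astype(np.uint8)).mean())  # ~0.75 measured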
Figure 5 shows synthesis results for sweeping bit-sparsity from 0% to 100%. We report utilization of LUTs, FFs, and LUTRAMs.

Fig. 5. Hardware utilization vs bit-sparsity of a 64x64 matrix
As expected, the total hardware cost of our architecture is linear with respect to the number of bits set in the weight matrix. This illustrates that our architecture effectively exploits bit-level sparsity to reduce the cost of fixed-matrix multiplication.

Recent advances in neural networks and architecture have similarly sought to exploit element-level sparsity to make computations more efficient [12], [13]. By element-level sparsity, we mean a portion of the weights being 0, so there is no need to multiply by them. Because bit-level sparsity is a superset of element-level sparsity, we wanted to know if our design was ineffective given an element-sparse matrix. To test this, we randomly initialized a set of matrices, where the weights are sampled from a uniform distribution of all possible values for the given bit-width. In this case, the matrix is 50% bit-sparse, as every bit has an equal probability of being 0 or 1. We then randomly replace matrix elements with 0 until we reach a desired level of element-sparsity. Taking these samples, we convert the element-sparse value into a bit-sparse value, and compare the two approaches.
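The element-sparse generation and its conversion to an equivalent bit-sparsity can be sketched as follows (our own illustration of the bookkeeping, not the paper's script):

import numpy as np

rng = np.random.default_rng(0)
dim, bw, elem_sparsity = 64, 8, 0.9

# Uniform 8-bit weights (about 50% bit-sparse), then zero out elements.
W = rng.integers(0, 2**bw, size=(dim, dim))
W[rng.random((dim, dim)) < elem_sparsity] = 0

# Equivalent bit-sparsity: fraction of all weight bits that are 0.
ones = np.unpackbits(W.astype(np.uint8)).sum()
print(1 - ones / (dim * dim * bw))  # ~1 - 0.5 * (1 - 0.9) = 0.95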
This second experiment encourages bits to gather in individual elements, rather than the bit-sparse experiment, which encouraged bits to be spread out. Figure 6 shows our findings and plots them against the original experiment.
Fig. 6. Cost of Element-Sparse Matrix Compared to Bit-Sparse Matrix. Here, (es) and (bs) designate Element-Sparse and Bit-Sparse schemes, respectively. While results do not match exactly, they are within an acceptable noise margin
The graph shows us that it doesn't matter whether the bits are concentrated or not. The two lines are nearly identical, which means that we don't have to make any concessions to support element-sparse designs. Unlike most sparse accelerators, this design exploits sparsity in the elements when matrix entries are zero, and also elicits great benefit from elements being powers of two.

Figure 7 shows the LUT and FF utilization vs matrix size. The cost is quadratic with respect to matrix dimension and therefore linear with respect to the number of elements. This implies there is little optimization our RTL flow is doing across weight elements. In other words, large matrices are no more and no less dense than smaller matrices.

We also explore the cost function with respect to the bit-width of the weights. Figure 8 shows the cost of a 64x64 matrix, sweeping the weight bit-width from 1 bit to 32 bits. Because the architecture performs 1-bit dot-products for each bit position, and then accumulates the results, we would expect there to be few cross-bit optimizations. Indeed, we observe a linear LUT and FF cost with respect to the bit-width of the weights.
Fig. 7. Hardware Utilization vs Matrix Size for random 8-bit integers
Fig. 8. Hardware Utilization of 64x64 Random Matrix for Varying Bitwidths
V. CANONICAL SIGNED DIGIT
In an effort to reduce the cost of the multiplier, we consider Canonical Signed Digit (CSD) representations [4]. CSD decomposes an unsigned integer into a sum of a positive and a negative integer, where the positive and negative integers together have fewer set bits than the original number. For example, consider the following:

15 = 16 − 1 ↔ 01111 = 10000 − 00001

Notice that we can decompose a number that had four bits set into the difference of two numbers which each have one bit set, for a total of two. Do also take note that the bit-width of the decomposition is one wider than the original. Recall that the cost function of our multiplier is the number of bits set, so this shows promise in our case. We transform Equation 3 using:

V ≡ P − N  ⟹  o = a^T (P − N) = (a^T P) − (a^T N)   (6)

To transform V into N and P, we apply the algorithm in Listing 1 element-wise, placing positive elements in P, and placing the absolute value of negative elements in N. In short, the algorithm searches for strings of consecutive 1 bits, which we call a chain. If a chain is length 1, nothing is done. If a chain is length 2, we flip a coin. On heads, we replace the chain with a +1 in the bit position one past that of the MSb in the chain and a −1 at the LSb in the chain. On tails, we leave the original representation. For chains of length 3 and greater, we perform the same replacement that we do for heads on length 2. We introduce the random variable to balance the decomposition, since a transformation of a length-2 chain has no benefit and no detriment.

When operating on signed weights, we perform a CSD transform on both the positive and negative weight matrices. Positive elements that result from CSD remain in the original matrix, and negative elements are transferred to the opposite weight matrix.

import random

def convert_to_csd(num_bin_list):
    # num_bin_list: MSb-first bits of an unsigned integer.
    # Returns MSb-first digits in {-1, 0, 1}, one position wider.
    local_list = list(num_bin_list)
    target = [0] * (len(local_list) + 1)
    local_list.reverse()  # process LSb-first
    chain_start = -1      # index where the current run of 1s began
    for i in range(len(target)):
        if i < len(local_list):
            bit = local_list[i]
        else:
            bit = 0  # virtual 0 past the MSb terminates any open chain
        if bit == 0:
            if chain_start == -1:
                target[i] = 0
            else:
                chain_length = i - chain_start
                if chain_length == 1:
                    target[chain_start] = 1  # lone 1: keep as-is
                elif chain_length == 2:
                    # coin flip: both forms have two set digits
                    if bool(random.getrandbits(1)):
                        target[chain_start] = -1
                        target[i] = 1
                    else:
                        target[chain_start] = 1
                        target[i - 1] = 1
                else:
                    # chain of 3+: replace 0111...1 with +1 0...0 -1
                    target[chain_start] = -1
                    target[i] = 1
                chain_start = -1
        else:
            if chain_start == -1:
                chain_start = i
    target.reverse()
    return target

Listing 1. CSD Conversion Algorithm
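As a quick check of the listing (our own example runs), a chain of four 1s collapses to two set digits, while isolated 1s pass through unchanged:

>>> convert_to_csd([1, 1, 1, 1])   # 15: chain of four 1s
[1, 0, 0, 0, -1]                   # 16 - 1
>>> convert_to_csd([1, 0, 1, 0])   # 10: two lone 1s, unchanged
[0, 1, 0, 1, 0]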
Figure 9 shows the resource utilization when CSD is applied to random matrices of varying element-sparsity. Note that the x-axis is element-sparsity and not bit-sparsity.

CSD results are strictly better than the naive implementation. In this experiment, all the numbers are uniform random. CSD applies equally to them and reduces the hardware by 17% for any level of element-sparsity. In other words, CSD always makes numbers more bit-sparse, which reduces the cost of our architecture. We would expect these savings to improve for larger weight bit-widths.

VI. LARGE SCALE DESIGN RESULTS
Fig. 9. CSD Resource Utilization for 64x64 Element-Sparse Matrices

The prior sections looked at the logic synthesis results of small matrices. In this section, we show the scaling of these techniques to much larger matrices. Our experiments in this section use square matrices with dimensions of 512 and 1024, with 8-bit signed weights. We consider element sparsity from 40% to 98%. We use a split matrix to deal with the signed weights, and the split matrices are generated either by splitting positive and negative terms (PN) or by using the CSD technique previously described. The results of these two matrices are fed to an array of final bit-serial subtractors to complete the signed matrix implementation. We "wrap" the matrix multiplier with a small design that feeds inputs from an SRAM, and captures results in that same SRAM. This design wrapper only adds a few extra LUTs and registers. We run synthesis and place-and-route on these designs and set a timing constraint of 450MHz. Our target FPGA is the Xilinx XCVU13P [27], which is a 16nm device containing four chiplets in the package (called Super Logic Regions or SLRs). This device has a capacity of 1.7M 6-input LUTs and 3.4M logic flip-flops.

The LUT and register counts as a function of ones in the matrix are plotted in Figure 10. The PN and CSD points are based on the identical original signed matrix. The trend lines show the strong correspondence between the resource counts and the number of ones. LUTs are essentially equivalent to the number of ones, and there are two registers per LUT. CSD reduces both the number of ones in the matrix and the resulting resource counts.

Fig. 10. Large Scale Area Results: The very strong linear relationship between matrix ones and FPGA resources is obvious.

The process of FPGA design does not always meet the specified timing constraint. Figure 11 shows the achieved frequency (Fmax) of these designs after placement and routing. All the paths within these designs have at most one LUT between flops, which means that the frequency is primarily a result of the interconnect delays between LUTs and flops. Other components, such as the SRAMs, have a maximum frequency that exceeds 600MHz. There is some noise in these results, but the trend is that bigger matrices run slower. As the matrices get bigger, two mechanisms impact the routing delay:

• The initial layer has a large fanout, approximately corresponding to the dimension times the sparsity. Nets that have a fanout of 100s can have delays of several nanoseconds, which becomes the critical path.

• Nets cross the chiplet boundaries, and those routes have significantly slower propagation delays.
Fig. 11. Large Scale Frequency Results: Increasing fanout of the first stage and the spanning of multiple chiplets increase the critical path delay and decrease maximum frequency.
Each of the four SLRs within the FPGA has a maximum capacity of 425k LUTs. After about 80% of LUTs are used, the tools can struggle. Figure 11 has tick marks illustrating the 82% thresholds of SLR capacity. Within one SLR, the frequencies range from 597MHz to 445MHz. Designs requiring 2 SLRs range from 296MHz to 400MHz. Matrices bigger than 2 SLRs seem relatively consistent between 225MHz and 250MHz. Both the fanout and chiplet-crossing problems could be addressed by adding registers to perform the fanout and chiplet crossings in multiple cycles. These optimizations are not represented here.

The other limit to the frequency is power. Figure 12 plots the estimated total power consumption of this device scaled to run at the maximum achievable frequency. These results were obtained from the Vivado tool based on the default assumptions about switching activity. Under medium settings for airflow and heatsink, the thermal power limit of this FPGA is approximately 150W, which we approach at high dimension and low sparsity.
Fig. 12. Large Scale Power Results: Note the sublinear increase due to the decreasing achievable frequency. Under medium cooling assumptions, this FPGA has a limit of about 150W.
VII. EVALUATION
To evaluate our design, we compare it against state-of-the-art sparse multiplication libraries on a GPU, and also against a recent sparse DNN accelerator.
A. GPU Benchmarks
We compare our results against the Volta V100 GPU from NVIDIA. The V100 is built in 12nm technology, whereas the UltraScale+ FPGA is 16nm. The more recent Ampere architecture from NVIDIA includes hardware support for sparsity up to 50%, but is unfortunately unavailable to us for this study. We use two libraries for the GPU benchmarks: cuSPARSE [18], and a recently published optimized kernel [9]. Neither of these libraries supports integer arithmetic, so we use FP16 as a best-case proxy. We run the FPGA at the maximum frequency achieved after place-and-route, which can be seen in Figures 11 and 12.

For both of these configurations, we initialize a random weight matrix using the same scheme as described previously. We transfer this to the device and allow the given library to perform any necessary optimizations before timing. Then, we randomly initialize a dense vector, and repeatedly multiply this vector by the matrix so that all caches and memory systems are warm. We do this for 1000 iterations and take the latency to be the mean iteration time. In this case, the latency is measured from the device's memory, through the arithmetic, and back to memory, which is identical to our FPGA implementation.
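A sketch of this timing protocol using CuPy's cuSPARSE-backed sparse types (our reconstruction under stated assumptions: the paper does not publish its harness, and we use FP32 here where the benchmark used FP16):

import cupy as cp
from cupyx.scipy import sparse as cusp

def mean_spmv_latency_ns(dim=1024, sparsity=0.98, iters=1000):
    A = cusp.random(dim, dim, density=1 - sparsity, format='csr',
                    dtype=cp.float32)          # fixed sparse weight matrix
    x = cp.random.rand(dim, dtype=cp.float32)  # dense input vector
    for _ in range(10):                        # warm caches before timing
        y = A @ x
    start, end = cp.cuda.Event(), cp.cuda.Event()
    start.record()
    for _ in range(iters):
        y = A @ x                              # cuSPARSE SpMV
    end.record()
    end.synchronize()
    return cp.cuda.get_elapsed_time(start, end) * 1e6 / iters  # ms -> ns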
a) Sweeping dimension: Our first experiment demonstrates latency performance as a function of array dimension at 98% element sparsity. We sweep the matrix dimension from 64x64 to 4096x4096 in powers of two. The results are shown in Figure 13 with actual latency numbers and in Figure 14 as comparative speedup. The "Optimized Kernel" is the library in [9].
Fig. 13. Latency (ns) of varying 98% element sparse matrices
Fig. 14. Speedup of varying 98% element sparse matrices
Notice that in all cases, our FPGA latency is less than 120ns, whereas the GPU cannot break the 1µs barrier. When the matrix size is less than 512x512, the GPU performance is nearly constant. This indicates the GPU is underutilized and therefore latency bound. GPUs possess incredible throughput using techniques like thread-level parallelism, memory latency hiding, double-buffering, and others. However, all these techniques require the GPU to spawn many more threads than the arithmetic can handle, swapping threads in and out to maximize utilization. In the low-latency regime, these techniques introduce overhead which cannot be overcome, even with the Optimized Kernel. Our solution configures the hardware to exactly support the matrix in question, leading to astounding latency. When the GPU is latency-bound, its latency does not increase for larger matrices. Our solution pays a constant cost to stream the inputs and outputs, but the number of cycles spent reducing partial sums increases logarithmically with matrix size. Furthermore, increasing the matrix size decreases our highest achievable clock frequency. In the GPU's latency-bound regime, we see our speedup fall from 86x to 60x. However, at 1024x1024, the GPU is utilized and is no longer latency-bound, so it begins to see linear scaling with respect to matrix size. Because our solution scales logarithmically in cycles, we see our speedup leveling off at 50x due to the slower clock.

b) Sweeping sparsity: Our next experiment considers the average latency as a function of matrix sparsity at a fixed dimension of 1024x1024. The results are shown in Figures 15 and 16 with average latency and speedup, respectively. Element sparsity is swept from 70% to 98%. Recall that our solution's latency in cycles does not depend on sparsity, but we can clock the design faster as sparsity increases. The GPU sees decreasing latency with increasing sparsity because it needs to operate on fewer elements and the cost of indexing is amortized. For cuSPARSE, increasing sparsity from 70% to 85% sees large reductions in latency as the library reduces the amount of compute to be done. The optimized kernel comparatively spends less time indexing and has higher performance at lower sparsity. In both cases, the compute reduction from 70% to 85% is effective and sees our speedup go from 77x to 72x. As sparsity increases further, the GPU again becomes underutilized and both the latency and speedup level off, yielding a minimum speedup of 60x. Again, the GPU is unable to break the 1µs barrier, whereas our solution stays under 120ns.

Fig. 15. Latency of varying sparsity
Fig. 16. Speedup of varying sparsity
c) Batching: We have illustrated the advantage of our architecture for latency-sensitive scenarios, which is our targeted case. In this next experiment, we compare the throughput of the GPU solution versus our solution. Previous experiments, because of the recurrence, were done with a batch size of 1, which underutilizes the GPU's pipelined resources. In this experiment, we change the GPU's computation to perform matrix-matrix multiplication, where the number of columns in the multiplicand matrix is the "batch size", borrowing terminology from DNN processing. We again randomly initialize a fixed weight matrix with 95% sparsity and signed 8-bit integers. We choose 95% sparsity to give the GPU ample headroom. We sweep the batch size from 1 to 64, reporting the speedup of our solution compared to the GPU. We are still measuring latency, but at the higher batch sizes, the reciprocal of the latency will approach the maximum throughput of the GPU.

Figure 17 presents our results for a 1024x1024 weight matrix, multiplying a (batch size x 1024) random dense matrix.
Fig. 17. Speedup against V100 for varying batch size (matrix = 1024x1024, 95% sparse). Batch-size 1 is a comparison of pure latency. As batch size increases, the speedup is comparing achievable throughput
Figure 18 presents our results for a 64x64 weight matrix, multiplying a (batch size x 64) random dense matrix. As expected, the latency for the GPU solution scales sublinearly with respect to batch size. As the GPU becomes more utilized, it's able to overlap computation and memory to take full advantage of the massive parallelism. Because our architecture is built for vector products, we have to stream the columns of the input matrix in one by one, which yields linear scaling. Because our solution begins with such a lead, it's able to stream multiple batches before the GPU can reach the same performance. This means our solution is still lower latency at small batch sizes. In the 1024 case, our solution is still marginally better due to the GPU already being close to utilized with the large matrix. With 64x64, the GPU has more computational intensity to fill before it becomes utilized.

Fig. 18. Speedup against V100 for varying batch size (matrix = 64x64, 95% sparse). Batch-size 1 is a comparison of pure latency. As batch size increases, the speedup is comparing achievable throughput

The TDP of our FPGA is 150W, while the V100 is 300W, but these numbers are not reliable for accurate run-time comparisons. Given the disparity between the platforms, it is very difficult to accurately compare power consumption between these two solutions. There is a difference in the process technology. The FPGA and GPU also have different sets of peripherals associated with them, which may or may not be active. The dynamic power will be different based on the specific variations of the random weights and activations, which is exacerbated by the int8 vs FP16 difference. Finally, the different boards have different methods to monitor power consumption on their different supply rails. However, we believe that the efficiency gains shown here are due to fundamental computational simplification, and it would be reasonable to assume that the dynamic power would be correspondingly lower.
B. DNN Accelerator Benchmarks
We also consider the state-of-the-art in sparse DNN acceleration: SIGMA [20]. SIGMA similarly relies on an input broadcast and reduction tree to perform matrix multiplication. We reached out to the authors and were given their cycle-accurate simulator. In their paper, they design SIGMA as a 128x128 array of FP16 processing elements (PEs) clocked at 500MHz. To approximate process technology node differences and the change to int8 from FP16, we assume that SIGMA can be clocked at 1GHz. We run SIGMA with the weight matrix stationary and stream the input in to minimize latency.
a) Sweeping dimension: We repeat a similar suite of experiments as before. We sweep the dimension of 98% element-sparse matrices. Figures 19 and 20 show the latency and speedup, respectively.

For small dimensions, SIGMA does report nanosecond-scale latency due to its input broadcast and reduction tree. Furthermore, small, sparse matrices easily fit into the PE grid, so there is little overhead from tiling. However, after 1024x1024, the elements no longer fit in the PE grid and the computation must be tiled. This invokes extra SRAM use and transitions SIGMA into the memory-bound region, where it sees linear scaling. This yields a 4.1x speedup for our solution in the worst case, but we quickly gain a 25x advantage as the matrix size increases.

Fig. 19. Latency of FPGA and SIGMA for varying dimension (98% sparse)

Fig. 20. Speedup against SIGMA for varying dimension (98% sparse)

b) Sweeping sparsity: Moving on, Figures 21 and 22 report the latency and speedup of a 1024x1024 matrix, sweeping element-sparsity.
Fig. 21. Latency of FPGA and SIGMA for varying sparsity (1024x1024)
The advantage of SIGMA is that it only maps non-zero weight and activation pairs to PEs. If this union can fit into the PE grid, no tiling is done. As we expect, the size of this set is directly related to the sparsity, and SIGMA sees huge latency improvements as sparsity increases, taking it into the nanosecond regime. However, even 90% sparsity and below is enough to push it back into the microsecond regime, which yields a large advantage to our design.

Fig. 22. Speedup against SIGMA for varying sparsity (1024x1024)
c) Batching: Finally, we compare matrix-matrix multiplication between our solution and SIGMA. We repeat the same batching test, and the result is shown in Figure 23.
Fig. 23. Speedup against SIGMA of varying batch-size (1024x1024, 95% sparse). Batch-size 1 is a comparison of pure latency. As batch size increases, the speedup is comparing achievable throughput
Because SIGMA is able to fill its PE array with sparse entries at small batches, it doesn't have as much to gain as a GPU when transitioning from vector to matrix products. At batch-size 2, SIGMA does find opportunity to utilize more PEs, and our advantage decreases. However, batch-size 4 and beyond quickly sees SIGMA in the memory-bound region again, which causes the speedup to saturate at 5.4x.

VIII. DISCUSSION AND FUTURE WORK
For the given use case, our design has two major limitations: 1) the fanout of the input broadcast saturates the interconnect resources of the FPGA and limits frequency, and 2) we are bound by the number of 6-input LUTs in the FPGA, which limits the number of set bits we can have in the weight matrix. Creating a CGRA architecture could solve both of these issues. We could employ a broadcast-friendly pipelined interconnect network for the inputs, similar to the Benes network in SIGMA. Furthermore, a 6-input LUT is made using 64 SRAM bits of 6 transistors each, with 64 MUX T-gates of 2 transistors each, which yields a total of 512 transistors for every LUT. A full adder uses 16 or fewer transistors [7], which is 1/32 the cost. A CGRA implementation of our design would use a grid of full adders and flip-flops, with a flexible tree-like interconnect to perform partial sums and a broadcast interconnect for the input. This approach would allow for higher compute density at higher frequencies.

Even with these optimizations, there may be instances where the compute matrix cannot entirely fit in hardware and must be tiled, similar to DNN accelerators. To amortize the cost of loading and moving weights, DNN inference accelerators often employ batching to perform the multiplication of multiple vectors with the same matrix, which saves power by reducing the movement of matrix weights. Our approach eliminates the movement of matrix weights by programming them into an interconnect matrix and takes further advantage by propagating the constants within the matrix. The time to modify the interconnect matrix of the FPGA is on the order of 200ms, which limits its practicality for moving weights during runtime. However, the feed-forward topology of this network allows for the approach of pipeline reconfiguration [11]. Pipeline reconfiguration is the ability to configure hardware cycle-by-cycle as the pipeline fills. This concept would allow a reconfigurable platform with nearly zero configuration overhead time to change the matrix weights. Pipeline reconfiguration is not supported in conventional FPGAs, but we could support it in a custom CGRA architecture. As the partial sums travel down the tree, the levels above the current accumulation can be reconfigured, as their state is no longer needed. One can think of "waves" of configuration travelling down the tree, where some parts of the tree are reconfiguring and some parts are computing. This is analogous to double-buffering in DNN processing.

We plan to explore these optimizations in future work.

IX. CONCLUSION
We present an architecture for performing integer matrix-vector multiplication. The cost and power of our implementations scale exactly with the number of non-zero terms in the matrix. Bit-serial implementation allows us to support matrices with up to 1.5 million ones, as large as a 1024x1024 eight-bit matrix at a sparsity of 60%. These products can be produced with nanosecond-scale latencies, which is 2-3 orders of magnitude faster than a GPU, and 4-47x faster than a sparse accelerator. Throughput is also competitive, especially for large matrices. Given an FPGA's slow configuration time, these techniques are primarily suitable for computations where the matrix is essentially fixed and where low latency is critical, such as reservoir computing. However, a customized programmable device for this approach could pipeline the configuration, thus effectively hiding it, and enable this approach to work for dynamic sparse matrices.

ACKNOWLEDGEMENTS
We would like to thank Peter Matheu for the extensive technical discussion and help with the SystemVerilog implementation. We would also like to thank Amir Yazdanbakhsh and Dan Zhang for paper review, and Guillermo Maturana for code review. This work was completed as an internship project.

REFERENCES
[1] S. Ahmad and L. Scheinkman, "How can we be so dense? The benefits of using highly sparse representations," CoRR, vol. abs/1903.11257, 2019. [Online]. Available: http://arxiv.org/abs/1903.11257

[2] J. Albericio, P. Judd, A. D. Lascorz, S. Sharify, and A. Moshovos, "Bit-pragmatic deep neural network computing," CoRR, vol. abs/1610.06920, 2016. [Online]. Available: http://arxiv.org/abs/1610.06920

[3] P. Antonik, A. Smerieri, F. Duport, M. Haelterman, and S. Massar, "FPGA implementation of reservoir computing with online learning," 2015.

[4] A. Avizienis, "Signed-digit number representations for fast parallel arithmetic," IRE Transactions on Electronic Computers, vol. EC-10, no. 3, pp. 389–400, 1961.

[5] F. M. Bianchi, S. Scardapane, S. Løkse, and R. Jenssen, "Reservoir computing approaches for representation and classification of multivariate time series," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–11, 2020.

[6] B. S. David Verstraeten and D. Stroobandt, "Reservoir computing with stochastic bitstream neurons," Proc. of ProRISC Workshop, 2005. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.5025&rep=rep1&type=pdf

[7] A. Dubey, S. Akashe, and S. Dubey, "A novel high-performance CMOS 1 bit full-adder cell," 2013, pp. 312–315.

[8] J. Frankle and M. Carbin, "The lottery ticket hypothesis: Training pruned neural networks," CoRR, vol. abs/1803.03635, 2018. [Online]. Available: http://arxiv.org/abs/1803.03635

[9] T. Gale, M. Zaharia, C. Young, and E. Elsen, "Sparse GPU kernels for deep learning," 2020. [Online]. Available: https://arxiv.org/abs/2006.10901

[10] C. Gallicchio, "Sparsity in reservoir computing neural networks," 2020. [Online]. Available: https://arxiv.org/abs/2006.02957

[11] S. C. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. R. Taylor, and R. Laufer, "PipeRench: A co/processor for streaming multimedia acceleration," in Proceedings of the 26th Annual International Symposium on Computer Architecture, ser. ISCA '99. USA: IEEE Computer Society, 1999, pp. 28–39. [Online]. Available: https://doi.org/10.1145/300979.300982

[12] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," 2016. [Online]. Available: https://arxiv.org/abs/1602.01528

[13] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," 2016. [Online]. Available: https://arxiv.org/abs/1510.00149

[14] N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, R. C. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a tensor processing unit," CoRR, vol. abs/1704.04760, 2017. [Online]. Available: http://arxiv.org/abs/1704.04760

[15] P. Judd, J. Albericio, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," IEEE Computer Architecture Letters, vol. 16, no. 1, pp. 80–83, 2017.

[16] D. Kleyko, E. P. Frady, M. Kheffache, and E. Osipov, "Integer echo state networks: Efficient reservoir computing for digital hardware," 2020. [Online]. Available: https://arxiv.org/abs/1706.00280

[17] A. F. Murray, A. V. W. Smith, and Z. F. Butler, "Bit-serial neural networks," in Neural Information Processing Systems, D. Z. Anderson, Ed. American Institute of Physics, 1988, pp. 573–583. [Online]. Available: http://papers.nips.cc/paper/27-bit-serial-neural-networks.pdf

[18] NVIDIA, NVIDIA cuSparse API Reference, 2020 (accessed Sept 4, 2020). [Online]. Available: https://docs.nvidia.com/cuda/cusparse/index.html

[19] B. Penkovsky, L. Larger, and D. Brunner, "Efficient design of hardware-enabled reservoir computing in FPGAs," CoRR, vol. abs/1805.03033, 2018. [Online]. Available: http://arxiv.org/abs/1805.03033

[20] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, "SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training," 2020, pp. 58–70.

[21] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, R. Chukka, C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick, J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. S. John, P. Kanwar, D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius, C. Osborne, G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao, F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou, "MLPerf inference benchmark," CoRR, vol. abs/1911.02549, 2019. [Online]. Available: http://arxiv.org/abs/1911.02549

[22] S. Sharify, A. D. Lascorz, M. Mahmoud, M. Nikolic, K. Siu, D. M. Stuart, Z. Poulos, and A. Moshovos, "Laconic deep learning inference acceleration," in Proceedings of the 46th International Symposium on Computer Architecture, ser. ISCA '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 304–317. [Online]. Available: https://doi.org/10.1145/3307650.3322255

[23] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, J. K. Kim, V. Chandra, and H. Esmaeilzadeh, "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural networks," CoRR, vol. abs/1712.01507, 2017. [Online]. Available: http://arxiv.org/abs/1712.01507

[24] E. S. Skibinsky-Gitlin, M. L. Alomar, C. F. Frasser, V. Canals, E. Isern, M. Roca, and J. L. Rosselló, "Cyclic reservoir computing with FPGA devices for efficient channel equalization," in Artificial Intelligence and Soft Computing, L. Rutkowski, R. Scherer, M. Korytkowski, W. Pedrycz, R. Tadeusiewicz, and J. M. Zurada, Eds. Cham: Springer International Publishing, 2018, pp. 226–234.

[25] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," 2020. [Online]. Available: https://arxiv.org/abs/1905.11946

[26] G. Tanaka, T. Yamane, J. B. Héroux, R. Nakane, N. Kanazawa, S. Takeda, H. Numata, D. Nakano, and A. Hirose, "Recent advances in physical reservoir computing: A review," Neural Networks, 2019.

[27] Xilinx, Virtex UltraScale+.