Pre-Defined Sparse Neural Networks with Hardware Acceleration
Sourya Dey, Kuan-Wen Huang, Peter A. Beerel, Senior Member, IEEE, and Keith M. Chugg, Fellow, IEEE
Abstract—Neural networks have proven to be extremely powerful tools for modern artificial intelligence applications, but computational and storage complexity remain limiting factors. This paper presents two compatible contributions towards reducing the time, energy, computational, and storage complexities associated with multilayer perceptrons. Pre-defined sparsity is proposed to reduce the complexity during both training and inference, regardless of the implementation platform. Our results show that storage and computational complexity can be reduced by factors greater than 5X without significant performance loss. The second contribution is an architecture for hardware acceleration that is compatible with pre-defined sparsity. This architecture supports both training and inference modes and is flexible in the sense that it is not tied to a specific number of neurons. For example, this flexibility implies that various sized neural networks can be supported on various sized Field Programmable Gate Arrays (FPGAs).
Index Terms—Machine learning, Neural network, Multilayer perceptron, Sparsity, Hardware acceleration
I. INTRODUCTION

NEURAL networks are critical drivers of new technologies such as computer vision, speech recognition, and autonomous systems. As more data have become available, the size and complexity of neural networks (NNs) has risen sharply, with modern NNs containing millions or even billions of trainable parameters [1], [2]. These massive NNs come with the cost of large computational and storage demands. The current state of the art is to train large NNs on Graphical Processing Units (GPUs) in the cloud – a process that can take days to weeks even on powerful GPUs [1]–[3] or similar programmable processors with multiply-accumulate accelerators [4]. Once trained, the model can be used for inference, which is less computationally intensive and is typically performed on more general purpose processors (i.e., Central Processing Units (CPUs)). It is increasingly desirable to run inference, and even some re-training, on embedded processors which have limited resources for computation and storage. In this regard, model reduction has been identified as a key to NN acceleration by several prominent researchers [5]. This is generally performed post-training to reduce the memory requirements to store the model for inference – e.g., methods for quantization, compression, and grouping parameters [6]–[9].

Decreasing the time, computation, storage, and energy costs for training and inference is therefore a highly relevant goal.
Manuscript submitted December 3, 2018. This work is partly supported by NSF, Software and Hardware Foundations, Grant 1763747. The authors are with the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA, 90089 USA (e-mail: {souryade, kuanwenh, pabeerel, chugg}@usc.edu).

In this paper we present two compatible methods towards this end goal: (i) a method for introducing sparsity in the connection patterns of NNs, and (ii) a flexible hardware architecture that is compatible with training and inference-only operation and supports the proposed sparse NNs. Our approach to sparsifying a NN is extremely simple and results in a large reduction in storage and computational complexity both in training and inference modes. Moreover, this method is not tied to the hardware acceleration and provides the same benefits for training and inference in software under the current paradigm. The hardware architecture is massively parallel, but not tightly coupled to a specific NN architecture (i.e., not tied to the number of nodes in a layer). Instead, the architecture allows for maximum throughput for a given amount of circuit resources.

Our approach to making a NN sparse is to specify a sparse set of neuron connections prior to training and to hold this pattern fixed throughout training and inference. We refer to this method of simply excluding some fixed set of connections in the NN as pre-defined sparsity. There are several methods in the literature related to sparse NNs, but most do not reduce the computation and storage complexity associated with training, which is a primary goal of this work. One related concept is drop-out [10], where selected edges in the NN are not processed during some steps of the training process, but the final result is a Fully-Connected (FC) NN for inference. Another set of approaches target producing a sparse NN for inference, but use FC NNs during training. Among these are pruning and trimming methods that post-process the trained NN to produce a sparse NN for inference mode [11]–[13]. As mentioned above, other methods have been proposed for reducing the complexity of performing inference on a trained FC NN, such as quantization, compression, and grouping parameters [6]–[9]. Other research has suggested a method of learning sparsity during training that begins by training a fully-connected NN and uses a cost regularizer that promotes sparsity in the trained model [14]. Note that none of these methods substantially reduces the complexity of training; instead they target inference models that have lower complexity. One method aimed at reducing both training and inference complexity is using NNs with structured, but not sparse, weight matrices [15], [16]. Finally, we note that several authors have very recently proposed pre-defined sparse NNs [17]–[19] independently of our published work [20]–[22].

Motivated by the fact that specialized hardware is typically faster and more energy efficient than GPUs and CPUs, there exists a large body of literature on NN hardware acceleration. The vast majority of this addresses only inference given a trained model [9], [23]–[26], with few addressing hardware-accelerated training [27]. The work of [27], for example, targets a specific size NN – i.e., the logic and memory architecture is tied to the number of neurons in a layer. We propose an architecture that supports training, but can be simplified for inference-only mode, and is flexible to the NN size.
This is particularly attractive for FPGA implementations. Specifically, the proposed architecture produces the maximum throughput on a given FPGA for a given NN and can therefore support various sized NNs on various sized FPGAs. This is accomplished by an edge-based processing architecture that can process z edges in a given layer in parallel (i.e., we refer to z as the degree of parallelism). A given FPGA can support some largest value of z, and NNs with more edges will simply take more clock cycles to process.

Our edge-based architecture is inspired by architectures proposed for iterative decoding of modern sparse-graph-based error correction codes (i.e.,
Turbo and Low Density Parity Check (LDPC) codes) (cf., [28], [29]). In particular, for a given processing task, there are z logic units to perform the task and z memories to store the quantities associated with the task. A challenge with this architecture, shared between the decoding and NN applications, is that, in order to achieve high throughput without memory duplication, the parallel memories must be accessed in two manners: natural order and interleaved order. In natural order, each computation unit is associated with one memory and accesses the elements of that memory sequentially. For interleaved order access, the z computational units must access the memories such that no memory is accessed more than once in a cycle. Such an addressing pattern is called clash-free, and this property ensures that no memory contention occurs, so that no stalls or wait states are required. For modern codes, the clash-free property of the memories is ensured by defining clash-free interleavers (i.e., permutations) [30], or clash-free parity check matrices [29]. In the context of NNs, this clash-free property is tied to the connection patterns between layers of neurons.

In addition to z degrees of parallelism in edge processing in a given layer, our architecture is pipelined across layers. Thus, there is a degree of parallelism associated with each layer (i.e., z_i for layer i) selected to set the number of cycles required to process a layer to a constant – i.e., larger layers have larger z so that the computation time of all layers is the same. For an (L+1)-layer NN there are L pipeline stages, so that a given NN input is processed in the time it takes to complete the processing of the edges in a single layer. Furthermore, the three operations associated with training – Feedforward (FF), Backpropagation (BP), and Update of Trainable Parameters (UP) – are performed in parallel. The architecture may be simplified to perform only inference by eliminating the logic and memory associated with BP and UP. Furthermore, while the architecture supports the reduced-complexity sparse NNs, it is also compatible with traditional FC networks. Interestingly, very recent work proposed pipelining across layers for an inference-only accelerator [31], as well as a scalable edge-based architecture for training [32], independently of our published work [20], [21]. Neither of these other recent works, however, takes advantage of pre-defined sparsity in the network.

We use the terms 'connection' and 'edge' interchangeably, as we do with 'node' and 'neuron'. Also, the term 'cycle' will mean 'clock cycle', unless otherwise stated.

In Section II we provide motivation for and simple examples of the effectiveness of pre-defined sparsity. In Section III the hardware architecture is described in detail, including defining a class of simple clash-free connection patterns with low address generation complexity. Section IV contains a detailed simulation study of pre-defined sparsity in NNs based on four different classification datasets – MNIST handwritten digits [33], Reuters news articles [34], the TIMIT speech corpus [35], and CIFAR-100 images [36]. We identify a set of trends, or design guidelines, in this section as well. This section also demonstrates that the simple, hardware-compatible clash-free connection patterns provide performance on par with or better than that of randomly connected sparse patterns.
Finally, in Section V we consider the issue of whether pre-defining the structured sparse patterns causes a significant performance loss relative to other sparse methods having a similar number of parameters. We find that there is no significant performance degradation, and therefore our hardware architecture can provide training and inference performance commensurate with state-of-the-art sparsity methods.
II. STRUCTURED PRE-DEFINED SPARSITY
A. Definitions, Notation, and Background

An (L+1)-layer Multilayer Perceptron (MLP) has N_i nodes in the i-th layer, described collectively by the neuronal configuration N_net = (N_0, N_1, ..., N_L), where layer 0 is the input layer. We use the convention that layer i is to the 'right' of layer i−1. There are L junctions between layers, with junction i connecting the N_{i−1} nodes of its left layer i−1 with the N_i nodes of its right layer i.

We define pre-defined sparsity as simply not having all N_{i−1} N_i edges present in junction i. Furthermore, we define structured pre-defined sparsity so that for a given junction i, each node in its left layer has fixed out-degree – i.e., d^out_i connections to its right layer – and each node in its right layer has fixed in-degree – i.e., d^in_i connections from its left layer. FC NNs have d^out_i = N_i and d^in_i = N_{i−1}, with N_{i−1} N_i edges in the i-th junction, while a sparse NN has at least one junction with fewer than this number of edges. The number of edges (or weights) in junction i is given by |W_i| = N_{i−1} d^out_i = N_i d^in_i. The density of junction i is measured relative to FC and denoted as ρ_i = |W_i| / (N_{i−1} N_i). The structured constraint implies that the number of possible ρ_i values is equal to the greatest common divisor (gcd) of N_{i−1} and N_i, as shown in Appendix A. The overall density is

$$\rho_{\mathrm{net}} = \frac{\sum_{i=1}^{L} |W_i|}{\sum_{i=1}^{L} N_{i-1} N_i} \tag{1}$$

Thus, specifying N_net and the out-degree configuration d^out_net = (d^out_1, ..., d^out_L) determines the density of each junction and the overall density.

We will also consider random pre-defined sparsity, where connections are distributed randomly given preset ρ_i values, without constraints on in- and out-degrees. In Sec. IV-B we show that random pre-defined sparsity is undesirable at low densities because it may result in unconnected neurons.

The standard equations for FC NNs are well known [37]. For a NN using structured pre-defined sparsity, only the weights corresponding to connected edges are stored in memory and used in computation. This leads to the modified equations (2)–(4), where subscripts denote layer/junction numbers, single superscripts denote neurons in a layer, and double superscripts denote (right neuron, left neuron) in a junction. The FF processing proceeds left to right and computes the activations a_i and associated derivatives ȧ_i for each layer by applying an activation function act(·) to a linear combination of biases b_i, junction weights W_i, and preceding layer activations a_{i−1}:

$$h_i^{(j)} = \sum_{f=1}^{d^{\mathrm{in}}_i} W_i^{(j,k_f)}\, a_{i-1}^{(k_f)} + b_i^{(j)} \tag{2a}$$
$$a_i^{(j)} = \mathrm{act}\left(h_i^{(j)}\right) \tag{2b}$$
$$\dot{a}_i^{(j)} = \frac{d a_i^{(j)}}{d h_i^{(j)}} = \dot{\mathrm{act}}\left(h_i^{(j)}\right) \tag{2c}$$

Note that (2c) is used in training, but is not required in inference mode. The BP computation is done only in training and computes a sequence of error values from right to left:

$$\delta_L^{(j)} = \frac{\partial\, l^{(j)}\left(a_L^{(j)}, y^{(j)}\right)}{\partial h_L^{(j)}} \tag{3a}$$
$$\delta_i^{(j)} = \dot{a}_i^{(j)} \sum_{f=1}^{d^{\mathrm{out}}_i} W_{i+1}^{(k_f, j)}\, \delta_{i+1}^{(k_f)} \tag{3b}$$

where l^{(j)}(a_L^{(j)}, y^{(j)}) is the j-th component of the loss function. Finally, stochastic gradient UP is given by

$$b_i^{(j)} \leftarrow b_i^{(j)} - \eta\, \delta_i^{(j)} \tag{4a}$$
$$W_i^{(j,k)} \leftarrow W_i^{(j,k)} - \eta\, a_{i-1}^{(k)}\, \delta_i^{(j)} \tag{4b}$$

where η is the learning rate. The parameters on the left-hand side of (2)–(4) will be referred to as the network parameters, with the weights and biases being the trainable parameters.
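To make the notation concrete, the following NumPy sketch implements one FF, BP, and UP pass of a structured pre-defined sparse MLP by keeping, for each junction, a dense weight matrix multiplied elementwise by a fixed binary mask. It is only an illustration of (1)–(4): the mask representation, the particular structured connection pattern, the ReLU/softmax and cross-entropy choices, and the initialization values are our own assumptions, and the pattern shown is not the clash-free pattern of Sec. III.

```python
import numpy as np

def build_structured_masks(n_net, d_out_net):
    """One fixed binary mask per junction: every left node has out-degree d_out_i,
    every right node has in-degree d_in_i = N_{i-1} * d_out_i / N_i."""
    masks = []
    for n_left, n_right, d_out in zip(n_net[:-1], n_net[1:], d_out_net):
        assert d_out <= n_right and (n_left * d_out) % n_right == 0
        d_in = n_left * d_out // n_right
        mask = np.zeros((n_right, n_left))
        for k in range(n_left):                   # left node k feeds d_out consecutive
            for c in range(d_out):                # right nodes (one simple structured pattern)
                mask[(k * d_out + c) % n_right, k] = 1.0
        assert mask.sum(axis=0).min() == d_out == mask.sum(axis=0).max()
        assert mask.sum(axis=1).min() == d_in == mask.sum(axis=1).max()
        masks.append(mask)
    return masks

def overall_density(n_net, d_out_net):
    """Equation (1): rho_net = sum_i |W_i| / sum_i N_{i-1} * N_i."""
    num = sum(nl * d for nl, d in zip(n_net[:-1], d_out_net))
    den = sum(nl * nr for nl, nr in zip(n_net[:-1], n_net[1:]))
    return num / den

def forward(x, weights, biases, masks):
    """FF pass, equations (2a)-(2c): only masked (connected) weights contribute."""
    acts, dacts = [x], [None]
    for i, (W, b, M) in enumerate(zip(weights, biases, masks), start=1):
        h = (W * M) @ acts[-1] + b
        if i < len(weights):                      # hidden layers: ReLU (assumed)
            a, da = np.maximum(h, 0.0), (h > 0).astype(float)
        else:                                     # output layer: softmax (assumed)
            e = np.exp(h - h.max())
            a, da = e / e.sum(), None
        acts.append(a)
        dacts.append(da)
    return acts, dacts

def backward_and_update(acts, dacts, y, weights, biases, masks, lr=0.01):
    """BP (3a)-(3b) and UP (4a)-(4b); gradients are re-masked so pruned edges never update."""
    delta = acts[-1] - y                          # (3a) for softmax + cross-entropy (assumed)
    for i in range(len(weights), 0, -1):
        W, b, M = weights[i - 1], biases[i - 1], masks[i - 1]
        grad_W = np.outer(delta, acts[i - 1]) * M # (4b), restricted to existing edges
        grad_b = delta.copy()                     # (4a)
        if i > 1:
            delta = dacts[i - 1] * ((W * M).T @ delta)   # (3b)
        W -= lr * grad_W
        b -= lr * grad_b

rng = np.random.default_rng(0)
n_net, d_out_net = (800, 100, 10), (20, 10)       # example configuration (cf. Sec. II-B)
masks = build_structured_masks(n_net, d_out_net)
weights = [rng.normal(0.0, 0.1, (nr, nl)) for nl, nr in zip(n_net[:-1], n_net[1:])]
biases = [np.full(nr, 0.1) for nr in n_net[1:]]
x, y = rng.normal(size=n_net[0]), np.eye(n_net[-1])[3]
acts, dacts = forward(x, weights, biases, masks)
backward_and_update(acts, dacts, y, weights, biases, masks)
print("rho_net =", overall_density(n_net, d_out_net))  # about 0.21
```

Because the mask is fixed before training and re-applied to the gradient, excluded edges never receive updates, which is exactly the pre-defined sparsity constraint; a hardware or sparse-matrix implementation would store only the |W_i| = N_{i−1} d^out_i surviving weights rather than a masked dense matrix.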
B. Motivation and Preliminary Examples

Pre-defined sparsity can be motivated by inspecting the histograms of trained weights in a FC NN. There have been previous efforts to study such statistics [3], [38], however, not for individual junctions. Fig. 1 shows weight histograms for each junction in both a 2-junction and a 4-junction FC NN trained on the MNIST dataset. Note that many of the weights are zero or near-zero after training, especially in the earlier junctions. This motivates the idea that some weights in these layers could be set to zero (i.e., the edges excluded). Even with this intuition, it is unclear that one can pre-define a set of weights to be zero and let the NN learn around this pre-defined sparsity constraint. Fig. 1(c) and (h) show that, in fact, this is the case – i.e., they show classification accuracy as a function of the overall density ρ_net for structured pre-defined sparsity. Since the computational and storage complexity is directly proportional to the number of edges in the NN, operating at an overall density of, for example, 50% results in a 2X reduction in complexity both during training and inference. Detailed numerical experiments in Section IV build on these simple examples. However, before we proceed to those results, it is important to consider a hardware architecture that can support structured pre-defined sparsity, and to consider the additional clash-free constraints placed on the connection patterns so that these can be included in the studies in Section IV.

Fig. 1. Histograms of weight values in different junctions for FC NNs trained on MNIST for 50 epochs, with (a-b) N_net = (800, 100, 10) and (d-g) a 4-junction N_net = (800, …, 10). Test accuracy is shown in (c,h) for different NNs with the same N_net and varying ρ_net. The overall density ρ_net is set by reducing ρ_1, since junction 1 has more weights close to zero in the FC cases (circled).

III. HARDWARE ARCHITECTURE
In this section we describe the proposed flexible hardware architecture outlined in the Introduction. The overall architectural view is captured by Fig. 2: sub-figure (a) shows parallel edge processing within a junction with degree of parallelism 3, (b) shows clash-free memory access, and (c) shows junction pipelining and parallel processing of the three operations – FF, BP, UP. The toy example in Fig. 2(a)-(b) is for N_{i−1} = 6, N_i = 3,
ρ_i = 6/18 = 1/3, and z_i = 3. Fig. 2(a) shows that the z_i = 3 blue edges are processed in parallel in one cycle, while the pink edges are processed in parallel during the next cycle. Fig. 2(b) shows how the z_i = 3 FF processing logic units access the memories in natural and interleaved order. As described in detail in Sec. III-B, the interleaved order access may represent reading of the activations {a^(j)_{i−1}} for a clash-free subset of j ∈ {0, ..., 5}, and the natural order access may correspond to writing the computed activations {a^(j)_i}. On the next cycle, the remaining memory locations (i.e., the white cells) will be accessed. Note that this illustrates a clash-free connection pattern since each of the z_i = 3 memories is accessed no more than once in each cycle – i.e., one hit per column on each access.

Fig. 2. (a) Processing z_i = 3 edges in each cycle (blue in cycle 0, pink in cycle 1) for some junction i. (b) Accessing z_i = 3 memories – M0, M1 and M2, shown as columns – from two separate banks, one in natural order (same address from each memory), the other in interleaved order. Clash-freedom is achieved by accessing only one element from each memory. The accessed values are fed to z_i = 3 processors to perform FF simultaneously. (c) Operational parallelism in each junction (vertical dotted lines denote processing for one junction), and junction pipelining of each operation across junctions (horizontal dashed lines) in a multi-junction NN. Subfigure (c) is modified from our previous conference publication [20, Fig. 2(c)].

The junction-based operation in Fig. 2(b) is repeated for each junction in a pipeline. In particular, there are L pipeline stages. For example, for the FF pipeline, while the first stage is processing input vector n + L on junction 1, the second stage is processing input vector n + L − 1 on junction 2. The degree of parallelism for each junction is selected so that the processing time for any operation (FF/BP/UP) is the same for each junction. Thus the throughput, i.e., the frequency of processing input samples, is determined by the time taken to perform a single operation in a single junction.

In summary, the architecture is (i) edge-based and not tied to a specific number of nodes in a layer, (ii) flexible in that the amount of logic is determined by the degree of parallelism, which trades size for speed, and (iii) fully pipelined for the parallel operations associated with NN training. Also note that the architecture can be specialized to perform only inference by removing the logic and memory associated with the BP and UP operations, and the ȧ_i computation in (2c).

A key concern when implementing NNs on hardware is the large amount of storage required. Several characteristics regarding memory requirements guided us in developing the proposed architecture. Firstly, since weight memories are the largest, their number should be minimized. Secondly, having a few deep memories is more efficient in terms of power and area than having many shallow memories [39]. Thirdly, throughput should be maximized without duplicating memories, hence the need for clash-free connection patterns.

In Sec. III-A, we describe the junction pipelining design, which attempts to minimize weight storage resources. The memory organization within a junction is described in Sec. III-B and is designed to minimize the number of memories for a given degree of parallelism. Finally, clash-free access conditions are developed in Sec. III-B and III-C, and a simple method for implementing such patterns is given in Sec. III-C.
A. Junction pipelining and operational parallelism
Our edge-based architecture is motivated by the fact that all three operations – FF, BP, UP – use the same weight values for computation. Since z_i edges are processed in parallel in a single cycle, the time taken to complete an operation in junction i is C_i = |W_i| / z_i cycles. The degree of parallelism configuration z_net = (z_1, ..., z_L) is chosen to achieve C_i = C for all i ∈ {1, ..., L}. This allows efficient junction pipelining, since each operation takes exactly C cycles to be completed for each input in each junction; we refer to this duration as a junction cycle. This determines throughput. (During hardware implementation, a few extra cycles may be needed to flush the pipeline, so that C_i = |W_i|/z_i + c_i. These are also balanced, i.e., c_i = c for all i ∈ {1, ..., L}, to achieve efficient pipelining. In our initial implementation [40], for example, c = 2 and the junction cycle is C = 34.)
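As a concrete illustration of this balancing rule, the short sketch below picks z_i = |W_i| / C for a target junction cycle C. It is our own illustrative calculation (the function name and the example value C = 100 are assumptions), and a real design must additionally respect the memory constraints of Sec. III-B, e.g., z_i dividing N_{i−1} and z_{i+1} ≥ ⌈z_i / d^in_i⌉.

```python
def balanced_z_net(n_net, d_out_net, junction_cycle):
    """Choose the degree of parallelism z_i = |W_i| / C so that every junction
    completes one operation (FF, BP, or UP) in the same number of cycles C."""
    w = [n_left * d for n_left, d in zip(n_net[:-1], d_out_net)]   # |W_i| per junction
    assert all(wi % junction_cycle == 0 for wi in w), "C must divide every |W_i|"
    return [wi // junction_cycle for wi in w]

# hypothetical example: the sparse configuration N_net = (800, 100, 10),
# d_out_net = (20, 10), processed with a junction cycle of C = 100 cycles
print(balanced_z_net((800, 100, 10), (20, 10), junction_cycle=100))   # -> [160, 10]
```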
The following is an analysis of Fig. 2(c) in more detail for an example NN with L = 2. While a new training input numbered n+3 is getting loaded as a_0, junction 1 is processing the FF stage for the previous input n+2 and computing a_1. Simultaneously, junction 2 is processing FF and computing δ_2 via the cost derivatives for input n+1. It is also doing BP on input n to compute δ_1, as well as updating (UP) its parameters from the finished δ_2 computation of input n. Simultaneously, junction 1 is performing UP using δ_1 from the finished BP results of input n−1. (BP does not occur in the first junction because there are no δ_0 values to be computed.) This results in operational parallelism in each junction, as shown in Fig. 3. The combined speedup is approximately a factor of 3L (three parallel operations across L pipelined junctions) as compared to doing one operation at a time for a single input.

Notice from Fig. 3 that there is only one weight memory bank, which is accessed for all three operations. However, UP in junction 1 needs access to a_0 for input n−1, as per the weight update equation (4b). This means that there need to be 2L+1 = 5 left activation memory banks for storing a_0 for inputs n−1 to n+3, i.e., a queue-like structure. Similarly, UP in junction 2 will need 2(L−1)+1 = 3 queued banks for each of its left activation a_1 and derivative ȧ_1 memories – for inputs from n (for which values will be read) to n+2 (for which values are being computed and written). There also need to be 2 banks for all δ memories – one for reading and the other for writing. Thus junction pipelining requires multiple memory banks, but only for the layer parameters a, ȧ, and δ, not for weights. The number of layer parameters is insignificant compared to the number of weights for practical networks. This is why pre-defined sparsity leads to significant storage savings, as quantified in Table I for the circled FC point vs the ρ_net = 21% point from Fig. 1(c). Specifically, memory requirements are reduced by 3.9X in this case. Furthermore, the computational complexity, which is proportional to the number of weights for a MLP, is reduced by 4.8X. For this example, these complexity reductions come at the cost of a small degradation in classification accuracy (see Fig. 1(c)).

Fig. 3. Architecture for parallel operations for an intermediate junction i (i ≠ 1, L) showing the three operations along with associated inputs and outputs. Natural and interleaved order accesses are shown using solid and dashed lines, respectively. The a and ȧ memory banks occur as queues, the δ memory banks as pairs, while there is a single weight memory bank. Figure modified from our previous conference publication [20, Fig. 3].

TABLE I
HARDWARE ARCHITECTURE TOTAL STORAGE COST COMPARISON FOR N_net = (800, 100, 10): FC VS SPARSE WITH d^out_net = (20, 10), ρ_net = 21%

Parameter | Expression                      | Count (FC) | Count (sparse)
a         | Σ_{i=0}^{L−1} (2(L−i)+1) N_i    | 4300       | 4300
ȧ         | Σ_{i=1}^{L−1} (2(L−i)+1) N_i    | 300        | 300
δ         | 2 Σ_{i=1}^{L} N_i               | 220        | 220
b         | Σ_{i=1}^{L} N_i                 | 110        | 110
W         | Σ_{i=1}^{L} N_i d^in_i          | 81000      | 17000
TOTAL     | Σ (all above)                   | 85930      | 21930
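The storage counts in Table I follow directly from the bank-count expressions listed there. The short script below recomputes the totals for the FC and sparse configurations; it is a plain arithmetic check written for this article (function and variable names are ours), not part of the hardware design itself.

```python
def storage_counts(n_net, d_in_net):
    """Total storage cost for the pipelined architecture (Table I).
    n_net = (N_0, ..., N_L); d_in_net = (d_in_1, ..., d_in_L)."""
    L = len(n_net) - 1
    a_cnt  = sum((2 * (L - i) + 1) * n_net[i] for i in range(L))      # activation queues
    da_cnt = sum((2 * (L - i) + 1) * n_net[i] for i in range(1, L))   # derivative queues
    d_cnt  = 2 * sum(n_net[1:])                                       # delta: read + write banks
    b_cnt  = sum(n_net[1:])                                           # biases
    w_cnt  = sum(n * d for n, d in zip(n_net[1:], d_in_net))          # weights
    return a_cnt + da_cnt + d_cnt + b_cnt + w_cnt

n_net = (800, 100, 10)
print(storage_counts(n_net, d_in_net=(800, 100)))   # FC case: 85930
print(storage_counts(n_net, d_in_net=(160, 100)))   # sparse, d_out = (20, 10): 21930
```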
B. Memory organization

For the purposes of memory organization, edges are numbered sequentially from top to bottom on the right side of the junction. Other network parameters such as a, ȧ and δ are numbered according to the neuron numbers in their respective layer. Consider Fig. 4 as an example, where junction i is flanked by N_{i−1} = 12 left neurons with d^out_i = 2 and N_i = 8 right neurons, leading to |W_i| = 24 and d^in_i = 3. The three weights connecting to right neuron 0 are numbered 0, 1, 2; the next three connecting to right neuron 1 are numbered 3, 4, 5, and so on. A particular right neuron connects to some subset of left neurons of cardinality d^in_i. (In our FPGA implementation the weight memories are dual-ported, while the a and ȧ memories are single-ported; the δ memories are also dual-ported due to the exact manner in which we implemented this architecture on FPGA – refer to [40] for full details.)

Fig. 4. An example of processing inside junction i with z_i = 4 memories in the weight and left banks, and z_{i+1} = 2 memories in the right bank. The banks are represented as numerical grids, each column is a memory, and the number in each cell is the number of the edge / left neuron / right neuron whose parameter value is stored in it. Edges are sequentially numbered on the right (shown in curly braces). Four weights are read in each of the six cycles, with the first three cycles colored blue, pink and green, respectively. These represent sweep 0, while the next 3 (using dashed lines), colored brown, red and purple, respectively, represent sweep 1. Clash-freedom leads to at most one cell from each memory in each bank being accessed each cycle. Weight and right memories are accessed in natural order, while left memories are accessed in interleaved order.

Each type of network parameter is stored in a bank of memories. The example in Fig. 4 uses z_i = 4, i.e., the weight bank consists of z_i memories, and their depth equals C_i. Weight memories are read in natural order – one row per cycle (shown in the same color). Right neurons are processed sequentially due to the weight numbering. The number of right neuron parameters of a particular type needing to be accessed in a cycle is upper bounded by ⌈z_i/d^in_i⌉, which leads to z_{i+1} ≥ ⌈z_i/d^in_i⌉ in order to prevent clashes in the right memory bank. (This does not limit most practical designs; see Appendix B.) For FF in Fig. 4, for example, cycles 0 and 1 finish computation of a_i^(0) and a_i^(1) respectively, while cycle 2 finishes computing both a_i^(2) and a_i^(3). For BP or UP, everything remains the same except for the right memory accesses. Now δ_i^(0) and δ_i^(1) are used in cycle 0, δ_i^(1) and δ_i^(2) in cycle 1, and δ_i^(2) and δ_i^(3) in cycle 2. Thus the maximum number of right neuron parameters ever accessed in a cycle is ⌈z_i/d^in_i⌉ = 2.

Since edges are interleaved on the left, in general, the z_i edge processing logic units will need access to z_i parameters of a particular type from layer i−1. So all the left memory banks have z_i memories, each of depth D_i = N_{i−1}/z_i, which are accessed in interleaved order. For example, after D_i cycles, N_{i−1} edges have been processed – i.e., (D_i × z_i) = N_{i−1}. We require that each of these edges be connected to a different left neuron, which eliminates the possibility of duplicate edges. This completes a sweep, i.e., one complete access of the left memory bank. Since each left neuron connects to d^out_i edges, d^out_i sweeps are required to process all the edges, i.e., each left activation is read d^out_i times in the whole junction cycle.
The reader can verify that D_i cycles multiplied by d^out_i sweeps results in C_i total cycles, i.e., one junction cycle: D_i d^out_i = (N_{i−1}/z_i) d^out_i = |W_i|/z_i = C_i.

C. Clash-free connection patterns
We define a clash as attempting to perform a particular operation more than once on the same memory at the same time, which would stall processing. (For single-ported memories, attempting two reads, two writes, or a read and a write in the same cycle is a clash. For simple dual-ported memories with one port exclusively for reading and the other exclusively for writing, a read and a write can be performed in the same cycle; attempting two reads or two writes in the same cycle is a clash.) The idea of clash-freedom is to pre-define a pattern of connections and z values such that no operation in any junction of the NN results in a clash. Sec. III-B described how z values should be designed to prevent clashes in the weight and right memory banks.

This subsection analyzes the left memory banks, which are accessed in interleaved order. Their memory access pattern should be designed so as to prevent clashes. Additionally, the following properties are desired for practical clash-free patterns. Firstly, it should be easy to find a pattern that gives good performance. Secondly, the logic and storage required to generate the left memory addresses should be of low complexity.

We generate clash-free patterns by initially specifying the left memory addresses to be accessed in cycle 0 using a seed vector φ_i ∈ {0, 1, ..., D_i − 1}^{z_i}. Subsequent addresses are cyclically generated. Considering Fig. 4 as an example, φ_i = (1, ·, ·, ·). Thus in cycle 0, we access addresses (1, ·, ·, ·) from memories (M0, M1, M2, M3), i.e., left neurons (4, ·, ·, ·). In cycle 1, the accessed addresses are (2, ·, ·, ·), and so on. Since D_i = 3, cycles 3–5 access the same left neurons as cycles 0–2.

We found that this technique results in a large number of possible connection patterns, as discussed in Appendix C. Randomly sampling from this set results in performance comparable with non-clash-free NNs, as shown in Sec. IV-B. Finally, our approach only requires storing φ_i and using z_i incrementers to generate subsequent addresses. This approach is similar to methods used in modern coding to allow parallel processing and memory accesses, cf. [28]–[30]. Other techniques to generate clash-free connection patterns are discussed in Appendix C.
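A minimal sketch of this address-generation scheme is given below, using the Fig. 4 dimensions (N_{i−1} = 12, z_i = 4, D_i = 3) but a hypothetical seed vector, since only the first component of the published φ_i is legible above. It generates the interleaved-order left-memory addresses for one sweep and checks the clash-free property (one access per memory per cycle, all distinct left neurons within a sweep); the mapping of left neuron k to memory k mod z_i at address k div z_i is our assumption about the figure's layout.

```python
def left_access_schedule(n_left, z, phi):
    """Cyclically generate interleaved-order addresses from seed phi (Sec. III-C).
    Returns, for each cycle of one sweep, the left-neuron index read from each memory."""
    D = n_left // z                                   # depth of each left memory
    assert len(phi) == z and all(0 <= p < D for p in phi)
    schedule = []
    for cycle in range(D):
        addrs = [(p + cycle) % D for p in phi]        # one address per memory -> clash-free
        neurons = [m + z * a for m, a in zip(range(z), addrs)]  # neuron k at memory k % z, address k // z
        schedule.append(neurons)
    return schedule

sched = left_access_schedule(n_left=12, z=4, phi=(1, 0, 2, 2))   # hypothetical seed
for c, neurons in enumerate(sched):
    print("cycle", c, "reads left neurons", neurons)
# every sweep touches each left neuron exactly once (no duplicate edges):
assert sorted(n for row in sched for n in row) == list(range(12))
```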
D. Batch size

It is common in training of NNs to use minibatches. For a batch size of M, the UP operation in (4) is performed only once for M inputs, using the average of the M gradients. Our architecture performs an UP for every input and therefore may be viewed as having batch size one. However, the processing in our architecture differs from a typical software implementation with M = 1 due to the pipelined and parallel operations. Specifically, in our architecture, FF and BP for the same input use different weights, as implied by Fig. 2(c). In results not presented here, we found no performance degradation due to this variation from the standard backpropagation algorithm. There is considerable ambiguity in the literature regarding ideal batch sizes [41], [42], and we found that our current network architecture performed well in our initial hardware implementation [40]. However, if a more conventional batch size is desired, the UP logic can be removed from the junction pipeline and the UP operation performed once every M inputs. This would eliminate some arithmetic units at the cost of increased storage for accumulating intermediate values from (4).

Fig. 5. Processing the FC version of the junction from Fig. 4. For clarity, only the first 12 and last 12 edges (dashed) are shown, corresponding respectively to right neurons 0 and 7, sweeps 0 and 7, cycles 0–2 and 21–23.

E. Special Case: Processing a FC junction
Fig. 5 shows the FC version of the junction from Fig. 4, which has 96 edges to be accessed and operated on. This can be done keeping the same junction cycle C_i = 6 by increasing z_i to 16, i.e., using more hardware. On the other hand, if hardware resources are limited, one can use the same z_i = 4 and pay the price of a longer junction cycle C_i = 24, as shown in Fig. 5. This demonstrates the flexibility of our architecture.

Note that FC junctions are clash-free in all practical cases, for the following reasons. Firstly, the left memory accesses are in natural order, just like the weights, which ensures that no more than one element is accessed from each memory per cycle. Secondly, ⌈z_i/d^in_i⌉ = 1 for all practical cases, since z_i ≤ N_{i−1}, as discussed in Appendix B, and d^in_i = N_{i−1} for FC junctions. This means that at most one right neuron is processed in a cycle (in Fig. 5, for example, one right neuron finishes processing every 3rd cycle), so clashes will never occur when accessing the right memory bank.

Note that compared to Fig. 4, the weight memories in Fig. 5 are deeper, since C_i has increased from 6 to 24. However, the left layer memories remain the same size, since N_{i−1} = 12 and z_i = 4 are unchanged, but the left memory bank is accessed more times, since the number of sweeps has increased from 2 to 8. Also note that even if cycle 0 (blue) accesses some other clash-free subset of left neurons than the one shown, the connection pattern would remain unchanged. This implies that different memory access patterns do not necessarily lead to different connection patterns, as discussed further in Appendix C.

IV. OBSERVED TRENDS OF PRE-DEFINED SPARSITY
This section analyzes trends observed when experimenting with several different datasets via software simulations. We intend the following four trends to provide guidelines on designing pre-defined sparse NNs.

1) Hardware-compatible, clash-free, pre-defined sparse patterns perform at least as well as other pre-defined sparse patterns (i.e., random and structured) (Sec. IV-B).
2) The performance of pre-defined sparsity is better on datasets that have more inherent redundancy (Sec. IV-C).
3) Junction density should increase to the right: junctions closer to the output should generally have more connections than junctions closer to the input (Sec. IV-D).
4) Larger and more sparse NNs are better than smaller and denser NNs, given the same number of layers and trainable parameters. Specifically, 'larger' refers to more hidden neurons (Sec. IV-E).

The remainder of this section first describes the datasets we experimented on, and then examines these trends in detail.
A. Datasets and Experimental Configuration
Unless otherwise noted, the parameters and configurations listed below were used for all presented results.

a) MNIST handwritten digits:
We rasterized each input image into a single layer of 784 features, i.e., the permutation-invariant format. (On certain occasions we added 16 input features which are always trivially 0, so as to get 800 features for each input; this leads to easier selection of different sparse network configurations.) No data augmentation was applied.

b) Reuters RCV1 corpus of newswire articles: The classification categories are grouped in a tree structure. We used preprocessing techniques similar to [43] to isolate articles which fell under a single category at the second level of the tree. We finally obtained 328,669 articles in 50 categories, split into validation, test, and training (the remainder) subsets. The original data has a list of token strings for each story; for example, a story on finance would frequently contain the token 'financ'. We chose the most common 2000 tokens and computed counts for each of these in each article. Each count x was transformed into log(1 + x) to form the final 2000-dimensional feature vector for each input.

c) TIMIT speech corpus:
TIMIT is a speech dataset comprising a few hours of 16 kHz audio, commonly used in Automatic Speech Recognition (ASR). A modern ASR system has three major components: (i) preprocessing and feature extraction, (ii) an acoustic model, and (iii) a dictionary and language model. A complete study of an ASR system is beyond the scope of this work. Instead we focus on the acoustic model, which is typically implemented using a NN. The input to the acoustic model is feature vectors and the output is a probability distribution over phonemes (i.e., speech sounds). For our experiments, we used 25 ms speech frames with a 10 ms shift, as in [43], and computed a feature vector of 39 Mel-frequency Cepstral Coefficients (MFCCs) for each frame. We used the complete training set (462 speakers), a validation set (50 speakers), and a test set (118 speakers). We used a phoneme set of size 39, as defined in [44].

d) CIFAR-100 images: Our setup for CIFAR-100 consists of a Convolutional Neural Network (CNN) followed by a MLP. The CNN has 3 blocks, and each block has 2 convolutional layers with window size 3x3 followed by a max pooling layer of pool size 2x2. The numbers of filters for the six convolutional layers are (60, 60, 125, 125, 250, 250). This results in a total of approximately one million trainable parameters in the convolutional portion of the network. Batch normalization is applied before activations. The output from the 3rd block, after flattening into a vector, has 4000 features. Typically dropout is applied in the MLP portion; however, we omitted it there since pre-defined sparsity is an alternate form of parameter reduction. Instead, we found that a dropout probability of one half applied to the convolutional blocks improved performance. No data augmentation was applied.

For each dataset, we performed classification using one-hot labels and measured accuracy on the test set as a performance metric. We also calculated the top-5 test set classification accuracy for CIFAR-100. (The NN in a complete ASR system would be a 'soft' classifier and feed the phoneme distribution outputs to a decoder to perform 'hard' final classification decisions. Therefore, for TIMIT we computed another performance metric called Test Prediction Comparison (TPC), measured as the KL divergence between predicted test output probability distributions of the sparse NN versus the respective FC case. Results obtained using TPC were qualitatively very similar to test accuracy and are not shown here.)

We found the optimal training configuration for each FC setup by doing a grid search using validation performance as a metric. This resulted in choosing ReLU activations for all layers except for the final softmax layer. The initialization proposed by He et al. [45] worked best for the weights, while for biases, we found that a small positive initial value worked best in all cases except for Reuters, for which zeros worked better. The Adam optimizer [46] was used with all parameters set to default, except the decay parameter, which was set to a small nonzero value for best results. We used a batch size of 1024 for TIMIT and Reuters, since the number of training samples is large, and 256 for MNIST and CIFAR.

All experiments were run for 50 epochs of training, and regularization was applied as an L2 penalty to the weights. To maintain consistency, we kept most hyperparameters the same when sparsifying the network, but reduced the L2 penalty
coefficient with increasing sparsity. This was done because sparse NNs have fewer trainable parameters and are less prone to overfitting. We ran each experiment at least five times to average out randomness, and we show the 90% Confidence Intervals (CIs) for each metric as shaded regions (this also holds for the results in Fig. 1(c,h)). In addition to the results shown, we developed a dataset of Morse code symbol sequences and investigated pre-defined sparse NNs on it. While these results are excluded for brevity, they are consistent with the trends described in this section and can be found in [47].

TABLE II
COMPARISON OF PRE-DEFINED SPARSE METHODS

[Table II lists, for MNIST (800 input features, 10 classes), Reuters (2000 input features, 50 classes), TIMIT (39 input features, 39 phonemes), and the CIFAR-100 MLP (4000 input features, 100 classes), a range of out-degree configurations d^out_net with the resulting overall densities ρ_net and the degree-of-parallelism configurations z_net used for the clash-free case, and reports the mean test accuracy and 90% CI of clash-free, structured, and random pre-defined sparsity alongside the FC baseline accuracy for each dataset.]

a For CIFAR-100, the given values of N_net, d^out_net, z_net and ρ_net are just for the MLP portion, which follows a CNN as described in Sec. IV-A to form the complete net. Reported values are top-5 test accuracies obtained from training on the complete net.

B. Comparison of Pre-Defined Sparse Methods
Table II shows performance on different datasets for three methods of pre-defined sparsity: a) the most restrictive and hardware-friendly clash-freedom, b) structured, and c) random. For the clash-free case, we experimented with different z_net settings to simulate different hardware environments:
• Reuters: One junction cycle is 50 cycles for all the different densities. This is because we scale z_net accordingly, i.e., a more powerful hardware device is used for each NN as ρ_net increases.
• CIFAR-100 and MNIST: These simulate cases where hardware choice is limited, such as a high-end, a mid-range, and a low-end device being available. Thus three different z_net values are used for CIFAR-100 depending on ρ_net.
• TIMIT: We keep z_net constant for different densities. The junction cycle length varies from 90 cycles for ρ_net ≈ 7% to 810 cycles for ρ_net ≈ 69%. This shows that when limited to a single low-end hardware device, denser NNs can be processed in longer time by simply changing z_net.

Table II confirms that hardware-friendly clash-free pre-defined sparse architectures do not lead to any statistically significant performance degradation. We also observed that random pre-defined sparsity performs poorly for very low density networks, as shown by the blue values in Table II. This is possibly because there is a non-negligible probability of neurons getting completely disconnected, leading to irrecoverable loss of information.
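The disconnection effect mentioned above is easy to quantify with a short Monte Carlo experiment. The sketch below (our own illustration, not an experiment from the paper) estimates the fraction of right-layer neurons left with zero in-degree when edges are placed uniformly at random at a preset density ρ; structured pre-defined sparsity avoids this by construction, since every neuron keeps a fixed nonzero in- and out-degree.

```python
import numpy as np

def frac_disconnected(n_left, n_right, rho, trials=200, seed=0):
    """Monte Carlo estimate of the fraction of right neurons with zero in-degree
    when |W| = rho * n_left * n_right edges are chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    n_edges = int(round(rho * n_left * n_right))
    disconnected = 0
    for _ in range(trials):
        edges = rng.choice(n_left * n_right, size=n_edges, replace=False)
        right_nodes = edges % n_right            # right endpoint of each sampled edge
        disconnected += n_right - np.unique(right_nodes).size
    return disconnected / (trials * n_right)

# example junction with a small right layer (100 left neurons, 10 right neurons):
for rho in (0.2, 0.05, 0.02):
    print(rho, frac_disconnected(n_left=100, n_right=10, rho=rho))
# at low densities a noticeable fraction of output neurons receives no input at all
```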
C. Dataset Redundancy

Many machine learning datasets have considerable redundancy in their input features. For example, one may not need information from all ∼800 input features of MNIST to infer the correct image class. We hypothesize that pre-defined sparsity takes advantage of this redundancy, and will be less effective when the redundancy is reduced. To test this, we changed the feature vector for each dataset as follows. For MNIST, Principal Component Analysis (PCA) was used to reduce the feature count to the 200 least redundant features. For Reuters, the number of most frequent tokens considered as features was reduced from 2000 to 400. For TIMIT, we both reduced and increased the number of MFCCs by 3X, to 13 and 117, respectively. Note that the latter increases redundancy. For CIFAR-100, a source of redundancy is the depth of the CNN, which extracts features and discriminates between classes before the MLP performs final classification. In other words, the CNN eases the burden of the MLP. So a way to reduce redundancy and increase the classification burden of the MLP is to lessen the effectiveness of the CNN by reducing its depth. Accordingly, we used a single convolutional layer with 250 filters followed by a max pooling layer. This results in the same number of features, 4000, at the input of the MLP as the original network, but has reduced redundancy for the MLP.

Classification performance results are shown in Fig. 6 as a function of ρ_net. For MNIST and CIFAR-100, the performance degrades more sharply with reducing ρ_net for the nets using the reduced-redundancy datasets. To explore this further, we recreated the histograms from Fig. 1 for the reduced-redundancy datasets, i.e., a FC NN with N_net = (200, …, 10) trained on MNIST after PCA. We observed a wider spread of weight values, implying less opportunity for sparsification (i.e., fewer weights were close to zero). Similar trends are less discernible for Reuters and TIMIT; however, reducing redundancy led to worse performance overall.

The results in Fig. 6 further demonstrate the effectiveness of pre-defined sparsity in greatly reducing network complexity with negligible performance degradation. For example, even the reduced-redundancy problems perform well when operating with half the number of connections. For CIFAR in particular, FC performs worse than an overall MLP density of around 20%. Thus, in addition to reducing complexity, structured pre-defined sparsity may be viewed as an alternative to
dropout in the MLP for the purpose of improving classification performance.

Fig. 6. Comparison of classification accuracy as a function of ρ_net for different versions of datasets – original, reduced in redundancy by reducing the feature space (MNIST, Reuters, TIMIT) or performing less processing prior to the MLP (CIFAR-100), and increased in redundancy by enlarging the feature space (TIMIT).

Fig. 7. Comparison of classification accuracy as a function of ρ_net for different ρ_2, where L = 2. Black-circled points show the effects of ρ_2 when ρ_net is the same. N_net values are (800, …, 10) for MNIST, (2000, …, 50) for Reuters, and (4000, …, 100) for the MLP in CIFAR-100.

D. Individual junction densities
The weight histograms in Fig. 1 indicate that latter junctions, particularly junction L closest to the output, have a wide spread of weight values. This suggests that a good strategy for reducing ρ_net would be to use lower densities in earlier junctions – i.e., ρ_1 < ρ_L. This is demonstrated in Fig. 7 for the cases of MNIST, CIFAR-100 and Reuters, each with L = 2 junctions in their MLPs. Each curve in each subfigure is for a fixed ρ_2, i.e., reducing ρ_net across a curve is done solely by reducing ρ_1. For a fixed ρ_net, the performance improves as ρ_2 increases. For example, the circled points in Reuters both have ρ_net = 4%, but the starred point with ρ_2 = 100% has noticeably better test accuracy than the pentagonal point with ρ_2 = 2%. The trend clearly holds for MNIST and is also discernible for CIFAR-100.

We further observed that this trend (i.e., ρ_{i+1} > ρ_i should hold) is related to the redundancy inherent in the dataset and may not hold for datasets with very low levels of redundancy. To explore this, results analogous to those in Fig. 7 are presented in Fig. 8 for TIMIT, but with varying sized MFCC feature vectors – i.e., datasets corresponding to larger feature vectors will contain more redundancy. The results in Fig. 8(c) are for 117-dimensional MFCCs and are consistent with the trend in Fig. 7. However, for a MFCC dimension of 13, this trend actually reverses – i.e., junction 1 should have higher density. This is shown in Fig. 8(b), where each curve is for a fixed density of one junction. This reversed trend is also observed for the case of 39-dimensional feature vectors, considered in Fig. 8(a), where N_net = (39, …, 39). Due to this symmetric neuronal configuration, for each value of ρ_net on the x-axis in Fig. 8(a), the two curves have complementary values of ρ_1 and ρ_2 (ρ_1 ≠ ρ_2) – e.g., the two curves at ρ_net ≈ 7% have (ρ_1, ρ_2) pairs of roughly (2%, 12%) and the same values swapped. We observe that the curve for ρ_1 < ρ_2 is generally worse than the curve for ρ_2 < ρ_1, which indicates that junction 1 should have higher density in this case.

Fig. 8(d) depicts the results for Reuters with the feature vector size reduced to 400 tokens. While junction 2 is still more important (as in Fig. 7(c) for the original Reuters dataset), notice the circled star-point at the very left of the ρ_2 = 100% curve. This point has very low ρ_1. Unlike Fig. 7(c), it crosses below the other curves, indicating that it is more important to have higher density in the first junction with this less redundant set of features. We observed a similar, but less prominent, trend for MNIST with PCA when the feature dimension was reduced to 200.

In summary, if an individual junction density falls below a certain value, referred to as the critical junction density, it will adversely affect performance regardless of the density of other junctions. This explains why some of the curves cross in Fig. 8. The critical junction density is much smaller for earlier junctions than for later junctions in most datasets with sufficient redundancy. However, the critical density for earlier junctions increases for datasets with low redundancy.

Fig. 8. Comparison of classification accuracy as a function of ρ_net for: (a) TIMIT with 39 MFCCs, for the two cases where one junction is always sparser than the other and vice versa; black-circled points show how reducing ρ_1 degrades performance to a greater extent. (b) TIMIT with 13 MFCCs. (c,d) TIMIT with 117 MFCCs, and Reuters reduced to 400 tokens, for different fixed densities of one junction. N_net values are (a) (39, …, 39), (b) (13, …, 39), (c) (117, …, 39), (d) (400, …, 50).

E. 'Large and sparse' vs 'small and dense' networks

We observed that when keeping the total number of trainable parameters the same, sparser NNs with larger hidden layers (i.e., more neurons) generally performed better than denser networks with smaller hidden layers. This is true as long as the larger NN is not so sparse that individual junction densities fall below the critical density, as explained in Sec. IV-D. While the critical density is problem-dependent, it is usually low enough to obtain significant complexity savings above it. Thus, 'large and sparse' is better than 'small and dense' for many practical cases, including NNs with more than one hidden layer (i.e., L > 2).

Fig. 9 shows this for networks having one and three hidden layers trained on MNIST. For the three-hidden-layer network, all hidden layers have the same number of neurons. Each solid curve shows classification performance vs ρ_net for a particular N_net, while the black dashed curves with identical markers are configurations that have approximately the same number of trainable parameters. As an example, the points with circular markers (with a big blue ellipse around them) in Fig. 9(b) all have the same number of trainable parameters and indicate that the larger, more sparse NNs perform better. Specifically, the larger network with d^out_net = (10, …) and ρ_net ≈ 9% performs significantly better than the FC network with N_net = (784, …, 10) and other smaller and denser networks, despite each having the same number of trainable parameters. Increasing the network size further while reducing ρ_net to keep the number of trainable parameters fixed leads to performance degradation. This is because the lower ρ_net was achieved by setting the earliest junction densities to roughly 2%, which appears to be below the critical density.

Fig. 10 summarizes the analogous experiment on Reuters with similar conclusions. Both subfigures show the same results, with the x-axis split into higher and lower density ranges (the latter on a log scale) to show more detail. Observe that the trend of 'large and sparse' being better than 'small and dense' holds for subfigure (a), but reverses for (b) since densities are very low (the black dashed curves have positive slope instead of negative). This is due to the critical density effect.

Fig. 11(a) shows the result for the same experiment on TIMIT with four hidden layers. The trend is less clearly discernible, but it exists. Notice how the black dashed curves have negative slopes at appreciable levels of ρ_net, indicating 'large and sparse' being better than 'small and dense', but high positive slopes at low ρ_net, indicating the rapid degradation in performance as density is reduced beyond the critical density. This is exacerbated by the fact that TIMIT with 39 MFCCs is a dataset with low redundancy, so the effects of very low ρ_net are more readily observed.

Fig. 11(b) for the MLP portion of CIFAR-100 shows similar results as TIMIT, but on a log x-scale for more clarity. As noted in Sec. IV-C, the best performance for a given N_net occurs at an overall density less than 100%. It appears that for any N_net for CIFAR-100, peak performance occurs at a similar, moderately low overall MLP density. In experiments not shown here, we obtained similar results for the reduced-redundancy net with a single convolutional layer.

Fig. 9. Comparing 'large and sparse' to 'small and dense' networks for MNIST with 784 features, with (a) N_net = (784, x, 10) (on the left) and (b) N_net = (784, x, x, x, 10) (on the right). Solid curves (with the shaded CIs around them) are for constant x; black dashed curves with the same marker are for the same number of trainable parameters. The final junction is always FC. Intermediate junctions for the L = 4 case have d^out values similar to junction 1.

Fig. 10. Comparing 'large and sparse' to 'small and dense' networks for Reuters with 2000 tokens, with N_net = (2000, x, 50). The x-axis is split into higher values on the left (a) and lower values on the right in log scale (b). Solid curves (with the shaded CIs around them) are for constant x; black dashed curves with the same marker are for the same number of trainable parameters. Junction 1 is sparsified first until its number of total weights is approximately equal to that of junction 2, then both are sparsified equally.

Fig. 11. Comparing 'large and sparse' to 'small and dense' networks for (a) TIMIT with 39 MFCCs and N_net = (39, x, x, x, x, 39) (on the left), and (b) CIFAR-100 with the deep 6-layer CNN and MLP N_net = (4000, x, 100), with a log scale for the x-axis (on the right). Solid curves (with the shaded CIs around them) are for constant x; black dashed curves with the same marker are for the same number of trainable parameters (in the MLP portion only for CIFAR). Since TIMIT has symmetric junctions, we tried to keep input and output junction densities as close as possible and adjusted intermediate junction densities to get the desired ρ_net. CIFAR-100 is sparsified in a way similar to Reuters in Fig. 10.

Fig. 12. Comparison of classification accuracy as a function of ρ_net for different sparse methods on (a) MNIST with N_net = (800, …, 10), (b) Reuters with N_net = (2000, …, 50), and (c) TIMIT with N_net = (39, …, 39). We set the overall density ρ_net and all individual junction densities ρ_i to be approximately the same across the different sparse methods.

V. COMPARISON TO OTHER SPARSE NN METHODS
Numerical results in Sec. IV showed that hardware-compatible clash-free connection patterns performed as well as structured and random pre-defined sparse connections. In this section, we compare clash-free patterns against two sparsity approaches that are less constrained than the structured pre-defined sparsity considered in Sec. IV. In particular, both approaches remove the constraint of regular degree – i.e., these approaches yield sparse NNs that have varying d^out_i and d^in_i selected to optimize classification performance. (We also performed experiments on TIMIT with one hidden layer (L = 2) and Reuters with two hidden layers (L = 3); results were similar to those shown, so they are omitted for brevity's sake.)

A. Attention-based Preprocessed Sparsity
Previous works [48], [49] have applied the concept of attention to object recognition and image captioning to achieve better performance with fewer parameters and less computation. We simplify this idea by computing the variance of input features as attention and setting the out-degree of the neurons of the input layer based on this value. Specifically, the feature variances are quantized into three levels, and input neurons with higher attention are assigned more connections than those with lower attention. For the neurons in latter layers, we use uniform out-degree and in-degree.
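A small sketch of this preprocessing step is given below. The three-level quantization via tertile thresholds and the specific out-degree values are our own illustrative choices; the paper only specifies that variance is quantized into three levels and that higher-variance input features receive more connections.

```python
import numpy as np

def attention_out_degrees(X, d_out_levels=(4, 10, 16)):
    """Assign a per-input-neuron out-degree from the variance ('attention') of
    each input feature, quantized into three levels (low, medium, high)."""
    var = X.var(axis=0)                          # per-feature variance over the training set
    t1, t2 = np.quantile(var, [1 / 3, 2 / 3])    # tertile thresholds (assumed)
    level = np.digitize(var, [t1, t2])           # 0, 1, or 2 for each feature
    return np.asarray(d_out_levels)[level]       # out-degree per input neuron

# usage: X is a (num_samples, num_features) training matrix
X = np.random.default_rng(0).normal(size=(1000, 784)) * np.linspace(0.1, 2.0, 784)
d_out = attention_out_degrees(X)
print(d_out[:10], d_out.mean())                  # higher-variance features get larger out-degree
```

In the full method, the resulting per-neuron out-degrees must still be matched to a consistent set of in-degrees for the next layer when the connections are actually placed; the remaining junctions use uniform in- and out-degrees, as stated above.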
B. Learning Structured Sparsity during Training
While the method in Sec. V-A obtains a non-uniform neuron out-degree for the first layer, it only considers the properties of the dataset and not the learning process. We also compared against the method of Learning Structured Sparsity (LSS), which learns a good sparse connection pattern during training. This method was proposed in [14] and prunes connections during training by including a sparsity-promoting penalty function in the objective function. Example penalty functions include the L1 and L1/L2 norms used in Lasso [50] and group-Lasso [51], respectively. During training, the optimizer minimizes a balancing objective comprising the loss function l(·), the regularizer r(·), and a sparsity-promoting penalty function p(·):

\min_{\{W_i, b_i\}_{i=1}^{L}} \; l\left(\{W_i, b_i\}_{i=1}^{L}\right) + \lambda\, r\left(\{W_i\}_{i=1}^{L}\right) + \sum_{i=1}^{L} \gamma_i\, p(W_i) \qquad (5)

where the penalty coefficients {γ_i}_{i=1}^{L} control the density of each junction. Note that, in contrast to Sec. II-A, the loss here is written as a function of all of the trainable parameters rather than of the output layer activations and ground-truth labels; this emphasizes that the overall objective can promote sparsity by driving some edge weights to zero. Increasing γ_i decreases ρ_i; however, obtaining a specific value of ρ_i requires experimental tuning of γ_i. In the results presented in this section, we used L1 as the element-wise sparsity-promoting penalty function and L2 as the regularizer. Note that, in contrast to the attention-based method and the structured pre-defined sparsity approach, LSS is not a pre-defined sparsity method. Instead, training in LSS begins with a FC network, which means that training complexity is similar to that of a FC NN. At the end of the LSS training process, weights with absolute value below a threshold are set to zero to achieve the target density.
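To make the balancing objective in (5) concrete, the following PyTorch-style sketch shows how a per-junction L1 penalty and an L2 regularizer could be added to the training loss, followed by the post-training thresholding step. It is a minimal sketch rather than the implementation of [14]; the model, the coefficients lam and gammas, and the threshold value are illustrative assumptions.

```python
import torch
import torch.nn as nn

def lss_objective(model, criterion, output, target, lam, gammas):
    """Loss + L2 regularizer + per-junction L1 sparsity penalty, cf. (5)."""
    weights = [m.weight for m in model if isinstance(m, nn.Linear)]
    loss = criterion(output, target)                                   # l(.)
    reg = lam * sum(w.pow(2).sum() for w in weights)                   # lambda * r(.), L2
    penalty = sum(g * w.abs().sum() for g, w in zip(gammas, weights))  # sum_i gamma_i p(W_i), L1
    return loss + reg + penalty

def threshold_weights(model, threshold):
    """After training, zero out weights with small magnitude (LSS pruning step)."""
    with torch.no_grad():
        for m in model:
            if isinstance(m, nn.Linear):
                m.weight.mul_((m.weight.abs() >= threshold).float())

# Example: a 2-junction MLP trained with the LSS-style objective
model = nn.Sequential(nn.Linear(800, 100), nn.ReLU(), nn.Linear(100, 10))
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(32, 800), torch.randint(0, 10, (32,))
obj = lss_objective(model, criterion, model(x), y, lam=1e-4, gammas=[1e-4, 1e-4])
obj.backward()
threshold_weights(model, threshold=1e-3)
```

In practice the threshold would be chosen so that the surviving fraction of weights matches the target density ρ_net.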
C. Performance Comparison
Fig. 12 compares performance versus ρ_net of different sparse NNs on MNIST, Reuters, and TIMIT. The density of each individual junction with the attention-based preprocessed sparse method is set to be identical to the density of the corresponding junction using the clash-free pre-defined sparse method. However, the density of the nets using the LSS method can be tuned only through the penalty coefficients. We tuned these to approximately match the density of the other methods; this is why the ρ_net values of the green curves do not perfectly align with the pre-defined sparsity curves.

The LSS method performs best among all sparse methods, which is to be expected as it is the least constrained and also discovers a good sparse connection pattern during training. However, the performance with clash-free pre-defined sparsity is close to that of the attention-based and LSS methods in terms of test accuracy at ρ_net = 20%. We conclude that even though the clash-free patterns are highly structured and pre-defined, there is no significant performance degradation when compared to advanced methods that produce sparse models by exploiting specific properties of the dataset or by learning sparse patterns during training.

VI. CONCLUSIONS AND FUTURE WORK
In this work we proposed a new technique for complexity reduction of neural networks, pre-defined sparsity, in which a sparse connection pattern is specified prior to training and held fixed during both training and inference. We presented a hardware architecture suited to leverage the benefits of structured pre-defined sparsity, capable of parallel and pipelined processing. The architecture can be used for both training and inference modes, and supports networks of arbitrary density, including conventional fully-connected ones. Flexibility is afforded by the degree of parallelism z_net, which trades hardware complexity for speed. Simple methods for clash-free memory access are presented, and these methods are shown to achieve performance on par with the best known methods for obtaining sparse MLPs.

Using extensive numerical experiments, we identified trends which help in designing pre-defined sparse networks. Firstly, it is better to allocate connections in a structured manner rather than randomly. Secondly, for most datasets with high redundancy, earlier junctions can be made more sparse. Thirdly, it is better to have more neurons in the hidden layers and then sparsify aggressively, keeping the number of edges low to reduce complexity.

As motivated in the Introduction, the rapidly growing complexity associated with modern NNs is a major challenge. Pre-defined sparsity is a simple method to help address this challenge, as is acceleration with custom hardware. Interesting areas for future research include analytical approaches to justify the trends observed in this work and improving our initial hardware implementation in [40]. It is also interesting to consider extending the methods introduced herein to convolutional layers and recurrent architectures. Finally, truly speeding up the training process by orders of magnitude would allow more extensive search over NN architectures and therefore a better understanding of the largely empirical process of NN design.

APPENDIX A
STRUCTURED PRE-DEFINED SPARSITY CONSTRAINTS
In our structured pre-defined sparse network, ρ_i, the density of junction i, cannot be arbitrary, since ρ_i = d^out_i / N_i = d^in_i / N_{i-1}, where d^out_i and d^in_i are natural numbers satisfying the equation N_{i-1} d^out_i = N_i d^in_i. Therefore, the number of possible ρ_i values is the same as the number of (d^out_i, d^in_i) pairs satisfying the structured pre-defined sparsity constraints:

d^{out}_i = \frac{N_i\, d^{in}_i}{N_{i-1}}, \quad d^{in}_i \le N_{i-1}, \quad d^{out}_i, d^{in}_i \in \mathbb{N} \qquad (6)

where \mathbb{N} denotes the set of natural numbers. The smallest value of d^in_i which satisfies d^out_i ∈ \mathbb{N} is N_{i-1} / gcd(N_{i-1}, N_i), and the other admissible values are its integer multiples. Since d^in_i is upper bounded by N_{i-1}, the total number of possible (d^out_i, d^in_i) pairs is gcd(N_{i-1}, N_i). Thus, the set of possible ρ_i is

\left\{ \rho_i \in (0, 1] \;\middle|\; \rho_i = \frac{k}{\gcd(N_{i-1}, N_i)},\; k \in \mathbb{N} \right\}. \qquad (7)

As a concrete example, consider a NN with N_net = (117, 390, 13). The numbers of possible densities of the two junctions are determined by gcd(117, 390) = 39 and gcd(390, 13) = 13. Therefore, the sets of junction densities are

\rho_1 \in \left\{ \tfrac{1}{39}, \tfrac{2}{39}, \cdots, \tfrac{39}{39} \right\}, \quad \rho_2 \in \left\{ \tfrac{1}{13}, \tfrac{2}{13}, \cdots, \tfrac{13}{13} \right\}. \qquad (8)
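As a quick check of (6)-(8), the following Python sketch (an illustrative aid, not part of the described design flow) enumerates the admissible (d^out_i, d^in_i) pairs and densities of a junction, and reproduces the counts of the (117, 390, 13) example.

```python
from math import gcd

def admissible_densities(n_prev, n_curr):
    """All (d_out, d_in, rho) with n_prev * d_out == n_curr * d_in, cf. (6)."""
    step = n_prev // gcd(n_prev, n_curr)        # smallest admissible d_in
    configs = []
    for d_in in range(step, n_prev + 1, step):  # integer multiples, up to n_prev
        d_out = n_curr * d_in // n_prev
        configs.append((d_out, d_in, d_in / n_prev))
    return configs

# Junctions of the N_net = (117, 390, 13) example
j1 = admissible_densities(117, 390)
j2 = admissible_densities(390, 13)
print(len(j1), len(j2))   # 39 and 13, i.e. gcd(117, 390) and gcd(390, 13)
print(j1[0], j2[0])       # sparsest configurations: rho_1 = 1/39, rho_2 = 1/13
```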
APPENDIX B
HARDWARE ARCHITECTURE CONSTRAINTS

The depth of the left memories in our hardware architecture is D_i = N_{i-1}/z_i. Thus N_{i-1} should preferably be an integral multiple of z_i. This is not a burdening constraint since the choice of z_i is independent of network parameters and depends on the capacity of the device. In the unusual case that this constraint cannot be met, the extra cells in the memories can be filled with dummy values such as 0.

There are also two conditions placed on the z values to eliminate stalls in processing: for all layers i ∈ {1, ..., L}, (i) |W_i|/z_i = C, a constant, and (ii) z_{i+1} ≥ ⌈z_i / d^in_i⌉. Using the definitions from Sec. II-A, (i) is equivalent to z_{i+1} = z_i d^out_{i+1} / d^in_i. Then, (ii) can be equivalently written as

d^{out}_{i+1} \ge \frac{d^{in}_i}{z_i} \left\lceil \frac{z_i}{d^{in}_i} \right\rceil \qquad (9)

which needs to be satisfied for all i ∈ {1, ..., L-1}. In practice, it is desirable to design z_i / d^in_i to be an integer so that an integral number of right neurons finish processing every cycle. This simplifies hardware implementation by eliminating the need for additional storage, for example, of the intermediate activation values during FF. In this case, (9) reduces to d^out_{i+1} ≥ 1, which is always true.

For non-integral z_i / d^in_i, there are two cases. If z_i > d^in_i, (9) reduces to d^out_{i+1} ≥ 2. On the other hand, if z_i < d^in_i, there is no bound on the right-hand side of (9). In general, note that (9) becomes a burdening constraint only if d^in_i is large while d^out_{i+1} and z_i are both desired to be small. This corresponds to earlier junctions being denser than later ones, which is typically not desirable according to the observations in Sec. IV-D, or to very limited hardware resources. We thus conclude that (9) is not a limiting constraint in most practical cases.
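As an illustration of these constraints, the following Python sketch (our own helper, not part of the described architecture) checks whether a choice of z values satisfies conditions (i) and (ii) for a given structured pre-defined sparse network; the example network and z values are hypothetical.

```python
import math

def check_no_stall(N, d_out, d_in, z):
    """Check (i) |W_i|/z_i constant and (ii) z_{i+1} >= ceil(z_i / d_in_i).

    N     : neuron counts per layer, length L+1
    d_out : out-degree of each junction, length L
    d_in  : in-degree of each junction, length L
    z     : degree of parallelism of each junction, length L
    """
    L = len(z)
    edges = [N[i] * d_out[i] for i in range(L)]        # |W_i| = N_{i-1} * d_out_i
    cycles = [edges[i] / z[i] for i in range(L)]
    cond_i = len(set(cycles)) == 1                     # same cycle count in every junction
    cond_ii = all(z[i + 1] >= math.ceil(z[i] / d_in[i]) for i in range(L - 1))
    return cond_i and cond_ii

# Hypothetical 2-junction example: N_net = (117, 390, 13) with rho = (1/39, 1/13)
N, d_out, d_in = (117, 390, 13), (10, 1), (3, 30)
z = (39, 13)   # |W_1| = 1170, |W_2| = 390; both junctions take 30 cycles
print(check_no_stall(N, d_out, d_in, z))   # True
```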
APPENDIX C
CLASH-FREE PATTERNS

Specifying N_{i-1}, N_i, d^in_i and z_i for junction i in a clash-free structured pre-defined sparse NN does not uniquely define a connection pattern (unless it is FC). This section discusses the number of possible left memory access patterns S^M_i for such a junction i. Note that the total number of possible memory access patterns for the complete NN is S^M = \prod_{i=1}^{L} S^M_i.

Fig. 13. (a)-(c) Various types of clash-freedom, and (d) memory dithering for type 3, using the same left neuronal structure from Fig. 4 as an example. The grids represent different access patterns for the same memory bank. The number in each cell represents the left neuron number whose parameter is stored in that cell. Cells sharing the same color are read in the same cycle.

When z_i ≥ d^in_i, which is expected to be true for practical cases of implementing sparse NNs on powerful hardware devices, S^M_i is also equal to the number of possible connection patterns S^C_i, which is the key quantity of interest. This is because if z_i ≥ d^in_i, at least one right neuron is completely processed in some cycle. Thus, changing the left memory access pattern will change the left neurons to which that right neuron connects, thereby changing the connection pattern. This one-to-one correspondence results in S^M_i = S^C_i. For the case of z_i < d^in_i, a FC junction provides an example where S^M_i ≠ S^C_i. Specifically, in this case S^C_i = 1 as there is only one way to fully connect all neurons, but there are many clash-free memory access patterns, as shown in the following equations (10)-(12).

We now discuss various types of clash-freedom, and the S^M_i arising from each:

• Type 1: This is as described in Sec. III-C, and recapitulated in Fig. 13(a). S^M_i is the number of ways of designing φ_i, i.e.,

S^M_i = D_i^{z_i} \qquad (10)

• Type 2 (implemented in our earlier work [40]): In this technique, a new φ_i is defined for every sweep; in the example of Fig. 13(b), the φ_i used for sweep 0 differs from the φ_i used for sweep 1. There will be d^out_i different φ_i vectors for each junction, resulting in

S^M_i = D_i^{z_i d^{out}_i} \qquad (11)

• Type 3: In this technique, the constraint of cyclically accessing the left memories is also eliminated. Instead, any cycle can access any cell from each of the memories. This means that storing φ_i is not enough; the entire sequence of memory accesses needs to be stored as a matrix Φ_i ∈ {0, 1, ..., D_i − 1}^{D_i × z_i}, as in the sweep-0 example of Fig. 13(c). Every sweep would also have a different Φ_i, resulting in

S^M_i = (D_i!)^{z_i d^{out}_i} \qquad (12)

A technique that can be applied to all the types of clash-freedom is memory dithering, which is a permutation of the z_i memories (i.e., the columns) in a bank. This permutation can change every sweep, as shown in Fig. 13(d). Memory dithering incurs an additional address computation storage cost because of the z_i permutation, but increases S^M_i by a factor K_i. If d^in_i / z_i is an integer, an integral number of cycles is required to process each right neuron. Since a cycle accesses all memories, dithering has no effect and K_i = 1. On the other hand, if z_i / d^in_i is an integer greater than 1, the effects of dithering on connectivity patterns are only observed when switching from one right neuron to the next within a cycle. This results in

K_i = \left( \frac{z_i!}{\left(d^{in}_i!\right)^{z_i / d^{in}_i}} \right)^{d^{out}_i} \qquad (13)

for types 2 and 3; the d^out_i exponent is omitted for type 1 since the access pattern does not change across sweeps. When neither of z_i and d^in_i perfectly divides the other, an exact value of K_i is hard to arrive at since some proper or improper fraction of right neurons is processed every cycle. In such cases, K_i is upper-bounded by (z_i!)^{d^out_i}.
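To make (10)-(13) concrete, the following Python sketch (an illustrative aid under our assumptions, not part of the paper's hardware flow) computes S^M_i for each clash-freedom type, with and without dithering, for the junction parameters of Table III below; it assumes z_i / d^in_i is an integer greater than 1, as in that example.

```python
from math import factorial

def access_pattern_counts(n_prev, d_out, d_in, z):
    """S^M_i for clash-freedom types 1-3, cf. (10)-(12), with dithering gain (13)."""
    D = n_prev // z                            # depth of each left memory
    base = {
        1: D ** z,                             # (10): one phi_i per junction
        2: D ** (z * d_out),                   # (11): a new phi_i every sweep
        3: factorial(D) ** (z * d_out),        # (12): arbitrary access order per cycle
    }
    k = factorial(z) // factorial(d_in) ** (z // d_in)   # per-sweep dithering gain, cf. (13)
    dithered = {t: base[t] * (k if t == 1 else k ** d_out) for t in base}
    return base, dithered

# Junction of Table III: (N_{i-1}, N_i, d^out_i, d^in_i, z_i) = (12, 12, 2, 2, 4)
base, dithered = access_pattern_counts(12, d_out=2, d_in=2, z=4)
print(base)       # {1: 81, 2: 6561, 3: 1679616}
print(dithered)   # {1: 486, 2: 236196, 3: 60466176}
```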
Table III compares the count of possible left memory access patterns and the associated storage cost for computing memory addresses for types 1-3, with and without memory dithering. The junction used is the same as in Fig. 4, except that N_i is raised to 12 so that d^in_i becomes 2, which allows us to better show the effects of memory dithering.

TABLE III
COMPARISON OF CLASH-FREE METHODS FOR A SINGLE JUNCTION i WITH (N_{i-1}, N_i, d^out_i, d^in_i, z_i) = (12, 12, 2, 2, 4)

Type | Memory Dithering | S^M_i  | Storage Cost to Compute Memory Addresses
  1  | No               | 81     | z_i = 4
  1  | Yes              | 486    | 2 z_i = 8
  2  | No               | 6561   | z_i d^out_i = 8
  2  | Yes              | 236k   | 2 z_i d^out_i = 16
  3  | No               | 1.68M  | N_{i-1} d^out_i = 24
  3  | Yes              | 60M    | (N_{i-1} + z_i) d^out_i = 32

REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Advances in Neural Information Processing Systems 25 (NIPS), 2012, pp. 1097–1105.
[2] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro, "Deep learning with COTS HPC systems," in Proc. 30th Int. Conf. Machine Learning (ICML), vol. 28, 2013, pp. III-1337–III-1345.
[3] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural networks," in Proc. Advances in Neural Information Processing Systems 28 (NIPS), 2015, pp. 1135–1143.
[4] N. P. Jouppi, C. Young, N. Patil et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Annu. Int. Symp. Computer Architecture (ISCA), June 2017.
[5] C. Szegedy, W. Liu, Y. Jia et al., "Going deeper with convolutions," in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, "Compressing deep convolutional networks using vector quantization," arXiv preprint arXiv:1412.6115, 2014.
[7] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. 32nd Int. Conf. Machine Learning (ICML), 2015.
[8] S. Han, H. Mao, and W. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," in Proc. Int. Conf. Learning Representations (ICLR), 2016.
[9] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in Proc. 43rd Int. Symp. Computer Architecture (ISCA), 2016.
[10] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[11] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," in Proc. 2016 ACM/IEEE 43rd Annu. Int. Symp. Computer Architecture (ISCA), 2016, pp. 1–13.
[12] B. Reagen, P. Whatmough, R. Adolf et al., "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in Proc. 2016 ACM/IEEE 43rd Annu. Int. Symp. Computer Architecture (ISCA), 2016, pp. 267–278.
[13] A. Aghasi, A. Abdi, N. Nguyen, and J. Romberg, "Net-trim: Convex pruning of deep neural networks with performance guarantee," in Proc. Advances in Neural Information Processing Systems 30 (NIPS), 2017.
[14] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Proc. Advances in Neural Information Processing Systems 29 (NIPS), 2016, pp. 2074–2082.
[15] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning," in Proc. Advances in Neural Information Processing Systems 28 (NIPS), 2015, pp. 3088–3096.
[16] S. Wang, Z. Li, C. Ding, B. Yuan, Y. Wang, Q. Qiu, and Y. Liang, "C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2018.
[17] A. Bourely, J. P. Boueri, and K. Choromonski, "Sparse neural network topologies," arXiv preprint arXiv:1706.05683, 2017.
[18] A. Prabhu, G. Varma, and A. M. Namboodiri, "Deep expander networks: Efficient deep networks from graph theory," arXiv preprint arXiv:1711.08757, 2017.
[19] D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and A. Liotta, "Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science," Nature Communications, vol. 9, 2018.
[20] S. Dey, Y. Shao, K. M. Chugg, and P. A. Beerel, "Accelerating training of deep neural networks via sparse edge processing," in Proc. 26th Int. Conf. Artificial Neural Networks (ICANN). Springer, Sep 2017, pp. 273–280.
[21] S. Dey, P. A. Beerel, and K. M. Chugg, "Interleaver design for deep neural networks," in Proc. 51st Asilomar Conf. Signals, Systems, and Computers, Oct 2017, pp. 1979–1983.
[22] S. Dey, K. Huang, P. A. Beerel, and K. M. Chugg, "Characterizing sparse connectivity patterns in neural networks," in Proc. 2018 Information Theory and Applications Workshop (ITA), Feb 2018, pp. 1–9.
[23] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[24] Y. Ma, N. Suda, Y. Cao, S. Vrudhula, and J. Seo, "ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler," Integration, the VLSI Journal, 2018.
[25] S. Zhang, Z. Du, L. Zhang et al., "Cambricon-X: An accelerator for sparse neural networks," in Proc. 2016 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2016, pp. 1–12.
[26] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proc. 2016 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays. ACM, 2016, pp. 16–25.
[27] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in Proc. 19th Int. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2014, pp. 269–284.
[28] G. Masera, G. Piccinini, M. R. Roch, and M. Zamboni, "VLSI architectures for turbo codes," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 7, no. 3, pp. 369–379, 1999.
[29] T. Brack, M. Alles, T. Lehnigk-Emden et al., "Low complexity LDPC code decoders for next generation standards," in Design, Automation & Test in Europe Conf. & Exhibition (DATE). IEEE, 2007, pp. 1–6.
[30] S. Crozier and P. Guinand, "High-performance low-memory interleaver banks for turbo-codes," in Vehicular Technology Conf., 2001. VTC 2001 Fall. IEEE VTS 54th, vol. 4. IEEE, 2001, pp. 2394–2398.
[31] F. Sun, C. Wang, L. Gong, C. Xu, Y. Zhang, Y. Lu, X. Li, and X. Zhou, "A high-performance accelerator for large-scale convolutional neural networks," Dec 2017, pp. 622–629.
[32] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, "DLAU: A scalable deep learning accelerator unit on FPGA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 3, pp. 513–517, March 2017.
[33] Y. LeCun, C. Cortes, and C. J. Burges, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/.
[34] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," Journal of Machine Learning Research, vol. 5, pp. 361–397, Apr 2004.
[35] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, "TIMIT acoustic-phonetic continuous speech corpus," https://catalog.ldc.upenn.edu/LDC93S1.
[36] A. Krizhevsky, "Learning multiple layers of features from tiny images," Master's thesis, University of Toronto, 2009.
[37] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[38] Proc. 29th Int. Conf. Machine Learning (ICML), 2012.
[39] N. H. E. Weste and D. M. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed. Pearson, 2010.
[40] S. Dey, D. Chen, Z. Li, S. Kundu, K. Huang, K. M. Chugg, and P. A. Beerel, "A highly parallel FPGA implementation of sparse neural network training," in Proc. 2018 Int. Conf. Reconfigurable Computing and FPGAs (ReConFig), Dec 2018, expanded pre-print version available at https://arxiv.org/abs/1806.01087.
[41] P. Goyal, P. Dollár, R. B. Girshick et al., "Accurate, large minibatch SGD: training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[42] D. Masters and C. Luschi, "Revisiting small batch training for deep neural networks," arXiv preprint arXiv:1804.07612, 2018.
[43] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
[44] K.-F. Lee and H.-W. Hon, "Speaker-independent phone recognition using hidden Markov models," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641–1648, Nov 1989.
[45] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2015, pp. 1026–1034.
[46] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learning Representations (ICLR), 2014.
[47] S. Dey, K. M. Chugg, and P. A. Beerel, "Morse code datasets for machine learning," in Proc. 9th Int. Conf. Computing, Communication and Networking Technologies (ICCCNT), Jul 2018, pp. 1–7.
[48] J. Ba, V. Mnih, and K. Kavukcuoglu, "Multiple object recognition with visual attention," arXiv preprint arXiv:1412.7755, 2014.
[49] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. 32nd Int. Conf. Machine Learning (ICML), 2015, pp. 2048–2057.
[50] M. R. Osborne, B. Presnell, and B. Turlach, "A new approach to variable selection in least squares problems," IMA Journal of Numerical Analysis, 2000.
[51] R. Jenatton, J. Audibert, and F. R. Bach, "Structured variable selection with sparsity-inducing norms,"