Ramanujan Bipartite Graph Products for Efficient Block Sparse Neural Networks
Dharma Teja Vooturi, Girish Varma, Kishore Kothapalli
Center for Security, Theory and Algorithmic Research, International Institute of Information Technology Hyderabad, India. [email protected]
Abstract
Sparse neural networks are shown to give accurate predictions competitive to their denser versions, while also minimizing the number of arithmetic operations performed. However, current hardware like GPUs can only exploit structured sparsity patterns for better efficiency. Hence the run time of a sparse neural network may not correspond to the arithmetic operations required. In this work, we propose the RBGP (Ramanujan Bipartite Graph Product) framework for generating structured multi-level block sparse neural networks by using the theory of graph products. We also propose to use products of Ramanujan graphs, which give the best connectivity for a given level of sparsity. This essentially ensures that i) the network has the structured block sparsity for which runtime-efficient algorithms exist, ii) the model gives high prediction accuracy, due to the better expressive power derived from the connectivity of the graph, and iii) the graph data structure has a succinct representation that can be stored efficiently in memory. We use our framework to design a specific connectivity pattern called RBGP4, which makes efficient use of the memory hierarchy available on GPU. We benchmark our approach by experimenting on the image classification task over the CIFAR dataset using VGG19 and WideResnet-40-4 networks and achieve 5-9x and 2-5x runtime gains over unstructured and block sparsity patterns respectively, while achieving the same level of accuracy.
Sparsity is an essential tool for generating compute- and memory-efficient neural networks. Despite this, the predominant choice of deep neural networks in production is dense instead of sparse. This is mainly because sparse neural networks tend to have poor runtime performance on widely used dense AI hardware like GPUs and TPUs, which are primarily designed for accelerating dense neural networks. So in order to truly uncover the potential of sparsity in production, it is necessary to generate sparse neural networks that are in harmony with dense AI hardware.

Pruning [16, 11, 10, 9] is one of the widely used approaches for generating sparse neural networks. In element pruning, individual parameters/elements are removed from a pre-trained dense neural network based on some criterion such as magnitude, and the resultant sparse network is then finetuned to recover accuracy. A significant number of parameters can be removed by using element pruning with minimal loss in model accuracy. But the main issue with element pruning is that the generated sparse neural networks have irregular compute and memory access patterns due to the unstructured sparsity pattern, and thus cannot be efficiently mapped onto dense AI hardware. Structured pruning methods [18, 26, 12, 22, 23, 36, 4, 33] have been proposed to improve the runtime performance of sparse neural networks. Unlike element pruning, where parameters are removed at an individual level, in structured pruning parameters are first divided into structural units like filter, channel, block, multi-block, etc., and
then removed at a unit level based on the strength of the unit. Structured sparse neural networks have better run-time performance than unstructured sparse neural networks. But this improvement in run-time performance comes at the cost of accuracy, due to the structural constraints imposed while removing parameters from a trained model. For example, Mao et al. [23] have shown that for a given amount of pruning, model accuracy decreases and run-time performance increases with increasing coarseness of the structural unit from 0D to 3D when pruning 4D weight tensors in convolutional neural networks. This trade-off between run-time and accuracy limits the possibility of generating efficient structured sparse neural networks using structured pruning methods. Structured sparse neural networks can also be generated using structure aware training (STAT) methods [35, 29, 19, 14, 34, 15], where structure is part of the training process. Because the structure is coupled with the training process, STAT methods are better placed than structured pruning methods for generating efficient structured sparse neural networks.

The runtime of a sparse neural network on a given hardware depends on the efficiency with which the SDMM (multiplication of a sparse matrix with a dense matrix) operation can be implemented. On hardware like a GPU with a memory hierarchy (registers > shared memory > L2 cache > DRAM), the SDMM operation will have good runtime efficiency if and only if it maximizes data accesses from faster memory through data reuse. For a structured sparse neural network, the amount of reuse depends on the choice of the structured sparsity pattern. Additionally, the chosen pattern should be well connected to allow for good flow of information in the neural network. In this work, we address these requirements and generate structured sparse networks that are both performant and connected. Following are our main contributions:

• We propose the RBGP (Ramanujan Bipartite Graph Product) framework for generating structured sparse neural networks that have multiple levels of block sparsity, good connectivity, and take less memory for storage.

• Using the RBGP framework, we propose the RBGP4 structured sparsity pattern for the GPU, a representative dense hardware, and achieve good runtime efficiency for the SDMM (multiplication of a sparse matrix with a dense matrix) operation on GPU.

• We demonstrate the utility of the RBGP4 sparsity pattern on the image classification task over the CIFAR dataset and achieve 5-9x and 2-5x runtime gains over unstructured and block sparsity patterns respectively, while achieving the same level of accuracy.

Figure 1: Tiled matrix multiplication of an RBGP4 sparse matrix W_s with a dense matrix I (O = W_s × I) on GPU. A tile OT in O is mapped to a thread block TB, and each thread in TB is mapped to a 2D strided grid of element blocks in OT, where the number of strides and the size of the element block in the row dimension are set to |G_r.U| and |G_b.U| respectively. OT is computed in steps; in each step, tiles WT_s and IT are first loaded into shared memory from DRAM, and a thread in TB loads the corresponding elements from shared memory into registers before performing the computation.
Post training:
Generating a sparse neural network from a trained dense model dates back to the decades-old work of LeCun et al. [16] and Hassibi & Stork [11], where second-derivative information is used to prune weights from a dense model. The idea of pruning was revived by Han et al. [10, 9] by simply pruning weights based on their magnitude. To improve runtime performance on dense AI hardware, structured pruning methods [18, 26, 12, 22, 23, 36, 4, 33] have been proposed with various structured sparsity patterns like filter, channel, block and multi-block. During training:
Sparse neural networks are generated during the training process either by gradually removing connections or by rearranging the existing set of connections [32, 28, 2, 25, 27, 17, 6]. Similarly, structured sparse networks are generated by removing elements at a structural unit level during training. Wen et al. [35] used group Lasso regularization to induce channel and filter sparsity in CNNs. Narang et al. [29] used gradual pruning along with group Lasso regularization to induce a block sparsity pattern in RNNs. In [19, 14, 34], structure is induced by assigning a learnable parameter to each structural unit and removing units gradually through regularization and pruning.
Before training (predefined):
Sparsity can be incorporated a priori to the training process by choosing a mask (choice of connections) in each layer of the sparse neural network and keeping it fixed throughout training. Prior works in the predefined approach differ in the way the mask is chosen. Prabhu et al. [30] make use of expander graphs and generate a random mask with a row uniformity pattern, where all the rows in the mask have an equal number of non-zeros. Dey et al. [7] generate a random mask with both row and column uniformity. Frankle et al. [8] use an unstructured mask generated by pruning a trained dense model. Kepner et al. [15] use the idea of radix topology to generate a mask with a cyclical diagonal pattern. A blocking pattern is the key requirement for achieving runtime performance on dense AI hardware, and none of the above works incorporate a block sparsity pattern. In this work, we impose a block sparsity pattern at multiple levels using the RBGP framework, and achieve good runtime performance on GPU, a representative dense AI hardware.
In this section, we set up various definitions and notations used throughout the paper. First we define various types of block sparsity patterns.
Block Sparse (BS) matrix:
A BS matrix W_bs is a sparse matrix where non-zero elements are structured in the form of blocks of size (bh, bw). Matrix W_bs has (W_bs.rows/bh × W_bs.columns/bw) blocks, and a block in W_bs is either a zero block with all zeros or a non-zero block with some or all elements as non-zeros. Uniform Block Sparse (UBS) matrix:
A UBS matrix W_ubs is a block sparse matrix with block size (bh, bw), where all the row/column blocks of size (bh, W_ubs.columns) / (W_ubs.rows, bw) have an equal number of non-zero blocks of size (bh, bw). Cloned Block Sparse (CBS) matrix:
A CBS matrix is a block sparse matrix with block size ( bh, bw ) , where all the non zero blocks of size ( bh, bw ) have the same non-zero pattern. Cloned Uniform Block Sparse (CUBS) matrix:
A CUBS matrix is a block sparse matrix with block size (bh, bw) that is both a UBS and a CBS matrix with block size (bh, bw). Recursive CUBS (RCUBS) matrix:
An RCUBS matrix W_s is a sparse matrix with K levels of blocking B_1, ..., B_K and the following recursion: W_s is a CUBS matrix with block size B_1, and a non-zero block of size B_i in W_s is again a CUBS matrix with block size B_{i+1}. Figure 3 shows an example of an RCUBS matrix with three levels of blocking.

We consider the bipartite graph G = (U, V, E) representation of matrices (with dimension |U| × |V|). In a biregular bipartite graph, all the vertices in U and V have the same degree d_l and d_r respectively. The degree also characterizes the sparsity of such graphs. The eigenvalues of a graph G are the eigenvalues of its adjacency matrix, and they characterize many graph properties including connectivity [5]. A bipartite graph with N vertices has eigenvalues ±λ_1, ..., ±λ_{N/2}, where λ_1 ≥ λ_2 ≥ ... ≥ λ_{N/2}. The spectral gap between λ_1 and λ_2 is a measure of the connectivity properties of the graph [1]. Ramanujan graphs are the graphs with the optimal connectivity (as measured by the spectral gap) for a given level of sparsity [21].
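To make the block sparsity definitions concrete, the following sketch checks whether a binary mask is UBS and CBS for a given block size. It is illustrative only; the representation and the helper names (is_ubs, is_cbs) are our own and not part of the paper.

import numpy as np

def blocks(mask, bh, bw):
    # Yield the (bh, bw) blocks of a 2D 0/1 mask, with their block indices.
    R, C = mask.shape
    for i in range(0, R, bh):
        for j in range(0, C, bw):
            yield i // bh, j // bw, mask[i:i + bh, j:j + bw]

def is_ubs(mask, bh, bw):
    # UBS: every row of blocks and every column of blocks has the same number of non-zero blocks.
    R, C = mask.shape
    nz = np.zeros((R // bh, C // bw), dtype=int)
    for bi, bj, blk in blocks(mask, bh, bw):
        nz[bi, bj] = int(blk.any())
    return len(set(nz.sum(axis=1))) == 1 and len(set(nz.sum(axis=0))) == 1

def is_cbs(mask, bh, bw):
    # CBS: all non-zero blocks share the same non-zero pattern.
    patterns = {tuple(blk.flatten()) for _, _, blk in blocks(mask, bh, bw) if blk.any()}
    return len(patterns) <= 1

# A CUBS example: one non-zero block per block row/column, all with the same pattern.
blk = np.array([[1, 0], [0, 1]])
mask = np.kron(np.eye(2, dtype=int), blk)      # 4x4 mask
print(is_ubs(mask, 2, 2), is_cbs(mask, 2, 2))  # True True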
Ramanujan bipartite graph:
A Ramanujan bipartite graph is a (d_l, d_r)-biregular bipartite graph where the second largest eigenvalue λ_2 is less than or equal to √(d_l − 1) + √(d_r − 1). Bipartite Graph Product (⊗_b): The bipartite graph product (G_p = G_1 ⊗_b G_2) takes two bipartite graphs, G_1(U_1, V_1, E_1) and G_2(U_2, V_2, E_2), as input and produces a bigger bipartite graph G_p(U_p, V_p, E_p), where U_p = U_1 × U_2, V_p = V_1 × V_2, and E_p is constructed using the cross product of edges from G_1 and G_2, i.e., E_p = {((u_1, u_2), (v_1, v_2)) | (u_1, v_1) ∈ E_1 and (u_2, v_2) ∈ E_2}. The bipartite graph product can also be viewed from a matrix viewpoint in the following way: a bipartite graph G(U, V, E) can be represented as a biadjacency matrix BA of size (|U|, |V|), with BA_uv = 1 if (u, v) ∈ E, and zero otherwise. For the bipartite graph product (G_p = G_1 ⊗_b G_2), the biadjacency matrix of G_p is equal to the tensor product (⊗) of the biadjacency matrices of the input bipartite graphs G_1 and G_2, i.e., BA_p = BA_1 ⊗ BA_2. Figure 2 shows an example of the bipartite graph product from the viewpoint of both the graph and the matrix.

Figure 2: Bipartite graph product operation (⊗_b) along with the matrix view. The biadjacency matrix of the product graph has a CBS (Cloned Block Sparse) pattern with block size (2, 2).

The connectivity between neurons in a layer L of a sparse neural network can be captured using a bipartite graph G, where left/right neurons in L correspond to left/right vertices in G, and the connections between left and right neurons in L correspond to undirected edges between left and right vertices in G. The core idea in the RBGP (Ramanujan Bipartite Graph Product) framework is to express G as a bipartite graph product of Ramanujan bipartite graphs, i.e., G = G_1 ⊗_b ... ⊗_b G_K, where K is the number of base graphs. In the rest of the section, we show how expressing the connectivity of a layer using bipartite graph products leads to sparse neural networks that have structured sparsity, good connectivity, and memory efficiency.
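As a small illustration of the matrix view, the sketch below builds the product of two tiny bipartite graphs both ways: by taking the cross product of edge sets and by taking the Kronecker (tensor) product of the biadjacency matrices with numpy.kron, and checks that the two agree. The two graphs used are arbitrary examples, not those of Figure 2.

import numpy as np
from itertools import product

# Biadjacency matrices of two small bipartite graphs G1, G2.
BA1 = np.array([[1, 0],
                [1, 1]])
BA2 = np.array([[1, 1, 0],
                [0, 1, 1]])

# Edge-set view: ((u1,u2),(v1,v2)) is an edge iff (u1,v1) in E1 and (u2,v2) in E2.
E1 = {(u, v) for u, v in zip(*np.nonzero(BA1))}
E2 = {(u, v) for u, v in zip(*np.nonzero(BA2))}
Ep = {((u1, u2), (v1, v2)) for (u1, v1), (u2, v2) in product(E1, E2)}

# Matrix view: biadjacency of the product is the Kronecker product BA1 (x) BA2.
BAp = np.kron(BA1, BA2)

# Check both views agree: vertex (u1,u2) maps to row u1*|U2| + u2, similarly for columns.
BAp_from_edges = np.zeros_like(BAp)
for (u1, u2), (v1, v2) in Ep:
    BAp_from_edges[u1 * BA2.shape[0] + u2, v1 * BA2.shape[1] + v2] = 1
assert np.array_equal(BAp, BAp_from_edges)
print(BAp)  # non-zero blocks are clones of BA2 -> CBS pattern with block size BA2.shape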
Structured sparsity. In the bipartite graph product (G_p = G_1 ⊗_b G_2), the biadjacency matrix of G_p is equal to the tensor product (⊗) of the biadjacency matrices of G_1 and G_2, i.e., BA_p = BA_1 ⊗ BA_2. In the tensor product, BA_p is constructed by replacing each non-zero element in BA_1 with the matrix BA_2, and each zero element in BA_1 with a zero matrix of the size of BA_2. As BA_2 is repeated, BA_p will have a CBS (Cloned Block Sparse) sparsity pattern with block size equal to the size of BA_2, i.e., (|G_2.U|, |G_2.V|). Figure 2 shows an example of the bipartite graph product, where the biadjacency matrix of the product graph has a CBS pattern with block size (2, 2). Additionally, when G_2 is a biregular bipartite graph, BA_p will have a CUBS (Cloned Uniform Block Sparse) sparsity pattern, as BA_2 has an equal number of non-zero elements in all rows and in all columns. In the RBGP framework, the bipartite graph G of a layer L in the neural network is constructed by performing a series of (K − 1) bipartite graph products on K base biregular bipartite graphs (G = G_1 ⊗_b ··· ⊗_b G_K) that are Ramanujan. The bipartite graph G can be rewritten as G = G_1 ⊗_b CG_2, where CG_2 = (G_2 ⊗_b ··· ⊗_b G_K). As G_1 is a biregular bipartite graph, BA (the biadjacency matrix of G) will have a CUBS sparsity pattern with block size (Π_{i=2}^{K} |G_i.U|, Π_{i=2}^{K} |G_i.V|). Going deeper, as CG_i = (G_i ⊗_b CG_{i+1}), and as all the base graphs are biregular, BA will have an RCUBS (Recursive Cloned Uniform Block Sparse) sparsity pattern with (K − 1) blocking levels B_1 ··· B_{K−1}, where B_j = (Π_{i=j+1}^{K} |G_i.U|, Π_{i=j+1}^{K} |G_i.V|). Figure 3 shows an example bipartite graph generated using the RBGP framework that uses four base graphs and has three block sizes (16, 8), (8, 4), and (2, 2).

Figure 3: Biadjacency matrix BA of a bipartite graph generated using the RBGP framework. BA has an RCUBS (Recursive Cloned Uniform Block Sparse) sparsity pattern with three blocking levels (16, 8), (8, 4), and (2, 2).
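A minimal sketch of the multi-level structure: chaining Kronecker products of K base biadjacency matrices and reporting the blocking levels B_j. The base graph sizes below are arbitrary placeholders chosen only so that the blocking levels come out as in the example above; they are not the exact graphs of Figure 3.

import numpy as np
from functools import reduce

def rbgp_biadjacency(base_BAs):
    # Biadjacency matrix of G = G_1 (x)_b ... (x)_b G_K as a chained Kronecker product.
    return reduce(np.kron, base_BAs)

def blocking_levels(base_BAs):
    # B_j = (prod_{i>j} |G_i.U|, prod_{i>j} |G_i.V|) for j = 1..K-1.
    shapes = [ba.shape for ba in base_BAs]
    levels = []
    for j in range(1, len(shapes)):
        rows = int(np.prod([s[0] for s in shapes[j:]]))
        cols = int(np.prod([s[1] for s in shapes[j:]]))
        levels.append((rows, cols))
    return levels

# Example: four biregular base graphs (illustrative sizes).
G1 = np.array([[1, 0], [0, 1]])                                  # 2x2, 1-regular
G2 = np.ones((2, 2), dtype=int)                                  # complete 2x2
G3 = np.kron(np.eye(2, dtype=int), np.ones((2, 1), dtype=int))   # 4x2 biregular
G4 = np.ones((2, 2), dtype=int)                                  # complete 2x2

BA = rbgp_biadjacency([G1, G2, G3, G4])
print(BA.shape)                            # (32, 16)
print(blocking_levels([G1, G2, G3, G4]))   # [(16, 8), (8, 4), (2, 2)]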
Memory efficiency. A sparse neural network can be efficiently stored by only storing the information related to the connections that are present in the sparse layers. For a sparse layer L and its associated bipartite graph G, |E(G)| memory is required for storing the parameters corresponding to the connections, and another |E(G)| memory is required for storing the connectivity information in the form of an adjacency list of G. Thus a total of 2 × |E(G)| memory is required for storing the information of a layer in a sparse neural network. In an RBGP sparse neural network, however, the memory requirement can be reduced by reducing the memory needed for the connectivity information. Since G is constructed using K base bipartite graphs (G = G_1 ⊗_b ... ⊗_b G_K), the connectivity information of G can be reduced from |E(G)| (= Π_{i=1}^{K} |E(G_i)|) to Σ_{i=1}^{K} |E(G_i)| by only storing the connectivity information of the individual base graphs. For example, the bipartite graph G generated using the RBGP framework in Figure 3 has 512 edges (the product of the edge counts of the base graphs), but only 22 edges (their sum) from the base graphs need to be stored to reconstruct the connectivity information of G, thus leading to a 23x reduction in the memory required for storing the connectivity information when compared to a random bipartite graph with the same number of edges as G.
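The storage saving is just a product-versus-sum count over the base graph edge sets; the snippet below computes it for a hypothetical list of base graph edge counts (chosen only so that they multiply to 512 and sum to 22, as in the example above).

from math import prod

def connectivity_storage(base_edge_counts):
    # Edges of the product graph vs. edges that must be stored for the base graphs.
    product_edges = prod(base_edge_counts)   # |E(G)| = prod_i |E(G_i)|
    stored_edges = sum(base_edge_counts)     # adjacency lists of the base graphs only
    return product_edges, stored_edges

edges, stored = connectivity_storage([8, 8, 4, 2])   # illustrative edge counts
print(edges, stored, round(edges / stored, 1))       # 512 22 23.3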
Good connectivity. Connectivity in a sparse neural network is key for ensuring good flow of information. It is well known [1] that the connectivity of a graph is characterized by the spectral gap between the largest and second largest eigenvalue (in absolute terms) of the adjacency matrix. In this section, we show that the spectral gap of the block sparse graphs we construct using graph products is asymptotically optimal for any fixed level of sparsity, for large graphs.

For a d-regular bipartite graph, the largest eigenvalues in absolute value are d and −d. The next largest eigenvalue is considered the second largest eigenvalue λ_2. The spectral gap is d − λ_2, and the larger this quantity, the better connected the graph. Suppose the bipartite graph has n vertices on both sides and degree d = αn, where α is the fraction of possible connections that are present. For a given value of d, the best possible spectral gap of d − 2√(d − 1) is achieved by Ramanujan graphs. We construct block sparse graphs using graph products of smaller Ramanujan graphs and show below that this construction has a similar spectral gap as n → ∞. For simplicity we consider the case where the bipartite graph G is the graph product of G_1, G_2, which are bipartite graphs with n vertices on each side and degree d = αn. Note that G has degree d² and density α² (i.e., sparsity 1 − α²). Theorem 1.
Let G = G_1 ⊗_b G_2, where the G_i are bipartite graphs with n vertices on each side and degree d = αn. Then for any fixed level of sparsity (fixed α),

    IdealSpectralGap_{d²} / SpectralGap(G) → 1 as n → ∞    (1)

where IdealSpectralGap_{d²} = d² − 2√(d² − 1) is the best possible spectral gap for d²-regular graphs and SpectralGap(G) is the spectral gap of the block sparse graph G that we construct.

Proof. The biadjacency matrix of G is the tensor product of the biadjacency matrices of G_1, G_2. Hence the eigenvalues of the biadjacency matrix of G are the pairwise products of the eigenvalues of the biadjacency matrices of G_1, G_2. Since G_1, G_2 are Ramanujan graphs, their second largest eigenvalue is at most 2√(d − 1). Hence the second largest eigenvalue of G is at most λ_2(G) = 2d√(d − 1). The ideal value of the second largest eigenvalue for graphs of degree d² is 2√(d² − 1). Hence Equation 1 becomes

    (d² − 2√(d² − 1)) / (d² − 2d√(d − 1)) = (1 − 2√(1/d² − 1/d⁴)) / (1 − 2√(1/d − 1/d²)).

Hence for any fixed level of sparsity, n → ∞ (large matrices) implies d → ∞, and the LHS of Equation 1 tends to 1.
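As a quick numerical sanity check of the proof idea (the eigenvalues of a tensor product are products of eigenvalues), the sketch below takes the biadjacency matrix of a d-regular bipartite graph, forms the product graph via numpy.kron implicitly through its singular values, and compares the resulting spectral gap with the ideal value d² − 2√(d² − 1). The example graph is a random d-regular circulant, which need not be Ramanujan, so the printed ratio is only indicative.

import numpy as np

def circulant_biregular(n, d, seed=0):
    # Biadjacency matrix of a random d-regular bipartite graph on n+n vertices:
    # each row is a cyclic shift of a fixed 0/1 pattern with d ones.
    rng = np.random.default_rng(seed)
    pattern = np.zeros(n, dtype=float)
    pattern[rng.choice(n, size=d, replace=False)] = 1.0
    return np.stack([np.roll(pattern, i) for i in range(n)])

n, alpha = 64, 0.25
d = int(alpha * n)
BA = circulant_biregular(n, d)
s = np.linalg.svd(BA, compute_uv=False)   # |eigenvalues| of the bipartite graph
assert np.isclose(s[0], d)

# Product graph G = G1 (x)_b G1: singular values are pairwise products,
# so the largest is d^2 and the second largest is d * s[1].
lam2_product = d * s[1]
spectral_gap = d**2 - lam2_product
ideal_gap = d**2 - 2 * np.sqrt(d**2 - 1)
print(ideal_gap / spectral_gap)   # approaches 1 for large n when the base graph is near-Ramanujan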
RBGP framework for GPU

A GPU is fundamentally a many-core architecture with thousands of cores, and has multiple memory subsystems (DRAM, L2 cache, L1 cache/shared memory, and registers) with data access times decreasing in that order. The reason for having many memory subsystems is to feed data into the cores at a higher rate by avoiding accesses to slower memory, say DRAM, when the data is already available in faster memory, say L2 cache. On a GPU, a computational task can have good runtime efficiency if it can avoid idling of cores by maximizing memory accesses from faster memories through data reuse. Sparse neural networks with an unstructured sparsity pattern offer limited data reuse due to irregular memory access patterns, and thus have poor runtime performance on GPU. The only way for sparse neural networks to achieve good runtime performance on GPU is by embracing structured sparsity patterns. In this section, using our proposed RBGP framework, we design the RBGP4 structured sparsity pattern to effectively use the memory subsystems on GPU by facilitating data reuse, and achieve good runtime performance for RBGP4 sparse neural networks.
RBGP4 sparsity pattern.
In the RBGP framework, the bipartite graph G (G = G_1 ⊗_b ... ⊗_b G_K) corresponding to a layer in the sparse neural network is configured by the number of base graphs (K) and, for each base graph G_i, its type (sparse or complete). The RBGP4 sparsity pattern corresponds to a specific configuration where G is constructed using four base Ramanujan bipartite graphs (G = G_o ⊗_b G_r ⊗_b G_i ⊗_b G_b), with graphs G_o and G_i being sparse, and G_r and G_b being complete bipartite graphs. Figure 1 shows an example of the RBGP4 sparsity pattern, where G_o and G_i are 50% sparse, and G_r and G_b are (2,1) and (2,2) complete bipartite graphs respectively.
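A minimal sketch of how an RBGP4 mask could be assembled from the four base graphs: the complete graphs G_r and G_b are all-ones matrices, and the mask is the chained Kronecker product. The specific sizes and the random sparse base graphs below are placeholders (not checked for the Ramanujan property) and are not the configuration used in the paper's figures.

import numpy as np

def random_biregular(n_left, n_right, d_left, seed=0):
    # Random biregular bipartite graph: each left vertex picks d_left right vertices,
    # assigned round-robin over a random permutation so right degrees stay balanced.
    rng = np.random.default_rng(seed)
    ba = np.zeros((n_left, n_right), dtype=int)
    perm = rng.permutation(n_right)
    k = 0
    for u in range(n_left):
        for _ in range(d_left):
            ba[u, perm[k % n_right]] = 1
            k += 1
    return ba

# Four base graphs: G_o and G_i sparse (50%), G_r and G_b complete.
G_o = random_biregular(4, 4, 2, seed=1)     # 50% sparse
G_r = np.ones((2, 1), dtype=int)            # complete (2, 1)
G_i = random_biregular(4, 4, 2, seed=2)     # 50% sparse
G_b = np.ones((2, 2), dtype=int)            # complete (2, 2)

mask = np.kron(np.kron(np.kron(G_o, G_r), G_i), G_b)
print(mask.shape, mask.mean())              # (64, 32), density 0.25 -> 75% sparsity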
GPU Implementation. The compute in each layer of an RBGP4 sparse neural network is composed of the RBGP4MM operation (multiplication of a sparse matrix W_s with RBGP4 sparsity pattern and a dense matrix I), O = W_s × I, where W_s, I, and O correspond to the sparse weight matrix, batched input activations, and batched output activations respectively. We use a tiling approach for efficiently processing the RBGP4MM operation. In the tiling approach, matrices are divided into tiles, and OT (a tile in O) is computed in steps, where each step comprises the matrix multiplication of WT_s (a sparse tile in W_s) with IT (a dense tile in I), i.e., OT += WT_s × IT. For RBGP4MM, the tile size in W_s is set to (|G_t.U|, |G_t.V|), where G_t = (G_r ⊗_b G_i ⊗_b G_b). On GPU, we associate the computation of OT with a thread block, and within a thread block each thread maps to a strided 2D grid of element blocks in OT, with |G_r.U| strides and element block size |G_b.U| in the row dimension. We exploit the data reuse offered by the RBGP4 sparsity pattern and make efficient use of the memory hierarchy on GPU by first loading tiles WT_s and IT into shared memory in each step of OT; each thread then loads its share of data into registers from shared memory before performing the computation. Figure 1 shows an example of using the tiling approach for the RBGP4MM operation on GPU. A more detailed GPU algorithm can be found in the Appendix.
Why RBGP4? The RBGP4 sparsity pattern (G = G_o ⊗_b G_r ⊗_b G_i ⊗_b G_b) is designed to achieve runtime efficiency for the SDMM operation (O = W_s × I) on GPU. Towards that, all four base graphs G_o, G_r, G_i, and G_b in the RBGP4 sparsity pattern have a specific role to play.

The role of G_o is to reduce the number of steps required to process OT (a tile in O) by inducing sparsity at the tile level in W_s. Performing the bipartite product to the left of G_t with G_o, i.e., G = G_o ⊗_b G_t, results in a block sparsity pattern in W_s with block size (|G_t.U|, |G_t.V|). As we set the tile size in W_s to be this block size, sparsity is induced at the tile/block level in W_s, which in turn reduces the number of steps for processing OT by skipping the computation corresponding to zero tiles in W_s. For example, in Figure 1 we can see that the number of steps required to compute OT is reduced from two to one, as W_s has only two non-zero tiles out of four due to 50% sparsity in G_o.

The role of graphs G_r and G_b in the RBGP4 sparsity pattern is to maximize data reuse from registers in GPU threads by inducing row repetition in WT_s (a tile in W_s). In row repetition, rows are divided into groups of equal size, where all the rows in a group have non-zeros at the same locations. Having a row repetition pattern in WT_s implies that all the rows in a group will have the same memory access pattern into IT, and thus allows for reuse of data from WT_s and IT. Performing the bipartite graph product to the left and right of G_i with complete graphs G_r and G_b respectively, i.e., G_t = G_r ⊗_b G_i ⊗_b G_b, results in row repetition in WT_s with |G_i.U| groups and |G_r.U| × |G_b.U| rows in each group. For example, in Figure 1 we can see that as G_r and G_b are complete bipartite graphs of sizes (2, 1) and (2, 2), the sparsity pattern of WT_s has a row repetition pattern with 4 rows per group. In the computation associated with thread T in O, the four rows of a group have the same non-zero pattern in WT_s, and this allows us to load two blocks from WT_s and one block from IT into register blocks RegW and RegI respectively, and reuse each element from RegW and RegI 2 and 4 times respectively.

The role of G_i in the RBGP4 sparsity pattern is to allow W_s to have any level of sparsity even when the tile size in W_s is big. When the tile size in W_s is relatively large compared to the size of W_s, it is not possible to obtain the desired level of sparsity if every non-zero tile in W_s is dense. For example, if a tile in W_s is of size (64, 64) and W_s is of size (128, 64), sparsity greater than 50% can be obtained only by allowing tiles in W_s to be sparse. The bipartite graph G_t corresponds to the sparsity pattern of WT_s, and in the RBGP4 sparsity pattern G_t = (G_r ⊗_b G_i ⊗_b G_b). As G_r and G_b are dense/complete, G_i has to be sparse to achieve a desired level of sparsity in W_s.

We study the effect of the RBGP4 sparsity pattern on model accuracy for the task of image classification and compare with unstructured and block structured sparsity patterns. Furthermore, we study the effect of changing the configuration of base graphs in the RBGP4 sparsity pattern on runtime. We perform all our experiments on a V100 GPU, where we benchmark unstructured and block sparsity patterns using the cuSPARSE library, and the dense pattern using the cuBLAS library from NVIDIA.

Sparsity | Pattern      | VGG19                    | WideResnet-40-4
  (%)    |              | CF10  CF100  Mem   Time  | CF10  CF100  Mem   Time
  0.00   | Dense        | 93.14 70.64  77.39 22    | 95.01 77.20  34.10 40
 50.00   | Unstructured | 92.67 70.31  77.39 165   | 95.42 77.92  34.10 241
         | Block        | 92.45 70.75  41.12 94    | 95.49 77.52  18.12 165
         | RBGP4        | 92.58 70.48  38.76 20    | 95.34 78.27  17.13 32
 75.00   | Unstructured | 91.99 69.32  38.71 86    | 95.10 76.89  17.05 135
         | Block        | 91.93 68.72  20.57 48    | 94.92 76.50   9.07 85
         | RBGP4        | 91.99 68.34  19.40 13    | 94.72 76.80   8.57 20
 87.50   | Unstructured | 90.88 65.41  19.37 79    | 94.48 75.21   8.53 102
         | Block        | 90.62 65.37  10.30 25    | 94.56 74.55   4.54 45
         | RBGP4        | 90.48 65.39   9.72 8     | 94.38 75.25   4.30 16
 93.75   | Unstructured | 90.01 62.33   9.70 50    | 93.57 73.09   4.27 69
         | Block        | 89.40 62.90   5.16 14    | 93.55 71.86   2.27 26
         | RBGP4        | 89.32 62.79   4.88 6     | 93.53 72.44   2.16 14

Table 1: Image classification on CIFAR10 (CF10) and CIFAR100 (CF100) datasets using VGG19 and WideResnet-40-4 networks. Models are trained using the predefined approach with unstructured, block, and RBGP4 sparsity patterns. For the block pattern, the block size is (4, 4). Memory (Mem) is given in MB, and time is given in milliseconds for one forward pass in training.
Image classification benchmark.
In this benchmark, we perform the image classification task on the CIFAR dataset using VGG19 [31], as adapted by Liu et al. [20], and WideResnet-40-4 [37] networks. To train the models, we use the predefined approach, where the mask (choice of connections) is chosen a priori to the training process. As a sparse neural network has fewer parameters, we first train the dense model and guide the sparse neural network using knowledge distillation [13]. For all our experiments, we incorporate an equal amount of sparsity in all layers, except for the first layer connected to the input and the final classifier layer. For the optimizer, we use SGD with momentum of 0.9 and weight decay of 1e-4. The VGG19/WideResnet-40-4 model is trained for 160/200 epochs with a batch size of 256/128. The initial learning rate is set to 0.1. For VGG19, the learning rate is multiplied by 0.1 at epochs 60, 120, and 160. For WideResnet-40-4, the learning rate is multiplied by 0.2 at epochs 70, 120, and 160. From Table 1, we can see that RBGP4 is as accurate as the unstructured and block sparsity patterns, but takes 2x less memory and is 5-9x faster when compared to unstructured, and 2-5x faster when compared to the block sparsity pattern.
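For concreteness, a hedged PyTorch sketch of the optimizer and learning-rate schedule described above (SGD with momentum 0.9, weight decay 1e-4, initial learning rate 0.1, step decays at the listed epochs); the model, data loading, masking, and distillation loss are omitted, and the function name is ours.

import torch

def make_optimizer_and_scheduler(model, arch="vgg19"):
    # Hyperparameters taken from the experimental setup in the text.
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=1e-4)
    if arch == "vgg19":   # 160 epochs, decay by 0.1
        sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60, 120, 160], gamma=0.1)
    else:                 # WideResnet-40-4: 200 epochs, decay by 0.2
        sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[70, 120, 160], gamma=0.2)
    return opt, sched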
RBGP4 runtime characteristics.
An RBGP4 sparse matrix W_s of a given size and sparsity can be obtained in multiple ways by varying the sizes of the base graphs G_o, G_r, G_i, G_b and the sparsities of G_o and G_i. For example, setting the sparsities of (G_o, G_i) to either (0%, 75%) or (50%, 50%) leads to 75% sparsity in W_s, and setting the sizes of the base graphs to either ((8, 8), (1, 1), (8, 8), (1, 1)) or ((8, 8), (2, 2), (4, 4), (1, 1)) leads to a W_s of size (64, 64). In this section, we study the effect of the RBGP4 configuration on the runtime of the SDMM operation (O = W_s × I). For all our experiments, we set the sizes of the matrices O, W_s, and I to 4096x4096.
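Since G_r and G_b are complete, the overall sparsity of W_s comes only from G_o and G_i, and the densities combine multiplicatively; a small helper (the name is ours) to check the combinations used in the runtime study:

def overall_sparsity(sp_o, sp_i):
    # Sparsity of W_s when G_o has sparsity sp_o and G_i has sparsity sp_i (G_r, G_b complete).
    return 1.0 - (1.0 - sp_o) * (1.0 - sp_i)

print(overall_sparsity(0.00, 0.75))    # 0.75
print(overall_sparsity(0.50, 0.50))    # 0.75
print(overall_sparsity(0.875, 0.50))   # 0.9375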
Sparsity distribution: In the RBGP4 sparsity pattern, sparsity is solely due to the presence of the sparse graphs G_o and G_i, as G_r and G_b are dense or complete graphs. We run experiments with 75%, 87.5%, and 93.75% sparsity distributed between G_o and G_i, while keeping the sizes of G_o, G_r, G_i, G_b fixed at (32, 32), (4, 4), (32, 32), (1, 1). From Table 2, we can see that for a given overall sparsity, as the sparsity of G_o increases, the runtime decreases. This is because sparsity in G_o incorporates sparsity at the tile level, which reduces runtime by skipping the computation and memory loads associated with zero tiles. For the dense case (0% sparsity), we use the cuBLAS library from NVIDIA. Row repetition:
In row repetition, the matrix W_s can be divided into row groups of equal size, where all the rows in a row group have non-zeros at exactly the same locations. Having row repetition allows us to effectively reuse data from I, as rows with the same non-zero pattern access the same elements. G_r and G_b in RBGP4 introduce |G_r.U| × |G_b.U| amount of row repetition in W_s. We run experiments with 1, 2, and 4 repetition amounts, while keeping the size of G_t (= G_r ⊗_b G_i ⊗_b G_b) fixed at (128, 32), and the sparsity of G_o at 50%. From Table 3, we can see that increasing the size of G_r or G_b or both leads to improved runtime performance as the repetition amount increases.

Sp(G)% | Sp(G_o)% | Sp(G_i)% | Time(ms)
 0     |  0       |  0       | 11.2 (1x)
75.00  |  0.00    | 75.00    | 5.64 (2x)
75.00  | 50.00    | 50.00    | 4.44 (2.5x)
87.50  |  0.00    | 87.50    | 4.31 (2.6x)
87.50  | 50.00    | 75.00    | 2.74 (4.1x)
87.50  | 75.00    | 50.00    | 2.29 (4.9x)
93.75  |  0.00    | 93.75    | 3.76 (3x)
93.75  | 50.00    | 87.50    | 1.93 (5.8x)
93.75  | 75.00    | 75.00    | 1.44 (7.8x)
93.75  | 87.50    | 50.00    | 1.22 (9.2x)

Table 2: Effect of varying the sparsities of the sparse graphs G_o and G_i in the RBGP4 sparsity pattern on runtime.

Table 3: Effect of varying the sizes of G_r and G_b in the RBGP4 sparsity pattern on runtime.

Conclusion

We used ideas from extremal graph theory and combinatorics to make sparse neural networks runtime efficient. Ramanujan graphs, which give the optimal connectivity for a given level of sparsity, are used to model the connections in a neural network layer. Furthermore, we obtain structured block sparsity by using products of Ramanujan graphs. We prove that the product graph also has near-optimal connectivity for large matrices. For the specific case of GPUs, we describe how the block sparsity can be efficiently implemented by exploiting the memory hierarchy through data reuse. Benchmarks of this implementation are shown to give significant runtime improvements. Similar ideas could be used for generating structured sparsity patterns that result in runtime-efficient implementations on other hardware as well. For future work, generating combinatorial structured sparsity patterns like RBGP4 during the training process could lead to more accurate models, as structure is induced in a gradual manner.

References

[1] Alon, N.: Eigenvalues and expanders. Combinatorica 6(2), 83–96 (1986)
[2] Bellec, G., Kappel, D., Maass, W., Legenstein, R.: Deep rewiring: Training very sparse deep networks. arXiv preprint arXiv:1711.05136 (2017)
[3] Bilu, Y., Linial, N.: Lifts, discrepancy and nearly optimal spectral gap. Combinatorica 26(5), 495–519 (2006). https://doi.org/10.1007/s00493-006-0029-7
[4] Cao, S., Zhang, C., Yao, Z., Xiao, W., Nie, L., Zhan, D., Liu, Y., Wu, M., Zhang, L.: Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. pp. 63–72 (2019)
[5] Chung, F.R.K.: Spectral Graph Theory. American Mathematical Society (1997)
[6] Dettmers, T., Zettlemoyer, L.: Sparse networks from scratch: Faster training without losing performance. CoRR abs/1907.04840 (2019), http://arxiv.org/abs/1907.04840
[7] Dey, S., Huang, K.W., Beerel, P.A., Chugg, K.M.: Pre-defined sparse neural networks with hardware acceleration. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9(2), 332–345 (2019)
[8] Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018)
[9] Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding.
In: Bengio, Y., LeCun, Y. (eds.) 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings (2016), http://arxiv.org/abs/1510.00149
[10] Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems. pp. 1135–1143 (2015)
[11] Hassibi, B., Stork, D.G., Wolff, G.: Optimal brain surgeon: Extensions and performance comparisons. In: Advances in Neural Information Processing Systems. pp. 263–270 (1994)
[12] He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1389–1397 (2017)
[13] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NIPS Deep Learning and Representation Learning Workshop (2015), http://arxiv.org/abs/1503.02531
[14] Huang, Z., Wang, N.: Data-driven sparse structure selection for deep neural networks. In: The European Conference on Computer Vision (ECCV) (September 2018)
[15] Kepner, J., Robinett, R.: RadiX-Net: Structured sparse matrices for deep neural networks. In: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). pp. 268–274. IEEE (2019)
[16] LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Advances in Neural Information Processing Systems. pp. 598–605 (1990)
[17] Lee, N., Ajanthan, T., Torr, P.H.S.: SNIP: Single-shot network pruning based on connection sensitivity. CoRR abs/1810.02340 (2018), http://arxiv.org/abs/1810.02340
[18] Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016)
[19] Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2736–2744 (2017)
[20] Liu, Z., Sun, M., Zhou, T., Huang, G., Darrell, T.: Rethinking the value of network pruning (2018)
[21] Lubotzky, A., Phillips, R., Sarnak, P.: Ramanujan graphs. Combinatorica 8(3), 261–277 (1988)
[22] Luo, J.H., Wu, J., Lin, W.: ThiNet: A filter level pruning method for deep neural network compression. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5058–5066 (2017)
[23] Mao, H., Han, S., Pool, J., Li, W., Liu, X., Wang, Y., Dally, W.J.: Exploring the granularity of sparsity in convolutional neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (July 2017)
[24] Marcus, A.W., Spielman, D.A., Srivastava, N.: Interlacing families I: Bipartite Ramanujan graphs of all degrees. Annals of Mathematics 182(1), 307–325 (2015)
[25] Mocanu, D.C., Mocanu, E., Stone, P., Nguyen, P.H., Gibescu, M., Liotta, A.: Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications 9(1), 1–12 (2018)
[26] Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440 (2016)
[27] Mostafa, H., Wang, X.: Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In: International Conference on Machine Learning.
pp. 4646–4655 (2019)
[28] Narang, S., Elsen, E., Diamos, G., Sengupta, S.: Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119 (2017)
[29] Narang, S., Undersander, E., Diamos, G.: Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782 (2017)
[30] Prabhu, A., Varma, G., Namboodiri, A.: Deep expander networks: Efficient deep networks from graph theory. In: The European Conference on Computer Vision (ECCV) (September 2018)
[31] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
[32] Srinivas, S., Subramanya, A., Venkatesh Babu, R.: Training sparse neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 138–145 (2017)
[33] Vooturi, D.T., Kothapalli, K.: Efficient sparse neural networks using regularized multi block sparsity pattern on a GPU. In: High Performance Computing and Data Analytics (HiPC) (December 2019)
[34] Vooturi, D.T., Varma, G., Kothapalli, K.: Dynamic block sparse reparameterization of convolutional neural networks. In: The IEEE International Conference on Computer Vision (ICCV) Workshops (October 2019)
[35] Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 2074–2082. Curran Associates, Inc. (2016), http://papers.nips.cc/paper/6504-learning-structured-sparsity-in-deep-neural-networks.pdf
[36] Yu, R., Li, A., Chen, C.F., Lai, J.H., Morariu, V.I., Han, X., Gao, M., Lin, C.Y., Davis, L.S.: NISP: Pruning networks using neuron importance score propagation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
[37] Zagoruyko, S., Komodakis, N.: Wide residual networks. In: Wilson, R.C., Hancock, E.R., Smith, W.A.P. (eds.) Proceedings of the British Machine Vision Conference (BMVC). pp. 87.1–87.12. BMVA Press (September 2016). https://doi.org/10.5244/C.30.87

Appendix
A construction for Ramanujan bipartite graphs (RBG) was given by Bilu and Linial [3]. The proof that this construction obtains the optimal eigenvalue gap was given by Marcus et al. [24]. We use algorithms (graph lifts) derived from these constructions to generate Ramanujan bipartite graphs for a given sparsity.
A 2-lift is an operation applied on a graph G to produce a bigger graph G_L that is twice as big as G in both vertices and edges. In the 2-lift operation, a clone graph G_c is first created and the vertex set of G_L is set to be the union of the vertex sets of G and G_c, i.e., V(G_L) = V(G) ∪ V(G_c). The edge set of G_L, i.e., E(G_L), is then constructed in the following way: for an edge (u, v) ∈ G and its corresponding clone edge (u_c, v_c) ∈ G_c, either the identity edge pair {(u, v), (u_c, v_c)} or the crossover edge pair {(u, v_c), (u_c, v)} is chosen at random and added to E(G_L). Figure 4 shows an example of the 2-lift operation.

Figure 4: 2-lift operation on graph G. A clone graph G_c is first created, and two of the edges of G are randomly chosen to cross over with the corresponding edges in the clone graph.

Generating sparse biregular bipartite graph:
A 2-lift operation, when applied on a biregular bipartite graph, also results in a biregular bipartite graph that is twice as big with the same left and right degrees. A biregular graph G(U, V, E) with sparsity sp (= 1.0 − |E(G)|/(|G.U| × |G.V|)) can be generated by repeatedly applying log₂(1/(1 − sp)) 2-lift operations, starting from a complete bipartite graph with (1 − sp) × |G.U| left and (1 − sp) × |G.V| right vertices. Generating RBG graph:
A Ramanujan bipartite graph is first a biregular bipartite graph, with an additional constraint on the second largest eigenvalue of the adjacency matrix of the graph. To generate an RBG, we sample sparse biregular bipartite graphs generated using 2-lift operations until the sampled graph is Ramanujan. We found that an RBG with sizes in the order of thousands can be generated in the order of minutes. For a layer in an RBGP sparse neural network, the base Ramanujan graphs are generated only once before training, and hence the sampling approach is not a bottleneck.
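A compact sketch of the generation procedure described above: a random 2-lift on a biadjacency matrix, repeated lifting from a small complete bipartite graph, and rejection sampling on the Ramanujan condition (checked via singular values). It follows the description in the text, but the helper names and the exact acceptance test are our own.

import numpy as np

def two_lift(ba, rng):
    # Random 2-lift of a bipartite graph given by biadjacency matrix ba.
    n, m = ba.shape
    lifted = np.zeros((2 * n, 2 * m), dtype=int)
    for u, v in zip(*np.nonzero(ba)):
        if rng.random() < 0.5:                      # identity pair
            lifted[u, v] = lifted[u + n, v + m] = 1
        else:                                       # crossover pair
            lifted[u, v + m] = lifted[u + n, v] = 1
    return lifted

def is_ramanujan(ba):
    # Check lambda_2 <= sqrt(d_l - 1) + sqrt(d_r - 1) using singular values.
    d_l, d_r = ba.sum(axis=1)[0], ba.sum(axis=0)[0]
    s = np.linalg.svd(ba.astype(float), compute_uv=False)
    return s[1] <= np.sqrt(d_l - 1) + np.sqrt(d_r - 1) + 1e-9

def sample_rbg(n_left, n_right, sp, rng=np.random.default_rng(0), max_tries=100):
    # Sample sparse biregular bipartite graphs of sparsity sp until one is Ramanujan.
    lifts = int(round(np.log2(1.0 / (1.0 - sp))))
    for _ in range(max_tries):
        ba = np.ones((int((1 - sp) * n_left), int((1 - sp) * n_right)), dtype=int)
        for _ in range(lifts):
            ba = two_lift(ba, rng)
        if is_ramanujan(ba):
            return ba
    raise RuntimeError("no Ramanujan graph found")

ba = sample_rbg(32, 32, sp=0.75)
print(ba.shape, ba.sum(axis=1)[0], ba.sum(axis=0)[0])   # (32, 32) with left/right degree 8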
Computation in each layer of a sparse neural network is an SDMM (multiplication of a sparse matrix with a dense matrix) operation (C = A_s × B). RBGP4MM is an SDMM operation where A_s has the RBGP4 sparsity pattern. Algorithm 1 describes the pseudocode for the RBGP4MM operation on a GPU. As the RBGP4 sparsity pattern has an equal number of non-zero elements in each row, the non-zero elements in A_s can be stored using a data array of size (A_s.rows, (1 − sp) × A_s.columns), and the index information of A_s is captured by storing the adjacency lists of the base bipartite graphs.

Algorithm 1: GPU algorithm for the RBGP4MM (C = A_s × B) operation using the tiling approach. Tile sizes for A_s, B, and C are chosen to be (TM, TK), (TK, TN), and (TM, TN) respectively. On GPU, each tile in C is mapped to a thread block, and each thread in the thread block is mapped to a group of (RM × BM × RN × BN) elements in a tile of C. Variables TM, TK, RM, RK, BM, BK are set based on the RBGP4 configuration (G = G_o ⊗_b G_r ⊗_b G_i ⊗_b G_b) of A_s.

function LBFM(matrix, (bi, bj), (BH, BW))                ▷ Load Block From Matrix
    block[BH][BW]
    for i in [0, BH) do
        for j in [0, BW) do
            block[i][j] = matrix[bi * BH + i][bj * BW + j]
        end for
    end for
    return block
end function

G_t = G_r ⊗_b G_i ⊗_b G_b
TM, TK = |G_t.U|, |G_t.V|                        ▷ Number of left and right vertices of bipartite graph G_t
RM, RK = |G_r.U|, |G_r.V|
BM, BK = |G_b.U|, |G_b.V|
gridBlockDim = (C.rows/TM, C.cols/TN)            ▷ 2D grid of blocks
threadBlockDim = (TM/(RM × BM), TN/(RN × BN))    ▷ 2D thread block
for (tbm, tbn) in [(0, 0) : gridBlockDim) do                 ▷ Mapped to thread blocks
    for (thm, thn) in [(0, 0) : threadBlockDim) do           ▷ Mapped to threads
        Areg[RM][BM][BK]                                     ▷ Registers
        Breg[RN][BK][BN]                                     ▷ Registers
        Creg[RM][RN][BM][BN]                                 ▷ Registers
        for outk in [0, G_o.d_l) do              ▷ G_o.d_l is the left degree of biregular bipartite graph G_o
            oind = G_o.adj_list[tbm][outk]
            Atile = LBFM(A_s.data, (tbm, outk), (TM, G_t.d_l))    ▷ DRAM to shared memory
            Btile = LBFM(B, (oind, tbn), (TK, TN))                ▷ DRAM to shared memory (shMem)
            __syncthreads()
            for rk, ink in [0, RK) × [0, G_i.d_l) do
                for rm in [0, RM) do
                    bm = rm * |G_i.U| + thm
                    bk = rk * G_i.d_l + ink
                    Areg[rm] = LBFM(Atile, (bm, bk), (BM, BK))    ▷ ShMem to registers
                end for
                for rn in [0, RN) do
                    bk = rk * |G_i.V| + G_i.adj_list[thm][ink]
                    bn = rn * TN/(RN × BN) + thn
                    Breg[rn] = LBFM(Btile, (bk, bn), (BK, BN))    ▷ ShMem to registers
                end for
                for rm, rn in [0, RM) × [0, RN) do
                    Creg[rm][rn] += Areg[rm] × Breg[rn]           ▷ Computation
                end for
            end for
            __syncthreads()
        end for
        for rm, rn in [0, RM) × [0, RN) do
            for m, n in [0, BM) × [0, BN) do
                row = tbm * TM + rm * (TM/RM) + thm * BM + m
                col = tbn * TN + rn * (TN/RN) + thn * BN + n
                C[row][col] += Creg[rm][rn][m][n]
            end for
        end for
    end for
end for
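For testing on the host, a dense NumPy reference of the same computation is handy: it expands the compact (packed values plus base-graph adjacency) storage into a dense masked weight matrix and multiplies. This is our own verification sketch, not the GPU kernel above; the base graphs are arbitrary examples.

import numpy as np

def rbgp4_mask(G_o, G_r, G_i, G_b):
    # 0/1 mask of A_s from the biadjacency matrices of the four base graphs.
    return np.kron(np.kron(np.kron(G_o, G_r), G_i), G_b)

def rbgp4mm_reference(values, mask, B):
    # C = A_s x B, where A_s is stored as a mask plus a packed value array holding
    # the non-zeros of each row in column order (equal count per row).
    A = np.zeros(mask.shape)
    rows, cols = np.nonzero(mask)              # row-major order matches the packing
    A[rows, cols] = values.reshape(-1)
    return A @ B

rng = np.random.default_rng(0)
G_o = np.array([[1, 0], [0, 1]]); G_r = np.ones((2, 1), dtype=int)
G_i = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]]); G_b = np.ones((2, 2), dtype=int)
mask = rbgp4_mask(G_o, G_r, G_i, G_b)
nnz_per_row = int(mask.sum(axis=1)[0])
values = rng.standard_normal((mask.shape[0], nnz_per_row))   # packed (rows, (1-sp)*cols) storage
B = rng.standard_normal((mask.shape[1], 5))
C = rbgp4mm_reference(values, mask, B)
print(mask.shape, C.shape)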