Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks
Itay Hubara, Brian Chmiel, Moshe Island, Ron Banner, Seffi Naor, Daniel Soudry
Itay Hubara* 1 2   Brian Chmiel* 1 2   Moshe Island 2   Ron Banner 2   Seffi Naor 3   Daniel Soudry 1

{ihubara, bchmiel, misland, rbanner}@habana.ai

Abstract
Recently, researchers proposed pruning deep neural network weights (DNNs) using an N:M fine-grained block sparsity mask. In this mask, for each block of M weights, we have at least N zeros. In contrast to unstructured sparsity, N:M fine-grained block sparsity allows acceleration in actual modern hardware. So far, this was used for DNN acceleration at the inference phase. First, we suggest a method to convert a pretrained model with unstructured sparsity to an N:M fine-grained block sparsity model, with little to no training. Then, to also allow such acceleration in the training phase, we suggest a novel transposable fine-grained sparsity mask, where the same mask can be used for both forward and backward passes. Our transposable mask ensures that both the weight matrix and its transpose follow the same sparsity pattern; thus the matrix multiplication required for passing the error backward can also be accelerated. We discuss the transposable constraint and devise a new measure for mask constraints, called mask diversity (MD), which correlates with their expected accuracy. Then, we formulate the problem of finding the optimal transposable mask as a minimum-cost flow problem and suggest a fast linear approximation that can be used when the masks dynamically change while training. Our experiments suggest a 2x speed-up with no accuracy degradation over vision and language models. A reference implementation can be found at https://github.com/papers-submission/structured_transposable_masks.

*Equal contribution. 1 Electrical Engineering Department, Technion, Haifa, Israel. 2 Habana Labs - An Intel company, Caesarea, Israel. 3 Computer Science Department, Technion, Haifa, Israel. Correspondence to: Itay Hubara <[email protected]>. Preprint. Under review.
1. Introduction
Deep neural networks (DNNs) have established themselves as the first-choice tool for a wide range of applications, including computer vision and natural language processing. However, their impressive performance comes with a price of extensive infrastructure costs, as they may contain trillions of parameters (Fedus et al., 2021) and require thousands of petaflops (Brown et al., 2020) for the training process. For this reason, compressing the DNN training and inference process is a leading research topic in academia and industry. The main compression techniques include quantization (Banner et al., 2018a; Nahshan et al., 2019), knowledge distillation (Hinton et al., 2015), and pruning (Han et al., 2015; Li et al., 2017).

Pruning DNNs is one of the most popular and widely studied methods to improve DNN resource efficiency. The different pruning methods can be categorized into two groups: unstructured and structured pruning. While the former can achieve a very high compression ratio, it usually fails to reduce the computation footprint on modern hardware. On the other hand, structured pruning methods, such as block (Wen et al., 2016) or filter (Li et al., 2017) pruning, are more hardware friendly. Unfortunately, these methods usually fail to keep the original accuracy at high compression ratios (Renda et al., 2020a).

Recently, Nvidia (Nvidia, 2020) announced the A100 GPU, containing sparse tensor cores which are able to accelerate fine-grained sparse matrix multiplication. The sparse tensor cores in the A100 enable a 2x acceleration of regular matrix multiplication in DNNs, $Y = WX$, where $W$ and $X$ are weight and input matrices, respectively. The only requirement is that $W$ have a fine-grained 2:4 sparsity structure, i.e., out of every four contiguous elements in $W$, two are pruned. Consequently, models with unstructured sparsity would, in general, suffer a severe degradation in accuracy when forced to adhere to an N:M sparsity structure. In our first contribution of this paper (Section 3):

• We analyze the setting of converting a pretrained model with unstructured sparsity to an N:M sparsity structure. We suggest two light methods that together can prevent the degradation when less than 10% of the active weights must be set to zero.

However, more commonly, only a pretrained dense model is given. Therefore, Nvidia (2020) suggested a two-fold scheme for pruning a dense model: (a) define a fixed mask which, for every weight tensor, prunes the two smallest-magnitude elements out of each block of four contiguous elements, and (b) retrain the masked weights using the original training schedule. Indeed, the Nvidia (2020) approach is very appealing for the common case where a pretrained dense model is given.

While the Nvidia (2020) method works well on many models, a pretrained model is not always given. Thus, Zhou et al. (2021) suggested a method that trains a model with an N:M fine-grained mask from scratch, using a sparse-refined straight-through estimator (SR-STE). Similar to quantization-aware-training methods (Hubara et al., 2017), they maintain a dense copy of the weights and prune it every iteration, while keeping the gradients oblivious to that process using the straight-through estimator (Bengio et al., 2013). Since the mask dynamically changes while training, they suggest adding an extra weight decay on the masked (i.e.,
pruned) elements to reduce mask changes during the training process.

The rest of the paper focuses on accelerating sparse training in the two settings detailed above (starting from a pretrained model or starting from scratch). Training DNNs requires three matrix multiplications per layer: one for the forward pass, one for the backward pass, and the last for the weight update. Both the forward and the backward matrix multiplications involve the weight matrix $W$, yet the methods suggested in Zhou et al. (2021); Nvidia (2020) accelerate only the forward-pass matrix multiplication:

$$Y = WX. \tag{1}$$

Therefore, the current methods use the sparse tensor cores to accelerate approximately a third of the total computation during training. We note that, for the backward pass, the sparse tensor cores cannot be utilized, even if $W$ has fine-grained sparsity. This is because the transposed matrix $W^T$ is used for the backward-pass multiplication:

$$\frac{\partial \mathrm{Loss}}{\partial X} = \frac{\partial \mathrm{Loss}}{\partial Y} W^{T}, \tag{2}$$

and $W^T$ does not generally have an N:M fine-grained sparsity structure, even if $W$ has such a structure.

We propose a novel N:M transposable fine-grained sparsity mask, where the same mask can be used for both the forward and backward passes. Our suggested mask contains only $M - N$ non-zero elements for every $M$ contiguous elements, simultaneously in both $W$ and $W^T$. In Fig. 1 we emphasize the difference between previously suggested methods of N:M fine-grained structured pruning (Zhou et al., 2021; Nvidia, 2020), which accelerate only the forward pass, and our suggested method, which is able to accelerate the backward pass as well. In this setting:

• For the case of pretrained dense models, we suggest a novel method for training models with the N:M transposable fine-grained sparsity mask, exploiting modern sparse tensor cores to allow acceleration of the forward and backward passes. Our method uses a novel algorithm to determine the optimal transposable mask using a reduction to the min-cost flow problem (Section 4).

• For the case where the model is trained from scratch, we define an approximation scheme with an (almost) linear (in input size) time complexity that produces a mask whose $\ell_1$-norm is within a factor of 2 of the optimal mask (Section 4).

• We suggest a new measure called mask diversity, which, to the best of our knowledge, provides the first connection between the mask constraints and network accuracy for a fixed sparsity ratio (Section 5).
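To make the transposable constraint concrete, the following sketch checks whether a binary mask satisfies N:M fine-grained sparsity for both $W$ and $W^T$. The helper names and the regular example pattern are ours, for illustration only; they are not part of the reference implementation.

```python
import numpy as np

def is_nm_sparse(mask: np.ndarray, n: int, m: int) -> bool:
    """Check that every block of m contiguous elements along each row
    of `mask` contains at least n zeros (N:M fine-grained sparsity)."""
    rows, cols = mask.shape
    assert cols % m == 0, "row length must be divisible by the block size"
    blocks = mask.reshape(rows, cols // m, m)
    zeros_per_block = m - blocks.sum(axis=-1)
    return bool((zeros_per_block >= n).all())

def is_transposable_nm_sparse(mask: np.ndarray, n: int, m: int) -> bool:
    """A transposable N:M mask must satisfy the constraint for both the
    matrix and its transpose, so one mask can serve Eq. (1) and Eq. (2)."""
    return is_nm_sparse(mask, n, m) and is_nm_sparse(mask.T, n, m)

# A 4:8 mask that happens to be transposable: a 4-regular pattern
# inside an 8x8 block (4 kept elements in every row *and* column).
block = np.zeros((8, 8), dtype=np.int8)
for i in range(8):
    block[i, (np.arange(4) + i) % 8] = 1
print(is_nm_sparse(block, 4, 8))               # True
print(is_transposable_nm_sparse(block, 4, 8))  # True
```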
2. Related work
Pruning of neural network weights has been extensively investigated, starting with classical methods in the late 1980s (Janowsky, 1989; Mozer & Smolensky, 1989a;b; Karnin, 1990) and amounting to dozens of papers published in recent years. Since DNNs are generally over-parameterized, pruning the weights reduces their memory footprint. In special cases, when the sparsity mask has a specific pattern, it has the potential to reduce the computation footprint as well. The most common practice is to prune a pretrained dense model so it will be sparse at deployment. Since the pretrained dense model accuracy is known, one can tune the sparsity level to ensure comparable accuracy for the sparse model. Recently, a new line of research that aims to train sparse models from scratch (Gray et al., 2017; Dettmers & Zettlemoyer, 2019; Evci et al., 2020) has emerged. Its goal is to train models that cannot fit into currently available hardware. Next, we briefly overview the structured, unstructured, and sparse-training-from-scratch categories.
Unstructured pruning removes individual elements of the matrix, aiming for high total sparsity while being agnostic to the location of the pruned elements.
Figure 1. (a): A 4:8 structured fine-grained pruning as used in Zhou et al. (2021); Nvidia (2020), capable of accelerating only the forward pass with sparse tensor cores. (b): The suggested 4:8 transposable structured fine-grained pruning, capable of accelerating both the forward and backward passes with sparse tensor cores.
Standard pruning methods are based on different criteria, such as magnitude (Han et al., 2015), approximated $L_0$ regularization (Louizos et al., 2018), or connection sensitivity (Lee et al., 2019). Recent methods (Frankle & Carbin, 2018) suggested training a dense network until convergence, extracting the required mask (the "winning ticket"), and using the original training regime to re-train the active weights from their original initialization or final values (Renda et al., 2020b). These methods are able to achieve over 80% sparsity on ResNet-50 with the ImageNet dataset (Renda et al., 2020b). Despite the high sparsity ratio that can be achieved with these methods, modern hardware cannot efficiently utilize such a form of sparsity to reduce computation resources (Nvidia, 2020).

Structured pruning removes weights in specific location-based patterns, which are more useful for hardware acceleration. Such methods can be applied at the level of channels or layers. For example, Li et al. (2017) removed the channels with the lowest norm, Luo et al. (2017) pruned channels according to their effect on the activation of the following layer, and Wen et al. (2016) split the filters into multiple groups and applied a group Lasso regularization. All these methods are natively supported in both hardware and software, as they effectively change the model structure by reducing channels or groups. Yet, no method was able to achieve a reasonable accuracy at sparsity levels higher than 50%. As observed by Liu et al. (2018), filter pruning of a pretrained dense over-parameterized model is rarely the best method to obtain an efficient final model. Thus, here structured pruning serves mostly as a DNN architecture search for the optimal compact model (Tan et al., 2019; Wu et al., 2019). Our work is most closely related to Zhou et al. (2021), which is the first work that attempted training with a fine-grained N:M structured sparsity mask, as explained above.

Sparse training from scratch:
Gray et al. (2017) were the first to introduce this approach for NLP tasks. They investigated a fixed mask in which blocks of size N x N are either pruned (i.e., all elements are set to zero) or not. They implemented dedicated GPU kernels for several block sizes (N = 8, 16, 32) and reported their results. While they managed to achieve slightly less than a 2x training speedup for 50% sparsity, they observed higher speedups as the sparsity level increases (5x for a 90% sparsity level). They initialized their masks using predefined schemes and fixed them throughout the training process. As expected, this method resulted in accuracy degradation if one does not expand the model size. Thus, several researchers (Bellec et al., 2017; Mocanu et al., 2018; Mostafa & Wang, 2019; Dettmers & Zettlemoyer, 2019; Evci et al., 2020) tried to enable dynamic mask changes during the training process. All of these methods focused on unstructured sparsity and aimed to enable training large models on hardware with memory limitations. The first to introduce this approach were Bellec et al. (2017), who essentially applied a random walk in parameter space: at initialization, connections are assigned a pre-defined sign at random; if during optimization the sign is flipped, the weight is pruned and a new weight is activated at random (i.e., regrown). Sparse Evolutionary Training (SET) (Mocanu et al., 2018) proposed a simpler scheme where weights are pruned based on their magnitude. Dettmers & Zettlemoyer (2019) replaced the random regrowth with a per-layer mean-momentum-magnitude redistribution. Evci et al. (2020) expanded the Mocanu et al. (2018) mask initialization scheme to convolution kernels and suggested pruning and regrowing based on dense gradients, calculated once every several iterations. As opposed to these approaches, Zhou et al. (2021) does not aim to reduce the model memory footprint and keeps a dense weight matrix. At each iteration, they re-calculate the mask and use it to obtain a pruned copy of the weights, which then serves for the forward and backward passes. Therefore, this method is most relevant when a pretrained dense model is not given, and one wishes to obtain a fine-grained sparse model for inference.
3. From unstructured to structured sparsity
Most DNN pruning methods focus on unstructured pruning, which reduces the memory footprint. However, current hardware implementations suggest that, unless very high sparsity levels are achieved, the model cannot be accelerated at all. Thus, commonly, the weights are simply decompressed before multiplication. Forcing structured sparsity on a model that was trained without it leads to a severe accuracy degradation, as several bits of the mask may change to satisfy the structured sparsity requirements.

In this section, we study the probability that an unstructured mask would not violate any N:M constraint (Nvidia, 2020). We then discuss two light methods to bridge the gap when a sparse model is given but the hardware does not support its structure.

Probability of violating the N:M constraint in unstructured sparsity. Let $X = \{x_1, x_2, \ldots, x_M\}$ be a block of independent and identically distributed random variables. Assume that with probability $\rho$, $x_i$ can be pruned without accuracy degradation (i.e., unstructured pruning). In this section, we consider a general form of block sparsity in which, for a block of size $M$, at least $N$ values can be pruned. Define $X$ to be N:M sparse if this $M$-sized block has at least $N$ values that can be pruned without harming accuracy. The probability of having an N:M sparse block is given by the binomial distribution, and so

$$P(X \text{ is } N{:}M \text{ sparse}) = \sum_{i \geq N} \binom{M}{i} \rho^i (1-\rho)^{M-i}. \tag{3}$$

This probability goes to 1 as $M \to \infty$ if $N/M = c < \rho$. In Fig. 2 we demonstrate Eq. (3) for a fixed $\rho$ and various block sizes.

To force a given sparse model to have fine-grained N:M sparsity, we need to make sure that $N$ out of every $M$ contiguous elements are zero. Therefore, as in Nvidia (2020), in each block we prune the $N$ weights with the lowest magnitude (including any zero weights, e.g., non-active). Forcing this pattern on an existing unstructured mask might remove active (non-zero) weights, i.e., flip some of the mask values from one to zero. We call these required flips pattern-violations. Removing active weights without retraining tends to severely degrade the model accuracy. To demonstrate the problem, we used an unstructured sparse pretrained ResNet-50 model and set the N:M structure per layer, based on Eq. (3), such that the probability of a pattern-violation would be equal to or less than a given percentage. Here we used a block size of M = 8. As can be seen in Fig. 3, without any optimization, even a small percentage of pattern-violations results in severe degradation. Next, we detail two light methods to boost the accuracy.
Figure 2. Eq. (3) for a fixed $\rho$ and various block sizes $M$. We have a sharp ("phase") transition at $N/M = \rho$. Specifically, (i) when $N/M \leq \rho$ we have a probability larger than 0.5 that the sampled block is N:M sparse; (ii) when $N/M \geq \rho$ this probability quickly decreases to zero. As the block size $M$ increases, this phase transition gets sharper. As expected, when $M \to \infty$, unstructured sparsity satisfies the structured constraints, and we expect it to display the phase transition precisely at the critical point $\rho$.
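Eq. (3) is cheap to evaluate directly. The short script below (ours, with $\rho = 0.5$ chosen only for illustration) reproduces the sharp transition shown in Fig. 2.

```python
from math import comb

def p_nm_sparse(rho: float, n: int, m: int) -> float:
    """Eq. (3): probability that a block of m i.i.d. weights, each prunable
    with probability rho, contains at least n prunable elements."""
    return sum(comb(m, i) * rho**i * (1 - rho)**(m - i) for i in range(n, m + 1))

# Probe both sides of the transition at N/M = rho:
for m in (16, 32, 64, 128):
    below = p_nm_sparse(0.5, int(0.4 * m), m)   # N/M = 0.4 < rho
    above = p_nm_sparse(0.5, int(0.6 * m), m)   # N/M = 0.6 > rho
    print(f"M={m:3d}: P(N/M=0.4)={below:.3f}  P(N/M=0.6)={above:.3g}")
```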
Figure 3. Top-1 accuracy vs. percent of constraints violated, comparing no-optimization, the absorb-mean bias fix, and AdaPrune against the unstructured baseline. The numbers next to the baseline samples represent the sparsity level of the refined model.
Pruning bias fix:
Several works (Banner et al., 2018b; Finkelstein et al., 2019; Hubara et al., 2020) reported that it is important to fix the bias introduced when quantizing a model. We build on those results and suggest absorbing the mean of the $N$ pruned weights into the $M - N$ non-zeroed weights. As can be seen in Fig. 3, this simple fix, by itself, greatly boosts accuracy.
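A minimal sketch of this bias fix, under our reading that the pruned elements' mean is added to each surviving weight in the block; the helper name is hypothetical and the per-row blocking is an implementation choice of ours.

```python
import numpy as np

def force_nm_with_mean_absorb(w: np.ndarray, n: int, m: int) -> np.ndarray:
    """Prune the n smallest-magnitude elements in every block of m contiguous
    weights, then absorb the mean of the pruned elements into the m - n
    survivors (the pruning bias fix described above)."""
    w = w.reshape(-1, m).copy()
    idx = np.argsort(np.abs(w), axis=1)            # per-block order, small to large
    pruned, kept = idx[:, :n], idx[:, n:]
    rows = np.arange(w.shape[0])[:, None]
    mean_pruned = w[rows, pruned].mean(axis=1, keepdims=True)
    w[rows, kept] += mean_pruned                   # absorb the pruning bias
    w[rows, pruned] = 0.0
    return w

w = np.random.randn(4, 8)
w_sparse = force_nm_with_mean_absorb(w, n=4, m=8)
print((w_sparse == 0).sum(axis=1))                 # 4 zeros per 8-element block
```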
AdaPrune: Recently, several works (Hubara et al., 2020; Nagel et al., 2020) suggested small, fast fine-tuning techniques for post-training quantization. These techniques replace heavy full-model training with a fast per-layer optimization which requires only a few iterations to converge. While each method applies a different optimization technique, they all aim to reduce the discrepancy between the quantized and full-precision layer outputs. We adapted the parallel-AdaQuant technique (Hubara et al., 2020), which optimizes (using a small calibration set) the weights and quantization parameters to reduce the per-layer pre-activation reconstruction error, measured via mean squared error. We adjusted their objective to the pruning problem:

$$\min_{W'} \left\| WX - (\mathrm{Mask} \odot W') X \right\|_2^2, \tag{4}$$

where $W$ is the original weight layer, $W'$ is the weight layer we aim to find, $X$ is the output of the previous activation layer, Mask is the weight sparsity mask, and $\odot$ is a component-wise product. We name this method AdaPrune. In our experiments, we used 1000 images from the ImageNet training set as a calibration set. As can be seen in Fig. 3, AdaPrune is capable of correcting the remaining error and obtains less than 1% degradation from the original unstructured-sparse model counterpart. We argue that with AdaPrune, we can potentially adapt any generic mask to the hardware at hand, thus alleviating the need to retrain the model. However, usually a pretrained unstructured sparse model is not given. When starting from a dense model (thus having 50% pattern-violations), we get 2.3% degradation using AdaPrune. We discuss and extend these experiments in Appendix A.1. Thus, retraining is required to prevent such degradation. In the next sections, we discuss two scenarios: the first is the common case when a pretrained dense model is given; the second is a more challenging scenario in which one aims to train with a sparse mask from scratch.
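The following is a minimal PyTorch sketch of the AdaPrune objective in Eq. (4) for a single linear layer; the shapes, optimizer, and hyper-parameters are illustrative assumptions of ours rather than the paper's exact recipe.

```python
import torch

def adaprune_layer(w, x, mask, iters=1000, lr=1e-3):
    """Per-layer AdaPrune (Eq. (4)): optimize the surviving weights so the
    masked layer reproduces the dense layer's output on a calibration batch."""
    y_ref = x @ w.T                                   # dense pre-activations
    w_prime = (w * mask).clone().requires_grad_(True)
    opt = torch.optim.Adam([w_prime], lr=lr)
    for _ in range(iters):
        loss = ((y_ref - x @ (mask * w_prime).T) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (mask * w_prime).detach()

w = torch.randn(64, 128)
mask = (torch.rand_like(w) > 0.5).float()             # stand-in for an N:M mask
x = torch.randn(1000, 128)                            # small calibration set
w_sparse = adaprune_layer(w, x, mask)
```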
4. Computing transposable sparsity masks
In general, training DNNs requires three matrix multiplications per layer. The first multiplication is required for the forward propagation between the weights and activations (Eq. (1)). The other two multiplications are used for the backward and update phases. The backward phase calculates the gradients of the loss function with respect to the input of the layer. This is done by recursively passing the error from the last layer to the first (Eq. (2)). Note that the backward phase uses the transposed weight matrix. Hence, accelerating the backward phase requires the transposed weight matrix to adhere to the hardware-required pattern (e.g., N:M fine-grained sparsity).

In this section, we tackle this issue by presenting a novel N:M transposable fine-grained sparsity mask, where the same mask can be used to accelerate both forward and backward passes. The required mask contains only $M - N$ non-zero elements for every $M$ contiguous elements in both $W$ and $W^T$ simultaneously. We formulate the problem and suggest two methods to generate the transposable mask: the first starts from a dense model and uses a min-cost flow procedure; the second trains from scratch, using an approximation algorithm that allows more efficient training.

First, we provide an integer programming (IP) formulation for finding an optimal transposable fine-grained mask. Let us consider a block of size $M \times M$ in a weight matrix $W$. Our goal is to maximize the $\ell_1$ norm of $W$ after masking $N$ elements in each row and column. Define a binary indicator variable $I_{i,j}$, where $I_{i,j} = 1$ if and only if the element $W_{i,j}$ is kept by the chosen mask, and otherwise $I_{i,j} = 0$. The integer program is as follows:

$$\begin{aligned}
\text{Maximize} \quad & \sum_{i,j=0}^{M-1} |W_{i,j}| \cdot I_{i,j} \\
\text{s.t.} \quad & \sum_j I_{i,j} = M - N, \quad \forall i \in \{0, \ldots, M-1\} \\
& \sum_i I_{i,j} = M - N, \quad \forall j \in \{0, \ldots, M-1\} \\
& I_{i,j} \in \{0, 1\}, \quad \forall i, j \in \{0, \ldots, M-1\}
\end{aligned} \tag{5}$$

In the following, we examine several methods for solving this problem and enforcing N:M transposable fine-grained sparsity during training. We first describe an optimal, yet computationally expensive, method. Then we describe a more efficient method, which only provides an approximate, though near-optimal, solution.

General integer programs have exponential worst-case time complexity with respect to the input size. Fortunately, as Fig. 4 shows, our IP formulation in Eq. (5) can be reduced to a min-cost flow problem. Hence, by using the cost-scaling method to solve the problem (Ahuja et al., 1988), we can find the optimal transposable mask in $O(M^3 \log(M))$ time for a block size of $M \times M$.

The min-cost flow solution should be used when training from a pretrained dense model (Section 6.1), where the transposable mask is generated once and is then fixed during training. Sparse training from scratch, on the other hand, requires changing the mask during training (Section 6.2). Therefore, it is essential to find a very efficient algorithm for computing the mask. To that end, we design a light 2-approximation algorithm: for every input, it produces a solution which is guaranteed to be within a factor of 2 of an optimal solution, yet it runs in almost linear time. Specifically, we design a greedy algorithm (see Algorithm 1) with a low time complexity that can be used in practice without compromising much the quality of the solution produced.
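Before analyzing the greedy variant, here is a concrete sketch of the optimal route: the min-cost flow reduction of Fig. 4 (shown below), solved with networkx. The library choice, the integer weight scaling, and the node naming are our assumptions, not part of the paper.

```python
import networkx as nx
import numpy as np

def optimal_transposable_mask(w_block: np.ndarray, n: int) -> np.ndarray:
    """Return a 0/1 mask with exactly `n` zeros in every row and column of
    an M x M block, minimizing the total magnitude of the pruned elements."""
    m = w_block.shape[0]
    scale = 10**6                                     # network_simplex wants integer costs
    g = nx.DiGraph()
    g.add_node("s", demand=-n * m)                    # total number of pruned elements
    g.add_node("t", demand=n * m)
    for i in range(m):
        g.add_edge("s", f"r{i}", capacity=n, weight=0)    # n pruned per row
        g.add_edge(f"c{i}", "t", capacity=n, weight=0)    # n pruned per column
    for i in range(m):
        for j in range(m):
            g.add_edge(f"r{i}", f"c{j}", capacity=1,
                       weight=int(abs(w_block[i, j]) * scale))
    flow = nx.min_cost_flow(g)
    mask = np.ones((m, m), dtype=np.int8)
    for i in range(m):
        for j in range(m):
            if flow[f"r{i}"].get(f"c{j}", 0) == 1:
                mask[i, j] = 0                        # unit flow = prune W[i, j]
    return mask

mask = optimal_transposable_mask(np.random.randn(8, 8), n=4)
print(mask.sum(axis=0), mask.sum(axis=1))             # 4 kept per row and per column
```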
Figure 4. N:M transposable-sparsity optimization as a min-cost flow problem. In addition to a source and a sink, the network has a node for each row and for each column. The construction uses three types of edges: (i) source edges emanating from the source node $s$ into each row node $i$; (ii) sink edges connecting each column node $j$ with the sink node $t$; and (iii) a coefficient edge $(i, j)$ for each matrix element $W_{i,j}$. Each source edge $(s, i)$ has capacity $N$, which is equal to the number of elements that need to be selected for pruning in row $i$. Similarly, each sink edge $(j, t)$ has capacity $N$, which is equal to the number of elements pruned in column $j$. Each coefficient edge $(i, j)$ has unit capacity and cost $|W_{i,j}|$. Finally, selecting a matrix element with weight $W_{i,j}$ for pruning corresponds to a unit flow on the coefficient edge $(i, j)$. A min-cost flow from source $s$ to destination $t$ finds the lowest possible cost of sending a flow of value $N \cdot M$ from $s$ to $t$. Assuming the source and sink edges have zero cost, it is easy to see a one-to-one correspondence between a min-cost flow in this construction and an optimal transposable mask, i.e., one that minimizes the sum of the absolute values selected for pruning.

Unlike the optimal min-cost flow solution, which runs in $O(M^3)$ time for a block size of $M \times M$, Algorithm 1 has a running time of $O(M^2 \log M)$, i.e., a time complexity that is almost linear in the number of block elements $M^2$. The approximation algorithm uses the same construction described in Fig. 4, but instead of running a min-cost flow on the graph, it employs a simple greedy approach. In Appendix A.3 we analyze the running times of different min-cost flow methods and compare them with the running time of our 2-approximation method.

Let $P$ be the list of edges pruned by Algorithm 1, let $W(P)$ be the total weight of the edges in $P$, and let $W^*$ be the weight of an optimal solution (i.e., the minimal sum of edge weights that can be pruned to create an N:M transposable sparsity mask). The next lemma establishes that Algorithm 1 finds a 2-approximate solution.
Lemma. Algorithm 1 produces a tight 2-approximate solution, i.e., $W(P) \leq 2 \cdot W^*$.

Proof. Consider any node $i \in V \setminus \{s, t\}$. Let $E'(i) = \{e'_1, e'_2, \ldots, e'_{M/2}\}$ denote the edges of an optimal solution that are adjacent to node $i$, sorted in ascending order from light to heavy. Let $E(i) = \{e_1, e_2, \ldots, e_{M/2}\}$ denote the first $M/2$ edges adjacent to $i$ in $P$, with respect to the order in which Algorithm 1 picked them. By construction, we have that for all edges in $E(i)$:

$$w(e_1) \leq w(e_2) \leq \ldots \leq w(e_{M/2}). \tag{6}$$

We note that we can truncate the list of $i$ at $M/2$, since if $i$ has more than $M/2$ edges adjacent to it in $P$, then any such edge $(i, j)$ would also appear in $E(j)$ (among the first $M/2$ edges adjacent to $j$). Thus, the union of the lists $E(i)$ contains all edges in $P$. We now prove by induction that for any $n \geq 1$,

$$w(e_n) \leq w(e'_n). \tag{7}$$

• Base case ($n = 1$): $w(e_1) \leq w(e'_1)$, since by construction of Algorithm 1, edge $e_1$ is the lightest edge adjacent to node $i$.

• Induction step: assume $w(e_n) \leq w(e'_n)$; then it must hold that $w(e_{n+1}) \leq w(e'_{n+1})$. Otherwise, if $w(e_{n+1}) > w(e'_{n+1})$, then $e'_{n+1}$ would have been considered before $e_{n+1}$ and also chosen by Algorithm 1.

Thus,

$$\sum_{j=1}^{M/2} w(e_j) \leq \sum_{j=1}^{M/2} w(e'_j).$$

To complete the proof, our goal is to charge the weight of the edges in $P$ to the weight of the edges in the optimal solution based on the above inequality. However, note that an edge $(i, j) \in P$ may appear in only one of the lists $E(i)$ or $E(j)$, while an edge in the optimal solution always appears in two lists (those of its endpoints). Thus, for example, two edges in $P$, $(i, j)$ and $(i', j)$, may charge their weight to the same edge in the optimal solution. But this "double" charging can happen at most twice, hence:

$$W(P) \leq 2 W^*. \tag{8}$$

In Appendix A.2 we show with an example that this upper bound is tight. □
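A compact sketch of Algorithm 1 on a dense $M \times M$ block, using array indices instead of explicit graph edges (an implementation choice of ours); as in the algorithm, a row or column may end up with more than the required number of pruned elements, which is feasible since N:M only demands at least $N$ zeros per block.

```python
import numpy as np

def greedy_transposable_mask(w_block: np.ndarray, n: int) -> np.ndarray:
    """Algorithm 1: scan |W[i, j]| from light to heavy and prune an element
    while its row or column still needs more pruned entries. The sort
    dominates, giving O(M^2 log M) per M x M block; the total pruned
    magnitude is guaranteed to be within a factor of 2 of optimal."""
    m = w_block.shape[0]
    order = np.argsort(np.abs(w_block), axis=None)   # light-to-heavy edge list
    row_deg = np.zeros(m, dtype=int)                 # pruned-so-far per row
    col_deg = np.zeros(m, dtype=int)                 # pruned-so-far per column
    mask = np.ones((m, m), dtype=np.int8)
    for flat in order:
        i, j = divmod(int(flat), m)
        if row_deg[i] < n or col_deg[j] < n:         # the greedy condition
            mask[i, j] = 0
            row_deg[i] += 1
            col_deg[j] += 1
    return mask

mask = greedy_transposable_mask(np.random.randn(8, 8), n=4)
print((mask == 0).sum(axis=0).min(), (mask == 0).sum(axis=1).min())  # both >= 4
```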
5. Mask Diversity
Structured sparsity requires the mask to have some hardware-friendly pattern. In this section, we argue that the more flexible the required sparsity constraint is (e.g., fine-grained sparsity with a large block size), the less we are prone to accuracy degradation. Consequently, as expected and well explored (Frankle & Carbin, 2018; Renda et al., 2020b), unstructured sparsity, which puts no requirements on the sparsity structure, achieves the best sparsity levels.
Algorithm 1: 2-approximation algorithm

Input: graph G = (V, E)
Initialize P = ∅
Sort the list of coefficient edges from light to heavy
Let A = [e_1, ..., e_n] be the sorted list of edges
for each edge e_i = (u, v) ∈ A do
    if degree(u) < M/2 or degree(v) < M/2 in P then
        P ← P + e_i
    end if
end for

To quantify the constraint a specific mask enforces, we introduce a new measure we name mask diversity. The mask diversity (MD) is the number of all possible mask configurations which adhere to the mask restriction at a given sparsity level. Let us consider a weight tensor $W$ of size $n \times n$ and a desired sparsity level of $N/M$. For unstructured sparsity,
$$MD_{\text{Unstructured}} = \binom{n^2}{\frac{N}{M} n^2}. \tag{9}$$

As the block size increases, the diversity increases, which might explain the recent success of global pruning. Here we investigate per-layer block sparsity and, specifically, fine-grained N:M structured sparsity (Nvidia, 2020). This approach requires us to zero out $N$ values in each block of size $M$. Since we have $n^2/M$ blocks, this results in
$$MD_{\text{Structured}} = \left( \frac{M!}{N!\,(M-N)!} \right)^{\frac{n^2}{M}}. \tag{10}$$

In order to evaluate the mask diversity in the N:M structured transposable case, let us first assume $N = 1$. The number of possibilities in each block of size $M \times M$ is then $M!$. By repeating this process for a general $N$ in all of the $n^2/M^2$ blocks we get:
$$MD_{\text{Structured transposable}} = \left( M!\,(M-1)! \cdots (M-N+1)! \right)^{\frac{n^2}{M^2}}. \tag{11}$$

A more constrained mask is a fine-grained N:M mask with a sequential structure. Here we require that every $M$ contiguous elements contain $N$ sequential zeros. In each block of size $M$, there are $M - N + 1$ options for the position of the sequential zeros. Hence, over all $n^2/M$ blocks we get:
$$MD_{\text{Sequential}} = \left( M - N + 1 \right)^{\frac{n^2}{M}}. \tag{12}$$

In Table 1 we show the MD for different constraints. Notice that the diversity decreases as we move from structured to transposable structured to sequential masks.
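The four counts are easy to tabulate; the sketch below evaluates Eqs. (9)-(12) as reconstructed here, with the matrix size $n = 64$ chosen arbitrarily for illustration (the values are astronomically large, so we print orders of magnitude).

```python
from math import comb, factorial, prod

def md_unstructured(n_dim, n, m):
    """Eq. (9): choose the zero positions freely among all n^2 weights."""
    return comb(n_dim**2, n * n_dim**2 // m)

def md_structured(n_dim, n, m):
    """Eq. (10): choose n zeros independently in each of the n^2/M blocks."""
    return comb(m, n) ** (n_dim**2 // m)

def md_transposable(n_dim, n, m):
    """Eq. (11): n zeros per row *and* column of each M x M block."""
    per_block = prod(factorial(m - k) for k in range(n))  # M! (M-1)! ... (M-N+1)!
    return per_block ** (n_dim**2 // m**2)

def md_sequential(n_dim, n, m):
    """Eq. (12): a run of n contiguous zeros has M - N + 1 positions per block."""
    return (m - n + 1) ** (n_dim**2 // m)

for name, md in [("unstructured 4:8", md_unstructured(64, 4, 8)),
                 ("structured 4:8", md_structured(64, 4, 8)),
                 ("structured 2:4", md_structured(64, 2, 4)),
                 ("transposable 4:8", md_transposable(64, 4, 8)),
                 ("sequential 4:8", md_sequential(64, 4, 8))]:
    print(f"{name}: ~10^{len(str(md)) - 1}")
```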
Table 1. MD for different mask constraints (Structured, Transposable structured, Sequential) for several N:M configurations.

In Fig. 5 we show the $\ell_1$ norm of the last layer of a trained ResNet-50 masked with different structured pruning masks. We note that 2:4 structured and 4:8 transposable structured have (almost) similar MD measures, which translates to similar $\ell_1$ norms as well.
Figure 5. Magnitude of the last layer's weight tensor of ResNet-50 (pretrained dense model), masked with the structured 4:8, 2:4, 4:8 transposable ("4:8-T"), and 4:8 sequential ("4:8-S") masks, normalized by unstructured 50% sparsity ("US"). Notice that mask diversity is correlated with magnitude preservation. As expected, the 4:8 transposable mask has a similar $\ell_1$ score to the 2:4 mask.

In order to show the correlation between MD and the accuracy of the model, we show in Fig. 6 the accuracy of ResNet18 on the Cifar100 dataset while inducing different sparsity masks with the same sparsity ratio of 50%. As expected, our mask-diversity measure correlates with the pruned model accuracy.
6. Experiments
In this section, we demonstrate the effectiveness of our proposed transposable N:M fine-grained structured sparsity on computer vision and natural language processing tasks. We compare the suggested method over two different initializations: (i) initialize from a trained dense model and train with a fixed mask, similar to ASP (Nvidia, 2020); (ii) train from scratch and update the mask frequently, similar to Zhou et al. (2021).
Figure 6. ResNet18 on Cifar100: accuracy at a weight sparsity of 50% using different structured masks. Notice that the more constraints we impose on the mask, the more we affect the pruned network accuracy.

We show comparable accuracy to previous methods, while achieving a significant reduction in training resources by exploiting the sparse tensor core abilities, allowing their use in both forward and backward passes. In all the experiments we use a transposable 4:8 mask, which, as shown in Table 1, has a similar MD to the 2:4 mask used in previous works (Nvidia, 2020; Zhou et al., 2021). The experimental settings appear in Appendix A.4.
6.1. Initializing from a pretrained dense model

We evaluate the suggested N:M transposable mask using a trained dense model as initialization. In order to find the transposable mask, we solve the min-cost flow reduction (Section 4) on the dense trained network and then fix the mask. In Table 2 we compare our method with ASP (Nvidia, 2020) on classification (ResNet50, ImageNet dataset), detection (MaskRCNN, COCO dataset), and question answering (BERT-large, SQuAD dataset) tasks. Notice that the initialization in both methods is similar; however, in the training phase, we allow a 2x acceleration through the use of the 4:8 transposable mask, in comparison to the 2:4 non-transposable mask used in ASP.

6.2. Training from scratch

In order to avoid training a dense model, we also evaluate the proposed transposable N:M mask in the training-from-scratch setting. Similar to Zhou et al. (2021), we keep a dense copy of the weights, and before each forward pass we mask the weights with an N:M transposable mask. In contrast to Zhou et al. (2021), who changed the mask every iteration, we found that we can use the 2-approximation scheme to extract the transposable mask every 40 iterations. Empirically, we found that the 2-approximation scheme is on average within a factor of 1.2 of the optimal mask. The hyper-parameters used for training are equal to the ones suggested by Zhou et al. (2021). In Table 3 we test the proposed method on ResNet18, ResNet50, ResNext50, and Vgg11 (ImageNet dataset) and on fine-tuning of BERT (SQuAD-v1.1 dataset), and compare to the results of Zhou et al. (2021). As can be seen, we achieve comparable accuracy with a 2x speedup in the training process.
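To make the from-scratch recipe concrete, here is a schematic PyTorch training step. The straight-through masking follows the description above, while `make_mask`, the linear-layer shapes, and the omission of SR-STE's extra weight decay are simplifying assumptions of ours.

```python
import torch

def train_step_with_transposable_mask(weight, mask, x, y, loss_fn, opt,
                                      step, make_mask, update_every=40):
    """One training step in the from-scratch setting: keep a dense copy of
    the weights, recompute the transposable mask every `update_every` steps
    (e.g., with the greedy 2-approximation), and apply it with a
    straight-through estimator so the dense copy still receives gradients.
    `make_mask` is a hypothetical per-tensor mask routine (see Algorithm 1)."""
    if step % update_every == 0:
        mask = make_mask(weight.detach())
    # Straight-through: forward uses masked weights, backward flows densely.
    w_masked = weight + (mask * weight - weight).detach()
    loss = loss_fn(x @ w_masked.T, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return mask, loss

# Sketch of usage with a single linear layer:
w = torch.randn(64, 128, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)
```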
Table 2. Comparison of the suggested method with ASP (Nvidia, 2020), initialized from a dense model, on ResNet50 (ImageNet dataset), BERT-large (SQuAD dataset), and MaskRCNN (COCO dataset). We use a transposable 4:8 mask while ASP uses 2:4. The use of the transposable mask allows a 2x speedup by enabling sparse multiplication in both forward and backward passes.

Model (Metric)      Method      Accuracy    Sparse core utilization
ResNet18 (Top1)     Baseline    69.7%       0%
                    Ours        70.06%      66%
ResNet50 (Top1)     Baseline    76.15%      0%
                    ASP         76.6%       33%
                    Ours        76.6%       66%
BERT-large (F1)     Baseline    91.1        0%
                    ASP         91.5        33%
                    Ours        91.67       66%
MaskRCNN (AP)       Baseline    37.7        0%
                    ASP         37.9        33%
                    Ours        37.84       66%
7. Conclusions
In this work, we analyze the constraints introduced by block sparsity. We discuss the limitations of current research in the field and suggest two simple methods (a pruning bias fix and AdaPrune) to transform an unstructured sparse model into a fine-grained structured sparse model with little to no training. We managed to reduce the accuracy degradation caused by forcing an N:M pattern onto an unstructured sparse mask. For example, in ResNet50 we reduce the degradation to less than 1% from the unstructured model without any retraining. Furthermore, with a light training procedure over a calibration set (i.e., AdaPrune) we can compress the model by up to 3x.

In addition, we discuss the inherent problem of accelerating sparse training and suggest a novel N:M transposable mask which enables accelerating the backward phase as well. We formulate the question of finding the optimal mask as a minimum-cost flow problem and show no accuracy degradation on a variety of tasks, with a 2x acceleration in comparison to previous methods (Nvidia, 2020). Moreover, we design a new fast algorithm (with almost linear complexity) that is guaranteed to be within a factor of 2 of the optimal transposable mask's $\ell_1$ norm, and we use it to train a sparse model from scratch while accelerating both the forward and backward phases. We believe this work paves the path toward truly efficient sparse training.
Table 3. Training from scratch of ResNet18, ResNet50, ResNext50, and Vgg11 on the ImageNet dataset, and fine-tuning of BERT-base on the SQuAD dataset, using the proposed 2-approximation scheme. We show comparable results to N:M-SS (Zhou et al., 2021) with training acceleration.

Model (Metric)      Method      Accuracy    Sparse core utilization
ResNet18 (Top1)     Baseline    70.54%      0%
                    N:M-SS      71.2%       33%
                    Ours        70.75%      66%
ResNet50 (Top1)     Baseline    77.3%       0%
                    N:M-SS      77.4%       33%
                    Ours        77.1%       66%
ResNext50 (Top1)    Baseline    77.6%       0%
                    Ours        77.4%       66%
Vgg11 (Top1)        Baseline    69%         0%
                    Ours        68.8%       66%
BERT-base (F1)      Baseline    88.52       0%
                    Ours        88.38       66%
References
Ahuja, R. K., Magnanti, T. L., and Orlin, J. B. Network Flows. 1988.

Banner, R., Hubara, I., Hoffer, E., and Soudry, D. Scalable methods for 8-bit training of neural networks. In NeurIPS, 2018a.

Banner, R., Nahshan, Y., Hoffer, E., and Soudry, D. Post-training 4-bit quantization of convolution networks for rapid-deployment. arXiv preprint arXiv:1810.05723, 2018b.

Bellec, G., Kappel, D., Maass, W., and Legenstein, R. Deep rewiring: Training very sparse deep networks. arXiv preprint arXiv:1711.05136, 2017.

Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020.

Dettmers, T. and Zettlemoyer, L. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.

Evci, U., Gale, T., Menick, J., Castro, P. S., and Elsen, E. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pp. 2943-2952. PMLR, 2020.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. ArXiv, abs/2101.03961, 2021.

Finkelstein, A., Almog, U., and Grobman, M. Fighting quantization bias with bias. arXiv preprint arXiv:1906.03193, 2019.

Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2018.

Gray, S., Radford, A., and Kingma, D. P. GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224, 2017.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural networks. ArXiv, abs/1506.02626, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hinton, G. E., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531, 2015.

Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869-6898, 2017.

Hubara, I., Nahshan, Y., Hanani, Y., Banner, R., and Soudry, D. Improving post training neural quantization: Layer-wise calibration and integer programming. ArXiv, abs/2006.10518, 2020.

Janowsky, S. A. Pruning versus clipping in neural networks. Physical Review A, 39(12):6600-6603, 1989. URL https://link.aps.org/doi/10.1103/PhysRevA.39.6600.

Karnin, E. D. A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2):239-242, 1990.

Lee, N., Ajanthan, T., and Torr, P. SNIP: Single-shot network pruning based on connection sensitivity. In ICLR, 2019.

Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. In ICLR, 2017.

Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.

Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L0 regularization. In ICLR, 2018.

Luo, J.-H., Wu, J., and Lin, W. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5068-5076, 2017.

Marcel, S. and Rodriguez, Y. Torchvision: The machine-vision package of Torch. In Proceedings of the 18th ACM International Conference on Multimedia, MM '10, pp. 1485-1488, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781605589336. doi: 10.1145/1873951.1874254. URL https://doi.org/10.1145/1873951.1874254.

Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):1-12, 2018.

Mostafa, H. and Wang, X. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, pp. 4646-4655. PMLR, 2019.

Mozer, M. C. and Smolensky, P. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, pp. 107-115, 1989a.

Mozer, M. C. and Smolensky, P. Using relevance to reduce network size automatically. Connection Science, 1(1):3-16, 1989b.

Nagel, M., Amjad, R. A., van Baalen, M., Louizos, C., and Blankevoort, T. Up or down? Adaptive rounding for post-training quantization. In ICML, 2020.

Nahshan, Y., Chmiel, B., Baskin, C., Zheltonozhskii, E., Banner, R., Bronstein, A. M., and Mendelson, A. Loss aware post-training quantization. ArXiv, abs/1911.07190, 2019.

Nvidia. Nvidia deep learning examples for tensor cores. 2018. URL https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch.

Nvidia. NVIDIA A100 tensor core GPU architecture. 2020.

Renda, A., Frankle, J., and Carbin, M. Comparing rewinding and fine-tuning in neural network pruning. ArXiv, abs/2003.02389, 2020a.

Renda, A., Frankle, J., and Carbin, M. Comparing rewinding and fine-tuning in neural network pruning. In ICLR, 2020b.

Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2820-2828, 2019.

Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074-2082, 2016.

Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., and Keutzer, K. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10734-10742, 2019.

Zhou, A., Ma, Y., Zhu, J., Liu, J., Zhang, Z., Yuan, K., Sun, W., and Li, H. Learning N:M fine-grained structured sparse neural networks from scratch. In ICLR, 2021.
A. Supplementary Material
A.1. Additional AdaPrune experiments
To further examine AdaPrune's capabilities, we checked two additional settings: (a) starting from a pretrained dense model, and (b) starting from a less constrained N:M mask.

A.1.1. AdaPrune from dense
While this case is more common, we expect to see some degradation, as we know that we have 50% mask violations. Yet, as can be seen in Table A.1, we managed to restore accuracy to within 2-3% of the full-precision baseline using just AdaPrune. To further improve results, we applied batch-norm tuning, as suggested by Hubara et al. (2020), and kept the first and last layers dense, which results in less than 2% degradation. We believe these to be the first tolerable post-training-pruning results reported.
Table A.1. Using AdaPrune from a dense pretrained model. AP stands for AdaPrune and BNT stands for batch-norm tuning.

Model        Dense      BiasFix    AP         AP + BNT
ResNet18     69.7%      62.47%     68.41%     68.63%
ResNet34     73.3%      68.72%     72.15%     72.36%
ResNet50     76.1%      67.42%     74.41%     74.75%
ResNet101    77.27%     71.54%     76.36%     76.48%
A.1.2. AdaPrune from N:M sparse

In Section 5 we explained why the mask diversity decreases as the block size decreases. Thus, we expect to have many violations when a pretrained sparse model with an $N_1{:}M_1$ pattern is translated to an $N_2{:}M_2$ pattern, for $N_1 > N_2$ and $M_1 > M_2$. We argue that this might be a common case in the future, as different hardware vendors will support different formats. In Table A.2 we show the results of converting a ResNet-50 model trained with a 4:8 sparsity pattern to 2:4 and 1:2 patterns. As can be seen, converting from 4:8 to 2:4 produces results with negligible accuracy degradation (less than 0.5%). Therefore, we argue that AdaPrune is an efficient and useful approach to convert models which were optimized on different hardware than the one in use, as it removes the need for full sparse training. This is even more important when the training data is not available.

A.2. A tight example for the 2-approximation factor in the Lemma
In the following, we show that the 2-approximation upper bound (proven in the Lemma) is asymptotically tight using a tight example. Assume we want to zero one element in each row and column of the 4 x 4 block presented in Fig. A.1a, using the 2-approximation algorithm (Algorithm 1).
Table A.2. Using AdaPrune to convert from one sparse pattern to another. The baseline model was trained with 4:8 sparsity (90 epochs); thus, the 4:8 column is the baseline. BNT stands for batch-norm tuning.

Model      4:8       2:4      2:4 + BNT    1:2      1:2 + BNT
RN50       76.5%     76.2%    76.4%        74.6%    75.1%
RN50-T     77.1%     76.3%    76.4%        74.7%    75.1%
RN18-T     70.75%    70.1%    70.2%        68.9%    69.2%

First, we need to convert the block into a directed bipartite graph (as suggested in Fig. 4). This construction appears in Fig. A.1b. Next, we sort the edges from light to heavy and go over the sorted list. In Fig. A.1c we show the seven iterations of the 2-approximation algorithm. All edges are added to the list of chosen edges $P$ up until iteration 7. The algorithm stops at iteration 7 since, after adding the seventh link, every node is "covered" by at least one edge (alternatively, each row and each column has at least one item chosen for pruning). Note that the optimal solution would choose the edges that correspond to the elements on the diagonal (i.e., $u_1 \to v_1$, $u_2 \to v_2$, $u_3 \to v_3$, and $u_4 \to v_4$), summing to a total weight of 4. Hence, we get an approximation ratio of 7/4. It is easy to see that when using the same construction for a general block of size $M \times M$, we get an approximation ratio of $(2M - 1)/M$, asymptotically converging to 2 as $M \to \infty$.

A.3. Run-time analysis
In this section we specify the running times of different min-cost flow methods and compare them with the running time of our 2-approximation method. Ahuja et al. (1988) specify the running times of six min-cost flow algorithms, two of which have distinctively better performance on our construction compared to the others. The running time complexities of these two methods depend on the following parameters: the number of nodes $n$, the number of edges $m$, the largest weight coefficient $W$, and the flow demand $U$. The cost-scaling method has a running time of $O(n^3 \log(n \cdot W))$, while the capacity-scaling method has a running time of $O(m(m + n \log n) \log U)$. For a block of size $M \times M$, our construction process creates $m = M^2 + 2M$ edges, $n = 2M + 2$ nodes, and a flow demand of $U = 0.5 M^2$. This boils down to running times of $O(M^3 \log(M \cdot W))$ and $O(M^4 \log(M))$ for the cost-scaling and the capacity-scaling methods, respectively. Finally, assuming the weights are represented in $b$ bits, we have $\log(W) = b$, and therefore solving our construction using the cost-scaling method has a running time complexity of $O(M^3 (\log(M) + b))$. In Table A.3 we summarize the complexity of these methods.
Block of size × where we want to zero one element in each row and column using 2-approximate algorithm. (b): Block in (a) represented in a direct bipartite graph. (c):
The 7 iterations of the 2-approximate algorithm on graph (b). Notice we get anapproximate ratio between the 2-approximate solution and the optimal solution of . Table A.3.
Table A.3. Running times of the min-cost flow based implementations and the 2-approximation method.

Method             Complexity
Cost-scaling       O(M^3 (log(M) + b))
Capacity-scaling   O(M^4 log(M))
2-approximation    O(M^2 log(M))

A.4. Experiment Settings

AdaPrune:
We used a small calibration set of 1000 images (one per class). We ran AdaPrune for 1000 iterations with a batch size of 100. For the results in the supplementary material, we kept the first and last layers dense.
N:M transposable sparsity mask from a pretrained model: We used the torchvision (Marcel & Rodriguez, 2010) model zoo as our pretrained dense baseline. For all ResNet models we used the original regime as given by He et al. (2016), i.e., SGD over 90 epochs, starting with a learning rate of 0.1 and decreasing it at epochs 30, 60, and 80 by a factor of 10. For BERT-large and MaskRCNN we used the default scripts as in Nvidia (2018).