Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks
Itay Hubara, Brian Chmiel, Moshe Island, Ron Banner, Seffi Naor, Daniel Soudry
Itay Hubara* 1 2   Brian Chmiel* 1 2   Moshe Island 2   Ron Banner 2   Seffi Naor 3   Daniel Soudry 1

{ihubara, bchmiel, misland, rbanner}@habana.ai

Abstract
Recently, researchers proposed pruning deep neural network weights (DNNs) using an N:M fine-grained block sparsity mask. In this mask, for each block of M weights, we have at least N zeros. In contrast to unstructured sparsity, N:M fine-grained block sparsity allows acceleration in actual modern hardware. So far, this was used for DNN acceleration at the inference phase. First, we suggest a method to convert a pretrained model with unstructured sparsity to an N:M fine-grained block sparsity model, with little to no training. Then, to also allow such acceleration in the training phase, we suggest a novel transposable fine-grained sparsity mask, where the same mask can be used for both forward and backward passes. Our transposable mask ensures that both the weight matrix and its transpose follow the same sparsity pattern; thus the matrix multiplication required for passing the error backward can also be accelerated. We discuss the transposable constraint and devise a new measure for mask constraints, called mask diversity (MD), which correlates with their expected accuracy. Then, we formulate the problem of finding the optimal transposable mask as a minimum-cost flow problem and suggest a fast linear approximation that can be used when the masks dynamically change while training. Our experiments suggest a 2x speed-up with no accuracy degradation over vision and language models. A reference implementation can be found at https://github.com/papers-submission/structured_transposable_masks.

*Equal contribution. 1 Electrical Engineering Department, Technion, Haifa, Israel. 2 Habana Labs - An Intel company, Caesarea, Israel. 3 Computer Science Department, Technion, Haifa, Israel. Correspondence to: Itay Hubara <[email protected]>. Preprint. Under review.
1. Introduction
Deep neural networks (DNNs) have established themselves as the first-choice tool for a wide range of applications, including computer vision and natural language processing. However, their impressive performance comes with a price of extensive infrastructure costs, as they may contain trillions of parameters (Fedus et al., 2021) and require thousands of petaflops (Brown et al., 2020) for the training process. For this reason, compressing the DNN training and inference process is a leading research topic in academia and industry. The main compression techniques include quantization (Banner et al., 2018a; Nahshan et al., 2019), knowledge distillation (Hinton et al., 2015), and pruning (Han et al., 2015; Li et al., 2017).

Pruning DNNs is one of the most popular and widely studied methods to improve DNN resource efficiency. The different pruning methods can be categorized into two groups: unstructured and structured pruning. While the former can achieve a very high compression ratio, it usually fails to reduce the computation footprint on modern hardware. On the other hand, structured pruning methods, such as block (Wen et al., 2016) or filter (Li et al., 2017) pruning, are more hardware friendly. Unfortunately, these methods usually fail to keep the original accuracy at high compression ratios (Renda et al., 2020a).

Recently, Nvidia (Nvidia, 2020) announced the A100 GPU, containing sparse tensor cores which are able to accelerate fine-grained sparse matrix multiplication. The sparse tensor cores in the A100 enable a 2x acceleration of regular matrix multiplication in DNNs, $Y = WX$, where $W$ and $X$ are weight and input matrices, respectively. The only requirement is that $W$ have a fine-grained 2:4 sparsity structure, i.e., out of every four contiguous elements in $W$, two are pruned. Consequently, models with unstructured sparsity would, in general, suffer a severe degradation in accuracy when forced to adhere to an N:M sparsity structure. In our first contribution of this paper (Section 3):

• We analyze the setting of converting a pretrained model with unstructured sparsity to an N:M sparsity structure. We suggest two light methods that together can prevent the degradation when less than 10% of the active weights must be set to zero.

However, more commonly, only a pretrained dense model is given. Therefore, Nvidia (2020) suggested a two-fold scheme for pruning a dense model: (a) define a fixed mask which, for every weight tensor, prunes the two smallest-magnitude elements out of each block of four contiguous elements, and (b) retrain the masked weights using the original training schedule. Indeed, the Nvidia (2020) approach is very appealing for the common case where a pretrained dense model is given.

While the Nvidia (2020) method works well on many models, a pretrained model is not always given. Thus, Zhou et al. (2021) suggested a method that trains a model with an N:M fine-grained mask from scratch, using a sparse-refined straight-through estimator (SR-STE). Similar to quantization-aware-training methods (Hubara et al., 2017), they maintain a dense copy of the weights and prune it every iteration, while keeping the gradients oblivious to that process using the straight-through estimator (Bengio et al., 2013). Since the mask dynamically changes while training, they suggest adding an extra weight decay on the masked (i.e.,
pruned) elements to reduce mask changes during the training process.

The rest of the paper focuses on accelerating sparse training in the two settings detailed above (starting from a pretrained model or starting from scratch). Training DNNs requires three matrix multiplications per layer: one for the forward pass, one for the backward pass, and the last for the weight update. Both the forward and the backward matrix multiplications involve the weight matrix $W$, yet the methods suggested in Zhou et al. (2021); Nvidia (2020) accelerate only the forward-pass matrix multiplication:

$$Y = WX. \tag{1}$$

Therefore, the current methods use the sparse tensor cores to accelerate approximately a third of the total computation during training. We note that, for the backward pass, the sparse tensor cores cannot be utilized, even if $W$ has fine-grained sparsity. This is because the transposed matrix $W^T$ is used for the backward-pass multiplication:

$$\frac{\partial \mathrm{Loss}}{\partial X} = \frac{\partial \mathrm{Loss}}{\partial Y} W^{T}, \tag{2}$$

and $W^T$ does not generally have an N:M fine-grained sparsity structure, even if $W$ has such a structure.

We propose a novel N:M transposable fine-grained sparsity mask, where the same mask can be used for both the forward and backward passes. Our suggested mask contains only $M - N$ non-zero elements for every $M$ contiguous elements, simultaneously in both $W$ and $W^T$. In Fig. 1 we emphasize the difference between previously suggested methods of N:M fine-grained structured pruning (Zhou et al., 2021; Nvidia, 2020), which accelerate only the forward pass, and our suggested method, which is able to accelerate the backward pass as well. In this setting:

• For the case of pretrained dense models, we suggest a novel method for training models with the N:M transposable fine-grained sparsity mask, exploiting modern sparse tensor cores to allow acceleration of the forward and backward passes. Our method uses a novel algorithm to determine the optimal transposable mask using a reduction to the min-cost flow problem (Section 4).

• For the case where the model is trained from scratch, we define an approximation scheme with an (almost) linear (in input size) time complexity that produces a mask whose $\ell_1$-norm is within a factor of 2 of the optimal mask (Section 4).

• We suggest a new measure called mask diversity, which, to the best of our knowledge, provides the first connection between the mask constraints and network accuracy for a fixed sparsity ratio (Section 5).
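To make the transposable constraint concrete, the following sketch checks whether a binary mask satisfies N:M fine-grained sparsity for both $W$ and $W^T$. The helper names and the regular example pattern are ours, for illustration only; they are not part of the reference implementation.

```python
import numpy as np

def is_nm_sparse(mask: np.ndarray, n: int, m: int) -> bool:
    """Check that every block of m contiguous elements along each row
    of `mask` contains at least n zeros (N:M fine-grained sparsity)."""
    rows, cols = mask.shape
    assert cols % m == 0, "row length must be divisible by the block size"
    blocks = mask.reshape(rows, cols // m, m)
    zeros_per_block = m - blocks.sum(axis=-1)
    return bool((zeros_per_block >= n).all())

def is_transposable_nm_sparse(mask: np.ndarray, n: int, m: int) -> bool:
    """A transposable N:M mask must satisfy the constraint for both the
    matrix and its transpose, so one mask can serve Eq. (1) and Eq. (2)."""
    return is_nm_sparse(mask, n, m) and is_nm_sparse(mask.T, n, m)

# A 4:8 mask that happens to be transposable: a 4-regular pattern
# inside an 8x8 block (4 kept elements in every row *and* column).
block = np.zeros((8, 8), dtype=np.int8)
for i in range(8):
    block[i, (np.arange(4) + i) % 8] = 1
print(is_nm_sparse(block, 4, 8))               # True
print(is_transposable_nm_sparse(block, 4, 8))  # True
```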
2. Related work
Pruning of neural network weights has been extensively investigated, starting with classical methods in the late 1980s (Janowsky, 1989; Mozer & Smolensky, 1989a;b; Karnin, 1990) and amounting to dozens of papers published in recent years. Since DNNs are generally over-parameterized, pruning the weights reduces their memory footprint. In special cases, when the sparsity mask has a specific pattern, it has the potential to reduce the computation footprint as well. The most common practice is to prune a pretrained dense model so it will be sparse at deployment. Since the pretrained dense model accuracy is known, one can tune the sparsity level to ensure comparable accuracy for the sparse model. Recently, a new line of research that aims to train sparse models from scratch (Gray et al., 2017; Dettmers & Zettlemoyer, 2019; Evci et al., 2020) has emerged. Its goal is to train models that cannot fit into currently available hardware. Next, we briefly overview the structured, unstructured, and sparse-training-from-scratch categories.
Unstructured pruning removes individual elements of the matrix, aiming for high total sparsity while being agnostic to the location of the pruned elements.
Figure 1. (a): A 4:8 structured fine-grained pruning as used in Zhou et al. (2021); Nvidia (2020), capable of accelerating only the forward pass with sparse tensor cores. (b): The suggested 4:8 transposable structured fine-grained pruning, capable of accelerating both the forward and backward passes with sparse tensor cores.
Standard pruning methods are based on different criteria, such as magnitude (Han et al., 2015), approximated $L_0$ regularization (Louizos et al., 2018), or connection sensitivity (Lee et al., 2019). Recent methods (Frankle & Carbin, 2018) suggested training a dense network until convergence, extracting the required mask (the "winning ticket"), and using the original training regime to re-train the active weights from their original initialization or final values (Renda et al., 2020b). These methods are able to achieve over 80% sparsity on ResNet-50 with the ImageNet dataset (Renda et al., 2020b). Despite the high sparsity ratio that can be achieved with these methods, modern hardware cannot efficiently utilize such a form of sparsity to reduce computation resources (Nvidia, 2020).

Structured pruning removes weights in specific location-based patterns, which are more useful for hardware acceleration. Such methods can be applied at the level of channels or layers. For example, Li et al. (2017) removed the channels with the lowest norm, Luo et al. (2017) pruned channels according to their effect on the activation of the following layer, and Wen et al. (2016) split the filters into multiple groups and applied a group Lasso regularization. All these methods are natively supported in both hardware and software, as they effectively change the model structure by reducing channels or groups. Yet, no method was able to achieve a reasonable accuracy at sparsity levels higher than 50%. As observed by Liu et al. (2018), filter pruning of a pretrained dense over-parameterized model is rarely the best method to obtain an efficient final model. Thus, here structured pruning serves mostly as a DNN architecture search for the optimal compact model (Tan et al., 2019; Wu et al., 2019). Our work is most closely related to Zhou et al. (2021), which is the first work that attempted training with a fine-grained N:M structured sparsity mask, as explained above.

Sparse training from scratch:
Gray et al. (2017) were the first to introduce this approach for NLP tasks. They investigated a fixed mask in which blocks of size N x N are either pruned (i.e., all elements are set to zero) or not. They implemented dedicated GPU kernels for several block sizes (N = 8, 16, 32) and reported their results. While they managed to achieve slightly less than a 2x training speedup for 50% sparsity, they observed higher speedups as the sparsity level increases (5x for a 90% sparsity level). They initialized their masks using predefined schemes and fixed them throughout the training process. As expected, this method resulted in accuracy degradation if one does not expand the model size. Thus, several researchers (Bellec et al., 2017; Mocanu et al., 2018; Mostafa & Wang, 2019; Dettmers & Zettlemoyer, 2019; Evci et al., 2020) tried to enable dynamic mask changes during the training process. All of these methods focused on unstructured sparsity and aimed to enable training large models on hardware with memory limitations. The first to introduce this approach were Bellec et al. (2017), who essentially applied a random walk in parameter space: at initialization, connections are assigned a pre-defined sign at random; if during optimization the sign is flipped, the weight is pruned and a new weight is activated at random (i.e., regrown). Sparse Evolutionary Training (SET) (Mocanu et al., 2018) proposed a simpler scheme where weights are pruned based on their magnitude. Dettmers & Zettlemoyer (2019) replaced the random regrowth with a per-layer mean-momentum-magnitude redistribution. Evci et al. (2020) expanded the Mocanu et al. (2018) mask initialization scheme to convolution kernels and suggested pruning and regrowing based on dense gradients, calculated once every several iterations. As opposed to these approaches, Zhou et al. (2021) does not aim to reduce the model memory footprint and keeps a dense weight matrix. At each iteration, they re-calculate the mask and use it to obtain a pruned copy of the weights, which then serves for the forward and backward passes. Therefore, this method is most relevant when a pretrained dense model is not given, and one wishes to obtain a fine-grained sparse model for inference.
3. From unstructured to structured sparsity
Most DNN pruning methods focus on unstructured pruning, which reduces the memory footprint. However, current hardware implementations suggest that, unless very high sparsity levels are achieved, the model cannot be accelerated at all. Thus, commonly, the weights are simply decompressed before multiplication. Forcing structured sparsity on a model that was trained without it leads to a severe accuracy degradation, as several bits of the mask may change to satisfy the structured sparsity requirements.

In this section, we study the probability that an unstructured mask would not violate any N:M constraint (Nvidia, 2020). We then discuss two light methods to bridge the gap when a sparse model is given but the hardware does not support its structure.

Probability of violating the N:M constraint in unstructured sparsity. Let $X = \{x_1, x_2, \ldots, x_M\}$ be a block of independent and identically distributed random variables. Assume that with probability $\rho$, $x_i$ can be pruned without accuracy degradation (i.e., unstructured pruning). In this section, we consider a general form of block sparsity in which, for a block of size $M$, at least $N$ values can be pruned. Define $X$ to be N:M sparse if this $M$-sized block has at least $N$ values that can be pruned without harming accuracy. The probability of having an N:M sparse block is given by the binomial distribution, and so

$$P(X \text{ is } N{:}M \text{ sparse}) = \sum_{i \geq N} \binom{M}{i} \rho^i (1-\rho)^{M-i}. \tag{3}$$

This probability goes to 1 as $M \to \infty$ if $N/M = c < \rho$. In Fig. 2 we demonstrate Eq. (3) for a fixed $\rho$ and various block sizes.

To force a given sparse model to have fine-grained N:M sparsity, we need to make sure that $N$ out of every $M$ contiguous elements are zero. Therefore, as in Nvidia (2020), in each block we prune the $N$ weights with the lowest magnitude (including any zero weights, e.g., non-active). Forcing this pattern on an existing unstructured mask might remove active (non-zero) weights, i.e., flip some of the mask values from one to zero. We call these required flips pattern-violations. Removing active weights without retraining tends to severely degrade the model accuracy. To demonstrate the problem, we used an unstructured sparse pretrained ResNet-50 model and set the N:M structure per layer, based on Eq. (3), such that the probability of a pattern-violation would be equal to or less than a given percentage. Here we used a block size of M = 8. As can be seen in Fig. 3, without any optimization, even a small percentage of pattern-violations results in severe degradation. Next, we detail two light methods to boost the accuracy.
Figure 2. Eq. (3) for a fixed $\rho$ and various block sizes $M$. We have a sharp ("phase") transition at $N/M = \rho$. Specifically, (i) when $N/M \leq \rho$ we have a probability larger than 0.5 that the sampled block is N:M sparse; (ii) when $N/M \geq \rho$ this probability quickly decreases to zero. As the block size $M$ increases, this phase transition gets sharper. As expected, when $M \to \infty$, unstructured sparsity satisfies the structured constraints, and we expect it to display the phase transition precisely at the critical point $\rho$.
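Eq. (3) is cheap to evaluate directly. The short script below (ours, with $\rho = 0.5$ chosen only for illustration) reproduces the sharp transition shown in Fig. 2.

```python
from math import comb

def p_nm_sparse(rho: float, n: int, m: int) -> float:
    """Eq. (3): probability that a block of m i.i.d. weights, each prunable
    with probability rho, contains at least n prunable elements."""
    return sum(comb(m, i) * rho**i * (1 - rho)**(m - i) for i in range(n, m + 1))

# Probe both sides of the transition at N/M = rho:
for m in (16, 32, 64, 128):
    below = p_nm_sparse(0.5, int(0.4 * m), m)   # N/M = 0.4 < rho
    above = p_nm_sparse(0.5, int(0.6 * m), m)   # N/M = 0.6 > rho
    print(f"M={m:3d}: P(N/M=0.4)={below:.3f}  P(N/M=0.6)={above:.3g}")
```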
Figure 3. Top-1 accuracy vs. percent of constraints violated, comparing no-optimization, the absorb-mean bias fix, and AdaPrune against the unstructured baseline. The numbers next to the baseline samples represent the sparsity level of the refined model.
Pruning bias fix:
Several works (Banner et al., 2018b; Finkelstein et al., 2019; Hubara et al., 2020) reported that it is important to fix the bias introduced when quantizing a model. We build on those results and suggest absorbing the mean of the $N$ pruned weights into the $M - N$ non-zeroed weights. As can be seen in Fig. 3, this simple fix, by itself, greatly boosts accuracy.
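A minimal sketch of this bias fix, under our reading that the pruned elements' mean is added to each surviving weight in the block; the helper name is hypothetical and the per-row blocking is an implementation choice of ours.

```python
import numpy as np

def force_nm_with_mean_absorb(w: np.ndarray, n: int, m: int) -> np.ndarray:
    """Prune the n smallest-magnitude elements in every block of m contiguous
    weights, then absorb the mean of the pruned elements into the m - n
    survivors (the pruning bias fix described above)."""
    w = w.reshape(-1, m).copy()
    idx = np.argsort(np.abs(w), axis=1)            # per-block order, small to large
    pruned, kept = idx[:, :n], idx[:, n:]
    rows = np.arange(w.shape[0])[:, None]
    mean_pruned = w[rows, pruned].mean(axis=1, keepdims=True)
    w[rows, kept] += mean_pruned                   # absorb the pruning bias
    w[rows, pruned] = 0.0
    return w

w = np.random.randn(4, 8)
w_sparse = force_nm_with_mean_absorb(w, n=4, m=8)
print((w_sparse == 0).sum(axis=1))                 # 4 zeros per 8-element block
```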
AdaPrune: Recently, several works (Hubara et al., 2020; Nagel et al., 2020) suggested small, fast fine-tuning techniques for post-training quantization. These techniques replace heavy full-model training with a fast per-layer optimization which requires only a few iterations to converge. While each method applies a different optimization technique, they all aim to reduce the discrepancy between the quantized and full-precision layer outputs. We adapted the parallel-AdaQuant technique (Hubara et al., 2020), which optimizes (using a small calibration set) the weights and quantization parameters to reduce the per-layer pre-activation reconstruction error, measured via mean squared error. We adjusted their objective to the pruning problem:

$$\min_{W'} \left\| WX - (\mathrm{Mask} \odot W') X \right\|_2^2, \tag{4}$$

where $W$ is the original weight layer, $W'$ is the weight layer we aim to find, $X$ is the output of the previous activation layer, Mask is the weight sparsity mask, and $\odot$ is a component-wise product. We name this method AdaPrune. In our experiments, we used 1000 images from the ImageNet training set as a calibration set. As can be seen in Fig. 3, AdaPrune is capable of correcting the remaining error and obtains less than 1% degradation from the original unstructured-sparse model counterpart. We argue that with AdaPrune, we can potentially adapt any generic mask to the hardware at hand, thus alleviating the need to retrain the model. However, usually a pretrained unstructured sparse model is not given. When starting from a dense model (thus having 50% pattern-violations), we get 2.3% degradation using AdaPrune. We discuss and extend these experiments in Appendix A.1. Thus, retraining is required to prevent such degradation. In the next sections, we discuss two scenarios: the first is the common case when a pretrained dense model is given; the second is a more challenging scenario in which one aims to train with a sparse mask from scratch.
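The following is a minimal PyTorch sketch of the AdaPrune objective in Eq. (4) for a single linear layer; the shapes, optimizer, and hyper-parameters are illustrative assumptions of ours rather than the paper's exact recipe.

```python
import torch

def adaprune_layer(w, x, mask, iters=1000, lr=1e-3):
    """Per-layer AdaPrune (Eq. (4)): optimize the surviving weights so the
    masked layer reproduces the dense layer's output on a calibration batch."""
    y_ref = x @ w.T                                   # dense pre-activations
    w_prime = (w * mask).clone().requires_grad_(True)
    opt = torch.optim.Adam([w_prime], lr=lr)
    for _ in range(iters):
        loss = ((y_ref - x @ (mask * w_prime).T) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (mask * w_prime).detach()

w = torch.randn(64, 128)
mask = (torch.rand_like(w) > 0.5).float()             # stand-in for an N:M mask
x = torch.randn(1000, 128)                            # small calibration set
w_sparse = adaprune_layer(w, x, mask)
```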
4. Computing transposable sparsity masks
In general, training DNNs requires three matrix multiplications per layer. The first multiplication is required for the forward propagation between the weights and activations (Eq. (1)). The other two multiplications are used for the backward and update phases. The backward phase calculates the gradients of the loss function with respect to the input of the layer. This is done by recursively passing the error from the last layer to the first (Eq. (2)). Note that the backward phase uses the transposed weight matrix. Hence, accelerating the backward phase requires the transposed weight matrix to adhere to the hardware-required pattern (e.g., N:M fine-grained sparsity).

In this section, we tackle this issue by presenting a novel N:M transposable fine-grained sparsity mask, where the same mask can be used to accelerate both forward and backward passes. The required mask contains only $M - N$ non-zero elements for every $M$ contiguous elements in both $W$ and $W^T$ simultaneously. We formulate the problem and suggest two methods to generate the transposable mask: the first starts from a dense model and uses a min-cost flow procedure; the second trains from scratch, using an approximation algorithm that allows more efficient training.

First, we provide an integer programming (IP) formulation for finding an optimal transposable fine-grained mask. Let us consider a block of size $M \times M$ in a weight matrix $W$. Our goal is to maximize the $\ell_1$ norm of $W$ after masking $N$ elements in each row and column. Define a binary indicator variable $I_{i,j}$, where $I_{i,j} = 1$ if and only if the element $W_{i,j}$ is kept by the chosen mask, and otherwise $I_{i,j} = 0$. The integer program is as follows:

$$\begin{aligned}
\text{Maximize} \quad & \sum_{i,j=0}^{M-1} |W_{i,j}| \cdot I_{i,j} \\
\text{s.t.} \quad & \sum_j I_{i,j} = M - N, \quad \forall i \in \{0, \ldots, M-1\} \\
& \sum_i I_{i,j} = M - N, \quad \forall j \in \{0, \ldots, M-1\} \\
& I_{i,j} \in \{0, 1\}, \quad \forall i, j \in \{0, \ldots, M-1\}
\end{aligned} \tag{5}$$

In the following, we examine several methods for solving this problem and enforcing N:M transposable fine-grained sparsity during training. We first describe an optimal, yet computationally expensive, method. Then we describe a more efficient method, which only provides an approximate, though near-optimal, solution.

General integer programs have exponential worst-case time complexity with respect to the input size. Fortunately, as Fig. 4 shows, our IP formulation in Eq. (5) can be reduced to a min-cost flow problem. Hence, by using the cost-scaling method to solve the problem (Ahuja et al., 1988), we can find the optimal transposable mask in $O(M^3 \log(M))$ time for a block size of $M \times M$.

The min-cost flow solution should be used when training from a pretrained dense model (Section 6.1), where the transposable mask is generated once and is then fixed during training. Sparse training from scratch, on the other hand, requires changing the mask during training (Section 6.2). Therefore, it is essential to find a very efficient algorithm for computing the mask. To that end, we design a light 2-approximation algorithm: for every input, it produces a solution which is guaranteed to be within a factor of 2 of an optimal solution, yet it runs in almost linear time. Specifically, we design a greedy algorithm (see Algorithm 1) with a low time complexity that can be used in practice without compromising much the quality of the solution produced.
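Before analyzing the greedy variant, here is a concrete sketch of the optimal route: the min-cost flow reduction of Fig. 4 (shown below), solved with networkx. The library choice, the integer weight scaling, and the node naming are our assumptions, not part of the paper.

```python
import networkx as nx
import numpy as np

def optimal_transposable_mask(w_block: np.ndarray, n: int) -> np.ndarray:
    """Return a 0/1 mask with exactly `n` zeros in every row and column of
    an M x M block, minimizing the total magnitude of the pruned elements."""
    m = w_block.shape[0]
    scale = 10**6                                     # network_simplex wants integer costs
    g = nx.DiGraph()
    g.add_node("s", demand=-n * m)                    # total number of pruned elements
    g.add_node("t", demand=n * m)
    for i in range(m):
        g.add_edge("s", f"r{i}", capacity=n, weight=0)    # n pruned per row
        g.add_edge(f"c{i}", "t", capacity=n, weight=0)    # n pruned per column
    for i in range(m):
        for j in range(m):
            g.add_edge(f"r{i}", f"c{j}", capacity=1,
                       weight=int(abs(w_block[i, j]) * scale))
    flow = nx.min_cost_flow(g)
    mask = np.ones((m, m), dtype=np.int8)
    for i in range(m):
        for j in range(m):
            if flow[f"r{i}"].get(f"c{j}", 0) == 1:
                mask[i, j] = 0                        # unit flow = prune W[i, j]
    return mask

mask = optimal_transposable_mask(np.random.randn(8, 8), n=4)
print(mask.sum(axis=0), mask.sum(axis=1))             # 4 kept per row and per column
```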
Figure 4. N:M transposable-sparsity optimization as a min-cost flow problem. In addition to a source and a sink, the network has a node for each row and for each column. The construction uses three types of edges: (i) source edges emanating from the source node $s$ into each row node $i$; (ii) sink edges connecting each column node $j$ with the sink node $t$; and (iii) a coefficient edge $(i, j)$ for each matrix element $W_{i,j}$. Each source edge $(s, i)$ has capacity $N$, which is equal to the number of elements that need to be selected for pruning in row $i$. Similarly, each sink edge $(j, t)$ has capacity $N$, which is equal to the number of elements pruned in column $j$. Each coefficient edge $(i, j)$ has unit capacity and cost $|W_{i,j}|$. Finally, selecting a matrix element with weight $W_{i,j}$ for pruning corresponds to a unit flow on the coefficient edge $(i, j)$. A min-cost flow from source $s$ to destination $t$ finds the lowest possible cost of sending a flow of value $N \cdot M$ from $s$ to $t$. Assuming the source and sink edges have zero cost, it is easy to see a one-to-one correspondence between a min-cost flow in this construction and an optimal transposable mask, i.e., one that minimizes the sum of the absolute values selected for pruning.

Unlike the optimal min-cost flow solution, which runs in $O(M^3)$ time for a block size of $M \times M$, Algorithm 1 has a running time of $O(M^2 \log M)$, i.e., a time complexity that is almost linear in the number of block elements $M^2$. The approximation algorithm uses the same construction described in Fig. 4, but instead of running a min-cost flow on the graph, it employs a simple greedy approach. In Appendix A.3 we analyze the running times of different min-cost flow methods and compare them with the running time of our 2-approximation method.

Let $P$ be the list of edges pruned by Algorithm 1, let $W(P)$ be the total weight of the edges in $P$, and let $W^*$ be the weight of an optimal solution (i.e., the minimal sum of edge weights that can be pruned to create an N:M transposable sparsity mask). The next lemma establishes that Algorithm 1 finds a 2-approximate solution.
Lemma. Algorithm 1 produces a tight 2-approximate solution, i.e., $W(P) \leq 2 \cdot W^*$.

Proof. Consider any node $i \in V \setminus \{s, t\}$. Let $E'(i) = \{e'_1, e'_2, \ldots, e'_{M/2}\}$ denote the edges of an optimal solution that are adjacent to node $i$, sorted in ascending order from light to heavy. Let $E(i) = \{e_1, e_2, \ldots, e_{M/2}\}$ denote the first $M/2$ edges adjacent to $i$ in $P$, with respect to the order in which Algorithm 1 picked them. By construction, we have that for all edges in $E(i)$:

$$w(e_1) \leq w(e_2) \leq \ldots \leq w(e_{M/2}). \tag{6}$$

We note that we can truncate the list of $i$ at $M/2$, since if $i$ has more than $M/2$ edges adjacent to it in $P$, then any such edge $(i, j)$ would also appear in $E(j)$ (among the first $M/2$ edges adjacent to $j$). Thus, the union of the lists $E(i)$ contains all edges in $P$. We now prove by induction that for any $n \geq 1$,

$$w(e_n) \leq w(e'_n). \tag{7}$$

• Base case ($n = 1$): $w(e_1) \leq w(e'_1)$, since by construction of Algorithm 1, edge $e_1$ is the lightest edge adjacent to node $i$.

• Induction step: assume $w(e_n) \leq w(e'_n)$; then it must hold that $w(e_{n+1}) \leq w(e'_{n+1})$. Otherwise, if $w(e_{n+1}) > w(e'_{n+1})$, then $e'_{n+1}$ would have been considered before $e_{n+1}$ and also chosen by Algorithm 1.

Thus,

$$\sum_{j=1}^{M/2} w(e_j) \leq \sum_{j=1}^{M/2} w(e'_j).$$

To complete the proof, our goal is to charge the weight of the edges in $P$ to the weight of the edges in the optimal solution based on the above inequality. However, note that an edge $(i, j) \in P$ may appear in only one of the lists $E(i)$ or $E(j)$, while an edge in the optimal solution always appears in two lists (those of its endpoints). Thus, for example, two edges in $P$, $(i, j)$ and $(i', j)$, may charge their weight to the same edge in the optimal solution. But this "double" charging can happen at most twice, hence:

$$W(P) \leq 2 W^*. \tag{8}$$

In Appendix A.2 we show with an example that this upper bound is tight. □
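A compact sketch of Algorithm 1 on a dense $M \times M$ block, using array indices instead of explicit graph edges (an implementation choice of ours); as in the algorithm, a row or column may end up with more than the required number of pruned elements, which is feasible since N:M only demands at least $N$ zeros per block.

```python
import numpy as np

def greedy_transposable_mask(w_block: np.ndarray, n: int) -> np.ndarray:
    """Algorithm 1: scan |W[i, j]| from light to heavy and prune an element
    while its row or column still needs more pruned entries. The sort
    dominates, giving O(M^2 log M) per M x M block; the total pruned
    magnitude is guaranteed to be within a factor of 2 of optimal."""
    m = w_block.shape[0]
    order = np.argsort(np.abs(w_block), axis=None)   # light-to-heavy edge list
    row_deg = np.zeros(m, dtype=int)                 # pruned-so-far per row
    col_deg = np.zeros(m, dtype=int)                 # pruned-so-far per column
    mask = np.ones((m, m), dtype=np.int8)
    for flat in order:
        i, j = divmod(int(flat), m)
        if row_deg[i] < n or col_deg[j] < n:         # the greedy condition
            mask[i, j] = 0
            row_deg[i] += 1
            col_deg[j] += 1
    return mask

mask = greedy_transposable_mask(np.random.randn(8, 8), n=4)
print((mask == 0).sum(axis=0).min(), (mask == 0).sum(axis=1).min())  # both >= 4
```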
5. Mask Diversity
Structured sparsity requires the mask to have some hardware-friendly pattern. In this section, we argue that the more flexible the required sparsity constraint is (e.g., fine-grained sparsity with a large block size), the less we are prone to accuracy degradation. Consequently, as expected and well explored (Frankle & Carbin, 2018; Renda et al., 2020b), unstructured sparsity, which puts no requirements on the sparsity structure, achieves the best sparsity levels.
Algorithm 1: 2-approximation algorithm

Input: graph G = (V, E)
Initialize P = ∅
Sort the list of coefficient edges from light to heavy
Let A = [e_1, ..., e_n] be the sorted list of edges
for each edge e_i = (u, v) ∈ A do
    if degree(u) < M/2 or degree(v) < M/2 in P then
        P ← P + e_i
    end if
end for

To quantify the constraint a specific mask enforces, we introduce a new measure we name mask diversity. The mask diversity (MD) is the number of all possible mask configurations which adhere to the mask restriction at a given sparsity level. Let us consider a weight tensor $W$ of size $n \times n$ and a desired sparsity level of $N/M$. For unstructured sparsity,
$$MD_{\text{Unstructured}} = \binom{n^2}{\frac{N}{M} n^2}. \tag{9}$$

As the block size increases, the diversity increases, which might explain the recent success of global pruning. Here we investigate per-layer block sparsity and, specifically, fine-grained N:M structured sparsity (Nvidia, 2020). This approach requires us to zero out $N$ values in each block of size $M$. Since we have $n^2/M$ blocks, this results in
$$MD_{\text{Structured}} = \left( \frac{M!}{N!\,(M-N)!} \right)^{\frac{n^2}{M}}. \tag{10}$$

In order to evaluate the mask diversity in the N:M structured transposable case, let us first assume $N = 1$. The number of possibilities in each block of size $M \times M$ is then $M!$. By repeating this process for a general $N$ in all of the $n^2/M^2$ blocks we get:
$$MD_{\text{Structured transposable}} = \left( M!\,(M-1)! \cdots (M-N+1)! \right)^{\frac{n^2}{M^2}}. \tag{11}$$

A more constrained mask is a fine-grained N:M mask with a sequential structure. Here we require that every $M$ contiguous elements contain $N$ sequential zeros. In each block of size $M$, there are $M - N + 1$ options for the position of the sequential zeros. Hence, over all $n^2/M$ blocks we get:
$$MD_{\text{Sequential}} = \left( M - N + 1 \right)^{\frac{n^2}{M}}. \tag{12}$$

In Table 1 we show the MD for different constraints. Notice that the diversity decreases as we move from structured to transposable structured to sequential masks.
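The four counts are easy to tabulate; the sketch below evaluates Eqs. (9)-(12) as reconstructed here, with the matrix size $n = 64$ chosen arbitrarily for illustration (the values are astronomically large, so we print orders of magnitude).

```python
from math import comb, factorial, prod

def md_unstructured(n_dim, n, m):
    """Eq. (9): choose the zero positions freely among all n^2 weights."""
    return comb(n_dim**2, n * n_dim**2 // m)

def md_structured(n_dim, n, m):
    """Eq. (10): choose n zeros independently in each of the n^2/M blocks."""
    return comb(m, n) ** (n_dim**2 // m)

def md_transposable(n_dim, n, m):
    """Eq. (11): n zeros per row *and* column of each M x M block."""
    per_block = prod(factorial(m - k) for k in range(n))  # M! (M-1)! ... (M-N+1)!
    return per_block ** (n_dim**2 // m**2)

def md_sequential(n_dim, n, m):
    """Eq. (12): a run of n contiguous zeros has M - N + 1 positions per block."""
    return (m - n + 1) ** (n_dim**2 // m)

for name, md in [("unstructured 4:8", md_unstructured(64, 4, 8)),
                 ("structured 4:8", md_structured(64, 4, 8)),
                 ("structured 2:4", md_structured(64, 2, 4)),
                 ("transposable 4:8", md_transposable(64, 4, 8)),
                 ("sequential 4:8", md_sequential(64, 4, 8))]:
    print(f"{name}: ~10^{len(str(md)) - 1}")
```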
Table 1. MD for different mask constraints (Structured, Transposable structured, Sequential) for several N:M configurations.

In Fig. 5 we show the $\ell_1$ norm of the last layer of a trained ResNet-50 masked with different structured pruning masks. We note that 2:4 structured and 4:8 transposable structured have (almost) similar MD measures, which translates to similar $\ell_1$ norms as well.
Figure 5. Magnitude of the last layer's weight tensor of ResNet-50 (pretrained dense model), masked with the structured 4:8, 2:4, 4:8 transposable ("4:8-T"), and 4:8 sequential ("4:8-S") masks, normalized by unstructured 50% sparsity ("US"). Notice that mask diversity is correlated with magnitude preservation. As expected, the 4:8 transposable mask has a similar $\ell_1$ score to the 2:4 mask.

In order to show the correlation between MD and the accuracy of the model, we show in Fig. 6 the accuracy of ResNet18 on the Cifar100 dataset while inducing different sparsity masks with the same sparsity ratio of 50%. As expected, our mask-diversity measure correlates with the pruned model accuracy.
6. Experiments
In this section, we demonstrate the effectiveness of our proposed transposable N:M fine-grained structured sparsity on computer vision and natural language processing tasks. We compare the suggested method over two different initializations: (i) initialize from a trained dense model and train with a fixed mask, similar to ASP (Nvidia, 2020); (ii) train from scratch and update the mask frequently, similar to Zhou et al. (2021).
Figure 6. ResNet18 on Cifar100: accuracy at a weight sparsity of 50% using different structured masks. Notice that the more constraints we impose on the mask, the more we affect the pruned network accuracy.

We show comparable accuracy to previous methods, while achieving a significant reduction in training resources by exploiting the sparse tensor core abilities, allowing their use in both forward and backward passes. In all the experiments we use a transposable 4:8 mask, which, as shown in Table 1, has a similar MD to the 2:4 mask used in previous works (Nvidia, 2020; Zhou et al., 2021). The experimental settings appear in Appendix A.4.
6.1. Initializing from a pretrained dense model

We evaluate the suggested N:M transposable mask using a trained dense model as initialization. In order to find the transposable mask, we solve the min-cost flow reduction (Section 4) on the dense trained network and then fix the mask. In Table 2 we compare our method with ASP (Nvidia, 2020) on classification (ResNet50, ImageNet dataset), detection (MaskRCNN, COCO dataset), and question answering (BERT-large, SQuAD dataset) tasks. Notice that the initialization in both methods is similar; however, in the training phase, we allow a 2x acceleration through the use of the 4:8 transposable mask, in comparison to the 2:4 non-transposable mask used in ASP.

6.2. Training from scratch

In order to avoid training a dense model, we also evaluate the proposed transposable N:M mask in the training-from-scratch setting. Similar to Zhou et al. (2021), we keep a dense copy of the weights, and before each forward pass we mask the weights with an N:M transposable mask. In contrast to Zhou et al. (2021), who changed the mask every iteration, we found that we can use the 2-approximation scheme to extract the transposable mask every 40 iterations. Empirically, we found that the 2-approximation scheme is on average within a factor of 1.2 of the optimal mask. The hyper-parameters used for training are equal to the ones suggested by Zhou et al. (2021). In Table 3 we test the proposed method on ResNet18, ResNet50, ResNext50, and Vgg11 (ImageNet dataset) and on fine-tuning of BERT (SQuAD-v1.1 dataset), and compare to the results of Zhou et al. (2021). As can be seen, we achieve comparable accuracy with a 2x speedup in the training process.
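To make the from-scratch recipe concrete, here is a schematic PyTorch training step. The straight-through masking follows the description above, while `make_mask`, the linear-layer shapes, and the omission of SR-STE's extra weight decay are simplifying assumptions of ours.

```python
import torch

def train_step_with_transposable_mask(weight, mask, x, y, loss_fn, opt,
                                      step, make_mask, update_every=40):
    """One training step in the from-scratch setting: keep a dense copy of
    the weights, recompute the transposable mask every `update_every` steps
    (e.g., with the greedy 2-approximation), and apply it with a
    straight-through estimator so the dense copy still receives gradients.
    `make_mask` is a hypothetical per-tensor mask routine (see Algorithm 1)."""
    if step % update_every == 0:
        mask = make_mask(weight.detach())
    # Straight-through: forward uses masked weights, backward flows densely.
    w_masked = weight + (mask * weight - weight).detach()
    loss = loss_fn(x @ w_masked.T, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return mask, loss

# Sketch of usage with a single linear layer:
w = torch.randn(64, 128, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)
```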
Table 2. Comparison of the suggested method with ASP (Nvidia, 2020), initialized from a dense model, on ResNet50 (ImageNet dataset), BERT-large (SQuAD dataset), and MaskRCNN (COCO dataset). We use a transposable 4:8 mask while ASP uses 2:4. The use of the transposable mask allows a 2x speedup by enabling sparse multiplication in both forward and backward passes.

Model (Metric)      Method      Accuracy    Sparse core utilization
ResNet18 (Top1)     Baseline    69.7%       0%
                    Ours        70.06%      66%
ResNet50 (Top1)     Baseline    76.15%      0%
                    ASP         76.6%       33%
                    Ours        76.6%       66%
BERT-large (F1)     Baseline    91.1        0%
                    ASP         91.5        33%
                    Ours        91.67       66%
MaskRCNN (AP)       Baseline    37.7        0%
                    ASP         37.9        33%
                    Ours        37.84       66%
7. Conclusions
In this work, we analyze the constraints introduced by block sparsity. We discuss the limitations of current research in the field and suggest two simple methods (a pruning bias fix and AdaPrune) to transform an unstructured sparse model into a fine-grained structured sparse model with little to no training. We managed to reduce the accuracy degradation caused by forcing an N:M pattern onto an unstructured sparse mask. For example, in ResNet50 we reduce the degradation to less than 1% from the unstructured model without any retraining. Furthermore, with a light training procedure over a calibration set (i.e., AdaPrune) we can compress the model by up to 3x.

In addition, we discuss the inherent problem of accelerating sparse training and suggest a novel N:M transposable mask which enables accelerating the backward phase as well. We formulate the question of finding the optimal mask as a minimum-cost flow problem and show no accuracy degradation on a variety of tasks, with a 2x acceleration in comparison to previous methods (Nvidia, 2020). Moreover, we design a new fast algorithm (with almost linear complexity) that is guaranteed to be within a factor of 2 of the optimal transposable mask's $\ell_1$ norm, and we use it to train a sparse model from scratch while accelerating both the forward and backward phases. We believe this work paves the path toward truly efficient sparse training.
Table 3. Training from scratch of ResNet18, ResNet50, ResNext50, and Vgg11 on the ImageNet dataset, and fine-tuning of BERT-base on the SQuAD dataset, using the proposed 2-approximation scheme. We show comparable results to N:M-SS (Zhou et al., 2021) with training acceleration.

Model (Metric)      Method      Accuracy    Sparse core utilization
ResNet18 (Top1)     Baseline    70.54%      0%
                    N:M-SS      71.2%       33%
                    Ours        70.75%      66%
ResNet50 (Top1)     Baseline    77.3%       0%
                    N:M-SS      77.4%       33%
                    Ours        77.1%       66%
ResNext50 (Top1)    Baseline    77.6%       0%
                    Ours        77.4%       66%
Vgg11 (Top1)        Baseline    69%         0%
                    Ours        68.8%       66%
BERT-base (F1)      Baseline    88.52       0%
                    Ours        88.38       66%
References
Ahuja, R. K., Magnanti, T. L., and Orlin, J. B. Network Flows. 1988.

Banner, R., Hubara, I., Hoffer, E., and Soudry, D. Scalable methods for 8-bit training of neural networks. In NeurIPS, 2018a.

Banner, R., Nahshan, Y., Hoffer, E., and Soudry, D. Post-training 4-bit quantization of convolution networks for rapid-deployment. arXiv preprint arXiv:1810.05723, 2018b.

Bellec, G., Kappel, D., Maass, W., and Legenstein, R. Deep rewiring: Training very sparse deep networks. arXiv preprint arXiv:1711.05136, 2017.

Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020.

Dettmers, T. and Zettlemoyer, L. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.

Evci, U., Gale, T., Menick, J., Castro, P. S., and Elsen, E. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pp. 2943-2952. PMLR, 2020.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. ArXiv, abs/2101.03961, 2021.

Finkelstein, A., Almog, U., and Grobman, M. Fighting quantization bias with bias. arXiv preprint arXiv:1906.03193, 2019.

Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2018.

Gray, S., Radford, A., and Kingma, D. P. GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224, 2017.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural networks. ArXiv, abs/1506.02626, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hinton, G. E., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531, 2015.

Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869-6898, 2017.

Hubara, I., Nahshan, Y., Hanani, Y., Banner, R., and Soudry, D. Improving post training neural quantization: Layer-wise calibration and integer programming. ArXiv, abs/2006.10518, 2020.

Janowsky, S. A. Pruning versus clipping in neural networks. Physical Review A, 39(12):6600-6603, 1989. URL https://link.aps.org/doi/10.1103/PhysRevA.39.6600.

Karnin, E. D. A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2):239-242, 1990.

Lee, N., Ajanthan, T., and Torr, P. SNIP: Single-shot network pruning based on connection sensitivity. In ICLR, 2019.

Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. In ICLR, 2017.

Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.

Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L0 regularization. In ICLR, 2018.

Luo, J.-H., Wu, J., and Lin, W. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5068-5076, 2017.

Marcel, S. and Rodriguez, Y. Torchvision: The machine-vision package of Torch. In Proceedings of the 18th ACM International Conference on Multimedia, MM '10, pp. 1485-1488, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781605589336. doi: 10.1145/1873951.1874254. URL https://doi.org/10.1145/1873951.1874254.

Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):1-12, 2018.

Mostafa, H. and Wang, X. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, pp. 4646-4655. PMLR, 2019.

Mozer, M. C. and Smolensky, P. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, pp. 107-115, 1989a.

Mozer, M. C. and Smolensky, P. Using relevance to reduce network size automatically. Connection Science, 1(1):3-16, 1989b.

Nagel, M., Amjad, R. A., van Baalen, M., Louizos, C., and Blankevoort, T. Up or down? Adaptive rounding for post-training quantization. In ICML, 2020.

Nahshan, Y., Chmiel, B., Baskin, C., Zheltonozhskii, E., Banner, R., Bronstein, A. M., and Mendelson, A. Loss aware post-training quantization. ArXiv, abs/1911.07190, 2019.

Nvidia. Nvidia deep learning examples for tensor cores. 2018. URL https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch.

Nvidia. NVIDIA A100 tensor core GPU architecture. 2020.

Renda, A., Frankle, J., and Carbin, M. Comparing rewinding and fine-tuning in neural network pruning. ArXiv, abs/2003.02389, 2020a.

Renda, A., Frankle, J., and Carbin, M. Comparing rewinding and fine-tuning in neural network pruning. In ICLR, 2020b.

Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2820-2828, 2019.

Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074-2082, 2016.

Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., and Keutzer, K. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10734-10742, 2019.

Zhou, A., Ma, Y., Zhu, J., Liu, J., Zhang, Z., Yuan, K., Sun, W., and Li, H. Learning N:M fine-grained structured sparse neural networks from scratch. In ICLR, 2021.
A. Supplementary Material
A.1. Additional AdaPrune experiments
To further examine AdaPrune's capabilities, we checked two additional settings: (a) starting from a pretrained dense model, and (b) starting from a less constrained N:M mask.

A.1.1. AdaPrune from dense
While this case is more common, we expect to see some degradation, as we know that we have 50% mask violations. Yet, as can be seen in Table A.1, we managed to restore accuracy to within 2-3% of the full-precision baseline using just AdaPrune. To further improve results, we applied batch-norm tuning, as suggested by Hubara et al. (2020), and kept the first and last layers dense, which results in less than 2% degradation. We believe these to be the first tolerable post-training-pruning results reported.
Table A.1. Using AdaPrune from a dense pretrained model. AP stands for AdaPrune and BNT stands for batch-norm tuning.

Model        Dense      BiasFix    AP         AP + BNT
ResNet18     69.7%      62.47%     68.41%     68.63%
ResNet34     73.3%      68.72%     72.15%     72.36%
ResNet50     76.1%      67.42%     74.41%     74.75%
ResNet101    77.27%     71.54%     76.36%     76.48%
A.1.2. AdaPrune from N:M sparse

In Section 5 we explained why the mask diversity decreases as the block size decreases. Thus, we expect to have many violations when a pretrained sparse model with an $N_1{:}M_1$ pattern is translated to an $N_2{:}M_2$ pattern, for $N_1 > N_2$ and $M_1 > M_2$. We argue that this might be a common case in the future, as different hardware vendors will support different formats. In Table A.2 we show the results of converting a ResNet-50 model trained with a 4:8 sparsity pattern to 2:4 and 1:2 patterns. As can be seen, converting from 4:8 to 2:4 produces results with negligible accuracy degradation (less than 0.5%). Therefore, we argue that AdaPrune is an efficient and useful approach to convert models which were optimized on different hardware than the one in use, as it removes the need for full sparse training. This is even more important when the training data is not available.

A.2. A tight example for the 2-approximation factor in the Lemma
In the following, we show that the 2-approximation upper bound (proven in the Lemma) is asymptotically tight using a tight example. Assume we want to zero one element in each row and column of the 4 x 4 block presented in Fig. A.1a, using the 2-approximation algorithm (Algorithm 1).
Table A.2. Using AdaPrune to convert from one sparse pattern to another. The baseline model was trained with 4:8 sparsity (90 epochs); thus, the 4:8 column is the baseline. BNT stands for batch-norm tuning.

Model      4:8       2:4      2:4 + BNT    1:2      1:2 + BNT
RN50       76.5%     76.2%    76.4%        74.6%    75.1%
RN50-T     77.1%     76.3%    76.4%        74.7%    75.1%
RN18-T     70.75%    70.1%    70.2%        68.9%    69.2%

First, we need to convert the block into a directed bipartite graph (as suggested in Fig. 4). This construction appears in Fig. A.1b. Next, we sort the edges from light to heavy and go over the sorted list. In Fig. A.1c we show the seven iterations of the 2-approximation algorithm. All edges are added to the list of chosen edges $P$ up until iteration 7. The algorithm stops at iteration 7 since, after adding the seventh link, every node is "covered" by at least one edge (alternatively, each row and each column has at least one item chosen for pruning). Note that the optimal solution would choose the edges that correspond to the elements on the diagonal (i.e., $u_1 \to v_1$, $u_2 \to v_2$, $u_3 \to v_3$, and $u_4 \to v_4$), summing to a total weight of 4. Hence, we get an approximation ratio of 7/4. It is easy to see that when using the same construction for a general block of size $M \times M$, we get an approximation ratio of $(2M - 1)/M$, asymptotically converging to 2 as $M \to \infty$.

A.3. Run-time analysis
In this section we specify the running times of different min-cost flow methods and compare them with the running time of our 2-approximation method. Ahuja et al. (1988) specify the running times of six min-cost flow algorithms, two of which have distinctively better performance on our construction compared to the others. The running time complexities of these two methods depend on the following parameters: the number of nodes $n$, the number of edges $m$, the largest weight coefficient $W$, and the flow demand $U$. The cost-scaling method has a running time of $O(n^3 \log(n \cdot W))$, while the capacity-scaling method has a running time of $O(m(m + n \log n) \log U)$. For a block of size $M \times M$, our construction process creates $m = M^2 + 2M$ edges, $n = 2M + 2$ nodes, and a flow demand of $U = 0.5 M^2$. This boils down to running times of $O(M^3 \log(M \cdot W))$ and $O(M^4 \log(M))$ for the cost-scaling and the capacity-scaling methods, respectively. Finally, assuming the weights are represented in $b$ bits, we have $\log(W) = b$, and therefore solving our construction using the cost-scaling method has a running time complexity of $O(M^3 (\log(M) + b))$. In Table A.3 we summarize the complexity of these methods.
Block of size × where we want to zero one element in each row and column using 2-approximate algorithm. (b): Block in (a) represented in a direct bipartite graph. (c):
The 7 iterations of the 2-approximate algorithm on graph (b). Notice we get anapproximate ratio between the 2-approximate solution and the optimal solution of . Table A.3.
Table A.3. Running times of the min-cost flow based implementations and the 2-approximation method.

Method             Complexity
Cost-scaling       O(M^3 (log(M) + b))
Capacity-scaling   O(M^4 log(M))
2-approximation    O(M^2 log(M))

A.4. Experiment Settings

AdaPrune:
We used a small calibration set of 1000 images (one per class). We ran AdaPrune for 1000 iterations with a batch size of 100. For the results in the supplementary material, we kept the first and last layers dense.
N:M transposable sparsity mask from a pretrained model: We used the torchvision (Marcel & Rodriguez, 2010) model zoo as our pretrained dense baseline. For all ResNet models we used the original regime as given by He et al. (2016), i.e., SGD over 90 epochs, starting with a learning rate of 0.1 and decreasing it at epochs 30, 60, and 80 by a factor of 10. For BERT-large and MaskRCNN we used the default scripts as in Nvidia (2018).