Lossless CNN Channel Pruning via Decoupling Remembering and Forgetting
Xiaohan Ding, Tianxiang Hao, Jianchao Tan, Ji Liu, Jungong Han, Yuchen Guo, Guiguang Ding
Beijing National Research Center for Information Science and Technology (BNRist); School of Software, Tsinghua University, Beijing, China
AI Platform Department, Seattle AI Lab, and FeDA Lab, Kwai Inc.
WMG Data Science, University of Warwick, Coventry, United Kingdom
Department of Automation, Tsinghua University; Institute for Brain and Cognitive Sciences, Tsinghua University, Beijing, China
September 2, 2020
Abstract
Channel pruning (a.k.a. filter pruning) aims to slim down a convolutional neural network (CNN) by reducing the width (i.e., numbers of output channels) of convolutional layers. However, as CNN's representational capacity depends on the width, doing so tends to degrade the performance. A traditional learning-based channel pruning paradigm applies a penalty on parameters to improve the robustness to pruning, but such a penalty may degrade the performance even before pruning. Inspired by the neurobiology research about the independence of remembering and forgetting, we propose to re-parameterize a CNN into the remembering parts and forgetting parts, where the former learn to maintain the performance and the latter learn for efficiency. By training the re-parameterized model using regular SGD on the former but a novel update rule with penalty gradients on the latter, we achieve structured sparsity, enabling us to equivalently convert the re-parameterized model into the original architecture with narrower layers. With our method, we can slim down a standard ResNet-50 with 76.15% top-1 accuracy on ImageNet to a narrower one with only 43.9% FLOPs and no accuracy drop. Code and models are released at https://github.com/DingXiaoH/ResRep.

1 Introduction

Convolutional Neural Network (CNN) is one of the most popular models for deep learning. To compress and accelerate CNN for efficient inference, numerous methods have been proposed, including sparsification [10, 13, 14], channel pruning [8, 19, 21, 29], quantization [4, 5, 35, 56], knowledge distillation [22, 26, 33, 50], etc. Channel pruning [21] (a.k.a. filter pruning [28] or network slimming [36]) reduces the width (i.e., number of output channels) of each convolutional layer, which can effectively reduce the required FLOPs and memory footprint.
Of note is that channel pruning is complementary to the other model compression and acceleration techniques, because it simply produces a thinner model of the original architecture with no customized structures or extra operations.

However, as CNN's representational capacity depends on the width of conv layers, it is difficult to reduce the width without performance drops. On practical CNN architectures like ResNet-50 [16] and large-scale datasets like ImageNet [6], lossless pruning with a high compression ratio has long been considered challenging. For a reasonable trade-off between compression ratio and performance, a typical paradigm (Fig. 1.A) [2, 3, 9, 30, 32, 51, 52] seeks to train the model with a magnitude-related penalty loss (e.g., group Lasso [46, 49]) on the conv kernels to produce structured sparsity. That is, all the parameters of some channels become small in magnitude. Though such a process may degrade the performance by an acceptable margin, pruning such channels causes less damage. Notably, if the parameters of the pruned channels are small enough, the pruned model may deliver the same performance as before (i.e., after training but before pruning), which we refer to as perfect pruning.

For convenience, we propose to evaluate a training-based pruning method from two aspects.
1) Resistance. The training process aims to introduce desired properties such as structured sparsity into the model for pruning. However, the emergence of such properties may degrade the model's performance. If the model resists such degradation, i.e., maintains the accuracy, we say it has high resistance.
2) Prunability. If the trained model endures a high pruning ratio with a low performance drop, we say it has high prunability.

Obviously, we seek a pruning method with both high resistance and high prunability. However, the traditional penalty-based paradigm naturally suffers from a resistance-prunability trade-off. Taking group Lasso as an example, if we use a strong penalty to achieve high structured sparsity, the performance will drop significantly during training (Fig. 3). Contrarily, with a weak penalty to maintain the performance, we will achieve low sparsity, hence low prunability. A detailed analysis will be presented in Sect. 3.3.

In this paper, we propose a novel method named ResRep, which comprises two key components, namely Convolutional Re-parameterization and Gradient Resetting, to address the above problem. ResRep surpasses the recent competitors by a significant margin and, to the best of our knowledge, is the first to achieve truly lossless pruning on ResNet-50 on ImageNet (76.15% top-1 accuracy before and after pruning) with a high pruning ratio of 56.1% (i.e., the resulting model has only 43.9% of the original FLOPs).

ResRep is inspired by the neurobiology research on remembering and forgetting. On the one hand, remembering requires the network to potentiate some synapses but depotentiate the others, which resembles the training process of CNN, making some parameters large and some small.
On the other hand, synapse elimination via shrinkage or loss of spines is one of the classical forgetting mechanisms [45] and a key process for improving both energy and space efficiency in biological neural networks, which resembles pruning. Neurobiology research reveals that remembering and forgetting are independently controlled by the Rutabaga adenylyl cyclase-mediated memory formation mechanism and the Rac-regulated spine shrinkage mechanism, respectively [12, 15, 48], indicating that it is more reasonable to control the learning and pruning by two decoupled modules.

Inspired by such independence, we propose to decouple the "remembering" and "forgetting", which are coupled in the traditional paradigm, because the conv parameters are involved in both the "remembering" (objective function) and the "forgetting" (penalty loss) in order for them to achieve a trade-off. Concretely, we first re-parameterize the original model into "remembering parts" and "forgetting parts", then apply "remembering learning" (i.e., regular SGD with the original objective function) on the former to maintain the "memory" (original performance), and "forgetting learning" (a customized update rule named Gradient Resetting) on the latter to "eliminate synapses" (zero out channels). More specifically, we re-parameterize the original conv-BN (short for a conv layer followed by batch normalization [25]) sequences as conv-BN-compactor, where a compactor is a pointwise (i.e., 1 × 1) conv layer. During training, we add penalty gradients to only the compactors, select some compactor channels, and zero out their gradients derived from the objective function. After training, we prune the compactors into narrower ones. As the word "re-parameterization" suggests, a conv-BN-compactor sequence is another parameterization of a regular conv layer, which makes it feasible to equivalently convert the former into the latter for inference. Eventually, the resulting model will have the same architecture as the original but narrower layers.
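The two ideas above can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' implementation: the compactor is modeled as a D × D matrix applied as a 1 × 1 conv, so identity initialization is a no-op; and once a channel's objective-related gradient is masked out, the penalty gradient alone drives that row towards zero. The shapes, learning rate, and penalty strength are made up (the penalty is exaggerated so the effect shows up within a few steps).

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, W = 4, 5, 5
y = rng.standard_normal((D, H, W))   # output of a conv-BN block, one example

def compactor(y, Q):
    # Pointwise (1x1) conv: each output channel is a linear mix of input channels.
    return np.einsum('dc,chw->dhw', Q, y)

# Identity initialization: the re-parameterized model equals the original.
Q = np.eye(D)
assert np.allclose(compactor(y, Q), y)

# "Forgetting learning" on one row of Q: with the objective-related gradient
# masked out, only the penalty gradient lam * q / ||q|| remains, so the row
# shrinks steadily no matter how important it is to the objective.
lam, lr = 0.05, 0.5   # exaggerated values for a quick demo
q = Q[1].copy()
for _ in range(200):
    n = np.linalg.norm(q)
    if n <= lr * lam:     # the next step would overshoot; snap to exactly zero,
        q[:] = 0.0        # mirroring the final pruning of close-to-zero rows
        break
    q = q - lr * (lam * q / n)
assert np.linalg.norm(q) < 1e-3
```

The point of the sketch is the decoupling: the original conv-BN weights are never touched by the penalty, while the masked compactor row converges to zero regardless of the data.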
Fig. 1.B shows an example for illustration. ResRep features: High resistance. ResRep does not change the loss function, update rule, or any training hyper-parameters of the original CNN parameters (i.e., the conv-BN parts), so that they learn to maintain the performance as usual. High prunability. The compactors are driven by the penalty gradients to learn which channels to prune, and we can make many compactor channels small enough to realize perfect pruning, even with a mild penalty strength. Given the required global reduction ratio of FLOPs, ResRep automatically finds the appropriate eventual width of each layer with no prior knowledge, making it a powerful tool for CNN structure optimization. End-to-end training and easy implementation.

Figure 1: Traditional penalty-based channel pruning v.s. ResRep. We prune a 3 × 3 conv layer with one input channel and four output channels for illustration. For the ease of visualization, we ravel the kernel K ∈ R^{4×1×3×3} into a matrix W ∈ R^{4×9}. A) To prune some channels of K (i.e., rows of W), we add a penalty loss on the kernel to the original loss, so that the gradients make some rows smaller in magnitude, but not small enough to realize perfect pruning. B) ResRep constructs a compactor with kernel matrix Q ∈ R^{4×4}. Driven by the penalty gradients, the compactor selects some of its channels and generates a binary mask, which resets some of the original gradients of Q to zero. After multiple iterations, those compactor channels with reset gradients become infinitely close to zero, which enables perfect pruning. Finally, the conv-BN-compactor sequence is equivalently converted into a regular conv layer with two channels. Blank rectangles indicate zero values.

We summarize our contributions as follows.

• Inspired by neurobiology research, we have proposed to decouple "remembering" and "forgetting" in channel pruning.

• We have proposed a novel method with two components, Gradient Resetting and Convolutional Re-parameterization, to achieve both high resistance and prunability.

• We have achieved state-of-the-art channel pruning results on common benchmark models, including truly lossless pruning on ResNet-50 on ImageNet with a pruning ratio of 56.1%.

2 Related Work

Most of the channel pruning methods can be categorized into two families.
Pruning-then-finetuning methods identify and prune the unimportant channels from a well-trained model by some measurements [1, 24, 28, 39, 41, 42, 55], which may cause a significant accuracy drop, and finetune it afterwards to restore the performance. However, a major drawback is that the pruned models are difficult to finetune, and the final accuracy is not guaranteed. As a prior work [37] highlighted, the pruned models can be easily trapped into bad local minima, and sometimes cannot even reach a level of accuracy similar to a counterpart of the same structure trained from scratch. Such a discovery motivated us to pursue perfect pruning, which eliminates the need for finetuning.
Learning-based pruning methods overcome such a drawback by a customized learning process. Apart from the above-mentioned penalty-based paradigm that zeroes out some of the channels [2, 3, 9, 30, 32, 51, 52], some other methods prune via meta-learning [34], adversarial learning [31], etc. Compared with these complicated methods, ResRep can be easily implemented and trained end-to-end.
3 ResRep

3.1 Formulation

We first introduce the formulation and background of convolution and channel pruning. For a conv layer with D output channels, C input channels and kernel size K × K, we use K ∈ R^{D×C×K×K} to denote the kernel parameter tensor, and b ∈ R^D for the optional bias term. Let I ∈ R^{N×C×H×W} be the input, O ∈ R^{N×D×H′×W′} be the output, ⊛ be the convolution operator, and B be the broadcast function which replicates b into an N × D × H′ × W′ tensor. We have

    O = I ⊛ K + B(b).    (1)

For a conv layer with no bias term but a following batch normalization (BN) [25] layer with accumulated mean μ, standard deviation σ, learned scaling factor γ and bias β ∈ R^D, we have

    O_{:,j,:,:} = ((I ⊛ K)_{:,j,:,:} − μ_j) γ_j / σ_j + β_j,  ∀ 1 ≤ j ≤ D.    (2)

Let i be the index of a convolutional layer. To prune conv i by some rules (e.g., removing channels with the smallest norms [28]), we obtain the index set of pruned channels P(i) ⊂ {1, 2, ..., D}, then its complementary set S(i) = {1, 2, ..., D} \ P(i) as the index set of channels which survive. The pruning operation preserves the S(i) output channels of conv i and the corresponding input channels of conv i+1, which is the succeeding layer, and discards the others. The corresponding entries in the bias or following BN, if any, should be discarded as well. Formally, the obtained kernels are

    K^{(i)′} = K^{(i)}_{S(i),:,:,:},  K^{(i+1)′} = K^{(i+1)}_{:,S(i),:,:}.    (3)

3.2 Convolutional Re-parameterization

Inspired by the neurobiology research about the independence of forgetting and remembering [12, 15, 45, 48], we propose to explicitly re-parameterize the original model into "remembering parts" and "forgetting parts". Specifically, for every conv layer together with the following BN we desire to prune, which are referred to as the target layers, we re-parameterize with an additional pointwise (1 × 1) conv with kernel Q ∈ R^{D×D}, which is named the compactor.
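The slicing in Eq. 3 amounts to a few lines of array indexing. A minimal sketch with made-up shapes:

```python
import numpy as np

# Pruning as kernel slicing (Eq. 3): drop some output channels of conv i and
# the matching input channels of the succeeding conv i+1.
D, C, k = 4, 3, 3
K_i  = np.zeros((D, C, k, k))      # kernel of conv i
K_i1 = np.zeros((5, D, k, k))      # kernel of conv i+1 (5 output channels)

P = {1, 3}                          # indices of pruned channels of conv i
S = sorted(set(range(D)) - P)       # surviving channels

K_i_new  = K_i[S, :, :, :]          # K^(i)'   = K^(i)_{S,:,:,:}
K_i1_new = K_i1[:, S, :, :]         # K^(i+1)' = K^(i+1)_{:,S,:,:}
assert K_i_new.shape  == (2, C, k, k)
assert K_i1_new.shape == (5, 2, k, k)
```

Any bias or BN parameters attached to conv i would be sliced with the same index set S.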
As the training begins, we initialize the conv-BN with the original weights of an off-the-shelf model and Q as an identity matrix, so that the re-parameterized model produces outputs identical to the original. After training with Gradient Resetting, which will be described in detail in Sect. 3.3, we prune the resulting close-to-zero channels of the compactors and convert the model into the original architecture with narrower layers. Concretely, for a specific compactor with kernel Q, we prune the channels with norm smaller than a threshold ε. Formally, we obtain the to-be-pruned set by P = {j | ||Q_{j,:}|| < ε}, or the surviving set S = {j | ||Q_{j,:}|| ≥ ε}. Similar to Eq. 3, we prune Q by Q′ = Q_{S,:}. In our experiments, we use ε = 1 × 10^{−5}, which is found to be small enough to realize perfect pruning. Now that the compactor has fewer rows than columns, i.e., Q′ ∈ R^{D′×D} with D′ = |S|, we seek to convert the conv-BN-compactor into a conv layer with kernel K′ ∈ R^{D′×C×K×K} and bias b′ ∈ R^{D′}.

Firstly, we point out that a conv-BN sequence can be equivalently converted into a conv layer for inference, which produces outputs identical to the original. With K, μ, σ, γ, β of a conv layer and its following BN, we can construct a new conv layer with kernel K̂ and bias b̂ as follows:

    K̂_{j,:,:,:} = (γ_j / σ_j) K_{j,:,:,:},  b̂_j = −μ_j γ_j / σ_j + β_j,  ∀ 1 ≤ j ≤ D.    (4)

Given Eq. 1, Eq. 2 and the homogeneity of convolution [11], it is easy to verify that

    ((I ⊛ K)_{:,j,:,:} − μ_j) γ_j / σ_j + β_j = (I ⊛ K̂ + B(b̂))_{:,j,:,:},  ∀ 1 ≤ j ≤ D.    (5)

After obtaining K̂ and b̂, we seek the formula to construct K′ and b′ so that

    (I ⊛ K̂ + B(b̂)) ⊛ Q′ = I ⊛ K′ + B(b′).    (6)
With the additivity of convolution, we arrive at

    I ⊛ K̂ ⊛ Q′ + B(b̂) ⊛ Q′ = I ⊛ K′ + B(b′).    (7)

The intuition is that every channel of B(b̂) is a constant matrix, thus every channel of B(b̂) ⊛ Q′ is also a constant matrix. And since the 1 × 1 convolution with Q′ on the result of I ⊛ K̂ only performs cross-channel recombination, it is feasible to merge Q′ into K̂ by recombining the entries in K̂. Let T be the transpose function (e.g., T(K̂) is a C × D × K × K tensor); we present the formulas to construct K′ and b′, which can be easily verified:

    K′ = T(T(K̂) ⊛ Q′),  b′_j = b̂ · Q′_{j,:},  ∀ 1 ≤ j ≤ D′.    (8)

For the ease of implementation, we convert and save the weights of the trained re-parameterized model, construct a model with the original architecture but narrower layers without BN, and use the saved weights for testing and deployment.

3.3 Gradient Resetting

We describe how to produce structured sparsity in compactors while maintaining the performance. We start from the situation where we use the traditional penalty-loss-based paradigm on a specific kernel K to make the magnitude of some channels smaller for high prunability, i.e., ||K_{P,:,:,:}|| → 0. Let Θ be the universal set of parameters, X, Y be the data examples and labels, and L_perf(X, Y, Θ) be the performance-related objective function (e.g., cross-entropy for classification tasks). The traditional paradigm adds a penalty loss term P(K) to the original loss with a pre-defined strength factor λ:

    L_total(X, Y, Θ) = L_perf(X, Y, Θ) + λ P(K),    (9)

where the common forms of P include L1 [28], L2 [9], and group Lasso [32, 52]. Specifically, group Lasso is effective in producing channel-wise structured sparsity:

    P_Lasso(K) = Σ_{j=1}^{D} ||K_{j,:,:,:}||.    (10)
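Before moving on, the conversion chain of Sect. 3.2 (Eqs. 4-8) can be verified numerically. The sketch below, with made-up shapes and values and a naive convolution (not the authors' implementation), checks that a conv-BN-compactor sequence and the single merged conv produce identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
D, Dp, C, k, H, W = 4, 2, 3, 3, 6, 6
x = rng.standard_normal((C, H, W))               # input, one example
K = rng.standard_normal((D, C, k, k))            # conv kernel
mu, sigma = rng.standard_normal(D), rng.uniform(0.5, 2.0, D)  # BN statistics
gamma, beta = rng.standard_normal(D), rng.standard_normal(D)  # BN affine params
Qp = rng.standard_normal((Dp, D))                # pruned compactor, D' < D rows

def conv2d(x, K):
    # Naive 'valid' cross-correlation (what deep-learning "convolution" computes).
    D_out, _, k, _ = K.shape
    out = np.empty((D_out, x.shape[1] - k + 1, x.shape[2] - k + 1))
    for d in range(D_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[d, i, j] = np.sum(K[d] * x[:, i:i + k, j:j + k])
    return out

# conv -> BN (inference form, Eq. 2) -> compactor (1x1 conv with Qp)
y_bn = (conv2d(x, K) - mu[:, None, None]) * (gamma / sigma)[:, None, None] \
       + beta[:, None, None]
y_seq = np.einsum('pd,dhw->phw', Qp, y_bn)

# Step 1 (Eq. 4): fuse BN into the conv kernel and bias
K_hat = (gamma / sigma)[:, None, None, None] * K
b_hat = beta - mu * gamma / sigma
assert np.allclose(y_bn, conv2d(x, K_hat) + b_hat[:, None, None])  # Eq. 5

# Step 2 (Eq. 8): merge the compactor; a 1x1 conv is a channel recombination
K_prime = np.einsum('pd,dchw->pchw', Qp, K_hat)
b_prime = Qp @ b_hat
y_merged = conv2d(x, K_prime) + b_prime[:, None, None]
assert np.allclose(y_seq, y_merged)                                # Eqs. 6-7
```

In deployment, this folding runs once offline, so inference uses only the merged kernel K′ and bias b′.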
In the following discussions with group Lasso as the form of penalty, we focus on a specific channel in K, which is denoted by F = K_{j,:,:,:}. Let G(F) be the gradient we use for the SGD update; we have

    G(F) = ∂L_total(X, Y, Θ) / ∂F = ∂L_perf(X, Y, Θ) / ∂F + λ F / ||F||.    (11)

The training dynamics are quite straightforward: starting from a well-trained model, F resides near a local optimum, thus the first term of Eq. 11 is close to 0, but the second is not, so F will be pushed closer to 0. If F is important to the performance, the objective function will intend to maintain its magnitude, i.e., the first gradient term will become larger to compete against the second, thus F will end up smaller than it used to be, depending on λ. Otherwise, taking the extreme case for example, if F does not influence L_perf at all, the first term will be 0, so F will keep moving towards 0, driven by the second term. That is, the performance-related loss and the penalty loss compete, so that the resulting value of F reflects its importance, which we refer to as competence-based importance evaluation for convenience. However, we face a dilemma. Problem A: The penalty deviates the parameters of every channel from the optima of the objective function. Notably, a mild deviation may not bring negative effects, e.g., L2 regularization can also be viewed as a mild deviation. However, with a strong penalty, though some channels are zeroed out for pruning, the remaining channels are also made too small to maintain the representational capacity, which is an undesired side-effect.
Problem B: With a mild penalty for high resistance, we cannot achieve high prunability, because most of the channels merely become smaller than they used to be, but not close enough to 0 for perfect pruning.

We propose to achieve high prunability with a mild penalty by resetting the gradients derived from the objective function. Specifically, we introduce a binary mask m ∈ {0, 1}, which indicates whether we wish to zero out F. For the ease of implementation, we add no terms to the objective function (i.e., L_total = L_perf), simply compute the gradients, then manually apply the mask and add the penalty gradients. Then we use the resulting gradients for the SGD update. That is,

    G(F) = (∂L_perf(X, Y, Θ) / ∂F) m + λ F / ||F||.    (12)

In practice, for deciding which channels to zero out (i.e., setting mask values for multiple channels), we may simply follow the smaller-norm-less-important rule [28] or other heuristics [24, 41]. In this way, we have solved the above two problems. A) Though we add Lasso gradients to the objective-related gradients of every channel, which is equivalent to deviating the optima by adding the Lasso loss to the original loss, the deviation is mild (λ = 1 × 10^{−4} in our experiments), thus doing so does not degrade the performance. B) With m = 0, the first term no longer exists to compete against the second, thus even a mild λ can make F steadily move towards 0.

Though doing so shows superiority over the traditional paradigm (Fig. 3) by simply zeroing out the channels with smaller norms, we notice a problem: the zeroed-out objective-related gradients encode the information necessary for maintaining the performance, which should be preserved to improve the resistance. Fortunately, Convolutional Re-parameterization provides a natural solution. As shown in Sect.
3.2, if we train a compactor from an identity matrix into a matrix with many close-to-zero rows via Gradient Resetting, we will be able to convert a conv-BN-compactor into a narrower conv, without losing any information encoded in the gradients of the original kernels.

Firstly, we need to decide which channels of Q to zero out. After a few epochs of training from the initialized re-parameterized model, ||Q_{j,:}|| will reflect the importance of channel j, so we start to perform channel selection. Let n be the number of compactors and m^{(i)} ∈ R^{D^{(i)}} be the mask for the i-th compactor; we define t^{(i)} ∈ R^{D^{(i)}} as the metric vector,

    t^{(i)}_j = ||Q^{(i)}_{j,:}||,  ∀ 1 ≤ j ≤ D^{(i)}.    (13)

We calculate the metric values and organize them as a mapping M = {(i, j) → t^{(i)}_j | ∀ 1 ≤ i ≤ n, 1 ≤ j ≤ D^{(i)}}. Then we sort the values of M in ascending order, pick one at a time starting from the smallest, and set the corresponding mask value m^{(i)}_j to 0. We stop picking when the reduced FLOPs (i.e., the original FLOPs minus the FLOPs without the current mask-0 channels) reach our target, or when we have already picked θ (named the channel selection limit) channels. The mask values of the unpicked channels are set to 1. The motivation is straightforward: following the discussions of competence-based importance evaluation, just like the traditional usage of a penalty loss to compete against the original loss and remove the channels with smaller norms, we use the penalty gradients to compete with the original gradients. Even better, all the metric values are 1 at the beginning (because every compactor kernel is an identity matrix), making them fair to compare across different layers. We initialize θ as a small number, increase θ every several iterations, and re-select channels to

Table 1: Pruning results of ResNet-50 and MobileNet on ImageNet.
Model       Result            Base Top-1  Base Top-5  Pruned Top-1  Pruned Top-5  Top-1 ↓  Top-5 ↓  FLOPs ↓ %
ResNet-50   SFP [18]          76.15       92.87       74.61         92.06         1.54     0.81     41.8
            GAL-0.5 [31]      76.15       92.87       71.95         90.94         4.20     1.93     43.03
            HRank [29]        76.15       92.87       74.98         92.33         1.17     0.54     43.76
            NISP [55]         -           -           -             -             0.89     -        44.01
            Channel Pr [21]   -           92.2        -             90.8          -        1.4      50
            HP [53]           76.01       92.93       74.87         92.43         1.14     0.50     50
            MetaPruning [34]  76.6        -           75.4          -             1.2      -        51.10
            Autopr [38]       76.15       92.87       74.76         92.15         1.39     0.72     51.21
            FPGM [19]         76.15       92.87       74.83         92.32         1.32     0.55     53.5
            DCP [57]          76.01       92.93       74.95         92.32         1.06     0.61     55.76
            C-SGD [7]         75.33       92.56       74.54         92.09         0.79     0.47     55.76
            ThiNet [40]       75.30       92.20       72.03         90.99         3.27     1.21     55.83
            SASL [47]         76.15       92.87       75.15         92.47         1.00     0.40     56.10
            ResRep (ours)     76.15       92.87       76.15         92.90         0.00     -0.03    56.11
            TRP [54]          75.90       92.70       72.69         91.41         3.21     1.29     56.52
            LFPC [17]         76.15       92.87       74.46         92.32         1.69     0.55     60.8
            HRank [29]        76.15       92.87       71.98         91.01         4.17     1.86     62.10
            ResRep (ours)     76.15       92.87       75.49         92.55         0.66     0.32     62.10
MobileNet   MetaPruning [34]  70.6        -           66.1          -             4.5      -        73.81
            ResRep (ours)     70.78       89.78       68.05         87.66         2.73     2.12     73.91

avoid zeroing out too many channels at once. As shown in the right of Fig. 3, those mask-0 channels will become very close to 0 under the effect of the Lasso gradients; thus the structured sparsity emerges.

4 Experiments

We first introduce the datasets and benchmark models. We experiment with ResNet-50 [16] and MobileNet [23] on
ImageNet [6], which contains 1.28M images for training and 50K for validation from 1000 classes. For reproducibility, we follow the official data augmentation provided by the PyTorch examples [43], including random cropping and left-right flipping. For the ResNet-50 base model, we used the official torchvision version (76.15% top-1 accuracy) [44] for a fair comparison with most competitors. For MobileNet, we trained from scratch with an initial learning rate of 0.1, batch size of 512, and cosine learning rate annealing for 70 epochs, achieving a top-1 accuracy of 70.78%, which is slightly higher than that reported in the original paper [23]. Then we use ResNet-56/110 [16] on
CIFAR-10 [27], which contains 50K images from 10 classes of 32 × 32 pixels for training and 10K for testing. We adopt the standard data augmentation [16]: padding to 40 × 40, random cropping and left-right flipping. We trained the base models with a batch size of 64 and the common learning rate schedule, which is initialized as 0.1, multiplied by 0.1 at epochs 120 and 180, and terminated after 240 epochs. We calculate the FLOPs in the same way as the original papers (3.86G for ResNet-50 [16], 569M for MobileNet [23], and 126M/253M for ResNet-56/110).

We apply ResRep to ResNet-50 and MobileNet with the same hyper-parameters: λ = 1 × 10^{−4}, channel selection limit θ = 4 with θ ← θ + 4 every 200 iterations, batch size of 256, initial learning rate of 0.01, and cosine annealing for 180 epochs. The first channel selection begins after 5 epochs. To align the pruning ratios for the ease of comparison, we experiment with ResNet-50 twice, with FLOPs reduction targets of 56.1% and 62.1%, to compare with SASL [47] and HRank [29], respectively, and with MobileNet at 73.9% to compare

Table 2: Pruning results of ResNet-56/110 on CIFAR-10.

Model       Result          Base top-1  Pruned top-1  Top-1 ↓ %  FLOPs ↓ %
ResNet-56   AMC [20]        92.8        91.9          0.9        50
            FPGM [19]       93.59       93.26         0.33       52.6
            SFP [18]        93.59       93.35         0.24       52.6
            LFPC [17]       93.59       93.24         0.35       52.9
            ResRep (ours)   93.71       93.73         -0.02      52.91
            TRP [54]        93.14       91.62         1.52       77.82
            ResRep (ours)   93.71       92.67         1.04       77.83
ResNet-110  Li et al. [28]  93.53       93.30         0.23       38.60
            GAL-0.5 [31]    93.50       92.74         0.76       48.5
            HRank [29]      93.50       93.36         0.14       58.2
            ResRep (ours)   94.64       94.62         0.02       58.21

with MetaPruning [34]. Following most competitors, we prune the first (1 × 1) and second (3 × 3) conv layers in every residual block of ResNet-50, and every non-depthwise conv of MobileNet. Moreover, inspired by a prior work [10], which modifies the gradients and utilizes momentum and weight decay for CNN sparsification, we raise the momentum coefficient of SGD on the compactors from 0.9 (the default setting in most cases) to 0.99. The intuition is that the mask-0 channels continuously grow in the same direction (i.e., towards zero), and such a tendency accumulates in the momentum, thus the zeroing-out process can be accelerated by a larger momentum coefficient. For ResNet-56/110 on CIFAR-10, the target layers include all the first layers in every residual block, and we use the same hyper-parameters as on ImageNet except for a batch size of 64 and cosine learning rate annealing for 480 epochs.

Tables 1 and 2 show the superiority of ResRep. On ResNet-50, ResRep achieves a 0.00% top-1 accuracy drop, which, to the best of our knowledge, is the first to realize lossless pruning with such a high pruning ratio. In terms of top-1 accuracy drop, ResRep outperforms SASL by 1.00%, HRank by 3.51%, and all the other recent competitors by a significant margin. On MobileNet, ResRep outperforms MetaPruning by 1.77%. For reference, the accuracy of the uniformly shrunk MobileNet (i.e., width multiplier 0.5 [23]), which has the same FLOPs as our result, is 63.7%. On ResNet-56/110, ResRep also outperforms the recent competitors by a significant margin, even though the comparison on accuracy drop is biased towards the other methods, as our base models deliver higher accuracy, i.e., it is more challenging to prune a higher-accuracy model without accuracy degradation.

As can be observed from the final width of each target layer (Fig. 2), given the desired global pruning ratio, ResRep discovers the appropriate final structure without any prior knowledge.
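The channel-selection rule of Sect. 3.3 that produces such width distributions can be sketched as follows. This is a simplified, hypothetical version: it honors only the channel selection limit θ and omits the FLOPs-based stopping criterion, and the function name and toy matrices are made up.

```python
import numpy as np

def select_masks(compactors, theta):
    """Pick the theta globally smallest compactor rows (by L2 norm, Eq. 13)
    and set their mask values to 0; all other channels keep mask 1."""
    scored = [(np.linalg.norm(Q[j, :]), i, j)
              for i, Q in enumerate(compactors) for j in range(Q.shape[0])]
    masks = [np.ones(Q.shape[0]) for Q in compactors]
    for _, i, j in sorted(scored)[:theta]:  # ascending by row norm
        masks[i][j] = 0.0
    return masks

# Toy example: two compactors whose rows already differ in magnitude.
Q1 = np.diag([1.0, 0.01, 1.0])
Q2 = np.diag([0.02, 1.0])
m1, m2 = select_masks([Q1, Q2], theta=2)
assert m1.tolist() == [1.0, 0.0, 1.0]   # middle row of Q1 is masked
assert m2.tolist() == [0.0, 1.0]        # first row of Q2 is masked
```

Because all row norms start at 1 (identity initialization), the comparison is fair across layers, and re-running the selection every few iterations lets the width of each layer emerge gradually.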
Notably, ResRep chooses to preserve more channels at higher-level layers of ResNet-50 and MobileNet, but prunes aggressively on the last block of ResNet-56, which ends up with only one channel in its first layer. A possible explanation is that rich higher-level features are essential for maintaining the fitting capacity on a difficult task like ImageNet, while ResNet-56 suffers from over-fitting on CIFAR-10.

We then perform controlled experiments with the same training configurations as described above on ResNet-56 to evaluate the significance of Convolutional Re-parameterization (Rep) and Gradient Resetting (Res) separately. As the baseline, we adopt the traditional paradigm by directly adding the Lasso loss (Eq. 10) on all the target layers. With four values of λ, we obtain four models with different final accuracy: 69.81%, 87.09%, 92.65%, 93.69%. To realize perfect pruning on each trained model, we attain the minimal structure by removing the channels one at a time until the accuracy drops below the original, i.e., pruning any one more channel of the minimal structure decreases the accuracy. Then we record the FLOPs reduction of the minimal structures: 81.24%, 71.94%, 57.56%, 28.31%. We test Rep but not Res by applying the Lasso loss on the compactors with varying λ to achieve FLOPs reductions comparable to the baselines. And with Res but not Rep, we directly apply Gradient Resetting on the original conv kernels, targeting the same FLOPs reduction as the four baseline models. Then we experiment with the full-featured ResRep. As shown in the left of Fig. 3, where the baseline data point of (81.24%, 69.81%) is omitted for better readability, Res and Rep deliver better final accuracy than the baselines, and perform even better when combined.

Figure 2: Width of target layers in pruned models. Left: ResNet-50, with the first 1 × 1 and 3 × 3 layer in each block shown separately. Middle: MobileNet. Right: ResNet-56. Vertical dashed lines indicate the stage transitions in ResNets.

Figure 3: Left: FLOPs reduction v.s. accuracy of baseline, Res, Rep and ResRep. Middle: the original and pruned accuracy of baseline and ResRep every 5 epochs. Right: the quadratic sum of survived parameters and those to-be-pruned (note the logarithmic scale).

We investigate the training process by saving the weights of the strongest-penalty baseline every 5 epochs. Upon the completion of training, we obtain the minimal structure, turn back to prune each saved model into the minimal structure, and observe the accuracy both before and after pruning. For the ResRep counterpart, we do the same but on the compactors instead of the original conv layers. As shown in the middle of Fig. 3, the baseline accuracy drops drastically because of the side-effects brought by the strong Lasso, which implies low resistance. In contrast, the original accuracy of ResRep remains at a high level. Both the baseline and ResRep models are severely damaged by pruning at the beginning, but as the sparsity emerges, the pruned accuracy (i.e., prunability) improves.
However, the prunability of the baseline improves slowly and unsteadily due to the competition between the two losses.

For each saved model, we also collect the quadratic sum of the parameters which survive at last, as well as the quadratic sum of those finally pruned, according to the final minimal structure. As shown in the right of Fig. 3 (note the logarithmic scale), the parameters of the baseline soon become too small to maintain the performance, which explains the poor resistance. However, for ResRep, the magnitude of the survived parameters decreases but remains at a high level due to the mild penalty, and the to-be-pruned (i.e., mask-0) parameters drop steadily and soon become very close to zero, which explains the high resistance and high prunability.

5 Conclusion

ResRep re-parameterizes a CNN into the "remembering" parts and "forgetting" parts, which can be equivalently converted back for inference. With regular SGD on the former and Gradient Resetting on the latter, we achieve both high resistance and prunability. The superiority of ResRep over the recent competitors suggests that decomposing the traditional learning-based pruning into "performance-oriented learning" and "pruning-oriented learning" may be a promising research direction.
References

[1] Reza Abbasi-Asl and Bin Yu. Structural compression of convolutional neural networks based on greedy filter pruning. arXiv preprint arXiv:1705.07356, 2017.
[2] Jose M. Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2270–2278, 2016.
[3] Jose M. Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems 29, pages 2262–2270, 2016.
[4] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.
[5] Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, pages 7948–7956, 2019.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[7] Xiaohan Ding, Guiguang Ding, Yuchen Guo, and Jungong Han. Centripetal SGD for pruning very deep convolutional networks with complicated structure. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4943–4953, 2019.
[8] Xiaohan Ding, Guiguang Ding, Yuchen Guo, Jungong Han, and Chenggang Yan. Approximated oracle filter pruning for destructive CNN width optimization. In International Conference on Machine Learning, pages 1607–1616, 2019.
[9] Xiaohan Ding, Guiguang Ding, Jungong Han, and Sheng Tang. Auto-balanced filter pruning for efficient convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, pages 6797–6804, 2018.
[10] Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, and Ji Liu. Global sparse momentum SGD for pruning very deep neural networks. In Advances in Neural Information Processing Systems, pages 6379–6391, 2019.
[11] Xiaohan Ding, Yuchen Guo, Guiguang Ding, and Jungong Han. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1911–1920, 2019.
[12] Tao Dong, Jing He, Shiqing Wang, Lianzhang Wang, Yuqi Cheng, and Yi Zhong. Inability to activate Rac1-dependent forgetting contributes to behavioral inflexibility in mutants of multiple autism-risk genes. Proceedings of the National Academy of Sciences of the United States of America, 113(27):7644–7649, 2016.
[13] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pages 1379–1387, 2016.
[14] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[15] Akiko Hayashi-Takagi, Sho Yagishita, Mayumi Nakamura, Fukutoshi Shirai, Yi Wu, Amanda L. Loshbaugh, Brian Kuhlman, Klaus M. Hahn, and Haruo Kasai. Labelling and optical erasure of synaptic memory traces in the motor cortex. Nature, 525:333–338, 2015.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[17] Yang He, Yuhang Ding, Ping Liu, Linchao Zhu, Hanwang Zhang, and Yi Yang. Learning filter pruning criteria for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[18] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 2234–2240, 2018.
[19] Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4340–4349, 2019.
[20] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In European Conference on Computer Vision, pages 815–832. Springer, 2018.
[21] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.
[22] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[23] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[24] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[26] Jianbo Jiao, Yunchao Wei, Zequn Jie, Honghui Shi, Rynson W. H. Lau, and Thomas S. Huang. Geometry-aware distillation for indoor semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2869–2878, 2019.
[27] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[28] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.
[29] Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. HRank: Filter pruning using high-rank feature map. CoRR, abs/2002.10179, 2020.
[30] Shaohui Lin, Rongrong Ji, Yuchao Li, Cheng Deng, and Xuelong Li. Towards compact convnets via structure-sparsity regularized filter pruning. arXiv preprint arXiv:1901.07827, 2019.
[31] Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, and David S. Doermann. Towards optimal structured CNN pruning via generative adversarial learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2790–2799, 2019.
[32] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
[33] Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2604–2613, 2019.
[34] Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting Cheng, and Jian Sun. MetaPruning: Meta learning for automatic neural network channel pruning. In IEEE International Conference on Computer Vision (ICCV), pages 3295–3304, 2019.
[35] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In European Conference on Computer Vision (ECCV), pages 747–763. Springer, 2018.
[36] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In IEEE International Conference on Computer Vision (ICCV), pages 2755–2763, 2017.
[37] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019.
[38] Jian-Hao Luo and Jianxin Wu. AutoPruner: An end-to-end trainable filter pruning method for efficient deep model inference. arXiv preprint arXiv:1805.08941, 2018.
[39] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision, pages 5068–5076, 2017.
[40] Jian-Hao Luo, Hao Zhang, Hong-Yu Zhou, Chen-Wei Xie, Jianxin Wu, and Weiyao Lin. ThiNet: Pruning CNN filters for a thinner net. IEEE Trans. Pattern Anal. Mach. Intell., 41(10):2525–2538, 2019.
[41] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, 2017.
[42] Adam Polyak and Lior Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, 2015.
[43] PyTorch. PyTorch Official Example, 2020.
[44] PyTorch. Torchvision Official Models, 2020.
[45] Blake A. Richards and Paul W. Frankland. The persistence and transience of memory. Neuron, 94(6):1071–1084, 2017.
[46] Volker Roth and Bernd Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the 25th International Conference on Machine Learning, pages 848–855. ACM, 2008.
[47] Jun Shi, Jianfeng Xu, Kazuyuki Tasaka, and Zhibo Chen. SASL: Saliency-adaptive sparsity learning for neural network acceleration. CoRR, abs/2003.05891, 2020.
[48] Yichun Shuai, Binyan Lu, Ying Hu, Lianzhang Wang, Kan Sun, and Yi Zhong. Forgetting is regulated through Rac activity in Drosophila. Cell, 140:579–589, 2010.
[49] Noah Simon, Jerome H. Friedman, Trevor J. Hastie, and Robert Tibshirani. A sparse-group lasso. 2013.
[50] Jayakorn Vongkulbhisal, Phongtharin Vinayavekhin, and Marco Visentini Scarzanella. Unifying heterogeneous classifiers with distillation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3175–3184, 2019.
[51] Huan Wang, Qiming Zhang, Yuehai Wang, Lu Yu, and Haoji Hu. Structured pruning for efficient convnets via incremental regularization. In International Joint Conference on Neural Networks, pages 1–8, 2019.
[52] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
[53] Xiaofan Xu, Mi Sun Park, and Cormac Brick. Hybrid pruning: Thinner sparse networks for fast inference on edge devices. arXiv preprint arXiv:1811.00482, 2018.
[54] Yuhui Xu, Yuxi Li, Shuai Zhang, Wei Wen, Botao Wang, Yingyong Qi, Yiran Chen, Weiyao Lin, and Hongkai Xiong. TRP: Trained rank pruning for efficient deep neural networks. CoRR, abs/2004.14566, 2020.
[55] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S. Davis. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9194–9203, 2018.
[56] Yiren Zhao, Xitong Gao, Daniel Bates, Robert D. Mullins, and Cheng-Zhong Xu. Focused quantization for sparse CNNs. In Advances in Neural Information Processing Systems, pages 5585–5594, 2019.
[57] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. In