Lossless CNN Channel Pruning via Decoupling Remembering and Forgetting
Xiaohan Ding, Tianxiang Hao, Jianchao Tan, Ji Liu, Jungong Han, Yuchen Guo, Guiguang Ding
Beijing National Research Center for Information Science and Technology (BNRist); School of Software, Tsinghua University, Beijing, China
AI Platform Department, Seattle AI Lab, and FeDA Lab, Kwai Inc.
WMG Data Science, University of Warwick, Coventry, United Kingdom
Department of Automation, Tsinghua University; Institute for Brain and Cognitive Sciences, Tsinghua University, Beijing, China
September 2, 2020
Abstract
Channel pruning (a.k.a. filter pruning) aims to slim down a convolutional neural network (CNN) by reducing the width (i.e., numbers of output channels) of convolutional layers. However, as CNN's representational capacity depends on the width, doing so tends to degrade the performance. A traditional learning-based channel pruning paradigm applies a penalty on parameters to improve the robustness to pruning, but such a penalty may degrade the performance even before pruning. Inspired by the neurobiology research about the independence of remembering and forgetting, we propose to re-parameterize a CNN into the remembering parts and forgetting parts, where the former learn to maintain the performance and the latter learn for efficiency. By training the re-parameterized model using regular SGD on the former but a novel update rule with penalty gradients on the latter, we achieve structured sparsity, enabling us to equivalently convert the re-parameterized model into the original architecture with narrower layers. With our method, we can slim down a standard ResNet-50 with 76.15% top-1 accuracy on ImageNet to a narrower one with only 43.9% FLOPs and no accuracy drop. Code and models are released at https://github.com/DingXiaoH/ResRep.

1 Introduction

Convolutional Neural Network (CNN) is one of the most popular models for deep learning. To compress and accelerate CNN for efficient inference, numerous methods have been proposed, including sparsification [10, 13, 14], channel pruning [8, 19, 21, 29], quantization [4, 5, 35, 56], knowledge distillation [22, 26, 33, 50], etc. Channel pruning [21] (a.k.a. filter pruning [28] or network slimming [36]) reduces the width (i.e., number of output channels) of each convolutional layer, which can effectively reduce the required FLOPs and memory footprint.
Of note is that channel pruning is complementary to the other model compression and acceleration techniques, because it simply produces a thinner model of the original architecture with no customized structures or extra operations.

However, as CNN's representational capacity depends on the width of conv layers, it is difficult to reduce the width without performance drops. On practical CNN architectures like ResNet-50 [16] and large-scale datasets like ImageNet [6], lossless pruning with a high compression ratio has long been considered challenging. For a reasonable trade-off between compression ratio and performance, a typical paradigm (Fig. 1.A) [2, 3, 9, 30, 32, 51, 52] seeks to train the model with a magnitude-related penalty loss (e.g., group Lasso [46, 49]) on the conv kernels to produce structured sparsity. That is, all the parameters of some channels become small in magnitude. Though such a process may degrade the performance by an acceptable margin, pruning such channels causes less damage. Notably, if the parameters of the pruned channels are small enough, the pruned model may deliver the same performance as before (i.e., after training but before pruning), which we refer to as perfect pruning.

For convenience, we propose to evaluate a training-based pruning method from two aspects.
1) Resistance. The training process aims to introduce desired properties such as structured sparsity into the model for pruning. However, the emergence of such properties may degrade the model's performance. If the model resists such degradation, i.e., maintains the accuracy, we say it has high resistance.
2) Prunability. If the trained model endures a high pruning ratio with a low performance drop, we say it has high prunability.

Obviously, we seek a pruning method with both high resistance and high prunability. However, the traditional penalty-based paradigm naturally suffers from a resistance-prunability trade-off. Taking group Lasso as an example, if we use a strong penalty to achieve high structured sparsity, the performance will drop significantly during training (Fig. 3). Contrarily, with a weak penalty to maintain the performance, we will achieve low sparsity, hence low prunability. A detailed analysis will be presented in Sect. 3.3.

In this paper, we propose a novel method named ResRep, which comprises two key components, namely Convolutional Re-parameterization and Gradient Resetting, to address the above problem. ResRep surpasses the recent competitors by a significant margin and, to the best of our knowledge, is the first to achieve truly lossless pruning on ResNet-50 on ImageNet (76.15% top-1 accuracy before and after pruning) with a high pruning ratio of 56.1% (i.e., the resulting model has only 43.9% of the original FLOPs).

ResRep is inspired by the neurobiology research on remembering and forgetting. On the one hand, remembering requires the network to potentiate some synapses but depotentiate the others, which resembles the training process of CNN, making some parameters large and some small.
On the other hand, synapse elimination via shrinkage or loss of spines is one of the classical forgetting mechanisms [45] and a key process for improving both energy and space efficiency in biological neural networks, which resembles pruning. Neurobiology research reveals that remembering and forgetting are independently controlled by the Rutabaga adenylyl cyclase-mediated memory formation mechanism and the Rac-regulated spine shrinkage mechanism, respectively [12, 15, 48], indicating that it is more reasonable to control the learning and pruning by two decoupled modules.

Inspired by such independence, we propose to decouple the "remembering" and "forgetting", which are coupled in the traditional paradigm, because the conv parameters are involved in both the "remembering" (objective function) and the "forgetting" (penalty loss) in order for them to achieve a trade-off. Concretely, we first re-parameterize the original model into "remembering parts" and "forgetting parts", then apply "remembering learning" (i.e., regular SGD with the original objective function) on the former to maintain the "memory" (original performance), and "forgetting learning" (a customized update rule named Gradient Resetting) on the latter to "eliminate synapses" (zero out channels). More specifically, we re-parameterize the original conv-BN (short for a conv layer followed by batch normalization [25]) sequences as conv-BN-compactor, where a compactor is a pointwise (i.e., 1 × 1) conv layer. During training, we add penalty gradients to only the compactors, select some compactor channels, and zero out their gradients derived from the objective function. After training, we prune the compactors into narrower ones. As the word "re-parameterization" suggests, a conv-BN-compactor sequence is another parameterization of a regular conv layer, which makes it feasible to equivalently convert the former into the latter for inference. Eventually, the resulting model will have the same architecture as the original but narrower layers.
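The two ideas above can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' implementation: the compactor is modeled as a D × D matrix applied as a 1 × 1 conv, so identity initialization is a no-op; and once a channel's objective-related gradient is masked out, the penalty gradient alone drives that row towards zero. The shapes, learning rate, and penalty strength are made up (the penalty is exaggerated so the effect shows up within a few steps).

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, W = 4, 5, 5
y = rng.standard_normal((D, H, W))   # output of a conv-BN block, one example

def compactor(y, Q):
    # Pointwise (1x1) conv: each output channel is a linear mix of input channels.
    return np.einsum('dc,chw->dhw', Q, y)

# Identity initialization: the re-parameterized model equals the original.
Q = np.eye(D)
assert np.allclose(compactor(y, Q), y)

# "Forgetting learning" on one row of Q: with the objective-related gradient
# masked out, only the penalty gradient lam * q / ||q|| remains, so the row
# shrinks steadily no matter how important it is to the objective.
lam, lr = 0.05, 0.5   # exaggerated values for a quick demo
q = Q[1].copy()
for _ in range(200):
    n = np.linalg.norm(q)
    if n <= lr * lam:     # the next step would overshoot; snap to exactly zero,
        q[:] = 0.0        # mirroring the final pruning of close-to-zero rows
        break
    q = q - lr * (lam * q / n)
assert np.linalg.norm(q) < 1e-3
```

The point of the sketch is the decoupling: the original conv-BN weights are never touched by the penalty, while the masked compactor row converges to zero regardless of the data.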
Fig. 1.B shows an example for illustration. ResRep features: High resistance. ResRep does not change the loss function, update rule, or any training hyper-parameters of the original CNN parameters (i.e., the conv-BN parts), so that they learn to maintain the performance as usual. High prunability. The compactors are driven by the penalty gradients to learn which channels to prune, and we can make many compactor channels small enough to realize perfect pruning, even with a mild penalty strength. Given the required global reduction ratio of FLOPs, ResRep automatically finds the appropriate eventual width of each layer with no prior knowledge, making it a powerful tool for CNN structure optimization. End-to-end training and easy implementation.

Figure 1: Traditional penalty-based channel pruning v.s. ResRep. We prune a 3 × 3 conv layer with one input channel and four output channels for illustration. For the ease of visualization, we ravel the kernel K ∈ R^{4×1×3×3} into a matrix W ∈ R^{4×9}. A) To prune some channels of K (i.e., rows of W), we add a penalty loss on the kernel to the original loss, so that the gradients make some rows smaller in magnitude, but not small enough to realize perfect pruning. B) ResRep constructs a compactor with kernel matrix Q ∈ R^{4×4}. Driven by the penalty gradients, the compactor selects some of its channels and generates a binary mask, which resets some of the original gradients of Q to zero. After multiple iterations, those compactor channels with reset gradients become infinitely close to zero, which enables perfect pruning. Finally, the conv-BN-compactor sequence is equivalently converted into a regular conv layer with two channels. Blank rectangles indicate zero values.

We summarize our contributions as follows.

• Inspired by neurobiology research, we have proposed to decouple "remembering" and "forgetting" in channel pruning.

• We have proposed a novel method with two components, Gradient Resetting and Convolutional Re-parameterization, to achieve both high resistance and prunability.

• We have achieved state-of-the-art channel pruning results on common benchmark models, including truly lossless pruning on ResNet-50 on ImageNet with a pruning ratio of 56.1%.

2 Related Work

Most of the channel pruning methods can be categorized into two families.
Pruning-then-finetuning methods identify and prune the unimportant channels from a well-trained model by some measurements [1, 24, 28, 39, 41, 42, 55], which may cause a significant accuracy drop, and finetune it afterwards to restore the performance. However, a major drawback is that the pruned models are difficult to finetune, and the final accuracy is not guaranteed. As a prior work [37] highlighted, the pruned models can be easily trapped into bad local minima, and sometimes cannot even reach a level of accuracy similar to a counterpart of the same structure trained from scratch. Such a discovery motivated us to pursue perfect pruning, which eliminates the need for finetuning.
Learning-based pruning methods overcome such a drawback by a customized learning process. Apart from the above-mentioned penalty-based paradigm that zeroes out some of the channels [2, 3, 9, 30, 32, 51, 52], some other methods prune via meta-learning [34], adversarial learning [31], etc. Compared with these complicated methods, ResRep can be easily implemented and trained end-to-end.
3 ResRep

3.1 Formulation

We first introduce the formulation and background of convolution and channel pruning. For a conv layer with D output channels, C input channels and kernel size K × K, we use K ∈ R^{D×C×K×K} to denote the kernel parameter tensor, and b ∈ R^D for the optional bias term. Let I ∈ R^{N×C×H×W} be the input, O ∈ R^{N×D×H′×W′} be the output, ⊛ be the convolution operator, and B be the broadcast function which replicates b into an N × D × H′ × W′ tensor. We have

    O = I ⊛ K + B(b).    (1)

For a conv layer with no bias term but a following batch normalization (BN) [25] layer with accumulated mean μ, standard deviation σ, learned scaling factor γ and bias β ∈ R^D, we have

    O_{:,j,:,:} = ((I ⊛ K)_{:,j,:,:} − μ_j) γ_j / σ_j + β_j,  ∀ 1 ≤ j ≤ D.    (2)

Let i be the index of a convolutional layer. To prune conv i by some rules (e.g., removing channels with the smallest norms [28]), we obtain the index set of pruned channels P(i) ⊂ {1, 2, ..., D}, then its complementary set S(i) = {1, 2, ..., D} \ P(i) as the index set of channels which survive. The pruning operation preserves the S(i) output channels of conv i and the corresponding input channels of conv i+1, which is the succeeding layer, and discards the others. The corresponding entries in the bias or following BN, if any, should be discarded as well. Formally, the obtained kernels are

    K^{(i)′} = K^{(i)}_{S(i),:,:,:},  K^{(i+1)′} = K^{(i+1)}_{:,S(i),:,:}.    (3)

3.2 Convolutional Re-parameterization

Inspired by the neurobiology research about the independence of forgetting and remembering [12, 15, 45, 48], we propose to explicitly re-parameterize the original model into "remembering parts" and "forgetting parts". Specifically, for every conv layer together with the following BN we desire to prune, which are referred to as the target layers, we re-parameterize with an additional pointwise (1 × 1) conv with kernel Q ∈ R^{D×D}, which is named the compactor.
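The slicing in Eq. 3 amounts to a few lines of array indexing. A minimal sketch with made-up shapes:

```python
import numpy as np

# Pruning as kernel slicing (Eq. 3): drop some output channels of conv i and
# the matching input channels of the succeeding conv i+1.
D, C, k = 4, 3, 3
K_i  = np.zeros((D, C, k, k))      # kernel of conv i
K_i1 = np.zeros((5, D, k, k))      # kernel of conv i+1 (5 output channels)

P = {1, 3}                          # indices of pruned channels of conv i
S = sorted(set(range(D)) - P)       # surviving channels

K_i_new  = K_i[S, :, :, :]          # K^(i)'   = K^(i)_{S,:,:,:}
K_i1_new = K_i1[:, S, :, :]         # K^(i+1)' = K^(i+1)_{:,S,:,:}
assert K_i_new.shape  == (2, C, k, k)
assert K_i1_new.shape == (5, 2, k, k)
```

Any bias or BN parameters attached to conv i would be sliced with the same index set S.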
As the training begins, we initialize the conv-BN with the original weights of an off-the-shelf model and Q as an identity matrix, so that the re-parameterized model produces outputs identical to the original. After training with Gradient Resetting, which will be described in detail in Sect. 3.3, we prune the resulting close-to-zero channels of the compactors and convert the model into the original architecture with narrower layers. Concretely, for a specific compactor with kernel Q, we prune the channels with norm smaller than a threshold ε. Formally, we obtain the to-be-pruned set by P = {j | ||Q_{j,:}|| < ε}, or the surviving set S = {j | ||Q_{j,:}|| ≥ ε}. Similar to Eq. 3, we prune Q by Q′ = Q_{S,:}. In our experiments, we use ε = 1 × 10^{−5}, which is found to be small enough to realize perfect pruning. Now that the compactor has fewer rows than columns, i.e., Q′ ∈ R^{D′×D} with D′ = |S|, we seek to convert the conv-BN-compactor into a conv layer with kernel K′ ∈ R^{D′×C×K×K} and bias b′ ∈ R^{D′}.

Firstly, we point out that a conv-BN sequence can be equivalently converted into a conv layer for inference, which produces outputs identical to the original. With K, μ, σ, γ, β of a conv layer and its following BN, we can construct a new conv layer with kernel K̂ and bias b̂ as follows:

    K̂_{j,:,:,:} = (γ_j / σ_j) K_{j,:,:,:},  b̂_j = −μ_j γ_j / σ_j + β_j,  ∀ 1 ≤ j ≤ D.    (4)

Given Eq. 1, Eq. 2 and the homogeneity of convolution [11], it is easy to verify that

    ((I ⊛ K)_{:,j,:,:} − μ_j) γ_j / σ_j + β_j = (I ⊛ K̂ + B(b̂))_{:,j,:,:},  ∀ 1 ≤ j ≤ D.    (5)

After obtaining K̂ and b̂, we seek the formula to construct K′ and b′ so that

    (I ⊛ K̂ + B(b̂)) ⊛ Q′ = I ⊛ K′ + B(b′).    (6)
With the additivity of convolution, we arrive at

    I ⊛ K̂ ⊛ Q′ + B(b̂) ⊛ Q′ = I ⊛ K′ + B(b′).    (7)

The intuition is that every channel of B(b̂) is a constant matrix, thus every channel of B(b̂) ⊛ Q′ is also a constant matrix. And since the 1 × 1 convolution with Q′ on the result of I ⊛ K̂ only performs cross-channel recombination, it is feasible to merge Q′ into K̂ by recombining the entries in K̂. Let T be the transpose function (e.g., T(K̂) is a C × D × K × K tensor); we present the formulas to construct K′ and b′, which can be easily verified:

    K′ = T(T(K̂) ⊛ Q′),  b′_j = b̂ · Q′_{j,:},  ∀ 1 ≤ j ≤ D′.    (8)

For the ease of implementation, we convert and save the weights of the trained re-parameterized model, construct a model with the original architecture but narrower layers without BN, and use the saved weights for testing and deployment.

3.3 Gradient Resetting

We describe how to produce structured sparsity in compactors while maintaining the performance. We start from the situation where we use the traditional penalty-loss-based paradigm on a specific kernel K to make the magnitude of some channels smaller for high prunability, i.e., ||K_{P,:,:,:}|| → 0. Let Θ be the universal set of parameters, X, Y be the data examples and labels, and L_perf(X, Y, Θ) be the performance-related objective function (e.g., cross-entropy for classification tasks). The traditional paradigm adds a penalty loss term P(K) to the original loss with a pre-defined strength factor λ:

    L_total(X, Y, Θ) = L_perf(X, Y, Θ) + λ P(K),    (9)

where the common forms of P include L1 [28], L2 [9], and group Lasso [32, 52]. Specifically, group Lasso is effective in producing channel-wise structured sparsity:

    P_Lasso(K) = Σ_{j=1}^{D} ||K_{j,:,:,:}||.    (10)
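Before moving on, the conversion chain of Sect. 3.2 (Eqs. 4-8) can be verified numerically. The sketch below, with made-up shapes and values and a naive convolution (not the authors' implementation), checks that a conv-BN-compactor sequence and the single merged conv produce identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
D, Dp, C, k, H, W = 4, 2, 3, 3, 6, 6
x = rng.standard_normal((C, H, W))               # input, one example
K = rng.standard_normal((D, C, k, k))            # conv kernel
mu, sigma = rng.standard_normal(D), rng.uniform(0.5, 2.0, D)  # BN statistics
gamma, beta = rng.standard_normal(D), rng.standard_normal(D)  # BN affine params
Qp = rng.standard_normal((Dp, D))                # pruned compactor, D' < D rows

def conv2d(x, K):
    # Naive 'valid' cross-correlation (what deep-learning "convolution" computes).
    D_out, _, k, _ = K.shape
    out = np.empty((D_out, x.shape[1] - k + 1, x.shape[2] - k + 1))
    for d in range(D_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[d, i, j] = np.sum(K[d] * x[:, i:i + k, j:j + k])
    return out

# conv -> BN (inference form, Eq. 2) -> compactor (1x1 conv with Qp)
y_bn = (conv2d(x, K) - mu[:, None, None]) * (gamma / sigma)[:, None, None] \
       + beta[:, None, None]
y_seq = np.einsum('pd,dhw->phw', Qp, y_bn)

# Step 1 (Eq. 4): fuse BN into the conv kernel and bias
K_hat = (gamma / sigma)[:, None, None, None] * K
b_hat = beta - mu * gamma / sigma
assert np.allclose(y_bn, conv2d(x, K_hat) + b_hat[:, None, None])  # Eq. 5

# Step 2 (Eq. 8): merge the compactor; a 1x1 conv is a channel recombination
K_prime = np.einsum('pd,dchw->pchw', Qp, K_hat)
b_prime = Qp @ b_hat
y_merged = conv2d(x, K_prime) + b_prime[:, None, None]
assert np.allclose(y_seq, y_merged)                                # Eqs. 6-7
```

In deployment, this folding runs once offline, so inference uses only the merged kernel K′ and bias b′.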
In the following discussions with group Lasso as the form of penalty, we focus on a specific channel in K, which is denoted by F = K_{j,:,:,:}. Let G(F) be the gradient we use for the SGD update; we have

    G(F) = ∂L_total(X, Y, Θ) / ∂F = ∂L_perf(X, Y, Θ) / ∂F + λ F / ||F||.    (11)

The training dynamics are quite straightforward: starting from a well-trained model, F resides near a local optimum, thus the first term of Eq. 11 is close to 0, but the second is not, so F will be pushed closer to 0. If F is important to the performance, the objective function will intend to maintain its magnitude, i.e., the first gradient term will become larger to compete against the second, thus F will end up smaller than it used to be, depending on λ. Otherwise, taking the extreme case for example, if F does not influence L_perf at all, the first term will be 0, so F will keep moving towards 0, driven by the second term. That is, the performance-related loss and the penalty loss compete, so that the resulting value of F reflects its importance, which we refer to as competence-based importance evaluation for convenience. However, we face a dilemma. Problem A: The penalty deviates the parameters of every channel from the optima of the objective function. Notably, a mild deviation may not bring negative effects, e.g., L2 regularization can also be viewed as a mild deviation. However, with a strong penalty, though some channels are zeroed out for pruning, the remaining channels are also made too small to maintain the representational capacity, which is an undesired side-effect.
Problem B: With a mild penalty for high resistance, we cannot achieve high prunability, because most of the channels merely become smaller than they used to be, but not close enough to 0 for perfect pruning.

We propose to achieve high prunability with a mild penalty by resetting the gradients derived from the objective function. Specifically, we introduce a binary mask m ∈ {0, 1}, which indicates whether we wish to zero out F. For the ease of implementation, we add no terms to the objective function (i.e., L_total = L_perf), simply compute the gradients, then manually apply the mask and add the penalty gradients. Then we use the resulting gradients for the SGD update. That is,

    G(F) = (∂L_perf(X, Y, Θ) / ∂F) m + λ F / ||F||.    (12)

In practice, for deciding which channels to zero out (i.e., setting mask values for multiple channels), we may simply follow the smaller-norm-less-important rule [28] or other heuristics [24, 41]. In this way, we have solved the above two problems. A) Though we add Lasso gradients to the objective-related gradients of every channel, which is equivalent to deviating the optima by adding the Lasso loss to the original loss, the deviation is mild (λ = 1 × 10^{−4} in our experiments), thus doing so does not degrade the performance. B) With m = 0, the first term no longer exists to compete against the second, thus even a mild λ can make F steadily move towards 0.

Though doing so shows superiority over the traditional paradigm (Fig. 3) by simply zeroing out the channels with smaller norms, we notice a problem: the zeroed-out objective-related gradients encode the information necessary for maintaining the performance, which should be preserved to improve the resistance. Fortunately, Convolutional Re-parameterization provides a natural solution. As shown in Sect.
3.2, if we train a compactor from an identity matrix into a matrix with many close-to-zero rows via Gradient Resetting, we will be able to convert a conv-BN-compactor into a narrower conv, without losing any information encoded in the gradients of the original kernels.

Firstly, we need to decide which channels of Q to zero out. After a few epochs of training from the initialized re-parameterized model, ||Q_{j,:}|| will reflect the importance of channel j, so we start to perform channel selection. Let n be the number of compactors and m^{(i)} ∈ R^{D^{(i)}} be the mask for the i-th compactor; we define t^{(i)} ∈ R^{D^{(i)}} as the metric vector,

    t^{(i)}_j = ||Q^{(i)}_{j,:}||,  ∀ 1 ≤ j ≤ D^{(i)}.    (13)

We calculate the metric values and organize them as a mapping M = {(i, j) → t^{(i)}_j | ∀ 1 ≤ i ≤ n, 1 ≤ j ≤ D^{(i)}}. Then we sort the values of M in ascending order, pick one at a time starting from the smallest, and set the corresponding mask value m^{(i)}_j to 0. We stop picking when the reduced FLOPs (i.e., the original FLOPs minus the FLOPs without the current mask-0 channels) reach our target, or when we have already picked θ (named the channel selection limit) channels. The mask values of the unpicked channels are set to 1. The motivation is straightforward: following the discussions of competence-based importance evaluation, just like the traditional usage of a penalty loss to compete against the original loss and remove the channels with smaller norms, we use the penalty gradients to compete with the original gradients. Even better, all the metric values are 1 at the beginning (because every compactor kernel is an identity matrix), making them fair to compare across different layers. We initialize θ as a small number, increase θ every several iterations, and re-select channels to

Table 1: Pruning results of ResNet-50 and MobileNet on ImageNet.
Model       Result            Base Top-1  Base Top-5  Pruned Top-1  Pruned Top-5  Top-1 ↓  Top-5 ↓  FLOPs ↓ %
ResNet-50   SFP [18]          76.15       92.87       74.61         92.06         1.54     0.81     41.8
            GAL-0.5 [31]      76.15       92.87       71.95         90.94         4.20     1.93     43.03
            HRank [29]        76.15       92.87       74.98         92.33         1.17     0.54     43.76
            NISP [55]         -           -           -             -             0.89     -        44.01
            Channel Pr [21]   -           92.2        -             90.8          -        1.4      50
            HP [53]           76.01       92.93       74.87         92.43         1.14     0.50     50
            MetaPruning [34]  76.6        -           75.4          -             1.2      -        51.10
            Autopr [38]       76.15       92.87       74.76         92.15         1.39     0.72     51.21
            FPGM [19]         76.15       92.87       74.83         92.32         1.32     0.55     53.5
            DCP [57]          76.01       92.93       74.95         92.32         1.06     0.61     55.76
            C-SGD [7]         75.33       92.56       74.54         92.09         0.79     0.47     55.76
            ThiNet [40]       75.30       92.20       72.03         90.99         3.27     1.21     55.83
            SASL [47]         76.15       92.87       75.15         92.47         1.00     0.40     56.10
            ResRep (ours)     76.15       92.87       76.15         92.90         0.00     -0.03    56.11
            TRP [54]          75.90       92.70       72.69         91.41         3.21     1.29     56.52
            LFPC [17]         76.15       92.87       74.46         92.32         1.69     0.55     60.8
            HRank [29]        76.15       92.87       71.98         91.01         4.17     1.86     62.10
            ResRep (ours)     76.15       92.87       75.49         92.55         0.66     0.32     62.10
MobileNet   MetaPruning [34]  70.6        -           66.1          -             4.5      -        73.81
            ResRep (ours)     70.78       89.78       68.05         87.66         2.73     2.12     73.91

avoid zeroing out too many channels at once. As shown in the right of Fig. 3, those mask-0 channels will become very close to 0 under the effect of the Lasso gradients; thus the structured sparsity emerges.

4 Experiments

We first introduce the datasets and benchmark models. We experiment with ResNet-50 [16] and MobileNet [23] on
ImageNet [6], which contains 1.28M images for training and 50K for validation from 1000 classes. For reproducibility, we follow the official data augmentation provided by the PyTorch examples [43], including random cropping and left-right flipping. For the ResNet-50 base model, we used the official torchvision version (76.15% top-1 accuracy) [44] for a fair comparison with most competitors. For MobileNet, we trained from scratch with an initial learning rate of 0.1, batch size of 512, and cosine learning rate annealing for 70 epochs, achieving a top-1 accuracy of 70.78%, which is slightly higher than that reported in the original paper [23]. Then we use ResNet-56/110 [16] on
CIFAR-10 [27], which contains 50K images from 10 classes of 32 × 32 pixels for training and 10K for testing. We adopt the standard data augmentation [16]: padding to 40 × 40, random cropping and left-right flipping. We trained the base models with a batch size of 64 and the common learning rate schedule, which is initialized as 0.1, multiplied by 0.1 at epochs 120 and 180, and terminated after 240 epochs. We calculate the FLOPs in the same way as the original papers (3.86G for ResNet-50 [16], 569M for MobileNet [23], and 126M/253M for ResNet-56/110).

We apply ResRep to ResNet-50 and MobileNet with the same hyper-parameters: λ = 1 × 10^{−4}, channel selection limit θ = 4 with θ ← θ + 4 every 200 iterations, batch size of 256, initial learning rate of 0.01, and cosine annealing for 180 epochs. The first channel selection begins after 5 epochs. To align the pruning ratios for the ease of comparison, we experiment with ResNet-50 twice, with FLOPs reduction targets of 56.1% and 62.1%, to compare with SASL [47] and HRank [29], respectively, and with MobileNet at 73.9% to compare

Table 2: Pruning results of ResNet-56/110 on CIFAR-10.

Model       Result          Base top-1  Pruned top-1  Top-1 ↓ %  FLOPs ↓ %
ResNet-56   AMC [20]        92.8        91.9          0.9        50
            FPGM [19]       93.59       93.26         0.33       52.6
            SFP [18]        93.59       93.35         0.24       52.6
            LFPC [17]       93.59       93.24         0.35       52.9
            ResRep (ours)   93.71       93.73         -0.02      52.91
            TRP [54]        93.14       91.62         1.52       77.82
            ResRep (ours)   93.71       92.67         1.04       77.83
ResNet-110  Li et al. [28]  93.53       93.30         0.23       38.60
            GAL-0.5 [31]    93.50       92.74         0.76       48.5
            HRank [29]      93.50       93.36         0.14       58.2
            ResRep (ours)   94.64       94.62         0.02       58.21

with MetaPruning [34]. Following most competitors, we prune the first (1 × 1) and second (3 × 3) conv layers in every residual block of ResNet-50, and every non-depthwise conv of MobileNet. Moreover, inspired by a prior work [10], which modifies the gradients and utilizes momentum and weight decay for CNN sparsification, we raise the momentum coefficient of SGD on the compactors from 0.9 (the default setting in most cases) to 0.99. The intuition is that the mask-0 channels continuously grow in the same direction (i.e., towards zero), and such a tendency accumulates in the momentum, thus the zeroing-out process can be accelerated by a larger momentum coefficient. For ResNet-56/110 on CIFAR-10, the target layers include all the first layers in every residual block, and we use the same hyper-parameters as on ImageNet except for a batch size of 64 and cosine learning rate annealing for 480 epochs.

Tables 1 and 2 show the superiority of ResRep. On ResNet-50, ResRep achieves a 0.00% top-1 accuracy drop, which, to the best of our knowledge, is the first to realize lossless pruning with such a high pruning ratio. In terms of top-1 accuracy drop, ResRep outperforms SASL by 1.00%, HRank by 3.51%, and all the other recent competitors by a significant margin. On MobileNet, ResRep outperforms MetaPruning by 1.77%. For reference, the accuracy of the uniformly shrunk MobileNet (i.e., width multiplier 0.5 [23]), which has the same FLOPs as our result, is 63.7%. On ResNet-56/110, ResRep also outperforms the recent competitors by a significant margin, even though the comparison on accuracy drop is biased towards the other methods, as our base models deliver higher accuracy, i.e., it is more challenging to prune a higher-accuracy model without accuracy degradation.

As can be observed from the final width of each target layer (Fig. 2), given the desired global pruning ratio, ResRep discovers the appropriate final structure without any prior knowledge.
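The channel-selection rule of Sect. 3.3 that produces such width distributions can be sketched as follows. This is a simplified, hypothetical version: it honors only the channel selection limit θ and omits the FLOPs-based stopping criterion, and the function name and toy matrices are made up.

```python
import numpy as np

def select_masks(compactors, theta):
    """Pick the theta globally smallest compactor rows (by L2 norm, Eq. 13)
    and set their mask values to 0; all other channels keep mask 1."""
    scored = [(np.linalg.norm(Q[j, :]), i, j)
              for i, Q in enumerate(compactors) for j in range(Q.shape[0])]
    masks = [np.ones(Q.shape[0]) for Q in compactors]
    for _, i, j in sorted(scored)[:theta]:  # ascending by row norm
        masks[i][j] = 0.0
    return masks

# Toy example: two compactors whose rows already differ in magnitude.
Q1 = np.diag([1.0, 0.01, 1.0])
Q2 = np.diag([0.02, 1.0])
m1, m2 = select_masks([Q1, Q2], theta=2)
assert m1.tolist() == [1.0, 0.0, 1.0]   # middle row of Q1 is masked
assert m2.tolist() == [0.0, 1.0]        # first row of Q2 is masked
```

Because all row norms start at 1 (identity initialization), the comparison is fair across layers, and re-running the selection every few iterations lets the width of each layer emerge gradually.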
Notably, ResRep chooses to preserve more channels at higher-level layers of ResNet-50 and MobileNet, but prunes aggressively on the last block of ResNet-56, which ends up with only one channel in its first layer. A possible explanation is that rich higher-level features are essential for maintaining the fitting capacity on a difficult task like ImageNet, while ResNet-56 suffers from over-fitting on CIFAR-10.

We then perform controlled experiments with the same training configurations as described above on ResNet-56 to evaluate the significance of Convolutional Re-parameterization (Rep) and Gradient Resetting (Res) separately. As the baseline, we adopt the traditional paradigm by directly adding the Lasso loss (Eq. 10) on all the target layers. With four values of λ, we obtain four models with different final accuracy: 69.81%, 87.09%, 92.65%, 93.69%. To realize perfect pruning on each trained model, we attain the minimal structure by removing the channels one at a time until the accuracy drops below the original, i.e., pruning any one more channel of the minimal structure decreases the accuracy. Then we record the FLOPs reduction of the minimal structures: 81.24%, 71.94%, 57.56%, 28.31%. We test Rep but not Res by applying the Lasso loss on the compactors with varying λ to achieve FLOPs reductions comparable to the baselines. And with Res but not Rep, we directly apply Gradient Resetting on the original conv kernels, targeting the same FLOPs reduction as the four baseline models. Then we experiment with the full-featured ResRep. As shown in the left of Fig. 3, where the baseline data point of (81.24%, 69.81%) is omitted for better readability, Res and Rep deliver better final accuracy than the baselines, and perform even better when combined.

Figure 2: Width of target layers in pruned models. Left: ResNet-50, with the first 1 × 1 and 3 × 3 layer in each block shown separately. Middle: MobileNet. Right: ResNet-56. Vertical dashed lines indicate the stage transitions in ResNets.

Figure 3: Left: FLOPs reduction v.s. accuracy of baseline, Res, Rep and ResRep. Middle: the original and pruned accuracy of baseline and ResRep every 5 epochs. Right: the quadratic sum of survived parameters and those to-be-pruned (note the logarithmic scale).

We investigate the training process by saving the weights of the strongest-penalty baseline every 5 epochs. Upon the completion of training, we obtain the minimal structure, turn back to prune each saved model into the minimal structure, and observe the accuracy both before and after pruning. For the ResRep counterpart, we do the same but on the compactors instead of the original conv layers. As shown in the middle of Fig. 3, the baseline accuracy drops drastically because of the side-effects brought by the strong Lasso, which implies low resistance. In contrast, the original accuracy of ResRep remains at a high level. Both the baseline and ResRep models are severely damaged by pruning at the beginning, but as the sparsity emerges, the pruned accuracy (i.e., prunability) improves.
However, the prunability of the baseline improves slowly and unsteadily due to the competition between the two losses.

For each saved model, we also collect the quadratic sum of the parameters which survive at last, as well as the quadratic sum of those finally pruned, according to the final minimal structure. As shown in the right of Fig. 3 (note the logarithmic scale), the parameters of the baseline soon become too small to maintain the performance, which explains the poor resistance. However, for ResRep, the magnitude of the survived parameters decreases but remains at a high level due to the mild penalty, and the to-be-pruned (i.e., mask-0) parameters drop steadily and soon become very close to zero, which explains the high resistance and high prunability.

5 Conclusion

ResRep re-parameterizes a CNN into the "remembering" parts and "forgetting" parts, which can be equivalently converted back for inference. With regular SGD on the former and Gradient Resetting on the latter, we achieve both high resistance and prunability. The superiority of ResRep over the recent competitors suggests that decomposing the traditional learning-based pruning into "performance-oriented learning" and "pruning-oriented learning" may be a promising research direction.
References

[1] Reza Abbasi-Asl and Bin Yu. Structural compression of convolutional neural networks based on greedy filter pruning. arXiv preprint arXiv:1705.07356, 2017.
[2] Jose M. Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2270–2278, 2016.
[3] Jose M. Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems 29, pages 2262–2270, 2016.
[4] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.
[5] Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, pages 7948–7956, 2019.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[7] Xiaohan Ding, Guiguang Ding, Yuchen Guo, and Jungong Han. Centripetal SGD for pruning very deep convolutional networks with complicated structure. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4943–4953, 2019.
[8] Xiaohan Ding, Guiguang Ding, Yuchen Guo, Jungong Han, and Chenggang Yan. Approximated oracle filter pruning for destructive CNN width optimization. In International Conference on Machine Learning, pages 1607–1616, 2019.
[9] Xiaohan Ding, Guiguang Ding, Jungong Han, and Sheng Tang. Auto-balanced filter pruning for efficient convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, pages 6797–6804, 2018.
[10] Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, and Ji Liu. Global sparse momentum SGD for pruning very deep neural networks. In Advances in Neural Information Processing Systems, pages 6379–6391, 2019.
[11] Xiaohan Ding, Yuchen Guo, Guiguang Ding, and Jungong Han. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1911–1920, 2019.
[12] Tao Dong, Jing He, Shiqing Wang, Lianzhang Wang, Yuqi Cheng, and Yi Zhong. Inability to activate Rac1-dependent forgetting contributes to behavioral inflexibility in mutants of multiple autism-risk genes. Proceedings of the National Academy of Sciences of the United States of America, 113(27):7644–7649, 2016.
[13] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pages 1379–1387, 2016.
[14] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[15] Akiko Hayashi-Takagi, Sho Yagishita, Mayumi Nakamura, Fukutoshi Shirai, Yi Wu, Amanda L. Loshbaugh, Brian Kuhlman, Klaus M. Hahn, and Haruo Kasai. Labelling and optical erasure of synaptic memory traces in the motor cortex. Nature, 525:333–338, 2015.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[17] Yang He, Yuhang Ding, Ping Liu, Linchao Zhu, Hanwang Zhang, and Yi Yang. Learning filter pruning criteria for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[18] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 2234–2240, 2018.
[19] Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4340–4349, 2019.
[20] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In European Conference on Computer Vision, pages 815–832. Springer, 2018.
[21] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.
[22] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[23] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[24] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[26] Jianbo Jiao, Yunchao Wei, Zequn Jie, Honghui Shi, Rynson W. H. Lau, and Thomas S. Huang. Geometry-aware distillation for indoor semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2869–2878, 2019.
[27] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[28] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.
[29] Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. HRank: Filter pruning using high-rank feature map. CoRR, abs/2002.10179, 2020.
[30] Shaohui Lin, Rongrong Ji, Yuchao Li, Cheng Deng, and Xuelong Li. Towards compact convnets via structure-sparsity regularized filter pruning. arXiv preprint arXiv:1901.07827, 2019.
[31] Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, and David S. Doermann. Towards optimal structured CNN pruning via generative adversarial learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2790–2799, 2019.
[32] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
[33] Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2604–2613, 2019.
[34] Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting Cheng, and Jian Sun. MetaPruning: Meta learning for automatic neural network channel pruning. In IEEE International Conference on Computer Vision (ICCV), pages 3295–3304, 2019.
[35] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In European Conference on Computer Vision (ECCV), pages 747–763. Springer, 2018.
[36] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In IEEE International Conference on Computer Vision (ICCV), pages 2755–2763, 2017.
[37] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019.
[38] Jian-Hao Luo and Jianxin Wu. AutoPruner: An end-to-end trainable filter pruning method for efficient deep model inference. arXiv preprint arXiv:1805.08941, 2018.
[39] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision, pages 5068–5076, 2017.
[40] Jian-Hao Luo, Hao Zhang, Hong-Yu Zhou, Chen-Wei Xie, Jianxin Wu, and Weiyao Lin. ThiNet: Pruning CNN filters for a thinner net. IEEE Trans. Pattern Anal. Mach. Intell., 41(10):2525–2538, 2019.
[41] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, 2017.
[42] Adam Polyak and Lior Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, 2015.
[43] PyTorch. PyTorch Official Example, 2020.
[44] PyTorch. Torchvision Official Models, 2020.
[45] Blake A. Richards and Paul W. Frankland. The persistence and transience of memory. Neuron, 94(6):1071–1084, 2017.
[46] Volker Roth and Bernd Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the 25th International Conference on Machine Learning, pages 848–855. ACM, 2008.
[47] Jun Shi, Jianfeng Xu, Kazuyuki Tasaka, and Zhibo Chen. SASL: Saliency-adaptive sparsity learning for neural network acceleration. CoRR, abs/2003.05891, 2020.
[48] Yichun Shuai, Binyan Lu, Ying Hu, Lianzhang Wang, Kan Sun, and Yi Zhong. Forgetting is regulated through Rac activity in Drosophila. Cell, 140:579–589, 2010.
[49] Noah Simon, Jerome H. Friedman, Trevor J. Hastie, and Robert Tibshirani. A sparse-group lasso. 2013.
[50] Jayakorn Vongkulbhisal, Phongtharin Vinayavekhin, and Marco Visentini Scarzanella. Unifying heterogeneous classifiers with distillation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3175–3184, 2019.
[51] Huan Wang, Qiming Zhang, Yuehai Wang, Lu Yu, and Haoji Hu. Structured pruning for efficient convnets via incremental regularization. In International Joint Conference on Neural Networks, pages 1–8, 2019.
[52] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
[53] Xiaofan Xu, Mi Sun Park, and Cormac Brick. Hybrid pruning: Thinner sparse networks for fast inference on edge devices. arXiv preprint arXiv:1811.00482, 2018.
[54] Yuhui Xu, Yuxi Li, Shuai Zhang, Wei Wen, Botao Wang, Yingyong Qi, Yiran Chen, Weiyao Lin, and Hongkai Xiong. TRP: Trained rank pruning for efficient deep neural networks. CoRR, abs/2004.14566, 2020.
[55] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S. Davis. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9194–9203, 2018.
[56] Yiren Zhao, Xitong Gao, Daniel Bates, Robert D. Mullins, and Cheng-Zhong Xu. Focused quantization for sparse CNNs. In Advances in Neural Information Processing Systems, pages 5585–5594, 2019.
[57] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. In