Holistic Filter Pruning for Efficient Deep Neural Networks
Lukas Enderich, Corporate Research, Robert Bosch GmbH, Renningen, Germany, [email protected]
Fabian Timm, Corporate Research, Robert Bosch GmbH, Renningen, Germany, [email protected]
Wolfram Burgard, Institute for Autonomous Intelligent Systems, University of Freiburg, Germany, [email protected]
Abstract
Deep neural networks (DNNs) are usually over-parameterized to increase the likelihood of getting adequate initial weights by random initialization. Consequently, trained DNNs have many redundancies which can be pruned from the model to reduce complexity and improve the ability to generalize. Structural sparsity, as achieved by filter pruning, directly reduces the tensor sizes of weights and activations and is thus particularly effective for reducing complexity. We propose Holistic Filter Pruning (HFP), a novel approach for common DNN training that is easy to implement and makes it possible to specify accurate pruning rates for the number of both parameters and multiplications. After each forward pass, the current model complexity is calculated and compared to the desired target size. By gradient descent, a global solution can be found that allocates the pruning budget over the individual layers such that the desired target size is fulfilled. In various experiments, we give insights into the training and achieve state-of-the-art performance on CIFAR-10 and ImageNet (HFP prunes 60% of the multiplications of ResNet-50 on ImageNet with no significant loss in accuracy). We believe our simple and powerful pruning approach constitutes a valuable contribution for users of DNNs in low-cost applications.
1. Introduction
Deep neural networks (DNNs) have a strong ability for data abstraction and outperform classical methods in many machine learning challenges such as computer vision, object detection, or speech recognition [1, 19]. However, recent progress has been made by training powerful models with many parameters on large-scale data sets [7, 32]. Frankle and Carbin [3] demonstrated the correlation between the initial model size and the probability of getting meaningful initial values for the parameters by random initialization. As a result, modern DNNs are usually over-parameterized, have high memory requirements, and need many floating-point multiplications, which are especially expensive concerning computation time and energy consumption [32].

Figure 1. Structural sparsity can be achieved by pruning complete filters or neurons from the network. Since filter pruning reduces both the number of filters in the respective layer and the number of output feature maps, the tensor sizes of both weights and activations decrease. With a reduced number of output feature maps, the depth of the following layer decreases to the same degree.

However, reduction techniques can significantly reduce the complexity of trained DNNs. On the one hand, quantization methods reduce the precision of both parameters and activations to accelerate DNNs on dedicated hardware [32, 2]. On the other hand, pruning and factorization methods reduce the number of parameters and multiplications rather than their bit-sizes [22, 21]. Structural sparsity, as achieved by filter pruning, directly reduces computation time, energy consumption, and memory requirements without the need for specialized hardware. A visualization of filter pruning is given in figure 1.

Unsupervised filter pruning usually fails to preserve the accuracy of the original model [29]. Therefore, data-driven approaches have been developed which either iteratively prune filters based on saliency scores [13, 22, 9, 38, 10, 24] or retrain the model under consideration of sparsity constraints [37, 11, 27, 31, 33]. Methods of the first category calculate saliency scores to rate the importance of individual filters. Filters with low saliency scores are considered unimportant and are deleted whereas the remaining filters are retrained. This process is repeated until the desired pruning rate is reached. In contrast, methods of the second category investigate sparsity constraints that can be integrated into the training of DNNs. Regularization terms push the sum of absolute values of filter weights towards zero [27, 37, 11]. Furthermore, in [31, 33, 34] learnable gate variables were introduced that scale single weights or complete filters by one or zero.

However, most recent approaches have some disadvantages. Determining saliency scores requires a lot of human labor and is usually a heuristic practice. Furthermore, layer-by-layer pruning as well as iterative pruning and retraining are unsuitable procedures for determining a global selection of filters to be pruned. Considering that all network layers jointly contribute to the learning task, it is inappropriate to prune single layers independently. Moreover, iterative pruning and retraining may prune filters that become important again at a later iteration.

In this work, we make the following contributions:

• We propose a holistic pruning approach that can be integrated into the training of DNNs by only a few lines of code. The proposed method induces sparsity via the channel-wise scaling factors of the batch-normalization layers. Hence, no additional variables are needed. Furthermore, the user can specify the desired model size in terms of the number of parameters and multiplications. The pruning budget is allocated over all layers automatically such that the desired model size is reached.

• We evaluate our pruning approach on two benchmark data-sets (CIFAR-10, ImageNet). We provide comparisons with recent filter pruning results and prove state-of-the-art performance on various DNN architectures. Furthermore, we analyze the allocation of pruning rates over the individual layers for different target sizes and layer types.
2. Related Work
DNNs are usually over-parameterized and have many redundant network connections, which can be eliminated (i.e., pruned) after the training to reduce the model complexity and improve the ability to generalize. The first pruning methods were aimed at setting single weights to zero in order to trim intermediate layer connections.
Optimal Brain Damage [20] utilized the second-order derivatives of the loss function to calculate saliencies for each network weight. Subsequently, weights with small saliency scores were pruned iteratively whereas the remaining weights were retrained. Since calculating the second-order derivatives of the loss function with respect to the parameters is too complex for large DNNs, many approaches applied magnitude-based pruning [6, 5, 27, 37].

However, since pruning single weights has no direct benefit for the hardware implementation of DNNs (unstructural sparsity), the number of non-zero weights is an insufficient indicator of the model complexity. In contrast, pruning complete filters or neurons from the network architecture directly reduces the tensor sizes of weights and activations (structural sparsity). A visualization is given in figure 1. Filter pruning methods can be divided into two subcategories [21]: saliency based pruning and retraining on the one hand and sparsity learning on the other. Both subcategories are based on pre-trained and usually over-parameterized models. In the following, a filter is equivalent to a channel or a neuron.
Saliency based pruning and retraining: These methods determine heuristics to calculate saliency scores for each filter. The saliency score indicates the importance of the respective filter: the higher the score, the more important the filter is considered to be for fulfilling the learning task. Based on the saliencies, a certain number or percentage of filters is pruned whereas the remaining ones are retrained. This process is repeated iteratively until the desired network size is reached.

Hu et al. identified unimportant filters by analyzing the magnitudes of the output activations [13]. Feature maps with comparatively small sums of absolute values were considered less important and hence removed. In contrast, Li et al. measured the importance of individual filters by calculating the sum over the absolute values of the weights [22]. Zhuang et al. argued that informative channels should have discriminative power [38]: they proposed a ranking heuristic to identify channels with high discriminative power while deleting redundant channels and their corresponding filters. Furthermore, He et al. set the weights of filters with small norms to zero [9]. During the retraining steps, however, the pruned filters were updated as well to improve the training behaviour. The procedure is repeated until the selection of filters with small norms converges. Furthermore, Yu et al. calculated saliency scores by minimizing the reconstruction error in the second-to-last layer before the classification output [35]. Recently, Zhonghui et al. introduced Gate Decorator, a pruning framework that uses gates to scale the channel-wise output of intermediate layers [34]. The change in the loss function caused by setting the gates to zero is calculated using a Taylor expansion and subsequently used for the saliency scores.
Sparsity learning induces sparsity constraints into the training of DNNs. Pan et al. approximated the L0-norm to penalize incoming and outgoing connections of single filters [30]. He et al. proposed a channel selection based on LASSO regression whereas pruning each layer is achieved by minimizing the reconstruction error of the output feature maps [11]. Furthermore, Liu et al. applied L1-norm based regularization on the scaling factors of the batch-normalization layers to scale single channels towards zero [26]. Subsequently, a certain percentile of channels is pruned according to a global threshold across all layers. However, in extreme cases this could lead to all channels of a single layer being pruned. Additionally, the scaling factors are penalized without considering the respective filter size. In contrast, Huang et al. proposed a try-and-learn algorithm to train pruning agents that identify superfluous filters [15]. Recently, Xiao et al. introduced AutoPrune, a framework that uses a set of additional parameters to prune single weights or filters during each forward pass [33]. However, in their implementation, the pruning layers are located in front of the batch-normalization layers, which reactivate the pruned channels (unless batch-normalization is disabled). Srinivas et al. proposed a similar approach using gate variables but neglected batch-normalization layers as well [31].
3. Background on batch-normalization
DNNs consist of interconnected layers which mainly perform weighted sums (convolution and fully-connected layers), batch-normalization, and non-linear transformations. With $l$ being the layer index, the weighted sums can be written as

$$a_l = w_l * x_{l-1} + b_l \qquad (1)$$

with the layer input $x_{l-1}$, the weight tensor or matrix $w_l$, the bias vector $b_l$, and $*$ denoting either a convolution operator or a matrix-vector multiplication. Each layer consists of several channels, with the number of channels in $a_l$ being equal to the number of convolution filters or matrix rows in $w_l$. After calculating the weighted sums, each channel is normalized and transformed linearly. The normalized output $\hat{a}_{l,c}$ is calculated by

$$\hat{a}_{l,c} = \begin{cases} \dfrac{a_{l,c} - \mathrm{E}[a_{l,c}]}{\sqrt{\mathrm{Var}[a_{l,c}] + \epsilon}}\, \gamma_{l,c} + \beta_{l,c} & \text{during training,} \\[2ex] \dfrac{a_{l,c} - \mu_{l,c}}{\sqrt{\sigma_{l,c}^2 + \epsilon}}\, \gamma_{l,c} + \beta_{l,c} & \text{during inference,} \end{cases} \qquad (2)$$

with $c$ denoting the channel index, $\mathrm{E}[a_{l,c}]$ and $\mathrm{Var}[a_{l,c}]$ being the mean and the variance of the current mini-batch, and $\{\gamma_{l,c}, \beta_{l,c}\}$ being the learnable parameters of the affine transformation.

Figure 2. An illustration of the indicator function during both forward and backward pass. During the forward pass, the indicator function outputs whether the absolute values of the batch-normalization scaling factors are greater than $t$. During the backward pass, the indicator function is approximated using two piecewise linear functions. Thus, the gradient with respect to the scaling factors is either $\pm 1$, depending on the sign of the scaling factors.

After training, batch-normalization layers are folded into the preceding convolution or fully-connected layer to accelerate the inference graph. The normalized output of the folded layer $\hat{a}_l$ can therefore be written as $\hat{a}_l = \hat{w}_l * x_{l-1} + \hat{b}_l$ with

$$\hat{w}_l = w_l \frac{\gamma_l}{\sqrt{\sigma_l^2 + \epsilon}} \quad \text{and} \quad \hat{b}_l = (b_l - \mu_l)\, \frac{\gamma_l}{\sqrt{\sigma_l^2 + \epsilon}} + \beta_l. \qquad (3)$$

Thus, batch-normalization scaling factors can be used to prune complete filters from the network structure: as the absolute value decreases, $\gamma_{l,c}$ scales the output of channel $c$ in layer $l$ towards zero.
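To make the folding step concrete, the following PyTorch sketch folds a BatchNorm2d layer into its preceding Conv2d according to equation 3. The paper does not provide code; the function name and interface are our own illustrative assumptions.

    import torch
    import torch.nn as nn

    def fold_batchnorm(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
        """Fold a trained BatchNorm2d into the preceding Conv2d (equation 3)."""
        folded = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                           stride=conv.stride, padding=conv.padding, bias=True)
        # scale = gamma / sqrt(sigma^2 + eps), one factor per output channel
        scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)
        folded.weight.data = conv.weight.data * scale.view(-1, 1, 1, 1)
        bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
        # b_hat = (b - mu) * scale + beta
        folded.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
        return folded

In evaluation mode, folded(x) then matches bn(conv(x)) up to numerical precision; a channel with $\gamma_{l,c} = 0$ produces the constant output $\beta_{l,c}$, which is the property HFP exploits.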
4. Holistic filter pruning
In this section, a pruning loss is provided that can be used in common DNN training to prune filters and neurons by gradient descent. The pruning rates are freely adjustable and automatically distributed over the individual layers. The pruning itself is induced via the channel-wise scaling factors of the batch-normalization layers, considering the respective layer sizes. The training objective combines the learning task $\mathcal{L}_{\text{learning}}$ and the pruning task $\mathcal{L}_{\text{pruning}}$ such that both are solved simultaneously during training:

$$\mathcal{L} = \mathcal{L}_{\text{learning}} + \lambda\, \mathcal{L}_{\text{pruning}}. \qquad (4)$$

Here, $\lambda$ is the pruning parameter that scales the weighting between both tasks.

As demonstrated in section 3, the pruning of complete filters can be done via the channel-wise scaling factors of the batch-normalization layers. Thus, we first implement a magnitude-based indicator function that determines whether the absolute value of $\gamma$ is smaller than the magnitude $t$:

$$\Phi(\gamma, t) = \begin{cases} 0 & \text{if } |\gamma| \leq t \\ 1 & \text{if } |\gamma| > t. \end{cases} \qquad (5)$$

If the indicator function outputs zero, the respective channel is considered inactive and would be deleted after training. As can be seen in figure 2, $\Phi$ is a non-smooth quantization function whose gradient is zero almost everywhere. Therefore, we utilize the straight-through estimator (STE, [12]), which is widely used in network quantization to approximate the local gradient of step functions during backpropagation. However, since $\Phi$ is symmetrical to the y-axis (in contrast, fixed-point quantization functions are usually symmetrical to the origin), we flip the estimator on the y-axis as well:

$$\frac{\partial \Phi(\gamma)}{\partial \gamma} = \begin{cases} -1 & \text{if } \gamma \leq 0 \\ +1 & \text{if } \gamma > 0. \end{cases} \qquad (6)$$

As shown in figure 2, this is the most suitable approach for approximating $\Phi$ with linear segments. As a result, the gradient estimator is easy to implement and non-zero for any input value.

Liu et al. found that scaling factors with absolute values below $10^{-4}$ can be set to zero without a noticeable drop in accuracy [26]. Therefore, we use $t = 10^{-4}$ for our experiments. According to equation 3, this results in the channel output being approximately equal to the batch-normalization bias $\beta_{l,c}$,

$$\hat{a}_{l,c} = \hat{w}_{l,c} * x_{l-1} + \hat{b}_{l,c} \approx \beta_{l,c} \quad \text{for } |\gamma_{l,c}| < 10^{-4}, \qquad (7)$$

which is independent of the channel input. The bias is propagated through the following convolution or fully-connected layer and shifts the resulting feature maps. However, this shift is corrected by the following batch-normalization layer by subtracting the mean over the respective mini-batch. After training, both the scaling factor and the bias of the batch-normalization layers are set to zero if the indicator function outputs zero.

The complexity of a DNN can be specified on the one hand by the number of parameters $P$ and on the other hand by the number of floating-point multiplications $M$ that are needed to propagate one sample through the network. If $\widetilde{P}$ and $\widetilde{M}$ denote the number of parameters and multiplications of the pruned model and $P^*$ and $M^*$ specify the desired target values, the deviation between the pruned model and the target size can be described by the following loss function:

$$\mathcal{L}_{\text{pruning}} = \mathrm{relu}\!\left(\frac{\widetilde{P} - P^*}{P}\right) + \mathrm{relu}\!\left(\frac{\widetilde{M} - M^*}{M}\right) \qquad (8)$$

with $P$ and $M$ being the number of parameters and multiplications of the original model. The terms within the rectifier functions denote the normalized differences between the current and the desired model size, with $1 - P^*/P$ being the desired pruning rate and $1 - \widetilde{P}/P$ being the current pruning rate.
Thus, both summands vary between zero and the desired pruning rates. For example, if the goal is to prune 50% of the parameters and 40% of the required multiplications, the pruning loss takes values between 0 and 0.9.

Both the original and the desired network sizes are constant values: the former is fixed whereas the latter is specified by the user. Therefore, $\widetilde{P}$ and $\widetilde{M}$ remain the only variable quantities in equation 8. Utilizing the indicator function from equation 5, the number of parameters in a feed-forward neural network can be calculated as follows:

$$\widetilde{P} = \sum_{l=1}^{L-1} \underbrace{\frac{P_l}{C_{l-1}\, C_l} \sum_{c=1}^{C_{l-1}} \Phi(\gamma_{l-1,c}) \sum_{c=1}^{C_l} \Phi(\gamma_{l,c})}_{\text{pruning rate of intermediate layers}} + \underbrace{\frac{P_L}{C_{L-1}} \sum_{c=1}^{C_{L-1}} \Phi(\gamma_{L-1,c})}_{\text{pruning rate of the last layer}} \qquad (9)$$

Here, $l$ denotes the layer index, $L$ the number of layers, $C_l$ the number of channels in layer $l$, and $P_l$ the original number of weights in layer $l$. The terms within the brackets correspond to the pruning rates of the respective layer and depend on the balance between active and inactive channels. The pruning ratios are scaled with the respective channel sizes and added together over the number of layers. The same calculation can be done analogously for the number of pruned multiplications:

$$\widetilde{M} = \sum_{l=1}^{L-1} \underbrace{\frac{M_l}{C_{l-1}\, C_l} \sum_{c=1}^{C_{l-1}} \Phi(\gamma_{l-1,c}) \sum_{c=1}^{C_l} \Phi(\gamma_{l,c})}_{\text{pruning ratio in intermediate layer } l} + \underbrace{\frac{M_L}{C_{L-1}} \sum_{c=1}^{C_{L-1}} \Phi(\gamma_{L-1,c})}_{\text{pruning ratio in last layer } L}. \qquad (10)$$

Hence, during each forward pass the pruning loss calculates the deviation between the current and the desired model size in terms of the number of parameters and multiplications. The gradients can be backpropagated by utilizing the gradient estimator of the indicator function.
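As an illustration of how equations 5, 6, 8, and 9 interact, the following PyTorch sketch implements the indicator function with the flipped straight-through estimator and a pruning loss for a plain feed-forward chain of layers. It is a minimal sketch under our own naming and interface assumptions, not the authors' implementation; shortcut connections and layer-wise multiplication counts are handled analogously.

    import torch

    class Indicator(torch.autograd.Function):
        """Phi(gamma, t): 0 if |gamma| <= t, else 1 (equation 5).
        The backward pass uses the flipped straight-through estimator of
        equation 6: the gradient is -1 for gamma <= 0 and +1 for gamma > 0."""

        @staticmethod
        def forward(ctx, gamma, t):
            ctx.save_for_backward(gamma)
            return (gamma.abs() > t).float()

        @staticmethod
        def backward(ctx, grad_output):
            (gamma,) = ctx.saved_tensors
            ste = torch.where(gamma > 0, torch.ones_like(gamma), -torch.ones_like(gamma))
            return grad_output * ste, None  # no gradient w.r.t. the threshold t

    def pruning_loss(gammas, params_per_layer, mults_per_layer,
                     p_target, m_target, t=1e-4):
        """Equation 8 for a feed-forward chain of L layers (equations 9 and 10).
        gammas: list of L-1 batch-norm scaling-factor tensors (the last layer's
        outputs are the class scores and are never pruned).
        params_per_layer / mults_per_layer: original P_l and M_l per layer."""
        num_layers = len(params_per_layer)
        p_total, m_total = float(sum(params_per_layer)), float(sum(mults_per_layer))
        active = [Indicator.apply(g, t) for g in gammas]  # active-channel indicators
        p_cur, m_cur = 0.0, 0.0
        for l in range(num_layers):
            in_ratio = active[l - 1].mean() if l > 0 else 1.0            # kept input channels
            out_ratio = active[l].mean() if l < num_layers - 1 else 1.0  # kept output channels
            p_cur = p_cur + params_per_layer[l] * in_ratio * out_ratio
            m_cur = m_cur + mults_per_layer[l] * in_ratio * out_ratio
        return (torch.relu((p_cur - p_target) / p_total)
                + torch.relu((m_cur - m_target) / m_total))

Calling backward() on this loss yields gradients on every scaling factor via the STE; once both rectifier terms reach zero, i.e. the target size $\{P^*, M^*\}$ is met, the pruning loss stops contributing gradients.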
State-of-the-art DNN architectures such as ResNet [7], DenseNet [14], or MobileNet use shortcut connections between layers which add the output feature maps of the layers. This makes filter pruning more complicated since shortcut connections can reactivate already pruned channels. Several solutions have been proposed for this problem: in [22, 28], layers with shortcut connections were not pruned to avoid the problem of reactivated channels. However, skipping the layers with shortcut connections reduces the feasible pruning ratio. Furthermore, in [26, 11] feature maps were sampled in front of each residual block to reduce their dimension. Yet, sampling layers bring additional computation cost. The authors of [34] proposed a group pruning method in which layers connected by a shortcut connection share the same pruning patterns.

In our case, the application of shortcut connections is not a problem as long as the counting functions from equations 9 and 10 are implemented correctly. Consequently, when calculating the layer-wise pruning rates, it must be taken into account whether a shortcut connection is added and, if so, whether the inactive channels match on both sides. This can be done by using a mask that consists of the element-wise sums of the absolute values of the batch-normalization scaling factors; a sketch is given below. The mask is then processed by the indicator function $\Phi$ to calculate the pruning rates.

Algorithm 1 shows how a DNN can be pruned using
Holistic Filter Pruning (HFP).
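Continuing the previous listing, a minimal sketch of the shortcut mask described above could look as follows; the function name is again only illustrative.

    def shared_channel_mask(gamma_a, gamma_b, t=1e-4):
        """Mask for two branches joined by a shortcut addition: a channel counts
        as active if it is active on either side, so the mask is built from the
        element-wise sum of the absolute scaling factors and then passed through
        the indicator function Phi (Indicator from the previous listing)."""
        return Indicator.apply(gamma_a.abs() + gamma_b.abs(), t)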
Sparsity learning: After each forward pass, the pruning loss is calculated according to equation 8 and added to the learning loss. Subsequently, the parameters are updated using SGD optimization with a Nesterov momentum of 0.9. We train until the number of given epochs is reached.

Regularization parameter: In equation 4, the pruning parameter $\lambda$ regularizes the weighting between the learning task on the one hand and the pruning loss on the other. Hence, $\lambda$ should be chosen such that both losses are in the same order of magnitude. Therefore, we define $\lambda$ such that $\lambda\, \mathcal{L}_{\text{pruning}}$ is equal to the expectation value of the learning loss over the training set. E.g., if the average cross-entropy loss for an untrained model is about 6.9 on ImageNet and the desired pruning rate is 0.5 for both parameters and multiplications, $\lambda$ is equal to 6.9. Furthermore, since we use pre-trained models, we recommend heating up the pruning parameter from one over the training epochs.

Fine-tuning: After training, channels whose scaling factors are set to zero by the indicator function are completely deleted from the network architecture. Subsequently, the remaining architecture is retrained for three epochs to update the batch statistics of the batch-normalization layers.
Algorithm 1
The procedure to prune a DNN with
Holistic Filter Pruning. The steps that have to be implemented by the user are the addition of the pruning loss and the pruning step after training.

Input: pre-trained model O, training data (X, Y), learning task L_learning, target size {P*, M*}

    model ← O
    for e in epochs do                          ▷ sparsity learning
        for (data, target) in (X, Y) do
            out = model(data)
            loss = L_learning(out, target)
            loss += λ · L_pruning(model, P*, M*)
            loss.backward()
            SGD.step(model)
        end for
    end for
    model ← Prune(model)
    model ← Retrain(model)                      ▷ fine-tune for 3 epochs
    return model
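Putting the pieces together, a minimal PyTorch sketch of the sparsity-learning loop in Algorithm 1, including the λ warm-up described above, could look as follows (optimizer settings follow the training procedure; names such as pruning_loss_fn are illustrative assumptions, not the authors' code):

    import torch

    def train_hfp(model, train_loader, learning_loss, pruning_loss_fn,
                  epochs, lambda_target, lr=1e-2):
        """Sparsity-learning phase of Algorithm 1 (pruning and fine-tuning follow)."""
        # SGD with Nesterov momentum as described in the training procedure
        optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                    momentum=0.9, nesterov=True)
        for epoch in range(epochs):
            # heat up the pruning parameter linearly from 1 to its target value
            lam = 1.0 + (lambda_target - 1.0) * epoch / max(epochs - 1, 1)
            for data, target in train_loader:
                optimizer.zero_grad()
                out = model(data)
                loss = learning_loss(out, target) + lam * pruning_loss_fn(model)
                loss.backward()  # gradients of Phi flow through the STE
                optimizer.step()
        return model

After this phase, channels with inactive indicator outputs are removed and the remaining architecture is retrained for three epochs, as in the last two lines of Algorithm 1.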
5. Experiments
In this section, we evaluate
Holistic Filter Pruning (HFP) on common benchmark data-sets including CIFAR-10 and ImageNet. First we compare with state-of-the-art filter pruning methods before giving insights into the training procedure of HFP. The baselines of the experiments on CIFAR-10 are calculated by training for 150 epochs using SGD optimization with the Nesterov momentum set to 0.9 and a batch-size of 64. The learning rate is reduced linearly during the training. For ImageNet, the baselines are taken from the torchvision model zoo (https://pytorch.org/docs/stable/torchvision/models.html).

CIFAR-10 is an image classification task with 10 different classes [17]. The data consists of 32×32 color images and is divided into 50,000 training and 10,000 test samples. We preprocess the images as recommended in [14] and use a batch-size of 64. Furthermore, we train for 150 epochs and linearly decrease the learning rate starting from 0.02.

Table 1 shows the pruning results of VGG-8. We specify to prune the number of parameters by 90% and the number of multiplications by 80%. Thus, we achieve comparable pruning rates to HRank but outperform its accuracy significantly by 3%. In comparison to Zhao et al. and SSS, we achieve higher pruning rates while simultaneously increasing the accuracy by more than 1%. Compared to the baseline accuracy, we are able to reduce the number of parameters by 90% with an accuracy drop of 0.6%.

VGG-8 on CIFAR-10
Table 1. Top-1 accuracy and percentage reduction in the number of multiplications and parameters.
Method            Flops %↓    Params %↓    Top-1 %
Baseline          -           -            94.89
SSS [16]          41.6        73.8         93.02
Zhao et al. [36]  39.1        73.3         93.18
GAL-0.1 [25]      45.2        82.2         90.73
HRank [24]        65.3        82.1         92.34
HRank [24]        76.5        92.0         91.23
HFP
ResNet-56 on CIFAR-10

Table 2. Top-1 accuracy and percentage reduction in the number of multiplications and parameters. Results marked with '-' are not reported by the authors.
Method            Flops %↓    Params %↓    Top-1 %
Baseline          -           -            93.30
NISP [35]         35.50       42.40        93.01
DCP [38]          47.10       70.30        93.79
CP [11]           50.00       -            91.80
FPGM [10]         52.60       -            93.26
GBN-40 [34]       60.10       53.50        93.41
GBN-60 [34]       70.30       66.70        93.07
HRank [24]        50.00       42.40        93.17
HRank [24]        74.10       68.10        90.72
HFP
Figure 3. Top-1 accuracies of ResNet-56 on CIFAR-10 with different pruning rates. The performance values are illustrated by colored level curves created by fitting a second-order polynomial.
ResNet-50 on ImageNet
Table 3. Labels have the same meaning as in Table 2.
Method            Flops %↓    Params %↓    Top-1 %
Baseline          -           -            76.15
NIPS 2018, NIPS 2019, CVPR 2019:
DCP [38]          55.76       51.45        74.95
FPGM [10]         53.50       -            74.83
GBN-60 [34]       40.54       31.83        76.19
GBN-50 [34]       55.06       53.40        75.18
CVPR 2020:
Hinge [23]        53.45       -            74.70
He et al. [8]     60.80       -            74.56
DMCP [4]          73.17       -            74.40
HRank [24]        62.10       -            71.98
HRank [24]        76.04       -            69.10
HFP

Table 2 shows the pruning results on the ResNet-56 architecture. We use two different settings with target reductions of 50% and 70%, respectively. Thus, we are able to prune both the parameters and the multiplications by at least 50% with no loss of accuracy. In comparison to
HRank, we achieve higher pruning rates with a slightly improved Top-1 accuracy. To the best of our knowledge, we are the first to reduce the number of multiplications by more than 75% while at the same time reducing accuracy by less than 1.5%.
GBN achieves a slightly higher Top-1 accuracy for comparable pruning rates. Additionally, figure 3 illustrates the level curves of various experiments with different pruning rates on ResNet-56. One can observe that pruning the parameters has a greater impact on the performance than pruning the multiplications.
ImageNet is an image classification task which provides 1000 different class labels. We use the data from 2012 (ILSVRC12 [18]) which consists of 1,281,167 training and 50,000 test samples. We preprocess the data by subtracting the mean and dividing by the standard deviation over the training set. For data augmentation we apply random horizontal flips and crop the images to 224×224. We train for 100 epochs with a batch-size of 256 and linearly decrease the learning rate.

ResNet-18 on ImageNet

Table 4. Top-1 accuracy and percentage reduction in the number of multiplications and parameters. Results marked with '-' are not reported by the authors.
Method            Flops %↓    Params %↓    Top-1 %
Baseline          -           -            69.75
SFP [9]           41.80       -            67.10
FPGM [10]         41.80       -            68.41
HFP
Table 5. Top-1 accuracy and percentage reduction in the number of multiplications and parameters for different values of λ.

λ                 Flops %↓    Params %↓    Top-1 %
1.0               48          36

Table 3 shows the pruning results of ResNet-50. To enable accurate comparisons, we evaluate four configurations with various pruning rates and compare with the latest pruning results from CVPR 2020. The first configuration reduces the number of multiplications by 60% with no significant loss in the accuracy. The second configuration achieves both higher pruning rates and higher accuracy than [34, 23, 8]. The third configuration yields a reduced number of multiplications and slightly improved accuracy in comparison with [4]. Furthermore, HFP is able to reduce the number of multiplications by 78% with only 2% loss in the accuracy.

Table 4 shows the pruning results of ResNet-18. ResNet-18 is much smaller than ResNet-50, less over-parameterized, and consequently more difficult to prune. HFP provides new state-of-the-art performance with 36% reduced multiplications and only 0.6% accuracy decrease. The second configuration reduces the number of multiplications by 45% with only 1.2% loss in the accuracy.

Table 5 shows the pruning results of ResNet-50 with the aim of pruning 60% of the multiplications and 40% of the parameters by using different values of the pruning parameter λ. The first experiment uses the constant value λ = 1. As noticeable, the desired pruning rates are not reached since the weighting of the pruning loss is too low. The second experiment uses the value of λ that results from the consideration in section 4.4. Indeed, the desired pruning rates are fulfilled. However, the accuracy drops below 76% since the imbalance is high at the beginning of the training. The third experiment utilizes the proposed strategy of heating up λ from 1 to its target value: the pruning rates are still fulfilled and the accuracy increases in comparison to the second experiment.

The distribution of the overall pruning budget to the individual layers is a well-known problem in filter pruning. HFP automatically distributes the pruning rates among the individual layers such that the pruning loss is minimized. VGG-8 consists of six convolution layers and two fully-connected layers. The convolution layers are especially expensive regarding the number of multiplications whereas the first fully-connected layer owns most of the parameters. Thus, we analyze two experiments: a) with the aim of pruning 90% of the parameters and b) with the aim of pruning 90% of the multiplications. Figure 4 shows the layer-wise pruning rates for both experiments as well as the proportional layer sizes regarding the number of parameters and multiplications. In the first experiment, HFP primarily reduces the layers which contribute most to the number of parameters (conv6 and fc7). Especially fc7 has a large number of parameters and is therefore pruned by approximately 97%. In contrast, the second experiment mainly leads to a reduction of the convolution layers as they offer more potential for saving multiplications. Consequently, we can observe that HFP distributes flexible pruning rates over the individual layers. Furthermore, the distribution of the pruning budget varies depending on the target reduction. Comparisons with the layer sizes regarding the number of multiplications and parameters result in a meaningful distribution.
This section analyzes how the overall reduction of parameters and multiplications is proportionally distributed among the individual layers. For example, if 1000 parameters are pruned and the first layer is reduced by 150 parameters, then the proportional contribution of the first layer to the parameter pruning is 15%. Figure 5 shows the proportional pruning rates of ResNet-56 with 56% pruned multiplications and 50% pruned parameters (section 5.1, table 1). The pruning rates are shown for different training epochs and refer to the pruning result at that time step (e.g., after 10 epochs 47% of the multiplications were reduced). The first diagram shows the proportional pruning rates for the multiplications while the second diagram shows the proportional pruning rates for the parameters. Additionally, the diagrams show the total number of multiplications and parameters of the unpruned layers (dotted lines). In both figures, the three basic blocks of the ResNet architecture are visible and marked with A, B, and C. In case of the pruned multiplications, the proportional pruning rates of the individual layers change over the epochs.

Figure 4. Pruning rates of all layers in VGG-8 for two different experiments: a) with the aim of pruning 90% of the parameters and b) with the aim of pruning 90% of the multiplications. Depending on the target reduction, the pruning budget is distributed differently over the individual layers: a) reduces layers with many parameters while b) especially prunes the convolution layers with many multiplications.

Figure 5. The upper plot shows the proportional pruning rates of the individual layers of ResNet-56 (with 56% reduced multiplications and 50% reduced parameters, table 1). Proportional pruning rates indicate the contribution of single layers to the overall pruning rate. E.g., if 1000 multiplications are pruned from the model and the first layer is reduced by 150 multiplications, the proportional pruning rate of the first layer is 15%. The lower plot indicates the proportional pruning rates for the number of parameters.
6. Conclusion
We propose
Holistic Filter Pruning (HFP), a simple and powerful filter pruning method to reduce the complexity of trained DNNs. HFP uses a pruning loss that takes accurate pruning rates for the number of both parameters and multiplications into account. After each forward pass, the deviation between the current model size and the target size is calculated. By gradient descent, the pruning rates are distributed over the individual layers such that the target size is fulfilled. The loss function fits seamlessly into the training of DNNs and uses the channel-wise scaling factors of the batch-normalization layers to calculate the model size. Thus, no additional variables need to be defined and the implementation effort is low. Especially for large pruning rates, HFP yields excellent performance and outperforms recent approaches by up to 5%.

References

[1] Li Deng and Dong Yu. Deep learning: Methods and applications. Foundations and Trends in Signal Processing, 7(3-4):197-387, 2014.
[2] Lukas Enderich, Fabian Timm, and Wolfram Burgard. Symog: Learning symmetric mixture of Gaussian modes for improved fixed-point quantization. Neurocomputing, 2020.
[3] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
[4] Shaopeng Guo, Yujie Wang, Quanquan Li, and Junjie Yan. DMCP: Differentiable Markov channel pruning for neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[5] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. CoRR, abs/1608.04493, 2016.
[6] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. CoRR, abs/1506.02626, 2015.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. pages 770-778, 2015.
[8] Yang He, Yuhang Ding, Ping Liu, Linchao Zhu, Hanwang Zhang, and Yi Yang. Learning filter pruning criteria for deep convolutional neural networks acceleration. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[9] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. CoRR, abs/1808.06866, 2018.
[10] Yang He, Ping Liu, Ziwei Wang, and Yi Yang. Pruning filter via geometric median for deep convolutional neural networks acceleration. CoRR, abs/1811.00250, 2019.
[11] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. CoRR, abs/1707.06168, 2017.
[12] Geoffrey Hinton. Neural networks for machine learning. Coursera, video lecture, 2012.
[13] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. CoRR, abs/1607.03250, 2016.
[14] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[15] Qiangui Huang, Shaohua Kevin Zhou, Suya You, and Ulrich Neumann. Learning to prune filters in convolutional neural networks. CoRR, abs/1801.07365, 2018.
[16] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In The European Conference on Computer Vision (ECCV), September 2018.
[17] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[19] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[20] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598-605. Morgan-Kaufmann, 1990.
[21] Carl Lemaire, Andrew Achkar, and Pierre-Marc Jodoin. Structured pruning of neural networks with budget-aware regularization. CoRR, abs/1811.09332, 2018.
[22] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016.
[23] Yawei Li, Shuhang Gu, Christoph Mayer, Luc Van Gool, and Radu Timofte. Group sparsity: The hinge between filter pruning and decomposition for network compression. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[24] Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. HRank: Filter pruning using high-rank feature map, 02 2020.
[25] Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, and David S. Doermann. Towards optimal structured CNN pruning via generative adversarial learning. CoRR, abs/1903.09291, 2019.
[26] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. CoRR, abs/1708.06519, 2017.
[27] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.
[28] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. CoRR, abs/1707.06342, 2017.
[29] Lukas Mauch and Bin Yang. A novel layerwise pruning method for model reduction of fully connected deep neural networks. pages 2382-2386, 2017.
[30] Wei Pan, Hao Dong, and Yike Guo. DropNeuron: Simplifying the structure of deep neural networks. CoRR, abs/1606.07326, 2016.
[31] Suraj Srinivas, Akshayvarun Subramanya, and R. Venkatesh Babu. Training sparse neural networks. CoRR, abs/1611.06694, 2016.
[32] V. Sze, Y. Chen, T. Yang, and J. S. Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295-2329, Dec 2017.
[33] Xia Xiao, Zigeng Wang, and Sanguthevar Rajasekaran. AutoPrune: Automatic network pruning by regularizing auxiliary parameters. In Advances in Neural Information Processing Systems 32, pages 13681-13691. Curran Associates, Inc., 2019.
[34] Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 2133-2144. Curran Associates, Inc., 2019.
[35] R. Yu, A. Li, C. Chen, J. Lai, V. I. Morariu, X. Han, M. Gao, C. Lin, and L. S. Davis. NISP: Pruning networks using neuron importance score propagation. pages 9194-9203, 2018.
[36] Chenglong Zhao, Bingbing Ni, Jian Zhang, Qiwei Zhao, Wenjun Zhang, and Qi Tian. Variational convolutional neural network pruning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[37] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. CoRR, 2017.
[38] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jin-Hui Zhu. Discrimination-aware channel pruning for deep neural networks. CoRR, abs/1810.11809, 2018.