Holistic Filter Pruning for Efficient Deep Neural Networks
Lukas Enderich, Corporate Research, Robert Bosch GmbH, Renningen, Germany, [email protected]
Fabian Timm, Corporate Research, Robert Bosch GmbH, Renningen, Germany, [email protected]
Wolfram Burgard, Institute for Autonomous Intelligent Systems, University of Freiburg, Germany, [email protected]
Abstract
Deep neural networks (DNNs) are usually over-parameterized to increase the likelihood of getting adequate initial weights by random initialization. Consequently, trained DNNs have many redundancies which can be pruned from the model to reduce complexity and improve the ability to generalize. Structural sparsity, as achieved by filter pruning, directly reduces the tensor sizes of weights and activations and is thus particularly effective for reducing complexity. We propose Holistic Filter Pruning (HFP), a novel approach for common DNN training that is easy to implement and makes it possible to specify accurate pruning rates for the number of both parameters and multiplications. After each forward pass, the current model complexity is calculated and compared to the desired target size. By gradient descent, a global solution can be found that allocates the pruning budget over the individual layers such that the desired target size is fulfilled. In various experiments, we give insights into the training and achieve state-of-the-art performance on CIFAR-10 and ImageNet (HFP prunes 60% of the multiplications of ResNet-50 on ImageNet with no significant loss in accuracy). We believe our simple and powerful pruning approach constitutes a valuable contribution for users of DNNs in low-cost applications.
1. Introduction
Deep neural networks (DNNs) have a strong ability for data abstraction and outperform classical methods in many machine learning challenges such as computer vision, object detection, or speech recognition [1, 19]. However, recent progress has been made by training powerful models with many parameters on large-scale data sets [7, 32]. Frankle and Carbin [3] demonstrated the correlation between the initial model size and the probability of getting meaningful initial values for the parameters by random initialization. As a result, modern DNNs are usually over-parameterized, have high memory requirements, and need many floating-point multiplications, which are especially expensive concerning computation time and energy consumption [32].

Figure 1. Structural sparsity can be achieved by pruning complete filters or neurons from the network. Since filter pruning reduces both the number of filters in the respective layer and the number of output feature maps, the tensor sizes of both weights and activations decrease. With a reduced number of output feature maps, the depth of the following layer decreases to the same degree.

However, reduction techniques can significantly reduce the complexity of trained DNNs. On the one hand, quantization methods reduce the precision of both parameters and activations to accelerate DNNs on dedicated hardware [32, 2]. On the other hand, pruning and factorization methods reduce the number of parameters and multiplications rather than their bit-sizes [22, 21]. Structural sparsity, as achieved by filter pruning, directly reduces computation time, energy consumption, and memory requirements without the need for specialized hardware. A visualization of filter pruning is given in figure 1.

Unsupervised filter pruning usually fails to preserve the accuracy of the original model [29]. Therefore, data-driven approaches have been developed which either iteratively prune filters based on saliency scores [13, 22, 9, 38, 10, 24] or retrain the model under consideration of sparsity constraints [37, 11, 27, 31, 33]. Methods of the first category calculate saliency scores to rate the importance of individual filters. Filters with low saliency scores are considered unimportant and are deleted whereas the remaining filters are retrained. This process is repeated until the desired pruning rate is reached. In contrast, methods of the second category investigate sparsity constraints that can be integrated into the training of DNNs. Regularization terms push the sum of absolute values of filter weights towards zero [27, 37, 11]. Furthermore, in [31, 33, 34] learnable gate variables were introduced that scale single weights or complete filters by one or zero.

However, most recent approaches have some disadvantages. Determining saliency scores requires a lot of human labor and is usually a heuristic practice. Furthermore, layer-by-layer pruning as well as iterative pruning and retraining are unsuitable procedures for determining a global selection of filters to be pruned. Considering that all network layers jointly contribute to the learning task, it is inappropriate to prune single layers independently. Moreover, iterative pruning and retraining may prune filters that become important again at a later iteration.

In this work, we make the following contributions:

• We propose a holistic pruning approach that can be integrated into the training of DNNs by only a few lines of code. The proposed method induces sparsity via the channel-wise scaling factors of the batch-normalization layers. Hence, no additional variables are needed. Furthermore, the user can specify the desired model size in terms of the number of parameters and multiplications. The pruning budget is allocated over all layers automatically such that the desired model size is reached.

• We evaluate our pruning approach on two benchmark data-sets (CIFAR-10, ImageNet). We provide comparisons with recent filter pruning results and prove state-of-the-art performance on various DNN architectures. Furthermore, we analyze the allocation of pruning rates over the individual layers for different target sizes and layer types.
2. Related Work
DNNs are usually over-parameterized and have many redundant network connections, which can be eliminated (i.e., pruned) after the training to reduce the model complexity and improve the ability to generalize. The first pruning methods were aimed at setting single weights to zero in order to trim intermediate layer connections.
Optimal Brain Damage [20] utilized the second-order derivatives of the loss function to calculate saliencies for each network weight. Subsequently, weights with small saliency scores were pruned iteratively whereas the remaining weights were retrained. Since calculating the second-order derivatives of the loss function with respect to the parameters is too complex for large DNNs, many approaches applied magnitude-based pruning [6, 5, 27, 37].

However, since pruning single weights has no direct benefit for the hardware implementation of DNNs (unstructural sparsity), the number of non-zero weights is an insufficient indicator of the model complexity. In contrast, pruning complete filters or neurons from the network architecture directly reduces the tensor sizes of weights and activations (structural sparsity). A visualization is given in figure 1. Filter pruning methods can be divided into two subcategories [21]: saliency based pruning and retraining on the one hand and sparsity learning on the other. Both subcategories are based on pre-trained and usually over-parameterized models. In the following, a filter is equivalent to a channel or a neuron.
Saliency based pruning and retraining: These methods determine heuristics to calculate saliency scores for each filter. The saliency score indicates the importance of the respective filter: the higher the score, the more important the filter is considered to be for fulfilling the learning task. Based on the saliencies, a certain number or percentage of filters is pruned whereas the remaining ones are retrained. This process is repeated iteratively until the desired network size is reached.

Hu et al. identified unimportant filters by analyzing the magnitudes of the output activations [13]. Feature maps with comparatively small sums of absolute values were considered less important and hence removed. In contrast, Li et al. measured the importance of individual filters by calculating the sum over the absolute values of the weights [22]. Zhuang et al. argued that informative channels should have discriminative power [38]: they proposed a ranking heuristic to identify channels with high discriminative power while deleting redundant channels and their corresponding filters. Furthermore, He et al. set the weights of filters with small norms to zero [9]. During the retraining steps, however, the pruned filters were updated as well to improve the training behaviour. The procedure is repeated until the selection of filters with small norms converges. Furthermore, Yu et al. calculated saliency scores by minimizing the reconstruction error in the second-to-last layer before the classification output [35]. Recently, Zhonghui et al. introduced Gate Decorator, a pruning framework that uses gates to scale the channel-wise output of intermediate layers [34]. The change in the loss function caused by setting the gates to zero is calculated using a Taylor expansion and subsequently used for the saliency scores.
Sparsity learning induces sparsity constraints into the training of DNNs. Pan et al. approximated the L0-norm to penalize incoming and outgoing connections of single filters [30]. He et al. proposed a channel selection based on LASSO regression whereas pruning each layer is achieved by minimizing the reconstruction error of the output feature maps [11]. Furthermore, Liu et al. applied L1-norm based regularization on the scaling factors of the batch-normalization layers to scale single channels towards zero [26]. Subsequently, a certain percentile of channels is pruned according to a global threshold across all layers. However, in extreme cases this could lead to all channels of a single layer being pruned. Additionally, the scaling factors are penalized without considering the respective filter size. In contrast, Huang et al. proposed a try-and-learn algorithm to train pruning agents that identify superfluous filters [15]. Recently, Xiao et al. introduced AutoPrune, a framework that uses a set of additional parameters to prune single weights or filters during each forward pass [33]. However, in their implementation, the pruning layers are located in front of the batch-normalization layers, which reactivate the pruned channels (unless batch-normalization is disabled). Srinivas et al. proposed a similar approach using gate variables but neglected batch-normalization layers as well [31].
3. Background on batch-normalization
DNNs consist of interconnected layers which mainly perform weighted sums (convolution and fully-connected layers), batch-normalization, and non-linear transformations. With $l$ being the layer index, the weighted sums can be written as

$$a_l = w_l * x_{l-1} + b_l \qquad (1)$$

with the layer input $x_{l-1}$, the weight tensor or matrix $w_l$, the bias vector $b_l$, and $*$ denoting either a convolution operator or a matrix-vector multiplication. Each layer consists of several channels, with the number of channels in $a_l$ being equal to the number of convolution filters or matrix rows in $w_l$. After calculating the weighted sums, each channel is normalized and transformed linearly. The normalized output $\hat{a}_{l,c}$ is calculated by

$$\hat{a}_{l,c} = \begin{cases} \dfrac{a_{l,c} - \mathrm{E}[a_{l,c}]}{\sqrt{\mathrm{Var}[a_{l,c}] + \epsilon}}\, \gamma_{l,c} + \beta_{l,c} & \text{during training,} \\[2ex] \dfrac{a_{l,c} - \mu_{l,c}}{\sqrt{\sigma_{l,c}^2 + \epsilon}}\, \gamma_{l,c} + \beta_{l,c} & \text{during inference,} \end{cases} \qquad (2)$$

with $c$ denoting the channel index, $\mathrm{E}[a_{l,c}]$ and $\mathrm{Var}[a_{l,c}]$ being the mean and the variance of the current mini-batch, and $\{\gamma_{l,c}, \beta_{l,c}\}$ being the learnable parameters of the affine transformation.

Figure 2. An illustration of the indicator function during both forward and backward pass. During the forward pass, the indicator function outputs whether the absolute values of the batch-normalization scaling factors are greater than $t$. During the backward pass, the indicator function is approximated using two piecewise linear functions. Thus, the gradient with respect to the scaling factors is either $\pm 1$, depending on the sign of the scaling factors.

After training, batch-normalization layers are folded into the preceding convolution or fully-connected layer to accelerate the inference graph. The normalized output of the folded layer $\hat{a}_l$ can therefore be written as $\hat{a}_l = \hat{w}_l * x_{l-1} + \hat{b}_l$ with

$$\hat{w}_l = w_l \frac{\gamma_l}{\sqrt{\sigma_l^2 + \epsilon}} \quad \text{and} \quad \hat{b}_l = (b_l - \mu_l)\, \frac{\gamma_l}{\sqrt{\sigma_l^2 + \epsilon}} + \beta_l. \qquad (3)$$

Thus, batch-normalization scaling factors can be used to prune complete filters from the network structure: as the absolute value decreases, $\gamma_{l,c}$ scales the output of channel $c$ in layer $l$ towards zero.
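To make the folding step concrete, the following PyTorch sketch folds a BatchNorm2d layer into its preceding Conv2d according to equation 3. The paper does not provide code; the function name and interface are our own illustrative assumptions.

    import torch
    import torch.nn as nn

    def fold_batchnorm(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
        """Fold a trained BatchNorm2d into the preceding Conv2d (equation 3)."""
        folded = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                           stride=conv.stride, padding=conv.padding, bias=True)
        # scale = gamma / sqrt(sigma^2 + eps), one factor per output channel
        scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)
        folded.weight.data = conv.weight.data * scale.view(-1, 1, 1, 1)
        bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
        # b_hat = (b - mu) * scale + beta
        folded.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
        return folded

In evaluation mode, folded(x) then matches bn(conv(x)) up to numerical precision; a channel with $\gamma_{l,c} = 0$ produces the constant output $\beta_{l,c}$, which is the property HFP exploits.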
4. Holistic filter pruning
In this section, a pruning loss is provided that can be used in common DNN training to prune filters and neurons by gradient descent. The pruning rates are freely adjustable and automatically distributed over the individual layers. The pruning itself is induced via the channel-wise scaling factors of the batch-normalization layers, considering the respective layer sizes. The training objective combines the learning task $\mathcal{L}_{\text{learning}}$ and the pruning task $\mathcal{L}_{\text{pruning}}$ such that both are solved simultaneously during training:

$$\mathcal{L} = \mathcal{L}_{\text{learning}} + \lambda\, \mathcal{L}_{\text{pruning}}. \qquad (4)$$

Here, $\lambda$ is the pruning parameter that scales the weighting between both tasks.

As demonstrated in section 3, the pruning of complete filters can be done via the channel-wise scaling factors of the batch-normalization layers. Thus, we first implement a magnitude-based indicator function that determines whether the absolute value of $\gamma$ is smaller than the magnitude $t$:

$$\Phi(\gamma, t) = \begin{cases} 0 & \text{if } |\gamma| \leq t \\ 1 & \text{if } |\gamma| > t. \end{cases} \qquad (5)$$

If the indicator function outputs zero, the respective channel is considered inactive and would be deleted after training. As can be seen in figure 2, $\Phi$ is a non-smooth quantization function whose gradient is zero almost everywhere. Therefore, we utilize the straight-through estimator (STE, [12]), which is widely used in network quantization to approximate the local gradient of step functions during backpropagation. However, since $\Phi$ is symmetrical to the y-axis (in contrast, fixed-point quantization functions are usually symmetrical to the origin), we flip the estimator on the y-axis as well:

$$\frac{\partial \Phi(\gamma)}{\partial \gamma} = \begin{cases} -1 & \text{if } \gamma \leq 0 \\ +1 & \text{if } \gamma > 0. \end{cases} \qquad (6)$$

As shown in figure 2, this is the most suitable approach for approximating $\Phi$ with linear segments. As a result, the gradient estimator is easy to implement and non-zero for any input value.

Liu et al. found that scaling factors with absolute values below $10^{-4}$ can be set to zero without a noticeable drop in accuracy [26]. Therefore, we use $t = 10^{-4}$ for our experiments. According to equation 3, this results in the channel output being approximately equal to the batch-normalization bias $\beta_{l,c}$,

$$\hat{a}_{l,c} = \hat{w}_{l,c} * x_{l-1} + \hat{b}_{l,c} \approx \beta_{l,c} \quad \text{for } |\gamma_{l,c}| < 10^{-4}, \qquad (7)$$

which is independent of the channel input. The bias is propagated through the following convolution or fully-connected layer and shifts the resulting feature maps. However, this shift is corrected by the following batch-normalization layer by subtracting the mean over the respective mini-batch. After training, both the scaling factor and the bias of the batch-normalization layers are set to zero if the indicator function outputs zero.

The complexity of a DNN can be specified on the one hand by the number of parameters $P$ and on the other hand by the number of floating-point multiplications $M$ that are needed to propagate one sample through the network. If $\widetilde{P}$ and $\widetilde{M}$ denote the number of parameters and multiplications of the pruned model and $P^*$ and $M^*$ specify the desired target values, the deviation between the pruned model and the target size can be described by the following loss function:

$$\mathcal{L}_{\text{pruning}} = \mathrm{relu}\!\left(\frac{\widetilde{P} - P^*}{P}\right) + \mathrm{relu}\!\left(\frac{\widetilde{M} - M^*}{M}\right) \qquad (8)$$

with $P$ and $M$ being the number of parameters and multiplications of the original model. The terms within the rectifier functions denote the normalized differences between the current and the desired model size, with $1 - P^*/P$ being the desired pruning rate and $1 - \widetilde{P}/P$ being the current pruning rate.
Thus, both summands vary between zero and the desired pruning rates. For example, if the goal is to prune 50% of the parameters and 40% of the required multiplications, the pruning loss takes values between 0 and 0.9.

Both the original and the desired network sizes are constant values: the former is fixed whereas the latter is specified by the user. Therefore, $\widetilde{P}$ and $\widetilde{M}$ remain the only variable quantities in equation 8. Utilizing the indicator function from equation 5, the number of parameters in a feed-forward neural network can be calculated as follows:

$$\widetilde{P} = \sum_{l=1}^{L-1} \underbrace{\frac{P_l}{C_{l-1}\, C_l} \sum_{c=1}^{C_{l-1}} \Phi(\gamma_{l-1,c}) \sum_{c=1}^{C_l} \Phi(\gamma_{l,c})}_{\text{pruning rate of intermediate layers}} + \underbrace{\frac{P_L}{C_{L-1}} \sum_{c=1}^{C_{L-1}} \Phi(\gamma_{L-1,c})}_{\text{pruning rate of the last layer}} \qquad (9)$$

Here, $l$ denotes the layer index, $L$ the number of layers, $C_l$ the number of channels in layer $l$, and $P_l$ the original number of weights in layer $l$. The terms within the brackets correspond to the pruning rates of the respective layer and depend on the balance between active and inactive channels. The pruning ratios are scaled with the respective channel sizes and added together over the number of layers. The same calculation can be done analogously for the number of pruned multiplications:

$$\widetilde{M} = \sum_{l=1}^{L-1} \underbrace{\frac{M_l}{C_{l-1}\, C_l} \sum_{c=1}^{C_{l-1}} \Phi(\gamma_{l-1,c}) \sum_{c=1}^{C_l} \Phi(\gamma_{l,c})}_{\text{pruning ratio in intermediate layer } l} + \underbrace{\frac{M_L}{C_{L-1}} \sum_{c=1}^{C_{L-1}} \Phi(\gamma_{L-1,c})}_{\text{pruning ratio in last layer } L}. \qquad (10)$$

Hence, during each forward pass the pruning loss calculates the deviation between the current and the desired model size in terms of the number of parameters and multiplications. The gradients can be backpropagated by utilizing the gradient estimator of the indicator function.
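As an illustration of how equations 5, 6, 8, and 9 interact, the following PyTorch sketch implements the indicator function with the flipped straight-through estimator and a pruning loss for a plain feed-forward chain of layers. It is a minimal sketch under our own naming and interface assumptions, not the authors' implementation; shortcut connections and layer-wise multiplication counts are handled analogously.

    import torch

    class Indicator(torch.autograd.Function):
        """Phi(gamma, t): 0 if |gamma| <= t, else 1 (equation 5).
        The backward pass uses the flipped straight-through estimator of
        equation 6: the gradient is -1 for gamma <= 0 and +1 for gamma > 0."""

        @staticmethod
        def forward(ctx, gamma, t):
            ctx.save_for_backward(gamma)
            return (gamma.abs() > t).float()

        @staticmethod
        def backward(ctx, grad_output):
            (gamma,) = ctx.saved_tensors
            ste = torch.where(gamma > 0, torch.ones_like(gamma), -torch.ones_like(gamma))
            return grad_output * ste, None  # no gradient w.r.t. the threshold t

    def pruning_loss(gammas, params_per_layer, mults_per_layer,
                     p_target, m_target, t=1e-4):
        """Equation 8 for a feed-forward chain of L layers (equations 9 and 10).
        gammas: list of L-1 batch-norm scaling-factor tensors (the last layer's
        outputs are the class scores and are never pruned).
        params_per_layer / mults_per_layer: original P_l and M_l per layer."""
        num_layers = len(params_per_layer)
        p_total, m_total = float(sum(params_per_layer)), float(sum(mults_per_layer))
        active = [Indicator.apply(g, t) for g in gammas]  # active-channel indicators
        p_cur, m_cur = 0.0, 0.0
        for l in range(num_layers):
            in_ratio = active[l - 1].mean() if l > 0 else 1.0            # kept input channels
            out_ratio = active[l].mean() if l < num_layers - 1 else 1.0  # kept output channels
            p_cur = p_cur + params_per_layer[l] * in_ratio * out_ratio
            m_cur = m_cur + mults_per_layer[l] * in_ratio * out_ratio
        return (torch.relu((p_cur - p_target) / p_total)
                + torch.relu((m_cur - m_target) / m_total))

Calling backward() on this loss yields gradients on every scaling factor via the STE; once both rectifier terms reach zero, i.e. the target size $\{P^*, M^*\}$ is met, the pruning loss stops contributing gradients.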
State-of-the-art DNN architectures such as ResNet [7], DenseNet [14], or MobileNet use shortcut connections between layers which add the output feature maps of the layers. This makes filter pruning more complicated since shortcut connections can reactivate already pruned channels. Several solutions have been proposed for this problem: in [22, 28], layers with shortcut connections were not pruned to avoid the problem of reactivated channels. However, skipping the layers with shortcut connections reduces the feasible pruning ratio. Furthermore, in [26, 11] feature maps were sampled in front of each residual block to reduce their dimension. Yet, sampling layers bring additional computation cost. The authors of [34] proposed a group pruning method in which layers connected by a shortcut connection share the same pruning patterns.

In our case, the application of shortcut connections is not a problem as long as the counting functions from equations 9 and 10 are implemented correctly. Consequently, when calculating the layer-wise pruning rates, it must be taken into account whether a shortcut connection is added and, if so, whether the inactive channels match on both sides. This can be done by using a mask that consists of the element-wise sums of the absolute values of the batch-normalization scaling factors; a sketch is given below. The mask is then processed by the indicator function $\Phi$ to calculate the pruning rates.

Algorithm 1 shows how a DNN can be pruned using
Holistic Filter Pruning (HFP).
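Continuing the previous listing, a minimal sketch of the shortcut mask described above could look as follows; the function name is again only illustrative.

    def shared_channel_mask(gamma_a, gamma_b, t=1e-4):
        """Mask for two branches joined by a shortcut addition: a channel counts
        as active if it is active on either side, so the mask is built from the
        element-wise sum of the absolute scaling factors and then passed through
        the indicator function Phi (Indicator from the previous listing)."""
        return Indicator.apply(gamma_a.abs() + gamma_b.abs(), t)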
Sparsity learning: After each forward pass, the pruning loss is calculated according to equation 8 and added to the learning loss. Subsequently, the parameters are updated using SGD optimization with a Nesterov momentum of 0.9. We train until the number of given epochs is reached.

Regularization parameter: In equation 4, the pruning parameter $\lambda$ regularizes the weighting between the learning task on the one hand and the pruning loss on the other. Hence, $\lambda$ should be chosen such that both losses are in the same order of magnitude. Therefore, we define $\lambda$ such that $\lambda\, \mathcal{L}_{\text{pruning}}$ is equal to the expectation value of the learning loss over the training set. E.g., if the average cross-entropy loss for an untrained model is about 6.9 on ImageNet and the desired pruning rate is 0.5 for both parameters and multiplications, $\lambda$ is equal to 6.9. Furthermore, since we use pre-trained models, we recommend heating up the pruning parameter from one over the training epochs.

Fine-tuning: After training, channels whose scaling factors are set to zero by the indicator function are completely deleted from the network architecture. Subsequently, the remaining architecture is retrained for three epochs to update the batch statistics of the batch-normalization layers.
Algorithm 1
The procedure to prune a DNN with
Holistic Filter Pruning. The steps that have to be implemented by the user are the addition of the pruning loss and the pruning step after training.

Input: pre-trained model O, training data (X, Y), learning task L_learning, target size {P*, M*}

    model ← O
    for e in epochs do                          ▷ sparsity learning
        for (data, target) in (X, Y) do
            out = model(data)
            loss = L_learning(out, target)
            loss += λ · L_pruning(model, P*, M*)
            loss.backward()
            SGD.step(model)
        end for
    end for
    model ← Prune(model)
    model ← Retrain(model)                      ▷ fine-tune for 3 epochs
    return model
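Putting the pieces together, a minimal PyTorch sketch of the sparsity-learning loop in Algorithm 1, including the λ warm-up described above, could look as follows (optimizer settings follow the training procedure; names such as pruning_loss_fn are illustrative assumptions, not the authors' code):

    import torch

    def train_hfp(model, train_loader, learning_loss, pruning_loss_fn,
                  epochs, lambda_target, lr=1e-2):
        """Sparsity-learning phase of Algorithm 1 (pruning and fine-tuning follow)."""
        # SGD with Nesterov momentum as described in the training procedure
        optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                    momentum=0.9, nesterov=True)
        for epoch in range(epochs):
            # heat up the pruning parameter linearly from 1 to its target value
            lam = 1.0 + (lambda_target - 1.0) * epoch / max(epochs - 1, 1)
            for data, target in train_loader:
                optimizer.zero_grad()
                out = model(data)
                loss = learning_loss(out, target) + lam * pruning_loss_fn(model)
                loss.backward()  # gradients of Phi flow through the STE
                optimizer.step()
        return model

After this phase, channels with inactive indicator outputs are removed and the remaining architecture is retrained for three epochs, as in the last two lines of Algorithm 1.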
5. Experiments
In this section, we evaluate
Holistic Filter Pruning (HFP) on common benchmark data-sets including CIFAR-10 and ImageNet. First we compare with state-of-the-art filter pruning methods before giving insights into the training procedure of HFP. The baselines of the experiments on CIFAR-10 are calculated by training for 150 epochs using SGD optimization with the Nesterov momentum set to 0.9 and a batch-size of 64. The learning rate is reduced linearly during the training. For ImageNet, the baselines are taken from the torchvision model zoo (https://pytorch.org/docs/stable/torchvision/models.html).

CIFAR-10 is an image classification task with 10 different classes [17]. The data consists of 32×32 color images and is divided into 50,000 training and 10,000 test samples. We preprocess the images as recommended in [14] and use a batch-size of 64. Furthermore, we train for 150 epochs and linearly decrease the learning rate starting from 0.02.

Table 1 shows the pruning results of VGG-8. We specify to prune the number of parameters by 90% and the number of multiplications by 80%. Thus, we achieve comparable pruning rates to HRank but outperform its accuracy significantly by 3%. In comparison to Zhao et al. and SSS, we achieve higher pruning rates while simultaneously increasing the accuracy by more than 1%. Compared to the baseline accuracy, we are able to reduce the number of parameters by 90% with an accuracy drop of 0.6%.

VGG-8 on CIFAR-10
Table 1. Top-1 accuracy and percentage reduction in the number of multiplications and parameters.
Method            Flops %↓    Params %↓    Top-1 %
Baseline          -           -            94.89
SSS [16]          41.6        73.8         93.02
Zhao et al. [36]  39.1        73.3         93.18
GAL-0.1 [25]      45.2        82.2         90.73
HRank [24]        65.3        82.1         92.34
HRank [24]        76.5        92.0         91.23
HFP
ResNet-56 on CIFAR-10

Table 2. Top-1 accuracy and percentage reduction in the number of multiplications and parameters. Results marked with '-' are not reported by the authors.
Method            Flops %↓    Params %↓    Top-1 %
Baseline          -           -            93.30
NISP [35]         35.50       42.40        93.01
DCP [38]          47.10       70.30        93.79
CP [11]           50.00       -            91.80
FPGM [10]         52.60       -            93.26
GBN-40 [34]       60.10       53.50        93.41
GBN-60 [34]       70.30       66.70        93.07
HRank [24]        50.00       42.40        93.17
HRank [24]        74.10       68.10        90.72
HFP
Figure 3. Top-1 accuracies of ResNet-56 on CIFAR-10 with different pruning rates. The performance values are illustrated by colored level curves created by fitting a second-order polynomial.
ResNet-50 on ImageNet
Table 3. Labels have the same meaning as in Table 2.
Method            Flops %↓    Params %↓    Top-1 %
Baseline          -           -            76.15
NIPS 2018, NIPS 2019, CVPR 2019:
DCP [38]          55.76       51.45        74.95
FPGM [10]         53.50       -            74.83
GBN-60 [34]       40.54       31.83        76.19
GBN-50 [34]       55.06       53.40        75.18
CVPR 2020:
Hinge [23]        53.45       -            74.70
He et al. [8]     60.80       -            74.56
DMCP [4]          73.17       -            74.40
HRank [24]        62.10       -            71.98
HRank [24]        76.04       -            69.10
HFP

Table 2 shows the pruning results on the ResNet-56 architecture. We use two different settings with target reductions of 50% and 70%, respectively. Thus, we are able to prune both the parameters and the multiplications by at least 50% with no loss of accuracy. In comparison to
HRank, we achieve higher pruning rates with a slightly improved Top-1 accuracy. To the best of our knowledge, we are the first to reduce the number of multiplications by more than 75% while at the same time reducing accuracy by less than 1.5%.
GBN achieves a slightly higher Top-1 accuracy for comparable pruning rates. Additionally, figure 3 illustrates the level curves of various experiments with different pruning rates on ResNet-56. One can observe that pruning the parameters has a greater impact on the performance than pruning the multiplications.
ImageNet is an image classification task which provides 1000 different class labels. We use the data from 2012 (ILSVRC12 [18]) which consists of 1,281,167 training and 50,000 test samples. We preprocess the data by subtracting the mean and dividing by the standard deviation over the training set. For data augmentation we apply random horizontal flips and crop the images to 224×224. We train for 100 epochs with a batch-size of 256 and linearly decrease the learning rate.

ResNet-18 on ImageNet

Table 4. Top-1 accuracy and percentage reduction in the number of multiplications and parameters. Results marked with '-' are not reported by the authors.
Method            Flops %↓    Params %↓    Top-1 %
Baseline          -           -            69.75
SFP [9]           41.80       -            67.10
FPGM [10]         41.80       -            68.41
HFP
Table 5. Top-1 accuracy and percentage reduction in the number of multiplications and parameters for different values of λ.

λ                 Flops %↓    Params %↓    Top-1 %
1.0               48          36

Table 3 shows the pruning results of ResNet-50. To enable accurate comparisons, we evaluate four configurations with various pruning rates and compare with the latest pruning results from CVPR 2020. The first configuration reduces the number of multiplications by 60% with no significant loss in the accuracy. The second configuration achieves both higher pruning rates and higher accuracy than [34, 23, 8]. The third configuration yields a reduced number of multiplications and slightly improved accuracy in comparison with [4]. Furthermore, HFP is able to reduce the number of multiplications by 78% with only 2% loss in the accuracy.

Table 4 shows the pruning results of ResNet-18. ResNet-18 is much smaller than ResNet-50, less over-parameterized, and consequently more difficult to prune. HFP provides new state-of-the-art performance with 36% reduced multiplications and only 0.6% accuracy decrease. The second configuration reduces the number of multiplications by 45% with only 1.2% loss in the accuracy.

Table 5 shows the pruning results of ResNet-50 with the aim of pruning 60% of the multiplications and 40% of the parameters by using different values of the pruning parameter λ. The first experiment uses the constant value λ = 1. As noticeable, the desired pruning rates are not reached since the weighting of the pruning loss is too low. The second experiment uses the value of λ that results from the consideration in section 4.4. Indeed, the desired pruning rates are fulfilled. However, the accuracy drops below 76% since the imbalance is high at the beginning of the training. The third experiment utilizes the proposed strategy of heating up λ from 1 to its target value: the pruning rates are still fulfilled and the accuracy increases in comparison to the second experiment.

The distribution of the overall pruning budget to the individual layers is a well-known problem in filter pruning. HFP automatically distributes the pruning rates among the individual layers such that the pruning loss is minimized. VGG-8 consists of six convolution layers and two fully-connected layers. The convolution layers are especially expensive regarding the number of multiplications whereas the first fully-connected layer owns most of the parameters. Thus, we analyze two experiments: a) with the aim of pruning 90% of the parameters and b) with the aim of pruning 90% of the multiplications. Figure 4 shows the layer-wise pruning rates for both experiments as well as the proportional layer sizes regarding the number of parameters and multiplications. In the first experiment, HFP primarily reduces the layers which contribute most to the number of parameters (conv6 and fc7). Especially fc7 has a large number of parameters and is therefore pruned by approximately 97%. In contrast, the second experiment mainly leads to a reduction of the convolution layers as they offer more potential for saving multiplications. Consequently, we can observe that HFP distributes flexible pruning rates over the individual layers. Furthermore, the distribution of the pruning budget varies depending on the target reduction. Comparisons with the layer sizes regarding the number of multiplications and parameters result in a meaningful distribution.
This section analyzes how the overall reduction of parameters and multiplications is proportionally distributed among the individual layers. For example, if 1000 parameters are pruned and the first layer is reduced by 150 parameters, then the proportional contribution of the first layer to the parameter pruning is 15%. Figure 5 shows the proportional pruning rates of ResNet-56 with 56% pruned multiplications and 50% pruned parameters (section 5.1, table 1). The pruning rates are shown for different training epochs and refer to the pruning result at that time step (e.g., after 10 epochs 47% of the multiplications were reduced). The first diagram shows the proportional pruning rates for the multiplications while the second diagram shows the proportional pruning rates for the parameters. Additionally, the diagrams show the total number of multiplications and parameters of the unpruned layers (dotted lines). In both figures, the three basic blocks of the ResNet architecture are visible and marked with A, B, and C. In case of the pruned multiplications, the proportional pruning rates of the individual layers change over the epochs.

Figure 4. Pruning rates of all layers in VGG-8 for two different experiments: a) with the aim of pruning 90% of the parameters and b) with the aim of pruning 90% of the multiplications. Depending on the target reduction, the pruning budget is distributed differently over the individual layers: a) reduces layers with many parameters while b) especially prunes the convolution layers with many multiplications.

Figure 5. The upper plot shows the proportional pruning rates of the individual layers of ResNet-56 (with 56% reduced multiplications and 50% reduced parameters, table 1). Proportional pruning rates indicate the contribution of single layers to the overall pruning rate. E.g., if 1000 multiplications are pruned from the model and the first layer is reduced by 150 multiplications, the proportional pruning rate of the first layer is 15%. The lower plot indicates the proportional pruning rates for the number of parameters.
6. Conclusion
We propose
Holistic Filter Pruning (HFP), a simple and powerful filter pruning method to reduce the complexity of trained DNNs. HFP uses a pruning loss that takes accurate pruning rates for the number of both parameters and multiplications into account. After each forward pass, the deviation between the current model size and the target size is calculated. By gradient descent, the pruning rates are distributed over the individual layers such that the target size is fulfilled. The loss function fits seamlessly into the training of DNNs and uses the channel-wise scaling factors of the batch-normalization layers to calculate the model size. Thus, no additional variables need to be defined and the implementation effort is low. Especially for large pruning rates, HFP yields excellent performance and outperforms recent approaches by up to 5%.

References

[1] Li Deng and Dong Yu. Deep learning: Methods and applications. Foundations and Trends in Signal Processing, 7(3-4):197-387, 2014.
[2] Lukas Enderich, Fabian Timm, and Wolfram Burgard. Symog: Learning symmetric mixture of Gaussian modes for improved fixed-point quantization. Neurocomputing, 2020.
[3] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
[4] Shaopeng Guo, Yujie Wang, Quanquan Li, and Junjie Yan. DMCP: Differentiable Markov channel pruning for neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[5] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. CoRR, abs/1608.04493, 2016.
[6] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. CoRR, abs/1506.02626, 2015.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. pages 770-778, 2015.
[8] Yang He, Yuhang Ding, Ping Liu, Linchao Zhu, Hanwang Zhang, and Yi Yang. Learning filter pruning criteria for deep convolutional neural networks acceleration. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[9] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. CoRR, abs/1808.06866, 2018.
[10] Yang He, Ping Liu, Ziwei Wang, and Yi Yang. Pruning filter via geometric median for deep convolutional neural networks acceleration. CoRR, abs/1811.00250, 2019.
[11] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. CoRR, abs/1707.06168, 2017.
[12] Geoffrey Hinton. Neural networks for machine learning. Coursera, video lecture, 2012.
[13] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. CoRR, abs/1607.03250, 2016.
[14] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[15] Qiangui Huang, Shaohua Kevin Zhou, Suya You, and Ulrich Neumann. Learning to prune filters in convolutional neural networks. CoRR, abs/1801.07365, 2018.
[16] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In The European Conference on Computer Vision (ECCV), September 2018.
[17] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[19] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[20] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598-605. Morgan-Kaufmann, 1990.
[21] Carl Lemaire, Andrew Achkar, and Pierre-Marc Jodoin. Structured pruning of neural networks with budget-aware regularization. CoRR, abs/1811.09332, 2018.
[22] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016.
[23] Yawei Li, Shuhang Gu, Christoph Mayer, Luc Van Gool, and Radu Timofte. Group sparsity: The hinge between filter pruning and decomposition for network compression. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[24] Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. HRank: Filter pruning using high-rank feature map, 02 2020.
[25] Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, and David S. Doermann. Towards optimal structured CNN pruning via generative adversarial learning. CoRR, abs/1903.09291, 2019.
[26] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. CoRR, abs/1708.06519, 2017.
[27] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.
[28] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. CoRR, abs/1707.06342, 2017.
[29] Lukas Mauch and Bin Yang. A novel layerwise pruning method for model reduction of fully connected deep neural networks. pages 2382-2386, 2017.
[30] Wei Pan, Hao Dong, and Yike Guo. DropNeuron: Simplifying the structure of deep neural networks. CoRR, abs/1606.07326, 2016.
[31] Suraj Srinivas, Akshayvarun Subramanya, and R. Venkatesh Babu. Training sparse neural networks. CoRR, abs/1611.06694, 2016.
[32] V. Sze, Y. Chen, T. Yang, and J. S. Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295-2329, Dec 2017.
[33] Xia Xiao, Zigeng Wang, and Sanguthevar Rajasekaran. AutoPrune: Automatic network pruning by regularizing auxiliary parameters. In Advances in Neural Information Processing Systems 32, pages 13681-13691. Curran Associates, Inc., 2019.
[34] Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 2133-2144. Curran Associates, Inc., 2019.
[35] R. Yu, A. Li, C. Chen, J. Lai, V. I. Morariu, X. Han, M. Gao, C. Lin, and L. S. Davis. NISP: Pruning networks using neuron importance score propagation. pages 9194-9203, 2018.
[36] Chenglong Zhao, Bingbing Ni, Jian Zhang, Qiwei Zhao, Wenjun Zhang, and Qi Tian. Variational convolutional neural network pruning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[37] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. CoRR, 2017.
[38] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jin-Hui Zhu. Discrimination-aware channel pruning for deep neural networks. CoRR, abs/1810.11809, 2018.