How Compact?: Assessing Compactness of Representations through Layer-Wise Pruning
Hyun-Joo Jung∗, Jaedeok Kim∗, Yoonsuck Choe
Machine Learning Lab, Artificial Intelligence Center, Samsung Research, Samsung Electronics Co., 56 Seongchon-gil, Seocho-gu, Seoul, Korea, 06765
Department of Computer Science and Engineering, Texas A&M University, College Station, TX, 77843, USA
Abstract
Various forms of representations may arise in the many layers embedded in deep neural networks (DNNs). Of these, where can we find the most compact representation? We propose to use a pruning framework to answer this question: How compact can each layer be compressed, without losing performance? Most of the existing DNN compression methods do not consider the relative compressibility of the individual layers. They uniformly apply a single target sparsity to all layers or adapt layer sparsity using heuristics and additional training. We propose a principled method that automatically determines the sparsity of individual layers derived from the importance of each layer. To do this, we consider a metric to measure the importance of each layer based on the layer-wise capacity. Given the trained model and the total target sparsity, we first evaluate the importance of each layer from the model. From the evaluated importance, we compute the layer-wise sparsity of each layer. The proposed method can be applied to any DNN architecture and can be combined with any pruning method that takes the total target sparsity as a parameter. To validate the proposed method, we carried out an image classification task with two types of DNN architectures on two benchmark datasets and used three pruning methods for compression. In the case of the VGG-16 model with weight pruning on the ImageNet dataset, we achieved up to 75% (17.5% on average) better top-5 accuracy than the baseline under the same total target sparsity. Furthermore, we analyzed where the maximum compression can occur in the network. This kind of analysis can help us identify the most compact representation within a deep neural network.
INTRODUCTION
In recent years, DNN models have been used for a variety of artificial intelligence (AI) tasks such as image classification (Simonyan and Zisserman 2014; He et al. 2016; Huang et al. 2017), semantic segmentation (He et al. 2017), and object detection (Redmon and Farhadi 2017). The need for integrating such models into devices with limited on-board computing power has been growing consistently. However, to extend the usage of large and accurate DNN models to resource-constrained devices such as mobile phones, home appliances, or IoT devices, compressing DNN models while maintaining their performance is imperative.

∗ Equally contributed.
A recent study (Zhu and Gupta 2017) demonstrates that making large DNN models sparse by pruning can consistently outperform directly trained small-dense DNN models. In pruning DNN models, we would like to address the following problem: "How can we determine the target sparsity of individual layers?" Most existing methods set the layer-wise sparsity to be uniformly fixed to a single target sparsity (Zhu and Gupta 2017) or adopt layer-wise sparsity manually (Han et al. 2015; He, Zhang, and Sun 2017). Starting from the assumption that not all layers in a DNN model have equal importance, we propose a new method that automatically computes the sparsity of individual layers according to layer-wise importance. The contributions of the proposed method are as follows:

• The proposed method can analytically compute the layer-wise sparsity of DNN models. The computed layer-wise sparsity values enable us to prune the DNN model more efficiently than pruning with uniform sparsity for all layers. In our experiments, we validate that the proposed layer-wise sparsity scheme can compress DNN models more than the uniform-layer sparsity scheme while retaining the same classification accuracy.

• The proposed method can be combined with any pruning method that takes total target sparsity as an input parameter. Such a condition is general in compression tasks because users usually want to control the trade-off between compression ratio and performance. In our experiments, we utilized three different pruning approaches (weight pruning by Han et al. 2015, random channel pruning, and channel pruning by Li et al. 2016) for compression. Other pruning approaches are also applicable.

• The proposed method can be applied to any DNN architecture because it does not require any constraint on the DNN architecture (e.g., Liu et al. 2017 and Ye et al. 2018 require a batch normalization layer to prune a DNN model).

• To compute the layer-wise sparsity, we do not require additional training or evaluation steps, which take a long time.
RELATED WORKS
Pruning is a simple but efficient method for DNN model compression. Zhu and Gupta (2017) showed that making large DNN models sparse by pruning can outperform small-dense DNN models trained from scratch. There are many pruning methods according to the granularity of pruning, from weight pruning (Han et al. 2015; Han, Mao, and Dally 2015) to channel pruning (He, Zhang, and Sun 2017; Liu et al. 2017; Li et al. 2016; Molchanov et al. 2016; Ye et al. 2018). These approaches mainly focused on how to select the redundant weights/filters in the model, rather than considering how many weights/filters need to be pruned in each layer.

However, considering the role of each layer is critical for efficient compression. Arora et al. (2018) measured the noise sensitivity of each layer and tried to use it to compute the most effective number of parameters for each layer. Recently, He and Han (2018) proposed Automated Deep Compression (ADC). They aimed to automatically find the sparsity ratio of each layer, which is similar to our goal. However, they used reinforcement learning to find the sparsity ratio and characterized the state space by the layout of a layer, i.e., kernel dimension, input size, FLOPs of a layer, etc.

In this paper, we focus on directly measuring the importance of each layer for the given task and the model. We use the values of the weight matrix itself rather than the layout of the layer. We will show that considering layer-wise importance in DNN compression is effective.

PROBLEM FORMULATION
Our goal is to compute the sparsity of each layer in the model given the total target sparsity of the model. In this section we derive the relation between the total target sparsity $s$ of a model and the sparsity of each layer while considering the importance of each layer.

Let $s_l \in [0, 1]$, $l = 1, \cdots, L$, be the layer sparsity of the $l$-th layer, where $L$ is the total number of layers in the model. Then the layer sparsity $s_l$ should satisfy the following condition:

$$\sum_{l=1}^{L} s_l N_l = sN, \qquad (1)$$

where $N_l$ is the number of parameters in the $l$-th layer and $N$ is the total number of parameters in the model.

A DNN model is usually overparameterized and contains many redundant parameters. Pruning aims at removing such parameters from every layer and leaving only the important ones. So, if a large proportion of parameters in a layer is important for the given task, the layer should not be pruned too much, while otherwise the layer should be pruned aggressively. Hence, it is natural to consider the importance of a layer when we determine the layer sparsity.

We assume that the number of remaining parameters in the $l$-th layer after pruning is proportional to the importance of the $l$-th layer, $\omega_l$. Because the number of remaining parameters in the $l$-th layer is equal to $(1 - s_l) N_l$, we have

$$(1 - s_l) N_l = \alpha \omega_l, \qquad (2)$$

where $\alpha > 0$ is a constant. Because we want all layers to equally share the number of pruned parameters with respect to the importance of each layer, $\alpha$ is set to be independent of the layer index.

By summing both sides over all the layers, we obtain

$$\sum_{l=1}^{L} (1 - s_l) N_l = N - sN, \qquad (3)$$

$$\sum_{l=1}^{L} \alpha \omega_l = \alpha \Omega, \qquad (4)$$

where $\Omega$ is the sum of the importance of all layers. From (Eq. 3) and (Eq. 4), we can easily compute $\alpha$, i.e.,

$$\alpha = \frac{(1 - s) N}{\Omega}. \qquad (5)$$

The above equation can be seen as the ratio of the total number of remaining parameters to the sum of the importance of all layers, which is proportional to the total number of effective parameters.

When $\alpha = 1$, the number of remaining parameters determined by $s$ is the same as the number of effective parameters. In that case, we can readily prune parameters in each layer. When $\alpha < 1$, the number of remaining parameters becomes smaller than the number of effective parameters. Therefore, we need to prune some effective parameters. In this case, $\alpha$ acts as a distributor that equally allocates the number of pruned effective parameters to all layers. Thanks to $\alpha$, the compression is prevented from pruning a specific layer too much. In the case of $\alpha > 1$, it acts similarly to the case of $\alpha < 1$ but allocates the number of pruned redundant parameters instead.

Because all factors constituting $\alpha$ are automatically determined by the given model, we can control the above pruning cases by controlling $s$ only, as desired.

By combining (Eq. 2) and (Eq. 5), we obtain the layer-wise sparsity as

$$s_l = 1 - \alpha \frac{\omega_l}{N_l} = 1 - (1 - s) \frac{N}{N_l} \frac{\omega_l}{\Omega}. \qquad (6)$$

The remaining problem is: How can we compute the layer-wise importance $\omega_l$? We propose a metric to measure the importance of each layer based on the layer-wise capacity. In the subsection below, we explain the metric in detail.
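To make the mapping from importance to sparsity concrete, the following is a minimal NumPy sketch of (Eq. 5) and (Eq. 6). It is our own illustration, not the authors' implementation, and all names are ours. Note that (Eq. 6) can yield values outside $[0, 1]$ for extreme importance profiles, which is exactly the situation handled by the constrained formulation (10) introduced later.

```python
import numpy as np

def layerwise_sparsity(total_sparsity, num_params, importance):
    """Per-layer sparsities from Eq. (6), given total sparsity s,
    parameter counts N_l, and importances omega_l."""
    num_params = np.asarray(num_params, dtype=float)
    importance = np.asarray(importance, dtype=float)
    N = num_params.sum()                                    # total parameters N
    alpha = (1.0 - total_sparsity) * N / importance.sum()   # Eq. (5)
    return 1.0 - alpha * importance / num_params            # Eq. (6)

# Toy usage: three layers; the parameter-weighted sum of the resulting
# sparsities recovers the total target sparsity s = 0.5.
s_l = layerwise_sparsity(0.5, [1000, 2000, 4000], [1.0, 2.0, 3.0])
print(s_l)  # approximately [0.417, 0.417, 0.563]
```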
Layer-wise Capacity

We measure the importance of a layer using the layer capacity induced by noise sensitivity (Arora et al. 2018). Based on this definition, the authors proved that a matrix having low noise sensitivity has large singular values (i.e., low rank). They identified the effective number of parameters of a DNN model by measuring the capacity of mapping operations (e.g., convolution or multiplication with a matrix), which is inversely proportional to the noise sensitivity. The layer capacity has the advantage of directly counting the effective number of parameters in a layer. However, they only consider a linear network having fully connected layers or convolutional layers.

Motivated by this work, we use the concept of capacity to compute layer-wise sparsity. The layer capacity $\mu_l$ of the $l$-th layer is defined as

$$\mu_l := \max_{x_l \in S_l} \frac{\|W_l x_l\|}{\|W_l\|_F \|x_l\|}, \qquad (7)$$

where $W_l$ is a mapping (e.g., convolution filter or multiplication with a weight matrix) of the $l$-th layer and $x_l$ is the input of the $l$-th layer. $\|\cdot\|$ and $\|\cdot\|_F$ are the $\ell_2$ norm and the Frobenius norm of the operator, respectively. $S_l$ is a set of inputs of the $l$-th layer. In other words, $\mu_l$ is the largest number that satisfies $\mu_l \|W_l\|_F \|x_l\| = \|W_l x_l\|$.

According to the work (Arora et al. 2018) and (Eq. 7), a mapping having large capacity has low rank and hence a small number of effective parameters. Therefore, we let the effective number of parameters be inversely proportional to the layer-wise capacity. More specifically, the effective number of parameters $e_l$ of the $l$-th layer can be written as

$$e_l = \frac{\beta}{\mu_l}, \qquad (8)$$

where $\beta$ is a constant.

In fact, $\beta$ might be different across layers because other attributes, such as the depth of a layer (distance from the input layer), would affect the number of effective parameters. In this paper, however, we set $\beta$ to be constant for simplicity and focus on the layer-wise capacity.

We assume that a layer having a large number of effective parameters (in other words, having small capacity) should not be pruned too much. Therefore, we can set the importance of the $l$-th layer $\omega_l$ to be the number of effective parameters $e_l$, i.e., $\omega_l = e_l$. Then, (Eq. 6) becomes

$$s_l = 1 - (1 - s) \frac{N}{N_l} \frac{e_l}{E} = 1 - (1 - s) \frac{N}{N_l} \frac{1}{M \mu_l}, \qquad (9)$$

where $E$ and $M$ are the sums of $e_l$ and $1/\mu_l$ over all layers, respectively. Because the value of $s_l$ is independent of the value of $\beta$, $s_l$ can be obtained without any knowledge of the value of $\beta$.
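As an illustration, the sketch below estimates $\mu_l$ for a fully connected layer over a sample of inputs and then converts capacities into layer-wise sparsities via (Eq. 9). It is a simplified sketch under the assumption that the layer is a plain matrix multiplication; for a convolutional layer, the product $W_l x_l$ would be replaced by the convolution of the input with the filter bank. All names are illustrative, not from the paper's code.

```python
import numpy as np

def layer_capacity(W, inputs):
    """Estimate mu_l of Eq. (7) for a dense layer y = W x by maximizing
    ||W x|| / (||W||_F ||x||) over a set of observed input vectors."""
    fro = np.linalg.norm(W)  # Frobenius norm ||W||_F
    return max(np.linalg.norm(W @ x) / (fro * np.linalg.norm(x)) for x in inputs)

def capacity_to_sparsity(total_sparsity, num_params, capacities):
    """Layer-wise sparsity from Eq. (9), with omega_l proportional to 1/mu_l."""
    num_params = np.asarray(num_params, dtype=float)
    capacities = np.asarray(capacities, dtype=float)
    N = num_params.sum()
    M = (1.0 / capacities).sum()  # M = sum over layers of 1/mu_l
    return 1.0 - (1.0 - total_sparsity) * N / (num_params * M * capacities)

# Toy usage with two random dense layers and random inputs.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((64, 128)), rng.standard_normal((128, 64))]
xs = [rng.standard_normal((20, 128)), rng.standard_normal((20, 64))]
mus = [layer_capacity(W, x) for W, x in zip(Ws, xs)]
print(capacity_to_sparsity(0.5, [W.size for W in Ws], mus))
```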
PRUNING WITH LAYER-WISE SPARSITY

In this section, we explain our proposed DNN model compression process with layer-wise sparsity. Given the total target sparsity, we first compute the layer-wise importance of each layer from the proposed metric (layer-wise capacity). According to (Eq. 9), we can then compute the layer-wise sparsity easily.

However, there are some issues in computing the actual layer-wise sparsity. First, the user might want to control the minimum number of remaining parameters. In the worst case, the required number of pruned parameters for a layer might be equal to or larger than the total number of parameters in the layer. Second, the exact number of pruned parameters may differ from the sparsity computed from (Eq. 6). For example, in channel pruning, the number of pruned parameters in the $(l-1)$-th layer also affects the $l$-th layer.

To handle these problems, we re-formulate (Eq. 6) as an optimization problem. The objective function is defined as

$$\min_{\epsilon} \|\epsilon\|, \qquad (10)$$

such that

$$\xi_l \leq \alpha (1 + \varepsilon_l) \omega_l \leq N_l \quad \text{for all } l = 1, \cdots, L,$$
$$\sum_l \alpha (1 + \varepsilon_l) \omega_l = (1 - s) N,$$

where $\xi_l$ is the minimum number of remaining parameters after pruning and $\epsilon = (\varepsilon_1, \cdots, \varepsilon_L)$ is the vector of layer-wise perturbations on the constant $\alpha$. That is, though we set the number of pruned parameters of each layer equally proportional to the importance of the layer, we perturb the degree of pruning of each layer by adding $\varepsilon_l$ to $\alpha$ in unavoidable cases.
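The sketch below solves (10) numerically with an off-the-shelf SLSQP solver; it minimizes the squared norm $\|\epsilon\|^2$, which has the same minimizer. This is our own illustrative formulation in SciPy, not the authors' code, and the helper names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def solve_layer_sparsity(s, N_l, omega, xi):
    """Solve (10): min ||eps||^2  s.t.  xi_l <= alpha*(1+eps_l)*omega_l <= N_l
    and sum_l alpha*(1+eps_l)*omega_l = (1-s)*N."""
    N_l, omega, xi = (np.asarray(v, dtype=float) for v in (N_l, omega, xi))
    N = N_l.sum()
    alpha = (1.0 - s) * N / omega.sum()   # Eq. (5)
    a = alpha * omega                     # remaining parameters when eps = 0

    # Box constraints follow directly from xi_l <= a_l*(1+eps_l) <= N_l.
    bounds = [(x / ai - 1.0, n / ai - 1.0) for x, ai, n in zip(xi, a, N_l)]
    constraints = [{"type": "eq",
                    "fun": lambda eps: a @ (1.0 + eps) - (1.0 - s) * N}]
    x0 = np.array([(lo + hi) / 2.0 for lo, hi in bounds])  # feasible start
    res = minimize(lambda eps: eps @ eps, x0,
                   method="SLSQP", bounds=bounds, constraints=constraints)
    remaining = a * (1.0 + res.x)
    return 1.0 - remaining / N_l          # feasible layer-wise sparsities

# Toy usage: the first layer would be over-pruned by Eq. (6) alone
# (its unconstrained sparsity is negative), so (10) redistributes it.
print(solve_layer_sparsity(0.5, [1000, 2000, 4000], [3.0, 2.0, 1.0],
                           xi=[100, 100, 100]))
```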
Proposition 1. If $\sum_{l=1}^{L} \xi_l \leq (1 - s) N$, then a solution of the optimization problem (10) exists and it is unique.

Proof. Since $\|\epsilon\|$ is strictly convex in $\epsilon$, the optimization problem (10) has a unique solution if the problem is feasible. So it is enough to show that the constraint set of the problem is not empty. Denote the constraint set of the inequalities by

$$D := \{\epsilon \in \mathbb{R}^L : \xi_l \leq \alpha (1 + \varepsilon_l) \omega_l \leq N_l \text{ for } l = 1, \cdots, L\}.$$

Consider a continuous function $f : D \to \mathbb{R}$ such that $f(\epsilon) := \sum_{l=1}^{L} \alpha (1 + \varepsilon_l) \omega_l$. If we take $\epsilon_{\max} = (N_1 / \alpha \omega_1 - 1, \cdots, N_L / \alpha \omega_L - 1)$, then $\epsilon_{\max} \in D$ and

$$f(\epsilon_{\max}) = \sum_{l=1}^{L} N_l = N \geq (1 - s) N.$$

Similarly, by taking $\epsilon_{\min} = (\xi_1 / \alpha \omega_1 - 1, \cdots, \xi_L / \alpha \omega_L - 1)$, we have $\epsilon_{\min} \in D$ and

$$f(\epsilon_{\min}) = \sum_{l=1}^{L} \xi_l \leq (1 - s) N.$$

By the provided condition, we know $\sum_{l=1}^{L} \xi_l \leq (1 - s) N \leq N$. Then the continuity of $f$ yields that there exists a point $\tilde{\epsilon} \in D$ such that $f(\tilde{\epsilon}) = (1 - s) N$, which completes our proof.

Intuitively, Proposition 1 says that the total target sparsity $s$ of pruning should not violate the constraint of the minimum number of remaining parameters. Let us assume $\sum_{l=1}^{L} \xi_l \leq (1 - s) N$ to guarantee the feasibility of the optimization problem. We can then apply various convex optimization algorithms to obtain the optimal solution $\epsilon^*$ (Boyd and Vandenberghe 2004).

Note that our proposed method does not require model training or evaluation to find the values of the layer sparsities. We obtain the layer sparsity analytically without additional training or evaluation. Many existing works (Zhong et al. 2018; He and Han 2018) use a trial-and-error approach for searching for the best combination of hyperparameters such as the layer sparsity $s_l$. Evaluating each combination through additional training or evaluation is necessary, which makes these approaches not feasible on large datasets.

Our method requires convex optimization in a small-dimensional space, which has negligible computing time in general. The total calculation time mainly depends on the time spent calculating the layer capacity $\mu_l$ by (Eq. 7). Although it is required to compute the norm of each input vector induced by the whole dataset, this has approximately the same computing cost as inference. Our approach thus has an advantage in terms of computation time. With our implementation, computing (Eq. 7) took less than 3 hours for the VGG-16 model with the ImageNet dataset on a single-GPU machine. For the VGG-16 model with the CIFAR-10 dataset, the total computation time of (Eq. 7) was less than 1 minute on the same hardware setup. The most time-consuming part of our implementation was calculating the Frobenius norm of each layer. There could be more speed-up if we used approximate values of the norms or a subset of the dataset instead of the whole dataset. However, these are out of the scope of this paper, so we will not discuss them any further.

Note that although here we assume that all layers in the model are compressed, selecting and compressing a subset of $1, \cdots, L$ can easily be handled. Moreover, the process described below can also be applied to any pruning method such as channel pruning or weight pruning.

Table 1: Simple DNN model architecture. Conv and FC mean convolutional and fully connected layer, respectively.

| Layer   | Filter size    | Output size  | Activation |
|---------|----------------|--------------|------------|
| Conv1   | (3, 3, 3, 32)  | (32, 32, 32) | ReLU       |
| Conv2   | (3, 3, 32, 32) | (32, 32, 32) | ReLU       |
| Maxpool | (2, 2)         | (16, 16, 32) |            |
| Conv3   | (3, 3, 32, 64) | (16, 16, 64) | ReLU       |
| Conv4   | (3, 3, 64, 64) | (16, 16, 64) | ReLU       |
| Maxpool | (2, 2)         | (8, 8, 64)   |            |
| FC1     | (2048, 512)    | (1, 512)     | ReLU       |
| FC2     | (512, 10)      | (1, 10)      | Softmax    |

Selection of Target Sparsity for Channel Pruning
In the experiments, we applied the proposed layer-wise sparsity method to both weight pruning and channel pruning. However, in the case of channel pruning, a pruned layer affects the input dimension of the next layer. So the actual number of pruned parameters may differ from our expectation. As a consequence, our proposed method does not achieve the exact total target sparsity $s$ (over-pruned in most cases) by channel pruning. In fact, the exact number of remaining parameters after channel pruning depends on the topology of the neural network and the properties of each layer, e.g., the kernel size of a convolution layer. So an exact formulation of the number of remaining parameters is not mathematically tractable.

To overcome this limitation we need a method to achieve the total target sparsity. Because the value of $s$ in our proposed method can be considered as the compression strength in pruning, we can find the exact sparsity $\hat{s}$ from $s$. Let $C(s)$ be the actual number of remaining parameters in a model after applying channel pruning with sparsity $s$, i.e., the $l$-th layer is pruned by the sparsity $s_l$ induced by (10). Then the total target sparsity is achievable if we use the value $\hat{s}$ instead of the total target sparsity $s$. Finally, the problem becomes finding the proper $\hat{s}$ that satisfies

$$C(\hat{s}) = (1 - s) N.$$

Because both solving (10) and counting the number of parameters require low computational costs, the value of $\hat{s}$ can be obtained in reasonable time. We therefore can control the achieved sparsity of the model by using $\hat{s}$ in the case of channel pruning.
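Assuming $C(\cdot)$ decreases monotonically in the trial sparsity, a simple bisection search suffices to calibrate $\hat{s}$. The sketch below is our own illustration; `count_remaining` stands for a caller-supplied routine that builds the channel-pruning plan at a trial sparsity and counts the surviving parameters (a hypothetical helper, not from the paper).

```python
def calibrate_sparsity(s, N, count_remaining, tol=1e-4):
    """Bisection for s_hat with C(s_hat) = (1 - s) * N, assuming the
    caller-supplied count_remaining(t) = C(t) decreases as t grows."""
    target = (1.0 - s) * N
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_remaining(mid) > target:
            lo = mid   # too many parameters survive: prune harder
        else:
            hi = mid
    return 0.5 * (lo + hi)
```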
EXPERIMENTAL RESULTS

In this section, we investigate how much our proposed scheme affects the performance of a pruned model. To do this, we carried out an image classification task with the simple DNN and VGG-16 (Simonyan and Zisserman 2014) models on the CIFAR-10 (Krizhevsky, Nair, and Hinton 2014) and ImageNet (Deng et al. 2009) datasets.

In all experiments, we compared our layer-wise sparsity method with the uniform sparsity method, which we use as the baseline. To investigate how robust our proposed scheme is to different pruning methods, we applied three pruning methods (magnitude-based weight pruning by Han et al. 2015, random channel pruning, and magnitude-based channel pruning by Li et al. 2016) to the DNN architectures. We set $\xi_l$ as $3 \times w_l \times h_l \times c_{l-1}$, where $c_{l-1}$ is the number of input channels of the $l$-th layer and $w_l$ and $h_l$ are the spatial filter sizes of the $l$-th layer. In other words, we wanted at least 3 channels to remain for performance.

We implemented the proposed method using Keras (Chollet 2015). For the VGG-16 model on the ImageNet dataset, we used pre-trained weights in Keras, and for the VGG-16 model on the CIFAR-10 dataset, we used pre-trained weights from (Geifman 2017). The simple DNN model was designed and trained by ourselves.

Simple DNN Model on CIFAR-10 Dataset
Table 1 shows the architecture of the simple DNN model used in this experiment. To compress the model, we applied pruning to the Conv2-4 and FC1 layers. After pruning is done, we can additionally apply (optional) fine-tuning to improve performance. Therefore we checked the accuracy of both pruned and fine-tuned models to investigate the resilience of the proposed method. For fine-tuning, we ran 3 epochs with learning rate = 0.0001.

Because the computation of layer-wise sparsity is not affected by the choice of pruning method, we need to compute the layer-wise sparsity for the model only once. In the subsections below, therefore, we shared the computed layer-wise sparsity across all three pruning methods.
Weight Pruning
To prune weights, we used magnitude-based pruning (Han et al. 2015). In other words, we pruned weights that have small absolute values, because we can consider that they do not contribute much to the output.

Figure 1 shows the performance after compression. As we can see in the figure, the proposed method outperforms the baseline under all total target sparsities in the pruning-only case, and achieves similar or better results after fine-tuning. At large total target sparsity, the effect of the proposed method becomes apparent. For example, at total target sparsity 0.9, the accuracy of the proposed method drops by 0.179 after pruning only, while the baseline drops by 0.656.

Figure 1: Classification accuracy comparison against the baseline and the proposed method using a simple DNN model on the CIFAR-10 dataset. For compression, weight pruning (Han et al. 2015) is used. (p) and (p + ft) mean pruning only and fine-tuning after pruning, respectively.
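As a concrete illustration of the magnitude criterion above, the sketch below zeroes out the smallest-magnitude weights of a single layer at a given sparsity. This is our own minimal rendering of the idea in Han et al. (2015), not their released code.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Return a copy of weight tensor W with the smallest-magnitude
    fraction `sparsity` of its entries set to zero."""
    k = int(sparsity * W.size)
    pruned = W.copy()
    if k == 0:
        return pruned
    # The k-th smallest absolute value serves as the pruning threshold.
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

# Toy usage: prune 60% of a random 4x4 kernel slice.
rng = np.random.default_rng(0)
print(magnitude_prune(rng.standard_normal((4, 4)), 0.6))
```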
Random Channel Pruning
To eliminate the effect of the details of a pruning algorithm, we also applied random pruning in the channel dimension. Given the total target sparsity, we computed the layer-wise sparsities and the number of required parameters to be pruned, in order. Then the number of pruned channels is determined by applying the floor operation to the number of pruned parameters. According to the computed number of pruned channels, we randomly selected which channels are pruned and repeated the selection 10 times.

We compared the classification accuracy of the proposed layer-wise sparsity scheme against the baseline. Figure 2 shows the results. Similar to weight pruning, the proposed layer-wise sparsity scheme outperforms the baseline under all total target sparsities. As we can see in Figure 2, the proposed method conducts compression reliably compared with the baseline method (smaller height of the min-max bar than the baseline). Interestingly, the result of the proposed method after pruning only even outperforms the baseline method after fine-tuning.

Figure 2: Classification accuracy comparison against the baseline and the proposed method using the simple DNN model on the CIFAR-10 dataset. For compression, random channel pruning is used. We plot the median value over 10 trials and the vertical bar at each point represents the max and min values. (p) and (p + ft) mean pruning only and fine-tuning after pruning, respectively.
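For completeness, one trial of the random selection step could look like the following sketch (our own illustration; the per-layer count `n_prune` is assumed to come from the floor operation described above).

```python
import numpy as np

def random_channel_selection(n_channels, n_prune, rng=None):
    """Pick `n_prune` distinct output channels of a layer uniformly at
    random; one of the 10 repeated trials described in the text."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(n_channels, size=n_prune, replace=False)

# Toy usage: prune 13 of 64 channels.
print(random_channel_selection(64, 13, np.random.default_rng(0)))
```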
Channel Pruning
We used channel pruning (Li et al. 2016) to compress the model. The authors used the sum of absolute weights in a channel as the criterion for pruning. In other words, channels that have small-magnitude weights are pruned. Figure 3 shows the results.

Numerically, the proposed method achieved up to 58.9% better classification accuracy than the baseline using the same total target sparsity; please refer to the accuracy of Baseline (p) and Proposed (p) at total target sparsity 0.6 in Figure 3. In terms of compression ratio, the proposed method can prune up to 5 times more parameters than the baseline while retaining the same accuracy; please compare the accuracy of Baseline (p) at total target sparsity 0.1 and Proposed (p) at total target sparsity 0.5 in Figure 3.

Figure 3: Classification accuracy comparison against the baseline and the proposed method using the simple DNN model on the CIFAR-10 dataset. For compression, channel pruning (Li et al. 2016) is used. (p) and (p + ft) mean pruning only and fine-tuning after pruning, respectively.
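The channel-selection criterion of Li et al. (2016) can be sketched as below for a Keras-style convolution kernel of shape (height, width, in_channels, out_channels); this is our own illustrative rendering, not the authors' code.

```python
import numpy as np

def l1_channel_selection(W, layer_sparsity):
    """Rank output channels by the sum of absolute filter weights and
    return the indices of the weakest ones to prune (Li et al. 2016 style).
    W is assumed to have shape (h, w, c_in, c_out), as in Keras Conv2D."""
    scores = np.abs(W).sum(axis=(0, 1, 2))       # L1 norm per output channel
    n_prune = int(layer_sparsity * W.shape[-1])
    return np.argsort(scores)[:n_prune]          # smallest norms first

# Toy usage: prune 25% of 16 channels in a random 3x3x8x16 kernel.
rng = np.random.default_rng(0)
print(l1_channel_selection(rng.standard_normal((3, 3, 8, 16)), 0.25))
```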
VGG-16 on CIFAR-10 Dataset
In this experiment, we applied pruning to all Conv and FC layers except the first and the last layers (Conv1 and FC2) in the VGG-16 model. For fine-tuning, we ran 3 epochs with learning rate = 0.0001 in all pruning cases.

Figure 4 shows the computed layer-wise sparsities under different total target sparsities, $s$. The computed sparsities confirm our assumption that all layers have different importance for the given task.

Figure 4: The computed layer-wise sparsity of the VGG-16 model on the CIFAR-10 dataset, given the total target sparsity $s$.

Random Channel Pruning
Similar to the simple DNN model, we repeated the random channel pruning 10 times; the results are shown in Figure 5. Surprisingly, the proposed method achieves almost 90% maximum accuracy when the total target sparsity is 0.8. Although the VGG-16 model is considered already quite overparameterized, we can still say that the proposed method efficiently compresses DNN models. The proposed method conducts compression reliably compared with the baseline method (smaller height of the min-max bar than the baseline), as we can see in Figure 5. Though the performance after pruning only is the same as random guessing or worse than the baseline when $s > 0.$, the performance is almost recovered (except the case $s = 0.$) after fine-tuning. This demonstrates that considering layer-wise sparsity helps not only pruning but also performance improvement with fine-tuning.

Figure 5: Classification accuracy comparison against the baseline and the proposed method using the VGG-16 model on the CIFAR-10 dataset. For compression, random channel pruning is used. We plot the median value over 10 trials and the vertical bar at each point represents the max and min values. (p) and (p + ft) mean pruning only and fine-tuning after pruning, respectively.

Channel Pruning
Figure 6 shows the results. Though the accuracy values after pruning only are worse than the baseline, the degree of performance improvement after fine-tuning is better than the baseline when $s > 0.$ except $s = 0.$. Similar to the results of random channel pruning, the proposed method maintains the performance within 3% of the original model until the total target sparsity becomes 0.7.

Figure 6: Classification accuracy comparison with the baseline and the proposed method of the VGG-16 model on the CIFAR-10 dataset. For compression, channel pruning (Li et al. 2016) is used. (p) and (p + ft) mean pruning only and fine-tuning after pruning, respectively.

Table 2 shows the performance comparison with other channel pruning methods. For a fair comparison, we pruned Conv layers only. For fine-tuning, we ran 100 epochs with a constant learning rate of 0.001 and selected the best accuracy.

Table 2: Performance evaluation for the baseline, other state-of-the-art methods, and the proposed method using a VGG-16 model on the CIFAR-10 dataset. We pruned Conv layers only for a fair comparison. ratio(param) and ratio(FLOP) mean the pruned ratio in the number of parameters and FLOPs, respectively. To compute Δerr, ratio(param), and ratio(FLOP) of other methods, we referred to their reported test errors.

In terms of the test error increase (Δerr), the proposed method outperforms the other methods in both cases. Our proposed method prunes more filters from the later layers than from the earlier layers. It performs better in reducing the number of parameters, while it does not in reducing FLOPs, as we can see in Table 2. However, reducing FLOPs can be easily achieved by reformulating (Eq. 1).

VGG-16 on ImageNet Dataset
In this experiment, we applied pruning to all Conv and FC layers except the first and the last layers (Conv1 and FC3) in the VGG-16 model on the ImageNet dataset.

Figure 7 shows the computed layer-wise sparsities under different total target sparsities. As we can see in the figure, the layer-wise sparsity differs considerably between layers. Surprisingly, the figure shows that we only need to prune the two fully connected layers (FC1 and FC2) until the total target sparsity $s$ becomes 0.8. Such results are reasonable because more than 85% of the total parameters in VGG are concentrated in the FC1 and FC2 layers. Therefore we can consider that there would be many redundant parameters in those layers. However, we can also see from the figure that the proposed method does not compute the layer-wise sparsities by considering the number of parameters only. For example, the number of parameters in the FC1 layer is far larger than in the FC2 layer, but the sparsity of the FC2 layer is larger than that of the FC1 layer. Conv11, Conv12, and Conv13 also show similar results (all three layers have the same number of parameters).

Figure 7: The computed layer-wise sparsity of the VGG-16 model on the ImageNet dataset, given the total target sparsity $s$.

Weight Pruning
Figure 8 shows the performance after compression. The proposed method outperforms the baseline under all target sparsities in both top-1 and top-5 accuracy. Both methods maintain the performance until $s = 0.4$, but when $s$ becomes larger than 0.4, the proposed method shows consistently better performance.

Figure 8: Classification accuracy comparison against the baseline and the proposed method using a VGG-16 model on the ImageNet dataset. 'w' means weight pruning and 'ch' means channel pruning.
Figure 8 also shows the compression results using channel pruning (Li et al. 2016). Because channel pruning removes parameters in bunches, distributing the number of required parameters to be pruned over all layers according to the layer-wise sparsity is harder than in weight pruning. Therefore the performance is worse than with weight pruning, but it still outperforms the baseline method in both the top-1 and top-5 accuracy cases.

From the above results, the proposed layer-wise sparsity scheme outperforms the baseline method except in a few cases. We can validate our claim that not all layers have the same importance for the given task and that the proposed layer-wise sparsity scheme is highly effective for compressing various DNN models by pruning.
CONCLUSION
In this paper, we proposed a new method that automatically computes the layer-wise sparsity from the layer-wise capacity for DNN model compression, especially pruning. Our proposed method does not require additional training or evaluation steps to compute the layer-wise sparsity, which gives it an advantage in terms of computation time. Experimental results validated the efficiency of the proposed layer-wise sparsity calculation in DNN model compression. Furthermore, the estimated layer-wise sparsity varied greatly across layers, suggesting that this information can be used to find where the most compact representation resides in a deep neural network.
References
Arora, S.; Ge, R.; Neyshabur, B.; and Zhang, Y. 2018. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296.

Ayinde, B. O., and Zurada, J. M. 2018. Building efficient convnets using redundant feature pruning. arXiv preprint arXiv:1802.07653.

Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Chollet, F. 2015. Keras. https://github.com/fchollet/keras.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.

Geifman, Y. 2017. cifar-vgg. https://github.com/geifmany/cifar-vgg.

Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 1135–1143.

Han, S.; Mao, H.; and Dally, W. J. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.

He, Y., and Han, S. 2018. ADC: Automated deep compression and acceleration with reinforcement learning. arXiv preprint arXiv:1802.03494.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2980–2988. IEEE.

He, Y.; Zhang, X.; and Sun, J. 2017. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2.

Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In CVPR, volume 1, 3.

Krizhevsky, A.; Nair, V.; and Hinton, G. 2014. The CIFAR-10 dataset.

Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; and Graf, H. P. 2016. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.

Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; and Zhang, C. 2017. Learning efficient convolutional networks through network slimming. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2755–2763. IEEE.

Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; and Kautz, J. 2016. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440.

Redmon, J., and Farhadi, A. 2017. YOLO9000: Better, faster, stronger. In CVPR, 6517–6525. IEEE.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Ye, J.; Lu, X.; Lin, Z.; and Wang, J. Z. 2018. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124.

Zhong, J.; Ding, G.; Guo, Y.; Han, J.; and Wang, B. 2018. Where to prune: Using LSTM to guide end-to-end pruning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, 3205–3211. International Joint Conferences on Artificial Intelligence Organization.

Zhu, M., and Gupta, S. 2017. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878.