Centripetal SGD for Pruning Very Deep Convolutional Networks with Complicated Structure
Xiaohan Ding, Guiguang Ding, Yuchen Guo (Tsinghua University), Jungong Han (Lancaster University)
[email protected], [email protected], {yuchen.w.guo, jungonghan77}@gmail.com

Abstract
Redundancy is widely recognized in Convolutional Neural Networks (CNNs), which makes it possible to remove unimportant filters from convolutional layers so as to slim the network with an acceptable performance drop. Inspired by the linear and combinational properties of convolution, we seek to make some filters increasingly close and eventually identical for network slimming. To this end, we propose Centripetal SGD (C-SGD), a novel optimization method, which can train several filters to collapse into a single point in the parameter hyperspace. When the training is completed, the removal of the identical filters can trim the network with NO performance loss, thus no finetuning is needed. By doing so, we have partly solved an open problem of constrained filter pruning on CNNs with complicated structure, where some layers must be pruned following others. Our experimental results on CIFAR-10 and ImageNet have justified the effectiveness of C-SGD-based filter pruning. Moreover, we have provided empirical evidence for the assumption that the redundancy in deep neural networks helps the convergence of training, by showing that a redundant CNN trained using C-SGD outperforms a normally trained counterpart with the equivalent width.
1. Introduction
The Convolutional Neural Network (CNN) has become an important tool for machine learning and many related fields [10, 36, 37, 38]. However, due to their computationally intensive nature, as CNNs grow wider and deeper, their memory footprint, power consumption and required floating-point operations (FLOPs) have increased dramatically, making them difficult to deploy on platforms without rich computational resources, such as embedded systems. In this context, CNN compression and acceleration methods have been intensively studied, including tensor low-rank expansion [31], connection pruning [20], filter pruning [40], quantization [19], knowledge distillation [27], fast convolution [48], feature map compacting [61], etc.

* This work is supported by the National Key R&D Program of China (No. 2018YFC0807500), National Natural Science Foundation of China (No. 61571269), National Postdoctoral Program for Innovative Talents (No. BX20180172), and the China Postdoctoral Science Foundation (No. 2018M640131). Corresponding author: Guiguang Ding. Here "centripetal" means "several objects moving towards a center", not "an object rotating around a center by the centripetal force".

We focus on filter pruning, a.k.a. channel pruning [26] or network slimming [44], for three reasons. Firstly, filter pruning is a universal technique able to handle any kind of CNN, making no assumptions on the application field, the network architecture or the deployment platform. Secondly, filter pruning effectively reduces the FLOPs of the network, which serve as the main criterion of computational burden. Lastly, as an important advantage in practice, filter pruning produces a thinner network with no customized structure or extra operation, which is orthogonal to the other model compression and acceleration techniques.

Motivated by this universality and significance, considerable efforts have been devoted to filter pruning techniques. Due to the widely observed redundancy in CNNs [8, 9, 13, 19, 66, 69], numerous excellent works have shown that, if a CNN is pruned appropriately with acceptable structural damage, a follow-up finetuning procedure can restore the performance to a certain degree. Some prior works [2, 5, 28, 40, 49, 50, 66] sort the filters by their importance, directly remove the unimportant ones and re-construct the network with the remaining filters. As the important filters are preserved, a comparable level of performance can be reached by finetuning. However, some recent powerful networks have complicated structures, such as identity mappings [23] and dense connections [29], where some layers must be pruned in the same pattern as others, raising an open problem of constrained filter pruning. This further challenges such pruning techniques, as one cannot assume that the important filters at different layers reside at the same positions.

Obviously, the model is more likely to recover if the destructive impact of pruning is reduced. Taking this into consideration, another family of methods [3, 15, 43, 60, 63] seeks to zero out some filters in advance, where group-Lasso Regularization [53] is frequently used. Essentially, zeroing filters out can be regarded as producing a desired redundancy pattern in CNNs. After reducing the magnitude of the parameters of some whole filters, pruning these filters causes a smaller accuracy drop, hence it becomes easier to restore the performance by finetuning.

In this paper, we also aim to produce redundancy patterns in CNNs for filter pruning.
However, instead of zeroing out filters, which ends up with a pattern where some whole filters are close to zero, we intend to merge multiple filters into one, leading to a redundancy pattern where some filters are identical. The intuition motivating the proposed method is an observation of the information flow in CNNs (Fig. 1). If two or more filters are trained to become identical, due to the linear and combinational properties of convolution, we can simply discard all but one of them and add up the parameters along the corresponding input channels of the next layer. Doing so causes ZERO performance loss, and there is no need for a time-consuming finetuning process. Note that such a finetuning process is essential for the zeroing-out methods [3, 43, 63], as the discarded filters are merely small in magnitude but still encode a certain amount of information; removing such filters therefore unavoidably degrades the performance of the network. When multiple filters are constrained to grow closer in the parameter hyperspace, which we refer to as the centripetal constraint, though they start to produce increasingly similar information, the information conveyed through the corresponding input channels of the next layer is still in full use, thus the model's representational capacity is stronger than that of a counterpart with the filters being zeroed out.
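As a sanity check of this merging equivalence, the following minimal sketch (our own illustration, not part of the paper's pipeline; it uses 1x1 convolutions and omits batch normalization and biases for brevity) builds two stacked convolutions in which two filters of the first layer are identical, merges them by summing the corresponding input channels of the second layer, and verifies that the output is unchanged.

```python
import numpy as np

# If two filters of conv1 are identical, one can be dropped, provided the
# corresponding input channels of conv2 are added together: the output is unchanged.
H, W, c_in, c1, c2 = 8, 8, 2, 4, 6
rng = np.random.default_rng(0)

x = rng.standard_normal((H, W, c_in))
k1 = rng.standard_normal((c_in, c1))          # conv1: 1x1 kernel with c1 filters
k1[:, 3] = k1[:, 2]                           # filters at indices 2 and 3 are identical
k2 = rng.standard_normal((c1, c2))            # conv2: 1x1 kernel with c2 filters

def conv1x1(feat, kernel):
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.tensordot(feat, kernel, axes=([2], [0]))

y_full = conv1x1(conv1x1(x, k1), k2)

# Prune: keep conv1 filters {0, 1, 2}; fold conv2's input channel 3 into channel 2.
k1_pruned = k1[:, [0, 1, 2]]
k2_merged = k2.copy()
k2_merged[2, :] += k2_merged[3, :]
k2_pruned = k2_merged[[0, 1, 2], :]

y_pruned = conv1x1(conv1x1(x, k1_pruned), k2_pruned)
print(np.allclose(y_full, y_pruned))          # True: zero performance loss
```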
We summarize our contributions as follows.

• We propose to produce redundancy patterns in CNNs by training some filters to become identical. Compared to the importance-based filter pruning methods, doing so requires no heuristic knowledge about the importance of filters. Compared to the zeroing-out methods, no finetuning is needed, and more of the representational capacity of the network is preserved.

• We propose Centripetal SGD (C-SGD), an innovative SGD optimization method. As the name suggests, we make multiple filters move towards a center in the hyperspace of the filter parameters. In the meantime, supervised by the model's original objective function, the performance is maintained as much as possible.

• By C-SGD, we have partly solved constrained filter pruning, an open problem of slimming modern very deep CNNs with complicated structure, where some layers must be pruned in the same pattern as others.

• We have presented both theoretical and empirical analysis of the effectiveness of C-SGD. We have shown empirical evidence supporting our motivation (Fig. 1) and the assumption that redundancy helps the convergence of neural networks [14, 27].

The codes are available at https://github.com/ShawnDing1994/Centripetal-SGD.
2. Related Work
Filter Pruning.
Numerous inspiring works [7, 17, 20, 22, 39, 58, 67] have shown that it is feasible to remove a large portion of connections or neurons from a neural network without a significant performance drop. However, as the connection pruning methods make the parameter tensors no smaller but just sparser, little or no acceleration can be observed without the support of specialized hardware. It is then natural for researchers to go further on CNNs: by removing filters instead of sporadic connections, we transform the wide convolutional layers into narrower ones, hence the FLOPs, memory footprint and power consumption are significantly reduced. One kind of method defines the importance of filters by some means, then selects and prunes the unimportant filters carefully to minimize the performance loss. Some prior works measure a filter's importance by the accuracy reduction (CAR) [2], the channel contribution variance [50], a Taylor-expansion-based criterion [49], the magnitude of convolution kernels [40] and the average percentage of zero activations (APoZ) [28], respectively; Luo et al. [47] select filters based on information derived from the next layer; Yu et al. [66] take into consideration the effect of error propagation; He et al. [26] select filters by solving a Lasso regression; He and Han [24] pick filters with the aid of reinforcement learning. Another category seeks to train the network under certain constraints in order to zero out some filters, where group-Lasso regularization is frequently used [3, 43, 63]. It is noteworthy that since removing some whole filters can degrade the network a lot, the CNNs are usually pruned in a layer-by-layer [3, 24, 26, 28, 47, 50] or filter-by-filter [2, 49] manner, and require one or more finetuning processes to restore the accuracy [2, 3, 5, 24, 26, 28, 40, 44, 47, 49, 50, 63, 66].
Other Methods.
Apart from filter pruning, some excellent works seek to compress and accelerate CNNs in other ways. Considerable works [4, 14, 31, 32, 54, 56, 65, 68] decompose or approximate the parameter tensors; quantization and binarization techniques [11, 18, 19, 51, 64] approximate a model using fewer bits per parameter; knowledge distillation methods [6, 27, 52] transfer knowledge from a big network to a smaller one; some researchers seek to speed up convolution with the help of perforation [16], FFT [48, 59] or DCT [62]; Wang et al. [61] compact feature maps by extracting information via circulant matrices. Of note is that since filter pruning simply shrinks a wide CNN into a narrower one with no special structures or extra operations, it is orthogonal to the other methods.
Figure 1: Zeroing-out v.s. centripetal constraint. This figure shows a CNN with 4 and 6 filters at the 1st and 2nd convolutional layer, respectively, which takes a 2-channel input. Left: the 3rd filter at conv1 is zeroed out, thus the 3rd feature map is close to zero, implying that the 3rd input channels of the 6 filters at conv2 are useless. During pruning, the 3rd filter at conv1 along with the 3rd input channels of the 6 filters at conv2 are removed. Right: the 3rd and 4th filters at conv1 are forced to grow close by the centripetal constraint until the 3rd and 4th feature maps become identical. But the 3rd and 4th input channels of the 6 filters at conv2 can still grow without constraints, keeping the encoded information in full use. When pruned, the 4th filter at conv1 is removed, and the 4th input channel of every filter at conv2 is added to the 3rd channel.
3. Slimming CNNs via Centripetal SGD
In modern CNNs, batch normalization [30] and scaling transformations are commonly used to enhance the representational capacity of convolutional layers. For simplicity and generality, we regard the possible subsequent batch normalization and scaling layer as part of the convolutional layer. Let $i$ be the layer index, $M^{(i)} \in \mathbb{R}^{h_i \times w_i \times c_i}$ be an $h_i \times w_i$ feature map with $c_i$ channels and $M^{(i,j)} = M^{(i)}_{:,:,j}$ be the $j$-th channel. The convolutional layer $i$ with kernel size $u_i \times v_i$ has at most one 4th-order tensor and four vectors as parameters, namely $K^{(i)} \in \mathbb{R}^{u_i \times v_i \times c_{i-1} \times c_i}$ and $\mu^{(i)}, \sigma^{(i)}, \gamma^{(i)}, \beta^{(i)} \in \mathbb{R}^{c_i}$, where $K^{(i)}$ is the convolution kernel, $\mu^{(i)}$ and $\sigma^{(i)}$ are the mean and standard deviation of batch normalization, and $\gamma^{(i)}$ and $\beta^{(i)}$ are the parameters of the scaling transformation. We then use $P^{(i)} = (K^{(i)}, \mu^{(i)}, \sigma^{(i)}, \gamma^{(i)}, \beta^{(i)})$ to denote the parameters of layer $i$. In this paper, the filter $j$ at layer $i$ refers to the five-tuple comprising all the parameter slices related to the $j$-th output channel of layer $i$, formally $F^{(j)} = (K^{(i)}_{:,:,:,j}, \mu^{(i)}_j, \sigma^{(i)}_j, \gamma^{(i)}_j, \beta^{(i)}_j)$. During forward propagation, this layer takes $M^{(i-1)} \in \mathbb{R}^{h_{i-1} \times w_{i-1} \times c_{i-1}}$ as input and outputs $M^{(i)}$. Let $*$ be the 2-D convolution operator; the $j$-th output channel is given by

$$M^{(i,j)} = \frac{\sum_{k=1}^{c_{i-1}} M^{(i-1,k)} * K^{(i)}_{:,:,k,j} - \mu^{(i)}_j}{\sigma^{(i)}_j} \, \gamma^{(i)}_j + \beta^{(i)}_j . \quad (1)$$

The importance-based filter pruning methods [2, 28, 40, 49, 50, 66] define the importance of filters by some means, prune the unimportant part and reconstruct the network using the remaining parameters. Let $\mathcal{I}_i$ be the filter index set of layer $i$ (e.g., $\mathcal{I}_2 = \{1, 2, 3, 4\}$ if the second layer has four filters), $T$ be the filter importance evaluation function and $\theta_i$ be the threshold. The remaining set, i.e., the index set of the filters which survive the pruning, is $R_i = \{ j \in \mathcal{I}_i \mid T(F^{(j)}) > \theta_i \}$. Then we reconstruct the network by assembling the parameters sliced from the original tensor or vectors of layer $i$ into the new parameters. That is,

$$\hat{P}^{(i)} = (K^{(i)}_{:,:,:,R_i}, \mu^{(i)}_{R_i}, \sigma^{(i)}_{R_i}, \gamma^{(i)}_{R_i}, \beta^{(i)}_{R_i}) . \quad (2)$$

The input channels of the next layer corresponding to the pruned filters should also be discarded,

$$\hat{P}^{(i+1)} = (K^{(i+1)}_{:,:,R_i,:}, \mu^{(i+1)}, \sigma^{(i+1)}, \gamma^{(i+1)}, \beta^{(i+1)}) . \quad (3)$$

For each convolutional layer, we first divide the filters into clusters. The number of clusters equals the desired number of filters, as we preserve only one filter for each cluster. We use $\mathcal{C}_i$ and $H$ to denote the set of all filter clusters of layer $i$ and a single cluster in the form of a filter index set, respectively. We generate the clusters evenly or by k-means [21], between which our experiments demonstrate only a minor difference (Table 1).

• K-means clustering. We aim to generate clusters with low intra-cluster distance in the parameter hyperspace, such that collapsing each cluster into a single point impacts the model less, which is natural. To this end, we simply flatten each filter's kernel and use it as the feature vector for k-means clustering.

• Even clustering. We can also generate clusters with no consideration of the filters' inherent properties. Let $c_i$ and $r_i$ be the number of original filters and desired clusters, respectively; then each cluster will have at most $\lceil c_i / r_i \rceil$ filters.

For example, if the second layer has six filters and we wish to slim it to four filters, we will have $\mathcal{C}_2 = \{H_1, H_2, H_3, H_4\}$, where $H_1 = \{1, 2\}$, $H_2 = \{3, 4\}$, $H_3 = \{5\}$, $H_4 = \{6\}$. We use $H(j)$ to denote the cluster containing filter $j$, so in the above example we have $H(3) = H_2$ and $H(6) = H_4$.
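The two clustering schemes can be sketched as follows. This is a minimal illustration under our own conventions (scikit-learn's KMeans and 0-based indices are assumptions; the released implementation may differ). Each filter is represented by its flattened kernel slice $K^{(i)}_{:,:,:,j}$, and each function returns a list of clusters, each being a list of filter indices.

```python
import numpy as np
from sklearn.cluster import KMeans

def even_clusters(num_filters, num_clusters):
    """Split filter indices into num_clusters groups of nearly equal size."""
    return [part.tolist() for part in np.array_split(np.arange(num_filters), num_clusters)]

def kmeans_clusters(kernel, num_clusters):
    """Cluster filters by their flattened kernels; kernel shape: (u, v, c_in, c_out)."""
    c_out = kernel.shape[3]
    feats = kernel.reshape(-1, c_out).T               # one flattened kernel per filter
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(feats)
    return [list(np.where(labels == c)[0]) for c in range(num_clusters)]

# Example: 6 filters slimmed to 4 clusters, as in the text (0-based indices).
print(even_clusters(6, 4))   # [[0, 1], [2, 3], [4], [5]]
```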
Let $F^{(j)}$ be the kernel or a vector parameter of filter $j$. At each training iteration, the update rule of C-SGD is

$$F^{(j)} \leftarrow F^{(j)} + \tau \, \Delta F^{(j)} , \qquad \Delta F^{(j)} = -\frac{\sum_{k \in H(j)} \frac{\partial L}{\partial F^{(k)}}}{|H(j)|} - \eta F^{(j)} + \epsilon \left( \frac{\sum_{k \in H(j)} F^{(k)}}{|H(j)|} - F^{(j)} \right) , \quad (4)$$

where $L$ is the original objective function, $\tau$ is the learning rate, $\eta$ is the model's original weight decay factor, and $\epsilon$ is the only introduced hyper-parameter, which is called the centripetal strength.

Let $\mathcal{L}$ be the layer index set. We use the sum of squared kernel deviation $\chi$ to measure the intra-cluster similarity, i.e., how close the filters are in each cluster,

$$\chi = \sum_{i \in \mathcal{L}} \sum_{j \in \mathcal{I}_i} \left\| K^{(i)}_{:,:,:,j} - \frac{\sum_{k \in H(j)} K^{(i)}_{:,:,:,k}}{|H(j)|} \right\|^2 . \quad (5)$$

It is easy to derive from Eq. 4 that, if floating-point operation errors are ignored, $\chi$ is lowered monotonically and exponentially with a proper learning rate $\tau$.

The intuition behind Eq. 4 is quite simple: for the filters in the same cluster, the increments derived from the objective function are averaged (the first term), the normal weight decay is applied as well (the second term), and the difference in the initial values is gradually eliminated (the last term), so the filters will move towards their center in the hyperspace. In practice, we fix $\eta$ and reduce $\tau$ over time just as we do in normal SGD training, and set $\epsilon$ casually. Intuitively, C-SGD training with a large $\epsilon$ prefers "rapid change" to "stable transition", and vice versa. If $\epsilon$ is too large, e.g., 10, the filters are merged almost instantly, such that the whole process becomes equivalent to training a destroyed model from scratch. If $\epsilon$ is extremely small, the difference between C-SGD training and normal SGD is almost invisible for a long time. However, since the difference among the filters in each cluster is reduced monotonically and exponentially, even an extremely small $\epsilon$ can make the filters close enough, sooner or later. As shown in the Appendix, C-SGD is insensitive to $\epsilon$.
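The per-filter update of Eq. 4 and the deviation measure of Eq. 5 can be sketched as follows (a NumPy illustration under our own shape conventions, with filters stored as rows of a matrix; the default hyper-parameter values are placeholders, not the paper's settings). With the objective gradient set to zero, the printed deviation shrinks at every step, matching the monotone and exponential decrease claimed above.

```python
import numpy as np

def csgd_step(W, grad, clusters, lr=0.1, eta=1e-4, eps=3e-3):
    """One C-SGD step (Eq. 4). W, grad: (num_filters, dim); clusters: list of index lists."""
    W_new = W.copy()
    for H in clusters:
        g_avg = grad[H].mean(axis=0)          # averaged objective-function gradient
        w_avg = W[H].mean(axis=0)             # cluster center
        for j in H:
            delta = -g_avg - eta * W[j] + eps * (w_avg - W[j])
            W_new[j] = W[j] + lr * delta
    return W_new

def kernel_deviation(W, clusters):
    """Sum of squared deviations from the cluster centers (Eq. 5, single layer)."""
    return sum(((W[H] - W[H].mean(axis=0)) ** 2).sum() for H in clusters)

# With a zero objective gradient, the intra-cluster deviation decays geometrically.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 27))              # e.g., six 3x3x3 filters, flattened
clusters = [[0, 1], [2, 3], [4], [5]]
for _ in range(3):
    print(kernel_deviation(W, clusters))
    W = csgd_step(W, np.zeros_like(W), clusters)
```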
A simple analogy to weight decay (i.e., $\ell$-2 regularization) may help to understand Centripetal SGD. Fig. 2a shows a 3-D loss surface, where a certain point $A$ corresponds to a 2-D parameter $\boldsymbol{a} = (a_1, a_2)$. Suppose the steepest descent direction is $\overrightarrow{AQ_1}$; we have $\overrightarrow{AQ_1} = -\frac{\partial L}{\partial \boldsymbol{a}}$, where $L$ is the objective function. Weight decay is commonly applied to reduce overfitting [35], that is, $\overrightarrow{AQ_2} = -\eta \boldsymbol{a}$, where $\eta$ is the model's weight decay factor, e.g., $1 \times 10^{-4}$ for ResNets [23]. The actual gradient descent direction then becomes $\Delta \boldsymbol{a} = \overrightarrow{AQ} = \overrightarrow{AQ_1} + \overrightarrow{AQ_2} = -\frac{\partial L}{\partial \boldsymbol{a}} - \eta \boldsymbol{a}$.

Figure 2: Gradient descent directions on the loss surface for (a) normal weight decay and (b) the centripetal constraint without merging the original gradients.

Formally, with $t$ denoting the number of training iterations, we seek to make points $A$ and $B$ grow increasingly close and eventually the same by satisfying

$$\lim_{t \to \infty} \| \boldsymbol{a}^{(t)} - \boldsymbol{b}^{(t)} \| = 0 . \quad (6)$$

Given that $\boldsymbol{a}^{(t+1)} = \boldsymbol{a}^{(t)} + \tau \Delta \boldsymbol{a}^{(t)}$ and $\boldsymbol{b}^{(t+1)} = \boldsymbol{b}^{(t)} + \tau \Delta \boldsymbol{b}^{(t)}$, where $\tau$ is the learning rate, Eq. 6 implies

$$\lim_{t \to \infty} \| (\boldsymbol{a}^{(t)} - \boldsymbol{b}^{(t)}) + \tau (\Delta \boldsymbol{a}^{(t)} - \Delta \boldsymbol{b}^{(t)}) \| = 0 . \quad (7)$$

We seek to achieve this with $\lim_{t \to \infty} (\Delta \boldsymbol{a}^{(t)} - \Delta \boldsymbol{b}^{(t)}) = \boldsymbol{0}$ as well as $\lim_{t \to \infty} (\boldsymbol{a}^{(t)} - \boldsymbol{b}^{(t)}) = \boldsymbol{0}$. Namely, as the two points grow closer, their gradients should become closer accordingly in order for the training to converge.

If we just wish to make $A$ and $B$ closer to each other than they used to be, a natural idea is to push both $A$ and $B$ towards their midpoint $M = \frac{1}{2}(\boldsymbol{a} + \boldsymbol{b})$, as shown in Fig. 2b. Therefore, the gradient descent direction of point $A$ becomes

$$\Delta \boldsymbol{a} = \overrightarrow{AQ_1} + \overrightarrow{AQ_2} + \overrightarrow{AQ_3} = -\frac{\partial L}{\partial \boldsymbol{a}} - \eta \boldsymbol{a} + \epsilon \left( \frac{\boldsymbol{a} + \boldsymbol{b}}{2} - \boldsymbol{a} \right) , \quad (8)$$

where $\epsilon$ is a hyper-parameter controlling the intensity or speed of pushing $A$ and $B$ close. We have

$$\Delta \boldsymbol{b} = -\frac{\partial L}{\partial \boldsymbol{b}} - \eta \boldsymbol{b} + \epsilon \left( \frac{\boldsymbol{a} + \boldsymbol{b}}{2} - \boldsymbol{b} \right) , \quad (9)$$

$$\Delta \boldsymbol{a} - \Delta \boldsymbol{b} = \left( \frac{\partial L}{\partial \boldsymbol{b}} - \frac{\partial L}{\partial \boldsymbol{a}} \right) + (\eta + \epsilon)(\boldsymbol{b} - \boldsymbol{a}) . \quad (10)$$

Here we see the problem: we cannot ensure $\lim_{t \to \infty} ( \frac{\partial L}{\partial \boldsymbol{b}^{(t)}} - \frac{\partial L}{\partial \boldsymbol{a}^{(t)}} ) = \boldsymbol{0}$. Actually, even $\boldsymbol{a} = \boldsymbol{b}$ does not imply $\frac{\partial L}{\partial \boldsymbol{a}} = \frac{\partial L}{\partial \boldsymbol{b}}$, because they participate in different computation flows. As a consequence, we cannot ensure $\lim_{t \to \infty} (\Delta \boldsymbol{a}^{(t)} - \Delta \boldsymbol{b}^{(t)}) = \boldsymbol{0}$ with Eq. 8 and Eq. 9.

We solve this problem by merging the gradients derived from the original objective function. For simplicity and symmetry, by replacing both $\frac{\partial L}{\partial \boldsymbol{a}}$ in Eq. 8 and $\frac{\partial L}{\partial \boldsymbol{b}}$ in Eq. 9 with $\frac{1}{2}(\frac{\partial L}{\partial \boldsymbol{a}} + \frac{\partial L}{\partial \boldsymbol{b}})$, we have $\Delta \boldsymbol{a} - \Delta \boldsymbol{b} = (\eta + \epsilon)(\boldsymbol{b} - \boldsymbol{a})$. In this way, the supervision information encoded in the objective-function-related gradients is preserved to maintain the model's performance, and Eq. 6 is satisfied, which can be easily verified. Intuitively, we deviate $\boldsymbol{a}$ from the steepest descent direction according to some information of $\boldsymbol{b}$, and deviate $\boldsymbol{b}$ vice versa, just as $\ell$-2 regularization deviates both $\boldsymbol{a}$ and $\boldsymbol{b}$ towards the origin of coordinates.

The efficiency of modern CNN training and deployment platforms, e.g., Tensorflow [1], is based on large-scale tensor operations. We therefore seek to implement C-SGD by efficient matrix multiplication, which introduces minimal computational burden. Concretely, given a convolutional layer $i$, the kernel $K \in \mathbb{R}^{u_i \times v_i \times c_{i-1} \times c_i}$ and the gradient $\frac{\partial L}{\partial K}$, we reshape $K$ to $W \in \mathbb{R}^{u_i v_i c_{i-1} \times c_i}$ and $\frac{\partial L}{\partial K}$ to $\frac{\partial L}{\partial W}$ accordingly. We construct the averaging matrix $\Gamma \in \mathbb{R}^{c_i \times c_i}$ and the decaying matrix $\Lambda \in \mathbb{R}^{c_i \times c_i}$ as in Eq. 12 and Eq. 13, such that Eq. 11 is equivalent to Eq. 4, which can be easily verified. Obviously, when the number of clusters equals that of the filters, Eq. 11 degrades into normal SGD with $\Gamma = \mathrm{diag}(1)$, $\Lambda = \mathrm{diag}(\eta)$. The other trainable parameters (i.e., $\gamma$ and $\beta$) are reshaped into $W \in \mathbb{R}^{1 \times c_i}$ and handled in the same way. In practice, we observe almost no difference in speed between normal SGD and C-SGD using Tensorflow on Nvidia GeForce GTX 1080Ti GPUs with CUDA 9.0 and cuDNN 7.0.

$$W \leftarrow W - \tau \left( \frac{\partial L}{\partial W} \Gamma + W \Lambda \right) . \quad (11)$$

$$\Gamma_{m,n} = \begin{cases} 1 / |H(m)| & \text{if } H(m) = H(n) , \\ 0 & \text{elsewise} . \end{cases} \quad (12)$$

$$\Lambda_{m,n} = \begin{cases} \eta + (1 - 1/|H(m)|)\,\epsilon & \text{if } m = n , \\ -\epsilon / |H(m)| & \text{if } m \neq n \text{ and } H(m) = H(n) , \\ 0 & \text{elsewise} . \end{cases} \quad (13)$$
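The sketch below builds $\Gamma$ and $\Lambda$ for a toy cluster assignment and checks numerically that the matrix update of Eq. 11 coincides with the per-filter rule of Eq. 4. It is our own NumPy rendering, using the $\Lambda$ reconstructed above and placeholder hyper-parameter values, not the authors' TensorFlow code.

```python
import numpy as np

def build_gamma_lambda(clusters, num_filters, eta, eps):
    """Averaging matrix Gamma (Eq. 12) and decaying matrix Lambda (Eq. 13)."""
    gamma = np.zeros((num_filters, num_filters))
    lam = np.zeros((num_filters, num_filters))
    for H in clusters:
        size = len(H)
        for m in H:
            for n in H:
                gamma[m, n] = 1.0 / size
                lam[m, n] = eta + (1 - 1 / size) * eps if m == n else -eps / size
    return gamma, lam

# Toy layer: kernel reshaped to W with one column per filter.
rng = np.random.default_rng(0)
num_filters, dim, tau, eta, eps = 6, 27, 0.1, 1e-4, 3e-3
clusters = [[0, 1], [2, 3], [4], [5]]
W = rng.standard_normal((dim, num_filters))
G = rng.standard_normal((dim, num_filters))         # stands for dL/dW

gamma, lam = build_gamma_lambda(clusters, num_filters, eta, eps)
W_matrix = W - tau * (G @ gamma + W @ lam)           # Eq. 11

# Per-filter rule of Eq. 4 for comparison.
W_loop = W.copy()
for H in clusters:
    g_avg, w_avg = G[:, H].mean(axis=1), W[:, H].mean(axis=1)
    for j in H:
        W_loop[:, j] = W[:, j] + tau * (-g_avg - eta * W[:, j] + eps * (w_avg - W[:, j]))

print(np.allclose(W_matrix, W_loop))                 # True
```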
After C-SGD training, since the filters in each cluster have become identical, as will be shown in Sect. 4.3, it makes no difference which one we pick. We simply pick the first filter (i.e., the filter with the smallest index) in each cluster to form the remaining set for each layer, which is $R_i = \{ \min(H) \mid \forall H \in \mathcal{C}_i \}$. For the next layer, we add the to-be-deleted input channels to the corresponding remaining one,

$$K^{(i+1)}_{:,:,k,:} \leftarrow \sum K^{(i+1)}_{:,:,H(k),:} \quad \forall k \in R_i ,$$

then we delete the redundant filters as well as the input channels of the next layer following Eq. 2 and Eq. 3. Due to the linear and combinational properties of convolution (Eq. 1), no damage is caused, hence no finetuning is needed.
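Putting the pieces together, the sketch below trims one pair of adjacent convolution kernels given the cluster assignment of the earlier layer: it keeps the lowest-index filter of every cluster, accumulates the next layer's input channels of each cluster onto that survivor, and then slices both kernels (Eq. 2, Eq. 3). The shapes, names and the omission of the batch-normalization and scaling vectors are our own simplifications.

```python
import numpy as np

def trim_pair(k_cur, k_next, clusters):
    """k_cur: (u, v, c_in, c_out) of layer i; k_next: (u, v, c_out, c_next) of layer i+1."""
    survivors = sorted(min(H) for H in clusters)      # R_i: first filter of each cluster
    k_next = k_next.copy()
    for H in clusters:
        keep = min(H)
        for k in H:
            if k != keep:                             # add the doomed input channel onto the survivor
                k_next[:, :, keep, :] += k_next[:, :, k, :]
    return k_cur[:, :, :, survivors], k_next[:, :, survivors, :]

# Toy usage with identical filters inside each cluster (as after C-SGD training).
rng = np.random.default_rng(0)
k1 = rng.standard_normal((3, 3, 2, 6))
clusters = [[0, 1], [2, 3], [4], [5]]
for H in clusters:                                    # make the filters in each cluster identical
    k1[:, :, :, H] = k1[:, :, :, [min(H)] * len(H)]
k2 = rng.standard_normal((3, 3, 6, 8))
k1_slim, k2_slim = trim_pair(k1, k2, clusters)
print(k1_slim.shape, k2_slim.shape)                   # (3, 3, 2, 4) (3, 3, 4, 8)
```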
Recently, accompanied by the advancement of CNN design philosophy, several efficient and compact CNN architectures [23, 29] have emerged and become favored in real-world applications. Although some excellent works [28, 32, 49, 66, 69] have shown that the classical plain CNNs, e.g., AlexNet [34] and VGG [55], are highly redundant and can be pruned significantly, the pruned versions are usually still inferior to the more up-to-date and complicated CNNs in terms of both accuracy and efficiency.

We consider filter pruning for very deep and complicated CNNs challenging for three reasons. Firstly, these networks are designed with computational efficiency in mind, which makes them inherently compact and efficient. Secondly, these networks are significantly deeper than the classical ones, thus the layer-by-layer pruning techniques become inefficient, and the errors can increase dramatically when propagated through multiple layers, making the estimation of filter importance less accurate [66]. Lastly and most importantly, some innovative structures are heavily used in these networks, e.g., cross-layer connections [23] and dense connections [29], raising an open problem of constrained filter pruning.

I.e., in each stage of ResNets, every residual block is expected to add the learned residuals to the stem feature maps produced by the first or the projection layer (referred to as the pacesetter), thus the last layer of every residual block (referred to as a follower) must be pruned in the same pattern as the pacesetter, i.e., the remaining set $R$ of all the followers and the pacesetter must be identical, or the network will be damaged so badly that finetuning cannot restore its accuracy. For example, Li et al. [40] once tried violently pruning ResNets but resulted in low accuracy. In some successful explorations, Li et al. [40] sidestep this problem by only pruning the internal layers of ResNet-56, i.e., the first layers in each residual block. Liu et al. [44] and He et al. [26] skip pruning these troublesome layers and insert an extra sampler layer before the first layer in each residual block at inference time to reduce the input channels. Though these methods are able to prune the networks to some extent, from a holistic perspective the networks are not literally "slimmed" but actually "clipped", as shown in Fig. 3.

Figure 3: (a) Original, (b) clipped, (c) sampled, (d) slimmed. Compared to the prior works which only clip the internal layers [40] or insert sampler layers [26, 44] on ResNets, C-SGD literally "slims" the network.

We have partly solved this open problem with C-SGD, where the key is to force different layers to learn the same redundancy pattern. For example, if layers $p$ and $q$ have to be pruned in the same pattern, we only generate clusters for layer $p$ by some means and assign the resulting cluster set to layer $q$, namely $\mathcal{C}_q \leftarrow \mathcal{C}_p$. Then during C-SGD training, the same redundancy patterns among filters are produced in both layer $p$ and layer $q$. I.e., if the $j$-th and $k$-th filters at layer $p$ become identical, we ensure the sameness of the $j$-th and $k$-th filters at layer $q$ as well, thus the troublesome layers can be pruned along with the others. Some sketches are presented in the Appendix for more intuition.
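A minimal sketch of this cluster sharing follows (our own illustration; the layer names, the dictionary layout and the use of even clustering for the pacesetter are assumptions): the pacesetter's cluster assignment is generated once and reused verbatim for every follower in the same stage, so that all constrained layers end up with identical remaining sets.

```python
import numpy as np

def assign_clusters(layer_kernels, constraint_groups, num_clusters):
    """layer_kernels: dict name -> kernel (u, v, c_in, c_out).
    constraint_groups: lists of layer names that must share one pruning pattern;
    the first name in each group acts as the pacesetter."""
    clusters = {}
    for group in constraint_groups:
        pacesetter = group[0]
        c_out = layer_kernels[pacesetter].shape[3]
        shared = [part.tolist() for part in np.array_split(np.arange(c_out), num_clusters)]
        for name in group:                    # C_q <- C_p for every follower q
            clusters[name] = shared
    return clusters

# Toy ResNet-like stage: the projection layer is the pacesetter, the last layer of
# each residual block is a follower with the same number of output channels.
rng = np.random.default_rng(0)
kernels = {name: rng.standard_normal((3, 3, 16, 16))
           for name in ["stage2_proj", "stage2_block1_conv2", "stage2_block2_conv2"]}
groups = [["stage2_proj", "stage2_block1_conv2", "stage2_block2_conv2"]]
clusters = assign_clusters(kernels, groups, num_clusters=10)
print(clusters["stage2_proj"] == clusters["stage2_block2_conv2"])   # True: identical pattern
```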
4. Experiments
We experiment on CIFAR-10 [33] and ImageNet-1K [12] to evaluate our method. For each trial we start from a well-trained base model and apply C-SGD training on all the target layers simultaneously. The comparisons between C-SGD and other filter pruning methods are presented in Table 1 and Table 2 in terms of both absolute and relative error increase, which are commonly adopted as metrics to fairly compare the change of accuracy across different base models.
E.g., the Top-1 accuracy of our ResNet-50 base model and C-SGD-70 is 75.33% and 75.27%, respectively, thus the absolute error increase is $75.33\% - 75.27\% = 0.06\%$ and the relative error increase is $0.06 / (100 - 75.33) = 0.24\%$.

CIFAR-10.
The base models are trained from scratch for 600 epochs to ensure convergence, which is much longer than the usually adopted benchmarks (160 [23] or 300 [29] epochs), such that the improved accuracy of the pruned model cannot simply be attributed to the extra training epochs on a base model which has not fully converged. We use the data augmentation techniques adopted by [23], i.e., padding to 40 × 40, random cropping and flipping. The hyper-parameter $\epsilon$ is set casually. We perform C-SGD training with batch size 64, and the learning rate is decayed by 0.1 whenever the loss stops decreasing. For each network we perform two experiments independently, where the only difference is the way we generate the filter clusters, namely even division or k-means clustering. We seek to reduce the FLOPs of every model by around 60%, so we prune 3/8 of every convolutional layer of ResNets, thus the parameters and FLOPs are reduced by around $1 - (5/8)^2 = 61\%$. Aggressive as this is, no obvious accuracy drop is observed. For DenseNet-40, the pruned model has 5, 8 and 10 filters at every incremental convolutional layer in the three stages, respectively, so that the FLOPs are reduced by 60.05%, and a significantly increased accuracy is observed, which is consistent with but better than that of [44].

ImageNet.
We perform experiments using ResNet-50 [23] on ImageNet to validate the effectiveness of C-SGD in real-world applications. We apply k-means clustering on the filter kernels to generate the clusters, then use the ILSVRC2015 training set, which contains 1.28M high-quality images, for training. We adopt the standard data augmentation techniques including b-box distortion and color shift. At test time, we use a single central crop. For C-SGD-7/10, C-SGD-6/10 and C-SGD-5/10, all the first and second layers in each residual block are shrunk to 70%, 60% and 50% of the original width, respectively.
Discussions.
Our pruned networks exhibit fewer FLOPs, simpler structures and higher or comparable accuracy. Note that we apply the same pruning ratio globally for ResNets, and better results could be achieved if more layer-sensitivity analysis experiments [26, 40, 66] were conducted and the resulting network structures tuned accordingly. Interestingly, even arbitrarily generated clusters can produce reasonable results (Table 1).

vs. Normal Training. The comparisons between C-SGD and other pruning-and-finetuning methods [26, 40, 47, 66] indicate that it may be better to train a redundant network and equivalently transform it into a narrower one than to finetune it after pruning. This observation is consistent with [14] and [27], where the authors believe that the redundancy in neural networks is necessary to overcome a highly non-convex optimization. We verify this assumption by training a narrow CNN with normal SGD and comparing it with another model trained using C-SGD with the equivalent width, which means that some redundant filters are produced during training and trimmed afterwards, resulting in the same network structure as the normally trained model. For example, if a network has 2× the number of filters of the normal counterpart but every two filters are identical, they will end up with the same structure. If the redundant one outperforms the normal one, we can conclude that C-SGD does yield more powerful networks by exploiting the redundant filters.

On DenseNet-40, we evenly divide the 12 filters at each incremental layer into 3 clusters, use C-SGD to train the network from scratch, then trim it to obtain a DenseNet-40 with 3 filters per incremental layer. I.e., during training, every 4 filters grow centripetally. As a contrast, we train a DenseNet-40 with originally 3 filters per layer by normal SGD. Another group of experiments where each layer ends up with 6 filters is carried out similarly. After that, experiments on VGG [55] are also carried out, where we slim each layer to 1/4 and 1/2 of the original width, respectively.

Table 1: Pruning results on CIFAR-10. For C-SGD, the left pruned Top-1 accuracy is achieved by even clustering and the right by k-means; the error increase is computed with the k-means result.
| Model | Result | Base Top-1 | Pruned Top-1 (even / k-means) | Top-1 error Abs/Rel ↑% | FLOPs ↓% | Architecture |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-56 | Li et al. [40] | 93.04 | 93.06 | -0.02 / -0.28 | 27.60 | only internals pruned |
| ResNet-56 | NISP-56 [66] | - | - | 0.03 / - | 43.61 | - |
| ResNet-56 | Channel Pruning [26] | 92.8 | 91.8 | 1.0 / 13.88 | 50 | sampler layer |
| ResNet-56 | ADC [24] | 92.8 | 91.9 | 0.9 / 12.5 | 50 | sampler layer |
| ResNet-56 | C-SGD-5/8 | 93.39 | 93.44 / 93.62 | -0.23 / -3.47 | 60.85 | 10-20-40 |
| ResNet-110 | Li et al. [40] | 93.53 | 93.30 | 0.23 / 3.55 | 38.60 | only internals pruned |
| ResNet-110 | NISP-110 [66] | - | - | 0.18 / - | 43.78 | - |
| ResNet-110 | C-SGD-5/8 | 94.38 | 94.54 / 94.41 | -0.03 / -0.53 | 60.89 | 10-20-40 |
| ResNet-164 | Network Slimming [44] | 94.58 | 94.73 | -0.15 / -2.76 | 44.90 | sampler layer |
| ResNet-164 | C-SGD-5/8 | 94.83 | 94.80 / 94.81 | 0.02 / 0.38 | 60.91 | 10-20-40 |
| DenseNet-40 | Network Slimming [44] | 93.89 | 94.35 | -0.46 / -7.52 | 55.00 | - |
| DenseNet-40 | C-SGD-5-8-10 | 93.81 | 94.37 / 94.56 | -0.75 / -12.11 | 60.05 | 5-8-10 |
Table 2: Pruning ResNet-50 on ImageNet using k-means clustering.

| Result | Base Top-1 | Base Top-5 | Pruned Top-1 | Pruned Top-5 | Top-1 error Abs/Rel ↑% | Top-5 error Abs/Rel ↑% | FLOPs ↓% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| C-SGD-70 | 75.33 | 92.56 | 75.27 | 92.46 | 0.06 / 0.24 | 0.10 / 1.34 | 36.75 |
| ThiNet-70 [47] | 72.88 | 91.14 | 72.04 | 90.67 | 0.84 / 3.09 | 0.47 / 5.30 | 36.75 |
| SFP [25] | 76.15 | 92.87 | 74.61 | 92.06 | 1.54 / 6.45 | 0.81 / 11.36 | 41.8 |
| NISP [66] | - | - | - | - | 0.89 / - | - / - | 43.82 |
| C-SGD-60 | 75.33 | 92.56 | 74.93 | 92.27 | 0.40 / 1.62 | 0.29 / 3.89 | 46.24 |
| CFP [57] | 75.3 | 92.2 | 73.4 | 91.4 | 1.9 / 7.69 | 0.8 / 10.25 | 49.6 |
| Channel Pruning [26] | - | 92.2 | - | 90.8 | - / - | 1.4 / 17.94 | 50 |
| Autopruner [46] | 76.15 | 92.87 | 74.76 | 92.15 | 1.39 / 5.82 | 0.72 / 10.09 | 51.21 |
| GDP [42] | 75.13 | 92.30 | 71.89 | 90.71 | 3.24 / 13.02 | 1.59 / 20.64 | 51.30 |
| SSR-L2 [41] | 75.12 | 92.30 | 71.47 | 90.19 | 3.65 / 14.67 | 2.11 / 27.40 | 55.76 |
| DCP [70] | 76.01 | 92.93 | 74.95 | 92.32 | 1.06 / 4.41 | 0.61 / 8.62 | 55.76 |
| ThiNet-50 [47] | 72.88 | 91.14 | 71.01 | 90.02 | 1.87 / 6.89 | 1.12 / 12.64 | 55.76 |
| C-SGD-50 | 75.33 | 92.56 | 74.54 | 92.09 | 0.79 / 3.20 | 0.47 / 6.31 | 55.76 |

It can be concluded from Table 3 that the redundant filters do help, compared to a normally trained counterpart with the equivalent width. This observation supports our intuition that the centripetally growing filters can maintain the model's representational capacity to some extent: though these filters are constrained, their corresponding input channels are still in full use and can grow without constraints (Fig. 1).

vs. Zeroing Out. As making filters identical and zeroing filters out [3, 15, 43, 60, 63] are two means of producing redundancy patterns for filter pruning, we perform controlled experiments on ResNet-56 to investigate the difference.

Table 3: Validation accuracy of scratch-trained DenseNet-40 and VGG using C-SGD or normal SGD on CIFAR-10.
| Model | Normal SGD | C-SGD |
| --- | --- | --- |
| DenseNet-3 | 88.60 | |
| DenseNet-6 | 89.96 | |
| VGG-1/4 | 90.16 | |
| VGG-1/2 | 92.49 | |

For fair comparison, we aim to produce the same number of redundant filters in both the model trained with C-SGD and the one with group-Lasso Regularization [53]. For C-SGD, the number of clusters in each layer is 5/8 of the number of filters.
For Lasso, 3/8 of the original filters in the pacesetters and internal layers are regularized by group-Lasso, and the followers are handled in the same pattern. We use the aforementioned sum of squared kernel deviation $\chi$ and the sum of squared kernel residuals $\phi$, defined as follows, to measure the redundancy. Let $\mathcal{L}$ be the layer index set and $P_i$ be the to-be-pruned filter set of layer $i$, i.e., the set of the 3/8 filters with group-Lasso regularization,

$$\phi = \sum_{i \in \mathcal{L}} \sum_{j \in P_i} \| K^{(i)}_{:,:,:,j} \|^2 .$$

We present in Fig. 4 the curves of $\chi$ and $\phi$ as well as the validation accuracy both before and after pruning. The learning rate $\tau$ is decayed by 0.1 at epochs 100 and 200, respectively. It can be observed that group-Lasso cannot literally zero out filters, but can decrease their magnitude to some extent, as $\phi$ plateaus when the gradients derived from the regularization term become close to those derived from the original objective function. We empirically find that even when $\phi$ becomes several orders of magnitude smaller than its initial value, pruning still causes obvious damage (around a 10% accuracy drop). When the learning rate is decayed and $\phi$ is reduced at epoch 200, we observe no improvement in the pruned accuracy, therefore no more experiments with smaller learning rates or stronger group-Lasso regularization are conducted. We reckon this is due to the error propagation and amplification in very deep CNNs [66]. By C-SGD, $\chi$ is reduced monotonically and perfectly exponentially, which leads to faster convergence. I.e., the filters in each cluster can become infinitely close to each other at a constant rate with a constant learning rate. For C-SGD, pruning causes absolutely no performance loss after around 90 epochs. Training with group-Lasso is also slower than C-SGD, as it requires costly square root operations.

Figure 4: Training process with C-SGD or group-Lasso on ResNet-56: (a) values of $\chi$ or $\phi$ (note the logarithmic scale); (b) validation accuracy before and after pruning.

vs. Other Filter Pruning Methods. We compare C-SGD with other methods by controlled experiments on DenseNet-40 [29]. We slim every incremental layer of a well-trained DenseNet-40 to 3 and 6 filters, respectively. The experiments are repeated 3 times,
and all the results are presented in Fig. 5. The training setting is kept the same for every model: the learning rate $\tau$ is decayed over four stages of 200, 200, 100 and 100 epochs, respectively, to ensure the convergence of every model. For our method, the models are trained with C-SGD and trimmed. For Magnitude- [40], APoZ- [28] and Taylor-expansion-based [49] pruning, the models are pruned by the respective criteria and finetuned. The models labeled as Lasso are trained with group-Lasso Regularization for 600 epochs in advance, pruned, and then finetuned for another 600 epochs with the same learning rate schedule, so that the comparison is actually biased towards the Lasso method. The models are tested on the validation set every 10,000 iterations (12.8 epochs). The results reveal the superiority of C-SGD in terms of higher accuracy and also better stability. Though group-Lasso Regularization can indeed reduce the performance drop caused by pruning, it is outperformed by C-SGD by a large margin. It is interesting that the violently pruned networks are unstable and easily trapped in local minima, e.g., the accuracy curves increase steeply in the beginning but slightly decline afterwards. This observation is consistent with that of Liu et al. [45].

Figure 5: Controlled pruning experiments on DenseNet-40: (a) three filters per layer; (b) six filters per layer.
5. Conclusion
We have proposed to produce identical filters in CNNs for network slimming. The intuition is that making filters identical not only eliminates the need for finetuning but also preserves more of the representational capacity of the network, compared to the zeroing-out fashion (Fig. 1). We have partly solved an open problem of constrained filter pruning on very deep and complicated CNNs and achieved state-of-the-art results on several common benchmarks. By training networks with redundant filters using C-SGD, we have demonstrated empirical evidence for the assumption that redundancy can help the convergence of neural network training, which may encourage future studies. Apart from pruning, we consider C-SGD promising as a means of regularization or as a training technique.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] R. Abbasi-Asl and B. Yu. Structural compression of convolutional neural networks based on greedy filter pruning. arXiv preprint arXiv:1705.07356, 2017.
[3] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In
Advances in Neural Informa-tion Processing Systems , pages 2270–2278, 2016.[4] J. M. Alvarez and M. Salzmann. Compression-aware train-ing of deep networks. In
Advances in Neural InformationProcessing Systems , pages 856–867, 2017.[5] S. Anwar, K. Hwang, and W. Sung. Structured prun-ing of deep convolutional neural networks.
ACM Journalon Emerging Technologies in Computing Systems (JETC) ,13(3):32, 2017.[6] J. Ba and R. Caruana. Do deep nets really need to be deep?In
Advances in neural information processing systems , pages2654–2662, 2014.[7] G. Castellano, A. M. Fanelli, and M. Pelillo. An iterativepruning algorithm for feedforward neural networks.
IEEEtransactions on Neural networks , 8(3):519–531, 1997.[8] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary,and S.-F. Chang. An exploration of parameter redundancyin deep networks with circulant projections. In
Proceedingsof the IEEE International Conference on Computer Vision ,pages 2857–2865, 2015.[9] M. D. Collins and P. Kohli. Memory bounded deep convolu-tional networks. arXiv preprint arXiv:1412.1442 , 2014.[10] R. Collobert and J. Weston. A unified architecture for naturallanguage processing: Deep neural networks with multitasklearning. In
Proceedings of the 25th international conferenceon Machine learning , pages 160–167. ACM, 2008.[11] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, andY. Bengio. Binarized neural networks: Training deep neu-ral networks with weights and activations constrained to+ 1or-1. arXiv preprint arXiv:1602.02830 , 2016.[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database.In
Computer Vision and Pattern Recognition, 2009. CVPR2009. IEEE Conference on , pages 248–255. IEEE, 2009.[13] M. Denil, B. Shakibi, L. Dinh, N. De Freitas, et al. Pre-dicting parameters in deep learning. In
Advances in neuralinformation processing systems , pages 2148–2156, 2013.[14] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fer-gus. Exploiting linear structure within convolutional net-works for efficient evaluation. In
Advances in neural infor-mation processing systems , pages 1269–1277, 2014.[15] X. Ding, G. Ding, J. Han, and S. Tang. Auto-balanced fil-ter pruning for efficient convolutional neural networks. In
Thirty-Second AAAI Conference on Artificial Intelligence ,2018.[16] M. Figurnov, A. Ibraimova, D. P. Vetrov, and P. Kohli. Per-foratedcnns: Acceleration through elimination of redundant convolutions. In
Advances in Neural Information ProcessingSystems , pages 947–955, 2016.[17] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery forefficient dnns. In
Advances In Neural Information Process-ing Systems , pages 1379–1387, 2016.[18] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan.Deep learning with limited numerical precision. In
Interna-tional Conference on Machine Learning , pages 1737–1746,2015.[19] S. Han, H. Mao, and W. J. Dally. Deep compres-sion: Compressing deep neural networks with pruning,trained quantization and huffman coding. arXiv preprintarXiv:1510.00149 , 2015.[20] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weightsand connections for efficient neural network. In
Advances inNeural Information Processing Systems , pages 1135–1143,2015.[21] J. A. Hartigan and M. A. Wong. Algorithm as 136: A k-means clustering algorithm.
Journal of the Royal StatisticalSociety. Series C (Applied Statistics) , 28(1):100–108, 1979.[22] B. Hassibi and D. G. Stork. Second order derivatives for net-work pruning: Optimal brain surgeon. In
Advances in neuralinformation processing systems , pages 164–171, 1993.[23] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn-ing for image recognition. In
Proceedings of the IEEE con-ference on computer vision and pattern recognition , pages770–778, 2016.[24] Y. He and S. Han. Adc: Automated deep compression andacceleration with reinforcement learning. arXiv preprintarXiv:1802.03494 , 2018.[25] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang. Soft filterpruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866 , 2018.[26] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerat-ing very deep neural networks. In
International Conferenceon Computer Vision (ICCV) , volume 2, page 6, 2017.[27] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledgein a neural network. arXiv preprint arXiv:1503.02531 , 2015.[28] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trim-ming: A data-driven neuron pruning approach towards effi-cient deep architectures. arXiv preprint arXiv:1607.03250 ,2016.[29] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten.Densely connected convolutional networks. In
Proceed-ings of the IEEE conference on computer vision and patternrecognition , volume 1, page 3, 2017.[30] S. Ioffe and C. Szegedy. Batch normalization: Acceleratingdeep network training by reducing internal covariate shift. In
International Conference on Machine Learning , pages 448–456, 2015.[31] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding upconvolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 , 2014.[32] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin.Compression of deep convolutional neural networks forfast and low power mobile applications. arXiv preprintarXiv:1511.06530 , 2015.33] A. Krizhevsky and G. Hinton. Learning multiple layers offeatures from tiny images. 2009.[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetclassification with deep convolutional neural networks. In
Advances in neural information processing systems , pages1097–1105, 2012.[35] A. Krogh and J. A. Hertz. A simple weight decay can im-prove generalization. In
Advances in neural information pro-cessing systems , pages 950–957, 1992.[36] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back.Face recognition: A convolutional neural-network approach.
IEEE transactions on neural networks , 8(1):98–113, 1997.[37] Y. LeCun, Y. Bengio, et al. Convolutional networks for im-ages, speech, and time series.
The handbook of brain theoryand neural networks , 3361(10):1995, 1995.[38] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E.Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digitrecognition with a back-propagation network. In
Advancesin neural information processing systems , pages 396–404,1990.[39] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain dam-age. In
Advances in neural information processing systems ,pages 598–605, 1990.[40] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P.Graf. Pruning filters for efficient convnets. arXiv preprintarXiv:1608.08710 , 2016.[41] S. Lin, R. Ji, Y. Li, C. Deng, and X. Li. Towards com-pact convnets via structure-sparsity regularized filter prun-ing. arXiv preprint arXiv:1901.07827 , 2019.[42] S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang. Accel-erating convolutional networks via global & dynamic filterpruning. In
IJCAI , pages 2425–2432, 2018.[43] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky.Sparse convolutional neural networks. In
Proceedings of theIEEE Conference on Computer Vision and Pattern Recogni-tion , pages 806–814, 2015.[44] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang.Learning efficient convolutional networks through networkslimming. In , pages 2755–2763. IEEE, 2017.[45] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Re-thinking the value of network pruning. arXiv preprintarXiv:1810.05270 , 2018.[46] J.-H. Luo and J. Wu. Autopruner: An end-to-end train-able filter pruning method for efficient deep model inference. arXiv preprint arXiv:1805.08941 , 2018.[47] J.-H. Luo, J. Wu, and W. Lin. Thinet: A filter level prun-ing method for deep neural network compression. In
Pro-ceedings of the IEEE international conference on computervision , pages 5058–5066, 2017.[48] M. Mathieu, M. Henaff, and Y. LeCun. Fast trainingof convolutional networks through ffts. arXiv preprintarXiv:1312.5851 , 2013.[49] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz.Pruning convolutional neural networks for resource efficientinference. 2016. [50] A. Polyak and L. Wolf. Channel-level acceleration of deepface representations.
IEEE Access , 3:2163–2175, 2015.[51] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neu-ral networks. In
European Conference on Computer Vision ,pages 525–542. Springer, 2016.[52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta,and Y. Bengio. Fitnets: Hints for thin deep nets. arXivpreprint arXiv:1412.6550 , 2014.[53] V. Roth and B. Fischer. The group-lasso for generalized lin-ear models: uniqueness of solutions and efficient algorithms.In
Proceedings of the 25th international conference on Ma-chine learning , pages 848–855. ACM, 2008.[54] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, andB. Ramabhadran. Low-rank matrix factorization for deepneural network training with high-dimensional output tar-gets. In
Acoustics, Speech and Signal Processing (ICASSP),2013 IEEE International Conference on , pages 6655–6659.IEEE, 2013.[55] K. Simonyan and A. Zisserman. Very deep convolutionalnetworks for large-scale image recognition. arXiv preprintarXiv:1409.1556 , 2014.[56] V. Sindhwani, T. Sainath, and S. Kumar. Structured trans-forms for small-footprint deep learning. In
Advances inNeural Information Processing Systems , pages 3088–3096,2015.[57] P. Singh, V. K. Verma, P. Rai, and V. P. Namboodiri. Lever-aging filter correlations for deep model compression. arXivpreprint arXiv:1811.10559 , 2018.[58] S. W. Stepniewski and A. J. Keane. Pruning backpropaga-tion neural networks using modern stochastic optimisationtechniques.
Neural Computing & Applications , 5(2):76–98,1997.[59] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Pi-antino, and Y. LeCun. Fast convolutional nets withfbfft: A gpu performance evaluation. arXiv preprintarXiv:1412.7580 , 2014.[60] H. Wang, Q. Zhang, Y. Wang, and H. Hu. Structured pruningfor efficient convnets via incremental regularization. arXivpreprint arXiv:1811.08390 , 2018.[61] Y. Wang, C. Xu, C. Xu, and D. Tao. Beyond filters: Com-pact feature map for portable deep model. In
InternationalConference on Machine Learning , pages 3703–3711, 2017.[62] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu. Cnnpack: Pack-ing convolutional neural networks in the frequency domain.In
Advances in neural information processing systems , pages253–261, 2016.[63] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learningstructured sparsity in deep neural networks. In
Advances inNeural Information Processing Systems , pages 2074–2082,2016.[64] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantizedconvolutional neural networks for mobile devices. In
Pro-ceedings of the IEEE Conference on Computer Vision andPattern Recognition , pages 4820–4828, 2016.[65] J. Xue, J. Li, and Y. Gong. Restructuring of deep neuralnetwork acoustic models with singular value decomposition.In
Interspeech , pages 2365–2369, 2013.66] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han,M. Gao, C.-Y. Lin, and L. S. Davis. Nisp: Pruning net-works using neuron importance score propagation. In
Pro-ceedings of the IEEE Conference on Computer Vision andPattern Recognition , pages 9194–9203, 2018.[67] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, andY. Wang. A systematic dnn weight pruning framework usingalternating direction method of multipliers. In
Proceedingsof the European Conference on Computer Vision (ECCV) ,pages 184–199, 2018.[68] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating verydeep convolutional networks for classification and detection.
IEEE transactions on pattern analysis and machine intelli-gence , 38(10):1943–1955, 2016.[69] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towardscompact cnns. In
European Conference on Computer Vision ,pages 662–677. Springer, 2016.[70] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu,J. Huang, and J. Zhu. Discrimination-aware channel pruningfor deep neural networks. In