Centripetal SGD for Pruning Very Deep Convolutional Networks with Complicated Structure
Xiaohan Ding, Guiguang Ding, Yuchen Guo (Tsinghua University), Jungong Han (Lancaster University)
[email protected], [email protected], {yuchen.w.guo, jungonghan77}@gmail.com

Abstract
Redundancy is widely recognized in Convolutional Neural Networks (CNNs), which makes it possible to remove unimportant filters from convolutional layers so as to slim the network with an acceptable performance drop. Inspired by the linear and combinational properties of convolution, we seek to make some filters increasingly close and eventually identical for network slimming. To this end, we propose Centripetal SGD (C-SGD), a novel optimization method, which can train several filters to collapse into a single point in the parameter hyperspace. When the training is completed, the removal of the identical filters can trim the network with NO performance loss, thus no finetuning is needed. By doing so, we have partly solved an open problem of constrained filter pruning on CNNs with complicated structure, where some layers must be pruned following others. Our experimental results on CIFAR-10 and ImageNet have justified the effectiveness of C-SGD-based filter pruning. Moreover, we have provided empirical evidence for the assumption that the redundancy in deep neural networks helps the convergence of training, by showing that a redundant CNN trained using C-SGD outperforms a normally trained counterpart with the equivalent width.
1. Introduction
The Convolutional Neural Network (CNN) has become an important tool for machine learning and many related fields [10, 36, 37, 38]. However, due to their computationally intensive nature, as CNNs grow wider and deeper, their memory footprint, power consumption and required floating-point operations (FLOPs) have increased dramatically, making them difficult to deploy on platforms without rich computational resources, such as embedded systems. In this context, CNN compression and acceleration methods have been intensively studied, including tensor low-rank expansion [31], connection pruning [20], filter pruning [40], quantization [19], knowledge distillation [27], fast convolution [48], feature map compacting [61], etc.

* This work is supported by the National Key R&D Program of China (No. 2018YFC0807500), National Natural Science Foundation of China (No. 61571269), National Postdoctoral Program for Innovative Talents (No. BX20180172), and the China Postdoctoral Science Foundation (No. 2018M640131). Corresponding author: Guiguang Ding. Here "centripetal" means "several objects moving towards a center", not "an object rotating around a center by the centripetal force".

We focus on filter pruning, a.k.a. channel pruning [26] or network slimming [44], for three reasons. Firstly, filter pruning is a universal technique able to handle any kind of CNN, making no assumptions on the application field, the network architecture or the deployment platform. Secondly, filter pruning effectively reduces the FLOPs of the network, which serve as the main criterion of computational burden. Lastly, as an important advantage in practice, filter pruning produces a thinner network with no customized structure or extra operation, which is orthogonal to the other model compression and acceleration techniques.

Motivated by this universality and significance, considerable efforts have been devoted to filter pruning techniques. Due to the widely observed redundancy in CNNs [8, 9, 13, 19, 66, 69], numerous excellent works have shown that, if a CNN is pruned appropriately with acceptable structural damage, a follow-up finetuning procedure can restore the performance to a certain degree. Some prior works [2, 5, 28, 40, 49, 50, 66] sort the filters by their importance, directly remove the unimportant ones and re-construct the network with the remaining filters. As the important filters are preserved, a comparable level of performance can be reached by finetuning. However, some recent powerful networks have complicated structures, such as identity mappings [23] and dense connections [29], where some layers must be pruned in the same pattern as others, raising an open problem of constrained filter pruning. This further challenges such pruning techniques, as one cannot assume that the important filters at different layers reside at the same positions.

Obviously, the model is more likely to recover if the destructive impact of pruning is reduced. Taking this into consideration, another family of methods [3, 15, 43, 60, 63] seeks to zero out some filters in advance, where group-Lasso Regularization [53] is frequently used. Essentially, zeroing filters out can be regarded as producing a desired redundancy pattern in CNNs. After reducing the magnitude of the parameters of some whole filters, pruning these filters causes a smaller accuracy drop, hence it becomes easier to restore the performance by finetuning.

In this paper, we also aim to produce redundancy patterns in CNNs for filter pruning.
However, instead of zeroing out filters, which ends up with a pattern where some whole filters are close to zero, we intend to merge multiple filters into one, leading to a redundancy pattern where some filters are identical. The intuition motivating the proposed method is an observation of the information flow in CNNs (Fig. 1). If two or more filters are trained to become identical, due to the linear and combinational properties of convolution, we can simply discard all but one of them and add up the parameters along the corresponding input channels of the next layer. Doing so causes ZERO performance loss, and there is no need for a time-consuming finetuning process. Note that such a finetuning process is essential for the zeroing-out methods [3, 43, 63], as the discarded filters are merely small in magnitude but still encode a certain amount of information; removing such filters therefore unavoidably degrades the performance of the network. When multiple filters are constrained to grow closer in the parameter hyperspace, which we refer to as the centripetal constraint, though they start to produce increasingly similar information, the information conveyed through the corresponding input channels of the next layer is still in full use, thus the model's representational capacity is stronger than that of a counterpart with the filters being zeroed out.
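As a sanity check of this merging equivalence, the following minimal sketch (our own illustration, not part of the paper's pipeline; it uses 1x1 convolutions and omits batch normalization and biases for brevity) builds two stacked convolutions in which two filters of the first layer are identical, merges them by summing the corresponding input channels of the second layer, and verifies that the output is unchanged.

```python
import numpy as np

# If two filters of conv1 are identical, one can be dropped, provided the
# corresponding input channels of conv2 are added together: the output is unchanged.
H, W, c_in, c1, c2 = 8, 8, 2, 4, 6
rng = np.random.default_rng(0)

x = rng.standard_normal((H, W, c_in))
k1 = rng.standard_normal((c_in, c1))          # conv1: 1x1 kernel with c1 filters
k1[:, 3] = k1[:, 2]                           # filters at indices 2 and 3 are identical
k2 = rng.standard_normal((c1, c2))            # conv2: 1x1 kernel with c2 filters

def conv1x1(feat, kernel):
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.tensordot(feat, kernel, axes=([2], [0]))

y_full = conv1x1(conv1x1(x, k1), k2)

# Prune: keep conv1 filters {0, 1, 2}; fold conv2's input channel 3 into channel 2.
k1_pruned = k1[:, [0, 1, 2]]
k2_merged = k2.copy()
k2_merged[2, :] += k2_merged[3, :]
k2_pruned = k2_merged[[0, 1, 2], :]

y_pruned = conv1x1(conv1x1(x, k1_pruned), k2_pruned)
print(np.allclose(y_full, y_pruned))          # True: zero performance loss
```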
We summarize our contributions as follows.

• We propose to produce redundancy patterns in CNNs by training some filters to become identical. Compared to the importance-based filter pruning methods, doing so requires no heuristic knowledge about the importance of filters. Compared to the zeroing-out methods, no finetuning is needed, and more of the representational capacity of the network is preserved.

• We propose Centripetal SGD (C-SGD), an innovative SGD optimization method. As the name suggests, we make multiple filters move towards a center in the hyperspace of the filter parameters. In the meantime, supervised by the model's original objective function, the performance is maintained as much as possible.

• By C-SGD, we have partly solved constrained filter pruning, an open problem of slimming modern very deep CNNs with complicated structure, where some layers must be pruned in the same pattern as others.

• We have presented both theoretical and empirical analysis of the effectiveness of C-SGD. We have shown empirical evidence supporting our motivation (Fig. 1) and the assumption that redundancy helps the convergence of neural networks [14, 27].

The codes are available at https://github.com/ShawnDing1994/Centripetal-SGD.
2. Related Work
Filter Pruning.
Numerous inspiring works [7, 17, 20, 22, 39, 58, 67] have shown that it is feasible to remove a large portion of connections or neurons from a neural network without a significant performance drop. However, as the connection pruning methods make the parameter tensors no smaller but just sparser, little or no acceleration can be observed without the support of specialized hardware. It is then natural for researchers to go further on CNNs: by removing filters instead of sporadic connections, we transform the wide convolutional layers into narrower ones, hence the FLOPs, memory footprint and power consumption are significantly reduced. One kind of method defines the importance of filters by some means, then selects and prunes the unimportant filters carefully to minimize the performance loss. Some prior works measure a filter's importance by the accuracy reduction (CAR) [2], the channel contribution variance [50], a Taylor-expansion-based criterion [49], the magnitude of convolution kernels [40] and the average percentage of zero activations (APoZ) [28], respectively; Luo et al. [47] select filters based on information derived from the next layer; Yu et al. [66] take into consideration the effect of error propagation; He et al. [26] select filters by solving a Lasso regression; He and Han [24] pick filters with the aid of reinforcement learning. Another category seeks to train the network under certain constraints in order to zero out some filters, where group-Lasso regularization is frequently used [3, 43, 63]. It is noteworthy that since removing some whole filters can degrade the network a lot, the CNNs are usually pruned in a layer-by-layer [3, 24, 26, 28, 47, 50] or filter-by-filter [2, 49] manner, and require one or more finetuning processes to restore the accuracy [2, 3, 5, 24, 26, 28, 40, 44, 47, 49, 50, 63, 66].
Other Methods.
Apart from filter pruning, some excellent works seek to compress and accelerate CNNs in other ways. Considerable works [4, 14, 31, 32, 54, 56, 65, 68] decompose or approximate the parameter tensors; quantization and binarization techniques [11, 18, 19, 51, 64] approximate a model using fewer bits per parameter; knowledge distillation methods [6, 27, 52] transfer knowledge from a big network to a smaller one; some researchers seek to speed up convolution with the help of perforation [16], FFT [48, 59] or DCT [62]; Wang et al. [61] compact feature maps by extracting information via circulant matrices. Of note is that since filter pruning simply shrinks a wide CNN into a narrower one with no special structures or extra operations, it is orthogonal to the other methods.
Figure 1: Zeroing-out v.s. centripetal constraint. This figure shows a CNN with 4 and 6 filters at the 1st and 2nd convolutional layer, respectively, which takes a 2-channel input. Left: the 3rd filter at conv1 is zeroed out, thus the 3rd feature map is close to zero, implying that the 3rd input channels of the 6 filters at conv2 are useless. During pruning, the 3rd filter at conv1 along with the 3rd input channels of the 6 filters at conv2 are removed. Right: the 3rd and 4th filters at conv1 are forced to grow close by the centripetal constraint until the 3rd and 4th feature maps become identical. But the 3rd and 4th input channels of the 6 filters at conv2 can still grow without constraints, keeping the encoded information in full use. When pruned, the 4th filter at conv1 is removed, and the 4th input channel of every filter at conv2 is added to the 3rd channel.
3. Slimming CNNs via Centripetal SGD
In modern CNNs, batch normalization [30] and scaling transformations are commonly used to enhance the representational capacity of convolutional layers. For simplicity and generality, we regard the possible subsequent batch normalization and scaling layer as part of the convolutional layer. Let $i$ be the layer index, $M^{(i)} \in \mathbb{R}^{h_i \times w_i \times c_i}$ be an $h_i \times w_i$ feature map with $c_i$ channels and $M^{(i,j)} = M^{(i)}_{:,:,j}$ be the $j$-th channel. The convolutional layer $i$ with kernel size $u_i \times v_i$ has at most one 4th-order tensor and four vectors as parameters, namely $K^{(i)} \in \mathbb{R}^{u_i \times v_i \times c_{i-1} \times c_i}$ and $\mu^{(i)}, \sigma^{(i)}, \gamma^{(i)}, \beta^{(i)} \in \mathbb{R}^{c_i}$, where $K^{(i)}$ is the convolution kernel, $\mu^{(i)}$ and $\sigma^{(i)}$ are the mean and standard deviation of batch normalization, and $\gamma^{(i)}$ and $\beta^{(i)}$ are the parameters of the scaling transformation. We then use $P^{(i)} = (K^{(i)}, \mu^{(i)}, \sigma^{(i)}, \gamma^{(i)}, \beta^{(i)})$ to denote the parameters of layer $i$. In this paper, the filter $j$ at layer $i$ refers to the five-tuple comprising all the parameter slices related to the $j$-th output channel of layer $i$, formally $F^{(j)} = (K^{(i)}_{:,:,:,j}, \mu^{(i)}_j, \sigma^{(i)}_j, \gamma^{(i)}_j, \beta^{(i)}_j)$. During forward propagation, this layer takes $M^{(i-1)} \in \mathbb{R}^{h_{i-1} \times w_{i-1} \times c_{i-1}}$ as input and outputs $M^{(i)}$. Let $*$ be the 2-D convolution operator; the $j$-th output channel is given by

$$M^{(i,j)} = \frac{\sum_{k=1}^{c_{i-1}} M^{(i-1,k)} * K^{(i)}_{:,:,k,j} - \mu^{(i)}_j}{\sigma^{(i)}_j} \, \gamma^{(i)}_j + \beta^{(i)}_j . \quad (1)$$

The importance-based filter pruning methods [2, 28, 40, 49, 50, 66] define the importance of filters by some means, prune the unimportant part and reconstruct the network using the remaining parameters. Let $\mathcal{I}_i$ be the filter index set of layer $i$ (e.g., $\mathcal{I}_2 = \{1, 2, 3, 4\}$ if the second layer has four filters), $T$ be the filter importance evaluation function and $\theta_i$ be the threshold. The remaining set, i.e., the index set of the filters which survive the pruning, is $R_i = \{ j \in \mathcal{I}_i \mid T(F^{(j)}) > \theta_i \}$. Then we reconstruct the network by assembling the parameters sliced from the original tensor or vectors of layer $i$ into the new parameters. That is,

$$\hat{P}^{(i)} = (K^{(i)}_{:,:,:,R_i}, \mu^{(i)}_{R_i}, \sigma^{(i)}_{R_i}, \gamma^{(i)}_{R_i}, \beta^{(i)}_{R_i}) . \quad (2)$$

The input channels of the next layer corresponding to the pruned filters should also be discarded,

$$\hat{P}^{(i+1)} = (K^{(i+1)}_{:,:,R_i,:}, \mu^{(i+1)}, \sigma^{(i+1)}, \gamma^{(i+1)}, \beta^{(i+1)}) . \quad (3)$$

For each convolutional layer, we first divide the filters into clusters. The number of clusters equals the desired number of filters, as we preserve only one filter for each cluster. We use $\mathcal{C}_i$ and $H$ to denote the set of all filter clusters of layer $i$ and a single cluster in the form of a filter index set, respectively. We generate the clusters evenly or by k-means [21], between which our experiments demonstrate only a minor difference (Table 1).

• K-means clustering. We aim to generate clusters with low intra-cluster distance in the parameter hyperspace, such that collapsing each cluster into a single point impacts the model less, which is natural. To this end, we simply flatten each filter's kernel and use it as the feature vector for k-means clustering.

• Even clustering. We can also generate clusters with no consideration of the filters' inherent properties. Let $c_i$ and $r_i$ be the number of original filters and desired clusters, respectively; then each cluster will have at most $\lceil c_i / r_i \rceil$ filters.

For example, if the second layer has six filters and we wish to slim it to four filters, we will have $\mathcal{C}_2 = \{H_1, H_2, H_3, H_4\}$, where $H_1 = \{1, 2\}$, $H_2 = \{3, 4\}$, $H_3 = \{5\}$, $H_4 = \{6\}$. We use $H(j)$ to denote the cluster containing filter $j$, so in the above example we have $H(3) = H_2$ and $H(6) = H_4$.
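The two clustering schemes can be sketched as follows. This is a minimal illustration under our own conventions (scikit-learn's KMeans and 0-based indices are assumptions; the released implementation may differ). Each filter is represented by its flattened kernel slice $K^{(i)}_{:,:,:,j}$, and each function returns a list of clusters, each being a list of filter indices.

```python
import numpy as np
from sklearn.cluster import KMeans

def even_clusters(num_filters, num_clusters):
    """Split filter indices into num_clusters groups of nearly equal size."""
    return [part.tolist() for part in np.array_split(np.arange(num_filters), num_clusters)]

def kmeans_clusters(kernel, num_clusters):
    """Cluster filters by their flattened kernels; kernel shape: (u, v, c_in, c_out)."""
    c_out = kernel.shape[3]
    feats = kernel.reshape(-1, c_out).T               # one flattened kernel per filter
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(feats)
    return [list(np.where(labels == c)[0]) for c in range(num_clusters)]

# Example: 6 filters slimmed to 4 clusters, as in the text (0-based indices).
print(even_clusters(6, 4))   # [[0, 1], [2, 3], [4], [5]]
```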
Let $F^{(j)}$ be the kernel or a vector parameter of filter $j$. At each training iteration, the update rule of C-SGD is

$$F^{(j)} \leftarrow F^{(j)} + \tau \, \Delta F^{(j)} , \qquad \Delta F^{(j)} = -\frac{\sum_{k \in H(j)} \frac{\partial L}{\partial F^{(k)}}}{|H(j)|} - \eta F^{(j)} + \epsilon \left( \frac{\sum_{k \in H(j)} F^{(k)}}{|H(j)|} - F^{(j)} \right) , \quad (4)$$

where $L$ is the original objective function, $\tau$ is the learning rate, $\eta$ is the model's original weight decay factor, and $\epsilon$ is the only introduced hyper-parameter, which is called the centripetal strength.

Let $\mathcal{L}$ be the layer index set. We use the sum of squared kernel deviation $\chi$ to measure the intra-cluster similarity, i.e., how close the filters are in each cluster,

$$\chi = \sum_{i \in \mathcal{L}} \sum_{j \in \mathcal{I}_i} \left\| K^{(i)}_{:,:,:,j} - \frac{\sum_{k \in H(j)} K^{(i)}_{:,:,:,k}}{|H(j)|} \right\|^2 . \quad (5)$$

It is easy to derive from Eq. 4 that, if floating-point operation errors are ignored, $\chi$ is lowered monotonically and exponentially with a proper learning rate $\tau$.

The intuition behind Eq. 4 is quite simple: for the filters in the same cluster, the increments derived from the objective function are averaged (the first term), the normal weight decay is applied as well (the second term), and the difference in the initial values is gradually eliminated (the last term), so the filters will move towards their center in the hyperspace. In practice, we fix $\eta$ and reduce $\tau$ over time just as we do in normal SGD training, and set $\epsilon$ casually. Intuitively, C-SGD training with a large $\epsilon$ prefers "rapid change" to "stable transition", and vice versa. If $\epsilon$ is too large, e.g., 10, the filters are merged almost instantly, such that the whole process becomes equivalent to training a destroyed model from scratch. If $\epsilon$ is extremely small, the difference between C-SGD training and normal SGD is almost invisible for a long time. However, since the difference among the filters in each cluster is reduced monotonically and exponentially, even an extremely small $\epsilon$ can make the filters close enough, sooner or later. As shown in the Appendix, C-SGD is insensitive to $\epsilon$.
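The per-filter update of Eq. 4 and the deviation measure of Eq. 5 can be sketched as follows (a NumPy illustration under our own shape conventions, with filters stored as rows of a matrix; the default hyper-parameter values are placeholders, not the paper's settings). With the objective gradient set to zero, the printed deviation shrinks at every step, matching the monotone and exponential decrease claimed above.

```python
import numpy as np

def csgd_step(W, grad, clusters, lr=0.1, eta=1e-4, eps=3e-3):
    """One C-SGD step (Eq. 4). W, grad: (num_filters, dim); clusters: list of index lists."""
    W_new = W.copy()
    for H in clusters:
        g_avg = grad[H].mean(axis=0)          # averaged objective-function gradient
        w_avg = W[H].mean(axis=0)             # cluster center
        for j in H:
            delta = -g_avg - eta * W[j] + eps * (w_avg - W[j])
            W_new[j] = W[j] + lr * delta
    return W_new

def kernel_deviation(W, clusters):
    """Sum of squared deviations from the cluster centers (Eq. 5, single layer)."""
    return sum(((W[H] - W[H].mean(axis=0)) ** 2).sum() for H in clusters)

# With a zero objective gradient, the intra-cluster deviation decays geometrically.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 27))              # e.g., six 3x3x3 filters, flattened
clusters = [[0, 1], [2, 3], [4], [5]]
for _ in range(3):
    print(kernel_deviation(W, clusters))
    W = csgd_step(W, np.zeros_like(W), clusters)
```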
A simple analogy to weight decay (i.e., $\ell$-2 regularization) may help to understand Centripetal SGD. Fig. 2a shows a 3-D loss surface, where a certain point $A$ corresponds to a 2-D parameter $\boldsymbol{a} = (a_1, a_2)$. Suppose the steepest descent direction is $\overrightarrow{AQ_1}$; we have $\overrightarrow{AQ_1} = -\frac{\partial L}{\partial \boldsymbol{a}}$, where $L$ is the objective function. Weight decay is commonly applied to reduce overfitting [35], that is, $\overrightarrow{AQ_2} = -\eta \boldsymbol{a}$, where $\eta$ is the model's weight decay factor, e.g., $1 \times 10^{-4}$ for ResNets [23]. The actual gradient descent direction then becomes $\Delta \boldsymbol{a} = \overrightarrow{AQ} = \overrightarrow{AQ_1} + \overrightarrow{AQ_2} = -\frac{\partial L}{\partial \boldsymbol{a}} - \eta \boldsymbol{a}$.

Figure 2: Gradient descent directions on the loss surface for (a) normal weight decay and (b) the centripetal constraint without merging the original gradients.

Formally, with $t$ denoting the number of training iterations, we seek to make points $A$ and $B$ grow increasingly close and eventually the same by satisfying

$$\lim_{t \to \infty} \| \boldsymbol{a}^{(t)} - \boldsymbol{b}^{(t)} \| = 0 . \quad (6)$$

Given that $\boldsymbol{a}^{(t+1)} = \boldsymbol{a}^{(t)} + \tau \Delta \boldsymbol{a}^{(t)}$ and $\boldsymbol{b}^{(t+1)} = \boldsymbol{b}^{(t)} + \tau \Delta \boldsymbol{b}^{(t)}$, where $\tau$ is the learning rate, Eq. 6 implies

$$\lim_{t \to \infty} \| (\boldsymbol{a}^{(t)} - \boldsymbol{b}^{(t)}) + \tau (\Delta \boldsymbol{a}^{(t)} - \Delta \boldsymbol{b}^{(t)}) \| = 0 . \quad (7)$$

We seek to achieve this with $\lim_{t \to \infty} (\Delta \boldsymbol{a}^{(t)} - \Delta \boldsymbol{b}^{(t)}) = \boldsymbol{0}$ as well as $\lim_{t \to \infty} (\boldsymbol{a}^{(t)} - \boldsymbol{b}^{(t)}) = \boldsymbol{0}$. Namely, as the two points grow closer, their gradients should become closer accordingly in order for the training to converge.

If we just wish to make $A$ and $B$ closer to each other than they used to be, a natural idea is to push both $A$ and $B$ towards their midpoint $M = \frac{1}{2}(\boldsymbol{a} + \boldsymbol{b})$, as shown in Fig. 2b. Therefore, the gradient descent direction of point $A$ becomes

$$\Delta \boldsymbol{a} = \overrightarrow{AQ_1} + \overrightarrow{AQ_2} + \overrightarrow{AQ_3} = -\frac{\partial L}{\partial \boldsymbol{a}} - \eta \boldsymbol{a} + \epsilon \left( \frac{\boldsymbol{a} + \boldsymbol{b}}{2} - \boldsymbol{a} \right) , \quad (8)$$

where $\epsilon$ is a hyper-parameter controlling the intensity or speed of pushing $A$ and $B$ close. We have

$$\Delta \boldsymbol{b} = -\frac{\partial L}{\partial \boldsymbol{b}} - \eta \boldsymbol{b} + \epsilon \left( \frac{\boldsymbol{a} + \boldsymbol{b}}{2} - \boldsymbol{b} \right) , \quad (9)$$

$$\Delta \boldsymbol{a} - \Delta \boldsymbol{b} = \left( \frac{\partial L}{\partial \boldsymbol{b}} - \frac{\partial L}{\partial \boldsymbol{a}} \right) + (\eta + \epsilon)(\boldsymbol{b} - \boldsymbol{a}) . \quad (10)$$

Here we see the problem: we cannot ensure $\lim_{t \to \infty} ( \frac{\partial L}{\partial \boldsymbol{b}^{(t)}} - \frac{\partial L}{\partial \boldsymbol{a}^{(t)}} ) = \boldsymbol{0}$. Actually, even $\boldsymbol{a} = \boldsymbol{b}$ does not imply $\frac{\partial L}{\partial \boldsymbol{a}} = \frac{\partial L}{\partial \boldsymbol{b}}$, because they participate in different computation flows. As a consequence, we cannot ensure $\lim_{t \to \infty} (\Delta \boldsymbol{a}^{(t)} - \Delta \boldsymbol{b}^{(t)}) = \boldsymbol{0}$ with Eq. 8 and Eq. 9.

We solve this problem by merging the gradients derived from the original objective function. For simplicity and symmetry, by replacing both $\frac{\partial L}{\partial \boldsymbol{a}}$ in Eq. 8 and $\frac{\partial L}{\partial \boldsymbol{b}}$ in Eq. 9 with $\frac{1}{2}(\frac{\partial L}{\partial \boldsymbol{a}} + \frac{\partial L}{\partial \boldsymbol{b}})$, we have $\Delta \boldsymbol{a} - \Delta \boldsymbol{b} = (\eta + \epsilon)(\boldsymbol{b} - \boldsymbol{a})$. In this way, the supervision information encoded in the objective-function-related gradients is preserved to maintain the model's performance, and Eq. 6 is satisfied, which can be easily verified. Intuitively, we deviate $\boldsymbol{a}$ from the steepest descent direction according to some information of $\boldsymbol{b}$, and deviate $\boldsymbol{b}$ vice versa, just as $\ell$-2 regularization deviates both $\boldsymbol{a}$ and $\boldsymbol{b}$ towards the origin of coordinates.

The efficiency of modern CNN training and deployment platforms, e.g., Tensorflow [1], is based on large-scale tensor operations. We therefore seek to implement C-SGD by efficient matrix multiplication, which introduces minimal computational burden. Concretely, given a convolutional layer $i$, the kernel $K \in \mathbb{R}^{u_i \times v_i \times c_{i-1} \times c_i}$ and the gradient $\frac{\partial L}{\partial K}$, we reshape $K$ to $W \in \mathbb{R}^{u_i v_i c_{i-1} \times c_i}$ and $\frac{\partial L}{\partial K}$ to $\frac{\partial L}{\partial W}$ accordingly. We construct the averaging matrix $\Gamma \in \mathbb{R}^{c_i \times c_i}$ and the decaying matrix $\Lambda \in \mathbb{R}^{c_i \times c_i}$ as in Eq. 12 and Eq. 13, such that Eq. 11 is equivalent to Eq. 4, which can be easily verified. Obviously, when the number of clusters equals that of the filters, Eq. 11 degrades into normal SGD with $\Gamma = \mathrm{diag}(1)$, $\Lambda = \mathrm{diag}(\eta)$. The other trainable parameters (i.e., $\gamma$ and $\beta$) are reshaped into $W \in \mathbb{R}^{1 \times c_i}$ and handled in the same way. In practice, we observe almost no difference in speed between normal SGD and C-SGD using Tensorflow on Nvidia GeForce GTX 1080Ti GPUs with CUDA 9.0 and cuDNN 7.0.

$$W \leftarrow W - \tau \left( \frac{\partial L}{\partial W} \Gamma + W \Lambda \right) . \quad (11)$$

$$\Gamma_{m,n} = \begin{cases} 1 / |H(m)| & \text{if } H(m) = H(n) , \\ 0 & \text{elsewise} . \end{cases} \quad (12)$$

$$\Lambda_{m,n} = \begin{cases} \eta + (1 - 1/|H(m)|)\,\epsilon & \text{if } m = n , \\ -\epsilon / |H(m)| & \text{if } m \neq n \text{ and } H(m) = H(n) , \\ 0 & \text{elsewise} . \end{cases} \quad (13)$$
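The sketch below builds $\Gamma$ and $\Lambda$ for a toy cluster assignment and checks numerically that the matrix update of Eq. 11 coincides with the per-filter rule of Eq. 4. It is our own NumPy rendering, using the $\Lambda$ reconstructed above and placeholder hyper-parameter values, not the authors' TensorFlow code.

```python
import numpy as np

def build_gamma_lambda(clusters, num_filters, eta, eps):
    """Averaging matrix Gamma (Eq. 12) and decaying matrix Lambda (Eq. 13)."""
    gamma = np.zeros((num_filters, num_filters))
    lam = np.zeros((num_filters, num_filters))
    for H in clusters:
        size = len(H)
        for m in H:
            for n in H:
                gamma[m, n] = 1.0 / size
                lam[m, n] = eta + (1 - 1 / size) * eps if m == n else -eps / size
    return gamma, lam

# Toy layer: kernel reshaped to W with one column per filter.
rng = np.random.default_rng(0)
num_filters, dim, tau, eta, eps = 6, 27, 0.1, 1e-4, 3e-3
clusters = [[0, 1], [2, 3], [4], [5]]
W = rng.standard_normal((dim, num_filters))
G = rng.standard_normal((dim, num_filters))         # stands for dL/dW

gamma, lam = build_gamma_lambda(clusters, num_filters, eta, eps)
W_matrix = W - tau * (G @ gamma + W @ lam)           # Eq. 11

# Per-filter rule of Eq. 4 for comparison.
W_loop = W.copy()
for H in clusters:
    g_avg, w_avg = G[:, H].mean(axis=1), W[:, H].mean(axis=1)
    for j in H:
        W_loop[:, j] = W[:, j] + tau * (-g_avg - eta * W[:, j] + eps * (w_avg - W[:, j]))

print(np.allclose(W_matrix, W_loop))                 # True
```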
After C-SGD training, since the filters in each cluster have become identical, as will be shown in Sect. 4.3, it makes no difference which one we pick. We simply pick the first filter (i.e., the filter with the smallest index) in each cluster to form the remaining set for each layer, which is $R_i = \{ \min(H) \mid \forall H \in \mathcal{C}_i \}$. For the next layer, we add the to-be-deleted input channels to the corresponding remaining one,

$$K^{(i+1)}_{:,:,k,:} \leftarrow \sum K^{(i+1)}_{:,:,H(k),:} \quad \forall k \in R_i ,$$

then we delete the redundant filters as well as the input channels of the next layer following Eq. 2 and Eq. 3. Due to the linear and combinational properties of convolution (Eq. 1), no damage is caused, hence no finetuning is needed.
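Putting the pieces together, the sketch below trims one pair of adjacent convolution kernels given the cluster assignment of the earlier layer: it keeps the lowest-index filter of every cluster, accumulates the next layer's input channels of each cluster onto that survivor, and then slices both kernels (Eq. 2, Eq. 3). The shapes, names and the omission of the batch-normalization and scaling vectors are our own simplifications.

```python
import numpy as np

def trim_pair(k_cur, k_next, clusters):
    """k_cur: (u, v, c_in, c_out) of layer i; k_next: (u, v, c_out, c_next) of layer i+1."""
    survivors = sorted(min(H) for H in clusters)      # R_i: first filter of each cluster
    k_next = k_next.copy()
    for H in clusters:
        keep = min(H)
        for k in H:
            if k != keep:                             # add the doomed input channel onto the survivor
                k_next[:, :, keep, :] += k_next[:, :, k, :]
    return k_cur[:, :, :, survivors], k_next[:, :, survivors, :]

# Toy usage with identical filters inside each cluster (as after C-SGD training).
rng = np.random.default_rng(0)
k1 = rng.standard_normal((3, 3, 2, 6))
clusters = [[0, 1], [2, 3], [4], [5]]
for H in clusters:                                    # make the filters in each cluster identical
    k1[:, :, :, H] = k1[:, :, :, [min(H)] * len(H)]
k2 = rng.standard_normal((3, 3, 6, 8))
k1_slim, k2_slim = trim_pair(k1, k2, clusters)
print(k1_slim.shape, k2_slim.shape)                   # (3, 3, 2, 4) (3, 3, 4, 8)
```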
Recently, accompanied by the advancement of CNN design philosophy, several efficient and compact CNN architectures [23, 29] have emerged and become favored in real-world applications. Although some excellent works [28, 32, 49, 66, 69] have shown that the classical plain CNNs, e.g., AlexNet [34] and VGG [55], are highly redundant and can be pruned significantly, the pruned versions are usually still inferior to the more up-to-date and complicated CNNs in terms of both accuracy and efficiency.

We consider filter pruning for very deep and complicated CNNs challenging for three reasons. Firstly, these networks are designed with computational efficiency in mind, which makes them inherently compact and efficient. Secondly, these networks are significantly deeper than the classical ones, thus the layer-by-layer pruning techniques become inefficient, and the errors can increase dramatically when propagated through multiple layers, making the estimation of filter importance less accurate [66]. Lastly and most importantly, some innovative structures are heavily used in these networks, e.g., cross-layer connections [23] and dense connections [29], raising an open problem of constrained filter pruning.

I.e., in each stage of ResNets, every residual block is expected to add the learned residuals to the stem feature maps produced by the first or the projection layer (referred to as the pacesetter), thus the last layer of every residual block (referred to as a follower) must be pruned in the same pattern as the pacesetter, i.e., the remaining set $R$ of all the followers and the pacesetter must be identical, or the network will be damaged so badly that finetuning cannot restore its accuracy. For example, Li et al. [40] once tried violently pruning ResNets but resulted in low accuracy. In some successful explorations, Li et al. [40] sidestep this problem by only pruning the internal layers of ResNet-56, i.e., the first layers in each residual block. Liu et al. [44] and He et al. [26] skip pruning these troublesome layers and insert an extra sampler layer before the first layer in each residual block at inference time to reduce the input channels. Though these methods are able to prune the networks to some extent, from a holistic perspective the networks are not literally "slimmed" but actually "clipped", as shown in Fig. 3.

Figure 3: (a) Original, (b) clipped, (c) sampled, (d) slimmed. Compared to the prior works which only clip the internal layers [40] or insert sampler layers [26, 44] on ResNets, C-SGD literally "slims" the network.

We have partly solved this open problem with C-SGD, where the key is to force different layers to learn the same redundancy pattern. For example, if layers $p$ and $q$ have to be pruned in the same pattern, we only generate clusters for layer $p$ by some means and assign the resulting cluster set to layer $q$, namely $\mathcal{C}_q \leftarrow \mathcal{C}_p$. Then during C-SGD training, the same redundancy patterns among filters are produced in both layer $p$ and layer $q$. I.e., if the $j$-th and $k$-th filters at layer $p$ become identical, we ensure the sameness of the $j$-th and $k$-th filters at layer $q$ as well, thus the troublesome layers can be pruned along with the others. Some sketches are presented in the Appendix for more intuition.
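A minimal sketch of this cluster sharing follows (our own illustration; the layer names, the dictionary layout and the use of even clustering for the pacesetter are assumptions): the pacesetter's cluster assignment is generated once and reused verbatim for every follower in the same stage, so that all constrained layers end up with identical remaining sets.

```python
import numpy as np

def assign_clusters(layer_kernels, constraint_groups, num_clusters):
    """layer_kernels: dict name -> kernel (u, v, c_in, c_out).
    constraint_groups: lists of layer names that must share one pruning pattern;
    the first name in each group acts as the pacesetter."""
    clusters = {}
    for group in constraint_groups:
        pacesetter = group[0]
        c_out = layer_kernels[pacesetter].shape[3]
        shared = [part.tolist() for part in np.array_split(np.arange(c_out), num_clusters)]
        for name in group:                    # C_q <- C_p for every follower q
            clusters[name] = shared
    return clusters

# Toy ResNet-like stage: the projection layer is the pacesetter, the last layer of
# each residual block is a follower with the same number of output channels.
rng = np.random.default_rng(0)
kernels = {name: rng.standard_normal((3, 3, 16, 16))
           for name in ["stage2_proj", "stage2_block1_conv2", "stage2_block2_conv2"]}
groups = [["stage2_proj", "stage2_block1_conv2", "stage2_block2_conv2"]]
clusters = assign_clusters(kernels, groups, num_clusters=10)
print(clusters["stage2_proj"] == clusters["stage2_block2_conv2"])   # True: identical pattern
```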
4. Experiments
We experiment on CIFAR-10 [33] and ImageNet-1K [12] to evaluate our method. For each trial we start from a well-trained base model and apply C-SGD training on all the target layers simultaneously. The comparisons between C-SGD and other filter pruning methods are presented in Table 1 and Table 2 in terms of both absolute and relative error increase, which are commonly adopted as metrics to fairly compare the change of accuracy across different base models.
E.g., the Top-1 accuracy of our ResNet-50 base model and C-SGD-70 is 75.33% and 75.27%, respectively, thus the absolute error increase is $75.33\% - 75.27\% = 0.06\%$ and the relative error increase is $0.06 / (100 - 75.33) = 0.24\%$.

CIFAR-10.
The base models are trained from scratch for 600 epochs to ensure convergence, which is much longer than the usually adopted benchmarks (160 [23] or 300 [29] epochs), such that the improved accuracy of the pruned model cannot simply be attributed to the extra training epochs on a base model which has not fully converged. We use the data augmentation techniques adopted by [23], i.e., padding to 40 × 40, random cropping and flipping. The hyper-parameter $\epsilon$ is set casually. We perform C-SGD training with batch size 64, and the learning rate is decayed by 0.1 whenever the loss stops decreasing. For each network we perform two experiments independently, where the only difference is the way we generate the filter clusters, namely even division or k-means clustering. We seek to reduce the FLOPs of every model by around 60%, so we prune 3/8 of every convolutional layer of ResNets, thus the parameters and FLOPs are reduced by around $1 - (5/8)^2 = 61\%$. Aggressive as this is, no obvious accuracy drop is observed. For DenseNet-40, the pruned model has 5, 8 and 10 filters at every incremental convolutional layer in the three stages, respectively, so that the FLOPs are reduced by 60.05%, and a significantly increased accuracy is observed, which is consistent with but better than that of [44].

ImageNet.
We perform experiments using ResNet-50 [23] on ImageNet to validate the effectiveness of C-SGD in real-world applications. We apply k-means clustering on the filter kernels to generate the clusters, then use the ILSVRC2015 training set, which contains 1.28M high-quality images, for training. We adopt the standard data augmentation techniques including b-box distortion and color shift. At test time, we use a single central crop. For C-SGD-7/10, C-SGD-6/10 and C-SGD-5/10, all the first and second layers in each residual block are shrunk to 70%, 60% and 50% of the original width, respectively.
Discussions.
Our pruned networks exhibit fewer FLOPs, simpler structures and higher or comparable accuracy. Note that we apply the same pruning ratio globally for ResNets, and better results could be achieved if more layer-sensitivity analysis experiments [26, 40, 66] were conducted and the resulting network structures tuned accordingly. Interestingly, even arbitrarily generated clusters can produce reasonable results (Table 1).

vs. Normal Training. The comparisons between C-SGD and other pruning-and-finetuning methods [26, 40, 47, 66] indicate that it may be better to train a redundant network and equivalently transform it into a narrower one than to finetune it after pruning. This observation is consistent with [14] and [27], where the authors believe that the redundancy in neural networks is necessary to overcome a highly non-convex optimization. We verify this assumption by training a narrow CNN with normal SGD and comparing it with another model trained using C-SGD with the equivalent width, which means that some redundant filters are produced during training and trimmed afterwards, resulting in the same network structure as the normally trained model. For example, if a network has 2× the number of filters of the normal counterpart but every two filters are identical, they will end up with the same structure. If the redundant one outperforms the normal one, we can conclude that C-SGD does yield more powerful networks by exploiting the redundant filters.

On DenseNet-40, we evenly divide the 12 filters at each incremental layer into 3 clusters, use C-SGD to train the network from scratch, then trim it to obtain a DenseNet-40 with 3 filters per incremental layer. I.e., during training, every 4 filters grow centripetally. As a contrast, we train a DenseNet-40 with originally 3 filters per layer by normal SGD. Another group of experiments where each layer ends up with 6 filters is carried out similarly. After that, experiments on VGG [55] are also carried out, where we slim each layer to 1/4 and 1/2 of the original width, respectively.

Table 1: Pruning results on CIFAR-10. For C-SGD, the left pruned Top-1 accuracy is achieved by even clustering and the right by k-means; the error increase is computed with the k-means result.
| Model | Result | Base Top-1 | Pruned Top-1 (even / k-means) | Top-1 error Abs/Rel ↑% | FLOPs ↓% | Architecture |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-56 | Li et al. [40] | 93.04 | 93.06 | -0.02 / -0.28 | 27.60 | only internals pruned |
| ResNet-56 | NISP-56 [66] | - | - | 0.03 / - | 43.61 | - |
| ResNet-56 | Channel Pruning [26] | 92.8 | 91.8 | 1.0 / 13.88 | 50 | sampler layer |
| ResNet-56 | ADC [24] | 92.8 | 91.9 | 0.9 / 12.5 | 50 | sampler layer |
| ResNet-56 | C-SGD-5/8 | 93.39 | 93.44 / 93.62 | -0.23 / -3.47 | 60.85 | 10-20-40 |
| ResNet-110 | Li et al. [40] | 93.53 | 93.30 | 0.23 / 3.55 | 38.60 | only internals pruned |
| ResNet-110 | NISP-110 [66] | - | - | 0.18 / - | 43.78 | - |
| ResNet-110 | C-SGD-5/8 | 94.38 | 94.54 / 94.41 | -0.03 / -0.53 | 60.89 | 10-20-40 |
| ResNet-164 | Network Slimming [44] | 94.58 | 94.73 | -0.15 / -2.76 | 44.90 | sampler layer |
| ResNet-164 | C-SGD-5/8 | 94.83 | 94.80 / 94.81 | 0.02 / 0.38 | 60.91 | 10-20-40 |
| DenseNet-40 | Network Slimming [44] | 93.89 | 94.35 | -0.46 / -7.52 | 55.00 | - |
| DenseNet-40 | C-SGD-5-8-10 | 93.81 | 94.37 / 94.56 | -0.75 / -12.11 | 60.05 | 5-8-10 |
Table 2: Pruning ResNet-50 on ImageNet using k-means clustering.

| Result | Base Top-1 | Base Top-5 | Pruned Top-1 | Pruned Top-5 | Top-1 error Abs/Rel ↑% | Top-5 error Abs/Rel ↑% | FLOPs ↓% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| C-SGD-70 | 75.33 | 92.56 | 75.27 | 92.46 | 0.06 / 0.24 | 0.10 / 1.34 | 36.75 |
| ThiNet-70 [47] | 72.88 | 91.14 | 72.04 | 90.67 | 0.84 / 3.09 | 0.47 / 5.30 | 36.75 |
| SFP [25] | 76.15 | 92.87 | 74.61 | 92.06 | 1.54 / 6.45 | 0.81 / 11.36 | 41.8 |
| NISP [66] | - | - | - | - | 0.89 / - | - / - | 43.82 |
| C-SGD-60 | 75.33 | 92.56 | 74.93 | 92.27 | 0.40 / 1.62 | 0.29 / 3.89 | 46.24 |
| CFP [57] | 75.3 | 92.2 | 73.4 | 91.4 | 1.9 / 7.69 | 0.8 / 10.25 | 49.6 |
| Channel Pruning [26] | - | 92.2 | - | 90.8 | - / - | 1.4 / 17.94 | 50 |
| Autopruner [46] | 76.15 | 92.87 | 74.76 | 92.15 | 1.39 / 5.82 | 0.72 / 10.09 | 51.21 |
| GDP [42] | 75.13 | 92.30 | 71.89 | 90.71 | 3.24 / 13.02 | 1.59 / 20.64 | 51.30 |
| SSR-L2 [41] | 75.12 | 92.30 | 71.47 | 90.19 | 3.65 / 14.67 | 2.11 / 27.40 | 55.76 |
| DCP [70] | 76.01 | 92.93 | 74.95 | 92.32 | 1.06 / 4.41 | 0.61 / 8.62 | 55.76 |
| ThiNet-50 [47] | 72.88 | 91.14 | 71.01 | 90.02 | 1.87 / 6.89 | 1.12 / 12.64 | 55.76 |
| C-SGD-50 | 75.33 | 92.56 | 74.54 | 92.09 | 0.79 / 3.20 | 0.47 / 6.31 | 55.76 |

It can be concluded from Table 3 that the redundant filters do help, compared to a normally trained counterpart with the equivalent width. This observation supports our intuition that the centripetally growing filters can maintain the model's representational capacity to some extent: though these filters are constrained, their corresponding input channels are still in full use and can grow without constraints (Fig. 1).

vs. Zeroing Out. As making filters identical and zeroing filters out [3, 15, 43, 60, 63] are two means of producing redundancy patterns for filter pruning, we perform controlled experiments on ResNet-56 to investigate the difference.

Table 3: Validation accuracy of scratch-trained DenseNet-40 and VGG using C-SGD or normal SGD on CIFAR-10.
| Model | Normal SGD | C-SGD |
| --- | --- | --- |
| DenseNet-3 | 88.60 | |
| DenseNet-6 | 89.96 | |
| VGG-1/4 | 90.16 | |
| VGG-1/2 | 92.49 | |

For fair comparison, we aim to produce the same number of redundant filters in both the model trained with C-SGD and the one with group-Lasso Regularization [53]. For C-SGD, the number of clusters in each layer is 5/8 of the number of filters.
For Lasso, 3/8 of the original filters in the pacesetters and internal layers are regularized by group-Lasso, and the followers are handled in the same pattern. We use the aforementioned sum of squared kernel deviation $\chi$ and the sum of squared kernel residuals $\phi$, defined as follows, to measure the redundancy. Let $\mathcal{L}$ be the layer index set and $P_i$ be the to-be-pruned filter set of layer $i$, i.e., the set of the 3/8 filters with group-Lasso regularization,

$$\phi = \sum_{i \in \mathcal{L}} \sum_{j \in P_i} \| K^{(i)}_{:,:,:,j} \|^2 .$$

We present in Fig. 4 the curves of $\chi$ and $\phi$ as well as the validation accuracy both before and after pruning. The learning rate $\tau$ is decayed by 0.1 at epochs 100 and 200, respectively. It can be observed that group-Lasso cannot literally zero out filters, but can decrease their magnitude to some extent, as $\phi$ plateaus when the gradients derived from the regularization term become close to those derived from the original objective function. We empirically find that even when $\phi$ becomes several orders of magnitude smaller than its initial value, pruning still causes obvious damage (around a 10% accuracy drop). When the learning rate is decayed and $\phi$ is reduced at epoch 200, we observe no improvement in the pruned accuracy, therefore no more experiments with smaller learning rates or stronger group-Lasso regularization are conducted. We reckon this is due to the error propagation and amplification in very deep CNNs [66]. By C-SGD, $\chi$ is reduced monotonically and perfectly exponentially, which leads to faster convergence. I.e., the filters in each cluster can become infinitely close to each other at a constant rate with a constant learning rate. For C-SGD, pruning causes absolutely no performance loss after around 90 epochs. Training with group-Lasso is also slower than C-SGD, as it requires costly square root operations.

Figure 4: Training process with C-SGD or group-Lasso on ResNet-56: (a) values of $\chi$ or $\phi$ (note the logarithmic scale); (b) validation accuracy before and after pruning.

vs. Other Filter Pruning Methods. We compare C-SGD with other methods by controlled experiments on DenseNet-40 [29]. We slim every incremental layer of a well-trained DenseNet-40 to 3 and 6 filters, respectively. The experiments are repeated 3 times,
and all the results are presented in Fig. 5. The training setting is kept the same for every model: the learning rate $\tau$ is decayed over four stages of 200, 200, 100 and 100 epochs, respectively, to ensure the convergence of every model. For our method, the models are trained with C-SGD and trimmed. For Magnitude- [40], APoZ- [28] and Taylor-expansion-based [49] pruning, the models are pruned by the respective criteria and finetuned. The models labeled as Lasso are trained with group-Lasso Regularization for 600 epochs in advance, pruned, and then finetuned for another 600 epochs with the same learning rate schedule, so that the comparison is actually biased towards the Lasso method. The models are tested on the validation set every 10,000 iterations (12.8 epochs). The results reveal the superiority of C-SGD in terms of higher accuracy and also better stability. Though group-Lasso Regularization can indeed reduce the performance drop caused by pruning, it is outperformed by C-SGD by a large margin. It is interesting that the violently pruned networks are unstable and easily trapped in local minima, e.g., the accuracy curves increase steeply in the beginning but slightly decline afterwards. This observation is consistent with that of Liu et al. [45].

Figure 5: Controlled pruning experiments on DenseNet-40: (a) three filters per layer; (b) six filters per layer.
5. Conclusion
We have proposed to produce identical filters in CNNs for network slimming. The intuition is that making filters identical not only eliminates the need for finetuning but also preserves more of the representational capacity of the network, compared to the zeroing-out fashion (Fig. 1). We have partly solved an open problem of constrained filter pruning on very deep and complicated CNNs and achieved state-of-the-art results on several common benchmarks. By training networks with redundant filters using C-SGD, we have demonstrated empirical evidence for the assumption that redundancy can help the convergence of neural network training, which may encourage future studies. Apart from pruning, we consider C-SGD promising as a means of regularization or as a training technique.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] R. Abbasi-Asl and B. Yu. Structural compression of convolutional neural networks based on greedy filter pruning. arXiv preprint arXiv:1705.07356, 2017.
[3] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In
Advances in Neural Informa-tion Processing Systems , pages 2270–2278, 2016.[4] J. M. Alvarez and M. Salzmann. Compression-aware train-ing of deep networks. In
Advances in Neural InformationProcessing Systems , pages 856–867, 2017.[5] S. Anwar, K. Hwang, and W. Sung. Structured prun-ing of deep convolutional neural networks.
ACM Journalon Emerging Technologies in Computing Systems (JETC) ,13(3):32, 2017.[6] J. Ba and R. Caruana. Do deep nets really need to be deep?In
Advances in neural information processing systems , pages2654–2662, 2014.[7] G. Castellano, A. M. Fanelli, and M. Pelillo. An iterativepruning algorithm for feedforward neural networks.
IEEEtransactions on Neural networks , 8(3):519–531, 1997.[8] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary,and S.-F. Chang. An exploration of parameter redundancyin deep networks with circulant projections. In
Proceedingsof the IEEE International Conference on Computer Vision ,pages 2857–2865, 2015.[9] M. D. Collins and P. Kohli. Memory bounded deep convolu-tional networks. arXiv preprint arXiv:1412.1442 , 2014.[10] R. Collobert and J. Weston. A unified architecture for naturallanguage processing: Deep neural networks with multitasklearning. In
Proceedings of the 25th international conferenceon Machine learning , pages 160–167. ACM, 2008.[11] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, andY. Bengio. Binarized neural networks: Training deep neu-ral networks with weights and activations constrained to+ 1or-1. arXiv preprint arXiv:1602.02830 , 2016.[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database.In
Computer Vision and Pattern Recognition, 2009. CVPR2009. IEEE Conference on , pages 248–255. IEEE, 2009.[13] M. Denil, B. Shakibi, L. Dinh, N. De Freitas, et al. Pre-dicting parameters in deep learning. In
Advances in neuralinformation processing systems , pages 2148–2156, 2013.[14] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fer-gus. Exploiting linear structure within convolutional net-works for efficient evaluation. In
Advances in neural infor-mation processing systems , pages 1269–1277, 2014.[15] X. Ding, G. Ding, J. Han, and S. Tang. Auto-balanced fil-ter pruning for efficient convolutional neural networks. In
Thirty-Second AAAI Conference on Artificial Intelligence ,2018.[16] M. Figurnov, A. Ibraimova, D. P. Vetrov, and P. Kohli. Per-foratedcnns: Acceleration through elimination of redundant convolutions. In
Advances in Neural Information ProcessingSystems , pages 947–955, 2016.[17] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery forefficient dnns. In
Advances In Neural Information Process-ing Systems , pages 1379–1387, 2016.[18] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan.Deep learning with limited numerical precision. In
Interna-tional Conference on Machine Learning , pages 1737–1746,2015.[19] S. Han, H. Mao, and W. J. Dally. Deep compres-sion: Compressing deep neural networks with pruning,trained quantization and huffman coding. arXiv preprintarXiv:1510.00149 , 2015.[20] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weightsand connections for efficient neural network. In
Advances inNeural Information Processing Systems , pages 1135–1143,2015.[21] J. A. Hartigan and M. A. Wong. Algorithm as 136: A k-means clustering algorithm.
Journal of the Royal StatisticalSociety. Series C (Applied Statistics) , 28(1):100–108, 1979.[22] B. Hassibi and D. G. Stork. Second order derivatives for net-work pruning: Optimal brain surgeon. In
Advances in neuralinformation processing systems , pages 164–171, 1993.[23] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn-ing for image recognition. In
Proceedings of the IEEE con-ference on computer vision and pattern recognition , pages770–778, 2016.[24] Y. He and S. Han. Adc: Automated deep compression andacceleration with reinforcement learning. arXiv preprintarXiv:1802.03494 , 2018.[25] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang. Soft filterpruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866 , 2018.[26] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerat-ing very deep neural networks. In
International Conferenceon Computer Vision (ICCV) , volume 2, page 6, 2017.[27] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledgein a neural network. arXiv preprint arXiv:1503.02531 , 2015.[28] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trim-ming: A data-driven neuron pruning approach towards effi-cient deep architectures. arXiv preprint arXiv:1607.03250 ,2016.[29] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten.Densely connected convolutional networks. In
Proceed-ings of the IEEE conference on computer vision and patternrecognition , volume 1, page 3, 2017.[30] S. Ioffe and C. Szegedy. Batch normalization: Acceleratingdeep network training by reducing internal covariate shift. In
International Conference on Machine Learning , pages 448–456, 2015.[31] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding upconvolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 , 2014.[32] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin.Compression of deep convolutional neural networks forfast and low power mobile applications. arXiv preprintarXiv:1511.06530 , 2015.33] A. Krizhevsky and G. Hinton. Learning multiple layers offeatures from tiny images. 2009.[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetclassification with deep convolutional neural networks. In
Advances in neural information processing systems , pages1097–1105, 2012.[35] A. Krogh and J. A. Hertz. A simple weight decay can im-prove generalization. In
Advances in neural information pro-cessing systems , pages 950–957, 1992.[36] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back.Face recognition: A convolutional neural-network approach.
IEEE transactions on neural networks , 8(1):98–113, 1997.[37] Y. LeCun, Y. Bengio, et al. Convolutional networks for im-ages, speech, and time series.
The handbook of brain theoryand neural networks , 3361(10):1995, 1995.[38] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E.Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digitrecognition with a back-propagation network. In
Advancesin neural information processing systems , pages 396–404,1990.[39] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain dam-age. In
Advances in neural information processing systems ,pages 598–605, 1990.[40] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P.Graf. Pruning filters for efficient convnets. arXiv preprintarXiv:1608.08710 , 2016.[41] S. Lin, R. Ji, Y. Li, C. Deng, and X. Li. Towards com-pact convnets via structure-sparsity regularized filter prun-ing. arXiv preprint arXiv:1901.07827 , 2019.[42] S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang. Accel-erating convolutional networks via global & dynamic filterpruning. In
IJCAI , pages 2425–2432, 2018.[43] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky.Sparse convolutional neural networks. In
Proceedings of theIEEE Conference on Computer Vision and Pattern Recogni-tion , pages 806–814, 2015.[44] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang.Learning efficient convolutional networks through networkslimming. In , pages 2755–2763. IEEE, 2017.[45] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Re-thinking the value of network pruning. arXiv preprintarXiv:1810.05270 , 2018.[46] J.-H. Luo and J. Wu. Autopruner: An end-to-end train-able filter pruning method for efficient deep model inference. arXiv preprint arXiv:1805.08941 , 2018.[47] J.-H. Luo, J. Wu, and W. Lin. Thinet: A filter level prun-ing method for deep neural network compression. In
Pro-ceedings of the IEEE international conference on computervision , pages 5058–5066, 2017.[48] M. Mathieu, M. Henaff, and Y. LeCun. Fast trainingof convolutional networks through ffts. arXiv preprintarXiv:1312.5851 , 2013.[49] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz.Pruning convolutional neural networks for resource efficientinference. 2016. [50] A. Polyak and L. Wolf. Channel-level acceleration of deepface representations.
IEEE Access , 3:2163–2175, 2015.[51] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neu-ral networks. In
European Conference on Computer Vision ,pages 525–542. Springer, 2016.[52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta,and Y. Bengio. Fitnets: Hints for thin deep nets. arXivpreprint arXiv:1412.6550 , 2014.[53] V. Roth and B. Fischer. The group-lasso for generalized lin-ear models: uniqueness of solutions and efficient algorithms.In
Proceedings of the 25th international conference on Ma-chine learning , pages 848–855. ACM, 2008.[54] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, andB. Ramabhadran. Low-rank matrix factorization for deepneural network training with high-dimensional output tar-gets. In
Acoustics, Speech and Signal Processing (ICASSP),2013 IEEE International Conference on , pages 6655–6659.IEEE, 2013.[55] K. Simonyan and A. Zisserman. Very deep convolutionalnetworks for large-scale image recognition. arXiv preprintarXiv:1409.1556 , 2014.[56] V. Sindhwani, T. Sainath, and S. Kumar. Structured trans-forms for small-footprint deep learning. In
Advances inNeural Information Processing Systems , pages 3088–3096,2015.[57] P. Singh, V. K. Verma, P. Rai, and V. P. Namboodiri. Lever-aging filter correlations for deep model compression. arXivpreprint arXiv:1811.10559 , 2018.[58] S. W. Stepniewski and A. J. Keane. Pruning backpropaga-tion neural networks using modern stochastic optimisationtechniques.
Neural Computing & Applications , 5(2):76–98,1997.[59] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Pi-antino, and Y. LeCun. Fast convolutional nets withfbfft: A gpu performance evaluation. arXiv preprintarXiv:1412.7580 , 2014.[60] H. Wang, Q. Zhang, Y. Wang, and H. Hu. Structured pruningfor efficient convnets via incremental regularization. arXivpreprint arXiv:1811.08390 , 2018.[61] Y. Wang, C. Xu, C. Xu, and D. Tao. Beyond filters: Com-pact feature map for portable deep model. In
InternationalConference on Machine Learning , pages 3703–3711, 2017.[62] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu. Cnnpack: Pack-ing convolutional neural networks in the frequency domain.In
Advances in neural information processing systems , pages253–261, 2016.[63] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learningstructured sparsity in deep neural networks. In
Advances inNeural Information Processing Systems , pages 2074–2082,2016.[64] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantizedconvolutional neural networks for mobile devices. In
Pro-ceedings of the IEEE Conference on Computer Vision andPattern Recognition , pages 4820–4828, 2016.[65] J. Xue, J. Li, and Y. Gong. Restructuring of deep neuralnetwork acoustic models with singular value decomposition.In
Interspeech , pages 2365–2369, 2013.66] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han,M. Gao, C.-Y. Lin, and L. S. Davis. Nisp: Pruning net-works using neuron importance score propagation. In
Pro-ceedings of the IEEE Conference on Computer Vision andPattern Recognition , pages 9194–9203, 2018.[67] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, andY. Wang. A systematic dnn weight pruning framework usingalternating direction method of multipliers. In
Proceedingsof the European Conference on Computer Vision (ECCV) ,pages 184–199, 2018.[68] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating verydeep convolutional networks for classification and detection.
IEEE transactions on pattern analysis and machine intelli-gence , 38(10):1943–1955, 2016.[69] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towardscompact cnns. In
European Conference on Computer Vision ,pages 662–677. Springer, 2016.[70] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu,J. Huang, and J. Zhu. Discrimination-aware channel pruningfor deep neural networks. In