Towards Compact ConvNets via Structure-Sparsity Regularized Filter Pruning
Shaohui Lin, Student Member, IEEE, Rongrong Ji*, Senior Member, IEEE, Yuchao Li, Cheng Deng, Member, IEEE, and Xuelong Li, Fellow, IEEE
Abstract—The success of convolutional neural networks (CNNs) in computer vision applications has been accompanied by a significant increase in computation and memory costs, which prohibits their usage in resource-limited environments such as mobile or embedded devices. To this end, research on CNN compression has recently been emerging. In this paper, we propose a novel filter pruning scheme, termed structured sparsity regularization (SSR), to simultaneously speed up the computation and reduce the memory overhead of CNNs, which can be well supported by various off-the-shelf deep learning libraries. Concretely, the proposed scheme incorporates two different regularizers of structured sparsity into the original objective function of filter pruning, which fully coordinates the global outputs and local pruning operations to adaptively prune filters. We further propose an Alternative Updating with Lagrange Multipliers (AULM) scheme to efficiently solve the resulting optimization. AULM follows the principle of ADMM and alternates between promoting the structured sparsity of CNNs and optimizing the recognition loss, which leads to a very efficient solver ( . × compared to the most recent work that directly solves the group-sparsity-based regularization). Moreover, by imposing the structured sparsity, online inference is extremely memory-light, since the number of filters and the output feature maps are simultaneously reduced. The proposed scheme has been deployed to a variety of state-of-the-art CNN structures, including LeNet, AlexNet, VGGNet, ResNet and GoogLeNet, over different datasets. Quantitative results demonstrate that the proposed scheme achieves superior performance over the state-of-the-art methods. We further demonstrate the proposed compression scheme for the task of transfer learning, including domain adaptation and object detection, which also shows exciting performance gains over the state-of-the-art filter pruning methods.

Index Terms—Convolutional neural networks, structured sparsity, CNN acceleration, CNN compression.
I. INTRODUCTION

In recent years, convolutional neural networks (CNNs) have achieved great success on a variety of computer vision applications, ranging from action recognition [1] and object recognition [2]–[8] to object detection [9]–[12]. One essential foundation of such success lies in the gigantic amount of model parameters that accompany the large-scale training data. For instance, ResNet-152 [3] has 57 million parameters, costs 230MB of storage, and requires 11.3 billion FLOPs to classify one image with a resolution of 224 × 224. Under such a circumstance, these models cannot be directly deployed to scenarios that require fast processing or compact storage, such as mobile systems or embedded devices.

Substantial efforts have been devoted to the speedup and compression of CNNs for efficient online inference. Among the existing methods, network pruning has attracted increasing attention recently, due to its ability to reduce an overwhelming amount of parameters. As one of the earliest works, LeCun et al. [13] proposed an Optimal Brain Damage algorithm to prune the network by reducing the number of connections with a theoretically-justified saliency measurement. Later, Hassibi and Stork [14] proposed an Optimal Brain Surgeon algorithm to remove unimportant parameters, which are determined by the second-order derivatives of their weights. Recently, Han et al. [15], [16] proposed to prune parameters with small magnitudes to reduce the network size. However, the above pruning schemes typically produce sparse CNNs with non-structured random connections, which cause irregular memory access, i.e., a complex storage structure that adversely impacts the efficiency of accessing CNN models in memory. Moreover, such non-structured sparse CNNs cannot be supported by off-the-shelf libraries, and thus need specialized hardware [17] or software [18] designs to improve their efficiency in online inference.

To address the shortcoming of such non-structured connections, filter pruning is regarded as a promising solution, which has shown significant speedup in online inference as well as good independence of software/hardware platforms [19]–[23]. Differing from previous works on parameter pruning, filter pruning can be further integrated with various CNN compression or acceleration methods, e.g., low-rank decomposition [24]–[31], DCT [32] or FFT [33], [34] based frequency-domain acceleration, and parameter quantization [31], [35]–[38]. Another advantage lies in reducing the energy consumption [39], which is influenced not only by the parameter amount, but also by the FLOPs and the memory access of input/output feature maps. From this perspective, filter pruning can directly remove FLOPs and the intermediate activations of filters (e.g., output feature maps and their corresponding filter channels in the next layer), which substantially reduces the energy consumption.

S. Lin and Y. Li are with the Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, 361005, China. R. Ji (corresponding author) is with the Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, 361005, China, and Peng Cheng Laboratory, Shenzhen, 518055, China (e-mail: [email protected]). C. Deng is with the School of Electronic Engineering, Xidian University, Xi'an 710071, China. X. Li is with the School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, China.

FLOPs: the number of floating-point operations.
Fig. 1. Illustration of the SSR scheme. (a) The complete process, which includes adaptively pruning the unimportant filters via the AULM solver, removing the corresponding output feature maps, removing the corresponding channels of filters in the next layer, as well as updating filters and fine-tuning the pruned network. (b) The process of selecting and pruning unimportant filters, the corresponding output feature maps and channels in the next layer (highlighted in purple). In particular, the tensor-based convolution operator can be replaced by a matrix-by-matrix multiplication using a BLAS library to accelerate the computation of CNNs. (Best viewed in color.)
However, there are still open issues in the existing filter pruning schemes. Concretely, how to adaptively and efficiently select important filters to reconstruct the global output, i.e., the probabilistic "softmax" output after local filter pruning, is still an open problem, with only a few works in the literature. For instance, the work in [20] proposed a magnitude-based criterion to prune convolutional filters with small ℓ1-norms. It results in structured sparse patterns that can accelerate online inference. However, such a magnitude-based measurement (e.g., the ℓ1-norm) is too simple and inefficient to determine the importance of each filter, due to the existence of nonlinear activation functions (e.g., the rectified linear unit (ReLU) [40]) and other complex operations (e.g., pooling and batch normalization [41]). To explain, filters with small ℓ1-norm values may still produce large responses in the output; conversely, a filter with a larger ℓ1-norm can yield a smaller (even zero) response after the convolution operator and ReLU activation, e.g., when its large-magnitude entries are negative with respect to the input. Very recently, Luo et al. [19] implicitly associated the convolutional filters of each layer with the input channels of the next layer, upon which filter pruning is done by selecting input channels that have a minimal local reconstruction error. The small reconstruction error, however, might be magnified and propagated through the deep network, leading to a large reconstruction error in the global outputs. A Taylor-expansion-based criterion [42] was further proposed to iteratively prune one feature map and its associated filter; the pruned network is then fine-tuned to reduce the accuracy drop. However, such a scheme is unadaptive and costly when pruning the entire network. Group sparsity [21]–[23], [43], [44] was introduced to select unimportant filters by Stochastic Gradient Descent (SGD). However, SGD is less efficient in convergence, and is also less efficient at generating a structured output of the pruned filters.

In this paper, we propose an efficient filter pruning scheme, termed Structured Sparsity Regularization (SSR), which can efficiently and adaptively prune a group of convolutional filters to minimize the classification error of the global output. Compared to the existing works on sequential filter pruning [19], [20], [23], [42], [45], we incorporate the structured sparsity constraint into the objective function of the global output to model the correlation between the global output loss and the local filter removal, which produces a structured network with fast computation and light memory consumption (to make a fair comparison, we only evaluate the computational cost of the model for online inference, excluding training). Here, structured sparsity directly prunes an entire filter/block (i.e., sets all of its values to zero), while non-structured sparsity only determines whether each individual element in a filter is zero. In particular, we propose two different kinds of structured sparse regularizers for adaptive filter pruning, i.e., the ℓ2,1-norm [46] and the ℓ2,0-norm. The ℓ2,0-norm of a matrix A is defined as ‖A‖_{2,0} = Σ_i ‖√(Σ_j A_{ij}²)‖_0, where for a scalar a, ‖a‖_0 = 0 if a = 0, and ‖a‖_0 = 1 otherwise. The ℓ2,1-norm is a convex norm that approximates the cardinality in filter selection, while the ℓ2,0-norm selects filters under an explicit cardinality constraint, which is the most natural constraint. For group sparsity with ℓ2,0-regularization, the SGD used in previous works [21]–[23], [43] cannot solve the NP-hard problem for effective filter pruning. To this end, we propose a novel Alternative Updating with Lagrange Multipliers (AULM) scheme, which handles the convergence difficulty of the ℓ2,1-norm under SGD and the NP-hard problem of the ℓ2,0-norm. AULM follows the principle of ADMM [47] by splitting the optimization problem into tractable sub-problems that can be solved efficiently. In addition, AULM converges faster than ADMM by adding Nesterov's optimal method [48]; it can effectively identify the importance of filters, and then updates parameters by alternating between promoting the structured sparsity and
optimizing the recognition loss. In particular, the proposed solver circumvents the sparse constraint evaluations during the standard back-propagation step, which makes the implementation very practical. Moreover, compared to SSL [23], the proposed solver is much faster than the existing solvers in offline pruning, i.e., almost . × faster than directly solving the regularizer with the ℓ2,1-norm by using SGD. Fig. 1 shows the proposed filter pruning framework.

Quantitatively, we demonstrate the advantage of the proposed SSR scheme using five widely-used models (i.e., LeNet, AlexNet, VGGNet-16, ResNet-50 and GoogLeNet) on two datasets (i.e., MNIST and ImageNet 2012). Compared to several state-of-the-art filter pruning methods [19], [20], [23], [42], [43], [45], [49], the proposed scheme performs drastically better, i.e., a . × CPU speedup and . × compression with a negligible classification accuracy loss for LeNet on MNIST, a . × CPU speedup with an increase of 1.28% Top-5 classification error for AlexNet, a . × GPU speedup with a decrease (i.e., an improvement) of 1.65% Top-1 classification error for VGG-16, a . × CPU speedup and . × compression with an increase of 3.65% Top-1 classification error for ResNet-50, and a . × CPU speedup and . × compression with an increase of 1.05% Top-5 classification error for GoogLeNet on ImageNet 2012. Moreover, the pruned AlexNet and VGG-16 can be further compressed by replacing the original fully-connected layers with global average pooling [50], leading to . × and × compression rates with increases of only 4.62% and 0.27% Top-5 classification error, respectively.

In addition, we also explore the generalization ability of the SSR-based compressed model in more complex tasks, i.e., domain adaptation and object detection. Experimental results demonstrate that such a compressed model achieves a 3.5× FLOPs reduction and 15.4× compression with only an increase of 1.95% Top-1 error on the task of domain adaptation, as well as a 0.3% mAP drop with a factor of 2.45× GPU speedup on the task of object detection. These two results are highly competitive compared to the state-of-the-art filter pruning methods.

II. RELATED WORK
Early works in network compression mainly focus on compressing the fully-connected layers [13]–[16], [51]. For instance, LeCun et al. [13] and Hassibi et al. [14] proposed a saliency measurement by computing the Hessian matrix of the loss function with respect to the parameters, based on which network parameters with low saliency values are pruned. Srinivas and Babu [51] explored the redundancy among neurons to remove a subset of neurons without retraining. Han et al. [15], [16] proposed a pruning scheme based on low-weight connections to reduce the total amount of parameters in CNNs. However, these methods only reduce the memory footprint and do not guarantee to reduce the computation time, since the time consumption is mostly dominated by the convolutional layers. Moreover, the above pruning schemes typically produce non-structured sparse CNNs that lack the flexibility to be applied across different platforms or libraries. For example, the Compressed Sparse Column (CSC) based weight format has to change the original format of weight storage in Caffe [52] after pruning, which cannot be well supported across different platforms.

To reduce the computation cost of convolutional layers, a popular solution is to decompose convolutional filters into a sequence of tensors with fewer parameters [25], [26], [28], [30], [31], [53]. The convolution can also be conducted in the frequency domain using DCT [32] and FFT [33], [34], or approximated by balanced decoupled spatial convolution [54] to reduce the redundancy of spatial and channel information. Besides, binarization of weights [36], [37], [55] and low-complexity weights [56] can also be employed in the convolutional layers to reduce the computation overhead with multiplication-free operations. Designing a compact filter can also accelerate the convolutional computation by replacing over-parametric filters with a compact block, such as the inception module in GoogLeNet [5], the bottleneck module in ResNet [3], the fire module in SqueezeNet [57], group convolution [4], [58], [59], and depth-wise separable convolution [6], [60], [61]. Without incurring additional overheads, our scheme can be integrated with the above schemes to further speed up the computation, since they are orthogonal to the core contribution of this paper.

In line with our work, some recent works have investigated structured pruning to remove redundant filters or feature maps, which can be categorized into either greedy-based pruning [19], [20], [42], [62], [63] or sparsity-regularization-based pruning [21]–[23], [43], [49], [64]. For the former group, the work in [20] proposed magnitude-based pruning to remove filters together with their corresponding feature maps by measuring the ℓ1-norm of filters, which is however inefficient in determining the importance of filters. He et al. [62] proposed an ℓ2-norm criterion to prune unsalient filters in a soft manner. Luo et al. [19] explored the importance of the input channels of the next convolutional layer, based on which a local channel selection is conducted to prune unimportant input channels and the corresponding filters in the current layer. However, the small local reconstruction error might lead to a large error in the global output after propagating through the deep network. A Taylor-expansion-based criterion was proposed in [42] to iteratively prune one filter and then fine-tune the pruned network, which is however prohibitively costly for deep networks. Lin et al. [63] proposed a global and dynamic pruning scheme to reduce redundant filters by greedy alternative updating. Alternatively, group-sparsity-based regularization was proposed in [21]–[23], [43], [44] to penalize unimportant parameters and prune redundant filters directly by using SGD, which is very slow in convergence for filter selection. To reduce redundancies in the model parameters, the combination of group and exclusive sparsity was proposed in [49] to promote sharing and competition for features, respectively. Different from these group-sparsity-based regularizations, we investigate the structured sparsity of filters instead, including ℓ2,1-regularization and ℓ2,0-regularization, and the proposed AULM solver alternates between promoting the structured sparsity and optimizing the recognition loss. In this way, AULM effectively overcomes the difficulty of convergence with ℓ2,1-regularization under SGD, and can also solve the NP-hard problem with ℓ2,0-regularization. Quantitatively, such an innovation achieves much faster convergence and generates much more structured filters during training. Recently, Liu et al. [64] proposed a network slimming scheme that associates a batch-normalization scaling factor with each filter channel, and imposed ℓ1-regularization on these scaling factors to identify and prune unimportant channels. Different from network slimming, we directly focus on structured filter sparsity by ℓ2,1-regularization and ℓ2,0-regularization for pruning complete filters.

III. STRUCTURED PRUNING VIA SSR

In this section, we first describe the notations and preliminaries. Next, we present the general framework of structured filter pruning. Then, we present the proposed structured sparsity regularization scheme. Afterwards, the AULM-based solver is presented to perform the corresponding optimization. Finally, we discuss how to deploy our pruning strategy on residual networks.
A. Notations and Preliminaries
Consider a CNN model consisting of L layers in total (including convolutional and fully-connected layers), which are interlaced with rectified linear units and pooling. For the convolution operation, an input tensor $\mathcal{I}^l$ of size $H_l \times W_l \times C_l$ is transformed into an output tensor $\mathcal{O}^l$ of size $H'_l \times W'_l \times C_{l+1}$ by the following linear mapping at the $l$-th layer:

$\mathcal{O}^l_{h',w',n} = \sum_{i=1}^{d_l} \sum_{j=1}^{d_l} \sum_{c=1}^{C_l} \mathcal{K}^l_{i,j,c,n} \, \mathcal{I}^l_{h_i,w_j,c},$  (1)

where the convolutional filter $\mathcal{K}^l$ at the $l$-th layer is a tensor of size $d_l \times d_l \times C_l \times C_{l+1}$. The spatial locations of the output are denoted as $h' = h_i - i + 1$ and $w' = w_j - j + 1$, respectively. For simplicity, we assume a unit stride without zero-padding and skip the biases.

In practice, many deep learning frameworks (e.g., Caffe [52] and Tensorflow [65]) compute the tensor-based convolution operator by a highly optimized matrix-by-matrix multiplication using linear algebra packages, such as Intel MKL and OpenBLAS. For example, an input tensor of size $H_l \times W_l \times C_l$ can be transformed into an input patch matrix $I^l$ of size $(d_l \times d_l \times C_l) \times H'_l W'_l$ using the im2col operator. The columns of $I^l$ are the patch elements of the input tensor, each of size $d_l \times d_l \times C_l$. Correspondingly, the convolutional filter is transformed into a filter matrix $K^l$ of size $C_{l+1} \times (d_l \times d_l \times C_l)$ using a reshape operator. Then, the output tensor can be obtained by reshaping the result matrix of size $C_{l+1} \times H'_l W'_l$, which is the product of the filter matrix $K^l$ with the input patch matrix $I^l$. In this paper, we use Tensorflow to train and test our structured sparse CNNs. Therefore, we replace the tensor-based filters $\mathcal{K}^l$ with the matrix-based $K^l$.

In addition, we consider several norms of the filter matrix $K^l$, which are used in the regularization term. The Frobenius norm of $K^l$ is defined as $\|K^l\|_F := \sqrt{\sum_{i,j} (K^l_{ij})^2}$. The sparsity-inducing $\ell_1$-norm is defined as $\|K^l\|_1 := \sum_{i=1}^{C_{l+1}} \|K^l_i\|_1$, where $K^l_i$ denotes the $i$-th row of $K^l$. In this paper, we introduce two different structured sparsity norms to adaptively select the unimportant filters to be pruned, i.e., the $\ell_{2,1}$-norm and the $\ell_{2,0}$-norm, which are defined as $\|K^l\|_{2,1} := \sum_{i=1}^{C_{l+1}} \|K^l_i\|_2$ and $\|K^l\|_{2,0} := \sum_{i=1}^{C_{l+1}} \big\|\sqrt{\sum_{j} (K^l_{ij})^2}\big\|_0$, respectively. Note that the $\ell_{2,0}$-norm is not a valid norm because it does not satisfy absolute homogeneity, i.e., $\|\alpha K^l\|_{2,0} = |\alpha| \|K^l\|_{2,0}$ does not hold for every scalar $\alpha$; the term "norm" is used here for convenience.
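To make the matrix view and the norms above concrete, the following minimal NumPy sketch (shapes and names are illustrative assumptions, not the authors' code) forms the patch matrix via im2col, computes the convolution as a matrix product, and evaluates the Frobenius, ℓ2,1 and ℓ2,0 norms of the filter matrix:

```python
import numpy as np

def im2col(x, k):
    """Unfold an (H, W, C) input into a (k*k*C, H_out*W_out) patch matrix
    (unit stride, no zero-padding), following the description above."""
    H, W, C = x.shape
    H_out, W_out = H - k + 1, W - k + 1
    cols = np.empty((k * k * C, H_out * W_out), dtype=x.dtype)
    col = 0
    for i in range(H_out):
        for j in range(W_out):
            cols[:, col] = x[i:i + k, j:j + k, :].reshape(-1)
            col += 1
    return cols

def frobenius_norm(K):
    return np.sqrt((K ** 2).sum())

def l21_norm(K):
    # Sum of the Euclidean norms of the rows (one row per output filter).
    return np.linalg.norm(K, axis=1).sum()

def l20_norm(K):
    # Number of rows with non-zero Euclidean norm, i.e. the number of filters
    # that would survive pruning (a small tolerance stands in for exact zero).
    return int((np.linalg.norm(K, axis=1) > 1e-12).sum())

# Hypothetical shapes: a 3x3 convolution mapping C_l = 4 channels to C_{l+1} = 8.
x = np.random.randn(16, 16, 4).astype(np.float32)     # input tensor I^l
K = np.random.randn(8, 3 * 3 * 4).astype(np.float32)  # filter matrix K^l
K[5] = 0.0                                            # a filter driven to zero by SSR
out = (K @ im2col(x, 3)).reshape(8, 14, 14)           # output feature maps O^l
print(l21_norm(K), l20_norm(K))                       # l20_norm(K) == 7
# Row 5 of K being all-zero means output channel 5 is constant zero, so the
# filter, its feature map, and the matching input channel of layer l+1 can be removed.
```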
B. The Framework of SSR

SSR prunes the least important filters from a trained convolutional network to reduce the computation and memory costs. Its procedure consists of three basic operations, i.e., (1) evaluate the importance of each filter, (2) prune unimportant filters, and (3) fine-tune the whole network. Differing from previous filter pruning, we adaptively select the unimportant filters to be pruned by using AULM. As shown in Fig. 1, we focus on the blue dotted boxes, which apply the AULM solver to adaptively prune unimportant filters and then remove the corresponding output feature maps and the filter channels in the next layer. We present the principal steps of SSR below (a high-level sketch of the whole loop follows the list):

1. Automatic filter selection. We design a novel objective function, which incorporates the structured sparsity constraint into the data error term, e.g., the cross-entropy between the inferred class probabilities and the ground truth. The optimization problem can be solved by the proposed AULM solver. Thus, the unimportant filters are adaptively identified during training.

2. Pruning. We prune the unimportant filters and their corresponding feature maps, together with the channels of the filters in the next layer.

3. Updating. We update the remaining filters and feature maps in the current layer, as well as the channels of the filters in the next layer.

4. Iteration. We return to Step 1 to prune the next layer.

5. Global fine-tuning. We globally fine-tune the pruned network, which recovers its discriminability and generalization ability.
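A minimal sketch of this loop is given below (illustrative pseudo-Python; `aulm_solve` and `finetune` are placeholders for the AULM solver of Sec. III-D and standard fine-tuning, not the authors' implementation):

```python
import numpy as np

def ssr_prune_network(filters, lambdas, aulm_solve, finetune):
    """High-level sketch of the SSR pipeline in Fig. 1(a).  `filters` is a list
    of per-layer filter matrices K^l (one row per filter)."""
    for l, lam in enumerate(lambdas):
        # Step 1: automatic filter selection -- AULM drives whole rows of K^l to zero.
        K_l = aulm_solve(filters[l], lam)
        # Step 2: prune zero filters together with their output feature maps.
        keep = np.linalg.norm(K_l, axis=1) > 1e-12
        # Step 3: update the remaining filters of the current layer.  The matching
        # input channels of layer l+1 would be removed here as well; the exact
        # column bookkeeping depends on the layer type and is omitted.
        filters[l] = K_l[keep]
        # Step 4: iterate -- the loop moves on to prune the next layer.
    # Step 5: global fine-tuning of the pruned network.
    return finetune(filters)
```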
C. The Objective Function of SSR
Instead of directly pruning filters by calculating their corresponding magnitudes [20], [45], SSR utilizes structured sparsity to seek the best trade-off between loss minimization and filter selection. We consider the following objective function:

$\min_{K} \; \mathcal{L}\big(Z, f(X; K)\big) + \lambda g(K).$  (2)

Here K represents the collection of all weights in the CNN. $\mathcal{L}\big(Z, f(X; K)\big)$ is the cross-entropy loss for classification, or the mean-squared error for regression, between the ground-truth labels Z and the output of the last layer of the CNN $f(X; K)$, where $D = \{X, Z\} = \{X_i, Z_i\}_{i=1}^{N}$ is a training dataset with N instances. We denote the first loss term in Eq. (2) as $L_D(K)$ for simplicity. (For simplicity, the weight-decay term, i.e., non-structured regularization applied to every weight such as the ℓ2-norm, is omitted, since it can be directly incorporated into the loss function and does not affect the result of structured sparsity regularization.) The term $g(K)$ is a structured sparsity regularization on the total size of the remaining filters in each iteration. In this paper, we consider two different kinds of structured sparsity regularizers, i.e., the ℓ2,1-norm and the ℓ2,0-norm.

The parameter λ is the penalty weight of the structured sparsity. As λ varies, the solution of Eq. (2) traces a trade-off path between performance and structured sparsity. Note that efficiently solving Eq. (2) with the ℓ2,1- or ℓ2,0-norm based structured sparse regularizer $g(K)$ is non-trivial.
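As a concrete reading of Eq. (2), the following sketch (a minimal NumPy illustration assuming a classification task; not the authors' code) evaluates the regularized objective L_D(K) + λ g(K) for a single layer with either structured regularizer:

```python
import numpy as np

def cross_entropy(probs, labels):
    # probs: (N, num_classes) softmax outputs f(X; K); labels: (N,) integer ground truth Z.
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def g(K, kind="l21"):
    row_norms = np.linalg.norm(K, axis=1)
    if kind == "l21":    # convex surrogate of the filter cardinality
        return row_norms.sum()
    elif kind == "l20":  # explicit cardinality of non-zero filters
        return float((row_norms > 1e-12).sum())
    raise ValueError(kind)

def ssr_objective(probs, labels, K, lam, kind="l21"):
    return cross_entropy(probs, labels) + lam * g(K, kind)
```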
D. The AULM Solver

The proposed AULM solver aims at solving the structure-sparsity regularized problem, and is inspired by ADMM in the field of distributed optimization. Different from ADMM, we focus on selecting and pruning unimportant filters by solving a non-convex optimization problem with the ℓ2,1-regularization and an NP-hard problem with the ℓ2,0-regularization. In particular, to handle the non-trivial regularizer in Eq. (2), we introduce a slack variable and an equality constraint as follows:

$\min_{K, F} \; L_D(K) + \lambda g(F) \quad \text{s.t.} \quad K = F.$  (3)

AULM is an iterative method that augments the Lagrangian function with quadratic penalty terms. The augmented Lagrangian associated with the constrained problem of Eq. (3) is given by:

$L(K, F, Y) = L_D(K) + \sum_{l=1}^{L} \lambda g(F^l) + \sum_{l=1}^{L} \mathrm{trace}\big(Y^{l\top}(K^l - F^l)\big) + \sum_{l=1}^{L} \frac{\rho}{2}\|K^l - F^l\|_F^2,$  (4)

where $K^l$, $F^l$ and $Y^l$ are the filter kernel, the intermediate filter with structured sparsity, and the dual variables (i.e., the Lagrange multipliers) at the $l$-th layer, respectively, and $\rho > 0$ is a penalty parameter. To minimize Eq. (4), AULM solves for each variable via a sequence of iterative computations:

1. Employ gradient descent to minimize the loss over K:
$K^{\{k+1\}} = \arg\min_{K} L\big(K, \hat F^{\{k\}}, \hat Y^{\{k\}}\big).$  (5)

2. Find the closed-form solution of the structured sparsity:
$F^{\{k+1\}} = \arg\min_{F} L\big(K^{\{k+1\}}, F, \hat Y^{\{k\}}\big).$  (6)

3. Update the dual variables using gradient ascent with a step size equal to ρ, i.e.,
$Y^{\{k+1\}} = \hat Y^{\{k\}} + \rho\big(K^{\{k+1\}} - F^{\{k+1\}}\big).$  (7)

4. Conduct an overrelaxation step for the accelerated variables $\hat F$ and $\hat Y$ with a step size equal to γ, i.e.,
$\hat Y^{\{k+1\}} = Y^{\{k+1\}} + \gamma^{\{k+1\}}\big(Y^{\{k+1\}} - Y^{\{k\}}\big),$  (8)
$\hat F^{\{k+1\}} = F^{\{k+1\}} + \gamma^{\{k+1\}}\big(F^{\{k+1\}} - F^{\{k\}}\big),$  (9)
where $\gamma^{\{k+1\}} = k/(k + r)$, with $r \geq 3$ ($r = 3$ is the standard choice).
Algorithm 1 AULM for structured pruning of a CNN

Input: Training data points D, pre-trained CNN weights K, a set of regularization factors S.
Output: The structured pruned filters K.
Initialize: dual variables $\hat Y = Y = 0$, $\hat F = F = K$, and ρ = 1.
for each λ in S do
  for each l in [1, L] do
    repeat
      Step 1: Find the estimate of $K^{l\{k+1\}}$ by solving the problem in Eq. (11) using SGD;
      Step 2: Find the structured sparse estimate of $F^{l\{k+1\}}$ with the ℓ2,1-norm or ℓ2,0-norm from Eq. (14) or Eq. (15), respectively;
      Step 3: Update the dual variables $Y^{l\{k+1\}}$ by Eq. (7);
      Step 4: Update the accelerated variables $\hat Y^{l\{k+1\}}$ and $\hat F^{l\{k+1\}}$ by Eq. (8) and Eq. (9), respectively;
    until $\|K^{l\{k+1\}} - F^{l\{k+1\}}\|_F \leq \epsilon$, or $\|F^{l\{k+1\}} - F^{l\{k\}}\|_F \leq \epsilon$.
    Prune the filters of $K^l$ whose corresponding rows of $F^l$ are zero, together with their feature maps.
  end for
  Fine-tune the pruned network.
end for
The above four steps are applied in an alternating manner. Below, we describe the details of Step 1 and Step 2 to obtain K and F. The proposed alternative optimization is summarized in Alg. 1. By overrelaxing the Lagrange-multiplier variables after each iteration, AULM not only converges faster, but also obtains a more effective solution compared to ADMM.
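A compact sketch of one AULM pass for a single layer (Algorithm 1, Steps 1-4) is given below; `sgd_minimize` and `prox_g` are assumed callables standing in for the SGD solution of Eq. (11) and the closed-form step of Eq. (14)/(15) (with λ closed over), not the authors' implementation:

```python
import numpy as np

def aulm_layer(K, sgd_minimize, prox_g, rho=1.0, r=3, max_iter=50, eps=1e-3):
    """One AULM run for a single layer."""
    F = K.copy()                  # intermediate structured-sparse filters F^l
    Y = np.zeros_like(K)          # Lagrange multipliers Y^l
    F_hat, Y_hat = F.copy(), Y.copy()
    for k in range(1, max_iter + 1):
        F_prev = F
        # Step 1 (Eq. 11): minimize L_D(K) + rho/2 * ||K - (F_hat - Y_hat/rho)||_F^2 by SGD.
        K = sgd_minimize(K, target=F_hat - Y_hat / rho, rho=rho)
        # Step 2 (Eq. 14 or 15): proximal step on T = K + Y_hat/rho.
        F = prox_g(K + Y_hat / rho, rho)
        # Step 3 (Eq. 7): dual ascent with step size rho.
        Y_new = Y_hat + rho * (K - F)
        # Step 4 (Eqs. 8-9): Nesterov-style overrelaxation with gamma = k/(k+r).
        gamma = k / (k + r)
        Y_hat = Y_new + gamma * (Y_new - Y)
        F_hat = F + gamma * (F - F_prev)
        Y = Y_new
        if np.linalg.norm(K - F) <= eps or np.linalg.norm(F - F_prev) <= eps:
            break
    return K, F
```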
1) The Updating of AULM:
Step 1.
By removing the penalty term $g(F^l)$ and completing the squares with respect to K in Eq. (4), we obtain the following problem, equivalent to Eq. (5):

$\min_{K} \; L_D(K) + \sum_{l=1}^{L} \frac{\rho}{2}\|K^l - T^l\|_F^2,$  (10)

where $T^l = F^l - \frac{1}{\rho} Y^l$. To obtain the sub-optimal filters K in the layer-wise pruning framework, we separately update $K^l$ at the $l$-th layer with the following optimization problem:

$\min_{K^l} \; L_D(K^l) + \frac{\rho}{2}\|K^l - T^l\|_F^2.$  (11)

We use Stochastic Gradient Descent (SGD) to optimize the filters $K^l$, which is a reasonable choice for such a high-dimensional optimization. The entire procedure relies mainly on the standard forward-backward pass.

Step 2.
By removing the first term $L_D(K)$ and completing the squares with respect to F in Eq. (4), we obtain the following problem, equivalent to Eq. (6):

$\min_{F} \; \sum_{l=1}^{L} \lambda g(F^l) + \sum_{l=1}^{L} \frac{\rho}{2}\|F^l - T^l\|_F^2,$  (12)

where $T^l = K^l + \frac{1}{\rho} Y^l$. We update $F^l$ layer by layer instead of directly updating all layers at once. Hence, we obtain the following optimization problem at the $l$-th layer:

$\min_{F^l} \; \lambda g(F^l) + \frac{\rho}{2}\|F^l - T^l\|_F^2.$  (13)
Based on Eq. (13), we can obtain a closed-form solution by considering the following regularizers $g(F^l)$:

• ℓ2,1-norm. A closed-form solution of Eq. (13) can be derived, which is evaluated row by row on $T^l$ [66], [67]. The $i$-th row is calculated via:
$F^l_i = \frac{\max\{\|T^l_i\|_2 - \lambda/\rho,\, 0\}}{\|T^l_i\|_2}\, T^l_i.$  (14)

• ℓ2,0-norm. A closed-form solution of Eq. (13) with this regularizer can also be derived, which is evaluated row by row on $T^l$ in Theorem 1. The $i$-th row is calculated via:
$F^l_i = \begin{cases} 0, & \lambda \geq \frac{\rho}{2}\|T^l_i\|_2^2, \\ T^l_i, & \lambda < \frac{\rho}{2}\|T^l_i\|_2^2. \end{cases}$  (15)

• ℓ1-norm. This norm is not a structural constraint, and solving Eq. (13) with it leads to unstructured sparsity. Specifically, we obtain a closed-form solution of Eq. (13) by evaluating each entry of $T^l$. The optimal solution $F^l_{ij}$ is obtained via:
$F^l_{ij} = \mathrm{sign}(T^l_{ij})\,\max\{|T^l_{ij}| - \lambda/\rho,\, 0\},$  (16)
where sign(·) is an indicator function, i.e., $\mathrm{sign}(T^l_{ij}) = 1$ if $T^l_{ij} \geq 0$ and $-1$ otherwise.

Here, we consider the ℓ1-norm to better validate the effectiveness of simultaneously accelerating the computation and compressing the memory overhead of CNNs by the aforementioned structured sparsity.
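A direct NumPy implementation of these three row-wise/element-wise updates might look as follows (a sketch assuming the ρ/2 quadratic term of Eq. (13); not the authors' code):

```python
import numpy as np

def prox_l21(T, lam, rho):
    """Row-wise closed form of Eq. (14): group soft-thresholding."""
    F = np.zeros_like(T)
    norms = np.linalg.norm(T, axis=1)
    for i, n in enumerate(norms):
        if n > 0:
            F[i] = T[i] * max(n - lam / rho, 0.0) / n
    return F

def prox_l20(T, lam, rho):
    """Row-wise closed form of Eq. (15): keep a row only if it is 'worth' lambda."""
    F = T.copy()
    sq_norms = (T ** 2).sum(axis=1)
    F[lam >= 0.5 * rho * sq_norms] = 0.0
    return F

def prox_l1(T, lam, rho):
    """Element-wise soft-thresholding of Eq. (16) (non-structured sparsity)."""
    return np.sign(T) * np.maximum(np.abs(T) - lam / rho, 0.0)
```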
Theorem 1. Let $g(F^l)$ be the regularizer $\|F^l\|_{2,0}$; then the optimal solution of Eq. (13) is given by Eq. (15), where $F^l = (F^l_1, F^l_2, \cdots, F^l_{C_{l+1}})^\top$ and $T^l = (T^l_1, T^l_2, \cdots, T^l_{C_{l+1}})^\top$.
Proof. Since $g(F^l) = \|F^l\|_{2,0} = \sum_i \big\|\sqrt{\sum_j (F^l_{ij})^2}\big\|_0$, we are interested in solving the following problem:

$\min_{F^l} \; \lambda \sum_i \Big\|\sqrt{\sum_j (F^l_{ij})^2}\Big\|_0 + \frac{\rho}{2}\|F^l - T^l\|_F^2.$  (17)

We rewrite $F^l$ and $T^l$ as $F^l = (F^l_1, F^l_2, \cdots, F^l_{C_{l+1}})^\top$ and $T^l = (T^l_1, T^l_2, \cdots, T^l_{C_{l+1}})^\top$, where $F^l_i$ and $T^l_i$ are the $i$-th rows of $F^l$ and $T^l$, respectively. Then, for each row independently, we solve the following problem, equivalent to Eq. (13):

$\min_{F^l_i} \; L(F^l_i) = \lambda \big\|\|F^l_i\|_2\big\|_0 + \frac{\rho}{2}\|F^l_i - T^l_i\|_2^2.$  (18)

On one hand, for any $F^l_i \neq 0$, $\lambda\big\|\|F^l_i\|_2\big\|_0 = \lambda$, and we obtain the optimal value $L(F^l_i) = \lambda$ when $\|F^l_i - T^l_i\|_2 = 0$, i.e., $F^l_i = T^l_i$. On the other hand, if $F^l_i = 0$, then $\lambda\big\|\|F^l_i\|_2\big\|_0 = 0$ and $L(F^l_i) = \frac{\rho}{2}\|T^l_i\|_2^2$. Therefore, when $\lambda \geq \frac{\rho}{2}\|T^l_i\|_2^2$, the optimal solution is $F^l_i = 0$, while when $\lambda < \frac{\rho}{2}\|T^l_i\|_2^2$, the optimal solution is $F^l_i = T^l_i$. Thus, by the row-wise decoupling of Eq. (13), we obtain the optimal solution of Eq. (13), which is given by Eq. (15).
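As a quick worked check of the theorem (with illustrative values), the closed form of Eq. (15) can be compared against brute-force candidates for a single row:

```python
import numpy as np

# Brute-force sanity check of Theorem 1 for a single row T_i (illustrative values).
rng = np.random.default_rng(0)
T_i, rho, lam = rng.normal(size=6), 1.0, 0.9

def obj(F_i):
    # Objective of Eq. (18): lambda if the row is non-zero, plus the quadratic term.
    return lam * float(np.any(F_i != 0)) + 0.5 * rho * np.sum((F_i - T_i) ** 2)

closed_form = T_i if lam < 0.5 * rho * np.sum(T_i ** 2) else np.zeros_like(T_i)
candidates = [np.zeros_like(T_i), T_i] + [T_i + 0.1 * rng.normal(size=6) for _ in range(1000)]
assert obj(closed_form) <= min(obj(c) for c in candidates) + 1e-12
print("closed form attains the minimum:", obj(closed_form))
```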
Fig. 2. Illustration of pruning ResNet. The red value is the number of remaining filters/channels.
2) The Convergence of AULM:
Since non-linear transformations, normalization and pooling commonly occur in CNNs, the objective function of Eq. (4) is highly non-convex, and a theoretical proof guaranteeing convergence to the global optimum is lacking. However, it is empirically shown that AULM works well when the penalty parameter ρ is sufficiently large. This is related to the quadratic term, which tends to be locally convex given a sufficiently large ρ. However, if ρ is too large, it is difficult for the iterative solver to take effect. As a trade-off, we set ρ = 1 in our implementation.

Since the objective function is highly non-convex, there is a risk of being trapped in a local optimum. In our implementation, we circumvent this difficulty by using the pre-trained weights as the initialization, which performs quite well in practice.

E. Pruning on ResNet
Unlike VGG-16 and AlexNet, there are some restrictions on pruning ResNet due to its special residual blocks. In general, each residual block with the bottleneck structure contains three convolutional layers (each followed by batch normalization and ReLU) and a shortcut connection. In order to perform the sum operation, the number of output feature maps in the last convolutional layer needs to be consistent with that of the projection shortcut layer. In particular, when the dimensions of the input/output channels are mismatched in a residual block, a linear projection is performed by the shortcut connection (see [3] for more details).

In this paper, we focus on pruning the first two layers in each residual block, as shown in Fig. 2, and we do not directly prune the last convolutional layer of each residual block. In fact, the parameters (e.g., filter channels) in the last convolutional layer are much fewer, since a large proportion of the filters in the second layer have been pruned.
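In code, this restriction amounts to a simple filter over the layer list; the sketch below uses hypothetical layer names and is only meant to illustrate which layers are candidates for pruning:

```python
def prunable_resnet_layers(layer_names):
    """Keep only conv1/conv2 of each bottleneck block as pruning candidates
    (hypothetical names such as 'block3/conv2'); conv3 and shortcut projections
    are left untouched so the residual addition stays dimension-consistent."""
    return [name for name in layer_names
            if name.endswith(("conv1", "conv2")) and "shortcut" not in name]

layers = ["block1/conv1", "block1/conv2", "block1/conv3", "block1/shortcut",
          "block2/conv1", "block2/conv2", "block2/conv3"]
print(prunable_resnet_layers(layers))
# ['block1/conv1', 'block1/conv2', 'block2/conv1', 'block2/conv2']
```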
IV. EXPERIMENTS
A. Experimental Setups
Models and datasets.
We conduct comprehensive experiments using five convolutional networks on two datasets, i.e., LeNet on MNIST [68], and AlexNet [4], VGG-16 [2], ResNet-50 [3] and GoogLeNet [5] on ImageNet [69]. We implement the proposed SSR scheme with Tensorflow [65]. All pre-trained CNNs except LeNet are taken from the Caffe model zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo). Our source code is available at https://github.com/ShaohuiLin/SSR.
TABLE I: Pruning results of LeNet on MNIST. "Num-Num-Num" is the number of remaining filters in each layer. K/M/B means thousand/million/billion in this paper, respectively.

We make use of an open source tool (https://github.com/ethereon/caffe-tensorflow) to convert the pre-trained models to the Tensorflow format and then fine-tune them to restore the accuracy (the accuracies of the models may be slightly different from those reported by other works, due to a different learning framework). We train LeNet from scratch and report the results in Table I.

Implementations.
To train the proposed SSR scheme, we use a learning rate of 0.001 with a constant dropping factor of 10 every 10 epochs. The weight decay is set to 0.0005 and the momentum is set to 0.9. To train SSR on both LeNet and AlexNet, the mini-batch size is set to 256. To train VGG-16, ResNet-50 and GoogLeNet, the mini-batch size is set to 32. After pruning, the pruned network is fine-tuned for 30 epochs, in which the learning rate is scaled by 0.1 every 10 epochs. All experiments are run on an NVIDIA GTX 1080Ti graphics card with 11GB memory and 128GB of RAM. The number of pruned filters is directly controlled by the hyper-parameter λ, i.e., the regularization factor of the structured sparsity. In our experiments, we vary λ over a set of 8 values to select the best trade-off between the compression/speedup rate and the accuracy. For r in the overrelaxation step, we set it to 3.
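For reference, the training setup described above can be collected into a small configuration sketch (the dictionary layout is illustrative; all values are taken from the text):

```python
ssr_train_config = {
    "learning_rate": 1e-3,        # dropped by a factor of 10 every 10 epochs
    "weight_decay": 5e-4,
    "momentum": 0.9,
    "batch_size": {"LeNet": 256, "AlexNet": 256, "VGG-16": 32, "ResNet-50": 32, "GoogLeNet": 32},
    "finetune_epochs": 30,        # after pruning, LR again decayed by 0.1 every 10 epochs
    "rho": 1.0,                   # AULM penalty parameter
    "overrelaxation_r": 3,        # r in gamma = k / (k + r)
    # lambda (the structured-sparsity weight) is swept over 8 values per layer to
    # trade off accuracy against compression; the exact grid is model-dependent.
}
```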
Evaluation Protocols.

For the evaluation protocols, we quantify the performance using FLOPs, the number of parameters, and the Top-1/5 classification error. To make a fair comparison, the speedup rate is measured on a single-thread Intel Xeon E5-2620 CPU and an NVIDIA GTX TITAN X GPU.
Alternative and State-of-the-art Approaches.
We first compare our filter selection criterion with three alternative criteria, which are briefly summarized as follows:

1. Random. Randomly prune filters of each layer.

2. L1-norm (filter norm) [20]. Filters with smaller magnitudes tend to be unimportant. Therefore, the ℓ1-norm of each filter, s_i = ‖K(i, :)‖_1, is chosen as its importance score. We then sort these scores and prune filters accordingly.

3. APoZ (Average Percentage of Zeros) [45]. The sparsity of each output channel after the ReLU activation can be chosen as the importance score of the corresponding filter. The score is calculated as s_i = 1 − (1/|I(:, :, i)|) Σ Σ 1(I(:, :, i) == 0), where |I(:, :, i)| is the number of entries in the i-th channel of the tensor I. The smaller s_i is, the less important the corresponding filter is.

We also compare the proposed SSR scheme to the state-of-the-art filter pruning methods, including SSL [23], ThiNet [19], Taylor expansion (TE) [42], CGES [49] and GSS [43]. Furthermore, we compare against our alternative schemes with different regularizers, i.e., SSR with the ℓ2,1-norm (termed SSR-L2,1), SSR with the ℓ2,0-norm (termed SSR-L2,0) and SSR with the ℓ1-norm (termed SSR-L1).

B. LeNet on MNIST

MNIST is a small-scale dataset, which contains a training set of 60,000 images and a test set of 10,000 images from 10 classes. Each image is a 28 × 28 gray-scale handwritten digit. LeNet on MNIST consists of 2 convolutional layers and 2 fully-connected layers, and achieves an error rate of 0.88% on MNIST. The detailed structure of LeNet is

(20)C − MP − (50)C − MP − FC − FC − S,  (19)

where (20)C is a 5 × 5 convolutional layer with 20 filters, MP is a max-pooling layer with kernel size 2, FC is a fully-connected layer (the last one has 10 nodes), and S is a softmax loss layer. Since the node number of the last layer is directly related to the number of classes, we prune the remaining layers except for the last layer (for AlexNet, VGG-16 and ResNet-50, we likewise keep the number of nodes of the final layer unchanged and prune the other remaining layers). The regularization factor λ is fixed to 6 groups, i.e., (0.1, 0.1, 0.3), (0.3, 0.3, 0.4), (0.3, 0.5, 0.5), (0.4, 0.5, 0.5), (0.5, 0.4, 0.5), and (0.5, 0.5, 0.5). We compare our method to different criteria of filter selection [20], [45] and also to other state-of-the-art methods in filter pruning [23], [42], [43], [49].

Fig. 3 shows the pruning results of LeNet based on different filter selection criteria. Compared to these criteria and baselines, both the FLOPs and the parameter size are significantly reduced at a lower error by using the proposed structured sparsity regularization, as expected. In contrast, the ℓ1-norm based filter selection performs poorly. To explain, due to the nonlinear transformations in the network, filters with small ℓ1-norm are still likely to be important and to have a large impact on the final loss function. When a large proportion of filters with small ℓ1-norm are pruned, the classification error can increase significantly. Note that the simplest scheme, random filter selection, works reasonably well, which is due to the self-recovery ability of distributed representations [70]. However, the random criterion is not robust in practice and may lead to a large accuracy loss when applied to compress large networks (e.g., VGG-16), as presented in Table III and Table IV. For APoZ, the sparsity of feature maps is quite reasonable for pruning redundant filters, owing to the self-sparsity of the pre-trained model with ReLU activation. In contrast, compared to these filter pruning methods, our SSR-L2,1 achieves the best performance, with an increase of 0.29%
classification error, a . × FLOPs reduction, and a . × parameter reduction. SSR-L2,0 achieves relatively consistent results with SSR-L2,1. In particular, at a significantly higher compression ratio (i.e., a . × parameter reduction ratio), SSR-L2,0 achieves much better performance than SSR-L2,1, as the ℓ2,0-norm directly measures the cardinality of the filter structure.

Fig. 3. The results of different evaluation criteria on FLOPs and parameter numbers for compressing LeNet.

The quantitative performance for compressing LeNet using the proposed scheme is further shown in Table I. First, we found that the FLOPs do not directly reflect the actual speedup ratio in online inference. For instance, compared to LeNet, the proposed SSR-L2,1 reaches a . × FLOPs reduction, with a × actual speedup ratio. To explain, memory accesses, both inter-layer and intra-layer, can significantly increase the computation cost. Second, TE [42] achieves the best trade-off between the speedup ratio and the classification error among all the baselines (e.g., SSL [23], CGES [49] and GSS [43]), as it inherits the effectiveness of filter selection by estimating the loss increase of pruning each filter with a Taylor expansion. Note that CGES+ [49] is the combination of iterative pruning [15] and CGES for further compressing LeNet, which achieves a 0.04% increase of classification error using 10% of the parameters of the full network. Compared to CGES+, the proposed SSR-L1 with the ℓ1-norm achieves a significantly higher compression ratio of . × (i.e.,
9K parameters), with only an increase of 0.12% Top-1 error. However, there are no structural constraints on the filters/weights, which leads to a very low speedup under the same hardware/software evaluation environment (to make a fair comparison, we evaluate the actual speedup of SSR-L1 without special hardware/software accelerators). Third, by using the proposed AULM with structured filter sparsity to adaptively select and prune the redundant filters, SSR-L2,1 achieves the best trade-off between the speedup/compression ratio and the classification error. For example, the Top-1 error is only increased by 0.18% with a . × speedup and . × compression.

C. ImageNet
ImageNet 2012 contains over 1 million training images from 1,000 object classes, as well as a validation set of 50,000 images. Each image is rescaled to a size of 256 × 256, and a 224 × 224 image is randomly cropped from each rescaled image (except for AlexNet, which uses a 227 × 227 crop) and mirrored for data augmentation.
TABLE II: The number of parameters and FLOPs in both convolutional and fully-connected layers, computational time on CPU (ms) and GPU (ms), and classification error rates (Top-1/5 Err.) of AlexNet, VGG-16, ResNet-50 and GoogLeNet.
We test the pruned network on the validation set using single-view testing (the central patch only) to evaluate the classification accuracy.

We implement the proposed SSR scheme on four CNNs, i.e., AlexNet, VGG-16, ResNet-50 and GoogLeNet. AlexNet contains 5 convolutional layers and 3 fully-connected layers, VGG-16 contains 13 convolutional layers and 3 fully-connected layers, ResNet-50 contains 54 convolutional layers with 16 residual blocks, and GoogLeNet contains 21 convolutional layers with 9 inception blocks. Unlike AlexNet and VGG-16, ResNet-50 and GoogLeNet use global average pooling over the last convolutional layer to reduce the number of parameters, which removes the 3 fully-connected layers. The computation time and storage overhead of the four networks, together with their classification errors, are shown in Table II.

Sensitivity Analysis.
We explore the sensitivity of each layer in the network to guide the filter pruning for that layer. Taking AlexNet and VGG-16 for instance, most layers are robust to pruning, as shown in Fig. 4. Still, there exists a small number of sensitive layers, which are located at the top convolutional layers. For example, it is sensitive to prune the last two convolutional layers for AlexNet and the last 3 convolutional layers (i.e., Conv5_1, Conv5_2, Conv5_3) for VGG-16. To explain, these top layers often carry the high-level semantic information that is necessary for maintaining the classification accuracy. In addition, pruning some filters in specific layers (e.g., certain convolutional layers of VGG-16 when λ is set to 0.2) yields a slightly better accuracy than the original network, which reveals that redundant filters can reduce the discriminability of the original network. Therefore, we use a different λ for each layer to reduce the impact of these sensitive layers (i.e., we set a large λ for insensitive layers and a small one for sensitive layers).
Fig. 4. Sensitivity of pruning filters in each layer. Left: the sensitivity of AlexNet. Right: the sensitivity of VGG-16.
TABLE III: Pruning results of AlexNet.

Quantitative Results.
Since the fully-connected layers occupy over 90% of the storage in AlexNet and VGG-16, we replace the original fully-connected layers with global average pooling (GAP) [50] to further compress the whole network. "X-GAP" refers to the model using GAP after all convolutional layers have been pruned via method "X" (e.g., ThiNet, SSR). The "X-GAP" models are fine-tuned with the same fine-tuning parameters as described in Sec. IV-A.

As shown in Table III, we prune AlexNet with three groups of λ, i.e., (0.2, 0.4, 0.5, 0.6, 0.1, 0.1, 0.3), (0.4, 0.5, 0.7, 0.6, 0.1, 0.3, 0.3) and (0.2, 0.4, 0.5, 0.6, 0.1, GAP). Compared to other filter pruning methods [20], [23], [42], [45], our SSR scheme achieves the best trade-off between the speedup/compression rate and the Top-1/5 classification error. First, we compare SSR-L2,1 to the three alternative selection criteria (i.e., random, L1-norm [20] and APoZ [45]) with the same number of pruned filters in each layer (to make a fair comparison, the number of filters pruned in each layer by the three alternative criteria is the same as for SSR-L2,1). SSR-L2,1 achieves the lowest Top-1/5 classification error. To explain, all these selection criteria are naive methods that prune filters based on statistical properties, resulting in a large approximation error in each layer that is propagated throughout the network. Second, by directly employing SGD with filter-wise sparsity to solve the SSL problem, the redundant filters cannot be pruned efficiently, which only achieves a . × CPU speedup with an increase of 1.32% Top-1 error (this differs from the result reported by Wen et al. [23] due to the different fine-tuning framework and deep learning library). Third, the work in [42] uses Taylor expansion (TE) to approximate the loss increase, which is similar to ours but with a totally different selection criterion. Quantitatively, it is time-consuming to prune one filter and then fine-tune the network iteratively. In contrast, SSR-L2,1 achieves the lowest Top-1 error increase of 1.72% and Top-5 error increase of 1.38%, while removing a much larger amount of parameters. Fourth, we also compare the two kinds of structured sparsity regularization (i.e., ℓ2,1-regularization and ℓ2,0-regularization) with element-wise sparsity regularization (i.e., ℓ1-regularization), and observe that SSR-L1 significantly reduces the memory storage to only 22.6M parameters, about half of SSR-L2,1 and SSR-L2,0, with a comparable error increase. However, SSR-L1 does not boost the inference efficiency, as element-wise sparsity cannot significantly reduce the number of filters, so its computation is on par with the full network (see Sec. IV-D for more detailed discussions). As for structured sparsity regularization, compared to SSR-L2,1, SSR-L2,0 achieves a lower error increase (i.e., 1.57% vs. 1.72%) with a higher compression and speedup, i.e., 45.9M parameters and a . × CPU speedup vs. 48M parameters and a . × CPU speedup. Moreover, to further compress AlexNet using SSR, GAP makes the pruned network more compact, leading to a . × compression rate (i.e., 2.5M parameters).

For VGG-16, we summarize the performance comparison with [19], [20], [42], [45] in Table IV. In the experiments, λ for the first 10 convolutional layers is set to (0.5, 0.4, 0.5, 0.3, 0.5, 0.3, 0.3, 0.4, 0.5, 0.3), with large values, while for the last 3 convolutional layers it is set to 0.1; λ is set to (0.1, 0.6) in the fully-connected layers. First, instead of directly pruning filters, ThiNet [19] conducts a greedy local channel selection, while TE [42] uses a greedy feature-map selection to prune the feature maps. Compared to ThiNet, TE achieves a higher GPU speedup (i.e., . × vs. . × in ThiNet), but has a significant increase in Top-5 error, which affects the discriminative ability of the compressed model.
TABLE IV: Pruning results of VGG-16.
For the three alternative criteria of filter selection, APoZ [45] achieves the lowest increase in both Top-1 and Top-5 classification error at the same factor of GPU and CPU speedup, e.g., a . × CPU speedup and a . × GPU speedup with a decrease of 0.64% Top-1 classification error and 0.43% Top-5 classification error, respectively. Compared to all baselines with fully-connected layers, SSR-L2,1 achieves the best trade-off between classification error and speedup, e.g., a decrease of 1.46% Top-1 error at a factor of . × GPU speedup. To explain, the relationship between the final output and the local filters is directly considered in SSR, which can therefore adaptively prune redundant filters that have less impact on the global outputs. Second, by replacing the fully-connected layers with GAP, the network is further compressed by a large rate, e.g., ThiNet-GAP achieves a . × parameter reduction (i.e., 9.5M parameters), and SSR-L2,1-GAP achieves a . × GPU speedup and a × parameter reduction, with only an increase of 0.52% Top-1 error. In addition, compared to SSR-L2,1-GAP, SSR-L2,0-GAP achieves a comparable result, with a factor of . × GPU speedup and . × parameter reduction at an increase of 0.83% Top-1 error.

TABLE V: Pruning results of ResNet-50.

We also apply the proposed SSR to multi-branch networks, e.g., ResNet-50 and GoogLeNet. The results of SSR on ResNet-50 are shown in Table V. We prune ResNet-50 with
two groups of λ. In the first 7 residual blocks of the first group, the hyper-parameter λ of each residual block is set to (0.4, 0.3), while λ of each residual block is set to (0.3, 0.4) in the remaining residual blocks (i.e., 9 residual blocks). In the second group, the corresponding λ of each residual block is increased by 0.1 relative to the first group. Note that we skip the first convolutional layer, which is quite sensitive to pruning. In addition, we prune the first two layers in each residual block and leave the output and projection shortcuts of the residual block unchanged, as shown in Fig. 2. We found that SGD in SSL [23] is not very effective for solving Eq. (2) with ℓ2,1-regularization, which leads to a significant error increase with a limited FLOPs reduction. For the three alternative criteria of filter selection, APoZ [45] still achieves the lowest error increase at the same computation complexity and memory storage. Although ThiNet [19] achieves the best performance among these state-of-the-art baselines [20], [23], [45], it requires additional samples (i.e., new input/output pairs from hidden layers) in each layer to find the optimal channels for pruning, which is not only expensive in terms of storing additional training data, but also time-consuming in collecting them during offline training. Moreover, ThiNet only reduces the reconstruction error of each layer, ignoring the correlation between local filter pruning and the global output, which leads to an accumulation of reconstruction error. Compared to ThiNet, without supervised information from hidden layers, SSR-L2,0 employs the original ImageNet dataset to improve the classification accuracy at the same speedup ratio, and also achieves a higher parameter reduction ( . × vs. . × ).

TABLE VI: Pruning results of GoogLeNet.

For GoogLeNet, we prune all convolutional filters with high computation complexity, i.e., filter sizes of 3 × 3 and 5 × 5. λ is set to 0.5 and 0.3 in the first three inception blocks and the remaining inception blocks, respectively. We skip the first convolutional layer and kernels of size 1 × 1 for effective pruning. As shown in Table VI, compared to the three alternative criteria of filter selection, SSR-L2,1 achieves a lower error increase at the same pruning level. By replacing the ℓ2,1-regularization with the ℓ2,0-regularization, SSR-L2,0 achieves the best performance, with an increase of 1.05% Top-5 error at a . × compression and a . × CPU speedup.
D. Analysis
Efficiency Analysis.
We first analyze the empirical efficiency of AULM and ADMM. As shown in Fig. 5, we found that our AULM helps to learn a more compact network with almost the same error using far fewer epochs compared to ADMM. For instance, AULM achieves 0.9% error with only 8 epochs and 7% of the parameters in the Conv1 layer, while ADMM achieves almost the same error but requires 11 epochs and 20% of the parameters.
Fig. 5. Further analysis of the ℓ2,1-regularization by our AULM and ADMM on the MNIST dataset. (a) Percentage of parameters used vs. epoch. (b) Top-1 error vs. epoch.
For instance, AULM achieves 0.9% error with only 8 epochs and 7% of the parameters in the Conv1 layer, whereas ADMM reaches almost the same error only after 11 epochs and with 20% of the parameters. This faster convergence toward structured filter sparsity is due to applying Nesterov's acceleration to over-relax the variables (i.e., the structured sparse filters and the dual variables), which speeds up the alternating optimization. Second, we study the influence of the optimization strategy (i.e., SSL vs. AULM) on filter pruning. In SSL [23], the structured sparsity of filters under the ℓ2,1-norm is learned by directly solving Eq. (2) with SGD, which differs from the proposed AULM-L2,1. We take the first three layers of AlexNet as an example, since they occupy a significant proportion of the computational overhead.

Fig. 6 presents the convergence of the different solvers on Conv1, Conv2 and Conv3, respectively. As shown in Fig. 6(a), AULM-L2,1 reduces the training loss faster than SSL and also reaches a lower final training loss, which leads to more effective training for structured pruning. Moreover, with AULM-L2,1 instead of SSL, the number of pruned filters is consistently larger, the classification error is lower, and the convergence is faster, especially in the first convolutional layer (convergence after 8 epochs with AULM-L2,1 vs.
20 epochs with SSL). Therefore, the alternating optimization in AULM is more effective for pruning the network than SGD in SSL. Furthermore, we compare the two structured regularizers (i.e., the ℓ2,1-norm and the ℓ2,0-norm) to explicitly analyze their convergence under AULM. (We do not compare SGD against AULM under ℓ2,0-regularization, since SGD cannot solve this NP-hard problem.) As shown in Figs. 6(b) and 6(c), compared to AULM-L2,1, AULM-L2,0 not only generates structured filters significantly faster, but also achieves a lower Top-5 error. Interestingly, the number of structured filters remains almost constant during training, which is due to the closed-form solution of Eq. (15): it yields nearly the same structured sparsity of the intermediate filters immediately after the first update.
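To make the two mechanisms above more concrete, namely the Nesterov-style over-relaxation of the auxiliary and dual variables and the closed-form group thresholding behind Eq. (15), the following NumPy sketch illustrates both under our own simplifying assumptions (a single weight matrix whose rows stand in for filters, illustrative thresholds, and function names of our choosing); it is not the authors' AULM implementation.

```python
# Schematic sketch (our own simplification, not the paper's AULM code):
# group-wise proximal steps and Nesterov-style over-relaxation on a 2-D
# weight matrix whose rows play the role of filters.
import numpy as np

def prox_group_l21(W, tau):
    """Group soft-thresholding: shrink each row's l2-norm by tau (l2,1 prox)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.clip(1.0 - tau / np.maximum(norms, 1e-12), 0.0, None)
    return W * scale

def prox_group_l20(W, tau):
    """Group hard-thresholding: zero out rows whose l2-norm is below tau
    (a closed-form step, which is why the sparsity pattern stabilizes early)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return np.where(norms >= tau, W, 0.0)

def nesterov_overrelax(Z_new, Z_old, t_old):
    """Nesterov-style extrapolation of an iterate, applied here to the
    auxiliary/dual variables to accelerate the alternating updates."""
    t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_old ** 2))
    Z_relaxed = Z_new + ((t_old - 1.0) / t_new) * (Z_new - Z_old)
    return Z_relaxed, t_new

W = np.random.randn(8, 27)                 # 8 "filters", 27 weights each
Z, Z_prev, t = prox_group_l20(W, 1.0), W.copy(), 1.0
Z, t = nesterov_overrelax(Z, Z_prev, t)    # over-relaxed auxiliary variable
print(np.count_nonzero(np.linalg.norm(Z, axis=1)))  # number of surviving filters
```

The hard-thresholding step fixes the set of surviving rows in one shot, which is consistent with the observation above that the structured sparsity barely changes after the first update.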
Visualization.
To verify the effectiveness of structured filter sparsity, we visualize the filters in the first convolutional layer of AlexNet learned by SSR with three different regularizers (i.e., the ℓ1-norm, the ℓ2,1-norm and the ℓ2,0-norm), as shown in Fig. 7. Although SSR with ℓ1-regularization yields a large number of sparse filter elements, it cannot remove entire filters, which leads to a very limited speedup without specialized software. In contrast, SSR with ℓ2,1-regularization or ℓ2,0-regularization removes complete filters, which directly accelerates network inference. Compared to ℓ2,1-regularization, SSR with ℓ2,0-regularization achieves a lower Top-5 error with more structured sparse filters, because the ℓ2,0-regularization explicitly imposes the natural constraint for filter selection.
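To illustrate why whole-filter sparsity translates directly into faster inference without specialized sparse kernels, the following PyTorch sketch (our own example with hypothetical function names, not the paper's code) removes zeroed output filters from one convolutional layer together with the corresponding input channels of the following layer, yielding a physically smaller network.

```python
# Illustrative sketch (not the authors' code): turn filter-level zeros into a
# physically smaller pair of conv layers, so the speedup needs no sparse kernels.
import torch
import torch.nn as nn

def prune_conv_pair(conv1: nn.Conv2d, conv2: nn.Conv2d):
    """Remove output filters of conv1 whose weights are entirely zero, together
    with the matching input channels of conv2."""
    with torch.no_grad():
        norms = conv1.weight.view(conv1.out_channels, -1).norm(dim=1)
        keep = torch.nonzero(norms > 0, as_tuple=False).squeeze(1)

        new_conv1 = nn.Conv2d(conv1.in_channels, len(keep),
                              conv1.kernel_size, conv1.stride, conv1.padding,
                              bias=conv1.bias is not None)
        new_conv1.weight.copy_(conv1.weight[keep])
        if conv1.bias is not None:
            new_conv1.bias.copy_(conv1.bias[keep])

        new_conv2 = nn.Conv2d(len(keep), conv2.out_channels,
                              conv2.kernel_size, conv2.stride, conv2.padding,
                              bias=conv2.bias is not None)
        new_conv2.weight.copy_(conv2.weight[:, keep])
        if conv2.bias is not None:
            new_conv2.bias.copy_(conv2.bias)
    return new_conv1, new_conv2

c1, c2 = nn.Conv2d(3, 64, 3, padding=1), nn.Conv2d(64, 128, 3, padding=1)
c1.weight.data[10:20].zero_()            # pretend the regularizer zeroed ten filters
p1, p2 = prune_conv_pair(c1, c2)
print(p1.out_channels, p2.in_channels)   # both become 54
```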
(a) SSR-L1, Error: 21.26%, Sparsity: 58.67%; (b) SSR-L2,1, Error: 21.26%, Sparsity: 13.54%; (c) SSR-L2,0, Error: 21.16%, Sparsity: 14.58%.
Fig. 7. Visualizations of the first convolutional layer when pruning the whole AlexNet using SSR with different regularizers. (a) SSR with ℓ1-regularization results in element-wise sparse filters. (b) SSR with ℓ2,1-regularization results in filter-wise removal (filters in dark color). (c) SSR with ℓ2,0-regularization obtains more structured sparse filters and achieves a lower Top-5 error than SSR-L2,1.

TABLE VII: COMPARISON OF DIFFERENT COMPRESSED MODELS FOR FINE-GRAINED CLASSIFICATION ON CUB-200.

E. Generalization Ability for Transfer Learning

SSR has demonstrated its effectiveness in simultaneously accelerating and compressing CNNs on the MNIST and ImageNet 2012 classification tasks. We further investigate the generalization ability of the compressed models in transfer learning, including domain adaptation and object detection. For ease of discussion, we take VGG-16 as our baseline model.
1) Domain Adaptation:
Since SSR does not change the network structure, a model pruned on ImageNet can be easily transferred to other domains. To evaluate the domain adaptation ability of the compressed model, we consider a practical application in which a pruned model trained on ImageNet is transferred to a smaller, domain-specific dataset. To this end, we select the public CUB-200 dataset [71] for fine-grained classification. CUB-200 contains 11,788 images of 200 bird species, split into 5,994 training images and 5,794 testing images. For a fair comparison, we fine-tune the models compressed on ImageNet by Random, L1, APoZ and SSR with the same hyper-parameters and number of epochs (implementation details are available at https://github.com/Roll920/fine-tune-avg-vgg16). The results of fine-grained classification are shown in Table VII.

The pre-trained VGG-16 is first fine-tuned on the CUB-200 dataset, which is an effective way to transfer the model directly from the ImageNet domain to the CUB-200 domain. As shown in Table VII, the pre-trained VGG-16 achieves the lowest error (27.60% Top-1 error) but incurs a huge memory cost and slow inference (i.e., 135.1M parameters and 15.5B FLOPs). We then fine-tune the compressed networks, which were previously compressed in the ImageNet domain by Random, L1 [20], APoZ [45] and SSR, respectively. Compared to Random, L1 and APoZ, the model compressed by SSR-L2,1 achieves the best performance, with an increase of only 1.1% Top-1 error at 124.6M parameters and 4.5B FLOPs. Furthermore, we also fine-tune the compressed models in which the traditional fully-connected layers are replaced with global average pooling (GAP). This yields a far more compact model with 8.8M parameters and 4.4B FLOPs, i.e., 15.4× lower memory cost and a theoretical 3.5× inference speedup over VGG-16. Compared to the three alternative selection criteria, SSR-L2,1-GAP achieves the lowest Top-1 error of 29.55%, an increase of only 1.95% over the pre-trained VGG-16.
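The replacement of the fully-connected classifier with global average pooling described above can be sketched as follows. This is a hedged illustration of the general idea (the 200-way head matches CUB-200, but the module wiring and the assumption of 512 backbone output channels are ours), not the code released at the repository linked above.

```python
# Hedged sketch (not the released fine-tuning code): replace VGG-16's
# fully-connected classifier with global average pooling and a 200-way
# linear head for CUB-200 fine-tuning.
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights=None)          # in practice, load the pruned/pre-trained weights here
features = vgg.features                    # convolutional backbone (possibly filter-pruned)

gap_model = nn.Sequential(
    features,
    nn.AdaptiveAvgPool2d(1),               # global average pooling over each feature map
    nn.Flatten(),
    nn.Linear(512, 200),                   # 512 assumes the unpruned VGG-16 width; 200 bird classes
)

x = torch.randn(2, 3, 224, 224)
print(gap_model(x).shape)                  # torch.Size([2, 200])
```

In practice, a pruned backbone may expose fewer than 512 output channels, in which case the linear head's input size should follow the pruned width.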
2) Object Detection:
We also evaluate the transfer-learning ability of the VGG-16 models compressed by Random, L1 [20], APoZ [45] and SSR with ℓ2,1-regularization when deployed in Faster R-CNN [11] for object detection. The PASCAL VOC 2007 object detection benchmark, which contains about 5K training/validation images and 5K testing images, is used to evaluate the models by mean Average Precision (mAP). In our experiments, we first compress VGG-16 by Random, L1, APoZ and SSR-L2,1 on ImageNet, and then use the compressed models as pre-trained backbones for Faster R-CNN with the default training settings.

The actual running time of Faster R-CNN with the original VGG-16 is 189 ms/image on a Titan X GPU, whereas the compressed model reduces the detection time to 77 ms/image, i.e., a 2.45× actual acceleration on the Titan X. As shown in Table VIII, filter pruning with the random criterion works surprisingly well for object detection, which we attribute to the self-recovery ability gained when training on the specific PASCAL VOC 2007 dataset. In contrast, APoZ appears unsuitable for object detection, yielding the lowest mAP (a drop of 1.7%). Compared to the three alternative pruning criteria, SSR-L2,1 achieves the best performance, delivering the 2.45× speedup on the Titan X with only a 0.3% mAP drop, which remains very practical for real-world applications.

TABLE VIII: THE SPEEDUP FOR FASTER R-CNN DETECTION.
Device | Method | Speedup | mAP | ∆ mAP
Titan X GPU | VGG-16 Baseline | – | 68.7 | –
Titan X GPU | Random | 2.45× | 67.9 | 0.8
Titan X GPU | L1 [20] | 2.45× | 68.1 | 0.6
Titan X GPU | APoZ [45] | 2.45× | 67.0 | 1.7
Titan X GPU | SSR-L2,1 | 2.45× | 68.4 | 0.3

V. CONCLUSION
In this paper, we have proposed a unified filter pruning scheme, termed structured sparsity regularization (SSR), for CNN acceleration and compression. SSR captures the relationship between the global output and local filter pruning, and formulates a novel optimization problem based on two structured-sparsity regularizers, the ℓ2,1-norm and the ℓ2,0-norm, both of which can be efficiently solved by a novel Alternative Updating with Lagrange Multipliers (AULM) scheme. The proposed AULM quickly generates structured filters and adaptively prunes redundant ones. We have demonstrated that the proposed SSR scheme achieves superior performance over the state-of-the-art filter pruning methods [19], [20], [23], [42], [45]. We have further evaluated the effectiveness of the models compressed by SSR when applied to domain adaptation and object detection.

In the future, we would like to investigate the specific design of filter pruning for ResNet and DenseNet, including (1) how to effectively prune the shortcut connections in residual blocks, (2) how to design a more effective strategy for accelerating batch normalization and pooling layers, which are left unexploited in existing works, and (3) how to design a novel filter selection layer to prevent the dimension mismatch between different dense blocks caused by the dense connectivity of DenseNet.

ACKNOWLEDGMENT
This work is supported by the National Key R&D Program of China (No. 2017YFC0113000, No. 2016YFB1001503 and No. 2018YFB1107400), the Natural Science Foundation of China (No. U1705262, No. 61772443, No. 61402388, No. 61572410 and No. 61871470), the Post-Doctoral Innovative Talent Support Program under Grant BX201600094, the China Post-Doctoral Science Foundation under Grant 2017M612134, and the Natural Science Foundation of Fujian Province, China (No. 2017J01125).
REFERENCES

[1] X. Chen, J. Weng, W. Lu, J. Xu, and J. Weng, "Deep manifold learning combined with convolutional neural networks for action recognition," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 9, pp. 3938–3952, 2018.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[6] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in CVPR, 2017.
[7] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in CVPR, 2017.
[8] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, 2017.
[9] R. Girshick, "Fast R-CNN," in ICCV, 2015.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014.
[11] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015.
[12] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in ECCV, 2016.
[13] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, "Optimal brain damage," in NIPS, vol. 2, 1989, pp. 598–605.
[14] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in NIPS, 1993.
[15] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," CoRR, abs/1510.00149, vol. 2, 2015.
[16] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural networks," in NIPS, 2015.
[17] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 2016, pp. 243–254.
[18] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, "Sparse convolutional neural networks," in CVPR, 2015.
[19] J. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in ICCV, 2017.
[20] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," in ICLR, 2017.
[21] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," arXiv preprint arXiv:1512.08571, 2015.
[22] V. Lebedev and V. Lempitsky, "Fast convnets using group-wise brain damage," in CVPR, 2016.
[23] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in NIPS, 2016.
[24] M. Denil, B. Shakibi, L. Dinh, N. de Freitas et al., "Predicting parameters in deep learning," in NIPS, 2013.
[25] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned CP-decomposition," arXiv preprint arXiv:1412.6553, 2014.
[26] S. Lin, R. Ji, C. Chen, and F. Huang, "ESPACE: Accelerating convolutional neural networks via eliminating spatial & channel redundancy," in AAAI, 2017.
[27] S. Lin, R. Ji, X. Guo, and X. Li, "Towards convolutional neural networks compression via global error reconstruction," in IJCAI, 2016.
[28] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," arXiv preprint arXiv:1405.3866, 2014.
[29] C. Tai, T. Xiao, Y. Zhang, X. Wang et al., "Convolutional neural networks with low-rank regularization," in ICLR, 2016.
[30] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in NIPS, 2014.
[31] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," arXiv preprint arXiv:1511.06530, 2015.
[32] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu, "CNNpack: Packing convolutional neural networks in the frequency domain," in NIPS, 2016.
[33] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," arXiv preprint arXiv:1312.5851, 2013.
[34] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, "Fast convolutional nets with fbfft: A GPU performance evaluation," arXiv preprint arXiv:1412.7580, 2014.
[35] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," arXiv preprint arXiv:1412.6115, 2014.
[36] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in NIPS, 2015.
[37] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in ECCV, 2016.
[38] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu, "Quantized CNN: A unified approach to accelerate and compress convolutional networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4730–4742, 2018.
[39] T.-J. Yang, Y.-H. Chen, and V. Sze, "Designing energy-efficient convolutional neural networks using energy-aware pruning," in CVPR, 2017.
[40] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010.
[41] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[42] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, "Pruning convolutional neural networks for resource efficient inference," in ICLR, 2017.
[43] A. Torfi and R. A. Shirvani, "Attention-based guided structured sparsity of deep neural networks," arXiv preprint arXiv:1802.09902, 2018.
[44] J. Wang, C. Xu, X. Yang, and J. M. Zurada, "A novel pruning algorithm for smoothing feedforward neural networks based on group lasso method," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 2012–2024, 2018.
[45] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, "Network trimming: A data-driven neuron pruning approach towards efficient deep architectures," arXiv preprint arXiv:1607.03250, 2016.
[46] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[47] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[48] Y. E. Nesterov, "A method for solving the convex programming problem with convergence rate O(1/k²)," in Dokl. Akad. Nauk SSSR, vol. 269, 1983, pp. 543–547.
[49] J. Yoon and S. J. Hwang, "Combined group and exclusive sparsity for deep neural networks," in ICML, 2017.
[50] M. Lin, Q. Chen, and S. Yan, "Network in network," in ICLR, 2014.
[51] S. Srinivas and R. V. Babu, "Data-free parameter pruning for deep neural networks," arXiv preprint arXiv:1507.06149, 2015.
[52] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
[53] H. Huang and H. Yu, "LTNN: A layerwise tensorized compression of multilayer neural network," IEEE Transactions on Neural Networks and Learning Systems, 2018.
[54] G. Xie, K. Yang, T. Zhang, J. Wang, and J. Lai, "Balanced decoupled spatial convolution for CNNs," IEEE Transactions on Neural Networks and Learning Systems, 2019.
[55] M. Courbariaux and Y. Bengio, "BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[56] R. J. Cintra, S. Duffner, C. Garcia, and A. Leite, "Low-complexity approximate convolutional neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 5981–5992, 2018.
[57] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," in ICLR, 2017.
[58] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in CVPR, 2018.
[59] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger, "CondenseNet: An efficient DenseNet using learned group convolutions," in CVPR, 2018.
[60] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[61] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in CVPR, 2018.
[62] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft filter pruning for accelerating deep convolutional neural networks," arXiv preprint arXiv:1808.06866, 2018.
[63] S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang, "Accelerating convolutional networks via global & dynamic filter pruning," in IJCAI, 2018, pp. 2425–2432.
[64] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in ICCV, 2017.
[65] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[66] S. F. Cotter, B. D. Rao, K. Engan, and K. Kreutz-Delgado, "Sparse solutions to linear inverse problems with multiple measurement vectors," IEEE Transactions on Signal Processing, vol. 53, no. 7, pp. 2477–2488, 2005.
[67] T. Goldstein, C. Studer, and R. Baraniuk, "A field guide to forward-backward splitting with a FASTA implementation," arXiv preprint arXiv:1411.3406, 2014.
[68] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[69] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[70] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.