Towards Compact ConvNets via Structure-Sparsity Regularized Filter Pruning
Shaohui Lin, Student Member, IEEE, Rongrong Ji*, Senior Member, IEEE, Yuchao Li, Cheng Deng, Member, IEEE, and Xuelong Li, Fellow, IEEE
Abstract—The success of convolutional neural networks (CNNs) in computer vision applications has been accompanied by a significant increase in computation and memory costs, which prohibits their usage in resource-limited environments such as mobile or embedded devices. To this end, research on CNN compression has recently been emerging. In this paper, we propose a novel filter pruning scheme, termed structured sparsity regularization (SSR), to simultaneously speed up the computation and reduce the memory overhead of CNNs, which can be well supported by various off-the-shelf deep learning libraries. Concretely, the proposed scheme incorporates two different regularizers of structured sparsity into the original objective function of filter pruning, which fully coordinates the global outputs and local pruning operations to adaptively prune filters. We further propose an Alternative Updating with Lagrange Multipliers (AULM) scheme to efficiently solve the resulting optimization. AULM follows the principle of ADMM and alternates between promoting the structured sparsity of CNNs and optimizing the recognition loss, which leads to a very efficient solver ( . × compared to the most recent work that directly solves the group-sparsity-based regularization). Moreover, by imposing the structured sparsity, online inference is extremely memory-light, since the number of filters and the output feature maps are simultaneously reduced. The proposed scheme has been deployed to a variety of state-of-the-art CNN structures, including LeNet, AlexNet, VGGNet, ResNet and GoogLeNet, over different datasets. Quantitative results demonstrate that the proposed scheme achieves superior performance over the state-of-the-art methods. We further demonstrate the proposed compression scheme for the task of transfer learning, including domain adaptation and object detection, which also shows exciting performance gains over the state-of-the-art filter pruning methods.

Index Terms—Convolutional neural networks, structured sparsity, CNN acceleration, CNN compression.
I. INTRODUCTION

In recent years, convolutional neural networks (CNNs) have achieved great success on a variety of computer vision applications, ranging from action recognition [1] and object recognition [2]–[8] to object detection [9]–[12]. One essential foundation of such success lies in the gigantic amount of model parameters that accompany the large-scale training data. For instance, ResNet-152 [3] has 57 million parameters, costs 230MB of storage, and requires 11.3 billion FLOPs to classify one image with a resolution of 224 × 224. Under such a circumstance, these models cannot be directly deployed to scenarios that require fast processing or compact storage, such as mobile systems or embedded devices.

Substantial efforts have been devoted to the speedup and compression of CNNs for efficient online inference. Among the existing methods, network pruning has attracted increasing attention recently, due to its ability to reduce an overwhelming amount of parameters. As one of the earliest works, LeCun et al. [13] proposed an Optimal Brain Damage algorithm to prune the network by reducing the number of connections with a theoretically-justified saliency measurement. Later, Hassibi and Stork [14] proposed an Optimal Brain Surgeon algorithm to remove unimportant parameters, which are determined by the second-order derivatives of their weights. Recently, Han et al. [15], [16] proposed to prune parameters with small magnitudes to reduce the network size. However, the above pruning schemes typically produce sparse CNNs with non-structured random connections, which cause irregular memory access, i.e., a complex storage structure that adversely impacts the efficiency of accessing CNN models in memory. Moreover, such non-structured sparse CNNs cannot be supported by off-the-shelf libraries, and thus need specialized hardware [17] or software [18] designs to improve their efficiency in online inference.

To address the shortcoming of such non-structured connections, filter pruning is regarded as a promising solution, which has shown significant speedup in online inference as well as good independence of software/hardware platforms [19]–[23]. Differing from previous works on parameter pruning, filter pruning can be further integrated with various CNN compression or acceleration methods, e.g., low-rank decomposition [24]–[31], DCT [32] or FFT [33], [34] based frequency-domain acceleration, and parameter quantization [31], [35]–[38]. Another advantage lies in reducing the energy consumption [39], which is influenced not only by the parameter amount, but also by the FLOPs and the memory access of input/output feature maps. From this perspective, filter pruning can directly remove FLOPs and the intermediate activations of filters (e.g., output feature maps and their corresponding filter channels in the next layer), which substantially reduces the energy consumption.

S. Lin and Y. Li are with the Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, 361005, China. R. Ji (corresponding author) is with the Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, 361005, China, and Peng Cheng Laboratory, Shenzhen, 518055, China (e-mail: [email protected]). C. Deng is with the School of Electronic Engineering, Xidian University, Xi'an 710071, China. X. Li is with the School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, China.

FLOPs: the number of floating-point operations.
Fig. 1. Illustration of the SSR scheme. (a) The complete process, which includes adaptively pruning the unimportant filters via the AULM solver, removing the corresponding output feature maps, removing the corresponding channels of filters in the next layer, as well as updating filters and fine-tuning the pruned network. (b) The process of selecting and pruning unimportant filters, the corresponding output feature maps and channels in the next layer (highlighted in purple). In particular, the tensor-based convolution operator can be replaced by a matrix-by-matrix multiplication using a BLAS library to accelerate the computation of CNNs. (Best viewed in color.)
However, there are still open issues in the existing filter pruning schemes. Concretely, how to adaptively and efficiently select important filters to reconstruct the global output, i.e., the probabilistic "softmax" output after local filter pruning, is still an open problem, with only a few works in the literature. For instance, the work in [20] proposed a magnitude-based criterion to prune convolutional filters with small ℓ1-norms. It results in structured sparse patterns that can accelerate online inference. However, such a magnitude-based measurement (e.g., the ℓ1-norm) is too simple and inefficient to determine the importance of each filter, due to the existence of nonlinear activation functions (e.g., the rectified linear unit (ReLU) [40]) and other complex operations (e.g., pooling and batch normalization [41]). To explain, filters with small ℓ1-norm values may still produce large responses in the output; conversely, a filter with a larger ℓ1-norm can yield a smaller (even zero) response after the convolution operator and ReLU activation, e.g., when its large-magnitude entries are negative with respect to the input. Very recently, Luo et al. [19] implicitly associated the convolutional filters of each layer with the input channels of the next layer, upon which filter pruning is done by selecting input channels that have a minimal local reconstruction error. The small reconstruction error, however, might be magnified and propagated through the deep network, leading to a large reconstruction error in the global outputs. A Taylor-expansion-based criterion [42] was further proposed to iteratively prune one feature map and its associated filter; the pruned network is then fine-tuned to reduce the accuracy drop. However, such a scheme is unadaptive and costly when pruning the entire network. Group sparsity [21]–[23], [43], [44] was introduced to select unimportant filters by Stochastic Gradient Descent (SGD). However, SGD is less efficient in convergence, and is also less efficient at generating a structured output of the pruned filters.

In this paper, we propose an efficient filter pruning scheme, termed Structured Sparsity Regularization (SSR), which can efficiently and adaptively prune a group of convolutional filters to minimize the classification error of the global output. Compared to the existing works on sequential filter pruning [19], [20], [23], [42], [45], we incorporate the structured sparsity constraint into the objective function of the global output to model the correlation between the global output loss and the local filter removal, which produces a structured network with fast computation and light memory consumption (to make a fair comparison, we only evaluate the computational cost of the model for online inference, excluding training). Here, structured sparsity directly prunes an entire filter/block (i.e., sets all of its values to zero), while non-structured sparsity only determines whether each individual element in a filter is zero. In particular, we propose two different kinds of structured sparse regularizers for adaptive filter pruning, i.e., the ℓ2,1-norm [46] and the ℓ2,0-norm. The ℓ2,0-norm of a matrix A is defined as ‖A‖_{2,0} = Σ_i ‖√(Σ_j A_{ij}²)‖_0, where for a scalar a, ‖a‖_0 = 0 if a = 0, and ‖a‖_0 = 1 otherwise. The ℓ2,1-norm is a convex norm that approximates the cardinality in filter selection, while the ℓ2,0-norm selects filters under an explicit cardinality constraint, which is the most natural constraint. For group sparsity with ℓ2,0-regularization, the SGD used in previous works [21]–[23], [43] cannot solve the NP-hard problem for effective filter pruning. To this end, we propose a novel Alternative Updating with Lagrange Multipliers (AULM) scheme, which handles the convergence difficulty of the ℓ2,1-norm under SGD and the NP-hard problem of the ℓ2,0-norm. AULM follows the principle of ADMM [47] by splitting the optimization problem into tractable sub-problems that can be solved efficiently. In addition, AULM converges faster than ADMM by adding Nesterov's optimal method [48]; it can effectively identify the importance of filters, and then updates parameters by alternating between promoting the structured sparsity and
optimizing the recognition loss. In particular, the proposed solver circumvents the sparse constraint evaluations during the standard back-propagation step, which makes the implementation very practical. Moreover, compared to SSL [23], the proposed solver is much faster than the existing solvers in offline pruning, i.e., almost . × faster than directly solving the regularizer with the ℓ2,1-norm by using SGD. Fig. 1 shows the proposed filter pruning framework.

Quantitatively, we demonstrate the advantage of the proposed SSR scheme using five widely-used models (i.e., LeNet, AlexNet, VGGNet-16, ResNet-50 and GoogLeNet) on two datasets (i.e., MNIST and ImageNet 2012). Compared to several state-of-the-art filter pruning methods [19], [20], [23], [42], [43], [45], [49], the proposed scheme performs drastically better, i.e., a . × CPU speedup and . × compression with a negligible classification accuracy loss for LeNet on MNIST, a . × CPU speedup with an increase of 1.28% Top-5 classification error for AlexNet, a . × GPU speedup with a decrease (i.e., an improvement) of 1.65% Top-1 classification error for VGG-16, a . × CPU speedup and . × compression with an increase of 3.65% Top-1 classification error for ResNet-50, and a . × CPU speedup and . × compression with an increase of 1.05% Top-5 classification error for GoogLeNet on ImageNet 2012. Moreover, the pruned AlexNet and VGG-16 can be further compressed by replacing the original fully-connected layers with global average pooling [50], leading to . × and × compression rates with increases of only 4.62% and 0.27% Top-5 classification error, respectively.

In addition, we also explore the generalization ability of the SSR-based compressed model in more complex tasks, i.e., domain adaptation and object detection. Experimental results demonstrate that such a compressed model achieves a 3.5× FLOPs reduction and 15.4× compression with only an increase of 1.95% Top-1 error on the task of domain adaptation, as well as a 0.3% mAP drop with a factor of 2.45× GPU speedup on the task of object detection. These two results are highly competitive compared to the state-of-the-art filter pruning methods.

II. RELATED WORK
Early works in network compression mainly focus on compressing the fully-connected layers [13]–[16], [51]. For instance, LeCun et al. [13] and Hassibi et al. [14] proposed a saliency measurement by computing the Hessian matrix of the loss function with respect to the parameters, based on which network parameters with low saliency values are pruned. Srinivas and Babu [51] explored the redundancy among neurons to remove a subset of neurons without retraining. Han et al. [15], [16] proposed a pruning scheme based on low-weight connections to reduce the total amount of parameters in CNNs. However, these methods only reduce the memory footprint and do not guarantee to reduce the computation time, since the time consumption is mostly dominated by the convolutional layers. Moreover, the above pruning schemes typically produce non-structured sparse CNNs that lack the flexibility to be applied across different platforms or libraries. For example, the Compressed Sparse Column (CSC) based weight format has to change the original format of weight storage in Caffe [52] after pruning, which cannot be well supported across different platforms.

To reduce the computation cost of convolutional layers, a popular solution is to decompose convolutional filters into a sequence of tensors with fewer parameters [25], [26], [28], [30], [31], [53]. The convolution can also be conducted in the frequency domain using DCT [32] and FFT [33], [34], or approximated by balanced decoupled spatial convolution [54] to reduce the redundancy of spatial and channel information. Besides, binarization of weights [36], [37], [55] and low-complexity weights [56] can also be employed in the convolutional layers to reduce the computation overhead with multiplication-free operations. Designing a compact filter can also accelerate the convolutional computation by replacing over-parametric filters with a compact block, such as the inception module in GoogLeNet [5], the bottleneck module in ResNet [3], the fire module in SqueezeNet [57], group convolution [4], [58], [59], and depth-wise separable convolution [6], [60], [61]. Without incurring additional overheads, our scheme can be integrated with the above schemes to further speed up the computation, since they are orthogonal to the core contribution of this paper.

In line with our work, some recent works have investigated structured pruning to remove redundant filters or feature maps, which can be categorized into either greedy-based pruning [19], [20], [42], [62], [63] or sparsity-regularization-based pruning [21]–[23], [43], [49], [64]. For the former group, the work in [20] proposed magnitude-based pruning to remove filters together with their corresponding feature maps by measuring the ℓ1-norm of filters, which is however inefficient in determining the importance of filters. He et al. [62] proposed an ℓ2-norm criterion to prune unsalient filters in a soft manner. Luo et al. [19] explored the importance of the input channels of the next convolutional layer, based on which a local channel selection is conducted to prune unimportant input channels and the corresponding filters in the current layer. However, the small local reconstruction error might lead to a large error in the global output after propagating through the deep network. A Taylor-expansion-based criterion was proposed in [42] to iteratively prune one filter and then fine-tune the pruned network, which is however prohibitively costly for deep networks. Lin et al. [63] proposed a global and dynamic pruning scheme to reduce redundant filters by greedy alternative updating. Alternatively, group-sparsity-based regularization was proposed in [21]–[23], [43], [44] to penalize unimportant parameters and prune redundant filters directly by using SGD, which is very slow in convergence for filter selection. To reduce redundancies in the model parameters, the combination of group and exclusive sparsity was proposed in [49] to promote sharing and competition for features, respectively. Different from these group-sparsity-based regularizations, we investigate the structured sparsity of filters instead, including ℓ2,1-regularization and ℓ2,0-regularization, and the proposed AULM solver alternates between promoting the structured sparsity and optimizing the recognition loss. In this way, AULM effectively overcomes the difficulty of convergence with ℓ2,1-regularization under SGD, and can also solve the NP-hard problem with ℓ2,0-regularization. Quantitatively, such an innovation achieves much faster convergence and generates much more structured filters during training. Recently, Liu et al. [64] proposed a network slimming scheme that associates a batch-normalization scaling factor with each filter channel, and imposed ℓ1-regularization on these scaling factors to identify and prune unimportant channels. Different from network slimming, we directly focus on structured filter sparsity by ℓ2,1-regularization and ℓ2,0-regularization for pruning complete filters.

III. STRUCTURED PRUNING VIA SSR

In this section, we first describe the notations and preliminaries. Next, we present the general framework of structured filter pruning. Then, we present the proposed structured sparsity regularization scheme. Afterwards, the AULM-based solver is presented to perform the corresponding optimization. Finally, we discuss how to deploy our pruning strategy on residual networks.
A. Notations and Preliminaries
Consider a CNN model consisting of L layers in total (including convolutional and fully-connected layers), which are interlaced with rectified linear units and pooling. For the convolution operation, an input tensor $\mathcal{I}^l$ of size $H_l \times W_l \times C_l$ is transformed into an output tensor $\mathcal{O}^l$ of size $H'_l \times W'_l \times C_{l+1}$ by the following linear mapping at the $l$-th layer:

$\mathcal{O}^l_{h',w',n} = \sum_{i=1}^{d_l} \sum_{j=1}^{d_l} \sum_{c=1}^{C_l} \mathcal{K}^l_{i,j,c,n} \, \mathcal{I}^l_{h_i,w_j,c},$  (1)

where the convolutional filter $\mathcal{K}^l$ at the $l$-th layer is a tensor of size $d_l \times d_l \times C_l \times C_{l+1}$. The spatial locations of the output are denoted as $h' = h_i - i + 1$ and $w' = w_j - j + 1$, respectively. For simplicity, we assume a unit stride without zero-padding and skip the biases.

In practice, many deep learning frameworks (e.g., Caffe [52] and Tensorflow [65]) compute the tensor-based convolution operator by a highly optimized matrix-by-matrix multiplication using linear algebra packages, such as Intel MKL and OpenBLAS. For example, an input tensor of size $H_l \times W_l \times C_l$ can be transformed into an input patch matrix $I^l$ of size $(d_l \times d_l \times C_l) \times H'_l W'_l$ using the im2col operator. The columns of $I^l$ are the patch elements of the input tensor, each of size $d_l \times d_l \times C_l$. Correspondingly, the convolutional filter is transformed into a filter matrix $K^l$ of size $C_{l+1} \times (d_l \times d_l \times C_l)$ using a reshape operator. Then, the output tensor can be obtained by reshaping the result matrix of size $C_{l+1} \times H'_l W'_l$, which is the product of the filter matrix $K^l$ with the input patch matrix $I^l$. In this paper, we use Tensorflow to train and test our structured sparse CNNs. Therefore, we replace the tensor-based filters $\mathcal{K}^l$ with the matrix-based $K^l$.

In addition, we consider several norms of the filter matrix $K^l$, which are used in the regularization term. The Frobenius norm of $K^l$ is defined as $\|K^l\|_F := \sqrt{\sum_{i,j} (K^l_{ij})^2}$. The sparsity-inducing $\ell_1$-norm is defined as $\|K^l\|_1 := \sum_{i=1}^{C_{l+1}} \|K^l_i\|_1$, where $K^l_i$ denotes the $i$-th row of $K^l$. In this paper, we introduce two different structured sparsity norms to adaptively select the unimportant filters to be pruned, i.e., the $\ell_{2,1}$-norm and the $\ell_{2,0}$-norm, which are defined as $\|K^l\|_{2,1} := \sum_{i=1}^{C_{l+1}} \|K^l_i\|_2$ and $\|K^l\|_{2,0} := \sum_{i=1}^{C_{l+1}} \big\|\sqrt{\sum_{j} (K^l_{ij})^2}\big\|_0$, respectively. Note that the $\ell_{2,0}$-norm is not a valid norm because it does not satisfy absolute homogeneity, i.e., $\|\alpha K^l\|_{2,0} = |\alpha| \|K^l\|_{2,0}$ does not hold for every scalar $\alpha$; the term "norm" is used here for convenience.
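To make the matrix view and the norms above concrete, the following minimal NumPy sketch (shapes and names are illustrative assumptions, not the authors' code) forms the patch matrix via im2col, computes the convolution as a matrix product, and evaluates the Frobenius, ℓ2,1 and ℓ2,0 norms of the filter matrix:

```python
import numpy as np

def im2col(x, k):
    """Unfold an (H, W, C) input into a (k*k*C, H_out*W_out) patch matrix
    (unit stride, no zero-padding), following the description above."""
    H, W, C = x.shape
    H_out, W_out = H - k + 1, W - k + 1
    cols = np.empty((k * k * C, H_out * W_out), dtype=x.dtype)
    col = 0
    for i in range(H_out):
        for j in range(W_out):
            cols[:, col] = x[i:i + k, j:j + k, :].reshape(-1)
            col += 1
    return cols

def frobenius_norm(K):
    return np.sqrt((K ** 2).sum())

def l21_norm(K):
    # Sum of the Euclidean norms of the rows (one row per output filter).
    return np.linalg.norm(K, axis=1).sum()

def l20_norm(K):
    # Number of rows with non-zero Euclidean norm, i.e. the number of filters
    # that would survive pruning (a small tolerance stands in for exact zero).
    return int((np.linalg.norm(K, axis=1) > 1e-12).sum())

# Hypothetical shapes: a 3x3 convolution mapping C_l = 4 channels to C_{l+1} = 8.
x = np.random.randn(16, 16, 4).astype(np.float32)     # input tensor I^l
K = np.random.randn(8, 3 * 3 * 4).astype(np.float32)  # filter matrix K^l
K[5] = 0.0                                            # a filter driven to zero by SSR
out = (K @ im2col(x, 3)).reshape(8, 14, 14)           # output feature maps O^l
print(l21_norm(K), l20_norm(K))                       # l20_norm(K) == 7
# Row 5 of K being all-zero means output channel 5 is constant zero, so the
# filter, its feature map, and the matching input channel of layer l+1 can be removed.
```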
B. The Framework of SSR

SSR prunes the least important filters from a trained convolutional network to reduce the computation and memory costs. Its procedure consists of three basic operations, i.e., (1) evaluate the importance of each filter, (2) prune unimportant filters, and (3) fine-tune the whole network. Differing from previous filter pruning, we adaptively select the unimportant filters to be pruned by using AULM. As shown in Fig. 1, we focus on the blue dotted boxes, which apply the AULM solver to adaptively prune unimportant filters and then remove the corresponding output feature maps and the filter channels in the next layer. We present the principal steps of SSR below (a high-level sketch of the whole loop follows the list):

1. Automatic filter selection. We design a novel objective function, which incorporates the structured sparsity constraint into the data error term, e.g., the cross-entropy between the inferred class probabilities and the ground truth. The optimization problem can be solved by the proposed AULM solver. Thus, the unimportant filters are adaptively identified during training.

2. Pruning. We prune the unimportant filters and their corresponding feature maps, together with the channels of the filters in the next layer.

3. Updating. We update the remaining filters and feature maps in the current layer, as well as the channels of the filters in the next layer.

4. Iteration. We return to Step 1 to prune the next layer.

5. Global fine-tuning. We globally fine-tune the pruned network, which recovers its discriminability and generalization ability.
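A minimal sketch of this loop is given below (illustrative pseudo-Python; `aulm_solve` and `finetune` are placeholders for the AULM solver of Sec. III-D and standard fine-tuning, not the authors' implementation):

```python
import numpy as np

def ssr_prune_network(filters, lambdas, aulm_solve, finetune):
    """High-level sketch of the SSR pipeline in Fig. 1(a).  `filters` is a list
    of per-layer filter matrices K^l (one row per filter)."""
    for l, lam in enumerate(lambdas):
        # Step 1: automatic filter selection -- AULM drives whole rows of K^l to zero.
        K_l = aulm_solve(filters[l], lam)
        # Step 2: prune zero filters together with their output feature maps.
        keep = np.linalg.norm(K_l, axis=1) > 1e-12
        # Step 3: update the remaining filters of the current layer.  The matching
        # input channels of layer l+1 would be removed here as well; the exact
        # column bookkeeping depends on the layer type and is omitted.
        filters[l] = K_l[keep]
        # Step 4: iterate -- the loop moves on to prune the next layer.
    # Step 5: global fine-tuning of the pruned network.
    return finetune(filters)
```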
C. The Objective Function of SSR
Instead of directly pruning filters by calculating their corresponding magnitudes [20], [45], SSR utilizes structured sparsity to seek the best trade-off between loss minimization and filter selection. We consider the following objective function:

$\min_{K} \; \mathcal{L}\big(Z, f(X; K)\big) + \lambda g(K).$  (2)

Here K represents the collection of all weights in the CNN. $\mathcal{L}\big(Z, f(X; K)\big)$ is the cross-entropy loss for classification, or the mean-squared error for regression, between the ground-truth labels Z and the output of the last layer of the CNN $f(X; K)$, where $D = \{X, Z\} = \{X_i, Z_i\}_{i=1}^{N}$ is a training dataset with N instances. We denote the first loss term in Eq. (2) as $L_D(K)$ for simplicity. (For simplicity, the weight-decay term, i.e., non-structured regularization applied to every weight such as the ℓ2-norm, is omitted, since it can be directly incorporated into the loss function and does not affect the result of structured sparsity regularization.) The term $g(K)$ is a structured sparsity regularization on the total size of the remaining filters in each iteration. In this paper, we consider two different kinds of structured sparsity regularizers, i.e., the ℓ2,1-norm and the ℓ2,0-norm.

The parameter λ is the penalty weight of the structured sparsity. As λ varies, the solution of Eq. (2) traces a trade-off path between performance and structured sparsity. Note that efficiently solving Eq. (2) with the ℓ2,1- or ℓ2,0-norm based structured sparse regularizer $g(K)$ is non-trivial.
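As a concrete reading of Eq. (2), the following sketch (a minimal NumPy illustration assuming a classification task; not the authors' code) evaluates the regularized objective L_D(K) + λ g(K) for a single layer with either structured regularizer:

```python
import numpy as np

def cross_entropy(probs, labels):
    # probs: (N, num_classes) softmax outputs f(X; K); labels: (N,) integer ground truth Z.
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def g(K, kind="l21"):
    row_norms = np.linalg.norm(K, axis=1)
    if kind == "l21":    # convex surrogate of the filter cardinality
        return row_norms.sum()
    elif kind == "l20":  # explicit cardinality of non-zero filters
        return float((row_norms > 1e-12).sum())
    raise ValueError(kind)

def ssr_objective(probs, labels, K, lam, kind="l21"):
    return cross_entropy(probs, labels) + lam * g(K, kind)
```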
D. The AULM Solver

The proposed AULM solver aims at solving the structure-sparsity regularized problem, and is inspired by ADMM in the field of distributed optimization. Different from ADMM, we focus on selecting and pruning unimportant filters by solving a non-convex optimization problem with the ℓ2,1-regularization and an NP-hard problem with the ℓ2,0-regularization. In particular, to handle the non-trivial regularizer in Eq. (2), we introduce a slack variable and an equality constraint as follows:

$\min_{K, F} \; L_D(K) + \lambda g(F) \quad \text{s.t.} \quad K = F.$  (3)

AULM is an iterative method that augments the Lagrangian function with quadratic penalty terms. The augmented Lagrangian associated with the constrained problem of Eq. (3) is given by:

$L(K, F, Y) = L_D(K) + \sum_{l=1}^{L} \lambda g(F^l) + \sum_{l=1}^{L} \mathrm{trace}\big(Y^{l\top}(K^l - F^l)\big) + \sum_{l=1}^{L} \frac{\rho}{2}\|K^l - F^l\|_F^2,$  (4)

where $K^l$, $F^l$ and $Y^l$ are the filter kernel, the intermediate filter with structured sparsity, and the dual variables (i.e., the Lagrange multipliers) at the $l$-th layer, respectively, and $\rho > 0$ is a penalty parameter. To minimize Eq. (4), AULM solves for each variable via a sequence of iterative computations:

1. Employ gradient descent to minimize the loss over K:
$K^{\{k+1\}} = \arg\min_{K} L\big(K, \hat F^{\{k\}}, \hat Y^{\{k\}}\big).$  (5)

2. Find the closed-form solution of the structured sparsity:
$F^{\{k+1\}} = \arg\min_{F} L\big(K^{\{k+1\}}, F, \hat Y^{\{k\}}\big).$  (6)

3. Update the dual variables using gradient ascent with a step size equal to ρ, i.e.,
$Y^{\{k+1\}} = \hat Y^{\{k\}} + \rho\big(K^{\{k+1\}} - F^{\{k+1\}}\big).$  (7)

4. Conduct an overrelaxation step for the accelerated variables $\hat F$ and $\hat Y$ with a step size equal to γ, i.e.,
$\hat Y^{\{k+1\}} = Y^{\{k+1\}} + \gamma^{\{k+1\}}\big(Y^{\{k+1\}} - Y^{\{k\}}\big),$  (8)
$\hat F^{\{k+1\}} = F^{\{k+1\}} + \gamma^{\{k+1\}}\big(F^{\{k+1\}} - F^{\{k\}}\big),$  (9)
where $\gamma^{\{k+1\}} = k/(k + r)$, with $r \geq 3$ ($r = 3$ is the standard choice).
Algorithm 1 AULM for structured pruning of a CNN

Input: Training data points D, pre-trained CNN weights K, a set of regularization factors S.
Output: The structured pruned filters K.
Initialize: dual variables $\hat Y = Y = 0$, $\hat F = F = K$, and ρ = 1.
for each λ in S do
  for each l in [1, L] do
    repeat
      Step 1: Find the estimate of $K^{l\{k+1\}}$ by solving the problem in Eq. (11) using SGD;
      Step 2: Find the structured sparse estimate of $F^{l\{k+1\}}$ with the ℓ2,1-norm or ℓ2,0-norm from Eq. (14) or Eq. (15), respectively;
      Step 3: Update the dual variables $Y^{l\{k+1\}}$ by Eq. (7);
      Step 4: Update the accelerated variables $\hat Y^{l\{k+1\}}$ and $\hat F^{l\{k+1\}}$ by Eq. (8) and Eq. (9), respectively;
    until $\|K^{l\{k+1\}} - F^{l\{k+1\}}\|_F \leq \epsilon$, or $\|F^{l\{k+1\}} - F^{l\{k\}}\|_F \leq \epsilon$.
    Prune the filters of $K^l$ whose corresponding rows of $F^l$ are zero, together with their feature maps.
  end for
  Fine-tune the pruned network.
end for
The above four steps are applied in an alternating manner. Below, we describe the details of Step 1 and Step 2 to obtain K and F. The proposed alternative optimization is summarized in Alg. 1. By overrelaxing the Lagrange-multiplier variables after each iteration, AULM not only converges faster, but also obtains a more effective solution compared to ADMM.
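A compact sketch of one AULM pass for a single layer (Algorithm 1, Steps 1-4) is given below; `sgd_minimize` and `prox_g` are assumed callables standing in for the SGD solution of Eq. (11) and the closed-form step of Eq. (14)/(15) (with λ closed over), not the authors' implementation:

```python
import numpy as np

def aulm_layer(K, sgd_minimize, prox_g, rho=1.0, r=3, max_iter=50, eps=1e-3):
    """One AULM run for a single layer."""
    F = K.copy()                  # intermediate structured-sparse filters F^l
    Y = np.zeros_like(K)          # Lagrange multipliers Y^l
    F_hat, Y_hat = F.copy(), Y.copy()
    for k in range(1, max_iter + 1):
        F_prev = F
        # Step 1 (Eq. 11): minimize L_D(K) + rho/2 * ||K - (F_hat - Y_hat/rho)||_F^2 by SGD.
        K = sgd_minimize(K, target=F_hat - Y_hat / rho, rho=rho)
        # Step 2 (Eq. 14 or 15): proximal step on T = K + Y_hat/rho.
        F = prox_g(K + Y_hat / rho, rho)
        # Step 3 (Eq. 7): dual ascent with step size rho.
        Y_new = Y_hat + rho * (K - F)
        # Step 4 (Eqs. 8-9): Nesterov-style overrelaxation with gamma = k/(k+r).
        gamma = k / (k + r)
        Y_hat = Y_new + gamma * (Y_new - Y)
        F_hat = F + gamma * (F - F_prev)
        Y = Y_new
        if np.linalg.norm(K - F) <= eps or np.linalg.norm(F - F_prev) <= eps:
            break
    return K, F
```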
1) The Updating of AULM:
Step 1.
By removing the penalty term $g(F^l)$ and completing the squares with respect to K in Eq. (4), we obtain the following problem, equivalent to Eq. (5):

$\min_{K} \; L_D(K) + \sum_{l=1}^{L} \frac{\rho}{2}\|K^l - T^l\|_F^2,$  (10)

where $T^l = F^l - \frac{1}{\rho} Y^l$. To obtain the sub-optimal filters K in the layer-wise pruning framework, we separately update $K^l$ at the $l$-th layer with the following optimization problem:

$\min_{K^l} \; L_D(K^l) + \frac{\rho}{2}\|K^l - T^l\|_F^2.$  (11)

We use Stochastic Gradient Descent (SGD) to optimize the filters $K^l$, which is a reasonable choice for such a high-dimensional optimization. The entire procedure relies mainly on the standard forward-backward pass.

Step 2.
By removing the first term $L_D(K)$ and completing the squares with respect to F in Eq. (4), we obtain the following problem, equivalent to Eq. (6):

$\min_{F} \; \sum_{l=1}^{L} \lambda g(F^l) + \sum_{l=1}^{L} \frac{\rho}{2}\|F^l - T^l\|_F^2,$  (12)

where $T^l = K^l + \frac{1}{\rho} Y^l$. We update $F^l$ layer by layer instead of directly updating all layers at once. Hence, we obtain the following optimization problem at the $l$-th layer:

$\min_{F^l} \; \lambda g(F^l) + \frac{\rho}{2}\|F^l - T^l\|_F^2.$  (13)
Based on Eq. (13), we can obtain a closed-form solution by considering the following regularizers $g(F^l)$:

• ℓ2,1-norm. A closed-form solution of Eq. (13) can be derived, which is evaluated row by row on $T^l$ [66], [67]. The $i$-th row is calculated via:
$F^l_i = \frac{\max\{\|T^l_i\|_2 - \lambda/\rho,\, 0\}}{\|T^l_i\|_2}\, T^l_i.$  (14)

• ℓ2,0-norm. A closed-form solution of Eq. (13) with this regularizer can also be derived, which is evaluated row by row on $T^l$ in Theorem 1. The $i$-th row is calculated via:
$F^l_i = \begin{cases} 0, & \lambda \geq \frac{\rho}{2}\|T^l_i\|_2^2, \\ T^l_i, & \lambda < \frac{\rho}{2}\|T^l_i\|_2^2. \end{cases}$  (15)

• ℓ1-norm. This norm is not a structural constraint, and solving Eq. (13) with it leads to unstructured sparsity. Specifically, we obtain a closed-form solution of Eq. (13) by evaluating each entry of $T^l$. The optimal solution $F^l_{ij}$ is obtained via:
$F^l_{ij} = \mathrm{sign}(T^l_{ij})\,\max\{|T^l_{ij}| - \lambda/\rho,\, 0\},$  (16)
where sign(·) is an indicator function, i.e., $\mathrm{sign}(T^l_{ij}) = 1$ if $T^l_{ij} \geq 0$ and $-1$ otherwise.

Here, we consider the ℓ1-norm to better validate the effectiveness of simultaneously accelerating the computation and compressing the memory overhead of CNNs by the aforementioned structured sparsity.
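A direct NumPy implementation of these three row-wise/element-wise updates might look as follows (a sketch assuming the ρ/2 quadratic term of Eq. (13); not the authors' code):

```python
import numpy as np

def prox_l21(T, lam, rho):
    """Row-wise closed form of Eq. (14): group soft-thresholding."""
    F = np.zeros_like(T)
    norms = np.linalg.norm(T, axis=1)
    for i, n in enumerate(norms):
        if n > 0:
            F[i] = T[i] * max(n - lam / rho, 0.0) / n
    return F

def prox_l20(T, lam, rho):
    """Row-wise closed form of Eq. (15): keep a row only if it is 'worth' lambda."""
    F = T.copy()
    sq_norms = (T ** 2).sum(axis=1)
    F[lam >= 0.5 * rho * sq_norms] = 0.0
    return F

def prox_l1(T, lam, rho):
    """Element-wise soft-thresholding of Eq. (16) (non-structured sparsity)."""
    return np.sign(T) * np.maximum(np.abs(T) - lam / rho, 0.0)
```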
Theorem 1. Let $g(F^l)$ be the regularizer $\|F^l\|_{2,0}$; then the optimal solution of Eq. (13) is given by Eq. (15), where $F^l = (F^l_1, F^l_2, \cdots, F^l_{C_{l+1}})^\top$ and $T^l = (T^l_1, T^l_2, \cdots, T^l_{C_{l+1}})^\top$.
Proof. Since $g(F^l) = \|F^l\|_{2,0} = \sum_i \big\|\sqrt{\sum_j (F^l_{ij})^2}\big\|_0$, we are interested in solving the following problem:

$\min_{F^l} \; \lambda \sum_i \Big\|\sqrt{\sum_j (F^l_{ij})^2}\Big\|_0 + \frac{\rho}{2}\|F^l - T^l\|_F^2.$  (17)

We rewrite $F^l$ and $T^l$ as $F^l = (F^l_1, F^l_2, \cdots, F^l_{C_{l+1}})^\top$ and $T^l = (T^l_1, T^l_2, \cdots, T^l_{C_{l+1}})^\top$, where $F^l_i$ and $T^l_i$ are the $i$-th rows of $F^l$ and $T^l$, respectively. Then, for each row independently, we solve the following problem, equivalent to Eq. (13):

$\min_{F^l_i} \; L(F^l_i) = \lambda \big\|\|F^l_i\|_2\big\|_0 + \frac{\rho}{2}\|F^l_i - T^l_i\|_2^2.$  (18)

On one hand, for any $F^l_i \neq 0$, $\lambda\big\|\|F^l_i\|_2\big\|_0 = \lambda$, and we obtain the optimal value $L(F^l_i) = \lambda$ when $\|F^l_i - T^l_i\|_2 = 0$, i.e., $F^l_i = T^l_i$. On the other hand, if $F^l_i = 0$, then $\lambda\big\|\|F^l_i\|_2\big\|_0 = 0$ and $L(F^l_i) = \frac{\rho}{2}\|T^l_i\|_2^2$. Therefore, when $\lambda \geq \frac{\rho}{2}\|T^l_i\|_2^2$, the optimal solution is $F^l_i = 0$, while when $\lambda < \frac{\rho}{2}\|T^l_i\|_2^2$, the optimal solution is $F^l_i = T^l_i$. Thus, by the row-wise decoupling of Eq. (13), we obtain the optimal solution of Eq. (13), which is given by Eq. (15).
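As a quick worked check of the theorem (with illustrative values), the closed form of Eq. (15) can be compared against brute-force candidates for a single row:

```python
import numpy as np

# Brute-force sanity check of Theorem 1 for a single row T_i (illustrative values).
rng = np.random.default_rng(0)
T_i, rho, lam = rng.normal(size=6), 1.0, 0.9

def obj(F_i):
    # Objective of Eq. (18): lambda if the row is non-zero, plus the quadratic term.
    return lam * float(np.any(F_i != 0)) + 0.5 * rho * np.sum((F_i - T_i) ** 2)

closed_form = T_i if lam < 0.5 * rho * np.sum(T_i ** 2) else np.zeros_like(T_i)
candidates = [np.zeros_like(T_i), T_i] + [T_i + 0.1 * rng.normal(size=6) for _ in range(1000)]
assert obj(closed_form) <= min(obj(c) for c in candidates) + 1e-12
print("closed form attains the minimum:", obj(closed_form))
```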
Fig. 2. Illustration of pruning ResNet. The red value is the number of remaining filters/channels.
2) The Convergence of AULM:
Since non-linear transformations, normalization and pooling commonly occur in CNNs, the objective function of Eq. (4) is highly non-convex, and a theoretical proof guaranteeing convergence to the global optimum is lacking. However, it is empirically shown that AULM works well when the penalty parameter ρ is sufficiently large. This is related to the quadratic term, which tends to be locally convex given a sufficiently large ρ. However, if ρ is too large, it is difficult for the iterative solver to take effect. As a trade-off, we set ρ = 1 in our implementation.

Since the objective function is highly non-convex, there is a risk of being trapped in a local optimum. In our implementation, we circumvent this difficulty by using the pre-trained weights as the initialization, which performs quite well in practice.

E. Pruning on ResNet
Unlike VGG-16 and AlexNet, there are some restrictions on pruning ResNet due to its special residual blocks. In general, each residual block with the bottleneck structure contains three convolutional layers (each followed by batch normalization and ReLU) and a shortcut connection. In order to perform the sum operation, the number of output feature maps in the last convolutional layer needs to be consistent with that of the projection shortcut layer. In particular, when the dimensions of the input/output channels are mismatched in a residual block, a linear projection is performed by the shortcut connection (see [3] for more details).

In this paper, we focus on pruning the first two layers in each residual block, as shown in Fig. 2, and we do not directly prune the last convolutional layer of each residual block. In fact, the parameters (e.g., filter channels) in the last convolutional layer are much fewer, since a large proportion of the filters in the second layer have been pruned.
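In code, this restriction amounts to a simple filter over the layer list; the sketch below uses hypothetical layer names and is only meant to illustrate which layers are candidates for pruning:

```python
def prunable_resnet_layers(layer_names):
    """Keep only conv1/conv2 of each bottleneck block as pruning candidates
    (hypothetical names such as 'block3/conv2'); conv3 and shortcut projections
    are left untouched so the residual addition stays dimension-consistent."""
    return [name for name in layer_names
            if name.endswith(("conv1", "conv2")) and "shortcut" not in name]

layers = ["block1/conv1", "block1/conv2", "block1/conv3", "block1/shortcut",
          "block2/conv1", "block2/conv2", "block2/conv3"]
print(prunable_resnet_layers(layers))
# ['block1/conv1', 'block1/conv2', 'block2/conv1', 'block2/conv2']
```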
IV. EXPERIMENTS
A. Experimental Setups
Models and datasets.
We conduct comprehensive experiments using five convolutional networks on two datasets, i.e., LeNet on MNIST [68], and AlexNet [4], VGG-16 [2], ResNet-50 [3] and GoogLeNet [5] on ImageNet [69]. We implement the proposed SSR scheme with Tensorflow [65]. All pre-trained CNNs except LeNet are taken from the Caffe model zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo). Our source code is available at https://github.com/ShaohuiLin/SSR.
TABLE I: Pruning results of LeNet on MNIST. "Num-Num-Num" is the number of remaining filters in each layer. K/M/B means thousand/million/billion in this paper, respectively.

We make use of an open source tool (https://github.com/ethereon/caffe-tensorflow) to convert the pre-trained models to the Tensorflow format and then fine-tune them to restore the accuracy (the accuracies of the models may be slightly different from those reported by other works, due to a different learning framework). We train LeNet from scratch and report the results in Table I.

Implementations.
To train the proposed SSR scheme, we use a learning rate of 0.001 with a constant dropping factor of 10 every 10 epochs. The weight decay is set to 0.0005 and the momentum is set to 0.9. To train SSR on both LeNet and AlexNet, the mini-batch size is set to 256. To train VGG-16, ResNet-50 and GoogLeNet, the mini-batch size is set to 32. After pruning, the pruned network is fine-tuned for 30 epochs, in which the learning rate is scaled by 0.1 every 10 epochs. All experiments are run on an NVIDIA GTX 1080Ti graphics card with 11GB memory and 128GB of RAM. The number of pruned filters is directly controlled by the hyper-parameter λ, i.e., the regularization factor of the structured sparsity. In our experiments, we vary λ over a set of 8 values to select the best trade-off between the compression/speedup rate and the accuracy. For r in the overrelaxation step, we set it to 3.
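For reference, the training setup described above can be collected into a small configuration sketch (the dictionary layout is illustrative; all values are taken from the text):

```python
ssr_train_config = {
    "learning_rate": 1e-3,        # dropped by a factor of 10 every 10 epochs
    "weight_decay": 5e-4,
    "momentum": 0.9,
    "batch_size": {"LeNet": 256, "AlexNet": 256, "VGG-16": 32, "ResNet-50": 32, "GoogLeNet": 32},
    "finetune_epochs": 30,        # after pruning, LR again decayed by 0.1 every 10 epochs
    "rho": 1.0,                   # AULM penalty parameter
    "overrelaxation_r": 3,        # r in gamma = k / (k + r)
    # lambda (the structured-sparsity weight) is swept over 8 values per layer to
    # trade off accuracy against compression; the exact grid is model-dependent.
}
```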
Evaluation Protocols.

For the evaluation protocols, we quantify the performance using FLOPs, the number of parameters, and the Top-1/5 classification error. To make a fair comparison, the speedup rate is measured on a single-thread Intel Xeon E5-2620 CPU and an NVIDIA GTX TITAN X GPU.
Alternative and State-of-the-art Approaches.
We first compare our filter selection criterion with three alternative criteria, which are briefly summarized as follows:

1. Random. Randomly prune filters of each layer.

2. L1-norm (filter norm) [20]. Filters with smaller magnitudes tend to be unimportant. Therefore, the ℓ1-norm of each filter, s_i = ‖K(i, :)‖_1, is chosen as its importance score. We then sort these scores and prune filters accordingly.

3. APoZ (Average Percentage of Zeros) [45]. The sparsity of each output channel after the ReLU activation can be chosen as the importance score of the corresponding filter. The score is calculated as s_i = 1 − (1/|I(:, :, i)|) Σ Σ 1(I(:, :, i) == 0), where |I(:, :, i)| is the number of entries in the i-th channel of the tensor I. The smaller s_i is, the less important the corresponding filter is.

We also compare the proposed SSR scheme to the state-of-the-art filter pruning methods, including SSL [23], ThiNet [19], Taylor expansion (TE) [42], CGES [49] and GSS [43]. Furthermore, we compare against our alternative schemes with different regularizers, i.e., SSR with the ℓ2,1-norm (termed SSR-L2,1), SSR with the ℓ2,0-norm (termed SSR-L2,0) and SSR with the ℓ1-norm (termed SSR-L1).

B. LeNet on MNIST

MNIST is a small-scale dataset, which contains a training set of 60,000 images and a test set of 10,000 images from 10 classes. Each image is a 28 × 28 gray-scale handwritten digit. LeNet on MNIST consists of 2 convolutional layers and 2 fully-connected layers, and achieves an error rate of 0.88% on MNIST. The detailed structure of LeNet is

(20)C − MP − (50)C − MP − FC − FC − S,  (19)

where (20)C is a 5 × 5 convolutional layer with 20 filters, MP is a max-pooling layer with kernel size 2, FC is a fully-connected layer (the last one has 10 nodes), and S is a softmax loss layer. Since the node number of the last layer is directly related to the number of classes, we prune the remaining layers except for the last layer (for AlexNet, VGG-16 and ResNet-50, we likewise keep the number of nodes of the final layer unchanged and prune the other remaining layers). The regularization factor λ is fixed to 6 groups, i.e., (0.1, 0.1, 0.3), (0.3, 0.3, 0.4), (0.3, 0.5, 0.5), (0.4, 0.5, 0.5), (0.5, 0.4, 0.5), and (0.5, 0.5, 0.5). We compare our method to different criteria of filter selection [20], [45] and also to other state-of-the-art methods in filter pruning [23], [42], [43], [49].

Fig. 3 shows the pruning results of LeNet based on different filter selection criteria. Compared to these criteria and baselines, both the FLOPs and the parameter size are significantly reduced at a lower error by using the proposed structured sparsity regularization, as expected. In contrast, the ℓ1-norm based filter selection performs poorly. To explain, due to the nonlinear transformations in the network, filters with small ℓ1-norm are still likely to be important and to have a large impact on the final loss function. When a large proportion of filters with small ℓ1-norm are pruned, the classification error can increase significantly. Note that the simplest scheme, random filter selection, works reasonably well, which is due to the self-recovery ability of distributed representations [70]. However, the random criterion is not robust in practice and may lead to a large accuracy loss when applied to compress large networks (e.g., VGG-16), as presented in Table III and Table IV. For APoZ, the sparsity of feature maps is quite reasonable for pruning redundant filters, owing to the self-sparsity of the pre-trained model with ReLU activation. In contrast, compared to these filter pruning methods, our SSR-L2,1 achieves the best performance, with an increase of 0.29%
classification error, a . × FLOPs reduction, and a . × parameter reduction. SSR-L2,0 achieves relatively consistent results with SSR-L2,1. In particular, at a significantly higher compression ratio (i.e., a . × parameter reduction ratio), SSR-L2,0 achieves much better performance than SSR-L2,1, as the ℓ2,0-norm directly measures the cardinality of the filter structure.

Fig. 3. The results of different evaluation criteria on FLOPs and parameter numbers for compressing LeNet.

The quantitative performance for compressing LeNet using the proposed scheme is further shown in Table I. First, we found that the FLOPs do not directly reflect the actual speedup ratio in online inference. For instance, compared to LeNet, the proposed SSR-L2,1 reaches a . × FLOPs reduction, with a × actual speedup ratio. To explain, memory accesses, both inter-layer and intra-layer, can significantly increase the computation cost. Second, TE [42] achieves the best trade-off between the speedup ratio and the classification error among all the baselines (e.g., SSL [23], CGES [49] and GSS [43]), as it inherits the effectiveness of filter selection by estimating the loss increase of pruning each filter with a Taylor expansion. Note that CGES+ [49] is the combination of iterative pruning [15] and CGES for further compressing LeNet, which achieves a 0.04% increase of classification error using 10% of the parameters of the full network. Compared to CGES+, the proposed SSR-L1 with the ℓ1-norm achieves a significantly higher compression ratio of . × (i.e.,
9K parameters), with only an increase of 0.12% Top-1 error. However, there are no structural constraints on the filters/weights, which leads to a very low speedup under the same hardware/software evaluation environment (to make a fair comparison, we evaluate the actual speedup of SSR-L1 without special hardware/software accelerators). Third, by using the proposed AULM with structured filter sparsity to adaptively select and prune the redundant filters, SSR-L2,1 achieves the best trade-off between the speedup/compression ratio and the classification error. For example, the Top-1 error is only increased by 0.18% with a . × speedup and . × compression.

C. ImageNet
ImageNet 2012 contains over 1 million training images from 1,000 object classes, as well as a validation set of 50,000 images. Each image is rescaled to a size of 256 × 256, and a 224 × 224 image is randomly cropped from each rescaled image (except for AlexNet, which uses a 227 × 227 crop) and mirrored for data augmentation.
TABLE II: The number of parameters and FLOPs in both convolutional and fully-connected layers, computational time on CPU (ms) and GPU (ms), and classification error rates (Top-1/5 Err.) of AlexNet, VGG-16, ResNet-50 and GoogLeNet.
We test the pruned network on the validation set using single-view testing (the central patch only) to evaluate the classification accuracy.

We implement the proposed SSR scheme on four CNNs, i.e., AlexNet, VGG-16, ResNet-50 and GoogLeNet. AlexNet contains 5 convolutional layers and 3 fully-connected layers, VGG-16 contains 13 convolutional layers and 3 fully-connected layers, ResNet-50 contains 54 convolutional layers with 16 residual blocks, and GoogLeNet contains 21 convolutional layers with 9 inception blocks. Unlike AlexNet and VGG-16, ResNet-50 and GoogLeNet use global average pooling over the last convolutional layer to reduce the number of parameters, which removes the 3 fully-connected layers. The computation time and storage overhead of the four networks, together with their classification errors, are shown in Table II.

Sensitivity Analysis.
We explore the sensitivity of each layer in the network to guide the filter pruning for that layer. Taking AlexNet and VGG-16 for instance, most layers are robust to pruning, as shown in Fig. 4. Still, there exists a small number of sensitive layers, which are located at the top convolutional layers. For example, it is sensitive to prune the last two convolutional layers for AlexNet and the last 3 convolutional layers (i.e., Conv5_1, Conv5_2, Conv5_3) for VGG-16. To explain, these top layers often carry the high-level semantic information that is necessary for maintaining the classification accuracy. In addition, pruning some filters in specific layers (e.g., certain convolutional layers of VGG-16 when λ is set to 0.2) yields a slightly better accuracy than the original network, which reveals that redundant filters can reduce the discriminability of the original network. Therefore, we use a different λ for each layer to reduce the impact of these sensitive layers (i.e., we set a large λ for insensitive layers and a small one for sensitive layers).
Fig. 4. Sensitivity of pruning filters in each layer. Left: the sensitivity of AlexNet. Right: the sensitivity of VGG-16.
TABLE III: Pruning results of AlexNet.

Quantitative Results.
Since the fully-connected layers occupy over 90% of the storage in AlexNet and VGG-16, we replace the original fully-connected layers with global average pooling (GAP) [50] to further compress the whole network. "X-GAP" refers to the model using GAP after all convolutional layers have been pruned via method "X" (e.g., ThiNet, SSR). The "X-GAP" models are fine-tuned with the same fine-tuning parameters as described in Sec. IV-A.

As shown in Table III, we prune AlexNet with three groups of λ, i.e., (0.2, 0.4, 0.5, 0.6, 0.1, 0.1, 0.3), (0.4, 0.5, 0.7, 0.6, 0.1, 0.3, 0.3) and (0.2, 0.4, 0.5, 0.6, 0.1, GAP). Compared to other filter pruning methods [20], [23], [42], [45], our SSR scheme achieves the best trade-off between the speedup/compression rate and the Top-1/5 classification error. First, we compare SSR-L2,1 to the three alternative selection criteria (i.e., random, L1-norm [20] and APoZ [45]) with the same number of pruned filters in each layer (to make a fair comparison, the number of filters pruned in each layer by the three alternative criteria is the same as for SSR-L2,1). SSR-L2,1 achieves the lowest Top-1/5 classification error. To explain, all these selection criteria are naive methods that prune filters based on statistical properties, resulting in a large approximation error in each layer that is propagated throughout the network. Second, by directly employing SGD with filter-wise sparsity to solve the SSL problem, the redundant filters cannot be pruned efficiently, which only achieves a . × CPU speedup with an increase of 1.32% Top-1 error (this differs from the result reported by Wen et al. [23] due to the different fine-tuning framework and deep learning library). Third, the work in [42] uses Taylor expansion (TE) to approximate the loss increase, which is similar to ours but with a totally different selection criterion. Quantitatively, it is time-consuming to prune one filter and then fine-tune the network iteratively. In contrast, SSR-L2,1 achieves the lowest Top-1 error increase of 1.72% and Top-5 error increase of 1.38%, while removing a much larger amount of parameters. Fourth, we also compare the two kinds of structured sparsity regularization (i.e., ℓ2,1-regularization and ℓ2,0-regularization) with element-wise sparsity regularization (i.e., ℓ1-regularization), and observe that SSR-L1 significantly reduces the memory storage to only 22.6M parameters, about half of SSR-L2,1 and SSR-L2,0, with a comparable error increase. However, SSR-L1 does not boost the inference efficiency, as element-wise sparsity cannot significantly reduce the number of filters, so its computation is on par with the full network (see Sec. IV-D for more detailed discussions). As for structured sparsity regularization, compared to SSR-L2,1, SSR-L2,0 achieves a lower error increase (i.e., 1.57% vs. 1.72%) with a higher compression and speedup, i.e., 45.9M parameters and a . × CPU speedup vs. 48M parameters and a . × CPU speedup. Moreover, to further compress AlexNet using SSR, GAP makes the pruned network more compact, leading to a . × compression rate (i.e., 2.5M parameters).

For VGG-16, we summarize the performance comparison with [19], [20], [42], [45] in Table IV. In the experiments, λ for the first 10 convolutional layers is set to (0.5, 0.4, 0.5, 0.3, 0.5, 0.3, 0.3, 0.4, 0.5, 0.3), with large values, while for the last 3 convolutional layers it is set to 0.1; λ is set to (0.1, 0.6) in the fully-connected layers. First, instead of directly pruning filters, ThiNet [19] conducts a greedy local channel selection, while TE [42] uses a greedy feature-map selection to prune the feature maps. Compared to ThiNet, TE achieves a higher GPU speedup (i.e., . × vs. . × in ThiNet), but has a significant increase in Top-5 error, which affects the discriminative ability of the compressed model.
TABLE IV: Pruning results of VGG-16.
For the three alternative criteria of filter selection, APoZ [45] achieves the lowest increase in both Top-1 and Top-5 classification error at the same factor of GPU and CPU speedup, e.g., a . × CPU speedup and a . × GPU speedup with a decrease of 0.64% Top-1 classification error and 0.43% Top-5 classification error, respectively. Compared to all baselines with fully-connected layers, SSR-L2,1 achieves the best trade-off between classification error and speedup, e.g., a decrease of 1.46% Top-1 error at a factor of . × GPU speedup. To explain, the relationship between the final output and the local filters is directly considered in SSR, which can therefore adaptively prune redundant filters that have less impact on the global outputs. Second, by replacing the fully-connected layers with GAP, the network is further compressed by a large rate, e.g., ThiNet-GAP achieves a . × parameter reduction (i.e., 9.5M parameters), and SSR-L2,1-GAP achieves a . × GPU speedup and a × parameter reduction, with only an increase of 0.52% Top-1 error. In addition, compared to SSR-L2,1-GAP, SSR-L2,0-GAP achieves a comparable result, with a factor of . × GPU speedup and . × parameter reduction at an increase of 0.83% Top-1 error.

TABLE V: Pruning results of ResNet-50.

We also apply the proposed SSR to multi-branch networks, e.g., ResNet-50 and GoogLeNet. The results of SSR on ResNet-50 are shown in Table V. We prune ResNet-50 with
two groups of λ. In the first 7 residual blocks of the first group, the hyper-parameter λ of each residual block is set to (0.4, 0.3), while λ of each residual block is set to (0.3, 0.4) in the remaining residual blocks (i.e., 9 residual blocks). In the second group, the corresponding λ of each residual block is increased by 0.1 relative to the first group. Note that we skip the first convolutional layer, which is quite sensitive to pruning. In addition, we prune the first two layers in each residual block and leave the output and projection shortcuts of the residual block unchanged, as shown in Fig. 2. We found that SGD in SSL [23] is not very effective for solving Eq. (2) with ℓ2,1-regularization, which leads to a significant error increase with a limited FLOPs reduction. For the three alternative criteria of filter selection, APoZ [45] still achieves the lowest error increase at the same computation complexity and memory storage. Although ThiNet [19] achieves the best performance among these state-of-the-art baselines [20], [23], [45], it requires additional samples (i.e., new input/output pairs from hidden layers) in each layer to find the optimal channels for pruning, which is not only expensive in terms of storing additional training data, but also time-consuming in collecting them during offline training. Moreover, ThiNet only reduces the reconstruction error of each layer, ignoring the correlation between local filter pruning and the global output, which leads to an accumulation of reconstruction error. Compared to ThiNet, without supervised information from hidden layers, SSR-L2,0 employs the original ImageNet dataset to improve the classification accuracy at the same speedup ratio, and also achieves a higher parameter reduction ( . × vs. . × ).

TABLE VI: Pruning results of GoogLeNet.

For GoogLeNet, we prune all convolutional filters with high computation complexity, i.e., filter sizes of 3 × 3 and 5 × 5. λ is set to 0.5 and 0.3 in the first three inception blocks and the remaining inception blocks, respectively. We skip the first convolutional layer and kernels of size 1 × 1 for effective pruning. As shown in Table VI, compared to the three alternative criteria of filter selection, SSR-L2,1 achieves a lower error increase at the same pruning level. By replacing the ℓ2,1-regularization with the ℓ2,0-regularization, SSR-L2,0 achieves the best performance, with an increase of 1.05% Top-5 error at a . × compression and a . × CPU speedup.
D. Analysis
Efficiency Analysis.
We first analyze the empirical efficiency of AULM and ADMM. As shown in Fig. 5, we found that our AULM helps to learn a more compact network with almost the same error using far fewer epochs compared to ADMM. For instance, AULM achieves 0.9% error with only 8 epochs and 7% of the parameters in the Conv1 layer, while ADMM achieves almost the same error but requires 11 epochs and 20% of the parameters.
Fig. 5. Further analysis of the ℓ2,1-regularization by our AULM and ADMM on the MNIST dataset. (a) Percentage of parameters used vs. epoch. (b) Top-1 error vs. epoch.
For instance, AULM achieves 0.9% error with only 8 epochs and 7% of the parameters in the Conv1 layer, whereas ADMM reaches almost the same error only after 11 epochs and with 20% of the parameters. This faster convergence toward structured filter sparsity is due to applying Nesterov's acceleration to over-relax the variables (i.e., the structured sparse filters and the dual variables), which speeds up the alternating optimization. Second, we study the influence of the optimization strategy (i.e., SSL vs. AULM) on filter pruning. In SSL [23], the structured sparsity of filters under the ℓ2,1-norm is learned by directly solving Eq. (2) with SGD, which differs from the proposed AULM-L2,1. We take the first three layers of AlexNet as an example, since they occupy a significant proportion of the computational overhead.

Fig. 6 presents the convergence of the different solvers on Conv1, Conv2 and Conv3, respectively. As shown in Fig. 6(a), AULM-L2,1 reduces the training loss faster than SSL and also reaches a lower final training loss, which leads to more effective training for structured pruning. Moreover, with AULM-L2,1 instead of SSL, the number of pruned filters is consistently larger, the classification error is lower, and the convergence is faster, especially in the first convolutional layer (convergence after 8 epochs with AULM-L2,1 vs.
20 epochs with SSL). Therefore, the alternating optimization in AULM is more effective for pruning the network than SGD in SSL. Furthermore, we compare the two structured regularizers (i.e., the ℓ2,1-norm and the ℓ2,0-norm) to explicitly analyze their convergence under AULM. (We do not compare SGD against AULM under ℓ2,0-regularization, since SGD cannot solve this NP-hard problem.) As shown in Figs. 6(b) and 6(c), compared to AULM-L2,1, AULM-L2,0 not only generates structured filters significantly faster, but also achieves a lower Top-5 error. Interestingly, the number of structured filters remains almost constant during training, which is due to the closed-form solution of Eq. (15): it yields nearly the same structured sparsity of the intermediate filters immediately after the first update.
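To make the two mechanisms above more concrete, namely the Nesterov-style over-relaxation of the auxiliary and dual variables and the closed-form group thresholding behind Eq. (15), the following NumPy sketch illustrates both under our own simplifying assumptions (a single weight matrix whose rows stand in for filters, illustrative thresholds, and function names of our choosing); it is not the authors' AULM implementation.

```python
# Schematic sketch (our own simplification, not the paper's AULM code):
# group-wise proximal steps and Nesterov-style over-relaxation on a 2-D
# weight matrix whose rows play the role of filters.
import numpy as np

def prox_group_l21(W, tau):
    """Group soft-thresholding: shrink each row's l2-norm by tau (l2,1 prox)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.clip(1.0 - tau / np.maximum(norms, 1e-12), 0.0, None)
    return W * scale

def prox_group_l20(W, tau):
    """Group hard-thresholding: zero out rows whose l2-norm is below tau
    (a closed-form step, which is why the sparsity pattern stabilizes early)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return np.where(norms >= tau, W, 0.0)

def nesterov_overrelax(Z_new, Z_old, t_old):
    """Nesterov-style extrapolation of an iterate, applied here to the
    auxiliary/dual variables to accelerate the alternating updates."""
    t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_old ** 2))
    Z_relaxed = Z_new + ((t_old - 1.0) / t_new) * (Z_new - Z_old)
    return Z_relaxed, t_new

W = np.random.randn(8, 27)                 # 8 "filters", 27 weights each
Z, Z_prev, t = prox_group_l20(W, 1.0), W.copy(), 1.0
Z, t = nesterov_overrelax(Z, Z_prev, t)    # over-relaxed auxiliary variable
print(np.count_nonzero(np.linalg.norm(Z, axis=1)))  # number of surviving filters
```

The hard-thresholding step fixes the set of surviving rows in one shot, which is consistent with the observation above that the structured sparsity barely changes after the first update.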
Visualization.
To verify the effectiveness of structured filter sparsity, we visualize the filters in the first convolutional layer of AlexNet learned by SSR with three different regularizers (i.e., the ℓ1-norm, the ℓ2,1-norm and the ℓ2,0-norm), as shown in Fig. 7. Although SSR with ℓ1-regularization yields a large number of sparse filter elements, it cannot remove entire filters, which leads to a very limited speedup without specialized software. In contrast, SSR with ℓ2,1-regularization or ℓ2,0-regularization removes complete filters, which directly accelerates network inference. Compared to ℓ2,1-regularization, SSR with ℓ2,0-regularization achieves a lower Top-5 error with more structured sparse filters, because the ℓ2,0-regularization explicitly imposes the natural constraint for filter selection.
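To illustrate why whole-filter sparsity translates directly into faster inference without specialized sparse kernels, the following PyTorch sketch (our own example with hypothetical function names, not the paper's code) removes zeroed output filters from one convolutional layer together with the corresponding input channels of the following layer, yielding a physically smaller network.

```python
# Illustrative sketch (not the authors' code): turn filter-level zeros into a
# physically smaller pair of conv layers, so the speedup needs no sparse kernels.
import torch
import torch.nn as nn

def prune_conv_pair(conv1: nn.Conv2d, conv2: nn.Conv2d):
    """Remove output filters of conv1 whose weights are entirely zero, together
    with the matching input channels of conv2."""
    with torch.no_grad():
        norms = conv1.weight.view(conv1.out_channels, -1).norm(dim=1)
        keep = torch.nonzero(norms > 0, as_tuple=False).squeeze(1)

        new_conv1 = nn.Conv2d(conv1.in_channels, len(keep),
                              conv1.kernel_size, conv1.stride, conv1.padding,
                              bias=conv1.bias is not None)
        new_conv1.weight.copy_(conv1.weight[keep])
        if conv1.bias is not None:
            new_conv1.bias.copy_(conv1.bias[keep])

        new_conv2 = nn.Conv2d(len(keep), conv2.out_channels,
                              conv2.kernel_size, conv2.stride, conv2.padding,
                              bias=conv2.bias is not None)
        new_conv2.weight.copy_(conv2.weight[:, keep])
        if conv2.bias is not None:
            new_conv2.bias.copy_(conv2.bias)
    return new_conv1, new_conv2

c1, c2 = nn.Conv2d(3, 64, 3, padding=1), nn.Conv2d(64, 128, 3, padding=1)
c1.weight.data[10:20].zero_()            # pretend the regularizer zeroed ten filters
p1, p2 = prune_conv_pair(c1, c2)
print(p1.out_channels, p2.in_channels)   # both become 54
```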
(a) SSR-L1, Error: 21.26%, Sparsity: 58.67%; (b) SSR-L2,1, Error: 21.26%, Sparsity: 13.54%; (c) SSR-L2,0, Error: 21.16%, Sparsity: 14.58%.
Fig. 7. Visualizations of the first convolutional layer when pruning the whole AlexNet using SSR with different regularizers. (a) SSR with ℓ1-regularization results in element-wise sparse filters. (b) SSR with ℓ2,1-regularization results in filter-wise removal (filters in dark color). (c) SSR with ℓ2,0-regularization obtains more structured sparse filters and achieves a lower Top-5 error than SSR-L2,1.

TABLE VII: COMPARISON OF DIFFERENT COMPRESSED MODELS FOR FINE-GRAINED CLASSIFICATION ON CUB-200.

E. Generalization Ability for Transfer Learning

SSR has demonstrated its effectiveness in simultaneously accelerating and compressing CNNs on the MNIST and ImageNet 2012 classification tasks. We further investigate the generalization ability of the compressed models in transfer learning, including domain adaptation and object detection. For ease of discussion, we take VGG-16 as our baseline model.
1) Domain Adaptation:
Since SSR does not change the network structure, a model pruned on ImageNet can be easily transferred to other domains. To evaluate the domain adaptation ability of the compressed model, we consider a practical application in which a pruned model trained on ImageNet is transferred to a smaller, domain-specific dataset. To this end, we select the public CUB-200 dataset [71] for fine-grained classification. CUB-200 contains 11,788 images of 200 bird species, split into 5,994 training images and 5,794 testing images. For a fair comparison, we fine-tune the models compressed on ImageNet by Random, L1, APoZ and SSR with the same hyper-parameters and number of epochs (implementation details are available at https://github.com/Roll920/fine-tune-avg-vgg16). The results of fine-grained classification are shown in Table VII.

The pre-trained VGG-16 is first fine-tuned on the CUB-200 dataset, which is an effective way to transfer the model directly from the ImageNet domain to the CUB-200 domain. As shown in Table VII, the pre-trained VGG-16 achieves the lowest error (27.60% Top-1 error) but incurs a huge memory cost and slow inference (i.e., 135.1M parameters and 15.5B FLOPs). We then fine-tune the compressed networks, which were previously compressed in the ImageNet domain by Random, L1 [20], APoZ [45] and SSR, respectively. Compared to Random, L1 and APoZ, the model compressed by SSR-L2,1 achieves the best performance, with an increase of only 1.1% Top-1 error at 124.6M parameters and 4.5B FLOPs. Furthermore, we also fine-tune the compressed models in which the traditional fully-connected layers are replaced with global average pooling (GAP). This yields a far more compact model with 8.8M parameters and 4.4B FLOPs, i.e., 15.4× lower memory cost and a theoretical 3.5× inference speedup over VGG-16. Compared to the three alternative selection criteria, SSR-L2,1-GAP achieves the lowest Top-1 error of 29.55%, an increase of only 1.95% over the pre-trained VGG-16.
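The replacement of the fully-connected classifier with global average pooling described above can be sketched as follows. This is a hedged illustration of the general idea (the 200-way head matches CUB-200, but the module wiring and the assumption of 512 backbone output channels are ours), not the code released at the repository linked above.

```python
# Hedged sketch (not the released fine-tuning code): replace VGG-16's
# fully-connected classifier with global average pooling and a 200-way
# linear head for CUB-200 fine-tuning.
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights=None)          # in practice, load the pruned/pre-trained weights here
features = vgg.features                    # convolutional backbone (possibly filter-pruned)

gap_model = nn.Sequential(
    features,
    nn.AdaptiveAvgPool2d(1),               # global average pooling over each feature map
    nn.Flatten(),
    nn.Linear(512, 200),                   # 512 assumes the unpruned VGG-16 width; 200 bird classes
)

x = torch.randn(2, 3, 224, 224)
print(gap_model(x).shape)                  # torch.Size([2, 200])
```

In practice, a pruned backbone may expose fewer than 512 output channels, in which case the linear head's input size should follow the pruned width.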
2) Object Detection:
We also evaluate the transfer-learning ability of the VGG-16 models compressed by Random, L1 [20], APoZ [45] and SSR with ℓ2,1-regularization when deployed in Faster R-CNN [11] for object detection. The PASCAL VOC 2007 object detection benchmark, which contains about 5K training/validation images and 5K testing images, is used to evaluate the models by mean Average Precision (mAP). In our experiments, we first compress VGG-16 by Random, L1, APoZ and SSR-L2,1 on ImageNet, and then use the compressed models as pre-trained backbones for Faster R-CNN with the default training settings.

The actual running time of Faster R-CNN with the original VGG-16 is 189 ms/image on a Titan X GPU, whereas the compressed model reduces the detection time to 77 ms/image, i.e., a 2.45× actual acceleration on the Titan X. As shown in Table VIII, filter pruning with the random criterion works surprisingly well for object detection, which we attribute to the self-recovery ability gained when training on the specific PASCAL VOC 2007 dataset. In contrast, APoZ appears unsuitable for object detection, yielding the lowest mAP (a drop of 1.7%). Compared to the three alternative pruning criteria, SSR-L2,1 achieves the best performance, delivering the 2.45× speedup on the Titan X with only a 0.3% mAP drop, which remains very practical for real-world applications.

TABLE VIII: THE SPEEDUP FOR FASTER R-CNN DETECTION.
Device | Method | Speedup | mAP | ∆ mAP
Titan X GPU | VGG-16 Baseline | – | 68.7 | –
Titan X GPU | Random | 2.45× | 67.9 | 0.8
Titan X GPU | L1 [20] | 2.45× | 68.1 | 0.6
Titan X GPU | APoZ [45] | 2.45× | 67.0 | 1.7
Titan X GPU | SSR-L2,1 | 2.45× | 68.4 | 0.3

V. CONCLUSION
In this paper, we have proposed a unified filter pruning scheme, termed structured sparsity regularization (SSR), for CNN acceleration and compression. SSR captures the relationship between the global output and local filter pruning, and formulates a novel optimization problem based on two structured-sparsity regularizers, the ℓ2,1-norm and the ℓ2,0-norm, both of which can be efficiently solved by a novel Alternative Updating with Lagrange Multipliers (AULM) scheme. The proposed AULM quickly generates structured filters and adaptively prunes redundant ones. We have demonstrated that the proposed SSR scheme achieves superior performance over the state-of-the-art filter pruning methods [19], [20], [23], [42], [45]. We have further evaluated the effectiveness of the models compressed by SSR when applied to domain adaptation and object detection.

In the future, we would like to investigate the specific design of filter pruning for ResNet and DenseNet, including (1) how to effectively prune the shortcut connections in residual blocks, (2) how to design a more effective strategy for accelerating batch normalization and pooling layers, which are left unexploited in existing works, and (3) how to design a novel filter selection layer to prevent the dimension mismatch between different dense blocks caused by the dense connectivity of DenseNet.

ACKNOWLEDGMENT
This work is supported by the National Key R&D Program of China (No. 2017YFC0113000, No. 2016YFB1001503 and No. 2018YFB1107400), the Natural Science Foundation of China (No. U1705262, No. 61772443, No. 61402388, No. 61572410 and No. 61871470), the Post-Doctoral Innovative Talent Support Program under Grant BX201600094, the China Post-Doctoral Science Foundation under Grant 2017M612134, and the Natural Science Foundation of Fujian Province, China (No. 2017J01125).
REFERENCES

[1] X. Chen, J. Weng, W. Lu, J. Xu, and J. Weng, "Deep manifold learning combined with convolutional neural networks for action recognition," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 9, pp. 3938–3952, 2018.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[6] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in CVPR, 2017.
[7] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in CVPR, 2017.
[8] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, 2017.
[9] R. Girshick, "Fast R-CNN," in ICCV, 2015.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014.
[11] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015.
[12] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in ECCV, 2016.
[13] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, "Optimal brain damage," in NIPS, vol. 2, 1989, pp. 598–605.
[14] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in NIPS, 1993.
[15] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," CoRR, abs/1510.00149, vol. 2, 2015.
[16] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural networks," in NIPS, 2015.
[17] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 2016, pp. 243–254.
[18] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, "Sparse convolutional neural networks," in CVPR, 2015.
[19] J. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in ICCV, 2017.
[20] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," in ICLR, 2017.
[21] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," arXiv preprint arXiv:1512.08571, 2015.
[22] V. Lebedev and V. Lempitsky, "Fast convnets using group-wise brain damage," in CVPR, 2016.
[23] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in NIPS, 2016.
[24] M. Denil, B. Shakibi, L. Dinh, N. de Freitas et al., "Predicting parameters in deep learning," in NIPS, 2013.
[25] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned CP-decomposition," arXiv preprint arXiv:1412.6553, 2014.
[26] S. Lin, R. Ji, C. Chen, and F. Huang, "ESPACE: Accelerating convolutional neural networks via eliminating spatial & channel redundancy," in AAAI, 2017.
[27] S. Lin, R. Ji, X. Guo, and X. Li, "Towards convolutional neural networks compression via global error reconstruction," in IJCAI, 2016.
[28] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," arXiv preprint arXiv:1405.3866, 2014.
[29] C. Tai, T. Xiao, Y. Zhang, X. Wang et al., "Convolutional neural networks with low-rank regularization," in ICLR, 2016.
[30] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in NIPS, 2014.
[31] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," arXiv preprint arXiv:1511.06530, 2015.
[32] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu, "CNNpack: Packing convolutional neural networks in the frequency domain," in NIPS, 2016.
[33] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," arXiv preprint arXiv:1312.5851, 2013.
[34] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, "Fast convolutional nets with fbfft: A GPU performance evaluation," arXiv preprint arXiv:1412.7580, 2014.
[35] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," arXiv preprint arXiv:1412.6115, 2014.
[36] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in NIPS, 2015.
[37] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in ECCV, 2016.
[38] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu, "Quantized CNN: A unified approach to accelerate and compress convolutional networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4730–4742, 2018.
[39] T.-J. Yang, Y.-H. Chen, and V. Sze, "Designing energy-efficient convolutional neural networks using energy-aware pruning," in CVPR, 2017.
[40] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010.
[41] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[42] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, "Pruning convolutional neural networks for resource efficient inference," in ICLR, 2017.
[43] A. Torfi and R. A. Shirvani, "Attention-based guided structured sparsity of deep neural networks," arXiv preprint arXiv:1802.09902, 2018.
[44] J. Wang, C. Xu, X. Yang, and J. M. Zurada, "A novel pruning algorithm for smoothing feedforward neural networks based on group lasso method," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 2012–2024, 2018.
[45] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, "Network trimming: A data-driven neuron pruning approach towards efficient deep architectures," arXiv preprint arXiv:1607.03250, 2016.
[46] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[47] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[48] Y. E. Nesterov, "A method for solving the convex programming problem with convergence rate O(1/k²)," in Dokl. Akad. Nauk SSSR, vol. 269, 1983, pp. 543–547.
[49] J. Yoon and S. J. Hwang, "Combined group and exclusive sparsity for deep neural networks," in ICML, 2017.
[50] M. Lin, Q. Chen, and S. Yan, "Network in network," in ICLR, 2014.
[51] S. Srinivas and R. V. Babu, "Data-free parameter pruning for deep neural networks," arXiv preprint arXiv:1507.06149, 2015.
[52] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
[53] H. Huang and H. Yu, "LTNN: A layerwise tensorized compression of multilayer neural network," IEEE Transactions on Neural Networks and Learning Systems, 2018.
[54] G. Xie, K. Yang, T. Zhang, J. Wang, and J. Lai, "Balanced decoupled spatial convolution for CNNs," IEEE Transactions on Neural Networks and Learning Systems, 2019.
[55] M. Courbariaux and Y. Bengio, "BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[56] R. J. Cintra, S. Duffner, C. Garcia, and A. Leite, "Low-complexity approximate convolutional neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 5981–5992, 2018.
[57] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," in ICLR, 2017.
[58] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in CVPR, 2018.
[59] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger, "CondenseNet: An efficient DenseNet using learned group convolutions," in CVPR, 2018.
[60] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[61] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in CVPR, 2018.
[62] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft filter pruning for accelerating deep convolutional neural networks," arXiv preprint arXiv:1808.06866, 2018.
[63] S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang, "Accelerating convolutional networks via global & dynamic filter pruning," in IJCAI, 2018, pp. 2425–2432.
[64] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in ICCV, 2017.
[65] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[66] S. F. Cotter, B. D. Rao, K. Engan, and K. Kreutz-Delgado, "Sparse solutions to linear inverse problems with multiple measurement vectors," IEEE Transactions on Signal Processing, vol. 53, no. 7, pp. 2477–2488, 2005.
[67] T. Goldstein, C. Studer, and R. Baraniuk, "A field guide to forward-backward splitting with a FASTA implementation," arXiv preprint arXiv:1411.3406, 2014.
[68] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[69] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[70] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.