[PDF] ℓ 0 Regularized Structured Sparsity Convolutional Neural Networks

Abstract

Deepening and widening convolutional neural networks (CNNs) significantly increases the number of trainable weight parameters by adding more convolutional layers and feature maps per layer, respectively. By imposing inter- and intra-group sparsity onto the weights of the layers during the training process, a compressed network can be obtained with accuracy comparable to a dense one. In this paper, we propose a new variant of sparse group lasso that blends the ℓ 0 norm onto the individual weight parameters and the ℓ 2,1 norm onto the output channels of a layer. To address the non-differentiability of the ℓ 0 norm, we apply variable splitting resulting in an algorithm that consists of executing stochastic gradient descent followed by hard thresholding for each iteration. Numerical experiments are demonstrated on LeNet-5 and wide-residual-networks for MNIST and CIFAR 10/100, respectively. They showcase the effectiveness of our proposed method in attaining superior test accuracy with network sparsification on par with the current state of the art.

Full PDF

aa r X i v : . [ c s . C V ] D ec ℓ Regularized Structured Sparsity Convolutional Neural Networks

Kevin BuiDepartment of MathematicsUniversity of California, Irvine [email protected]

Fredrick ParkDepartment of Mathematics & Computer ScienceWhittier College [email protected]

Shuai ZhangQualcomm AI [email protected]

Yingyong QiQualcomm AI [email protected]

Jack XinDepartment of MathematicsUniversity of California, Irvine [email protected]

December 18, 2019

Abstract

Deepening and widening convolutional neural networks(CNNs) signiﬁcantly increases the number of trainableweight parameters by adding more convolutional layersand feature maps per layer, respectively. By imposinginter- and intra-group sparsity onto the weights of the lay-ers during the training process, a compressed network canbe obtained with accuracy comparable to a dense one. Inthis paper, we propose a new variant of sparse group lassothat blends the ℓ norm onto the individual weight pa-rameters and the ℓ , norm onto the output channels of alayer. To address the non-differentiability of the ℓ norm,we apply variable splitting resulting in an algorithm thatconsists of executing stochastic gradient descent followedby hard thresholding for each iteration. Numerical exper-iments are demonstrated on LeNet-5 and wide-residual-networks for MNIST and CIFAR 10/100, respectively.They showcase the effectiveness of our proposed methodin attaining superior test accuracy with network sparsiﬁ-cation on par with the current state of the art. Deep neural networks (DNNs) have proven to be advanta-geous for numerous modern computer vision tasks involv- ing image or video data. In particular, convolutional neu-ral networks (CNNs) yield highly accurate models withapplications in image classiﬁcation [21, 36, 14, 44], se-mantic segmentation [25, 7], and object detection [32,16, 31]. These large models often contain millions oreven billions of weight parameters that often exceed thenumber of training data. This is a double-edged swordsince on one hand, large models allow for high accuracy,while on the other, they contain many redundant param-eters that lead to overparametrization. Overparametriza-tion is a well-known phenomenon in DNN models [8, 3]that results in overﬁtting, learning useless random patternsin data [45], and having inferior generalization. Addition-ally, these models also possess exorbitant computationaland memory demands during both training and inference.As a result, they may not be applicable for devices withlow computational power and memory.Resolving these problems requires compressing thenetworks through sparsiﬁcation and pruning. Althoughremoving weights might affect the accuracy and gener-alization of the models, previous works [26, 12, 39, 29]demonstrated that many networks can be substantiallypruned with negligible effect on accuracy. There are manysystematic approaches to achieving sparsity in DNNs.Han et al. [13] proposed to ﬁrst train a dense network,prune it afterward by setting the weights to zero if be-low a ﬁxed threshold, and retrain the network with the1emaining weights. Jin et al. [18] extended this method byrestoring the pruned weights, training the network again,and repeating the process. Rather than pruning by thresh-olding, Aghasi et al. [1] proposed Net-Trim, which prunesan already trained network layer-by-layer using convexoptimization in order to ensure that the layer inputs andoutputs remain consistent with the original network. ForCNNs in particular, ﬁlter or channel pruning is preferredbecause it signiﬁcantly reduces the amount of weight pa-rameters required compared to individual weight prun-ing. Le et al. [24] calculated the sums of absolute weightsof the ﬁlters of each layer and pruned the ones with thesmallest weights. Hu et al. [15] proposed a metric calledaverage percentage of zeroes for channels to measure theirredundancies and pruned those with highest values foreach layer. Zhuang et al. [48] developed discrimination-aware channel pruning that selects channels that con-tribute to the discriminative power of the network.An alternative approach to pruning a dense network islearning a compressed structure from scratch. A conven-tional approach is to optimize the loss function equippedwith either the ℓ or ℓ regularization, which drives theweights to zero or to very small values during training. Tolearn which groups of weights (e.g., neurons, ﬁlters, chan-nels) are necessary, group regularization, such as grouplasso [42] and sparse group lasso [35], are equipped to theloss function. Alvarez and Salzmann [2] and Scardapane et al. [34] applied group lasso and sparse group lasso tovarious architectures and obtained compressed networkswith comparable or even better accuracy. Instead of shar-ing features among the weights as suggested by groupsparsity, exclusive sparsity [47] promotes competition forfeatures between different weights. This method was in-vestigated by Yoon and Hwang [41]. In addition, theycombined it with group sparsity and demonstrated thatthis combination resulted in compressed networks withbetter performance than their original. Non-convex reg-ularization has also been examined. Louizos et al. [26]proposed a practical algorithm using probabilistic meth-ods to perform ℓ regularization on neural networks. Ma et al. [28] proposed integrated transformed ℓ , a convexcombination of transformed ℓ and group lasso, and com-pared its performance against the aforementioned groupregularization methods.In this paper, we propose a group regularization methodthat balances both group lasso and ℓ regularization: it is a variant of sparse group lasso that replaces the ℓ penaltyterm with the ℓ penalty term. This proposed group reg-ularization method is presumed to yield a better perform-ing, compressed network than sparse group lasso since ℓ is a convex relaxation of ℓ . We develop an algorithm tooptimize loss functions equipped with the proposed regu-larization term for DNNs. Given a training dataset consisting of N input-outputpairs { ( x i , y i ) } Ni =1 , the weight parameters of a DNN arelearned by optimizing the following objective function: min W N N X i =1 L ( h ( x i , W ) , y i ) + λ R ( W ) , (1)where • W is the set of weight parameters of the DNN. • L ( · , · ) ≥ is the loss function that compares theprediction h ( x i , W ) with the ground-truth output y i . Examples include cross-entropy loss function forclassiﬁcation and mean-squared error for regression. • h ( · , · ) is the output of the DNN used for prediction. • λ > is a regularization parameter for R ( · ) . • R ( · ) is the regularizer on the set of weight parame-ters W .The most common regularizer used for DNN is k · k ,also known as weight decay. It prevents overﬁtting andimproves generalization because it enforces the weights todecrease proportional to their magnitudes [22]. Sparsitycan be imposed by pruning weights whose magnitudes arebelow a certain threshold at each iteration during training.However, an alternative regularizer is the ℓ norm k · k ,also known as lasso [37]. ℓ norm is the tightest convexrelaxation of the ℓ norm and it yields a sparse solutionthat is found on the corners of the 1-norm ball. Unfor-tunately, element-wise sparsity by ℓ or ℓ regularization2n CNNs may not yield meaningful speedup as the num-ber of ﬁlters and channels required for computation andinference may remain the same [40].To determine which ﬁlters or channels are relevant ineach layer, group sparsity using group lasso is considered.Suppose a DNN has L layers, so the set of weight param-eters W is divided into L sets of weights: W = { W l } Ll =1 .The weight set of each layer W l is divided into N l groups(e.g., channels or ﬁlters): W l = { w l,g } N l g =1 . Group lassoapplied to W l is formulated as R GL ( W l ) = N l X g =1 q | w l,g |k w l,g k (2) = N l X g =1 q | w l,g | vuut | w l,g | X i =1 w l,g,i , where w l,g,i corresponds to the weight parameter with in-dex i in group g in layer l , and the term p | w l,g | ensuresthat each group is weighed uniformly. This regularizerimposes the ℓ norm on each group, forcing weights ofthe same groups to decrease altogether at every iterationduring training. As a result, groups of weights are prunedwhen their ℓ norms are negligible, resulting in a highlycompact network compared to element-sparse networks.To obtain an even sparser network, element-wise spar-sity and group sparsity can be combined and applied to-gether to the training of DNNs. One regularizer that com-bines these two types of sparsity is sparse group lasso,which is formulated as R SGL ( W l ) = R GL ( W l ) + k W l k , (3)where k W l k = | N l | X g =1 | w l,g | X i =1 | w l,g,i | . Sparse group lasso simultaneously enforces group spar-sity by having R GL ( · ) and element-wise sparsity by hav-ing k · k . asso We recall that the ℓ norm is a convex relaxation of the ℓ norm, which is non-convex and discontinuous. In addi- tion, any ℓ -regularized problem is NP-hard. These prop-erties make developing convergent and tractable algo-rithms for ℓ -regularized problems difﬁcult, thereby mak-ing ℓ -regularized problems better alternatives to solve.However, the ℓ -regularized problems have their advan-tages over their ℓ counterparts. For example, theyare able to recover better sparse solutions than do ℓ -regularized problems in various applications, such ascompressed sensing [27], image restoration [4, 6, 10, 46],MRI reconstruction [38], and machine learning [27, 43].Used to solve ℓ minimization, the soft-thresholding op-erator S λ ( c ) = sign ( c ) max {| c | − λ, } yields a biasedestimator [11].Due to the advantages and recent successes of ℓ min-imization, we propose to replace the ℓ norm in (3) withthe ℓ norm k W l k = | N l | X g =1 | w l,g | X i =1 | w l,g,i | , (4)where | w | = ( if w = 00 if w = 0 . Hence, we propose a new regularizer called sparse group ℓ asso deﬁned by R SGL ( W l ) = R GL ( W l ) + k W l k . (5)Using this regularizer, we expect to obtain a better sparsenetwork than from using sparse group lasso. Before discussing the algorithm, we summarize notationsthat we will use to save space. They are the following: • If V = { V l } Ll =1 and W = { W l } Ll =1 , then ( V, W ) :=( { V l } Ll =1 , { W l } Ll =1 ) = ( V , . . . , V L , W , . . . , W L ) . • For V = { V l } Ll =1 , V l =( V l +1 , . . . , V L ) . Both V ≤ l and V ≥ l are deﬁned simi-larly. • V + := V k +1 . • ˜ L ( W ) := N P Ni =1 L ( h ( x i , W ) , y i ) .3 .4 Numerical Optimization We develop an algorithm to solve (1) with the sparsegroup ℓ asso regularizer (5). So, with W = { W l } Ll =1 ,the minimization problem we solve is min W ˜ L ( W ) + λ L X l =1 R SGL ( W l ) (6) = ˜ L ( W ) + λ L X l =1 ( R GL ( W l ) + k W l k ) . Throughout this paper, we assume that L is continuouslydifferentiable with respect to W l for each l = 1 , . . . , L .Because ﬁnding the subderivative of the objective prob-lem is difﬁcult due to the ℓ norm, we need to ﬁgure outa method to solve it. By introducing an auxiliary variable V = { V l } Ll =1 , we have a constrained optimization prob-lem min V,W ˜ L ( W ) + λ L X l =1 ( R GL ( W l ) + k V l k ) s.t. V = W. (7)The constraint can be relaxed by adding a quadraticpenalty term with β > so that we have min V,W F β ( V, W ):= ˜ L ( W ) + L X l =1 h λ ( R GL ( W l ) + k V l k )+ β k V l − W l k i . (8)With β ﬁxed, (8) can be solved by alternating minimiza-tion: W k +1 l = arg min W l F β ( V k , W + l ) for l = 1 , . . . , L (9a) V k +1 = arg min V F β ( V, W k +1 ) . (9b) We explicitly update W l by gradient descent and V l byhard-thresholding: W k +1 l = W kl − γ (cid:16) ∇ W l ˜ L ( W )+ λ∂ R GL ( W kl ) − β ( V kl − W kl ) (cid:17) for l = 1 , . . . , L (10a) V k +1 = H√ λ/β ( W k +1 ) , (10b)where γ is the learning rate, ∂ R GL is the subdifferen-tial of R GL , and H√ λ/β ( · ) is the element-wise hard-thresholding operator: H √ λ/β ( w i ) = ( if | w i | ≤ p λ/βw i if | w i | > p λ/β. (11)In practice, (10a) is performed using stochastic gradientdescent (or one of its variants) with mini-batches due tothe large-size computation dealing with the amount ofdata and weight parameters that a typical DNN has.After presenting an algorithm that solves the quadraticpenalty problem (8), we now present an algorithm to solve(6). We solve a sequence of quadratic penalty problems(8) with β ∈ { β j } ∞ j =1 such that β j ↑ ∞ . This will yield asequence { ( V j , W j ) } ∞ j =1 such that W j ↑ W ∗ , a solutionto (6). This algorithm is based on the quadratic penaltymethod [30] and the penalty decomposition method [27].The algorithm is summarized in Algorithm 1. To establish convergence for the proposed algorithm, weshow that the accumulation point of the sequence gener-ated by (9a)-(9b) is a block-coordinate minimizer, and anaccumulation point generated by Algorithm 1 is a sparsefeasible solution to (6). Unfortunately, this feasible solu-tion may not be a local minimizer of (6) because the lossfunction L ( · , · ) is nonconvex. However, it was shown in[9] that a similar algorithm to (1), but only for ℓ mini-mization, generates an approximate global solution withhigh probbility for a one-layer CNN with ReLu activationfunction. Theorem 1.

Let { ( V k , W k ) } ∞ k =1 be a sequence gener-ated by the alternating minimization algorithm (9a) - (9b) . lgorithm 1: Algorithm for Sparse Group L assoRegularization Initialize V and W with random entries; learningrate γ ; regularization parameters λ and β ; andmultiplier σ > . Set j := 1 . while stopping criterion for outer loop not satisﬁed do Set k := 1 . Set W j, = W j and V j, = V j . while stopping criterion for inner loop notsatisﬁed do Update W j,k +1 by Eq. (10a). Update V j,k +1 by Eq. (10b). k := k + 1 end Set W j +1 = W j,k and V j +1 = V j,k . Set β := σβ . Set j := j + 1 . end If ( V ∗ , W ∗ ) is an accumulation point of { ( V k , W k ) } ∞ k =1 ,then ( V ∗ , W ∗ ) is a block coordinate minimizer of (8) .that is V ∗ ∈ arg min V l F β ( V, W ∗ ) W ∗ l ∈ arg min W l F β ( V ∗ , W ∗ l ) for l = 1 , . . . , L Proof.

By (9a)- (9b), we have F β ( V k , W + ≤ l , W k>l ) ≤ F β ( V k , W + l ) (12) F β ( V + , W + ) ≤ F β ( V, W + ) (13)for all W l , l = 1 , . . . , L , and V , so it follows after somecomputation that F β ( V + , W + ) ≤ F β ( V k , W k ) (14)for each k ∈ N . Hence, { F β ( V k , W k ) } ∞ k =1 is nonin-creasing. Since F β ( V k , W k ) ≥ for all k ∈ N , its limit lim k →∞ F β ( V k , W k ) exists. From (12), we have F β ( V k , W + ≤ l , W k>l ) ≤ F β ( V k , W + l ) = lim k →∞ F β ( V k , W + l ) + X jl ˜ R λ,β ( W kj , V kj )= F β ( V, W + l ) − λ L X l =1 k V kl k ≥ F β ( V, W + ≤ l , W k>l ) − λ L X l =1 k V kl k = ˜ L ( W + ≤ l , W k>l ) + X j ≤ l ˜ R λ,β ( W + j , V kj )+ X j>l ˜ R λ,β ( W kj , V kj ) k ∈ K . Because lim k ∈ K →∞ V k exists, thesequence { V k } k ∈ K is bounded, which implies that {k V k k } k ∈ K is bounded as well. Hence, there exists afurther subsequence K ⊂ K such that lim k ∈ K →∞ k V k k exists. As a result, For each l = 1 , . . . , L , we have that lim k ∈ K k V kl k exists. So, we obtain lim k ∈ K →∞ ˜ L ( W + ≤ l , W k>l ) + X j ≤ l ˜ R λ,β ( W + j , V kj ) (19) + X j>l ˜ R λ,β ( W kj , V kj )= lim k ∈ K →∞ F β ( V, W + ≤ l , W k>l ) − λ L X l =1 k V kl k = lim k ∈ K →∞ F β ( V, W + ≤ l , W k>l ) − lim k ∈ K →∞ λ L X l =1 k V kl k = lim k ∈ K →∞ F β ( V k , W + l ) + X j = l ˜ R λ,β ( W ∗ j , V ∗ j ) (20) + ˜ R λ,β ( W l , V ∗ l ) ≥ ˜ L ( W ∗ ) + L X l =1 ˜ R λ,β ( W ∗ l , V ∗ l ) Adding P Ll =1 k V ∗ l k on both sides yields F β ( V ∗ , W ∗ l ) ≥ F β ( V ∗ , W ∗ ) . (21) By (17) and (21), ( V ∗ , W ∗ ) is a block coordinate mini-mizer. Theorem 2.

Let { ( V k , W k , β k ) } ∞ k =1 be a sequence gen-erated by Algorithm 1. Suppose that { F β k ( V k , W k ) } ∞ k =1 is uniformly bounded. If ( V ∗ , W ∗ ) is an accumulationpoint of { V k , W k ) } ∞ k =1 , then V ∗ = W ∗ and W ∗ is afeasible solution to (6) .Proof. Because ( V ∗ , W ∗ ) is an accumula-tion point, there exists a subsequence K suchthat lim k ∈ K →∞ ( V k , W k ) = ( V ∗ , W ∗ ) . If { F β k ( V k , W k ) } ∞ k =1 is uniformly bounded, thereexists M such that F β k ( V k , W k ) ≤ M for all k ∈ N .After some algebraic manipulation, we should obtain L X l =1 k V kl − W kl k ≤ β k M, (22)where M is some positive constant equals to the totalnumber of weight parameters in W . Taking the limit over k ∈ K , we have L X l =1 k V ∗ l − W ∗ l k = 0 , which follows that V ∗ = W ∗ . As a result, W ∗ is a feasi-ble solution to (6). We compare the proposed sparse group l asso regulariza-tion against four other methods as baselines: group lasso,sparse group lasso, combined group and exclusive spar-sity (CGES) proposed in [41], and the group variant of ℓ regularization proposed in [26]. For the group terms, theweights are grouped together based on the ﬁlters or out-put channels, which we will refer to as neurons. We applythese methods on the following image datasets: MNIST[23] using the LeNet-5-Caffe [17] and CIFAR 10/100 [20]using wide residual networks [44]. Because the optimiza-tion algorithms do not drive most, if not all, the weightsand neurons to zeroes, we have to set them to zeroes whentheir values are below a certain threshold. In our exper-iments, if the absolute weights are below − , we setthem to zeroes. Then, weight sparsity is deﬁned to be the ercentage of zero weights with respect to the total num-ber of weights trained in the network . If the normalizedsum of the absolute values of the weights of the neuron isless than − , then the weights of the neuron are set tozeroes. Neuron sparsity is deﬁned to be the percentageof neurons whose weights are zeroes with respect to thetotal number of neurons in the network . The MNIST dataset consists of 60k training images and10k test images. It is trained on Lenet-5-Caffe, whichhas four layers with 1,370 total neurons and 431,080 totalweight parameters. All layers of the network are appliedwith strictly the same type of regularization. No otherregularization methods (e.g., dropout and batch normal-ization) are used. The network is optimized using Adam[19] with initial learning rate 0.001. For every 40 epochs,the learning rate decays by a factor of 0.1. We set the reg-ularization parameter λ = 0 . / . For sparse groupl asso, we set β = 2 . / , and for every 40 epochs,it increases by a factor of 1.25. The network is trained for200 epochs across 5 runs.Table 1 reports the mean results for weight sparsity,neuron sparsity, and test error obtained at the end of theruns. The ℓ regularization method barely sparsiﬁes thenetwork. On the other hand, CGES obtains the lowestmean test error with the largest mean weight sparsity, butits mean neuron sparsity is not as high as group lasso,sparse group lasso, and sparse group l asso. The largestmean neuron sparsity is attained by sparse group lasso, butits corresponding test error is worse than the other meth-ods. Sparse group l asso attains comparable mean weightand neuron sparsity as group lasso and sparse group lassobut with lower test error. Therefore, the proposed regular-ization is able to balance accuracy with both weight andneuron sparsity better than the baseline methods. CIFAR 10/100 is a dataset that has 10/100 classes splitinto 50k training images and 10k test images. Thedataset is trained on wide residual networks, speciﬁcallyWRN-28-10. WRN-28-10 has approximately 36,500,000weight parameters and 10,736 neurons. The network isoptimized using stochastic gradient descent with initial learning rate 0.1. After every 60 epochs, learning ratedecays by a factor of 0.2. Strictly the same type of reg-ularization is applied to the weights of the hidden layerwhere dropout is utilized in the residual block. We varythe regularization parameter λ = α/ by trainingthe model on α ∈ { . , . , . , . , . } . For sparsegroup l asso, we set β = 25 α/ initially and it in-creases by a factor of 1.25 for every 20 epochs. The net-work is trained for 200 epochs across 5 runs. Note that weexclude ℓ regularization by Louizos et al. [26] becausewe found the method to be unstable when α ≥ . . Theresults are shown in Figures 1 and 2 for CIFAR 10 andCIFAR 100, respectively.According to Figure 1, CGES outperforms the othermethods when α = 0 . for both sparsity and test error.However, sparsity levels stabilize after when α = 0 . .Sparse group lasso attains the highest mean weight andneuron sparsity when α ≥ . . Group lasso and sparsegroup l asso have comparable mean weight and neuronsparsity levels, but sparse group l asso outperforms theother methods in terms of test error when α ≥ . .Figure 2 shows that the results for CIFAR 100 are sim-ilar to the results for CIFAR 10. CGES has better weightsparsity when α ≤ . , but it has the least neuron spar-sity when α ≥ . . Sparsity levels for CGES appear tostabilize after when α = 0 . . Sparse group lasso attainsthe highest mean weight and neuron sparsity for α ≥ . .The proposed method sparse group l asso has comparableweight and neuron sparsity as group lasso, but it has thelowest mean test error when α ∈ { . , . , . } .Overall, the results demonstrate that sparse groupl asso maintains superior test accuracy with similar spar-sity levels as group lasso when trading accuracy for spar-sity as α increases. In this work, we propose sparse group l asso, a new vari-ant of sparse group lasso where the ℓ norm on the weightparameters is replaced with the ℓ norm. We developa new algorithm to optimize loss functions regularizedwith sparse group l asso for DNNs in order to attain asparse network with competitive accuracy. We compareour method with various baseline methods on MNIST andCIFAR 10/100 on different CNNs. The experimental re-7 ethod Mean Weight Sparsity (%) [Std (%)] Mean Neuron Sparsity (%)[Std (%)] Test Error (%) [Std (%)] ℓ [26] 0.02 [ < . ] 0 [0] 0.69 [0.02]CGES 94.12 [0.26] 39.33 [1.61] 0.65 [0.04]group lasso 88.38 [0.49] 69.39 [0.64] 0.76 [0.02]sparse group lasso 93.50 [0.13] 73.52 [0.49] 0.77 [0.03]sparse group ℓ asso (proposed) 89.27 [0.46] 68.25 [0.49] 0.67 [0.02] Table 1: Comparison of the baseline methods and sparse group l asso regularization method on LeNet-5-Caffe trainedon MNIST. Mean weight sparsity, mean neuron sparsity, and mean test error across 5 runs after 200 epochs are shown. N = 60000 , the number of training points. W e i g h t Sp a r s i t y CGESgroup lassosparse group lassosparse group lasso l0 (a) Mean weight sparsity N e u r o n Sp a r s i t y CGESgroup lassosparse group lassosparse group lasso l0 (b) Mean neuron sparsity T e s t E rr o r ( % ) CGESgroup lassosparse group lassosparse group lasso l0 (c) Mean test error

Figure 1: Mean results for CIFAR-10 on WRN-28-10 across 5 runs when varying the regularization parameter λ = α/ when α ∈ { . , . , . , . , . } .sults demonstrate that in general, sparse group l asso at-tains similar weight and neuron sparsity as group lassowhile maintaining competitive accuracy.For our future work, we plan to extend our proposedvariant to other nonconvex penalties, such as ℓ − ℓ ,transformed ℓ , and ℓ / . We will examine these noncon-vex sparse group lasso methods on various experiments,not only on MNIST and CIFAR 10/100 but also on TinyImagenet and Street View House Number trained on dif-ferent networks such as MobileNetv2 [33]. In addition, we might investigate in developing an alternating direc-tion method of multipliers algorithm [5] as an alternativeto the algorithm developed in this paper. The work was partially supported by NSF grant IIS-1632935, DMS- 1854434, a Qualcomm Faculty Award,and Qualcomm AI Research.8 .0 0.1 0.2 0.3 0.4 0.5 0.6α0.00.10.20.30.40.50.60.70.8 W e i g h t Sp a r s i t y CGESgroup lassosparse group lassosparse group lasso l0 (a) Mean weight sparsity N e u r o n Sp a r s i t y CGESgroup lassosparse group lassosparse group lasso l0 (b) Mean neuron sparsity T e s t E rr o r ( % ) CGESgroup lassosparse group lassosparse group lasso l0 (c) Mean test error)

Figure 2: Mean results for CIFAR-100 on WRN-28-10 across 5 runs when varying the regularization parameter λ = α/ when α ∈ { . , . , . , . , . } . References [1] A. Aghasi, A. Abdi, N. Nguyen, and J. Romberg. Net-trim:Convex pruning of deep neural networks with performanceguarantee. In

Advances in Neural Information ProcessingSystems , pages 3177–3186, 2017.[2] J. M. Alvarez and M. Salzmann. Learning the number ofneurons in deep networks. In

Advances in Neural Infor-mation Processing Systems , pages 2270–2278, 2016.[3] J. Ba and R. Caruana. Do deep nets really need to bedeep? In

Advances in neural information processing sys-tems , pages 2654–2662, 2014.[4] C. Bao, B. Dong, L. Hou, Z. Shen, X. Zhang, andX. Zhang. Image restoration by minimizing zeronorm of wavelet frame coefﬁcients.

Inverse problems ,32(11):115004, 2016.[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al.Distributed optimization and statistical learning via the al-ternating direction method of multipliers.

Foundations andTrends R (cid:13) in Machine learning , 3(1):1–122, 2011.[6] R. H. Chan, T. F. Chan, L. Shen, and Z. Shen. Waveletalgorithms for high-resolution image reconstruction. SIAM Journal on Scientiﬁc Computing , 24(4):1408–1432, 2003.[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, andA. L. Yuille. Deeplab: Semantic image segmentation withdeep convolutional nets, atrous convolution, and fully con-nected crfs.

IEEE transactions on pattern analysis andmachine intelligence , 40(4):834–848, 2017.[8] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fer-gus. Exploiting linear structure within convolutional net-works for efﬁcient evaluation. In

Advances in neural in-formation processing systems , pages 1269–1277, 2014.[9] T. Dinh and J. Xin. Convergence of a relaxed vari-able splitting method for learning sparse neural networksvia ℓ , ℓ , and transformed- ℓ penalties. arXiv preprintarXiv:1812.05719 , 2018.[10] B. Dong and Y. Zhang. An efﬁcient algorithm for l0 min-imization in wavelet frame based image restoration. Jour-nal of Scientiﬁc Computing , 54(2-3):350–368, 2013.[11] J. Fan and R. Li. Variable selection via nonconcave pe-nalized likelihood and its oracle properties.

Journal ofthe American statistical Association , 96(456):1348–1360,2001.

12] S. Han, H. Mao, and W. J. Dally. Deep compres-sion: Compressing deep neural networks with pruning,trained quantization and huffman coding. arXiv preprintarXiv:1510.00149 , 2015.[13] S. Han, J. Pool, J. Tran, and W. Dally. Learning bothweights and connections for efﬁcient neural network. In

Advances in neural information processing systems , pages1135–1143, 2015.[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn-ing for image recognition. In

Proceedings of the IEEE con-ference on computer vision and pattern recognition , pages770–778, 2016.[15] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Net-work trimming: A data-driven neuron pruning approachtowards efﬁcient deep architectures. arXiv preprintarXiv:1607.03250 , 2016.[16] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara,A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama,et al. Speed/accuracy trade-offs for modern convolutionalobject detectors. In

Proceedings of the IEEE conferenceon computer vision and pattern recognition , pages 7310–7311, 2017.[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long,R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convo-lutional architecture for fast feature embedding. In

Pro-ceedings of the 22nd ACM international conference onMultimedia , pages 675–678. ACM, 2014.[18] X. Jin, X. Yuan, J. Feng, and S. Yan. Training skinny deepneural networks with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423 , 2016.[19] D. P. Kingma and J. Ba. Adam: A method for stochasticoptimization. arXiv preprint arXiv:1412.6980 , 2014.[20] A. Krizhevsky and G. Hinton. Learning multiple layersof features from tiny images. Technical report, Citeseer,2009.[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetclassiﬁcation with deep convolutional neural networks. In

Advances in neural information processing systems , pages1097–1105, 2012.[22] A. Krogh and J. A. Hertz. A simple weight decay canimprove generalization. In

Advances in neural informationprocessing systems , pages 950–957, 1992.[23] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition.

Proceed-ings of the IEEE , 86(11):2278–2324, 1998.[24] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P.Graf. Pruning ﬁlters for efﬁcient convnets. arXiv preprintarXiv:1608.08710 , 2016. [25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutionalnetworks for semantic segmentation. In

Proceedings of theIEEE conference on computer vision and pattern recogni-tion , pages 3431–3440, 2015.[26] C. Louizos, M. Welling, and D. P. Kingma. Learningsparse neural networks through l regularization. CoRR ,abs/1712.01312, 2017.[27] Z. Lu and Y. Zhang. Sparse approximation via penaltydecomposition methods.

SIAM Journal on Optimization ,23(4):2448–2478, 2013.[28] R. Ma, J. Miao, L. Niu, and P. Zhang. Transformed ℓ reg-ularization for learning sparse deep neural networks. arXivpreprint arXiv:1901.01021 , 2019.[29] D. Molchanov, A. Ashukha, and D. Vetrov. Variationaldropout sparsiﬁes deep neural networks. In Proceedings ofthe 34th International Conference on Machine Learning-Volume 70 , pages 2498–2507. JMLR. org, 2017.[30] J. Nocedal and S. Wright.

Numerical optimization .Springer Science & Business Media, 2006.[31] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep facerecognition. In bmvc , page 6, 2015.[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: To-wards real-time object detection with region proposal net-works. In

Advances in neural information processing sys-tems , pages 91–99, 2015.[33] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C.Chen. Mobilenetv2: Inverted residuals and linear bottle-necks. In

Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition , pages 4510–4520,2018.[34] S. Scardapane, D. Comminiello, A. Hussain, andA. Uncini. Group sparse regularization for deep neuralnetworks.

Neurocomputing , 241:81–89, 2017.[35] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. Asparse-group lasso.

Journal of Computational and Graph-ical Statistics , 22(2):231–245, 2013.[36] K. Simonyan and A. Zisserman. Very deep convolu-tional networks for large-scale image recognition.

CoRR ,abs/1409.1556, 2015.[37] R. Tibshirani. Regression shrinkage and selection via thelasso.

Journal of the Royal Statistical Society: Series B(Methodological) , 58(1):267–288, 1996.[38] J. Trzasko, A. Manduca, and E. Borisch. Sparse mri recon-struction via multiscale l0-continuation. In , pages176–180. IEEE, 2007.[39] K. Ullrich, E. Meeds, and M. Welling. Soft weight-sharingfor neural network compression. stat , 1050:9, 2017.

40] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learningstructured sparsity in deep neural networks. In

Advances inneural information processing systems , pages 2074–2082,2016.[41] J. Yoon and S. J. Hwang. Combined group and exclu-sive sparsity for deep neural networks. In

Proceedings ofthe 34th International Conference on Machine Learning-Volume 70 , pages 3958–3966. JMLR. org, 2017.[42] M. Yuan and Y. Lin. Model selection and estimation inregression with grouped variables.

Journal of the RoyalStatistical Society: Series B (Statistical Methodology) ,68(1):49–67, 2006.[43] X.-T. Yuan, P. Li, and T. Zhang. Gradient hard thresh-olding pursuit.

Journal of Machine Learning Research ,18:166–1, 2017.[44] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146 , 2016.[45] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals.Understanding deep learning requires rethinking general-ization. arXiv preprint arXiv:1611.03530 , 2016.[46] Y. Zhang, B. Dong, and Z. Lu. l0 minimization for waveletframe based image restoration.

Mathematics of Computa-tion , 82(282):995–1015, 2013.[47] Y. Zhou, R. Jin, and S. C.-H. Hoi. Exclusive lasso formulti-task feature selection. In

Proceedings of the Thir-teenth International Conference on Artiﬁcial Intelligenceand Statistics , pages 988–995, 2010.[48] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu,J. Huang, and J. Zhu. Discrimination-aware channel prun-ing for deep neural networks. In

Advances in Neural In-formation Processing Systems , pages 875–886, 2018., pages 875–886, 2018.