Neural Network Compression Via Sparse Optimization
Tianyi Chen∗§, Bo Ji†§, Yixin Shi∗, Tianyu Ding‡, Biyi Fang∗, Sheng Yi∗, Xiao Tu∗

ABSTRACT
The compression of deep neural networks (DNNs) to reduce inference cost becomes increasingly important for meeting realistic deployment requirements of various applications. There has been a significant amount of work on network compression, but most of it is heuristic rule-based or not easily incorporated into varying scenarios. On the other hand, sparse optimization, which yields sparse solutions, naturally fits the compression requirement; however, due to the limited study of sparse optimization in stochastic learning, its extension and application to model compression is rarely well explored. In this work, we propose a model compression framework based on recent progress in sparse stochastic optimization. Compared to existing model compression techniques, our method is effective, requires fewer extra engineering efforts to incorporate with varying applications, and has been numerically demonstrated on benchmark compression tasks. In particular, compared to the baseline heavy models, we achieve up to .× and .× FLOPs reduction with competitive .% and .% top-1 accuracy on VGG16 for CIFAR10 and ResNet50 for ImageNet, respectively.

1 INTRODUCTION
In the past decade, building deeper and larger neural networks has been a primary trend for pushing the accuracy boundary in various artificial intelligence scenarios (Long et al., 2015; Vaswani et al., 2017). However, these high-capacity networks typically carry significantly heavier inference costs in both time and space complexity, especially when used with embedded sensors or mobile devices where computational resources are usually limited. In addition, for cloud artificial intelligence services, servers operating on a limited latency budget (Shi et al., 2020) benefit remarkably from deployed models that are computationally efficient, which helps maintain a reliable service level agreement during heavy cloud traffic. For these applications, in addition to satisfactory generalization accuracy, computational efficiency and small network size are crucial factors for business success.
Literature Review:
There have been numerous efforts devoted to network compression (Buciluǎ et al., 2006) to achieve speedup and efficient model inference, and the studies have largely evolved into (i) weight pruning, (ii) quantization, and (iii) knowledge distillation. Weight pruning mainly focuses on, given a heavy model with high generalization accuracy, filtering out redundant kernels to achieve slimmer architectures (Gale et al., 2019), including Bayesian pruning (Zhou et al., 2019; Neklyudov et al., 2017; Louizos et al., 2017), ranking kernel importance (Luo et al., 2017; Hu et al., 2016; He et al., 2018a; Li et al., 2019), and compression by reinforcement learning (He et al., 2018b). Quantization is the process of constraining network parameters to lower precision (Wu et al., 2016). With network parameters quantized, the majority of operations, e.g., convolution and matrix multiplication, can be efficiently estimated via approximate inner product computation (Jacob et al., 2018; Zhou et al., 2016). Knowledge distillation constructs a teacher-student training pipeline to transfer knowledge from a heavy and accurate teacher network to a prescribed smaller student network (Hinton et al., 2015; Polino et al., 2018).

∗ Microsoft, Redmond, WA, USA. E-mail: {tiachen, yixshi, bif, shengyi, xiaotu}@microsoft.com. † National University of Singapore, Singapore. E-mail: [email protected]. ‡ Johns Hopkins University, Baltimore, MD, USA. E-mail: [email protected]. § These authors contributed equally.
Figure 1: Overview of the alternating schema.

We take the perspective that these existing methods tackle network compression from different aspects and work as complements to each other; in this work we study the weight pruning method. We note that existing weight pruning methods are perhaps either simple and heuristic (Li et al., 2016; Luo et al., 2017), hence easily hurting generalization accuracy, or sophisticated, so that numerous engineering efforts are required to reproduce them and employ them on varying applications (He et al., 2018b), which may be prohibitive for applications with complicated training procedures. Therefore, establishing a simple and effective compression pipeline becomes indispensable, especially given the irresistible popularization of various artificial intelligence applications.

On the other hand, to remove redundancy of heavy models in convex learning, sparse optimization, which augments the problem with a sparsity-inducing regularization term to yield sparse solutions (including many zeros) (Beck & Teboulle, 2009; Chen, 2018), serves as one of the most effective methods in classical feature engineering (Tibshirani, 1996; Roth & Fischer, 2008), where the redundancy is trimmed based on the distribution of zero entries in the solution. However, its analogous application in non-convex deep learning to filter redundancy in networks is far from well explored. In fact, although quite a few works formulate sparse optimization to compress networks, they either do not actually utilize the sparsity ratio of zero weight parameters, instead ranking magnitudes to filter out weight parameters whose norms are sufficiently small (Li et al., 2016), or they do not perform comparably to other approaches (Wen et al., 2016; Lin et al., 2019; Gale et al., 2019). As investigated in Xiao (2010); Xiao & Zhang (2014); Chen et al. (2020a;b), the main reason for the limited performance of sparse optimization on network compression is that effective sparsity exploration in stochastic learning is hard to achieve due to randomness and limited sparsity identification mechanisms, i.e., the solutions computed by most stochastic sparse optimization algorithms are typically fully dense. Hence, previous compression techniques based on sparse optimization may not utilize an accurate sparsity signal to achieve effective model compression. Moreover, sparsity identification in stochastic non-convex learning has recently been improved to some extent by the progress made by orthant-face and half-space projection for ℓ1-regularization and group-sparsity regularization problems, respectively (Chen et al., 2020a;b). Therefore, it is natural to ask whether this recent progress on stochastic sparse optimization is applicable and beneficial to the network compression scenario. In this work, we attempt to re-investigate the utility of sparse optimization on the weight pruning task to search for approximately optimal model architectures.

Our Contributions:
We propose a novel network compression framework based on the recent sparse optimization algorithm OBProx-SG for solving ℓ1-regularization, called the Recursive Sparse Pruning (RSP) method. Compared to other weight pruning methods, RSP is easily plugged into various applications with competitive performance. We now summarize our contributions as follows.

• Algorithmic Design:
We formulate the model compression problem as an optimization problem and design an alternating recursive schema, as shown in Figure 1, to search for an approximately optimal solution, wherein the optimal model architecture requires much less time and space for inference and achieves accuracy competitive with the original full model. Compared to other pruning methods, RSP is more friendly to incorporate with various applications with much less effort, as the fundamental step herein, i.e., redundancy exploration, is tackled by a well-packaged optimization algorithm (OBProx-SG) applicable to broad training pipelines.
• Numerical Experiments:
We numerically demonstrate the effectiveness of our compression approach on two benchmark model compression problems, i.e., VGG16 on CIFAR10 and ResNet50 on ImageNet (ILSVRC2012), with up to .× and .× FLOPs reduction and only slight accuracy regression. Compared to the results presented in Zhou et al. (2019), our approach significantly outperforms all other pruning methods on both FLOPs reduction and accuracy maintenance.
2 NOTATIONS AND PRELIMINARIES
Consider a CNN model consisting of L convolutional layers, which are interlaced with various activation functions, e.g., rectified linear units, and pooling operators, e.g., max pooling. For the convolution operation at the l-th convolutional layer, an input tensor I^l of size C_l × H^l_I × W^l_I is transformed into an output tensor O^l of size C_{l+1} × H^l_O × W^l_O by the following linear mapping:

O^l_{c',h',w'} = Σ_{c=1}^{C_l} Σ_{i=1}^{H^l_K} Σ_{j=1}^{W^l_K} K^l_{c',c,i,j} I^l_{c,h_i,w_j},   (1)

where the convolutional filter K^l at the l-th layer is a tensor of size K^l × C_l × H^l_K × W^l_K with K^l = C_{l+1}. In particular, we denote by K^l_{c'} ∈ R^{C_l × H^l_K × W^l_K} the c'-th kernel of the l-th convolutional layer. Similarly to Lin et al. (2019), we flatten the tensor K^l into a filter matrix K^l of size K^l × C_l H^l_K W^l_K. Hence, the c'-th row of K^l exactly stores the parameters of the c'-th kernel.

Additionally, we consider several norms of the filter matrix K^l that are widely used to formulate sparsity-inducing regularization problems. One is the ℓ1-norm of K^l,

‖K^l‖_1 = Σ_{c'} ‖K^l_{c'}‖_1,   (2)

which induces a sparse solution whose zero elements may be distributed randomly. To encode a more sophisticated group sparsity structure, the mixed ℓ1/ℓ2 norm is introduced as

‖K^l‖_{1,2} = Σ_{c'} ‖K^l_{c'}‖_2,   (3)

which induces entire rows of K^l to be zero when solving the sparse optimization problem. In this work, we focus on leveraging sparse ℓ1-regularized optimization to compress heavy networks and leave the mixed ℓ1/ℓ2-regularization for encoding group sparsity to future work.
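To make the flattening and the two norms concrete, the following is a minimal PyTorch-style sketch (the function names are ours, not from the paper), assuming the usual (out_channels, in_channels, kH, kW) kernel layout:

```python
import torch

def filter_matrix(kernel: torch.Tensor) -> torch.Tensor:
    # Flatten a kernel tensor of shape (K_l, C_l, H_K, W_K) into the
    # K_l x (C_l * H_K * W_K) filter matrix; row c' holds the c'-th kernel.
    return kernel.reshape(kernel.shape[0], -1)

def l1_norm(kernel: torch.Tensor) -> torch.Tensor:
    # Eq. (2): sum of absolute values of all entries (row-wise l1 norms summed).
    return filter_matrix(kernel).abs().sum()

def mixed_l1_l2_norm(kernel: torch.Tensor) -> torch.Tensor:
    # Eq. (3): sum of the l2 norms of the rows, which pushes whole kernels to zero.
    return filter_matrix(kernel).norm(p=2, dim=1).sum()

# Example: a random layer with 64 filters, 3 input channels and 3x3 kernels.
k = torch.randn(64, 3, 3, 3)
print(l1_norm(k).item(), mixed_l1_l2_norm(k).item())
```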
3 NETWORK COMPRESSION PROBLEM FORMULATION

Essentially, network compression searches for a slimmer model architecture given a prescribed network topology while minimizing the sacrifice in generalization accuracy, which can be formulated as the following constrained optimization problem:

minimize_{w,M} f(w | M, D)  subject to  |M| ≤ T,   (4)

where M denotes the model architecture associated with its weight parameters w, f is the loss function measuring the deviation of the predicted output on the evaluation dataset D under (w, M) from the ground-truth output, and the constraint |M| ≤ T ∈ Z_+ restricts the size of the model M. It is well recognized that (4) is NP-hard and intractable in general, but one can solve it approximately by transforming it into the unconstrained optimization problem

minimize_{w,M} f(w | M, D) + λ · Ω(M),   (5)

where the constraint in (4) is moved into the objective as an extra penalty term Ω(M) with a Lagrangian coefficient λ. Here Ω serves as a regularizer on the size of the model M, which can be further relaxed to the ℓ1-norm of the weight parameters w on M:

minimize_{w,M} f(w | M, D) + λ · ‖w‖_1.   (6)

Compared to (4) and (5), problem (6) is tractable by our proposed alternating algorithm below, in which we iteratively update w given a fixed M, then construct M given the computed w, until convergence. The solution (w∗, M∗) of problem (6) is expected to possess a smaller model size and similar evaluation accuracy compared to the initial heavy model M, thereby accelerating inference.
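As a concrete reading of objective (6), here is a minimal PyTorch-style sketch; the helper name and the choice of cross-entropy as the data-fitting loss are our assumptions, since the paper does not prescribe a specific loss:

```python
import torch
import torch.nn.functional as F

def penalized_loss(model: torch.nn.Module,
                   inputs: torch.Tensor,
                   targets: torch.Tensor,
                   lam: float) -> torch.Tensor:
    # Objective (6): data-fitting loss f plus lambda times the l1 norm of all weights.
    f = F.cross_entropy(model(inputs), targets)
    l1 = sum(p.abs().sum() for p in model.parameters())
    return f + lam * l1
```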
4 THE COMPRESSION METHOD

In this section, we provide a comprehensive introduction to our compression technique via sparse optimization, as illustrated by Figure 1. Our RSP prunes redundant filters from a high-capacity model for computational efficiency while avoiding sacrificing generalization accuracy. In general, to solve the formulated compression problem (6), RSP, as stated in Algorithm 1, is a framework consisting of two stages: (i) an iteratively alternating compression schema that searches for an approximately optimal model architecture, i.e., repeatedly updating w_t given M_t and then updating M_{t+1} given w_t until convergence, as in lines 2 to 5 of Algorithm 1; and (ii) fine-tuning the model parameters given the searched optimal model architecture, as in line 6 of Algorithm 1. In the remainder of this section, we first present how the alternating compression schema works, then explain the fine-tuning of the network weight parameters, and conclude by revealing the Bayesian interpretation of our approach.
Algorithm 1 The Network Compression via Sparse Optimization Framework
1: Input: initial high-capacity CNN architecture M_0 and dataset D.
2: while the network compression does not converge do
3:    Update w_t given M_t: train the weight parameters w_t on dataset D,
         w_t ← SparseOptimization(M_t, D, λ_t).   (Explore redundancy)
4:    Update M_{t+1} given w_t: construct the model M_{t+1} by Algorithm 2,
         M_{t+1} ← Compression(M_t, w_t).   (Compress M_t)
5:    Update λ_{t+1}: scale proportionally to the model size ratio between M_{t+1} and M_t,
         λ_{t+1} ← |M_{t+1}| / |M_t| · λ_t.   (7)
6: Global fine-tuning: retrain to attain weights w∗ given the optimal model M∗.
7: Output: pruned CNN architecture M∗ with tuned weight parameters w∗.

Explore redundancy: update w_t given M_t. Given the current, possibly heavy, model architecture M_t, we aim to evaluate the redundancy of each layer in order to construct smaller structures with the same prediction power. To achieve this, we first train M_t under a sparsity-inducing regularization to attain a highly sparse solution w_t, i.e.,

w_t = argmin_w F(w | M_t, D) := f(w | M_t, D) + λ_t · ‖w‖_1,   (8)

where ‖w‖_1 penalizes the density and magnitude of the network weight parameters; theoretically, under a proper λ_t, the solution w_t of problem (8) tends to have both high sparsity and high evaluation accuracy. Problems of the form (8) are well known as ℓ1-regularization problems, which are widely used for feature selection in convex applications to filter out redundant features via the sparsity of the solutions. Consequently, it is natural to apply such regularization to determine the redundancy of deep neural networks. Unfortunately, although problems of the form (8) have been well studied in convex deterministic learning (Beck & Teboulle, 2009; Keskar et al., 2015; Chen et al., 2017), their study in stochastic non-convex applications, e.g., deep learning, is quite limited, i.e., the solutions computed by existing stochastic optimization methods are typically dense. Because of this limitation on sparsity exploration, ℓ1-regularization is typically not used in the standard way to identify redundancy for network compression; in the recent literature it is leveraged only to rank the importance of filters by magnitude rather than by sparsity (Li et al., 2016). We note that the recent orthant-face prediction method (OBProx-SG) (Chen et al., 2020a) made a breakthrough in the study of ℓ1-regularization, especially in sparsity exploration, compared to classical proximal methods in stochastic non-convex learning (Xiao, 2010; Xiao & Zhang, 2014). In particular, the solutions computed by OBProx-SG typically have multiple times higher sparsity and competitive objective convergence compared to the solutions computed by the competitors. Hence, we employ OBProx-SG in line 3 of Algorithm 1 to seek highly sparse weight parameters w_t on the model architecture M_t without accuracy regression.

Compress model: update M_{t+1} given w_t and M_t. Given the highly sparse weight parameters w_t of the model architecture M_t yielded by OBProx-SG, we perform a specified compression algorithm, e.g., filter pruning, to construct a smaller model architecture M_{t+1}, following the high-level procedure in Algorithm 2, wherein the sparsity ratio of each building block is leveraged to represent the redundancy of the corresponding layers. In particular, taking fully convolutional layers (no residual or skip connections) as a concrete example, as stated in Algorithm 3, the sparsity ratio of each ℓ-th convolutional layer is computed in lines 2 to 3 by counting the percentage of zero elements. Then, following the backward path of M_{t+1}, the ℓ-th convolutional layer shrinks its number of filters proportionally to the computed sparsity ratio s_{ℓ+1,t}, as in lines 5 to 6. Finally, the architecture of M_{t+1} is finalized after a calibration of potential structural inconsistencies introduced during pruning, to guarantee that the ultimate network generates outputs of the desired shape.
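For intuition on the redundancy-exploration step (line 3 of Algorithm 1), the sketch below shows one ℓ1-regularized proximal stochastic-gradient update on problem (8). It is only a plain Prox-SG stand-in, not OBProx-SG itself, whose additional orthant-face projection is described in Chen et al. (2020a); the function names are ours.

```python
import torch

def soft_threshold(w: torch.Tensor, tau: float) -> torch.Tensor:
    # Proximal operator of tau * ||w||_1: shrink every entry toward zero.
    return torch.sign(w) * torch.clamp(w.abs() - tau, min=0.0)

def prox_sg_step(model: torch.nn.Module, loss_fn, batch, lr: float, lam: float) -> float:
    # One stochastic proximal-gradient step on f(w) + lam * ||w||_1, cf. problem (8).
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p.copy_(soft_threshold(p - lr * p.grad, lr * lam))
    return loss.item()
```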
Update the regularization coefficient λ_{t+1} given λ_t, M_t and M_{t+1}. The regularization coefficient λ in problem (6) balances the sparsity level of the exact optimal solution against the generalization accuracy. A larger λ typically results in higher sparsity of the solution w computed by OBProx-SG, but sacrifices more in terms of the bias of the model estimation. Hence, to achieve both a low objective value and a highly sparse solution, λ requires careful selection, and its setting is usually related to the number of variables involved per iteration during optimization. Thus, we scale λ_t according to the model size ratio between M_{t+1} and M_t to yield λ_{t+1}, as in line 5 of Algorithm 1.
Algorithm 2 Compress Step for General CNNs
1: Input: CNN model M_t of L_t building blocks and sparse weights w_t.
2: for ℓ = 1, ..., L_t do
3:    Compute the empirical sparsity ratio s_{ℓ,t} given w_t as
         s_{ℓ,t} ← #{zero entries of K^{ℓ,t}} / #{entries of K^{ℓ,t}}.   (9)
4: Remove redundancy in each building block according to the sparsity ratios {s_{ℓ,t}}_{ℓ=1}^{L_t} to form M_{t+1}.
5: Output: the CNN model M_{t+1}.
Algorithm 3 Compress Step for Fully Convolutional Layers
1: Input: parameter x_t, CNN model M_t of L_t convolutional layers and weight parameters w_t.
2: for ℓ = 1, ..., L_t do
3:    Compute the empirical sparsity ratio s_{ℓ,t} given w_t as
         s_{ℓ,t} ← #{zero entries of K^{ℓ,t}} / #{entries of K^{ℓ,t}}.   (10)
4: Initialize M_{t+1} with L_{t+1} ← L_t convolutional layers of the same architecture as M_t.
5: for ℓ = L_{t+1} − 1, ..., 1 do
6:    In the ℓ-th convolutional layer, shrink the number of filters from K^{ℓ,t} to
         K^{ℓ,t+1} = ⌈K^{ℓ,t} · (1 − s_{ℓ+1,t})⌉.   (11)
7: Calibrate the architecture of M_{t+1}.
8: Output: the CNN model M_{t+1}.
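A minimal sketch of this compress step for a plain stack of convolutional layers is given below; the function names and the eps floor (used later in Eq. (12)) are ours, and the calibration of downstream layer shapes is omitted.

```python
import math
from typing import List
import torch

def sparsity_ratio(kernel: torch.Tensor, tol: float = 0.0) -> float:
    # Eq. (9)/(10): fraction of (near-)zero entries in the layer's kernel tensor.
    return (kernel.abs() <= tol).float().mean().item()

def shrunken_filter_counts(kernels: List[torch.Tensor], eps: float = 0.0) -> List[int]:
    # Eq. (11): layer l keeps ceil(K_l * max(1 - s_{l+1}, eps)) filters, where
    # s_{l+1} is the sparsity ratio of the next layer; the last layer is untouched.
    ratios = [sparsity_ratio(k) for k in kernels]
    new_counts = [k.shape[0] for k in kernels]
    for l in range(len(kernels) - 1):
        keep = max(1.0 - ratios[l + 1], eps)
        new_counts[l] = math.ceil(new_counts[l] * keep)
    return new_counts

# Example: three layers with 64, 128 and 256 filters.
layers = [torch.randn(64, 3, 3, 3), torch.randn(128, 64, 3, 3), torch.randn(256, 128, 3, 3)]
print(shrunken_filter_counts(layers, eps=0.1))
```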
Convergence Criteria The stopping criterion of the alternating schema is designed carefully to ensure that an approximately minimal model architecture is obtained without accuracy regression. Our design monitors several evaluation metrics, e.g., sparsity, model size and validation accuracy, during the whole procedure. In particular, if the evolution of the sparsity of w and of the model size of M flattens, or the validation accuracy regresses, the alternating schema is considered convergent and returns the best checkpoint as the optimal model M∗, as in line 2 of Algorithm 1.
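The paper does not spell out exact thresholds, so the following is only a hypothetical sketch of such a monitor, with made-up field names and tolerances:

```python
def has_converged(history: list, tol: float = 1e-3) -> bool:
    # history: one dict per alternating round with keys
    # "sparsity", "model_size" and "val_acc" (names are ours).
    if len(history) < 2:
        return False
    prev, curr = history[-2], history[-1]
    size_flat = abs(prev["model_size"] - curr["model_size"]) / prev["model_size"] < tol
    sparsity_flat = abs(curr["sparsity"] - prev["sparsity"]) < tol
    acc_regressed = curr["val_acc"] < max(h["val_acc"] for h in history[:-1])
    return (size_flat and sparsity_flat) or acc_regressed
```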
Global Fine-Tuning After searching for the optimal model M∗, whose size tends to be much smaller than that of the original M, we retrain the network in the final step to strengthen the remaining neurons after compression and enhance the generalization performance of the trimmed network, as in line 6 of Algorithm 1.
Bayesian Interpretation It is well recognized that the ℓ1-regularization in problem (8) is equivalent to placing a Laplace prior on the training problem to encourage the occurrence of zero entries in the solutions. From this point of view, our method can be categorized as weight pruning by Bayesian compression, where other methods build other priors into the original objectives (Zhou et al., 2019; Louizos et al., 2017; Neklyudov et al., 2017).
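For completeness, this standard equivalence (not specific to this paper) follows from a one-line MAP argument with an i.i.d. Laplace prior p(w_i) ∝ exp(−|w_i|/b):

```latex
\max_{w}\; p(w \mid \mathcal{D})
\;\propto\; p(\mathcal{D}\mid w)\prod_i e^{-|w_i|/b}
\quad\Longleftrightarrow\quad
\min_{w}\; \underbrace{-\log p(\mathcal{D}\mid w)}_{f(w\,|\,\mathcal{M},\,\mathcal{D})}
\;+\; \underbrace{\tfrac{1}{b}}_{\lambda}\,\|w\|_1 .
```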
5 NUMERICAL EXPERIMENTS

In this section, we validate the effectiveness of the proposed method RSP on two benchmark datasets, CIFAR10 (Krizhevsky & Hinton, 2009) and ImageNet (ILSVRC2012) (Deng et al., 2009). The CNN architectures to prune are VGG16 (Simonyan & Zisserman, 2014) and ResNet50 (He et al., 2016). We mainly report floating-point operations (FLOPs) to indicate the acceleration effect. The compression rate (CR) is also reported as another pruning criterion.
Experimental Settings
For the OBProx-SG called in Algorithm 1, the initial regularization coefficient λ is tuned over all powers of 10 between 10^− and 10^−, taking the value that achieves the same generalization accuracy as the existing benchmark. The mini-batch sizes for CIFAR10 and ILSVRC2012 are selected as 64 and 512, respectively. The learning rates follow the setting of He et al. (2016) and decay periodically by a factor of . . In practice, due to the expense of training, we empirically set the number of alternating updates to one.
We first compress VGG16 on CIFAR10, which contains 50,000 images of size 32 × 32 in the training set and 10,000 in the test set. Model performance on CIFAR10 is quantified by the accuracy of classifying the ten image classes. Considering that VGG16 was originally proposed for a large-scale dataset, its redundancy here is naturally pronounced. In our implementation, we add a lower bound on the compression ratio, i.e., ε > 0, into line 6 of Algorithm 3, so that

K^{ℓ,t+1} = ⌈K^{ℓ,t} · max{1 − s_{ℓ+1,t}, ε}⌉,   (12)

which guarantees that each layer keeps a fair number of kernels after compression; we select ε as 0.1. We report the pruning results in Table 1, together with existing state-of-the-art compression techniques reported in Zhou et al. (2019). In particular, Table 1 shows that RSP is clearly the best compression method in terms of FLOPs reduction, achieving a .× speedup whereas the others are stuck at about ×. We remark that although RSP reaches a slightly lower compression ratio than RBP because of the universal setting of ε for both convolutional and linear layers, the FLOPs reduction of RSP is roughly 2 times higher due to the more significant computational acceleration in the early layers. If we instead compare against the baseline validation accuracy, RSP regresses only by about . , which also outperforms the majority of the other compression techniques.
We now prune ResNet50 on ILSVRC2012 (Deng et al., 2009). ILSVRC2012, well known as ImageNet, is a large-scale image classification dataset containing 1,000 classes, more than 1.2 million images in the training set and 50,000 in the validation set. ResNet50 is a very deep CNN in the residual network family. It contains 16 residual blocks (He et al., 2016), in which around 50 convolutional layers are stacked. For ResNet, there exist some restrictions due to its special structure. For example, the channel number of each block in the same group needs to be consistent in order to perform the sum operation; thus it is hard to prune the last convolutional layer of each residual block directly. Hence, we treat the residual block, rather than the convolutional layer, as the basic unit when applying Algorithm 2. In particular, the sparsity ratio of each residual block is computed and then leveraged to prune redundant filters, similarly to the strategy in Luo et al. (2017). The compression results are described in Table 2, where RSP performs the best, with remarkable FLOPs reduction and the highest top-1 and top-5 accuracy.

Table 1: Comparison of pruning VGG16 on CIFAR10 (columns: Method, Architecture, CR, FLOPs, ACC); convolutional layers in the architecture column are in bold. The pruned RSP architecture is 55-31-65-63-115-75-43-52-52-52-52-52-52-52-52.

Table 2: Comparison of pruning ResNet50 on ILSVRC2012.
Method       FLOPs    Top-1 Acc.    Top-5 Acc.
Baseline     . ×      .             .
DDS          . ×      .             .
CP           . ×      .             .
ThiNet-50    . ×      .             .
RBP          . ×      .             .
RSP          . ×      .             .

6 CONCLUSIONS AND FUTURE WORK
In this report, we propose a model compression framework based on ℓ1-regularized sparse optimization and the recent breakthrough in sparsity exploration achieved by OBProx-SG. Our RSP contains a recursive alternating schema to search for an approximately optimal model architecture that accelerates network inference without sacrificing accuracy, by leveraging the sparsity distribution computed during the algorithmic procedure. Compared to other Bayesian weight pruning methods, RSP performs the best in terms of FLOPs reduction and accuracy maintenance on the benchmark compression experiments, i.e., ResNet50 on ImageNet, reducing FLOPs by .× while achieving . top-1 accuracy. As future work, we remark that how to utilize the recent group-sparsity optimization algorithm HSPG (Chen et al., 2020b) to directly remove entire redundant hidden structures from networks remains an open problem.

REFERENCES
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541, 2006.

Tianyi Chen. A Fast Reduced-Space Algorithmic Framework for Sparse Optimization. PhD thesis, Johns Hopkins University, 2018.

Tianyi Chen, Frank E. Curtis, and Daniel P. Robinson. A reduced-space algorithm for minimizing ℓ1-regularized convex functions. SIAM Journal on Optimization, 27(3):1583–1610, 2017.

Tianyi Chen, Tianyu Ding, Bo Ji, Guanyi Wang, Yixin Shi, Sheng Yi, Xiao Tu, and Zhihui Zhu. Orthant based proximal stochastic gradient method for ℓ1-regularized optimization. arXiv preprint arXiv:2004.03639, 2020a.

Tianyi Chen, Guanyi Wang, Tianyu Ding, Bo Ji, Sheng Yi, and Zhihui Zhu. A half-space stochastic projected gradient method for group-sparsity regularization. arXiv preprint arXiv:2009.12078, 2020b.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866, 2018a.

Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800, 2018b.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018.

Nitish Shirish Keskar, Jorge Nocedal, Figen Oztoprak, and Andreas Waechter. A second-order method for convex ℓ1-regularized optimization with active set prediction. arXiv preprint arXiv:1505.04315, 2015.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

Yuchao Li, Shaohui Lin, Baochang Zhang, Jianzhuang Liu, David Doermann, Yongjian Wu, Feiyue Huang, and Rongrong Ji. Exploiting kernel sparsity and entropy for interpretable CNN compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2800–2809, 2019.

Shaohui Lin, Rongrong Ji, Yuchao Li, Cheng Deng, and Xuelong Li. Toward compact convnets via structure-sparsity regularized filter pruning. IEEE Transactions on Neural Networks and Learning Systems, 31(2):574–588, 2019.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.

Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3288–3298, 2017.

Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066, 2017.

Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. Structured Bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems, pp. 6775–6784, 2017.

Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018.

Volker Roth and Bernd Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the 25th International Conference on Machine Learning, pp. 848–855, 2008.

Yixin Shi, Aman Orazaev, Tianyi Chen, and Sheng Yi. Object detection and segmentation for inking applications, September 24 2020. US Patent App. 16/360,006.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.

Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4820–4828, 2016.

Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.

Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

Yuefu Zhou, Ya Zhang, Yanfeng Wang, and Qi Tian. Accelerate CNN via recursive Bayesian pruning, 2019.