Stage-wise Channel Pruning for Model Compression
Mingyang Zhang · Linlin Ou
Abstract
Auto-ML pruning methods aim to search for a pruning strategy automatically in order to reduce the computational complexity of deep convolutional neural networks (deep CNNs). However, previous works have found that the results of many Auto-ML pruning methods cannot even surpass those of uniform pruning. In this paper, we first analyze the reason for the ineffectiveness of Auto-ML pruning. Subsequently, a stage-wise pruning (SP) method is proposed to solve this problem. As with most previous Auto-ML pruning methods, SP trains a super-net that provides proxy performance for sub-nets and searches for the sub-net with the best proxy performance. Different from previous works, we split a deep CNN into several stages and use a full-net, in which no layer is pruned, to supervise the training and the searching of the sub-nets. Remarkably, the proxy performance of sub-nets trained with SP is closer to the actual performance than in most previous Auto-ML pruning works. Therefore, SP achieves the state-of-the-art on both CIFAR-10 and ImageNet under the mobile setting.
Keywords
Model Compression · Auto-ML · Neural Network Pruning
1 Introduction

Deep convolutional neural networks (deep CNNs) [1,2,3,4] have achieved outstanding results in many computer vision tasks. However, deep CNNs come with a huge computational cost, which limits their application on embedded devices (e.g., mobile phones).
To expand the scope of application for deep CNNs, an effective neural network compression method is channel pruning. Traditional channel pruning methods always rely on human-designed rules [5,6]. Recently, inspired by Neural Architecture Search (NAS), some AutoML-based pruning works [7,8,9] have been proposed to automatically prune channels without human-designed rules. Considering a network with 10 layers where each layer contains 32 channels, the number of candidates for one layer and for the whole network could be 32 and 32^10, respectively. Thus, AutoML-based pruning methods can be seen as fine-grained NAS, because each layer has more candidates than in normal NAS [10,11,12].
Among the above-mentioned AutoML-based methods, the reinforcement-learning-based and evolutionary-based methods [13,7,6] are quite time-consuming, since every pruned network is retrained iteratively. To reduce the computation in pruning, many AutoML-based works [14,15,8,9] share weights among all candidate pruned network structures (called sub-nets) by training a super-net. A typical weight-sharing pruning approach contains three steps: training a super-net by iteratively sampling and updating different candidates, searching for the best sub-net with an evolutionary or greedy algorithm, and training the best sub-net from scratch. However, [16] argued that weight sharing causes insufficient training in the first step, since each candidate (sub-net) has only a small probability of being sampled during training. Moreover, insufficient training leads to an inaccurate evaluation in the second step, which means some candidates perform well under weight sharing but badly when trained from scratch. This can be even worse in AutoML-based pruning, because the super-net contains more candidates.
To address the above-mentioned issues, a stage-wise training and searching approach is proposed in this paper. Inspired by [17], we consider a deep CNN as several stages (e.g., ResNet-50 [3] consists of 4 stages). Each stage of sub-nets can be trained and searched independently; thus, the number of candidates in a stage is exponentially smaller than in the whole network. With a small search space in each stage, the probability of a candidate being sampled is raised, which means each sub-net can be trained fully. Besides, since we divide the network into multiple stages, we propose a distributed evolutionary algorithm in which each stage is searched independently by an evolutionary algorithm (EA). The constraints (e.g., FLOPs, latency) for each EA are provided by another EA, called the EA manager, which searches for the best combination of FLOPs budgets across the stages. Due to the small and independent stage-wise search spaces, the EAs can be run in parallel to speed up the search.
However, the lack of ground truth for each stage is a new obstacle for our method. To solve this problem, [17] uses an existing pre-trained neural network to generate stage-wise feature maps that are viewed as ground truth for each stage. Nevertheless, it is time-consuming to obtain a pre-trained neural network as the teacher network. Besides, [18] considers that the structural difference between the teacher network and the student networks has a strong impact on distillation results. Hence, we propose a stage-wise inplace distillation method that uses the full-net (the widest sub-net) to supervise the learning of the sub-nets. It is worth noting that the full-net is jointly trained with the other sub-nets; thus, there is no extra cost for obtaining the full-net.

Fig. 1 (a) The expectation of top-1 accuracy collected from ResNet-56 [3] with different numbers of candidates in one layer. The blue and red dashed lines denote ResNet-56 trained on CIFAR-10 [19] for 100 epochs and 500 epochs, respectively. (b) The expectation of top-1 accuracy collected from ResNet-18 [3], ResNet-34 [3] and ResNet-56 with different numbers of candidates in one layer. All models are trained for 100 epochs on CIFAR-10 [19].

Our contributions are fourfold:
– We propose a stage-wise training and searching pipeline for channel pruning. By splitting a CNN into several stages, the number of stage-wise candidates is exponentially smaller than the number of net-wise candidates. Hence, each candidate is trained fully, which is the essence of an accurate evaluation for searching.
– To conveniently provide stage-wise ground truth for each stage, we propose a stage-wise inplace distillation method. The core of this method is jointly training the full-net and the sub-nets, where the full-net supervises the learning of the sub-nets by offering stage-wise feature maps.
– To accelerate the searching process, a distributed evolutionary algorithm is proposed. Each stage can be searched by an EA in parallel, with constraints given by an EA manager.
– Compared to other AutoML pruning methods, our method enhances the ranking effectiveness of searching and achieves the state-of-the-art on several datasets.
2 Related Work

Neural Architecture Search. The purpose of neural architecture search is to automatically find the optimal network structure with reinforcement-learning-based (RL-based) [20,21], evolutionary-algorithm-based (EA-based) [22], gradient-based [12,23,24] and parameter-sharing [25,11,10] methods.
RL-based and EA-based methods need to evaluate each sampled network by retraining it on the dataset, which is time-consuming. Gradient-based methods can simultaneously train and search by assigning a learnable weight to each candidate operation. However, [16] argues that gradient-based approaches cause unfair training, because some candidates obtain more learning resources than others. Moreover, gradient-based approaches need more memory for training and thus cannot be applied to large-scale datasets. Parameter-sharing methods can search on large-scale datasets by activating only one candidate in each training iteration. Nevertheless, [16] finds that parameter-sharing methods cause insufficient training. Insufficient or unfair training results in an inaccurate evaluation during searching, which means the best-searched architecture is not the optimal one after retraining. To solve this problem, [17] proposed a block-wise searching method, which trains each sampled sub-net more fairly and more fully.
Pruning for CNNs.
Pruning redundant weights is a prevalent method to accelerate the inference of CNNs. According to the granularity of pruning, it can be divided into weight pruning and channel pruning. In weight pruning, individual weights in a channel are removed based on some rule [26], which yields unstructured sparse filters that cannot be accelerated directly on most hardware. Therefore, many recent works focus on channel pruning. Channel pruning methods [27,28,13,29,5] can accelerate the inference of CNNs on general-purpose hardware by reducing the number of filters, since the remaining filters are structured. Though the above methods achieve remarkable improvements in the practicality of pruning, they still need human-designed heuristics to guide the pruning.
AutoML Pruning.
Recently, inspired by NAS works, AutoML pruning methods [7,8,9,14,15] have attracted growing interest for automatically pruning deep CNNs. Different from NAS, the candidate choices in the channel pruning task are consecutive. Compared with pruning methods based on human-crafted rules, AutoML pruning methods aim to search for the best configuration without manual tuning. AMC [7] adopts an agent to sample a pruned network and evaluates its performance by training from scratch, which is time-consuming and cannot be applied to a large-scale dataset. MetaPruning [8] trains a PruningNet that can predict the weights of any pruned network, but the parameter count of the PruningNet is several times that of the target network, which leads to insufficient training. AutoSlim [9] trains a slimmable network [14], in which weights of different widths are shared, as the super-net and searches for the best sub-net with a greedy algorithm. However, during training, the width of the convolutional layers in each sub-net must be uniform. This leads to sub-nets that achieve the highest accuracy under weight sharing but perform poorly when trained from scratch. To keep the search results consistent with the retraining results, the proposed stage-wise pruning method splits a CNN into several stages and trains them separately under the supervision of the full-net, as explained in Sec. 3.
Fig. 2
Illustration of the stage-wise training. There are three forms of the network: the full-net, the sub-net and the small-net. The full-net infers the inputs once to generate its knowledge and transfers it to the sub-net and the small-net by minimizing the L2-distance between their stage-wise output feature maps. It is worth noting that these three networks share weights.
3 Method

In this section, we first introduce the problem of weight sharing, which was proposed to reduce the computational cost of AutoML pruning but causes an inaccurate evaluation of candidate sub-nets. To solve this problem, the stage-wise pruning method is then presented.

3.1 Challenge of Weight Sharing

AutoML pruning methods always need to train a super-net that shares weights among all sub-nets and to evaluate the accuracy of each sub-net. The number of sub-nets N that inherit weights from the super-net can be formulated as

||N|| = g^L    (1)

where g denotes the number of candidates for each convolutional layer and L is the depth of the CNN.
In many AutoML pruning approaches [7,8,9], pruning candidates are directly compared with each other in terms of evaluation accuracy. The sub-nets with higher evaluation accuracy are selected and expected to also deliver high accuracy after training from scratch. However, this expectation is not necessarily met: we notice that some sub-nets perform poorly after training from scratch while being evaluated well on shared parameters. For the fine-grained pruning of deep CNNs, the search space N is always extremely large. Hence, due to weight sharing, many sub-nets are insufficiently trained, which leads to an ineffective evaluation.
To visualize the performance drop caused by weight sharing, we train super-nets with different numbers of candidates. For a trained super-net, we randomly sample a batch of sub-nets from the super-net and evaluate them on the validation dataset. We use the statistical accuracy expectation E(a_super) to evaluate whether the super-net is adequately trained, which can be written as

E(a_super) = (1/n) Σ_{i=1}^{n} a_i^{sub}    (2)

where n denotes the number of randomly sampled sub-nets and a_i^{sub} represents the accuracy of the i-th sub-net. As shown in figure 1(a), as the number of candidates increases, the expectation of the top-1 accuracy of the super-net degrades dramatically under 100 epochs, while it falls only slightly under 500 epochs. It can be inferred that there are too many sub-nets, which leads to insufficient sub-net training. Although increasing the number of training iterations can alleviate this situation, it conflicts with the purpose of weight sharing.
Moreover, we train three super-nets of different depths and calculate their E(a_super) values. The expectation of the top-1 accuracy is found to be related to the depth of the CNN, as shown in figure 1(b). Thus, if the number of sub-nets in the training process can be reduced by segmenting the network into several parts, the sub-nets can be trained more fully.

3.2 Stage-wise Self-distillation for Training

As mentioned before, too many candidates in training lead to an ineffective evaluation during searching because of insufficient training. To adequately train the super-net, we divide the super-net into K stages according to its depth. Hence, the search space of the super-net can be represented by

N = [N_1, ..., N_i, ..., N_K]    (3)

where N_i denotes the search space of the i-th stage. Then we can train the super-net by training the stages separately. The learning of stage i can be formulated as

W_i* = argmin_{W_i} L_train(W_i, N_i; X, Y)    (4)

where X and Y denote the input data and the ground-truth labels, respectively. The number of candidates of the i-th stage can be written as

||N_i|| = g^{L_i}    (5)

where L_i denotes the depth of the i-th stage and is smaller than L.
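To make the reduction concrete, the following Python sketch compares the candidate counts of Eq. (1) and Eq. (5); the layer counts and the number of width choices are illustrative assumptions, not the paper's exact configuration:

# Illustrative sketch: candidate counts for net-wise vs. stage-wise search.
# The numbers (g = 32 width choices per layer, a 16-layer network split
# into 4 stages of 4 layers) are assumptions for illustration only.

g = 32                       # candidate widths per convolutional layer
stage_depths = [4, 4, 4, 4]  # L_i for each of the K stages
L = sum(stage_depths)        # total depth of the CNN

net_wise = g ** L                              # Eq. (1): ||N|| = g^L
stage_wise = [g ** Li for Li in stage_depths]  # Eq. (5): ||N_i|| = g^{L_i}

print(f"net-wise candidates:   {net_wise:.3e}")
print(f"stage-wise candidates: {[f'{n:.3e}' for n in stage_wise]}")
# Because the stages are searched independently, the effective search burden
# is the sum of the stage-wise spaces (about 4 * 32^4), not their product
# (32^16).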
The search space can be extremely reduced when we train each stage independently. However, the internal ground truth in Eq. (4) cannot be obtained directly from the dataset. [17] uses block-wise feature maps generated by a pre-trained network to supervise the training of sub-nets. However, it is time-consuming in practice to obtain a pre-trained network by training from scratch (e.g., ResNet-50 takes more than 10 GPU days). Besides, [18] found that the architectures of the teacher and student networks have a huge impact on the transfer results.
To tackle this problem, inplace distillation [15] is applied here. The essential idea behind inplace distillation is to transfer knowledge inside a single super-net from the full-net to a sub-net, in place, in each training iteration. For an individual convolutional layer, the performance of a wider candidate cannot be worse than that of a slimmer one, because the wider one can match the slimmer one by learning the weights of the useless channels toward zero. Therefore, the performance of any candidate is bounded by the smallest one and the largest one, which can be formulated as

|y_f − y_f| ≤ |y_f − y_r| ≤ |y_f − y_s|    (6)

where y_r = Σ_{i=1}^{r} w_i x_i is the aggregated feature, and r, s and f denote the channel numbers of a randomly sampled candidate, the smallest one and the largest one, respectively. This rule can also be extended to the whole super-net, which means the performance of a sub-net of any width is bounded by the small-net, which samples the smallest width for every layer, and the full-net, which samples the largest width for every layer.
Inspired by inplace distillation, we use the stage-wise representations of the full-net to supervise the sub-nets. The pipeline of stage-wise supervision with inplace distillation is shown in Fig. 2. We use Ŷ_{i−1}, the output of the (i−1)-th stage of the full-net, as the input of the i-th stage of the sub-nets. To supervise the learning of the sub-nets from the full-net, the MSE loss is used as the distillation loss, which can be given as

L_train(Ŷ_{i−1}, Y_i) = (1/K) ||Y_i − Ŷ_i||^2    (7)

where Y_i and Ŷ_i denote the outputs of the sub-net and the full-net in the i-th stage, respectively, and K is the number of channels in Y.
To ensure the lower and upper performance bounds of the super-net, the sandwich rule [15] is applied to the training pipeline. Given a batch of input images and ground-truth labels, we first calculate the task loss (e.g., cross entropy) and the gradients of the full-net by forward and backward propagation; meanwhile, its stage-wise feature maps [Ŷ_1, ..., Ŷ_k] are saved. Subsequently, under the supervision of the stage-wise feature maps from the full-net, the distillation loss of Eq. (7) and the gradients of the stage-wise sub-nets are calculated. Further, as part of the sub-net training process, we train the smallest width (the small-net) to improve the lower performance bound of the super-net. Moreover, since the ground-truth stage-wise feature maps have already been generated in the full-net forward pass, the training of each stage-wise sub-net can be sped up in a parallel way. The detailed algorithm is described in Algorithm 1.
Algorithm 1 Framework of stage-wise supervision with inplace distillation.

Input: The full-net, P; the stage-wise super-nets, [P_1, ..., P_k]; the dataset, (X, Y);
Output: The well-trained stage-wise super-nets, [P_1, ..., P_k];
1: for t = 1, ..., T do
2:   Get the next mini-batch of data x and labels y from (X, Y)
3:   Execute the full-net, y' = P(x), and save the stage-wise feature maps X̂ = [x, Ŷ_1, ..., Ŷ_{k−1}], Ŷ = [Ŷ_1, ..., Ŷ_k]
4:   Calculate the task loss, loss = C(y', y)
5:   Clear gradients, optimizer.zero_grad()
6:   Accumulate gradients, loss.backward()
7:   Randomly sample widths for the convolutional layers and obtain the stage-wise sub-nets, P^r = [P^r_1, ..., P^r_k]
8:   Uniformly take the smallest width for the convolutional layers and obtain the stage-wise small-nets, P^s = [P^s_1, ..., P^s_k]
9:   parallel for p = [P^r, P^s], x_s = [X̂, X̂], y_s = [Ŷ, Ŷ] do
10:    Execute the sub-net, y' = p(x_s)
11:    Calculate the distillation loss, loss = L(y', y_s)
12:    Accumulate gradients, loss.backward()
13:  end parallel
14:  Update the weights, optimizer.step()
15: end for
return the trained stage-wise super-nets, [P_1, ..., P_k];
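To make Algorithm 1 concrete, the following PyTorch-style sketch shows one training iteration with stage-wise inplace distillation and the sandwich rule. It is a minimal sketch: the weight-sharing stage interface stage(x, width=...) and the helpers head and sample_width are hypothetical stand-ins for the paper's super-net implementation, not its actual code.

import torch
import torch.nn.functional as F

def train_step(stages, head, x, y, optimizer, sample_width):
    """One iteration of stage-wise supervision with inplace distillation
    (cf. Algorithm 1). `stages` is a list of K weight-sharing stage modules
    that accept a `width` argument ("full", "small", or a random sample);
    this interface is a hypothetical stand-in."""
    optimizer.zero_grad()

    # Full-net forward: compute the task loss and cache every stage's output.
    feats = [x]
    out = x
    for stage in stages:
        out = stage(out, width="full")
        feats.append(out)
    task_loss = F.cross_entropy(head(out), y)
    task_loss.backward()  # gradients for the full-net weights

    # Sandwich rule: a randomly sampled sub-net and the smallest sub-net.
    for width in (sample_width(), "small"):
        distill_loss = 0.0
        for i, stage in enumerate(stages):
            # Stage i consumes the full-net's stage (i-1) output and
            # regresses the full-net's stage i output (Eq. (7), MSE).
            pred = stage(feats[i].detach(), width=width)
            distill_loss = distill_loss + F.mse_loss(pred, feats[i + 1].detach())
        distill_loss.backward()  # accumulate into the shared weights

    optimizer.step()

Because each stage regresses a cached feature map instead of a dataset label, the K distillation terms are independent and can be computed in parallel, matching lines 9-13 of Algorithm 1.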
Fig. 3 Illustration of the distributed evolutionary algorithm. There are two kinds of evolutionary algorithms, the EA Manager and the EAs. Given a FLOPs constraint for the whole network, the EA Manager is responsible for searching for the best combination of stage-wise FLOPs. The feedback for each FLOPs gene in the EA Manager is provided by an EA that searches for the smallest distillation loss under the stage-wise FLOPs constraint.

3.3 Distributed Evolutionary Algorithm for Searching

After stage-wise training, there is still an enormous number of candidate stage-wise sub-nets, and it is infeasible to evaluate all of them. Thanks to the inplace distillation training described above, the evolutionary algorithm (EA) in figure 3 is applied to search for the best stage-wise sub-net, i.e., the one with the smallest distillation loss under a given FLOPs constraint. As in other works [8,21], the gene of each stage-wise sub-net is encoded as a vector of the channel numbers of its layers. Different from those works, each gene is evaluated by Eq. (7), and the search space of an individual EA is shrunk by orders of magnitude. After evaluation, the top k genes with the smallest distillation losses are selected for mutation and crossover to generate new genes. By repeating this for several iterations, the EA can find the best stage-wise sub-net under the given FLOPs constraint. However, there is still a technical barrier in the searching process: assuming the FLOPs constraint for the pruned network is F, the stage-wise FLOPs constraints F_i for the i-th stage must satisfy F = Σ_{i=1}^{k} F_i. How should the stage-wise FLOPs constraints be assigned optimally?
To automatically find the best assignment of stage-wise FLOPs constraints, a distributed evolutionary algorithm (DEA) is proposed. The workflow of the DEA is shown in figure 3. The EA Manager is itself an evolutionary algorithm that provides the FLOPs-constraint strategy for the other EAs. Different from the EAs above, the genes of the EA Manager are encoded as vectors of the FLOPs constraints of the stages. The evaluation of each gene is the sum of the distillation losses returned by the EAs. We show the details in Algorithm 2.
Algorithm 2 Framework of the distributed evolutionary algorithm.

Input: The FLOPs constraint, C; the full-net, P; the stage-wise super-nets, [P_1, ..., P_k]; the dataset, (X, Y);
Output: The best sub-net, P_top;
1: Execute the full-net and save the stage-wise feature maps Ŷ: y' = P(X), Ŷ = [Ŷ_1, ..., Ŷ_k]
2: Randomly generate a batch of genes G under constraint C, G = [G_1, ..., G_s], s.t. ||G_i|| = ||[C_{i1}, ..., C_{ik}]|| = C
3: for t = 1, ..., T do
4:   for g = G_1, ..., G_s do
5:     Obtain the stage-wise FLOPs constraints from g, g = [C_1, ..., C_k]
6:     parallel for c = C_1, ..., C_k; x_s = X, Ŷ_1, ..., Ŷ_{k−1}; y_s = Ŷ_1, ..., Ŷ_k; p = P_1, ..., P_k do
7:       Search the best stage-wise sub-net p' and calculate its distillation loss with the EA in Algorithm 3, (p', L_{p_i}) = EA(p, x_s, y_s, c)
8:     end parallel
9:     Calculate the total loss L of g, L = L_{p_1} + ... + L_{p_k}
10:  end for
11:  Keep the top k genes G_topk according to L
12:  Generate M mutation genes, G_mutation = [G_{m1}, ..., G_{mM}], s.t. ||G_{mi}|| = C
13:  Generate S crossover genes, G_crossover = [G_{c1}, ..., G_{cS}], s.t. ||G_{ci}|| = C
14:  Generate the new population G, G = G_mutation + G_crossover
15: end for
16: Select P_top = [p'_1, ..., p'_k] with the smallest L
return P_top;
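The sum constraint on the EA Manager's genes (each gene [C_1, ..., C_k] must satisfy C_1 + ... + C_k = C) has to be preserved through sampling, mutation and crossover. A minimal sketch follows; the renormalization scheme is an assumption for illustration, since the paper does not describe how valid genes are generated:

import random

# Illustrative helpers for the EA Manager's genes: a gene is a list of
# stage-wise FLOPs budgets [C_1, ..., C_k] whose sum must equal the total
# budget C. The renormalization trick below is an assumption of this sketch.

def random_gene(total_flops, k):
    """Sample a random partition of `total_flops` into k positive budgets."""
    weights = [random.random() for _ in range(k)]
    s = sum(weights)
    return [total_flops * w / s for w in weights]

def mutate(gene, total_flops, prob=0.1, scale=0.2):
    """Perturb some budgets, then renormalize so the sum constraint holds."""
    child = [c * (1.0 + random.uniform(-scale, scale))
             if random.random() < prob else c
             for c in gene]
    s = sum(child)
    return [total_flops * c / s for c in child]

def crossover(a, b, total_flops):
    """Mix two parents stage-wise, then renormalize to the total budget."""
    child = [random.choice(pair) for pair in zip(a, b)]
    s = sum(child)
    return [total_flops * c / s for c in child]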
Algorithm 3 Framework of the evolutionary algorithm.

Input: The FLOPs constraint, C; the stage-wise super-net, P; the stage-wise feature maps, X, Y;
Output: The best stage-wise sub-net, P_top; the stage-wise distillation loss, L;
1: Randomly generate a batch of genes G under constraint C, G = [G_1, ..., G_s]
2: for t = 1, ..., T do
3:   for g = G_1, ..., G_s do
4:     Construct a stage-wise sub-net P_g according to P and g
5:     Calculate the distillation loss of P_g, L_g = L(P_g(X), Y), with L from Eq. (7)
6:   end for
7:   Keep the top k genes G_topk according to L_g
8:   Generate M mutation genes under constraint C, G_mutation = [G_{m1}, ..., G_{mM}]
9:   Generate S crossover genes under constraint C, G_crossover = [G_{c1}, ..., G_{cS}]
10:  Generate the new population G, G = G_mutation + G_crossover
11: end for
12: Select G_top with the smallest L_g
return G_top, L_{G_top};
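As a compact illustration of Algorithm 3, the sketch below implements the gene encoding, mutation and crossover in Python. The helpers eval_loss and flops_of are hypothetical stand-ins for the super-net utilities, and enforcing the FLOPs constraint by rejection sampling is an assumption of the sketch, since the paper does not specify the mechanism:

import random

def evolutionary_search(layers, flops_budget, eval_loss, flops_of,
                        pop=128, iters=10, topk=16, mut_prob=0.1):
    """Search a stage-wise sub-net (gene = per-layer channel counts) that
    minimizes the distillation loss under a FLOPs budget (cf. Algorithm 3).
    `layers` maps each layer to its candidate channel counts; `eval_loss`
    returns the stage-wise distillation loss of a gene (Eq. (7)); `flops_of`
    returns a gene's FLOPs. All three are assumed to be supplied by the
    super-net implementation."""
    def sample():
        while True:  # rejection sampling to satisfy the FLOPs constraint
            g = tuple(random.choice(c) for c in layers)
            if flops_of(g) <= flops_budget:
                return g

    population = [sample() for _ in range(pop)]
    for _ in range(iters):
        parents = sorted(population, key=eval_loss)[:topk]
        children = []
        while len(children) < pop - topk:
            if random.random() < 0.5:    # mutation
                g = list(random.choice(parents))
                for i, cands in enumerate(layers):
                    if random.random() < mut_prob:
                        g[i] = random.choice(cands)
                g = tuple(g)
            else:                        # crossover
                a, b = random.sample(parents, 2)
                g = tuple(random.choice(p) for p in zip(a, b))
            if flops_of(g) <= flops_budget:
                children.append(g)
        population = parents + children
    best = min(population, key=eval_loss)
    return best, eval_loss(best)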
4 Experiments

In this section, we demonstrate the effectiveness of the proposed stage-wise pruning method. We first explain the experimental settings on the CIFAR-10 [19] and ImageNet 2012 [30] datasets. Then, we prune ResNet [3] on CIFAR-10 and visualize the consistency of performance between searching and retraining. Moreover, we apply the stage-wise pruning method to ImageNet 2012 and compare the results with other state-of-the-art works. Last, ablation studies are carried out to examine the influence of inplace distillation.

4.1 Setups

The stage-wise pruning method consists of three steps:
Stage-wise training
According to the resolution of the feature maps, we split ResNet [3] and the MobileNet series [1,2] into 4 and 5 stages, respectively. The distillation loss of each stage is calculated by Eq. (7). To match the channel number of the full-net, the output of each stage is connected with a 1×1 convolution. On CIFAR-10 [19], we use momentum SGD to optimize the weights, with initial learning rate η = 0.1, momentum 0.9, and weight decay 3 × 10^{−}. The super-net is trained for 50 epochs with batch size 512, and the learning rate is decayed every 10 epochs.
On the ImageNet 2012 [30] dataset, we randomly sample 50 images per class from the training set as the validation dataset and use the remaining images to train the super-net. We use momentum SGD to optimize the weights, with initial learning rate η = 0.1, momentum 0.9, and weight decay 3 × 10^{−}. The super-net is trained for 100 epochs with batch size 512, and the learning rate is decayed at epochs 30, 60 and 90.

Table 1 Pruning results of ResNet-56.

Method      FLOPs(M)   Top1-Acc(%)
ResNet-56   125.49     93.27
FP          90.90      93.06
RFP         90.70      93.12
HRank       88.72      93.52
EagleEye    62.23      94.66
SP (Ours)   –          –

Stage-wise searching
After training the stage-wise super-net as above, the best sub-net is searched for each stage. First, we use the full-net to generate and save the stage-wise feature maps with a batch size of 2048. The hyperparameters of each EA and of the EA Manager are set to a population size of 128, a mutation probability of 0.1, and 10 iterations. We use 4 and 5 parallel processes to speed up the searching for the ResNet and MobileNet series, respectively. Each process can use 2 GPUs for inference with a batch size of 2048.
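Since the stage-wise targets are fixed during searching, the full-net only needs to run once per held-out batch. A minimal sketch of this caching step, reusing the hypothetical stage interface from the earlier training sketch:

import torch

@torch.no_grad()
def cache_stage_features(stages, images):
    """Run the full-net once on a held-out batch and keep every stage's
    output, so the EAs can evaluate distillation losses (Eq. (7)) without
    re-running the full-net."""
    feats = [images]
    out = images
    for stage in stages:
        out = stage(out, width="full")
        feats.append(out)
    # feats[i] is the input of stage i+1; feats[i+1] is its regression target.
    return feats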
Retraining
After searching for the best sub-net, we adopt the same training scheme as [8] on ImageNet 2012 for both the ResNet and MobileNet series. For the training scheme of ResNet on CIFAR-10, we follow [12]. Note that all baseline models are trained under the same schemes mentioned above.

4.2 Pruning ResNet on CIFAR-10 and Analysis

To demonstrate the effectiveness of stage-wise pruning, we prune ResNet-56 [3] under a 50% FLOPs constraint on CIFAR-10. As shown in Table 1, our stage-wise pruning method surpasses the baseline model by about 1.4%. Moreover, our method outperforms all other pruning methods in terms of top-1 accuracy.
To evaluate the consistency of the model-ranking abilities of our method and other AutoML methods, we visualize the relationship between the proxy performance and the actual performance. To fairly compare our method with MetaPruning [8] and AutoSlim [9], which also train a super-net first, we train a PruningNet [8] and a Universally Slimmable Network (US-Net) [15] as super-nets under the same training scheme. The total distillation loss is viewed as the proxy performance of our method; the other two methods take the top-1 accuracy of each sub-net that inherits weights from the super-net as the proxy performance. To obtain the actual performance, each sub-net is trained from scratch. As shown in figure 4, our method exhibits a strong correlation between the proxy performance and the actual performance, while the others barely rank the sub-nets.
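The consistency shown in Fig. 4 can also be quantified with a rank-correlation statistic. The snippet below computes Kendall's tau between proxy and actual performance; this is an illustrative addition with placeholder numbers, not the paper's protocol or results:

from scipy.stats import kendalltau

# proxy: distillation losses (lower is better) of sampled sub-nets;
# actual: their top-1 accuracies after training from scratch.
# The values below are placeholders, not results from the paper.
proxy = [0.41, 0.35, 0.52, 0.30, 0.47]
actual = [93.1, 93.6, 92.4, 94.0, 92.8]

# Negate the losses so that both sequences are "higher is better".
tau, p_value = kendalltau([-p for p in proxy], actual)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")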
Fig. 4
Comparison of the ranking effectiveness of stage-wise pruning, MetaPruning [8] and AutoSlim [9].
Fig. 5
The pruning results of ResNet-50. ResNet-50 is stacked from many blocks whose main branch consists of three convolutional layers. According to their locations, we simply divide the three convolutional layers of each block into top layers, middle layers and bottom layers. (a) The number of channels in the top layers. (b) The number of channels in the middle layers. (c) The number of channels in the bottom layers.
4.3 Pruning on ImageNet and Analysis

Table 2 Results of ImageNet classification. We show the top-1 accuracy of each method under the same or close FLOPs.

Network        Method         Acc@1    FLOPs
MobileNet V1   Baseline       68.4%    325M
MobileNet V1   AMC [7]        70.5%    285M
MobileNet V1   SN [14]        69.5%    325M
MobileNet V1   MP [8]         70.4%    281M
MobileNet V1   AutoSlim [9]   71.5%    325M
MobileNet V1   SP (ours)      71.7%    –
–              SP (ours)      58.5%    –
–              SP (ours)      59.2%    –
–              SP (ours)      76.1%    2.0G
–              SP (ours)      75.6%    1.0G

Several insights can be drawn from the results. We compare our results with the default channel configuration and with MetaPruning [8] on ResNet-50. In figure 5 (a-c), we show the number of channels in the top layers, middle layers and bottom layers of the bottleneck blocks of ResNet-50, respectively. First, we find that our method is prone to pruning more channels from the top layers compared with MetaPruning. Note that although the top layers have a small number of channels, the output feature maps of a top layer are processed by the next middle layer, whose kernel size is 3. Hence, pruning the top layers can reduce the computational complexity considerably. Second, both our method and MetaPruning keep more channels for the downsampling layers, because these layers shrink the feature-map size. Moreover, our method prunes fewer channels in the bottom layers, since the feature maps of the sub-net and the full-net should be as close as possible.
4.4 Ablation Study

Table 3
Comparison of stage-wise pruning with different distillation strategies.

Method   FLOPs   Acc@1
Ours     2.0G    76.1%
Ours     1.0G    75.6%
S1       2.0G    76.1%
S1       1.0G    75.6%
S2       2.0G    75.6%
S2       1.0G    73.1%
Fig. 6
The number of channels in the bottom layers with different teachers.

5 Conclusion

With the proposed stage-wise pruning, the consistency between the proxy performance and the actual performance of the sub-networks is greatly improved. We further demonstrate that inplace distillation can replace distillation from a pre-trained teacher, thereby eliminating the time needed to train a teacher network from scratch.
References
1. Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications", arXiv preprint arXiv:1704.04861, 2017.
2. Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, "MobileNetV2: Inverted residuals and linear bottlenecks", in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
3. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition", in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
4. Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, "Densely connected convolutional networks", in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
5. Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang, "Filter pruning via geometric median for deep convolutional neural networks acceleration", in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
6. Miguel A. Carreira-Perpinan and Yerlan Idelbayev, "'Learning-compression' algorithms for neural net pruning", in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
7. Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han, "AMC: AutoML for model compression and acceleration on mobile devices", in The European Conference on Computer Vision (ECCV), September 2018.
8. Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting Cheng, and Jian Sun, "MetaPruning: Meta learning for automatic neural network channel pruning", arXiv preprint arXiv:1903.10258, 2019.
9. Jiahui Yu and Thomas Huang, "AutoSlim: Towards one-shot architecture search for channel numbers", 2019.
10. Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun, "Single path one-shot neural architecture search with uniform sampling", arXiv preprint arXiv:1904.00420, 2019.
11. Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han, "Once-for-all: Train one network and specialize it for efficient deployment", arXiv preprint arXiv:1908.09791, 2019.
12. Hanxiao Liu, Karen Simonyan, and Yiming Yang, "DARTS: Differentiable architecture search", arXiv preprint arXiv:1806.09055, 2018.
13. Q. Huang, K. Zhou, S. You, and U. Neumann, "Learning to prune filters in convolutional neural networks", in The IEEE Winter Conference on Applications of Computer Vision (WACV), March 2018, pp. 709-718.
14. Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas S. Huang, "Slimmable neural networks", arXiv preprint arXiv:1812.08928, 2018.
15. Jiahui Yu and Thomas S. Huang, "Universally slimmable networks and improved training techniques", arXiv preprint arXiv:1903.05134, 2019.
16. Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li, "FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search", 2020.
17. Changlin Li, Jiefeng Peng, Liuchun Yuan, Guangrun Wang, Xiaodan Liang, Liang Lin, and Xiaojun Chang, "Blockwisely supervised neural architecture search with knowledge distillation", 2020.
18. Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr, "SNIP: Single-shot network pruning based on connection sensitivity", arXiv preprint arXiv:1810.02340, 2018.
19. Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition", arXiv preprint arXiv:1409.1556, 2014.
20. Barret Zoph and Quoc V. Le, "Neural architecture search with reinforcement learning", arXiv preprint arXiv:1611.01578, 2016.
21. Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le, "Learning transferable architectures for scalable image recognition", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
22. Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le, "Regularized evolution for image classifier architecture search", in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 4780-4789.
23. Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong, "PC-DARTS: Partial channel connections for memory-efficient differentiable architecture search", arXiv preprint arXiv:1907.05737, 2019.
24. Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian, "Progressive differentiable architecture search: Bridging the depth gap between search and evaluation", in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 1294-1303.
25. Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean, "Efficient neural architecture search via parameter sharing", arXiv preprint arXiv:1802.03268, 2018.
26. Song Han, Jeff Pool, John Tran, and William Dally, "Learning both weights and connections for efficient neural network", in Advances in Neural Information Processing Systems, 2015, pp. 1135-1143.
27. Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang, "Learning efficient convolutional networks through network slimming", in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
28. Yihui He, Xiangyu Zhang, and Jian Sun, "Channel pruning for accelerating very deep neural networks", in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
29. Jianbo Ye, Xin Lu, Zhe L. Lin, and James Zijun Wang, "Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers", arXiv preprint arXiv:1802.00124, 2018.
30. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei, "ImageNet large scale visual recognition challenge", International Journal of Computer Vision (IJCV), December 2015.
31. Xiaohan Ding, Guiguang Ding, Yuchen Guo, Jungong Han, and Chenggang Yan, "Approximated oracle filter pruning for destructive CNN width optimization", in Proceedings of the 36th International Conference on Machine Learning, Kamalika Chaudhuri and Ruslan Salakhutdinov, Eds., Long Beach, California, USA, 09-15 Jun 2019, vol. 97 of Proceedings of Machine Learning Research, pp. 1607-1616, PMLR.
32. Xiaohan Ding, Guiguang Ding, Yuchen Guo, and Jungong Han, "Centripetal SGD for pruning very deep convolutional networks with complicated structure", 2019.
33. Jian-Hao Luo, Jianxin Wu, and Weiyao Lin, "ThiNet: A filter level pruning method for deep neural network compression", 2017.
34. Mingxing Tan and Quoc V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks", arXiv preprint arXiv:1905.11946, 2019.