A Framework For Pruning Deep Neural Networks Using Energy-Based Models
This paper is accepted for presentation at the IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP), 2021.
Hojjat Salehinejad, Member, IEEE, and Shahrokh Valaee, Fellow, IEEE
Department of Electrical & Computer Engineering, University of Toronto, Toronto, Canada [email protected], [email protected]
ABSTRACT
A typical deep neural network (DNN) has a large number of trainable parameters. Choosing a network with proper capacity is challenging, and generally a larger network with excessive capacity is trained. Pruning is an established approach to reducing the number of parameters in a DNN. In this paper, we propose a framework for pruning DNNs based on a population-based global optimization method. This framework can use any pruning objective function. As a case study, we propose a simple but efficient objective function based on the concept of energy-based models. Our experiments on ResNets, AlexNet, and SqueezeNet for the CIFAR-10 and CIFAR-100 datasets show pruning rates of up to more than half of the trainable parameters with only a small drop in Top-1 and Top-5 classification accuracy.

Index Terms — Compression of neural networks, dropout, energy-based models, pruning.
1. INTRODUCTION
Pruning a deep neural network (DNN) is one of the major methods for removing redundant trainable parameters and compressing the network. This approach permanently removes a subset of trainable parameters. In general, pruning algorithms have three stages: training, pruning, and fine-tuning [1]. One pruning approach is to utilize second-derivative information to minimize a cost function that reduces network complexity by removing an excess number of trainable parameters, followed by further fine-tuning [2]. One of the major approaches is
Deep Compression, which has three main steps: pruning, quantization, and Huffman coding [3]. It prunes all connections with weights below a given threshold and then retrains the sparsified network.

Generally, probabilistic models can be considered as a special type of energy-based models (EBMs). An EBM assigns a scalar energy loss as a measure of compatibility to a configuration of parameters in neural networks, as demonstrated in Figure 1. This approach avoids computing the normalization term and can be interpreted as an alternative to probabilistic estimation [4].

Fig. 1: Energy-based model (EBM), [4].

Calculating the exact probability requires computing the partition function over all the data classes. However, for a large number of data classes, such as in language models with very large vocabularies, this becomes a bottleneck [5]. Some methods, such as annealed importance sampling [6], have been proposed to deal with this problem, which is out of the scope of this paper.

Previously, we have proposed an Ising energy model for dropout and pruning of multilayer perceptron (MLP) networks [7, 8]. In this paper, a pruning framework based on a population-based stochastic global optimization method is proposed, which is integrated into the typical training procedure of a DNN. This scheme is inspired by the concept of dropout [9] and the biological pruning of neurons in the brain. The framework can handle different pruning objective functions with multiple constraints. We also propose an energy-based pruning objective function based on the concept of an EBM in a DNN, which allocates a scalar energy value to each state vector in a population of state vectors; we call the resulting method EPruning. Each state vector is in fact a representation of a sub-network of the original DNN. Pruning is defined as searching for a binary state vector that prunes the network while minimizing the energy loss for a set of inputs and corresponding outputs in each iteration. Hence, the search for weights is conducted using a probabilistic model while the pruning state vector is fixed, and the search for the pruning state is conducted using an EBM while the weights are fixed in each iteration. The candidate states help to find a subset of the neural network and capture its energy function, which associates low energies to correct values of the remaining variables and higher energies to incorrect values. The code and more details of the experimental setup are available at: https://github.com/sparsifai/epruning.

2. PROPOSED PRUNING FRAMEWORK

2.1. Energy Model
A DNN can be modeled as a parametric function that maps an input image X ∈ 𝒳 to C real-valued numbers ǫ = {ε_1, ε_2, ..., ε_C} (a.k.a. logits). The output is then passed to a classifier, such as the Softmax function, to parameterize a categorical distribution in the form of a probability distribution over the data classes Y = {y_1, ..., y_C} [10], defined as {p(y_1), ..., p(y_C)}, where for simplicity we define p_c = p(y_c) ∀ c ∈ {1, ..., C}, as illustrated in Figure 2. The loss is then calculated based on cross-entropy with respect to the correct answer Y. The Gibbs distribution is a very general family of probability distributions defined as

p(Y | X) = e^{−βF(Y, X)} / Z(β),   (1)

where Z(β) = Σ_{y_c ∈ Y} e^{−βF(y_c, X)} is the partition function, β > 0 is the inverse temperature parameter [4], and F(·) is the Hamiltonian or the energy function. The Softmax function is a special case of the Gibbs distribution. We can obtain the energy function corresponding to a Softmax layer by setting β = 1 in (1) [11], which gives the Hamiltonian

F(Y, X) = −ǫ.   (2)

We define the following energy loss function to measure the quality of the energy function for (X, Y) with target output y_c:

E = L(Y, F) = F(y_c, X) − min{F(y_{c'}, X) : y_{c'} ∈ Y, c' ≠ c},   (3)

which can be extended to a batch of data. The energy loss function assigns a low loss value to a pruning state vector which has the lowest energy with respect to the target data class c and higher energy with respect to the other data classes, and vice versa [4].
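As a minimal illustration, the energy loss in (2)-(3) can be computed for a single example directly from the logits as in the following PyTorch-style sketch; the function name and tensor shapes are illustrative assumptions, not the paper's implementation.

import torch

def energy_loss(logits: torch.Tensor, target: int) -> torch.Tensor:
    # logits: tensor of shape (C,), the raw network outputs for one example.
    # target: index c of the correct class.
    energies = -logits                               # F(y, X) = -epsilon, eq. (2)
    target_energy = energies[target]                 # F(y_c, X) for the true class
    # Minimum energy over all classes other than the target, eq. (3).
    others = torch.cat([energies[:target], energies[target + 1:]])
    return target_energy - others.min()

For a mini-batch, the per-example values can be aggregated (for example, averaged), in line with the remark that (3) can be extended to a batch of data.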
We are interested in pruning weight kernels and hidden units, including their bias terms, referred to as units hereafter for convenience. We define a set of S binary candidate pruning state vectors as the population S of size S × D. Each vector s_i ∈ S has length D and represents a sub-network. The energy function value for each state vector is F_i ∈ {F_1, ..., F_S}. If s_{i,d} = 0, unit d is dropped, and if s_{i,d} = 1, it is active.

Algorithm 1 shows the different steps of the proposed framework. At the beginning of training (t = 0), we initialize the candidate pruning states S^{(0)} ∈ Z^{S×D}, where s^{(0)}_{i,d} ∼ Bernoulli(P) for i ∈ {1, ..., S} and d ∈ {1, ..., D}, and P is the initialization probability. For each candidate state s^{(t)}_i ∈ S^{(t)} in iteration t, the energy loss value E^{(t)}_i is calculated using (3). Searching for the pruning state which minimizes the energy loss value is an NP-hard combinatorial problem. Various methods such as MCMC [12] and simulated annealing (SA) can be used to search for low-energy states. We propose using a binary version of differential evolution (BDE) [13] to minimize the energy loss function. This method has the advantage of searching the optimization landscape in parallel and sharing the search experience among candidate states. Another advantage of this approach is the flexibility of designing the energy function with constraints.

Fig. 2: Switching between the energy-based model (EBM) and the probabilistic model. The EBM searches for the pruning state and the probabilistic model searches for the weights. Both models are aware of the target class Y during training. At inference, the best pruning state is applied and the EBM is removed.

The optimization step has three phases: mutation, crossover, and selection. Given the population of states S^{(t−1)}, a mutation vector is defined for each candidate state s^{(t−1)}_i ∈ S^{(t−1)} as

v_{i,d} = 1 − s^{(t−1)}_{i_1,d}   if s^{(t−1)}_{i_2,d} ≠ s^{(t−1)}_{i_3,d} and r_d < F,
v_{i,d} = s^{(t−1)}_{i_1,d}        otherwise,   (4)

for all d ∈ {1, ..., D}, where i_1, i_2, i_3 ∈ {1, ..., S} are mutually different, F is the mutation factor [14], and r_d ∈ [0, 1] is a random number. The next step is to cross over the mutation vectors to generate new candidate state vectors as

s̃^{(t)}_{i,d} = v_{i,d}            if r'_d ∈ [0, 1] ≤ C_r,
s̃^{(t)}_{i,d} = s^{(t−1)}_{i,d}    otherwise,   (5)

where C_r is the crossover coefficient [14]. The parameters C_r and F control exploration and exploitation of the population on the optimization landscape. Each generated state s̃^{(t)}_i is then compared with its corresponding parent with respect to its energy loss value Ẽ^{(t)}_i as

s^{(t)}_i = s̃^{(t)}_i        if Ẽ^{(t)}_i ≤ E^{(t−1)}_i,
s^{(t)}_i = s^{(t−1)}_i      otherwise,   ∀ i ∈ {1, ..., S}.   (6)

The state with the minimum energy loss E^{(t)}_b = min{E^{(t)}_1, ..., E^{(t)}_S} is selected as the best state s_b, which represents the sub-network for the next training batch. This optimization strategy is simple and feasible to implement in parallel for a large S.

Population-based optimization methods suffer from premature convergence and stagnation problems. The former generally occurs when the population (candidate state vectors) has converged to local optima, has lost its diversity, or shows no improvement in finding better solutions. The latter happens mainly when the population stays diverse during training [15]. After a number of iterations, depending on the capacity of the neural network and the complexity of the dataset, all the states in S^{(t)} may converge to a state s_b ∈ S^{(t)}. We call this the early state convergence phase and define it as

Δs = E^{(t)}_b − (1/S) Σ_{j=1}^{S} E^{(t)}_j,   (7)

where E^{(t)}_b is the energy loss of s_b. Therefore, if Δs = 0 we can call for an early state convergence and continue training by fine-tuning the sub-network identified by the state vector s_b. In addition, a stagnation threshold Δs_T is implemented: if Δs ≠ 0 after Δs_T training epochs, the energy loss optimizer is stopped and fine-tuning of the selected sub-network starts. The convergence to the best state s_b splits the training procedure into two phases, where the first phase acts similarly to dropout and the second phase fine-tunes the pruned network, defined by s_b.
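As a rough illustration of (4)-(6), one BDE update of the candidate pruning states might look as follows; the energy_of_state callable, the mutation factor F, and the crossover coefficient Cr are placeholders rather than the settings used in the paper.

import numpy as np

def bde_step(states, energies, energy_of_state, F=0.5, Cr=0.7, rng=None):
    # states:   (S, D) binary array, one candidate sub-network per row.
    # energies: (S,) energy loss of each candidate from the previous iteration.
    # energy_of_state: maps a binary state vector to its energy loss, eq. (3),
    #                  evaluated on the current mini-batch.
    rng = np.random.default_rng() if rng is None else rng
    S, D = states.shape
    new_states, new_energies = states.copy(), energies.copy()
    for i in range(S):
        i1, i2, i3 = rng.choice(S, size=3, replace=False)   # mutually different
        r = rng.random(D)
        flip = (states[i2] != states[i3]) & (r < F)
        v = np.where(flip, 1 - states[i1], states[i1])       # mutation, eq. (4)
        cross = rng.random(D) <= Cr
        trial = np.where(cross, v, states[i])                # crossover, eq. (5)
        trial_energy = energy_of_state(trial)
        if trial_energy <= energies[i]:                      # selection, eq. (6)
            new_states[i], new_energies[i] = trial, trial_energy
    best = int(np.argmin(new_energies))                      # index of s_b
    return new_states, new_energies, best

The early state convergence criterion (7) can then be evaluated at the end of each epoch by comparing the best energy against the population mean, e.g., new_energies[best] - new_energies.mean().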
3. EXPERIMENTS
The experiments were conducted on the CIFAR-10 and CIFAR-100 [16] datasets. The horizontal flip and Cutout [17] augmentation methods were used. Input images were resized to the input resolutions required by the ResNets and by AlexNet [18] and SqueezeNet v1.1 [19]. We used ResNets (18, 34, 50, and 101 layers) [20], AlexNet [18], SqueezeNet v1.1 [19], and Deep Compression [3] to evaluate EPruning. The results were averaged over five independent runs. A grid hyper-parameter search was conducted based on the Top-1 accuracy for all models, covering several initial learning rates, the Stochastic Gradient Descent (SGD) [21] and Adadelta [22] optimizers, exponential and step learning rate decays with different gamma values, and batch sizes of 64 and 128. The Adadelta optimizer with a step learning rate schedule (step: every 50 epochs at a gamma rate of 0.1) and weight decay was used. The number of epochs was 200 and the batch size was 128. Random dropout was not used in the EPruning experiments. For the other models, where applicable, the random dropout rate was set to 0.5. The early state convergence criterion in (7) was used with a threshold of 100 epochs. The models are implemented in PyTorch [23] and trained on three NVIDIA Titan RTX GPUs.
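For reference, the training configuration described above (Adadelta with a step learning rate schedule that multiplies the learning rate by 0.1 every 50 epochs, plus weight decay) can be set up in PyTorch roughly as follows; the learning rate and weight decay values are placeholders, since the exact grid values are not reproduced here.

import torchvision
from torch.optim import Adadelta
from torch.optim.lr_scheduler import StepLR

# Example model; the paper evaluates ResNets, AlexNet, and SqueezeNet v1.1.
model = torchvision.models.resnet50(num_classes=10)

# Placeholder learning rate and weight decay values.
optimizer = Adadelta(model.parameters(), lr=1.0, weight_decay=1e-4)

# Step learning rate schedule: multiply the learning rate by 0.1 every 50 epochs.
scheduler = StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(200):        # 200 epochs, batch size 128 in the paper
    # ... one epoch of mini-batch training goes here ...
    scheduler.step()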
Algorithm 1: EPruning

Set t = 0  // optimization counter
Initialize the neural network with trainable weights Θ
Set S^(0) ∼ Bernoulli(P)  // states initialization
Set Δs = 0 and the stagnation threshold Δs_T
for i_epoch = 1 → N_epoch do  // epoch counter
  for i_batch = 1 → N_batch do  // batch counter
    t = t + 1
    if Δs = 0 or i_epoch ≤ Δs_T then
      if i_epoch = 1 and i_batch = 1 then
        Compute the energy loss of S^(0) as E^(0) using (3)
      end if
      for i = 1 → S do  // states counter
        Generate mutually different i_1, i_2, i_3 ∈ {1, ..., S}
        for d = 1 → D do  // state dimension counter
          Generate a random number r_d ∈ [0, 1]
          Compute the mutation vector v_{i,d} using (4)
          Compute the candidate state s̃^(t)_{i,d} using (5)
        end for
      end for
      Compute the energy loss of S̃^(t) as Ẽ^(t) using (3)
      Select S^(t) and the corresponding energies E^(t) using (6)
      Select the state with the lowest energy from S^(t) as s^(t)_b
    else
      s^(t)_b = s^(t−1)_b
    end if
    Temporarily drop weights of the network based on the best state s^(t)_b
    Compute the loss of the sparsified network
    Perform backpropagation to update Θ
  end for
  Update Δs for early state convergence using (7)
end for

Table 1 shows the classification performance results. The original models contain all the trainable parameters and have a larger learning capacity. EPruning in its pruned and full versions has slightly lower Top-1 performance than the original models and competitive Top-5 performance. The Deep Compression [3] method receives the pruning rate as input. For the sake of comparison, we have modified it to perform pruning on every convolution layer at the rate achieved by EPruning, where it generally has lower performance than
EPruning. SqueezeNet [24] is a small network with AlexNet-level accuracy.
EPruning is also applied to AlexNet and SqueezeNet v1.1, where it has a smaller pruning rate for AlexNet compared to the ResNets but can prune approximately half of the trainable parameters in SqueezeNet v1.1 while achieving only slightly lower performance.

Figure 3 shows convergence plots of ResNet-50 with respect to the cross-entropy loss and the energy of the sub-network over 200 training epochs. The plots show that the best energy is lower than the average energy and that, after epoch 100, the sub-network is selected due to early stopping. The training loss and the energy loss follow a similar declining trend, and after epoch 100 the training loss declines more slowly. We observe that before epoch 100 the model is in an exploration phase and after this epoch it enters an exploitation (fine-tuning) phase.

[Fig. 3(a): CIFAR-10: training and validation cross-entropy loss, average energy, and best energy over 200 epochs.]
[Fig. 3(b): CIFAR-100: training and validation cross-entropy loss, average energy, and best energy over 200 epochs.]
Fig. 3: Cross-entropy loss and energy of EPruning over 200 training epochs of ResNet-50 for the CIFAR-10 and CIFAR-100 datasets with S = 8, initialization probability P, and Δs_T = 100.

Table 1: Classification performance on the test datasets. R is the percentage of kept trainable parameters and p is the approximate number of trainable parameters. All values except the loss and p are in percent. (F) refers to the full network used for inference and (P) refers to the pruned network obtained with EPruning.

(a) CIFAR-10
Model | Loss | Top-1 | Top-3 | Top-5 | R | p
ResNet-18 | 0.3181 | 92.81 | 98.78 | 99.49 | 100 | 11.2M
ResNet-18+DeepCompression | 0.6951 | 76.15 | 94.16 | 98.59 | 49.66 | 5.5M
ResNet-18+EPruning(F) | 0.4906 | 90.96 | 98.33 | 99.60 | 100 | 11.2M
ResNet-18+EPruning(P) | 0.4745 | 90.96 | 98.40 | 99.58 | 49.66 | 5.5M
ResNet-34 | 0.3684 | 92.80 | 98.85 | 99.71 | 100 | 21.3M
ResNet-34+DeepCompression | 1.057 | 66.51 | 91.40 | 97.68 | 38.83 | 8.3M
ResNet-34+EPruning(F) | 0.4576 | 88.28 | 97.47 | 99.31 | 100 | 21.3M
ResNet-34+EPruning(P) | 0.4598 | 88.21 | 97.48 | 99.28 | 38.83 | 8.3M
ResNet-50 | 0.3761 | 92.21 | 98.70 | 99.51 | 100 | 23.5M
ResNet-50+DeepCompression | 1.0271 | 67.53 | 89.92 | 96.30 | 46.39 | 10.9M
ResNet-50+EPruning(F) | 0.6041 | 85.22 | 96.35 | 98.77 | 100 | 23.5M
ResNet-50+EPruning(P) | 0.5953 | 85.30 | 96.62 | 98.76 | 46.39 | 10.9M
ResNet-101 | 0.3680 | 92.66 | 98.69 | 99.65 | 100 | 42.5M
ResNet-101+DeepCompression | 1.037 | 66.32 | 92.65 | 98.11 | 45.10 | 19.2M
ResNet-101+EPruning(F) | 0.6231 | 86.97 | 97.42 | 99.24 | 100 | 42.5M
ResNet-101+EPruning(P) | 0.6339 | 86.57 | 97.37 | 99.20 | 45.10 | 19.2M
AlexNet | 0.9727 | 84.32 | 96.58 | 99.08 | 100 | 57.4M
AlexNet+EPruning(F) | 0.7632 | 75.05 | 93.74 | 98.18 | 100 | 57.4M
AlexNet+EPruning(P) | 0.7897 | 74.66 | 93.63 | 97.96 | 77.36 | 44.4M
SqueezeNet | 0.5585 | 81.49 | 96.31 | 99.01 | 100 | 0.73M
SqueezeNet+EPruning(F) | 0.6686 | 76.76 | 94.55 | 98.62 | 100 | 0.73M
SqueezeNet+EPruning(P) | 0.6725 | 76.85 | 95.00 | 98.56 | 52.35 | 0.38M
(b) CIFAR-100
Model | Loss | Top-1 | Top-3 | Top-5 | R | p
ResNet-18 | 1.3830 | 69.03 | 84.44 | 88.90 | 100 | 11.2M
ResNet-18+DeepCompression | 2.3072 | 40.01 | 62.20 | 72.28 | 48.04 | 5.4M
ResNet-18+EPruning(F) | 1.9479 | 67.04 | 84.11 | 89.43 | 100 | 11.2M
ResNet-18+EPruning(P) | 1.9541 | 67.06 | 84.14 | 89.27 | 48.04 | 5.4M
ResNet-34 | 1.3931 | 69.96 | 85.65 | 90.10 | 100 | 21.3M
ResNet-34+DeepCompression | 2.1778 | 42.09 | 65.01 | 74.31 | 49.41 | 10.5M
ResNet-34+EPruning(F) | 1.9051 | 64.50 | 81.38 | 86.87 | 100 | 21.3M
ResNet-34+EPruning(P) | 1.9219 | 64.79 | 81.28 | 86.74 | 49.41 | 10.5M
ResNet-50 | 1.3068 | 71.22 | 86.47 | 90.74 | 100 | 23.7M
ResNet-50+DeepCompression | 2.3115 | 43.87 | 67.02 | 76.26 | 46.01 | 10.9M
ResNet-50+EPruning(F) | 1.8750 | 61.60 | 79.52 | 85.45 | 100 | 23.7M
ResNet-50+EPruning(P) | 1.8768 | 61.91 | 79.99 | 85.87 | 46.01 | 10.9M
ResNet-101 | 1.3574 | 71.19 | 85.54 | 90.00 | 100 | 42.6M
ResNet-101+DeepCompression | 2.6003 | 37.08 | 58.78 | 68.76 | 43.76 | 18.6M
ResNet-101+EPruning(F) | 1.9558 | 61.52 | 79.71 | 85.20 | 100 | 42.6M
ResNet-101+EPruning(P) | 1.9412 | 61.92 | 79.49 | 85.23 | 43.76 | 18.6M
AlexNet | 2.8113 | 60.12 | 79.18 | 83.31 | 100 | 57.4M
AlexNet+EPruning(F) | 2.4731 | 56.62 | 78.72 | 81.92 | 100 | 57.4M
AlexNet+EPruning(P) | 2.4819 | 56.59 | 78.52 | 81.62 | 71.84 | 41.2M
SqueezeNet | 1.4150 | 67.85 | 85.81 | 89.69 | 100 | 0.77M
SqueezeNet+EPruning(F) | 1.5265 | 64.23 | 82.71 | 88.63 | 100 | 0.77M
SqueezeNet+EPruning(P) | 1.5341 | 64.02 | 81.63 | 88.51 | 56.40 | 0.43M
This method increases the computational complexity of the model due to the evaluation of the energy loss for each candidate state vector in the population. However, with a proper parallel implementation at the state vector level and shared memory management, this overhead can be significantly reduced. As an example, for ResNet-50 the inference time per image is 5.50E-5 seconds, whereas with EPruning with a population size of 8 this time is 3.40E-4 seconds and, in parallel mode, 8.95E-5 seconds.
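This per-candidate evaluation is the source of the overhead: each state vector requires a forward pass of the corresponding sub-network on the current mini-batch. The following self-contained sketch illustrates that loop on a toy masked MLP; MaskedMLP, batch_energy_loss, and population_energies are illustrative names and simplifications, not the paper's implementation, and the per-state evaluations can be distributed across workers as discussed above.

import torch
import torch.nn as nn

def batch_energy_loss(logits, targets):
    # Eq. (3) for a batch, averaged over examples; F(y, X) = -logit_y.
    energies = -logits                                           # (B, C)
    target_e = energies.gather(1, targets[:, None]).squeeze(1)
    others = energies.scatter(1, targets[:, None], float("inf"))
    return (target_e - others.min(dim=1).values).mean()

class MaskedMLP(nn.Module):
    # Toy network whose hidden units can be switched off by a binary state vector.
    def __init__(self, d_in=32, d_hidden=64, n_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, n_classes)

    def forward(self, x, state):
        h = torch.relu(self.fc1(x)) * state     # drop units where state == 0
        return self.fc2(h)

@torch.no_grad()
def population_energies(model, states, x, y):
    # Serial evaluation of the energy loss of every candidate state; this loop
    # is the overhead discussed above and can be parallelized across states.
    return torch.stack([batch_energy_loss(model(x, s), y) for s in states])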
4. CONCLUSIONS
In this paper, we have proposed a stochastic framework for pruning deep neural networks. Most pruning and compression models first prune a network and then fine-tune it. The proposed EPruning method has two phases. The first phase acts as dropout, where various subsets of the neural network are trained. Each sub-network is selected based on a corresponding energy loss value, which reflects the performance of the sub-network. The second phase is a fine-tuning phase focused on training the pruned network. The proposed framework has the advantage of immediate usability for any neural network without manual modification of layers. In addition, a predefined number of active states can also be used in the optimizer to enforce a specific dropout/pruning rate. Our experiments show that as the proposed framework searches for sub-networks with lower energy, the training loss also decreases.
5. ACKNOWLEDGMENT
The authors acknowledge the financial support of Fujitsu Laboratories Ltd. and Fujitsu Consulting (Canada) Inc.

6. REFERENCES

[1] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell, "Rethinking the value of network pruning," arXiv preprint arXiv:1810.05270, 2018.
[2] Yann LeCun, John S Denker, and Sara A Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems, 1990, pp. 598–605.
[3] Song Han, Huizi Mao, and William J Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[4] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang, "A tutorial on energy-based learning," Predicting Structured Data, vol. 1, no. 0, 2006.
[5] David Barber and Aleksandar Botev, "Dealing with a large number of classes: likelihood, discrimination or ranking?," arXiv preprint arXiv:1606.06959, 2016.
[6] Radford M Neal, "Annealed importance sampling," Statistics and Computing, vol. 11, no. 2, pp. 125–139, 2001.
[7] Hojjat Salehinejad and Shahrokh Valaee, "Ising-dropout: A regularization method for training and compression of deep neural networks," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3602–3606.
[8] Hojjat Salehinejad, Zijian Wang, and Shahrokh Valaee, "Ising dropout with node grouping for training and compression of deep neural networks," IEEE, 2019, pp. 1–5.
[9] Alex Labach, Hojjat Salehinejad, and Shahrokh Valaee, "Survey of dropout methods for deep neural networks," arXiv preprint arXiv:1904.13310, 2019.
[10] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky, "Your classifier is secretly an energy based model and you should treat it like one," arXiv preprint arXiv:1912.03263, 2019.
[11] Kevin P Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[12] Keivan Dabiri, Mehrdad Malekmohammadi, Ali Sheikholeslami, and Hirotaka Tamura, "Replica exchange MCMC hardware with automatic temperature selection and parallel trial," IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 7, pp. 1681–1692, 2020.
[13] Kenneth V Price, "Differential evolution," pp. 187–214, 2013.
[14] Hojjat Salehinejad, Shahryar Rahnamayan, and Hamid R Tizhoosh, "Micro-differential evolution: Diversity enhancement and a comparative study," Applied Soft Computing, vol. 52, pp. 812–833, 2017.
[15] Jouni Lampinen, Ivan Zelinka, et al., "On stagnation of the differential evolution algorithm," in Proceedings of MENDEL, 2000, pp. 76–83.
[16] Alex Krizhevsky, Geoffrey Hinton, et al., "Learning multiple layers of features from tiny images," 2009.
[17] Terrance DeVries and Graham W Taylor, "Improved regularization of convolutional neural networks with cutout," arXiv preprint arXiv:1708.04552, 2017.
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[19] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[21] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky, "Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent," Cited on, vol. 14, no. 8, 2012.
[22] Matthew D Zeiler, "Adadelta: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[23] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
[24] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.