This paper is accepted for presentation at the IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP), 2021.
PRUNING OF CONVOLUTIONAL NEURAL NETWORKS USING ISING ENERGY MODEL
Hojjat Salehinejad, Member, IEEE, and Shahrokh Valaee, Fellow, IEEE
Department of Electrical & Computer Engineering, University of Toronto, Toronto, Canada [email protected], [email protected]
ABSTRACT
Pruning is one of the major methods to compress deep neural networks. In this paper, we propose an Ising energy model within an optimization framework for pruning convolutional kernels and hidden units. This model is designed to reduce redundancy between weight kernels and to detect inactive kernels/hidden units. Our experiments using ResNets, AlexNet, and SqueezeNet on the CIFAR-10 and CIFAR-100 datasets show that the proposed method on average can achieve a pruning rate of more than 50% of the trainable parameters with approximately <10% and <5% drop of Top-1 and Top-5 classification accuracy, respectively.

Index Terms — Neural networks, Ising model, pruning.
1. INTRODUCTION
Deployment of deep neural networks (DNNs) in inference mode is challenging for applications with limited resources, such as edge devices [1, 2]. Pruning is one of the major approaches for compressing a DNN by permanently dropping a subset of the network parameters. Pruning methods are divided into unstructured and structured approaches. Unstructured pruning does not follow a specific geometry and removes any subset of the weights [3]. Structured pruning typically follows a geometric structure and happens at the channel, kernel, and intra-kernel levels [3, 4]. One of the early attempts at pruning neural networks used second-derivative information to minimize a cost function that reduces network complexity by removing an excess number of trainable parameters, and then further trained the remainder of the network to increase inference accuracy [5].
Deep Compression is a popular pruning method with three stages: pruning, quantization, and Huffman coding [6]. It works by pruning all connections with weights below a threshold, followed by retraining the sparsified network. We previously proposed an Ising energy model for pruning hidden units in multi-layer perceptron (MLP) networks [2, 7]. In this paper, we propose IPruning, which targets pruning convolutional kernels, including all corresponding input/output connections, and hidden units by modeling a DNN as a graph and quantifying the interdependencies among trainable variables using the Ising energy model. A DNN is modeled as a graph whose nodes represent kernels/hidden units and whose edges represent the relationships between nodes. These relationships are modeled using the entropy of feature maps between convolutional layers and the relative entropy (Kullback–Leibler (KL) divergence) between convolutional kernels within a layer. The relationships are represented as weights in an Ising energy model, which targets dropping kernels with low activity and eliminating redundant kernels. We initiate a set of candidate pruning state vectors, which correspond to different subgraphs of the original graph. The objective is to search for the state vector that minimizes the Ising energy of the graph. Each step of the optimization procedure happens within a training iteration of the network, where only the kernels identified by the best pruning state vector are trained using backpropagation. This is similar to training with dropout [8], where the original network is partially trained. However, after a number of iterations the set of candidate state vectors can converge to a best pruning state vector, which represents the pruned network.
2. PROPOSED ISING PRUNING METHOD
The weights of a DCNN with $L$ layers are defined as the set $\Theta = \{\Theta^{[1]}, \dots, \Theta^{[L]}\}$, where $\Theta^{[l]} \in \Theta$ is the set of weights in layer $l$, $\Theta^{[l]}_i$ is the weight kernel $i$ of size $N^{[l]}_i$, and $N^{[l]}$ is the number of weight kernels in convolutional layer $l$. Similarly, in a dense layer $N^{[l]}$ is the number of weights from layer $l$ to the next layer $l+1$. Generally, a feature map is constructed using the convolution operation defined as
$$F^{[l]}_i = \sigma\big(\Theta^{[l]}_i \star F^{[l-1]}\big), \qquad (1)$$
where $F^{[l]}_i$ is feature map $i$ in layer $l$, $F^{[l-1]}$ is the set of feature maps from the previous layer, $\sigma(\cdot)$ is the activation function, and $\star$ is the convolution operation. The key question is: "How can we detect redundant and inactive kernels in a DCNN?" To answer this question, we quantitatively evaluate the activity and redundancy of the kernels using entropy and KL divergence as follows. The code and further details of the experimental setup are available at: https://github.com/sparsifai/ipruning
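For concreteness, the following is a minimal PyTorch sketch of evaluating (1) for a single kernel with a ReLU activation; this is our illustrative example (shapes and variable names are hypothetical), not code from the released repository.

```python
# Minimal sketch of Eq. (1): one feature map as the ReLU-activated
# convolution of the previous layer's feature maps with kernel i.
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)       # F^{[l-1]}: three input feature maps
kernel_i = torch.randn(1, 3, 3, 3)  # Theta^{[l]}_i: one 3x3 kernel over 3 channels
fmap_i = torch.relu(F.conv2d(x, kernel_i, padding=1))  # F^{[l]}_i
print(fmap_i.shape)                 # torch.Size([1, 1, 32, 32])
```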
2.1. Measuring Kernels Activity

Feature maps are the activation values of a convolutional layer, representing the activation of a weight kernel for a given input. We use the feature maps of a kernel as a means of evaluating its activation. Assuming $\sigma(x) = \max(0, x)$, a feature map value $f_{i,j} \in \mathbb{R}_{\geq 0}$, where $j$ indexes the $j$th element of $F_i$, is generally a real number in a continuous domain. We quantize a feature map into a discrete domain $\Lambda$ by mapping the feature map values as $\mathbb{R}_{\geq 0} \xrightarrow{Q(\cdot)} \Lambda$, where $\Lambda = \{0, 1, \dots, 255\}$ is an 8-bit discrete state space and
$$Q(f_{i,j}) = \Big\lfloor 255 \cdot \frac{f_{i,j}}{\max(F_i)} \Big\rceil, \qquad (2)$$
where $\lfloor\cdot\rceil$ is the round-to-the-nearest-integer function. Let us define the random variable $F = Q(f_{i,j})$ with possible outcome $\lambda \in \Lambda$. Then, the probability of $\lambda$ is
$$p_F(\lambda) = n_\lambda / |F_i| \quad \forall \lambda \in \Lambda, \qquad (3)$$
where $n_\lambda$ is the number of times $\lambda$ occurs and $|\cdot|$ is the cardinality of the feature map. The entropy of the feature map $F_i$ is then defined as
$$H_i = -\sum_{\lambda \in \Lambda} p_F(\lambda) \log p_F(\lambda). \qquad (4)$$

2.2. Measuring Kernels Redundancy

The weights in neural networks are generally initialized from a normal distribution. A DCNN can have redundancy between the kernels in a layer. Removing the redundant kernels prunes the network, though it may slightly drop the classification accuracy. A kernel $\Theta$ is generally a three-dimensional tensor of size $K_1 \times K_2 \times N$, where $K = K_1 \times K_2$ is the size of a filter and $N$ is the number of filters, corresponding to the number of input channels. Therefore, we can represent the weights in a kernel with $K$ sets $W_1, \dots, W_K$, where $W_k = \{\theta_{k,1}, \dots, \theta_{k,N}\}$ and $k \in \{1, \dots, K\}$. Let us assume the weights $W_k$ have a normal distribution. Hence, for kernel $i$ we have a multivariate normal distribution with means $\boldsymbol{\mu}_i = (\mu_{i,1}, \dots, \mu_{i,K})$ and the $K \times K$ covariance matrix $\Sigma_i$. The distributions $\mathcal{N}_i(\boldsymbol{\mu}_i, \Sigma_i)$ and $\mathcal{N}_j(\boldsymbol{\mu}_j, \Sigma_j)$ of two given kernels $i$ and $j$, respectively, have the same dimension. Hence, we can compute the KL divergence between the two kernels $i$ and $j$ as
$$D_{KL}(\mathcal{N}_i \,\|\, \mathcal{N}_j) = \frac{1}{2}\Big( \mathrm{tr}(\Sigma_j^{-1}\Sigma_i) + (\boldsymbol{\mu}_j - \boldsymbol{\mu}_i)^\top \Sigma_j^{-1} (\boldsymbol{\mu}_j - \boldsymbol{\mu}_i) - K + \ln\frac{|\Sigma_j|}{|\Sigma_i|} \Big), \qquad (5)$$
where $\mathrm{tr}(\cdot)$ is the trace and $|\cdot|$ is the determinant.
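To make (2)–(5) concrete, here is a minimal NumPy sketch of the two quantities (our illustration, not the released implementation; the small `eps` ridge added to the covariance matrices is our own assumption to keep them invertible):

```python
import numpy as np

def feature_map_entropy(fmap):
    """Eqs. (2)-(4): quantize a non-negative feature map to the 8-bit
    state space {0, ..., 255} and return its entropy."""
    f = np.asarray(fmap, dtype=float).ravel()
    q = np.rint(255.0 * f / max(f.max(), 1e-12)).astype(int)  # Eq. (2)
    p = np.bincount(q, minlength=256) / q.size                # Eq. (3)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())                      # Eq. (4)

def kernel_kl(kernel_i, kernel_j, eps=1e-6):
    """Eq. (5): KL divergence between multivariate normal fits of two
    kernels of shape (K1, K2, N), where each of the K = K1*K2 filter
    positions is one dimension with N samples (input channels)."""
    wi = np.asarray(kernel_i, dtype=float)
    wj = np.asarray(kernel_j, dtype=float)
    wi = wi.reshape(-1, wi.shape[-1])      # K x N
    wj = wj.reshape(-1, wj.shape[-1])
    K = wi.shape[0]
    mu_i, mu_j = wi.mean(axis=1), wj.mean(axis=1)
    cov_i = np.cov(wi) + eps * np.eye(K)   # ridge keeps covariances invertible
    cov_j = np.cov(wj) + eps * np.eye(K)
    cov_j_inv = np.linalg.inv(cov_j)
    d = mu_j - mu_i
    _, logdet_i = np.linalg.slogdet(cov_i)
    _, logdet_j = np.linalg.slogdet(cov_j)
    return 0.5 * (np.trace(cov_j_inv @ cov_i) + d @ cov_j_inv @ d
                  - K + logdet_j - logdet_i)

# Toy usage: entropy of a ReLU feature map and KL between two 3x3x64 kernels.
fmap = np.maximum(0.0, np.random.randn(16, 16))
ki, kj = np.random.randn(3, 3, 64), np.random.randn(3, 3, 64)
print(feature_map_entropy(fmap), kernel_kl(ki, kj))
```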
A neural network $\mathcal{F}$ has the set of layers $\mathcal{A} = \mathcal{A}_1 \cup \mathcal{A}_2$, where $\mathcal{A}_1$ and $\mathcal{A}_2$ are the sets of convolutional and dense layers, respectively. Obviously, the sets $\mathcal{A}_1$ and $\mathcal{A}_2$ are disjoint (i.e., $\mathcal{A}_1 \cap \mathcal{A}_2 = \emptyset$). Hereafter, we refer to a hidden unit or a convolutional kernel as a unit for simplicity. A binary state vector $\mathbf{s}$ of length $D$ represents the state of the units, where $s_d \in \{0,1\}$ $\forall d \in \{1,\dots,D\}$. If $s_d = 0$, unit $d$ is inactive, and if $s_d = 1$, the unit participates in training and inference. Therefore, the state vector $\mathbf{s}$ represents a subnetwork of the original network. Each unit $d$ belongs to a layer $l \in \mathcal{A}$. Let us represent the network $\mathcal{F}$ as a graph $\mathcal{G} = (\mathcal{D}, \Gamma)$, where $\mathcal{D}$ is the set of vertices (nodes) with cardinality $D$ and $\Gamma$ is the set of edges (connections) with weight $\gamma_{d,d'}$ between vertices $d$ and $d'$. The graph has two types of connections: connections between the vertices of a layer are bidirectional, and connections between layers are unidirectional. In dense layers, unidirectional connections exist between the nodes of two layers, where each node has a state $s_d \in \{0,1\}$, except in the last layer (logits layer), where $s_d = 1$.

We are interested in pruning the vertices and all corresponding edges. We model the dependencies between the vertices in the graph $\mathcal{G}$ using the Ising energy model as
$$E = -\sum_{d \in \mathcal{D}} \sum_{d' \in \mathcal{D}} \gamma_{d,d'}\, s_d s_{d'} - b \sum_{d \in \mathcal{D}} s_d, \qquad (6)$$
where $b$ is the bias coefficient and $\gamma_{d,d'}$ is the weight between vertices $d$ and $d'$, defined as
$$\gamma_{d,d'} = \begin{cases} D_{KL}(\mathcal{N}_d \,\|\, \mathcal{N}_{d'}) - 1 & \text{if } d, d' \in l \text{ and } l \in \mathcal{A}_1 \\ H_d - 1 & \text{if } d \in l,\ d' \in l+1 \text{ and } l, l+1 \in \mathcal{A}_1 \\ A_d - 1 & \text{if } d \in l,\ d' \in l+1 \text{ and } l, l+1 \in \mathcal{A}_2 \\ 0 & \text{otherwise}, \end{cases} \qquad (7)$$
where $H_d$ is calculated using (4). Similar to the approach we proposed in [2] for hidden units in dense layers, we have
$$A_i = \tanh(a_i), \qquad (8)$$
which maps the activation value $a_i$ of unit $i$, generated by the ReLU activation function, such that a dead unit has the lowest $A_i$ and a highly activated unit has a high $A_i$. In (7), a high weight is allocated to the unidirectional connections of a unit with a high activation value, and a high weight is allocated to bidirectional connections with a high KL divergence, and vice versa. From another perspective, the former allocates a small weight to low-active units and the latter allocates a small weight to redundant units. Assuming all the states are active (i.e., $s_d = 1$ $\forall d \in \mathcal{D}$), the bias coefficient is defined to balance the interaction term and the bias term by setting $E = 0$. Hence,
$$b = -\frac{\sum_{d \in \mathcal{D}} \sum_{d' \in \mathcal{D}} \gamma_{d,d'}\, s_d s_{d'}}{\sum_{d \in \mathcal{D}} s_d} = -\frac{|\gamma|}{D}, \qquad (9)$$
where $|\gamma|$ is the sum of the weights $\gamma$ and $\sum_{d \in \mathcal{D}} s_d = D$. Minimizing (6) is equivalent to finding a state vector which represents a sub-network of $\mathcal{F}$ with a smaller number of redundant kernels and inactive units.

Algorithm 1: IPruning
Set t = 0                                   // optimization counter
Initialize the neural network F
Set S^(0) ~ Bernoulli(P = 0.5)              // states initialization
Set Δs = 0                                  // early state threshold
for i_epoch = 1 → N_epoch do                // epoch counter
  for i_batch = 1 → N_batch do              // batch counter
    t = t + 1
    if Δs = 0 then
      if i_epoch = 1 and i_batch = 1 then
        Compute the energy of S^(0) using (6)
      end if
      for i = 1 → S do                      // states counter
        Generate mutually different i_1, i_2, i_3 ∈ {1, ..., S}
        for d = 1 → D do                    // state dimension counter
          Generate a random number r_d ∈ [0, 1]
          Compute the mutation vector v_{i,d} using (10)
          Compute the candidate state ~s_i^(t) using (11)
        end for
      end for
      Compute the energy loss of ~S^(t) using (6)
      Select S^(t) and the corresponding energies using (12)
      Select the state with the lowest energy from S^(t) as s_b^(t)
    else
      s_b^(t) = s_b^(t-1)
    end if
    Temporarily drop the weights of F according to s_b^(t)
    Compute the cross-entropy loss of the sparsified network
    Perform backpropagation to update the active weights
  end for
  Update Δs for early state convergence using (13)
end for

Algorithm 1 shows the different steps of IPruning. The process of searching for the pruning state vector with the lowest energy is incorporated into the typical training of the neural network $\mathcal{F}$ with backpropagation. First, a population of candidate state vectors is initialized and the Ising energy loss is computed for each vector. Then, the population of vectors is evolved on the optimization landscape of states with respect to the Ising energy, and the state with the lowest energy is selected. Dropout is performed according to the selected state vector, and only the active weights are updated with backpropagation. The population is then evolved again and the same procedure is repeated until the population of states converges to a best state solution or a predefined number of iterations is reached.

Let us initialize a population of candidate states $S^{(t)} \in \mathbb{Z}^{S \times D}$ such that $\mathbf{s}^{(t)}_i \in S^{(t)}$, where $t$ is the iteration index and $s^{(0)}_{i,d} \sim \mathrm{Bernoulli}(P = 0.5)$ for $i \in \{1,\dots,S\}$ and $d \in \{1,\dots,D\}$. A state vector $\mathbf{s}^{(t)}_j \in S^{(t)}$ selects a subgraph of the graph $\mathcal{G}$. The optimization procedure has three phases: mutation, crossover, and selection. Given the population of states $S^{(t-1)}$, a mutation vector is defined for each candidate state $\mathbf{s}^{(t-1)}_i \in S^{(t-1)}$ as
$$v_{i,d} = \begin{cases} 1 - s^{(t-1)}_{i_1,d} & \text{if } s^{(t-1)}_{i_2,d} = s^{(t-1)}_{i_3,d} \text{ and } r_d < F \\ s^{(t-1)}_{i_1,d} & \text{otherwise}, \end{cases} \qquad (10)$$
for $d \in \{1,\dots,D\}$, where $i_1, i_2, i_3 \in \{1,\dots,S\}$ are mutually different, $F$ is the mutation factor [9], and $r_d \in [0,1]$ is a random number. The next step is to crossover the mutation vectors to generate new candidate state vectors as
$$\tilde{s}^{(t)}_{i,d} = \begin{cases} v_{i,d} & \text{if } r'_d \leq C \\ s^{(t-1)}_{i,d} & \text{otherwise}, \end{cases} \qquad (11)$$
where $r'_d \in [0,1]$ is a random number and $C$ is the crossover coefficient [9]. The parameters $C$ and $F$ control the exploration and exploitation of the optimization landscape. Each generated state $\tilde{\mathbf{s}}^{(t)}_i$ is then compared with its corresponding parent with respect to its energy value $\tilde{E}^{(t)}_i$, and the state with the smaller energy is selected as
$$\mathbf{s}^{(t)}_i = \begin{cases} \tilde{\mathbf{s}}^{(t)}_i & \text{if } \tilde{E}^{(t)}_i \leq E^{(t-1)}_i \\ \mathbf{s}^{(t-1)}_i & \text{otherwise} \end{cases} \qquad \forall i \in \{1,\dots,S\}. \qquad (12)$$
The state with the minimum energy $E^{(t)}_b = \min\{E^{(t)}_1, \dots, E^{(t)}_S\}$ is selected as the best state $\mathbf{s}_b$, which represents the sub-network for the next training batch. This optimization strategy is simple and feasible to implement in parallel for a large $S$.

After a number of iterations, depending on the capacity of the neural network and the complexity of the dataset, all the states in $S^{(t)}$ may converge to the best state vector $\mathbf{s}_b \in S^{(t)}$ with Ising energy $E^{(t)}_b$. Hence, we can define
$$\Delta_s = E^{(t)}_b - \frac{1}{S}\sum_{j=1}^{S} E^{(t)}_j, \qquad (13)$$
such that if $\Delta_s = 0$, we can call for an early state convergence and continue training by fine-tuning the sub-network identified by the state vector $\mathbf{s}_b$.
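Putting (6) and (9)–(12) together, one search step can be sketched as follows (a minimal NumPy sketch under our own assumptions: `gamma` stands for a precomputed D x D weight matrix from (7), and the mutation factor and crossover coefficient values are placeholders, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def ising_energy(s, gamma, b):
    # Eq. (6): E = -sum_{d,d'} gamma[d,d'] s_d s_{d'} - b * sum_d s_d
    return -s @ gamma @ s - b * s.sum()

def evolve(S, gamma, b, F_mut=0.5, C=0.5):
    """One mutation/crossover/selection step over the population of
    binary pruning states, Eqs. (10)-(12)."""
    S_size, D = S.shape
    E = np.array([ising_energy(s, gamma, b) for s in S])
    out = S.copy()
    for i in range(S_size):
        i1, i2, i3 = rng.choice(S_size, size=3, replace=False)
        r = rng.random(D)
        # Eq. (10): flip s_{i1,d} where s_{i2,d} == s_{i3,d} and r_d < F
        v = np.where((S[i2] == S[i3]) & (r < F_mut), 1 - S[i1], S[i1])
        # Eq. (11): binomial crossover with the parent state
        trial = np.where(rng.random(D) <= C, v, S[i])
        # Eq. (12): greedy selection on the Ising energy
        if ising_energy(trial, gamma, b) <= E[i]:
            out[i] = trial
    return out

# Toy usage: D = 8 units, population of S = 6 candidate states.
D, S_size = 8, 6
gamma = rng.standard_normal((D, D))
b = -gamma.sum() / D                          # Eq. (9), all states active
S = rng.binomial(1, 0.5, size=(S_size, D))    # Bernoulli initialization
for _ in range(50):
    S = evolve(S, gamma, b)
s_b = S[np.argmin([ising_energy(s, gamma, b) for s in S])]  # best state
```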
3. EXPERIMENTS
The experiments were conducted on the CIFAR-10 and CIFAR-100 [10] datasets using ResNets (18, 34, 50, and 101 layers) [11], AlexNet [12], SqueezeNet [13], and Deep Compression [6].

Table 1: Classification performance on the test datasets. R is the percentage of kept trainable parameters and p is the approximate number of trainable parameters. All values except Loss and p are in percentage. (F) refers to the full network used for inference and (P) refers to the network pruned using IPruning.

(a)
CIFAR-10
Model Loss Top-1 Top-3 Top-5 R p
ResNet-18 0.3181 92.81 98.78 99.49 100 11.2M
ResNet-18+DeepCompression 0.6893 76.18 94.21 98.63 49.19 5.5M
ResNet-18+IPruning(F) 0.5167 84.12 96.74 99.24 100 11.2M
ResNet-18+IPruning(P) 0.5254 84.09 96.77 99.33 49.19 5.5M
ResNet-34 0.3684 92.80 98.85 99.71 100 21.3M
ResNet-34+DeepCompression 0.8423 71.45 93.28 98.39 49.61 10.5M
ResNet-34+IPruning(F) 0.6352 88.78 98.14 99.41 100 21.3M
ResNet-34+IPruning(P) 0.6401 88.72 97.93 99.42 49.61 10.5M
ResNet-50 0.3761 92.21 98.70 99.51 100 23.5M
ResNet-50+DeepCompression 1.0355 67.47 90.45 97.26 43.46 10.2M
ResNet-50+IPruning(F) 0.8200 82.32 95.92 97.37 100 23.5M
ResNet-50+IPruning(P) 0.8374 82.45 95.32 97.27 43.46 10.2M
ResNet-101 0.3680 92.66 98.69 99.65 100 42.5M
ResNet-101+DeepCompression 1.083 66.63 92.03 97.97 42.41 18.0M
ResNet-101+IPruning(F) 0.8233 84.47 97.42 98.47 100 42.5M
ResNet-101+IPruning(P) 0.8372 84.38 97.03 98.37 42.41 18.0M
AlexNet 0.9727 84.32 96.58 99.08 100 57.4M
AlexNet+IPruning(F) 0.8842 74.02 92.79 97.63 100 57.4M
AlexNet+IPruning(P) 0.8830 73.62 92.35 97.03 62.84 36.0M
SqueezeNet 0.5585 81.49 96.31 99.01 100 0.73M
SqueezeNet+IPruning(F) 0.6894 76.74 95.53 98.54 100 0.73M
SqueezeNet+IPruning(P) 0.6989 76.35 95.13 98.34 51.26 0.37M

(b)
CIFAR-100
Model Loss Top-1 Top-3 Top-5 R p
ResNet-18 1.3830 69.03 84.44 88.90 100 11.2M
ResNet-18+DeepCompression 2.2130 40.15 61.92 71.84 47.95 5.3M
ResNet-18+IPruning(F) 1.8431 55.43 74.94 82.60 100 11.2M
ResNet-18+IPruning(P) 1.8696 56.43 75.37 82.43 47.95 5.3M
ResNet-34 1.3931 69.96 85.65 90.10 100 21.3M
ResNet-34+DeepCompression 2.1778 42.09 65.01 74.31 49.41 10.5M
ResNet-34+IPruning(F) 2.3789 60.73 79.26 85.48 100 21.3M
ResNet-34+IPruning(P) 2.3794 61.13 79.23 85.30 49.41 10.5M
ResNet-50 1.3068 71.22 86.47 90.74 100 23.7M
ResNet-50+DeepCompression 2.4927 43.72 66.93 76.15 44.63 10.8M
ResNet-50+IPruning(F) 1.8750 60.44 79.25 86.24 100 23.7M
ResNet-50+IPruning(P) 2.1462 60.05 78.83 85.78 44.63 10.8M
ResNet-101 1.3574 71.19 85.54 90.00 100 42.6M
ResNet-101+DeepCompression 2.6232 36.58 57.82 68.36 41.36 17.6M
ResNet-101+IPruning(F) 2.1338 60.52 79.91 83.22 100 42.6M
ResNet-101+IPruning(P) 2.2952 60.35 78.99 83.01 41.36 17.6M
AlexNet 2.8113 60.12 79.18 83.31 100 57.4M
AlexNet+IPruning(F) 2.7420 53.52 72.42 79.70 100 57.4M
AlexNet+IPruning(P) 2.7396 53.05 72.28 79.69 65.35 37.5M
SqueezeNet 1.4150 67.85 85.81 89.69 100 0.77M
SqueezeNet+IPruning(F) 1.9285 61.93 80.74 86.92 100 0.77M
SqueezeNet+IPruning(P) 1.9437 61.46 80.45 85.81 53.20 0.41M
Horizontal flip and Cutout [14] augmentation methods were used. The results are averaged over five independent runs. The Adadelta optimizer with a step adaptive learning rate (decayed every 50 epochs at a gamma rate of 0.1) and weight decay was used. The number of epochs is 200 and the batch size is 128. The random dropout rate is set to 0.5 where applicable, except for the proposed model. The early state convergence in (13) is used with a threshold of 100.

As Table 1 shows, IPruning on average has removed more than 50% of the trainable weights while the Top-1 performance has dropped by less than 10% compared to the original model. We used the pruning rate achieved by IPruning to prune the original network using Deep Compression [6]. Since this method is tailored to pruning certain layers, we modified it to prune every layer, similar to IPruning. We also evaluated the inference results of IPruning in full and pruned modes. The former refers to training the network with IPruning but performing inference using the full model; the latter refers to training the network with IPruning and performing inference with the pruned network. The results show that the full network has slightly better performance than the pruned network. That is, we are able to achieve very competitive performance using the pruned network compared with the full network, which has a larger capacity, trained with IPruning.

Fig. 1: Energy loss and kept rate for ResNet-18 over 1,000 training iterations. (a) Average energy and best energy of the states population for CIFAR-10 and CIFAR-100. (b) Rate of kept trainable parameters.

Figure 1 shows the energy loss and the corresponding pruning rate over 1,000 training iterations of IPruning for ResNet-18. Since CIFAR-100 is more complicated than CIFAR-10, it converges more slowly. The results show that the pruning rates generally converge to a value close to 50%, regardless of the initial distribution of the population. This might be due to the behavior of the optimizer in a very high dimensional space and its limited capability of reaching all possible states during the evolution.
4. CONCLUSIONS
We propose an Ising energy-based framework, called IPruning, for structured pruning of neural networks. Unlike most other methods, IPruning considers every trainable weight in any layer of a given network for pruning. From an implementation perspective, most pruning methods require manual modification of the network architecture to apply the pruning mask, while IPruning can automatically detect trainable weights and construct a pruning graph for a given network.
5. ACKNOWLEDGMENT
The authors acknowledge the financial support of Fujitsu Laboratories Ltd. and Fujitsu Consulting (Canada) Inc.

REFERENCES

[1] Shaohui Lin, Rongrong Ji, Yuchao Li, Cheng Deng, and Xuelong Li, "Toward compact convnets via structure-sparsity regularized filter pruning," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 2, pp. 574–588, 2019.

[2] Hojjat Salehinejad and Shahrokh Valaee, "Ising-dropout: A regularization method for training and compression of deep neural networks," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3602–3606.

[3] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung, "Structured pruning of deep convolutional neural networks," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 13, no. 3, pp. 1–18, 2017.

[4] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang, "A survey of model compression and acceleration for deep neural networks," arXiv preprint arXiv:1710.09282, 2017.

[5] Yann LeCun, John S. Denker, and Sara A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems, 1990, pp. 598–605.

[6] Song Han, Huizi Mao, and William J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.

[7] Hojjat Salehinejad, Zijian Wang, and Shahrokh Valaee, "Ising dropout with node grouping for training and compression of deep neural networks," in 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2019, pp. 1–5.

[8] Alex Labach, Hojjat Salehinejad, and Shahrokh Valaee, "Survey of dropout methods for deep neural networks," arXiv preprint arXiv:1904.13310, 2019.

[9] Hojjat Salehinejad, Shahryar Rahnamayan, and Hamid R. Tizhoosh, "Micro-differential evolution: Diversity enhancement and a comparative study," Applied Soft Computing, vol. 52, pp. 812–833, 2017.

[10] Alex Krizhevsky, Geoffrey Hinton, et al., "Learning multiple layers of features from tiny images," 2009.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[13] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.

[14] Terrance DeVries and Graham W. Taylor, "Improved regularization of convolutional neural networks with cutout," arXiv preprint arXiv:1708.04552, 2017.