A Framework For Pruning Deep Neural Networks Using Energy-Based Models
This paper is accepted for presentation at the IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP), 2021.
Hojjat Salehinejad, Member, IEEE, and Shahrokh Valaee, Fellow, IEEE
Department of Electrical & Computer Engineering, University of Toronto, Toronto, Canada [email protected], [email protected]
ABSTRACT
A typical deep neural network (DNN) has a large number of trainable parameters. Choosing a network with proper capacity is challenging, and generally a larger network with excessive capacity is trained. Pruning is an established approach to reducing the number of parameters in a DNN. In this paper, we propose a framework for pruning DNNs based on a population-based global optimization method. This framework can use any pruning objective function. As a case study, we propose a simple but efficient objective function based on the concept of energy-based models. Our experiments on ResNets, AlexNet, and SqueezeNet for the CIFAR-10 and CIFAR-100 datasets show pruning rates of up to more than half of the trainable parameters with only a small drop in Top-1 and Top-5 classification accuracy.

Index Terms — Compression of neural networks, dropout, energy-based models, pruning.
1. INTRODUCTION
Pruning a deep neural network (DNN) is one of the major methods for removing redundant trainable parameters and compressing the network. This approach permanently removes a subset of trainable parameters. In general, pruning algorithms have three stages: training, pruning, and fine-tuning [1]. One pruning approach is to utilize second-derivative information to minimize a cost function that reduces network complexity by removing an excess number of trainable parameters, followed by further fine-tuning [2]. One of the major approaches is
Deep Compression, which has three main steps: pruning, quantization, and Huffman coding [3]. It prunes all connections with weights below a given threshold and then retrains the sparsified network.

Generally, probabilistic models can be considered as a special type of energy-based models (EBMs). An EBM assigns a scalar energy loss as a measure of compatibility to a configuration of parameters in neural networks, as demonstrated in Figure 1. This approach avoids computing the normalization term and can be interpreted as an alternative to probabilistic estimation [4].

Fig. 1: Energy-based model (EBM), [4].

Calculating the exact probability requires computing the partition function over all the data classes. However, for a large number of data classes, such as in language models with very large vocabularies, this becomes a bottleneck [5]. Some methods, such as annealed importance sampling [6], have been proposed to deal with this problem, which is out of the scope of this paper.

Previously, we have proposed an Ising energy model for dropout and pruning of multilayer perceptron (MLP) networks [7, 8]. In this paper, a pruning framework based on a population-based stochastic global optimization method is proposed, which is integrated into the typical training procedure of a DNN. This scheme is inspired by the concept of dropout [9] and the biological pruning of neurons in the brain. The framework can handle different pruning objective functions with multiple constraints. We also propose an energy-based pruning objective function based on the concept of an EBM in a DNN, which allocates a scalar energy value to each state vector in a population of state vectors; we call the resulting method EPruning. Each state vector is in fact a representation of a sub-network of the original DNN. Pruning is defined as searching for a binary state vector that prunes the network while minimizing the energy loss for a set of inputs and corresponding outputs in each iteration. Hence, the search for weights is conducted using a probabilistic model while the pruning state vector is fixed, and the search for the pruning state is conducted using an EBM while the weights are fixed in each iteration. The candidate states help to find a subset of the neural network and capture its energy function, which associates low energies to correct values of the remaining variables and higher energies to incorrect values. The code and more details of the experimental setup are available at: https://github.com/sparsifai/epruning.

2. PROPOSED PRUNING FRAMEWORK

2.1. Energy Model
A DNN can be modeled as a parametric function that maps an input image X ∈ 𝒳 to C real-valued numbers ǫ = {ε_1, ε_2, ..., ε_C} (a.k.a. logits). The output is then passed to a classifier, such as the Softmax function, to parameterize a categorical distribution in the form of a probability distribution over the data classes Y = {y_1, ..., y_C} [10], defined as {p(y_1), ..., p(y_C)}, where for simplicity we define p_c = p(y_c) ∀ c ∈ {1, ..., C}, as illustrated in Figure 2. The loss is then calculated based on cross-entropy with respect to the correct answer Y. The Gibbs distribution is a very general family of probability distributions defined as

p(Y | X) = e^{−βF(Y, X)} / Z(β),   (1)

where Z(β) = Σ_{y_c ∈ Y} e^{−βF(y_c, X)} is the partition function, β > 0 is the inverse temperature parameter [4], and F(·) is the Hamiltonian or the energy function. The Softmax function is a special case of the Gibbs distribution. We can obtain the energy function corresponding to a Softmax layer by setting β = 1 in (1) [11], which gives the Hamiltonian

F(Y, X) = −ǫ.   (2)

We define the following energy loss function to measure the quality of the energy function for (X, Y) with target output y_c:

E = L(Y, F) = F(y_c, X) − min{F(y_{c'}, X) : y_{c'} ∈ Y, c' ≠ c},   (3)

which can be extended to a batch of data. The energy loss function assigns a low loss value to a pruning state vector which has the lowest energy with respect to the target data class c and higher energy with respect to the other data classes, and vice versa [4].
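As a minimal illustration, the energy loss in (2)-(3) can be computed for a single example directly from the logits as in the following PyTorch-style sketch; the function name and tensor shapes are illustrative assumptions, not the paper's implementation.

import torch

def energy_loss(logits: torch.Tensor, target: int) -> torch.Tensor:
    # logits: tensor of shape (C,), the raw network outputs for one example.
    # target: index c of the correct class.
    energies = -logits                               # F(y, X) = -epsilon, eq. (2)
    target_energy = energies[target]                 # F(y_c, X) for the true class
    # Minimum energy over all classes other than the target, eq. (3).
    others = torch.cat([energies[:target], energies[target + 1:]])
    return target_energy - others.min()

For a mini-batch, the per-example values can be aggregated (for example, averaged), in line with the remark that (3) can be extended to a batch of data.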
We are interested in pruning weight kernels and hidden units, including their bias terms, referred to as units hereafter for convenience. We define a set of S binary candidate pruning state vectors as the population S of size S × D. Each vector s_i ∈ S has length D and represents a sub-network. The energy function value for each state vector is F_i ∈ {F_1, ..., F_S}. If s_{i,d} = 0, unit d is dropped, and if s_{i,d} = 1, it is active.

Algorithm 1 shows the different steps of the proposed framework. At the beginning of training (t = 0), we initialize the candidate pruning states S^{(0)} ∈ Z^{S×D}, where s^{(0)}_{i,d} ∼ Bernoulli(P) for i ∈ {1, ..., S} and d ∈ {1, ..., D}, and P is the initialization probability. For each candidate state s^{(t)}_i ∈ S^{(t)} in iteration t, the energy loss value E^{(t)}_i is calculated using (3). Searching for the pruning state which minimizes the energy loss value is an NP-hard combinatorial problem. Various methods such as MCMC [12] and simulated annealing (SA) can be used to search for low-energy states. We propose using a binary version of differential evolution (BDE) [13] to minimize the energy loss function. This method has the advantage of searching the optimization landscape in parallel and sharing the search experience among candidate states. Another advantage of this approach is the flexibility of designing the energy function with constraints.

Fig. 2: Switching between the energy-based model (EBM) and the probabilistic model. The EBM searches for the pruning state and the probabilistic model searches for the weights. Both models are aware of the target class Y during training. At inference, the best pruning state is applied and the EBM is removed.

The optimization step has three phases: mutation, crossover, and selection. Given the population of states S^{(t−1)}, a mutation vector is defined for each candidate state s^{(t−1)}_i ∈ S^{(t−1)} as

v_{i,d} = 1 − s^{(t−1)}_{i_1,d}   if s^{(t−1)}_{i_2,d} ≠ s^{(t−1)}_{i_3,d} and r_d < F,
v_{i,d} = s^{(t−1)}_{i_1,d}        otherwise,   (4)

for all d ∈ {1, ..., D}, where i_1, i_2, i_3 ∈ {1, ..., S} are mutually different, F is the mutation factor [14], and r_d ∈ [0, 1] is a random number. The next step is to cross over the mutation vectors to generate new candidate state vectors as

s̃^{(t)}_{i,d} = v_{i,d}            if r'_d ∈ [0, 1] ≤ C_r,
s̃^{(t)}_{i,d} = s^{(t−1)}_{i,d}    otherwise,   (5)

where C_r is the crossover coefficient [14]. The parameters C_r and F control exploration and exploitation of the population on the optimization landscape. Each generated state s̃^{(t)}_i is then compared with its corresponding parent with respect to its energy loss value Ẽ^{(t)}_i as

s^{(t)}_i = s̃^{(t)}_i        if Ẽ^{(t)}_i ≤ E^{(t−1)}_i,
s^{(t)}_i = s^{(t−1)}_i      otherwise,   ∀ i ∈ {1, ..., S}.   (6)

The state with the minimum energy loss E^{(t)}_b = min{E^{(t)}_1, ..., E^{(t)}_S} is selected as the best state s_b, which represents the sub-network for the next training batch. This optimization strategy is simple and feasible to implement in parallel for a large S.

Population-based optimization methods suffer from premature convergence and stagnation problems. The former generally occurs when the population (candidate state vectors) has converged to local optima, has lost its diversity, or shows no improvement in finding better solutions. The latter happens mainly when the population stays diverse during training [15]. After a number of iterations, depending on the capacity of the neural network and the complexity of the dataset, all the states in S^{(t)} may converge to a state s_b ∈ S^{(t)}. We call this the early state convergence phase and define it as

Δs = E^{(t)}_b − (1/S) Σ_{j=1}^{S} E^{(t)}_j,   (7)

where E^{(t)}_b is the energy loss of s_b. Therefore, if Δs = 0 we can call for an early state convergence and continue training by fine-tuning the sub-network identified by the state vector s_b. In addition, a stagnation threshold Δs_T is implemented: if Δs ≠ 0 after Δs_T training epochs, the energy loss optimizer is stopped and fine-tuning of the selected sub-network starts. The convergence to the best state s_b splits the training procedure into two phases, where the first phase acts similarly to dropout and the second phase fine-tunes the pruned network, defined by s_b.
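As a rough illustration of (4)-(6), one BDE update of the candidate pruning states might look as follows; the energy_of_state callable, the mutation factor F, and the crossover coefficient Cr are placeholders rather than the settings used in the paper.

import numpy as np

def bde_step(states, energies, energy_of_state, F=0.5, Cr=0.7, rng=None):
    # states:   (S, D) binary array, one candidate sub-network per row.
    # energies: (S,) energy loss of each candidate from the previous iteration.
    # energy_of_state: maps a binary state vector to its energy loss, eq. (3),
    #                  evaluated on the current mini-batch.
    rng = np.random.default_rng() if rng is None else rng
    S, D = states.shape
    new_states, new_energies = states.copy(), energies.copy()
    for i in range(S):
        i1, i2, i3 = rng.choice(S, size=3, replace=False)   # mutually different
        r = rng.random(D)
        flip = (states[i2] != states[i3]) & (r < F)
        v = np.where(flip, 1 - states[i1], states[i1])       # mutation, eq. (4)
        cross = rng.random(D) <= Cr
        trial = np.where(cross, v, states[i])                # crossover, eq. (5)
        trial_energy = energy_of_state(trial)
        if trial_energy <= energies[i]:                      # selection, eq. (6)
            new_states[i], new_energies[i] = trial, trial_energy
    best = int(np.argmin(new_energies))                      # index of s_b
    return new_states, new_energies, best

The early state convergence criterion (7) can then be evaluated at the end of each epoch by comparing the best energy against the population mean, e.g., new_energies[best] - new_energies.mean().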
3. EXPERIMENTS
The experiments were conducted on the CIFAR-10 and CIFAR-100 [16] datasets. The horizontal flip and Cutout [17] augmentation methods were used. Input images were resized to the input resolutions required by the ResNets and by AlexNet [18] and SqueezeNet v1.1 [19]. We used ResNets (18, 34, 50, and 101 layers) [20], AlexNet [18], SqueezeNet v1.1 [19], and Deep Compression [3] to evaluate EPruning. The results were averaged over five independent runs. A grid hyper-parameter search was conducted based on the Top-1 accuracy for all models, covering several initial learning rates, the Stochastic Gradient Descent (SGD) [21] and Adadelta [22] optimizers, exponential and step learning rate decays with different gamma values, and batch sizes of 64 and 128. The Adadelta optimizer with a step learning rate schedule (step: every 50 epochs at a gamma rate of 0.1) and weight decay was used. The number of epochs was 200 and the batch size was 128. Random dropout was not used in the EPruning experiments. For the other models, where applicable, the random dropout rate was set to 0.5. The early state convergence criterion in (7) was used with a threshold of 100 epochs. The models are implemented in PyTorch [23] and trained on three NVIDIA Titan RTX GPUs.
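For reference, the training configuration described above (Adadelta with a step learning rate schedule that multiplies the learning rate by 0.1 every 50 epochs, plus weight decay) can be set up in PyTorch roughly as follows; the learning rate and weight decay values are placeholders, since the exact grid values are not reproduced here.

import torchvision
from torch.optim import Adadelta
from torch.optim.lr_scheduler import StepLR

# Example model; the paper evaluates ResNets, AlexNet, and SqueezeNet v1.1.
model = torchvision.models.resnet50(num_classes=10)

# Placeholder learning rate and weight decay values.
optimizer = Adadelta(model.parameters(), lr=1.0, weight_decay=1e-4)

# Step learning rate schedule: multiply the learning rate by 0.1 every 50 epochs.
scheduler = StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(200):        # 200 epochs, batch size 128 in the paper
    # ... one epoch of mini-batch training goes here ...
    scheduler.step()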
Algorithm 1: EPruning

Set t = 0  // optimization counter
Initialize the neural network with trainable weights Θ
Set S^(0) ∼ Bernoulli(P)  // states initialization
Set Δs = 0 and the stagnation threshold Δs_T
for i_epoch = 1 → N_epoch do  // epoch counter
  for i_batch = 1 → N_batch do  // batch counter
    t = t + 1
    if Δs = 0 or i_epoch ≤ Δs_T then
      if i_epoch = 1 and i_batch = 1 then
        Compute the energy loss of S^(0) as E^(0) using (3)
      end if
      for i = 1 → S do  // states counter
        Generate mutually different i_1, i_2, i_3 ∈ {1, ..., S}
        for d = 1 → D do  // state dimension counter
          Generate a random number r_d ∈ [0, 1]
          Compute the mutation vector v_{i,d} using (4)
          Compute the candidate state s̃^(t)_{i,d} using (5)
        end for
      end for
      Compute the energy loss of S̃^(t) as Ẽ^(t) using (3)
      Select S^(t) and the corresponding energies E^(t) using (6)
      Select the state with the lowest energy from S^(t) as s^(t)_b
    else
      s^(t)_b = s^(t−1)_b
    end if
    Temporarily drop weights of the network based on the best state s^(t)_b
    Compute the loss of the sparsified network
    Perform backpropagation to update Θ
  end for
  Update Δs for early state convergence using (7)
end for

Table 1 shows the classification performance results. The original models contain all the trainable parameters and have a larger learning capacity. EPruning in its pruned and full versions has slightly lower Top-1 performance than the original models and competitive Top-5 performance. The Deep Compression [3] method receives the pruning rate as input. For the sake of comparison, we have modified it to perform pruning on every convolution layer at the rate achieved by EPruning, where it generally has lower performance than
EPruning. SqueezeNet [24] is a small network with AlexNet-level accuracy.
EPruning is also applied to AlexNet and SqueezeNet v1.1, where it has a smaller pruning rate for AlexNet compared to the ResNets but can prune approximately half of the trainable parameters in SqueezeNet v1.1 while achieving only slightly lower performance.

Figure 3 shows convergence plots of ResNet-50 with respect to the cross-entropy loss and the energy of the sub-network over 200 training epochs. The plots show that the best energy is lower than the average energy and that, after epoch 100, the sub-network is selected due to early stopping. The training loss and the energy loss follow a similar declining trend, and after epoch 100 the training loss declines more slowly. We observe that before epoch 100 the model is in an exploration phase and after this epoch it enters an exploitation (fine-tuning) phase.

[Fig. 3(a): CIFAR-10: training and validation cross-entropy loss, average energy, and best energy over 200 epochs.]
[Fig. 3(b): CIFAR-100: training and validation cross-entropy loss, average energy, and best energy over 200 epochs.]
Fig. 3: Cross-entropy loss and energy of EPruning over 200 training epochs of ResNet-50 for the CIFAR-10 and CIFAR-100 datasets with S = 8, initialization probability P, and Δs_T = 100.

Table 1: Classification performance on the test datasets. R is the percentage of kept trainable parameters and p is the approximate number of trainable parameters. All values except the loss and p are in percent. (F) refers to the full network used for inference and (P) refers to the pruned network obtained with EPruning.

(a) CIFAR-10
Model | Loss | Top-1 | Top-3 | Top-5 | R | p
ResNet-18 | 0.3181 | 92.81 | 98.78 | 99.49 | 100 | 11.2M
ResNet-18+DeepCompression | 0.6951 | 76.15 | 94.16 | 98.59 | 49.66 | 5.5M
ResNet-18+EPruning(F) | 0.4906 | 90.96 | 98.33 | 99.60 | 100 | 11.2M
ResNet-18+EPruning(P) | 0.4745 | 90.96 | 98.40 | 99.58 | 49.66 | 5.5M
ResNet-34 | 0.3684 | 92.80 | 98.85 | 99.71 | 100 | 21.3M
ResNet-34+DeepCompression | 1.057 | 66.51 | 91.40 | 97.68 | 38.83 | 8.3M
ResNet-34+EPruning(F) | 0.4576 | 88.28 | 97.47 | 99.31 | 100 | 21.3M
ResNet-34+EPruning(P) | 0.4598 | 88.21 | 97.48 | 99.28 | 38.83 | 8.3M
ResNet-50 | 0.3761 | 92.21 | 98.70 | 99.51 | 100 | 23.5M
ResNet-50+DeepCompression | 1.0271 | 67.53 | 89.92 | 96.30 | 46.39 | 10.9M
ResNet-50+EPruning(F) | 0.6041 | 85.22 | 96.35 | 98.77 | 100 | 23.5M
ResNet-50+EPruning(P) | 0.5953 | 85.30 | 96.62 | 98.76 | 46.39 | 10.9M
ResNet-101 | 0.3680 | 92.66 | 98.69 | 99.65 | 100 | 42.5M
ResNet-101+DeepCompression | 1.037 | 66.32 | 92.65 | 98.11 | 45.10 | 19.2M
ResNet-101+EPruning(F) | 0.6231 | 86.97 | 97.42 | 99.24 | 100 | 42.5M
ResNet-101+EPruning(P) | 0.6339 | 86.57 | 97.37 | 99.20 | 45.10 | 19.2M
AlexNet | 0.9727 | 84.32 | 96.58 | 99.08 | 100 | 57.4M
AlexNet+EPruning(F) | 0.7632 | 75.05 | 93.74 | 98.18 | 100 | 57.4M
AlexNet+EPruning(P) | 0.7897 | 74.66 | 93.63 | 97.96 | 77.36 | 44.4M
SqueezeNet | 0.5585 | 81.49 | 96.31 | 99.01 | 100 | 0.73M
SqueezeNet+EPruning(F) | 0.6686 | 76.76 | 94.55 | 98.62 | 100 | 0.73M
SqueezeNet+EPruning(P) | 0.6725 | 76.85 | 95.00 | 98.56 | 52.35 | 0.38M
(b) CIFAR-100
Model | Loss | Top-1 | Top-3 | Top-5 | R | p
ResNet-18 | 1.3830 | 69.03 | 84.44 | 88.90 | 100 | 11.2M
ResNet-18+DeepCompression | 2.3072 | 40.01 | 62.20 | 72.28 | 48.04 | 5.4M
ResNet-18+EPruning(F) | 1.9479 | 67.04 | 84.11 | 89.43 | 100 | 11.2M
ResNet-18+EPruning(P) | 1.9541 | 67.06 | 84.14 | 89.27 | 48.04 | 5.4M
ResNet-34 | 1.3931 | 69.96 | 85.65 | 90.10 | 100 | 21.3M
ResNet-34+DeepCompression | 2.1778 | 42.09 | 65.01 | 74.31 | 49.41 | 10.5M
ResNet-34+EPruning(F) | 1.9051 | 64.50 | 81.38 | 86.87 | 100 | 21.3M
ResNet-34+EPruning(P) | 1.9219 | 64.79 | 81.28 | 86.74 | 49.41 | 10.5M
ResNet-50 | 1.3068 | 71.22 | 86.47 | 90.74 | 100 | 23.7M
ResNet-50+DeepCompression | 2.3115 | 43.87 | 67.02 | 76.26 | 46.01 | 10.9M
ResNet-50+EPruning(F) | 1.8750 | 61.60 | 79.52 | 85.45 | 100 | 23.7M
ResNet-50+EPruning(P) | 1.8768 | 61.91 | 79.99 | 85.87 | 46.01 | 10.9M
ResNet-101 | 1.3574 | 71.19 | 85.54 | 90.00 | 100 | 42.6M
ResNet-101+DeepCompression | 2.6003 | 37.08 | 58.78 | 68.76 | 43.76 | 18.6M
ResNet-101+EPruning(F) | 1.9558 | 61.52 | 79.71 | 85.20 | 100 | 42.6M
ResNet-101+EPruning(P) | 1.9412 | 61.92 | 79.49 | 85.23 | 43.76 | 18.6M
AlexNet | 2.8113 | 60.12 | 79.18 | 83.31 | 100 | 57.4M
AlexNet+EPruning(F) | 2.4731 | 56.62 | 78.72 | 81.92 | 100 | 57.4M
AlexNet+EPruning(P) | 2.4819 | 56.59 | 78.52 | 81.62 | 71.84 | 41.2M
SqueezeNet | 1.4150 | 67.85 | 85.81 | 89.69 | 100 | 0.77M
SqueezeNet+EPruning(F) | 1.5265 | 64.23 | 82.71 | 88.63 | 100 | 0.77M
SqueezeNet+EPruning(P) | 1.5341 | 64.02 | 81.63 | 88.51 | 56.40 | 0.43M
This method increases the computational complexity of the model due to the evaluation of the energy loss for each candidate state vector in the population. However, with a proper parallel implementation at the state vector level and shared memory management, this overhead can be significantly reduced. As an example, for ResNet-50 the inference time per image is 5.50E-5 seconds, whereas with EPruning with a population size of 8 this time is 3.40E-4 seconds and, in parallel mode, 8.95E-5 seconds.
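This per-candidate evaluation is the source of the overhead: each state vector requires a forward pass of the corresponding sub-network on the current mini-batch. The following self-contained sketch illustrates that loop on a toy masked MLP; MaskedMLP, batch_energy_loss, and population_energies are illustrative names and simplifications, not the paper's implementation, and the per-state evaluations can be distributed across workers as discussed above.

import torch
import torch.nn as nn

def batch_energy_loss(logits, targets):
    # Eq. (3) for a batch, averaged over examples; F(y, X) = -logit_y.
    energies = -logits                                           # (B, C)
    target_e = energies.gather(1, targets[:, None]).squeeze(1)
    others = energies.scatter(1, targets[:, None], float("inf"))
    return (target_e - others.min(dim=1).values).mean()

class MaskedMLP(nn.Module):
    # Toy network whose hidden units can be switched off by a binary state vector.
    def __init__(self, d_in=32, d_hidden=64, n_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, n_classes)

    def forward(self, x, state):
        h = torch.relu(self.fc1(x)) * state     # drop units where state == 0
        return self.fc2(h)

@torch.no_grad()
def population_energies(model, states, x, y):
    # Serial evaluation of the energy loss of every candidate state; this loop
    # is the overhead discussed above and can be parallelized across states.
    return torch.stack([batch_energy_loss(model(x, s), y) for s in states])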
4. CONCLUSIONS
In this paper, we have proposed a stochastic framework for pruning deep neural networks. Most pruning and compression models first prune a network and then fine-tune it. The proposed EPruning method has two phases. The first phase acts as dropout, where various subsets of the neural network are trained. Each sub-network is selected based on a corresponding energy loss value, which reflects the performance of the sub-network. The second phase is a fine-tuning phase focused on training the pruned network. The proposed framework has the advantage of immediate usability for any neural network without manual modification of layers. In addition, a predefined number of active states can also be used in the optimizer to enforce a specific dropout/pruning rate. Our experiments show that as the proposed framework searches for sub-networks with lower energy, the training loss also decreases.
5. ACKNOWLEDGMENT
The authors acknowledge the financial support of Fujitsu Laboratories Ltd. and Fujitsu Consulting (Canada) Inc.

6. REFERENCES

[1] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell, "Rethinking the value of network pruning," arXiv preprint arXiv:1810.05270, 2018.
[2] Yann LeCun, John S Denker, and Sara A Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems, 1990, pp. 598–605.
[3] Song Han, Huizi Mao, and William J Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[4] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang, "A tutorial on energy-based learning," Predicting Structured Data, vol. 1, no. 0, 2006.
[5] David Barber and Aleksandar Botev, "Dealing with a large number of classes: likelihood, discrimination or ranking?," arXiv preprint arXiv:1606.06959, 2016.
[6] Radford M Neal, "Annealed importance sampling," Statistics and Computing, vol. 11, no. 2, pp. 125–139, 2001.
[7] Hojjat Salehinejad and Shahrokh Valaee, "Ising-dropout: A regularization method for training and compression of deep neural networks," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3602–3606.
[8] Hojjat Salehinejad, Zijian Wang, and Shahrokh Valaee, "Ising dropout with node grouping for training and compression of deep neural networks," IEEE, 2019, pp. 1–5.
[9] Alex Labach, Hojjat Salehinejad, and Shahrokh Valaee, "Survey of dropout methods for deep neural networks," arXiv preprint arXiv:1904.13310, 2019.
[10] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky, "Your classifier is secretly an energy based model and you should treat it like one," arXiv preprint arXiv:1912.03263, 2019.
[11] Kevin P Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[12] Keivan Dabiri, Mehrdad Malekmohammadi, Ali Sheikholeslami, and Hirotaka Tamura, "Replica exchange MCMC hardware with automatic temperature selection and parallel trial," IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 7, pp. 1681–1692, 2020.
[13] Kenneth V Price, "Differential evolution," pp. 187–214, 2013.
[14] Hojjat Salehinejad, Shahryar Rahnamayan, and Hamid R Tizhoosh, "Micro-differential evolution: Diversity enhancement and a comparative study," Applied Soft Computing, vol. 52, pp. 812–833, 2017.
[15] Jouni Lampinen, Ivan Zelinka, et al., "On stagnation of the differential evolution algorithm," in Proceedings of MENDEL, 2000, pp. 76–83.
[16] Alex Krizhevsky, Geoffrey Hinton, et al., "Learning multiple layers of features from tiny images," 2009.
[17] Terrance DeVries and Graham W Taylor, "Improved regularization of convolutional neural networks with cutout," arXiv preprint arXiv:1708.04552, 2017.
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[19] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[21] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky, "Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent," Cited on, vol. 14, no. 8, 2012.
[22] Matthew D Zeiler, "Adadelta: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[23] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
[24] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.