Now that I can see, I can improve: Enabling data-driven finetuning of CNNs on the edge
PREPRINT: Accepted at the Joint Workshop on Efficient Deep Learning in Computer Vision (EDLCV) at CVPR, 2020
Aditya Rajagopal, Christos-Savvas Bouganis
Imperial College London
{aditya.rajagopal14, christos-savvas.bouganis}@imperial.ac.uk

Abstract
In today's world, a vast amount of data is being generated by edge devices that can be used as valuable training data to improve the performance of machine learning algorithms in terms of the achieved accuracy or to reduce the compute requirements of the model. However, due to user data privacy concerns as well as storage and communication bandwidth limitations, this data cannot be moved from the device to the data centre for further improvement of the model and subsequent deployment. As such, there is a need for increased edge intelligence, where the deployed models can be fine-tuned on the edge, leading to improved accuracy and/or a reduction in the model's workload as well as its memory and power footprint. In the case of Convolutional Neural Networks (CNNs), both the weights of the network and its topology can be tuned to adapt to the data that it processes. This paper provides a first step towards enabling CNN finetuning on an edge device based on structured pruning. It explores the performance gains and costs of doing so and presents an extensible open-source framework that allows the deployment of such approaches on a wide range of network architectures and devices. The results show that, on average, data-aware pruning with retraining can provide 10.2pp increased accuracy over a wide range of subsets, networks and pruning levels, with a maximum improvement of 42.0pp, over pruning and retraining in a manner agnostic to the data being processed by the network.
1. Introduction
Modern CNN-based systems achieve unprecedented levels of accuracy in various tasks such as image recognition [35], segmentation [25], drone navigation [15], and object detection [16, 6] due to the vast amounts of curated data [2, 1] used to train them. This training is performed in large data centres and, once deployed, the models remain static. However, the large quantities of domain-specific data that can help further improve the performance of these networks in terms of accuracy or inference time reside on the edge. This improvement can stem either from the availability of more data samples or from the realisation of a different distribution of data at the deployment side. Nonetheless, user data privacy concerns as well as limited storage and communication bandwidths mean that this data cannot be easily moved from the edge to these data centres for updating the deployed model through changes to the model's architecture (i.e. topology in the case of a CNN) and its parameters. Consequently, there has been a push to move the required processing from data centres to edge devices [40].

A widely adopted approach to tune the architecture of the model to the input data distribution is through pruning, usually followed by a retraining stage. This paper refers to such approaches as data-aware pruning and retraining (DaPR) approaches. The suitability of this approach is further supported by works such as [31, 26, 5, 21], which have shown that the pruning levels are linked to the complexity of the data the network is processing. Currently, the increased compute and memory capabilities of edge devices such as NVIDIA's Jetson TX2, NVIDIA's Xavier GPUs and Google's Edge TPU [14] provide an opportunity to perform such tuning on edge devices in a manner that does not infringe on user data privacy and is within an acceptable time frame.

With the goal of enabling CNNs to improve and adapt their performance to the data they are processing on the edge, the contributions of this paper are as follows:

1. A methodology based on the L1-norm of the weights that allows for on-device DaPR to be performed without user intervention. In doing so, the paper explores the accuracy gains of adapting a network to the data it is processing as well as the cost of achieving these gains on an edge device. The paper provides quantitative results on the possible performance gains (i.e. inference latency) and the associated costs (i.e. pruning and retraining) of after-deployment tuning on a number of state-of-the-art models targeting an actual embedded device.

2. An open-source framework ADaPT (Automated Data-aware Pruning and ReTraining) that allows rapid prototyping and deployment of various structured pruning techniques on a wide range of network architectures on edge devices. To the best of our knowledge, it is the only open-source tool that fully automates the process of identifying filters to prune, shrinking the network to obtain memory and performance gains, and performing retraining of the pruned network. In doing so, it enables direct deployment of DaPR solutions on any edge device.

The rest of the paper is organised as follows. Section 2 formally describes the field of research that this work addresses and states any assumptions made. Section 3 describes the metrics that are of interest when evaluating solutions within this field. Section 4 discusses various state-of-the-art structured pruning techniques and frameworks that allow for experimentation with pruning.
Section 5 describes the proposed L1-norm based DaPR solution. Section 6 describes the key features of ADaPT. Finally, Section 7 evaluates the gains and costs of performing DaPR on an NVIDIA Jetson TX2.
2. Motivation
Let us consider a training dataset D = {I, C}, where I is the set of images in the dataset and C is the set of classes represented by the images. Let us define a model M as a tuple (W, A), where W represents the weights of the model and A represents its topology. After performing training on D, the resulting model M_D = (W_D, A_D) is such that for class i ∈ C, a_i is the accuracy with which the model predicts that class, and c is the cost of the model.

Consider the case where this model M_D is going to be deployed in an environment D' = {I', C'}. In most practical scenarios the classes that are expected to be seen are not completely known before deployment, but nonetheless a reasonable assumption (Assumption 1) is that D' ⊆ D, i.e. the classes encountered upon deployment are a subset of the classes included in the training data. With this in mind, the problem this work addresses is:

Given the dataset D' and a provided compute and memory capacity budget, find a model M_D' such that a) its cost c' is less than the cost c of the initially deployed model M_D, and b) (1/|D'|) Σ_{j ∈ C'} a'_j ≥ (1/|D'|) Σ_{j ∈ C'} a_j, i.e. M_D' performs at least as well as M_D on D'.

The transformation of a model can happen both through changes to the weights of the model and/or the topology of the model.
3. Metrics of interest
The following metrics are adopted in order to assess the quality of various pruning strategies in this work:

1. The achieved accuracy of the produced model.

2. The cost c of the produced model, which entails:
• The number of operations required for a single inference (GOps). It is a widely adopted metric in the literature, as it provides a platform-agnostic way of comparing the inference time of various pruned networks. It should be noted that, depending on the architecture of the target hardware, this is not always a good representation of the true latency of the system.
• The latency of a single inference step on a specific device.

3. The reduction in memory footprint of M'_D, expressed as a percentage of the memory footprint of M_D. This will be referred to as the pruning level throughout the rest of this paper.
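To make the cost metrics concrete, the following is a minimal sketch of how the GOps and pruning-level figures used later in the paper can be estimated for a single convolutional layer; the helper names and the example layer sizes are ours, not part of any cited framework.

```python
# Minimal sketch (helper names are ours): estimating the Section 3 cost metrics
# for one convolutional layer described by (n_l, m_l, k_l) and its output map size.

def conv_gops(n_l, m_l, k_l, h_out, w_out):
    """Operations of one convolution for a single inference, in GOps (2 ops per MAC)."""
    macs = n_l * m_l * k_l * k_l * h_out * w_out
    return 2 * macs / 1e9

def conv_params(n_l, m_l, k_l):
    """Number of weights in one convolution (bias ignored for brevity)."""
    return n_l * m_l * k_l * k_l

# Illustrative layer: 128 filters, 64 input features, 3x3 kernels, 32x32 output map.
gops = conv_gops(n_l=128, m_l=64, k_l=3, h_out=32, w_out=32)
orig = conv_params(n_l=128, m_l=64, k_l=3)

# Pruning level as used in this paper: the reduction in memory footprint of the
# pruned model, as a percentage of the footprint of the original model.
pruned = conv_params(n_l=64, m_l=64, k_l=3)          # e.g. half the filters removed
pruning_level = 100 * (1 - pruned / orig)
print(f"{gops:.3f} GOps, {orig} weights, pruning level {pruning_level:.0f}%")
```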
4. Background and Related Works
Consider a CNN with L layers, where layer l ∈ L has n_l convolutional filters of size (m_l × k_l × k_l) each, where n_l is the number of output features, m_l the number of input features, and each convolutional kernel is of size k_l × k_l. For layer l ∈ L, the weight matrix W_l has size n_l × m_l × k_l × k_l, and the output feature maps (OFM) are X_l.

Unstructured pruning [4, 23] removes individual neurons within the network and induces sparsity, requiring custom hardware optimised for sparse operations [38, 28] in order to transform these changes into actual performance gains. Structured pruning focuses on removing entire convolutional filters and not just individual neurons. This allows runtime gains to be realised on commercial hardware using off-the-shelf frameworks, and is the approach to pruning this paper will utilise.
Structured Pruning - The process of structured pruning generally involves pruning a pre-trained network based on a filter ranking criterion and then, if possible, performing fine-tuning of the pruned network to regain the accuracy lost due to pruning. A number of works have explored various filter ranking criteria. [20] uses the L1-norm of the weights to rank filters and obtains up to 64% memory footprint and up to 38.6% Ops reduction for negligible accuracy loss across networks on CIFAR-10. They also achieve up to 10.8% memory and 24.2% Ops reduction for around 1% accuracy loss on ResNet34 on ImageNet [1]. [24] use the weights learnt by batch normalisation layers to rank channels and obtain up to 29.7% reduction in memory and 50.6% reduction in Ops for ResNet164 on CIFAR-100, and 82.5% memory and 30.4% Ops reduction for VGG on ImageNet, for negligible accuracy loss. [27, 19] use the concept of sensitivity, which ranks filters based on an approximation of the impact that their omission has on the loss. [27] achieves up to 66% memory and 2.5x latency reduction for VGG16 on ImageNet with an accuracy loss of 2.3%. Sensitivity-based works require the gradient of each filter as well as the weights and are more memory intensive than the weight-based methods.

[12] perform entropy-based pruning by estimating the average amount of information passed from weights to output. They achieve up to 94% memory reduction with negligible loss in accuracy for LeNet-5 on the MNIST [18] dataset. However, such information-theoretic methods are compute intensive, which makes them unfavourable for deployment in embedded settings.

[31] uses a correlation-based metric on the OFM (X_l) of each layer in order to decide which filters to prune. Using the feature maps makes this method input dependent, and the paper also explores the difference in filters pruned when the network is shown only various subsets of the dataset that it was originally trained on. They achieve memory reductions of 8x, 3x and 1.4x on VGG-16 for CIFAR-10, CIFAR-100, and ImageNet respectively, and up to 85% memory reduction with no accuracy loss when tuned to random subsets of CIFAR-100 with 2 classes in each subset.

The works mentioned above do not dynamically choose the filters pruned based on the data that the network is processing. There is a line of works that reduce the required operations during inference by skipping convolution operations depending on the input image, by introducing conditional execution. However, these works do not reduce the memory requirements of the model. [5, 10, 21] all pre-train classifiers that, at run-time, can identify which filters are important for the current input image. For VGG-16, [5] achieves a 1.98x Ops reduction for a 2% decrease in accuracy on ImageNet and shows that it outperforms both [10] and [21] on the same metric. [37] splits the network into multiple sections and learns classifiers that allow for early exit through the network depending on the input image processed. They achieve on average a 2.17x reduction in Ops across networks on CIFAR-100 for no accuracy loss, and a 1.99x reduction in Ops on ImageNet, also for no accuracy loss.
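For concreteness, the L1-norm criterion of [20], which is also the criterion adopted later in this paper, can be sketched in a few lines of PyTorch; the function below is our own illustration rather than code from any of the cited works.

```python
import torch
import torch.nn as nn

def rank_filters_by_l1(conv: nn.Conv2d):
    """Return filter indices of a conv layer sorted from least to most important,
    where importance is the L1-norm of each (m_l x k_l x k_l) filter."""
    # conv.weight has shape (n_l, m_l, k_l, k_l); sum |w| over all but the filter axis.
    l1 = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    return torch.argsort(l1).tolist()

# Example: mark the 25% lowest-L1 filters of a layer as candidates for pruning.
layer = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
order = rank_filters_by_l1(layer)
to_prune = order[: len(order) // 4]
```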
Frameworks - The various pruning techniques discussed above each have a unique set of hyperparameters that relate to filter ranking metrics as well as to the manner in which the models are re-trained. For instance, [31] sequentially prunes and retrains on a per-layer basis, while works such as [37] have to add many auxiliary layers on top of the chosen architecture in order to create and train their early-exit classifiers. Distiller [41] and Mayo [39] are two state-of-the-art open-source frameworks that allow for experimentation with such pruning techniques. Mayo focuses on automating the search for hyperparameters related to pruning, while Distiller focuses on implementing a wide variety of the pruning techniques discussed above. Distiller provides a functionality called "Thinning" for ResNet models only, and does not allow for easy application to other networks as the functionality is tailored to the ResNet architecture. Moreover, neither framework automatically shrinks the size of the model after pruning; instead they mask the weights of the model in order to allow for experimentation with the pruned model. Consequently, they can only be used to assess the impact of pruning on accuracy and not on runtime. In contrast, ADaPT addresses both these issues by shrinking the model for a wide variety of architectures to help realise run-time gains, as well as providing an extensible codebase for applying model shrinking to any new architecture.
5. On-device DaPR Methodology
This section presents an approach to obtain a model M_D' from M_D as described in Section 2. It is assumed that the inputs collected upon deployment of the system have been correctly classified (Assumption 2). This assumption enables training to be performed on the edge without uncertainty over the class labels, and thus allows the focus to be placed solely on the gains that can be made by adapting the network to the data it processes upon deployment.
Adapting the network architecture to the data it is processing involves searching for a model M_D' that performs at least as well as M_D on D', but with a reduced cost. The proposed approach is shown in Algorithm 1 and performs a binary search over a range of predefined pruning levels. If progressively larger pruning levels are searched, the time taken to perform this search and the memory footprint of the searched models decrease, as progressively smaller models are used. The algorithm converges when the pruning level to be searched does not change between iterations of the binary search.
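A minimal Python sketch of this binary search is given below (Algorithm 1 itself is reproduced in the next section). The prune_and_retrain callable, the rounding helper and the default values mirror the description above but are our own simplifications, not ADaPT's internal code.

```python
def round_to_multiple(x, step):
    """Round x to the nearest multiple of step."""
    return step * round(x / step)

def search_pruning_level(prune_and_retrain, a_target,
                         p_l=5, p_u=95, p_i=5, p=50):
    """Binary search over pruning levels (Algorithm 1, simplified).
    prune_and_retrain(p) is assumed to prune the finetuned model to level p,
    retrain it on D', and return (best test accuracy a_max, pruned model)."""
    best = None
    prev_p = None
    while p != prev_p:                      # converge when p stops changing
        prev_p = p
        a_max, model = prune_and_retrain(p)
        if a_max < a_target:
            p_u = p                         # accuracy too low: prune less
            p = round_to_multiple((p_l + p) / 2, p_i)
        else:
            best = (p, a_max, model)        # feasible: remember it and prune more
            p_l = p
            p = round_to_multiple((p_u + p) / 2, p_i)
    return best
```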
6. ADaPT
In order to support the quick development and easy deployment of DaPR methodologies on edge devices, the ADaPT framework has been developed. In its current state it implements, deploys and evaluates the methodology described in Section 5, but its extensible nature allows the deployment of other DaPR methodologies, overcoming the limitations of the existing tools described in Section 4. A high-level description of its important features is presented below; further details on how to use it can be found in the GitHub repository (https://github.com/adityarajagopal/pytorch_training.git). ADaPT's functionality can be split into 4 stages: 1) pruning dependency calculation (PDC), 2) pruning, 3) model writing, and 4) weight transfer.

Algorithm 1: Proposed DaPR Methodology
Inputs: Initial model M_D; subset D'; first pruning level to search p; set of pruning levels to search P = {i·p_i | i ∈ [p_l/p_i, p_u/p_i]}; target accuracy a_target.
Output: M'_D with test accuracy a_max and pruning level p ∈ P that is the highest pruning level s.t. a_max ≥ a_target.
Finetune M_D for n_f epochs on D' to obtain M_D^f
while p changes across iterations do
    Prune M_D^f to pruning level p according to the chosen pruning strategy
    Retrain the pruned model for n_r epochs on D', and record the model M'_D and its test accuracy a_max for the model with the highest validation accuracy
    if a_max < a_target then
        p_u = p
        // Reduce the pruning level to the nearest multiple of p_i to the midpoint between p_l and p
        p = nearest multiple of p_i to (p_l + p)/2
    else
        p_l = p
        // Increase the pruning level to the nearest multiple of p_i to the midpoint between p_u and p
        p = nearest multiple of p_i to (p_u + p)/2
    end
end

Modern CNN networks are usually constructed from a set of structural modules connected in a specific way. Most networks are based on the modules found in AlexNet [17], ResNet20 [9], MobileNetV2 [29], and SqueezeNet [13]. These four networks incorporate amongst them the most commonly used structural modules -
Sequential connectivity in AlexNet, Residuals in ResNet20 and MobileNetV2, MBConv modules in MobileNetV2, and Fire modules in SqueezeNet. For instance, EfficientNet [34] uses the MBConv as its primary convolutional module. Moreover, certain connectivity patterns result in pruning dependencies that necessitate the pruning of identical filters across dependent layers. The PDC automates the process of recognising such dependencies within a network.

Figure 1: Structural blocks that require consideration of dependencies when pruning. (a) Residual Blocks [8]; (b) Depthwise Separable Unit [29].

Networks with only Sequential connectivity patterns, such as AlexNet and VGG [30], do not have any pruning dependencies. The same applies to the
Fire modules that make up SqueezeNet.

ResNet variants are made of residual blocks as shown in Fig.1a. Due to the residual connection and the summation following it, the same filters need to be pruned in the last convolution (conv_final) of each residual block. Works such as [22] choose not to prune the last convolutional layer, while others such as [20] choose to enforce this dependency. Another consideration with ResNet architectures is that the network is split into groups of residuals where, with every new group, a downsampling 1x1 convolution (conv_down) is added to the residual connection, which ensures that the number of channels output from the connection matches those output from conv_final. In these cases, the pruning dependency exists between conv_final and conv_down. The PDC enforces either a dependency across all conv_final layers within a group of residual blocks, or prunes conv_down in line with its corresponding conv_final.

MobileNetV2 is made of MBConv blocks, which contain depth-wise separable convolutions as shown in Fig.1b, where the blue blocks are input feature maps (IFM) and the red blocks are convolutional filters. The 3x3 convolution in the figure is the depth-wise convolution (conv_dw) and differs from regular convolutions in that each filter only acts on one of the IFM, i.e. m_l = 1. This means that the number of filters in conv_dw must always match the number of IFM to that layer, and consequently the same filters need to be pruned in conv_dw and the layer(s) feeding it. This dependency is also enforced by the PDC when MBConv modules are present in the model.

The PDC takes a model description in PyTorch that the user annotates with Python decorators to identify the classes that correspond to various structural modules and the names of the convolutions within them. Based on this information, the PDC automatically calculates all the dependent layers and communicates this information to the pruning stage so that layers can be pruned in dependent groups if necessary. This automation makes ADaPT very easy to use, as tools such as Distiller [41] require the user to manually identify each dependent convolution in the entire network, which for larger networks can be very tedious to list.

Pruning the network once the dependencies have been identified is performed by ranking all the filters based on a customised metric, and then removing filters one by one until the desired percentage of memory has been reached. The user can choose to rank filters globally or on a per-layer basis. This stage utilises the PDC information about dependencies between layers that need to be considered when pruning. Removing a filter from one of the dependent layers removes one from all of the layers in that dependency chain. Furthermore, the effect of removing a filter is propagated through the network, as each filter corresponds to an IFM of the next layer. This stage computes the filters per layer that need to be pruned and passes this information to the Model Writing stage.
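To illustrate what the pruning stage has to account for, the following PyTorch sketch (our own helper, not ADaPT's internal code) removes a set of filters from one convolution and propagates the removal to the input channels of the convolution that consumes its output; in a dependency group identified by the PDC, the same indices would be applied to every layer in the group. Batch normalisation layers and the dependency groups themselves are omitted for brevity.

```python
import torch
import torch.nn as nn

def prune_filters(conv: nn.Conv2d, next_conv: nn.Conv2d, prune_idx):
    """Remove output filters `prune_idx` from `conv` and the matching input
    channels from `next_conv`, returning new (smaller) layers."""
    drop = set(prune_idx)
    keep = [i for i in range(conv.out_channels) if i not in drop]

    # Shrunk version of the pruned layer, keeping only the surviving filters.
    new_conv = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()

    # Each removed filter was an input feature map of the next layer,
    # so the same indices are dropped along the next layer's input axis.
    new_next = nn.Conv2d(len(keep), next_conv.out_channels, next_conv.kernel_size,
                         stride=next_conv.stride, padding=next_conv.padding,
                         bias=next_conv.bias is not None)
    new_next.weight.data = next_conv.weight.data[:, keep].clone()
    if next_conv.bias is not None:
        new_next.bias.data = next_conv.bias.data.clone()
    return new_conv, new_next
```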
In order to effectively evaluate the solution described in Section 5, a necessary capability of the tool is to produce a new model that has a reduced runtime and memory footprint, so that, where possible, the n_r epochs of retraining of a pruned model can be accelerated. The Model Writing stage enables this by creating a new, shrunk PyTorch model description with the pruned filters removed. The Model Writer takes the channels to be pruned as input and provides the user with a description of a pruned network which can be used in any PyTorch code-base. This decouples model pruning from model writing and allows for easy access to pruned models.

The final stage transfers the relevant weights from M_D to M'_D. This is necessary to minimise the drop in accuracy that is seen once the network is pruned, and thus minimise the number of epochs n_r in the retraining stage that are needed to reach the target performance. It takes the pruned model description as input and returns a PyTorch model that is ready for deployment with the transferred weights.
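A minimal sketch of what such a weight transfer can look like for a purely sequential network is shown below; the kept_filters mapping and the assumption that the original and pruned models iterate their modules in the same order are ours, and batch-norm and fully-connected layers are ignored for brevity.

```python
import torch

def transfer_conv_weights(original, pruned, kept_filters):
    """Copy the surviving filters of each convolution from `original` into the
    shrunk `pruned` model. `kept_filters[name]` lists the retained output
    filters of layer `name`; the retained input channels of a layer are the
    retained output filters of the layer feeding it (sequential connectivity)."""
    prev_kept = None
    for (name, src), (_, dst) in zip(original.named_modules(), pruned.named_modules()):
        if not isinstance(src, torch.nn.Conv2d):
            continue
        out_idx = kept_filters[name]
        w = src.weight.data[out_idx]
        if prev_kept is not None:            # select surviving input channels
            w = w[:, prev_kept]
        dst.weight.data.copy_(w)
        if src.bias is not None:
            dst.bias.data.copy_(src.bias.data[out_idx])
        prev_kept = out_idx
```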
Table 1: A summary of the structural modules implemented, along with networks that contain each module.
Sequential: AlexNet [17], VGG [30]
Residuals: ResNet [8], MobileNetV2 [29], ResNeXt [36], DenseNet [11]
Depth-wise Separable: MobileNetV2 [29], Xception [3], EfficientNet [34]
Fire Module: SqueezeNet [13], GoogLeNet [33]

It is important to note that the parts discussed in this section are linked to structural modules and not to networks. This makes the tool more extensible, as any network that contains the above structural modules can be readily pruned with the current version of the tool. Furthermore, it is built for easy customisation of all the functions discussed above, thus allowing new architectures and different pruning techniques to be experimented with easily. A summary of the supported structural modules, as well as the networks that contain these modules and are hence supported by ADaPT, is presented in Table 1.
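As a purely hypothetical illustration of the annotation workflow described for the PDC, a user-side sketch might look as follows; the decorator name and its arguments are ours and do not reflect ADaPT's actual API, which is documented in the repository.

```python
# Hypothetical illustration only: the decorator and its arguments are ours, not
# ADaPT's actual API. The idea is that the user tags a module class as a known
# structural block and names the convolutions inside it, so the dependency
# calculation can derive which layers must be pruned together.
import torch
import torch.nn as nn

def structural_module(kind, convs):
    """Attach structural-module metadata to a PyTorch module class."""
    def wrap(cls):
        cls._structural_kind = kind      # e.g. "residual", "mbconv", "fire"
        cls._structural_convs = convs    # names of the convolutions in the block
        return cls
    return wrap

@structural_module(kind="residual", convs=["conv1", "conv2"])
class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))
```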
7. Evaluation
A key contribution of this work is the analysis of the cost reduction in model deployment obtained by performing on-device DaPR, as well as of the associated costs of performing such domain adaptation on an edge device. The methodology being evaluated is that presented in Section 5. The results reported in this section are averages and standard deviations across 5 independent runs of each experiment.
The chosen edge device was the NVIDIA Jetson TX2. Through cross-validation, the values chosen for n_f and n_r were 5 and 25 respectively, to ensure that the accuracy plateaus before retraining ends. The range of pruning levels searched was from p_l = 5% to p_u = 95% in increments of p_i = 5%. The first pruning level searched was p = 50%.

In order to explore the problem setup discussed in Section 2, it is necessary to create subsets D' ⊆ D. In this case, D was the entire CIFAR-100 dataset, and five different subsets D' were tested. CIFAR-100 is categorised into 20 "coarse classes" which contain 5 "fine classes" each. The subsets created and the classes within them are described in Table 2. The first four subsets were hand selected to have coherent semantic meaning and, across the four of them, span the entire CIFAR-100 dataset. The fifth subset was randomly generated. The networks tested were AlexNet, ResNet20, MobileNetV2, and SqueezeNet, for all the subsets listed in Table 2.

Table 2: Subsets tested, along with the coarse classes included in each subset and the number of fine classes per subset.
Aquatic (10 fine classes): aquatic mammals, fish
Indoor (15): food containers, household electrical devices, household furniture
Outdoor (35): large man-made outdoor things, large natural outdoor scenes, vehicles 1, vehicles 2, trees, small mammals, people
Natural (40): flowers, fruit and vegetables, insects, large omnivores and herbivores, medium mammals, non-insect invertebrates, small mammals, reptiles
Random (65): aquatic mammals, fish, flowers, fruit and vegetables, household furniture, large man-made outdoor things, large omnivores and herbivores, medium mammals, non-insect invertebrates, people, reptiles, trees, vehicles 2

The chosen metric to rank filters was the L1-norm, as it has been established to show competitive performance even against data-aware metrics [26], despite being relatively computationally inexpensive.

The learning rate schedule was set to start at the final learning rate employed when M_D was trained on D. After pruning, the learning rate was increased to the second-highest learning rate employed when M_D was trained on D. Following this, the learning rate was decayed at epochs 15 and 25 by the same gamma that was used when M_D was trained on D.

The batch size used was 128, and the training dataset was split into a training and a validation set in an 80:20 ratio. The accuracy values reported in this section are the test-set accuracy corresponding to the model with the best validation accuracy. Standard data augmentation techniques for CIFAR-100, such as random crop, rotation and flip, were used for the training set.
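A minimal PyTorch sketch of the learning rate schedule described above is shown below; the concrete rates and gamma are illustrative placeholders, since the paper only specifies them relative to the original training schedule of M_D.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

# Illustrative values: assume M_D was originally trained with the step schedule
# 0.1 -> 0.01 -> 0.001 and gamma = 0.1 (these numbers are ours, not from the paper).
model = nn.Conv2d(3, 16, 3)                       # stand-in for the deployed model

# Finetuning for n_f epochs starts at the *final* original learning rate.
finetune_opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Retraining after pruning restarts at the *second-highest* original learning rate
# and decays at epochs 15 and 25 by the original gamma.
retrain_opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = MultiStepLR(retrain_opt, milestones=[15, 25], gamma=0.1)

for epoch in range(25):                           # n_r = 25 retraining epochs
    # ... one epoch of training on D' would go here ...
    scheduler.step()
```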
Fig.2 shows the trade-off between the achieved test accuracy and inference time for all pruning levels between 5% and 95% on various subsets and networks. The red dot in each of the figures (Unpruned) displays the performance of M_D on D'. The orange dots in the figures (Subset Agnostic Pruning) display the performance of a model that was pruned without any finetuning on D' and retrained on the entirety of D for n_f + n_r epochs. These sets of points serve as baselines against which to compare the proposed DaPR methodology, as they do not finetune the network being deployed to the domain (D') in any way, i.e. they are subset-agnostic methods. The green points (Subset Aware Pruning) display the performance of a model that was finetuned for n_f epochs on the subset D', then pruned and subsequently retrained on D' for n_r epochs. Instead of performing a binary search as proposed in Section 5, Fig.2 acts as an "oracle" that, for each pruning level, compares the performance of the data-aware method (green points) with the data-agnostic methods (orange and red points).

Figure 2: Trade-off of Test Accuracy vs Inference Time for various levels of pruning. The error bars show the mean and standard deviation of test accuracy obtained per level of pruning across all 5 subsets. Panels: (a) AlexNet; (b) ResNet20; (c) MobileNetV2; (d) SqueezeNet; (e) AlexNet - Aquatic; (f) SqueezeNet - Indoor; (g) MobileNetV2 - Outdoor; (h) ResNet20 - Random.

Figs.2a-2d use error bars to display the mean and standard deviation of the test accuracy across all 5 subsets for a given pruning level and network. In many cases, the worst performing subset-aware strategy performs better than the best performing subset-agnostic strategy, with an average improvement in test accuracy of 10.2pp and a maximum improvement of 42.0pp over all pruning levels, networks and datasets. Furthermore, this improvement in test accuracy tends to increase as more of the network is pruned (lower inference time), further motivating the need to perform data-aware pruning and retraining, as better accuracy can be achieved for smaller models.

Figs.2e-2h show the same trade-off but for specific combinations of network and subset. From left to right, the size of the subset increases and, intuitively, the gap between subset-aware and subset-agnostic pruning and retraining decreases as the subset D' converges towards D. Nonetheless, for all pruning levels, the subset-aware strategy outperforms subset-agnostic pruning and finetuning.

The "oracle" results discussed above show the gains that can be obtained for a wide range of pruning levels; however, only some of the models perform better than the baseline performance of M_D on D' (red points). The methodology proposed in Section 5 efficiently searches, on an edge device, for the low inference time models that can perform better than M_D on D'. Figs.3a-3d each form a Pareto frontier describing the trade-off between the time spent searching for a smaller model and the inference time of this model upon deployment. The labels next to the points show the relative improvement in GOps compared to the unpruned model (-x times) and the pruning level of the model (memory footprint reduction).

Across all subsets of the CIFAR-100 dataset, performing the search described in Section 5 results in sub-minute search times per minibatch for up to a 2.22x improvement in inference time, a 4.18x improvement in GOps and a 90% memory footprint reduction. Furthermore, the memory utilisation of the GPU does not exceed 2GB, which lies far below the maximum memory availability of the TX2 of 8GB.
Figure 3: Trade-off between search time per minibatch and the inference time of the searched model. All models shown here have no accuracy loss compared to M_D inferred on D'. The labels next to the points show (improvement in GOps (-x times), pruning level (%)) of that model. Panels: (a) AlexNet - Random; (b) ResNet20 - Aquatic; (c) MobileNetV2 - Outdoor; (d) SqueezeNet - Natural.

The number of minibatches searched will depend on both the amount of data available and the time budget allocated during deployment to perform finetuning. However, the results presented here show that such DaPR methodologies can be performed on edge devices within a reasonable time budget.

It should be noted that the budget allocated for performing the pruning in this work is much shorter than the budget required by the current state-of-the-art pruning methods described in Section 4, and as such a direct comparison to those works is not meaningful. The existing methods perform pruning and retraining before deployment, and do not actively adapt the network once deployed. Furthermore, none of these works except [32] report results on subsets of CIFAR-100. [32], however, does not provide details of the classes present in each subset, making a direct comparison infeasible.

An investigation was carried out into the relationship between the type of structural block and the impact of finetuning on its parameters. Fig.4a shows the percentage difference between the filters that would be selected for pruning at epoch 0 and those selected at epoch n_f, after n_f epochs of finetuning on D'. In this case D' is the Aquatic subset. The results show that the weights in ResNet20 and MobileNetV2 are highly sensitive to finetuning on a subset, whereas AlexNet and SqueezeNet show negligible change in the filters selected to prune before and after the finetune stage.
Figure 4: Percentage difference in channels pruned at various points of DaPR for different network and subset combinations. Panels: (a) Aquatic - across networks; (b) ResNet20 - Aquatic; (c) MobileNetV2 - Indoor; (d) MobileNetV2 - Random.

To analyse whether these changes in selected filters are due to the finetuning process or to random variation of the weights during training, ResNet20 and MobileNetV2 were explored further. The blue bars in Figs.4b-4d show the difference in channels pruned at epoch 0 versus epoch n_f. The n_f epochs of finetuning were performed 5 times per network and subset from the same starting point, resulting in 5 models at epoch n_f. The orange bars show the average difference in channels pruned across all pairwise combinations of these models at epoch n_f. Fig.4b shows that with ResNet architectures there is a large random variation due to training (high-percentage orange bars), and the comparable percentages between the orange and blue bars suggest that the effect of the finetuning process in tuning the topology of the network to the data processed is limited. However, for the MobileNetV2 architecture, Figs.4c-4d show that the finetuning process effectively tunes the topology to the data processed.

These results suggest that Residual structural blocks (a common feature of ResNet and MobileNetV2) make the weights sensitive to finetuning, but that the depth-wise convolution (unique to MobileNetV2 among the networks tested) allows for data-aware discrimination between filters based on just their L1-norm. Further experiments need to be conducted to generalise such behaviour, but the results also suggest that the metric that needs to be used for data-dependent tuning of architectures may vary depending on the structural blocks present.
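For reference, one way to compute such a percentage difference between two sets of selected filters is sketched below; the exact definition used for Fig. 4 is not spelled out in the paper, so this is our assumption.

```python
def channel_difference(pruned_a, pruned_b):
    """Percentage of filters selected for pruning in one run but not the other.
    `pruned_a`, `pruned_b` are sets of (layer, filter-index) pairs; the exact
    definition of 'percentage difference' used in Fig. 4 is our assumption."""
    a, b = set(pruned_a), set(pruned_b)
    if not a and not b:
        return 0.0
    return 100 * len(a ^ b) / len(a | b)

# Example: filters chosen at epoch 0 vs after n_f epochs of finetuning.
epoch0 = {("layer1", 3), ("layer1", 7), ("layer2", 1)}
epoch_nf = {("layer1", 3), ("layer2", 4), ("layer2", 1)}
print(channel_difference(epoch0, epoch_nf))   # 50.0
```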
8. Conclusion
Acknowledging the shift in paradigm from server-based compute to increased edge processing, this work provides a solution that performs on-device DaPR, an analysis of the accuracy gains achievable by performing such data-aware retraining, an analysis of the costs of performing this process on an embedded device, and a tool that allows for rapid deployment and research of on-device DaPR methodologies. The results show that the gains in accuracy obtained by retraining to the subset are significant, averaging 10.2pp across a wide range of subsets, networks and pruning levels. In terms of costs, the search for a pruned model that achieves a given target accuracy can be performed on an NVIDIA Jetson TX2 with sub-minute search times per minibatch, for up to a 2.22x improvement in inference latency and a 90% reduction in memory footprint. Furthermore, analysis of the selected filters shows that, for the MobileNetV2 architecture, the L1-norm is an effective yet computationally inexpensive metric with which to tune a network's topology to the data being processed, and suggests that different architectures may require different metrics to make the pruning of the network dependent on the data it is processing.

Additionally, the extensible and customisable tool (ADaPT) developed to perform this on-device pruning and retraining allows users to prune a wide range of CNN architectures and realise the memory and runtime gains immediately, on any platform of their choice, in a fully automated manner. To the best of our knowledge, this is the only open-source tool that allows the user to automatically shrink (through pruning) and deploy a network for such a wide variety of networks as well as target hardware. There are commercial tools [7] that can perform this function; however, they tend to focus deployment on specific target hardware architectures such as FPGAs. This combination of features allows for rapid deployment of pruned models on any device without user intervention and thus makes ADaPT a unique open-source tool for pruning research.
References
[1] ImageNet: A large-scale hierarchical image database. IEEE Conference Publication.
[2] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models.
[3] Francois Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. Pages 1800-1807, Honolulu, HI, July 2017. IEEE.
[4] Trevor Gale, Erich Elsen, and Sara Hooker. The State of Sparsity in Deep Neural Networks. arXiv:1902.09574 [cs, stat], Feb. 2019.
[5] Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng-zhong Xu. Dynamic Channel Pruning: Feature Boosting and Suppression. arXiv:1810.05331 [cs], Oct. 2018.
[6] Ross Girshick. Fast R-CNN. Pages 1440-1448, 2015.
[7] Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Song Yao, Song Han, Yu Wang, and Huazhong Yang. From model to FPGA: Software-hardware co-design for efficient neural network acceleration. Pages 1-27, Aug. 2016.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs], Dec. 2015.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs], Dec. 2015.
[10] Weizhe Hua, Yuan Zhou, Christopher De Sa, Zhiru Zhang, and G. Edward Suh. Channel Gating Neural Networks. arXiv:1805.12549 [cs, stat], Oct. 2019.
[11] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely Connected Convolutional Networks. arXiv:1608.06993 [cs], Jan. 2018.
[12] Cheonghwan Hur and Sanggil Kang. Entropy-based pruning method for convolutional neural networks. The Journal of Supercomputing, 75(6):2950-2963, June 2019.
[13] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360 [cs], Nov. 2016.
[14] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit.
[15] Alexandros Kouris and Christos-Savvas Bouganis. Learning to Fly by MySelf: A Self-Supervised CNN-Based Approach for Autonomous Navigation. Pages 1-9, Madrid, Oct. 2018. IEEE.
[16] Alexandros Kouris, Christos Kyrkou, and Christos-Savvas Bouganis. Informed Region Selection for Efficient UAV-based Object Detectors: Altitude-aware Vehicle Detection with CyCAR Dataset. Pages 51-58, Nov. 2019.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84-90, May 2017.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov. 1998.
[19] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal Brain Damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598-605. Morgan-Kaufmann, 1990.
[20] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning Filters for Efficient ConvNets. arXiv:1608.08710 [cs], Mar. 2017.
[21] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime Neural Pruning.
[22] Shaohui Lin, Rongrong Ji, Yuchao Li, Cheng Deng, and Xuelong Li. Towards Compact ConvNets via Structure-Sparsity Regularized Filter Pruning. arXiv:1901.07827 [cs], Mar. 2019.
[23] Tao Lin, Luis Barba, Martin Jaggi, Sebastian U. Stich, and Daniil Dmitriev. Dynamic Model Pruning with Feedback. 2020.
[24] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning Efficient Convolutional Networks through Network Slimming. Aug. 2017.
[25] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation.
[26] Deepak Mittal, Shweta Bhardwaj, Mitesh M. Khapra, and Balaraman Ravindran. Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks. arXiv:1801.10447 [cs], Jan. 2018.
[27] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance Estimation for Neural Network Pruning.
[28] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. ACM SIGARCH Computer Architecture News, 45(2):27-40, June 2017.
[29] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv:1801.04381 [cs], Mar. 2019.
[30] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs], Apr. 2015.
[31] Xavier Suau, Luca Zappella, and Nicholas Apostoloff. Filter Distillation for Network Compression. arXiv:1807.10585 [cs], Dec. 2019.
[32] Xavier Suau, Luca Zappella, and Nicholas Apostoloff. Filter Distillation for Network Compression. arXiv:1807.10585 [cs], Dec. 2019.
[33] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolutions. arXiv:1409.4842 [cs], Sept. 2014.
[34] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv:1905.11946 [cs, stat], Nov. 2019.
[35] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with Noisy Student improves ImageNet classification. arXiv:1911.04252 [cs, stat], Jan. 2020.
[36] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated Residual Transformations for Deep Neural Networks. arXiv:1611.05431 [cs], Apr. 2017.
[37] Linfeng Zhang, Zhanhong Tan, Jiebo Song, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. SCAN: A Scalable Neural Networks Framework Towards Compact and Efficient Models. arXiv:1906.03951 [cs, stat], May 2019.
[38] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. Cambricon-X: An accelerator for sparse neural networks. Pages 1-12, Oct. 2016.
[39] Yiren Zhao, Xitong Gao, Robert Mullins, and Chengzhong Xu. Mayo: A Framework for Auto-generating Hardware Friendly Deep Neural Networks. In Proceedings of the 2nd International Workshop on Embedded and Mobile Deep Learning (EMDL'18), pages 25-30, Munich, Germany, 2018. ACM Press.
[40] Zhi Zhou, Xu Chen, En Li, Liekang Zeng, Ke Luo, and Junshan Zhang. Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing. arXiv:1905.10083 [cs], May 2019.
[41] Neta Zmora, Guy Jacob, Lev Zlotnik, Bar Elharar, and Gal Novik. Neural Network Distiller: A Python Package For DNN Compression Research. arXiv:1910.12232 [cs, stat].