Enabling On-Device CNN Training by Self-Supervised Instance Filtering and Error Map Pruning
Yawen Wu†, Zhepeng Wang†, Yiyu Shi‡, and Jingtong Hu†
†Department of Electrical and Computer Engineering, University of Pittsburgh, USA
‡Department of Computer Science and Engineering, University of Notre Dame, USA
Email: [email protected], [email protected], [email protected], [email protected]
Abstract—This work aims to enable on-device training of convolutional neural networks (CNNs) by reducing the computation cost at training time. CNN models are usually trained on high-performance computers and only the trained models are deployed to edge devices. But the statically trained model cannot adapt dynamically in a real environment and may result in low accuracy for new inputs. On-device training by learning from the real-world data after deployment can greatly improve accuracy. However, the high computation cost makes training prohibitive for resource-constrained devices. To tackle this problem, we explore the computational redundancies in training and reduce the computation cost by two complementary approaches: self-supervised early instance filtering on the data level and error map pruning on the algorithm level. The early instance filter selects important instances from the input stream to train the network and drops trivial ones. The error map pruning further prunes out insignificant computations when training with the selected instances. Extensive experiments show that the computation cost is substantially reduced without any or with marginal accuracy loss. For example, when training ResNet-110 on CIFAR-10, we achieve 68% computation saving while preserving full accuracy and 75% computation saving with a marginal accuracy loss of 1.3%. Aggressive computation saving of 96% is achieved with less than 0.1% accuracy loss when quantization is integrated into the proposed approaches. Besides, when training LeNet on MNIST, we save 79% computation while boosting accuracy by 0.2%.
Index Terms—On-device training, data filter, gradient pruning
I. INTRODUCTION
The maturation of deep learning has enabled on-device intelligence for Internet of Things (IoT) devices. The convolutional neural network (CNN), as an effective deep learning model, has been intensively deployed on IoT devices to extract information from the sensed data, such as in smart cities [1], smart agriculture [2], and wearable devices [3]. The models are initially trained on high-performance computers (HPCs) and then deployed to IoT devices for inference. However, in the physical world, the statically trained model cannot adapt to the real world dynamically and may result in low accuracy for new input instances. On-device training has the potential to learn from the environment and update the model in-situ. This enables incremental/lifelong learning [4] to train an existing model to update its knowledge, and device personalization [5] by learning features from the specific user and improving model accuracy. Federated learning [6] is another application scenario of on-device training, where a large number of devices (typically mobile phones) collaboratively learn a shared model while keeping the training data on personal devices to protect privacy. Since each device still computes the full model update by the expensive training process, the computation cost of training needs to be greatly reduced to make federated learning realistic.

While the efficiency of training on HPCs can always be improved by allocating more computing resources, such as 1024 GPUs [7], training on resource-constrained IoT devices remains prohibitive. The main problem is the large gap between the high computation and energy demand of training and the limited computing resources and battery on IoT devices. For example, training ResNet-110 [8] on a 32x32 input image takes 780M FLOPs, which is prohibitive for IoT devices. Besides, since computation directly translates into energy consumption and IoT devices are usually battery-constrained [9], the high computation demand of training will quickly drain the battery. While existing works [10]–[12] effectively reduce the computation cost of inference by assigning input instances to different classifiers according to their difficulty, the computation cost of training is not reduced.

To address this challenge, this work aims to enable on-device training by significantly reducing the computation cost of training while preserving the desired accuracy. Meanwhile, the proposed techniques can also be adopted to improve training efficiency on HPCs. To achieve this goal, we investigate the computation cost of the entire training cycle, aiming to eliminate unnecessary computations while keeping full accuracy. We made the following two observations:
First, not all the input instances are important for improving the model accuracy. Some instances are similar to the ones that the model has already been trained with and can be completely dropped to save computation. Therefore, developing an approach to filter out unimportant instances can greatly reduce the computation cost.
Second, for the important instances, not all the computation in the training cycle is necessary. Eliminating insignificant computations will have a marginal influence on accuracy. In the backward pass of training, some channels in the error maps have small values. Pruning out these insignificant channels and the corresponding computation will have a marginal influence on the final accuracy while saving a large portion of computation.

Based on the two observations, we propose a novel framework consisting of two complementary approaches to reduce the computation cost of training while preserving the full accuracy. The first approach is an early instance filter to select important instances from the input stream to train the network and drop trivial ones. The second approach is error map pruning to prune out insignificant computations in the backward pass when training with the selected instances.

In summary, the main contributions of this paper include:

• A framework to enable on-device training.
We propose a framework consisting of two approaches to eliminate unnecessary computation in training CNNs while preserving full network accuracy. The first approach improves the training efficiency of both the forward and backward passes, and the second approach further reduces the computation cost in the backward pass.

• Self-supervised early instance filtering (EIF) on the data level.
We propose an instance filter to predict the loss of each instance and develop a self-supervised algorithm to train the filter. Instances with predicted low loss are dropped before starting the training cycle to save computation. To train the filter simultaneously with the main network, we propose a self-supervised training algorithm including an adaptive threshold based labeling strategy, an uncertainty sampling based instance selection algorithm, and a weighted loss for the biased high-loss ratio.

• Error map pruning (EMP) on the algorithm level.
We propose an algorithm to prune insignificant channels in error maps to reduce the computation cost in the backward pass. The channel selection strategy considers the importance of each channel on both the error propagation and the computation of the weight gradients to minimize the influence of pruning on the final accuracy.

We evaluate the proposed approaches on networks of different scales. ResNet and VGG are for on-device training on mobile devices, and LeNet is for tiny sensor node-level devices. Experimental results show the proposed approaches effectively reduce the computation cost of training without any or with marginal accuracy loss.

II. BACKGROUND AND RELATED WORK
A. Background of CNN Training
The training of CNNs is most commonly conducted with the mini-batch stochastic gradient descent (SGD) algorithm [13]. It updates the model weights iteration-by-iteration using a mini-batch (e.g., 128) of input instances. For each instance in the mini-batch, a forward pass and a backward pass are conducted. The forward pass attempts to predict the correct outputs using the current model weights. Then the backward pass back-propagates the loss through the layers, which generates the error maps for each layer. Using the error maps, the gradients of the loss w.r.t. the model weights are computed. Finally, the model weights are updated by using the weight gradients and an optimization algorithm such as SGD.

To provide labeled data for on-device training, labeling strategies from existing works can be used. For example, the labels can come from aggregating inference results from neighbor devices [14] (e.g., voting), employing spatial context information as the supervisory signals [15], [16], or be naturally inferred from user interaction [17], [18] such as next-word prediction in keyboard typing.
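To make the forward/backward procedure concrete, the following is a minimal PyTorch sketch of one mini-batch iteration (PyTorch is the framework used in the experiments); the model and optimizer are placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, labels):
    """One mini-batch iteration of SGD training."""
    optimizer.zero_grad()
    outputs = model(images)                  # forward pass
    loss = F.cross_entropy(outputs, labels)  # loss on current weights
    loss.backward()                          # backward pass: error maps and weight gradients
    optimizer.step()                         # weight update
    return loss.item()
```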
B. Related Work
Accelerated Training.
There are a number of works on accelerating network training. Stochastic depth [19] accelerates the training by randomly bypassing layers with the residual connection. E2Train [20] randomly drops mini-batches and selectively skips layers by using residual connections to save computation cost. Different from [20], which randomly drops mini-batches, we investigate the importance of each instance before keeping or dropping it. The input data from the real world is not ideally shuffled, and valuable instances for training can concentrate within one mini-batch. Simply dropping mini-batches can miss important instances for training the network. Besides, the layer skipping in these two works relies on the ResNet architecture [8] and cannot be naturally extended to general CNNs. In contrast, our approaches are applicable to general CNNs. OHEM [21] selects high-loss instances and drops low-loss ones to improve training efficiency. It computes the loss values of all instances in the forward pass and only keeps high-loss instances for the backward pass. The main drawback is that the computation in the forward pass of low-loss instances is wasted. Different from this, our approach predicts the loss of each instance and drops low-loss instances before starting the forward pass, which eliminates the computation cost of low-loss instances.
Distributed Training.
Another way to accelerate training is leveraging distributed training with abundant computing resources and large batch sizes. [7] employs an extremely large batch size of 32K with 1024 GPUs to train ResNet-50 in 20 minutes. [22] integrates a mixed-precision method into distributed training and pushes the time down to 6.6 minutes. However, these works target leveraging highly-parallel computing resources to reduce the training time and actually increase the total computation cost, which is infeasible for training on resource-constrained IoT devices.
Network Pruning during Training.
Some works aim to train and prune the network architecture simultaneously. [23] proposes a pruning approach to sparsify weights during training. The goal is to generate a compact network for inference instead of improving training efficiency. In fact, it requires more time on training by first training the backbone and then pruning it. Similarly, [24] prunes the sparse network during training to have a compact network for inference. However, these works only improve the inference efficiency, and the training computation cost is not reduced. [25], [26] aim to accelerate training by reconfiguring the network to a smaller one during training. The main drawback is that the network is pruned on the offline training dataset, and the ability of the pruned network for further on-device learning is compromised. Instead, we focus on reducing the computation cost of online training, and the entire network architecture is preserved to keep the full ability for learning in an uncertain future.
Network Compression and Neural Architecture Search.
There are extensive explorations on network compression and neural architecture search (NAS). [27], [28] prune the network to generate a compact network for efficient inference. [29]–[32] search neural architectures for hardware-friendly inference. [33] further considers quantization during NAS for efficient inference. However, these works only aim to design network architectures for efficient inference. The computation cost of training is not considered.

III. FRAMEWORK OVERVIEW
Fig. 1: Overview of early instance filtering (EIF) and error map pruning (EMP).

The overview of the proposed framework is shown in Fig. 1. On top of the main neural network, a small instance filter network is proposed to select important instances from the input stream to train the network and drop trivial ones. When the input instances arrive, the early instance filter predicts the loss value for each instance as if the instance were fed into the main network and makes a binary decision to drop or preserve this instance. If the predicted loss is high and the instance is preserved, the main network will be invoked to start the forward and backward pass for training. Since the loss prediction is for the main network, once the main network is updated, the instance filter also needs to be trained for accurate loss prediction. The training of the instance filter is self-supervised based on the labeling strategy by the adaptive loss threshold, instance selection by uncertainty sampling, and the weighted loss for the biased high-loss ratio, which will be introduced in Section IV. Once important instances are selected, the error map pruning further reduces the computation cost of the backward pass. It prunes out channels in the error maps that have small contributions to the error propagation and gradient computation, which will be introduced in Section V.
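The decision loop described above can be summarized in a short sketch. This is an illustrative schematic under assumed interfaces; `predict_high_loss`, `train_main_net`, and `update_filter` are hypothetical placeholder callables, not the authors' API.

```python
def training_loop(stream, predict_high_loss, train_main_net, update_filter):
    """Schematic of the EIF-gated on-device training loop."""
    for images, labels in stream:
        # EIF: cheap binary loss prediction (high / low) with a confidence value
        is_high_loss, confidence = predict_high_loss(images)
        if not is_high_loss:
            continue                        # drop trivial instance: no forward/backward pass
        # Preserved instance: full training cycle, with EMP in the backward pass
        loss = train_main_net(images, labels)
        # Self-supervised update of the filter based on the observed loss
        update_filter(images, loss)
```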
IV. SELF-SUPERVISED EARLY INSTANCE FILTERING
The early instance filter (EIF) is used to select important instances for training the main network and drop trivial instances to reduce the computation cost of training. Since the main network is constantly being updated during training, it is essential to tune the EIF every time the main network is updated. In this way, the EIF can accurately select important instances based on the latest state of the main network. In this section, we will first introduce the working flow of EIF to select instances for training the main model. Then we discuss the challenges for updating the EIF. After that, we present three approaches to address these challenges such that the EIF can be effectively updated.

Fig. 2: Self-supervised training of the early instance filter (EIF) by adaptive loss threshold, uncertainty sampling, and weighted loss.

To select important instances and drop trivial ones on-the-fly during training, the EIF predicts the loss value of each instance from the input stream without actually feeding the instance into the main network. Trivial instances with predicted low loss are dropped before the forward pass, which eliminates the computation of the forward pass and the more computationally intensive backward pass of the main network. Important instances with predicted high loss are preserved to complete the forward pass, calculate the loss, and finish the backward pass to compute the weight gradients to update the main network. Kindly note that the instances are not pre-selected before the training starts. Instead, they are selected on-the-fly during training based on what the main network has and has not learned at the current state.

Fig. 2 shows the working flow of EIF. The user first needs to pre-define a high-loss ratio $R_{set}$ (e.g., 10%) such that this fraction of instances in the whole input stream will be predicted as high-loss and the others will be predicted as low-loss. Only instances predicted as high-loss will be used for training the main network. When instances arrive sequentially, the early instance filter predicts the loss value of each instance $i$ as binary high or low, $y_{pred,i} \in \{H, L\}$, for the main network such that the pre-defined high-loss ratio is satisfied. The filter also produces the confidence of each loss prediction, represented by the entropy of the loss prediction. Since the loss prediction by the EIF network is for the main network and the main network is constantly being updated, it is essential to re-train the EIF network every time the main network is updated to realize accurate loss prediction. However, there are several challenges in realizing automatic self-supervised training for the EIF network. In this section, we will first present three major challenges. Then, we will present three techniques to address these challenges: adaptive loss threshold, uncertainty sampling, and weighted loss, as shown in Fig. 2.

Challenges:
During on-device training, instances with predicted low loss are dropped before being fed to the main network to compute the actual loss, and their true loss values are unknown. Thus, we can only know the true loss values of instances with predicted high loss, which brings the following challenges.

The first challenge is how to label instances as high-loss or low-loss for training the EIF according to the pre-defined high-loss ratio. For example, if we could know the loss values of all instances, defining a loss threshold that separates the 10% of instances with the highest loss values would simply be a matter of sorting all the loss values and finding the value for separation. Since the loss values of dropped instances are unknown, defining a loss threshold remains a challenge.

The second challenge is that the EIF network can choose which instances will be used to train itself, which is not possible in normal CNN training. As long as the EIF network is not 100% accurate, it will make wrong predictions. To avoid punishment, instead of adjusting its own weights to make accurate loss predictions, the filter will learn a shortcut by predicting all the new input instances as low loss and dropping them. Since the dropped instances will not be fed to the main network, the EIF network will never know the ground truth of the losses and thus will not be punished for doing so. In this way, the EIF will think it makes perfect predictions. Dropping all the new instances prevents further training of the filter and the main network.

The third challenge is how to correctly train the filter when the number of high-loss and low-loss instances is extremely unbalanced in the input stream. This is different from normal training datasets such as CIFAR-10 and ImageNet, in which the number of instances in each class is balanced. The unbalanced number of high-loss and low-loss instances makes the EIF network training ineffective. For example, when the pre-defined high-loss ratio is relatively low (e.g., 10%), simply predicting all the instances as low-loss will produce a high accuracy of 90% on the filter, which it believes is a good result. However, this prediction is useless since it does not find any important instance to train the main network.

We will present three techniques to address these challenges.
A. Adaptive Loss Threshold Based Labeling Strategy
The adaptive loss threshold is used to provide the ground truth (labels) for training the EIF. With the adaptive loss threshold, we can label the loss of instances as high-loss or low-loss to train the EIF. During the training of the EIF and the main network, $R_{set}$ percent of instances will be predicted as high-loss by the EIF. The true loss values of instances predicted as high-loss can be obtained on the main network. However, we do not know the true loss values of instances predicted as low-loss since they are dropped before being fed into the main network. With only partial loss values, defining an exact loss threshold (e.g., sorting all loss values and finding the threshold) is challenging. Therefore, we aim to approximate the threshold. To achieve this, we first define true high (TH) instances as the instances with predicted high loss by the filter and labeled as high-loss by the loss threshold. We monitor the number of TH instances in the last $n$ mini-batches. Then we calculate the percentage $R_{TH}$ as the number of TH instances among the preserved ones over all the instances in the last $n$ mini-batches. By comparing the percentage $R_{TH}$ with the pre-defined percentage $R_{set}$, the loss threshold is adjusted to draw $R_{TH}$ to the pre-defined percentage $R_{set}$.

Formally, with adaptive loss threshold $T_l$, instances are labeled as high-loss or low-loss as follows:
$$y_i = \begin{cases} H & \text{if } loss_i \geq T_l \\ L & \text{otherwise} \end{cases} \tag{1}$$
where $loss_i$ is the loss value of instance $i$ computed by the main network and $T_l$ is the adaptive loss threshold.

The true high (TH) loss instance ratio $R_{TH}$ by the filter is defined as follows:
$$R_{TH} = \frac{1}{mn} \sum_{i=1}^{mn} I(y_{pred,i} = H)\, I(y_i = H) \tag{2}$$
$I(x)$ is an indicator function which equals 1 if $x$ is true and 0 otherwise. $y_{pred,i}$ is the binary prediction by the filter for instance $i$, and $y_i$ is the loss label by Eq. (1). $m$ is the batch size, and $n$ is the number of mini-batches to monitor for one update of the loss threshold.

Based on the computed $R_{TH}$ and pre-defined $R_{set}$, the loss threshold $T_l$ is adjusted to draw $R_{TH}$ to $R_{set}$. When $R_{TH}$ is larger than $R_{set}$, too many instances are labeled and predicted as high loss, which indicates $T_l$ is too small. Therefore, $T_l$ will be incremented by multiplying with a factor larger than 1. Similarly, when $R_{TH}$ is smaller than $R_{set}$, $T_l$ is too large and will be attenuated. The loss threshold $T_l$ is adjusted as:
$$T_l = \begin{cases} \alpha_1 T_l & \text{if } R_{TH} \geq R_{set} \\ \alpha_2 T_l & \text{otherwise} \end{cases} \tag{3}$$
$\alpha_1$ and $\alpha_2$ are two hyper-parameters, where $\alpha_1$ is larger than 1 and $\alpha_2$ is smaller than 1, to define the step size.

The computed $R_{TH}$ is essential to the self-supervised training of the EIF. More specifically, $R_{TH}$ controls the loss threshold $T_l$ by Eq. (3), which further controls the instance labels $y_i$ by Eq. (1) for training the instance filter. With the labels $y_i$, the filter will be trained accordingly to predict high-loss instances. The number of instances with predicted high loss by the filter and labeled as high-loss will be used to compute the new $R_{TH}$ by Eq. (2), which further adjusts $T_l$. This process continues for each mini-batch, which forms the self-supervised training of the instance filter. Leveraging the self-supervision, the loss threshold $T_l$ will be properly adjusted and the instance filter will be well-trained to track the latest state of the main network. In this way, the true high-loss ratio $R_{TH}$, affected by both the filter and the loss threshold, will be kept at the set ratio $R_{set}$. The filter will effectively select $R_{set}$ percent important instances for training the main network.
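As a concrete illustration of Eqs. (1)-(3), the threshold adaptation can be written in a few lines. This is a hedged sketch, not the authors' code; the initial threshold and step factors ($\alpha_1 = 1.1$, $\alpha_2 = 0.9$) are illustrative assumptions.

```python
class AdaptiveLossThreshold:
    """Adjusts T_l so that the true-high ratio R_TH tracks R_set (Eqs. 1-3)."""
    def __init__(self, r_set, t_init=1.0, alpha1=1.1, alpha2=0.9):
        self.r_set = r_set            # pre-defined high-loss ratio R_set
        self.t_l = t_init             # adaptive loss threshold T_l (assumed initial value)
        self.alpha1, self.alpha2 = alpha1, alpha2

    def label(self, loss_value):
        """Eq. (1): label an observed loss as high ('H') or low ('L')."""
        return 'H' if loss_value >= self.t_l else 'L'

    def update(self, num_true_high, num_instances):
        """Eqs. (2)-(3): after n monitored mini-batches, adjust T_l."""
        r_th = num_true_high / num_instances     # Eq. (2)
        self.t_l *= self.alpha1 if r_th >= self.r_set else self.alpha2
```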
B. Instance Selection by Uncertainty Sampling
The main reason for the second challenge is that if an instance is dropped, it will never be fed to the main network, so the EIF network will never know the ground truth of its loss. In this way, the labels (i.e., high-loss or low-loss) of the dropped instances for training the EIF will be unknown, and the EIF cannot be correctly trained. To address this problem, we keep some instances with predicted low loss, which would otherwise be dropped, to augment the preserved instances for training the filter. In this way, wrong loss predictions of the dropped instances will also punish the filter, which forces it to actually learn to find important instances. To decide which instances to keep while minimizing the number of selected instances, we employ uncertainty sampling [34]. The dropped instances that the filter is least confident about will be fed into the main network to compute the loss value. To measure the confidence of a loss prediction by the filter, we use the entropy defined as:
$$entropy(i) = - \sum_{c \in \{H, L\}} p_{i,c} \log p_{i,c}, \quad p_{i,c} = prob(y_{pred,i} = c) \tag{4}$$
$p_{i,c}$ is the computed probability by the filter of being high-loss ($c = H$) or low-loss ($c = L$) for instance $i$. The smaller the entropy, the more confident the filter is about the prediction. Based on the entropy, we select from the dropped instances those whose entropy is above the entropy threshold to augment the preserved instances for training the filter. The set of selected instances is defined as:
$$I = \{ i \mid i \in \{TL, FL\},\ entropy(i) > entropy_T \} \tag{5}$$
where $entropy_T$ is the entropy threshold.
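A minimal sketch of this selection rule follows, assuming the filter outputs a two-way softmax over {H, L}; the entropy threshold value is an illustrative assumption, not a number from the paper.

```python
import torch

def select_uncertain(probs_low_loss, entropy_threshold=0.5):
    """Eqs. (4)-(5): among instances predicted low-loss, keep the ones the
    filter is least confident about (high prediction entropy).

    probs_low_loss: tensor [N, 2] of filter softmax outputs over {H, L}
    for instances predicted as low-loss.
    Returns indices of instances to feed to the main network anyway.
    """
    eps = 1e-12  # numerical guard for log(0)
    entropy = -(probs_low_loss * (probs_low_loss + eps).log()).sum(dim=1)
    return torch.nonzero(entropy > entropy_threshold).flatten()
```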
C. Weighted Loss for Biased High-Loss Ratio

To address the third challenge, we propose to use a weighted loss function to make the EIF training process fair in treating the high-loss instances when their ratio is low. In this way, the EIF can be trained to make accurate loss predictions and select important instances for training the main network. Traditionally, for datasets with balanced classes, we use the average loss of each instance as the loss function of a mini-batch for training. In our case, based on the binary loss label $y_i$ in Eq. (1) and the binary loss prediction $y_{pred,i}$ by the filter, the loss function for instance $i$ is defined by cross-entropy as:
$$L_i = - \sum_{c \in \{H, L\}} I(y_i = c) \log p_{i,c} \tag{6}$$
where $p_{i,c}$ defined in Eq. (4) is the computed probability of being a high or low loss for instance $i$ by the filter. $L_i$ measures how well the loss prediction approximates the true loss label and will be minimized during training. The average loss would be the average loss value of each preserved instance in a mini-batch. However, when the pre-defined high-loss ratio is not 50% and makes the number of high-loss and low-loss instances unbalanced, directly using the average loss will result in ineffective training of the EIF.

To understand the inefficiency of training with the average loss, we define the weighted loss for the preserved instances in a mini-batch to train the filter as:
$$L = \sum_{i \in TH} w_H L_i + \sum_{j \in FH} w_L L_j + \sum_{p \in TL} w_L L_p + \sum_{q \in FL} w_H L_q \tag{7}$$
where $TH$, $FH$, $TL$ and $FL$ represent true high, false high, true low and false low loss instances, respectively. $TH$ and $FH$ are instances with predicted high loss and labeled as $H$ and $L$ by Eq. (1), respectively. $TL$ and $FL$ are instances with predicted low loss and selected by uncertainty sampling in Eq. (5), which have loss labels $L$ and $H$, respectively. The weights $w_H$ and $w_L$ represent how important the true high loss (instances with loss label $H$, including $TH$ and $FL$) and true low loss (instances with loss label $L$, including $TL$ and $FH$) instances are, respectively. $w_H$ and $w_L$ are normalized such that the weights of all instances in Eq. (7) sum up to 1.

When the pre-defined high-loss ratio $R_{set}$ is not 50%, the numbers of high-loss and low-loss instances will not be equal in the input stream. This makes training the EIF with the average loss ineffective. For example, when $R_{set}$ is set to 10%, only 10% of the instances streamed in will be labeled as high-loss by the adaptive loss threshold. In this way, 90% of the elements in Eq. (7) will be low-loss instances and dominate the loss. If we were using the average loss, all the weights would be the same. To minimize the loss when training the filter, simply predicting all instances as low-loss will produce small loss values on the dominating second and third elements in Eq. (7), and hence on the total loss, which prevents effective training of the filter.

To address this problem, we make the weights biased by setting $w_H = 1/R_{set}$ and $w_L = 1/(1 - R_{set})$. In this way, we have $w_H \times percent(H = TH + FL) = w_L \times percent(L = TL + FH)$. The first and fourth sums in Eq. (7) correspond to the high-loss ($H$) instances. The second and third sums in Eq. (7) correspond to the low-loss ($L$) instances. By setting the weights in this way, the high-loss and low-loss instances will contribute equally to the total loss and will be treated fairly in training. In the above example, while the first and fourth sums only contribute 10% of the number of elements, the higher weight $w_H = 1/0.1 = 10$ makes them equally important as the second and third sums, which have the lower weight $w_L = 1/0.9 \approx 1.1$. Therefore, the instance filter can be correctly trained with the unbalanced number of high-loss and low-loss instances and accurately predict high-loss ones.

With the predicted high-loss instances by the filter, the selected instances by uncertainty sampling, and the weighted loss function for training, the filter is effectively trained to predict high-loss instances for training the main network.
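The weighted loss of Eq. (7) reduces to weighting a per-instance cross-entropy. Below is a hedged sketch assuming binary labels (1 = high-loss, 0 = low-loss); the exact normalization in the authors' implementation may differ.

```python
import torch
import torch.nn.functional as F

def weighted_filter_loss(logits, labels, r_set):
    """Eq. (7): weight per-instance cross-entropy by 1/R_set for
    high-loss labels and 1/(1 - R_set) for low-loss labels.

    logits: [N, 2] filter outputs over {low, high}; labels: [N] in {0, 1}.
    """
    per_instance = F.cross_entropy(logits, labels, reduction='none')  # Eq. (6)
    w_h, w_l = 1.0 / r_set, 1.0 / (1.0 - r_set)
    weights = torch.where(labels == 1,
                          torch.full_like(per_instance, w_h),
                          torch.full_like(per_instance, w_l))
    weights = weights / weights.sum()   # normalize so all weights sum to 1
    return (weights * per_instance).sum()
```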
V. ERROR MAP PRUNING IN BACKWARD PASS
Fig. 3: Error maps of convolutional layers in back-propagation.

When training with the selected instances, the computation in the backward pass can be further reduced by error map pruning (EMP). Since the backward pass takes about 2/3 of the computation cost of training, reducing its computation can effectively reduce the total cost. As shown in Fig. 3, in the backward pass of training, the back-propagation propagates the errors layer-by-layer from the last layer to the first layer. We focus on pruning convolutional layers because they dominate the computation cost in the backward pass. Within one convolutional layer, the input error map is generated from the output error map of the same layer. The output error map consists of many channels. We aim to prune the insignificant channels to reduce the computation cost of training.

Given a pruning ratio, we need to keep the most representative channels in the error map to maintain as much information as possible such that the training accuracy is retained. The main idea of the proposed channel selection strategy is to prune the channels that have the least influence on both the error propagation and the computation of the weight gradients.
A. Channel Selection to Minimize Reconstruction Error on Error Propagation

Fig. 4: Back-propagation of errors with pruned error map.

The first criterion to select the channels to be pruned is to minimize the reconstruction error on error propagation. The error propagation for one convolutional layer is shown on the top of Fig. 4. Within one layer, the error propagation starts from the output error map $\delta^l$ shown on the right, convolves $\delta^l$ with the rotated kernel weights $rot(W^l)$, and generates the input error map $\delta^l_{in}$ on the left. The error propagation with pruned $\delta^l$ is shown on the bottom of Fig. 4. The number of channels in $\delta^l$ is pruned from $n$ to $n'$. When computing $\delta^l_{in}$, the computations corresponding to the pruned channels, which are convolutional operations between $\delta^l$ and the rotated weights, are removed. In order to maintain training accuracy, we want to keep the input error map $\delta^l_{in}$ before and after the pruning as similar as possible. In other words, we want to minimize the reconstruction error on the input error map.

Formally, without channel pruning of $\delta^l$, $\delta^l_{in}$ is computed as follows:
$$\delta^l_{in} = \sum_{j=1}^{n} rot(W^l_j) * \delta^l_j \tag{8}$$
where $\delta^l_{in}$ is the input error map consisting of $c$ channels, each with shape $[W_{in}, H_{in}]$. $rot(W^l_j)$ is the rotated weights of the $j$th convolutional kernel with shape $[c, k_w, k_h]$. $\delta^l_j$ is the $j$th channel of the output error map with shape $[W, H]$.

Given a pruning ratio $\alpha$ and an output error map $\delta^l$, we aim to reduce the number of channels in $\delta^l$ from $n$ to $n'$ such that $\alpha = n'/n$. To minimize the reconstruction error on $\delta^l_{in}$, the channel selection problem is formulated as follows:
$$\arg\min_{\beta}\ \Big\| \delta^l_{in} - \sum_{j=1}^{n} rot(W^l_j) * (\delta^l_j \beta_j) \Big\|_2 \tag{9}$$
$$\text{s.t.}\quad \|\beta\|_1 = n' \tag{10}$$
where $\beta$ is the error map selection strategy, represented as a binary vector of length $n$. $\beta_j$ is the $j$th entry of $\beta$, and $\beta_j = 0$ means the $j$th channel $\delta^l_j$ is pruned. The $\ell_2$-norm $\|x\|_2 = \sqrt{\Sigma x^2}$ measures the reconstruction error on $\delta^l_{in}$.

However, directly solving the minimization problem is prohibitive. $\delta^l_{in}$ in the problem is computed by Eq. (8), which completes all the computation in error propagation and defeats the purpose of saving computation. To select channels to prune before starting the actual error propagation, we define an importance score as an indication of how much each channel will influence the value of $\delta^l_{in}$ and prune the least important channels to minimize the reconstruction error on $\delta^l_{in}$.

Importance Score.
In Eq. (9), when a channel $\delta^l_j$ is pruned, the computation error on $\delta^l_{in}$ is caused by the pruned $rot(W^l_j) * \delta^l_j$. As a fast and accurate estimation of the magnitude of $rot(W^l_j) * \delta^l_j$, we define the importance score of channel $j$ as follows:
$$s_j = \gamma_1 \|W^l_j\|_1 + \gamma_2 \|\delta^l_j\|_1 \tag{11}$$
where $\|W^l_j\|_1$ is the $\ell_1$-norm of convolutional kernel $j$, computed by $\sum_{i=1}^{c} |W^l_{j,i}|$. Here we remove the rotation on $W^l_j$ since it does not change the $\ell_1$-norm. $\|\delta^l_j\|_1$ is the $\ell_1$-norm of channel $j$ in the output error map, computed by the sum of its absolute values $\sum_{x=1}^{W} \sum_{y=1}^{H} |\delta^l_{j,x,y}|$. $\gamma_1$ and $\gamma_2$ are two hyper-parameters to adjust the weight of each $\ell_1$-norm.

The importance score $s_j$ gives an expectation of the magnitude that a channel $j$ in $\delta^l$ contributes to $\delta^l_{in}$. Channels with small magnitudes in $\delta^l$ and small corresponding kernel weights $|W^l_j|$ tend to produce trivial values in the input error map $\delta^l_{in}$ and can be pruned while minimizing the influence on $\delta^l_{in}$.
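A sketch of Eq. (11) in PyTorch follows; the tensor layout and the default $\gamma$ values are assumptions for illustration, not from the paper.

```python
import torch

def importance_scores(weight, error_map, gamma1=1.0, gamma2=1.0):
    """Eq. (11): per-channel importance for the output error map of layer l.

    weight:    conv kernels of layer l, shape [n, c, k_h, k_w]
    error_map: output error map for one instance, shape [n, W, H]
    Returns a tensor of n scores, one per error map channel.
    """
    w_norm = weight.abs().sum(dim=(1, 2, 3))   # l1-norm of each kernel W_j
    d_norm = error_map.abs().sum(dim=(1, 2))   # l1-norm of each channel delta_j
    return gamma1 * w_norm + gamma2 * d_norm
```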
B. Channel Selection to Minimize Reconstruction Error on Gradient Computation

Fig. 5: Computation of the weight gradient with a pruned error map.

The second criterion to select the channels to be pruned is to minimize the reconstruction error on the weight gradients. The computation of the weight gradients without pruning is shown on the top of Fig. 5. The output feature map $a^{l-1}$ of the previous layer convolves with one channel of the output error map $\delta^l$ to produce the gradient of one kernel. When some channels in $\delta^l$ are pruned, the computation of the weight gradients corresponding to the pruned channels is removed. To retain training accuracy, we want to keep the weight gradients before and after pruning as similar as possible. Without channel pruning of $\delta^l$, the weight gradients of kernel $j$ are computed as follows:
$$g^l_{w,j} = a^{l-1} * \delta^l_j, \quad \forall j \in \{1, ..., n\} \tag{12}$$
where $g^l_{w,j}$ is the weight gradient of kernel $j$ with shape $[c, k_w, k_h]$. $a^{l-1}$ is the output feature map of the previous layer $l-1$ with shape $[c, W_{in}, H_{in}]$. $\delta^l_j$ is channel $j$ of the output error map in layer $l$, which has shape $[W, H]$.

To determine the channel selection strategy $\beta$ while minimizing the reconstruction error on the gradient computation, the channel selection problem is formulated as follows:
$$\arg\min_{\beta} \sum_{j=1}^{n} \big\| g^l_{w,j} - a^{l-1} * (\delta^l_j \beta_j) \big\|_2 \quad \text{s.t.}\quad \|\beta\|_1 = n' \tag{13}$$
Similar to Eq. (9), we use the $\ell_2$-norm $\|\cdot\|_2$ to measure the reconstruction error on the computation of the weight gradients for all $n$ kernels incurred by the pruning. Similarly, solving this problem requires completing all the gradient computation in Eq. (12) to get $g^l_{w,j}, j \in \{1, ..., n\}$, which contradicts the goal of saving computation. Thus, we define the importance score of each channel in $\delta^l$ for $g_w$ and prune the least important ones to minimize the reconstruction error on $g_w$.

Importance Score.
In Eq. (13), when a channel $\delta^l_j$ is pruned, the computation error is caused by the pruned $a^{l-1} * \delta^l_j$. Since $a^{l-1}$ is independent of $j$ and can be considered a constant when measuring the importance of each channel $\delta^l_j$, we ignore $a^{l-1}$ and only include $\delta^l_j$ in the importance score of channel $j$, which is defined as follows:
$$s_j = \|\delta^l_j\|_1 \tag{14}$$

C. Mini-Batch Pruning with Combined Importance Score
To make the pruned channels for error propagation and gradient computation consistent with each other, we combine the importance scores for these two processes. Then we scale the score from instance-wise to batch-wise for mini-batch training. The importance score $s_j$ for gradient computation in Eq. (14) is a reduced form of Eq. (11) obtained by setting $\gamma_1 = 0$ and $\gamma_2 = 1$. Therefore, we combine them into Eq. (11). Based on the per-instance importance score of each channel, we can prune channels for a mini-batch of instances to reduce the computation while maintaining the accuracy. For a mini-batch of instances, we prune the same channels for all the instances. The batch-wise importance score of one channel is calculated as $S_j = \sum_{i=1}^{m} s^i_j$, where $m$ is the batch size and $s^i_j$ is the importance score of channel $j$ for instance $i$.

With the batch-wise importance score, the error map pruning process for one convolutional layer is as follows. Given a pruning ratio $\alpha$, $n(1 - \alpha)$ channels in the output error map $\delta^l$ need to be pruned. First, for each channel $j$ in $\delta^l$, we calculate the batch-wise importance score $S_j$. Then the importance scores of all channels are sorted, and the $n(1 - \alpha)$ channels with the smallest $S_j$ are marked as pruned. Then the error propagation and the computation of the weight gradients corresponding to the pruned channels are skipped to save computation.
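Reusing the per-instance score of Eq. (11), the batch-wise selection step can be sketched as follows; like the previous block, the tensor layout is an assumption for illustration, not the paper's implementation.

```python
import torch

def select_pruned_channels(weight, error_maps, alpha, gamma1=1.0, gamma2=1.0):
    """Mini-batch channel selection for error map pruning.

    weight:     conv kernels of layer l, shape [n, c, k_h, k_w]
    error_maps: output error maps for a mini-batch, shape [m, n, W, H]
    alpha:      fraction of channels to keep (n' = alpha * n)
    Returns indices of the n(1 - alpha) channels to prune.
    """
    m = error_maps.shape[0]
    # Batch-wise score S_j = sum over instances of s_j (Eq. 11 per instance);
    # the weight term is identical across the batch, so it is scaled by m.
    w_norm = weight.abs().sum(dim=(1, 2, 3))      # [n]
    d_norm = error_maps.abs().sum(dim=(0, 2, 3))  # sum over batch and spatial dims -> [n]
    scores = gamma1 * m * w_norm + gamma2 * d_norm
    num_pruned = int(weight.shape[0] * (1.0 - alpha))
    return torch.argsort(scores)[:num_pruned]     # channels with smallest S_j
```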
Computation Reduction. With error map pruning, the computation cost of both the error propagation and the weight gradients is effectively reduced. With pruning ratio $\alpha$, $1 - \alpha$ of the computation in the error propagation and gradient computation is skipped, which saves about $1 - \alpha$ of the computation in the backward pass of training. More specifically, without pruning, for one instance the computation cost of error propagation for a convolutional layer $l$ in floating-point operations (FLOPs) is $FLOPs(\delta^l_{in}) = W_{in} H_{in} c n k_w k_h$. When pruning the number of channels in $\delta^l$ from $n$ to $\alpha n$, the computation cost is reduced to $\alpha FLOPs(\delta^l_{in})$. For the computation of the weight gradients, before pruning the computation cost of $g^l_w$ is $FLOPs(g^l_w) = W H c n k_w k_h$. With pruning ratio $\alpha$, the cost is reduced to $\alpha FLOPs(g^l_w)$. In this way, $1 - \alpha$ of the computation cost is reduced in the backward pass of convolutional layers.
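As a worked example with assumed layer dimensions (a 3x3 convolution with $c = n = 64$ on 32x32 maps and pruning ratio $\alpha = 0.5$; these numbers are illustrative, not from the paper):

```latex
% Assumed dimensions: W_in = H_in = W = H = 32, c = n = 64, k_w = k_h = 3, alpha = 0.5
\begin{align*}
FLOPs(\delta^l_{in}) &= 32 \cdot 32 \cdot 64 \cdot 64 \cdot 3 \cdot 3 \approx 37.7\,\text{M} \\
FLOPs(g^l_w)         &= 32 \cdot 32 \cdot 64 \cdot 64 \cdot 3 \cdot 3 \approx 37.7\,\text{M} \\
\text{pruned cost}   &= \alpha \left( FLOPs(\delta^l_{in}) + FLOPs(g^l_w) \right) \approx 37.7\,\text{M}
\end{align*}
% i.e., 1 - alpha = 50% of this layer's backward-pass FLOPs are skipped.
```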
Overhead Analysis. The computation overhead of error map pruning is negligible. It is caused by the channel selection and the skipping of pruned channels. When using the $\ell_1$-norm strategy in Eq. (11) for the channel selection, the overhead is negligible because the sums over each kernel weight and each channel are relatively cheap compared with the expensive convolutional operations in the backward pass. For example, the channel selection of ResNet-110 consumes a marginal 0.53% of the FLOPs of the backward pass. For the overhead of skipping, since we employ structured pruning, skipping the pruned channels is simply skipping the computation involving the pruned channels, which has negligible overhead.

VI. EXPERIMENTS
We conduct extensive experiments to demonstrate the effectiveness of our approaches in terms of computation reduction, energy saving, accuracy, and convergence speed, and we provide detailed analysis. The evaluation is on four network architectures and four datasets. We first evaluate EIF and then evaluate the combined EIF+EMP approach. After that, we evaluate the practical energy savings on two edge devices.

A. Experimental Setup
Datasets and Networks.
We evaluate the proposed approaches on four datasets: CIFAR-10, CIFAR-100 [35], MNIST [36], and ImageNet [37]. We use networks with different capacities to show the scalability of the proposed approaches. The networks include large-scale networks for mobile devices and small networks for tiny sensor nodes. For large-scale networks, we employ two kinds of CNNs, including the residual network ResNet [8] and the plain network VGG [38]. ResNet-110, ResNet-74, and VGG-16 are evaluated on CIFAR-10/100. ResNet-18 and VGG-11 are evaluated on ImageNet. For small networks, we use LeNet on MNIST.
Architectures of Instance Filter.
We use different networks as the instance filter for different datasets. For CIFAR-10/100, we use ResNet-8. It has 7 convolutional layers and 1 fully-connected layer. The first layer is 3x3 convolutions with 16 filters. Then there is a stack of 3 residual blocks. Each block has 2 convolutional layers with kernel size 3x3. The numbers of filters in each block are { }, respectively. The network ends with a 10/100-way fully-connected layer. For ImageNet, we use ResNet-10. It has 9 convolutional layers and 1 fully-connected layer. The first layer is 7x7 convolutions with 64 filters. Additional downsampling is conducted with a stride of 4 to reduce the computation cost. Then there is a stack of 4 residual blocks. Each block has 2 convolutional layers with kernel size 3x3. The numbers of filters in each block are { }, respectively. The network ends with a 1000-way fully-connected layer. For MNIST, we use a slimmed LeNet with kernel size 3x3 and { } filters for the two convolutional layers, respectively.

The computation overhead of the EIF is negligible compared with the main networks. For CIFAR-10/100, the computation required for the inference of the EIF is 5.0% of ResNet-110 and 4.1% of VGG-16, respectively. The computation required for training the EIF is 5.9% of ResNet-110 and 4.8% of VGG-16, respectively. For ImageNet, the computation required for the inference and training of the EIF network is 3.4% and 3.9% of ResNet-18, and 0.81% and 1.05% of VGG-11, respectively. For MNIST, the computation required for the inference and training of the EIF is 9.5% and 7.8% of LeNet, respectively.

Training Details.
We train both the main network and the instance filter simultaneously from scratch. For ResNet-110, ResNet-74, and VGG-16, we employ the training settings in [8]. We use the SGD optimizer with momentum 0.9 and weight decay 0.0001 with batch size 128. The models are trained for 64k iterations. The initial learning rate is 0.1 and decayed by a factor of 10 at 32k and 48k iterations. For the instance filter, the learning rate is set to 0.1. For ResNet-18 and VGG-11, similar training settings are employed except that the batch size is 256, the models are trained for 450k iterations, and the learning rate is decayed by a factor of 10 at 150k and 300k iterations. For LeNet, the learning rate is 0.01 and the momentum is 0.5. The model is trained for 18.7k iterations with batch size 64. For the instance filter, the initial learning rate is 0.1 and decayed to 0.05 after 0.94k iterations.
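For reference, the CIFAR-10/100 optimizer settings above map onto the following PyTorch configuration; this is a sketch of the stated hyper-parameters and assumes the scheduler is stepped once per iteration rather than per epoch.

```python
import torch

def make_optimizer(model):
    """SGD configuration for ResNet-110/74 and VGG-16 on CIFAR-10/100."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # Decay the learning rate by 10x at 32k and 48k iterations (64k total)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[32000, 48000], gamma=0.1)
    return optimizer, scheduler
```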
Metrics.
We evaluate the proposed approaches with two highly related but different metrics: the reduction of computation cost and the practical energy saving. The computation cost is measured in FLOPs, which is independent of the specific devices and a commonly used metric of computation cost [39]. The evaluation of the computation cost is conducted on an NVIDIA P100 GPU with PyTorch 1.1 and measured by the THOP library [40], which will be presented in Sections VI-B to VI-E. The second metric, practical energy saving, depends on the devices and is measured on two edge devices (an NVIDIA Jetson TX2 mobile GPU and an MSP432 MCU), which will be presented in Section VI-F.
B. Evaluating Early Instance Filtering (EIF)
Fig. 6: Top-1 accuracy by early instance filter (EIF) and baselines with ResNet-110 on CIFAR-10.

To show that the proposed early instance filtering (EIF) can effectively reduce the computation cost while maintaining or even boosting the accuracy, we compare it with two state-of-the-art (SOTA) baselines and a standard training approach. Online hard example mining (OHEM) [21] selects hard examples for training by computing the loss values. Stochastic mini-batch dropping (SMD) [20] is a SOTA on-device training approach, which randomly skips every mini-batch with a probability. SMB is the standard mini-batch training method by stochastic gradient descent (SGD), where the computation cost is adjusted by reducing the number of training iterations.
Computation Reduction while Boosting Accuracy.
The proposed EIF substantially outperforms the baselines in terms of both accuracy and computation reduction. As shown in Fig. 6, when training ResNet-110 on CIFAR-10, with different remaining computation ratios, EIF consistently outperforms the baselines by a large margin. Compared with the full accuracy by SGD (i.e., SMB with remaining computation ratio 1.0), when using only 36.50% remaining computation, EIF boosts the accuracy by 0.16% (93.73% vs. 93.57%). With only 55.45% computation, EIF boosts the accuracy by 0.52% (94.09% vs. 93.57%). Compared with SMB and SMD, under different computation ratios, EIF achieves consistently higher accuracy in the ranges [0.84%, 2.32%] and [0.83%, 2.28%], respectively. The significant improvement is achieved because EIF selects instances by predicting the true loss value, instead of randomly dropping the instances. Compared with OHEM, EIF consistently achieves higher accuracy in the range [0.31%, 0.98%] under different computation ratios. The improved accuracy and reduced computation cost show that the proposed instance filter effectively selects important instances for training to save computation cost.

To further evaluate EIF, we conduct experiments on training ResNet-74, VGG-16, and LeNet. Consistent accuracy improvement over the SOTA baselines is observed, as shown in Fig. 7(a)(b)(c). For ResNet-74, with 63.74% computation reduction, EIF has only 0.02% accuracy loss, while OHEM has a larger accuracy loss of 5.56% with a smaller computation reduction of 60.59%. SMD and SMB have accuracy losses of 2.25% and 1.98% with a smaller computation reduction of 60%. Similar results are observed on VGG-16. For LeNet, EIF boosts the accuracy by 0.21% (99.45% over 99.24%) with a computation reduction of 77.19%.
C. Evaluating EIF + EMP
We evaluate the proposed framework EIF+EMP, consisting of EIF and EMP, and compare it with the SOTA baselines. Our approach effectively reduces the computation cost and achieves significantly better accuracy than the SOTA baselines. Fig. 8 shows the accuracy of ResNet-110 on CIFAR-10 when trained by EIF+EMP and the baselines under different remaining computation ratios. Compared with EIF or EMP only, EIF+EMP achieves more aggressive computation reduction while preserving and even boosting the accuracy. With EIF only, we achieve 63.50% computation reduction without accuracy loss. With EMP only, we achieve 35.56% computation reduction in the backward pass without accuracy loss and 62.22% computation reduction with a slight accuracy loss of 0.72%. With the combined EIF+EMP, at up to 67.84% computation reduction, we achieve no accuracy loss and boost the accuracy by up to 0.84% (94.41% vs. 93.57%).

We further evaluate EIF+EMP with more network architectures and datasets. Our approach substantially outperforms the baselines in terms of computation reduction and accuracy. We evaluate our approach with ResNet-110, ResNet-74, and VGG-16 on CIFAR-10 and LeNet on MNIST. For a fair comparison with E2Train [20], which employs quantization [41], we use the same quantization scheme. When comparing with other baselines, we do not use quantization. The experimental results are shown in Table I. When training ResNet-74, our approach achieves 63.91% computation saving without accuracy loss. With quantization, our approach achieves 95.41% computation saving with a marginal accuracy loss of 0.46%. E2Train achieves a smaller computation saving of 90.13% and a much higher accuracy loss of 2.10%.
Fig. 7: Top-1 accuracy by EIF and baselines (OHEM, SMD, SMB) with (a) ResNet-74 and (b) VGG-16 on CIFAR-10 and (c) LeNet on MNIST.
Fig. 8: Accuracy of ResNet-110 on CIFAR-10 by EIF+EMP and baselines under different remaining computation ratios (68% computation saving with no accuracy loss).

TABLE I: Top-1 accuracy by EIF+EMP with ResNet-110, ResNet-74, VGG-16 on CIFAR-10 and LeNet on MNIST.
Network    | Method           | Comp. Reduce | Accuracy
-----------|------------------|--------------|---------
ResNet-110 | SGD (original)   | -            | 93.57%
           | EIF+EMP          | 67.84%       | 93.66%
           | OHEM [21]        | 60.45%       | 85.11%
           | SD [19]          | 60.00%       | 91.96%
           | EIF+EMP+Q        | 95.71%       | 93.54%
           | E2Train(+Q) [20] | 90.13%       | 91.68%
ResNet-74  | SGD (original)   | -            | 93.46%
           | EIF+EMP          | 63.91%       | no loss
           | OHEM             | 60.59%       | 87.90%
           | SD               | 60.00%       | 90.99%
           | EIF+EMP+Q        | 95.41%       | 93.00%
           | E2Train(+Q)      | 90.13%       | 91.36%
VGG-16     | SGD (original)   | -            | 93.25%
           | EIF+EMP          | 67.33%       | 93.15%
           | OHEM             | 60.41%       | 71.81%
           | EIF+EMP+Q        | 95.54%       | 92.69%
           | E2Train(+Q)      | -            | -
LeNet      | SGD (original)   | -            | 99.23%
           | EIF+EMP          | 78.60%       | 99.43%
           | OHEM             | 65.24%       | 99.33%

Similar results are observed on ResNet-110, VGG-16, and LeNet. SD and E2Train rely on the residual connections in ResNet and cannot be applied to VGG-16 and LeNet. These results show the proposed framework EIF+EMP achieves superior computation saving and significantly higher accuracy than the baselines on different networks.
Experiments on CIFAR-100.
We further evaluate the proposed approaches on CIFAR-100 with ResNet-110 and VGG-16. EIF+EMP substantially outperforms the baselines in both computation reduction and accuracy. As shown in Table II, with ResNet-110, EIF+EMP achieves 56.24% computation reduction while preserving the full network accuracy, and 50.02% computation reduction while boosting the accuracy by 0.42% (72.02% vs. 71.60%).

TABLE II: Top-1 accuracy by EIF+EMP and baselines with ResNet-110 and VGG-16 on CIFAR-100.
Network    | Method         | Comp. Reduce | Accuracy
-----------|----------------|--------------|---------
ResNet-110 | SGD (original) | -            | 71.60%
           | EIF+EMP        | 50.02%       | 72.02%
           |                | 56.24%       | 71.63%
           | OHEM           | 47.01%       | 69.98%
           | SD             | 50.00%       | 70.44%
           | SMB            | 50.00%       | 67.28%
           | EIF+EMP+Q      | 92.92%       | 71.29%
           | E2Train(+Q)    | 90.13%       | 67.94%
VGG-16     | SGD (original) | -            | 71.56%
           | EIF+EMP        | 50.49%       | 71.59%
           |                | 53.86%       | 70.92%
           | OHEM           | 46.99%       | 65.17%
           | SMB            | 50.00%       | 68.76%
TABLE III: Top-1 and Top-5 accuracy by EIF+EMP and baselines with ResNet-18 and VGG-11 on ImageNet.
Network   | Method         | Comp. Reduce | Accuracy (top-1) | Accuracy (top-5)
----------|----------------|--------------|------------------|-----------------
ResNet-18 | SGD (original) | -            | 69.76%           | 89.08%
          | EIF+EMP        | 58.91%       | 70.27%           | 89.63%
          |                | 64.71%       | 68.98%           | 89.35%
          | OHEM           | 46.67%       | 62.09%           | 87.08%
          | SD             | 50.00%       | 65.36%           | 86.41%
          | SMB            | 50.00%       | 65.94%           | 87.50%
VGG-11    | SGD (original) | -            | 70.38%           | 89.81%
          | EIF+EMP        | 51.63%       | 70.36%           | 89.98%
          |                | 60.59%       | 70.01%           | 89.83%
          | OHEM           | 46.59%       | 56.39%           | 85.62%
          | SMB            | 50.00%       | 63.76%           | 86.49%
Experiments on ImageNet.
We evaluate the proposed approaches on the large-scale ImageNet dataset [37]. ImageNet consists of 1.2M training images in 1000 classes. The main networks are ResNet-18 and VGG-11.

The proposed EIF+EMP effectively reduces the computation cost in training while preserving the accuracy on the large-scale dataset, and it significantly outperforms the baselines. As shown in Table III, when training ResNet-18, with 58.91% computation reduction in training, EIF+EMP boosts the top-1 accuracy by 0.51% (70.27% vs. 69.76%) and the top-5 accuracy by 0.55%. With a more aggressive computation reduction of 64.71%, EIF+EMP still boosts the top-5 accuracy by 0.27% (89.35% vs. 89.08%). EIF+EMP consistently outperforms the SOTA baselines by a large margin. With larger computation reduction, EIF+EMP achieves higher top-1 accuracy in the range [4.33%, 8.18%] and higher top-5 accuracy in the range [2.13%, 3.22%], respectively. SD relies on the residual connections and cannot be applied to VGG-11. Similar results are observed on VGG-11, as shown in Table III.

D. Convergence Speed

Fig. 9: Convergence speed of ResNet-110 on CIFAR-10 during training with different approaches (SD, OHEM, EIF, EMP, EIF+EMP).

The proposed approaches improve the convergence speed in the training process. The test error (i.e., 100% minus accuracy on the test dataset) over the computation cost during training is shown in Fig. 9. The proposed EIF, EMP, and combined EIF+EMP approaches converge faster than the baselines, represented as lower test error (higher accuracy) at the same computation cost. More specifically, EIF+EMP achieves 3.1x faster convergence and 0.09% accuracy improvement compared with the standard mini-batch approach (SMB). The SOTA baselines OHEM and SD achieve lower convergence speed and larger accuracy losses of 8.46% and 1.61%, respectively.
E. Quantitative and Qualitative Analysis
Effectiveness of Adaptive Loss Threshold.
The proposed early instance filter effectively predicts a pre-defined percentage of input instances as high-loss, and the adaptive loss threshold effectively adjusts the loss threshold as the labeling strategy to train the filter. In Fig. 10(a), the pre-defined high-loss ratio is 40% for training ResNet-110 on CIFAR-10. The number of predicted high-loss instances, averaged every 390 iterations, is stabilized at about 51, which effectively selects 40% high-loss instances on average from the 128 instances in each mini-batch. As the average loss decreases, the adaptive loss threshold also decreases following a similar pattern to closely track the latest state of the main network.

We further compare the proposed adaptive loss threshold with a static loss threshold. With a static loss threshold of 1.0, the number of predicted high-loss instances per mini-batch and the average loss of the main model in the training process are shown in Fig. 10(b). The goal of training is to minimize the loss of the main model. However, the static loss threshold cannot effectively decrease the loss of the main model, as shown in the blue line, and results in low accuracy. This is because the static loss threshold cannot track the latest state of the main model. Therefore, it cannot effectively stabilize the number of predicted high-loss instances to train the main model. The static loss threshold only achieves 80.83% final accuracy of the main model. In contrast, the proposed adaptive loss threshold can effectively minimize the loss of the main model and achieves a high accuracy of 94.24%.

Fig. 10: The adaptive loss threshold (left) tracks the state of the main model in the training process and stabilizes the number of preserved instances with predicted high loss by EIF. The static loss threshold (right) cannot generate correct loss labels to train the EIF, which results in an incorrect number of predicted high-loss instances and a high average loss of the main model.

Fig. 11: Incorrect loss prediction ratio of EIF with and without the weighted loss.
Effectiveness of Weighted Loss for Training EIF.
The weighted loss in Eq. (7) effectively trains the EIF network to make accurate loss predictions, which eventually results in higher accuracy of the main model. As shown in Fig. 11, when the weighted loss is employed, the wrong loss prediction ratio of the EIF is much lower than that without the weighted loss. The pre-defined high-loss ratio is 30%, and the corresponding low-loss ratio is 70%. This high-loss ratio makes the number of high-loss and low-loss instances unbalanced in the input stream.

When the weighted loss is used for training the EIF, the average wrong loss prediction ratio of the EIF is reduced from 20.31% to 8.59%. This accurate loss prediction effectively selects high-loss instances to train the main model and results in significantly higher accuracy of the main model: 94.05% with the weighted loss vs. 90.58% without the weighted loss.

Fig. 12: Energy and computation overhead of EIF: (a) EIF energy overhead, (b) EIF computation overhead, (c) EIF computation overhead at every iteration. Energy overhead is measured on an NVIDIA Jetson TX2 mobile GPU.
Overhead of EIF.
The proposed early instance filter hasmarginal energy and computation overhead. The average en-ergy and computation overhead of the EIF network per trainingiteration (e.g. one mini-batch of 128 instances) when trainingResNet-110 on CIFAR-10 dataset is shown in Fig. 12. Asshown in the yellow bar in Fig. 12(a), the energy overheadof the EIF network (measured on NVIDIA Jetson TX2) is0.43J per iteration, which is 10.22% of the total energy cost4.18J when training with EIF. Without EIF, the energy cost is12.90J per iteration. As shown in Fig. 12(b), the computationoverhead of EIF is 3.88 GFLOPs, which is 11.65% of thetotal computation cost 33.21 GFLOPs when training withEIF. Without EIF, the computation cost is 99.91 GFLOPs periteration. The detailed EIF computation overhead across alltraining iterations are shown in Fig. 12(c). While the overheadof EIF is not zero, the proposed approach achieves 67.60%energy saving and 66.76% computation saving while fullypreserving the accuracy. D r o pp e d P r e s e r v e d CIFAR-10 MNIST
Fig. 13: Preserved and dropped instances by EIF when training ResNet-110 on CIFAR-10 and LeNet on MNIST.
Preserved and Dropped Instances by EIF.
To better understand the instances selected by the early instance filter, we cluster the instances that the filter preserves and drops when training ResNet-110 on CIFAR-10 and LeNet on MNIST, as shown in Fig. 13. We find that the dropped instances show full objects with typical characteristics. The preserved instances either show only part of the object or show non-typical characteristics that are hard even for humans to recognize. This result shows that the early instance filter can effectively find important instances to train the network.
Fig. 14: Visualization of the pruned and preserved channels in the error map and the corresponding convolutional kernel weights.
Analysis of Error Map Pruning.
To better understand the channels pruned and preserved in the backward pass by error map pruning, we visualize them to analyze the effectiveness of the proposed channel selection approach. The preserved and pruned channels in the error map and the corresponding kernel weights in the conv2 layer of VGG-16 are shown in Fig. 14. The 16 channels with the highest/lowest proposed importance scores are shown on the top left and bottom left, respectively, and their corresponding convolutional kernel weights are shown on the right. The pruned channels are darker, with smaller values than the preserved channels, which are brighter with larger values. Similarly, the kernel weights corresponding to the pruned channels have smaller values than those of the preserved ones. Therefore, the pruned channels have the least influence on both the error propagation and the computation of weight gradients. This result shows that the proposed error map pruning approach effectively selects channels to prune so as to minimize the influence on training.
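A minimal sketch of the pruning step is given below; the per-channel L2-norm score is a stand-in for the paper's importance score, which is defined earlier in the text.

import torch

def prune_error_map(error_map, keep_ratio=0.5):
    # error_map: gradient w.r.t. a conv layer's output, shape [N, C, H, W].
    # Score each channel (here: squared L2 norm over batch and spatial
    # dims, an assumed proxy for the paper's importance score) and zero
    # out the lowest-scoring channels so their error propagation and
    # weight-gradient computations can be skipped in the backward pass.
    n, c, h, w = error_map.shape
    scores = error_map.pow(2).sum(dim=(0, 2, 3))
    keep = torch.topk(scores, max(1, int(keep_ratio * c))).indices
    mask = torch.zeros(c, device=error_map.device, dtype=error_map.dtype)
    mask[keep] = 1.0
    return error_map * mask.view(1, c, 1, 1)

Zeroing whole channels (rather than individual elements) is what makes the saving structured: the convolutions that would consume those channels in back-propagation can be dropped entirely.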
F. Practical Energy Saving on Hardware Platforms
The energy cost of training consists of both the computation cost and the memory access cost. While the former dominates the energy cost and is represented by the commonly used metric FLOPs [39], the energy saving ratio can differ slightly from the computation reduction ratio. To evaluate the practical energy saving, we conduct extensive experiments on two edge platforms and evaluate the proposed approaches in terms of practical energy saving and accuracy.
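For reference, FLOP counts of this kind can be reproduced with an op counter such as THOP [40]; a minimal usage sketch is shown below (the torchvision model and CIFAR-sized input are illustrative choices, not the paper's exact measurement script).

import torch
from thop import profile  # pip install thop
from torchvision.models import resnet18

model = resnet18()
dummy_input = torch.randn(1, 3, 32, 32)  # one CIFAR-sized image
macs, params = profile(model, inputs=(dummy_input,))
print(f"{macs / 1e9:.2f} GMACs, {params / 1e6:.2f} M parameters")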
Fig. 15: Energy measurement setup for training on two edge platforms: a mobile GPU for mobile-level devices (top, measured with an energy meter) and an MCU for sensor node-level devices (bottom, fed the dataset from a computer via UART and measured with a power analyzer).
Hardware Setup.
We apply the proposed training approach on two edge platforms to evaluate realistic energy saving. For mobile-level devices, we train ResNet-110, ResNet-74, and VGG-16 on an NVIDIA Jetson TX2 mobile GPU [42] with the CIFAR-10 and CIFAR-100 datasets using PyTorch 1.1. We use an energy meter to measure the energy cost, as shown at the top of Fig. 15.
For sensor node-level devices, we train LeNet on the MSP432 MCU [43]. We implement the training process on the MCU in C. Since the MCU cannot store the entire dataset, we use a computer to feed the training data to the MCU via UART during training. We use a Keysight N6705C power analyzer to measure the energy cost on the MCU, as shown at the bottom of Fig. 15.
Fig. 16: Energy saving when training ResNet-110 and VGG-16 on an NVIDIA Jetson TX2 [42] mobile GPU with the CIFAR-10 dataset. Accuracy: (a) ResNet-110: SGD 93.57%, EIF+EMP 93.66%, OHEM 85.11%, SD 91.96%; (b) VGG-16: SGD 93.25%, EIF+EMP 93.15%, OHEM 71.81%. EIF+EMP prolongs battery life by 2.5x to 3.1x without any or with marginal accuracy loss.
Energy Saving of Training on Mobile GPU.
We evaluate the energy saving of EIF+EMP on mobile-level devices. We repeat all the experiments in Tables I and II on the mobile GPU to measure the practical energy saving, except for LeNet, which is evaluated on the MCU. Our approach effectively reduces the energy cost of on-device training. Compared with the original SGD, the proposed EIF+EMP achieves energy savings of 67.60%, 63.57%, and 60.02% when training ResNet-110, ResNet-74, and VGG-16 on CIFAR-10, respectively, as shown in Fig. 16 (the result of ResNet-74 is omitted for conciseness). These energy savings prolong battery life by 3.1x, 2.7x, and 2.5x while improving accuracy or incurring a slight 0.1% accuracy loss. Compared with the SOTA baselines OHEM and SD, our approach achieves significantly higher accuracy at a similar energy saving. SD relies on residual connections and cannot be applied to VGG-16. Besides, the practical energy saving ratios are very close to the computation reduction ratios measured in FLOPs, which shows that the computation reduction in FLOPs generalizes well to energy saving on hardware platforms. Similar results are observed on CIFAR-100, where we achieve 54.22% and 46.64% energy saving (2.2x and 1.9x battery life) for ResNet-110 and VGG-16, respectively, without any accuracy loss.
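The battery-life factors follow directly from the energy saving ratio $s$, assuming battery drain scales linearly with training energy:
\[
\text{battery-life factor} = \frac{1}{1-s}, \qquad \frac{1}{1-0.6760} \approx 3.1\times, \quad \frac{1}{1-0.6357} \approx 2.7\times, \quad \frac{1}{1-0.6002} \approx 2.5\times.
\]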
Fig. 17: Energy saving when training LeNet on the MSP432 [43] MCU. Accuracy: SGD 96.98%, EIF+EMP 97.31%, OHEM 96.66%. EIF+EMP prolongs battery life by 3.9x.
Energy Saving of Training on MCU.
We evaluate the energy saving of EIF+EMP on sensor node-level devices (i.e., MCUs). We train LeNet on the MSP432 MCU for one epoch of 60,000 instances and measure the energy cost and accuracy. Due to the limited runtime memory, we set the batch size to 1. Since the original SGD approach takes too long (about 50 days) to complete on the MCU, we run 10% of the training iterations of one epoch on the MCU and estimate the total energy cost by multiplying the measured energy cost by 10. The accuracy of the original SGD is measured on a P100 GPU after finishing one epoch. OHEM cannot be applied to MCUs because it needs batch-wise loss values for instance selection. To compare with OHEM, we measure its energy cost on the MCU by completing its computation while ignoring the accuracy; the accuracy of OHEM is evaluated on the P100 GPU. EIF+EMP significantly reduces the energy cost of training on MCUs and effectively prolongs battery life. As shown in Fig. 17, when training LeNet on the MSP432 MCU, EIF+EMP reduces the energy cost by 74.09% while improving accuracy by 0.33%, prolonging battery life by 3.9x. OHEM, while not fully feasible on the MCU, achieves a much lower energy saving of 59.78% with an accuracy loss of 0.32%. This result shows that EIF+EMP greatly improves the battery life of tiny sensor nodes and outperforms the baselines.
VII. CONCLUSION
This work aims to enable on-device training of convolutional neural networks by reducing the computation cost at training time. We propose two complementary approaches to reduce the computation cost: early instance filtering (EIF), which selects important instances to train the network and drops trivial ones, and error map pruning (EMP), which prunes insignificant channels of the error maps in back-propagation. Experimental results show superior computation reduction with higher accuracy compared with state-of-the-art techniques.
REFERENCES

[1] M. Song, K. Zhong, J. Zhang, Y. Hu, D. Liu, W. Zhang, J. Wang, and T. Li, "In-situ AI: Towards autonomous and incremental deep learning for IoT systems," in IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 92–103.
[2] N. Zhu, X. Liu, Z. Liu, K. Hu, Y. Wang, J. Tan, M. Huang, Q. Zhu, X. Ji, Y. Jiang et al., "Deep learning for smart agriculture: Concepts, tools, applications, and opportunities," International Journal of Agricultural and Biological Engineering, vol. 11, no. 4, pp. 32–44, 2018.
[3] S. Bhattacharya and N. D. Lane, "Sparsification and separation of deep learning layers for constrained resource inference on wearables," in Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM, 2016, pp. 176–189.
[4] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125–141, 2008.
[5] O. Rudovic, J. Lee, M. Dai, B. Schuller, and R. W. Picard, "Personalized machine learning for robot perception of affect and engagement in autism therapy," Science Robotics, vol. 3, no. 19, p. eaao6760, 2018.
[6] B. McMahan and D. Ramage, "Federated learning: Collaborative machine learning without centralized training data," 2017. [Online]. Available: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
[7] T. Akiba, S. Suzuki, and K. Fukuda, "Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes," 2017.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[9] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[10] N. K. Jayakodi, A. Chatterjee, W. Choi, J. R. Doppa, and P. P. Pande, "Trading-off accuracy and energy of deep inference on embedded systems: A co-design approach," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, 2018.
[11] P. Panda, A. Sengupta, and K. Roy, "Conditional deep learning for energy-efficient and enhanced pattern recognition," in Design, Automation & Test in Europe Conference (DATE), 2016.
[12] Y. Wu, Z. Wang, Z. Jia, Y. Shi, and J. Hu, "Intermittent inference with nonuniformly compressed multi-exit neural network for energy harvesting powered devices," arXiv preprint arXiv:2004.11293, 2020.
[13] L. Bottou, "Stochastic gradient descent tricks," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[14] S. Lee and S. Nirjon, "Neuro.ZERO: A zero-energy neural network accelerator for embedded sensing and inference systems," in Proceedings of the 17th Conference on Embedded Networked Sensor Systems, 2019.
[15] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in European Conference on Computer Vision. Springer, 2016, pp. 69–84.
[16] C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
[17] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., "Communication-efficient learning of deep networks from decentralized data," arXiv preprint arXiv:1602.05629, 2016.
[18] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage, "Federated learning for mobile keyboard prediction," arXiv preprint arXiv:1811.03604, 2018.
[19] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," in European Conference on Computer Vision. Springer, 2016, pp. 646–661.
[20] Y. Wang, Z. Jiang, X. Chen, P. Xu, Y. Zhao, Y. Lin, and Z. Wang, "E2-Train: Training state-of-the-art CNNs with over 80% energy savings," in Advances in Neural Information Processing Systems, 2019.
[21] A. Shrivastava, A. Gupta, and R. Girshick, "Training region-based object detectors with online hard example mining," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[22] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu et al., "Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes," 2018.
[23] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
[24] H. Zhou, J. M. Alvarez, and F. Porikli, "Less is more: Towards compact CNNs," in European Conference on Computer Vision. Springer, 2016.
[25] J. M. Alvarez and M. Salzmann, "Compression-aware training of deep networks," in Advances in Neural Information Processing Systems, 2017.
[26] S. Lym, E. Choukse, S. Zangeneh, W. Wen, M. Erez, and S. Shanghavi, "PruneTrain: Gradual structured pruning from scratch for faster neural network training," arXiv preprint arXiv:1901.09290, 2019.
[27] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," arXiv preprint arXiv:1608.08710, 2016.
[28] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5058–5066.
[29] W. Jiang, X. Zhang, E. H.-M. Sha, L. Yang, Q. Zhuge, Y. Shi, and J. Hu, "Accuracy vs. efficiency: Achieving both through FPGA-implementation aware neural architecture search," in Proceedings of the 56th Annual Design Automation Conference 2019, 2019, pp. 1–6.
[30] L. Yang, W. Jiang, W. Liu, E. H.-M. Sha, Y. Shi, and J. Hu, "Co-exploring neural architecture and network-on-chip design for real-time artificial intelligence," in Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2020, pp. 85–90.
[31] L. Yang, Z. Yan, M. Li, H. Kwon, L. Lai, T. Krishna, V. Chandra, W. Jiang, and Y. Shi, "Co-exploration of neural architectures and heterogeneous ASIC accelerator designs targeting multiple tasks," arXiv preprint arXiv:2002.04116, 2020.
[32] W. Jiang, L. Yang, E. H.-M. Sha, Q. Zhuge, S. Gu, S. Dasgupta, Y. Shi, and J. Hu, "Hardware/software co-exploration of neural architectures," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020.
[33] Q. Lu, W. Jiang, X. Xu, Y. Shi, and J. Hu, "On neural architecture search for resource-constrained hardware platforms," arXiv preprint arXiv:1911.00105, 2019.
[34] K. Konyushkova, R. Sznitman, and P. Fua, "Learning active learning from data," in Advances in Neural Information Processing Systems, 2017.
[35] A. Krizhevsky, "Learning multiple layers of features from tiny images," 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[36] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[37] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR09, 2009.
[38] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[39] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[40] "THOP: PyTorch-OpCounter, a tool to count the FLOPs of a PyTorch model," 2020. [Online]. Available: https://pypi.org/project/thop/
[41] R. Banner, I. Hubara, E. Hoffer, and D. Soudry, "Scalable methods for 8-bit training of neural networks," in Advances in Neural Information Processing Systems, 2018.