Enabling On-Device CNN Training by Self-Supervised Instance Filtering and Error Map Pruning
Yawen Wu†, Zhepeng Wang†, Yiyu Shi‡, and Jingtong Hu†
†Department of Electrical and Computer Engineering, University of Pittsburgh, USA
‡Department of Computer Science and Engineering, University of Notre Dame, USA
Email: [email protected], [email protected], [email protected], [email protected]
Abstract—This work aims to enable on-device training of convolutional neural networks (CNNs) by reducing the computation cost at training time. CNN models are usually trained on high-performance computers and only the trained models are deployed to edge devices. But the statically trained model cannot adapt dynamically in a real environment and may result in low accuracy for new inputs. On-device training by learning from the real-world data after deployment can greatly improve accuracy. However, the high computation cost makes training prohibitive for resource-constrained devices. To tackle this problem, we explore the computational redundancies in training and reduce the computation cost by two complementary approaches: self-supervised early instance filtering on the data level and error map pruning on the algorithm level. The early instance filter selects important instances from the input stream to train the network and drops trivial ones. The error map pruning further prunes out insignificant computations when training with the selected instances. Extensive experiments show that the computation cost is substantially reduced without any or with marginal accuracy loss. For example, when training ResNet-110 on CIFAR-10, we achieve 68% computation saving while preserving full accuracy and 75% computation saving with a marginal accuracy loss of 1.3%. Aggressive computation saving of 96% is achieved with less than 0.1% accuracy loss when quantization is integrated into the proposed approaches. Besides, when training LeNet on MNIST, we save 79% computation while boosting accuracy by 0.2%.
Index Terms—On-device training, data filter, gradient pruning
I. INTRODUCTION
The maturation of deep learning has enabled on-device intelligence for Internet of Things (IoT) devices. The convolutional neural network (CNN), as an effective deep learning model, has been intensively deployed on IoT devices to extract information from the sensed data, such as in smart cities [1], smart agriculture [2], and wearable devices [3]. The models are initially trained on high-performance computers (HPCs) and then deployed to IoT devices for inference. However, in the physical world, the statically trained model cannot adapt to the real world dynamically and may result in low accuracy for new input instances. On-device training has the potential to learn from the environment and update the model in-situ. This enables incremental/lifelong learning [4] to train an existing model to update its knowledge, and device personalization [5] by learning features from the specific user and improving model accuracy. Federated learning [6] is another application scenario of on-device training, where a large number of devices (typically mobile phones) collaboratively learn a shared model while keeping the training data on personal devices to protect privacy. Since each device still computes the full model update by the expensive training process, the computation cost of training needs to be greatly reduced to make federated learning realistic.

While the efficiency of training on HPCs can always be improved by allocating more computing resources, such as 1024 GPUs [7], training on resource-constrained IoT devices remains prohibitive. The main problem is the large gap between the high computation and energy demand of training and the limited computing resources and battery on IoT devices. For example, training ResNet-110 [8] on a 32x32 input image takes 780M FLOPs, which is prohibitive for IoT devices. Besides, since computation directly translates into energy consumption and IoT devices are usually battery-constrained [9], the high computation demand of training will quickly drain the battery. While existing works [10]–[12] effectively reduce the computation cost of inference by assigning input instances to different classifiers according to their difficulty, the computation cost of training is not reduced.

To address this challenge, this work aims to enable on-device training by significantly reducing the computation cost of training while preserving the desired accuracy. Meanwhile, the proposed techniques can also be adopted to improve training efficiency on HPCs. To achieve this goal, we investigate the computation cost of the entire training cycle, aiming to eliminate unnecessary computations while keeping full accuracy. We made the following two observations:
First, not all the input instances are important for improving the model accuracy. Some instances are similar to the ones that the model has already been trained with and can be completely dropped to save computation. Therefore, developing an approach to filter out unimportant instances can greatly reduce the computation cost.
Second, for the important instances, not all the computation in the training cycle is necessary. Eliminating insignificant computations will have a marginal influence on accuracy. In the backward pass of training, some channels in the error maps have small values. Pruning out these insignificant channels and the corresponding computation will have a marginal influence on the final accuracy while saving a large portion of computation.

Based on the two observations, we propose a novel framework consisting of two complementary approaches to reduce the computation cost of training while preserving the full accuracy. The first approach is an early instance filter to select important instances from the input stream to train the network and drop trivial ones. The second approach is error map pruning to prune out insignificant computations in the backward pass when training with the selected instances.

In summary, the main contributions of this paper include:

• A framework to enable on-device training.
We propose a framework consisting of two approaches to eliminate unnecessary computation in training CNNs while preserving full network accuracy. The first approach improves the training efficiency of both the forward and backward passes, and the second approach further reduces the computation cost in the backward pass.

• Self-supervised early instance filtering (EIF) on the data level.
We propose an instance filter to predict the loss of each instance and develop a self-supervised algorithm to train the filter. Instances with predicted low loss are dropped before starting the training cycle to save computation. To train the filter simultaneously with the main network, we propose a self-supervised training algorithm including an adaptive threshold based labeling strategy, an uncertainty sampling based instance selection algorithm, and a weighted loss for the biased high-loss ratio.

• Error map pruning (EMP) on the algorithm level.
We propose an algorithm to prune insignificant channels in error maps to reduce the computation cost in the backward pass. The channel selection strategy considers the importance of each channel on both the error propagation and the computation of the weight gradients to minimize the influence of pruning on the final accuracy.

We evaluate the proposed approaches on networks of different scales. ResNet and VGG are for on-device training on mobile devices, and LeNet is for tiny sensor node-level devices. Experimental results show the proposed approaches effectively reduce the computation cost of training without any or with marginal accuracy loss.

II. BACKGROUND AND RELATED WORK
A. Background of CNN Training
The training of CNNs is most commonly conducted with the mini-batch stochastic gradient descent (SGD) algorithm [13]. It updates the model weights iteration-by-iteration using a mini-batch (e.g., 128) of input instances. For each instance in the mini-batch, a forward pass and a backward pass are conducted. The forward pass attempts to predict the correct outputs using the current model weights. Then the backward pass back-propagates the loss through the layers, which generates the error maps for each layer. Using the error maps, the gradients of the loss w.r.t. the model weights are computed. Finally, the model weights are updated by using the weight gradients and an optimization algorithm such as SGD.

To provide labeled data for on-device training, labeling strategies from existing works can be used. For example, the labels can come from aggregating inference results from neighbor devices [14] (e.g., voting), employing spatial context information as the supervisory signals [15], [16], or be naturally inferred from user interaction [17], [18] such as next-word prediction in keyboard typing.
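To make the forward/backward procedure concrete, the following is a minimal PyTorch sketch of one mini-batch iteration (PyTorch is the framework used in the experiments); the model and optimizer are placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, labels):
    """One mini-batch iteration of SGD training."""
    optimizer.zero_grad()
    outputs = model(images)                  # forward pass
    loss = F.cross_entropy(outputs, labels)  # loss on current weights
    loss.backward()                          # backward pass: error maps and weight gradients
    optimizer.step()                         # weight update
    return loss.item()
```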
B. Related Work
Accelerated Training.
There are a number of works on accelerating network training. Stochastic depth [19] accelerates the training by randomly bypassing layers with the residual connection. E2Train [20] randomly drops mini-batches and selectively skips layers by using residual connections to save computation cost. Different from [20], which randomly drops mini-batches, we investigate the importance of each instance before keeping or dropping it. The input data from the real world is not ideally shuffled, and valuable instances for training can concentrate within one mini-batch. Simply dropping mini-batches can miss important instances for training the network. Besides, the layer skipping in these two works relies on the ResNet architecture [8] and cannot be naturally extended to general CNNs. In contrast, our approaches are applicable to general CNNs. OHEM [21] selects high-loss instances and drops low-loss ones to improve training efficiency. It computes the loss values of all instances in the forward pass and only keeps high-loss instances for the backward pass. The main drawback is that the computation in the forward pass of low-loss instances is wasted. Different from this, our approach predicts the loss of each instance and drops low-loss instances before starting the forward pass, which eliminates the computation cost of low-loss instances.
Distributed Training.
Another way to accelerate training is leveraging distributed training with abundant computing resources and large batch sizes. [7] employs an extremely large batch size of 32K with 1024 GPUs to train ResNet-50 in 20 minutes. [22] integrates a mixed-precision method into distributed training and pushes the time down to 6.6 minutes. However, these works target leveraging highly-parallel computing resources to reduce the training time and actually increase the total computation cost, which is infeasible for training on resource-constrained IoT devices.
Network Pruning during Training.
Some works aim to train and prune the network architecture simultaneously. [23] proposes a pruning approach to sparsify weights during training. The goal is to generate a compact network for inference instead of improving training efficiency. In fact, it requires more time on training by first training the backbone and then pruning it. Similarly, [24] prunes the sparse network during training to have a compact network for inference. However, these works only improve the inference efficiency, and the training computation cost is not reduced. [25], [26] aim to accelerate training by reconfiguring the network to a smaller one during training. The main drawback is that the network is pruned on the offline training dataset, and the ability of the pruned network for further on-device learning is compromised. Instead, we focus on reducing the computation cost of online training, and the entire network architecture is preserved to keep the full ability for learning in an uncertain future.
Network Compression and Neural Architecture Search.
There are extensive explorations on network compression and neural architecture search (NAS). [27], [28] prune the network to generate a compact network for efficient inference. [29]–[32] search neural architectures for hardware-friendly inference. [33] further considers quantization during NAS for efficient inference. However, these works only aim to design network architectures for efficient inference. The computation cost of training is not considered.

III. FRAMEWORK OVERVIEW
Fig. 1: Overview of early instance filtering (EIF) and error map pruning (EMP).

The overview of the proposed framework is shown in Fig. 1. On top of the main neural network, a small instance filter network is proposed to select important instances from the input stream to train the network and drop trivial ones. When the input instances arrive, the early instance filter predicts the loss value for each instance as if the instance were fed into the main network and makes a binary decision to drop or preserve this instance. If the predicted loss is high and the instance is preserved, the main network will be invoked to start the forward and backward pass for training. Since the loss prediction is for the main network, once the main network is updated, the instance filter also needs to be trained for accurate loss prediction. The training of the instance filter is self-supervised based on the labeling strategy by the adaptive loss threshold, instance selection by uncertainty sampling, and the weighted loss for the biased high-loss ratio, which will be introduced in Section IV. Once important instances are selected, the error map pruning further reduces the computation cost of the backward pass. It prunes out channels in the error maps that have small contributions to the error propagation and gradient computation, which will be introduced in Section V.
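The decision loop described above can be summarized in a short sketch. This is an illustrative schematic under assumed interfaces; `predict_high_loss`, `train_main_net`, and `update_filter` are hypothetical placeholder callables, not the authors' API.

```python
def training_loop(stream, predict_high_loss, train_main_net, update_filter):
    """Schematic of the EIF-gated on-device training loop."""
    for images, labels in stream:
        # EIF: cheap binary loss prediction (high / low) with a confidence value
        is_high_loss, confidence = predict_high_loss(images)
        if not is_high_loss:
            continue                        # drop trivial instance: no forward/backward pass
        # Preserved instance: full training cycle, with EMP in the backward pass
        loss = train_main_net(images, labels)
        # Self-supervised update of the filter based on the observed loss
        update_filter(images, loss)
```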
IV. SELF-SUPERVISED EARLY INSTANCE FILTERING
The early instance filter (EIF) is used to select important instances for training the main network and drop trivial instances to reduce the computation cost of training. Since the main network is constantly being updated during training, it is essential to tune the EIF every time the main network is updated. In this way, the EIF can accurately select important instances based on the latest state of the main network. In this section, we will first introduce the working flow of EIF to select instances for training the main model. Then we discuss the challenges for updating the EIF. After that, we present three approaches to address these challenges such that the EIF can be effectively updated.

Fig. 2: Self-supervised training of the early instance filter (EIF) by adaptive loss threshold, uncertainty sampling, and weighted loss.

To select important instances and drop trivial ones on-the-fly during training, the EIF predicts the loss value of each instance from the input stream without actually feeding the instance into the main network. Trivial instances with predicted low loss are dropped before the forward pass, which eliminates the computation of the forward pass and the more computationally intensive backward pass of the main network. Important instances with predicted high loss are preserved to complete the forward pass, calculate the loss, and finish the backward pass to compute the weight gradients to update the main network. Kindly note that the instances are not pre-selected before the training starts. Instead, they are selected on-the-fly during training based on what the main network has and has not learned at the current state.

Fig. 2 shows the working flow of EIF. The user first needs to pre-define a high-loss ratio $R_{set}$ (e.g., 10%) such that this fraction of instances in the whole input stream will be predicted as high-loss and the others will be predicted as low-loss. Only instances predicted as high-loss will be used for training the main network. When instances arrive sequentially, the early instance filter predicts the loss value of each instance $i$ as binary high or low, $y_{pred,i} \in \{H, L\}$, for the main network such that the pre-defined high-loss ratio is satisfied. The filter also produces the confidence of each loss prediction, represented by the entropy of the loss prediction. Since the loss prediction by the EIF network is for the main network and the main network is constantly being updated, it is essential to re-train the EIF network every time the main network is updated to realize accurate loss prediction. However, there are several challenges in realizing automatic self-supervised training for the EIF network. In this section, we will first present three major challenges. Then, we will present three techniques to address these challenges: adaptive loss threshold, uncertainty sampling, and weighted loss, as shown in Fig. 2.

Challenges:
During on-device training, instances with predicted low loss are dropped before being fed to the main network to compute the actual loss, and their true loss values are unknown. Thus, we can only know the true loss values of instances with predicted high loss, which brings the following challenges.

The first challenge is how to label instances as high-loss or low-loss for training the EIF according to the pre-defined high-loss ratio. For example, if we could know the loss values of all instances, defining a loss threshold that separates the 10% of instances with the highest loss values would simply be a matter of sorting all the loss values and finding the value for separation. Since the loss values of dropped instances are unknown, defining a loss threshold remains a challenge.

The second challenge is that the EIF network can choose which instances will be used to train itself, which is not possible in normal CNN training. As long as the EIF network is not 100% accurate, it will make wrong predictions. To avoid punishment, instead of adjusting its own weights to make accurate loss predictions, the filter will learn a shortcut by predicting all the new input instances as low loss and dropping them. Since the dropped instances will not be fed to the main network, the EIF network will never know the ground truth of the losses and thus will not be punished for doing so. In this way, the EIF will think it makes perfect predictions. Dropping all the new instances prevents further training of the filter and the main network.

The third challenge is how to correctly train the filter when the number of high-loss and low-loss instances is extremely unbalanced in the input stream. This is different from normal training datasets such as CIFAR-10 and ImageNet, in which the number of instances in each class is balanced. The unbalanced number of high-loss and low-loss instances makes the EIF network training ineffective. For example, when the pre-defined high-loss ratio is relatively low (e.g., 10%), simply predicting all the instances as low-loss will produce a high accuracy of 90% on the filter, which it believes is a good result. However, this prediction is useless since it does not find any important instance to train the main network.

We will present three techniques to address these challenges.
A. Adaptive Loss Threshold Based Labeling Strategy
The adaptive loss threshold is used to provide the ground truth (labels) for training the EIF. With the adaptive loss threshold, we can label the loss of instances as high-loss or low-loss to train the EIF. During the training of the EIF and the main network, $R_{set}$ percent of instances will be predicted as high-loss by the EIF. The true loss values of instances predicted as high-loss can be obtained on the main network. However, we do not know the true loss values of instances predicted as low-loss since they are dropped before being fed into the main network. With only partial loss values, defining an exact loss threshold (e.g., sorting all loss values and finding the threshold) is challenging. Therefore, we aim to approximate the threshold. To achieve this, we first define true high (TH) instances as the instances with predicted high loss by the filter and labeled as high-loss by the loss threshold. We monitor the number of TH instances in the last $n$ mini-batches. Then we calculate the percentage $R_{TH}$ as the number of TH instances among the preserved ones over all the instances in the last $n$ mini-batches. By comparing the percentage $R_{TH}$ with the pre-defined percentage $R_{set}$, the loss threshold is adjusted to draw $R_{TH}$ to the pre-defined percentage $R_{set}$.

Formally, with adaptive loss threshold $T_l$, instances are labeled as high-loss or low-loss as follows:
$$y_i = \begin{cases} H & \text{if } loss_i \geq T_l \\ L & \text{otherwise} \end{cases} \tag{1}$$
where $loss_i$ is the loss value of instance $i$ computed by the main network and $T_l$ is the adaptive loss threshold.

The true high (TH) loss instance ratio $R_{TH}$ by the filter is defined as follows:
$$R_{TH} = \frac{1}{mn} \sum_{i=1}^{mn} I(y_{pred,i} = H)\, I(y_i = H) \tag{2}$$
$I(x)$ is an indicator function which equals 1 if $x$ is true and 0 otherwise. $y_{pred,i}$ is the binary prediction by the filter for instance $i$, and $y_i$ is the loss label by Eq. (1). $m$ is the batch size, and $n$ is the number of mini-batches to monitor for one update of the loss threshold.

Based on the computed $R_{TH}$ and pre-defined $R_{set}$, the loss threshold $T_l$ is adjusted to draw $R_{TH}$ to $R_{set}$. When $R_{TH}$ is larger than $R_{set}$, too many instances are labeled and predicted as high loss, which indicates $T_l$ is too small. Therefore, $T_l$ will be incremented by multiplying with a factor larger than 1. Similarly, when $R_{TH}$ is smaller than $R_{set}$, $T_l$ is too large and will be attenuated. The loss threshold $T_l$ is adjusted as:
$$T_l = \begin{cases} \alpha_1 T_l & \text{if } R_{TH} \geq R_{set} \\ \alpha_2 T_l & \text{otherwise} \end{cases} \tag{3}$$
$\alpha_1$ and $\alpha_2$ are two hyper-parameters, where $\alpha_1$ is larger than 1 and $\alpha_2$ is smaller than 1, to define the step size.

The computed $R_{TH}$ is essential to the self-supervised training of the EIF. More specifically, $R_{TH}$ controls the loss threshold $T_l$ by Eq. (3), which further controls the instance labels $y_i$ by Eq. (1) for training the instance filter. With the labels $y_i$, the filter will be trained accordingly to predict high-loss instances. The number of instances with predicted high loss by the filter and labeled as high-loss will be used to compute the new $R_{TH}$ by Eq. (2), which further adjusts $T_l$. This process continues for each mini-batch, which forms the self-supervised training of the instance filter. Leveraging the self-supervision, the loss threshold $T_l$ will be properly adjusted and the instance filter will be well-trained to track the latest state of the main network. In this way, the true high-loss ratio $R_{TH}$, affected by both the filter and the loss threshold, will be kept at the set ratio $R_{set}$. The filter will effectively select $R_{set}$ percent important instances for training the main network.
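As a concrete illustration of Eqs. (1)-(3), the threshold adaptation can be written in a few lines. This is a hedged sketch, not the authors' code; the initial threshold and step factors ($\alpha_1 = 1.1$, $\alpha_2 = 0.9$) are illustrative assumptions.

```python
class AdaptiveLossThreshold:
    """Adjusts T_l so that the true-high ratio R_TH tracks R_set (Eqs. 1-3)."""
    def __init__(self, r_set, t_init=1.0, alpha1=1.1, alpha2=0.9):
        self.r_set = r_set            # pre-defined high-loss ratio R_set
        self.t_l = t_init             # adaptive loss threshold T_l (assumed initial value)
        self.alpha1, self.alpha2 = alpha1, alpha2

    def label(self, loss_value):
        """Eq. (1): label an observed loss as high ('H') or low ('L')."""
        return 'H' if loss_value >= self.t_l else 'L'

    def update(self, num_true_high, num_instances):
        """Eqs. (2)-(3): after n monitored mini-batches, adjust T_l."""
        r_th = num_true_high / num_instances     # Eq. (2)
        self.t_l *= self.alpha1 if r_th >= self.r_set else self.alpha2
```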
B. Instance Selection by Uncertainty Sampling
The main reason for the second challenge is that if an instance is dropped, it will never be fed to the main network, so the EIF network will never know the ground truth of its loss. In this way, the labels (i.e., high-loss or low-loss) of the dropped instances for training the EIF will be unknown, and the EIF cannot be correctly trained. To address this problem, we keep some instances with predicted low loss, which would otherwise be dropped, to augment the preserved instances for training the filter. In this way, wrong loss predictions of the dropped instances will also punish the filter, which forces it to actually learn to find important instances. To decide which instances to keep while minimizing the number of selected instances, we employ uncertainty sampling [34]. The dropped instances that the filter is least confident about will be fed into the main network to compute the loss value. To measure the confidence of a loss prediction by the filter, we use the entropy defined as:
$$entropy(i) = - \sum_{c \in \{H, L\}} p_{i,c} \log p_{i,c}, \quad p_{i,c} = prob(y_{pred,i} = c) \tag{4}$$
$p_{i,c}$ is the computed probability by the filter of being high-loss ($c = H$) or low-loss ($c = L$) for instance $i$. The smaller the entropy, the more confident the filter is about the prediction. Based on the entropy, we select from the dropped instances those whose entropy is above the entropy threshold to augment the preserved instances for training the filter. The set of selected instances is defined as:
$$I = \{ i \mid i \in \{TL, FL\},\ entropy(i) > entropy_T \} \tag{5}$$
where $entropy_T$ is the entropy threshold.
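A minimal sketch of this selection rule follows, assuming the filter outputs a two-way softmax over {H, L}; the entropy threshold value is an illustrative assumption, not a number from the paper.

```python
import torch

def select_uncertain(probs_low_loss, entropy_threshold=0.5):
    """Eqs. (4)-(5): among instances predicted low-loss, keep the ones the
    filter is least confident about (high prediction entropy).

    probs_low_loss: tensor [N, 2] of filter softmax outputs over {H, L}
    for instances predicted as low-loss.
    Returns indices of instances to feed to the main network anyway.
    """
    eps = 1e-12  # numerical guard for log(0)
    entropy = -(probs_low_loss * (probs_low_loss + eps).log()).sum(dim=1)
    return torch.nonzero(entropy > entropy_threshold).flatten()
```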
C. Weighted Loss for Biased High-Loss Ratio

To address the third challenge, we propose to use a weighted loss function to make the EIF training process fair in treating the high-loss instances when their ratio is low. In this way, the EIF can be trained to make accurate loss predictions and select important instances for training the main network. Traditionally, for datasets with balanced classes, we use the average loss of each instance as the loss function of a mini-batch for training. In our case, based on the binary loss label $y_i$ in Eq. (1) and the binary loss prediction $y_{pred,i}$ by the filter, the loss function for instance $i$ is defined by cross-entropy as:
$$L_i = - \sum_{c \in \{H, L\}} I(y_i = c) \log p_{i,c} \tag{6}$$
where $p_{i,c}$ defined in Eq. (4) is the computed probability of being a high or low loss for instance $i$ by the filter. $L_i$ measures how well the loss prediction approximates the true loss label and will be minimized during training. The average loss would be the average loss value of each preserved instance in a mini-batch. However, when the pre-defined high-loss ratio is not 50% and makes the number of high-loss and low-loss instances unbalanced, directly using the average loss will result in ineffective training of the EIF.

To understand the inefficiency of training with the average loss, we define the weighted loss for the preserved instances in a mini-batch to train the filter as:
$$L = \sum_{i \in TH} w_H L_i + \sum_{j \in FH} w_L L_j + \sum_{p \in TL} w_L L_p + \sum_{q \in FL} w_H L_q \tag{7}$$
where $TH$, $FH$, $TL$ and $FL$ represent true high, false high, true low and false low loss instances, respectively. $TH$ and $FH$ are instances with predicted high loss and labeled as $H$ and $L$ by Eq. (1), respectively. $TL$ and $FL$ are instances with predicted low loss and selected by uncertainty sampling in Eq. (5), which have loss labels $L$ and $H$, respectively. The weights $w_H$ and $w_L$ represent how important the true high loss (instances with loss label $H$, including $TH$ and $FL$) and true low loss (instances with loss label $L$, including $TL$ and $FH$) instances are, respectively. $w_H$ and $w_L$ are normalized such that the weights of all instances in Eq. (7) sum up to 1.

When the pre-defined high-loss ratio $R_{set}$ is not 50%, the numbers of high-loss and low-loss instances will not be equal in the input stream. This makes training the EIF with the average loss ineffective. For example, when $R_{set}$ is set to 10%, only 10% of the instances streamed in will be labeled as high-loss by the adaptive loss threshold. In this way, 90% of the elements in Eq. (7) will be low-loss instances and dominate the loss. If we were using the average loss, all the weights would be the same. To minimize the loss when training the filter, simply predicting all instances as low-loss will produce small loss values on the dominating second and third elements in Eq. (7), and hence on the total loss, which prevents effective training of the filter.

To address this problem, we make the weights biased by setting $w_H = 1/R_{set}$ and $w_L = 1/(1 - R_{set})$. In this way, we have $w_H \times percent(H = TH + FL) = w_L \times percent(L = TL + FH)$. The first and fourth sums in Eq. (7) correspond to the high-loss ($H$) instances. The second and third sums in Eq. (7) correspond to the low-loss ($L$) instances. By setting the weights in this way, the high-loss and low-loss instances will contribute equally to the total loss and will be treated fairly in training. In the above example, while the first and fourth sums only contribute 10% of the number of elements, the higher weight $w_H = 1/0.1 = 10$ makes them equally important as the second and third sums, which have the lower weight $w_L = 1/0.9 \approx 1.1$. Therefore, the instance filter can be correctly trained with the unbalanced number of high-loss and low-loss instances and accurately predict high-loss ones.

With the predicted high-loss instances by the filter, the selected instances by uncertainty sampling, and the weighted loss function for training, the filter is effectively trained to predict high-loss instances for training the main network.
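The weighted loss of Eq. (7) reduces to weighting a per-instance cross-entropy. Below is a hedged sketch assuming binary labels (1 = high-loss, 0 = low-loss); the exact normalization in the authors' implementation may differ.

```python
import torch
import torch.nn.functional as F

def weighted_filter_loss(logits, labels, r_set):
    """Eq. (7): weight per-instance cross-entropy by 1/R_set for
    high-loss labels and 1/(1 - R_set) for low-loss labels.

    logits: [N, 2] filter outputs over {low, high}; labels: [N] in {0, 1}.
    """
    per_instance = F.cross_entropy(logits, labels, reduction='none')  # Eq. (6)
    w_h, w_l = 1.0 / r_set, 1.0 / (1.0 - r_set)
    weights = torch.where(labels == 1,
                          torch.full_like(per_instance, w_h),
                          torch.full_like(per_instance, w_l))
    weights = weights / weights.sum()   # normalize so all weights sum to 1
    return (weights * per_instance).sum()
```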
V. ERROR MAP PRUNING IN BACKWARD PASS
Fig. 3: Error maps of convolutional layers in back-propagation.

When training with the selected instances, the computation in the backward pass can be further reduced by error map pruning (EMP). Since the backward pass takes about 2/3 of the computation cost of training, reducing its computation can effectively reduce the total cost. As shown in Fig. 3, in the backward pass of training, the back-propagation propagates the errors layer-by-layer from the last layer to the first layer. We focus on pruning convolutional layers because they dominate the computation cost in the backward pass. Within one convolutional layer, the input error map is generated from the output error map of the same layer. The output error map consists of many channels. We aim to prune the insignificant channels to reduce the computation cost of training.

Given a pruning ratio, we need to keep the most representative channels in the error map to maintain as much information as possible such that the training accuracy is retained. The main idea of the proposed channel selection strategy is to prune the channels that have the least influence on both the error propagation and the computation of the weight gradients.
A. Channel Selection to Minimize Reconstruction Error on Error Propagation

Fig. 4: Back-propagation of errors with pruned error map.

The first criterion to select the channels to be pruned is to minimize the reconstruction error on error propagation. The error propagation for one convolutional layer is shown on the top of Fig. 4. Within one layer, the error propagation starts from the output error map $\delta^l$ shown on the right, convolves $\delta^l$ with the rotated kernel weights $rot(W^l)$, and generates the input error map $\delta^l_{in}$ on the left. The error propagation with pruned $\delta^l$ is shown on the bottom of Fig. 4. The number of channels in $\delta^l$ is pruned from $n$ to $n'$. When computing $\delta^l_{in}$, the computations corresponding to the pruned channels, which are convolutional operations between $\delta^l$ and the rotated weights, are removed. In order to maintain training accuracy, we want to keep the input error map $\delta^l_{in}$ before and after the pruning as similar as possible. In other words, we want to minimize the reconstruction error on the input error map.

Formally, without channel pruning of $\delta^l$, $\delta^l_{in}$ is computed as follows:
$$\delta^l_{in} = \sum_{j=1}^{n} rot(W^l_j) * \delta^l_j \tag{8}$$
where $\delta^l_{in}$ is the input error map consisting of $c$ channels, each with shape $[W_{in}, H_{in}]$. $rot(W^l_j)$ is the rotated weights of the $j$th convolutional kernel with shape $[c, k_w, k_h]$. $\delta^l_j$ is the $j$th channel of the output error map with shape $[W, H]$.

Given a pruning ratio $\alpha$ and an output error map $\delta^l$, we aim to reduce the number of channels in $\delta^l$ from $n$ to $n'$ such that $\alpha = n'/n$. To minimize the reconstruction error on $\delta^l_{in}$, the channel selection problem is formulated as follows:
$$\arg\min_{\beta}\ \Big\| \delta^l_{in} - \sum_{j=1}^{n} rot(W^l_j) * (\delta^l_j \beta_j) \Big\|_2 \tag{9}$$
$$\text{s.t.}\quad \|\beta\|_1 = n' \tag{10}$$
where $\beta$ is the error map selection strategy, represented as a binary vector of length $n$. $\beta_j$ is the $j$th entry of $\beta$, and $\beta_j = 0$ means the $j$th channel $\delta^l_j$ is pruned. The $\ell_2$-norm $\|x\|_2 = \sqrt{\Sigma x^2}$ measures the reconstruction error on $\delta^l_{in}$.

However, directly solving the minimization problem is prohibitive. $\delta^l_{in}$ in the problem is computed by Eq. (8), which completes all the computation in error propagation and defeats the purpose of saving computation. To select channels to prune before starting the actual error propagation, we define an importance score as an indication of how much each channel will influence the value of $\delta^l_{in}$ and prune the least important channels to minimize the reconstruction error on $\delta^l_{in}$.

Importance Score.
In Eq. (9), when a channel $\delta^l_j$ is pruned, the computation error on $\delta^l_{in}$ is caused by the pruned $rot(W^l_j) * \delta^l_j$. As a fast and accurate estimation of the magnitude of $rot(W^l_j) * \delta^l_j$, we define the importance score of channel $j$ as follows:
$$s_j = \gamma_1 \|W^l_j\|_1 + \gamma_2 \|\delta^l_j\|_1 \tag{11}$$
where $\|W^l_j\|_1$ is the $\ell_1$-norm of convolutional kernel $j$, computed by $\sum_{i=1}^{c} |W^l_{j,i}|$. Here we remove the rotation on $W^l_j$ since it does not change the $\ell_1$-norm. $\|\delta^l_j\|_1$ is the $\ell_1$-norm of channel $j$ in the output error map, computed by the sum of its absolute values $\sum_{x=1}^{W} \sum_{y=1}^{H} |\delta^l_{j,x,y}|$. $\gamma_1$ and $\gamma_2$ are two hyper-parameters to adjust the weight of each $\ell_1$-norm.

The importance score $s_j$ gives an expectation of the magnitude that a channel $j$ in $\delta^l$ contributes to $\delta^l_{in}$. Channels with small magnitudes in $\delta^l$ and small corresponding kernel weights $|W^l_j|$ tend to produce trivial values in the input error map $\delta^l_{in}$ and can be pruned while minimizing the influence on $\delta^l_{in}$.
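A sketch of Eq. (11) in PyTorch follows; the tensor layout and the default $\gamma$ values are assumptions for illustration, not from the paper.

```python
import torch

def importance_scores(weight, error_map, gamma1=1.0, gamma2=1.0):
    """Eq. (11): per-channel importance for the output error map of layer l.

    weight:    conv kernels of layer l, shape [n, c, k_h, k_w]
    error_map: output error map for one instance, shape [n, W, H]
    Returns a tensor of n scores, one per error map channel.
    """
    w_norm = weight.abs().sum(dim=(1, 2, 3))   # l1-norm of each kernel W_j
    d_norm = error_map.abs().sum(dim=(1, 2))   # l1-norm of each channel delta_j
    return gamma1 * w_norm + gamma2 * d_norm
```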
B. Channel Selection to Minimize Reconstruction Error on Gradient Computation

Fig. 5: Computation of the weight gradient with a pruned error map.

The second criterion to select the channels to be pruned is to minimize the reconstruction error on the weight gradients. The computation of the weight gradients without pruning is shown on the top of Fig. 5. The output feature map $a^{l-1}$ of the previous layer convolves with one channel of the output error map $\delta^l$ to produce the gradient of one kernel. When some channels in $\delta^l$ are pruned, the computation of the weight gradients corresponding to the pruned channels is removed. To retain training accuracy, we want to keep the weight gradients before and after pruning as similar as possible. Without channel pruning of $\delta^l$, the weight gradients of kernel $j$ are computed as follows:
$$g^l_{w,j} = a^{l-1} * \delta^l_j, \quad \forall j \in \{1, ..., n\} \tag{12}$$
where $g^l_{w,j}$ is the weight gradient of kernel $j$ with shape $[c, k_w, k_h]$. $a^{l-1}$ is the output feature map of the previous layer $l-1$ with shape $[c, W_{in}, H_{in}]$. $\delta^l_j$ is channel $j$ of the output error map in layer $l$, which has shape $[W, H]$.

To determine the channel selection strategy $\beta$ while minimizing the reconstruction error on the gradient computation, the channel selection problem is formulated as follows:
$$\arg\min_{\beta} \sum_{j=1}^{n} \big\| g^l_{w,j} - a^{l-1} * (\delta^l_j \beta_j) \big\|_2 \quad \text{s.t.}\quad \|\beta\|_1 = n' \tag{13}$$
Similar to Eq. (9), we use the $\ell_2$-norm $\|\cdot\|_2$ to measure the reconstruction error on the computation of the weight gradients for all $n$ kernels incurred by the pruning. Similarly, solving this problem requires completing all the gradient computation in Eq. (12) to get $g^l_{w,j}, j \in \{1, ..., n\}$, which contradicts the goal of saving computation. Thus, we define the importance score of each channel in $\delta^l$ for $g_w$ and prune the least important ones to minimize the reconstruction error on $g_w$.

Importance Score.
In Eq. (13), when a channel $\delta^l_j$ is pruned, the computation error is caused by the pruned $a^{l-1} * \delta^l_j$. Since $a^{l-1}$ is independent of $j$ and can be considered a constant when measuring the importance of each channel $\delta^l_j$, we ignore $a^{l-1}$ and only include $\delta^l_j$ in the importance score of channel $j$, which is defined as follows:
$$s_j = \|\delta^l_j\|_1 \tag{14}$$

C. Mini-Batch Pruning with Combined Importance Score
To make the pruned channels for error propagation and gradient computation consistent with each other, we combine the importance scores for these two processes. Then we scale the score from instance-wise to batch-wise for mini-batch training. The importance score $s_j$ for gradient computation in Eq. (14) is a reduced form of Eq. (11) obtained by setting $\gamma_1 = 0$ and $\gamma_2 = 1$. Therefore, we combine them into Eq. (11). Based on the per-instance importance score of each channel, we can prune channels for a mini-batch of instances to reduce the computation while maintaining the accuracy. For a mini-batch of instances, we prune the same channels for all the instances. The batch-wise importance score of one channel is calculated as $S_j = \sum_{i=1}^{m} s^i_j$, where $m$ is the batch size and $s^i_j$ is the importance score of channel $j$ for instance $i$.

With the batch-wise importance score, the error map pruning process for one convolutional layer is as follows. Given a pruning ratio $\alpha$, $n(1 - \alpha)$ channels in the output error map $\delta^l$ need to be pruned. First, for each channel $j$ in $\delta^l$, we calculate the batch-wise importance score $S_j$. Then the importance scores of all channels are sorted, and the $n(1 - \alpha)$ channels with the smallest $S_j$ are marked as pruned. Then the error propagation and the computation of the weight gradients corresponding to the pruned channels are skipped to save computation.
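Reusing the per-instance score of Eq. (11), the batch-wise selection step can be sketched as follows; like the previous block, the tensor layout is an assumption for illustration, not the paper's implementation.

```python
import torch

def select_pruned_channels(weight, error_maps, alpha, gamma1=1.0, gamma2=1.0):
    """Mini-batch channel selection for error map pruning.

    weight:     conv kernels of layer l, shape [n, c, k_h, k_w]
    error_maps: output error maps for a mini-batch, shape [m, n, W, H]
    alpha:      fraction of channels to keep (n' = alpha * n)
    Returns indices of the n(1 - alpha) channels to prune.
    """
    m = error_maps.shape[0]
    # Batch-wise score S_j = sum over instances of s_j (Eq. 11 per instance);
    # the weight term is identical across the batch, so it is scaled by m.
    w_norm = weight.abs().sum(dim=(1, 2, 3))      # [n]
    d_norm = error_maps.abs().sum(dim=(0, 2, 3))  # sum over batch and spatial dims -> [n]
    scores = gamma1 * m * w_norm + gamma2 * d_norm
    num_pruned = int(weight.shape[0] * (1.0 - alpha))
    return torch.argsort(scores)[:num_pruned]     # channels with smallest S_j
```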
Computation Reduction. With error map pruning, the computation cost of both the error propagation and the weight gradients is effectively reduced. With pruning ratio $\alpha$, $1 - \alpha$ of the computation in the error propagation and gradient computation is skipped, which saves about $1 - \alpha$ of the computation in the backward pass of training. More specifically, without pruning, for one instance the computation cost of error propagation for a convolutional layer $l$ in floating-point operations (FLOPs) is $FLOPs(\delta^l_{in}) = W_{in} H_{in} c n k_w k_h$. When pruning the number of channels in $\delta^l$ from $n$ to $\alpha n$, the computation cost is reduced to $\alpha FLOPs(\delta^l_{in})$. For the computation of the weight gradients, before pruning the computation cost of $g^l_w$ is $FLOPs(g^l_w) = W H c n k_w k_h$. With pruning ratio $\alpha$, the cost is reduced to $\alpha FLOPs(g^l_w)$. In this way, $1 - \alpha$ of the computation cost is reduced in the backward pass of convolutional layers.
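As a worked example with assumed layer dimensions (a 3x3 convolution with $c = n = 64$ on 32x32 maps and pruning ratio $\alpha = 0.5$; these numbers are illustrative, not from the paper):

```latex
% Assumed dimensions: W_in = H_in = W = H = 32, c = n = 64, k_w = k_h = 3, alpha = 0.5
\begin{align*}
FLOPs(\delta^l_{in}) &= 32 \cdot 32 \cdot 64 \cdot 64 \cdot 3 \cdot 3 \approx 37.7\,\text{M} \\
FLOPs(g^l_w)         &= 32 \cdot 32 \cdot 64 \cdot 64 \cdot 3 \cdot 3 \approx 37.7\,\text{M} \\
\text{pruned cost}   &= \alpha \left( FLOPs(\delta^l_{in}) + FLOPs(g^l_w) \right) \approx 37.7\,\text{M}
\end{align*}
% i.e., 1 - alpha = 50% of this layer's backward-pass FLOPs are skipped.
```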
Overhead Analysis. The computation overhead of error map pruning is negligible. It is caused by the channel selection and the skipping of pruned channels. When using the $\ell_1$-norm strategy in Eq. (11) for the channel selection, the overhead is negligible because the sums over each kernel weight and each channel are relatively cheap compared with the expensive convolutional operations in the backward pass. For example, the channel selection of ResNet-110 consumes a marginal 0.53% of the FLOPs of the backward pass. For the overhead of skipping, since we employ structured pruning, skipping the pruned channels is simply skipping the computation involving the pruned channels, which has negligible overhead.

VI. EXPERIMENTS
We conduct extensive experiments to demonstrate the effectiveness of our approaches in terms of computation reduction, energy saving, accuracy, and convergence speed, and we provide detailed analysis. The evaluation is on four network architectures and four datasets. We first evaluate EIF and then evaluate the combined EIF+EMP approach. After that, we evaluate the practical energy savings on two edge devices.

A. Experimental Setup
Datasets and Networks.
We evaluate the proposed approaches on four datasets: CIFAR-10, CIFAR-100 [35], MNIST [36], and ImageNet [37]. We use networks with different capacities to show the scalability of the proposed approaches. The networks include large-scale networks for mobile devices and small networks for tiny sensor nodes. For large-scale networks, we employ two kinds of CNNs, including the residual network ResNet [8] and the plain network VGG [38]. ResNet-110, ResNet-74, and VGG-16 are evaluated on CIFAR-10/100. ResNet-18 and VGG-11 are evaluated on ImageNet. For small networks, we use LeNet on MNIST.
Architectures of Instance Filter.
We use different networks as the instance filter for different datasets. For CIFAR-10/100, we use ResNet-8. It has 7 convolutional layers and 1 fully-connected layer. The first layer is 3x3 convolutions with 16 filters. Then there is a stack of 3 residual blocks. Each block has 2 convolutional layers with kernel size 3x3. The numbers of filters in each block are { }, respectively. The network ends with a 10/100-way fully-connected layer. For ImageNet, we use ResNet-10. It has 9 convolutional layers and 1 fully-connected layer. The first layer is 7x7 convolutions with 64 filters. Additional downsampling is conducted with a stride of 4 to reduce the computation cost. Then there is a stack of 4 residual blocks. Each block has 2 convolutional layers with kernel size 3x3. The numbers of filters in each block are { }, respectively. The network ends with a 1000-way fully-connected layer. For MNIST, we use a slimmed LeNet with kernel size 3x3 and { } filters for the two convolutional layers, respectively.

The computation overhead of the EIF is negligible compared with the main networks. For CIFAR-10/100, the computation required for the inference of the EIF is 5.0% of ResNet-110 and 4.1% of VGG-16, respectively. The computation required for training the EIF is 5.9% of ResNet-110 and 4.8% of VGG-16, respectively. For ImageNet, the computation required for the inference and training of the EIF network is 3.4% and 3.9% of ResNet-18, and 0.81% and 1.05% of VGG-11, respectively. For MNIST, the computation required for the inference and training of the EIF is 9.5% and 7.8% of LeNet, respectively.

Training Details.
We train both the main network and the instance filter simultaneously from scratch. For ResNet-110, ResNet-74, and VGG-16, we employ the training settings in [8]. We use the SGD optimizer with momentum 0.9 and weight decay 0.0001 with batch size 128. The models are trained for 64k iterations. The initial learning rate is 0.1 and decayed by a factor of 10 at 32k and 48k iterations. For the instance filter, the learning rate is set to 0.1. For ResNet-18 and VGG-11, similar training settings are employed except that the batch size is 256, the models are trained for 450k iterations, and the learning rate is decayed by a factor of 10 at 150k and 300k iterations. For LeNet, the learning rate is 0.01 and the momentum is 0.5. The model is trained for 18.7k iterations with batch size 64. For the instance filter, the initial learning rate is 0.1 and decayed to 0.05 after 0.94k iterations.
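For reference, the CIFAR-10/100 optimizer settings above map onto the following PyTorch configuration; this is a sketch of the stated hyper-parameters and assumes the scheduler is stepped once per iteration rather than per epoch.

```python
import torch

def make_optimizer(model):
    """SGD configuration for ResNet-110/74 and VGG-16 on CIFAR-10/100."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # Decay the learning rate by 10x at 32k and 48k iterations (64k total)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[32000, 48000], gamma=0.1)
    return optimizer, scheduler
```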
Metrics.
We evaluate the proposed approaches with two highly related but different metrics: the reduction of computation cost and the practical energy saving. The computation cost is measured in FLOPs, which is independent of the specific devices and a commonly used metric of computation cost [39]. The evaluation of the computation cost is conducted on an NVIDIA P100 GPU with PyTorch 1.1 and measured by the THOP library [40], which will be presented in Sections VI-B to VI-E. The second metric, practical energy saving, depends on the devices and is measured on two edge devices (an NVIDIA Jetson TX2 mobile GPU and an MSP432 MCU), which will be presented in Section VI-F.
B. Evaluating Early Instance Filtering (EIF)
Fig. 6: Top-1 accuracy by early instance filter (EIF) and baselines with ResNet-110 on CIFAR-10.

To show that the proposed early instance filtering (EIF) can effectively reduce the computation cost while maintaining or even boosting the accuracy, we compare it with two state-of-the-art (SOTA) baselines and a standard training approach. Online hard example mining (OHEM) [21] selects hard examples for training by computing the loss values. Stochastic mini-batch dropping (SMD) [20] is a SOTA on-device training approach, which randomly skips every mini-batch with a probability. SMB is the standard mini-batch training method by stochastic gradient descent (SGD), where the computation cost is adjusted by reducing the number of training iterations.
Computation Reduction while Boosting Accuracy.
The proposed EIF substantially outperforms the baselines in terms of both accuracy and computation reduction. As shown in Fig. 6, when training ResNet-110 on CIFAR-10, with different remaining computation ratios, EIF consistently outperforms the baselines by a large margin. Compared with the full accuracy by SGD (i.e., SMB with remaining computation ratio 1.0), when using only 36.50% remaining computation, EIF boosts the accuracy by 0.16% (93.73% vs. 93.57%). With only 55.45% computation, EIF boosts the accuracy by 0.52% (94.09% vs. 93.57%). Compared with SMB and SMD, under different computation ratios, EIF achieves consistently higher accuracy in the ranges [0.84%, 2.32%] and [0.83%, 2.28%], respectively. The significant improvement is achieved because EIF selects instances by predicting the true loss value, instead of randomly dropping the instances. Compared with OHEM, EIF consistently achieves higher accuracy in the range [0.31%, 0.98%] under different computation ratios. The improved accuracy and reduced computation cost show that the proposed instance filter effectively selects important instances for training to save computation cost.

To further evaluate EIF, we conduct experiments on training ResNet-74, VGG-16, and LeNet. Consistent accuracy improvement over the SOTA baselines is observed, as shown in Fig. 7(a)(b)(c). For ResNet-74, with 63.74% computation reduction, EIF has only 0.02% accuracy loss, while OHEM has a larger accuracy loss of 5.56% with a smaller computation reduction of 60.59%. SMD and SMB have accuracy losses of 2.25% and 1.98% with a smaller computation reduction of 60%. Similar results are observed on VGG-16. For LeNet, EIF boosts the accuracy by 0.21% (99.45% over 99.24%) with a computation reduction of 77.19%.
C. Evaluating EIF + EMP
We evaluate the proposed framework EIF+EMP, consisting of EIF and EMP, and compare it with the SOTA baselines. Our approach effectively reduces the computation cost and achieves significantly better accuracy than the SOTA baselines. Fig. 8 shows the accuracy of ResNet-110 on CIFAR-10 when trained by EIF+EMP and the baselines under different remaining computation ratios. Compared with EIF or EMP only, EIF+EMP achieves more aggressive computation reduction while preserving and even boosting the accuracy. With EIF only, we achieve 63.50% computation reduction without accuracy loss. With EMP only, we achieve 35.56% computation reduction in the backward pass without accuracy loss and 62.22% computation reduction with a slight accuracy loss of 0.72%. With the combined EIF+EMP, at up to 67.84% computation reduction, we achieve no accuracy loss and boost the accuracy by up to 0.84% (94.41% vs. 93.57%).

We further evaluate EIF+EMP with more network architectures and datasets. Our approach substantially outperforms the baselines in terms of computation reduction and accuracy. We evaluate our approach with ResNet-110, ResNet-74, and VGG-16 on CIFAR-10 and LeNet on MNIST. For a fair comparison with E2Train [20], which employs quantization [41], we use the same quantization scheme. When comparing with other baselines, we do not use quantization. The experimental results are shown in Table I. When training ResNet-74, our approach achieves 63.91% computation saving without accuracy loss. With quantization, our approach achieves 95.41% computation saving with a marginal accuracy loss of 0.46%. E2Train achieves a smaller computation saving of 90.13% and a much higher accuracy loss of 2.10%.
Fig. 7: Top-1 accuracy by EIF and baselines (OHEM, SMD, SMB) with (a) ResNet-74 and (b) VGG-16 on CIFAR-10 and (c) LeNet on MNIST.
Fig. 8: Accuracy of ResNet-110 on CIFAR-10 by EIF+EMP and baselines under different remaining computation ratios (68% computation saving with no accuracy loss).

TABLE I: Top-1 accuracy by EIF+EMP with ResNet-110, ResNet-74, VGG-16 on CIFAR-10 and LeNet on MNIST.
Network    | Method           | Comp. Reduce | Accuracy
-----------|------------------|--------------|---------
ResNet-110 | SGD (original)   | -            | 93.57%
           | EIF+EMP          | 67.84%       | 93.66%
           | OHEM [21]        | 60.45%       | 85.11%
           | SD [19]          | 60.00%       | 91.96%
           | EIF+EMP+Q        | 95.71%       | 93.54%
           | E2Train(+Q) [20] | 90.13%       | 91.68%
ResNet-74  | SGD (original)   | -            | 93.46%
           | EIF+EMP          | 63.91%       | no loss
           | OHEM             | 60.59%       | 87.90%
           | SD               | 60.00%       | 90.99%
           | EIF+EMP+Q        | 95.41%       | 93.00%
           | E2Train(+Q)      | 90.13%       | 91.36%
VGG-16     | SGD (original)   | -            | 93.25%
           | EIF+EMP          | 67.33%       | 93.15%
           | OHEM             | 60.41%       | 71.81%
           | EIF+EMP+Q        | 95.54%       | 92.69%
           | E2Train(+Q)      | -            | -
LeNet      | SGD (original)   | -            | 99.23%
           | EIF+EMP          | 78.60%       | 99.43%
           | OHEM             | 65.24%       | 99.33%

Similar results are observed on ResNet-110, VGG-16, and LeNet. SD and E2Train rely on the residual connections in ResNet and cannot be applied to VGG-16 and LeNet. These results show the proposed framework EIF+EMP achieves superior computation saving and significantly higher accuracy than the baselines on different networks.
Experiments on CIFAR-100.
We further evaluate the proposed approaches on CIFAR-100 with ResNet-110 and VGG-16. EIF+EMP substantially outperforms the baselines in both computation reduction and accuracy. As shown in Table II, with ResNet-110, EIF+EMP achieves 56.24% computation reduction while preserving the full network accuracy, and 50.02% computation reduction while boosting the accuracy by 0.42% (72.02% vs. 71.60%).

TABLE II: Top-1 accuracy by EIF+EMP and baselines with ResNet-110 and VGG-16 on CIFAR-100.
Network    | Method         | Comp. Reduce | Accuracy
-----------|----------------|--------------|---------
ResNet-110 | SGD (original) | -            | 71.60%
           | EIF+EMP        | 50.02%       | 72.02%
           |                | 56.24%       | 71.63%
           | OHEM           | 47.01%       | 69.98%
           | SD             | 50.00%       | 70.44%
           | SMB            | 50.00%       | 67.28%
           | EIF+EMP+Q      | 92.92%       | 71.29%
           | E2Train(+Q)    | 90.13%       | 67.94%
VGG-16     | SGD (original) | -            | 71.56%
           | EIF+EMP        | 50.49%       | 71.59%
           |                | 53.86%       | 70.92%
           | OHEM           | 46.99%       | 65.17%
           | SMB            | 50.00%       | 68.76%
TABLE III: Top-1 and Top-5 accuracy by EIF+EMP and baselines with ResNet-18 and VGG-11 on ImageNet.
Network   | Method         | Comp. Reduce | Accuracy (top-1) | Accuracy (top-5)
----------|----------------|--------------|------------------|-----------------
ResNet-18 | SGD (original) | -            | 69.76%           | 89.08%
          | EIF+EMP        | 58.91%       | 70.27%           | 89.63%
          |                | 64.71%       | 68.98%           | 89.35%
          | OHEM           | 46.67%       | 62.09%           | 87.08%
          | SD             | 50.00%       | 65.36%           | 86.41%
          | SMB            | 50.00%       | 65.94%           | 87.50%
VGG-11    | SGD (original) | -            | 70.38%           | 89.81%
          | EIF+EMP        | 51.63%       | 70.36%           | 89.98%
          |                | 60.59%       | 70.01%           | 89.83%
          | OHEM           | 46.59%       | 56.39%           | 85.62%
          | SMB            | 50.00%       | 63.76%           | 86.49%
Experiments on ImageNet.
We evaluate the proposed approaches on the large-scale ImageNet dataset [37]. ImageNet consists of 1.2M training images in 1000 classes. The main networks are ResNet-18 and VGG-11.

The proposed EIF+EMP effectively reduces the computation cost in training while preserving the accuracy on the large-scale dataset, and it significantly outperforms the baselines. As shown in Table III, when training ResNet-18, with 58.91% computation reduction in training, EIF+EMP boosts the top-1 accuracy by 0.51% (70.27% vs. 69.76%) and the top-5 accuracy by 0.55%. With a more aggressive computation reduction of 64.71%, EIF+EMP still boosts the top-5 accuracy by 0.27% (89.35% vs. 89.08%). EIF+EMP consistently outperforms the SOTA baselines by a large margin. With larger computation reduction, EIF+EMP achieves higher top-1 accuracy in the range [4.33%, 8.18%] and higher top-5 accuracy in the range [2.13%, 3.22%], respectively. SD relies on the residual connections and cannot be applied to VGG-11. Similar results are observed on VGG-11, as shown in Table III.

D. Convergence Speed

Fig. 9: Convergence speed of ResNet-110 on CIFAR-10 during training with different approaches (SD, OHEM, EIF, EMP, EIF+EMP).

The proposed approaches improve the convergence speed in the training process. The test error (i.e., 100% minus accuracy on the test dataset) over the computation cost during training is shown in Fig. 9. The proposed EIF, EMP, and combined EIF+EMP approaches converge faster than the baselines, represented as lower test error (higher accuracy) at the same computation cost. More specifically, EIF+EMP achieves 3.1x faster convergence and 0.09% accuracy improvement compared with the standard mini-batch approach (SMB). The SOTA baselines OHEM and SD achieve lower convergence speed and larger accuracy losses of 8.46% and 1.61%, respectively.
E. Quantitative and Qualitative Analysis
Effectiveness of Adaptive Loss Threshold.
The proposed early instance filter effectively predicts a pre-defined percentage of input instances as high-loss, and the adaptive loss threshold effectively adjusts the loss threshold as the labeling strategy to train the filter. In Fig. 10(a), the pre-defined high-loss ratio is 40% for training ResNet-110 on CIFAR-10. The number of predicted high-loss instances, averaged every 390 iterations, is stabilized at about 51, which effectively selects 40% high-loss instances on average from the 128 instances in each mini-batch. As the average loss decreases, the adaptive loss threshold also decreases following a similar pattern to closely track the latest state of the main network.

We further compare the proposed adaptive loss threshold with a static loss threshold. With a static loss threshold of 1.0, the number of predicted high-loss instances per mini-batch and the average loss of the main model in the training process are shown in Fig. 10(b). The goal of training is to minimize the loss of the main model. However, the static loss threshold cannot effectively decrease the loss of the main model, as shown in the blue line, and results in low accuracy. This is because the static loss threshold cannot track the latest state of the main model. Therefore, it cannot effectively stabilize the number of predicted high-loss instances to train the main model. The static loss threshold only achieves 80.83% final accuracy of the main model. In contrast, the proposed adaptive loss threshold can effectively minimize the loss of the main model and achieves a high accuracy of 94.24%.

Fig. 10: The adaptive loss threshold (left) tracks the state of the main model in the training process and stabilizes the number of preserved instances with predicted high loss by EIF. The static loss threshold (right) cannot generate correct loss labels to train the EIF, which results in an incorrect number of predicted high-loss instances and a high average loss of the main model.

Fig. 11: Incorrect loss prediction ratio of EIF with and without the weighted loss.
Effectiveness of Weighted Loss for Training EIF.
The weighted loss in Eq. (7) effectively trains the EIF network to make accurate loss predictions, which eventually results in higher accuracy of the main model. As shown in Fig. 11, when the weighted loss is employed, the wrong loss prediction ratio of the EIF is much lower than that without the weighted loss. The pre-defined high-loss ratio is 30%, and the corresponding low-loss ratio is 70%. This high-loss ratio makes the number of high-loss and low-loss instances unbalanced in the input stream.

When the weighted loss is used for training the EIF, the average wrong loss prediction ratio of the EIF is reduced from 20.31% to 8.59%. This accurate loss prediction effectively selects high-loss instances to train the main model and results in significantly higher accuracy of the main model: 94.05% with the weighted loss vs. 90.58% without the weighted loss.

Fig. 12: Energy and computation overhead of EIF: (a) EIF energy overhead, (b) EIF computation overhead, (c) EIF computation overhead at every iteration. Energy overhead is measured on an NVIDIA Jetson TX2 mobile GPU.
Overhead of EIF.
The proposed early instance filter hasmarginal energy and computation overhead. The average en-ergy and computation overhead of the EIF network per trainingiteration (e.g. one mini-batch of 128 instances) when trainingResNet-110 on CIFAR-10 dataset is shown in Fig. 12. Asshown in the yellow bar in Fig. 12(a), the energy overheadof the EIF network (measured on NVIDIA Jetson TX2) is0.43J per iteration, which is 10.22% of the total energy cost4.18J when training with EIF. Without EIF, the energy cost is12.90J per iteration. As shown in Fig. 12(b), the computationoverhead of EIF is 3.88 GFLOPs, which is 11.65% of thetotal computation cost 33.21 GFLOPs when training withEIF. Without EIF, the computation cost is 99.91 GFLOPs periteration. The detailed EIF computation overhead across alltraining iterations are shown in Fig. 12(c). While the overheadof EIF is not zero, the proposed approach achieves 67.60%energy saving and 66.76% computation saving while fullypreserving the accuracy. D r o pp e d P r e s e r v e d CIFAR-10 MNIST
Fig. 13: Preserved and dropped instances by EIF when training ResNet-110 on CIFAR-10 and LeNet on MNIST.
Preserved and Dropped Instances by EIF.
To better understand the instances selected by the early instance filter, we cluster the instances that the filter preserves and drops when training ResNet-110 on CIFAR-10 and LeNet on MNIST, as shown in Fig. 13. We find that the dropped instances show full objects with typical characteristics. The preserved instances either show only part of the object or show non-typical characteristics that are hard even for humans to recognize. This result shows that the early instance filter can effectively find important instances to train the network.
Fig. 14: Visualization of the pruned and preserved channels in the error map and the corresponding convolutional kernel weights.
Analysis of Error Map Pruning.
To better understand the channels pruned and preserved in the backward pass by error map pruning, we visualize them to analyze the effectiveness of the proposed channel selection approach. The preserved and pruned channels in the error map and the corresponding kernel weights in the conv2 layer of VGG-16 are shown in Fig. 14. The 16 channels with the highest/lowest proposed importance scores are shown on the top left and bottom left, respectively, and their corresponding convolutional kernel weights are shown on the right. The pruned channels are darker, with smaller values than the preserved channels, which are brighter with larger values. Similarly, the kernel weights corresponding to the pruned channels have smaller values than those of the preserved ones. Therefore, the pruned channels have the least influence on both the error propagation and the computation of weight gradients. This result shows that the proposed error map pruning approach effectively selects channels to prune so as to minimize the influence on training.
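A minimal sketch of the pruning step is given below; the per-channel L2-norm score is a stand-in for the paper's importance score, which is defined earlier in the text.

import torch

def prune_error_map(error_map, keep_ratio=0.5):
    # error_map: gradient w.r.t. a conv layer's output, shape [N, C, H, W].
    # Score each channel (here: squared L2 norm over batch and spatial
    # dims, an assumed proxy for the paper's importance score) and zero
    # out the lowest-scoring channels so their error propagation and
    # weight-gradient computations can be skipped in the backward pass.
    n, c, h, w = error_map.shape
    scores = error_map.pow(2).sum(dim=(0, 2, 3))
    keep = torch.topk(scores, max(1, int(keep_ratio * c))).indices
    mask = torch.zeros(c, device=error_map.device, dtype=error_map.dtype)
    mask[keep] = 1.0
    return error_map * mask.view(1, c, 1, 1)

Zeroing whole channels (rather than individual elements) is what makes the saving structured: the convolutions that would consume those channels in back-propagation can be dropped entirely.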
F. Practical Energy Saving on Hardware Platforms
The energy cost of training consists of both the computation cost and the memory access cost. While the former dominates the energy cost and is represented by the commonly used metric FLOPs [39], the energy saving ratio can differ slightly from the computation reduction ratio. To evaluate the practical energy saving, we conduct extensive experiments on two edge platforms and evaluate the proposed approaches in terms of practical energy saving and accuracy.
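For reference, FLOP counts of this kind can be reproduced with an op counter such as THOP [40]; a minimal usage sketch is shown below (the torchvision model and CIFAR-sized input are illustrative choices, not the paper's exact measurement script).

import torch
from thop import profile  # pip install thop
from torchvision.models import resnet18

model = resnet18()
dummy_input = torch.randn(1, 3, 32, 32)  # one CIFAR-sized image
macs, params = profile(model, inputs=(dummy_input,))
print(f"{macs / 1e9:.2f} GMACs, {params / 1e6:.2f} M parameters")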
Fig. 15: Energy measurement setup for training on two edge platforms: a mobile GPU for mobile-level devices (top, measured with an energy meter) and an MCU for sensor node-level devices (bottom, fed the dataset from a computer via UART and measured with a power analyzer).
Hardware Setup.
We apply the proposed training approach on two edge platforms to evaluate realistic energy saving. For mobile-level devices, we train ResNet-110, ResNet-74, and VGG-16 on an NVIDIA Jetson TX2 mobile GPU [42] with the CIFAR-10 and CIFAR-100 datasets using PyTorch 1.1. We use an energy meter to measure the energy cost, as shown at the top of Fig. 15.
For sensor node-level devices, we train LeNet on the MSP432 MCU [43]. We implement the training process on the MCU in C. Since the MCU cannot store the entire dataset, we use a computer to feed the training data to the MCU via UART during training. We use a Keysight N6705C power analyzer to measure the energy cost on the MCU, as shown at the bottom of Fig. 15.
Fig. 16: Energy saving when training ResNet-110 and VGG-16 on an NVIDIA Jetson TX2 [42] mobile GPU with the CIFAR-10 dataset. Accuracy: (a) ResNet-110: SGD 93.57%, EIF+EMP 93.66%, OHEM 85.11%, SD 91.96%; (b) VGG-16: SGD 93.25%, EIF+EMP 93.15%, OHEM 71.81%. EIF+EMP prolongs battery life by 2.5x to 3.1x without any or with marginal accuracy loss.
Energy Saving of Training on Mobile GPU.
We evaluate the energy saving of EIF+EMP on mobile-level devices. We repeat all the experiments in Tables I and II on the mobile GPU to measure the practical energy saving, except for LeNet, which is evaluated on the MCU. Our approach effectively reduces the energy cost of on-device training. Compared with the original SGD, the proposed EIF+EMP achieves energy savings of 67.60%, 63.57%, and 60.02% when training ResNet-110, ResNet-74, and VGG-16 on CIFAR-10, respectively, as shown in Fig. 16 (the result of ResNet-74 is omitted for conciseness). These energy savings prolong battery life by 3.1x, 2.7x, and 2.5x while improving accuracy or incurring a slight 0.1% accuracy loss. Compared with the SOTA baselines OHEM and SD, our approach achieves significantly higher accuracy at a similar energy saving. SD relies on residual connections and cannot be applied to VGG-16. Besides, the practical energy saving ratios are very close to the computation reduction ratios measured in FLOPs, which shows that the computation reduction in FLOPs generalizes well to energy saving on hardware platforms. Similar results are observed on CIFAR-100, where we achieve 54.22% and 46.64% energy saving (2.2x and 1.9x battery life) for ResNet-110 and VGG-16, respectively, without any accuracy loss.
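The battery-life factors follow directly from the energy saving ratio $s$, assuming battery drain scales linearly with training energy:
\[
\text{battery-life factor} = \frac{1}{1-s}, \qquad \frac{1}{1-0.6760} \approx 3.1\times, \quad \frac{1}{1-0.6357} \approx 2.7\times, \quad \frac{1}{1-0.6002} \approx 2.5\times.
\]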
Fig. 17: Energy saving when training LeNet on the MSP432 [43] MCU. Accuracy: SGD 96.98%, EIF+EMP 97.31%, OHEM 96.66%. EIF+EMP prolongs battery life by 3.9x.
Energy Saving of Training on MCU.
We evaluate the energy saving of EIF+EMP on sensor node-level devices (i.e., MCUs). We train LeNet on the MSP432 MCU for one epoch of 60,000 instances and measure the energy cost and accuracy. Due to the limited runtime memory, we set the batch size to 1. Since the original SGD approach takes too long (about 50 days) to complete on the MCU, we run 10% of the training iterations of one epoch on the MCU and estimate the total energy cost by multiplying the measured energy cost by 10. The accuracy of the original SGD is measured on a P100 GPU after finishing one epoch. OHEM cannot be applied to MCUs because it needs batch-wise loss values for instance selection. To compare with OHEM, we measure its energy cost on the MCU by completing its computation while ignoring the accuracy; the accuracy of OHEM is evaluated on the P100 GPU. EIF+EMP significantly reduces the energy cost of training on MCUs and effectively prolongs battery life. As shown in Fig. 17, when training LeNet on the MSP432 MCU, EIF+EMP reduces the energy cost by 74.09% while improving accuracy by 0.33%, prolonging battery life by 3.9x. OHEM, while not fully feasible on the MCU, achieves a much lower energy saving of 59.78% with an accuracy loss of 0.32%. This result shows that EIF+EMP greatly improves the battery life of tiny sensor nodes and outperforms the baselines.
VII. CONCLUSION
This work aims to enable on-device training of convolutional neural networks by reducing the computation cost at training time. We propose two complementary approaches to reduce the computation cost: early instance filtering (EIF), which selects important instances to train the network and drops trivial ones, and error map pruning (EMP), which prunes insignificant channels of the error maps in back-propagation. Experimental results show superior computation reduction with higher accuracy compared with state-of-the-art techniques.
REFERENCES

[1] M. Song, K. Zhong, J. Zhang, Y. Hu, D. Liu, W. Zhang, J. Wang, and T. Li, "In-situ AI: Towards autonomous and incremental deep learning for IoT systems," in IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 92–103.
[2] N. Zhu, X. Liu, Z. Liu, K. Hu, Y. Wang, J. Tan, M. Huang, Q. Zhu, X. Ji, Y. Jiang et al., "Deep learning for smart agriculture: Concepts, tools, applications, and opportunities," International Journal of Agricultural and Biological Engineering, vol. 11, no. 4, pp. 32–44, 2018.
[3] S. Bhattacharya and N. D. Lane, "Sparsification and separation of deep learning layers for constrained resource inference on wearables," in Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM, 2016, pp. 176–189.
[4] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125–141, 2008.
[5] O. Rudovic, J. Lee, M. Dai, B. Schuller, and R. W. Picard, "Personalized machine learning for robot perception of affect and engagement in autism therapy," Science Robotics, vol. 3, no. 19, p. eaao6760, 2018.
[6] B. McMahan and D. Ramage, "Federated learning: Collaborative machine learning without centralized training data," 2017. [Online]. Available: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
[7] T. Akiba, S. Suzuki, and K. Fukuda, "Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes," 2017.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[9] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[10] N. K. Jayakodi, A. Chatterjee, W. Choi, J. R. Doppa, and P. P. Pande, "Trading-off accuracy and energy of deep inference on embedded systems: A co-design approach," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, 2018.
[11] P. Panda, A. Sengupta, and K. Roy, "Conditional deep learning for energy-efficient and enhanced pattern recognition," in Design, Automation & Test in Europe Conference (DATE), 2016.
[12] Y. Wu, Z. Wang, Z. Jia, Y. Shi, and J. Hu, "Intermittent inference with nonuniformly compressed multi-exit neural network for energy harvesting powered devices," arXiv preprint arXiv:2004.11293, 2020.
[13] L. Bottou, "Stochastic gradient descent tricks," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[14] S. Lee and S. Nirjon, "Neuro.ZERO: A zero-energy neural network accelerator for embedded sensing and inference systems," in Proceedings of the 17th Conference on Embedded Networked Sensor Systems, 2019.
[15] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in European Conference on Computer Vision. Springer, 2016, pp. 69–84.
[16] C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
[17] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., "Communication-efficient learning of deep networks from decentralized data," arXiv preprint arXiv:1602.05629, 2016.
[18] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage, "Federated learning for mobile keyboard prediction," arXiv preprint arXiv:1811.03604, 2018.
[19] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," in European Conference on Computer Vision. Springer, 2016, pp. 646–661.
[20] Y. Wang, Z. Jiang, X. Chen, P. Xu, Y. Zhao, Y. Lin, and Z. Wang, "E2-Train: Training state-of-the-art CNNs with over 80% energy savings," in Advances in Neural Information Processing Systems, 2019.
[21] A. Shrivastava, A. Gupta, and R. Girshick, "Training region-based object detectors with online hard example mining," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[22] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu et al., "Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes," 2018.
[23] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
[24] H. Zhou, J. M. Alvarez, and F. Porikli, "Less is more: Towards compact CNNs," in European Conference on Computer Vision. Springer, 2016.
[25] J. M. Alvarez and M. Salzmann, "Compression-aware training of deep networks," in Advances in Neural Information Processing Systems, 2017.
[26] S. Lym, E. Choukse, S. Zangeneh, W. Wen, M. Erez, and S. Shanghavi, "PruneTrain: Gradual structured pruning from scratch for faster neural network training," arXiv preprint arXiv:1901.09290, 2019.
[27] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," arXiv preprint arXiv:1608.08710, 2016.
[28] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5058–5066.
[29] W. Jiang, X. Zhang, E. H.-M. Sha, L. Yang, Q. Zhuge, Y. Shi, and J. Hu, "Accuracy vs. efficiency: Achieving both through FPGA-implementation aware neural architecture search," in Proceedings of the 56th Annual Design Automation Conference 2019, 2019, pp. 1–6.
[30] L. Yang, W. Jiang, W. Liu, E. H.-M. Sha, Y. Shi, and J. Hu, "Co-exploring neural architecture and network-on-chip design for real-time artificial intelligence," in Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2020, pp. 85–90.
[31] L. Yang, Z. Yan, M. Li, H. Kwon, L. Lai, T. Krishna, V. Chandra, W. Jiang, and Y. Shi, "Co-exploration of neural architectures and heterogeneous ASIC accelerator designs targeting multiple tasks," arXiv preprint arXiv:2002.04116, 2020.
[32] W. Jiang, L. Yang, E. H.-M. Sha, Q. Zhuge, S. Gu, S. Dasgupta, Y. Shi, and J. Hu, "Hardware/software co-exploration of neural architectures," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020.
[33] Q. Lu, W. Jiang, X. Xu, Y. Shi, and J. Hu, "On neural architecture search for resource-constrained hardware platforms," arXiv preprint arXiv:1911.00105, 2019.
[34] K. Konyushkova, R. Sznitman, and P. Fua, "Learning active learning from data," in Advances in Neural Information Processing Systems, 2017.
[35] A. Krizhevsky, "Learning multiple layers of features from tiny images," 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[36] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[37] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR09, 2009.
[38] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[39] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[40] "THOP: PyTorch-OpCounter, a tool to count the FLOPs of a PyTorch model," 2020. [Online]. Available: https://pypi.org/project/thop/
[41] R. Banner, I. Hubara, E. Hoffer, and D. Soudry, "Scalable methods for 8-bit training of neural networks," in Advances in Neural Information Processing Systems, 2018.