Dynamically Sacrificing Accuracy for Reduced Computation: Cascaded Inference Based on Softmax Confidence
Konstantin Berestizshevsky
School of Electrical Engineering, Tel Aviv University
[email protected]
Guy Even
School of Electrical Engineering, Tel Aviv University
[email protected]
Abstract
We study the tradeoff between computational effort and accuracy in a cascade of deep neural networks. During inference, early termination in the cascade is controlled by confidence levels derived directly from the softmax outputs of intermediate classifiers. The advantage of early termination is that classification is performed using less computation, thus adjusting the computational effort to the complexity of the input. Moreover, dynamic modification of confidence thresholds allows one to trade accuracy for computational effort without requiring retraining. Basing early termination on softmax classifier outputs is justified by experimentation that demonstrates an almost linear relation between confidence levels in intermediate classifiers and accuracy. Our experimentation with architectures based on ResNet obtained the following results: (i) a speedup of – that sacrifices – accuracy with respect to the CIFAR-10 test set; (ii) a speedup of – that sacrifices – accuracy with respect to the CIFAR-100 test set; (iii) a speedup of – that sacrifices – accuracy with respect to the SVHN test set.

1 Introduction

The quest for high accuracy in Deep Neural Networks (DNNs) has led to the design of large networks consisting of hundreds of layers with millions of trainable weights. Computation in such DNNs requires billions of multiply-accumulate operations (MACs) for a single image classification [SCYE17]. The natural question that arises is whether this amount of computation is indeed required [PSR16].

In this paper, we focus on the computational effort spent on inference in DNNs. (For simplicity, we measure the computational effort by the number of multiply-accumulate operations (MACs).) Many conjecture that the computational effort required for classifying images is not constant and depends on the image [BBPP15, Gra16, FCZ+17, PSR16, TMK16]. We claim that the required computational effort for classification is an intrinsic yet hidden property of the images. Namely, some images are much easier to classify than others, but the computational effort needed for classification is hard to predict before classification is completed.

The desire to spend the "right" computational effort in classification leads to the first goal in this work.
Goal 1.1.
Provide an architecture in which the computational effort is proportional to the complexity of the input.
We also consider a setting in which the system's power consumption or throughput is not fixed. Examples of such settings are: (1) As the battery drains in a mobile device, one would like to enter a "power saving mode" in which less power is spent per classification. (2) If the input rate increases in a real-time system (e.g., due to a burst of inputs), then one must spend less time per input [CZC+10]. (3) Timely processing in a data center during spikes in query arrival rates may require reducing the computational effort per query [Bod10].

Dynamic changes in the computational effort or the throughput lead to the second goal in this work.
Goal 1.2.
Introduce the ability to dynamically control the computational effort while sacrificing accuracy as little as possible. Such changes in the computational effort should not involve retraining of the DNN.
We propose an architecture that is based on a cascade of DNNs [BWDS17], depicted in Figure 1. The cascade comprises multiple DNNs (e.g., three DNNs), called component DNNs. Classification takes place by invoking the component DNNs in increasing order of complexity, and stopping the computation as soon as the confidence level reaches the desired level. Our component DNNs are not disjoint; namely, the layers of earlier components are part of the processing of subsequent components. The advantage of this approach is that the next component reuses the computational outcome of the previous component and further refines it.

The decision of which DNN component is the final one invoked in the computation is based on the softmax output of the invoked DNNs. We define a simple confidence threshold, based on the softmax output, that allows for trading off (a little) accuracy for (a substantial reduction in) computational effort. Basing the stopping condition on the softmax output has two advantages over previous methods [PSR16, BWDS17]: (1) simplicity, as we do not require an additional training step for configuring the control that selects the output; (2) improved tradeoffs of computational effort vs. accuracy.
We focus on the task of image classification. Our setting is applicable to other data, as we impose no limitations on the number of classes or the distribution of the input data. The cascading of DNNs allows for the usage of various DNN layers, including convolutional layers and fully connected layers, alongside batch-normalization and pooling layers. We do not limit the non-linear functions employed by the neurons, and we consider only classification networks terminating with a softmax function.
2 Related work

The two principal techniques that we employ are cascaded classification and confidence estimation. Cascaded classification is suggested in the seminal work of [VJ01]. As opposed to voting or stacking ensembles, in which a classification is derived from the outputs of multiple experts (e.g., by majority), the decision in a cascaded architecture is based on the last expert. Uncertainty measures of classifiers are discussed in [CSTV95, SSV00]. These works address the degree of confidence that a classifier has about its output. We elaborate on recent usage of these techniques hereinafter.
2.1 Cascaded classification

A cascaded neural network architecture for computer vision is presented in [HCL+17]. The work of [LBDC+17] presented the idea of early stopping in a setting in which the cascaded DNNs are distributed among multiple devices. Reinforcement learning is employed in [OLO17] in a cascade of meta-layers to train controllers that select computational modules per meta-layer.

2.2 Confidence estimation
Confidence of an assembly of algorithms is investigated by Fagin et al. [LBG+16] in a general setup. Fagin et al. define instance optimality and suggest terminating the execution according to a criterion based on a threshold.

Rejection refers to the event that a classifier is not confident about its outcome, and hence the output is rendered unreliable. In [GEY17], a selective classification technique is presented in which a classifier and a rejection function are trained together. The goal is to obtain coverage (i.e., at least one classifier does not reject) while controlling the risk via rejection functions. They proposed a softmax-response mechanism for deriving the rejection function and discussed how the true risk of a classifier (i.e., the average loss over all the non-rejected samples) can be traded off with its coverage (i.e., the mass of the non-rejected region in the input space). Our work adopts the usage of the softmax response as a confidence rate function; however, it differs in the way we apply the confidence threshold. Namely, we propose a cascade of classifiers that terminates as soon as a desired confidence threshold is reached.
The work of [CZC+10] presented an additive ensemble machine learning approach with early exits in the context of web document ranking. In the additive approach, the sum of the outputs of a prefix of the classifiers provides the current output confidence.

The work of [TMK16] presented the BranchyNet approach, in which a network architecture has multiple branches; each branch consists of a few convolutional layers terminated by a classifier and a softmax function. Confidence of an output vector y in BranchyNet is derived from the entropy function entropy(y) = −Σ_c y_c log y_c. Our approach attempts to reduce the amount of computation that takes place outside the "main path", so that computations that take place in rejected branches are a negligible fraction of the total computation. In addition, we derive the confidence by taking the maximum over the softmax output. Finally, in [TMK16], automatic setting of threshold levels is not developed.

Cascaded classification with dedicated linear confidence estimations (rather than softmax) appears in the Conditional Deep Learning (CDL) of [PSR16]. Cascaded classification with confidence estimation appears also in the SACT mechanism of [FCZ+17]. In [BWDS17], after each component k, a special decision function γ_k is trained to decide whether an exit should be chosen. We conjecture that it is hard to decide whether an exit should be chosen based on a convolutional layer without using a classifier. Indeed, in [BWDS17], a speedup of roughly – is achieved for every doubling of the accuracy loss.
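To make the two confidence measures concrete, the following minimal sketch (NumPy; the logits are hypothetical) computes both the entropy-based confidence used by BranchyNet and the maximum-softmax confidence adopted in this work.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a logit vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical logits produced by an intermediate classifier for one input.
z = np.array([2.1, 0.3, -1.0, 0.5])
y = softmax(z)

# BranchyNet-style confidence: low entropy(y) means high confidence.
entropy = -np.sum(y * np.log(y))

# Confidence used in this work: the maximal softmax value.
max_softmax = np.max(y)

print(f"entropy = {entropy:.3f}, max softmax = {max_softmax:.3f}")
```

In the two-class case the two scores induce the same ordering of inputs; with many classes they can differ, and the cascade described below only needs the maximum, which is also the cheapest to compute.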
Table 1 depicts the parameters and notation used in this paper.

3 Cascaded inference with early termination

A cascade of DNNs is a chain of convolutional layers with branchings between layers to classifiers (see Figure 1). Early termination in cascaded DNN components means that intermediate feature maps are evaluated by classifiers. These classifiers attempt to classify the feature map and output a confidence measurement of their classification. If the confidence level is above a threshold, then execution terminates, and the classification of the intermediate feature map is output. See Figure 1 for an example of a cascaded architecture based on three convolutional layers. In our experimentation, we employ ResNet modules [HZRS15a] as component DNNs in our cascade.

Table 1: Notation and definitions used in this paper.
  n_e ∈ ℕ: number of training epochs.
  n_m ∈ ℕ: number of component DNNs in the cascade.
  n_c ∈ ℕ: number of classes in the classification task.
  n ∈ ℕ: number of ResNet blocks in a ResNet module.
  T: labeled training set, containing pairs of inputs and corresponding labels.
  M: set of component DNNs that form a cascade (|M| = n_m).
  M_m: the m-th component in the cascade, m ∈ {0, ..., n_m − 1}.
  θ^conv_m: weights and biases of the convolutional layers in component M_m.
  Θ^conv: weights and biases of the convolutional layers in the cascade.
  Θ^fc: weights and biases of the fully connected layers of the cascade.
  θ^fc_m: weights and biases of the fully connected layers of component M_m.
  out_m(x) ∈ {0, ..., n_c − 1}: class predicted by component M_m for input x.
  δ_m(x) ∈ [0, 1]: confidence output by component M_m for input x.
  δ̂_m ∈ [0, 1]: confidence threshold of component m.

Each component in a cascaded architecture consists of convolutional layers followed by a branching that leads to (1) a classifier, and (2) the next component.
Figure 1: An example of a cascaded architecture of three component DNNs with early termination. A cascade of convolutional layers (CONV_0, CONV_1, CONV_2) ends with a fully connected layer FC_2. Early termination is enabled by introducing fully connected networks FC_i after convolutional layers. The output of each fully connected layer consists of a classification out_i and a confidence measurement δ_i.

The usage of the threshold for determining early termination in the cascade is listed as Algorithm 1. The algorithm applies the component DNNs one by one and stops as soon as the confidence measure reaches the confidence threshold of the current component. This approach differs from previous cascaded architectures, in which a combination (e.g., a sum) of the confidence measures of the components is used to control the execution [FCZ+17, CZC+10].

Algorithm 1 CI(M, δ̂, x): sequential execution of DNN components in a cascaded architecture. Early termination takes place as soon as the confidence level reaches the confidence threshold. The parameters are the cascade M, the confidence thresholds δ̂ = (δ̂_0, ..., δ̂_{n_m−1}), and the input x.
  1. For m = 0 to n_m − 1 do:
     (a) (out_m(x), δ_m(x)) ← M_m(x)
     (b) If δ_m(x) ≥ δ̂_m then return out_m(x)
  2. Return out_{n_m−1}(x).
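A minimal Python sketch of Algorithm 1 is given below; it assumes each component M_m is a callable returning the pair (out_m(x), δ_m(x)), so the component implementations themselves are placeholders.

```python
def cascaded_inference(components, thresholds, x):
    """Algorithm 1 (CI): evaluate the component DNNs in order and terminate
    as soon as a component's confidence reaches its threshold.

    components -- list of n_m callables; each maps x to (out_m, delta_m)
    thresholds -- list of n_m confidence thresholds (the last one is
                  typically 0, so the deepest component always answers)
    x          -- a single input sample
    """
    out_m = None
    for m, component in enumerate(components):
        out_m, delta_m = component(x)
        if delta_m >= thresholds[m]:
            return out_m            # early termination
    return out_m                    # output of the last component
```

Because the thresholds are plain numbers compared against the softmax confidence, they can be changed between queries without touching the trained weights.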
3.1 Softmax confidence

Every classifier consists of one or more fully connected layers followed by a softmax function. Let z_m ∈ ℝ^{n_c} denote the input to the softmax function in the m-th classifier of the cascade. Let s_m ∈ [0, 1]^{n_c} denote the softmax value in the m-th classifier. The softmax value is defined as follows.

Definition 3.1 (softmax). s_m[i] = e^{z_m[i]} / Σ_{c=0}^{n_c−1} e^{z_m[c]}.

The output out_m ∈ {0, ..., n_c − 1} and the confidence measure δ_m ∈ [0, 1] with respect to this output are defined as follows.

Definition 3.2. out_m ≜ argmax_c { s_m[c] | 0 ≤ c ≤ n_c − 1 }.

Definition 3.3. δ_m ≜ max_c { s_m[c] | 0 ≤ c ≤ n_c − 1 }.

4 Training the cascade

In this section we present the training procedure of the DNN components and classifiers. Consider a cascaded architecture with n_m components. We denote this cascade by M = (M_0, ..., M_{n_m−1}), where M_m denotes the m-th component in the cascade. Let θ^conv_m (θ^fc_m, resp.) denote the weights and biases of the convolutional (fully connected, resp.) layers in component M_m. Let Θ^conv = {θ^conv_0, ..., θ^conv_{n_m−1}} and Θ^fc = {θ^fc_0, ..., θ^fc_{n_m−1}} denote the weights and biases of the convolutional layers and the fully connected layers, respectively.

Let L_M(out_m, T) denote a loss function of the cascade M with respect to the output of the m-th component, averaged over the labeled dataset T. In order to train the cascade M, we propose the backtrack-training algorithm BT(M, T, n_e) (Algorithm 2). We emphasize that the training procedure first optimizes all the convolutional weights together with the weights of the last fully connected layers. Only then do we optimize the weights of the fully connected layers of the remaining components, one by one. Our approach differs from previous training procedures [TMK16, WYDG17], in which the loss functions associated with all the classifiers are jointly optimized.

Algorithm 2 BT(M, T, n_e): backtrack training of the cascade M = {M_0, ..., M_{n_m−1}}, in which each component DNN trains for n_e epochs over the training set T. The algorithm outputs the trained weights of the cascade M.
  1. Optimize Θ^conv ∪ θ^fc_{n_m−1} with L_M(out_{n_m−1}, T) for a multiple of n_e epochs.
  2. For m = 0 to n_m − 2: optimize θ^fc_m with L_M(out_m, T) for n_e epochs.
  3. Return Θ^conv ∪ Θ^fc.
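The following PyTorch-style sketch illustrates the two phases of Algorithm 2. It relies on several assumptions not fixed by the text: the cascade object exposes per-component parameter groups (conv_params, fc_params) and per-classifier logits, the loader yields labeled mini-batches, and the learning rate and the step-1 epoch multiplier are unspecified hyperparameters.

```python
import torch
import torch.nn.functional as F

def train_epochs(params, logits_fn, loader, num_epochs, lr=0.1):
    """Plain SGD with cross-entropy loss; L2 regularization via weight decay."""
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=1e-4)
    for _ in range(int(num_epochs)):
        for x, y in loader:
            opt.zero_grad()
            loss = F.cross_entropy(logits_fn(x), y)
            loss.backward()
            opt.step()

def backtrack_training(cascade, loader, n_e, step1_mult):
    """Algorithm 2 (BT), assuming `cascade` exposes:
       cascade.num_components -- n_m
       cascade.conv_params()  -- iterable of all convolutional weights/biases
       cascade.fc_params(m)   -- weights/biases of the m-th classifier
       cascade.logits(x, m)   -- logits of the m-th classifier for input x
    `step1_mult` stands in for the (larger) epoch budget of step 1."""
    last = cascade.num_components - 1
    # Step 1: jointly optimize all conv layers with the last classifier.
    train_epochs(list(cascade.conv_params()) + list(cascade.fc_params(last)),
                 lambda x: cascade.logits(x, last), loader, step1_mult * n_e)
    # Step 2: the conv weights are left out of the optimizer, so only the
    # remaining classifiers are trained, one by one, for n_e epochs each.
    for m in range(last):
        train_epochs(list(cascade.fc_params(m)),
                     lambda x, m=m: cascade.logits(x, m), loader, n_e)
    return cascade
```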
In our experiments, we noticed that the "long path" of the cascade required a larger number of training epochs. This is why the number of training epochs in step 1 of BT(M, T, n_e) is a multiple of n_e.

5 Setting the confidence thresholds

In this section we present an automatic methodology for setting the confidence threshold δ̂_m of each component M_m. Early termination is chosen if the confidence level reaches the threshold. The important feature of the automatic setting of the confidence thresholds is that one can change them on the fly during the inference stage.

Let T_m(δ) denote the subset of inputs for which the confidence measure of the m-th component is at least δ:
  T_m(δ) ≜ {(x, y) ∈ T | δ_m(x) ≥ δ}.
Let γ_m(δ) denote the number of times the classification output by component M_m is correct for inputs in T_m(δ):
  γ_m(δ) ≜ Σ_{(x,y) ∈ T_m(δ)} 1{out_m(x) = y}.
Let α_m(δ) denote the accuracy of component M_m with respect to T_m(δ):
  α_m(δ) ≜ γ_m(δ) / |T_m(δ)| if |T_m(δ)| > 0, and 0 otherwise.
Let α*_m denote the maximum accuracy of component M_m:
  α*_m ≜ max_{δ ∈ [0,1]} α_m(δ).
For an accuracy degradation ε > 0, we define the confidence threshold δ_m(ε) by
  δ_m(ε) ≜ min { δ | α_m(δ) ≥ α*_m − ε }.

In Algorithm 1, the confidence threshold vector δ̂ is set as follows: choose an ε ≥ 0, and set δ̂_m ← δ_m(ε) for every m. We remark that (i) the threshold for the last component should be zero, and (ii) one could use separate datasets for training the weights and setting the confidence thresholds.
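The threshold selection can be implemented directly from these definitions. The sketch below (NumPy; names are illustrative) takes the confidences δ_m(x) and correctness indicators of a single component on a labeled set and returns δ_m(ε) evaluated over the observed confidence values.

```python
import numpy as np

def confidence_threshold(confidences, correct, eps):
    """Compute delta_m(eps) = min{ delta : alpha_m(delta) >= alpha*_m - eps }.

    confidences -- array of delta_m(x) over the calibration samples
    correct     -- boolean array, True where out_m(x) equals the label
    eps         -- allowed accuracy degradation
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    # Candidate thresholds: the observed confidence values (plus 0).
    candidates = np.unique(np.concatenate(([0.0], confidences)))

    def alpha(delta):
        kept = confidences >= delta          # T_m(delta)
        return correct[kept].mean() if kept.any() else 0.0

    alphas = np.array([alpha(d) for d in candidates])
    alpha_star = alphas.max()                # alpha*_m
    feasible = candidates[alphas >= alpha_star - eps]
    return feasible.min()
```

Repeating this per component (and forcing the last threshold to zero) yields the vector δ̂ used by Algorithm 1.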
6 Experiments

This section elaborates on the experiments we performed on the CIFAR-10, CIFAR-100, and SVHN datasets.

Our experiments deal with image classification in which the convolutional layers follow the ResNet architecture [HZRS15a]. The input to the network is the per-pixel-standardized RGB image. The ResNet(n) architecture consists of 6n + 2 layers: (1) a 2D convolutional layer with 16 filters of size 3×3 applied to the 32×32×3 input; (2) three ResNet modules, each of which uses only 3×3 convolutions. Each module contains n ResNet blocks, where each block contains two convolutional layers with ReLU non-linearities, a skip connection, and batch normalization. See Figures 2a and 2b for the block structure. The classification is performed using a global average pooling layer followed by a fully connected layer with 64 inputs and n_c outputs and a softmax function.

We transformed the regular ResNet(n) architecture into a cascaded architecture with early termination by introducing two more classifiers branching from the first two ResNet modules. To improve accuracy, we enhanced the first two classifiers by increasing their feature map. In contrast to BranchyNet, we did not allocate additional convolutional layers. The resulting architecture, CI-ResNet(n), is depicted in Figure 2c. The overhead of the classifier enhancement is constant, and the enhancement further improves accuracy as the number of ResNet blocks per module increases. For example, for n = 18, the classifier enhancement incurs only a small increase in parameter count and computational effort per inference compared to the original ResNet(18) architecture.

We performed the training according to Algorithm 2. Simple data augmentation was employed only for the CIFAR models, as in [HZRS15a]. All models were trained from scratch. The model sizes and architectures used for CIFAR-10 and CIFAR-100 are identical except for the last FC layer in each classifier, containing 10 and 100 outputs respectively, which adds a relatively negligible number of parameters. The weights were initialized at random from N(0, √(2/k)), where k is the number of inputs to a neuron, as proposed by [HZRS15b]. All the models used a cross-entropy loss regularized by an L2 loss with a coefficient of 1e-4. Following the practice of [IS15], no dropout was applied. The CIFAR and SVHN models were trained with Stochastic Gradient Descent (SGD) for 160 and 50 epochs per classifier, respectively. The learning rate was scheduled as described in [HZRS15a].
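As an illustration of the intermediate classifiers described above, the sketch below implements one early-exit branch in PyTorch; the exact dimensions (pooling stride 8, a 64-unit hidden layer, 256 pooled features for the first branch) are read off Figure 2c and should be treated as assumptions rather than a definitive specification.

```python
import torch
import torch.nn as nn

class EarlyExitClassifier(nn.Module):
    """One intermediate classifier branch of CI-ResNet (sketch; dimensions
    assumed from Figure 2c for the first branch: a 32x32x16 feature map is
    average-pooled with stride 8 and flattened to 256 features)."""

    def __init__(self, in_features=256, hidden=64, n_classes=10, pool_stride=8):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=pool_stride)  # stride defaults to kernel size
        self.fc1 = nn.Linear(in_features, hidden)
        self.bn = nn.BatchNorm1d(hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, feature_map):
        h = self.pool(feature_map).flatten(start_dim=1)
        h = torch.relu(self.bn(self.fc1(h)))
        probs = torch.softmax(self.fc2(h), dim=1)
        delta, out = probs.max(dim=1)   # confidence (Def. 3.3) and class (Def. 3.2)
        return out, delta
```

Returning the (out, delta) pair keeps the branch compatible with the cascaded-inference sketch given after Algorithm 1.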
Figure 2: (a) A general ResNet block. (b) The first ResNet block in each module performs sub-sampling using stride 2. (c) The CI-ResNet(n) architecture used in our experiments.

We trained the CI-ResNet(18) model and evaluated its performance using various ε values. The tradeoff between the test accuracy and the number of MACs required for a single inference is shown in Figure 3. The MAC counts were obtained analytically by summing up the linear operations in the convolutional layers and the fully connected layers, excluding activations and batch normalization. Quantitative results appear in Table 2. Similar gains are reported for SkipNet with ResNet-110 [WYDG17]; note, however, that our approach does not incur extra computation for gating.

Figure 3: Cascaded inference with early termination: test accuracy vs. average number of MAC operations (Giga-MACs) per inference. The curves correspond to confidence threshold vectors chosen w.r.t. ε ∈ {0%, 1%, 2%, 4%, 8%}. (a) CIFAR-10. (b) CIFAR-100. (c) SVHN.

Table 2: Cascaded inference with early termination. The accuracy of a cascade of i components is listed in column M_{0,...,i−1}. Accuracy (top) and speedup (bottom) using early termination based on the confidence thresholds δ_m(ε) are listed for five values of ε. Speedup is relative to the computational effort of the full M_{0,1,2} cascade.

Dataset     | M_0   | M_{0,1} | M_{0,1,2} | ε=0%  | ε=1%   | ε=2%  | ε=4%  | ε=8%
CIFAR-10    | –.5%  | 81.4%   | 93.1%     | 93.1% | 92.7%  | 91.9% | 91.1% | 86.–%
  speedup   |       |         |           | 1.064 | 1.377  | 1.513 | 1.735 | 2.–
CIFAR-100   | –.1%  | 50.0%   | 70.5%     | 70.5% | 70.65% | 70.5% | 70.3% | 69.–%
  speedup   |       |         |           | 1.009 | 1.044  | 1.072 | 1.116 | 1.–
SVHN        | –.8%  | 85.2%   | 97.0%     | 97.0% | 95.6%  | 94.0% | 91.3% | 89.–%
  speedup   |       |         |           | –.001 | 2.168  | 2.438 | 2.773 | 2.–
For the CI-ResNet(18) architecture, we measured the accuracy α_m(δ) (see the definition in Section 5) of each classifier independently. This time, α_m(δ) was determined with respect to the test set rather than the training set. The plots in Figure 4 show how the choice of the threshold provides control over the test accuracy. In addition, we examined the frequency of the different δ values, which is shown as a bar plot in Figure 4. The distribution of the confidences of the first two components of the cascade is relatively uniform. The distribution of the confidences of the last classifier is of no importance, since in our inference approach the confidence threshold of the last classifier is set to δ̂_{n_m−1} = 0. Note that the range of α_m(δ) starts at the accuracy of M_m and ends at the accuracy that corresponds to the highest confidence measure. The almost linear behavior of α_m(δ) as a function of δ justifies basing the confidence threshold on the softmax output.

Figure 4: Softmax as a confidence measure. The line plots show the accuracy α_m(δ) of each classifier in the cascade independently. The bar plots present the frequency of the different confidence levels sampled over the test set. All plots were obtained by separately testing the three component DNNs of the CI-ResNet(18) architecture with respect to the test sets of the CIFAR and SVHN datasets. (a) CIFAR-10. (b) CIFAR-100. (c) SVHN.
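The analytic MAC counts used in this section can be reproduced with a simple per-layer calculation; the sketch below counts multiply-accumulate operations for convolutional and fully connected layers only, as described above (the layer shapes in the example are illustrative).

```python
def conv2d_macs(out_h, out_w, out_channels, in_channels, k_h, k_w):
    """MACs of a 2D convolution: one k_h*k_w*in_channels dot product
    per output element."""
    return out_h * out_w * out_channels * in_channels * k_h * k_w

def fc_macs(in_features, out_features):
    """MACs of a fully connected layer."""
    return in_features * out_features

# Example: the first 3x3 convolution of CI-ResNet (3 -> 16 channels, 32x32 output)
# plus a 256 -> 64 fully connected layer of an intermediate classifier.
total = conv2d_macs(32, 32, 16, 3, 3, 3) + fc_macs(256, 64)
print(f"{total} MACs")
```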
7 Future work

As further research, cascading can be applied to RNNs [BBPP15]; alternatively, the impact of the depth of feedforward DNNs on the confidence estimation can be investigated. A gap between the allowed accuracy degradation (ε) and the actual test accuracy degradation was especially evident on the CIFAR-100 dataset. We believe that if one determines the thresholds with respect to a validation set, rather than the training set, this gap will be reduced.

From a digital hardware point of view, we see two interesting directions to investigate in the context of cascaded inference. First, the impact on cache memory performance can be observed as a function of the confidence threshold adjustment. Second, an innovative hardware architecture can be proposed to support cascaded inference and to provide high throughput via allocation of resources to each component DNN, taking advantage of the locality at each component.

8 Conclusion

We showed that using the softmax output as a confidence measure in a cascade of DNNs can provide a –× to –× speedup at a cost of – classification accuracy loss. This approach is simple and requires no retraining when the confidence thresholds must be adjusted after the network has been trained. In addition, fascinating properties of the softmax function as a confidence measure were revealed. We showed that, if trained properly, a cascade of DNN components can reliably indicate its confidence level directly through the softmax output.

Acknowledgments
We thank Nissim Halabi and Moni Shahar for useful conversations.
References

[BBPP15] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. CoRR, abs/1511.06297, 2015.

[Bod10] Peter Bodik. Automating datacenter operations using machine learning. PhD thesis, UC Berkeley, 2010.

[BWDS17] Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. Adaptive neural networks for fast test-time prediction. CoRR, abs/1702.07811, 2017.

[CSTV95] L. P. Cordella, C. De Stefano, F. Tortorella, and M. Vento. A method for improving classification reliability of multilayer perceptrons. IEEE Transactions on Neural Networks, 6(5):1140–1147, Sep 1995.

[CZC+10] B. Barla Cambazoglu, Hugo Zaragoza, Olivier Chapelle, Jiang Chen, Ciya Liao, Zhaohui Zheng, and Jon Degenhardt. Early exit optimizations for additive machine learned ranking systems. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 411–420, New York, NY, USA, 2010. ACM.

[FCZ+17] Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[GEY17] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4878–4887. Curran Associates, Inc., 2017.

[Gra16] Alex Graves. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983, 2016.

[HCL+17] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844, 2017.

[HZRS15a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[HZRS15b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.

[IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[LBDC+17] Sam Leroux, Steven Bohez, Elias De Coninck, Tim Verbelen, Bert Vankeirsbilck, Pieter Simoens, and Bart Dhoedt. The cascading neural network: building the internet of smart things. Knowledge and Information Systems, 52(3):791–814, 2017.

[LBG+16] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, L. Jiao, L. Qendro, and F. Kawsar. DeepX: A software accelerator for low-power deep learning inference on mobile devices. In 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pages 1–12, April 2016.

[OLO17] Augustus Odena, Dieterich Lawson, and Christopher Olah. Changing model behavior at test-time using reinforcement learning. arXiv preprint arXiv:1702.07780, 2017.

[PSR16] P. Panda, A. Sengupta, and K. Roy. Conditional deep learning for energy-efficient and enhanced pattern recognition. In Design, Automation and Test in Europe Conference (DATE), pages 475–480, March 2016.

[SCYE17] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. Efficient processing of deep neural networks: A tutorial and survey. CoRR, abs/1703.09039, 2017.

[SSV00] C. De Stefano, C. Sansone, and M. Vento. To reject or not to reject: that is the question-an answer in case of neural classifiers. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 30(1):84–94, Feb 2000.

[TMK16] S. Teerapittayanon, B. McDanel, and H. T. Kung. BranchyNet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 2464–2469, Dec 2016.

[VJ01] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001.

[WYDG17] Xin Wang, Fisher Yu, Zi-Yi Dou, and Joseph E. Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. arXiv preprint arXiv:1711.09485, 2017.