Dynamically Sacrificing Accuracy for Reduced Computation: Cascaded Inference Based on Softmax Confidence
Konstantin Berestizshevsky
School of Electrical Engineering, Tel Aviv University
[email protected]
Guy Even
School of Electrical Engineering, Tel Aviv University
[email protected]
Abstract
We study the tradeoff between computational effort and accuracy in a cascade of deep neural networks. During inference, early termination in the cascade is controlled by confidence levels derived directly from the softmax outputs of intermediate classifiers. The advantage of early termination is that classification is performed using less computation, thus adjusting the computational effort to the complexity of the input. Moreover, dynamic modification of confidence thresholds allows one to trade accuracy for computational effort without requiring retraining. Basing early termination on softmax classifier outputs is justified by experimentation that demonstrates an almost linear relation between confidence levels in intermediate classifiers and accuracy. Our experimentation with architectures based on ResNet obtained the following results: (i) a speedup of – that sacrifices – accuracy with respect to the CIFAR-10 test set; (ii) a speedup of – that sacrifices – accuracy with respect to the CIFAR-100 test set; (iii) a speedup of – that sacrifices – accuracy with respect to the SVHN test set.

1 Introduction

The quest for high accuracy in Deep Neural Networks (DNNs) has led to the design of large networks consisting of hundreds of layers with millions of trainable weights. Computation in such DNNs requires billions of multiply-accumulate operations (MACs) for a single image classification [SCYE17]. The natural question that arises is whether this amount of computation is indeed required [PSR16].

In this paper, we focus on the computational effort spent on inference in DNNs. (For simplicity, we measure the computational effort by the number of multiply-accumulate operations (MACs).) Many conjecture that the computational effort required for classifying images is not constant and depends on the image [BBPP15, Gra16, FCZ+17, PSR16, TMK16]. We claim that the required computational effort for classification is an intrinsic yet hidden property of the images. Namely, some images are much easier to classify than others, but the computational effort needed for classification is hard to predict before classification is completed.

The desire to spend the "right" computational effort in classification leads to the first goal in this work.
Goal 1.1.
Provide an architecture in which the computational effort is proportional to the complexity of the input.
We also consider a setting in which the system's power consumption or throughput is not fixed. Examples of such settings are: (1) As the battery drains in a mobile device, one would like to enter a "power saving mode" in which less power is spent per classification. (2) If the input rate increases in a real-time system (e.g., due to a burst of inputs), then one must spend less time per input [CZC+10]. (3) Timely processing in a data center during spikes in query arrival rates may require reducing the computational effort per query [Bod10].

Dynamic changes in the computational effort or the throughput lead to the second goal in this work.
Goal 1.2.
Introduce the ability to dynamically control the computational effort while sacrificing accuracy as little as possible. Such changes in the computational effort should not involve retraining of the DNN.
We propose an architecture that is based on a cascade of DNNs [BWDS17], depicted in Figure 1. The cascade comprises multiple DNNs (e.g., three DNNs), called component DNNs. Classification takes place by invoking the component DNNs in increasing order of complexity, and stopping the computation as soon as the confidence level reaches the desired level. Our component DNNs are not disjoint; namely, the layers of earlier components are part of the processing of subsequent components. The advantage of this approach is that the next component reuses the computational outcome of the previous component and further refines it.

The decision of which DNN component is the final one invoked in the computation is based on the softmax output of the invoked DNNs. We define a simple confidence threshold, based on the softmax output, that allows for trading off (a little) accuracy for (a substantial reduction in) computational effort. Basing the stopping condition on the softmax output has two advantages over previous methods [PSR16, BWDS17]: (1) simplicity, as we do not require an additional training step for configuring the control that selects the output; (2) improved tradeoffs of computational effort vs. accuracy.
We focus on the task of image classification. Our setting is applicable to other data, as we impose no limitations on the number of classes or the distribution of the input data. The cascading of DNNs allows for the usage of various DNN layers, including convolutional layers and fully connected layers, alongside batch-normalization and pooling layers. We do not limit the non-linear functions employed by the neurons, and we consider only classification networks terminating with a softmax function.
2 Related work

The two principal techniques that we employ are cascaded classification and confidence estimation. Cascaded classification is suggested in the seminal work of [VJ01]. As opposed to voting or stacking ensembles, in which a classification is derived from the outputs of multiple experts (e.g., by majority), the decision in a cascaded architecture is based on the last expert. Uncertainty measures of classifiers are discussed in [CSTV95, SSV00]. These works address the degree of confidence that a classifier has about its output. We elaborate on recent usage of these techniques hereinafter.
2.1 Cascaded classification

A cascaded neural network architecture for computer vision is presented in [HCL+17]. The work of [LBDC+17] presented the idea of early stopping in a setting in which the cascaded DNNs are distributed among multiple devices. Reinforcement learning is employed in [OLO17] in a cascade of meta-layers to train controllers that select computational modules per meta-layer.

2.2 Confidence estimation
Confidence of an assembly of algorithms is investigated by Fagin et al. [LBG+16] in a general setup. Fagin et al. define instance optimality and suggest terminating the execution according to a criterion based on a threshold.

Rejection refers to the event that a classifier is not confident about its outcome, and hence the output is rendered unreliable. In [GEY17], a selective classification technique is presented in which a classifier and a rejection function are trained together. The goal is to obtain coverage (i.e., at least one classifier does not reject) while controlling the risk via rejection functions. They proposed a softmax-response mechanism for deriving the rejection function and discussed how the true risk of a classifier (i.e., the average loss over all the non-rejected samples) can be traded off with its coverage (i.e., the mass of the non-rejected region in the input space). Our work adopts the usage of the softmax response as a confidence rate function; however, it differs in the way we apply the confidence threshold. Namely, we propose a cascade of classifiers that terminates as soon as a desired confidence threshold is reached.
The work of [CZC+10] presented an additive ensemble machine learning approach with early exits in the context of web document ranking. In the additive approach, the sum of the outputs of a prefix of the classifiers provides the current output confidence.

The work of [TMK16] presented the BranchyNet approach, in which a network architecture has multiple branches; each branch consists of a few convolutional layers terminated by a classifier and a softmax function. Confidence of an output vector y in BranchyNet is derived from the entropy function entropy(y) = −Σ_c y_c log y_c. Our approach attempts to reduce the amount of computation that takes place outside the "main path", so that computations that take place in rejected branches are a negligible fraction of the total computation. In addition, we derive the confidence by taking the maximum over the softmax output. Finally, in [TMK16], automatic setting of threshold levels is not developed.

Cascaded classification with dedicated linear confidence estimations (rather than softmax) appears in the Conditional Deep Learning (CDL) of [PSR16]. Cascaded classification with confidence estimation appears also in the SACT mechanism of [FCZ+17]. In [BWDS17], after each component k, a special decision function γ_k is trained to decide whether an exit should be chosen. We conjecture that it is hard to decide whether an exit should be chosen based on a convolutional layer without using a classifier. Indeed, in [BWDS17], a speedup of roughly – is achieved for every doubling of the accuracy loss.
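To make the two confidence measures concrete, the following minimal sketch (NumPy; the logits are hypothetical) computes both the entropy-based confidence used by BranchyNet and the maximum-softmax confidence adopted in this work.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a logit vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical logits produced by an intermediate classifier for one input.
z = np.array([2.1, 0.3, -1.0, 0.5])
y = softmax(z)

# BranchyNet-style confidence: low entropy(y) means high confidence.
entropy = -np.sum(y * np.log(y))

# Confidence used in this work: the maximal softmax value.
max_softmax = np.max(y)

print(f"entropy = {entropy:.3f}, max softmax = {max_softmax:.3f}")
```

In the two-class case the two scores induce the same ordering of inputs; with many classes they can differ, and the cascade described below only needs the maximum, which is also the cheapest to compute.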
Table 1 depicts the parameters and notation used in this paper.

3 Cascaded inference with early termination

A cascade of DNNs is a chain of convolutional layers with branchings between layers to classifiers (see Figure 1). Early termination in cascaded DNN components means that intermediate feature maps are evaluated by classifiers. These classifiers attempt to classify the feature map and output a confidence measurement of their classification. If the confidence level is above a threshold, then execution terminates, and the classification of the intermediate feature map is output. See Figure 1 for an example of a cascaded architecture based on three convolutional layers. In our experimentation, we employ ResNet modules [HZRS15a] as component DNNs in our cascade.

Table 1: Notation and definitions used in this paper.
  n_e ∈ ℕ: number of training epochs.
  n_m ∈ ℕ: number of component DNNs in the cascade.
  n_c ∈ ℕ: number of classes in the classification task.
  n ∈ ℕ: number of ResNet blocks in a ResNet module.
  T: labeled training set, containing pairs of inputs and corresponding labels.
  M: set of component DNNs that form a cascade (|M| = n_m).
  M_m: the m-th component in the cascade, m ∈ {0, ..., n_m − 1}.
  θ^conv_m: weights and biases of the convolutional layers in component M_m.
  Θ^conv: weights and biases of the convolutional layers in the cascade.
  Θ^fc: weights and biases of the fully connected layers of the cascade.
  θ^fc_m: weights and biases of the fully connected layers of component M_m.
  out_m(x) ∈ {0, ..., n_c − 1}: class predicted by component M_m for input x.
  δ_m(x) ∈ [0, 1]: confidence output by component M_m for input x.
  δ̂_m ∈ [0, 1]: confidence threshold of component m.

Each component in a cascaded architecture consists of convolutional layers followed by a branching that leads to (1) a classifier, and (2) the next component.
Figure 1: An example of a cascaded architecture of three component DNNs with early termination. A cascade of convolutional layers (CONV_0, CONV_1, CONV_2) ends with a fully connected layer FC_2. Early termination is enabled by introducing fully connected networks FC_i after convolutional layers. The output of each fully connected layer consists of a classification out_i and a confidence measurement δ_i.

The usage of the threshold for determining early termination in the cascade is listed as Algorithm 1. The algorithm applies the component DNNs one by one and stops as soon as the confidence measure reaches the confidence threshold of the current component. This approach differs from previous cascaded architectures, in which a combination (e.g., a sum) of the confidence measures of the components is used to control the execution [FCZ+17, CZC+10].

Algorithm 1 CI(M, δ̂, x): sequential execution of DNN components in a cascaded architecture. Early termination takes place as soon as the confidence level reaches the confidence threshold. The parameters are the cascade M, the confidence thresholds δ̂ = (δ̂_0, ..., δ̂_{n_m−1}), and the input x.
  1. For m = 0 to n_m − 1 do:
     (a) (out_m(x), δ_m(x)) ← M_m(x)
     (b) If δ_m(x) ≥ δ̂_m then return out_m(x)
  2. Return out_{n_m−1}(x).
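A minimal Python sketch of Algorithm 1 is given below; it assumes each component M_m is a callable returning the pair (out_m(x), δ_m(x)), so the component implementations themselves are placeholders.

```python
def cascaded_inference(components, thresholds, x):
    """Algorithm 1 (CI): evaluate the component DNNs in order and terminate
    as soon as a component's confidence reaches its threshold.

    components -- list of n_m callables; each maps x to (out_m, delta_m)
    thresholds -- list of n_m confidence thresholds (the last one is
                  typically 0, so the deepest component always answers)
    x          -- a single input sample
    """
    out_m = None
    for m, component in enumerate(components):
        out_m, delta_m = component(x)
        if delta_m >= thresholds[m]:
            return out_m            # early termination
    return out_m                    # output of the last component
```

Because the thresholds are plain numbers compared against the softmax confidence, they can be changed between queries without touching the trained weights.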
3.1 Softmax confidence

Every classifier consists of one or more fully connected layers followed by a softmax function. Let z_m ∈ ℝ^{n_c} denote the input to the softmax function in the m-th classifier of the cascade. Let s_m ∈ [0, 1]^{n_c} denote the softmax value in the m-th classifier. The softmax value is defined as follows.

Definition 3.1 (softmax). s_m[i] = e^{z_m[i]} / Σ_{c=0}^{n_c−1} e^{z_m[c]}.

The output out_m ∈ {0, ..., n_c − 1} and the confidence measure δ_m ∈ [0, 1] with respect to this output are defined as follows.

Definition 3.2. out_m ≜ argmax_c { s_m[c] | 0 ≤ c ≤ n_c − 1 }.

Definition 3.3. δ_m ≜ max_c { s_m[c] | 0 ≤ c ≤ n_c − 1 }.

4 Training the cascade

In this section we present the training procedure of the DNN components and classifiers. Consider a cascaded architecture with n_m components. We denote this cascade by M = (M_0, ..., M_{n_m−1}), where M_m denotes the m-th component in the cascade. Let θ^conv_m (θ^fc_m, resp.) denote the weights and biases of the convolutional (fully connected, resp.) layers in component M_m. Let Θ^conv = {θ^conv_0, ..., θ^conv_{n_m−1}} and Θ^fc = {θ^fc_0, ..., θ^fc_{n_m−1}} denote the weights and biases of the convolutional layers and the fully connected layers, respectively.

Let L_M(out_m, T) denote a loss function of the cascade M with respect to the output of the m-th component, averaged over the labeled dataset T. In order to train the cascade M, we propose the backtrack-training algorithm BT(M, T, n_e) (Algorithm 2). We emphasize that the training procedure first optimizes all the convolutional weights together with the weights of the last fully connected layers. Only then do we optimize the weights of the fully connected layers of the remaining components, one by one. Our approach differs from previous training procedures [TMK16, WYDG17], in which the loss functions associated with all the classifiers are jointly optimized.

Algorithm 2 BT(M, T, n_e): backtrack training of the cascade M = {M_0, ..., M_{n_m−1}}, in which each component DNN trains for n_e epochs over the training set T. The algorithm outputs the trained weights of the cascade M.
  1. Optimize Θ^conv ∪ θ^fc_{n_m−1} with L_M(out_{n_m−1}, T) for a multiple of n_e epochs.
  2. For m = 0 to n_m − 2: optimize θ^fc_m with L_M(out_m, T) for n_e epochs.
  3. Return Θ^conv ∪ Θ^fc.
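The following PyTorch-style sketch illustrates the two phases of Algorithm 2. It relies on several assumptions not fixed by the text: the cascade object exposes per-component parameter groups (conv_params, fc_params) and per-classifier logits, the loader yields labeled mini-batches, and the learning rate and the step-1 epoch multiplier are unspecified hyperparameters.

```python
import torch
import torch.nn.functional as F

def train_epochs(params, logits_fn, loader, num_epochs, lr=0.1):
    """Plain SGD with cross-entropy loss; L2 regularization via weight decay."""
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=1e-4)
    for _ in range(int(num_epochs)):
        for x, y in loader:
            opt.zero_grad()
            loss = F.cross_entropy(logits_fn(x), y)
            loss.backward()
            opt.step()

def backtrack_training(cascade, loader, n_e, step1_mult):
    """Algorithm 2 (BT), assuming `cascade` exposes:
       cascade.num_components -- n_m
       cascade.conv_params()  -- iterable of all convolutional weights/biases
       cascade.fc_params(m)   -- weights/biases of the m-th classifier
       cascade.logits(x, m)   -- logits of the m-th classifier for input x
    `step1_mult` stands in for the (larger) epoch budget of step 1."""
    last = cascade.num_components - 1
    # Step 1: jointly optimize all conv layers with the last classifier.
    train_epochs(list(cascade.conv_params()) + list(cascade.fc_params(last)),
                 lambda x: cascade.logits(x, last), loader, step1_mult * n_e)
    # Step 2: the conv weights are left out of the optimizer, so only the
    # remaining classifiers are trained, one by one, for n_e epochs each.
    for m in range(last):
        train_epochs(list(cascade.fc_params(m)),
                     lambda x, m=m: cascade.logits(x, m), loader, n_e)
    return cascade
```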
In our experiments, we noticed that the "long path" of the cascade required a larger number of training epochs. This is why the number of training epochs in step 1 of BT(M, T, n_e) is a multiple of n_e.

5 Setting the confidence thresholds

In this section we present an automatic methodology for setting the confidence threshold δ̂_m of each component M_m. Early termination is chosen if the confidence level reaches the threshold. The important feature of the automatic setting of the confidence thresholds is that one can change them on the fly during the inference stage.

Let T_m(δ) denote the subset of inputs for which the confidence measure of the m-th component is at least δ:
  T_m(δ) ≜ {(x, y) ∈ T | δ_m(x) ≥ δ}.
Let γ_m(δ) denote the number of times the classification output by component M_m is correct for inputs in T_m(δ):
  γ_m(δ) ≜ Σ_{(x,y) ∈ T_m(δ)} 1{out_m(x) = y}.
Let α_m(δ) denote the accuracy of component M_m with respect to T_m(δ):
  α_m(δ) ≜ γ_m(δ) / |T_m(δ)| if |T_m(δ)| > 0, and 0 otherwise.
Let α*_m denote the maximum accuracy of component M_m:
  α*_m ≜ max_{δ ∈ [0,1]} α_m(δ).
For an accuracy degradation ε > 0, we define the confidence threshold δ_m(ε) by
  δ_m(ε) ≜ min { δ | α_m(δ) ≥ α*_m − ε }.

In Algorithm 1, the confidence threshold vector δ̂ is set as follows: choose an ε ≥ 0, and set δ̂_m ← δ_m(ε) for every m. We remark that (i) the threshold for the last component should be zero, and (ii) one could use separate datasets for training the weights and setting the confidence thresholds.
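The threshold selection can be implemented directly from these definitions. The sketch below (NumPy; names are illustrative) takes the confidences δ_m(x) and correctness indicators of a single component on a labeled set and returns δ_m(ε) evaluated over the observed confidence values.

```python
import numpy as np

def confidence_threshold(confidences, correct, eps):
    """Compute delta_m(eps) = min{ delta : alpha_m(delta) >= alpha*_m - eps }.

    confidences -- array of delta_m(x) over the calibration samples
    correct     -- boolean array, True where out_m(x) equals the label
    eps         -- allowed accuracy degradation
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    # Candidate thresholds: the observed confidence values (plus 0).
    candidates = np.unique(np.concatenate(([0.0], confidences)))

    def alpha(delta):
        kept = confidences >= delta          # T_m(delta)
        return correct[kept].mean() if kept.any() else 0.0

    alphas = np.array([alpha(d) for d in candidates])
    alpha_star = alphas.max()                # alpha*_m
    feasible = candidates[alphas >= alpha_star - eps]
    return feasible.min()
```

Repeating this per component (and forcing the last threshold to zero) yields the vector δ̂ used by Algorithm 1.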
6 Experiments

This section elaborates on the experiments we performed on the CIFAR-10, CIFAR-100, and SVHN datasets.

Our experiments deal with image classification in which the convolutional layers follow the ResNet architecture [HZRS15a]. The input to the network is the per-pixel-standardized RGB image. The ResNet(n) architecture consists of 6n + 2 layers: (1) a 2D convolutional layer with 16 filters of size 3×3 applied to the 32×32×3 input; (2) three ResNet modules, each of which uses only 3×3 convolutions. Each module contains n ResNet blocks, where each block contains two convolutional layers with ReLU non-linearities, a skip connection, and batch normalization. See Figures 2a and 2b for the block structure. The classification is performed using a global average pooling layer followed by a fully connected layer with 64 inputs and n_c outputs and a softmax function.

We transformed the regular ResNet(n) architecture into a cascaded architecture with early termination by introducing two more classifiers branching from the first two ResNet modules. To improve accuracy, we enhanced the first two classifiers by increasing their feature map. In contrast to BranchyNet, we did not allocate additional convolutional layers. The resulting architecture, CI-ResNet(n), is depicted in Figure 2c. The overhead of the classifier enhancement is constant, and the enhancement further improves accuracy as the number of ResNet blocks per module increases. For example, for n = 18, the classifier enhancement incurs only a small increase in parameter count and computational effort per inference compared to the original ResNet(18) architecture.

We performed the training according to Algorithm 2. Simple data augmentation was employed only for the CIFAR models, as in [HZRS15a]. All models were trained from scratch. The model sizes and architectures used for CIFAR-10 and CIFAR-100 are identical except for the last FC layer in each classifier, containing 10 and 100 outputs respectively, which adds a relatively negligible number of parameters. The weights were initialized at random from N(0, √(2/k)), where k is the number of inputs to a neuron, as proposed by [HZRS15b]. All the models used a cross-entropy loss regularized by an L2 loss with a coefficient of 1e-4. Following the practice of [IS15], no dropout was applied. The CIFAR and SVHN models were trained with Stochastic Gradient Descent (SGD) for 160 and 50 epochs per classifier, respectively. The learning rate was scheduled as described in [HZRS15a].
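As an illustration of the intermediate classifiers described above, the sketch below implements one early-exit branch in PyTorch; the exact dimensions (pooling stride 8, a 64-unit hidden layer, 256 pooled features for the first branch) are read off Figure 2c and should be treated as assumptions rather than a definitive specification.

```python
import torch
import torch.nn as nn

class EarlyExitClassifier(nn.Module):
    """One intermediate classifier branch of CI-ResNet (sketch; dimensions
    assumed from Figure 2c for the first branch: a 32x32x16 feature map is
    average-pooled with stride 8 and flattened to 256 features)."""

    def __init__(self, in_features=256, hidden=64, n_classes=10, pool_stride=8):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=pool_stride)  # stride defaults to kernel size
        self.fc1 = nn.Linear(in_features, hidden)
        self.bn = nn.BatchNorm1d(hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, feature_map):
        h = self.pool(feature_map).flatten(start_dim=1)
        h = torch.relu(self.bn(self.fc1(h)))
        probs = torch.softmax(self.fc2(h), dim=1)
        delta, out = probs.max(dim=1)   # confidence (Def. 3.3) and class (Def. 3.2)
        return out, delta
```

Returning the (out, delta) pair keeps the branch compatible with the cascaded-inference sketch given after Algorithm 1.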
Figure 2: (a) A general ResNet block. (b) The first ResNet block in each module performs sub-sampling using stride 2. (c) The CI-ResNet(n) architecture used in our experiments.

We trained the CI-ResNet(18) model and evaluated its performance using various ε values. The tradeoff between the test accuracy and the number of MACs required for a single inference is shown in Figure 3. The MAC counts were obtained analytically by summing up the linear operations in the convolutional layers and the fully connected layers, excluding activations and batch normalization. Quantitative results appear in Table 2. Similar gains are reported for SkipNet with ResNet-110 [WYDG17]; note, however, that our approach does not incur extra computation for gating.

Figure 3: Cascaded inference with early termination: test accuracy vs. average number of MAC operations (Giga-MACs) per inference. The curves correspond to confidence threshold vectors chosen w.r.t. ε ∈ {0%, 1%, 2%, 4%, 8%}. (a) CIFAR-10. (b) CIFAR-100. (c) SVHN.

Table 2: Cascaded inference with early termination. The accuracy of a cascade of i components is listed in column M_{0,...,i−1}. Accuracy (top) and speedup (bottom) using early termination based on the confidence thresholds δ_m(ε) are listed for five values of ε. Speedup is relative to the computational effort of the full M_{0,1,2} cascade.

Dataset     | M_0   | M_{0,1} | M_{0,1,2} | ε=0%  | ε=1%   | ε=2%  | ε=4%  | ε=8%
CIFAR-10    | –.5%  | 81.4%   | 93.1%     | 93.1% | 92.7%  | 91.9% | 91.1% | 86.–%
  speedup   |       |         |           | 1.064 | 1.377  | 1.513 | 1.735 | 2.–
CIFAR-100   | –.1%  | 50.0%   | 70.5%     | 70.5% | 70.65% | 70.5% | 70.3% | 69.–%
  speedup   |       |         |           | 1.009 | 1.044  | 1.072 | 1.116 | 1.–
SVHN        | –.8%  | 85.2%   | 97.0%     | 97.0% | 95.6%  | 94.0% | 91.3% | 89.–%
  speedup   |       |         |           | –.001 | 2.168  | 2.438 | 2.773 | 2.–
For the CI-ResNet(18) architecture, we measured the accuracy α_m(δ) (see the definition in Section 5) of each classifier independently. This time, α_m(δ) was determined with respect to the test set rather than the training set. The plots in Figure 4 show how the choice of the threshold provides control over the test accuracy. In addition, we examined the frequency of the different δ values, which is shown as a bar plot in Figure 4. The distribution of the confidences of the first two components of the cascade is relatively uniform. The distribution of the confidences of the last classifier is of no importance, since in our inference approach the confidence threshold of the last classifier is set to δ̂_{n_m−1} = 0. Note that the range of α_m(δ) starts at the accuracy of M_m and ends at the accuracy that corresponds to the highest confidence measure. The almost linear behavior of α_m(δ) as a function of δ justifies basing the confidence threshold on the softmax output.

Figure 4: Softmax as a confidence measure. The line plots show the accuracy α_m(δ) of each classifier in the cascade independently. The bar plots present the frequency of the different confidence levels sampled over the test set. All plots were obtained by separately testing the three component DNNs of the CI-ResNet(18) architecture with respect to the test sets of the CIFAR and SVHN datasets. (a) CIFAR-10. (b) CIFAR-100. (c) SVHN.
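The analytic MAC counts used in this section can be reproduced with a simple per-layer calculation; the sketch below counts multiply-accumulate operations for convolutional and fully connected layers only, as described above (the layer shapes in the example are illustrative).

```python
def conv2d_macs(out_h, out_w, out_channels, in_channels, k_h, k_w):
    """MACs of a 2D convolution: one k_h*k_w*in_channels dot product
    per output element."""
    return out_h * out_w * out_channels * in_channels * k_h * k_w

def fc_macs(in_features, out_features):
    """MACs of a fully connected layer."""
    return in_features * out_features

# Example: the first 3x3 convolution of CI-ResNet (3 -> 16 channels, 32x32 output)
# plus a 256 -> 64 fully connected layer of an intermediate classifier.
total = conv2d_macs(32, 32, 16, 3, 3, 3) + fc_macs(256, 64)
print(f"{total} MACs")
```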
7 Future work

As further research, cascading can be applied to RNNs [BBPP15]; alternatively, the impact of the depth of feedforward DNNs on the confidence estimation can be investigated. A gap between the allowed accuracy degradation (ε) and the actual test accuracy degradation was especially evident on the CIFAR-100 dataset. We believe that if one determines the thresholds with respect to a validation set, rather than the training set, this gap will be reduced.

From a digital hardware point of view, we see two interesting directions to investigate in the context of cascaded inference. First, the impact on cache memory performance can be observed as a function of the confidence threshold adjustment. Second, an innovative hardware architecture can be proposed to support cascaded inference and to provide high throughput via allocation of resources to each component DNN, taking advantage of the locality at each component.

8 Conclusion

We showed that using the softmax output as a confidence measure in a cascade of DNNs can provide a –× to –× speedup at a cost of – classification accuracy loss. This approach is simple and requires no retraining when the confidence thresholds must be adjusted after the network has been trained. In addition, fascinating properties of the softmax function as a confidence measure were revealed. We showed that, if trained properly, a cascade of DNN components can reliably indicate its confidence level directly through the softmax output.

Acknowledgments
We thank Nissim Halabi and Moni Shahar for useful conversations.
References

[BBPP15] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. CoRR, abs/1511.06297, 2015.

[Bod10] Peter Bodik. Automating datacenter operations using machine learning. PhD thesis, UC Berkeley, 2010.

[BWDS17] Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. Adaptive neural networks for fast test-time prediction. CoRR, abs/1702.07811, 2017.

[CSTV95] L. P. Cordella, C. De Stefano, F. Tortorella, and M. Vento. A method for improving classification reliability of multilayer perceptrons. IEEE Transactions on Neural Networks, 6(5):1140–1147, Sep 1995.

[CZC+10] B. Barla Cambazoglu, Hugo Zaragoza, Olivier Chapelle, Jiang Chen, Ciya Liao, Zhaohui Zheng, and Jon Degenhardt. Early exit optimizations for additive machine learned ranking systems. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 411–420, New York, NY, USA, 2010. ACM.

[FCZ+17] Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[GEY17] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4878–4887. Curran Associates, Inc., 2017.

[Gra16] Alex Graves. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983, 2016.

[HCL+17] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844, 2017.

[HZRS15a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[HZRS15b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.

[IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[LBDC+17] Sam Leroux, Steven Bohez, Elias De Coninck, Tim Verbelen, Bert Vankeirsbilck, Pieter Simoens, and Bart Dhoedt. The cascading neural network: building the internet of smart things. Knowledge and Information Systems, 52(3):791–814, 2017.

[LBG+16] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, L. Jiao, L. Qendro, and F. Kawsar. DeepX: A software accelerator for low-power deep learning inference on mobile devices. In 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pages 1–12, April 2016.

[OLO17] Augustus Odena, Dieterich Lawson, and Christopher Olah. Changing model behavior at test-time using reinforcement learning. arXiv preprint arXiv:1702.07780, 2017.

[PSR16] P. Panda, A. Sengupta, and K. Roy. Conditional deep learning for energy-efficient and enhanced pattern recognition. In Design, Automation and Test in Europe Conference (DATE), pages 475–480, March 2016.

[SCYE17] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. Efficient processing of deep neural networks: A tutorial and survey. CoRR, abs/1703.09039, 2017.

[SSV00] C. De Stefano, C. Sansone, and M. Vento. To reject or not to reject: that is the question-an answer in case of neural classifiers. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 30(1):84–94, Feb 2000.

[TMK16] S. Teerapittayanon, B. McDanel, and H. T. Kung. BranchyNet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 2464–2469, Dec 2016.

[VJ01] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001.

[WYDG17] Xin Wang, Fisher Yu, Zi-Yi Dou, and Joseph E. Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. arXiv preprint arXiv:1711.09485, 2017.