Dynamic Neural Networks: A Survey

Yizeng Han∗, Gao Huang∗, Member, IEEE, Shiji Song, Senior Member, IEEE, Le Yang, Honghui Wang, and Yulin Wang

• The authors are with the Department of Automation, Tsinghua University, Beijing 100084, China. E-mail: {hanyz18, yangle15, wanghh20, wang-yl19}@mails.tsinghua.edu.cn; {gaohuang, shijis}@tsinghua.edu.cn. Corresponding author: Gao Huang. ∗ Equal contribution.
Abstract—Dynamic neural network is an emerging research topic in deep learning. Compared to static models, which have fixed computational graphs and parameters at the inference stage, dynamic networks can adapt their structures or parameters to different inputs, leading to notable advantages in terms of accuracy, computational efficiency, adaptiveness, etc. In this survey, we comprehensively review this rapidly developing area by dividing dynamic networks into three main categories: 1) instance-wise dynamic models that process each instance with data-dependent architectures or parameters; 2) spatial-wise dynamic networks that conduct adaptive computation with respect to different spatial locations of image data; and 3) temporal-wise dynamic models that perform adaptive inference along the temporal dimension for sequential data such as videos and texts. The important research problems of dynamic networks, e.g., architecture design, decision making scheme, optimization technique and applications, are reviewed systematically. Finally, we discuss the open problems in this field together with interesting future research directions.
Index Terms—Dynamic networks, Adaptive inference, Efficient inference, Convolutional neural networks.
1 INTRODUCTION

Deep neural networks (DNNs) are playing an important role in various areas including computer vision (CV) [1], [2], [3], [4], [5] and natural language processing (NLP) [6], [7], [8]. In recent years, we have witnessed many successful deep models such as AlexNet [1], VGG [2], GoogleNet [3], ResNet [4], DenseNet [5] and Transformers [6]. These architecture innovations have enabled the training of deeper, more accurate and more efficient models. The recent research on neural architecture search (NAS) [9], [10] further speeds up the process of designing more powerful structures. However, most of the prevalent deep learning models perform inference in a static manner, i.e., both the computational graph and the network parameters are fixed once trained, which may limit their representation power, efficiency and interpretability [11], [12], [13], [14].

Dynamic networks, as opposed to static ones, can adapt their structures or parameters to the input during inference, and therefore enjoy favorable properties that are absent in static models. In general, dynamic computation in the context of deep learning has the following advantages:
1) Efficiency.
One of the most notable advantages of dynamic networks is that they are able to strategically allocate computations on demand at test time, by selectively activating model components (e.g. layers [12], channels [15] or sub-networks [16]) conditioned on the input. Consequently, less computation is spent on canonical samples that are relatively easy to recognize, or on less informative spatial/temporal locations of an input.
2) Representation power.
Due to the data-dependent network architecture/parameters, dynamic networks have a significantly enlarged parameter space and improved representation power. For example, with a minor increase of computation, model capacity can be boosted by applying feature-conditioned attention weights on an ensemble of convolutional kernels [13], [17]. It is worth noting that the popular soft attention mechanism could also be unified in the framework of dynamic networks, as different channels [18], spatial areas [19] or temporal locations [20] of features are dynamically re-weighted at test time.
3) Adaptiveness.
Dynamic models are able to achieve a desired trade-off between accuracy and efficiency for dealing with varying computational budgets on the fly. Therefore, they are more adaptable to different hardware platforms and changing environments, compared to static models with a fixed computational cost.
4) Compatibility.
Dynamic networks are compatible with most advanced techniques in deep learning, including architecture design [4], [5], optimization algorithms [21], [22] and data preprocessing [23], [24], which ensures that they can benefit from the most recent advances in the field to achieve state-of-the-art performance. For example, dynamic networks can inherit architecture innovations in lightweight models [25], or be designed via NAS approaches [9], [10]. Their efficiency could also be further improved by using acceleration methods developed for static models, such as network pruning [26], weight quantization [27], knowledge distillation [28] and low-rank approximation [29].
5) Generality.
As a substitute for static deep learning techniques, many dynamic models are general approaches that can be applied seamlessly to a wide range of applications, such as image classification [12], [30], object detection [31] and semantic segmentation [32]. Moreover, the techniques developed in CV tasks are proven to transfer well to language models in NLP tasks [33], [34], and vice versa.
6) Interpretability.
We finally note that the research on dynamic networks may potentially bridge the gap between the underlying mechanism of deep models and that of brains, as it is believed that brains process information in a dynamic way [35], [36]. With dynamic neural networks, it is possible to analyze which components of a deep model are activated [30] when processing an input instance, and to observe which parts of the input are accountable for certain predictions [37]. These properties may shed light on interpreting the decision process of DNNs.

Fig. 1. Overview of the survey.

In fact, adaptive inference, the key idea underlying dynamic neural networks, had been studied before the popularity of modern DNNs. The most classical approach is building an adaptive ensemble of multiple models through a cascaded [38] or parallel [39] structure, and selectively activating them conditioned on the input. The spiking neural network (SNN) [40], [41] also performs data-dependent inference by propagating pulse signals in the model. However, the training strategy for SNNs is quite different from that of popular convolutional neural networks (CNNs), and SNNs are not commonly used in vision tasks. Therefore, we leave out the work related to SNNs in this survey.

In the context of deep learning, dynamic inference with modern deep architectures has raised many new research questions and has attracted great research interest in the past three years. Despite the extensive work on designing various types of dynamic networks, a systematic and comprehensive review on this topic is still lacking. This motivates us to write this survey, to review the recent advances in this rapidly developing area, with the purposes of 1) providing an overview as well as new perspectives for researchers who are interested in this topic; 2) pointing out the close relations of different subareas and reducing the risk of reinventing the wheel; and 3) summarizing the key challenges and possible future research directions.
TABLE 1
Notations used in this paper.

  Notation      Description
  R^m           m-dimensional real number domain
  a, a          Scalar, vector/matrix/tensor (bold)
  x, y          Input, output feature
  x^ℓ           Feature at layer ℓ
  h_t           Hidden state at time step t
  x(p)          Feature at spatial location p on x
  Θ             Learnable parameter
  Θ̂|x           Dynamic parameter conditioned on x
  x ⋆ W         Convolution of feature x and weight W
  ⊗             Channel-wise or element-wise multiplication
  F(·, Θ)       Functional operation parameterized by Θ
  F ∘ G         Composition of functions F and G

This survey is organized as follows. In Sec. 2, we introduce the most common instance-wise dynamic networks, which adapt their architectures or parameters conditioned on each input instance. Then, dynamic models working at a finer granularity, i.e., spatially adaptive or temporally adaptive models, are reviewed in Sec. 3 and Sec. 4, respectively. We then review the decision making strategies and the training techniques of dynamic networks in Sec. 5, and further summarize the applications of dynamic models in Sec. 6. Finally, a number of open problems and future research directions are discussed in Sec. 7. For better readability, we list the notations used in this survey in Table 1.
2 INSTANCE-WISE DYNAMIC NETWORKS
Aiming at processing different samples in data-dependent manners, instance-wise dynamic networks are typically designed from two perspectives: 1) adjusting the model architectures to allocate appropriate computation based on each instance, and therefore reducing the redundant computation on those "easy" samples to improve inference efficiency (Sec. 2.1); 2) adapting the network parameters to every instance while keeping the computational graphs fixed, with the goal of boosting the representation power with a minimal increase of computational cost (Sec. 2.2).

Fig. 2. Multi-scale architectures with dynamic inference graphs: (a) multi-scale DenseNet; (b) early exiting with policy networks; (c) resolution adaptive network; (d) dynamic routing inside a SuperNet.

Fig. 3. The early-exiting scheme: (a) cascading of models; (b) network with intermediate classifiers. The dashed lines and shaded modules are not executed, conditioned on the decisions made by the routers.
2.1 Dynamic Architectures

Given that different inputs may have diverse computational demands, it is natural to perform inference with dynamic architectures conditioned on each sample. Specifically, one can adjust the network depth (Sec. 2.1.1), width (Sec. 2.1.2), or perform dynamic routing within a super network (SuperNet) that includes multiple possible paths (Sec. 2.1.3). Networks with dynamic architectures not only save redundant computation for canonical ("easy") instances, but also preserve their representation power when recognizing non-canonical ("hard") samples. Such a property leads to remarkable advantages in efficiency compared to the acceleration techniques for static models [26], [27], [42], which handle "easy" and "hard" inputs with identical computation, and fail to reduce intrinsic computational redundancy.
2.1.1 Dynamic Depth

As modern DNNs are getting increasingly deep for recognizing more "hard" samples, a straightforward solution to reducing redundant computation is performing inference with dynamic network depths, which can be realized by 1) early exiting, i.e. allowing "easy" samples to be output at shallow exits without executing deeper layers [12], [43], [44]; or 2) layer skipping, i.e. selectively skipping intermediate network layers conditioned on each instance [11], [45], [46]. Because of the layer-wise sequential execution procedure of deep networks, models with dynamic depths usually enjoy favorable runtime efficiency in practice.
1) Early exiting.
The complexity (or "difficulty") of inputs varies in most real-world scenarios, and shallow networks are capable of correctly identifying many canonical instances. Ideally, these instances should be output at certain early exits without executing deeper layers. For an input sample x, the forward propagation of an L-layer deep network F can be represented by

y = F_L ∘ F_{L-1} ∘ ··· ∘ F_1(x),   (1)

where F_ℓ denotes the operational function at layer ℓ, 1 ≤ ℓ ≤ L. In contrast, early exiting allows the inference procedure to terminate at an intermediate layer. For the i-th input sample x_i, the forward propagation can be written as

y_i = F_{ℓ_i} ∘ F_{ℓ_i - 1} ∘ ··· ∘ F_1(x_i),   1 ≤ ℓ_i ≤ L.   (2)

Note that ℓ_i is adaptively determined based on x_i. Extensive architectures have been studied to endow DNNs with such early exiting behaviors, as discussed in the following.

a) Cascading of DNNs.
The most intuitive approach to enabling early exiting is cascading multiple models (see Fig. 3 (a)), and adaptively retrieving the prediction of an early network without activating later ones. For example, Big/little-Net [47] cascades two CNNs with different depths. After obtaining the SoftMax output of the first model, early exiting is conducted when the score margin between the two largest elements exceeds a threshold. Moreover, a number of classic CNNs [1], [3], [4] are cascaded in [44]. After each model, a decision function is trained to determine whether the obtained feature should be fed to a linear classifier for immediate prediction, or be sent to subsequent classifiers.
b) Intermediate classifiers.
The models in the aforementioned cascading structures are mutually independent. Consequently, once a "difficult" instance is decided to be fed to a later network, a whole inference procedure needs to be executed from scratch, and thus the already learned features are not efficiently reused. A more compact design is to involve intermediate classifiers within one backbone network (see Fig. 3 (b)), where early features can be propagated to deep layers if needed. Based on such an architecture design, early exiting can be achieved according to confidence-based criteria [43], [48] or learned decision functions [44], [49], [50], [51]. Note that the confidence-based exiting policy consumes no extra computation during inference, while usually requiring the threshold(s) to be tuned on a validation set. In the latter scheme, gating functions that directly make discrete decisions might face training issues, which will be further discussed in Sec. 5. Adaptive early exiting can also be extended to language models (e.g. BERT [7]) to improve their efficiency on NLP tasks [52], [53], [54], [55]. It has also been implemented in recurrent neural networks (RNNs) for temporally dynamic inference when processing sequential data such as videos [56], [57] and texts [58], [59], [60] (Sec. 4).
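To make the confidence-based exiting policy concrete, the following is a minimal sketch in PyTorch-style pseudocode of a backbone with intermediate classifiers. The stage and exit modules, the confidence threshold, and the batch-size-1 inference loop are illustrative assumptions rather than the design of any specific method.

```python
# Minimal sketch of confidence-based early exiting (Sec. 2.1.1).
# Assumes a backbone split into stages, each followed by a hypothetical
# intermediate classifier, and batch size 1 at inference.
import torch
import torch.nn as nn

class MultiExitNet(nn.Module):
    def __init__(self, stages, exits, threshold=0.9):
        super().__init__()
        self.stages = nn.ModuleList(stages)   # backbone blocks F_1, ..., F_L
        self.exits = nn.ModuleList(exits)     # intermediate classifiers
        self.threshold = threshold            # tuned on a validation set

    @torch.no_grad()
    def forward(self, x):
        for stage, exit_head in zip(self.stages, self.exits):
            x = stage(x)
            probs = exit_head(x).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            # Terminate as soon as the current exit is sufficiently confident.
            if conf.item() >= self.threshold:
                return pred
        return pred  # fall back to the final (deepest) classifier
```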
Fig. 4. Dynamic layer skipping: (a) layer skipping based on a halting score; (b) layer skipping based on a gating function; (c) layer skipping based on a policy network. The dashed features in (a) are not calculated conditioned on the halting score, the gating module in (b) decides whether to execute the layer/block, and the extra policy network in (c) directly generates the skipping decisions for all layers in the main network.

c) Multi-scale architecture with early exits.
Researchers [12] have observed that in chain-structured networks, the multiple classifiers may interfere with each other, which degrades the overall performance. A reasonable interpretation is that in regular CNNs, the high-resolution features lack the global information that is essential for classification, leading to unsatisfying results at early exits. Moreover, early classifiers force the shallow layers to generate task-specialized features, so that part of the general information is lost, leading to degraded performance at deep exits. To address this issue, multi-scale dense network (MSDNet) [12] adopts 1) a multi-scale architecture, to quickly generate coarse-level features that are suitable for classification; and 2) dense connections, to reuse early features and improve the performance of deep classifiers (see Fig. 2 (a)). Such a specially designed architecture effectively enhances the overall accuracy of all the classifiers in the network.

Besides the architecture design, the exiting policies and training techniques are also important for model performance. Apart from the confidence-based criteria in [12], policy networks are built for multi-scale dynamic models with early classifiers (see Fig. 2 (b)) [61], [62] to decide whether each instance should exit. As for training, specific techniques are studied in [63] for multi-exit networks. More discussion of the inference and training schemes for dynamic models is given in Sec. 5.

The methods discussed above mostly implement the early-exiting scheme via depth adaptation. From the perspective of exploiting spatial redundancy in features, resolution adaptive network (RANet, see Fig. 2 (c)) [30] further achieves resolution adaptation together with depth adaptation. Specifically, the network first processes each instance with low-resolution features, while high-resolution representations are utilized conditioned on the prediction confidence of early classifiers.
2) Layer skipping.
In the aforementioned early-exiting paradigm, the general idea is skipping the execution of all the deep layers after a certain classifier. More flexibly, the network depth can also be adapted on the fly by strategically skipping the calculation of intermediate layers without placing extra classifiers. Given the i-th input instance x_i, dynamic layer skipping can generally be written as

y_i = (1_L ∘ F_L) ∘ (1_{L-1} ∘ F_{L-1}) ∘ ··· ∘ (1_1 ∘ F_1)(x_i),   (3)

where 1_ℓ denotes the indicator function determining the execution of layer F_ℓ, 1 ≤ ℓ ≤ L. This scheme is typically implemented on structures with skip connections (e.g. ResNet [4]) to guarantee the continuity of forward propagation, and here we summarize three representative approaches.

a) The halting score.
Adaptive computation time (ACT) [11] is achieved based on an RNN, where a scalar named the halting score is accumulated as multiple layers are sequentially executed within a time step, and the hidden state of the RNN is directly fed to the next step once the score exceeds a threshold. The ACT method is further extended to ResNets for vision tasks [31] by viewing residual blocks within a stage as linear layers within a step of an RNN (see Fig. 4 (a)). Moreover, the halting score in [31] is allowed to vary across spatial locations. Rather than skipping the execution of layers with independent parameters, iterative and adaptive mobile neural network (IamNN) [64] replaces multiple residual blocks in each ResNet stage by one block with shared weights, leading to a significant reduction of parameters. In every stage, the block is executed for an adaptive number of steps according to the halting score. In addition to RNNs and CNNs, the halting scheme is further implemented on Transformers [6] by [33] and [34] to achieve dynamic network depth on NLP tasks.

b) Gating function.

Apart from comparing calculated halting scores with certain thresholds as in the aforementioned approaches, a gating function is also a prevalent option for making discrete decisions due to its plug-and-play property. By generating binary values based on intermediate features, a gating function can determine the skipping/execution of a layer (block) on the fly (see Fig. 4 (b)). Take the layer skipping in a ResNet as an example: let x^ℓ denote the input feature of the ℓ-th residual block; a gating function G^ℓ generates a binary value to determine the execution of F^ℓ. This procedure can be represented by

x^{ℓ+1} = G^ℓ(x^ℓ) F^ℓ(x^ℓ) + x^ℓ.   (4)

SkipNet [45] and convolutional network with adaptive inference graph (Conv-AIG) [46] are two representative approaches to enabling dynamic layer skipping. Both methods introduce lightweight computational overheads to efficiently produce the binary decisions on whether to skip the calculation of a residual block. Specifically, Conv-AIG utilizes two FC layers in each residual block, while the gating function in SkipNet is implemented as an RNN for parameter sharing. Rather than skipping layers in classic ResNets, dynamic recursive network [65] iteratively executes one block with shared parameters in each residual stage. Although seemingly similar to the aforementioned IamNN [64], its decision policy differs significantly: instead of tuning a threshold for halting scores as in IamNN, gating modules are exploited in [65] to decide the recursion depth.

Instead of either skipping a layer, or executing it thoroughly with full numerical precision, a line of work [66], [67] studies adaptive bit-widths for different layers conditioned on the resource budget. Furthermore, fractional skipping [68] adaptively selects a bit-width for each residual block by a gating function based on the input features.
Footnotes: 1. Here we refer to a stage as a stack of multiple residual blocks with the same feature resolution. 2. For simplicity and without loss of generality, the subscript for the sample index is omitted in the following.
Fig. 5. MoE structures with soft weighting (a) and hard gating (b) both adopt an auxiliary module to generate the weights or gates; (c) dynamic routing in a tree structure, where nodes and transformations (paths) are represented as circles and lines with arrows, respectively. Only the solid lines are activated.
c) Policy networks.
Besides making sequential decisions based on intermediate features, another implementation is using an extra model to directly decide which layers in a network need to be executed based on each instance. For example, BlockDrop [69] builds a policy network that takes each instance as input and produces the binary gates for all layers in a pre-trained ResNet, as illustrated in Fig. 4 (c).
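As a concrete illustration of the gating scheme in Eq. 4, the sketch below shows a gated residual block in PyTorch-style pseudocode. The pooling-plus-FC gate, the 0.5 threshold, and the soft/hard switch between training and inference are assumptions made for illustration, not the exact SkipNet or Conv-AIG module.

```python
# Minimal sketch of dynamic layer skipping with a gating function (Eq. 4).
# Assumes batch size 1 at inference so a single gate value decides the skip.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    def __init__(self, block, channels):
        super().__init__()
        self.block = block                    # residual transformation F^l
        self.gate = nn.Linear(channels, 1)    # lightweight gating function G^l

    def forward(self, x):
        # Global average pooling summarizes the input feature for the gate.
        summary = F.adaptive_avg_pool2d(x, 1).flatten(1)
        g = torch.sigmoid(self.gate(summary))  # soft gate in [0, 1]
        if not self.training:
            # Hard decision at inference: skip the block entirely when g < 0.5.
            if g.item() < 0.5:
                return x
            return x + self.block(x)
        # Soft gating during training keeps the computation differentiable.
        return x + g.view(-1, 1, 1, 1) * self.block(x)
```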
2.1.2 Dynamic Width

An alternative to adapting the network depth (Sec. 2.1.1) is performing inference with dynamic widths: although every layer is still executed, its multiple components (e.g. neurons, branches or channels) are selectively activated conditioned on the input. Therefore, this approach can be viewed as a finer-grained form of conditional computation.
1) Dynamic width of fully-connected (FC) layers.
The computational cost of an FC layer is determined by its input and output dimensions. It is commonly believed that different neuron units are responsible for representing different features, and therefore not all of them need to be activated for every instance. Early studies learn to adaptively control the neuron activations by auxiliary branches [70], [71], [72] or other techniques such as low-rank approximation [73].
2) Mixture of Experts (MoE).
In Sec. 2.1.1, adaptive model ensembles are achieved in a cascading manner, where later networks are conditionally executed based on early predictions. An alternative approach to improving the capacity of networks without making them deeper is the MoE [39], [74] structure, in which multiple network branches are built as experts in parallel. These experts can be selectively executed, and their outputs are fused with data-dependent weights.

Conventional soft MoE approaches [39], [74] adopt real-valued weights to dynamically rescale the representations obtained from different experts (shown in Fig. 5 (a)). In this way, all the branches still need to be executed, and the computation cannot be reduced at test time. To increase inference efficiency, hard gates with only a fraction of non-zero elements allow the models to adaptively allocate the computation depending on the input (see Fig. 5 (b)). Let G denote a gating module whose output is an N-dimensional vector α controlling the execution of N experts F_1, F_2, ···, F_N; the final output can be written as

y = Σ_{n=1}^{N} [G(x)]_n F_n(x) = Σ_{n=1}^{N} α_n F_n(x),   (5)

and the n-th expert is not executed if α_n = 0.

Hard MoE has been implemented in diverse network structures. For example, HydraNet [75] replaces the convolutional blocks in the last stage of a CNN by multiple branches, and selectively executes these branches at test time. For another example, dynamic routing network (DRNet) [76] implements a hard branch selection in each cell structure commonly used in the NAS framework [10]. On NLP tasks, sparsely gated MoE [16] and the recent switch Transformer [77] embed hard MoE in a long short-term memory (LSTM) [78] network and a Transformer [6], respectively. In place of making choices with binary gates as in [76], only the branches corresponding to the top-K elements of the real-valued gates are activated in [16], [75], [77].
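The following is a minimal sketch of a hard MoE layer with top-k gating in the spirit of Eq. 5. The expert modules, the linear gate, and the batch-size-1 routing loop are illustrative assumptions; real implementations batch and dispatch tokens to experts far more efficiently.

```python
# Minimal sketch of hard MoE with top-k gating (Eq. 5), PyTorch-style.
# Assumes vector inputs of shape (1, in_dim), i.e. batch size 1.
import torch
import torch.nn as nn

class HardMoE(nn.Module):
    def __init__(self, experts, in_dim, k=2):
        super().__init__()
        self.experts = nn.ModuleList(experts)        # parallel experts F_1..F_N
        self.gate = nn.Linear(in_dim, len(experts))  # gating module G
        self.k = k

    def forward(self, x):
        scores = self.gate(x).softmax(dim=-1)            # real-valued gates α
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)  # keep only top-k experts
        y = 0
        for j in range(self.k):
            idx = topk_idx[0, j].item()
            # Only the selected experts are executed; the rest are skipped.
            y = y + topk_vals[0, j] * self.experts[idx](x)
        return y
```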
3) Dynamic channel pruning in CNNs.
Modern CNNs usually have considerable redundancy in their large number of feature channels. Based on the common belief that the same channel can be of disparate importance for different instances, dynamic width of CNNs can be realized by adapting the number of active channels at runtime. Compared to static pruning methods [26], [42] that remove certain filters permanently, such a dynamic pruning approach selectively skips the calculation of channels in a data-dependent manner. The capacity of the CNN is not degraded, while the overall efficiency can be improved.
a) Multi-stage architectures along the channel dimension.
Recall that the early-exiting networks [12], [30] discussed in Sec. 2.1.1 can be viewed as multi-stage models along the depth dimension, where late stages are conditionally executed based on early predictions. One can also build multi-stage architectures along the width (channel) dimension, and progressively execute these stages on demand. Along this direction, channel gating network (CGNet) [79] is an example that uses a subset of convolutional filters in every layer, and activates the remaining filters only on certain strategically selected areas. The recent static-to-dynamic neural architecture search (S2DNAS) [80] searches for an optimal architecture among multiple structures with different widths, and any instance can be output at an early stage when a confident prediction is obtained.
b) Dynamic pruning based on gating functions.
The aforementioned progressive activation paradigm decides the execution of a later stage based on previous outputs. As a result, a complete forward propagation has to be performed for every stage, which might be suboptimal for reducing the practical inference latency. Another prevalent solution is to decide the execution of channels at every layer based on gating functions. For example, runtime neural pruning (RNP) [15] models the layer-wise pruning as a Markov decision process, and uses an RNN to select specific channel groups. Moreover, pooling operations followed by FC layers are utilized to generate channel-wise hard attention for each instance [81], [82], [83], [84]. Different reparameterization and optimization techniques are adopted at the training stage, which will be discussed in Sec. 5.2.

The approaches mentioned above have managed to skip the execution of either network layers [45], [46] (see Sec. 2.1.1) or convolutional filters [15], [81], [82], [83]. On the basis of this existing literature, recent work [85], [86], [87] has realized dynamic inference with respect to network depth and width simultaneously: only if a layer is determined to be executed are its channels selectively activated, leading to a more flexible adaptation of computational graphs.

It is worth noting that rather than placing plug-in gating modules inside a CNN, GaterNet [88] builds an individual network with the same architecture as the backbone. This additional network takes in the input instance and directly generates all the channel selection decisions for the backbone CNN. This implementation is similar to BlockDrop [69] discussed in Sec. 2.1.1, which exploits an extra policy network for dynamic layer skipping.
c) Dynamic pruning based on feature activations.
Without auxiliary branches and their computational overheads, dynamic pruning can be conducted directly based on feature activation values [89], where a regularization term is introduced in training to encourage the sparsity of intermediate features.
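The sketch below illustrates the gating-based channel pruning scheme described above: a pooling-plus-FC gate produces channel-wise hard decisions for each instance. The gate design and threshold are illustrative; an optimized implementation would skip the pruned channels rather than zero them out as done here.

```python
# Minimal sketch of dynamic channel pruning with a gating module, PyTorch-style.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelGatedConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Linear(in_ch, out_ch)   # predicts per-channel saliency

    def forward(self, x):
        summary = F.adaptive_avg_pool2d(x, 1).flatten(1)
        saliency = torch.sigmoid(self.gate(summary))   # per-channel score in [0, 1]
        mask = (saliency > 0.5).float()                # hard channel selection
        # A practical kernel would compute only the selected channels;
        # zeroing the rest here just emulates the same output.
        y = self.conv(x)
        return y * mask.view(mask.size(0), -1, 1, 1)
```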
2.1.3 Dynamic Routing

The aforementioned methods mostly adjust the depth (Sec. 2.1.1) or width (Sec. 2.1.2) of classic architectures by activating their computational units (e.g. layers [45], [46] or channels [15], [83]) conditioned on the input. More generally, the computational graph can be adapted by performing dynamic routing inside a SuperNet with various possible inference paths.

To achieve dynamic routing, there are typically routing nodes in a SuperNet that are responsible for allocating features/samples to different paths. For node s at the ℓ-th layer, let α^ℓ_{s→j} denote the probability of assigning the reached feature x^ℓ_s to node j at layer ℓ+1; the path from node s to node j is activated only when α^ℓ_{s→j} > 0. The resulting feature that reaches node j is obtained by

x^{ℓ+1}_j = Σ_{s ∈ {s: α^ℓ_{s→j} > 0}} α^ℓ_{s→j} F^ℓ_{s→j}(x^ℓ_s).   (6)

One can generate the probabilities α^ℓ_{s→j} in different manners, and extra constraints can be imposed at the training stage to improve efficiency. Note that the dynamic early-exiting networks [12], [30] are a special form of SuperNets, where routing decisions are only made at intermediate classifiers. The CapsuleNet series [14], [90] also performs dynamic routing between capsules, i.e. groups of neurons, to characterize the relations between (parts of) objects. Here we mainly focus on different architecture designs for the SuperNets and their routing policies.
1) Path selection in multi-branch structures.
The simplest SuperNet can be established by setting a number of candidate modules at each layer, and dynamically selecting one of them to execute [91], [92]. This is equivalent to having the probability distribution α^ℓ_{s→·} in Eq. 6 be one-hot, and can be viewed as a special form of hard MoE (see Fig. 5 (b)). The main difference is that only one branch is selected without any fusion operations. Various implementations have been explored. For example, the branch selection is realized with RNN-based gating functions in [91]. Different topologies of branches have also been enabled by [92].
2) Neural trees and tree-structured networks.
As decision trees perform inference along one forward path that is dependent on input properties, combining tree structures with neural networks can enjoy the adaptive inference paradigm and the representation power of DNNs simultaneously. Note that in a tree structure, the outputs of different nodes are routed to independent paths rather than being fused together as in MoE structures (compare Fig. 5 (b) and (c)).

Early work developed the soft decision tree (SDT) [93], [94], [95], which performs differentiable operations in both training and inference stages yet is unable to achieve conditional computation. The end-to-end training of neural trees that make hard decisions has been enabled with specific techniques [96], [97]. Moreover, tree-structured CNNs have been developed [98], [99], [100] to endow modern CNN architectures with dynamic routing behaviors.
a) SDT [93], [94], [95] adopts neural units as its routing nodes (blue nodes in Fig. 5 (c)), and the output of a routing node is the real-valued fraction of the input assigned to its left/right sub-tree. Each leaf node of an SDT generates a probability distribution over the output space, and the final prediction is the expectation of the results from all leaf nodes. In an SDT, the probability of an instance reaching each leaf node is data-dependent, but all the paths are still executed, which limits the inference efficiency.
b) Neural trees with deterministic routing policies are designed to make hard routing decisions during inference, avoiding computation on the unselected paths and therefore improving efficiency in practice. The end-to-end training of hard neural trees has been enabled in [96] and [97].
c) Tree-structured DNNs.
Apart from developing decision trees containing neural units, a line of work builds special network architectures to endow them with the routing behavior of decision trees. For instance, hierarchical deep convolutional neural network (HD-CNN) [98] first activates a small CNN to classify each sample into coarse categories, and then conditionally executes specific sub-networks based on the coarse predictions. A subsequent work [99] not only partitions samples to different sub-networks, but also divides and routes the feature channels.

Different from those networks using neural units only in routing nodes [96], [97], or routing each sample to pre-designed sub-networks [98], [99], adaptive neural tree (ANT) [100] adopts CNN modules as feature transformers in a hard neural tree (see the lines with arrows in Fig. 5 (c)), and learns the tree structure together with the network parameters in the training stage.
3) Others.
Performing dynamic routing within other forms of SuperNets is also a recent research trend. Representatively, one can design the SuperNet by hand [101] or by NAS [102], and the routing policies for each instance are determined either by an extra network [102] or by plug-in modules [101]. For example, instance-aware neural architecture search (InstaNAS) [102] searches for an architecture distribution with partly shared parameters from a SuperNet containing an enormous number of sub-networks. During inference, every instance is allocated by a controller network to one sub-network with an appropriate computational cost.

For another example, a hand-designed multi-scale SuperNet (see Fig. 2 (d)) is developed in [101]. Instead of training a standalone controller network with reinforcement learning (RL) as in InstaNAS, gating modules are plugged inside the SuperNet to decide the routing path for each sample. Moreover, unlike the soft routing functions that only produce non-zero values [93], [94], [95], or many gating functions that require reparameterization techniques to produce binary values [45], [46], [83], the routing modules in [101] utilize max(0, Tanh(·)) as their activation function to directly generate zero values, leading to a conditional activation of different paths.

Fig. 6. Adaptive inference with dynamic parameters: (a) dynamic weight adjustment; (b) dynamic weight prediction; (c) soft attention for dynamic features.

2.2 Dynamic Parameters

Although the dynamic architectures in Sec. 2.1 can adapt their inference graphs to each instance and achieve an efficient allocation of computation, they usually have special architecture designs, requiring specific training strategies or careful hyper-parameter tuning (Sec. 7).

Another line of work performs adaptive inference with dynamic parameters, while keeping the architectures fixed. Existing arts have been shown effective in improving the representation power of networks with a minor increase of computational cost. Given an input sample x, the output of a conventional network (module) with static parameters can be written as y = F(x, Θ). In contrast, the output of a model with dynamic parameters is

y = F(x, Θ̂|x) = F(x, W(x, Θ)),   (7)

where W(·, Θ) is the operation producing the dynamic parameters, and different choices of W have been extensively explored.

In general, parameter adaptation can be achieved from three aspects: 1) adjusting the trained parameters based on the input (Sec. 2.2.1); 2) directly generating the network parameters from the input (Sec. 2.2.2); and 3) rescaling the features with soft attention (Sec. 2.2.3).

2.2.1 Parameter Adjustment

A typical approach to parameter adaptation is adjusting the weights based on the input during inference. This implementation usually consumes little computation to obtain the adjustments, e.g., attention weights [13], [17], [103], [104] or sampling offsets [105], [106], [107] (see Fig. 6 (a)).
1) Attention on weights.
The amount of trainable parameters is a key factor for representation power. A family of dynamic networks, e.g. conditionally parameterized convolution (CondConv) [13] and dynamic convolutional neural network (DY-CNN) [17], performs soft attention on multiple convolutional kernels to produce an adaptive ensemble of parameters without noticeably increasing the computational cost. Assuming that there are N kernels W_n, n = 1, 2, ···, N, such a dynamic convolution can be formulated as

y = x ⋆ W̃ = x ⋆ (Σ_{n=1}^{N} α_n W_n).   (8)

This procedure increases the model capacity while remaining highly efficient: the result obtained by fusing the outputs of N convolutional branches (as in the MoE structure, see Fig. 5 (a)) is equivalent to that produced by performing the convolution once with W̃, yet the latter only consumes approximately 1/N of the computation.

Weight adjustment can also be achieved by performing soft attention over the spatial locations of convolutional weights [103], [104]. For example, segmentation-aware convolutional network [103] applies locally masked convolution to aggregate information with larger weights from similar pixels, which are more likely to belong to the same object. Unlike [103], which requires a sub-network for feature embedding, pixel-adaptive convolution (PAC) [104] adapts the convolutional weights based on an attention mask generated from the input feature at each layer.
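The sketch below illustrates Eq. 8: attention weights computed from the input are used to aggregate N kernels into a single input-dependent kernel, which is then applied once. The routing function and kernel shapes are illustrative, and the sketch assumes batch size 1 so that one aggregated kernel serves the whole input.

```python
# Minimal sketch of dynamic convolution via soft attention over N kernels (Eq. 8).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=4):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size))
        self.route = nn.Linear(in_ch, num_kernels)  # produces attention α over kernels
        self.padding = kernel_size // 2

    def forward(self, x):
        summary = F.adaptive_avg_pool2d(x, 1).flatten(1)
        alpha = self.route(summary).softmax(dim=-1)       # (1, N), assumes batch size 1
        # Aggregate the N kernels into a single input-dependent kernel W~.
        agg_weight = torch.einsum('n,noihw->oihw', alpha[0], self.weight)
        return F.conv2d(x, agg_weight, padding=self.padding)
```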
2) Kernel shape adaptation.
Apart from adaptively scaling the weight values, parameter adjustment can also be realized by reshaping the convolutional kernels to achieve dynamic receptive fields. Along this direction, when performing convolution at each pixel, deformable convolutions [105], [106] sample pixels from adaptive locations in the feature maps. Deformable kernels [107] instead sample the weights in the kernel space to adapt the effective receptive field (ERF) while leaving the receptive field unchanged. Table 2 summarizes the formulations of these three methods. Note that the main difference between [105] and [106] is that the latter version introduces a dynamic modulation mechanism by learning a spatial mask. Though customized CUDA kernels are required for implementation, these kernel shape adaptation approaches all lead to significant improvements in accuracy on image classification and object detection tasks.
2.2.2 Weight Prediction

Compared to making modifications to model parameters on the fly (Sec. 2.2.1), weight prediction [108] is more straightforward: it directly generates (a subset of) instance-wise parameters with an independent model at test time (see Fig. 6 (b)). This idea was first suggested in [109], where both the weight prediction model and the backbone model were feedforward networks. Recent work has further extended this paradigm to modern network architectures and tasks.
1) General architectures.
Dynamic filter networks (DFN) [110] and HyperNetworks [111] are two classic approaches realizing runtime weight prediction for CNNs and RNNs, respectively. Specifically, a filter generation network is built in DFN [110] to produce the filters for a convolutional layer. As for processing sequential data (e.g. a sentence), the weight matrices of the main RNN are predicted by a smaller one at each time step conditioned on the input (e.g. a word) [111]. The recent WeightNet [112] unifies the dynamic schemes of CondConv [13] and squeeze-and-excitation network (SENet) [18], and directly predicts the convolutional weights by adding a grouped FC layer after the attention activation layer, achieving competitive results in terms of the accuracy-FLOPs (floating point operations) and accuracy-parameters trade-offs.

TABLE 2
Deformation for convolutional kernels.

  Method                        Formulation                                        Sampled Target   Dynamic Mask
  Regular convolution           y(p) = Σ_{k=1}^{K} W(p_k) x(p + p_k)               -                -
  Deformable ConvNet-v1 [105]   y(p) = Σ_{k=1}^{K} W(p_k) x(p + p_k + Δp_k)        Feature map      No
  Deformable ConvNet-v2 [106]   y(p) = Σ_{k=1}^{K} W(p_k) x(p + p_k + Δp_k) Δm_k   Feature map      Yes
  Deformable kernels [107]      y(p) = Σ_{k=1}^{K} W(p_k + Δp_k) x(p + p_k)        Conv kernel      No
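To illustrate the weight-prediction scheme described above (in the spirit of DFN), the following sketch lets a small generator predict the filters of a convolutional layer from the input itself. The generator design, shapes, and batch-size-1 assumption are all illustrative.

```python
# Minimal sketch of runtime weight prediction for a convolutional layer, PyTorch-style.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FilterPredictingConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, kernel_size
        # Filter-generation network: maps a pooled summary of x to conv weights.
        self.generator = nn.Linear(in_ch, out_ch * in_ch * kernel_size * kernel_size)

    def forward(self, x):
        # Assumes batch size 1 so one predicted filter bank serves the input.
        summary = F.adaptive_avg_pool2d(x, 1).flatten(1)
        weight = self.generator(summary).view(self.out_ch, self.in_ch, self.k, self.k)
        return F.conv2d(x, weight, padding=self.k // 2)
```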
2) Task-specific information has also been exploited to predict model parameters on the fly. For example, edge attributes are utilized in [113] to generate filters for graph convolution, and the camera perspective is incorporated in [114] to generate weights for image convolution.
2.2.3 Dynamic Features

The main effect of performing inference with adjusted (Sec. 2.2.1) or predicted (Sec. 2.2.2) parameters is producing more dynamic and informative features, and therefore enhancing the representation power of deep models. A more straightforward solution is rescaling the features with input-dependent soft attention (see Fig. 6 (c)). Such dynamic features are easier to obtain, as only minor modifications to the computational graph are required. Note that for a linear transformation F, applying attention α on the output features is equivalent to performing the computation with re-weighted parameters, i.e.

F(x, Θ) ⊗ α = F(x, Θ ⊗ α).   (9)
1) Channel-wise attention.
One common soft attention mechanism dynamically rescales different feature channels, following the form in SENet [18]:

ỹ = y ⊗ α = y ⊗ A(y).   (10)

In Eq. 10, y = x ⋆ W is the output feature of a convolutional layer with C channels, and A(·) is a parameterized function that contains pooling and linear layers, producing the attention α ∈ [0, 1]^C with relatively cheap computation. Taking the convolution into account, the procedure can also be written as ỹ = (x ⋆ W) ⊗ α = x ⋆ (W ⊗ α), from which we can see that applying attention on features is equivalent to performing convolution with dynamic weights.

Other implementations of attention modules have also been developed, including using the standard deviation to provide additional statistics [115], or replacing FC layers with more efficient 1D convolutions [116]. The empirical performance of three computational graphs for soft attention is studied in [117]: 1) ỹ = y ⊗ A(y), 2) ỹ = y ⊗ A(x) and 3) ỹ = y ⊗ A(Conv(x)). It is found that the three forms yield different performance in different backbone networks.
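The following is a minimal SENet-style channel attention module implementing Eq. 10. The reduction ratio and layer sizes are illustrative choices rather than the exact configuration of any cited method.

```python
# Minimal sketch of channel-wise soft attention (Eq. 10), PyTorch-style.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, y):
        # Squeeze: global average pooling over spatial dimensions.
        s = F.adaptive_avg_pool2d(y, 1).flatten(1)
        # Excitation: two FC layers produce attention α in [0, 1]^C.
        alpha = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))
        # Re-weight each channel of the feature map.
        return y * alpha.view(alpha.size(0), -1, 1, 1)
```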
2) Spatial-wise attention. Spatial locations in features can also be dynamically rescaled with attention to improve the representation power of deep models [118]. Instead of using pooling operations to efficiently gather global information as in channel-wise attention, convolutions are often adopted in spatial-wise attention to encode local information. Moreover, these two types of attention modules can be integrated in one framework [19], [119], [120], [121] (see Fig. 6 (c)).
3) Dynamic activation functions.
The aforementioned approaches to generating dynamic features usually apply soft attention before static activation functions. A recent line of work has sought to increase the representation power of models with dynamic activation functions [122], [123]. For instance, DY-ReLU [122] replaces the ReLU, y_c = max(x_c, 0), with the maximum of N linear transformations, y_c = max_n {a^n_c x_c + b^n_c}, where c is the channel index and a^n_c, b^n_c are linear coefficients calculated from x. Dynamic activation functions are compatible with different network architectures, and have been shown effective on vision tasks.

To summarize, soft attention has been exploited in many fields due to its simplicity and effectiveness. Moreover, it can be conveniently incorporated with other methods. For example, by replacing the weighting scalar α_n in Eq. 5 with channel-wise [124] or spatial-wise [125] attention, the outputs of multiple branches with independent kernel sizes [124] or feature resolutions [125] in a soft MoE structure (see Fig. 5 (a)) are fused with more flexibility.

Note that we leave out a detailed discussion of the self-attention mechanism, which is widely studied in both NLP [6], [7] and CV [126], [127], [128] to re-weight features based on the similarity between queries and keys at different (temporal or spatial) locations. Readers who are interested in this topic may refer to review studies [129], [130], [131]. In this survey, we mainly focus on the feature re-weighting scheme in the framework of dynamic inference.
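The sketch below shows a dynamic activation function in the spirit of DY-ReLU: the activation is the maximum over N channel-wise linear functions whose coefficients are predicted from the input. The hyper-function design and coefficient parameterization are simplified assumptions, not the exact DY-ReLU formulation.

```python
# Minimal sketch of a dynamic activation function (max over N predicted linear
# functions per channel), PyTorch-style.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicReLU(nn.Module):
    def __init__(self, channels, n_funcs=2):
        super().__init__()
        self.n = n_funcs
        # Hyper-function predicting 2*N coefficients (a^n_c, b^n_c) per channel.
        self.coef = nn.Linear(channels, 2 * n_funcs * channels)

    def forward(self, x):
        b, c = x.shape[:2]
        summary = F.adaptive_avg_pool2d(x, 1).flatten(1)
        coefs = self.coef(summary).view(b, 2, self.n, c)
        a = coefs[:, 0].view(b, self.n, c, 1, 1)      # slopes a^n_c
        bias = coefs[:, 1].view(b, self.n, c, 1, 1)   # intercepts b^n_c
        # y_c = max_n { a^n_c * x_c + b^n_c }
        out = a * x.unsqueeze(1) + bias
        return out.max(dim=1).values
```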
3 SPATIAL-WISE DYNAMIC NETWORKS

In visual learning, it has been found that not all locations contribute equally to the final prediction of CNNs [132], which suggests that spatially dynamic computation has great potential for reducing computational redundancy. In other words, making a correct prediction may only require processing a fraction of pixels or regions with an adaptive amount of computation. Moreover, based on the observation that low-resolution representations are sufficient to yield decent performance for most inputs [25], static CNNs that process all inputs at the same resolution may also induce considerable redundancy.

To this end, spatial-wise dynamic networks are built to perform adaptive inference with respect to different spatial locations of images. According to the granularity of dynamic computation, we further categorize the relevant approaches into three levels: 1) pixel level, where each pixel in the features is treated adaptively (Sec. 3.1); 2) region level, where the model only attends to strategically selected regions (Sec. 3.2); and 3) resolution level, where each input image is processed with an adaptive resolution (Sec. 3.3).
3.1 Pixel-level Dynamic Networks

Commonly seen spatial-wise dynamic networks perform adaptive computation at the pixel level. Similar to the categorization in Sec. 2, there are two types of pixel-level dynamic networks: 1) models with dynamic architectures that adapt their depth or width when processing each pixel of the features (Sec. 3.1.1); 2) networks with dynamic parameters that perform convolutions with pixel-specific weights for improved flexibility of feature representation (Sec. 3.1.2).

Fig. 7. Dynamic convolution on selected spatial locations.
Based on the common belief that foreground pixels are more informative and computationally demanding than those in the background, some dynamic networks learn to adjust their architectures for each pixel. Existing work generally achieves this by 1) sparse convolution, which only performs convolutions on a subset of sampled pixels; or 2) additional refinement, which strategically allocates extra computation (e.g. layers or channels) to certain spatial positions.
1) Dynamic sparse convolution.
To reduce the unnecessary computation on less informative locations, convolution can be performed only on strategically sampled pixels. The quality of the sampled feature locations largely determines the accuracy and efficiency of the network. Existing sampling strategies include 1) making use of the intrinsic sparsity of the input [133]; 2) predicting the positions of zero elements in the output [134], [135]; and 3) estimating the saliency of pixels [136], [137], [138]. A typical implementation adopts an extra branch to generate a spatial mask, determining the execution of convolution at each pixel (see Fig. 7). Moreover, as mentioned in Sec. 2.1.1, spatially adaptive computation time (SACT) [31] achieves dynamic network depth at each pixel based on a calculated halting score. In these dynamic convolutions, the unselected positions are usually neglected, which might degrade the network performance. The recent stochastic feature sampling and interpolation (SFSI) [138] utilizes interpolation to efficiently fill those locations, therefore alleviating the aforementioned disadvantage.
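The following sketch illustrates the mask-based dynamic sparse convolution described above: an auxiliary branch predicts a binary spatial mask and the convolution output is kept only at the selected pixels. The mask branch and threshold are illustrative, and the dense masked computation merely emulates what an optimized sparse kernel would skip.

```python
# Minimal sketch of spatially sparse dynamic convolution with a predicted mask.
import torch
import torch.nn as nn

class SpatiallyMaskedConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.mask_branch = nn.Conv2d(in_ch, 1, 1)   # cheap 1x1 conv predicting saliency

    def forward(self, x):
        saliency = torch.sigmoid(self.mask_branch(x))   # (B, 1, H, W)
        mask = (saliency > 0.5).float()                 # hard spatial selection
        # A sparse implementation would run the conv only at masked positions;
        # the dense computation below reproduces the same output.
        return self.conv(x) * mask
```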
2) Dynamic additional refinement.
Instead of sampling a subset of pixels to perform convolutions, another line of work first conducts relatively cheap computation on the whole feature map, and adaptively activates extra modules on certain pixels for further refinement. Representatively, dynamic capacity network [139] generates coarse features with a shallow model, and utilizes the gradient information to predict the spatial locations to which the network output is sensitive. For these locations, extra layers are applied to extract finer features. Similarly, specific positions are additionally processed by a fraction of convolutional filters in channel gating network (CGNet) [79]. These methods adapt their network architectures in terms of depth or width at the pixel level, achieving a spatially adaptive allocation of computation.

In semantic segmentation, pixel-wise early exiting (see Sec. 2.1.1) is proposed in [32], where pixels with high prediction confidence are output without being processed by deeper layers. PointRend [140] shares a similar idea, and applies additional FC layers on selected pixels with low prediction confidence, which are more likely to lie on the borders of objects. All these studies demonstrate that by exploiting the spatial redundancy in image data, dynamic computation at the pixel level, beyond the instance level, significantly increases model efficiency.

In contrast to entirely skipping the convolution operation on a subset of pixels, dynamic networks can also apply data-dependent weights at different pixels for improved representation power or adaptive receptive fields.
1) Dynamic weights.
The trained convolutional weights can be rescaled by pixel-wise attention on the fly. For example, pixel-adaptive convolution [104] rescales the weights based on the distance between pairs of pixels. Apart from making dynamic modifications, weight prediction (Sec. 2.2.2) is also adopted to directly generate location-specific convolution kernels. Most existing arts [141], [142], [143], [144] generate an H × W × k² kernel map to produce spatially dynamic weights (H, W are the spatial size of the output feature and k is the kernel size). Considering that pixels belonging to the same object may share identical weights, dynamic region-aware convolution (DRConv) [145] generates a segmentation mask for an input image, dividing it into m regions, for each of which a weight generation network is responsible for producing a data-dependent kernel.
2) Dynamic receptive fields.
Traditional convolution operations usually have a fixed kernel shape and size (e.g. the commonly used 3 × 3 square kernel for 2D convolution). The resulting uniform receptive field across all the layers may have limitations for recognizing objects with varying shapes and sizes. To tackle this, a line of work learns adaptive receptive fields for different feature positions. As introduced in Sec. 2.2.1, the deformable convolution series [105], [106] dynamically samples pixels from the whole feature map when performing convolutions. Moreover, adaptive sampling can also be conducted in the kernel space rather than the feature space to achieve an adaptive ERF [107].

Instead of adapting the sampling locations of features or kernels, adaptively connected network [146] realizes a dynamic trade-off among self transformation (e.g. 1 × 1 convolution), local inference (e.g. 3 × 3 convolution) and global inference (e.g. an FC layer). The three branches of outputs are fused with a data-dependent weighted summation. Besides images, the local and global information in non-Euclidean data, such as graphs, can also be adaptively aggregated.
3) Pixel-wise dynamic features.
Applying spatial-wise soft attention on features can effectively increase the representation power of models [19], [119], [120], [125]. Though being equivalent to performing convolution with dynamic weights, directly rescaling the features is usually easier to implement in practice.
3.2 Region-level Dynamic Networks

Pixel-level dynamic networks mentioned in Sec. 3.1 often require specific implementations for sparse computation, and consequently may face challenges in achieving real acceleration on hardware. An alternative approach is to perform adaptive inference on regions or patches of the input images. There mainly exist two lines of work along this direction. One performs parameterized transformations on a region of the input feature maps for more accurate prediction [147], [148] (Sec. 3.2.1). The other learns hard attention on selected patches [37], [149], [150], with the goal of improving the effectiveness and/or efficiency of models (Sec. 3.2.2). The general procedure is illustrated in Fig. 8, where a region selection module generates the transformation parameters or the location of the attended region, and the subsequent network performs inference on the transformed/cropped input.

Fig. 8. Region-level dynamic inference.
Dynamic transformations (e.g. affine/projective/thin plate spline transformations) can be performed on images to undo certain variations [147] for better generalization ability, or to exaggerate the salient regions [148] for effective visual attention. For example, spatial transformer [147] adopts a localization network to generate the transformation parameters, and then applies the parameterized transformation to recover the input from the corresponding variations. Moreover, transformations are learned to adaptively zoom in on the salient regions of images for tasks where model performance is sensitive to a small portion of regions, e.g. gaze tracking and fine-grained image classification [148].
Inspired by the fact that informative features may only correspond to certain regions of an image, dynamic networks with hard attention are proposed to strategically select patches from the input for higher efficiency. Extensive implementations have been explored, as follows.
1) Hard attention with RNNs.
The most typical approach to region-level hard attention is formulating a classification task as a sequential decision process. Along this direction, RNNs are adopted to focus on one patch at a time, and predictions are made iteratively [149], [151]. For example, recurrent attention model (RAM) [149] classifies images within a fixed number of steps. At each step, the classifier RNN only sees a cropped patch, deciding the next attentional location, until the last step is reached. An adaptive step number is further achieved by including early stopping in the action space [151]. The recent glance-and-focus network (GFNet) [37] builds a general framework of adaptive inference by sequentially focusing on a series of selected patches, and is compatible with existing backbone architectures. By allowing early exiting, both spatially and temporally adaptive inference can be realized [37], [151].
2) Hard attention with other implementations.
Rather than using an RNN to predict the region position that the model should attend to, class activation mapping (CAM) [132] is adopted to select salient patches iteratively [152]. At each iteration, the selection is performed on the previously cropped input, leading to a progressive refinement procedure. Recurrent attention CNN (RA-CNN) [150] adopts a multi-scale architecture to implement hard attention, in which each scale takes the cropped patch from the previous scale as input, and is responsible for simultaneously learning 1) the feature representations for classification and 2) the attention map for the next scale.
3.3 Resolution-level Dynamic Networks

The approaches discussed above typically divide feature maps into different areas and treat them adaptively. A downside of these approaches is that the sparse sampling (Sec. 3.1) or cropping (Sec. 3.2) operations might degrade the practical efficiency. Alternatively, dynamic networks can treat each image as a whole, with representations of adaptive resolution. It has been observed that a low resolution might be sufficient for recognizing "easy" samples [25], while traditional CNNs mostly process all inputs at the same resolution, inducing considerable redundancy. Therefore, resolution-level dynamic networks exploit spatial redundancy from the perspective of feature resolution rather than the saliency of different locations. Existing arts mainly include 1) downsampling/upsampling with adaptive scaling ratios [153], [154] (Sec. 3.3.1); and 2) selectively activating sub-networks with different resolutions in a multi-scale architecture [30], [155] (Sec. 3.3.2).
Features with dynamic resolution can be produced by performing downsampling/upsampling with adaptive scaling ratios. For example, for face detection, a small sub-network is first executed to predict the scale distribution of faces [153]. The input images are then adaptively zoomed in or out, so that all the faces fall into a suitable range for recognition. A subsequent work further exploits a plug-in module to predict the stride of the first convolution layer in each ResNet stage, producing features with dynamic resolution [154].
An alternative approach to achieving dynamic resolution is building multiple sub-networks in a parallel [155] or cascading [30] way. These sub-networks with different feature resolutions are selectively activated conditioned on the input during inference. For instance, Elastic [155] realizes a soft selection from multiple branches at every layer, where each branch performs a downsample-convolution-upsample procedure with an independent scaling ratio. To practically avoid redundant computation, a hard selection is realized by the resolution adaptive network (RANet) [30], which allows each instance to conditionally activate sub-networks that process feature representations with resolutions from low to high (see also Sec. 2.1.1).
4 TEMPORAL-WISE DYNAMIC NETWORKS
Apart from the spatial dimension (Sec. 3), adaptive computation could also be performed along the temporal dimension of sequential data, such as texts (Sec. 4.1) and videos (Sec. 4.2). In particular, redundant computation can generally be saved from two aspects: 1) allocating cheap operations to the input at certain locations; 2) selectively conducting computation only on a subset of temporal locations.
Fig. 9. Temporally adaptive inference: (a) skip update of hidden state; (b) partial update of hidden state; (c) multi-scale RNN architecture; (d) temporal dynamic jumping.
Traditional RNNs mostly follow a static inference paradigm: input tokens are read sequentially to update a hidden state at each time step, which can be written as

$\mathbf{h}_t = \mathcal{F}(\mathbf{x}_t, \mathbf{h}_{t-1}), \quad t = 1, 2, \cdots, T. \quad (11)$

The final state $\mathbf{h}_T$ is utilized for solving the task. Such a static inference paradigm induces significant redundant computation, as different tokens usually make different contributions to the downstream task. A family of dynamic RNNs has been developed to allocate an appropriate computational cost at each step. Some of them read in all the tokens while learning to "skim" unimportant tokens through dynamic updates of hidden states (Sec. 4.1.1), while others conduct an adaptive reading procedure to avoid reading task-irrelevant tokens at test time. Specifically, such adaptive reading can be achieved by early exiting (Sec. 4.1.2) or by jumping in texts with adaptive strides (Sec. 4.1.3). Note that the input of these RNNs at each step can be at different levels of text granularity: characters [11], words [156] or even sentences [60].

Since not all the words or sentences are essential for capturing the task-relevant information in a sequence, dynamic RNNs can be built to adaptively update their hidden states at each time step. Less informative tokens are coarsely skimmed, i.e. the states are updated with cheap operations to reduce unnecessary computation. After reading in a token, the dynamic update can be achieved by 1) directly skipping the update [157], [158], [159]; 2) conducting a coarse update [11], [160], [161]; and 3) performing a selective update in multi-scale structures [162], [163].
1) Skipping the update.
For unimportant inputs at certain temporal locations, dynamic models can learn to entirely skip the update of hidden states (see Fig. 9 (a)), i.e.

$\mathbf{h}_t = \alpha_t \mathcal{F}(\mathbf{x}_t, \mathbf{h}_{t-1}) + (1 - \alpha_t)\, \mathbf{h}_{t-1}, \quad \alpha_t \in \{0, 1\}. \quad (12)$

For instance, Skip-RNN [157] uses a controlling signal to determine whether to update or copy the state, where the signal is regarded as a special hidden state and is updated at every step with negligible computation. An extra agent is adopted by Structural-Jump-LSTM [158] to make the skipping decision conditioned on the previous state and the current input. Compared to [157] and [158], which require training the RNNs and the controllers jointly, a predictor is trained in [159] to estimate whether each input will bring a "significant change" to the hidden state, and the update is executed only when the estimated change exceeds a threshold.
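A minimal sketch of the skip-update rule in Eq. (12) is given below; the straight-through binarization of the update gate is one common implementation choice and not necessarily the exact scheme of [157].

```python
# A minimal sketch of a skip-update RNN cell implementing Eq. (12).
import torch
import torch.nn as nn

class SkipGRUCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.update_prob = nn.Linear(hidden_size, 1)         # decides whether to update or copy

    def forward(self, x_t, h_prev):
        p = torch.sigmoid(self.update_prob(h_prev))          # update probability from the previous state
        alpha = (p > 0.5).float() + p - p.detach()           # straight-through binarization of alpha_t
        h_new = self.cell(x_t, h_prev)
        return alpha * h_new + (1.0 - alpha) * h_prev        # Eq. (12): update or copy the state
```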
2) Coarse update.
As directly skipping the update may be too aggressive, dynamic models can also update the hidden states with adaptively allocated operations. Specifically, a network can adapt its architecture at every step, i.e.

$\mathbf{h}_t = \mathcal{F}_t(\mathbf{x}_t, \mathbf{h}_{t-1}), \quad (13)$

where $\mathcal{F}_t$ is determined based on the input $\mathbf{x}_t$. One implementation is selecting a subset of state dimensions to compute and copying the rest from the previous step [160], [161], as shown in Fig. 9 (b). To achieve such a partial update, a subset of rows in the weight matrices of the RNN is dynamically activated in [160], while Skim-RNN [161] makes a choice between two independent RNNs. When the hidden states are generated by a multi-layer network, the update can also be interrupted at an intermediate layer based on a calculated halting score (see Fig. 4 (a)). To summarize, a coarse update can be realized with data-dependent network depth [11] or width [160], [161].
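The following sketch illustrates a Skim-RNN-style coarse update, where a small cell refreshes only the first few state dimensions; the dimensions and the hard test-time decision are illustrative, and training would typically rely on a relaxation such as Gumbel-Softmax (cf. Sec. 5.2).

```python
# A minimal sketch of a coarse (partial) state update in the spirit of Skim-RNN.
import torch
import torch.nn as nn

class SkimRNNCell(nn.Module):
    def __init__(self, input_size: int, d_big: int = 256, d_small: int = 32):
        super().__init__()
        self.big = nn.GRUCell(input_size, d_big)         # full update
        self.small = nn.GRUCell(input_size, d_small)     # cheap update of a few dimensions
        self.d_small = d_small
        self.decide = nn.Linear(input_size + d_big, 2)   # read fully vs. skim

    def forward(self, x_t, h_prev):
        logits = self.decide(torch.cat([x_t, h_prev], dim=-1))
        skim = logits.argmax(dim=-1, keepdim=True).float()        # hard decision at test time
        h_big = self.big(x_t, h_prev)
        h_small = self.small(x_t, h_prev[:, :self.d_small])
        h_skim = torch.cat([h_small, h_prev[:, self.d_small:]], dim=-1)  # rest copied from h_prev
        return skim * h_skim + (1.0 - skim) * h_big
```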
3) Selective updates in multi-scale RNNs.
Dynamic multi-scale RNNs [162], [163] are built to capture the hierarchical structure in long sequences. During inference, the RNNs at higher levels selectively update their states conditioned on the output of lower-level ones (see Fig. 9 (c)). For example, in the hierarchical multi-scale recurrent neural network (HM-RNN) [162], when the low-level (character-level) model detects that the input satisfies certain conditions, it "flushes" (resets) its states and feeds them to a higher-level (word-level) network. Focused hierarchical RNN [163] builds a multi-scale architecture for question answering (QA) tasks: a gating module decides whether the state should be fed to the higher-level (sentence-level) RNN based on each word of the asked question.
Although the dynamic RNNs in Sec. 4.1.1 are able to update their states with data-dependent computational costs at each step, all the tokens must still be read, which is inefficient in scenarios where the task-relevant result can be obtained before reading the entire sequence. For instance, one could capture the main idea of a paper by reading only its title and abstract. Ideally, an efficient model should adaptively stop reading once the captured information is sufficient to yield a confident result, i.e. perform early exiting before the last step $T$ in Eq. (11) is reached. For instance, the reasoning network (ReasoNet) [58] terminates its reading procedure once sufficient evidence has been found for answering a question. Similarly, the length adaptive recurrent model (LARM) [164] and Jumper [60] implement early stopping for sentence-level and paragraph-level classification, respectively. Note that the dynamic models discussed here make early predictions with respect to the temporal dimension of sequential input, rather than along the depth dimension of networks as discussed in Sec. 2.1.1.

Although the early exiting mechanism in Sec. 4.1.2 largely reduces redundant computation, all the tokens must still be fed to the model one by one. More aggressively, dynamic RNNs can further learn to decide "where to read" by strategically skipping certain tokens without reading them, directly jumping to a temporal location with an adaptive stride (see Fig. 9 (d)). Such dynamic jumping, together with early exiting, is presented in [156] and [59]. Specifically, the network in [156] implements an auxiliary unit to predict the jumping stride within a predefined range, and the reading process ends when the unit outputs zero. Differently, the model in [59] first decides at each step whether to stop; if not, it further chooses to re-read the current input or to skip a flexible number of words. Moreover, structural information is exploited by Structural-Jump-LSTM [158], which utilizes an agent to decide whether to jump to the next punctuation mark. Apart from looking ahead, LSTM-Shuttle [165] also allows backward jumping to recover missed history information.
For video recognition, where a video can be seen as a sequential input of frames, temporal-wise dynamic networks are designed to allocate adaptive computational resources to different frames conditioned on the input. This is generally achieved by two approaches. First (Sec. 4.2.1), a line of work performs recognition by processing frames sequentially with RNNs. Similar to the approaches introduced in Sec. 4.1, RNN-based adaptive video recognition can be realized by 1) treating unimportant frames with relatively cheap operations (a "glimpse") [166], [167]; 2) early exiting [56], [57]; and 3) strategically deciding "what to see when" at each time step [56], [168], [169], [170]. The second line of work (Sec. 4.2.2) adopts a dynamic pre-sampling procedure for key frames (or clips, i.e. small parts of a long video) [171], [172], [173], and the selected frames or clips are processed by a task-specific model.
Video recognition is often conducted via a recurrent procedure, where the frames of a video are first encoded by a 2D CNN, and the obtained frame features are fed to an RNN sequentially for updating a hidden state, which is finally used for the prediction. Such a procedure is similar to the text processing discussed in Sec. 4.1. Due to the temporal redundancy in videos, task-irrelevant frames can be processed coarsely, or even neglected.
1) Dynamic update of hidden states.
To avoid redundant computation at each step, LiteEval [166] makes a choice between two LSTM models with different computational costs. ActionSpotter [167] adaptively decides whether the current input should be used to update the hidden state. Such a glimpse procedure (i.e. allocating cheap operations to relatively unimportant frames) is similar to the skimming operation for texts [157], [158] (Sec. 4.1.1).
2) Temporally early exiting.
Humans are often able to comprehend the content of a video before watching it in its entirety. Such early stopping has also been implemented to make predictions based on only a portion of the video frames [56], [57]. Together with the temporal dimension, the model in [57] further achieves early exiting along the network depth, as discussed in Sec. 2.1.1.
3) Jumping in videos.
Considering that encoding unimportant frames with a CNN still requires considerable computation, a more efficient inference paradigm can be achieved by dynamically skipping some frames without watching them. Existing works [168], [169], [174] typically learn to predict the location that the network should jump to at each time step. Furthermore, both early stopping and dynamic jumping are allowed in [56], where the jumping stride is limited to a discrete range. Adaptive frame selection (AdaFrame) [170] generates a continuous scalar in the range [0, 1] as the relative location. A recent work [175] combines adaptation of frame resolution with the dynamic jumping scheme, further improving recognition efficiency by considering the redundancy in both the spatial and temporal dimensions.

Rather than processing video frames recurrently as in Sec. 4.2.1, another line of work first performs an adaptive pre-sampling procedure, and then makes predictions by processing the selected subset of key frames/clips.
1) Temporal attention. Both soft and hard attention have been exploited to let networks focus on salient frames in videos. For face recognition, the neural aggregation network [20] uses soft attention to adaptively aggregate frame features. With the goal of improving inference efficiency, hard attention is realized in [171], which removes unimportant frames iteratively with RL for efficient video face verification.
2) Dynamic sampling strategies.
In addition to the attention mechanism, a dedicated sampling module is an alternative option. For example, one can train frame sampling agent(s) with RL [172], [176]: both approaches first sample frames uniformly, and then decide for each selected frame whether to step forward or backward. As for clip-level sampling, the salient clip sampler (SCSampler) [173] is built on a trained classifier to find the most informative clips for prediction. Moreover, the dynamic sampling network (DSN) [177] segments each video into multiple sections, and a sampling module with weights shared across sections is exploited to sample one clip from each section. Beyond frame/clip selection, adaptive 3D convolution (Ada3D) [178] chooses between 2D and 3D convolutions after obtaining the selected frames: by simply taking the center channel of a 3D filter along its temporal dimension, a 3D convolution can be transformed into a 2D one. In this way, inference efficiency is improved by exploiting the redundancy in both the data and the network structure, which is a recent research trend [30], [37].
5 INFERENCE AND TRAINING
In previous sections, we have reviewed three different types of dynamic networks: instance-wise (Sec. 2), spatial-wise (Sec. 3) and temporal-wise (Sec. 4). It can be observed that making data-dependent decisions during inference is essential to achieving high efficiency and effectiveness. Moreover, training dynamic networks is usually more challenging than optimizing static models. Note that since parameter adaptation (Sec. 2.2) can be conveniently achieved by differentiable operations, models with dynamic parameters [13], [18], [112] can be directly trained by stochastic gradient descent (SGD) without specific techniques. Therefore, in this section we mainly focus on discrete decision making (Sec. 5.1) and the corresponding training strategies (Sec. 5.2), which are absent in most static models.

As described above, dynamic networks are capable of making data-dependent decisions during inference to transform their architectures or parameters, or to select salient spatial/temporal locations of the input. Here we summarize three commonly seen decision making schemes.
Many dynamic networks [12], [30], [43] are able to output "easy" samples at early exits if a certain confidence-based criterion is satisfied. These methods generally estimate the confidence of intermediate predictions and compare it to a predefined threshold for decision making. In classification tasks, the confidence is usually represented by the maximum element of the SoftMax output [12], [30]. Alternative criteria include the entropy [43], [53] and the score margin [47]. For NLP tasks, a notion of model patience is proposed in [55]: when the predictions for an instance stay unchanged across a number of classifiers, the inference procedure stops. In addition, the halting scores in [11], [31], [33], [34] can also be viewed as confidence for whether the current feature should be output to the next time step or calculation stage. Empirically, confidence-based criteria are easy to implement and generally require no specific training techniques. The trade-off between accuracy and efficiency is controlled by manipulating the thresholds, which are usually tuned on a validation set. It is worth noting that the overconfidence issue in deep models [179], [180] might affect the effectiveness of this decision paradigm: samples that are classified incorrectly with high confidence may be output at early exits.
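A minimal sketch of such confidence-thresholded early exiting is shown below, assuming a batch size of one at inference time; the stage/head interfaces and the 0.9 threshold are illustrative, and real systems tune the thresholds on validation data.

```python
# A minimal sketch of confidence-based early exiting over a list of intermediate classifiers.
import torch

@torch.no_grad()
def early_exit_inference(x, stages, classifiers, threshold=0.9):
    """stages[i] maps the previous feature to the next; classifiers[i] is the i-th exit head."""
    feat = x
    for stage, head in zip(stages, classifiers):
        feat = stage(feat)
        probs = torch.softmax(head(feat), dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:       # confident enough: stop here and skip deeper stages
            return pred, conf
    return pred, conf                      # otherwise fall back to the last exit
```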
To adapt the network topology to different instances, a common option is to build an additional policy network that learns a decision function governing the execution of multiple units in the main model. Each input sample is first processed by the policy network, whose output directly determines which parts of the main network should be activated. For example, BlockDrop [69] and GaterNet [88] use a policy network to adaptively control the depth (Sec. 2.1.1) and width (Sec. 2.1.2) of a backbone network, respectively. More generally, the dynamic routing decisions in a SuperNet can also be controlled by a policy network [102] (Sec. 2.1.3).

It is worth noting that the architecture design and the training process of such policy networks are typically developed for a specific backbone. This is considered a limitation of this decision scheme, because it cannot easily be adapted to different backbone architectures.
Gating functions are a general and flexible approach to decision making in dynamic networks. They can be conveniently adopted as plug-in modules at arbitrary locations in any backbone network. During inference, each module is responsible for controlling the local inference graph of a layer or block. The gating functions take in intermediate features and efficiently produce binary-valued gate vectors to decide 1) which channels need to be activated [15], [81], [82], [83], [84]; 2) which layers need to be skipped [45], [46], [85], [86]; 3) which paths should be selected in a SuperNet [101]; or 4) which locations of the input should be allocated computation [136], [137], [138]. Compared to the aforementioned decision policies, gating functions demonstrate notable generality and applicability. However, due to their lack of differentiability, these gating functions usually require specific training techniques, which are introduced in the following subsection.
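The sketch below illustrates a plug-in channel gating module of this kind; the squeeze-style gate architecture is an illustrative choice, and the hard threshold corresponds to test-time behavior (training typically relies on the techniques discussed in Sec. 5.2).

```python
# A minimal sketch of a plug-in channel gating module producing a binary gate vector.
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        g = (torch.sigmoid(self.gate(x)) > 0.5).float()   # binary gate per channel and per sample
        return x * g.view(x.size(0), -1, 1, 1)            # inactive channels are zeroed (and skippable)
```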
Besides architecture design, training is also essential for dynamic networks. Here we summarize the existing training strategies for dynamic models from the perspectives of objectives and optimization.
1) Training multi-exit networks.
First, we note that dynamic networks with early exits [12], [30] are generally trained by minimizing a weighted cumulative loss over the intermediate classifiers. One challenge in training such models is the joint optimization of multiple classifiers, which may interfere with each other. MSDNet [12] alleviates this problem through its multi-scale architecture and dense connections. Several training techniques are proposed in [63] to further improve the training of multi-exit networks, including a gradient equilibrium algorithm to stabilize the training process, and a bi-directional knowledge transfer approach to foster collaboration between classifiers.
2) Encouraging sparsity.
Some dynamic networks adapt their inference procedure by conditionally activating their computational units [45], [83] or by strategically sampling locations from the input [138]. Training these models without additional constraints would result in superfluous computational redundancy, as a network could tend to activate all the candidate units to minimize the task-specific loss. The overall objective function for restraining such redundancy is typically written as

$\mathcal{L} = \mathcal{L}_{\text{task}} + \gamma \mathcal{L}_{\text{sparse}},$

where $\gamma$ is a hyper-parameter balancing the two terms for the trade-off between accuracy and efficiency. In practice, the second term can be designed based on the gate/mask values of candidate units (e.g. channels [82], [83], layers [45], [46] or spatial locations [138]). Specifically, one may set a target activation rate for these units [46], [82], or directly minimize a norm of the gates/masks as a regularization term [138]. Moreover, it is also practical to optimize a resource-aware loss (e.g. FLOPs) [85], [101], [137], which can be estimated from the input and output feature dimensions of every candidate unit.
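A minimal sketch of this objective with a target activation rate is given below; the target rate of 0.5 and the value of $\gamma$ are illustrative hyper-parameters rather than recommended settings.

```python
# A minimal sketch of the sparsity objective L = L_task + gamma * L_sparse with a target activation rate.
import torch
import torch.nn.functional as F

def total_loss(logits, labels, gate_values, target_rate=0.5, gamma=0.1):
    """gate_values: concatenated soft gate activations in [0, 1] from all dynamic units."""
    task_loss = F.cross_entropy(logits, labels)
    sparse_loss = (gate_values.mean() - target_rate) ** 2   # keep the average activation near the target
    return task_loss + gamma * sparse_loss
```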
3) Other techniques.
Note that extra loss terms are mostly, but not exclusively, designed for improving efficiency. Taking [150] as an example, the multi-scale model progressively focuses on a selected region, and is trained with an additional inter-scale pairwise ranking loss, which is designed to obtain better region proposals with representative features. Moreover, knowledge distillation is also utilized to boost the co-training of multiple sub-networks in [79] and [63].
TABLE 3
Applications of Dynamic Networks. For the type column, I, S, T stand for instance-wise, spatial-wise and temporal-wise respectively.

Computer Vision | Image
  I: Object detection (face [38], [181], [182], facial point [183], pedestrian [184], general [31], [185], [186], [187], [188]), Image segmentation [101], [189], Super resolution [190], Style transfer [191], Coarse-to-fine classification [192]
  I & S: Image segmentation [32], [119], [136], [138], [140], [144], [146], [193], [194], [195], [196], [197], Image-to-image translation [198], Object detection [105], [106], [137], [138], [153], Semantic image synthesis [199], [200], [201], Image denoising [202], Fine-grained classification [148], [150], [203], [204], Eye tracking [148], Super resolution [141], [143], [205]
  I & S & T: General classification [37], [149], [152], Multi-object classification [206], [207], Fine-grained classification [151]

Computer Vision | Video
  I: Multi-task learning (human action recognition and frame prediction) [208]
  I & T: Classification (action recognition) [56], [166], [170], [172], [173], [175], [176], [177], [209], Semantic segmentation [210], Video face recognition [20], [171], Action detection [168], [169], Action spotting [167], [174]
  I & S & T: Frame interpolation [211], [212], Video super resolution [213], Video deblurring [214], [215], Action prediction [216]

Computer Vision | Point Cloud
  I & S: 3D shape classification and segmentation, 3D scene segmentation [217], 3D semantic scene completion [218]

Natural Language Processing | Text
  I: Natural language inference, Text classification, Paraphrase similarity matching, Sentiment analysis [54], [55]
  I & T: Language modeling [11], [16], [111], [160], [162], Machine translation [16], [33], [34], Classification [59], [60], [164], Sentiment analysis [156], [158], [159], [161], [165], Question answering [33], [58], [158], [161], [163]

Cross-Field | Image & Text
  I & S & T: Image captioning [120], [219], Visual question answering [220]

Others
  Document classification [146], Link prediction [221], Graph classification [113], Stereo confidence estimation [222], Recommendation systems [223]
A variety of dynamic networks contain non-differentiable functions that make discrete decisions to modify their architectures or to sample spatial/temporal locations from the input. These functions cannot be trained directly with back-propagation. Therefore, specific techniques have been proposed to enable end-to-end training, including estimating the gradients of non-differentiable variables [70], [162] and adopting reparameterization techniques [46], [88], [137], [138]. Other work exploits reinforcement learning (RL) to train such discrete actions [15], [45], [61], [169].
1) Gradient estimation approximates the gradients of non-differentiable functions to enable back-propagation. In [70], [162], the straight-through estimator (STE) is exploited to heuristically copy the gradient with respect to the stochastic (binarized) output directly as an estimate of the gradient with respect to the Sigmoid argument.
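The following sketch shows a basic straight-through estimator for a binary gate; the thresholded Sigmoid in the forward pass is an illustrative choice of binarization.

```python
# A minimal sketch of a straight-through estimator: hard 0/1 forward, identity backward.
import torch

class BinaryGateSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, logits):
        return (torch.sigmoid(logits) > 0.5).float()   # hard decision in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                              # copy the gradient straight through to the logits

# Usage: gate = BinaryGateSTE.apply(gate_logits)
```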
2) Reparameterization techniques.
Apart from STE, reparameterization techniques have also been proposed to handle the training of non-differentiable functions. For instance, the gating functions in [46], [82] are both trained with the Gumbel-SoftMax technique [224], [225] to control the network width or depth. To reduce spatial redundancy in CNNs, [138] and [137] also use Gumbel-SoftMax to sample feature pixels for dynamic convolution. In addition, Improved SemHash [226] is utilized by [84] and [88] to enable end-to-end training of hard gating modules.
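Below is a minimal sketch of training a binary gate with the Gumbel-SoftMax relaxation, using PyTorch's built-in F.gumbel_softmax; the two-way [off, on] parameterization is an illustrative choice.

```python
# A minimal sketch of sampling a differentiable binary gate via Gumbel-SoftMax.
import torch
import torch.nn.functional as F

def sample_gate(gate_logits, tau=1.0):
    """gate_logits: (..., 2) logits for the [off, on] decision of each candidate unit."""
    one_hot = F.gumbel_softmax(gate_logits, tau=tau, hard=True)  # discrete forward, soft backward
    return one_hot[..., 1]   # 0/1 gate that still receives gradients through the relaxation
```

During training, the temperature tau is typically annealed so that the relaxed samples gradually approach discrete decisions.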
3) Reinforcement learning.
The non-differentiable decision functions can also be trained with RL. Specifically, the backbones are trained by standard SGD, while the agents (either policy networks as introduced in Sec. 5.1.2 or gating functions as discussed in Sec. 5.1.3) are trained with RL to take discrete actions for dynamic inference graphs [15], [45], [69] or spatial/temporal sampling strategies [149], [151], [168], [176]. The reward signal usually includes a penalty term on the computational cost to encourage efficiency.
6 APPLICATIONS OF DYNAMIC NETWORKS
In this section, we summarize typical applications of dynamic DNNs. Based on the input data modality, we list representative methods and their corresponding adaptive modes in Table 3.

For image recognition, most dynamic CNNs are designed to conduct instance-wise or spatial-wise adaptive inference on the classification task, and many of these inference paradigms can be generalized to other tasks. Note that, as mentioned in Sec. 3.2, object recognition can be formulated as a sequential decision problem [37], [151]; by allowing early exiting in these approaches, a temporally adaptive inference procedure can also be realized.

For text data, reducing the intrinsic temporal redundancy of the sequential input has attracted great research interest. The inference paradigm of temporal-wise dynamic RNNs (see Sec. 4.1) is also general enough to process audio (e.g. multi-scale RNNs for speech recognition [227]). Based on large language models such as the Transformer [6] and BERT [7], data-dependent model depths [52], [53], [54], [55] have been extensively studied to reduce structural redundancy for efficient inference.

For video-related tasks, the three types of dynamic inference (instance-wise, spatial-wise and temporal-wise) can be implemented simultaneously [151], [211], [212]. However, networks that do not process videos recurrently, e.g. 3D CNNs [228], [229], [230], mostly follow a static inference scheme, and little research has been devoted to building dynamic 3D CNNs [178], which might be an interesting future direction.

Dynamic networks can also be exploited to tackle some fundamental problems in deep learning. For example, multi-exit models have been used to 1) alleviate the over-thinking issue while reducing the overall computation [49], [231]; 2) perform long-tailed classification [232] by inducing early exiting in the training stage; and 3) improve model robustness [233]. As another example, the idea of dynamic routing has been used to 1) reduce the training cost in a multi-task setting [234] and 2) find the optimal fine-tuning strategy per example in transfer learning [235].

7 DISCUSSIONS
Although significant advances have been made in the research of dynamic deep neural networks, many open problems remain worth exploring. In this section, we summarize a few challenges together with possible future directions in this field.

Despite the success of dynamic neural networks, relatively little research has been devoted to analyzing them from a theoretical perspective. In fact, theories for a deep understanding of current dynamic learning models, and for improving them in principled ways, are highly valuable. Here we list several theoretical problems that are fundamental for dynamic networks.
1) Optimal decision in dynamic networks.
An essential operation in most dynamic networks (especially those designed for improving computational efficiency) is making data-dependent decisions, e.g., determining whether a module should be evaluated or skipped. Existing solutions either use confidence-based criteria or introduce policy networks and gating functions. Although effective in practice (as discussed in Sec. 5), they may not be optimal and lack theoretical justification. Taking early exiting as an example, the current heuristic methods [12], [30] might face the issues of overconfidence, high sensitivity to threshold setting and poor transferability. As for policy networks and gating modules, runtime decisions can be made based on a learned function, but they often introduce extra computation and usually require a long and unstable training procedure. Therefore, principled approaches with theoretical guarantees for designing decision functions in dynamic networks are a valuable research topic.
2) Generalization issues.
In a dynamic model, a sub-network might be activated by a set of test samples that are not uniformly sampled from the data distribution, e.g., smaller sub-networks tend to handle "easy" samples, while larger sub-networks are used for "hard" inputs [12]. This brings a divergence between the training data distribution and that of the inference stage, and thus violates the common i.i.d. assumption in classical machine learning. Therefore, it would be interesting to develop new theories to analyze the generalization properties of dynamic networks under such a distribution mismatch. Note that transfer learning also aims to address distributional shift at test time, but the samples of the target domain are assumed to be accessible in advance. In contrast, for dynamic models, the test distribution is not available until the training process is finished, when the network architecture and parameters are finalized. This poses greater challenges than analyzing the generalization issues in transfer learning.
Architecture design has proven to be essential for deep networks. Existing research on architecture innovation mainly targets static models [4], [5], [25], while relatively little is dedicated to developing architectures specifically for dynamic networks. Most current approaches simply adopt structures designed for static models, which may lead to suboptimal solutions and degraded performance. For example, it has been observed that intermediate classifiers tend to interfere with each other in an early-exiting network, a problem that can be alleviated by a carefully designed multi-scale architecture with dense connections [12]. It is expected that architectures developed specifically for dynamic networks may further improve their effectiveness and efficiency. Possible research directions include designing dynamic network structures either by hand (as in [12], [30], [33], [65]) or by leveraging NAS techniques (as in [80], [102]). Moreover, considering the popularity of Transformers [128], developing dynamic versions of this family of models could be an interesting direction.
Many existing dynamic networks (e.g., most of the instance-wise adaptive networks) are designed specifically for classification tasks, and cannot be directly applied to other vision tasks such as object detection and semantic segmentation. The difficulty arises from the fact that for these tasks there is no simple criterion for asserting whether an input image is easy or hard, as it usually contains multiple objects and pixels of different levels of difficulty. Although many efforts, e.g., spatially adaptive models [31], [37], [138] and soft-attention-based models [13], [18], [19], have been made to address this issue, it remains challenging to develop a unified and elegant dynamic network that can serve as an off-the-shelf backbone for a variety of tasks.
Current deep learning hardware and libraries are mostly optimized for static models and may not be friendly to dynamic networks. Therefore, the practical runtime of dynamic models often lags behind their theoretical efficiency. For example, some spatially adaptive networks involve sparse computation, which is known to be inefficient on modern GPUs. In addition, dynamic inference usually requires the model to handle input samples sequentially, which also poses a challenge for parallel computation. This issue is mitigated in mobile/edge computing scenarios, where the input signal is itself sequential and the computing hardware is less powerful than high-end GPUs. Nevertheless, designing dynamic networks that are more compatible with existing hardware and software remains a valuable and challenging topic. Moreover, it is also an interesting research direction to optimize hardware and deep learning libraries to harvest the theoretical efficiency gains of dynamic networks.
Dynamic models may provide new perspectives for the research on adversarial robustness of deep neural networks, as shown in recent work [233]. In addition, traditional attacks are usually aimed at reducing the accuracy of models; for dynamic networks, it is possible to launch adversarial attacks that decrease efficiency and accuracy simultaneously [236]. The robustness of dynamic networks is an interesting yet understudied topic.
Dynamic networks inherit the black-box nature of deep neural networks, and thus also invite research on interpreting their working mechanisms. What is special here is that the adaptive inference paradigm, e.g., spatial/temporal adaptiveness, conforms well with that of the human visual system, and may offer new possibilities for making models more transparent to humans. In a dynamic network, it is usually convenient to analyze which part of the model is activated for a given input, or to locate which part of the input the model mostly relies on in making its prediction. We expect that the research on dynamic networks will inspire new work on the interpretability of deep learning.

REFERENCES

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In
NeurIPS , 2012.[2] Karen Simonyan and Andrew Zisserman. Very deep convolu-tional networks for large-scale image recognition. In
ICLR , 2015.[3] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, ScottReed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke,and Andrew Rabinovich. Going deeper with convolutions. In
CVPR , 2015.[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deepresidual learning for image recognition. In
CVPR , 2016.[5] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian QWeinberger. Densely connected convolutional networks. In
CVPR , 2017.[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin.Attention is all you need. In
NeurIPS , 2017.[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and KristinaToutanova. BERT: Pre-training of Deep Bidirectional Transform-ers for Language Understanding. In
ACL , 2019.[8] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah,Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, PranavShyam, Girish Sastry, Amanda Askell, et al. Language modelsare few-shot learners. In
NeurIPS , 2020.[9] Barret Zoph and Quoc V Le. Neural architecture search withreinforcement learning. In
ICLR , 2017.[10] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Dif-ferentiable Architecture Search. In
ICLR , 2018.[11] Alex Graves. Adaptive computation time for recurrent neuralnetworks. arXiv preprint arXiv:1603.08983 , 2016.[12] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van derMaaten, and Kilian Weinberger. Multi-scale dense networks forresource efficient image classification. In
ICLR , 2018.[13] Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam.Condconv: Conditionally parameterized convolutions for effi-cient inference. In
NeurIPS , 2019.[14] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamicrouting between capsules. In
NeurIPs , 2017.[15] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neuralpruning. In
NeurIPS , 2017.[16] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, AndyDavis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageouslylarge neural networks: The sparsely-gated mixture-of-expertslayer. In
ICLR , 2017.[17] Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen,Lu Yuan, and Zicheng Liu. Dynamic convolution: Attention overconvolution kernels. In
CVPR , 2020.[18] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks.In
CVPR , 2018.[19] Sanghyun Woo, Jongchan Park, Joon-Young Lee, andIn So Kweon. Cbam: Convolutional block attention module. In
ECCV , 2018.[20] Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, FangWen, Hongdong Li, and Gang Hua. Neural aggregation networkfor video face recognition. In
CVPR , 2017.[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochas-tic optimization. In
ICLR , 2015.[22] Sergey Ioffe and Christian Szegedy. Batch normalization: Acceler-ating deep network training by reducing internal covariate shift.
CoRR , abs/1502.03167, 2015.[23] Yulin Wang, Xuran Pan, Shiji Song, Hong Zhang, Gao Huang,and Cheng Wu. Implicit semantic data augmentation for deepnetworks. In
NeurIPS , 2019.[24] Ekin Dogus Cubuk, Barret Zoph, Dandelion Man´e, Vijay Vasude-van, and Quoc V. Le. Autoaugment: Learning augmentationpolicies from data.
CoRR , 2018. [25] Andrew G Howard, Menglong Zhu, Bo Chen, DmitryKalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto,and Hartwig Adam. Mobilenets: Efficient convolutional neu-ral networks for mobile vision applications. arXiv preprintarXiv:1704.04861 , 2017.[26] Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian QWeinberger. Condensenet: An efficient densenet using learnedgroup convolutions. In
CVPR , 2018.[27] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv,and Yoshua Bengio. Binarized neural networks. In
NeurIPS , 2016.[28] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling theknowledge in a neural network. In
NeurIPS Workshop , 2014.[29] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speed-ing up convolutional neural networks with low rank expansions.In
BMVC , 2014.[30] Le Yang, Yizeng Han, Xi Chen, Shiji Song, Jifeng Dai, and GaoHuang. Resolution Adaptive Networks for Efficient Inference. In
CVPR , 2020.[31] Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang,Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spa-tially adaptive computation time for residual networks. In
CVPR ,2017.[32] Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, and XiaoouTang. Not all pixels are equal: Difficulty-aware semantic segmen-tation via deep layer cascade. In
CVPR , 2017.[33] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszko-reit, and Lukasz Kaiser. Universal Transformers. In
ICLR , 2019.[34] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli.Depth-Adaptive Transformer. In
ICLR , 2020.[35] David H Hubel and Torsten N Wiesel. Receptive fields, binocularinteraction and functional architecture in the cat’s visual cortex.
The Journal of physiology , 1962.[36] Akira Murata, Vittorio Gallese, Giuseppe Luppino, MasakazuKaseda, and Hideo Sakata. Selectivity for the shape, size, andorientation of objects for grasping in neurons of monkey parietalarea aip.
Journal of neurophysiology , 2000.[37] Yulin Wang, Kangchen Lv, Rui Huang, Shiji Song, Le Yang, andGao Huang. Glance and focus: a dynamic approach to reducingspatial redundancy in image classification. In
NeurIPS , 2020.[38] Paul Viola and Michael J. Jones. Robust real-time face detection.
IJCV , 2004.[39] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Ge-offrey E Hinton. Adaptive mixtures of local experts.
Neuralcomputation , 1991.[40] Wolfgang Maass. Networks of spiking neurons: the third gener-ation of neural network models.
Neural networks , 1997.[41] Eugene M Izhikevich. Simple model of spiking neurons.
TNN ,2003.[42] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, ShoumengYan, and Changshui Zhang. Learning efficient convolutionalnetworks through network slimming. In
ICCV , 2017.[43] Surat Teerapittayanon, Bradley McDanel, and Hsiang-TsungKung. Branchynet: Fast inference via early exiting from deepneural networks. In
ICPR , 2016.[44] Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and VenkateshSaligrama. Adaptive neural networks for efficient inference. In
ICML , 2017.[45] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph EGonzalez. Skipnet: Learning dynamic routing in convolutionalnetworks. In
ECCV , 2018.[46] Andreas Veit and Serge Belongie. Convolutional networks withadaptive inference graphs. In
ECCV , 2018.[47] Eunhyeok Park, Dongyoung Kim, Soobeom Kim, Yong-DeokKim, Gunhee Kim, Sungroh Yoon, and Sungjoo Yoo. Big/littledeep neural network for ultra low power inference. In
CODES+ISSS , 2015.[48] Sam Leroux, Steven Bohez, Elias De Coninck, Tim Verbelen, BertVankeirsbilck, Pieter Simoens, and Bart Dhoedt. The cascadingneural network: building the internet of smart things.
KAIS , 2017.[49] Xin Wang, Yujia Luo, Daniel Crankshaw, Alexey Tumanov, FisherYu, and Joseph E Gonzalez. Idk cascades: Fast deep learning bylearning not to overthink. In
AUAI , 2017.[50] Jiaqi Guan, Yang Liu, Qiang Liu, and Jian Peng. Energy-efficientamortized inference with cascaded deep classifiers. In
IJCAI ,2018.[51] Xin Dai, Xiangnan Kong, and Tian Guo. Epnet: Learning to exitwith flexible multi-branch network. In
CIKM , 2020. [52] Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng,and QI JU. FastBERT: a Self-distilling BERT with AdaptiveInference Time. In ACL , 2020.[53] Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin.DeeBERT: Dynamic Early Exiting for Accelerating BERT Infer-ence. In
ACL , 2020.[54] Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, JesseDodge, and Noah A. Smith. The Right Tool for the Job: MatchingModel and Instance Complexities. In
ACL , 2020.[55] Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu,and Furu Wei. BERT Loses Patience: Fast and Robust Inferencewith Early Exit. arXiv:2006.04152 [cs] , 2020.[56] Hehe Fan, Zhongwen Xu, Linchao Zhu, Chenggang Yan, JianjunGe, and Yi Yang. Watching a small portion could be as goodas watching all: Towards efficient video classification. In
JICAI ,2018.[57] Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, Yi Yang,and Shilei Wen. Dynamic Inference: A New Approach TowardEfficient Video Action Recognition. In
CVPR Workshop , 2020.[58] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen.Reasonet: Learning to stop reading in machine comprehension.In
KDD , 2017.[59] Keyi Yu, Yang Liu, Alexander G. Schwing, and Jian Peng. Fastand accurate text classification: Skimming, rereading and earlystopping. In
ICLR Workshop , 2018.[60] Xianggen Liu, Lili Mou, Haotian Cui, Zhengdong Lu, and SenSong. Finding decision jumps in text classification.
Neurocomput-ing , 2020.[61] Mason McGill and Pietro Perona. Deciding how to decide:Dynamic routing in artificial neural networks. In
ICML , 2017.[62] Zequn Jie, Peng Sun, Xin Li, Jiashi Feng, and Wei Liu. Anytimerecognition with routing convolutional networks.
TPAMI , 2019.[63] Hao Li, Hong Zhang, Xiaojuan Qi, Ruigang Yang, and GaoHuang. Improved techniques for training adaptive deep net-works. In
ICCV , 2019.[64] Sam Leroux, Pavlo Molchanov, Pieter Simoens, Bart Dhoedt,Thomas Breuel, and Jan Kautz. IamNN: Iterative and AdaptiveMobile Neural Network for Efficient Image Classification. In
ICML Workshop , 2018.[65] Qiushan Guo, Zhipeng Yu, Yichao Wu, Ding Liang, Haoyu Qin,and Junjie Yan. Dynamic recursive neural network. In
CVPR ,2019.[66] Haichao Yu, Haoxiang Li, Honghui Shi, Thomas S Huang, andGang Hua. Any-precision deep neural networks. arXiv preprintarXiv:1911.07346 , 2019.[67] Qing Jin, Linjie Yang, and Zhenyu Liao. Adabits: Neural networkquantization with adaptive bit-widths. In
CVPR , 2020.[68] Jianghao Shen, Yonggan Fu, Yue Wang, Pengfei Xu, ZhangyangWang, and Yingyan Lin. Fractional skipping: Towards finer-grained dynamic cnn inference. In
AAAI , 2020.[69] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie,Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop:Dynamic inference paths in residual networks. In
CVPR , 2018.[70] Yoshua Bengio, Nicholas L´eonard, and Aaron Courville. Esti-mating or propagating gradients through stochastic neurons forconditional computation. arXiv preprint arXiv:1308.3432 , 2013.[71] Kyunghyun Cho and Yoshua Bengio. Exponentially increasingthe capacity-to-computation ratio for conditional computation indeep learning. arXiv preprint arXiv:1406.7362 , 2014.[72] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and DoinaPrecup. Conditional computation in neural networks for fastermodels.
ICLR Workshop , 2016.[73] Andrew Davis and Itamar Arel. Low-rank approximations forconditional feedforward computation in deep neural networks. arXiv preprint arXiv:1312.4461 , 2013.[74] David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. Learningfactored representations in a deep mixture of experts. In
ICLRWorkshop , 2013.[75] Ravi Teja Mullapudi, William R Mark, Noam Shazeer, andKayvon Fatahalian. Hydranets: Specialized dynamic architec-tures for efficient inference. In
CVPR , 2018.[76] Shaofeng Cai, Yao Shu, and Wei Wang. Dynamic routing net-works. In
WACV , 2021.[77] William Fedus, Barret Zoph, and Noam Shazeer. Switch Trans-formers: Scaling to Trillion Parameter Models with Simple andEfficient Sparsity. arXiv e-prints , 2021. [78] Sepp Hochreiter and J ¨urgen Schmidhuber. Long short-termmemory.
Neural computation , 1997.[79] Weizhe Hua, Yuan Zhou, Christopher M De Sa, Zhiru Zhang, andG Edward Suh. Channel gating neural networks. In
NeurIPS ,2019.[80] Zhihang Yuan, Bingzhe Wu, Zheng Liang, Shiwan Zhao, WeichenBi, and Guangyu Sun. S2dnas: Transforming static cnn model fordynamic inference via neural architecture search. In
ECCV , 2020.[81] Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, andCheng zhong Xu. Dynamic channel pruning: Feature boostingand suppression. In
ICLR , 2019.[82] Charles Herrmann, Richard Strong Bowen, and Ramin Zabih. Anend-to-end approach for speeding up neural network inference. arXiv preprint arXiv:1812.04180 , 2018.[83] Babak Ehteshami Bejnordi, Tijmen Blankevoort, and MaxWelling. Batch-shaping for learning conditional channel gatednetworks. In
ICLR , 2020.[84] Jinting Chen, Zhaocheng Zhu, Cheng Li, and Yuming Zhao. Self-adaptive network pruning. In
ICONIP , 2019.[85] Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, TanNguyen, Richard G. Baraniuk, Zhangyang Wang, and YingyanLin. Dual dynamic inference: Enabling more efficient, adaptiveand controllable deep inference.
JSTSP , 2020.[86] Wenhan Xia, Hongxu Yin, Xiaoliang Dai, and Niraj K Jha. Fullydynamic inference with deep neural networks. arXiv preprintarXiv:2007.15151 , 2020.[87] Ali Ehteshami Bejnordi and Ralf Krestel. Dynamic channel andlayer gating in convolutional neural networks. In KI , 2020.[88] Zhourong Chen, Yang Li, Samy Bengio, and Si Si. You look twice:Gaternet for dynamic filter selection in cnns. In CVPR , 2019.[89] Chuanjian Liu, Yunhe Wang, Kai Han, Chunjing Xu, and ChangXu. Learning instance-wise sparsity for accelerating deep mod-els. In
IJCAI , 2019.[90] Geoffrey E Hinton, Sara Sabour, and Nicholas Frosst. Matrixcapsules with EM routing. In
ICLR , 2018.[91] Augustus Odena, Dieterich Lawson, and Christopher Olah.Changing model behavior at test-time using reinforcement learn-ing. In
ICLR Workshop , 2017.[92] Lanlan Liu and Jia Deng. Dynamic deep neural networks:Optimizing accuracy-efficiency trade-offs by selective execution.In
AAAI , 2018.[93] Samuel Rota Bulo and Peter Kontschieder. Neural decisionforests for semantic image labelling. In
CVPR , 2014.[94] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, andSamuel Rota Bulo. Deep neural decision forests. In
ICCV , 2015.[95] Nicholas Frosst and Geoffrey Hinton. Distilling a neural networkinto a soft decision tree. arXiv preprint arXiv:1711.09784 , 2017.[96] Thomas M Hehn, Julian FP Kooij, and Fred A Hamprecht. End-to-end learning of decision trees and forests.
IJCV , 2019.[97] Hussein Hazimeh, Natalia Ponomareva, Petros Mol, Zhenyu Tan,and Rahul Mazumder. The Tree Ensemble Layer: Differentiabilitymeets Conditional Computation. arXiv preprint arXiv:2002.07772 ,2020.[98] Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Ja-gadeesh, Dennis DeCoste, Wei Di, and Yizhou Yu. Hd-cnn:hierarchical deep convolutional neural networks for large scalevisual recognition. In
ICCV , 2015.[99] Yani Ioannou, Duncan Robertson, Darko Zikic, PeterKontschieder, Jamie Shotton, Matthew Brown, and AntonioCriminisi. Decision forests, convolutional networks and themodels in-between. arXiv preprint arXiv:1603.01250 , 2016.[100] Ryutaro Tanno, Kai Arulkumaran, Daniel Alexander, AntonioCriminisi, and Aditya Nori. Adaptive neural trees. In
ICML ,2019.[101] Yanwei Li, Lin Song, Yukang Chen, Zeming Li, Xiangyu Zhang,Xingang Wang, and Jian Sun. Learning Dynamic Routing forSemantic Segmentation. In
CVPR , 2020.[102] An-Chieh Cheng, Chieh Hubert Lin, Da-Cheng Juan, Wei Wei,and Min Sun. Instanas: Instance-aware neural architecture search.In
AAAI , 2020.[103] Adam W. Harley, Konstantinos G. Derpanis, and Iasonas Kokki-nos. Segmentation-aware convolutional networks using localattention masks. In
ICCV , 2017.[104] Hang Su, Varun Jampani, Deqing Sun, Orazio Gallo, ErikLearned-Miller, and Jan Kautz. Pixel-adaptive convolutionalneural networks. In
CVPR , 2019. [105] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, HanHu, and Yichen Wei. Deformable convolutional networks. In ICCV , 2017.[106] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformableconvnets v2: More deformable, better results. In
CVPR , 2019.[107] Hang Gao, Xizhou Zhu, Stephen Lin, and Jifeng Dai. DeformableKernels: Adapting Effective Receptive Fields for Object Deforma-tion. In
ICLR , 2019.[108] Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ran-zato, and Nando De Freitas. Predicting parameters in deeplearning. In
NeurIPS , 2013.[109] J ¨urgen Schmidhuber. Learning to control fast-weight memories:An alternative to dynamic recurrent networks.
Neural Computa-tion , 1992.[110] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool.Dynamic filter networks. In
NeurIPS , 2016.[111] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. In
ICLR , 2016.[112] Ningning Ma, Xiangyu Zhang, Jiawei Huang, and Jian Sun.WeightNet: Revisiting the Design Space of Weight Networks. In
ECCV , 2020.[113] Martin Simonovsky and Nikos Komodakis. Dynamic Edge-Conditioned Filters in Convolutional Neural Networks onGraphs. In
CVPR , 2017.[114] Di Kang, Debarun Dhar, and Antoni Chan. Incorporating sideinformation by adaptive convolution. In
NeurIPS , 2017.[115] HyunJae Lee, Hyo-Eun Kim, and Hyeonseob Nam. Srm: A style-based recalibration module for convolutional neural networks.In
ICCV , 2019.[116] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, WangmengZuo, and Qinghua Hu. ECA-net: Efficient channel attention fordeep convolutional neural networks. In
CVPR , 2020.[117] Jingda Guo, Xu Ma, Andrew Sansom, Mara McGuire, AndrewKalaani, Qi Chen, Sihai Tang, Qing Yang, and Song Fu. Spanet:Spatial Pyramid Attention Network for Enhanced Image Recog-nition. In
ICME , 2020.[118] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li,Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residualattention network for image classification. In
CVPR , 2017.[119] Abhijit Guha Roy, Nassir Navab, and Christian Wachinger. Con-current spatial and channel ‘squeeze & excitation’in fully convo-lutional networks. In
MICCAI , 2018.[120] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao,Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and channel-wiseattention in convolutional networks for image captioning. In
CVPR , 2017.[121] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi.Gather-excite: Exploiting feature context in convolutional neuralnetworks. In
NeurIPS , 2018.[122] Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen,Lu Yuan, and Zicheng Liu. Dynamic relu. In
ECCV , 2020.[123] Ningning Ma, Xiangyu Zhang, and Jian Sun. Funnel activationfor visual recognition. In
ECCV , 2020.[124] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selectivekernel networks. In
CVPR , 2019.[125] Shenlong Wang, Linjie Luo, Ning Zhang, and Li-Jia Li. Au-toscaler: Scale-attention networks for visual correspondence. In
BMVC , 2017.[126] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He.Non-local neural networks. In
CVPR , 2018.[127] Kaiyu Yue, Ming Sun, Yuchen Yuan, Feng Zhou, Errui Ding, andFuxin Xu. Compact generalized non-local network. In
NeurIPS ,2018.[128] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, DirkWeissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa De-hghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.An image is worth 16x16 words: Transformers for image recog-nition at scale. arXiv preprint arXiv:2010.11929 , 2020.[129] Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, andVarun Mithal. An attentive survey of attention models. arXivpreprint arXiv:1904.02874 , 2019.[130] Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, and JifengDai. An empirical study of spatial attention mechanisms in deepnetworks. In
ICCV , 2019.[131] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed WaqasZamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformersin vision: A survey. arXiv preprint arXiv:2101.01169 , 2021. [132] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, andAntonio Torralba. Learning deep features for discriminativelocalization. In
CVPR , 2016.[133] Mengye Ren, Andrei Pokrovsky, Bin Yang, and Raquel Urtasun.SBNet: Sparse Blocks Network for Fast Inference.
CVPR , 2018.[134] Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. Moreis less: A more complicated network with less inference complex-ity. In
CVPR , 2017.[135] Shijie Cao, Lingxiao Ma, Wencong Xiao, Chen Zhang, Yunxin Liu,Lintao Zhang, Lanshun Nie, and Zhi Yang. Seernet: Predictingconvolutional neural network feature-map sparsity through low-bit quantization. In
CVPR , 2019.[136] Shu Kong and Charless Fowlkes. Pixel-wise attentional gatingfor scene parsing. In
WACV, 2019.
[137] Thomas Verelst and Tinne Tuytelaars. Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference. In CVPR, 2020.
[138] Zhenda Xie, Zheng Zhang, Xizhou Zhu, Gao Huang, and Stephen Lin. Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation. In ECCV, 2020.
[139] Amjad Almahairi, Nicolas Ballas, Tim Cooijmans, Yin Zheng, Hugo Larochelle, and Aaron Courville. Dynamic capacity networks. In ICML, 2016.
[140] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image segmentation as rendering. In CVPR, 2020.
[141] Aritra Bhowmik, Suprosanna Shit, and Chandra Sekhar Seelamantula. Training-free, single-image super-resolution using a dynamic convolutional network. IEEE Signal Processing Letters, 2017.
[142] Jialin Wu, Dai Li, Yu Yang, Chandrajit Bajaj, and Xiangyang Ji. Dynamic filtering with large sampling field for convnets. In ECCV, 2018.
[143] Xuecai Hu, Haoyuan Mu, Xiangyu Zhang, Zilei Wang, Tieniu Tan, and Jian Sun. Meta-SR: A magnification-arbitrary network for super-resolution. In CVPR, 2019.
[144] Jiaqi Wang, Kai Chen, Rui Xu, Ziwei Liu, Chen Change Loy, and Dahua Lin. CARAFE: Content-Aware ReAssembly of FEatures. In ICCV, 2019.
[145] Jin Chen, Xijun Wang, Zichao Guo, Xiangyu Zhang, and Jian Sun. Dynamic region-aware convolution. arXiv preprint arXiv:2003.12243, 2020.
[146] Guangrun Wang, Keze Wang, and Liang Lin. Adaptively connected neural networks. In CVPR, 2019.
[147] Max Jaderberg, Karen Simonyan, and Andrew Zisserman. Spatial transformer networks. In NeurIPS, 2015.
[148] Adria Recasens, Petr Kellnhofer, Simon Stent, Wojciech Matusik, and Antonio Torralba. Learning to zoom: a saliency-based sampling layer for neural networks. In ECCV, 2018.
[149] Volodymyr Mnih, Nicolas Heess, and Alex Graves. Recurrent models of visual attention. In NeurIPS, 2014.
[150] Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, 2017.
[151] Zhichao Li, Yi Yang, Xiao Liu, Feng Zhou, Shilei Wen, and Wei Xu. Dynamic computational time for visual attention. In ICCV Workshop, 2017.
[152] Amir Rosenfeld and Shimon Ullman. Visual concept recognition and localization via iterative introspection. In ACCV, 2016.
[153] Zekun Hao, Yu Liu, Hongwei Qin, Junjie Yan, Xiu Li, and Xiaolin Hu. Scale-aware face detection. In CVPR, 2017.
[154] Zerui Yang, Yuhui Xu, Wenrui Dai, and Hongkai Xiong. Dynamic-stride-net: Deep convolutional neural network with dynamic stride. In SPIE Optoelectronic Imaging and Multimedia Technology, 2019.
[155] Huiyu Wang, Aniruddha Kembhavi, Ali Farhadi, Alan L. Yuille, and Mohammad Rastegari. Elastic: Improving CNNs with dynamic scaling policies. In CVPR, 2019.
[156] Adams Wei Yu, Hongrae Lee, and Quoc Le. Learning to Skim Text. In ACL, 2017.
[157] Víctor Campos, Brendan Jou, Xavier Giró-i-Nieto, Jordi Torres, and Shih-Fu Chang. Skip RNN: Learning to skip state updates in recurrent neural networks. In ICLR, 2018.
[158] Christian Hansen, Casper Hansen, Stephen Alstrup, Jakob Grue Simonsen, and Christina Lioma. Neural Speed Reading with Structural-Jump-LSTM. In ICLR, 2019.
[159] Jin Tao, Urmish Thakker, Ganesh Dasika, and Jesse Beu. Skipping RNN State Updates without Retraining the Original Model. In SenSys-ML, 2019.
[160] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.
[161] Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. Neural Speed Reading via Skim-RNN. In ICLR, 2018.
[162] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. In ICLR, 2017.
[163] Nan Rosemary Ke, Konrad Żołna, Alessandro Sordoni, Zhouhan Lin, Adam Trischler, Yoshua Bengio, Joelle Pineau, Laurent Charlin, and Christopher Pal. Focused Hierarchical RNNs for Conditional Sequence Processing. In ICML, 2018.
[164] Zhengjie Huang, Zi Ye, Shuangyin Li, and Rong Pan. Length adaptive recurrent model for text classification. In CIKM, 2017.
[165] Tsu-Jui Fu and Wei-Yun Ma. Speed Reading: Learning to Read ForBackward via Shuttle. In EMNLP, 2018.
[166] Zuxuan Wu, Caiming Xiong, Yu-Gang Jiang, and Larry S. Davis. LiteEval: A coarse-to-fine framework for resource efficient video recognition. In NeurIPS, 2019.
[167] Guillaume Vaudaux-Ruth, Adrien Chan-Hon-Tong, and Catherine Achard. ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos. arXiv:2004.06971 [cs], 2020.
[168] Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In CVPR, 2016.
[169] Yu-Chuan Su and Kristen Grauman. Leaving some stones unturned: Dynamic feature prioritization for activity detection in streaming video. In ECCV, 2016.
[170] Zuxuan Wu, Caiming Xiong, Chih-Yao Ma, Richard Socher, and Larry S. Davis. AdaFrame: Adaptive Frame Selection for Fast Video Recognition. In CVPR, 2019.
[171] Yongming Rao, Jiwen Lu, and Jie Zhou. Attention-aware deep reinforcement learning for video face recognition. In ICCV, 2017.
[172] Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, and Shilei Wen. Multi-agent reinforcement learning based frame sampling for effective untrimmed video recognition. In ICCV, 2019.
[173] Bruno Korbar, Du Tran, and Lorenzo Torresani. SCSampler: Sampling salient clips from video for efficient action recognition. In ICCV, 2019.
[174] Humam Alwassel, Fabian Caba Heilbron, and Bernard Ghanem. Action search: Spotting actions in videos and its application to temporal action localization. In ECCV, 2018.
[175] Yue Meng, Chung-Ching Lin, Rameswar Panda, Prasanna Sattigeri, Leonid Karlinsky, Aude Oliva, Kate Saenko, and Rogerio Feris. AR-Net: Adaptive frame resolution for efficient action recognition. In ECCV, 2020.
[176] Yansong Tang, Yi Tian, Jiwen Lu, Peiyang Li, and Jie Zhou. Deep Progressive Reinforcement Learning for Skeleton-Based Action Recognition. In CVPR, 2018.
[177] Yin-Dong Zheng, Zhaoyang Liu, Tong Lu, and Limin Wang. Dynamic Sampling Networks for Efficient Action Recognition in Videos. TIP, 2020.
[178] Hengduo Li, Zuxuan Wu, Abhinav Shrivastava, and Larry S. Davis. 2D or not 2D? Adaptive 3D convolution selection for efficient video recognition. arXiv preprint arXiv:2012.14950, 2020.
[179] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, 2017.
[180] Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In CVPR, 2019.
[181] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Neural network-based face detection. TPAMI, 1998.
[182] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional neural network cascade for face detection. In CVPR, 2015.
[183] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In CVPR, 2013.
[184] Anelia Angelova, Alex Krizhevsky, Vincent Vanhoucke, Abhijit Ogale, and Dave Ferguson. Real-Time Pedestrian Detection with Deep Network Cascades. In BMVC, 2015.
[185] Fan Yang, Wongun Choi, and Yuanqing Lin. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In CVPR, 2016.
[186] Hong-Yu Zhou, Bin-Bin Gao, and Jianxin Wu. Adaptive feeding: Achieving fast and accurate detections by adaptively combining object detectors. In ICCV, 2017.
[187] Tong Yang, Xiangyu Zhang, Zeming Li, Wenqiang Zhang, and Jian Sun. MetaAnchor: Learning to detect objects with customized anchors. In NeurIPS, 2018.
[188] Chunlin Chen and Qiang Ling. Adaptive Convolution for Object Detection. IEEE Transactions on Multimedia, 2019.
[189] Hiroki Tokunaga, Yuki Teramoto, Akihiko Yoshizawa, and Ryoma Bise. Adaptive weighting multi-field-of-view CNN for semantic segmentation in pathology. In CVPR, 2019.
[190] Gernot Riegler, Samuel Schulter, Matthias Ruther, and Horst Bischof. Conditioned regression models for non-blind single image super-resolution. In ICCV, 2015.
[191] Falong Shen, Shuicheng Yan, and Gang Zeng. Neural style transfer via meta networks. In CVPR, 2018.
[192] Yu-Gang Jiang, Changmao Cheng, Hangyu Lin, and Yanwei Fu. Learning layer-skippable inference network. TIP, 2020.
[193] Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multi-scale filters for semantic segmentation. In ICCV, 2019.
[194] Dmitrii Marin, Zijian He, Peter Vajda, Priyam Chatterjee, Sam Tsai, Fei Yang, and Yuri Boykov. Efficient segmentation: Learning downsampling near semantic boundaries. In ICCV, 2019.
[195] Jun Li, Yongjun Chen, Lei Cai, Ian Davidson, and Shuiwang Ji. Dense transformer networks for brain electron microscopy image segmentation. In IJCAI, 2019.
[196] Fei Wu, Feng Chen, Xiao-Yuan Jing, Chang-Hui Hu, Qi Ge, and Yimu Ji. Dynamic attention network for semantic segmentation. Neurocomputing, 2020.
[197] Zilong Zhong, Zhong Qiu Lin, Rene Bidart, Xiaodan Hu, Ibrahim Ben Daya, Zhifeng Li, Wei-Shi Zheng, Jonathan Li, and Alexander Wong. Squeeze-and-Attention Networks for Semantic Segmentation. In CVPR, 2020.
[198] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
[199] Xihui Liu, Guojun Yin, Jing Shao, Xiaogang Wang, and Hongsheng Li. Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis. In NeurIPS, 2019.
[200] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[201] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. SEAN: Image Synthesis with Semantic Region-Adaptive Normalization. In CVPR, 2020.
[202] Meng Chang, Qi Li, Huajun Feng, and Zhihai Xu. Spatial-adaptive network for single image denoising. In ECCV, 2020.
[203] Tianjun Xiao, Yichong Xu, Kuiyuan Yang, Jiaxing Zhang, Yuxin Peng, and Zheng Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In CVPR, 2015.
[204] Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In ICCV, 2017.
[205] Wanjie Sun and Zhenzhong Chen. Learned image downscaling for upscaling using content adaptive resampler. TIP, 2020.
[206] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015.
[207] SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, and Geoffrey E. Hinton. Attend, infer, repeat: Fast scene understanding with generative models. In NeurIPS, 2016.
[208] Ali Diba, Vivek Sharma, Luc Van Gool, and Rainer Stiefelhagen. DynamoNet: Dynamic action and motion network. In ICCV, 2019.
[209] Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, and Lorenzo Torresani. Listen to look: Action recognition by previewing audio. In CVPR, 2020.
[210] Yu-Syuan Xu, Tsu-Jui Fu, Hsuan-Kung Yang, and Chun-Yi Lee. Dynamic video segmentation network. In CVPR, 2018.
[211] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In ICCV, 2017.
[212] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In CVPR, 2017.
[213] Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In CVPR, 2018.
[214] Tae Hyun Kim, Kyoung Mu Lee, Bernhard Scholkopf, and Michael Hirsch. Online video deblurring via dynamic temporal blending network. In CVPR, 2017.
[215] Shangchen Zhou, Jiawei Zhang, Jinshan Pan, Haozhe Xie, Wangmeng Zuo, and Jimmy Ren. Spatio-temporal filter adaptive network for video deblurring. In ICCV, 2019.
[216] Lei Chen, Jiwen Lu, Zhanjie Song, and Jie Zhou. Part-activated deep reinforcement learning for action prediction. In ECCV, 2018.
[217] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J. Guibas. KPConv: Flexible and deformable convolution for point clouds. In ICCV, 2019.
[218] Jie Li, Kai Han, Peng Wang, Yu Liu, and Xia Yuan. Anisotropic convolutional networks for 3D semantic scene completion. In CVPR, 2020.
[219] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[220] Peng Gao, Hongsheng Li, Shuang Li, Pan Lu, Yikang Li, Steven CH Hoi, and Xiaogang Wang. Question-guided hybrid convolution for visual question answering. In ECCV, 2018.
[221] Xiaotian Jiang, Quan Wang, and Bin Wang. Adaptive convolution for multi-relational learning. In NAACL, 2019.
[222] Sunok Kim, Seungryong Kim, Dongbo Min, and Kwanghoon Sohn. LAF-Net: Locally adaptive fusion networks for stereo confidence estimation. In CVPR, 2019.
[223] Weiping Song, Zhiping Xiao, Yifan Wang, Laurent Charlin, Ming Zhang, and Jian Tang. Session-based social recommendation via dynamic graph attention networks. In WSDM, 2019.
[224] Emil Julius Gumbel. Statistical theory of extreme values and some practical applications. NBS Applied Mathematics Series, 1954.
[225] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.
[226] Łukasz Kaiser and Samy Bengio. Discrete autoencoders for sequence models. arXiv preprint arXiv:1801.09797, 2018.
[227] Raffaele Tavarone and Leonardo Badino. Conditional-Computation-Based Recurrent Neural Networks for Computationally Efficient Acoustic Modelling. In Interspeech, 2018.
[228] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[229] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, 2017.
[230] Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Limin Wang, and Shilei Wen. StNet: Local and global spatial-temporal modeling for action recognition. In AAAI, 2019.
[231] Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. Shallow-deep networks: Understanding and mitigating network overthinking. In ICML, 2019.
[232] Rahul Duggal, Scott Freitas, Sunny Dhamnani, Duen Horng Chau, and Jimeng Sun. ELF: An Early-Exiting Framework for Long-Tailed Classification. arXiv:2006.11979 [cs, stat], 2020.
[233] Ting-Kuei Hu, Tianlong Chen, Haotao Wang, and Zhangyang Wang. Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference. In ICLR, 2020.
[234] Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. In ICLR, 2018.
[235] Yunhui Guo, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, and Rogerio Feris. SpotTune: Transfer learning through adaptive fine-tuning. In CVPR, 2019.
[236] Sanghyun Hong, Yiğitcan Kaya, Ionuț-Vlad Modoranu, and Tudor Dumitraș. A panda? No, it's a sloth: Slowdown attacks on adaptive multi-exit neural network inference. arXiv preprint arXiv:2010.02432, 2020.