ALERT: Accurate Learning for Energy and Timeliness
Chengcheng Wan, Muhammad Santriaji, Eri Rogers, Henry Hoffmann, Michael Maire, Shan Lu
The University of Chicago
Abstract
An increasing number of software applications incorporate runtime Deep Neural Networks (DNNs) to process sensor data and return inference results to humans. Effective deployment of DNNs in these interactive scenarios requires meeting latency and accuracy constraints while minimizing energy, a problem exacerbated by common system dynamics. Prior approaches handle dynamics through either (1) system-oblivious DNN adaptation, which adjusts DNN latency/accuracy tradeoffs, or (2) application-oblivious system adaptation, which adjusts resources to change latency/energy tradeoffs. In contrast, this paper improves on the state-of-the-art by coordinating application- and system-level adaptation. ALERT, our runtime scheduler, uses a probabilistic model to detect environmental volatility and then simultaneously selects both a DNN and a system resource configuration to meet latency, accuracy, and energy constraints. We evaluate ALERT on CPU and GPU platforms for image and speech tasks in dynamic environments. ALERT's holistic approach achieves more than 13% energy reduction and 27% error reduction over prior approaches that adapt solely at the application or system level. Furthermore, ALERT incurs only 3% more energy consumption and 2% higher DNN-inference error than an oracle scheme with perfect application and system knowledge.
Deep neural networks (DNNs) have become a key workload for many computing systems due to their high inference accuracy. This accuracy, however, comes at a cost of long latency, high energy usage, or both. Successful DNN deployment requires meeting a variety of user-defined, application-specific goals for latency, accuracy, and often energy in unpredictable, dynamic environments. Latency constraints naturally arise with DNN deployments when inference interacts with the real world as a consumer—processing data streamed from a sensor—or a producer—returning a series of answers to a human. For example, in motion tracking, a frame must be processed at camera speed [41]; in simultaneous interpretation, translation must be provided every 2–4 seconds [57]. Violating these deadlines may lead to severe consequences: if a self-driving vehicle cannot act within a small time budget, life-threatening accidents could follow [54]. Accuracy and energy requirements are also common and may vary for different applications in different operating environments. On one hand, low inference accuracy can lead to software failures [68, 81]. On the other hand, it is beneficial to minimize DNN energy or resource usage to extend mobile-battery time or reduce server-operation cost [42]. These requirements are also highly dynamic.
For example, the latency requirement for a job could vary dynamically depending on how much time has already been consumed by related jobs before it [54]; the power budget and the accuracy requirement for a job may switch among different settings depending on what type of events are currently sensed [1]. Additionally, the latency requirement may change based on the computing system's current context; e.g., in robotic vision systems the latency requirement can change based on the robot's speed and distance from perceived pedestrians [19]. Satisfying all these requirements in a dynamic computing environment where the inference job may compete for resources against unpredictable, co-located jobs is challenging. Although prior work addresses these problems at either the application level or system level separately, each approach by itself lacks critical information that could be used to produce better results. At the application level, different DNN designs—with different depths, widths, and numeric precisions—provide various latency-accuracy trade-offs for the same inference task [27, 40, 43, 78, 86]. Even more dynamic schemes have been proposed that adapt the DNN by dynamically changing its structure at the beginning of [23, 62, 85, 88] or during [5, 35, 36, 50, 53, 83, 87] every inference task. Although helpful, these techniques are sub-optimal without considering system-level adaptation options.
For example, under energy pressure, these application-level adaptation techniques have to switch to lower-accuracy DNNs, sacrificing accuracy for energy saving, even if the energy goal could have been achieved by lowering the system power setting (if there is sufficient latency budget). At the system level, machine learning [4, 15, 16, 52, 64, 69, 70, 80] and control theory [33, 38, 45, 46, 63, 71, 75, 92] based techniques have been proposed to dynamically assign system resources to better satisfy system and application constraints. Unfortunately, without considering the option of application adaptations, these techniques also reach sub-optimal solutions. For example, when the current DNN offers much higher accuracy than necessary, switching to a lower-precision DNN may offer much more energy saving than any system-level adaptation technique. This problem is exacerbated because, in the DNN design space, very small drops in accuracy enable dramatic reductions in latency, and therefore in system resource requirements. A cross-stack solution would enable DNN applications to meet multiple, dynamic constraints. However, offering such a holistic solution is non-trivial. The combination of DNN and system-resource adaptation creates a huge configuration space, making it difficult to dynamically and efficiently predict which combination of DNN and system settings will meet all the requirements optimally. Furthermore, without careful coordination, adaptations at the application and system level may conflict and cause constraint violations, like missing a latency deadline due to switching to a higher-accuracy DNN and a lower power setting at the same time. This paper presents ALERT, a cross-stack runtime system for DNN inference that meets user goals by simultaneously adapting both DNN models and system-resource settings.
Understanding the challenges
We profile DNN inference across applications, inputs, hardware, and resource contention, confirming that there is high variation in inference time. This leads to challenges in meeting not only latency but also energy and accuracy requirements. Furthermore, our profiling of 42 existing DNNs for image classification confirms that different designs offer a wide spectrum of latency, energy, and accuracy tradeoffs. In general, higher accuracy comes at the cost of longer latency and/or higher energy consumption. These trade-offs provide both opportunities and challenges to holistic inference management (Section 2).
Run-time inference management
We design ALERT, a DNN inference management system that dynamically selects and adapts a DNN and a system-resource setting together to handle changing system environments and meet dynamic
energy, latency, and accuracy requirements with probabilistic guarantees (Section 3).

Figure 1: ALERT inference system

ALERT is a feedback-based run-time. It measures inference accuracy, latency, and energy consumption; it checks whether the requirements on these goals are met; and it then outputs both system- and application-level configurations adjusted to the current requirements and operating conditions. ALERT focuses on meeting constraints in any two dimensions while optimizing the third; e.g., minimizing energy given accuracy and latency requirements, or maximizing accuracy given latency and energy budgets. The key is estimating how DNN and system configurations interact to affect the goals. To do so, ALERT addresses three primary challenges: (1) the combined DNN and system configuration space is huge; (2) the environment may change dynamically (including input, available resources, and even the required constraints); and (3) the predictions must be low overhead to have negligible impact on the inference itself. ALERT addresses these challenges with a global slow-down factor, a random variable relating the current runtime environment to a nominal profiling environment. After each inference task, ALERT estimates the global slow-down factor using a Kalman filter. The global slow-down factor's mean represents the expected change compared to the profile, while the variance represents the current volatility. The mean provides a single scalar that modifies the predicted latency/accuracy/energy for every DNN/system configuration—a simple mechanism that leverages commonality among DNN architectures to allow prediction for even rarely used configurations (tackling challenge 1), while incorporating variance into predictions naturally makes ALERT conservative in volatile environments and aggressive in quiescent ones (tackling challenge 2). The global slow-down factor and Kalman filter are efficient to implement and low-overhead (tackling challenge 3).
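The slow-down-factor estimate described above can be sketched with a scalar Kalman-style filter. The class below is a minimal illustration only, not ALERT's exact formulation (the paper's version, with adaptive process noise, is given in Eq. 5); the noise constants and observed latencies are made-up values.

```python
# Minimal sketch of a global slow-down factor estimator.
# Constants (q, r) and all latencies below are illustrative, not measured.

class SlowdownEstimator:
    """Tracks xi = observed_latency / profiled_latency as a noisy scalar."""

    def __init__(self, mean=1.0, var=1.0, q=0.1, r=0.001):
        self.mean = mean  # estimated mean of the slow-down factor
        self.var = var    # estimated variance of the estimate
        self.q = q        # process noise (fixed here; ALERT adapts it)
        self.r = r        # measurement noise

    def update(self, observed_latency, profiled_latency):
        xi = observed_latency / profiled_latency  # one noisy observation
        # Standard scalar Kalman update: predict, then correct.
        prior_var = self.var + self.q
        gain = prior_var / (prior_var + self.r)
        self.mean += gain * (xi - self.mean)
        self.var = (1 - gain) * prior_var
        return self.mean, self.var

    def predict_latency(self, profiled_latency):
        # Scale any configuration's profiled latency by the shared factor,
        # even if that configuration was never run recently.
        return self.mean * profiled_latency

est = SlowdownEstimator()
for obs in [0.12, 0.13, 0.11, 0.12]:  # observed latencies (s) of one config
    est.update(obs, profiled_latency=0.10)
print(est.predict_latency(0.05))      # predicted latency of another config
```

Because the factor is shared, one configuration's history updates the latency prediction of every other configuration, which is what makes the huge combined space tractable.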
Thus, ALERT combines the global slow-down factor with latency, power, and accuracy measurements to select the DNN and system configuration with the highest likelihood of meeting the constraints optimally. ALERT provides probabilistic, not hard, guarantees, as the latter would require much more conservative configurations, often hurting both energy and accuracy. Section 3.6 discusses this issue further.
We evaluate ALERT using various DNNs and application domains on different (CPU and GPU) machines under various constraints. Our evaluation shows that ALERT overcomes dynamic variability efficiently. Across various experimental settings, ALERT meets constraints in most cases while achieving within 93–99% of optimal energy saving or accuracy optimization. Compared to approaches that adapt at the application level or system level only, ALERT achieves more than 13% energy reduction and 27% error reduction (Section 5).
We conduct an empirical study to examine the large trade-off space offered by different DNN designs and system settings (Sec. 2.1), and the timing variability of inference (Sec. 2.2).
         Embedded                 CPU1               CPU2                         GPU
CPU      ARM Cortex [email protected] GHz  [email protected] GHz  Xeon(R) Gold [email protected] GHz  [email protected] GHz
GPU      none                     none               none                         RTX 2080
Memory   DDR3 2G                  DDR4 16G           DDR4 16G*12                  DDR4 16G
LLC      2MB                      9MB                19.25MB                      9MB
Table 1: Hardware platforms used in our experiments
ID     Task                   DNN Models      Datasets
IMG1   Image Classification   VGG16 [79]      ILSVRC2012 (ImageNet)
IMG2   Image Classification   ResNet50 [30]   ILSVRC2012 (ImageNet)
NLP1   Sentence Prediction    RNN             Penn Treebank [60]
NLP2   Question Answering     Bert [18]       Stanford Q&A Dataset (SQuAD) [72]
Table 2: ML tasks and benchmark datasets in our experiments

We use two canonical machine learning tasks, with state-of-the-art networks and common datasets (see Table 2), on a diverse set of hardware platforms, representing embedded systems, laptops (CPU1), CPU servers (CPU2), and GPU platforms (see Table 1). The two tasks, image classification and natural language processing (NLP), are often deployed with deadlines—e.g., for motion tracking [41] and simultaneous interpretation [57]—and both have received wide attention, leading to a diverse set of DNN models.
Tradeoffs from DNNs
We run all 42 image classification models provided by the Tensorflow website [77] on the 50000 images from ImageNet [17], and measure their average latency, accuracy (error rate), and energy consumption. The results from CPU2 are shown in Figure 2. We can clearly see two trends from the figure, which hold on other machines. First, different DNN models offer a wide spectrum of accuracy (error rate in the figure), latency, and energy. As shown
Figure 2: Tradeoffs for 42 DNNs (CPU2).
Figure 3: Tradeoffs for ResNet50 at different power settings (CPU2). (Numbers inside circles are power limit settings.)

in the figure, the fastest model runs almost 18× faster than the slowest one, and the most accurate model has about 7.8× lower error rate than the least accurate. These models also consume a wide range—more than 20×—of energy usage. Second, there is no magic DNN that offers both the best accuracy and the lowest latency, confirming the intuition that there exists a tradeoff between DNN accuracy and resource usage. Of course, some DNNs offer better tradeoffs than others. In Figure 2, all the networks sitting above the lower-convex-hull curve represent sub-optimal tradeoffs.

Tradeoffs from system settings
We run ResNet50 under 31 power settings from 40–100W on CPU2. We consider a sensor processing scenario with periodic inputs, setting the period to the latency under the 40W cap. We then plot the average energy consumed for the whole period (run-time plus idle energy) and the average inference latency in Figure 3. The results reflect two trends, which hold on other machines. First, a large latency/energy space is available by changing system settings. The fastest setting (100W) is more than 2× faster than the slowest setting (40W). The most energy-hungry setting (64W) uses 1.3× more energy than the least (40W). Second, there is no easy way to choose the best setting. For example, 40W offers the lowest energy, but the highest latency. Furthermore, most of these points are sub-optimal in terms of energy and latency tradeoffs. For example, 84W should be chosen for extremely low latency deadlines, but all other nearby points (from 52–100) will harm latency, energy, or both. Additionally, when deadlines change or when there is resource contention, the energy-latency curve also changes and different points become optimal.

Figure 4: Latency variance across inputs for different tasks and hardware (Most tasks have 3 boxplots for 3 hardware platforms, CPU1-2, GPU from left to right; NLP1 has an extra boxplot for Embedded; other tasks run out of memory on Embedded; every box shows the 25th–75th percentile; points beyond the whiskers are >90th or <10th).
Summary:
DNN models and system-resource settings offer a huge trade-off space. The energy/latency tradeoff space is not smooth (when accounting for deadlines and idle power) and optimal operating points cannot be found with simple gradient-based heuristics. Thus, there is a great opportunity and also a great challenge in picking different DNN models and system-resource settings to satisfy inference latency, accuracy, and energy requirements.
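To make the idle-power effect behind this non-smoothness concrete, the sketch below computes per-period energy for a periodic-input scenario like the one in Figure 3. The power/latency pairs and the idle power are hypothetical values, not our measurements.

```python
# Why energy/latency tradeoffs are not smooth once deadlines and idle
# power are accounted for. All numbers below are hypothetical.

def energy_per_period(power_cap, inference_time, period, idle_power=10.0):
    """Energy for one input period: active inference, plus idling at
    idle_power until the next input arrives (inference must fit the period)."""
    assert inference_time <= period
    return power_cap * inference_time + idle_power * (period - inference_time)

# Hypothetical (power cap in W, inference latency in s) points for one DNN:
configs = [(40, 0.50), (60, 0.35), (80, 0.28), (100, 0.25)]
period = 0.5  # one input every 500 ms

for watts, latency in configs:
    print(watts, round(energy_per_period(watts, latency, period), 2))
```

Under this model, the slowest setting that still meets the period can be the most energy-efficient (it never idles), while raising the power cap buys latency slack at a super-linear energy cost; shrink the period and the cheapest feasible point jumps discontinuously.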
To understand how DNN inference varies across inputs, platforms, and run-time environments, and hence how (not) helpful off-line profiling is, we run a set of experiments below, where we feed the network one input at a time and use 1/10 of the total data for warm up, to emulate real-world scenarios. We plot the inference latency without and with co-located jobs in Figures 4 and 5, and we see several trends. First, deadline violation is a realistic concern. Image classification on video has deadlines ranging from 1 second to the camera latency (e.g., 1/60 seconds) [41]; the two NLP tasks have deadlines around 1 second [65]. There is clearly no single inference task that meets all deadlines on all hardware. Second, the inference variation among inputs is relatively small, particularly when there are no co-located jobs (Fig. 4), except for NLP1, where the large variance is mainly caused by different input lengths. For other tasks, outlier inputs exist but are rare. Third, the latency and its variation across inputs are both greatly affected by resource contention. Comparing Figure 5 with Figure 4, we can see that the co-located job has increased
Figure 5: Latency variance with co-located jobs (the memory-intensive STREAM benchmark [61] co-located on Embedded, CPU1-2; GPU-intensive Backprop [9] co-located on GPU)

both the median latency, the tail latency, and the difference between these two for all tasks on all platforms. This trend also applies to other contention cases. While the discussion above is about latency, similar conclusions apply to inference accuracy and energy: the accuracy typically drops to close to 0 when the inference time exceeds the latency requirement, and the energy consumption naturally changes with inference time.
Summary:
Deadline violations are realistic concerns, and inference latency varies greatly across platforms, under contention, and sometimes across inputs. Clearly, sticking to one static DNN design across platforms and workloads leads to an unpleasant trade-off: always meeting the deadline by sacrificing accuracy or energy in most settings, or achieving high accuracy sometimes but exceeding the deadline in others. Furthermore, it is also sub-optimal to make run-time decisions based solely on off-line profiling, considering the variation caused by run-time contention.
We now show how confining adaptation to a single layer (just application or system) is insufficient. We run the ImageNet classification on CPU1. We examine a range of latency (0.1s–0.7s) and accuracy constraints (85%–95%), and try meeting those constraints while minimizing energy by either (1) configuring just the DNN (selecting a DNN from a family, like that in Figure 2) or (2) configuring just the system (by selecting resources to control energy–latency tradeoffs as in Figure 3). We compare these single-layer approaches to one that simultaneously picks the DNN and system configuration. As we are concerned with the ideal case, we create oracles by running 90 inputs in all possible DNN and system configurations, from which we find the best configuration for each input. The App-level oracle uses the default system setting. The Sys-level oracle uses the default (highest accuracy) DNN.
Figure 6: Minimize-energy task with latency and accuracy constraints @ CPU1. (∞ means unable to meet the constraints)

Figure 6 shows the results. As we have a three-dimensional problem—meeting accuracy and latency constraints with minimal energy—we linearize the constraints and show them on the x-axis (accuracy changes faster, latency slower, so each latency bin contains all accuracy goals). There are several important conclusions here. First, the App-only approach meets all possible accuracy and latency constraints, while the Sys-only approach cannot meet any constraints below 0.3s. Second, across the entire constraint range, App-only consumes significantly more energy than Combined (60% more on average). The intuition behind Combined's superiority is that there are discrete choices for DNNs; so when one is selected, there are almost always energy-saving opportunities by tailoring resource usage to that DNN's needs.

Summary:
Combining DNN- and system-level approaches achieves better outcomes. If left solely to the application, energy will be wasted. If left solely to the system, many achievable constraints will not be met.
ALERT's runtime system navigates the large tradeoff space created by combining DNN-level and system-level adaptation. ALERT meets user-specified latency, accuracy, and energy constraints and optimization goals while accounting for run-time variations in the environment or the goals themselves.
ALERT's inputs are specifications about (1) the adaptation options, including a set of DNN models $D = \{d_i \mid i = 1 \cdots K\}$ and a set of system-resource settings, expressed as different power caps $P = \{p_j \mid j = 1 \cdots L\}$; and (2) the user-specified requirements on latency, accuracy, and energy usage, which can take the form of meeting constraints in any two of these three dimensions while optimizing the third. ALERT's output is the DNN model $d_i \in D$ and the system-resource setting $p_j \in P$ for the next inference-task input. Formally, ALERT selects a DNN $d_i$ and a system-resource setting $p_j$ to fulfill either of these user-specified goals.

Maximizing inference accuracy $q$ (minimizing error) for an energy budget $E_{goal}$ and inference deadline $T_{goal}$:

$$\arg\max_{i,j} q_{i,j} \quad \text{s.t.} \quad e_{i,j} \le E_{goal} \wedge t_{i,j} \le T_{goal} \quad (1)$$

Minimizing the energy use $e$ for an accuracy goal $Q_{goal}$ and inference deadline $T_{goal}$:

$$\arg\min_{i,j} e_{i,j} \quad \text{s.t.} \quad q_{i,j} \ge Q_{goal} \wedge t_{i,j} \le T_{goal} \quad (2)$$

We omit the discussion of meeting energy and accuracy constraints while minimizing latency, as it is a trivial extension of the discussed techniques and we believe it to be the least practically useful. We also omit the problem of optimizing all three dimensions, as it creates a feasibility problem, leaving nothing for optimization—lowest latency and highest accuracy are impractical to achieve simultaneously.

Generality
Along the DNN-adaptation side, the input DNN set can consist of any DNNs that offer different accuracy, latency, and energy tradeoffs, e.g., those in Figure 2. In particular, ALERT can work with either or both of the broad classes of DNN adaptation approaches that have arisen recently, including: (1) traditional DNNs, where the adaptation option must be selected prior to starting an inference task [21, 23, 62, 85, 88], and (2) anytime DNNs, which produce a series of outputs as they execute [5, 35, 36, 50, 53, 83, 87]. These two classes are similar in that they both vary things like the network depth or width to create latency/accuracy tradeoffs. On the system-resource side, ALERT uses a power cap as the proxy for system resource usage. Since both hardware [14] and software resource managers [34, 73, 89] can convert power budgets into optimal performance resource allocations, ALERT is compatible with many different schemes from both commercial products and the research literature.
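The selection problem formalized in Eqs. 1 and 2 can be sketched as a brute-force scan over (DNN, power cap) pairs, scoring each by its probability of meeting the deadline under a normal latency model and the resulting expected accuracy. This is an illustrative sketch, not ALERT's implementation: the configuration names, profiled numbers, slow-down statistics (`mu`, `sigma`), and the simplified energy model (which ignores idle energy) are all assumptions.

```python
# Hedged sketch of picking a (DNN, power cap) pair under a deadline and
# an energy budget. All profiled numbers below are hypothetical.
import math

def normal_cdf(x, mean, std):
    """P(X <= x) for X ~ Normal(mean, std)."""
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))

def pick_config(configs, mu, sigma, t_goal, e_goal, q_fail=0.0):
    """Maximize expected accuracy subject to an energy budget."""
    best = None
    for cfg in configs:
        mean_t = mu * cfg["t_prof"]                      # scaled latency
        std_t = max(sigma * cfg["t_prof"], 1e-9)
        p_meet = normal_cdf(t_goal, mean_t, std_t)       # deadline probability
        # Step-function accuracy: full accuracy if on time, q_fail otherwise.
        exp_q = p_meet * cfg["q"] + (1 - p_meet) * q_fail
        energy = cfg["power"] * mean_t                   # simplified: no idle term
        if energy <= e_goal and (best is None or exp_q > best[0]):
            best = (exp_q, cfg)
    return best

# Hypothetical profiles: latency (s), accuracy, power (W) per configuration.
configs = [
    {"name": "small@40W", "t_prof": 0.05, "q": 0.85, "power": 40},
    {"name": "large@40W", "t_prof": 0.20, "q": 0.95, "power": 40},
    {"name": "large@80W", "t_prof": 0.11, "q": 0.95, "power": 80},
]
exp_q, cfg = pick_config(configs, mu=1.2, sigma=0.05, t_goal=0.15, e_goal=15.0)
print(cfg["name"], round(exp_q, 3))
```

Note how the combined choice wins here: the large DNN at 40W would miss the deadline and the small DNN sacrifices accuracy, while raising the power cap lets the large DNN fit the latency budget.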
ALERT works as a feedback controller. It follows four steps to pick the DNN and resource settings for each input n:

1) Measurement. ALERT records the processing time and energy usage, and computes the inference accuracy, for input n−1.

2) Goal adjustment. ALERT updates the latency goal T_goal if necessary, considering the potential latency-requirement variation across inputs. In some inference tasks, a set of inputs share one combined requirement (e.g., in the NLP1 task in Table 2, all the words in a sentence are processed by a DNN one by one and share one sentence-wise deadline), and hence delays in processing previous inputs could greatly shorten the available time for the next input [1, 48]. Additionally, ALERT sets the goal latency to compensate for its own, worst-case overhead so that ALERT itself will not cause violations.

3) Feedback-based estimation. ALERT computes the expected latency, accuracy, and energy consumption for every combination of DNN model and power setting.

4) Picking a configuration. ALERT feeds all the updated estimations of latency, accuracy, and energy into Eqs. 1 and 2, and gets the desired DNN model and power-cap setting for n.

The key task is step 3: the estimation needs to be accurate and fast. In the remainder of this section, we discuss the key ideas and the exact algorithm of our feedback-based estimation.

Strawman
Solving Eqs. 1 and 2 would be trivially easy if the deployment environment were guaranteed to match the training and profiling environment: we could estimate $t_{i,j}$ to be the average (or worst case, etc.) inference time $t^{prof}_{i,j}$ over a set of profiling inputs under model $d_i$ and power setting $p_j$. However, this approach does not work given the dynamic input, contention, and requirement variation. Next, we present the key ideas behind how ALERT estimates the inference latency, accuracy, and energy consumption under model $d_i$ and power setting $p_j$.

How to estimate the inference latency $t_{i,j}$? To handle the run-time variation, a potential solution is to apply an estimator, like a Kalman filter [56], to make dynamic predictions based on recent history about inferences under model $d_i$ and power $p_j$. The problem is that most models and power settings will not have been picked recently and hence would have no recent history to feed into the estimator. This problem is a direct example of the challenge imposed by the large space of combined application and system options.

Idea 1: Handle the large selection space with a single scalar value.
To make effective online estimation for all combinations of models and power settings, ALERT introduces a global slow-down factor ξ to capture how the current environment differs from the profiled environment (e.g., due to co-running processes, input variation, or other changes). Such an environmental slow-down factor is independent of any individual model or power selection. It can fully leverage execution history, no matter which models and power settings were recently used; it can then be used to estimate $t_{i,j}$ based on $t^{prof}_{i,j}$ for all $d_i$ and $p_j$ combinations. Applying a global slowdown factor to all combinations of application- and system-level settings is crucial for ALERT to make quick decisions for every inference task. Although it is possible that some perturbations may lead to different slowdowns for different configurations, the slight loss of accuracy here is outweighed by the benefit of having a simple mechanism that allows prediction even for configurations that have not been used recently. This idea is also novel in ALERT, as previous cross-stack management systems all use much more complicated models to estimate and select different setting combinations (e.g., using model predictive control to estimate combinations of settings [58]). ALERT's global slowdown factor is based on several unique features of DNN families that accomplish the same task with different accuracy/latency tradeoffs. We categorize these features as: (1) similarity of code paths and (2) proportionality of structure. The first is based on the observation that DNNs do not have complex conditional code dependences, so we do not need to worry about the case where different inputs would exercise very different code paths.
Thus, what ALERT learns about latency, accuracy, and energy for one input will always inform it about future inputs. The second feature refers to the fact that as DNNs in a family scale in latency, the proportions of different operations tend to be similar, so what ALERT learns about one DNN in the family generally applies to other DNNs in the same family. These properties of DNNs do not hold for many other types of software, where different inputs or additional functionality can invoke entirely different code paths, with different resource requirements or responses.

How to estimate the accuracy under a deadline?
Given a deadline $T_{goal}$, the inference accuracy delivered by model $d_i$ and power setting $p_j$ is determined by three factors, as shown in Eq. 3: (1) whether the inference result, which takes time $t_{i,j}$, can be generated before the deadline $T_{goal}$; (2) if yes, the accuracy is determined by the model $d_i$; (3) if not, the accuracy drops to that offered by a backup result $q_{fail}$. For traditional DNN models, without any output at the deadline, a random guess will be used and $q_{fail}$ will be much worse than $q_i$. For anytime DNN models that output multiple results as they are ready, the backup result is the latest output [5, 35, 36, 50, 53, 83, 87], which we discuss more in Section 3.5.

$$q_{i,j}[T_{goal}] = \begin{cases} q_i, & \text{if } t_{i,j} \le T_{goal} \\ q_{fail}, & \text{otherwise} \end{cases} \quad (3)$$

A potential solution to estimate the accuracy $q_{i,j}$ at the deadline $T_{goal}$ is to simply feed the estimated $t_{i,j}$ into Eq. 3. However, this simple approach fails to account for two issues. First, while DNNs are generally well-behaved, significant tail effects are possible (see Figure 4). Second, Eq. 3 is not linear, and is best understood as a step function, where a failure to complete inference by the deadline results in a worthless inference output ($q_{fail}$). Combined, these two issues mean that for tail inputs, inference will produce a worthless result; i.e., accuracy is not proportional to latency, but can easily fall to zero for tail inputs. The tail will, of course, be increased if there is any unexpected resource contention. Therefore, the simple approach of using the mean latency prediction fails to account for the non-linear effects of latency on accuracy.

Idea 2: Handle the runtime variation and account for tail behavior.
To handle the run-time variability, ALERT considers $t_{i,j}$ and the global slow-down factor ξ as random variables drawn from a normal distribution. (Since it could be infeasible to calculate the exact inference accuracy at run time, ALERT uses the average training accuracy of the selected DNN model $d_i$, denoted as $q_i$, as the inference accuracy, as long as the inference computation finishes before the specified deadline.) ALERT uses a recently proposed extension to the Kalman filter to adaptively update the noise covariance [2]. While this extension was originally proposed to produce better estimates of the mean, a novel approach in ALERT is using this covariance estimate as a measure of system volatility. ALERT uses this Kalman filter extension to predict not just the mean accuracy, but also the likelihood of meeting the accuracy requirements in the current operating environment. Section 5.3 shows the advantages of our extensions.

How to minimize energy or satisfy energy constraints?
Minimizing energy or satisfying energy constraints is complicated, as the energy is related to, but cannot be easily calculated from, the complexity of the selected model $d_i$ and the power cap $p_j$. As discussed in Section 2.2, the energy consumption includes both that used during the inference under a given model $d_i$ and that used during the inference-idle period, waiting for the next input. Consequently, it is not straightforward to decide which power setting to use.

Idea 3.
ALERT leverages insights from previous research, which shows that energy for latency-constrained systems can be efficiently expressed as a mathematical optimization problem [8, 49, 51, 63]. These frameworks optimize energy by scheduling available configurations in time: time is assigned to configurations so that the average performance hits the desired latency target and the overall energy (including idle energy) is minimal. The key is that while the configuration space is large, the number of constraints is small (typically just two). Thus, the number of configurations assigned a non-zero time is also small (equal to the number of constraints) [49]. Given this structure, the optimization problem can be solved using a binary search over available configurations, or even more efficiently with a hash table [63].

The only difficulty in applying this prior work to ALERT is that prior work assumed there was only a single job running at a time, while ALERT assumes that other applications might contend for resources. Thus, ALERT cannot assume that there is a single system-idle state that will be used whenever the DNN is not executing. To address this challenge, ALERT continually estimates the system power when DNN inference is idle (but other, non-inference tasks might be active), p^DNNidle, transforming Eq. 1 into:

  argmax_{i,j} q_{i,j}[T_goal]  s.t.  p_{i,j} · t_{i,j} + p^DNNidle · t^DNNidle ≤ E_goal    (4)

Global Slow-down Factor ξ. As discussed in Idea-1, ALERT uses ξ to reflect how the run-time environment differs from the profiling environment. Conceptually, if the inference task under model d_i and power cap p_j took time t_{i,j} at run time and took t^prof_{i,j} on average to finish during profiling, the corresponding ξ would be t_{i,j} / t^prof_{i,j}. ALERT estimates ξ using recent execution history under any model or power setting. Specifically, after an input n−1, ALERT computes ξ^(n−1) as the ratio of the observed time t^(n−1)_{i,j} to the profiled time t^prof_{i,j}, and then uses a Kalman Filter to estimate the mean µ^(n) and variance (σ^(n))² of ξ^(n) at input n. ALERT's formulation is defined in Eq. 5, where K^(n) is the Kalman gain variable, R is a constant reflecting the measurement noise, and Q^(n) is the process noise, bounded below by Q^(0). We set a forgetting factor of process variance α and initialize K^(0), R, Q^(0), µ^(0), and (σ^(0))², following the standard convention [56]:

  Q^(n) = max{ Q^(0), α·Q^(n−1) + (1−α)·(K^(n−1)·y^(n−1))² }
  K^(n) = [ (1−K^(n−1))·(σ^(n−1))² + Q^(n) ] / [ (1−K^(n−1))·(σ^(n−1))² + Q^(n) + R ]
  y^(n) = t^(n−1)_{i,j} / t^prof_{i,j} − µ^(n−1)
  µ^(n) = µ^(n−1) + K^(n)·y^(n)
  (σ^(n))² = (1−K^(n−1))·(σ^(n−1))² + Q^(n)    (5)

Then, using ξ^(n), ALERT estimates the inference time of input n under any model d_i and power cap p_j: t^(n)_{i,j} = ξ^(n) · t^prof_{i,j}.

Probability of meeting the deadline.
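To make this concrete, the Eq. 5 update and the deadline probability it enables can be sketched in Python. This is a minimal sketch; the constants (α, R, Q(0)) and the initial state are illustrative assumptions, not the paper's tuned values.

```python
import math

# Sketch of ALERT's global slowdown-factor estimator (Eq. 5) and the
# probability of meeting a deadline (Eq. 6). All constants below are
# illustrative assumptions, not the paper's tuned values.

class SlowdownEstimator:
    def __init__(self, alpha=0.3, R=0.01, Q0=0.01, mu0=1.0, var0=0.1):
        self.alpha, self.R, self.Q0 = alpha, R, Q0
        self.mu, self.var = mu0, var0          # mean / variance of xi
        self.K, self.Q, self.y = 0.5, Q0, 0.0  # gain, process noise, innovation

    def update(self, t_observed, t_profiled):
        # Process noise adapts with a forgetting factor, floored at Q0.
        self.Q = max(self.Q0,
                     self.alpha * self.Q
                     + (1 - self.alpha) * (self.K * self.y) ** 2)
        prior_var = (1 - self.K) * self.var + self.Q
        self.K = prior_var / (prior_var + self.R)    # Kalman gain
        self.y = t_observed / t_profiled - self.mu   # innovation
        self.mu += self.K * self.y                   # updated mean of xi
        self.var = prior_var                         # updated variance of xi
        return self.mu, self.var

def deadline_probability(mu, var, t_prof, t_goal):
    """Pr[xi * t_prof <= t_goal] for xi ~ N(mu, var), as in Eq. 6."""
    mean, std = mu * t_prof, math.sqrt(var) * t_prof
    return 0.5 * (1.0 + math.erf((t_goal - mean) / (std * math.sqrt(2.0))))
```

For instance, after observing an input that ran 20% slower than its profile, the estimated mean of ξ moves above 1 and the probability of meeting a tight deadline drops accordingly.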
Given the Kalman Filter estimation for the global slowdown factor, we can calculate Pr_{i,j}, the probability that the inference completes before the deadline T_goal. ALERT computes this value using a cumulative distribution function (CDF) based on the normal distribution of ξ^(n) estimated by the Kalman Filter:

  Pr_{i,j} = Pr[ ξ^(n) · t^prof_{i,j} ≤ T_goal ]
           = CDF( ξ^(n) · t^prof_{i,j}, T_goal )
           = CDF( µ^(n) · t^prof_{i,j}, σ^(n) · t^prof_{i,j}, T_goal )    (6)

Accuracy.
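The expectation in Eq. 7 below reduces to a two-term mixture, which can be sketched in a few lines; the q_fail value used in the example is an illustrative assumption.

```python
# Sketch of Eq. 7: expected accuracy mixes the model's accuracy (when it
# meets the deadline) with a random guess (when it does not).

def expected_accuracy(pr_meet, q_model, q_fail):
    return pr_meet * q_model + (1.0 - pr_meet) * q_fail
```

For a model with 0.98 accuracy, a 97% chance of meeting the deadline, and a hypothetical random-guess accuracy of 0.1, the expectation is 0.97 × 0.98 + 0.03 × 0.1 = 0.9536.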
As discussed in Idea-2, ALERT computes the estimated inference accuracy q̂_{i,j}[T_goal] by considering t_{i,j} as a random variable that follows a normal distribution, with its mean and variance computed from those of ξ. Here q_{i,j} represents the inference accuracy when the DNN inference finishes before the deadline, and q_fail is the accuracy of a random guess:

  q̂_{i,j}[T_goal] = E( q_{i,j}[T_goal] | t^(n)_{i,j} )
                  = E( q_{i,j}[T_goal] | ξ^(n) · t^prof_{i,j} )
                  = Pr_{i,j} · q_{i,j} + (1 − Pr_{i,j}) · q_fail,
  where ξ^(n) ∼ N( µ^(n), (σ^(n))² )    (7)

Energy.
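The two estimates described next, the idle-power-ratio filter of Eq. 8 and the mean-latency energy of Eq. 9, can be sketched as follows; the constants (φ(0), M(0), S, V) are illustrative assumptions, not the paper's values.

```python
# Sketch of the DNN-idle power-ratio filter (Eq. 8) and the mean-latency
# energy estimate (Eq. 9). The initial state and noise constants are
# illustrative assumptions.

class IdlePowerEstimator:
    def __init__(self, phi0=0.3, M0=0.1, S=0.01, V=0.05):
        self.phi, self.M, self.S, self.V = phi0, M0, S, V

    def update(self, p_idle_observed, p_cap):
        W = (self.M + self.S) / (self.M + self.S + self.V)    # Kalman gain
        self.M = (1 - W) * (self.M + self.S)                  # variance update
        self.phi += W * (p_idle_observed / p_cap - self.phi)  # idle-power ratio
        return self.phi

def estimated_energy(p_cap, mu, t_prof, phi, t_goal):
    """Eq. 9: inference energy plus idle energy until the deadline."""
    t_infer = mu * t_prof              # mean predicted inference latency
    return p_cap * t_infer + phi * p_cap * (t_goal - t_infer)
```

For example, under a 10 W cap, a 50 ms mean latency, a 100 ms deadline, and an idle-power ratio of 0.3, the per-input estimate is 10 × 0.05 + 3 × 0.05 = 0.65 J.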
As discussed in Idea-3, ALERT predicts energy consumption by separately estimating the energy used during (1) DNN execution, estimated by multiplying the power limit by the predicted inference latency, and (2) the inference-idle period before the deadline. For the latter, ALERT uses a second Kalman Filter to track the DNN-idle power. (A Kalman Filter is an optimal estimator that assumes a normal distribution and estimates a varying quantity from multiple, potentially noisy observations [56].) In Eq. 8, φ^(n) is the predicted DNN-idle power ratio, M^(n) is the process variance, S is the process noise, V is the measurement noise, and W^(n) is the Kalman Filter gain; ALERT initializes M^(0), S, and V as constants:

  W^(n) = ( M^(n−1) + S ) / ( M^(n−1) + S + V )
  M^(n) = (1 − W^(n)) · ( M^(n−1) + S )
  φ^(n) = φ^(n−1) + W^(n) · ( p_idle / p^(n−1)_{i,j} − φ^(n−1) )    (8)

ALERT then predicts the energy with Eq. 9. Unlike Eq. 7, which uses probabilistic estimates, the energy estimation is calculated without the notion of probability: because ALERT sets power limits, the inference power is the same whether the inference misses or meets the deadline. It is therefore safe to estimate the energy from the mean latency without considering the distribution of possible latencies (see Eq. 12 for an estimate based on a worst-case latency percentile):

  e^(n)_{i,j} = p_{i,j} · ξ^(n) · t^prof_{i,j} + φ^(n) · p_{i,j} · ( T_goal − ξ^(n) · t^prof_{i,j} )    (9)

Selecting Configurations.
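This selection step can be sketched as a filter-then-argmax over candidate configurations; the candidate tuples and their estimates below are hypothetical.

```python
# Sketch of configuration selection for the maximize-accuracy goal (Eq. 1):
# keep configurations whose estimated latency and energy meet the
# constraints, then pick the highest expected accuracy. The candidate
# list (names and estimates) is hypothetical.

def select_config(candidates, t_goal, e_goal):
    valid = [c for c in candidates
             if c["t_est"] <= t_goal and c["e_est"] <= e_goal]
    if not valid:
        return None                      # no configuration meets constraints
    return max(valid, key=lambda c: c["q_exp"])

candidates = [
    {"name": "large@15W", "t_est": 0.050, "e_est": 0.80, "q_exp": 0.951},
    {"name": "small@10W", "t_est": 0.020, "e_est": 0.40, "q_exp": 0.949},
    {"name": "large@10W", "t_est": 0.080, "e_est": 0.70, "q_exp": 0.920},
]
best = select_config(candidates, t_goal=0.060, e_goal=0.85)
```

Tightening the deadline in this toy example (e.g., t_goal = 0.030) filters out both large-model configurations, so the small model wins by default.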
Given the estimates of latency, expected accuracy, and energy consumption, ALERT generates the set of valid configurations that meet all of the constraints. ALERT then chooses the best valid configuration according to the optimization goal; i.e., ALERT selects a configuration that solves either Eq. 1 or Eq. 2 using the estimated latency, accuracy, and energy from Equations 5, 7, and 9, respectively.

In unpredictable environments, i.e., those with high estimated variance from Eq. 5, ALERT is naturally more conservative, selecting from fewer valid configurations. Consider an example scenario with two DNN candidates: the larger one has an estimated accuracy of 0.98 and a 97% probability of meeting the deadline, while the smaller one has an estimated accuracy of 0.95 and a 99.9% probability. The larger DNN has the lower probability because it takes longer to run. When the observed variance is low, ALERT picks the larger DNN for its higher expected accuracy (97% × 0.98 = 0.951, compared with the smaller one's 99.9% × 0.95 = 0.949). When the observed variance grows, ALERT increases the Q value (Eq. 5) and thus its estimated variance (σ)². The higher estimated variance decreases the probability of completion by the deadline for every configuration in Eq. 6, and it decreases more for the larger DNN, which has the larger latency. In our example, the larger DNN's probability of completion drops from 97% to 95%, decreasing its expected accuracy to 0.941, while the smaller DNN's probability only drops from 99.9% to 99.5%, decreasing its expected accuracy to 0.945. ALERT then chooses the smaller DNN, which now has the higher expected accuracy (because it is more likely to complete) in this high-variance environment.

Manipulating ALERT's Probabilistic Guarantees.
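The threshold mechanism described next (Eqs. 10–12) amounts to one extra constraint plus a percentile latency. The percentile-based energy estimate of Eq. 12 can be sketched with the standard library's normal quantile function; all numeric inputs below are illustrative.

```python
from statistics import NormalDist

# Sketch of Eq. 12: a pessimistic energy estimate that replaces the mean
# latency with the Pr_th-percentile latency drawn from the distribution
# N(mu * t_prof, (sigma * t_prof)^2). Inputs are illustrative.

def pessimistic_energy(p_cap, mu, sigma, t_prof, phi, t_goal, pr_th):
    t_worst = NormalDist(mu * t_prof, sigma * t_prof).inv_cdf(pr_th)
    return p_cap * t_worst + phi * p_cap * (t_goal - t_worst)
```

At pr_th = 0.5 this reduces to the mean-based estimate of Eq. 9; higher thresholds yield strictly higher estimates whenever φ < 1, so more configurations are rejected, matching the tighter-energy-bound behavior described in the text.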
ALERT's default setting uses a full mathematical expectation, without explicitly defining a probabilistic threshold Pr_th for meeting the constraints. Users can set this probabilistic threshold Pr_th according to their needs, and ALERT will then not select configurations whose probability falls below it. Adding this capability is as simple as adding another constraint to Eq. 1 in the accuracy-maximizing scenario:

  argmax_{i,j} q_{i,j}  s.t.  e_{i,j} ≤ E_goal ∧ t_{i,j} ≤ T_goal ∧ Pr_{i,j} ≥ Pr_th    (10)

In the energy-minimizing scenario, Eq. 2 is modified to be:

  argmin_{i,j} e_{i,j}  s.t.  q_{i,j} ≥ Q_goal ∧ t_{i,j} ≤ T_goal ∧ Pr_{i,j} ≥ Pr_th    (11)

The energy estimation can also be updated accordingly, for users who want more control over ALERT's energy guarantees:

  e^(n)_{i,j} = p_{i,j} · CDF⁻¹( ξ^(n) · t^prof_{i,j}, Pr_th ) + φ^(n) · p_{i,j} · ( T_goal − CDF⁻¹( ξ^(n) · t^prof_{i,j}, Pr_th ) ),    (12)

where CDF⁻¹( ξ^(n) · t^prof_{i,j}, Pr_th ) is the inverse of the cumulative distribution function. It takes two inputs: (1) the distribution of the random variable ξ^(n) · t^prof_{i,j} and (2) the user threshold Pr_th, which indicates the required probability of meeting the goal. It outputs the predicted latency at the Pr_th percentile of the distribution of t_{i,j}, i.e., a worst-case latency for that percentile. Compared with Eq. 9, the energy estimate from this equation is higher, as it uses a higher-percentile latency. ALERT will therefore reject more configurations and may deliver lower expected accuracy, as the cost of tighter energy bounds.

An anytime DNN is an inference model that outputs a series of increasingly accurate inference results o_1, o_2, ..., o_k, with each o_t more reliable than o_{t−1}. A variety of recent works [5, 36, 50, 53, 83, 87] have proposed DNNs supporting anytime inference, covering a variety of problem domains. ALERT easily works not only with traditional DNNs but also with anytime DNNs. The only change is that q_fail in Eq.
3 no longer corresponds to a random guess. That is, when the inference cannot generate its final result o_k by the deadline T_goal, an earlier result o_x can be used, with much better accuracy than that of a random guess. The updated accuracy equation is below:

  q_{i,j} = q_k,      if t_{k,j} ≤ T_goal
          = q_{k−1},  if t_{k−1,j} ≤ T_goal < t_{k,j}
            ...
          = q_fail,   otherwise    (13)

Existing anytime DNNs consider latency but not energy constraints: an anytime DNN keeps running until the latency deadline arrives, and the last output is delivered to the user. ALERT naturally improves anytime-DNN energy efficiency, sometimes stopping the inference before the deadline, based on its estimates, to meet not only latency and accuracy but also energy requirements.

Furthermore, ALERT can work with a set of traditional DNNs and an anytime DNN together, achieving the best combined result. The reason is that anytime DNNs generally sacrifice accuracy for flexibility. When we construct the candidate set D from a group of traditional DNNs and one anytime DNN, Eq. 7 leads ALERT to naturally select the anytime DNN when the environment is changing rapidly (because the expected accuracy of an anytime DNN is higher given that variance) and the regular DNN, which has slightly higher accuracy at similar computation cost, when the environment is stable, getting the best of both worlds.

In our evaluation, we use the nested design from [5], which provides generic coverage of anytime DNNs.

Assumptions of the Kalman Filter.
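The step function in Eq. 13 above can be sketched as follows; the output schedule and accuracies in the example are hypothetical.

```python
# Sketch of Eq. 13: with an anytime DNN, missing the final output's
# deadline degrades to the accuracy of the latest intermediate output
# that did finish, rather than to a random guess.

def anytime_accuracy(t_goal, output_times, output_accs, q_fail):
    """output_times[k] is the completion time of output k; output_accs[k]
    its accuracy. Later outputs are assumed more accurate."""
    best = q_fail
    for t_k, q_k in zip(output_times, output_accs):
        if t_k <= t_goal:
            best = q_k
    return best
```

For example, with intermediate outputs at 10/20/40 ms reaching 0.70/0.85/0.95 accuracy and a 30 ms deadline, the second output's 0.85 is used instead of q_fail.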
ALERT's prediction, particularly the Kalman Filter, relies on feedback from recent input processing. Consequently, it requires at least one input to react to sudden changes. Additionally, the Kalman Filter formulation assumes that the underlying distributions are normal, which may not hold in practice. If the behavior is not Gaussian, the Kalman Filter will produce poor estimates of the mean of ξ for some amount of time.

Having said that, as our experiments will show, no single distribution fits all real-world scenarios, and the normal distribution is the best fit we have found in practice (Figure 11). Furthermore, ALERT is specifically designed to handle deviation from the normal-distribution assumption, novelly using the Kalman Filter's variance estimate to measure system volatility and accounting for that volatility in the accuracy/energy estimations. Consequently, after just 2–3 such bad mean predictions, the estimated variance increases, which triggers ALERT to pick an anytime DNN over traditional DNNs, or a low-latency traditional DNN over high-latency ones, because the former has a better chance of producing results by the latency deadline and hence a higher expected accuracy under high variance. So, in the worst case, ALERT chooses a DNN with slightly lower accuracy than what could have been used with the right model of randomness. Users can also compensate for extremely aberrant latency distributions by increasing the value of Q^(0) in Eq. 5. As we will see in Section 5.3, ALERT performs well even when the distribution is not normal.

Probabilistic guarantees.
ALERT provides probabilistic, not hard, guarantees. Because ALERT estimates not just average timing but the distribution of possible timings, it can provide arbitrarily many nines of assurance that it will meet latency or accuracy goals, but it cannot provide a 100% guarantee. Providing 100% guarantees requires knowledge of the worst-case execution time (WCET), a latency value that is guaranteed never to be exceeded. ALERT does not assume the availability of such information and hence cannot provide hard guarantees [7].
Safety guarantees.
While ALERT does not explicitly model safety requirements, it can be configured to prioritize accuracy over other dimensions. In scenarios where users particularly value safety (e.g., autonomous driving), they could set a high accuracy requirement or even remove the energy constraints.
Concurrent inference jobs.
ALERT is currently designed to support one inference job at a time. To support multiple concurrent inference jobs, future work will need to extend ALERT to coordinate across them. We expect ALERT's main ideas, such as using a global slowdown factor to estimate system variation, to still apply.
Scope of ALERT.
Finally, how the inference behaves ultimately depends not only on ALERT but also on the DNN models and the system-resource setting options. As we will evaluate in Section 5, ALERT helps make the best use of the supplied DNN models, but it does not eliminate the differences between DNN models.
We implement ALERT for both CPUs and GPUs. On CPUs, ALERT adjusts power through Intel's RAPL interface [14], which allows software to set a hardware power limit. On GPUs, ALERT uses PyNVML to control frequency and builds a power–frequency lookup table. ALERT can also be applied to other approaches that translate power limits into settings for combinations of resources [34, 37, 73, 89].

In our experiments, ALERT considers a series of power settings within the feasible range, at a 2.5W interval on our test laptop and a 5W interval on our test CPU server and GPU platform, as the latter has a wider power range than the former. The number of power buckets is configurable.

ALERT incurs small overhead in both scheduler computation and switching from one DNN/power setting to another: just 0.6–1.7% of an input's inference time. We explicitly account for this overhead by subtracting it from the user-specified goal (see step 2 in Section 3.2).

Users may set goals that are not achievable. If ALERT cannot meet all constraints, it prioritizes latency highest, then accuracy, then power. This hierarchy is configurable.

Run-time environment settings:
  Default   Inference task has no co-running process
  Memory    Co-located with memory-hungry STREAM [61] (@CPU);
            co-located with Backprop from Rodinia-3.1 [9] (@GPU)
  Compute   Co-located with Bodytrack from PARSEC-3.0 [6] (@CPU);
            co-located with the forward pass of Backprop [9] (@GPU)

Ranges of constraint settings:
  Latency   0.4x–2x mean latency* of the largest Anytime DNN
  Accuracy  Whole range achievable by trad. and Anytime DNNs
  Energy    Whole feasible power-cap range on the machine

  Task                  Trad. DNN      Anytime [5]  Fixed deadline?
  Image Classification  Sparse ResNet  Depth-Nest   Yes
  Sentence Prediction   RNN            Width-Nest   No

  Scheme ID      DNN selection                      Power selection
  Oracle         Dynamic optimal                    Dynamic optimal
  Oracle-Static  Static optimal                     Static optimal
  App-only       One Anytime DNN                    System default
  Sys-only       Fastest traditional DNN            State-of-Art [38]
  No-coord       Anytime DNN w/o coord. with power  State-of-Art [38]
  ALERT          ALERT default                      ALERT default
  ALERT-Any      ALERT w/o traditional DNNs         ALERT default
  ALERT-Trad     ALERT w/o Anytime DNNs             ALERT default

Table 3: Settings and schemes under evaluation (* measured under the default setting, without resource contention)
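The configuration space described above (DNN candidates crossed with power buckets at a configurable interval) can be sketched as follows; the wattage bounds and model names are illustrative assumptions.

```python
# Sketch of ALERT's candidate configuration grid: every (DNN, power-cap)
# pair, with caps enumerated at a configurable interval within the
# feasible range. The bounds and model names are illustrative.

def power_buckets(p_min, p_max, interval):
    caps, p = [], p_min
    while p <= p_max + 1e-9:      # small epsilon guards float accumulation
        caps.append(round(p, 1))
        p += interval
    return caps

models = ["sparse_resnet", "depth_nest"]   # hypothetical candidate DNNs
configs = [(m, p) for m in models for p in power_buckets(7.5, 15.0, 2.5)]
```

With a 2.5W interval over a hypothetical 7.5–15W feasible range, two models yield eight candidate configurations for the scheduler to evaluate per input.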
We apply ALERT to different inference tasks on both CPU and GPU, with and without resource contention from co-located jobs. We set ALERT to (1) reduce energy while satisfying latency and accuracy requirements and (2) reduce error rates while satisfying latency and energy requirements. We compare ALERT with both oracle and state-of-the-art schemes, and we evaluate detailed design decisions.
Experimental setup.
We use the three platforms listed in Table 1: CPU1, CPU2, and GPU. On each, we run two inference tasks, image classification and sentence prediction, under three different resource-contention scenarios:

• No contention: the inference task is the only job running; referred to as "Default".
• Memory dynamic: the inference task runs together with a memory-intensive job that is repeatedly stopped and restarted, representing dynamic memory-resource contention; referred to as "Memory".
• Computation dynamic: the inference task runs together with a computation-intensive job that is repeatedly stopped and restarted, representing dynamic computation-resource contention; referred to as "Compute".

On GPU, we run only the image-classification task, as the RNN-based sentence-prediction task is better suited for CPU [90].
Figure 7: Result summary: average performance normalized to Oracle-Static, for the minimize-energy and minimize-error tasks. Violations% refers to the percentage of constraint settings under which a scheme incurs violations on more than 10% of all inputs. (Smaller is better; details in Table 4.)

We then evaluate a number of management schemes' ability to meet latency, accuracy, and energy constraints. Table 3 lists the details.
Schemes under evaluation.
We give ALERT three different DNN sets: traditional DNN models (ALERT-Trad), an Anytime DNN (ALERT-Any), and both (ALERT).

We compare with two Oracle* schemes that have perfect predictions for every input under every DNN/power setting (and are thus impractical). The "Oracle" allows DNN/power settings to change across inputs, representing the best possible results; the "Oracle-Static" uses one fixed setting across all inputs, representing the best results achievable without dynamic adaptation.

Finally, we compare with three state-of-the-art approaches:

• "App-only" adapts only at the application level, through an Anytime DNN [5];
• "Sys-only" adapts only at the system level, following an existing resource-management system that minimizes energy under soft real-time constraints [63] (specifically, a feedback scheduler that predicts inference latency based on a Kalman Filter) and uses the fastest candidate DNN to avoid latency violations;
• "No-coord" uses both the Anytime DNN for application-level adaptation and the power-management scheme [63] to adapt power, but with the two working independently.

Table 4 shows the results for all schemes, for different tasks on different platforms and environments. Each cell shows the average energy or error rate under 35–40 combinations of latency, accuracy, and energy constraints (the settings are detailed in Table 3), normalized to the Oracle-Static result. Figure 7 compares these results, where lower bars represent better results and lower *s represent fewer constraint violations.

Table 4: Average energy consumption (minimize-energy task) and error rate (minimize-error task) for ALERT, ALERT-Any, Sys-only, App-only, No-coord, and Oracle, normalized to Oracle-Static; smaller is better. (Each cell is averaged over 35–40 constraint settings.)

ALERT and ALERT-Any both work very well in all settings. They outperform the state-of-the-art approaches, which adapt at only one level, and they outperform Oracle-Static because they adapt to dynamic variations. ALERT also comes very close to the theoretically optimal Oracle.
Comparing with Oracles.
As shown in Table 4, ALERT achieves 93–99% of the Oracle's energy and accuracy optimization while satisfying constraints. Oracle-Static, the baseline in Table 4, represents the best one can achieve by selecting one DNN model and one power setting for all inputs. ALERT greatly outperforms Oracle-Static, reducing its energy consumption by 3–48% while satisfying accuracy constraints (36% in harmonic mean) and reducing its error rate by 9–66% while satisfying energy constraints (54% in harmonic mean).

Figure 8 shows a detailed comparison for the energy-minimization task. The figure shows the range of performance under all requirement settings (i.e., the whiskers). ALERT not only achieves similar mean energy reduction; its whole range of optimization behavior is also similar to the Oracle's. In comparison, Oracle-Static has not only the worst mean but also the worst tail performance. Due to space constraints, we omit the figures for other settings, where similar trends hold.

ALERT has a larger advantage over Oracle-Static on CPUs than on GPUs. The CPUs exhibit more empirical variance than the GPU, so they benefit more from dynamic adaptation; the GPU experiences significantly lower dynamic fluctuation, so the static oracle makes good predictions there.

ALERT satisfies the constraints in 99.9% of tests for image classification and in 98.5% of those for sentence prediction. For the latter, due to the large input variability (NLP1 in Figure 4), some input sentences simply cannot complete by the deadline even with the fastest DNN; there, the Oracle fails, too.

Note that these Oracle schemes not only have perfect (and hence impractical) prediction capability, but they also have no overhead. In contrast, ALERT runs on the same machines as the DNN workloads. All results include ALERT's run-time latency and power overhead.
Comparing with State-of-the-Art.
For a fair comparison, we focus on ALERT-Any, as it uses exactly the same DNN candidate set as "Sys-only", "App-only", and "No-coord". Across all settings, ALERT-Any outperforms the others.

The System-only solution suffers from not being able to choose different DNNs under different runtime scenarios. As a result, it performs much worse than ALERT-Any at satisfying accuracy requirements and at optimizing accuracy. For the former (left side of Table 4 and Figure 7), it creates accuracy violations in 68% of the settings, as shown in Figure 7; for the latter (right side of Table 4 and Figure 7), although capable of satisfying energy constraints, it introduces 34% more error than ALERT-Any.

The Application-only solution, which uses an Anytime DNN, suffers from not being able to adjust to the energy requirements. As a result, it consumes 73% more energy in energy-minimizing tasks (left side of Table 4 and Figure 7) and introduces many energy-budget violations, particularly under resource-contention settings (right side of Table 4 and Figure 7).

The no-coordination scheme is worse than both System- and Application-only. It violates constraints in both tasks, with 69% more energy and 34% more error than ALERT-Any. Without coordination, the two levels can work at cross purposes; e.g., the application switches to a faster DNN to save energy while the system makes more power available.

Figure 8: ALERT versus Oracle and Oracle-Static on the minimize-energy task, for (a) CPU1, image classification; (b) CPU1, sentence prediction; (c) CPU2, image classification; (d) CPU2, sentence prediction. (Lower is better; whisker: whole range; circle: mean.)
Table 5: ALERT's average energy consumption (minimize-energy task) and error rate (minimize-error task) with the Any, Trad, and combined candidate sets, normalized to Oracle-Static @ Sparse ResNet; smaller is better.
Different DNN candidate sets.
Table 5 compares the performance of ALERT working with an Anytime DNN (Any), a set of traditional DNN models (Trad), and both. At a high level, ALERT works well with all three DNN sets. Under closer comparison, ALERT-Trad violates more accuracy constraints than the others, particularly under resource contention on CPUs, because a traditional DNN suffers a much larger accuracy drop than an anytime DNN when it misses a latency deadline. Consequently, when the system variation is large, ALERT-Trad selects a faster DNN to meet latency and thus may not meet accuracy goals. Of course, ALERT-Any is not always the best. As discussed in Section 3.5, anytime DNNs sometimes have lower accuracy than a traditional DNN with similar execution time. This difference leads to the slightly better results for ALERT over ALERT-Any.

Figure 9: Minimize error rates with latency and energy constraints @ CPU1 (latency, power, accuracy, and DNN-choice panels over time). (ALERT in blue; ALERT-Trad in orange; constraints in red. Memory contention occurs from about input 46 to 119. Deadline: 1.25× mean latency of the largest Anytime DNN in Default; power limit: 35W.)

Figure 9 visualizes the different dynamic behaviors of ALERT (blue curve) and ALERT-Trad (orange curve) when the environment changes from Default to Memory-intensive and back. At the beginning, due to a loose latency constraint, ALERT and ALERT-Trad both select the biggest traditional DNN, which provides the highest accuracy within the energy budget. When the memory contention suddenly starts, this DNN choice leads to a deadline miss and an energy-budget violation (as the idle period disappears), which causes an accuracy dip. Fortunately, both schemes quickly detect this problem and sense the high variability in the expected latency. ALERT switches to an anytime DNN and a lower power cap. This switch is effective: although the environment is still unstable, the inference accuracy remains high, with slight ups and downs depending on which anytime output finished before the deadline. Only able to choose from traditional DNNs, ALERT-Trad conservatively switches to much simpler, and hence lower-accuracy, DNNs to avoid deadline misses. This switch does eliminate deadline misses under the highly dynamic environment, but many of the conservatively chosen DNNs finish well before the deadline (see the latency panel), wasting the opportunity to produce more accurate results and causing ALERT-Trad to have lower accuracy than ALERT. When the system quiesces, both schemes quickly shift back to the highest-accuracy traditional DNN.

Overall, these results demonstrate how ALERT always makes use of the full potential of the DNN candidate set to optimize performance and satisfy constraints.
ALERT probabilistic design.
A key feature of ALERT is its use of not just mean estimates but also their variance. To evaluate the impact of this design, we compare ALERT to an alternative design, ALERT*, which uses only the estimated mean to select configurations.

Figure 10: Minimize error for sentence prediction @ CPU1, under (a) default contention and (b) memory contention, for the Standard, Traditional-only, and Anytime-only candidate sets (average perplexity; lower is better; whisker: whole range; circle: mean).

Figure 10 shows the performance of ALERT and ALERT* on the minimize-error task for sentence prediction. As we can see, ALERT (blue circles) always performs better than ALERT*. Its advantage is biggest when the DNN candidate set includes both traditional and Anytime DNNs ("Standard" in Figure 10). The reason is that traditional DNNs and Anytime DNNs have different accuracy/latency curves: Eq. 3 for the former and Eq. 13 for the latter. ALERT* is much worse than ALERT at distinguishing these two, as it simply uses the mean of the estimated latency to predict accuracy. ALERT's advantage also shows under memory contention with traditional DNN candidates: since ALERT's estimation better captures dynamic system variation, it clearly outperforms ALERT* there. Overall, these results show that ALERT's probabilistic design is effective.
Sensitivity to latency distribution.
ALERT assumes a Gaussian distribution. However, ALERT is still robust under other distributions, as explained in Section 3.6. As shown in Figure 11, the observed ξs (red bars) are indeed not a perfect fit for the Gaussian distribution (blue lines) in all scenarios, which confirms ALERT's robustness.

Figure 11: Distribution of ξ for image classification on CPU1 (observation vs. estimation, under the Default, Compute, and Memory settings).

Past resource-management systems have used machine learning [4, 52, 69, 70, 80] or control theory [33, 38, 45, 46, 63, 75, 92] to make dynamic decisions and adapt to changing environments or application needs. Some also use a Kalman filter because of its optimal error properties [38, 45, 46, 63]. There are two major differences between these systems and ALERT: (1) prior approaches use the Kalman filter to estimate physical quantities such as CPU utilization [46] or job latency [38], while ALERT estimates a virtual quantity that is then used to update a large number of latency estimates; and (2) while variance is naturally computed as part of the filter, ALERT actually uses it, in addition to the mean, to produce estimates that better account for environment variability.

Past work has designed resource managers that explicitly coordinate approximate applications with system resource usage [22, 32, 33, 47]. Although related, they manage applications separately from system resources, which is fundamentally different from ALERT's holistic design. When an environmental change occurs, prior approaches first adjust the application and then the system serially (or vice versa), so that the change's effects on each can be established independently [32, 33]. That is, coordination is established by forcing one level to lag behind the other. In practice, this design forces each level to keep its own independent model and delays the response to environmental changes. In contrast, ALERT's global slowdown factor allows it to easily model and update predictions about all application and system configurations simultaneously, leading to very fast response times, like the single-input delay demonstrated in Figure 9.

Much work accelerates DNNs through hardware [3, 11–13, 20, 24, 25, 28, 31, 39, 44, 55, 59, 67, 74, 76, 84], compiler [10, 66], system [29, 54], or design support [26, 27, 40, 43, 78, 82, 86]. These techniques essentially shift and extend the tradeoff space, but they do not provide policies for meeting user needs or for navigating tradeoffs dynamically, and hence are orthogonal to ALERT.

Some research supports hard real-time guarantees for DNNs [91], providing 100% timing guarantees while assuming that the DNN model gives the desired accuracy, that the environment is completely predictable, and that energy consumption is not a concern. ALERT provides slightly weaker timing guarantees but manages accuracy and power goals as well, and it provides more flexibility to adapt to unpredictable environments. Hard real-time systems would fail in the co-located scenario unless they explicitly accounted for all possible co-located applications at design time.

This paper demonstrates the challenges behind the important problem of ensuring timely, accurate, and energy-efficient neural-network inference under dynamic input, contention, and requirement variation. ALERT achieves these goals through dynamic, coordinated DNN model selection and power management based on feedback control. We evaluate ALERT with a variety of workloads and DNN models and achieve high performance and energy efficiency.
Acknowledgement
We thank the anonymous reviewers for their helpful feedback and Ken Birman for shepherding this paper. This research is supported by NSF (grants CNS-1956180, CNS-13764039, CCF-1837120, CNS-1764039, CNS-1563956, CNS-1514256, IIS-1546543, CNS-1823032, CCF-1439156), ARO (grant W911NF1920321), DOE (grant DESC0014195 0003), DARPA (grant FA8750-16-2-0004), and the CERES Center for Unstoppable Computing. Additional support comes from the DARPA BRASS program and a DOE Early Career award.
References

[1] Baidu AI. Apollo open vehicle certificate platform. Online document, http://apollo.auto, 2018.
[2] S. Akhlaghi, N. Zhou, and Z. Huang. Adaptive adjustment of noise covariance in Kalman filter for dynamic state estimation. 2017.
[3] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In ISCA, pages 1–13, 2016.
[4] Jason Ansel, Maciej Pacula, Yee Lok Wong, Cy Chan, Marek Olszewski, Una-May O'Reilly, and Saman Amarasinghe. SiblingRivalry: Online autotuning through local competitions. In CASES, 2012.
[5] Anonymous Authors. Prioritized SGD and nested architectures for anytime neural network. In submission; included as supplementary material.
[6] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: Characterization and architectural implications. In PACT, October 2008.
[7] Giorgio C. Buttazzo, Giuseppe Lipari, Luca Abeni, and Marco Caccamo. Soft Real-Time Systems: Predictability vs. Efficiency. Springer, 2006.
[8] Aaron Carroll and Gernot Heiser. Mobile multicores: Use them or waste them. In HotPower, 2013.
[9] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In IISWC, 2009.
[10] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, pages 578–594, 2018.
[11] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine learning. SIGPLAN Not., pages 269–284, 2014.
[12] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.
[13] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. DaDianNao: A machine-learning supercomputer. In MICRO 47, pages 609–622, 2014.
[14] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le. RAPL: Memory power estimation and capping. In ISLPED, 2010.
[15] Christina Delimitrou and Christos Kozyrakis. Paragon: QoS-aware scheduling for heterogeneous datacenters. In ASPLOS, 2013.
[16] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-efficient and QoS-aware cluster management. In ASPLOS, 2014.
[17] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[19] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 2011.
[20] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. ShiDianNao: Shifting vision processing closer to the sensor. In ISCA, pages 92–104, 2015.
[21] Biyi Fang, Xiao Zeng, and Mi Zhang. NestDNN: Resource-aware multi-tenant on-device deep learning for continuous mobile vision. In MobiCom '18, pages 115–127, 2018.
[22] Anne Farrell and Henry Hoffmann. MEANTIME: Achieving both minimal energy and timeliness with approximate computing. In USENIX ATC, 2016.
[23] Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry P. Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In CVPR, page 7, 2017.
[24] Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, and Christos Kozyrakis. DRAF: A low-power DRAM-based reconfigurable acceleration fabric.
ISCA , pages 506–518,2016.[25] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, ArdavanPedram, Mark A Horowitz, and William J Dally. Eie:efficient inference engine on compressed deep neuralnetwork. In
ISCA , pages 243–254, 2016.[26] Song Han, Huizi Mao, and William J Dally. Deep com-pression: Compressing deep neural networks with prun-ing, trained quantization and huffman coding. arXivpreprint arXiv:1510.00149 , 2015.[27] Soheil Hashemi, Nicholas Anthony, Hokchhay Tann,R Iris Bahar, and Sherief Reda. Understanding the im-pact of precision quantization on the accuracy and en-ergy of neural networks. In
DATE , pages 1474–1479,2017.[28] Johann Hauswald, Yiping Kang, Michael A Laurenzano,Quan Chen, Cheng Li, Trevor Mudge, Ronald G Dreslin-ski, Jason Mars, and Lingjia Tang. Djinn and tonic: Dnnas a service and its implications for future warehousescale computers. In
ISCA , pages 27–40, 2015.[29] Johann Hauswald, Michael A Laurenzano, Yunqi Zhang,Cheng Li, Austin Rovinski, Arjun Khurana, Ronald GDreslinski, Trevor Mudge, Vinicius Petrucci, LingjiaTang, et al. Sirius: An open end-to-end voice and visionpersonal assistant and its implications for future ware-house scale computers. In
ASPLOS , pages 223–238,2015.[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and JianSun. Deep residual learning for image recognition. In
CVPR , pages 770–778, 2016.[31] Parker Hill, Animesh Jain, Mason Hill, Babak Zamirai,Chang-Hong Hsu, Michael A Laurenzano, Scott Mahlke,Lingjia Tang, and Jason Mars. Deftnn: Addressing bot-tlenecks for dnn execution on gpus via synapse vectorelimination and near-compute data fission. In
MICRO ,pages 786–799, 2017.[32] Henry Hoffmann. Coadapt: Predictable behavior foraccuracy-aware applications running on power-awaresystems. In
ECRTS , pages 223–232, 2014.[33] Henry Hoffmann. Jouleguard: energy guarantees forapproximate applications. In
SOSP , 2015. [34] Henry Hoffmann and Martina Maggio. PCP: A general-ized approach to optimizing performance under powerconstraints through resource management. In
ICAC ,pages 241–247, 2014.[35] Hanzhang Hu, Debadeepta Dey, Martial Hebert, andJ Andrew Bagnell. Learning anytime predictions inneural networks via adaptive loss balancing. In
AAAI ,2019.[36] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Lau-rens van der Maaten, and Kilian Q. Weinberger. Multi-scale dense convolutional networks for efficient predic-tion. In
CoRR , 2017.[37] C. Imes and H. Hoffmann. Bard: A unified frameworkfor managing soft timing and power constraints. In
SAMOS , pages 31–38, 2016.[38] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann.Poet: a portable approach to minimizing energy undersoft real-time constraints. In
RTAS , pages 75–86, April2015.[39] Animesh Jain, Michael A Laurenzano, Gilles A Pokam,Jason Mars, and Lingjia Tang. Architectural supportfor convolutional neural networks on modern cpus. In
PACT , 2018.[40] Shubham Jain, Swagath Venkataramani, VijayalakshmiSrinivasan, Jungwook Choi, Pierce Chuang, and Le-land Chang. Compensated-dnn: energy efficient low-precision deep neural networks by compensating quan-tization errors. In
DAC , pages 1–6, 2018.[41] Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik,Siddhartha Sen, and Ion Stoica. Chameleon: scalableadaptation of video analytics. In
ACM SIGCOMM ,pages 253–266, 2018.[42] Norman P. Jouppi, Cliff Young, Nishant Patil, DavidPatterson, Gaurav Agrawal, Raminder Bajwa, SarahBates, Suresh Bhatia, Nan Boden, Al Borchers, RickBoyle, Pierre-luc Cantin, Clifford Chao, Chris Clark,Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean,Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Got-tipati, William Gulland, Robert Hagmann, C. RichardHo, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt,Julian Ibarz, Aaron Jaffey, Alek Jaworski, AlexanderKaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch,Naveen Kumar, Steve Lacy, James Laudon, James Law,Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke,Alan Lundin, Gordon MacKean, Adriana Maggiore,Maire Mahony, Kieran Miller, Rahul Nagarajan, RaviNarayanaswami, Ray Ni, Kathy Nix, Thomas Norrie,Mark Omernick, Narayana Penukonda, Andy Phelps,15onathan Ross, Matt Ross, Amir Salek, Emad Samadi-ani, Chris Severn, Gregory Sizikov, Matthew Snelham,Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan,Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle,Vijay Vasudevan, Richard Walter, Walter Wang, EricWilcox, and Doe Hyun Yoon. In-datacenter performanceanalysis of a tensor processing unit. In
ISCA , 2017.[43] Patrick Judd, Jorge Albericio, Tayler Hetherington,Tor M Aamodt, Natalie Enright Jerger, and AndreasMoshovos. Proteus: Exploiting numerical precisionvariability in deep neural networks. In
ICS , page 23,2016.[44] Patrick Judd, Jorge Albericio, Tayler Hetherington,Tor M Aamodt, and Andreas Moshovos. Stripes: Bit-serial deep neural network computing. In
MICRO , pages1–12, 2016.[45] Evangelia Kalyvianaki, Themistoklis Charalambous,and Steven Hand. Self-adaptive and self-configuredcpu resource provisioning for virtualized servers usingkalman filters. In
ICAC , 2009.[46] Evangelia Kalyvianaki, Themistoklis Charalambous,and Steven Hand. Adaptive resource provisioning forvirtualized servers using kalman filters.
TAAS , 2014.[47] Aman Kansal, Scott Saponas, AJ Brush, Kathryn SMcKinley, Todd Mytkowicz, and Ryder Ziola. The la-tency, accuracy, and battery (lab) abstraction: program-mer productivity and energy efficiency for continuousmobile context sensing. In
OOPSLA , 2013.[48] Shinpei Kato, Shota Tokunaga, Yuya Maruyama, SeiyaMaeda, Manato Hirabayashi, Yuki Kitsukawa, AbrahamMonrroy, Tomohito Ando, Yusuke Fujii, and TakuyaAzumi. Autoware on board: Enabling autonomous vehi-cles with embedded systems. In
ICCPS , pages 287–296,2018.[49] D. H. K. Kim, C. Imes, and H. Hoffmann. Racing andpacing to idle: Theoretical and empirical analysis ofenergy optimization heuristics. In
ICCPS , 2015.[50] Gustav Larsson, Michael Maire, and GregoryShakhnarovich. Fractalnet: Ultra-deep neural networkswithout residuals. arXiv preprint arXiv:1605.07648 ,2016.[51] Etienne Le Sueur and Gernot Heiser. Slow down orsleep, that is the question. In
USENIX ATC , June 2011.[52] Benjamin C Lee and David Brooks. Efficiency trendsand limits from comprehensive microarchitectural adap-tivity.
ASPLOS , 2008. [53] Hankook Lee and Jinwoo Shin. Anytime neural pre-diction via slicing networks vertically. arXiv preprintarXiv:1807.02609 , 2018.[54] Shih-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, MattSkach, Md E Haque, Lingjia Tang, and Jason Mars. Thearchitectural implications of autonomous driving: Con-straints and acceleration. In
ASPLOS , pages 751–766,2018.[55] Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou,Shengyuan Zhou, Olivier Teman, Xiaobing Feng, Xue-hai Zhou, and Yunji Chen. Pudiannao: A polyvalentmachine learning accelerator. In
ISCA , pages 369–381,2015.[56] Jun S Liu and Rong Chen. Sequential monte carlomethods for dynamic systems.
Journal of the Americanstatistical association , 1998.[57] ATLAS LS. What is simultaneous/conference inter-pretation? Online document, https://atlasls.com/what-is-simultaneousconference-interpretation/ ,2010.[58] Martina Maggio, Alessandro Vittorio Papadopoulos, An-tonio Filieri, and Henry Hoffmann. Automated controlof multiple software goals using multiple actuators. In
FSE , 2017.[59] Divya Mahajan, Jongse Park, Emmanuel Amaro, HardikSharma, Amir Yazdanbakhsh, Joon Kyung Kim, andHadi Esmaeilzadeh. Tabla: A unified template-basedframework for accelerating statistical machine learning.In
HPCA , pages 14–26. IEEE, 2016.[60] Mitchell P. Marcus, Beatrice Santorini, Mary AnnMarcinkiewicz, and Ann Taylor. Treebank-3 - lin-guistic data consortium. Online document, https://catalog.ldc.upenn.edu/LDC99T42 , 1999.[61] John D McCalpin. Memory bandwidth and machinebalance in current high performance computers.
TCCA ,1995.[62] Mason McGill and Pietro Perona. Deciding how todecide: Dynamic routing in artificial neural networks. arXiv preprint arXiv:1703.06217 , 2017.[63] Nikita Mishra, Connor Imes, John D. Lafferty, andHenry Hoffmann. CALOREE: learning control for pre-dictable latency and low energy. In
ASPLOS , 2018.[64] Nikita Mishra, Huazhe Zhang, John D. Lafferty, andHenry Hoffmann. A probabilistic graphical model-based approach for minimizing energy under perfor-mance constraints.
ASPLOS , 2015.1665] Jakob Nielsen.
Usability engineering . Elsevier, 1994.[66] NVIDIA. Nvidia tensorrt: Programmable inferenceaccelerator. Online document, https://developer.nvidia.com/tensorrt , 2018.[67] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim,Jeremy Fowers, Karin Strauss, and Eric S Chung. Ac-celerating deep convolutional neural networks usingspecialized hardware.
Microsoft Research Whitepaper ,2015.[68] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana.Deepxplore: Automated whitebox testing of deep learn-ing systems. In
SOSP , 2017.[69] Paula Petrica, Adam M Izraelevitz, David H Albonesi,and Christine A Shoemaker. Flicker: A dynamicallyadaptive architecture for power limited multicore sys-tems. In
ISCA , 2013.[70] Dmitry Ponomarev, Gurhan Kucuk, and Kanad Ghose.Reducing power requirements of instruction schedul-ing through dynamic allocation of multiple datapathresources. In
MICRO , 2001.[71] Amir M. Rahmani, Bryan Donyanavard, Tiago Mück,Kasra Moazzemi, Axel Jantsch, Onur Mutlu, andNikil D. Dutt. SPECTR: formal supervisory controland coordination for many-core systems resource man-agement. In
ASPLOS , pages 169–183, 2018.[72] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev,and Percy Liang. Squad: 100,000+ questions formachine comprehension of text. arXiv preprintarXiv:1606.05250 , 2016.[73] S. Reda, R. Cochran, and A. K. Coskun. Adaptive powercapping for servers with multithreaded workloads.
IEEEMicro , 2012.[74] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Ar-slan Zulfiqar, and Stephen W Keckler. vdnn: Virtualizeddeep neural networks for scalable, memory-efficient neu-ral network design. In
MICRO , page 18, 2016.[75] Muhammad Husni Santriaji and Henry Hoffmann.Grape: Minimizing energy for gpu applications withperformance requirements. In
MICRO , 2016.[76] Hardik Sharma, Jongse Park, Divya Mahajan, Em-manuel Amaro, Joon Kyung Kim, Chenkai Shao, AsitMishra, and Hadi Esmaeilzadeh. From high-level deepneural models to fpgas. In
MICRO , page 17, 2016.[77] N Silberman and Guadarrama. S. Tensorflow-slimimage classification model library. Online doc-ument, https://github.com/tensorflow/models/tree/master/research/slim , 2016. [78] Hyeonuk Sim, Saken Kenzhegulov, and JongeunLee. Dps: dynamic precision scaling for stochas-tic computing-based deep neural networks. In
DAC ,page 13, 2018.[79] Karen Simonyan and Andrew Zisserman. Very deep con-volutional networks for large-scale image recognition.In
ICLR , 2015.[80] Srinath Sridharan, Gagan Gupta, and Gurindar S Sohi.Holistic run-time parallelism management for time andenergy efficiency. In
ICS , 2013.[81] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang,Marta Kwiatkowska, and Daniel Kroening. Concolictesting for deep neural networks. In
ASE , 2018.[82] Hokchhay Tann, Soheil Hashemi, R Iris Bahar, andSherief Reda. Hardware-software codesign of accurate,multiplier-free deep neural networks. In
DAC , 2017.[83] Surat Teerapittayanon, Bradley McDanel, and H.T.Kung. Branchynet: Fast inference via early exiting fromdeep neural networks. In
CVPR , 2016.[84] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao.Improving the speed of neural networks on cpus. In
Proc. Deep Learning and Unsupervised Feature Learn-ing NIPS Workshop , page 4, 2011.[85] Andreas Veit and Serge Belongie. Convolutional net-works with adaptive inference graphs. In
ECCV , 2018.[86] Swagath Venkataramani, Ashish Ranjan, Kaushik Roy,and Anand Raghunathan. Axnn: energy-efficient neu-romorphic systems using approximate computing. In
ISLPED , 2014.[87] Yan Wang, Zihang Lai, Gao Huang, Brian H Wang, Lau-rens van der Maaten, Mark Campbell, and Kilian QWeinberger. Anytime stereo image depth estimationon mobile devices. arXiv preprint arXiv:1810.11408 ,2018.[88] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, StevenRennie, Larry S Davis, Kristen Grauman, and RogerioFeris. Blockdrop: Dynamic inference paths in residualnetworks. In
CVPR , pages 8817–8826, 2018.[89] Huazhe Zhang and Henry Hoffmann. Maximizing per-formance under a power cap: A comparison of hardware,software, and hybrid techniques. In
ASPLOS , 2016.[90] Minjia Zhang, Samyam Rajbhandari, Wenhan Wang,and Yuxiong He. Deepcpu: Serving rnn-based deeplearning models 10x faster. In
ATC , pages 951–965,2018.1791] H. Zhou, S. Bateni, and C. Liu. S3dnn: Supervisedstreaming and scheduling for gpu-accelerated real-timednn workloads. In
RTAS , 2018. [92] Yanqi Zhou, Henry Hoffmann, and David Wentzlaff.Cash: Supporting iaas customers with a sub-core config-urable architecture. In