ALERT: Accurate Learning for Energy and Timeliness
Chengcheng Wan, Muhammad Santriaji, Eri Rogers, Henry Hoffmann, Michael Maire, Shan Lu
The University of Chicago
Abstract
An increasing number of software applications incorporate runtime Deep Neural Networks (DNNs) to process sensor data and return inference results to humans. Effective deployment of DNNs in these interactive scenarios requires meeting latency and accuracy constraints while minimizing energy, a problem exacerbated by common system dynamics. Prior approaches handle dynamics through either (1) system-oblivious DNN adaptation, which adjusts DNN latency/accuracy tradeoffs, or (2) application-oblivious system adaptation, which adjusts resources to change latency/energy tradeoffs. In contrast, this paper improves on the state-of-the-art by coordinating application- and system-level adaptation. ALERT, our runtime scheduler, uses a probabilistic model to detect environmental volatility and then simultaneously selects both a DNN and a system resource configuration to meet latency, accuracy, and energy constraints. We evaluate ALERT on CPU and GPU platforms for image and speech tasks in dynamic environments. ALERT's holistic approach achieves more than 13% energy reduction and 27% error reduction over prior approaches that adapt solely at the application or system level. Furthermore, ALERT incurs only 3% more energy consumption and 2% higher DNN-inference error than an oracle scheme with perfect application and system knowledge.
Deep neural networks (DNNs) have become a key workload for many computing systems due to their high inference accuracy. This accuracy, however, comes at a cost of long latency, high energy usage, or both. Successful DNN deployment requires meeting a variety of user-defined, application-specific goals for latency, accuracy, and often energy in unpredictable, dynamic environments. Latency constraints naturally arise with DNN deployments when inference interacts with the real world as a consumer—processing data streamed from a sensor—or a producer—returning a series of answers to a human. For example, in motion tracking, a frame must be processed at camera speed [41]; in simultaneous interpretation, translation must be provided every 2–4 seconds [57]. Violating these deadlines may lead to severe consequences: if a self-driving vehicle cannot act within a small time budget, life-threatening accidents could follow [54]. Accuracy and energy requirements are also common and may vary for different applications in different operating environments. On one hand, low inference accuracy can lead to software failures [68, 81]. On the other hand, it is beneficial to minimize DNN energy or resource usage to extend mobile-battery time or reduce server-operation cost [42]. These requirements are also highly dynamic.
For example, the latency requirement for a job could vary dynamically depending on how much time has already been consumed by related jobs before it [54]; the power budget and the accuracy requirement for a job may switch among different settings depending on what type of events are currently sensed [1]. Additionally, the latency requirement may change based on the computing system's current context; e.g., in robotic vision systems the latency requirement can change based on the robot's speed and distance from perceived pedestrians [19]. Satisfying all these requirements in a dynamic computing environment where the inference job may compete for resources against unpredictable, co-located jobs is challenging. Although prior work addresses these problems at either the application level or system level separately, each approach by itself lacks critical information that could be used to produce better results. At the application level, different DNN designs—with different depths, widths, and numeric precisions—provide various latency-accuracy trade-offs for the same inference task [27, 40, 43, 78, 86]. Even more dynamic schemes have been proposed that adapt the DNN by dynamically changing its structure at the beginning of [23, 62, 85, 88] or during [5, 35, 36, 50, 53, 83, 87] every inference task. Although helpful, these techniques are sub-optimal without considering system-level adaptation options.
For example, under energy pressure, these application-level adaptation techniques have to switch to lower-accuracy DNNs, sacrificing accuracy for energy saving, even if the energy goal could have been achieved by lowering the system power setting (if there is sufficient latency budget). At the system level, machine learning [4, 15, 16, 52, 64, 69, 70, 80] and control theory [33, 38, 45, 46, 63, 71, 75, 92] based techniques have been proposed to dynamically assign system resources to better satisfy system and application constraints. Unfortunately, without considering the option of application adaptations, these techniques also reach sub-optimal solutions. For example, when the current DNN offers much higher accuracy than necessary, switching to a lower-precision DNN may offer much more energy saving than any system-level adaptation technique. This problem is exacerbated because, in the DNN design space, very small drops in accuracy enable dramatic reductions in latency, and therefore in system resource requirements. A cross-stack solution would enable DNN applications to meet multiple, dynamic constraints. However, offering such a holistic solution is non-trivial. The combination of DNN and system-resource adaptation creates a huge configuration space, making it difficult to dynamically and efficiently predict which combination of DNN and system settings will meet all the requirements optimally. Furthermore, without careful coordination, adaptations at the application and system level may conflict and cause constraint violations, like missing a latency deadline due to switching to a higher-accuracy DNN and a lower power setting at the same time. This paper presents ALERT, a cross-stack runtime system for DNN inference that meets user goals by simultaneously adapting both DNN models and system-resource settings.
Understanding the challenges
We profile DNN inference across applications, inputs, hardware, and resource contention, confirming that there is high variation in inference time. This leads to challenges in meeting not only latency but also energy and accuracy requirements. Furthermore, our profiling of 42 existing DNNs for image classification confirms that different designs offer a wide spectrum of latency, energy, and accuracy tradeoffs. In general, higher accuracy comes at the cost of longer latency and/or higher energy consumption. These trade-offs provide both opportunities and challenges to holistic inference management (Section 2).
Run-time inference management
We design ALERT, a DNN inference management system that dynamically selects and adapts a DNN and a system-resource setting together to handle changing system environments and meet dynamic
energy, latency, and accuracy requirements with probabilistic guarantees (Section 3).

Figure 1: ALERT inference system

ALERT is a feedback-based run-time. It measures inference accuracy, latency, and energy consumption; it checks whether the requirements on these goals are met; and it then outputs both system- and application-level configurations adjusted to the current requirements and operating conditions. ALERT focuses on meeting constraints in any two dimensions while optimizing the third; e.g., minimizing energy given accuracy and latency requirements, or maximizing accuracy given latency and energy budgets. The key is estimating how DNN and system configurations interact to affect the goals. To do so, ALERT addresses three primary challenges: (1) the combined DNN and system configuration space is huge; (2) the environment may change dynamically (including input, available resources, and even the required constraints); and (3) the predictions must be low overhead to have negligible impact on the inference itself. ALERT addresses these challenges with a global slow-down factor, a random variable relating the current runtime environment to a nominal profiling environment. After each inference task, ALERT estimates the global slow-down factor using a Kalman filter. The global slow-down factor's mean represents the expected change compared to the profile, while the variance represents the current volatility. The mean provides a single scalar that modifies the predicted latency/accuracy/energy for every DNN/system configuration—a simple mechanism that leverages commonality among DNN architectures to allow prediction for even rarely used configurations (tackling challenge 1), while incorporating variance into predictions naturally makes ALERT conservative in volatile environments and aggressive in quiescent ones (tackling challenge 2). The global slow-down factor and Kalman filter are efficient to implement and low-overhead (tackling challenge 3).
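The slow-down-factor estimate described above can be sketched with a scalar Kalman-style filter. The class below is a minimal illustration only, not ALERT's exact formulation (the paper's version, with adaptive process noise, is given in Eq. 5); the noise constants and observed latencies are made-up values.

```python
# Minimal sketch of a global slow-down factor estimator.
# Constants (q, r) and all latencies below are illustrative, not measured.

class SlowdownEstimator:
    """Tracks xi = observed_latency / profiled_latency as a noisy scalar."""

    def __init__(self, mean=1.0, var=1.0, q=0.1, r=0.001):
        self.mean = mean  # estimated mean of the slow-down factor
        self.var = var    # estimated variance of the estimate
        self.q = q        # process noise (fixed here; ALERT adapts it)
        self.r = r        # measurement noise

    def update(self, observed_latency, profiled_latency):
        xi = observed_latency / profiled_latency  # one noisy observation
        # Standard scalar Kalman update: predict, then correct.
        prior_var = self.var + self.q
        gain = prior_var / (prior_var + self.r)
        self.mean += gain * (xi - self.mean)
        self.var = (1 - gain) * prior_var
        return self.mean, self.var

    def predict_latency(self, profiled_latency):
        # Scale any configuration's profiled latency by the shared factor,
        # even if that configuration was never run recently.
        return self.mean * profiled_latency

est = SlowdownEstimator()
for obs in [0.12, 0.13, 0.11, 0.12]:  # observed latencies (s) of one config
    est.update(obs, profiled_latency=0.10)
print(est.predict_latency(0.05))      # predicted latency of another config
```

Because the factor is shared, one configuration's history updates the latency prediction of every other configuration, which is what makes the huge combined space tractable.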
Thus, ALERT combines the global slow-down factor with latency, power, and accuracy measurements to select the DNN and system configuration with the highest likelihood of meeting the constraints optimally. ALERT provides probabilistic, not hard, guarantees, as the latter would require much more conservative configurations, often hurting both energy and accuracy. Section 3.6 discusses this issue further.
We evaluate ALERT using various DNNs and application domains on different (CPU and GPU) machines under various constraints. Our evaluation shows that ALERT overcomes dynamic variability efficiently. Across various experimental settings, ALERT meets constraints in most cases while achieving within 93–99% of optimal energy saving or accuracy optimization. Compared to approaches that adapt at the application level or system level only, ALERT achieves more than 13% energy reduction and 27% error reduction (Section 5).
We conduct an empirical study to examine the large trade-off space offered by different DNN designs and system settings (Sec. 2.1), and the timing variability of inference (Sec. 2.2).
         Embedded                 CPU1               CPU2                         GPU
CPU      ARM Cortex [email protected] GHz  [email protected] GHz  Xeon(R) Gold [email protected] GHz  [email protected] GHz
GPU      none                     none               none                         RTX 2080
Memory   DDR3 2G                  DDR4 16G           DDR4 16G*12                  DDR4 16G
LLC      2MB                      9MB                19.25MB                      9MB
Table 1: Hardware platforms used in our experiments
ID     Task                   DNN Models      Datasets
IMG1   Image Classification   VGG16 [79]      ILSVRC2012 (ImageNet)
IMG2   Image Classification   ResNet50 [30]   ILSVRC2012 (ImageNet)
NLP1   Sentence Prediction    RNN             Penn Treebank [60]
NLP2   Question Answering     Bert [18]       Stanford Q&A Dataset (SQuAD) [72]
Table 2: ML tasks and benchmark datasets in our experiments

We use two canonical machine learning tasks, with state-of-the-art networks and common datasets (see Table 2), on a diverse set of hardware platforms, representing embedded systems, laptops (CPU1), CPU servers (CPU2), and GPU platforms (see Table 1). The two tasks, image classification and natural language processing (NLP), are often deployed with deadlines—e.g., for motion tracking [41] and simultaneous interpretation [57]—and both have received wide attention, leading to a diverse set of DNN models.
Tradeoffs from DNNs
We run all 42 image classification models provided by the Tensorflow website [77] on the 50000 images from ImageNet [17], and measure their average latency, accuracy (error rate), and energy consumption. The results from CPU2 are shown in Figure 2. We can clearly see two trends from the figure, which hold on other machines. First, different DNN models offer a wide spectrum of accuracy (error rate in the figure), latency, and energy. As shown
Figure 2: Tradeoffs for 42 DNNs (CPU2).
Figure 3: Tradeoffs for ResNet50 at different power settings (CPU2). (Numbers inside circles are power limit settings.)

in the figure, the fastest model runs almost 18× faster than the slowest one, and the most accurate model has about 7.8× lower error rate than the least accurate. These models also consume a wide range—more than 20×—of energy usage. Second, there is no magic DNN that offers both the best accuracy and the lowest latency, confirming the intuition that there exists a tradeoff between DNN accuracy and resource usage. Of course, some DNNs offer better tradeoffs than others. In Figure 2, all the networks sitting above the lower-convex-hull curve represent sub-optimal tradeoffs.

Tradeoffs from system settings
We run ResNet50 under 31 power settings from 40–100W on CPU2. We consider a sensor processing scenario with periodic inputs, setting the period to the latency under the 40W cap. We then plot the average energy consumed for the whole period (run-time plus idle energy) and the average inference latency in Figure 3. The results reflect two trends, which hold on other machines. First, a large latency/energy space is available by changing system settings. The fastest setting (100W) is more than 2× faster than the slowest setting (40W). The most energy-hungry setting (64W) uses 1.3× more energy than the least (40W). Second, there is no easy way to choose the best setting. For example, 40W offers the lowest energy, but the highest latency. Furthermore, most of these points are sub-optimal in terms of energy and latency tradeoffs. For example, 84W should be chosen for extremely low latency deadlines, but all other nearby points (from 52–100) will harm latency, energy, or both. Additionally, when deadlines change or when there is resource contention, the energy-latency curve also changes and different points become optimal.

Figure 4: Latency variance across inputs for different tasks and hardware (Most tasks have 3 boxplots for 3 hardware platforms, CPU1-2, GPU from left to right; NLP1 has an extra boxplot for Embedded; other tasks run out of memory on Embedded; every box shows the 25th–75th percentile; points beyond the whiskers are >90th or <10th).
Summary:
DNN models and system-resource settings offer a huge trade-off space. The energy/latency tradeoff space is not smooth (when accounting for deadlines and idle power) and optimal operating points cannot be found with simple gradient-based heuristics. Thus, there is a great opportunity and also a great challenge in picking different DNN models and system-resource settings to satisfy inference latency, accuracy, and energy requirements.
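To make the idle-power effect behind this non-smoothness concrete, the sketch below computes per-period energy for a periodic-input scenario like the one in Figure 3. The power/latency pairs and the idle power are hypothetical values, not our measurements.

```python
# Why energy/latency tradeoffs are not smooth once deadlines and idle
# power are accounted for. All numbers below are hypothetical.

def energy_per_period(power_cap, inference_time, period, idle_power=10.0):
    """Energy for one input period: active inference, plus idling at
    idle_power until the next input arrives (inference must fit the period)."""
    assert inference_time <= period
    return power_cap * inference_time + idle_power * (period - inference_time)

# Hypothetical (power cap in W, inference latency in s) points for one DNN:
configs = [(40, 0.50), (60, 0.35), (80, 0.28), (100, 0.25)]
period = 0.5  # one input every 500 ms

for watts, latency in configs:
    print(watts, round(energy_per_period(watts, latency, period), 2))
```

Under this model, the slowest setting that still meets the period can be the most energy-efficient (it never idles), while raising the power cap buys latency slack at a super-linear energy cost; shrink the period and the cheapest feasible point jumps discontinuously.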
To understand how DNN inference varies across inputs, platforms, and run-time environments, and hence how (not) helpful off-line profiling is, we run a set of experiments below, where we feed the network one input at a time and use 1/10 of the total data for warm up, to emulate real-world scenarios. We plot the inference latency without and with co-located jobs in Figures 4 and 5, and we see several trends. First, deadline violation is a realistic concern. Image classification on video has deadlines ranging from 1 second to the camera latency (e.g., 1/60 seconds) [41]; the two NLP tasks have deadlines around 1 second [65]. There is clearly no single inference task that meets all deadlines on all hardware. Second, the inference variation among inputs is relatively small, particularly when there are no co-located jobs (Fig. 4), except for NLP1, where the large variance is mainly caused by different input lengths. For other tasks, outlier inputs exist but are rare. Third, the latency and its variation across inputs are both greatly affected by resource contention. Comparing Figure 5 with Figure 4, we can see that the co-located job has increased
Figure 5: Latency variance with co-located jobs (the memory-intensive STREAM benchmark [61] co-located on Embedded, CPU1-2; GPU-intensive Backprop [9] co-located on GPU)

both the median latency, the tail latency, and the difference between these two for all tasks on all platforms. This trend also applies to other contention cases. While the discussion above is about latency, similar conclusions apply to inference accuracy and energy: the accuracy typically drops to close to 0 when the inference time exceeds the latency requirement, and the energy consumption naturally changes with inference time.
Summary:
Deadline violations are realistic concerns, and inference latency varies greatly across platforms, under contention, and sometimes across inputs. Clearly, sticking to one static DNN design across platforms and workloads leads to an unpleasant trade-off: always meeting the deadline by sacrificing accuracy or energy in most settings, or achieving high accuracy sometimes but exceeding the deadline in others. Furthermore, it is also sub-optimal to make run-time decisions based solely on off-line profiling, considering the variation caused by run-time contention.
We now show how confining adaptation to a single layer (just application or system) is insufficient. We run the ImageNet classification on CPU1. We examine a range of latency (0.1s–0.7s) and accuracy constraints (85%–95%), and try meeting those constraints while minimizing energy by either (1) configuring just the DNN (selecting a DNN from a family, like that in Figure 2) or (2) configuring just the system (by selecting resources to control energy–latency tradeoffs as in Figure 3). We compare these single-layer approaches to one that simultaneously picks the DNN and system configuration. As we are concerned with the ideal case, we create oracles by running 90 inputs in all possible DNN and system configurations, from which we find the best configuration for each input. The App-level oracle uses the default system setting. The Sys-level oracle uses the default (highest accuracy) DNN.
Figure 6: Minimize-energy task with latency and accuracy constraints @ CPU1. (∞ means unable to meet the constraints)

Figure 6 shows the results. As we have a three-dimensional problem—meeting accuracy and latency constraints with minimal energy—we linearize the constraints and show them on the x-axis (accuracy changes faster, latency slower, so each latency bin contains all accuracy goals). There are several important conclusions here. First, the App-only approach meets all possible accuracy and latency constraints, while the Sys-only approach cannot meet any constraints below 0.3s. Second, across the entire constraint range, App-only consumes significantly more energy than Combined (60% more on average). The intuition behind Combined's superiority is that there are discrete choices for DNNs; so when one is selected, there are almost always energy-saving opportunities by tailoring resource usage to that DNN's needs.

Summary:
Combining DNN- and system-level approaches achieves better outcomes. If left solely to the application, energy will be wasted. If left solely to the system, many achievable constraints will not be met.
ALERT's runtime system navigates the large tradeoff space created by combining DNN-level and system-level adaptation. ALERT meets user-specified latency, accuracy, and energy constraints and optimization goals while accounting for run-time variations in the environment or the goals themselves.
ALERT's inputs are specifications about (1) the adaptation options, including a set of DNN models $D = \{d_i \mid i = 1 \cdots K\}$ and a set of system-resource settings, expressed as different power caps $P = \{p_j \mid j = 1 \cdots L\}$; and (2) the user-specified requirements on latency, accuracy, and energy usage, which can take the form of meeting constraints in any two of these three dimensions while optimizing the third. ALERT's output is the DNN model $d_i \in D$ and the system-resource setting $p_j \in P$ for the next inference-task input. Formally, ALERT selects a DNN $d_i$ and a system-resource setting $p_j$ to fulfill either of these user-specified goals.

Maximizing inference accuracy $q$ (minimizing error) for an energy budget $E_{goal}$ and inference deadline $T_{goal}$:

$$\arg\max_{i,j} q_{i,j} \quad \text{s.t.} \quad e_{i,j} \le E_{goal} \wedge t_{i,j} \le T_{goal} \quad (1)$$

Minimizing the energy use $e$ for an accuracy goal $Q_{goal}$ and inference deadline $T_{goal}$:

$$\arg\min_{i,j} e_{i,j} \quad \text{s.t.} \quad q_{i,j} \ge Q_{goal} \wedge t_{i,j} \le T_{goal} \quad (2)$$

We omit the discussion of meeting energy and accuracy constraints while minimizing latency, as it is a trivial extension of the discussed techniques and we believe it to be the least practically useful. We also omit the problem of optimizing all three dimensions, as it creates a feasibility problem, leaving nothing for optimization—lowest latency and highest accuracy are impractical to achieve simultaneously.

Generality
Along the DNN-adaptation side, the input DNN set can consist of any DNNs that offer different accuracy, latency, and energy tradeoffs, e.g., those in Figure 2. In particular, ALERT can work with either or both of the broad classes of DNN adaptation approaches that have arisen recently, including: (1) traditional DNNs, where the adaptation option must be selected prior to starting an inference task [21, 23, 62, 85, 88], and (2) anytime DNNs, which produce a series of outputs as they execute [5, 35, 36, 50, 53, 83, 87]. These two classes are similar in that they both vary things like the network depth or width to create latency/accuracy tradeoffs. On the system-resource side, ALERT uses a power cap as the proxy for system resource usage. Since both hardware [14] and software resource managers [34, 73, 89] can convert power budgets into optimal performance resource allocations, ALERT is compatible with many different schemes from both commercial products and the research literature.
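The selection problem formalized in Eqs. 1 and 2 can be sketched as a brute-force scan over (DNN, power cap) pairs, scoring each by its probability of meeting the deadline under a normal latency model and the resulting expected accuracy. This is an illustrative sketch, not ALERT's implementation: the configuration names, profiled numbers, slow-down statistics (`mu`, `sigma`), and the simplified energy model (which ignores idle energy) are all assumptions.

```python
# Hedged sketch of picking a (DNN, power cap) pair under a deadline and
# an energy budget. All profiled numbers below are hypothetical.
import math

def normal_cdf(x, mean, std):
    """P(X <= x) for X ~ Normal(mean, std)."""
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))

def pick_config(configs, mu, sigma, t_goal, e_goal, q_fail=0.0):
    """Maximize expected accuracy subject to an energy budget."""
    best = None
    for cfg in configs:
        mean_t = mu * cfg["t_prof"]                      # scaled latency
        std_t = max(sigma * cfg["t_prof"], 1e-9)
        p_meet = normal_cdf(t_goal, mean_t, std_t)       # deadline probability
        # Step-function accuracy: full accuracy if on time, q_fail otherwise.
        exp_q = p_meet * cfg["q"] + (1 - p_meet) * q_fail
        energy = cfg["power"] * mean_t                   # simplified: no idle term
        if energy <= e_goal and (best is None or exp_q > best[0]):
            best = (exp_q, cfg)
    return best

# Hypothetical profiles: latency (s), accuracy, power (W) per configuration.
configs = [
    {"name": "small@40W", "t_prof": 0.05, "q": 0.85, "power": 40},
    {"name": "large@40W", "t_prof": 0.20, "q": 0.95, "power": 40},
    {"name": "large@80W", "t_prof": 0.11, "q": 0.95, "power": 80},
]
exp_q, cfg = pick_config(configs, mu=1.2, sigma=0.05, t_goal=0.15, e_goal=15.0)
print(cfg["name"], round(exp_q, 3))
```

Note how the combined choice wins here: the large DNN at 40W would miss the deadline and the small DNN sacrifices accuracy, while raising the power cap lets the large DNN fit the latency budget.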
ALERT works as a feedback controller. It follows four steps to pick the DNN and resource settings for each input n:

1) Measurement. ALERT records the processing time and energy usage, and computes the inference accuracy, for input n−1.

2) Goal adjustment. ALERT updates the latency goal T_goal if necessary, considering the potential latency-requirement variation across inputs. In some inference tasks, a set of inputs share one combined requirement (e.g., in the NLP1 task in Table 2, all the words in a sentence are processed by a DNN one by one and share one sentence-wise deadline), and hence delays in processing previous inputs could greatly shorten the available time for the next input [1, 48]. Additionally, ALERT sets the goal latency to compensate for its own, worst-case overhead so that ALERT itself will not cause violations.

3) Feedback-based estimation. ALERT computes the expected latency, accuracy, and energy consumption for every combination of DNN model and power setting.

4) Picking a configuration. ALERT feeds all the updated estimations of latency, accuracy, and energy into Eqs. 1 and 2, and gets the desired DNN model and power-cap setting for n.

The key task is step 3: the estimation needs to be accurate and fast. In the remainder of this section, we discuss the key ideas and the exact algorithm of our feedback-based estimation.

Strawman
Solving Eqs. 1 and 2 would be trivially easy if the deployment environment were guaranteed to match the training and profiling environment: we could estimate $t_{i,j}$ to be the average (or worst case, etc.) inference time $t^{prof}_{i,j}$ over a set of profiling inputs under model $d_i$ and power setting $p_j$. However, this approach does not work given the dynamic input, contention, and requirement variation. Next, we present the key ideas behind how ALERT estimates the inference latency, accuracy, and energy consumption under model $d_i$ and power setting $p_j$.

How to estimate the inference latency $t_{i,j}$? To handle the run-time variation, a potential solution is to apply an estimator, like a Kalman filter [56], to make dynamic predictions based on recent history about inferences under model $d_i$ and power $p_j$. The problem is that most models and power settings will not have been picked recently and hence would have no recent history to feed into the estimator. This problem is a direct example of the challenge imposed by the large space of combined application and system options.

Idea 1: Handle the large selection space with a single scalar value.
To make effective online estimation for all combinations of models and power settings, ALERT introduces a global slow-down factor ξ to capture how the current environment differs from the profiled environment (e.g., due to co-running processes, input variation, or other changes). Such an environmental slow-down factor is independent of any individual model or power selection. It can fully leverage execution history, no matter which models and power settings were recently used; it can then be used to estimate $t_{i,j}$ based on $t^{prof}_{i,j}$ for all $d_i$ and $p_j$ combinations. Applying a global slowdown factor to all combinations of application- and system-level settings is crucial for ALERT to make quick decisions for every inference task. Although it is possible that some perturbations may lead to different slowdowns for different configurations, the slight loss of accuracy here is outweighed by the benefit of having a simple mechanism that allows prediction even for configurations that have not been used recently. This idea is also novel in ALERT, as previous cross-stack management systems all use much more complicated models to estimate and select different setting combinations (e.g., using model predictive control to estimate combinations of settings [58]). ALERT's global slowdown factor is based on several unique features of DNN families that accomplish the same task with different accuracy/latency tradeoffs. We categorize these features as: (1) similarity of code paths and (2) proportionality of structure. The first is based on the observation that DNNs do not have complex conditional code dependences, so we do not need to worry about the case where different inputs would exercise very different code paths.
Thus, what ALERT learns about latency, accuracy, and energy for one input will always inform it about future inputs. The second feature refers to the fact that as DNNs in a family scale in latency, the proportions of different operations tend to be similar, so what ALERT learns about one DNN in the family generally applies to other DNNs in the same family. These properties of DNNs do not hold for many other types of software, where different inputs or additional functionality can invoke entirely different code paths, with different resource requirements or responses.

How to estimate the accuracy under a deadline?
Given a deadline $T_{goal}$, the inference accuracy delivered by model $d_i$ and power setting $p_j$ is determined by three factors, as shown in Eq. 3: (1) whether the inference result, which takes time $t_{i,j}$, can be generated before the deadline $T_{goal}$; (2) if yes, the accuracy is determined by the model $d_i$; (3) if not, the accuracy drops to that offered by a backup result $q_{fail}$. For traditional DNN models, without any output at the deadline, a random guess will be used and $q_{fail}$ will be much worse than $q_i$. For anytime DNN models that output multiple results as they are ready, the backup result is the latest output [5, 35, 36, 50, 53, 83, 87], which we discuss more in Section 3.5.

$$q_{i,j}[T_{goal}] = \begin{cases} q_i, & \text{if } t_{i,j} \le T_{goal} \\ q_{fail}, & \text{otherwise} \end{cases} \quad (3)$$

A potential solution to estimate the accuracy $q_{i,j}$ at the deadline $T_{goal}$ is to simply feed the estimated $t_{i,j}$ into Eq. 3. However, this simple approach fails to account for two issues. First, while DNNs are generally well-behaved, significant tail effects are possible (see Figure 4). Second, Eq. 3 is not linear, and is best understood as a step function, where a failure to complete inference by the deadline results in a worthless inference output ($q_{fail}$). Combined, these two issues mean that for tail inputs, inference will produce a worthless result; i.e., accuracy is not proportional to latency, but can easily fall to zero for tail inputs. The tail will, of course, be increased if there is any unexpected resource contention. Therefore, the simple approach of using the mean latency prediction fails to account for the non-linear effects of latency on accuracy.

Idea 2: Handle the runtime variation and account for tail behavior.
To handle the run-time variability, ALERT considers $t_{i,j}$ and the global slow-down factor ξ as random variables drawn from a normal distribution. (Since it could be infeasible to calculate the exact inference accuracy at run time, ALERT uses the average training accuracy of the selected DNN model $d_i$, denoted as $q_i$, as the inference accuracy, as long as the inference computation finishes before the specified deadline.) ALERT uses a recently proposed extension to the Kalman filter to adaptively update the noise covariance [2]. While this extension was originally proposed to produce better estimates of the mean, a novel approach in ALERT is using this covariance estimate as a measure of system volatility. ALERT uses this Kalman filter extension to predict not just the mean accuracy, but also the likelihood of meeting the accuracy requirements in the current operating environment. Section 5.3 shows the advantages of our extensions.

How to minimize energy or satisfy energy constraints?
Minimizing energy or satisfying energy constraints is complicated, as the energy is related to, but cannot be easily calculated from, the complexity of the selected model $d_i$ and the power cap $p_j$. As discussed in Section 2.2, the energy consumption includes both that used during the inference under a given model $d_i$ and that used during the inference-idle period, waiting for the next input. Consequently, it is not straightforward to decide which power setting to use.

Idea 3.
ALERT leverages insights from previous research, which shows that energy for latency-constrained systems can be efficiently expressed as a mathematical optimization problem [8, 49, 51, 63]. These frameworks optimize energy by scheduling available configurations in time: time is assigned to configurations so that the average performance hits the desired latency target and the overall energy (including idle energy) is minimal. The key is that while the configuration space is large, the number of constraints is small (typically just two). Thus, the number of configurations assigned a non-zero time is also small (equal to the number of constraints) [49]. Given this structure, the optimization problem can be solved using a binary search over available configurations, or even more efficiently with a hash table [63].

The only difficulty in applying this prior work to ALERT is that prior work assumed there was only a single job running at a time, while ALERT assumes that other applications might contend for resources. Thus, ALERT cannot assume that there is a single system-idle state that will be used whenever the DNN is not executing. To address this challenge, ALERT continually estimates the system power when DNN inference is idle (but other, non-inference tasks might be active), p^DNNidle, transforming Eq. 1 into:

  argmax_{i,j} q_{i,j}[T_goal]  s.t.  p_{i,j} · t_{i,j} + p^DNNidle · t^DNNidle ≤ E_goal    (4)

Global Slow-down Factor ξ. As discussed in Idea-1, ALERT uses ξ to reflect how the run-time environment differs from the profiling environment. Conceptually, if the inference task under model d_i and power cap p_j took time t_{i,j} at run time and took t^prof_{i,j} on average to finish during profiling, the corresponding ξ would be t_{i,j} / t^prof_{i,j}. ALERT estimates ξ using recent execution history under any model or power setting. Specifically, after an input n−1, ALERT computes ξ^(n−1) as the ratio of the observed time t^(n−1)_{i,j} to the profiled time t^prof_{i,j}, and then uses a Kalman Filter to estimate the mean µ^(n) and variance (σ^(n))² of ξ^(n) at input n. ALERT's formulation is defined in Eq. 5, where K^(n) is the Kalman gain variable, R is a constant reflecting the measurement noise, and Q^(n) is the process noise, bounded below by Q^(0). We set a forgetting factor of process variance α and initialize K^(0), R, Q^(0), µ^(0), and (σ^(0))², following the standard convention [56]:

  Q^(n) = max{ Q^(0), α·Q^(n−1) + (1−α)·(K^(n−1)·y^(n−1))² }
  K^(n) = [ (1−K^(n−1))·(σ^(n−1))² + Q^(n) ] / [ (1−K^(n−1))·(σ^(n−1))² + Q^(n) + R ]
  y^(n) = t^(n−1)_{i,j} / t^prof_{i,j} − µ^(n−1)
  µ^(n) = µ^(n−1) + K^(n)·y^(n)
  (σ^(n))² = (1−K^(n−1))·(σ^(n−1))² + Q^(n)    (5)

Then, using ξ^(n), ALERT estimates the inference time of input n under any model d_i and power cap p_j: t^(n)_{i,j} = ξ^(n) · t^prof_{i,j}.

Probability of meeting the deadline.
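To make this concrete, the Eq. 5 update and the deadline probability it enables can be sketched in Python. This is a minimal sketch; the constants (α, R, Q(0)) and the initial state are illustrative assumptions, not the paper's tuned values.

```python
import math

# Sketch of ALERT's global slowdown-factor estimator (Eq. 5) and the
# probability of meeting a deadline (Eq. 6). All constants below are
# illustrative assumptions, not the paper's tuned values.

class SlowdownEstimator:
    def __init__(self, alpha=0.3, R=0.01, Q0=0.01, mu0=1.0, var0=0.1):
        self.alpha, self.R, self.Q0 = alpha, R, Q0
        self.mu, self.var = mu0, var0          # mean / variance of xi
        self.K, self.Q, self.y = 0.5, Q0, 0.0  # gain, process noise, innovation

    def update(self, t_observed, t_profiled):
        # Process noise adapts with a forgetting factor, floored at Q0.
        self.Q = max(self.Q0,
                     self.alpha * self.Q
                     + (1 - self.alpha) * (self.K * self.y) ** 2)
        prior_var = (1 - self.K) * self.var + self.Q
        self.K = prior_var / (prior_var + self.R)    # Kalman gain
        self.y = t_observed / t_profiled - self.mu   # innovation
        self.mu += self.K * self.y                   # updated mean of xi
        self.var = prior_var                         # updated variance of xi
        return self.mu, self.var

def deadline_probability(mu, var, t_prof, t_goal):
    """Pr[xi * t_prof <= t_goal] for xi ~ N(mu, var), as in Eq. 6."""
    mean, std = mu * t_prof, math.sqrt(var) * t_prof
    return 0.5 * (1.0 + math.erf((t_goal - mean) / (std * math.sqrt(2.0))))
```

For instance, after observing an input that ran 20% slower than its profile, the estimated mean of ξ moves above 1 and the probability of meeting a tight deadline drops accordingly.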
Given the Kalman Filter estimation for the global slowdown factor, we can calculate Pr_{i,j}, the probability that the inference completes before the deadline T_goal. ALERT computes this value using a cumulative distribution function (CDF) based on the normal distribution of ξ^(n) estimated by the Kalman Filter:

  Pr_{i,j} = Pr[ ξ^(n) · t^prof_{i,j} ≤ T_goal ]
           = CDF( ξ^(n) · t^prof_{i,j}, T_goal )
           = CDF( µ^(n) · t^prof_{i,j}, σ^(n) · t^prof_{i,j}, T_goal )    (6)

Accuracy.
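The expectation in Eq. 7 below reduces to a two-term mixture, which can be sketched in a few lines; the q_fail value used in the example is an illustrative assumption.

```python
# Sketch of Eq. 7: expected accuracy mixes the model's accuracy (when it
# meets the deadline) with a random guess (when it does not).

def expected_accuracy(pr_meet, q_model, q_fail):
    return pr_meet * q_model + (1.0 - pr_meet) * q_fail
```

For a model with 0.98 accuracy, a 97% chance of meeting the deadline, and a hypothetical random-guess accuracy of 0.1, the expectation is 0.97 × 0.98 + 0.03 × 0.1 = 0.9536.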
As discussed in Idea-2, ALERT computes the estimated inference accuracy q̂_{i,j}[T_goal] by considering t_{i,j} as a random variable that follows a normal distribution, with its mean and variance computed from those of ξ. Here q_{i,j} represents the inference accuracy when the DNN inference finishes before the deadline, and q_fail is the accuracy of a random guess:

  q̂_{i,j}[T_goal] = E( q_{i,j}[T_goal] | t^(n)_{i,j} )
                  = E( q_{i,j}[T_goal] | ξ^(n) · t^prof_{i,j} )
                  = Pr_{i,j} · q_{i,j} + (1 − Pr_{i,j}) · q_fail,
  where ξ^(n) ∼ N( µ^(n), (σ^(n))² )    (7)

Energy.
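The two estimates described next, the idle-power-ratio filter of Eq. 8 and the mean-latency energy of Eq. 9, can be sketched as follows; the constants (φ(0), M(0), S, V) are illustrative assumptions, not the paper's values.

```python
# Sketch of the DNN-idle power-ratio filter (Eq. 8) and the mean-latency
# energy estimate (Eq. 9). The initial state and noise constants are
# illustrative assumptions.

class IdlePowerEstimator:
    def __init__(self, phi0=0.3, M0=0.1, S=0.01, V=0.05):
        self.phi, self.M, self.S, self.V = phi0, M0, S, V

    def update(self, p_idle_observed, p_cap):
        W = (self.M + self.S) / (self.M + self.S + self.V)    # Kalman gain
        self.M = (1 - W) * (self.M + self.S)                  # variance update
        self.phi += W * (p_idle_observed / p_cap - self.phi)  # idle-power ratio
        return self.phi

def estimated_energy(p_cap, mu, t_prof, phi, t_goal):
    """Eq. 9: inference energy plus idle energy until the deadline."""
    t_infer = mu * t_prof              # mean predicted inference latency
    return p_cap * t_infer + phi * p_cap * (t_goal - t_infer)
```

For example, under a 10 W cap, a 50 ms mean latency, a 100 ms deadline, and an idle-power ratio of 0.3, the per-input estimate is 10 × 0.05 + 3 × 0.05 = 0.65 J.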
As discussed in Idea-3, ALERT predicts energy consumption by separately estimating the energy used during (1) DNN execution, estimated by multiplying the power limit by the predicted inference latency, and (2) the inference-idle period before the deadline. For the latter, ALERT uses a second Kalman Filter to track the DNN-idle power. (A Kalman Filter is an optimal estimator that assumes a normal distribution and estimates a varying quantity from multiple, potentially noisy observations [56].) In Eq. 8, φ^(n) is the predicted DNN-idle power ratio, M^(n) is the process variance, S is the process noise, V is the measurement noise, and W^(n) is the Kalman Filter gain; ALERT initializes M^(0), S, and V as constants:

  W^(n) = ( M^(n−1) + S ) / ( M^(n−1) + S + V )
  M^(n) = (1 − W^(n)) · ( M^(n−1) + S )
  φ^(n) = φ^(n−1) + W^(n) · ( p_idle / p^(n−1)_{i,j} − φ^(n−1) )    (8)

ALERT then predicts the energy with Eq. 9. Unlike Eq. 7, which uses probabilistic estimates, the energy estimation is calculated without the notion of probability: because ALERT sets power limits, the inference power is the same whether the inference misses or meets the deadline. It is therefore safe to estimate the energy from the mean latency without considering the distribution of possible latencies (see Eq. 12 for an estimate based on a worst-case latency percentile):

  e^(n)_{i,j} = p_{i,j} · ξ^(n) · t^prof_{i,j} + φ^(n) · p_{i,j} · ( T_goal − ξ^(n) · t^prof_{i,j} )    (9)

Selecting Configurations.
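This selection step can be sketched as a filter-then-argmax over candidate configurations; the candidate tuples and their estimates below are hypothetical.

```python
# Sketch of configuration selection for the maximize-accuracy goal (Eq. 1):
# keep configurations whose estimated latency and energy meet the
# constraints, then pick the highest expected accuracy. The candidate
# list (names and estimates) is hypothetical.

def select_config(candidates, t_goal, e_goal):
    valid = [c for c in candidates
             if c["t_est"] <= t_goal and c["e_est"] <= e_goal]
    if not valid:
        return None                      # no configuration meets constraints
    return max(valid, key=lambda c: c["q_exp"])

candidates = [
    {"name": "large@15W", "t_est": 0.050, "e_est": 0.80, "q_exp": 0.951},
    {"name": "small@10W", "t_est": 0.020, "e_est": 0.40, "q_exp": 0.949},
    {"name": "large@10W", "t_est": 0.080, "e_est": 0.70, "q_exp": 0.920},
]
best = select_config(candidates, t_goal=0.060, e_goal=0.85)
```

Tightening the deadline in this toy example (e.g., t_goal = 0.030) filters out both large-model configurations, so the small model wins by default.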
Given the estimates of latency, expected accuracy, and energy consumption, ALERT generates the set of valid configurations that meet all of the constraints. ALERT then chooses the best valid configuration according to the optimization goal; i.e., ALERT selects a configuration that solves either Eq. 1 or Eq. 2 using the estimated latency, accuracy, and energy from Equations 5, 7, and 9, respectively.

In unpredictable environments, i.e., those with high estimated variance from Eq. 5, ALERT is naturally more conservative, selecting from fewer valid configurations. Consider an example scenario with two DNN candidates: the larger one has an estimated accuracy of 0.98 and a 97% probability of meeting the deadline, while the smaller one has an estimated accuracy of 0.95 and a 99.9% probability. The larger DNN has the lower probability because it takes longer to run. When the observed variance is low, ALERT picks the larger DNN for its higher expected accuracy (97% × 0.98 = 0.951, compared with the smaller one's 99.9% × 0.95 = 0.949). When the observed variance grows, ALERT increases the Q value (Eq. 5) and thus its estimated variance (σ)². The higher estimated variance decreases the probability of completion by the deadline for every configuration in Eq. 6, and it decreases more for the larger DNN, which has the larger latency. In our example, the larger DNN's probability of completion drops from 97% to 95%, decreasing its expected accuracy to 0.941, while the smaller DNN's probability only drops from 99.9% to 99.5%, decreasing its expected accuracy to 0.945. ALERT then chooses the smaller DNN, which now has the higher expected accuracy (because it is more likely to complete) in this high-variance environment.

Manipulating ALERT's Probabilistic Guarantees.
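The threshold mechanism described next (Eqs. 10–12) amounts to one extra constraint plus a percentile latency. The percentile-based energy estimate of Eq. 12 can be sketched with the standard library's normal quantile function; all numeric inputs below are illustrative.

```python
from statistics import NormalDist

# Sketch of Eq. 12: a pessimistic energy estimate that replaces the mean
# latency with the Pr_th-percentile latency drawn from the distribution
# N(mu * t_prof, (sigma * t_prof)^2). Inputs are illustrative.

def pessimistic_energy(p_cap, mu, sigma, t_prof, phi, t_goal, pr_th):
    t_worst = NormalDist(mu * t_prof, sigma * t_prof).inv_cdf(pr_th)
    return p_cap * t_worst + phi * p_cap * (t_goal - t_worst)
```

At pr_th = 0.5 this reduces to the mean-based estimate of Eq. 9; higher thresholds yield strictly higher estimates whenever φ < 1, so more configurations are rejected, matching the tighter-energy-bound behavior described in the text.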
ALERT's default setting uses a full mathematical expectation, without explicitly defining a probabilistic threshold Pr_th for meeting the constraints. Users can set this probabilistic threshold Pr_th according to their needs, and ALERT will then not select configurations whose probability falls below it. Adding this capability is as simple as adding another constraint to Eq. 1 in the accuracy-maximizing scenario:

  argmax_{i,j} q_{i,j}  s.t.  e_{i,j} ≤ E_goal ∧ t_{i,j} ≤ T_goal ∧ Pr_{i,j} ≥ Pr_th    (10)

In the energy-minimizing scenario, Eq. 2 is modified to be:

  argmin_{i,j} e_{i,j}  s.t.  q_{i,j} ≥ Q_goal ∧ t_{i,j} ≤ T_goal ∧ Pr_{i,j} ≥ Pr_th    (11)

The energy estimation can also be updated accordingly, for users who want more control over ALERT's energy guarantees:

  e^(n)_{i,j} = p_{i,j} · CDF⁻¹( ξ^(n) · t^prof_{i,j}, Pr_th ) + φ^(n) · p_{i,j} · ( T_goal − CDF⁻¹( ξ^(n) · t^prof_{i,j}, Pr_th ) ),    (12)

where CDF⁻¹( ξ^(n) · t^prof_{i,j}, Pr_th ) is the inverse of the cumulative distribution function. It takes two inputs: (1) the distribution of the random variable ξ^(n) · t^prof_{i,j} and (2) the user threshold Pr_th, which indicates the required probability of meeting the goal. It outputs the predicted latency at the Pr_th percentile of the distribution of t_{i,j}, i.e., a worst-case latency for that percentile. Compared with Eq. 9, the energy estimate from this equation is higher, as it uses a higher-percentile latency. ALERT will therefore reject more configurations and may deliver lower expected accuracy, as the cost of tighter energy bounds.

An anytime DNN is an inference model that outputs a series of increasingly accurate inference results o_1, o_2, ..., o_k, with each o_t more reliable than o_{t−1}. A variety of recent works [5, 36, 50, 53, 83, 87] have proposed DNNs supporting anytime inference, covering a variety of problem domains. ALERT easily works not only with traditional DNNs but also with anytime DNNs. The only change is that q_fail in Eq.
3 no longer corresponds to a random guess. That is, when the inference cannot generate its final result o_k by the deadline T_goal, an earlier result o_x can be used, with much better accuracy than that of a random guess. The updated accuracy equation is below:

  q_{i,j} = q_k,      if t_{k,j} ≤ T_goal
          = q_{k−1},  if t_{k−1,j} ≤ T_goal < t_{k,j}
            ...
          = q_fail,   otherwise    (13)

Existing anytime DNNs consider latency but not energy constraints: an anytime DNN keeps running until the latency deadline arrives, and the last output is delivered to the user. ALERT naturally improves anytime-DNN energy efficiency, sometimes stopping the inference before the deadline, based on its estimates, to meet not only latency and accuracy but also energy requirements.

Furthermore, ALERT can work with a set of traditional DNNs and an anytime DNN together, achieving the best combined result. The reason is that anytime DNNs generally sacrifice accuracy for flexibility. When we construct the candidate set D from a group of traditional DNNs and one anytime DNN, Eq. 7 leads ALERT to naturally select the anytime DNN when the environment is changing rapidly (because the expected accuracy of an anytime DNN is higher given that variance) and the regular DNN, which has slightly higher accuracy at similar computation cost, when the environment is stable, getting the best of both worlds.

In our evaluation, we use the nested design from [5], which provides generic coverage of anytime DNNs.

Assumptions of the Kalman Filter.
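The step function in Eq. 13 above can be sketched as follows; the output schedule and accuracies in the example are hypothetical.

```python
# Sketch of Eq. 13: with an anytime DNN, missing the final output's
# deadline degrades to the accuracy of the latest intermediate output
# that did finish, rather than to a random guess.

def anytime_accuracy(t_goal, output_times, output_accs, q_fail):
    """output_times[k] is the completion time of output k; output_accs[k]
    its accuracy. Later outputs are assumed more accurate."""
    best = q_fail
    for t_k, q_k in zip(output_times, output_accs):
        if t_k <= t_goal:
            best = q_k
    return best
```

For example, with intermediate outputs at 10/20/40 ms reaching 0.70/0.85/0.95 accuracy and a 30 ms deadline, the second output's 0.85 is used instead of q_fail.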
ALERT's prediction, particularly the Kalman Filter, relies on feedback from recent input processing. Consequently, it requires at least one input to react to sudden changes. Additionally, the Kalman Filter formulation assumes that the underlying distributions are normal, which may not hold in practice. If the behavior is not Gaussian, the Kalman Filter will produce poor estimates of the mean of ξ for some amount of time.

Having said that, as our experiments will show, no single distribution fits all real-world scenarios, and the normal distribution is the best fit we have found in practice (Figure 11). Furthermore, ALERT is specifically designed to handle deviation from the normal-distribution assumption, novelly using the Kalman Filter's variance estimate to measure system volatility and accounting for that volatility in the accuracy/energy estimations. Consequently, after just 2–3 such bad mean predictions, the estimated variance increases, which triggers ALERT to pick an anytime DNN over traditional DNNs, or a low-latency traditional DNN over high-latency ones, because the former has a better chance of producing results by the latency deadline and hence a higher expected accuracy under high variance. So, in the worst case, ALERT chooses a DNN with slightly lower accuracy than what could have been used with the right model of randomness. Users can also compensate for extremely aberrant latency distributions by increasing the value of Q^(0) in Eq. 5. As we will see in Section 5.3, ALERT performs well even when the distribution is not normal.

Probabilistic guarantees.
ALERT provides probabilistic, not hard, guarantees. Because ALERT estimates not just average timing but the distribution of possible timings, it can provide arbitrarily many nines of assurance that it will meet latency or accuracy goals, but it cannot provide a 100% guarantee. Providing 100% guarantees requires knowledge of the worst-case execution time (WCET), a latency value that is guaranteed never to be exceeded. ALERT does not assume the availability of such information and hence cannot provide hard guarantees [7].
Safety guarantees.
While ALERT does not explicitly model safety requirements, it can be configured to prioritize accuracy over other dimensions. In scenarios where users particularly value safety (e.g., autonomous driving), they could set a high accuracy requirement or even remove the energy constraints.
Concurrent inference jobs.
ALERT is currently designed to support one inference job at a time. To support multiple concurrent inference jobs, future work will need to extend ALERT to coordinate across them. We expect ALERT's main ideas, such as using a global slowdown factor to estimate system variation, to still apply.
Scope of ALERT.
Finally, how the inference behaves ultimately depends not only on ALERT but also on the DNN models and the system-resource setting options. As we will evaluate in Section 5, ALERT helps make the best use of the supplied DNN models, but it does not eliminate the differences between DNN models.
We implement ALERT for both CPUs and GPUs. On CPUs, ALERT adjusts power through Intel's RAPL interface [14], which allows software to set a hardware power limit. On GPUs, ALERT uses PyNVML to control frequency and builds a power–frequency lookup table. ALERT can also be applied to other approaches that translate power limits into settings for combinations of resources [34, 37, 73, 89].

In our experiments, ALERT considers a series of power settings within the feasible range, at a 2.5W interval on our test laptop and a 5W interval on our test CPU server and GPU platform, as the latter has a wider power range than the former. The number of power buckets is configurable.

ALERT incurs small overhead in both scheduler computation and switching from one DNN/power setting to another: just 0.6–1.7% of an input's inference time. We explicitly account for this overhead by subtracting it from the user-specified goal (see step 2 in Section 3.2).

Users may set goals that are not achievable. If ALERT cannot meet all constraints, it prioritizes latency highest, then accuracy, then power. This hierarchy is configurable.

Run-time environment settings:
  Default   Inference task has no co-running process
  Memory    Co-located with memory-hungry STREAM [61] (@CPU);
            co-located with Backprop from Rodinia-3.1 [9] (@GPU)
  Compute   Co-located with Bodytrack from PARSEC-3.0 [6] (@CPU);
            co-located with the forward pass of Backprop [9] (@GPU)

Ranges of constraint settings:
  Latency   0.4x–2x mean latency* of the largest Anytime DNN
  Accuracy  Whole range achievable by trad. and Anytime DNNs
  Energy    Whole feasible power-cap range on the machine

  Task                  Trad. DNN      Anytime [5]  Fixed deadline?
  Image Classification  Sparse ResNet  Depth-Nest   Yes
  Sentence Prediction   RNN            Width-Nest   No

  Scheme ID      DNN selection                      Power selection
  Oracle         Dynamic optimal                    Dynamic optimal
  Oracle-Static  Static optimal                     Static optimal
  App-only       One Anytime DNN                    System default
  Sys-only       Fastest traditional DNN            State-of-Art [38]
  No-coord       Anytime DNN w/o coord. with power  State-of-Art [38]
  ALERT          ALERT default                      ALERT default
  ALERT-Any      ALERT w/o traditional DNNs         ALERT default
  ALERT-Trad     ALERT w/o Anytime DNNs             ALERT default

Table 3: Settings and schemes under evaluation (* measured under the default setting, without resource contention)
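The configuration space described above (DNN candidates crossed with power buckets at a configurable interval) can be sketched as follows; the wattage bounds and model names are illustrative assumptions.

```python
# Sketch of ALERT's candidate configuration grid: every (DNN, power-cap)
# pair, with caps enumerated at a configurable interval within the
# feasible range. The bounds and model names are illustrative.

def power_buckets(p_min, p_max, interval):
    caps, p = [], p_min
    while p <= p_max + 1e-9:      # small epsilon guards float accumulation
        caps.append(round(p, 1))
        p += interval
    return caps

models = ["sparse_resnet", "depth_nest"]   # hypothetical candidate DNNs
configs = [(m, p) for m in models for p in power_buckets(7.5, 15.0, 2.5)]
```

With a 2.5W interval over a hypothetical 7.5–15W feasible range, two models yield eight candidate configurations for the scheduler to evaluate per input.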
We apply ALERT to different inference tasks on both CPU and GPU, with and without resource contention from co-located jobs. We set ALERT to (1) reduce energy while satisfying latency and accuracy requirements and (2) reduce error rates while satisfying latency and energy requirements. We compare ALERT with both oracle and state-of-the-art schemes, and we evaluate detailed design decisions.
Experimental setup.
We use the three platforms listed in Table 1: CPU1, CPU2, and GPU. On each, we run two inference tasks, image classification and sentence prediction, under three different resource-contention scenarios:

• No contention: the inference task is the only job running; referred to as "Default".
• Memory dynamic: the inference task runs together with a memory-intensive job that is repeatedly stopped and restarted, representing dynamic memory-resource contention; referred to as "Memory".
• Computation dynamic: the inference task runs together with a computation-intensive job that is repeatedly stopped and restarted, representing dynamic computation-resource contention; referred to as "Compute".

On GPU, we run only the image-classification task, as the RNN-based sentence-prediction task is better suited for CPU [90].
Figure 7: Result summary: average performance normalized to Oracle-Static, for the minimize-energy and minimize-error tasks. Violations% refers to the percentage of constraint settings under which a scheme incurs violations on more than 10% of all inputs. (Smaller is better; details in Table 4.)

We then evaluate a number of management schemes' ability to meet latency, accuracy, and energy constraints. Table 3 lists the details.
Schemes under evaluation.
We give ALERT three different DNN sets: traditional DNN models (ALERT-Trad), an Anytime DNN (ALERT-Any), and both (ALERT).

We compare with two Oracle* schemes that have perfect predictions for every input under every DNN/power setting (and are thus impractical). The "Oracle" allows DNN/power settings to change across inputs, representing the best possible results; the "Oracle-Static" uses one fixed setting across all inputs, representing the best results achievable without dynamic adaptation.

Finally, we compare with three state-of-the-art approaches:

• "App-only" adapts only at the application level, through an Anytime DNN [5];
• "Sys-only" adapts only at the system level, following an existing resource-management system that minimizes energy under soft real-time constraints [63] (specifically, a feedback scheduler that predicts inference latency based on a Kalman Filter) and uses the fastest candidate DNN to avoid latency violations;
• "No-coord" uses both the Anytime DNN for application-level adaptation and the power-management scheme [63] to adapt power, but with the two working independently.

Table 4 shows the results for all schemes, for different tasks on different platforms and environments. Each cell shows the average energy or error rate under 35–40 combinations of latency, accuracy, and energy constraints (the settings are detailed in Table 3), normalized to the Oracle-Static result. Figure 7 compares these results, where lower bars represent better results and lower *s represent fewer constraint violations.

Table 4: Average energy consumption (minimize-energy task) and error rate (minimize-error task) for ALERT, ALERT-Any, Sys-only, App-only, No-coord, and Oracle, normalized to Oracle-Static; smaller is better. (Each cell is averaged over 35–40 constraint settings.)

ALERT and ALERT-Any both work very well in all settings. They outperform the state-of-the-art approaches, which adapt at only one level, and they outperform Oracle-Static because they adapt to dynamic variations. ALERT also comes very close to the theoretically optimal Oracle.
Comparing with Oracles.
As shown in Table 4, ALERT achieves 93–99% of the Oracle's energy and accuracy optimization while satisfying constraints. Oracle-Static, the baseline in Table 4, represents the best one can achieve by selecting one DNN model and one power setting for all inputs. ALERT greatly outperforms Oracle-Static, reducing its energy consumption by 3–48% while satisfying accuracy constraints (36% in harmonic mean) and reducing its error rate by 9–66% while satisfying energy constraints (54% in harmonic mean).

Figure 8 shows a detailed comparison for the energy-minimization task. The figure shows the range of performance under all requirement settings (i.e., the whiskers). ALERT not only achieves similar mean energy reduction; its whole range of optimization behavior is also similar to the Oracle's. In comparison, Oracle-Static has not only the worst mean but also the worst tail performance. Due to space constraints, we omit the figures for other settings, where similar trends hold.

ALERT has a larger advantage over Oracle-Static on CPUs than on GPUs. The CPUs exhibit more empirical variance than the GPU, so they benefit more from dynamic adaptation; the GPU experiences significantly lower dynamic fluctuation, so the static oracle makes good predictions there.

ALERT satisfies the constraints in 99.9% of tests for image classification and in 98.5% of those for sentence prediction. For the latter, due to the large input variability (NLP1 in Figure 4), some input sentences simply cannot complete by the deadline even with the fastest DNN; there, the Oracle fails, too.

Note that these Oracle schemes not only have perfect (and hence impractical) prediction capability, but they also have no overhead. In contrast, ALERT runs on the same machines as the DNN workloads. All results include ALERT's run-time latency and power overhead.
Comparing with State-of-the-Art.
For a fair comparison, we focus on ALERT-Any, as it uses exactly the same DNN candidate set as "Sys-only", "App-only", and "No-coord". Across all settings, ALERT-Any outperforms the others.

The System-only solution suffers from not being able to choose different DNNs under different runtime scenarios. As a result, it performs much worse than ALERT-Any at satisfying accuracy requirements and at optimizing accuracy. For the former (left side of Table 4 and Figure 7), it creates accuracy violations in 68% of the settings, as shown in Figure 7; for the latter (right side of Table 4 and Figure 7), although capable of satisfying energy constraints, it introduces 34% more error than ALERT-Any.

The Application-only solution, which uses an Anytime DNN, suffers from not being able to adjust to the energy requirements. As a result, it consumes 73% more energy in energy-minimizing tasks (left side of Table 4 and Figure 7) and introduces many energy-budget violations, particularly under resource-contention settings (right side of Table 4 and Figure 7).

The no-coordination scheme is worse than both System- and Application-only. It violates constraints in both tasks, with 69% more energy and 34% more error than ALERT-Any. Without coordination, the two levels can work at cross purposes; e.g., the application switches to a faster DNN to save energy while the system makes more power available.

Figure 8: ALERT versus Oracle and Oracle-Static on the minimize-energy task, for (a) CPU1, image classification; (b) CPU1, sentence prediction; (c) CPU2, image classification; (d) CPU2, sentence prediction. (Lower is better; whisker: whole range; circle: mean.)
Table 5: ALERT's average energy consumption (minimize-energy task) and error rate (minimize-error task) with the Any, Trad, and combined candidate sets, normalized to Oracle-Static @ Sparse ResNet; smaller is better.
Different DNN candidate sets.
Table 5 compares the performance of ALERT working with an Anytime DNN (Any), a set of traditional DNN models (Trad), and both. At a high level, ALERT works well with all three DNN sets. Under closer comparison, ALERT-Trad violates more accuracy constraints than the others, particularly under resource contention on CPUs, because a traditional DNN suffers a much larger accuracy drop than an anytime DNN when it misses a latency deadline. Consequently, when the system variation is large, ALERT-Trad selects a faster DNN to meet latency and thus may not meet accuracy goals. Of course, ALERT-Any is not always the best. As discussed in Section 3.5, anytime DNNs sometimes have lower accuracy than a traditional DNN with similar execution time. This difference leads to the slightly better results for ALERT over ALERT-Any.

Figure 9: Minimize error rates with latency and energy constraints @ CPU1 (latency, power, accuracy, and DNN-choice panels over time). (ALERT in blue; ALERT-Trad in orange; constraints in red. Memory contention occurs from about input 46 to 119. Deadline: 1.25× mean latency of the largest Anytime DNN in Default; power limit: 35W.)

Figure 9 visualizes the different dynamic behaviors of ALERT (blue curve) and ALERT-Trad (orange curve) when the environment changes from Default to Memory-intensive and back. At the beginning, due to a loose latency constraint, ALERT and ALERT-Trad both select the biggest traditional DNN, which provides the highest accuracy within the energy budget. When the memory contention suddenly starts, this DNN choice leads to a deadline miss and an energy-budget violation (as the idle period disappears), which causes an accuracy dip. Fortunately, both schemes quickly detect this problem and sense the high variability in the expected latency. ALERT switches to an anytime DNN and a lower power cap. This switch is effective: although the environment is still unstable, the inference accuracy remains high, with slight ups and downs depending on which anytime output finished before the deadline. Only able to choose from traditional DNNs, ALERT-Trad conservatively switches to much simpler, and hence lower-accuracy, DNNs to avoid deadline misses. This switch does eliminate deadline misses under the highly dynamic environment, but many of the conservatively chosen DNNs finish well before the deadline (see the latency panel), wasting the opportunity to produce more accurate results and causing ALERT-Trad to have lower accuracy than ALERT. When the system quiesces, both schemes quickly shift back to the highest-accuracy traditional DNN.

Overall, these results demonstrate how ALERT always makes use of the full potential of the DNN candidate set to optimize performance and satisfy constraints.
ALERT probabilistic design.
A key feature of ALERT is its use of not just mean estimates but also their variance. To evaluate the impact of this design, we compare ALERT to an alternative design, ALERT*, which uses only the estimated mean to select configurations.

Figure 10: Minimize error for sentence prediction @ CPU1, under (a) default contention and (b) memory contention, for the Standard, Traditional-only, and Anytime-only candidate sets (average perplexity; lower is better; whisker: whole range; circle: mean).

Figure 10 shows the performance of ALERT and ALERT* on the minimize-error task for sentence prediction. As we can see, ALERT (blue circles) always performs better than ALERT*. Its advantage is biggest when the DNN candidate set includes both traditional and Anytime DNNs ("Standard" in Figure 10). The reason is that traditional DNNs and Anytime DNNs have different accuracy/latency curves: Eq. 3 for the former and Eq. 13 for the latter. ALERT* is much worse than ALERT at distinguishing these two, as it simply uses the mean of the estimated latency to predict accuracy. ALERT's advantage also shows under memory contention with traditional DNN candidates: since ALERT's estimation better captures dynamic system variation, it clearly outperforms ALERT* there. Overall, these results show that ALERT's probabilistic design is effective.
Sensitivity to latency distribution.
ALERT assumes a Gaussian distribution. However, ALERT is still robust under other distributions, as explained in Section 3.6. As shown in Figure 11, the observed ξs (red bars) are indeed not a perfect fit for the Gaussian distribution (blue lines) in all scenarios, which confirms ALERT's robustness.

Figure 11: Distribution of ξ for image classification on CPU1 (observation vs. estimation, under the Default, Compute, and Memory settings).

Past resource-management systems have used machine learning [4, 52, 69, 70, 80] or control theory [33, 38, 45, 46, 63, 75, 92] to make dynamic decisions and adapt to changing environments or application needs. Some also use a Kalman filter because of its optimal error properties [38, 45, 46, 63]. There are two major differences between these systems and ALERT: (1) prior approaches use the Kalman filter to estimate physical quantities such as CPU utilization [46] or job latency [38], while ALERT estimates a virtual quantity that is then used to update a large number of latency estimates; and (2) while variance is naturally computed as part of the filter, ALERT actually uses it, in addition to the mean, to produce estimates that better account for environment variability.

Past work has designed resource managers that explicitly coordinate approximate applications with system resource usage [22, 32, 33, 47]. Although related, they manage applications separately from system resources, which is fundamentally different from ALERT's holistic design. When an environmental change occurs, prior approaches first adjust the application and then the system serially (or vice versa), so that the change's effects on each can be established independently [32, 33]. That is, coordination is established by forcing one level to lag behind the other. In practice, this design forces each level to keep its own independent model and delays the response to environmental changes. In contrast, ALERT's global slowdown factor allows it to easily model and update predictions about all application and system configurations simultaneously, leading to very fast response times, like the single-input delay demonstrated in Figure 9.

Much work accelerates DNNs through hardware [3, 11–13, 20, 24, 25, 28, 31, 39, 44, 55, 59, 67, 74, 76, 84], compiler [10, 66], system [29, 54], or design support [26, 27, 40, 43, 78, 82, 86]. These techniques essentially shift and extend the tradeoff space, but they do not provide policies for meeting user needs or for navigating tradeoffs dynamically, and hence are orthogonal to ALERT.

Some research supports hard real-time guarantees for DNNs [91], providing 100% timing guarantees while assuming that the DNN model gives the desired accuracy, that the environment is completely predictable, and that energy consumption is not a concern. ALERT provides slightly weaker timing guarantees but manages accuracy and power goals as well, and it provides more flexibility to adapt to unpredictable environments. Hard real-time systems would fail in the co-located scenario unless they explicitly accounted for all possible co-located applications at design time.

This paper demonstrates the challenges behind the important problem of ensuring timely, accurate, and energy-efficient neural-network inference under dynamic input, contention, and requirement variation. ALERT achieves these goals through dynamic, coordinated DNN model selection and power management based on feedback control. We evaluate ALERT with a variety of workloads and DNN models and achieve high performance and energy efficiency.
Acknowledgement
We thank the anonymous reviewers for their helpful feedback and Ken Birman for shepherding this paper. This research is supported by NSF (grants CNS-1956180, CNS-13764039, CCF-1837120, CNS-1764039, CNS-1563956, CNS-1514256, IIS-1546543, CNS-1823032, CCF-1439156), ARO (grant W911NF1920321), DOE (grant DESC0014195 0003), DARPA (grant FA8750-16-2-0004), and the CERES Center for Unstoppable Computing. Additional support comes from the DARPA BRASS program and a DOE Early Career award.
References

[1] Baidu AI. Apollo open vehicle certificate platform. Online document, http://apollo.auto, 2018.
[2] S. Akhlaghi, N. Zhou, and Z. Huang. Adaptive adjustment of noise covariance in Kalman filter for dynamic state estimation. 2017.
[3] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In ISCA, pages 1–13, 2016.
[4] Jason Ansel, Maciej Pacula, Yee Lok Wong, Cy Chan, Marek Olszewski, Una-May O'Reilly, and Saman Amarasinghe. SiblingRivalry: Online autotuning through local competitions. In CASES, 2012.
[5] Anonymous Authors. Prioritized SGD and nested architectures for anytime neural network. In submission; included as supplementary material.
[6] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: Characterization and architectural implications. In PACT, October 2008.
[7] Giorgio C. Buttazzo, Giuseppe Lipari, Luca Abeni, and Marco Caccamo. Soft Real-Time Systems: Predictability vs. Efficiency. Springer, 2006.
[8] Aaron Carroll and Gernot Heiser. Mobile multicores: Use them or waste them. In HotPower, 2013.
[9] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In IISWC, 2009.
[10] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, pages 578–594, 2018.
[11] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine learning. SIGPLAN Not., pages 269–284, 2014.
[12] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.
[13] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. DaDianNao: A machine-learning supercomputer. In MICRO 47, pages 609–622, 2014.
[14] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le. RAPL: Memory power estimation and capping. In ISLPED, 2010.
[15] Christina Delimitrou and Christos Kozyrakis. Paragon: QoS-aware scheduling for heterogeneous datacenters. In ASPLOS, 2013.
[16] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-efficient and QoS-aware cluster management. In ASPLOS, 2014.
[17] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[19] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 2011.
[20] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. ShiDianNao: Shifting vision processing closer to the sensor. In ISCA, pages 92–104, 2015.
[21] Biyi Fang, Xiao Zeng, and Mi Zhang. NestDNN: Resource-aware multi-tenant on-device deep learning for continuous mobile vision. In MobiCom '18, pages 115–127, 2018.
[22] Anne Farrell and Henry Hoffmann. MEANTIME: Achieving both minimal energy and timeliness with approximate computing. In USENIX ATC, 2016.
[23] Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry P. Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In CVPR, page 7, 2017.
[24] Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, and Christos Kozyrakis. DRAF: A low-power DRAM-based reconfigurable acceleration fabric.
ISCA , pages 506–518,2016.[25] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, ArdavanPedram, Mark A Horowitz, and William J Dally. Eie:efficient inference engine on compressed deep neuralnetwork. In
ISCA , pages 243–254, 2016.[26] Song Han, Huizi Mao, and William J Dally. Deep com-pression: Compressing deep neural networks with prun-ing, trained quantization and huffman coding. arXivpreprint arXiv:1510.00149 , 2015.[27] Soheil Hashemi, Nicholas Anthony, Hokchhay Tann,R Iris Bahar, and Sherief Reda. Understanding the im-pact of precision quantization on the accuracy and en-ergy of neural networks. In
DATE , pages 1474–1479,2017.[28] Johann Hauswald, Yiping Kang, Michael A Laurenzano,Quan Chen, Cheng Li, Trevor Mudge, Ronald G Dreslin-ski, Jason Mars, and Lingjia Tang. Djinn and tonic: Dnnas a service and its implications for future warehousescale computers. In
ISCA , pages 27–40, 2015.[29] Johann Hauswald, Michael A Laurenzano, Yunqi Zhang,Cheng Li, Austin Rovinski, Arjun Khurana, Ronald GDreslinski, Trevor Mudge, Vinicius Petrucci, LingjiaTang, et al. Sirius: An open end-to-end voice and visionpersonal assistant and its implications for future ware-house scale computers. In
ASPLOS , pages 223–238,2015.[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and JianSun. Deep residual learning for image recognition. In
CVPR , pages 770–778, 2016.[31] Parker Hill, Animesh Jain, Mason Hill, Babak Zamirai,Chang-Hong Hsu, Michael A Laurenzano, Scott Mahlke,Lingjia Tang, and Jason Mars. Deftnn: Addressing bot-tlenecks for dnn execution on gpus via synapse vectorelimination and near-compute data fission. In
MICRO ,pages 786–799, 2017.[32] Henry Hoffmann. Coadapt: Predictable behavior foraccuracy-aware applications running on power-awaresystems. In
ECRTS , pages 223–232, 2014.[33] Henry Hoffmann. Jouleguard: energy guarantees forapproximate applications. In
SOSP , 2015. [34] Henry Hoffmann and Martina Maggio. PCP: A general-ized approach to optimizing performance under powerconstraints through resource management. In
ICAC ,pages 241–247, 2014.[35] Hanzhang Hu, Debadeepta Dey, Martial Hebert, andJ Andrew Bagnell. Learning anytime predictions inneural networks via adaptive loss balancing. In
AAAI ,2019.[36] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Lau-rens van der Maaten, and Kilian Q. Weinberger. Multi-scale dense convolutional networks for efficient predic-tion. In
CoRR , 2017.[37] C. Imes and H. Hoffmann. Bard: A unified frameworkfor managing soft timing and power constraints. In
SAMOS , pages 31–38, 2016.[38] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann.Poet: a portable approach to minimizing energy undersoft real-time constraints. In
RTAS , pages 75–86, April2015.[39] Animesh Jain, Michael A Laurenzano, Gilles A Pokam,Jason Mars, and Lingjia Tang. Architectural supportfor convolutional neural networks on modern cpus. In
PACT , 2018.[40] Shubham Jain, Swagath Venkataramani, VijayalakshmiSrinivasan, Jungwook Choi, Pierce Chuang, and Le-land Chang. Compensated-dnn: energy efficient low-precision deep neural networks by compensating quan-tization errors. In
DAC , pages 1–6, 2018.[41] Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik,Siddhartha Sen, and Ion Stoica. Chameleon: scalableadaptation of video analytics. In
ACM SIGCOMM ,pages 253–266, 2018.[42] Norman P. Jouppi, Cliff Young, Nishant Patil, DavidPatterson, Gaurav Agrawal, Raminder Bajwa, SarahBates, Suresh Bhatia, Nan Boden, Al Borchers, RickBoyle, Pierre-luc Cantin, Clifford Chao, Chris Clark,Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean,Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Got-tipati, William Gulland, Robert Hagmann, C. RichardHo, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt,Julian Ibarz, Aaron Jaffey, Alek Jaworski, AlexanderKaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch,Naveen Kumar, Steve Lacy, James Laudon, James Law,Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke,Alan Lundin, Gordon MacKean, Adriana Maggiore,Maire Mahony, Kieran Miller, Rahul Nagarajan, RaviNarayanaswami, Ray Ni, Kathy Nix, Thomas Norrie,Mark Omernick, Narayana Penukonda, Andy Phelps,15onathan Ross, Matt Ross, Amir Salek, Emad Samadi-ani, Chris Severn, Gregory Sizikov, Matthew Snelham,Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan,Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle,Vijay Vasudevan, Richard Walter, Walter Wang, EricWilcox, and Doe Hyun Yoon. In-datacenter performanceanalysis of a tensor processing unit. In
ISCA , 2017.[43] Patrick Judd, Jorge Albericio, Tayler Hetherington,Tor M Aamodt, Natalie Enright Jerger, and AndreasMoshovos. Proteus: Exploiting numerical precisionvariability in deep neural networks. In
ICS , page 23,2016.[44] Patrick Judd, Jorge Albericio, Tayler Hetherington,Tor M Aamodt, and Andreas Moshovos. Stripes: Bit-serial deep neural network computing. In
MICRO , pages1–12, 2016.[45] Evangelia Kalyvianaki, Themistoklis Charalambous,and Steven Hand. Self-adaptive and self-configuredcpu resource provisioning for virtualized servers usingkalman filters. In
ICAC , 2009.[46] Evangelia Kalyvianaki, Themistoklis Charalambous,and Steven Hand. Adaptive resource provisioning forvirtualized servers using kalman filters.
TAAS , 2014.[47] Aman Kansal, Scott Saponas, AJ Brush, Kathryn SMcKinley, Todd Mytkowicz, and Ryder Ziola. The la-tency, accuracy, and battery (lab) abstraction: program-mer productivity and energy efficiency for continuousmobile context sensing. In
OOPSLA , 2013.[48] Shinpei Kato, Shota Tokunaga, Yuya Maruyama, SeiyaMaeda, Manato Hirabayashi, Yuki Kitsukawa, AbrahamMonrroy, Tomohito Ando, Yusuke Fujii, and TakuyaAzumi. Autoware on board: Enabling autonomous vehi-cles with embedded systems. In
ICCPS , pages 287–296,2018.[49] D. H. K. Kim, C. Imes, and H. Hoffmann. Racing andpacing to idle: Theoretical and empirical analysis ofenergy optimization heuristics. In
ICCPS , 2015.[50] Gustav Larsson, Michael Maire, and GregoryShakhnarovich. Fractalnet: Ultra-deep neural networkswithout residuals. arXiv preprint arXiv:1605.07648 ,2016.[51] Etienne Le Sueur and Gernot Heiser. Slow down orsleep, that is the question. In
USENIX ATC , June 2011.[52] Benjamin C Lee and David Brooks. Efficiency trendsand limits from comprehensive microarchitectural adap-tivity.
ASPLOS , 2008. [53] Hankook Lee and Jinwoo Shin. Anytime neural pre-diction via slicing networks vertically. arXiv preprintarXiv:1807.02609 , 2018.[54] Shih-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, MattSkach, Md E Haque, Lingjia Tang, and Jason Mars. Thearchitectural implications of autonomous driving: Con-straints and acceleration. In
ASPLOS , pages 751–766,2018.[55] Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou,Shengyuan Zhou, Olivier Teman, Xiaobing Feng, Xue-hai Zhou, and Yunji Chen. Pudiannao: A polyvalentmachine learning accelerator. In
ISCA , pages 369–381,2015.[56] Jun S Liu and Rong Chen. Sequential monte carlomethods for dynamic systems.
Journal of the Americanstatistical association , 1998.[57] ATLAS LS. What is simultaneous/conference inter-pretation? Online document, https://atlasls.com/what-is-simultaneousconference-interpretation/ ,2010.[58] Martina Maggio, Alessandro Vittorio Papadopoulos, An-tonio Filieri, and Henry Hoffmann. Automated controlof multiple software goals using multiple actuators. In
FSE , 2017.[59] Divya Mahajan, Jongse Park, Emmanuel Amaro, HardikSharma, Amir Yazdanbakhsh, Joon Kyung Kim, andHadi Esmaeilzadeh. Tabla: A unified template-basedframework for accelerating statistical machine learning.In
HPCA , pages 14–26. IEEE, 2016.[60] Mitchell P. Marcus, Beatrice Santorini, Mary AnnMarcinkiewicz, and Ann Taylor. Treebank-3 - lin-guistic data consortium. Online document, https://catalog.ldc.upenn.edu/LDC99T42 , 1999.[61] John D McCalpin. Memory bandwidth and machinebalance in current high performance computers.
TCCA ,1995.[62] Mason McGill and Pietro Perona. Deciding how todecide: Dynamic routing in artificial neural networks. arXiv preprint arXiv:1703.06217 , 2017.[63] Nikita Mishra, Connor Imes, John D. Lafferty, andHenry Hoffmann. CALOREE: learning control for pre-dictable latency and low energy. In
ASPLOS , 2018.[64] Nikita Mishra, Huazhe Zhang, John D. Lafferty, andHenry Hoffmann. A probabilistic graphical model-based approach for minimizing energy under perfor-mance constraints.
ASPLOS , 2015.1665] Jakob Nielsen.
Usability engineering . Elsevier, 1994.[66] NVIDIA. Nvidia tensorrt: Programmable inferenceaccelerator. Online document, https://developer.nvidia.com/tensorrt , 2018.[67] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim,Jeremy Fowers, Karin Strauss, and Eric S Chung. Ac-celerating deep convolutional neural networks usingspecialized hardware.
Microsoft Research Whitepaper ,2015.[68] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana.Deepxplore: Automated whitebox testing of deep learn-ing systems. In
SOSP , 2017.[69] Paula Petrica, Adam M Izraelevitz, David H Albonesi,and Christine A Shoemaker. Flicker: A dynamicallyadaptive architecture for power limited multicore sys-tems. In
ISCA , 2013.[70] Dmitry Ponomarev, Gurhan Kucuk, and Kanad Ghose.Reducing power requirements of instruction schedul-ing through dynamic allocation of multiple datapathresources. In
MICRO , 2001.[71] Amir M. Rahmani, Bryan Donyanavard, Tiago Mück,Kasra Moazzemi, Axel Jantsch, Onur Mutlu, andNikil D. Dutt. SPECTR: formal supervisory controland coordination for many-core systems resource man-agement. In
ASPLOS , pages 169–183, 2018.[72] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev,and Percy Liang. Squad: 100,000+ questions formachine comprehension of text. arXiv preprintarXiv:1606.05250 , 2016.[73] S. Reda, R. Cochran, and A. K. Coskun. Adaptive powercapping for servers with multithreaded workloads.
IEEEMicro , 2012.[74] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Ar-slan Zulfiqar, and Stephen W Keckler. vdnn: Virtualizeddeep neural networks for scalable, memory-efficient neu-ral network design. In
MICRO , page 18, 2016.[75] Muhammad Husni Santriaji and Henry Hoffmann.Grape: Minimizing energy for gpu applications withperformance requirements. In
MICRO , 2016.[76] Hardik Sharma, Jongse Park, Divya Mahajan, Em-manuel Amaro, Joon Kyung Kim, Chenkai Shao, AsitMishra, and Hadi Esmaeilzadeh. From high-level deepneural models to fpgas. In
MICRO , page 17, 2016.[77] N Silberman and Guadarrama. S. Tensorflow-slimimage classification model library. Online doc-ument, https://github.com/tensorflow/models/tree/master/research/slim , 2016. [78] Hyeonuk Sim, Saken Kenzhegulov, and JongeunLee. Dps: dynamic precision scaling for stochas-tic computing-based deep neural networks. In
DAC ,page 13, 2018.[79] Karen Simonyan and Andrew Zisserman. Very deep con-volutional networks for large-scale image recognition.In
ICLR , 2015.[80] Srinath Sridharan, Gagan Gupta, and Gurindar S Sohi.Holistic run-time parallelism management for time andenergy efficiency. In
ICS , 2013.[81] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang,Marta Kwiatkowska, and Daniel Kroening. Concolictesting for deep neural networks. In
ASE , 2018.[82] Hokchhay Tann, Soheil Hashemi, R Iris Bahar, andSherief Reda. Hardware-software codesign of accurate,multiplier-free deep neural networks. In
DAC , 2017.[83] Surat Teerapittayanon, Bradley McDanel, and H.T.Kung. Branchynet: Fast inference via early exiting fromdeep neural networks. In
CVPR , 2016.[84] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao.Improving the speed of neural networks on cpus. In
Proc. Deep Learning and Unsupervised Feature Learn-ing NIPS Workshop , page 4, 2011.[85] Andreas Veit and Serge Belongie. Convolutional net-works with adaptive inference graphs. In
ECCV , 2018.[86] Swagath Venkataramani, Ashish Ranjan, Kaushik Roy,and Anand Raghunathan. Axnn: energy-efficient neu-romorphic systems using approximate computing. In
ISLPED , 2014.[87] Yan Wang, Zihang Lai, Gao Huang, Brian H Wang, Lau-rens van der Maaten, Mark Campbell, and Kilian QWeinberger. Anytime stereo image depth estimationon mobile devices. arXiv preprint arXiv:1810.11408 ,2018.[88] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, StevenRennie, Larry S Davis, Kristen Grauman, and RogerioFeris. Blockdrop: Dynamic inference paths in residualnetworks. In
CVPR , pages 8817–8826, 2018.[89] Huazhe Zhang and Henry Hoffmann. Maximizing per-formance under a power cap: A comparison of hardware,software, and hybrid techniques. In
ASPLOS , 2016.[90] Minjia Zhang, Samyam Rajbhandari, Wenhan Wang,and Yuxiong He. Deepcpu: Serving rnn-based deeplearning models 10x faster. In
ATC , pages 951–965,2018.1791] H. Zhou, S. Bateni, and C. Liu. S3dnn: Supervisedstreaming and scheduling for gpu-accelerated real-timednn workloads. In
RTAS , 2018. [92] Yanqi Zhou, Henry Hoffmann, and David Wentzlaff.Cash: Supporting iaas customers with a sub-core config-urable architecture. In