MDInference: Balancing Inference Accuracy and Latency for Mobile Applications
Samuel S. Ogden
Worcester Polytechnic Institute [email protected]
Tian Guo
Worcester Polytechnic Institute [email protected]
Abstract—Deep Neural Networks are allowing mobile devices to incorporate a wide range of features into user applications. However, the computational complexity of these models makes it difficult to run them effectively on resource-constrained mobile devices. Prior work approached the problem of supporting deep learning in mobile applications by either decreasing model complexity or utilizing powerful cloud servers. These approaches each focus on only a single aspect of mobile inference and thus they often sacrifice overall performance.

In this work we introduce a holistic approach to designing mobile deep inference frameworks. We first identify the key goals of accuracy and latency for mobile deep inference and the conditions that must be met to achieve them. We demonstrate our holistic approach through the design of a hypothetical framework called MDInference. This framework leverages two complementary techniques: a model selection algorithm that chooses from a set of cloud-based deep learning models to improve inference accuracy, and an on-device request duplication mechanism to bound latency. Through empirically-driven simulations we show that MDInference improves aggregate accuracy over static approaches by over 40% without incurring SLA violations. Additionally, we show that with a target latency of 250ms, MDInference increased the aggregate accuracy in 99.74% of cases on faster university networks and 96.84% of cases on residential networks.
Index Terms —Mobile deep learning, performance
I. INTRODUCTION
Deep learning on mobile devices is allowing a wide range of new features, such as virtual personal assistants [1], [2], visual text translation [3] and facial filters [4], to become commonplace in mobile applications. These diverse functionalities are made possible by recent advances in machine learning models called deep neural networks (DNNs), which on some tasks can approach human-level accuracy [5].

However, DNNs achieve this high accuracy with high computational complexity [6], leading to high latency, especially when running on mobile devices [7]. This forces a trade-off between model accuracy and model execution latency. Modern frameworks such as TensorFlow allow for on-device execution, in-cloud execution, or some hybrid of the two, introducing a wide range of choices for this accuracy-latency trade-off. Each of these three approaches has strengths but introduces additional drawbacks.
On-device inference executes inferences entirely on the mobile device with easy-to-predict latency, but the mobile developer has to choose between high execution latency or using lower-accuracy models.
In-cloud inference can execute high-accuracy models with low latency, but the reliance on network communication means unpredictable, and potentially unacceptably long, overall response time [8].
Hybrid inference spreads execution between the mobile device and the cloud, allowing for potential reductions in latency, but can result in worse latency and lower accuracy than purely on-device or in-cloud approaches.

In this paper we argue the need for mobile-oriented inference frameworks. We discuss the pros and cons of existing approaches and pinpoint the potential areas for improvement. We propose a holistic approach that considers mobile-specific factors when designing mobile inference frameworks. Finally, we demonstrate our approach through the design of a hypothetical framework called MDInference, aiming to increase aggregate accuracy, defined as the average accuracy over all models used to service requests, while bounding latency for mobile inference requests. This is enabled by both utilizing a network-aware model selection algorithm to dynamically choose high-accuracy models that can execute within a target response time, and duplicating requests to ensure a bounded-latency response.

Instead of approaching the design of mobile inference frameworks as a static problem, where a single model is used and network time is disregarded, we take a runtime approach to mobile inference with a two-pronged design.
First, by selecting the most accurate model for in-cloud inference based on the network delay, we increase accuracy within an overall latency target.
Second, by duplicating inference execution on-device using a low-latency model, we can meet the latency target regardless of network connectivity and delay. In short, by dynamically selecting a model while running inference both in-cloud and on-device, we improve accuracy while providing latency guarantees for mobile applications.

Our three main contributions are:
• We introduce a new mobile-oriented approach to designing deep inference frameworks that focuses on the specific goals and constraints of mobile devices, such as network condition variation. Making frameworks aware of these constraints will allow them to improve aggregate accuracy without sacrificing latency.
• We designed a hypothetical framework, MDInference, that demonstrates the ability of this mobile-oriented approach to improve the aggregate accuracy of inferences while meeting latency targets. Our evaluation shows MDInference
Fig. 1: Comparison of different mobile inference techniques. (a) On-device inference runs models on-device in a resource-constrained environment; mobile-optimized models have lower latency but also lower accuracy. (b) Cloud-based inference allows for complex models but requires network transfer prior to inference execution, adding unpredictable network delay. (c) Hybrid inference shares execution between the mobile device and a cloud server, relying on both being available, to decrease latency. (d) MDInference uses runtime model selection and request duplication to select accurate cloud-based models and an on-device model to guarantee latency.
achieved its target latency in 23% more cases than in-cloud approaches and increased aggregate accuracy by over 39% compared to purely on-device approaches.
• We developed and integrated two algorithms to make MDInference mobile-aware. These algorithms opportunistically increase the aggregate accuracy of inferences and ensure that there are no SLA violations, improving user experience.

The remainder of this paper is structured as follows. In Section II we introduce a number of existing approaches to mobile inference frameworks. The problem of mobile deep inference is formalized in Section III. Section IV discusses the key advantages of existing approaches, and Section V describes how we design a hypothetical framework which we call MDInference. An evaluation of the techniques implemented in MDInference is presented in Section VI and a discussion of future directions is conducted in Section VII.

II. BACKGROUND AND MOTIVATION
Deep neural networks have become increasingly popular for embedding novel features into mobile applications. Two common forms of deep learning, convolutional neural networks (CNNs) and recurrent neural networks (RNNs), excel at image processing and speech-to-text, respectively. This allows for the addition of features such as Optical Character Recognition (OCR) [3] and virtual assistants [1], [2] to mobile applications.

State-of-the-art DNNs, with their accuracy-driven design, can contain tens of millions of parameters and hundreds of layers, and are therefore both computationally- and memory-intensive [6], [9]–[12]. To leverage these deep learning models, devices first need to preprocess the input data and load the models into memory. Once these models are loaded into memory, executing them requires large matrix multiplication operations with many millions, and often billions, of floating point operations [6], [10], [13]. Therefore, while these models can add rich functionality to mobile applications, actually leveraging them on mobile devices is difficult due to resource constraints [12].

Further, mobile devices must balance an extreme range of otherwise common operating issues. First, mobile devices experience a wide range of network conditions, both in terms of connection quality and speed. They can be without a network connection for days or switch
Fig. 2: DNN execution latency on a range of mobile devices.
We measure execution time for 21 pretrained models [7] via the TensorFlow Lite framework [14] running on three devices (Nexus 5, Moto X4, and Pixel 2). The circle size corresponds to the standard deviation of the inference latency. We observe that high-accuracy models take multiple seconds to run on all devices and that newer devices, such as the Pixel 2, can support more models than older devices.

between high-speed WiFi and a cellular connection within the same minute. Second, they are inherently resource constrained, as devices must be small and efficient enough to be carried by end-users throughout daily life. Finally, mobile applications are inevitably user-facing, compelling them to adhere to strict latency goals to improve user experience.

Three main approaches to enabling mobile deep learning are depicted in Figure 1 and discussed below.
A. On-device Inference
On-device inference refers to running DNNs on mobile devices, as illustrated in Figure 1(a), and is supported by frameworks such as Caffe2 [15] and TensorFlow Lite [14]. These frameworks often use models that have been trained on powerful servers and exported to a format optimized for mobile devices. Mobile-oriented optimizations to decrease the latency of on-device execution often aim to reduce the complexity of the models themselves [6], [13].

In Figure 2 we show the execution latency of 21 pretrained CNN models [7]. While many of the models that have been optimized for mobile devices completed execution in under 250ms, these models have lower accuracy. Higher-accuracy models often take much longer to run, even on devices with specialized hardware such as the Pixel 2 [16]. Further, we observe that the lower-accuracy models show a distinct range of latencies, with latency increasing exponentially with accuracy, leading to the highest-accuracy models having multi-second latency on all tested devices.

Further, even mobile-oriented models can still be orders of magnitude slower than running on dedicated servers with accelerators. The inference latency can be exacerbated when an application needs to load multiple models, such as chaining the execution of an OCR model and a text translation model [3], or requires higher accuracy.

In summary, even though on-device inference is a plausible choice for simple tasks and newer mobile devices, it is less suitable for complex tasks or older mobile devices.
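To make the measurement setup behind Figure 2 concrete, the following is a minimal sketch of how per-model on-device latency can be collected with the TensorFlow Lite Python interpreter. The model filename, the random stand-in input, and the run count are our own illustrative assumptions, not details from the paper.

```python
# Sketch: measure mean and std. dev. of inference latency for one .tflite model.
import time
import numpy as np
import tensorflow as tf

def measure_latency(model_path: str, runs: int = 50):
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    # Random data stands in for a real preprocessed image (float32 model assumed).
    dummy = np.random.random_sample(tuple(inp["shape"])).astype(np.float32)

    samples = []
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], dummy)
        start = time.perf_counter()
        interpreter.invoke()
        samples.append((time.perf_counter() - start) * 1000.0)  # milliseconds
    return float(np.mean(samples)), float(np.std(samples))

mu, sigma = measure_latency("mobilenet_v1_0.25_128.tflite")  # hypothetical file
print(f"inference: {mu:.1f}ms +/- {sigma:.1f}ms")
```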
B. In-cloud Inference
In-cloud inference, as illustrated in Figure 1(b), executes models on remote servers instead of on-device. Cloud-based servers, especially those with access to powerful accelerators such as GPUs, can execute models orders of magnitude faster than mobile devices. For example, execution of the
NasNet Large model takes over 5 seconds on all of our mobile devices but takes only 113ms on a server with a GPU (details in Table III). By leveraging this decrease in latency, in-cloud inference can decrease overall response time, even while using more accurate models. However, transferring the input data to the cloud-based servers can incur long and unpredictable network time [8], [17].

Model serving systems [18]–[21] allow mobile applications to leverage these cloud-based frameworks, often through REST APIs. However, many such systems require mobile developers to manually specify the exact model to use through the exposed API endpoints. These frameworks fail to consider the impact of dynamic mobile network conditions, which can take up a significant portion of end-to-end inference time [17], [22]. Moreover, such static development-time decisions can lead to using high-accuracy models whose high execution latency may be compounded by unexpectedly long network transfer time.

In summary, cloud-based inference has the potential to support many application scenarios, simple and complex, for heterogeneous mobile devices. However, current mobile-agnostic serving platforms fall short by not automatically adapting inference choices based on mobile constraints.
C. Hybrid Inference
Hybrid inference spreads the execution of models across both the mobile device and a cloud-based server, as shown in Figure 1(c). By splitting execution between two locations, hybrid inference allows decisions to be made at runtime to reduce overall response latency.

The division of model execution between the mobile device and the remote server is done by identifying partition points in models where intermediate data can be efficiently transferred from the mobile device to the remote server [23]. Executing the first layers of a model on-device and then the rest of the model on a remote server allows for transferring less data across the network. However, if the network is unavailable
TABLE I: Symbols used throughout this paper.

Symbol | Meaning
T_sla | Response time SLA
T_budget | Time allowed for model execution
T_nw | Estimated round-trip network time
M | A set of available models
A(m) | Accuracy of a model m
µ(m), σ(m) | Average and standard deviation of model execution time for model m

the entire model can be executed locally, but an unpredictable network can lead to an increase in latency.

To remove this reliance on the network, each segment of the model execution can calculate a confidence in its response [24], [25], where a high confidence results in using the current response. If the confidence is too low on-device, the intermediate data can be sent for remote inference. This decreases the reliance on the network but potentially decreases accuracy.

In addition, since hybrid inference relies on continuing execution on the remote server, this server has to host the same model as was used on the mobile device. The need to accommodate the possibility of no network connection thus limits the models that can be used for hybrid execution.

In summary, hybrid inference allows for decreasing latency by partitioning the inference model and selecting where, and whether, each of the pieces should be executed. Network constraints may still lead to longer latency, and the fixed model limits the ability to improve accuracy through model choice.

III. PROBLEM STATEMENT
We target the problem of designing deep inference frameworks for mobile applications. The core aspect of this problem is that the mobile device can have a variable, or nonexistent, network connection while requesting inferences. Additionally, an application developer has access to a set of models M that exhibit a range of different accuracies and latencies [7], [26] for the same task. Therefore, the problem is how to enable high-accuracy inference results for mobile devices within a given target latency. Concretely, for a mobile device requesting an inference within a target latency T_sla, we want to select an inference model m ∈ M that maximizes accuracy and returns results within T_sla. All symbols used throughout this paper are listed in Table I.

We consider two main metrics that measure the quality of solutions to this problem. First is Service Level Agreement (SLA) attainment, which is measured as the number of requests that return results within the specified response time target. The goal for a mobile-oriented framework is to return all results within a given SLA.
Second is aggregate accuracy, which is the average accuracy of all models used by the framework. For example, if three inference requests are serviced by models with 40%, 60% and 60% accuracy, then the framework's aggregate accuracy is 53.3%. The goal of any framework is to maximize its aggregate accuracy.

System model and assumptions: We assume our mobile device is resource constrained and can only run a single on-device model. Further, we assume the mobile device may have access to an in-cloud server hosting a set of functionally-equivalent models, but transferring input data can take a variable network time T_nw. We call a set of distinct models functionally-equivalent if they all perform the same task, such as image classification. We further assume this network time can be calculated or estimated through a number of methods such as time synchronization, direct measurement, or network modeling [8].

Our hypothetical system is designed specifically for CNNs performing image classification tasks. We assume that any required preprocessing is completed on the mobile device and is not directly considered as part of the response time. We also assume that each request has an appropriate T_sla, representing the target request-response latency.

IV. MOBILE INFERENCE FRAMEWORK DESIGNS
Mobile-oriented inference frameworks have a number of unique goals and constraints, which we discuss next. These represent a number of opportunities that we discuss in Section IV-B.
A. Mobile-Aware Framework Design Goals and Challenges
As more mobile applications leverage DNNs, it is becoming critical that inference frameworks be aware of the special demands of these applications. Existing approaches focus on optimizing for single goals, such as latency on mobile devices or inference server throughput, while ignoring mobile-specific needs. As an example, the
NasNet Mobile model was designed to provide high-accuracy inference on mobile devices. On a Pixel 2 phone this model ran in 236ms, but on other tested devices it took up to 2.5X longer.
Goals for a mobile-specific inference framework:
A mobile inference framework needs to dynamically balance two design goals: latency and accuracy. This need is driven by dynamic mobile environments and network connections, and by the inherent heterogeneity of devices.

Latency is the time required to return an inference response to the mobile end-user. Keeping this metric low and consistent is important for mobile applications, which are inherently user-facing. Response times that are particularly long relative to the average will be obvious to users.

Accuracy is the ability of a model to return the correct response for input data, often reported for image classification models as the top-1 accuracy. This describes the model's average likelihood of correctly classifying input images. In complex use cases accuracy is especially important.
Challenges for mobile-oriented frameworks:
An ideal mobile inference framework should allow both goals to be optimized by balancing them. To do this it has to be aware of three major constraints, which we introduce below and summarize in Table II.
First, mobile devices experience a wide range of network conditions that can lead to large variations in the latency of transferring input data for remote inference. Frameworks
TABLE II: Different mobile inference approaches and their goals and awareness. The three approaches discussed each have different optimization goals. On-device inference relies on an awareness of available resources to optimize for inference latency. In-cloud inference has the goal of increasing the throughput of inference servers for the most accurate models, showing attention to SLA but ignoring the network. With hybrid approaches, the goals and awareness lie on a spectrum. Typically frameworks are aware of a subset of the various factors, but no single approach is aware of all three. MDInference is aware of all three factors to achieve reliable latency while increasing accuracy when possible.

Approach | Goal: Accuracy | Goal: Latency | Awareness: Network | Awareness: Resource | Awareness: SLA
On-Device | ✗ | ✓ | – | ✓ | ✓
In-Cloud | ✓ | ✗ | ✗ | – | ✓
Hybrid | ✓/✗ | ✓/✗ | ✓ | ✓ | –
MDInference | ✓ | ✓ | ✓ | ✓ | ✓

that perform remote inference should be aware of this variation and be able to adapt their inference decisions to minimize its impact.
Second, mobile devices are inherently resource constrained, making on-device inference difficult; this is exacerbated by device heterogeneity. A mobile-aware inference framework should reduce its reliance on on-device inference, as these constraints are device-specific and may force each device to use a different low-accuracy model.
Finally, mobile applications are user-facing and thus generally very sensitive to response time. Therefore any framework providing mobile devices with inference services should be able to provide results within a reasonable time, often defined by its latency SLA.
B. Inference Serving Opportunities
The existing approaches that mobile deep inference frameworks take introduce a number of potential opportunities.
On-device inference aims to ensure that mobile users can always run inference, but at decreased accuracy.
By decreasing the complexity of deep learning models it is possible to run inference directly on the mobile device within a reasonable latency. This ensures that, regardless of network connectivity, mobile users can obtain inference results. One example of this is
MobileNets [13], which, by tuning the number of parameters within the model prior to training, allows for a smooth trade-off curve between latency and accuracy based on the same model architecture.

The main drawback of on-device inference is that decreased latency is achieved by sacrificing inference accuracy. In the case of MobileNets, this can mean decreasing the top-1 accuracy by 29.6% (comparing the accuracy of the fastest and most accurate variations [7]). The problem of trading accuracy for latency is further compounded by the need to make such decisions prior to training. In particular, doing so at development time means an application either relies on a single model across all devices or needs to select the optimal model per device, which is challenging given the wide range of devices and models.
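As a concrete illustration of this development-time decision, the sketch below instantiates MobileNet at several width multipliers using the Keras application API; the parameter count shrinks with alpha, which is exactly the knob that must be fixed before training. The input shape and the decision to skip pretrained weights are our choices for illustration.

```python
# Sketch: MobileNet's width multiplier (alpha) is fixed at model-construction
# time, so the latency/accuracy trade-off cannot be revisited at runtime.
import tensorflow as tf

for alpha in (0.25, 0.5, 0.75, 1.0):
    model = tf.keras.applications.MobileNet(
        input_shape=(224, 224, 3), alpha=alpha, weights=None)
    print(f"alpha={alpha}: {model.count_params():,} parameters")
```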
In-cloud inference allows high-accuracy models to be run with low latency but neglects the needs of mobile applications. By leveraging hardware accelerators such as GPUs, cloud-based inference servers can greatly reduce latency and improve serving throughput even with complex high-accuracy models [18], [19], [27]. As an example, we observed that the time to execute the
NasNet Large model (82.6% accuracy) in the cloud was faster than running inference requests with the
MobileNetV1 160 1.0 model (68.0% accuracy) on the fastest mobile device in our experiments (for details see Figure 2 and Table III). Cloud-based serving allows not only for high-accuracy inferences with low execution latency, but also opens up opportunities to serve inference requests with functionally-equivalent models that exhibit different latency-accuracy trade-offs.

The drawback of in-cloud inference frameworks is that they are mobile-agnostic and typically oriented towards service providers. This has two impacts. First, cloud-based servers aim to achieve a service level objective considering only on-server time, excluding the network latency of the input data [18], [28]. Due to this, poor mobile network connections can result in poor mobile performance [17]. Second, optimizations for throughput, such as batching, lead to an increase in the latency of individual requests [11], [18].
Hybrid inference spreads execution across multiple locations, allowing for decreased latency but at the cost of relying on the availability of both locations.
Spreading inference across multiple devices allows for a decrease in the amount of data transmitted across the network [23] or for exiting early from execution when confidence in the intermediate result is above a threshold [25]. As a result, frameworks that support hybrid inference have the flexibility to selectively improve inference performance by carefully spreading the model across different locations.

However, this requires both that intermediate data be transferred between locations and that the intermediate data be usable in both locations, leading to the same model being executed in both locations. In the case that network transfer of intermediate data is prohibitive, the model must be executed entirely on-device. For complex models this leads to unacceptable latency, and simple models fail to benefit from remote execution. Therefore, hybrid frameworks share the limitations of on-device frameworks, in that the model used must be selected during development, and of in-cloud frameworks, with their sensitivity to the network.

V. MDINFERENCE FRAMEWORK DESIGN
The key insight of MDInference is that we can leverage a set of cloud-based functionally-equivalent models to improve accuracy. In addition, duplication of inference requests [29], [30] allows us to bound latency. For each inference, MDInference submits a request to a remote server that dynamically selects an accurate model, and at the same time executes a low-latency model on-device to ensure results will be available within the SLA. This allows for increased accuracy and reliable latency.

MDInference combines the advantages of existing approaches in order to improve end-user performance. By dynamically selecting cloud-based models based on network information, we can opportunistically use higher-accuracy models and improve the aggregate accuracy. Additionally, MDInference can further improve the aggregate accuracy by using a more accurate on-device model, although this can impact the minimum achievable SLA. This combination of local and remote inference allows MDInference to provide reliable latency and improved aggregate accuracy.

MDInference consists of two components. First, a cloud-based server selects, from a number of functionally-equivalent models, one that can complete within a specific SLA by estimating the time consumed transferring input data. This algorithm is detailed in Section V-A. Second, a local inference is run on-device to ensure that results are available within the target SLA. The combination of these two components ensures that inference output is available within the SLA, possibly with improved aggregate accuracy from the cloud-based component. We discuss the implications of duplicating inference requests in Section V-B.
A. Model Selection Algorithm
MDInference's model selection algorithm is designed to manage a set of functionally-equivalent CNN models and pick the most accurate model that can return results within the specified SLA. It is designed to take advantage of the low variability of model execution latency to not only mitigate the impact of variations in the mobile network latency but opportunistically use them to improve accuracy. The key insight of our model selection algorithm is that variations in transfer latency for an inference request can be compensated for with the appropriate choice of inference model. As functionally-equivalent models each have different execution times and accuracies, MDInference determines which CNN model to execute by explicitly making inference latency and model accuracy trade-offs.

MDInference works by selecting the most accurate model whose execution time is low enough to return results to the end-user within the SLA. It accomplishes this by first calculating the request's time budget as the difference between the SLA and the estimated network time, that is, T_budget = T_sla − T_nw, where T_nw, referred to as network time, denotes the time to transfer the inference request and to return the result. Consequently, T_nw can be estimated conservatively as T_nw = 2 × T_input, where T_input is the time to transfer the data to the remote server. Estimating T_nw using T_input is straightforward, as T_input can be measured by the server prior to inference execution. Further, such estimation is reasonable for application scenarios such as image recognition or image-to-text translation. These applications often need to send more data to the cloud (i.e., input data), which leads to T_input ≥ T_output, the time to return results. For other application scenarios, such as speech recognition, where output data size is often larger, one could leverage past observations of T_output and estimate T_nw = 2 × T_output instead. Using this time budget we can then identify the set of models, M_E, that can complete execution within the request time budget T_budget.

The basic approach described above assumes that the previously measured execution times and accuracies of models stay the same. However, these two assumptions do not always hold, leading to a need to expand the basic concept of model selection to probabilistically select models. Real-world serving systems [18], [19] often experience queuing delay or workload spikes [31], leading to transient increases in latency. Additionally, accuracy is affected over time by concept drift [32]. To handle these changes in latency and accuracy, the model selection algorithm probabilistically selects models, thus exploring available models that might have been previously disregarded due to transient issues. We do this by selecting a model using a weighted probability based on the model's accuracy and its latency relative to T_budget. We implement this probabilistic approach via the three-stage algorithm described below.

Stage one: greedily picking the baseline model.
In this stage, MDInference takes all the existing models and selects a base model m_b as follows:

maximize_{m ∈ M} A(m)    (1)
subject to µ(m) + σ(m) < T_budget    (2)

To find the base model we first consider all models that have an expected inference time less than the time budget and use the most accurate of these models, making it likely that the model will finish within the calculated budget. We use this model m_b as our base model. In the case that no models satisfy the time budget constraint, the fastest available model is chosen as the base model and execution begins.

Stage two: optimistically constructing the eligible model set.
In order to account for unexpected performance variations, such as queueing delays or accuracy variations, the probabilistic model selection algorithm expands around the base model to form an exploration set, M_E. This exploration set represents models that are similar to the base model in terms of execution time. Specifically, we construct the exploration set as

M_E = { m | µ(m) ∈ [µ(m_b) − σ(m_b), µ(m_b) + σ(m_b)] }    (3)

which is the set of all models that have an average execution time within the typical execution time of the base model. It is important to note that M_E may include models that violate the latency variation constraint imposed on the base model. This is accounted for in stage three.

Stage three: opportunistically selecting the inference model.
From the exploration set M_E we select a model m′ to balance the risk of SLA violations against the exploration reward. Concretely, we calculate a utility for each model, U(m), based on its inference accuracy and its likelihood of violating the response time SLA:

U(m) = A(m) · (T_budget − (µ(m) + σ(m))) / |T_budget − µ(m)|    (4)

MDInference then normalizes these utilities to calculate the selection probability as Pr(m) = U(m) / Σ_{n ∈ M_E} U(n), and picks m′ based on its probability. This helps avoid choosing models with lower inference accuracy, wider inference time distributions, and outdated performance profiles while still exploring the set of potentially eligible models.

TABLE III: Summaries of model statistics through empirical measurement. Models are sorted based on their reported top-1 accuracy, which is defined as the percentage of correctly labeled test images using only the most probable label. We measure the average inference time and standard deviation for each model running via TensorFlow on an AWS p2.xlarge GPU server. We used these models in simulations to study MDInference's effectiveness in trading off aggregate accuracy and latency. Note, NasNet Fictional is a copy of NasNet with lower accuracy, used in Section VI-C.

Model Name | Top-1 Accuracy (%) | Inference Avg. (ms) | Inference Std. (ms)
SqueezeNet | | |
MobileNetV1 0.25 | | |
MobileNetV1 0.5 | | |
DenseNet | | |
MobileNetV1 0.75 | | |
MobileNetV1 1.0 | | |
NasNet Mobile | | |
InceptionResNetV2 | | |
InceptionV3 | | |
InceptionV4 | | |
NasNet Large | | |
NasNet Fictional* | 50 | 112.61 | 0.36
B. Request Duplication
To ensure that all requests can be serviced within the SLA, MDInference duplicates requests to bound their tail response latency. As discussed in Section II-A, many mobile-oriented models can produce results on-device within a reasonable time limit, but with lower accuracy.

When an inference is initiated, two requests are generated by the MDInference framework. The first is sent to a remote inference server that executes the model selection algorithm outlined previously. While this cloud request aims to return results within the SLA, it is not guaranteed to do so. Therefore, the inference request is duplicated and executed locally using the on-device model. In MDInference we chose the fastest available model to use on-device, supporting SLAs as low as 50ms, although any model that satisfies the SLA goal can be used.

There are two potential outcomes of duplication. In the first, the SLA expires without the remote inference request having returned results, in which case MDInference uses the results of the on-device model. In our experiments this occurred in only 3.16% of cases, as we will see in Section VI-D. In the second, the remote inference response arrives before the SLA expires and the remote results are used.
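A minimal sketch of this duplication flow follows, assuming a thread pool on the device; the stubs for the on-device model and the remote call, and their timings, are invented here purely for illustration.

```python
# Sketch: duplicate every request locally and remotely; prefer the remote
# result only if it arrives before the SLA deadline (Section V-B).
import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def run_on_device(image):
    time.sleep(0.03)                     # stand-in for a ~30ms on-device model
    return ("local", "fast-model-label")

def post_to_server(image):
    time.sleep(0.12)                     # stand-in for network + cloud inference
    return ("remote", "selected-model-label")

def infer_with_duplication(image, t_sla_s: float = 0.25):
    local = pool.submit(run_on_device, image)    # duplicate: always run locally
    remote = pool.submit(post_to_server, image)  # remote runs model selection
    try:
        # Prefer the (usually more accurate) remote result within the SLA.
        return remote.result(timeout=t_sla_s)
    except concurrent.futures.TimeoutError:
        return local.result()  # SLA expired: fall back to the on-device result

print(infer_with_duplication(object()))  # -> ('remote', ...) here: 120ms < SLA
```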
VI. EVALUATION
Our evaluation goal is to quantify the effectiveness of MDInference in opportunistically improving aggregate accuracy for mobile devices. We do this by first demonstrating the effectiveness of our model selection algorithm at increasing the aggregate accuracy at a range of SLAs when compared to a number of alternative algorithms. Finally, we demonstrate the benefit of request duplication by analyzing its impact on SLA attainment and aggregate accuracy.
(a) Comparison of average end-to-end latency and accuracy. MDInference's selection algorithm is able to track the SLA target safely as the SLA target increases. Further, our model selection algorithm improves the standard deviation of inference times as well, keeping them below the SLA target.

(b) Models selected by MDInference. The use of a diverse set of models allows MDInference to minimize SLA violations while providing highly accurate inference results.
Fig. 3: Comparison of MDInference to the static greedy algorithm.
For each SLA target, we simulated 10,000 inference requests and recorded the inference time incurred by both the greedy algorithm and MDInference.
Key metrics.
The first metric that we measure in many of our evaluations is aggregate accuracy, which we introduced in Section III. We additionally measure SLA attainment, which is the percentage of requests that return results to the user within a target SLA. With duplication, SLA attainment is no longer an issue thanks to leveraging low-latency on-device models; in these cases we instead measure the percentage of requests that rely on the on-device model and the aggregate accuracy improvement when leveraging in-cloud models.
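For clarity, these two metrics can be stated in a few lines of code; the function names and request bookkeeping are our own, with the worked example from Section III as a sanity check.

```python
# Sketch: the two evaluation metrics, assuming each simulated request records
# the accuracy of the model that served it and its end-to-end latency.
def aggregate_accuracy(served_accuracies: list) -> float:
    return sum(served_accuracies) / len(served_accuracies)

def sla_attainment(latencies_ms: list, t_sla_ms: float) -> float:
    met = sum(1 for t in latencies_ms if t <= t_sla_ms)
    return 100.0 * met / len(latencies_ms)

# Worked example from Section III: models at 40%, 60% and 60% accuracy.
assert round(aggregate_accuracy([40, 60, 60]), 1) == 53.3
```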
Simulation methodology.
In our simulations, we leverage a range of models, summarized in Table III [7], [9], [10], [13], [33]–[35], that expose different accuracy and inference time trade-offs. We empirically measured the inference time distributions of the models using an EC2 p2.xlarge GPU-accelerated server over 1,000 inference executions. The accuracy of our functionally-equivalent pretrained models was reported against the ILSVRC 2012 dataset [7], [36], a widely used image classification test set. The mobile network we use as the basis for many of our simulations assumes that transferring an input image takes ms ± ms, based on real-world measurements of our university network. For each simulation, we generate 10,000 inference requests with a predefined SLA target and record the model selected by MDInference (and baseline algorithms) and relevant performance metrics. We repeated each simulation for a variety of SLA target and network profile combinations. For all tests except those in Section VI-D, we evaluated the model selection capabilities of MDInference without duplication of requests.
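A sketch of this simulation loop under our own assumptions (a Gaussian network time, the select_model sketch from Section V-A, and invented parameter names); the paper's measured network parameters are deliberately not reproduced here.

```python
# Sketch: drive n requests through the selector and score both metrics.
import random

def simulate(models, t_sla: float, nw_mean: float, nw_std: float, n: int = 10_000):
    accs, met = [], 0
    for _ in range(n):
        t_nw = max(0.0, random.gauss(nw_mean, nw_std))   # round-trip estimate
        m = select_model(models, t_sla, t_nw / 2.0)      # T_input = T_nw / 2
        latency = t_nw + random.gauss(m.mu, m.sigma)     # network + inference
        accs.append(m.accuracy)
        met += latency <= t_sla
    return sum(accs) / n, 100.0 * met / n                # accuracy, attainment
```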
A. Benefits over static greedy model selection

Figure 3a shows the average end-to-end inference time (left) and aggregate accuracy (right) achieved by our model selection algorithm and a static greedy algorithm, which picks the most accurate model that can complete within the given SLA. This figure shows that MDInference consistently achieved up to 42% lower inference latency compared to static greedy. Moreover, MDInference can operate under a much more stringent SLA target (∼ ms) while static greedy continues to frequently incur SLA violations until the SLA target is more than 200ms. The key reason is that MDInference was able to effectively trade off aggregate accuracy and response time by choosing from a diverse set of functionally-equivalent models. Consequently, MDInference had an aggregate accuracy of 68% (on par with using MobileNetV1 0.75, which can take 2.9X more time running on mobile devices) under a low SLA target (∼ ms), but was able to match the aggregate accuracy achieved by static greedy for higher SLA targets. Note that static greedy achieved up to 12% higher accuracy by sacrificing inference latency.

Figure 3b illustrates the CNN model usage patterns (i.e., the percentage of each model being used for executing the inference requests) under different SLA targets. At very low SLA targets (less than 30ms), MDInference chooses the fastest model, MobileNetV1 0.25, as described in Section V-A. As the SLA target increases, MDInference aggressively explores more accurate but slower models, commonly using our most accurate model,
NasNet Large.

We make two observations. First, MDInference was effective in picking the most appropriate model to increase accuracy while staying safely within the SLA target. For example, InceptionResNetV2 was never selected by MDInference because better alternatives such as InceptionV3 and InceptionV4 exist. Second, MDInference faithfully explored eligible models and was able to converge to the most accurate model when the SLA target allows, as shown in Figure 3a at T_sla = 250ms.

In summary, MDInference outperformed static greedy with an end-to-end latency reduction of up to 43%, while matching its accuracy when the SLA budget is larger than ms. This is possible because MDInference adapted its model selection by considering both the SLA target and network transfer time, while static greedy naively selected the most accurate model.
B. Adaptiveness to dynamic mobile network conditions
One of the key goals of MDInference is to adapt to network variations in order to improve mobile user experience. To further examine how MDInference copes with these variations, we simulated network profiles with increasing variability. Specifically, we fixed the average network latency to be 100ms and varied the Coefficient of Variation (CV), defined as the ratio between the standard deviation and the average of the network time, from 0% to 100%, representing a perfectly stable network and a network whose standard deviation equals its average, respectively. As a point of reference, our measured university WiFi network has a CV of 74%.
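The network profiles used here can be generated from the mean and CV alone; the sketch below assumes a Gaussian shape (our choice; the paper fixes only the mean and the CV) and truncates draws at zero.

```python
# Sketch: draw one round-trip network time; std. dev. = CV x mean.
import random

def network_time_ms(mean_ms: float = 100.0, cv: float = 0.74) -> float:
    return max(0.0, random.gauss(mean_ms, cv * mean_ms))
```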
Fig. 4: Aggregate accuracy of MDInference at different levels of CV with T_nw = 100ms. The initial low level of SLA attainment is due to the initial network time of 100ms, leaving no time for inference execution. As the variability of the network increases, MDInference can take advantage of the range of models available to it to quickly improve accuracy and SLA attainment. Similarly, at a higher SLA, MDInference can achieve high accuracy until the network variability exceeds the SLA.

Fig. 5: Model usage vs. network latency CV, shown at two different SLA targets.
When there is a reliable network (i.e., low CV), we see that a single model is used for all requests, since the highly reliable network leaves no room for other model choices. As the network becomes increasingly volatile, a wider range of models is used to meet the SLA.
Figure 4 shows the aggregate accuracy and SLA attainment achieved by MDInference. For a low SLA target (100ms), when the network is relatively stable, MDInference had an SLA attainment of less than 50%. As the network condition became more variable, MDInference was able to increase the aggregate accuracy gradually while maintaining the SLA attainment. The low SLA attainment is due to, on average, half the requests needing the entire SLA just for network transfer. Conversely, the slight increase in accuracy is due to MDInference taking advantage of the network variation to use more accurate models.

It is important to note that when the network transfer took the entire SLA, MDInference performed as expected by choosing
MobileNetV1 0.25, the fastest available model. We note that satisfying such stringent SLA targets under high network latency variation may require approaches such as provisioning inference servers at the network edge.

When we consider an SLA target of 250ms we see a
Fig. 6: Decomposition of benefits of MDInference's three-stage algorithm.
MDInference achieves similar accuracy and SLA attainment compared to related accurate, indicating the effectiveness of our probabilistic approach. Both pure random and related random have poor aggregate accuracy due to their inability to distinguish models with different latency.

different outcome. This SLA is slightly more than the sum of the average network latency and the time to execute the most accurate model,
NasNet Large. In this case we see that MDInference used high-accuracy models, maintaining an accuracy of around 80% throughout the entire range of network variations.

Figure 5 shows the models chosen when varying the CV of network time for SLAs of 100ms and 250ms. The SLA of 100ms is the RTT of the simulated network, leaving no time for inference. Similarly, when the SLA is 250ms the target time is larger than the sum of the RTT of the network and the execution latency of NasNet Large, our most complex model, which MDInference leverages.

We make the following two observations. First, as the network became more variable (i.e., high CV), MDInference matched the network variability by using a subset of faster models; in some cases it can exploit this variability to opportunistically use models with high inference accuracy. Second, the probability of exploring different eligible models is proportional to the SLA target and network variability. Faster models, such as those in the MobileNetV1 family, are used as a basis for low SLA targets, while the most accurate model, NasNet Large, is used for higher SLA targets.
In summary, MDInference was effective in handling highly variable mobile networks by exploring a diverse set of deep learning models that expose different inference latency and accuracy trade-offs. By taking advantage of network variation it could improve accuracy in many cases, and even with low target response times it could often return responses within the SLA.
C. Decomposing the efficiency of MDInference
Next, we break down the performance benefits provided by MDInference by examining the stages of our probabilistic model selection algorithm (Section V-A). For each of the three stages we compare to an alternative algorithm. For stage one we compare to pure random model selection. For stage two we compare to related random, which randomly selects a model from the exploration set M_E. For stage three we compare to related accurate, which selects the most accurate model from M_E, to demonstrate that our probabilistic approach does not sacrifice accuracy. All algorithms were tested with a network latency with average ms and standard deviation of ms.

TABLE IV: Aggregate accuracy and on-device model reliance. MDInference achieved the highest aggregate accuracy with lower on-device reliance, compared to other algorithms.

Algorithm | University: On-device Reliance | University: Aggregate Accuracy | Residential: On-device Reliance | Residential: Aggregate Accuracy
Static Latency | | | |
Static Accuracy | | | |
Random | | | |
MDInference | | | |

Figure 6 shows the average inference latency and aggregate accuracy for all four model selection algorithms. All three algorithms, including MDInference, that choose from the exploration set M_E were able to meet reasonable SLA targets, while pure random had approximately the same latency across all SLAs, incurring a large number of SLA violations. This indicates that the construction of M_E by stages one and two both enabled exploration and minimized the risk of unnecessary SLA violations.

As the SLA target increases, pure random again achieved approximately the same aggregate accuracy across all SLAs. The three other algorithms were able to gradually increase the aggregate accuracy by using slower but more accurate models from Table III. However, once we have a large enough SLA target (∼ ms), the exploration set M_E converges to two models: NasNet Large and
NasNet Fictional. At this point, the related random algorithm started to degrade aggregate accuracy since it cannot differentiate between these two models. Meanwhile, both related accurate and MDInference were able to steadily improve aggregate accuracy by avoiding
NasNet Fictional.

It is important to note there is only a negligible difference in aggregate accuracy between MDInference and the related accurate algorithm. This small difference is due to related accurate always selecting the most accurate model from M_E, while MDInference explores the eligible set. The probability of picking a less accurate model is low enough as to not overly impact the aggregate accuracy. The probabilistic behavior of MDInference is meant to allow for this exploration while generally maintaining aggregate accuracy, as opposed to related accurate, which misses the opportunity to use models that may have improved accuracy or latency profiles.

In summary, MDInference's three-stage algorithm is effective in distinguishing and identifying the most appropriate model to use under dynamic conditions. All three stages contribute to helping MDInference cope with variable network conditions and improve aggregate accuracy.
D. Effectiveness of Request Duplication
To test the on-device reliance of MDInference we used the network time from a sample of 5,000 requests on each of our university network and a residential network. These requests consisted of preprocessed image inputs.
Fig. 7: Aggregate accuracy and on-device model reliance on a residential network.
MDInference demonstrated improvements throughout all tested SLAs. At lower SLAs MDInference can quickly improve the aggregate accuracy. Meanwhile, MDInference also reduces the reliance on on-device models at low SLAs, much more quickly than other algorithms.
Fig. 8: Inference time for 20 randomly sampled requests over the residential network.
In many cases MDInference chose a model that yields results within the time budget. Other approaches, such as Static Accuracy, returned high-accuracy results but rely more heavily on the on-device model due to network variation.
For on-device execution we used the MobileNetV1 128 0.25 model, as it represents the single model most likely to complete within any SLA for all tested mobile devices. It was also excluded from the set of models available in the cloud to better demonstrate the ability of MDInference to improve over on-device inference.

For each of these measured requests we compared MDInference to three other simulated model selection algorithms using the models detailed in Table III and an SLA target of 250ms. The three other algorithms we used were static latency, which picks the fastest model, static accuracy, which picks the most accurate model, and random, which picks a random model.

Table IV compares the aggregate accuracy and on-device reliance for all four model selection algorithms. MDInference achieved the highest aggregate accuracy for inference requests sent over both university and residential WiFi, improving over static accuracy by 7.32% on the variable residential network. Meanwhile, it improved aggregate accuracy compared to static accuracy by up to 19%.

The aggregate accuracy and on-device reliance are shown in Figure 7. We can observe that MDInference increases aggregate accuracy more quickly than the other algorithms tested and has a lower reliance on the on-device models, allowing it to maintain this higher aggregate accuracy.

Figure 8 shows the inference latency breakdown for 20 randomly sampled requests that were sent over the residential network. We observe that most requests were able to complete on the remote server, but in some cases the on-device model must be used, in which case we highlight the network latency in red. This shows the ability of MDInference to adapt its execution choice to compensate for network variability, allowing it to decrease its on-device reliance and boosting its accuracy.
In summary, the duplication mechanism allows MDInference to ensure that results are returned to the mobile user within the target response time. Combined with the model selection algorithm, MDInference is able to increase the aggregate accuracy of inference requests in the vast majority of cases.
VII. DISCUSSION & FUTURE WORK
There are a number of potential avenues for future work in mobile-oriented deep inference frameworks. We discuss a number of important factors that should be considered, such as energy consumption and aggregate accuracy.
Energy Consumption.
The duplication of inference requests solves the issue of SLA violations, allowing users to have reasonable performance. However, it requires energy consumption on the device for both network communication and inference. Therefore, identifying times when duplication is critical and avoiding unnecessary duplications could allow for reduced energy consumption.
On-device Model Selection.
Currently, our proposed MDInference framework uses the same on-device model regardless of the mobile device. There are a number of different approaches, discussed in Section II-A, for improving on-device inference, but they generally rely on statically selected models. While some model optimizations provide the ability to simplify models post-training [37], these offer only limited options.
Spanning Subsets of Models.
Figure 3b demonstrated that many requests can be serviced by only a small subset of models. This potentially indicates that there exists a subset of models that could service nearly all requests, and thus form a spanning subset of all the models. This would be highly beneficial for decreasing the cost of inference serving, as only the models that fall into this subset would need to be available. Further, finding this subset for an arbitrary set of models without resorting to empirical measurement is another challenging problem to investigate.

VIII. CONCLUSION
In this work we introduced a holistic approach to designing mobile-oriented deep inference frameworks that focuses on identifying user needs and the constraints of mobile devices. We introduced the design of MDInference, a hypothetical framework utilizing this approach. MDInference can improve aggregate accuracy in over 96% of cases without introducing additional SLA violations. This improvement in accuracy was over 40% in some cases and was a 7.32% increase over statically serving the highest-latency model while duplicating inference execution locally. Our work shows the potential to improve mobile inference serving by explicitly addressing mobile-oriented constraints.

ACKNOWLEDGMENT
We would like to thank our anonymous reviewers for their valuable feedback. This work was supported in part by NSF Grants
REFERENCES

[1] T. Capes, P. Coles, A. Conkie, L. Golipour, A. Hadjitarkhani, Q. Hu, N. Huddleston, M. Hunt, J. Li, M. Neeracher et al., "Siri on-device deep learning-guided unit selection text-to-speech system," in INTERSPEECH, 2017, pp. 4011–4015.
[2] V. Kepuska and G. Bohouta, "Next-generation of virtual personal assistants (Microsoft Cortana, Apple Siri, Amazon Alexa and Google Home)." IEEE, 2018, pp. 99–103.
[3] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Vigas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean, "Google's multilingual neural machine translation system: Enabling zero-shot translation," 2016.
[4] Z. Li, X. Wang, X. Lv, and T. Yang, "Sep-nets: Small and effective pattern networks," CoRR, vol. abs/1706.03912, 2017. [Online]. Available: http://arxiv.org/abs/1706.03912
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[6] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Santa Clara, CA: USENIX Association, Jul. 2015, pp. 417–429.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016.
[10] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[11] S. Bianco, R. Cadene, L. Celona, and P. Napoletano, "Benchmark analysis of representative deep neural network architectures."
[12] T. Guo, "Cloud-based or on-device: An empirical study of mobile deep inference," in USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18). Boston, MA: USENIX Association, Jul. 2018.
[18] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, "Clipper: A low-latency online prediction serving system," 2017, pp. 613–627.
[19] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, "Tensorflow-serving: Flexible, high-performance ml serving."
[20] P. Gao, L. Yu, Y. Wu, and J. Li, "Low latency rnn inference with cellular batching," in Proceedings of the Thirteenth EuroSys Conference, ser. EuroSys '18. New York, NY, USA: ACM, 2018.
[21] M. LeMay, S. Li, and T. Guo, "Perseus: Characterizing performance and cost of multi-tenant serving for CNN models." IEEE, 2020.
[22] M. Satyanarayanan, P. Bahl, R. Caceres, and N. Davies, "The case for vm-based cloudlets in mobile computing," IEEE Pervasive Computing, vol. 8, no. 4, Oct. 2009.
[23] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, "Neurosurgeon: Collaborative intelligence between the cloud and mobile edge," in ACM SIGARCH Computer Architecture News, vol. 45, no. 1. ACM, 2017, pp. 615–629.
[24] S. Teerapittayanon, B. McDanel, and H.-T. Kung, "Branchynet: Fast inference via early exiting from deep neural networks." IEEE, 2016, pp. 2464–2469.
[25] S. Teerapittayanon, B. McDanel, and H. T. Kung, "Distributed deep neural networks over the cloud, the edge and end devices." IEEE, 2017, pp. 328–339.
[26] "Caffe: Model Zoo," http://caffe.berkeleyvision.org/model_zoo.html.
[27] C. Zhang, M. Yu, W. Wang, and F. Yan, "MArk: Exploiting cloud services for cost-effective, SLO-aware machine learning inference serving," 2019.
[28] R. S. Kannan, L. Subramanian, A. Raju, J. Ahn, J. Mars, and L. Tang, "GrandSLAm: Guaranteeing SLAs for jobs in microservices execution frameworks," in Proceedings of the Fourteenth EuroSys Conference 2019. ACM, 2019, p. 34.
[29] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica, "Sparrow: distributed, low latency scheduling," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013, pp. 69–84.
[30] M. Mitzenmacher, "The power of two choices in randomized load balancing," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 10, pp. 1094–1104, 2001.
[31] P. Bodik, A. Fox, M. J. Franklin, M. I. Jordan, and D. A. Patterson, "Characterizing, modeling, and generating workload spikes for stateful services," in Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 2010, pp. 241–252.
[32] G. Widmer and M. Kubat, "Learning in the presence of concept drift and hidden contexts," Machine Learning, vol. 23, no. 1, pp. 69–101, 1996.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR09, 2009.
[34] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size."
[36] International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[37] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," arXiv preprint arXiv:1510.00149, 2015.