CoEdge: Cooperative DNN Inference with Adaptive Workload Partitioning over Heterogeneous Edge Devices
Liekang Zeng, Student Member, IEEE, Xu Chen, Senior Member, IEEE, Zhi Zhou, Member, IEEE, Lei Yang, Senior Member, IEEE, and Junshan Zhang, Fellow, IEEE
Abstract—Recent advances in artificial intelligence have driven increasing intelligent applications at the network edge, such as smart home, smart factory, and smart city. To deploy computationally intensive Deep Neural Networks (DNNs) on resource-constrained edge devices, traditional approaches have relied on either offloading workload to the remote cloud or optimizing computation at the end device locally. However, the cloud-assisted approaches suffer from the unreliable and delay-significant wide-area network, and the local computing approaches are limited by the constrained computing capability. Towards high-performance edge intelligence, the cooperative execution mechanism offers a new paradigm, which has attracted growing research interest recently. In this paper, we propose CoEdge, a distributed DNN computing system that orchestrates cooperative DNN inference over heterogeneous edge devices. CoEdge utilizes available computation and communication resources at the edge and dynamically partitions the DNN inference workload adaptive to devices' computing capabilities and network conditions. Experimental evaluations based on a realistic prototype show that CoEdge outperforms status-quo approaches in saving energy with close inference latency, achieving up to 25.5% ∼ 66.9% energy saving across four widely-adopted CNN models.

Index Terms—Edge Intelligence, Cooperative DNN Inference, Distributed Computing, Energy Efficiency.
I. INTRODUCTION

Recent years have witnessed an ever-increasing number of Internet of Things (IoT) devices diving into miscellaneous application domains, e.g., smart home [1], smart factory [2], autonomous driving [3], etc. This trend also drives the community to build smarter, faster, and greener intelligent applications at the network edge, pushing remarkable progress in smart healthcare, security inspection and disease detection [4]–[6]. Meanwhile, advances in Deep Neural Networks (DNNs) have shown unprecedented ability in learning abstract representations and extracting high-level features, promoting significant improvement in processing human-centric contents [7]. Motivated by this success, it is envisioned that employing DNNs on edge devices would enable and boost intelligent
L. Zeng, X. Chen and Z. Zhou are with the School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, 510006 China (e-mail: [email protected], [email protected], [email protected]). L. Yang is with the Department of Computer Science and Engineering, University of Nevada, Reno, NV, 89557 USA (e-mail: [email protected]). J. Zhang is with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, 85287 USA (e-mail: [email protected]).
services, supporting brand new smart interactions between humans and their physical surroundings.

The essential demand of these services is to respond to users' queries timely, e.g., recognizing voice commands [8], inspecting visitors' faces [9], and detecting heartbeat frequency [6], all within a matter of milliseconds. This also implies a soft-realtime requirement: if the result comes late, the user may turn to other applications, and the result can even be out of date and meaningless. Therefore, minimizing response latency and promising users' experience is of paramount importance. However, DNN-based applications are typically computation-intensive and resource-hungry [7], and service providers traditionally appeal to the resource-abundant cloud to satisfy the stringent responsiveness requirement [10]. Yet the Quality-of-Service (QoS) can still be poor and unsatisfactory due to the unreliable and delay-significant wide-area network connection between edge devices and the remote cloud [11], [12]. What's worse, for many smart applications with humans in the loop, the sensory data can contain highly sensitive or private information. Offloading these data to a remote datacenter owned by curious commercial companies inevitably raises users' privacy concerns.

Intuitively, keeping data locally and processing tasks without external remote assistance will preserve user privacy and avoid the remote network transmission. Unfortunately, local edge devices generally have limited computing capability, making it hard to fulfill DNN execution under the latency Service-Level-Objective (SLO). For instance, if a smart home camera runs CNN-based face recognition to provide real-time inspection and warning, the response delay when running the DNN locally may last for a few seconds, resulting in poor user experience and a completely unusable service.

Fig. 1. An example of cooperative DNN inference in a smart home scenario. As the raw image is captured, device A decides a cooperative execution plan and distributes the workload to devices B and C. According to the plan, the devices perform cooperative inference in response to the DNN task.
Fig. 2. Conventional CNN inference workflow, which is typically in two stages. In the first stage, the CNN processes the input image to extract hidden features through operations like convolution and pooling, and generates multidimensional feature maps. In the second stage, the CNN classifies the feature maps by fully-connected layers and obtains the inference result.
To tackle these challenges, a promising approach is to exploit available computation resources in the proximity of the data source with the emerging edge intelligence paradigm [13]. Instead of uploading data to the remote cloud or keeping all computation at a single local device, edge intelligence enjoys real-time response as well as privacy preservation by offloading computing workload within a manageable range. As Fig. 1 illustrates, we can utilize the diverse computing resources in a smart home (with inspection camera, smartphone, tablet, and desktop PC) to accelerate the CNN-based face recognition. Specifically, the source device A can distribute the inference workload to devices B and C, and perform cooperative inference via a high-bandwidth local wireless connection (e.g., WiFi). Nevertheless, this paradigm brings some key challenges to be addressed: (1) how to decide the workload assignment tailored to the resource heterogeneity of edge devices, (2) how to optimize the system performance in the presence of network dynamics, and (3) how to orchestrate computation and communication during the cooperative inference runtime.

To answer these questions, we propose CoEdge (Cooperative Edge), a runtime system that orchestrates cooperative DNN inference over multiple heterogeneous edge devices. CoEdge does not apply any structural modifications or tuning requirements to the given DNN model, and does not sacrifice model accuracy as it preserves the input data and model parameters of the given DNN model. CoEdge employs a similar parallel workflow as DeepThings [14], where the input is split initially and the execution is parallelized on multiple devices at runtime. While DeepThings leverages a layer fusion technique to reduce communication overhead, CoEdge proposes to optimize workload allocation to maximally utilize heterogeneous edge resources. By optimizing the computation-communication tradeoff, CoEdge optimally partitions the input inference workload, where the partitions' sizes are chosen to match devices' computing capabilities and network conditions to improve system performance in both latency and energy metrics. We implement CoEdge using a realistic prototype with Raspberry Pi 3, Jetson TX2, and desktop PC. Experimental evaluations show 7.21× ∼ latency speedup over the local approach and up to 25.5% ∼ 66.9% energy saving over state-of-the-art cooperative approaches. The main contributions of this paper are as follows.

• We propose CoEdge, a distributed DNN computing system that orchestrates cooperative inference over heterogeneous devices to minimize system energy consumption while promising the response latency requirement.

• We identify the impacts of workload partitioning on the cooperative inference workflow, and build a constrained programming model for workload distribution optimization. We prove the NP-hardness of the problem, and devise a fast approximation algorithm to decide an efficient partitioning policy in real time, tailored to devices' diverse computing capabilities and network conditions.

• We implement a multi-device prototype using heterogeneous edge devices, and evaluate CoEdge on four widely-adopted DNN models to corroborate its superior performance.

The rest of this paper is organized as follows. Section II briefly reviews background on DNN inference, and discusses opportunities and challenges based on a case of cooperative inference. Section III presents the CoEdge design and its workflow. Section IV builds the system model and describes our workload partitioning algorithm. We explain implementation details in Section V and evaluate the prototype in Section VI. Section VII reviews related work.
Section VIII discusses the limitations and extensions of CoEdge, and Section IX concludes. The appendix (in the supplementary material) details the proofs of Theorems 1 and 2.

II. BACKGROUND AND MOTIVATION
In this section, we briefly review conventional CNN inference and cooperative inference. We study a case of cooperative inference and discuss the potential challenges behind it.
A. Deep Neural Network Inference
In this paper, we focus on the classical Convolutional Neural Networks (CNNs) as they are widely adopted across a broad spectrum of intelligent services, including image classification, object detection, and semantic segmentation.

Fig. 2 depicts a conventional CNN inference for an image classification task from the perspective of feature maps. As we can see, a conventional CNN inference can be viewed as a series of successive algorithmic operations on feature maps. These operations comprise convolution, pooling, batch normalization, activation, and fully-connected computation, etc. (for ease of illustration, only some of the operations are drawn in Fig. 2). In light of the functionality of the operations, the inference process can be separated into two stages. The first stage is the feature extraction stage, where the model processes every pixel in the input image to generate hidden feature representations. Following that, at the second stage, these features are classified by the fully-connected layers, exporting results in a probabilistic form.

TABLE I
RASPBERRY PI SPECIFICATIONS [15]
Hardware    Specifications
CPU         1.2GHz Quad Core ARM Cortex-A53
Memory      1GB LPDDR2 900MHz
GPU         No GPU
Power       Idle: 1.3W; Fully Loaded: 6.5W; Average Observed: 3W

TABLE II
JETSON TX2 SPECIFICATIONS [16]
Hardware    Specifications
CPU         2.0GHz Dual Denver 2 + 2.0GHz Quad Core ARM Cortex-A57
Memory      8GB LPDDR4 1.6GHz
GPU         Pascal Architecture, 256 CUDA Cores
Power       Idle: 5W; Fully Loaded: 15W; Average Observed: 9.5W
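To make the two-stage view concrete, the toy model below stacks a convolution/pooling feature extractor and a fully-connected classifier with the TensorFlow Keras API. This is purely an illustrative sketch of the structure described above, not one of the networks evaluated later in the paper.

```python
import tensorflow as tf

def two_stage_cnn(num_classes: int = 4) -> tf.keras.Model:
    # Feature extraction stage: convolution and pooling produce feature maps.
    feature_extractor = [
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
    ]
    # Classification stage: fully-connected layers map features to class probabilities.
    classifier = [
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ]
    return tf.keras.Sequential(feature_extractor + classifier)
```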
B. Case Study: Cooperative Inference with Two Devices
The key impediment to deploying CNNs at the network edge lies in the gap between the intensive CNN inference computation and the limited computing capability of edge devices. To bridge this gap, we can utilize the cooperative inference mechanism to exploit available computing resources at the edge. A straightforward solution, for example, is the master-worker paradigm that offloads inference workload to external infrastructure. To obtain a better understanding of cooperative inference, we use a real hardware testbed to emulate this solution.

As a case study, we employ a Raspberry Pi 3 and a Jetson TX2, representing weak IoT devices and mobile AI platforms at the edge, respectively. Tables I and II present their specifications and reported power parameters, which are measured with the Monsoon High Voltage Monitor [17] using the methodology in [18]. For each inference task, we input one single image to the Pi and then offload a part of the image to the Jetson. The two devices parallelize the DNN execution and their results are finally aggregated at the Pi as output. We measure the end-to-end latency of this process, i.e., from the image input to the inference result output, and we record the average latency of fulfilling the inference task over 100 runs. We implement AlexNet [19] with TensorFlow Lite [20] on both devices, and run the model with the same image from ImageNet [21]. For the bandwidth between the two devices, we fix it at 1MB/s using the traffic control tool tc [22].

We define the offloading ratio to indicate how much data is offloaded from the Raspberry Pi to the Jetson TX2. For instance, when the ratio is 0.5, we split the input image along the height into two equal parts, and transfer one of them to the Jetson TX2. In particular, a zero ratio indicates performing inference at the Raspberry Pi locally, while a ratio of 1.0 means offloading all workload to the Jetson TX2.
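The offloading ratio simply determines how many rows of the input image are shipped to the cooperator; the sketch below is our illustration of this split for the case study setup, not code from the CoEdge prototype.

```python
import numpy as np

def split_by_offloading_ratio(image: np.ndarray, ratio: float):
    """Split an (H, W, C) image along the height dimension.

    ratio = 0.0 keeps the whole input on the Raspberry Pi (local inference);
    ratio = 1.0 offloads the whole input to the Jetson TX2.
    """
    height = image.shape[0]
    offloaded_rows = int(round(height * ratio))
    local_part = image[: height - offloaded_rows]      # processed on the Pi
    offloaded_part = image[height - offloaded_rows:]   # transferred to the Jetson
    return local_part, offloaded_part

# Example: a 0.5 ratio sends the lower half of a 224x224 RGB image to the Jetson.
local_half, remote_half = split_by_offloading_ratio(np.zeros((224, 224, 3)), 0.5)
```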
Fig. 3. The total latency and energy consumption under varying offloading ratio, i.e., the proportion of data that is offloaded from the Raspberry Pi 3 to the Jetson TX2.

Fig. 3 shows the latency and energy overheads under varying offloading ratios. Through this experiment, we derive the following observations.

• Jetson TX2 enjoys better performance than Raspberry Pi 3. When the ratio is zero, the system consumption is only the computation cost of the Pi, while in the 1.0 case, the total cost comprises the input offloading overhead, the DNN computation overhead on the Jetson, and the overhead of transferring the result back. However, the former still incurs higher cost than the latter in terms of both latency and energy. Note that fully offloading workload to the Jetson (i.e., offloading ratio = 1) may not necessarily yield the lowest cost if the network fluctuates.

• Cooperative inference is more economical than local inference given a favourable network condition. As the offloading ratio increases, both latency and energy drop. In other words, the system cost decreases via harvesting the cooperator's computing resources.

• The latency curve drops faster as the offloading ratio increases. This is because the DNN execution is parallelized in cooperative inference and the end-to-end inference latency is determined by the slower device (the straggler). Therefore, in high-bandwidth environments, assigning more workload to the Jetson TX2 yields larger performance improvement.
C. Merits and Challenges
We see from the above observations that the cooperative mechanism has the potential to improve inference performance with multiple devices, which is exactly what edge scenarios possess. More precisely, we envision the deployment of the cooperative inference system in an environment such as a smart home or smart factory, wherein the devices are managed by the same owner and thus are willing to cooperate and share their resources. This brings several major merits as well as challenges.
Merits. On the one hand, compared with local inference, cooperative inference has significant potential in reducing latency and energy costs by harvesting idle computing resources at the edge. On the other hand, unlike the cloud approach that uploads data to a remote datacenter, the cooperative approach keeps data within the user's control scope, therefore avoiding the delay-significant wide-area network as well as privacy issues.
Challenges. To effectively exploit computing resources at the edge, we need to felicitously factor in the computing capabilities of edge devices, considering both their magnitude and heterogeneity. Also, given the network dynamics inherent in edge computing, an efficient workload allocation solution that jointly considers the systematic costs is desired. More specifically, it is crucial to decide which devices participate in the cooperative inference and how much workload each device affords. Besides, since the cooperative mechanism parallelizes CNN inference in a distributed manner, the system needs to orchestrate the computation and communication over multiple devices.

We address these challenges by designing a cooperative system, CoEdge, which orchestrates the available resources from heterogeneous edge devices.

III. COEDGE DESIGN AND WORKFLOW
In this section, we present the CoEdge design and the workflow of cooperative CNN inference. We further explore how workload partitioning impacts parallel processing in terms of computation and communication.
A. CoEdge Design
For ease of illustration, we differentiate between devices based on their roles in the cooperation. For the device that launches a CNN inference task, we label it as the master device, and a device that joins the cooperation is marked as a worker device. The master device is responsible for registering participating devices, generating a feasible workload partitioning plan, and managing the cooperative inference over worker devices. Note that a device can be the master and a worker at the same time since it can retain CNN workload in situ.

Fig. 4 illustrates the architecture overview of CoEdge, which works in two phases, namely the setup phase and the runtime phase. In the setup phase, CoEdge records the execution profiles of each device. In the runtime phase, CoEdge creates a cooperative inference plan that determines the workload partitions and their corresponding assignment, using the profiling results collected in the setup phase and the network status. According to the cooperation plan, the master distributes the workload partitions to the workers and then performs cooperative execution collaboratively.
Setup phase. Whenever a CNN-based application is installed, the Device Profiler runs the CNN models locally and records the Profiling Results. These results sketch the device's computing capability, including the computation intensity, computation frequency, and power parameters, which will be detailed in Section IV-A.
Runtime phase. The runtime phase starts when the master raises a CNN inference query. As the image arrives, the master establishes connections with worker devices and pulls their profiling results. Since the size of the profiling results is very small (tens of bytes in our prototype), the overhead of transferring them is negligible. As the master receives the profiling data, the partitioning engine in the master device generates a workload allocation plan using the adaptive workload partitioning algorithm (explained in Section IV-C). According to the plan, the DNN execution runtime distributes the workload partitions to workers and performs cooperative inference in response to the query.
Fig. 4. CoEdge architecture overview, which works in two phases. In the setup phase, the devices profile parameters to sketch their computing capability information. In the runtime phase, the master device creates a partitioning plan using the collected profiling results. According to the plan, the master device distributes the workload and performs cooperative inference with the worker devices.
B. Cooperative Inference Workflow
In this work, we exploit model parallelism to partition CNN inference over multiple devices. Under model parallelism, the CNN model parameters are divided into subsets and assigned to multiple edge nodes. With its respective parameters, each device accepts a necessary part of the input feature maps and generates a portion of the output feature maps. Concatenating all these portions yields the complete output of each layer.

Fig. 5 provides an instance of the cooperative inference workflow with three devices from the perspective of feature maps. The cooperative inference begins when the image is piece-wise split into partitions. Note that to accommodate devices' heterogeneity, the partition sizes are differentiated to match device capabilities. The partitions are then distributed from the master to three devices (i.e., devices A, B, and C in Fig. 5). At the feature extraction stage, the three devices execute their workload in parallel, while at the classification stage, their execution results are aggregated at one of them (device B in Fig. 5). This aggregation is to avoid excess communication overhead caused by the nature of fully-connected computation, which requires repeated data access on the feature vectors.
Generalization. Based on the workflow in Fig. 5, it is feasible to accommodate various CNNs with complex structures by redesigning some details. For example, for CNNs without fully-connected layers (e.g., Network in Network [23]), we can remove the classification stage in Fig. 5. To adapt to CNNs with skip connections (e.g., ResNet [24]), we can keep the intermediate output results on each device at the shortcut starting point and release these data at the shortcut destination to collect them when needed.
Fig. 5. Cooperative CNN inference workflow of CoEdge. The input image is piece-wise partitioned into patches before execution. In the feature extraction stage, these patches are distributed to devices A, B and C, respectively, and then in the classification stage, the feature map fragments are aggregated to finish the remaining execution.

Fig. 6. Example of a convolution operation for cooperative inference. The input feature map partitions locate at devices A and B, respectively. To generate the output feature map through the convolution kernel, device A needs to pull padding data from device B.

C. Impact of Workload Partitioning
The way of piece-wise partitioning significantly affects the communication between devices, especially for convolution operations that process data across partition boundaries. For instance, Fig. 6 shows a typical convolution operation with two partitions. To compute convolution over its own partition with the given kernel, device A needs to fetch the margin rows in device B's partition. In general, for a kernel whose size k is greater than 1, each device needs to pull padding data of ⌊k/2⌋ rows along the split dimension from the neighboring device. In some extreme cases, when the kernel size is very large but the neighboring partition size is very small, the padding range may even span three or more devices, which could incur extravagant communication overhead.

To reduce the communication between devices, some prior works [14], [25] exploit sending redundant data in advance to avoid the padding issue. However, while transferring redundant data takes additional communication cost, preparing necessary data beforehand for a number of CNN layers incurs extra storage overhead. In this work, we address the padding issue by imposing a principle that requires the allocated partition size on the neighboring device to be not smaller than the padding size, unless it owns no partition. This principle ensures that the padding data can always be acquired from only the neighboring device as long as it has data. That is, the transmission of the padding data merely happens once, and thus we reduce the overhead of establishing additional connections. To illustrate this, Fig. 7 shows the communication pattern of the example in Fig. 5. Initially, the input image is partitioned and distributed to the corresponding devices, along with the padding data for the first convolution layer. For the following layers, each device only connects to its neighboring device and fetches padding data for convolution computation. This pattern holds until all convolutions are completed, and then the separated feature map partitions are aggregated to one of the devices for fully-connected computation. The inference result is finally returned to a user-specified device (device C, for example, in Fig. 7).
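For a convolution with kernel size k, the halo that must be pulled from the neighbouring partition is ⌊k/2⌋ rows; the following small sketch (our illustration, not CoEdge code) captures this together with the partition-size principle described above.

```python
def required_halo_rows(kernel_size: int) -> int:
    """Rows of padding data to pull from the neighbouring device (Section III-C)."""
    return kernel_size // 2

def can_serve_halo(neighbor_rows: int, kernel_size: int) -> bool:
    """Partition-size principle: the neighbour either holds no data, or holds at
    least as many rows as the halo it must serve, so one transfer always suffices."""
    return neighbor_rows == 0 or neighbor_rows >= required_halo_rows(kernel_size)
```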
Fig. 7. The communication pattern of a CoEdge runtime instance with three devices. For convolution computation, each device pulls the necessary padding data from its neighboring device. For fully-connected computation, the feature map partitions are aggregated. The final inference result is transferred to a user-specified location, i.e., device C in the figure.

Under this principle, finding an appropriate workload assignment matters significantly for system performance. For instance, offloading a large portion of workload to a device that owns high bandwidth but poor computing capability may not lead to a lower execution latency. To deploy cooperative inference optimally, we need to match the assigned partition size to the computation and communication resources of each device. We achieve this goal by designing a workload partitioning algorithm that is adaptive to the computing capabilities of the available devices and the dynamic network status.

IV. ADAPTIVE WORKLOAD PARTITIONING
The objective of optimizing workload allocation is to improve cooperative inference performance in both latency and energy metrics. For simplicity of problem definition, we target meeting the latency requirement while minimizing the energy costs. Assuming an execution deadline D, the optimization problem is to optimally allocate the workload so that the system energy consumption is minimized while promising the execution deadline D, given the available computation and communication resources.

In the following, we present a detailed formulation of this problem and our workload partitioning algorithm.

A. Problem Formulation
We assume that the devices are available and relatively stable during the inference runtime. This can be relevant as executing an inference task typically takes a period of seconds, and many edge environments are maintained statically in independent spaces, such as smart homes and smart factories. Besides, the underlying support of intelligent services in such scenarios usually employs a few commonly-adopted DNN models and frequently runs similar types of DNN inference tasks. Therefore, we suppose that the DNN models have been loaded ahead of inference queries, and can be used to compute input tensors as soon as the necessary data are prepared.

Since a CNN model typically encompasses many layers, we model the cooperative inference process from single-layer to multi-layer, progressively. The single-layer formulation focuses on sketching the workload partitioning constraints and shaping the performance of a single layer, while the multi-layer part aims at summarizing the system behavior for the whole workflow. Prior to that, we define the necessary concepts and notations as follows (a compact sketch of these tuples as data structures is given after this list):

• A layer l is an algorithmic operation in a CNN model. In our formulation, a layer refers to either a convolution (Conv) or a fully-connected (FC) layer. Given a CNN model, L = [1, 2, ..., L] denotes the layers in order.

• A partitioning solution π is a group of coterminous partitions of the input image, which is generated by piece-wise partitioning along one dimension. For the input partition assigned to device i, a_i represents the number of rows that it covers. Hence, given the devices' indices N = [1, 2, ..., N], π = [a_1, a_2, ..., a_N]. We denote the workload as the input feature map partition to be processed at each DNN layer. For layer l, the workload size of the i-th partition is r_{l,i}, which can be obtained by calculating the partition's data size.

• A configuration tuple (k, c_in, c_out, s, p)_{l,i} denotes the l-th layer's computation task on the i-th partition, which is characterized by the layer's configuration, i.e., convolution kernel size k, input channels c_in, output channels c_out, stride s, and padding p. This tuple is applicable to both convolution and FC layers since FC computation can be viewed as a special case (the tuple depicts a fully-connected computation when the input feature map's size is 1 × 1 × c_in, the output feature map's size is 1 × 1 × c_out, kernel size k = 1, stride s = 1, and padding p = 0). In particular, as discussed in Section III-C, the padding size of convolutional layers is supposed to be smaller than the size of the partition on the last neighboring device, unless it owns no data. We formulate this principle as Eq. (1), where 1{a_i > 0} is an indicator function that takes value 1 if a_i > 0 and 0 otherwise. This constraint is essentially equivalent to the disjunction of a_i ≥ p_{i+1} and a_i = 0.

• A resource tuple (ρ, f, m, P^c, P^x)_i specifies the resource profile of device i. Here, ρ is defined as the computing intensity (in processing cycles per 1KB input) of the given DNN model, which is measured by application-driven offline profiling [26] in the setup phase. f is the device's CPU frequency, reflecting its computing capability at a coarse granularity. m is the available maximum memory capacity for inference tasks. For a single device that only processes CNN workloads, m is the volume of memory excluding the space taken by the underlying system services, e.g., I/O services, compiler, etc. P^c and P^x denote the computation power and the wireless transmission power, respectively.
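For concreteness, the configuration and resource tuples above can be recorded as simple data structures; the sketch below is our illustration, and the field names are not taken from CoEdge's implementation.

```python
from dataclasses import dataclass

@dataclass
class LayerConfig:
    """Configuration tuple (k, c_in, c_out, s, p) of one layer's task on one partition."""
    k: int        # convolution kernel size (1 for FC layers)
    c_in: int     # input channels
    c_out: int    # output channels
    s: int        # stride
    p: int        # padding

@dataclass
class DeviceProfile:
    """Resource tuple (rho, f, m, P^c, P^x) of one device."""
    rho: float    # computing intensity (cycles per KB of input)
    f: float      # CPU frequency (cycles per second)
    m: float      # memory available for inference workloads (KB)
    p_c: float    # computation power (W)
    p_x: float    # wireless transmission power (W)
```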
1) Single-Layer Formulation:
There are some numerical constraints on the partition sizes. Eq. (1) imposes the size restriction with respect to the padding size as discussed in Section III-C, and Eq. (2) claims that a_i is a nonnegative integer. Eq. (3) states that the concatenated size of all partitions along the height dimension equals this dimension's size H. The piece-wise partitioning can be conducted along either the height H or the width W of the input. In our experiments, we split along the height H without loss of generality.

a_i ≥ p_{i+1} · 1{a_i > 0},  ∀ i ∈ N,   (1)
a_i ≥ 0, a_i ∈ Z,  ∀ i ∈ N,   (2)
Σ_{i∈N} a_i = H.   (3)

The workload size r_{l,i} of a partition is constrained by the device's available memory capacity m_i, as in Eq. (4). Here, we only limit the memory footprint to the size of per-layer inputs for the sake of simplicity, while the runtime memory may not be exactly r_{l,i}. For practical deployment cases, emerging techniques for characterizing the detailed CNN execution memory (e.g., [27]) can be adopted, and the deep learning platform-related memory footprint can be added to the left-hand side of Eq. (4) as an enhancement.

r_{l,i} ≤ m_i,  ∀ i ∈ N, l ∈ L.   (4)

During a single layer's execution, the system spends time and energy on two aspects, computation and communication. For computation, we calculate the latency and energy by first approximating the computing cycles of the given partitions. As demonstrated in previous empirical studies [26], [28], [29], for many data processing tasks as exemplified by data encoding and decoding, the required computing cycles are proportional to their input data sizes. This means that a constant computing intensity (in computing cycles per unit data) exists for such tasks, and we can use it to capture the effective computing capability of a specific device. Existing literature, such as [30]–[32], has leveraged this observation to characterize deep learning workloads, and in this work, we adopt it to estimate the amount of computing cycles given the partitions and DNN layers. Concretely, in Eq. (5), we assess the total processing cycles of the i-th partition by multiplying the device's computing intensity ρ_i with the workload size r_{l,i}. Moreover, for each respective DNN layer, CNN inference typically conducts a feed-forward execution without any branch operation or recurrent computation [7], indicating that its execution latency is approximately linear in the computing cycles. Therefore, the latency T^c_{l,i} for computing layer l is appraised by dividing the total processing cycles by the computation frequency f_i, and the energy is the product of T^c_{l,i} and the computation power P^c_i in Eq. (6). Note that Eq. (6) only reckons dynamic energy. Static energy consumption, e.g., that for maintaining basic system-level services, is not considered in our formulation.

T^c_{l,i} = ρ_i r_{l,i} / f_i,  ∀ i ∈ N, l ∈ L,   (5)
E^c_{l,i} = P^c_i T^c_{l,i},  ∀ i ∈ N, l ∈ L.   (6)

For communication, let b_{i,j} be the available bandwidth between devices i and j. Particularly, j = i implies delivering data from a device to itself, and b_{i,i} is the memory bandwidth. In our experiment, b_{i,i} is set to 12.8GB/s by default, which is the typical memory bandwidth of DDR3 [33]. Initially, communication occurs when the master device (denoted as device M) distributes the input partitions to worker devices; the transmission time is therefore calculated in the l = 1 case of Eq. (7). For the communication of pulling the padding data from the neighboring device, the transmission time is described by the l > 1 case.
For the sake of simplicity, Eq. (7) does not take queuing delays into account since we are optimizing inference for a single image input at a time. Streaming input, in which case the queuing delays significantly matter, is left for future work. With the transmission power P^x_i, we acquire the dynamic energy of communicating with device i on layer l in Eq. (8).

T^x_{l,i} = a_i / b_{M,i},  if l = 1, i ∈ N;  T^x_{l,i} = p_{l,i} / b_{i,i+1},  if l > 1, l ∈ L, i ∈ N,   (7)
E^x_{l,i} = P^x_i T^x_{l,i},  ∀ i ∈ N, l ∈ L.   (8)
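The per-layer costs of Eqs. (5)–(8) can be evaluated directly from these quantities; the sketch below is our illustration, reusing the hypothetical DeviceProfile record introduced earlier (helper and argument names are ours, not CoEdge's API).

```python
def layer_costs(r_li_kb: float, transferred_kb: float,
                device: "DeviceProfile", bandwidth_kbps: float):
    """Per-layer costs of partition i on one device, following Eqs. (5)-(8).

    r_li_kb        : workload (layer-l input partition) size in KB.
    transferred_kb : data volume moved for this layer -- the input partition
                     itself for l = 1, the padding data otherwise.
    bandwidth_kbps : b_{M,i} for l = 1, b_{i,i+1} for l > 1 (KB/s).
    """
    t_comp = device.rho * r_li_kb / device.f     # Eq. (5): cycles / frequency
    e_comp = device.p_c * t_comp                 # Eq. (6)
    t_comm = transferred_kb / bandwidth_kbps     # Eq. (7), either branch
    e_comm = device.p_x * t_comm                 # Eq. (8)
    return t_comp, e_comp, t_comm, e_comm
```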
2) Multi-Layer Formulation:
The key challenge of extending the formulation from a single layer to multiple layers lies in the synchronization mechanism during parallel processing. Fig. 8 presents a job breakdown of a CoEdge instance processing one image over three devices. As we can see, each device processes computation and communication jobs alternately, and they trigger synchronization periodically whenever a communication job (except for the initial communication job) is accomplished. The contents of the synchronizations are the requisite padding data for convolutional computation. During the interval between two synchronizations, there is no data dependency between devices, and thus they process jobs in parallel. The scattered feature map partitions are finally aggregated at the classification stage for FC computation. Hence, the whole process works in a Bulk Synchronous Parallel (BSP) mechanism [34].

To summarize the cost of the whole process, we denote E^c and E^x as the total energy consumption of computation and communication, respectively, which are obtained by summing up the energy of all devices over all layers in Eqs. (9) and (10).
Fig. 8. The job breakdown of a CoEdge runtime instance with three devices. Each device processes computation and communication jobs alternately, and the system performs in a Bulk Synchronous Parallel (BSP) mechanism.
We count the total physical latency T according to the BSP model and obtain Eq. (11). Concretely, we acquire T by calculating the maximum latency over all devices in each interval and then summing up the physical latency of all intervals. It is worth noting that Eq. (11) also counts the latency of the FC layers, as the maximum latency over all devices is essentially that of the selected device in the classification stage.

E^c = Σ_{l∈L} Σ_{i∈N} E^c_{l,i},   (9)
E^x = Σ_{l∈L} Σ_{i∈N} E^x_{l,i},   (10)
T = Σ_{l∈L} max_{i∈N} (T^c_{l,i} + T^x_{l,i}).   (11)

Given the execution deadline D, the targeted problem is to decide an optimal partitioning solution π = [a_1, a_2, ..., a_N] with the objective of minimizing the total energy without violating the execution deadline D. Hence, we can formulate the cooperative inference optimization as the following problem P1:

P1:  min  E^c + E^x
     s.t.  T ≤ D,  (1), (2), (3), (4).
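Given the per-layer quantities, the BSP-style totals of Eqs. (9)–(11) and the deadline constraint of P1 can be evaluated for any candidate partition. The following minimal sketch is our illustration, assuming per-layer, per-device cost lists produced, e.g., by the layer_costs helper above.

```python
def total_costs(t_comp, e_comp, t_comm, e_comm):
    """Aggregate per-layer, per-device costs (nested lists indexed as [l][i]).

    Returns (T, E): T follows Eq. (11), summing the slowest device per layer;
    E sums the computation and communication energy of Eqs. (9) and (10).
    """
    T = sum(max(tc + tx for tc, tx in zip(t_comp[l], t_comm[l]))
            for l in range(len(t_comp)))
    E = sum(sum(row) for row in e_comp) + sum(sum(row) for row in e_comm)
    return T, E

def satisfies_deadline(T: float, deadline: float) -> bool:
    """Deadline constraint T <= D of problem P1."""
    return T <= deadline
```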
Theorem 1. Problem P1 is NP-hard.

We prove Theorem 1 by identifying the relation between P1 and a known NP-hard problem; the detailed proof is given in Appendix A.

B. Problem Transformation
A Linear Programming (LP) problem is an optimization towards a linear objective function subject to linear equality or inequality constraints, and an Integer Linear Programming (ILP) problem is a special case where all optimization variables are integers [35]. As proved in Appendix A, P1 is an ILP problem owing to the integer variables a_i. To produce a feasible solution to P1 efficiently, we relax the integer variables and introduce continuous variables λ_i to approximate a_i. Eq. (12) defines the relation between λ_i and a_i, where H is the input's height and λ_i describes the proportion that the i-th partition covers. Since the input of CNN inference is usually of a large size (e.g., typically
224 × 224 from the ImageNet [21] dataset), the approximation error is tiny and tolerable. Eqs. (13), (14), and (15) show the numerical constraints for λ_i, which are derived from Eqs. (1), (2), and (3), respectively.

a_i = λ_i H,  ∀ i ∈ N,   (12)
λ_i H ≥ p_{i+1} · 1{λ_i > 0},  ∀ i ∈ N,   (13)
λ_i ≥ 0,  ∀ i ∈ N,   (14)
Σ_{i∈N} λ_i = 1.   (15)

Eq. (13) is essentially equivalent to the expression λ_i H ≥ p_{i+1} or λ_i = 0. Since λ_i = 0 is a potential solution, it is feasible to separate solving λ_i's value and checking whether λ_i H ≥ p_{i+1} into two steps. Therefore, we relax the constraint Eq. (13) to λ_i H ≥ 0, i.e., λ_i ≥ 0, and P1 is transformed to the following problem P2:

P2:  min  E^c + E^x
     s.t.  T ≤ D,  (4), (14), (15).
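Because P2 has a linear objective in λ and the per-layer maximum inside Eq. (11) can be linearized with auxiliary variables, it can be handed to an off-the-shelf LP solver. The sketch below is our simplification using SciPy's linprog, not CoEdge's CPLEX-based implementation: it assumes device i's layer-l workload scales as λ_i times the layer's full input size and folds the λ-independent padding-transfer terms into constants.

```python
import numpy as np
from scipy.optimize import linprog

def solve_p2(S, pad, devices, b_master, b_next, deadline):
    """LP relaxation P2 (a sketch; symbols follow Section IV, simplifications are ours).

    S[l]        : full input size of layer l (KB); device i's share is lam[i] * S[l].
    pad[l]      : padding volume pulled per device at layer l > 0 (KB); pad[0] unused.
    devices[i]  : DeviceProfile with fields rho, f, m, p_c, p_x.
    b_master[i] : bandwidth from the master to device i (KB/s).
    b_next[i]   : bandwidth from device i to its neighbour (KB/s).
    deadline    : execution deadline D (s).
    Returns the proportions lam (Eq. (12)) or None if the LP is infeasible.
    """
    N, L = len(devices), len(S)
    n_var = N + L                       # variables: lam_1..lam_N, t_1..t_L

    # Objective: total dynamic energy (padding-transfer energy is constant, dropped).
    c = np.zeros(n_var)
    for i, d in enumerate(devices):
        comp = sum(d.p_c * d.rho * S[l] / d.f for l in range(L))   # Eqs. (5)-(6), summed
        dist = d.p_x * S[0] / b_master[i]                          # Eqs. (7)-(8), l = 1
        c[i] = comp + dist

    A_ub, b_ub = [], []
    # Linearize Eq. (11): t_l >= T^c_{l,i} + T^x_{l,i} for every device i.
    for l in range(L):
        for i, d in enumerate(devices):
            row = np.zeros(n_var)
            row[i] = d.rho * S[l] / d.f + (S[0] / b_master[i] if l == 0 else 0.0)
            row[N + l] = -1.0
            const = 0.0 if l == 0 else pad[l] / b_next[i]
            A_ub.append(row); b_ub.append(-const)
    # Deadline: sum_l t_l <= D.
    row = np.zeros(n_var); row[N:] = 1.0
    A_ub.append(row); b_ub.append(deadline)

    # Eq. (15): the proportions sum to one.
    A_eq = np.zeros((1, n_var)); A_eq[0, :N] = 1.0

    # Eq. (14) and a coarse memory bound from Eq. (4); auxiliary t_l >= 0.
    bounds = [(0.0, min(1.0, d.m / max(S))) for d in devices] + [(0.0, None)] * L

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[1.0], bounds=bounds, method="highs")
    return res.x[:N] if res.success else None
```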
The solution to P2, however, is not necessarily feasible for P1. Particularly, in the solution to P2, there may be some devices that are assigned a tiny workload (∃ i ∈ N, 0 ≤ λ_i H < p_{i+1}), while in a solution to P1, the workload size on every device must be larger than or equal to the padding data size unless it is zero (∀ i ∈ N, λ_i H ≥ p_{i+1} or λ_i = 0). Setting aside the potential solution λ_i = 0 to problem P1, the main difference between P1 and P2 is that P1 imposes the lower-bound threshold p_{i+1} on nonzero partitions while P2 does not. Hence, we can exploit the solution to P2 to construct a feasible solution to P1.
Theorem 2. Problem P2 is a Linear Programming (LP) problem.

Theorem 2 (proved in Appendix B) reveals that P2 can be solved efficiently with standard LP solvers, and its solution can be exploited to construct a feasible partitioning for P1.

C. Workload Partitioning Algorithm Design
We propose a threshold-based workload partitioning algorithm for P1 that exploits the relaxed problem P2, as presented in Algorithm 1. The output is the workload partitioning solution to P1. The algorithm first checks whether N is empty: if N is an empty set, there are no available devices to perform cooperative inference, and thus no feasible solution to P1. Otherwise, it solves π from P2.

Algorithm 1 Workload Partitioning Algorithm
Input:
  N: Available devices [1, 2, ..., N]
  L: CNN layers [1, 2, ..., L]
  (k, c_in, c_out, s, p)_{l,i}, ∀ i ∈ N, ∀ l ∈ L: Configuration tuples
  (ρ, f, m, P^c, P^x)_i, ∀ i ∈ N: Resource tuples
  b_{i,j}, ∀ i, j ∈ N: Bandwidths
  D: Execution deadline
Output:
  π: Assigned workload proportions [λ_1, λ_2, ..., λ_N]
Procedure PARTITION(N)
  if N is empty then
    return NULL    ▷ no feasible solution
  Solve π from P2
  if π satisfies Eq. (13) then
    return π
  else
    Find the index set N_0 of zero elements in π
    Find the minimum element λ_m in π
    N ← N − N_0 − {m}
    return PARTITION(N)
  end
end Procedure
After solving π from P2, the algorithm checks whether the obtained π satisfies Eq. (13), the threshold constraint of P1. If so, the current version of π is a feasible solution and is immediately returned. Otherwise, there must be some elements in π that are smaller than the required padding size. In this case, we remove part of these unsatisfied elements from the available devices list: first we remove the zero elements, since a zero workload assignment indicates that the device would not participate in cooperative inference; next we find the minimum of the remaining elements in π and remove it from N. After that, the algorithm goes to the next recursion to acquire a new partitioning solution with the updated N and checks the result for P1 again. Since the recursion depth is at most N (the total number of available devices), the solving process of Algorithm 1 is very fast in practice.
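A compact sketch of Algorithm 1's recursion is given below. This is our illustration, not CoEdge's code; solve_p2_fn stands for any routine that solves the relaxed problem P2 for the current device subset (e.g., the linprog-based sketch above wrapped accordingly), and the neighbour bookkeeping of Eq. (13) is simplified to a per-device threshold.

```python
def partition(devices, solve_p2_fn, pad_threshold, H):
    """Threshold-based recursion of Algorithm 1 (illustrative sketch).

    devices      : list of candidate devices (profiles or indices).
    solve_p2_fn  : callable mapping the current device list to proportions lam,
                   or None if the relaxed problem P2 is infeasible.
    pad_threshold: per-device padding threshold (rows) used in Eq. (13).
    H            : input height, converting lam back to rows via Eq. (12).
    """
    if not devices:                               # no feasible solution to P1
        return None
    lam = solve_p2_fn(devices)
    if lam is None:
        return None
    if all(l == 0 or l * H >= p for l, p in zip(lam, pad_threshold)):  # Eq. (13)
        return lam
    # Drop zero-workload devices plus the one with the smallest nonzero share.
    nonzero = [i for i, l in enumerate(lam) if l > 0]
    drop = {i for i, l in enumerate(lam) if l == 0}
    drop.add(min(nonzero, key=lambda i: lam[i]))
    kept = [i for i in range(len(devices)) if i not in drop]
    return partition([devices[i] for i in kept], solve_p2_fn,
                     [pad_threshold[i] for i in kept], H)
```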
V. PROTOTYPE IMPLEMENTATION

We employ TensorFlow Lite [20] as the backend engine to execute CNN layers, and implement the communication module based on gRPC [37]. In the following, we provide the implementation details of CoEdge.
Deployment and profiling. Since any one of the devices in the environment may launch a CNN inference task, the employed CNN models are trained and installed on all devices in advance. Once a model is installed, we use the TensorFlow benchmark tool to profile the latency of one inference and measure the energy with the Monsoon High Voltage Power Monitor [17].
Fig. 9. Our experimental prototype uses four Raspberry Pis (RPi), one Jetson TX2 and one desktop PC. Their specifications are listed in Tables I, II and III. We employ the Monsoon High Voltage Power Monitor (HVPM) to measure the energy.
For each CNN model, we run it once as a warm-up and then record the execution time over 50 runs without break. The aim of the warm-up run is to alleviate the impact of weight loading and TensorFlow initiation, since we have omitted these overheads in the formulation. The execution tasks on all devices are the same: perform CNN inference on the same image from ImageNet [21]. We take the mean values as the measurement results and derive the resource tuple parameters from them.

The computation frequency f is taken directly from the known specifications. With f and the measured latency, we can estimate the total computing cycles of one inference. Dividing the cycle count by the processed image size yields the computing intensity ρ. We obtain the memory capacity m by observing the available memory space of an idle system. For the power parameters P^c and P^x, we derive them from the measured computation/communication energy and delay.
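This derivation amounts to a few divisions; the sketch below is our illustration of the procedure, reusing the hypothetical DeviceProfile record from Section IV (names are not from CoEdge's code).

```python
def derive_resource_tuple(mean_latency_s, image_size_kb, cpu_freq_hz,
                          free_mem_kb, comp_energy_j, comm_energy_j, comm_time_s):
    """Turn profiling measurements into a (rho, f, m, P^c, P^x) resource tuple.

    The computing cycles of one inference are estimated as f * latency; dividing by
    the processed image size gives the computing intensity rho (cycles per KB).
    Power parameters are measured energy divided by the corresponding delay.
    """
    cycles = cpu_freq_hz * mean_latency_s
    rho = cycles / image_size_kb                 # computing intensity (cycles/KB)
    p_c = comp_energy_j / mean_latency_s         # computation power (W)
    p_x = comm_energy_j / comm_time_s            # transmission power (W)
    return DeviceProfile(rho=rho, f=cpu_freq_hz, m=free_mem_kb, p_c=p_c, p_x=p_x)
```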
Workload partitioning and distribution. To create the workload allocation plan efficiently, we run the workload partitioning algorithm based on IBM ILOG CPLEX [36], a linear programming solver package. If the algorithm returns a feasible solution, we segment the input image accordingly and send the partitions to the corresponding devices. Otherwise, the algorithm returns an infeasible signal, which means the deadline is set too strictly. In this case, we choose to offload all workload to the device that can minimize the end-to-end execution latency.
Runtime communication. During the runtime, each device needs to fetch padding data from its neighboring device. Due to the limited computing capability, a device may still be working on generating its output feature map partition when a padding-pulling request arrives. To accommodate this case, we block the pulling request until the needed data is prepared. Note that such a circumstance is rare since our workload partitioning algorithm has optimized the workload allocation to match devices' computing capabilities. Under this plan, the execution time on each device is reasonably close and the utilization of computing resources is maximized as much as possible. Moreover, our workload partitioning algorithm supposes that the participating devices can communicate well with each other during the runtime. However, devices can accidentally break down or become temporarily unavailable in real-world deployment. This raises robustness issues, which are discussed in Section VIII.
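The blocking behaviour can be realized with a per-layer synchronization primitive; the following sketch uses Python's standard threading module and is our illustration, not CoEdge's gRPC service.

```python
import threading

class PaddingStore:
    """Holds per-layer output partitions; a padding request blocks until data is ready."""

    def __init__(self, num_layers: int):
        self._data = [None] * num_layers
        self._ready = [threading.Event() for _ in range(num_layers)]

    def publish(self, layer: int, feature_map_slice):
        """Called locally once layer `layer`'s output partition has been computed."""
        self._data[layer] = feature_map_slice
        self._ready[layer].set()

    def fetch_padding(self, layer: int, rows: int):
        """Called on behalf of the neighbour; blocks until the requested rows exist."""
        self._ready[layer].wait()
        return self._data[layer][:rows]          # boundary rows adjacent to the neighbour
```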
TABLE III
DESKTOP PC SPECIFICATIONS
Hardware    Specifications
CPU         3.60GHz 8-Core Intel i7-7700
Memory      2666MHz 16GB DDR4
GPU         GeForce GTX 1050 (Pascal), 640 CUDA cores
Power       Idle: 80W; CPU Fully Loaded: 180W; GPU Fully Loaded: 200W

TABLE IV
INFERENCE LATENCY (MS) AND COMPUTATION INTENSITY (CYCLES/KB) OF BASIC IMPLEMENTATION
Model        Raspberry Pi       Jetson TX2         Desktop PC
             Lat.    Inten.     Lat.    Inten.     Lat.    Inten.
AlexNet      302     615        89      301        46      282
VGG-f        276     563        83      283        44      269
GoogLeNet    769     1568       227     772        114     698
MobileNet    226     461        71      239        37      226

VI. PERFORMANCE EVALUATION
In this section, we evaluate the performance of the CoEdge prototype in terms of inference latency and dynamic energy. We also explore the impact of the deadline setting, the system scalability, and the adaptability to network fluctuation.
A. Experimental Setup
Prototype. We implement the CoEdge prototype with six devices: four Raspberry Pi 3, one Jetson TX2, and one desktop PC, as shown in Fig. 9. The Raspberry Pi 3 and the Jetson TX2 represent weak IoT devices and mobile AI platforms. Besides, we take a desktop PC to emulate small edge servers. The specifications of the three types of devices are provided in Tables I, II and III. We employ the Monsoon High Voltage Power Monitor (HVPM) [17] to measure the energy. For bandwidth control, we use the traffic control tool tc [22], which is able to limit the bandwidth under the set value.
Workload. In our prototype, we use TensorFlow Lite [20] to implement four typical CNN models: AlexNet [19], VGG-f [9], GoogLeNet [38], and MobileNet [39], all of which are trained before deployment. Table IV presents the reported latency of the basic implementation and the computing intensity on the different platforms. We set the workload as the image classification task on one ImageNet [21] image. The average inference latency and computation intensity over one hundred runs are taken as the results. During the runtime we turn off all applications except for necessary OS background services.
Approaches. We compare CoEdge with the following related approaches. (1) MoDNN [40] adopts the same piece-wise partitioning mechanism as CoEdge, but decides partition sizes in proportion to the devices' computing capabilities without considering network conditions. (2) Musical Chair [18] is a cooperative inference system that exploits both data and model parallelism. For each layer, it chooses one of the parallelisms and accordingly partitions the workloads in equal proportion.
Fig. 10. The end-to-end latency of different approaches running four DNN models. The deadlines for AlexNet, VGG-f, GoogLeNet, and MobileNet are set as 100ms, 100ms, 200ms, and 100ms, respectively.
Fig. 11. The dynamic energy consumption of different approaches running four DNN models. We use the testbed in Fig. 9, which consists of four Raspberry Pi 3, one Jetson TX2, and one desktop PC. All experimental settings are the same as those in the experiment of Fig. 10.
(3) The local approach executes CNN inference at the master device solely. In our experiment, the local approach is the baseline, and we fix the master device as a certain Raspberry Pi 3.
B. Performance Comparison
Fig. 10 and Fig. 11 show the latency and dynamic energy results of different models with the local approach (Loc), MoDNN (MD), Musical Chair (MC), and CoEdge (CE). The results in these two figures are measured under the same experimental settings, and the maximum bandwidth between devices is fixed at 1MB/s. We set the deadlines for executing the four models as 100ms, 100ms, 200ms, and 100ms, respectively, marked as dashed lines in Fig. 10.

As shown in Fig. 10, CoEdge, Musical Chair and MoDNN always accomplish inference within the deadline. As the most time-consuming option, the local approach is the only one that violates the latency requirement, and CoEdge achieves 7.21× ∼ latency speedup over it. Comparing the local approach with the other ones reflects the power of cooperative inference gained by harvesting vicinal edge resources. Among the three cooperative approaches, Musical Chair takes higher latency than the other two. This is because Musical Chair directly splits the workload in equal proportion, ignoring the resource heterogeneity. CoEdge and MoDNN perform closely in the latency metric, but differ in their energy costs.

As evidence, Fig. 11 shows that CoEdge takes the least dynamic energy consumption compared with the other approaches. CoEdge saves up to 66.9%, 64.9%, 46.0%, and 25.5% energy for the four models, respectively (compared with Musical Chair). Against the baseline (local approach), CoEdge saves 39.2%, 37.8%, 11.5%, and 10.9% energy.
Fig. 12. The dynamic energy consumption of four approaches under varying deadlines, built on the testbed in Fig. 9. The result is recorded as zero if the approach fails to finish inference within the deadline.
Fig. 13. The latency and dynamic energy results of CoEdge with a varying number of devices. The top text indicates which type of device is newly added to the cluster.

CoEdge saves energy prominently for AlexNet and VGG-f, but improves not so much for GoogLeNet and MobileNet. This is attributed to the structure of the CNN models. GoogLeNet's complicated block structure comprises a crowd of layers, which incurs frequent data exchanges in cooperative inference. At the opposite end of the spectrum, MobileNet uses a simplified structure and has been well optimized for local inference on embedded devices, which limits the improvement space for cooperative inference. It is worth noting that the local approach consumes less energy than MoDNN and Musical Chair. The reasons come from two aspects. On the one hand, the local approach does not incur communication costs, while MoDNN and Musical Chair need frequent cross-device communication during the runtime, which consumes energy. On the other hand, the optimization of MoDNN and Musical Chair does not consider the power characteristics of different types of devices, so the workload is processed in an energy-lavish manner. In contrast, by jointly optimizing the computation-communication tradeoff given the devices' computing capabilities and network conditions, CoEdge achieves the lowest energy costs.
C. Performance under Varying Deadlines
In this experiment, we explore how the deadline setting impacts the system performance. We run AlexNet to process one image input. The bandwidths between devices are fixed at 1MB/s. Fig. 12 shows the dynamic energy results as a function of the deadline. To emphasize the deadline constraint, we plot the energy result as zero if an approach fails to accomplish the inference within the deadline. When the latency requirement is very stringent, CoEdge prefers to sacrifice some energy saving for latency reduction. As the requirement loosens, the pressure of satisfying the deadline constraint gradually relaxes and CoEdge shifts its emphasis to energy optimization. When the deadline is adequately slack, it is no longer a binding constraint in our optimization, in which case changing it will not impact the workload allocation plan of CoEdge, and thus the dynamic energy result keeps stable.

D. Scalability
To evaluate CoEdge's scalability, we measure the latency and energy by incrementally adding devices to the experimental cluster. We fix the bandwidth at 1MB/s and set a loose deadline of 500ms. The inference task is to run AlexNet with one image for classification. We add devices in the following order: Raspberry Pi, Raspberry Pi, desktop PC, Raspberry Pi, Raspberry Pi, and Jetson TX2. Fig. 13 presents the measurement results of CoEdge, where the top text shows the added devices in order. With the increase of the cluster scale, both the latency and dynamic energy drop. In particular, there is a distinctive decrease when adding the PC (2 → 3) and the Jetson TX2 (5 → 6).

E. Adaptability to Network Fluctuation
In this experiment, we record the system performance of different cooperative approaches under varying bandwidths. We run AlexNet with one image on the six-device cluster, and the deadline is 100ms, plotted as the dashed line in Fig. 14. For each epoch, CoEdge captures the available bandwidths and triggers a reprogramming of the workload partitioning if the bandwidths change. This reprogramming process incurs tiny overhead, reported as less than 10ms in the experiment. We adjust the bandwidth settings between devices in different periods. The top subfigure in Fig. 14 presents the network fluctuation with bandwidths of 1000KB/s, 750KB/s, 500KB/s, 1250KB/s, 1500KB/s, and 1000KB/s, respectively. As the bandwidth changes, all three approaches vary in their performance. The performance variation comes from two reasons. On the one hand, the communication overhead for necessary data exchange during cooperative inference depends on the network conditions. On the other hand, for CoEdge, diverse bandwidths yield diverse partitioning plans and therefore impact the performance of the cooperation. In most cases, the latency results of the approaches are close. On the energy side, however, CoEdge outperforms the other approaches all the time. In particular, when the bandwidth drops to 500KB/s (Epochs 11-15), only CoEdge's execution satisfies the deadline.
Fig. 14. The latency and dynamic energy results under varying network status. The deadline for CoEdge is set as 100ms.
ELATED W ORK
Previous research efforts in enabling artificial intelligence atthe network edge can be divided into three directions: cloud-assisted execution, local resource exploitation, and multi-device collaboration.
Cloud-assisted execution.
Cloud-assisted approaches of-fload DNN inference workload from local to the cloud fullyor partially [10], [41]–[45]. MCDNN [41] fully offloads DNNcomputation. It creates DNN model variants and selects onefrom them to maximize the accuracy under resource con-straints, while CoEdge does not involve accuracy issues asthe model and data are never modified. Neurosurgeon [42]proposes a partially offloading solution, which decides anintermediate partition point in the DNN structure to keep frontlayers locally and offload rear layers to the cloud. DDNN [44]leverages a similar principle and partitions DNN layers in acloud-edge-device hierarchy. However, it requests to retrainthe DNN model in scheduling, while CoEdge does not re-quire any retraining work. The cloud-assisted approaches havebeen widely-adopted in mobile scenarios, e.g., drone-enabledvehicles tracking [46], robotics-based vision applications [47]–[49]. On these specific cases, the cloud’s functionality isfurther optimized to adjust demands. For example, RILaaS[49] introduces a Robot-Inference-and-Learning-as-a-Serviceplatform with robotics-oriented features such as reliable net-work protocol, secure authentication, and REST front-end API.
Local resource exploitation.
Local approaches keep allcomputation locally and optimize the performance throughhardware specialization or model modification [27], [50]–[53].Hardware specialization generally centers around basic DNNoperations (e.g., matrix multiplication, convolution) to developefficient hardware accelerators, e.g., ARM ML Processor [54],Google Edge TPU [55]. Some other works target to optimizethe utilization of existing hardware. For example, µ Layer[50] accelerates inference in layer granularity by simulta-neously utilizing heterogeneous processors inside an edge device. Model modification typically uses model compressiontechnique, e.g., model sparsification and quantization [51].ReForm [52] reconfigurates CNN models by model pruningand selective computing to reduce inference latency on mobiledevices. On the same goal, libnumber [53] employs quantiza-tion technique to optimize number representation in low-level,reducing both model size and inference latency. CoEdge isorthogonal to such optimizations since it does not apply anystructural modifications to the employed DNN. Multi-device collaboration.
Multi-device approaches exe-cutes DNN inference using a cluster of devices in the edgeenvironment [14], [40], [56]. Within this category, previousworks optimize workload distribution in two ways, layer fusionand workload size adjustment. Layer fusion partitions thefeature maps in a fixed pattern, and distributes workloadwith redundant data - the padding data - to avoid datarequests between devices during the runtime. Under thismechanism, DeepThings [14] fuses front convolutional layersand parallelizes these layers on multiple devices. The follow-up work [25] generalizes the fusion operation to all layersand takes resources heterogeneity into account. It designsa dynamic-programming-based fusion searching strategy toadaptively decide which layers are fused and which layers aredirectly parallelized. Workload size adjustment accommodatesthe workload allocation to minimize the end-to-end inferencelatency. MoDNN [40] segments workload greedily and assignsmore workload to the devices with higher computing capabilitywithout considering the network conditions. Musical Chair[18] introduces a partitioning algorithm integrating data andmodel parallelism, and partitions the feature maps in equalproportion. Based on Musical Chair, the subsequent work[57]–[59] improves distributed CNN execution in terms oflatency performance, scalability, and robustness, respectively.Our work falls into the mutli-device collaboration category,and combines the parallel workflow of layer-fusion techniques[14], [25] and the partitioning mechanism of workload ad-justment approaches [18], [40]. Beyond combining the noveldesigns of these two lines, CoEdge jointly considers availablecomputation and communication resources and improves theworkload allocation on heterogeneous devices via an adaptivealgorithm, which has not been addressed in prior works.VIII. D
VIII. DISCUSSION AND FUTURE DIRECTIONS
In this section, we discuss the limitations and extensions of CoEdge, and provide some future research directions.
Robustness and Generalization.
As a distributed system, a crash of any participant or a network timeout can break down the whole cooperative inference. To increase robustness against such faults, it may be helpful to design the system in a modular fashion [48] or to back up intermediate results periodically. Another direction is to further generalize and optimize the system workflow for more sophisticated model structures. Applying workload partitioning only over the whole network may not fit more complicated architectures well, since the feature maps of the deeper layers usually have smaller height and width.
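As a minimal, hypothetical sketch of the periodic-backup idea (the class and method names below are our own and not part of the CoEdge prototype), a device could persist its latest per-layer partial outputs so that a restarted peer can resume from the most recent layer instead of the raw input:

import pickle
import time

class IntermediateBackup:
    """Periodically snapshot per-layer partial outputs to local storage."""

    def __init__(self, path, interval_s=1.0):
        self.path = path              # file used for the latest snapshot
        self.interval_s = interval_s  # minimum time between snapshots
        self._last = 0.0

    def maybe_save(self, layer_idx, partial_output):
        # Write a snapshot only if enough time has passed since the last one.
        now = time.time()
        if now - self._last >= self.interval_s:
            with open(self.path, "wb") as f:
                pickle.dump({"layer": layer_idx, "output": partial_output}, f)
            self._last = now

    def restore(self):
        # Return (layer_idx, partial_output) of the latest snapshot, or None.
        try:
            with open(self.path, "rb") as f:
                snap = pickle.load(f)
            return snap["layer"], snap["output"]
        except FileNotFoundError:
            return None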
Other optimization objectives.
CoEdge focuses on optimizing the dynamic energy consumption under a preset deadline for CNN inference. Modifying the objective function of the constrained program can steer CoEdge toward other priorities. For example, one can express a performance preference by optimizing a tunable weighted combination of latency and energy. Alternatively, taking static energy consumption into account may produce a more energy-friendly workload allocation plan. Another potential objective is accuracy. Although CoEdge does not sacrifice any accuracy in theory, running DNNs on some small edge devices may still lose precision owing to the limitations of their modest computing capability and the execution mechanisms of DNN frameworks. Characterizing and optimizing such accuracy issues is practically significant for edge deployment.
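As a purely illustrative sketch (the weight $\alpha$ and this exact form are our assumption, not part of CoEdge's formulation), the energy objective could be replaced by a weighted latency-energy combination while keeping the original constraints:

$\min_{\{a_i\}} \;\; \alpha \sum_{l \in \mathcal{L}} \sum_{i \in \mathcal{N}} \left( E_i^{c,l} + E_i^{x,l} \right) \;+\; (1-\alpha) \sum_{l \in \mathcal{L}} \max_{i \in \mathcal{N}} \left( T_i^{c,l} + T_i^{x,l} \right), \qquad \alpha \in [0, 1],$

subject to the original memory and deadline constraints. Setting $\alpha = 1$ recovers the current energy-minimizing behavior, while smaller $\alpha$ biases the partitioning toward lower latency.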
Utilizing edge-oriented resources.
Recent technical progress in edge computing enhancements for computation (e.g., the pluggable Google Edge TPU [55] and Intel Movidius Neural Compute Stick [60]) could potentially benefit CoEdge's performance. For example, by equipping a Raspberry Pi with an Edge TPU, CoEdge may choose to retain the input workload mostly or even entirely in situ. This requires more effort on shaping and utilizing such emerging elastic computing resources. Moreover, improvements on the communication side, e.g., 5G and mmWave, can also boost cooperative edge intelligence.
IX. CONCLUSION
In this paper, we present CoEdge, a distributed DNN computing system that orchestrates cooperative DNN inference over heterogeneous edge devices. We explore the workflow of cooperative inference and formulate it as a constrained optimization problem, which is NP-hard. To solve it efficiently, we design a workload partitioning algorithm that decides an efficient partitioning policy in real time. By jointly optimizing computation and communication, CoEdge can find the optimal workload partitioning plan that minimizes the system energy cost while meeting the execution latency requirements. Experimental evaluations using a realistic prototype show 7.21×∼ × latency speedup over the local approach and up to 25.5%∼ energy saving.

REFERENCES

[1] B. L. R. Stojkoska and K. V. Trivodaliev, "A review of internet of things for smart home: Challenges and solutions,"
Journal of Cleaner Production, vol. 140, pp. 1454–1464, 2017.
[2] F. Shrouf, J. Ordieres, and G. Miragliotta, "Smart factories in industry 4.0: A review of the concept and of energy management approached in production based on the internet of things paradigm," in . IEEE, 2014, pp. 697–701.
[3] M. Gerla, E.-K. Lee, G. Pau, and U. Lee, "Internet of vehicles: From intelligent grid to autonomous cars and vehicular clouds," in . IEEE, 2014, pp. 241–246.
[4] A.-M. Rahmani, N. K. Thanigaivelan, T. N. Gia, J. Granados, B. Negash, P. Liljeberg, and H. Tenhunen, "Smart e-health gateway: Bringing intelligence to internet-of-things based ubiquitous healthcare systems," in . IEEE, 2015, pp. 826–834.
[5] Q. Shi and X. Chen, "Carpool for big data: Enabling efficient crowd cooperation in data market for pervasive ai," IEEE Transactions on Vehicular Technology, 2020.
[6] U. R. Acharya, S. L. Oh, Y. Hagiwara, J. H. Tan, M. Adam, A. Gertych, and R. San Tan, "A deep convolutional neural network model to classify heartbeats," Computers in Biology and Medicine, vol. 89, pp. 389–396, 2017.
[7] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[8] L. Deng, G. Hinton, and B. Kingsbury, "New types of deep neural network learning for speech recognition and related applications: An overview," in . IEEE, 2013, pp. 8599–8603.
[9] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[10] E. Li, L. Zeng, Z. Zhou, and X. Chen, "Edge ai: On-demand accelerating deep neural network inference via edge computing," IEEE Transactions on Wireless Communications, vol. 19, no. 1, pp. 447–457, 2019.
[11] T. Ouyang, Z. Zhou, and X. Chen, "Follow me at the edge: Mobility-aware dynamic service placement for mobile edge computing," IEEE Journal on Selected Areas in Communications, vol. 36, no. 10, pp. 2333–2345, 2018.
[12] X. Chen, Q. Shi, L. Yang, and J. Xu, "Thriftyedge: Resource-efficient edge computing for intelligent iot applications," IEEE Network, vol. 32, no. 1, pp. 61–65, 2018.
[13] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, "Edge intelligence: Paving the last mile of artificial intelligence with edge computing," Proceedings of the IEEE, 2019.
[14] Z. Zhao, K. M. Barijough, and A. Gerstlauer, "Deepthings: Distributed adaptive deep learning inference on resource-constrained iot edge clusters," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3709–3716, 2018.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[20] "Tensorflow benchmark tool," https://github.com/tensorflow/tensorflow/tree/r1.4/tensorflow/tools/benchmark, accessed May 15, 2019.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in
International Conference on Learning Representations, 2014.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[25] L. Zhou, M. H. Samavatian, A. Bacha, S. Majumdar, and R. Teodorescu, "Adaptive parallel execution of deep neural networks on heterogeneous edge devices," in Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, 2019, pp. 195–208.
[26] A. P. Miettinen and J. K. Nurminen, "Energy efficiency of mobile clients in cloud computing," HotCloud, vol. 10, no. 4-4, p. 19, 2010.
[27] S. Yao, Y. Zhao, H. Shao, S. Liu, D. Liu, L. Su, and T. Abdelzaher, "Fastdeepiot: Towards understanding and optimizing neural network execution time on mobile and embedded devices," in Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems. ACM, 2018, pp. 278–291.
[28] Y. Wen, W. Zhang, and H. Luo, "Energy-optimal mobile application execution: Taming resource-poor mobile devices with cloud clones," in . IEEE, 2012, pp. 2716–2720.
[29] Y. Cui, J. Song, K. Ren, M. Li, Z. Li, Q. Ren, and Y. Zhang, "Software defined cooperative offloading for mobile cloudlets," IEEE/ACM Transactions on Networking, vol. 25, no. 3, pp. 1746–1760, 2017.
[30] T. Mohammed, C. Joe-Wong, R. Babbar, and M. Di Francesco, "Distributed inference acceleration with adaptive dnn partitioning and offloading," in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications. IEEE, 2020, pp. 854–863.
[31] M. Mukherjee, V. Kumar, A. Lat, M. Guo, R. Matam, and Y. Lv, "Distributed deep learning-based task offloading for uav-enabled mobile edge computing," in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2020, pp. 1208–1212.
[32] S. Xu, Q. Liu, B. Gong, F. Qi, S. Guo, X. Qiu, and C. Yang, "Rjcc: Reinforcement learning based joint communicational-and-computational resource allocation mechanism for smart city iot," IEEE Internet of Things Journal, 2020.
[33] K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz, "Towards energy-proportional datacenter memory with mobile dram," in . IEEE, 2012, pp. 37–48.
[34] L. G. Valiant, "A bridging model for parallel computation," Communications of the ACM, vol. 33, no. 8, pp. 103–111, 1990.
[35] G. B. Dantzig, Linear Programming and Extensions. Princeton University Press, 1998, vol. 48.
[36] I. I. CPLEX, "V12. 1: User's manual for cplex," International Business Machines Corporation, vol. 46, no. 53, p. 157, 2009.
[37] Google, "gprc - a rpc library and framework," https://grpc.io, accessed December 15, 2019.
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[39] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[40] J. Mao, X. Chen, K. W. Nixon, C. Krieger, and Y. Chen, "Modnn: Local distributed mobile computing system for deep neural network," in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. IEEE, 2017, pp. 1396–1401.
[41] S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy, "Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2016, pp. 123–136.
[42] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, "Neurosurgeon: Collaborative intelligence between the cloud and mobile edge," in ACM SIGARCH Computer Architecture News, vol. 45, no. 1. ACM, 2017, pp. 615–629.
[43] L. Zeng, E. Li, Z. Zhou, and X. Chen, "Boomerang: On-demand cooperative deep neural network inference for edge intelligence on industrial internet of things," IEEE Network, 2019.
[44] S. Teerapittayanon, B. McDanel, and H.-T. Kung, "Distributed deep neural networks over the cloud, the edge and end devices," in . IEEE, 2017, pp. 328–339.
[45] H.-J. Jeong, H.-J. Lee, C. H. Shin, and S.-M. Moon, "Ionn: Incremental offloading of neural network computations from mobile devices to edge servers," in Proceedings of the ACM Symposium on Cloud Computing, 2018, pp. 401–411.
[46] L. Ballotta, L. Schenato, and L. Carlone, "Computation-communication trade-offs and sensor selection in real-time estimation for processing networks," IEEE Transactions on Network Science and Engineering, 2020.
[47] S. P. Chinchali, E. Cidon, E. Pergament, T. Chu, and S. Katti, "Neural networks meet physical networks: Distributed inference between edge devices and the cloud," in Proceedings of the 17th ACM Workshop on Hot Topics in Networks, 2018, pp. 50–56.
[48] S. Chinchali, A. Sharma, J. Harrison, A. Elhafsi, D. Kang, E. Pergament, E. Cidon, S. Katti, and M. Pavone, "Network offloading policies for cloud robotics: a learning-based approach," in Robotics: Science and Systems, 2019, pp. 1–10.
[49] A. K. Tanwani, R. Anand, J. E. Gonzalez, and K. Goldberg, "Rilaas: Robot inference and learning as a service," IEEE Robotics and Automation Letters, 2020.
[50] Y. Kim, J. Kim, D. Chae, D. Kim, and J. Kim, "µLayer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization," in Proceedings of the Fourteenth EuroSys Conference 2019, 2019, pp. 1–15.
[51] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[52] Z. Xu, F. Yu, C. Liu, and X. Chen, "Reform: Static and dynamic resource-aware dnn reconfiguration framework for mobile device," in Proceedings of the 56th Annual Design Automation Conference 2019, 2019, pp. 1–6.
[53] Y. H. Oh, Q. Quan, D. Kim, S. Kim, J. Heo, S. Jung, J. Jang, and J. W. Lee, "A portable, automatic data qantizer for deep neural networks," in Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques. IEEE, 2020, pp. 157–163.
[57] R. Hadidi, J. Cao, M. S. Ryoo, and H. Kim, "Towards collaborative inferencing of deep neural networks on internet of things devices," IEEE Internet of Things Journal, 2020.
[58] J. Cao, F. Wu, R. Hadidi, L. Liu, T. Krishna, M. S. Ryoo, and H. Kim, "An edge-centric scalable intelligent framework to collaboratively execute dnn," in Demo for SysML Conference, Palo Alto, CA, 2019.
[59] R. Hadidi, J. Cao, M. S. Ryoo, and H. Kim, "Robustly executing dnns in iot systems using coded distributed computing," in
Proceedings of the 56th Annual Design Automation Conference 2019.

APPENDIX A
PROOF OF THEOREM 1

Proof.
We reduce the $P||C_{max}$ problem to a special case of $P_1$; since the $P||C_{max}$ problem is NP-hard, $P_1$ is at least as hard as the $P||C_{max}$ problem. Firstly, we identify $P_1$'s decision variables $a_i$; the constraints (1), (2), and (3) limit them to a range of nonnegative integers. Using $a_i$, we can obtain the initial workload on each device by multiplying $a_i$ by the data size of each row. Since the input feature maps of each layer are the output of the prior layer, we can derive the workload of each layer based on its specific configuration. For example, for a convolution operation, given an input feature map partition of size $(H, W, C_{in})$ (Height, Width, Channels) and a convolution kernel $(k, C_{in}, C_{out}, s, p)$, the output size is $\left(\frac{H-k+2p}{s}+1,\ \frac{W-k+2p}{s}+1,\ C_{out}\right)$. Therefore, we can express $r_i^l$ linearly using $a_i$. The same holds for $T_i^{c,l}$, $T_i^{x,l}$, $E_i^{c,l}$, and $E_i^{x,l}$ according to Eq. (5), (7), (9), and (10). For the deadline constraint $T = \sum_{l \in \mathcal{L}} \max_{i \in \mathcal{N}} (T_i^{c,l} + T_i^{x,l}) \leq D$, we transform it into a series of inequalities. Assuming a sub-deadline $D^l$ for processing layer $l$, we have $\max_{i \in \mathcal{N}} (T_i^{c,l} + T_i^{x,l}) \leq D^l$, which is equivalent to $T_1^{c,l} + T_1^{x,l} \leq D^l$, $T_2^{c,l} + T_2^{x,l} \leq D^l$, $\cdots$, $T_N^{c,l} + T_N^{x,l} \leq D^l$. Without loss of generality, we apply this transformation to all layers and obtain $N \cdot L$ inequalities in total, i.e., $T_i^{c,l} + T_i^{x,l} \leq D^l, \forall i \in \mathcal{N}, \forall l \in \mathcal{L}$. Given that $T_i^{c,l}$ and $T_i^{x,l}$ are linear in $a_i$, these inequalities are linear. In conclusion, all the expressions in $P_1$ are linear in $a_i$. Letting $a_i$ be the jobs to schedule and all power parameters be 1, we can reduce the $P||C_{max}$ problem to a special case of $P_1$. Since the $P||C_{max}$ problem is NP-hard, $P_1$ is NP-hard as well.
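To make the per-layer linearity argument concrete, the following small sketch (illustrative only; the helper names and layer configurations are hypothetical and not part of the paper's prototype) propagates a device's initial partition height $a_i$ through a stack of convolution layers using the output-size formula above; ignoring integer rounding, every resulting height is an affine function of $a_i$:

def conv_out_height(h, k, s, p):
    # Output height of a convolution: (h - k + 2p) / s + 1.
    # The proof treats this as an exact linear expression in h,
    # ignoring the integer rounding applied here.
    return (h - k + 2 * p) // s + 1

def partition_heights(a_i, layers):
    # Partition height a device must process at each layer, starting
    # from its initial input partition of a_i rows.
    heights = [a_i]
    h = a_i
    for k, s, p in layers:  # (kernel size, stride, padding) per conv layer
        h = conv_out_height(h, k, s, p)
        heights.append(h)
    return heights

# Hypothetical 3-layer configuration.
print(partition_heights(56, [(3, 1, 1), (3, 2, 1), (5, 1, 2)]))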
APPENDIX B
PROOF OF THEOREM 2

Proof. As we have discussed in the proof of Theorem 1, the objective function, the memory constraint, and the deadline constraint are linear in $a_i$. In $P_1$