CoEdge: Cooperative DNN Inference with Adaptive Workload Partitioning over Heterogeneous Edge Devices
Liekang Zeng, Student Member, IEEE, Xu Chen, Senior Member, IEEE, Zhi Zhou, Member, IEEE, Lei Yang, Senior Member, IEEE, and Junshan Zhang, Fellow, IEEE
Abstract—Recent advances in artificial intelligence have driven increasing intelligent applications at the network edge, such as smart home, smart factory, and smart city. To deploy computationally intensive Deep Neural Networks (DNNs) on resource-constrained edge devices, traditional approaches have relied on either offloading workload to the remote cloud or optimizing computation at the end device locally. However, the cloud-assisted approaches suffer from the unreliable and delay-significant wide-area network, and the local computing approaches are limited by the constrained computing capability. Towards high-performance edge intelligence, the cooperative execution mechanism offers a new paradigm, which has attracted growing research interest recently. In this paper, we propose CoEdge, a distributed DNN computing system that orchestrates cooperative DNN inference over heterogeneous edge devices. CoEdge utilizes available computation and communication resources at the edge and dynamically partitions the DNN inference workload adaptive to devices' computing capabilities and network conditions. Experimental evaluations based on a realistic prototype show that CoEdge outperforms status-quo approaches in saving energy with close inference latency, achieving up to 25.5% ∼ 66.9% energy saving across four widely-adopted CNN models.

Index Terms—Edge Intelligence, Cooperative DNN Inference, Distributed Computing, Energy Efficiency.
I. INTRODUCTION

Recent years have witnessed an ever-increasing number of Internet of Things (IoT) devices diving into miscellaneous application domains, e.g., smart home [1], smart factory [2], autonomous driving [3], etc. This trend also drives the community to build smarter, faster, and greener intelligent applications at the network edge, pushing remarkable progress in smart healthcare, security inspection and disease detection [4]–[6]. Meanwhile, advances in Deep Neural Networks (DNNs) have shown unprecedented ability in learning abstract representations and extracting high-level features, promoting significant improvement in processing human-centric contents [7]. Motivated by this success, it is envisioned that employing DNNs on edge devices would enable and boost intelligent
L. Zeng, X. Chen and Z. Zhou are with the School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, 510006 China (e-mail: [email protected], [email protected], [email protected]). L. Yang is with the Department of Computer Science and Engineering, University of Nevada, Reno, NV, 89557 USA (e-mail: [email protected]). J. Zhang is with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, 85287 USA (e-mail: [email protected]).
services, supporting brand new smart interactions between humans and their physical surroundings.

The essential demand of these services is to respond to users' queries timely, e.g., recognizing voice commands [8], inspecting visitors' faces [9], and detecting heartbeat frequency [6], all within a matter of milliseconds. This also implies a soft-realtime requirement: if the result comes late, the user may turn to other applications, and the result can even be out of date and meaningless. Therefore, minimizing response latency and promising users' experience is of paramount importance. However, DNN-based applications are typically computation-intensive and resource-hungry [7], and service providers traditionally appeal to the resource-abundant cloud to satisfy the stringent responsiveness requirement [10]. Yet the Quality-of-Service (QoS) can still be poor and unsatisfactory due to the unreliable and delay-significant wide-area network connection between edge devices and the remote cloud [11], [12]. What's worse, for many smart applications with humans in the loop, the sensory data can contain highly sensitive or private information. Offloading these data to a remote datacenter owned by curious commercial companies inevitably raises users' privacy concerns.

Intuitively, keeping data locally and processing tasks without external remote assistance will preserve user privacy and avoid the remote network transmission. Unfortunately, local edge devices generally have limited computing capability, making it hard to fulfill DNN execution under the latency Service-Level-Objective (SLO). For instance, if a smart home camera runs CNN-based face recognition to provide real-time inspection and warning, the response delay when running the DNN locally may last for a few seconds, resulting in poor user experience and a completely unusable service.

Fig. 1. An example of cooperative DNN inference in a smart home scenario. As the raw image is captured, device A decides a cooperative execution plan and distributes the workload to devices B and C. According to the plan, the devices perform cooperative inference in response to the DNN task.
Fig. 2. Conventional CNN inference workflow, which is typically in two stages. In the first stage, the CNN processes the input image to extract hidden features through operations like convolution and pooling, and generates multidimensional feature maps. In the second stage, the CNN classifies the feature maps by fully-connected layers and obtains the inference result.
To tackle these challenges, a promising approach is to exploit available computation resources in the proximity of the data source with the emerging edge intelligence paradigm [13]. Instead of uploading data to the remote cloud or keeping all computation at a single local device, edge intelligence enjoys real-time response as well as privacy preservation by offloading computing workload within a manageable range. As Fig. 1 illustrates, we can utilize the diverse computing resources in a smart home (with inspection camera, smartphone, tablet, and desktop PC) to accelerate the CNN-based face recognition. Specifically, the source device A can distribute the inference workload to devices B and C, and perform cooperative inference via a high-bandwidth local wireless connection (e.g., WiFi). Nevertheless, this paradigm brings some key challenges to be addressed: (1) how to decide the workload assignment tailored to the resource heterogeneity of edge devices, (2) how to optimize the system performance in the presence of network dynamics, and (3) how to orchestrate computation and communication during the cooperative inference runtime.

To answer these questions, we propose CoEdge (Cooperative Edge), a runtime system that orchestrates cooperative DNN inference over multiple heterogeneous edge devices. CoEdge does not apply any structural modifications or tuning requirements to the given DNN model, and does not sacrifice model accuracy as it preserves the input data and model parameters of the given DNN model. CoEdge employs a similar parallel workflow as DeepThings [14], where the input is split initially and the execution is parallelized on multiple devices at runtime. While DeepThings leverages a layer fusion technique to reduce communication overhead, CoEdge proposes to optimize workload allocation to maximally utilize heterogeneous edge resources. By optimizing the computation-communication tradeoff, CoEdge optimally partitions the input inference workload, where the partitions' sizes are chosen to match devices' computing capabilities and network conditions to improve system performance in both latency and energy metrics. We implement CoEdge using a realistic prototype with Raspberry Pi 3, Jetson TX2, and desktop PC. Experimental evaluations show 7.21× ∼ latency speedup over the local approach and up to 25.5% ∼ 66.9% energy saving over state-of-the-art cooperative approaches. The main contributions of this paper are as follows.

• We propose CoEdge, a distributed DNN computing system that orchestrates cooperative inference over heterogeneous devices to minimize system energy consumption while promising the response latency requirement.

• We identify the impacts of workload partitioning on the cooperative inference workflow, and build a constrained programming model for workload distribution optimization. We prove the NP-hardness of the problem, and devise a fast approximation algorithm to decide an efficient partitioning policy in real time, tailored to devices' diverse computing capabilities and network conditions.

• We implement a multi-device prototype using heterogeneous edge devices, and evaluate CoEdge on four widely-adopted DNN models to corroborate its superior performance.

The rest of this paper is organized as follows. Section II briefly reviews background on DNN inference, and discusses opportunities and challenges based on a case of cooperative inference. Section III presents the CoEdge design and its workflow. Section IV builds the system model and describes our workload partitioning algorithm. We explain implementation details in Section V and evaluate the prototype in Section VI. Section VII reviews related work.
Section VIII discusses the limitations and extensions of CoEdge, and Section IX concludes. The appendix (in the supplementary material) details the proofs of Theorems 1 and 2.

II. BACKGROUND AND MOTIVATION
In this section, we briefly review conventional CNN inference and cooperative inference. We study a case of cooperative inference and discuss the potential challenges behind it.
A. Deep Neural Network Inference
In this paper, we focus on the classical Convolutional Neural Networks (CNNs) as they are widely adopted across a broad spectrum of intelligent services, including image classification, object detection, and semantic segmentation.

Fig. 2 depicts a conventional CNN inference for an image classification task from the perspective of feature maps. As we can see, a conventional CNN inference can be viewed as a series of successive algorithmic operations on feature maps. These operations comprise convolution, pooling, batch normalization, activation, and fully-connected computation, etc. (for ease of illustration, only some of the operations are drawn in Fig. 2). In light of the functionality of the operations, the inference process can be separated into two stages. The first stage is the feature extraction stage, where the model processes every pixel in the input image to generate hidden feature representations. Following that, at the second stage, these features are classified by the fully-connected layers, exporting results in a probabilistic form.

TABLE I
RASPBERRY PI SPECIFICATIONS [15]
Hardware    Specifications
CPU         1.2GHz Quad Core ARM Cortex-A53
Memory      1GB LPDDR2 900MHz
GPU         No GPU
Power       Idle: 1.3W; Fully Loaded: 6.5W; Average Observed: 3W

TABLE II
JETSON TX2 SPECIFICATIONS [16]
Hardware    Specifications
CPU         2.0GHz Dual Denver 2 + 2.0GHz Quad Core ARM Cortex-A57
Memory      8GB LPDDR4 1.6GHz
GPU         Pascal Architecture, 256 CUDA Cores
Power       Idle: 5W; Fully Loaded: 15W; Average Observed: 9.5W
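To make the two-stage view concrete, the toy model below stacks a convolution/pooling feature extractor and a fully-connected classifier with the TensorFlow Keras API. This is purely an illustrative sketch of the structure described above, not one of the networks evaluated later in the paper.

```python
import tensorflow as tf

def two_stage_cnn(num_classes: int = 4) -> tf.keras.Model:
    # Feature extraction stage: convolution and pooling produce feature maps.
    feature_extractor = [
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
    ]
    # Classification stage: fully-connected layers map features to class probabilities.
    classifier = [
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ]
    return tf.keras.Sequential(feature_extractor + classifier)
```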
B. Case Study: Cooperative Inference with Two Devices
The key impediment to deploying CNNs at the network edge lies in the gap between the intensive CNN inference computation and the limited computing capability of edge devices. To bridge this gap, we can utilize the cooperative inference mechanism to exploit available computing resources at the edge. A straightforward solution, for example, is the master-worker paradigm that offloads inference workload to external infrastructure. To obtain a better understanding of cooperative inference, we use a real hardware testbed to emulate this solution.

As a case study, we employ a Raspberry Pi 3 and a Jetson TX2, representing weak IoT devices and mobile AI platforms at the edge, respectively. Tables I and II present their specifications and reported power parameters, which are measured with the Monsoon High Voltage Monitor [17] using the methodology in [18]. For each inference task, we input one single image to the Pi and then offload a part of the image to the Jetson. The two devices parallelize the DNN execution and their results are finally aggregated at the Pi as output. We measure the end-to-end latency of this process, i.e., from the image input to the inference result output, and we record the average latency of fulfilling the inference task over 100 runs. We implement AlexNet [19] with TensorFlow Lite [20] on both devices, and run the model with the same image from ImageNet [21]. For the bandwidth between the two devices, we fix it at 1MB/s using the traffic control tool tc [22].

We define the offloading ratio to indicate how much data is offloaded from the Raspberry Pi to the Jetson TX2. For instance, when the ratio is 0.5, we split the input image along the height into two equal parts, and transfer one of them to the Jetson TX2. In particular, a zero ratio indicates performing inference at the Raspberry Pi locally, while a ratio of 1.0 means offloading all workload to the Jetson TX2.
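The offloading ratio simply determines how many rows of the input image are shipped to the cooperator; the sketch below is our illustration of this split for the case study setup, not code from the CoEdge prototype.

```python
import numpy as np

def split_by_offloading_ratio(image: np.ndarray, ratio: float):
    """Split an (H, W, C) image along the height dimension.

    ratio = 0.0 keeps the whole input on the Raspberry Pi (local inference);
    ratio = 1.0 offloads the whole input to the Jetson TX2.
    """
    height = image.shape[0]
    offloaded_rows = int(round(height * ratio))
    local_part = image[: height - offloaded_rows]      # processed on the Pi
    offloaded_part = image[height - offloaded_rows:]   # transferred to the Jetson
    return local_part, offloaded_part

# Example: a 0.5 ratio sends the lower half of a 224x224 RGB image to the Jetson.
local_half, remote_half = split_by_offloading_ratio(np.zeros((224, 224, 3)), 0.5)
```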
Fig. 3. The total latency and energy consumption under varying offloading ratio, i.e., the proportion of data that is offloaded from the Raspberry Pi 3 to the Jetson TX2.

Fig. 3 shows the latency and energy overheads under varying offloading ratios. Through this experiment, we derive the following observations.

• Jetson TX2 enjoys better performance than Raspberry Pi 3. When the ratio is zero, the system consumption is only the computation cost of the Pi, while in the 1.0 case, the total cost comprises the input offloading overhead, the DNN computation overhead on the Jetson, and the overhead of transferring the result back. However, the former still incurs higher cost than the latter in terms of both latency and energy. Note that fully offloading workload to the Jetson (i.e., offloading ratio = 1) may not necessarily yield the lowest cost if the network fluctuates.

• Cooperative inference is more economical than local inference given a favourable network condition. As the offloading ratio increases, both latency and energy drop. In other words, the system cost decreases via harvesting the cooperator's computing resources.

• The latency curve drops faster as the offloading ratio increases. This is because the DNN execution is parallelized in cooperative inference and the end-to-end inference latency is determined by the slower device (the straggler). Therefore, in high-bandwidth environments, assigning more workload to the Jetson TX2 yields larger performance improvement.
C. Merits and Challenges
We see from the above observations that the cooperative mechanism has the potential to improve inference performance with multiple devices, which is exactly what edge scenarios possess. More precisely, we envision the deployment of the cooperative inference system in an environment such as a smart home or smart factory, wherein the devices are managed by the same owner and thus are willing to cooperate and share their resources. This brings several major merits as well as challenges.
Merits. On the one hand, compared with local inference, cooperative inference has significant potential in reducing latency and energy costs by harvesting idle computing resources at the edge. On the other hand, unlike the cloud approach that uploads data to a remote datacenter, the cooperative approach keeps data within the user's control scope, therefore avoiding the delay-significant wide-area network as well as privacy issues.
Challenges. To effectively exploit computing resources at the edge, we need to felicitously factor in the computing capabilities of edge devices, considering both their magnitude and heterogeneity. Also, given the network dynamics inherent in edge computing, an efficient workload allocation solution that jointly considers the systematic costs is desired. More specifically, it is crucial to decide which devices participate in the cooperative inference and how much workload each device affords. Besides, since the cooperative mechanism parallelizes CNN inference in a distributed manner, the system needs to orchestrate the computation and communication over multiple devices.

We address these challenges by designing a cooperative system, CoEdge, which orchestrates the available resources from heterogeneous edge devices.

III. COEDGE DESIGN AND WORKFLOW
In this section, we present the CoEdge design and the workflow of cooperative CNN inference. We further explore how workload partitioning impacts parallel processing in terms of computation and communication.
A. CoEdge Design
For ease of illustration, we differentiate between devices based on their roles in the cooperation. For the device that launches a CNN inference task, we label it as the master device, and a device that joins the cooperation is marked as a worker device. The master device is responsible for registering participating devices, generating a feasible workload partitioning plan, and managing the cooperative inference over worker devices. Note that a device can be the master and a worker at the same time since it can retain CNN workload in situ.

Fig. 4 illustrates the architecture overview of CoEdge, which works in two phases, namely the setup phase and the runtime phase. In the setup phase, CoEdge records the execution profiles of each device. In the runtime phase, CoEdge creates a cooperative inference plan that determines the workload partitions and their corresponding assignment, using the profiling results collected in the setup phase and the network status. According to the cooperation plan, the master distributes the workload partitions to the workers and then performs cooperative execution collaboratively.
Setup phase. Whenever a CNN-based application is installed, the Device Profiler runs the CNN models locally and records the Profiling Results. These results sketch the device's computing capability, including the computation intensity, computation frequency, and power parameters, which will be detailed in Section IV-A.
Runtime phase. The runtime phase starts when the master raises a CNN inference query. As the image arrives, the master establishes connections with worker devices and pulls their profiling results. Since the size of the profiling results is very small (tens of bytes in our prototype), the overhead of transferring them is negligible. As the master receives the profiling data, the partitioning engine in the master device generates a workload allocation plan using the adaptive workload partitioning algorithm (explained in Section IV-C). According to the plan, the DNN execution runtime distributes the workload partitions to workers and performs cooperative inference in response to the query.
Fig. 4. CoEdge architecture overview, which works in two phases. In the setup phase, the devices profile parameters to sketch their computing capability information. In the runtime phase, the master device creates a partitioning plan using the collected profiling results. According to the plan, the master device distributes the workload and performs cooperative inference with the worker devices.
B. Cooperative Inference Workflow
In this work, we exploit model parallelism to partition CNN inference over multiple devices. Under model parallelism, the CNN model parameters are divided into subsets and assigned to multiple edge nodes. With its respective parameters, each device accepts a necessary part of the input feature maps and generates a portion of the output feature maps. Concatenating all these portions yields the complete output of each layer.

Fig. 5 provides an instance of the cooperative inference workflow with three devices from the perspective of feature maps. The cooperative inference begins when the image is piece-wise split into partitions. Note that to accommodate devices' heterogeneity, the partition sizes are differentiated to match device capabilities. The partitions are then distributed from the master to three devices (i.e., devices A, B, and C in Fig. 5). At the feature extraction stage, the three devices execute their workload in parallel, while at the classification stage, their execution results are aggregated at one of them (device B in Fig. 5). This aggregation is to avoid excess communication overhead caused by the nature of fully-connected computation, which requires repeated data access on the feature vectors.
Generalization. Based on the workflow in Fig. 5, it is feasible to accommodate various CNNs with complex structures by redesigning some details. For example, for CNNs without fully-connected layers (e.g., Network in Network [23]), we can remove the classification stage in Fig. 5. To adapt to CNNs with skip connections (e.g., ResNet [24]), we can keep the intermediate output results on each device at the shortcut starting point and release these data at the shortcut destination to collect them when needed.
Fig. 5. Cooperative CNN inference workflow of CoEdge. The input image is piece-wise partitioned into patches before execution. In the feature extraction stage, these patches are distributed to devices A, B and C, respectively, and then in the classification stage, the feature map fragments are aggregated to finish the remaining execution.

Fig. 6. Example of a convolution operation for cooperative inference. The input feature map partitions locate at devices A and B, respectively. To generate the output feature map through the convolution kernel, device A needs to pull padding data from device B.

C. Impact of Workload Partitioning
The way of piece-wise partitioning significantly affects the communication between devices, especially for convolution operations that process data across partition boundaries. For instance, Fig. 6 shows a typical convolution operation with two partitions. To compute convolution over its own partition with the given kernel, device A needs to fetch the margin rows in device B's partition. In general, for a kernel whose size k is greater than 1, each device needs to pull padding data of ⌊k/2⌋ rows along the split dimension from the neighboring device. In some extreme cases, when the kernel size is very large but the neighboring partition size is very small, the padding range may even span three or more devices, which could incur extravagant communication overhead.

To reduce the communication between devices, some prior works [14], [25] exploit sending redundant data in advance to avoid the padding issue. However, while transferring redundant data takes additional communication cost, preparing necessary data beforehand for a number of CNN layers incurs extra storage overhead. In this work, we address the padding issue by imposing a principle that requires the allocated partition size on the neighboring device to be not smaller than the padding size, unless it owns no partition. This principle ensures that the padding data can always be acquired from only the neighboring device as long as it has data. That is, the transmission of the padding data merely happens once, and thus we reduce the overhead of establishing additional connections. To illustrate this, Fig. 7 shows the communication pattern of the example in Fig. 5. Initially, the input image is partitioned and distributed to the corresponding devices, along with the padding data for the first convolution layer. For the following layers, each device only connects to its neighboring device and fetches padding data for convolution computation. This pattern holds until all convolutions are completed, and then the separated feature map partitions are aggregated to one of the devices for fully-connected computation. The inference result is finally returned to a user-specified device (device C, for example, in Fig. 7).
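For a convolution with kernel size k, the halo that must be pulled from the neighbouring partition is ⌊k/2⌋ rows; the following small sketch (our illustration, not CoEdge code) captures this together with the partition-size principle described above.

```python
def required_halo_rows(kernel_size: int) -> int:
    """Rows of padding data to pull from the neighbouring device (Section III-C)."""
    return kernel_size // 2

def can_serve_halo(neighbor_rows: int, kernel_size: int) -> bool:
    """Partition-size principle: the neighbour either holds no data, or holds at
    least as many rows as the halo it must serve, so one transfer always suffices."""
    return neighbor_rows == 0 or neighbor_rows >= required_halo_rows(kernel_size)
```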
Fig. 7. The communication pattern of a CoEdge runtime instance with three devices. For convolution computation, each device pulls the necessary padding data from its neighboring device. For fully-connected computation, the feature map partitions are aggregated. The final inference result is transferred to a user-specified location, i.e., device C in the figure.

Under this principle, finding an appropriate workload assignment matters significantly for system performance. For instance, offloading a large portion of workload to a device that owns high bandwidth but poor computing capability may not lead to a lower execution latency. To deploy cooperative inference optimally, we need to match the assigned partition size to the computation and communication resources of each device. We achieve this goal by designing a workload partitioning algorithm that is adaptive to the computing capabilities of the available devices and the dynamic network status.

IV. ADAPTIVE WORKLOAD PARTITIONING
The objective of optimizing workload allocation is to improve cooperative inference performance in both latency and energy metrics. For simplicity of problem definition, we target meeting the latency requirement while minimizing the energy costs. Assuming an execution deadline D, the optimization problem is to optimally allocate the workload so that the system energy consumption is minimized while promising the execution deadline D, given the available computation and communication resources.

In the following, we present a detailed formulation of this problem and our workload partitioning algorithm.

A. Problem Formulation
We assume that the devices are available and relatively stable during the inference runtime. This can be relevant as executing an inference task typically takes a period of seconds, and many edge environments are maintained statically in independent spaces, such as smart homes and smart factories. Besides, the underlying support of intelligent services in such scenarios usually employs a few commonly-adopted DNN models and frequently runs similar types of DNN inference tasks. Therefore, we suppose that the DNN models have been loaded ahead of inference queries, and can be used to compute input tensors as soon as the necessary data are prepared.

Since a CNN model typically encompasses many layers, we model the cooperative inference process from single-layer to multi-layer, progressively. The single-layer formulation focuses on sketching the workload partitioning constraints and shaping the performance of a single layer, while the multi-layer part aims at summarizing the system behavior for the whole workflow. Prior to that, we define the necessary concepts and notations as follows (a compact sketch of these tuples as data structures is given after this list):

• A layer l is an algorithmic operation in a CNN model. In our formulation, a layer refers to either a convolution (Conv) or a fully-connected (FC) layer. Given a CNN model, L = [1, 2, ..., L] denotes the layers in order.

• A partitioning solution π is a group of coterminous partitions of the input image, which is generated by piece-wise partitioning along one dimension. For the input partition assigned to device i, a_i represents the number of rows that it covers. Hence, given the devices' indices N = [1, 2, ..., N], π = [a_1, a_2, ..., a_N]. We denote the workload as the input feature map partition to be processed at each DNN layer. For layer l, the workload size of the i-th partition is r_{l,i}, which can be obtained by calculating the partition's data size.

• A configuration tuple (k, c_in, c_out, s, p)_{l,i} denotes the l-th layer's computation task on the i-th partition, which is characterized by the layer's configuration, i.e., convolution kernel size k, input channels c_in, output channels c_out, stride s, and padding p. This tuple is applicable to both convolution and FC layers since FC computation can be viewed as a special case (the tuple depicts a fully-connected computation when the input feature map's size is 1 × 1 × c_in, the output feature map's size is 1 × 1 × c_out, kernel size k = 1, stride s = 1, and padding p = 0). In particular, as discussed in Section III-C, the padding size of convolutional layers is supposed to be smaller than the size of the partition on the last neighboring device, unless it owns no data. We formulate this principle as Eq. (1), where 1{a_i > 0} is an indicator function that takes value 1 if a_i > 0 and 0 otherwise. This constraint is essentially equivalent to the disjunction of a_i ≥ p_{i+1} and a_i = 0.

• A resource tuple (ρ, f, m, P^c, P^x)_i specifies the resource profile of device i. Here, ρ is defined as the computing intensity (in processing cycles per 1KB input) of the given DNN model, which is measured by application-driven offline profiling [26] in the setup phase. f is the device's CPU frequency, reflecting its computing capability at a coarse granularity. m is the available maximum memory capacity for inference tasks. For a single device that only processes CNN workloads, m is the volume of memory excluding the space taken by the underlying system services, e.g., I/O services, compiler, etc. P^c and P^x denote the computation power and the wireless transmission power, respectively.
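For concreteness, the configuration and resource tuples above can be recorded as simple data structures; the sketch below is our illustration, and the field names are not taken from CoEdge's implementation.

```python
from dataclasses import dataclass

@dataclass
class LayerConfig:
    """Configuration tuple (k, c_in, c_out, s, p) of one layer's task on one partition."""
    k: int        # convolution kernel size (1 for FC layers)
    c_in: int     # input channels
    c_out: int    # output channels
    s: int        # stride
    p: int        # padding

@dataclass
class DeviceProfile:
    """Resource tuple (rho, f, m, P^c, P^x) of one device."""
    rho: float    # computing intensity (cycles per KB of input)
    f: float      # CPU frequency (cycles per second)
    m: float      # memory available for inference workloads (KB)
    p_c: float    # computation power (W)
    p_x: float    # wireless transmission power (W)
```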
1) Single-Layer Formulation:
There are some numerical constraints on the partition sizes. Eq. (1) imposes the size restriction with respect to the padding size as discussed in Section III-C, and Eq. (2) claims that a_i is a nonnegative integer. Eq. (3) states that the concatenated size of all partitions along the height dimension equals this dimension's size H. The piece-wise partitioning can be conducted along either the height H or the width W of the input. In our experiments, we split along the height H without loss of generality.

a_i ≥ p_{i+1} · 1{a_i > 0},  ∀ i ∈ N,   (1)
a_i ≥ 0, a_i ∈ Z,  ∀ i ∈ N,   (2)
Σ_{i∈N} a_i = H.   (3)

The workload size r_{l,i} of a partition is constrained by the device's available memory capacity m_i, as in Eq. (4). Here, we only limit the memory footprint to the size of per-layer inputs for the sake of simplicity, while the runtime memory may not be exactly r_{l,i}. For practical deployment cases, emerging techniques for characterizing the detailed CNN execution memory (e.g., [27]) can be adopted, and the deep learning platform-related memory footprint can be added to the left-hand side of Eq. (4) as an enhancement.

r_{l,i} ≤ m_i,  ∀ i ∈ N, l ∈ L.   (4)

During a single layer's execution, the system spends time and energy on two aspects, computation and communication. For computation, we calculate the latency and energy by first approximating the computing cycles of the given partitions. As demonstrated in previous empirical studies [26], [28], [29], for many data processing tasks as exemplified by data encoding and decoding, the required computing cycles are proportional to their input data sizes. This means that a constant computing intensity (in computing cycles per unit data) exists for such tasks, and we can use it to capture the effective computing capability of a specific device. Existing literature, such as [30]–[32], has leveraged this observation to characterize deep learning workloads, and in this work, we adopt it to estimate the amount of computing cycles given the partitions and DNN layers. Concretely, in Eq. (5), we assess the total processing cycles of the i-th partition by multiplying the device's computing intensity ρ_i with the workload size r_{l,i}. Moreover, for each respective DNN layer, CNN inference typically conducts a feed-forward execution without any branch operation or recurrent computation [7], indicating that its execution latency is approximately linear in the computing cycles. Therefore, the latency T^c_{l,i} for computing layer l is appraised by dividing the total processing cycles by the computation frequency f_i, and the energy is the product of T^c_{l,i} and the computation power P^c_i in Eq. (6). Note that Eq. (6) only reckons dynamic energy. Static energy consumption, e.g., that for maintaining basic system-level services, is not considered in our formulation.

T^c_{l,i} = ρ_i r_{l,i} / f_i,  ∀ i ∈ N, l ∈ L,   (5)
E^c_{l,i} = P^c_i T^c_{l,i},  ∀ i ∈ N, l ∈ L.   (6)

For communication, let b_{i,j} be the available bandwidth between devices i and j. Particularly, j = i implies delivering data from a device to itself, and b_{i,i} is the memory bandwidth. In our experiment, b_{i,i} is set to 12.8GB/s by default, which is the typical memory bandwidth of DDR3 [33]. Initially, communication occurs when the master device (denoted as device M) distributes the input partitions to worker devices; the transmission time is therefore calculated in the l = 1 case of Eq. (7). For the communication of pulling the padding data from the neighboring device, the transmission time is described by the l > 1 case.
For the sake of simplicity, Eq. (7) does not take queuing delays into account since we are optimizing inference for a single image input at a time. Streaming input, in which case the queuing delays significantly matter, is left for future work. With the transmission power P^x_i, we acquire the dynamic energy of communicating with device i on layer l in Eq. (8).

T^x_{l,i} = a_i / b_{M,i},  if l = 1, i ∈ N;  T^x_{l,i} = p_{l,i} / b_{i,i+1},  if l > 1, l ∈ L, i ∈ N,   (7)
E^x_{l,i} = P^x_i T^x_{l,i},  ∀ i ∈ N, l ∈ L.   (8)
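The per-layer costs of Eqs. (5)–(8) can be evaluated directly from these quantities; the sketch below is our illustration, reusing the hypothetical DeviceProfile record introduced earlier (helper and argument names are ours, not CoEdge's API).

```python
def layer_costs(r_li_kb: float, transferred_kb: float,
                device: "DeviceProfile", bandwidth_kbps: float):
    """Per-layer costs of partition i on one device, following Eqs. (5)-(8).

    r_li_kb        : workload (layer-l input partition) size in KB.
    transferred_kb : data volume moved for this layer -- the input partition
                     itself for l = 1, the padding data otherwise.
    bandwidth_kbps : b_{M,i} for l = 1, b_{i,i+1} for l > 1 (KB/s).
    """
    t_comp = device.rho * r_li_kb / device.f     # Eq. (5): cycles / frequency
    e_comp = device.p_c * t_comp                 # Eq. (6)
    t_comm = transferred_kb / bandwidth_kbps     # Eq. (7), either branch
    e_comm = device.p_x * t_comm                 # Eq. (8)
    return t_comp, e_comp, t_comm, e_comm
```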
2) Multi-Layer Formulation:
The key challenge of extending the formulation from a single layer to multiple layers lies in the synchronization mechanism during parallel processing. Fig. 8 presents a job breakdown of a CoEdge instance processing one image over three devices. As we can see, each device processes computation and communication jobs alternately, and they trigger synchronization periodically whenever a communication job (except for the initial communication job) is accomplished. The contents of the synchronizations are the requisite padding data for convolutional computation. During the interval between two synchronizations, there is no data dependency between devices, and thus they process jobs in parallel. The scattered feature map partitions are finally aggregated at the classification stage for FC computation. Hence, the whole process works in a Bulk Synchronous Parallel (BSP) mechanism [34].

To summarize the cost of the whole process, we denote E^c and E^x as the total energy consumption of computation and communication, respectively, which are obtained by summing up the energy of all devices over all layers in Eqs. (9) and (10).
Fig. 8. The job breakdown of a CoEdge runtime instance with three devices. Each device processes computation and communication jobs alternately, and the system performs in a Bulk Synchronous Parallel (BSP) mechanism.
We count the total physical latency T according to the BSP model and obtain Eq. (11). Concretely, we acquire T by calculating the maximum latency over all devices in each interval and then summing up the physical latency of all intervals. It is worth noting that Eq. (11) also counts the latency of the FC layers, as the maximum latency over all devices is essentially that of the selected device in the classification stage.

E^c = Σ_{l∈L} Σ_{i∈N} E^c_{l,i},   (9)
E^x = Σ_{l∈L} Σ_{i∈N} E^x_{l,i},   (10)
T = Σ_{l∈L} max_{i∈N} (T^c_{l,i} + T^x_{l,i}).   (11)

Given the execution deadline D, the targeted problem is to decide an optimal partitioning solution π = [a_1, a_2, ..., a_N] with the objective of minimizing the total energy without violating the execution deadline D. Hence, we can formulate the cooperative inference optimization as the following problem P1:

P1:  min  E^c + E^x
     s.t.  T ≤ D,  (1), (2), (3), (4).
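Given the per-layer quantities, the BSP-style totals of Eqs. (9)–(11) and the deadline constraint of P1 can be evaluated for any candidate partition. The following minimal sketch is our illustration, assuming per-layer, per-device cost lists produced, e.g., by the layer_costs helper above.

```python
def total_costs(t_comp, e_comp, t_comm, e_comm):
    """Aggregate per-layer, per-device costs (nested lists indexed as [l][i]).

    Returns (T, E): T follows Eq. (11), summing the slowest device per layer;
    E sums the computation and communication energy of Eqs. (9) and (10).
    """
    T = sum(max(tc + tx for tc, tx in zip(t_comp[l], t_comm[l]))
            for l in range(len(t_comp)))
    E = sum(sum(row) for row in e_comp) + sum(sum(row) for row in e_comm)
    return T, E

def satisfies_deadline(T: float, deadline: float) -> bool:
    """Deadline constraint T <= D of problem P1."""
    return T <= deadline
```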
Theorem 1. Problem P1 is NP-hard.

We prove Theorem 1 by identifying the relation between P1 and a known NP-hard problem; the detailed proof is given in Appendix A.

B. Problem Transformation
A Linear Programming (LP) problem is an optimization towards a linear objective function subject to linear equality or inequality constraints, and an Integer Linear Programming (ILP) problem is a special case where all optimization variables are integers [35]. As proved in Appendix A, P1 is an ILP problem owing to the integer variables a_i. To produce a feasible solution to P1 efficiently, we relax the integer variables and introduce continuous variables λ_i to approximate a_i. Eq. (12) defines the relation between λ_i and a_i, where H is the input's height and λ_i describes the proportion that the i-th partition covers. Since the input of CNN inference is usually of a large size (e.g., typically
224 × 224 from the ImageNet [21] dataset), the approximation error is tiny and tolerable. Eqs. (13), (14), and (15) show the numerical constraints for λ_i, which are derived from Eqs. (1), (2), and (3), respectively.

a_i = λ_i H,  ∀ i ∈ N,   (12)
λ_i H ≥ p_{i+1} · 1{λ_i > 0},  ∀ i ∈ N,   (13)
λ_i ≥ 0,  ∀ i ∈ N,   (14)
Σ_{i∈N} λ_i = 1.   (15)

Eq. (13) is essentially equivalent to the expression λ_i H ≥ p_{i+1} or λ_i = 0. Since λ_i = 0 is a potential solution, it is feasible to separate solving λ_i's value and checking whether λ_i H ≥ p_{i+1} into two steps. Therefore, we relax the constraint Eq. (13) to λ_i H ≥ 0, i.e., λ_i ≥ 0, and P1 is transformed to the following problem P2:

P2:  min  E^c + E^x
     s.t.  T ≤ D,  (4), (14), (15).
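Because P2 has a linear objective in λ and the per-layer maximum inside Eq. (11) can be linearized with auxiliary variables, it can be handed to an off-the-shelf LP solver. The sketch below is our simplification using SciPy's linprog, not CoEdge's CPLEX-based implementation: it assumes device i's layer-l workload scales as λ_i times the layer's full input size and folds the λ-independent padding-transfer terms into constants.

```python
import numpy as np
from scipy.optimize import linprog

def solve_p2(S, pad, devices, b_master, b_next, deadline):
    """LP relaxation P2 (a sketch; symbols follow Section IV, simplifications are ours).

    S[l]        : full input size of layer l (KB); device i's share is lam[i] * S[l].
    pad[l]      : padding volume pulled per device at layer l > 0 (KB); pad[0] unused.
    devices[i]  : DeviceProfile with fields rho, f, m, p_c, p_x.
    b_master[i] : bandwidth from the master to device i (KB/s).
    b_next[i]   : bandwidth from device i to its neighbour (KB/s).
    deadline    : execution deadline D (s).
    Returns the proportions lam (Eq. (12)) or None if the LP is infeasible.
    """
    N, L = len(devices), len(S)
    n_var = N + L                       # variables: lam_1..lam_N, t_1..t_L

    # Objective: total dynamic energy (padding-transfer energy is constant, dropped).
    c = np.zeros(n_var)
    for i, d in enumerate(devices):
        comp = sum(d.p_c * d.rho * S[l] / d.f for l in range(L))   # Eqs. (5)-(6), summed
        dist = d.p_x * S[0] / b_master[i]                          # Eqs. (7)-(8), l = 1
        c[i] = comp + dist

    A_ub, b_ub = [], []
    # Linearize Eq. (11): t_l >= T^c_{l,i} + T^x_{l,i} for every device i.
    for l in range(L):
        for i, d in enumerate(devices):
            row = np.zeros(n_var)
            row[i] = d.rho * S[l] / d.f + (S[0] / b_master[i] if l == 0 else 0.0)
            row[N + l] = -1.0
            const = 0.0 if l == 0 else pad[l] / b_next[i]
            A_ub.append(row); b_ub.append(-const)
    # Deadline: sum_l t_l <= D.
    row = np.zeros(n_var); row[N:] = 1.0
    A_ub.append(row); b_ub.append(deadline)

    # Eq. (15): the proportions sum to one.
    A_eq = np.zeros((1, n_var)); A_eq[0, :N] = 1.0

    # Eq. (14) and a coarse memory bound from Eq. (4); auxiliary t_l >= 0.
    bounds = [(0.0, min(1.0, d.m / max(S))) for d in devices] + [(0.0, None)] * L

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[1.0], bounds=bounds, method="highs")
    return res.x[:N] if res.success else None
```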
The solution to P2, however, is not necessarily feasible for P1. Particularly, in the solution to P2, there may be some devices that are assigned a tiny workload (∃ i ∈ N, 0 ≤ λ_i H < p_{i+1}), while in a solution to P1, the workload size on every device must be larger than or equal to the padding data size unless it is zero (∀ i ∈ N, λ_i H ≥ p_{i+1} or λ_i = 0). Setting aside the potential solution λ_i = 0 to problem P1, the main difference between P1 and P2 is that P1 imposes the lower-bound threshold p_{i+1} on nonzero partitions while P2 does not. Hence, we can exploit the solution to P2 to construct a feasible solution to P1.
Theorem 2. Problem P2 is a Linear Programming (LP) problem.

Theorem 2 (proved in Appendix B) reveals that P2 can be solved efficiently with standard LP solvers, and its solution can be exploited to construct a feasible partitioning for P1.

C. Workload Partitioning Algorithm Design
We propose a threshold-based workload partitioning algorithm for P1 that exploits the relaxed problem P2, as presented in Algorithm 1. The output is the workload partitioning solution to P1. The algorithm first checks whether N is empty: if N is an empty set, there are no available devices to perform cooperative inference, and thus no feasible solution to P1. Otherwise, it solves π from P2.

Algorithm 1 Workload Partitioning Algorithm
Input:
  N: Available devices [1, 2, ..., N]
  L: CNN layers [1, 2, ..., L]
  (k, c_in, c_out, s, p)_{l,i}, ∀ i ∈ N, ∀ l ∈ L: Configuration tuples
  (ρ, f, m, P^c, P^x)_i, ∀ i ∈ N: Resource tuples
  b_{i,j}, ∀ i, j ∈ N: Bandwidths
  D: Execution deadline
Output:
  π: Assigned workload proportions [λ_1, λ_2, ..., λ_N]
Procedure PARTITION(N)
  if N is empty then
    return NULL    ▷ no feasible solution
  Solve π from P2
  if π satisfies Eq. (13) then
    return π
  else
    Find the index set N_0 of zero elements in π
    Find the minimum element λ_m in π
    N ← N − N_0 − {m}
    return PARTITION(N)
  end
end Procedure
After solving π from P2, the algorithm checks whether the obtained π satisfies Eq. (13), the threshold constraint of P1. If so, the current version of π is a feasible solution and is immediately returned. Otherwise, there must be some elements in π that are smaller than the required padding size. In this case, we remove part of these unsatisfied elements from the available devices list: first we remove the zero elements, since a zero workload assignment indicates that the device would not participate in cooperative inference; next we find the minimum of the remaining elements in π and remove it from N. After that, the algorithm goes to the next recursion to acquire a new partitioning solution with the updated N and checks the result for P1 again. Since the recursion depth is at most N (the total number of available devices), the solving process of Algorithm 1 is very fast in practice.
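A compact sketch of Algorithm 1's recursion is given below. This is our illustration, not CoEdge's code; solve_p2_fn stands for any routine that solves the relaxed problem P2 for the current device subset (e.g., the linprog-based sketch above wrapped accordingly), and the neighbour bookkeeping of Eq. (13) is simplified to a per-device threshold.

```python
def partition(devices, solve_p2_fn, pad_threshold, H):
    """Threshold-based recursion of Algorithm 1 (illustrative sketch).

    devices      : list of candidate devices (profiles or indices).
    solve_p2_fn  : callable mapping the current device list to proportions lam,
                   or None if the relaxed problem P2 is infeasible.
    pad_threshold: per-device padding threshold (rows) used in Eq. (13).
    H            : input height, converting lam back to rows via Eq. (12).
    """
    if not devices:                               # no feasible solution to P1
        return None
    lam = solve_p2_fn(devices)
    if lam is None:
        return None
    if all(l == 0 or l * H >= p for l, p in zip(lam, pad_threshold)):  # Eq. (13)
        return lam
    # Drop zero-workload devices plus the one with the smallest nonzero share.
    nonzero = [i for i, l in enumerate(lam) if l > 0]
    drop = {i for i, l in enumerate(lam) if l == 0}
    drop.add(min(nonzero, key=lambda i: lam[i]))
    kept = [i for i in range(len(devices)) if i not in drop]
    return partition([devices[i] for i in kept], solve_p2_fn,
                     [pad_threshold[i] for i in kept], H)
```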
V. PROTOTYPE IMPLEMENTATION

We employ TensorFlow Lite [20] as the backend engine to execute CNN layers, and implement the communication module based on gRPC [37]. In the following, we provide the implementation details of CoEdge.
Deployment and profiling. Since any one of the devices in the environment may launch a CNN inference task, the employed CNN models are trained and installed on all devices in advance. Once a model is installed, we use the TensorFlow benchmark tool to profile the latency of one inference and measure the energy with the Monsoon High Voltage Power Monitor [17].
Fig. 9. Our experimental prototype uses four Raspberry Pis (RPi), one Jetson TX2 and one desktop PC. Their specifications are listed in Tables I, II and III. We employ the Monsoon High Voltage Power Monitor (HVPM) to measure the energy.
For each CNN model, we run it once as a warm-up and then record the execution time over 50 runs without break. The aim of the warm-up run is to alleviate the impact of weight loading and TensorFlow initiation, since we have omitted these overheads in the formulation. The execution tasks on all devices are the same: perform CNN inference on the same image from ImageNet [21]. We take the mean values as the measurement results and derive the resource tuple parameters from them.

The computation frequency f is taken directly from the known specifications. With f and the measured latency, we can estimate the total computing cycles of one inference. Dividing the cycle count by the processed image size yields the computing intensity ρ. We obtain the memory capacity m by observing the available memory space of an idle system. For the power parameters P^c and P^x, we derive them from the measured computation/communication energy and delay.
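This derivation amounts to a few divisions; the sketch below is our illustration of the procedure, reusing the hypothetical DeviceProfile record from Section IV (names are not from CoEdge's code).

```python
def derive_resource_tuple(mean_latency_s, image_size_kb, cpu_freq_hz,
                          free_mem_kb, comp_energy_j, comm_energy_j, comm_time_s):
    """Turn profiling measurements into a (rho, f, m, P^c, P^x) resource tuple.

    The computing cycles of one inference are estimated as f * latency; dividing by
    the processed image size gives the computing intensity rho (cycles per KB).
    Power parameters are measured energy divided by the corresponding delay.
    """
    cycles = cpu_freq_hz * mean_latency_s
    rho = cycles / image_size_kb                 # computing intensity (cycles/KB)
    p_c = comp_energy_j / mean_latency_s         # computation power (W)
    p_x = comm_energy_j / comm_time_s            # transmission power (W)
    return DeviceProfile(rho=rho, f=cpu_freq_hz, m=free_mem_kb, p_c=p_c, p_x=p_x)
```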
Workload partitioning and distribution. To create the workload allocation plan efficiently, we run the workload partitioning algorithm based on IBM ILOG CPLEX [36], a linear programming solver package. If the algorithm returns a feasible solution, we segment the input image accordingly and send the partitions to the corresponding devices. Otherwise, the algorithm returns an infeasible signal, which means the deadline is set too strictly. In this case, we choose to offload all workload to the device that can minimize the end-to-end execution latency.
Runtime communication. During the runtime, each device needs to fetch padding data from its neighboring device. Due to the limited computing capability, a device may still be working on generating its output feature map partition when a padding-pulling request arrives. To accommodate this case, we block the pulling request until the needed data is prepared. Note that such a circumstance is rare since our workload partitioning algorithm has optimized the workload allocation to match devices' computing capabilities. Under this plan, the execution time on each device is reasonably close and the utilization of computing resources is maximized as much as possible. Moreover, our workload partitioning algorithm supposes that the participating devices can communicate well with each other during the runtime. However, devices can accidentally break down or become temporarily unavailable in real-world deployment. This raises robustness issues, which are discussed in Section VIII.
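The blocking behaviour can be realized with a per-layer synchronization primitive; the following sketch uses Python's standard threading module and is our illustration, not CoEdge's gRPC service.

```python
import threading

class PaddingStore:
    """Holds per-layer output partitions; a padding request blocks until data is ready."""

    def __init__(self, num_layers: int):
        self._data = [None] * num_layers
        self._ready = [threading.Event() for _ in range(num_layers)]

    def publish(self, layer: int, feature_map_slice):
        """Called locally once layer `layer`'s output partition has been computed."""
        self._data[layer] = feature_map_slice
        self._ready[layer].set()

    def fetch_padding(self, layer: int, rows: int):
        """Called on behalf of the neighbour; blocks until the requested rows exist."""
        self._ready[layer].wait()
        return self._data[layer][:rows]          # boundary rows adjacent to the neighbour
```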
TABLE III
DESKTOP PC SPECIFICATIONS
Hardware    Specifications
CPU         3.60GHz 8-Core Intel i7-7700
Memory      2666MHz 16GB DDR4
GPU         GeForce GTX 1050 (Pascal), 640 CUDA cores
Power       Idle: 80W; CPU Fully Loaded: 180W; GPU Fully Loaded: 200W

TABLE IV
INFERENCE LATENCY (MS) AND COMPUTATION INTENSITY (CYCLES/KB) OF BASIC IMPLEMENTATION
Model        Raspberry Pi       Jetson TX2         Desktop PC
             Lat.    Inten.     Lat.    Inten.     Lat.    Inten.
AlexNet      302     615        89      301        46      282
VGG-f        276     563        83      283        44      269
GoogLeNet    769     1568       227     772        114     698
MobileNet    226     461        71      239        37      226

VI. PERFORMANCE EVALUATION
In this section, we evaluate the performance of the CoEdge prototype in terms of inference latency and dynamic energy. We also explore the impact of the deadline setting, the system scalability, and the adaptability to network fluctuation.
A. Experimental Setup
Prototype. We implement the CoEdge prototype with six devices: four Raspberry Pi 3, one Jetson TX2, and one desktop PC, as shown in Fig. 9. The Raspberry Pi 3 and the Jetson TX2 represent weak IoT devices and mobile AI platforms. Besides, we take a desktop PC to emulate small edge servers. The specifications of the three types of devices are provided in Tables I, II and III. We employ the Monsoon High Voltage Power Monitor (HVPM) [17] to measure the energy. For bandwidth control, we use the traffic control tool tc [22], which is able to limit the bandwidth under the set value.
Workload. In our prototype, we use TensorFlow Lite [20] to implement four typical CNN models: AlexNet [19], VGG-f [9], GoogLeNet [38], and MobileNet [39], all of which are trained before deployment. Table IV presents the reported latency of the basic implementation and the computing intensity on the different platforms. We set the workload as the image classification task on one ImageNet [21] image. The average inference latency and computation intensity over one hundred runs are taken as the results. During the runtime we turn off all applications except for necessary OS background services.
Approaches. We compare CoEdge with the following related approaches. (1) MoDNN [40] adopts the same piece-wise partitioning mechanism as CoEdge, but decides partition sizes in proportion to the devices' computing capabilities without considering network conditions. (2) Musical Chair [18] is a cooperative inference system that exploits both data and model parallelism. For each layer, it chooses one of the parallelisms and accordingly partitions the workloads in equal proportion.
Fig. 10. The end-to-end latency of different approaches running four DNN models. The deadlines for AlexNet, VGG-f, GoogLeNet, and MobileNet are set as 100ms, 100ms, 200ms, and 100ms, respectively.
Fig. 11. The dynamic energy consumption of different approaches running four DNN models. We use the testbed in Fig. 9, which consists of four Raspberry Pi 3, one Jetson TX2, and one desktop PC. All experimental settings are the same as those in the experiment of Fig. 10.
(3) The local approach executes CNN inference at the master device solely. In our experiment, the local approach is the baseline, and we fix the master device as a certain Raspberry Pi 3.
B. Performance Comparison
Fig. 10 and Fig. 11 show the latency and dynamic energy results of different models with the local approach (Loc), MoDNN (MD), Musical Chair (MC), and CoEdge (CE). The results in these two figures are measured under the same experimental settings, and the maximum bandwidth between devices is fixed at 1MB/s. We set the deadlines for executing the four models as 100ms, 100ms, 200ms, and 100ms, respectively, marked as dashed lines in Fig. 10.

As shown in Fig. 10, CoEdge, Musical Chair and MoDNN always accomplish inference within the deadline. As the most time-consuming option, the local approach is the only one that violates the latency requirement, and CoEdge achieves 7.21× ∼ latency speedup over it. Comparing the local approach with the other ones reflects the power of cooperative inference gained by harvesting vicinal edge resources. Among the three cooperative approaches, Musical Chair takes higher latency than the other two. This is because Musical Chair directly splits the workload in equal proportion, ignoring the resource heterogeneity. CoEdge and MoDNN perform closely in the latency metric, but differ in their energy costs.

As evidence, Fig. 11 shows that CoEdge takes the least dynamic energy consumption compared with the other approaches. CoEdge saves up to 66.9%, 64.9%, 46.0%, and 25.5% energy for the four models, respectively (compared with Musical Chair). Against the baseline (local approach), CoEdge saves 39.2%, 37.8%, 11.5%, and 10.9% energy.
Fig. 12. The dynamic energy consumption of four approaches under varying deadlines, built on the testbed in Fig. 9. The result is recorded as zero if the approach fails to finish inference within the deadline.
Fig. 13. The latency and dynamic energy results of CoEdge with a varying number of devices. The top text indicates which type of device is newly added to the cluster.

CoEdge saves energy prominently for AlexNet and VGG-f, but improves not so much for GoogLeNet and MobileNet. This is attributed to the structure of the CNN models. GoogLeNet's complicated block structure comprises a crowd of layers, which incurs frequent data exchanges in cooperative inference. At the opposite end of the spectrum, MobileNet uses a simplified structure and has been well optimized for local inference on embedded devices, which limits the improvement space for cooperative inference. It is worth noting that the local approach consumes less energy than MoDNN and Musical Chair. The reasons come from two aspects. On the one hand, the local approach does not incur communication costs, while MoDNN and Musical Chair need frequent cross-device communication during the runtime, which consumes energy. On the other hand, the optimization of MoDNN and Musical Chair does not consider the power characteristics of different types of devices, so the workload is processed in an energy-lavish manner. In contrast, by jointly optimizing the computation-communication tradeoff given the devices' computing capabilities and network conditions, CoEdge achieves the lowest energy costs.
C. Performance under Varying Deadlines
In this experiment, we explore how the deadline setting impacts the system performance. We run AlexNet to process one image input. The bandwidths between devices are fixed at 1MB/s. Fig. 12 shows the dynamic energy results as a function of the deadline. To emphasize the deadline constraint, we plot the energy result as zero if an approach fails to accomplish the inference within the deadline. When the latency requirement is very stringent, CoEdge prefers to sacrifice some energy saving for latency reduction. As the requirement loosens, the pressure of satisfying the deadline constraint gradually relaxes and CoEdge shifts its emphasis to energy optimization. When the deadline is adequately slack, it is no longer a binding constraint in our optimization, in which case changing it will not impact the workload allocation plan of CoEdge, and thus the dynamic energy result keeps stable.

D. Scalability
To evaluate CoEdge's scalability, we measure the latency and energy by incrementally adding devices to the experimental cluster. We fix the bandwidth at 1MB/s and set a loose deadline of 500ms. The inference task is to run AlexNet with one image for classification. We add devices in the following order: Raspberry Pi, Raspberry Pi, desktop PC, Raspberry Pi, Raspberry Pi, and Jetson TX2. Fig. 13 presents the measurement results of CoEdge, where the top text shows the added devices in order. With the increase of the cluster scale, both the latency and dynamic energy drop. In particular, there is a distinctive decrease when adding the PC (2 → 3) and the Jetson TX2 (5 → 6).

E. Adaptability to Network Fluctuation
In this experiment, we record the system performance of different cooperative approaches under varying bandwidths. We run AlexNet with one image on the six-device cluster, and the deadline is 100ms, plotted as the dashed line in Fig. 14. For each epoch, CoEdge captures the available bandwidths and triggers a reprogramming of the workload partitioning if the bandwidths change. This reprogramming process incurs tiny overhead, reported as less than 10ms in the experiment. We adjust the bandwidth settings between devices in different periods. The top subfigure in Fig. 14 presents the network fluctuation with bandwidths of 1000KB/s, 750KB/s, 500KB/s, 1250KB/s, 1500KB/s, and 1000KB/s, respectively. As the bandwidth changes, all three approaches vary in their performance. The performance variation comes from two reasons. On the one hand, the communication overhead for necessary data exchange during cooperative inference depends on the network conditions. On the other hand, for CoEdge, diverse bandwidths yield diverse partitioning plans and therefore impact the performance of the cooperation. In most cases, the latency results of the approaches are close. On the energy side, however, CoEdge outperforms the other approaches all the time. In particular, when the bandwidth drops to 500KB/s (Epochs 11-15), only CoEdge's execution satisfies the deadline.
Fig. 14. The latency and dynamic energy results under varying network status. The deadline for CoEdge is set as 100ms.
ELATED W ORK
Previous research efforts in enabling artificial intelligence atthe network edge can be divided into three directions: cloud-assisted execution, local resource exploitation, and multi-device collaboration.
Cloud-assisted execution.
Cloud-assisted approaches of-fload DNN inference workload from local to the cloud fullyor partially [10], [41]–[45]. MCDNN [41] fully offloads DNNcomputation. It creates DNN model variants and selects onefrom them to maximize the accuracy under resource con-straints, while CoEdge does not involve accuracy issues asthe model and data are never modified. Neurosurgeon [42]proposes a partially offloading solution, which decides anintermediate partition point in the DNN structure to keep frontlayers locally and offload rear layers to the cloud. DDNN [44]leverages a similar principle and partitions DNN layers in acloud-edge-device hierarchy. However, it requests to retrainthe DNN model in scheduling, while CoEdge does not re-quire any retraining work. The cloud-assisted approaches havebeen widely-adopted in mobile scenarios, e.g., drone-enabledvehicles tracking [46], robotics-based vision applications [47]–[49]. On these specific cases, the cloud’s functionality isfurther optimized to adjust demands. For example, RILaaS[49] introduces a Robot-Inference-and-Learning-as-a-Serviceplatform with robotics-oriented features such as reliable net-work protocol, secure authentication, and REST front-end API.
Local resource exploitation.
Local approaches keep allcomputation locally and optimize the performance throughhardware specialization or model modification [27], [50]–[53].Hardware specialization generally centers around basic DNNoperations (e.g., matrix multiplication, convolution) to developefficient hardware accelerators, e.g., ARM ML Processor [54],Google Edge TPU [55]. Some other works target to optimizethe utilization of existing hardware. For example, µ Layer[50] accelerates inference in layer granularity by simulta-neously utilizing heterogeneous processors inside an edge device. Model modification typically uses model compressiontechnique, e.g., model sparsification and quantization [51].ReForm [52] reconfigurates CNN models by model pruningand selective computing to reduce inference latency on mobiledevices. On the same goal, libnumber [53] employs quantiza-tion technique to optimize number representation in low-level,reducing both model size and inference latency. CoEdge isorthogonal to such optimizations since it does not apply anystructural modifications to the employed DNN. Multi-device collaboration.
Multi-device approaches exe-cutes DNN inference using a cluster of devices in the edgeenvironment [14], [40], [56]. Within this category, previousworks optimize workload distribution in two ways, layer fusionand workload size adjustment. Layer fusion partitions thefeature maps in a fixed pattern, and distributes workloadwith redundant data - the padding data - to avoid datarequests between devices during the runtime. Under thismechanism, DeepThings [14] fuses front convolutional layersand parallelizes these layers on multiple devices. The follow-up work [25] generalizes the fusion operation to all layersand takes resources heterogeneity into account. It designsa dynamic-programming-based fusion searching strategy toadaptively decide which layers are fused and which layers aredirectly parallelized. Workload size adjustment accommodatesthe workload allocation to minimize the end-to-end inferencelatency. MoDNN [40] segments workload greedily and assignsmore workload to the devices with higher computing capabilitywithout considering the network conditions. Musical Chair[18] introduces a partitioning algorithm integrating data andmodel parallelism, and partitions the feature maps in equalproportion. Based on Musical Chair, the subsequent work[57]–[59] improves distributed CNN execution in terms oflatency performance, scalability, and robustness, respectively.Our work falls into the mutli-device collaboration category,and combines the parallel workflow of layer-fusion techniques[14], [25] and the partitioning mechanism of workload ad-justment approaches [18], [40]. Beyond combining the noveldesigns of these two lines, CoEdge jointly considers availablecomputation and communication resources and improves theworkload allocation on heterogeneous devices via an adaptivealgorithm, which has not been addressed in prior works.VIII. D
VIII. DISCUSSION AND FUTURE DIRECTIONS
In this section, we discuss the limitations and extensions of CoEdge, and provide some future research directions.
Robustness and Generalization.
As a distributed system, a crash of any participant or a network timeout can break down the whole cooperative inference. To increase robustness against such faults, it may be helpful to design the system in a modular fashion [48] or to back up intermediate results periodically. Another direction is to further generalize and optimize the system workflow for more sophisticated model structures. Applying workload partitioning only over the whole network may not fit more complicated architectures well, since the feature maps of the deeper layers usually have smaller height and width.
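As a minimal, hypothetical sketch of the periodic-backup idea (the class and method names below are our own and not part of the CoEdge prototype), a device could persist its latest per-layer partial outputs so that a restarted peer can resume from the most recent layer instead of the raw input:

import pickle
import time

class IntermediateBackup:
    """Periodically snapshot per-layer partial outputs to local storage."""

    def __init__(self, path, interval_s=1.0):
        self.path = path              # file used for the latest snapshot
        self.interval_s = interval_s  # minimum time between snapshots
        self._last = 0.0

    def maybe_save(self, layer_idx, partial_output):
        # Write a snapshot only if enough time has passed since the last one.
        now = time.time()
        if now - self._last >= self.interval_s:
            with open(self.path, "wb") as f:
                pickle.dump({"layer": layer_idx, "output": partial_output}, f)
            self._last = now

    def restore(self):
        # Return (layer_idx, partial_output) of the latest snapshot, or None.
        try:
            with open(self.path, "rb") as f:
                snap = pickle.load(f)
            return snap["layer"], snap["output"]
        except FileNotFoundError:
            return None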
Other optimization objectives.
CoEdge focuses on optimizing the dynamic energy consumption under a preset deadline for CNN inference. Modifying the objective function of the constrained program can steer CoEdge toward other priorities. For example, one can express a performance preference by optimizing a tunable weighted combination of latency and energy. Alternatively, taking static energy consumption into account may produce a more energy-friendly workload allocation plan. Another potential objective is accuracy. Although CoEdge does not sacrifice any accuracy in theory, running DNNs on some small edge devices may still lose precision owing to the limitations of their modest computing capability and the execution mechanisms of DNN frameworks. Characterizing and optimizing such accuracy issues is practically significant for edge deployment.
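As a purely illustrative sketch (the weight $\alpha$ and this exact form are our assumption, not part of CoEdge's formulation), the energy objective could be replaced by a weighted latency-energy combination while keeping the original constraints:

$\min_{\{a_i\}} \;\; \alpha \sum_{l \in \mathcal{L}} \sum_{i \in \mathcal{N}} \left( E_i^{c,l} + E_i^{x,l} \right) \;+\; (1-\alpha) \sum_{l \in \mathcal{L}} \max_{i \in \mathcal{N}} \left( T_i^{c,l} + T_i^{x,l} \right), \qquad \alpha \in [0, 1],$

subject to the original memory and deadline constraints. Setting $\alpha = 1$ recovers the current energy-minimizing behavior, while smaller $\alpha$ biases the partitioning toward lower latency.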
Utilizing edge-oriented resources.
Recent technical progress in edge computing enhancements for computation (e.g., the pluggable Google Edge TPU [55] and Intel Movidius Neural Compute Stick [60]) could potentially benefit CoEdge's performance. For example, by equipping a Raspberry Pi with an Edge TPU, CoEdge may choose to retain the input workload mostly or even entirely in situ. This requires more effort on shaping and utilizing such emerging elastic computing resources. Moreover, improvements on the communication side, e.g., 5G and mmWave, can also boost cooperative edge intelligence.
IX. CONCLUSION
In this paper, we present CoEdge, a distributed DNN computing system that orchestrates cooperative DNN inference over heterogeneous edge devices. We explore the workflow of cooperative inference and formulate it as a constrained optimization problem, which is NP-hard. To solve it efficiently, we design a workload partitioning algorithm that decides an efficient partitioning policy in real time. By jointly optimizing computation and communication, CoEdge can find the optimal workload partitioning plan that minimizes the system energy cost while meeting the execution latency requirements. Experimental evaluations using a realistic prototype show 7.21×∼ × latency speedup over the local approach and up to 25.5%∼ energy saving.

REFERENCES

[1] B. L. R. Stojkoska and K. V. Trivodaliev, "A review of internet of things for smart home: Challenges and solutions,"
Journal of Cleaner Production, vol. 140, pp. 1454–1464, 2017.
[2] F. Shrouf, J. Ordieres, and G. Miragliotta, "Smart factories in industry 4.0: A review of the concept and of energy management approached in production based on the internet of things paradigm," in . IEEE, 2014, pp. 697–701.
[3] M. Gerla, E.-K. Lee, G. Pau, and U. Lee, "Internet of vehicles: From intelligent grid to autonomous cars and vehicular clouds," in . IEEE, 2014, pp. 241–246.
[4] A.-M. Rahmani, N. K. Thanigaivelan, T. N. Gia, J. Granados, B. Negash, P. Liljeberg, and H. Tenhunen, "Smart e-health gateway: Bringing intelligence to internet-of-things based ubiquitous healthcare systems," in . IEEE, 2015, pp. 826–834.
[5] Q. Shi and X. Chen, "Carpool for big data: Enabling efficient crowd cooperation in data market for pervasive ai," IEEE Transactions on Vehicular Technology, 2020.
[6] U. R. Acharya, S. L. Oh, Y. Hagiwara, J. H. Tan, M. Adam, A. Gertych, and R. San Tan, "A deep convolutional neural network model to classify heartbeats," Computers in Biology and Medicine, vol. 89, pp. 389–396, 2017.
[7] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[8] L. Deng, G. Hinton, and B. Kingsbury, "New types of deep neural network learning for speech recognition and related applications: An overview," in . IEEE, 2013, pp. 8599–8603.
[9] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[10] E. Li, L. Zeng, Z. Zhou, and X. Chen, "Edge ai: On-demand accelerating deep neural network inference via edge computing," IEEE Transactions on Wireless Communications, vol. 19, no. 1, pp. 447–457, 2019.
[11] T. Ouyang, Z. Zhou, and X. Chen, "Follow me at the edge: Mobility-aware dynamic service placement for mobile edge computing," IEEE Journal on Selected Areas in Communications, vol. 36, no. 10, pp. 2333–2345, 2018.
[12] X. Chen, Q. Shi, L. Yang, and J. Xu, "Thriftyedge: Resource-efficient edge computing for intelligent iot applications," IEEE Network, vol. 32, no. 1, pp. 61–65, 2018.
[13] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, "Edge intelligence: Paving the last mile of artificial intelligence with edge computing," Proceedings of the IEEE, 2019.
[14] Z. Zhao, K. M. Barijough, and A. Gerstlauer, "Deepthings: Distributed adaptive deep learning inference on resource-constrained iot edge clusters," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3709–3716, 2018.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[20] "Tensorflow benchmark tool," https://github.com/tensorflow/tensorflow/tree/r1.4/tensorflow/tools/benchmark, accessed May 15, 2019.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in
International Conference on Learning Representations, 2014.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[25] L. Zhou, M. H. Samavatian, A. Bacha, S. Majumdar, and R. Teodorescu, "Adaptive parallel execution of deep neural networks on heterogeneous edge devices," in Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, 2019, pp. 195–208.
[26] A. P. Miettinen and J. K. Nurminen, "Energy efficiency of mobile clients in cloud computing," HotCloud, vol. 10, no. 4-4, p. 19, 2010.
[27] S. Yao, Y. Zhao, H. Shao, S. Liu, D. Liu, L. Su, and T. Abdelzaher, "Fastdeepiot: Towards understanding and optimizing neural network execution time on mobile and embedded devices," in Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems. ACM, 2018, pp. 278–291.
[28] Y. Wen, W. Zhang, and H. Luo, "Energy-optimal mobile application execution: Taming resource-poor mobile devices with cloud clones," in . IEEE, 2012, pp. 2716–2720.
[29] Y. Cui, J. Song, K. Ren, M. Li, Z. Li, Q. Ren, and Y. Zhang, "Software defined cooperative offloading for mobile cloudlets," IEEE/ACM Transactions on Networking, vol. 25, no. 3, pp. 1746–1760, 2017.
[30] T. Mohammed, C. Joe-Wong, R. Babbar, and M. Di Francesco, "Distributed inference acceleration with adaptive dnn partitioning and offloading," in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications. IEEE, 2020, pp. 854–863.
[31] M. Mukherjee, V. Kumar, A. Lat, M. Guo, R. Matam, and Y. Lv, "Distributed deep learning-based task offloading for uav-enabled mobile edge computing," in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2020, pp. 1208–1212.
[32] S. Xu, Q. Liu, B. Gong, F. Qi, S. Guo, X. Qiu, and C. Yang, "Rjcc: Reinforcement learning based joint communicational-and-computational resource allocation mechanism for smart city iot," IEEE Internet of Things Journal, 2020.
[33] K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz, "Towards energy-proportional datacenter memory with mobile dram," in . IEEE, 2012, pp. 37–48.
[34] L. G. Valiant, "A bridging model for parallel computation," Communications of the ACM, vol. 33, no. 8, pp. 103–111, 1990.
[35] G. B. Dantzig, Linear Programming and Extensions. Princeton University Press, 1998, vol. 48.
[36] I. I. CPLEX, "V12. 1: User's manual for cplex," International Business Machines Corporation, vol. 46, no. 53, p. 157, 2009.
[37] Google, "gprc - a rpc library and framework," https://grpc.io, accessed December 15, 2019.
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[39] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[40] J. Mao, X. Chen, K. W. Nixon, C. Krieger, and Y. Chen, "Modnn: Local distributed mobile computing system for deep neural network," in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. IEEE, 2017, pp. 1396–1401.
[41] S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy, "Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2016, pp. 123–136.
[42] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, "Neurosurgeon: Collaborative intelligence between the cloud and mobile edge," in ACM SIGARCH Computer Architecture News, vol. 45, no. 1. ACM, 2017, pp. 615–629.
[43] L. Zeng, E. Li, Z. Zhou, and X. Chen, "Boomerang: On-demand cooperative deep neural network inference for edge intelligence on industrial internet of things," IEEE Network, 2019.
[44] S. Teerapittayanon, B. McDanel, and H.-T. Kung, "Distributed deep neural networks over the cloud, the edge and end devices," in . IEEE, 2017, pp. 328–339.
[45] H.-J. Jeong, H.-J. Lee, C. H. Shin, and S.-M. Moon, "Ionn: Incremental offloading of neural network computations from mobile devices to edge servers," in Proceedings of the ACM Symposium on Cloud Computing, 2018, pp. 401–411.
[46] L. Ballotta, L. Schenato, and L. Carlone, "Computation-communication trade-offs and sensor selection in real-time estimation for processing networks," IEEE Transactions on Network Science and Engineering, 2020.
[47] S. P. Chinchali, E. Cidon, E. Pergament, T. Chu, and S. Katti, "Neural networks meet physical networks: Distributed inference between edge devices and the cloud," in Proceedings of the 17th ACM Workshop on Hot Topics in Networks, 2018, pp. 50–56.
[48] S. Chinchali, A. Sharma, J. Harrison, A. Elhafsi, D. Kang, E. Pergament, E. Cidon, S. Katti, and M. Pavone, "Network offloading policies for cloud robotics: a learning-based approach," in Robotics: Science and Systems, 2019, pp. 1–10.
[49] A. K. Tanwani, R. Anand, J. E. Gonzalez, and K. Goldberg, "Rilaas: Robot inference and learning as a service," IEEE Robotics and Automation Letters, 2020.
[50] Y. Kim, J. Kim, D. Chae, D. Kim, and J. Kim, "µLayer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization," in Proceedings of the Fourteenth EuroSys Conference 2019, 2019, pp. 1–15.
[51] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[52] Z. Xu, F. Yu, C. Liu, and X. Chen, "Reform: Static and dynamic resource-aware dnn reconfiguration framework for mobile device," in Proceedings of the 56th Annual Design Automation Conference 2019, 2019, pp. 1–6.
[53] Y. H. Oh, Q. Quan, D. Kim, S. Kim, J. Heo, S. Jung, J. Jang, and J. W. Lee, "A portable, automatic data qantizer for deep neural networks," in Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques. IEEE, 2020, pp. 157–163.
[57] R. Hadidi, J. Cao, M. S. Ryoo, and H. Kim, "Towards collaborative inferencing of deep neural networks on internet of things devices," IEEE Internet of Things Journal, 2020.
[58] J. Cao, F. Wu, R. Hadidi, L. Liu, T. Krishna, M. S. Ryoo, and H. Kim, "An edge-centric scalable intelligent framework to collaboratively execute dnn," in Demo for SysML Conference, Palo Alto, CA, 2019.
[59] R. Hadidi, J. Cao, M. S. Ryoo, and H. Kim, "Robustly executing dnns in iot systems using coded distributed computing," in
Proceedings of the 56th Annual Design Automation Conference 2019.

APPENDIX A
PROOF OF THEOREM 1

Proof.
We reduce the $P||C_{max}$ problem to a special case of $P_1$; since the $P||C_{max}$ problem is NP-hard, $P_1$ is at least as hard as the $P||C_{max}$ problem. Firstly, we identify $P_1$'s decision variables $a_i$; the constraints (1), (2), and (3) limit them to a range of nonnegative integers. Using $a_i$, we can obtain the initial workload on each device by multiplying $a_i$ by the data size of each row. Since the input feature maps of each layer are the output of the prior layer, we can derive the workload of each layer based on its specific configuration. For example, for a convolution operation, given an input feature map partition of size $(H, W, C_{in})$ (Height, Width, Channels) and a convolution kernel $(k, C_{in}, C_{out}, s, p)$, the output size is $\left(\frac{H-k+2p}{s}+1,\ \frac{W-k+2p}{s}+1,\ C_{out}\right)$. Therefore, we can express $r_i^l$ linearly using $a_i$. The same holds for $T_i^{c,l}$, $T_i^{x,l}$, $E_i^{c,l}$, and $E_i^{x,l}$ according to Eq. (5), (7), (9), and (10). For the deadline constraint $T = \sum_{l \in \mathcal{L}} \max_{i \in \mathcal{N}} (T_i^{c,l} + T_i^{x,l}) \leq D$, we transform it into a series of inequalities. Assuming a sub-deadline $D^l$ for processing layer $l$, we have $\max_{i \in \mathcal{N}} (T_i^{c,l} + T_i^{x,l}) \leq D^l$, which is equivalent to $T_1^{c,l} + T_1^{x,l} \leq D^l$, $T_2^{c,l} + T_2^{x,l} \leq D^l$, $\cdots$, $T_N^{c,l} + T_N^{x,l} \leq D^l$. Without loss of generality, we apply this transformation to all layers and obtain $N \cdot L$ inequalities in total, i.e., $T_i^{c,l} + T_i^{x,l} \leq D^l, \forall i \in \mathcal{N}, \forall l \in \mathcal{L}$. Given that $T_i^{c,l}$ and $T_i^{x,l}$ are linear in $a_i$, these inequalities are linear. In conclusion, all the expressions in $P_1$ are linear in $a_i$. Letting $a_i$ be the jobs to schedule and all power parameters be 1, we can reduce the $P||C_{max}$ problem to a special case of $P_1$. Since the $P||C_{max}$ problem is NP-hard, $P_1$ is NP-hard as well.
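To make the per-layer linearity argument concrete, the following small sketch (illustrative only; the helper names and layer configurations are hypothetical and not part of the paper's prototype) propagates a device's initial partition height $a_i$ through a stack of convolution layers using the output-size formula above; ignoring integer rounding, every resulting height is an affine function of $a_i$:

def conv_out_height(h, k, s, p):
    # Output height of a convolution: (h - k + 2p) / s + 1.
    # The proof treats this as an exact linear expression in h,
    # ignoring the integer rounding applied here.
    return (h - k + 2 * p) // s + 1

def partition_heights(a_i, layers):
    # Partition height a device must process at each layer, starting
    # from its initial input partition of a_i rows.
    heights = [a_i]
    h = a_i
    for k, s, p in layers:  # (kernel size, stride, padding) per conv layer
        h = conv_out_height(h, k, s, p)
        heights.append(h)
    return heights

# Hypothetical 3-layer configuration.
print(partition_heights(56, [(3, 1, 1), (3, 2, 1), (5, 1, 2)]))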
APPENDIX B
PROOF OF THEOREM 2

Proof. As we have discussed in the proof of Theorem 1, the objective function, the memory constraint, and the deadline constraint are linear in $a_i$. In $P_1$