Dynamic Heterogeneity-Aware Coded Cooperative Computation at the Edge
Yasaman Keshtkarjahromi, Member, IEEE, Yuxuan Xing, Student Member, IEEE, and Hulya Seferoglu, Member, IEEE
Abstract—Cooperative computation is a promising approach for localized data processing at the edge, e.g., for Internet of Things (IoT). Cooperative computation advocates that computationally intensive tasks in a device could be divided into sub-tasks and offloaded to other devices or servers in close proximity. However, exploiting the potential of cooperative computation is challenging mainly due to the heterogeneous and time-varying nature of edge devices. Coded computation, which advocates mixing data in sub-tasks by employing erasure codes and offloading these sub-tasks to other devices for computation, is recently gaining interest, thanks to its higher reliability, smaller delay, and lower communication costs. In this paper, we develop a coded cooperative computation framework, which we name Coded Cooperative Computation Protocol (C3P), by taking into account the heterogeneous resources of edge devices. C3P dynamically offloads coded sub-tasks to helpers and is adaptive to time-varying resources. We show that (i) the task completion delay of C3P is very close to that of optimal coded cooperative computation solutions, (ii) the efficiency of C3P in terms of resource utilization is high, and (iii) C3P improves task completion delay significantly as compared to baselines, via both simulations and in a testbed consisting of real Android-based smartphones.
I. INTRODUCTION
Data processing is crucial for many applications at the edge, including the Internet of Things (IoT), but it could be computationally intensive and not doable if devices operate individually. One of the promising solutions to handle computationally intensive tasks is computation offloading, which advocates offloading tasks to remote servers or the cloud. Yet, offloading tasks to remote servers or the cloud could be a luxury that cannot be afforded by most edge applications, where connectivity to remote servers can be lost or compromised, which makes localized processing crucial.

Cooperative computation is a promising approach for edge computing, where computationally intensive tasks in a device (collector device) could be offloaded to other devices (helpers) in close proximity, as illustrated in Fig. 1. These devices could be other IoT or mobile devices, local servers, or fog at the edge of the network [1], [2]. However,
This work was supported by the NSF under Grant CNS-1801708, the ARL under Grants W911NF-1820181 and W911NF-1710032, and the NIST under Grant 70NANB17H188. The preliminary results of this paper were presented in part at the IEEE International Conference on Network Protocols (ICNP), Cambridge, UK, Sep. 2018. Y. Keshtkarjahromi was with the Department of Electrical and Computer Engineering, University of Illinois at Chicago. She is now an ORAU Postdoc Fellow. E-mail: [email protected]. Y. Xing and H. Seferoglu are with the Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL, 60607. E-mail: [email protected], [email protected].
Fig. 1. Cooperative computation to compute y = Ax. (a) Matrix A is divided into sub-matrices A_1, A_2, ..., A_N. Each sub-matrix along with the vector x is transmitted from the collector to one of the helpers. (b) Each helper computes the multiplication of its received sub-matrix with vector x and sends the computed value back to the collector.

exploiting the potential of cooperative computation is challenging mainly due to the heterogeneous and time-varying nature of the devices at the edge. Indeed, these devices may have different and time-varying computing power and energy resources, and could be mobile. Thus, our goal is to develop a dynamic, adaptive, and heterogeneity-aware cooperative computation framework by taking into account the heterogeneity and time-varying nature of devices at the edge.

We focus on the computation of linear functions. In particular, we assume that the collector's data is represented by a large matrix A and it wishes to compute the product y = Ax for a given vector x, Fig. 1. In fact, matrix multiplication forms the atomic function computed over many iterations of several signal processing, machine learning, and optimization algorithms, such as gradient descent based algorithms, classification algorithms, etc. [3], [4], [5], [6].

In the cooperative computation setup, matrix A is divided into sub-matrices A_1, A_2, ..., A_N, and each sub-matrix along with the vector x is transmitted from the collector to one of the helpers, Fig. 1(a). Helper n computes A_n x and transmits the computed result back to the collector, Fig. 1(b), who can process all returned computations to obtain the result of its original task, i.e., the calculation of y = Ax.

Coding in computation systems is recently gaining interest in large-scale computing environments, and it advocates higher reliability and smaller delay [3].
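The split-and-recombine flow of Fig. 1 can be sketched in a few lines of numpy; the dimensions below are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # matrix A with R = 6 rows (hypothetical width)
x = rng.standard_normal((4, 1))

# Collector: divide A into N = 3 sub-matrices A_1, A_2, A_3 (Fig. 1(a)).
sub_matrices = np.vsplit(A, 3)

# Helpers: helper n computes A_n x (Fig. 1(b)).
results = [A_n @ x for A_n in sub_matrices]

# Collector: stack the returned products to recover y = A x.
y = np.vstack(results)
assert np.allclose(y, A @ x)
```

Coded computation replaces the plain sub-matrices with combinations of them, so that a sufficiently large subset of returned products, rather than all of them, suffices to recover y.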
In particular, coded computation (e.g., by employing erasure codes) mixes data in sub-tasks and offloads these coded sub-tasks for computation, which improves delay and reliability. The following canonical example, inspired from [3], demonstrates the effectiveness of coded computation.

Example 1: Let us consider that a collector device would like to calculate y = Ax with the help of three helper devices (helper 1, helper 2, and helper 3), where the number of rows in A is 6. Let us assume that each helper has a different runtime; helper 1 computes each row in 1 unit of time, while the second and the third helpers require 2 and 10 units of time for computing one row, respectively. Assuming that these runtimes are random and not known a priori, one may divide A into three sub-matrices A_1, A_2, and A_3, each with 2 rows. Thus, the completion times of these sub-matrices become 2, 4, and 20 at helpers 1, 2, and 3, respectively. Since the collector should receive all the calculated sub-matrices to compute its original task, i.e., y = Ax, the total task completion delay becomes max(2, 4, 20) = 20.

As seen, helper 3 becomes a bottleneck in this scenario, which can be addressed using coding. In particular, A could be divided into two sub-matrices A_1 and A_2, each with 3 rows. Then, A_1 and A_2 could be offloaded to helpers 1 and 2, and A_1 + A_2 could be offloaded to helper 3. In this setup, the runtimes become 3, 6, and 30 at helpers 1, 2, and 3, respectively. However, since the collector requires replies from only two helpers to compute y = Ax thanks to coding, the total task completion delay becomes max(3, 6) = 6. As seen, the task completion delay reduces to 6 from 20 with the help of coding. □

The above example demonstrates the benefit of coding for cooperative computation. However, offloading sub-tasks with equal sizes to all helpers, without considering their heterogeneous resources, is inefficient. Let us consider the same setup as in Example 1. If A_1 with 4 rows and A_2 with 2 rows are offloaded to helper 1 and helper 2, respectively, and helper 3 is not used, the task completion delay becomes max(4, 4) = 4, which is the smallest possible delay in this example. Furthermore, the resources of helper 3 are not wasted, which is another advantage of taking into account the heterogeneity as compared to the above example. As seen, it is crucial to divide and offload matrix A to helpers by taking into account the heterogeneity of resources.

Indeed, a code design mechanism under such a heterogeneous setup is developed in [7], where matrix A is divided, coded, and offloaded to helpers by taking into account the heterogeneity of resources. However, the available resources at helpers are generally not known by the collector a priori and may vary over time, which is not taken into account in [7]. For example, the per-row runtime of a helper in Example 1 may increase while it is computing (e.g., it may start running another computationally intensive task), which would increase the total task completion delay. Thus, it is crucial to design a coded cooperation framework that is dynamic and adaptive to heterogeneous and time-varying resources, which is the goal of this paper.

In this paper, we design a coded cooperative computation framework for edge computing. In particular, we design a Coded Cooperative Computation Protocol (C3P), which packetizes rows of matrix A into packets, codes these packets using Fountain codes, and determines how many coded packets each helper should compute dynamically over time. We provide theoretical analysis of C3P's task completion delay and efficiency, and evaluate its performance via simulations as well as in a testbed consisting of real Android-based smartphones as compared to baselines. The following are the key contributions of this work:

• We formulate the coded cooperative computation problem as an optimization problem. We investigate the non-ergodic and static solutions of this problem. As a dynamic solution to the optimization problem, we develop a coded cooperative computation protocol (
C3P), which is based on an Automatic Repeat reQuest (ARQ) mechanism. In particular, a collector device offloads coded sub-tasks to helpers gradually, and receives an Acknowledgment (ACK) after each sub-task is computed. Depending on the time difference between offloading a sub-task to a helper and its ACK, the collector estimates the runtimes of the helpers, and offloads more/fewer tasks accordingly. This makes C3P dynamic and adaptive to heterogeneous and time-varying resources at helpers.

• We characterize the performance of C3P as compared to the non-ergodic and static solutions, and show that (i) the gap between the task completion delays of C3P and the non-ergodic solution is finite even for a large number of sub-tasks, i.e., R → ∞, and (ii) the task completion delay of C3P is approximately equal to that of the static solution for large numbers of sub-tasks. We also analyze the efficiency of C3P at each helper in closed form, where the efficiency metric represents the effective utilization of the resources at each helper.

• We evaluate C3P via simulations as well as in a testbed consisting of real Android-based smartphones and show that (i) C3P improves task completion delay significantly as compared to baselines, and (ii) the efficiency of C3P in terms of resource utilization is high.

The structure of the rest of this paper is as follows. Section II presents the coded cooperative computation problem formulation. Section III presents the non-ergodic and static solutions to the coded cooperative computation problem and the design of C3P. Section IV provides the performance analysis of C3P. Section V presents the performance evaluation of C3P. Section VI presents related work. Section VII concludes the paper.

II. PROBLEM FORMULATION
Setup. We consider the setup shown in Fig. 1, where the collector device offloads its task to helpers in the set N (where N = |N|) via device-to-device (D2D) links such as Wi-Fi Direct and/or Bluetooth. In this setup, all devices could potentially be mobile, so the encounter time of the collector with helpers varies over time; i.e., the collector may connect to fewer than N helpers at a time.

Application. As we described in Section I, we focus on the computation of linear functions; i.e., the collector wishes to compute y = Ax, where A = (a_{i,j}) ∈ R^{R×R} and x ∈ R^{R×1}. Our goal is to determine the sub-matrix A_n = (a_{i,j}) ∈ R^{r_n×R} that will be offloaded to helper n, where r_n is an integer.

Coding Approach.
We use Fountain codes [8], [9], which are ideal in our dynamic coded cooperation framework thanks to their rateless property, low encoding and decoding complexity, and low overhead. In particular, the encoding and decoding complexity of Fountain codes could be as low as O(R log(R)) for LT codes and O(R) for Raptor codes, and the coding overhead can be kept small [10]. We note that Fountain codes perform better than (i) repetition codes, thanks to the randomization of sub-tasks by mixing them; (ii) maximum distance separable (MDS) codes, as MDS codes require a priori task allocation (due to their block coding nature) and are not suitable for the dynamic and adaptive framework that we would like to develop; and (iii) network coding, as the decoding complexity of network coding is too high [11], which introduces too much computation overhead at the collector and obsoletes the computation offloading benefit.

Packetization. We packetize each row of A into a packet and create R packets; Γ = {ρ_1, ρ_2, ..., ρ_R}. These packets are used to create Fountain-coded packets, where ν_i is the i-th coded packet. The coded packet ν_i is transmitted to a helper, where the helper computes the multiplication ν_i x and sends the result back to the collector. R + K coded computed packets are required at the collector to decode the coded packets, where K is the coding overhead. Let p_{n,i} denote the i-th coded packet transmitted to helper n, which is the j-th coded packet generated by the collector; p_{n,i} = ν_j, j ≥ i.

Delay Model.
Each transmitted packet p_{n,i} experiences transmission delay between the collector and helper n as well as computing delay at helper n. Also, the computed packet p_{n,i} x experiences transmission delay while transmitted from helper n to the collector. The average round trip time (RTT) of sending a packet to helper n and receiving the computed packet is characterized as RTT_n^data. The runtime of packet p_{n,i} at helper n is a random variable denoted by β_{n,i}. Assuming that r_n packets are offloaded to helper n, the total task completion delay for helper n to receive r_n coded packets, compute them, and send the results back to the collector becomes D_n, which is expressed as D_n = RTT_n^data + Σ_{i=1}^{r_n} β_{n,i}. Note that RTT_n^data in this formulation is due to transmitting the first packet p_{n,1} and receiving the last computed packet p_{n,r_n} x. The other packets can be transmitted while the helpers are busy processing packets; this is why we do not sum RTT_n^data across packets.

Problem Formulation.
Our goal is to determine the task offloading set R = {r_1, ..., r_N} that minimizes the total task completion delay; i.e., we would like to dynamically determine R that solves the following optimization problem:

min_R max_{n∈N} D_n
subject to Σ_{n=1}^{N} r_n = R,  r_n ∈ N, ∀n ∈ N.   (1)

The objective of the optimization problem in (1) is to minimize the maximum of the per-helper task completion delays, which is equal to max_{n∈N} D_n, as the helpers compute their tasks in parallel. The constraint in (1) is a task conservation constraint that guarantees that the resources of the helpers are not wasted, i.e., the sum of the received computed tasks from all helpers is equal to the number of rows of matrix A. Note that this constraint is possible thanks to coding. As we mentioned earlier, R + K coded computed packets are required at the collector to decode the coded packets when we use Fountain codes. The constraint in (1) guarantees this requirement in an idealized scenario assuming that K = 0. The constraint r_n ∈ N makes sure that the number of tasks r_n is an integer. The solution of (1) is challenging as (i) D_n = RTT_n^data + Σ_{i=1}^{r_n} β_{n,i} is a random variable and not known a priori, and (ii) it is an integer programming problem.

(Footnote: Our framework is compatible with any delay distribution, but for the sake of characterizing the efficiency of our algorithm and simulating its task completion delay, we use a shifted exponential distribution in Sections IV-D and V.)

III. PROBLEM SOLUTION & C3P DESIGN
In this section, we investigate the solution of (1) for non-ergodic, static, and dynamic setups.
A. Non-Ergodic Solution
Let us assume that the solution of (1) is

T_best = max_{n∈N} ( RTT_n^data + Σ_{i=1}^{r_n^best} β_{n,i} ),   (2)

where r_n^best = argmin_{r_n ∈ N} max_{n∈N} ( RTT_n^data + Σ_{i=1}^{r_n} β_{n,i} ). We note that (2) is a non-ergodic solution, as it requires perfect knowledge of β_{n,i} a priori. Although we do not have a compact solution for T_best, the solution in (2) will serve as a performance benchmark for our dynamic and adaptive coded cooperative computation framework in Section IV-A.

B. Static Solution
We assume that RTT_n^data becomes negligible as compared to Σ_{i=1}^{r_n} β_{n,i}. This assumption holds in practical scenarios with large R, and/or when the transmission delay is smaller than the processing delay. Then, D_n can be approximated as Σ_{i=1}^{r_n} β_{n,i}, and the optimization problem in (1) becomes

min_R max_{n∈N} Σ_{i=1}^{r_n} β_{n,i}
subject to Σ_{n=1}^{N} r_n = R,  r_n ∈ N, ∀n ∈ N.   (3)

As a static solution, we solve the expected value of the objective function in (3) by relaxing the integer constraint r_n ∈ N. The expected value of the objective function of (3) is expressed as E[max_{n∈N} Σ_{i=1}^{r_n} β_{n,i}], which is greater than or equal to max_{n∈N} Σ_{i=1}^{r_n} E[β_{n,i}] = max_{n∈N} r_n E[β_{n,i}] (noting that max(.) is a convex function, so E[max(.)] ≥ max(E[.])), where the expectation is across the packets.

(Footnote: We note that the optimal computation offloading problem, when coding is not employed, is formulated as min_{Γ_n} max_{n∈N} ( RTT_n^data + Σ_{i=1}^{|Γ_n|} β_{n,i} ) subject to ∪_{n=1}^{N} Γ_n = Γ, where Γ_n ⊂ Γ is the set of packets offloaded to helper n. As seen, the optimization problem in (1) is more tractable as compared to this problem thanks to employing Fountain codes.)

Fig. 2. Different states of the system: (a) ideal case, (b) underutilized case, and (c) congested case.

Assuming that the average task completion delay is T = E[max_{n∈N} Σ_{i=1}^{r_n} β_{n,i}] ≥ max_{n∈N} r_n E[β_{n,i}], (3) is converted to

min_R T
subject to r_n E[β_{n,i}] ≤ T, ∀n ∈ N,
          Σ_{n=1}^{N} r_n = R.   (4)

We solve (4) using Lagrange relaxation (we omit the steps of the solution as it is straightforward); the optimal task offloading policy becomes

r_n^static = R / ( E[β_{n,i}] Σ_{n'=1}^{N} 1/E[β_{n',i}] ),   (5)

and the optimal task completion delay becomes T_static = R / Σ_{n'=1}^{N} (1/E[β_{n',i}]). Although the solution in (5) is an optimal solution of (4), an algorithm that offloads r_n^static sub-tasks to helper n a priori (static allocation) loses optimality, as it is not adaptive to the time-varying nature of resources (i.e., β_{n,i}). Next, we introduce our Coded Cooperative Computation Protocol (C3P), which is dynamic and adaptive to time-varying resources and approaches the optimal solution in (5) with increasing R.

C. Dynamic Solution: C3P
We consider the system setup in Fig. 1, where the collector connects to N helpers. In this setup, the collector device offloads coded packets gradually to the helpers, and receives two ACKs for each packet: one confirming the receipt of the packet by the helper, and the second one (piggybacked to the computed packet p_{n,i} x) showing that the packet has been computed by the helper. Inspired by ARQ mechanisms [12], the collector transmits more/fewer coded packets based on the frequency of the received ACKs.

In particular, we define the transmission time interval TTI_{n,i} as the time interval between sending two consecutive packets, p_{n,i} and p_{n,i+1}, to helper n by the collector. The goal of our mechanism is to determine the best TTI_{n,i} that reduces the task completion delay and increases helper efficiency (i.e., exploiting the full potential of the helpers while not overloading them).
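The ACK-driven pacing just described can be sketched as collector-side state kept per helper. The class below is a hypothetical illustration (the names and the running-mean estimator are our assumptions, not the paper's exact pseudocode): compute times are inferred from ACK timing, and the next transmission interval is derived from them.

```python
class HelperPacer:
    """Collector-side pacing state for one helper (illustrative sketch)."""

    def __init__(self):
        self.runtimes = []          # observed per-packet compute times

    def record_ack(self, t_sent, t_result_received, rtt_data):
        # Infer the compute time of the ACKed packet: time between sending
        # the packet and getting its result back, minus the round trip time.
        self.runtimes.append(max(t_result_received - t_sent - rtt_data, 0.0))

    def next_interval(self, t_sent, t_result_received):
        # Pace by the mean observed runtime, but never wait longer than the
        # observed send-to-result gap, so the helper is not left idle.
        mean_runtime = sum(self.runtimes) / len(self.runtimes)
        return min(t_result_received - t_sent, mean_runtime)
```

A fast helper (small inferred runtimes) thus gets packets at short intervals, while a slow or loaded helper is automatically throttled.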
TTI_{n,i} in an ideal scenario. Let T_{n,i}^x be the time at which p_{n,i} is transmitted from the collector to helper n, T_{n,i}^c be the time at which helper n finishes computing p_{n,i}, and T_{n,i}^r be the time at which the computed packet (i.e., by abusing the notation, p_{n,i} x) is received by the collector from helper n. We assume that the time of transmitting the first packet to each helper, i.e., p_{n,1}, ∀n ∈ N, is zero; i.e., T_{n,1}^x = 0, ∀n ∈ N.

Let us first consider the ideal scenario, Fig. 2(a), where TTI_{n,i} is equal to β_{n,i} for all packets that are transmitted to helper n. Indeed, if TTI_{n,i} > β_{n,i}, Fig. 2(b), helper n stays idle, which reduces the efficient utilization of resources and increases the task completion delay. On the other hand, if TTI_{n,i} < β_{n,i}, Fig. 2(c), packets are queued at helper n. This congested (overloaded) scenario is not ideal either, because the collector can receive enough packets before all queued packets in the helpers are processed, which wastes resources.

Determining TTI_{n,i} in practice. Now that we know that TTI_{n,i} = β_{n,i} should be satisfied for the best system efficiency and smallest task completion delay, the collector could set TTI_{n,i} to β_{n,i}. However, the collector does not know β_{n,i} a priori, as it is the computation runtime of packet p_{n,i} at helper n. Thus, we should determine TTI_{n,i} without explicit knowledge of β_{n,i}.

Our approach in C3P is to estimate β_{n,i} as E[β_{n,i}], where the expectation is taken over packets. We will explain how to calculate E[β_{n,i}] later in this section, but before that, let us explain how the estimated E[β_{n,i}] is used for setting TTI_{n,i}. It is obvious that if the computed packet p_{n,i} x is received at the collector before packet p_{n,i+1} is transmitted from the collector to helper n, the helper will be idle until it receives packet p_{n,i+1}. Therefore, to better utilize the resources at helper n, the collector should offload a new packet before or immediately after receiving the computed value of the previous packet, i.e., TTI_{n,i} ≤ T_{n,i}^r − T_{n,i}^x should be satisfied, as in Fig. 2. Therefore, if the calculated E[β_{n,i}] is larger than T_{n,i}^r − T_{n,i}^x, then we set TTI_{n,i} to T_{n,i}^r − T_{n,i}^x to satisfy this condition. In other words, TTI_{n,i} is set to

TTI_{n,i} = min( T_{n,i}^r − T_{n,i}^x, E[β_{n,i}] ).   (6)

Calculation of E[β_{n,i}]. In C3P, E[β_{n,i}] is estimated using the runtimes of previous packets:

E[β_{n,i}] ≈ ( Σ_{j=1}^{m_n} β_{n,j} ) / m_n,   (7)

where m_n is the number of computed packets received at the collector from helper n before sending packet p_{n,i+1}. In order to calculate (7), the collector device should have the β_{n,j} values of the previously offloaded packets. A straightforward approach would be putting timestamps on sub-tasks to directly access the runtimes at the collector. However, this approach introduces overhead on sub-tasks. Thus, we also developed a mechanism where the collector device infers β_{n,j} by taking into account the transmission and ACK times of sub-tasks. The details of this approach are provided in Appendix A.

Algorithm 1 C3P algorithm at the collector
1: Initialize: TO_n = ∞, ∀n ∈ N.
2: while R + K computed packets have not been received do
3:   if computed packet p_{n,i} x is received before the timeout TO_n expires then
4:     Calculate TTI_{n,i} according to (7) and (6).
5:   else
6:     TTI_{n,i} = 2 × TTI_{n,i}.
7:   Update the timeout as TO_n = 2 TTI_{n,i}.

C3P in a nutshell.
The main goal of C3P is to determine the packet transmission intervals TTI_{n,i} according to (6), which is summarized in Algorithm 1. Note that Algorithm 1 also has a timeout value, defined in line 7, which is needed for unresponsive helpers. If helper n is not responsive, TTI_{n,i} is quickly increased, as shown in line 6, so that fewer and fewer packets are offloaded to that helper. In particular, C3P doubles TTI_{n,i} when the timeout for receiving an ACK occurs. This is inspired by the additive increase multiplicative decrease strategy of TCP, where the number of transmitted packets is halved to back off quickly when the system is not responding.

After TTI_{n,i} is updated when a transmitted packet is ACKed or a timeout occurs, this interval is used to determine the transmission times of the next coded packets. In particular, coded packets are generated and transmitted one by one to all helpers with intervals TTI_{n,i} until (i) TTI_{n,i} is updated with a new ACK packet or when a timeout occurs, or (ii) the collector collects R + K computed packets. Next, we characterize the performance of C3P.

IV. PERFORMANCE ANALYSIS OF C3P
A. Performance of C3P w.r.t. the Non-Ergodic Solution
In this section, we analyze the gap between C3P and the non-ergodic solution characterized in Section III-A. Let us first characterize the task completion delay of C3P as

T_C3P = max_{n∈N} ( RTT_n^data + Σ_{i=1}^{r_n^C3P} (β_{n,i} + T_{n,i}^u) ),   (8)

where r_n^C3P = argmin_{r_n} max_{n∈N} ( RTT_n^data + Σ_{i=1}^{r_n} (β_{n,i} + T_{n,i}^u) ), and T_{n,i}^u is the per-packet under-utilization time at helper n, which occurs because C3P does not have a priori knowledge of β_{n,i}, but estimates β_{n,i} and accordingly determines the packet transmission times TTI_{n,i} according to (6). The gap between T_C3P and T_best in (2) is upper bounded by

T_C3P − T_best
= max_{n∈N} ( RTT_n^data + Σ_{i=1}^{r_n^C3P} (β_{n,i} + T_{n,i}^u) ) − max_{n∈N} ( RTT_n^data + Σ_{i=1}^{r_n^best} β_{n,i} )
≤ max_{n∈N} ( RTT_n^data + Σ_{i=1}^{r_n^best} (β_{n,i} + T_{n,i}^u) ) − max_{n∈N} ( RTT_n^data + Σ_{i=1}^{r_n^best} β_{n,i} )
≤ max_{n∈N} ( RTT_n^data + Σ_{i=1}^{r_n^best} β_{n,i} ) + max_{n∈N} Σ_{i=1}^{r_n^best} T_{n,i}^u − max_{n∈N} ( RTT_n^data + Σ_{i=1}^{r_n^best} β_{n,i} )
= max_{n∈N} Σ_{i=1}^{r_n^best} T_{n,i}^u,   (9)

where the first inequality comes from r_n^C3P = argmin_{r_n} max_{n∈N} ( RTT_n^data + Σ_{i=1}^{r_n} (β_{n,i} + T_{n,i}^u) ), and the second inequality comes from the fact that max(f(x) + g(x)) ≤ max(f(x)) + max(g(x)). As seen, the gap between C3P and the non-ergodic solution is bounded by the sum of the T_{n,i}^u. The next theorem characterizes T_{n,i}^u.

(Footnote: Note that in (9), we assume that the runtime of packet i at helper n is the same in both the non-ergodic solution and C3P, which is necessary for a fair comparison.)

Theorem 1: T_{n,i}^u is monotonically decreasing with an increasing number of sub-tasks, and lim_{i→∞} Pr(T_{n,i}^u > 0) → 0.

Proof: Let us first consider the following lemma, which determines the conditions for having a positive T_{n,i+1}^u.

Lemma 2: The necessary and sufficient conditions to satisfy T_{n,i+1}^u > 0 are

Σ_{j=i+1−k}^{i} β_{n,j} < k E[β_{n,i}], ∀k = 1, 2, ..., i.   (10)

Proof: The proof is provided in Appendix B. □

According to the conditions given in Lemma 2, the probability of T_{n,i}^u > 0 is calculated as

Pr(T_{n,i}^u > 0) = ∫_0^{E[β_{n,i}]} ∫_0^{2E[β_{n,i}]−x_i} ... ∫_0^{iE[β_{n,i}]−Σ_{j=2}^{i} x_j} f_{β_{n,1},...,β_{n,i}}(x_1, ..., x_i) dx_1 ... dx_i,   (11)

where f_{β_{n,1},...,β_{n,i}}(x_1, ..., x_i) is the joint probability density function of (β_{n,1}, ..., β_{n,i}). With the assumption that the β_{n,j}, j = 1, 2, ..., i, are drawn from an i.i.d. distribution, the joint probability density function of β_{n,1}, ..., β_{n,i} is the product of i probability density functions:

Pr(T_{n,i}^u > 0)
= ∫_0^{E[β_{n,i}]} ∫_0^{2E[β_{n,i}]−x_i} ... ∫_0^{iE[β_{n,i}]−Σ_{j=2}^{i} x_j} f_{β_{n,i}}(x_1) f_{β_{n,i}}(x_2) ... f_{β_{n,i}}(x_i) dx_1 dx_2 ... dx_i   (12)
= ∫_0^{E[β_{n,i}]} f_{β_{n,i}}(x_i) ∫_0^{2E[β_{n,i}]−x_i} f_{β_{n,i}}(x_{i−1}) ... ∫_0^{(i−1)E[β_{n,i}]−Σ_{j=3}^{i} x_j} f_{β_{n,i}}(x_2) ∫_0^{iE[β_{n,i}]−Σ_{j=2}^{i} x_j} f_{β_{n,i}}(x_1) dx_1 dx_2 ... dx_i   (13)
< ∫_0^{E[β_{n,i}]} f_{β_{n,i}}(x_i) ∫_0^{2E[β_{n,i}]−x_i} f_{β_{n,i}}(x_{i−1}) ... ∫_0^{(i−1)E[β_{n,i}]−Σ_{j=3}^{i} x_j} f_{β_{n,i}}(x_2) dx_2 ... dx_i   (14)
= ∫_0^{E[β_{n,i}]} f_{β_{n,i}}(x_{i−1}) ∫_0^{2E[β_{n,i}]−x_{i−1}} f_{β_{n,i}}(x_{i−2}) ... ∫_0^{(i−1)E[β_{n,i}]−Σ_{j=2}^{i−1} x_j} f_{β_{n,i}}(x_1) dx_1 ... dx_{i−1},   (15)

where the inequality comes from the fact that ∫_0^{iE[β_{n,i}]−Σ_{j=2}^{i} x_j} f_{β_{n,i}}(x_1) dx_1 is less than 1, because the probability density function is integrated over a finite range of the variable x_1, and the last equality comes from a change of variables in the integrals. (15) is equal to Pr(T_{n,i−1}^u > 0), and thus Pr(T_{n,i}^u > 0) < Pr(T_{n,i−1}^u > 0). Similarly, we can show that

Pr(T_{n,j}^u > 0) < Pr(T_{n,j−1}^u > 0), ∀j = 2, 3, ..., i.   (16)

From the above, we can conclude that as i gets larger, Pr(T_{n,i}^u > 0) gets smaller, and lim_{i→∞} Pr(T_{n,i}^u > 0) → 0 is satisfied. This concludes the proof. □
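Theorem 1 can be spot-checked with a small Monte Carlo experiment: sample i.i.d. runtimes, count how often all of the conditions in (10) hold, and watch the estimate of Pr(T^u_{n,i} > 0) shrink as i grows. The sketch below uses unit-mean exponential runtimes and the true mean as the estimate, which are simplifying assumptions of ours rather than the paper's exact setup:

```python
import random

def prob_underutilized(i, mean=1.0, trials=20000, seed=7):
    """Monte Carlo estimate of Pr(T^u_{n,i} > 0) via Lemma 2's conditions:
    the sum of the last k runtimes < k * E[beta] for every k = 1, ..., i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        betas = [rng.expovariate(1.0 / mean) for _ in range(i)]
        suffix = 0.0
        ok = True
        for k in range(1, i + 1):
            suffix += betas[i - k]      # running sum of the last k runtimes
            if suffix >= k * mean:      # the k-th condition in (10) fails
                ok = False
                break
        hits += ok
    return hits / trials

probs = [prob_underutilized(i) for i in (1, 2, 4, 8)]
# The estimated probabilities decrease as the packet index i grows.
assert all(a > b for a, b in zip(probs, probs[1:]))
```

For i = 1 the estimate is simply Pr(β < E[β]) ≈ 1 − e^{−1} for the exponential case, and the estimates fall monotonically afterwards, matching the theorem.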
C3P and the non-ergodic solu-tion decreases with increasing the number of sub-tasks andeventually the rate becomes zero for R → ∞ . Therefore, thegap becomes finite even for R → ∞ . B. Performance of
C3P w.r.t the Static Solution
In this section, we analyze the performance of C3P as compared to the static solution characterized in Section III-B. The next theorem characterizes the task completion delay of C3P as well as the optimal task offloading policy.

Theorem 3: The task completion delay of C3P approaches

T_C3P ≈ (R + K) / ( Σ_{n'=1}^{N} 1/E[β_{n',i}] ),   (17)

with increasing R, and the number of tasks offloaded to helper n is approximated as

r_n^C3P ≈ (R + K) / ( E[β_{n,i}] Σ_{n'=1}^{N} 1/E[β_{n',i}] ).   (18)

Proof: The proof is provided in Appendix C. □

Theorem 3 shows that the task completion delay of C3P gets close to the static solution T_static characterized in Section III-B with increasing R. The gap between T_static and T_C3P is K / Σ_{n'=1}^{N} (1/E[β_{n',i}]), which is due to the coding overhead of Fountain codes and becomes negligible for large R.

C. Performance of C3P w.r.t. Repetition Codes
In this section, we demonstrate the performance of C3P as compared to repetition coding with round-robin (RR) scheduling through an illustrative example. Repetition codes with RR scheduling work as follows. Uncoded packets from the set Γ = {ρ_1, ρ_2, ..., ρ_R} are offloaded to the helpers one by one (in a round-robin manner) depending on their sequence in Γ. For example, ρ_1 is offloaded to helper 1, ρ_2 is offloaded to helper 2, and so on. When all the packets have been offloaded from Γ, we start again from the first packet in the set (so it is repetition coding). Note that whenever a packet is computed and the corresponding ACK is received, the packet is removed from Γ. Thus, this RR scheduling continues until Γ becomes an empty set. We use TTI_{n,i} in (6) to determine the next scheduling time for helper n. The next example demonstrates the benefit of C3P as compared to this repetition coding mechanism with RR scheduling.
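The gain of rateless coding over a fixed uncoded assignment can be previewed numerically. The per-packet runtimes below are hypothetical and constant per helper, a simplification of ours, not the values used in the example that follows:

```python
# Hypothetical per-packet runtimes for N = 3 helpers and R = 6 packets.
runtimes = [1.0, 2.0, 10.0]
R = 6

# Fixed uncoded assignment: packet i is pinned to helper i % 3, so the
# collector must wait for every helper's last packet.
finish = [0.0] * len(runtimes)
for i in range(R):
    finish[i % len(runtimes)] += runtimes[i % len(runtimes)]
uncoded_delay = max(finish)

# Rateless coding: helpers stream coded packets back to back; the task
# completes once any R computed packets (from any helpers) have arrived.
arrivals = sorted((k + 1) * b for b in runtimes for k in range(R))
coded_delay = arrivals[R - 1]

assert uncoded_delay == 20.0
assert coded_delay == 4.0
```

Under the fixed assignment the slowest helper gates completion at t = 20, while with rateless packets the collector already holds R = 6 results at t = 4; repetition with RR scheduling sits between these two extremes, since it re-offloads straggling packets but still wastes the work of slow helpers.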
Example 2:
We consider the same setup as in Example 1. We assume heterogeneous per-packet runtimes across the three helpers, with helpers 1 and 2 considerably faster than helper 3, and negligible packet transmission times.

As seen in Fig. 3(a), the RR scheduler sends ρ_1, ρ_2, and ρ_3 to helpers 1, 2, and 3, respectively, at time t = 0. At t = 1, the computed packet ρ_1x is received at the collector, and ρ_4, which is the next packet selected by the RR scheduler, is transmitted to helper 1. The remaining packets are transmitted to the helpers in the same manner until the results for all packets are received at the collector. As seen, the resources of the fast helpers are wasted recomputing a packet that a slower helper is still working on, because those resources could have been used for computing a new packet. C3P addresses this problem thanks to employing Fountain codes. In particular, at time t = 0, three Fountain-coded packets ν_1, ν_2, ν_3 are created and transmitted to the three helpers, i.e., p_{1,1} = ν_1, p_{2,1} = ν_2, p_{3,1} = ν_3. At t = 1, a new coded packet ν_4 is created and transmitted as a second packet to helper 1, i.e., p_{1,2} = ν_4. This continues until enough computed coded packets (assuming that the overhead of Fountain codes, i.e., K, is zero) are received at the collector, which happens noticeably earlier than under RR scheduling. □

Fig. 3. Performance of C3P with respect to repetition codes with RR scheduling: (a) repetition codes with RR scheduling, (b) C3P.

Example 2 shows that the task completion delay is significantly reduced when we use Fountain codes. Section V presents extensive simulation results supporting this illustrative example.

D. Efficiency of C3P
In this section, we characterize the efficiency of C3P in the worst-case scenario, when per-task runtimes follow the shifted exponential distribution. We call it the worst-case efficiency because we account for the per-packet under-utilization T^u_{n,i} in the efficiency calculation, but we do not use the fact, stated in Theorem 1, that T^u_{n,i} is monotonically decreasing.
Theorem 4: Assume that the runtime of each packet, i.e., β_{n,i}, is an i.i.d. random variable following the shifted exponential distribution

F_{β_{n,i}}(t) = Pr(β_{n,i} < t) = 1 − e^{−μ_n (t − a_n)},   (19)

with mean a_n + 1/μ_n and shift a_n. The expected duration that helper n is underutilized per packet is characterized as

E[T^u_{n,i}] = (1/(e μ_n)) (1 − e^{μ_n RTT^data_n}) + RTT^data_n,   if RTT^data_n < 1/μ_n,
E[T^u_{n,i}] = 1/(e μ_n),   otherwise.   (20)

Proof: The proof is provided in Appendix D. □
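As a sanity check on (20), the closed form can be compared against a Monte Carlo estimate under the worst-case model T^u_{n,i} = min(max(0, E[β_{n,i}] − β_{n,i}), RTT^data_n), i.e., (29) with the queueing delay set to zero. This is a sketch with our own naming, not the paper's code:

```python
import math
import random

def worst_case_idle_closed(mu, rtt):
    """Closed form (20); note the shift a_n cancels out of T^u."""
    if rtt < 1.0 / mu:
        return (1.0 - math.exp(mu * rtt)) / (math.e * mu) + rtt
    return 1.0 / (math.e * mu)

def worst_case_idle_mc(mu, a, rtt, n=200_000, seed=7):
    """Monte Carlo estimate of E[T^u] under the worst-case model,
    with beta = a + Exp(mu) (shifted exponential, mean a + 1/mu)."""
    rng = random.Random(seed)
    mean_beta = a + 1.0 / mu
    total = 0.0
    for _ in range(n):
        beta = a + rng.expovariate(mu)
        # E[beta] - beta = 1/mu - Exp(mu) sample, so the shift a cancels.
        total += min(max(0.0, mean_beta - beta), rtt)
    return total / n
```

Both branches of (20) agree with the simulated average in both regimes (RTT^data_n below and above 1/μ_n).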
We define the efficiency of helper n in the worst case as γ_n = 1 − E[T^u_{n,i}]/E[β_{n,i}]. Note that E[T^u_{n,i}] is the expected time that helper n is underutilized per packet in the worst case, while E[β_{n,i}] is the expected runtime, i.e., the expected time that helper n works per packet. Thus, E[T^u_{n,i}]/E[β_{n,i}] is the under-utilization ratio of helper n in the worst case, so γ_n = 1 − E[T^u_{n,i}]/E[β_{n,i}] is the worst-case efficiency. From (20), replacing E[β_{n,i}] with a_n + 1/μ_n, γ_n is expressed as follows:

γ_n = (a_n μ_n + 1 − μ_n RTT^data_n − 1/e + e^{μ_n RTT^data_n − 1}) / (1 + a_n μ_n),   if RTT^data_n < 1/μ_n,
γ_n = (e(1 + a_n μ_n) − 1) / (e(1 + a_n μ_n)),   otherwise.   (21)

We show through simulations (in Section V) that (i) γ_n in (21) is high, which is significant as (21) is the worst-case efficiency, and (ii) C3P's efficiency is even larger than γ_n, since γ_n in (21) is the efficiency in the worst case, where the under-utilization time period has its maximum value.

V. PERFORMANCE EVALUATION OF C3P
In this section, we evaluate the performance of our algorithm, the Coded Cooperative Computation Protocol (C3P), via simulations and using real Android-based smartphones.
A. Simulation Results
We consider two scenarios: (i) Scenario 1, where the system resources of each helper vary over time; in this scenario, the runtime for computing each packet p_{n,i}, ∀i, at helper n is an i.i.d. shifted exponential random variable with shift a_n and mean a_n + 1/μ_n. (ii) Scenario 2, where the runtime for computing packets at helper n does not change over time, i.e., β_{n,i} = β_n, ∀i, and β_n, ∀n ∈ N, is a shifted exponential random variable with shift a_n and mean a_n + 1/μ_n.

In our simulations, each simulated point is obtained by averaging over multiple iterations for N = 100 helpers. The transmission rate for sending each packet from the collector to each helper n, and from helper n to the collector, is a Poisson random variable with the average (in Mbps) selected uniformly at random for each helper n. The size of a transmitted packet p_{n,i} is set to B_x = 8R bits, where R is the number of rows of matrix A and varies across our simulations. The sizes of a computed packet p_{n,i}x and an acknowledgement packet are set to B_r = 8 bits and B_ack = 1 bit, respectively. These parameters are used for all plots unless otherwise stated.

Task Completion Delay vs. Number of Rows:
We evaluate C3P for Scenarios 1 and 2 and compare its task completion delay with: (i) the static solution, i.e., the task completion delay characterized in Section III-B for both Scenarios 1 and 2; (ii) the non-ergodic solution, i.e., a realization of the non-ergodic problem characterized in Section III-A, obtained by knowing β_{n,i} a priori at the collector and setting TTI_{n,i} to β_{n,i}; (iii) uncoded, where r_n packets without coding are assigned to each helper n and the collector waits to receive computed values from all helpers; the number of packets assigned to each helper n is inversely proportional to the mean of β_{n,i}, i.e., r_n ∝ 1/(a_n + 1/μ_n); and (iv) HCMM, the coded cooperative framework developed in [7] using block codes. We include the coding overhead for C3P and for the static and non-ergodic solutions.

Fig. 4. Task completion delay vs. number of rows/packets for (a) Scenario 1 and (b) Scenario 2, where the runtime for computing one row at helper n is drawn from a shifted exponential distribution with a fixed shift a_n, ∀n ∈ N, and μ_n selected uniformly at random per helper.

Fig. 4(a) shows completion delay versus number of rows for Scenario 1, where the runtime for computing each packet at helper n, β_{n,i}, ∀i, is a shifted exponential random variable with shift a_n and mean a_n + 1/μ_n. As seen, C3P performs close to the static and non-ergodic solutions, which shows the effectiveness of our proposed algorithm. In addition, C3P performs better than the baselines, with substantial average improvement over both HCMM and no coding. Fig. 4(b) considers the same setup but for Scenario 2, where the runtime for computing r_n packets at helper n is r_n β_n, and β_n is selected from a shifted exponential distribution with shift a_n, ∀n ∈ N, and μ_n selected uniformly at random. As seen, for this scenario as well, C3P performs close to the static and non-ergodic solutions; C3P performs better than HCMM, and HCMM performs better than no coding. Note that uncoded performs better than HCMM in Scenario 1: HCMM is designed for Scenario 2, so it does not work well in Scenario 1. C3P performs well in both scenarios.

Fig. 5 shows completion delay versus number of rows for both Scenarios 1 and 2, where the runtime for computing the rows at each helper n is drawn from a shifted exponential distribution with μ_n, n ∈ N, selected uniformly at random and a_n = 1/μ_n (different shifts for different helpers). As seen, C3P performs close to the static and non-ergodic solutions and much better than the baselines; in both scenarios, C3P obtains substantial improvement over both HCMM and no coding.

Fig. 5. Task completion delay vs. number of rows/packets for (a) Scenario 1 and (b) Scenario 2, where the runtime for computing one row at each helper n is drawn from a shifted exponential distribution with μ_n selected uniformly at random for different helpers and a_n = 1/μ_n, ∀n ∈ N.
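The two runtime scenarios above can be sketched as follows (our own illustrative code, not the simulator used in the paper):

```python
import random

def make_runtimes(N, R, scenario, a, mu, seed=0):
    """Per-packet runtimes beta[n][i] from a shifted exponential with
    shift a[n] and mean a[n] + 1/mu[n].
    Scenario 1: i.i.d. across packets (time-varying helper resources).
    Scenario 2: one draw per helper, constant across packets."""
    rng = random.Random(seed)
    beta = []
    for n in range(N):
        if scenario == 1:
            row = [a[n] + rng.expovariate(mu[n]) for _ in range(R)]
        else:
            b = a[n] + rng.expovariate(mu[n])
            row = [b] * R
        beta.append(row)
    return beta
```

In Scenario 2 each helper's row is constant, which is what makes a static a-priori assignment (such as HCMM's) effective there but not in Scenario 1.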
Efficiency:
We calculated the efficiency of the helpers for different simulation setups and compared it with the theoretical efficiency obtained in (21) for Scenario 1. For all simulation setups, the average efficiency over all helpers was high, and the theoretical efficiency was a little lower than the simulated efficiency; e.g., for R = 8000 rows, with μ_n, n ∈ N, selected uniformly at random and a_n = 1/μ_n, the average simulated efficiency over all helpers exceeded the average theoretical efficiency. This is expected, as the theoretical efficiency is calculated for the worst-case scenario. We also calculated the efficiency of the helpers for Scenario 2; for all simulation setups, the average efficiency over all helpers was again high, e.g., for R = 8000 rows with the same parameters. Note that the theoretical efficiency in (21) is derived for Scenario 1. The simulated efficiency can be lower than the theoretical one because the simulation underutilizes each helper while transmitting the very first packet to it, i.e., before the collector has estimated that helper's resources.

C3P as Compared to Repetition Coding and Round-Robin Scheduling:
Fig. 6 shows the percentage of improvement of
C3P over repetition coding with RR scheduling in terms of task completion delay. The number of rows is selected as R = 2000, including the Fountain-code overhead for C3P, and the number of helpers varies from N = 100 to N = 600. The transmission rate for sending each packet from the collector to each helper n, and from helper n to the collector, is a Poisson random variable with the average selected uniformly at random for each helper n. The other parameters are the same as those used in Fig. 4(a). As seen, as the number of helpers increases, more improvement is gained by C3P over repetition coding with RR scheduling.

Fig. 6. Percentage of improvement of C3P over repetition codes with RR scheduling in terms of the task completion delay.
B. Evaluation in a Testbed
We implemented a testbed of a collector and multiple helpers using real mobile devices, specifically Android 6.0.1-based Nexus 6P and Nexus 5 smartphones. All the helpers are connected to the collector device using Wi-Fi Direct connections. We conducted our experiments in a lab environment where several other Wi-Fi networks were operating in the background. We located all the devices in close proximity of each other (within a few meters). We implemented both C3P and repetition coding with RR scheduling in our testbed. The collector device would like to calculate the matrix multiplication y = Ax, where A is a K × K matrix and x is a K × 1 vector. Matrix A is divided into sub-matrices, each of which contains a subset of the rows of A. A sub-task to be processed by a helper is the multiplication of a sub-matrix with the vector x. There is one collector device (Nexus 5) and a varying number of helpers (Nexus 6P).

Fig. 7 shows task completion delay versus number of helpers for both C3P and repetition codes with RR scheduling. In this setup, each helper receives a sub-task, processes it, waits for a random amount of time (an exponential random variable with a fixed mean), which may arise due to other applications running on the smartphones, and then sends the result back to the collector. As can be seen, the task completion delay reduces with an increasing number of helpers in both algorithms. When there is one helper, C3P performs worse, which is expected: C3P introduces coding overhead, and one helper is too few to see the benefit of coding. On the other hand, when the number of helpers increases, we start seeing the benefit of coding, and C3P improves over repetition codes with RR scheduling. This result confirms our simulation results in Fig. 6 in a testbed with real Android-based smartphones.

Fig. 8 shows the task completion delay versus per sub-task random delays at the helpers in a multi-helper scenario. As can be seen, C3P improves more over repetition codes with RR scheduling when the delay increases, as it increases
heterogeneity, and C3P is designed to take heterogeneity into account.

Fig. 7. Task completion delay versus number of helpers.

Fig. 8. Task completion delay versus per sub-task delay.

VI. RELATED WORK
Mobile cloud computing is a rapidly growing field with the goal of providing extensive computational resources to mobile devices as well as higher quality of experience [13], [14], [15]. The initial approach to mobile cloud computing has been to offload resource-intensive tasks to remote clouds by exploiting the Internet connectivity of mobile devices. This approach has received a lot of attention, which has led to extensive literature in the area [16], [17], [18], [19], [20]. The feasibility of computation offloading to a remote cloud by mobile devices [21], as well as energy-efficient computation offloading [22], [23], has been considered in previous work. As compared to this line of work, our focus is on edge computing rather than remote clouds.

There is an increasing interest in edge computing that exploits connectivity among mobile devices [24]. This approach suggests that if devices in close proximity are capable of processing tasks cooperatively, then local-area computation groups could be formed and exploited for computation. Indeed, cooperative computation mechanisms exploiting device-to-device connections of mobile devices in close proximity are developed in [24] and [25]. A similar approach is considered in [26], with particular focus on load balancing across workers. As compared to this line of work, we consider coded cooperative computation.

Coded cooperative computation is shown to provide higher reliability, smaller delay, and reduced communication cost in the MapReduce framework [27], where computationally intensive tasks are offloaded to distributed server clusters [28]. In [3] and [29], coded computation for matrix multiplication is considered, where matrix A is divided into sub-matrices and each sub-matrix is sent from the master node (called the collector in our work) to one of the worker nodes (called helpers in our work) for matrix multiplication, with the assumption that the workers are homogeneous. In [3], the workload of the worker nodes is optimized such that the overall runtime is minimized. Fountain codes are employed in [30] for coded computation, but for homogeneous resources. In [7], the same problem is considered, but with the assumption that workers are heterogeneous in terms of their resources. Compared to this line of work, we develop C3P, a practical algorithm that (i) is adaptive to the time-varying resources of helpers, and (ii) does not require any prior information about the computation capabilities of the helpers. As shown, our proposed method reduces the task completion delay significantly as compared to prior work.

VII. CONCLUSION
In this paper, we designed the Coded Cooperative Computation Protocol (C3P) for settings where heterogeneous edge devices with computation capabilities and energy resources are connected to each other. In C3P, a collector device divides tasks into sub-tasks and offloads them to helpers by taking into account their heterogeneous resources. C3P is (i) a dynamic algorithm that efficiently utilizes the potential of each helper, and (ii) adaptive to the time-varying resources at the helpers. We analyzed the performance of C3P in terms of task completion delay and efficiency. Simulation and experiment results in an Android testbed confirm that C3P is efficient and reduces the completion delay significantly as compared to baselines.
REFERENCES

[1] Y. Li and W. Wang, "Can mobile cloudlets support mobile applications?," in Proc. of IEEE INFOCOM, Toronto, ON, 2014.
[2] M. Chen, Y. Hao, Y. Li, C. F. Lai and D. Wu, "On the computation offloading at ad hoc cloudlet: architecture and service modes," in IEEE Communications Magazine, vol. 53, no. 6, pp. 18-24, June 2015.
[3] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos and K. Ramchandran, "Speeding up distributed machine learning using codes," in Proc. of IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, July 2016.
[4] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, "Learning to rank using gradient descent," in Proc. of the ACM 22nd International Conference on Machine Learning (ICML), Bonn, Germany, Aug. 2005.
[5] T. Zhang, "Solving large scale linear prediction problems using stochastic gradient descent algorithms," in Proc. of the ACM 21st International Conference on Machine Learning (ICML), Banff, Canada, July 2004.
[6] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proc. of the International Conference on Computational Statistics (COMPSTAT), Paris, France, Aug. 2010.
[7] A. Reisizadeh, S. Prakash, R. Pedarsani and A. S. Avestimehr, "Coded computation over heterogeneous clusters," in Proc. of IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, June 2017.
[8] M. Luby, "LT codes," in Proc. of the 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002.
[9] A. Shokrollahi, "Raptor codes," in IEEE Transactions on Information Theory, vol. 52, no. 6, pp. 2551-2567, June 2006.
[10] D. J. C. MacKay, "Fountain codes," in IEE Proceedings - Communications, vol. 152, no. 6, pp. 1062-1068, Dec. 2005.
[11] T. Ho, R. Koetter, M. Medard, D. R. Karger and M. Effros, "The benefits of coding over routing in a randomized setting," in Proc. of IEEE International Symposium on Information Theory (ISIT), Yokohama, Japan, 2003.
[12] S. Lin, D. J. Costello, and M. J. Miller, "Automatic-repeat-request error-control schemes," in IEEE Communications Magazine, vol. 22, no. 12, pp. 5-17, 1984.
[13] H. T. Dinh, C. Lee, D. Niyato, and P. Wang, "A survey of mobile cloud computing: architecture, applications, and approaches," in Wireless Communications and Mobile Computing, vol. 13, no. 8, October 2011.
[14] N. Fernando, S. W. Loke, and W. Rahayu, "Mobile cloud computing: A survey," in Future Generation Computer Systems, vol. 29, no. 1, pp. 84-106, 2013.
[15] Z. Sanaei, S. Abolfazli, A. Gani, and R. Buyya, "Heterogeneity in mobile cloud computing: Taxonomy and open challenges," in IEEE Communications Surveys & Tutorials, vol. 16, no. 1, pp. 369-392, 2014.
[16] M. Gordon, D. Jamshidi, S. Mahlke, Z. Mao, and X. Chen, "COMET: Code offload by migrating execution transparently," in Proc. OSDI, Hollywood, CA, October 2012.
[17] E. Cuervo, A. Balasubramanian, D. Cho, A. Wolman, S. Saroiu, R. Chandra, and P. Bahl, "MAUI: Making smartphones last longer with code offload," in Proc. ACM MobiSys, San Francisco, CA, June 2010.
[18] Y. Zhang, G. Huang, X. Liu, W. Zhang, H. Mei, and S. Yang, "Refactoring android java code for on-demand computation offloading," in OOPSLA, Tucson, AZ, October 2012.
[19] R. Kemp, N. Palmer, T. Kielmann, and H. Bal, "Cuckoo: A computation offloading framework for smartphones," in Mobile Computing, Applications, and Services, vol. 76, pp. 59-79, 2012.
[20] D. T. Hoang, D. Niyato, and P. Wang, "Optimal admission control policy for mobile cloud computing hotspot with cloudlet," in Proc. IEEE Wireless Communications and Networking Conference (WCNC), April 2012, pp. 3145-3149.
[21] S. Kosta, A. Aucinas, P. Hui, R. Mortier, and X. Zhang, "ThinkAir: Dynamic resource allocation and parallel execution in the cloud for mobile code offloading," in Proc. IEEE INFOCOM, March 2012, pp. 945-953.
[22] Y. Geng, W. Hu, Y. Yang, W. Gao, and G. Cao, "Energy-efficient computation offloading in cellular networks," in Proc. IEEE 23rd International Conference on Network Protocols (ICNP), Nov. 2015, pp. 145-155.
[23] W. Zhang, Y. Wen, and D. O. Wu, "Energy-efficient scheduling policy for collaborative execution in mobile cloud computing," in Proc. IEEE INFOCOM, April 2013, pp. 190-194.
[24] R. K. Lomotey and R. Deters, "Architectural designs from mobile cloud computing to ubiquitous cloud computing - survey," in Proc. IEEE Services, Anchorage, Alaska, June 2014.
[25] E. Miluzzo, R. Caceres and Y. Chen, "Vision: mClouds - computing on clouds of mobile devices," in Proc. ACM Workshop on Mobile Cloud Computing and Services, Low Wood Bay, Lake District, UK, June 2012.
[26] T. Penner, A. Johnson, B. V. Slyke, M. Guirguis and Q. Gu, "Transient clouds: Assignment and collaborative execution of tasks on mobile devices," in Proc. IEEE GLOBECOM, Austin, TX, Dec. 2014.
[27] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," in Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[28] S. Li, M. A. Maddah-Ali and A. S. Avestimehr, "Coded MapReduce," in Proc. 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, 2015.
[29] S. Dutta, V. Cadambe and P. Grover, "Short-Dot: Computing large linear transforms distributedly using coded short dot products," in Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, Dec. 2016.
[30] A. Mallick, M. Chaudhari and G. Joshi, "Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication," available on arXiv, arXiv:1804.10331v2.
Fig. 9. Demonstrating RTT^data_n through an example.

APPENDIX A: CALCULATING E[β_{n,i}] AT THE COLLECTOR
We define and use the parameters residual time XTT_{n,i} and round trip time of computed packets RTT^data_{n,i} to estimate E[β_{n,i}]. Next, we first show how we characterize XTT_{n,i} and RTT^data_{n,i}, and then present how we estimate E[β_{n,i}] using them.

Characterization of XTT_{n,i}. The collector knows the transmission time T^x_{n,i} of packet p_{n,i} and the time T^r_{n,i} at which the computed packet p_{n,i}x is received. Thus, the time between transmitting a packet and receiving its computed value, defined as T^t_{n,i} = T^r_{n,i} − T^x_{n,i}, can be calculated at the collector. On the other hand, to better utilize the resources at helper n, the collector should offload a new packet before (or immediately after) receiving the computed value of the previous packet, i.e., the following condition should be satisfied: TTI_{n,i} ≤ T^t_{n,i}. Thus, we can write TTI_{n,i} = T^t_{n,i} − XTT_{n,i+1}, where XTT_{n,i+1} is the residual time that the collector measures in our C3P setup:

XTT_{n,i+1} = T^r_{n,i} − T^x_{n,i+1}.   (22)

Characterization of
RTT^data_{n,i}. We define RTT^data_{n,i} as the round trip time (RTT) of packet p_{n,i} sent to helper n. More precisely, RTT^data_{n,i} is equal to the transmission delay of packet p_{n,i} from the collector to helper n plus the transmission delay of the calculated packet p_{n,i}x from helper n to the collector. Although RTT^data_{n,i} is a round trip time, it cannot be directly measured at the collector, as the collector only knows the time period between sending a packet and receiving the computed packet, which equals the sum of transmission and computing delays. Thus, in C3P, we calculate RTT^data_{n,i} using RTT^ack_{n,i}, which is the time period between sending packet p_{n,i} and receiving its ACK at the collector. Fig. 9 demonstrates the difference between RTT^data_{n,i} and RTT^ack_{n,i}. Note that RTT^ack_{n,i} can be directly measured by employing ACKs. We can represent RTT^ack_{n,i} as RTT^ack_{n,i} = B_x/C^up_{n,i} + B_ack/C^down_{n,i}, where B_x is the size of the transmitted packet, B_ack is the size of the ACK packet, and C^up_{n,i} and C^down_{n,i} are the uplink (from the collector to helper n) and downlink (from helper n to the collector) transmission rates experienced by packet p_{n,i} and its ACK. Similarly, RTT^data_{n,i} is characterized as RTT^data_{n,i} = B_x/C^up_{n,i} + B_r/C^down_{n,i}, where B_r is the size of the computed packet p_{n,i}x. Assuming that the uplink and downlink transmission rates are the same, which is likely in an IoT setup, we obtain

RTT^data_{n,i} = ((B_x + B_r)/(B_x + B_ack)) RTT^ack_{n,i}.   (23)

As discussed earlier, RTT^ack_{n,i} can be directly measured by the collector and used in (23) to determine RTT^data_{n,i}. Next, we characterize the average data round trip time of helper n, RTT^data_n, as an exponentially weighted moving average of the per-packet round trip times RTT^data_{n,i}:

RTT^data_n = α RTT^data_{n,i} + (1 − α) RTT^data_n,   (24)

where α is a weight satisfying 0 < α < 1. Now that we have characterized XTT_{n,i} and
RTT^data_n, and discussed how we can measure these parameters at the collector, we explain how to use them to calculate E[β_{n,i}] at the collector.

Calculation of E[β_{n,i}]. We formulate E[β_{n,i}] as follows:

E[β_{n,i}] ≈ ( Σ_{j=1}^{m_n} β_{n,j} ) / m_n = (T^c_{n,i} − T^u_n) / m_n,   (25)

where T^c_{n,i} is the time at which helper n finishes computing p_{n,i}x, T^u_n is the collector's estimate of the total (cumulative) time that helper n has been underutilized, and m_n is the number of packets that helper n has processed up to (and including) packet p_{n,i}. Since T^c_{n,i} is the time instant at which helper n finishes computing packet p_{n,i}x, and T^u_n is the cumulative time that helper n has been underutilized, their difference gives the total time that helper n has been busy since the start. This total busy time, T^c_{n,i} − T^u_n, is normalized by the total number of processed packets m_n to determine E[β_{n,i}]. Next, we characterize T^c_{n,i} and T^u_n in terms of XTT_{n,i} and RTT^data_n.

The collector estimates T^c_{n,i} as

T^c_{n,i} ≈ T^r_{n,i} − (B_r/(B_x + B_r)) RTT^data_n,   (26)

where T^r_{n,i} is the time at which the computed packet p_{n,i}x is received by the collector from helper n, so it is known by the collector; (B_r/(B_x + B_r)) RTT^data_n is the backward trip time estimated by the collector using (24) and the packet sizes. The next step is to characterize T^u_n, the collector's estimate of the total (cumulative) time that helper n is underutilized. T^u_n is the sum of all per-packet under-utilization times T^u_{n,i}. In particular,
T^u_{n,i} is defined as the time period during which the helper is idle between computing packets p_{n,i−1}x and p_{n,i}x. In order to calculate T^u_{n,i}, we should determine the state of the system, i.e., whether the system is in the ideal, underutilized, or congested case (Fig. 2). As shown in Fig. 2, XTT_{n,i} ≥ RTT^data_n in the ideal and congested cases. Note that these cases occur when packet p_{n,i} is received at the helper while the helper is still computing p_{n,i−1}x, or right after the helper has computed p_{n,i−1}x. In this setup, since there is no under-utilization, T^u_{n,i} is equal to 0. On the other hand, XTT_{n,i} < RTT^data_n in the underutilized case. As seen in Fig. 2(b), the underutilized case occurs when packet p_{n,i} is received at the helper some time after the helper has finished computing packet p_{n,i−1}x. In this setup, RTT^data_n − XTT_{n,i} is the approximate duration that helper n is idle before calculating p_{n,i}x, i.e., T^u_{n,i} ≈ RTT^data_n − XTT_{n,i}.

Algorithm 2 Calculating E[β_{n,i}] by the collector
  Initialize: T^x_{n,1} = 0, RTT^data_n = 0, ∀n ∈ N.
  if an ACK for successful transmission of packet p_{n,i} is received from helper n then
    Update RTT^data_n according to (24).
  if calculated packet p_{n,i}x and the corresponding computation ACK are received then
    if i == 1 then
      T^u_n = 0.
    else
      XTT_{n,i} = T^r_{n,i−1} − T^x_{n,i}.
      Update T^u_n according to (27).
    Calculate E[β_{n,i}] from (25).
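A compact sketch of Algorithm 2 in code form (class and method names are ours; the paper specifies only the pseudocode and eqs. (22)-(27)):

```python
class RuntimeEstimator:
    """Per-helper estimator of the mean runtime E[beta] from timestamps only.
    Assumes packets are sent and their results received in order."""

    def __init__(self, b_x, b_r, b_ack, alpha=0.5):
        self.b_x, self.b_r, self.b_ack = b_x, b_r, b_ack  # packet sizes (bits)
        self.alpha = alpha        # EWMA weight, 0 < alpha < 1
        self.rtt_data = 0.0       # smoothed data round trip time, eq. (24)
        self.t_u = 0.0            # cumulative under-utilization estimate
        self.m = 0                # packets processed so far
        self.send_times = {}      # i -> T^x_{n,i}
        self.prev_recv = None     # T^r_{n,i-1}

    def on_send(self, i, t_x):
        self.send_times[i] = t_x

    def on_tx_ack(self, rtt_ack):
        # Eq. (23): scale the measured ACK RTT into a data-RTT sample,
        # then eq. (24): EWMA update.
        sample = (self.b_x + self.b_r) / (self.b_x + self.b_ack) * rtt_ack
        self.rtt_data = self.alpha * sample + (1 - self.alpha) * self.rtt_data

    def on_result(self, i, t_r):
        """Computed packet i received at time t_r; returns the E[beta] estimate."""
        if i > 1:
            xtt = self.prev_recv - self.send_times[i]   # eq. (22): T^r_{i-1} - T^x_i
            self.t_u += max(0.0, self.rtt_data - xtt)   # eq. (27)
        self.prev_recv = t_r
        self.m = i
        # Eq. (26): back out the backward trip time, then eq. (25).
        t_c = t_r - self.b_r / (self.b_x + self.b_r) * self.rtt_data
        return (t_c - self.t_u) / self.m
```

With negligible transmission delays (rtt_data = 0), the estimate reduces to busy time divided by the number of processed packets, as expected from (25).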
Therefore, T^u_n is updated after p_{n,i}x is received by the collector as follows:

T^u_n ≈ T^u_n + max{0, RTT^data_n − XTT_{n,i}}.   (27)

As seen, E[β_{n,i}] can be calculated from parameters that are known by the collector. The process of calculating E[β_{n,i}] by the collector is summarized in Algorithm 2.

APPENDIX B: PROOF OF LEMMA

In this appendix, we first characterize T^u_{n,i} and then find the closed-form conditions for T^u_{n,i} > 0.

A. Characterising T^u_{n,i}
According to (6), in C3P the packets are transmitted from the collector to helper n with a time interval equal to min(T^r_{n,i} − T^x_{n,i}, E[β_{n,i}]). We first provide the queueing model for the case of TTI_{n,i} equal to E[β_{n,i}], and then show that the queueing model for C3P is the same as this queue, with the only difference that the idle time in C3P is smaller than the idle time for the case with TTI_{n,i} equal to E[β_{n,i}].
T T I n,i equal to E [ β n,i ] . The systemof the collector and helper n for the case with T T I n,i equal to E [ β n,i ] can be modeled as a queue, where each packet p n,i isarrived at helper n with the arrival rate of packet per E [ β n,i ] and processed with the service time of β n,i . In the steady statecase, ( i.e., the case that the queue is empty at the time packet p n,i is received at the helper), if the service time is larger thanthe arrival time, i.e., β n,i > E [ β n,i ] , the next received packetof p n,i +1 will be queued at the helper for the time period equalto the difference between the service time and the arrival time, i.e., β n,i − E [ β n,i ] . On the other hand, if the service time issmaller than the arrival time, i.e., β n,i < E [ β n,i ] , processingof the received packet p n,i +1 will be delayed for the timeperiod equal to the difference between the arrival time and theservice time, i.e., E [ β n,i ] − β n,i , after computing the previouspacket of p n,i . This is the idle (underutilized) time period ofthe queue. Now let us consider the general case where thequeue is not empty when packet p n,i +1 is received at helper n with the queueing delay of T q n,i , i.e., it takes T q n,i forhelper n to start computing the last packet in its queue. Inthis case, if β n,i < E [ β n,i ] , then the underutilized time periodbetween computing packet p n,i and packet p n,i +1 at the helperis equal to max (0 , E [ β n,i ] − β n,i − T q n,i ) and thus T u n,i ischaracterized as
T u n,i = max (cid:0) max(0 , E [ β n,i ] − β n,i ) − T q n,i , (cid:1) . (28)We will formulate T q n,i later in this section, but before thatlet us formulate
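This queueing model can be sanity-checked with a short event-driven simulation. The following is a minimal sketch, not from the paper: it assumes hypothetical i.i.d. exponential compute times with mean $E[\beta_{n,i}]$ and a fixed send interval of $E[\beta_{n,i}]$, and checks that the idle gaps and waiting times of the simulated FCFS queue match (28) and the queueing-delay recursion (30); all variable names are ours.

```python
import random

random.seed(7)

E_beta = 1.0   # E[beta]: mean compute time, also the send interval
N = 2000       # number of packets

# hypothetical i.i.d. compute times (service times) with mean E_beta
beta = [random.expovariate(1.0 / E_beta) for _ in range(N)]

# Event-driven FCFS helper queue: packet i arrives at time i * E_beta.
starts, finishes = [], []
f = 0.0
for i in range(N):
    s = max(i * E_beta, f)   # computing starts after arrival and previous packet
    starts.append(s)
    f = s + beta[i]
    finishes.append(f)

# Queueing delay from the event model vs. the recursion of (30)
Tq_evt = [starts[i] - i * E_beta for i in range(N)]
Tq_rec, t = [], 0.0
for i in range(N):
    Tq_rec.append(t)
    t = max(beta[i] - E_beta + t, 0.0)
assert all(abs(x - y) < 1e-6 for x, y in zip(Tq_evt, Tq_rec))

# Idle gap before the next packet vs. eq. (28)
for i in range(N - 1):
    tu = max(max(0.0, E_beta - beta[i]) - Tq_evt[i], 0.0)
    assert abs(tu - (starts[i + 1] - finishes[i])) < 1e-6
print("event-driven queue matches (28) and (30)")
```

The check confirms that the nested-max form of (28) equals the raw idle gap $\max(0, E[\beta_{n,i}] - \beta_{n,i} - T^q_{n,i})$, since $T^q_{n,i} \geq 0$.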
Formulating $T^u_{n,i+1}$ for C3P. The difference between C3P and the case where $T^{TI}_{n,i}$ is equal to $E[\beta_{n,i}]$ is that in C3P the idle time is reduced. In particular, if the collector notices that the helper is idle (by receiving the computed packet $p_{n,i}$ back before sending packet $p_{n,i+1}$), it reduces $T^{TI}_{n,i}$, the time interval between sending packets $p_{n,i}$ and $p_{n,i+1}$, to $T^r_{n,i} - T^x_{n,i}$. In this case, from Fig. 2(b), the parameter $XTT_{n,i}$ becomes zero, which reduces the underutilized time $T^u_{n,i+1}$ to $RTT^{data}_n$. Therefore, $T^u_{n,i+1}$ for C3P is equal to:

$$T^u_{n,i+1} = \min\Big(\max\big(\max(0, E[\beta_{n,i}] - \beta_{n,i}) - T^q_{n,i},\, 0\big),\, RTT^{data}_n\Big). \quad (29)$$
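The effect of the $\min(\cdot, RTT^{data}_n)$ cap in (29) can be illustrated with a tiny helper function; a sketch with hypothetical numeric values (the function name and values are ours, not the paper's):

```python
def idle_time_c3p(E_beta, beta_i, Tq_i, rtt_data):
    """Eq. (29): helper idle time before the next packet under C3P."""
    uncapped = max(max(0.0, E_beta - beta_i) - Tq_i, 0.0)  # eq. (28)
    return min(uncapped, rtt_data)

# fast compute (beta << E[beta]) and empty queue: idle capped at the RTT
assert idle_time_c3p(E_beta=1.0, beta_i=0.125, Tq_i=0.0, rtt_data=0.5) == 0.5
# moderate compute: idle below the cap, equals E[beta] - beta
assert idle_time_c3p(1.0, 0.75, 0.0, 0.5) == 0.25
# slow compute (beta > E[beta]): no idle time at all
assert idle_time_c3p(1.0, 1.5, 0.0, 0.5) == 0.0
```

The cap reflects that, under C3P, the collector reacts to an idle helper one data round-trip time after the helper becomes idle, so the idle period cannot exceed $RTT^{data}_n$.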
Formulating $T^q_{n,i}$. The queueing delay $T^q_{n,i}$ is defined as the period that packet $p_{n,i}$ should wait in the queue before being computed by helper $n$. We consider two cases to calculate $T^q_{n,i}$. (i) $\beta_{n,i-1} > E[\beta_{n,i}]$: this is the congested scenario, where $p_{n,i}$ is received at the helper while the helper is busy computing the previously received packets. Therefore, packet $p_{n,i}$ should be queued at the helper, and its queueing delay is equal to the sum of $\beta_{n,i-1} - E[\beta_{n,i}]$ and the queueing delay of its previous packet $p_{n,i-1}$, which is $T^q_{n,i-1}$; thus $T^q_{n,i} = \beta_{n,i-1} - E[\beta_{n,i}] + T^q_{n,i-1}$. (ii) $\beta_{n,i-1} < E[\beta_{n,i}]$: this is the underutilized scenario if there is no packet in the queue when packet $p_{n,i}$ is received at the helper. In this case, the helper will be idle for the time period $E[\beta_{n,i}] - \beta_{n,i-1}$ after it computes packet $p_{n,i-1}$, until it receives packet $p_{n,i}$ and starts computing it. However, if there are packets in the queue at the time packet $p_{n,i}$ is received at helper $n$, then two cases may occur: (a) $T^q_{n,i-1} - (E[\beta_{n,i}] - \beta_{n,i-1}) > 0$: in this case, packet $p_{n,i}$ still has to wait in the queue, but its queueing delay is reduced, compared to the queueing delay of packet $p_{n,i-1}$, by $E[\beta_{n,i}] - \beta_{n,i-1}$. (b) $T^q_{n,i-1} - (E[\beta_{n,i}] - \beta_{n,i-1}) < 0$: in this case, the queueing delay of packet $p_{n,i}$ is zero, and $p_{n,i}$ will be computed by helper $n$ as soon as it is received at the helper. The reason is that helper $n$ finishes computing the previous packet $p_{n,i-1}$ before packet $p_{n,i}$ is received at the helper, and remains idle for the period $(E[\beta_{n,i}] - \beta_{n,i-1}) - T^q_{n,i-1}$ before it starts computing packet $p_{n,i}$. By considering all these cases, $T^q_{n,i}$ can be formulated as:

$$T^q_{n,i} = \max\big(\beta_{n,i-1} - E[\beta_{n,i}] + T^q_{n,i-1},\, 0\big). \quad (30)$$
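The recursion (30) and the cases (i), (ii)(a), and (ii)(b) above can be exercised with a one-line helper; a minimal sketch with hypothetical values (names are ours):

```python
def tq_next(beta_prev, E_beta, tq_prev):
    """Eq. (30): queueing delay of the next packet at the helper."""
    return max(beta_prev - E_beta + tq_prev, 0.0)

# case (i), congestion: beta_{i-1} > E[beta], so the backlog grows
assert tq_next(1.5, 1.0, 0.25) == 0.75
# case (ii)(a): slack E[beta] - beta_{i-1} smaller than the backlog
assert tq_next(0.75, 1.0, 0.5) == 0.25
# case (ii)(b): slack absorbs the backlog entirely, delay drops to zero
assert tq_next(0.5, 1.0, 0.25) == 0.0
```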
B. Finding the Closed-Form Conditions for $T^u_{n,i+1} > 0$

From (29), for $T^u_{n,i+1}$ to be positive, i.e., $\max\big(\max(0, E[\beta_{n,i}] - \beta_{n,i}) - T^q_{n,i},\, 0\big) > 0$ (the outer $\min$ in (29) can be ignored since $RTT^{data}_n > 0$), the following condition should be satisfied:

$$\max(0, E[\beta_{n,i}] - \beta_{n,i}) - T^q_{n,i} > 0 \quad (31)$$
$$\Leftrightarrow \max(0, E[\beta_{n,i}] - \beta_{n,i}) > T^q_{n,i} \quad (32)$$
$$\Leftrightarrow E[\beta_{n,i}] - \beta_{n,i} > T^q_{n,i}, \quad (33)$$

where (33) follows since $T^q_{n,i} \geq 0$, so (32) cannot hold when $E[\beta_{n,i}] - \beta_{n,i} \leq 0$. By replacing $T^q_{n,i}$ from (30), we have:

$$E[\beta_{n,i}] - \beta_{n,i} > \max\big(\beta_{n,i-1} - E[\beta_{n,i}] + T^q_{n,i-1},\, 0\big) \quad (34)$$
$$\Leftrightarrow \begin{cases} E[\beta_{n,i}] - \beta_{n,i} > 0 & (35) \\ E[\beta_{n,i}] - \beta_{n,i} > \beta_{n,i-1} - E[\beta_{n,i}] + T^q_{n,i-1} & (36) \end{cases}$$
$$\Leftrightarrow \begin{cases} E[\beta_{n,i}] > \beta_{n,i} & (37) \\ 2E[\beta_{n,i}] - \beta_{n,i} - \beta_{n,i-1} > T^q_{n,i-1} & (38) \end{cases}$$

(37) generates the first condition of Lemma 2, i.e., $k = 1$. By replacing $T^q_{n,i-1}$ in (38), we have:

$$2E[\beta_{n,i}] - \beta_{n,i} - \beta_{n,i-1} > \max\big(\beta_{n,i-2} - E[\beta_{n,i}] + T^q_{n,i-2},\, 0\big) \quad (39)$$
$$\Leftrightarrow \begin{cases} 2E[\beta_{n,i}] > \beta_{n,i} + \beta_{n,i-1} & (40) \\ 3E[\beta_{n,i}] - \beta_{n,i} - \beta_{n,i-1} - \beta_{n,i-2} > T^q_{n,i-2} & (41) \end{cases}$$

(40) generates the second condition of Lemma 2, i.e., $k = 2$. Intuitively, we can obtain all other conditions of Lemma 2 by repeatedly replacing $T^q_{n,i-2}$ in (41). In the following, we give a formal proof by induction.

First, we prove by induction that $kE[\beta_{n,i}] - \sum_{j=i-k+1}^{i} \beta_{n,j} > T^q_{n,i-k+1}$ is satisfied for all $k = 1, 2, \ldots, i$ when $T^u_{n,i+1} > 0$. We already showed in (33) that this is true for $k = 1$. If this inequality is true for $k = m$, i.e., $mE[\beta_{n,i}] - \sum_{j=i+1-m}^{i} \beta_{n,j} > T^q_{n,i-m+1}$, then by replacing $T^q_{n,i-m+1}$ with its equivalent from (30), we have:

$$mE[\beta_{n,i}] - \sum_{j=i+1-m}^{i} \beta_{n,j} > \max\big(\beta_{n,i-m} - E[\beta_{n,i}] + T^q_{n,i-m},\, 0\big) \quad (42)$$
$$\Rightarrow mE[\beta_{n,i}] - \sum_{j=i+1-m}^{i} \beta_{n,j} > \beta_{n,i-m} - E[\beta_{n,i}] + T^q_{n,i-m} \quad (43)$$
$$\Rightarrow (m+1)E[\beta_{n,i}] - \beta_{n,i-m} - \sum_{j=i+1-m}^{i} \beta_{n,j} > T^q_{n,i-m} \quad (44)$$
$$\Rightarrow (m+1)E[\beta_{n,i}] - \sum_{j=i-m}^{i} \beta_{n,j} > T^q_{n,i-m}, \quad (45)$$

and thus the inequality is true for $k = m+1$. This proves $kE[\beta_{n,i}] - \sum_{j=i+1-k}^{i} \beta_{n,j} > T^q_{n,i-k+1}$, $\forall k = 1, 2, \ldots, i$, from which we conclude $kE[\beta_{n,i}] > \sum_{j=i+1-k}^{i} \beta_{n,j}$, $\forall k = 1, 2, \ldots, i$, as $T^q_{n,i-k+1}$ is nonnegative. Therefore, we proved that the $i$ conditions $kE[\beta_{n,i}] > \sum_{j=i+1-k}^{i} \beta_{n,j}$, $\forall k = 1, 2, \ldots, i$, are necessary for $T^u_{n,i+1} > 0$.

Now, we prove the sufficiency of the conditions in Lemma 2; i.e., if $kE[\beta_{n,i}] > \sum_{j=i+1-k}^{i} \beta_{n,j}$, $\forall k = 1, 2, \ldots, i$, then $T^u_{n,i+1} > 0$, or equivalently $E[\beta_{n,i}] - \beta_{n,i} > T^q_{n,i}$. First, we prove by induction that $(i-l)E[\beta_{n,i}] - \sum_{j=l+1}^{i} \beta_{n,j} > T^q_{n,l+1}$ is satisfied for all $l = 0, 1, \ldots, i-1$ when $kE[\beta_{n,i}] > \sum_{j=i+1-k}^{i} \beta_{n,j}$, $\forall k = 1, 2, \ldots, i$. For $l = 0$, we just need to set $k = i$ in $kE[\beta_{n,i}] > \sum_{j=i+1-k}^{i} \beta_{n,j}$, as the queueing delay of the first packet received at helper $n$, $T^q_{n,1}$, is zero. Now we assume that $(i-l)E[\beta_{n,i}] - \sum_{j=l+1}^{i} \beta_{n,j} > T^q_{n,l+1}$ is satisfied for $l = m$:

$$(i-m)E[\beta_{n,i}] - \sum_{j=m+1}^{i} \beta_{n,j} > T^q_{n,m+1} \quad (46)$$
$$\Rightarrow (i-m-1)E[\beta_{n,i}] - \sum_{j=m+2}^{i} \beta_{n,j} > \beta_{n,m+1} - E[\beta_{n,i}] + T^q_{n,m+1}. \quad (47)$$

On the other hand, by setting $k = i-m-1$ in $kE[\beta_{n,i}] > \sum_{j=i+1-k}^{i} \beta_{n,j}$, we have $(i-m-1)E[\beta_{n,i}] - \sum_{j=m+2}^{i} \beta_{n,j} > 0$, and thus we have:

$$(i-m-1)E[\beta_{n,i}] - \sum_{j=m+2}^{i} \beta_{n,j} > \max\big(\beta_{n,m+1} - E[\beta_{n,i}] + T^q_{n,m+1},\, 0\big) \quad (48)$$
$$= T^q_{n,m+2}. \quad (49)$$

The above inequality shows that $(i-l)E[\beta_{n,i}] - \sum_{j=l+1}^{i} \beta_{n,j} > T^q_{n,l+1}$ is satisfied for $l = m+1$. Therefore, by induction, it is satisfied for all $l = 0, 1, \ldots, i-1$. By setting $l = i-1$, we have $E[\beta_{n,i}] - \beta_{n,i} > T^q_{n,i}$, or equivalently $T^u_{n,i+1} > 0$. This proves the sufficiency of the conditions in Lemma 2. This concludes the proof.
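The equivalence proved above can be checked numerically by brute force: compute $T^u_{n,i+1}$ from the recursions (28) and (30), and compare its positivity with the $i$ partial-sum conditions of Lemma 2. A minimal sketch (not from the paper), assuming hypothetical i.i.d. uniform compute times; all names are ours:

```python
import random

random.seed(1)

def tu_positive(beta, E):
    """True iff T^u_{i+1} > 0, via (28) and (30); i = len(beta)."""
    Tq = 0.0
    for b in beta[:-1]:                 # builds T^q_i from beta_1..beta_{i-1}
        Tq = max(b - E + Tq, 0.0)
    return max(max(0.0, E - beta[-1]) - Tq, 0.0) > 0

def lemma2(beta, E):
    """k*E > sum of the last k compute times, for every k = 1..i."""
    i = len(beta)
    return all(k * E > sum(beta[i - k:]) for k in range(1, i + 1))

for _ in range(20000):
    i = random.randint(1, 6)
    beta = [random.uniform(0.2, 2.0) for _ in range(i)]
    assert tu_positive(beta, 1.0) == lemma2(beta, 1.0)
print("Lemma 2 equivalence verified on random instances")
```

With continuous compute times, boundary ties (where strict and non-strict inequalities would differ) occur with probability zero, so the strict comparison is safe.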
APPENDIX C: PROOF OF THEOREM 3

We first show that the queueing delay $T^q_{n,i}$ is small, and then use this property to prove Theorem 3. According to (30), the queueing delay
$T^q_{n,i}$, which is the delay that packet $p_{n,i}$ experiences before being computed by helper $n$, is equal to the sum of $\beta_{n,i-1} - E[\beta_{n,i}]$ and $T^q_{n,i-1}$ if this sum is positive, and is equal to zero otherwise. If we look at this equation more closely, we observe that we can reformulate $T^q_{n,i}$ as $\sum_{j=i'}^{i-1} (\beta_{n,j} - E[\beta_{n,i}])$, where $i' \le i-1$ corresponds to the last time that helper $n$ has been seen as underutilized, i.e., $i'$ is the largest index $j \le i-1$ for which $T^q_{n,j} = 0$. Therefore, the average of $T^q_{n,i}$ is $E[T^q_{n,i}] = E\big[\sum_{j=i'}^{i-1} (\beta_{n,j} - E[\beta_{n,i}])\big] = 0$, since each $\beta_{n,j}$ has mean $E[\beta_{n,i}]$. Therefore, on average, the queueing delay of C3P is zero. Next, we use this property to prove Theorem 3.
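The unrolled-sum reformulation of $T^q_{n,i}$ can be verified directly against the recursion (30); a small sketch with hypothetical exponential compute times (all names are ours):

```python
import random

random.seed(3)

E = 1.0                                   # E[beta]
beta = [random.expovariate(1.0 / E) for _ in range(500)]

# Recursion (30): T^q_1 = 0, then T^q_i = max(beta_{i-1} - E + T^q_{i-1}, 0)
Tq = [0.0]
for b in beta[:-1]:
    Tq.append(max(b - E + Tq[-1], 0.0))

# Unrolled form: sum of (beta_j - E) since the last index i' with T^q_{i'} = 0
for i in range(len(Tq)):
    if Tq[i] > 0.0:
        ip = max(j for j in range(i) if Tq[j] == 0.0)
        s = 0.0
        for j in range(ip, i):
            s += beta[j] - E
        assert abs(Tq[i] - s) < 1e-9
print("unrolled sum matches the recursion (30)")
```

Between two visits to zero, the $\max(\cdot, 0)$ in (30) is never binding, which is why the recursion reduces to the plain sum of centered compute times.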
C3P sends packets to each helper $n$ with a time interval less than or equal to $E[\beta_{n,i}]$, according to (6). Since the queueing delay of C3P is small, this time interval results in a delay $D^{C3P}_n$ less than or equal to $r^{C3P}_n E[\beta_{n,i}]$ for calculating $r^{C3P}_n$ packets at helper $n$. On the other hand, the collector stops sending packets to helpers once it receives $R + K$ packets collectively from all helpers. Again, since the queueing delay of C3P is small, i.e., packets wait only briefly in the queue of helper $n$ before being computed, at the time that $R + K$ packets are collected at the collector, there may be only a small number of packets waiting at the queue of each helper, and thus $\sum_{n=1}^{N} r^{C3P}_n \simeq R + K$. In addition, with small queueing delay, all helpers finish their assigned tasks at approximately the same time; this results in (17) and (18). This concludes the proof. □
APPENDIX D: PROOF OF THEOREM 4

Since $\beta_{n,i-1}$ is drawn from a shifted exponential distribution with shift value $a_n$ and mean $a_n + 1/\mu_n$, $E[\beta_{n,i}]$ in (29) can be replaced with $a_n + 1/\mu_n$. In addition, $T^q_{n,i}$ is equal to zero, as we consider the worst-case scenario. Therefore, we have:
$$T^u_{n,i} = \begin{cases} RTT^{data}_n, & \text{if } a_n < \beta_{n,i-1} < a_n + \frac{1}{\mu_n} - RTT^{data}_n, \\ a_n + \frac{1}{\mu_n} - \beta_{n,i-1}, & \text{if } a_n + \frac{1}{\mu_n} - RTT^{data}_n < \beta_{n,i-1} < a_n + \frac{1}{\mu_n}, \\ 0, & \text{otherwise}, \end{cases} \quad (50)$$

where the random variable $\beta_{n,i-1}$ is always greater than its shift value $a_n$, i.e., the condition $\beta_{n,i-1} > a_n$ is always satisfied. From (50), the value of $T^u_{n,i}$ changes depending on the value of
$RTT^{data}_n$ and the value of the distribution parameter $1/\mu_n$. We decompose (50) as follows:

$$T^u_{n,i} = \begin{cases} \tilde{T}_{n,i}, & \text{if } RTT^{data}_n < 1/\mu_n, \\ \bar{T}_{n,i}, & \text{otherwise}, \end{cases} \quad (51)$$

where

$$\tilde{T}_{n,i} = \begin{cases} RTT^{data}_n, & \text{if } a_n < \beta_{n,i-1} < a_n + \frac{1}{\mu_n} - RTT^{data}_n, \\ a_n + \frac{1}{\mu_n} - \beta_{n,i-1}, & \text{if } a_n + \frac{1}{\mu_n} - RTT^{data}_n < \beta_{n,i-1} < a_n + \frac{1}{\mu_n}, \\ 0, & \text{otherwise}, \end{cases} \quad (52)$$

and

$$\bar{T}_{n,i} = \begin{cases} a_n + \frac{1}{\mu_n} - \beta_{n,i-1}, & \text{if } a_n \le \beta_{n,i-1} \le a_n + \frac{1}{\mu_n}, \\ 0, & \text{otherwise}. \end{cases} \quad (53)$$

To prove Theorem 4, we find the expected values of $\tilde{T}_{n,i}$ and $\bar{T}_{n,i}$. From (52), the average of $\tilde{T}_{n,i}$ is calculated as:

$$E[\tilde{T}_{n,i}] = \int f_{\beta_{n,i-1}}(t)\, \tilde{T}_{n,i}\, dt = \int_{a_n}^{a_n + \frac{1}{\mu_n} - RTT^{data}_n} \mu_n e^{-\mu_n (t - a_n)}\, RTT^{data}_n\, dt + \int_{a_n + \frac{1}{\mu_n} - RTT^{data}_n}^{a_n + \frac{1}{\mu_n}} \mu_n e^{-\mu_n (t - a_n)} \Big(a_n + \frac{1}{\mu_n} - t\Big)\, dt = RTT^{data}_n + \frac{1}{\mu_n}\big(\exp(-1) - \exp(\mu_n RTT^{data}_n - 1)\big). \quad (54)$$

Similarly, from (53), we can calculate the average of $\bar{T}_{n,i}$:

$$E[\bar{T}_{n,i}] = \int f_{\beta_{n,i-1}}(t)\, \bar{T}_{n,i}\, dt = \int_{a_n}^{a_n + \frac{1}{\mu_n}} \mu_n e^{-\mu_n (t - a_n)} \Big(a_n + \frac{1}{\mu_n} - t\Big)\, dt = \frac{1}{\mu_n} \exp(-1). \quad (55)$$

By replacing the obtained expected values of $\tilde{T}_{n,i}$ and $\bar{T}_{n,i}$ in (51), we have:
$$E[T^u_{n,i}] = \begin{cases} RTT^{data}_n + \frac{1}{\mu_n}\big(\exp(-1) - \exp(\mu_n RTT^{data}_n - 1)\big), & \text{if } RTT^{data}_n < 1/\mu_n, \\ \frac{1}{\mu_n} \exp(-1), & \text{otherwise}. \end{cases} \quad (56)$$

This concludes the proof.
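The two branches of the expected idle time derived above can be cross-checked by Monte Carlo sampling of the shifted exponential; a sketch with hypothetical parameter values for $a_n$, $\mu_n$, and $RTT^{data}_n$ (all names are ours):

```python
import math
import random

random.seed(11)

def tu_50(beta_prev, a, mu, rtt):
    """Piecewise idle time of eq. (50) (worst case, T^q = 0)."""
    E = a + 1.0 / mu
    if a < beta_prev < E - rtt:
        return rtt                     # idle capped at the data RTT
    if E - rtt < beta_prev < E:
        return E - beta_prev           # partial idle period
    return 0.0                         # slow compute: no idle time

def avg_tu(a, mu, rtt, n=400000):
    """Monte Carlo estimate of E[T^u] under beta ~ a + Exp(mu)."""
    return sum(tu_50(a + random.expovariate(mu), a, mu, rtt)
               for _ in range(n)) / n

a, mu = 0.3, 2.0                       # hypothetical shift and rate
# case RTT < 1/mu: closed form RTT + (exp(-1) - exp(mu*RTT - 1))/mu
rtt = 0.25
closed = rtt + (math.exp(-1.0) - math.exp(mu * rtt - 1.0)) / mu
assert abs(avg_tu(a, mu, rtt) - closed) < 5e-3
# case RTT >= 1/mu: closed form exp(-1)/mu
assert abs(avg_tu(a, mu, 0.8) - math.exp(-1.0) / mu) < 5e-3
print("Monte Carlo matches both branches of the closed form")
```

In the second case the cap never binds, since the idle time is at most $1/\mu_n \le RTT^{data}_n$, which is why the expectation collapses to $\exp(-1)/\mu_n$.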