How to Optimally Allocate Resources for Coded Distributed Computing?
Qian Yu*, Songze Li*, Mohammad Ali Maddah-Ali†, and A. Salman Avestimehr*

*Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA
†Nokia Bell Labs, Holmdel, NJ, USA
Abstract
Today's data centers have an abundance of computing resources, hosting server clusters consisting of as many as tens or hundreds of thousands of machines. To execute a complex computing task over a data center, it is natural to distribute computations across many nodes to take advantage of parallel processing. However, as we allocate more and more computing resources to a computation task and further distribute the computations, large amounts of (partially) computed data must be moved between consecutive stages of computation tasks among the nodes, hence the communication load can become the bottleneck. In this paper, we study the optimal allocation of computing resources in distributed computing, in order to minimize the total execution time, accounting for both the duration of the computation and communication phases. In particular, we consider a general MapReduce-type distributed computing framework, in which the computation is decomposed into three stages: Map, Shuffle, and Reduce. We focus on a recently proposed Coded Distributed Computing approach for MapReduce and study the optimal allocation of computing resources in this framework. For all values of the problem parameters, we characterize the optimal number of servers that should be used for distributed processing, provide the optimal placements of the Map and Reduce tasks, and propose an optimal coded data shuffling scheme, in order to minimize the total execution time. To prove the optimality of the proposed scheme, we first derive a matching information-theoretic converse on the execution time, then we prove that among all possible resource allocation schemes that achieve the minimum execution time, our proposed scheme uses the exact minimum possible number of servers.
I. INTRODUCTION
In recent years, distributed systems like Apache Spark [1] and computational primitives like MapReduce [2], Dryad [3], and CIEL [4] have gained significant traction, as they enable the execution of production-scale computation tasks on data sizes of the order of tens of terabytes and more. The design of these modern distributed computing platforms is driven by scaling out computations across clusters consisting of as many as tens or hundreds of thousands of machines. As a result, there is an abundance of computing resources that can be utilized for distributed processing of computation tasks. However, as we allocate more and more computing resources to a computation task and further distribute the computations, a large amount of (partially) computed data must be moved between consecutive stages of computation tasks among the nodes, hence the communication load can become the bottleneck. This gives rise to an important problem:

• How should we optimally allocate computing resources for distributed processing of a computation task in order to minimize its total execution time (accounting for both the duration of computation and communication phases)?
This problem has indeed attracted a lot of attention in recent years, and it has been broadly studied in various settings (see, e.g., [5]–[9]). In this paper, we study the resource allocation problem in the context of a recently proposed coding framework for distributed computing, namely Coded Distributed Computing [10], which allows one to optimally trade computation load for communication load in distributed computing. The key advantage of this framework is that it quantitatively captures the relation between computation time and communication time in distributed computing, which is crucial for resource allocation problems.

More formally, we consider a general MapReduce-type framework for distributed computing (see, e.g., [1], [2]), in which the overall computation is decomposed into three stages, Map, Shuffle, and Reduce, that are executed distributedly across several computing nodes. In the Map phase, each input file is processed locally, in one (or more) of the nodes, to generate intermediate values. In the Shuffle phase, for every output function to be calculated, all intermediate values corresponding to that function are transferred to one of the nodes for reduction. Finally, in the Reduce phase, all intermediate values of a function are reduced to the final result.

In Coded Distributed Computing, we allow redundant execution of Map tasks at the nodes, since it can result in significant reductions in the data shuffling load by enabling in-network coding. In fact, in [10], [11] it has been shown that by assigning the computation of each Map task to r carefully chosen nodes, we can enable novel coding opportunities that reduce the communication load by exactly a multiplicative factor of the computation load r. For example, the communication load can be reduced by more than 50% when each Map task is computed at only one other node (i.e., r = 2).

Based on this framework, we consider two types of implementations:

1) Sequential Implementation. The three phases take place one after another sequentially. In this case, the overall execution time is T_sequential = T_map + T_shuffle + T_reduce.
2) Parallel Implementation. The Shuffle phase happens in parallel with the Map phase. In this case, the overall execution time becomes T_parallel = max{T_map, T_shuffle} + T_reduce.

Then the considered resource allocation problem for, e.g., the sequential implementation can (informally) be formulated as the following optimization problem:

min over {number of utilized servers, placements of Map/Reduce tasks, data shuffling scheme} of T_sequential = T_map + T_shuffle + T_reduce.    (1)

In this paper, we exactly solve the above optimization problem and its counterpart for the parallel implementation. In particular, for each implementation, we propose an optimal resource allocation scheme that exactly achieves the minimum execution time. In the proposed scheme to compute Q output functions, for some design parameter r*, we use Q + ⌈Q/r*⌉ server nodes for computation. These servers are split into two groups, termed the "solvers" and the "helpers". There are Q solver nodes, each computing a distinct Reduce function. The remaining ⌈Q/r*⌉ nodes are helpers, on which Map functions are computed to facilitate a more efficient data shuffling process. No Reduce function is computed on the helpers themselves. In the Map phase, each input file is repetitively mapped on r* solver nodes according to a specified pattern. On the other hand, on the helper nodes, all input files are evenly partitioned and assigned for mapping, without any repetition. Then, in the Shuffle phase, the communication is solely from the helpers to the solvers. In particular, based on the locally computed intermediate values in the Map phase, each helper node constructs coded multicast messages that simultaneously deliver required intermediate values to r* + 1 solvers. From these multicast messages, each solver node can decode the required intermediate values for reduction, using locally computed Map results.
Finally, each solver node computes the assigned Reduce functions (hence the final output functions) locally, using the locally computed Map results and the intermediate values decoded from the messages received from the helpers.

We also prove the exact optimality of our proposed resource allocation strategies for both sequential and parallel implementations. To do that, we first derive a lower bound on the data shuffling time under any placement of the Map and Reduce tasks. From this lower bound, we then derive a lower bound on the minimum total execution time, and show that it is no shorter than the time achieved by the proposed strategy. At the same time, we also prove that the proposed strategy always uses exactly the minimum required number of servers to achieve the exact minimum execution time, by showing that the derived lower bound on the minimum execution time cannot be achieved with fewer servers.

Related Work.
The idea of injecting structured redundancy in computation to provide coding opportunities that significantly reduce the communication load has been studied in [10]–[14]. In all these works, it was assumed that the computation is carried out with a fixed number of computing nodes. Furthermore, a balanced design of the computation scheme was assumed, where the Reduce jobs in the considered MapReduce-type framework have to be evenly distributed over all the nodes. Under these assumptions, these works focused on characterizing the optimal tradeoff between the computation load in the Map phase and the communication load in the Shuffle phase, by designing only the Map phase and the Shuffle phase. In this paper, we generalize the prior works by allowing the flexibility of using an arbitrary number of servers, and unbalanced Reduce task assignments on the computing nodes. We design all three phases (Map, Shuffle, and Reduce) and aim to minimize the total execution time. We also aim to minimize the usage of computing resources (nodes) while achieving the optimal performance. In another line of research, [15] showed that injecting redundancy in computation also provides robustness against straggling effects, and [14] proposed a framework that takes both the straggling effect and the bandwidth usage into account. In this work, we do not focus on the straggling effect, and we consider the simple model where all the nodes compute at the same speed.

The rest of the paper is organized as follows. Section II formally establishes the system model and defines the problems. Section III summarizes and discusses the main results of this paper. Section IV describes the proposed resource allocation schemes for both sequential and parallel implementations. Section V proves the exact optimality of the proposed schemes through matching information-theoretic converses. Section VI concludes the paper.

II. PROBLEM FORMULATION
We consider a problem of computing Q output functions from N input files, for some system parameters Q, N ∈ ℕ. More specifically, given N input files w_1, ..., w_N ∈ F_{2^F}, for some F ∈ ℕ, the goal is to compute Q output functions φ_1, ..., φ_Q, where φ_q : (F_{2^F})^N → F_{2^B}, q ∈ {1, ..., Q}, maps all input files to a B-bit output value u_q = φ_q(w_1, ..., w_N) ∈ F_{2^B}, for some B ∈ ℕ.

We employ a MapReduce-type distributed computing structure and decompose the computation of the output function φ_q, q ∈ {1, ..., Q}, as follows:

φ_q(w_1, ..., w_N) = h_q(g_{q,1}(w_1), ..., g_{q,N}(w_N)),    (2)

where, as illustrated in Fig. 1,
• The "Map" functions g⃗_n = (g_{1,n}, ..., g_{Q,n}) : F_{2^F} → (F_{2^T})^Q, n ∈ {1, ..., N}, map the input file w_n into Q length-T intermediate values v_{q,n} = g_{q,n}(w_n) ∈ F_{2^T}, q ∈ {1, ..., Q}, for some T ∈ ℕ.
• The "Reduce" functions h_q : (F_{2^T})^N → F_{2^B}, q ∈ {1, ..., Q}, map the intermediate values of the output function φ_q in all input files into the output value u_q = h_q(v_{q,1}, ..., v_{q,N}) = φ_q(w_1, ..., w_N).
Fig. 1: Illustration of a two-stage distributed computing framework. The overall computation is decomposed into computing a set of Map and Reduce functions.
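As a concrete instance of the decomposition in (2), the following sketch uses made-up files and a word-counting output function (not taken from the paper): each Map function g_{q,n} sees only one file, and each Reduce function h_q combines the N intermediate values of function q.

```python
# Toy instance of the Map/Reduce decomposition in Eq. (2); the files and
# the word-counting functions below are illustrative, not from the paper.
files = ["a b a", "b c", "a c c"]   # input files w_1, ..., w_N
targets = ["a", "b", "c"]           # one output function per word (Q = 3)

def g(q, w_n):
    """Map function g_{q,n}: intermediate value v_{q,n} from file w_n alone."""
    return w_n.split().count(targets[q])

def h(q, intermediates):
    """Reduce function h_q: combines v_{q,1}, ..., v_{q,N} into u_q."""
    return sum(intermediates)

# phi_q(w_1, ..., w_N) computed through the decomposition:
outputs = [h(q, [g(q, w) for w in files]) for q in range(len(targets))]
print(outputs)  # [3, 2, 3]: total counts of "a", "b", "c" across all files
```

Since each v_{q,n} depends on one file only, the Map work can be placed on any server that holds that file, which is exactly the design freedom exploited in the rest of the paper.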
We perform the above computation using K distributed computing servers, labelled Server 1, ..., Server K. Here the number of servers K is a design parameter and can be an arbitrary positive integer. The chosen K servers carry out the computation in three phases: Map, Shuffle, and Reduce.

Map Phase. In the Map phase, each server maps a subset of input files. For each k ∈ {1, ..., K}, we denote the set of indices of the files mapped by Server k as M_k, which is a design parameter. Each file is mapped by at least one server, i.e., ∪_{k=1,...,K} M_k = {1, ..., N}. For each n ∈ M_k, Server k computes the Map function g⃗_n(w_n) = (v_{1,n}, ..., v_{Q,n}).

Definition 1 (Peak Computation Load). We define the peak computation load, denoted by p, 0 ≤ p ≤ 1, as the maximum number of files mapped at one server, normalized by the number of files N, i.e., p ≜ max_{k=1,...,K} |M_k| / N. ♦

We assume that all servers are homogeneous and have the same processing capacity. The average time a server spends in the Map phase is linearly proportional to the number of Map functions it computes, i.e., the average time for a server to compute n Map functions is c_m · n/N, for some constant c_m > 0. Also, since the servers compute their assigned Map functions simultaneously in parallel, we define the Map time, denoted by T_map, as the average time for the server mapping the most files to finish its computations, i.e.,

T_map = max_{k=1,...,K} c_m · |M_k| / N = c_m · p.    (3)

The minimum possible Map time can be arbitrarily close to 0, assuming N is large. This minimum Map time can be achieved by using a large number of servers, and letting the N Map tasks be uniformly assigned to these servers without repetition.
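As a quick numeric sketch of Definition 1 and Eq. (3) — the file assignment below is an arbitrary illustration, not an optimized placement:

```python
# Peak computation load (Definition 1) and Map time (Eq. (3)) for an
# illustrative assignment; M_k is the set of file indices mapped by server k.
N = 6                                    # number of input files
c_m = 1.0                                # per-file Map cost constant
M = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]    # each file mapped by at least one server

assert set().union(*M) == set(range(N))  # covering condition on the M_k
p = max(len(Mk) for Mk in M) / N         # peak computation load p
T_map = c_m * p                          # Eq. (3)
print(p, T_map)  # 0.5 0.5
```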
Shuffle Phase. We assign the tasks of computing the Q output functions across the K servers, and denote the set of indices of the output functions computed by Server k, k = 1, ..., K, as W_k, which is also a design parameter. Each output function is computed exactly once at some server, i.e., 1) ∪_{k=1,...,K} W_k = {1, ..., Q}, and 2) W_j ∩ W_k = ∅ for j ≠ k.

To compute the output value u_q for some q ∈ W_k, Server k needs the intermediate values that are not computed locally in the Map phase, i.e., {v_{q,n} : q ∈ W_k, n ∉ M_k}. After the Map phase, the K servers proceed to exchange the needed intermediate values for reduction. We formally define a shuffling scheme as follows:
• Each server k, k ∈ {1, ..., K}, creates a message X_k as a function of the intermediate values computed locally in the Map phase, i.e., X_k = ψ_k({g⃗_n(w_n) : n ∈ M_k}), and multicasts it to an arbitrary subset of the other K − 1 servers.

Definition 2 (Communication Load). We define the communication load, denoted by L, 0 ≤ L ≤ 1, as the total number of bits communicated by all servers in the Shuffle phase, normalized by QNT (which equals the total number of bits in all intermediate values {v_{q,n} : q ∈ {1, ..., Q}, n ∈ {1, ..., N}}). ♦

For some constant c_s > 0, we denote the bandwidth of the shared link connecting the servers by 1/c_s. Thus, given a communication load of L, the Shuffle time, denoted by T_shuffle, is defined as

T_shuffle = c_s · L.    (4)

The minimum possible Shuffle time is 0. It can be achieved by having each of the servers assigned to compute the Reduce functions map all N files locally.
Reduce Phase. Server k, k ∈ {1, ..., K}, uses the local Map results {g⃗_n(w_n) : n ∈ M_k} and the messages X_1, ..., X_K received in the Shuffle phase to construct the inputs to the assigned Reduce functions in W_k, and computes the output value u_q = h_q(v_{q,1}, ..., v_{q,N}) for all q ∈ W_k.

Similar to the computation of the Map functions, the average time for a server to compute q Reduce functions is c_r · q, for some constant c_r > 0. The servers compute their assigned Reduce functions simultaneously in parallel. We define the Reduce time, denoted by T_reduce, as the average time for the server reducing the most output functions to finish its computations, i.e.,

T_reduce = c_r · max_{k=1,...,K} |W_k|.    (5)

The minimum Reduce time equals c_r. To minimize the Reduce time, we need at least Q servers, each computing a unique Reduce function. In this paper, we assume that the cost of multicasting to multiple servers is the same as unicasting to one server.
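Putting Eqs. (3)–(5) together, a minimal sketch of how the phase times combine under the two implementations defined in the Introduction; the constants and loads are illustrative values, not an optimized design:

```python
# Phase times from Eqs. (3)-(5), combined into the sequential and parallel
# execution-time models; all numbers are illustrative.
c_m, c_s, c_r = 1.0, 1.0, 0.2
p = 0.5                  # peak computation load (Definition 1)
L = 0.25                 # communication load (Definition 2)
W = [{0}, {1}, {2}]      # W_k: disjoint Reduce task sets

T_map = c_m * p                            # Eq. (3)
T_shuffle = c_s * L                        # Eq. (4)
T_reduce = c_r * max(len(Wk) for Wk in W)  # Eq. (5)

T_sequential = T_map + T_shuffle + T_reduce
T_parallel = max(T_map, T_shuffle) + T_reduce
print(T_sequential, T_parallel)  # 0.95 0.7
```

The parallel model overlaps Map and Shuffle, so it is never slower than the sequential model for the same loads.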
In this setting, we are interested in designing distributed computing schemes, which include the selection of K, the assignment of the Map tasks M ≜ (M_1, ..., M_K), the assignment of the Reduce tasks W ≜ (W_1, ..., W_K), and the design of the data shuffling scheme, in order to minimize the overall execution time to accomplish the distributed computing tasks.

Specifically, the overall execution time is the total amount of time spent executing the above three phases of the computation. In this paper, we consider the following two types of implementations.

1) Sequential Implementation. For the sequential implementation, the three phases take place one after another sequentially, e.g., the Shuffle phase does not start until all servers have completed their Map computations. In this case, the overall execution time is T_sequential = T_map + T_shuffle + T_reduce.
2) Parallel Implementation. For the parallel implementation, the Shuffle phase happens in parallel with the Map phase, i.e., a server communicates a message as soon as the intermediate values needed to construct that message are calculated locally from the Map functions. In this case, the overall execution time becomes T_parallel = max{T_map, T_shuffle} + T_reduce.

To design the optimal distributed computing scheme that minimizes the execution time while using as few servers as possible, we need to answer the following questions:
• What is the minimum possible execution time?
• What is the minimum number of servers needed to achieve the minimum possible execution time?
• How should we place the Map and Reduce tasks and design the data shuffling scheme to achieve the minimum execution time?

To answer these questions, we formulate them into the following problem:
Problem 1 (Optimal Resource Allocation). Consider a computing task with parameters Q and N. Given a certain number of servers K, a Map task assignment M and a Reduce task assignment W on these servers, we say a shuffling scheme is valid if, for any possible outcomes of the intermediate values v_{q,n}, each server can decode all its needed intermediate values based on the values that are locally computed in the Map phase and the messages received during the Shuffle phase.

Suppose we always use valid shuffling schemes with minimum shuffling time. We denote the resulting execution times given K, M and W by T*_sequential(K, M, W) and T*_parallel(K, M, W). Assuming N is large, we aim to find the minimum execution times over all possible designs, which can be rigorously defined as follows:

T*_sequential = inf_{K, M, W} T*_sequential(K, M, W),    (6)
T*_parallel = inf_{K, M, W} T*_parallel(K, M, W).    (7)

We are also interested in finding the minimum numbers of servers required to exactly achieve the minimum execution times for large N, denoted by K*_sequential and K*_parallel, defined as follows:

K*_sequential = min{K ∈ ℕ | min_{M, W} T*_sequential(K, M, W) = T*_sequential},    (8)
K*_parallel = min{K ∈ ℕ | min_{M, W} T*_parallel(K, M, W) = T*_parallel}.    (9)

If the minimum in any of the above equations does not exist, we say the corresponding T*_sequential or T*_parallel cannot be achieved using a finite number of servers.

Besides, we want to find the optimal computing schemes that minimize the execution time while using the minimum number of servers. Specifically, for each implementation, we want to construct a Map task assignment M, a Reduce task assignment W, and a valid shuffling scheme design that achieve the minimum execution time using the minimum number of servers. ♦

In this paper, we answer all the questions mentioned in the above problem. Interestingly, some of the answers match the intuition and some do not.
For example, the coding gain in our proposed optimal scheme is obtained through coded multicasting, which agrees with the intuition. However, counterintuitively, the optimal scheme requires a non-symmetric design, where the servers are classified into two groups. One group is only assigned Map and Reduce tasks, focusing on computing the output functions, while the other group only does Map and Shuffle, focusing on delivering the intermediate results and exploiting the multicast opportunity. Also, intuition may suggest that by using more servers, we may always be able to further reduce the execution time. However, we show that in most cases the minimum execution time can be exactly achieved using finitely many servers, and the minimum execution time cannot be further reduced after the number of servers passes a threshold.

III. MAIN RESULTS
For the sequential implementation, we characterize the minimum execution time T*_sequential and the minimum number of servers needed to achieve T*_sequential in the following theorem.

Theorem 1 (Sequential Implementation). For a distributed computing application that computes Q output functions, T*_sequential defined in Problem 1 is given by

T*_sequential = c_m · r*/Q + c_s · (Q − r*)/(Q(r* + 1)) + c_r,    (10)

where r* is defined as follows:

r* = max argmin_{r ∈ {0, 1, ..., Q}} c_m · r/Q + c_s · (Q − r)/(Q(r + 1)).    (11)

We can show that the above execution time can be exactly achieved using a finite number of servers if and only if r* ≠ 0. For r* ≠ 0, K*_sequential defined in Problem 1 is given by

K*_sequential = Q + ⌈Q/r*⌉ if 0 < r* < Q, and K*_sequential = Q if r* = Q.    (12)

Remark. The above theorem generalizes the prior works on coded distributed computing, [10]–[13], by allowing the flexibility of using an arbitrary number of servers and arbitrary Reduce task assignments on the servers. In prior works, it is assumed that all the Q Reduce tasks are uniformly assigned to all the servers. In this paper, we will see that by focusing on the execution time and allowing the use of an arbitrary number of servers, the optimal scheme naturally requires a certain Reduce task assignment, where each server either reduces one function, or does not reduce at all. To simplify the discussion, we refer to the servers that are assigned Reduce tasks as solvers, and we refer to the rest of the servers as helpers.

Remark. To achieve the above minimum execution time, we propose a distributed computing scheme where each server maps no more than an r*/Q fraction of the files in the database, with a communication load of (Q − r*)/(Q(r* + 1)). In the proposed achievability scheme, we will see each file repetitively mapped on r* solvers.
Having this redundancy in the Map phase has two advantages: first, more computation enhances the local availability of the intermediate values, so each solver only needs values from a 1 − r*/Q fraction of the files in the shuffling phase; second, mapping the same file at multiple servers allows delivering intermediate values through coded multicasting, and a coding gain of r* + 1 is achieved in the proposed delivery scheme.

Remark. Similar to the prior works [10]–[13], the tradeoff between computation load and communication load can be established, and the above theorem demonstrates how the optimal peak computation load can be chosen based on this tradeoff.
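The optimization in Theorem 1 is a one-dimensional search over the integer redundancy r, so it can be evaluated directly. The following sketch brute-forces Eqs. (10)–(12) for one illustrative parameter set (the parameter values are ours, not from the paper):

```python
from math import ceil

def sequential_optimum(Q, c_m, c_s, c_r):
    """Brute-force evaluation of Theorem 1: minimize the Map + Shuffle time
    over r in {0, 1, ..., Q}, taking the largest minimizer as in Eq. (11)."""
    def t(r):  # Map time + Shuffle time as a function of the redundancy r
        return c_m * r / Q + c_s * (Q - r) / (Q * (r + 1))
    best = min(t(r) for r in range(Q + 1))
    r_star = max(r for r in range(Q + 1) if t(r) == best)  # "max argmin"
    T_star = best + c_r                                    # Eq. (10)
    if r_star == 0:
        K_star = None  # per Theorem 1, not achievable with finitely many servers
    elif r_star == Q:
        K_star = Q
    else:
        K_star = Q + ceil(Q / r_star)                      # Eq. (12)
    return r_star, T_star, K_star

r_star, T_star, K_star = sequential_optimum(Q=100, c_m=1.0, c_s=1.0, c_r=0.1)
print(r_star, K_star)  # 9 112: r* is close to sqrt(Q), and K* = 100 + ceil(100/9)
```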
Remark. We prove the exact optimality of the proposed scheme through a matching information-theoretic converse, which is provided in Section V. We observe that in most cases, using a finite number of servers is sufficient to exactly achieve the lower bound on the minimum execution time, which means that the execution time cannot be further reduced by using more servers than the provided K*_sequential. This is due to the fact that the coded multicasting opportunity, which is essential to achieving the minimum communication load, relies on mapping the files repetitively on the solvers. Because the total number of Reduce functions is fixed, the number of solvers is upper bounded by Q even if we use infinitely many servers. Consequently, when using a large number of servers, reducing the peak computation load on the solvers will inevitably reduce the number of times that each file is repetitively mapped on the solvers, which in turn hurts the coded multicasting opportunity and increases the communication load. Hence, the entire benefit of using more than Q servers is to reduce the computation load of the helpers, until the computation load of the solvers becomes the bottleneck; further increasing the number of servers does not affect the computation-communication tradeoff.

Conversely, Theorem 1 also indicates that, when using fewer servers than the suggested minimum number (K*_sequential), the resulting computing scheme must be strictly suboptimal. This is due to the fact that only the helpers can fully utilize the coded multicasting opportunity during the shuffling phase. Hence, to achieve the minimum communication load, no shuffling job should be handled by the solvers, and we need sufficiently many helpers to map enough files in order to obtain enough information to support the shuffling phase, without becoming the bottleneck of the peak computation load.

Remark.
From Theorem 1, we observe that the optimal solution always requires using at least Q servers, because any computing scheme in which a server reduces more than one function is strictly suboptimal (as will be proved later), so at least Q solvers are needed to compute all the Reduce functions.

In addition, we note that K*_sequential is a decreasing function of r*, and consequently an increasing function of c_m/c_s, which can be explained as follows: when c_m/c_s increases, the computation time for mapping one file becomes relatively larger, and it is therefore better to pick a computing scheme with a larger communication load and a smaller computation load. To reduce the computation load, r*, the number of times each file is repetitively mapped on the solvers, should be decreased. As a result, the peak computation load on the helpers also decreases, and thus more helpers are needed to make sure that each file needed for the shuffling phase is mapped on at least one helper.

Remark. If we ignore the integrality constraint, r* and K*_sequential can be approximated as follows:

r* ≈ √((Q + 1) c_s/c_m) − 1 ≈ √(Q c_s/c_m),    (13)
K*_sequential ≈ Q + Q/(√((Q + 1) c_s/c_m) − 1) ≈ Q + √(Q c_m/c_s).    (14)

Interestingly, r* is approximately proportional to the square root of c_s/c_m, while the number of helpers (i.e., K*_sequential − Q) is inversely proportional to the square root of c_s/c_m. Hence, if the computation time of mapping one file is increased by a factor of 4, r* should be halved, and the number of helpers should be doubled.

We have the following explanation: in the optimal computing scheme proposed in this paper, the computation time is proportional to c_m · r, and the communication time is approximately c_s/r, where r is the number of times each file is repetitively mapped on all solvers.
To minimize the total execution time, the design parameter should balance the time spent in these two phases, which implies that r* should be approximately proportional to the square root of c_s/c_m. Besides, in most cases the helpers should map all files in the database in order to support the shuffling phase. Hence, the minimum number of helpers (i.e., K*_sequential − Q) should be inversely proportional to the per-helper computation load, and consequently inversely proportional to the square root of c_s/c_m.

Remark. As we have discussed, achieving the minimum possible communication load relies on exploiting local availabilities and allowing coded multicasting. As a comparison, we consider computing designs where the opportunity of multicasting during the shuffling phase is not utilized, i.e., the shuffling phase is uncoded. The minimum execution time is given as follows:

T*_sequential, uncoded = min_{r ∈ {0, ..., Q}} c_m · r/Q + c_s · (1 − r/Q) + c_r    (15)
= min{c_m, c_s} + c_r.    (16)

The above execution time can be achieved using an uncoded computing scheme with a finite number of servers if and only if c_m ≤ c_s, and the minimum needed number of servers in this case equals Q.

Compared to the uncoded scheme, a large coding gain that scales with the size of the problem can be achieved by exploiting coded multicasting opportunities during the shuffling phase. For example, when c_m = c_s, the execution time of the Map and Shuffle phases of the optimal coded scheme scales as Θ(Q^{−1/2}), while the execution time of the uncoded scheme remains constant.

The two schemes also require different numbers of servers to achieve the minimum execution time. For the uncoded computing scheme, at most Q servers are needed to achieve the minimum cost, unless the computing power of Q servers is not sufficient to map the entire database; while for the coded computing scheme, in most cases more than Q servers are needed to achieve the minimum execution time.
This is due to the fact that in the coded computing scheme, the Reduce tasks and the shuffling jobs are handled by disjoint groups of servers in order to fully maximize the coding gain, and hence extra servers are needed to optimize the performance. In the uncoded scheme, however, the only use of non-solver nodes is to provide extra computing power. Hence, when Q servers are sufficient to map the entire database, using more servers does not reduce the execution time.

For the parallel implementation, we characterize the minimum execution time, and the minimum number of servers needed to achieve T*_parallel, in the following theorem.

Theorem 2 (Parallel Implementation). For a distributed computing application that computes Q output functions, T*_parallel defined in Problem 1 is given by

T*_parallel = max{c_m · r*/Q, c_s · Conv((Q − r*)/(Q(r* + 1)))} + c_r,    (17)

where Conv(f(·)) denotes the lower convex envelope of the points {(r, f(r)) | r ∈ {0, 1, ..., Q}}, and r* is defined as follows:

r* = argmin_{0 ≤ r ≤ Q} max{c_m · r/Q, c_s · Conv((Q − r)/(Q(r + 1)))}.    (18)

We can show that the above execution time can be exactly achieved using a finite number of servers, and K*_parallel defined in Problem 1 is given by

K*_parallel = Q + ⌈Q/r*⌉ if r* ≤ Q − 1, and K*_parallel = Q + ⌈Q(Q − r*)/r*⌉ if r* > Q − 1.    (19)

Remark. The above theorem generalizes the prior works [10]–[13] by allowing the flexibility of using an arbitrary number of servers and arbitrary Reduce task assignments on the servers. Similar to the sequential implementation, the optimal scheme for the parallel implementation also requires a certain Reduce task assignment, where each server either reduces one function or does not reduce at all. Thus, we continue to use the names solvers and helpers for the parallel implementation.
Remark. To achieve the above minimum execution time, we propose a distributed computing scheme where each server maps no more than an r*/Q fraction of the files in the database, with a communication load of Conv((Q − r*)/(Q(r* + 1))). Similar to the sequential case, each file is repetitively mapped r* times. This redundancy enhances the local availability of the intermediate values, and allows delivering intermediate values through coded multicasting. Hence, by following the same argument, we can achieve the same computation-communication tradeoff achieved by the scheme used in the sequential implementation. However, given the same computation-communication tradeoff, the above theorem indicates that the optimal peak computation load should be chosen differently compared to the sequential case, in order to minimize the execution time for the parallel implementation.

Remark. We prove the exact optimality of the proposed scheme through a matching information-theoretic converse, which is provided in Section V. We note that for the parallel implementation, using a finite number of servers is sufficient to exactly achieve the minimum execution time. Conversely, Theorem 2 also indicates that when using fewer servers than the suggested minimum number (K*_parallel), the resulting computing scheme must be strictly suboptimal. Both statements can be understood exactly the same way as discussed for the sequential implementation.

Remark. From Theorem 2, we observe that the optimal solution always requires using at least Q servers. In addition, we note that K*_parallel is a decreasing function of r*, and consequently an increasing function of c_m/c_s. Both observations can be understood exactly the same way as discussed for the sequential implementation.
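Theorem 2 involves the lower convex envelope Conv(·) of the integer shuffle-load points, which can be computed with a standard lower-hull scan. The following sketch evaluates Eqs. (17)–(19) numerically for one illustrative parameter set; the grid search over real r is our shortcut for the one-dimensional minimization in Eq. (18), not the paper's method:

```python
from math import ceil

def lower_envelope(points):
    """Lower convex hull of the given (x, y) points (monotone-chain scan)."""
    hull = []
    for x, y in sorted(points):
        # pop hull[-1] while it lies on or above the chord hull[-2] -> (x, y)
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (x - x1) >= (y - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull

def envelope_at(hull, x):
    """Piecewise-linear interpolation of the hull at point x."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            return y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    return hull[-1][1]

Q, c_m, c_s, c_r = 100, 1.0, 1.0, 0.1
shuffle_load = lambda r: (Q - r) / (Q * (r + 1))  # integer points entering Conv
hull = lower_envelope([(r, shuffle_load(r)) for r in range(Q + 1)])

# Eq. (18): minimize max{Map time, Shuffle time} over real r in [0, Q].
grid = [Q * i / 10**4 for i in range(10**4 + 1)]
r_star = min(grid, key=lambda r: max(c_m * r / Q, c_s * envelope_at(hull, r)))
T_star = max(c_m * r_star / Q, c_s * envelope_at(hull, r_star)) + c_r  # Eq. (17)
K_star = (Q + ceil(Q / r_star) if r_star <= Q - 1
          else Q + ceil(Q * (Q - r_star) / r_star))                    # Eq. (19)
print(r_star, K_star)  # 9.05 112
```

Since the increasing Map-time term and the decreasing envelope term cross once, the optimum sits at their intersection; for c_m = c_s the optimal r* again scales like √Q, as in the sequential case.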
If we ignore the integrality constraint, r* and K*_parallel can be approximated as follows:

    r* ≈ sqrt( Q·c_s/c_m + ((c_s/c_m + 1)/2)² ) − (c_s/c_m + 1)/2 ≈ sqrt( Q·c_s/c_m ),    (20)

    K*_parallel ≈ Q + Q/r* ≈ Q + sqrt( Q·c_m/c_s ).    (21)

Similar to the sequential case, r* is approximately proportional to the square root of c_s/c_m, while the number of helpers (i.e., K*_parallel − Q) is inversely proportional to the square root of c_s/c_m. Both approximations can be explained through the same arguments used for the sequential implementation.

Remark. We consider the minimum execution time of the uncoded scheme, which is given as follows:

    T*_parallel, uncoded = min_{r ∈ [0,Q]} max{ c_m · r/Q , c_s · (1 − r/Q) } + c_r    (22)
                         = c_m·c_s/(c_m + c_s) + c_r.    (23)

The above execution time can be achieved using K*_parallel, uncoded = max{ Q, ⌈Q/r*⌉ } servers.

Compared to the uncoded scheme, a large coding gain that scales with the size of the problem is achieved by the proposed coded scheme. For example, when c_m = c_s, the execution time of the Map and Shuffle phases of the optimal coded scheme decays as Θ(Q^{−1/2}), while the execution time of the uncoded scheme remains constant.

The two schemes also require different numbers of servers to achieve the minimum execution time. For the uncoded scheme, at most Q servers are needed to achieve the minimum cost, unless the computing power of Q servers is not sufficient to map the entire database; while for the coded computing scheme, in most cases more than Q servers are needed to achieve the minimum execution time. This is due to the fact that the uncoded scheme fails to exploit the coded multicast opportunity, as explained in Remark 7.

IV. ACHIEVABILITY SCHEMES
In this section, we construct achievability schemes that achieve the minimum execution times characterized in Section III, using the minimum number of servers. We start by giving an illustrative example of how to build an optimal scheme for the sequential implementation, given a specific set of values of the problem parameters. Then we proceed to present the optimal achievability scheme for general parameters. The optimal achievability scheme for the parallel implementation is described in Appendix A.
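Before presenting the constructions, it may help to see how the operating points in Theorems 1 and 2 can be evaluated numerically. The sketch below is our own illustration, not part of the paper's schemes: the function names, the grid search, and the tie-breaking are implementation choices we assume. Conv(·) is implemented as piecewise-linear interpolation, which coincides with the lower convex envelope here because (Q − j)/(Q(j + 1)) is convex in j.

```python
from math import ceil

def sequential_optimum(Q, c_m, c_s, c_r):
    # Theorem 1: minimize c_m*r/Q + c_s*(Q - r)/(Q*(r + 1)) over r in {0,...,Q},
    # breaking ties toward the largest minimizer (r* = max argmin).
    cost = lambda r: c_m * r / Q + c_s * (Q - r) / (Q * (r + 1))
    best = min(cost(r) for r in range(Q + 1))
    r_star = max(r for r in range(Q + 1) if cost(r) == best)
    # K* = Q + ceil(Q/r*) for 0 < r* < Q; Q if r* = Q; no finite K if r* = 0.
    K = Q if r_star == Q else (None if r_star == 0 else Q + ceil(Q / r_star))
    return r_star, best + c_r, K

def parallel_optimum(Q, c_m, c_s, c_r, grid=10**5):
    # Theorem 2: Conv(.) is the piecewise-linear interpolation of the convex
    # sequence (Q - j)/(Q*(j + 1)); grid-search r over (0, Q] for (18).
    f = lambda j: (Q - j) / (Q * (j + 1))
    conv = lambda r: f(int(r)) + (r - int(r)) * (f(int(r) + 1) - f(int(r))) if r < Q else 0.0
    obj = lambda r: max(c_m * r / Q, c_s * conv(r))
    r_star = min((Q * i / grid for i in range(1, grid + 1)), key=obj)
    K = Q + ceil(Q / r_star) if r_star <= Q - 1 else Q + ceil(Q * (Q - r_star) / r_star)
    return r_star, obj(r_star) + c_r, K

def r_star_parallel_approx(Q, c_m, c_s):
    # (20), ignoring integrality: positive root of r^2 + (1 + c_s/c_m)*r = Q*c_s/c_m.
    x = c_s / c_m
    return (Q * x + ((x + 1) / 2) ** 2) ** 0.5 - (x + 1) / 2
```

For the parameters of the example used later in this section (Q = 3, c_m = 1, c_s = 2, c_r = 1), `sequential_optimum` returns r* = 2 and K*_sequential = 5, consistent with Theorem 1.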
A. Illustrative Example
We present an illustrative example of the optimal achievability scheme for a given set of parameters: N = 6, Q = 3, c_m = 1, c_s = 2, and c_r = 1. According to Theorem 1, we choose the design parameter r* = 2 and use K*_sequential = 5 servers. We let servers 1, 2, and 3 reduce functions 1, 2, and 3, respectively.

Map Phase Design.
We let the Map task assignment to the servers be M_1 = {1, 2, 3, 4}, M_2 = {3, 4, 5, 6}, M_3 = {1, 2, 5, 6}, M_4 = {1, 3, 5}, and M_5 = {2, 4, 6}. Here each solver, i.e., each server in {1, 2, 3}, maps a 2/3 = r*/Q fraction of the files, and each helper maps a 1/2 < r*/Q fraction of the files. Hence the peak computation load equals 2/3 = r*/Q.

Shuffle Phase Design.
After the Map phase, server 4 multicasts the message v_{1,5} ⊕ v_{2,1} ⊕ v_{3,3}, and server 5 multicasts the message v_{1,6} ⊕ v_{2,2} ⊕ v_{3,4}. The normalized communication load equals 1/9 = (Q − r*)/(Q(r* + 1)). Since node 1 has locally computed v_{2,1} and v_{3,3}, it can decode v_{1,5} from the message multicast by server 4. Similarly, it can also decode v_{1,6} from the other message. Because v_{1,1}, ..., v_{1,4} are already locally computed by server 1, the Reduce function 1 can be executed after the Shuffle phase. The same argument holds for the other Reduce functions, hence the computation can be completed after the shuffling.

Note that in the above example, each server computes at most 1 Reduce function. Hence the reduce time equals c_r = 1. Consequently, the total execution time for the sequential implementation equals 1 · (2/3) + 2 · (1/9) + 1 = c_m · r*/Q + c_s · (Q − r*)/(Q(r* + 1)) + c_r, which can be verified to equal the minimum execution time T*_sequential given in Theorem 1.

B. General Description for Sequential Implementation
We consider a general computing task with Q Reduce functions, parameters c_m, c_s, c_r, and sufficiently large N. We first compute the design parameter r* as specified in Theorem 1. Depending on the value of r*, we design the achievability scheme as follows.

Note that if network-layer multicast is not possible for delivering the coded packets, we can instead use existing application-layer multicast algorithms (e.g., the Message Passing Interface (MPI)) to multicast them (see [10], Section VII-A, for more details).
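The illustrative example above can also be checked mechanically. The following sketch is our own illustration, not from the paper: intermediate values are modeled as random 32-bit integers so that ⊕ is bitwise XOR, and the variable names (`M`, `messages`, `v`) are hypothetical.

```python
import random
from functools import reduce
from operator import xor

# Map assignment from the example: solvers 1-3, helpers 4-5, files 1-6.
M = {1: {1, 2, 3, 4}, 2: {3, 4, 5, 6}, 3: {1, 2, 5, 6},
     4: {1, 3, 5},    5: {2, 4, 6}}
rng = random.Random(0)
# v[q][n]: intermediate value of Reduce function q computed from file n.
v = {q: {n: rng.getrandbits(32) for n in range(1, 7)} for q in (1, 2, 3)}

# Helper 4 multicasts v[1][5] ^ v[2][1] ^ v[3][3];
# helper 5 multicasts v[1][6] ^ v[2][2] ^ v[3][4].
messages = {4: [(1, 5), (2, 1), (3, 3)], 5: [(1, 6), (2, 2), (3, 4)]}

for q in (1, 2, 3):                      # each solver q ...
    decoded = set()
    for helper, terms in messages.items():
        payload = reduce(xor, (v[p][n] for p, n in terms))
        # The side information is locally computable: solver q mapped file n.
        assert all(n in M[q] for p, n in terms if p != q)
        side = reduce(xor, (v[p][n] for p, n in terms if p != q))
        (_, missing) = next(t for t in terms if t[0] == q)
        assert payload ^ side == v[q][missing]
        decoded.add(missing)
    # Together with its locally mapped files, solver q now covers all 6 files.
    assert decoded | M[q] == {1, 2, 3, 4, 5, 6}
```

The two multicast messages carry 2 of the 18 intermediate values, giving the normalized communication load 2/18 = 1/9 claimed in the example.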
Fig. 2: Illustration of the optimal achievability scheme for N = 6, Q = 3, c_m = 1, c_s = 2 and c_r = 1.

r* ∈ {1, ..., Q − 1}: We use K = K*_sequential servers as suggested in Theorem 1. Since K*_sequential ≥ Q always holds, we let nodes 1, 2, ..., Q reduce functions 1, 2, ..., Q, respectively.

Map Phase Design.
Assuming N is large, we evenly partition the dataset into (K − Q)·C(Q, r*) disjoint subsets, where C(·, ·) denotes the binomial coefficient. We bijectively map these subsets to tuples of a subset of r* solvers and a helper. Rigorously, we map the subsets of files to the following set: {(i, A) | i ∈ {Q + 1, ..., K}, A ⊆ {1, ..., Q}, |A| = r*}. We denote the subset of files that is mapped to (i, A) by B_{i,A}. We let each solver k ∈ {1, ..., Q} map all subsets of files B_{i,A} satisfying k ∈ A, and we let each helper k ∈ {Q + 1, ..., K} map all subsets B_{k,A}. Each solver maps a C(Q − 1, r* − 1)(K − Q) / (C(Q, r*)(K − Q)) = r*/Q fraction of the files, and each helper maps a C(Q, r*) / (C(Q, r*)(K − Q)) = 1/(K − Q) ≤ r*/Q fraction of the files. Hence, the computation time of this Map phase design equals c_m · r*/Q.

Shuffle Phase Design.
We group all the intermediate values for a Reduce function q from all files in B_{i,A} into a single variable, and denote it by V_{i,A,q}. In the Shuffle phase, each helper i ∈ {Q + 1, ..., K} multicasts the following messages: for each subset S of r* + 1 solvers, helper i multicasts Y_{i,S} ≜ ⊕_{k∈S} V_{i,S\{k},k} to all the solvers in S. The normalized communication load equals C(Q, r* + 1)(K − Q) / (C(Q, r*)(K − Q)·Q) = (Q − r*)/(Q(r* + 1)). Hence, the communication time of this Shuffle phase design equals c_s · (Q − r*)/(Q(r* + 1)).

Now we prove the validity of the above scheme: for each subset A ⊆ {2, ..., Q} of size r* and for each i ∈ {Q + 1, ..., K}, server 1 can decode V_{i,A,1} from Y_{i,A∪{1}}. Combining with the intermediate values that are computed locally on server 1, the Reduce function 1 can be executed after the Shuffle phase. The same argument holds for the other Q − 1 Reduce functions, hence the proposed shuffling scheme is valid.
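The Map and Shuffle phase designs above can be verified programmatically for any admissible (Q, r*, K). The sketch below is our own illustration (the function `verify_scheme` and its internals are hypothetical, not part of the paper): it creates one file per pair (helper i, r*-subset A), forms every multicast Y_{i,S}, checks that each solver in S can decode its missing value from locally mapped side information, and returns the normalized communication load.

```python
import random
from itertools import combinations
from functools import reduce
from operator import xor

def verify_scheme(Q, r, K):
    # One file B_{i,A} per (helper i, r-subset A of solvers); N = (K-Q)*C(Q,r).
    files = [(i, A) for i in range(Q + 1, K + 1)
                    for A in combinations(range(1, Q + 1), r)]
    rng = random.Random(1)
    # V[(file, q)]: intermediate value, modeled as a random 32-bit word.
    V = {(f, q): rng.getrandbits(32) for f in files for q in range(1, Q + 1)}

    sent = 0
    for i in range(Q + 1, K + 1):                       # each helper i ...
        for S in combinations(range(1, Q + 1), r + 1):  # ... per (r+1)-subset S
            sub = lambda k: (i, tuple(x for x in S if x != k))
            Y = reduce(xor, (V[(sub(k), k)] for k in S))
            sent += 1
            for k in S:
                # Solver k maps every file (i, S\{j}) with j != k (since k is in
                # that subset), so it cancels the other terms and decodes
                # V_{i, S\{k}, k}.
                side = reduce(xor, (V[(sub(j), j)] for j in S if j != k))
                assert Y ^ side == V[(sub(k), k)]
    return sent / (len(files) * Q)   # normalized communication load
```

The returned load matches (Q − r)/(Q(r + 1)); e.g., `verify_scheme(3, 2, 5)` gives 1/9, as in the example of Section IV-A.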
Remark. Note that if we view all the helpers as a single super node, this node maps all the files and broadcasts all the messages during the Shuffle phase. By viewing the super node as the server and the solvers as the users, we recover the caching scheme proposed in [16]. In our proposed distributed computing scheme, we split the Map-phase work of the super node onto multiple nodes, in order to ensure that the peak computation load is not bottlenecked by the Map tasks executed at these helpers.

r* = 0: In this case, Theorem 1 states that T*_sequential cannot be exactly achieved using a finite number of servers. Hence we pick a parameter K as large as possible, and use K servers for the achievability scheme. We let nodes 1, 2, ..., Q reduce functions 1, 2, ..., Q, respectively, without being assigned any Map tasks. Assuming N is large, we evenly partition the dataset into K − Q subsets of files, and we let each helper map one subset disjointly. The peak computation load consequently equals 1/(K − Q), which is negligible if K is sufficiently large. Hence the Map phase design requires a computation time of c_m · 1/(K − Q), which approaches c_m · r*/Q = 0.

In the Shuffle phase, note that each intermediate value is computed by exactly one helper; we simply let all the helpers unicast each intermediate value to the solver that requires the value to execute its Reduce function. Because each intermediate value is unicast exactly once, the normalized communication load equals 1 and the communication time equals c_s = c_s · (Q − r*)/(Q(r* + 1)).

r* = Q: In this case, K*_sequential = Q. We simply use Q servers, each reducing one function and mapping the entire database. The peak computation load equals 1, hence the computation time equals c_m = c_m · r*/Q. Since each server obtains all the needed intermediate values after the Map phase, no communication is required in the Shuffle phase.
Hence the communication time equals 0 = c_s · (Q − r*)/(Q(r* + 1)).

In all the above cases, each server reduces at most one function, hence our proposed achievability scheme always achieves a reduce time of c_r. Besides, in all the cases, the achievability scheme uses K*_sequential servers (or sufficiently many servers if K*_sequential does not exist), and achieves a computation time of c_m · r*/Q and a communication time of c_s · (Q − r*)/(Q(r* + 1)). The total execution time always equals T*_sequential = c_m · r*/Q + c_s · (Q − r*)/(Q(r* + 1)) + c_r. Hence, our proposed scheme always achieves the T*_sequential and K*_sequential stated in Theorem 1.

Remark. Interestingly, in the proposed optimal computing scheme, the minimum cost is achieved by completely separating the Reduce tasks and the shuffle jobs onto different servers. Because no solver in the proposed scheme is responsible for multicasting messages in the delivery phase, the Map tasks on the solvers can be designed to fully exploit the multicast opportunity, without having to consider the encodability constraint.

V. CONVERSE
In this section, we derive matching converses that show the optimality of the proposed computation schemes. We also show that our proposed optimal scheme uses the minimum possible number of nodes to achieve the minimum execution time.
A. Key Lemma
Before deriving the exact converse for each implementation, we first prove the following key lemma, which applies to both sequential and parallel implementations. The lemma lower bounds the shuffling time given an arbitrary Map and Reduce task allocation:
Lemma 1 (Converse Bound for Communication Load). Consider a distributed computing task with N files and Q Reduce functions, and a given Map and Reduce design that uses K nodes. For any integers s, d, let a_{s,d} denote the number of intermediate values that are available at s nodes, and required by (but not available at) d nodes. The following lower bound on the communication load holds:

    L ≥ (1/(QN)) Σ_{s=1}^{K} Σ_{d=1}^{K−s} a_{s,d} · d/(s + d − 1).    (24)

Remark. Prior to this work, several bounding techniques have been proposed for coded distributed computing and coded caching with uncoded prefetching [10], [12], [13], [17]–[19]. All of them can be derived as special cases of the above simple lemma.
Remark. Although we assume that each server sends messages independently during the Shuffle phase, the above lemma can be easily generalized to computing models where the data shuffling process is carried out in multiple rounds and dependency between messages is allowed. We can prove that even when multiple-round communication is allowed, exactly the same lower bound stated in Lemma 1 still holds. Consequently, requiring the servers to communicate independently does not induce any cost in the total execution time.

We postpone the proof of Lemma 1 to Appendix B; in this section, we assume the correctness of this lemma and prove the optimality of the proposed schemes based on it.
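As a sanity check, the counters a_{s,d} in Lemma 1 can be computed directly from a Map assignment M and a Reduce assignment W, and the bound (24) evaluated. The sketch below is our own illustration (the function name is hypothetical); for the example of Section IV-A, the bound evaluates to 1/9, exactly the load achieved by the coded scheme there.

```python
def lemma1_bound(M, W, Q, N):
    # Evaluate (1/(Q*N)) * sum of d/(s + d - 1) over all intermediate values,
    # where s = number of nodes that computed the value and d = number of nodes
    # needing it without having computed it (the a_{s,d} counters of Lemma 1).
    total = 0.0
    for n in range(1, N + 1):
        mappers = {k for k in M if n in M[k]}
        s = len(mappers)
        for q in range(1, Q + 1):
            d = len({k for k in W if q in W[k]} - mappers)
            if s >= 1 and d >= 1:
                total += d / (s + d - 1)
    return total / (Q * N)

# Example of Section IV-A: 5 servers, solvers 1-3 reduce functions 1-3.
M = {1: {1, 2, 3, 4}, 2: {3, 4, 5, 6}, 3: {1, 2, 5, 6},
     4: {1, 3, 5},    5: {2, 4, 6}}
W = {1: {1}, 2: {2}, 3: {3}, 4: set(), 5: set()}
```

Here `lemma1_bound(M, W, 3, 6)` returns 1/9, showing that the example's coded shuffle meets the bound (24) with equality.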
B. Converse Bounds for Sequential Implementation
Now we use Lemma 1 to prove a matching converse for Theorem 1, which is equivalent to proving the following two statements:
1) The execution time of any coded computing scheme for a distributed computing task with N files and Q Reduce functions with sequential implementation is at least T*_sequential.
2) Any computing scheme that arbitrarily closely achieves an execution time of T*_sequential uses at least K*_sequential servers.

First of all, note that for any coded computing scheme, we can construct an alternative valid scheme with the same computation load and communication load, in which each server reduces at most 1 function. The construction is given as follows: for each server k that reduces at least 2 functions, let q_k denote the number of functions reduced by this server. Make q_k − 1 extra copies of this server mapping the same set of files, but not responsible for any shuffling job, and let each of these q_k servers reduce only one of the q_k functions originally assigned to server k. If all Map, Shuffle, and Reduce phases for the other servers remain the same, each additional server can still obtain enough information to execute its Reduce function. Besides, the Map time and the Shuffle time remain the same, but each server in the new computing scheme reduces at most 1 function.

Consequently, for any computing scheme that assigns more than 1 function to any single server, we can find a further optimized scheme with a strict improvement in the execution time of at least c_r. Hence any such scheme cannot achieve the minimum possible execution time. So to prove a matching converse for Theorem 1, it is sufficient to focus on computing schemes where each server reduces at most one function.

We consider an arbitrary computing scheme that maps N files, uses K servers, and reduces Q functions.
Without loss of generality, we assume the servers in {1, ..., Q} are assigned Reduce tasks.

We first derive a lower bound on the communication load by enhancing the computing system: we view the servers in {Q + 1, ..., K} as a super node that maps all files mapped by these servers, and broadcasts all messages broadcast by these servers during the Shuffle phase. It is easy to verify that by enhancing the computing system in this way, all solvers are still able to execute their Reduce functions, and the total communication load does not increase.

We then apply Lemma 1 to the enhanced computing system. Let a_{j,0} denote the number of files that are mapped by j solvers but not by the super node, and let a_{j,1} be the number of files that are mapped by j solvers and by the super node. From Lemma 1, the communication load is lower bounded by the following inequality:

    L ≥ (1/(QN)) Σ_{j=0}^{Q} [ (Q − j) a_{j,0}/j + (Q − j) a_{j,1}/(j + 1) ].    (25)

Note that the peak computation load is lower bounded by the average computation load on the solvers, thus

    p ≥ Σ_{k=1}^{Q} |M_k| / (QN) = (1/(QN)) Σ_{j=0}^{Q} j (a_{j,0} + a_{j,1}).    (26)

Hence, the total execution time is lower bounded by

    T_sequential ≥ (1/(QN)) Σ_{j=0}^{Q} [ a_{j,0} (c_m j + c_s (Q − j)/j) + a_{j,1} (c_m j + c_s (Q − j)/(j + 1)) ] + c_r.    (27)

Note that a_{j,0} and a_{j,1} are non-negative and satisfy the following equation:

    N = Σ_{j=0}^{Q} (a_{j,0} + a_{j,1}).    (28)

Consequently, the minimum value that T_sequential can take is given by

    T_sequential ≥ (1/Q) min_{j ∈ {0,...,Q}} min{ c_m j + c_s (Q − j)/j , c_m j + c_s (Q − j)/(j + 1) } + c_r    (29)
                 = min_{r ∈ {0,...,Q}} ( c_m r/Q + c_s (Q − r)/(Q(r + 1)) ) + c_r    (30)
                 = T*_sequential,    (31)

which proves the first statement.

Let R* = argmin_{r ∈ {0,1,...,Q}} ( c_m r/Q + c_s (Q − r)/(Q(r + 1)) ); we have r* = max R*.
If T*_sequential is arbitrarily closely achieved, the Map task assignment of the computation scheme must satisfy a_{j,i} ≈ 0 except for j ∈ R*, and i = 1 if j ≠ Q.

We consider the following two possible cases, distinguished by the value of r*:

1. If r* ≠ Q, i.e., Q ∉ R*, then a_{j,i} can only be non-zero when i = 1, which means almost all files must be mapped at the super node. Since equality in (26) must hold in order for a computing scheme to arbitrarily closely achieve the lower bound on T_sequential, the peak computation load must be no larger than r*/Q. Consequently, the minimum number of helpers must be at least ⌈1/p⌉ = ⌈Q/r*⌉ in order for them to map all the files. Hence, we have

    K ≥ Q + ⌈Q/r*⌉ = K*_sequential.    (32)

Note that if r* = 0, the minimum execution time cannot be achieved using a finite number of servers.

2. If r* = Q, the required number of servers to achieve T*_sequential is simply bounded by Q, because the Q Reduce functions have to be assigned to distinct servers. Hence K ≥ Q = K*_sequential.

Hence, the second statement is proved for all possible values of r*.

C. Converse Bounds for Parallel Implementation
Now we use Lemma 1 to prove a matching converse for Theorem 2, which is equivalent to proving the following two statements. (If K = Q, we simply let the super node not be assigned any tasks.)
1) The execution time of any coded computing scheme for a distributed computing task with N files and Q Reduce functions with parallel implementation is at least T*_parallel.
2) Any computing scheme that arbitrarily closely achieves an execution time of T*_parallel uses at least K*_parallel servers.

Similar to the sequential case, we can easily show that any computing scheme that assigns more than 1 Reduce function to any single server cannot achieve the minimum possible execution time. So to prove a matching converse, it is sufficient to focus on computing schemes where each server reduces at most one function.

We consider an arbitrary computing scheme that maps N files, uses K servers, and reduces Q functions. Without loss of generality, we assume the servers in {1, ..., Q} are assigned Reduce tasks. Following the same arguments and the same notation used for the sequential case, the following bounds on the communication load and the computation load also hold for the parallel implementation:

    L ≥ (1/(QN)) Σ_{j=0}^{Q} [ (Q − j) a_{j,0}/j + (Q − j) a_{j,1}/(j + 1) ],    (33)

    p ≥ (1/(QN)) Σ_{j=0}^{Q} j (a_{j,0} + a_{j,1}).    (34)

Letting Conv(f(·)) denote the lower convex envelope of the points {(r, f(r)) | r ∈ {0, 1, ..., Q}}, we have

    L ≥ (1/N) Σ_{j=0}^{Q} (a_{j,0} + a_{j,1}) (Q − j)/(Q(j + 1))    (35)
      = (1/N) Σ_{j=0}^{Q} (a_{j,0} + a_{j,1}) Conv( (Q − j)/(Q(j + 1)) ).    (36)

Note that

    N = Σ_{j=0}^{Q} (a_{j,0} + a_{j,1}),    (37)

and (Q − j)/(Q(j + 1)) is a decreasing sequence; using Jensen's inequality, we have

    L ≥ Conv( (Q − r)/(Q(r + 1)) ),    (38)

where r = Qp. Consequently,

    T_parallel ≥ min_{r ∈ [0,Q]} max{ c_m r/Q , c_s Conv( (Q − r)/(Q(r + 1)) ) } + c_r    (39)
               = T*_parallel,    (40)

which proves the first statement.

It is easy to show that the above bound is minimized by a unique value r* ∈ (0, Q). If T*_parallel is arbitrarily closely achieved, the equality in the Jensen's inequality used in (38) must hold.
Consequently, the Map task assignment of the computation scheme must satisfy a_{j,i} ≈ 0 except for j = ⌊r*⌋ or j = ⌈r*⌉, and i = 1 if j ≠ Q.

We consider the following two possible cases, distinguished by the value of r*:

1. If r* ≤ Q − 1, a_{j,i} can only be non-zero when i = 1, which means almost all files must be mapped at the super node. Similar to the sequential case, the minimum number of helpers must be at least ⌈1/p⌉ = ⌈Q/r*⌉ in order for them to map all the files. Hence, we have

    K ≥ Q + ⌈Q/r*⌉ = K*_parallel.    (41)

2. If r* > Q − 1, only a_{Q−1,1}, a_{Q,0}, and a_{Q,1} can be non-zero. Hence we have

    a_{Q−1,1} + a_{Q,0} + a_{Q,1} = N,    (42)

    (Q − 1) a_{Q−1,1} + Q a_{Q,0} + Q a_{Q,1} = r* N.    (43)

Note that a_{Q−1,1} + a_{Q,1} files are mapped at the super node, and that each helper maps at most r* N/Q files, since the peak computation load must be no larger than r*/Q. The required number of servers to achieve T*_parallel can thus be bounded as follows:

    K ≥ Q + ⌈ (a_{Q−1,1} + a_{Q,1}) / (r* N/Q) ⌉    (44)
      ≥ Q + ⌈ a_{Q−1,1} / (r* N/Q) ⌉    (45)
      = Q + ⌈ (QN − r* N) / (r* N/Q) ⌉    (46)
      = Q + ⌈ Q(Q − r*)/r* ⌉    (47)
      = K*_parallel,    (48)

where (46) follows because (42) and (43) imply a_{Q−1,1} = (Q − r*) N.

Hence, the second statement is proved for all possible values of r*.

VI. CONCLUSION AND FUTURE DIRECTIONS
In this paper, we considered the problem of optimally allocating computing resources for distributed computation tasks. We proposed the optimal resource allocation scheme that minimizes the total execution time of the computation tasks, and proved its optimality through information-theoretic converses. Similarly, we proved that our proposed design uses the minimum possible number of servers among all possible computation schemes that achieve the minimum execution time.

This work leads to several interesting future directions. From a practical perspective, we can apply and implement our proposed scheme in many distributed computing algorithms to improve their performance. One example is the TeraSort algorithm, whose coded version has been successfully implemented [20], [21]. On the other hand, we can extend this problem to a heterogeneous setting, where the processing speeds of the computing nodes vary significantly. For example, an interesting problem could be how to optimally allocate the computing resources for a cluster with a few “super computers” and an abundant number of “slower processors”. Prior to this work, [22] considered a distributed matrix multiplication problem, and showed that designing a computing scheme without fully exploiting the heterogeneity could significantly increase the computation latency.

VII.
ACKNOWLEDGEMENT
This work is in part supported by NSF grants CAREER-1408639 and NETS-1419632, ONR award N000141612189, an NSA award, and funds from Intel.

APPENDIX A
ACHIEVABILITY SCHEMES FOR THE PARALLEL IMPLEMENTATION
In this appendix, we provide achievability schemes that achieve the minimum execution time T*_parallel for the parallel implementation using K*_parallel servers. We consider a general computing task with Q Reduce functions, parameters c_m, c_s, c_r, and sufficiently large N. We compute the design parameters r* and K*_parallel specified in Theorem 2. It is easy to show from (18) that r* > 0 given that c_s > 0, hence K = K*_parallel is always well defined. We use K = K*_parallel servers, as suggested in Theorem 2. Since K*_parallel ≥ Q always holds, we let nodes 1, 2, ..., Q reduce functions 1, 2, ..., Q, respectively. Depending on the value of r*, we design the Map phase and Shuffle phase as follows.

r* ∈ (0, Q − 1]: For a given parameter r*, we let r+ ≜ ⌈r*⌉, r− ≜ r+ − 1, and α ≜ r* − r−. It is easy to verify that r+, r− ∈ {0, 1, ..., Q − 1} and α ∈ (0, 1]. Assuming N is large, we break the dataset into two subsets, one with αN files, the other with (1 − α)N files. We construct the Map and Shuffle phases as follows:

Map Phase Design.
We first consider the Map task assignment for the subset of αN files: we evenly partition this set of αN files into (K − Q)·C(Q, r+) disjoint subsets, where C(·, ·) denotes the binomial coefficient. We bijectively map these subsets to tuples of a subset of r+ solvers and a helper. Rigorously, we map the subsets of files to the following set: {(i, A) | i ∈ {Q + 1, ..., K}, A ⊆ {1, ..., Q}, |A| = r+}. We denote the subset of files that is mapped to (i, A) by B_{i,A}. We let each solver k ∈ {1, ..., Q} map all subsets of files B_{i,A} satisfying k ∈ A, and we let each helper k ∈ {Q + 1, ..., K} map all subsets B_{k,A}. Each solver maps an α·C(Q − 1, r+ − 1)(K − Q) / (C(Q, r+)(K − Q)) = α·r+/Q fraction of the files, and each helper maps an α·C(Q, r+) / (C(Q, r+)(K − Q)) = α/(K − Q) fraction of the files.

We map the rest of the (1 − α)N files in a similar way, except that we let each file be repetitively mapped by r− solvers. This requires extra computation loads of (1 − α)·r−/Q on each solver and (1 − α)/(K − Q) on each helper. Hence, each solver maps an α·r+/Q + (1 − α)·r−/Q = r*/Q fraction of the files, and each helper maps an α/(K − Q) + (1 − α)/(K − Q) = 1/(K − Q) ≤ r*/Q fraction of the files. The peak computation load thus equals r*/Q and the computation time equals c_m · r*/Q.

Shuffle Phase Design.
We first consider a shuffling scheme that delivers all the intermediate values computed from the subset of αN files: we group all the intermediate values for a Reduce function q from all files in B_{i,A} into a single variable, and denote it by V_{i,A,q}. In the Shuffle phase, each helper i ∈ {Q + 1, ..., K} multicasts the following messages: for each subset S of r+ + 1 solvers, helper i multicasts Y_{i,S} ≜ ⊕_{k∈S} V_{i,S\{k},k} to all the solvers in S. The normalized communication load equals α·C(Q, r+ + 1)(K − Q) / (C(Q, r+)(K − Q)·Q) = α·(Q − r+)/(Q(r+ + 1)).

The validity of the above scheme is proved as follows: for each subset A ⊆ {2, ..., Q} of size r+ and for each i ∈ {Q + 1, ..., K}, server 1 can decode V_{i,A,1} from Y_{i,A∪{1}}. Combining with the intermediate values that are computed locally, server 1 obtains all the intermediate values mapped from the files in the subset of size αN for Reduce function 1. The same argument holds for the other Q − 1 Reduce functions, hence the proposed shuffling scheme is valid for delivering the intermediate values that are mapped from the subset of αN files.

Similarly, we can deliver the intermediate values of the rest of the (1 − α)N files using a communication load of (1 − α)·(Q − r−)/(Q(r− + 1)). Hence the total communication time of the proposed scheme equals c_s·( α·(Q − r+)/(Q(r+ + 1)) + (1 − α)·(Q − r−)/(Q(r− + 1)) ) = c_s · Conv( (Q − r*)/(Q(r* + 1)) ).

r* ∈ (Q − 1, Q]: Similar to the other case, we define the parameters r+ = Q, r− = Q − 1, and α = r* − r−, and we break the dataset into two subsets and handle the Map and Shuffle tasks for these two subsets separately. For the subset of size (1 − α)N, we use exactly the same Map and Shuffle phase designs as discussed above, which require computation loads of (1 − α)·r−/Q on each solver and (1 − α)/(K − Q) on each helper, and a communication load of (1 − α)·(Q − r−)/(Q(r− + 1)).
However, for the rest of the files, we simply let all of them be mapped on all the solvers, which requires no extra computation on the helpers and no extra communication. The computation load on each solver thus equals (1 − α)·r−/Q + α = r*/Q, and the computation load on each helper equals (1 − α)/(K − Q) ≤ r*/Q. Consequently, the computation time equals c_m · r*/Q. On the other hand, the communication load equals (1 − α)·(Q − r−)/(Q(r− + 1)) = Conv( (Q − r*)/(Q(r* + 1)) ), hence the communication time equals c_s · Conv( (Q − r*)/(Q(r* + 1)) ).

In all the above cases, each server reduces at most one function, hence our proposed achievability scheme always achieves a reduce time of c_r. Besides, in all the cases, the achievability scheme uses K*_parallel servers, and achieves a computation time of c_m · r*/Q and a communication time of c_s · Conv( (Q − r*)/(Q(r* + 1)) ). The total execution time always equals T*_parallel = c_m · r*/Q + c_s · Conv( (Q − r*)/(Q(r* + 1)) ) + c_r. Hence, our proposed scheme always achieves the T*_parallel and K*_parallel stated in Theorem 2.

APPENDIX B
PROOF OF LEMMA 1

Proof.
For q ∈ {1, ..., Q} and n ∈ {1, ..., N}, we let V_{q,n} be i.i.d. random variables uniformly distributed on F_{2^T}, and let the intermediate values v_{q,n} be the realizations of V_{q,n}. For any Q ⊆ {1, ..., Q} and N ⊆ {1, ..., N}, we define

    V_{Q,N} ≜ {V_{q,n} : q ∈ Q, n ∈ N}.    (49)

Since each message X_k is generated as a function of the intermediate values that are computed at node k, the following equation holds for all k ∈ {1, ..., K}:

    H(X_k | V_{[Q],M_k}) = 0,    (50)

where [Q] ≜ {1, ..., Q}. The validity of the shuffling scheme requires that for all k ∈ {1, ..., K}, the following equation holds:

    H(V_{W_k,[N]} | X_{[K]}, V_{[Q],M_k}) = 0.    (51)

Given M and W, for any disjoint subsets of servers S and D, we denote the number of intermediate values that are exclusively available at the servers in S, and exclusively needed by (but not available at) the servers in D, by a_{S,D}, i.e.,

    a_{S,D} = | ((∩_{k∈S} M_k) \ (∪_{i∉S} M_i)) ∩ ((∩_{k∈D} W_k) \ (∪_{i∉D∪S} W_i)) |.    (52)

For any subset C ⊆ {1, ..., K}, let C^c ≜ {1, ..., K} \ C. We define

    Y_{C^c} ≜ (V_{W_{C^c},[N]}, V_{[Q],M_{C^c}}).    (53)

We denote the number of intermediate values that are exclusively available at s servers in C, and exclusively needed by (but not available at) d servers in C, by a_{s,d,C}, i.e.,

    a_{s,d,C} = Σ_{S⊆C: |S|=s} Σ_{D⊆C\S: |D|=d} a_{S,D}.    (54)

Then we prove the following statement by induction:

Claim 1. For any subset
C ⊆ {1, ..., K}, we have

    H(X_C | Y_{C^c}) ≥ T Σ_{s=1}^{|C|} Σ_{d=1}^{|C|−s} a_{s,d,C} · d/(s + d − 1).

a. If C = ∅, the claim obviously holds, since both sides equal 0 (the sums are empty).    (55)

b. Suppose the statement is true for all subsets of size C. For any C ⊆ {1, ..., K} of size |C| = C + 1 and all k ∈ C, the subset versions of (50) and (51) can be derived:

    H(X_k | V_{[Q],M_k}, Y_{C^c}) = 0,    (56)

    H(V_{W_k,[N]} | X_C, V_{[Q],M_k}, Y_{C^c}) = 0.    (57)

Consequently, by the chain rule and (57), the following equation holds:

    H(X_C | V_{[Q],M_k}, Y_{C^c}) = H(X_C | V_{W_k,[N]}, V_{[Q],M_k}, Y_{C^c}) + H(V_{W_k,[N]} | V_{[Q],M_k}, Y_{C^c}).    (58)

Next we lower bound H(X_C | Y_{C^c}) as follows:

    H(X_C | Y_{C^c}) = (1/|C|) Σ_{k∈C} H(X_C, X_k | Y_{C^c})    (59)
                     = (1/|C|) Σ_{k∈C} ( H(X_C | X_k, Y_{C^c}) + H(X_k | Y_{C^c}) )    (60)
                     ≥ (1/|C|) Σ_{k∈C} H(X_C | X_k, Y_{C^c}) + (1/|C|) H(X_C | Y_{C^c}).    (61)

From (61), we can derive a lower bound on H(X_C | Y_{C^c}) that equals the LHS of (58) averaged over k ∈ C:

    H(X_C | Y_{C^c}) ≥ (1/(|C| − 1)) Σ_{k∈C} H(X_C | X_k, Y_{C^c})    (62)
                     ≥ (1/C) Σ_{k∈C} H(X_C | X_k, V_{[Q],M_k}, Y_{C^c})    (63)
                     = (1/C) Σ_{k∈C} H(X_C | V_{[Q],M_k}, Y_{C^c}).    (64)

The first term on the RHS of (58) is lower bounded by the induction assumption:

    H(X_C | V_{W_k,[N]}, V_{[Q],M_k}, Y_{C^c}) = H(X_{C\{k}} | Y_{(C\{k})^c})    (65)
        ≥ T Σ_{s=1}^{C} Σ_{d=1}^{C−s} a_{s,d,C\{k}} · d/(s + d − 1)    (66)
        = T Σ_{S⊆C\{k}: |S|≥1} Σ_{D⊆(C\{k})\S: |D|≥1} a_{S,D} · |D|/(|S| + |D| − 1)    (67)
        = T Σ_{S⊆C: |S|≥1} Σ_{D⊆C\S: |D|≥1} a_{S,D} · |D| · 1(k ∉ S ∪ D)/(|S| + |D| − 1),    (68)

where 1(·) denotes the indicator function. The second term on the RHS of (58) can be calculated based on the independence of the intermediate values:

    H(V_{W_k,[N]} | V_{[Q],M_k}, Y_{C^c})    (69)
        = H(V_{W_k,[N]} | V_{[Q],M_k}, V_{W_{C^c},[N]}, V_{[Q],M_{C^c}})    (70)
        = T Σ_{S⊆C\{k}} Σ_{D⊆C\S: k∈D} a_{S,D}    (71)
        ≥ T Σ_{S⊆C\{k}: |S|≥1} Σ_{D⊆C\S: k∈D} a_{S,D}    (72)
        = T Σ_{S⊆C: |S|≥1} Σ_{D⊆C\S: |D|≥1} a_{S,D} · 1(k ∈ D).    (73)

Thus by (58), (64), (68) and (73), we have

    H(X_C | Y_{C^c}) ≥ (1/C) Σ_{k∈C} H(X_C | V_{[Q],M_k}, Y_{C^c})    (74)
        = (1/C) Σ_{k∈C} ( H(X_C | V_{W_k,[N]}, V_{[Q],M_k}, Y_{C^c}) + H(V_{W_k,[N]} | V_{[Q],M_k}, Y_{C^c}) )    (75)
        ≥ (T/C) Σ_{k∈C} Σ_{S⊆C: |S|≥1} Σ_{D⊆C\S: |D|≥1} a_{S,D} ( |D| · 1(k ∉ S ∪ D)/(|S| + |D| − 1) + 1(k ∈ D) )    (76)
        = (T/C) Σ_{S⊆C: |S|≥1} Σ_{D⊆C\S: |D|≥1} a_{S,D} Σ_{k∈C} ( |D| · 1(k ∉ S ∪ D)/(|S| + |D| − 1) + 1(k ∈ D) )    (77)
        = (T/C) Σ_{S⊆C: |S|≥1} Σ_{D⊆C\S: |D|≥1} a_{S,D} ( |D| (|C| − |S| − |D|)/(|S| + |D| − 1) + |D| )    (78)
        = (T/C) Σ_{S⊆C: |S|≥1} Σ_{D⊆C\S: |D|≥1} a_{S,D} · |D| (|C| − 1)/(|S| + |D| − 1)    (79)
        = T Σ_{S⊆C: |S|≥1} Σ_{D⊆C\S: |D|≥1} a_{S,D} · |D|/(|S| + |D| − 1).    (80)

From the definition of a_{s,d,C} and (80), we have

    H(X_C | Y_{C^c}) ≥ T Σ_{s=1}^{|C|} Σ_{d=1}^{|C|−s} a_{s,d,C} · d/(s + d − 1).    (81)

c. Thus for all subsets C ⊆ {1, ..., K}, the following holds:

    H(X_C | Y_{C^c}) ≥ T Σ_{s=1}^{|C|} Σ_{d=1}^{|C|−s} a_{s,d,C} · d/(s + d − 1),    (82)

which proves Claim 1.

Then by Claim 1, letting C = {1, ..., K} be the set of all K servers, we have

    L ≥ H(X_C | Y_{C^c}) / (QNT) ≥ (1/(QN)) Σ_{s=1}^{K} Σ_{d=1}^{K−s} a_{s,d} · d/(s + d − 1).    (83)

This completes the proof of Lemma 1.

REFERENCES

[1] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” vol. 10, p. 10, June 2010.
[2] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,”
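Two steps of the proof of Lemma 1 above are purely mechanical and lend themselves to machine verification: the averaging argument in (59)–(62), and the counting identity used in going from (77) to (79). The short Python sketch below is our own illustration, not part of the paper; the choices of $|\mathcal{C}| = 2$ for the entropy check and $|\mathcal{C}| = 5$ for the counting check, as well as all variable names, are arbitrary assumptions. It checks the entropy inequality on random joint distributions of binary variables, and verifies the combinatorial identity exactly (with rational arithmetic) over all disjoint nonempty $\mathcal{S}, \mathcal{D}$.

```python
import itertools
import math
import random
from fractions import Fraction
from itertools import combinations


def cond_entropy(pmf, target, given):
    """H(target | given) in bits, for a joint pmf over outcome tuples;
    `target` and `given` are lists of coordinate indices."""
    marg, joint = {}, {}
    for o, p in pmf.items():
        g = tuple(o[i] for i in given)
        t = tuple(o[i] for i in target)
        marg[g] = marg.get(g, 0.0) + p
        joint[(t, g)] = joint.get((t, g), 0.0) + p
    return sum(p * math.log2(marg[g] / p) for (t, g), p in joint.items() if p > 0)


# Check 1: for |C| = 2, the chain (59)-(62) reduces to
#   H(X_1, X_2 | Y) >= H(X_1, X_2 | X_1, Y) + H(X_1, X_2 | X_2, Y),
# tested here on random joint distributions of three binary variables.
random.seed(0)
for _ in range(200):
    w = [random.random() for _ in range(8)]
    tot = sum(w)
    pmf = {o: wi / tot for o, wi in zip(itertools.product((0, 1), repeat=3), w)}
    lhs = cond_entropy(pmf, [0, 1], [2])
    rhs = cond_entropy(pmf, [0, 1], [0, 2]) + cond_entropy(pmf, [0, 1], [1, 2])
    assert lhs >= rhs - 1e-9


def subsets(nodes, min_size=1):
    """All subsets of `nodes` with at least `min_size` elements."""
    items = list(nodes)
    for r in range(min_size, len(items) + 1):
        yield from (set(c) for c in combinations(items, r))


# Check 2: the counting identity behind (77)-(79): for disjoint nonempty
# S, D inside C, summing the per-node terms over k in C gives the closed form
# |D| * (|C| - 1) / (|S| + |D| - 1). Verified exactly with rationals.
C = set(range(5))
for S in subsets(C):
    for D in subsets(C - S):
        per_node = sum(
            Fraction(len(D) * (k not in S | D), len(S) + len(D) - 1) + (k in D)
            for k in C
        )
        assert per_node == Fraction(len(D) * (len(C) - 1), len(S) + len(D) - 1)

print("both checks passed")
```

The second check is the one that makes the factor $|\mathcal{C}| - 1 = C$ cancel against the $1/C$ in front of (79), yielding (80); using `Fraction` keeps the verification exact rather than approximate.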