Optimizing the Transition Waste in Coded Elastic Computing
Hoang Dau, Ryan Gabrys, Yu-Chih Huang, Chen Feng, Quang-Hung Luu, Eidah Alzahrani, Zahir Tari
Abstract—Distributed computing, in which a resource-intensive task is divided into subtasks and distributed among different machines, plays a key role in solving large-scale problems, e.g., machine learning for large datasets or massive computational problems arising in genomic research. Coded computing is a recently emerging paradigm where redundancy for distributed computing is introduced to alleviate the impact of slow machines, or stragglers, on the completion time. Motivated by recently available services in the cloud computing industry, e.g., EC2 Spot or Azure Batch, where spare/low-priority virtual machines are offered at a fraction of the price of the on-demand instances but can be preempted on short notice, we investigate coded computing solutions over elastic resources, where the set of available machines may change in the middle of the computation. Our contributions are two-fold: We first propose an efficient method to minimize the transition waste, a newly introduced concept quantifying the total number of tasks that existing machines have to abandon or take on anew when a machine joins or leaves, for the cyclic elastic task allocation scheme recently proposed in the literature (Yang et al. ISIT'19). We then proceed to generalize such a scheme and introduce new task allocation schemes based on finite geometry that achieve zero transition wastes as long as the number of active machines varies within a fixed range. The proposed solutions can be applied on top of every existing coded computing scheme tolerating stragglers.
I. INTRODUCTION
In the era of Big Data, massive computational tasks, e.g., in large-scale machine learning and data analytics, are often carried out in distributed systems like Apache Spark [1] and Hadoop [2], which can efficiently process terabytes or even petabytes of data. However, it has been observed in such systems that slow machines, or stragglers, which may run 6x-8x slower than a median one, may significantly affect the performance of the whole distributed system [3], [4], [5].
Coded distributed computing [6], [7], [8], built upon algorithmic fault tolerance [9], is a recently emerging paradigm where computation redundancy is employed to tackle the straggler effect. As a toy example [6], to perform a matrix-vector multiplication Ax, a master machine first partitions the matrix A into two equal-size submatrices A_1 and A_2 and then distributes A_1, A_2, and A_1 + A_2 to three worker machines, respectively. These machines also receive the vector x and perform three multiplications A_1x, A_2x, and (A_1 + A_2)x in parallel. Clearly, Ax can be recovered by the master from the outcomes of any two workers. Thus, this coded scheme can tolerate one straggler. The potential of coded distributed computing has been extensively investigated through a substantial body of work in the literature, e.g., [10], [11], [12], [13], [14]. Recent breakthroughs have shown that this paradigm not only applies to linear or bilinear operations but also works for general nonlinear operations such as polynomial evaluation [15] or even for any function that can be represented by a deep network [16]. Most of the research in the literature of coded distributed computing, however, assumes that the set of available worker machines remains fixed. This critical limitation renders current coded computing schemes inapplicable in an environment where low-cost elastic resources are readily available.

Hoang Dau, Eidah Alzahrani, and Zahir Tari are with the School of Science, RMIT University. Emails: {sonhoang.dau, eidah.alzahrani, zahir.tari}@rmit.edu.au. Ryan Gabrys is with SPAWAR Systems Center, San Diego. Email: [email protected]. Yu-Chih Huang is with the Department of Communication Engineering, National Taipei University. Email: [email protected]. Chen Feng is with the School of Engineering, British Columbia University (Okanagan Campus). Email: [email protected]. Quang-Hung Luu is with the Department of Computer Science and Software Engineering, Swinburne University of Technology. Email: [email protected].
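The toy scheme above can be sketched in a few lines of Python (a minimal illustration; the matrix sizes and values are our own arbitrary choices, and the helper names are hypothetical):

```python
# Toy coded scheme from the text: split A into A1, A2 and hand out
# A1*x, A2*x, (A1+A2)*x; any two workers' results recover A*x.

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

def vec_sub(u, v):
    return [a - b for a, b in zip(u, v)]

A1 = [[1, 2, 3], [4, 5, 6]]       # top half of A
A2 = [[7, 8, 9], [10, 11, 12]]    # bottom half of A
x = [1, 0, 2]

# Each worker computes one coded product in parallel.
results = {1: matvec(A1, x),
           2: matvec(A2, x),
           3: matvec([vec_add(r1, r2) for r1, r2 in zip(A1, A2)], x)}

def recover(r, alive):
    """Recover A*x = (A1*x, A2*x) from any two non-straggling workers."""
    if 1 in alive and 2 in alive:
        return r[1] + r[2]
    if 1 in alive:                        # workers 1 and 3 alive
        return r[1] + vec_sub(r[3], r[1])
    return vec_sub(r[3], r[2]) + r[2]     # workers 2 and 3 alive

Ax = matvec(A1, x) + matvec(A2, x)
for straggler in (1, 2, 3):   # any single straggler is tolerated
    alive = [w for w in (1, 2, 3) if w != straggler]
    assert recover(results, alive) == Ax
```

Whichever single worker straggles, the master reconstructs Ax from the two remaining coded products.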
In fact, major cloud computing providers very recently started offering spare virtual machines at a price up to 90% cheaper than that of the on-demand machines, e.g., Amazon EC2 Spot [17] and
Microsoft Azure Batch [18], albeit at the cost of low priority in the sense that these machines can be preempted (removed) for a higher-priority customer on short notice (e.g., two minutes in the case of Amazon Spot). This new development in the cloud computing industry provides customers with an opportunity to have large computing resources at a fraction of the cost of the normal on-demand service. Realizing this opportunity, however, requires the user to develop much more flexible distributed computing paradigms in order to efficiently exploit elastic resources where low-cost machines can leave and join at any time during the computation cycle. Recently, Yang et al. [19] proposed an elegant technique extending coded computing to deal with elastic resources. Their key idea is to couple a cyclic task allocation scheme, which works for any number of machines, with a coded computing scheme to guarantee that a) as long as there are a sufficient number of machines working, the original computation can be recovered, and b) the workload at each machine is inversely proportional to the number of available machines. In other words, their solution allows an elastic task allocation: when a new machine joins, existing machines can share some of their workload with the newcomer and hence reduce the number of tasks they are currently working on; likewise, when a machine leaves, existing machines must cover extra tasks left over by that machine. The elastic coded computing scheme proposed in [19] was evaluated in the multi-tenancy cluster at Microsoft using the Apache REEF Elastic Group Communication framework, and shown to reduce the completion time of matrix-vector multiplication and linear regression by up to 46% compared to ordinary coded computing schemes.
Relaxing the cyclic task allocation proposed in [19], we investigate a more general elastic task allocation problem, which we believe may find applications not just in coded distributed computing but also in a much broader context where a set of tasks is distributed to an elastic set of participants (e.g., virtual machines), which frequently leave and join. More specifically, we need to address the following key questions.
• Task allocation: given a set of tasks and a set of machines, how to assign tasks to machines so that all machines are assigned an equal number of tasks (workload balance) and every task is covered by the same number of machines? This can be easily solved, e.g., by using the cyclic scheme employed in [19].
• Transition reallocation: when an elastic event occurs (machines leaving/joining), how to reallocate the tasks to the new set of machines so as to minimize the transition waste, i.e., the total number of tasks that existing machines have to abandon or take over when one machine joins or leaves, less the necessary amount? This is a much more challenging question and is our focus in this work.
We illustrate in a toy example (Fig. 1) the concept of transition waste and explain why the cyclic elastic task allocation scheme in [19] is suboptimal with respect to this new metric. We consider the computation of Ax where A consists of 40 equal-sized sub-matrices A_0, ..., A_39. We first partition these sub-matrices into 20 groups, e.g., {A_0, A_1}, {A_2, A_3}, and so forth. Then each group is assigned a task index (or task, for simplicity) from 0 to 19. Task 0, for instance, corresponds to the computation of {A_0x, A_1x}. Task 0 is encoded into five subtasks: A_0x, A_1x, (A_0 + A_1)x, (A_0 + 2A_1)x, and (A_0 + 3A_1)x. A machine taking Task 0 means it computes one of these five subtasks. Similar to the earlier discussion, any three out of five subtasks/machines form a coded computing group that can recover Task 0 given one straggler.
Hence, abstracting away the underlying coded computing scheme, which can be designed independently of the task allocation scheme in consideration, given F = 20 tasks, we require that each task must be covered by precisely L = 3 machines. Furthermore, this requirement can be easily met by using the cyclic scheme in [19]: each of the N machines is preloaded with a set of F tasks, which is then divided into N equal consecutive subsets of size F/N each, and works on tasks in the union of L consecutive such subsets. For instance, when N = 5, Machine 1 works on the set of tasks S_1 = {0, 1, ..., 11} = {0, ..., 3} ∪ {4, ..., 7} ∪ {8, ..., 11}, Machine 2 works on S_2 = {4, 5, ..., 15}, and so forth (see Fig. 1 (a)). Note that each machine takes 12 tasks and, due to the cyclic task allocation scheme, each task is covered by three machines.

[Fig. 1: Illustration of the sub-optimality of the cyclic task allocation scheme proposed in [19] with respect to the transition waste when one machine leaves. Here, we use a−b to denote the set {a, a + 1, ..., b} (mod F). (a) Cyclic task allocation for five machines [19]. (b) Cyclic task allocation for four machines [19]; the transition waste from five to four machines is 12 tasks. (c) Our proposed shifted cyclic task allocation for four machines that results in an optimal transition waste among all cyclic schemes (zero in this case).]

In Fig. 1 (b), only four machines are available, each of which takes 15 tasks. As the fifth machine has left, it is necessary now that each of the four available machines must take 15 − 12 = 3 more tasks. Ideally, when the transition from five machines to four machines occurs, each machine continues their existing tasks and works on three new tasks. This is true for Machine 1 because S_1^5 ⊂ S_1^4. However, it is not the case for other machines. For instance, Machine 3 has to abandon two tasks (8 and 9) and take over five new tasks (0-4). The transition waste at Machine 3 is (2 + 5) − 3 = 4 tasks. Note that three is the necessary increase in the number of tasks each machine must take and so we subtract that amount. The transition wastes at other machines can be computed in a similar manner. The total transition waste is (3 − 3) + (5 − 3) + (7 − 3) + (9 − 3) = 12 (tasks).
Therefore, sticking to the cyclic allocation scheme of [19], we waste 12 tasks. However, it turns out that we can reduce the transition waste to zero if we use the allocation scheme in Fig. 1 (c) instead. In this case, as S_n^5 ⊂ S_n^4, the transition wastes at all four machines are zero. The trick is to shift the cyclic task allocation by the right amount (−3 in this case) to maximize the overlaps between S_n^5 and S_n^4, n = 1, ..., 4. Our main contributions are summarized below.
• We first introduce a new concept of transition waste of an elastic task allocation scheme, which quantifies the total number of tasks that existing machines have to abandon or take over when one machine joins or leaves, less the necessary amount.
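The Fig. 1 numbers can be reproduced with a short Python sketch (the cyclic allocation and the −3 shift are as described in the example above; the helper names are our own):

```python
# Reproducing the Fig. 1 example: F = 20 tasks, L = 3, and the cyclic
# allocation of [19] when machine 5 leaves, versus a -3-shifted variant.
F, L = 20, 3

def cyclic_tas(N, shift=0):
    """S_n = [(n-1)F/N + shift, (n-1)F/N + LF/N - 1 + shift] (mod F)."""
    return [{(k + shift) % F
             for k in range((n - 1) * F // N, (n - 1) * F // N + L * F // N)}
            for n in range(1, N + 1)]

def transition_waste(old, new):
    """Sum over surviving machines of |S_old ^ S_new| minus the
    necessary load change |LF/N - LF/N'|."""
    delta = abs(len(old[0]) - len(new[0]))
    return sum(len(s ^ t) - delta for s, t in zip(old, new))

survivors = cyclic_tas(5)[:4]                  # machine 5 leaves
assert transition_waste(survivors, cyclic_tas(4)) == 12           # Fig. 1(b)
assert transition_waste(survivors, cyclic_tas(4, shift=-3)) == 0  # Fig. 1(c)
```

The plain cyclic transition wastes 12 tasks, while the shifted one wastes none, matching the discussion above.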
A reduction in transition waste implies lower computation and communication costs (Remark 1).
• We then compute explicitly the transition waste incurred in the cyclic elastic task allocation scheme introduced by Yang et al. [19] when machines leave and join (Theorems 1, 2) and propose a shifted cyclic scheme that minimizes the transition waste among all cyclic schemes (Theorems 3, 4). The optimal transition waste of a shifted cyclic scheme is, in general, greater than zero.
• Lastly, we show that there exists a zero-waste transition when a machine leaves if and only if there exists a perfect matching in a certain bipartite graph, using the famous Hall's marriage theorem. Based on this new insight, we construct several novel task allocation schemes based on finite geometry that achieve zero transition wastes when the number of active machines varies within a fixed range.
While the cyclic schemes are simple to implement and efficient when there are many tasks and many machines, the schemes with zero-waste transitions are more suitable when there are a moderate number of machines and tasks but each task is resource-intensive. We discuss this further in Section II. We emphasize that our task allocation schemes are designed separately from the underlying coded computing scheme and hence can be applied on top of almost every coded computing scheme. Readers who are familiar with the parity declustering technique in redundant disk arrays (RAID) [20], [21], [22] may recognize the analogy between a coded computing scheme and a stripe unit and between a task allocation scheme and a data layout (in the terminology of [21]).
The paper is organized as follows. The concepts of elastic task allocation and transition waste are defined and discussed in Section II. Section III is devoted to the cyclic task allocation scheme and our proposed shifted version with optimal transition wastes.
We develop elastic task allocation schemes that admit zero transition wastes in Section IV and conclude the paper in Section V.

II. PRELIMINARIES
In this section we first define the elastic task allocation scheme and the new concept of transition waste. We then explain how to couple such a scheme with a coded computing scheme to create a coded elastic computing scheme, which generalizes the cyclic scheme originally proposed by Yang et al. [19]. We henceforth use N for the number of available machines, F for the common number of pre-loaded tasks at each machine, and L for the minimum number of available machines so that the scheme still works (L ≤ N). Each task is represented by a label from [[F]] ≜ {0, 1, ..., F − 1}. We assume that all tasks consume an equal amount of resources (storage, memory, CPU). We use [F] to denote the set {1, 2, ..., F} and [A, B] to denote the set {A, A + 1, ..., B}. We also use 2^[[F]] to denote the power set of the set {0, 1, ..., F − 1} and (2^[[F]])^M = 2^[[F]] × 2^[[F]] × ··· × 2^[[F]] to denote the M-ary Cartesian power of 2^[[F]].

Definition 1 (Task allocation scheme). An ordered list of N sets S^N = (S_1^N, ..., S_N^N) ∈ (2^[[F]])^N, where S_n^N ⊂ [[F]], n ∈ [N], is referred to as an (N, L, F) task allocation scheme ((N, L, F)-TAS) if it satisfies the following two properties.
• (L-Redundancy) each element in [[F]] is included in precisely L sets in S^N, and
• (Load Balancing) |S_n^N| = LF/N for all n ∈ [N]. Here we assume that LF/N ∈ Z.

Note that we can relax the Load Balancing property and require that |S_n^N| ∈ {⌊LF/N⌋, ⌈LF/N⌉} and hence can lift the requirement that N divides LF. To simplify the exposition, however, we assume LF/N ∈ Z. In practice, padding of dummy tasks can be employed to achieve this property. The L-Redundancy property is tied to the underlying coded computing scheme (see Appendix VI-A). An (N, L, F)-TAS S^N = (S_1^N, ..., S_N^N) can also be represented by its incidence matrix B = (b_{f,n})_{F×N}, where b_{f,n} = 1 if and only if f ∈ S_n^N.
The rows and columns of B represent tasks and machines, respectively. Clearly, B has row weight L and column weight LF/N. In other words, each row of B has precisely L ones while each column has precisely LF/N ones. Thus, a TAS simply corresponds to a binary matrix with constant row and column weights.
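This correspondence can be checked mechanically. The sketch below builds the incidence matrix of a small cyclic allocation (N = 3, L = 2, F = 6; our own illustrative instance) and verifies both weight conditions:

```python
# A TAS as a binary matrix: rows are tasks, columns are machines.
N, L, F = 3, 2, 6

# Cyclic allocation: S_n = {(n-1)F/N, ..., (n-1)F/N + LF/N - 1} (mod F).
S = [{((n * F // N) + k) % F for k in range(L * F // N)} for n in range(N)]

# Incidence matrix B: b[f][n] = 1 iff task f is assigned to machine n+1.
B = [[1 if f in S[n] else 0 for n in range(N)] for f in range(F)]

assert all(sum(row) == L for row in B)                 # row weight L
assert all(sum(B[f][n] for f in range(F)) == L * F // N
           for n in range(N))                          # column weight LF/N
```

The two assertions are exactly the L-Redundancy and Load Balancing properties of Definition 1, stated on the matrix.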
Example 1. For N = 3, L = 2, F = 6, the list of sets S^3 = ({0, 1, 2, 3}, {2, 3, 4, 5}, {4, 5, 0, 1}) is a (3, 2, 6)-TAS as each member set has size 2 × 6/3 = 4 and each element f ∈ [[6]] = {0, 1, ..., 5} belongs to precisely L = 2 such sets. The incidence matrix of S^3, given by (1), has column weight four and row weight two.

            Machine 1   Machine 2   Machine 3
              S_1^3       S_2^3       S_3^3
  Task 0        1           0           1
  Task 1        1           0           1
  Task 2        1           1           0
  Task 3        1           1           0
  Task 4        0           1           1
  Task 5        0           1           1          (1)

When a machine leaves or joins, we need to reallocate tasks to a new set of machines. Thus, we must extend the notion of a task allocation scheme (TAS) to that of an elastic task allocation scheme (ETAS). We explain in Appendix VI-A how to couple an ETAS and a coded computing scheme to achieve a coded elastic computing scheme that tolerates stragglers.

Definition 2 (Elastic task allocation). A pair (S^{N_0}, T) is referred to as an (N_0, L, F) elastic task allocation scheme ((N_0, L, F)-ETAS) if S^{N_0} is an (N_0, L, F)-TAS and T is an algorithm that reallocates tasks when machines leave and join so that the new scheme remains a TAS. More specifically, T: (2^[[F]])^N × {−1, 1} × [N] → (2^[[F]])^{N−1} ∪ (2^[[F]])^{N+1} takes as input an (N, L, F)-TAS S^N, where L ≤ N ≤ LF, a variable b ∈ {−1, 1}, which represents the elastic event of one machine leaving (b = −1) or joining (b = 1), and an index n* ∈ [N], which indicates the index of the machine that leaves (effective only when b = −1), and returns an output S^{N′}, which is another (N′, L, F)-TAS, where N′ = N + b. In other words, moving from a set of N machines to a new set of N′ = N + b machines, T updates the list of task sets S^N to obtain S^{N′}, which remains a TAS (Definition 1). The starting TAS is set to be S^{N_0}. A few remarks are in order.
First, we make a simplifying assumption in Definition 2 that each elastic event corresponds to one machine leaving or joining only. In other words, we assume that machines leave and join one after another and not at the same time. Second, while in general we allow N to take any value in the range [L, LF], it is more practical to limit N within a fixed range [L, N_max]. Moreover, we often assume that F is divisible by any number within this range. These assumptions allow us to achieve concrete results and are also practically reasonable. For instance, we can use padding, i.e., adding dummy tasks, to make F satisfy the aforementioned property. Third, when Machine n* ∈ [N] leaves, we index the remaining machines by the set [N − 1] = {1, ..., N − 1}. However, when comparing with the previous TAS, we often use {1, ..., n* − 1, n* + 1, ..., N} instead of [N − 1], so that the same machine is given the same index in the previous and in the current task allocation schemes.

Cyclic elastic task allocation scheme [19].
A simple way to construct an ETAS is to let T depend only on the number of machines and not on the current TAS. More specifically, whenever there are N machines available as the result of an elastic event, we always use a fixed (N, L, F)-TAS

  S^N_cyc = (S_1^N, ..., S_N^N),  S_n^N = [(n − 1)F/N, (n − 1)F/N + LF/N − 1] (mod F),   (2)

for every n ∈ [N], where [A, B] (mod F) is obtained from [A, B] by applying the modulo operation on every element of this set. We also assume here that F/N ∈ Z. It is straightforward to verify that each S^N_cyc satisfies the Load Balancing and the L-Redundancy properties, and therefore, is indeed an (N, L, F)-TAS. The reallocation algorithm is trivial: T(S^N_cyc, 1) = S^{N+1}_cyc and T(S^N_cyc, −1, n*) = S^{N−1}_cyc for every n* ∈ [N]. Fig. 1(a) and (b) illustrate the cyclic ETAS when N = 5 and when N = 4, with L = 3 and F = 20.

Transition waste.
We now define the transition waste occurring during an elastic event when one machine leaves or joins and demonstrate this new concept via a few examples.

Definition 3 (Necessary load change). For a transition from an (N, L, F)-TAS S^N to another (N′, L, F)-TAS S^{N′}, ∆_{N,N′} ≜ |LF/N − LF/N′| is referred to as the necessary load change. We assume here that N′ ∈ {N − 1, N + 1}.

The necessary load change, ∆_{N,N′} = ||S_n^N| − |S_n^{N′}||, reflects the necessary increase or decrease in the number of tasks each machine must take when one machine leaves or joins, respectively. For instance, when L = 3, F = 20, if there are N = 5 machines, the Load-Balancing property requires that each machine runs LF/N = 12 tasks, while if there are N′ = 4 machines due to the removal of one, then each machine runs LF/N′ = 15 tasks. Therefore, each of the four machines has to take 15 − 12 = 3 more tasks to react to this event. The necessary load change is three in this case.

Definition 4 (Transition waste for one machine). The transition waste incurred at Machine n when transitioning from a set of tasks S_n^N to another set of tasks S_n^{N′} is defined as W(S_n^N → S_n^{N′}) = |S_n^N ∆ S_n^{N′}| − ∆_{N,N′}, where ∆_{N,N′} is the necessary load change (Definition 3) and A ∆ B denotes the symmetric difference between A and B. We also use W_{n*}(S_n^N → S_n^{N′}) for the case Machine n* leaves.

Remark 1. Note that |S_n^N ∆ S_n^{N′}| = |S_n^N \ S_n^{N′}| + |S_n^{N′} \ S_n^N| corresponds to the number of scheduled tasks Machine n has to abandon (tasks that belong to S_n^N but not S_n^{N′}) and take anew (tasks that belong to S_n^{N′} but not S_n^N). Thus, the transition waste W(S_n^N → S_n^{N′}) in Definition 4 measures the maximum number of tasks wasted at Machine n. As some tasks may have been already completed before the transition, one should abandon as few existing tasks as possible. At the same time, taking on fewer new tasks will decrease the downloading traffic (if the protocol requires new tasks to be downloaded). In other words, having a low-waste transition will save computation and network resources and reduce the completion time of the scheme.

Definition 5 (Transition waste). When Machine N + 1 joins, the transition waste of the transition from an (N, L, F)-TAS S^N to an (N + 1, L, F)-TAS S^{N+1} is defined as

  W(S^N → S^{N+1}) ≜ Σ_{n ∈ [N]} W(S_n^N → S_n^{N+1}).

When Machine n* leaves, the transition waste of the transition from an (N, L, F)-TAS S^N to an (N − 1, L, F)-TAS S^{N−1} is defined as

  W_{n*}(S^N → S^{N−1}) ≜ Σ_{n ∈ [N]\{n*}} W_{n*}(S_n^N → S_n^{N−1}).

Here, W(S_n^N → S_n^{N+1}) and W_{n*}(S_n^N → S_n^{N−1}) denote the transition waste incurred at Machine n (Definition 4).

We demonstrate in the Introduction (Fig. 1(a), (b), (c)) two different transitions from a (5, 3, 20)-TAS to a (4, 3, 20)-TAS, i.e., one machine removed. The first transition has a transition waste of 12 tasks, while the second one has a zero waste. Another example, built upon Example 1, is given below. Example 2.
Let L = 2, F = 6, N = 3, and N′ = 4. It is easy to verify that S^3 = ({0, 1, 2, 3}, {2, 3, 4, 5}, {4, 5, 0, 1}) is a (3, 2, 6)-TAS and S^4 = ({0, 1, 2}, {0, 1, 4}, {3, 4, 5}, {2, 3, 5}) is a (4, 2, 6)-TAS. The necessary load change when going from three to four machines, and vice versa, is ∆_{3,4} = |4 − 3| = 1. The waste when transitioning from S^3 to S^4 is computed as follows:

  W(S^3 → S^4) = Σ_{n=1}^{3} (|S_n^3 ∆ S_n^4| − ∆_{3,4}) = (1 − 1) + (5 − 1) + (3 − 1) = 6.

Storage, communication, and computation overhead of an ETAS.
As proposed in [19], each machine stores all F tasks but only runs a subset of those tasks based on the specific allocation. In this way, when switching to a new TAS, each existing machine does not have to download new data. When coupling with a coded computing scheme (Appendix VI-A), each machine actually stores only a 1/(L − E)-fraction of the input data, e.g., the matrix A if we are computing Ax, where E < L is the number of stragglers (slow machines) that the scheme can tolerate. Every machine joining the system has to download its portion of data once, which constitutes the most costly, but necessary, communication overhead of the system. The communication between a master machine, which coordinates the task allocation, and the worker machines is negligible. The master has to run an algorithm to find a new TAS whenever a machine leaves or joins. If a cyclic or a shifted cyclic ETAS (see Section III-B) is used, the computation overhead is negligible. If a zero-waste transition (see Section IV) is insisted upon, the complexity of the search is polynomial in N, L, and F (basically, it runs a network flow algorithm). A zero-waste transition will be particularly beneficial when there are a moderate number of tasks while each task is resource-intensive, e.g., when we multiply a fat matrix with a long vector. In that case, the benefit of a zero-waste transition will offset the time spent finding one.

III. SHIFTED CYCLIC ELASTIC TASK ALLOCATION SCHEMES WITH OPTIMAL TRANSITION WASTES

We first compute explicitly the transition waste of the cyclic elastic task allocation scheme introduced by Yang et al. [19] and then propose a shifted cyclic scheme that achieves the optimal transition waste among all such cyclic schemes. We assume that the number of machines N lies in a predetermined interval [L, N_max] and N(N + 1) | F for every L ≤ N < N_max.

A. Transition Waste of the Cyclic Elastic Task Allocation
The following lemma is useful in determining the symmetric difference between two sets in [[F]].

Lemma 1. Let S = [a, b] (mod F) and T = [c, d] (mod F). Assume that 0 ≤ a ≤ c < F, and moreover, 0 < |S| < F and 0 < |T| < F. The following statements hold.
(a) If c − a < |S| < (c − a) + |T| < F then |S ∆ T| = 2(c − a) + |T| − |S|.
(b) If |S| ≥ (c − a) + |T| then T ⊂ S and |S ∆ T| = |S| − |T|.

Proof. (a) Suppose that c − a < |S| < (c − a) + |T| < F. If we travel along the circle of integers mod F (see Fig. 2 (a)) clockwise from a, we first see c, then b (mod F) (because c − a < |S|), then d (mod F) (because |S| < (c − a) + |T|), before we reach a again (because (c − a) + |T| < F). Therefore,

  |S ∆ T| = |S \ T| + |T \ S| = (c − a) + (|T| − (|S| − (c − a))) = 2(c − a) + |T| − |S|.

(b) Suppose that |S| ≥ (c − a) + |T|. This clearly implies that T ⊂ S and hence |S ∆ T| = |S| − |T| (see Fig. 2 (b)). ∎

Lemma 2 is obvious by the definition of the transition waste.
Lemma 2. The transition waste incurred at Machine n when transitioning from a set of tasks S_n^N to another set of tasks S_n^{N′} is zero if and only if S_n^N ⊂ S_n^{N′} or S_n^N ⊃ S_n^{N′}.

[Fig. 2: Illustrations of the two sets S = [a, b] (mod F) and T = [c, d] (mod F) on the circle of integers mod F; panel (a) shows the overlapping case and panel (b) the case T ⊂ S.]

In the next corollary, we show that when there are N = L + 1 machines and one machine leaves or when there are N = L machines and one machine joins, the transition waste is trivially zero, no matter which TASs the system is employing.

Corollary 1. The transition waste when transitioning from an (L, L, F)-TAS to an (L + 1, L, F)-TAS and vice versa is zero.

Proof. Note that for an (L, L, F)-TAS S^L = (S_1^L, ..., S_L^L), we have S_n^L = [[F]] for all n ∈ [L]. Therefore, S_n^N ⊃ S_n^{N′}. By Lemma 2, the corollary follows. ∎

We henceforth assume that N > L when one machine joins and N > L + 1 when one machine leaves. First, we consider the case of one machine joining.
Theorem 1. The transition waste when transitioning from a cyclic (N, L, F)-TAS S^N_cyc to a cyclic (N + 1, L, F)-TAS S^{N+1}_cyc (defined in (2)) is given below (assuming N > L):

  W(S^N_cyc → S^{N+1}_cyc) = (N − 1)F/(N + 1).

Proof. Suppose Machine N + 1 joins the computation. According to (2), we have S^N_cyc = (S_1^N, ..., S_N^N) and S^{N+1}_cyc = (S_1^{N+1}, ..., S_{N+1}^{N+1}), where for n ∈ [N],

  S_n^N = [(n − 1)F/N, (n − 1)F/N + LF/N − 1] (mod F),

and for n ∈ [N + 1],

  S_n^{N+1} = [(n − 1)F/(N + 1), (n − 1)F/(N + 1) + LF/(N + 1) − 1] (mod F).

We now apply Lemma 1 to find the symmetric difference of S_n^N and S_n^{N+1} for every n ∈ [N]. We write S = S_n^{N+1} = [a, b] (mod F), T = S_n^N = [c, d] (mod F), and can verify that all assumptions of Lemma 1 (a) are satisfied. Indeed, since N > L and N ≥ n ≥ 1, we have

  0 ≤ a = (n − 1)F/(N + 1) ≤ c = (n − 1)F/N < F,
  0 < |S| = LF/(N + 1) < F,  0 < |T| = LF/N < F,
  c − a = (n − 1)F/(N(N + 1)) < LF/(N + 1) = |S|,
  |S| = LF/(N + 1) < (n − 1)F/(N(N + 1)) + LF/N = (c − a) + |T| < F.

Therefore, by Lemma 1 (a),

  |S_n^N ∆ S_n^{N+1}| = 2(c − a) + (|S_n^N| − |S_n^{N+1}|) = 2(n − 1)F/(N(N + 1)) + ∆_{N,N+1}.

Thus, the transition waste incurred at Machine n is W(S_n^N → S_n^{N+1}) = |S_n^N ∆ S_n^{N+1}| − ∆_{N,N+1} = 2(n − 1)F/(N(N + 1)). Finally, the transition waste when transitioning from S^N_cyc to S^{N+1}_cyc is

  W(S^N_cyc → S^{N+1}_cyc) = Σ_{n ∈ [N]} W(S_n^N → S_n^{N+1}) = Σ_{n ∈ [N]} 2(n − 1)F/(N(N + 1)) = (N − 1)F/(N + 1),

as desired. ∎

We now turn to the slightly more involved case when one machine leaves the computation. When Machine n* ∈ [N] leaves, for the ease of notation, we assume the system transitions to the cyclic TAS S^{N−1}_cyc = (S_1^{N−1}, ..., S_{n*−1}^{N−1}, S_{n*+1}^{N−1}, ..., S_N^{N−1}), where for n < n*,

  S_n^{N−1} = [(n − 1)F/(N − 1), (n − 1)F/(N − 1) + LF/(N − 1) − 1] (mod F),

and for n > n*,

  S_n^{N−1} = [(n − 2)F/(N − 1), (n − 2)F/(N − 1) + LF/(N − 1) − 1] (mod F).

Lemma 3.
Suppose that Machine n* ∈ [N] leaves and the system transitions from a cyclic (N, L, F)-TAS S^N_cyc to a cyclic (N − 1, L, F)-TAS S^{N−1}_cyc (defined in (2)). The transition waste incurred at Machine n for n < n* is (assuming N > L + 1)

  W_{n*}(S_n^N → S_n^{N−1}) = 2(n − 1)F/(N(N − 1)).

Proof. The proof is the same as that of Theorem 1, whereby Lemma 1 (a) is applied to S = S_n^N and T = S_n^{N−1}. ∎

Lemma 4.
Suppose that Machine n* ∈ [N] leaves and the system transitions from a cyclic (N, L, F)-TAS S^N_cyc to a cyclic (N − 1, L, F)-TAS S^{N−1}_cyc (defined in (2)). The transition waste incurred at Machine n for N ≥ n ≥ n* + 1 is given below (assuming N > L + 1).
If n* ≥ N − L then W_{n*}(S_n^N → S_n^{N−1}) = 0.
If n* < N − L < n then W_{n*}(S_n^N → S_n^{N−1}) = 0.
If n* < n ≤ N − L then W_{n*}(S_n^N → S_n^{N−1}) = 2(N − L − n + 1)F/(N(N − 1)).

Proof. As n > n*, we have

  S_n^{N−1} = [(n − 2)F/(N − 1), (n − 2)F/(N − 1) + LF/(N − 1) − 1] (mod F).

We now apply Lemma 1 to the sets S = S_n^{N−1} = [a, b] (mod F), T = S_n^N = [c, d] (mod F). The common assumptions of Lemma 1 are verified as follows. We have

  0 ≤ a = (n − 2)F/(N − 1) < c = (n − 1)F/N < F,
  0 < |S| = LF/(N − 1) < F,  0 < |T| = LF/N < F.

Case 1. When n* ≥ N − L, or n* < N − L but n > N − L, we aim to show W_{n*}(S_n^N → S_n^{N−1}) = 0 by proving that S_n^N ⊂ S_n^{N−1} (Lemma 2). Note that in this case, we always have n ≥ N − L + 1. Therefore,

  LF/(N(N − 1)) ≥ (N − n + 1)F/(N(N − 1)),   (3)

which is equivalent to |S_n^{N−1}| − |S_n^N| ≥ c − a, or |S| ≥ (c − a) + |T|. By Lemma 1 (b), we conclude that T = S_n^N ⊂ S = S_n^{N−1}, as desired. Hence the transition waste incurred at Machine n is zero.

Case 2. Suppose that n* < n ≤ N − L. The inequality (3) is reversed, which gives us |S| < (c − a) + |T|. We now verify that the other conditions of Lemma 1 (a) are also satisfied. First, it is clear that c − a = (N − n + 1)F/(N(N − 1)) < LF/(N − 1) = |S|. Moreover, as N > L + 1 (our assumption),

  (c − a) + |T| = (N − n + 1)F/(N(N − 1)) + LF/N < F.

Therefore, by Lemma 1 (a), we obtain

  |S_n^N ∆ S_n^{N−1}| = 2(c − a) + (|S_n^N| − |S_n^{N−1}|) = 2(N − n + 1)F/(N(N − 1)) − LF/(N(N − 1)).

Noting that ∆_{N,N−1} = LF/(N(N − 1)), we obtain

  W_{n*}(S_n^N → S_n^{N−1}) = |S_n^N ∆ S_n^{N−1}| − ∆_{N,N−1} = 2(N − L − n + 1)F/(N(N − 1)).

This completes the proof. ∎
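The closed forms of Theorem 1 and of Lemmas 3 and 4 can be sanity-checked numerically. The sketch below uses sample parameters of our own choosing that satisfy the divisibility assumptions:

```python
# Numerical check of the transition wastes for the cyclic scheme.
def cyclic_set(n, N, L, F):
    """S_n^N of the cyclic TAS in (2), for machine index n in [N]."""
    start = (n - 1) * F // N
    return {k % F for k in range(start, start + L * F // N)}

# One machine joins (Theorem 1): total waste = (N-1)F/(N+1).
N, L, F = 5, 3, 60
delta = L * F // N - L * F // (N + 1)
total = sum(len(cyclic_set(n, N, L, F) ^ cyclic_set(n, N + 1, L, F)) - delta
            for n in range(1, N + 1))
assert total == (N - 1) * F // (N + 1)

# Machine n* leaves (Lemmas 3 and 4): per-machine wastes.
N, L, F = 5, 3, 20
delta = L * F // (N - 1) - L * F // N
for n_star in range(1, N + 1):
    for n in range(1, N + 1):
        if n == n_star:
            continue
        # Surviving machine n is re-indexed to n (n < n*) or n - 1 (n > n*).
        new = cyclic_set(n if n < n_star else n - 1, N - 1, L, F)
        w = len(cyclic_set(n, N, L, F) ^ new) - delta
        if n < n_star:
            assert w == 2 * (n - 1) * F // (N * (N - 1))          # Lemma 3
        elif n_star >= N - L or n > N - L:
            assert w == 0                                          # Lemma 4
        else:
            assert w == 2 * (N - L - n + 1) * F // (N * (N - 1))   # Lemma 4
```

With N = 5, L = 3, F = 20, and n* = 5, the per-machine wastes are 0, 2, 4, 6, recovering the total of 12 tasks from Fig. 1(b).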
Theorem 2. The transition waste when Machine n* ∈ [N] leaves and the system transitions from a cyclic (N, L, F)-TAS S^N_cyc to a cyclic (N − 1, L, F)-TAS S^{N−1}_cyc (defined in (2)) is given as follows (assuming N > L + 1).
If n* < N − L, W_{n*}(S^N_cyc → S^{N−1}_cyc) is

  ((n* − 1)(n* − 2) + (N − L − n*)(N − L − n* + 1)) F/(N(N − 1)).

If n* ≥ N − L, W_{n*}(S^N_cyc → S^{N−1}_cyc) is

  (n* − 1)(n* − 2) F/(N(N − 1)).

Averaging n* over [N], the averaged transition waste when one machine leaves in the cyclic ETAS is

  W_avg(S^N_cyc → S^{N−1}_cyc) = ( (N − 2)/(3N) + (N − L − 1)(N − L)(N − L + 1)/(3N²(N − 1)) ) F.

Proof. If n* < N − L, by Lemma 3 and Lemma 4, we have

  W_{n*}(S^N_cyc → S^{N−1}_cyc) = Σ_{n=1}^{n*−1} 2(n − 1)F/(N(N − 1)) + Σ_{n=n*+1}^{N−L} 2(N − L − n + 1)F/(N(N − 1)) + Σ_{n=N−L+1}^{N} 0
  = ((n* − 1)(n* − 2) + (N − L − n*)(N − L − n* + 1)) F/(N(N − 1)).

Similarly, when n* ≥ N − L, we obtain

  W_{n*}(S^N_cyc → S^{N−1}_cyc) = Σ_{n=1}^{n*−1} 2(n − 1)F/(N(N − 1)) + Σ_{n=n*+1}^{N} 0 = (n* − 1)(n* − 2) F/(N(N − 1)).

Averaging W_{n*}(S^N_cyc → S^{N−1}_cyc) over all n* ∈ [N], we obtain the stated formula for W_avg(S^N_cyc → S^{N−1}_cyc). ∎

B. Shifted Cyclic Scheme Achieving Optimal Transition Waste
From Theorem 1 and Theorem 2, the transition waste incurred across all existing machines in the cyclic ETAS proposed in [19] is $\frac{N-1}{N+1}F \approx F$ or $\left(\frac{N-2}{3N} + \cdots\right)F = \Theta(F)$ tasks when a machine joins or leaves, respectively. In this section, we show that by applying a calculated shift, we can significantly reduce the transition waste of the cyclic ETAS. As mentioned earlier, the updated TAS used by the cyclic ETAS [19] (see Section II) depends only on the number of machines available and not on the current TAS, which is one reason for the scheme's poor transition waste. We now generalize the cyclic TAS to the shifted cyclic TAS in order to allow a more adaptive transition that takes the current TAS into account.

Definition 6 (Shifted cyclic task allocation). For $\delta \in [[F]]$, a $\delta$-shifted cyclic $(N, L, F)$-TAS is given as follows: $\mathcal{S}^N_{\delta\text{-cyc}} = (S^N_1, \ldots, S^N_N)$, where for $n \in [N]$,
$$S^N_n = \left[\frac{(n-1)F}{N} + \delta,\; \frac{(n-1)F}{N} + \frac{LF}{N} - 1 + \delta\right] \pmod{F}.$$

Note that there are $F$ different shifted TASs possible, corresponding to the $F$ different values of $\delta$. When $\delta = 0$, the shifted cyclic TAS reduces to an ordinary cyclic TAS (Section II). Given that the system transitions from a $\delta'$-shifted cyclic $(N, L, F)$-TAS to a $\delta$-shifted cyclic $(N', L, F)$-TAS, the question of interest is to determine the $\delta$ that leads to a minimum transition waste. We note here that the master machine can always exhaustively examine all $F$ possible shifted schemes and find the one with the smallest waste. However, this takes the master roughly $F \cdot \frac{LF}{N} = \frac{LF^2}{N}$ operations, which is time-consuming for large $F$. Our contribution is to derive an explicit formula for an optimal shift, which results in the minimum waste among all $F$ shifted schemes. We first tackle the case of one machine joining and then argue that the case of one machine leaving follows by symmetry.

Theorem 3.
The transition waste when transitioning from a $\delta'$-shifted cyclic $(N, L, F)$-TAS $\mathcal{S}^N_{\delta'\text{-cyc}}$ to a $\delta$-shifted cyclic $(N+1, L, F)$-TAS $\mathcal{S}^{N+1}_{\delta\text{-cyc}}$ with $\delta = \delta' + \left\lfloor\frac{N+L-1}{2}\right\rfloor\frac{F}{N(N+1)}$ is
$$W(\mathcal{S}^N_{\delta'\text{-cyc}} \to \mathcal{S}^{N+1}_{\delta\text{-cyc}}) = \begin{cases} \dfrac{(N-L-1)(N-L+1)F}{2N(N+1)}, & \text{for odd } N-L,\\[4pt] \dfrac{(N-L)^2 F}{2N(N+1)}, & \text{for even } N-L. \end{cases}$$

Before proving Theorem 3, we observe that the transition waste of the proposed shifted cyclic TAS improves over that of the ordinary cyclic TAS ([19]) by a considerable factor of approximately $\frac{2N^2}{(N-L)^2}$, which is 8X when $L \approx N/2$. The improvement becomes even more significant when $L$ gets closer to $N$, e.g., in the order of $N^2$ when $N-L$ is small.

Proof of Theorem 3.
Without loss of generality, we can always assume that $\delta' = 0$ and $\delta = \left\lfloor\frac{N+L-1}{2}\right\rfloor\frac{F}{N(N+1)}$. We provide a proof for the case when $N+L$ is odd, i.e., $\delta = \frac{N+L-1}{2}\cdot\frac{F}{N(N+1)}$, noting that we assume $N(N+1) \mid F$ (padding with dummy tasks if necessary). A proof for the case when $N+L$ is even can be done similarly.

With $\delta' = 0$ and $\delta = \frac{N+L-1}{2}\cdot\frac{F}{N(N+1)}$, we have $\mathcal{S}^N_{\delta'\text{-cyc}} = (S^N_1, \ldots, S^N_N)$ and $\mathcal{S}^{N+1}_{\delta\text{-cyc}} = (S^{N+1}_1, \ldots, S^{N+1}_{N+1})$, where for $n \in [N]$,
$$S^N_n = \left[\frac{(n-1)F}{N},\; \frac{(n-1)F}{N} + \frac{LF}{N} - 1\right] \pmod{F},$$
$$S^{N+1}_n = \left[\frac{(n-1)F}{N+1} + \frac{(N+L-1)F}{2N(N+1)},\; \frac{(n-1)F}{N+1} + \frac{LF}{N+1} - 1 + \frac{(N+L-1)F}{2N(N+1)}\right] \pmod{F}.$$

To compute the transition waste $W(S^N_n \to S^{N+1}_n)$ incurred at Machine $n \in [N]$, we consider the following three cases.

Case 1. $1 \leq n < \frac{N-L+1}{2}$. It can be easily verified that all conditions of Lemma 1 (a) are satisfied for $S \triangleq S^N_n = [a, b] \pmod{F}$ and $T \triangleq S^{N+1}_n = [c, d] \pmod{F}$. Therefore,
$$W(S^N_n \to S^{N+1}_n) = 2(c-a) - 2\Delta_{N,N+1} = \frac{(N+L+1-2n)F}{N(N+1)} - \frac{2LF}{N(N+1)} = \frac{(N-L+1-2n)F}{N(N+1)}.$$

Case 2. $\frac{N-L+1}{2} \leq n < \frac{N+L+1}{2}$. We can verify that all conditions of Lemma 1 (b) are satisfied for $S \triangleq S^N_n = [a, b] \pmod{F}$ and $T \triangleq S^{N+1}_n = [c, d] \pmod{F}$. Hence, $T \subset S$ and $W(S^N_n \to S^{N+1}_n) = 0$.

Case 3. $\frac{N+L+1}{2} \leq n \leq N$. We can verify that all conditions of Lemma 1 (a) are satisfied for $S \triangleq S^{N+1}_n = [a, b] \pmod{F}$ and $T \triangleq S^N_n = [c, d] \pmod{F}$. Therefore,
$$W(S^N_n \to S^{N+1}_n) = 2(c-a) + \Delta_{N,N+1} - \Delta_{N,N+1} = \frac{\big(2n-(N+L+1)\big)F}{N(N+1)}.$$

Thus, the waste when transitioning from $\mathcal{S}^N_{\delta'\text{-cyc}}$ to $\mathcal{S}^{N+1}_{\delta\text{-cyc}}$ is
$$W(\mathcal{S}^N_{\delta'\text{-cyc}} \to \mathcal{S}^{N+1}_{\delta\text{-cyc}}) = \frac{F}{N(N+1)}\left(\sum_{n=1}^{\frac{N-L-1}{2}} (N-L+1-2n) + \sum_{n=\frac{N-L+1}{2}}^{\frac{N+L-1}{2}} 0 + \sum_{n=\frac{N+L+1}{2}}^{N} \big(2n-(N+L+1)\big)\right) = \frac{(N-L-1)(N-L+1)F}{2N(N+1)}.$$
This completes the proof.
$\blacksquare$
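Both Theorem 3 and the exhaustive-search remark preceding it can be checked numerically for a small case. The sketch below is our own illustration (the helper names are hypothetical, not from the paper): it computes the joining waste of every shift $\delta$ and confirms that the shift of Theorem 3 attains the predicted minimum.

```python
def shifted_cyclic_tas(N, L, F, shift):
    # delta-shifted cyclic (N, L, F)-TAS of Definition 6
    return [{((n - 1) * F // N + shift + i) % F for i in range(L * F // N)}
            for n in range(1, N + 1)]

def joining_waste(N, L, F, shift):
    # total waste over the N existing machines when Machine N+1 joins
    old = shifted_cyclic_tas(N, L, F, 0)          # delta' = 0
    new = shifted_cyclic_tas(N + 1, L, F, shift)
    unavoidable = L * F // (N * (N + 1))          # Delta_{N,N+1}
    return sum(len(s ^ t) - unavoidable for s, t in zip(old, new))

N, L, F = 5, 2, 30                                # N(N+1) divides F
best = (N + L - 1) // 2 * F // (N * (N + 1))      # optimal shift of Theorem 3
# N - L = 3 is odd, so the predicted minimum waste is (N-L-1)(N-L+1)F / (2N(N+1))
assert joining_waste(N, L, F, best) == (N - L - 1) * (N - L + 1) * F // (2 * N * (N + 1))
# exhaustive search over all F shifts confirms that no shift does better
assert min(joining_waste(N, L, F, d) for d in range(F)) == joining_waste(N, L, F, best)
```

For $N = 5$, $L = 2$, $F = 30$, the optimal shift is $\delta = 3$ with a waste of $4$ tasks, versus $20$ tasks for the unshifted transition ($\delta = 0$).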
Theorem 4.
The transition waste when transitioning from a $\delta'$-shifted cyclic $(N, L, F)$-TAS $\mathcal{S}^N_{\delta'\text{-cyc}}$ to a $\delta$-shifted cyclic $(N-1, L, F)$-TAS $\mathcal{S}^{N-1}_{\delta\text{-cyc}}$ with $\delta = \delta' + \left((N-n^*) - \left\lfloor\frac{N+L-2}{2}\right\rfloor\right)\frac{F}{N(N-1)}$, where Machine $n^*$ leaves, is
$$W(\mathcal{S}^N_{\delta'\text{-cyc}} \to \mathcal{S}^{N-1}_{\delta\text{-cyc}}) = \begin{cases} \dfrac{(N-L-1)^2 F}{2N(N-1)}, & \text{for odd } N-L,\\[4pt] \dfrac{(N-L)(N-L-2)F}{2N(N-1)}, & \text{for even } N-L. \end{cases}$$

Proof.
The proof works by symmetry. By treating Machine $n^*$, which leaves, as the machine that joins the system in Theorem 3 and replacing $N$ by $N-1$, we obtain the claimed formula for the transition waste. Note that because the task sets can be cyclically shifted along the circle of integers mod $F$, the index of the machine that leaves does not matter. This phenomenon, however, does not apply to the ordinary cyclic ETAS. $\blacksquare$

Although we are able to show the optimality of our shifted cyclic ETASs only within a certain range of $\delta$, we believe the optimality holds for every $\delta$, which is supported by an exhaustive search over small values of $L$ and $N$.

Theorem 5.
The transition wastes stated in Theorem 3 and Theorem 4 are optimal among all choices of $\delta$-shifted cyclic TASs where $\frac{F}{N(N+1)}$ divides $\delta - \delta'$ and $\frac{F}{N(N-1)}$ divides $\delta' - \delta + (N-n^*)\frac{F}{N(N-1)}$, respectively.

Proof. By symmetry, we just need to prove this for the case of machines joining. We first derive a formula of the transition waste for every $\delta$ and then show that it is minimized within the specified range of $\delta$. See Appendix VI-B for more details. $\blacksquare$

IV. ZERO-WASTE ELASTIC TASK ALLOCATION SCHEMES
The shifted cyclic ETAS developed in Section III-B is easy to implement and has a negligible computation overhead at the master machine: to coordinate a transition, the master just needs to inform each machine of its updated index, the number of active machines, and the amount of shift required. However, in order to maintain the cyclic structure, the transitions incur a nontrivial transition waste, which can be linear in $F$, the maximum number of tasks each machine can take. This may significantly increase the computation overhead at each machine because a large number of completed tasks can potentially be abandoned. Moreover, high transition waste also means that more new tasks than necessary must be downloaded if each machine does not already store all the tasks from the beginning, which leads to higher communication overhead. This drawback of the (shifted) cyclic ETAS motivated us to investigate elastic task allocation schemes with zero transition waste. Our key findings include a necessary and sufficient condition, based on the famous Hall's marriage theorem, for the existence of a zero-waste transition from an $(N, L, F)$-TAS to an $(N', L, F)$-TAS, and a construction of zero-waste ETASs based on finite geometry.

A. Zero-Waste Transition When One Machine Joins
By Lemma 2, the transition waste incurred at Machine $n$ when transitioning from the set of tasks $S^N_n$ to another set $S^{N'}_n$ is zero if and only if $S^N_n \subset S^{N'}_n$ or vice versa. It turns out that if the elastic events consist only of machines joining, then it is easy to achieve zero-waste transitions.

Proposition 1.
There always exists a zero-waste transition from an $(N, L, F)$-TAS to an $(N+1, L, F)$-TAS.

Proof. To achieve a zero-waste transition when Machine $N+1$ joins, each existing machine (from $1$ to $N$) can simply choose a subset of $\frac{LF}{N(N+1)}$ tasks to pass to Machine $N+1$, which will then have in total $N \cdot \frac{LF}{N(N+1)} = \frac{LF}{N+1}$ tasks. The requirement is to have these $N$ sets disjoint. We can achieve this by letting each machine $n$ from $1$ to $N$ choose an arbitrary subset of $S^N_n$ of size $\frac{LF}{N(N+1)}$ that does not intersect any of the sets chosen by previous machines so far. This is always possible because Machine $n$ has enough tasks in its set to do the selection:
$$|S^N_n| = \frac{LF}{N} \geq (n-1)\frac{LF}{N(N+1)} + \frac{LF}{N(N+1)}.$$
This completes the proof. $\blacksquare$
Note that this proposition is a stand-alone result and willnot be used in the rest of the paper.
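The greedy argument in the proof of Proposition 1 translates directly into code. The sketch below is our own illustration (the function name `machine_joins` is hypothetical): each machine hands over tasks it still owns and that no earlier machine has handed over, and the survivors therefore only shrink, which is exactly the zero-waste condition of Lemma 2.

```python
from collections import Counter

def machine_joins(tas, L, F):
    # Zero-waste transition of Proposition 1: each of the N existing machines
    # passes LF/(N(N+1)) of its tasks to the newcomer, all passed sets disjoint.
    N = len(tas)
    give = L * F // (N * (N + 1))
    handed_over = set()
    new_tas = []
    for s in tas:
        passed = sorted(s - handed_over)[:give]   # any `give` not-yet-passed tasks
        assert len(passed) == give                # guaranteed by the counting argument
        handed_over |= set(passed)
        new_tas.append(s - set(passed))
    new_tas.append(handed_over)                   # the newcomer's task set
    return new_tas

# sanity check on a cyclic (4, 2, 20)-TAS
F, N, L = 20, 4, 2
tas = [{((n - 1) * F // N + i) % F for i in range(L * F // N)} for n in range(1, N + 1)]
new = machine_joins(tas, L, F)
assert all(t <= s for s, t in zip(tas, new))      # survivors only shrink: zero waste
assert len(new[N]) == L * F // (N + 1)            # newcomer gets LF/(N+1) tasks
assert all(c == L for c in Counter(x for s in new for x in s).values())
```

Any rule for picking the passed tasks works; sorting is used here only to make the example deterministic.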
B. Zero-Waste Transition When One Machine Leaves
The case of one machine leaving, say Machine $n^*$, is more interesting. Note that to achieve a zero-waste transition, due to Lemma 2, it is necessary and sufficient to let the other machines keep their current sets of tasks while reallocating the tasks from the leaving machine to them (so that $S^N_n \subset S^{N-1}_n$). Reallocating one task from Machine $n^*$ to a machine $n$ corresponds to selecting one edge in the transition graph (Definition 7 below). We will see later that reallocating all tasks turns out to correspond to a "matching" in that graph (Lemma 5).

Definition 7.
Given an $(N, L, F)$-TAS $\mathcal{S}^N = (S^N_1, \ldots, S^N_N)$, the transition graph $G_{n^*}$ is the bipartite graph with vertex set $U_{n^*} \cup V_{n^*}$, where $U_{n^*} = [N] \setminus \{n^*\}$ and $V_{n^*} = S^N_{n^*}$, and there is an edge $(u, v)$, $u \in U_{n^*}$, $v \in V_{n^*}$, if and only if $v \in \overline{S}^N_u \triangleq [[F]] \setminus S^N_u$.

Note that the set $V_{n^*}$ of the transition graph represents the tasks from the leaving machine $n^*$ that need to be reallocated to the other machines, while an edge $(u, v)$ implies that the task $v \in V_{n^*}$ can be taken over by Machine $u$, i.e., this machine was not allocated this task before the transition. An example of such a graph is given below.

Fig. 3: Illustration of the transition graph $G_5$ in Example 3, with $U$ representing the available machines and $V$ the tasks of the leaving machine that need to be reallocated. An edge $(u, v)$ means the task $v$ from the leaving machine can be taken over by Machine $u$ because Machine $u$ was not allocated this task before the transition.

Example 3.
When $N = 5$, $L = 3$, and $F = 20$, we consider the $(N, L, F)$-TAS given in Fig. 1 (a), $\mathcal{S}^5 = (S^5_1, \ldots, S^5_5)$, where $S^5_1 = \{0, \ldots, 11\}$, $S^5_2 = \{4, \ldots, 15\}$, $S^5_3 = \{8, \ldots, 19\}$, $S^5_4 = \{12, \ldots, 19, 0, \ldots, 3\}$, and $S^5_5 = \{16, \ldots, 19, 0, \ldots, 7\}$. Suppose that Machine 5 leaves, i.e., $n^* = 5$. Then the transition graph $G_5$ is illustrated in Fig. 3.

A subset $M$ of edges of a bipartite graph $G$ with vertex set $(U, V)$ is referred to as a perfect $\Delta$-matching of $G$ if each vertex in $V$ is incident to precisely one edge in $M$ while each vertex in $U$ is incident to precisely $\Delta$ edges in $M$.

Lemma 5.
There exists a zero-waste transition from an $(N, L, F)$-TAS $\mathcal{S}^N$ to an $(N-1, L, F)$-TAS $\mathcal{S}^{N-1}$ when Machine $n^*$ leaves if and only if the transition graph $G_{n^*}$ admits a perfect $\Delta_{N,N-1}$-matching.

Proof. Recall that due to Lemma 2, the transition has zero waste if and only if $S^N_n \subset S^{N-1}_n$ for every $n \in [N] \setminus \{n^*\}$. This means that we need to reallocate the tasks left over by Machine $n^*$ to the other $N-1$ machines by adding these new tasks to the existing task sets of these machines. It is evident that a way to reallocate the $\frac{LF}{N}$ tasks from Machine $n^*$ to the $N-1$ other machines corresponds precisely to a perfect $\Delta_{N,N-1}$-matching of the transition graph $G_{n^*}$: each task, which corresponds to a vertex $v \in V_{n^*}$, is reallocated to exactly one machine, which corresponds to a vertex $u \in U_{n^*}$; moreover, each machine is allocated precisely $\Delta_{N,N-1} = \frac{LF}{N(N-1)}$ new tasks, which shows that each vertex $u$ is incident to precisely $\Delta_{N,N-1}$ edges while each vertex $v$ is incident to exactly one edge in the matching. $\blacksquare$

For instance, the zero-waste transition presented in Fig. 1 (a)(c) corresponds to a perfect $3$-matching $M$ of $G_5$ (thicker edges in Fig. 3). Based on this matching, each of Machines 1, 2, 3, and 4 is allocated three new tasks from the leaving Machine 5. Furthermore, every task from Machine 5, i.e., every task in $\{16, 17, 18, 19, 0, 1, \ldots, 7\}$, is reallocated to exactly one machine.

The following lemma is a straightforward corollary of Hall's marriage theorem.

Lemma 6.
A bipartite graph $G$ with vertex set $U \cup V$ has a perfect $\Delta_{N,N-1}$-matching if and only if the inequality
$$\left|\bigcup_{n \in J} \Gamma_G(n)\right| \geq |J| \Delta_{N,N-1} \qquad (4)$$
holds for every nonempty set $J \subseteq U$, where $\Gamma_G(n)$ denotes the set of neighbors of $n$ in $G$.

Proof. The celebrated Hall's marriage theorem [23] states that a bipartite graph $G$ with vertex set $(U, V)$ has a perfect matching (or, a perfect $1$-matching, in our notation) if and only if for every nonempty set $J \subseteq U$, it holds that $|\cup_{n \in J} \Gamma_G(n)| \geq |J|$, where $\Gamma_G(n)$ denotes the set of neighbors of $n$ in $G$. By duplicating each vertex of $U$ and its incident edges $\Delta_{N,N-1}$ times and applying Hall's theorem to the resulting bipartite graph, we deduce that $G$ has a perfect $\Delta_{N,N-1}$-matching if and only if (4) holds for every nonempty set $J \subseteq U$. $\blacksquare$

As a corollary of Lemma 5 and Lemma 6, we obtain a necessary and sufficient condition for the existence of a zero-waste transition when one particular machine leaves.
Corollary 2.
There exists a zero-waste transition from an $(N, L, F)$-TAS $\mathcal{S}^N$ to an $(N-1, L, F)$-TAS $\mathcal{S}^{N-1}$ when Machine $n^*$ leaves if and only if the inequality
$$\left|\left(\bigcup_{n \in J} \overline{S}^N_n\right) \cap S^N_{n^*}\right| \geq |J| \Delta_{N,N-1} \qquad (5)$$
holds for every nonempty set $J \subseteq [N] \setminus \{n^*\}$.

Proof. The conclusion is straightforward from Lemma 5 and Lemma 6 and the following observation: by the definition of the transition graph $G_{n^*}$, the set of neighbors of a vertex $n \in U_{n^*} = [N] \setminus \{n^*\}$ in $G_{n^*}$ is $\Gamma_{G_{n^*}}(n) = \overline{S}^N_n \cap S^N_{n^*}$. $\blacksquare$

Theorem 6 provides a necessary and sufficient condition for the existence of a zero-waste transition from an $(N, L, F)$-TAS to an $(N-1, L, F)$-TAS no matter which machine leaves. Essentially, it states that as long as the sets of tasks of different machines do not overlap too much, there exists a zero-waste transition. Recall that $\Delta_{N,N-1} = \frac{LF}{N(N-1)}$.

Theorem 6.
There exists a zero-waste transition from an $(N, L, F)$-TAS $\mathcal{S}^N = (S^N_1, \ldots, S^N_N)$ to an $(N-1, L, F)$-TAS when Machine $n^*$ leaves, for every $n^* \in [N]$, if and only if
$$\left|\bigcap_{n \in I} S^N_n\right| \leq (N - |I|)\Delta_{N,N-1} \qquad (6)$$
for every nonempty set $I \subseteq [N]$. Moreover, such a transition can be found in time $O\big((N-1+\frac{LF}{N})(N-1)F(1-\frac{L}{N})\big)$.

Proof. Let $G_{n^*}$ be the transition graph of an $(N, L, F)$-TAS $\mathcal{S}^N$ with vertex set $(U_{n^*}, V_{n^*})$ (Definition 7). By Corollary 2, it suffices to show that the inequality (5) holds for every nonempty set $J \subseteq [N] \setminus \{n^*\}$ and for every $n^* \in [N]$ if and only if (6) holds for every nonempty set $I \subseteq [N]$. Suppose that (5) holds as stated. Note that
$$\left|\left(\bigcup_{n \in J} \overline{S}^N_n\right) \cap S^N_{n^*}\right| = \left|S^N_{n^*} \setminus \bigcap_{n \in J} S^N_n\right| = |S^N_{n^*}| - \left|\bigcap_{n \in J \cup \{n^*\}} S^N_n\right|.$$
Therefore, (5) is equivalent to
$$\left|\bigcap_{n \in J \cup \{n^*\}} S^N_n\right| \leq |S^N_{n^*}| - |J| \Delta_{N,N-1}.$$
Setting $I = J \cup \{n^*\}$, this is also equivalent to
$$\left|\bigcap_{n \in I} S^N_n\right| \leq |S^N_{n^*}| - (|I|-1)\Delta_{N,N-1} = (N-1)\Delta_{N,N-1} - (|I|-1)\Delta_{N,N-1} = (N - |I|)\Delta_{N,N-1}.$$
Note that as $n^*$ varies over $[N]$ and $J$ varies over all nonempty subsets of $[N] \setminus \{n^*\}$, $I = J \cup \{n^*\}$ varies over all subsets of $[N]$ of size at least two. Furthermore, (6) holds trivially (with equality) when $|I| = 1$. Therefore, (6) holds for all nonempty sets $I \subseteq [N]$. Hence, we settle the only-if direction. As all steps are equivalent transformations, the if direction is also true. The complexity of finding a zero-waste transition comes from that of a network flow algorithm [24] employed to find a perfect matching for $G_{n^*}$. This completes the proof. $\blacksquare$

Theorem 6 provides us with an important insight: to make transitions with zero waste possible, we should assign to the machines sets of tasks with small overlaps. This will be crucial in our construction of an ETAS with zero transition waste in the next section.
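For intuition, both the matching of Lemma 5 and the condition (6) of Theorem 6 can be checked on Example 3 ($N = 5$, $L = 3$, $F = 20$, Machine 5 leaving). The sketch below is our own illustration, not the authors' code: it uses a simple augmenting-path (Kuhn-style) matching with machine capacities rather than the network flow algorithm of [24], and all helper names are hypothetical.

```python
from itertools import combinations

def cyclic_tas(N, L, F):
    return [{((n - 1) * F // N + i) % F for i in range(L * F // N)}
            for n in range(1, N + 1)]

N, L, F = 5, 3, 20
tas = cyclic_tas(N, L, F)
delta = L * F // (N * (N - 1))       # Delta_{5,4} = 3

# Hall-like condition (6): |intersection over I| <= (N - |I|) * delta
for size in range(1, N + 1):
    for I in combinations(range(N), size):
        assert len(set.intersection(*(tas[i] for i in I))) <= (N - size) * delta

# Perfect delta-matching on the transition graph G_5 (Machine 5 leaves).
n_star = 5
survivors = [n for n in range(1, N + 1) if n != n_star]
capacity = {u: delta for u in survivors}
match = {}                           # task -> machine taking it over

def augment(v, seen):
    # try to assign task v, re-routing previously assigned tasks if needed
    for u in survivors:
        if v not in tas[u - 1] and u not in seen:   # edge (u, v) of G_{n*}
            seen.add(u)
            if capacity[u] > 0:
                capacity[u] -= 1
                match[v] = u
                return True
            if any(m == u and augment(w, seen) for w, m in list(match.items())):
                match[v] = u         # u freed one slot by re-routing task w
                return True
    return False

for v in tas[n_star - 1]:
    assert augment(v, set())         # Lemmas 5 and 6 guarantee success here
assert set(match) == tas[n_star - 1]
assert all(sum(1 for m in match.values() if m == u) == delta for u in survivors)
```

Each of the 12 tasks of Machine 5 ends up on exactly one survivor, and each survivor takes exactly $\Delta_{5,4} = 3$ new tasks, i.e., a perfect $3$-matching as in Fig. 3.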
C. A Zero-Waste Elastic Task Allocation Scheme
So far we have discussed the case of a single machine leaving or joining. The more challenging question is how to allow a (possibly infinite) chain of such elastic events while guaranteeing zero-waste transitions. More specifically, we are interested in establishing a zero-waste range $[N_{\min}, N_{\max}] \subset [L, F]$ where the system can start with any number $N$ of machines, $N \in [N_{\min}, N_{\max}]$, and then can transition with zero waste an arbitrary number of times within this range, one machine leaving or joining at a time. We show the existence of a handful of such ranges in Theorem 7 and Corollary 3. We first need a formal definition of a zero-waste range.

Definition 8 (Zero-waste range). Given $L$ and $F$, a range $[N_{\min}, N_{\max}]$, where $L \leq N_{\min} \leq N_{\max} \leq F$, is called an $(L, F)$-zero-waste range ($(L, F)$-ZWR) if for every $N \in [N_{\min}, N_{\max}]$ there exists an $(N, L, F)$-ETAS $(\mathcal{S}^N, T)$ (see Definition 2) where the transition algorithm $T$ incurs zero waste whenever the transition is within the range $[N_{\min}, N_{\max}]$.

Note that $N_{\min}$ and $N_{\max}$ are usually functions of $L$ and $F$. Also, the transition algorithm $T$ mentioned in Definitions 2 and 8 can be applied repeatedly to enable a chain of transitions between $N_{\min}$ and $N_{\max}$ machines, although only by adding or removing one machine at a time. It turns out that if we can construct an $(N, L, F)$-ETAS $(\mathcal{S}^N, T)$ so that $T$ incurs zero transition waste within $[N_{\min}, N_{\max}]$ for some $N \in [N_{\min}, N_{\max}]$, then we can also construct an $(N', L, F)$-ETAS satisfying the same property for every $N' \in [N_{\min}, N_{\max}]$, i.e., $[N_{\min}, N_{\max}]$ is an $(L, F)$-ZWR. In particular, we show that this claim is true when $N = N_{\max}$.

Lemma 7.
If there exists an $(N_{\max}, L, F)$-ETAS $(\mathcal{S}^{N_{\max}}, T)$ so that $T$ always incurs zero transition waste for every possible chain of $N_{\max} - N_{\min}$ transitions from $N_{\max}$ down to $N_{\min}$ machines (machines leaving only), then $[N_{\min}, N_{\max}]$ is an $(L, F)$-ZWR.

Before proving this lemma, we need the concept of a transition tree, which keeps track of all the possible states the system can be in and the transitions leading to them from the original state, where a state consists of the list of available machines and the corresponding TAS. The transition tree is, in fact, an explicit way to represent an ETAS.

Fig. 4: Illustration of a transition tree when $N_{\min} = 1$ and $N_{\max} = 3$. The set of available machines is given at each node (we omit the TAS associated with each node); the root $\{1,2,3\}$ has children $\{1,2\}$, $\{1,3\}$, $\{2,3\}$, each of which has singleton children. Moving down the tree corresponds to machines leaving, and moving up to machines joining.

Definition 9 (Transition tree). Given an $(N_{\max}, L, F)$-ETAS $(\mathcal{S}^{N_{\max}}, T)$ satisfying the assumption of Lemma 7, the corresponding transition tree $\mathcal{T}$ is a rooted tree created as follows. The root node of the tree consists of the set $[N_{\max}]$ and the corresponding $(N_{\max}, L, F)$-TAS. Other nodes are created in a recursive manner. Suppose that a node $u$ is already created that consists of a set of indices $\mathcal{I}$ and an $(|\mathcal{I}|, L, F)$-TAS. If $|\mathcal{I}| > N_{\min}$, the $|\mathcal{I}|$ child nodes of $u$ are created as follows. Each child node $v$ corresponds to the removal of one machine indexed by $n^* \in \mathcal{I}$ and consists of the list $\mathcal{J} \triangleq \mathcal{I} \setminus \{n^*\}$ and a $(|\mathcal{J}|, L, F)$-TAS obtained by applying the transition algorithm $T$ to the $(|\mathcal{I}|, L, F)$-TAS of $u$.

For instance, when $N_{\max} = 3$ and $N_{\min} = L = 1$, we have the transition tree illustrated in Fig. 4.

Proof of Lemma 7.
Based on the transition tree, it is easy to see that once the system can start from an $(N_{\max}, L, F)$-TAS and transition with zero waste down to an $(N_{\min}, L, F)$-TAS in all possible ways, then we can also start from any intermediate $(N, L, F)$-TAS, $N \in [N_{\min}, N_{\max}]$, and transition with zero waste within this range. Indeed, if one machine leaves and the system is currently at a state corresponding to a node in the tree, then it can transition to a child node depending on which machine is leaving. Vice versa, if one machine joins, the system can transition to the state stored at the parent node. $\blacksquare$

Remark 2 (Overhead incurred by the transition tree). As shown in the proof of Lemma 7, the transition tree is used to keep track of all zero-waste transitions possible within the range $[N_{\min}, N_{\max}]$. The entire tree can be created once by the master machine before the computation session starts or can be created on the fly. The tree has height $N_{\max} - N_{\min}$ and a total of $\sum_{h=1}^{N_{\max}-N_{\min}} \prod_{i=0}^{h-1} (N_{\max} - i)$ nodes, which is in the order of $N_{\max}!$. To create a child node, an algorithm such as the network flow algorithm is invoked to find the zero-waste transition (however, the computation required becomes lighter as it gets closer to the leaves). The creation and storage of the transition tree incur significant storage and computation overheads at the master node, and therefore, using the tree is beneficial when we have relatively small $N$ and $F$ and intensive tasks, so that having zero transition waste pays off. Maintaining a zero-waste ETAS with lower overheads remains an open question for future research.

Based on Lemma 7, we now describe our construction of $(L, F)$-ZWRs based on the so-called symmetric configurations from combinatorial designs.

Definition 10 (Configuration [25]).
A $(v, b, k, r)$-configuration is an incidence structure of $v$ points and $b$ lines such that
• each line contains $k$ points,
• each point lies on $r$ lines, and
• two different points are connected by at most one line.
If $v = b$ and, hence, $r = k$, the configuration is symmetric, denoted by $(v, k)$-configuration.

The famous Fano plane is a $(7, 3)$-configuration with seven points $\{1, 2, \ldots, 7\}$ and seven lines: $\{1,2,3\}$, $\{1,4,5\}$, $\{1,6,7\}$, $\{2,4,6\}$, $\{2,5,7\}$, $\{3,4,7\}$, and $\{3,5,6\}$ (Fig. 5).

Fig. 5: A Fano plane with seven points and seven lines.

We first show that an $(N_{\max}, L)$-configuration can be used to construct an $(N_{\max}, L, F)$-TAS with small pairwise overlaps and then present a method to establish an $[N_{\min}, N_{\max}]$ zero-waste range from such a TAS. Essentially, points correspond to tasks while lines correspond to sets of tasks. As there are $N_{\max}$ points and $F$ tasks, it is natural to associate each point with $F/N_{\max}$ tasks.
Construction 1.
Suppose that $N_{\max}$ divides $F$ and $\mathcal{B} = \{B_1, \ldots, B_{N_{\max}}\}$ is the set of $N_{\max}$ lines of an $(N_{\max}, L)$-configuration. An $(N_{\max}, L, F)$-TAS $\mathcal{S}^{N_{\max}}$ can be constructed as follows. First, partition $[[F]] = \{0, \ldots, F-1\}$ into $N_{\max}$ equal-sized parts $F_1, \ldots, F_{N_{\max}}$. Then for each $n \in [N_{\max}]$ we assign to Machine $n$ the tasks indexed by the parts $F_p$ corresponding to all points $p$ on the line $B_n$. In other words, we set
$$S^{N_{\max}}_n := \bigcup_{p \in B_n} F_p, \quad \text{for every } n \in [N_{\max}].$$

For instance, when there are $N_{\max} = 7$ machines, $L = 3$, and $F = 14$ tasks, we first partition $[[F]]$ into seven parts:
$$F_1 = \{0, 1\},\; F_2 = \{2, 3\},\; F_3 = \{4, 5\},\; F_4 = \{6, 7\},\; F_5 = \{8, 9\},\; F_6 = \{10, 11\},\; F_7 = \{12, 13\}.$$
Then, using the $(7, 3)$-configuration (the Fano plane) in Construction 1, we obtain a $(7, 3, 14)$-TAS, represented by Fig. 6. For instance, Machine 1 is allocated the task set $S_1 = \{0, \ldots, 5\} = F_1 \cup F_2 \cup F_3$, while Machine 2 has the task set $S_2 = \{0, 1, 6, 7, 8, 9\} = F_1 \cup F_4 \cup F_5$. It is easy to verify that each task is performed by $L = 3$ machines and each machine performs $LF/N_{\max} = 6$ tasks.

Since every two lines in a configuration intersect in at most one point, the resulting TAS also has small pairwise intersections, which is crucial for our construction of a zero-waste range.
        S_1    S_2    S_3    S_4    S_5    S_6    S_7
  1     0-1    0-1    0-1     -      -      -      -
  2     2-3     -      -     2-3    2-3     -      -
  3     4-5     -      -      -      -     4-5    4-5
  4      -     6-7     -     6-7     -     6-7     -
  5      -     8-9     -      -     8-9     -     8-9
  6      -      -    10-11  10-11    -      -    10-11
  7      -      -    12-13    -    12-13  12-13    -

Fig. 6: A $(7, 3, 14)$-TAS constructed from the Fano plane. The rows and columns of the table correspond to the seven points and seven lines of the plane. Here, we use $a$-$b$ to denote the set $\{a, a+1, \ldots, b\} \pmod{14}$.
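Construction 1 is easy to verify in code. The sketch below is our own illustration (not the authors' code): it instantiates the construction with the standard Fano-plane labeling used in the example above and checks the machine loads, the cover count, and the small pairwise overlaps.

```python
from collections import Counter
from itertools import combinations

FANO_LINES = [{1, 2, 3}, {1, 4, 5}, {1, 6, 7}, {2, 4, 6},
              {2, 5, 7}, {3, 4, 7}, {3, 5, 6}]          # a (7, 3)-configuration
N_max, L, F = 7, 3, 14
# part F_p of the partition of [[F]]: the F/N_max tasks associated with point p
part = {p: {F // N_max * (p - 1) + i for i in range(F // N_max)}
        for p in range(1, N_max + 1)}
tas = [set().union(*(part[p] for p in line)) for line in FANO_LINES]

assert all(len(s) == L * F // N_max for s in tas)        # each machine: 6 tasks
counts = Counter(t for s in tas for t in s)
assert all(counts[t] == L for t in range(F))             # each task: 3 machines
# pairwise intersections have size at most F/N_max = 2 (as two lines of the
# configuration share at most one point)
assert all(len(s & t) <= F // N_max for s, t in combinations(tas, 2))
```

The same code accepts the line set of any symmetric configuration in place of `FANO_LINES`.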
Construction 1 produces an $(N_{\max}, L, F)$-TAS in which every two task sets intersect in at most $F/N_{\max}$ tasks.

Proof.
According to Construction 1, each task set has size
$$|S^{N_{\max}}_n| = |B_n| \cdot \frac{F}{N_{\max}} = \frac{LF}{N_{\max}}.$$
Moreover, as each point $p$ in the configuration belongs to exactly $L$ lines, each task also belongs to precisely $L$ task sets. Hence, the resulting $\mathcal{S}^{N_{\max}}$ is indeed an $(N_{\max}, L, F)$-TAS. Furthermore, since every two lines in the configuration intersect in at most one point, every two task sets $S^{N_{\max}}_n$ and $S^{N_{\max}}_{n'}$, $n \neq n'$, intersect in at most $F/N_{\max}$ tasks, as claimed. $\blacksquare$
Note that the expected cardinality of the intersection of two random subsets of $[[F]]$ of cardinality $LF/N_{\max}$ is $\left(\frac{L}{N_{\max}}\right)^2 F$, which is approximately $F/N_{\max}$ for $L \approx \sqrt{N_{\max}}$. Therefore, $F/N_{\max}$ is indeed the lowest pairwise intersection size that we could expect for this parameter range.

By Lemma 8, Construction 1 produces an initial $(N_{\max}, L, F)$-TAS with small pairwise set overlaps. To show that $R$ machines can be removed one by one from this TAS with zero transition waste, we first show that the pairwise intersections of the sets of the intermediate TASs do not increase too much. Then, by using the pairwise intersection as an upper bound on the intersection of any collection $I$ of task sets, $|I| \leq L$, we can guarantee that the intersections still satisfy the Hall-like condition in Theorem 6. As a consequence, zero-waste transitions are possible within the range $[N_{\max} - R, N_{\max}]$.

Theorem 7.
If there exists an $(N_{\max}, L)$-configuration, then there exists an $(N_{\max}, L, F)$-TAS $\mathcal{S}^{N_{\max}} = (S^{N_{\max}}_1, \ldots, S^{N_{\max}}_{N_{\max}})$ where
$$|S^{N_{\max}}_n \cap S^{N_{\max}}_{n'}| \leq \frac{F}{N_{\max}}, \quad \text{for every } n, n' \in [N_{\max}],\; n \neq n'.$$
This leads to the existence of an $(L, F)$-zero-waste range $[N_{\max} - R, N_{\max}]$ where
$$R = 1 + \left\lfloor\frac{(3LN_{\max} - 2N_{\max} - 2L + 1) - \sqrt{\Delta}}{4L - 2}\right\rfloor \qquad (7)$$
and
$$\Delta = LN_{\max}(LN_{\max} + 8L^2 - 16L + 6) + (2L - 1)^2. \qquad (8)$$
We assume here that $N \mid F$ for every $N \in [N_{\min}, N_{\max}]$.
The first statement is due to Lemma 8. We now prove the second statement, assuming that there exists an $(N_{\max}, L, F)$-TAS as specified. Thanks to Lemma 7, it suffices to show that for every $0 \leq r < R$, after removing any $r$ machines one after another, the resulting $(N_{\max}-r, L, F)$-TAS still admits a zero-waste transition when one more machine leaves. Equivalently, we aim to show that this TAS satisfies the Hall-like condition (6).

Suppose that $r < R$ machines have been removed with $r$ zero-waste transitions and $\mathcal{S}^{N_{\max}-r} = (S^{N_{\max}-r}_1, \ldots, S^{N_{\max}-r}_{N_{\max}-r})$ is the resulting $(N_{\max}-r, L, F)$-TAS. Let $I$ be a nonempty subset of indices of $|I|$ machines among the remaining ones. Note that when $|I| = 1$ or $|I| > L$, the inequality (6) is trivially satisfied. Indeed, when $|I| = 1$, equality is achieved. When $|I| > L$, as each task cannot belong to more than $L$ task sets, the intersection of the $|I|$ task sets is empty and hence (6) holds trivially. We henceforth assume $2 \leq |I| \leq L$. Suppose $n, n' \in I$, $n \neq n'$. Note that whenever there is a zero-waste transition from an $(N, L, F)$-TAS to an $(N-1, L, F)$-TAS, each machine keeps its current task set and also takes $\Delta_{N,N-1}$ extra tasks. Hence, the intersection of a pair of task sets increases by at most $2\Delta_{N,N-1}$ tasks. Therefore,
$$\left|\bigcap_{i \in I} S^{N_{\max}-r}_i\right| \leq |S^{N_{\max}-r}_n \cap S^{N_{\max}-r}_{n'}| \leq |S^{N_{\max}}_n \cap S^{N_{\max}}_{n'}| + \sum_{j=0}^{r-1} 2\Delta_{N_{\max}-j,\, N_{\max}-j-1}$$
$$= |S^{N_{\max}}_n \cap S^{N_{\max}}_{n'}| + 2\sum_{j=0}^{r-1} \left(\frac{LF}{N_{\max}-j-1} - \frac{LF}{N_{\max}-j}\right) \leq \frac{F}{N_{\max}} + 2\left(\frac{LF}{N_{\max}-r} - \frac{LF}{N_{\max}}\right) = \frac{F}{N_{\max}} + \frac{2LFr}{(N_{\max}-r)N_{\max}}.$$
Therefore, in order to show that (6) holds for the $(N_{\max}-r, L, F)$-TAS $\mathcal{S}^{N_{\max}-r}$, that is,
$$\left|\bigcap_{i \in I} S^{N_{\max}-r}_i\right| \leq (N_{\max} - r - |I|)\Delta_{N_{\max}-r,\, N_{\max}-r-1},$$
as we assume $|I| \leq L$, it suffices to show that
$$\frac{F}{N_{\max}} + \frac{2LFr}{(N_{\max}-r)N_{\max}} \leq (N_{\max} - r - L)\Delta_{N_{\max}-r,\, N_{\max}-r-1},$$
or equivalently,
$$\frac{F}{N_{\max}} + \frac{2LFr}{(N_{\max}-r)N_{\max}} \leq (N_{\max} - r - L)\frac{LF}{(N_{\max}-r)(N_{\max}-r-1)}. \qquad (9)$$
Simplifying (9), we obtain
$$(2L-1)r^2 - (3LN_{\max} - 2N_{\max} - 2L + 1)r + N_{\max}(LN_{\max} - N_{\max} - L^2 + 1) \geq 0. \qquad (10)$$
The left-hand side of (10) can be regarded as a quadratic polynomial in $r$, which has two positive roots
$$\frac{(3LN_{\max} - 2N_{\max} - 2L + 1) \pm \sqrt{\Delta}}{4L - 2},$$
where $\Delta$ is given as in (8). Note that $\Delta \geq 0$ for all $L \geq 1$ and $N_{\max} \geq 1$. Therefore, when
$$r \leq R - 1 = \left\lfloor\frac{(3LN_{\max} - 2N_{\max} - 2L + 1) - \sqrt{\Delta}}{4L - 2}\right\rfloor,$$
the left-hand side of (10) is non-negative, which implies that this inequality holds. Therefore, we have shown that for every $r < R$ defined as in (7), the inequality (6) holds for the $(N_{\max}-r, L, F)$-TAS in consideration. Hence, there is a zero-waste transition from this TAS to an $(N_{\max}-r-1, L, F)$-TAS. Thus, $[N_{\max} - R, N_{\max}]$ is an $(L, F)$-zero-waste range. $\blacksquare$
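The bound $R$ of (7)-(8) is straightforward to evaluate. A small sketch (our own illustration; the function name is hypothetical):

```python
import math

def zero_waste_R(n_max, L):
    # R of (7)-(8): [n_max - R, n_max] is an (L, F)-zero-waste range
    disc = L * n_max * (L * n_max + 8 * L * L - 16 * L + 6) + (2 * L - 1) ** 2
    num = (3 * L * n_max - 2 * n_max - 2 * L + 1) - math.sqrt(disc)
    return 1 + math.floor(num / (4 * L - 2))

# The Fano-plane TAS (L = 3, N_max = 7) tolerates R = 2 departures: range [5, 7].
assert zero_waste_R(7, 3) == 2
# With L = 4 and N_max = 13 (a (13, 4)-configuration): R = 4, range [9, 13].
assert zero_waste_R(13, 4) == 4
```

These two instances are exactly the explicit ranges obtained from Corollary 3 below.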
Corollary 3.
The following zero-waste ranges exist for all relevant $F$, that is, for $F$ divisible by $N(N-1)$ for every $N \in [N_{\min}+1, N_{\max}]$.
1) $L = 3$, $N_{\max} \geq 7$, $N_{\min} = N_{\max} - \left\lfloor\frac{7N_{\max} - 5 - \sqrt{\Delta}}{10}\right\rfloor - 1$, where $\Delta = 9N_{\max}^2 + 90N_{\max} + 25$.
2) $L = 4$, $N_{\max} \geq 13$, $N_{\min} = N_{\max} - \left\lfloor\frac{10N_{\max} - 7 - \sqrt{\Delta}}{14}\right\rfloor - 1$, where $\Delta = 16N_{\max}^2 + 280N_{\max} + 49$.
3) $L = q + 1$, $N_{\max} = q^2 + q + 1$, $N_{\min} = N_{\max} - \left\lfloor\frac{3q^3 + 4q^2 + 2q - \sqrt{\Delta}}{4q + 2}\right\rfloor - 1$, where $\Delta = q^6 + 12q^5 + 24q^4 + 24q^3 + 16q^2 + 4q$, for every prime power $q$.
4) $L = q$, $N_{\max} = q^2$, $N_{\min} = N_{\max} - \left\lfloor\frac{3q^3 - 2q^2 - 2q + 1 - \sqrt{\Delta}}{4q - 2}\right\rfloor - 1$, where $\Delta = q^6 + 8q^5 - 16q^4 + 6q^3 + 4q^2 - 4q + 1$, for every prime power $q$.
5) $L = q$, $N_{\max} = q^2 - 1$, $N_{\min} = N_{\max} - \left\lfloor\frac{3q^3 - 2q^2 - 5q + 3 - \sqrt{\Delta}}{4q - 2}\right\rfloor - 1$, where $\Delta = q^6 + 8q^5 - 18q^4 - 2q^3 + 21q^2 - 10q + 1$, for every prime power $q$.

Proof. Note that $(v, k)$-configurations exist for the following $v$ and $k$.
1) $k \in \{3, 4\}$ and $v \geq k(k-1) + 1$ (see [25]).
2) $k = q + 1$ and $v = q^2 + q + 1$ for any prime power $q$. Such a $(q^2 + q + 1, q + 1)$-configuration is also referred to as a finite projective plane. This gives us the Fano plane when $q = 2$. For this existence result and the following ones, see, e.g., [26, p. 2].
3) $k = q$ and $v = q^2$ for any prime power $q$. A $(q^2, q)$-configuration can be obtained from a $(q^2 + q + 1, q + 1)$-configuration by removing a point $P$ and all $q + 1$ lines containing $P$ without removing their points, and also removing one line containing $P$ together with all of its points.
4) $k = q$ and $v = q^2 - 1$ for any prime power $q$. A $(q^2 - 1, q)$-configuration can be obtained from a $(q^2 + q + 1, q + 1)$-configuration by removing a point $P$ and all $q + 1$ lines containing $P$ without removing their points, and also removing one line not containing $P$ together with all of its points.
Applying Theorem 7 to these configurations, setting $N_{\max} = v$ and $L = k$, we deduce the conclusions of the corollary. $\blacksquare$
Building on the work of Yang et al. [19] on coded elastic computing, we first propose a complete separation between the elastic task allocation scheme and the coded computing scheme. As a result, we have the freedom to design efficient elastic task allocation schemes as combinatorial objects independent of the underlying coded computing schemes. Moreover, our result can be applied to almost every coded computing scheme developed in the literature. We illustrate the application of our result in matrix-vector and matrix-matrix multiplication, linear regression, and multivariate polynomial evaluation. The proposed separation simplifies the coupling significantly compared to the original approach in [19].

Our main contributions in this work include the introduction of a new performance criterion for elastic task allocation schemes, called the transition waste, and constructions of different schemes that achieve optimal transition wastes. This quantity measures the number of tasks that available machines must abandon or take on anew when one machine leaves or joins in the middle of the computation of a large-scale job. Smaller transition wastes reduce the waste of computing resources and speed up the job completion time.

The works of Yang et al. [19] and ours address the need to bridge the gap between the common setup of most coded computing schemes in the literature, where the number of available machines remains fixed, and an emerging trend in the cloud computing industry where the number of available machines can vary, due to the fact that low-priority virtual machines are often offered at much cheaper prices but can be taken back at short notice (e.g., Amazon EC2 Spot and Microsoft Azure Batch).

We can imagine one application of the coded elastic computing scheme as follows.
We purchase a number of EC2 on-demand instances at a higher price while also getting a few Spot instances at a much cheaper cost to run our computation. During the computation cycle, the low-priority Spot instances may leave, reducing the number of available machines. Our system can still handle this if we employ a coded elastic computing scheme in which the number of on-demand instances is greater than or equal to the minimum number of available machines required by the scheme. Thus, instead of maintaining all the costly on-demand instances from the beginning to the end, this approach allows us to take advantage of the low-cost Spot instances available to us while keeping the computation running smoothly even when machines leave. An interesting related approach from Amazon in 2018 was implemented in a new feature called Amazon EC2 Fleet [27], which allows users to specify the target capacity and the preferred EC2 instances while automatically performing mix-and-match to meet customers' specifications at the lowest price. We strongly believe that distributed computing with elastic resources is a fruitful research direction and can potentially create a significant impact on the cloud computing industry.

VI. APPENDIX
A. Coupling an Elastic Task Allocation Scheme and a Coded Computing Scheme
We now explain how to couple an elastic task allocation scheme (ETAS) and a coded computing scheme (CCS) to achieve a coded elastic computing scheme, which allows
• straggler tolerance: at most E slow machines do not affect the completion time of the system,
• load balancing: every available machine is assigned the same workload, and
• elasticity: the workload of available machines can be flexibly adjusted when machines leave and join.
The general method is to first partition the problem instance into F independent sub-instances and then apply a CCS to each sub-instance. Task f, f ∈ [[F]], refers to the computation task performed over the f-th sub-instance. Suppose that throughout the computation the number of available machines varies from L to N_max. For each task, a CCS generates N_max sub-tasks, which are distributed to at most N_max machines so that the completion of any L − E sub-tasks leads to the completion of the task (L − E is referred to as the recovery threshold). Each of the N available machines must be loaded with the corresponding sub-tasks of all F sub-instances so as to be ready to work on any new tasks when machines leave or join. However, each machine only works on the sub-tasks of the tasks assigned to it by the TAS. More specifically, if an (N, L, F)-TAS S^N = {S_1^N, ..., S_N^N} is used, then Machine n only works on the tasks indexed by S_n^N.
The L-Redundancy of the TAS guarantees that any task f is worked on by precisely L different machines among N. As the CCS allows the recovery of Task f from any L − E outputs, the coded elastic computing scheme, which couples a TAS and a CCS, can tolerate E stragglers. The Load Balancing property of the TAS guarantees that every available machine is assigned the same workload. When a machine joins or leaves, a new TAS constructed by the transition algorithm T of the ETAS is applied, which preserves the straggler tolerance and the load balancing property.
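As a concrete toy illustration of this coupling, the sketch below pairs the cyclic (N, L, F)-TAS with a simple real-valued Vandermonde code playing the role of the CCS, for a matrix-vector product. The parameters and the use of a Vandermonde code are our own illustrative choices, not the specific schemes of [19] or [6].

```python
import numpy as np

N, L, E, F = 4, 3, 1, 4
k = L - E                                  # recovery threshold of the CCS
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))            # 8 rows = F tasks x 2 rows per task
x = rng.standard_normal(5)
A_tasks = np.split(A, F)                   # sub-instances (A_f, x)

# Cyclic (N, L, F)-TAS: machine n (0-based) works on the LF/N tasks
# starting cyclically at position nF/N.
S = [{(n * F // N + i) % F for i in range(L * F // N)} for n in range(N)]
assert all(sum(f in S_n for S_n in S) == L for f in range(F))  # L-redundancy

# CCS: for each task, k row blocks are combined into N coded blocks by a
# toy real-valued Vandermonde code; any k coded results recover the task,
# so E stragglers per task are tolerated.
G = np.vander(np.arange(1, N + 1), k, increasing=True)
def coded_block(f, n):
    blocks = np.split(A_tasks[f], k)
    return sum(G[n, j] * blocks[j] for j in range(k))

# Every machine stores a coded block of every task but computes only the
# tasks the TAS assigns to it.
outputs = {(f, n): coded_block(f, n) @ x for n in range(N) for f in S[n]}

Ax = []
for f in range(F):
    done = sorted(n for n in range(N) if f in S[n])[:k]   # any k finishers
    Y = np.stack([outputs[(f, n)] for n in done])
    Ax.append(np.linalg.solve(G[done], Y).reshape(-1))    # decode A_f x
assert np.allclose(np.concatenate(Ax), A @ x)
```

Here any k = L − E finishers per task suffice, so each task survives E stragglers, while the TAS keeps every machine's assigned load equal.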
We discuss below how to define the tasks for a few specific problems.

Matrix-Vector Multiplication.
We aim to compute Ax, where A is a matrix and x is a vector of matching dimension, in a way that tolerates any E stragglers (0 ≤ E < L) and allows a varying number of available machines N (L ≤ N ≤ N_max). Assuming that the number of rows of A is divisible by F (padding if necessary), we partition A row-wise into F equal-sized sub-matrices A_0, A_1, ..., A_{F−1}. The pair (A_f, x) forms the f-th sub-instance of the original instance (A, x), and the computation of A_f x is referred to as Task f. A known CCS for matrix-vector multiplication (e.g., [6]) can then be used to generate N_max sub-tasks for each Task f, each of which is then distributed to the corresponding machine (machines joining later download later). Clearly, the completion of all tasks f ∈ [[F]] gives us the desired product Ax.

Matrix-Matrix Multiplication.
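For this problem, defined formally in the next paragraph, the tasks are partial products of matching column and row blocks of the two factors, and the task outputs are simply summed. A minimal numpy sketch of that decomposition (the per-task coding, e.g., by MatDot [28], is omitted; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
F = 4
A = rng.standard_normal((5, 8))   # partitioned column-wise into F parts
B = rng.standard_normal((8, 6))   # partitioned row-wise into F parts
A_parts = np.hsplit(A, F)         # A_0, ..., A_{F-1}, each 5 x 2
B_parts = np.vsplit(B, F)         # B_0, ..., B_{F-1}, each 2 x 6

# Task f computes A_f B_f (in the full scheme, delegated to a CCS);
# summing the completed tasks recovers AB.
partial = [A_f @ B_f for A_f, B_f in zip(A_parts, B_parts)]
assert np.allclose(sum(partial), A @ B)
```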
The goal is to compute the product AB, where A and B are matrices of matching dimensions, in a way that tolerates any E stragglers (0 ≤ E < L) and allows a varying number of available machines N (L ≤ N ≤ N_max). We partition A and B column-wise and row-wise, respectively, into F equal-sized sub-matrices (padding with zeros if necessary) as follows:
A = [A_0, A_1, ..., A_{F−1}],  B = [B_0; B_1; ⋯; B_{F−1}] (stacked row-wise).
The pair (A_f, B_f), f ∈ [[F]], forms the f-th sub-instance, and the computation of A_f B_f is referred to as Task f. As AB = Σ_{f=0}^{F−1} A_f B_f, the completion of all F tasks gives us the product AB. For each Task f, a known CCS for matrix-matrix multiplication can be applied (e.g., MatDot [28]).

Linear Regression.
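For the regression procedure described below, the elastic computation reduces to two amortized products, A = XᵀX and b = Xᵀy, followed by a single matrix-vector product per iteration. A hedged numpy sketch of the resulting iteration (the coded, distributed execution of each product is omitted, and the step size is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)

# One-off (amortized) products, each computable by an elastic scheme:
A = X.T @ X          # via coded matrix-matrix multiplication
b = X.T @ y          # via coded matrix-vector multiplication

# Per-iteration work is then the single matrix-vector product A w(t);
# only w(t) needs to be shipped to the machines each round.
w, eta = np.zeros(3), 0.01
for _ in range(2000):
    w = w - eta * (A @ w - b)   # gradient of ||Xw - y||^2, up to a constant

assert np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0], atol=1e-6)
```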
Given a data matrix X and a vector y, we aim to find a weight vector w that minimizes the loss function ‖Xw − y‖. Using gradient descent, in each iteration we update the weight using the gradient of the loss function, which requires the computation of Xᵀ(Xw(t) − y).
The algorithm in [19] first computes Xw(t) via coded elastic computing, computes z(t) = Xw(t) − y at the master node, and adaptively encodes z(t) according to the knowledge of which machines are active. Hence, it is not suitable for the scenario where machines join or leave in the middle of each iteration. Our approach presented below simplifies the approach in [19] and also overcomes this drawback.
Note that both X and y are fixed while w(t) varies from one iteration to the next. Therefore, the matrix-matrix product A = XᵀX and the matrix-vector product Xᵀy can be computed once in advance with amortized cost using an ETAS as described earlier. The only job left is to repeatedly compute Aw(t), t = 0, 1, .... Again, we use an ETAS to perform this matrix-vector multiplication. Despite its conceptual simplicity, this procedure not only allows machines to join or leave in the middle of each iteration but also saves communication bandwidth since, at each iteration, we only send w(t) to the machines rather than both w(t) and a coded version of z(t).

Multivariate polynomial evaluation.
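For the evaluation problem described below, each task can be handled by Lagrange coded computing [15]. The following toy, real-valued sketch illustrates a single task with scalar data points; the interpolation points, the polynomial g, and the worker count are our own illustrative choices (the actual scheme handles matrices and typically operates over a finite field):

```python
import numpy as np

# One task: m = K/F data points, here scalars for simplicity.
X_task = np.array([1.0, 2.0, 0.5])
g = lambda v: v**2 + 1.0                    # deg(g) = 2
m = len(X_task)

alphas = np.array([0.0, 1.0, 2.0])          # encoding points: u(alpha_j) = X_j
betas = np.array([-1.5, -0.5, 0.5, 1.5, 2.5, 3.5])   # N_max = 6 workers

u = np.polyfit(alphas, X_task, m - 1)       # Lagrange-style encoding polynomial
worker_out = g(np.polyval(u, betas))        # worker i evaluates g at its coded point

# h(z) = g(u(z)) has degree deg(g)*(m-1) = 4, so any 5 outputs suffice
# (recovery threshold 5; with 6 workers, one straggler is tolerated).
alive = [0, 1, 2, 4, 5]                     # worker 3 straggles
h = np.polyfit(betas[alive], worker_out[alive], 2 * (m - 1))
recovered = np.polyval(h, alphas)           # h(alpha_j) = g(X_j)
assert np.allclose(recovered, g(X_task))
```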
We aim to compute g(X_1), ..., g(X_K), where g is a multivariate polynomial and X_k is a large matrix or vector (k ∈ [K]), in a way that tolerates E stragglers and allows the number of available machines to vary between L and N_max. Suppose that K is divisible by F (padding if necessary). We partition the set of evaluation points into F equal parts
P_f = {X_{fK/F+1}, ..., X_{fK/F+K/F}}, f ∈ [[F]].
Task f refers to the computations of g(X_p), p ∈ P_f. Clearly, the completion of all F tasks gives us g(X_1), ..., g(X_K) as desired. Yu et al. [15] propose a CCS called Lagrange coded computing to perform distributed polynomial evaluation that tolerates stragglers. We can apply this CCS to each task using N_max machines and recovery threshold L − E.

B. Proof of Theorem 5
Note that we only need to prove Theorem 5 for the case when Machine N + 1 joins. The following lemma holds for all δ ∈ [[F]].

Lemma 9.
The transition waste when transitioning from a cyclic (N, L, F)-TAS S^N_cyc to a δ-shifted cyclic (N + 1, L, F)-TAS S^{N+1}_δ-cyc is
W(S^N_cyc → S^{N+1}_δ-cyc) = Sum 1 + Sum 2 + Sum 3,
where these three sums are given as follows. Setting d = F/(N(N + 1)) ∈ Z, the first sum is
Sum 1 = Σ_{n ∈ [N]: (n−1)d > δ} 2((n − 1)d − δ).
When L < ⌈(N+1)/2⌉, the second and third sums are
Sum 2 = Σ_{n ∈ [N]: (n−1+L)d ≤ δ < (n−1+L+LN)d} 2(δ − (n − 1 + L)d),
Sum 3 = Σ_{n ∈ [N]: (n−1+L+LN)d ≤ δ ≤ F+(n−1)d−LNd} 2LNd + Σ_{n ∈ [N]: F+(n−1)d−LNd < δ} 2(F + (n − 1)d − δ).
When L ≥ ⌈(N+1)/2⌉, the second and third sums are
Sum 2 = Σ_{n ∈ [N]: (n−1+L)d ≤ δ ≤ F+(n−1)d−LNd} 2(δ − (n − 1 + L)d) + Σ_{n ∈ [N]: F+(n−1)d−LNd < δ < (n−1+L+LN)d} 2(N − L)F/N,
Sum 3 = Σ_{n ∈ [N]: (n−1+L+LN)d ≤ δ} 2(F + (n − 1)d − δ).

Proof.
These sums are obtained by considering all possible cases of the intersection between S_n^N and S_n^{N+1}, taking into account the fact that we have shifted S_n^{N+1} cyclically by δ positions compared to the ordinary cyclic TAS. We omit some details due to lack of space but provide the cases that lead to these sums so that the interested reader can follow and verify our result.
Let S^N_cyc = (S_1^N, ..., S_N^N) and S^{N+1}_δ-cyc = (S_1^{N+1}, ..., S_{N+1}^{N+1}). For n ∈ [N],
S_n^N = [(n − 1)F/N, (n − 1)F/N + LF/N − 1] (mod F),
S_n^{N+1} = [(n − 1)F/(N + 1) + δ, (n − 1)F/(N + 1) + LF/(N + 1) − 1 + δ] (mod F).
To compute the transition waste W(S_n^N → S_n^{N+1}) incurred at Machine n ∈ [N], we consider the following three cases depending on the relative position of the endpoints of S_n^N and S_n^{N+1} on the circle of integers mod F.
Case 1. δ < (n − 1)F/(N(N + 1)) = (n − 1)d. The left endpoint of S_n^{N+1} lies between 0 and the left endpoint of S_n^N (see Fig. 7). Applying Lemma 1 (a) to S = S_n^{N+1} and T = S_n^N, we have W(S_n^N → S_n^{N+1}) = 2((n − 1)d − δ). Case 1 gives rise to Sum 1.

Fig. 7: Illustration of Case 1.
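The case analysis can also be checked numerically. The sketch below computes the waste by brute force, under our reading of the definition in which each existing machine's unavoidable load reduction of Ld tasks (from LF/N down to LF/(N+1)) is not counted as waste and Machine N + 1's new load is excluded; with this convention it reproduces the quadratic formula and the optimal shift δ_opt = ⌊(N + L − 1)/2⌋d derived in the proof of Theorem 5 below.

```python
def interval_mod(start, length, F):
    """The cyclic interval {start, start+1, ..., start+length-1} mod F."""
    return {(start + i) % F for i in range(length)}

def transition_waste(N, L, F, delta):
    # Waste of moving from the cyclic (N, L, F)-TAS to the delta-shifted
    # cyclic (N+1, L, F)-TAS, summed over the N existing machines; each
    # machine's unavoidable load reduction of L*d tasks is discounted
    # (an assumption on our part), and Machine N+1 is excluded.
    d = F // (N * (N + 1))
    total = 0
    for n in range(1, N + 1):
        S = interval_mod((n - 1) * F // N, L * F // N, F)
        T = interval_mod((n - 1) * F // (N + 1) + delta, L * F // (N + 1), F)
        total += len(S ^ T) - L * d      # discounted symmetric difference
    return total

N, L, F = 5, 2, 30                       # F divisible by N(N+1), so d = 1
d = F // (N * (N + 1))
wastes = {delta: transition_waste(N, L, F, delta) for delta in range(F)}
delta_opt = (N + L - 1) // 2 * d         # the shift from the proof, here 3
assert min(wastes, key=wastes.get) == delta_opt
# Quadratic formula from the proof: 2*d_opt^2/d - 2(N+L-1)d_opt + (N(N-1)+L(L-1))d.
assert wastes[delta_opt] == 2 * delta_opt**2 // d \
    - 2 * (N + L - 1) * delta_opt + (N * (N - 1) + L * (L - 1)) * d
```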
Case 2. (n − 1)d ≤ δ < (n − 1 + L + LN)d. The left endpoint of S_n^{N+1} lies between the two endpoints of S_n^N (inclusive). We further divide Case 2 into two sub-cases.
Sub-case 2.1. (n − 1)d ≤ δ < (n − 1 + L)d. Since S_n^{N+1} ⊂ S_n^N, by Lemma 2, the transition waste is zero and we can ignore this sub-case.
Sub-case 2.2. (n − 1 + L)d ≤ δ < (n − 1 + L + LN)d. When L < ⌈(N+1)/2⌉, the intersection of S_n^N and S_n^{N+1} is contiguous (see Fig. 8 (a)) and we can use a similar argument as in Lemma 1 (a) to deduce that
W(S_n^N → S_n^{N+1}) = 2(δ − (n − 1 + L)d).
When L ≥ ⌈(N+1)/2⌉, we have
F + (n − 1)d − LNd < (n − 1 + L + LN)d.
This inequality is important because for (n − 1 + L)d ≤ δ ≤ F + (n − 1)d − LNd, the intersection of S_n^N and S_n^{N+1} is contiguous and the transition waste is
W(S_n^N → S_n^{N+1}) = 2(δ − (n − 1 + L)d),
while for F + (n − 1)d − LNd < δ < (n − 1 + L + LN)d, the intersection between the two sets is non-contiguous (see Fig. 8 (b)) and the transition waste is
W(S_n^N → S_n^{N+1}) = 2(N − L)F/N.
These explain the formula of Sum 2.

Fig. 8: Illustration of the scenarios in Sub-case 2.2: (a) contiguous intersection; (b) non-contiguous intersection.

Fig. 9: Illustration of the scenarios in Case 3: (a) non-intersecting case; (b) intersecting case.
Case 3. (n − 1 + L + LN)d ≤ δ < F. The right endpoint of S_n^N lies between its left endpoint and the left endpoint of S_n^{N+1}. We divide this case further into two sub-cases, depending on whether the two sets intersect or not (see Fig. 9). When L < ⌈(N+1)/2⌉, for (n − 1 + L + LN)d ≤ δ ≤ F + (n − 1)d − LNd, the two sets do not intersect, and so, the transition waste is 2LNd, while for F + (n − 1)d − LNd < δ < F, the two sets intersect and the transition waste is 2(F + (n − 1)d − δ). When L ≥ ⌈(N+1)/2⌉, the two sets S_n^N and S_n^{N+1} always intersect and the transition waste is 2(F + (n − 1)d − δ). These explain the formula of Sum 3. ∎

Proof of Theorem 5.
Lemma 9 establishes an implicit formula for the transition waste when transitioning from a cyclic (N, L, F)-TAS S^N_cyc to a δ-shifted cyclic (N + 1, L, F)-TAS S^{N+1}_δ-cyc. It remains to determine an explicit form of the transition waste and show that it is minimized at δ_opt = ⌊(N + L − 1)/2⌋d. To simplify the computation, we assume that δ is divisible by d ≜ F/(N(N + 1)). Even with this simplification, the computation is still very tedious, with many cases depending on the relation between N and L and the exact interval δ lies in (four cases, each with seven intervals to consider; see Figs. 10 and 11).

Case 1. L < ⌈(N+1)/2⌉ and L ≠ N/2.
Case 2. N even and L = N/2.

Fig. 10: Illustration of the intervals for δ (with endpoints 0, Ld, (N − 1)d, F − LNd, F − LNd + (N − 1)d, (L + LN)d, (N − 1 + L + LN)d, and F) and the non-empty sums contributing to the transition waste when L < ⌈(N+1)/2⌉. The labels 2a/3a and 2b/3b refer to the two component sums of Sum 2 and Sum 3, respectively. The appearance of the labels 1, 2, 2a, 2b, 3a, 3b in each interval indicates that these sums are non-empty in that interval of δ.

Case 3. L > ⌈(N+1)/2⌉, or L = ⌈(N+1)/2⌉ and N even.
Case 4. N odd and L = (N+1)/2.

Fig. 11: Illustration of the intervals for δ and the non-empty sums contributing to the transition waste when L ≥ ⌈(N+1)/2⌉.

Note that while the transition waste can be written as the sum of four component sums, depending on the interval that δ belongs to, only a few sums are non-empty. We must know which sums are non-empty in which intervals of δ to obtain a precise formula for the transition waste. We present below the computation of the transition waste in one interval of δ that contains δ_opt and omit the rest due to lack of space.
Consider Case 1 when Ld ≤ δ < (N − 1)d (see Fig. 10). By Lemma 9, W(S^N_cyc → S^{N+1}_δ-cyc) = Sum 1 + Sum 2, where
Sum 1 = Σ_{n = δ/d + 2}^{N} 2((n − 1)d − δ) = δ²/d − (2N − 1)δ + N(N − 1)d,
Sum 2 = Σ_{n = 1}^{δ/d − L + 1} 2(δ − (n − 1 + L)d) = δ²/d − (2L − 1)δ + L(L − 1)d.
Note that it is important to determine the precise lower and upper limits for each sum. Hence,
W(S^N_cyc → S^{N+1}_δ-cyc) = 2δ²/d − 2(N + L − 1)δ + (N(N − 1) + L(L − 1))d.
This is a quadratic function of δ, which achieves its minimum at δ_opt = ⌊(N + L − 1)/2⌋d. This is indeed the shift recommended in Theorem 3. It is straightforward but tedious to show that the transition waste, which can be constant, linear, or quadratic in δ, is always greater than or equal to that at δ_opt if δ lies in the other intervals. We omit the details. ∎

ACKNOWLEDGEMENT
We thank Yaoqing Yang for helpful discussions.

REFERENCES
[1] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010.
[2] Apache Hadoop. http://hadoop.apache.org.
[3] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, "Effective straggler mitigation: Attack of the clones," in Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, 2013, pp. 185–198.
[4] J. Dean and L. A. Barroso, "The tail at scale," Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
[5] N. J. Yadwadkar, B. Hariharan, J. E. Gonzalez, and R. Katz, "Multi-task learning for straggler avoiding predictive job scheduling," Journal of Machine Learning Research, vol. 17, no. 1, pp. 3692–3728, 2016.
[6] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2018.
[7] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, "Coded MapReduce," in Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing, 2015, pp. 964–971.
[8] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, "Gradient coding: Avoiding stragglers in distributed learning," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 3368–3376.
[9] K. H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Transactions on Computers, vol. C-33, no. 6, pp. 518–528, 1984.
[10] S. Dutta, V. Cadambe, and P. Grover, "Short-dot: Computing large linear transforms distributedly using coded short dot products," in Advances in Neural Information Processing Systems, 2016, pp. 2100–2108.
[11] Q. Yu, M. A. Maddah-Ali, and S. Avestimehr, "Polynomial codes: An optimal design for high-dimensional coded matrix multiplication," in Advances in Neural Information Processing Systems, 2017, pp. 4403–4413.
[12] C. Karakus, Y. Sun, S. Diggavi, and W. Yin, "Straggler mitigation in distributed optimization through data encoding," in Advances in Neural Information Processing Systems 30, 2017, pp. 5434–5442.
[13] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, "A fundamental tradeoff between computation and communication in distributed computing," IEEE Transactions on Information Theory, vol. 64, no. 1, pp. 109–128, 2018.
[14] A. Mallick, M. Chaudhari, and G. Joshi, "Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication," 2018. [Online]. Available: http://arxiv.org/abs/1804.10331
[15] Q. Yu, S. Li, N. Raviv, S. M. M. Kalan, M. Soltanolkotabi, and S. Avestimehr, "Lagrange coded computing: Optimal design for resiliency, security, and privacy," in Proceedings of Machine Learning Research, vol. 89, 2019, pp. 1215–1225.
[16] J. Kosaian, K. V. Rashmi, and S. Venkataraman, "Learning a code: Machine learning for approximate non-linear coded computation," 2018. [Online]. Available: http://arxiv.org/abs/1806.01259
[17] Amazon EC2 Spot Instances. https://aws.amazon.com/ec2/spot/.
[18] Microsoft Azure Batch. https://azure.microsoft.com/en-au/services/batch/.
[19] Y. Yang, P. Grover, and S. Kar, "Coded elastic computing," in IEEE International Symposium on Information Theory, 2019, pp. 2654–2658.
[20] R. R. Muntz and J. C. S. Lui, "Performance analysis of disk arrays under failure," in Proceedings of the 16th International Conference on Very Large Data Bases, ser. VLDB '90, 1990, pp. 162–173.
[21] M. Holland and A. G. Gibson, "Parity declustering for continuous operation in redundant disk arrays," in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS V, 1992, pp. 23–35.
[22] S. H. Dau, Y. Jia, C. Jin, W. Xi, and K. S. Chan, "Parity declustering for fault-tolerant storage systems via t-designs," in , 2014, pp. 7–14.
[23] P. Hall, "On representatives of subsets," Journal of the London Mathematical Society, vol. s1-10, no. 1, pp. 26–30, 1935.
[24] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[25] Handbook of Combinatorial Designs, Second Edition (Discrete Mathematics and Its Applications).
[26] M. Funk, D. Labbate, and V. Napolitano, "Tactical (de-)compositions of symmetric configurations," Discrete Mathematics, vol. 309, no. 4, pp. 741–747, 2009.
[27] Introducing Amazon EC2 Fleet. https://aws.amazon.com/about-aws/whats-new/2018/04/introducing-amazon-ec2-fleet/.
[28] M. Fahim, H. Jeong, F. Haddadpour, S. Dutta, V. Cadambe, and P. Grover, "On the optimal recovery threshold of coded matrix multiplication," in