Effective Straggler Mitigation: Which Clones Should Attack and When?
Mehmet Fatih Aktaş ([email protected]), Pei Peng ([email protected]), Emina Soljanin ([email protected])
Department of Electrical and Computer Engineering, Rutgers University
1. INTRODUCTION AND MODEL
Motivation:
Distributed (computing) systems aim to attain scalability through parallel execution of multiple tasks constituting a job. Each of these tasks is run on a separate node, and the job is completed only when the slowest task is finished. It has been observed that task execution times have significant variability, e.g., because of multiple jobs sharing resources [1]. The slowest tasks that determine the job execution time are known as "stragglers".

Two common performance metrics for distributed job execution are 1) Latency, measuring the execution time, and 2) Cost, measuring the resource usage. Job execution is desired to be fast and with low cost, but these are conflicting objectives. Replicating tasks and running the replicas on separate nodes has been shown to be effective in mitigating the effect of stragglers on latency [2], and is used in practice [3]. Recent research proposes to delay replication, and to clone only the tasks that at some point appear to be straggling, in order to reduce the cost [4].

Erasure coding is a more general form of redundancy than simple replication, and it has been considered for straggler mitigation both in data download [5] and, more recently, in the distributed computing context [6]. We here take this line of work further by analyzing the effect of coding on the tradeoff between latency and cost. As in [4], which deals with this issue in the context of replication, we consider systems where coded redundancy is introduced with a delay in order to reduce the cost, and examine the impact of that delay on latency. In [2], the introduction of redundancy was playfully described as an attack of the clones. We here examine whether the redundancy should be simple replication or coding, and when it should be introduced. That is, following the analogy of [2], we ask which clones should attack and when.
System Model:
In our system, a job is split into k tasks. The job execution starts with launching all of its k tasks, and redundancy is introduced only if the job is not completed by some time ∆. In a replicated-redundancy (k, c, ∆)-system, if the job still runs at time ∆, c replicas of each remaining task are launched. In a coded-redundancy (k, n, ∆)-system, if the job still runs at time ∆, n − k redundant parity tasks are launched, where completion of any k of all launched tasks results in total job completion (see Fig. 1). Note that this assumption does not impose severe restrictions: any linear computing algorithm can be structured in this way simply by using linear erasure codes. Particular examples can be found in, e.g., [6] and references therein.

Figure 1: A job with four tasks is executed with delayed redundancy. A check mark represents completion of a task, while a cross represents cancellation of remaining outstanding redundant tasks.

We assume that task execution times are iid and follow one of three canonical distributions: 1) Exp(µ), commonly used to model execution of small-size tasks; 2) SExp(D, µ), a constant D plus Exp(µ) noise, used when the job size affects the execution time [4]; 3) Pareto(λ, α), the canonical heavy-tail distribution that has been observed to fit task execution times in real computing systems [1, 7].

We use T to denote the job execution time. Cost is defined as the sum of the lifetimes of all tasks involved in the job execution. There are two main setups that define cost: 1) Cost with task cancellation, C_c: remaining outstanding tasks are canceled upon job completion, which is a viable option for distributed computing with redundancy; 2) Cost without task cancellation, C: tasks remaining after job completion run until they complete, which, for instance, is the only option for data transmission over a multi-path network with redundancy.

In this paper, we analyze the effect of replicated and coded redundancy on the cost and latency tradeoff. Specifically, we present exact expressions for expected latency and cost under delayed and zero-delay redundancy schemes. From these expressions, we observe that the pain and gain of redundancy are strongly correlated with the tail of the task execution time.

Summary of Observations:
Coding allows us to increase the degree of redundancy in finer steps than replication, which translates into a greater achievable cost vs. latency region. Delaying coded redundancy is not effective in trading off latency for cost; therefore, primarily the degree of redundancy should be tuned for the desired cost and latency. Coding is shown to outperform replication in terms of cost and latency together. When the task execution time has a heavy tail, redundancy can reduce cost and latency simultaneously, where the reduction depends on how heavy the tail is.

2. RESULTS AND OBSERVATIONS

We next state expressions for the expected latency and cost under replicated and coded redundancy. Note that these quantities depend on k, the number of tasks the job is split into, the redundancy level (c in the replicated and n in the coded systems), as well as ∆, the time when the redundancy is introduced.

Notation: $H_n$ is the $n$th harmonic number, defined for $n \in \mathbb{Z}^+$ as $\sum_{i=1}^{n} \frac{1}{i}$ and for $n \in \mathbb{R}$ as $\int_0^1 \frac{1-x^n}{1-x}\,dx$. The incomplete Beta function $B(q; m, n)$ is defined for $q \in [0,1]$ and $m, n \in \mathbb{R}^+$ as $\int_0^q u^{m-1}(1-u)^{n-1}\,du$, and the Beta function as $B(m, n) = B(1; m, n)$. The Gamma function $\Gamma(x)$ is defined as $\int_0^\infty u^{x-1} e^{-u}\,du$ for $x \in \mathbb{R}$, and $\Gamma(x) = (x-1)!$ for $x \in \mathbb{Z}^+$.

Expected Latency and Cost with Replication:
Theorem 1. Under exponential task execution time X ∼ Exp(µ), the expected latency in the replicated-redundancy (k, c, ∆)-system is well approximated as
$$E[T] \approx \frac{1}{\mu}\left(H_k - \frac{c}{c+1}\,H_{k-kq}\right).$$
The expected cost with (C_c) and without (C) task cancellation is
$$E[C_c] = \frac{k}{\mu}, \qquad E[C] = \bigl(c(1-q) + 1\bigr)\frac{k}{\mu},$$
where $q = 1 - e^{-\mu\Delta}$.

Theorem 2. Under shifted exponential task execution time X ∼ SExp(D/k, µ), the expected latency in the replicated-redundancy (k, c, ∆)-system is well approximated as
$$E[T] \approx \frac{D}{k} + \frac{1}{\mu}\left(H_k - \frac{c}{c+1}\,H_{k-kq}\right),$$
where $q = 1 - e^{-\mu\Delta}$. The expected cost with (C_c) and without (C) task cancellation is
$$E[C_c] = D + \frac{k}{\mu}\Bigl(1 + c\bigl(1 - q - e^{-\mu\Delta}\bigr)\Bigr), \quad \Delta > \frac{D}{k}, \qquad E[C] = \bigl(c(1-q) + 1\bigr)\left(D + \frac{k}{\mu}\right),$$
where $q = 1 - e^{-\mu(\Delta - D/k)}$.

Expected Latency and Cost with Coding:
Theorem 3. Under exponential task execution time X ∼ Exp(µ), the expected latency in the coded-redundancy (k, n, ∆)-system is well approximated as
$$E[T] \approx \Delta - \frac{1}{\mu}\,B(q;\,k+1,\,0) + \frac{1}{\mu}\left(H_{n-kq} - H_{n-k}\right).$$
The expected cost with (C_c) and without (C) task cancellation is
$$E[C_c] = \frac{k}{\mu}, \qquad E[C] = \frac{k}{\mu}\,q^k + \frac{n}{\mu}\bigl(1 - q^k\bigr),$$
where $q = 1 - e^{-\mu\Delta}$.

Theorem 4. Under shifted exponential task execution time X ∼ SExp(D/k, µ), the expected latency in the coded-redundancy (k, n, ∆)-system is well approximated as
$$E[T] \approx \frac{D}{k} + \Delta - \frac{1}{\mu}\,B(\tilde{q};\,k+1,\,0) + \frac{1}{\mu}\left(H_{n-k\tilde{q}} - H_{n-k}\right).$$
The expected cost with (C_c) and without (C) task cancellation is
$$E[C] = q^k\, k\left(\frac{1}{\mu} + \frac{D}{k}\right) + \bigl(1 - q^k\bigr)\, n\left(\frac{1}{\mu} + \frac{D}{k}\right),$$
$$E[C_c] \approx E[C] - \frac{n-k}{\mu}\bigl(1 - \tilde{q}^{\,k}\bigr) - \frac{n-k}{\mu}\,\eta^{-k}(1-q)\,B\!\left(\eta;\; k - kq + 1,\; \tilde{q}^{\,k} - q^k\right),$$
where $q = \mathbf{1}(\Delta > D/k)\bigl(1 - e^{-\mu(\Delta - D/k)}\bigr)$, $\tilde{q} = 1 - e^{-\mu\Delta}$, and $\eta = 1 - e^{-\mu\Delta}$.

Scheme Comparison:
In order to answer the title question, which clones to send and when, we next compare replicated and coded redundancy in the distributed computing context, where it is feasible to cancel the running redundant tasks upon job completion.

With exponential task execution time, under both replicated and coded redundancy, the expected cost depends neither on the time ∆ at which redundancy is introduced nor on the degree of redundancy c or n (see Thm. 1 and 3). Consequently, in order to achieve the minimum latency, one can introduce all available redundancy at once (∆ = 0) with zero expected penalty in cost.

We want to understand the reduction in cost (gain) and the increase in latency (pain) per increase in ∆. Fig. 2 shows cost vs. latency under delayed redundancy for SExp tasks. For coded redundancy, we observe two phases: 1) Initially, increasing ∆ away from 0 returns almost no reduction in cost but significantly increases latency. 2) Beyond a certain point, increasing ∆ further reduces cost significantly while not increasing delay much. In other words, significant reduction in cost by delaying redundancy is possible only with significant increase in latency. Therefore, delaying coded redundancy is not effective, because one can simply achieve lower cost for the same latency by decreasing the degree of redundancy n. Simulations show that this two-phase behavior exists for Pareto task execution time as well. Note that delaying is effective for replicated redundancy to reduce cost up to some point, beyond which, once again, it is better to reduce the degree of replication c.

Figure 2: Under SExp task execution time (D = 30, µ = 0.5, k = 10), the achievable expected cost with task cancellation vs. latency region is plotted for replicated (c = 1, 2) and coded (n ∈ [k+1, 3k]) redundancy by varying the time ∆ of introducing redundancy along each curve.

Thm. 5 gives exact expressions for the expected cost and latency under zero-delay redundancy. Under both SExp and Pareto task execution times, coding always achieves better expected cost and latency than replication, as illustrated in Fig. 3.
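The exponential zero-delay comparison can be checked with a short simulation. The sketch below is illustrative (function name and defaults are our own); replicas of a task are canceled the moment the task completes, and coded tasks are canceled once any k finish:

```python
import random

def sim_zero_delay(k, mu=1.0, c=0, n=None, runs=5000):
    """Zero-delay (Delta = 0) redundancy with Exp(mu) tasks and cancellation.
    Replication: k groups of c+1 clones; a task finishes with its fastest
    clone, and the other clones in its group are canceled at that moment.
    Coding: n tasks in total; the job finishes when any k of them complete,
    and the remaining n-k tasks are canceled.
    Returns (mean latency, mean cost with cancellation)."""
    lat = cost = 0.0
    for _ in range(runs):
        if n is None:  # replicated redundancy
            groups = [[random.expovariate(mu) for _ in range(c + 1)]
                      for _ in range(k)]
            t = max(min(g) for g in groups)
            cost += sum((c + 1) * min(g) for g in groups)
        else:          # MDS-coded redundancy
            times = sorted(random.expovariate(mu) for _ in range(n))
            t = times[k - 1]
            cost += sum(times[:k]) + (n - k) * t
        lat += t
    return lat / runs, cost / runs
```

With k = 10 and µ = 1, both c = 1 replication and n = 20 coding keep E[C_c] ≈ k/µ = 10, but coding brings the latency from about H_10/2 ≈ 1.46 down to about H_20 − H_10 ≈ 0.67, consistent with Theorems 1 and 3 at ∆ = 0.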
Theorem 5. Let the expected latency and cost with task cancellation be $E[T^{(k,c)}]$, $E[C^{(k,c)}]$ for zero-delay replicated redundancy, and $E[T^{(k,n)}]$, $E[C^{(k,n)}]$ for zero-delay coded redundancy. Under task execution time X ∼ SExp(D/k, µ),
$$E[T^{(k,c)}] = \frac{D}{k} + \frac{H_k}{(c+1)\mu}, \qquad E[C^{(k,c)}] = (c+1)D + \frac{k}{\mu},$$
$$E[T^{(k,n)}] = \frac{D}{k} + \frac{1}{\mu}\left(H_n - H_{n-k}\right), \qquad E[C^{(k,n)}] = \frac{nD}{k} + \frac{k}{\mu}.$$
Under task execution time X ∼ Pareto(λ, α),
$$E[T^{(k,c)}] = \lambda\, k!\, \frac{\Gamma\bigl(1 - ((c+1)\alpha)^{-1}\bigr)}{\Gamma\bigl(k+1 - ((c+1)\alpha)^{-1}\bigr)}, \qquad E[C^{(k,c)}] = \lambda k (c+1)\, \frac{(c+1)\alpha}{(c+1)\alpha - 1},$$
$$E[T^{(k,n)}] = \lambda\, \frac{n!}{(n-k)!}\, \frac{\Gamma\bigl(n-k+1-\alpha^{-1}\bigr)}{\Gamma\bigl(n+1-\alpha^{-1}\bigr)}, \qquad E[C^{(k,n)}] = \frac{\lambda n}{\alpha - 1}\left(\alpha - \frac{\Gamma(n)\,\Gamma\bigl(n-k+1-\alpha^{-1}\bigr)}{\Gamma(n-k)\,\Gamma\bigl(n+1-\alpha^{-1}\bigr)}\right).$$

Figure 3: Expected cost vs. latency for zero-delay redundancy, where the redundancy levels c and n vary along the curves (panels, left to right: SExp(D/k, µ) with D = 30, µ = 0.5; Pareto(λ = 3, α = 1.5); Pareto(λ = 3, α = 1.2); k = 10 throughout). Tail heaviness increases from left to right. The heavier the tail, the higher the maximum reduction in expected cost and latency.

One would expect that adding more redundancy reduces latency but always increases cost. In [4], replicated redundancy is demonstrated to reduce both cost and latency under heavy-tail task execution time. Fig. 3 shows and compares this for replicated and also coded redundancy using the analytical expressions presented here. Under a heavy tail, it is possible to reduce latency by adding redundancy and still pay no more than the baseline cost of running without redundancy. Corollary 1 gives expressions for the minimum achievable expected latency without exceeding the baseline cost.
Corollary 1. Under task execution time X ∼ Pareto(λ, α) in the zero-delay replicated-redundancy system, the minimum latency $E[T_{\min}]$ that can be achieved without exceeding the baseline cost is
$$E[T_{\min}] = \lambda\, k!\, \frac{\Gamma\bigl(1 - (\alpha(c_{\max}+1))^{-1}\bigr)}{\Gamma\bigl(k+1 - (\alpha(c_{\max}+1))^{-1}\bigr)},$$
where $c_{\max} = \max\bigl\{\lfloor (\alpha-1)^{-1} \rfloor - 1,\ 0\bigr\}$, and any reduction in latency without exceeding the baseline cost is possible only if α < 1.5. For the coded-redundancy system, a tight upper bound on $E[T_{\min}]$ is
$$E[T_{\min}] < \frac{\lambda}{\alpha} + \lambda\, k!\, \frac{\Gamma\bigl(1 - \alpha^{-1}\bigr)}{\Gamma\bigl(k+1 - \alpha^{-1}\bigr)}.$$

Fig. 4 illustrates that the maximum percentage reduction in latency, while paying no more than the baseline cost, depends on the tail of the task execution time. As stated in Corollary 1, this reduction is possible under replicated redundancy only when the tail index is less than 1.5, in other words, when the tail is very heavy, while coding relaxes this constraint significantly. In addition, the constraint on α is independent of the number of tasks k under replication, while it increases with k under coding, meaning that jobs with a larger number of tasks can get a reduction in latency at no cost even for lighter-tailed task execution times.

Figure 4: $(E[T] - E[T_{\min}])/E[T]$ vs. α under Pareto(λ = 3, α), for replicated and coded redundancy with k = 10 and k = 50. $E[T_{\min}]$ is the minimum expected latency with redundancy without exceeding the baseline cost, and E[T] is the expected latency with no redundancy.
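The replication threshold in Corollary 1 is simple to evaluate. The helper below (our own illustrative code) computes the largest replication level whose expected cost stays at or below the no-redundancy baseline:

```python
import math

def c_max(alpha):
    """Largest c with E[C^(k,c)] <= baseline E[C^(k,0)] under Pareto tasks
    (Corollary 1): c_max = max(floor(1/(alpha - 1)) - 1, 0).
    The threshold is independent of k and lambda."""
    return max(math.floor(1.0 / (alpha - 1.0)) - 1, 0)

# No-cost latency reduction via replication needs c_max >= 1, i.e. alpha < 1.5:
# e.g. c_max(1.2) = 4 and c_max(1.4) = 1, but c_max(1.6) = 0.
```

Any α ≥ 1.5 forces c_max = 0, which is the regime where, per Fig. 4, only coding can still reduce latency at no added cost.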
3. REFERENCES
[1] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.
[2] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective straggler mitigation: Attack of the clones. In NSDI, volume 13, pages 185–198, 2013.
[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[4] Da Wang, Gauri Joshi, and Gregory Wornell. Using straggler replication to reduce latency in large-scale parallel computing. ACM SIGMETRICS Performance Evaluation Review, 43(3):7–11, 2015.
[5] Gauri Joshi, Emina Soljanin, and Gregory Wornell. Queues with redundancy: Latency-cost analysis. ACM SIGMETRICS Performance Evaluation Review, 43(2):54–56, 2015.
[6] Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. Short-Dot: Computing large linear transforms distributedly using coded short dot products. In Advances in Neural Information Processing Systems, pages 2092–2100, 2016.
[7] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. Towards understanding heterogeneous clouds at scale: Google trace analysis.