A Lower Bound on the Stability Region of Redundancy-d with FIFO Service Discipline

Gal Mendelson
Abstract—Redundancy-d (R(d)) is a load balancing method used to route incoming jobs to K servers, each with its own queue. Every arriving job is replicated into 1 ≤ d ≤ K tasks, which are then routed to d servers chosen uniformly at random. When the first task finishes service, the remaining d − 1 tasks are cancelled and the job departs the system.
Despite the fact that R(d) is known, under certain conditions, to substantially improve job completion times compared to not using redundancy at all, little is known about a more fundamental performance criterion: what is the set of arrival rates under which the R(d) queueing system with FIFO service discipline is stable? In this context, due to the complex dynamics of systems with redundancy and cancellations, existing results are scarce and are limited to very special cases with respect to the joint service time distribution of tasks.
In this paper we provide a non-trivial, closed form lower bound on the stability region of R(d) for a general joint service time distribution of tasks with finite first and second moments. We consider a discrete time system with Bernoulli arrivals and assume that jobs are processed in their order of arrival. We use the workload processes and a quadratic Lyapunov function to characterize a set of arrival rates for which the system is stable. While simulation results indicate our bound is not tight, it provides an easy-to-check performance guarantee.

Index Terms—Redundancy Routing, Job Replication, Job Cancellation, Stability, Lyapunov Stability.
I. INTRODUCTION

Gal Mendelson is with the Electrical Engineering Faculty, Technion, Israel.

Redundancy and cancellation based routing has attracted much attention in the last decade [1]–[7]. The basic motivation behind using redundancy and cancellation is reducing the tail of the job completion time distribution. The idea is to replicate a job and send its copies, referred to as tasks, to different servers for processing. When the first task is finished being processed, the job is deemed complete and leaves the network. The premise is that allowing copies of a job to traverse different paths in the network makes it highly improbable that all of the copies experience large queueing delay and/or processing time.

Implementing such redundancy and cancellation mechanisms incurs an overhead which can include software, hardware, control, memory and computational power. The performance-cost trade-off of these schemes may well be worthwhile, since the potential benefit in terms of performance is known in some cases to be substantial [1]–[5].

In this paper we are concerned with a specific scheme called Redundancy-d (R(d)), used to route incoming jobs to K servers, working at rate µ, each with its own queue. Within each server, service is given by order of arrival (FIFO). Every arriving job is replicated into 1 ≤ d ≤ K tasks, which are then routed to d distinct servers chosen uniformly at random. When the first task finishes service, the remaining d − 1 tasks are cancelled and the job departs the system.

Our main research question is concerned with a first order performance criterion of R(d): what is the set of arrival rates under which the R(d) system is stable? We refer to this set as the stability region. Here and throughout we refer to a system as stable if the underlying Markov chain describing the load (e.g. queue lengths, workloads) in the system is positive recurrent.

For policies with no redundancy, such as random routing or 'join the shortest queue', it is well known that the queueing system is stable as long as the arrival rate λ satisfies λ ∈ [0, Kµ). For R(d), denoting by B_1, …, B_d the service time requirements of the d tasks belonging to a single job, such that E[B_i] = µ^{-1}, the stability region is known exactly only in two special cases:

(i) B_1, …, B_d are independent and exponentially distributed. Then the stability region remains λ ∈ [0, Kµ) for all values of d [6].

(ii) d = K (full redundancy). Then the stability region of R(K) is λ ∈ [0, 1/E[∧_{i=1}^{K} B_i]) [7].

The only other closed form result the author is aware of is a lower bound on the stability region of R(d) stating that if λ ∈ [0, 1/E[∧_{i=1}^{d} B_i]) then the system is stable [7]. This lower bound is tight for d = K but is limited in general because it does not depend on K. Finally, the authors of [8] implicitly characterize the exact stability region in the case where B_1, …, B_d are identical (i.e. B_1 = … = B_d) and exponentially distributed. The stability condition is given in terms of the mean number of jobs in service in an associated 'saturated' system. Using this result, our lower bound can be easily derived for this special case.

Characterizing the stability region of the R(d) queueing system for general 1 < d < K remains open, which motivates the lower bound developed in this paper.

II. MODEL
Consider a time-slotted system with a single dispatcher and K homogeneous servers. Each server has an infinite-size buffer in which a queue can form, and the servers do not idle when there is work in the buffer. In every time slot, each server completes a single unit of service if it has work to do, and service is given by order of arrival, i.e. FIFO. We assume a server can work on a job that has just arrived.
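These per-slot dynamics of a single server can be sketched as follows (a minimal illustration in our own notation, not code from the paper):

```python
def server_step(workload, incoming_work):
    """One time slot of a single FIFO server: work arriving in the slot
    is queued, then one unit of service is completed if the buffer is
    non-empty (the server never idles while it has work)."""
    return max(workload + incoming_work - 1, 0)
```

For example, `server_step(3, 2)` returns 4: three queued units plus two arriving units, minus the one unit served in the slot.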
Arrival. At each time slot t ∈ N, a job arrives to the dispatcher with probability 0 < λ < 1, according to the value of a Bernoulli random variable (RV) A(t), such that E[A(t)] = λ.

Routing.
When a job arrives, the dispatcher immediately sends d replicas of the job, where 1 ≤ d ≤ K, to d distinct servers. We refer to these replicas as tasks. Denote by G_d the set of all d-sized subsets of [K]. For each t, denote by G_d(t) a set-valued RV taking values in G_d with equal probability. If a job arrives at time slot t, then G_d(t) determines which d servers will receive its tasks. We assume that the G_d(t) are independent and identically distributed (i.i.d.) across time slots. When the first of the job's tasks finishes service, the remaining d − 1 tasks are immediately cancelled, and this marks the job's departure time from the system. For completeness, in the case where several tasks are completed at the exact same time, we refer to the task in the smallest indexed server as completed and to the rest as cancelled.

Service time distribution.
If a job arrives at time slot t, the service duration requirements for its tasks are determined by the random vector B̄(t) = (B_1(t), …, B_K(t)), whose members take values in N. The quantity B_i(t) represents the service time requirement of the task that is to be sent to server i, provided i is a member of G_d(t). Denote B̄ = B̄(1).
The homogeneity of the servers is captured by the assumption that the distribution of B̄ is symmetric with respect to its members, such that the joint distribution of any subset of B̄ coincides with that of any other subset of the same size. Formally,

Assumption II.1 (Homogeneity). For every k ∈ [K], {i_1, …, i_k} ⊂ [K] and {j_1, …, j_k} ⊂ [K] such that i_1 < … < i_k and j_1 < … < j_k, the random vectors (B_{i_1}, …, B_{i_k}) and (B_{j_1}, …, B_{j_k}) have the same distribution.
To avoid the possibility of λ = 1 being inside the stability region, we scale time appropriately. This translates to an assumption on B̄ as follows. The average amount of work that enters the system upon a job's arrival is bounded from below by the average of the minimum of the service requirements of its tasks, i.e. E[∧_{i=1}^{d} B_i]. Note that it does not matter which d members we take, due to Assumption II.1. The servers clear at most K units of work per time slot, so we require

E[∧_{i=1}^{d} B_i] > K,   (2)

such that the largest possible λ for which the system is stable is strictly smaller than 1.

Workload.
Denote by W_i(t) the workload in buffer i at time slot t, after a possible arrival and service. This is defined as the amount of time it will take (in time slots) for the existing tasks in the buffer (including the one in service, if there is any) to leave the system. Note that tasks can leave the system either due to service completion or due to cancellation. Also, due to the FIFO service discipline, the workloads depend only on the tasks currently present in the buffers and not on future arrivals. This is not true for other service disciplines such as last-in-first-out or processor sharing.
Denote W̄(t) = (W_1(t), …, W_K(t)) and W̄ = {W̄(t)}_{t∈N}. We refer to W̄ as the workload process. We assume that the system starts empty, i.e.

W_i(0) = 0, ∀i ∈ [K].   (3)

Markov chain formulation.
We turn to analyze the dynamics of the workload process. To this end, denote

∆_{i,j}(t) = W_j(t) − W_i(t),   (4)

and let 1_i(t) denote the indicator RV which equals 1 if server i is a member of the d chosen servers at time slot t. Then

E[1_i(t)] = P(i ∈ G_d(t)) = (K−1 choose d−1)/(K choose d) = d/K.   (5)

Proposition II.2.
For each i ∈ [K] we have

W_i(t) = [W_i(t−1) + A(t)A_i(t) − 1]^+,

where

A_i(t) = 1_i(t) ( ∧_{j∈G_d(t)} [B_j(t) + ∆_{i,j}(t−1)]^+ ).   (6)

Proof.
The first part of (6) is a standard balance equation. The workload at time t equals the workload at time t − 1 plus arrival minus service, and is kept non-negative. The second part of (6) is less trivial and captures the complexity of the R(d) model with FIFO service discipline.
The quantity A_i(t) represents the total amount of work server i receives at time slot t, provided a job arrives. If i ∉ G_d(t) then 1_i(t) = 0 and we have A_i(t) = 0. Otherwise, the amount of work that server i receives depends on the workload in the other d − 1 members of G_d(t), as well as the service time requirements of the arriving job's tasks. Considering the FIFO service discipline within each server and the definition of the workloads, the task that reaches a server first out of the d tasks is the one that is sent to the server with the least workload. Considering the cancellation mechanism, the task that finishes processing first out of the d tasks is the one that is sent to the server j for which W_j(t−1) + B_j(t) is minimal. Denote this server by j* = argmin_{j∈G_d(t)} {W_j(t−1) + B_j(t)}, where in the case of several minimizers, the smallest index is returned.
If B_{j*}(t) + W_{j*}(t−1) ≤ W_i(t−1), or, written differently using (4), B_{j*}(t) + ∆_{i,j*}(t−1) ≤ 0, the task that arrives to server i will be cancelled before being processed and A_i(t) = 0. If B_{j*}(t) + ∆_{i,j*}(t−1) > 0, then by the definition of j* the task in server i will be cancelled at the exact same time server j* completes its task. Thus, the workload at server i will be truncated and will equal W_{j*}(t−1) + B_{j*}(t), and the amount of work server i receives equals W_{j*}(t−1) + B_{j*}(t) − W_i(t−1) = B_{j*}(t) + ∆_{i,j*}(t−1). Note that ∆_{i,j*}(t−1) may be negative. Overall, we obtain

A_i(t) = 1_i(t)[B_{j*}(t) + ∆_{i,j*}(t−1)]^+.   (7)

To connect (7) with (6), we argue that

∧_{j∈G_d(t)} [B_j(t) + ∆_{i,j}(t−1)]^+ = [B_{j*}(t) + ∆_{i,j*}(t−1)]^+.   (8)

Indeed, if B_{j*}(t) + ∆_{i,j*}(t−1) ≤ 0, the right hand side of (8) is zero, and since j* ∈ G_d(t), the left hand side of (8) equals zero as well. If B_{j*}(t) + ∆_{i,j*}(t−1) > 0, then by the definition of j*, for every j ∈ G_d(t) we have

B_j(t) + ∆_{i,j}(t−1) = B_j(t) + W_j(t−1) − W_i(t−1) ≥ B_{j*}(t) + W_{j*}(t−1) − W_i(t−1) = B_{j*}(t) + ∆_{i,j*}(t−1) > 0,

which completes the proof.

Define the state space S of W̄ as all members of Z_+^K that can be reached from an empty state. Equations (3)–(6) uniquely define the process W̄ as a Markov chain on S. Since P(A(t) = 0) > 0, all states in S communicate and the empty state has a self transition. Therefore, W̄ is irreducible and aperiodic.

Remark II.3.
Since we have assumed the system starts empty (3), by Property 1 in [7], the d largest workloads are always equal. Thus the largest d components of any member of S must be equal.

Remark II.4.
The R(d) policy routes tasks to d servers chosen uniformly at random. It does not use workload or service time information, which we use only for modelling and stability analysis.

III. MAIN RESULTS
We first state our results, then discuss them in detail. The proofs immediately follow.
A. Statements
Our first result identifies a non-trivial lower bound on the stability region, given as the solution of a certain minimization problem. This lower bound, which we denote by λ_m, satisfies that if λ ∈ [0, λ_m) then W̄ is positive recurrent. Our second result, which is the main result of this paper, is a closed form formula for a lower bound λ_lb on the stability region, satisfying λ_lb ≤ λ_m. To this end, define the space of ordered states in S by

S↑ = {s̄ ∈ S : s_1 ≤ … ≤ s_{K−d+1} = … = s_K},   (9)

and define the space of vectors capturing the differences between the coordinates of members of S↑ by

D_S = {δ̄ ∈ Z_+^{K−1} : ∃ s̄ ∈ S↑ such that δ_i = s_{i+1} − s_i, for 1 ≤ i ≤ K−1}.   (10)

Note that δ_{K−d+1} = … = δ_{K−1} = 0. For δ̄ ∈ D_S, define

δ_{i,j} = Σ_{k=i}^{j−1} δ_k.   (11)

For ease of notation, denote

(G_d, B_j, A, 1_i) = (G_d(1), B_j(1), A(1), 1_i(1)).   (12)

Define

λ_m = inf_{δ̄∈D_S} { K / Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)] }.   (13)

Proposition III.1. If λ ∈ [0, λ_m) then W̄ is positive recurrent.

We now state the main result of this paper. To this end, for 0 ≤ m ≤ d, define

Ω_m = {choosing m out of the d largest workloads}.   (14)

A simple calculation yields

P_m := P(Ω_m) = (K−d choose d−m)(d choose m)/(K choose d).   (15)

Define

λ_lb = K / Σ_{m=0}^{d} ( Σ_{j=1}^{d−m} E[∧_{k=1}^{j} B_k] + m E[∧_{k=1}^{d} B_k] ) P_m.   (16)
Theorem III.2. Let λ_m and λ_lb be defined as in (13) and (16), respectively. Then 0 < λ_lb ≤ λ_m < 1. Specifically, if λ ∈ [0, λ_lb) then W̄ is positive recurrent.

B. Intuition and discussion

Before proving Proposition III.1 and Theorem III.2, we discuss (13) and (16) in detail.
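The probabilities P_m in (15) are elementary to compute; the sketch below (our own code and naming, not from the paper) evaluates them and checks that the events Ω_0, …, Ω_d partition the sample space:

```python
from math import comb

def p_omega(K, d, m):
    """P(Ω_m) from (15): the probability that exactly m of the d chosen
    servers are among the d servers with the largest workloads."""
    return comb(K - d, d - m) * comb(d, m) / comb(K, d)

# Sanity check via the Vandermonde identity: the Ω_m partition the
# sample space, so the probabilities must sum to one.
K, d = 10, 3
assert abs(sum(p_omega(K, d, m) for m in range(d + 1)) - 1.0) < 1e-12
```

Note that `math.comb` returns 0 when the lower index exceeds the upper one, which correctly handles the values of m for which Ω_m is impossible.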
Intuition for λ_m. The basic idea is as follows. By (6), the average amount of incoming work at time slot t is given by

E[A(t) Σ_{i∈[K]} A_i(t)] = λ E[ Σ_{i∈[K]} 1_i(t) (∧_{j∈G_d(t)} [B_j(t) + ∆_{i,j}(t−1)]^+) ].   (17)

This quantity depends on the state at the end of time slot t−1 only through {∆_{i,j}(t−1)}, i.e. the differences between the workloads. We require that for all relevant values of {∆_{i,j}(t−1)} the right hand side of (17) is less than or equal to K, and find the largest λ for which this still holds. This is a notion of sub-criticality. The challenge is to prove that this is sufficient for stability of the R(d) system.
The infimum in (13) is taken over D_S and not Z_+^{K−1} for two reasons. First, by the symmetry of the servers, it is sufficient to consider only ordered states in S of the form s_1 ≤ s_2 ≤ … ≤ s_K, i.e. ∆_{i,j}(t−1) ∈ Z_+ whenever i ≤ j. Second, all states reached by W̄ must have that the d largest workloads are equal, namely ∆_{K−d+1,j}(t−1) = 0 for j = K−d+1, …, K.

Intuition for λ_lb. Suppose a job arrives and d corresponding tasks are sent to d distinct servers. Further suppose that exactly m of the servers with the largest workloads are chosen, where 0 ≤ m ≤ d. For simplicity, assume that the chosen servers are {1, …, d}, such that W_1 ≤ … ≤ W_d, and that the m servers d−m+1, …, d have the largest workload in the system, implying W_{d−m+1} = W_{d−m+2} = … = W_d. Then server 1 receives at most B_1 units of work, server 2 at most B_1 ∧ B_2, and so on up to server d−m, which receives at most B_1 ∧ … ∧ B_{d−m}. The last m servers each receive at most B_1 ∧ … ∧ B_d. Taking expectations and summing over the possible values of m gives an upper bound on the expected amount of work that enters the system upon a job's arrival, which yields the closed form expression of λ_lb in (16).

Why we included λ_m in this paper. One can prove that λ_lb is a lower bound on the stability region without resorting to λ_m at all. This requires a minor modification of the proof of Proposition III.1 and some elements from the proof of Theorem III.2.
However, we feel that λ_m is interesting in its own right. Indeed, given K, d and the distribution of B̄, if one can solve for λ_m numerically then one may obtain a tighter bound than λ_lb. But this is not trivial. First, the set D_S is infinite. Second, different distributions of B̄ may change the set of possible states the system can reach, resulting in different sets D_S which, in turn, may be difficult to characterize. Third, for each member of D_S, one must explicitly calculate the expected value of the minimum of several functions of B̄. This may be computationally expensive for certain distributions. We leave this as an open problem.

The special case of d = 1 (no replication). In this case |G_d(t)| = 1 and every incoming job is randomly routed to one of the servers with equal probability. Together with the fact that by definition ∆_{i,i}(t−1) = 0, ∀i ∈ [K], Equation (17) reduces to

E[A(t) Σ_{i∈[K]} A_i(t)] = λ E[ Σ_{i∈[K]} 1_i(t) B_i(t) ] = λ E[B_1(t)],

and the solution to (13) is given by λ_m = K/E[B_1] = Kµ, as expected. As for λ_lb, substituting d = 1 in (16) and using the convention that Σ_{j=1}^{0} = 0 yields λ_lb = λ_m.

The special case of d = K (full replication). In this case the workloads of all of the servers are equal and all servers are chosen for each job. Thus P_K = 1, and by (16), λ_lb = 1/E[∧_{i=1}^{K} B_i], as expected.

Example calculation.
Consider the case where K = 3, d = 2 and the joint service time distribution B̄ = (B_1, B_2, B_3) satisfies that B_1, B_2 and B_3 are i.i.d. and

B_1 = α with probability p, and β with probability 1 − p,

where 0 ≤ p ≤ 1 and α, β ∈ N. By (15) we have P_0 = 0, P_1 = 2/3 and P_2 = 1/3, and a straightforward calculation yields

E[B_1] = αp + β(1 − p),
E[B_1 ∧ B_2] = α(1 − (1 − p)²) + β(1 − p)².

Thus, by (16) we have

λ_lb = 3 / ( (E[B_1] + E[B_1 ∧ B_2])P_1 + 2E[B_1 ∧ B_2]P_2 )
     = 3 / ( (2/3)(E[B_1] + 2E[B_1 ∧ B_2]) )
     = 9 / ( 2(αp(5 − 2p) + β(1 − p)(3 − 2p)) ).

We use the following two results in the proofs that follow.
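Formula (16) is straightforward to evaluate numerically. The sketch below (our own code; the values α = 4, β = 8, p = 1/2 are illustrative, chosen so that E[B_1 ∧ B_2] > K) computes λ_lb for this example via (15)–(16) and checks it against the closed form:

```python
from math import comb

def lambda_lb(K, d, E_min):
    """λ_lb from (15)-(16); E_min[j] must hold E[B_1 ∧ ... ∧ B_j]."""
    P = [comb(K - d, d - m) * comb(d, m) / comb(K, d) for m in range(d + 1)]
    denom = sum((sum(E_min[j] for j in range(1, d - m + 1)) + m * E_min[d]) * P[m]
                for m in range(d + 1))
    return K / denom

alpha, beta, p = 4, 8, 0.5                # illustrative two-point service times
E_B = alpha * p + beta * (1 - p)          # E[B_1]
E_min2 = alpha * (1 - (1 - p) ** 2) + beta * (1 - p) ** 2  # E[B_1 ∧ B_2]

lb = lambda_lb(K=3, d=2, E_min={1: E_B, 2: E_min2})
closed_form = 9 / (2 * (alpha * p * (5 - 2 * p) + beta * (1 - p) * (3 - 2 * p)))
assert abs(lb - closed_form) < 1e-12      # both equal 9/32 here
```

With these values E[B_1] = 6 and E[B_1 ∧ B_2] = 5, giving λ_lb = 9/32 = 0.28125 from both expressions.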
Proposition III.3 (Balance). In a R(d) system with all-zero initial condition, the d largest workloads are equal at all times.

Proof. The proof is given in Property 1 in [7].
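This balance property can be checked empirically against the workload recursion (6); the following sketch (our own code, with illustrative parameters) verifies that the d largest workloads stay equal along a random trajectory started from the empty state:

```python
import random

def rd_step(W, B, G):
    """One arrival step of the R(d) workload recursion (Prop. II.2),
    given the chosen server set G and service vector B, followed by one
    unit of service at every non-empty server; our own sketch."""
    inc = {i: min(max(B[j] + W[j] - W[i], 0) for j in G) for i in G}
    return [max(W[i] + inc.get(i, 0) - 1, 0) for i in range(len(W))]

rng = random.Random(0)
K, d = 5, 3
W = [0] * K                               # empty initial condition (3)
for _ in range(2000):
    if rng.random() < 0.4:                # Bernoulli arrival
        G = rng.sample(range(K), d)
        B = [rng.randint(6, 12) for _ in range(K)]
        W = rd_step(W, B, G)
    else:
        W = [max(w - 1, 0) for w in W]    # service only
    top = sorted(W)[-d:]
    assert top.count(top[0]) == d         # d largest workloads are equal
```

The assertion inside the loop is exactly the statement of Proposition III.3 along the simulated trajectory.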
Proposition III.4 (Average workload). Consider a R(d) system with K servers. Let s̄ = (s_1, …, s_K) ∈ S be such that s_1 ≤ … ≤ s_K. Let s_{i,j} = s_j − s_i. Denote

f_i(s̄) = E[1_i (∧_{j∈G_d} [B_j + s_{i,j}]^+)].

Then

E[A_i(1) | W̄(0) = s̄] = f_i(s̄),   (18)

and

f_1(s̄) ≥ … ≥ f_K(s̄).   (19)

Proof.
Relation (18) is an immediate consequence of (6) and (12). We now prove that for any i ∈ {1, …, K−1} we have f_i(s̄) ≥ f_{i+1}(s̄), from which relation (19) follows. This simply means that the average incoming workload is monotone non-increasing when considering servers ordered by their workload.
Consider the four possibilities describing whether or not i and i+1 are members of G_d. Under the event that neither i nor i+1 is in G_d, both receive zero work. If both are in G_d, then i+1 cannot receive more work than i, due to the minimum taken in (6) and the ordering s_i ≤ s_{i+1}. Finally, for any event under which i is chosen and i+1 is not, there is an event with equal probability where i is not chosen and i+1 is, with the remaining d−1 servers the same, and vice versa. Again by (6), the average amount of work that enters server i under the first event is no less than the average amount of work that enters server i+1 under the second event. This concludes the proof.

C. Proofs of main results
Proof of Proposition III.1. Since W̄ is irreducible and aperiodic, by Theorem 3.3.7 of [9], it suffices to prove that if λ < λ_m, then a Lyapunov drift condition holds. Namely, that there exist a function L : S → R_+, a finite set F ⊂ S and constants ε, C > 0 such that

E[∆L(t+1) | W̄(t) = s̄] ≤ −ε if s̄ ∈ S \ F, and E[∆L(t+1) | W̄(t) = s̄] ≤ C if s̄ ∈ F,   (20)

where ∆L(t+1) denotes the drift at time slot t+1, namely

∆L(t+1) = L(W̄(t+1)) − L(W̄(t)).   (21)

While we assumed in (3) that the system starts empty, with a slight abuse of notation and for simplicity, in what follows we suppress the dependence on t by writing W̄(0) and W̄(1) instead of W̄(t) and W̄(t+1), respectively. We choose the quadratic function

L(s̄) = Σ_{i=1}^{K} s_i².   (22)

Since the members of B̄ have finite first and second moments, it is trivial that the left hand side of (20) is bounded from above by some positive constant C, uniformly over all s̄ ∈ F, for any finite set F ⊂ S. We omit the details.
Next, we choose the finite set F to be of the form

F = {s̄ = (s_1, …, s_K) ∈ S : max_{1≤i≤K} s_i < C_1},   (23)

where C_1 > 0 is a large enough constant whose value is determined later in the proof.
Consider a state s̄ ∈ S \ F. By the homogeneity of the servers, the symmetric distribution of B̄ and the uniformly at random routing choice, we can, without loss of generality, consider s̄ = (s_1, …, s_K) such that 0 ≤ s_1 ≤ … ≤ s_K. By the definition of F in (23), we have

s_K > C_1,   (24)

and by Proposition III.3, s_{K−d+1} = … = s_K. By (6), using the notation in (12) and denoting A_i = A_i(1), we have W_i(1) = [s_i + A·A_i − 1]^+ and therefore

W_i(1)² ≤ s_i² + 2s_i(A·A_i − 1) + (A·A_i − 1)².   (25)

Using (18), (21), (22) and (25) we obtain

E[∆L(1) | W̄(0) = s̄] = Σ_{i=1}^{K} E[W_i(1)² − W_i(0)² | W̄(0) = s̄]
≤ 2 Σ_{i=1}^{K} s_i (λ E[A_i | W̄(0) = s̄] − 1) + C_2
= 2 Σ_{i=1}^{K} s_i (λ f_i(s̄) − 1) + C_2,   (26)

where the constant C_2 > 0 satisfies

Σ_{i∈[K]} E[(A·A_i − 1)²] ≤ Σ_{i∈[K]} E[(Σ_{j=1}^{d} B_j)²] ≤ C_2.   (27)

The existence of C_2 is due to the finite first and second moments of the members of B̄.
Denote

d_i = λ f_i(s̄) − 1.   (28)

The argument proceeds by analyzing Σ_{i=1}^{K} s_i d_i. First, by (19), d_1 ≥ … ≥ d_K. Second, since λ < λ_m, there exists ε_1 > 0 such that λ = λ_m − ε_1. Thus

Σ_{i=1}^{K} d_i = λ Σ_{i=1}^{K} f_i(s̄) − K = (λ_m − ε_1) Σ_{i=1}^{K} f_i(s̄) − K ≤ −ε_1 Σ_{i=1}^{K} f_i(s̄) < 0,

where in the last inequality we have used the definition of λ_m in (13). Denote by k the lowest index in [K] such that Σ_{i=1}^{k} d_i < 0, namely

k = min{j ∈ [K] : Σ_{i=1}^{j} d_i < 0}.   (29)

Hence

Σ_{i=1}^{j} d_i ≥ 0, ∀j < k,   (30)

and

0 > d_k ≥ … ≥ d_K.   (31)

Next, we argue by induction that

Σ_{i=1}^{j−1} s_i d_i ≤ s_j Σ_{i=1}^{j−1} d_i, ∀j ∈ {1, …, k},   (32)

with the convention that Σ_{i=1}^{0} = 0. Inequality (32) holds trivially for j = 1. Suppose it holds for j = j_0 < k. Then

Σ_{i=1}^{j_0} s_i d_i = Σ_{i=1}^{j_0−1} s_i d_i + s_{j_0} d_{j_0} ≤ s_{j_0} Σ_{i=1}^{j_0−1} d_i + s_{j_0} d_{j_0} = s_{j_0} Σ_{i=1}^{j_0} d_i ≤ s_{j_0+1} Σ_{i=1}^{j_0} d_i,

where the first inequality is due to the induction hypothesis and the second is due to (30) and s_{j_0} ≤ s_{j_0+1}. Taking j = k in (32) yields

Σ_{i=1}^{k−1} s_i d_i ≤ s_k Σ_{i=1}^{k−1} d_i,

and therefore

Σ_{i=1}^{K} s_i d_i ≤ s_k Σ_{i=1}^{k−1} d_i + Σ_{i=k}^{K} s_i d_i = s_k Σ_{i=1}^{k} d_i + Σ_{i=k+1}^{K} s_i d_i.   (33)

By the definition of k in (29) and by (31), the coefficients Σ_{i=1}^{k} d_i, d_{k+1}, …, d_K, multiplying s_k, s_{k+1}, …, s_K respectively, are strictly negative. Define

γ = max{ Σ_{i=1}^{k} d_i, d_{k+1}, …, d_K } < 0.   (34)

By (33) and (34) we obtain

Σ_{i=1}^{K} s_i d_i ≤ γ(s_k + s_{k+1} + … + s_K) ≤ γ s_K ≤ γ C_1,   (35)

where in the last inequality we have used (24). Combining (26), (28) and (35) we obtain

E[∆L(1) | W̄(0) = s̄] ≤ 2γC_1 + C_2,

where we recall from (34) that γ < 0.
Finally, we determine ε and F in (20). Given the primitive arrival and service processes, the constants C_2 in (27) and γ in (34) are given. Fix some ε > 0 and choose C_1 (which defines F) to be large enough such that 2γC_1 + C_2 < −ε. This concludes the proof.

Proof of Theorem III.2. Recall that by (13),

λ_m = inf_{δ̄∈D_S} { K / Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)] },

where D_S is given in (10). Denote by G_d(k) the k-th member of G_d, such that

G_d(1) < … < G_d(d).   (36)

With this notation at hand we can write

Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)] = Σ_{k=1}^{d} E[∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+].   (37)

Proof that λ_m < 1. Keeping only the k = 1 term of the right hand side of (37) yields

Σ_{k=1}^{d} E[∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+] ≥ E[∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(1),G_d(j)}]^+].   (38)

By the definitions of D_S, δ_{i,j} and {G_d(1), …, G_d(d)} in (10), (11) and (36), respectively, we have that δ_{G_d(1),G_d(j)} ≥ 0. Therefore

E[∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(1),G_d(j)}]^+] ≥ E[∧_{j=1}^{d} B_{G_d(j)}] = E[∧_{j=1}^{d} B_j],   (39)

where the last transition is due to the symmetry assumed in Assumption II.1 and the fact that G_d and B̄ are independent. Combining (37), (38) and (39) yields

Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)] ≥ E[∧_{j=1}^{d} B_j] > 0.

Since this bound holds for every δ̄ ∈ D_S, using (13) we obtain λ_m ≤ K/E[∧_{j=1}^{d} B_j] < 1, where the last transition is due to the time scaling assumption in (2).

Proof that λ_lb ≤ λ_m. Fix δ̄ ∈ D_S. On Ω_m defined in (14), exactly m of the d largest workloads are members of G_d, and they are given by {G_d(d−m+1), …, G_d(d)}. By (10), (11) and (36) we have δ_{G_d(k),G_d(j)} ≤ 0 for k > min{j, d−m}. So, for 1 ≤ k ≤ d−m, we have

(∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+) 1_{Ω_m} ≤ (∧_{j=1}^{k} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+) 1_{Ω_m} ≤ (∧_{j=1}^{k} B_{G_d(j)}) 1_{Ω_m},   (40)

and for d−m+1 ≤ k ≤ d we have

(∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+) 1_{Ω_m} ≤ (∧_{j=1}^{d} B_{G_d(j)}) 1_{Ω_m}.   (41)

Using (37), (40) and (41) we obtain

Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)]
= E[ Σ_{k=1}^{d} (∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+) ]
= Σ_{m=0}^{d} E[ Σ_{k=1}^{d} (∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+) 1_{Ω_m} ]
≤ Σ_{m=0}^{d} ( Σ_{k=1}^{d−m} E[∧_{j=1}^{k} B_j] + m E[∧_{j=1}^{d} B_j] ) P(Ω_m),   (42)

where in the last inequality we used the fact that Ω_m is independent of B̄. Finally, since the bound in (42) is finite and uniform over δ̄ ∈ D_S, we obtain

λ_m = K / sup_{δ̄∈D_S} { Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)] }
≥ K / Σ_{m=0}^{d} ( Σ_{k=1}^{d−m} E[∧_{j=1}^{k} B_j] + m E[∧_{j=1}^{d} B_j] ) P(Ω_m) = λ_lb,

which concludes the proof.

IV. SIMULATION
In this section we present simulation results which shed some light on the behaviour of the stability region of R(d), our lower bound λ_lb and the known lower bound 1/E[∧_{k=1}^{d} B_k].
We consider the R(d) system with K = 10 servers, working according to the FIFO service discipline. The service time distribution of tasks, B̄, is comprised of i.i.d. random variables B_1, …, B_{10}, each taking one of two values in N, the smaller with high probability. Since the smaller value exceeds K, the time scaling condition (2) holds, and thus the stability region is a subset of [0, 1) for all values of d. For each value of d ∈ {1, …, 10} we run simulations over a large number of time slots for different loads (namely, values of the arrival rate λ) in [0, 1). The number of time slots was chosen such that the difference in the outputs of different runs at the maximal load was negligible.
For each simulation run corresponding to a specific (d, λ) pair, we calculate the running average workload in the system (over all time slots, after an initial duration required for convergence). Whenever the Markov chain is positive recurrent (i.e. the system is stable), it is also ergodic. Thus the running average workload converges, and one simulation run is enough to calculate the steady-state average workload.
The idea is that for values of d where the stability region is not known, the dependence of the steady-state average workload on λ, and, specifically, the loads at which it becomes very large, suffices as an approximation of the actual stability region. We also calculate our lower bound λ_lb and the known lower bound 1/E[∧_{k=1}^{d} B_k]. Figure 1 depicts the results.
The stability region for d = 1 is marked by the vertical line 'ob-1' (where 'ob' stands for 'our bound') and equals approximately 0.52. The simulation indicates that the stability region for d = 2 is the largest and equals approximately 0.6. The stability region for d = 3 is still larger than that for d = 1 and equals approximately 0.56.
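A minimal version of such a simulation, built on the workload recursion of Proposition II.2, can be sketched as follows (our own code; the two-point service values 11 and 20 are illustrative stand-ins chosen so that condition (2) holds, not the paper's exact values):

```python
import random

def avg_workload(K, d, lam, T=100_000, seed=1):
    """Simulate the R(d) workload recursion of Prop. II.2 for T slots
    and return the running average per-server workload."""
    rng = random.Random(seed)
    W = [0] * K
    total = 0.0
    for _ in range(T):
        inc = [0] * K
        if rng.random() < lam:                  # Bernoulli arrival A(t)
            G = rng.sample(range(K), d)         # d distinct servers, uniform
            # Illustrative i.i.d. two-point service times (values are ours).
            B = [11 if rng.random() < 0.9 else 20 for _ in range(K)]
            for i in G:
                # A_i(t) = min over j in G of [B_j + W_j - W_i]^+, Eq. (6)
                inc[i] = min(max(B[j] + W[j] - W[i], 0) for j in G)
        W = [max(W[i] + inc[i] - 1, 0) for i in range(K)]
        total += sum(W) / K
    return total / T
```

Sweeping λ over [0, 1) for each d and locating the load at which the returned average blows up approximates the stability region, as in Figure 1.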
For larger values of d the stability region decreases substantially, until reaching around 0.1 for d = 10. The non-monotone behaviour of the stability region with respect to d is evidence of why it is challenging to study.
On the one hand, our bound is not tight. For example, ob-2 marks our bound for d = 2 and equals almost half of the actual stability region. On the other hand, it is much better than the known lower bound for all 1 ≤ d < K. Another interesting result of the simulation is that the steady-state average workload decreases substantially for d > 1 compared to the case d = 1 (no replication). In fact, it can be seen that most of the improvement is achieved by using d = 2 instead of d = 1. If the system under consideration is currently working at around 0.3 load, then our lower bound guarantees that the system remains stable for d = 2 while obtaining the benefits of replication.

Fig. 1: Steady-state average workload vs. load (values of the arrival rate λ) for K = 10 servers and different values of d. The vertical lines marked 'ob-i' and 'kb-i' stand for our lower bound λ_lb and the known lower bound 1/E[∧_{k=1}^{d} B_k], respectively, for d = i. Only kb-1 and kb-10 are shown because all other values of kb are between them. The values of ob-8 and ob-9 are not shown and satisfy ob-10 < ob-9 < ob-8 < ob-7.

Next, to further compare the lower bounds, we consider the R(d) system with K = 30 servers. Instead of choosing a specific distribution for B̄, we specify the connection between the different expected values needed to calculate the lower bounds. Figure 2(a) depicts the results for the case where E[∧_{k=1}^{d} B_k] = K/d^{1.1}, d ∈ {1, …, K}, and Figure 2(b) depicts the results for the case where E[∧_{k=1}^{d} B_k] = 2K/d^{1.1}, d ∈ {1, …, K}. As mentioned in the introduction, neither bound implies the other, and their values depend strongly on the distribution of B̄ and the value of d. Taking the maximum of the two lower bounds yields a new and improved lower bound.

ACKNOWLEDGMENT
The author would like to thank Rami Atar, Isaac Keslassy and Shay Vargaftik for their useful feedback. This research was supported in part by the Hasso Plattner Institute.

REFERENCES

[1] G. Joshi, Y. Liu, and E. Soljanin, "Coding for fast content download," pp. 326–333, IEEE, 2012.
[2] J. Dean and L. A. Barroso, "The tail at scale," Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
[3] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, "Effective straggler mitigation: Attack of the clones," in NSDI, pp. 185–198, 2013.
[4] A. Vulimiri, P. B. Godfrey, R. Mittal, J. Sherry, S. Ratnasamy, and S. Shenker, "Low latency via redundancy," in Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies, pp. 283–294, ACM, 2013.
[5] K. Gardner, M. Harchol-Balter, A. Scheller-Wolf, M. Velednitsky, and S. Zbarsky, "Redundancy-d: The power of d choices for redundancy," Operations Research, vol. 65, no. 4, pp. 1078–1094, 2017.
[6] K. Gardner, S. Zbarsky, S. Doroudi, M. Harchol-Balter, E. Hyytiä, and A. Scheller-Wolf, "Queueing with redundant requests: exact analysis," Queueing Systems, vol. 83, no. 3-4, pp. 227–259, 2016.
[7] Y. Raaijmakers, S. Borst, and O. Boxma, "Redundancy scheduling with scaled Bernoulli service requirements," Queueing Systems, vol. 93, no. 1-2, pp. 67–82, 2019.
[8] E. Anton, U. Ayesta, M. Jonckheere, and I. M. Verloop, "On the stability of redundancy models," arXiv preprint arXiv:1903.04414, 2019.
[9] R. Srikant and L. Ying, Communication Networks: An Optimization, Control, and Stochastic Networks Perspective. Cambridge University Press, 2013.
Fig. 2: (a) Our lower bound and the known lower bound for K = 30 and different values of d, for the case where E[∧_{k=1}^{d} B_k] = K/d^{1.1}. (b) The same for the case where E[∧_{k=1}^{d} B_k] = 2K/d^{1.1}.