A Lower Bound on the Stability Region of Redundancy-d with FIFO Service Discipline

Gal Mendelson
Abstract—Redundancy-d (R(d)) is a load balancing method used to route incoming jobs to K servers, each with its own queue. Every arriving job is replicated into 1 ≤ d ≤ K tasks, which are then routed to d servers chosen uniformly at random. When the first task finishes service, the remaining d − 1 tasks are cancelled and the job departs the system.
Despite the fact that R(d) is known, under certain conditions, to substantially improve job completion times compared to not using redundancy at all, little is known about a more fundamental performance criterion: what is the set of arrival rates under which the R(d) queueing system with FIFO service discipline is stable? In this context, due to the complex dynamics of systems with redundancy and cancellations, existing results are scarce and are limited to very special cases with respect to the joint service time distribution of tasks.
In this paper we provide a non-trivial, closed form lower bound on the stability region of R(d) for a general joint service time distribution of tasks with finite first and second moments. We consider a discrete time system with Bernoulli arrivals and assume that jobs are processed in their order of arrival. We use the workload processes and a quadratic Lyapunov function to characterize a set of arrival rates for which the system is stable. While simulation results indicate our bound is not tight, it provides an easy-to-check performance guarantee.

Index Terms—Redundancy Routing, Job Replication, Job Cancellation, Stability, Lyapunov Stability.
I. INTRODUCTION

Gal Mendelson is with the Electrical Engineering Faculty, Technion, Israel.

Redundancy and cancellation based routing has attracted much attention in the last decade [1]–[7]. The basic motivation behind using redundancy and cancellation is reducing the tail of the job completion time distribution. The idea is to replicate a job and send its copies, referred to as tasks, to different servers for processing. When the first task is finished being processed, the job is deemed complete and leaves the network. The premise is that allowing copies of a job to traverse different paths in the network makes it highly improbable that all of the copies experience large queueing delay and/or processing time.

Implementing such redundancy and cancellation mechanisms incurs an overhead which can include software, hardware, control, memory and computational power. The performance-cost trade-off of these schemes may well be worthwhile, since the potential benefit in terms of performance is known in some cases to be substantial [1]–[5].

In this paper we are concerned with a specific scheme called Redundancy-d (R(d)), used to route incoming jobs to K servers, working at rate µ, each with its own queue. Within each server, service is given by order of arrival (FIFO). Every arriving job is replicated into 1 ≤ d ≤ K tasks, which are then routed to d distinct servers chosen uniformly at random. When the first task finishes service, the remaining d − 1 tasks are cancelled and the job departs the system.

Our main research question is concerned with a first order performance criterion of R(d): what is the set of arrival rates under which the R(d) system is stable? We refer to this set as the stability region. Here and throughout we refer to a system as stable if the underlying Markov chain describing the load (e.g. queue lengths, workloads) in the system is positive recurrent.

For policies with no redundancy, such as random routing or 'join the shortest queue', it is well known that the queueing system is stable as long as the arrival rate λ satisfies λ ∈ [0, Kµ). For R(d), denoting by B_1, …, B_d the service time requirements of the d tasks belonging to a single job, such that E[B_i] = µ^{-1}, the stability region is known exactly only in two special cases:

(i) B_1, …, B_d are independent and exponentially distributed. Then the stability region remains λ ∈ [0, Kµ) for all values of d [6].

(ii) d = K (full redundancy). Then the stability region of R(K) is λ ∈ [0, 1/E[∧_{i=1}^{K} B_i]) [7].

The only other closed form result the author is aware of is a lower bound on the stability region of R(d) stating that if λ ∈ [0, 1/E[∧_{i=1}^{d} B_i]) then the system is stable [7]. This lower bound is tight for d = K but is limited in general because it does not depend on K. Finally, the authors of [8] implicitly characterize the exact stability region in the case where B_1, …, B_d are identical (i.e. B_1 = … = B_d) and exponentially distributed. The stability condition is given in terms of the mean number of jobs in service in an associated 'saturated' system. Using this result, our lower bound can be easily derived for this special case.

Characterizing the stability region of the R(d) queueing system for general 1 < d < K remains open, which motivates the lower bound developed in this paper.

II. MODEL
Consider a time-slotted system with a single dispatcher and K homogeneous servers. Each server has an infinite-size buffer in which a queue can form, and the servers do not idle when there is work in the buffer. In every time slot, each server completes a single unit of service if it has work to do, and service is given by order of arrival, i.e. FIFO. We assume a server can work on a job that has just arrived.
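These per-slot dynamics of a single server can be sketched as follows (a minimal illustration in our own notation, not code from the paper):

```python
def server_step(workload, incoming_work):
    """One time slot of a single FIFO server: work arriving in the slot
    is queued, then one unit of service is completed if the buffer is
    non-empty (the server never idles while it has work)."""
    return max(workload + incoming_work - 1, 0)
```

For example, `server_step(3, 2)` returns 4: three queued units plus two arriving units, minus the one unit served in the slot.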
Arrival. At each time slot t ∈ N, a job arrives to the dispatcher with probability 0 < λ < 1, according to the value of a Bernoulli random variable (RV) A(t), such that E[A(t)] = λ.

Routing.
When a job arrives, the dispatcher immediately sends d replicas of the job, where 1 ≤ d ≤ K, to d distinct servers. We refer to these replicas as tasks. Denote by G_d the set of all d-sized subsets of [K]. For each t, denote by G_d(t) a set-valued RV taking values in G_d with equal probability. If a job arrives at time slot t, then G_d(t) determines which d servers will receive its tasks. We assume that the G_d(t) are independent and identically distributed (i.i.d.) across time slots. When the first of the job's tasks finishes service, the remaining d − 1 tasks are immediately cancelled, and this marks the job's departure time from the system. For completeness, in the case where several tasks are completed at the exact same time, we refer to the task in the smallest indexed server as completed and to the rest as cancelled.

Service time distribution.
If a job arrives at time slot t, the service duration requirements for its tasks are determined by the random vector B̄(t) = (B_1(t), …, B_K(t)), whose members take values in N. The quantity B_i(t) represents the service time requirement of the task that is to be sent to server i, provided i is a member of G_d(t). Denote B̄ = B̄(1).
The homogeneity of the servers is captured by the assumption that the distribution of B̄ is symmetric with respect to its members, such that the joint distribution of any subset of B̄ coincides with that of any other subset of the same size. Formally,

Assumption II.1 (Homogeneity). For every k ∈ [K], {i_1, …, i_k} ⊂ [K] and {j_1, …, j_k} ⊂ [K] such that i_1 < … < i_k and j_1 < … < j_k, the random vectors (B_{i_1}, …, B_{i_k}) and (B_{j_1}, …, B_{j_k}) have the same distribution.
To avoid the possibility of λ = 1 being inside the stability region, we scale time appropriately. This translates to an assumption on B̄ as follows. The average amount of work that enters the system upon a job's arrival is bounded from below by the average of the minimum of the service requirements of its tasks, i.e. E[∧_{i=1}^{d} B_i]. Note that it does not matter which d members we take, due to Assumption II.1. The servers clear at most K units of work per time slot, so we require

E[∧_{i=1}^{d} B_i] > K,   (2)

such that the largest possible λ for which the system is stable is strictly smaller than 1.

Workload.
Denote by W_i(t) the workload in buffer i at time slot t, after a possible arrival and service. This is defined as the amount of time it will take (in time slots) for the existing tasks in the buffer (including the one in service, if there is any) to leave the system. Note that tasks can leave the system either due to service completion or due to cancellation. Also, due to the FIFO service discipline, the workloads depend only on the tasks currently present in the buffers and not on future arrivals. This is not true for other service disciplines such as last-in-first-out or processor sharing.
Denote W̄(t) = (W_1(t), …, W_K(t)) and W̄ = {W̄(t)}_{t∈N}. We refer to W̄ as the workload process. We assume that the system starts empty, i.e.

W_i(0) = 0, ∀i ∈ [K].   (3)

Markov chain formulation.
We turn to analyze the dynamics of the workload process. To this end, denote

∆_{i,j}(t) = W_j(t) − W_i(t),   (4)

and let 1_i(t) denote the indicator RV which equals 1 if server i is a member of the d chosen servers at time slot t. Then

E[1_i(t)] = P(i ∈ G_d(t)) = (K−1 choose d−1)/(K choose d) = d/K.   (5)

Proposition II.2.
For each i ∈ [K] we have

W_i(t) = [W_i(t−1) + A(t)A_i(t) − 1]^+,

where

A_i(t) = 1_i(t) ( ∧_{j∈G_d(t)} [B_j(t) + ∆_{i,j}(t−1)]^+ ).   (6)

Proof.
The first part of (6) is a standard balance equation. The workload at time t equals the workload at time t − 1 plus arrival minus service, and is kept non-negative. The second part of (6) is less trivial and captures the complexity of the R(d) model with FIFO service discipline.
The quantity A_i(t) represents the total amount of work server i receives at time slot t, provided a job arrives. If i ∉ G_d(t) then 1_i(t) = 0 and we have A_i(t) = 0. Otherwise, the amount of work that server i receives depends on the workload in the other d − 1 members of G_d(t), as well as the service time requirements of the arriving job's tasks. Considering the FIFO service discipline within each server and the definition of the workloads, the task that reaches a server first out of the d tasks is the one that is sent to the server with the least workload. Considering the cancellation mechanism, the task that finishes processing first out of the d tasks is the one that is sent to the server j for which W_j(t−1) + B_j(t) is minimal. Denote this server by j* = argmin_{j∈G_d(t)} {W_j(t−1) + B_j(t)}, where in the case of several minimizers, the smallest index is returned.
If B_{j*}(t) + W_{j*}(t−1) ≤ W_i(t−1), or, written differently using (4), B_{j*}(t) + ∆_{i,j*}(t−1) ≤ 0, the task that arrives to server i will be cancelled before being processed and A_i(t) = 0. If B_{j*}(t) + ∆_{i,j*}(t−1) > 0, then by the definition of j* the task in server i will be cancelled at the exact same time server j* completes its task. Thus, the workload at server i will be truncated and will equal W_{j*}(t−1) + B_{j*}(t), and the amount of work server i receives equals W_{j*}(t−1) + B_{j*}(t) − W_i(t−1) = B_{j*}(t) + ∆_{i,j*}(t−1). Note that ∆_{i,j*}(t−1) may be negative. Overall, we obtain

A_i(t) = 1_i(t)[B_{j*}(t) + ∆_{i,j*}(t−1)]^+.   (7)

To connect (7) with (6), we argue that

∧_{j∈G_d(t)} [B_j(t) + ∆_{i,j}(t−1)]^+ = [B_{j*}(t) + ∆_{i,j*}(t−1)]^+.   (8)

Indeed, if B_{j*}(t) + ∆_{i,j*}(t−1) ≤ 0, the right hand side of (8) is zero, and since j* ∈ G_d(t), the left hand side of (8) equals zero as well. If B_{j*}(t) + ∆_{i,j*}(t−1) > 0, then by the definition of j*, for every j ∈ G_d(t) we have

B_j(t) + ∆_{i,j}(t−1) = B_j(t) + W_j(t−1) − W_i(t−1) ≥ B_{j*}(t) + W_{j*}(t−1) − W_i(t−1) = B_{j*}(t) + ∆_{i,j*}(t−1) > 0,

which completes the proof.

Define the state space S of W̄ as all members of Z_+^K that can be reached from an empty state. Equations (3)–(6) uniquely define the process W̄ as a Markov chain on S. Since P(A(t) = 0) > 0, all states in S communicate and the empty state has a self transition. Therefore, W̄ is irreducible and aperiodic.

Remark II.3.
Since we have assumed the system starts empty (3), by Property 1 in [7], the d largest workloads are always equal. Thus the largest d components of any member of S must be equal.

Remark II.4.
The R(d) policy routes tasks to d servers chosen uniformly at random. It does not use workload or service time information, which we use only for modelling and stability analysis.

III. MAIN RESULTS
We first state our results, then discuss them in detail. The proofs immediately follow.
A. Statements
Our first result identifies a non-trivial lower bound on the stability region, given as the solution of a certain minimization problem. This lower bound, which we denote by λ_m, satisfies that if λ ∈ [0, λ_m) then W̄ is positive recurrent. Our second result, which is the main result of this paper, is a closed form formula for a lower bound λ_lb on the stability region, satisfying λ_lb ≤ λ_m. To this end, define the space of ordered states in S by

S↑ = {s̄ ∈ S : s_1 ≤ … ≤ s_{K−d+1} = … = s_K},   (9)

and define the space of vectors capturing the differences between the coordinates of members of S↑ by

D_S = {δ̄ ∈ Z_+^{K−1} : ∃ s̄ ∈ S↑ such that δ_i = s_{i+1} − s_i, for 1 ≤ i ≤ K−1}.   (10)

Note that δ_{K−d+1} = … = δ_{K−1} = 0. For δ̄ ∈ D_S, define

δ_{i,j} = Σ_{k=i}^{j−1} δ_k.   (11)

For ease of notation, denote

(G_d, B_j, A, 1_i) = (G_d(1), B_j(1), A(1), 1_i(1)).   (12)

Define

λ_m = inf_{δ̄∈D_S} { K / Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)] }.   (13)

Proposition III.1. If λ ∈ [0, λ_m) then W̄ is positive recurrent.

We now state the main result of this paper. To this end, for 0 ≤ m ≤ d, define

Ω_m = {choosing m out of the d largest workloads}.   (14)

A simple calculation yields

P_m := P(Ω_m) = (K−d choose d−m)(d choose m)/(K choose d).   (15)

Define

λ_lb = K / Σ_{m=0}^{d} ( Σ_{j=1}^{d−m} E[∧_{k=1}^{j} B_k] + m E[∧_{k=1}^{d} B_k] ) P_m.   (16)
Theorem III.2. Let λ_m and λ_lb be defined as in (13) and (16), respectively. Then 0 < λ_lb ≤ λ_m < 1. Specifically, if λ ∈ [0, λ_lb) then W̄ is positive recurrent.

B. Intuition and discussion

Before proving Proposition III.1 and Theorem III.2, we discuss (13) and (16) in detail.
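The probabilities P_m in (15) are elementary to compute; the sketch below (our own code and naming, not from the paper) evaluates them and checks that the events Ω_0, …, Ω_d partition the sample space:

```python
from math import comb

def p_omega(K, d, m):
    """P(Ω_m) from (15): the probability that exactly m of the d chosen
    servers are among the d servers with the largest workloads."""
    return comb(K - d, d - m) * comb(d, m) / comb(K, d)

# Sanity check via the Vandermonde identity: the Ω_m partition the
# sample space, so the probabilities must sum to one.
K, d = 10, 3
assert abs(sum(p_omega(K, d, m) for m in range(d + 1)) - 1.0) < 1e-12
```

Note that `math.comb` returns 0 when the lower index exceeds the upper one, which correctly handles the values of m for which Ω_m is impossible.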
Intuition for λ_m. The basic idea is as follows. By (6), the average amount of incoming work at time slot t is given by

E[A(t) Σ_{i∈[K]} A_i(t)] = λ E[ Σ_{i∈[K]} 1_i(t) (∧_{j∈G_d(t)} [B_j(t) + ∆_{i,j}(t−1)]^+) ].   (17)

This quantity depends on the state at the end of time slot t−1 only through {∆_{i,j}(t−1)}, i.e. the differences between the workloads. We require that for all relevant values of {∆_{i,j}(t−1)} the right hand side of (17) is less than or equal to K, and find the largest λ for which this still holds. This is a notion of sub-criticality. The challenge is to prove that this is sufficient for stability of the R(d) system.
The infimum in (13) is taken over D_S and not Z_+^{K−1} for two reasons. First, by the symmetry of the servers, it is sufficient to consider only ordered states in S of the form s_1 ≤ s_2 ≤ … ≤ s_K, i.e. ∆_{i,j}(t−1) ∈ Z_+ whenever i ≤ j. Second, all states reached by W̄ must have that the d largest workloads are equal, namely ∆_{K−d+1,j}(t−1) = 0 for j = K−d+1, …, K.

Intuition for λ_lb. Suppose a job arrives and d corresponding tasks are sent to d distinct servers. Further suppose that exactly m of the servers with the largest workloads are chosen, where 0 ≤ m ≤ d. For simplicity, assume that the chosen servers are {1, …, d}, such that W_1 ≤ … ≤ W_d, and that the m servers d−m+1, …, d have the largest workload in the system, implying W_{d−m+1} = W_{d−m+2} = … = W_d. Then server 1 receives at most B_1 units of work, server 2 at most B_1 ∧ B_2, and so on up to server d−m, which receives at most B_1 ∧ … ∧ B_{d−m}. The last m servers each receive at most B_1 ∧ … ∧ B_d. Taking expectations and summing over the possible values of m gives an upper bound on the expected amount of work that enters the system upon a job's arrival, which yields the closed form expression of λ_lb in (16).

Why we included λ_m in this paper. One can prove that λ_lb is a lower bound on the stability region without resorting to λ_m at all. This requires a minor modification of the proof of Proposition III.1 and some elements from the proof of Theorem III.2.
However, we feel that λ_m is interesting in its own right. Indeed, given K, d and the distribution of B̄, if one can solve for λ_m numerically then one may obtain a tighter bound than λ_lb. But this is not trivial. First, the set D_S is infinite. Second, different distributions of B̄ may change the set of possible states the system can reach, resulting in different sets D_S which, in turn, may be difficult to characterize. Third, for each member of D_S, one must explicitly calculate the expected value of the minimum of several functions of B̄. This may be computationally expensive for certain distributions. We leave this as an open problem.

The special case of d = 1 (no replication). In this case |G_d(t)| = 1 and every incoming job is randomly routed to one of the servers with equal probability. Together with the fact that by definition ∆_{i,i}(t−1) = 0, ∀i ∈ [K], Equation (17) reduces to

E[A(t) Σ_{i∈[K]} A_i(t)] = λ E[ Σ_{i∈[K]} 1_i(t) B_i(t) ] = λ E[B_1(t)],

and the solution to (13) is given by λ_m = K/E[B_1] = Kµ, as expected. As for λ_lb, substituting d = 1 in (16) and using the convention that Σ_{j=1}^{0} = 0 yields λ_lb = λ_m.

The special case of d = K (full replication). In this case the workloads of all of the servers are equal and all servers are chosen for each job. Thus P_K = 1, and by (16), λ_lb = 1/E[∧_{i=1}^{K} B_i], as expected.

Example calculation.
Consider the case where K = 3, d = 2 and the joint service time distribution B̄ = (B_1, B_2, B_3) satisfies that B_1, B_2 and B_3 are i.i.d. and

B_1 = α with probability p, and β with probability 1 − p,

where 0 ≤ p ≤ 1 and α, β ∈ N. By (15) we have P_0 = 0, P_1 = 2/3 and P_2 = 1/3, and a straightforward calculation yields

E[B_1] = αp + β(1 − p),
E[B_1 ∧ B_2] = α(1 − (1 − p)²) + β(1 − p)².

Thus, by (16) we have

λ_lb = 3 / ( (E[B_1] + E[B_1 ∧ B_2])P_1 + 2E[B_1 ∧ B_2]P_2 )
     = 3 / ( (2/3)(E[B_1] + 2E[B_1 ∧ B_2]) )
     = 9 / ( 2(αp(5 − 2p) + β(1 − p)(3 − 2p)) ).

We use the following two results in the proofs that follow.
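Formula (16) is straightforward to evaluate numerically. The sketch below (our own code; the values α = 4, β = 8, p = 1/2 are illustrative, chosen so that E[B_1 ∧ B_2] > K) computes λ_lb for this example via (15)–(16) and checks it against the closed form:

```python
from math import comb

def lambda_lb(K, d, E_min):
    """λ_lb from (15)-(16); E_min[j] must hold E[B_1 ∧ ... ∧ B_j]."""
    P = [comb(K - d, d - m) * comb(d, m) / comb(K, d) for m in range(d + 1)]
    denom = sum((sum(E_min[j] for j in range(1, d - m + 1)) + m * E_min[d]) * P[m]
                for m in range(d + 1))
    return K / denom

alpha, beta, p = 4, 8, 0.5                # illustrative two-point service times
E_B = alpha * p + beta * (1 - p)          # E[B_1]
E_min2 = alpha * (1 - (1 - p) ** 2) + beta * (1 - p) ** 2  # E[B_1 ∧ B_2]

lb = lambda_lb(K=3, d=2, E_min={1: E_B, 2: E_min2})
closed_form = 9 / (2 * (alpha * p * (5 - 2 * p) + beta * (1 - p) * (3 - 2 * p)))
assert abs(lb - closed_form) < 1e-12      # both equal 9/32 here
```

With these values E[B_1] = 6 and E[B_1 ∧ B_2] = 5, giving λ_lb = 9/32 = 0.28125 from both expressions.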
Proposition III.3 (Balance). In a R(d) system with all-zero initial condition, the d largest workloads are equal at all times.

Proof. The proof is given in Property 1 in [7].
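This balance property can be checked empirically against the workload recursion (6); the following sketch (our own code, with illustrative parameters) verifies that the d largest workloads stay equal along a random trajectory started from the empty state:

```python
import random

def rd_step(W, B, G):
    """One arrival step of the R(d) workload recursion (Prop. II.2),
    given the chosen server set G and service vector B, followed by one
    unit of service at every non-empty server; our own sketch."""
    inc = {i: min(max(B[j] + W[j] - W[i], 0) for j in G) for i in G}
    return [max(W[i] + inc.get(i, 0) - 1, 0) for i in range(len(W))]

rng = random.Random(0)
K, d = 5, 3
W = [0] * K                               # empty initial condition (3)
for _ in range(2000):
    if rng.random() < 0.4:                # Bernoulli arrival
        G = rng.sample(range(K), d)
        B = [rng.randint(6, 12) for _ in range(K)]
        W = rd_step(W, B, G)
    else:
        W = [max(w - 1, 0) for w in W]    # service only
    top = sorted(W)[-d:]
    assert top.count(top[0]) == d         # d largest workloads are equal
```

The assertion inside the loop is exactly the statement of Proposition III.3 along the simulated trajectory.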
Proposition III.4 (Average workload). Consider a R(d) system with K servers. Let s̄ = (s_1, …, s_K) ∈ S be such that s_1 ≤ … ≤ s_K. Let s_{i,j} = s_j − s_i. Denote

f_i(s̄) = E[1_i (∧_{j∈G_d} [B_j + s_{i,j}]^+)].

Then

E[A_i(1) | W̄(0) = s̄] = f_i(s̄),   (18)

and

f_1(s̄) ≥ … ≥ f_K(s̄).   (19)

Proof.
Relation (18) is an immediate consequence of (6) and (12). We now prove that for any i ∈ {1, …, K−1} we have f_i(s̄) ≥ f_{i+1}(s̄), from which relation (19) follows. This simply means that the average incoming workload is monotone non-increasing when considering servers ordered by their workload.
Consider the four possibilities describing whether or not i and i+1 are members of G_d. Under the event that neither i nor i+1 is in G_d, both receive zero work. If both are in G_d, then i+1 cannot receive more work than i, due to the minimum taken in (6) and the ordering s_i ≤ s_{i+1}. Finally, for any event under which i is chosen and i+1 is not, there is an event with equal probability where i is not chosen and i+1 is, with the remaining d−1 servers the same, and vice versa. Again by (6), the average amount of work that enters server i under the first event is no less than the average amount of work that enters server i+1 under the second event. This concludes the proof.

C. Proofs of main results
Proof of Proposition III.1. Since W̄ is irreducible and aperiodic, by Theorem 3.3.7 of [9], it suffices to prove that if λ < λ_m, then a Lyapunov drift condition holds. Namely, that there exist a function L : S → R_+, a finite set F ⊂ S and constants ε, C > 0 such that

E[∆L(t+1) | W̄(t) = s̄] ≤ −ε if s̄ ∈ S \ F, and E[∆L(t+1) | W̄(t) = s̄] ≤ C if s̄ ∈ F,   (20)

where ∆L(t+1) denotes the drift at time slot t+1, namely

∆L(t+1) = L(W̄(t+1)) − L(W̄(t)).   (21)

While we assumed in (3) that the system starts empty, with a slight abuse of notation and for simplicity, in what follows we suppress the dependence on t by writing W̄(0) and W̄(1) instead of W̄(t) and W̄(t+1), respectively. We choose the quadratic function

L(s̄) = Σ_{i=1}^{K} s_i².   (22)

Since the members of B̄ have finite first and second moments, it is trivial that the left hand side of (20) is bounded from above by some positive constant C, uniformly over all s̄ ∈ F, for any finite set F ⊂ S. We omit the details.
Next, we choose the finite set F to be of the form

F = {s̄ = (s_1, …, s_K) ∈ S : max_{1≤i≤K} s_i < C_1},   (23)

where C_1 > 0 is a large enough constant whose value is determined later in the proof.
Consider a state s̄ ∈ S \ F. By the homogeneity of the servers, the symmetric distribution of B̄ and the uniformly at random routing choice, we can, without loss of generality, consider s̄ = (s_1, …, s_K) such that 0 ≤ s_1 ≤ … ≤ s_K. By the definition of F in (23), we have

s_K > C_1,   (24)

and by Proposition III.3, s_{K−d+1} = … = s_K. By (6), using the notation in (12) and denoting A_i = A_i(1), we have W_i(1) = [s_i + A·A_i − 1]^+ and therefore

W_i(1)² ≤ s_i² + 2s_i(A·A_i − 1) + (A·A_i − 1)².   (25)

Using (18), (21), (22) and (25) we obtain

E[∆L(1) | W̄(0) = s̄] = Σ_{i=1}^{K} E[W_i(1)² − W_i(0)² | W̄(0) = s̄]
≤ 2 Σ_{i=1}^{K} s_i (λ E[A_i | W̄(0) = s̄] − 1) + C_2
= 2 Σ_{i=1}^{K} s_i (λ f_i(s̄) − 1) + C_2,   (26)

where the constant C_2 > 0 satisfies

Σ_{i∈[K]} E[(A·A_i − 1)²] ≤ Σ_{i∈[K]} E[(Σ_{j=1}^{d} B_j)²] ≤ C_2.   (27)

The existence of C_2 is due to the finite first and second moments of the members of B̄.
Denote

d_i = λ f_i(s̄) − 1.   (28)

The argument proceeds by analyzing Σ_{i=1}^{K} s_i d_i. First, by (19), d_1 ≥ … ≥ d_K. Second, since λ < λ_m, there exists ε_1 > 0 such that λ = λ_m − ε_1. Thus

Σ_{i=1}^{K} d_i = λ Σ_{i=1}^{K} f_i(s̄) − K = (λ_m − ε_1) Σ_{i=1}^{K} f_i(s̄) − K ≤ −ε_1 Σ_{i=1}^{K} f_i(s̄) < 0,

where in the last inequality we have used the definition of λ_m in (13). Denote by k the lowest index in [K] such that Σ_{i=1}^{k} d_i < 0, namely

k = min{j ∈ [K] : Σ_{i=1}^{j} d_i < 0}.   (29)

Hence

Σ_{i=1}^{j} d_i ≥ 0, ∀j < k,   (30)

and

0 > d_k ≥ … ≥ d_K.   (31)

Next, we argue by induction that

Σ_{i=1}^{j−1} s_i d_i ≤ s_j Σ_{i=1}^{j−1} d_i, ∀j ∈ {1, …, k},   (32)

with the convention that Σ_{i=1}^{0} = 0. Inequality (32) holds trivially for j = 1. Suppose it holds for j = j_0 < k. Then

Σ_{i=1}^{j_0} s_i d_i = Σ_{i=1}^{j_0−1} s_i d_i + s_{j_0} d_{j_0} ≤ s_{j_0} Σ_{i=1}^{j_0−1} d_i + s_{j_0} d_{j_0} = s_{j_0} Σ_{i=1}^{j_0} d_i ≤ s_{j_0+1} Σ_{i=1}^{j_0} d_i,

where the first inequality is due to the induction hypothesis and the second is due to (30) and s_{j_0} ≤ s_{j_0+1}. Taking j = k in (32) yields

Σ_{i=1}^{k−1} s_i d_i ≤ s_k Σ_{i=1}^{k−1} d_i,

and therefore

Σ_{i=1}^{K} s_i d_i ≤ s_k Σ_{i=1}^{k−1} d_i + Σ_{i=k}^{K} s_i d_i = s_k Σ_{i=1}^{k} d_i + Σ_{i=k+1}^{K} s_i d_i.   (33)

By the definition of k in (29) and by (31), the coefficients Σ_{i=1}^{k} d_i, d_{k+1}, …, d_K, multiplying s_k, s_{k+1}, …, s_K respectively, are strictly negative. Define

γ = max{ Σ_{i=1}^{k} d_i, d_{k+1}, …, d_K } < 0.   (34)

By (33) and (34) we obtain

Σ_{i=1}^{K} s_i d_i ≤ γ(s_k + s_{k+1} + … + s_K) ≤ γ s_K ≤ γ C_1,   (35)

where in the last inequality we have used (24). Combining (26), (28) and (35) we obtain

E[∆L(1) | W̄(0) = s̄] ≤ 2γC_1 + C_2,

where we recall from (34) that γ < 0.
Finally, we determine ε and F in (20). Given the primitive arrival and service processes, the constants C_2 in (27) and γ in (34) are given. Fix some ε > 0 and choose C_1 (which defines F) to be large enough such that 2γC_1 + C_2 < −ε. This concludes the proof.

Proof of Theorem III.2. Recall that by (13),

λ_m = inf_{δ̄∈D_S} { K / Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)] },

where D_S is given in (10). Denote by G_d(k) the k-th member of G_d, such that

G_d(1) < … < G_d(d).   (36)

With this notation at hand we can write

Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)] = Σ_{k=1}^{d} E[∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+].   (37)

Proof that λ_m < 1. Keeping only the k = 1 term of the right hand side of (37) yields

Σ_{k=1}^{d} E[∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+] ≥ E[∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(1),G_d(j)}]^+].   (38)

By the definitions of D_S, δ_{i,j} and {G_d(1), …, G_d(d)} in (10), (11) and (36), respectively, we have that δ_{G_d(1),G_d(j)} ≥ 0. Therefore

E[∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(1),G_d(j)}]^+] ≥ E[∧_{j=1}^{d} B_{G_d(j)}] = E[∧_{j=1}^{d} B_j],   (39)

where the last transition is due to the symmetry assumed in Assumption II.1 and the fact that G_d and B̄ are independent. Combining (37), (38) and (39) yields

Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)] ≥ E[∧_{j=1}^{d} B_j] > 0.

Since this bound holds for every δ̄ ∈ D_S, using (13) we obtain λ_m ≤ K/E[∧_{j=1}^{d} B_j] < 1, where the last transition is due to the time scaling assumption in (2).

Proof that λ_lb ≤ λ_m. Fix δ̄ ∈ D_S. On Ω_m defined in (14), exactly m of the d largest workloads are members of G_d, and they are given by {G_d(d−m+1), …, G_d(d)}. By (10), (11) and (36) we have δ_{G_d(k),G_d(j)} ≤ 0 for k > min{j, d−m}. So, for 1 ≤ k ≤ d−m, we have

(∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+) 1_{Ω_m} ≤ (∧_{j=1}^{k} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+) 1_{Ω_m} ≤ (∧_{j=1}^{k} B_{G_d(j)}) 1_{Ω_m},   (40)

and for d−m+1 ≤ k ≤ d we have

(∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+) 1_{Ω_m} ≤ (∧_{j=1}^{d} B_{G_d(j)}) 1_{Ω_m}.   (41)

Using (37), (40) and (41) we obtain

Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)]
= E[ Σ_{k=1}^{d} (∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+) ]
= Σ_{m=0}^{d} E[ Σ_{k=1}^{d} (∧_{j=1}^{d} [B_{G_d(j)} + δ_{G_d(k),G_d(j)}]^+) 1_{Ω_m} ]
≤ Σ_{m=0}^{d} ( Σ_{k=1}^{d−m} E[∧_{j=1}^{k} B_j] + m E[∧_{j=1}^{d} B_j] ) P(Ω_m),   (42)

where in the last inequality we used the fact that Ω_m is independent of B̄. Finally, since the bound in (42) is finite and uniform over δ̄ ∈ D_S, we obtain

λ_m = K / sup_{δ̄∈D_S} { Σ_{i∈[K]} E[1_i (∧_{j∈G_d} [B_j + δ_{i,j}]^+)] }
≥ K / Σ_{m=0}^{d} ( Σ_{k=1}^{d−m} E[∧_{j=1}^{k} B_j] + m E[∧_{j=1}^{d} B_j] ) P(Ω_m) = λ_lb,

which concludes the proof.

IV. SIMULATION
In this section we present simulation results which shed some light on the behaviour of the stability region of R(d), our lower bound λ_lb and the known lower bound 1/E[∧_{k=1}^{d} B_k].
We consider the R(d) system with K = 10 servers, working according to the FIFO service discipline. The service time distribution of tasks, B̄, is comprised of i.i.d. random variables B_1, …, B_{10}, each taking one of two values in N, the smaller with high probability. Since the smaller value exceeds K, the time scaling condition (2) holds, and thus the stability region is a subset of [0, 1) for all values of d. For each value of d ∈ {1, …, 10} we run simulations over a large number of time slots for different loads (namely, values of the arrival rate λ) in [0, 1). The number of time slots was chosen such that the difference in the outputs of different runs at the maximal load was negligible.
For each simulation run corresponding to a specific (d, λ) pair, we calculate the running average workload in the system (over all time slots, after an initial duration required for convergence). Whenever the Markov chain is positive recurrent (i.e. the system is stable), it is also ergodic. Thus the running average workload converges, and one simulation run is enough to calculate the steady-state average workload.
The idea is that for values of d where the stability region is not known, the dependence of the steady-state average workload on λ, and, specifically, the loads at which it becomes very large, suffices as an approximation of the actual stability region. We also calculate our lower bound λ_lb and the known lower bound 1/E[∧_{k=1}^{d} B_k]. Figure 1 depicts the results.
The stability region for d = 1 is marked by the vertical line 'ob-1' (where 'ob' stands for 'our bound') and equals approximately 0.52. The simulation indicates that the stability region for d = 2 is the largest and equals approximately 0.6. The stability region for d = 3 is still larger than that for d = 1 and equals approximately 0.56.
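A minimal version of such a simulation, built on the workload recursion of Proposition II.2, can be sketched as follows (our own code; the two-point service values 11 and 20 are illustrative stand-ins chosen so that condition (2) holds, not the paper's exact values):

```python
import random

def avg_workload(K, d, lam, T=100_000, seed=1):
    """Simulate the R(d) workload recursion of Prop. II.2 for T slots
    and return the running average per-server workload."""
    rng = random.Random(seed)
    W = [0] * K
    total = 0.0
    for _ in range(T):
        inc = [0] * K
        if rng.random() < lam:                  # Bernoulli arrival A(t)
            G = rng.sample(range(K), d)         # d distinct servers, uniform
            # Illustrative i.i.d. two-point service times (values are ours).
            B = [11 if rng.random() < 0.9 else 20 for _ in range(K)]
            for i in G:
                # A_i(t) = min over j in G of [B_j + W_j - W_i]^+, Eq. (6)
                inc[i] = min(max(B[j] + W[j] - W[i], 0) for j in G)
        W = [max(W[i] + inc[i] - 1, 0) for i in range(K)]
        total += sum(W) / K
    return total / T
```

Sweeping λ over [0, 1) for each d and locating the load at which the returned average blows up approximates the stability region, as in Figure 1.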
For larger values of d the stability region decreases substantially, until reaching around 0.1 for d = 10. The non-monotone behaviour of the stability region with respect to d is evidence of why it is challenging to study.
On the one hand, our bound is not tight. For example, ob-2 marks our bound for d = 2 and equals almost half of the actual stability region. On the other hand, it is much better than the known lower bound for all 1 ≤ d < K. Another interesting result of the simulation is that the steady-state average workload decreases substantially for d > 1 compared to the case d = 1 (no replication). In fact, it can be seen that most of the improvement is achieved by using d = 2 instead of d = 1. If the system under consideration is currently working at around 0.3 load, then our lower bound guarantees that the system remains stable for d = 2 while obtaining the benefits of replication.

Fig. 1: Steady-state average workload vs. load (values of the arrival rate λ) for K = 10 servers and different values of d. The vertical lines marked 'ob-i' and 'kb-i' stand for our lower bound λ_lb and the known lower bound 1/E[∧_{k=1}^{d} B_k], respectively, for d = i. Only kb-1 and kb-10 are shown because all other values of kb are between them. The values of ob-8 and ob-9 are not shown and satisfy ob-10 < ob-9 < ob-8 < ob-7.

Next, to further compare the lower bounds, we consider the R(d) system with K = 30 servers. Instead of choosing a specific distribution for B̄, we specify the connection between the different expected values needed to calculate the lower bounds. Figure 2(a) depicts the results for the case where E[∧_{k=1}^{d} B_k] = K/d^{1.1}, d ∈ {1, …, K}, and Figure 2(b) depicts the results for the case where E[∧_{k=1}^{d} B_k] = 2K/d^{1.1}, d ∈ {1, …, K}. As mentioned in the introduction, neither bound implies the other, and their values depend strongly on the distribution of B̄ and the value of d. Taking the maximum of the two lower bounds yields a new and improved lower bound.

ACKNOWLEDGMENT
The author would like to thank Rami Atar, Isaac Keslassy and Shay Vargaftik for their useful feedback. This research was supported in part by the Hasso Plattner Institute.

REFERENCES

[1] G. Joshi, Y. Liu, and E. Soljanin, "Coding for fast content download," pp. 326–333, IEEE, 2012.
[2] J. Dean and L. A. Barroso, "The tail at scale," Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
[3] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, "Effective straggler mitigation: Attack of the clones," in NSDI, pp. 185–198, 2013.
[4] A. Vulimiri, P. B. Godfrey, R. Mittal, J. Sherry, S. Ratnasamy, and S. Shenker, "Low latency via redundancy," in Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies, pp. 283–294, ACM, 2013.
[5] K. Gardner, M. Harchol-Balter, A. Scheller-Wolf, M. Velednitsky, and S. Zbarsky, "Redundancy-d: The power of d choices for redundancy," Operations Research, vol. 65, no. 4, pp. 1078–1094, 2017.
[6] K. Gardner, S. Zbarsky, S. Doroudi, M. Harchol-Balter, E. Hyytiä, and A. Scheller-Wolf, "Queueing with redundant requests: exact analysis," Queueing Systems, vol. 83, no. 3-4, pp. 227–259, 2016.
[7] Y. Raaijmakers, S. Borst, and O. Boxma, "Redundancy scheduling with scaled Bernoulli service requirements," Queueing Systems, vol. 93, no. 1-2, pp. 67–82, 2019.
[8] E. Anton, U. Ayesta, M. Jonckheere, and I. M. Verloop, "On the stability of redundancy models," arXiv preprint arXiv:1903.04414, 2019.
[9] R. Srikant and L. Ying, Communication Networks: An Optimization, Control, and Stochastic Networks Perspective. Cambridge University Press, 2013.
Fig. 2: (a) Our lower bound and the known lower bound for K = 30 and different values of d, for the case where E[∧_{k=1}^{d} B_k] = K/d^{1.1}. (b) The same for the case where E[∧_{k=1}^{d} B_k] = 2K/d^{1.1}.