Asymptotically Optimal Load Balancing in Large-scale Heterogeneous Systems with Multiple Dispatchers
Xingyu Zhou, Department of ECE, The Ohio State University, [email protected]
Ness Shroff, Department of ECE and CSE, The Ohio State University, shroff[email protected]
Adam Wierman, Department of Computing and Mathematical Sciences, California Institute of Technology, [email protected]
Abstract
We consider the load balancing problem in large-scale heterogeneous systems with multiple dispatchers. We introduce a general framework called Local-Estimation-Driven (LED). Under this framework, each dispatcher keeps local (possibly outdated) estimates of queue lengths for all the servers, and the dispatching decision is made purely based on these local estimates. The local estimates are updated via infrequent communications between dispatchers and servers. We derive sufficient conditions for LED policies to achieve throughput optimality and delay optimality in heavy traffic, respectively. These conditions directly imply delay optimality in heavy traffic for many previous local-memory based policies. Moreover, the results enable us to design new delay optimal policies for heterogeneous systems with multiple dispatchers. Finally, the heavy-traffic delay optimality of the LED framework directly resolves a recent open problem on how to design optimal load balancing schemes using delayed information.
Load balancing, which is responsible for dispatching jobs to parallel servers, has attracted significant interest in recent years. This is motivated by the challenges associated with efficiently dispatching jobs in large-scale data centers and cloud applications, which are rapidly increasing in size. A good load balancing policy not only ensures high throughput by maximizing server utilization, but also improves the user experience by minimizing delay.

There have been numerous load balancing policies proposed in the literature. The most straightforward one is Join-Shortest-Queue (JSQ), which has been shown to enjoy optimal delay in both non-asymptotic (for homogeneous servers) and asymptotic regimes [22, 5, 4]. However, it is difficult to implement in today's large-scale data centers due to the large message overhead between the dispatcher and servers. As a result, alternative load balancing policies with low message overhead have been proposed. For example, the Power-of-d policy [12] has been shown to achieve optimal average delay in heavy traffic with only 2d messages per arrival [10]. Another common load balancing policy is the pull-based Join-Idle-Queue (JIQ) [9, 16], which has been shown to outperform the Power-of-d policy using less overhead. However, both Power-of-d and JIQ mainly achieve good performance for systems with homogeneous servers. Recently, some works have considered heterogeneous servers and proposed flexible, low-message-overhead policies that achieve optimal delay in heavy traffic [29, 27]. However, only a single dispatcher is considered in these works. Theoretical analysis of load balancing with multiple dispatchers has so far mainly focused on the JIQ policy [13, 17], which has poor performance in heavy traffic and is even generally unstable for heterogeneous systems [29].

Note that heterogeneous systems with multiple dispatchers are now almost the default scenario in today's cloud infrastructures.
On the one hand, the heterogeneity comes from the use of multiple generations of CPUs and various types of devices [6]. On the other hand, with the massive amount of data, a scalable cloud infrastructure needs multiple dispatchers to increase both throughput and robustness [15].

Motivated by this, a recent work [1] proposes a new framework named Loosely-Shortest-Queue (LSQ) for designing load balancing policies for heterogeneous systems with multiple dispatchers. In particular, under this framework, each dispatcher keeps its own, local, and possibly outdated view of each server's queue length. Upon arrival, each dispatcher routes to the server with the shortest local view. A small amount of message overhead is used to update the local view. The authors successfully establish sufficient conditions on the update scheme for the system to be stable. Moreover, extensive simulations were conducted to show that LSQ policies significantly outperform well-known low-communication policies while using similar communication overhead, in both heterogeneous and homogeneous cases. However, no theoretical guarantees on the delay performance are provided. It is worth noting that the key challenge in establishing a delay performance guarantee for this framework is that it only uses possibly outdated local information to dispatch jobs. In fact, the problem of designing delay optimal load balancing schemes that only have access to delayed information has recently been listed as an open problem in [8].

Inspired by this, in this paper, we are particularly interested in the following questions: Is it possible to establish delay performance guarantees for load balancing in heterogeneous systems with multiple dispatchers? If so, can these guarantees be achieved using only delayed information?
Contributions.
To answer the questions above, we propose a general framework of load balancing for heterogeneous systems with multiple dispatchers that uses only delayed (out-of-date) information about the system state. We call this framework Local-Estimation-Driven (LED); it generalizes the LSQ framework. Our main results provide sufficient conditions for LED policies to be both throughput optimal and delay optimal in heavy traffic. Our key contributions can be summarized as follows.

First, we introduce the LED framework for designing load balancing policies for heterogeneous systems with multiple dispatchers. In this framework, each dispatcher keeps its own local estimates of queue lengths for all the servers, and makes its dispatching decision purely based on its own local estimates according to a certain dispatching strategy. The local estimates are updated infrequently via an update strategy that is based on communications between dispatchers and servers.

Second, we derive sufficient conditions for LED policies to be throughput optimal and delay optimal in heavy traffic. The importance of the sufficient conditions is three-fold: (i) It can be shown that previous local-memory based policies (e.g., LSQ) satisfy our sufficient conditions. As a result, we are able to show that they are not only throughput optimal (in a stronger sense) but also delay optimal in heavy traffic. (ii) The conditions allow us to design new delay optimal load balancing policies with zero dispatching delay and low message overhead that work for heterogeneous servers and multiple dispatchers. (iii) These conditions also provide us with a systematic approach for generalizing previous optimal policies to the case of multiple dispatchers and exploring the trade-off between memory (i.e., local estimates) and message overhead.

Third, the LED framework also resolves the open problem posed in [8], which asks how to design heavy-traffic delay optimal policies that only use delayed information.
Our main results for LED policies not only demonstrate that it is possible to achieve optimal delay in heavy traffic via only delayed information, but also highlight conditions on the extent to which old information is useful. Moreover, they provide methods for using the delayed information to achieve optimality in heavy traffic. Interestingly, the LED framework also shows that, in the case of multiple dispatchers, inaccurate information can actually lead to improved performance.

To establish the main results, we need to address the following two key challenges. First, each dispatcher in our model has access only to delayed and outdated system information. Second, each dispatcher does not know the arrivals to servers from other dispatchers, since there is no communication between them. As a result, for throughput optimality, we have to carefully design our Lyapunov function, since the local estimates can also be unbounded. For delay optimality, we consider two queueing systems: a local-estimation system and the actual system. We then have to transfer the negative drift on the local-estimation system to the actual system, which requires establishing new bounds and analyzing sample paths.
Related work.
The study of efficient load balancing algorithms has been a hot topic for a long time and spans different asymptotic regimes. The most extensively investigated policy might be Join-Shortest-Queue (JSQ), under which incoming jobs are always sent to the server with the shortest queue length. JSQ has been shown to be optimal in a stochastic order sense in the non-asymptotic regime for arbitrary arrival processes and identical service processes with non-decreasing failure rate [22, 23]. In the heavy-traffic asymptotic regime, in which the normalized load approaches one and the number of servers is fixed, JSQ has been proved to achieve optimal delay even for heterogeneous servers, using both diffusion approximations in [5] and the recently proposed drift-based method [4]. However, the optimality of JSQ comes at the cost of a large amount of communication between dispatchers and servers, which is particularly undesirable for large-scale data centers. Thus, some popular low-message-overhead alternative policies have been proposed, e.g., Power-of-d and Join-Idle-Queue (JIQ). Under Power-of-d, the dispatcher only needs to sample d ≥ 2 servers and route each arrival to the shortest queue among them. This simple policy has been shown to enjoy a doubly exponential decay rate in response time in the large-system asymptotic regime [12] and to achieve optimal delay in heavy traffic for homogeneous servers [3, 10]. Another low-message-overhead policy is JIQ (or the pull-based policy) [9, 16], under which arrivals are sent to one of the idle servers, if there are any, and to a randomly selected server otherwise. Compared to JSQ and Power-of-d, JIQ has the nice property of zero dispatching delay, since each arrival can be routed instantaneously rather than waiting for feedback from servers. Moreover, JIQ has been shown to outperform Power-of-d with even smaller message overhead (at most one message per job). In particular, under JIQ, arriving jobs achieve asymptotically zero waiting time in the large-system regime while Power-of-d does not.
An even stronger result suggests that, in the Halfin-Whitt asymptotic regime, JIQ achieves the same delay performance as JSQ [14]. Nevertheless, the performance of JIQ drops substantially in heavy traffic with a finite number of servers, even for homogeneous servers. In fact, it is not heavy-traffic delay optimal in this case [29]. Motivated by this, recent works have proposed alternative pull-based policies that not only enjoy all the nice features of JIQ but also achieve optimal delay in heavy traffic [29, 27]. However, these studies only consider the case of a single dispatcher.

Compared to the large literature on the single-dispatcher case, there are only a few works for the scenario of multiple dispatchers, and they mainly focus on the JIQ policy. In particular, [13] presents a new large-system asymptotic analysis of JIQ without the simplifying assumptions in [9]. The property of asymptotically zero waiting time under JIQ was generalized to the case of multiple dispatchers in [17]. However, the results for JIQ in [9, 13, 17] all assume that the loads at the various dispatchers are strictly equal. Without this assumption, [19] shows that the waiting time under JIQ no longer vanishes in the large-system regime, and two enhanced JIQ schemes are proposed. As mentioned earlier, although JIQ is a scalable choice for the multiple-dispatcher case, it is not delay optimal in heavy traffic for homogeneous servers and is not even generally stable for heterogeneous systems.

The case of heterogeneous systems with multiple dispatchers has received very little attention from the theoretical community so far. To the best of our knowledge, the framework proposed in [1] is the first attempt to study efficient load balancing schemes with a theoretical guarantee for the scenario of heterogeneous systems with multiple dispatchers.
In particular, under the proposed Loosely-Shortest-Queue (LSQ) framework, each dispatcher independently keeps its own local view of server queue lengths and routes jobs to the shortest among them. Communication is used only to update the local views and make sure that they are not too far from the real queue lengths. The main contributions of [1] are the sufficient conditions for any LSQ policy to achieve strong stability with low message overhead. Additionally, extensive simulations have been used to demonstrate its appeal. Nevertheless, a theoretical guarantee on the delay performance of LSQ policies remains an important unsolved question.

It is worth pointing out that the idea of using local memory to hold possibly old information for load balancing was also explored in two recent works [2, 20]. As we discuss later, these two proposed policies fall within our LED framework. Both works only consider a single dispatcher and homogeneous servers, which is also a special case of our model. Further, their analysis focuses on the large-system asymptotic regime, where the number of servers goes to infinity, while our analysis deals with a finite number of servers.

This section describes the system model and assumptions considered in this paper. Then, several necessary preliminaries are presented.
We consider a discrete-time (i.e., time-slotted) load balancing system consisting of M dispatchers and N possibly-heterogeneous servers. Each server maintains an infinite-capacity FIFO queue. At each dispatcher, there is a local memory, through which the dispatcher can have some (possibly delayed) information about the system state. In each time-slot, each dispatcher routes its new incoming tasks to one of the servers, immediately upon arrival. Once a task joins a queue, it remains in that queue until its service is completed. Each server is assumed to be work conserving, i.e., a server is idle if and only if its corresponding queue is empty.

Arrivals. Let A_m(t) denote the number of exogenous tasks that arrive at dispatcher m at the beginning of time-slot t. We assume that A_Σ(t) = Σ_{m=1}^M A_m(t) is an integer-valued random variable, which is i.i.d. across time-slots. The mean and variance of A_Σ(t) are denoted by λ_Σ and σ_Σ², respectively. We further assume that there is a positive probability that A_Σ(t) is zero. The allocation of the total arriving tasks among the M dispatchers is allowed to follow any arbitrary policy that is independent of the system state. Note that, in contrast to previous works on multiple dispatchers [9, 13, 17], we do not require that the loads at all dispatchers are equal. We assume that there is a strictly positive probability for tasks to arrive at each dispatcher in any time-slot t. That is, there exists a strictly positive constant p₀ such that

P(A_m(t) > 0) ≥ p₀, ∀(m, t) ∈ M × N, (1)

where M = {1, 2, ..., M}. Moreover, we assume that A_m(t) is i.i.d. across time-slots with mean arrival rate denoted by λ_m. We further let A_{m,n}(t) denote the number of new arrivals at server n from dispatcher m at the beginning of time-slot t. Let A_n(t) = Σ_{m=1}^M A_{m,n}(t) be the total number of arriving tasks at server n at the beginning of time-slot t.
Let S_n(t) denote the amount of service that server n offers for queue n in time-slot t. That is, S_n(t) is the maximum number of tasks that can be completed by server n in time-slot t. We assume that S_n(t) is an integer-valued random variable, which is i.i.d. across time-slots. We also assume that S_n(t) is independent across different servers as well as of the arrival process. The mean and variance of S_n(t) are denoted by µ_n and ν_n, respectively. Let µ_Σ ≜ Σ_{n=1}^N µ_n and ν_Σ ≜ Σ_{n=1}^N ν_n denote the mean and variance of the hypothetical total service process S_Σ(t) ≜ Σ_{n=1}^N S_n(t). Let ǫ = µ_Σ − λ_Σ characterize the distance between the arrival rate and the boundary of the capacity region.

Let Q_n(t) be the queue length of server n at the beginning of time-slot t. Let A_n(t) denote the number of tasks routed to queue n at the beginning of time-slot t according to the dispatching decision. Then the evolution of the length of queue n is given by

Q_n(t + 1) = Q_n(t) + A_n(t) − S_n(t) + U_n(t), n = 1, 2, ..., N, (2)

where U_n(t) = max{S_n(t) − Q_n(t) − A_n(t), 0} is the unused service due to an empty queue.

We do not assume any specific distribution for the arrival and service processes. Moreover, in contrast to previous works [29, 4], we do not require that the arrival and service processes have finite support. Instead, we only need the condition that their distributions are light-tailed. More specifically, we assume that

E[e^{θ₁ A_Σ(t)}] ≤ D₁ and E[e^{θ₂ S_n(t)}] ≤ D₂, (3)

for each n, where the constants θ₁ > 0, θ₂ > 0, D₁ < ∞, and D₂ < ∞ are all independent of ǫ.

We are interested in the case where the local memory at each dispatcher m stores an estimate of the queue length of each server n. In particular, we let Q̃^m_n(t) be the local estimate of the queue length of server n at dispatcher m at the beginning of time-slot t (before any arrivals and departures).
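Before proceeding, the per-slot dynamics in Eq. (2) can be sanity-checked in a few lines of Python (the function name is ours; the recursion itself is exactly Eq. (2)):

```python
def step_queue(q, arrivals, service):
    """One slot of Eq. (2): Q_n(t+1) = Q_n(t) + A_n(t) - S_n(t) + U_n(t),
    with U_n(t) = max{S_n(t) - Q_n(t) - A_n(t), 0} the unused service."""
    unused = max(service - q - arrivals, 0)
    return q + arrivals - service + unused

# Adding the unused service makes the update equivalent to
# Q_n(t+1) = max{Q_n(t) + A_n(t) - S_n(t), 0}:
assert step_queue(3, 1, 5) == 0  # one service token goes unused
assert step_queue(3, 1, 2) == 2
```

In particular, the unused-service term U_n(t) is what keeps queue lengths nonnegative while preserving the linear form of the recursion.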
More specifically, we introduce the following framework for load balancing.

Definition 1.
A Local-Estimation-Driven (LED) policy is composed of the following components: (a)
Dispatching strategy:
At the beginning of each time-slot, each dispatcher m chooses one of the servers for new arrivals purely based on its local estimates (i.e., the local queue length estimates Q̃^m). (b) Update strategy:
At the end of each time-slot, each dispatcher may update its local estimates, e.g., synchronize a local queue length estimate with the true queue length.

Under any LED policy, the system can be described by the Markov chain {Z(t) = (Q(t), m(t)), t ≥ 0} with state space Z, where Q(t) is the queue length vector and m(t) ≜ (Q̃¹(t), Q̃²(t), ..., Q̃ᴹ(t)) is the memory state. We consider a set of load balancing systems {Z^(ǫ)(t), t ≥ 0} parameterized by ǫ such that the mean arrival rate of the total exogenous arrival process {A_Σ^(ǫ)(t), t ≥ 0} is λ_Σ^(ǫ) = µ_Σ − ǫ. Note that the parameter ǫ characterizes the distance between the arrival rate and the boundary of the capacity region. We are interested in the throughput performance and the steady-state delay performance in the heavy-traffic regime under any LED policy.

A load balancing system is stable if the Markov chain {Z(t), t ≥ 0} is positive recurrent, and Z̄ = (Q̄, m̄) denotes the random vector whose distribution is the same as the steady-state distribution of {Z(t), t ≥ 0}. We have the following definition.

Definition 2 (Throughput Optimality). A load balancing policy is said to be throughput optimal if for any arrival rate within the capacity region, i.e., for any ǫ > 0, the system is positive recurrent and all the moments of ‖Q̄^(ǫ)‖ are finite.

Note that this is a stronger definition of throughput optimality than that in [1, 21, 25] because, besides positive recurrence, it also requires all the moments to be finite in steady state for any arrival rate within the capacity region.

To characterize the steady-state average delay performance in the heavy-traffic regime as ǫ approaches zero, by Little's law, it is sufficient to focus on the sum of all the queue lengths. First, recall the following fundamental lower bound on the expected sum of queue lengths in a load balancing system under any throughput optimal policy [4].
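To make Definition 1 concrete, here is a minimal sketch of one LED dispatcher in Python. The class and method names are ours, and the shortest-estimate dispatch plus sampled synchronization shown here are just one possible instantiation of the two strategies:

```python
import random

class Dispatcher:
    """Sketch of a single LED dispatcher (illustrative, names are ours).

    It keeps a local, possibly stale, estimate of every server's queue
    length and never reads the true queue lengths when dispatching."""

    def __init__(self, num_servers):
        self.estimates = [0] * num_servers  # local view Q~^m

    def dispatch(self):
        """Dispatching strategy: choose a server using only local estimates.
        Here: join the server with the shortest local estimate."""
        return min(range(len(self.estimates)), key=lambda n: self.estimates[n])

    def update(self, true_queues, sample_prob=0.2):
        """Update strategy: at the end of a time-slot, with probability
        sample_prob synchronize one randomly chosen estimate with the
        true queue length."""
        if random.random() < sample_prob:
            n = random.randrange(len(self.estimates))
            self.estimates[n] = true_queues[n]
```

Any concrete LED policy is then a choice of `dispatch` (e.g., shortest local estimate) together with a choice of `update` (e.g., push- or pull-based synchronization, discussed later).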
Note that this result was originally proved under the assumption of finite support on the service process (Lemma 5 in [4]); it can be generalized to service processes with light-tailed distributions via a careful analysis of the unused service, see our proof of Lemma 6.

Lemma 1.
Given any throughput optimal policy, and assuming that (σ_Σ^(ǫ))² converges to a constant σ_Σ² as ǫ decreases to zero, then

lim inf_{ǫ↓0} ǫ E[Σ_{n=1}^N Q̄_n^(ǫ)] ≥ ζ/2, (4)

where ζ ≜ σ_Σ² + ν_Σ.

The right-hand side of Eq. (4) is the heavy-traffic limit of a hypothetical single-server system with arrival process A_Σ^(ǫ)(t) and service process Σ_{n=1}^N S_n(t) for all t ≥ 0. This hypothetical single-server queueing system is often called the resource-pooled system. Since a task cannot be moved from one queue to another in the load balancing system, it is easy to see that the expected sum of queue lengths in the load balancing system is larger than the expected queue length in the resource-pooled system. However, if a policy achieves the lower bound in Eq. (4) in the heavy-traffic limit, then by Little's law this policy achieves the minimum average delay of the system in steady state, and is thus said to be heavy-traffic delay optimal, see [4, 10, 21, 24, 25, 29].
Definition 3 (Heavy-traffic Delay Optimality in Steady-state). A load balancing scheme is said to be heavy-traffic delay optimal in steady state if the steady-state queue length vector Q̄^(ǫ) satisfies

lim sup_{ǫ↓0} ǫ E[Σ_{n=1}^N Q̄_n^(ǫ)] ≤ ζ/2,

where ζ is defined in Lemma 1.

In order to provide a unified way to specify the dispatching strategy in LED, we first introduce a concept called dispatching preference. In particular, let P_{m,n}(t) be the probability that new arrivals at dispatcher m are dispatched to server n in time-slot t. We define β_{m,n}(t) ≜ P_{m,n}(t) − µ_n/µ_Σ, which is the difference between the probability that server n is chosen under a particular dispatching strategy and under random routing (weighted by service rate). Then, we have the following definition.

Definition 4 (Dispatching preference). Fix a dispatcher m, and let σ_t(·) be a permutation of (1, 2, ..., N) that satisfies

Q̃^m_{σ_t(1)}(t) ≤ Q̃^m_{σ_t(2)}(t) ≤ ... ≤ Q̃^m_{σ_t(N)}(t).

The dispatching preference at dispatcher m is an N-dimensional vector denoted by ∆^m(t), the n-th component of which is given by ∆_{m,n}(t) ≜ β_{m,σ_t(n)}(t).

In words, the dispatching preference at a dispatcher m specifies how servers with different local estimates are preferred, in a unified way that is independent of the actual values of the local estimates: it depends only on the relative order of the local estimates. More specifically, fix a dispatcher m; by definition we can see that the weighted random routing strategy has no preference for any server, and ∆_{m,n}(t) = 0 for every n. On the other hand, if new arrivals are always dispatched to the server with the shortest local estimate (as in, e.g., the LSQ policy), we have ∆_{m,1}(t) > 0 and ∆_{m,n}(t) < 0 for 2 ≤ n ≤ N. Thus, we can see that a positive value of ∆_{m,n}(t) means that the dispatching strategy has a preference for the server with the n-th shortest local estimate. This observation directly motivates the following two definitions.
Definition 5 (Tilted dispatching strategy). A dispatching strategy adopted at dispatcher m is said to be tilted if there exists a k ∈ {1, 2, ..., N} such that, for all t, ∆_{m,n}(t) ≥ 0 for all n ≤ k and ∆_{m,n}(t) ≤ 0 for all n > k.

Definition 6 (δ-tilted dispatching strategy). A dispatching strategy adopted at dispatcher m is said to be δ-tilted if for all t (i) it is a tilted dispatching strategy, and (ii) there exists a positive constant δ such that ∆_{m,1}(t) ≥ δ and ∆_{m,N}(t) ≤ −δ.

Remark 1.
Note that similar definitions were first provided in [29] for the case of a single dispatcher with up-to-date information. Based on these definitions, sufficient conditions were presented for throughput and heavy-traffic optimality. However, these conditions cannot be directly applied to our model due to the following two major challenges. One is that, in our model, each dispatcher only has access to outdated information. The other is that each dispatcher has no knowledge of the arrivals at the servers coming from other dispatchers, since there is no communication between them. To handle these challenges, we have to develop new techniques.
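Definitions 4-6 are mechanical to check for any concrete strategy. The following sketch (our own helper names) computes the dispatching preference of a strategy from its routing probabilities and tests the tilted and δ-tilted properties:

```python
def dispatching_preference(probs, estimates, mu):
    """Delta^m(t) from Definition 4: beta_{m,n}(t) = P_{m,n}(t) - mu_n/mu_Sigma,
    with entries reordered so position n refers to the server holding the
    n-th shortest local estimate."""
    mu_total = sum(mu)
    beta = [p - s / mu_total for p, s in zip(probs, mu)]
    order = sorted(range(len(estimates)), key=lambda n: estimates[n])
    return [beta[n] for n in order]

def is_tilted(delta, tol=1e-12):
    """Definition 5: some prefix of the preference vector is nonnegative
    and the remaining suffix is nonpositive."""
    seen_negative = False
    for d in delta:
        if d < -tol:
            seen_negative = True
        elif d > tol and seen_negative:
            return False
    return True

def is_delta_tilted(delta, lower):
    """Definition 6: tilted, with strictly positive preference for the
    shortest local estimate and strictly negative for the longest."""
    return is_tilted(delta) and delta[0] >= lower and delta[-1] <= -lower

# Weighted random routing has zero preference everywhere ...
mu = [1.0, 2.0, 1.0]
random_probs = [s / sum(mu) for s in mu]
assert dispatching_preference(random_probs, [3, 0, 5], mu) == [0.0, 0.0, 0.0]

# ... while sending everything to the shortest local estimate (server 1 here)
# is delta-tilted with delta = mu_min / mu_Sigma = 0.25:
delta = dispatching_preference([0.0, 1.0, 0.0], [3, 0, 5], mu)
assert is_delta_tilted(delta, lower=0.25)
```

Note that any preference vector produced this way sums to zero, since the routing probabilities and the weights µ_n/µ_Σ each sum to one.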
We end this section by providing intuitions behind the two definitions. To start, it can easily be seen that Σ_{n=1}^N ∆_{m,n}(t) = 0 for all m and t, by the definition of dispatching preference. Roughly speaking, a tilted dispatching strategy means that, compared to (weighted) random routing (which does not have any preference), the probabilities of choosing servers with shorter local estimates (the first k shortest ones) are increased, and, as a result, the probabilities of choosing servers with longer local estimates are reduced. This is the reason why we call it tilted, since more preference is given to queues with shorter local estimates. Therefore, a tilted dispatching strategy can be viewed as a strategy that is at least as 'good' as (weighted) random routing. On the other hand, a δ-tilted dispatching strategy can be viewed as a strategy that is strictly better than (weighted) random routing. The reason is that, besides being tilted, it also requires a strictly positive preference for the server with the shortest local estimate.

In this section, we first present the sufficient conditions for LED policies to be throughput optimal and heavy-traffic delay optimal. Then, we explore several example policies within the LED framework to demonstrate its flexibility in designing new load balancing schemes.
Let us begin with the sufficient conditions for LED policies to be throughput optimal. In particular, we specify conditions on the dispatching strategy and the update strategy that guarantee throughput optimality.

To state the theorem, we need the following notation. Let I_{m,n}(t) be an indicator function which equals 1 if and only if the local estimate of server n's queue length at dispatcher m gets updated, i.e., the estimated queue length Q̃^m_n(t) is set to the actual queue length Q_n(t) at the end of time-slot t.

Theorem 1.
Consider an LED policy. Suppose the dispatching strategy at each dispatcher is tilted and the update strategy guarantees that there exists a positive constant p₁ such that

E[I_{m,n}(t) | Z(t) = Z] ≥ p₁ (5)

holds for all Z and (m, n, t) ∈ M × N × N. Then, this policy is throughput optimal, i.e., the system under this policy is positive recurrent with all the moments bounded for any ǫ > 0.

Proof. See Section 5.1.

Note that this theorem directly implies that LSQ is not only strongly stable but also keeps all the moments of the queue lengths bounded in steady state. Moreover, it suggests that any dispatching strategy that is as good as (weighted) random routing is sufficient to guarantee throughput optimality. Further, the update probability can be a function of the traffic load.

Now, we turn to presenting the sufficient conditions for LED policies to be delay optimal in heavy traffic. In order to achieve delay optimality, we need stronger conditions on both the dispatching strategy and the update strategy.
Theorem 2.
Consider an LED policy. Suppose the dispatching strategy at each dispatcher is δ-tilted with a uniform lower bound δ > 0 that is independent of ǫ. Suppose the update strategy guarantees that there exists a positive constant p₁ (independent of ǫ) such that

E[I_{m,n}(t) | Z(t) = Z] ≥ p₁ (6)

holds for all Z and (m, n, t) ∈ M × N × N. Then, this policy is heavy-traffic delay optimal.

Proof. See Section 5.2.

This theorem not only establishes a delay performance guarantee for many previous local-memory based policies (e.g., LSQ in [1] and the low-message policies in [2, 20]), but also provides us with the flexibility to design new delay optimal load balancing schemes for different scenarios with heterogeneous servers and multiple dispatchers, as discussed in the next section. More importantly, our results directly suggest that it is possible to use only delayed information to achieve delay optimality, which resolves one of the open problems listed in [8].
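As a toy illustration of a policy satisfying both conditions, the sketch below simulates shortest-local-estimate dispatching (which is δ-tilted) combined with an update that synchronizes each estimate with probability bounded away from zero. The Bernoulli arrival/service model, the specific rates, and all names are our own assumptions for illustration, not the paper's setup:

```python
import random

def simulate_led(num_dispatchers=3, horizon=20000, update_prob=0.5, seed=0):
    """Discrete-time toy LED system: each dispatcher routes by shortest
    local estimate, and at the end of every slot synchronizes one uniformly
    chosen estimate with probability update_prob, so that
    E[I_{m,n}(t)] >= update_prob / N > 0, as Theorem 2 requires."""
    rng = random.Random(seed)
    mu = [0.2, 0.3, 0.4, 0.5]                       # heterogeneous service rates
    n_servers = len(mu)
    arrival_prob = 0.9 * sum(mu) / num_dispatchers  # total load is 90% of capacity
    queues = [0] * n_servers
    estimates = [[0] * n_servers for _ in range(num_dispatchers)]
    total = 0
    for _ in range(horizon):
        for m in range(num_dispatchers):            # dispatch on local info only
            if rng.random() < arrival_prob:
                n = min(range(n_servers), key=lambda i: estimates[m][i])
                queues[n] += 1
        for n in range(n_servers):                  # Bernoulli(mu_n) service
            if queues[n] > 0 and rng.random() < mu[n]:
                queues[n] -= 1
        for m in range(num_dispatchers):            # infrequent synchronization
            if rng.random() < update_prob:
                n = rng.randrange(n_servers)
                estimates[m][n] = queues[n]
        total += sum(queues)
    return total / horizon                          # time-averaged total queue length
```

In this sketch every routing decision uses only stale estimates, yet the time-averaged total queue length remains modest at 90% load, which is the qualitative behavior the theorem predicts.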
High-level proof idea.
We end this section by providing drift-based intuitions behind the technical proofs. In particular, let us consider two queueing systems: a local-estimation system and the actual system (i.e., the queue lengths at the servers). Throughput optimality requires the actual system to have a drift towards the origin. First, by the definition of a tilted dispatching strategy, there is an equivalent drift on the local-estimation system towards the origin. Then, the condition on the update strategy guarantees that the local-estimation system is never too far away from the actual system. Hence, the actual system also has a drift towards the origin. Heavy-traffic delay optimality requires not only a drift towards the origin, but also a drift towards the line on which all the queue lengths are equal. First, by the definition of a δ-tilted dispatching strategy, there is a drift towards the line on which all the local estimates within a given dispatcher are equal. Then, by the condition on the update strategy, the drift on the local-estimation system can be transferred to a drift on the actual system, and hence delay optimality follows. Note that, in the current proof, in order to make this 'drift-transfer' process valid, we impose the condition that both δ and p₁ are independent of ǫ. This is not necessarily required, and both of them could possibly be particular functions of ǫ, as in [26]. This relaxation could be an interesting future research direction.

To illustrate the applications of Theorems 1 and 2, in this section we introduce examples of LED policies that are both throughput optimal and heavy-traffic delay optimal. The flexibility provided by our sufficient conditions not only allows us to include previous policies as special cases, but also enables us to design new flexible policies. Let us first introduce some typical δ-tilted dispatching strategies.

Example 1 (Local-Join-Shortest-Queue (L-JSQ)).
At the beginning of each time-slot t, the dispatcher forwards its arrivals to the server with the shortest local estimate, with ties broken arbitrarily. That is, for dispatcher m, the chosen server is i* ∈ arg min_n {Q̃^m_n(t)}.

This dispatching strategy is the same as that in the LSQ policy in [1]. By the definition of dispatching preference, we can see that under L-JSQ, ∆_{m,1}(t) = 1 − µ_{σ_t(1)}/µ_Σ > 0 and ∆_{m,n}(t) = −µ_{σ_t(n)}/µ_Σ < 0 for n ≥ 2. Hence, it is δ-tilted even for heterogeneous servers, with δ = µ_min/µ_Σ where µ_min = min_n µ_n.

Instead of always joining the server with the shortest local estimate, it is also possible to join a server whose queue length is below a threshold while satisfying the condition of a δ-tilted dispatching preference.

Example 2 (Local-Join-Below-Average (L-JBA)). At the beginning of each time-slot t, the dispatcher forwards its arrivals to a randomly chosen server whose local estimate is at or below the average local queue length estimate. That is, consider dispatcher m with average local estimate Q̄^m(t) = (1/N) Σ_n Q̃^m_n(t). Let A ≜ {n : Q̃^m_n(t) ≤ Q̄^m(t)}. Then, for each i ∈ A, P_{m,i}(t) = µ_i / Σ_{n∈A} µ_n, and for i ∉ A, P_{m,i}(t) = 0.

It can easily be shown from the definition that L-JBA is also δ-tilted. Note that, compared to L-JSQ, in the heterogeneous case it requires the dispatcher to know the service rate of each server, which can easily be obtained via the update strategies introduced next. This strategy is more flexible than L-JSQ since it does not require new arrivals to be sent only to the server with the shortest local estimate, which could be useful in scenarios with data locality. Moreover, some randomness in the dispatching strategy is also useful, as discussed in the next section.

Further, it is possible to generalize many previous heavy-traffic delay optimal policies into the LED framework. For example, we can directly apply the Power-of-d policy as our dispatching strategy.

Example 3 (Local-Power-of-d (L-Pod)). At the beginning of each time-slot t, the dispatcher randomly chooses d ≥ 2 servers and sends arrivals to the server that has the shortest local estimate among the d servers.

It can easily be shown that L-Pod is tilted for homogeneous servers.
Moreover, for a given m, we have ∆_{m1}(t) = (d − 1)/N and ∆_{mN}(t) = −1/N, and hence it is δ-tilted with δ = 1/N.

Now, let us turn to discussing update strategies that satisfy the condition in Theorem 2. In particular, an update strategy can be either push-based (dispatchers sample servers) or pull-based (servers report to dispatchers).

Definition 7 (Push-Update). If there are new arrivals, then at the end of the time-slot dispatcher m samples d distinct servers with a positive probability p̂ and updates the corresponding d local estimates with the true values.

It has been shown in [1] that even for d = 1, the push-update strategy is guaranteed to satisfy the condition in Theorem 2.

Definition 8 (Pull-Update). At the end of each time-slot, each server n that has completed tasks picks a dispatcher m uniformly at random and then abides by one of the following two rules:
• If the server becomes idle (i.e., has no tasks), it sends (n, 0) to dispatcher m.
• If not, it sends (n, Q_n) to dispatcher m with probability p̂.

It has been shown in [1] that for any p̂ >
0, the pull-update strategy is guaranteed to satisfy the condition in Theorem 2.

Now, having introduced both the dispatching strategies and the update strategies, we can combine them to obtain different LED policies that are delay optimal in heavy traffic. For example, we have L-JSQ-Push, L-JSQ-Pull, L-JBA-Push, and L-JBA-Pull for heterogeneous servers, as well as L-Pod-Push and L-Pod-Pull for homogeneous servers.

We end this section by summarizing the contributions of the LED framework. (i)
It covers previous policies.
L-JSQ-Push (with p̂ = 1) and L-JSQ-Pull are the same as the LSQ policies considered in [1], which include the policies developed in both [2] and [20] as special cases. Thus, by Theorems 1 and 2, all these policies are throughput and heavy-traffic delay optimal. (ii) It allows randomness in dispatching.
The randomness introduced in L-JBA and L-Pod is helpful when dealing with an extremely low budget on the message overhead, as discussed next. (iii)
It enables trade-offs between memory and message overhead.
For example, L-Pod-Push and L-Pod-Pull are good examples of trading memory for low message overhead. That is, if each dispatcher directly used the traditional Power-of-d without any memory, then at least 4 messages per arrival would be needed to guarantee delay optimality in heavy traffic. In contrast, under both L-Pod-Push and L-Pod-Pull, the worst-case message overhead is just 1 per arrival. In addition, the message overhead can be further reduced by choosing a smaller value of p̂ in the update strategy.

Before moving to the proofs, we would like to discuss key features of and insights about LED, and point out possible refinements.

Key features of LED
In this section, we highlight the key features of the LED framework: low message overhead, zero dispatching delay, low computational complexity, and appealing performance across various loads.
Low message overhead.
It should be noted that communication occurs only during the update phase of LED policies. For the push-update strategy, the number of messages per arrival is at most 2d (and d can even be one). For the pull-update strategy, the number of messages per arrival is at most 1. In contrast, JSQ needs 2N messages per arrival and Power-of-d needs at least 4 messages per arrival. Although JIQ has a worst-case message overhead comparable to that of LED policies, it is not stable for heterogeneous servers.

Zero dispatching delay.
Another key feature of all LED policies is that there is zero dispatching delay. That is, the dispatcher can immediately route its new arrivals to the chosen server, since the decision is made purely based on its local estimates. Moreover, the communication between dispatchers and servers happens only after the decision is made. This is in contrast to typical push-based policies like JSQ and Power-of-d, under which the dispatcher has to wait for the responses of the sampled servers to make its dispatching decision, resulting in a non-zero dispatching delay.

Low computational complexity.
In order to implement LED policies, each dispatcher has to keep an array of size N for its local estimates. Such a space requirement is negligible in a modern cluster. Further, the operations required by the dispatching strategies of LED policies are very efficient. For example, to find the server with the minimal local estimate in L-JSQ, we can keep the array in a min-heap data structure. For L-JBA, we can maintain the average with an efficient running-average update. The simple L-Pod only needs random number generators.

Appealing performance across loads.
Although the theoretical delay optimality of the LED framework holds in the heavy-traffic asymptotic regime, the family of LED policies includes efficient policies that significantly outperform alternative low-message-overhead policies with the same (or even smaller) amount of communication. For example, if the dispatching strategy adopts L-JSQ, then LED reduces to the LSQ policy proposed in [1], which has been shown via extensive simulations to enjoy good performance over a wide range of traffic loads in different scenarios.

As mentioned earlier, the class of heavy-traffic delay optimal LED policies is broad and includes flexible combinations of dispatching and update strategies suited to different application scenarios. The actual delay performance (outside the heavy-load regime) varies with the particular choice of dispatching and update strategy. Thus, it is not possible to pick one particular LED policy that fits every circumstance, and that is not the focus of this paper. Instead, we present some useful insights about the LED framework below, which can serve as guidance for the choice or design of new LED policies.
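On the low-computational-complexity point above: since push- and pull-updates overwrite individual entries of the estimate array, a plain min-heap used for the L-JSQ argmin must tolerate stale entries; lazy invalidation is one standard way to handle this. The sketch below is our own illustration of that data-structure detail (the class and method names are hypothetical), not part of the LED specification.

```python
import heapq

class LocalEstimates:
    """Local estimate table with O(log N) amortized argmin via a
    lazily-invalidated min-heap."""

    def __init__(self, n):
        self.val = [0] * n                     # current local estimates
        self.heap = [(0, i) for i in range(n)] # (estimate, server) entries
        heapq.heapify(self.heap)

    def update(self, i, q):
        """Overwrite server i's estimate (e.g., on a push- or pull-update).
        Stale heap entries are discarded on access, not removed eagerly."""
        self.val[i] = q
        heapq.heappush(self.heap, (q, i))

    def argmin(self):
        """Return the server with the minimal current estimate."""
        while True:
            q, i = self.heap[0]
            if q == self.val[i]:   # top entry still matches the table
                return i
            heapq.heappop(self.heap)  # stale entry, drop it
```

Each `update` pushes a fresh entry, so a valid entry for every server is always present and the `argmin` loop terminates.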
The main trait of the LED framework is that only local, possibly delayed and inaccurate, information is used for making the dispatching decision. In the following, we present two useful insights about the use of inaccurate and delayed information for load balancing.
Inaccurate information can improve performance.
A big problem for load balancing with multiple dispatchers is herd behavior, in which arrivals at different dispatchers join the same server. This often leads to poor delay performance in practice [18]. For example, JSQ with multiple dispatchers exhibits serious herd behavior, since all the dispatchers route arrivals to the single shortest queue. In contrast, under the LED framework, each dispatcher may believe that a different queue is the shortest according to its own local estimates, because these estimates are inaccurate and delayed. Thus, jobs at different dispatchers are sent to different queues that may not have the actual shortest length but still have relatively small queue lengths. This intuition is illustrated by Fig. 1. In particular, we consider a setup with 10 dispatchers and 100 heterogeneous servers. All the LED policies are configured to have the same average message overhead as Power-of-2. It can be seen that the LED policies are not only stable but also achieve much better performance compared to JSQ, which suffers from herd behavior in the multiple-dispatcher case.
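The herd-behavior intuition can be seen in a toy calculation (our own illustration, not the experiment of Fig. 1; the queue lengths and the ±3 perturbation model for outdated views are arbitrary assumptions):

```python
import random

def chosen_servers(estimates):
    """Each dispatcher routes its arrivals to the argmin of its own view."""
    return [min(range(len(view)), key=lambda i: view[i]) for view in estimates]

random.seed(1)
true_queues = [4, 5, 6, 7, 8]   # hypothetical instantaneous queue lengths
M = 10                          # number of dispatchers

# Exact shared information (JSQ-style): every dispatcher herds to server 0.
exact_choices = chosen_servers([list(true_queues) for _ in range(M)])

# Independently outdated local estimates: each dispatcher last refreshed its
# view at a different time, modeled here as independent +/-3 perturbations.
noisy_views = [[q + random.randint(-3, 3) for q in true_queues] for _ in range(M)]
noisy_choices = chosen_servers(noisy_views)
```

With exact information all M dispatchers pick the same server (`len(set(exact_choices)) == 1`), while the independently perturbed views typically spread the M batches of arrivals over several distinct servers, all of which still have relatively short queues.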
Randomness is useful for heavily-delayed information.
As mentioned earlier, the LED framework provides the possibility of exploring load balancing with extremely low message overhead by choosing a small value of p̂ in the update strategy. As a result, the local information at each dispatcher is only updated after a long time interval. In this case, if a deterministic dispatching strategy (e.g., L-JSQ) is adopted, it again incurs herd behavior (even in the single-dispatcher case), since all the arrivals during the long update interval join the same queue. This is another motivation for considering L-JBA and L-Pod, which
Figure 1: Inaccurate information could improve performance in multiple-dispatcher case.
Figure 2: Randomness is useful for heavily-delayed information.

naturally introduce a certain level of randomness and hence help avoid herd behavior, as suggested by [11]. To illustrate this insight, we consider a setup with 10 dispatchers and 100 homogeneous servers. We compare the delay performance of L-JSQ-Push, L-Pod-Push, and L-JBA-Push with a small update probability p̂ and d = 2. As shown in Fig. 2, both L-JBA-Push and L-Pod-Push outperform L-JSQ-Push, which suffers from herd behavior because of heavily-delayed information.

Our main results suggest that there is a large class of heavy-traffic delay optimal LED policies. On the one hand, this provides flexibility to tailor the policy design to different application scenarios through different choices of dispatching and update strategies. On the other hand, it also suggests the need for refinements of LED beyond delay optimality in heavy traffic. To this end, we introduce two possible directions for refinements.
Degree of queue imbalance.
As introduced in [28], the degree of queue imbalance is a refined metric that further distinguishes heavy-traffic delay optimal policies. The idea is that, instead of looking at the average queue length (and hence the average delay), the degree of queue imbalance measures the expected difference in queue lengths among the servers. By following the proof of Proposition 5.6 in [28], we can establish that the degree of queue imbalance of all heavy-traffic delay optimal LED policies is bounded by a constant that grows as δ or p̂ shrinks. Thus, even though by Theorem 2 any positive δ and p̂ are sufficient for delay optimality in heavy traffic, a dispatching strategy with a smaller δ or an update strategy with a smaller p̂ could degrade the performance in practice.

Other asymptotic regimes.
In this paper, we focus on the heavy-traffic asymptotic regime, where the number of servers is fixed and the load approaches one. As mentioned before, there are also other asymptotic regimes in the analysis of load balancing schemes. One possible direction is to extend the fluid-limit techniques for the large-system regime in [20] to the case of multiple dispatchers and heterogeneous servers. Another alternative is the many-server heavy-traffic regime (e.g., the Halfin-Whitt regime), which strikes a balance between the heavy-traffic regime and the large-system regime. Studying LED in such regimes is another interesting direction for future work.
In this paper, we extend the Lyapunov drift-based approach developed in [4] to allow for unbounded supports of the arrival and service processes. In particular, we replace the finiteness condition on the drift in [4] by a stochastic dominance condition, as shown in (C2) in Lemma 2. As proved in [7], this weaker condition, combined with a negative drift condition, still guarantees finite moment bounds. Besides a weaker condition, we also replace the one-step drift with a T-step drift. Formally, we use the following lemma to derive bounded moments in steady state.

Lemma 2.
For an irreducible, aperiodic, and positive recurrent Markov chain {X(t), t ≥ 0} over a countable state space 𝒳, which converges in distribution to X̄, suppose V : 𝒳 → R₊ is a Lyapunov function. We define the T-time-slot drift of V at X as

∆V(X) ≜ [V(X(t + T)) − V(X(t))] I(X(t) = X),

where I(·) is the indicator function. Suppose that for some positive finite integer T, the T-time-slot drift of V satisfies the following conditions:

• (C1) There exist an η > 0 and a κ < ∞ such that for any t = 1, 2, . . . and for all X ∈ 𝒳 with V(X) ≥ κ,

E[∆V(X) | X(t) = X] ≤ −η.

• (C2) |∆V(X)| ≺ W for all t and all X ∈ 𝒳, where W is a random variable satisfying E[e^{θW}] = D < ∞ for some θ > 0.

Then {V(X(t)), t ≥ 0} converges in distribution to a random variable V̄ for which there exist a θ* > 0 and a C* < ∞ such that

E[e^{θ* V̄}] ≤ C*,

which directly implies that all the moments of V̄ exist and are finite.

To start with, let us first show that the Markov chain {Z(t) = (Q(t), m(t)), t ≥ 0} with m(t) ≜ (Q̃¹(t), Q̃²(t), . . . , Q̃^M(t)) is irreducible and aperiodic. Let the initial state be Z(0) = (Q(0), m(0)) = (0_{1×N}, 0_{1×MN}), and let the state space 𝒵 consist of all the states that can be reached from the initial state. Consider any state Z. The queue length vector Q can reach the initial state with positive probability, since the event that there are no exogenous arrivals and the offered service at each server is at least one during each time-slot happens with positive probability under our assumptions. Moreover, under the condition for the update strategy given by Eq. (5), the event that Q remains at the initial state while all Q̃^m reach the initial state happens with positive probability. Therefore, any state in the state space can reach the initial state, and hence the Markov chain is irreducible.
The aperiodicity of the Markov chain comes from the fact that the transition probability from the initial state to itself is positive.

In order to show positive recurrence, we adopt the Foster-Lyapunov theorem. In particular, we consider the following Lyapunov function

W(Z(t)) = ‖Q(t)‖² + Σ_{m=1}^{M} Σ_{n=1}^{N} |Q_n(t) − Q̃^m_n(t)|,

and in the rest of the proof we use W(t) as an abbreviation of W(Z(t)). Let X_{mn}(t) ≜ |Q_n(t) − Q̃^m_n(t)|. The conditional mean drift of W(t), defined as D(Z(t)) ≜ E[W(t + T) − W(t) | Z(t)], can be decomposed as follows:

D(Z(t)) = D_Q(t) + Σ_{m=1}^{M} Σ_{n=1}^{N} D_{X_{mn}}(t),   (7)

where

D_Q(t) ≜ E[ ‖Q(t + T)‖² − ‖Q(t)‖² | Z(t) ],
D_{X_{mn}}(t) ≜ E[ X_{mn}(t + T) − X_{mn}(t) | Z(t) ].

Let us first consider the term D_{X_{mn}}(t). Note that for all t, m and n,

E[X_{mn}(t + 1) | Z(t) = Z] ≤ E[ (1 − I_{mn}(t)) (X_{mn}(t) + A_n(t) + S_n(t)) | Z(t) = Z ]
 ≤(a) (1 − p̂) X_{mn}(t) + λ_Σ + µ_max,   (8)

where (a) follows from the condition in Eq. (5) and µ_max = max_n µ_n. Then, denoting the conditioning time-slot by t₀, we have

D_{X_{mn}}(t₀) = E[ Σ_{t=t₀}^{t₀+T−1} ( X_{mn}(t + 1) − X_{mn}(t) ) | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ E[ X_{mn}(t + 1) − X_{mn}(t) | Z(t) ] | Z ]
 ≤(a) Σ_{t=t₀}^{t₀+T−1} E[ −p̂ X_{mn}(t) + λ_Σ + µ_max | Z ]
 ≤ −p̂ X_{mn}(t₀) + T (λ_Σ + µ_max),   (9)

where (a) follows from Eq. (8). Let us turn to the term D_Q(t). By the queue dynamics in Eq.
(2),

D_Q(t₀) = E[ Σ_{t=t₀}^{t₀+T−1} ( ‖Q(t + 1)‖² − ‖Q(t)‖² ) | Z(t₀) = Z ]
 = E[ Σ_{t=t₀}^{t₀+T−1} ( ‖Q(t) + A(t) − S(t) + U(t)‖² − ‖Q(t)‖² ) | Z ]
 ≤(a) E[ Σ_{t=t₀}^{t₀+T−1} ( ‖Q(t) + A(t) − S(t)‖² − ‖Q(t)‖² ) | Z ]
 = E[ Σ_{t=t₀}^{t₀+T−1} ( 2⟨Q(t), A(t) − S(t)⟩ + ‖A(t) − S(t)‖² ) | Z ]
 ≤(b) 2 E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q(t), A(t) − S(t)⟩ | Z ] + K T,   (10)

where (a) follows from the facts that Q_n(t) + A_n(t) − S_n(t) + U_n(t) = max(Q_n(t) + A_n(t) − S_n(t), 0) for any n and t ≥ 0, and that (max(a, 0))² ≤ a² for any a ∈ R; (b) holds by our assumption of light-tailed distributions for the total arrival process and each service process in Eq. (3). In particular, the second moments of the total arrival process and of the service process of each server are finite (independent of ǫ), and hence there exists a finite upper bound K which is independent of the load parameter ǫ.

Now, let us continue to work on Eq. (10). In particular, we have

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q(t), A(t) − S(t)⟩ | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q(t), A(t) − S(t)⟩ | Z(t) ] | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q(t), A(t)⟩ | Z(t) ] | Z(t₀) = Z ]   (11)
 − Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) µ_n | Z(t₀) = Z ].   (12)

For Eq. (11), we have

Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q(t), A(t)⟩ | Z(t) ] | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} E[ A_{mn}(t) | Z(t) ] | Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} P_{mn}(t) λ_m | Z(t₀) = Z ]
 =(a) Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} ( β_{mn}(t) + µ_n/µ_Σ ) λ_m | Z(t₀) = Z ],

where (a) follows from the definition of β_{mn}(t). Then, it can be further simplified as follows:

Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q(t), A(t)⟩ | Z(t) ] | Z(t₀) = Z ]
 =(a) Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z ]
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} (µ_n/µ_Σ)(µ_Σ − ǫ) p_m | Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z ]
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) µ_n | Z ]
  − Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) (ǫ µ_n / µ_Σ) | Z ],   (13)

where in (a), p_m is the probability that arrivals are allocated to dispatcher m (equivalently, the fraction of the total arrivals allocated to dispatcher m).
(11), (12) and (13) yields

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q(t), A(t) − S(t)⟩ | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z(t₀) = Z ]
  − Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) (ǫ µ_n / µ_Σ) | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q_n(t) − Q̃^m_n(t) + Q̃^m_n(t) ) β_{mn}(t) λ_m | Z ]
  − Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) (ǫ µ_n / µ_Σ) | Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q_n(t) − Q̃^m_n(t) ) β_{mn}(t) λ_m | Z ]   (T₁)
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} Q̃^m_n(t) β_{mn}(t) λ_m | Z ]   (T₂)
  − Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) (ǫ µ_n / µ_Σ) | Z ].   (T₃)

We are going to handle each term one by one. To upper bound T₁, we use the following result on X_{mn}(t) = |Q_n(t) − Q̃^m_n(t)|.

Lemma 3.
Under the condition given by Eq. (5), for any t₀ and Z(t₀), there exist a finite T₀ independent of ǫ and a finite constant L that is a function only of p̂ and µ_Σ, such that for all T ≥ T₀,

E[ Σ_{t=t₀}^{t₀+T−1} X_{mn}(t) | Z(t₀) = Z ] ≤ L T

holds for all m and n.

Proof. See Appendix A.

By using Lemma 3 with T ≥ T₀, we have

T₁ ≤ λ_Σ Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} | Q_n(t) − Q̃^m_n(t) | | Z ] ≤ λ_Σ M N L T.   (14)

For T₂, we have

T₂ =(a) Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} Q̃^m_{σ_t(n)}(t) ∆_{mn}(t) λ_m | Z ] ≤(b) 0,   (15)

where (a) comes from the definition of the dispatching preference vector ∆^m(t); (b) holds since the dispatching preference is tilted and Q̃^m_{σ_t(1)}(t) ≤ Q̃^m_{σ_t(2)}(t) ≤ . . . ≤ Q̃^m_{σ_t(N)}(t).

For T₃, we have

T₃ ≥ (ǫ µ_min / µ_Σ) ‖Q(t₀)‖₁,   (16)

where µ_min = min_n µ_n.

Now, combining Eqs. (14), (15) and (16) yields

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q(t), A(t) − S(t)⟩ | Z(t₀) = Z ] ≤ −(ǫ µ_min / µ_Σ) ‖Q(t₀)‖₁ + λ_Σ M N L T.
Substituting the result above back into Eq. (10) yields

D_Q(t₀) ≤ −(2ǫ µ_min / µ_Σ) ‖Q(t₀)‖₁ + 2 λ_Σ M N L T + K T.   (17)

Now, we are ready to substitute Eq. (9) and Eq. (17) back into Eq. (7). As a result, we have

D(Z(t₀)) ≤ −(2ǫ µ_min / µ_Σ) ‖Q(t₀)‖₁ − p̂ Σ_{m=1}^{M} Σ_{n=1}^{N} X_{mn}(t₀) + 2 λ_Σ M N L T + K T + M N T (λ_Σ + µ_max)
 ≤(a) −ξ ( ‖Q(t₀)‖₁ + Σ_{m=1}^{M} Σ_{n=1}^{N} | Q_n(t₀) − Q̃^m_n(t₀) | ) + K₀,

where in (a), ξ = min(2ǫ µ_min / µ_Σ, p̂) and K₀ ≜ 2 λ_Σ M N L T + K T + M N T (λ_Σ + µ_max). Pick any α > 0 and define

ℬ ≜ { Z ∈ 𝒵 : ‖Q(t)‖₁ + Σ_{m=1}^{M} Σ_{n=1}^{N} | Q_n(t) − Q̃^m_n(t) | ≤ (K₀ + α)/ξ }.

Then, ℬ is a finite subset. For any Z ∈ ℬᶜ, D(Z) ≤ −α, and for any Z ∈ ℬ, D(Z) ≤ K₀. By the Foster-Lyapunov theorem, we have established positive recurrence.

Having shown that the Markov chain {Z(t), t ≥ 0} is ergodic, we are left with the task of showing that all the moments are finite in steady state. In order to do so, we use Lemma 2. In particular, we choose the Lyapunov function V(Z^{(ǫ)}) = ‖Q^{(ǫ)}‖ and then verify the two conditions. In the following, the superscript (ǫ) is omitted for ease of notation. To verify condition (C2), we have

|∆V(Z)| = | ‖Q(t + T)‖ − ‖Q(t)‖ | I(Z(t) = Z)
 ≤(a) ‖Q(t + T) − Q(t)‖ I(Z(t) = Z)
 ≤ Σ_{t'=t}^{t+T−1} ‖Q(t' + 1) − Q(t')‖ I(Z(t) = Z)
 ≤ Σ_{t'=t}^{t+T−1} ‖A(t') − S(t') + U(t')‖ I(Z(t) = Z)
 ≤(b) Σ_{t'=t}^{t+T−1} ( ‖A(t')‖ + 2 ‖S(t')‖ ) I(Z(t) = Z),   (18)

where (a) holds since | ‖x‖ − ‖y‖ | ≤ ‖x − y‖ for any x, y ∈ Rᴺ, and (b) follows from the triangle inequality and the fact that U_n(t') ≤ S_n(t') for all n and t'. Then, by our assumption of light-tailed distributions for both the total arrival and service processes, there exists a random variable W such that |∆V(Z)| ≺ W for all t and all Z ∈ 𝒵, and E[e^{θW}] = D is finite for some θ >
0, which verifies (C2).

For (C1), we have

E[∆V(Z) | Z(t) = Z]
 = E[ ‖Q(t + T)‖ − ‖Q(t)‖ | Z(t) = Z ]
 = E[ √(‖Q(t + T)‖²) − √(‖Q(t)‖²) | Z(t) = Z ]
 ≤(a) (1 / (2‖Q(t)‖)) E[ ‖Q(t + T)‖² − ‖Q(t)‖² | Z(t) = Z ]
 ≤(b) −ǫ µ_min / µ_Σ + (2 λ_Σ M N L T + K T) / (2‖Q(t)‖),

where (a) follows from the fact that f(x) = √x is concave, and (b) comes from Eq. (17) together with ‖Q(t)‖₁ ≥ ‖Q(t)‖. Hence, the drift is at most a negative constant whenever ‖Q(t)‖ is sufficiently large. Thus, condition (C1) is valid, and the proof of Theorem 1 is complete.

In order to prove the result, we need two intermediate results. One is called state-space collapse, as stated in Proposition 1, which is the key ingredient for establishing heavy-traffic delay optimality. Roughly speaking, it means that the multi-dimensional space of the queue length vector reduces to one dimension, in the sense that the deviation from the line on which all the queue lengths are equal is bounded by a constant independent of ǫ. The other intermediate result is concerned with the unused service. Based on these two intermediate results, we can prove heavy-traffic delay optimality. We omit the time reference t for simplicity when necessary.

Proposition 1.
Under the conditions in Theorem 2, Q_⊥ is bounded in the sense that in steady state there exist finite constants {L_r, r ∈ N} independent of ǫ such that

E[ ‖Q^{(ǫ)}_⊥‖ʳ ] ≤ L_r

for all ǫ ∈ (0, ǫ₀) and r ∈ N, where Q_⊥ = Q − ⟨Q, c⟩c is the perpendicular component of Q with respect to the line c = (1/√N)(1, 1, . . . , 1).

Proof. It suffices to show that V_⊥(Z^{(ǫ)}) ≜ ‖Q^{(ǫ)}_⊥‖ satisfies the conditions (C1) and (C2) in Lemma 2. Let us first consider condition (C2). In particular, we have

|∆V_⊥(Z)| = | ‖Q_⊥(t + T)‖ − ‖Q_⊥(t)‖ | I(Z(t) = Z)
 ≤(a) ‖Q_⊥(t + T) − Q_⊥(t)‖ I(Z(t) = Z)
 = ‖Q(t + T) − Q_∥(t + T) − Q(t) + Q_∥(t)‖ I(Z(t) = Z)
 ≤(b) ( ‖Q(t + T) − Q(t)‖ + ‖Q_∥(t + T) − Q_∥(t)‖ ) I(Z(t) = Z)
 ≤(c) 2 ‖Q(t + T) − Q(t)‖ I(Z(t) = Z)
 ≤(d) 2 Σ_{t'=t}^{t+T−1} ( ‖A(t')‖ + 2 ‖S(t')‖ ) I(Z(t) = Z),   (19)

where (a) follows from the fact that | ‖x‖ − ‖y‖ | ≤ ‖x − y‖ holds for any x, y ∈ Rᴺ; (b) follows from the triangle inequality; (c) holds due to the non-expansive property of the projection onto a convex set; (d) follows from Eq. (18). Then, by our assumption of light-tailed distributions for both the total arrival and service processes, there exists a random variable W such that |∆V_⊥(Z)| ≺ W for all t and all Z ∈ 𝒵, and E[e^{θW}] = D is finite for some θ >
0, which verifies (C2).

Let us turn to condition (C1). By the proof of Lemma 3.6 in [29], it suffices to establish the following result in order to verify (C1): there exist T > 0, K₁ ≥ 0, η > 0 and ǫ₀ > 0, such that for all t₀ and Z ∈ 𝒵,

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q_⊥(t), A(t) − S(t)⟩ | Z(t₀) = Z ] ≤ −η ‖Q_⊥(t₀)‖ + K₁   (20)

holds for all ǫ ∈ (0, ǫ₀). Note that

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q_⊥(t), A(t) − S(t)⟩ | Z(t₀) = Z ]
 =(a) Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q_⊥(t), A(t)⟩ | Z(t) ] | Z(t₀) = Z ]   (21)
  − Σ_{t=t₀}^{t₀+T−1} E[ Σ_n µ_n Q_{⊥,n}(t) | Z(t₀) = Z ],   (22)

where (a) follows from the tower property of conditional expectation and the fact that A(t) is independent of Z(t₀) given Z(t). Moreover, Q_{⊥,n}(t) denotes the n-th component of the vector Q_⊥(t). Now let us first focus on Eq. (21):

Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q_⊥(t), A(t)⟩ | Z(t) ] | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} P_{mn}(t) λ_m | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} ( β_{mn}(t) + µ_n/µ_Σ ) λ_m | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z ]
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} (µ_n/µ_Σ)(µ_Σ − ǫ) p_m | Z ].

Combining the result above with Eq. (22) yields

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q_⊥(t), A(t) − S(t)⟩ | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z ]   (23)
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) ( −ǫ µ_n / µ_Σ ) | Z ].   (24)

Note that by definition Q_{⊥,n}(t) = Q_n(t) − Q_avg(t), in which Q_avg(t) is the average queue length among the N queues at the beginning of time-slot t. Moreover, Q_{⊥,n}(t) can be written as

Q_{⊥,n}(t) = Q_n(t) − Q̃^m_n(t) + Q̃^m_n(t) − Q̄^m(t) + Q̄^m(t) − Q_avg(t)   (25)

for all m and t, in which Q̄^m(t) ≜ (1/N) Σ_{n=1}^{N} Q̃^m_n(t), i.e., the average queue length estimated by dispatcher m at the beginning of time-slot t. By utilizing Eq. (25), Eq.
(23) can be written as

Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q̃^m_n(t) − Q̄^m(t) ) β_{mn}(t) λ_m | Z ]   (26)
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q_n(t) − Q̃^m_n(t) ) β_{mn}(t) λ_m | Z ]   (27)
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q̄^m(t) − Q_avg(t) ) β_{mn}(t) λ_m | Z ].   (28)

Our main task now is to upper bound each term above. Let us start with Eq. (26). In particular, we can bound it by using the following result.

Lemma 4.
There exist finite positive constants η₀ and C₀ such that

Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q̃^m_n(t) − Q̄^m(t) ) β_{mn}(t) λ_m | Z ] ≤ −η₀ ‖Q_⊥(t₀)‖ + C₀

holds for all T ≥ 1, in which η₀ = λ_Σ δ p̂ / √N and C₀ = 3 (µ_Σ)² / p̂.

Proof. See Appendix B.

For Eqs. (27) and (28), we can bound both of them by using the result in Lemma 3. In particular, for Eq. (27), we have

Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q_n(t) − Q̃^m_n(t) ) β_{mn}(t) λ_m | Z ]
 ≤ λ_Σ Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} | Q_n(t) − Q̃^m_n(t) | | Z ]
 ≤ µ_Σ M N L T.   (29)

For Eq. (28), we have

Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q̄^m(t) − Q_avg(t) ) β_{mn}(t) λ_m | Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( (1/N) Σ_{n'=1}^{N} ( Q̃^m_{n'}(t) − Q_{n'}(t) ) ) β_{mn}(t) λ_m | Z ]
 ≤ λ_Σ Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} (1/N) Σ_{n'=1}^{N} | Q̃^m_{n'}(t) − Q_{n'}(t) | | Z ]
 ≤ µ_Σ M N L T.   (30)

We have obtained bounds for Eqs. (26), (27) and (28). Let us turn to Eq. (24), which can be upper bounded by the following result.
Lemma 5.
For any t₀ and Z,

Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) ( −ǫ µ_n / µ_Σ ) | Z(t₀) = Z ] ≤ ǫ √N T ‖Q_⊥(t₀)‖ + K₂,

where K₂ is a finite constant independent of ǫ.

Proof. See Appendix C.

Now, we are ready to bound the left-hand side of Eq. (20) by using the bounds for both Eq. (23) and Eq. (24). In particular, we have

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q_⊥(t), A(t) − S(t)⟩ | Z(t₀) = Z ]
 ≤ −(λ_Σ δ p̂ / √N) ‖Q_⊥(t₀)‖ + C₀ + 2 µ_Σ M N L T + ǫ √N T ‖Q_⊥(t₀)‖ + K₂
 =(a) ( T ǫ − λ_Σ δ p̂ / N ) √N ‖Q_⊥(t₀)‖ + K₁
 ≤ −( µ_Σ δ p̂ / (2√N) ) ‖Q_⊥(t₀)‖ + K₁,  for all ǫ < µ_Σ δ p̂ / (2NT + 2δp̂),   (31)

where in (a), K₁ = C₀ + 2 µ_Σ M N L T + K₂, which is independent of ǫ. Hence, this verifies condition (C1) with η = µ_Σ δ p̂ / (2√N), which is also independent of ǫ. Combined with condition (C2), this finishes the proof of Proposition 1.

Having proved the state-space collapse result, we turn to another intermediate result regarding the unused service, as stated in the following lemma. In words, this lemma says that in heavy traffic the unused service tends to be zero.

Lemma 6.
Under any LED policy, we have

lim_{ǫ↓0} E[ ‖U^{(ǫ)}‖² ] = 0.

Proof.
First, we would like to show that under any LED policy,

E[ ‖Ū^{(ǫ)}‖₁ ] = ǫ.   (32)

To see this, we consider the Lyapunov function W(Z(t)) = ‖Q(t)‖₁. Since LED is throughput optimal with all the moments being finite, the mean drift of W(Z(t)) in steady state is zero. Then, we have

0 = E[ ‖Ā^{(ǫ)}‖₁ − ‖S̄‖₁ + ‖Ū^{(ǫ)}‖₁ ],

which directly implies the result in Eq. (32).

Now let us fix n ∈ {1, . . . , N} and S′ > 0. We have, for any t ≥ 0,

U_n²(t) ≤ U_n(t) S_n(t)
 = U_n(t) S_n(t) I(S_n(t) ≤ S′) + U_n(t) S_n(t) I(S_n(t) > S′)
 ≤ U_n(t) S′ + S_n²(t) I(S_n(t) > S′).

In steady state, we have

E[ Ū_n² ] ≤ E[Ū_n] S′ + E[ S_n(∞)² I(S_n(∞) > S′) ]
 ≤(a) ǫ S′ + E[ S_n(0)² I(S_n(0) > S′) ]
 ≤(b) ǫ S′ + β,

where (a) follows from the fact that E[‖Ū^{(ǫ)}‖₁] = ǫ and the service processes are i.i.d.; in (b), we choose S′ such that E[S_n(0)² I(S_n(0) > S′)] ≤ β, which is possible by the exponential decay rate of S_n(0) under the light-tailed assumption. Thus, we have lim_{ǫ↓0} E[Ū_n²] ≤ β for any β >
0. Hence, we have lim_{ǫ↓0} E[Ū_n²] = 0 for each n, which directly implies our result.

Now, we are prepared to show that under the conditions in Theorem 2, the system achieves optimal delay in heavy traffic. More specifically, by Lemma 3 in [26], we only need to verify the following condition:

lim_{ǫ↓0} E[ ‖Q^{(ǫ)}(t + 1)‖₁ ‖U^{(ǫ)}(t)‖₁ ] = 0.   (33)

Let us define B^{(ǫ)} ≜ E[ ‖Q^{(ǫ)}(t + 1)‖₁ ‖U^{(ǫ)}(t)‖₁ ]. We can bound it as follows:

B^{(ǫ)} =(a) N E[ ⟨U^{(ǫ)}(t), −Q^{(ǫ)}_⊥(t + 1)⟩ ]
 ≤(b) N √( E[ ‖U^{(ǫ)}‖² ] E[ ‖Q^{(ǫ)}_⊥(t + 1)‖² ] )
 =(c) N √( E[ ‖U^{(ǫ)}‖² ] E[ ‖Q^{(ǫ)}_⊥(t)‖² ] )
 ≤(d) N √( E[ ‖U^{(ǫ)}‖² ] L₂ ),

where the equality (a) comes from the property Q^{(ǫ)}_n(t + 1) U^{(ǫ)}_n(t) = 0 for all n and all t ≥ 0, together with the definition of Q_⊥; the inequality (b) holds due to the Cauchy-Schwarz inequality; the equality (c) is true since the distributions of Q^{(ǫ)}_⊥(t + 1) and Q^{(ǫ)}_⊥(t) are the same in steady state; (d) follows from the state-space collapse result in Proposition 1. Finally, by Lemma 6 and the fact that L₂ is independent of ǫ, we have lim_{ǫ↓0} B^{(ǫ)} = 0, which finishes our proof.

We have introduced the Local-Estimation-Driven (LED) framework for load balancing policies in possibly heterogeneous systems with multiple dispatchers. Under this framework, each dispatcher keeps local and possibly outdated estimates of the queue lengths for all the servers, and makes its dispatching decision based only on these local estimates.
Communication between dispatchers and servers is used only to update the local estimates. We have established sufficient conditions for LED policies to achieve both throughput optimality and delay optimality in heavy traffic. These sufficient conditions not only establish delay optimality for many previous local-memory-based policies, but also enable us to tailor the design of new delay optimal policies to different application requirements. The heavy-traffic delay optimality of LED policies also resolves a recent open problem on the development of load balancing schemes that have access only to delayed information.

In future work, it will be interesting to investigate the LED framework in other asymptotic regimes, e.g., the large-system regime and the many-server heavy-traffic regime.
References

[1] Anonymous. Load balancing in large-scale heterogeneous systems with multiple dispatchers. Anonymous preprint under review. https://openreview.net/pdf?id=BJxN9Hvf-V, 2018.
[2] Jonatha Anselmi and Francois Dufour. Power-of-d-choices with memory: Fluid limit and optimality. arXiv preprint arXiv:1802.06566, 2018.
[3] Hong Chen and Heng-Qing Ye. Asymptotic optimality of balanced routing. Operations Research, 60(1):163–179, 2012.
[4] Atilla Eryilmaz and R. Srikant. Asymptotically tight steady-state queue length bounds implied by drift conditions. Queueing Systems, 72(3-4):311–359, 2012.
[5] G. J. Foschini and J. Salz. A basic dynamic routing problem and diffusion. IEEE Transactions on Communications, 26(3):320–327, 1978.
[6] Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley, and Amin Vahdat. Evolve or die: High-availability design principles drawn from Google's network infrastructure. In Proceedings of the 2016 ACM SIGCOMM Conference, pages 58–72, 2016.
[7] Bruce Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, pages 502–525, 1982.
[8] David Lipshutz. Open problem—load balancing using delayed information. Stochastic Systems, 9(3):305–306, 2019.
[9] Yi Lu, Qiaomin Xie, Gabriel Kliot, Alan Geller, James R. Larus, and Albert Greenberg. Join-Idle-Queue: A novel load balancing algorithm for dynamically scalable web services. Performance Evaluation, 68(11):1056–1071, 2011.
[10] Siva Theja Maguluri, R. Srikant, and Lei Ying. Heavy traffic optimal resource allocation algorithms for cloud computing clusters. Performance Evaluation, 81:20–39, 2014.
[11] Michael Mitzenmacher. How useful is old information? IEEE Transactions on Parallel and Distributed Systems, 11(1):6–20, 2000.
[12] Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems, 12(10):1094–1104, 2001.
[13] Michael Mitzenmacher. Analyzing distributed Join-Idle-Queue: A fluid limit approach. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 312–318. IEEE, 2016.
[14] Debankur Mukherjee, Sem C. Borst, Johan S. H. van Leeuwaarden, and Philip A. Whiting. Universality of load balancing schemes on the diffusion scale. Journal of Applied Probability, 53(4):1111–1124, 2016.
[15] Patrick Shuff. Building a billion user load balancer. 2016.
[16] Alexander L. Stolyar. Pull-based load distribution in large-scale heterogeneous service systems. Queueing Systems, 80(4):341–361, 2015.
[17] Alexander L. Stolyar. Pull-based load distribution among heterogeneous parallel servers: the case of multiple routers. Queueing Systems, 85(1-2):31–65, 2017.
[18] Lalith Suresh, Marco Canini, Stefan Schmid, and Anja Feldmann. C3: Cutting tail latency in cloud data stores via adaptive replica selection. In Proceedings of the 2015 USENIX NSDI Conference, pages 513–527, 2015.
[19] Mark van der Boor, Sem Borst, and Johan van Leeuwaarden. Load balancing in large-scale systems with multiple dispatchers. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pages 1–9. IEEE, 2017.
[20] Mark van der Boor, Sem Borst, and Johan van Leeuwaarden. Hyper-scalable JSQ with sparse feedback. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(1):1–37, 2019.
[21] Weina Wang, Kai Zhu, Lei Ying, Jian Tan, and Li Zhang. MapTask scheduling in MapReduce with data locality: Throughput and heavy-traffic optimality. IEEE/ACM Transactions on Networking, 24(1):190–203, 2016.
[22] Richard R. Weber. On the optimal assignment of customers to parallel servers. Journal of Applied Probability, pages 406–413, 1978.
[23] Wayne Winston. Optimality of the shortest line discipline. Journal of Applied Probability, 14(1):181–189, 1977.
[24] Qiaomin Xie and Yi Lu. Priority algorithm for near-data scheduling: Throughput and heavy-traffic optimality. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM), pages 963–972, 2015.
[25] Qiaomin Xie, Ali Yekkehkhany, and Yi Lu. Scheduling with multi-level data locality: Throughput and heavy-traffic optimality. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM), pages 1–9, 2016.
[26] Xingyu Zhou, Jian Tan, and Ness Shroff. Flexible load balancing with multi-dimensional state-space collapse: Throughput and heavy-traffic delay optimality. Performance Evaluation, 127:176–193, 2018.
[27] Xingyu Zhou, Jian Tan, and Ness Shroff. Heavy-traffic delay optimality in pull-based load balancing systems: Necessary and sufficient conditions. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3):1–33, 2018.
[28] Xingyu Zhou, Fei Wu, Jian Tan, Kannan Srinivasan, and Ness Shroff. Degree of queue imbalance: Overcoming the limitation of heavy-traffic delay optimality in load balancing systems. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(1):1–41, 2018.
[29] Xingyu Zhou, Fei Wu, Jian Tan, Yin Sun, and Ness Shroff. Designing low-complexity heavy-traffic delay-optimal load balancing schemes: Theory to algorithms. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):39, 2017.
Appendix

A Proof of Lemma 3
First, we show that the Markov chain $\{X_{mn}(t), t \ge 0\}$ is ergodic. It is irreducible and aperiodic, since any state can reach the initial state $X_{mn}(0) = 0$ via an update, and the initial state can also reach itself. It is positive recurrent, since the expected return time to state $0$ is finite under the condition in Eq. (5). By the ergodicity of the Markov chain, the limiting distribution $\pi = (\pi_0, \pi_1, \ldots)$ exists and is also the unique stationary distribution. Here,
$$\pi_k = \lim_{n \to \infty} P^n_{vk} \quad \forall v,$$
where $P^n_{vk}$ is the probability of being in state $k$ after $n$ steps, given that we are in state $v$ now. Since the limiting distribution is independent of the initial state, we can pick any state to start with. In particular, we pick the initial state $X_{mn}(0) = 0$. By Eq. (8), we have for any $t \ge 0$
$$\mathbb{E}\left[X_{mn}(t+1)\right] \le (1-p)\,\mathbb{E}\left[X_{mn}(t)\right] + \lambda_\Sigma + \mu_{\max}. \quad (34)$$
We use induction to establish that $\mathbb{E}\left[X_{mn}(t)\right] \le \frac{\lambda_\Sigma + \mu_{\max}}{p}$ for all $t$. The base case holds since $X_{mn}(0) = 0$. Assume that $\mathbb{E}\left[X_{mn}(t)\right] \le \frac{\lambda_\Sigma + \mu_{\max}}{p}$ holds for $t = t_0$; then by Eq. (34), $\mathbb{E}\left[X_{mn}(t_0+1)\right] \le \frac{\lambda_\Sigma + \mu_{\max}}{p}$ also holds, which completes the induction. In particular, $\mathbb{E}\left[X_{mn}(t)\right] \le \frac{\lambda_\Sigma + \mu_{\max}}{p} \le \frac{2\mu_\Sigma}{p}$ for all $t$. This directly implies that the stationary distribution $\pi$ of $\{X_{mn}(t), t \ge 0\}$ has a finite mean (independent of $\epsilon$). Similarly, by applying the same steps as in Eq. (8) combined with the inductive argument, we can show that all the moments of the stationary distribution are bounded (independent of $\epsilon$), because all the moments of the total arrival process and of each service process are bounded (independent of $\epsilon$) by our light-tailed assumption.

Let $f_T \triangleq \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} X_{mn}(t)$ given $Z(t_0) = Z$. By ergodicity (i.e., time-average equals ensemble-average), we have that for any starting point $Z(t_0) = Z$, with probability 1,
$$\lim_{T \to \infty} f_T = \sum_{k=0}^{\infty} k \pi_k \le \frac{2\mu_\Sigma}{p}.$$
As a result, we can find a finite $T_0$ (independent of $\epsilon$, since all the moments of $\pi$ are bounded independently of $\epsilon$) such that for all $T \ge T_0$, $f_T \le \frac{4\mu_\Sigma}{p} \triangleq L$ with probability 1. Therefore, if $T \ge T_0$,
$$\mathbb{E}\left[\sum_{t=t_0}^{t_0+T-1} X_{mn}(t) \mid Z(t_0) = Z\right] \le LT,$$
which completes the proof.
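The inductive step above is a standard geometric-drift argument: any non-negative sequence starting at $0$ and satisfying $x(t+1) \le (1-p)\,x(t) + c$ never exceeds the fixed point $c/p$. A minimal numerical sketch, where $c$ stands in for $\lambda_\Sigma + \mu_{\max}$ and the constants are purely illustrative:

```python
# Sketch of the inductive bound: if x(0) = 0 and
# x(t+1) <= (1 - p) * x(t) + c, then x(t) <= c / p for all t.
def drift_iterates(p, c, steps):
    xs = [0.0]
    for _ in range(steps):
        # Worst case: take the drift inequality with equality.
        xs.append((1 - p) * xs[-1] + c)
    return xs

p, c = 0.2, 1.5          # illustrative values, not the paper's parameters
xs = drift_iterates(p, c, steps=200)
bound = c / p            # fixed point of the recursion, here 7.5
```

The iterates increase monotonically toward `bound` without ever crossing it, mirroring the induction in the proof.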
B Proof of Lemma 4
Consider the left-hand side (LHS) of the inequality in Lemma 4.
$$\text{LHS} \overset{(a)}{\le} \sum_{t=t_0}^{t_0+T-1} \sum_{m=1}^{M} \lambda_m \delta\, \mathbb{E}\left[\widetilde{Q}^m_{\min}(t) - \widetilde{Q}^m_{\max}(t) \mid Z(t_0) = Z\right] \overset{(b)}{\le} \sum_{m=1}^{M} \lambda_m \delta\, \mathbb{E}\left[\widetilde{Q}^m_{\min}(t_0+1) - \widetilde{Q}^m_{\max}(t_0+1) \mid Z(t_0) = Z\right] + \sum_{m=1}^{M} \lambda_m \delta\, \mathbb{E}\left[\widetilde{Q}^m_{\min}(t_0+2) - \widetilde{Q}^m_{\max}(t_0+2) \mid Z(t_0) = Z\right],$$
where (a) follows from the definition of the $\delta$-tilted dispatching strategy and the fact that $\widetilde{Q}^m_{\min}(t) \triangleq \widetilde{Q}^m_{\sigma_t(1)}(t) \le \widetilde{Q}^m_{\sigma_t(2)}(t) \le \ldots \le \widetilde{Q}^m_{\sigma_t(N)}(t) \triangleq \widetilde{Q}^m_{\max}(t)$; (b) holds since all the terms in the summation are non-positive and $T \ge 2$.

Define $I^m_{\min}(t)$ as the indicator function that equals $1$ if the server with the minimal true queue length at the end of time-slot $t$ is updated by dispatcher $m$. Similarly, define $I^m_{\max}(t)$ as the indicator function that equals $1$ if the server with the maximal true queue length at the end of time-slot $t$ is updated by dispatcher $m$. In the following, we consider the event that $I^m_{\max}(t_0) = 1$ and $I^m_{\min}(t_0+1) = 1$, i.e., at the end of time-slot $t_0$ the server with the maximal actual queue length is updated, and at the end of time-slot $t_0+1$ the server with the minimal actual queue length is updated. Then, by the law of total expectation, we have
$$\text{LHS} \le \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[\widetilde{Q}^m_{\min}(t_0+1) - \widetilde{Q}^m_{\max}(t_0+1) + \widetilde{Q}^m_{\min}(t_0+2) - \widetilde{Q}^m_{\max}(t_0+2) \mid Z,\, I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\right] \overset{(a)}{\le} \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[\widetilde{Q}^m_{\min}(t_0+1) - Q_{\max}(t_0+1) + Q_{\min}(t_0+2) - \widetilde{Q}^m_{\max}(t_0+2) \mid Z,\, I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\right] \overset{(b)}{\le} \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[-Q_{\max}(t_0+1) + Q_{\min}(t_0+2) \mid Z(t_0)=Z,\, I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\right], \quad (35)$$
where (a) follows from the fact that with $I^m_{\max}(t_0)=1$ we have $\widetilde{Q}^m_{\max}(t_0+1) \ge Q_{\max}(t_0+1)$, and with $I^m_{\min}(t_0+1)=1$ we have $\widetilde{Q}^m_{\min}(t_0+2) \le Q_{\min}(t_0+2)$; (b) holds since $\widetilde{Q}^m_{\min}(t_0+1) \le \widetilde{Q}^m_{\max}(t_0+1) \le \widetilde{Q}^m_{\max}(t_0+2)$. This is because under any LED policy, a local estimate can decrease only when the corresponding queue is updated.

In order to relate the queue lengths at time-slots $t_0+1$ and $t_0+2$ to the queue lengths at $t_0$, we use the following result.

Claim 1. For any $t_0$, we have
1. $\mathbb{E}\left[Q_{\min}(t_0+1) \mid Z(t_0) = Z\right] \le Q_{\min}(t_0) + M$
2. $\mathbb{E}\left[Q_{\max}(t_0+1) \mid Z(t_0) = Z\right] \ge Q_{\max}(t_0) - M$
where $M = \mu_\Sigma$.

Proof. Let us start with the first result. Suppose that server $i$ has the shortest queue length at time-slot $t_0$. We have
$$\mathbb{E}\left[Q_i(t_0+1) \mid Z(t_0)=Z\right] = \mathbb{E}\left[Q_i(t_0) + A_i(t_0) - S_i(t_0) + U_i(t_0) \mid Z(t_0)\right] \le Q_i(t_0) + \max(\lambda_\Sigma, \mu_i) \le Q_i(t_0) + \mu_\Sigma = Q_i(t_0) + M. \quad (36)$$
If at time-slot $t_0+1$ the same server $i$ still has the shortest queue length, then we are done. If not, suppose that some other server $j$ has the shortest queue length at time-slot $t_0+1$. Now, assume that $\mathbb{E}\left[Q_j(t_0+1) \mid Z(t_0)=Z\right] > Q_i(t_0) + M$; we will arrive at a contradiction. First, since $Q_j(t_0+1) \le Q_i(t_0+1)$, we have $\mathbb{E}\left[Q_j(t_0+1) \mid Z(t_0)=Z\right] \le \mathbb{E}\left[Q_i(t_0+1) \mid Z(t_0)=Z\right]$. Combined with our assumption, we get
$$Q_i(t_0) + M < \mathbb{E}\left[Q_j(t_0+1) \mid Z(t_0)=Z\right] \le \mathbb{E}\left[Q_i(t_0+1) \mid Z(t_0)=Z\right]. \quad (37)$$
Thus, we can see that Eq. (37) contradicts Eq. (36). Hence, $\mathbb{E}\left[Q_j(t_0+1) \mid Z(t_0)=Z\right] \le Q_i(t_0) + M$, which finishes the proof of the first result. The same argument can be applied to prove the second result.

Now, we are ready to bound Eq. (35). First,
$$\text{LHS} \le \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[-Q_{\max}(t_0+1) \mid Z(t_0)=Z\right] \quad (38)$$
$$\qquad + \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[Q_{\min}(t_0+2) \mid Z(t_0)=Z,\, \phi(t_0)\right], \quad (39)$$
where $\phi(t_0) \triangleq \{I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\}$. The inequality follows from the fact that given $Z(t_0)$, $Q_{\max}(t_0+1)$ is independent of $I^m_{\max}(t_0)$ and $I^m_{\min}(t_0+1)$. By using the bound in Claim 1, we have an upper bound for the term in Eq. (38):
$$\sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[-Q_{\max}(t_0+1) \mid Z(t_0)=Z\right] \le \sum_{m=1}^{M} \lambda_m \delta p\, \left(-Q_{\max}(t_0) + M\right). \quad (40)$$
For the term in Eq. (39), we have
$$\sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[Q_{\min}(t_0+2) \mid Z(t_0)=Z,\, \phi(t_0)\right] \overset{(a)}{=} \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[\mathbb{E}\left[Q_{\min}(t_0+2) \mid Z(t_0+1)\right] \mid Z(t_0)=Z,\, I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\right],$$
where (a) follows from the tower property of conditional expectation and the fact that given $Z(t_0+1)$, $Q_{\min}(t_0+2)$ is independent of $\phi(t_0)$. Then, it can be upper bounded as follows:
$$\sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[Q_{\min}(t_0+2) \mid Z(t_0)=Z,\, \phi(t_0)\right] \overset{(a)}{\le} \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[Q_{\min}(t_0+1) + M \mid Z(t_0)=Z,\, I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\right] \overset{(b)}{=} \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[Q_{\min}(t_0+1) + M \mid Z(t_0)=Z\right] \overset{(c)}{\le} \sum_{m=1}^{M} \lambda_m \delta p\, \left(Q_{\min}(t_0) + 2M\right),$$
where (a) holds by the bound in Claim 1; (b) holds since given $Z(t_0)$, $Q_{\min}(t_0+1)$ is independent of the event $\{I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\}$; (c) holds by the bound in Claim 1 again.

Thus, combining the bounds for Eqs. (38) and (39) yields
$$\text{LHS} \le \sum_{m=1}^{M} \lambda_m \delta p\, \left(Q_{\min}(t_0) - Q_{\max}(t_0) + 3M\right) = \lambda_\Sigma \delta p\, \left(Q_{\min}(t_0) - Q_{\max}(t_0)\right) + 3M \lambda_\Sigma \delta p \le -\frac{\lambda_\Sigma \delta p}{\sqrt{N}} \left\|Q_\perp(t_0)\right\| + 3(\mu_\Sigma)^2 p, \quad (41)$$
in which the last inequality follows from the fact that $\left\|Q_\perp(t_0)\right\| \le \sqrt{N}\left(Q_{\max}(t_0) - Q_{\min}(t_0)\right)$ and $M = \mu_\Sigma$ with $\delta \le 1$. Hence, the proof of Lemma 4 is complete.
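The final step relies on the deterministic inequality $\|Q_\perp\| \le \sqrt{N}\,(Q_{\max} - Q_{\min})$, where $Q_\perp$ is the component of $Q$ orthogonal to the all-ones direction. A quick numerical sanity check of this inequality on random vectors (the vectors and constants are illustrative only):

```python
import math
import random

def perp_norm(q):
    """Euclidean norm of the component of q orthogonal to the all-ones vector."""
    n = len(q)
    mean = sum(q) / n
    return math.sqrt(sum((x - mean) ** 2 for x in q))

# Each coordinate deviates from the mean by at most (max - min), so
# ||Q_perp||^2 <= N * (max - min)^2, i.e. ||Q_perp|| <= sqrt(N)*(max - min).
rng = random.Random(0)
for _ in range(1000):
    n = rng.randint(2, 10)
    q = [rng.uniform(0.0, 100.0) for _ in range(n)]
    assert perp_norm(q) <= math.sqrt(n) * (max(q) - min(q)) + 1e-9
```

The inequality is tight up to constants: for $q = (0, c, c, \ldots, c)$ the two sides differ only by a bounded factor, which is why the $\sqrt{N}$ loss in (41) cannot be avoided by this argument.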
C Proof of Lemma 5