Asymptotically Optimal Load Balancing in Large-scale Heterogeneous Systems with Multiple Dispatchers
Xingyu Zhou, Department of ECE, The Ohio State University, [email protected]
Ness Shroff, Department of ECE and CSE, The Ohio State University, shroff[email protected]
Adam Wierman, Department of Computing and Mathematical Sciences, California Institute of Technology, [email protected]
Abstract
We consider the load balancing problem in large-scale heterogeneous systems with multiple dispatchers. We introduce a general framework called Local-Estimation-Driven (LED). Under this framework, each dispatcher keeps local (possibly outdated) estimates of queue lengths for all the servers, and the dispatching decision is made purely based on these local estimates. The local estimates are updated via infrequent communications between dispatchers and servers. We derive sufficient conditions for LED policies to achieve throughput optimality and delay optimality in heavy traffic, respectively. These conditions directly imply delay optimality in heavy traffic for many previous local-memory based policies. Moreover, the results enable us to design new delay optimal policies for heterogeneous systems with multiple dispatchers. Finally, the heavy-traffic delay optimality of the LED framework directly resolves a recent open problem on how to design optimal load balancing schemes using delayed information.
Load balancing, which is responsible for dispatching jobs to parallel servers, has attracted significant interest in recent years. This is motivated by the challenges associated with efficiently dispatching jobs in large-scale data centers and cloud applications, which are rapidly increasing in size. A good load balancing policy not only ensures high throughput by maximizing server utilization, but also improves the user experience by minimizing delay.

There have been numerous load balancing policies proposed in the literature. The most straightforward one is Join-Shortest-Queue (JSQ), which has been shown to enjoy optimal delay in both non-asymptotic (for homogeneous servers) and asymptotic regimes [22, 5, 4]. However, it is difficult to implement in today's large-scale data centers due to the large message overhead between the dispatcher and servers. As a result, alternative load balancing policies with low message overhead have been proposed. For example, the Power-of-d policy [12] has been shown to achieve optimal average delay in heavy traffic with only 2d messages per arrival [10]. Another common load balancing policy is the pull-based Join-Idle-Queue (JIQ) [9, 16], which has been shown to outperform the Power-of-d policy using less overhead. However, both Power-of-d and JIQ mainly achieve good performance for systems with homogeneous servers. Recently, some works have considered heterogeneous servers and proposed flexible, low-message-overhead policies that achieve optimal delay in heavy traffic [29, 27]. However, only a single dispatcher is considered in these works. Theoretical analysis of load balancing with multiple dispatchers has so far mainly focused on the JIQ policy [13, 17], which has poor performance in heavy traffic and is even generally unstable for heterogeneous systems [29].

Note that heterogeneous systems with multiple dispatchers are now almost the default scenario in today's cloud infrastructures.
On the one hand, the heterogeneity comes from the use of multiple generations of CPUs and various types of devices [6]. On the other hand, with the massive amount of data, a scalable cloud infrastructure needs multiple dispatchers to increase both throughput and robustness [15].

Motivated by this, a recent work [1] proposes a new framework named Loosely-Shortest-Queue (LSQ) for designing load balancing policies for heterogeneous systems with multiple dispatchers. In particular, under this framework, each dispatcher keeps its own, local, and possibly outdated view of each server's queue length. Upon arrival, each dispatcher routes to the server with the shortest local view. A small amount of message overhead is used to update the local view. The authors successfully establish sufficient conditions on the update scheme for the system to be stable. Moreover, extensive simulations were conducted to show that LSQ policies significantly outperform well-known low-communication policies while using similar communication overhead, in both heterogeneous and homogeneous cases. However, no theoretical guarantees on the delay performance are provided. It is worth noting that the key challenge in establishing a delay performance guarantee for this framework is that it only uses possibly outdated local information to dispatch jobs. In fact, the problem of designing delay optimal load balancing schemes that only have access to delayed information has recently been listed as an open problem in [8].

Inspired by this, in this paper, we are particularly interested in the following questions: Is it possible to establish delay performance guarantees for load balancing in heterogeneous systems with multiple dispatchers? If so, can these guarantees be achieved using only delayed information?
Contributions.
To answer the questions above, we propose a general framework of load balancing for heterogeneous systems with multiple dispatchers that uses only delayed (out-of-date) information about the system state. We call this framework Local-Estimation-Driven (LED); it generalizes the LSQ framework. Our main results provide sufficient conditions for LED policies to be both throughput optimal and delay optimal in heavy traffic. Our key contributions can be summarized as follows.

First, we introduce the LED framework for designing load balancing policies for heterogeneous systems with multiple dispatchers. In this framework, each dispatcher keeps its own local estimates of queue lengths for all the servers, and makes its dispatching decision purely based on its own local estimates according to a certain dispatching strategy. The local estimates are updated infrequently via an update strategy that is based on communications between dispatchers and servers.

Second, we derive sufficient conditions for LED policies to be throughput optimal and delay optimal in heavy traffic. The importance of the sufficient conditions is three-fold: (i) It can be shown that previous local-memory based policies (e.g., LSQ) satisfy our sufficient conditions. As a result, we are able to show that they are not only throughput optimal (in a stronger sense) but also delay optimal in heavy traffic. (ii) The conditions allow us to design new delay optimal load balancing policies with zero dispatching delay and low message overhead that work for heterogeneous servers and multiple dispatchers. (iii) These conditions also provide us with a systematic approach for generalizing previous optimal policies to the case of multiple dispatchers and exploring the trade-off between memory (i.e., local estimates) and message overhead.

Third, the LED framework also resolves the open problem posed in [8], which asks how to design heavy-traffic delay optimal policies that only use delayed information.
Our main results for LED policies not only demonstrate that it is possible to achieve optimal delay in heavy traffic via only delayed information, but also highlight conditions on the extent to which old information is useful. Moreover, they provide methods for using the delayed information to achieve optimality in heavy traffic. Interestingly, the LED framework also shows that, in the case of multiple dispatchers, inaccurate information can actually lead to improved performance.

To establish the main results, we need to address the following two key challenges. First, each dispatcher in our model has access only to delayed and outdated system information. Second, each dispatcher does not know the arrivals to servers from other dispatchers, since there is no communication between them. As a result, for throughput optimality, we have to carefully design our Lyapunov function, since the local estimates can also be unbounded. For delay optimality, we consider two queueing systems: a local-estimation system and the actual system. We then have to transfer the negative drift on the local-estimation system to the actual system, which requires establishing new bounds and analyzing sample paths.
Related work.
The study of efficient load balancing algorithms has been a hot topic for a long time and spans different asymptotic regimes. The most extensively investigated policy might be Join-Shortest-Queue (JSQ), under which incoming jobs are always sent to the server with the shortest queue length. JSQ has been shown to be optimal in a stochastic order sense in the non-asymptotic regime for arbitrary arrival processes and identical service processes with non-decreasing failure rate [22, 23]. In the heavy-traffic asymptotic regime, in which the normalized load approaches one and the number of servers is fixed, JSQ has been proved to achieve optimal delay even for heterogeneous servers, using both diffusion approximations in [5] and the recently proposed drift-based method [4]. However, the optimality of JSQ comes at the cost of a large amount of communication between dispatchers and servers, which is particularly undesirable for large-scale data centers. Thus, some popular low-message-overhead alternative policies have been proposed, e.g., Power-of-d and Join-Idle-Queue (JIQ). Under Power-of-d, the dispatcher only needs to sample d ≥ 2 servers and route each arrival to the shortest queue among them. This simple policy has been shown to enjoy a doubly exponential decay rate in response time in the large-system asymptotic regime [12] and to achieve optimal delay in heavy traffic for homogeneous servers [3, 10]. Another low-message-overhead policy is JIQ (or the pull-based policy) [9, 16], under which arrivals are sent to one of the idle servers, if there are any, and to a randomly selected server otherwise. Compared to JSQ and Power-of-d, JIQ has the nice property of zero dispatching delay, since each arrival can be routed instantaneously rather than waiting for feedback from servers. Moreover, JIQ has been shown to outperform Power-of-d with even smaller message overhead (at most one message per job). In particular, under JIQ, arriving jobs achieve asymptotically zero waiting time in the large-system regime while Power-of-d does not.
An even stronger result suggests that, in the Halfin-Whitt asymptotic regime, JIQ achieves the same delay performance as JSQ [14]. Nevertheless, the performance of JIQ drops substantially in heavy traffic with a finite number of servers, even for homogeneous servers. In fact, it is not heavy-traffic delay optimal in this case [29]. Motivated by this, recent works have proposed alternative pull-based policies that not only enjoy all the nice features of JIQ but also achieve optimal delay in heavy traffic [29, 27]. However, these studies only consider the case of a single dispatcher.

Compared to the large literature on the single-dispatcher case, there are only a few works for the scenario of multiple dispatchers, and they mainly focus on the JIQ policy. In particular, [13] presents a new large-system asymptotic analysis of JIQ without the simplifying assumptions in [9]. The property of asymptotically zero waiting time under JIQ was generalized to the case of multiple dispatchers in [17]. However, the results for JIQ in [9, 13, 17] all assume that the loads at the various dispatchers are strictly equal. Without this assumption, [19] shows that the waiting time under JIQ no longer vanishes in the large-system regime, and two enhanced JIQ schemes are proposed. As mentioned earlier, although JIQ is a scalable choice for the multiple-dispatcher case, it is not delay optimal in heavy traffic for homogeneous servers and is not even generally stable for heterogeneous systems.

The case of heterogeneous systems with multiple dispatchers has received very little attention from the theoretical community so far. To the best of our knowledge, the framework proposed in [1] is the first attempt to study efficient load balancing schemes with a theoretical guarantee for the scenario of heterogeneous systems with multiple dispatchers.
In particular, under the proposed Loosely-Shortest-Queue (LSQ) framework, each dispatcher independently keeps its own local view of server queue lengths and routes jobs to the shortest among them. Communication is used only to update the local views and make sure that they are not too far from the real queue lengths. The main contributions of [1] are the sufficient conditions for any LSQ policy to achieve strong stability with low message overhead. Additionally, extensive simulations have been used to demonstrate its appeal. Nevertheless, a theoretical guarantee on the delay performance of LSQ policies remains an important unsolved question.

It is worth pointing out that the idea of using local memory to hold possibly old information for load balancing was also explored in two recent works [2, 20]. As we discuss later, these two proposed policies fall within our LED framework. Both works only consider a single dispatcher and homogeneous servers, which is also a special case of our model. Further, their analysis focuses on the large-system asymptotic regime, where the number of servers goes to infinity, while our analysis deals with a finite number of servers.

This section describes the system model and assumptions considered in this paper. Then, several necessary preliminaries are presented.
We consider a discrete-time (i.e., time-slotted) load balancing system consisting of M dispatchers and N possibly-heterogeneous servers. Each server maintains an infinite-capacity FIFO queue. At each dispatcher, there is a local memory, through which the dispatcher can have some (possibly delayed) information about the system state. In each time-slot, each dispatcher routes its new incoming tasks to one of the servers, immediately upon arrival. Once a task joins a queue, it remains in that queue until its service is completed. Each server is assumed to be work conserving, i.e., a server is idle if and only if its corresponding queue is empty.

Arrivals. Let A_m(t) denote the number of exogenous tasks that arrive at dispatcher m at the beginning of time-slot t. We assume that A_Σ(t) = Σ_{m=1}^M A_m(t) is an integer-valued random variable, which is i.i.d. across time-slots. The mean and variance of A_Σ(t) are denoted by λ_Σ and σ_Σ², respectively. We further assume that there is a positive probability that A_Σ(t) is zero. The allocation of the total arriving tasks among the M dispatchers is allowed to follow any arbitrary policy that is independent of the system state. Note that, in contrast to previous works on multiple dispatchers [9, 13, 17], we do not require that the loads at all dispatchers are equal. We assume that there is a strictly positive probability for tasks to arrive at each dispatcher in any time-slot t. That is, there exists a strictly positive constant p₀ such that

P(A_m(t) > 0) ≥ p₀, ∀(m, t) ∈ M × N, (1)

where M = {1, 2, ..., M}. Moreover, we assume that A_m(t) is i.i.d. across time-slots with mean arrival rate denoted by λ_m. We further let A_{m,n}(t) denote the number of new arrivals at server n from dispatcher m at the beginning of time-slot t. Let A_n(t) = Σ_{m=1}^M A_{m,n}(t) be the total number of arriving tasks at server n at the beginning of time-slot t.
Let S_n(t) denote the amount of service that server n offers for queue n in time-slot t. That is, S_n(t) is the maximum number of tasks that can be completed by server n in time-slot t. We assume that S_n(t) is an integer-valued random variable, which is i.i.d. across time-slots. We also assume that S_n(t) is independent across different servers as well as of the arrival process. The mean and variance of S_n(t) are denoted by µ_n and ν_n, respectively. Let µ_Σ ≜ Σ_{n=1}^N µ_n and ν_Σ ≜ Σ_{n=1}^N ν_n denote the mean and variance of the hypothetical total service process S_Σ(t) ≜ Σ_{n=1}^N S_n(t). Let ǫ = µ_Σ − λ_Σ characterize the distance between the arrival rate and the boundary of the capacity region.

Let Q_n(t) be the queue length of server n at the beginning of time-slot t. Let A_n(t) denote the number of tasks routed to queue n at the beginning of time-slot t according to the dispatching decision. Then the evolution of the length of queue n is given by

Q_n(t + 1) = Q_n(t) + A_n(t) − S_n(t) + U_n(t), n = 1, 2, ..., N, (2)

where U_n(t) = max{S_n(t) − Q_n(t) − A_n(t), 0} is the unused service due to an empty queue.

We do not assume any specific distribution for the arrival and service processes. Moreover, in contrast to previous works [29, 4], we do not require that the arrival and service processes have finite support. Instead, we only need the condition that their distributions are light-tailed. More specifically, we assume that

E[e^{θ₁ A_Σ(t)}] ≤ D₁ and E[e^{θ₂ S_n(t)}] ≤ D₂, (3)

for each n, where the constants θ₁ > 0, θ₂ > 0, D₁ < ∞, and D₂ < ∞ are all independent of ǫ.

We are interested in the case where the local memory at each dispatcher m stores an estimate of the queue length of each server n. In particular, we let Q̃^m_n(t) be the local estimate of the queue length of server n at dispatcher m at the beginning of time-slot t (before any arrivals and departures).
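Before proceeding, the per-slot dynamics in Eq. (2) can be sanity-checked in a few lines of Python (the function name is ours; the recursion itself is exactly Eq. (2)):

```python
def step_queue(q, arrivals, service):
    """One slot of Eq. (2): Q_n(t+1) = Q_n(t) + A_n(t) - S_n(t) + U_n(t),
    with U_n(t) = max{S_n(t) - Q_n(t) - A_n(t), 0} the unused service."""
    unused = max(service - q - arrivals, 0)
    return q + arrivals - service + unused

# Adding the unused service makes the update equivalent to
# Q_n(t+1) = max{Q_n(t) + A_n(t) - S_n(t), 0}:
assert step_queue(3, 1, 5) == 0  # one service token goes unused
assert step_queue(3, 1, 2) == 2
```

In particular, the unused-service term U_n(t) is what keeps queue lengths nonnegative while preserving the linear form of the recursion.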
More specifically, we introduce the following framework for load balancing.

Definition 1.
A Local-Estimation-Driven (LED) policy is composed of the following components: (a)
Dispatching strategy:
At the beginning of each time-slot, each dispatcher m chooses one of the servers for new arrivals purely based on its local estimates (i.e., the local queue length estimates Q̃^m). (b) Update strategy:
At the end of each time-slot, each dispatcher may update its local estimates, e.g., synchronize a local queue length estimate with the true queue length.

Under any LED policy, the system can be described by the Markov chain {Z(t) = (Q(t), m(t)), t ≥ 0} with state space Z, where Q(t) is the queue length vector and m(t) ≜ (Q̃¹(t), Q̃²(t), ..., Q̃ᴹ(t)) is the memory state. We consider a set of load balancing systems {Z^(ǫ)(t), t ≥ 0} parameterized by ǫ such that the mean arrival rate of the total exogenous arrival process {A_Σ^(ǫ)(t), t ≥ 0} is λ_Σ^(ǫ) = µ_Σ − ǫ. Note that the parameter ǫ characterizes the distance between the arrival rate and the boundary of the capacity region. We are interested in the throughput performance and the steady-state delay performance in the heavy-traffic regime under any LED policy.

A load balancing system is stable if the Markov chain {Z(t), t ≥ 0} is positive recurrent, and Z̄ = (Q̄, m̄) denotes the random vector whose distribution is the same as the steady-state distribution of {Z(t), t ≥ 0}. We have the following definition.

Definition 2 (Throughput Optimality). A load balancing policy is said to be throughput optimal if for any arrival rate within the capacity region, i.e., for any ǫ > 0, the system is positive recurrent and all the moments of ‖Q̄^(ǫ)‖ are finite.

Note that this is a stronger definition of throughput optimality than that in [1, 21, 25] because, besides positive recurrence, it also requires all the moments to be finite in steady state for any arrival rate within the capacity region.

To characterize the steady-state average delay performance in the heavy-traffic regime as ǫ approaches zero, by Little's law, it is sufficient to focus on the sum of all the queue lengths. First, recall the following fundamental lower bound on the expected sum of queue lengths in a load balancing system under any throughput optimal policy [4].
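To make Definition 1 concrete, here is a minimal sketch of one LED dispatcher in Python. The class and method names are ours, and the shortest-estimate dispatch plus sampled synchronization shown here are just one possible instantiation of the two strategies:

```python
import random

class Dispatcher:
    """Sketch of a single LED dispatcher (illustrative, names are ours).

    It keeps a local, possibly stale, estimate of every server's queue
    length and never reads the true queue lengths when dispatching."""

    def __init__(self, num_servers):
        self.estimates = [0] * num_servers  # local view Q~^m

    def dispatch(self):
        """Dispatching strategy: choose a server using only local estimates.
        Here: join the server with the shortest local estimate."""
        return min(range(len(self.estimates)), key=lambda n: self.estimates[n])

    def update(self, true_queues, sample_prob=0.2):
        """Update strategy: at the end of a time-slot, with probability
        sample_prob synchronize one randomly chosen estimate with the
        true queue length."""
        if random.random() < sample_prob:
            n = random.randrange(len(self.estimates))
            self.estimates[n] = true_queues[n]
```

Any concrete LED policy is then a choice of `dispatch` (e.g., shortest local estimate) together with a choice of `update` (e.g., push- or pull-based synchronization, discussed later).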
Note that this result was originally proved under the assumption of finite support on the service process (Lemma 5 in [4]); it can be generalized to service processes with light-tailed distributions via a careful analysis of the unused service, see our proof of Lemma 6.

Lemma 1.
Given any throughput optimal policy, and assuming that (σ_Σ^(ǫ))² converges to a constant σ_Σ² as ǫ decreases to zero, then

lim inf_{ǫ↓0} ǫ E[Σ_{n=1}^N Q̄_n^(ǫ)] ≥ ζ/2, (4)

where ζ ≜ σ_Σ² + ν_Σ.

The right-hand side of Eq. (4) is the heavy-traffic limit of a hypothetical single-server system with arrival process A_Σ^(ǫ)(t) and service process Σ_{n=1}^N S_n(t) for all t ≥ 0. This hypothetical single-server queueing system is often called the resource-pooled system. Since a task cannot be moved from one queue to another in the load balancing system, it is easy to see that the expected sum of queue lengths in the load balancing system is larger than the expected queue length in the resource-pooled system. However, if a policy achieves the lower bound in Eq. (4) in the heavy-traffic limit, then by Little's law this policy achieves the minimum average delay of the system in steady state, and is thus said to be heavy-traffic delay optimal, see [4, 10, 21, 24, 25, 29].
Definition 3 (Heavy-traffic Delay Optimality in Steady-state). A load balancing scheme is said to be heavy-traffic delay optimal in steady state if the steady-state queue length vector Q̄^(ǫ) satisfies

lim sup_{ǫ↓0} ǫ E[Σ_{n=1}^N Q̄_n^(ǫ)] ≤ ζ/2,

where ζ is defined in Lemma 1.

In order to provide a unified way to specify the dispatching strategy in LED, we first introduce a concept called dispatching preference. In particular, let P_{m,n}(t) be the probability that new arrivals at dispatcher m are dispatched to server n in time-slot t. We define β_{m,n}(t) ≜ P_{m,n}(t) − µ_n/µ_Σ, which is the difference between the probability that server n is chosen under a particular dispatching strategy and under random routing (weighted by service rate). Then, we have the following definition.

Definition 4 (Dispatching preference). Fix a dispatcher m, and let σ_t(·) be a permutation of (1, 2, ..., N) that satisfies

Q̃^m_{σ_t(1)}(t) ≤ Q̃^m_{σ_t(2)}(t) ≤ ... ≤ Q̃^m_{σ_t(N)}(t).

The dispatching preference at dispatcher m is an N-dimensional vector denoted by ∆^m(t), the n-th component of which is given by ∆_{m,n}(t) ≜ β_{m,σ_t(n)}(t).

In words, the dispatching preference at a dispatcher m specifies how servers with different local estimates are preferred, in a unified way that is independent of the actual values of the local estimates: it depends only on the relative order of the local estimates. More specifically, fix a dispatcher m; by definition we can see that the weighted random routing strategy has no preference for any server, and ∆_{m,n}(t) = 0 for every n. On the other hand, if new arrivals are always dispatched to the server with the shortest local estimate (as in, e.g., the LSQ policy), we have ∆_{m,1}(t) > 0 and ∆_{m,n}(t) < 0 for 2 ≤ n ≤ N. Thus, we can see that a positive value of ∆_{m,n}(t) means that the dispatching strategy has a preference for the server with the n-th shortest local estimate. This observation directly motivates the following two definitions.
Definition 5 (Tilted dispatching strategy). A dispatching strategy adopted at dispatcher m is said to be tilted if there exists a k ∈ {1, 2, ..., N} such that, for all t, ∆_{m,n}(t) ≥ 0 for all n ≤ k and ∆_{m,n}(t) ≤ 0 for all n > k.

Definition 6 (δ-tilted dispatching strategy). A dispatching strategy adopted at dispatcher m is said to be δ-tilted if for all t (i) it is a tilted dispatching strategy, and (ii) there exists a positive constant δ such that ∆_{m,1}(t) ≥ δ and ∆_{m,N}(t) ≤ −δ.

Remark 1.
Note that similar definitions were first provided in [29] for the case of a single dispatcher with up-to-date information. Based on these definitions, sufficient conditions were presented for throughput and heavy-traffic optimality. However, these conditions cannot be directly applied to our model due to the following two major challenges. One is that, in our model, each dispatcher only has access to outdated information. The other is that each dispatcher has no knowledge of the arrivals at the servers coming from other dispatchers, since there is no communication between them. To handle these challenges, we have to develop new techniques.
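Definitions 4-6 are mechanical to check for any concrete strategy. The following sketch (our own helper names) computes the dispatching preference of a strategy from its routing probabilities and tests the tilted and δ-tilted properties:

```python
def dispatching_preference(probs, estimates, mu):
    """Delta^m(t) from Definition 4: beta_{m,n}(t) = P_{m,n}(t) - mu_n/mu_Sigma,
    with entries reordered so position n refers to the server holding the
    n-th shortest local estimate."""
    mu_total = sum(mu)
    beta = [p - s / mu_total for p, s in zip(probs, mu)]
    order = sorted(range(len(estimates)), key=lambda n: estimates[n])
    return [beta[n] for n in order]

def is_tilted(delta, tol=1e-12):
    """Definition 5: some prefix of the preference vector is nonnegative
    and the remaining suffix is nonpositive."""
    seen_negative = False
    for d in delta:
        if d < -tol:
            seen_negative = True
        elif d > tol and seen_negative:
            return False
    return True

def is_delta_tilted(delta, lower):
    """Definition 6: tilted, with strictly positive preference for the
    shortest local estimate and strictly negative for the longest."""
    return is_tilted(delta) and delta[0] >= lower and delta[-1] <= -lower

# Weighted random routing has zero preference everywhere ...
mu = [1.0, 2.0, 1.0]
random_probs = [s / sum(mu) for s in mu]
assert dispatching_preference(random_probs, [3, 0, 5], mu) == [0.0, 0.0, 0.0]

# ... while sending everything to the shortest local estimate (server 1 here)
# is delta-tilted with delta = mu_min / mu_Sigma = 0.25:
delta = dispatching_preference([0.0, 1.0, 0.0], [3, 0, 5], mu)
assert is_delta_tilted(delta, lower=0.25)
```

Note that any preference vector produced this way sums to zero, since the routing probabilities and the weights µ_n/µ_Σ each sum to one.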
We end this section by providing intuitions behind the two definitions. To start, it can easily be seen that Σ_{n=1}^N ∆_{m,n}(t) = 0 for all m and t, by the definition of dispatching preference. Roughly speaking, a tilted dispatching strategy means that, compared to (weighted) random routing (which does not have any preference), the probabilities of choosing servers with shorter local estimates (the first k shortest ones) are increased, and, as a result, the probabilities of choosing servers with longer local estimates are reduced. This is the reason why we call it tilted, since more preference is given to queues with shorter local estimates. Therefore, a tilted dispatching strategy can be viewed as a strategy that is at least as 'good' as (weighted) random routing. On the other hand, a δ-tilted dispatching strategy can be viewed as a strategy that is strictly better than (weighted) random routing. The reason is that, besides being tilted, it also requires a strictly positive preference for the server with the shortest local estimate.

In this section, we first present the sufficient conditions for LED policies to be throughput optimal and heavy-traffic delay optimal. Then, we explore several example policies within the LED framework to demonstrate its flexibility in designing new load balancing schemes.
Let us begin with the sufficient conditions for LED policies to be throughput optimal. In particular, we specify conditions on the dispatching strategy and the update strategy that guarantee throughput optimality.

To state the theorem, we need the following notation. Let I_{m,n}(t) be an indicator function which equals 1 if and only if the local estimate of server n's queue length at dispatcher m gets updated, i.e., the estimated queue length Q̃^m_n(t) is set to the actual queue length Q_n(t) at the end of time-slot t.

Theorem 1.
Consider an LED policy. Suppose the dispatching strategy at each dispatcher is tilted and the update strategy guarantees that there exists a positive constant p₁ such that

E[I_{m,n}(t) | Z(t) = Z] ≥ p₁ (5)

holds for all Z and (m, n, t) ∈ M × N × N. Then, this policy is throughput optimal, i.e., the system under this policy is positive recurrent with all the moments bounded for any ǫ > 0.

Proof. See Section 5.1.

Note that this theorem directly implies that LSQ is not only strongly stable but also keeps all the moments of the queue lengths bounded in steady state. Moreover, it suggests that any dispatching strategy that is as good as (weighted) random routing is sufficient to guarantee throughput optimality. Further, the update probability can be a function of the traffic load.

Now, we turn to presenting the sufficient conditions for LED policies to be delay optimal in heavy traffic. In order to achieve delay optimality, we need stronger conditions on both the dispatching strategy and the update strategy.
Theorem 2.
Consider an LED policy. Suppose the dispatching strategy at each dispatcher is δ-tilted with a uniform lower bound δ > 0 that is independent of ǫ. Suppose the update strategy guarantees that there exists a positive constant p₁ (independent of ǫ) such that

E[I_{m,n}(t) | Z(t) = Z] ≥ p₁ (6)

holds for all Z and (m, n, t) ∈ M × N × N. Then, this policy is heavy-traffic delay optimal.

Proof. See Section 5.2.

This theorem not only establishes a delay performance guarantee for many previous local-memory based policies (e.g., LSQ in [1] and the low-message policies in [2, 20]), but also provides us with the flexibility to design new delay optimal load balancing schemes for different scenarios with heterogeneous servers and multiple dispatchers, as discussed in the next section. More importantly, our results directly suggest that it is possible to use only delayed information to achieve delay optimality, which resolves one of the open problems listed in [8].
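As a toy illustration of a policy satisfying both conditions, the sketch below simulates shortest-local-estimate dispatching (which is δ-tilted) combined with an update that synchronizes each estimate with probability bounded away from zero. The Bernoulli arrival/service model, the specific rates, and all names are our own assumptions for illustration, not the paper's setup:

```python
import random

def simulate_led(num_dispatchers=3, horizon=20000, update_prob=0.5, seed=0):
    """Discrete-time toy LED system: each dispatcher routes by shortest
    local estimate, and at the end of every slot synchronizes one uniformly
    chosen estimate with probability update_prob, so that
    E[I_{m,n}(t)] >= update_prob / N > 0, as Theorem 2 requires."""
    rng = random.Random(seed)
    mu = [0.2, 0.3, 0.4, 0.5]                       # heterogeneous service rates
    n_servers = len(mu)
    arrival_prob = 0.9 * sum(mu) / num_dispatchers  # total load is 90% of capacity
    queues = [0] * n_servers
    estimates = [[0] * n_servers for _ in range(num_dispatchers)]
    total = 0
    for _ in range(horizon):
        for m in range(num_dispatchers):            # dispatch on local info only
            if rng.random() < arrival_prob:
                n = min(range(n_servers), key=lambda i: estimates[m][i])
                queues[n] += 1
        for n in range(n_servers):                  # Bernoulli(mu_n) service
            if queues[n] > 0 and rng.random() < mu[n]:
                queues[n] -= 1
        for m in range(num_dispatchers):            # infrequent synchronization
            if rng.random() < update_prob:
                n = rng.randrange(n_servers)
                estimates[m][n] = queues[n]
        total += sum(queues)
    return total / horizon                          # time-averaged total queue length
```

In this sketch every routing decision uses only stale estimates, yet the time-averaged total queue length remains modest at 90% load, which is the qualitative behavior the theorem predicts.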
High-level proof idea.
We end this section by providing drift-based intuitions behind the technical proofs. In particular, let us consider two queueing systems: a local-estimation system and the actual system (i.e., the queue lengths at the servers). Throughput optimality requires the actual system to have a drift towards the origin. First, by the definition of a tilted dispatching strategy, there is an equivalent drift on the local-estimation system towards the origin. Then, the condition on the update strategy guarantees that the local-estimation system is never too far away from the actual system. Hence, the actual system also has a drift towards the origin. Heavy-traffic delay optimality requires not only a drift towards the origin, but also a drift towards the line on which all the queue lengths are equal. First, by the definition of a δ-tilted dispatching strategy, there is a drift towards the line on which all the local estimates within a given dispatcher are equal. Then, by the condition on the update strategy, the drift on the local-estimation system can be transferred to a drift on the actual system, and hence delay optimality follows. Note that, in the current proof, in order to make this 'drift-transfer' process valid, we impose the condition that both δ and p₁ are independent of ǫ. This is not necessarily required, and both of them could possibly be particular functions of ǫ, as in [26]. This relaxation could be an interesting future research direction.

To illustrate the applications of Theorems 1 and 2, in this section we introduce examples of LED policies that are both throughput optimal and heavy-traffic delay optimal. The flexibility provided by our sufficient conditions not only allows us to include previous policies as special cases, but also enables us to design new flexible policies. Let us first introduce some typical δ-tilted dispatching strategies.

Example 1 (Local-Join-Shortest-Queue (L-JSQ)).
At the beginning of each time-slot t, the dispatcher forwards its arrivals to the server with the shortest local estimate, with ties broken arbitrarily. That is, for dispatcher m, the chosen server is i* ∈ arg min_n {Q̃^m_n(t)}.

This dispatching strategy is the same as that in the LSQ policy in [1]. By the definition of dispatching preference, we can see that under L-JSQ, ∆_{m,1}(t) = 1 − µ_{σ_t(1)}/µ_Σ > 0 and ∆_{m,n}(t) = −µ_{σ_t(n)}/µ_Σ < 0 for n ≥ 2. Hence, it is δ-tilted even for heterogeneous servers, with δ = µ_min/µ_Σ where µ_min = min_n µ_n.

Instead of always joining the server with the shortest local estimate, it is also possible to join a server whose queue length is below a threshold while satisfying the condition of a δ-tilted dispatching preference.

Example 2 (Local-Join-Below-Average (L-JBA)). At the beginning of each time-slot t, the dispatcher forwards its arrivals to a randomly chosen server whose local estimate is at or below the average local queue length estimate. That is, consider dispatcher m with average local estimate Q̄^m(t) = (1/N) Σ_n Q̃^m_n(t). Let A ≜ {n : Q̃^m_n(t) ≤ Q̄^m(t)}. Then, for each i ∈ A, P_{m,i}(t) = µ_i / Σ_{n∈A} µ_n, and for i ∉ A, P_{m,i}(t) = 0.

It can easily be shown from the definition that L-JBA is also δ-tilted. Note that, compared to L-JSQ, in the heterogeneous case it requires the dispatcher to know the service rate of each server, which can easily be obtained via the update strategies introduced next. This strategy is more flexible than L-JSQ since it does not require new arrivals to be sent only to the server with the shortest local estimate, which could be useful in scenarios with data locality. Moreover, some randomness in the dispatching strategy is also useful, as discussed in the next section.

Further, it is possible to generalize many previous heavy-traffic delay optimal policies into the LED framework. For example, we can directly apply the Power-of-d policy as our dispatching strategy.

Example 3 (Local-Power-of-d (L-Pod)). At the beginning of each time-slot t, the dispatcher randomly chooses d ≥ 2 servers and sends arrivals to the server that has the shortest local estimate among the d servers.

It can easily be shown that L-Pod is tilted for homogeneous servers.
Moreover, for a given m, we have ∆_{m1}(t) = (d − 1)/N and ∆_{mN}(t) = −1/N, and hence it is δ-tilted with δ = 1/N.

Now, let us turn to discussing update strategies that satisfy the condition in Theorem 2. In particular, an update strategy can be either push-based (dispatchers sample servers) or pull-based (servers report to dispatchers).

Definition 7 (Push-Update). If there are new arrivals, then at the end of the time-slot dispatcher m samples d distinct servers with a positive probability p̂ and updates the corresponding d local estimates with the true values.

It has been shown in [1] that even for d = 1, the push-update strategy is guaranteed to satisfy the condition in Theorem 2.

Definition 8 (Pull-Update). At the end of each time-slot, each server n that has completed tasks picks a dispatcher m uniformly at random and then abides by one of the following two rules:
• If the server becomes idle (i.e., has no tasks), it sends (n, 0) to dispatcher m.
• If not, it sends (n, Q_n) to dispatcher m with probability p̂.

It has been shown in [1] that for any p̂ >
0, the pull-update strategy is guaranteed to satisfy the condition in Theorem 2.

Now, having introduced both the dispatching strategies and the update strategies, we can combine them to obtain different LED policies that are delay optimal in heavy traffic. For example, we have L-JSQ-Push, L-JSQ-Pull, L-JBA-Push, and L-JBA-Pull for heterogeneous servers, as well as L-Pod-Push and L-Pod-Pull for homogeneous servers.

We end this section by summarizing the contributions of the LED framework. (i)
It covers previous policies.
L-JSQ-Push (with p̂ = 1) and L-JSQ-Pull are the same as the LSQ policies considered in [1], which include the policies developed in both [2] and [20] as special cases. Thus, by Theorems 1 and 2, all these policies are throughput and heavy-traffic delay optimal. (ii) It allows randomness in dispatching.
The randomness introduced in L-JBA and L-Pod is helpful when dealing with an extremely low budget on the message overhead, as discussed next. (iii)
It enables trade-offs between memory and message overhead.
For example, L-Pod-Push and L-Pod-Pull are good examples of trading memory for low message overhead. That is, if each dispatcher directly used the traditional Power-of-d without any memory, then at least 4 messages per arrival would be needed to guarantee delay optimality in heavy traffic. In contrast, under both L-Pod-Push and L-Pod-Pull, the worst-case message overhead is just 1 per arrival. In addition, the message overhead can be further reduced by choosing a smaller value of p̂ in the update strategy.

Before moving to the proofs, we would like to discuss key features of and insights about LED, and point out possible refinements.

Key features of LED
In this section, we highlight the key features of the LED framework: low message overhead, zero dispatching delay, low computational complexity, and appealing performance across various loads.
Low message overhead.
It should be noted that communication occurs only during the update phase of LED policies. For the push-update strategy, the number of messages per arrival is at most 2d (and d can even be one). For the pull-update strategy, the number of messages per arrival is at most 1. In contrast, JSQ needs 2N messages per arrival and Power-of-d needs at least 4 messages per arrival. Although JIQ has a worst-case message overhead comparable to that of LED policies, it is not stable for heterogeneous servers.

Zero dispatching delay.
Another key feature of all LED policies is that there is zero dispatching delay. That is, the dispatcher can immediately route its new arrivals to the chosen server, since the decision is made purely based on its local estimates. Moreover, the communication between dispatchers and servers happens only after the decision is made. This is in contrast to typical push-based policies like JSQ and Power-of-d, under which the dispatcher has to wait for the responses of the sampled servers to make its dispatching decision, resulting in a non-zero dispatching delay.

Low computational complexity.
In order to implement LED policies, each dispatcher has to keep an array of size N for its local estimates. Such a space requirement is negligible in a modern cluster. Further, the operations required by the dispatching strategies of LED policies are very efficient. For example, to find the server with the minimal local estimate in L-JSQ, we can keep the array in a min-heap data structure. For L-JBA, we can maintain the average with an efficient running-average update. The simple L-Pod only needs random number generators.

Appealing performance across loads.
Although the theoretical delay optimality of the LED framework holds in the heavy-traffic asymptotic regime, the family of LED policies includes efficient policies that significantly outperform alternative low-message-overhead policies with the same (or even smaller) amount of communication. For example, if the dispatching strategy adopts L-JSQ, then LED reduces to the LSQ policy proposed in [1], which has been shown via extensive simulations to enjoy good performance over a wide range of traffic loads in different scenarios.

As mentioned earlier, the class of heavy-traffic delay optimal LED policies is broad and includes flexible combinations of dispatching and update strategies suited to different application scenarios. The actual delay performance (outside the heavy-load regime) varies with the particular choice of dispatching and update strategy. Thus, it is not possible to pick one particular LED policy that fits every circumstance, and that is not the focus of this paper. Instead, we present some useful insights about the LED framework below, which can serve as guidance for the choice or design of new LED policies.
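On the low-computational-complexity point above: since push- and pull-updates overwrite individual entries of the estimate array, a plain min-heap used for the L-JSQ argmin must tolerate stale entries; lazy invalidation is one standard way to handle this. The sketch below is our own illustration of that data-structure detail (the class and method names are hypothetical), not part of the LED specification.

```python
import heapq

class LocalEstimates:
    """Local estimate table with O(log N) amortized argmin via a
    lazily-invalidated min-heap."""

    def __init__(self, n):
        self.val = [0] * n                     # current local estimates
        self.heap = [(0, i) for i in range(n)] # (estimate, server) entries
        heapq.heapify(self.heap)

    def update(self, i, q):
        """Overwrite server i's estimate (e.g., on a push- or pull-update).
        Stale heap entries are discarded on access, not removed eagerly."""
        self.val[i] = q
        heapq.heappush(self.heap, (q, i))

    def argmin(self):
        """Return the server with the minimal current estimate."""
        while True:
            q, i = self.heap[0]
            if q == self.val[i]:   # top entry still matches the table
                return i
            heapq.heappop(self.heap)  # stale entry, drop it
```

Each `update` pushes a fresh entry, so a valid entry for every server is always present and the `argmin` loop terminates.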
The main trait of the LED framework is that only local, possibly delayed and inaccurate, information is used for making the dispatching decision. In the following, we present two useful insights about the use of inaccurate and delayed information for load balancing.
Inaccurate information can improve performance.
A big problem for load balancing with multiple dispatchers is herd behavior, in which arrivals at different dispatchers join the same server. This often leads to poor delay performance in practice [18]. For example, JSQ with multiple dispatchers exhibits serious herd behavior, since all the dispatchers route arrivals to the single shortest queue. In contrast, under the LED framework, each dispatcher may believe that a different queue is the shortest according to its own local estimates, because these estimates are inaccurate and delayed. Thus, jobs at different dispatchers are sent to different queues that may not have the actual shortest length but still have relatively small queue lengths. This intuition is illustrated by Fig. 1. In particular, we consider a setup with 10 dispatchers and 100 heterogeneous servers. All the LED policies are configured to have the same average message overhead as Power-of-2. It can be seen that the LED policies are not only stable but also achieve much better performance compared to JSQ, which suffers from herd behavior in the multiple-dispatcher case.
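The herd-behavior intuition can be seen in a toy calculation (our own illustration, not the experiment of Fig. 1; the queue lengths and the ±3 perturbation model for outdated views are arbitrary assumptions):

```python
import random

def chosen_servers(estimates):
    """Each dispatcher routes its arrivals to the argmin of its own view."""
    return [min(range(len(view)), key=lambda i: view[i]) for view in estimates]

random.seed(1)
true_queues = [4, 5, 6, 7, 8]   # hypothetical instantaneous queue lengths
M = 10                          # number of dispatchers

# Exact shared information (JSQ-style): every dispatcher herds to server 0.
exact_choices = chosen_servers([list(true_queues) for _ in range(M)])

# Independently outdated local estimates: each dispatcher last refreshed its
# view at a different time, modeled here as independent +/-3 perturbations.
noisy_views = [[q + random.randint(-3, 3) for q in true_queues] for _ in range(M)]
noisy_choices = chosen_servers(noisy_views)
```

With exact information all M dispatchers pick the same server (`len(set(exact_choices)) == 1`), while the independently perturbed views typically spread the M batches of arrivals over several distinct servers, all of which still have relatively short queues.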
Randomness is useful for heavily-delayed information.
As mentioned earlier, the LED framework provides the possibility of exploring load balancing with extremely low message overhead by choosing a small value of p̂ in the update strategy. As a result, the local information at each dispatcher is only updated after a long time interval. In this case, if a deterministic dispatching strategy (e.g., L-JSQ) is adopted, it again incurs herd behavior (even in the single-dispatcher case), since all the arrivals during the long update interval join the same queue. This is another motivation for considering L-JBA and L-Pod, which
Figure 1: Inaccurate information could improve performance in multiple-dispatcher case.
Figure 2: Randomness is useful for heavily-delayed information.

naturally introduce a certain level of randomness and hence help avoid herd behavior, as suggested by [11]. To illustrate this insight, we consider a setup with 10 dispatchers and 100 homogeneous servers. We compare the delay performance of L-JSQ-Push, L-Pod-Push, and L-JBA-Push with a small update probability p̂ and d = 2. As shown in Fig. 2, both L-JBA-Push and L-Pod-Push outperform L-JSQ-Push, which suffers from herd behavior because of heavily-delayed information.

Our main results suggest that there is a large class of heavy-traffic delay optimal LED policies. On the one hand, this provides flexibility to tailor the policy design to different application scenarios through different choices of dispatching and update strategies. On the other hand, it also suggests the need for refinements of LED beyond delay optimality in heavy traffic. To this end, we introduce two possible directions for refinements.
Degree of queue imbalance.
As introduced in [28], the degree of queue imbalance is a refined metric that further distinguishes heavy-traffic delay optimal policies. The idea is that, instead of looking at the average queue length (and hence the average delay), the degree of queue imbalance measures the expected difference in queue lengths among the servers. By following the proof of Proposition 5.6 in [28], we can establish that the degree of queue imbalance of all heavy-traffic delay optimal LED policies is bounded by a constant that grows as δ or p̂ shrinks. Thus, even though by Theorem 2 any positive δ and p̂ are sufficient for delay optimality in heavy traffic, a dispatching strategy with a smaller δ or an update strategy with a smaller p̂ could degrade the performance in practice.

Other asymptotic regimes.
In this paper, we focus on the heavy-traffic asymptotic regime, where the number of servers is fixed and the load approaches one. As mentioned before, there are also other asymptotic regimes in the analysis of load balancing schemes. One possible direction is to extend the fluid-limit techniques for the large-system regime in [20] to the case of multiple dispatchers and heterogeneous servers. Another alternative is the many-server heavy-traffic regime (e.g., the Halfin-Whitt regime), which strikes a balance between the heavy-traffic regime and the large-system regime. Studying LED in such regimes is another interesting direction for future work.
In this paper, we extend the Lyapunov drift-based approach developed in [4] to allow for unbounded supports of the arrival and service processes. In particular, we replace the finiteness condition on the drift in [4] by a stochastic dominance condition, as shown in (C2) in Lemma 2. As proved in [7], this weaker condition, combined with a negative drift condition, still guarantees finite moment bounds. Besides a weaker condition, we also replace the one-step drift with a T-step drift. Formally, we use the following lemma to derive bounded moments in steady state.

Lemma 2.
For an irreducible, aperiodic, and positive recurrent Markov chain {X(t), t ≥ 0} over a countable state space 𝒳, which converges in distribution to X̄, suppose V : 𝒳 → R₊ is a Lyapunov function. We define the T-time-slot drift of V at X as

∆V(X) ≜ [V(X(t + T)) − V(X(t))] I(X(t) = X),

where I(·) is the indicator function. Suppose that for some positive finite integer T, the T-time-slot drift of V satisfies the following conditions:

• (C1) There exist an η > 0 and a κ < ∞ such that for any t = 1, 2, . . . and for all X ∈ 𝒳 with V(X) ≥ κ,

E[∆V(X) | X(t) = X] ≤ −η.

• (C2) |∆V(X)| ≺ W for all t and all X ∈ 𝒳, where W is a random variable satisfying E[e^{θW}] = D < ∞ for some θ > 0.

Then {V(X(t)), t ≥ 0} converges in distribution to a random variable V̄ for which there exist a θ* > 0 and a C* < ∞ such that

E[e^{θ* V̄}] ≤ C*,

which directly implies that all the moments of V̄ exist and are finite.

To start with, let us first show that the Markov chain {Z(t) = (Q(t), m(t)), t ≥ 0} with m(t) ≜ (Q̃¹(t), Q̃²(t), . . . , Q̃^M(t)) is irreducible and aperiodic. Let the initial state be Z(0) = (Q(0), m(0)) = (0_{1×N}, 0_{1×MN}), and let the state space 𝒵 consist of all the states that can be reached from the initial state. Consider any state Z. The queue length vector Q can reach the initial state with positive probability, since the event that there are no exogenous arrivals and the offered service at each server is at least one during each time-slot happens with positive probability under our assumptions. Moreover, under the condition for the update strategy given by Eq. (5), the event that Q remains at the initial state while all Q̃^m reach the initial state happens with positive probability. Therefore, any state in the state space can reach the initial state, and hence the Markov chain is irreducible.
The aperiodicity of the Markov chain comes from the fact that the transition probability from the initial state to itself is positive.

In order to show positive recurrence, we adopt the Foster-Lyapunov theorem. In particular, we consider the following Lyapunov function

W(Z(t)) = ‖Q(t)‖² + Σ_{m=1}^{M} Σ_{n=1}^{N} |Q_n(t) − Q̃^m_n(t)|,

and in the rest of the proof we use W(t) as an abbreviation of W(Z(t)). Let X_{mn}(t) ≜ |Q_n(t) − Q̃^m_n(t)|. The conditional mean drift of W(t), defined as D(Z(t)) ≜ E[W(t + T) − W(t) | Z(t)], can be decomposed as follows:

D(Z(t)) = D_Q(t) + Σ_{m=1}^{M} Σ_{n=1}^{N} D_{X_{mn}}(t),   (7)

where

D_Q(t) ≜ E[ ‖Q(t + T)‖² − ‖Q(t)‖² | Z(t) ],
D_{X_{mn}}(t) ≜ E[ X_{mn}(t + T) − X_{mn}(t) | Z(t) ].

Let us first consider the term D_{X_{mn}}(t). Note that for all t, m and n,

E[X_{mn}(t + 1) | Z(t) = Z] ≤ E[ (1 − I_{mn}(t)) (X_{mn}(t) + A_n(t) + S_n(t)) | Z(t) = Z ]
 ≤(a) (1 − p̂) X_{mn}(t) + λ_Σ + µ_max,   (8)

where (a) follows from the condition in Eq. (5) and µ_max = max_n µ_n. Then, denoting the conditioning time-slot by t₀, we have

D_{X_{mn}}(t₀) = E[ Σ_{t=t₀}^{t₀+T−1} ( X_{mn}(t + 1) − X_{mn}(t) ) | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ E[ X_{mn}(t + 1) − X_{mn}(t) | Z(t) ] | Z ]
 ≤(a) Σ_{t=t₀}^{t₀+T−1} E[ −p̂ X_{mn}(t) + λ_Σ + µ_max | Z ]
 ≤ −p̂ X_{mn}(t₀) + T (λ_Σ + µ_max),   (9)

where (a) follows from Eq. (8). Let us turn to the term D_Q(t). By the queue dynamics in Eq.
(2),

D_Q(t₀) = E[ Σ_{t=t₀}^{t₀+T−1} ( ‖Q(t + 1)‖² − ‖Q(t)‖² ) | Z(t₀) = Z ]
 = E[ Σ_{t=t₀}^{t₀+T−1} ( ‖Q(t) + A(t) − S(t) + U(t)‖² − ‖Q(t)‖² ) | Z ]
 ≤(a) E[ Σ_{t=t₀}^{t₀+T−1} ( ‖Q(t) + A(t) − S(t)‖² − ‖Q(t)‖² ) | Z ]
 = E[ Σ_{t=t₀}^{t₀+T−1} ( 2⟨Q(t), A(t) − S(t)⟩ + ‖A(t) − S(t)‖² ) | Z ]
 ≤(b) 2 E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q(t), A(t) − S(t)⟩ | Z ] + K T,   (10)

where (a) follows from the facts that Q_n(t) + A_n(t) − S_n(t) + U_n(t) = max(Q_n(t) + A_n(t) − S_n(t), 0) for any n and t ≥ 0, and that (max(a, 0))² ≤ a² for any a ∈ R; (b) holds by our assumption of light-tailed distributions for the total arrival process and each service process in Eq. (3). In particular, the second moments of the total arrival process and of the service process of each server are finite (independent of ǫ), and hence there exists a finite upper bound K which is independent of the load parameter ǫ.

Now, let us continue to work on Eq. (10). In particular, we have

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q(t), A(t) − S(t)⟩ | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q(t), A(t) − S(t)⟩ | Z(t) ] | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q(t), A(t)⟩ | Z(t) ] | Z(t₀) = Z ]   (11)
 − Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) µ_n | Z(t₀) = Z ].   (12)

For Eq. (11), we have

Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q(t), A(t)⟩ | Z(t) ] | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} E[ A_{mn}(t) | Z(t) ] | Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} P_{mn}(t) λ_m | Z(t₀) = Z ]
 =(a) Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} ( β_{mn}(t) + µ_n/µ_Σ ) λ_m | Z(t₀) = Z ],

where (a) follows from the definition of β_{mn}(t). Then, it can be further simplified as follows:

Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q(t), A(t)⟩ | Z(t) ] | Z(t₀) = Z ]
 =(a) Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z ]
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} (µ_n/µ_Σ)(µ_Σ − ǫ) p_m | Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z ]
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) µ_n | Z ]
  − Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) (ǫ µ_n / µ_Σ) | Z ],   (13)

where in (a), p_m is the probability that arrivals are allocated to dispatcher m (equivalently, the fraction of the total arrivals allocated to dispatcher m).
(11), (12) and (13) yields

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q(t), A(t) − S(t)⟩ | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z(t₀) = Z ]
  − Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) (ǫ µ_n / µ_Σ) | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q_n(t) − Q̃^m_n(t) + Q̃^m_n(t) ) β_{mn}(t) λ_m | Z ]
  − Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) (ǫ µ_n / µ_Σ) | Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q_n(t) − Q̃^m_n(t) ) β_{mn}(t) λ_m | Z ]   (T₁)
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} Q̃^m_n(t) β_{mn}(t) λ_m | Z ]   (T₂)
  − Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_n(t) (ǫ µ_n / µ_Σ) | Z ].   (T₃)

We are going to handle each term one by one. To upper bound T₁, we use the following result on X_{mn}(t) = |Q_n(t) − Q̃^m_n(t)|.

Lemma 3.
Under the condition given by Eq. (5), for any t₀ and Z(t₀), there exist a finite T₀ independent of ǫ and a finite constant L that is a function only of p̂ and µ_Σ, such that for all T ≥ T₀,

E[ Σ_{t=t₀}^{t₀+T−1} X_{mn}(t) | Z(t₀) = Z ] ≤ L T

holds for all m and n.

Proof. See Appendix A.

By using Lemma 3 with T ≥ T₀, we have

T₁ ≤ λ_Σ Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} | Q_n(t) − Q̃^m_n(t) | | Z ] ≤ λ_Σ M N L T.   (14)

For T₂, we have

T₂ =(a) Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} Q̃^m_{σ_t(n)}(t) ∆_{mn}(t) λ_m | Z ] ≤(b) 0,   (15)

where (a) comes from the definition of the dispatching preference vector ∆^m(t); (b) holds since the dispatching preference is tilted and Q̃^m_{σ_t(1)}(t) ≤ Q̃^m_{σ_t(2)}(t) ≤ . . . ≤ Q̃^m_{σ_t(N)}(t).

For T₃, we have

T₃ ≥ (ǫ µ_min / µ_Σ) ‖Q(t₀)‖₁,   (16)

where µ_min = min_n µ_n.

Now, combining Eqs. (14), (15) and (16) yields

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q(t), A(t) − S(t)⟩ | Z(t₀) = Z ] ≤ −(ǫ µ_min / µ_Σ) ‖Q(t₀)‖₁ + λ_Σ M N L T.
Substituting the result above back into Eq. (10) yields

D_Q(t₀) ≤ −(2ǫ µ_min / µ_Σ) ‖Q(t₀)‖₁ + 2 λ_Σ M N L T + K T.   (17)

Now, we are ready to substitute Eq. (9) and Eq. (17) back into Eq. (7). As a result, we have

D(Z(t₀)) ≤ −(2ǫ µ_min / µ_Σ) ‖Q(t₀)‖₁ − p̂ Σ_{m=1}^{M} Σ_{n=1}^{N} X_{mn}(t₀) + 2 λ_Σ M N L T + K T + M N T (λ_Σ + µ_max)
 ≤(a) −ξ ( ‖Q(t₀)‖₁ + Σ_{m=1}^{M} Σ_{n=1}^{N} | Q_n(t₀) − Q̃^m_n(t₀) | ) + K₀,

where in (a), ξ = min(2ǫ µ_min / µ_Σ, p̂) and K₀ ≜ 2 λ_Σ M N L T + K T + M N T (λ_Σ + µ_max). Pick any α > 0 and define

ℬ ≜ { Z ∈ 𝒵 : ‖Q(t)‖₁ + Σ_{m=1}^{M} Σ_{n=1}^{N} | Q_n(t) − Q̃^m_n(t) | ≤ (K₀ + α)/ξ }.

Then, ℬ is a finite subset. For any Z ∈ ℬᶜ, D(Z) ≤ −α, and for any Z ∈ ℬ, D(Z) ≤ K₀. By the Foster-Lyapunov theorem, we have established positive recurrence.

Having shown that the Markov chain {Z(t), t ≥ 0} is ergodic, we are left with the task of showing that all the moments are finite in steady state. In order to do so, we use Lemma 2. In particular, we choose the Lyapunov function V(Z^{(ǫ)}) = ‖Q^{(ǫ)}‖ and then verify the two conditions. In the following, the superscript (ǫ) is omitted for ease of notation. To verify condition (C2), we have

|∆V(Z)| = | ‖Q(t + T)‖ − ‖Q(t)‖ | I(Z(t) = Z)
 ≤(a) ‖Q(t + T) − Q(t)‖ I(Z(t) = Z)
 ≤ Σ_{t'=t}^{t+T−1} ‖Q(t' + 1) − Q(t')‖ I(Z(t) = Z)
 ≤ Σ_{t'=t}^{t+T−1} ‖A(t') − S(t') + U(t')‖ I(Z(t) = Z)
 ≤(b) Σ_{t'=t}^{t+T−1} ( ‖A(t')‖ + 2 ‖S(t')‖ ) I(Z(t) = Z),   (18)

where (a) holds since | ‖x‖ − ‖y‖ | ≤ ‖x − y‖ for any x, y ∈ Rᴺ, and (b) follows from the triangle inequality and the fact that U_n(t') ≤ S_n(t') for all n and t'. Then, by our assumption of light-tailed distributions for both the total arrival and service processes, there exists a random variable W such that |∆V(Z)| ≺ W for all t and all Z ∈ 𝒵, and E[e^{θW}] = D is finite for some θ >
0, which verifies (C2).

For (C1), we have

E[∆V(Z) | Z(t) = Z]
 = E[ ‖Q(t + T)‖ − ‖Q(t)‖ | Z(t) = Z ]
 = E[ √(‖Q(t + T)‖²) − √(‖Q(t)‖²) | Z(t) = Z ]
 ≤(a) (1 / (2‖Q(t)‖)) E[ ‖Q(t + T)‖² − ‖Q(t)‖² | Z(t) = Z ]
 ≤(b) −ǫ µ_min / µ_Σ + (2 λ_Σ M N L T + K T) / (2‖Q(t)‖),

where (a) follows from the fact that f(x) = √x is concave, and (b) comes from Eq. (17) together with ‖Q(t)‖₁ ≥ ‖Q(t)‖. Hence, the drift is at most a negative constant whenever ‖Q(t)‖ is sufficiently large. Thus, condition (C1) is valid, and the proof of Theorem 1 is complete.

In order to prove the result, we need two intermediate results. One is called state-space collapse, as stated in Proposition 1, which is the key ingredient for establishing heavy-traffic delay optimality. Roughly speaking, it means that the multi-dimensional space of the queue length vector reduces to one dimension, in the sense that the deviation from the line on which all the queue lengths are equal is bounded by a constant independent of ǫ. The other intermediate result is concerned with the unused service. Based on these two intermediate results, we can prove heavy-traffic delay optimality. We omit the time reference t for simplicity when necessary.

Proposition 1.
Under the conditions in Theorem 2, Q_⊥ is bounded in the sense that in steady state there exist finite constants {L_r, r ∈ N} independent of ǫ such that

E[ ‖Q^{(ǫ)}_⊥‖ʳ ] ≤ L_r

for all ǫ ∈ (0, ǫ₀) and r ∈ N, where Q_⊥ = Q − ⟨Q, c⟩c is the perpendicular component of Q with respect to the line c = (1/√N)(1, 1, . . . , 1).

Proof. It suffices to show that V_⊥(Z^{(ǫ)}) ≜ ‖Q^{(ǫ)}_⊥‖ satisfies the conditions (C1) and (C2) in Lemma 2. Let us first consider condition (C2). In particular, we have

|∆V_⊥(Z)| = | ‖Q_⊥(t + T)‖ − ‖Q_⊥(t)‖ | I(Z(t) = Z)
 ≤(a) ‖Q_⊥(t + T) − Q_⊥(t)‖ I(Z(t) = Z)
 = ‖Q(t + T) − Q_∥(t + T) − Q(t) + Q_∥(t)‖ I(Z(t) = Z)
 ≤(b) ( ‖Q(t + T) − Q(t)‖ + ‖Q_∥(t + T) − Q_∥(t)‖ ) I(Z(t) = Z)
 ≤(c) 2 ‖Q(t + T) − Q(t)‖ I(Z(t) = Z)
 ≤(d) 2 Σ_{t'=t}^{t+T−1} ( ‖A(t')‖ + 2 ‖S(t')‖ ) I(Z(t) = Z),   (19)

where (a) follows from the fact that | ‖x‖ − ‖y‖ | ≤ ‖x − y‖ holds for any x, y ∈ Rᴺ; (b) follows from the triangle inequality; (c) holds due to the non-expansive property of the projection onto a convex set; (d) follows from Eq. (18). Then, by our assumption of light-tailed distributions for both the total arrival and service processes, there exists a random variable W such that |∆V_⊥(Z)| ≺ W for all t and all Z ∈ 𝒵, and E[e^{θW}] = D is finite for some θ >
0, which verifies (C2).

Let us turn to condition (C1). By the proof of Lemma 3.6 in [29], it suffices to establish the following result in order to verify (C1): there exist T > 0, K₁ ≥ 0, η > 0 and ǫ₀ > 0, such that for all t₀ and Z ∈ 𝒵,

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q_⊥(t), A(t) − S(t)⟩ | Z(t₀) = Z ] ≤ −η ‖Q_⊥(t₀)‖ + K₁   (20)

holds for all ǫ ∈ (0, ǫ₀). Note that

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q_⊥(t), A(t) − S(t)⟩ | Z(t₀) = Z ]
 =(a) Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q_⊥(t), A(t)⟩ | Z(t) ] | Z(t₀) = Z ]   (21)
  − Σ_{t=t₀}^{t₀+T−1} E[ Σ_n µ_n Q_{⊥,n}(t) | Z(t₀) = Z ],   (22)

where (a) follows from the tower property of conditional expectation and the fact that A(t) is independent of Z(t₀) given Z(t). Moreover, Q_{⊥,n}(t) denotes the n-th component of the vector Q_⊥(t). Now let us first focus on Eq. (21):

Σ_{t=t₀}^{t₀+T−1} E[ E[ ⟨Q_⊥(t), A(t)⟩ | Z(t) ] | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} P_{mn}(t) λ_m | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} ( β_{mn}(t) + µ_n/µ_Σ ) λ_m | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z ]
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} (µ_n/µ_Σ)(µ_Σ − ǫ) p_m | Z ].

Combining the result above with Eq. (22) yields

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q_⊥(t), A(t) − S(t)⟩ | Z(t₀) = Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z ]   (23)
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) ( −ǫ µ_n / µ_Σ ) | Z ].   (24)

Note that by definition Q_{⊥,n}(t) = Q_n(t) − Q_avg(t), in which Q_avg(t) is the average queue length among the N queues at the beginning of time-slot t. Moreover, Q_{⊥,n}(t) can be written as

Q_{⊥,n}(t) = Q_n(t) − Q̃^m_n(t) + Q̃^m_n(t) − Q̄^m(t) + Q̄^m(t) − Q_avg(t)   (25)

for all m and t, in which Q̄^m(t) ≜ (1/N) Σ_{n=1}^{N} Q̃^m_n(t), i.e., the average queue length estimated by dispatcher m at the beginning of time-slot t. By utilizing Eq. (25), Eq.
(23) can be written as

Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) Σ_{m=1}^{M} β_{mn}(t) λ_m | Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q̃^m_n(t) − Q̄^m(t) ) β_{mn}(t) λ_m | Z ]   (26)
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q_n(t) − Q̃^m_n(t) ) β_{mn}(t) λ_m | Z ]   (27)
  + Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q̄^m(t) − Q_avg(t) ) β_{mn}(t) λ_m | Z ].   (28)

Our main task now is to upper bound each term above. Let us start with Eq. (26). In particular, we can bound it by using the following result.

Lemma 4.
There exist finite positive constants η₀ and C₀ such that

Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q̃^m_n(t) − Q̄^m(t) ) β_{mn}(t) λ_m | Z ] ≤ −η₀ ‖Q_⊥(t₀)‖ + C₀

holds for all T ≥ 1, in which η₀ = λ_Σ δ p̂ / √N and C₀ = 3 (µ_Σ)² / p̂.

Proof. See Appendix B.

For Eqs. (27) and (28), we can bound both of them by using the result in Lemma 3. In particular, for Eq. (27), we have

Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q_n(t) − Q̃^m_n(t) ) β_{mn}(t) λ_m | Z ]
 ≤ λ_Σ Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} | Q_n(t) − Q̃^m_n(t) | | Z ]
 ≤ µ_Σ M N L T.   (29)

For Eq. (28), we have

Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( Q̄^m(t) − Q_avg(t) ) β_{mn}(t) λ_m | Z ]
 = Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} ( (1/N) Σ_{n'=1}^{N} ( Q̃^m_{n'}(t) − Q_{n'}(t) ) ) β_{mn}(t) λ_m | Z ]
 ≤ λ_Σ Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Σ_{m=1}^{M} (1/N) Σ_{n'=1}^{N} | Q̃^m_{n'}(t) − Q_{n'}(t) | | Z ]
 ≤ µ_Σ M N L T.   (30)

We have obtained bounds for Eqs. (26), (27) and (28). Let us turn to Eq. (24), which can be upper bounded by the following result.
Lemma 5.
For any t₀ and Z,

Σ_{t=t₀}^{t₀+T−1} E[ Σ_{n=1}^{N} Q_{⊥,n}(t) ( −ǫ µ_n / µ_Σ ) | Z(t₀) = Z ] ≤ ǫ √N T ‖Q_⊥(t₀)‖ + K₂,

where K₂ is a finite constant independent of ǫ.

Proof. See Appendix C.

Now, we are ready to bound the left-hand side of Eq. (20) by using the bounds for both Eq. (23) and Eq. (24). In particular, we have

E[ Σ_{t=t₀}^{t₀+T−1} ⟨Q_⊥(t), A(t) − S(t)⟩ | Z(t₀) = Z ]
 ≤ −(λ_Σ δ p̂ / √N) ‖Q_⊥(t₀)‖ + C₀ + 2 µ_Σ M N L T + ǫ √N T ‖Q_⊥(t₀)‖ + K₂
 =(a) ( T ǫ − λ_Σ δ p̂ / N ) √N ‖Q_⊥(t₀)‖ + K₁
 ≤ −( µ_Σ δ p̂ / (2√N) ) ‖Q_⊥(t₀)‖ + K₁,  for all ǫ < µ_Σ δ p̂ / (2NT + 2δp̂),   (31)

where in (a), K₁ = C₀ + 2 µ_Σ M N L T + K₂, which is independent of ǫ. Hence, this verifies condition (C1) with η = µ_Σ δ p̂ / (2√N), which is also independent of ǫ. Combined with condition (C2), this finishes the proof of Proposition 1.

Having proved the state-space collapse result, we turn to another intermediate result regarding the unused service, as stated in the following lemma. In words, this lemma says that in heavy traffic the unused service tends to be zero.

Lemma 6.
Under any LED policy, we have

lim_{ǫ↓0} E[ ‖U^{(ǫ)}‖² ] = 0.

Proof.
First, we would like to show that under any LED policy,

E[ ‖Ū^{(ǫ)}‖₁ ] = ǫ.   (32)

To see this, we consider the Lyapunov function W(Z(t)) = ‖Q(t)‖₁. Since LED is throughput optimal with all the moments being finite, the mean drift of W(Z(t)) in steady state is zero. Then, we have

0 = E[ ‖Ā^{(ǫ)}‖₁ − ‖S̄‖₁ + ‖Ū^{(ǫ)}‖₁ ],

which directly implies the result in Eq. (32).

Now let us fix n ∈ {1, . . . , N} and S′ > 0. We have, for any t ≥ 0,

U_n²(t) ≤ U_n(t) S_n(t)
 = U_n(t) S_n(t) I(S_n(t) ≤ S′) + U_n(t) S_n(t) I(S_n(t) > S′)
 ≤ U_n(t) S′ + S_n²(t) I(S_n(t) > S′).

In steady state, we have

E[ Ū_n² ] ≤ E[Ū_n] S′ + E[ S_n(∞)² I(S_n(∞) > S′) ]
 ≤(a) ǫ S′ + E[ S_n(0)² I(S_n(0) > S′) ]
 ≤(b) ǫ S′ + β,

where (a) follows from the fact that E[‖Ū^{(ǫ)}‖₁] = ǫ and the service processes are i.i.d.; in (b), we choose S′ such that E[S_n(0)² I(S_n(0) > S′)] ≤ β, which is possible by the exponential decay rate of S_n(0) under the light-tailed assumption. Thus, we have lim_{ǫ↓0} E[Ū_n²] ≤ β for any β >
0. Hence, we have lim_{ǫ↓0} E[Ū_n²] = 0 for each n, which directly implies our result.

Now, we are prepared to show that under the conditions in Theorem 2, the system achieves optimal delay in heavy traffic. More specifically, by Lemma 3 in [26], we only need to verify the following condition:

lim_{ǫ↓0} E[ ‖Q^{(ǫ)}(t + 1)‖₁ ‖U^{(ǫ)}(t)‖₁ ] = 0.   (33)

Let us define B^{(ǫ)} ≜ E[ ‖Q^{(ǫ)}(t + 1)‖₁ ‖U^{(ǫ)}(t)‖₁ ]. We can bound it as follows:

B^{(ǫ)} =(a) N E[ ⟨U^{(ǫ)}(t), −Q^{(ǫ)}_⊥(t + 1)⟩ ]
 ≤(b) N √( E[ ‖U^{(ǫ)}‖² ] E[ ‖Q^{(ǫ)}_⊥(t + 1)‖² ] )
 =(c) N √( E[ ‖U^{(ǫ)}‖² ] E[ ‖Q^{(ǫ)}_⊥(t)‖² ] )
 ≤(d) N √( E[ ‖U^{(ǫ)}‖² ] L₂ ),

where the equality (a) comes from the property Q^{(ǫ)}_n(t + 1) U^{(ǫ)}_n(t) = 0 for all n and all t ≥ 0, together with the definition of Q_⊥; the inequality (b) holds due to the Cauchy-Schwarz inequality; the equality (c) is true since the distributions of Q^{(ǫ)}_⊥(t + 1) and Q^{(ǫ)}_⊥(t) are the same in steady state; (d) follows from the state-space collapse result in Proposition 1. Finally, by Lemma 6 and the fact that L₂ is independent of ǫ, we have lim_{ǫ↓0} B^{(ǫ)} = 0, which finishes our proof.

We have introduced the Local-Estimation-Driven (LED) framework for load balancing policies in possibly heterogeneous systems with multiple dispatchers. Under this framework, each dispatcher keeps local and possibly outdated estimates of the queue lengths for all the servers, and makes its dispatching decision based only on these local estimates.
Communication between dispatchers and servers is used only to update the local estimates. We have established sufficient conditions for LED policies to achieve both throughput optimality and delay optimality in heavy traffic. These sufficient conditions not only establish delay optimality for many previous local-memory-based policies, but also enable us to tailor the design of new delay optimal policies to different application requirements. The heavy-traffic delay optimality of LED policies also resolves a recent open problem on the development of load balancing schemes that have access only to delayed information.

In future work, it will be interesting to investigate the LED framework in other asymptotic regimes, e.g., the large-system regime and the many-server heavy-traffic regime.
References

[1] Anonymous. Load balancing in large-scale heterogeneous systems with multiple dispatchers. Anonymous preprint under review. https://openreview.net/pdf?id=BJxN9Hvf-V, 2018.
[2] Jonatha Anselmi and Francois Dufour. Power-of-d-choices with memory: Fluid limit and optimality. arXiv preprint arXiv:1802.06566, 2018.
[3] Hong Chen and Heng-Qing Ye. Asymptotic optimality of balanced routing. Operations Research, 60(1):163–179, 2012.
[4] Atilla Eryilmaz and R. Srikant. Asymptotically tight steady-state queue length bounds implied by drift conditions. Queueing Systems, 72(3-4):311–359, 2012.
[5] G. J. Foschini and J. Salz. A basic dynamic routing problem and diffusion. IEEE Transactions on Communications, 26(3):320–327, 1978.
[6] Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley, and Amin Vahdat. Evolve or die: High-availability design principles drawn from Google's network infrastructure. In Proceedings of the 2016 ACM SIGCOMM Conference, pages 58–72, 2016.
[7] Bruce Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, pages 502–525, 1982.
[8] David Lipshutz. Open problem—load balancing using delayed information. Stochastic Systems, 9(3):305–306, 2019.
[9] Yi Lu, Qiaomin Xie, Gabriel Kliot, Alan Geller, James R. Larus, and Albert Greenberg. Join-Idle-Queue: A novel load balancing algorithm for dynamically scalable web services. Performance Evaluation, 68(11):1056–1071, 2011.
[10] Siva Theja Maguluri, R. Srikant, and Lei Ying. Heavy traffic optimal resource allocation algorithms for cloud computing clusters. Performance Evaluation, 81:20–39, 2014.
[11] Michael Mitzenmacher. How useful is old information? IEEE Transactions on Parallel and Distributed Systems, 11(1):6–20, 2000.
[12] Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems, 12(10):1094–1104, 2001.
[13] Michael Mitzenmacher. Analyzing distributed Join-Idle-Queue: A fluid limit approach. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 312–318. IEEE, 2016.
[14] Debankur Mukherjee, Sem C. Borst, Johan S. H. van Leeuwaarden, and Philip A. Whiting. Universality of load balancing schemes on the diffusion scale. Journal of Applied Probability, 53(4):1111–1124, 2016.
[15] Patrick Shuff. Building a billion user load balancer. 2016.
[16] Alexander L. Stolyar. Pull-based load distribution in large-scale heterogeneous service systems. Queueing Systems, 80(4):341–361, 2015.
[17] Alexander L. Stolyar. Pull-based load distribution among heterogeneous parallel servers: the case of multiple routers. Queueing Systems, 85(1-2):31–65, 2017.
[18] Lalith Suresh, Marco Canini, Stefan Schmid, and Anja Feldmann. C3: Cutting tail latency in cloud data stores via adaptive replica selection. In Proceedings of the 2015 USENIX NSDI Conference, pages 513–527, 2015.
[19] Mark van der Boor, Sem Borst, and Johan van Leeuwaarden. Load balancing in large-scale systems with multiple dispatchers. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pages 1–9. IEEE, 2017.
[20] Mark van der Boor, Sem Borst, and Johan van Leeuwaarden. Hyper-scalable JSQ with sparse feedback. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(1):1–37, 2019.
[21] Weina Wang, Kai Zhu, Lei Ying, Jian Tan, and Li Zhang. MapTask scheduling in MapReduce with data locality: Throughput and heavy-traffic optimality. IEEE/ACM Transactions on Networking, 24(1):190–203, 2016.
[22] Richard R. Weber. On the optimal assignment of customers to parallel servers. Journal of Applied Probability, pages 406–413, 1978.
[23] Wayne Winston. Optimality of the shortest line discipline. Journal of Applied Probability, 14(1):181–189, 1977.
[24] Qiaomin Xie and Yi Lu. Priority algorithm for near-data scheduling: Throughput and heavy-traffic optimality. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM), pages 963–972, 2015.
[25] Qiaomin Xie, Ali Yekkehkhany, and Yi Lu. Scheduling with multi-level data locality: Throughput and heavy-traffic optimality. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM), pages 1–9, 2016.
[26] Xingyu Zhou, Jian Tan, and Ness Shroff. Flexible load balancing with multi-dimensional state-space collapse: Throughput and heavy-traffic delay optimality. Performance Evaluation, 127:176–193, 2018.
[27] Xingyu Zhou, Jian Tan, and Ness Shroff. Heavy-traffic delay optimality in pull-based load balancing systems: Necessary and sufficient conditions. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3):1–33, 2018.
[28] Xingyu Zhou, Fei Wu, Jian Tan, Kannan Srinivasan, and Ness Shroff. Degree of queue imbalance: Overcoming the limitation of heavy-traffic delay optimality in load balancing systems. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(1):1–41, 2018.
[29] Xingyu Zhou, Fei Wu, Jian Tan, Yin Sun, and Ness Shroff. Designing low-complexity heavy-traffic delay-optimal load balancing schemes: Theory to algorithms. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):39, 2017.
Appendix

A Proof of Lemma 3
First, we show that the Markov chain $\{X_{mn}(t), t \ge 0\}$ is ergodic. It is irreducible and aperiodic, since any state can reach the initial state $X_{mn}(0) = 0$ via an update, and the initial state can also reach itself. It is positive recurrent, since the expected return time to state $0$ is finite under the condition in Eq. (5). By the ergodicity of the Markov chain, the limiting distribution $\pi = (\pi_0, \pi_1, \ldots)$ exists and is also the unique stationary distribution. Here,
$$\pi_k = \lim_{n \to \infty} P^n_{vk} \quad \forall v,$$
where $P^n_{vk}$ is the probability of being in state $k$ after $n$ steps, given that we are in state $v$ now. Since the limiting distribution is independent of the initial state, we can pick any state to start with. In particular, we pick the initial state $X_{mn}(0) = 0$. By Eq. (8), we have for any $t \ge 0$
$$\mathbb{E}\left[X_{mn}(t+1)\right] \le (1-p)\,\mathbb{E}\left[X_{mn}(t)\right] + \lambda_\Sigma + \mu_{\max}. \quad (34)$$
We use induction to establish that $\mathbb{E}\left[X_{mn}(t)\right] \le \frac{\lambda_\Sigma + \mu_{\max}}{p}$ for all $t$. The base case holds since $X_{mn}(0) = 0$. Assume that $\mathbb{E}\left[X_{mn}(t)\right] \le \frac{\lambda_\Sigma + \mu_{\max}}{p}$ holds for $t = t_0$; then by Eq. (34), $\mathbb{E}\left[X_{mn}(t_0+1)\right] \le \frac{\lambda_\Sigma + \mu_{\max}}{p}$ also holds, which completes the induction. In particular, $\mathbb{E}\left[X_{mn}(t)\right] \le \frac{\lambda_\Sigma + \mu_{\max}}{p} \le \frac{2\mu_\Sigma}{p}$ for all $t$. This directly implies that the stationary distribution $\pi$ of $\{X_{mn}(t), t \ge 0\}$ has a finite mean (independent of $\epsilon$). Similarly, by applying the same steps as in Eq. (8) combined with the inductive argument, we can show that all the moments of the stationary distribution are bounded (independent of $\epsilon$), because all the moments of the total arrival process and of each service process are bounded (independent of $\epsilon$) by our light-tailed assumption.

Let $f_T \triangleq \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} X_{mn}(t)$ given $Z(t_0) = Z$. By ergodicity (i.e., time-average equals ensemble-average), we have that for any starting point $Z(t_0) = Z$, with probability 1,
$$\lim_{T \to \infty} f_T = \sum_{k=0}^{\infty} k \pi_k \le \frac{2\mu_\Sigma}{p}.$$
As a result, we can find a finite $T_0$ (independent of $\epsilon$, since all the moments of $\pi$ are bounded independently of $\epsilon$) such that for all $T \ge T_0$, $f_T \le \frac{4\mu_\Sigma}{p} \triangleq L$ with probability 1. Therefore, if $T \ge T_0$,
$$\mathbb{E}\left[\sum_{t=t_0}^{t_0+T-1} X_{mn}(t) \mid Z(t_0) = Z\right] \le LT,$$
which completes the proof.
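The inductive step above is a standard geometric-drift argument: any non-negative sequence starting at $0$ and satisfying $x(t+1) \le (1-p)\,x(t) + c$ never exceeds the fixed point $c/p$. A minimal numerical sketch, where $c$ stands in for $\lambda_\Sigma + \mu_{\max}$ and the constants are purely illustrative:

```python
# Sketch of the inductive bound: if x(0) = 0 and
# x(t+1) <= (1 - p) * x(t) + c, then x(t) <= c / p for all t.
def drift_iterates(p, c, steps):
    xs = [0.0]
    for _ in range(steps):
        # Worst case: take the drift inequality with equality.
        xs.append((1 - p) * xs[-1] + c)
    return xs

p, c = 0.2, 1.5          # illustrative values, not the paper's parameters
xs = drift_iterates(p, c, steps=200)
bound = c / p            # fixed point of the recursion, here 7.5
```

The iterates increase monotonically toward `bound` without ever crossing it, mirroring the induction in the proof.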
B Proof of Lemma 4
Consider the left-hand side (LHS) of the inequality in Lemma 4.
$$\text{LHS} \overset{(a)}{\le} \sum_{t=t_0}^{t_0+T-1} \sum_{m=1}^{M} \lambda_m \delta\, \mathbb{E}\left[\widetilde{Q}^m_{\min}(t) - \widetilde{Q}^m_{\max}(t) \mid Z(t_0) = Z\right] \overset{(b)}{\le} \sum_{m=1}^{M} \lambda_m \delta\, \mathbb{E}\left[\widetilde{Q}^m_{\min}(t_0+1) - \widetilde{Q}^m_{\max}(t_0+1) \mid Z(t_0) = Z\right] + \sum_{m=1}^{M} \lambda_m \delta\, \mathbb{E}\left[\widetilde{Q}^m_{\min}(t_0+2) - \widetilde{Q}^m_{\max}(t_0+2) \mid Z(t_0) = Z\right],$$
where (a) follows from the definition of the $\delta$-tilted dispatching strategy and the fact that $\widetilde{Q}^m_{\min}(t) \triangleq \widetilde{Q}^m_{\sigma_t(1)}(t) \le \widetilde{Q}^m_{\sigma_t(2)}(t) \le \ldots \le \widetilde{Q}^m_{\sigma_t(N)}(t) \triangleq \widetilde{Q}^m_{\max}(t)$; (b) holds since all the terms in the summation are non-positive and $T \ge 2$.

Define $I^m_{\min}(t)$ as the indicator function that equals $1$ if the server with the minimal true queue length at the end of time-slot $t$ is updated by dispatcher $m$. Similarly, define $I^m_{\max}(t)$ as the indicator function that equals $1$ if the server with the maximal true queue length at the end of time-slot $t$ is updated by dispatcher $m$. In the following, we consider the event that $I^m_{\max}(t_0) = 1$ and $I^m_{\min}(t_0+1) = 1$, i.e., at the end of time-slot $t_0$ the server with the maximal actual queue length is updated, and at the end of time-slot $t_0+1$ the server with the minimal actual queue length is updated. Then, by the law of total expectation, we have
$$\text{LHS} \le \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[\widetilde{Q}^m_{\min}(t_0+1) - \widetilde{Q}^m_{\max}(t_0+1) + \widetilde{Q}^m_{\min}(t_0+2) - \widetilde{Q}^m_{\max}(t_0+2) \mid Z,\, I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\right] \overset{(a)}{\le} \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[\widetilde{Q}^m_{\min}(t_0+1) - Q_{\max}(t_0+1) + Q_{\min}(t_0+2) - \widetilde{Q}^m_{\max}(t_0+2) \mid Z,\, I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\right] \overset{(b)}{\le} \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[-Q_{\max}(t_0+1) + Q_{\min}(t_0+2) \mid Z(t_0)=Z,\, I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\right], \quad (35)$$
where (a) follows from the fact that with $I^m_{\max}(t_0)=1$ we have $\widetilde{Q}^m_{\max}(t_0+1) \ge Q_{\max}(t_0+1)$, and with $I^m_{\min}(t_0+1)=1$ we have $\widetilde{Q}^m_{\min}(t_0+2) \le Q_{\min}(t_0+2)$; (b) holds since $\widetilde{Q}^m_{\min}(t_0+1) \le \widetilde{Q}^m_{\max}(t_0+1) \le \widetilde{Q}^m_{\max}(t_0+2)$. This is because under any LED policy, a local estimate can decrease only when the corresponding queue is updated.

In order to relate the queue lengths at time-slots $t_0+1$ and $t_0+2$ to the queue lengths at $t_0$, we use the following result.

Claim 1. For any $t_0$, we have
1. $\mathbb{E}\left[Q_{\min}(t_0+1) \mid Z(t_0) = Z\right] \le Q_{\min}(t_0) + M$
2. $\mathbb{E}\left[Q_{\max}(t_0+1) \mid Z(t_0) = Z\right] \ge Q_{\max}(t_0) - M$
where $M = \mu_\Sigma$.

Proof. Let us start with the first result. Suppose that server $i$ has the shortest queue length at time-slot $t_0$. We have
$$\mathbb{E}\left[Q_i(t_0+1) \mid Z(t_0)=Z\right] = \mathbb{E}\left[Q_i(t_0) + A_i(t_0) - S_i(t_0) + U_i(t_0) \mid Z(t_0)\right] \le Q_i(t_0) + \max(\lambda_\Sigma, \mu_i) \le Q_i(t_0) + \mu_\Sigma = Q_i(t_0) + M. \quad (36)$$
If at time-slot $t_0+1$ the same server $i$ still has the shortest queue length, then we are done. If not, suppose that some other server $j$ has the shortest queue length at time-slot $t_0+1$. Now, assume that $\mathbb{E}\left[Q_j(t_0+1) \mid Z(t_0)=Z\right] > Q_i(t_0) + M$; we will arrive at a contradiction. First, since $Q_j(t_0+1) \le Q_i(t_0+1)$, we have $\mathbb{E}\left[Q_j(t_0+1) \mid Z(t_0)=Z\right] \le \mathbb{E}\left[Q_i(t_0+1) \mid Z(t_0)=Z\right]$. Combined with our assumption, we get
$$Q_i(t_0) + M < \mathbb{E}\left[Q_j(t_0+1) \mid Z(t_0)=Z\right] \le \mathbb{E}\left[Q_i(t_0+1) \mid Z(t_0)=Z\right]. \quad (37)$$
Thus, we can see that Eq. (37) contradicts Eq. (36). Hence, $\mathbb{E}\left[Q_j(t_0+1) \mid Z(t_0)=Z\right] \le Q_i(t_0) + M$, which finishes the proof of the first result. The same argument can be applied to prove the second result.

Now, we are ready to bound Eq. (35). First,
$$\text{LHS} \le \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[-Q_{\max}(t_0+1) \mid Z(t_0)=Z\right] \quad (38)$$
$$\qquad + \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[Q_{\min}(t_0+2) \mid Z(t_0)=Z,\, \phi(t_0)\right], \quad (39)$$
where $\phi(t_0) \triangleq \{I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\}$. The inequality follows from the fact that given $Z(t_0)$, $Q_{\max}(t_0+1)$ is independent of $I^m_{\max}(t_0)$ and $I^m_{\min}(t_0+1)$. By using the bound in Claim 1, we have an upper bound for the term in Eq. (38):
$$\sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[-Q_{\max}(t_0+1) \mid Z(t_0)=Z\right] \le \sum_{m=1}^{M} \lambda_m \delta p\, \left(-Q_{\max}(t_0) + M\right). \quad (40)$$
For the term in Eq. (39), we have
$$\sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[Q_{\min}(t_0+2) \mid Z(t_0)=Z,\, \phi(t_0)\right] \overset{(a)}{=} \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[\mathbb{E}\left[Q_{\min}(t_0+2) \mid Z(t_0+1)\right] \mid Z(t_0)=Z,\, I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\right],$$
where (a) follows from the tower property of conditional expectation and the fact that given $Z(t_0+1)$, $Q_{\min}(t_0+2)$ is independent of $\phi(t_0)$. Then, it can be upper bounded as follows:
$$\sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[Q_{\min}(t_0+2) \mid Z(t_0)=Z,\, \phi(t_0)\right] \overset{(a)}{\le} \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[Q_{\min}(t_0+1) + M \mid Z(t_0)=Z,\, I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\right] \overset{(b)}{=} \sum_{m=1}^{M} \lambda_m \delta p\, \mathbb{E}\left[Q_{\min}(t_0+1) + M \mid Z(t_0)=Z\right] \overset{(c)}{\le} \sum_{m=1}^{M} \lambda_m \delta p\, \left(Q_{\min}(t_0) + 2M\right),$$
where (a) holds by the bound in Claim 1; (b) holds since given $Z(t_0)$, $Q_{\min}(t_0+1)$ is independent of the event $\{I^m_{\max}(t_0)=1,\, I^m_{\min}(t_0+1)=1\}$; (c) holds by the bound in Claim 1 again.

Thus, combining the bounds for Eqs. (38) and (39) yields
$$\text{LHS} \le \sum_{m=1}^{M} \lambda_m \delta p\, \left(Q_{\min}(t_0) - Q_{\max}(t_0) + 3M\right) = \lambda_\Sigma \delta p\, \left(Q_{\min}(t_0) - Q_{\max}(t_0)\right) + 3M \lambda_\Sigma \delta p \le -\frac{\lambda_\Sigma \delta p}{\sqrt{N}} \left\|Q_\perp(t_0)\right\| + 3(\mu_\Sigma)^2 p, \quad (41)$$
in which the last inequality follows from the fact that $\left\|Q_\perp(t_0)\right\| \le \sqrt{N}\left(Q_{\max}(t_0) - Q_{\min}(t_0)\right)$ and $M = \mu_\Sigma$ with $\delta \le 1$. Hence, the proof of Lemma 4 is complete.
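The final step relies on the deterministic inequality $\|Q_\perp\| \le \sqrt{N}\,(Q_{\max} - Q_{\min})$, where $Q_\perp$ is the component of $Q$ orthogonal to the all-ones direction. A quick numerical sanity check of this inequality on random vectors (the vectors and constants are illustrative only):

```python
import math
import random

def perp_norm(q):
    """Euclidean norm of the component of q orthogonal to the all-ones vector."""
    n = len(q)
    mean = sum(q) / n
    return math.sqrt(sum((x - mean) ** 2 for x in q))

# Each coordinate deviates from the mean by at most (max - min), so
# ||Q_perp||^2 <= N * (max - min)^2, i.e. ||Q_perp|| <= sqrt(N)*(max - min).
rng = random.Random(0)
for _ in range(1000):
    n = rng.randint(2, 10)
    q = [rng.uniform(0.0, 100.0) for _ in range(n)]
    assert perp_norm(q) <= math.sqrt(n) * (max(q) - min(q)) + 1e-9
```

The inequality is tight up to constants: for $q = (0, c, c, \ldots, c)$ the two sides differ only by a bounded factor, which is why the $\sqrt{N}$ loss in (41) cannot be avoided by this argument.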
C Proof of Lemma 5