Optimal Load Balancing in Bipartite Graphs

Abstract

Applications in cloud platforms motivate the study of efficient load balancing under job-server constraints and server heterogeneity. In this paper, we study load balancing on a bipartite graph where left nodes correspond to job types and right nodes correspond to servers, with each edge indicating that a job type can be served by a server. Thus edges represent locality constraints, i.e., each job can only be served at servers which contained certain data and/or machine learning (ML) models. Servers in this system can have heterogeneous service rates. In this setting, we investigate the performance of two policies named Join-the-Fastest-of-the-Shortest-Queue (JFSQ) and Join-the-Fastest-of-the-Idle-Queue (JFIQ), which are simple variants of Join-the-Shortest-Queue and Join-the-Idle-Queue, where ties are broken in favor of the fastest servers. Under a "well-connected" graph condition, we show that JFSQ and JFIQ are asymptotically optimal in the mean response time when the number of servers goes to infinity. In addition to asymptotic optimality, we also obtain upper bounds on the mean response time for finite-size systems. We further show that the well-connectedness condition can be satisfied by a random bipartite graph construction with relatively sparse connectivity.



Wentao Weng

Institute for Interdisciplinary Information Sciences, Tsinghua University

Xingyu Zhou

ECE, Ohio State University

R. Srikant

C3.ai DTI, CSL and ECE, University of Illinois at Urbana-Champaign

August 21, 2020

Many applications that use data centers, cloud computing systems and other data analytic platforms, including Web search engines [22], cloud computing services [1], large-scale data processing [13], and cloud storage, have extremely stringent latency requirements. Ultra-low latency guarantees in these applications not only provide a smooth user experience, but also help improve company profits [12].

A key component for achieving a fast response in the aforementioned systems is the load balancing algorithm, which is responsible for dispatching jobs to parallel servers. Motivated by the demanding requirement of low latency, a line of recent research aims to design smart load balancing algorithms with delay performance guarantees. This work often focuses on the classical load balancing model, where there are $N$ identical servers with exponential service times and a dispatcher that assigns Poisson arrivals to one of the servers. It has been shown that, in this setting, a class of load balancing policies including Join-the-Shortest-Queue (JSQ), Join-the-Idle-Queue (JIQ) [33] and variants of the Power-of-$d$-Choices (Pod) [36, 46] that sample a sufficiently large number of queues or exploit the parallelism of tasks within a job are able to achieve asymptotically zero waiting time for a sufficiently large $N$.

However, the above classical load balancing model may not be appropriate for certain modern cloud computing and data analytic applications due to the presence of job-server constraints. Under such constraints, a job can only be dispatched to a subset of the $N$ servers. These constraints, often called locality constraints, are quite common in large-scale Machine Learning as a Service (MLaaS) and serverless computing services supported by cloud computing platforms (e.g., Microsoft Azure [35], Amazon Web Services [1], Google Cloud [21]). To give a concrete example, let us consider MLaaS. In this setting, various well-trained machine learning models are deployed on cloud platforms, say deep convolutional neural network (CNN) models for image classification and natural language processing (NLP) models. A user's image classification request can only be sent to the servers on which the CNN models have been loaded. As a result, it is not appropriate to assume that every request can be served by any server in the system. Other examples with inherent job-server constraints include online video services, such as TikTok, Netflix and YouTube. In these applications, user requests can only be sent to servers with the required data (e.g., movies, music). The ultimate goal in all these modern applications is to achieve a fast response time and efficient resource usage (e.g., number of servers) while satisfying job-server constraints.

Inspired by these applications, in this paper, we take into account job-server constraints by considering a bipartite load balancing model. In this model, job-server constraints are abstracted by the edges in a bipartite graph, where the left nodes are called ports and the right nodes are called servers. Each port represents a job type that requires a specific chunk of data or a specific machine learning model to execute, and thus can only be routed to specific servers. Each port $\ell$ receives Poisson job arrivals with rate $\lambda_\ell$. A job from port $\ell$ can only be sent to a server $r$ such that $(\ell, r)$ is an edge of the graph. Jobs routed to a server $r$ are queued in a buffer and are served in a first-come-first-served manner.
The service time of each job at server $r$ is exponentially distributed with rate $\mu_r$, which may differ across servers.

To the best of our knowledge, this bipartite graph model was only introduced recently in [11], where JSQ is shown to be throughput optimal but no delay performance guarantee is provided. The bipartite graph model generalizes the load balancing model on graphs introduced in [38, 8]. In that model, jobs arrive at each node with a homogeneous rate, and each job can be served by the node at which it arrives or by one of its neighbors. It has been shown that in this setting JSQ achieves zero delays under certain assumptions on graph connectivity [38].

Inspired by the discussions above, we are particularly interested in the following question:

Are there simple policies that can achieve optimal response time in modern load balancing systems with both job-server constraints and service-rate heterogeneity?

This paper affirmatively answers the above question by presenting optimal policies as well as performance bounds on the mean response time. The detailed contributions can be summarized as follows.

First, we consider two policies: Join-the-Fastest-of-the-Shortest-Queues (JFSQ) and Join-the-Fastest-of-the-Idle-Queues (JFIQ). We show that, under a "well-connected" graph condition, they asymptotically achieve the minimum response time in both the many-server regime (the system load $\lambda < 1$ is a constant while the number of servers $N \to \infty$) and the sub-Halfin-Whitt (HW) regime ($\lambda = 1 - N^{-\alpha}$ with $\alpha < 0.5$). The minimum response time metric is more stringent than the common "zero queueing delay" metric discussed before, and is especially important in systems with heterogeneous servers. JFSQ and JFIQ are simple variants of JSQ and JIQ adapted to job-server constraints, in which ties are broken by choosing the fastest servers. Consequently, our results imply that JSQ and JIQ have asymptotically zero waiting time for homogeneous servers. JFSQ and JFIQ are practical since they only need comparisons between service speeds rather than the exact service rates of servers. In addition to the asymptotic result, we also obtain finite-system bounds on the mean response time. Roughly speaking, we show that the difference between the mean response time in an $N$-server system and that in the limit is bounded by $O\left(\epsilon + ((1-\lambda)\epsilon N)^{-1/2}\right)$, where $\epsilon$ is a parameter related to the well-connectedness of the underlying bipartite graph, and $\lambda$ reflects the load of the system.

Second, our theoretical results provide practical guidance in designing modern load balancing systems. Besides the two simple but efficient algorithms, the underlying "well-connected" condition sheds light on the efficient deployment of various ML models or the required data among the servers.
In particular, the key message is that each movie on Netflix or each ML model deployed on Microsoft Azure only needs to be loaded in $\omega(1)$ servers. To give a concrete example, we show that if edges in the bipartite graph are randomly generated according to some given probabilities, then the graph is "well-connected" with high probability. Let $L$ be the number of kinds of jobs, and $N$ be the number of servers. Our result indicates that on average, the graph only needs $\omega\left(\frac{L+N}{(1-\lambda)^2}\right)$ connections to be "well-connected". If the arrival rates of jobs are uniform, this number can be reduced to $\omega\left(\frac{L+N}{1-\lambda}\ln\frac{1}{1-\lambda}\right)$.

A key theoretical contribution of the paper is showing that a recently developed Lyapunov drift method for studying parallel-server queueing systems can be generalized to bipartite graphs using two key ideas: (i) we demonstrate something akin to state-space collapse and resource pooling by exploiting the connectivity structure of the graph, and (ii) we apply this idea twice, once to bound the number of jobs in fast servers, which are busy in the large-system limit, and a second time to bound the number of jobs in slow servers, which are idle in the limit, using a conditional geometric tail bound.

There is a vast literature on efficient load balancing policies, mostly in the classical load balancing setting where there are $N$ identical servers and the service time is exponentially distributed. Upon arrival, each job can be sent to any of the $N$ servers. It is now well known that in this setting JSQ is optimal in a stochastic ordering sense [49]. However, obtaining the exact steady-state performance of JSQ is difficult. The problem is partly solved in [15], which establishes that the scaled queue length process of JSQ converges to a two-dimensional Ornstein-Uhlenbeck process, and that the fraction of waiting jobs vanishes in the Halfin-Whitt heavy-traffic regime. Although this result is at the process level, it was later confirmed for the steady-state distribution by [6]. The tail of the distribution is further studied in [4].

Since JSQ has significant communication overhead in large-scale systems, alternative policies have been proposed and analyzed. One prominent policy is Power-of-$d$-Choices (Pod). In Pod, each arriving job probes $d$ random servers and joins the one with the shortest queue. [39] first shows that if $d \to \infty$, then both the fluid limit and the state occupancy distribution of Pod coincide with those of JSQ in the many-server limit. This implies that Pod has zero waiting time in the many-server limit. [39] also proves that the diffusion limit of Pod is the same as that of JSQ if $d = \omega(\sqrt{N} \log N)$ in the Halfin-Whitt heavy-traffic regime, but this does not imply steady-state performance. For the many-server regime, a line of work [16, 17] studies the minimum resources (such as memory and communication overhead) required to achieve zero waiting time.

When the system load $\lambda$ can also approach $1$ as $N$ increases (i.e., the many-server heavy-traffic regime), [29] shows that Pod can achieve asymptotically zero waiting time if $d = \omega\left(\frac{1}{1-\lambda}\right)$ when $1 - \lambda = \omega(N^{-1/6})$. For a heavier-traffic regime, a recent breakthrough is the work [31]. In the sub-Halfin-Whitt regime ($1 - \lambda = \omega(N^{-0.5})$), this work establishes the asymptotic zero-waiting property for a large class of policies including JSQ, JIQ and Pod with $d = O\left(\frac{\log N}{1-\lambda}\right)$. The result was later extended to the Beyond-Halfin-Whitt regime ($1 - \lambda = \omega(N^{-1})$) [30], and to Coxian-2 service time distributions [32]. When $1 - \lambda = O(N^{-1})$, it is known that the waiting time must be positive for all load balancing policies [3, 24]. When jobs are divisible, [50, 39] show similar results for Batch Sampling [40] and Batch-Filling [54], which are batch variants of Pod.

Proving optimality of load balancing algorithms is more complicated when servers are heterogeneous. Simple heuristics have nevertheless been proposed for decades. We note that a policy called the Never Queue policy, which is very similar to JFIQ, was proposed in [42]. The Never Queue policy is analyzed in the case of a centralized queue, but not for load balancing systems. Many studies have focused on the heavy-traffic regime, where the system load converges to $1$ while the number of servers is fixed. In this regime, JSQ was shown to be delay optimal by the drift method [14]. Later, [57] proves that a threshold policy is heavy-traffic optimal. The stability and heavy-traffic optimality of Pod for heterogeneous servers were studied recently by [28]. Moreover, [56] provides a simple criterion for load balancing algorithms to be heavy-traffic optimal. The heavy-traffic assumption can be relaxed to the many-server heavy-traffic regime when $1 - \lambda$ vanishes sufficiently fast as a power of $N$ [27, 55]. Nevertheless, the results mentioned above do not imply fast mean response time in the many-server regime, which is more practical for cloud platforms. For the many-server regime, the work in [44] shows that JIQ has asymptotically zero waiting time as $N \to \infty$. However, this does not imply optimal mean response time since the service time of jobs varies across servers. A recent work [19] takes heterogeneity into account by studying a system with fast and slow servers.
Although [19] obtains the mean-field limit for a variant of Pod, the result does not imply optimal mean response time.

Load balancing with job-server constraints was not considered in the literature until recent years. To the best of our knowledge, [37] is the first paper that considers load balancing with job-server constraints, and it proposes an online load balancing algorithm with the optimal competitive ratio. However, that model is not stochastic, and is thus quite different from the model we consider in this paper. Cruise et al. [11] consider the stability of JSQ on the same model as ours, but provide no delay guarantee. In Cardinaels et al. [10], redundancy policies are explored in bipartite load balancing. They obtain a product-form steady-state distribution, which however does not imply an optimal mean response time. Besides these papers, there are also studies of load balancing on graphs. In [45, 20, 8], the impact of the graph structure on the performance of Pod is studied. Mukherjee et al. [38] use a stochastic coupling method to prove that, under certain graph constraints, JSQ on a graph can have the same performance as JSQ in the classical load balancing model in both the many-server regime and the Halfin-Whitt regime. This implies that JSQ can also achieve zero waiting time in the many-server regime for a graph-based model. However, the model in [38] only considers identical servers and homogeneous arrival rates of jobs, which is a special case of the model in this paper.

Figure 1: An example of the bipartite graph model. Jobs from a given port can only be routed to the servers connected to that port.

We note that if servers share a central queue, then the bipartite graph model turns into the skill-based model studied in the call center literature [18, 10]. It is shown in [18] (and the references within) that the stationary distributions under several redundancy policies have product forms. One related result is that our model becomes the same as a skill-based model, and thus enjoys a product-form stationary distribution, if we send each job to a connected server with the least amount of work in its buffer [18, 10]. Such a policy is, however, impractical since the workloads of jobs in cloud platforms are volatile. Also, as [18] has pointed out, it is non-trivial to obtain bounds on the mean response time just from the product-form results.

Our bipartite graph model also resembles other problems in the literature. One particular model is the job-server affinity model for data locality problems studied in [9, 51, 52, 47]. In the job-server affinity model, if a job is served by a server that holds its data, it has a fast constant service rate. Otherwise, it has a slow service rate, meaning that this server has to fetch the data from elsewhere. However, this setting is not suitable in the context of MLaaS discussed above. There, ML models are usually reconfigured on machines periodically, and a new request will only be routed to servers with the needed model [23]. Also, previous studies on job-server affinity models can only guarantee heavy-traffic delay optimality [51, 52, 47], which does not yield the extremely fast mean response time required in cloud platforms.

From a methodological perspective, our paper builds on the drift method to obtain performance bounds. In this method, one exploits the fact that the steady-state expectation of suitable functions of the state of a Markov process does not change with time.
This idea was developed in [14, 34, 48] for the heavy-traffic regime, where the idea of using the tail bounds of [26, 5] to prove state-space collapse or resource pooling was introduced. The recent work in [31] developed a parallel approach for the many-server regime, where the authors introduced the notion of generator coupling inspired by Stein's method in [53, 7, 25, 43] and designed a clever Lyapunov coupling to show that, for JSQ-type policies, the number of homogeneous servers utilized is large when the backlog is large. We will call this latter idea state-space collapse since it is similar to the notion of state-space collapse in the heavy-traffic regime. In this paper, we introduce new ideas to expand the applicability of the techniques of [31] to networks of heterogeneous servers.

Contemporaneous to our work, the authors of [41] study the waiting time of JSQ(d) policies in bipartite graphs in the limit as the size of the graph goes to infinity. While the papers are motivated by related problems, the models and routing policies studied and the results in the two papers are different. The authors of [41] consider the case of homogeneous servers with infinite buffers, and show that the performance of JSQ(d) in a bipartite graph with limited connectivity converges to the performance of the fully flexible system in terms of queue length (or waiting time) under appropriate connectivity conditions. In addition, they prove that the steady-state occupancy of the limited-connectivity system converges to the steady state of the fully flexible system. Our paper considers the case of heterogeneous arrival and service rates with finite buffers, and shows that the waiting time in the queue and the blocking probability both go to zero in the large-system limit under the JFIQ and JFSQ routing policies. Additionally, the techniques used in the two papers are different. We use the drift method to obtain performance bounds for finite-sized systems, while [41] uses process-level convergence techniques.
We consider load balancing in a bipartite graph $G = (\mathcal{L}, \mathcal{R}, \mathcal{E})$, where $\mathcal{L}$ and $\mathcal{R}$ are the sets of left nodes and right nodes, respectively, and $\mathcal{E}$ is the set of edges between these two sets of nodes. Nodes in $\mathcal{L}$ are indexed as $\{1, 2, \cdots, L\}$ with $L = |\mathcal{L}|$, and nodes in $\mathcal{R}$ are indexed as $\{1, 2, \cdots, N\}$ with $N = |\mathcal{R}|$. For a node $\ell \in \mathcal{L}$ (or $r \in \mathcal{R}$), define $\mathcal{N}_L(\ell)$ (or $\mathcal{N}_R(r)$) to be the set of right (or left) nodes it connects with. Without loss of generality, every $\mathcal{N}_L(\ell)$ and $\mathcal{N}_R(r)$ is assumed to be non-empty. To distinguish between left and right nodes, we may refer to a node $\ell \in \mathcal{L}$ as port $\ell$, and a node $r \in \mathcal{R}$ as server $r$. See Fig. 1 for an illustration.

Jobs arrive at port $\ell$ according to a Poisson process with rate $\lambda_\ell$, and the goal is to route them to one of the servers connected to $\ell$ so as to minimize a certain performance metric of interest. It is assumed that every server has a finite buffer of size $b$. When a job is routed to a server that is currently processing another job, this new arrival is placed in the buffer. But if there are already $b$ jobs at the server (including the one being served), the new arrival is blocked and lost forever. We assume that jobs in the buffer are served in a first-come-first-served manner. The queue length $Q_r$ of a server $r$ is the number of jobs in the buffer plus one if there is a job running on the server.

To reflect the nature of server heterogeneity in a practical load balancing system, we assume that there are $M$ types of servers. For a type $m$ server, the service time of a job running on it is assumed to be exponentially distributed with mean $1/\mu_m$. The arrival processes to the ports and the service times of jobs are assumed to be independent. Denote the number of type $m$ servers by $N_m$, and the type of a server $r$ by $t_r$. Equivalently, we can write $N_m = N \alpha_m$ with $\alpha_m \in (0, 1]$ and $\sum_{m=1}^{M} \alpha_m = 1$. We assume that there is sufficient service capacity, i.e., $\lambda_\Sigma = \sum_{\ell=1}^{L} \lambda_\ell < N \sum_{m=1}^{M} \mu_m \alpha_m$. Without loss of generality, we assume $\mu_1 > \mu_2 > \cdots > \mu_M > 0$ since we can always reorder the types of servers.

We study two routing policies, Join-the-Fastest-of-the-Shortest-Queues (JFSQ) and Join-the-Fastest-of-the-Idle-Queues (JFIQ), in bipartite load balancing systems.
For JFSQ, upon the arrival of a job at port $\ell$, we select a server $r$ connected to port $\ell$ with the shortest queue length, that is, $r \in \arg\min_{r' \in \mathcal{N}_L(\ell)} Q_{r'}$. If there are multiple such servers, we select the one with the fastest service rate, i.e., the largest $\mu_{t_r}$, and break any remaining ties by randomly choosing one server. Alternatively, if we use JFIQ, we find an idle server $r \in \mathcal{N}_L(\ell)$ with the fastest service rate. If there are no idle servers, we select one server from $\mathcal{N}_L(\ell)$ randomly. The question of interest in this paper is whether these two policies can achieve optimal job delays (at least for a large system) under appropriate conditions on the underlying bipartite graph. We note that our routing policies JFIQ and JFSQ reduce to JIQ and JSQ, respectively, when all servers have the same service rate.

Before we proceed to state our results, we first introduce the notation that we will use in the paper. We use capital letters to denote random variables, such as $Q_r(t)$ for the queue length of server $r$ at time $t$, and small letters to denote realizations.

Clearly, for the system considered in this paper, the process $\{Q(t) = (Q_1(t), \cdots, Q_N(t))\}$ forms a continuous-time Markov chain (CTMC). Since the buffers are finite, there is a unique stationary distribution of $Q(t)$. For each state $q = (q_1, \cdots, q_N)$, let

$$s_{m,i}(q) = \frac{1}{N} \left| \{ r \in \mathcal{R} : q_r \ge i,\ t_r = m \} \right|$$

be the number of type $m$ servers with queue length at least $i$, normalized by $N$. In addition, let

$$C_m(q) = \sum_{i=1}^{b} s_{m,i}(q), \qquad W(q) = \sum_{m=1}^{K} \mu_m s_{m,1}(q),$$

which are, respectively, the normalized (divided by $N$) number of jobs in type $m$ servers, and the normalized rate at which jobs are completed if we only consider the first $K$ types of servers.

Notation:

As mentioned earlier, capital letters are reserved for random variables (such as $Q(t)$ for queue lengths at time $t$), and small letters are for realizations (such as $q$ for a queue-length state). A bar on top of a variable means that it is in steady state (such as $\bar{Q}$). This paper makes use of asymptotic notation. For two positive functions $f(x)$ and $g(x)$, we write $f(x) = o(g(x))$ if $\limsup_{x \to \infty} \frac{f(x)}{g(x)} = 0$; $f(x) = O(g(x))$ if $\limsup_{x \to \infty} \frac{f(x)}{g(x)} < \infty$; $f(x) = \Omega(g(x))$ if $\liminf_{x \to \infty} \frac{f(x)}{g(x)} > 0$; and $f(x) = \omega(g(x))$ if $\liminf_{x \to \infty} \frac{f(x)}{g(x)} = \infty$.

We summarize our main results in this section. To be specific, our results provide an upper bound on the mean number of jobs in the system under certain assumptions. This upper bound directly implies asymptotic optimality of JFSQ and JFIQ in the sense of minimum mean response time, which we will define explicitly later. We also give a random graph construction of the graph $G$ such that $G$ satisfies Assumption 2 with high probability.
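The JFSQ and JFIQ routing rules and the state summaries $s_{m,i}(q)$ and $C_m(q)$ from the model description can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `jfsq`, `jfiq`, `s` and `C`, and the representation of the graph as a dictionary from ports to lists of server indices, are hypothetical choices made here for exposition.

```python
import random
from typing import Dict, List, Sequence


def jfsq(port: int, neighbors: Dict[int, List[int]],
         queue: Sequence[int], rate: Sequence[float], rng=random) -> int:
    """Join-the-Fastest-of-the-Shortest-Queues (illustrative sketch)."""
    candidates = neighbors[port]
    shortest = min(queue[r] for r in candidates)
    short = [r for r in candidates if queue[r] == shortest]
    # Among the shortest queues, keep only the fastest servers.
    fastest = max(rate[r] for r in short)
    # Remaining ties are broken at random.
    return rng.choice([r for r in short if rate[r] == fastest])


def jfiq(port: int, neighbors: Dict[int, List[int]],
         queue: Sequence[int], rate: Sequence[float], rng=random) -> int:
    """Join-the-Fastest-of-the-Idle-Queues (illustrative sketch)."""
    candidates = neighbors[port]
    idle = [r for r in candidates if queue[r] == 0]
    if not idle:
        # No idle neighbor: pick any connected server at random.
        return rng.choice(candidates)
    fastest = max(rate[r] for r in idle)
    return rng.choice([r for r in idle if rate[r] == fastest])


def s(q: Sequence[int], server_type: Sequence[int], m: int, i: int) -> float:
    """s_{m,i}(q): type-m servers with queue length >= i, divided by N."""
    n = len(q)
    return sum(1 for r in range(n) if server_type[r] == m and q[r] >= i) / n


def C(q: Sequence[int], server_type: Sequence[int], m: int, b: int) -> float:
    """C_m(q): normalized number of jobs in type-m servers."""
    return sum(s(q, server_type, m, i) for i in range(1, b + 1))
```

Note that both rules only compare service rates, so any ranking of server speeds could replace the numeric `rate` values, in line with the remark that exact service rates are not needed.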

Let $K$ be the minimum value such that $N \sum_{m=1}^{K} \mu_m \alpha_m > \lambda_\Sigma$. Such a $K$ must exist by the assumption of sufficient service capacity. Assume that $\lambda_\Sigma = N \sum_{m=1}^{K} \mu_m \alpha_m (1 - \beta)$ where $0 < \beta \le 1$, and denote $\lambda = \frac{\lambda_\Sigma}{N}$. Let

$$C^*_1 = \alpha_1, \quad \cdots, \quad C^*_{K-1} = \alpha_{K-1}, \quad C^*_K = \frac{\lambda - \sum_{m=1}^{K-1} \mu_m \alpha_m}{\mu_K},$$

and let $C^* = \sum_{m=1}^{K} C^*_m$. This definition is motivated by the mean-field limit of our system, which will be illustrated later. The following result provides lower bounds on the expected service time of each job and on the mean number of jobs in the system.

Proposition 1.

Suppose that the buffer size is infinite, i.e., $b = \infty$. Let $\bar{Z}$ be the random variable denoting the service time of one job. Then for any stable policy, the mean number of jobs in the system is lower bounded by $N C^*$, and

$$\mathbb{E}\left[\bar{Z}\right] \ge \frac{C^*}{\lambda}. \tag{1}$$

The proof is provided in the appendix.

For every $1 \le m \le K$, let $\mathcal{R}_m$ be the set of servers of types $1$ through $m$. Let $\hat{\beta} = \beta \sum_{m=1}^{K} \alpha_m$, and let $\epsilon$ be a number in $(0, \hat{\beta}]$; we call $\epsilon$ the approximation error since we will later use this parameter to characterize the near optimality of our routing policies. For any subset $\mathcal{I} \subseteq \mathcal{R}$, define $\mathcal{N}_R(\mathcal{I}) = \cup_{r \in \mathcal{I}} \mathcal{N}_R(r)$ to be the set of ports connected to at least one server in $\mathcal{I}$, and $D_{\mathcal{I}} = \sum_{\ell \notin \mathcal{N}_R(\mathcal{I})} \lambda_\ell$ to be the sum of arrival rates at ports not connected to $\mathcal{I}$. Before stating our results on JFSQ and JFIQ, we first make a few assumptions on the system. Let $\tau_{1K} = \frac{\mu_1}{\mu_K}$, $\tau_{1M} = \frac{\mu_1}{\mu_M}$, and $\tau_{KM} = \frac{\mu_K}{\mu_M}$.

Assumption 1 (Buffer Size). For a fixed approximation parameter $\epsilon$ in $(0, \hat{\beta}]$, the buffer size $b$ is at least $\sqrt{\tau_{1K}}$ and at most the floor of a fractional power of $\frac{\epsilon N}{\tau_{1K} \ln N}$.

Assumption 2 (Well Connectedness). The graph $G$ satisfies the following conditions:
• $D_{\mathcal{I}} \le N \tilde{d}_1$ for any $\mathcal{I} \subseteq \mathcal{R}_{K-1}$ with $|\mathcal{I}| \ge N p_1$;
• $D_{\mathcal{I}} \le N \tilde{d}_2$ for any $\mathcal{I} \subseteq \mathcal{R}_K$ with $|\mathcal{I}| \ge N p_2$,
where $p_1 = \frac{\epsilon}{b}$, $p_2 = \hat{\beta}$, $\tilde{d}_1 \le \frac{\epsilon \mu_K}{b}$, and $\tilde{d}_2 \le \frac{\epsilon \mu_K}{b}$.

Although there are two constraints, Assumption 2 basically requires that any large enough subset of the first $K$ types of servers is connected to ports with enough total arrival rate. This requirement ensures that JFSQ and JFIQ behave almost the same as in a classical load balancing system even though there are additional job-server constraints. We are now ready to state the main result.

Theorem 1.

Suppose that Assumptions 1 and 2 hold, and that the routing policy is either JFSQ or JFIQ. Then for a sufficiently large $N$, the following results hold:

(i) the expected number of jobs in servers of the first $K$ types divided by $N$ is bounded as

$$\mathbb{E}\left[\max\left(\sum_{m=1}^{K} C_m(\bar{Q}) - (C^* + \epsilon),\ 0\right)\right] \le \frac{\tau_{1K}\, b}{\epsilon N}; \tag{2}$$

(ii) if $K < M$, the expected number of jobs in the system divided by $N$ is bounded as

$$\mathbb{E}\left[\sum_{m=1}^{M} C_m(\bar{Q})\right] \le C^* + \left(\tau_{KM}\right)\epsilon + 2\sqrt{\frac{\tau_{1M}\, b \ln N}{N}} + 60\, b \sqrt{\frac{\tau_{1K}\tau_{1M}}{\hat{\beta}\epsilon N}}; \tag{3}$$

(iii) the probability $p_B$ that an arriving job is blocked is bounded as

$$p_B \le \frac{\tilde{d}_2}{\lambda} + \frac{52\, \tau_{1K}\, b}{\epsilon N}. \tag{4}$$
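As a worked illustration of the definitions of $K$, $C^*_m$ and $C^*$ and of the lower bound $C^*/\lambda$ in Proposition 1, the following Python sketch computes these quantities for a given set of server types. The function name `mean_field_targets` and the numerical example are assumptions made here for illustration only; `mu` is taken to be sorted in decreasing order, and `lam` is the normalized load $\lambda = \lambda_\Sigma / N$.

```python
from typing import List, Sequence, Tuple


def mean_field_targets(mu: Sequence[float], alpha: Sequence[float],
                       lam: float) -> Tuple[int, List[float], float]:
    """Return (K, [C*_1, ..., C*_K], C*) for normalized load lam.

    mu[m-1] is the service rate and alpha[m-1] the fraction of type-m
    servers, with mu sorted from fastest to slowest (an assumption of
    this sketch, matching mu_1 > mu_2 > ... > mu_M in the text).
    """
    capacity = 0.0  # sum of mu_m * alpha_m over the types already filled
    for k, (m, a) in enumerate(zip(mu, alpha), start=1):
        if capacity + m * a > lam:
            # K = k is the first index whose cumulative capacity exceeds lam.
            c_star = list(alpha[:k - 1]) + [(lam - capacity) / m]
            return k, c_star, sum(c_star)
        capacity += m * a
    raise ValueError("insufficient service capacity for this load")


# Hypothetical example: half of the servers have rate 2, half have rate 1.
K, c_star, C_total = mean_field_targets(mu=[2.0, 1.0], alpha=[0.5, 0.5], lam=1.2)
service_time_lower_bound = C_total / 1.2  # the bound C*/lambda of Proposition 1
```

In this example the fast servers alone provide capacity $1.0 < 1.2$, so $K = 2$, $C^*_1 = 0.5$, $C^*_2 = 0.2$ and $C^* = 0.7$ (up to floating-point rounding), which gives a lower bound of roughly $0.583$ on the mean service time: all fast servers are kept busy, and only the residual load is served by slow servers.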

Theorem 1 may be difficult to interpret since there are several parameters involved in the results. So let us interpret the result for an important special case, which is perhaps the one that is practically most relevant. Suppose that the normalized arrival rate $\lambda$, the proportions of different types of servers $\{\alpha_m\}$, and $\epsilon$ are fixed. In most practical systems, the number of jobs that can wait at a server is small, so let us suppose that $b$ is a fixed constant satisfying Assumption 1. Then, from (3), it is clear that the normalized expected number of jobs in the system is asymptotically equal to $C^* + O(\epsilon)$ in the many-server limit. The blocking probability goes to zero provided $\tilde{d}_2 = o(1)$, and the rate at which it goes to zero depends on the rate at which $\tilde{d}_2$ decreases with $N$. From Proposition 1, the lower bound on the normalized number of jobs in an infinite-buffer system is $C^*$. This suggests that JFSQ and JFIQ are near-optimal from the perspective of mean response time if the graph is reasonably well connected; we make this argument more general (by allowing many parameters to scale) and precise next.

To study the limit as $N$ approaches infinity, we let $\{G_N = (\mathcal{L}_N, \mathcal{R}_N, \mathcal{E}_N), N \ge 1\}$ be a sequence of bipartite graphs such that $|\mathcal{R}_N| = N$ and the buffer size of each server is given by $b_N$. Here, the number of servers, $N$, is allowed to scale, but the server-type distribution $(\alpha_1, \cdots, \alpha_M)$ and the service rate of each type of server, $(\mu_1, \cdots, \mu_M)$ with $\mu_1 > \cdots > \mu_M$, are fixed. Further, the total arrival rate at ports in $\mathcal{L}_N$, $\lambda_\Sigma$, is assumed to be equal to $N \sum_{m=1}^{K} \mu_m \alpha_m (1 - \beta_N)$ for all $G_N$. As before, we can define a sequence of parameters $\{\epsilon_N, N \ge 1\}$ that quantify the approximation error, where $\epsilon_N \in (0, \hat{\beta}_N]$ and $\hat{\beta}_N = \beta_N \sum_{m=1}^{K} \alpha_m$.
Now we can discuss the asymptotic performance of a routing policy as $N \to \infty$.

Proposition 1 provides a lower bound on the expected service time of a job in the system with infinite buffers. We thus have the following definition of an (asymptotically) optimal routing policy in the bipartite load balancing system.

Definition 1 (Optimality in the Mean Response Time Sense). A stable routing policy is asymptotically optimal in the response time if the mean response time of jobs converges to $\frac{C^*}{\lambda}$ and the blocking probability goes to zero when $N \to \infty$.

We can see that optimality in the mean response time is a stronger metric than the common zero-waiting property discussed in the literature [44, 16, 31]. With this optimality, an arriving job not only has asymptotically zero waiting time, but also has the minimum possible service time.

Theorem 1 then immediately implies that both JFSQ and JFIQ are asymptotically optimal if the load of the system is moderate and the graph $G_N$ is suitably well connected.

Corollary 1.

Suppose that $\epsilon_N$ is both $o(1)$ and $\omega(\ln(N) N^{-0.5})$, and that both Assumptions 1 and 2 hold for $G_N$ when $N$ is sufficiently large. Then as $N \to \infty$, both JFSQ and JFIQ are asymptotically optimal, and the expected queueing delay converges to zero for both policies.

Due to the relationship between $\beta_N$ and $\epsilon_N$, it is not difficult to see that asymptotic optimality holds for arrival rates up to the sub-Halfin-Whitt regime. We refer the reader to the appendix for a proof of Corollary 1.

We now discuss when a bipartite graph can satisfy Assumption 2 in random graph models. Suppose the set of ports $\mathcal{L}$ and the set of servers $\mathcal{R}$ are fixed, but the connections between them, i.e., the graph $G$, are not determined. This section considers a random graph $G$ where port $i$ connects with server $j$ with probability $z_{ij}$. We devise an explicit construction of $z_{ij}$ and show that such a random graph satisfies Assumption 2 with high probability. Our result first provides the construction of $z_{ij}$ when ports can have different arrival rates. Later, by restricting the scope to homogeneous arrival rates among ports, we give a better construction where the graph $G$ can have fewer edges. We are now ready to state our results.

Theorem 2.

Let H j = N + L ) /Np j for j ∈ { , } . Consider the following construction of the graph G . For eachport (cid:96) ∈ L , • if λ (cid:96) ≥ N ˜ d H , this port connects with all servers of types less than K ; • if λ (cid:96) ≥ N ˜ d H , this port connects with all servers of types equal to K ; • otherwise, for each server r ∈ R , if r ∈ R K − , then (cid:96) connects with r with probability λ (cid:96) H N ˜ d . And if r ∈ R K \ R K − , then (cid:96) connects with r with probability λ (cid:96) H N ˜ d . PREPRINT - A


Then G satisﬁes Assumption 2 with probability at least − − ( N + L − . The expected total number of edges used in G N scales as O ( ( N + L ) b (cid:15) ) . Next, we discuss the special case of homogeneous arrival rates.
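The construction in Theorem 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code. It assumes that the pair $(H_1, \tilde d_1)$ governs connections to servers of types below $K$, that the pair $(H_2, \tilde d_2)$ governs connections to servers of type $K$, and that each group is handled independently by capping the connection probability $\lambda_\ell H_j / (N \tilde d_j)$ at one, so that a port with $\lambda_\ell \ge N \tilde d_j / H_j$ connects to the whole group. The function name and all argument names are hypothetical.

```python
import random


def sample_port_edges(arrival_rates, fast_servers, type_k_servers,
                      n_servers, h1, d1, h2, d2, seed=0):
    """Sample, for every port, the set of servers it connects to.

    arrival_rates  : list of per-port arrival rates (lambda_l)
    fast_servers   : ids of servers of types below K (assumed group for h1, d1)
    type_k_servers : ids of servers of type K (assumed group for h2, d2)
    n_servers      : N, the total number of servers
    h1, d1, h2, d2 : stand-ins for H_1, d~_1, H_2, d~_2 (hypothetical inputs)
    """
    rng = random.Random(seed)
    groups = ((fast_servers, h1, d1), (type_k_servers, h2, d2))
    edges = []
    for lam in arrival_rates:
        neighbors = set()
        for servers, h, d in groups:
            # Connection probability lambda_l * H_j / (N * d~_j), capped at 1.
            # The cap reproduces "connect to every server in the group" for
            # ports whose rate is at least N * d~_j / H_j.
            prob = min(1.0, lam * h / (n_servers * d))
            for server in servers:
                if prob >= 1.0 or rng.random() < prob:
                    neighbors.add(server)
        edges.append(neighbors)
    return edges
```

Under these assumptions, the expected degree of a low-rate port is its connection probability times the size of each server group, which is why the edge count grows with the ratios $H_j / \tilde d_j$.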

Theorem 3.

Suppose that all ports have the same arrival rates, that is, λ (cid:96) ≡ ¯ λ for all (cid:96) ∈ L . Then following the sameconstruction of graph G in Theorem 2 but with H j = 6 (cid:16) − ln p j + ˜ d j p j ¯ λ ln µ ˜ d j (cid:17) for j ∈ { , } , it holds that G satisﬁesAssumption 2 with probability at least − (cid:0) NNp (cid:1) − . The total number of edges in G N scales as O (cid:16) ( N + L ) b (cid:15) ln b(cid:15) (cid:17) . Remark 1.

Th previous two theorems indicate that to achieve asymptotically optimal mean response time and asymp-totic zero waiting probability, the average number of connections of each port is only O ( (cid:15) ) for heterogeneous arrivalrates, and O ( (cid:15) ln (cid:15) ) for homogeneous arrival rates, given that L = Ω( N ) , b = O (1) . When / (1 − λ ) = O (1) , we only require (cid:15) = o (1) . Then the average number of edges connected to each port becomes ω (1) . Therefore, forachieving very small loss probability and near-optimal response times, the number of edges in a random graph needto be only sparse compared to a fully connected graph. In this section, we provide the proofs of Theorem 1. These results respectively bound the mean number of jobs ina ﬁnite-size system and show the asymptotic optimality for JFSQ and JFIQ in the many-server limit and the subHalﬁn-Whitt regime.
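The proof works throughout with the constants $C^*_m$ and the index $K$, which come from filling servers from the fastest type to the slowest until the total service rate matches the load. A minimal sketch of that computation follows. It assumes that `alpha[m]` is the fraction of servers of type $m$, that `mu` is sorted from fastest to slowest, and that `load` is the arrival rate per server; the function name and the load value 0.3 are hypothetical, while the rates $\mu_i = 2^{-i+1}$ and the equal split into four types follow the simulation setting described later.

```python
def fill_fastest_first(alpha, mu, load):
    """Return the busy fraction per server type and the last type used (1-based).

    alpha : fraction of servers of each type (sums to 1)
    mu    : service rate of each type, sorted from fastest to slowest
    load  : arrival rate per server (assumed normalization)
    """
    busy = [0.0] * len(alpha)
    remaining = load
    last_used = 0
    for m, (fraction, rate) in enumerate(zip(alpha, mu)):
        if remaining <= 1e-12:
            break
        # Work that type m can absorb is capped by its total service capacity.
        served = min(fraction * rate, remaining)
        busy[m] = served / rate
        remaining -= served
        last_used = m + 1
    if remaining > 1e-12:
        raise ValueError("load exceeds total service capacity")
    return busy, last_used


alpha = [0.25, 0.25, 0.25, 0.25]
mu = [1.0, 0.5, 0.25, 0.125]
busy, k = fill_fastest_first(alpha, mu, load=0.3)
c_star = sum(busy)                # scaled number of busy servers
response_lower_bound = c_star / 0.3
```

In this example the fastest type is fully busy, the second type is partially busy, and the two slowest types stay idle, which is the shape the proof sketch below refers to as the mean-field limit.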

Ahead of the complete proof, we first provide a sketch of the proof reflecting the intuition behind it. Recall that the goal is to bound the mean number of jobs in the system divided by $N$, given by $E\left[\sum_{m=1}^{M} C_m(\bar{Q})\right]$. Here, by definition, $C_m(\bar{Q}) = \sum_{j=1}^{b} s_{m,j}(\bar{Q})$. Our proof starts with the following observation about the mean-field limit for JFSQ and JFIQ in the heterogeneous system. Ideally, if the load $\lambda$ is a constant, then as $N \to \infty$, it holds that

$$s_{m,1}(\bar{Q}) \approx \begin{cases} \alpha_m, & m < K \\ C^*_K, & m = K \\ 0, & m > K \end{cases} \quad \text{and} \quad s_{m,j}(\bar{Q}) \approx 0, \ \forall m = 1, \dots, M,\ j = 2, \dots, b. \tag{5}$$

Roughly speaking, this limit tells us that all servers of the first $K-1$ types are busy, some servers of type $K$ are busy, and all servers of types greater than $K$ are idle.

The intuition behind (5) is as follows. Since there are infinitely many servers, a certain fraction of them must be idle. Then, by the definition of JFIQ and JFSQ, all arriving jobs are routed to idle servers, at least in a fluid model. Therefore, the scaled number of waiting jobs (i.e., not in service), $\sum_{m=1}^{M} \sum_{j=2}^{b} S_{m,j}(Q)$, must converge to zero. For $S_{1,1}(Q), \cdots, S_{M,1}(Q)$, JFIQ and JFSQ always route jobs to the fastest idle servers. Therefore, it must be the case that the $s_{m,1}(Q)$ are filled from $1$ to $M$ until $\sum_{m=1}^{M} \mu_m s_{m,1}(\bar{Q}) = \lambda$. That is to say, the total departure rate is equal to the total arrival rate. Therefore, we can 'guess' that the mean-field limit has the form (5).

Based on this limit, the scaled mean number of jobs can be decomposed as

$$E\left[\sum_{m=1}^{M} C_m(\bar{Q})\right] = E\left[\sum_{m=1}^{K} C_m(\bar{Q})\right] + E\left[\sum_{m=K+1}^{M} C_m(\bar{Q})\right]. \tag{6}$$

The drift argument starts by considering a Lyapunov function $g$ and setting its drift in steady-state equal to zero.
Since we are considering continuous-time Markov chains, this is equivalent to saying that $E[Gg(\bar{Q})] = 0$, where $G$ is the generator of the Markov chain (defined explicitly later). Initially, let us focus on the total queue length in the first $K$ types of servers (scaled by $N$) and thus choose the Lyapunov function to be a function of the scaled total number of jobs in these servers and their queues, which we will call $x$. By an abuse of notation, we will rewrite the drift as $E[Gg(x)] = 0$. However, this drift may be hard to analyze. Instead, suppose that the system was a simple deterministic fluid model of the form $\dot{x} = -\Delta$ for an appropriately chosen $\Delta > 0$. The motivation for considering this fluid model is that, in the large-system limit, our system behaves like a single-server queue with simple fluid dynamics. If this fluid limit were the true system, then the drift of $g$ would simply be $-g'(x)\Delta$. We add and subtract this drift from the drift of the stochastic system to obtain

$$E[Gg(x) - g'(x)\Delta + g'(x)\Delta] = 0,$$

which can be rewritten as

$$E[g'(x)\Delta] = E[Gg(x) - (-g'(x)\Delta)].$$

We are interested in getting a bound on the steady-state expectation of $h(x) = (x - (C^* + \epsilon))^+$, where $\epsilon$ controls the approximation error. Therefore, we choose $g$ such that $g'(x)\Delta = h(x)$ (this equality is sometimes called Stein's equation). Thus, the drift equation becomes

$$E[h(x)] = E[Gg(x) - (-g'(x)\Delta)].$$

Now, it is easy to see that we can bound $E[h(x)]$ if we can show that the drift of the Markov process, $E[G(g(x))]$, is approximately equal to $-g'(x)\Delta$. The rest of the proof involves studying $E[Gg(x) - (-g'(x)\Delta)]$ by choosing $\Delta = \mu_1 \delta$ where $\delta > 0$. In Lemma 3, we show that this expression is approximately equal to

$$\frac{1}{\mu_1 \delta} E\left[ \mathbb{1}\left\{ \sum_{m=1}^{K} C_m(\bar{Q}) \ge C^* + \epsilon + \frac{1}{N} \right\} h\left( \sum_{m=1}^{K} C_m(\bar{Q}) \right) \left( \lambda + \mu_1 \delta - W(\bar{Q}) \right) \right]. \tag{7}$$

We want to upper bound this expression by a quantity which is small when $N$ is large. Note that $\sum_{m=1}^{K} C_m(\bar{Q})$ is the total scaled queue length in the first $K$ types of servers and $W(\bar{Q}) = \sum_{m=1}^{K} \mu_m s_{m,1}(\bar{Q})$ can be interpreted as the departure rate from these servers.
Thus, the above expression can be upper bounded by a small quantity if the following holds: whenever the total queue length is large, the departure rate exceeds the arrival rate with high probability.

To establish this fact, the mean-field limit (5) motivates us to show that $s_{m,1}(\bar{Q}) \approx \alpha_m$ for $m < K$ and $s_{K,1}(\bar{Q}) \approx C^*_K$. To be concrete, we show a two-stage state space collapse result through the following two Lyapunov functions (omitting extra technical terms):

$$\tilde{V}_1(q) = \min\left\{ \sum_{m=1}^{K-1} \sum_{j=2}^{b} s_{m,j}(q) + C_K(q),\ \sum_{m=1}^{K-1} \alpha_m - \sum_{m=1}^{K-1} s_{m,1}(q) \right\} \tag{8}$$

$$\tilde{V}_2(q) = \min\left\{ \sum_{m=1}^{K} \sum_{j=2}^{b} s_{m,j}(q),\ \sum_{m=1}^{K} C^*_m + \tau_K \delta - \sum_{m=1}^{K} s_{m,1}(q) \right\}. \tag{9}$$

The well-connectedness condition in Assumption 2 and the routing policy (JFSQ and JFIQ) ensure that both of them have negative drifts when they are sufficiently large (Lemma 4 and Lemma 5). We now provide some intuition to explain how the well-connectedness condition plays a role in establishing the negative drift of these Lyapunov functions. We consider $\tilde{V}_1$; the explanation for the other Lyapunov function is similar. If $\tilde{V}_1$ is large, then both terms inside the min in (8) are large. In particular, by focusing on the second term, we note that a large $\tilde{V}_1$ implies that the (scaled) number of used servers $\sum_{m=1}^{K-1} s_{m,1}(q)$ is small. Equivalently, the number of idle servers is large. The well-connectedness condition simply states that the arrival rate to large subsets of servers is large. Thus, if $\tilde{V}_1$ is large, the number of empty servers is large, which implies they have a large arrival rate, which in turn implies that the number of empty servers quickly decreases.
The negative drift of $\tilde{V}_1$ and $\tilde{V}_2$ can be used to establish geometric tail bounds (Lemma 6) using standard drift arguments to show that they are small with high probability.

Observe that when $\sum_{m=1}^{K} C_m(q) > C^* + \epsilon$, these two Lyapunov functions are both equal to the second term on their right-hand side. Then in this case, $\sum_{m=1}^{K-1} s_{m,1}(q) \approx \sum_{m=1}^{K-1} \alpha_m$, and $\sum_{m=1}^{K} s_{m,1}(q) \approx \sum_{m=1}^{K} C^*_m + \tau_K \delta$. This implies $s_{K,1}(q) \approx C^*_K + \tau_K \delta$. Since $\sum_{m=1}^{K} \mu_m C^*_m = \lambda$, it holds that $W(q) \approx \lambda + \mu_1 \delta$ with high probability. We thus prove that (7) should be small, and it leads to a bound on the scaled mean number of jobs in the first $K$ types of servers.

Now for the remaining types of servers, the mean-field limit (5) indicates that almost all of them are idle. We thus try to bound this third Lyapunov function, $\sum_{m=K+1}^{M} C_m(\bar{Q})$. From the mean-field limit, we know that $\sum_{m=1}^{K} s_{m,1}(Q) \approx C^*$. Therefore, approximately $N\left(\sum_{m=1}^{K} \alpha_m - C^*\right)$ servers of the first $K$ types are idle, and Assumption 2 ensures that very few jobs are routed to the remaining types of servers under JFSQ and JFIQ. By utilizing a conditional geometric tail bound (Lemma 6), we manage to show that $\sum_{m=K+1}^{M} C_m(\bar{Q})$ is small with high probability, and finally obtain a bound on its mean.

For the complete proof of Theorem 1, since our theorem consists of three parts, we prove each of them in order, and combine them together at the end of this section.

K Types of Servers

The first result, which bounds the number of jobs in the first $K$ types of servers, is the most important part of the theorem and is restated as follows.

Lemma 1.

Under Assumption 1 and Assumption 2, the expected number of jobs in servers of the ﬁrst K types dividedby N is bounded as E (cid:34) max (cid:32) K (cid:88) m =1 C m ( ¯ Q ) − ( C ∗ + (cid:15) ) , (cid:33)(cid:35) ≤ τ K b (cid:15)N (2) if the routing policy is either JFSQ or JFIQ.Proof. Throughout this proof, we assume all assumptions in Lemma 1 are satisﬁed. Recall that the metric of interestis E (cid:104) max (cid:16)(cid:80) Km =1 C m ( ¯ Q ) − ( C ∗ + (cid:15) ) , (cid:17)(cid:105) , where C ∗ = (cid:80) Km =1 C ∗ m . To simplify the notation, let η = C ∗ + (cid:15) ,and denote h ( x ) = max( x − η, . Our goal is thus to bound E (cid:104) h ( (cid:80) Km =1 C m ( ¯ Q )) (cid:105) . The proof is motivated by theframework introduced in [31], and can be divided mainly into three parts, generator coupling, gradient bounds andstate-space collapse. Generator Coupling

We couple our system with a ﬂuid model that is simple, but can well approximate the evolutionof h ( (cid:80) Km =1 C m ( ¯ Q )) . In particular, consider a ﬂuid model ˙ x = − µ δ where δ = µ K µ b (cid:15) . Let g ( x ) be the solution tothe following Stein’s equation of the ﬂuid model, µ δg (cid:48) ( x ) = h ( x ) . (10)The solution is unique, and is given by g ( x ) = max( x − η, µ δ , g (cid:48) ( x ) = max( x − η, µδ , g (cid:48)(cid:48) ( x ) =  , x < η µ δ , x ≥ η. (11)The next step is to couple our system with the ﬂuid model through this stein’s equation.To do so, recall that the system is a CTMC deﬁned on queue lengths of servers, Q ( t ) . let G be the generator of oursystem such that for a queue state q , and any function V deﬁned on the state space, GV ( q ) = (cid:88) q (cid:48) r q , q (cid:48) ( V ( q (cid:48) ) − V ( q )) (12)where r q , q (cid:48) is the transition rate from state q to state q (cid:48) . It is clear that Gg ( q ) serves as an analog of the drift offunction g at state q in a discrete-time Markov chain as in [14]. To couple our system with the ﬂuid model, we ﬁrstneed the following property, a key insight from [14] and [31]. Lemma 2.

The expectation E (cid:104) Gg ( (cid:80) Km =1 C m ( ¯ Q )) (cid:105) is equal to . Then the two systems can be coupled by seeing that E (cid:34) h (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33)(cid:35) = E (cid:34) g (cid:48) (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33) ( µ δ ) (cid:35) (13) = E (cid:34) Gg (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33) − g (cid:48) (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33) ( − µ δ ) (cid:35) . (14)As a result, to bound E (cid:104) h (cid:16)(cid:80) Km =1 C m ( ¯ Q ) (cid:17)(cid:105) , it is equivalent to bound (14).10 PREPRINT - A


Gradient Bounds.

We now utilizing the explicit form of g ( x ) in (11) to bound (14). First by deﬁnition, it holds thatfor a state q , Gg (cid:32) K (cid:88) m =1 C m ( q ) (cid:33) = (cid:88) q (cid:48) r q , q (cid:48) (cid:32) g (cid:32) K (cid:88) m =1 C m ( q (cid:48) ) (cid:33) − g (cid:32) K (cid:88) m =1 C m ( q ) (cid:33)(cid:33) = λ Σ (1 − P k ( q )) (cid:32) g (cid:32) K (cid:88) m =1 C m ( q ) + 1 N (cid:33) − g (cid:32) K (cid:88) m =1 C m ( q ) (cid:33)(cid:33) ( Arrival transitions ) (15) + N W ( q ) (cid:32) g (cid:32) K (cid:88) m =1 C m ( q ) − N (cid:33) − g (cid:32) K (cid:88) m =1 C m ( q ) (cid:33)(cid:33) ( Departure transitions ) (16)where P k ( q ) is the probability that an arrival of jobs is not routed to a server of type no greater than K , and W ( q ) = (cid:80) Km =1 µ m s m, ( q ) . Then by (14), we can get E (cid:34) h (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33)(cid:35) ≤ E (cid:34) g (cid:48) (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33) ( µ δ ) (17) + λ Σ (cid:32) g (cid:32) K (cid:88) m =1 C m ( ¯ Q ) + 1 N (cid:33) − g (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33)(cid:33) (18) + N W ( ¯ Q ) (cid:32) g (cid:32) K (cid:88) m =1 C m ( ¯ Q ) − N (cid:33) − g (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33)(cid:33)(cid:35) (19)where we omit the term P k ( ¯ Q ) from (16) since g ( x ) is an increasing function by (11). Now to simplify the equation,we can do Taylor’s expansion on (18) and (19), and apply gradient bounds of g ( x ) . The result is summarized asfollows whose proof is provided in the appendix. Lemma 3.

It holds that E (cid:34) h (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33)(cid:35) ≤ E (cid:34) (cid:40) K (cid:88) m =1 C m ( ¯ Q ) ≥ η + 1 N (cid:41) g (cid:48) (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33) ( µ δ + λ − W ( ¯ Q )) (cid:35) + 38 b τ K (cid:15)N . (20)The remaining step is to bound the ﬁrst term on the right hand side in (20), which is the main part of this proof. Thekey insight is that as long as W ( q ) ≥ λ + µ δ , it holds that the contribution of q to the ﬁrst term would be at mostzero. Furthermore, this property only needs to hold when (cid:80) Km =1 C m ( q ) ≥ η + N due to the indicator function. Tojustify this result, we establish two state space collapse results as follows. State Space Collapse.

Recall that $\sum_{m=1}^{K} C_m(q)$ is the number of jobs in servers of the first $K$ types divided by $N$. The intuition is to show that when this number is large, it holds that with high probability,

$$s_{1,1}(q) = C^*_1,\ \cdots,\ s_{K-1,1}(q) = C^*_{K-1},\ s_{K,1}(q) > C^*_K. \tag{21}$$

That is to say, almost all servers of the first $K-1$ types are busy, and enough type-$K$ servers are busy that their total departure rate (or the work produced by these servers) is sufficient for the total arrival rate $\lambda_\Sigma$.

The following lemma indirectly shows that unless $\sum_{m=1}^{K} C_m(q)$ is small, $\sum_{m=1}^{K-1} s_{m,1}(q) \approx \sum_{m=1}^{K-1} \alpha_m$. In particular, it designs a Lyapunov function closely related to the above property. Due to space limitations, the proof is deferred to the appendix.

Lemma 4.

In addition to Lemma 4 that focuses on the first $K-1$ types of servers, the following lemma provides another Lyapunov function. This function is later used together with Lemma 4 to show that if $\sum_{m=1}^{K} C_m(q)$ is large, then a certain number of type-$K$ servers are busy. It then complements the goal in (21). The proof of this lemma is similar to that of Lemma 4, and is provided in the appendix.

Lemma 5.

Consider the following Lyapunov function V ( q ) = min  K (cid:88) m =1 b (cid:88) j =2 s m,j ( q ) , K (cid:88) m =1 C ∗ m + B + 3 τ K ¯ δ − K (cid:88) m =1 s m, ( q )  (23) where ¯ δ := τ K δ , and B := (cid:15) + ¯ δ. It holds that if V ( q ) ≥ B , then GV ( q ) ≤ − µ δb . To apply the above two lemmas, we need the following geometric tail bound from [50], which originates in [5, 48].This lemma translates the fact that a Lyapunov function has a negative drift to the property that the function is withina certain region with high probability.
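To illustrate what such a drift-based tail bound delivers, the sketch below compares the exact stationary tail of a small birth-death chain with the bound $\left(\frac{f_{\max}}{f_{\max} + \gamma}\right)^j$ from Lemma 6 below. The example is an illustration under stated assumptions, not part of the paper: the Lyapunov function is the queue length $V(s) = s$, the set $\mathcal{E}$ is taken to be the whole state space so that the second term of the bound vanishes, and the rates, the number of states, and all function names are hypothetical.

```python
def stationary_tail(up_rate, down_rate, n_states, level):
    """P(S >= level) for a birth-death chain on {0, ..., n_states - 1}.

    With constant rates, detailed balance gives a stationary distribution
    proportional to (up_rate / down_rate) ** k.
    """
    ratio = up_rate / down_rate
    weights = [ratio ** k for k in range(n_states)]
    return sum(weights[level:]) / sum(weights)


def drift_tail_bound(f_max, gamma, j):
    """First term of the tail bound: (f_max / (f_max + gamma)) ** j."""
    return (f_max / (f_max + gamma)) ** j


up_rate, down_rate, n_states = 1.0, 2.0, 40
threshold_b = 1                  # the drift is negative once V(s) >= 1
nu_max = 1                       # V changes by at most 1 per transition
gamma = down_rate - up_rate      # GV(s) <= -(down - up) whenever V(s) >= 1
f_max = up_rate                  # largest upward drift contribution

comparison = []
for j in range(1, 6):
    level = threshold_b + 2 * nu_max * j
    exact = stationary_tail(up_rate, down_rate, n_states, level)
    comparison.append((j, exact, drift_tail_bound(f_max, gamma, j)))
```

For this chain the exact tail decays roughly like $(1/2)^{1+2j}$, which sits below the bound $(1/2)^j$ for every $j$, so the bound is valid but not tight here.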

Lemma 6.

Consider a continuous time Markov chain { S ( t ) : t ≥ } on a ﬁnite state space S . Assume that it has aunique stationary distribution. For a Lyapunov function V : S → [0 , + ∞ ) , deﬁne GV ( s ) = (cid:80) s (cid:48) ∈S r s , s (cid:48) ( V ( s (cid:48) ) − V ( s )) where r s , s (cid:48) is the transition rate from state s to s (cid:48) .Suppose that ν max := sup s , s (cid:48) ∈S : r s , s (cid:48) > | V ( s ) − V ( s (cid:48) ) | < ∞ ; f max := max  , sup s ∈S (cid:88) s (cid:48) : V ( s (cid:48) ) >V ( s ) r s , s (cid:48) ( V ( s (cid:48) ) − V ( s ))  < ∞ . Given a set E . If for some B > , γ > , ξ ≥ , it holds: 1) GV ( s ) ≤ − γ when V ( s ) ≥ B and s ∈ E ; 2) GV ( s ) ≤ ξ when V ( s ) ≥ B and s (cid:54)∈ E ,then for all positive integer j , if ¯ S is the steady-state random variable, it holds P (cid:8) V (¯ S ) ≥ B + 2 ν max j (cid:9) ≤ (cid:18) f max f max + γ (cid:19) j + (cid:18) ξγ + 1 (cid:19) P { s (cid:54)∈ E} . (24)Based on Lemma 6, we can bound the probability that V ( q ) or V ( q ) is large in the following result. Lemma 7.

Let χ = 96 τ K b ln N . With the same notation in Lemma 4 and Lemma 5, it holds that P (cid:110) V ( ¯ Q ) ≥ B + χ(cid:15)N (cid:111) ≤ N − ; P (cid:110) V ( ¯ Q ) ≥ B + χ(cid:15)N (cid:111) ≤ N − . (25) Proof.

Note that under the notation in Lemma 6, we have for both V ( q ) and V ( q ) , ν max = N , and f max ≤ µ . Weﬁrst bound P (cid:8) V ( q ) ≥ B + χ(cid:15)N (cid:9) . Since by Lemma 4, when V ( q ) ≥ B , it holds GV ( q ) ≤ − µ δ b . Then by takingthe set E to be the empty set and taking j = bδ log N , Lemma 6 shows that P { V ( q ) ≥ B + 2 ν max j } ≤ (cid:18) δ b (cid:19) − j ≤ exp (cid:18) − j δ b (cid:19) = N − (26)where the last inequality comes from the fact that ln(1 + x ) ≥ x/ for x ∈ [0 , . We can easily verify that ν max j = N · µ b µ K (cid:15) = χ(cid:15)N . Similarly, take j = bδ log N for V ( q ) . Together with Lemma 5, Lemma 6 shows that P { V ( q ) ≥ B + 2 ν max j } ≤ (cid:18) δb (cid:19) − j ≤ exp (cid:18) − j δ b (cid:19) = N − . (27)We complete the proof by noticing that ν max j = N · µ b µ K (cid:15) ≤ χ(cid:15)N . PREPRINT - A


Completing the Whole Proof

Finally, combining Lemma 7 with Lemma 3 help us complete the proof. To see why,recall that it remains to bound E (cid:34) (cid:40) K (cid:88) m =1 C m ( ¯ Q ) ≥ η + 1 N (cid:41) g (cid:48) (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33) ( λ + µ δ − W ( ¯ Q )) (cid:35) . (28)Let event D = { V ( ¯ Q ) ≤ B + χ(cid:15)N } ∩ { V ( ¯ Q ) ≤ B + χ(cid:15)N } . It holds that (28) ≤ E (cid:34) (cid:40) K (cid:88) m =1 C m ( ¯ Q ) ≥ η + 1 N (cid:41) g (cid:48) (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33) ( λ + µ δ − W ( ¯ Q )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) D (cid:35) + g (cid:48) ( b ) µ (1 + δ ) P { ¯ D}≤ E (cid:34) (cid:40) K (cid:88) m =1 C m ( ¯ Q ) ≥ η + 1 N (cid:41) g (cid:48) (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33) ( λ + µ δ − W ( ¯ Q )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) D (cid:35) + 2 bδN (1 + δ ) (29)where the ﬁrst inequality is by the law of total probability and the fact that g (cid:48) ( x ) is a positive increasing function,that (cid:80) Km =1 C m ( q ) ≤ b for all possible q , and that λ ≤ µ , and the second inequality is by Lemma 7 that shows P { ¯ D} ≤ N . Therefore, it is sufﬁcient to bound the ﬁrst term in (29). The following lemma shows that this term is indeed non-positive.

Lemma 8.

For any q such that V ( q ) ≤ B + χ(cid:15)N and V ( q ) ≤ B + χ(cid:15)N , it holds that (cid:40) K (cid:88) m =1 C m ( q ) ≥ η + 1 N (cid:41) ( λ + µ δ − W ( q )) ≤ . (30) Proof.

W.L.O.G., we can directly assume (cid:80) Km =1 C m ( q ) ≥ η + N . Otherwise, (30) is already zero. Then the key stepis to show W ( q ) = (cid:80) Km =1 µ m s m, ( q ) ≥ λ + µ δ . By the deﬁnition of V ( q ) in (23), since (cid:80) Km =1 C m ( q ) ≥ η + N ,it holds that V ( q ) = (cid:80) K − m =1 C ∗ m − (cid:80) K − m =1 s m, ( q ) . Furthermore, as V ( q ) ≤ B + χ(cid:15)N and C ∗ m = α m for m < K ,it satisﬁes K − (cid:88) m =1 s i, ( q ) ≥ K − (cid:88) m =1 α m − ( B + χ(cid:15)N ) . (31)Since s m, ( q ) ≤ α m for all m , the total departure rate of servers of the ﬁrst K − types is at least K − (cid:88) m =1 µ m s m, ( q ) ≥ K − (cid:88) m =1 µ m α m − µ (cid:16) B + χ(cid:15)N (cid:17) . (32)Then for s K, ( q ) , recall the deﬁnition of V ( q ) in (22). To show that V ( q ) is equal to the second term in its deﬁnition,note that B + 3 τ K ¯ δ = 12 (cid:15) + τ K δ + 3 τ K δ ≤

12 + 2 τ K (cid:15) b ≤ (cid:15). Then since (cid:80) Km =1 C m ( q ) ≥ (cid:80) Km =1 C ∗ m + (cid:15) + N , it holds (cid:80) Km =1 C m ( q ) ≥ (cid:80) Km =1 C ∗ m + B +3 τ K ¯ δ . Therefore, V ( q ) is equal to (cid:80) Km =1 C ∗ m + B + 3 τ K ¯ δ − (cid:80) Km =1 s m, ( q ) , the second term in (22). By assumption, V ( q ) ≤ B + χ(cid:15)N .As a result, K (cid:88) m =1 s m, ( q ) ≥ K (cid:88) m =1 C ∗ m + 3 τ K ¯ δ − χ(cid:15)N , (33)and s K, ( q ) ≥ C ∗ K + 3 τ K ¯ δ − χ(cid:15)N (34)because s m, ( q ) ≤ α m = C ∗ m for m < K . From (32) and (34), it holds W ( q ) = K − (cid:88) m =1 µ m s m, ( q ) + µ K s K, ( q ) ≥ K − (cid:88) m =1 µ m α m + µ K C ∗ K + 3 µ K τ K ¯ δ − µ B − µ χ(cid:15)N (35) ≥ λ + 2 µ µ K δ − µ b µ K (cid:15)N ln( N ) ≥ λ + µ δ (36)where the last inequality is because µ > µ K , and µ µ K δ ≥ µ ln( N ) µ K (cid:15)N b by Assumption 1. The inequality (36)immediately implies the desired result. 13 PREPRINT - A


21, 2020To conclude the proof of Lemma 1, by Lemma 3, the bound in (29) and Lemma 8, it holds E (cid:34) h (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33)(cid:35) ≤ bδN (1 + δ ) + 38 b τ K (cid:15)N ≤ b τ K (cid:15)N + 2 bN + 38 b τ K (cid:15)N ≤ b τ K (cid:15)N . (37) Since Lemma 1 only bounds the mean number of jobs in servers of the ﬁrst K types, we need the following result forthe remaining servers in the system. This result shows that very few jobs will be served by servers of the last M − K types of jobs. Note that if K = M , then Lemma 1 already bounds the mean number of jobs in the system. Lemma 9.

Suppose

K < M . Under Assumption 1 and Assumption 2, if N is sufﬁciently large, the expected numberof jobs in servers of the last M − K types divided by N is bounded as E (cid:34) M (cid:88) m = K +1 C m ( ¯ Q ) (cid:35) ≤ ˜ d bµ M + 2 (cid:114) τ M b ln NN + 8 b (cid:115) τ K τ M ˆ β(cid:15)N . (38) if the routing policy is either JFSQ or JFIQ.Proof. To prove this result, let us consider the Lyapunov function V ( q ) = (cid:80) Mm = K +1 C m ( q ) . Then by showing thatthis function has a negative drift when outside of a region, we can obtain a bound on its expectation. To do so, deﬁne B as B = 1 µ M (cid:32) ˜ d b + (cid:115) µ µ M (cid:18) b ln( N ) N + 416 τ K b ˆ β(cid:15)N (cid:19)(cid:33) . (39)Let E K = { q : (cid:80) Km =1 C m ( q ) ≤ C ∗ + ˆ β } . It holds that ¯ Q lies in E K with high probability by the following lemmawhose proof is in the appendix. Lemma 10.

For any ∆ ≥ ˆ β , it holds P { (cid:80) Km =1 C m ( ¯ Q ) > C ∗ + ∆ } ≤ τ K b ∆ (cid:15)N . By Lemma 10, it holds that P { ¯ Q (cid:54)∈ E K } ≤ τ K b ˆ β(cid:15)N . Then it is natural to discuss the drift of V ( q ) when it is greaterthan B by conditioning on whether q is in E K or not. The result is summarized in this lemma, and the proof is in theappendix. Lemma 11.

When V ( q ) ≥ B , it holds that • if q ∈ E K , the drift is bounded as GV ( q ) ≤ − B µ M b + ˜ d ; • if q (cid:54)∈ E K , the drift is bounded as GV ( q ) ≤ µ . We now apply Lemma 6. Under the notation of that lemma, it holds ν max = N , f max ≤ µ for V ( q ) . Let γ := B µ M b − ˜ d , and take j = µ ln( N ) γ . Applying Lemma 6 and using Lemma 11, it satisﬁes that P (cid:26) V ( ¯ Q ) > B + 2 j N (cid:27) ≤ (cid:18) γµ (cid:19) − j + (cid:18) µ γ + 1 (cid:19) P { q (cid:54)∈ E K } ≤ N − + 416 µ τ K b β(cid:15)N (40)where the last inequality is because γ < µ when N is sufﬁciently large. Furthermore, the expecation of V ( ¯ Q ) canbe bounded as E (cid:2) V ( ¯ Q ) (cid:3) ≤ E (cid:20) V ( ¯ Q ) (cid:12)(cid:12)(cid:12)(cid:12) V ( ¯ Q ) ≤ B + 2 j N (cid:21) + E (cid:20) V ( ¯ Q ) (cid:12)(cid:12)(cid:12)(cid:12) V ( ¯ Q ) > B + 2 j N (cid:21) P (cid:26) V ( ¯ Q ) > B + 2 j N (cid:27) (41) ≤ B + 4 µ ln( N ) γN + b (cid:18) N − + 416 µ τ K b β(cid:15)N (cid:19) (42) ≤ B + 5 µ ln( N ) γN + 416 µ τ K b ˆ β(cid:15)γN . (43)The deﬁnition of B in (39) and that of γ immediately give the desired result.14 PREPRINT - A


The next lemma provides a bound on the blocking probability, and thus characterizes the effective throughput of the system. Due to space limitations, the reader is referred to the appendix for the proof.

Lemma 12.

Under Assumptions 1 and 2, the probability p B that an arrival of job is blocked is bounded as p B ≤ ˜ d λ + 52 τ K b (cid:15)N . (4)Wrapping up above lemmas, we can conclude the proof of Theorem 1. Proof of Theorem 1.

The ﬁrst result and third result in Theorem 1 corresponds to Lemma 1 and 12. For the secondresult, notice that Lemma 1 implies E (cid:34) K (cid:88) m =1 C m ( ¯ Q ) (cid:35) ≤ C ∗ + (cid:15) + 52 τ K b (cid:15)N . (44)Then combining (44) and (4) in Lemma 9 and the assumption that ˜ d ≤ (cid:15)µ K b in Assumption 2, it holds E (cid:34) M (cid:88) m =1 C m ( ¯ Q ) (cid:35) = E (cid:34) K (cid:88) m =1 C m ( ¯ Q ) (cid:35) + E (cid:34) M (cid:88) m = K +1 C m ( ¯ Q ) (cid:35) ≤ C ∗ + (cid:15) + 52 τ K b (cid:15)N + ˜ d bµ M + 2 (cid:114) τ M b ln NN + 8 b (cid:115) τ K τ M ˆ β(cid:15)N ≤ C ∗ + (cid:18) µ K µ M (cid:19) (cid:15) + 2 (cid:114) τ M b ln NN + 60 b (cid:115) τ K τ M ˆ β(cid:15)N , which is exactly (3). In this section, we prove Theorem 2. Since similar proof holds for Theorem 3, we provide that proof in the appendix.

Proof Sketch

The result is proved by showing that almost every pair of large enough subsets of $\mathcal{L}$ and $\mathcal{R}$ shares edges between the two sets because of the random graph structure. To show this fact, we first bound the probability that two given subsets are disconnected. The union bound then concludes the proof, since the total number of pairs of subsets is $2^{L+N}$.

Proof.

Recall the deﬁnition of p , p , ˜ d , ˜ d in Assumption 2. W.L.O.G., assume N p j is an integer for j = 1 , .Otherwise, we can raise p j to satisfy this condition since the size of a subset must be an integer. Suppose that wegenerate a bipartite graph G as in Theorem 2. Let C j be the event that G violates the j − th condition in Assumption 2.We bound P {C j } separately. To simplify the notation, let us denote R = R K − , R = R K . And let us write p (cid:96),r bethe probability that a port (cid:96) connects with a server r in the graph G .First, deﬁne D K , I as the event that a subset K of L has no edges with a subset I of R . Then for j = 1 , , C j = (cid:91) K⊆L : (cid:80) (cid:96) ∈K λ (cid:96) >N ˜ d j I⊆R j : |I|≥ Np j D K , I . (45)Fix j ∈ { , } . Let K be any subset of L satisfying (cid:80) (cid:96) ∈K λ (cid:96) > N ˜ d j , and I be any subset of R j satisfying |I| ≥ N p j .We want to bound P {D K , I } . Notice that by Assumption 2, it holds p < p , ˜ d < ˜ d , and ˜ d H ≥ ˜ d H . Then by theconstruction of G , if there is a port (cid:96) in K such that λ (cid:96) ≥ N ˜ d j H j , this port must be connected to all servers in R j ,15 PREPRINT - A


21, 2020meaning that P {D K , I } = 0 . Therefore, we can assume that such port does not exist. Recall that z (cid:96),r is the probabilitythat port (cid:96) is connected with server r . It holds that P {D K , I } = (cid:89) (cid:96) ∈K (cid:89) r ∈I (1 − z (cid:96),r ) ≤ exp (cid:32) − (cid:88) (cid:96) ∈K (cid:88) r ∈I z (cid:96),r (cid:33) ≤ exp (cid:32) − (cid:88) (cid:96) ∈K (cid:88) r ∈I λ (cid:96) H j N ˜ d j (cid:33) , (46)and thus P {D K , I } ≤ exp (cid:32) −|I| (cid:80) (cid:96) ∈K λ (cid:96) H j N ˜ d j (cid:33) ≤ exp( − H j N p j ) ≤ − N + L ) . (47)The ﬁrst inequality is because ln(1 + x ) ≤ x for x > − , and z (cid:96),r < . The second inequality is from the constructionof G . The third inequality is from the deﬁnition of K and I . It thus holds that P {C j } ≤ N + L − N + L ) = 2 − ( N + L ) by the union bound. Use the union bound once again, it holds P {C ∪ C } ≤ − ( N + L − .For the total number of edges used in G N , recall the deﬁnition of p , p , ˜ d , ˜ d for a particular system in Assumption2, and H , H in Theorem 2. It holds that ˜ d H = O ( (cid:15) b ( N + L ) /N ) , and ˜ d H = O ( (cid:15) b ( N + L ) /N ) . Note that there are fourtypes of connections on graph G N as per Theorem 2, we bound their numbers of edges separately. First, the numberof ports with λ (cid:96) ≥ N ˜ d H is bounded by Nµ H N ˜ d = O ( b ( N + L ) (cid:15) N ) because λ Σ ≤ N µ . Therefore, the number ofconnections from them is bounded by O ( b ( N + L ) (cid:15) ) since there are N servers. The same result holds for ports with λ (cid:96) ≥ N ˜ d H . Now for the remaining ports, the expected number of edges is upper bounded by (cid:80) (cid:96) ∈L λ (cid:96) N (cid:16) H ˜ d + H ˜ d (cid:17) N = O (cid:16) b ( N + L ) (cid:15) (cid:17) . Then to sum up, the expected number of edges in G N scales as O (cid:16) b ( N + L ) (cid:15) (cid:17) . In this section, we present simulation results for JFSQ and JFIQ. 
In particular, the following two settings are explored: • we compare the mean response time of JFSQ, JFIQ with a recent paper [19] in a ﬁxed-size system; • we study the convergence of JFSQ and JFIQ on a random bipartite graph in the many-server regime.We will also compare our policies with JSQ and JIQ where we assume that ties in those policies are broken at random.Detailed results are as follows. We ﬁrst study one particular setting as in [19]. There are servers with fast service rate , and servers withslow service rate . Jobs arrive into the system in a Poisson process of rate λ Σ , and can be routed to any server. Wesimulate an inﬁnite buffer system by setting the buffer size at each server to . We compare JFSQ and JFIQ withJSQ, JIQ and JSQ-(2,2) introduced in [19]. JSQ-(2,2) is similar to Pod, and it is shown in [19] to perform better thanother algorithms in light trafﬁc. We refer the reader to the appendix for a detailed description of JSQ-(2,2). Beside, thelower bound result in Theorem 1 is plotted as a baseline. Deﬁne the system load to be λ Σ . By increasing the systemload, we can obtain Fig. 2. Clearly, Fig. 2 shows that JFSQ and JFIQ can achieve consistently fast mean response(very close to the lower bound) ranging from light trafﬁc to heavy trafﬁc (the system load is around . ). For otherpolicies, JSQ-(2,2) performs well in light trafﬁc. However, JIQ and JSQ could have relatively poor response time inlight trafﬁc, although JIQ is shown to have asymptotically zero waiting time [44]. Next we explore the convergence behavior of JFSQ and JFIQ when there are job-server constraints. In particular,suppose there are N servers in the system. We assume there are four types of servers with the same amount ofeach type. The service time distributions are all exponentially distributed, but with different service rate such that µ i = 2 − i +1 , i = 1 , , , . We also study the convergence of JSQ and JIQ. 
JSQ-(2,2) introduced above is not studied because it is designed for systems with two classes of servers. The number of ports is set as L = N . . The arrival rate to each port is assumed to be homogeneous, and is equal to $\lambda_\Sigma / L$ with λ Σ = 0 . ∑ i =1 Nµ i . Denote the system load as λ = 0 . . In the corresponding bipartite graph, each port connects with each server with probability √ ln NN (1 − λ ) ln − λ according to Theorem 3. The buffer size in this case is set as $b = 5$ because in many-server systems, we expect there to be little queueing and one should not need a large buffer size.

Figure 2: The Mean Response Time of Different Routing Policies in a Fixed-Size System with Increasing System Load. (Vertical axis: mean response time; curves: JFIQ, JFSQ, JIQ, JSQ, JSQ-(2,2), lower bound.)

Figure 3: The Mean Response Time of Different Routing Policies on Increasing-Sized Random Bipartite Graphs. (Horizontal axis: number of servers; vertical axis: mean response time; curves: JFIQ, JFSQ, JIQ, JSQ, lower bound.)

Fig. 3 presents the convergence behavior of the mean response time for JFSQ, JFIQ, JIQ and JSQ. It is interesting to notice that both JIQ and JFIQ suffer from slow mean response time when the system is small. But when the number of servers is $N = 2048$, the mean response time of JFSQ and JFIQ is very close to the lower bound. Such a requirement on the number of servers is reasonable since modern cloud platforms can easily possess tens of thousands of servers [2]. On the other hand, both JSQ and JIQ also converge as $N$ increases. Nevertheless, their mean response time is not optimal because they neglect server heterogeneity. Note that when the system is large, the blocking probability is nearly zero, even with a small buffer size. The convergence of the blocking probability is provided in the appendix. The setting is also extended to hyper-exponential service time distributions. For this new distribution, we show that although JFSQ and JFIQ have slow mean response times initially, their convergence behavior is similar to Fig. 3 when $N$ increases. We refer the reader to the appendix for details.

In this paper, we studied the performance of two load balancing policies, JFSQ and JFIQ, on a bipartite graph. For a "well-connected" bipartite graph, we presented a bound on the mean response time for finite-size systems, which implies asymptotic optimality in the mean response time in both the many-server regime and the sub-Halfin-Whitt regime. A by-product of this paper is a novel technique for bounding the distance to the mean-field limit of heterogeneous load balancing systems.
In the analysis, we established three state-space collapse results to show that the system behaves similarly to its mean-field limit. We also presented how to construct a sparse "well-connected" bipartite graph, where each left node is only connected to ω ( − λ ) ) right nodes when arrival rates are heterogeneous, and only ω ( − λ ln − λ ) nodes for homogeneous servers, given that the buffer size is a constant, and the number of left nodes is at least that of right nodes. However, it is unknown whether these two bounds are tight, which we leave for future research.

Acknowledgment: The work of Wentao Weng was conducted during a visit to the Coordinated Science Lab, UIUC, during 2020.

References

[1] Amazon. Amazon Web Services (AWS) cloud computing services, 2020. URL https://aws.amazon.com.
[2] G. Amvrosiadis, J. W. Park, G. R. Ganger, G. A. Gibson, E. Baseman, and N. DeBardeleben. On the diversity of cluster workloads and its impact on research results. In Proc. USENIX Ann. Technical Conf. (ATC), pages 533–546, 2018.
[3] R. Atar. A diffusion regime with nondegenerate slowdown. Operations Research, 60(2):490–500, 2012.
[4] S. Banerjee, D. Mukherjee, et al. Join-the-shortest queue diffusion limit in Halfin–Whitt regime: Tail asymptotics and scaling of extrema. Ann. Appl. Probab., 29(2):1262–1309, 2019.
[5] D. Bertsimas, D. Gamarnik, and J. N. Tsitsiklis. Performance of multiclass Markovian queueing networks via piecewise linear Lyapunov functions. Ann. Appl. Probab., 11(4):1384–1428, 2001.
[6] A. Braverman. Steady-state analysis of the join-the-shortest-queue model in the Halfin–Whitt regime. Math. Oper. Res., 2020.
[7] A. Braverman, J. Dai, and J. Feng. Stein's method for steady-state diffusion approximations: an introduction through the Erlang-A and Erlang-C models. Stochastic Systems, 6(2):301–366, 2017.
[8] A. Budhiraja, D. Mukherjee, R. Wu, et al. Supermarket model on graphs. The Annals of Applied Probability, 29(3):1740–1777, 2019.
[9] E. Cardinaels, S. C. Borst, and J. S. van Leeuwaarden. Job assignment in large-scale service systems with affinity relations. Queueing Systems, 93(3-4):227–268, 2019.
[10] E. Cardinaels, S. Borst, and J. S. H. van Leeuwaarden. Redundancy scheduling with locally stable compatibility graphs, 2020.
[11] J. Cruise, M. Jonckheere, S. Shneer, et al. Stability of JSQ in queues with general server-job class compatibilities. Queueing Syst., pages 1–9, 2020.
[12] J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.
[13] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[14] A. Eryilmaz and R. Srikant. Asymptotically tight steady-state queue length bounds implied by drift conditions. Queueing Syst., 72(3-4):311–359, 2012.
[15] P. Eschenfeldt and D. Gamarnik. Join the shortest queue with many servers. The heavy-traffic asymptotics. Math. Oper. Res., 43(3):867–886, 2018.
[16] D. Gamarnik, J. N. Tsitsiklis, and M. Zubeldia. Delay, memory, and messaging tradeoffs in distributed service systems. Stoch. Syst., 8(1):45–74, 2018.
[17] D. Gamarnik, J. N. Tsitsiklis, M. Zubeldia, et al. A lower bound on the queueing delay in resource constrained load balancing. Annals of Applied Probability, 30(2):870–901, 2020.
[18] K. Gardner and R. Righter. Product forms for FCFS queueing models with arbitrary server-job compatibilities: An overview. arXiv preprint arXiv:2006.05979, 2020.
[19] K. Gardner, J. A. Jaleel, A. Wickeham, and S. Doroudi. Scalable load balancing in the presence of heterogeneous servers. arXiv preprint arXiv:2006.13987, 2020.
[20] N. Gast. The power of two choices on graphs: the pair-approximation is accurate? ACM SIGMETRICS Performance Evaluation Review, 43(2):69–71, 2015.
[21] Google. Google Cloud cloud computing services, 2020. URL https://cloud.google.com.
[22] Google. Google search, 2020. URL .
[23] A. Gujarati, S. Elnikety, Y. He, K. S. McKinley, and B. B. Brandenburg. Swayam: distributed autoscaling to meet SLAs of machine learning inference services with resource efficiency. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, pages 109–120, 2017.
[24] V. Gupta and N. Walton. Load balancing in the nondegenerate slowdown regime. Operations Research, 67(1):281–294, 2019.
[25] I. Gurvich et al. Diffusion models and steady-state approximations for exponentially ergodic Markovian queues. The Annals of Applied Probability, 24(6):2527–2559, 2014.
[26] B. Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, pages 502–525, 1982.
[27] D. Hurtado-Lange and S. T. Maguluri. Load balancing system under join the shortest queue: Many-server-heavy-traffic asymptotics. arXiv preprint arXiv:2004.04826, 2020.
[28] D. Hurtado-Lange and S. T. Maguluri. Throughput and delay optimality of power-of-d choices in inhomogeneous load balancing systems. arXiv preprint arXiv:2004.00538, 2020.
[29] X. Liu and L. Ying. On achieving zero delay with power-of-d-choices load balancing. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pages 297–305. IEEE, 2018.
[30] X. Liu and L. Ying. On universal scaling of distributed queues under load balancing. arXiv preprint arXiv:1912.11904, 2019.
[31] X. Liu and L. Ying. Steady-state analysis of load-balancing algorithms in the sub-Halfin–Whitt regime. J. Appl. Probab., 57(2):578–596, 2020.
[32] X. Liu, K. Gong, and L. Ying. Steady-state analysis of load balancing with Coxian- distributed service times. arXiv preprint arXiv:2005.09815, 2020.
[33] Y. Lu, Q. Xie, G. Kliot, A. Geller, J. R. Larus, and A. Greenberg. Join-idle-queue: A novel load balancing algorithm for dynamically scalable web services. Performance Evaluation, 68(11):1056–1071, 2011.
[34] S. T. Maguluri and R. Srikant. Heavy traffic queue length behavior in a switch under the MaxWeight algorithm. Stochastic Systems, 6(1):211–250, 2016.
[35] Microsoft. Microsoft Azure cloud computing services, 2020. URL https://azure.microsoft.com/en-us/.
[36] M. Mitzenmacher. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems, 12(10):1094–1104, 2001.
[37] S. Moharir, S. Sanghavi, and S. Shakkottai. Online load balancing under graph constraints. IEEE/ACM Transactions on Networking, 24(3):1690–1703, 2015.
[38] D. Mukherjee, S. C. Borst, and J. S. Van Leeuwaarden. Asymptotically optimal load balancing topologies. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(1):1–29, 2018.
[39] D. Mukherjee, S. C. Borst, J. S. Van Leeuwaarden, and P. A. Whiting. Universality of power-of-d load balancing in many-server systems. Stoch. Syst., 8(4):265–292, 2018.
[40] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 69–84, 2013.
[41] D. Rutten and D. Mukherjee. Load balancing under strict compatibility constraints. 2020.
[42] S. Shenker and A. Weinrib. The optimal control of heterogeneous queueing systems: A paradigm for load-sharing and routing. IEEE Transactions on Computers, 38(12):1724–1735, 1989.
[43] A. L. Stolyar. Tightness of stationary distributions of a flexible-server system in the Halfin-Whitt asymptotic regime. Stochastic Systems, 5(2):239–267, 2015.
[44] A. L. Stolyar. Pull-based load distribution in large-scale heterogeneous service systems. Queueing Syst., 80(4):341–361, 2015.
[45] S. R. Turner. The effect of increasing routing choice on resource pooling. Probability in the Engineering and Informational Sciences, 12(1):109–124, 1998.
[46] N. D. Vvedenskaya, R. L. Dobrushin, and F. I. Karpelevich. Queueing system with selection of the shortest of two queues: An asymptotic approach. Problemy Peredachi Informatsii, 32(1):20–34, 1996.
[47] W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang. MapTask scheduling in MapReduce with data locality: Throughput and heavy-traffic optimality. IEEE/ACM Transactions on Networking, 24(1):190–203, 2014.
[48] W. Wang, S. T. Maguluri, R. Srikant, and L. Ying. Heavy-traffic delay insensitivity in connection-level models of data transfer with proportionally fair bandwidth sharing. In Proc. ACM SIGMETRICS Int. Conf. Measurement and Modeling of Computer Systems, volume 45, pages 232–245. ACM, 2018.
[49] R. R. Weber. On the optimal assignment of customers to parallel servers. 15(2):406–413, 1978.
[50] W. Weng and W. Wang. Dispatching parallel jobs to achieve zero queuing delay. arXiv preprint arXiv:2004.02081, 2020.
[51] Q. Xie and Y. Lu. Priority algorithm for near-data scheduling: Throughput and heavy-traffic optimality. In , pages 963–972. IEEE, 2015.
[52] Q. Xie, A. Yekkehkhany, and Y. Lu. Scheduling with multi-level data locality: Throughput and heavy-traffic optimality. In IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications, pages 1–9. IEEE, 2016.
[53] L. Ying. Stein's method for mean field approximations in light and heavy traffic regimes. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(1):1–27, 2017.
[54] L. Ying, R. Srikant, and X. Kang. The power of slightly more than one sample in randomized load balancing. Math. Oper. Res., 42(3):692–722, 2017.
[55] X. Zhou and N. Shroff. A note on load balancing in many-server heavy-traffic regime. arXiv preprint arXiv:2004.09574, 2020.
[56] X. Zhou, J. Tan, and N. Shroff. Flexible load balancing with multi-dimensional state-space collapse: Throughput and heavy-traffic delay optimality. Performance Evaluation, 127:176–193, 2018.
[57] X. Zhou, J. Tan, and N. Shroff. Heavy-traffic delay optimality in pull-based load balancing systems: Necessary and sufficient conditions. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3):1–33, 2018.

A Proof of Proposition 1

Proposition 1[Restated].

For any $m \in \{1, \cdots, M\}$, let $I_m$ denote the probability that an arriving job is scheduled to a type-$m$ server in steady state. Also, recall that $\bar{S}_{m,1}$ is defined as a steady-state random variable denoting the number of busy type-$m$ servers divided by $N$. Then, because of stability and the work conservation law, it holds that for all $m \le M$,

$$\lambda_\Sigma I_m = N \mu_m E\left[\bar{S}_{m,1}\right]. \tag{49}$$

In particular,

$$\lambda = \sum_{m=1}^{M} \frac{\lambda_\Sigma I_m}{N} = \sum_{m=1}^{M} \mu_m E\left[\bar{S}_{m,1}\right] \tag{50}$$

since $\sum_{m=1}^{M} I_m = 1$. Now notice that the mean service time of jobs is given by

$$E\left[\bar{Z}\right] = \sum_{m=1}^{M} \frac{I_m}{\mu_m} = \sum_{m=1}^{M} \frac{E\left[\bar{S}_{m,1}\right]}{\lambda} \tag{51}$$

since the service time at type-$m$ servers is exponentially distributed with mean $1/\mu_m$, and $I_m$ satisfies (49). To obtain a lower bound of $E\left[\bar{Z}\right]$, consider the following linear program:

$$\min \ \frac{1}{\lambda} \sum_{m=1}^{M} x_m \quad \text{s.t.} \quad \lambda = \sum_{m=1}^{M} \mu_m x_m, \qquad 0 \le x_m \le \alpha_m, \quad m = 1, \ldots, M,$$

where $x_m$ is an analog of $E\left[\bar{S}_{m,1}\right]$, and the objective value is a lower bound of $E\left[\bar{Z}\right]$ because of (50). Then since only the sum of the $x_m$ matters, and $\mu_1 \ge \cdots \ge \mu_M$, the optimal solution is exactly given by

$$x_1^* = \alpha_1, \ \cdots, \ x_{K-1}^* = \alpha_{K-1}, \quad x_K^* = \frac{\lambda - \sum_{m=1}^{K-1} \mu_m x_m^*}{\mu_K}, \quad x_m^* = 0 \ \text{for } m > K.$$

Then it is clear that $E\left[\bar{Z}\right] \ge \frac{1}{\lambda} \sum_{m=1}^{M} x_m^* = \frac{C^*}{\lambda}$.

B Proof of Lemmas in Section 4

B.1 Proof of Lemma 2

Lemma 2 [Restated]. The expectation $E\left[G g\left(\sum_{m=1}^{K} C_m(\bar{Q})\right)\right]$ is equal to $0$.

Proof. To simplify the notation, denote $V(q) = g\left(\sum_{m=1}^{K} C_m(q)\right)$ for a state $q$. Since the system is stable (because of the assumption of finite buffers), there is a unique stationary distribution $\pi_q$ that solves the balance equations, such that for every $q$,

$$\pi_q \sum_{q'} r_{q,q'} = \sum_{q'} \pi_{q'} r_{q',q} \tag{52}$$

where $r_{q,q'}$ is the transition rate from $q$ to $q'$. Since $V(q)$ is bounded (as $\sum_{m=1}^{K} C_m(q) \le b$), it holds that

$$\begin{aligned}
E\left[GV(\bar{Q})\right] &= \sum_{q} \pi_q \sum_{q'} r_{q,q'} \left(V(q') - V(q)\right) \\
&= -\sum_{q} \pi_q \sum_{q'} V(q)\, r_{q,q'} + \sum_{q} \pi_q \sum_{q'} r_{q,q'} V(q') \\
&= -\sum_{q} V(q) \sum_{q'} \pi_q r_{q,q'} + \sum_{q} V(q) \sum_{q'} \pi_{q'} r_{q',q} = 0,
\end{aligned}$$

where the last equality follows from (52).

B.2 Proof of Lemma 3

Lemma 3 [Restated].

The idea is to utilize the result that E (cid:104) h (cid:16)(cid:80) Km =1 C m ( ¯ Q ) (cid:17)(cid:105) ≤ (17)(18) + (19) , and to expand (18) and (19)by Taylor’s expansion. Consider three cases of state q . • First, if (cid:80) Km =1 C m ( q ) ≤ η − N , then g ( (cid:80) Km =1 C m ( q ) − N ) , g ( (cid:80) Km =1 C m ( q )) , g ( (cid:80) Km =1 C m ( q ) + N ) areall zero. This case has no contribution to the expectation; • second, if (cid:80) Km =1 C m ( q ) ∈ ( η − N , η + N ) , by ﬁrst-order Taylor’s expansion, there exists some ˜ ξ q , ˜ η q ∈ ( η − N , η + N ) , such that g (cid:32) K (cid:88) m =1 C m ( q ) + 1 N (cid:33) − g (cid:32) K (cid:88) m =1 C m ( q ) (cid:33) = 1 N g (cid:48) ( ˜ ξ q ) ,g (cid:32) K (cid:88) m =1 C m ( q ) − N (cid:33) − g (cid:32) K (cid:88) m =1 C m ( q ) (cid:33) = − N g (cid:48) (˜ η q ); • third, if (cid:80) Km =1 C m ( q ) ≥ η + N , by second-order Taylor’s expansion, there exists some ξ q , η q , such that g (cid:32) K (cid:88) m =1 C m ( q ) + 1 N (cid:33) − g (cid:32) K (cid:88) m =1 C m ( q ) (cid:33) = 1 N g (cid:48) (cid:32) K (cid:88) m =1 C m ( q ) (cid:33) + 2 N g (cid:48)(cid:48) ( ξ q ) ,g (cid:32) K (cid:88) m =1 C m ( q ) − N (cid:33) − g (cid:32) K (cid:88) m =1 C m ( q ) (cid:33) = − N g (cid:48) (cid:32) K (cid:88) m =1 C m ( q ) (cid:33) + 2 N g (cid:48)(cid:48) ( η q ) . Then it holds that E (cid:34) h (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33)(cid:35) (53) ≤ (17) + (18) + (19) (54)21 PREPRINT - A

UGUST

21, 2020 = E (cid:34) (cid:40) K (cid:88) m =1 C m ( ¯ Q ) ≥ η + 1 N (cid:41) (cid:32) g (cid:48) (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33) ( λ + µ δ − W ( ¯ Q )) (cid:33)(cid:35) (55) + E (cid:34) (cid:40) K (cid:88) m =1 C m ( ¯ Q ) ≥ η + 1 N (cid:41) (cid:18) N (cid:0) λg (cid:48)(cid:48) ( ξ ¯ Q ) + W ( ¯ Q ) g (cid:48)(cid:48) ( η ¯ Q ) (cid:1)(cid:19)(cid:35) (56) + E (cid:34) (cid:40) K (cid:88) m =1 C m ( ¯ Q ) ∈ ( η − N , η + 1 N ) (cid:41) (cid:32) g (cid:48) (cid:32) K (cid:88) m =1 C m ( ¯ Q ) (cid:33) ( µ δ ) + λg (cid:48) ( ˜ ξ ¯ Q ) − W ( ¯ Q ) g (cid:48) (˜ η ¯ Q ) (cid:33)(cid:35) . (57)It sufﬁces to bound (56) and (57). First, note that | g (cid:48)(cid:48) ( x ) | ≤ µ δ for all x by the explicit form of g ( x ) in (11). It holds (56) ≤ N · µ δ · µ = 4 N δ = 24 τ K b (cid:15)N . (58)On the other hand, to bound (57), since (cid:80) Km =1 C m ( ¯ Q ) , ˜ ξ q , ˜ η q ∈ ( η − N , η + N ) , their derivatives are all bounded by Nµ δ . Then (57) ≤ N µ δ · ( µ δ + µ ) = 2 N + 12 τ K b (cid:15)N ≤ τ K b (cid:15)N . (59)Summing the above two equations completes the proof of Lemma 3. B.3 Proof of Lemma 4Lemma 4[Restated].

Consider the following Lyapunov function V ( q ) = min  b (cid:88) j =1 s K,j ( q ) + K − (cid:88) m =1 b (cid:88) j =2 s m,j ( q ) , K − (cid:88) m =1 C ∗ m − K − (cid:88) m =1 s m, ( q )  . (22) It holds that if V ( q ) ≥ B := τ K δ , then GV ( q ) ≤ − µ δ b .Proof. Since V ( q ) ≥ B by assumption, both of the following two properties holds: b (cid:88) j =1 s K,j ( q ) + K − (cid:88) m =1 b (cid:88) j =2 s m,j ( q ) ≥ B ; (60) K − (cid:88) m =1 s m, ( q ) ≤ K − (cid:88) m =1 C ∗ m − B . (61)Let T , be the ﬁrst term in V ( q ) , and T , be the second term. First, by deﬁnition, GV ( q ) = (cid:88) q (cid:48) r q , q (cid:48) ( V ( q (cid:48) ) − V ( q ))= (cid:88) q (cid:48) , arrival r q , q (cid:48) ( V ( q (cid:48) ) − V ( q )) (62) + (cid:88) q (cid:48) , departure r q , q (cid:48) ( V ( q (cid:48) ) − V ( q )) (63)where we separate transitions by identifying those caused by a job arrival from those caused by a job departure.Bounding (62) and (63) can then bound GV ( q ) . Next we consider two cases corresponding to whether V ( q ) is equalto T , or to T , .Suppose that T , ≤ T , . then in this case, (63) ≤ −  b (cid:88) j =1 µ K ( s K,j ( q ) − s K,j +1 ( q )) + K − (cid:88) m =1 b (cid:88) j =2 µ m ( s m,j ( q ) − s m,j +1 ( q ))  (64)22 PREPRINT - A

UGUST

21, 2020 = − (cid:32) µ K s K, ( q ) + K (cid:88) m =1 µ m s m, ( q ) (cid:33) (65) ≤ − B µ K b ≤ − µ δb . (66)The ﬁrst inequality (64) is because V ( q ) = τ , , and only jobs departing from servers of type K and servers of typesless than K with queue length at least can affect the value of V ( q ) . The ﬁrst equation (65) comes from the fact that s m,b +1 = 0 for all m . The last inequality is from (60) and the non-decreasing property s m, ( q ) ≥ s m, ( q ) ≥ · · · s m,b ( q ) for all m .On the other hand, to bound (62), notice that V ( q ) can increase only when a job arrival is routed to some servers oftypes at least K . Then clearly, (62) ≤ L (cid:88) (cid:96) =1 N λ (cid:96) · { an arrival to port (cid:96) is not routed to an idle server of types less than k | q } . (67)However, by (61), the number of idle servers of types less than K is at least N K − (cid:88) m =1 ( C ∗ m − s m, ( q )) ≥ N B = N (cid:15) b . Let I be the set of idle servers of types less than K . Since |I| ≥ N(cid:15) b , Assumption 2 guarantees that (cid:80) (cid:96) (cid:54)∈ N R ( I ) λ (cid:96) ≤ N ˜ d = N(cid:15)µ K b . That is to say, the total arrival rates of ports not connected with servers in I is bounded by N ˜ d . Nowsince our routing policy is either JFSQ or JFIQ, for those ports connected with I , a job arrival must be routed to oneserver in I because servers in I are idle, and are faster than other idle servers not in I . Therefore, (67) ≤ N · N (cid:15)µ K b ≤ µ δ b . (68)With (66) and (68), it holds GV ( q ) ≤ − µ δ b when T , ≤ T , .For the second case where T , ≥ T , , it holds (63) ≤ K − (cid:88) m =1 µ m ( s m, ( q ) − s m, ( q )) (69)since V ( q ) increases only when a job departs from a server of type less than K and only with this single job in theserver. Also, we can see (62) ≤ − N L (cid:88) (cid:96) =1 λ (cid:96) · { an arrival to port (cid:96) is routed to an idle server of type less than k | q } (70) ≤ N ( − λ Σ + N ˜ d ) = − λ + ˜ d . 
(71)The ﬁrst inequality is because for arrival transitions, only jobs arriving to idle servers of types less than k can change V ( q ) , and their arrivals will all decrease V ( q ) by N by the deﬁnition of T , . The second inequality is derived fromthe same argument of (68). Therefore, it holds that GV ( q ) = (62) + (63) ≤ − λ + ˜ d + K − (cid:88) m =1 µ m ( s m, ( q ) − s m, ( q )) ≤ − λ + ˜ d + K − (cid:88) m =1 µ m α m − µ K B (72) ≤ − µ K B + ˜ d (73) ≤ − µ δ b (74)because of (61) and the assumption that λ ≥ (cid:80) K − m =1 µ m α m . Therefore, the above discussion proves that whenever V ( q ) ≥ B , it holds GV ( q ) ≤ − µ δ b . PREPRINT - A

UGUST

21, 2020
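The drift $GV(q) = \sum_{q'} r_{q,q'} (V(q') - V(q))$ used in the proof of Lemma 4, split into an arrival part as in (62) and a departure part as in (63), can be illustrated on a toy chain. The sketch below does not model the bipartite system: it assumes a single M/M/1/b queue with hypothetical rates `lam` and `mu` and the test function $V(q) = q$. It also checks numerically that the stationary expectation of the drift vanishes, which is the statement of Lemma 2 for bounded test functions.

```python
def drift_parts(q, lam, mu, b, V):
    """Return (arrival part, departure part) of the drift GV(q)
    for an M/M/1/b queue with arrival rate lam and service rate mu."""
    arrival = lam * (V(q + 1) - V(q)) if q < b else 0.0   # blocked when full
    departure = mu * (V(q - 1) - V(q)) if q > 0 else 0.0  # no departure when empty
    return arrival, departure


def stationary(lam, mu, b):
    """Stationary distribution of the M/M/1/b queue: pi_k proportional to (lam/mu)^k."""
    rho = lam / mu
    weights = [rho ** k for k in range(b + 1)]
    total = sum(weights)
    return [w / total for w in weights]


lam, mu, b = 0.8, 1.0, 5          # hypothetical parameters


def V(q):
    return float(q)               # queue length as test function


pi = stationary(lam, mu, b)
# E[GV(Q)] under the stationary distribution; should vanish up to rounding.
mean_drift = sum(p * sum(drift_parts(q, lam, mu, b, V)) for q, p in enumerate(pi))
```

In the proofs above, the same decomposition is applied state by state: the departure part is bounded using the number of busy servers, and the arrival part using the connectivity guarantee of Assumption 2.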

B.4 Proof of Lemma 5Lemma 5[Restated].

Let T , be the ﬁrst term in V ( q ) , and T , be the second term. Since V ( q ) ≥ B , both the following hold: K (cid:88) m =1 b (cid:88) j =2 s ij ( q ) ≥ B ; (75) K (cid:88) m =1 s m, ( q ) ≤ K (cid:88) m =1 C im + 3 µ ¯ δ. (76)By deﬁnition, GV ( q ) = (cid:88) q (cid:48) , arrival r q , q (cid:48) ( V ( q (cid:48) ) − V ( q )) (77) + (cid:88) q (cid:48) , departure r q , q (cid:48) ( V ( q (cid:48) ) − V ( q )) . (78)We then consider two cases. First, suppose that T , ≤ T , . Then similar to the proof of Lemma 4, using (75), it holdsthat (78) ≤ − N K (cid:88) m =1 b (cid:88) j =2 N µ m ( s m,j ( q ) − s m,j +1 ( q )) (79) = − N K (cid:88) m =1 N µ m s m, ( q ) (80) ≤ − B µ K b = − (cid:15)µ K b − µ δb . (81)On the other hand, we have (77) ≤ L (cid:88) (cid:96) =1 N λ (cid:96) · { an arrival to port (cid:96) is not routed to an idle server of types ≤ k | q } . (82)Notice that by (76), the number of idle servers of types no greater than K satisﬁes that N (cid:32) K (cid:88) m =1 α m − K (cid:88) m =1 s m, ( q ) (cid:33) (83) ≥ N (cid:32) K (cid:88) m =1 α m − K (cid:88) m =1 C ∗ m − τ ,K ¯ δ (cid:33) (84) = N (cid:32) α K − λ − (cid:80) K − m =1 µ m α m µ K − τ ,K ¯ δ (cid:33) (85) = N · (cid:80) Km =1 µ m α m − λµ K − N τ K ¯ δ (86) = Nµ K (cid:32) β K (cid:88) m =1 µ m α m − µ τ K δ (cid:33) (87) ≥ N (cid:16) ˆ β − τ K (cid:15) b (cid:17) ≥ N ˆ β (88)24 PREPRINT - A

UGUST

21, 2020where (88) is because b ≥ τ K by Assumption 1, and ˆ β = β (cid:80) Km =1 α m , and µ > · · · > µ K .Let I be the set of idle servers of types no greater than K . It then holds |I| ≥ N ˆ β . Then By Assumption 2, the totalarrival rate of ports not connected with I is bounded by N ˜ d . Since the routing policy is either JFSQ or JIFQ, jobsarriving to ports connecting with I must be routed to servers in I . Therefore, it holds (82) ≤ ˜ d ≤ µ K (cid:15) b . Then in thiscase, we know GV ( q ) = (77) + (78) ≤ − (cid:15)µ K b − µ δb + µ K (cid:15) b ≤ − µ δb . Now we consider the second case, T , ≥ T , . Similarly, it holds (78) ≤ (cid:80) Km =1 µ m ( s m, ( q ) − s m, ( q )) , and (77) ≤ − N L (cid:88) (cid:96) =1 N λ (cid:96) · { an arrival to port (cid:96) is routed to an idle server of types ≤ k | q }≤ − λ + ˜ d (89)where the last inequality follows the same argument as in the ﬁrst case. Then it holds GV ( q ) ≤ K (cid:88) m =1 µ m s m, ( q ) − K (cid:88) m =1 µ m s m, ( q ) − λ + ˜ d (90) ≤ K − (cid:88) m =1 µ m α m + µ K ( C ∗ K + 3 µ ¯ δ ) − λ − µ K B b − µ K (cid:15) b (91) ≤ µ δ − µ K B b − (cid:15) b (92) ≤ µ δ − µ K (cid:15) b −

1) + µ K (cid:15) b − µ δb (93) ≤ − µ δb . (94)The last inequality is because µ K (cid:15) b − − µ K (cid:15) b = µ K (cid:15) b ≥ µ µ K (cid:15) µ b = 3 µ δ. Therefore, we complete the proof of Lemma 5.

B.5 Proof of Lemma 10Lemma 10[Restated].

For any ∆ ≥ ˆ β , it holds P { (cid:80) Km =1 C m ( ¯ Q ) > C ∗ + ∆ } ≤ τ K b ∆ (cid:15)N . Proof.

By Lemma 1, it holds that P (cid:40) K (cid:88) m =1 C m ( ¯ Q ) > C ∗ + ∆ (cid:41) = P (cid:40) K (cid:88) m =1 C m ( ¯ Q ) − C ∗ − ˆ β > ∆ − β (cid:41) (95) ≤ P (cid:40) K (cid:88) m =1 C m ( ¯ Q ) − C ∗ − ˆ β >

12 ∆ (cid:41) (96) ≤ E (cid:104) max (cid:16)(cid:80) Km =1 C m ( ¯ Q ) − C ∗ − ˆ β , (cid:17)(cid:105) ∆ (97) ≤ τ K b ∆ (cid:15)N (98)since (cid:15) ≤ ˆ β by assumption. 25 PREPRINT - A

UGUST

21, 2020
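The step from (96) to (97) is Markov's inequality applied to the positive part of $X = \sum_{m=1}^{K} C_m(\bar{Q}) - C^* - \hat{\beta}$: for any $t > 0$, $P\{X > t\} \le E[\max(X, 0)] / t$. The following sketch checks this inequality numerically. The discrete distribution is hypothetical and chosen only for illustration; it is not the stationary distribution of the model.

```python
def tail_prob(values, probs, t):
    """P{X > t} for a discrete random variable X."""
    return sum(p for x, p in zip(values, probs) if x > t)


def positive_part_mean(values, probs):
    """E[max(X, 0)] for a discrete random variable X."""
    return sum(p * max(x, 0.0) for x, p in zip(values, probs))


# Hypothetical distribution of X (values may be negative).
values = [-0.2, 0.0, 0.1, 0.4]
probs = [0.5, 0.3, 0.15, 0.05]

# Each entry is (t, P{X > t}, E[max(X, 0)] / t); the second never exceeds the third.
checks = [
    (t, tail_prob(values, probs, t), positive_part_mean(values, probs) / t)
    for t in (0.05, 0.1, 0.3)
]
```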

B.6 Proof of Lemma 11Lemma 11[Restated].

When V ( q ) ≥ B , it holds that • if q ∈ E K , the drift is bounded as GV ( q ) ≤ − B µ M b + ˜ d ; • if q (cid:54)∈ E K , the drift is bounded as GV ( q ) ≤ µ . Proof.

By deﬁnition, GV ( q ) = (cid:88) q (cid:48) r q , q (cid:48) ( V ( q (cid:48) ) − V ( q ))= (cid:88) q (cid:48) , arrival r q , q (cid:48) ( V ( q (cid:48) ) − V ( q )) (99) + (cid:88) q (cid:48) , departure r q , q (cid:48) ( V ( q (cid:48) ) − V ( q )) . (100)Note that since V ( q ) ≥ B , and V ( q ) = (cid:80) Mm = K +1 (cid:80) bj =1 s m,j ( q ) , it holds that (100) = − M (cid:88) m = k +1 µ m s m, ( q ) ≥ − B µ M b (101)since s m, ( q ) ≥ · · · ≥ s m,b ( q ) and s m,b +1 ( q ) = 0 for all m .For (99), we consider two cases. First, if q ∈ E K , the number of idle servers of types no greater than K is given by N (cid:32) K (cid:88) m =1 α m − K (cid:88) m =1 s m, ( q ) (cid:33) ≥ N (cid:32) K (cid:88) m =1 α m − K (cid:88) m =1 C m ( q ) (cid:33) ≥ N (cid:32) K (cid:88) m =1 α m − C ∗ − ˆ β (cid:33) = N (cid:32) β (cid:80) K − m =1 α m µ m µ K − ˆ β (cid:33) ≥ N ˆ β where the second inequality is because sum Km =1 C m ( q ) ≤ C ∗ + ˆ β when q ∈ E K . Then since the routing policy iseither JFSQ or JFIQ, jobs arriving to ports connecting with idle servers of types no greater than K must be routed tothose servers. And by Assumption 2, the total arrival rate of disconnected ports is bounded by ˜ d N . As a result, (99) ≤ ˜ d , (102)showing that GV ( q ) ≤ − B µ M b + ˜ d when q ∈ E K .When q (cid:54)∈ E K , it holds that (99) ≤ λ ≤ µ , and (100) ≥ . Therefore, GV ( q ) ≤ µ . B.7 Proof of Lemma 12Lemma 11[Restated].

Under Assumption 1 and Assumption 2, the probability p B that an arrival of job is blocked isbounded as p B ≤ ˜ d λ + 52 τ K b (cid:15)N . (4)26 PREPRINT - A

UGUST

21, 2020

Proof.

Denote B (cid:96) ( q ) = {∀ r ∈ N L ( (cid:96) ) , q r = b } . That is, whether all neighbors of port (cid:96) are full. Then by deﬁnition, p B = 1 λ Σ L (cid:88) (cid:96) =1 λ (cid:96) E (cid:2) B (cid:96) ( ¯ Q ) (cid:3) = 1 λ Σ L (cid:88) (cid:96) =1 λ (cid:96) E (cid:34) B (cid:96) ( ¯ Q ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K (cid:88) m =1 C m ( ¯ Q ) ≤ (cid:35) P (cid:40) K (cid:88) m =1 C m ( ¯ Q ) ≤ (cid:41) + 1 λ Σ L (cid:88) (cid:96) =1 λ (cid:96) E (cid:34) B (cid:96) ( ¯ Q ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K (cid:88) m =1 C m ( ¯ Q ) > (cid:35) P (cid:40) K (cid:88) m =1 C m ( ¯ Q ) > (cid:41) ≤ λ Σ L (cid:88) (cid:96) =1 λ (cid:96) E (cid:34) B (cid:96) ( ¯ Q ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K (cid:88) m =1 C m ( ¯ Q ) ≤ (cid:35) + P (cid:40) K (cid:88) m =1 C m ( ¯ Q ) > (cid:41) . To bound P (cid:110)(cid:80) Km =1 C m ( ¯ Q ) > (cid:111) , notice that C ∗ ≤ , so P (cid:40) K (cid:88) m =1 C m ( ¯ Q ) > (cid:41) ≤ P (cid:40) K (cid:88) m =1 C m ( ¯ Q ) > C ∗ + 2 (cid:41) ≤ τ K b (cid:15)N by Lemma 10.Then for the case (cid:80) Km =1 C m ( q ) ≤ , it holds that (cid:80) Km =1 s m,b ( q ) ≤ b . Let I be the set of servers of types no greaterthan K with queue length less than b . Then we know |I| ≥ (1 − b ) N ≥ ˆ β N since b ≥ . By Assumption 2, the totalarrival rate of ports not connected with I is thus upper bounded by N ˜ d . As a result, p B ≤ λ Σ L (cid:88) (cid:96) =1 λ (cid:96) E (cid:34) B (cid:96) ( ¯ Q ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K (cid:88) m =1 C m ( ¯ Q ) ≤ (cid:35) + P (cid:40) K (cid:88) m =1 C m ( ¯ Q ) > (cid:41) ≤ ˜ d λ + 52 τ K b (cid:15)N . B.8 Proof of Corollary 1Corollary 1[Restated].

Suppose that (cid:15) N is both o (1) and ω ( N − . ln( N )) , and that both Assumptions 1 and 2 holdfor G N when N is sufﬁciently large. Then as N → ∞ , both JFSQ and JFIQ are asymptotically optimal, and theexpected queueing delay converges to zero for both policies.Proof. First since (cid:15) N = ω (ln N N − . ) , there is always a b N satisfying Assumption 1 when N is sufﬁciently large.Let ¯ Q N be the queue-length random variable, and let p N B be the blocking probability for the N − th system. ApplyingTheorem 1 gives E (cid:34) M (cid:88) m =1 C m ( ¯ Q N ) (cid:35) ≤ C ∗ + (cid:16) τ KM (cid:17) (cid:15) N + 2 (cid:114) τ M b N ln NN + 60 b N (cid:115) τ K τ M ˆ β N (cid:15) N N , and p N B ≤ (cid:15) N µ K b N λ + τ K b N (cid:15) N N for N large enough.Since (cid:15) N = o (1) , (cid:15) N = ω ( N − . ln N ) , ˆ β N > (cid:15) N and b N satisﬁes Assumption 1, it holds that lim N →∞ E (cid:104)(cid:80) Mm =1 C m ( ¯ Q N ) (cid:105) = C ∗ . Then by Little’s Law, the expected mean response time E [ T N ] of the N − thsystem is given by the mean number of jobs in the system divided by the effective arrival rate. Therefore, lim N →∞ E [ T N ] = lim N →∞ E (cid:104) N (cid:80) Mm =1 C m ( ¯ Q N ) (cid:105) λ Σ (1 − p N B ) ≤ C ∗ λ (cid:16) − lim N →∞ (cid:15) N µ K b N λ + τ K b N (cid:15) N N (cid:17) = C ∗ λ , which matches the lower bound in Theorem 1. Therefore, JFSQ and JFIQ are asymptotically optimal in mean responsetime. On the other hand, let E (cid:2) T N W (cid:3) be the expected waiting time of jobs, and let E [ Z N ] be the expected servicetime in the N − th system. Then it holds E [ T N ] = E (cid:2) T N W (cid:3) + E [ Z N ] . Since E [ Z N ] ≥ C ∗ λ , E (cid:2) T N W (cid:3) ≥ , and lim N →∞ E [ T N ] = C ∗ λ , it holds lim N →∞ E (cid:2) T N W (cid:3) = 0 . As a result, JFSQ and JFIQ obtain asymptotic zero queueingdelays. 27 PREPRINT - A

UGUST

21, 2020
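The two quantities compared in the proof of Corollary 1 can be computed directly. The sketch below is illustrative only and its function names are hypothetical. It assumes that server types are sorted by decreasing service rate, that `alpha[m]` is the fraction of servers of type `m`, and that `lam` is the arrival rate normalized by $N$. The first function is the greedy solution of the linear program in Appendix A, which gives $C^* = \sum_m x_m^*$; the last one is Little's law with the effective arrival rate $\lambda (1 - p_B)$.

```python
def greedy_busy_fractions(lam, mu, alpha):
    """Greedy LP solution: fill the fastest server types first.

    mu must be sorted in decreasing order; returns the list x* of busy fractions.
    """
    x, remaining = [], lam
    for rate, frac in zip(mu, alpha):
        use = min(frac, remaining / rate) if remaining > 0 else 0.0
        x.append(use)
        remaining -= rate * use
    if remaining > 1e-12:
        raise ValueError("load exceeds total service capacity")
    return x


def response_time_lower_bound(lam, mu, alpha):
    """Lower bound C*/lam on the mean response time."""
    return sum(greedy_busy_fractions(lam, mu, alpha)) / lam


def mean_response_time(mean_jobs_per_server, lam, p_block):
    """Little's law: mean number of jobs (normalized by N) divided by
    the effective arrival rate lam * (1 - p_block)."""
    return mean_jobs_per_server / (lam * (1.0 - p_block))
```

When the normalized mean number of jobs tends to $C^*$ and the blocking probability tends to zero, `mean_response_time` tends to `response_time_lower_bound`, which is the asymptotic optimality statement.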

C Proof of Random Graph Results

Here we provide the missing proof of Theorem 3.

C.1 Proof of Theorem 3Theorem 3[Restated].

Suppose that all ports share the same arrival rates, that is, λ (cid:96) ≡ ¯ λ for all (cid:96) ∈ L . Thenfollowing the same construction of graph G in Theorem 2 but with H j = 6 (cid:16) − ln p j + ˜ d j p j ¯ λ ln µ ˜ d j (cid:17) for j ∈ { , } , itholds that G satisﬁes Assumption 2 with probability at least − (cid:0) NNp (cid:1) − . The total number of edges in G N scalesas O (cid:16) ( N + L ) b (cid:15) ln b(cid:15) (cid:17) .Proof. The proof is similar to that of Theorem 2. Let us follow the same notation in the proof of Theorem 2. Fix j ∈ { , } . Similarly, let K be any subset of L satisfying (cid:80) (cid:96) ∈K λ (cid:96) > N ˜ d j , and I be any subset of R j satisfying |I| ≥ N p j . To bound P {D K , I } , W.L.O.G., we can assume every port in K has arrival rate less than N ˜ d j H j , otherwise P {D K , I } = 0 . Then following the same argument in the proof of Theorem 2, it holds P {D K , I } ≤ exp( − H j N p j ) . The key step is to obtain a bound on the number of pairs of feasible K , I so that we can use the union bound. Let N j K , N j I be the amount of such sets, respectively. W.L.O.G., assume that N p j is an integer since |I| must be an integer.Also, as all ports share the same arrival rate ¯ λ , we can assume N ˜ d j / ¯ λ is an integer since the size of K must exceedthis value. Then it holds that N j K = (cid:18) LN ˜ d j / ¯ λ (cid:19) ≤ (cid:18) (cid:100) N µ / ¯ λ (cid:101) N ˜ d j / ¯ λ (cid:19) (103) N j I = (cid:18) NN p j (cid:19) . (104)We have the following lemma bounding a binomial number. Lemma 13.

Fix an integer n . For any < α < , if αn is an integer, then ln (cid:0)(cid:0) nαn (cid:1)(cid:1) ≤ − αn ln α .Proof. Let k = αn . It holds that (cid:18) nk (cid:19) = n ( n − · · · ( n − k + 1) k ! ≤ n k k ! . We know that e k = (cid:80) i ≥ k i i ! . Therefore, k k k ! ≤ e k . It then implies that (cid:18) nk (cid:19) ≤ n k k ! ≤ e k n k k k = (cid:16) enk (cid:17) k . As a result, ln (cid:18)(cid:18) nαn (cid:19)(cid:19) ≤ αn (1 − ln( α )) ≤ − nα ln α because α < .Now by the deﬁnition of p j , ˜ d j , it holds p j < , N ˜ d j / ¯ λ (cid:100) Nµ / ¯ λ (cid:101) < . Then by Lemma 13, when N is sufﬁciently large, ln (cid:16) N j K (cid:17) ≤ − N p j ln p j , ln (cid:16) N j I (cid:17) ≤ − N ˜ d j / ¯ λ ln (cid:18) µ ˜ d (cid:19) . (105)Therefore, it holds that P {C j } ≤ N j K N j I exp( − H j N p j ) ≤ exp (cid:32) − N p j H j − N p j ln p j − N p j ˜ d j p j ¯ λ ln (cid:32) µ ˜ d j (cid:33)(cid:33) . (106)28 PREPRINT - A

By definition, $H_j = 6\left(-\ln p_j + \frac{\tilde d_j}{p_j \bar\lambda} \ln \frac{\mu_1}{\tilde d_j}\right)$. Then we can see

$$P\{\mathcal{C}_j\} \le \exp(3 N p_j \ln p_j) \le \binom{N}{N p_j}^{-1}.$$

By the union bound, it holds that

$$P\{\mathcal{C}_1 \cup \mathcal{C}_2\} \le 2 \binom{N}{N p_1}^{-1},$$

since $p_1 < p_2 < 1/2$. Therefore, the probability that $G_N$ satisfies Assumption 2 is at least $1 - 2\binom{N}{N p_1}^{-1}$.

For the total number of edges used in $G_N$, consider the four types of connections on graph $G_N$ as per Theorem 2 and Theorem 3, where we use different $H_j$. We bound the number of edges for each type as follows. First, through some calculations, $H_j = O\left(\frac{b}{\bar\lambda} \ln \frac{b}{\epsilon}\right)$ and $\frac{H_j}{\tilde d_j} = O\left(\frac{b}{\bar\lambda} + \frac{b}{\epsilon \bar\lambda} \ln \frac{b}{\epsilon}\right)$.

Then the number of ports with $\lambda_\ell \ge N \tilde d_1 / H_1$ is bounded by $\frac{L \bar\lambda H_1}{N \tilde d_1} = O\left(\frac{(N + L) b}{N \epsilon} \ln \frac{b}{\epsilon}\right)$ because $\lambda_\Sigma = L \bar\lambda$. Therefore, the number of connections from them is bounded by $O\left((N + L)\frac{b}{\epsilon} \ln \frac{b}{\epsilon}\right)$, since there are $N$ servers. The same result holds for ports with $\lambda_\ell \ge N \tilde d_2 / H_2$. Now for the remaining ports, the expected number of edges is upper bounded by

$$\sum_{\ell \in \mathcal{L}} \frac{\lambda_\ell}{N} \left(\frac{H_1}{\tilde d_1} + \frac{H_2}{\tilde d_2}\right) N = O\left((N + L)\frac{b}{\epsilon} \ln \frac{b}{\epsilon}\right).$$

To sum up, the expected number of edges in $G_N$ scales as $O\left((N + L)\frac{b}{\epsilon} \ln \frac{b}{\epsilon}\right)$.

D Additional Simulation Results

In this section, we provide details omitted from the main text and give additional simulation results.

D.1 Description of JSQ-(2,2)
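
For readers who prefer code, the routing rule specified in this subsection can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used for the simulations: the function name `jsq_2_2_route`, the representation of the system as a list of queue lengths, and uniform sampling without replacement are all assumptions. Polling two servers per class is read off the name JSQ-(2,2), and the last branch uses the complement $1 - p_F$ so that the two routing probabilities sum to one.

```python
import random
from typing import Sequence


def jsq_2_2_route(queue_len: Sequence[int],
                  fast: Sequence[int],
                  slow: Sequence[int],
                  p_f: float,
                  p_s: float,
                  rng: random.Random) -> int:
    """Return the index of the server that receives the arriving job."""
    # Step 1: poll two fast and two slow servers
    # (assumed: uniformly at random, without replacement).
    fast_polled = rng.sample(list(fast), 2)
    slow_polled = rng.sample(list(slow), 2)

    # Polled server with the shorter queue in each class.
    best_fast = min(fast_polled, key=lambda s: queue_len[s])
    best_slow = min(slow_polled, key=lambda s: queue_len[s])

    # Step 2: an idle polled fast server always gets the job.
    if queue_len[best_fast] == 0:
        return best_fast

    # Step 3: no idle fast server, but an idle polled slow server.
    if queue_len[best_slow] == 0:
        return best_slow if rng.random() < p_s else best_fast

    # Step 4: every polled server is busy
    # (assumed: slow branch taken with probability 1 - p_f).
    return best_fast if rng.random() < p_f else best_slow


# Hypothetical usage: servers 0 and 1 are fast, servers 2 and 3 are slow.
rng = random.Random(0)
queues = [2, 3, 0, 4]
chosen = jsq_2_2_route(queues, fast=[0, 1], slow=[2, 3],
                       p_f=0.8, p_s=0.5, rng=rng)
```

The values `p_f=0.8` and `p_s=0.5` in the usage lines are placeholders, not the tuned values from Table 1 in [19].
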

In JSQ-(2,2) [19], there are two parameters $p_F, p_S$. For each arriving job, we find a server as follows:

1. sample 2 fast servers and 2 slow servers;
2. if there is an idle fast server, route the job to this server;
3. if there is an idle slow server, route the job to this server with probability $p_S$, and route the job to the fast server with the shorter queue with probability $1 - p_S$;
4. otherwise, route the job to the fast server with the shorter queue with probability $p_F$, and route the job to the slow server with the shorter queue with probability $1 - p_F$.

We set $p_S, p_F$ to be the optimal values from Table 1 in [19].

D.2 Convergence of Blocking Probability

Fig. 4 shows the convergence of the blocking probability under the same setting as in Section 6.2. Unlike JSQ, which is shown to be throughput optimal [11] (as is JFSQ), JIQ and JFIQ could lose part of the capacity of the system. As shown in Fig. 4, when we set the buffer size to be , the blocking probability of JIQ is around 1.5 percent, and that of JFIQ is around 1 percent. Interestingly, JFIQ seems to be more stable. Nevertheless, the blocking probability of both algorithms decreases swiftly as $N$ increases.

D.3 Exploring More General Service Time Distribution

We present a preliminary study here that extends the results proved in this paper. Roughly speaking, we consider the same setting as in Section 6.2, except that we allow the service time distribution to be hyper-exponential.

As before, suppose there are $N$ servers in the system, where $N$ can scale up. Servers are classified into four types with different service speeds, and each type consists of the same number of servers. Then let $X$ be a hyper-exponential

[Plot: blocking probability versus number of servers for JFIQ, JFSQ, JIQ, and JSQ]

Figure 4: The Blocking Probability of Different Routing Policies on Increasing-Sized Random Bipartite Graphs

[Plot: mean response time versus number of servers for JFIQ, JFSQ, JIQ, JSQ, and the lower bound]