Scalable Load Balancing in the Presence of Heterogeneous Servers
Kristen Gardner, Jazeem Abdul Jaleel, Alexander Wickeham, Sherwin Doroudi
Amherst College and University of Minnesota
June 24, 2020
Abstract
Heterogeneity is becoming increasingly ubiquitous in modern large-scale computer systems. Developing good load balancing policies for systems whose resources have varying speeds is crucial in achieving low response times. Indeed, how best to dispatch jobs to servers is a classical and well-studied problem in the queueing literature. Yet the bulk of existing work on large-scale systems assumes homogeneous servers; unfortunately, policies that perform well in the homogeneous setting can cause unacceptably poor performance—or even instability—in heterogeneous systems.

We adapt the “power-of-d” versions of both the Join-the-Idle-Queue and Join-the-Shortest-Queue policies to design two corresponding families of heterogeneity-aware dispatching policies, each of which is parameterized by a pair of routing probabilities. Unlike their heterogeneity-unaware counterparts, our policies use server speed information both when choosing which servers to query and when probabilistically deciding where (among the queried servers) to dispatch jobs. Both of our policy families are analytically tractable: our mean response time and queue length distribution analyses are exact as the number of servers approaches infinity, under standard assumptions. Furthermore, our policy families achieve maximal stability and outperform well-known dispatching rules—including heterogeneity-aware policies such as Shortest-Expected-Delay—with respect to mean response time.

1 Introduction

In large-scale computer systems, deciding how to dispatch arriving jobs to servers is a primary factor affecting system performance. Consequently, there is a wealth of literature on designing, analyzing, and evaluating the performance of load balancing policies. For analytical tractability, most existing work on dispatching in large-scale systems makes a key assumption: that the servers are homogeneous, meaning that they all have the same speeds, capabilities, and available resources.
But this assumption is not accurate in practice. Modern computer systems are instead heterogeneous: server farms may consist of multiple generations of hardware, servers with varied resources, or even virtual machines running in a cloud environment. Given the ubiquity of heterogeneity in today’s systems, it is critically important to develop load balancing policies that perform well in heterogeneous environments. In this paper, we focus on systems in which server speeds are heterogeneous.

The dominant dispatching paradigm in the contemporary literature on large-scale systems is the “power of d choices,” wherein the dispatcher cannot use global information to make dispatching decisions, as that would require prohibitively expensive computation upon each job’s arrival. Rather, a fixed number (d) of servers are queried at random, and a dispatching decision is made among these servers. Unfortunately, the “power of d” policies that have been designed to perform well in homogeneous systems can lead to unacceptably poor performance—or even instability—in the presence of heterogeneity. For example, the classical Join-the-Shortest-Queue-d (JSQ-d) policy, under which, upon a job’s arrival, the dispatcher queries d servers uniformly at random and sends the job to the queried server with the fewest jobs in its queue, can cause the system to become unstable if the system’s capacity is concentrated among a relatively small number of fast servers. JSQ-d is just one example of a heterogeneity-unaware policy, but recent work has shown that other heterogeneity-unaware policies, including Join-Idle-Queue (JIQ), also can lead to poor performance in heterogeneous systems. Clearly, it is necessary to use server speed information when making dispatching decisions in heterogeneous systems.

Yet simply using heterogeneity information is not enough: it matters exactly when and how the dispatcher uses this information.
Consider the Shortest-Expected-Delay-d (SED-d) policy, a natural heterogeneity-aware generalization of JSQ-d. Under SED-d, upon a job’s arrival the dispatcher queries d servers uniformly at random and sends the job to the queried server at which the job’s expected delay—the number of jobs in the queue scaled by the server’s speed—is smallest. By allowing the dispatcher to select a fast server with a longer queue over a slow server with a shorter queue, SED-d overcomes one of the weaknesses of JSQ-d in the presence of heterogeneity. Unfortunately, this is insufficient to solve the fundamental problem faced by JSQ-d. SED-d, too, can cause poor performance and instability if fast servers are queried infrequently.

While server heterogeneity poses a problem for many existing dispatching policies, it also presents an opportunity to design new policies that leverage heterogeneity to achieve good performance and maintain stability, rather than suffering in the presence of heterogeneity. Our key insight is that there are two decision points at which “power of d” policies can use server speed information. First, the dispatcher can make heterogeneity-aware decisions about which d servers to query. Second, the dispatcher can make heterogeneity-aware decisions about where among the queried servers to send an arriving job. Alone, neither decision point appears to be enough to both ensure stability and achieve good performance. In combination, they allow for the design of a new class of powerful policies that benefit from server speed heterogeneity, thereby resolving the problems of instability and poor performance.

We propose two new families of policies, called JIQ-(d_F, d_S) and JSQ-(d_F, d_S), that are inspired by classical “power of d” policies but use server speed information at both decision points. This enables them to significantly outperform JSQ-d, SED-d, and other heterogeneity-aware policies, as well as to maintain the full stability region.
At the first decision point, instead of querying d servers uniformly at random from among all servers, our policies query d_F fast servers and d_S slow servers. Unlike under JSQ-d and SED-d, this guarantees that each job has the option to run on a fast server. After querying d_F + d_S servers, our policies decide probabilistically, based on the servers’ states (idle or busy), whether to dispatch the job to a fast server or a slow server. Our policy families are analytically tractable: given the probabilistic parameter settings, we derive the mean response time and queue length distribution under each. While the two families are functionally similar, they require different analytical approaches. We analyze JIQ-(d_F, d_S) using a mean field approach, and JSQ-(d_F, d_S) using a system of differential equations capturing the system evolution. Our analyses of both policies are exact in the limiting regime where the number of servers approaches infinity, under standard asymptotic independence assumptions.

The remainder of this paper is organized as follows. In Section 2 we survey related work on dispatching in heterogeneous systems. Section 3 describes the system model and defines the JIQ-(d_F, d_S) and JSQ-(d_F, d_S) policy families. In Section 4 we present our analyses of both policies. We give a numerical evaluation in Section 5 and propose a heuristic for selecting policy parameters in Section 6. Finally, in Section 7, we conclude.

2 Related Work

In large-scale homogeneous systems, Join-the-Shortest-Queue (JSQ) is known to minimize mean response time under first-come-first-served (FCFS) scheduling when service times are independent and identically distributed and have non-decreasing hazard rate [31, 29]. While analyzing response time is challenging due to the dependencies among queue lengths, approximations exist in both the FCFS setting with exponential service times [17] and the Processor Sharing (PS) setting with general service times [6].
Because of the high communication cost required to query all servers for their queue lengths, the JSQ-d (also called SQ(d) or Power-of-d) policy was proposed and analyzed, assuming homogeneous servers and exponential service times [15, 27]. Other policies, such as Join-Idle-Queue (JIQ), have also been proposed as low-communication alternatives to JSQ [13, 28].

Once the server homogeneity assumption is relaxed, the optimality and analytical tractability of state-aware dispatching policies suffer. The SQ(2) policy has been studied in heterogeneous FCFS systems with general service times, under both light traffic [9] and heavy traffic [32] assumptions. Performance analysis also exists for SQ(2) in heterogeneous PS systems [16]. The Shortest Expected Delay (SED) policy is a natural alternative to JSQ when server speeds are known; SED has been shown empirically to perform favorably relative to several other heterogeneity-aware policies [1]. However, SED is known to be suboptimal in general [30]. When service times are generally distributed, SED requires knowledge of the full job size distribution in order to estimate the remaining service time of the job currently in service. The Generalized JSQ (GJSQ) policy has been proposed as an alternative when only the mean job size at each server, not the full job size distribution, is known [21] (note that when service times are exponentially distributed, SED and GJSQ are equivalent). The equilibrium distribution of the number of jobs in the system has been analyzed under both SED and GJSQ in a heterogeneous two-server system [21, 22].

Figure 1: The system consists of k_F fast servers, each with service rate µ_F, and k_S slow servers, each with service rate µ_S. Jobs arrive to the system as a Poisson process with rate λk and are dispatched immediately.
The Balanced Routing policy (which we call Weighted JSQ in Section 5) uses server speed information by querying servers probabilistically in proportion to their speeds, but ignores heterogeneity information when choosing among the queried servers; this policy minimizes the system workload in heavy traffic [4], but can be suboptimal at lower load.

A common theme in much of the recent work on dispatching in heterogeneous systems is the observation that policies like SQ(d) and JIQ, which were designed for homogeneous systems, have a reduced stability region when used in heterogeneous systems. Consequently, much of the recent work in heterogeneous systems has focused on developing policies that maximize the stability region. Recently several families of throughput-optimal policies have been proposed, including PULL [25] and Π [32]. PULL, which is similar to JIQ, is shown to be optimal in the sense that it stochastically minimizes the queue length distribution [25]; as we will see in Section 5, this does not mean that it is optimal with respect to other system metrics such as response time.

Another related stream of work focuses on the so-called “slow server problem,” wherein the system designer must choose when, if at all, to use a slow server. Typically, models consist of two servers of different speeds with all jobs arriving to a single queue [11, 12, 18, 19, 10], with more recent work examining similar problems in settings with more than two servers [14, 20]. As they examine a central queue setting rather than an immediate dispatching setting, the policies and analysis proposed in these papers are inapplicable to our setting.
Closer to our setting, but still within the literature on central queues, is [24], which considers dispatching to one of two subsystems: a central queue for a limited number of fast servers, and a subsystem with an infinite number of slow servers.

More closely related to our work is a literature stream on dispatching in small-scale heterogeneous systems [26, 3, 2, 5, 23]. Such work explores policies that use information about all servers’ queue lengths (or sometimes more detailed information, as in [8]) when making dispatching decisions. These are not “power of d” policies and would not typically be considered scalable; hence, our policies of interest, analytical approaches, and qualitative findings differ significantly from those in the papers above.

3 Model

Our system consists of k heterogeneous servers (see Figure 1). There are two classes of servers: k_F of the servers are “fast” servers and k_S = k − k_F of the servers are “slow” servers. We let q_F = k_F/k and q_S = k_S/k = 1 − q_F denote the fraction of servers that are fast and slow, respectively. Service times are independent, and for most of the paper we assume that they are exponentially distributed with rate µ_F on fast servers and rate µ_S on slow servers, where the speed ratio r ≡ µ_F/µ_S >
1. In Section 4.1.2 we consider general service time distributions. For simplicity, we assume that µ_F q_F + µ_S q_S = 1, so that the system has total capacity k.

Jobs arrive to the system as a Poisson process with rate λk. Upon arrival to the system, a job is dispatched immediately to a single server according to some policy. Each server works on the jobs in its queue in first-come first-served (FCFS) order without preemption.

We consider two dispatching policies: JIQ-(d_F, d_S) and JSQ-(d_F, d_S). The common framework shared by both policies favors idle fast servers whenever possible, leverages the idea that slow servers are still occasionally worth utilizing (motivating probabilistic decision-making), and recognizes that it is better to utilize slow servers when they are idle rather than busy (motivating the use of two—rather than just one—probabilistic parameters).

Definition 1.
Under both JIQ-(d_F, d_S) and JSQ-(d_F, d_S), when a job arrives the dispatcher queries d_F fast servers and d_S slow servers, chosen uniformly at random without replacement. The job is then dispatched to one of the queried servers as follows:

• If any of the d_F fast servers are idle, the job begins service on one of them.

• If all d_F fast servers are busy and any of the d_S slow servers are idle:
  – With probability p_S the job begins service on an idle slow server.
  – With probability 1 − p_S the job is dispatched to a chosen fast server among the d_F queried.

• If all d_F + d_S queried servers are busy:
  – With probability p_F the job is dispatched to a chosen fast server among the d_F queried.
  – With probability 1 − p_F the job is dispatched to a chosen slow server among the d_S queried.

The difference between the two policies lies in how a busy server (among those under consideration) is chosen. Under JIQ-(d_F, d_S) the server is chosen uniformly at random. Under JSQ-(d_F, d_S) the server with the shortest queue is chosen. Under both policies all ties are broken uniformly at random.

4 Analysis

In this section we analyze the queue length distribution and mean response time under both JIQ-(d_F, d_S) and JSQ-(d_F, d_S). Let ρ_F and ρ_S denote, respectively, the fraction of time that a fast server is busy and that a slow server is busy. We begin with the observation that ρ_F and ρ_S do not depend on the choice between JIQ-(d_F, d_S) and JSQ-(d_F, d_S), nor on the service time distribution. For both policies, and for any service time distribution such that the system is stable, we have

ρ_F = λk P{job runs on a fast server} / (µ_F k_F) = (λ / (µ_F q_F)) [ (1 − ρ_F^{d_F}) + ρ_F^{d_F}(1 − ρ_S^{d_S})(1 − p_S) + ρ_F^{d_F} ρ_S^{d_S} p_F ]   (1)

ρ_S = λk P{job runs on a slow server} / (µ_S k_S) = (λ / (µ_S q_S)) [ ρ_F^{d_F}(1 − ρ_S^{d_S}) p_S + ρ_F^{d_F} ρ_S^{d_S}(1 − p_F) ].
(2)

Solving this system of equations, numerically if an exact analytical solution is not possible, yields ρ_F and ρ_S. We define π_F = 1 − ρ_F (respectively, π_S = 1 − ρ_S) to be the probability that a fast (slow) server is idle.

We will assume that k → ∞ and that in this limiting regime the queue lengths at the individual servers become independent. This lets us treat a single queue as its own isolated system. While we do not formally prove this asymptotic independence, our numerical results indicate that as k becomes large our approximation is highly accurate.

4.1 JIQ-(d_F, d_S)

We will derive performance metrics under JIQ-(d_F, d_S) first for exponential service times, then for general service times. For both analyses, we use a mean field approach and study a tagged fast server and a tagged slow server, each in isolation. We will need the arrival rates to fast and slow servers when they are busy and when they are idle; we note that these rates are independent of the service time distribution. Let λ_F^B, λ_F^I, λ_S^B, and λ_S^I denote, respectively, the arrival rates to a tagged busy fast, idle fast, busy slow, and idle slow server.

Let λ_F^Q denote the arrival rate of jobs that query a tagged fast server. We have

λ_F^Q = λk · C(k_F − 1, d_F − 1) C(k_S, d_S) / (C(k_F, d_F) C(k_S, d_S)) = λ d_F / q_F,   (3)

where C(n, m) denotes the binomial coefficient. λ_S^Q is defined similarly. The arrival rates λ_F^I and λ_F^B depend not only on the state of the tagged fast server, but also on whether the other servers queried by an arriving job are busy or idle. Under our asymptotic independence assumption, all other fast (respectively, slow) servers have the same stationary distribution, π_{iF} (π_{iS}), as the tagged fast (slow) server, where π_{iF} (π_{iS}) denotes the stationary probability that there are i jobs at a tagged fast (slow) server, i ∈ {0, 1, . . .}.
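As noted above, the fixed point of equations (1) and (2) can be computed numerically. The following Python sketch (ours, not part of the paper; the parameter values at the bottom are illustrative examples) solves the system by damped fixed-point iteration:

```python
# Numerically solve equations (1)-(2) for the busy probabilities rho_F and
# rho_S via damped fixed-point iteration. Parameter values are illustrative.

def busy_probabilities(lam, mu_F, mu_S, q_F, q_S, d_F, d_S, p_F, p_S,
                       tol=1e-13, max_iter=200_000):
    rho_F = rho_S = lam          # any starting point in (0, 1)
    for _ in range(max_iter):
        a, b = rho_F ** d_F, rho_S ** d_S
        # Right-hand sides of (1) and (2).
        nF = (lam / (mu_F * q_F)) * ((1 - a) + a * (1 - b) * (1 - p_S) + a * b * p_F)
        nS = (lam / (mu_S * q_S)) * (a * (1 - b) * p_S + a * b * (1 - p_F))
        nF, nS = min(nF, 1.0), min(nS, 1.0)   # keep iterates in [0, 1]
        if abs(nF - rho_F) < tol and abs(nS - rho_S) < tol:
            return nF, nS
        # Damping avoids oscillation of the plain iteration.
        rho_F, rho_S = 0.5 * (rho_F + nF), 0.5 * (rho_S + nS)
    return rho_F, rho_S

# Example: q_F = 0.2, speed ratio r = 10, normalized so mu_F*q_F + mu_S*q_S = 1.
q_F, q_S, r = 0.2, 0.8, 10.0
mu_S = 1.0 / (r * q_F + q_S)     # from mu_F*q_F + mu_S*q_S = 1 with mu_F = r*mu_S
mu_F = r * mu_S
rho_F, rho_S = busy_probabilities(0.7, mu_F, mu_S, q_F, q_S, 2, 2, 0.5, 0.5)
```

A useful sanity check on any solution is work conservation: the four dispatch probabilities in (1)-(2) sum to one, so at the fixed point µ_F q_F ρ_F + µ_S q_S ρ_S = λ.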
When the tagged fast server is idle, an arriving job that queries the tagged server will be dispatched to it if it is chosen (uniformly at random) among all idle fast servers queried by the arrival. We have:

λ_F^I = λ_F^Q · Σ_{i=0}^{d_F−1} C(d_F − 1, i) π_F^i (1 − π_F)^{d_F−1−i} / (i + 1).   (4)

When the tagged fast server is busy, an arriving job that queries the tagged server will be dispatched to it if none of the other queried fast servers are idle (probability (1 − π_F)^{d_F−1}), and if either (1) the arrival queries an idle slow server (probability 1 − (1 − π_S)^{d_S}), the dispatcher chooses to send the job to a fast server (probability 1 − p_S), and the tagged fast server is chosen uniformly at random among all queried fast servers (probability 1/d_F), or (2) all queried slow servers are busy (probability (1 − π_S)^{d_S}), the dispatcher chooses to send the job to a fast server (probability p_F), and the tagged fast server is chosen uniformly at random among all queried fast servers (probability 1/d_F). We thus have:

λ_F^B = λ_F^Q · ((1 − π_F)^{d_F−1} / d_F) · [ (1 − (1 − π_S)^{d_S})(1 − p_S) + (1 − π_S)^{d_S} p_F ].   (5)

Our approach for the tagged slow server is similar, yielding:

λ_S^I = λ_S^Q (1 − π_F)^{d_F} p_S · Σ_{i=0}^{d_S−1} C(d_S − 1, i) π_S^i (1 − π_S)^{d_S−1−i} / (i + 1)   (6)

λ_S^B = λ_S^Q (1 − π_F)^{d_F} ((1 − π_S)^{d_S−1} / d_S) (1 − p_F).   (7)

We are now ready to derive mean response time under both exponential and general service times.

4.1.1 Exponential Service Times

Our approach involves setting up and solving a Markov chain for a tagged fast server and for a tagged slow server. We begin with the fast server. Recall that state iF denotes that there are i jobs at the fast server, including the job in service if there is one, and π_{iF} denotes that state’s stationary probability.
The number of jobs at the tagged fast server evolves as a state-dependent M/M/1 queue with arrival rate λ_F^I when the server is idle, arrival rate λ_F^B when it is busy, and service rate µ_F. Figure 2 depicts the Markov chain corresponding to this server.

The stationary probabilities for this Markov chain are:

π_{iF} = (λ_F^I / µ_F)(λ_F^B / µ_F)^{i−1} π_F, i ≥ 1.
Figure 2: The Markov chain tracking the number of jobs at a tagged fast server. State iF indicates that there are i jobs at the fast server (including the job in service, if there is one). From state 0F the chain moves up at rate λ_F^I; from each state iF with i ≥ 1 it moves up at rate λ_F^B and down at rate µ_F.

With the normalization equation, Σ_{i=0}^∞ π_{iF} = 1, this yields:

π_F = (µ_F − λ_F^B) / (µ_F − λ_F^B + λ_F^I).   (8)

Our approach for the slow server is similar, yielding:

π_{iS} = (λ_S^I / µ_S)(λ_S^B / µ_S)^{i−1} π_S, i ≥ 1

π_S = (µ_S − λ_S^B) / (µ_S − λ_S^B + λ_S^I).   (9)

We now have six equations (4, 5, 6, 7, 8, 9) to solve for six unknown variables (π_F, λ_F^I, λ_F^B, π_S, λ_S^I, λ_S^B), after which we will have obtained the full queue length distribution under JIQ-(d_F, d_S).

We are now ready to give an expression for mean response time as a function of the system parameters and the policy parameters p_F and p_S. Let E[N_F] and E[N_S] denote, respectively, the mean number of jobs at a fast server and at a slow server. We have:

E[N_F] = Σ_{i=0}^∞ i π_{iF} = π_F (λ_F^I / µ_F) Σ_{i=1}^∞ i (λ_F^B / µ_F)^{i−1} = λ_F^I µ_F / ((µ_F − λ_F^B)(µ_F − λ_F^B + λ_F^I))   (10)

E[N_S] = λ_S^I µ_S / ((µ_S − λ_S^B)(µ_S − λ_S^B + λ_S^I)).   (11)

Putting this together, the mean number of jobs in the system is:

E[N] = k_F E[N_F] + k_S E[N_S].   (12)

Finally, we apply Little’s Law to obtain the mean response time:

E[T] = (k_F E[N_F] + k_S E[N_S]) / (λk) = q_F λ_F^I µ_F / (λ(µ_F − λ_F^B)(µ_F − λ_F^B + λ_F^I)) + q_S λ_S^I µ_S / (λ(µ_S − λ_S^B)(µ_S − λ_S^B + λ_S^I)).   (13)

4.1.2 General Service Times

For general service times, our Markov chain approach no longer applies. Now, a job’s service time on a fast server (respectively, a slow server) is distributed like Y_F (Y_S), where Y_S ∼ r Y_F.
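The exponential-service analysis above is fully constructive: equations (4)–(9) can be solved by fixed-point iteration on the idle probabilities, after which (10)–(13) give E[T] directly. A Python sketch (ours; the function name and the example parameter values are illustrative, not the paper’s):

```python
# Solve the six equations (4)-(9) for JIQ-(d_F, d_S) by fixed-point iteration
# on the idle probabilities (pi_F, pi_S), then evaluate E[T] via (10)-(13).
from math import comb

def jiq_metrics(lam, mu_F, mu_S, q_F, q_S, d_F, d_S, p_F, p_S,
                tol=1e-13, max_iter=200_000):
    lamQ_F = lam * d_F / q_F                      # equation (3)
    lamQ_S = lam * d_S / q_S
    pi_F = pi_S = 0.5                             # initial guess
    for _ in range(max_iter):
        # (4): tagged idle fast server wins the uniform tie-break among
        # the idle fast servers queried by the arrival
        lamI_F = lamQ_F * sum(comb(d_F - 1, i) * pi_F**i * (1 - pi_F)**(d_F - 1 - i) / (i + 1)
                              for i in range(d_F))
        # (5): all other queried fast servers busy, job routed to a fast server
        lamB_F = lamQ_F * (1 - pi_F)**(d_F - 1) / d_F * (
            (1 - (1 - pi_S)**d_S) * (1 - p_S) + (1 - pi_S)**d_S * p_F)
        # (6)-(7): analogous rates for a tagged slow server
        lamI_S = lamQ_S * (1 - pi_F)**d_F * p_S * sum(
            comb(d_S - 1, i) * pi_S**i * (1 - pi_S)**(d_S - 1 - i) / (i + 1)
            for i in range(d_S))
        lamB_S = lamQ_S * (1 - pi_F)**d_F * (1 - pi_S)**(d_S - 1) / d_S * (1 - p_F)
        # (8)-(9): idle probabilities of the two tagged birth-death chains
        nF = (mu_F - lamB_F) / (mu_F - lamB_F + lamI_F)
        nS = (mu_S - lamB_S) / (mu_S - lamB_S + lamI_S)
        if abs(nF - pi_F) < tol and abs(nS - pi_S) < tol:
            pi_F, pi_S = nF, nS
            break
        pi_F, pi_S = 0.5 * (pi_F + nF), 0.5 * (pi_S + nS)   # damped update
    # (10)-(11): mean number of jobs at a tagged server; (13): Little's law
    EN_F = lamI_F * mu_F / ((mu_F - lamB_F) * (mu_F - lamB_F + lamI_F))
    EN_S = lamI_S * mu_S / ((mu_S - lamB_S) * (mu_S - lamB_S + lamI_S))
    return (q_F * EN_F + q_S * EN_S) / lam, pi_F, pi_S

# Example: q_F = 0.2, r = 10, capacity normalized to 1, lambda = 0.7.
q_F, q_S, r = 0.2, 0.8, 10.0
mu_S = 1.0 / (r * q_F + q_S); mu_F = r * mu_S
ET, pi_F, pi_S = jiq_metrics(0.7, mu_F, mu_S, q_F, q_S, 2, 2, 0.5, 0.5)
```

At the fixed point, per-server flow balance gives q_F µ_F (1 − π_F) + q_S µ_S (1 − π_S) = λ, which serves as a check that the iteration has converged to a consistent solution.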
Note that the servers exhibit heterogeneity in speed, but (as in the case of exponential service times) the coefficient of variation of the service times is the same across both server speeds.

To analyze this system, we observe that the dynamics of a busy fast server are identical to those of an M/G/1 system with arrival rate λ_F^B and service time distributed like Y_F. The only difference between these two systems is that they have different arrival rates when idle; this does not affect the response time distribution. Hence we can conclude that the response time distribution at a fast server under JIQ-(d_F, d_S) is the same as that of this M/G/1. A similar result holds for slow servers. The Pollaczek-Khinchine formula gives us:

E[T_F] = λ_F^B E[Y_F²] / (2(1 − λ_F^B E[Y_F])) + E[Y_F]

E[T_S] = λ_S^B E[Y_S²] / (2(1 − λ_S^B E[Y_S])) + E[Y_S].

Conditioning on whether an arriving job is dispatched to a fast or a slow server, we then obtain the system mean response time:

E[T] = (q_F λ_F / λ) ( λ_F^B E[Y_F²] / (2(1 − λ_F^B E[Y_F])) + E[Y_F] ) + (q_S λ_S / λ) ( λ_S^B E[Y_S²] / (2(1 − λ_S^B E[Y_S])) + E[Y_S] ),   (14)

where λ_F (λ_S) denotes the overall arrival rate to a fast (slow) server. This coincides with (13) when Y_F and Y_S are exponentially distributed.

The observation that a tagged fast server essentially behaves like an M/G/1 also allows us to adapt standard techniques, such as M/G/1 transform analysis, to derive queue length distributions and other system metrics (see Chapter 26 of [7]).

Having determined E[T] for fixed p_F and p_S, we can now optimize the JIQ-(d_F, d_S) policy by finding the optimal values of p_F and p_S. We will assume fixed d_F and d_S, but note that we could also optimize over d_F and d_S; only a small set of values for d_F and d_S are likely to be practical.

Equation (14) tells us that mean response time is linear in the second moments of Y_F and Y_S.
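The per-server expressions above are direct applications of the Pollaczek-Khinchine formula. A minimal sketch (ours) evaluates it; as a sanity check, for exponential service times (E[Y²] = 2 E[Y]²) it reduces to the familiar M/M/1 mean response time 1/(µ − λ):

```python
# Pollaczek-Khinchine mean response time for an M/G/1 queue with arrival
# rate lam_B, mean service time EY, and second moment EY2, as used in (14).

def pk_mean_response_time(lam_B, EY, EY2):
    assert lam_B * EY < 1, "the server must be stable"
    return lam_B * EY2 / (2 * (1 - lam_B * EY)) + EY

# Exponential service times with mu = 1 and lam_B = 0.5: EY = 1, EY2 = 2,
# and the formula reduces to 1/(mu - lam_B) = 2.
t = pk_mean_response_time(0.5, 1.0, 2.0)   # 2.0
```

The system-level E[T] in (14) is then just a convex combination of two such terms, weighted by the fractions of jobs dispatched to fast and slow servers.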
This means that, because Y_F and Y_S have the same coefficient of variation, the optimal values of p_F and p_S depend only on the mean service times E[Y_F] = 1/µ_F and E[Y_S] = 1/µ_S. This insensitivity property allows us to assume exponential service times without loss of generality when carrying out our optimization.

Our optimization problem is as follows:

minimize over p_F, p_S:  E[T]
subject to:  Equations (4), (5), (6), (7), (8), (9)
             0 ≤ π_F, π_S ≤ 1
             0 ≤ p_F, p_S ≤ 1,   (15)

where E[T] is given in (13). We provide an explicit formulation of this problem in the Appendix.

4.2 JSQ-(d_F, d_S)

While the difference between JIQ-(d_F, d_S) and JSQ-(d_F, d_S) may seem like only a minor policy modification, it necessitates a fundamentally different analytical approach. Imagine applying the tagged server approach used to analyze JIQ-(d_F, d_S) to JSQ-(d_F, d_S), and consider a tagged fast server under JSQ-(d_F, d_S). As under JIQ-(d_F, d_S), this server experiences a state-dependent arrival rate. Unlike under JIQ-(d_F, d_S), this arrival rate is different for every state, and it depends on the queue lengths of all other polled servers. Hence adopting the Markov chain-based approach we used for JIQ-(d_F, d_S) would require solving a highly complicated infinite system of equations.

Instead, our approach for analyzing JSQ-(d_F, d_S) will involve considering a tagged arrival to the system, again assuming that k → ∞ and that in this limiting regime, all servers have independent queue lengths. We condition on whether the tagged arrival runs on a fast or slow server and on whether or not it waits in the queue:

E[T] = E[T | run on idle fast] · P{run on idle fast} + E[T | run on idle slow] · P{run on idle slow}
     + E[T | queue at busy fast] · P{queue at busy fast} + E[T | queue at busy slow] · P{queue at busy slow}
     = (1/µ_F)(1 − ρ_F^{d_F}) + (1/µ_S) ρ_F^{d_F}(1 − ρ_S^{d_S}) p_S
     + E[T | queue at busy fast] · ρ_F^{d_F}(ρ_S^{d_S} p_F + (1 − ρ_S^{d_S})(1 − p_S))
     + E[T | queue at busy slow]
· ρ_F^{d_F} ρ_S^{d_S}(1 − p_F).   (16)

In line (16) we use the asymptotic independence assumption.

We next derive E[T | queue at busy fast]. Here, the job joins the shortest queue among the d_F polled fast servers, all of which are busy. In order to derive response time, we first need to determine the distribution of the number of jobs in a fast server’s queue.

Let n_i(t) denote the number of fast servers with at least i jobs at time t. Let f_i(t) = n_i(t)/k_F be the fraction of fast servers that have at least i jobs at time t. We note that f_0(t) = 1 for all t.

As in [15], we consider a limiting system, where k → ∞ and the system exhibits deterministic steady-state behavior with df_i(t)/dt = 0 for all i ≥
0. This setting lets us describe our system’s evolution through a system of differential equations wherein all f_i(t) functions are constant (henceforth we write f_i rather than f_i(t)).

We formulate the differential equations by considering the expected change in the number of fast servers with at least i > 0 jobs during a small interval of time dt. This number will increase if an arriving job joins the queue at a fast server with exactly i − 1 jobs. Jobs arrive at rate λk; with probability f_{i−1}^{d_F} − f_i^{d_F}, all d_F of the polled fast servers have at least i − 1 jobs but not all d_F have at least i jobs (that is, the shortest queue among the d_F polled fast servers contains exactly i − 1 jobs). The arriving job will join the length-(i −
1) queue if either (1) there is an idle slow server among the d_S polled slow servers (probability 1 − ρ_S^{d_S}) and the job is assigned to join the queue at a fast server (probability 1 − p_S), or (2) there are no idle slow servers among the d_S polled slow servers (probability ρ_S^{d_S}) and the job is assigned to join the queue at a fast server (probability p_F). The number of queues with at least i > 0 jobs will decrease if a fast server with exactly i jobs completes a job. This happens with rate µ_F k_F (f_i − f_{i+1}). Putting this together, we have, for i > 1:

dn_i/dt = λk (f_{i−1}^{d_F} − f_i^{d_F}) [ (1 − ρ_S^{d_S})(1 − p_S) + ρ_S^{d_S} p_F ] − µ_F k_F (f_i − f_{i+1}).

The case where i = 1 is similar, except here an arriving job that finds a fast server with i − 1 = 0 jobs (that is, an idle fast server) among those polled always joins a fast server. For i = 1 we have:

dn_1/dt = λk (f_0^{d_F} − f_1^{d_F}) − µ_F k_F (f_1 − f_2).

Dividing by k_F gives a system of equations for the f_i terms:

df_i/dt = (λ/q_F) (f_{i−1}^{d_F} − f_i^{d_F}) [ (1 − ρ_S^{d_S})(1 − p_S) + ρ_S^{d_S} p_F ] − µ_F (f_i − f_{i+1})   (17)

df_1/dt = (λ/q_F) (1 − f_1^{d_F}) − µ_F (f_1 − f_2),   (18)

recalling that q_F = k_F/k is the fraction of servers that are fast and that f_0 = 1. We further note that f_1 is the fraction of fast servers that are busy; using our asymptotic independence assumption, we have f_1 = ρ_F. We now set df_i/dt = 0 for all i and solve for the f_i terms.

Once we have the f_i terms, we can find E[T | queue at busy fast] by conditioning on the queue length seen by an arriving job:

E[T | queue at busy fast] = Σ_{i=1}^∞ P{job joins queue with i jobs | queue at busy fast} · (i + 1)/µ_F = (1/µ_F) Σ_{i=1}^∞ (i + 1) · (f_i^{d_F} − f_{i+1}^{d_F}) / f_1^{d_F}.   (19)

Note that the probability that a job joins a queue with i jobs is not the same as the probability that a server has i jobs in its queue. Our approach to find E[T | queue at busy slow] is similar.
Let s_i(t) denote the fraction of slow servers with at least i jobs at time t (we will write s_i when the meaning is clear). We obtain the following system of differential equations for the s_i terms:

ds_i/dt = (λ/q_S) (s_{i−1}^{d_S} − s_i^{d_S}) ρ_F^{d_F} (1 − p_F) − µ_S (s_i − s_{i+1})   (20)

ds_1/dt = (λ/q_S) (1 − s_1^{d_S}) ρ_F^{d_F} p_S − µ_S (s_1 − s_2),   (21)

where we note that s_0(t) = 1 for all t. Again, setting ds_i/dt = 0 for all i allows us to solve for a fixed point for the s_i terms.

As with the fast servers, we now find

E[T | queue at busy slow] = (1/µ_S) Σ_{i=1}^∞ (i + 1) · (s_i^{d_S} − s_{i+1}^{d_S}) / s_1^{d_S}.   (22)

The overall system mean response time results from combining (16), (19), and (22).

As under JIQ-(d_F, d_S), we now find the values of p_F and p_S that minimize mean response time under JSQ-(d_F, d_S) (assuming d_F and d_S are fixed). Our optimization problem is as follows:

minimize over p_F, p_S:  E[T]
subject to:  Equations (1), (2)
             df_i/dt = ds_i/dt = 0, i ≥ 1
             f_0 = s_0 = 1
             f_1 = ρ_F, s_1 = ρ_S
             0 ≤ ρ_F, ρ_S ≤ 1
             0 ≤ p_F, p_S ≤ 1,   (23)

where E[T] is given in (16), (19), and (22), and df_i/dt, ds_i/dt are given in (17), (18), (20), and (21). We provide an explicit formulation of this problem in the Appendix.

4.3 Stability

One of the significant downsides to heterogeneity-unaware dispatching policies such as JSQ-d and SED-d is that they can become unstable under certain system parameters—for example, when q_F is low and the fast servers are significantly faster than the slow servers. In Theorem 1, we show that JIQ-(d_F, d_S) and JSQ-(d_F, d_S) do not suffer this downside: instead, our policies remain stable as long as λ <
1, thereby achieving the maximum possible stability region.
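Before turning to the stability results, we note that the JSQ-(d_F, d_S) analysis above is also straightforward to evaluate numerically: solve (1)–(2) for ρ_F and ρ_S, compute the fixed points of (17)–(18) and (20)–(21) by forward recursion, and combine (16), (19), and (22). A Python sketch (ours; example parameter values only):

```python
# Evaluate E[T] under JSQ-(d_F, d_S): solve (1)-(2) for rho_F and rho_S,
# compute the tail-fraction fixed points of (17)-(18) and (20)-(21) by
# forward recursion, and combine (16), (19), and (22).

def jsq_mean_response_time(lam, mu_F, mu_S, q_F, q_S, d_F, d_S, p_F, p_S):
    # Step 1: busy probabilities from (1)-(2), by damped fixed-point iteration.
    rho_F = rho_S = lam
    for _ in range(200_000):
        a, b = rho_F ** d_F, rho_S ** d_S
        nF = (lam / (mu_F * q_F)) * ((1 - a) + a * (1 - b) * (1 - p_S) + a * b * p_F)
        nS = (lam / (mu_S * q_S)) * (a * (1 - b) * p_S + a * b * (1 - p_F))
        nF, nS = min(nF, 1.0), min(nS, 1.0)
        if abs(nF - rho_F) < 1e-13 and abs(nS - rho_S) < 1e-13:
            break
        rho_F, rho_S = 0.5 * (rho_F + nF), 0.5 * (rho_S + nS)
    a, b = rho_F ** d_F, rho_S ** d_S

    def tail(first, c1, c_rest, d, mu):
        # Fixed point of the tail fractions: level 1 uses rate c1 (eqs. (18)/(21)),
        # deeper levels use rate c_rest (eqs. (17)/(20)).
        x = [1.0, first]
        while len(x) < 400:
            i = len(x) - 1
            if i == 1:
                nxt = x[1] - c1 * (1 - x[1] ** d) / mu
            else:
                nxt = x[i] - c_rest * (x[i - 1] ** d - x[i] ** d) / mu
            nxt = max(nxt, 0.0)
            x.append(nxt)
            if nxt < 1e-12 or nxt >= x[-2]:
                break   # tail exhausted (or stalled at floating-point noise)
        x.append(0.0)
        return x

    f = tail(rho_F, lam / q_F, (lam / q_F) * ((1 - b) * (1 - p_S) + b * p_F), d_F, mu_F)
    s = tail(rho_S, (lam / q_S) * a * p_S, (lam / q_S) * a * (1 - p_F), d_S, mu_S)

    # (19) and (22): conditional mean response times for jobs that must queue.
    ET_qF = sum((i + 1) * (f[i] ** d_F - f[i + 1] ** d_F)
                for i in range(1, len(f) - 1)) / (mu_F * f[1] ** d_F)
    ET_qS = sum((i + 1) * (s[i] ** d_S - s[i + 1] ** d_S)
                for i in range(1, len(s) - 1)) / (mu_S * s[1] ** d_S)

    # (16): condition on where the tagged arrival is dispatched.
    return ((1 - a) / mu_F
            + a * (1 - b) * p_S / mu_S
            + ET_qF * a * (b * p_F + (1 - b) * (1 - p_S))
            + ET_qS * a * b * (1 - p_F))

# Example: q_F = 0.2, r = 10, capacity normalized to 1.
q_F, q_S, r = 0.2, 0.8, 10.0
mu_S = 1.0 / (r * q_F + q_S); mu_F = r * mu_S
et_lo = jsq_mean_response_time(0.50, mu_F, mu_S, q_F, q_S, 2, 2, 0.5, 0.5)
et_hi = jsq_mean_response_time(0.85, mu_F, mu_S, q_F, q_S, 2, 2, 0.5, 0.5)
```

Because (1) guarantees that the decaying solution of the recursion is consistent with f_1 = ρ_F, the forward recursion collapses doubly exponentially and only a handful of levels carry probability mass.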
Theorem 1.
Under both JIQ-(d_F, d_S) and JSQ-(d_F, d_S) with optimal choices of p_F and p_S, the system is stable for λ < µ_F q_F + µ_S q_S = 1, for any values of d_F, d_S ≥ 1.

Proof. We begin by showing that the system is stable under JIQ-(d_F, d_S) when p_S = 1 and p_F = µ_F q_F, for all d_F, d_S ≥
1. The system’s stability is affected by the arrival rates to busy fast servers and to busy slow servers. The arrival rate to an individual busy fast server, denoted λ_F^B (while we use the same notation as earlier in the section, note that here we do not assume that k → ∞), is:

λ_F^B = λk · (C(k_F − 1, d_F − 1) / C(k_F, d_F)) · P{all other queried fast servers busy} · (1/d_F) · ( P{all queried slow servers busy} p_F + P{not all queried slow servers busy} (1 − p_S) ),

which is at most λ p_F / q_F because p_S = 1, P{all other queried fast servers busy} ≤
1, and P{all queried slow servers busy} ≤
1. Let p_F = µ_F q_F. Then we have λ_F^B ≤ (λ/q_F) p_F = λ µ_F < µ_F, ensuring the stability of the fast servers, if λ < 1. We next consider λ_S^B. We have:

λ_S^B = λk · (C(k_S − 1, d_S − 1) / C(k_S, d_S)) · P{all other queried slow servers busy} · P{all queried fast servers busy} · (1 − p_F) · (1/d_S),

which is at most λ(1 − p_F)/q_S because P{all other queried slow servers busy} ≤ 1 and P{all queried fast servers busy} ≤
1. Again, let p_F = µ_F q_F. Then we have λ_S^B ≤ (λ/q_S)(1 − p_F) = (λ/q_S) µ_S q_S = λ µ_S, which is less than µ_S, ensuring the stability of the slow servers, if λ < 1. Hence JIQ-(d_F, d_S) is stable for p_S = 1, p_F = µ_F q_F. We obtain the same stability result for JSQ-(d_F, d_S) by observing that joining the shortest queue among d_F fast servers (or among d_S slow servers) instead of routing randomly to one of those d_F fast servers (d_S slow servers) cannot change the stability region. Finally, optimizing over all possible choices of p_F and p_S cannot decrease the stability region.

Theorem 1 tells us that there always exist settings for p_S and p_F for which the system is stable; in Theorem 2 we identify more specific necessary and sufficient conditions for stability as λ → 1.

Theorem 2. As λ → 1, the system is unstable if p_F ≠ µ_F q_F, and the system is stable if p_F = µ_F q_F and p_S ≥ µ_S q_S.

Proof. We first show that the system is stable if p_F = µ_F q_F and p_S ≥ µ_S q_S. We begin by considering an arbitrary tagged fast server. Note that the arrival rate to the tagged server when it is idle does not affect the stability region of that server. The arrival rate to a tagged busy fast server is

λ_F^B = λk · (C(k_F − 1, d_F − 1) / C(k_F, d_F)) · P{all other queried fast servers busy} · (1/d_F) · ( P{all queried slow servers busy} p_F + P{not all queried slow servers busy} (1 − p_S) ).   (24)

We have P{all other queried fast servers busy} ≤ 1, p_F = µ_F q_F, and p_S ≥ µ_S q_S, so 1 − p_S ≤ 1 − µ_S q_S = µ_F q_F. Applying these bounds to (24) we obtain

λ_F^B ≤ (λ/q_F) ( P{all queried slow servers busy} µ_F q_F + P{not all queried slow servers busy} µ_F q_F ) = λ µ_F,

which is less than µ_F, ensuring the stability of the tagged server—and hence, of all fast servers—if λ <
1, it mustalso be the case that P { tagged fast server busy } →
1. Thus an arriving job is likely to query d F busyservers: P { all queried fast servers busy } →
1. Let P { all queried fast servers busy } = 1 − (cid:15) for some small (cid:15) >
0, where (cid:15) → λ →
1. The total arrival rate to all slow servers is then λk (1 − (cid:15) ). Consider anarbitrary tagged slow server, and note that, as for the fast servers, the arrival rate to a slow server when itis idle does not affect its stability region. For a tagged busy slow server, we have λ BS = λd S q S (1 − (cid:15) ) · P (cid:110) all other queriedslow servers busy (cid:111) · (1 − p F ) · d S . We have P { all other queried slow servers busy } ≤ p F = µ F q F , so 1 − p F = 1 − µ F q F = µ S q S , whichgives λ BS ≤ λµ S (1 − (cid:15) ) . This is less than µ S , ensuring stability of the tagged slow server—and hence, of all slow servers—if λ < k under both JIQ-( d F , d S ) andJSQ-( d F , d S ). Here q F = 0 . r = 10, d F = d S = 2, and p F and p S are optimized separately for each policyfamily.We now turn to the second part of the result: that the system is unstable when p F (cid:54) = µ F q F (for anychoice of p S ). The argument hinges on the observation that the maximum throughput of the system is k ( µ F q F + µ S q S ) = k (because µ F q F + µ S q S = 1). In order for the system to be stable as λ → k , it must therefore be the case that the probability that all serversare busy approaches 1; if some servers were idle with probability (cid:15) >
0, then the maximum possible systemthroughput would be less than the arrival rate and the system would be unstable.With this observation in mind, we first consider the case where p F > µ F q F . Recall from Theorem 1 thearrival rate to an individual busy fast server: λ BF = λk (cid:0) k F − d F − (cid:1)(cid:0) k F d F (cid:1) P (cid:110) all other queriedfast servers busy (cid:111) d F · (cid:16) P (cid:110) all queriedslow servers busy (cid:111) p F + P (cid:110) not all queriedslow servers busy (cid:111) (1 − p S ) (cid:17) = λq F P (cid:110) all other queriedfast servers busy (cid:111) · (cid:16) P (cid:110) all queriedslow servers busy (cid:111) p F + P (cid:110) not all queriedslow servers busy (cid:111) (1 − p S ) (cid:17) . Assuming that P (cid:110) all other queriedfast servers busy (cid:111) → P (cid:110) all queriedslow servers busy (cid:111) → λ → λ BF → p F /q F , which is less than µ F if p F < µ F q F ; this contradicts our assumptionthat p F > µ F q F , hence the system is unstable in this case. The case where p F < µ F q F is similar.It is possible that the system also remains stable for a wider range of values for p S >
0, but identifyingthe full stability region remains an open problem.
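Throughout, the server speeds are pinned down by q_F and r via the normalization µ_F q_F + µ_S q_S = 1, so the stabilizing routing probability p_F = µ_F q_F of Theorem 2 is easy to compute. A minimal sketch (the helper names are ours, and the example parameters are illustrative):

```python
from fractions import Fraction

def server_speeds(q_f, r):
    """Given the fast-server fraction q_F and speed ratio r = mu_F / mu_S,
    return (mu_F, mu_S) under the normalization mu_F*q_F + mu_S*(1 - q_F) = 1."""
    mu_s = Fraction(1) / (r * q_f + (1 - q_f))
    return r * mu_s, mu_s

def stabilizing_p_f(q_f, r):
    """The unique p_F that keeps the system stable as lambda -> 1
    (Theorem 2): p_F = mu_F * q_F."""
    mu_f, _ = server_speeds(q_f, r)
    return mu_f * q_f
```

For instance, q_F = 1/5 and r = 5 give p_F = 5/9 ≈ 0.556 and µ_S q_S = 4/9 ≈ 0.444.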
5 Numerical Results

In this section we present a numerical study to evaluate performance under the JIQ-(d_F, d_S) and JSQ-(d_F, d_S) policy families. For each set of system parameters considered, we report results for the optimal policy within each family, i.e., p_F and p_S are chosen to minimize mean response time, as discussed in Sections 4.1.3 and 4.2.1. We consider different levels of server heterogeneity by varying two parameters: q_F (the fraction of servers that are fast) and r ≡ µ_F/µ_S (the speed ratio). Unless otherwise specified, we set d_F = d_S = 2.

5.1 Accuracy

Our analyses for both JIQ-(d_F, d_S) (Section 4.1) and JSQ-(d_F, d_S) (Section 4.2) are approximate because they assume that the server states are independent as the number of servers k → ∞. We evaluate the accuracy of our approximations by comparing our analytical results to simulation (see Figure 3). As k increases, our analytical results for mean response time under both policies become increasingly accurate. By k = 500, the analytical and simulation results are indistinguishable. We obtained similar results for other system parameter settings.

5.2 Mean Response Time
Figure 4: Mean response time as a function of λ under JIQ-(2,2), JSQ-(2,2), JSQ-4, SED-4, WJSQ-4, and JIQ. Left to right: q_F = 0.2, q_F = 0.5, q_F = 0.8. Top to bottom: r = 1.1, r = 2, r = 5, r = 10.

Figure 4 compares mean response time under JIQ-(d_F, d_S) and JSQ-(d_F, d_S) to that under four other policies (results for our policies are analytical, while results for the following policies are simulated):

• Under JSQ-d, the dispatcher queries d servers uniformly at random and sends the job to the server among those d with the shortest queue.
• Under SED-d, the dispatcher queries d servers uniformly at random and sends the job to the server among those d at which it has the shortest expected delay.
• Under WJSQ-d (the W stands for "Weighted"), the dispatcher queries d servers, where the probability that a server is queried is proportional to that server's speed, and sends the job to the server among those d with the shortest queue.
• Under JIQ, the dispatcher sends the job to an idle server if there is one, and to a busy server chosen uniformly at random otherwise.

We note that JSQ-d and JIQ are heterogeneity-unaware, SED-d only uses heterogeneity information when dispatching, and WJSQ-d only uses heterogeneity information when querying. Unlike the other five policies that we consider, JIQ is not a "power of d" policy; we include it here as a point of comparison because it is known to minimize the probability that an arriving job waits in the queue [25].

When there is little difference in speed between fast and slow servers (r = 1.1, top row of Figure 4), JSQ-d and SED-d perform similarly to each other, and both outperform our policies at high load. This is because when all servers are similar in speed, providing more flexibility when selecting among queried servers offers a greater advantage than ensuring that some fast servers are queried. But in systems with more pronounced heterogeneity, JSQ-d and SED-d cannot maintain their good performance. As r increases, JSQ-d suffers significantly: here it is a serious shortcoming to make dispatching decisions based only on queue lengths. SED-d corrects for this problem by scaling queue lengths in proportion to server speeds. Yet when r is high and q_F is low, both JSQ-d and SED-d can lead to instability. In this regime, much of the system's capacity belongs to the fast servers, but an arriving job may not query any fast servers because JSQ-d and SED-d use uniform querying (e.g., when q_F = 0.
2, only about 40% of jobs query a fast server). This causes the slow servers to become overloaded. WJSQ-d avoids instability in this regime by ensuring that faster servers are more likely to be queried and thus sent a job. However, performance under WJSQ-d still suffers at low load; here all queue lengths are relatively short, so WJSQ-d effectively ignores server speeds when dispatching. Our policies remain stable and achieve better performance by differentiating between fast and slow servers both when querying and when choosing where to dispatch among the queried servers. At low load, JIQ-(d_F, d_S) and JSQ-(d_F, d_S) perform similarly to each other, and both outperform SED-d, JSQ-d, and WJSQ-d. As r increases, the gap between our policies and JSQ-d becomes particularly pronounced: JSQ-d frequently sends jobs to slow servers even when there are idle fast servers, whereas our policies are more likely to find and select an idle fast server. Indeed, our policies effectively throw out the slow servers when load is sufficiently low or r is sufficiently high. At high load, too, our policies perform competitively with or better than JSQ-d, SED-d, and WJSQ-d. Most notably, while JSQ-d and SED-d have a reduced stability region when q_F is low and r is high, both JIQ-(d_F, d_S) and JSQ-(d_F, d_S) are guaranteed to be stable provided λ < µ_F q_F + µ_S q_S, as shown in Theorem 1.

Unsurprisingly, JSQ-(d_F, d_S) always outperforms JIQ-(d_F, d_S). This makes sense: when using the same p_F and p_S values, the only difference between the two policies is that the JSQ version makes a better dispatching decision when choosing among busy servers. Note that the results in Figure 4 do not necessarily have the same values of p_F and p_S for JSQ-(d_F, d_S) and JIQ-(d_F, d_S) because both policy families are optimized over the parameters.
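The querying and dispatching rules of the comparison policies above can be sketched as follows. This is an illustrative implementation, not the authors' simulation code; the server-state inputs (`mus` for speeds, `qlen` for queue lengths) are hypothetical:

```python
import random

def query_uniform(n, d):
    """JSQ-d / SED-d querying: d distinct servers chosen uniformly at random."""
    return random.sample(range(n), d)

def query_weighted(mus, d):
    """WJSQ-d querying: d distinct servers, each draw made with probability
    proportional to server speed (simple sequential weighted sampling)."""
    pool = list(range(len(mus)))
    chosen = []
    for _ in range(d):
        pick = random.choices(pool, weights=[mus[i] for i in pool])[0]
        pool.remove(pick)
        chosen.append(pick)
    return chosen

def dispatch_jsq(queried, qlen):
    """JSQ dispatching: shortest queue among the queried servers."""
    return min(queried, key=lambda i: qlen[i])

def dispatch_sed(queried, qlen, mus):
    """SED dispatching: smallest expected delay (n + 1) / mu."""
    return min(queried, key=lambda i: (qlen[i] + 1) / mus[i])
```

With queues [3, 1] and speeds [10, 1], JSQ picks the short queue (server 1) while SED picks the fast server (server 0), illustrating why SED tolerates heterogeneity better than JSQ.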
Even though JSQ-(d_F, d_S) is guaranteed to achieve lower mean response time than JIQ-(d_F, d_S), the two policies perform similarly until λ becomes high. At this point JSQ-(d_F, d_S)'s advantage becomes more apparent, as this is when queues actually build up. Under both JIQ-(d_F, d_S) and JSQ-(d_F, d_S), mean response time appears to be non-convex in λ. This surprising result is due to our optimization over p_F and p_S. For any fixed p_F and p_S, mean response time is convex in λ, and indeed the convex regions in the plots in Figure 4 occur when p_F and p_S do not change (for example, when λ is relatively low it is optimal to set p_S = 0, i.e., to never use the slow servers). The non-convex regions appear when either p_F or p_S is varying between 0 and 1.

We also compare our policies to JIQ, which uses queue length information from all servers, not just a subset of d servers. At high load, JIQ outperforms all of the "power of d" policies; this is unsurprising given that JIQ will always find an idle server if there is one. But at low load and high r, JIQ yields a substantially higher mean response time than our policies. This is because, like JSQ-d and WJSQ-d, JIQ does not use server speed information to break ties between idle servers. That our policies outperform JIQ may seem surprising in light of the fact that JIQ is delay optimal [25]; we explore this result further in Section 5.3.

5.3 Queue Length Distributions

In this section we look at the queue length distributions under JIQ-(d_F, d_S), JSQ-(d_F, d_S), and JIQ in more detail to gain insight as to why our policies can outperform JIQ in terms of response time, even though they lack JIQ's queue length optimality property.
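For intuition about these distributions: in the mean-field analysis each server behaves like an M/M/1 queue that sees one Poisson arrival rate while idle (λ_I) and another while busy (λ_B), and the resulting stationary queue-length distribution is geometric beyond the idle state, with idle probability π = (µ − λ_B)/(µ − λ_B + λ_I). A sketch under that modeling assumption (the rate values in the test are made up):

```python
def queue_length_dist(lam_idle, lam_busy, mu, nmax=200):
    """Stationary queue-length distribution of a single server receiving
    Poisson arrivals at rate lam_idle while idle and lam_busy (< mu) while
    busy, with exp(mu) service:
      P{N=0} = pi,  P{N=n} = pi * (lam_idle/mu) * (lam_busy/mu)**(n-1).
    Returned as a list truncated at nmax."""
    pi = (mu - lam_busy) / (mu - lam_busy + lam_idle)
    dist = [pi]
    for n in range(1, nmax + 1):
        dist.append(pi * (lam_idle / mu) * (lam_busy / mu) ** (n - 1))
    return dist
```

The balance equations µ p_1 = λ_I p_0 and µ p_(n+1) = λ_B p_n give this form directly, and normalization recovers exactly the π expression used in the optimization formulation (15).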
Figure 5: Comparing the queue length distribution under JSQ-(2,2), JIQ-(2,2), and JIQ for fast servers (top row) and slow servers (bottom row) when q_F = 0.5. (a) r = 1.1, λ = 0.5; (b) r = 5, λ = 0.8; (c) r = 10, λ = 0.2.

Figure 5 compares the queue length distribution under JIQ-(d_F, d_S), JSQ-(d_F, d_S), and JIQ for both fast servers (top row) and slow servers (bottom row) in three settings selected from those featured in Figure 4. At left, we show a case in which all three policies have similar mean response times; in this case the queue length distributions are also similar. The center column shows a case in which JIQ yields lower mean response time than our policies: in this case r = 5 and λ = 0.
8. Because λ is high, few slow servers are idle, but both our policies and JIQ prevent queues from building up at the slow servers. The key difference between the policies lies in what happens at the fast servers. Under our policies, the optimal value of p_F in this setting is 1, meaning that a job will never choose to wait in the queue at a slow server. This means that many jobs are deferred back to the (busy) fast servers, causing the queue lengths to increase. JIQ prevents the queue lengths at the fast servers from growing. A slightly greater proportion of jobs run on slow servers under JIQ, but the jobs that run on fast servers do not have to wait in the queue. When λ is high, this tradeoff favors JIQ.

In contrast, when λ is low the same tradeoff favors JIQ-(d_F, d_S) and JSQ-(d_F, d_S), as shown in the right column of Figure 5, where r = 10 and λ = 0.2. Again, under JIQ a higher proportion of slow servers are busy because JIQ does not differentiate between fast and slow idle servers. Indeed, there are no busy slow servers under JIQ-(d_F, d_S) and JSQ-(d_F, d_S) because the combination of high r and low λ means that the optimal value of p_S is 0: it is best not to use any of the slow servers at all. As a result, the fast servers have a slightly lower probability of being idle under our policies than under JIQ. However, because λ is low the queue lengths under JIQ-(d_F, d_S) and JSQ-(d_F, d_S) remain short. In this case, JIQ's decision to prioritize server idleness over server speed works against it, and our policies achieve lower mean response time.

5.4 The Effect of d

One of the primary selling points of policies like JSQ-d, SED-d, and WJSQ-d is the "power of two choices": often, there is a large benefit in going from d = 1 (i.e., random routing) to d = 2, but a much smaller marginal benefit in further increasing d. Consequently, JSQ-2 is the most commonly considered variant of JSQ-d. Our JIQ-(d_F, d_S) and JSQ-(d_F, d_S) policies query fast and slow servers separately; while setting d_F = d_S = 1 offers two choices in total, it does not offer a choice within each speed. Therefore, JIQ-(1,1) and JSQ-(1,1) are equivalent: once the dispatcher has chosen to send the job to a fast (or slow) server, there is only one queried server of that speed to which the job can go.
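The value of structured querying can be quantified directly: under uniform querying, the probability that a job queries no fast server at all is (1 − q_F)^d, whereas JIQ-(1,1) and JSQ-(1,1) query exactly one fast server by construction. A quick check (the q_F and d values here are illustrative):

```python
def p_no_fast_queried(q_f, d):
    """Probability that uniform querying of d servers (with k large, so
    draws are effectively independent) includes no fast server."""
    return (1 - q_f) ** d
```

With q_F = 0.2 and d = 2, a job queries at least one fast server with probability only 1 − 0.64 = 0.36, roughly the 40% figure cited in Section 5.2; even d = 4 leaves about a 41% chance of missing every fast server.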
Figure 6: Mean response time as a function of λ under JIQ-(1,1), JSQ-2, SED-2, and WJSQ-2 when r = 5. (a) q_F = 0.2, (b) q_F = 0.

JIQ-(1,1) is competitive with these policies, especially when q_F is low (see Figure 6). As we have seen, both JSQ-2 and SED-2 can cause instability when q_F is low and r is high, whereas JIQ-(1,1) guarantees that the system will remain stable.

In Figure 7 we consider the effect of varying d = d_F + d_S on the performance of JIQ-(d_F, d_S) and JSQ-(d_F, d_S): does the marginal benefit of increasing d decrease as d gets larger? When d = 1, we interpret our policies to collapse the querying and dispatching decision points into a single probabilistic choice: we dispatch to a random fast server with probability p_F and to a slow server otherwise. For all other values of d, we choose the optimal combination of d_F, d_S, p_F, and p_S such that d_F + d_S = d. As under JSQ-d and SED-d, the steepest drop in mean response time comes from going from d = 1 to d = 2, and mean response time is convex in d. When the fast and slow servers are similar in speed (Figure 7(a)), JSQ-d and SED-d perform slightly better at low d, and all policies have similar performance at high d. When r is high and q_F is low (Figure 7(b)), JIQ-(d_F, d_S) and JSQ-(d_F, d_S) are stable at all values of d, and outperform JSQ-d and SED-d even when d is high enough for the latter two policies to be stable.

5.5 Choosing p_F and p_S

A key part of defining the JIQ-(d_F, d_S) and JSQ-(d_F, d_S) policies involves choosing values for p_F and p_S; in Sections 4.1.3 and 4.2.1 we do this by finding the values of p_F and p_S that minimize mean response time. Figure 8 shows mean response time under JSQ-(d_F, d_S) as a function of p_F and p_S for two different parameter settings (results for JIQ-(d_F, d_S) are similar). When λ is low to moderate (Figure 8(a)), mean response time is relatively insensitive to the particular parameter choices, provided that p_S is high enough to ensure stability.
When λ is high (Figure 8(b)), it becomes more important to choose the right p_F and p_S: even small variations in p_F and p_S can lead to substantial changes in response time, and there is a smaller set of p_F and p_S values for which the system is stable.

The extreme sensitivity to p_F and p_S occurs only at very high λ; at most parameter settings the optimal values of p_F and p_S fall into one of a few cases. If the fast servers comprise a sufficiently high fraction of the total system capacity or if the system load is very low, it is best to set p_S = 0. If the fast and slow servers are relatively similar in speed or if the system load is sufficiently high, it is best to set p_S = 1. As we showed in Theorem 2, as λ → 1, p_F = µ_F q_F is the only value of p_F for which the system is stable.

Motivated by these observations, we propose a heuristic for choosing appropriate values of p_F and p_S. Instead of optimizing over the entire parameter space for p_F and p_S, which can be computationally expensive, we consider the following parameter settings:

• p_S = 0. Note that in this case the slow servers are never used, so the choice of p_F does not matter.
• All combinations of p_S ∈ {µ_S q_S, 1} and p_F ∈ {0, µ_F q_F, 1}.

For each setting of λ, q_F, and r, this gives us only seven policies to compare; we select the p_F and p_S that yield the best performance among these seven alternatives.

Figure 7: Effect of varying d on mean response time under JIQ-(d_F, d_S), JSQ-(d_F, d_S), JSQ-d, SED-d, and WJSQ-d when λ = 0.8. (a) q_F = 0., r = 1.1. (b) q_F = 0., r = 10. The tables at right show the optimal choices of (d_F, d_S) for each d:

(a) q_F = 0., r = 1.1
d   JIQ-(d_F, d_S)   JSQ-(d_F, d_S)
2   (1,1)            (1,1)
3   (1,2)            (2,1)
4   (2,2)            (2,2)
5   (2,3)            (2,3)
6   (3,3)            (3,3)
7   (3,4)            (3,4)
8   (3,5)            (3,5)

(b) q_F = 0., r = 10
d   JIQ-(d_F, d_S)   JSQ-(d_F, d_S)
2   (1,1)            (1,1)
3   (2,1)            (2,1)
4   (3,1)            (3,1)
5   (4,1)            (4,1)
6   (5,1)            (5,1)
7   (6,1)            (6,1)
8   (7,1)            (7,1)

Figure 8: Mean response time under JSQ-(d_F, d_S) as a function of p_F and p_S. (a) q_F = 0., r = 5, λ = 0.56; (b) q_F = 0., r = 2, λ = 0.95. The red circle indicates the optimal E[T].

Table 1 shows our results for JIQ-(d_F, d_S) and JSQ-(d_F, d_S); each row shows a different value of λ, for a system with q_F = 0.2, r = 5. Under both policies, when λ is low it is optimal to set p_S = 0, and our heuristic correctly selects this policy. As λ starts to increase, it becomes optimal to increase p_S continuously and set p_F = 1. Our heuristic sets p_F = 1 and changes p_S in discrete steps from 0 to µ_S q_S to 1; because λ is still relatively low, mean response time is relatively insensitive to selecting a slightly suboptimal value of p_S and our heuristic has low error.

JIQ-(2,2)
λ      p*_F    p*_S    E[T_opt]   p_F^heur   p_S^heur   E[T_heur]   % error
0.14   any     0       0.384      any        0          0.384       0
0.24   any     0       0.443      any        0          0.443       0
0.34   0.999   0.018   0.575      any        0          0.576       0.023
0.44   1       0.426   0.742      1          0.444      0.743       0.014
0.54   1       0.723   0.868      1          1          0.879       1.196
0.64   1       1       0.967      1          1          0.967       0
0.74   1       1       1.101      1          1          1.101       0
0.84   0.877   1       1.547      1          1          1.605       3.732
0.90   0.714   1       2.331      0.555      1          2.908       24.754
0.98   0.579   1       10.677     0.555      1          12.837      20.231

JSQ-(2,2)
λ      p*_F    p*_S    E[T_opt]   p_F^heur   p_S^heur   E[T_heur]   % error
0.14   any     0       0.383      any        0          0.383       0
0.24   any     0       0.429      any        0          0.429       0
0.34   any     0       0.514      any        0          0.514       0
0.44   1       0.103   0.677      any        0          0.689       1.693
0.54   1       0.405   0.832      1          0.444      0.833       0.066
0.64   1       0.722   0.946      1          1          0.954       0.762
0.74   1       1       1.039      1          1          1.039       0
0.84   1       1       1.217      1          1          1.217       0
0.90   0.839   1       1.595      1          1          1.957       22.697
0.98   0.597   1       3.243      0.555      1          3.659       12.804

Table 1: Comparison of optimal p_F and p_S to best heuristic under JIQ-(2,2) (top) and JSQ-(2,2) (bottom). Here q_F = 0.2, r = 5. The columns p*_F and p*_S give the optimal values of p_F and p_S, while p_F^heur and p_S^heur are the values chosen by the heuristic.
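The seven-candidate heuristic can be sketched as follows; the `evaluate` argument is a hypothetical callback standing in for the mean response time computations of Sections 4.1.3 and 4.2.1:

```python
def heuristic_candidates(q_f, r):
    """The seven (p_F, p_S) candidate settings described above. p_F = None
    encodes 'any' (irrelevant because p_S = 0 disables the slow servers)."""
    mu_s = 1.0 / (r * q_f + (1.0 - q_f))    # normalization mu_F*q_F + mu_S*q_S = 1
    mu_f = r * mu_s
    cands = [(None, 0.0)]                   # p_S = 0: never use slow servers
    for p_s in (mu_s * (1.0 - q_f), 1.0):   # p_S in {mu_S*q_S, 1}
        for p_f in (0.0, mu_f * q_f, 1.0):  # p_F in {0, mu_F*q_F, 1}
            cands.append((p_f, p_s))
    return cands

def pick_heuristic(q_f, r, evaluate):
    """Select the candidate minimizing evaluate(p_f, p_s), a stand-in for
    the mean response time formula."""
    return min(heuristic_candidates(q_f, r), key=lambda c: evaluate(*c))
```

With q_F = 0.2 and r = 5, the candidate grid includes p_F = µ_F q_F ≈ 0.555 and p_S = µ_S q_S ≈ 0.444, the values that appear in the heuristic columns of Table 1.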
When λ becomes high, the performance of our heuristic can suffer. In this region it becomes optimal to set p_S = 1 and decrease p_F continuously, while our heuristic must choose either p_F = 1 or p_F = µ_F q_F. Because λ is high, a small change in p_F (which corresponds to a small change in the arrival rate to any individual busy server) can have a big effect on mean response time, and the error of our heuristic can reach as high as 25%. However, as λ → 1, the heuristic, which sets p_S = 1 and p_F = µ_F q_F, again approaches perfect accuracy because p_F = µ_F q_F is the only value of p_F that maintains stability, and as λ → 1, using all of the servers (p_S = 1) also should be optimal.

6 Conclusion

This paper addresses the problem of dispatching in large-scale, heterogeneous systems. We design two new heterogeneity-aware families of policies, JIQ-(d_F, d_S) and JSQ-(d_F, d_S). Our policies are simple, analytically tractable, and provide outstanding performance.

Our results yield several insights about how to design "power of d" policies that perform well in heterogeneous settings. In order to maintain the maximum stability region, the dispatcher must ensure that fast servers are queried sufficiently often. Alone, neither uniform sampling nor weighting querying in favor of fast servers is enough to ensure good performance. Our work establishes that, instead, dispatching policies should use heterogeneity information at two decision points: (1) when choosing which servers to query, and (2) when choosing where among the queried servers to dispatch a job. Ultimately, how best to distribute jobs among fast and slow servers depends jointly on the system load, the fraction of servers that are fast, and the relative speeds of the servers. It may be best to use only fast servers, to use slow servers only when they are idle, or to balance jobs among fast and slow servers in some other way. Because there is no single right answer, policies designed for heterogeneous systems must be able to adapt to the system parameters. JIQ-(d_F, d_S) and JSQ-(d_F, d_S) do this by optimizing over the probabilistic parameters to choose the best allocation of jobs to fast and slow servers.
Moreover, as we show in Theorem 1, the optimal policy in each family is guaranteed to be stable.

We focus specifically on policies that query fixed numbers of fast and slow servers and then make probabilistic decisions about how to route among the queried servers based on idleness and queue length information. The space of policies that use heterogeneity information at both decision points is much larger than the policies we propose here. For example, one could imagine generalizing our policies at the first decision point by choosing d_F and d_S probabilistically for each query; this also would allow us to adapt our policies for systems with more than two server speeds. At the second decision point, one could combine (d_F, d_S)-style querying with a heterogeneity-aware dispatching policy, such as SED. While optimizing over such a large policy space is likely to be challenging, we are optimistic that substantial advances could be made in future work toward understanding a wider scope of policies and settings.

Differing server speeds is just one way in which server farms may exhibit heterogeneity. Systems may also consist of servers that are heterogeneous in their memory, network bandwidth, or any other resource availability. Some jobs may be able to run on certain servers but not on others, for example due to data locality. Jobs may be capable of running on any server, but may have a preference for or run faster on certain servers. The policies we present in this paper are designed to perform well specifically for the case of heterogeneous server speeds, but we believe the insights gained will aid the design of effective load balancing policies for the broad range of heterogeneity that exists in today's systems.

References

[1] S. Banawan and N. Zeidat. A comparative study of load sharing in heterogeneous multicomputer systems. In Proceedings of the 25th Annual Simulation Symposium, pages 22–31. IEEE, 1992.
[2] S. A. Banawan and J. Zahorjan. Load sharing in heterogeneous queueing systems. In Proceedings of IEEE INFOCOM '89, pages 731–739, 1989.
[3] F. Bonomi. On job assignment for a parallel system of processor sharing queues. IEEE Transactions on Computers, 39(7):858–869, July 1990.
[4] H. Chen and H.-Q. Ye. Asymptotic optimality of balanced routing. Operations Research, 60(1):163–179, 2012.
[5] H. Feng, V. Misra, and D. Rubenstein. Optimal state-free, size-aware dispatching for heterogeneous M/G/-type systems. Performance Evaluation, 62(1):475–492, 2005.
[6] V. Gupta, M. Harchol-Balter, K. Sigman, and W. Whitt. Analysis of join-the-shortest-queue routing for web server farms. Performance Evaluation, 64(9-12):1062–1081, 2007.
[7] M. Harchol-Balter. Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, 2013.
[8] E. Hyytiä. Optimal routing of fixed size jobs to two parallel servers. INFOR: Information Systems and Operational Research, 51(4):215–224, 2013.
[9] A. Izagirre and A. Makowski. Light traffic performance under the power of two load balancing strategy: the case of server heterogeneity. SIGMETRICS Performance Evaluation Review, 42(2):18–20, 2014.
[10] G. Koole. A simple proof of the optimality of a threshold policy in a two-server queueing system. Systems and Control Letters, 26(5):301–303, Dec. 1995.
[11] R. L. Larsen. Control of Multiple Exponential Servers with Application to Computer Systems. PhD thesis, College Park, MD, USA, 1981.
[12] W. Lin and P. R. Kumar. Optimal control of a queueing system with two heterogeneous servers. IEEE Transactions on Automatic Control, 29(8):696–703, 1984.
[13] Y. Lu, Q. Xie, G. Kliot, A. Geller, J. Larus, and A. Greenberg. Join-Idle-Queue: A novel load balancing algorithm for dynamically scalable web services. Performance Evaluation, 68(11):1056–1071, 2011.
[14] H. P. Luh and I. Viniotis. Threshold control policies for heterogeneous server systems. Mathematical Methods of Operations Research, 55(1):121–142, 2002.
[15] M. Mitzenmacher. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems, 12(10):1094–1104, 2001.
[16] A. Mukhopadhyay and R. Mazumdar. Analysis of randomized join-the-shortest-queue (JSQ) schemes in large heterogeneous processor-sharing systems. IEEE Transactions on Control of Network Systems, 3(2):116–126, 2016.
[17] R. D. Nelson and T. K. Philips. An approximation to the response time for shortest queue routing, volume 17. ACM, 1989.
[18] M. Rubinovitch. The slow server problem. Journal of Applied Probability, 22(1):205–213, 1985.
[19] M. Rubinovitch. The slow server problem: A queue with stalling. Journal of Applied Probability, 22(4):879–892, 1985.
[20] V. V. Rykov and D. V. Efrosinin. On the slow server problem. Automation and Remote Control, 70(12):2013–2023, 2009.
[21] J. Selen, I. Adan, and S. Kapodistria. Approximate performance analysis of generalized join the shortest queue routing. In Proceedings of the 9th EAI International Conference on Performance Evaluation Methodologies and Tools, pages 103–110. ICST, 2016.
[22] J. Selen, I. Adan, S. Kapodistria, and J. van Leeuwaarden. Steady-state analysis of shortest expected delay routing. Queueing Systems, 84(3-4):309–354, 2016.
[23] J. Sethuraman and M. S. Squillante. Optimal stochastic scheduling in multiclass parallel queues. SIGMETRICS Performance Evaluation Review, 27(1):93–102, May 1999.
[24] S. Shenker and A. Weinrib. The optimal control of heterogeneous queueing systems: a paradigm for load-sharing and routing. IEEE Transactions on Computers, 38(12):1724–1735, Dec 1989.
[25] A. Stolyar. Pull-based load distribution in large-scale heterogeneous service systems. Queueing Systems, 80(4):341–361, 2015.
[26] A. N. Tantawi and D. Towsley. Optimal static load balancing in distributed computer systems. Journal of the ACM, 32(2):445–465, 1985.
[27] N. Vvedenskaya, R. Dobrushin, and F. Karpelevich. Queueing system with selection of the shortest of two queues: An asymptotic approach. Problemy Peredachi Informatsii, 32(1):20–34, 1996.
[28] C. Wang, C. Feng, and J. Cheng. Distributed join-the-idle-queue for low latency cloud services. IEEE/ACM Transactions on Networking, 26(5):2309–2319, 2018.
[29] R. R. Weber. On the optimal assignment of customers to parallel servers. Journal of Applied Probability, 15(2):406–413, 1978.
[30] W. Whitt. Deciding which queue to join: Some counterexamples. Operations Research, 34(1):55–62, 1986.
[31] W. Winston. Optimality of the shortest line discipline. Journal of Applied Probability, 14(1):181–189, 1977.
[32] X. Zhou, F. Wu, J. Tan, Y. Sun, and N. Shroff. Designing low-complexity heavy-traffic delay-optimal load balancing schemes: Theory to algorithms. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):39, 2017.

Appendix
Here we give the complete expanded form of the optimization formulations given in (15) and (23).

For JIQ-(d_F, d_S) our optimization formulation (15) is as follows:

minimize over p_F, p_S:
  (q_F λ_IF µ_F) / (λ (µ_F − λ_BF)(µ_F − λ_BF + λ_IF)) + (q_S λ_IS µ_S) / (λ (µ_S − λ_BS)(µ_S − λ_BS + λ_IS))

subject to:
  λ_IF = (λ d_F / q_F) · Σ_{i=0}^{d_F−1} C(d_F−1, i) π_F^i (1 − π_F)^(d_F−1−i) / (i + 1)
  λ_BF = (λ d_F / q_F) · (1 − π_F)^(d_F−1) · (1/d_F) · ((1 − (1 − π_S)^(d_S))(1 − p_S) + (1 − π_S)^(d_S) p_F)
  λ_IS = (λ d_S / q_S) · (1 − π_F)^(d_F) · (Σ_{i=0}^{d_S−1} C(d_S−1, i) π_S^i (1 − π_S)^(d_S−1−i) / (i + 1)) · p_S
  λ_BS = (λ d_S / q_S) · (1 − π_F)^(d_F) (1 − π_S)^(d_S−1) · (1/d_S) · (1 − p_F)
  π_F = (µ_F − λ_BF) / (µ_F − λ_BF + λ_IF)
  π_S = (µ_S − λ_BS) / (µ_S − λ_BS + λ_IS)
  0 ≤ π_F, π_S ≤ 1
  0 ≤ p_F, p_S ≤ 1

For JSQ-(d_F, d_S) our optimization formulation (23) is as follows:

minimize over p_F, p_S:
  (1/µ_F) · (1 − ρ_F^(d_F))
  + (1/µ_S) · ρ_F^(d_F) (1 − ρ_S^(d_S)) p_S
  + (1/µ_F) · Σ_{i=1}^{∞} (i + 1) · ((f_i^(d_F) − f_{i+1}^(d_F)) / f_1^(d_F)) · ρ_F^(d_F) (ρ_S^(d_S) p_F + (1 − ρ_S^(d_S))(1 − p_S))
  + (1/µ_S) · Σ_{i=1}^{∞} (i + 1) · ((s_i^(d_S) − s_{i+1}^(d_S)) / s_1^(d_S)) · ρ_F^(d_F) ρ_S^(d_S) (1 − p_F)

subject to:
  ρ_F = (λ/(µ_F q_F)) · ρ_F^(d_F) (1 − ρ_S^(d_S))(1 − p_S) + (λ/(µ_F q_F)) · ((1 − ρ_F^(d_F)) + ρ_F^(d_F) ρ_S^(d_S) p_F)
  ρ_S = (λ/(µ_S q_S)) · (ρ_F^(d_F) (1 − ρ_S^(d_S)) p_S + ρ_F^(d_F) ρ_S^(d_S) (1 − p_F))
  (λ/q_F) · (f_{i−1}^(d_F) − f_i^(d_F)) · ((1 − ρ_S^(d_S))(1 − p_S) + ρ_S^(d_S) p_F) = µ_F (f_i − f_{i+1}),  i ≥ 1
  (λ/q_S) · (s_{i−1}^(d_S) − s_i^(d_S)) · ρ_F^(d_F) (1 − p_F) = µ_S (s_i − s_{i+1}),  i ≥ 1
  f_0 = s_0 = 1
  f_1 = ρ_F
  s_1 = ρ_S
  0 ≤ ρ_F, ρ_S ≤ 1
  0 ≤ p_F, p_S ≤ 1
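The JIQ-(d_F, d_S) constraint set above is a fixed point in the idle probabilities (π_F, π_S), which can be solved by damped iteration. The following is a sketch based on our reading of (15), not the authors' code; the parameter defaults and helper name are illustrative:

```python
from math import comb

def solve_jiq_fixed_point(lam, q_f, mu_f, mu_s, d_f, d_s, p_f, p_s,
                          iters=5000, damp=0.5):
    """Damped fixed-point iteration for (pi_F, pi_S) in the JIQ-(d_F, d_S)
    formulation (15). Returns the idle probabilities (pi_F, pi_S)."""
    q_s = 1.0 - q_f
    pi_f = pi_s = 0.5
    for _ in range(iters):
        # Arrival rate to an idle fast server (ties among idle queried
        # fast servers broken uniformly, hence the 1/(i+1) factors).
        lam_if = (lam * d_f / q_f) * sum(
            comb(d_f - 1, i) * pi_f**i * (1 - pi_f)**(d_f - 1 - i) / (i + 1)
            for i in range(d_f))
        # Arrival rate to a busy fast server.
        lam_bf = (lam / q_f) * (1 - pi_f)**(d_f - 1) * (
            (1 - (1 - pi_s)**d_s) * (1 - p_s) + (1 - pi_s)**d_s * p_f)
        # Arrival rates to idle and busy slow servers.
        lam_is = (lam * d_s / q_s) * (1 - pi_f)**d_f * p_s * sum(
            comb(d_s - 1, i) * pi_s**i * (1 - pi_s)**(d_s - 1 - i) / (i + 1)
            for i in range(d_s))
        lam_bs = (lam / q_s) * (1 - pi_f)**d_f * (1 - pi_s)**(d_s - 1) * (1 - p_f)
        new_f = (mu_f - lam_bf) / (mu_f - lam_bf + lam_if)
        new_s = (mu_s - lam_bs) / (mu_s - lam_bs + lam_is)
        pi_f += damp * (new_f - pi_f)
        pi_s += damp * (new_s - pi_s)
    return pi_f, pi_s
```

As a sanity check: with p_S = 0 and p_F = 1 all work goes to the fast servers, so with λ = 0.5, q_F = 0.2, r = 5 (µ_F = 25/9) each fast server has utilization λ/(q_F µ_F) = 0.9, and the iteration recovers the M/M/1 idle probability π_F = 0.1 with π_S = 1.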