Achieving Zero Asymptotic Queueing Delay for Parallel Jobs

Dispatching Parallel Jobs to Achieve Zero Queuing Delay

Wentao Weng

Institute for Interdisciplinary Information Sciences, Tsinghua University

Weina Wang

Computer Science Department, Carnegie Mellon University

April 7, 2020

Abstract

Zero queueing delay is highly desirable in large-scale computing systems. Existing work has shown that it can be asymptotically achieved by using the celebrated Power-of-$d$-choices (Po$d$) policy with a probe overhead $d = \omega\left(\frac{\log N}{1-\lambda}\right)$, and it is impossible when $d = O\left(\frac{1}{1-\lambda}\right)$, where $N$ is the number of servers and $\lambda$ is the load of the system. However, these results are based on the model where each job is an indivisible unit, which does not capture the parallel structure of jobs in today’s predominant parallel computing paradigm.

This paper thus considers a model where each job consists of a batch of parallel tasks. In this model, we say a policy leads to zero (asymptotic) queueing delay if the job delay under the policy approaches the delay given by the max of service times of its tasks, as if tasks entered service right upon arrival. We show that zero queueing delay for such parallel jobs can be achieved using a variant of the Po$d$ policy, the batch-filling policy, with a probe overhead $d = \omega\left(\frac{1}{(1-\lambda)\log k}\right)$, where $k$ is the number of tasks in each job. This result demonstrates that for parallel jobs, zero queueing delay can be achieved with a smaller probe overhead. We also establish a lower bound on the minimum $d$ needed: we show that zero queueing delay cannot be achieved if $d = e^{o\left(\frac{\log N}{\log k}\right)}$.

In view of the rise in the amount of latency-critical workloads in today’s datacenters [26, 22], load-balancing policies with ultra-low latency have attracted great attention (see, e.g., [21, 16, 7, 17, 15]). In particular, it is highly desirable to have a policy under which the delay due to queueing is minimal.

In a classical setting of load-balancing, the celebrated greedy policy, Join-the-Shortest-Queue (JSQ), achieves a minimal queueing delay in the sense that the queueing delay is diminishing as the system becomes large, even in heavy-traffic regimes [31, 30, 21].
Therefore, we say that JSQ achieves a zero (asymptotic) queueing delay. Specifically, consider a system with $N$ servers where jobs arrive into the system following a Poisson process. Each server has its own queue and serves jobs in the queue in a First-Come-First-Serve manner. Under JSQ, each incoming job will be assigned to a server with the shortest queue length. Then the expected time (in steady state) a job spends in the queue before entering service goes to zero as $N$ goes to infinity.

However, a drawback of JSQ is that it has a high communication overhead, which can cancel out its advantage of achieving zero queueing delay. For assigning each job, JSQ requires the knowledge of the queue-length information of all the $N$ servers, which will be referred to as having a probe overhead of $N$. In a typical cluster of servers, $N$ is in the tens of thousands range, resulting in intolerable delay due to communication [26, 22].

A load-balancing algorithm that provides tradeoffs between queueing delay and communication overhead is the Power-of-$d$-choices (Po$d$) policy [27, 20]. For each incoming job, Po$d$ selects $d$ queues out of $N$ queues uniformly at random, and assigns the job to a shortest queue among the $d$ selected queues. Therefore, Po$d$ has a probe overhead of $d$. It is easy to see that when $d = N$, Po$d$ coincides with JSQ, thus achieving a zero queueing delay. However, a fundamental question is: Can zero queueing delay be achieved by Po$d$ with a $d$ smaller than $N$? Or, what is the smallest $d$ for achieving zero queueing delay?

arXiv [cs.PF] preprint, April 7, 2020

This question has been recently answered in a line of research [21, 16, 17, 15]. In particular, the following results are the most relevant to our paper. Suppose the job arrival rate is $N\lambda$ and job service times are exponentially distributed with rate 1. Then the load of the system is $\lambda$. Consider a heavy-traffic regime with $\lambda = 1 - \beta N^{-\alpha}$, where $\alpha$ and $\beta$ are constants with $0 < \beta < 1$ and $0 < \alpha < 1$. It has been shown that Po$d$ achieves zero queueing delay when $d = \Omega\left(\frac{\log N}{1-\lambda}\right)$, and does not have zero queueing delay when $d = O\left(\frac{1}{1-\lambda}\right)$.

Although these prior results provide great insights into achieving zero queueing delay, they are all for the classical setting where each job is an indivisible unit. In today’s applications, parallel computing has emerged as a dominant paradigm to support the rapidly growing data volume and computation demands. A job with a parallel structure is no longer a single unit, but consists of multiple components that can run in parallel, resulting in a system dynamics that is very different from the non-parallel model. Therefore, it is of great importance to revisit the fundamental question on the minimum probe overhead needed for achieving zero queueing delay, and answer it under the new parallel paradigm.

In this paper, to capture the parallel structure, we consider a model where each job consists of $k$ tasks. Tasks can run on different servers in parallel, and a job is completed when all its tasks are completed. We assume that task service times are independent and exponentially distributed with rate 1. Recall that $N$ denotes the number of servers in the system. We assume that $k$ grows with $N$ (with exact assumption specified later on), but we suppress this dependency in notation for conciseness.

Zero queueing delay for parallel jobs

We are interested in achieving zero queueing delay since this is the regime where the delay due to queueing is minimal and jobs are only subject to delay due to their inherent sizes. In the non-parallel model, it is clear that the delay due to queueing for a job is just the time a job spends waiting in the queue. However, when a job consists of multiple tasks, quantifying the delay due to queueing is more complicated since different tasks experience different queueing times.

In this paper, we propose the following notion of zero queueing delay for parallel jobs. Let $X_1, X_2, \ldots, X_k$ denote the service times of a job’s $k$ tasks. Then if a job does not experience any queueing, its delay is given by $T^* = \max\{X_1, X_2, \ldots, X_k\}$. This is the job delay when all the tasks of the job enter service immediately, so we call it the inherent delay. Let $T$ denote the delay of a job in steady state. Then the delay due to queueing is characterized by the difference $\mathbb{E}[T - T^*]$. We say jobs have zero queueing delay if

$$\frac{\mathbb{E}[T - T^*]}{\mathbb{E}[T^*]} \to 0 \quad \text{as } N \to \infty, \tag{1}$$

i.e., the queueing delay takes a diminishing fraction of the inherent delay. Interestingly, under this notion, zero queueing delay allows tasks in a job to wait in queues for non-negligible times.

Probe overhead and batch-filling policy

When a job arrives into the system, a task-assigning policy samples some queues to obtain their queue length information, and then decides how to assign the $k$ tasks to the sampled servers. If the policy samples $kd$ queues, then we say its probe overhead [34, 22] is $d$ since $d$ is the average number of samples per task.

In this paper, we focus on a policy called batch-filling, which has been shown to outperform the naive implementation of Po$d$ and also another policy called batch-sampling for parallel jobs [34, 22]. Batch-filling assigns the tasks one by one to the shortest queue, where the queue length is updated after every task assignment.

Challenges and our results

For the non-parallel model, to show a zero queueing delay, it suffices to characterize the fraction of non-idle servers since a job can only land in one single queue. However, for parallel jobs, crucially, zero queueing delay of jobs can be achieved even when tasks have non-zero queueing delays. As a result, the analysis becomes much harder: we need to characterize the fractions of servers with queue lengths ranging from zero to a certain threshold. More specifically, the threshold here is $o(\log k)$. A key in our analysis is an interesting state-space collapse result that we discover. This result enables us to use the powerful framework of Stein’s method [4, 5].

We consider a system with a job arrival rate of $N\lambda/k$. Then $\lambda$ is the load of the system. We focus on a heavy-traffic regime where $\lambda = 1 - \beta N^{-\alpha}$ with $0 < \beta < 1$ and $0 < \alpha < 0.5$, i.e., the sub-Halfin-Whitt regime. Note that the larger $\alpha$ is, the faster the load approaches 1 as $N \to \infty$. All the order notation and asymptotic results in this paper are with respect to the regime that $N \to \infty$.

Our main result is that zero queueing delay is achieved when the probe overhead $d$ satisfies

$$d = \omega\left(\frac{1}{(1-\lambda)\log k}\right), \tag{2}$$

where the number of tasks $k$ satisfies $k = o\left(\frac{N^{0.5-\alpha}}{\log N}\right)$ and $k \log k = \Omega(\log N)$. For example, this includes $k = \log N$, a $k$ that grows as a fractional power of $N$ when $\alpha$ is small enough, and so on.

Recall that for the non-parallel model, a lower bound result is that zero queueing cannot be achieved when the probe overhead is $O\left(\frac{1}{1-\lambda}\right)$. In contrast, we can see that for parallel jobs, the probe overhead in (2) can be smaller in order than $\frac{1}{1-\lambda}$.

We also prove a lower bound result on the minimum $d$ needed: zero queueing delay is not achievable if

$$d = e^{o\left(\frac{\log N}{\log k}\right)}, \tag{3}$$

where $k$ satisfies $k = e^{o\left(\sqrt{\log N}\right)}$ and $k = \omega(1)$. To establish this lower bound, we utilize the tail bound given by a Lyapunov function in a novel way. This proof technique may be of independent interest.

Related works

Load-balancing systems for non-parallel jobs have been extensively studied in the literature. It is well-known that JSQ is delay-optimal under a wide range of assumptions [31, 30]. Although getting exact-form stationary distributions is typically not feasible for most load-balancing policies, many results and approximations are known for various asymptotic regimes.

For JSQ in heavy-traffic regimes, Eschenfeldt and Gamarnik [6] obtain a diffusion approximation in the Halfin-Whitt regime ($\alpha = 0.5$), which has a zero queueing delay in the diffusion limit. The convergence result in [6] is on the process level. Braverman [3] later establishes steady-state results, which imply the convergence of the stationary distributions to the diffusion limit. JSQ has also been studied in the nondegenerate slowdown (NDS) regime ($\alpha = 1$) [10].

The problem of achieving zero queueing delay with Po$d$ has been studied in [21, 16, 17, 15]. Mukherjee et al. [21] show through stochastic coupling that the diffusion limit of Po$d$ with $d = \omega\left(N^{0.5}\log N\right)$ converges to that of JSQ in the Halfin-Whitt regime, thus resulting in a zero queueing delay. The convergence to the diffusion limit in [21] is on the process level. Zero queueing delay for Po$d$ in steady state is first studied by Liu and Ying [17] for a heavy-traffic regime with $\alpha$ below a fixed constant, where they show that the waiting probability goes to 0 as $N \to \infty$ when $d = \omega\left(\frac{1}{1-\lambda}\right)$. The results are later extended to the sub-Halfin-Whitt regime ($0 < \alpha < 0.5$) for both exponential and Coxian-2 service times [16, 15] and to the beyond-Halfin-Whitt regime ($0.5 \le \alpha < 1$) [15], where it is shown that zero queueing delay is achieved when $d = \Omega\left(\frac{\log N}{1-\lambda}\right)$. The paper [17] also provides a lower bound result: the waiting probability is bounded away from 0 when $d = O\left(\frac{1}{1-\lambda}\right)$, for $\alpha$ in a corresponding range.

Po$d$ has also been analyzed in the regime with a constant load ($\alpha = 0$) as $N \to \infty$. Mean-field analysis has been derived for a constant $d$ in [20, 27], and Mukherjee et al.
[21] show $d = \omega(1)$ leads to zero queueing delay. We remark that mean-field analysis results are also available for other policies such as Join-the-Idle-Queue (JIQ) [18, 24], and also for delay-resource tradeoffs [7].

To the best of our knowledge, very limited work has been done on achieving zero queueing delay for parallel jobs, or on analyzing delay for parallel jobs in general. Only the regime with a constant load as $N \to \infty$ has been studied. Mukherjee et al. [21] briefly touch upon this topic and show that fluid-level optimality can be achieved with probe overhead $d \ge \frac{1}{1-\lambda-\epsilon}$ under the so-called batch-sampling policy [22]. Ying et al. [34] provide limiting distributions for the stationary distributions under (batch-version) Po$d$, batch-sampling, and batch-filling, but have not analyzed delay of jobs. Wang et al. [29] analyze job delay under a (batch-version) random-routing policy, which does not achieve zero queueing delay. There have been no results for heavy-traffic regimes.

Finally, the techniques we use in this paper are based on Stein’s method and drift-based state-space collapse. Proposed in [23], Stein’s method has been an effective tool for bounding the distance between two distributions. The seminal papers [4, 5, 11] build an analytical framework for Stein’s method in queueing theory that consists of generator approximation, gradient bounds, and possibly state-space collapse. The papers [4, 5] use Stein’s method to study steady-state diffusion approximation, and [16, 17, 33, 1, 3, 8, 9, 32] use Stein’s method to obtain convergence rates to the mean-field limit. A similar approach has also been developed by Stolyar [25].

Figure 1: An $N$-server system with batch arrivals.

Figure 2: An example of the number of spaces below a threshold $\ell$ in a set of queues: $\ell = 2$, a set $A$ of three queues, and $N_\ell(A) = 3$.

We consider a system with $N$ identical servers, illustrated in Figure 1. Each server has its own queue and serves tasks in its queue in a First-Come-First-Serve manner. Since each queue is associated with a server, we will refer to queues and servers interchangeably. Jobs arrive into the system following a Poisson process. To capture the parallel structure of jobs, we assume that each job consists of $k$ tasks that can run on different servers in parallel. A job finishes when all of its tasks finish. We study the large-system regime where the number of servers, $N$, becomes large, and we will let $k$ increase to infinity with $N$ to capture the trend of growing job sizes.

We denote the job arrival rate by $N\lambda/k$ and assume that the service times of tasks are independent and exponentially distributed with rate 1. Then $\lambda$ is the load of the system. We consider a heavy-traffic regime where $\lambda = 1 - \beta N^{-\alpha}$ with $0 < \beta < 1$ and $0 < \alpha < 0.5$, i.e., the so-called sub-Halfin-Whitt regime [16, 12].

When a job arrives into the system, we sample $kd$ queues and obtain their queue length information. Since the average overhead is $d$ samples per task, the probe overhead is $d$. We then assign the $k$ tasks of the job to the $kd$ selected queues using the batch-filling policy proposed in [34]. Batch-filling assigns the tasks one by one to the shortest queue, where the queue length is updated after each task assignment. Specifically, the task assignment process runs in $k$ rounds. For each round, we put a task into the shortest queue among sampled queues.
We then update the queue length, and continue to the next round.

Now we give an equivalent description of batch-filling, which is useful in our analysis. For each queue and a positive integer $\ell$, we use the number of spaces below threshold $\ell$ to refer to the quantity $\max\{\ell - \text{queue length}, 0\}$, i.e., the number of tasks we can put in the queue such that the queue length after receiving the tasks is no larger than $\ell$. For a set of queues $A$, we use $N_\ell(A)$ (or just $N_\ell$ when it is clear from the context) to denote the total number of spaces below $\ell$ in $A$. Figure 2 gives an example of $N_\ell(A)$. We say a task is at a queueing position $p$ if there are $p - 1$ tasks ahead of it in the queue. With the above terminology, the batch-filling policy can be described in the following way: it finds a minimum threshold $\ell$ such that the total number of spaces below $\ell$ in the sampled queues is at least $k$. Then it fills the $k$ tasks into these spaces from low positions to high positions.

To define zero queueing delay for parallel jobs, let $X_1, X_2, \cdots, X_k$ be the service times of the tasks of a job. When a job does not experience any queueing, its delay is given by $T^* = \max\{X_1, \cdots, X_k\}$, which we call the inherent delay of this job. If the actual delay of the job is very close to its inherent delay, it is as if the job immediately gets service when it arrives to the system. Therefore, we say a job experiences zero queueing delay if the steady state delay of the job, $T$, satisfies that

$$\frac{\mathbb{E}[T - T^*]}{\mathbb{E}[T^*]} \to 0 \quad \text{as } N \to \infty.$$

We note that as the service time of each task is exponentially distributed with mean 1, it holds that $\mathbb{E}[T^*] = H_k = \ln k + o(\ln k)$, where $H_k$ is the $k$-th harmonic number [19].

We make the following interesting observation, which provides a basis for our delay analysis of parallel jobs: a job can have zero queueing delay even when its tasks are assigned to non-idle servers. In fact, we establish a necessary and sufficient condition: a job has zero queueing delay if and only if all of its tasks are at queueing positions below a threshold $h$ with $h = o(\log k)$ after assigned to servers, noting that the inherent delay is $\ln k + o(\ln k)$. The formal proof is based on Lemma 4. This phenomenon allows us to have a zero queueing delay with low probe overhead. But it also makes the analysis hard since it implies that there are many situations that can lead to zero queueing delay.

We assume that every queue has a finite buffer size of $b$ including the task in service. If the dispatcher routes a task to a queue with length equal to $b$, we simply discard this task and all the other tasks of the same job. In this case, we say the job is dropped; otherwise, we say the job is admitted. We remark that this assumption is not restrictive for the following two reasons: (1) our results hold for a very large range of $b$ (see Theorem 1); and (2) the probability of discarding a job is very small (see Theorem 2).

To represent the state of the system, let $S_i(t)$ denote the fraction of servers that have at least $i$ jobs at time $t$, where $0 \le i \le b$. Note that it always holds $S_0(t) = 1$. Then $S(t) = (S_0(t), S_1(t), \cdots, S_b(t))$ forms a continuous-time Markov chain (CTMC) since batch-filling is oblivious to labels of servers.
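The two descriptions of batch-filling given above (round by round, and via a minimum threshold $\ell$) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the sampled queue lengths are passed in as a plain list, and the finite buffer size $b$ and the random sampling of the $kd$ queues are deliberately left out.

```python
import heapq


def spaces_below(queue_lengths, threshold):
    """N_l(A): total number of spaces below `threshold` in the given queues."""
    return sum(max(threshold - q, 0) for q in queue_lengths)


def batch_fill_rounds(queue_lengths, k):
    """Round-by-round form: k times, add one task to a shortest sampled queue."""
    heap = [(length, idx) for idx, length in enumerate(queue_lengths)]
    heapq.heapify(heap)
    result = list(queue_lengths)
    for _ in range(k):
        length, idx = heapq.heappop(heap)   # a shortest queue (ties: lowest index)
        result[idx] = length + 1            # queue length is updated after each task
        heapq.heappush(heap, (length + 1, idx))
    return result


def batch_fill_threshold(queue_lengths, k):
    """Threshold form: find the minimum l with N_l(A) >= k, fill low positions first."""
    threshold = 0
    while spaces_below(queue_lengths, threshold) < k:
        threshold += 1
    # Fill every queue up to position threshold - 1 ...
    result = [max(q, threshold - 1) for q in queue_lengths]
    leftover = k - spaces_below(queue_lengths, threshold - 1)
    # ... then place the remaining tasks at position `threshold`.
    for idx, q in enumerate(result):
        if leftover == 0:
            break
        if q == threshold - 1:
            result[idx] = threshold
            leftover -= 1
    return result


if __name__ == "__main__":
    sampled = [0, 2, 1, 3]   # illustrative queue lengths, not taken from the paper
    print(batch_fill_rounds(sampled, 4), batch_fill_threshold(sampled, 4))
```

Both functions return the queue lengths after the $k$ tasks are placed; with the same tie-breaking rule they agree, which is the equivalence the text relies on.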
The state space is as follows:

$$\mathcal{S} = \left\{ s = (s_0, s_1, s_2, \cdots, s_b) : 1 = s_0 \ge s_1 \ge s_2 \ge \cdots \ge s_b,\; N s_i \in \mathbb{N},\; \forall\, 0 \le i \le b \right\}.$$

It can be verified that $\{S(t) : t \ge 0\}$ is irreducible and positive recurrent, thus having a unique stationary distribution. Let $\pi_S$ denote this stationary distribution, and let $S = (S_0, \cdots, S_b)$ be a random element with distribution $\pi_S$.

Our main results provide bounds on queue lengths and delay, which lead to corresponding bounds on the probe overhead for achieving zero queueing delay. We divide our results into upper-bound and lower-bound results. Again, all the asymptotics are with respect to the regime that the number of servers, $N$, goes to infinity.

Upper-Bound Results

We ﬁrst give an upper bound on E (cid:104)(cid:80) bi =1 S i (cid:105) , the expected number of tasks in each server, in Theorem 1. This upperbound underpins our analysis of job delay. theorem 1. Consider a system with N servers where each job consists of k tasks. Let the load be λ = 1 − βN − α with < β < and < α < . . Under the batch-ﬁlling policy with a probe overhead of d such that d ≥ − λ ) h forsome h = o (log k ) and h = ω (1) , it holds that E (cid:34) max (cid:40) b (cid:88) i =1 S i − h (cid:18) − βN − α (cid:19) , (cid:41)(cid:35) ≤ √ N log N , (4) where k satisﬁes that k = o (cid:16) N . − α log N (cid:17) and k log k = Ω(log N ) , the buffer size b = min (cid:110) N α , N . − α k (cid:111) , and N issufﬁciently large. We remark that the h = o (log k ) in this theorem represents the threshold position we pointed out for zero queueingdelay, i.e., a job has zero queueing delay if all of its tasks are at queueing positions below h after assigned to servers.The upper bound on E (cid:104)(cid:80) bi =1 S i (cid:105) in Theorem 1 indicates how full the queues are. This enables us to analyze theprobability that all the tasks of an incoming job end up in positions below h under batch-ﬁlling, which further leads tothe zero queueing delay result below in Theorem 2. Recall that the buffer size b of each queue is ﬁnite, so a job willget dropped if at least one of its tasks is assigned to a queue with a full buffer. We denote the probability of droppingan incoming job in steady state by p d . 5 PREPRINT - A

Theorem 2.

Under the assumptions of Theorem 1, the dropping probability under batch-ﬁlling, p d , can be upperbounded as follows when N is sufﬁciently large: p d ≤ b √ N log N .

The steady-state delay of jobs that are admitted satisfies that

$$\mathbb{E}[T \mid \text{admitted}] = \ln k + o(\ln k). \tag{5}$$

Therefore, the batch-filling policy achieves zero queueing delay for parallel jobs.
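Since the inherent delay satisfies $\mathbb{E}[T^*] = H_k = \ln k + o(\ln k)$, the statement in (5) says that the admitted-job delay matches the inherent delay to first order. The short sketch below only illustrates the underlying fact that $H_k / \ln k$ approaches 1 as $k$ grows; the function names are hypothetical and nothing here simulates the queueing system itself.

```python
import math


def harmonic(k):
    """k-th harmonic number H_k = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / i for i in range(1, k + 1))


def inherent_delay_ratio(k):
    """Ratio H_k / ln k, i.e. E[T*] relative to its leading term ln k."""
    return harmonic(k) / math.log(k)


if __name__ == "__main__":
    for k in (10, 1_000, 100_000):
        print(k, round(harmonic(k), 4), round(inherent_delay_ratio(k), 4))
```

The ratio decreases toward 1 slowly, because the gap $H_k - \ln k$ stays bounded (it tends to the Euler-Mascheroni constant) while $\ln k$ grows.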

Theorems 1 and 2 imply that zero queueing delay for parallel jobs can be achieved with a probe overhead $d = \omega\left(\frac{1}{(1-\lambda)\log k}\right)$. This breaks the lower bound of $\omega\left(\frac{1}{1-\lambda}\right)$ for achieving zero queueing delay for non-parallel jobs, i.e., single-task jobs [17]. Therefore, the parallel structure helps reduce communication overhead.

Lower-Bound Results

To complement the upper-bound results, below we investigate when zero queueing delay cannot be achieved. InTheorem 3, we ﬁnd conditions under which (cid:80) hi =1 S i is lower bounded with a constant probability. theorem 3. Consider a system with N servers where each job consists of k tasks. Let the load be λ = 1 − βN − α with < β < and < α < . . Assume that b = ∞ and k satisﬁes that k = e o ( √ log N ) and k = ω (1) . For any stabletask-assigning policy with a probe overhead of d such that d = e o ( log N log k ) and any h with h = O (log k ) , it holds thatwhen N is sufﬁciently large, P (cid:40) h (cid:88) i =1 S i ≥ h − d (cid:41) ≥ e . (6)The lower bound on (cid:80) hi =1 S i in Theorem 3 guarantees that an incoming job will have a signiﬁcant delay in additionto its inherent delay, and thus fails to have zero queueing delay. This result is formally stated in Theorem 4 below. theorem 4. Under the assumptions of Theorem 3, the steady-state job delay, T , satisﬁes that E [ T ] ≥ k (7) when N is sufﬁciently large. Therefore, to achieve zero queueing delay, the probe overhead d needs to be at least e Ω ( log N log k ) . In this section, we prove the upper-bound results in Theorems 1 and 2. We ﬁrst give a proof sketch that providesan overview of the structure of the proofs. We then present the formal proofs of Theorems 1 and 2 in Sections 4.1and 4.2, respectively. These two proofs rely on lemmas that are presented in Section 4.3, followed by their proofs inSection 4.4. Throughout this section, we assume that the assumptions in Theorem 1 hold.

Proof Sketch

We start by setting the goal to be proving the zero queueing delay result in Theorem 2. The need for the fundamentalcharacterizations of the system in Theorem 1 will emerge during the analysis. We ﬁrst note that the steady-state jobdelay T can be upper bounded in the following way: E [ T ] ≤ E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:35) + E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:35) · P (cid:40) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:41) , (8)6 PREPRINT - A

PRIL

7, 2020where we have used the fact that P (cid:110)(cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1)(cid:111) ≤ .In this upper bound, the conditions in the expectations are based the threshold value h (cid:0) − βN − α (cid:1) for (cid:80) hi =1 S i .We choose this particular threshold value for the following reason. Given the condition (cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1) ,we can show that with high probability, all the tasks of an incoming job will be assigned to queueing positions below h with h = o (log k ) , thus resulting in a zero queueing delay for the job. Speciﬁcally, suppose a job arrives to the systemwith state s . If we choose one queue uniformly at random from all the queues, then the probability for the chosenqueue to have a length of i is s i − s i +1 . So the expected number of spaces below position h in the chosen queue is (cid:80) hi =0 ( h − i )( s i − s i +1 ) = h − (cid:80) hi =1 s i . The batch-ﬁlling policy samples kd queues. Thus the total expected numberof spaces below position h in the kd sampled queues is kd (cid:16) h − (cid:80) hi =1 s i (cid:17) . To ﬁt all the k tasks of the incoming jobto positions below h , we need kd (cid:32) h − h (cid:88) i =1 s i (cid:33) ≥ k, which becomes h (cid:88) i =1 s i ≤ h (cid:18) − βN − α (cid:19) when d ≥ − λ ) h = N α βh as required in Theorem 1. We strengthen this requirement to the condition (cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1) to obtain a high-probability guarantee using proper concentration inequalities.The second summand in the upper bound (8) is based on the condition (cid:80) hi =1 S i > h (cid:0) − βN − α (cid:1) . Under thiscondition, we may not be able to put all the tasks of an incoming job to positions below h . But we show that theprobability P (cid:110)(cid:80) hi =1 S i > h (cid:0) − βN − α (cid:1)(cid:111) is very small in Theorem 1. 
To this end, we ﬁrst upper-bound it usingthe Markov inequality: P (cid:40) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:41) ≤ E (cid:104) max (cid:110)(cid:80) bi =1 S i − h (cid:0) − βN − α (cid:1) , (cid:111)(cid:105) βN − α . It then sufﬁces to bound E (cid:104) max (cid:110)(cid:80) bi =1 S i − h (cid:0) − βN − α (cid:1) , (cid:111)(cid:105) .We focus on the dynamics of (cid:80) bi =1 S i ( t ) in the proof of Theorem 1, which is equal to the total queue length attime t divided by N . Our proof follows the framework of Stein’s method. The main idea is to couple our Markovchain { S ( t ) : t ≥ } with an auxiliary process that is easier to analyze, and bound their difference through generatorapproximation. Here we consider the following simple ﬂuid model as our auxiliary process: ˙ x ( t ) = ( − δ ) { x> } ,x ( t ) is continuous, (9)where δ = ( k +1) log N √ N , and we then compare the dynamics of (cid:80) bi =1 S i ( t ) with that of x ( t ) . Based on this coupling,we derive an upper bound on E (cid:104) max (cid:110)(cid:80) bi =1 S i − h (cid:0) − βN − α (cid:1) , (cid:111)(cid:105) in Section 4.1 below. We reiterate that akey in our analysis is a novel state-space collapse result that we establish.Combining the arguments above for both the condition (cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1) and the condition (cid:80) hi =1 S i >h (cid:0) − βN − α (cid:1) , we can conclude that the upper bound on E [ T ] in (8) implies zero queueing delay. Proof.

As we explained in the proof sketch, we consider the fluid model in (9). The generator of this fluid model, denoted as $\mathcal{G}$, is simply given by

$$\mathcal{G} g(x) = g'(x) \cdot (-\delta)\, \mathbb{1}_{\{x > 0\}}$$

for a differentiable function $g$. Recall that we will compare the dynamics of $x(t)$ in this fluid model with that of $\sum_{i=1}^{b} S_i(t)$.

The quantity of interest in Theorem 1 is $\mathbb{E}\left[\max\left\{\sum_{i=1}^{b} S_i - \eta, 0\right\}\right]$, where we have used the notation $\eta = h\left(1 - \beta N^{-\alpha}\right)$ for conciseness. Recall that $S$ follows the stationary distribution of $\{S(t) : t \ge 0\}$. To couple $\{S(t) : t \ge 0\}$ with the fluid model, we solve for a function $g$ such that

$$\mathcal{G} g(x) = \max\{x - \eta, 0\}, \qquad g(0) = 0. \tag{10}$$

It is not hard to see that the solution is

$$g(x) = \frac{(x - \eta)^2}{2(-\delta)}\, \mathbb{1}_{\{x \ge \eta\}}. \tag{11}$$

Now we utilize this function $g$ to bound $\mathbb{E}\left[\max\left\{\sum_{i=1}^{b} S_i - \eta, 0\right\}\right]$ through generator approximation. Let $G$ be the generator of $\{S(t) : t \ge 0\}$. Then

$$G g\left(\sum_{i=1}^{b} s_i\right) = \sum_{s' \in \mathcal{S}} r_{s \to s'} \left( g\left(\sum_{i=1}^{b} s'_i\right) - g\left(\sum_{i=1}^{b} s_i\right) \right),$$

where $r_{s \to s'}$ is the transition rate from state $s$ to state $s'$. Since $g\left(\sum_{i=1}^{b} s_i\right)$ is bounded on $\mathcal{S}$, it holds that

$$\mathbb{E}\left[ G g\left(\sum_{i=1}^{b} S_i\right) \right] = 0. \tag{12}$$

Combining this with the equations in (10) gives

$$
\begin{aligned}
\mathbb{E}\left[\max\left\{\sum_{i=1}^{b} S_i - \eta, 0\right\}\right]
&= \mathbb{E}\left[ \mathcal{G} g\left(\sum_{i=1}^{b} S_i\right) \right] \\
&= \mathbb{E}\left[ \mathcal{G} g\left(\sum_{i=1}^{b} S_i\right) - G g\left(\sum_{i=1}^{b} S_i\right) \right] \\
&= \mathbb{E}\left[ g'\left(\sum_{i=1}^{b} S_i\right)(-\delta) - G g\left(\sum_{i=1}^{b} S_i\right) \right].
\end{aligned} \tag{13}
$$

This is referred to as the generator approximation since we are approximating the generator $G$ with $\mathcal{G}$.

Next we take a closer look at the term $G g\left(\sum_{i=1}^{b} S_i\right)$ and derive an upper bound for (13).
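To see why the function in (11) solves (10), the following short derivation may help. It is a worked sketch under the assumption that $\eta > 0$, so the indicator $\mathbb{1}_{\{x > 0\}}$ in the fluid generator plays no role wherever $\max\{x - \eta, 0\}$ is nonzero; the notation $(\cdot)^{+} = \max\{\cdot, 0\}$ is used for brevity.

```latex
\begin{align*}
g'(x)\,(-\delta) = (x-\eta)^{+}
  \quad &\Longrightarrow \quad
  g'(x) = -\frac{(x-\eta)^{+}}{\delta}, \\
g(x) = g(0) + \int_{0}^{x} g'(u)\,du
  &= -\frac{1}{\delta}\int_{\eta}^{\max\{x,\eta\}} (u-\eta)\,du
  = -\frac{(x-\eta)^{2}}{2\delta}\,\mathbb{1}_{\{x \ge \eta\}} .
\end{align*}
```

In particular $g$ vanishes on $[0, \eta]$, and both $g'$ and $g''$ are nonpositive, which is what the sign arguments in the case analysis use.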
Let $P_A(s)$ be the probability that a job arrival is admitted into the system given that the system is at state $s$, i.e., the probability that all the tasks of the job are routed to positions below $b$. Then

$$G g\left(\sum_{i=1}^{b} s_i\right) = \frac{N\lambda}{k} P_A(s) \left( g\left(\sum_{i=1}^{b} s_i + \frac{k}{N}\right) - g\left(\sum_{i=1}^{b} s_i\right) \right) + N s_1 \left( g\left(\sum_{i=1}^{b} s_i - \frac{1}{N}\right) - g\left(\sum_{i=1}^{b} s_i\right) \right),$$

where the first term is the drift due to a job arrival and the second term is due to a task departure. To derive an upper bound on (13), we divide the discussion into the three cases below. Recall that $g(x) = \frac{(x-\eta)^2}{2(-\delta)}\, \mathbb{1}_{\{x \ge \eta\}}$ and $g'(x) = \frac{x - \eta}{-\delta}\, \mathbb{1}_{\{x \ge \eta\}}$.

Case 1: $\sum_{i=1}^{b} S_i < \eta - \frac{k}{N}$. In this case, clearly $g'\left(\sum_{i=1}^{b} S_i\right) = 0$ and $G g\left(\sum_{i=1}^{b} S_i\right) = 0$.

Case 2: $\sum_{i=1}^{b} S_i \in \left[\eta - \frac{k}{N}, \eta + \frac{1}{N}\right)$. By the mean value theorem,

$$
\begin{aligned}
g'\left(\sum_{i=1}^{b} S_i\right)(-\delta) - G g\left(\sum_{i=1}^{b} S_i\right)
&= g'\left(\sum_{i=1}^{b} S_i\right)(-\delta) - \left( \frac{N\lambda}{k} P_A(S)\, \frac{k}{N}\, g'(\xi) + N S_1\, \frac{-1}{N}\, g'(\tilde{\xi}) \right) \\
&\le g'\left(\sum_{i=1}^{b} S_i\right)(-\delta) - \lambda g'(\xi) + S_1 g'(\tilde{\xi}),
\end{aligned} \tag{14}
$$

where $\xi \in \left(\sum_{i=1}^{b} S_i, \sum_{i=1}^{b} S_i + \frac{k}{N}\right)$, $\tilde{\xi} \in \left(\sum_{i=1}^{b} S_i - \frac{1}{N}, \sum_{i=1}^{b} S_i\right)$, and (14) is true since $P_A(S) \le 1$ and $g'(x) \le 0$ for all $x$.

Case 3: $\sum_{i=1}^{b} S_i \ge \eta + \frac{1}{N}$. Since $g'(x)$ is continuous for all $x$, by the second order Taylor expansion in the Lagrange form,

$$
\begin{aligned}
g'\left(\sum_{i=1}^{b} S_i\right)(-\delta) - G g\left(\sum_{i=1}^{b} S_i\right)
&= g'\left(\sum_{i=1}^{b} S_i\right)(-\delta) - \frac{N\lambda}{k} P_A(S) \left( \frac{k}{N}\, g'\left(\sum_{i=1}^{b} S_i\right) + \frac{k^2}{2N^2}\, g''(\zeta) \right) \\
&\quad - N S_1 \left( \frac{-1}{N}\, g'\left(\sum_{i=1}^{b} S_i\right) + \frac{1}{2N^2}\, g''(\tilde{\zeta}) \right) \\
&\le g'\left(\sum_{i=1}^{b} S_i\right)(-\delta - \lambda + S_1) - \frac{1}{2N} \left( \lambda k\, g''(\zeta) + S_1\, g''(\tilde{\zeta}) \right),
\end{aligned} \tag{15}
$$

where $\zeta \in \left(\sum_{i=1}^{b} S_i, \sum_{i=1}^{b} S_i + \frac{k}{N}\right)$ and $\tilde{\zeta} \in \left(\sum_{i=1}^{b} S_i - \frac{1}{N}, \sum_{i=1}^{b} S_i\right)$.

Combining these three cases yields

$$
\begin{aligned}
\mathbb{E}\left[ g'\left(\sum_{i=1}^{b} S_i\right)(-\delta) - G g\left(\sum_{i=1}^{b} S_i\right) \right]
&\le \mathbb{E}\left[ \left( g'\left(\sum_{i=1}^{b} S_i\right)(-\delta) - \lambda g'(\xi) + S_1 g'(\tilde{\xi}) \right) \mathbb{1}_{\left\{\sum_{i=1}^{b} S_i \in \left[\eta - \frac{k}{N},\, \eta + \frac{1}{N}\right)\right\}} \right] && (16) \\
&\quad - \frac{1}{2N}\, \mathbb{E}\left[ \left( \lambda k\, g''(\zeta) + S_1\, g''(\tilde{\zeta}) \right) \mathbb{1}_{\left\{\sum_{i=1}^{b} S_i \ge \eta + \frac{1}{N}\right\}} \right] && (17) \\
&\quad + \mathbb{E}\left[ g'\left(\sum_{i=1}^{b} S_i\right)(-\delta - \lambda + S_1)\, \mathbb{1}_{\left\{\sum_{i=1}^{b} S_i \ge \eta + \frac{1}{N}\right\}} \right]. && (18)
\end{aligned}
$$

The first two terms (16) and (17) are easy to bound once we notice that for any $x \in \left[\eta - \frac{k+1}{N}, \eta + \frac{k+1}{N}\right]$, $|g'(x)| \le \frac{|x - \eta|}{\delta} \le \frac{1}{\sqrt{N}\log N}$, and for any $x \in (\eta, +\infty)$, $|g''(x)| = \frac{1}{\delta} = \frac{\sqrt{N}}{(k+1)\log N}$. Then when $N$ is sufficiently large,

$$|(16)| \le \frac{1}{\sqrt{N}\log N} \left( \frac{(k+1)\log N}{\sqrt{N}} + 1 + 1 \right) \le \frac{3}{\sqrt{N}\log N},$$

and

$$|(17)| \le \frac{1}{2N} \cdot \frac{\sqrt{N}}{(k+1)\log N}\, (\lambda k + 1) \le \frac{1}{\sqrt{N}\log N}.$$

The key in this proof is to bound the term (18), for which we need the state-space collapse result in Lemma 3 inSection 4.3. Consider the following Lyapunov function: V ( s ) = min  h − b (cid:88) i = h s i , b (cid:32)(cid:18) − βN − α (cid:19) − h − h − (cid:88) i =1 s i (cid:33) +  , PREPRINT - A

7, 2020where the superscript + denotes the function x + = max { x, } . Then by Lemma 3, when N is sufﬁciently large, P (cid:26) V ( S ) > B + 2 kb log N ( h − √ N (cid:27) ≤ e − log N , where B = b − h +1 h − (cid:16) βN − α + log N √ N (cid:17) .We partition the probability space based on the value of V ( S ) . Note that g (cid:48) (cid:16)(cid:80) bi =1 S i (cid:17) ( − δ − λ + S ) { (cid:80) bi =1 S i ≥ η + N } is always no larger than bδ for large enough N . Then (18) can be upper bounded as(18) ≤ E (cid:34) g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ − λ + S ) · { (cid:80) bi =1 S i ≥ η + N } (cid:12)(cid:12)(cid:12)(cid:12) V ( S ) ≤ B + 2 kb log N ( h − √ N (cid:21) + 2 bδ P (cid:26) V ( S ) > B + 2 kb log N ( h − √ N (cid:27) . (19)Now we focus on the case where we are given the condition that V ( S ) ≤ B + kb log N ( h − √ N . Our goal is to show that S is large enough such that δ + λ − S < . Intuitively, this condition on V ( S ) implies that we either have a small (cid:80) bi = h S i , which leads to a large S when combined with the condition (cid:80) bi =1 S i ≥ η + N in the indicator, or a large (cid:80) h − i =1 S i , which directly gives a large S since S ≥ · · · ≥ S h − .If h − (cid:80) bi = h S i ≤ b (cid:16)(cid:0) − βN − α (cid:1) − h − (cid:80) h − i =1 S i (cid:17) + in V ( S ) , the condition V ( S ) ≤ B + kb log N ( h − √ N implies that h − b (cid:88) i = h S i ≤ b − h + 1 h − (cid:18) βN − α + log N √ N (cid:19) + 2 kb log N ( h − √ N . (20)Recall that b = min (cid:110) N α , N . − α k (cid:111) and h = o (log k ) . Note that the indicator function in (19) makes it sufﬁcient toconsider the case where (cid:80) bi =1 S i ≥ η + N , which implies ( h − S + (cid:80) bi = h S i ≥ η . Combining this with (20) gives S ≥ ηh − − b − h + 1 h − (cid:18) βN − α + log N √ N (cid:19) − kb log N ( h − √ N ≥ − β ) 1 h − − βN − α + o (cid:18) h (cid:19) when N is sufﬁciently large. Note that δ = o (cid:0) h (cid:1) and λ = 1 − βN − α . 
Therefore, λ + δ − S < when N issufﬁciently large.If h − (cid:80) bi = h S i > b (cid:16)(cid:0) − βN − α (cid:1) − h − (cid:80) h − i =1 S i (cid:17) + in V ( S ) , the condition V ( S ) ≤ B + kb log N ( h − √ N implies that b (cid:32) − βN − α − h − h − (cid:88) i =1 S i (cid:33) ≤ B + 2 kb log N ( h − √ N .

Then S ≥ h − h − (cid:88) i =1 S i ≥ − βN − α − b (cid:18) B + 2 kb log N ( h − √ N (cid:19) ≥ − βN − α + o ( N − α ) . PREPRINT - A

7, 2020As a result, again we have λ + δ − S ≤ − βN − α + o ( N − α ) < when N is sufﬁciently large.Inserting these bounds back to (19) gives that when N is sufﬁciently large,(18) ≤ bδ P (cid:26) V ( S ) > B + 2 kb log N ( h − √ N (cid:27) ≤ bδ e − log N ≤ √ N log N .

Combining the bounds for (16), (17) and (18), we have E (cid:34) max (cid:40) b (cid:88) i =1 S i − h (cid:18) − βN − α (cid:19) , (cid:41)(cid:35) ≤ √ N log N , which completes the proof of Theorem 1.

Proof.

We ﬁrst bound the dropping probability p d using Lemma 1, which will be presented in Section 4.3. Note thatan incoming job does not get dropped if and only if all its k tasks are routed to queueing positions below threshold b ,which is the complement of the event FILL b in Lemma 1. Thus, p d = 1 − P { FILL b } = 1 − P (cid:40) FILL b (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) b (cid:88) i =1 S i ≤ b (cid:18) − βN − α (cid:19)(cid:41) · P (cid:40) b (cid:88) i =1 S i ≤ b (cid:18) − βN − α (cid:19)(cid:41) − P (cid:40) FILL b (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) b (cid:88) i =1 S i > b (cid:18) − βN − α (cid:19)(cid:41) · P (cid:40) b (cid:88) i =1 S i > b (cid:18) − βN − α (cid:19)(cid:41) . We can easily have that P (cid:110) FILL b (cid:12)(cid:12)(cid:12) (cid:80) bi =1 S i ≤ b (cid:0) − βN − α (cid:1)(cid:111) ≤ N using Lemma 1.Now we bound P (cid:110)(cid:80) bi =1 S i > b (cid:0) − βN − α (cid:1)(cid:111) using Theorem 1. Note that P (cid:40) b (cid:88) i =1 S i > b (cid:18) − βN − α (cid:19)(cid:41) ≤ P (cid:40) max (cid:40) b (cid:88) i =1 S i − h (cid:18) − βN − α (cid:19) , (cid:41) > b − b βN − α − h (cid:41) ≤ P (cid:40) max (cid:40) b (cid:88) i =1 S i − h (cid:18) − βN − α (cid:19) , (cid:41) > b (cid:41) , PREPRINT - A

7, 2020where we have used the fact that b βN − α + h ≤ b when N is sufﬁciently large due to our assumptions on b and h .Then by Markov’s inequality, P (cid:40) b (cid:88) i =1 S i > b (cid:18) − βN − α (cid:19)(cid:41) ≤ E (cid:104) max (cid:110)(cid:80) bi =1 S i − h (cid:0) − βN − α (cid:1) , (cid:111)(cid:105) b ≤ b √ N log N .

Combining the arguments above yields p d ≥ − N − b √ N log N ≥ − b √ N log N when N is sufﬁciently large.Next we bound the expected job delay given that a job is admitted, i.e., E [ T | admitted ] . We deﬁne the delay of a jobthat is dropped to be zero since it leaves the system immediately after arrival. Then E [ T ] = E [ T | admitted ] · (1 − p d ) + E [ T | dropped ] · p d , and thus E [ T | admitted ] = E [ T ]1 − p d . So we can focus on bounding E [ T ] , following the outlinegiven in the proof sketch.We bound E [ T ] in the following way E [ T ] ≤ E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:35) (21) + E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:35) · P (cid:40) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:41) . (22)For the ﬁrst term (21) in this upper bound, as described in the proof sketch, we will rely on the fact that with highprobability, all the k tasks are assigned to queueing positions below h . Speciﬁcally, E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:35) = E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19) , FILL h (cid:35) · P (cid:40) FILL h (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:41) + E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19) , FILL h (cid:35) · P (cid:40) FILL h (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:41) , where FILL h is the complement of FILL h .Suppose FILL h is true. Suppose that the k tasks of the incoming job land in m distinct queues with m ≤ k . Wecall the tasks with the highest positions in these m queues tasks , , . . . , m , and let n , n , . . . , n m denote thesepositions. 
Then the delay of task $i$ can be written as $Y_i = \sum_{j=1}^{n_i} X_{i,j}$, where $X_{i,j}$ is the service time of the task at
7, 2020position j in the same queue as task i . Clearly X i,j ’s are i.i.d. with an exponential distribution of rate . We know that n i ≤ h, i = 1 , , . . . , m given FILL h . Then by Lemma 4, E [max { Y , · · · , Y m } ] ≤ ln k + o (ln k ) . When FILL h is true, E (cid:104) T (cid:12)(cid:12)(cid:12) (cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1) , FILL h (cid:105) ≤ bk since the highest position for a task is b andthe maximum is upper bounded by the sum. Further, P (cid:110) FILL h (cid:12)(cid:12)(cid:12) (cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1)(cid:111) ≤ N by Lemma 1.Combining the arguments above, we have the following bound for term (21): E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:35) ≤ ln k + o (ln k ) + bkN . Now we go back to the term (22). Again, it is easy to see that E (cid:104) T (cid:12)(cid:12)(cid:12) (cid:80) hi =1 S i > h (cid:0) − βN − α (cid:1)(cid:105) ≤ bk . UtilizingTheorem 1, we have P (cid:40) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:41) ≤ P (cid:40) max (cid:40) b (cid:88) i =1 S i − h (cid:18) − βN − α (cid:19) , (cid:41) > hβN − α (cid:41) ≤ E (cid:104) max (cid:110)(cid:80) bi =1 S i − h (cid:0) − βN − α (cid:1) , (cid:111)(cid:105) hβN − α ≤ hβN − α log N .

With the bounds above on (21) and (22), we have E [ T ] ≤ ln k + o (ln k ) + bkN + 20 bkhβN − α log N .

Consequently, E [ T | admitted ] = E [ T ]1 − p d ≤ ln k + o (ln k ) + bkN + bkhβN − α log N − p d ≤ ln k + o (ln k ) , which completes the proof. In Lemma 1 below, we consider the event that all the k tasks of an incoming job are routed to queueing positions belowsome threshold value (cid:96) . Let this event be denoted by FILL (cid:96) , and Lemma 1 lower-bounds its probability P { FILL (cid:96) } forseveral values of (cid:96) of interest. Lemma 1 is an essential building block and is needed for establishing the state-spacecollapse result in Lemma 3 and bounding job delay in Theorem 2. lemma 1 (Filling Probability) . Under the assumptions of Theorem 1, given that the system is in a state s such that (cid:80) (cid:96)i =1 s i ≤ (cid:96) (cid:0) − βN − α (cid:1) , the probability of the event FILL (cid:96) for any (cid:96) ∈ { h − , h, b } can be bounded as followswhen N is sufﬁciently large: P { FILL (cid:96) } ≥ − N .

Lemma 2 bounds the distribution tails of a Lyapunov function, which slightly generalizes the tail bounds in [28], [15] and [2].

Lemma 2.

Consider a continuous time Markov chain { S ( t ) } with a unique stationary distribution π . Assume it has aﬁnite state space S . For a Lyapunov function V : S → [0 , + ∞ ) , deﬁne the drift of V at a state s ∈ S as ∆ V ( s ) = GV ( s ) = (cid:88) s (cid:48) ∈S , s (cid:54) = s (cid:48) r s → s (cid:48) ( V ( s (cid:48) ) − V ( s )) , where r s → s (cid:48) is the transition rate from state s to s (cid:48) .Assume that ν max := sup s , s (cid:48) ∈S : r s → s (cid:48) > | V ( s ) − V ( s (cid:48) ) | < ∞ f max := max  , sup s ∈S (cid:88) s (cid:48) : V ( s (cid:48) ) >V ( s ) r s → s (cid:48) ( V ( s (cid:48) ) − V ( s ))  < ∞ . If there is a set E with B > , γ > , δ ≥ such that • ∆ V ( s ) ≤ − γ when V ( s ) ≥ B and s ∈ E . • ∆ V ( s ) ≤ δ when V ( s ) ≥ B and s (cid:54)∈ E .Then it holds that for all j ∈ N , P { V ( s ) ≥ B + 2 ν max j } ≤ (cid:18) f max f max + γ (cid:19) j + (cid:18) δγ + 1 (cid:19) P { s (cid:54)∈ E} . This tail bound in Lemma 2 is slightly more general than existing bounds in that it allows different drift bounds basedon whether a state s is in a set E or not, which will be needed in the proof of lower-bound results.We utilize Lemma 2 to establish the state-space collapse result below in Lemma 3. Here we simply let E be the wholestate space. lemma 3 (State-Space Collapse) . Under the assumption of Theorem 1, consider the following Lyapunov function V ( s ) = min  h − b (cid:88) i = h s i , b (cid:32)(cid:18) − βN − α (cid:19) − h − h − (cid:88) i =1 s i (cid:33) +  , where the superscript + denotes the function x + = max { x, } . Let B = b − h +1 h − (cid:16) βN − α + log N √ N (cid:17) . For any state s such that V ( s ) > B , its Lyapunov drift satisﬁes ∆ V ( s ) = GV ( s ) ≤ − b √ N .

Consequently, when N is sufﬁciently large, P (cid:26) V ( S ) > B + 2 kb log N ( h − √ N (cid:27) ≤ e − log N , Lemma 4 below is used in the proof of Theorem 2 to bound job delay based on queueing positions of its tasks. It statesthat if m tasks are in queueing positions below o (log m ) , then the maximum delay of these tasks is ln m + o (ln m ) .Due to space limitations, the proof is given in Appendix A. lemma 4. Consider m independent random variables Y , · · · , Y m where each Y i is the sum of n i i.i.d. randomvariables that follow the exponential distribution with rate . In the asymptotic regime that m goes to inﬁnity, if max { n , · · · , n m } = o (log m ) , then E [max { Y , · · · , Y m } ] ≤ ln m + o (ln m ) . (Filling Probability) Proof.

Assume that a job arrival sees a state S = s that satisﬁes (cid:96) (cid:88) i =1 s i ≤ (cid:96) (cid:18) − βN − α (cid:19) . PREPRINT - A

7, 2020We focus on the the number of spaces below the threshold (cid:96) in the sampled queues, denoted by N (cid:96) . Then N (cid:96) is themaximum number of tasks that can be put into these queues such that all of these tasks are at queueing positions below (cid:96) . Therefore, P { FILL (cid:96) } = P { N (cid:96) ≥ k } ≥ − P { N (cid:96) ≤ k } . Now we bound P { N (cid:96) ≤ k } . We can think of the sampling process of batch-ﬁlling as sampling kd queues one byone without replacement. Let X , X , · · · , X kd be the numbers of spaces below (cid:96) in the st, nd, . . . , kd th sampledqueues, respectively. Then N (cid:96) = X + · · · + X kd . It is not hard to see that for each of the sampled queue and eachinteger x with ≤ x ≤ (cid:96) , P { X i = x } = s (cid:96) − x − s (cid:96) − x +1 , and P { X i = 0 } = s (cid:96) .Note that since we sample without replacement, X , X , . . . , X kd are not independent. But we can still deriveconcentration bounds using a result of Hoeffding [13, Theorem 4]. By this result, we have E (cid:104) f (cid:16)(cid:80) kdi =1 X i (cid:17)(cid:105) ≤ E (cid:104) f (cid:16)(cid:80) kdi =1 Y i (cid:17)(cid:105) for any continuous and convex function f ( · ) , where Y , Y , . . . , Y kd are i.i.d. and follow the samedistribution as X . We take the function f ( · ) to be f ( x ) = e − tx with t > . Then P { N (cid:96) ≤ k } = P (cid:8) e − tN (cid:96) ≥ e − tk (cid:9) ≤ e tk kd (cid:89) i =1 E (cid:2) e − tY i (cid:3) = e tk kd (cid:89) i =1  − (cid:96) (cid:88) j =1 ( s (cid:96) − j − s (cid:96) − j +1 ) (cid:0) − e − tj (cid:1) . 
Since − x ≤ e − x for each x ≥ , this can be further bounded as P { N (cid:96) ≤ k }≤ exp  tk − kd (cid:96) (cid:88) j =1 ( s (cid:96) − j − s (cid:96) − j +1 ) (cid:0) − e − tj (cid:1) ≤ exp  tk + kd (cid:96) (cid:88) j =1 ( s j − − s j ) (cid:16) e − t ( (cid:96) − j +1) − (cid:17) (23)Rearranging the terms in the sum in (23), we get (cid:96) (cid:88) j =1 ( s j − − s j ) (cid:16) e − t ( (cid:96) − j +1) − (cid:17) = (cid:0) e − t(cid:96) − (cid:1) + (cid:0) e t − (cid:1) (cid:96) (cid:88) j =1 s j e − t ( (cid:96) − j +1) . (24)Since ≥ s ≥ · · · s (cid:96) and we have assumed that (cid:80) (cid:96)j =1 s j ≤ (cid:96) (cid:0) − βN − α ) (cid:1) , (24) is maximized when s = s = · · · = s (cid:96) = 1 − βN − α . Therefore, the upper bound becomes P { N (cid:96) ≤ k } ≤ exp (cid:18) tk + kd (cid:0) e − t(cid:96) − (cid:1) βN − α (cid:19) . PREPRINT - A

7, 2020Now we apply the condition that d ≥ N α βh and let t = ln(2 (cid:96) ) − ln h(cid:96) . Then P { N (cid:96) ≤ k }≤ exp (cid:18) tk + 2 kh (cid:0) e − t(cid:96) − (cid:1)(cid:19) = exp (cid:18) kh (cid:18) h(cid:96) (ln(2 (cid:96) ) − ln h ) + h(cid:96) − (cid:19)(cid:19) . Recall the we have assumed that kh = ω (log N ) and h = ω (1) . Then it can be veriﬁed that with a sufﬁciently large N , h(cid:96) (ln(2 (cid:96) ) − ln h ) + h(cid:96) + 2 N − . − is smaller than a negative constant for all (cid:96) ∈ { h − , h, b } . Thus P { N (cid:96) ≤ k } ≤ exp( − ω (log N )) ≤ N .
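The equality obtained above after substituting $t$ can be verified directly. The following worked step uses only the stated choice of $t$ and the expression $tk + \frac{2k}{h}\left(e^{-t\ell} - 1\right)$ appearing in the exponent:

\[
t = \frac{\ln(2\ell) - \ln h}{\ell}
\quad\Longrightarrow\quad
e^{-t\ell} = \frac{h}{2\ell},
\qquad
tk = \frac{k}{h}\cdot\frac{h}{\ell}\bigl(\ln(2\ell) - \ln h\bigr),
\]
\[
tk + \frac{2k}{h}\left(e^{-t\ell} - 1\right)
= \frac{k}{h}\cdot\frac{h}{\ell}\bigl(\ln(2\ell) - \ln h\bigr) + \frac{k}{\ell} - \frac{2k}{h}
= \frac{k}{h}\left(\frac{h}{\ell}\bigl(\ln(2\ell) - \ln h\bigr) + \frac{h}{\ell} - 2\right).
\]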

As a result, P { FILL (cid:96) } ≥ − P { N (cid:96) ≤ k } ≥ − N , which completes the proof.
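The quantity $N_\ell$ used in this proof is easy to compute for a concrete system state, which can help build intuition for the event FILL$_\ell$. The sketch below is only illustrative and is not part of the paper: the function names, the list-of-queue-lengths representation, and the Monte Carlo parameters are assumptions. It samples $kd$ queues without replacement, counts the free positions up to the threshold $\ell$ (a queue of length $\ell - x$ contributes $x$ spaces, matching the distribution of $X_i$ above), and estimates $P\{N_\ell \ge k\}$. It does not simulate how the dispatcher actually places tasks.

```python
import random


def spaces_below(queue_lengths, ell):
    """N_ell: total number of free positions up to threshold ell in the given queues."""
    return sum(max(ell - q, 0) for q in queue_lengths)


def estimate_fill_probability(all_queues, k, d, ell, trials=2000, seed=0):
    """Monte Carlo estimate of P{N_ell >= k} when k*d queues are sampled
    uniformly at random without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sampled = rng.sample(all_queues, k * d)  # sampling without replacement
        if spaces_below(sampled, ell) >= k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Toy state: 1000 queues, most of them holding one task, a few empty.
    queues = [1] * 900 + [0] * 100
    print(estimate_fill_probability(queues, k=10, d=3, ell=2))
```

Varying `d` in this toy setting shows how a larger probe overhead makes it more likely that enough low positions are found among the sampled queues.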

Proof of Lemma 3 (State-Space Collapse)

Proof.

Consider the Lyapunov function in the lemma, i.e., V ( s ) = min  h − b (cid:88) i = h s i , b (cid:32)(cid:18) − βN − α (cid:19) − h − h − (cid:88) i =1 s i (cid:33) +  . We will refer to the ﬁrst term and second term in the minimum as T and T , respectively. Let B = b − h +1 h − (cid:16) βN − α + log N √ N (cid:17) and suppose V ( s ) > B . Recall that the drift of V is given by ∆ V ( s ) = GV ( s ) = (cid:88) s (cid:48) ∈S , s (cid:54) = s (cid:48) r s → s (cid:48) ( V ( s (cid:48) ) − V ( s )) , where r s → s (cid:48) is the transition rate from state s to s (cid:48) . Let e i = (cid:0) , · · · , , N , , · · · , (cid:1) be a vector of length b whose i th entry is N and all the other entries are zero. We divide the discussion into two cases. Case 1: T ≤ T . In this case V ( s ) = T . When the state transition is due to a task departure from a queue of length i , which has a rate of N ( s i − s i +1 ) , then V ( s − e i ) = (cid:40) V ( s ) , if ≤ i < h,V ( s ) − N ( h − , if h ≤ i ≤ b. Now consider the state transition due to a job arrival. Let a i be the queueing position that task i is assigned to. Thenthe next state can be written as s + e a + · · · + e a k . Note that when the event

FILL h − happens, the dispatcher puts all k tasks to positions below threshold h − . Thenunder FILL h − , s i does not change for i ≥ h , which implies that V ( s + e a + · · · + e a k ) = V ( s ) . We can show that P { FILL h − } ≥ − N using Lemma 1 since T ≥ T > B > . Otherwise, i.e., when FILL h − is not true, it is easy to see that V ( s + e a + · · · + e a k ) ≤ V ( s ) + kN ( h − . PREPRINT - A

7, 2020Therefore, ∆ V ( s ) ≤ b (cid:88) i =1 N ( s i − s i +1 ) ( V ( s − e i ) − V ( s )) + N λk N kN ( h − N ( h − − s h h − ≤ N ( h − − h − b − h + 1 b (cid:88) i = h s i . By the assumption that T > B , we have b − h + 1 b (cid:88) i = h s i ≥ h − b − h + 1 B = βN − α + log N √ N .

Inserting this back to the upper bound on ∆ V ( s ) gives ∆ V ( s ) ≤ − h − (cid:18) − N + βN − α + log N √ N (cid:19) . Since βN − α h − ≥ N − α k ≥ b √ N and log N √ N ≥ N when N is sufﬁciently large, this upper bound becomes ∆ V ( s ) ≤ − b √ N .

Case 2: T > T . In this case V ( s ) = T . Similarly, a task departs from a queue of length i at a rate of N ( s i − s i +1 ) .The change in V ( s ) can be bounded as V ( s − e i ) − V ( s ) ≤ (cid:40) bN ( h − , if ≤ i < h, , if h ≤ i ≤ b. When a job arrives, under the event

FILL h − , V ( s + e a + · · · + e a k ) = V ( s ) − kbN ( h − , where we have used the fact that T > B . Again, P { FILL h − } ≥ − N by Lemma 1. Otherwise, i.e., when FILL h − is not true, V ( s + e a + · · · + e a k ) ≤ V ( s ) .Therefore, ∆ V ( s ) ≤ b (cid:88) i =1 N ( s i − s i +1 ) ( V ( s − e i ) − V ( s ))+ N λk (cid:18) − N (cid:19) (cid:18) − kbN ( h − (cid:19) ≤ bh − s − s h ) − bh − (cid:18) − N (cid:19) (cid:0) − βN − α (cid:1) ≤ bh − (cid:18) − (cid:18) βN − α + log N √ N (cid:19) − (cid:18) − N (cid:19) (cid:0) − βN − α (cid:1)(cid:19) , (25) = bh − (cid:18) − log N √ N + 1 N (cid:0) − βN − α (cid:1)(cid:19) ≤ − bh − N − √ N √ N , where (25) is due to the fact that s ≤ and the fact that s h ≥ βN − α + log N √ N following similar arguments as those inCase 1 noting that T > T > B . When N is sufﬁciently large, this upper bound becomes ∆ V ( s ) ≤ − b √ N , PREPRINT - A

7, 2020which completes the proof of the drift bound in Lemma 3.For this Lyapunov function V , under the notation in Lemma 2, we have that ν max ≤ kbN ( h − and f max ≤ bh − . Let E = S and j = √ N log N . Then by Lemma 2, the drift bound implies that P (cid:26) V ( S ) > B + 2 kb log N ( h − √ N (cid:27) = P (cid:26) V ( S ) > B + 2 kb ( h − N j (cid:27) ≤ (cid:18) h − √ N (cid:19) − j ≤ (cid:32)(cid:18) √ N (cid:19) √ N +1 (cid:33) − √ N +1 √ N log N ≤ e − log N , where the last inequality holds when N is sufﬁciently large. This completes the proof. In this section, we prove the lower-bound results in Theorems 3 and 4. We ﬁrst present the proofs of Theorems 3 and4 in Sections 5.1 and 5.2, respectively. Then we give the lemmas needed in Section 5.3. Due to space limitations, theproofs of the lemmas are given in Appendix B. Throughout this section, we assume that the assumptions in Theorem 3hold.
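Both the state-space collapse argument above and the lower-bound proofs in this section rely on the tail bound of Lemma 2. As a sanity check of how the bound is applied, the sketch below evaluates it on a toy chain that is not from the paper: an M/M/1/K queue with $V(n) = n$ and $\mathcal{E}$ taken to be the whole state space, so that the bound reads $P\{V \ge B + 2\nu_{\max} j\} \le \left(\frac{f_{\max}}{f_{\max}+\gamma}\right)^j$. The chain, the parameter values, and all function names are assumptions chosen for illustration.

```python
def stationary_mm1k(lam, mu, K):
    """Stationary distribution of an M/M/1/K queue (states 0..K)."""
    rho = lam / mu
    weights = [rho ** n for n in range(K + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def drift(n, lam, mu, K):
    """Drift of V(n) = n: sum over transitions of rate * (V(n') - V(n))."""
    up = lam if n < K else 0.0
    down = mu if n > 0 else 0.0
    return up * 1.0 + down * (-1.0)


def check_tail_bound(lam, mu, K, B=1, j_max=5):
    """Return (j, exact tail, Lemma-2-style bound) for j = 0..j_max."""
    pi = stationary_mm1k(lam, mu, K)
    nu_max = 1.0       # every transition changes V by exactly 1
    f_max = lam        # total upward drift contribution is at most lam
    gamma = mu - lam   # drift is at most -gamma whenever V >= B = 1
    assert all(drift(n, lam, mu, K) <= -gamma + 1e-12 for n in range(B, K + 1))
    results = []
    for j in range(j_max + 1):
        level = B + 2 * nu_max * j
        tail = sum(p for n, p in enumerate(pi) if n >= level)
        bound = (f_max / (f_max + gamma)) ** j
        results.append((j, tail, bound))
    return results


if __name__ == "__main__":
    for j, tail, bound in check_tail_bound(lam=0.8, mu=1.0, K=30):
        print(j, tail, bound)
```

In this toy case the bound equals $(\lambda/\mu)^j$, while the exact tail decays roughly like $(\lambda/\mu)^{2j+1}$, so the bound holds with room to spare.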

Proof.

The proof proceeds in an iterative fashion. The base case is that E [ S ] = λ = 1 − βN − α , which can be provedusing the Little’s law. We will then bound S − S i based on properties of S − S i − .For simplicity, let u = 2 kd , this is the ratio appearing in Lemma 6. Consider a Lyapunov function V ( s ) = s . Let h = O (log k ) and B = 1 − hβN − α . For some state s such that V ( s ) > B , it holds ∆ V ( s ) = (cid:88) s (cid:48) : s → s (cid:48) due to an arrival r s → s (cid:48) ( V ( s (cid:48) ) − V ( s ))+ (cid:88) s (cid:48) : s → s (cid:48) due to a departure r s → s (cid:48) ( V ( s (cid:48) ) − V ( s )) (a) ≤ uhβN − α − N ( s − s ) 1 N = uhβN − α − ( s − s ) , where (a) is due to Lemma 6.Let p = P (cid:8) S − S ≤ uh βN − α (cid:9) and E = (cid:8) s ∈ S| s − s > uh βN − α (cid:9) . Then p = P { S (cid:54)∈ E } . We now use the tail bound in Lemma 2. Assume that we follow the notation in the lemma.Consider the following two cases: • s (cid:54)∈ E , ∆ V ( s ) ≤ uhβN − α =: δ . • s ∈ E . Let γ = − ∆ V ( s ) . It holds γ ≥ uhβN − α ( h − .18 PREPRINT - A

7, 2020Following the deﬁnition in 2, it is easy to verify that ν max ≤ kN and f max ≤ for V ( s ) . Let j = (cid:16) N α βuh ( h − (cid:17) log N . By Lemma 2, it holds that P { V ( S ) > B + 2 ν max j } ≤ (cid:18) f max f max + γ (cid:19) j + (cid:18) δγ + 1 (cid:19) P { S (cid:54)∈ E }≤ (cid:18) f max f max + γ (cid:19) j + hh − p . Besides, when N is large enough, (cid:18) f max f max + γ (cid:19) j ≤ (cid:0) uhβN − α ( h − (cid:1) − ( N α βuh ( h − ) log N ≤ e − log N . As a result, P { V ( S ) > B + 2 ν max j } ≤ N − log N + hh − p . Since < α < . and k = e o ( √ log N ) , − ( h − βN − α > − hβN − α + 2 kN (cid:18) N α βuh ( h − (cid:19) log N when N is large enough. It follows that P (cid:8) V ( S ) > − ( h − βN − α (cid:9) ≤ P { V ( S ) > B + 2 ν max j }≤ N − log N + hh − p However, by Lemma 5, P (cid:8) V ( S ) > − ( h − βN − α (cid:9) ≥ − h − . Therefore, hh − p + N − log N ≥ h − h − , and thus P (cid:8) S − S ≤ uh βN − α (cid:9) = p ≥ h − h − N − log N . Let b q = u q − h q βN − α for an integer q > . Deﬁne a sequence a q , such that a = 0 , a = 1 and a q = ( q − a q − +2 for q > . We now have P { S − S ≤ a b } ≥ h − h − N − log N . We can use Lemma 7 successively to establish P { S − S q ≤ a q b q } ≥ (cid:18) h − h (cid:19) q − − ( q − N − log N for all ≤ q ≤ h .Let us condition on S − S h ≤ a h b h . For ease of notation, let p c = (cid:0) h − h (cid:1) h − − ( h − N − log N , which is a lowerbound on the probability of the condition. Note that E [ S ] ≤ E [ S | S − S h ≤ a h b h ] P { S − S h ≤ a h b h } + 1 · P { S − S h > a h b h } . Thus E [ S | S − S h ≤ a h b h ] ≥ − βN − α − (1 − P { S − S h ≤ a h b h } ) P { S − S h ≤ a h b h }≥ − βp c N − α . PREPRINT - A

7, 2020We can also see that P (cid:40) h (cid:88) i =1 S i ≥ h − d (cid:41) ≥ P (cid:40) h (cid:88) i =1 S i ≥ h − d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) S − S h ≤ a h b h (cid:41) P { S − s h ≤ a h b h }≥ p c P (cid:26) hS − h ( S − S h ) ≥ h − d (cid:12)(cid:12)(cid:12)(cid:12) S − S h ≤ a h b h (cid:27) ≥ p c P (cid:26) S ≥ − dh + a h b h (cid:12)(cid:12)(cid:12)(cid:12) S − S h ≤ a h b h (cid:27) . (26)Utilizing the Markov inequality gives (26) ≥ p c (cid:18) − dh − dh E [ S | S − S h ≤ a h b h ]1 − dha h b h (cid:19) ≥ p c (cid:18) − βp c dh − dha h b h N − α (cid:19) . Recall that a q = ( q − a q − + 2 for q > and a = 1 . We have a h ≤ h h , and thus a h b h ≤ βu h h h N − α . As d = e o (log N/ log k ) , k = e o ( √ log N ) , h = O (log k ) , we have ln( a h b h ) = − Ω(log N ) . Furthermore, since ln(3 dh ) = o (log N/ log k ) + O (log k ) , α > , it holds − βp c dh − dha h b h N − α ≥ if N is sufﬁciently large. Note that p c is equal to (cid:0) h − h (cid:1) h − − ( h − N − log N which converges to e . We couldconclude that when N goes to inﬁnity, we have P (cid:40) h (cid:88) i =1 S i ≥ h − d (cid:41) ≥ e . Proof.

Let h = 12 e ln k . Then h = O (log k ) . Suppose that we have an incoming job. By Theorem 3 and the PASTAproperty of a Poisson arrival process, with probability at least e , this job will see a state s such that (cid:80) hi =1 s i ≥ h − d . By Lemma 8, the dispatcher will route at least one task of this job into a queue of length at least h + 1 with probability − o (1) . Let T be the delay of the job. Then it holds for a large enough N , E [ T ] ≥ k (1 − o (1)) ≥ k, which completes the proof. Assume that the system is stable. Then for any x > , P { S < − x } ≤ βN − α x . lemma 6. Let (cid:96) be a threshold such that ≤ (cid:96) ≤ h with h = O (log k ) . Suppose that an incoming job sees astate s such that (cid:80) (cid:96)i =1 s i ≥ (cid:96) − x , where x = Ω( hN − α ) and x = e − Ω(log N ) . Consider a Lyapunov function V (cid:96) ( s ) = s + s + · · · + s (cid:96) . It holds that when N is sufﬁciently large, (cid:88) s (cid:48) : s → s (cid:48) due to an arrival r s → s (cid:48) ( V (cid:96) ( s (cid:48) ) − V (cid:96) ( s )) ≤ kdx, where r s → s (cid:48) is the transition rate, and s → s (cid:48) due to an arrival means that s will move to state s (cid:48) on the Markovchain only if there is an incoming job. PREPRINT - A

7, 2020Lemma 7 below is a key in establishing the iterative proof. This lemma relates S q to S q − for ≤ i ≤ h . lemma 7. Deﬁne u = 2 kd and b q = u q − h q βN − α for q ∈ N . Deﬁne a sequence a q , such that a = 0 , a = 1 and a q = ( q − a q − + 2 for q > . For any q with ≤ q ≤ h , if P { S − S q − ≤ a q − b q − } ≥ (cid:18) h − h (cid:19) q − − ( q − N − log N , then P { S − S q ≤ a q b q } ≥ (cid:18) h − h (cid:19) q − − ( q − N − log N . Lemma 8 below complements the probability bound in Lemma 1. Recall that

FILL h denotes the event that all the k tasks of an incoming job are assigned to queueing positions below a threshold h . Lemma 8 gives a condition on thetotal queue length for FILL h to happen with low probability. lemma 8. Suppose an incoming job sees a state s such that (cid:80) hi =1 s i > h − d . Then when N is sufﬁciently large, P { FILL h } = o (1) . We studied a load balancing algorithm, batch-ﬁlling, for a system where each job consists of k parallel tasks inthe sub-Halﬁn-Whitt regime of heavy trafﬁc. We showed that to achieve zero queueing delay for such jobs, weonly need a probe overhead of d = ω (cid:16) − λ ) log k (cid:17) under proper conditions. Existing work has shown that d = ω (cid:16) − λ (cid:17) is necessary for achieving zero queueing delay when each job consists of a single task. Therefore, with aparallel structure, we save a factor of log k communication overhead. We also established a lower-bound result on theprobe overhead d , where we showed that d = Ω (cid:16) exp (cid:16) log N log k (cid:17)(cid:17) is necessary for achieving zero queueing delay. Aninteresting future direction is to extend our results to general service time distributions, where it is possible to get moresavings when the distributions have a heavy tail. References [1] S. Banerjee and D. Mukherjee. Join-the-shortest queue diffusion limit in halﬁn–whitt regime: Tail asymptoticsand scaling of extrema.

Ann. Appl. Probab. , 29(2):1262–1309, 2019.[2] D. Bertsimas, D. Gamarnik, and J. N. Tsitsiklis. Performance of multiclass markovian queueing networks viapiecewise linear lyapunov functions.

Ann. Appl. Probab. , 11(4):1384–1428, 11 2001.[3] A. Braverman. Steady-state analysis of the join the shortest queue model in the halﬁn-whitt regime. arXiv:1801.05121 [math.PR] , 2018.[4] A. Braverman and J. Dai. Stein’s method for steady-state diffusion approximations of m/ Ph /n + m systems. Ann. Appl. Probab. , 27:550–581, Feb. 2017. doi: 10.1214/16-AAP1211.[5] A. Braverman, J. Dai, and J. Feng. Stein’s method for steady-state diffusion approximations: an introductionthrough the erlang-a and erlang-c models.

Stoch. Syst. , 6(2):301–366, 2017.[6] P. Eschenfeldt and D. Gamarnik. Join the shortest queue with many servers. the heavy-trafﬁc asymptotics.

Math.Oper. Res. , 43(3):867–886, 2018.[7] D. Gamarnik, J. N. Tsitsiklis, and M. Zubeldia. Delay, memory, and messaging tradeoffs in distributed servicesystems. In

Proc. ACM SIGMETRICS/PERFORMANCE Jt. Int. Conf. Measurement and Modeling of ComputerSystems , pages 1–12. ACM, 2016.[8] N. Gast. Expected values estimated via mean-ﬁeld approximation are 1/n-accurate. In

Proc. ACM Measurementand Analysis of Computing Systems (POMACS) , volume 45, pages 50–50. ACM, 2017.[9] N. Gast and B. Van Houdt. A reﬁned mean ﬁeld approximation. In

Proc. ACM Measurement and Analysis ofComputing Systems (POMACS) , volume 1, page 33. ACM, 2017.[10] V. Gupta and N. Walton. Load balancing in the nondegenerate slowdown regime.

Oper. Res. , 67(1):281–294,2019. 21

PREPRINT - A

PRIL

7, 2020[11] I. Gurvich. Diffusion models and steady-state approximations for exponentially ergodic markovian queues.

Ann.Appl. Probab. , 24(6):2527–2559, 2014.[12] S. Halﬁn and W. Whitt. Heavy-trafﬁc limits for queues with many exponential servers.

Oper. Res. , 29(3):567–588, 1981.[13] W. Hoeffding. Probability inequalities for sums of bounded random variables.

J. Amer. Stat. Assoc. , 58(301):13–30, 1963.[14] G. Kamath. Bounds on the expectation of the maximum of samples from a gaussian[online], 2015. URL .[15] X. Liu.

Steady State Analysis of Load Balancing Algorithms in Heavy Trafﬁc Regime . PhD thesis, Arizona StateUniv., Tempe, AZ, USA, 2019.[16] X. Liu and L. Ying. A simple steady-state analysis of load balancing algorithms in the sub-halﬁn-whitt regime. arXiv:1804.02622 [math.PR] , 2018.[17] X. Liu and L. Ying. On achieving zero delay with power-of-d-choices load balancing. In

Proc. IEEE Int. Conf.Computer Communications (INFOCOM) , pages 297–305, Honolulu, HI, USA, Apr. 2018.[18] Y. Lu, Q. Xie, G. Kliot, A. Geller, J. R. Larus, and A. Greenberg. Join-Idle-Queue: A novel load balancingalgorithm for dynamically scalable web services.

Perform. Eval., 68(11):1056–1071, Nov. 2011.

[19] M. Lugo. The expectation of the maximum of exponentials [online], 2011. URL .

[20] M. Mitzenmacher. The power of two choices in randomized load balancing. IEEE Trans. Parallel Distrib. Syst., 12(10):1094–1104, 2001.

[21] D. Mukherjee, S. C. Borst, J. S. Van Leeuwaarden, and P. A. Whiting. Universality of power-of-d load balancing in many-server systems. Stoch. Syst., 8(4):265–292, 2018.

[22] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: distributed, low latency scheduling. In Proc. ACM Symp. Operating Systems Principles (SOSP), pages 69–84. ACM, 2013.

[23] C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.

[24] A. L. Stolyar. Pull-based load distribution in large-scale heterogeneous service systems. Queueing Syst., 80(4):341–361, 2015.

[25] A. L. Stolyar. Tightness of stationary distributions of a flexible-server system in the Halfin-Whitt asymptotic regime. Stoch. Syst., 5(2):239–267, 2015.

[26] A. Verma, L. Pedrosa, M. R. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes. Large-scale cluster management at Google with Borg. In Proc. European Conf. Computer Systems (EuroSys), 2015.

[27] N. D. Vvedenskaya, R. L. Dobrushin, and F. I. Karpelevich. Queueing system with selection of the shortest of two queues: An asymptotic approach. Problems of Information Transmission, 32(1):15–27, 1996.

[28] W. Wang, S. T. Maguluri, R. Srikant, and L. Ying. Heavy-traffic delay insensitivity in connection-level models of data transfer with proportionally fair bandwidth sharing. In Proc. ACM SIGMETRICS Int. Conf. Measurement and Modeling of Computer Systems, volume 45, pages 232–245. ACM, 2018.

[29] W. Wang, M. Harchol-Balter, H. Jiang, A. Scheller-Wolf, and R. Srikant. Delay asymptotics and bounds for multitask parallel jobs. Queueing Syst., 91(3):207–239, Apr. 2019.

[30] R. R. Weber. On the optimal assignment of customers to parallel servers. J. Appl. Probab., 15(2):406–413, 1978.

[31] W. Winston. Optimality of the shortest line discipline. J. Appl. Probab., 14(1):181–189, 1977.

[32] L. Ying. On the approximation error of mean-field models. ACM SIGMETRICS Perform. Evaluation Rev., 44(1):285–297, 2016.

[33] L. Ying. Stein's method for mean field approximations in light and heavy traffic regimes. ACM SIGMETRICS Perform. Evaluation Rev., 45(1):49, 2017.

[34] L. Ying, R. Srikant, and X. Kang. The power of slightly more than one sample in randomized load balancing. In Proc. IEEE Int. Conf. Computer Communications (INFOCOM), pages 1131–1139, Kowloon, Hong Kong, Apr. 2015.


A Proof of Lemma 4

Proof. The proof idea is similar to that in [14]. Let $M_X(s)$ be the moment generating function of a random variable $X$. By assumption, $Y_i = \sum_{j=1}^{n_i} X_{i,j}$, and the $X_{i,j}$, $1 \le i \le m$, $1 \le j \le n_i$, are all independent and exponentially distributed with mean $1$. Therefore, for any $1 \le i \le m$, $1 \le j \le n_i$ and any $s < 1$,

$$M_{X_{i,j}}(s) = \mathbb{E}\left[e^{sX_{i,j}}\right] = \frac{1}{1-s}, \qquad M_{Y_i}(s) = \mathbb{E}\left[e^{sY_i}\right] = \left(\frac{1}{1-s}\right)^{n_i}.$$

Let $q = \max\{n_1, \dots, n_m\}$. It holds that for any $s \in (0,1)$,

\begin{align}
\exp\left(s\,\mathbb{E}\left[\max_{1 \le j \le m} Y_j\right]\right) &\le \mathbb{E}\left[\exp\left(s \max_{1 \le j \le m} Y_j\right)\right] \tag{27}\\
&= \mathbb{E}\left[\max_{1 \le j \le m} \exp(sY_j)\right] \tag{28}\\
&\le \sum_{j=1}^{m} \mathbb{E}\left[\exp(sY_j)\right] \tag{29}\\
&\le m\left(\frac{1}{1-s}\right)^{q}, \tag{30}
\end{align}

where (27) is due to Jensen's inequality and (29) is true since the maximum is upper bounded by the sum. As a result,

$$\mathbb{E}\left[\max_{1 \le j \le m} Y_j\right] \le \frac{\ln m}{s} + q \cdot \frac{-\ln(1-s)}{s}.$$

Since we assume that $q = o(\log m)$, we can write $q$ as $q = (\ln m) \cdot \ell(m)$ where $\ell(m) \to 0^{+}$ as $m \to \infty$. Let $s = 1 - \ell(m)$; then

$$\mathbb{E}\left[\max_{1 \le j \le m} Y_j\right] \le \frac{\ln m}{1-\ell(m)}\bigl(1 - \ell(m)\ln \ell(m)\bigr) = (\ln m)\left(1 + \frac{\ell(m)}{1-\ell(m)}\right)\bigl(1 - \ell(m)\ln \ell(m)\bigr).$$

Note that $\lim_{m \to \infty} \ell(m)\ln \ell(m) = 0$. Then as $m \to \infty$,

$$\mathbb{E}\left[\max_{1 \le j \le m} Y_j\right] \le (\ln m)(1 + o(1)),$$

which completes the proof.

B Proofs of Lemmas 5–8
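As a numerical sanity check on the Lemma 4 bound from Appendix A, the following Python sketch compares a Monte Carlo estimate of $\mathbb{E}[\max_j Y_j]$ with the right-hand side $\frac{\ln m}{s} + q \cdot \frac{-\ln(1-s)}{s}$ evaluated at $s = 1 - \ell(m)$. The function names, the values $m = 200$ and $q = 2$, the trial count, and the seed are illustrative assumptions and do not come from the paper.

```python
import math
import random


def lemma4_bound(m, q, s):
    """Right-hand side ln(m)/s + q * (-ln(1 - s)) / s, defined for 0 < s < 1."""
    if not 0.0 < s < 1.0:
        raise ValueError("s must lie in (0, 1)")
    return math.log(m) / s - q * math.log(1.0 - s) / s


def simulated_mean_max(sizes, trials, rng):
    """Monte Carlo estimate of E[max_j Y_j], where Y_j sums sizes[j] Exp(1) draws."""
    total = 0.0
    for _ in range(trials):
        total += max(
            sum(rng.expovariate(1.0) for _ in range(n)) for n in sizes
        )
    return total / trials


# Illustrative parameters (assumptions): m jobs, each with exactly q tasks.
m, q = 200, 2
sizes = [q] * m
ell = q / math.log(m)   # q = (ln m) * ell(m)
s = 1.0 - ell           # the choice of s made in the proof

rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = simulated_mean_max(sizes, trials=2000, rng=rng)
bound = lemma4_bound(m, q, s)

print(f"Monte Carlo estimate of E[max Y_j]: {estimate:.3f}")
print(f"Lemma 4 bound at s = 1 - ell(m):     {bound:.3f}")
print(f"ln m:                                {math.log(m):.3f}")
```

For small $m$ the bound is loose, since $\ell(m)$ is not yet close to zero; the $(1 + o(1))$ factor only tightens as $m$ grows with $q = o(\log m)$.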

B.1 Proof of Lemma 5

Proof. By Little's law, it holds that $\mathbb{E}[S_1] = \lambda = 1 - \beta N^{-\alpha}$. Then $\mathbb{E}[1 - S_1] = \beta N^{-\alpha}$. Therefore, by Markov's inequality, for any $x > 0$,

$$\mathbb{P}\{S_1 < 1 - x\} = \mathbb{P}\{1 - S_1 > x\} \le \frac{\beta N^{-\alpha}}{x}.$$

B.2 Proof of Lemma 6

Proof. Suppose that an arrival sees a state $s$. Given $\sum_{i=1}^{\ell} s_i \ge \ell - x$, we have $s_\ell \ge 1 - x$ since $s_i \le 1$ for all $1 \le i \le \ell$. Without loss of generality, we can think of the batch-filling policy as sampling the $kd$ queues one by one. During the sampling, we always choose at most $kd$ servers of length at least $\ell$. The probability that all $kd$ sampled servers have length at least $\ell$ is thus greater than or equal to

$$\left(\frac{N(1-x) - kd}{N}\right)^{kd} = \left(1 - \left(x + \frac{kd}{N}\right)\right)^{kd}.$$

Recall that by the assumptions in Theorem 3, we have $x = e^{-\Omega(\log N)}$, $kd = o(N^{1-\alpha})$, and thus $-\left(x + \frac{kd}{N}\right) > -1$ when $N$ is sufficiently large. Furthermore, applying Bernoulli's inequality and the assumption that $x = \Omega(hN^{-\alpha})$, it holds that

$$\left(1 - \left(x + \frac{kd}{N}\right)\right)^{kd} \ge 1 - kd\left(x + \frac{kd}{N}\right) \ge 1 - 2xkd$$

for a large $N$. Note that if we put all tasks of this arrival into servers of length at least $\ell$, we will not affect the value of $V_\ell(s)$. As a result,

$$\sum_{s' :\, s \to s' \text{ due to an arrival}} r_{s \to s'}\bigl(V_\ell(s') - V_\ell(s)\bigr) \le (1 - 2kdx) \cdot 0 \cdot \frac{\lambda}{k} + 2kdx \cdot k \cdot \frac{\lambda}{k} \le 2kdx,$$

which completes the proof.

B.3 Proof of Lemma 7

Proof.

The proof is close to that of Theorem 3. Recall that for each ≤ (cid:96) ≤ h and state s ∈ S , we deﬁne theLyapunov function V (cid:96) ( s ) = (cid:96) (cid:88) i =1 s i . For q such that ≤ q ≤ h , by assumption, P { S − S q − ≤ a q − b q − } ≥ (cid:18) h − h (cid:19) q − − ( q − N − log N . It holds P { V q − ( S ) < q − − (( q − a q − + 1) b q − }≤ P { V q − ( S ) < q − − (( q − a q − + 1) b q − ,S − S q − ≤ a q − b q − } + P { S − S q − > a q − b q − }≤ P { ( q − S < q − − b q − } + 1 − (cid:18) h − h (cid:19) q − + ( q − N − log N ≤ q − u q − h q − + 1 − (cid:18) h − h (cid:19) q − + ( q − N − log N . (31)The last inequality uses Lemma 5 and b q − = u q − h q − βN − α . Now let B q − = q − − (( q − a q − + 2) b q − . We can see that B q − = q − − a q b q − . For a state s such that V q − ( s ) > B q − , it holds ∆ V q − ( s ) = (cid:88) s (cid:48) : s → s (cid:48) due to an arrival r s → s (cid:48) ( V q − ( s (cid:48) ) − V q − ( s ))+ (cid:88) s (cid:48) : s → s (cid:48) due to a departure r s → s (cid:48) ( V q − ( s (cid:48) ) − V q − ( s )) . Recall that we deﬁne u = 2 kd and b q = u q − h q βN − α . As V q − ( s ) > q − − a q b q − , by Lemma 6, it holds ∆ V q − ( s ) ≤ kda q b q − − ( s − s q )= a q u q − h q − βN − α − ( s − s q ) . Let P { S − S q ≤ a q b q } = p q , E q − = { s ∈ S | s − s q > a q b q } . Then P { S (cid:54)∈ E q − } = p q . For a state s , considerthe following two cases. 24 PREPRINT - A


7, 2020 • s (cid:54)∈ E q − , ∆ V q − ( s ) ≤ a q u q − h q − βN − α =: δ . • s ∈ E q − . Let γ = − ∆ V q − ( s ) . It holds γ ≥ a q u q − h q − βN − α ( h − . We then utilize the tail bound, Lemma 2. Following the deﬁnition in Lemma 2, it is easy to verify that ν max ≤ kN , f max ≤ for the Lyapunov function V q − ( s ) . Let j q − = (cid:18) N α a q u q − h q − ( h − β (cid:19) log N. Using Lemma 2, P { V q − ( S ) > B q − + 2 ν max j q − }≤ (cid:18) f max f max + γ (cid:19) j q − + (cid:18) δγ + 1 (cid:19) P { S (cid:54)∈ E q − }≤ (cid:18) f max f max + γ (cid:19) j q − + hh − p q . Note that when N is sufﬁciently large, (cid:18) f max f max + γ (cid:19) j q − ≤ e − log N . Besides, we assume that < α < . , k = e o ( √ log N ) and h = O (log k ) . As a result, for a large N , P { V q − ( S ) ≥ q − − (( q − a q − + 1) b q − }≤ P { V q − ( S ) > B + 2 ν max j q − }≤ e − log N + hh − p q . Together with Eq.(31), we have (cid:18) h − h (cid:19) q − − q − u q − h q − − ( q − N − log N ≤ P { V q − ( S ) > q − − (( q − a q − + 1) b q − }≤ e − log N + hh − p q We can conclude that for a large N , P { S − S q ≤ a q b q } = p q ≥ (cid:18) h − h (cid:19) q − − ( q − N − log N , which completes the proof. B.4 Proof of Lemma 8

Proof. We use a similar argument as the proof of Lemma 1. Suppose that an arrival sees a state $s$. By assumption, it holds that

$$\sum_{i=1}^{h} s_i \ge h - \frac{1}{3d}.$$

Let $X_1, \dots, X_{kd}$ be the numbers of places below $h$ in each sampled server. The goal is to show that

$$\mathbb{P}\{\mathrm{FILL}_h\} = \mathbb{P}\left\{\sum_{i=1}^{kd} X_i \ge k\right\} = o(1)$$

when $N$ is large enough.

For each integer $x$ such that $1 \le x \le h$, $\mathbb{P}\{X_i = x\} = s_{h-x} - s_{h-x+1}$ (with the convention $s_0 = 1$), and $\mathbb{P}\{X_i = 0\} = s_h$. Since we are sampling without replacement, $X_1, \dots, X_{kd}$ are not independent. Nevertheless, by a result of Hoeffding [13, Theorem 4], we have

$$\mathbb{E}\left[f\left(\sum_{i=1}^{kd} X_i\right)\right] \le \mathbb{E}\left[f\left(\sum_{i=1}^{kd} Y_i\right)\right]$$

for any continuous and convex function $f(\cdot)$, where $Y_1, \dots, Y_{kd}$ are i.i.d. and follow the same distribution as $X_1$. Take $f(\cdot)$ to be $f(x) = e^{tx}$, where $t$ is some positive value. It then holds that

\begin{align*}
\mathbb{P}\{\mathrm{FILL}_h\} = \mathbb{P}\left\{\sum_{i=1}^{kd} X_i \ge k\right\} &= \mathbb{P}\left\{e^{t\sum_{i=1}^{kd} X_i} \ge e^{tk}\right\}\\
&\le e^{-tk}\prod_{i=1}^{kd} \mathbb{E}\left[e^{tY_i}\right]\\
&= e^{-tk}\prod_{i=1}^{kd}\left(1 + \sum_{j=1}^{h}\left(e^{t(h-j+1)} - 1\right)(s_{j-1} - s_j)\right).
\end{align*}

Since $1 + x \le e^{x}$ for all $x > 0$, we further have

\begin{equation}
\mathbb{P}\{\mathrm{FILL}_h\} \le e^{-tk}\exp\left(kd\sum_{j=1}^{h}\left(e^{t(h-j+1)} - 1\right)(s_{j-1} - s_j)\right). \tag{32}
\end{equation}

Rearranging the sum in (32), we get

\begin{align}
\sum_{j=1}^{h}\left(e^{t(h-j+1)} - 1\right)(s_{j-1} - s_j) &= e^{th} - 1 - \sum_{j=1}^{h} s_j\left(e^{t(h-j+1)} - e^{t(h-j)}\right) \notag\\
&= e^{th} - 1 - (e^{t} - 1)\sum_{j=1}^{h} s_j e^{t(h-j)}. \tag{33}
\end{align}

Recall that $\sum_{j=1}^{h} s_j \ge h - \frac{1}{3d}$ and $1 \ge s_1 \ge s_2 \ge \cdots \ge s_h \ge 0$. Eq. (33) is maximized when $s_1 = s_2 = \cdots = s_h = 1 - \frac{1}{3dh}$, and thus

$$(33) \le \left(e^{th} - 1\right)\frac{1}{3dh}.$$
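The rearrangement in (33) and the resulting cap can be checked numerically. The Python sketch below assumes the convention $s_0 = 1$ and reads the constraint as $\sum_{j=1}^{h} s_j \ge h - \frac{1}{3d}$; the function names and the values of $h$, $d$, and $t$ are illustrative assumptions, not taken from the paper.

```python
import math


def sum_before_rearranging(s, t):
    """sum_{j=1}^h (e^{t(h-j+1)} - 1) * (s_{j-1} - s_j), with s_0 = 1."""
    h = len(s)
    padded = [1.0] + list(s)  # padded[j] holds s_j, padded[0] is s_0 = 1
    return sum(
        (math.exp(t * (h - j + 1)) - 1.0) * (padded[j - 1] - padded[j])
        for j in range(1, h + 1)
    )


def sum_after_rearranging(s, t):
    """e^{th} - 1 - (e^t - 1) * sum_{j=1}^h s_j e^{t(h-j)}, the form in (33)."""
    h = len(s)
    weighted = sum(s[j - 1] * math.exp(t * (h - j)) for j in range(1, h + 1))
    return math.exp(t * h) - 1.0 - (math.exp(t) - 1.0) * weighted


# Illustrative parameters (assumptions).
h, d, t = 5, 4, 0.3
cap = (math.exp(t * h) - 1.0) / (3 * d * h)

# The flat state that the proof identifies as the maximizer.
flat = [1.0 - 1.0 / (3 * d * h)] * h
# Another feasible state with the same total: all slack sits in s_h.
skewed = [1.0] * (h - 1) + [1.0 - 1.0 / (3 * d)]

for name, state in (("flat", flat), ("skewed", skewed)):
    lhs = sum_before_rearranging(state, t)
    rhs = sum_after_rearranging(state, t)
    print(f"{name}: before={lhs:.6f} after={rhs:.6f} cap={cap:.6f}")
```

Both forms of the sum agree on each state, the flat state attains the cap exactly, and the skewed state stays below it, which matches the monotonicity argument: the weights $e^{t(h-j)}$ decrease in $j$, so moving mass toward small $j$ is blocked only by the ordering constraint on the $s_j$.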