Achieving Zero Asymptotic Queueing Delay for Parallel Jobs
Dispatching Parallel Jobs to Achieve Zero Queuing Delay
Wentao Weng
Institute for Interdisciplinary Information Sciences, Tsinghua University
Weina Wang
Computer Science Department, Carnegie Mellon University
April 7, 2020

Abstract
Zero queueing delay is highly desirable in large-scale computing systems. Existing work has shown that it can be asymptotically achieved by using the celebrated Power-of-$d$-choices (Po$d$) policy with a probe overhead $d = \omega\left(\frac{\log N}{1-\lambda}\right)$, and that it is impossible when $d = O\left(\frac{1}{1-\lambda}\right)$, where $N$ is the number of servers and $\lambda$ is the load of the system. However, these results are based on a model where each job is an indivisible unit, which does not capture the parallel structure of jobs in today's predominant parallel computing paradigm.

This paper thus considers a model where each job consists of a batch of parallel tasks. In this model, we say a policy leads to zero (asymptotic) queueing delay if the job delay under the policy approaches the delay given by the max of service times of its tasks, as if tasks entered service right upon arrival. We show that zero queueing delay for such parallel jobs can be achieved using a variant of the Po$d$ policy, the batch-filling policy, with a probe overhead $d = \omega\left(\frac{1}{(1-\lambda)\log k}\right)$, where $k$ is the number of tasks in each job. This result demonstrates that for parallel jobs, zero queueing delay can be achieved with a smaller probe overhead. We also establish a lower bound on the minimum $d$ needed: we show that zero queueing delay cannot be achieved if $d = e^{o\left(\frac{\log N}{\log k}\right)}$.

In view of the rise in the amount of latency-critical workloads in today's datacenters [26, 22], load-balancing policies with ultra-low latency have attracted great attention (see, e.g., [21, 16, 7, 17, 15]). In particular, it is highly desirable to have a policy under which the delay due to queueing is minimal.

In a classical setting of load-balancing, the celebrated greedy policy, Join-the-Shortest-Queue (JSQ), achieves a minimal queueing delay in the sense that the queueing delay is diminishing as the system becomes large, even in heavy-traffic regimes [31, 30, 21].
Therefore, we say that JSQ achieves a zero (asymptotic) queueing delay. Specifically, consider a system with $N$ servers where jobs arrive into the system following a Poisson process. Each server has its own queue and serves jobs in the queue in a First-Come-First-Serve manner. Under JSQ, each incoming job is assigned to a server with the shortest queue length. Then the expected time (in steady state) a job spends in the queue before entering service goes to zero as $N$ goes to infinity.

However, a drawback of JSQ is that it has a high communication overhead, which can cancel out its advantage of achieving zero queueing delay. For assigning each job, JSQ requires the queue-length information of all the $N$ servers, which will be referred to as having a probe overhead of $N$. In a typical cluster of servers, $N$ is in the tens of thousands range, resulting in intolerable delay due to communication [26, 22].

A load-balancing algorithm that provides tradeoffs between queueing delay and communication overhead is the Power-of-$d$-choices (Po$d$) policy [27, 20]. For each incoming job, Po$d$ selects $d$ queues out of $N$ queues uniformly at random, and assigns the job to a shortest queue among the $d$ selected queues. Therefore, Po$d$ has a probe overhead of $d$. It is easy to see that when $d = N$, Po$d$ coincides with JSQ, thus achieving a zero queueing delay. However, a fundamental question is: Can zero queueing delay be achieved by Po$d$ with a $d$ smaller than $N$? Or, what is the smallest $d$ for achieving zero queueing delay?

This question has been recently answered in a line of research [21, 16, 17, 15]. In particular, the following results are the most relevant to our paper. Suppose the job arrival rate is $N\lambda$ and job service times are exponentially distributed with rate $1$. Then the load of the system is $\lambda$. Consider a heavy-traffic regime with $\lambda = 1 - \beta N^{-\alpha}$, where $\alpha$ and $\beta$ are constants with $0 < \beta < 1$ and $0 < \alpha < 1$. It has been shown that Po$d$ achieves zero queueing delay when $d = \Omega\left(\frac{\log N}{1-\lambda}\right)$, and does not have zero queueing delay when $d = O\left(\frac{1}{1-\lambda}\right)$.

Although these prior results provide great insights into achieving zero queueing delay, they are all for the classical setting where each job is an indivisible unit. In today's applications, parallel computing has emerged as a dominant paradigm to support the rapidly growing data volume and computation demands. A job with a parallel structure is no longer a single unit, but consists of multiple components that can run in parallel, resulting in system dynamics that are very different from those of the non-parallel model. Therefore, it is of great importance to revisit the fundamental question on the minimum probe overhead needed for achieving zero queueing delay, and answer it under the new parallel paradigm.

In this paper, to capture the parallel structure, we consider a model where each job consists of $k$ tasks. Tasks can run on different servers in parallel, and a job is completed when all its tasks are completed. We assume that task service times are independent and exponentially distributed with rate $1$. Recall that $N$ denotes the number of servers in the system. We assume that $k$ grows with $N$ (with the exact assumption specified later on), but we suppress this dependency in notation for conciseness.

Zero queueing delay for parallel jobs
We are interested in achieving zero queueing delay since this is the regime where the delay due to queueing is minimal and jobs are only subject to delay due to their inherent sizes. In the non-parallel model, it is clear that the delay due to queueing for a job is just the time a job spends waiting in the queue. However, when a job consists of multiple tasks, quantifying the delay due to queueing is more complicated since different tasks experience different queueing times.

In this paper, we propose the following notion of zero queueing delay for parallel jobs. Let $X_1, X_2, \ldots, X_k$ denote the service times of a job's $k$ tasks. Then if a job does not experience any queueing, its delay is given by $T^* = \max\{X_1, X_2, \ldots, X_k\}$. This is the job delay when all the tasks of the job enter service immediately, so we call it the inherent delay. Let $T$ denote the delay of a job in steady state. Then the delay due to queueing is characterized by the difference $E[T - T^*]$. We say jobs have zero queueing delay if
$$\frac{E[T - T^*]}{E[T^*]} \to 0 \quad \text{as } N \to \infty, \tag{1}$$
i.e., the queueing delay takes a diminishing fraction of the inherent delay. Interestingly, under this notion, zero queueing delay allows tasks in a job to wait in queues for non-negligible times.

Probe overhead and batch-filling policy
When a job arrives into the system, a task-assigning policy samples some queues to obtain their queue length information, and then decides how to assign the $k$ tasks to the sampled servers. If the policy samples $kd$ queues, then we say its probe overhead [34, 22] is $d$ since $d$ is the average number of samples per task.

In this paper, we focus on a policy called batch-filling, which has been shown to outperform the naive implementation of Po$d$ and also another policy called batch-sampling for parallel jobs [34, 22]. Batch-filling assigns the tasks one by one to the shortest queue, where the queue length is updated after every task assignment.

Challenges and our results
For the non-parallel model, to show a zero queueing delay, it suffices to characterize the fraction of non-idle servers since a job can only land in one single queue. However, for parallel jobs, crucially, zero queueing delay of jobs can be achieved even when tasks have non-zero queueing delays. As a result, the analysis becomes much harder: we need to characterize the fractions of servers with queue lengths ranging from zero to a certain threshold. More specifically, the threshold here is $o(\log k)$. A key in our analysis is an interesting state-space collapse result that we discover. This result enables us to use the powerful framework of Stein's method [4, 5].

We consider a system with a job arrival rate of $N\lambda/k$. Then $\lambda$ is the load of the system. We focus on a heavy-traffic regime where $\lambda = 1 - \beta N^{-\alpha}$ with $0 < \beta < 1$ and $0 < \alpha < 0.5$, i.e., the sub-Halfin-Whitt regime. Note that the larger $\alpha$ is, the faster the load approaches $1$ as $N \to \infty$. All the order notation and asymptotic results in this paper are with respect to the regime that $N \to \infty$.

Our main result is that zero queueing delay is achieved when the probe overhead $d$ satisfies
$$d = \omega\left(\frac{1}{(1-\lambda)\log k}\right), \tag{2}$$
where the number of tasks $k$ satisfies $k = o\left(\frac{N^{0.5-\alpha}}{\log N}\right)$ and $k \log k = \Omega(\log N)$. For example, this includes $k = \log N$, or $k = N^{c}$ for a constant $c > 0$ when $\alpha < 0.5 - c$, and so on.

Recall that for the non-parallel model, a lower bound result is that zero queueing delay cannot be achieved when the probe overhead is $O\left(\frac{1}{1-\lambda}\right)$. In contrast, for parallel jobs, the probe overhead in (2) can be of a smaller order than $\frac{1}{1-\lambda}$.

We also prove a lower bound result on the minimum $d$ needed: zero queueing delay is not achievable if
$$d = e^{o\left(\frac{\log N}{\log k}\right)}, \tag{3}$$
where $k$ satisfies $k = e^{o\left(\sqrt{\log N}\right)}$ and $k = \omega(1)$. To establish this lower bound, we utilize the tail bound given by a Lyapunov function in a novel way. The proof technique we develop may be of independent interest.

Related works
Load-balancing systems for non-parallel jobs have been extensively studied in the literature. It is well-known that JSQ is delay-optimal under a wide range of assumptions [31, 30]. Although getting exact-form stationary distributions is typically not feasible for most load-balancing policies, many results and approximations are known for various asymptotic regimes.

For JSQ in heavy-traffic regimes, Eschenfeldt and Gamarnik [6] obtain a diffusion approximation in the Halfin-Whitt regime ($\alpha = 0.5$), which has a zero queueing delay in the diffusion limit. The convergence result in [6] is on the process level. Braverman [3] later establishes steady-state results, which imply the convergence of the stationary distributions to the diffusion limit. JSQ has also been studied in the nondegenerate slowdown (NDS) regime ($\alpha = 1$) [10].

The problem of achieving zero queueing delay with Po$d$ has been studied in [21, 16, 17, 15]. Mukherjee et al. [21] show through stochastic coupling that the diffusion limit of Po$d$ with $d = \omega(N^{0.5} \log N)$ converges to that of JSQ in the Halfin-Whitt regime, thus resulting in a zero queueing delay. The convergence to the diffusion limit in [21] is on the process level. Zero queueing delay for Po$d$ in steady state is first studied by Liu and Ying [17] for a restricted range of $\alpha$, where they show that the waiting probability goes to $0$ as $N \to \infty$ when $d = \omega\left(\frac{1}{1-\lambda}\right)$. The results are later extended to the sub-Halfin-Whitt regime ($0 < \alpha < 0.5$) for both exponential and Coxian-2 service times [16, 15] and to the beyond-Halfin-Whitt regime ($0.5 \le \alpha < 1$) [15], where it is shown that zero queueing delay is achieved when $d = \Omega\left(\frac{\log N}{1-\lambda}\right)$. The paper [17] also provides a lower bound result: in the regime considered there, the waiting probability is bounded away from $0$ when $d = O\left(\frac{1}{1-\lambda}\right)$.

Po$d$ has also been analyzed in the regime with a constant load ($\alpha = 0$) as $N \to \infty$. Mean-field analysis has been derived for a constant $d$ in [20, 27], and Mukherjee et al. [21] show that $d = \omega(1)$ leads to zero queueing delay. We remark that mean-field analysis results are also available for other policies such as Join-the-Idle-Queue (JIQ) [18, 24], and also for delay-resource tradeoffs [7].

To the best of our knowledge, very limited work has been done on achieving zero queueing delay for parallel jobs, or on analyzing delay for parallel jobs in general. Only the regime with a constant load as $N \to \infty$ has been studied. Mukherjee et al. [21] briefly touch upon this topic and show that fluid-level optimality can be achieved with probe overhead $d \ge \frac{1}{1-\lambda-\epsilon}$ under the so-called batch-sampling policy [22]. Ying et al. [34] provide limiting distributions for the stationary distributions under (batch-version) Po$d$, batch-sampling, and batch-filling, but have not analyzed the delay of jobs. Wang et al. [29] analyze job delay under a (batch-version) random-routing policy, which does not achieve zero queueing delay. There have been no results for heavy-traffic regimes.

Finally, the techniques we use in this paper are based on Stein's method and drift-based state-space collapse. Proposed in [23], Stein's method has been an effective tool for bounding the distance between two distributions. The seminal papers [4, 5, 11] build an analytical framework for Stein's method in queueing theory that consists of generator approximation, gradient bounds, and possibly state-space collapse. The papers [4, 5] use Stein's method to study steady-state diffusion approximation, and [16, 17, 33, 1, 3, 8, 9, 32] use Stein's method to obtain convergence rates to the mean-field limit. A similar approach has also been developed by Stolyar [25].

Figure 1: An $N$-server system with batch arrivals ($k$ tasks per job, routed by a dispatcher).

Figure 2: An example of the number of spaces below a threshold $\ell$ in a set of queues: $\ell = 2$, a set $A$ of three queues, and $N_\ell(A) = 3$.

We consider a system with $N$ identical servers, illustrated in Figure 1. Each server has its own queue and serves tasks in its queue in a First-Come-First-Serve manner. Since each queue is associated with a server, we will refer to queues and servers interchangeably. Jobs arrive into the system following a Poisson process. To capture the parallel structure of jobs, we assume that each job consists of $k$ tasks that can run on different servers in parallel. A job finishes when all of its tasks finish. We study the large-system regime where the number of servers, $N$, becomes large, and we will let $k$ increase to infinity with $N$ to capture the trend of growing job sizes.

We denote the job arrival rate by $N\lambda/k$ and assume that the service times of tasks are independent and exponentially distributed with rate $1$. Then $\lambda$ is the load of the system. We consider a heavy-traffic regime where $\lambda = 1 - \beta N^{-\alpha}$ with $0 < \beta < 1$ and $0 < \alpha < 0.5$, i.e., the so-called sub-Halfin-Whitt regime [16, 12].

When a job arrives into the system, we sample $kd$ queues and obtain their queue length information. Since the average overhead is $d$ samples per task, the probe overhead is $d$. We then assign the $k$ tasks of the job to the $kd$ selected queues using the batch-filling policy proposed in [34]. Batch-filling assigns the tasks one by one to the shortest queue, where the queue length is updated after each task assignment. Specifically, the task assignment process runs in $k$ rounds. For each round, we put a task into the shortest queue among sampled queues.
We then update the queue length, and continue to the next round.

Now we give an equivalent description of batch-filling, which is useful in our analysis. For each queue and a positive integer $\ell$, we use the number of spaces below threshold $\ell$ to refer to the quantity $\max\{\ell - \text{queue length}, 0\}$, i.e., the number of tasks we can put in the queue such that the queue length after receiving the tasks is no larger than $\ell$. For a set of queues $A$, we use $N_\ell(A)$ (or just $N_\ell$ when it is clear from the context) to denote the total number of spaces below $\ell$ in $A$. Figure 2 gives an example of $N_\ell(A)$. We say a task is at queueing position $p$ if there are $p - 1$ tasks ahead of it in the queue. With the above terminology, the batch-filling policy can be described in the following way: it finds the minimum threshold $\ell$ such that the total number of spaces below $\ell$ in the sampled queues is at least $k$. Then it fills the $k$ tasks into these spaces from low positions to high positions.

To define zero queueing delay for parallel jobs, let $X_1, X_2, \cdots, X_k$ be the service times of the tasks of a job. When a job does not experience any queueing, its delay is given by $T^* = \max\{X_1, \cdots, X_k\}$, which we call the inherent delay of this job. If the actual delay of the job is very close to its inherent delay, it is as if the job immediately gets service when it arrives to the system. Therefore, we say a job experiences zero queueing delay if the steady-state delay of the job, $T$, satisfies
$$\frac{E[T - T^*]}{E[T^*]} \to 0 \quad \text{as } N \to \infty.$$
We note that as the service time of each task is exponentially distributed with mean $1$, it holds that $E[T^*] = H_k = \ln k + o(\ln k)$, where $H_k$ is the $k$-th harmonic number [19].

We make the following interesting observation, which provides a basis for our delay analysis of parallel jobs: a job can have zero queueing delay even when its tasks are assigned to non-idle servers. In fact, we establish a necessary and sufficient condition: a job has zero queueing delay if and only if all of its tasks are at queueing positions below a threshold $h$ with $h = o(\log k)$ after being assigned to servers, noting that the inherent delay is $\ln k + o(\ln k)$. The formal proof is based on Lemma 4. This phenomenon allows us to have a zero queueing delay with low probe overhead. But it also makes the analysis hard since it implies that there are many situations that can lead to zero queueing delay.

We assume that every queue has a finite buffer size of $b$ including the task in service. If the dispatcher routes a task to a queue with length equal to $b$, we simply discard this task and all the other tasks of the same job. In this case, we say the job is dropped; otherwise, we say the job is admitted. We remark that this assumption is not restrictive for the following two reasons: (1) our results hold for a very large range of $b$ (see Theorem 1); and (2) the probability of discarding a job is very small (see Theorem 2).

To represent the state of the system, let $S_i(t)$ denote the fraction of servers that have at least $i$ tasks at time $t$, where $0 \le i \le b$. Note that it always holds that $S_0(t) = 1$. Then $S(t) = (S_0(t), S_1(t), \cdots, S_b(t))$ forms a continuous-time Markov chain (CTMC) since batch-filling is oblivious to labels of servers.
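The batch-filling rule, its equivalent threshold description, and the tail-fraction state described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`batch_fill`, `min_threshold`, `tail_fractions`) and the tie-breaking rule (lowest index first) are assumptions made here, not part of the paper, and the finite-buffer dropping rule is not modeled.

```python
def batch_fill(sampled_lengths, k):
    """Assign k tasks one by one to the currently shortest sampled queue.

    Returns the updated queue lengths and the queueing position of each task
    (position p means p - 1 tasks are ahead of it in its queue).
    Ties are broken by lowest index, which is an assumption of this sketch.
    """
    if not sampled_lengths:
        raise ValueError("need at least one sampled queue")
    lengths = list(sampled_lengths)
    positions = []
    for _ in range(k):
        j = min(range(len(lengths)), key=lengths.__getitem__)
        lengths[j] += 1                # queue length is updated after every task
        positions.append(lengths[j])   # the new task sits at this position
    return lengths, positions


def min_threshold(sampled_lengths, k):
    """Smallest l such that the total number of spaces below l is at least k."""
    if not sampled_lengths:
        raise ValueError("need at least one sampled queue")
    l = 0
    while sum(max(l - q, 0) for q in sampled_lengths) < k:
        l += 1
    return l


def tail_fractions(all_lengths, b):
    """S_i = fraction of servers with at least i tasks, for i = 0, ..., b."""
    n = len(all_lengths)
    return [sum(q >= i for q in all_lengths) / n for i in range(b + 1)]
```

For example, with sampled queue lengths `[2, 0, 1, 3]` and `k = 4`, the round-by-round rule places the tasks at positions `[1, 2, 2, 3]`, and the highest position used coincides with `min_threshold([2, 0, 1, 3], 4)`, in line with the equivalent description.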
The state space is as follows:
$$\mathcal{S} = \left\{ s = (s_0, s_1, s_2, \cdots, s_b) : 1 = s_0 \ge s_1 \ge s_2 \ge \cdots \ge s_b,\; N s_i \in \mathbb{N},\; \forall\, 0 \le i \le b \right\}.$$
It can be verified that $\{S(t) : t \ge 0\}$ is irreducible and positive recurrent, thus having a unique stationary distribution. Let $\pi_S$ denote this stationary distribution, and let $S = (S_0, \cdots, S_b)$ be a random element with distribution $\pi_S$.

Our main results provide bounds on queue lengths and delay, which lead to corresponding bounds on the probe overhead for achieving zero queueing delay. We divide our results into upper-bound and lower-bound results. Again, all the asymptotics are with respect to the regime that the number of servers, $N$, goes to infinity.

Upper-Bound Results
We first give an upper bound on E (cid:104)(cid:80) bi =1 S i (cid:105) , the expected number of tasks in each server, in Theorem 1. This upperbound underpins our analysis of job delay. theorem 1. Consider a system with N servers where each job consists of k tasks. Let the load be λ = 1 − βN − α with < β < and < α < . . Under the batch-filling policy with a probe overhead of d such that d ≥ − λ ) h forsome h = o (log k ) and h = ω (1) , it holds that E (cid:34) max (cid:40) b (cid:88) i =1 S i − h (cid:18) − βN − α (cid:19) , (cid:41)(cid:35) ≤ √ N log N , (4) where k satisfies that k = o (cid:16) N . − α log N (cid:17) and k log k = Ω(log N ) , the buffer size b = min (cid:110) N α , N . − α k (cid:111) , and N issufficiently large. We remark that the h = o (log k ) in this theorem represents the threshold position we pointed out for zero queueingdelay, i.e., a job has zero queueing delay if all of its tasks are at queueing positions below h after assigned to servers.The upper bound on E (cid:104)(cid:80) bi =1 S i (cid:105) in Theorem 1 indicates how full the queues are. This enables us to analyze theprobability that all the tasks of an incoming job end up in positions below h under batch-filling, which further leads tothe zero queueing delay result below in Theorem 2. Recall that the buffer size b of each queue is finite, so a job willget dropped if at least one of its tasks is assigned to a queue with a full buffer. We denote the probability of droppingan incoming job in steady state by p d . 5 PREPRINT - A
Theorem 2.
Under the assumptions of Theorem 1, the dropping probability under batch-filling, p d , can be upperbounded as follows when N is sufficiently large: p d ≤ b √ N log N .
The steady-state delay of jobs that are admitted satisfies
$$E[T \mid \text{admitted}] = \ln k + o(\ln k). \tag{5}$$
Therefore, the batch-filling policy achieves zero queueing delay for parallel jobs.
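To get a feel for the $\ln k$ scale in (5), recall that the inherent delay of a job with $k$ independent rate-$1$ exponential tasks has mean $H_k$, the $k$-th harmonic number. The following Python sketch (the helper names `harmonic` and `inherent_delay_ratio` are hypothetical, chosen here for illustration) compares $H_k$ with $\ln k$ numerically.

```python
import math


def harmonic(k):
    """k-th harmonic number H_k = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / i for i in range(1, k + 1))


def inherent_delay_ratio(k):
    """Ratio H_k / ln k; it tends to 1 as k grows, i.e., H_k = ln k + o(ln k)."""
    return harmonic(k) / math.log(k)
```

The ratio approaches $1$ slowly because $H_k - \ln k$ converges to a constant (the Euler-Mascheroni constant) rather than to zero, which is still $o(\ln k)$.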
Theorems 1 and 2 imply that zero queueing delay for parallel jobs can be achieved with a probe overhead $d = \omega\left(\frac{1}{(1-\lambda)\log k}\right)$. This breaks the lower bound of $\omega\left(\frac{1}{1-\lambda}\right)$ for achieving zero queueing delay for non-parallel jobs, i.e., single-task jobs [17]. Therefore, the parallel structure helps reduce communication overhead.

Lower-Bound Results
To complement the upper-bound results, below we investigate when zero queueing delay cannot be achieved. InTheorem 3, we find conditions under which (cid:80) hi =1 S i is lower bounded with a constant probability. theorem 3. Consider a system with N servers where each job consists of k tasks. Let the load be λ = 1 − βN − α with < β < and < α < . . Assume that b = ∞ and k satisfies that k = e o ( √ log N ) and k = ω (1) . For any stabletask-assigning policy with a probe overhead of d such that d = e o ( log N log k ) and any h with h = O (log k ) , it holds thatwhen N is sufficiently large, P (cid:40) h (cid:88) i =1 S i ≥ h − d (cid:41) ≥ e . (6)The lower bound on (cid:80) hi =1 S i in Theorem 3 guarantees that an incoming job will have a significant delay in additionto its inherent delay, and thus fails to have zero queueing delay. This result is formally stated in Theorem 4 below. theorem 4. Under the assumptions of Theorem 3, the steady-state job delay, T , satisfies that E [ T ] ≥ k (7) when N is sufficiently large. Therefore, to achieve zero queueing delay, the probe overhead d needs to be at least e Ω ( log N log k ) . In this section, we prove the upper-bound results in Theorems 1 and 2. We first give a proof sketch that providesan overview of the structure of the proofs. We then present the formal proofs of Theorems 1 and 2 in Sections 4.1and 4.2, respectively. These two proofs rely on lemmas that are presented in Section 4.3, followed by their proofs inSection 4.4. Throughout this section, we assume that the assumptions in Theorem 1 hold.
Proof Sketch
We start by setting the goal to be proving the zero queueing delay result in Theorem 2. The need for the fundamentalcharacterizations of the system in Theorem 1 will emerge during the analysis. We first note that the steady-state jobdelay T can be upper bounded in the following way: E [ T ] ≤ E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:35) + E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:35) · P (cid:40) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:41) , (8)6 PREPRINT - A
7, 2020where we have used the fact that P (cid:110)(cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1)(cid:111) ≤ .In this upper bound, the conditions in the expectations are based the threshold value h (cid:0) − βN − α (cid:1) for (cid:80) hi =1 S i .We choose this particular threshold value for the following reason. Given the condition (cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1) ,we can show that with high probability, all the tasks of an incoming job will be assigned to queueing positions below h with h = o (log k ) , thus resulting in a zero queueing delay for the job. Specifically, suppose a job arrives to the systemwith state s . If we choose one queue uniformly at random from all the queues, then the probability for the chosenqueue to have a length of i is s i − s i +1 . So the expected number of spaces below position h in the chosen queue is (cid:80) hi =0 ( h − i )( s i − s i +1 ) = h − (cid:80) hi =1 s i . The batch-filling policy samples kd queues. Thus the total expected numberof spaces below position h in the kd sampled queues is kd (cid:16) h − (cid:80) hi =1 s i (cid:17) . To fit all the k tasks of the incoming jobto positions below h , we need kd (cid:32) h − h (cid:88) i =1 s i (cid:33) ≥ k, which becomes h (cid:88) i =1 s i ≤ h (cid:18) − βN − α (cid:19) when d ≥ − λ ) h = N α βh as required in Theorem 1. We strengthen this requirement to the condition (cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1) to obtain a high-probability guarantee using proper concentration inequalities.The second summand in the upper bound (8) is based on the condition (cid:80) hi =1 S i > h (cid:0) − βN − α (cid:1) . Under thiscondition, we may not be able to put all the tasks of an incoming job to positions below h . But we show that theprobability P (cid:110)(cid:80) hi =1 S i > h (cid:0) − βN − α (cid:1)(cid:111) is very small in Theorem 1. 
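The counting argument above, that a uniformly chosen queue has $h - \sum_{i=1}^{h} s_i$ expected spaces below position $h$, can be checked numerically. The Python sketch below uses hypothetical helper names and an arbitrary example state; it only illustrates the identity and the in-expectation fitting condition $kd\,(h - \sum_{i=1}^{h} s_i) \ge k$, not the high-probability argument.

```python
def expected_spaces_direct(s, h):
    """sum_{i=0}^{h} (h - i) * (s_i - s_{i+1}) for tail fractions s with s[0] == 1."""
    padded = list(s) + [0.0]  # the i = h term has weight 0, so the padding value is irrelevant
    return sum((h - i) * (padded[i] - padded[i + 1]) for i in range(h + 1))


def expected_spaces_closed_form(s, h):
    """Closed form h - sum_{i=1}^{h} s_i."""
    return h - sum(s[1:h + 1])


def fits_in_expectation(s, h, d):
    """True if k*d sampled queues offer at least k expected spaces below h (k cancels)."""
    return d * expected_spaces_closed_form(s, h) >= 1.0
```

For the example state `s = [1.0, 0.9, 0.5, 0.2]` and `h = 2`, both expressions evaluate to about `0.6`, so in expectation a probe overhead of `d = 2` offers enough spaces while `d = 1` does not.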
To this end, we first upper-bound it usingthe Markov inequality: P (cid:40) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:41) ≤ E (cid:104) max (cid:110)(cid:80) bi =1 S i − h (cid:0) − βN − α (cid:1) , (cid:111)(cid:105) βN − α . It then suffices to bound E (cid:104) max (cid:110)(cid:80) bi =1 S i − h (cid:0) − βN − α (cid:1) , (cid:111)(cid:105) .We focus on the dynamics of (cid:80) bi =1 S i ( t ) in the proof of Theorem 1, which is equal to the total queue length attime t divided by N . Our proof follows the framework of Stein’s method. The main idea is to couple our Markovchain { S ( t ) : t ≥ } with an auxiliary process that is easier to analyze, and bound their difference through generatorapproximation. Here we consider the following simple fluid model as our auxiliary process: ˙ x ( t ) = ( − δ ) { x> } ,x ( t ) is continuous, (9)where δ = ( k +1) log N √ N , and we then compare the dynamics of (cid:80) bi =1 S i ( t ) with that of x ( t ) . Based on this coupling,we derive an upper bound on E (cid:104) max (cid:110)(cid:80) bi =1 S i − h (cid:0) − βN − α (cid:1) , (cid:111)(cid:105) in Section 4.1 below. We reiterate that akey in our analysis is a novel state-space collapse result that we establish.Combining the arguments above for both the condition (cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1) and the condition (cid:80) hi =1 S i >h (cid:0) − βN − α (cid:1) , we can conclude that the upper bound on E [ T ] in (8) implies zero queueing delay. Proof.
As we explained in the proof sketch, we consider the fluid model in (9). The generator of this fluid model,denoted as G , is simply given by Gg ( x ) = g (cid:48) ( x ) · ( − δ ) { x> } PREPRINT - A
7, 2020for a differentiable function g . Recall that we will compare the dynamics of x ( t ) in this fluid model with that of (cid:80) bi =1 S i ( t ) .The quantity of interest in Theorem 1 is E (cid:104) max (cid:110)(cid:80) bi =1 S i − η, (cid:111)(cid:105) , where we have used the notation η = h (cid:0) − βN − α (cid:1) for conciseness. Recall that S follows the stationary distribution of { S ( t ) : t ≥ } . To couple { S ( t ) : t ≥ } with the fluid model, we solve for a function g such that Gg ( x ) = max { x − η, } ,g (0) = 0 . (10)It is not hard to see that the solution is g ( x ) = ( x − η ) − δ ) { x ≥ η } . (11)Now we utilize this function g to bound E (cid:104) max (cid:110)(cid:80) bi =1 S i − η, (cid:111)(cid:105) through generator approximation. Let G be thegenerator of { S ( t ) : t ≥ } . Then Gg (cid:32) b (cid:88) i =1 s i (cid:33) = (cid:88) s (cid:48) ∈S r s → s (cid:48) (cid:32) g (cid:32) b (cid:88) i =1 s (cid:48) i (cid:33) − g (cid:32) b (cid:88) i =1 s i (cid:33)(cid:33) , where r s → s (cid:48) is the transition rate from state s to state s (cid:48) . Since g (cid:16)(cid:80) bi =1 s i (cid:17) is bounded on S , it holds that E (cid:34) Gg (cid:32) b (cid:88) i =1 S i (cid:33)(cid:35) = 0 . (12)Combining this with the equations in (10) gives, E (cid:34) max (cid:40) b (cid:88) i =1 S i − η, (cid:41)(cid:35) = E (cid:34) Gg (cid:32) b (cid:88) i =1 S i (cid:33)(cid:35) = E (cid:34) Gg (cid:32) b (cid:88) i =1 S i (cid:33) − Gg (cid:32) b (cid:88) i =1 S i (cid:33)(cid:35) = E (cid:34) g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ ) − Gg (cid:32) b (cid:88) i =1 S i (cid:33)(cid:35) (13)This is referred to as the generator approximation since we are approximating the generator G with G .Next we take a closer look at the term Gg (cid:16)(cid:80) bi =1 S i (cid:17) and derive an upper bound for (13). 
Let P A ( s ) be the probabilitythat a job arrival is admitted into the system given that the system is at state s , i.e., the probability that all the tasks ofthe job are routed to positions below b . Then Gg (cid:32) b (cid:88) i =1 s i (cid:33) = N λk P A ( s ) (cid:32) g (cid:32) b (cid:88) i =1 s i + kN (cid:33) − g (cid:32) b (cid:88) i =1 s i (cid:33)(cid:33) + N s (cid:32) g (cid:32) b (cid:88) i =1 s i − N (cid:33) − g (cid:32) b (cid:88) i =1 s i (cid:33)(cid:33) , where first term is the drift due to a job arrival and the second term is due to a task departure. To derive an upperbound on (13), we divide the discussion into the three cases below. Recall that g ( x ) = ( x − η ) − δ ) { x ≥ η } and g (cid:48) ( x ) = x − η − δ { x ≥ η } . Case 1 : (cid:80) bi =1 S i < η − kN . In this case, clearly g (cid:48) (cid:16)(cid:80) bi =1 S i (cid:17) = 0 and Gg (cid:16)(cid:80) bi =1 S i (cid:17) = 0 .8 PREPRINT - A
Case 2 : (cid:80) bi =1 S i ∈ [ η − kN , η + N ) . By the mean value theorem, g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ ) − Gg (cid:32) b (cid:88) i =1 S i (cid:33) = g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ ) − (cid:18) N λk P A ( S ) kN g (cid:48) ( ξ ) + N S − N g (cid:48) ( ˜ ξ ) (cid:19) ≤ g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ ) − λg (cid:48) ( ξ ) + S g (cid:48) ( ˜ ξ ) , (14)where ξ ∈ (cid:16)(cid:80) bi =1 S i , (cid:80) bi =1 S i + kN (cid:17) , ˜ ξ ∈ (cid:16)(cid:80) bi =1 S i − N , (cid:80) bi =1 S i (cid:17) , and (14) is true since P A ( S ) ≤ and g (cid:48) ( x ) ≤ for all x . Case 3 : (cid:80) bi =1 S i ≥ η + N . Since g (cid:48) ( x ) is continuous for all x , by the second order Taylor expansion in the Lagrangeform, g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ ) − Gg (cid:32) b (cid:88) i =1 S i (cid:33) = g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ ) − N λk P A ( S ) (cid:32) kN g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) + k N g (cid:48)(cid:48) ( ζ ) (cid:33) − N S (cid:32) − N g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) + 12 N g (cid:48)(cid:48) (˜ ζ ) (cid:33) ≤ g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ − λ + S ) − N (cid:16) λkg (cid:48)(cid:48) ( ζ ) + S g (cid:48)(cid:48) (˜ ζ ) (cid:17) , (15)where ζ ∈ (cid:16)(cid:80) bi =1 S i , (cid:80) bi =1 S i + kN (cid:17) , ˜ ζ ∈ (cid:16)(cid:80) bi =1 S i − N , (cid:80) bi =1 S i (cid:17) .Combining these three cases yields E (cid:34) g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ ) − Gg (cid:32) b (cid:88) i =1 S i (cid:33)(cid:35) ≤ E (cid:34)(cid:32) g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ ) − λg (cid:48) ( ξ ) + S g (cid:48) ( ˜ ξ ) (cid:33) { (cid:80) bi =1 S i ∈ [ η − kN ,η + N ) } (cid:35) (16) − N E (cid:104) ( λkg (cid:48)(cid:48) ( ζ ) + S g (cid:48)(cid:48) (˜ ζ )) { (cid:80) bi =1 S i ≥ η + N } (cid:105) (17) + E (cid:34) g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ 
− λ + S ) { (cid:80) bi =1 S i ≥ η + N } (cid:35) . (18)The first two terms (16) and (17) are easy to bound once we notice that for any x ∈ (cid:2) η − k +1 N , η + k +1 N (cid:3) , | g (cid:48) ( x ) | ≤ | x − η | δ ≤ √ N log N , and for any x ∈ ( η, + ∞ ) , | g (cid:48)(cid:48) ( x ) | = δ = √ N ( k +1) log N . Then when N is sufficiently large, | (16) | ≤ √ N log N (cid:18) ( k + 1) log N √ N + 1 + 1 (cid:19) ≤ √ N log N , and | (17) | ≤ N √ N ( k + 1) log N ( λk + 1) ≤ √ N log N .
The key in this proof is to bound the term (18), for which we need the state-space collapse result in Lemma 3 inSection 4.3. Consider the following Lyapunov function: V ( s ) = min h − b (cid:88) i = h s i , b (cid:32)(cid:18) − βN − α (cid:19) − h − h − (cid:88) i =1 s i (cid:33) + , PREPRINT - A
PRIL
7, 2020where the superscript + denotes the function x + = max { x, } . Then by Lemma 3, when N is sufficiently large, P (cid:26) V ( S ) > B + 2 kb log N ( h − √ N (cid:27) ≤ e − log N , where B = b − h +1 h − (cid:16) βN − α + log N √ N (cid:17) .We partition the probability space based on the value of V ( S ) . Note that g (cid:48) (cid:16)(cid:80) bi =1 S i (cid:17) ( − δ − λ + S ) { (cid:80) bi =1 S i ≥ η + N } is always no larger than bδ for large enough N . Then (18) can be upper bounded as(18) ≤ E (cid:34) g (cid:48) (cid:32) b (cid:88) i =1 S i (cid:33) ( − δ − λ + S ) · { (cid:80) bi =1 S i ≥ η + N } (cid:12)(cid:12)(cid:12)(cid:12) V ( S ) ≤ B + 2 kb log N ( h − √ N (cid:21) + 2 bδ P (cid:26) V ( S ) > B + 2 kb log N ( h − √ N (cid:27) . (19)Now we focus on the case where we are given the condition that V ( S ) ≤ B + kb log N ( h − √ N . Our goal is to show that S is large enough such that δ + λ − S < . Intuitively, this condition on V ( S ) implies that we either have a small (cid:80) bi = h S i , which leads to a large S when combined with the condition (cid:80) bi =1 S i ≥ η + N in the indicator, or a large (cid:80) h − i =1 S i , which directly gives a large S since S ≥ · · · ≥ S h − .If h − (cid:80) bi = h S i ≤ b (cid:16)(cid:0) − βN − α (cid:1) − h − (cid:80) h − i =1 S i (cid:17) + in V ( S ) , the condition V ( S ) ≤ B + kb log N ( h − √ N implies that h − b (cid:88) i = h S i ≤ b − h + 1 h − (cid:18) βN − α + log N √ N (cid:19) + 2 kb log N ( h − √ N . (20)Recall that b = min (cid:110) N α , N . − α k (cid:111) and h = o (log k ) . Note that the indicator function in (19) makes it sufficient toconsider the case where (cid:80) bi =1 S i ≥ η + N , which implies ( h − S + (cid:80) bi = h S i ≥ η . Combining this with (20) gives S ≥ ηh − − b − h + 1 h − (cid:18) βN − α + log N √ N (cid:19) − kb log N ( h − √ N ≥ − β ) 1 h − − βN − α + o (cid:18) h (cid:19) when N is sufficiently large. Note that δ = o (cid:0) h (cid:1) and λ = 1 − βN − α . 
Therefore, λ + δ − S < when N issufficiently large.If h − (cid:80) bi = h S i > b (cid:16)(cid:0) − βN − α (cid:1) − h − (cid:80) h − i =1 S i (cid:17) + in V ( S ) , the condition V ( S ) ≤ B + kb log N ( h − √ N implies that b (cid:32) − βN − α − h − h − (cid:88) i =1 S i (cid:33) ≤ B + 2 kb log N ( h − √ N .
Then S ≥ h − h − (cid:88) i =1 S i ≥ − βN − α − b (cid:18) B + 2 kb log N ( h − √ N (cid:19) ≥ − βN − α + o ( N − α ) . PREPRINT - A
PRIL
7, 2020As a result, again we have λ + δ − S ≤ − βN − α + o ( N − α ) < when N is sufficiently large.Inserting these bounds back to (19) gives that when N is sufficiently large,(18) ≤ bδ P (cid:26) V ( S ) > B + 2 kb log N ( h − √ N (cid:27) ≤ bδ e − log N ≤ √ N log N .
Combining the bounds for (16), (17) and (18), we have E (cid:34) max (cid:40) b (cid:88) i =1 S i − h (cid:18) − βN − α (cid:19) , (cid:41)(cid:35) ≤ √ N log N , which completes the proof of Theorem 1.
Proof.
We first bound the dropping probability p d using Lemma 1, which will be presented in Section 4.3. Note thatan incoming job does not get dropped if and only if all its k tasks are routed to queueing positions below threshold b ,which is the complement of the event FILL b in Lemma 1. Thus, p d = 1 − P { FILL b } = 1 − P (cid:40) FILL b (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) b (cid:88) i =1 S i ≤ b (cid:18) − βN − α (cid:19)(cid:41) · P (cid:40) b (cid:88) i =1 S i ≤ b (cid:18) − βN − α (cid:19)(cid:41) − P (cid:40) FILL b (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) b (cid:88) i =1 S i > b (cid:18) − βN − α (cid:19)(cid:41) · P (cid:40) b (cid:88) i =1 S i > b (cid:18) − βN − α (cid:19)(cid:41) . We can easily have that P (cid:110) FILL b (cid:12)(cid:12)(cid:12) (cid:80) bi =1 S i ≤ b (cid:0) − βN − α (cid:1)(cid:111) ≤ N using Lemma 1.Now we bound P (cid:110)(cid:80) bi =1 S i > b (cid:0) − βN − α (cid:1)(cid:111) using Theorem 1. Note that P (cid:40) b (cid:88) i =1 S i > b (cid:18) − βN − α (cid:19)(cid:41) ≤ P (cid:40) max (cid:40) b (cid:88) i =1 S i − h (cid:18) − βN − α (cid:19) , (cid:41) > b − b βN − α − h (cid:41) ≤ P (cid:40) max (cid:40) b (cid:88) i =1 S i − h (cid:18) − βN − α (cid:19) , (cid:41) > b (cid:41) , PREPRINT - A
PRIL
7, 2020where we have used the fact that b βN − α + h ≤ b when N is sufficiently large due to our assumptions on b and h .Then by Markov’s inequality, P (cid:40) b (cid:88) i =1 S i > b (cid:18) − βN − α (cid:19)(cid:41) ≤ E (cid:104) max (cid:110)(cid:80) bi =1 S i − h (cid:0) − βN − α (cid:1) , (cid:111)(cid:105) b ≤ b √ N log N .
Combining the arguments above yields p d ≥ − N − b √ N log N ≥ − b √ N log N when N is sufficiently large.Next we bound the expected job delay given that a job is admitted, i.e., E [ T | admitted ] . We define the delay of a jobthat is dropped to be zero since it leaves the system immediately after arrival. Then E [ T ] = E [ T | admitted ] · (1 − p d ) + E [ T | dropped ] · p d , and thus E [ T | admitted ] = E [ T ]1 − p d . So we can focus on bounding E [ T ] , following the outlinegiven in the proof sketch.We bound E [ T ] in the following way E [ T ] ≤ E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:35) (21) + E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:35) · P (cid:40) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:41) . (22)For the first term (21) in this upper bound, as described in the proof sketch, we will rely on the fact that with highprobability, all the k tasks are assigned to queueing positions below h . Specifically, E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:35) = E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19) , FILL h (cid:35) · P (cid:40) FILL h (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:41) + E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19) , FILL h (cid:35) · P (cid:40) FILL h (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:41) , where FILL h is the complement of FILL h .Suppose FILL h is true. Suppose that the k tasks of the incoming job land in m distinct queues with m ≤ k . Wecall the tasks with the highest positions in these m queues tasks , , . . . , m , and let n , n , . . . , n m denote thesepositions. 
Then the delay of task i can be written as Y i = (cid:80) n i j =1 X i,j , where X i,j is the service time of the task at12 PREPRINT - A
PRIL
7, 2020position j in the same queue as task i . Clearly X i,j ’s are i.i.d. with an exponential distribution of rate . We know that n i ≤ h, i = 1 , , . . . , m given FILL h . Then by Lemma 4, E [max { Y , · · · , Y m } ] ≤ ln k + o (ln k ) . When FILL h is true, E (cid:104) T (cid:12)(cid:12)(cid:12) (cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1) , FILL h (cid:105) ≤ bk since the highest position for a task is b andthe maximum is upper bounded by the sum. Further, P (cid:110) FILL h (cid:12)(cid:12)(cid:12) (cid:80) hi =1 S i ≤ h (cid:0) − βN − α (cid:1)(cid:111) ≤ N by Lemma 1.Combining the arguments above, we have the following bound for term (21): E (cid:34) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h (cid:88) i =1 S i ≤ h (cid:18) − βN − α (cid:19)(cid:35) ≤ ln k + o (ln k ) + bkN . Now we go back to the term (22). Again, it is easy to see that E (cid:104) T (cid:12)(cid:12)(cid:12) (cid:80) hi =1 S i > h (cid:0) − βN − α (cid:1)(cid:105) ≤ bk . UtilizingTheorem 1, we have P (cid:40) h (cid:88) i =1 S i > h (cid:18) − βN − α (cid:19)(cid:41) ≤ P (cid:40) max (cid:40) b (cid:88) i =1 S i − h (cid:18) − βN − α (cid:19) , (cid:41) > hβN − α (cid:41) ≤ E (cid:104) max (cid:110)(cid:80) bi =1 S i − h (cid:0) − βN − α (cid:1) , (cid:111)(cid:105) hβN − α ≤ hβN − α log N .
With the bounds above on (21) and (22), we have E [ T ] ≤ ln k + o (ln k ) + bkN + 20 bkhβN − α log N .
Consequently, E [ T | admitted ] = E [ T ]1 − p d ≤ ln k + o (ln k ) + bkN + bkhβN − α log N − p d ≤ ln k + o (ln k ) , which completes the proof. In Lemma 1 below, we consider the event that all the k tasks of an incoming job are routed to queueing positions belowsome threshold value (cid:96) . Let this event be denoted by FILL (cid:96) , and Lemma 1 lower-bounds its probability P { FILL (cid:96) } forseveral values of (cid:96) of interest. Lemma 1 is an essential building block and is needed for establishing the state-spacecollapse result in Lemma 3 and bounding job delay in Theorem 2. lemma 1 (Filling Probability) . Under the assumptions of Theorem 1, given that the system is in a state s such that (cid:80) (cid:96)i =1 s i ≤ (cid:96) (cid:0) − βN − α (cid:1) , the probability of the event FILL (cid:96) for any (cid:96) ∈ { h − , h, b } can be bounded as followswhen N is sufficiently large: P { FILL (cid:96) } ≥ − N .
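The event in Lemma 1 can also be explored numerically. The following is a minimal Monte Carlo sketch, not the authors' code: it probes kd distinct queues uniformly at random and checks whether they offer at least k free positions below the threshold. The function names, the `trials` and `seed` parameters, and the restriction to an integer d are illustrative assumptions. Counting max(ell - length, 0) free positions per queue is our reading of "positions below the threshold", chosen to match the distribution of spaces used in the proof of Lemma 1.

```python
import random


def spaces_below(queue_len, ell):
    """Free positions up to threshold ell in a queue holding queue_len tasks."""
    return max(ell - queue_len, 0)


def estimate_fill_probability(queue_lengths, k, d, ell, trials=2000, seed=0):
    """Monte Carlo estimate of P{FILL_ell} for a fixed system state.

    queue_lengths: list with the current length of each of the N queues.
    k: number of tasks in the arriving job.
    d: probe overhead (assumed to be an integer here), so k*d queues are probed.
    ell: threshold on the queueing position.
    """
    rng = random.Random(seed)
    probes = k * d
    if probes > len(queue_lengths):
        raise ValueError("cannot probe more queues than exist")
    hits = 0
    for _ in range(trials):
        # Probe k*d distinct queues, i.e. sample without replacement.
        sampled = rng.sample(queue_lengths, probes)
        n_ell = sum(spaces_below(q, ell) for q in sampled)
        # FILL_ell holds when all k tasks fit below the threshold.
        if n_ell >= k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Hypothetical state: 1000 queues, most of them already at the threshold.
    state = [3] * 950 + [2] * 40 + [0] * 10
    print(estimate_fill_probability(state, k=10, d=20, ell=3))
```

Varying `d` in such a toy state gives a rough feel for how the probe overhead drives the filling probability toward one, which is the qualitative content of the lemma.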
Lemma 2 bounds the distribution tails of a Lyapunov function, which slightly generalizes the tail bounds in [28], [15], and [2].

Lemma 2.
Consider a continuous time Markov chain { S ( t ) } with a unique stationary distribution π . Assume it has afinite state space S . For a Lyapunov function V : S → [0 , + ∞ ) , define the drift of V at a state s ∈ S as ∆ V ( s ) = GV ( s ) = (cid:88) s (cid:48) ∈S , s (cid:54) = s (cid:48) r s → s (cid:48) ( V ( s (cid:48) ) − V ( s )) , where r s → s (cid:48) is the transition rate from state s to s (cid:48) .Assume that ν max := sup s , s (cid:48) ∈S : r s → s (cid:48) > | V ( s ) − V ( s (cid:48) ) | < ∞ f max := max , sup s ∈S (cid:88) s (cid:48) : V ( s (cid:48) ) >V ( s ) r s → s (cid:48) ( V ( s (cid:48) ) − V ( s )) < ∞ . If there is a set E with B > , γ > , δ ≥ such that • ∆ V ( s ) ≤ − γ when V ( s ) ≥ B and s ∈ E . • ∆ V ( s ) ≤ δ when V ( s ) ≥ B and s (cid:54)∈ E .Then it holds that for all j ∈ N , P { V ( s ) ≥ B + 2 ν max j } ≤ (cid:18) f max f max + γ (cid:19) j + (cid:18) δγ + 1 (cid:19) P { s (cid:54)∈ E} . This tail bound in Lemma 2 is slightly more general than existing bounds in that it allows different drift bounds basedon whether a state s is in a set E or not, which will be needed in the proof of lower-bound results.We utilize Lemma 2 to establish the state-space collapse result below in Lemma 3. Here we simply let E be the wholestate space. lemma 3 (State-Space Collapse) . Under the assumption of Theorem 1, consider the following Lyapunov function V ( s ) = min h − b (cid:88) i = h s i , b (cid:32)(cid:18) − βN − α (cid:19) − h − h − (cid:88) i =1 s i (cid:33) + , where the superscript + denotes the function x + = max { x, } . Let B = b − h +1 h − (cid:16) βN − α + log N √ N (cid:17) . For any state s such that V ( s ) > B , its Lyapunov drift satisfies ∆ V ( s ) = GV ( s ) ≤ − b √ N .
Consequently, when N is sufficiently large, P (cid:26) V ( S ) > B + 2 kb log N ( h − √ N (cid:27) ≤ e − log N , Lemma 4 below is used in the proof of Theorem 2 to bound job delay based on queueing positions of its tasks. It statesthat if m tasks are in queueing positions below o (log m ) , then the maximum delay of these tasks is ln m + o (ln m ) .Due to space limitations, the proof is given in Appendix A. lemma 4. Consider m independent random variables Y , · · · , Y m where each Y i is the sum of n i i.i.d. randomvariables that follow the exponential distribution with rate . In the asymptotic regime that m goes to infinity, if max { n , · · · , n m } = o (log m ) , then E [max { Y , · · · , Y m } ] ≤ ln m + o (ln m ) . (Filling Probability) Proof.
Assume that a job arrival sees a state S = s that satisfies (cid:96) (cid:88) i =1 s i ≤ (cid:96) (cid:18) − βN − α (cid:19) . PREPRINT - A
We focus on the number of spaces below the threshold $\ell$ in the sampled queues, denoted by $N_\ell$. Then $N_\ell$ is the maximum number of tasks that can be put into these queues such that all of these tasks are at queueing positions below $\ell$. Therefore,
$$\mathbb{P}\{\mathrm{FILL}_\ell\} = \mathbb{P}\{N_\ell \ge k\} \ge 1 - \mathbb{P}\{N_\ell \le k\}.$$
Now we bound $\mathbb{P}\{N_\ell \le k\}$. We can think of the sampling process of batch-filling as sampling $kd$ queues one by one without replacement. Let $X_1, X_2, \dots, X_{kd}$ be the numbers of spaces below $\ell$ in the 1st, 2nd, ..., $kd$-th sampled queues, respectively. Then $N_\ell = X_1 + \dots + X_{kd}$. It is not hard to see that for each sampled queue and each integer $x$ with $1 \le x \le \ell$,
$$\mathbb{P}\{X_i = x\} = s_{\ell-x} - s_{\ell-x+1}, \qquad \text{and} \qquad \mathbb{P}\{X_i = 0\} = s_\ell.$$
Note that since we sample without replacement, $X_1, X_2, \dots, X_{kd}$ are not independent. But we can still derive concentration bounds using a result of Hoeffding [13, Theorem 4]. By this result, we have
$$\mathbb{E}\left[f\left(\sum_{i=1}^{kd} X_i\right)\right] \le \mathbb{E}\left[f\left(\sum_{i=1}^{kd} Y_i\right)\right]$$
for any continuous and convex function $f(\cdot)$, where $Y_1, Y_2, \dots, Y_{kd}$ are i.i.d. and follow the same distribution as $X_1$. We take the function $f(\cdot)$ to be $f(x) = e^{-tx}$ with $t > 0$. Then, by Markov's inequality and the comparison above,
$$\mathbb{P}\{N_\ell \le k\} = \mathbb{P}\left\{e^{-tN_\ell} \ge e^{-tk}\right\} \le e^{tk} \prod_{i=1}^{kd} \mathbb{E}\left[e^{-tY_i}\right] = e^{tk} \prod_{i=1}^{kd} \left(1 - \sum_{j=1}^{\ell} \left(s_{\ell-j} - s_{\ell-j+1}\right)\left(1 - e^{-tj}\right)\right).$$
Since − x ≤ e − x for each x ≥ , this can be further bounded as P { N (cid:96) ≤ k }≤ exp tk − kd (cid:96) (cid:88) j =1 ( s (cid:96) − j − s (cid:96) − j +1 ) (cid:0) − e − tj (cid:1) ≤ exp tk + kd (cid:96) (cid:88) j =1 ( s j − − s j ) (cid:16) e − t ( (cid:96) − j +1) − (cid:17) (23)Rearranging the terms in the sum in (23), we get (cid:96) (cid:88) j =1 ( s j − − s j ) (cid:16) e − t ( (cid:96) − j +1) − (cid:17) = (cid:0) e − t(cid:96) − (cid:1) + (cid:0) e t − (cid:1) (cid:96) (cid:88) j =1 s j e − t ( (cid:96) − j +1) . (24)Since ≥ s ≥ · · · s (cid:96) and we have assumed that (cid:80) (cid:96)j =1 s j ≤ (cid:96) (cid:0) − βN − α ) (cid:1) , (24) is maximized when s = s = · · · = s (cid:96) = 1 − βN − α . Therefore, the upper bound becomes P { N (cid:96) ≤ k } ≤ exp (cid:18) tk + kd (cid:0) e − t(cid:96) − (cid:1) βN − α (cid:19) . PREPRINT - A
PRIL
7, 2020Now we apply the condition that d ≥ N α βh and let t = ln(2 (cid:96) ) − ln h(cid:96) . Then P { N (cid:96) ≤ k }≤ exp (cid:18) tk + 2 kh (cid:0) e − t(cid:96) − (cid:1)(cid:19) = exp (cid:18) kh (cid:18) h(cid:96) (ln(2 (cid:96) ) − ln h ) + h(cid:96) − (cid:19)(cid:19) . Recall the we have assumed that kh = ω (log N ) and h = ω (1) . Then it can be verified that with a sufficiently large N , h(cid:96) (ln(2 (cid:96) ) − ln h ) + h(cid:96) + 2 N − . − is smaller than a negative constant for all (cid:96) ∈ { h − , h, b } . Thus P { N (cid:96) ≤ k } ≤ exp( − ω (log N )) ≤ N .
As a result, P { FILL (cid:96) } ≥ − P { N (cid:96) ≤ k } ≥ − N , which completes the proof.
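The Chernoff-type bound used in this proof can be evaluated numerically for any concrete state. The sketch below is an illustration under stated assumptions rather than part of the original analysis: `s` is a list with `s[0] = 1` and `s[i]` the fraction of queues of length at least `i`, `probes` stands for the number kd of sampled queues, and the grid search over `t` is a crude stand-in for the specific choice of `t` made in the proof. All function names are hypothetical. The i.i.d. product is a valid bound for sampling without replacement because of the Hoeffding comparison cited in the proof.

```python
import math


def space_distribution(s, ell):
    """Return p with p[x] = P{X = x}, x = 0..ell, where X is the free space
    below threshold ell in one uniformly sampled queue.

    s[i] is the fraction of queues with length at least i, with s[0] = 1.
    """
    p = [0.0] * (ell + 1)
    p[0] = s[ell]
    for x in range(1, ell + 1):
        p[x] = s[ell - x] - s[ell - x + 1]
    return p


def chernoff_bound(s, ell, k, probes, t):
    """Evaluate e^{tk} * (E[e^{-tY}])^probes, an upper bound on P{N_ell <= k}
    for any t > 0."""
    p = space_distribution(s, ell)
    mgf = sum(p[x] * math.exp(-t * x) for x in range(ell + 1))
    # Work in the log domain to avoid underflow for large probe counts.
    return math.exp(t * k + probes * math.log(mgf))


def best_chernoff_bound(s, ell, k, probes, grid=None):
    """Minimize the bound over a grid of t values and cap it at 1."""
    grid = grid or [0.01 * i for i in range(1, 501)]
    return min(1.0, min(chernoff_bound(s, ell, k, probes, t) for t in grid))


if __name__ == "__main__":
    # Hypothetical state in which 2% of the probed positions are free on average.
    s = [1.0, 0.98, 0.98, 0.98]
    print(best_chernoff_bound(s, ell=3, k=10, probes=400))
```

Comparing this bound with a direct simulation of the sampling step shows how much slack the exponential-moment argument leaves in a given state.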
Proof of Lemma 3 (State-Space Collapse)
Proof.
Consider the Lyapunov function in the lemma, i.e., V ( s ) = min h − b (cid:88) i = h s i , b (cid:32)(cid:18) − βN − α (cid:19) − h − h − (cid:88) i =1 s i (cid:33) + . We will refer to the first term and second term in the minimum as T and T , respectively. Let B = b − h +1 h − (cid:16) βN − α + log N √ N (cid:17) and suppose V ( s ) > B . Recall that the drift of V is given by ∆ V ( s ) = GV ( s ) = (cid:88) s (cid:48) ∈S , s (cid:54) = s (cid:48) r s → s (cid:48) ( V ( s (cid:48) ) − V ( s )) , where r s → s (cid:48) is the transition rate from state s to s (cid:48) . Let e i = (cid:0) , · · · , , N , , · · · , (cid:1) be a vector of length b whose i th entry is N and all the other entries are zero. We divide the discussion into two cases. Case 1: T ≤ T . In this case V ( s ) = T . When the state transition is due to a task departure from a queue of length i , which has a rate of N ( s i − s i +1 ) , then V ( s − e i ) = (cid:40) V ( s ) , if ≤ i < h,V ( s ) − N ( h − , if h ≤ i ≤ b. Now consider the state transition due to a job arrival. Let a i be the queueing position that task i is assigned to. Thenthe next state can be written as s + e a + · · · + e a k . Note that when the event
FILL h − happens, the dispatcher puts all k tasks to positions below threshold h − . Thenunder FILL h − , s i does not change for i ≥ h , which implies that V ( s + e a + · · · + e a k ) = V ( s ) . We can show that P { FILL h − } ≥ − N using Lemma 1 since T ≥ T > B > . Otherwise, i.e., when FILL h − is not true, it is easy to see that V ( s + e a + · · · + e a k ) ≤ V ( s ) + kN ( h − . PREPRINT - A
PRIL
7, 2020Therefore, ∆ V ( s ) ≤ b (cid:88) i =1 N ( s i − s i +1 ) ( V ( s − e i ) − V ( s )) + N λk N kN ( h − N ( h − − s h h − ≤ N ( h − − h − b − h + 1 b (cid:88) i = h s i . By the assumption that T > B , we have b − h + 1 b (cid:88) i = h s i ≥ h − b − h + 1 B = βN − α + log N √ N .
Inserting this back to the upper bound on ∆ V ( s ) gives ∆ V ( s ) ≤ − h − (cid:18) − N + βN − α + log N √ N (cid:19) . Since βN − α h − ≥ N − α k ≥ b √ N and log N √ N ≥ N when N is sufficiently large, this upper bound becomes ∆ V ( s ) ≤ − b √ N .
Case 2: T > T . In this case V ( s ) = T . Similarly, a task departs from a queue of length i at a rate of N ( s i − s i +1 ) .The change in V ( s ) can be bounded as V ( s − e i ) − V ( s ) ≤ (cid:40) bN ( h − , if ≤ i < h, , if h ≤ i ≤ b. When a job arrives, under the event
FILL h − , V ( s + e a + · · · + e a k ) = V ( s ) − kbN ( h − , where we have used the fact that T > B . Again, P { FILL h − } ≥ − N by Lemma 1. Otherwise, i.e., when FILL h − is not true, V ( s + e a + · · · + e a k ) ≤ V ( s ) .Therefore, ∆ V ( s ) ≤ b (cid:88) i =1 N ( s i − s i +1 ) ( V ( s − e i ) − V ( s ))+ N λk (cid:18) − N (cid:19) (cid:18) − kbN ( h − (cid:19) ≤ bh − s − s h ) − bh − (cid:18) − N (cid:19) (cid:0) − βN − α (cid:1) ≤ bh − (cid:18) − (cid:18) βN − α + log N √ N (cid:19) − (cid:18) − N (cid:19) (cid:0) − βN − α (cid:1)(cid:19) , (25) = bh − (cid:18) − log N √ N + 1 N (cid:0) − βN − α (cid:1)(cid:19) ≤ − bh − N − √ N √ N , where (25) is due to the fact that s ≤ and the fact that s h ≥ βN − α + log N √ N following similar arguments as those inCase 1 noting that T > T > B . When N is sufficiently large, this upper bound becomes ∆ V ( s ) ≤ − b √ N , PREPRINT - A
PRIL
7, 2020which completes the proof of the drift bound in Lemma 3.For this Lyapunov function V , under the notation in Lemma 2, we have that ν max ≤ kbN ( h − and f max ≤ bh − . Let E = S and j = √ N log N . Then by Lemma 2, the drift bound implies that P (cid:26) V ( S ) > B + 2 kb log N ( h − √ N (cid:27) = P (cid:26) V ( S ) > B + 2 kb ( h − N j (cid:27) ≤ (cid:18) h − √ N (cid:19) − j ≤ (cid:32)(cid:18) √ N (cid:19) √ N +1 (cid:33) − √ N +1 √ N log N ≤ e − log N , where the last inequality holds when N is sufficiently large. This completes the proof. In this section, we prove the lower-bound results in Theorems 3 and 4. We first present the proofs of Theorems 3 and4 in Sections 5.1 and 5.2, respectively. Then we give the lemmas needed in Section 5.3. Due to space limitations, theproofs of the lemmas are given in Appendix B. Throughout this section, we assume that the assumptions in Theorem 3hold.
Proof.
The proof proceeds in an iterative fashion. The base case is that E [ S ] = λ = 1 − βN − α , which can be provedusing the Little’s law. We will then bound S − S i based on properties of S − S i − .For simplicity, let u = 2 kd , this is the ratio appearing in Lemma 6. Consider a Lyapunov function V ( s ) = s . Let h = O (log k ) and B = 1 − hβN − α . For some state s such that V ( s ) > B , it holds ∆ V ( s ) = (cid:88) s (cid:48) : s → s (cid:48) due to an arrival r s → s (cid:48) ( V ( s (cid:48) ) − V ( s ))+ (cid:88) s (cid:48) : s → s (cid:48) due to a departure r s → s (cid:48) ( V ( s (cid:48) ) − V ( s )) (a) ≤ uhβN − α − N ( s − s ) 1 N = uhβN − α − ( s − s ) , where (a) is due to Lemma 6.Let p = P (cid:8) S − S ≤ uh βN − α (cid:9) and E = (cid:8) s ∈ S| s − s > uh βN − α (cid:9) . Then p = P { S (cid:54)∈ E } . We now use the tail bound in Lemma 2. Assume that we follow the notation in the lemma.Consider the following two cases: • s (cid:54)∈ E , ∆ V ( s ) ≤ uhβN − α =: δ . • s ∈ E . Let γ = − ∆ V ( s ) . It holds γ ≥ uhβN − α ( h − .18 PREPRINT - A
PRIL
7, 2020Following the definition in 2, it is easy to verify that ν max ≤ kN and f max ≤ for V ( s ) . Let j = (cid:16) N α βuh ( h − (cid:17) log N . By Lemma 2, it holds that P { V ( S ) > B + 2 ν max j } ≤ (cid:18) f max f max + γ (cid:19) j + (cid:18) δγ + 1 (cid:19) P { S (cid:54)∈ E }≤ (cid:18) f max f max + γ (cid:19) j + hh − p . Besides, when N is large enough, (cid:18) f max f max + γ (cid:19) j ≤ (cid:0) uhβN − α ( h − (cid:1) − ( N α βuh ( h − ) log N ≤ e − log N . As a result, P { V ( S ) > B + 2 ν max j } ≤ N − log N + hh − p . Since < α < . and k = e o ( √ log N ) , − ( h − βN − α > − hβN − α + 2 kN (cid:18) N α βuh ( h − (cid:19) log N when N is large enough. It follows that P (cid:8) V ( S ) > − ( h − βN − α (cid:9) ≤ P { V ( S ) > B + 2 ν max j }≤ N − log N + hh − p However, by Lemma 5, P (cid:8) V ( S ) > − ( h − βN − α (cid:9) ≥ − h − . Therefore, hh − p + N − log N ≥ h − h − , and thus P (cid:8) S − S ≤ uh βN − α (cid:9) = p ≥ h − h − N − log N . Let b q = u q − h q βN − α for an integer q > . Define a sequence a q , such that a = 0 , a = 1 and a q = ( q − a q − +2 for q > . We now have P { S − S ≤ a b } ≥ h − h − N − log N . We can use Lemma 7 successively to establish P { S − S q ≤ a q b q } ≥ (cid:18) h − h (cid:19) q − − ( q − N − log N for all ≤ q ≤ h .Let us condition on S − S h ≤ a h b h . For ease of notation, let p c = (cid:0) h − h (cid:1) h − − ( h − N − log N , which is a lowerbound on the probability of the condition. Note that E [ S ] ≤ E [ S | S − S h ≤ a h b h ] P { S − S h ≤ a h b h } + 1 · P { S − S h > a h b h } . Thus E [ S | S − S h ≤ a h b h ] ≥ − βN − α − (1 − P { S − S h ≤ a h b h } ) P { S − S h ≤ a h b h }≥ − βp c N − α . PREPRINT - A
PRIL
7, 2020We can also see that P (cid:40) h (cid:88) i =1 S i ≥ h − d (cid:41) ≥ P (cid:40) h (cid:88) i =1 S i ≥ h − d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) S − S h ≤ a h b h (cid:41) P { S − s h ≤ a h b h }≥ p c P (cid:26) hS − h ( S − S h ) ≥ h − d (cid:12)(cid:12)(cid:12)(cid:12) S − S h ≤ a h b h (cid:27) ≥ p c P (cid:26) S ≥ − dh + a h b h (cid:12)(cid:12)(cid:12)(cid:12) S − S h ≤ a h b h (cid:27) . (26)Utilizing the Markov inequality gives (26) ≥ p c (cid:18) − dh − dh E [ S | S − S h ≤ a h b h ]1 − dha h b h (cid:19) ≥ p c (cid:18) − βp c dh − dha h b h N − α (cid:19) . Recall that a q = ( q − a q − + 2 for q > and a = 1 . We have a h ≤ h h , and thus a h b h ≤ βu h h h N − α . As d = e o (log N/ log k ) , k = e o ( √ log N ) , h = O (log k ) , we have ln( a h b h ) = − Ω(log N ) . Furthermore, since ln(3 dh ) = o (log N/ log k ) + O (log k ) , α > , it holds − βp c dh − dha h b h N − α ≥ if N is sufficiently large. Note that p c is equal to (cid:0) h − h (cid:1) h − − ( h − N − log N which converges to e . We couldconclude that when N goes to infinity, we have P (cid:40) h (cid:88) i =1 S i ≥ h − d (cid:41) ≥ e . Proof.
Let h = 12 e ln k . Then h = O (log k ) . Suppose that we have an incoming job. By Theorem 3 and the PASTAproperty of a Poisson arrival process, with probability at least e , this job will see a state s such that (cid:80) hi =1 s i ≥ h − d . By Lemma 8, the dispatcher will route at least one task of this job into a queue of length at least h + 1 with probability − o (1) . Let T be the delay of the job. Then it holds for a large enough N , E [ T ] ≥ k (1 − o (1)) ≥ k, which completes the proof. Assume that the system is stable. Then for any x > , P { S < − x } ≤ βN − α x . lemma 6. Let (cid:96) be a threshold such that ≤ (cid:96) ≤ h with h = O (log k ) . Suppose that an incoming job sees astate s such that (cid:80) (cid:96)i =1 s i ≥ (cid:96) − x , where x = Ω( hN − α ) and x = e − Ω(log N ) . Consider a Lyapunov function V (cid:96) ( s ) = s + s + · · · + s (cid:96) . It holds that when N is sufficiently large, (cid:88) s (cid:48) : s → s (cid:48) due to an arrival r s → s (cid:48) ( V (cid:96) ( s (cid:48) ) − V (cid:96) ( s )) ≤ kdx, where r s → s (cid:48) is the transition rate, and s → s (cid:48) due to an arrival means that s will move to state s (cid:48) on the Markovchain only if there is an incoming job. PREPRINT - A
PRIL
7, 2020Lemma 7 below is a key in establishing the iterative proof. This lemma relates S q to S q − for ≤ i ≤ h . lemma 7. Define u = 2 kd and b q = u q − h q βN − α for q ∈ N . Define a sequence a q , such that a = 0 , a = 1 and a q = ( q − a q − + 2 for q > . For any q with ≤ q ≤ h , if P { S − S q − ≤ a q − b q − } ≥ (cid:18) h − h (cid:19) q − − ( q − N − log N , then P { S − S q ≤ a q b q } ≥ (cid:18) h − h (cid:19) q − − ( q − N − log N . Lemma 8 below complements the probability bound in Lemma 1. Recall that
FILL h denotes the event that all the k tasks of an incoming job are assigned to queueing positions below a threshold h . Lemma 8 gives a condition on thetotal queue length for FILL h to happen with low probability. lemma 8. Suppose an incoming job sees a state s such that (cid:80) hi =1 s i > h − d . Then when N is sufficiently large, P { FILL h } = o (1) . We studied a load balancing algorithm, batch-filling, for a system where each job consists of k parallel tasks inthe sub-Halfin-Whitt regime of heavy traffic. We showed that to achieve zero queueing delay for such jobs, weonly need a probe overhead of d = ω (cid:16) − λ ) log k (cid:17) under proper conditions. Existing work has shown that d = ω (cid:16) − λ (cid:17) is necessary for achieving zero queueing delay when each job consists of a single task. Therefore, with aparallel structure, we save a factor of log k communication overhead. We also established a lower-bound result on theprobe overhead d , where we showed that d = Ω (cid:16) exp (cid:16) log N log k (cid:17)(cid:17) is necessary for achieving zero queueing delay. Aninteresting future direction is to extend our results to general service time distributions, where it is possible to get moresavings when the distributions have a heavy tail. References [1] S. Banerjee and D. Mukherjee. Join-the-shortest queue diffusion limit in halfin–whitt regime: Tail asymptoticsand scaling of extrema.
Ann. Appl. Probab. , 29(2):1262–1309, 2019.[2] D. Bertsimas, D. Gamarnik, and J. N. Tsitsiklis. Performance of multiclass markovian queueing networks viapiecewise linear lyapunov functions.
Ann. Appl. Probab. , 11(4):1384–1428, 11 2001.[3] A. Braverman. Steady-state analysis of the join the shortest queue model in the halfin-whitt regime. arXiv:1801.05121 [math.PR] , 2018.[4] A. Braverman and J. Dai. Stein’s method for steady-state diffusion approximations of m/ Ph /n + m systems. Ann. Appl. Probab. , 27:550–581, Feb. 2017. doi: 10.1214/16-AAP1211.[5] A. Braverman, J. Dai, and J. Feng. Stein’s method for steady-state diffusion approximations: an introductionthrough the erlang-a and erlang-c models.
Stoch. Syst. , 6(2):301–366, 2017.[6] P. Eschenfeldt and D. Gamarnik. Join the shortest queue with many servers. the heavy-traffic asymptotics.
Math.Oper. Res. , 43(3):867–886, 2018.[7] D. Gamarnik, J. N. Tsitsiklis, and M. Zubeldia. Delay, memory, and messaging tradeoffs in distributed servicesystems. In
Proc. ACM SIGMETRICS/PERFORMANCE Jt. Int. Conf. Measurement and Modeling of ComputerSystems , pages 1–12. ACM, 2016.[8] N. Gast. Expected values estimated via mean-field approximation are 1/n-accurate. In
Proc. ACM Measurementand Analysis of Computing Systems (POMACS) , volume 45, pages 50–50. ACM, 2017.[9] N. Gast and B. Van Houdt. A refined mean field approximation. In
Proc. ACM Measurement and Analysis ofComputing Systems (POMACS) , volume 1, page 33. ACM, 2017.[10] V. Gupta and N. Walton. Load balancing in the nondegenerate slowdown regime.
Oper. Res. , 67(1):281–294,2019. 21
PREPRINT - A
PRIL
7, 2020[11] I. Gurvich. Diffusion models and steady-state approximations for exponentially ergodic markovian queues.
Ann.Appl. Probab. , 24(6):2527–2559, 2014.[12] S. Halfin and W. Whitt. Heavy-traffic limits for queues with many exponential servers.
Oper. Res. , 29(3):567–588, 1981.[13] W. Hoeffding. Probability inequalities for sums of bounded random variables.
J. Amer. Stat. Assoc. , 58(301):13–30, 1963.[14] G. Kamath. Bounds on the expectation of the maximum of samples from a gaussian[online], 2015. URL .[15] X. Liu.
Steady State Analysis of Load Balancing Algorithms in Heavy Traffic Regime . PhD thesis, Arizona StateUniv., Tempe, AZ, USA, 2019.[16] X. Liu and L. Ying. A simple steady-state analysis of load balancing algorithms in the sub-halfin-whitt regime. arXiv:1804.02622 [math.PR] , 2018.[17] X. Liu and L. Ying. On achieving zero delay with power-of-d-choices load balancing. In
Proc. IEEE Int. Conf.Computer Communications (INFOCOM) , pages 297–305, Honolulu, HI, USA, Apr. 2018.[18] Y. Lu, Q. Xie, G. Kliot, A. Geller, J. R. Larus, and A. Greenberg. Join-Idle-Queue: A novel load balancingalgorithm for dynamically scalable web services.
Perform. Eval., 68(11):1056–1071, Nov. 2011.
[19] M. Lugo. The expectation of the maximum of exponentials [online], 2011. URL .
[20] M. Mitzenmacher. The power of two choices in randomized load balancing. IEEE Trans. Parallel Distrib. Syst., 12(10):1094–1104, 2001.
[21] D. Mukherjee, S. C. Borst, J. S. Van Leeuwaarden, and P. A. Whiting. Universality of power-of-d load balancing in many-server systems. Stoch. Syst., 8(4):265–292, 2018.
[22] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: distributed, low latency scheduling. In Proc. ACM Symp. Operating Systems Principles (SOSP), pages 69–84. ACM, 2013.
[23] C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.
[24] A. L. Stolyar. Pull-based load distribution in large-scale heterogeneous service systems. Queueing Syst., 80(4):341–361, 2015.
[25] A. L. Stolyar. Tightness of stationary distributions of a flexible-server system in the Halfin-Whitt asymptotic regime. Stoch. Syst., 5(2):239–267, 2015.
[26] A. Verma, L. Pedrosa, M. R. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes. Large-scale cluster management at Google with Borg. In Proc. European Conf. Computer Systems (EuroSys), 2015.
[27] N. D. Vvedenskaya, R. L. Dobrushin, and F. I. Karpelevich. Queueing system with selection of the shortest of two queues: An asymptotic approach. Problems of Information Transmission, 32(1):15–27, 1996.
[28] W. Wang, S. T. Maguluri, R. Srikant, and L. Ying. Heavy-traffic delay insensitivity in connection-level models of data transfer with proportionally fair bandwidth sharing. In Proc. ACM SIGMETRICS Int. Conf. Measurement and Modeling of Computer Systems, volume 45, pages 232–245. ACM, 2018.
[29] W. Wang, M. Harchol-Balter, H. Jiang, A. Scheller-Wolf, and R. Srikant. Delay asymptotics and bounds for multitask parallel jobs. Queueing Syst., 91(3):207–239, Apr. 2019.
[30] R. R. Weber. On the optimal assignment of customers to parallel servers. J. Appl. Probab., 15(2):406–413, 1978.
[31] W. Winston. Optimality of the shortest line discipline. J. Appl. Probab., 14(1):181–189, 1977.
[32] L. Ying. On the approximation error of mean-field models. ACM SIGMETRICS Perform. Evaluation Rev., 44(1):285–297, 2016.
[33] L. Ying. Stein’s method for mean field approximations in light and heavy traffic regimes. ACM SIGMETRICS Perform. Evaluation Rev., 45(1):49, 2017.
[34] L. Ying, R. Srikant, and X. Kang. The power of slightly more than one sample in randomized load balancing. In Proc. IEEE Int. Conf. Computer Communications (INFOCOM), pages 1131–1139, Kowloon, Hong Kong, Apr. 2015.
A Proof of Lemma 4
Proof.
The proof idea is similar to that in [14]. Let $M_X(s)$ be the moment generating function of a random variable $X$. By assumption, $Y_i = \sum_{j=1}^{n_i} X_{i,j}$, and the $X_{i,j}$, $1 \le i \le m$, $1 \le j \le n_i$, are all independent and exponentially distributed with mean $1$. Therefore, for any $1 \le i \le m$, $1 \le j \le n_i$ and any $s < 1$,
$$M_{X_{i,j}}(s) = \mathbb{E}\left[e^{sX_{i,j}}\right] = \frac{1}{1-s}, \qquad M_{Y_i}(s) = \mathbb{E}\left[e^{sY_i}\right] = \left(\frac{1}{1-s}\right)^{n_i}.$$
Let $q = \max\{n_1, \cdots, n_m\}$. It holds that for any $s \in (0,1)$,
\begin{align}
\exp\left(s\,\mathbb{E}\left[\max_{j=1}^{m} Y_j\right]\right) &\le \mathbb{E}\left[\exp\left(s \max_{j=1}^{m} Y_j\right)\right] \tag{27}\\
&= \mathbb{E}\left[\max_{j=1}^{m} \exp(sY_j)\right] \tag{28}\\
&\le \sum_{j=1}^{m} \mathbb{E}\left[\exp(sY_j)\right] \tag{29}\\
&\le m\left(\frac{1}{1-s}\right)^{q}, \tag{30}
\end{align}
where (27) is due to Jensen's inequality, (29) is true since the maximum of nonnegative terms is upper bounded by their sum, and (30) uses $n_j \le q$ together with $\frac{1}{1-s} \ge 1$. Taking logarithms and dividing by $s$, we obtain
$$\mathbb{E}\left[\max_{j=1}^{m} Y_j\right] \le \frac{\ln m}{s} + q \cdot \frac{-\ln(1-s)}{s}.$$
Since we assume that $q = o(\log m)$, we can write $q$ as $q = (\ln m) \cdot \ell(m)$ where $\ell(m) \to 0^+$ as $m \to \infty$. Let $s = 1 - \ell(m)$, which lies in $(0,1)$ for all sufficiently large $m$. Then
\begin{align*}
\mathbb{E}\left[\max_{j=1}^{m} Y_j\right] &\le \frac{\ln m}{1-\ell(m)}\bigl(1 - \ell(m)\ln(\ell(m))\bigr)\\
&= (\ln m)\left(1 + \frac{\ell(m)}{1-\ell(m)}\right)\bigl(1 - \ell(m)\ln(\ell(m))\bigr).
\end{align*}
Note that $\lim_{m \to \infty} \ell(m)\ln(\ell(m)) = 0$. Then as $m \to \infty$,
$$\mathbb{E}\left[\max_{j=1}^{m} Y_j\right] \le (\ln m)(1 + o(1)),$$
which completes the proof.

B Proofs of Lemmas 5–8
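As a numerical sanity check on the bound $\frac{\ln m}{s} + q \cdot \frac{-\ln(1-s)}{s}$ derived in the proof of Lemma 4, the following minimal sketch looks at the special case $n_1 = \cdots = n_m = 1$ (so $q = 1$), where the expected maximum of $m$ i.i.d. unit-mean exponentials is known exactly: it equals the harmonic number $H_m$. The function names `harmonic`, `mgf_bound`, and `best_bound`, as well as the grid search over $s$, are illustrative assumptions and not part of the paper.

```python
import math

def harmonic(m: int) -> float:
    """Exact E[max of m i.i.d. Exp(1)], i.e. the m-th harmonic number H_m."""
    return sum(1.0 / i for i in range(1, m + 1))

def mgf_bound(m: int, q: int, s: float) -> float:
    """Bound from the proof: ln(m)/s + q * (-ln(1 - s)) / s, for 0 < s < 1."""
    if not 0.0 < s < 1.0:
        raise ValueError("s must lie in (0, 1)")
    return math.log(m) / s + q * (-math.log(1.0 - s)) / s

def best_bound(m: int, q: int, grid: int = 999) -> float:
    """Smallest value of the bound over a uniform grid of s in (0, 1).

    The grid search is a hypothetical convenience; the proof instead
    picks s = 1 - l(m) analytically.
    """
    return min(mgf_bound(m, q, i / (grid + 1)) for i in range(1, grid + 1))

if __name__ == "__main__":
    for m in (10**3, 10**6):
        ratio = best_bound(m, 1) / math.log(m)
        print(f"m={m}: bound / ln(m) = {ratio:.3f}")
```

The printed ratio of the bound to $\ln m$ shrinks as $m$ grows, in line with the $(\ln m)(1 + o(1))$ conclusion, although the convergence is slow.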
B.1 Proof of Lemma 5
Proof.
By Little's law, it holds that $\mathbb{E}[S_1] = \lambda = 1 - \beta N^{-\alpha}$. Then $\mathbb{E}[1 - S_1] = \beta N^{-\alpha}$. Therefore, by the Markov inequality, for any $x > 0$,
$$\mathbb{P}\{S_1 < 1 - x\} = \mathbb{P}\{1 - S_1 > x\} \le \frac{\beta N^{-\alpha}}{x}.$$

B.2 Proof of Lemma 6
Proof.
Suppose that an arrival sees a state $s$. Given $\sum_{i=1}^{\ell} s_i \ge \ell - x$, we have $s_\ell \ge 1 - x$ since $s_i \le 1$ for all $1 \le i \le \ell$. Without loss of generality, we can think of the batch-filling policy as sampling the $kd$ queues one by one. During the sampling, at most $kd$ servers of length at least $\ell$ have already been chosen, so each draw still finds at least $N(1-x) - kd$ such servers. The probability that all $kd$ sampled servers have length at least $\ell$ is thus at least
$$\left(\frac{N(1-x) - kd}{N}\right)^{kd} = \left(1 - \left(x + \frac{kd}{N}\right)\right)^{kd}.$$
Recall that by the assumptions in Theorem 3, we have $x = e^{-\Omega(\log N)}$ and $kd = o(N^{1-\alpha})$, and thus $-\left(x + \frac{kd}{N}\right) > -1$ when $N$ is sufficiently large. Furthermore, applying Bernoulli's inequality and the assumption that $x = \Omega(hN^{-\alpha})$, it holds that
$$\left(1 - \left(x + \frac{kd}{N}\right)\right)^{kd} \ge 1 - kd\left(x + \frac{kd}{N}\right) \ge 1 - 2xkd$$
for a large $N$. Note that if we put all tasks of this arrival into servers of length at least $\ell$, we will not affect the value of $V_\ell(s)$. As a result,
$$\sum_{s' :\, s \to s' \text{ due to an arrival}} r_{s \to s'}\bigl(V_\ell(s') - V_\ell(s)\bigr) \le (1 - 2kdx) \cdot 0 \cdot \frac{\lambda}{k} + 2kdx \cdot k \cdot \frac{\lambda}{k} \le 2kdx,$$
which completes the proof.

B.3 Proof of Lemma 7
Proof.
The proof is close to that of Theorem 3. Recall that for each $1 \le \ell \le h$ and state $s \in \mathcal{S}$, we define the Lyapunov function
$$V_\ell(s) = \sum_{i=1}^{\ell} s_i.$$
For $q$ such that $2 \le q \le h$, by assumption,
$$\mathbb{P}\{S_1 - S_{q-1} \le a_{q-1} b_{q-1}\} \ge \left(\frac{h-2}{h}\right)^{q-2} - (q-2)N^{-\log N}.$$
It holds that
\begin{align}
&\mathbb{P}\{V_{q-1}(S) < q-1 - ((q-1)a_{q-1} + 1)b_{q-1}\} \nonumber\\
&\quad\le \mathbb{P}\{V_{q-1}(S) < q-1 - ((q-1)a_{q-1} + 1)b_{q-1},\ S_1 - S_{q-1} \le a_{q-1}b_{q-1}\} + \mathbb{P}\{S_1 - S_{q-1} > a_{q-1}b_{q-1}\} \nonumber\\
&\quad\le \mathbb{P}\{(q-1)S_1 < q-1 - b_{q-1}\} + 1 - \left(\frac{h-2}{h}\right)^{q-2} + (q-2)N^{-\log N} \nonumber\\
&\quad\le \frac{q-1}{u^{q-2}h^{q-1}} + 1 - \left(\frac{h-2}{h}\right)^{q-2} + (q-2)N^{-\log N}. \tag{31}
\end{align}
The second inequality holds because $V_{q-1}(S) \ge (q-1)S_{q-1}$, and the last inequality uses Lemma 5 and $b_{q-1} = u^{q-2}h^{q-1}\beta N^{-\alpha}$.

Now let $B_{q-1} = q-1 - ((q-1)a_{q-1} + 2)b_{q-1}$. We can see that $B_{q-1} = q-1 - a_q b_{q-1}$. For a state $s$ such that $V_{q-1}(s) > B_{q-1}$, it holds that
$$\Delta V_{q-1}(s) = \sum_{s' :\, s \to s' \text{ due to an arrival}} r_{s \to s'}\bigl(V_{q-1}(s') - V_{q-1}(s)\bigr) + \sum_{s' :\, s \to s' \text{ due to a departure}} r_{s \to s'}\bigl(V_{q-1}(s') - V_{q-1}(s)\bigr).$$
Recall that we define $u = 2kd$ and $b_q = u^{q-1}h^{q}\beta N^{-\alpha}$. As $V_{q-1}(s) > q-1 - a_q b_{q-1}$, by Lemma 6, it holds that
\begin{align*}
\Delta V_{q-1}(s) &\le 2kd\,a_q b_{q-1} - (s_1 - s_q)\\
&= a_q u^{q-1}h^{q-1}\beta N^{-\alpha} - (s_1 - s_q).
\end{align*}
Let $p_q = \mathbb{P}\{S_1 - S_q \le a_q b_q\}$ and $E_{q-1} = \{s \in \mathcal{S} \mid s_1 - s_q > a_q b_q\}$. Then $\mathbb{P}\{S \notin E_{q-1}\} = p_q$. For a state $s$, consider the following two cases.
• $s \notin E_{q-1}$: $\Delta V_{q-1}(s) \le a_q u^{q-1}h^{q-1}\beta N^{-\alpha} =: \delta$.
• $s \in E_{q-1}$: Let $\gamma = -\Delta V_{q-1}(s)$. It holds that $\gamma \ge a_q u^{q-1}h^{q-1}\beta N^{-\alpha}(h-1)$.

We then utilize the tail bound, Lemma 2. Following the definitions in Lemma 2, it is easy to verify that $\nu_{\max} \le \frac{k}{N}$ and $f_{\max} \le 1$ for the Lyapunov function $V_{q-1}(s)$. Let
$$j_{q-1} = \left(\frac{N^{\alpha}}{a_q u^{q-1}h^{q-1}(h-1)\beta}\right)\log^2 N.$$
Using Lemma 2,
\begin{align*}
\mathbb{P}\{V_{q-1}(S) > B_{q-1} + 2\nu_{\max} j_{q-1}\} &\le \left(\frac{f_{\max}}{f_{\max} + \gamma}\right)^{j_{q-1}} + \left(\frac{\delta}{\gamma} + 1\right)\mathbb{P}\{S \notin E_{q-1}\}\\
&\le \left(\frac{f_{\max}}{f_{\max} + \gamma}\right)^{j_{q-1}} + \frac{h}{h-1}\,p_q.
\end{align*}
Note that when $N$ is sufficiently large,
$$\left(\frac{f_{\max}}{f_{\max} + \gamma}\right)^{j_{q-1}} \le e^{-\log^2 N}.$$
Besides, we assume that $0 < \alpha < 0.5$, $k = e^{o(\sqrt{\log N})}$ and $h = O(\log k)$, so that $2\nu_{\max} j_{q-1} \le b_{q-1}$ for a large $N$. As a result, for a large $N$,
\begin{align*}
\mathbb{P}\{V_{q-1}(S) \ge q-1 - ((q-1)a_{q-1} + 1)b_{q-1}\} &\le \mathbb{P}\{V_{q-1}(S) > B_{q-1} + 2\nu_{\max} j_{q-1}\}\\
&\le e^{-\log^2 N} + \frac{h}{h-1}\,p_q.
\end{align*}
Together with Eq. (31), we have
\begin{align*}
\left(\frac{h-2}{h}\right)^{q-2} - \frac{q-1}{u^{q-2}h^{q-1}} - (q-2)N^{-\log N} &\le \mathbb{P}\{V_{q-1}(S) \ge q-1 - ((q-1)a_{q-1} + 1)b_{q-1}\}\\
&\le e^{-\log^2 N} + \frac{h}{h-1}\,p_q.
\end{align*}
We can conclude that for a large $N$,
$$\mathbb{P}\{S_1 - S_q \le a_q b_q\} = p_q \ge \left(\frac{h-2}{h}\right)^{q-1} - (q-1)N^{-\log N},$$
which completes the proof.

B.4 Proof of Lemma 8
Proof.
We use a similar argument as in the proof of Lemma 1. Suppose that an arrival sees a state $s$. By assumption, it holds that
$$\sum_{i=1}^{h} s_i \ge h - \frac{1}{3d}.$$
Let $X_1, \cdots, X_{kd}$ be the numbers of places below $h$ in each sampled server. The goal is to show that
$$\mathbb{P}\{\mathrm{FILL}_h\} = \mathbb{P}\left\{\sum_{i=1}^{kd} X_i \ge k\right\} = o(1)$$
when $N$ is large enough.

We can see that for each integer $x$ such that $1 \le x \le h$, $\mathbb{P}\{X_i = x\} = s_{h-x} - s_{h-x+1}$, and $\mathbb{P}\{X_i = 0\} = s_h$, with the convention $s_0 = 1$. Since we are sampling without replacement, $X_1, \cdots, X_{kd}$ are not independent. But still, utilizing a result of Hoeffding [13, Theorem 4], we have
$$\mathbb{E}\left[f\left(\sum_{i=1}^{kd} X_i\right)\right] \le \mathbb{E}\left[f\left(\sum_{i=1}^{kd} Y_i\right)\right]$$
for any continuous and convex function $f(\cdot)$, where $Y_1, \cdots, Y_{kd}$ are i.i.d. and follow the same distribution as $X_1$. Take $f(\cdot)$ to be $f(x) = e^{tx}$ where $t$ is some positive value.

It then holds that
\begin{align*}
\mathbb{P}\{\mathrm{FILL}_h\} = \mathbb{P}\left\{\sum_{i=1}^{kd} X_i \ge k\right\} &= \mathbb{P}\left\{e^{t\sum_{i=1}^{kd} X_i} \ge e^{tk}\right\}\\
&\le e^{-tk}\prod_{i=1}^{kd} \mathbb{E}\left[e^{tY_i}\right]\\
&= e^{-tk}\prod_{i=1}^{kd}\left(1 + \sum_{j=1}^{h}\left(e^{t(h-j+1)} - 1\right)(s_{j-1} - s_j)\right).
\end{align*}
Since $1 + x \le e^{x}$ for all $x > 0$, we can further have
\begin{equation}
\mathbb{P}\{\mathrm{FILL}_h\} \le e^{-tk}\exp\left(kd\sum_{j=1}^{h}\left(e^{t(h-j+1)} - 1\right)(s_{j-1} - s_j)\right). \tag{32}
\end{equation}
Rearranging the sum in (32), we get
\begin{align}
\sum_{j=1}^{h}\left(e^{t(h-j+1)} - 1\right)(s_{j-1} - s_j) &= e^{th} - 1 - \sum_{j=1}^{h} s_j\left(e^{t(h-j+1)} - e^{t(h-j)}\right) \nonumber\\
&= e^{th} - 1 - (e^{t} - 1)\sum_{j=1}^{h} s_j e^{t(h-j)}. \tag{33}
\end{align}
Recall that $\sum_{j=1}^{h} s_j \ge h - \frac{1}{3d}$, and $1 \ge s_1 \ge s_2 \ge \cdots \ge s_h \ge 0$. Eq. (33) is maximized when $s_1 = s_2 = \cdots = s_h = 1 - \frac{1}{3dh}$ and thus,
$$(33) \le \left(e^{th} - 1\right)\frac{1}{3dh}.$$
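The rearrangement in (33) and the claim that equal coordinates maximize it can be checked numerically. The sketch below is an illustration under stated assumptions: `lhs` evaluates the sum in (32) with the convention $s_0 = 1$, `rhs` evaluates the rearranged form (33), and a generic slack `c` stands in for the per-coordinate deficit $1 - s_j$ instead of the specific constant used in the text. The function names and the sample values are hypothetical.

```python
import math

def lhs(s, t):
    """Sum over j = 1..h of (e^{t(h-j+1)} - 1) * (s_{j-1} - s_j), with s_0 = 1."""
    h = len(s)
    full = [1.0] + list(s)  # full[j] = s_j, and full[0] = s_0 = 1
    return sum(
        (math.exp(t * (h - j + 1)) - 1.0) * (full[j - 1] - full[j])
        for j in range(1, h + 1)
    )

def rhs(s, t):
    """Rearranged form (33): e^{th} - 1 - (e^t - 1) * sum_j s_j * e^{t(h-j)}."""
    h = len(s)
    weighted = sum(s[j - 1] * math.exp(t * (h - j)) for j in range(1, h + 1))
    return math.exp(t * h) - 1.0 - (math.exp(t) - 1.0) * weighted

if __name__ == "__main__":
    t = 0.3
    s = [0.9, 0.8, 0.5]                  # any nonincreasing state in [0, 1]
    print(lhs(s, t), rhs(s, t))          # the two forms should agree

    c, h = 0.05, 4                       # hypothetical per-coordinate slack
    equal = [1.0 - c] * h
    print(rhs(equal, 0.25), (math.exp(0.25 * h) - 1.0) * c)
```

With all coordinates equal to $1 - c$, the geometric sum collapses and (33) reduces to $(e^{th} - 1)c$, which is the shape of the final bound above.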