On the Throughput Optimization in Large-Scale Batch-Processing Systems
Sounak Kar, Robin Rehrmann, Arpan Mukhopadhyay, Bastian Alt, Florin Ciucu, Heinz Koeppl, Carsten Binnig, Amr Rizk
Sounak Kar (TU Darmstadt, Darmstadt, Germany)
Robin Rehrmann (TU Dresden, Dresden, Germany)
Arpan Mukhopadhyay (University of Warwick, Coventry, United Kingdom)
Bastian Alt (TU Darmstadt, Darmstadt, Germany)
Florin Ciucu (University of Warwick, Coventry, United Kingdom)
Heinz Koeppl (TU Darmstadt, Darmstadt, Germany)
Carsten Binnig (TU Darmstadt, Darmstadt, Germany)
Amr Rizk (Universität Ulm, Ulm, Germany)
ABSTRACT
We analyze a data-processing system with n clients producing jobs which are processed in batches by m parallel servers; the system throughput critically depends on the batch size and a corresponding sub-additive speedup function. In practice, throughput optimization relies on numerical searches for the optimal batch size, a process that can take up to multiple days in existing commercial systems. In this paper, we model the system in terms of a closed queueing network; a standard Markovian analysis yields the optimal throughput in ω(n²) time. Our main contribution is a mean-field model of the system for the regime where the system size is large. We show that the mean-field model has a unique, globally attractive stationary point which can be found in closed form and which characterizes the asymptotic throughput of the system as a function of the batch size. Using this expression we find the asymptotically optimal throughput in O(1) time. Numerical settings from a large commercial system reveal that this asymptotic optimum is accurate in practical finite regimes.

1 INTRODUCTION

A key technique to cut back overhead in data-processing systems is service batching, i.e., collecting the inputs to form batches that are then processed as one entity. The rationale lies in the overhead amortization when increasing the batch size. A prominent example highlighting the benefits of service batching is a Linux-based system in which the network-card throughput can be substantially increased by batching data packets [10]. Similar improvements hold in software-defined networks by passing switching rule updates in batches from controllers to network switches [34]. In this work, we analyze the benefits of service batching in the context of large-scale data-processing systems, and in particular of a large commercial database system.

We consider a closed system in which n clients generate jobs to be processed by m parallel servers.
Each client alternates between being in either an active or an inactive state; in the former it produces a job and in the latter it awaits the job to be fully processed. We note that each client can have at most one job in the system, i.e., a client produces a new job no sooner than its previous one finished execution. The servers process jobs in batches of size k, i.e., once k clients produce k jobs these are sent for batch processing and may have to wait in a central queue if all servers are busy; see Fig. 1.

Figure 1: A closed queueing system with n clients and m servers. Clients are either active or inactive and produce jobs at rate λx when x of them are active. The batcher produces batches of size k at rate M⌊y/k⌋ when there are y available jobs. The service station consists of a single queue and m parallel servers, each having a service rate µ; the overall batch service rate is µ min(m, z) when z batches are available.

This model is representative for some real-world data-processing systems such as databases employing Multi-Query Optimization [28, 29, 31].

Besides a model with a single job type, we also consider a generalized model with two job types. A typical example would be read and write jobs in a database system; such jobs not only have different average processing times but some are prioritized over the others, e.g., the write jobs have non-preemptive priority over the read jobs for consistency reasons.

Classical approaches to queueing systems with batch arrivals and batch service disciplines have been intensively studied, e.g., in [1, 4, 9, 11] and the references therein.
Most of these studies were either mainly concerned with open queueing systems or focused on different properties of interest such as the product form; for a more thorough discussion see Sect. 2. (All times are exponentially distributed with the rates λ, M, and µ, the last two depending on the batch size k; we will show that this technically convenient assumption is valid by fitting our model's parameters from a real-world system.) To the best of our knowledge the closed queueing system from Fig. 1 is new, i.e., it does not fit existing models.

The main contribution of this paper consists in the throughput optimization in a closed batching system characteristic of a large production system; this involves finding the optimal batch sizes. We first provide the exact analysis by solving for the balance equations in a Markov model, an approach requiring at least ω(n²) computational time. We also provide the corresponding mean-field models which yield exact results in an asymptotic regime whereby both n and m are proportionally scaled. This second approach yields the optimal (asymptotic) throughput in O(1) time, which is particularly appealing given that existing empirical approaches rely on extensive numerical searches for the optimal batch sizes, a process which typically runs in the order of days.

To find the asymptotically optimal batch size, we first prove that the dynamics of the system converge to a deterministic mean-field limit as n, m → ∞. We then find a closed-form solution of the stationary point of the mean-field and prove that it is globally attractive.
Using the stationary point of the mean-field we characterize the throughput of the system as a function of the batch size. This finally leads to a simple optimization problem which can be solved either in closed form or numerically in constant time to find the asymptotically optimal batch size.

Recently, mean-field techniques have been used successfully in various models of large-scale service systems, such as web server farms [26], cloud data centers [35], and caching systems [14], where an exact solution of the stationary distribution is computationally infeasible due to the large size of the state space. In such systems, the key idea is to approximate the Markovian dynamics of the system by a deterministic dynamical system, called the mean-field limit, typically described by a system of ordinary differential equations (ODEs). Such an approximation is exact in the limit as the system becomes large. The stationary behaviour of the limiting system can be described by the stationary point of the mean-field, which can either be found in closed form or computed in constant time. The key challenge is to prove the existence and uniqueness of the stationary point and the fact that all possible trajectories of the mean-field limit converge to this unique stationary point (global attraction) [6, 33].

To demonstrate the practical relevance of our results we analyze a large commercial database system. In such a system a job refers to a query, e.g., an SQL string, which can execute read or write operations. A client can only send a new query once the previous query has been processed, i.e., each client can have at most one outstanding query at any time. Job/query batching involves merging multiple similar queries into a new SQL string, whose execution time depends on many factors such as the operations' types.
Moreover, the shared overhead amongst the individual queries lends itself to a certain speedup in the batch execution time, which was empirically shown to be around a factor of 2 in [28]; the speedup is generally a function of both the number of batched jobs k and the jobs' types, e.g., read or write. (That numerical searches for the optimal batch size typically run in the order of days is according to personal communications with engineers from a large commercial database system.)
The remainder of the paper is structured as follows. We first discuss related work and then describe the queueing model and the optimization formulation. In Sect. 4 we provide the mean-field model and the corresponding asymptotic result. In Sect. 5 we provide the generalized model for the two-types-of-jobs case, and then present numerical and experimental evaluation results for the optimal batch sizing approach in Sect. 6. Lastly, we conclude the paper in Sect. 7.
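As a running illustration of the model in Fig. 1, the closed-loop dynamics can be simulated directly; the following Gillespie-style sketch estimates the throughput for one batch size (all rate values are assumptions for the example, not measurements from the commercial system):

```python
import random

def simulate(n=100, m=10, k=5, lam=1.0, M=50.0, mu=2.0, T=1000.0, seed=1):
    """Gillespie-style simulation of the closed batching system of Fig. 1.

    State: x active clients, y jobs at the batcher, z batches at the servers.
    Rates: lam*x (job production), M*floor(y/k) (batch formation),
           mu*min(m, z) (batch service). Returns the empirical throughput
    in jobs served per unit time.
    """
    x, y, z = n, 0, 0          # all clients start active
    t, served = 0.0, 0
    rng = random.Random(seed)
    while t < T:
        r_prod  = lam * x
        r_batch = M * (y // k)
        r_serve = mu * min(m, z)
        total = r_prod + r_batch + r_serve
        t += rng.expovariate(total)          # time to the next event
        u = rng.random() * total             # pick the event type
        if u < r_prod:
            x, y = x - 1, y + 1              # a client submits a job
        elif u < r_prod + r_batch:
            y, z = y - k, z + 1              # the batcher forms a batch
        else:
            z, x = z - 1, x + k              # a batch finishes; k clients reactivate
            served += k
    return served / T
```

Varying k in such a simulation already exhibits the throughput tradeoff analyzed below, although each evaluation is far slower than the closed-form route developed in Sect. 4.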
2 RELATED WORK

We overview some open and closed queueing systems with batching, and practical approaches to batching in database systems.

In the open queueing systems literature, one of the earliest examples of batching is [1], which derives the expected value of the steady-state queue length and waiting time assuming exponential inter-arrival and Chi-squared service times. In [12], the authors consider a queueing system with Poisson arrivals and general batch service time, independent of the batch size; both the execution time and batch size can be dynamically controlled subject to real-world constraints on the maximum possible batch size. If a batch is forwarded to the server only at the points when the server is free, or there is an arrival or departure, it is shown that it is optimal to serve all jobs in a batch only when the queue length exceeds a certain threshold. Batching in the context of running a shuttle service between two end points has been considered in [11], which provides an optimal batching policy for minimizing the expected total discounted cost over an infinite horizon. Here it is assumed that the customers arrive according to independent Poisson processes. The authors in [3] consider a discrete-time system with incoming jobs having a strict delay guarantee. Given a certain form of serving cost which incentivizes batching and an arrival distribution, the authors lay down a strategy that minimizes the expected long-term cost per unit time. Further, in [16], a queueing system with bulk service at scheduled time points has been considered where the customers can pick their arrival time to minimize the waiting time. Under some given conditions, the authors show that it is optimal to arrive just the moment before a service starts.

In turn, a key objective in the closed queueing systems literature was proving the product form property of the steady-state queues' distribution.
Gordon and Newell [17] considered a closed network with multiple service stages and a set of probabilities governing the routing among these stages, and showed the product form property under the assumption of exponential service times. In the seminal work on BCMP networks [2], the authors considered the more general case of open, closed, and mixed networks, and also multiple job classes. Inspired by the functioning of central processors, data channels, and terminals, among others, sufficient conditions have been provided for each of these cases for the network to have a product form equilibrium distribution. Further, in [7], the authors generalized the idea of local balance to station balance, which explains the conditions for a network with non-exponential service times to have a product form. These findings were further extended under a more general set-up in [8], which investigated the existence of product form equilibrium distributions under certain restrictions on the service discipline, which can however be class dependent. The existence of product form in closed queueing networks with service batching was investigated in [19], which derives conditions for the existence of a product form distribution in a discrete-time setting with state-independent routing, allowing multiple events to occur in a single time slot. The results were further extended to a continuous-time setting allowing for batch arrivals in [20]. For the particular closed queueing network with service batching from this paper, it is certainly of interest to determine whether the product form property applies. However, the aforementioned works do not apply to our problem as the conditional routing probabilities of jobs/batches in our case are state-dependent due to the FCFS nature of service.
Further, even if we approximate FCFS order by random service order, we cannot directly compute the system throughput from these works as they lack a method to derive the normalizing constant for the corresponding product form.

In the context of batching in databases, one of the earliest and most influential works is [13], whereby transactions are executed as sequences of jobs and batches of jobs access the same log page. Once that page is full, the log is flushed and the batch is executed, thus decreasing the I/O. Naturally, the batch size is fixed to the page size; in turn, in our work, we allow for flexible batch sizes in relation to the number of clients and specifically focus on optimizing throughput rather than I/O reduction. In comparison,
SharedDB [15] executes all incoming jobs as one big batch. Jobs that enter the system while a batch is executed are queued and batched once the previous batch finishes execution. In contrast to our work, SharedDB executes batches of different sizes sequentially and does not classify job types or consider job sizes. A similar work to SharedDB is
BatchDB [23], in which incoming analytic jobs are batched and their execution is interleaved with write jobs as they occur. Like SharedDB, BatchDB does not classify its jobs or focus on the size of batches in relation to clients. The closest system to our work is
OLTPShare [28], where the authors use a fixed time interval to collect incoming jobs into batches. In contrast, our approach of using count-based batching (i.e., each batch has exactly k jobs) has the practical benefit of utilizing cached batch queries. These batch jobs are compiled SQL strings that have been requested before. Using the interval approach results in batches of various sizes, diminishing the efficiency of caching previously seen batch requests.

3 SYSTEM MODEL

We consider a closed queueing system where jobs are routed along three stations: job producer, job batcher, and service station. The producer station has n clients, each being assigned a token enabling them to submit a new job/query. (We use the terms job and query interchangeably.) Upon submission, the token is revoked and the query is passed to the job batcher, which creates a merged query at rate M(k) once k queries become available to form a batch of size k. Each batch is forwarded to the service station consisting of m serving units, or servers, processing batches in FCFS order at rate µ(k), i.e., the number of batches served per unit time. Further, the merged query is compiled, executed, and the result is split and sent back to the respective clients. Along with receiving a result, each client also receives its token back and becomes ready to submit a new query. We note that the rate at which a new query is submitted to the batching station depends on the number of active clients, i.e., clients with a token, rather than the total number of clients. It is also important to observe that the total number of jobs in the system is the same as the number of clients n. For a schematic representation of the system recall Fig. 1.

A key observation is that the additional time spent on batching is compensated by the reduction in the total execution time of the jobs, owing to the amortization of associated operational overhead characteristic to jobs of the same type.
The gain from batching usually grows when increasing the batch size, an effect which is commonly referred to as speedup. However, increasing the batch size beyond a certain threshold can lead to an excessive idling of the available servers. This is due to the fact that batch formation takes longer and also the number of batches in the system can become less than the number of servers. In other words, higher speedups can idle more servers, which raises an interesting performance tradeoff. Our objective is to find the optimal batch size k∗ maximizing the system's throughput, i.e., the number of jobs served at the service station per unit time. To this end, we will first model the closed queueing system as a continuous-time Markov chain (CTMC) and find its steady-state distribution.

We assume that the time for each client to produce a job is exponentially distributed with rate λ; denoting by x the number of active clients (i.e., having a token), the producer station forwards a job to the batcher at rate λx. Let us also denote by y and z the number of jobs at the batcher and the number of batches at the server, respectively. The state of the system can thus be uniquely described by the triple (x, y, zk) belonging to the state space

S = {(x₁, x₂, x₃) ∈ Z₊³ : x₁ + x₂ + x₃ = n, k | x₃}.

Although (x, y, zk) is determined by any two of its components, we retain the triple representation due to a more convenient visualisation. The state of the system clearly evolves as a continuous-time Markov chain and the rates at which the system jumps to another state from the state (x, y, zk) are given by

(x, y, zk) → (x − 1, y + 1, zk) at rate λx, for x > 0,
(x, y, zk) → (x, y − k, (z + 1)k) at rate M(k)⌊y/k⌋, for y ≥ k,
(x, y, zk) → (x + k, y, (z − 1)k) at rate µ(k) min(m, z), for z > 0.
(1)

Informally, when the system is in state (x, y, zk), either one job can move from the producer to the batcher at rate λx when there are x active clients, or k jobs can move from the batcher to the server at rate M(k)⌊y/k⌋, or k more clients become active (i.e., receive their tokens back) at rate min(m, z)µ(k). The rates to all other states are zero.

The system attains a steady state with the unique distribution π given by the solution of the equation π · Q =
0. This is due to the fact that the chain is irreducible, whereas the finiteness of the state space guarantees positive recurrence; for a rigorous argument see Sect. A.1 in the Appendix. Here, Q(r, s) denotes the jump rate from state r to s, where r, s are of the form (x, y, zk), as specified in (1). Given the non-linear state-dependent rates, we can only obtain the solution π numerically rather than in closed form.

Further, the steady-state distribution π immediately lends itself to the steady-state system throughput, i.e.,

Θ(k) := Σ_{(x, y, zk) ∈ S} π(x, y, zk) kµ(k) min(m, z),   (2)

which implicitly yields the optimal batch size

k∗ := arg max_{k ∈ K} Θ(k).   (3)

Here K = {1, 2, 3, ..., K} and K is the maximum possible batch size imposed by the underlying queueing system. Note that finding the solution of (3) runs in ω(n²) time as it involves solving π · Q = 0 for each 1 ≤ k ≤ K in (2); for a particular batch size k, the dimension of Q is of order n²/k.

4 MEAN-FIELD MODEL

In practical data-processing systems, the number of clients served is usually large. From a computational point of view, the standard Markovian approach followed in Sect. 3 becomes increasingly computationally infeasible when growing the number of clients. Consequently, we adopt a mean-field approach where the number of servers m scales with the number of clients n. We assume that the batching step is instantaneous, i.e., the number of jobs in the batching station jumps accordingly from (k − 1) to 0 upon the arrival of a new job.
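For small instances, the exact program of (1)-(3) can be carried out directly: enumerate the state space S, assemble the generator Q, solve the balance equations π · Q = 0 together with the normalization Σπ = 1, and evaluate (2). A sketch in NumPy with illustrative parameters (not fitted values):

```python
import numpy as np

def exact_throughput(n, m, k, lam, M, mu):
    """Solve pi @ Q = 0 for the chain (1) and evaluate the throughput (2).

    Only feasible for small n: the state space
    {(x, y, z) : x + y + z*k = n} is enumerated explicitly.
    """
    states = [(x, n - x - z * k, z)
              for x in range(n + 1) for z in range((n - x) // k + 1)]
    idx = {s: i for i, s in enumerate(states)}
    S = len(states)
    Q = np.zeros((S, S))
    for (x, y, z), i in idx.items():
        if x > 0:                                   # client submits a job
            Q[i, idx[(x - 1, y + 1, z)]] += lam * x
        if y >= k:                                  # batcher forms a batch
            Q[i, idx[(x, y - k, z + 1)]] += M * (y // k)
        if z > 0:                                   # a batch finishes service
            Q[i, idx[(x + k, y, z - 1)]] += mu * min(m, z)
    Q -= np.diag(Q.sum(axis=1))                     # generator diagonal
    # balance equations plus normalization, solved in least squares
    A = np.vstack([Q.T, np.ones(S)])
    b = np.zeros(S + 1); b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return float(sum(pi[i] * k * mu * min(m, z)
                     for (x, y, z), i in idx.items()))
```

Scanning k with this routine realizes (3) exactly, but the state-space enumeration is what makes the approach impractical for large n and motivates the mean-field model below.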
This assumption not only simplifies our analysis but is also motivated by empirical observations; for instance, in the commercial database system where we run the evaluation experiments, the batching step is approximately 50 times faster than the service step.

Let X⁽ⁿ⁾(t) denote the number of active clients in the system at time t ≥
0. Hence, the number of queries in the system at time t is n − X⁽ⁿ⁾(t). Then, (X⁽ⁿ⁾(t), t ≥ 0) is a Markov process with the state space {0, 1, ..., n} and the following rates:

q⁽ⁿ⁾(x → x − 1) = λx,
q⁽ⁿ⁾(x → x + k) = µ(k) min(m, ⌊(n − x)/k⌋),

where x ∈ {0, 1, ..., n} and q(i → j) denotes the transition rate from state i to state j. The Markov process (X⁽ⁿ⁾(t), t ≥ 0) is ergodic because it is irreducible and has a finite state space. However, it is extremely difficult to obtain a closed-form solution of the stationary distribution π⁽ⁿ⁾ by solving the matrix equation π⁽ⁿ⁾Q⁽ⁿ⁾ = 0. Instead, balancing the stationary rates of job production and job completion yields

λ E[X] = kµ(k) E[min(m, ⌊(n − X)/k⌋)].   (4)

Using Jensen's inequality we obtain

λ E[X] ≤ kµ(k) min(m, (n − E[X])/k),

which yields the following bound on E[X]:

E[X] ≤ min( nµ(k)/(λ + µ(k)), kµ(k)m/λ ).   (5)

The throughput of the system is given by the RHS of (4). Hence, an upper bound on the throughput Θ⁽ⁿ⁾ is given by

E[Θ⁽ⁿ⁾] ≤ min( kµ(k)m, nλµ(k)/(λ + µ(k)) )   (6)

(note that we drop the dependency on k in Θ⁽ⁿ⁾ for brevity).

In addition to having this bound on the throughput for finite values of n and m, we will next show that the bound is asymptotically tight as n, m → ∞ with m = αn for some fixed α > 0. To this end, we consider the scaled process (w⁽ⁿ⁾(t), t ≥ 0), where w⁽ⁿ⁾(t) := X⁽ⁿ⁾(t)/n denotes the fraction of active clients in the system. The process (w⁽ⁿ⁾(t), t ≥ 0) is a density-dependent jump Markov process [21, 24, 25] with rates

q⁽ⁿ⁾(w → w − 1/n) = nλw,
q⁽ⁿ⁾(w → w + k/n) = nµ(k) min(α, ⌊n(1 − w)/k⌋/n),

where w := x/n.

Next we prove the following main result:

Theorem 4.1.
(i) If w⁽ⁿ⁾(0) → w₀ ∈ [0, 1] as n → ∞ in probability, then for any T > 0 we have

sup_{0 ≤ t ≤ T} ∥w⁽ⁿ⁾(t) − w(t)∥ → 0

in probability as n → ∞, where (w(t), t ≥ 0) is the unique solution of the following ODE:

ẇ(t) = f(w(t)), w(0) = w₀,   (7)

with f : [0, 1] → R defined as

f(w) = kµ(k) min(α, (1 − w)/k) − λw.   (8)

(ii) For any w₀ ∈ [0, 1], we have w(t) → w∗ exponentially fast as t → ∞, where w∗ is the unique solution of f(w∗) = 0 and is given by

w∗ = min( µ(k)/(λ + µ(k)), αkµ(k)/λ ).   (9)

(iii) The sequence of stationary measures π⁽ⁿ⁾ of the process (w⁽ⁿ⁾(t), t ≥ 0) converges weakly to δ_w∗ as n → ∞.

Proof. To show part (i), we first note that the limiting expected drift of the process (w⁽ⁿ⁾(t), t ≥ 0) conditioned on w⁽ⁿ⁾(t) = w converges point-wise (and hence uniformly) to the continuous function f, i.e., for each w ∈ [0, 1] we have

lim_{n→∞} lim_{h→0} (1/h) E[w⁽ⁿ⁾(t + h) − w⁽ⁿ⁾(t) | w⁽ⁿ⁾(t) = w] = f(w).   (10)

Furthermore, it is easy to see that f : [0, 1] → R is Lipschitz continuous, which follows from the facts that (1) any linear function is Lipschitz continuous, (2) if F, G are Lipschitz continuous, then cF + dG is Lipschitz continuous for any c, d ∈ R, (3) |F| is Lipschitz continuous when F is Lipschitz continuous, and (4) min(F, G) = (F + G − |F − G|)/2. Part (i) now follows from Theorem 3.1 of [21].

To prove part (ii), we first observe that the unique solution to the equation f(w∗) = 0 is given by (9). Assume w₀ ≥ w∗. Then w(t) ≥ w∗ for all t ≥ 0, which follows from the continuity of w(t) and the fact that ẇ(t) = 0 whenever w(t) = w∗. We define the distance function φ(t) = w(t) − w∗. Clearly, φ(t) ≥ 0 for all t ≥
0. Now we have

φ̇(t) = ẇ(t) = f(w(t)) = f(w(t)) − f(w∗)
 = −λ(w(t) − w∗) + kµ(k)[ min(α, (1 − w(t))/k) − min(α, (1 − w∗)/k) ]
 ≤ −λφ(t),

where the last inequality follows since w(t) ≥ w∗ for all t ≥ 0 and the map w ↦ min(α, (1 − w)/k) is non-increasing. Hence, φ(t) ≤ φ(0)e^(−λt); the case w₀ ≤ w∗ is handled symmetrically. This implies that w(t) → w∗ as required.

To show part (iii), we first note that the sequence of stationary measures π⁽ⁿ⁾ is tight as it is defined on the compact space [0, 1]. Hence, part (iii) follows from Theorem 2 of [5]. □

The above theorem implies the weaker result that

lim_{n→∞} lim_{t→∞} E[w⁽ⁿ⁾(t)] = lim_{t→∞} lim_{n→∞} E[w⁽ⁿ⁾(t)] = w∗.

Equivalently, we have the following convergence of the normalized throughput Θ⁽ⁿ⁾/n:

lim_{n→∞} E[Θ⁽ⁿ⁾/n] = λw∗,

which proves the asymptotic tightness of the bound from (6).

The optimal asymptotic throughput further follows by maximizing the fraction of active clients w∗ with respect to the batch size k. The asymptotically optimal batch size is the solution to the following optimization problem:

max_k min( µ(k)/(λ + µ(k)), αkµ(k)/λ ).   (11)

In the particular case when µ(k) is a non-increasing function of k and kµ(k) is a non-decreasing function of k, the optimal solution k∗ is simply the solution to the following equation:

µ(k)/(λ + µ(k)) = αkµ(k)/λ.   (12)

Therefore, we have just shown that k∗ can simply be found by solving a polynomial equation. We can approximate the optimal batch size for finite systems by k∗ as long as n and m are large. The advantage is that solving the polynomial equation can be done in time independent of the system size n; moreover, as we will show in our numerical experiments, the approximation is numerically accurate in practical regimes.

5 MODEL WITH TWO JOB TYPES

We now consider the case when jobs can be of two types, e.g., write and read in a database system.
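In the regime of Theorem 4.1, the optimization (11) can be evaluated for any candidate speedup profile in time independent of n. The following sketch scans (11) for a hypothetical sub-additive speedup (the profile µ(k) = µ₁/k^0.7 and all parameter values are assumptions for illustration, not fitted to any system):

```python
def w_star(k, lam, mu_k, alpha):
    """Stationary fraction of active clients, eq. (9)."""
    return min(mu_k / (lam + mu_k), alpha * k * mu_k / lam)

def optimal_batch_size(lam, alpha, mu, K=100):
    """Maximize the asymptotic throughput lam * w*(k) over k, cf. (11).
    Runs in O(K) time, independent of the system size n."""
    return max(range(1, K + 1),
               key=lambda k: lam * w_star(k, lam, mu(k), alpha))

# Hypothetical sub-additive speedup (an assumption for the example):
# serving a k-batch takes k**0.7 times the service time of a single job,
# hence the batch service rate is mu(k) = mu1 / k**0.7. Note that mu(k)
# is non-increasing and k * mu(k) is non-decreasing, as required for (12).
mu1 = 2.0
k_opt = optimal_batch_size(lam=1.0, alpha=0.1, mu=lambda k: mu1 / k ** 0.7)
```

For speedup profiles satisfying the monotonicity conditions, the scan lands on the crossing point of the two terms in (11), i.e., on the solution of (12).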
Each of these types benefits from batching and can possibly have different speedups; we note that batching involves jobs of the same type, which is typically the case in database systems. Additionally, we consider priority service scheduling between the two types, which can be either preemptive or non-preemptive. In a database system, where queries can be of type write or read, the former is usually prioritized. In our model, we assume without loss of generality that the first type is given priority in the service station. Below we describe the system dynamics and the required state space representation before providing the mean-field formulation.

Recall that the producer station has n clients, each producing one job with rate λ once becoming active (i.e., once receiving their token back); also, the number of active clients is denoted by x. In the two job-type model, each active client produces a job of type 1 with probability p or a job of type 2 with probability (1 − p). The number of type 1 and type 2 jobs in the batching station is denoted by y₁ and y₂, respectively. The batching station groups kᵢ jobs of type i into a batch with rate Mᵢ(kᵢ)⌊yᵢ/kᵢ⌋ whenever yᵢ ≥ kᵢ, i ∈ {1, 2}, and forwards batches to the service station. Further, the service station has m parallel servers which give preemptive priority to the type 1 jobs; the alternative case of non-preemptive priority is discussed in Sect. A.2 of the Appendix.

Let us denote the total number of type 1 batches by z₁. Due to preemptive priority, the actual number of type 1 batches in service is v₁ = min(m, z₁). The rest of the servers may be occupied by batches of type 2. The state of the system can be uniquely described by the quadruple (x, y₁, y₂, z₁k₁), where (x, y₁, y₂, z₁k₁) belongs to the state space

S = {(x₁, x₂, x₃, x₄) ∈ Z₊⁴ : x₁ + x₂ + x₃ + x₄ ≤ n, k₁ | x₄}.
Note that the number of type 2 jobs in the system which are already batched is

z₂k₂ = n − x − y₁ − y₂ − z₁k₁,

out of which v₂k₂ are at the server and the rest are queued for service; here,

v₂ = min( max(0, m − z₁), z₂ ).   (13)

Clearly, the system evolves as a continuous-time Markov chain with the jump rates

s → s − e₁ + e₂ at rate λxp, for x > 0,
s → s − e₁ + e₃ at rate λx(1 − p), for x > 0,
s → s − k₁e₂ + k₁e₄ at rate M₁(k₁)⌊y₁/k₁⌋, for y₁ ≥ k₁,
s → s − k₂e₃ at rate M₂(k₂)⌊y₂/k₂⌋, for y₂ ≥ k₂,
s → s + k₁e₁ − k₁e₄ at rate v₁µ₁(k₁), for z₁ ≥ 1,
s → s + k₂e₁ at rate v₂µ₂(k₂), for v₂ ≥ 1,   (14)

where s = (x, y₁, y₂, z₁k₁) and eⱼ is the unit vector of appropriate size whose j-th component is unity; note that batching type 2 jobs implicitly increments z₂ through the identity above. The jump rates to all other states are zero.

The chain is irreducible, whereas the finiteness of the state space guarantees positive recurrence. Thus, we can derive the rate matrix Q using (14) and derive the steady-state distribution π by solving π · Q = 0.

While we could jointly optimize for k₁ and k₂, database batching argues for using a uniform batch size across all job types (see, e.g., [15, 28, 29]); in particular, standard multi-query optimization methods in databases batch requests through fixed compiling of the execution of multiple queries into one SQL string, which renders equal batch sizes regardless of type. Denoting k := k₁ = k₂, the steady-state throughput is

Θₚ(k) = Σ_{s ∈ S} π(s) k (µ₁(k)v₁ + µ₂(k)v₂),   (15)

where s = (x, y₁, y₂, z₁k) and v₂ is derived in (13). The optimal batch size is

k∗ = arg max_{k ∈ K} Θₚ(k).   (16)

Here, K = {1, 2, 3, ..., K} and K is the maximum possible batch size for the considered system.

We now discuss the preemptive priority case in the context of the mean-field formulation from Sect. 4. The system can now be uniquely described by the number of active clients and the number of type 1 jobs in the system.
This is due to the fact that there can be at most (k₁ − 1) jobs of type 1 that have not yet formed a batch; the number of un-batched type 1 jobs is mod(x₂, k₁), where x₂ is the number of type 1 jobs in the system. This observation also applies to the type 2 jobs and lets us derive the number of type 2 jobs which are not yet batched. Assuming that the servers are work-conserving and that type 1 has preemptive priority over type 2, we can derive the number of batches in service for each type. Note that we use the notation µ₁ and µ₂ instead of µ₁(k₁) and µ₂(k₂) when the dependence is clear.

Let X₁⁽ⁿ⁾(t) and X₂⁽ⁿ⁾(t) denote the number of active clients and the total number of type 1 jobs in the system at time t ≥ 0, respectively. Then, at any time t ≥
0, the number of type 2 jobs in the system is n − X₁⁽ⁿ⁾(t) − X₂⁽ⁿ⁾(t), the number of type 1 batches being served is min(m, ⌊X₂⁽ⁿ⁾(t)/k₁⌋), and the number of type 2 batches being served is

min( m − min(m, ⌊X₂⁽ⁿ⁾(t)/k₁⌋), ⌊(n − X₁⁽ⁿ⁾(t) − X₂⁽ⁿ⁾(t))/k₂⌋ ),

which simplifies to

min( max(0, m − ⌊X₂⁽ⁿ⁾(t)/k₁⌋), ⌊(n − X₁⁽ⁿ⁾(t) − X₂⁽ⁿ⁾(t))/k₂⌋ ).

Clearly, ((X₁⁽ⁿ⁾(t), X₂⁽ⁿ⁾(t)), t ≥ 0) is a Markov process on the state space S = {(x₁, x₂) ∈ Z₊² : x₁ + x₂ ≤ n} with the following rates:

q((x₁, x₂) → (x₁ − 1, x₂ + 1)) = λpx₁,
q((x₁, x₂) → (x₁ − 1, x₂)) = λ(1 − p)x₁,
q((x₁, x₂) → (x₁ + k₁, x₂ − k₁)) = µ₁ min(m, ⌊x₂/k₁⌋),
q((x₁, x₂) → (x₁ + k₂, x₂)) = µ₂ min( max(0, m − ⌊x₂/k₁⌋), ⌊(n − x₁ − x₂)/k₂⌋ ).

As in the previous section, we consider the scaled process w⁽ⁿ⁾(t) = (w₁⁽ⁿ⁾(t), w₂⁽ⁿ⁾(t)) with wᵢ⁽ⁿ⁾(t) = Xᵢ⁽ⁿ⁾(t)/n, i = 1, 2, and m = αn. We then have the following analogue of Theorem 4.1.

Theorem 5.1. (i) If w⁽ⁿ⁾(0) → w₀ ∈ [0, 1]² as n → ∞ in probability, then for any T > 0 we have

sup_{0 ≤ t ≤ T} ∥w⁽ⁿ⁾(t) − w(t)∥ → 0

in probability as n → ∞, where (w(t) = (w₁(t), w₂(t)), t ≥ 0) is the unique solution of the following system of ODEs:

ẇ₁(t) = f₁(w(t)), ẇ₂(t) = f₂(w(t)), w(0) = w₀,   (17)

with f = (f₁, f₂) : [0, 1]² → R² defined as

f₁(w) = −λw₁ + k₁µ₁ min(α, w₂/k₁) + k₂µ₂ min( max(0, α − w₂/k₁), (1 − w₁ − w₂)/k₂ ),   (18)

f₂(w) = λpw₁ − k₁µ₁ min(α, w₂/k₁).   (19)

(ii) For any w₀ ∈ [0, 1]², we have w(t) → w∗ as t → ∞, where w∗ = (w₁∗, w₂∗) is the unique solution of f(w∗) = 0 and is given by

w₁∗ = min( µ₁µ₂ / (µ₁λ(1 − p) + µ₂λp + µ₁µ₂), αk₁k₂µ₁µ₂ / (k₁µ₁λ(1 − p) + k₂µ₂λp) ),   (20)

w₂∗ = λpw₁∗/µ₁.   (21)

(iii) The sequence of stationary measures π⁽ⁿ⁾ of the process (w⁽ⁿ⁾(t), t ≥ 0) converges weakly to δ_w∗ as n → ∞.

Proof. Part (i) can be shown using arguments similar to the proof of Part (i) of Theorem 4.1. To show part (ii), we first note that w∗ is the unique solution of f(w∗) =
0. We now show that w ∗ isglobally attractive.We first define a linear transform ( w , w ) → ( z , z ) defined as z = w + w and z = w . Under this transformation the system isdescribed as follows: n the Throughput Optimization in Large-Scale Batch-Processing Systems Conference version, 2020, Virtual dz dt = − λ ( − p )( z − z ) , if z ≥ k α − λ ( − p )( z − z ) + µ ( − z ) , if − z k + z k < α − λ ( − p )( z − z ) − k k µ z + k µ α , if − z k + z k ≥ α (22) dz dt = (cid:40) λp ( z − z ) − k µ α , if z ≥ k αλp ( z − z ) − µ z , otherwise . (23)Furthermore, the stationary point is mapped to ( z ∗ , z ∗ ) , where z ∗ = min ( z ∗ , z ∗ ) and z ∗ = ηz ∗ with z ∗ = ( µ + λp ) µ µ λ ( − p ) + µ λp + µ µ , z ∗ = k k ( µ + λp ) µ αk µ λ ( − p ) + k µ λp , η = λpµ + λp .Clearly, the system is a piece-wise linear system. We considerthe stability of each region individually: Case 1 : k α ≥ k α ≥ ≥ z ≥ z ≥ dz dt = − λ ( − p )( z − z ) + µ ( − z ) dz dt = λp ( z − z ) − µ z (24)The above system can be represented as a linear dynamicalsystem (cid:219) z = Az + b , where the eigenvalues θ of A ∈ R × satisfy θ + ( λ + µ + µ ) θ + c = , for some constant c . Clearly, the real parts of the eigenvalues arestrictly negative. Hence, the system is globally attractive to theunique stationary point ( z ∗ , ηz ∗ ) . Case 2 : k α ≥ k α < L : z = k k ( k α − + z ) , having respective linear equations. Let us also consider the line L : z = ηz . Note that (cid:219) z ( t ) ≥ z lies below L . Thus, if the system startsbelow L , it stays there and vice-versa and the fixed point(s) of thesystem, if exist(s), lie(s) on L . Let z denote z -coordinate of theintersection point of these lines, i.e., z = k ( µ + λp )( − k α ) k µ + k λp − k λp . Let us assume z ∗ ≤ z ∗ . Calculations show that this implies z ≤ z ∗ ≤ z ∗ . 
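The sign claim on the eigenvalues in the Case 1 argument can be double-checked numerically; a minimal stdlib sketch, with illustrative (assumed) parameter values for λ, p, µ_1, µ_2:

```python
import cmath

# Illustrative (assumed) parameters; any positive values work.
lam, p, mu1, mu2 = 1.5, 0.2, 2.0, 3.0

# Case 1 dynamics (24): dz/dt = A z + b with
# A = [[-lam(1-p) - mu2, lam(1-p)], [lam*p, -lam*p - mu1]].
a11, a12 = -lam * (1 - p) - mu2, lam * (1 - p)
a21, a22 = lam * p, -lam * p - mu1

# Characteristic polynomial: theta^2 - tr(A) theta + det(A) = 0,
# i.e. theta^2 + (lam + mu1 + mu2) theta + c = 0 with c = det(A) > 0.
tr = a11 + a22
det = a11 * a22 - a12 * a21
disc = cmath.sqrt(tr * tr - 4 * det)
theta1, theta2 = (tr + disc) / 2, (tr - disc) / 2

assert abs(-tr - (lam + mu1 + mu2)) < 1e-12  # matches the stated coefficient
assert det > 0                               # the constant c is positive
assert theta1.real < 0 and theta2.real < 0   # both eigenvalues are stable
```

Since the trace is −(λ + µ_1 + µ_2) < 0 and the determinant is λ(1−p)µ_1 + λp µ_2 + µ_1 µ_2 > 0, stability holds for any positive parameter choice, not just the values assumed here.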
If the system starts from a point below L_1, the evolution of the system is given by (24), and an argument similar to Case 1 shows that the system converges to the fixed point (z*_{1,a}, η z*_{1,a}). In case the initial point lies above L_1, the evolution in the starting phase is given by

dz_1/dt = −λ(1−p)(z_1 − z_2) + k_2 µ_2 (α − z_2/k_1)
dz_2/dt = λ p (z_1 − z_2) − µ_1 z_2 (25)

Calculations show that the real parts of the corresponding eigenvalues are negative and that the fixed point of this system is (z*_{1,b}, η z*_{1,b}). Thus the system crosses L_1, after which the evolution is governed by (24). From the perspective of convergence, this is equivalent to having the initial point below L_1, in which case convergence to (z*_{1,a}, η z*_{1,a}) is already established. Thus, the system always converges to (z*_{1,a}, η z*_{1,a}) when z*_{1,a} ≤ z*_{1,b}. For the case z*_{1,a} > z*_{1,b}, we notice that z̄_1 ≥ z*_{1,a} ≥ z*_{1,b}, and a similar argument shows convergence of the system to (z*_{1,b}, η z*_{1,b}).

Case 3: k_1 α < 1 and k_2 α < 1. Consider an initial point (z_1, z_2) such that 1 ≥ z_1 ≥ z_2 ≥ k_1 α. We will show that the system eventually reaches a state where z_2 ≤ k_1 α. Until we have z_2 ≤ k_1 α, the evolution is given by

dz_1/dt = −λ(1−p)(z_1 − z_2)
dz_2/dt = λ p (z_1 − z_2) − k_1 µ_1 α (26)

The above system has no stable fixed point in this region, with z_1 and z_2 decreasing indefinitely. Therefore, there exists t_0 ≥ 0 such that z_2(t_0) ≤ k_1 α. Thus, without loss of generality, we take an initial point satisfying z_2 ≤ k_1 α. Let us first assume η ≤ k_1 α. As in Case 2, we observe that either z̄_1 ≤ z*_{1,a} ≤ z*_{1,b} or z̄_1 ≥ z*_{1,a} ≥ z*_{1,b}, and the proof follows the same line of argument as Case 2. Now consider the scenario η > k_1 α. We show that z*_{1,a} ≥ z*_{1,b}, which holds if and only if

µ_1 µ_2 k_1 k_2 α ≤ k_1 µ_1 λ(1−p)(1 − k_2 α) + k_2 µ_2 λp (1 − k_1 α).

Since η > k_1 α, it suffices to show

µ_1 µ_2 k_2 η ≤ k_1 µ_1 λ(1−p)(1 − k_2 α) + k_2 µ_2 λp (1 − η),

which is equivalent to k_1 (µ_1 + λp) λ(1−p)(1 − k_2 α) ≥ 0 and clearly holds since k_2 α < 1. Similar to Case 2, we see that if the initial point lies above L_1, the evolution is given by (25) and the system converges to (z*_{1,b}, η z*_{1,b}). When started below L_1, the evolution is governed by (24) initially and the system moves towards (z*_{1,a}, η z*_{1,a}). This eventually changes the evolution dynamics to (25), and the system converges to (z*_{1,b}, η z*_{1,b}) in either case.

Case 4: k_1 α < 1 and k_2 α ≥ 1. As in Case 3, we take an initial point with z_2 ≤ k_1 α. Let us first assume η ≤ k_1 α. Similar to the argument of Case 3 when η > k_1 α, and using the fact that k_2 α ≥ 1, we observe that this implies z*_{1,a} ≤ z*_{1,b}. The convergence from an initial point below or above L_1 follows in a similar fashion. The remaining scenario is η > k_1 α. Similar to Case 2, we observe that either z̄_1 ≤ z*_{1,a} ≤ z*_{1,b} or z̄_1 ≥ z*_{1,a} ≥ z*_{1,b}, and convergence to (z*_{1,a}, η z*_{1,a}) or (z*_{1,b}, η z*_{1,b}), respectively, can be shown using the same line of argument presented there. Thus global attraction is established in all scenarios; note that the actual limit is (z*_{1,a}, η z*_{1,a}) or (z*_{1,b}, η z*_{1,b}), as applicable.

Finally, Part (iii) of the theorem follows by the same line of arguments as in the proof of Part (iii) of Theorem 4.1. A more general result under the assumption of equal batch sizes is given in Appendix A.3. □

From the above theorem it follows that the asymptotic throughput is a linear combination of w_1* and w_2*. Given the forms of the speedup functions µ_1(k_1) and µ_2(k_2), we can optimize the asymptotic throughput jointly over k_1 and k_2. The time taken to find the asymptotically optimal batch sizes is clearly independent of the system size n, and these asymptotic solutions serve as accurate estimates of the optimal batch sizes for finite systems, as we show in Sec. 6.

In this section, we evaluate the performance of our model for throughput optimization using both simulations and an application to a research prototype of a large commercial database system. We first show the accuracy of our model on simulation results and subsequently describe the details of the experimental evaluations, including the system layout, the experiment description, data collection, and the model performance.
We first numerically compare our exact and asymptotic results to the corresponding simulation results. For all comparisons, the exact model obtains the throughput by numerically solving for the steady state distribution, whereas for simulations we plot the observed throughput when the system is simulated using (1). The unit of time for simulations is seconds, and a linear form of speedup is assumed. Further, the following system parameters are used for the single job type case; the parameter values correspond to the range of values observed in the prototype system described in the next section:
• job generation rate λ,
• batch service time 1/µ(k), affine in k,
• batching time 1/M(k), affine in k.
For two job types, the type 1 job has higher priority and is generated with 20% probability. We modify the service rates as below and keep the other parameters unchanged:
• type 1 service time 1/µ_1(k), a constant multiple of 1/µ(k),
• type 2 service time 1/µ_2(k), affine in k.
In Figures 2-4, we compare the steady state throughput for the non-asymptotic/exact model, the mean-field model, and simulations, for both the one-job type and the two-job type cases with preemptive priority. In all figures we vary the number of servers m and obtain the corresponding steady state throughput as a function of the number of clients n or of the batch size k.
In Fig. 2 we show the optimal steady state throughput as a function of the number of clients n, for fixed values of the number of servers m. The non-asymptotic/exact model and, more interestingly, the mean-field model accurately capture the optimal steady state throughput obtained from simulations. The optimal throughput is concave in the number of clients n, as it is given by the mean-field analysis as m k* µ(k*) in the limit, with k* from (12). Similar observations hold in Fig. 3, depicting the optimal total steady state throughput for the two job-type case with preemptive priority.
The next set of results in Figs. 4-5 concern the steady state throughput as a function of the batch size k. In Fig. 4 we show how the exact/non-asymptotic model and the mean-field model accurately capture the simulated steady state throughput and provide the optimal batch sizes k*. Figure 5 shows the total throughput for the case of two job-types with preemptive priority.
Next, we consider the trade-off between the speedup and the idling of servers. Fig. 6a shows the extent to which the effect of idling is compensated by the batching speedup for a set-up with n = 300 clients and different numbers of servers. For this same set-up, we also look at the convergence rate of the system to the steady state in Fig. 6b, but only for the optimal batch size given by the exact analysis. We assume the system starts in the state where all jobs are at the producer/clients station and numerically compute the marginal distribution at regular time intervals. To visualize the distance of the marginal from the steady state distribution of the system, we use the total variation distance as defined in [22]. To conclude this subsection, we note that the mean-field results accurately capture both the optimal steady state throughput and the corresponding optimal batch size k* of the system.
In this section, we discuss the performance of our system in experimental evaluations. We start with the description of the system and data collection before comparing our results to actual observations.
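The total variation distance used for Fig. 6b reduces, for discrete distributions on a common finite state space, to half the L1 distance [22]; a minimal sketch on hypothetical three-state distributions:

```python
def total_variation(p, q):
    """TV distance between two distributions on a common finite state space."""
    assert abs(sum(p) - 1) < 1e-9 and abs(sum(q) - 1) < 1e-9
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical marginal at some time t vs. a hypothetical steady state.
marginal = [0.5, 0.3, 0.2]
steady = [0.4, 0.4, 0.2]
assert abs(total_variation(marginal, steady) - 0.1) < 1e-12
assert total_variation(steady, steady) == 0.0
```

In the transient study, `marginal` would be the numerically computed distribution of the system state at each sampling instant and `steady` the solution of π·Q = 0.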
Here we provide an overview of our system and the Telecom Application Transaction Processing (TATP) benchmark [30] that is used to retrieve the data for our model. We run our experiments in a research prototype based on a commercial in-memory database. The database receives a client request as an SQL string and compiles it to optimized execution plans, or extracts such a plan from a plan cache if the string was already compiled for a previous request. Each plan consists of several data operators, e.g., for accessing tables by index or scanning, or aggregating results, as well as operators for sending the results back to the requesting client.
Fig. 7 shows that incoming requests are not executed instantly but rather wait in a queue until the number of waiting requests reaches a certain threshold (i.e., the batch size). Once this event occurs, the number of requests to grab from the waiting queue is determined, we extract that amount of requests, preferring the write jobs, and create one SQL string from the requests. The service thread then compiles and executes the merged SQL string, which produces a shared result. Finally, the service thread splits the shared result to return to each client its individual result.
Service threads execute three tasks on a merged batch taken from the waiting queue: (1) compilation, (2) execution, and (3) splitting the results. For merging, we need to execute some string operations
Figure 2: The optimal steady state throughput as a function of the number of clients/jobs n, for the single job type case; results from the mean-field model, the non-asymptotic/exact formulation, and simulations, for several values of the number of servers m and a linear service speedup. For a fixed m, the optimal throughput is known from the mean-field analysis to be m k* µ(k*), where k* is given in (12).
Figure 3: The optimal total steady state throughput for the two-job types case, preemptive priority, and linear speedup.
Figure 4: Steady state throughput of the system for one job-type for several values of the number of servers m; each set of lines corresponds to a value of n in increasing order (from left to right). Observe that the exact and simulated throughput decreases sharply at points where the maximum possible number of active servers drops by one, which becomes more apparent for larger batch sizes due to the higher relative change. Both the exact and the mean-field model accurately mimic the steady state throughput and capture the optimal batch sizes k* (at the peak point).

to create the merged SQL string. The processing time of this step depends on the number of requests extracted from the waiting queue. In comparison, step (1) first looks up in the cache whether that SQL string was already compiled, and only if this is not the
Figure 5: Steady state throughput with two job-types and preemptive priority for several values of m; each set of lines corresponds to a value of n in increasing order (from left to right).
Figure 6: Steady state and transient characteristics from the exact analysis for the system with n = 300 clients and one job type. Fig. 6(a) shows the average number of active servers η in the steady state; the annotated optimal batch sizes mark the point up to which the speedup compensates for diminishing server utilization. Fig. 6(b) shows the total variation distance to the steady state distribution π for the respective optimal batch sizes over time, i.e., how the marginal distribution of the system states gets reasonably close to the steady state distribution π within a few milliseconds.

case, it compiles the string itself. This is a crucial step, because compiling a string into an executable plan is a time-consuming task. The execution of a batch in step (2) heavily depends on the table format (row-store or column-store [32]) and on whether an index exists on the filtered column or the column needs to be scanned. Finally, in the last step (3), the service thread scans the shared result for each client that belongs to the batch and sends back the matching rows.
For our experiments, we focus on two transactions of the TATP benchmark [30], a well-known Online Transactional Processing (OLTP) benchmark for databases. The two transactions used are GET_SUBSCRIBER_DATA, consisting of one read operation, and DELETE_CALL_FORWARDING, consisting of one read and one write, namely a delete operation.
Figure 7: Query Batching in the Database System. Requests of the same SQL string are merged and executed as a batch.
Each operation is expressed as an SQL string, which is sent to the database and processed on the server side, as described earlier. Each of the reading and writing operations accesses only one row of exactly one table to read or delete from, and is usually processed in less than 1 ms. We adjust the DELETE_CALL_FORWARDING transaction in such a way that it submits a single read operation in 80% of all cases and a delete operation in the remaining 20%.
We run our experiments on a base table size of 10 rows with a varying number of clients. The database and the clients run on different sockets of the same server with SUSE Linux Enterprise Server 12 SP1 (kernel: 4.1.36-44-default), having 512 GB of main memory, four sockets with 10 cores each, and no hyperthreading. The server runs on an Intel(R) Xeon(R) CPU E7-4870 with a speed of 2.4 GHz.
In the following, we employ standard optimal experiment design techniques to characterize the service distributions for all batch sizes, while letting the batch-processing system run only for some selected batch sizes. To this end, we estimate the batching speedup and characterize the corresponding service distributions. For the sake of brevity, we describe the estimation process for only one job type; the two job-type case proceeds similarly.
Figure 8: Experimental evaluation: comparison of the observed optimal batch sizes k* and the model estimates with an increasing number of servers. The system receives only one job type, i.e., read jobs, and the comparison is done for a varying number of clients. As expected, the optimal batch size decreases with an increasing number of servers due to server idling.

First, we express the batching speedup through the function д : N → R_+ where д(k) = 1/µ(k). To avoid triviality, we assume sub-additivity, i.e., д(k + k′) ≤ д(k) + д(k′). In the experimental evaluation, we consider the best fit of the empirical data to have one of the following speedup forms:
• д(k) = ak + b,
• д(k) = γk^α with α < 1,
• д(k) = c log k + d.
Each of these forms can be written as a linear model with regression weights w and feature vectors φ(k). Assuming a Gaussian distribution on the error of the responses of this model, i.e., the mean service times, the standard linear model can be used and hence the ordinary least squares (OLS) estimate of the regression weights can be found. For the experiment design on the batch-processing system, i.e., deciding on the set A containing the batch sizes to run for the subsequent fitting, we employ a D-optimal design [27] to minimize the log determinant of the covariance matrix of the OLS estimator. The size of the subset A is usually set in accordance with time and cost considerations. We solve this integer optimization problem numerically after relaxation using the CVX package [18]. Finally, we denote the set of sample service times corresponding to batch size k ∈ A as S_k and the respective mean service times as E[Y(k)], and find the speedup function д minimizing the corresponding OLS estimation error, i.e., д = д_m where m = arg min_i e_i and e_i = Σ_{k∈A} (д_i(k | θ̂_i) − E[Y(k)])².
Here, we express the parameter space corresponding to the parameter vector θ_i of the speedup function д_i as Θ_i, and adopt an OLS approach to estimate θ_i through θ̂_i = arg min_{θ_i ∈ Θ_i} Σ_{k∈A} (д_i(k) − E[Y(k)])².
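The per-form OLS fit and the arg-min selection can be sketched with closed-form simple regression: the linear and logarithmic forms are fitted directly, the power form on log-log axes. The data below is synthetic; in the paper the fit uses measurements from the design set A.

```python
import math

# Sketch: fit candidate speedup forms g(k) = 1/mu(k) to measured mean
# service times by least squares and keep the form with the smallest error.
data = [(k, 0.05 + 0.01 * k) for k in (2, 8, 32, 64, 128)]  # (k, E[Y(k)])

def ols(xs, ys):
    """Closed-form simple-regression slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def sse(pred):
    return sum((pred(k) - y) ** 2 for k, y in data)

ks = [k for k, _ in data]
ys = [y for _, y in data]

a, b = ols(ks, ys)                               # g(k) = a k + b
c, d = ols([math.log(k) for k in ks], ys)        # g(k) = c log k + d
al, lg = ols([math.log(k) for k in ks],
             [math.log(y) for y in ys])          # g(k) = gamma k^alpha
fits = {"linear": sse(lambda k: a * k + b),
        "log": sse(lambda k: c * math.log(k) + d),
        "power": sse(lambda k: math.exp(lg) * k ** al)}
best_form = min(fits, key=fits.get)
assert best_form == "linear"  # data was generated from the linear form
```

Note that transforming the axes for the power form minimizes error in log-space rather than the e_i criterion exactly; it is a common and adequate shortcut for this kind of model selection.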
Figure 9: Equivalence of preemptive and non-preemptive priority in terms of the steady state throughput for a simulated system with two job-types; each set of lines corresponds to a value of n in increasing order (from left to right).

For the experimental evaluation we set a measurement budget for the fitting and parameter estimation, i.e., we estimate the service times and the speedup based on measurement runs for only ∼
5% of all possible batch sizes. Using the optimal experimental design approach from the previous section, we calculate the set A of batch sizes to be measured for the considered numbers of clients n. For each n we estimate the mean batching and service times for each batch size k ∈ A from independent runs. The mean service times for the batch sizes k ∈ A are then used to estimate the speedup. Equipped with the estimated service and batching rates, we populate the intensity matrix Q using (1) and subsequently solve for the steady state distribution. We further calculate the steady state throughput using (2) and obtain the corresponding optimal batch size. We repeat the same process for a varying number of servers m and for a varying number of clients n up to 300. Note that the database prototype at hand has at most m = 10 available servers.
In addition, we run an exhaustive experiment over all possible batch sizes to find the empirical optimum for the set-ups with a varying number of servers and clients, for the sake of completeness. Fig. 8 shows a comparison of the modelled and observed optimal batch sizes k* for an increasing number of servers and different numbers of clients n. We observe that our models are accurate: both the non-asymptotic/exact model and the mean-field model capture the decline in the optimal batch size with an increasing number of servers m.
We also conduct experiments where the submitted jobs can be of two types: read or write. A new request is a read query with probability 0.8 and a write query with probability 0.2. Further, the write jobs have priority over the read jobs. The prototype system provides non-preemptive priority to the write jobs; however, the difference in system throughput diminishes in the stationary asymptotic regime, as illustrated through simulations in Fig. 9. In Fig. 10 we compare the modelled and the actual optimal batch sizes in the system for the two-job case for a varying number of clients and observe a reasonably close match. The contributed mean-field model is seen to capture the system behavior very well.
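The workflow of populating the intensity matrix Q and solving for the steady state can be sketched on a toy chain; the 3-state Q below is an assumed example, not the matrix from (1), and one balance equation is replaced by the normalization Σπ = 1, the standard trick for solving π·Q = 0:

```python
# Sketch: solve pi * Q = 0 with sum(pi) = 1 for a small intensity matrix
# via Gaussian elimination (stdlib only).
def steady_state(Q):
    n = len(Q)
    # Solve Q^T pi^T = 0: transpose Q, then replace the last balance
    # equation by the normalization sum(pi) = 1.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Forward elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi

# Assumed toy birth-death chain: arrivals at rate 2, services at rate 1.
Q = [[-2.0, 2.0, 0.0],
     [1.0, -3.0, 2.0],
     [0.0, 1.0, -1.0]]
pi = steady_state(Q)
assert abs(sum(pi) - 1) < 1e-9
assert all(abs(sum(pi[i] * Q[i][j] for i in range(3))) < 1e-9
           for j in range(3))
```

For this toy chain the result is π ∝ (1, 2, 4); the throughput step then reads the relevant reward (here, (2)) off π.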
Figure 10: Optimal batch sizes for read and write job types with a fixed number of servers m and a varying number of clients n. The system implements non-preemptive priority of write jobs over read jobs. For the mean field analysis, optima are approximated by the preemptive model in Sect. 5.2, whereas the exact model follows the workflow in Sect. A.2. As expected, the modelled and observed optima are in close proximity.

In this work, we optimize the throughput of closed data-processing systems that process incoming jobs in batches. By modelling the system as a closed queueing network, where batches observe a sub-additive speedup in execution, we obtain the optimal throughput as a function of the batch size for n clients and m servers. The considered system resembles standard database systems where clients wait for the result of an input query before generating the next one. We contribute a mean-field model that captures the system throughput in the asymptotic regime and show that the analytical results accurately provide the optimal throughput, as well as the corresponding optimal batch size, both in simulation and for a prototype of a large commercial system.

REFERENCES
[1] Norman T. J. Bailey. 1954. On queueing processes with bulk service.
J. R. Stat. Soc. B (Methodological) (1954), 80–87.
[2] Forest Baskett, K. Mani Chandy, Richard R. Muntz, and Fernando G. Palacios. 1975. Open, closed, and mixed networks of queues with different classes of customers. J. ACM 22, 2 (1975), 248–260.
[3] Menachem Berg, Frank van der Duyn Schouten, and Jorg Jansen. 1998. Optimal batch provisioning to customers subject to a delay-limit. Manag. Sci. 44, 5 (1998), 684–697.
[4] Gunter Bolch, Stefan Greiner, Hermann de Meer, and Kishor Shridharbhai Trivedi. 2005. Queueing Networks and Markov Chains. Wiley-Interscience, New York, NY, USA.
[5] L. Bortolussi and N. Gast. 2016. Mean-Field Limits Beyond Ordinary Differential Equations. Springer International Publishing, Cham, 61–82. https://doi.org/10.1007/978-3-319-34096-8_3
[6] Amarjit Budhiraja, Paul Dupuis, Markus Fischer, and Kavita Ramanan. 2015. Local stability of Kolmogorov forward equations for finite state nonlinear Markov processes. Electron. J. Probab. 20 (2015), 30 pp. https://doi.org/10.1214/EJP.v20-4004
[7] K. Mani Chandy, John H. Howard Jr., and Donald F. Towsley. 1977. Product form and local balance in queueing networks. J. ACM 24, 2 (1977), 250–263.
[8] K. Mani Chandy and Alain J. Martin. 1983. A characterization of product-form queuing networks. J. ACM 30, 2 (1983), 286–299.
[9] M. L. Chaudhry and J. G. C. Templeton. 1983. A First Course in Bulk Queues. Wiley.
[10] Edward Cree. 2018. Linux kernel patch "Handle multiple received packets at each stage". Retrieved May 25, 2020 from https://github.com/torvalds/linux/commit/2d1b138505dc29bbd7ac5f82f5a10635ff48bddb
[11] Rajat K. Deb. 1978. Optimal dispatching of a finite capacity shuttle. Manag. Sci. 24, 13 (1978), 1362–1372.
[12] Rajat K. Deb and Richard F. Serfozo. 1973. Optimal control of batch service queues. Advances in Applied Probability 5, 2 (1973), 340–361.
[13] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David A. Wood. 1984. Implementation Techniques for Main Memory Database Systems. In Proc. ACM SIGMOD Int. Conf. Manag. Dat. (Boston, Massachusetts). ACM, New York, NY, USA, 1–8. https://doi.org/10.1145/602259.602261
[14] Nicolas Gast and Benny Van Houdt. 2015. Transient and Steady-State Regime of a Family of List-Based Cache Replacement Algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM, New York, NY, USA, 123–136.
[15] Georgios Giannikis, Gustavo Alonso, and Donald Kossmann. 2012. SharedDB: Killing One Thousand Queries with One Stone. PVLDB 5, 6 (2012), 526–537.
[16] Amihai Glazer and Refael Hassin. 1987. Equilibrium arrivals in queues with bulk service at scheduled times. Transp. Sci. 21, 4 (1987), 273–278.
[17] William J. Gordon and Gordon F. Newell. 1967. Closed queuing systems with exponential servers. Operations Research 15, 2 (1967), 254–265.
[18] Michael Grant, Stephen Boyd, and Yinyu Ye. 2008. CVX: Matlab software for disciplined convex programming.
[19] William Henderson, C. E. M. Pearce, Peter G. Taylor, and Nico M. van Dijk. 1990. Closed queueing networks with batch services. Queueing Systems 6, 1 (1990), 59–70.
[20] William Henderson and Peter G. Taylor. 1990. Product form in networks of queues with batch arrivals and batch services. Queueing Syst. 6, 1 (1990), 71–87.
[21] T. G. Kurtz. 1970. Solutions of Ordinary Differential Equations as Limits of Pure Jump Markov Processes. Journal of Applied Probability 7, 1 (1970), 49–58.
[22] David A. Levin and Yuval Peres. Markov Chains and Mixing Times. American Mathematical Society.
[23] Darko Makreshanski, Jana Giceva, Claude Barthels, and Gustavo Alonso. 2017. BatchDB: Efficient Isolated Execution of Hybrid OLTP+OLAP Workloads for Interactive Applications. In Proc. ACM Int. Conf. Manag. Dat. (Chicago, Illinois, USA) (SIGMOD '17). ACM, New York, NY, USA, 37–50. https://doi.org/10.1145/3035918.3035959
[24] M. Mitzenmacher. 1996. The Power of Two Choices in Randomized Load Balancing. Ph.D. Dissertation. University of California at Berkeley.
[25] A. Mukhopadhyay, A. Karthik, and R. R. Mazumdar. 2016. Randomized Assignment of Jobs to Servers in Heterogeneous Clusters of Shared Servers for Low Delay. Stochastic Systems 6, 1 (2016), 90–131.
[26] A. Mukhopadhyay and R. R. Mazumdar. 2016. Analysis of Randomized Join-the-Shortest-Queue (JSQ) Schemes in Large Heterogeneous Processor-Sharing Systems. IEEE Transactions on Control of Network Systems 3, 2 (June 2016), 116–126. https://doi.org/10.1109/TCNS.2015.2428331
[27] Friedrich Pukelsheim. 1993. Optimal Design of Experiments. Vol. 50. SIAM.
[28] Robin Rehrmann, Carsten Binnig, Alexander Böhm, Kihong Kim, Wolfgang Lehner, and Amr Rizk. 2018. OLTPshare: The Case for Sharing in OLTP Workloads. Proc. VLDB Endow. 11, 12 (Aug. 2018), 1769–1780.
[29] Timos K. Sellis. 1988. Multiple-query Optimization. ACM Trans. Database Syst. 13, 1 (March 1988), 23–52. https://doi.org/10.1145/42201.42203
[30] Simo Neuvonen, Antoni Wolski, Markku Manner, and Vilho Raatikka. 2009. Telecommunication Application Transaction Processing (TATP) Benchmark Description. Technical Report. IBM Software Group Information Management. 19 pages.
[31] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: Fast Distributed Transactions for Partitioned Database Systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (Scottsdale, Arizona, USA) (SIGMOD '12). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/2213836.2213838
[32] M. J. Turner, R. Hammond, and P. Cotton. 1979. A DBMS for Large Statistical Databases. In Proc. Int. Conf. Very Large Dat. Bases - Volume 5 (Rio de Janeiro, Brazil) (VLDB '79). VLDB Endowment, 319–327. http://dl.acm.org/citation.cfm?id=1286711.1286746
[33] Benny Van Houdt. 2019. Global Attraction of ODE-Based Mean Field Models with Hyperexponential Job Sizes. Proc. ACM Meas. Anal. Comput. Syst. 3, 2, Article 23 (June 2019), 23 pages. https://doi.org/10.1145/3341617.3326137
[34] X. Wen, B. Yang, Y. Chen, L. E. Li, K. Bu, P. Zheng, Y. Yang, and C. Hu. 2016. RuleTris: Minimizing Rule Update Latency for TCAM-Based SDN Switches. In Proc. IEEE Int. Conf. Dist. Comput. Sys.
[35] SIGMETRICS Perform. Eval. Rev. 43, 1 (June 2015), 321–334. https://doi.org/10.1145/2796314.2745849
A APPENDIX
A.1 Irreducibility of the Closed Queueing System
Proposition 1.
The Markov chain describing the queueing system in Sect. 3 is irreducible.
Proof. It is sufficient to show that the states (n, 0, 0) and (x, y, zk) communicate. To show that (x, y, zk) can be reached from (n, 0, 0) in finitely many steps with positive probability, we show that each of the intermediate states in the following chain can be reached in finitely many steps with positive probability:

(n, 0, 0) → (n − k, k, 0) → (n − y − zk, y + zk, 0) → (n − y − zk, y + (z − 1)k, k) → (n − y − zk, y, zk).

Starting from (n, 0, 0), (n − k, k, 0) is reached in k steps with probability 1. This is due to the fact that there cannot be any batching unless there are at least k jobs at the batching station. Further, (n − y − zk, y + zk, 0) is reached in another y + (z − 1)k steps, where the r-th step has probability p_r = P[X_r < Y]. Here, X_r is an exponential variable with mean 1/((n − k − r + 1)λ) and Y is another independent exponential variable with mean 1/M. Each of these steps corresponds to the outcome that the producer sends a job to the batcher before the batcher could form a batch. Again, (n − y − zk, y + (z − 1)k, k) is reached from (n − y − zk, y + zk, 0) in a single step with probability P[Y < X], where X is another independent exponential variable with mean 1/((n − y − zk)λ). That is, a batch is formed by the batcher before the dispatcher could send a new job. Finally, (n − y − zk, y, zk) is reached in another (z − 1) steps, where the r-th step has probability P[Y < min(X, Z_r)], Z_r being an exponential variable with mean 1/(min(m, r)µ(k)). Each step describes the event that the batching station merges a batch before either the dispatcher could send a new job or a server could finish serving a batch. Similarly, we can show that starting from (x, y, zk), there exists a way to reach (n, 0, 0) in finitely many steps with positive probability, completing the proof. □

Since the Markov chain describing the states of the queueing system in Sect. 3 is finite and irreducible, it is positive recurrent as well.
Thus, there exists a unique steady-state distribution for this chain, obtainable by solving the equation π · Q = 0. Similarly, we can argue the existence and uniqueness of the steady-state distribution for the system described in Sect. 5.1.
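Concretely, for any finite irreducible generator Q, the stationary distribution is the solution of a plain linear system. A minimal sketch in Python/NumPy, using a toy 3-state birth-death generator as a stand-in for the (much larger) generator of the batching system:

```python
import numpy as np

# Toy generator: M/M/1 queue with buffer 2, arrival rate 1, service rate 2.
# Its stationary distribution is known in closed form: pi ~ (1, 1/2, 1/4).
Q = np.array([[-1.0,  1.0,  0.0],
              [ 2.0, -3.0,  1.0],
              [ 0.0,  2.0, -2.0]])

def stationary(Q):
    """Solve pi @ Q = 0 together with sum(pi) = 1 for a finite
    irreducible chain, by stacking the normalization onto Q^T."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])   # Q^T pi = 0 and 1^T pi = 1
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

pi = stationary(Q)
# pi is approximately (4/7, 2/7, 1/7) for this toy chain.
```

For the full batching system, the same recipe applies once the states are enumerated; for very large state spaces one would swap the dense least-squares solve for a sparse solver.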
A.2 System with Two Job Types and Non-preemptive Priority
Unlike the case with preemptive priority in Sect. 5, the case with non-preemption requires tracking the number of type 1 jobs in service explicitly. The system can be uniquely described by the tuple s = (x, y_1, y_2, u_1 k, v_1 k), where x is the number of active clients, y_ι is the number of type ι jobs not yet batched, u_1 is the number of type 1 batches waiting in the queue and v_1 is the number of type 1 batches in service; s belongs to the state space

S = { (x_1, x_2, x_3, x_4, x_5) ∈ Z^5_+ : x · 1 ≤ n, k | x_4, k | x_5 }.

Here, 1 denotes the column vector of ones whose size is implied by the context. Note that the number of type 2 batches is given by z_2 = (n − s · 1)/k, out of which v_2 = min(m − v_1, z_2) are in service. The system evolves as a CTMC with the jump rates

s → s − e_1 + e_2 at rate λxp_1, x > 0,
s → s − e_1 + e_3 at rate λx(1 − p_1), x > 0,
s → s − k e_2 + k e_4 at rate M(k)⌊y_1/k⌋, y_1 ≥ k, v = m,
s → s − k e_2 + k e_5 at rate M(k)⌊y_1/k⌋, y_1 ≥ k, v < m,
s → s − k e_3 at rate M(k)⌊y_2/k⌋, y_2 ≥ k,                                  (27)
s → s + k e_1 − k e_5 at rate v_1 µ_1(k), v_1 ≥ 1, u_1 = 0,
s → s + k e_1 − k e_4 at rate v_1 µ_1(k), v_1 ≥ 1, u_1 ≥ 1,
s → s + k e_1 at rate v_2 µ_2(k), v_2 ≥ 1, u_1 = 0,
s → s + k e_1 − k e_4 + k e_5 at rate v_2 µ_2(k), v_2 ≥ 1, u_1 ≥ 1,

where s = (x, y_1, y_2, u_1 k, v_1 k) and v = v_1 + v_2 denotes the total number of busy servers. The jump rate to any other state is zero. Similar to Sect. 5.1, we can solve π · Q = 0, derive the throughput and find the optimal batch size k* that maximizes it.

A.3 Extension to Multiple Job Types with Preemptive Priority
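For small instances, the rates in (27) translate directly into a generator that can be assembled and solved numerically. The sketch below (Python/NumPy; the parameter values and the constant batching rate M are hypothetical) enumerates the reachable states, builds Q, solves π · Q = 0 and evaluates the throughput as the mean rate of completed jobs:

```python
import numpy as np
from collections import deque

# Hypothetical small instance.
n, m, k = 8, 2, 2
lam, p1, M = 1.0, 0.5, 2.0        # client rate, type-1 probability, batching rate
mu1, mu2 = 1.0, 1.5               # per-batch service rates

def jumps(s):
    """Transitions out of s = (x, y1, y2, u1, v1); batch counts are stored
    unscaled, i.e. u1 is the number of waiting type-1 batches. The type-2
    batch count z2 and the type-2 servers v2 are implicit, as in (27)."""
    x, y1, y2, u1, v1 = s
    z2 = (n - x - y1 - y2 - k * (u1 + v1)) // k
    v2 = min(m - v1, z2)
    out = []
    if x > 0:                                      # job production
        out += [((x-1, y1+1, y2, u1, v1), lam * x * p1),
                ((x-1, y1, y2+1, u1, v1), lam * x * (1 - p1))]
    if y1 >= k:                                    # type-1 batch formed
        t = (x, y1-k, y2, u1, v1+1) if v1 + v2 < m else (x, y1-k, y2, u1+1, v1)
        out.append((t, M * (y1 // k)))
    if y2 >= k:                                    # type-2 batch formed (z2 += 1 implicitly)
        out.append(((x, y1, y2-k, u1, v1), M * (y2 // k)))
    if v1 > 0:                                     # type-1 batch served
        t = (x+k, y1, y2, u1-1, v1) if u1 > 0 else (x+k, y1, y2, u1, v1-1)
        out.append((t, v1 * mu1))
    if v2 > 0:                                     # type-2 batch served; a waiting
        t = (x+k, y1, y2, u1-1, v1+1) if u1 > 0 else (x+k, y1, y2, u1, v1)
        out.append((t, v2 * mu2))                  # type-1 batch takes the free server
    return out

# Breadth-first enumeration of the reachable states from (n, 0, 0, 0, 0).
start = (n, 0, 0, 0, 0)
index, queue, edges = {start: 0}, deque([start]), []
while queue:
    s = queue.popleft()
    for t, rate in jumps(s):
        if t not in index:
            index[t] = len(index)
            queue.append(t)
        edges.append((index[s], index[t], rate))

N = len(index)
Q = np.zeros((N, N))
for i, j, rate in edges:
    Q[i, j] += rate
    Q[i, i] -= rate

# Stationary distribution: pi @ Q = 0, sum(pi) = 1.
A = np.vstack([Q.T, np.ones(N)])
b = np.zeros(N + 1); b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

def completion_rate(s):
    """Job-completion rate in state s (k jobs per finished batch)."""
    x, y1, y2, u1, v1 = s
    v2 = min(m - v1, (n - x - y1 - y2 - k * (u1 + v1)) // k)
    return k * (v1 * mu1 + v2 * mu2)

throughput = sum(pi[i] * completion_rate(s) for s, i in index.items())
```

As a consistency check, in steady state the throughput must equal the job-production rate λ E[x]; sweeping k in an outer loop then recovers the optimal batch size k* for the instance at hand.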
Let us recall the framework described in Sect. 5.2 and consider the case where there are r types of jobs in the system, with jobs of type i_1 having preemptive priority over jobs of type i_2 whenever i_1 < i_2. We suppose that each client produces a job of type i with probability p_i, where Σ_i p_i = 1. Further, we assume that batches go through d levels of service before being unbatched, after which the individual responses are sent back to the clients. The workflow of the system requires that, after each level of service, batches wait in a common queue if all servers of the next stage are busy (we use level and stage interchangeably). Let k_i denote the batch size of job type i at all stages, m_j the total number of servers at stage j, and µ_{ij}(k_i) the service rate of type i at level j for batch size k_i; we suppress the argument of µ_{ij} when the dependence is clear. If X^(n)_{ij}(t) denotes the number of type i jobs that are waiting for or are at the j-th level of service, then (X^(n)_{ij}(t), 1 ≤ i ≤ r, 1 ≤ j ≤ d, t ≥ 0) is Markov on the state space S = { (x_{ij}) ∈ Z^{rd}_+ : Σ_{i=1}^r Σ_{j=1}^d x_{ij} ≤ n }. Note that X_{i1} includes the unbatched jobs of type i as well, whereas X_{ij}, j > 1, comprises only batches. We consider the corresponding scaled process w^(n)(t) = (w^(n)_{ij}(t)), where w^(n)_{ij}(t) = X^(n)_{ij}(t)/n, 1 ≤ i ≤ r, 1 ≤ j ≤ d. We have

ẇ_{11} = λp_1 (1 − Σ_{a,b} w_{ab}) − µ_{11} w_{11},
ẇ_{i1} = λp_i (1 − Σ_{a,b} w_{ab}) − µ_{i1} min( w_{i1}, max(0, α_1 − Σ_{l<i} w_{l1}/k_l) k_i ), 2 ≤ i ≤ r,
ẇ_{1j} = µ_{1(j−1)} w_{1(j−1)} − µ_{1j} w_{1j}, 2 ≤ j ≤ d,                                  (28)
ẇ_{ij} = µ_{i(j−1)} min( w_{i(j−1)}, max(0, α_{j−1} − Σ_{l<i} w_{l(j−1)}/k_l) k_i )
       − µ_{ij} min( w_{ij}, max(0, α_j − Σ_{l<i} w_{lj}/k_l) k_i ), 2 ≤ i ≤ r, 2 ≤ j ≤ d,

where m_j/n → α_j and Σ_{i,j} w_{ij} ≤ 1. We use the shorthand notation dw/dt = F(w) for (28) and note that F is Lipschitz continuous, which follows from arguments identical to part (i) of Thm. 4.1. Also, the stationary measure π^(n)_w is tight, as it is defined on the compact space [0, 1]^{rd}. We observe that the dynamical system (28) is piecewise linear, and we investigate global attraction to its fixed point when k_i = k and there is only one level of service.

Theorem A.1. Consider the system in (28) with k_i = k, 1 ≤ i ≤ r, and d = 1. The system is globally attractive to

w* = A^{−1}c_a, if ⟨A^{−1}c_a, 1⟩ < kα,
w* = B^{−1}c_b, otherwise,

where

A = [ −µ_1 − λp_1, −λp_1, …, −λp_1 ;
      −λp_2, −µ_2 − λp_2, …, −λp_2 ;
      ⋮ ;
      −λp_r, −λp_r, …, −µ_r − λp_r ],

B = [ −µ_1 − λp_1, −λp_1, …, −λp_1 ;
      −λp_2, −µ_2 − λp_2, …, −λp_2 ;
      ⋮ ;
      µ_r − λp_r, µ_r − λp_r, …, −λp_r ],

c_a = (−λp_1, −λp_2, …, −λp_r)ᵀ, c_b = (−λp_1, −λp_2, …, −λp_{r−1}, −λp_r + kαµ_r)ᵀ.

Proof. When k_i = k, α_1 = α and d = 1, we can suppress the service-stage index j and the ODEs in (28) reduce to

ẇ_1 = λp_1 (1 − Σ_a w_a) − µ_1 w_1,
ẇ_i = λp_i (1 − Σ_a w_a) − µ_i min( w_i, max(0, kα − Σ_{l<i} w_l) ), 2 ≤ i ≤ r.      (29)

Next, we show that B is non-singular and that the real parts of its eigenvalues are negative; the same holds for A, which can be proved in a similar, although simpler, way. Let Bx = 0 with x ≠ 0. Then

(µ_1 x_1, µ_2 x_2, …, µ_r x_r)ᵀ = −(Σ_l x_l) (λp_1, λp_2, …, λp_r − µ_r)ᵀ.

Since x ≠ 0, we have Σ_l x_l ≠ 0 (otherwise every x_i would vanish), and summing the coordinates after dividing by the µ_i yields

Σ_l x_l = −λ (Σ_l x_l) (Σ_l p_l/µ_l) + Σ_l x_l, i.e., λ Σ_l p_l/µ_l = 0,

which contradicts the positivity of λ, the µ_i and the p_i. Next, we prove that the eigenvalues of B have negative real part. Let θ be an eigenvalue and u a corresponding eigenvector. For θ ≠ −µ_j for all j, we have

u_j (θ + µ_j) = −λp_j Σ_l u_l, 1 ≤ j ≤ r − 1,
u_r (θ + µ_r) = (−λp_r + µ_r) Σ_l u_l.

Since θ ≠ −µ_j for all j and u ≠ 0, we have Σ_l u_l ≠ 0 and

Σ_l u_l = ( µ_r/(µ_r + θ) − Σ_l λp_l/(µ_l + θ) ) (Σ_l u_l),
i.e., Σ_l λp_l/(µ_l + θ) = −1 + µ_r/(µ_r + θ),
i.e., Σ_l λp_l (µ_l + ℜ(θ))/|µ_l + θ|² = −1 + µ_r (µ_r + ℜ(θ))/|µ_r + θ|².

The left- and right-hand sides have different signs unless ℜ(θ) < 0; if θ = −µ_j for some j, we are done anyway. Now we return to (29) and prove global attraction to the unique fixed point under the different scenarios.

Case 1: kα ≥ 1. Since Σ_a w_a ≤ 1 ≤ kα, each minimum in (29) is attained at w_i, so that dw/dt = Aw − c_a, and the system is globally attractive to the unique fixed point A^{−1}c_a, as A is non-singular and all its eigenvalues have negative real part.

Case 2: kα < 1. We first note that ⟨A^{−1}c_a, 1⟩ < kα iff ⟨B^{−1}c_b, 1⟩ < kα.
For if Ax = c_a and By = c_b, we have

Σ_l x_l = (Σ_l λp_l/µ_l) / (1 + Σ_l λp_l/µ_l) and Σ_l y_l = (Σ_l λp_l/µ_l − kα) / (Σ_l λp_l/µ_l),

and

(Σ_l λp_l/µ_l) / (1 + Σ_l λp_l/µ_l) < kα ⟺ (Σ_l λp_l/µ_l − kα) / (Σ_l λp_l/µ_l) < kα.

Next, we observe that the system eventually enters the region Σ_{i≤r−1} w_i < kα. Indeed, if i_0 ≤ r − 1 is the lowest index with Σ_{i≤i_0} w_i ≥ kα, then dw_i/dt = λp_i (1 − Σ_a w_a) ≥ 0 for all i > i_0, since the domain of interest is Σ_i w_i ≤ 1. Such a system is unstable, with w_i, i > i_0, increasing indefinitely; thus it eventually enters the region Σ_{i≤r−1} w_i < kα.

Let us assume ⟨A^{−1}c_a, 1⟩ < kα. If the system is started in the subregion Σ_{i≤r} w_i < kα, it evolves as in the case kα ≥ 1 and converges to A^{−1}c_a. If the system is started in the subregion Σ_{i≤r} w_i ≥ kα, the evolution is given by dw/dt = Bw − c_b. Since B is non-singular and its eigenvalues have negative real part, the system moves toward the point B^{−1}c_b. However, ⟨A^{−1}c_a, 1⟩ < kα implies ⟨B^{−1}c_b, 1⟩ < kα. Hence, the system eventually enters the subregion Σ_{i≤r} w_i < kα and converges to A^{−1}c_a. For the case ⟨A^{−1}c_a, 1⟩ > kα, the system converges to B^{−1}c_b, and the proof proceeds similarly. □
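Thm. A.1 and the proof above lend themselves to a direct numerical sanity check. The sketch below (Python/NumPy, with hypothetically chosen λ, p_i, µ_i and kα = 0.4, so that ⟨A^{−1}c_a, 1⟩ = 0.5 > kα) builds A, B, c_a, c_b, verifies the non-singularity, eigenvalue and inner-product claims, and integrates (29) by forward Euler from the empty state:

```python
import numpy as np

# Hypothetical parameters; here <A^{-1} c_a, 1> = 0.5 > k*alpha = 0.4,
# so Thm. A.1 predicts convergence to B^{-1} c_b.
lam, k, alpha = 1.0, 4, 0.1
p = np.array([0.5, 0.3, 0.2])
mu = np.array([1.0, 1.2, 0.8])
r = len(p)

# A = -lam * p 1^T - diag(mu); B coincides with A except in the last row.
A = -lam * np.outer(p, np.ones(r)) - np.diag(mu)
B = A.copy()
B[-1, :] += mu[-1]            # row r becomes (mu_r - lam p_r, ..., -lam p_r)
c_a = -lam * p
c_b = c_a.copy()
c_b[-1] += k * alpha * mu[-1]

# Non-singularity and stability, as established in the proof.
for C in (A, B):
    assert (np.linalg.eigvals(C).real < 0).all()

x = np.linalg.solve(A, c_a)
y = np.linalg.solve(B, c_b)
S = np.sum(lam * p / mu)
assert np.isclose(x.sum(), S / (1 + S))            # <A^{-1} c_a, 1>
assert np.isclose(y.sum(), (S - k * alpha) / S)    # <B^{-1} c_b, 1>

def F(w):
    """Right-hand side of (29): lower priorities get whatever of the
    k*alpha service capacity the higher-priority types leave over."""
    dw = lam * p * (1.0 - w.sum())
    dw[0] -= mu[0] * w[0]
    for i in range(1, r):
        dw[i] -= mu[i] * min(w[i], max(0.0, k * alpha - w[:i].sum()))
    return dw

# Forward-Euler integration from the empty state, dt = 0.01, T = 1000.
w = np.zeros(r)
for _ in range(100_000):
    w += 0.01 * F(w)

w_star = y if x.sum() >= k * alpha else x
assert np.allclose(w, w_star, atol=1e-4)
```

Starting the integration from other points in the simplex lands on the same fixed point, which is exactly the global attractivity the theorem asserts; only the parameter values are assumptions of this sketch.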