Performance Analysis of Modified SRPT in Multiple-Processor Multitask Scheduling
A Note on Multiple-Processor Multitask Scheduling
Wenxin Li
Department of ECE, The Ohio State University

Ness Shroff
Department of ECE and CSE, The Ohio State University
June 12, 2020
Abstract
In this paper we study the multiple-processor multitask scheduling problem in both the deterministic and the stochastic model. We consider and analyze M-SRPT, a simple modification of the shortest remaining processing time (SRPT) algorithm, which schedules jobs according to SRPT whenever possible while processing tasks in an arbitrary order. The modified SRPT algorithm is shown to achieve a competitive ratio of Θ(log α + β) for minimizing flow time, where α denotes the ratio of the maximum job workload to the minimum job workload, and β represents the ratio of the maximum non-preemptive task workload to the minimum job workload. The algorithm is shown to be optimal (up to a constant factor) when the number of machines is constant. We further consider the problem under Poisson arrivals and a general workload distribution, and prove that M-SRPT is asymptotically optimal as the traffic intensity ρ approaches 1, provided that the task size is at most the derived upper bound η.

Introduction

With widespread applications in various manufacturing industries, scheduling jobs to minimize the total flow time (also known as response time, sojourn time and delay) is one of the most classic and fundamental problems in operations research and has been extensively studied. As an important metric of scheduler quality, the flow time of a job is formally defined as the difference between its completion time and its release date, and characterizes the amount of time that the job spends in the system.

Optimizing flow time has been considered in both offline and online scenarios. If preemption is allowed, the shortest remaining processing time (SRPT) discipline is known to be optimal in the single-machine environment.
Many generalizations of this basic formulation become NP-hard, for example the non-preemptive single-machine model and the preemptive model with two machines [5]. When jobs arrive online, no information about a job is known to the algorithm in advance; several algorithms with logarithmic competitive ratios have been proposed in various settings [5, 1]. On the other hand, while SRPT minimizes the mean response time sample-path-wise, it requires knowledge of the remaining job service time. Gittins proved that the Gittins index policy, which only requires access to the job size distribution, minimizes the mean delay in an M/G/1 queue.

Though much progress has been made in single-task job scheduling, there is a lack of theoretical understanding of multiple-processor multitask scheduling (MPMS). Jobs with multiple tasks are common and relevant in practice, as jobs and tasks can take many different forms in modern computing environments. For example, to compute a matrix-vector product, we can divide the matrix elements and vector elements into groups of columns and rows respectively; the tasks then correspond to the block-wise multiplication operations. Tasks can also be the map, shuffle and reduce procedures in the MapReduce framework. To this end, in this paper we investigate how to minimize the total flow time of multitask jobs in a multiserver system, where a job is considered complete only when all the tasks within the job are finished.

With the tremendous increase in data size and job complexity, the importance of multiple-processor multitask scheduling can hardly be overstated. More specifically, distributed computing has become a useful tool for tackling many large-scale computational challenges, since parallel algorithms can be advantageous over their sequential counterparts: they divide computationally expensive jobs over machines as multiple tasks, so as to utilize the combined computational power of the processors.
For example, there are two basic perspectives for designing distributed scalable machine learning methods [4]: data-parallel and model-parallel. In the data-parallel perspective, the data set is partitioned and dispersed across different machines, and each machine has a local copy of the whole model, while the model-parallel framework partitions and distributes the model parameters over different workers and updates a subset of the parameters on each worker. A natural question is how to design efficient scheduling algorithms that minimize the total amount of time that multitask jobs spend in the system.

On the other hand, the ability to preempt jobs is important for good performance in flow time minimization [3, p.506]. When preemption is not available, checkpoint-based preemption is suggested [3, p.506]. Checkpointing is a fault-tolerance technique that prevents applications with large processing times from being forced to restart from the very beginning. Similarly, we can take extra time to checkpoint jobs and later restart them from the last checkpoint, which provides more flexibility for scheduling jobs. Since the number of checkpoints can be varied, it is important to understand its effect on system performance. If there are only a few checkpoints, the performance is close to that under non-preemptive disciplines; if the number of checkpoints is large, a large amount of extra time is spent on saving job states and restarting jobs. It is natural to ask how to choose the number of checkpoints to ensure good performance.
Related Work.
For the MapReduce framework, Wang et al. [12] studied the problem of scheduling map tasks with data locality, and proposed a map task scheduling algorithm consisting of the Join the Shortest Queue policy and the MaxWeight policy. The algorithm asymptotically minimizes the number of backlogged tasks (which is directly related to the delay performance through Little's law) when the arrival rate vector approaches the boundary of the capacity region. Zheng et al. [14] proposed an online scheduler called available shortest remaining processing time (ASRPT), which is shown to achieve an efficiency ratio of no more than two.

However, little is known about multitask scheduling. Scully et al. [9] presented the first theoretical analysis of the single-processor multitask scheduling problem, and gave an optimal policy that is easy to compute for batch arrivals, under the assumption that the processing times of tasks follow aged Pareto distributions. To model the scenario in which the scheduler has incomplete information about job size, Scully et al. [10] introduced the multistage job model and proposed an optimal scheduling algorithm for multistage job scheduling in an M/G/1 queue. In addition, a closed-form expression of the mean response time is given for the optimal scheduler. Sun et al. [11] studied the multitask scheduling problem when all the tasks are of unit size, and proved that among causal and non-preemptive policies, the fewest unassigned tasks first (FUT) policy, the earliest due date first (EDD) policy, and the first come first serve (FCFS) policy are near delay-optimal in distribution (stochastic ordering) for minimizing the average delay, the maximum lateness and the maximum delay, respectively.
Contributions.
In this paper we answer the aforementioned questions, and our contributions are summarized as follows. The analysis in this paper follows from and is closely related to that in [2, 6].

• We present Algorithm 1 in Section 3, which is a simple modification of SRPT and achieves a competitive ratio of $O(\log\alpha + \beta)$, where $\alpha$ is the maximum-to-minimum job workload ratio and $\beta$ is the ratio of the maximum non-preemptive task workload to the minimum job workload. In addition, it can be shown that no $o(\log\alpha + \beta)$-competitive algorithm exists when the number of machines is constant. For the class of work-conserving algorithms, $O(\log\alpha + \beta^{1-\varepsilon})$ is the best possible competitive ratio.

• Under a certain probabilistic structure on the problem instances, we further establish the following conclusion about the algorithm in Section 4, by utilizing our aforementioned result in the adversarial setting. Assuming that jobs arrive according to a Poisson process, we prove that Algorithm 1 is optimal when the load $\rho \to 1$, as long as the workload of non-preemptive tasks is upper bounded by the threshold $\eta$ specified in equation (10).

Deterministic Model.
We are given a set $\mathcal{J} = \{J_1, J_2, \ldots, J_n\}$ of $n$ jobs arriving online over time, together with a set of $N$ identical machines. Job $i$ consists of $n_i$ tasks, and its workload $p_i$ is equal to the sum of the processing times of its tasks, i.e., $p_i = \sum_{\ell \in [n_i]} p_i^{(\ell)}$, where $p_i^{(\ell)}$ represents the processing time of task $\ell$. Tasks can be either preemptive or non-preemptive. A task is non-preemptive if it cannot be interrupted once it starts service, i.e., the task is run to completion. All the information about job $i$ is unknown to the algorithm until its release date $r_i$. Under any given scheduling algorithm, the completion time of job $j$, denoted by $C_j$, is equal to the maximum completion time of the individual tasks within the job. Formally, let $C_j^{(\ell)}$ be the completion time of task $\ell$ in job $j$; then $C_j = \max_{\ell \in [n_j]} C_j^{(\ell)}$. The flow time of job $j$ is defined as $F_j = C_j - r_j$, and our objective is to minimize the total flow time $\sum_{j \in [n]} F_j$. Note that different tasks within the same job may or may not be allowed to be processed in parallel; our analysis holds for both scenarios.

Throughout the paper we use $\alpha = \max_{i \in [n]} p_i / \min_{i \in [n]} p_i$ to denote the ratio of the maximum to the minimum job workload. Let $\eta = \max\{ p_i^{(\ell)} \mid \text{task } \ell \text{ of job } i \text{ is non-preemptive} \}$ be the maximum processing time of a non-preemptive task, and let $\beta = \eta / \min_{i \in [n]} p_i$ be the ratio between $\eta$ and the minimum job workload. In some sense, the parameters $\beta$ and $\eta$ represent the degree of non-preemptivity and exhibit a trade-off between the preemptive and non-preemptive settings, since the problem degenerates to the preemptive case when $\eta = 1$ and approaches the non-preemptive case when $\eta$ increases to $\max_{i \in [n]} p_i$.

The definitions of work-conserving algorithms and of the competitive ratio are formally given below; the notation of this paper is summarized in Table 1.

Definition 1 (Work-conserving scheduling algorithm [3]).
A scheduling algorithm $\pi$ is called work-conserving if it never idles machines when there exists at least one feasible job or task awaiting execution in the system. Here a job or task is called feasible if it satisfies all the given constraints of the system (e.g., precedence constraints, preemptive and non-preemptive constraints, etc.).

Definition 2 (Competitive ratio). The competitive ratio of an online algorithm $A$ is the worst-case ratio of the cost incurred by $A$ to that of the optimal offline algorithm $A^*$ over all input instances $\omega$ in $\Omega$, i.e.,
$$\mathrm{CR}_A = \max_{\omega \in \Omega} \frac{\mathrm{Cost}_A(\omega)}{\mathrm{Cost}_{A^*}(\omega)}.$$
In the multiple-processor multitask scheduling problem, the cost is the total flow time under instance $\omega = \{ (r_i, p_i^{(\ell)}) \}_{\ell \in [n_i],\, i \in [n]}$.

Stochastic Model.
In the stochastic setting, we assume that jobs arrive into the system according to a Poisson process with rate $\lambda$. Job processing times are i.i.d. with probability density function $f(\cdot)$. The analysis relies on the concept of a busy period, which is defined as follows.

Definition 3 (Busy Period [3]). A busy period is defined to be the longest time interval in which no machines are idle.
We use $B(w)$ to denote the length of a busy period started by a workload of $w$. It can be seen that $B(\cdot)$ is an additive function [3, p.460], i.e., for all $w_1, w_2$,
$$B(w_1 + w_2) = B(w_1) + B(w_2),$$
as a busy period with initial workload $w_1 + w_2$ can be regarded as a busy period started by initial workload $w_1$, followed by a busy period started by initial workload $w_2$. Moreover, the expected length of a busy period with initial workload $w$ and load $\rho$ is equal to
$$\mathbb{E}[B(w)] = \frac{\mathbb{E}[w]}{1 - \rho}. \quad (1)$$

Table 1: Notation Table

  Symbol            Meaning
  $N$               number of machines
  $n$               number of jobs
  $r_i$             arrival time of job $i$
  $p_i$             total workload of job $i$
  $\rho_{\le y}$    load composed of jobs with size 0 to $y$: $\rho_{\le y} = \lambda \cdot \int_0^y t f(t)\, dt$
  $\alpha$          job size ratio: $\alpha = \max_{i \in [n]} p_i / \min_{i \in [n]} p_i$
  $\eta$            maximum processing time of a single task
  $\beta$           $\eta / \min_{i \in [n]} p_i$
  $C_i$             completion time of job $i$
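The mean busy-period formula (1) can be checked numerically. The following is a minimal Monte Carlo sketch, assuming a single unit-speed server, Poisson arrivals and exponentially distributed job sizes; the function name `busy_period` and all parameter values are illustrative assumptions, not part of the paper.

```python
import random

def busy_period(w, lam, mean_size, rng):
    """Length of a single-server busy period started by w units of work.

    Assumptions (not from the paper): unit-speed server, Poisson arrivals
    with rate lam, exponential job sizes with the given mean.
    """
    total = 0.0
    pending = w  # work that still has to be served
    while pending > 0:
        total += pending
        # Work brought by jobs that arrive while `pending` is being served
        # extends the busy period and is handled in the next round.
        new_work = 0.0
        clock = rng.expovariate(lam)
        while clock < pending:
            new_work += rng.expovariate(1.0 / mean_size)
            clock += rng.expovariate(lam)
        pending = new_work
    return total

rng = random.Random(0)
lam, mean_size, w = 0.5, 1.0, 1.0      # load rho = lam * mean_size = 0.5
rho = lam * mean_size
samples = [busy_period(w, lam, mean_size, rng) for _ in range(20000)]
estimate = sum(samples) / len(samples)
print(estimate, w / (1 - rho))         # the two numbers should be close
```

With these hypothetical parameters the empirical mean should be close to $w/(1-\rho) = 2$, which is what (1) predicts for a deterministic initial workload.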
Competitive Ratio Analysis
The main idea of Algorithm 1 is similar to that of SRPT, i.e., we devote as many resources as possible to the job with the smallest remaining workload, so as to reduce the number of alive jobs in a greedy manner while satisfying all the given constraints.
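The greedy rule just described can be sketched as a slot-by-slot simulation. This is only one possible reading under stated assumptions, not the authors' implementation: time is slotted, task sizes are positive integers, tasks of a job run sequentially in the listed order (no parallelism within a job), and a started non-preemptive task keeps its machine until it finishes. The names `m_srpt` and `alpha_beta` and the input format are hypothetical.

```python
def m_srpt(jobs, num_machines):
    """Simulate the greedy rule; return the flow time of each job.

    jobs: list of (release_time, [(size, preemptive), ...]) with integer sizes.
    Assumption: tasks of a job are processed one after another.
    """
    n = len(jobs)
    task_idx = [0] * n                                   # current task of each job
    task_rem = [tasks[0][0] for _, tasks in jobs]        # remaining work of that task
    total_rem = [sum(s for s, _ in tasks) for _, tasks in jobs]
    locked = set()          # jobs holding a machine for a started non-preemptive task
    completion = [None] * n
    t = 0
    while any(c is None for c in completion):
        alive = [i for i in range(n) if completion[i] is None and jobs[i][0] <= t]
        free = num_machines - len(locked)
        # Free machines go to the alive jobs with the smallest remaining workload.
        waiting = sorted((i for i in alive if i not in locked),
                         key=lambda i: (total_rem[i], i))
        for i in list(locked) + waiting[:free]:
            _, preemptive = jobs[i][1][task_idx[i]]
            task_rem[i] -= 1
            total_rem[i] -= 1
            if not preemptive:
                locked.add(i)            # must run to completion once started
            if task_rem[i] == 0:
                locked.discard(i)
                task_idx[i] += 1
                if task_idx[i] == len(jobs[i][1]):
                    completion[i] = t + 1
                else:
                    task_rem[i] = jobs[i][1][task_idx[i]][0]
        t += 1
    return [completion[i] - jobs[i][0] for i in range(n)]

def alpha_beta(jobs):
    """Return (alpha, beta); eta defaults to 1 when every task is preemptive."""
    workloads = [sum(s for s, _ in tasks) for _, tasks in jobs]
    eta = max((s for _, tasks in jobs for s, pre in tasks if not pre), default=1)
    return max(workloads) / min(workloads), eta / min(workloads)

blocking = [(0, [(3, False)]), (1, [(1, True)])]   # long non-preemptive task first
flexible = [(0, [(3, True)]), (1, [(1, True)])]    # same sizes, fully preemptive
print(m_srpt(blocking, 1), alpha_beta(blocking))   # [3, 3] (3.0, 3.0)
print(m_srpt(flexible, 1), alpha_beta(flexible))   # [4, 1] (3.0, 1.0)
```

On this toy instance the non-preemptive task delays the short job, raising the total flow time from 5 to 6, which illustrates why the competitive ratio depends on $\beta$ as well as $\alpha$.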
Algorithm 1: Modified SRPT (M-SRPT)

At each time slot $t$, maintain the following quantities:
• For each job $i \in [n]$, maintain
  – $W_i(t)$ // remaining workload
  – $w_i(t)$ // remaining workload of the shortest single task being processed (if one exists) or alive
• $\mathcal{J}(t) \leftarrow \{ i \in [n] \mid w_i(t) = 0 \}$ // jobs with tasks that are finished at time $t$
• $d(t) \leftarrow |\mathcal{J}(t)|$ // number of machines to be reallocated
and assign the alive jobs to the $d(t)$ machines, where jobs with lower remaining workload have higher priority. When parallelism is not allowed, at most one machine is allocated to a single job.

Our main result is stated in the following theorem.
Theorem 4.
Algorithm 1 achieves a competitive ratio that is no more than
$$\mathrm{CR}_{\text{M-SRPT}} \le O\!\left(\log\frac{p_{\max}}{p_{\min}}\right) + \frac{2\eta}{p_{\min}} + 8.$$

To show the competitive ratio above, we divide the jobs into different classes and compare the remaining number of jobs under Algorithm 1 with that under the optimal algorithm $\pi^*$. For any algorithm $\pi$, at time slot $t$, we divide the unfinished jobs into $\Theta(\log\alpha)$ classes $\{\mathcal{C}_k(\pi, t)\}_{k \in [\log\alpha + 1]}$, based on their remaining workload. Jobs with remaining workload that is no more than $2^k$ and larger than $2^{k-1}$ are assigned to the $k$-th class. Formally,
$$\mathcal{C}_k(\pi, t) = \left\{ i \in [n] \,\middle|\, W_i(\pi, t) \in (2^{k-1}, 2^k] \right\},$$
where $W_i(\pi, t)$ represents the unfinished workload of job $i$ at time $t$. In the following analysis, we use $\mathcal{C}_{[k]}(\pi, t) = \cup_{i=1}^{k} \mathcal{C}_i(\pi, t)$ to denote the collection of jobs in the first $k$ classes, and let $W^{[k]}_{\pi}(t) = \sum_{i=1}^{k} W^{(i)}_{\pi}(t)$ represent the total remaining workload of jobs in the first $k$ classes, where $W^{(k)}_{\pi}(t)$ denotes the amount of remaining workload of jobs in class $\mathcal{C}_k(\pi, t)$. $W^{(k)}_{\pi^*}(t)$ and $W^{[k]}_{\pi^*}(t)$ are defined in a similar way for the optimal scheduling algorithm $\pi^*$.

We first prove the following lemma, which relates the remaining workload under M-SRPT to that under the optimal algorithm $\pi^*$.

Lemma 5.
For all $t \ge 0$, the unfinished workload under Algorithm 1 can be upper bounded as
$$W^{[k]}_{\text{M-SRPT}}(t) \le W^{[k]}_{\pi^*}(t) + N \cdot (2^{k+1} + \eta + 1). \quad (2)$$

Proof: In the proof we always divide jobs into different classes according to the remaining workload under M-SRPT, so we suppress the reference to M-SRPT in the notation $\mathcal{C}_k$. Without loss of generality we can assume that $W^{[k]}_{\text{M-SRPT}}(t) > W^{[k]}_{\pi^*}(t)$, otherwise Lemma 5 already holds. Since the remaining workload under M-SRPT is strictly larger than that under the optimal algorithm, we claim that there must exist time slots in $(0, t]$ at which either
• idle machines exist under M-SRPT; or
• jobs with remaining workload (under M-SRPT) larger than $2^k$ are processed.
Otherwise, all the machines would be processing jobs belonging to $\mathcal{C}_{[k]}(t)$ before time $t$, while no jobs in higher classes, i.e., $\cup_{i>k} \mathcal{C}_i(t)$, would switch into $\mathcal{C}_{[k]}(t)$. Combining this with the fact that the initial workloads under Algorithm 1 and the optimal algorithm are identical, i.e., $W^{[k]}_{\text{M-SRPT}}(0) = W^{[k]}_{\pi^*}(0)$, we would have $W^{[k]}_{\text{M-SRPT}}(t) \le W^{[k]}_{\pi^*}(t)$, which is a contradiction.

Now consider the following two collections of time slots before $t$:
$$T^{(1)}_k = \left\{ \bar t \in [0, t] \,\middle|\, \text{at time } \bar t \text{, at least one machine is idle under Algorithm 1} \right\},$$
$$T^{(2)}_k = \left\{ \bar t \in [0, t] \,\middle|\, \text{at time } \bar t \text{, there exists } i > k \text{ such that at least one machine is processing jobs in } \mathcal{C}_i \text{ under Algorithm 1} \right\}.$$
Let $\bar t^{(i)}_k = \max\{ \bar t \mid \bar t \in T^{(i)}_k \}$ ($i \in \{1, 2\}$) be the last time slot in $T^{(i)}_k$, based on which we divide our proof into the following two cases.

Case 1: $\bar t^{(1)}_k \ge \bar t^{(2)}_k$.
From the definition of $\bar t^{(1)}_k$, it can be seen that during $(\bar t^{(1)}_k, t]$, no machines are idle or process jobs with remaining workload larger than $2^k$ under Algorithm 1, while the increments in remaining workload incurred by newly arriving jobs are identical for Algorithm 1 and $\pi^*$. In addition, it is important to point out that $([n] \setminus \mathcal{C}_{[k]}(\bar t^{(1)}_k)) \cap \mathcal{C}_{[k]}(\tilde t) = \emptyset$ for all $\tilde t \in (\bar t^{(1)}_k, t]$, i.e., no job switches from a higher class to $\mathcal{C}_{[k]}$ during $(\bar t^{(1)}_k, t]$. Hence
$$W^{[k]}_{\text{M-SRPT}}(t) - W^{[k]}_{\pi^*}(t) \le W^{[k]}_{\text{M-SRPT}}(\bar t^{(1)}_k) - W^{[k]}_{\pi^*}(\bar t^{(1)}_k).$$
It suffices to prove the workload difference inequality (2) for $t = \bar t^{(1)}_k$, i.e.,
$$W^{[k]}_{\text{M-SRPT}}(\bar t^{(1)}_k) \le W^{[k]}_{\pi^*}(\bar t^{(1)}_k) + N \cdot (2^{k+1} + \eta + 1). \quad (3)$$
Note that there exist idle machines at time $t = \bar t^{(1)}_k$, which implies that under Algorithm 1 the number of jobs alive must be less than $N$. Hence $W^{[k]}_{\text{M-SRPT}}(\bar t^{(1)}_k) \le (N - 1) \cdot 2^k$ and (3) holds.

Case 2: $\bar t^{(1)}_k < \bar t^{(2)}_k$. According to the definition of $\bar t^{(2)}_k$, there exist jobs with remaining workload larger than $2^k$ being processed at $\bar t^{(2)}_k$; we use $\hat{\mathcal{J}}(\bar t^{(2)}_k) \subseteq [n] \setminus \mathcal{C}_{[k]}(\bar t^{(2)}_k)$ to denote the collection of such jobs.

When all the tasks are processed preemptively, we can obtain (2) easily, as we are able to conclude that there are at most $N - 1$ jobs in $\mathcal{C}_{[k]}(\bar t^{(2)}_k)$. This is because tasks are allowed to be preempted, while Algorithm 1 selects a job with remaining workload larger than $2^k$ at the beginning of time $\bar t^{(2)}_k$.
Consequently $W^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k) \le n^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k) \cdot 2^k$, where $n^{[k]}_{\text{M-SRPT}}(\cdot)$ denotes the number of jobs in $\mathcal{C}_{[k]}$, and for all $t > \bar t^{(2)}_k$,
$$W^{[k]}_{\text{M-SRPT}}(t) - W^{[k]}_{\pi^*}(t) \le W^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k) - W^{[k]}_{\pi^*}(\bar t^{(2)}_k) + \left[ N - n^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k) \right] \cdot 2^k \le N \cdot 2^k,$$
where the first inequality follows from the fact that no more than $N - n^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k)$ jobs switch from higher classes to $\mathcal{C}_{[k]}(t)$, as at most $N - n^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k)$ jobs with remaining workload larger than $2^k$ are being processed at time $\bar t^{(2)}_k$. Hence Lemma 5 holds.

Now for the case when there exist non-preemptive tasks, the arguments above do not work, since machines may be processing tasks with remaining workload larger than $2^k$ and $n^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k)$ may be larger than $N$. Let $r \in [N]$ be the number of tasks that are being processed at time $\bar t^{(2)}_k$ and belong to $[n] \setminus \mathcal{C}_{[k]}(\bar t^{(2)}_k)$, and let $t_s \le \bar t^{(2)}_k$ be the latest starting time of these tasks. We divide our analysis into the following two subcases:

• Case 2.1: No jobs switch from the set $[n] \setminus \mathcal{C}_{[k]}(t_s)$ to $\mathcal{C}_{[k]}(\bar t^{(2)}_k)$ under Algorithm 1. We use $\Delta_k$ to represent the increment of $W^{[k]}$ incurred by the newly arriving jobs during the time period $[t_s, \bar t^{(2)}_k]$. Then we have
$$W^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k) - W^{[k]}_{\text{M-SRPT}}(t_s) = -(N - r)(\bar t^{(2)}_k - t_s) + \Delta_k. \quad (4)$$
On the other hand, $W^{[k]}_{\pi^*}$, the remaining workload of jobs in class $\mathcal{C}_{[k]}$ under the optimal algorithm, decreases at a speed of no more than $N$ units of workload per time slot, hence
$$W^{[k]}_{\pi^*}(\bar t^{(2)}_k) - W^{[k]}_{\pi^*}(t_s) \ge -N \cdot (\bar t^{(2)}_k - t_s) + \Delta_k. \quad (5)$$
According to the definition of $\bar t^{(2)}_k$, no jobs with remaining workload larger than $2^k$ are processed in $(\bar t^{(2)}_k, t]$. Compared with time $\bar t^{(2)}_k$, at most $r$ jobs switch from $[n] \setminus \mathcal{C}_{[k]}(\bar t^{(2)}_k)$ to the set $\mathcal{C}_{[k]}(\bar t^{(2)}_k + 1)$.
Therefore
$$W^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k + 1) - W^{[k]}_{\pi^*}(\bar t^{(2)}_k + 1) \le W^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k) - W^{[k]}_{\pi^*}(\bar t^{(2)}_k) + r \cdot 2^k. \quad (6)$$
Combining inequalities (4)-(6), we obtain
$$\begin{aligned} W^{[k]}_{\text{M-SRPT}}(t) - W^{[k]}_{\pi^*}(t) &\le W^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k + 1) - W^{[k]}_{\pi^*}(\bar t^{(2)}_k + 1) \\ &\le W^{[k]}_{\text{M-SRPT}}(t_s) - W^{[k]}_{\pi^*}(t_s) + r \cdot \left[ 2^k + (\bar t^{(2)}_k - t_s) \right] \\ &\le (N - 1) \cdot 2^k + r \cdot (\bar t^{(2)}_k - t_s) \\ &\le N \cdot (2^k + \eta). \end{aligned}$$
The third inequality above holds since at time $t_s$, Algorithm 1 is required to do job selection and a job with remaining workload larger than $2^k$ is selected. The last inequality follows from the fact that $\bar t^{(2)}_k - t_s \le \eta$, as $t_s$ is the starting time of a non-preemptive task that is still alive at time $\bar t^{(2)}_k$.

• Case 2.2: There exist jobs switching from the set $[n] \setminus \mathcal{C}_{[k]}(t_s)$ to $\mathcal{C}_{[k]}(\bar t^{(2)}_k)$ under Algorithm 1. We use $\mathcal{J}_s$ to denote the collection of such switching jobs. It is essential to bound the number of switching jobs, which incur an increment of $|\mathcal{J}_s| \cdot 2^k$ in the remaining workload of class $\mathcal{C}_{[k]}$. A straightforward bound is $|\mathcal{J}_s| \le N \cdot (\bar t^{(2)}_k - t_s) \le N \cdot \eta$, since at most $N$ jobs receive service at each time slot, and hence the number of jobs switching in each time slot is no more than $N$. However, this bound is loose and we argue that
$$|\mathcal{J}_s| \le N - r. \quad (7)$$
Notice that after a job switches to class $\mathcal{C}_{[k]}$ during $[t_s, \bar t^{(2)}_k]$, it will only be preempted by jobs that are also in class $\mathcal{C}_{[k]}$, which is due to the SRPT rule. According to the precondition of this case, there are $r$ jobs in the set $[n] \setminus \mathcal{C}_{[k]}$ that are continuously being processed during $[t_s, \bar t^{(2)}_k]$, hence at most $N - r$ units of resources per time slot are available for the remaining jobs. Note that resources that are allocated to jobs in $\mathcal{C}_{[k]}$ will not be utilized for switching a job from a higher class to $\mathcal{C}_{[k]}$. In addition, finished jobs make no contribution to the total remaining workload $W^{[k]}_{\text{M-SRPT}}(t)$.
Hence $|\mathcal{J}_s|$ is no more than $N - r$.

Furthermore, we can derive the following conclusion:
$$\begin{aligned} W^{[k]}_{\text{M-SRPT}}(t) - W^{[k]}_{\pi^*}(t) &\le W^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k + 1) - W^{[k]}_{\pi^*}(\bar t^{(2)}_k + 1) \\ &\le W^{[k]}_{\text{M-SRPT}}(\bar t^{(2)}_k) - W^{[k]}_{\pi^*}(\bar t^{(2)}_k) + r \cdot 2^k + N && \text{(job switching at } \bar t^{(2)}_k) \\ &\le \left[ W^{[k]}_{\text{M-SRPT}}(t_s) - W^{[k]}_{\pi^*}(t_s) + (N - r) \cdot 2^k + N \cdot (\bar t^{(2)}_k - t_s) \right] + r \cdot 2^k + N && \text{(job switching during } [t_s, \bar t^{(2)}_k]) \\ &\le N \cdot (2^{k+1} + \eta + 1). && (\bar t^{(2)}_k - t_s \le \eta) \end{aligned}$$
The proof is complete. □

We are ready to prove the competitive ratio of Algorithm 1.
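The class decomposition used in Lemma 5 also gives a simple way to sandwich the number of alive jobs by class workloads: a job in class $k$ has remaining workload in $(2^{k-1}, 2^k]$, so it contributes at most 1 to $\sum_k W^{(k)}/2^k$ and more than 1 to $\sum_k W^{(k)}/2^{k-1}$. The following minimal sketch checks this on a toy list of remaining workloads; it assumes positive integer workloads, and the helper names are hypothetical.

```python
def job_class(w):
    """Class index k with 2**(k-1) < w <= 2**k, for a positive integer w."""
    return (w - 1).bit_length()

def class_workloads(remaining):
    """Map each class k to W^(k), the total remaining workload of its jobs."""
    totals = {}
    for w in remaining:
        k = job_class(w)
        totals[k] = totals.get(k, 0) + w
    return totals

def count_bounds(remaining):
    """Return (lower, upper) with lower <= number of alive jobs <= upper."""
    totals = class_workloads(remaining)
    lower = sum(w / 2 ** k for k, w in totals.items())
    upper = sum(w / 2 ** (k - 1) for k, w in totals.items())
    return lower, upper

remaining = [1, 2, 3, 5, 8]            # toy remaining workloads of alive jobs
print(class_workloads(remaining))      # {0: 1, 1: 2, 2: 3, 3: 13}
print(count_bounds(remaining))         # (4.375, 8.75), and there are 5 jobs
```

The two bounds differ by exactly a factor of 2, which is why relating class workloads under two schedules translates into a constant-factor relation between their numbers of alive jobs.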
Proof of Theorem 4:
Let $n_{\text{M-SRPT}}(t)$ and $n_{\pi^*}(t)$ represent the number of jobs alive at time $t$ under Algorithm 1 and the optimal scheduler, respectively. For all $t \ge 0$,
$$\begin{aligned} n_{\pi^*}(t) &\ge \sum_{k=\log p_{\min}}^{\log p_{\max}+1} \frac{W^{(k)}_{\pi^*}(t)}{2^k} = \sum_{k=\log p_{\min}}^{\log p_{\max}+1} \frac{W^{[k]}_{\pi^*}(t) - W^{[k-1]}_{\pi^*}(t)}{2^k} && \text{(definition of } W^{[k]}_{\pi^*}(t)) \\ &= \frac{W^{[\log p_{\max}+1]}_{\pi^*}(t)}{2^{\log p_{\max}+1}} + \sum_{k=\log p_{\min}}^{\log p_{\max}} \frac{W^{[k]}_{\pi^*}(t)}{2^{k+1}} \ge \sum_{k=\log p_{\min}}^{\log p_{\max}+1} \frac{W^{[k]}_{\pi^*}(t)}{2^{k+1}}. && (8) \end{aligned}$$
On the other hand, the number of jobs alive under Algorithm 1 can be upper bounded in a similar fashion,
$$\begin{aligned} n_{\text{M-SRPT}}(t) &\le \sum_{k=\log p_{\min}}^{\log p_{\max}+1} \frac{W^{(k)}_{\text{M-SRPT}}(t)}{2^{k-1}} = \sum_{k=\log p_{\min}}^{\log p_{\max}+1} \frac{W^{[k]}_{\text{M-SRPT}}(t) - W^{[k-1]}_{\text{M-SRPT}}(t)}{2^{k-1}} && \text{(definition of } W^{[k]}_{\text{M-SRPT}}(t)) \\ &= \sum_{k=\log p_{\min}}^{\log p_{\max}} \frac{W^{[k]}_{\text{M-SRPT}}(t)}{2^{k}} + \frac{W^{[\log p_{\max}+1]}_{\text{M-SRPT}}(t)}{2^{\log p_{\max}}} \le \sum_{k=\log p_{\min}}^{\log p_{\max}+1} \frac{W^{[k]}_{\text{M-SRPT}}(t)}{2^{k-1}}. \end{aligned}$$
Using Lemma 5, we are able to relate the number of unfinished jobs under the two algorithms,
$$\begin{aligned} n_{\text{M-SRPT}}(t) &\le \sum_{k=\log p_{\min}}^{\log p_{\max}+1} \frac{W^{[k]}_{\text{M-SRPT}}(t)}{2^{k-1}} \le \sum_{k=\log p_{\min}}^{\log p_{\max}+1} \frac{W^{[k]}_{\pi^*}(t)}{2^{k-1}} + \sum_{k=\log p_{\min}}^{\log p_{\max}+1} \frac{N \cdot (2^k + \eta)}{2^{k-1}} \\ &\le 4\, n_{\pi^*}(t) + N \cdot \left( 2 \log\alpha + \frac{4\eta}{p_{\min}} + 4 \right), \end{aligned}$$
where the last inequality follows from inequality (8). To summarize, summing the number of alive jobs over all time slots $t$ shows that the competitive ratio of Algorithm 1 satisfies the bound stated in Theorem 4.

Fact 6. For the multiple-processor multitask scheduling problem with a constant number of machines, there exists no algorithm achieving a competitive ratio of $o(\log\alpha + \beta)$.

Proof: When $p_{\min} = \eta = 1$, the problem degenerates to the preemptive setting and no algorithm can achieve a competitive ratio of $o(\log\alpha)$. When $\eta = p_{\max}$, the problem degenerates to the non-preemptive setting and $O(\beta)$ is the best possible competitive ratio if the number of machines is constant. The proof is complete. □

Fact 7.
For the multiple-processor multitask scheduling problem, any work-conserving algorithm has a competitive ratio of $\Omega(\log\alpha + \beta^{1-\varepsilon})$ for all $\varepsilon > 0$.

Proof: The reasoning is similar to that of Fact 6, since work-conserving algorithms cannot achieve a competitive ratio of $o(\beta^{1-\varepsilon})$ in the non-preemptive setting. □

Optimality with Poisson Arrival

In this section we show that under mild probabilistic assumptions, Algorithm 1 is asymptotically optimal for minimizing the total flow time in the heavy traffic regime. The main result is stated as follows.

Theorem 8. Let $F^{\text{M-SRPT}}_{\rho}$ and $F^{*}_{\rho}$ be the mean flow time incurred by Algorithm 1 and the optimal algorithm respectively, when the traffic intensity is equal to $\rho$. In an M/G/N queue whose job size distribution is either (1) bounded or (2) unbounded with a tail function of upper Matuszewska index less than $-2$, Algorithm 1 is heavy traffic optimal, i.e.,
$$\lim_{\rho \to 1} \frac{\mathbb{E}[F^{\text{M-SRPT}}_{\rho}]}{\mathbb{E}[F^{*}_{\rho}]} = 1, \quad (9)$$
as long as the size of a single task is no more than
$$\eta = \begin{cases} o\!\left( \dfrac{1}{(1-\rho) \cdot \int_0^\infty \frac{f(x)}{1 - \rho_{\le x}}\, dx} \right) & \text{Case (1)} \\[2ex] o\!\left( \dfrac{1}{(1-\rho) \cdot G^{-1}(\rho) \cdot \int_0^\infty \frac{f(x)}{1 - \rho_{\le x}}\, dx} \right) & \text{Case (2)} \end{cases} \quad (10)$$

Remark. The probabilistic assumptions (1) and (2) here are all with respect to the distribution of job size, i.e., the total workload of tasks. For the processing time of a single task, the only assumption we have is the upper bound $\eta$. It can be seen that the optimality result in [2] corresponds to the special case when $\eta = 1$, while the bound derived in (10) could be extremely large when $\rho$ approaches 1. On the other hand, for the integral above, we have the following rough estimation,
$$\int_0^\infty \frac{f(x)}{1 - \rho_{\le x}}\, dx \le \int_1^\infty \frac{x f(x)}{1 - \rho_{\le x}}\, dx + \int_0^1 \frac{f(x)}{1 - \rho_{\le x}}\, dx \le \log\!\left( \frac{1}{1-\rho} \right) + \frac{1}{1-\rho} \le \frac{2}{1-\rho}.$$

Lower bound on minimum flow time $\mathbb{E}[F^{*}_{\rho}]$. To start with, we consider the benchmark system consisting of a single machine with speed $N$, where all the tasks are allowed to be served in a preemptive fashion, i.e., the concept of task is unnecessary in this setting. The performance of SRPT for this single server system is summarized by the following fact.

Fact 9 ([7]). In an M/G/1 queue whose service distribution is either (1) bounded or (2) unbounded with a tail function of upper Matuszewska index less than $-2$,
$$\mathbb{E}[F^{\text{SRPT-1}}_{\rho}] = \begin{cases} \Theta\!\left( \dfrac{1}{1-\rho} \right) & \text{Case (1)} \\[2ex] \Theta\!\left( \dfrac{1}{(1-\rho) \cdot G^{-1}(\rho)} \right) & \text{Case (2)} \end{cases}$$
where $G^{-1}(\cdot)$ denotes the inverse of $G(x) = \rho_{\le x} / \rho$.

It is clear that the mean flow time under SRPT for this system serves as a valid lower bound for the multitask problem, i.e.,
$$\mathbb{E}[F^{*}_{\rho}] \ge \mathbb{E}[F^{\text{SRPT-1}}_{\rho}]. \quad (11)$$

Proof of Theorem 8: Our main goal is to derive an analytical upper bound on the quantity $\mathbb{E}[F^{\text{M-SRPT}}_{\rho}]$. The proof mainly follows the techniques in [2, 8], which relate the flow time of the tagged job to an appropriate busy period.

Consider a tagged job with remaining workload $x$, arrival time $r_x$ and completion time $C_x$. The computing resources of the $N$ servers must be spent on the following types of jobs during $[r_x, C_x]$:

1. The system may be dealing with jobs with remaining workload larger than $x$, or some machines may be idle, while the tagged job is in service, because the number of jobs alive is smaller than $N$. We use $W_{\text{waste}}(r_x)$ to represent the amount of such resources; then
$$W_{\text{waste}}(r_x) \le (N - 1) \cdot x, \quad (12)$$
which is the same as the corresponding lemma in [2]. The reason is straightforward: the tagged job must be in service according to Algorithm 1, hence the number of such time slots cannot exceed $x$ and thus (12) holds.

2. The system may be dealing with jobs with remaining workload no more than $x$ at time $r_x$; the amount of resources spent on this class is no more than $W^{\text{M-SRPT}}_{\le x}(r_x)$. Here we use $W^{\text{M-SRPT}}_{\le x}(t)$ to denote the total workload of jobs with remaining workload no more than $x$ at time $t$.

3. The system may be dealing with jobs which have a remaining workload larger than $x$ at time $t = r_x$, while the tagged job is not in service. This happens only if the system is processing tasks which belong to a job with total remaining workload larger than $x$, the tasks are in service before time $r_x$, and the non-preemptive rule allows the tasks to be served from time $r_x$ onwards. Let $W_{\text{non-pm}}(r_x)$ denote the total units of computing resources spent on this class of jobs during $[r_x, C_x]$. Our main argument for this class of jobs is
$$W_{\text{non-pm}}(r_x) \le (N^2 + N) \cdot \eta + N \cdot x. \quad (13)$$
To see the correctness of inequality (13), we consider the time intervals $[r_x, r_x + \eta]$ and $(r_x + \eta, C_x]$ separately.
• Note that there are $N \cdot \eta$ computing resources during $[r_x, r_x + \eta]$ in total, hence the amount of resources spent on this collection of jobs during $[r_x, r_x + \eta]$ cannot exceed $N \cdot \eta$.
• We next show that in the time interval $(r_x + \eta, C_x]$, the total amount of computing resources spent on such jobs is no more than $N^2 \cdot \eta + N \cdot x$. Consider the following two types of jobs:
  – Jobs of this class that have a remaining workload larger than $x$ at time $t = r_x + \eta$ will be processed after time $t = r_x + \eta$ only if the tagged job is in service, hence the amount of resources spent on such jobs is already taken into account in the first class above, i.e., the quantity $W_{\text{waste}}(r_x)$, and we can ignore this subclass.
  – For the collection of jobs with remaining workload no more than $x$ at time $t = r_x + \eta$, we first consider the setting in which different tasks within the same job can be processed in parallel. The remaining workload of each such job at time $t = r_x$ is no more than $x + N \cdot \eta$.
Since there are at most $N$ such jobs in total, we can conclude that the remaining workload of jobs in this subclass must be no more than $N \cdot (x + N \cdot \eta) = N \cdot x + N^2 \cdot \eta$, which implies that $W_{\text{non-pm}}(r_x) \le N \cdot x + N^2 \cdot \eta + N \eta$ and (13) holds.

4. The tagged job itself. The amount of resources is equal to $x$, the size of the tagged job.

5. Newly arriving jobs during $[r_x, C_x]$ with size no more than $x$.

Hence $f^{\text{M-SRPT}}_x$, the flow time of the tagged job, is no more than the length of a busy period with load $\rho_{\le x}$ and initial workload $W_{\text{waste}}(r_x) + W_{\text{non-pm}}(r_x) + W^{\text{M-SRPT}}_{\le x}(r_x) + x$. Hence we have
$$\begin{aligned} f^{\text{M-SRPT}}_x &\le_{st} B^{(\rho_{\le x})}\!\left( W_{\text{waste}}(r_x) + W_{\text{non-pm}}(r_x) + W^{\text{M-SRPT}}_{\le x}(r_x) + x \right) \\ &\overset{(a)}{=} B^{(\rho_{\le x})}\!\left( W_{\text{waste}}(r_x) + W_{\text{non-pm}}(r_x) + x \right) + B^{(\rho_{\le x})}\!\left( W^{\text{M-SRPT}}_{\le x}(r_x) \right) \\ &\overset{(b)}{\le} B^{(\rho_{\le x})}\!\left( N^2 \cdot \eta + N \cdot (2x + \eta) \right) + B^{(\rho_{\le x})}\!\left( W^{\text{M-SRPT}}_{\le x}(r_x) \right) \\ &\overset{(c)}{\le} \underbrace{B^{(\rho_{\le x})}\!\left( N^2 \cdot \eta + N \cdot (3x + 2\eta + 1) \right)}_{\Sigma_1} + \underbrace{B^{(\rho_{\le x})}\!\left( W^{\text{SRPT-1}}_{\le x}(r_x) \right)}_{\Sigma_2}, \end{aligned}$$
where (a) follows from the additivity of the busy period, in (b) we utilize the upper bounds established in (12) and (13), and (c) follows from Lemma 10 together with additivity.

Note that the average flow time under SRPT in a single server system is lower bounded as
$$\mathbb{E}[F^{\text{SRPT-1}}_{\rho}] \ge \mathbb{E}\!\left[ B^{(\rho_{\le x})}\!\left( W^{\text{SRPT-1}}_{\le x}(t) \right) \right] = \mathbb{E}_{x, r_x}\!\left[ B^{(\rho_{\le x})}\!\left( W^{\text{SRPT-1}}_{\le x}(r_x) \right) \right] = \mathbb{E}_{x, r_x}[\Sigma_2], \quad (14)$$
where the first equality holds due to the Poisson Arrivals See Time Averages (PASTA) property [13]. Note that, for a constant number of machines,
$$\mathbb{E}[\Sigma_1] = O\!\left( \mathbb{E}\!\left[ B^{(\rho_{\le x})}(\eta + x) \right] \right) = O\!\left( \mathbb{E}\!\left[ \frac{\eta + x}{1 - \rho_{\le x}} \right] \right) = O\!\left( \log\frac{1}{1-\rho} \right) + \eta \cdot O\!\left( \int_0^\infty \frac{f(x)}{1 - \rho_{\le x}}\, dx \right).$$
To achieve heavy traffic optimality, it suffices to show that the difference between the average flow time under Algorithm 1 and that under the optimal algorithm is a lower order term, i.e.,
$$\lim_{\rho \to 1} \frac{\mathbb{E}[F^{\text{M-SRPT}}_{\rho}] - \mathbb{E}[F^{\text{SRPT-1}}_{\rho}]}{\mathbb{E}[F^{\text{SRPT-1}}_{\rho}]} = 0. \quad (15)$$
Note that
$$\mathbb{E}[F^{\text{M-SRPT}}_{\rho}] = \mathbb{E}_{x, r_x}[f^{\text{M-SRPT}}_x] \le \mathbb{E}_{x, r_x}[\Sigma_1] + \mathbb{E}_{x, r_x}[\Sigma_2],$$
so (15) holds provided that
$$\lim_{\rho \to 1} \frac{\eta \cdot O\!\left( \int_0^\infty \frac{f(x)}{1 - \rho_{\le x}}\, dx \right)}{\mathbb{E}[F^{\text{SRPT-1}}_{\rho}]} = 0,$$
since $\log(1/(1-\rho))$ is always a lower order term compared with the optimal flow time $\mathbb{E}[F^{\text{SRPT-1}}_{\rho}]$. Hence
$$\eta = \begin{cases} o\!\left( \dfrac{1}{(1-\rho) \cdot \int_0^\infty \frac{f(x)}{1 - \rho_{\le x}}\, dx} \right) & \text{Case (1)} \\[2ex] o\!\left( \dfrac{1}{(1-\rho) \cdot G^{-1}(\rho) \cdot \int_0^\infty \frac{f(x)}{1 - \rho_{\le x}}\, dx} \right) & \text{Case (2)} \end{cases}$$
□

Lemma 10. The difference between the workload of jobs with remaining workload no more than $y$ under Algorithm 1 and that under SRPT in the single server benchmark system is upper bounded by
$$W^{\text{M-SRPT}}_{\le y}(t) - W^{\text{SRPT-1}}_{\le y}(t) \le N \cdot (y + \eta + 1), \quad \forall\, y, t \ge 0.$$

Proof: The proof is similar to that of Lemma 5. □

References

[1] Yossi Azar and Noam Touitou. Improved online algorithm for weighted flow time. In FOCS, pages 427–437, 2018.
[2] Isaac Grosof, Ziv Scully, and Mor Harchol-Balter. SRPT for multiserver systems. Performance Evaluation, 127:154–175, 2018.
[3] Mor Harchol-Balter. Performance modeling and design of computer systems: queueing theory in action. Cambridge University Press, 2013.
[4] Jin Kyu Kim, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth A. Gibson, and Eric P. Xing. STRADS: a distributed framework for scheduled model parallel machine learning. In EuroSys, pages 5:1–5:16, 2016.
[5] Stefano Leonardi and Danny Raz. Approximating total flow time on parallel machines. In STOC, pages 110–119, 1997.
[6] Stefano Leonardi and Danny Raz. Approximating total flow time on parallel machines. Journal of Computer and System Sciences, 73(6):875–891, 2007.
[7] Minghong Lin, Adam Wierman, and Bert Zwart. Heavy-traffic analysis of mean response time under shortest remaining processing time. Performance Evaluation, 68(10):955–966, 2011.
[8] Linus E. Schrage and Louis W. Miller. The queue M/G/1 with the shortest remaining processing time discipline. Operations Research, 14(4):670–684, 1966.
[9] Ziv Scully, Guy Blelloch, Mor Harchol-Balter, and Alan Scheller-Wolf. Optimally scheduling jobs with multiple tasks. ACM SIGMETRICS Performance Evaluation Review, 45(2):36–38, 2017.
[10] Ziv Scully, Mor Harchol-Balter, and Alan Scheller-Wolf. Optimal scheduling and exact response time analysis for multistage jobs. arXiv preprint arXiv:1805.06865, 2018.
[11] Yin Sun, C. Emre Koksal, and Ness B. Shroff. Near delay-optimal scheduling of batch jobs in multi-server systems. Ohio State Univ., Tech. Rep, 2017.
[12] Weina Wang, Kai Zhu, Lei Ying, Jian Tan, and Li Zhang. Maptask scheduling in mapreduce with data locality: Throughput and heavy-traffic optimality. IEEE/ACM Transactions on Networking (TON), 24(1):190–203, 2016.
[13] Ronald W. Wolff. Poisson arrivals see time averages. Operations Research, 30(2):223–231, 1982.
[14] Yousi Zheng, Ness B. Shroff, and Prasun Sinha. A new analytical technique for designing provably efficient mapreduce schedulers. In