Online Job Scheduling with Redundancy and Opportunistic Checkpointing: A Speedup-Function-Based Analysis
Huanle Xu, Member, IEEE, Gustavo de Veciana, Fellow, IEEE, Wing Cheong Lau, Senior Member, IEEE, and Kunxiao Zhou
Abstract—In a large-scale computing cluster, job completions can be substantially delayed by two sources of variability, namely, variability in the job size and variability in the machine service capacity. To tackle this issue, existing works have proposed various scheduling algorithms which exploit redundancy, wherein a job runs on multiple servers until the first copy completes. In this paper, we explore the impact of variability in the machine service capacity and adopt a rigorous analytical approach to design scheduling algorithms using redundancy and checkpointing. We design several online scheduling algorithms which can dynamically vary the number of redundant copies for jobs. We also provide new theoretical performance bounds for these algorithms in terms of the overall job flowtime by introducing the notion of a speedup function, based on which a novel potential function can be defined to enable the corresponding competitive ratio analysis. In particular, by adopting the online primal-dual fitting approach, we prove that our SRPT+R Algorithm in a non-multitasking cluster is (1 + ε)-speed, O(1/ε)-competitive. We also show that our proposed Fair+R and LAPS+R(β) Algorithms for a multitasking cluster are (4 + ε)-speed, O(1/ε)-competitive and (β + 2ε)-speed, O(1/(βε))-competitive respectively. We demonstrate via extensive simulations that our proposed algorithms can significantly reduce job flowtime under both the non-multitasking and multitasking modes.

Index Terms—Online Scheduling, Redundancy, Optimization, Competitive Analysis, Dual-Fitting, Potential Function
1 INTRODUCTION

Job traces from large-scale computing clusters indicate that the completion time of jobs can vary substantially [8], [9]. This variability has two sources: variability in the job processing requirements and variability in machine service capacity. The job profiles in production clusters also become increasingly diverse as small latency-sensitive jobs coexist with large batch processing applications which take hours to months to complete [51]. With the size of today's computing clusters continuing to grow, component failures and resource contention have become a common phenomenon in cloud infrastructure [25], [33]. As a result, the rate of machine service capacity may fluctuate significantly over the lifetime of a job. The same job may experience a far higher response time when executed at a different time on the same server [21]. These two dimensions of variability make efficient job scheduling for fast response time (also referred to as job flowtime) over large-scale computing clusters challenging.

To tackle variability in job processing requirements, various schedulers have been proposed to provide efficient resource sharing among heterogeneous applications. Widely deployed schedulers to date include the Fair scheduler [3] and the Capacity scheduler [2].

• Huanle Xu and Kunxiao Zhou are with the School of Computer Science and Network Security, Dongguan University of Technology, Dongguan, Guangdong. E-mail: {xuhl, zhoukx}@dgut.edu.cn.
• Gustavo de Veciana is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA. E-mail: [email protected].
• Wing Cheong Lau is with the Department of Information Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. E-mail: [email protected].

Part of this work has been presented in IEEE Infocom 2017.
It is well known that the Shortest Remaining Processing Time (SRPT) scheduler is optimal for minimizing the overall/average job flowtime [19] on a single machine in the clairvoyant setting, i.e., when job processing times are known a priori. As such, many works have aimed to extend SRPT scheduling to yield efficient scheduling algorithms in the multiprocessor setting with the objective of reducing job flowtimes for different systems and programming frameworks [22], [35], [36], [53]. Under SRPT, jobs' residual processing times are known to the job scheduler upon arrival and smaller jobs are given priority. However, if only the distribution of job sizes is known, it is shown in [4] that a Gittins index-based policy is optimal for minimizing the expected job flowtime under Poisson job arrivals in the single-server case. The Gittins index depends on knowing the service already allocated to each job and gives priority to the job with the highest index.

To deal with component failures and resource contention, computing clusters are exploiting redundant execution, wherein multiple copies of the same job execute on available machines until the first completes. With redundancy, it is expected that one copy of the same job might complete quickly so as to avoid long completion times. For the Google MapReduce system, it has been shown that redundancy can decrease the average job flowtime by 44% [17]. Many other cloud computing systems apply simple heuristics to use redundancy, and these have proven effective at reducing job flowtimes in practical deployments, e.g., [1], [7], [9], [14], [17], [31], [52].

Recently, researchers have started to investigate the effectiveness of scheduling redundant copies from a queueing perspective [15], [21], [38], [39], [42], [45]. These works assume a specific distribution of the job execution time where jobs follow the same distribution.
However, they do not characterize the major cause of the variance of the job response time, namely, whether the variance is due to variability in job size or to variability in machine service capacity. In fact, if there is no variability in the machine service capacity, making multiple copies of the same job may not help, and redundancy is a waste of resources.

To overcome the aforementioned limitations, we developed a stochastic framework in our previous work [49] to explore the impact of variability in the machine service capacity. In this framework, the service capacity of each machine over time is modeled as a stationary process. To take full advantage of redundancy, [49] allows checkpointing [37] to preempt, migrate and perform dynamic partitioning [43] on its running jobs. By checkpointing, we mean that the runtime system of a cluster takes a snapshot of the state of a job in progress so that its execution can be resumed from that point in the case of subsequent machine failure or job preemption [10]. Upon checkpointing, the state of the redundant copy which has made the most progress is propagated and cloned to the other copies. In other words, all the redundant copies of a job can be brought to that most advanced state and proceed to execute from this updated state.

A fundamental limitation of [49] is that checkpointing needs to be done periodically while a job is being processed. Such a checkpointing mechanism would incur large overheads when the cluster size is large, while the scheduler needs to make scheduling decisions quickly. To tackle this limitation, in this paper, we limit the total number of checkpointings for each job. Moreover, we allow checkpointing to occur on a job only if there is an arrival to or departure from the system.
As such, the resultant algorithms are more scalable and applicable to real-world implementations.

Most previous works studying job scheduling assume that clusters work in the non-multitasking mode, i.e., each server (CPU core) in the cluster can only serve one job at any time. However, multitasking is a reasonable model of current scheduling policies in CPUs, web servers, routers, etc. [16], [44], [46]. In a multitasking cluster, each server may run multiple jobs simultaneously and jobs can share resources in different proportions. In this paper, we will also study scheduling algorithms, which determine checkpointing times, the number of redundant copies between successive checkpoints as well as the fraction of resource to be shared, in both the multitasking and non-multitasking settings.

Our Results
For non-multitasking clusters, we propose the SRPT+R algorithm, where redundancy is used only when the number of active jobs is less than the number of servers. For clusters allowing multitasking, we first design the Fair+R Algorithm, which shares resources nearly equally among existing jobs, with priority given to jobs which arrived most recently. We then extend the Fair+R Algorithm to yield the LAPS+R(β) Algorithm, which only shares resources amongst a fixed fraction of the active jobs. In summary, this paper makes the following technical contributions:

• New Framework.
We present the first optimization framework to address the job scheduling problem with redundancy, subject to a limited number of checkpointings. Our optimization problems consider both the multitasking and non-multitasking scenarios.

• New Techniques.
We introduce the notion of speedup functions in both the multitasking and non-multitasking cases. Thanks to this, we develop a new dual-fitting approach to bound the competitive performance of both SRPT+R and Fair+R. Based on the speedup function, we also design a novel potential function accounting for redundancy to analyze the performance of LAPS+R(β) in the multitasking setting. By changing the speedup function, one can readily apply our dual-fitting approach as well as the potential function analysis to other resource allocation problems in the multi-machine setting with or without multitasking.

• New Results.
Under our optimization framework, SRPT+R achieves a much tighter competitive bound than other SRPT-based redundancy algorithms under different settings, e.g., [49]. Moreover, LAPS+R(β) is the first algorithm to address the redundancy issue among those algorithms which work under the multitasking mode.

The rest of this paper is organized as follows. After reviewing the related work in Section 2, we introduce our system model and optimization framework in Section 3. In Section 4, we present SRPT+R and its performance bound in a non-multitasking cluster. We proceed to introduce the design and analysis of both Fair+R and LAPS+R(β) under the multitasking mode in Section 5. Before concluding our work in Section 7, we conduct several numerical studies in Section 6 to evaluate our proposed algorithms.

2 RELATED WORK
In this section, we begin by giving a brief introduction to existing work on job schedulers. Then, we review the related work on redundancy schemes in large-scale computing clusters from prior research in industry and academia.

The design of job schedulers for large-scale computing clusters is currently an active research area [12], [13], [35], [36], [50], [53]. In particular, several works have derived performance bounds towards minimizing the total job completion time [12], [13], [50] by formulating an approximate linear programming problem. By contrast, [34] shows that there is a strong lower bound on any online randomized algorithm for the job scheduling problem on multiple unit-speed processors with the objective of minimizing the overall job flowtime. Based on this lower bound, some works extend the SRPT scheduler to design algorithms that minimize the overall flowtimes of jobs which may consist of multiple small tasks with precedence constraints [35], [36], [50], [53]. The above work was conducted in the clairvoyant setting, i.e., the job size is known once the job arrives. For the non-clairvoyant setting, [26], [27], [28] design several multitasking algorithms under which machines are allocated to all jobs in the system and priority is given to jobs which arrived most recently. All of the above studies assume accurate knowledge of machine service capacity and hence do not address dynamic scheduling of redundant copies for a job.

Production clusters and big data computing frameworks have adopted various approaches to use redundancy for running jobs. The initial Google MapReduce system launches redundant copies when a job is close to its completion [17]. Hadoop adopts another solution called LATE, which schedules a redundant copy for a running task only if its estimated progress rate is below a certain threshold [1].
By comparison, Microsoft Mantri [9] schedules a new copy for a running task if its progress is slow and the total resource consumption is expected to decrease once a new redundant copy is made.

Researchers have proposed different schemes to take advantage of redundancy via more careful designs. For example, [14] proposes a smart redundancy scheme to accurately estimate the task progress rate and launch redundant copies accordingly. The authors in [7] propose to use redundancy for very small jobs when the extra load is not high. As an extension to [7], they further develop GRASS [8], which carefully schedules redundant copies for approximation jobs. Moreover, [41] proposes Hopper, which allocates computing slots based on the virtual job size, which is larger than the actual size. Hopper can immediately schedule a redundant copy once the progress rate of a task is detected to be slow. No performance characterization has been developed for these heuristics.

In our previous work, we developed several optimization frameworks to study the design of scheduling algorithms utilizing redundancy [47], [48]. The proposed algorithms in [47] require knowledge of the exact distribution of the task response time. We also analyzed performance bounds of the algorithm proposed in [48], which extends the SRPT scheduler, by adopting the potential function analysis. A fundamental limitation is that the resultant bounds are not scalable, as they increase linearly with the number of machines. Recently, [20] proposes a simple model to address both machine service variability and job size variability. However, [20] only considers the FIFO scheduling policy on each server to characterize the average job response time from a queueing perspective.

Another body of research related to this paper focuses on the study of scheduling algorithms for jobs with intermediate parallelizability. In these works, e.g., [5], [11], [18], [24], [30], jobs are parallelizable and the service rate can be arbitrarily scaled.
In particular, Samuli et al. present several optimal scheduling policies for different capacity regions in [5], but for the transient case only. [18], [11] and [24] propose similar multitasking algorithms for jobs wherein priority is given to jobs which arrived most recently. These works develop competitive performance bounds with respect to the total job flowtime by adopting potential function arguments. [30] also provides a competitive bound for the SRPT-based parallelizable algorithm in the multitasking setting. One limitation of [30] is that the resultant bound is potentially very large. By contrast, this paper is motivated by the setting where there is variability in the machine service capacity.

For the analysis of the SRPT+R algorithm in Section 4.2 and the Fair+R algorithm in Section 5.1, we adopt the dual fitting approach. Dual fitting was first developed by [6], [23] and is now widely used for the analysis of online algorithms [27], [28]. In particular, [6] and [27], [28] address linear objectives and use the dual-fitting approach to derive competitive bounds for traditional scheduling algorithms without redundancy. By contrast, [23] focuses on a convex objective in the multitasking setting. By comparison, this paper includes integer constraints associated with the non-multitasking mode. Moreover, our setting of dual variables is novel in the sense that it deals with the dynamic change of job flowtime across multiple machines, whereas other settings of dual variables can only deal with the change of job flowtime on one single machine.

We apply the potential function analysis to bound the performance of LAPS+R(β) in Section 5. Potential functions are widely used to derive performance bounds with resource augmentation for online parallel scheduling algorithms, e.g., [18], [30].
However, since we need to deal with redundancy and checkpointing, the design of our potential function is totally different from that in [18] and [30], which only address sublinear speedup.

While this paper adopts a framework similar to the one in [49] to model machine service variability, it differs from [49] in two major aspects. Firstly, the requirement of limiting the total number of checkpointings results in a very different optimization problem which is much more difficult to solve than the one in [49]. To tackle this challenge, in this paper, we adopt both the dual fitting approach and potential function analysis to make approximations and bound the competitive performance. By contrast, [49] only applies the potential function analysis to derive performance bounds. Secondly, the current paper considers both the multitasking mode and the non-multitasking mode to design corresponding online scheduling algorithms using redundancy. By contrast, the scheduling algorithms proposed in [49] can only work under the non-multitasking mode.

3 SYSTEM MODEL
Consider a computing cluster which consists of M servers (machines), where the servers are indexed from 1 to M. Job j arrives to the cluster at time a_j and the job arrival process, (a_1, a_2, ..., a_N), is an arbitrary deterministic time sequence. In addition, job j has a workload which requires p_j units of time to complete when processed on a machine at unit speed. Job j completes at time c_j and its flowtime, f_j, is given by f_j = c_j − a_j. In this paper, we focus on minimizing the overall job flowtime, i.e., Σ_{j=1}^N f_j.

The service capacities of machines are assumed to be identically distributed random processes with stationary increments. To be specific, we let S_i = (S_i(t) | t ≥ 0) be a random process where S_i(t, τ] = S_i(τ) − S_i(t) denotes the cumulative service delivered by machine i in the interval (t, τ]. The service capacity of a machine has unit mean speed and a peak rate of Δ, so for all τ > t ≥ 0, we have S_i(t, τ] ≤ (τ − t) · Δ almost surely and E[S_i(t, τ]] = τ − t.

In this paper, our aim is to mitigate the impact of service variability by (possibly) varying the number of redundant copies with appropriate checkpointing. Checkpointing can make the most out of the allocated resources, i.e., start the processing of the possibly redundant copies at the most advanced state amongst the previously executing copies. In fact, we shall make the following assumption across the system:

Fig. 1. The service process of job j.

Assumption 1.
A job j can be checkpointed only if there is an arrival to, or a departure from, the system.

Remark 1.
We refer to Assumption 1 as a scalability assumption, as it limits the checkpointing overheads in the system.
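The machine model above is easy to mimic numerically. Below is a minimal sketch, assuming (purely for illustration; the paper does not fix a distribution) that in each small time step a machine runs at a speed drawn uniformly from [0, Δ] with Δ = 2, which yields unit mean speed, stationary increments and peak rate Δ:

```python
import random

def cumulative_service(t, dt=0.01, peak=2.0, rng=random):
    """Approximate S_i(0, t] for one machine by summing independent
    increments; each increment runs at a speed Uniform(0, peak), so
    E[S_i(0, t]] = t and S_i(0, t] <= peak * t almost surely."""
    steps = int(round(t / dt))
    return sum(rng.uniform(0.0, peak) * dt for _ in range(steps))
```

Averaging many sample paths recovers E[S_i(0, t]] = t, while every single path stays below Δ·t, matching the two properties assumed above.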
Below, we will first introduce a service model where each server can only serve one job at a time. In Section 3.2, we will discuss a service model which supports multitasking, i.e., a server can execute multiple jobs simultaneously.
As illustrated in Fig. 1, one can view the service process of job j in a non-multitasking cluster by dividing its service period (from its arrival to its completion) into several subintervals, i.e., {(t_j^{k−1}, t_j^k]}_k, where t_j^k denotes the time when the k-th checkpointing of job j occurs. The job arrival and completion times are also considered as checkpointing times, i.e., t_j^0 = a_j and t_j^{L_j} = c_j if job j experiences (L_j + 1) checkpoints. During (t_j^{k−1}, t_j^k], job j is running on r_j^k redundant servers. Thus, together t_j = (t_j^k | k = 0, 1, ..., L_j) and r_j = (r_j^k | k = 1, 2, ..., L_j) capture the checkpoint times and the scheduled redundancy for job j.

We will let g(r, t) denote the cumulative service delivered to a job on r redundant machines and checkpointed at the end of an interval of duration t. Clearly, g(r, t) is equivalent to the amount of work processed by the redundant copy which has made the most progress. In this paper, we make the following assumption for g(r, t):

Assumption 2.
We shall model (approximate) the cumulative service capacity under redundant execution, g(r, t), by its mean, i.e.,

g(r, t) = E[ max_{i=1,2,...,r} S_i(0, t] ].    (1)

Remark 2.
Assumption 2 essentially replaces the service capacity of the system with its mean, but accounts for the mean gains one might expect when there are redundant copies executed.
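Under Assumption 2, g(r, t) can be estimated by simulation. Here is a small sketch, assuming (for illustration only, not the paper's model) that each copy i runs at a random but constant speed V_i ~ Uniform(0, Δ) with Δ = 2, so that S_i(0, t] = V_i · t has unit mean speed and peak rate Δ; under this toy model g(r, t) = Δ·t·r/(r+1) in closed form:

```python
import random

def g_hat(r, t, delta=2.0, samples=40000, rng=random):
    """Monte Carlo estimate of g(r, t) = E[max_{i=1..r} S_i(0, t]]
    under the toy constant-speed model S_i(0, t] = V_i * t."""
    total = 0.0
    for _ in range(samples):
        total += max(rng.uniform(0.0, delta) for _ in range(r)) * t
    return total / samples
```

For t = 1 the estimates approach 1, 4/3, 3/2, 8/5 for r = 1, ..., 4: each extra copy helps less than the previous one, and no value exceeds min{Δt, rt}, which is exactly what the two lemmas below establish in general.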
The following lemmas illustrate two important properties of g(r, t):

Lemma 1.
For a fixed t, {g(r, t)}_r is a concave sequence, i.e., g(r, t) − g(r−1, t) ≤ g(r−1, t) − g(r−2, t).

Proof. Let H_r(0, t] = max_{i=1,2,...,r} S_i(0, t] and define F(x, t) as the cumulative distribution function of the random variable S_i(0, t] for a fixed t. Thus, we have Pr(H_r(0, t] ≤ x) = F^r(x, t) and g(r, t) = E[H_r(0, t]] = ∫_0^∞ (1 − F^r(x, t)) dx, which further implies that:

g(r, t) − g(r−1, t) = ∫_0^∞ F^{r−1}(x, t) · (1 − F(x, t)) dx
  ≤ ∫_0^∞ F^{r−2}(x, t) · (1 − F(x, t)) dx
  = g(r−1, t) − g(r−2, t).    (2)

This completes the proof.

Lemma 1 states that the marginal increase of the mean service capacity in the number of redundant executions is decreasing.

Lemma 2.
For all r ∈ N and r ≤ M, g(r, t) ≤ min{Δt, rt}.

Proof. As shown in the proof of Lemma 1, g(r, t) = ∫_0^∞ (1 − F^r(x, t)) dx. Therefore, it follows that:

∫_0^∞ (1 − F^r(x, t)) dx = ∫_0^∞ (1 − F(x, t)) Σ_{l=0}^{r−1} F^l(x, t) dx
  ≥ r ∫_0^∞ (1 − F(x, t)) F^{r−1}(x, t) dx
  = r (g(r, t) − g(r−1, t)),    (3)

which implies g(r, t) ≤ (r/(r−1)) · g(r−1, t). Iterating, g(r, t) ≤ r · g(1, t). Moreover, we have g(1, t) = E[S_i(0, t]] = t. Thus, we have:

g(r, t) ≤ rt.    (4)

Since S_i(0, t] ≤ Δt almost surely, it follows that:

E[ max_{i=1,2,...,r} S_i(0, t] ] ≤ Δt.    (5)

The result follows from (4) and (5).

Lemma 2 states that the mean service capacity under redundant execution can grow at most linearly in the redundancy, rt, and is bounded by the peak service capacity of any single redundant copy, Δt.

Given Assumption 2, the last checkpoint time for job j, t_j^{L_j}, is also the completion time c_j and satisfies the following equation:

Σ_{k=1}^{L_j} g(r_j^k, t_j^k − t_j^{k−1}) = p_j.    (6)

In the sequel, we shall also make use of the speedup function, h_j(t_j, r_j, t), defined as follows:

h_j(t_j, r_j, t) = { g(r_j^k, t_j^k − t_j^{k−1}) / (t_j^k − t_j^{k−1}),  t ∈ (t_j^{k−1}, t_j^k];
                    0,  otherwise.    (7)

The speedup function captures the speedup that redundant execution delivers in a checkpointing interval relative to a job execution on a unit-speed machine. (6) can be reformulated in terms of the speedup as follows:

∫_{a_j}^{c_j} h_j(t_j, r_j, τ) dτ = p_j.    (8)

Remark 3.
Note that the speedup depends not only on the number of redundant copies being executed, but also on all the times when checkpointing occurs. In this sense, h_j(t_j, r_j, t) is not a causal function. However, in the following sections, h_j(t_j, r_j, t) will be a convenient notation to study competitive performance bounds for our proposed algorithms.

With multitasking, a server can run several jobs simultaneously and the service a job receives on a server is proportional to the fraction of processing resource it is assigned. We will model a cluster allowing multitasking as follows. Compared with the service model in Subsection 3.1, we include another variable, x_j^k, to characterize the fraction of resource assigned to job j in the k-th subinterval, i.e., (t_j^{k−1}, t_j^k]. Here, we assume that job j shares the same fraction of processing resource on all the machines on which it is being executed. Let x_j = (x_j^k | k = 1, 2, ..., L_j) and define another speedup function, ĥ_j(t_j, x_j, r_j, t), as follows:

ĥ_j(t_j, x_j, r_j, t) = { x_j^k · h_j(t_j, r_j, t),  t ∈ (t_j^{k−1}, t_j^k];
                          0,  otherwise.    (9)

Paralleling (8), the completion time of job j, c_j, must satisfy the following equation:

∫_{a_j}^{c_j} ĥ_j(t_j, x_j, r_j, τ) dτ = p_j.    (10)

In the sequel, we will design and analyze algorithms under both the multitasking mode and the non-multitasking mode.

In this paper, we will study algorithms for scheduling, which involves determining checkpointing times, the number of redundant copies for jobs between successive checkpoints and, in the multitasking setting, the fraction of resource shares. Note that, when there is no variability in the machine's service capacity, our problem reduces to job scheduling on multiple unit-speed processors with the objective of minimizing the overall flowtime.
This has beenproven to be NP-hard even when preemption and migrationare allowed and previous work [30], [32] has adopted aresource augmentation analysis. Under such analysis, theperformance of the optimal algorithm on M unit-speedmachines is compared with that of the proposed algorithmson M δ -speed machines where δ > .The following definition characterizes the competitiveperformance of an online algorithm using resource augmen-tation. Definition 1. [32] An online algorithm is δ -speed c-competitiveif the algorithm’s objective is within a factor of c of the optimalsolution’s objective when the algorithm is given δ resource aug-mentation. In this paper, we also adopt the resource augmentationsetup to bound the competitive performance of our pro-posed algorithms. With resource augmentation, the servicecapacity in each checkpointing interval under our algo-rithms is scaled by δ . Similarly, the value of the speedupfunctons, i.e., h j ( t j , r j , t ) and ˆ h j ( t j , x j , r j , t ) , under ouralgorithms is δ times that under the optimal algorithm ofthe same variables. LGORITHM D ESIGN IN A N ON -M ULTITASKING C LUSTER
In a non-multitasking cluster, each server can only serve one job at any time. Before going into the details of the algorithm design, we first state the optimization problem formulation. For ease of illustration, we let y_j = (t_j, r_j, L_j) denote the checkpointing trajectory of job j and y = (y_j | j = 1, 2, ..., N) that for all jobs. Moreover, let 1(A) denote the indicator function that takes value 1 if A is true and 0 otherwise. The optimization problem formulation is as follows:

min_y Σ_{j=1}^N (c_j − a_j)    (OPT)
such that (a), (b), (c), (d) are satisfied:

(a) Job completion: The completion time of job j, c_j, satisfies ∫_{a_j}^{c_j} h_j(t_j, r_j, t) dt = p_j, ∀j.
(b) Resource constraint: The total number of redundant executions at any time t ≥ 0 is no larger than the number of machines, M, i.e., Σ_{j: a_j ≤ t} Σ_{k=1}^{L_j} r_j^k · 1(t ∈ (t_j^{k−1}, t_j^k]) ≤ M, ∀t.
(c) Checkpoint trajectory: The number of checkpoints for each job is between 2 and 2N, since there are 2N job arrivals and departures, i.e., L_j ∈ {1, 2, ..., 2N − 1}. The checkpoint times of job j, t_j, satisfy t_j ∈ T_j^{L_j+1}, where T_j^{L_j+1} = {(t^0, t^1, ..., t^{L_j}) ∈ R^{L_j+1} | a_j = t^0 < t^1 < ... < t^{L_j}}.

Theorem 1. SRPT+R is (1 + ε)-speed, O(1/ε)-competitive with respect to the total job flowtime.

We will prove Theorem 1 by adopting the online dual fitting approach.
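The ingredients of this formulation can be checked numerically for a candidate schedule. Below is a sketch, assuming a hypothetical encoding in which each job's trajectory is a list of (t_start, t_end, value) checkpointing intervals, the value being either the precomputed g(r_j^k, t_j^k − t_j^{k−1}) or the number of copies r_j^k:

```python
def speedup(intervals, t):
    """h_j(t_j, r_j, t): intervals is a list of (lo, hi, g_val) with
    g_val = g(r^k, hi - lo); returns g_val / (hi - lo) on (lo, hi]."""
    for lo, hi, g_val in intervals:
        if lo < t <= hi:
            return g_val / (hi - lo)
    return 0.0

def completes(intervals, p_j, tol=1e-9):
    """Constraint (a): the integral of h_j over the service period,
    i.e. the sum of the per-interval g-values, equals the size p_j."""
    return abs(sum(g for _, _, g in intervals) - p_j) < tol

def resource_feasible(jobs, copies, M, times):
    """Constraint (b): at each probe time t, the total number of
    redundant copies over all jobs is at most M; copies[j] lists the
    (lo, hi, r) checkpointing intervals of job j."""
    for t in times:
        load = sum(r for j in jobs for lo, hi, r in copies[j] if lo < t <= hi)
        if load > M:
            return False
    return True
```

For instance, a job with intervals [(0, 2, 3.0), (2, 5, 6.0)] has speedup 1.5 then 2.0 and completes a workload of p_j = 9, illustrating how (6)–(8) tie the trajectory to the job size.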
The first step is to formulate a minimization problem which serves as an approximation to the optimal cost, OPT, with a guarantee that the cost of the approximation is within a constant factor of OPT. We then formulate the dual problem of the approximation and exploit the fact that a feasible solution to this dual problem gives a lower bound on its cost, which in turn is a constant times the cost of the proposed algorithm.

Algorithm 1: SRPT+R Algorithm

while a job arrives at or departs from the system, at time t, do
    Sort the jobs such that p_1(t) ≤ p_2(t) ≤ ... ≤ p_{n(t)}(t) and count the number of redundant copies being executed for each job j, r_j;
    Initialize M(t) to be the set of idle machines;
    if n(t) < M then
        for j = 1, 2, ..., n(t) do
            if j = 1 then
                r_j(t) = M − (n(t) − 1) · ⌊M/n(t)⌋;
            else
                r_j(t) = ⌊M/n(t)⌋;
            Checkpoint job j and assign its redundant executions to r_j(t) machines which are chosen uniformly at random from {1, 2, ..., M};
    if n(t) ≥ M then
        for j = 1, 2, ..., n(t) do
            if j ≤ M then
                Checkpoint job j and assign it to one machine which is chosen uniformly at random from {1, 2, ..., M};
            else
                Checkpoint job j;

Remark 4. It is worth noting that, when there is no machine service variability, SRPT+R performs exactly the same as the traditional SRPT algorithm on multiple machines. As a result, our proposed dual fitting framework also shows that SRPT is (1 + ε)-speed, (3 + ε)-competitive with respect to the overall job flowtime. When given small resource augmentation, where ε ≤ 1, our result improves the recent result in [19], which states that SRPT on multiple identical machines is (1 + ε)-speed, 4/ε-competitive in terms of the overall job flowtime.
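The copy-assignment rule of Algorithm 1 can be sketched as follows (a simplified, hypothetical helper that returns only the number of copies per job, ignoring the random machine placement and checkpoint bookkeeping):

```python
def srpt_r_copies(remaining, M):
    """Copies per active job under SRPT+R at an arrival/departure event.

    remaining: remaining sizes of the active jobs; M: number of servers.
    If n < M, every job gets floor(M/n) copies and the job with the
    smallest remaining size absorbs the leftover machines; if n >= M,
    only the M smallest jobs run, with one machine each.
    """
    n = len(remaining)
    copies = [0] * n
    if n == 0:
        return copies
    order = sorted(range(n), key=lambda j: remaining[j])
    if n < M:
        base = M // n
        for rank, j in enumerate(order):
            copies[j] = M - (n - 1) * base if rank == 0 else base
    else:
        for rank, j in enumerate(order):
            copies[j] = 1 if rank < M else 0
    return copies
```

With M = 8 and remaining sizes [5, 2, 9], the smallest job receives 8 − 2·2 = 4 copies and the other two receive 2 each, so all M machines stay busy whenever any job is active.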
To prove Theorem 1, we shall first both approximate the objective of OPT and relax Constraint (d) in OPT to obtain the following problem P1:

min_y Σ_{j=1}^N ∫_{a_j}^∞ ((t − a_j + 2p_j)/p_j) · h_j(t_j, r_j, t) dt    (P1)
s.t. ∫_{a_j}^∞ h_j(t_j, r_j, t) dt ≥ p_j, ∀j,
     Σ_{j: a_j ≤ t} Σ_{k=1}^{L_j} r_j^k · 1(t ∈ (t_j^{k−1}, t_j^k]) ≤ M, ∀t,
     L_j ∈ {1, 2, ..., 2N − 1}, t_j ∈ T_j^{L_j+1}, r_j ∈ N^{L_j}, ∀j.

Let OPT denote the cost, i.e., the overall job flowtime, achieved by an optimal scheduling policy. The following lemma guarantees that the optimal cost of P1, denoted by P̄_1, is not far from OPT.

Lemma 3. P̄_1 is upper bounded by (1 + 2Δ) · OPT, i.e., P̄_1 ≤ (1 + 2Δ) · OPT.

Let α_j and β(t) denote the Lagrangian dual variables corresponding to the first and second constraints in P1, respectively. Define α = (α_j | j = 1, 2, ..., N) and β = (β(t) | t ∈ R_+). The Lagrangian function associated with P1 can be written as:

Φ(y, α, β) = Σ_{j=1}^N ∫_{a_j}^∞ ((t − a_j + 2p_j)/p_j) · h_j(t_j, r_j, t) dt
  + ∫_0^∞ β(t) ( Σ_{j: a_j ≤ t} Σ_{k=1}^{L_j} r_j^k · 1(t ∈ (t_j^{k−1}, t_j^k]) − M ) dt
  − Σ_{j=1}^N α_j ( ∫_{a_j}^∞ h_j(t_j, r_j, t) dt − p_j ),

with the dual problem of P1 given by:

max_{α ≥ 0, β ≥ 0} min_y Φ(y, α, β)    (D1)
s.t. L_j ∈ {1, 2, ..., 2N − 1}, r_j ∈ N^{L_j}, t_j ∈ T_j^{L_j+1}.

Applying weak duality theory for continuous programs [40], we can conclude that the optimal value of D1 is a lower bound for P̄_1. Moreover, the objective of D1 can be reformulated by collecting terms, as shown below. Still, it is difficult to solve D1, as it involves the minimization of a complex objective function of integer-valued variables.
Collecting terms, Φ(y, α, β) can be rewritten as:

Φ(y, α, β) = Σ_j α_j p_j − M ∫_0^∞ β(t) dt
  + ∫_0^∞ Σ_{j: a_j ≤ t} [ ((t − a_j)/p_j + 2 − α_j) · h_j(t_j, r_j, t) + β(t) Σ_{k=1}^{L_j} r_j^k · 1(t ∈ (t_j^{k−1}, t_j^k]) ] dt.

Moreover, it follows from Lemma 2 that r_j^k ≥ 1(t ∈ (t_j^{k−1}, t_j^k]) · h_j(t_j, r_j, t) for all j and t ≥ a_j; thus, we have that

Σ_{k=1}^{L_j} r_j^k · 1(t ∈ (t_j^{k−1}, t_j^k]) ≥ Σ_{k=1}^{L_j} 1(t ∈ (t_j^{k−1}, t_j^k]) · h_j(t_j, r_j, t) = h_j(t_j, r_j, t).

Therefore, the last term on the R.H.S. of Φ(y, α, β) is lower bounded by:

∫_0^∞ Σ_{j: a_j ≤ t} ((t − a_j)/p_j + 2 − α_j + β(t)) · h_j(t_j, r_j, t) dt.

As a result, for fixed α_j and β(t) such that, for all t ≥ a_j,

(t − a_j)/p_j + 2 − α_j + β(t) ≥ 0,

the minimum of Φ(y, α, β) can be attained by setting all r_j^k to 1 and t_j = (a_j, c_j). In this solution, there are no checkpoints for job j other than the job arrival and departure. Therefore, restricting α and β to satisfy the above condition gives a lower bound on D1 and results in the following optimization problem:

max_{α, β} Σ_j α_j p_j − M ∫_0^∞ β(t) dt    (P2)
s.t. α_j − β(t) ≤ (t − a_j)/p_j + 2, ∀j, t ≥ a_j,
     α_j ≥ 0, ∀j,
     β(t) ≥ 0, ∀t.

Based on Lemma 3, we conclude that P̄_2 ≤ P̄_1 ≤ (1 + 2Δ) · OPT, where P̄_2 is the optimal cost of P2.

Next, we shall find a setting of the dual variables in P2 such that the corresponding objective is lower bounded by Ω(ε) · SR under a (1 + ε)-speed resource augmentation. To achieve this, we first consider a pure SRPT scheduling process that does not exploit job redundancy. We then use this to motivate a setting of dual variables which is feasible for P2.
Finally, we show that the objective for this setting of dual variables is at least $O(\epsilon)$ times the cost of SRPT, which is in turn lower bounded by $O(\epsilon) \cdot SR$, since the cost of SRPT is no smaller than $SR$.

Observe that SRPT+R and SRPT only differ when $n(t) < M$, in which case SRPT assigns only a single machine to each active job. Since SRPT+R maintains the same scheduling order and schedules each job on at least as many machines as SRPT, we conclude that $SR \le SRPT$, where $SR$ denotes the overall job flowtime achieved by SRPT+R and $SRPT$ denotes the cost of SRPT. In this section, we let $n(t)$ and $p_j(t)$ denote the number of active jobs and the remaining workload of job $j$ under SRPT, respectively.

Let $\Theta_j = \{k : a_k \le a_j \le c_k\}$ be the set of jobs that are active when job $j$ arrives, and let $A_j = \{k \ne j : k \in \Theta_j \text{ and } p_k(a_j) \le p_j\}$, i.e., the jobs whose residual processing time upon job $j$'s arrival is at most job $j$'s processing requirement. Define $\rho_j = |A_j|$. We shall set the dual variables as follows:

$$\alpha_j = \frac{1}{(1+\epsilon) p_j} \sum_{k=1}^{\rho_j} \Big( \Big\lfloor \frac{n(a_j) - k}{M} \Big\rfloor - \Big\lfloor \frac{n(a_j) - k - 1}{M} \Big\rfloor \Big) p_k(a_j) + \frac{1}{1+\epsilon} \Big( \Big\lfloor \frac{n(a_j) - \rho_j - 1}{M} \Big\rfloor + 1 \Big), \tag{9}$$

where $\epsilon > 0$, and

$$\beta(t) = \frac{1}{(1+\epsilon) M}\, n(t). \tag{10}$$

We show in the following lemma that this setting of dual variables is feasible.

Fig. 2. The scheduling process of SRPT at time $a_j$, where $n(a_j) = zM + q$ and there are no further job arrivals after $a_j$. Jobs are sorted based on remaining size, denoted by $r_j$ for job $j$, i.e., $r_j = p_j(a_j)$. Jobs indexed by $kM + i$, for integer-valued $k$ and $i$, are assigned to machine $i$.

Lemma 4. The setting of dual variables in (9) and (10) is feasible for P2.

Proof.
Since $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are both nonnegative, it only remains to show that $\alpha_j - \beta(t) \le \frac{t - a_j}{p_j} + 2$ for all $j$ and $t \ge a_j$. First, $\alpha_j$ can be represented as follows:

$$\alpha_j = \frac{\sum_{k=0}^{z} p_{kM+q}(a_j)\, \mathbb{1}(kM + q \le \rho_j)}{(1+\epsilon)\, p_j} + \frac{\big\lfloor \frac{n(a_j) - \rho_j - 1}{M} \big\rfloor + 1}{1+\epsilon}. \tag{11}$$

For ease of illustration, let $\Omega_1$ and $\Omega_2$ denote the two terms on the R.H.S. of (11), respectively. If $n(a_j) \le M$, we have $\alpha_j = \frac{1}{1+\epsilon}$ and the result follows. Therefore, we only consider $n(a_j) = zM + q > M$ and analyze the following three cases:

Case I: All the jobs in $\Theta_j$ have completed at time $t$. As depicted in Fig. 2, if there are no job arrivals after time $a_j$, then the jobs indexed by $kM + q$, for nonnegative integers $k$, are all processed on Machine $q$. Since the service capacity of Machine $q$ during $(a_j, t]$ is $(1+\epsilon)(t - a_j)$ under the speed augmentation, it follows that

$$t - a_j \ge \frac{1}{1+\epsilon} \sum_{k=0}^{z} p_{kM+q}(a_j). \tag{12}$$

If instead there are further job arrivals after time $a_j$, Machine $q$ needs to process an amount of work which exceeds $\sum_{k=0}^{z} p_{kM+q}(a_j)$, so (12) still holds. Thus, we have

$$\frac{t - a_j}{p_j} - \Omega_1 \ge \frac{\sum_{k=0}^{z} p_{kM+q}(a_j)\, \mathbb{1}(kM + q \ge \rho_j + 1)}{(1+\epsilon)\, p_j} \ge \frac{\sum_{k=0}^{z} \mathbb{1}(kM + q \ge \rho_j + 1)}{1+\epsilon} = \Omega_2. \tag{13}$$

Case II: The jobs indexed from 1 to $\kappa$ in $\Theta_j$ have completed and $\kappa \le \rho_j$. Let $\kappa = z_1 M + q_1$. Similar to Case I, it follows that

$$t - a_j \ge \frac{1}{1+\epsilon} \sum_{k=0}^{z_1} p_{kM+q_1}(a_j). \tag{14}$$

In addition, the number of active jobs, $n(t)$, is no less than $n(a_j) - \kappa$.
Therefore, we have:

$$\alpha_j \overset{(ii)}{\le} \frac{\sum_{k=0}^{z_1} p_{kM+q_1}(a_j)}{(1+\epsilon)\, p_j} + \frac{1}{1+\epsilon} \Big\lceil \frac{\rho_j - \kappa}{M} \Big\rceil + \frac{\big\lfloor \frac{n(a_j) - \rho_j - 1}{M} \big\rfloor + 1}{1+\epsilon} \overset{(iii)}{\le} \frac{t - a_j}{p_j} + \frac{1}{1+\epsilon} \Big\lceil \frac{\rho_j - \kappa}{M} \Big\rceil + \frac{\big\lfloor \frac{n(a_j) - \rho_j - 1}{M} \big\rfloor + 1}{1+\epsilon} \le \frac{t - a_j}{p_j} + \frac{1}{1+\epsilon} \Big( \Big\lfloor \frac{n(a_j) - \kappa}{M} \Big\rfloor + 2 \Big) \le \frac{t - a_j}{p_j} + \beta(t) + 2, \tag{15}$$

where $\lceil x \rceil$ denotes the smallest integer no less than $x$; $(ii)$ holds because $\Omega_1 \le \frac{1}{(1+\epsilon) p_j} \sum_{k=0}^{z_1} p_{kM+q_1}(a_j) + \frac{1}{1+\epsilon} \big\lceil \frac{\rho_j - \kappa}{M} \big\rceil$, and $(iii)$ is due to (14). The final inequality uses $n(t) \ge n(a_j) - \kappa$ together with (10).

Case III: The jobs indexed from 1 to $\kappa$ in $\Theta_j$ have completed and $\kappa > \rho_j$. In this case, (14) still holds. Moreover, we have $\sum_{k=0}^{z_1} p_{kM+q_1}(a_j) \ge (1+\epsilon)\, p_j\, \Omega_1 + \big\lfloor \frac{\kappa - \rho_j}{M} \big\rfloor p_j$. Therefore, it follows that:

$$\alpha_j \le \frac{t - a_j}{p_j} - \frac{1}{1+\epsilon} \Big\lfloor \frac{\kappa - \rho_j}{M} \Big\rfloor + \frac{1}{1+\epsilon} \Big\lceil \frac{n(a_j) - \rho_j}{M} \Big\rceil \le \frac{t - a_j}{p_j} + \frac{1}{1+\epsilon} \Big( \Big\lfloor \frac{n(a_j) - \kappa}{M} \Big\rfloor + 2 \Big) \le \frac{t - a_j}{p_j} + \beta(t) + 2. \tag{16}$$

Thus, we conclude that in all three cases above the constraint between $\alpha_j$ and $\beta(t)$ is satisfied.

To bound the cost of the dual variables set in (9) and (10), we first show the following lemma, which quantifies the total job flowtime under SRPT in the transient case where there are no job arrivals after time $t$.

Lemma 5. When there are no job arrivals after time $t$, the overall remaining job flowtime under SRPT scheduling, $F(t)$, is given by:

$$F(t) = \sum_{j=1}^{n(t)} \Big( \Big\lfloor \frac{n(t) - j}{M} \Big\rfloor + 1 \Big) p_j(t). \tag{17}$$

Proof. In this proof, we shall not assume resource augmentation. Let $f_j(t)$ denote the remaining flowtime for job $j$ at time $t$.
Thus, the completion time of job $j$ is $c_j = f_j(t) + t$. Since jobs are indexed such that $p_1(t) \le p_2(t) \le \cdots \le p_{n(t)}(t)$, under SRPT it follows that $c_1 \le c_2 \le \cdots \le c_{n(t)}$. When $n(t) \le M$, (17) follows immediately, since all jobs can be scheduled simultaneously and $f_j(t) = p_j(t)$.

Let us then consider the case where $n(t) > M$. Let $n(t) = zM + q$, where $z \ge 1$, $0 \le q \le M-1$, and $z, q$ are nonnegative integers. We first show that for all $k$ such that $M \le k \le n(t)$, the following result holds:

$$\sum_{j=k-M+1}^{k} f_j(t) = \sum_{j=1}^{k} p_j(t). \tag{18}$$

As illustrated in Fig. 3, at any time between $t$ and $c_1$, there are $(k-M)$ jobs waiting to be processed among the $k$ jobs which complete first. Hence, the accumulated waiting time in this period is $(k-M)\, f_1(t)$. Similarly, at any time between $c_1$ and $c_2$, there are $(k-M-1)$ jobs waiting to be processed, and they contribute $(k-M-1) \cdot (c_2 - c_1) = (k-M-1) \cdot (f_2(t) - f_1(t))$ waiting time. Hence, the total waiting time of the $k$ jobs is given by:

$$\sum_{j=0}^{k-M-1} (k-M-j) \cdot \big( f_{j+1}(t) - f_j(t) \big) = \sum_{j=1}^{k-M} f_j(t), \tag{19}$$

where $f_0(t) = 0$. Therefore, the total remaining flowtime for these $k$ jobs is as follows:

$$\sum_{j=1}^{k} f_j(t) = \sum_{j=1}^{k} p_j(t) + \sum_{j=1}^{k-M} f_j(t). \tag{20}$$

By shifting terms in (20), we have $\sum_{j=k-M+1}^{k} f_j(t) = \sum_{j=1}^{k} p_j(t)$. Summing up all job flowtimes, it follows that:

$$\sum_{j=1}^{n(t)} f_j(t) = \sum_{j=1}^{q} f_j(t) + \sum_{k=1}^{z} \sum_{j=(k-1)M+q+1}^{kM+q} f_j(t) \overset{(i)}{=} \sum_{j=1}^{q} p_j(t) + \sum_{k=1}^{z} \sum_{j=1}^{kM+q} p_j(t) = \sum_{j=1}^{n(t)} \Big( \Big\lfloor \frac{n(t) - j}{M} \Big\rfloor + 1 \Big) p_j(t), \tag{21}$$

where, on the R.H.S. of $(i)$, the first term holds because the flowtime of the first $q$ jobs is equal to their remaining job size, and the second term holds because $\sum_{j=(k-1)M+q+1}^{kM+q} f_j(t) = \sum_{j=1}^{kM+q} p_j(t)$ by (18).
This completes the proof.

Based on Lemma 5, if job $j$ never arrived to the system and the subsequent jobs did not enter the system, the overall remaining job flowtime at time $a_j$ would be:

$$F'_j(a_j) = \sum_{k=1}^{n(a_j)-1} \Big( \Big\lfloor \frac{n(a_j) - 1 - k}{M} \Big\rfloor + 1 \Big) p_k(a_j). \tag{22}$$

In contrast, when job $j$ arrives and the subsequent jobs do not enter the system, the overall remaining job flowtime at time $a_j$ is as follows:

$$F_j(a_j) = \sum_{k=1}^{\rho_j} \Big( \Big\lfloor \frac{n(a_j) - k}{M} \Big\rfloor + 1 \Big) p_k(a_j) + \Big( \Big\lfloor \frac{n(a_j) - \rho_j - 1}{M} \Big\rfloor + 1 \Big) p_j + \sum_{k=\rho_j+1}^{n(a_j)-1} \Big( \Big\lfloor \frac{n(a_j) - 1 - k}{M} \Big\rfloor + 1 \Big) p_k(a_j). \tag{23}$$

Therefore, one can view $\alpha_j$ as the incremental increase in the overall job flowtime caused by the arrival of job $j$: taking the difference of (23) and (22) and dividing by $(1+\epsilon) p_j$ recovers (9). Since we are using a $(1+\epsilon)$-speed resource augmentation, $\sum_j \alpha_j p_j$ exactly characterizes the overall job flowtime under SRPT, i.e., $\sum_j \alpha_j p_j = SRPT$. Moreover, $\beta(t)$ reflects the loading condition of the cluster; in our setting, $M \int_0^{\infty} \beta(t)\, dt = \frac{1}{1+\epsilon} SRPT$. Therefore, we have

$$\sum_j \alpha_j p_j - M \int_0^{\infty} \beta(t)\, dt = \frac{\epsilon}{1+\epsilon}\, SRPT.$$

Based on Lemma 3, we conclude that

$$\frac{\epsilon}{1+\epsilon}\, SR \le \frac{\epsilon}{1+\epsilon}\, SRPT \le \bar{P}_2 \le \bar{P}_1 \le c_\Delta \cdot OPT.$$

This implies $SR \le O(\frac{1}{\epsilon})\, OPT$ and completes the proof of Theorem 1.

Fig. 3. The number of jobs waiting to be processed in different time periods, where $k > M$.

5 Algorithm Design for Multitasking Processors

In this section, we design scheduling algorithms for clusters supporting multitasking.
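As a numerical sanity check of the transient flowtime formula (17) from the preceding analysis, one can simulate SRPT directly with unit-speed machines and no arrivals and compare against the closed form; the following is an illustrative sketch, not the paper's code.

```python
def srpt_total_flowtime(sizes, M):
    """Simulate SRPT with M unit-speed machines and no further arrivals;
    return the total flowtime (sum of completion times measured from t)."""
    rem = sorted(float(s) for s in sizes)
    now = total = 0.0
    while rem:
        dt = rem[0]                                # time until next completion
        now += dt
        rem = [r - dt for r in rem[:M]] + rem[M:]  # the M shortest jobs run
        done = sum(1 for r in rem if r <= 1e-9)
        total += now * done                        # each finished job's flowtime
        rem = [r for r in rem if r > 1e-9]
    return total

def lemma5_formula(sizes, M):
    """Closed form (17): F(t) = sum_j (floor((n - j)/M) + 1) * p_j(t)."""
    p = sorted(sizes)
    n = len(p)
    return sum(((n - j) // M + 1) * p[j - 1] for j in range(1, n + 1))
```

On any instance, the simulated total flowtime and the closed form agree, which is exactly the content of Lemma 5.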
Besides checkpointing times and levels of redundancy, one must introduce additional variables $\mathbf{x} = (\mathbf{x}_j : j = 1, 2, \cdots, N)$, where $\mathbf{x}_j = (x_j^k \mid k = 1, 2, \cdots, L_j)$ are the fractions of resource shares to be allocated to each job during checkpointing intervals. To be specific, we first design the Fair+R Algorithm, which is an extension of the Fair Scheduler. Fair+R allows all jobs in the cluster to (near) equally share resources in the cluster, with priority given to those which arrived most recently. We then generalize Fair+R to design the LAPS+R(β) algorithm, which is an extension of LAPS (Latest Arrival Processor Sharing). The main idea of LAPS is to share resources only among a certain fraction of the jobs in the cluster [18]. However, the original version of LAPS only considers speed scaling across different jobs; our proposed LAPS+R(β) Algorithm extends it so that redundant copies of jobs can be made dynamically. In this section, we assume without loss of generality that jobs have been ordered such that $a_1 \le a_2 \le \cdots \le a_{n(t)}$.

Let $n(t) = kM + l$ denote the number of jobs which are active in the cluster at time $t$. At a high level, Fair+R works as follows. When $n(t) \ge M$, the $kM$ jobs which arrived most recently, i.e., the jobs indexed from $(l+1)$ to $n(t)$, are each assigned to one server and get a resource share of $\frac{1}{k}$; each server processes $k$ jobs simultaneously. By contrast, if $n(t) < M$, the latest-arriving job, i.e., Job $n(t)$, is scheduled on $M - \lfloor \frac{M}{n(t)} \rfloor (n(t) - 1)$ machines, and each of the others is scheduled on $\lfloor \frac{M}{n(t)} \rfloor$ machines. In this case, there is no multitasking. The corresponding pseudo-code is exhibited in the panel named Algorithm 2. Our main result for Fair+R is given in the following theorem:

Theorem 2. Fair+R is $(4+\epsilon)$-speed, $O(\frac{1}{\epsilon})$-competitive with respect to the total job flowtime.

Paralleling the proof of Theorem 1, we adopt the dual-fitting approach to prove Theorem 2.
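The Fair+R allocation rule described above can be sketched as follows; this is an illustrative Python version (not the paper's implementation) that returns the number of redundant copies $r_j$ and the resource share $x_j$ for each job, with job $n$ the latest arrival.

```python
def fair_r_allocation(n, M):
    """Fair+R allocation sketch for n active jobs on M machines. Returns
    (r, x) lists for jobs 1..n, where r[j] is the number of machine copies
    and x[j] the per-machine resource share; job n is the latest arrival."""
    r = [0] * (n + 1)
    x = [0.0] * (n + 1)
    if n >= M:
        k, l = divmod(n, M)               # n(t) = k*M + l
        for j in range(l + 1, n + 1):     # the k*M most recent jobs
            r[j] = 1
            x[j] = 1.0 / k                # each server multitasks k jobs
    else:                                 # no multitasking: split the machines
        base = M // n
        for j in range(1, n):
            r[j] = base
            x[j] = 1.0
        r[n] = M - base * (n - 1)         # latest job takes the remainder
        x[n] = 1.0
    return r[1:], x[1:]
```

Note that in both branches the total machine usage $\sum_j x_j r_j$ equals $M$, matching the capacity constraint.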
Let $\mathbf{z}_j = (\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, L_j)$ and $\mathbf{z} = (\mathbf{z}_j \mid j = 1, 2, \cdots, N)$.

Algorithm 2: Fair+R Algorithm
while a job arrives to or departs from the system do
  Sort the jobs such that $a_1 \le a_2 \le \cdots \le a_{n(t)}$;
  Compute $n(t) = kM + l$;
  if $n(t) \ge M$ then
    for $j = l+1, l+2, \cdots, n(t)$ do
      $r_j(t) = 1$ and $x_j(t) = 1/k$;
  else
    $r_{n(t)}(t) = M - \lfloor \frac{M}{n(t)} \rfloor (n(t)-1)$ and $x_{n(t)}(t) = 1$;
    for $j = 1, 2, \cdots, n(t)-1$ do
      $r_j(t) = \lfloor \frac{M}{n(t)} \rfloor$ and $x_j(t) = 1$;
  Checkpoint all jobs and assign job $j$'s redundant executions to $r_j(t)$ machines chosen uniformly at random from $\{1, 2, \cdots, M\}$, each with a resource share of $x_j(t)$;

We first formulate an approximate optimization problem as follows:

$$\min_{\mathbf{z}} \; \sum_{j=1}^{N} \int_{a_j}^{\infty} \frac{t - a_j + p_j}{p_j}\, \tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)\, dt \tag{P3}$$

$$\text{s.t.} \quad \int_{a_j}^{\infty} \tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)\, dt \ge p_j, \;\forall j,$$

$$\sum_{j: a_j \le t} \sum_k x_j^k r_j^k \cdot \mathbb{1}\big(t \in (t_j^{k-1}, t_j^k]\big) \le M, \;\forall t,$$

$$L_j \in \{1, 2, \cdots, N-1\}, \; \mathbf{t}_j \in \mathcal{T}_j^{L_j+1}, \; \mathbf{r}_j \in \mathbb{N}^{L_j}, \;\forall j,$$

$$0 < x_j^k \le 1, \;\forall j, \; 1 \le k \le L_j.$$

Observe that P3 and P1 differ in both the objective and the second constraint, since job $j$ gets a resource share of $x_j^k r_j^k$ when $t \in (t_j^{k-1}, t_j^k]$. The dual problem associated with P3 is similar to that of P1, and we only need to modify the first constraint of P2 to yield the following inequality:

$$\alpha_j - \beta(t) \le \frac{t - a_j}{p_j} + 1, \;\forall j, \; t \ge a_j. \tag{24}$$

Paralleling Lemma 3, we have $\bar{P} \le \bar{P}_3 \le c_\Delta \cdot OPT$, where $\bar{P}$ and $\bar{P}_3$ are the optimal values of the dual problem and of P3, respectively. Denote by $A(t)$ the set which contains all jobs that are still active in the cluster at time $t$ under Fair+R. Thus, $n(t) = |A(t)|$.
We shall set $\alpha_j$ as follows:

$$\alpha_j = \int_{a_j}^{c_j} \alpha_j(\tau)\, d\tau, \tag{25}$$

where

$$\alpha_j(t) = \frac{\sum_{k: a_k \le a_j} \mathbb{1}(k \in A(t))\, \mathbb{1}(n(t) \ge M)\, \tilde{h}_k(\mathbf{t}_k, \mathbf{x}_k, \mathbf{r}_k, t)}{(4+\epsilon)\, M\, p_j} + \frac{\mathbb{1}(n(t) < M)\, \tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)}{4(4+\epsilon)\, p_j}, \tag{26}$$

and the setting of $\beta(t)$ is given by:

$$\beta(t) = \frac{1}{(4+\epsilon) M}\, n(t). \tag{27}$$

Next, we proceed to check the feasibility of these dual variables. Observe that $\alpha_j$ and $\beta(t)$ are nonnegative for all $j, t$; thus we only need to show that they satisfy (24).

Lemma 6. The dual variable settings in (25) and (27) satisfy the constraint in (24).

Lemma 7. Under the choice of dual variables in (25) and (27), $\sum_{j=1}^{N} \alpha_j p_j - M \int_0^{\infty} \beta(t)\, dt \ge \frac{\epsilon}{4(4+\epsilon)}\, FR$, where $FR$ is the cost of Fair+R.

Lemma 7 implies that $FR \le \frac{4(4+\epsilon)}{\epsilon}\, \bar{P} \le \frac{4(4+\epsilon)}{\epsilon} \cdot c_\Delta \cdot OPT = O(\frac{1}{\epsilon})\, OPT$. This completes the proof of Theorem 2.

The LAPS+R(β) Algorithm and its performance guarantee

The algorithm depends on a parameter $\beta \in (0, 1]$. Say $\beta = 1/2$; then the algorithm essentially schedules the $n(t)/2$ most recently arrived jobs. If there are fewer than $M$ such jobs, each is assigned a roughly equal number of servers for execution without multitasking; otherwise, each of these jobs roughly gets a share of $\frac{2M}{n(t)}$ on some machine. For a given number of active jobs $n(t)$ and parameter $\beta$, define $z \in \mathbb{N}$, $\alpha \in \{0, 1, \cdots, M-1\}$ and $\gamma \in [0, 1)$ such that $\beta n(t) = zM + \alpha + \gamma$.

The LAPS+R(β) Algorithm operates as follows. At time $t$, if $z = 0$, the jobs indexed from $(n(t)-\alpha)$ to $(n(t)-1)$ are scheduled on $\lfloor \frac{M}{\alpha+1} \rfloor$ machines each, and Job $n(t)$ is scheduled on the remaining $(M - \alpha \lfloor \frac{M}{\alpha+1} \rfloor)$ machines. In this case, there is no multitasking.
By contrast, if $z \ge 1$, the jobs indexed from $(n(t)-zM-\alpha)$ to $(n(t)-1)$ are each assigned a single machine and get a resource share of $\frac{1}{z+1}$, and Job $n(t)$ is scheduled on $(M-\alpha)$ machines with a $\frac{1}{z+1}$ share of each machine's resources. The corresponding pseudo-code is exhibited as Algorithm 3 in the panel below.

Performance guarantees for LAPS+R(β) and our techniques

Let $OPT$ and $LR$ denote the cost of the optimal scheduling policy and of LAPS+R(β), respectively. The main result in this section, characterizing the competitive performance of LAPS+R(β), is given in the following theorem:

Theorem 3. LAPS+R(β) is $(2+2\beta+2\epsilon)$-speed, $O(\frac{1}{\beta\epsilon})$-competitive with respect to the total job flowtime.

The dual-fitting approach fails in this setting, so we adopt the use of a potential function, which is widely used to derive performance bounds with resource augmentation for online parallel scheduling algorithms, e.g., [18], [29]. The main idea of this method is to find a potential function which couples the optimal schedule and LAPS+R(β). To be specific, let $LR(t)$ and $OPT(t)$ denote the accumulated job flowtime under LAPS+R(β) with a $(2+2\beta+2\epsilon)$-speed resource augmentation and under the optimal schedule, respectively. We define a potential function $\Lambda(t)$ which satisfies the following properties:

1) Boundary Condition: $\Lambda(0) = \Lambda(\infty) = 0$.
Algorithm 3: LAPS+R(β) Algorithm
while a job arrives at or departs from the system do
  Sort the jobs such that $a_1 \le a_2 \le \cdots \le a_{n(t)}$;
  Compute $\beta n(t) = zM + \alpha + \gamma$, where $\gamma \in [0, 1)$ and $\alpha < M$;
  if $z \ge 1$ then
    $r_{n(t)}(t) = M - \alpha$ and $x_{n(t)}(t) = \frac{1}{z+1}$;
    for $j = n(t)-zM-\alpha, \cdots, n(t)-1$ do
      $r_j(t) = 1$ and $x_j(t) = \frac{1}{z+1}$;
  if $z < 1$ then
    $r_{n(t)}(t) = M - \alpha \lfloor \frac{M}{\alpha+1} \rfloor$ and $x_{n(t)}(t) = 1$;
    for $j = n(t)-\alpha, \cdots, n(t)-1$ do
      $r_j(t) = \lfloor \frac{M}{\alpha+1} \rfloor$ and $x_j(t) = 1$;
  for $j = 1, 2, \cdots, n(t)-zM-\alpha-1$ do
    $x_j(t) = r_j(t) = 0$;
  Checkpoint all jobs and assign job $j$'s redundant executions to $r_j(t)$ machines chosen uniformly at random from $\{1, 2, \cdots, M\}$, each with a resource share of $x_j(t)$;

2) Jumps Condition: the potential function may have jumps only when a job arrives or completes under the LAPS+R(β) schedule, and any such jump must be a decrease.
3) Drift Condition: with a $(2+2\beta+2\epsilon)$-speed resource augmentation, for any time $t$ not corresponding to a jump, and for some constants $c_1, c_2 > 0$, we have that

$$\frac{d\Lambda(t)}{dt} \le -c_1 \epsilon \beta \cdot \frac{dLR(t)}{dt} + c_2 \cdot \frac{dOPT(t)}{dt}. \tag{28}$$

By integrating (28) and accounting for the negative jumps and the boundary condition, one can see that the existence of such a potential function guarantees that $LR \le \frac{c_2}{c_1 \epsilon \beta}\, OPT = O(\frac{1}{\beta\epsilon})\, OPT$ under a $(2+2\beta+2\epsilon)$-speed resource augmentation.

To prove Theorem 3, we shall propose a potential function $\Lambda(t)$ which satisfies all three properties specified above.

The potential function Λ(t)

Consider a checkpointing trajectory for job $j$ under LAPS+R(β) and under the optimal schedule, denoted by $(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j)$ and $(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*)$, respectively. Let $\psi^*(t)$ be the set of jobs that are still active at time $t$ under the optimal schedule, and denote by $\psi(t)$ the set of jobs that are active under LAPS+R(β). Thus, we have $|\psi(t)| = n(t)$.
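Before turning to the potential-function analysis, the allocation step of LAPS+R(β) in Algorithm 3 can be sketched in Python; the decomposition $\beta n(t) = zM + \alpha + \gamma$ is computed with `divmod`, and this function is illustrative rather than the paper's implementation.

```python
def laps_r_allocation(n, M, beta):
    """LAPS+R(beta) allocation sketch: write beta*n = z*M + alpha + gamma and
    share the cluster among the most recent ~beta*n arrivals (job n is the
    latest). Returns per-job copy counts r and per-machine shares x for
    jobs 1..n."""
    z, rest = divmod(beta * n, M)
    z, alpha = int(z), int(rest)          # gamma = rest - alpha lies in [0, 1)
    r = [0] * (n + 1)
    x = [0.0] * (n + 1)
    if z >= 1:
        share = 1.0 / (z + 1)             # each machine multitasks z+1 jobs
        for j in range(n - z * M - alpha, n):
            r[j] = 1
            x[j] = share
        r[n] = M - alpha                  # latest job gets the leftover machines
        x[n] = share
    else:                                 # z = 0: no multitasking
        per = M // (alpha + 1)
        for j in range(n - alpha, n):
            r[j] = per
            x[j] = 1.0
        r[n] = M - alpha * per
        x[n] = 1.0
    return r[1:], x[1:]
```

Jobs older than the $\beta n(t)$ most recent arrivals receive no resources, which is the defining feature of LAPS-style scheduling.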
Further, let $n_j(t)$ denote the number of jobs which are active at time $t$ and arrived no later than job $j$ under LAPS+R(β). Define the cumulative service difference between the two schedules for job $j$ at time $t$, $\pi_j(t)$, as follows:

$$\pi_j(t) = \max\Big[ \int_{a_j}^{t} \tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, \tau)\, d\tau - \int_{a_j}^{t} \tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, \tau)\, d\tau,\; 0 \Big]. \tag{29}$$

Let $\delta = 2 + 2\beta + 2\epsilon$ and define

$$f(n_j(t)) = \begin{cases} 1 & \beta n_j(t) \le M, \\ \frac{M}{\beta n_j(t)} & \text{otherwise}. \end{cases} \tag{30}$$

Note that $f(n_j(t))$ takes the minimum of 1 and $\frac{M}{\beta n_j(t)}$, where the latter is roughly the total resource allocated to job $j$ under LAPS+R(β) if $n_j(t)$ jobs were active at time $t$. Our potential function is given by:

$$\Lambda(t) = \sum_{j \in \psi(t)} \Lambda_j(t), \tag{31}$$

where $\Lambda_j(t)$ is the ratio between (29) and (30), scaled by $\frac{1}{\delta}$, i.e., $\Lambda_j(t) = \frac{\pi_j(t)}{\delta \cdot f(n_j(t))}$.

Changes in Λ(t) caused by job arrivals and departures

Clearly, our potential function satisfies the boundary condition: since each job is completed under LAPS+R(β), $\psi(t)$ will eventually be empty, and $\Lambda(0) = \Lambda(\infty) = 0$. Let us consider possible jump times. When job $j$ arrives to the system at time $a_j$, $\pi_j(a_j) = 0$ and $f(n_k(t))$ does not change for any $k \ne j$. Therefore, we conclude that a job arrival does not change the potential function $\Lambda(t)$. When a job leaves the system under LAPS+R(β), $f(n_j(t))$ can only increase for each job $j$ active at time $t$, leading to a decrease in $\Lambda_j(t)$. As a consequence, job arrivals and departures do not cause any increase in the potential function $\Lambda(t)$; thus, the jumps condition on the potential function is satisfied.

Changes in Λ(t) caused by job processing

Besides job arrivals and departures under LAPS+R(β), there are no other events leading to changes in $f(n_j(t))$; thus, changes in $\Lambda_j(t)$ otherwise depend only on $\pi_j(t)$; see the definition of $\Lambda_j(t)$ in (31).
Specifically, for all $t \notin \{a_j\}_j \cup \{c_j\}_j$, we have that

$$\frac{d\Lambda(t)}{dt} = \sum_{j \in \psi(t)} \frac{d\Lambda_j(t)}{dt} = \sum_{j \in \psi(t)} \frac{d\pi_j(t)/dt}{\delta \cdot f(n_j(t))},$$

where we let $\frac{d\pi_j(t)}{dt} = \lim_{\tau \to 0^+} \frac{\pi_j(t+\tau) - \pi_j(t)}{\tau}$, so that $\frac{d\pi_j(t)}{dt}$ exists for all $t \ge 0$. Moreover, we have:

$$\frac{d\pi_j(t)}{dt} \le \mathbb{1}(j \in \psi^*(t))\, \tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t) - \mathbb{1}(j \notin \psi^*(t))\, \tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t); \tag{32}$$

indeed, either $j \in \psi^*(t)$, so job $j$ has not completed under the optimal policy and the drift is bounded by the first term in (32); or $j \notin \psi^*(t)$, the job has completed under the optimal policy, the difference term in (29) is positive, and its derivative is given by the second term in (32). Therefore, for all $t \notin \{a_j\}_j \cup \{c_j\}_j$, we have the following upper bound:

$$\frac{d\Lambda(t)}{dt} \le \sum_{j \in \psi(t)} \frac{\mathbb{1}(j \in \psi^*(t))\, \tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t)}{\delta \cdot f(n_j(t))} - \sum_{j \in \psi(t)} \frac{\mathbb{1}(j \notin \psi^*(t))\, \tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)}{\delta \cdot f(n_j(t))} \le \underbrace{\sum_{j \in \psi^*(t)} \frac{\tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t)}{\delta \cdot f(n_j(t))}}_{\Gamma^*(t)} \; \underbrace{- \sum_{j \in \psi(t) \setminus \psi^*(t)} \frac{\tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)}{\delta \cdot f(n_j(t))}}_{\Gamma(t)}, \tag{33}$$

where $\psi(t) \setminus \psi^*(t)$ contains all the jobs that are in $\psi(t)$ but not in $\psi^*(t)$. For ease of illustration, let $\Gamma^*(t)$ denote the first term and $\Gamma(t)$ the (negative) second term on the R.H.S. of (33), so that $\frac{d\Lambda(t)}{dt} \le \Gamma^*(t) + \Gamma(t)$. In the sequel, we bound these two terms.

Bounding Γ*(t)

When $\beta n_j(t) \ge M$, we have $f(n_j(t)) = \frac{M}{\beta n_j(t)}$; thus, $\frac{\tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t)}{f(n_j(t))} = \frac{\tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t)}{M / \beta n_j(t)}$.
By contrast, when $\beta n_j(t) \le M$, it follows that $f(n_j(t)) = 1$, so $\frac{\tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t)}{f(n_j(t))} = \tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t)$, which is upper bounded by $\Delta$ based on Lemma 2. Therefore, we have:

$$\frac{\tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t)}{\delta f(n_j(t))} \le \frac{1}{\delta} \Big( \frac{\tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t)}{M / \beta n_j(t)} + \Delta \Big),$$

and

$$\Gamma^*(t) \le \sum_{j \in \psi^*(t)} \frac{1}{\delta} \Big( \frac{\tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t)}{M / \beta n_j(t)} + \Delta \Big) \le \frac{\Delta |\psi^*(t)|}{\delta} + \sum_{j \in \psi^*(t)} \frac{\beta n(t)\, \tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t)}{\delta M} \le \Delta |\psi^*(t)| / \delta + \beta n(t) / \delta, \tag{34}$$

where the last inequality is due to

$$\sum_{j \in \psi^*(t)} \tilde{h}_j(\mathbf{t}_j^*, \mathbf{x}_j^*, \mathbf{r}_j^*, t) \le \sum_{j \in \psi^*(t)} \sum_k x_j^{k*} r_j^{k*}\, \mathbb{1}\big(t \in (t_j^{k-1,*}, t_j^{k,*}]\big) \le M$$

for all $t$.

Bounding Γ(t)

First, $\Gamma(t)$ can be represented as:

$$\Gamma(t) = \sum_{j \in \psi^*(t) \cap \psi(t)} \frac{\tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)}{\delta f(n_j(t))} - \sum_{j \in \psi(t)} \frac{\tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)}{\delta f(n_j(t))}. \tag{35}$$

To obtain an upper bound on $\Gamma(t)$, we consider two cases based on the decomposition $\beta n(t) = zM + \alpha + \gamma$: the case $z = 0$ and the case $z \ge 1$.

Case 1: Suppose $z = 0$; then $\lceil \beta n(t) \rceil \le M$. Since $n_j(t) \le n(t)$ for all $1 \le j \le n(t)$, it follows that $\beta n_j(t) \le M$ and $f(n_j(t)) = 1$, which implies, for all $j \in \psi^*(t) \cap \psi(t)$, that $\frac{\tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)}{\delta f(n_j(t))} \le \Delta$, since $\tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t) \le \delta \Delta$ under the δ-speed resource augmentation. Thus, the first term on the R.H.S. of (35) is upper bounded by:

$$\sum_{j \in \psi^*(t) \cap \psi(t)} \frac{\tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)}{\delta f(n_j(t))} \le |\psi^*(t)|\, \Delta. \tag{36}$$

Consider $j \in \psi(t)$ where $n(t) - \alpha \le j \le n(t)$ and $t \in (t_j^{k-1}, t_j^k]$ for some $k \in \{1, 2, \cdots, L_j\}$.
Then, the number of redundant executions for job $j$ satisfies $r_j^k \ge \lfloor \frac{M}{\alpha+1} \rfloor \ge 1$. Thus, $\tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t) \ge \delta$ and $\frac{\tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)}{\delta f(n_j(t))} \ge 1$. Combining (35) and (36), we then have:

$$\Gamma(t) \le \Delta |\psi^*(t)| - (\alpha + 1) \le \Delta |\psi^*(t)| - \beta n(t). \tag{37}$$

Case 2: Suppose $z \ge 1$; then $\lceil \beta n(t) \rceil > M$ and $\frac{M}{\beta n(t)} \ge \frac{1}{z+1}$. Similarly, we consider job $j \in \psi(t)$ where $n(t) - zM - \alpha \le j \le n(t)$ and $t \in (t_j^{k-1}, t_j^k]$. Based on the scheduling policy of LAPS+R(β), we have $x_j^k = \frac{1}{z+1}$ and $r_j^k \ge 1$. Therefore, $\tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)$ is bounded by:

$$\frac{\delta}{z+1} \le \tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t) \le \frac{\delta \Delta}{z+1}. \tag{38}$$

Moreover, we have $\min[1, \frac{M}{\beta n_j(t)}] \ge \min[1, \frac{M}{\beta n(t)}] \ge \frac{1}{z+1}$. Therefore, it follows that:

$$\frac{1}{z+1} \le \min\Big[1, \frac{M}{\beta n_j(t)}\Big] \le f(n_j(t)) \le \frac{M}{\beta n_j(t)}. \tag{39}$$

Combining (38) and (39), we have that, for all $n(t) - zM - \alpha \le j \le n(t) - 1$:

$$\frac{1/(z+1)}{M / \beta n_j(t)} \le \frac{\tilde{h}_j(\mathbf{t}_j, \mathbf{x}_j, \mathbf{r}_j, t)}{\delta f(n_j(t))} \le \Delta. \tag{40}$$

Substituting (40) into (35), it then follows that:

$$\Gamma(t) \le \sum_{j \in \psi^*(t) \cap \psi(t)} \Delta - \sum_{j = n(t)-zM-\alpha}^{n(t)} \frac{1/(z+1)}{M / \beta n_j(t)} \le \Delta |\psi^*(t)| - \frac{\beta z M \big( n(t) - zM - \alpha \big)}{M (z+1)} \le \Delta |\psi^*(t)| - \beta \Big( \frac{1}{2} - \frac{\beta}{2} \Big) n(t), \tag{41}$$

where the second inequality is due to $n_j(t) = j$, and the last inequality holds because $zM + \alpha \le \beta n(t)$ and $\frac{z}{z+1} \ge \frac{1}{2}$.

Based on Case 1 and Case 2, we have $\Gamma(t) \le \Delta |\psi^*(t)| - \beta (\frac{1}{2} - \frac{\beta}{2}) n(t)$. Thus, combining (33) and (34), we obtain the following upper bound for the drift $\frac{d\Lambda(t)}{dt}$:

$$\frac{d\Lambda(t)}{dt} \le \Gamma^*(t) + \Gamma(t) \le \Delta |\psi^*(t)| / \delta + \beta n(t) / \delta + \Delta |\psi^*(t)| - \beta \Big( \frac{1}{2} - \frac{\beta}{2} \Big) n(t) = \frac{(\delta+1)\Delta}{\delta} |\psi^*(t)| + \frac{\beta \big( 1 - \delta (\frac{1}{2} - \frac{\beta}{2}) \big)}{\delta} n(t) \le \frac{(\delta+1)\Delta}{\delta} |\psi^*(t)| - \frac{\epsilon \beta}{\delta} n(t), \tag{42}$$

where the last inequality follows from the choice of $\delta = 2 + 2\beta + 2\epsilon$, which ensures $\delta (\frac{1}{2} - \frac{\beta}{2}) \ge 1 + \epsilon$. Based on (42), we then have:

$$\Lambda(\infty) - \Lambda(0) \le \int_0^{\infty} \frac{d\Lambda(t)}{dt}\, dt \le \frac{(\delta+1)\Delta}{\delta} \int_0^{\infty} |\psi^*(t)|\, dt - \frac{\epsilon \beta}{\delta} \int_0^{\infty} n(t)\, dt = \frac{(\delta+1)\Delta}{\delta}\, OPT - \frac{\epsilon \beta}{\delta}\, LR, \tag{43}$$

where the first inequality is due to the fact that only negative jumps occur during the evolution of $\Lambda(t)$. This completes the proof of Theorem 3.

Fig. 4. Fluctuation of machine service rates across alternating available and unavailable periods.

Fig. 5. Comparison between algorithms with and without redundancy. Panel (a) shows the CDF of job flowtimes under SRPT+R and SRPT. Panel (b) shows the CDF of job flowtimes under LAPS+R(β) with β = 0.2 and under LAPS.

6 Numerical Studies

In this section, we conduct several numerical studies to evaluate our proposed algorithms in both the multitasking and non-multitasking settings. As pointed out in [33], the Gamma distribution is a good fit for the failure model of most parallel and distributed computing systems. Therefore, we apply the Gamma distribution to generate machine service rates in a cluster with 100 machines over a period lasting 100000 units of time. To be more specific, we categorize the service process of each machine into two classes, namely the Available Period (AP) and the Unavailable Period (UP). As depicted in Fig. 4, each AP is followed by a UP. During an available period, the machine's service rate is uniformly distributed over a high-rate interval with lower endpoint 2. On the other hand, when the machine is processing jobs in an unavailable period, its rate is uniformly distributed over a low-rate interval just above zero.
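An alternating AP/UP service-rate process of the kind just described can be generated as follows; the Gamma shape/scale values and the rate intervals in this sketch are illustrative placeholders, not the paper's fitted values (which come from the trace data in [33]).

```python
import random

def machine_rate_trace(horizon, rng=random.Random(0)):
    """Generate one machine's service-rate trace as (duration, rate) segments,
    alternating Gamma-length available and unavailable periods. All numeric
    parameters below are illustrative placeholders, not the paper's fitted
    values."""
    segs, t, available = [], 0.0, True
    while t < horizon:
        if available:
            dur = rng.gammavariate(0.5, 90.0)   # AP length: assumed shape/scale
            rate = rng.uniform(2.0, 3.0)        # high rate during an AP (assumed)
        else:
            dur = rng.gammavariate(0.5, 40.0)   # UP length: assumed shape/scale
            rate = rng.uniform(0.0, 0.2)        # near-zero rate during a UP (assumed)
        segs.append((min(dur, horizon - t), rate))
        t += dur
        available = not available
    return segs
```

A full cluster simulation would draw one such independent trace per machine and normalize the rates so that the long-run mean service rate is one.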
In addition, we apply the statistics of trace data collected from a computational grid platform (see [33]) to generate a series of available and unavailable periods for each machine independently. The lengths of APs and UPs are Gamma distributed with shape and scale parameters fitted from the trace; in both cases the shape parameter $k$ is below one, with scale parameter $\theta$ on the order of 94 time units for APs and 39 time units for UPs. We also normalize all the distributions such that the mean service rate is one.

In all the following evaluations, we consider time to be slotted, and the scheduling decisions are made at the beginning of each time slot. Jobs arrive at the cluster following a Poisson process with rate $\lambda$, and the workload of each job is Pareto distributed as shown below:

$$\mathbb{P}\{p_j \le x\} = \begin{cases} 1 - (b/x)^{\alpha} & x \ge b, \\ 0 & \text{otherwise}, \end{cases}$$

where $b = 20$ and $\alpha = 2$. It can be readily shown that the mean of the job workload is 40. In the following simulations, we will compare the average as well as the cumulative distribution function (CDF) of job flowtimes for the different algorithms.

Fig. 6. Average job flowtime under different scheduling algorithms with and without redundancy.

Fig. 7. Comparison between different algorithms in terms of average job flowtime for different λ.

In this subsection, we implement scheduling algorithms both with and without redundancy to characterize the benefit of redundant execution. We set the job arrival rate $\lambda$ to one and depict the simulation results in Fig. 5 and Fig. 6. As shown in Fig. 5, a substantially larger fraction of jobs completes within 40 units of time under SRPT+R than under the SRPT scheme. It is worth noting that this result also applies to LAPS+R(β) and LAPS. Moreover, Fig. 6 shows that, with redundancy, the average job flowtime is substantially reduced under all the scheduling algorithms.

We conducted a more comprehensive comparison of the various algorithms by tuning the value of $\lambda$. Following the simulation parameters configured at the beginning of Section 6, we can readily show that $\lambda = 2.5$ reaches the heavy-traffic limit above which the system is overloaded. As such, we tune $\lambda$ from 1 to 2.5 in this simulation. Observe in Fig. 7 that the average job flowtimes under SRPT+R and Fair+R are roughly the same for $\lambda = 1$ and $\lambda = 1.5$. However, as $\lambda$ increases, SRPT+R tends to perform much better than both LAPS+R(β) and Fair+R. For $\lambda = 2$, the average job flowtime under both LAPS+R(β = 0.8) and Fair+R is two times that under SRPT+R. More importantly, when $\lambda$ hits the heavy-traffic limit, the average job flowtime under both LAPS+R(β) and Fair+R increases significantly in $\lambda$, while it does not change much under SRPT+R. In addition, Fair+R outperforms LAPS+R(β) when $\lambda$ is below 2; conversely, when $\lambda$ is above 2, LAPS+R(β) performs better than Fair+R.

Fig. 8. The job flowtime under different β in LAPS+R(β) when λ = 2.5.

Fig. 9. The job flowtime under different β in LAPS+R(β) when λ = 1.

Impact of β in LAPS+R(β)

Since β has a high impact on the performance of LAPS+R(β), in this subsection we tune the value of β to illustrate the performance of LAPS+R(β) under different settings. We depict the comparison results under the heavy-traffic regime, where $\lambda = 2.5$, in Fig. 8. It shows that, as β decreases, the number of jobs with small flowtime (less than 200 units of time) increases.
Therefore, small jobs benefit more than large jobs under a small β, as they have higher priority in the allocation of cluster resources. In addition, there is a value of β at which the average job flowtime attains its minimum.

As illustrated in Fig. 9, when λ = 1, almost all of the jobs in the cluster complete within 200 units of time under the different settings of β. When the job arrival rate is low, jobs with small workloads can obtain large fractions of the shared resource even under a small value of β. In this case, the benefit of redundancy is marginal, and tuning down the value of β does not help small jobs much. However, in terms of the average job flowtime, a smaller β leads to worse performance. The reason is that, under a small value of β in LAPS+R(β), a large job has very little chance to obtain the shared resource; since it takes a long time to complete, this results in a large flowtime. Though the number of large jobs is small, the total job flowtime contributed by those large jobs is significant.

CONCLUSIONS AND FUTURE DIRECTIONS

This paper is an attempt to address the impact of two key sources of variability in parallel computing clusters: job processing times and machine processing rates. Our primary aim and contribution was to introduce a new speedup function to account for redundancy, and to provide a fundamental understanding of how job scheduling and redundant-execution algorithms with a limited number of checkpoints can help mitigate the impact of variability on job response time. As the need to deliver predictable service on shared clusters and computing platforms grows, approaches such as ours will likely be an essential element of any possible solution. Extensions of this work to non-clairvoyant scenarios, the case of jobs with associated task graphs, etc., are likely next steps towards developing the foundational theory and associated algorithms to address this problem.

APPENDIX A: PROOF OF LEMMA

Proof. Consider an optimal solution to OPT, y*, whose corresponding completion time for job j is denoted by c*_j. Thus, for all j = 1, 2, ..., N, y* and c*_j satisfy:

\int_{a_j}^{c_j^*} h_j(t_j^*, r_j^*, t)\, dt = p_j.   (44)

Moreover, it follows that h_j(t_j^*, r_j^*, t) = 0 for all t \ge c_j^*; thus, we have that:

\int_{a_j}^{\infty} h_j(t_j^*, r_j^*, t)\, dt = p_j,   (45)

and it follows that:

\int_{a_j}^{\infty} \frac{t - a_j}{p_j}\, h_j(t_j^*, r_j^*, t)\, dt = \int_{a_j}^{c_j^*} \frac{t - a_j}{p_j}\, h_j(t_j^*, r_j^*, t)\, dt \le \int_{a_j}^{c_j^*} \frac{c_j^* - a_j}{p_j}\, h_j(t_j^*, r_j^*, t)\, dt = c_j^* - a_j.   (46)

Following Lemma 2, it can be readily shown that h_j(t_j^*, r_j^*, t) \le \Delta. Therefore, we have:

p_j = \int_{a_j}^{c_j^*} h_j(t_j^*, r_j^*, t)\, dt \le \Delta (c_j^* - a_j).   (47)

Combining (46) and (47), we have:

\int_{a_j}^{\infty} \frac{t - a_j + 2 p_j}{p_j} \cdot h_j(t_j^*, r_j^*, t)\, dt \le (1 + 2\Delta)(c_j^* - a_j).

Since the optimal solution to OPT must be feasible for P1, it follows that:

P \le \sum_{j=1}^{N} \int_{a_j}^{\infty} \frac{t - a_j + 2 p_j}{p_j} \cdot h_j(t_j^*, r_j^*, t)\, dt \le (1 + 2\Delta) \sum_{j=1}^{N} (c_j^* - a_j) = (1 + 2\Delta)\, OPT.   (48)

This completes the proof.

APPENDIX B: PROOF OF LEMMA

Proof.
First, we have:

\frac{1}{(4+\epsilon) p_j} \int_{a_j}^{c_j} \mathbb{1}(n(t) < M) \cdot \tilde{h}_j(t_j, x_j, r_j, t)\, dt \le \frac{1}{(4+\epsilon) p_j} \int_{a_j}^{c_j} \tilde{h}_j(t_j, x_j, r_j, t)\, dt = \frac{1}{4+\epsilon}.   (49)

Next, we proceed to show that the following result holds:

\frac{1}{(4+\epsilon) M p_j} \int_{a_j}^{c_j} \sum_{k: a_k \le a_j} \mathbb{1}(k \in A(\tau)) \cdot \mathbb{1}(n(\tau) \ge M)\, \tilde{h}_k(t_k, x_k, r_k, \tau)\, d\tau \le \frac{t - a_j}{p_j} + \frac{n(t)}{(4+\epsilon) M}.   (50)

To achieve this, we divide the job set \Psi_j = \{k : a_k \le a_j\} into two separate sets: \Psi_j^1 = \{k : c_k \le t\} \cap \Psi_j and \Psi_j^2 = \{k : c_k > t\} \cap \Psi_j. For the first set, we have:

\frac{1}{M} \int_{a_j}^{c_j} \sum_{k \in \Psi_j^1} \mathbb{1}(k \in A(\tau)) \cdot \mathbb{1}(n(\tau) \ge M) \cdot \tilde{h}_k(t_k, x_k, r_k, \tau)\, d\tau \le \frac{1}{M} \int_{a_j}^{t} \sum_{k \in \Psi_j^1} \mathbb{1}(k \in A(\tau)) \cdot \mathbb{1}(n(\tau) \ge M) \cdot \tilde{h}_k(t_k, x_k, r_k, \tau)\, d\tau.   (51)

Based on the scheduling principle of Fair+R, it follows that:

\sum_{k} \mathbb{1}(k \in A(t)) \cdot \mathbb{1}(n(t) \ge M) \cdot \tilde{h}_k(t_k, x_k, r_k, t) \le (4+\epsilon) M.   (52)

Therefore, the L.H.S. of (51) is upper bounded by (4+\epsilon)(t - a_j). For all jobs in \Psi_j^2, we have:

\int_{a_j}^{c_j} \sum_{k \in \Psi_j^2} \mathbb{1}(k \in A(\tau)) \cdot \mathbb{1}(n(\tau) \ge M) \cdot \tilde{h}_k(t_k, x_k, r_k, \tau)\, d\tau = \sum_{k \in \Psi_j^2} \int_{a_j}^{c_k} \mathbb{1}(k \in A(\tau)) \cdot \mathbb{1}(n(\tau) \ge M) \cdot \tilde{h}_k(t_k, x_k, r_k, \tau)\, d\tau \overset{(ii)}{\le} \sum_{k: a_k \le t \le c_k \le c_j} p_j \le n(t)\, p_j,   (53)

where (ii) is due to the fact that, for any job k that arrives before j, the amount of its work processed in the range [a_j, c_k] is upper bounded by p_j.

Combining all of the inequalities above, the lemma immediately follows. This completes the proof.

APPENDIX C: PROOF OF LEMMA

Proof. First, it can be readily shown that:

M \int_0^{\infty} \beta(t)\, dt = \frac{1}{4+\epsilon} \int_0^{\infty} n(t)\, dt = \frac{RF}{4+\epsilon}.   (54)

Next, we proceed to show that \sum_{j=1}^{N} \alpha_j / p_j \ge RF/4. To achieve this, we consider the following two cases:

Case I: n(t) \ge M.
In this case, it is easy to verify that \alpha_j(t) = 0 for j \le l and \alpha_j(t) = \frac{j-l}{kM}\, p_j for l < j \le n(t). Therefore, it follows that:

\sum_{j=1}^{N} \frac{\alpha_j(t)}{p_j} = \sum_{j=l+1}^{n(t)} \frac{j-l}{kM} = \frac{kM+1}{2} \ge \frac{n(t)}{4}.   (55)

Case II: n(t) < M. In this case, we have \tilde{h}_j(t_j, x_j, r_j, t) \ge \frac{4+\epsilon}{4}, since we are using a resource augmentation of (4+\epsilon)-speed. Hence, the following holds:

\sum_{j=1}^{N} \frac{\alpha_j(t)}{p_j} \ge \sum_{j=1}^{n(t)} \frac{1}{4}\, \mathbb{1}(n(t) < M) = \frac{n(t)}{4}.   (56)

As such, we have:

\sum_{j=1}^{N} \frac{\alpha_j}{p_j} = \sum_{j=1}^{N} \int_{a_j}^{c_j} \frac{\alpha_j(\tau)}{p_j}\, d\tau = \int_0^{\infty} \sum_{j=1}^{N} \frac{\alpha_j(\tau)}{p_j}\, d\tau \ge \int_0^{\infty} \frac{n(\tau)}{4}\, d\tau = \frac{RF}{4}.

This completes the proof.
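As a quick numerical sanity check of the single-job chain (46)–(48) from Appendix A, one can pick any service-rate profile h_j bounded by ∆ and verify the inequality directly. The constant profile below is an illustrative choice of ours, not the paper's model; the specific values a, c, ∆, and h are arbitrary.

```python
# Sanity-check:  integral of (t - a + 2p)/p * h(t) dt over [a, c]
#                <= (1 + 2*Delta) * (c - a),
# for one job with constant rate h(t) = 1.5 on [a, c], where h <= Delta
# and p is the total work processed (the integral of h over [a, c]).
a, c, delta = 0.0, 4.0, 2.0
h = 1.5                      # constant service rate, satisfies h <= delta
p = h * (c - a)              # processed work = 6.0, matching eq. (45)

# Closed form of the left-hand-side integral for constant h:
#   (h/p) * [ (c - a)^2 / 2 + 2*p*(c - a) ]
lhs = (h / p) * ((c - a) ** 2 / 2 + 2 * p * (c - a))
rhs = (1 + 2 * delta) * (c - a)
print(lhs, rhs)              # prints: 14.0 20.0
assert lhs <= rhs            # the bound (48) relies on holds
```

The slack here (14 versus 20) reflects that the bound h_j ≤ ∆ in (47) is loose for this profile; the chain is tight only when the job is served at rate ∆ throughout.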