Communication-Aware Scheduling of Precedence-Constrained Tasks on Related Machines
Yu Su∗, Xiaoqi Ren†, Shai Vardi‡ and Adam Wierman§

∗§ Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA
† Google, Kirkland, WA
‡ Krannert School of Management, Purdue University, West Lafayette, IN

Abstract
Scheduling precedence-constrained tasks is a classical problem that has been studied for more than fifty years. However, little progress has been made in the setting where there are communication delays between tasks. Results for the case of identical machines were derived nearly thirty years ago, and yet no results for related machines have followed. In this work, we propose a new scheduler, Generalized Earliest Time First (GETF), and provide the first provable, worst-case approximation guarantees for the goals of minimizing both the makespan and total weighted completion time of tasks with precedence constraints on related machines with machine-dependent communication times.
1 Introduction

In this paper we study scheduling precedence-constrained tasks onto a set of heterogeneous machines with communication delays between the machines in order to minimize the makespan or the total weighted completion time. Initially, work on this topic was motivated by the goal of scheduling jobs on multi-processor systems, e.g., [1]. Today this problem is timely due to the prominence of large-scale, general-purpose machine learning platforms. For example, in systems such as Google's TensorFlow [2], Facebook's PyTorch [3] and Microsoft's Azure Machine Learning (AzureML) [4], machine learning workflows are expressed via a computational graph, where jobs are made up of tasks, represented as vertices, and precedence relationships between the tasks, represented as edges. This "precedence graph" abstraction allows data scientists to quickly develop and incorporate modular components into their machine learning pipeline (e.g., data preprocessing, model training, and model evaluation) and then easily specify a workflow. The graphs that specify the workflows in platforms such as TensorFlow, PyTorch and AzureML can be made up of hundreds or even thousands of tasks, and the jobs may be run on systems with thousands of machines. As a result, the performance of the platforms depends on how these precedence-constrained tasks are scheduled across machines.

The goal of scheduling jobs composed of precedence-constrained tasks has been studied for more than fifty years, starting with the work of [5]. The simplest version of this scheduling problem focuses on scheduling a single job with $n$ precedence-constrained tasks on $m$ identical parallel machines with the goal of minimizing the makespan: the time until the last task completes. More generally, the goal of minimizing the total weighted completion time is considered, where the total weighted completion time is a weighted average of the completion time of each task in the job. (Makespan is a special case of total weighted completion time, as a dummy task with weight one can be added as the final task of the job, with all other tasks given weight zero.) For the goal of minimizing the makespan, Graham showed that a simple list scheduling algorithm can find a schedule of length within a multiplicative factor of $(2 - 1/m)$ of the optimal. This result is still the best guarantee known for this simple setting. Since then, research has sought to generalize the setting considered in two important ways: (i) to non-identical machines and (ii) to the case where communication is needed between tasks.

Addressing these two issues has been one of the major goals of the field since Graham's initial result fifty years ago. Since that time, most progress has been made on generalizations to heterogeneous machines. The focus has been on (uniformly) related machines, a model where each machine $i$ has a speed $s_i$, each task $j$ has a size $w_j$, and the time to run task $j$ on machine $i$ is $w_j/s_i$. Under the related machine model, a sequence of results in the 1980s and 1990s culminated in a result that showed how to use list scheduling algorithms in combination with a partitioning of machines into groups with "similar" speeds in order to achieve an $O(\log m)$-approximation algorithm for makespan [6]. This result was also extended in the same work to total weighted completion time by proposing a time-indexed linear programming technique.
The extension yields an $O(\log m)$-approximation for total weighted completion time. The idea of using a group assignment rule to partition machines into groups of machines with similar speeds and then to assign tasks to a group is a powerful one and has shown up frequently in the years since; it recently led to a breakthrough when the idea of partitioning machines was adapted further and combined with a variation of list scheduling to obtain an $O(\log m/\log\log m)$-approximation algorithm for both makespan and total weighted completion time [7].

Despite the progress made in generalizing from identical machines to heterogeneous machines, there has been little progress toward the goal of incorporating communication delays. Machine-dependent communication delays are crucial for capturing issues such as data locality and the difference between intra-rack and inter-rack communication. We note that if communication delays are machine independent, they can simply be viewed as part of the processing time, making the problem much easier. The state-of-the-art result in the case of communication delays is [8], which studies machine-dependent communication costs in the setting of identical machines. In this context, a greedy algorithm called Earliest Time First (ETF) has been shown to produce schedules with a makespan bounded by $(2 - 1/m)\,\mathrm{OPT}(i) + C$, where $\mathrm{OPT}(i)$ is the optimal schedule length when ignoring communication time and $C$ is the maximum amount of communication of a chain (path) in the precedence graph. However, the analysis for the case of identical machines in [8] is quite complex and it has proven difficult to generalize to the related machines setting. As a result, there has been no progress outside the context of identical machines in the thirty years since [8].

Given the challenge of designing schedulers that are approximately optimal for related machines with machine-dependent communication time, most work studying the design of scheduling policies in this context has relied on developing scheduling heuristics and evaluating these heuristics numerically, e.g., [9, 10, 11, 12, 13, 14]. For a recent survey see [15, 16] and the references therein.

Contributions.
In this paper we propose a new scheduler, Generalized Earliest Time First (GETF), and prove that it computes a schedule of makespan at most $O(\log m/\log\log m)\,\mathrm{OPT}(i) + C$ in the case of related machines and machine-dependent communication times, where $C$ is the amount of communication time in a chain (path) in the precedence graph. Additionally, we generalize our result to the objective of total weighted completion time and show that GETF produces a schedule $S$ whose total weighted completion time is at most $O(\log m/\log\log m)\,\mathrm{wOPT}(i) + \sum_j \omega_j C(S, j)$, where $\mathrm{wOPT}(i)$ is the optimal total weighted completion time, $\omega_j$ is the weight in the objective, and $C(S, j)$ is the communication requirement in a chain in the precedence graph. These two results address long-standing open problems. Note that the makespan result matches state-of-the-art bounds for the special cases (i) when there is zero communication time and (ii) when the machines are identical. In the case of total weighted completion time, no previous result exists for the case of identical machines with communication time, but the result matches the best known bound for the case with related machines and zero communication time.

The key technical advance that enables our new result is a dramatically simplified analysis of ETF in the setting of identical machines. The state-of-the-art result in this setting is [8], which is established using a long, complex argument. In contrast, the core idea in our proof of Theorem 4.1 is a short, simple proof of a Separation Principle, which can be used to provide a novel proof of the approximation ratio for ETF in the case of identical machines. The proof is simple and general enough that it can be extended from identical machines to related machines by adapting recent advances from [7].
Related literature.
In recent years, the design and optimization of large-scale general-purpose machine learning platforms has been an overarching goal, bridging many communities in both industry and academia. The emergence of platforms such as TensorFlow, PyTorch and AzureML illustrates the power of such systems to democratize tools from machine learning, making them accessible and scalable for anyone.

Since the emergence of such systems, there has been a torrent of work that seeks to optimize the scheduling and assignment of the precedence-constrained graphs in such systems. Heuristics have emerged for managing straggler tasks, e.g., [10, 17, 9, 18]; scheduling tasks with different computational properties, e.g., jobs with MapReduce-type structures [19, 20, 21, 22, 23, 24]; scheduling approximation jobs [9, 25, 18]; and managing communication times [16, 26]. Many of these heuristics have led to system designs that have had a significant industrial impact.

Such designs typically address the challenges associated with precedence constraints in ad hoc ways based on simplifying assumptions about the structures of the graphs. In contrast, there is a long history of analytic work seeking to design schedulers for precedence-constrained tasks with provable worst-case guarantees. As we have already mentioned, the initial results on this topic for makespan were provided by Graham, who gave a $(2 - 1/m)$-approximation algorithm based on list scheduling for $P \mid prec \mid C_{\max}$ [5]. A decade later, it was shown by [27] that it is NP-hard to approximate $P \mid prec \mid C_{\max}$ within a factor of $4/3$. This left a gap which has essentially been closed recently, when [28] proved that it is NP-hard to achieve an approximation factor less than $2$, given the assumption of a new variant of the Unique Games Conjecture introduced by [29]. In the case of the total weighted completion time objective $P \mid prec \mid \sum_j \omega_j C_j$, the negative results carry over from the makespan objective, since the makespan objective can be viewed as a special case of the total weighted completion time objective. Moreover, under the assumption of the stronger version of the Unique Games Conjecture, it is shown in [29] that it is even hard to approximate within a factor of $2 - \epsilon$ for the problem with one machine. On the positive side, a $7$-approximation was given in [30], and [31, 32] later improved it to a $4$-approximation. The current best known result is a $(2 + 2\ln 2 + \epsilon)$-approximation by [7] via a time-indexed linear programming relaxation technique.

The results mentioned above all focus on identical machines with zero communication delays. When related machines are considered, the problem becomes more challenging. An early result on this topic is [6], which proposed a Speed-based List Scheduling (SLS) algorithm that obtains an approximation ratio of $O(\log m)$ for $Q \mid prec \mid C_{\max}$. A time-indexed linear programming technique was proposed in the same work that gives an $O(\log m)$ bound for $Q \mid prec \mid \sum_j \omega_j C_j$. Recently, an improvement to $O(\log m/\log\log m)$ for both objectives was proven in [7]. The best known lower bound for the problem of related machines is from [33], which shows that it is impossible for a polynomial time algorithm to approximate the minimal makespan to any constant factor, assuming the hardness of an optimization problem on $k$-partite graphs.

In contrast, when communication delay is considered, much less is known. To our knowledge, no approximation ratio is known for $P \mid prec, c_{i,j} \mid C_{\max}$, and this open problem was noted by [34].
The only algorithm with a guaranteed worst-case performance bound in this setting is ETF [8], which provides a bound of $(2 - 1/m)\,\mathrm{OPT}(i) + C$ on the makespan in the case of identical machines. Prior to our paper, no algorithm with a worst-case approximation guarantee for either makespan or total weighted completion time was known for the case of related machines with communication delays, i.e., $Q \mid prec, c_{i,j} \mid C_{\max}$ and $Q \mid prec, c_{i,j} \mid \sum_j \omega_j C_j$.

2 Model

We study a model that generalizes $Q \mid prec, c_{i,j} \mid \sum_j \omega_j C_j$ by including machine-dependent communication times. Our goal is to derive bounds on the total weighted completion time and the makespan, which is an important special case of the total weighted completion time that uses a particular choice of $\omega_j$.

Specifically, we consider the task of scheduling a job made up of a set $V$ of $n$ tasks on a heterogeneous system composed of a set $M$ of $m$ machines with potentially different processing speeds and communication speeds. The tasks form a directed acyclic graph (DAG) $G = (V, E)$, in which each node $j$ represents a task and an edge $(j', j)$ between task $j$ and task $j'$ represents a precedence constraint. We use node and task interchangeably, as convenient. Precedence constraints are denoted by a partial order $\prec$ between the two nodes of any edge, where $j' \prec j$ means that task $j$ can only be scheduled after task $j'$ completes. Let $w_j$ represent the processing demand of task $j$. The amount of data to be transmitted between task $j'$ and task $j$ is represented by the edge weight $w_{j',j}$ of $(j', j)$.

The system is heterogeneous in two aspects: processing speed and communication speed. For processing speed, we consider the classical related machines model: a machine $i$ has speed $s_i$, and it takes $w_j/s_i$ uninterrupted time units for task $j$ to complete on machine $i$. Computer resources such as CPUs and GPUs have varying speeds; hence schedulers must be able to handle heterogeneous servers. The communication speed $s_{i',i}$ between any two machines $i', i$ is heterogeneous across different machine pairs. We denote the machine to which task $j$ is assigned by $h(j)$. If $i = h(j)$ and $i' = h(j')$, then the communication time between task $j'$ and $j$ in the DAG is $w_{j',j}/s_{i',i}$.

For simplicity, we consider a setting where the machines are fully connected to each other, so any machine can communicate with any other machine. This is without loss of generality, as one can simply set the communication speed between any two disconnected machines to 0. We also assume that the DAG is connected. Again, this is without loss of generality because, otherwise, the DAG can be viewed as multiple DAGs and the same results can be applied to each. As a result, our results trivially apply to the case of multiple jobs. Additionally, our model assumes that each machine (processing unit) can process at most one task at a time, i.e., there is no time-sharing, and the machines are assumed to be non-preemptive, i.e., once a task starts on a machine, the scheduler must wait for the task to complete before assigning any new task to this machine. This is a natural assumption in many settings, as interrupting a task and transferring it to another machine can cause significant processing overhead and communication delays due to data locality, e.g., [35].
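To make the model concrete, the following minimal Python sketch encodes a problem instance; the class and helper names are ours, not part of the model. It records the related-machine speeds $s_i$, the pairwise communication speeds $s_{i',i}$, and the DAG weights $w_j$ and $w_{j',j}$, together with the two timing rules defined above.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    """A problem instance (illustrative encoding, names are ours).

    speeds[i]      -- processing speed s_i of machine i
    comm[i1][i2]   -- communication speed s_{i1,i2} between machines i1 and i2
    size[j]        -- processing demand w_j of task j
    data[(j1,j2)]  -- data w_{j1,j2} shipped along DAG edge (j1, j2)
    preds[j]       -- immediate predecessors of task j
    """
    speeds: list
    comm: list
    size: dict
    data: dict
    preds: dict

def proc_time(inst, j, i):
    """Uninterrupted time to run task j on machine i: w_j / s_i."""
    return inst.size[j] / inst.speeds[i]

def comm_time(inst, j1, j2, i1, i2):
    """Delay before task j2 on machine i2 can consume the output of
    task j1 on machine i1: w_{j1,j2} / s_{i1,i2}."""
    return inst.data[(j1, j2)] / inst.comm[i1][i2]
```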
The goal of the scheduler in our model is to minimize the total weighted completion time of the job, denoted by $\sum_j \omega_j C_j$, where $C_j$ is the completion time of task $j$ and $\omega_j$ is the weight associated with task $j$. We also consider the makespan, denoted by $C_{\max}$, which is the time when the final task in the DAG completes. Note that the problem we consider is an offline scheduling problem. This is a classical problem with relevance to modern ML platforms, which use batch scheduling of precedence-constrained tasks in their pipelines, e.g., [2]. It is also known to be challenging. Specifically, minimizing the makespan (and hence also minimizing the total weighted completion time) of jobs with precedence constraints is known to be NP-complete [36]. Thus, we aim to design a polynomial-time algorithm that computes an approximately optimal schedule. We say that an algorithm is a $\rho$-approximation algorithm if it always produces, in polynomial time, a solution with an objective value within a factor of $\rho$ of optimal.

Our main results use three important concepts. First, our results provide bounds in terms of $\mathrm{OPT}(i)$ and $\mathrm{wOPT}(i)$, which are the optimal makespan and the optimal total weighted completion time if the communication delays were zero, respectively. Note that $\mathrm{OPT}(i)$ and $\mathrm{wOPT}(i)$ are lower bounds on the corresponding objectives of the problem when communication delays are included. Second, we provide bounds in terms of the communication time of a terminal chain of the schedule. A chain in the DAG is a sequence of immediate predecessor-successor pairs, whose first node is a node with no predecessor and whose last node is a leaf node with no successors. Third, we provide bounds in terms of the communication time of a terminal chain of a subset of the DAG that is naturally formed in the scheduling process. Formally, for any given schedule, a terminal chain $\mathcal{C}$ of length $N$ can be constructed in the following fashion. We start with one of the tasks that ends last in the given schedule, denoted as $c_N$. Among all the immediate predecessors of node $c_N$, we pick one of the tasks that finishes last and define it as $c_{N-1}$. In such a way, we can construct a chain of tasks $c_1 \prec c_2 \prec \ldots \prec c_N$ until the first node $c_1$ in the chain does not have a predecessor. There may be many such terminal chains, and our results apply to any arbitrary terminal chain for the given schedule.

3 Algorithm

In this section, we introduce a new algorithm – Generalized Earliest Time First (GETF) – for scheduling tasks with precedence constraints in settings where servers have heterogeneous service rates and communication times. For GETF, we provide provable worst-case approximation guarantees for both the goal of minimizing the makespan and the goal of minimizing the total weighted completion time.
At its core, GETF is a greedy algorithm. Like ETF, it seeks to run the tasks that can be started earliest, thus greedily minimizing the idle time created by the precedence constraints. However, this simple heuristic does not take into account the potential difference between the service rates of different machines. For this, GETF is similar to SLS. It uses a group assignment function $f(\cdot)$ to determine sets of "similar" machines and then assigns tasks to different groups of machines. Within the groups of similar machines, GETF uses the ETF greedy allocation rule.

GETF is parameterized by a group assignment function $f(\cdot)$ and a tie-breaking rule, and proceeds in two stages at every iteration. First, GETF finds the set $A$ of all the tasks that are ready to process and are not yet scheduled. For every task in $A$, GETF calculates the earliest starting time if the task were only allowed to be scheduled on machines in its assigned group. Then, GETF computes $B$, the set of tasks in $A$ with the earliest starting time, and chooses one of these tasks to process on a machine based on the tie-breaking rule. The pseudocode for GETF is presented in Algorithm 1, and Figure 1 in Section 3.3 illustrates the operation of GETF on a simple example (Example 1).

Algorithm 1 Generalized Earliest Time First (GETF)
INPUT: group assignment rule $f(\cdot)$, tie-breaking rule
OUTPUT: schedule $S$ with machine assignment mapping $h(\cdot)$ and starting time mapping $t(\cdot)$
  $R \leftarrow \{1, 2, \ldots, n\}$
  while $R \neq \emptyset$ do
    $A = \{j : j \in R,\ \nexists\, j' \text{ s.t. } j' \in R \text{ and } j' \prec j\}$
    For $j \in A$, $t'_j =$ earliest starting time on a machine $m'_j$ s.t. $m'_j \in f(j)$
    $B = \{j : j = \arg\min_{j' \in A} t'_{j'}\}$
    Choose $j$ from $B$ to start on machine $m'_j$ with starting time $t'_j$, based on the given tie-breaking rule
    $h(j) = m'_j$, $t(j) = t'_j$
    $R \leftarrow R \setminus \{j\}$
  end while

GETF can be instantiated with different group assignment and tie-breaking rules. To understand how these rules work, consider a situation where the $m$ machines are divided into $K$ groups $M_1, M_2, \ldots, M_K$ by a group assignment rule. Let $f(j)$ denote the group of machines to which task $j$ can be assigned, for $j = 1, \ldots, n$. Given this notation, a schedule under GETF consists of two mappings: a mapping $h(\cdot)$ from each task to its assigned machine and a mapping $t(\cdot)$ from each task to its starting time. Further, for any schedule produced by GETF, $h(\cdot)$ must be consistent with the group assignment function $f(\cdot)$, i.e., $h(j) \in f(j)$ for each task $j$.

The choice of the group assignment rule has a significant impact on the performance of GETF. Indeed, different group assignment functions are used for the goals of minimizing the makespan and the total weighted completion time. While our results hold for any tie-breaking rule, different tie-breaking rules could provide meaningful improvements in real-world workloads. As it can be helpful to keep a specific tie-breaking rule in mind while considering the algorithm and proofs, the reader may find random tie-breaking a useful default. Our technical results are based on the specific group assignment functions described in the following subsections.
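Before turning to the group assignment rules, the following sketch illustrates the loop of Algorithm 1, reusing the hypothetical `Instance` helpers from Section 2. It is one possible reading of the algorithm under stated simplifying assumptions: ties are broken by task index, and a machine's earliest available time is taken to be the completion of the last task placed on it, i.e., we do not search for idle gaps between already-placed tasks.

```python
def getf(inst, f, n_tasks):
    """A minimal sketch of GETF (Algorithm 1). f[j] is the list of machines
    in task j's assigned group; ties are broken by task index here."""
    free = [0.0] * len(inst.speeds)          # time machine i becomes free
    h, t, finish = {}, {}, {}                # machine, start, finish per task
    remaining = set(range(n_tasks))
    while remaining:
        # A: unscheduled tasks whose predecessors are all scheduled
        ready = [j for j in remaining
                 if all(p in finish for p in inst.preds[j])]
        best = None                          # (start time, task, machine)
        for j in ready:
            for i in f[j]:                   # only machines in j's group
                # data from every scheduled predecessor must arrive first
                arrive = max((finish[p] + comm_time(inst, p, j, h[p], i)
                              for p in inst.preds[j]), default=0.0)
                start = max(free[i], arrive)
                if best is None or (start, j) < (best[0], best[1]):
                    best = (start, j, i)
        start, j, i = best                   # earliest-starting ready task
        h[j], t[j] = i, start
        finish[j] = start + proc_time(inst, j, i)
        free[i] = finish[j]
        remaining.remove(j)
    return h, t, finish
```

Running this sketch requires a group assignment rule such as the ones constructed in the next two subsections; for identical machines one can simply use a single group containing all machines.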
3.1 Group assignment rule for makespan

The group assignment rule $f_{mksp}(\cdot)$ for the goal of minimizing the makespan that we focus on is adapted from SLS, which is designed for the setting without communication time. Specifically, machines of similar speeds are grouped together as follows.

First, all the machines with speed less than a $1/m$ fraction of the speed of the fastest machine are discarded. Then, the remaining machines are divided into $K$ groups $M_1, M_2, \ldots, M_K$, where $K = \lceil \log_\gamma m \rceil$ and $\gamma = \log m/\log\log m$. Note that $K = O(\log m/\log\log m)$. Given the removal of the slowest machines, we can assume that any remaining machine has speed within a factor of $m$ of the fastest machine. Without loss of generality, we assume the speed of the fastest machine is $m$, and the group $M_k$ contains machines with speeds in the range $[\gamma^{k-1}, \gamma^k)$.

It may seem strange that some machines are discarded, but note that the total speed of the discarded machines is not bigger than the speed of the fastest machine, since each of the fewer than $m$ discarded machines has speed less than a $1/m$ fraction of the fastest. So, if we consider the scheduling problem with zero communication time, removing these machines at most doubles the makespan in the worst case.

After dividing machines into $K$ groups in the preprocessing step, we need to assign tasks to groups. This step is more involved than the division. The design of the group assignment rule $f_{mksp}(\cdot)$ is based on the solution of a linear program (LP), which is a relaxed version of the following mixed integer linear program (MILP):

$$\min_{x_{i,j},\, C_j,\, T} \; T$$
$$\sum_i x_{i,j} = 1 \quad \forall j \qquad (1a)$$
$$w_j \sum_i \frac{x_{i,j}}{s_i} \le C_j \quad \forall j \qquad (1b)$$
$$C_{j'} + w_j \sum_i \frac{x_{i,j}}{s_i} \le C_j \quad \forall\, j' \prec j \qquad (1c)$$
$$\frac{1}{s_i} \sum_j w_j x_{i,j} \le T \quad \forall i \qquad (1d)$$
$$C_j \le T \quad \forall j \qquad (1e)$$
$$x_{i,j} \in \{0, 1\} \quad \forall i, j \qquad (1f)$$

While the MILP is only designed to produce a group assignment rule, its optimal solution does not necessarily provide a feasible schedule. In the MILP, $x_{i,j} = 1$ if task $j$ is assigned to machine $i$; otherwise $x_{i,j} = 0$. For each task $j$, $C_j$ denotes the completion time of task $j$. Constraint (1a) ensures that every task is processed on some machine. For any task $j$, the processing time $w_j \sum_i x_{i,j}/s_i$ is bounded by its completion time, as in constraint (1b). Constraint (1c) enforces the precedence constraint between any predecessor-successor pair $(j', j)$. Constraint (1d) guarantees that the total load assigned to machine $i$, which takes time $\frac{1}{s_i}\sum_j w_j x_{i,j}$ to process, is not greater than the makespan. Finally, constraint (1e) states that the makespan should not be smaller than the completion time of any task.

Since we cannot solve the MILP efficiently, we relax it to form an LP by replacing constraint (1f) with $x_{i,j} \ge 0$. Let $x^*, C^*, T^*$ denote the optimal solution of this LP. Note that $T^*$ provides a lower bound on $\mathrm{OPT}(i)$, the optimal makespan for the same problem with zero communication time.

For a set $M_k \subseteq M$ of machines, let $s(M_k)$ denote the total speed of the machines in $M_k$, i.e.,
$$s(M_k) = \sum_{i \in M_k} s_i.$$
Define $x^*_{M_k,j}$ as the total fraction of task $j$ assigned to machines in set $M_k$:
$$x^*_{M_k,j} = \sum_{i \in M_k} x^*_{i,j}.$$
For any task $j$, define $\ell_j$ as the largest group index such that at least half of task $j$ is fractionally assigned to machines in groups $M_\ell, \ldots, M_K$:
$$\ell_j = \max_\ell\ \ell \quad \text{s.t.} \quad \sum_{k=\ell}^K x^*_{M_k,j} \ge \frac{1}{2}.$$
We note that any choice of constant above works for the purpose of our worst-case analysis of GETF, but the choice can potentially have an impact on its empirical performance. Thus the choice of the parameter should be further optimized when applied in practice. Each task $j$ is assigned to the group $f_{mksp}(j)$ that maximizes the total speed of the machines in that group among the candidates $M_{\ell_j}, \ldots, M_K$, i.e.,
$$f_{mksp}(j) = \arg\max_{M_k :\, \ell_j \le k \le K} s(M_k).$$
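As an illustration, the preprocessing and assignment steps of $f_{mksp}$ might be implemented as follows. This is a sketch under stated assumptions: the function names are ours, solving the LP relaxation of (1) is omitted, and the fractional solution `x_star` (with `x_star[i][j]` the fraction of task $j$ on machine $i$) is assumed to come from any off-the-shelf LP solver.

```python
import math

def speed_groups(speeds):
    """Split machines into K = ceil(log_gamma m) speed classes (Section 3.1):
    discard machines slower than a 1/m fraction of the fastest, rescale so the
    fastest speed is m, and place machines with rescaled speed in
    [gamma^k, gamma^{k+1}) into (0-indexed) group k.  Requires m large enough
    that gamma = log m / log log m > 1."""
    m = len(speeds)
    gamma = math.log(m) / math.log(math.log(m))
    scale = m / max(speeds)
    K = math.ceil(math.log(m, gamma))
    groups = [[] for _ in range(K)]
    for i, s in enumerate(speeds):
        if s * scale >= 1:                       # keep: within factor m of fastest
            k = min(K - 1, int(math.log(s * scale, gamma)))
            groups[k].append(i)
    return groups

def f_mksp(x_star, groups, speeds, j):
    """Group index for task j: ell_j is the largest l with
    sum_{k >= l} x*_{M_k, j} >= 1/2; among groups M_{ell_j}, ..., M_K
    pick the one with the largest total speed."""
    K = len(groups)
    frac = [sum(x_star[i][j] for i in g) for g in groups]   # x*_{M_k, j}
    tail, ell = 0.0, 0
    for l in range(K - 1, -1, -1):               # scan from the fastest group down
        tail += frac[l]
        if tail >= 0.5:
            ell = l
            break
    return max(range(ell, K), key=lambda k: sum(speeds[i] for i in groups[k]))
```

In the notation of Algorithm 1, one would then set `f[j] = groups[f_mksp(x_star, groups, speeds, j)]`.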
3.2 Group assignment rule for total weighted completion time

The group assignment rule $f_{twct}(\cdot)$ for the goal of minimizing the total weighted completion time is similar in spirit to $f_{mksp}(\cdot)$, but is based on modified solutions of a different LP. We divide machines into groups in the same way as in Section 3.1. Without loss of generality, we assume that $w_j/s_i \ge 1$ for any task $j$ to be processed on any machine $i$. Thus, we can divide the time horizon into the following time-indexed intervals of possible task completion times: $[1, 2], (2, 4], (4, 8], \ldots, (\tau_{Q-1}, \tau_Q]$, where $Q = \log\big(\sum_j w_j / \min_i s_i\big)$ and $\tau_q = 2^q$ for $0 \le q \le Q$. Then, the MILP that forms the basis for the group assignment rule can be formulated as follows:

$$\min_{x_{i,j,q},\, C_j} \; \sum_j \omega_j C_j$$
$$\sum_i \sum_q x_{i,j,q} = 1 \quad \forall j \qquad (2a)$$
$$w_j \sum_i \frac{1}{s_i} \sum_q x_{i,j,q} \le C_j \quad \forall j \qquad (2b)$$
$$C_{j'} + w_j \sum_i \frac{1}{s_i} \sum_q x_{i,j,q} \le C_j \quad \forall\, j' \prec j \qquad (2c)$$
$$\sum_{t=1}^{q} \sum_i x_{i,j,t} - \sum_{t=1}^{q} \sum_i x_{i,j',t} \le 0 \quad \forall q,\; j' \prec j \qquad (2d)$$
$$\sum_q \tau_{q-1} \sum_i x_{i,j,q} \le C_j \quad \forall j \qquad (2e)$$
$$\frac{1}{s_i} \sum_j w_j \sum_{t=1}^{q} x_{i,j,t} \le \tau_q \quad \forall i, q \qquad (2f)$$
$$x_{i,j,q} \in \{0, 1\} \quad \forall i, j, q \qquad (2g)$$

Again, the MILP is only designed to find a group assignment rule, and thus its optimal solution does not necessarily produce a feasible schedule. Here, $x_{i,j,q} = 1$ if task $j$ is assigned to machine $i$ and it completes in the $q$th interval $(\tau_{q-1}, \tau_q]$. For each task $j$, $C_j$ denotes the completion time of task $j$ and $\omega_j$ represents its weight in the objective of total weighted completion time. Constraint (2a) enforces that each task is assigned to some machine. Constraint (2b) guarantees that the completion time of a task is not smaller than its processing time. Constraints (2c) and (2d) together enforce the precedence constraint for every predecessor-successor pair. Constraint (2e) guarantees that the completion time of task $j$ is not smaller than the left boundary of the interval in which it completes. The total load assigned to machine $i$ up to the $q$th interval takes time $\frac{1}{s_i}\sum_j w_j \sum_{t=1}^{q} x_{i,j,t}$ to process, and it should not be greater than the upper bound $\tau_q$, as enforced in constraint (2f).

To define the group allocation rule, we relax constraint (2g) to $x_{i,j,q} \ge 0$ to form an LP. As in the previous section, let $x^*, C^*$ denote the optimal solution for this LP. Note that $\sum_j \omega_j C^*_j$ provides a lower bound for $\mathrm{wOPT}(i)$. For any task $j$, define $q(j)$ as the minimum value of $q$ such that both $\sum_{t=1}^{q} \sum_i x^*_{i,j,t} \ge \frac{1}{2}$ and $C^*_j \le 2^q$ are satisfied. Intuitively, $2^{q(j)}$ can be viewed as a rough estimate of the completion time of task $j$. Define $\alpha_j$ as the total fraction of task $j$ over all machines in the first $q(j)$ intervals with respect to the solution $x^*$:
$$\alpha_j = \sum_{t=1}^{q(j)} \sum_i x^*_{i,j,t}.$$
We construct a set of feasible solutions $\tilde{x}$ based on the optimal solution $x^*$ for the LP:
$$\tilde{x}_{i,j} = \sum_{q=1}^{q(j)} \frac{x^*_{i,j,q}}{\alpha_j} \quad \forall i, j. \qquad (3)$$
Notice that the group assignment rule $f_{twct}(\cdot)$ is of the same form as $f_{mksp}(\cdot)$, with $\tilde{x}$ replacing $x^*$. For task $j$, define $\tilde{\ell}_j$ as before, but with respect to $\tilde{x}$ instead of $x^*$:
$$\tilde{\ell}_j = \max_\ell\ \ell \quad \text{s.t.} \quad \sum_{k=\ell}^K \tilde{x}_{M_k,j} \ge \frac{1}{2}.$$
The group assignment rule $f_{twct}(\cdot)$ for the goal of minimizing the total weighted completion time then follows:
$$f_{twct}(j) = \arg\max_{M_k :\, \tilde{\ell}_j \le k \le K} s(M_k).$$
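The modification of the LP solution in equation (3) is mechanical. Below is a sketch under the assumption that the time-indexed solution is stored as `x_star[i][j][q]` with $q = 1, \ldots, Q$ (index 0 unused); $f_{twct}$ is then obtained by feeding the returned fractions to the same group-selection rule used for $f_{mksp}$ above.

```python
def tilde_x(x_star, C_star, Q):
    """Collapse the time-indexed LP solution x*_{i,j,q} into machine-task
    fractions x~_{i,j} as in equation (3).  C_star[j] is the LP completion
    time of task j."""
    n_machines, n_tasks = len(x_star), len(x_star[0])
    xt = [[0.0] * n_tasks for _ in range(n_machines)]
    for j in range(n_tasks):
        # q(j): smallest q with sum_{t<=q} x* >= 1/2 and C*_j <= 2^q
        mass, qj = 0.0, Q
        for q in range(1, Q + 1):
            mass += sum(x_star[i][j][q] for i in range(n_machines))
            if mass >= 0.5 and C_star[j] <= 2 ** q:
                qj = q
                break
        alpha = sum(x_star[i][j][t] for i in range(n_machines)
                    for t in range(1, qj + 1))       # alpha_j >= 1/2
        for i in range(n_machines):
            xt[i][j] = sum(x_star[i][j][t] for t in range(1, qj + 1)) / alpha
    return xt
```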
3.3 Example

[Figure 1: An illustration of GETF running on Example 1. (a)-(d) show the first four iterations.]

[Figure 2: An illustration of SLS running on Example 1. (a)-(d) show the first four iterations.]

The description of GETF above highlights that it combines the greedy heuristic of ETF with the speed-based assignment heuristic of SLS. This enables GETF to provide guarantees for settings with both heterogeneous processing rates and communication delays. In contrast, SLS does not provide guarantees in settings with communication time. This is a result of the fact that SLS is based on list scheduling and does not always schedule the earliest task first, thus making it impossible to bound the overall idle time in between tasks.

To illustrate the difference between GETF and SLS, we provide a simple example of scheduling a job made up of four tasks.
Example 1.
We consider a job made up of four tasks, 0, 1, 2, 3, that are to be scheduled on a set of two identical machines with the same processing speed; the processing demands and the precedence graph are shown in Figures 1 and 2. Of the edges in the graph, three have communication weight 2 and one has communication weight 1. We assume $s_{i,j} = 1$ for $i \neq j$, and otherwise $s_{i,i} = 2$ for $i = 0, 1$.

The schedules of GETF and SLS are illustrated in Figures 1 and 2. Note that, since the servers are identical, the group assignment rule does not play a role in these examples. Given the priority list $(0, 1, 2, 3)$, a possible schedule produced by SLS puts two of the tasks on one machine and assigns the rest of the tasks to the other, as demonstrated in Figure 2. In a terminal chain for this schedule, the idle time between the end of a task and the start of its successor on the other machine is not bounded by the communication time between the two tasks. In contrast, the successor starts earlier in the schedule produced by GETF; see Figure 1. List scheduling does not always schedule the earliest task at each step, thus making the idle time on a machine not necessarily bounded by the communication time between the corresponding task pair. Our proofs in Section 5.1 highlight that maintaining a tight bound on the communication time between tasks is crucial to achieving a good approximation ratio in settings with machine-dependent communication time.
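As a usage illustration, one can run the GETF sketch from Section 3 on a small hypothetical instance in the spirit of Example 1. The concrete demands and edge weights below are illustrative stand-ins chosen by us, not the paper's values; the point is only the mechanics of invoking the scheduler with a trivial single-group assignment.

```python
# Hypothetical 4-task diamond DAG on two identical machines.
inst = Instance(
    speeds=[1.0, 1.0],                      # two identical machines
    comm=[[2.0, 1.0], [1.0, 2.0]],          # s_{i,i} = 2, s_{i,j} = 1
    size={0: 1.0, 1: 2.0, 2: 2.0, 3: 1.0},  # made-up processing demands
    data={(0, 1): 2.0, (0, 2): 2.0, (1, 3): 2.0, (2, 3): 1.0},
    preds={0: [], 1: [0], 2: [0], 3: [1, 2]},
)
f = {j: [0, 1] for j in range(4)}           # identical machines: one group
h, t, finish = getf(inst, f, n_tasks=4)
print(h, t, finish)                         # machine, start, finish per task
```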
4 Results

Our main results bound the approximation ratio of GETF in settings with related machines and heterogeneous communication time for the goals of minimizing the makespan and minimizing the total weighted completion time.

4.1 Makespan

In the case of minimizing the makespan, our main result provides a bound in terms of the communication time of a terminal chain of the schedule. Specifically, let $\mathcal{C}: c_1 \prec c_2 \prec \ldots \prec c_N$ be a terminal chain for the schedule, and define $C$ as the communication time over such a chain in the worst case, i.e.,
$$C = \sum_{j=2}^{N} \frac{w_{c_{j-1}, c_j}}{\bar{s}(c_{j-1}, c_j)},$$
where $\bar{s}(c_{j-1}, c_j)$ is defined as the slowest communication speed between $h(c_{j-1})$, the machine assigned to $c_{j-1}$, and any machine in the group $f(c_j)$, i.e.,
$$\bar{s}(c_{j-1}, c_j) = \min_{i \in f(c_j)} s_{h(c_{j-1}), i}.$$
Note that $C$ can be computed efficiently and minimized over all the terminal chains using dynamic programming, and that the tie-breaking rule can have an impact on $C$ due to its impact on terminal chains.
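The dynamic program mentioned above is straightforward. The following sketch, again built on the hypothetical helpers from earlier sections (with `h` and `finish` describing the schedule and `f` the group assignment), computes the smallest value of $C$ over all terminal chains of a given schedule.

```python
def min_terminal_chain_comm(inst, f, h, finish):
    """Minimum over all terminal chains of the worst-case chain communication
    time C (Section 4.1).  A terminal chain follows, backwards from a task
    finishing last, a predecessor finishing last at every step; ties among
    latest-finishing predecessors are what make the minimization non-trivial.
    g[j] = cheapest such chain prefix ending at task j."""
    g = {}
    for j in sorted(finish, key=finish.get):     # predecessors come first
        preds = inst.preds[j]
        if not preds:
            g[j] = 0.0
            continue
        latest = max(finish[p] for p in preds)
        best = float('inf')
        for p in preds:
            if finish[p] == latest:              # only latest finishers qualify
                s_bar = min(inst.comm[h[p]][i] for i in f[j])
                best = min(best, g[p] + inst.data[(p, j)] / s_bar)
        g[j] = best
    t_end = max(finish.values())
    return min(g[j] for j in finish if finish[j] == t_end)
```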
Theorem 4.1. For any schedule $S$ produced by GETF with group assignment rule $f_{mksp}(\cdot)$,
$$C_{\max}(S) \le O(\log m/\log\log m)\,\mathrm{OPT}(i) + C,$$
where $\mathrm{OPT}(i)$ is the optimal schedule length obtained if communication time for all pairs were zero.

Theorem 4.1 represents the first result for makespan in the setting of related machines and heterogeneous communication time, addressing a problem that has been open since ETF was introduced for identical machines thirty years ago. Additionally, it matches the state-of-the-art results for the case without communication time, where the best known approximation ratio is $O(\log m/\log\log m)$ [7], and for the case with communication time but identical machines, where the best known bound is $(2 - 1/m)\,\mathrm{OPT}(i) + C$ [8].

Concretely, in the special case of identical machines, the group assignment rule $f_{mksp}(\cdot)$ is no longer required when implementing GETF, since all machines share the same speed and so there is only one group of machines. Thus, GETF reduces to ETF. The result below makes use of $C'$, which is defined as
$$C' = \frac{1}{m} \sum_{j=2}^{N} \sum_{i=1}^{m} \frac{w_{c_{j-1}, c_j}}{s_{h(c_{j-1}), i}}.$$
Note that $C'$ differs from $C$ since it is an average over the terminal chain. The result we obtain in this case is the following, which matches the current state-of-the-art result of [8].
Proposition 4.2. Consider a setting with $m$ identical machines. For any schedule $S$ produced by GETF,
$$C_{\max}(S) \le \left(2 - \frac{1}{m}\right) \mathrm{OPT}(i) + C',$$
where $\mathrm{OPT}(i)$ is the optimal schedule length obtained if communication time for all pairs were zero.

4.2 Total weighted completion time

Similarly to the makespan case, we provide a bound with respect to the communication time of chains. However, since the total weighted completion time depends on the completion time of every task (instead of just one task, as in the case of makespan), the communication times of terminal chains of many subsets of the DAG show up in the bound. More formally, assume that the tasks are indexed with respect to their order in the schedule determined by GETF, denoted by $S$. At iteration $j$, task $j$ is to be scheduled. Let $G(S, j)$ denote the DAG formed by the set of tasks that have been scheduled so far and the corresponding edges within these tasks. Define $S(j)$ to be the subset of the given schedule $S$ up to iteration $j$, i.e., it is a schedule for the DAG $G(S, j)$. This definition ensures that task $j$ is one of the tasks that ends last in the schedule $S(j)$. Now, let $\mathcal{C}(S, j): c_1 \prec c_2 \prec \cdots \prec c_{N_j}$ be a terminal chain that ends with task $j = c_{N_j}$ in the schedule $S(j)$, and define $C(S, j)$ as the communication time over such a chain in the worst case, i.e.,
$$C(S, j) = \sum_{j'=2}^{N_j} \frac{w_{c_{j'-1}, c_{j'}}}{\bar{s}(c_{j'-1}, c_{j'})}.$$
This definition of $C(S, j)$ generalizes the notion of $C$ used in Theorem 4.1 for makespan and plays a similar role in the theorem below.
Theorem 4.3. For any schedule $S$ produced by GETF with group assignment rule $f_{twct}(\cdot)$,
$$\sum_j \omega_j C_j \le O(\log m/\log\log m)\,\mathrm{wOPT}(i) + \sum_j \omega_j C(S, j),$$
where $\mathrm{wOPT}(i)$ is the optimal total weighted completion time obtained if communication time for all pairs were zero.

5 Proofs

In this section, we present our proofs of Theorems 4.1 and 4.3. The general form of both arguments is similar; however, the case of total weighted completion time is more involved. The first step of our argument is to show a general upper bound, which is valid for GETF regardless of the choice of group assignment function $f(\cdot)$ and tie-breaking rule. This Separation Principle can be used to easily establish the result for makespan in the case of identical machines (Proposition 4.2), and represents a significant simplification compared to existing proofs of that result in the literature. We then tighten the general bound by taking advantage of the choices of $f(\cdot)$ described in Section 3 for makespan and total weighted completion time. Finally, we establish a connection between the makespan and total weighted completion time in the same settings by introducing a time-indexed LP that enables us to bound the total weighted completion time.

5.1 The Separation Principle

The Separation Principle presented here is a key component of our proof of Theorem 4.1. The core of nearly all proofs in this area is the construction of a chain, which is then used to bound the overall makespan. This idea goes back to the first list scheduling algorithms proposed by [5]. The key to our argument is to bound the amount of communication time between any predecessor-successor pair in a terminal chain. However, as we discuss in Section 3, it is not possible to do this under list scheduling algorithms.

Our approach also differs considerably from the approach used to study ETF in [8], where the authors divide $[0, C_{\max}]$ into two sets of time intervals, one for the time when all the machines are busy and the other covered by a single chain. Extending this approach to related machines does not appear possible. In contrast, in our argument, the construction of a terminal chain is simple, and so we can identify the set of time intervals between tasks in the terminal chain and take advantage of the greedy nature of GETF to bound these times directly.

A key feature of the Separation Principle below is that it separates the analysis of the terminal chain from the analysis of the group assignment rule, which provides another valuable simplification relative to previous proof approaches.

Theorem 5.1 (Separation Principle). For any choice of group assignment function $f(\cdot)$ and tie-breaking rule, GETF produces a schedule $S$ of makespan
$$C_{\max}(S) \le P + \sum_{k=1}^{K} D_k + C,$$
where
$$P = \sum_{c_j \in \mathcal{C}} \frac{w_{c_j}}{s_{h(c_j)}}, \qquad D_k = \frac{\sum_{j : f(j) = M_k} w_j}{s(M_k)}, \qquad C = \sum_{j=1}^{N-1} \frac{w_{c_j, c_{j+1}}}{\bar{s}(c_j, c_{j+1})}.$$

Note that the upper bound in this result is valid regardless of the choice of group assignment rule and tie-breaking rule. $P$ is the sum of processing times along a terminal chain, and $D_k$ can be viewed as the time to process the total load assigned to machines in group $M_k$. Neither $P$ nor $D_k$, $k = 1, 2, \ldots, K$, depends on the communication constraints, which enables us to take advantage of any good choice of group assignment rule $f(\cdot)$ for general DAG scheduling, even in the case of zero communication time.
Proof. Our proof proceeds in four steps:

(i) Define a terminal chain $\mathcal{C}$. Recall that a chain $\mathcal{C}: c_1 \prec c_2 \prec \ldots \prec c_N$ is a terminal chain when task $c_N$ completes at the end of the overall schedule.

(ii) Partition the overall makespan into $K + 1$ parts. The idea of this step is to decouple $[0, C_{\max}]$ into one part where the tasks in the terminal chain are being processed and $K$ other parts associated with each machine group. Depending on the choice of group assignment rule, we can further bound these $K + 1$ parts.

(iii) Bound the idle time in between tasks. The greedy nature of GETF makes it possible to bound the length of the idle time intervals between tasks by the communication delays of task pairs.

(iv) Combine (ii) and (iii) to bound the overall makespan in terms of the communication time of the terminal chain.

(i) Define a terminal chain $\mathcal{C}$. To find a terminal chain of length $N$, we start with one of the tasks that ends last, denoted as $c_N$. According to the definition of $h(\cdot)$ and $t(\cdot)$, task $c_N$ is assigned to machine $h(c_N)$ in group $f(c_N)$ with starting time $t(c_N)$. Among all the immediate predecessors of task $c_N$, we pick one of the tasks that finishes last and define it as $c_{N-1}$. In such a fashion, we construct a chain $\mathcal{C}$ of tasks $c_1 \prec c_2 \prec \ldots \prec c_N$ of length $N$ such that $c_1$ does not have any predecessor.

(ii) Partition $[0, C_{\max}]$ into $K + 1$ parts, $T_0, T_1, \ldots, T_K$. Recall that $K = O(\log m/\log\log m)$ is the number of machine groups produced by the group assignment rule, as described in Section 3. Let $T_0$ denote the union of the time intervals during which tasks of chain $\mathcal{C}$ are being processed. Consider the time interval between the end of task $c_{j-1}$ and the start of task $c_j$ for $j = 2, 3, \ldots, N$, and assign it to $T_k$, where $M_k = f(c_j)$. As a set of time intervals, $T_k$ can possibly be empty or have more than one time interval. Essentially, $T_k$ is the set of time intervals during which tasks in the terminal chain $\mathcal{C}$ assigned to machines in group $M_k$ have to wait before being processed. In such a fashion, we define $T_1, T_2, \ldots, T_K$, since $f(\cdot)$ maps each task to one of the $K$ machine groups. The length of the union of the $T_i$ for $i = 0, 1, \ldots, K$ is the makespan.
(iii) Bound the idle time in between tasks. Consider a task $c_j$ assigned to machine $h(c_j)$. For each machine $i \in f(c_j)$, let $E(c_{j-1}, c_j, i)$ denote the union of disjoint empty time intervals on machine $i$ between the end time of task $c_{j-1}$ and the start time of task $c_j$. Between the end time of task $c_{j-1}$ and the start time of task $c_j$, there can be multiple tasks being processed on machine $i$ in serial, possibly resulting in more than one idle time interval on machine $i$ within $E(c_{j-1}, c_j, i)$. Precedence constraints between task pairs can also possibly make a successor wait before it gets started. Regardless of the reason for idle time between tasks, no task can possibly start earlier on any machine in its assigned group, due to the greedy nature of GETF. Thus the length of $E(c_{j-1}, c_j, i)$ is bounded above by the communication time between task $c_{j-1}$ and task $c_j$, i.e.,
$$|E(c_{j-1}, c_j, i)| \le \frac{w_{c_{j-1}, c_j}}{s_{h(c_{j-1}), i}} \quad \forall i \in f(c_j).$$
This is true because, if it were not the case, then task $c_j$ could have started earlier on machine $i$. Note that the end time of task $c_j$ could possibly be earlier if it were allowed to be scheduled on a faster machine with a slightly bigger communication delay, since the processing speeds of machines in the same group vary.

Let $e_i$ be the idle time on machine $i$ in group $M_k$ during the time intervals $T_k$, and let $\bar{e}_k$ be the maximum idle time on any machine in group $M_k$ during the time intervals $T_k$, i.e., $e_i \le \bar{e}_k$ for all $i \in M_k$. Thus,
$$\sum_{k=1}^{K} \bar{e}_k \le \sum_{j=2}^{N} \frac{w_{c_{j-1}, c_j}}{\min_{i' \in f(c_j)} s_{h(c_{j-1}), i'}} = \sum_{j=2}^{N} \frac{w_{c_{j-1}, c_j}}{\bar{s}(c_{j-1}, c_j)}. \qquad (4)$$
(iv) Bound the makespan. For $1 \le k \le K$, the total speed of the machines in group $M_k$ is
$$s(M_k) = \sum_{i \in M_k} s_i.$$
Denote the total length of the intervals in $T_k$ by $t_k$. There must be at least $(t_k - e_i)s_i$ units of processing done on each machine $i$ in group $M_k$ during the time intervals $T_k$. Thus, for $1 \le k \le K$,
$$\sum_{i \in M_k} (t_k - e_i) s_i \le \sum_{j : f(j) = M_k} w_j,$$
and therefore
$$t_k \le \frac{\sum_{j : f(j) = M_k} w_j}{s(M_k)} + \frac{\sum_{i \in M_k} e_i s_i}{s(M_k)}. \qquad (5)$$
We now bound $C_{\max}$:
$$C_{\max} = \sum_{k=1}^{K} t_k + t_0 \le \sum_{k=1}^{K} \left( \frac{\sum_{j : f(j) = M_k} w_j}{s(M_k)} + \frac{\sum_{i \in M_k} e_i s_i}{s(M_k)} \right) + \sum_{c_j \in \mathcal{C}} \frac{w_{c_j}}{s_{h(c_j)}} \qquad (6a)$$
$$\le P + \sum_{k=1}^{K} D_k + \sum_{k=1}^{K} \bar{e}_k \frac{\sum_{i \in M_k} s_i}{s(M_k)} = P + \sum_{k=1}^{K} D_k + \sum_{k=1}^{K} \bar{e}_k \le P + \sum_{k=1}^{K} D_k + C, \qquad (6b)$$
where (6a) is due to (5) and (6b) is due to (4).

5.2 Proof of Theorem 4.1

In order to apply the Separation Principle to prove Theorem 4.1, we need to prove bounds on $P$ and $\sum_{k=1}^{K} D_k$ in the case of the group assignment rule defined in Section 3. For this, we consider the scheduling problem with zero communication time. Note that the design of the group assignment function $f_{mksp}(\cdot)$ is based on the optimal solution $x^*$ of the relaxed LP for a scheduling problem with zero communication time; hence the upper bounds for both $P$ and $\sum_{k=1}^{K} D_k$ are associated with the optimal objective of the relaxed LP in the setting with zero communication time as well.

The bounds on $P$ and $\sum_{k=1}^{K} D_k$ are given in the following two lemmas, which are adapted from results in [7]. Theorem 4.1 follows directly from these two lemmas, the Separation Principle, and the fact that $T^* \le \mathrm{OPT}(i)$, where $T^*$ is the optimal solution to the LP.

Lemma 5.2. $P \le 2\gamma T^*$.
Proof. Recall that $x^*_{M',j} = \sum_{i \in M'} x^*_{i,j}$ and that $\ell_j$ is the largest group index such that at least half of task $j$ is fractionally assigned to machines in groups $M_{\ell_j}, \ldots, M_K$. For every task $j$, by the maximality of the index $\ell_j$,
$$\sum_{k=1}^{\ell_j} x^*_{M_k,j} > \frac{1}{2}. \qquad (7)$$
Thus, for any machine $i \in f(j)$,
$$\sum_{i' \in M} \frac{x^*_{i',j}}{s_{i'}} = \sum_{k=1}^{K} \sum_{i' \in M_k} \frac{x^*_{i',j}}{s_{i'}} \qquad (8a)$$
$$\ge \sum_{k=1}^{\ell_j} \sum_{i' \in M_k} \frac{x^*_{i',j}}{s_{i'}} \ge \frac{1}{2} \gamma^{-\ell_j} \qquad (8b)$$
$$\ge \frac{1}{2\gamma s_i}, \qquad (8c)$$
where (8b) is due to (7) and the fact that the processing speed of any machine $i'$ in group $M_k$ with $k \le \ell_j$ is at most $\gamma^{\ell_j}$, and (8c) is due to the fact that the processing speed of machine $i$ in group $f(j)$, whose group index is not smaller than $\ell_j$, is at least $\gamma^{\ell_j - 1}$. Using this, we can bound $P$ as follows:
$$P = \sum_{c_j \in \mathcal{C}} \frac{w_{c_j}}{s_{h(c_j)}} \le 2\gamma \sum_{c_j \in \mathcal{C}} w_{c_j} \sum_{i' \in M} \frac{x^*_{i',c_j}}{s_{i'}} \qquad (9a)$$
$$\le 2\gamma\, C^*_{c_N} \qquad (9b)$$
$$\le 2\gamma T^*, \qquad (9c)$$
where (9a) is due to (8) with $i = h(c_j)$, (9b) follows by telescoping constraints (1b) and (1c) of the LP along the chain $\mathcal{C}$, and (9c) is due to constraint (1e).

Lemma 5.3. $\sum_{k=1}^{K} D_k \le 2KT^*$.
Proof. For any task $j$, by the definition of $\ell_j$, $\sum_{k=\ell_j}^{K} x^*_{M_k,j} \ge \frac{1}{2}$. Thus,
$$\frac{1}{s(f(j))} \le \frac{2 \sum_{k=\ell_j}^{K} x^*_{M_k,j}}{s(f(j))} \le 2 \sum_{k=\ell_j}^{K} \frac{x^*_{M_k,j}}{s(M_k)} \qquad (10)$$
$$\le 2 \sum_{k=1}^{K} \frac{x^*_{M_k,j}}{s(M_k)}.$$
Inequality (10) is due to the fact that the assigned group $f(j)$ maximizes the total speed of the machines in that group among the candidates $M_{\ell_j}, \ldots, M_K$. Thus,
$$\sum_{k=1}^{K} D_k = \sum_{k=1}^{K} \frac{\sum_{j : f(j) = M_k} w_j}{s(M_k)} = \sum_{j \in V} \frac{w_j}{s(f(j))} \le 2 \sum_{j \in V} w_j \sum_{k=1}^{K} \frac{x^*_{M_k,j}}{s(M_k)} = 2 \sum_{k=1}^{K} \frac{1}{s(M_k)} \sum_{j \in V} w_j x^*_{M_k,j} \le 2 \sum_{k=1}^{K} T^* = 2KT^*. \qquad (11)$$
The total load assigned to machines in group $M_k$ is $\sum_{j \in V} w_j x^*_{M_k,j}$, while its total speed is $s(M_k)$. Multiplying constraint (1d) by $s_i$ and summing over the machines in group $M_k$ gives $\sum_{j \in V} w_j x^*_{M_k,j} \le T^* s(M_k)$, which leads to the last inequality in (11).

5.3 Proof of Proposition 4.2

We now show how the Separation Principle can be used to provide a new, simpler proof of the state-of-the-art approximation ratio of ETF in the case of identical machines. Recall that the group assignment function is not required for GETF in this case.

To prove Proposition 4.2, we use the same approach as we used for proving the Separation Principle. However, we can tighten the analysis in the final step of the argument. Specifically, the proof can be broken into three steps, instead of four:

(i) Define a terminal chain $\mathcal{C}$. This step is identical to the definition of a terminal chain in the proof of the Separation Principle.

(ii) Bound the idle time in between tasks. While the machines are identical in terms of processing speed, the communication speeds between different machine pairs are still heterogeneous, due, e.g., to the possible geolocations of the machines.

(iii) Combine (i) and (ii) to bound the overall makespan in terms of the communication time of the terminal chain.

Compared with the proof of the Separation Principle, Step (i) defines a terminal chain in exactly the same way. In Step (ii), bounding the idle time in the case of identical machines is also similar. Step (iii) requires more work. Here, we further tighten the bound by eliminating the processing time of the terminal chain to improve the constant factor.

(i) Define a terminal chain $\mathcal{C}$. This step is identical to the definition of a terminal chain in the proof of the Separation Principle.
(ii) Bound the idle time in between tasks. Let $I(c_{j-1}, c_j)$ be the time interval between the end time of task $c_{j-1}$ and the start time of $c_j$ for $j = 2, 3, \ldots, N$. As we explained in the proof of the Separation Principle, there can possibly be multiple idle time intervals on a machine during the time interval $I(c_{j-1}, c_j)$. For each machine $i \in M$, define $E(c_{j-1}, c_j, i)$ as the union of disjoint empty time intervals on machine $i$ during the time interval $I(c_{j-1}, c_j)$. For any machine $i$, the length of $E(c_{j-1}, c_j, i)$ is bounded above by the communication time between task $c_{j-1}$ and task $c_j$, i.e.,
$$|E(c_{j-1}, c_j, i)| \le \frac{w_{c_{j-1}, c_j}}{s_{h(c_{j-1}), i}} \quad \forall i \in M,\; j = 2, 3, \ldots, N.$$
Otherwise, task $c_j$ could have started earlier on machine $i$.
(iii) Bound the makespan. During the time intervals $I(c_{j-1}, c_j)$ for $j = 2, 3, \ldots, N$, at least $\sum_{j=2}^{N} \sum_{i=1}^{m} (|I(c_{j-1}, c_j)| - |E(c_{j-1}, c_j, i)|)$ units of processing are done, and this is bounded by the sum of the processing units of all the tasks except those in the terminal chain. This leads to the following bound:
$$\sum_{j=2}^{N} \sum_{i=1}^{m} \left( |I(c_{j-1}, c_j)| - |E(c_{j-1}, c_j, i)| \right) \le \sum_{j=1}^{n} w_j - \sum_{j=1}^{N} w_{c_j}. \qquad (12a)$$
Finally, applying (12a), we have
$$C_{\max} = \sum_{j=2}^{N} |I(c_{j-1}, c_j)| + \sum_{j=1}^{N} w_{c_j} \le \frac{1}{m} \sum_{j=1}^{n} w_j + \frac{m-1}{m} \sum_{j=1}^{N} w_{c_j} + \frac{1}{m} \sum_{j=2}^{N} \sum_{i=1}^{m} |E(c_{j-1}, c_j, i)| \le \left(2 - \frac{1}{m}\right) \mathrm{OPT}(i) + C'. \qquad (13a)$$
The total processing time $\sum_{j=1}^{n} w_j$ divided by the number of machines $m$ is a lower bound on the makespan, i.e., $\frac{1}{m} \sum_{j=1}^{n} w_j \le \mathrm{OPT}(i)$. At the same time, the makespan of any schedule must at least cover the processing time of any chain $\mathcal{C}$ in the DAG. These two facts lead to the last inequality in (13a).

5.4 Proof of Theorem 4.3

To establish the bound on the total weighted completion time for the group assignment rule $f_{twct}(\cdot)$, we first apply the Separation Principle to separate the requirements on communication and processing times. Second, we break the tasks into subsets based on the task completion times and, for each subset, we form an LP for those tasks alone. For each such LP, we construct a feasible solution $\tilde{x}, \tilde{C}$ and $\tilde{T}$ to bound the processing time of the tasks. The feasibility of $\tilde{x}, \tilde{C}$ and $\tilde{T}$ enables us to take advantage of Lemmas 5.2 and 5.3 with only the loss of an additional constant factor.

Given a schedule $S$ for a DAG $G$, we use the same notation as in Section 4.2, $G(S, j)$, to denote subsets of the DAG. For each DAG $G(S, j)$, there is a terminal chain $\mathcal{C}(S, j)$ with task $j$ as the ending task in the schedule $S(j)$. Similarly, define $P(S, j)$ as the sum of the processing times along the terminal chain $\mathcal{C}(S, j)$,
$$P(S, j) = \sum_{c_{j'} \in \mathcal{C}(S,j)} \frac{w_{c_{j'}}}{s_{h(c_{j'})}}, \qquad (14)$$
and let $D_k(S, j)$ denote the time to process the total load assigned to machines in group $M_k$ in the DAG $G(S, j)$,
$$D_k(S, j) = \frac{\sum_{j' :\, j' \in G(S,j),\, f(j') = M_k} w_{j'}}{s(M_k)}. \qquad (15)$$
For every DAG $G(S, j)$ associated with the schedule $S(j)$ for $1 \le j \le n$, we can apply the Separation Principle and then combine these inequalities as follows:
$$\sum_j \omega_j C_j \le \sum_j \omega_j \left( P(S, j) + \sum_k D_k(S, j) \right) + \sum_j \omega_j C(S, j).$$
Both $P(S, j)$ and $D_k(S, j)$ are independent of the communication constraints, which enables us to take advantage of any group assignment rule.

Using the group assignment rule $f_{twct}(\cdot)$ helps further tighten the bound. To show this, we first divide the $n$ tasks into $Q$ sets based on $q(j)$, which can be viewed as a rough estimate of the completion time of task $j$. For the $q$th interval, we define $J_q$ as the set of tasks such that $q(j) = q$:
$$J_q = \{ j : q(j) = q \}.$$
In this way, we have divided the $n$ tasks into $Q$ sets: $J_1, J_2, \ldots, J_Q$.

Next, for $1 \le q \le Q$, we construct a set of feasible solutions for LP (1), $\tilde{x}, \tilde{C}$ and $\tilde{T}$, for every set of tasks in $J_q$, based on the optimal solution of LP (2), i.e., $x^*$ and $C^*$. Note that $\tilde{x}$ here is the same as in equation (3). Since precedence constraints are preserved in the constraints of the LPs, we can concatenate these schedules together to obtain a feasible schedule for all of the tasks.
Lemma 5.4. Consider the set of tasks $J_q$ for a fixed $q$. A feasible solution for LP (1) is defined by
$$\tilde{x}_{i,j} = \sum_{t=1}^{q} \frac{x^*_{i,j,t}}{\alpha_j} \quad \forall i,\; j \in J_q, \qquad (17a)$$
$$\tilde{C}_j = 2 C^*_j \quad \forall j \in J_q, \qquad (17b)$$
$$\tilde{T} = 2^{q+1}. \qquad (17c)$$
Proof. To show feasibility of this candidate solution, we verify that $\tilde{x}, \tilde{C}$ and $\tilde{T}$ satisfy all the constraints in LP (1). Substituting $\tilde{x}$ into the left side of constraint (1a) for any task $j \in J_q$, it is clear that $\sum_i \tilde{x}_{i,j} = 1$. To validate that constraint (1b) is satisfied, note that $\alpha_j \ge 1/2$ by definition, and so a direct substitution on the left hand side yields the right hand side due to (2b). Similarly, constraint (2c) ensures that constraint (1c) is satisfied, and constraint (2f) ensures that constraint (1d) is satisfied. Finally, we have $C^*_j \le 2^q$ by the definition of $q(j)$, and thus constraint (1e) holds.

Due to the similarity between the group assignment rules $f_{mksp}(\cdot)$ and $f_{twct}(\cdot)$, we can further tighten the bound using Lemmas 5.2 and 5.3 from Section 5.2 directly. Combining Lemmas 5.2 and 5.4, we conclude that the total load along any chain $\mathcal{C}$ in the DAG formed by $J_q$ is upper bounded by
$$\sum_{j \in \mathcal{C}} \frac{w_j}{s_{h(j)}} \le 2\gamma \tilde{T} = 2\gamma \cdot 2^{q+1}.$$
Next, since the terminal chain $\mathcal{C}(S, j)$ can be represented as a concatenation of chains in the DAGs formed by the tasks in $J_q$ for $1 \le q \le q(j)$, we have
$$P(S, j) \le \sum_{t=1}^{q(j)} 2\gamma \cdot 2^{t+1} \le 8\gamma \cdot 2^{q(j)}.$$
Similarly, combining Lemmas 5.3 and 5.4,
$$\sum_{j \in J_q} \frac{w_j}{s(f_{twct}(j))} \le 2K\tilde{T} = 2K \cdot 2^{q+1}.$$
The left side can be viewed as $\sum_k D_k$ for the DAG formed by the tasks in $J_q$. Since the tasks in DAG $G(S, j)$ form a subset of $\cup_{t=1}^{q(j)} J_t$, the following inequality holds:
$$\sum_k D_k(S, j) \le \sum_{t=1}^{q(j)} \sum_{j' \in J_t} \frac{w_{j'}}{s(f_{twct}(j'))} \le \sum_{t=1}^{q(j)} 2K \cdot 2^{t+1} \le 8K \cdot 2^{q(j)},$$
which immediately yields
$$P(S, j) + \sum_k D_k(S, j) \le 8(\gamma + K) \cdot 2^{q(j)}.$$
Finally, the remaining piece of the proof is to upper bound $2^{q(j)}$ by a multiplicative factor of the optimal completion time $C^*_j$ in LP (2). By the minimality of $q(j)$, for task $j$ either
$$\sum_{t=1}^{q(j)-1} \sum_i x^*_{i,j,t} < \frac{1}{2} \qquad (20)$$
or
$$C^*_j > 2^{q(j)-1}. \qquad (21)$$
If inequality (20) holds, then
$$2^{q(j)-1} = \tau_{q(j)-1} \le 2\tau_{q(j)-1} \sum_{t=q(j)}^{Q} \sum_i x^*_{i,j,t} \qquad (22a)$$
$$\le 2 \sum_{t=q(j)}^{Q} \tau_{t-1} \sum_i x^*_{i,j,t} \le 2 \left( \sum_t \tau_{t-1} \sum_i x^*_{i,j,t} \right) \le 2 C^*_j. \qquad (22b)$$
Inequality (22a) is due to (20), which implies $\sum_{t=q(j)}^{Q} \sum_i x^*_{i,j,t} \ge 1/2$, and constraint (2e) of LP (2) leads to (22b). If inequality (21) is true, then $2^{q(j)-1} < C^*_j \le 2C^*_j$. In both cases, $2^{q(j)-1}$ is upper bounded by $2C^*_j$. Thus, we achieve
$$P(S, j) + \sum_k D_k(S, j) \le 32(\gamma + K) \cdot C^*_j.$$
Since $\sum_j \omega_j C^*_j$ is at most $\mathrm{wOPT}(i)$, we conclude that
$$\sum_j \omega_j C_j \le \sum_j \omega_j \left( P(S, j) + \sum_k D_k(S, j) \right) + \sum_j \omega_j C(S, j) \qquad (23a)$$
$$\le 32(\gamma + K) \sum_j \omega_j C^*_j + \sum_j \omega_j C(S, j) \qquad (23b)$$
$$\le O(\log m/\log\log m) \cdot \mathrm{wOPT}(i) + \sum_j \omega_j C(S, j), \qquad (23c)$$
which completes the proof.

6 Concluding remarks

This paper studies the problem of scheduling tasks with precedence constraints on related machines with machine-dependent communication times, and addresses two long-standing open problems in the area. We introduce a new scheduler, GETF, and prove worst-case approximation ratios for it in the case of (i) scheduling to minimize the makespan and (ii) scheduling to minimize the total weighted completion time. These results represent the first progress on this problem in the 30 years since [8] provided a bound on the makespan under ETF in the case of identical servers and communication time.
No previous bounds exist for the case of total weighted completion time when communication time is considered.

A variety of open questions are raised by the work in this paper. Most importantly, while we have provided theoretical bounds on the performance of GETF, it is also important to investigate how GETF performs in real settings via an implementation study. GETF could be particularly powerful in the context of large-scale machine learning platforms, where workflows are typically specified as DAGs. As part of such a study, it would be interesting to understand how to best choose a tie-breaking rule, how to adjust the group assignment rules for the best performance, and how various choices for these rules compare with heuristics that have been suggested in the literature. Further, it will be important to see if it is possible to obtain theoretical results characterizing how the optimal choices for these rules depend on properties of real-world workloads. Moreover, it will also be interesting to extend the results of this work to stochastic settings, e.g., when task sizes are unknown.

On the analytic side, it will be interesting to discover other applications of the Separation Principle. It may be possible to revisit other scheduling problems for precedence-constrained tasks and obtain more general results because of the separation this result provides. Further, it is possible to consider other performance measures, such as energy usage and resource augmentation, using the Separation Principle.
References

[1] Edward Grady Coffman and John L. Bruno. Computer and Job-Shop Scheduling Theory. John Wiley & Sons, 1976.
[2] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[3] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[4] David Chappell. Introducing Azure Machine Learning. A guide for technical professionals, sponsored by Microsoft Corporation, 2015.
[5] Ronald L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416–429, 1969.
[6] Fabián A. Chudak and David B. Shmoys. Approximation algorithms for precedence-constrained scheduling problems on parallel machines that run at different speeds. Journal of Algorithms, 30(2):323–343, 1999.
[7] S. Li. Scheduling to minimize total weighted completion time via time-indexed linear programming relaxations. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 283–294, October 2017.
[8] Jing-Jang Hwang, Yuan-Chieh Chow, Frank D. Anger, and Chung-Yee Lee. Scheduling precedence graphs in systems with interprocessor communication times. SIAM Journal on Computing, 18(2):244–257, 1989.
[9] Ganesh Ananthanarayanan, Michael Chien-Chun Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, and Minlan Yu. GRASS: Trimming stragglers in approximation analytics. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 289–302, Seattle, WA, 2014. USENIX Association.
[10] Xiaoqi Ren, Ganesh Ananthanarayanan, Adam Wierman, and Minlan Yu. Hopper: Decentralized speculation-aware cluster scheduling at scale. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM '15, pages 379–392, New York, NY, USA, 2015. ACM.
[11] M.-Y. Wu and Daniel D. Gajski. Hypertool: A programming aid for message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 1(3):330–343, 1990.
[12] Yuming Xu, Kenli Li, Ligang He, Longxin Zhang, and Keqin Li. A hybrid chemical reaction optimization scheme for task scheduling on heterogeneous computing systems. IEEE Transactions on Parallel and Distributed Systems, 26(12):3208–3222, 2015.
[13] Tao Yang and Apostolos Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951–967, 1994.
[14] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260–274, 2002.
[15] Fuhui Wu, Qingbo Wu, and Yusong Tan. Workflow scheduling in cloud: a survey. The Journal of Supercomputing, 71(9):3373–3418, 2015.
[16] Ruben Mayer, Christian Mayer, and Larissa Laich. The TensorFlow partitioning and scheduling problem: it's the critical path! In Proceedings of the 1st Workshop on Distributed Infrastructures for Deep Learning, pages 1–6. ACM, 2017.
[17] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective straggler mitigation: Attack of the clones. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 185–198, 2013.
[18] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. Reining in the outliers in map-reduce clusters using Mantri. In 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI 10), Vancouver, BC, 2010. USENIX Association.
[19] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, et al. Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, page 5. ACM, 2013.
[20] Minghong Lin, Li Zhang, Adam Wierman, and Jian Tan. Joint optimization of overlapping phases in MapReduce. Performance Evaluation, 70(10):720–735, 2013.
[21] Balaji Palanisamy, Aameek Singh, Ling Liu, and Bhushan Jain. Purlieus: Locality-aware resource allocation for MapReduce in a cloud. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, page 58. ACM, 2011.
[22] Jian Tan, Xiaoqiao Meng, and Li Zhang. Delay tails in MapReduce scheduling. ACM SIGMETRICS Performance Evaluation Review, 40(1):5–16, 2012.
[23] Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell. Two sides of a coin: Optimizing the schedule of MapReduce jobs to minimize their makespan and improve cluster performance. In 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 11–18. IEEE, 2012.
[24] Weina Wang, Kai Zhu, Lei Ying, Jian Tan, and Li Zhang. MapTask scheduling in MapReduce with data locality: Throughput and heavy-traffic optimality. IEEE/ACM Transactions on Networking, 24(1):190–203, 2016.
[25] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI '08, pages 29–42, Berkeley, CA, USA, 2008. USENIX Association.
[26] Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy H. Campbell. TicTac: Accelerating distributed deep learning with communication scheduling. CoRR, abs/1803.03288, 2018.
[27] Jan Karel Lenstra and A. H. G. Rinnooy Kan. Complexity of scheduling under precedence constraints. Operations Research, 26(1):22–35, 1978.
[28] Ola Svensson. Conditional hardness of precedence constrained scheduling on identical machines. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 745–754. ACM, 2010.
[29] Nikhil Bansal and Subhash Khot. Optimal long code test with one free bit. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 453–462. IEEE, 2009.
[30] Leslie A. Hall, David B. Shmoys, and Joel Wein. Scheduling to minimize average completion time: Off-line and on-line algorithms. In SODA, volume 96, pages 142–151, 1996.
[31] Alix Munier, Maurice Queyranne, and Andreas S. Schulz. Approximation bounds for a general class of precedence constrained parallel machine scheduling problems. In International Conference on Integer Programming and Combinatorial Optimization, pages 367–382. Springer, 1998.
[32] Maurice Queyranne and Andreas S. Schulz. Approximation bounds for a general class of precedence constrained parallel machine scheduling problems. SIAM Journal on Computing, 35(5):1241–1253, 2006.
[33] Abbas Bazzi and Ashkan Norouzi-Fard. Towards tight lower bounds for scheduling problems. In Algorithms – ESA 2015, pages 118–129. Springer, 2015.
[34] Maciej Drozdowski. Scheduling for Parallel Processing. Springer, 2009.
[35] Yu-Kwong Kwok and Ishfaq Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 31(4):406–471, 1999.