A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
Fareed Qararyah, Mohamed Wahib, Doğa Dikbayır, Mehmet Esat Belviranli, Didem Unat
Fareed Qararyah
Koç University, Turkey, [email protected]
Mohamed Wahib
National Institute of Advanced Industrial Science and Technology, Japan, [email protected]
Doğa Dikbayır
Michigan State University, USA, [email protected]
Mehmet Esat Belviranli
Colorado School of Mines, USA, [email protected]
Didem Unat
Koç University, Turkey, [email protected]
Abstract
We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for large DNN models that do not fit into single device memory. ParDNN decides a placement of the DNN's underlying computational graph operations across multiple devices so that the devices' memory constraints are met and the training time is minimized. ParDNN is completely independent of the deep learning aspects of a DNN and requires no modification at either the model level or the systems-level implementation of operation kernels. It partitions DNNs having billions of parameters and hundreds of thousands of operations in seconds to a few minutes. Our experiments with TensorFlow on 16 GPUs demonstrate efficient training of 5 very large models while achieving superlinear scaling for both the batch size and training throughput. In comparison to related work (Mesh-TensorFlow and Gradient Checkpointing), ParDNN either outperforms or qualitatively improves upon them.
Keywords
DNN, graph partitioning, model parallelism
Introduction

DNN models have been doubling in size at a rapid pace. Deeper or wider models, or both, produce results with higher accuracy on more complex tasks. However, they come at a high memory cost required to store the parameters and the intermediate results for both training and inference [53]. For example, in computer vision, considering Wide Residual Network [70], a widened variant of the well-known Resnet [19], widening the model 8 times increases the number of parameters ∼60 times [70], leading to a substantial increase in the memory requirements. The same trend shows up in the NLP field, where deep-stacked LSTMs [66] or attention layers [61] often give more accurate results compared to shallower models, but these newer models push the number of parameters up to O(10B) [50, 55].

Different approaches have been proposed to tackle the issue of training very large models on multiple devices. One approach is to work at the model level, where the model is partitioned across multiple devices through model, pipeline, or channel parallelism, or combinations of them [12, 15, 21, 27, 28, 43, 55]. Even though these methods are successful to some extent, they suffer from one of the following: (a) they are not generic, as they target a specific class of DNNs, (b) they introduce non-negligible memory overhead to maintain statistical efficiency, or (c) they can incur a high implementation cost and necessitate a detailed understanding of the DNN model for an accurate cost model. Another approach works at the systems level by partitioning the computational graph that represents the operations in a neural network model and distributing it over multiple devices. However, the method proposed in [64] has restricted applicability because it relies on a descriptive language to specify computations and cannot describe all the operations used in DL. Others propose a reinforcement learning-based approach, which is impractical in many cases due to substantial resource and time requirements [41, 42].

We adopt the system-level approach and propose a generic, efficient, and non-intrusive partitioning strategy (ParDNN) that avoids the drawbacks of the related work. ParDNN works directly on the computational graph representation of the neural network adopted by the most popular general-purpose DL frameworks, such as TensorFlow [1] and MXNet [9]. Operating at the graph level has three main benefits. First, it provides a fine-grained view of the model, which gives more parallelization options and allows better load balancing and resource utilization. Second, it isolates our strategy from the details of the learning process, which provides more generality and guarantees unaffected statistical efficiency [43] of the model. Third, working at the level of the graph enables us to leverage decades of work on graph partitioning and static scheduling (as will be discussed later).

ParDNN's strategy is composed of two main steps. First, we cluster the operation-nodes of the computational graph into K partitions, where K is the number of available devices. The objective of this step is to reduce the end-to-end runtime by assigning the operations to the partitions such that the computational loads are balanced and the communication is minimized. In the second step, we check whether the memory constraints are met in each partition. If they are not, we reassign some operations to different partitions such that the reassigned operations have the least possible perturbing effect on the placement generated by the first step while meeting the memory constraints.

Most existing graph partitioning libraries are designed to handle undirected graphs. State-of-the-art graph partitioning tools, such as the Scotch static mapper [45, 46] and the MinCut optimizer, result in a 2 to 10 times slowdown when applied to the directed graphs of DL models [41, 42].
Our algorithm outline is inspired by the principle of the multilevel approach used in graph partitioning [30], but the design and algorithmic details of ParDNN include a mix of variants of static scheduling heuristics [31] that are mutated to reduce the time complexity, along with novel techniques that address shortcomings in the existing ones [40, 49]. Our contributions are:
• We propose a novel computational graph partitioning method that enables training models with large memory consumption on a set of devices with limited memory.
• We conduct extensive experiments with large DNNs to demonstrate ParDNN's efficiency. In comparison to related work: (a) ParDNN's performance is comparable to that of Mesh-TensorFlow, a state-of-the-art distributed training framework [54], while having the qualitative advantages of automating the partitioning and not requiring a model rewrite. (b) It generally outperforms redundant-recomputation methods (Gradient Checkpointing [47]). (c) It outperforms out-of-core methods (CUDA Unified Memory).
• For models that do not fit into a single GPU's memory, ParDNN enables training of very large models with billions of parameters.
• ParDNN's overhead is negligible. For a graph having hundreds of thousands of nodes representing DNNs with billions of parameters, it takes ∼2 minutes.
• To the best of our knowledge, ParDNN is the first of its type that permits the training of models that do not fit into a single device's memory while being generic, due to (a) having zero dependency on, and requiring no knowledge about, the DL aspects of the models, and (b) not requiring any modifications of the model or operation kernels.
Background

Many DL frameworks model a computation as a directed graph [1, 5, 9]. TensorFlow uses a stateful directed graph to represent the computational flow of operations. It extends the classical dataflow graph model to allow maintaining and updating the persistent state of some special nodes, branching, and loop control. In a TensorFlow graph G = (V, E), each node n ∈ V represents the instantiation of an operation (e.g., matrix multiplication or convolution) and has zero or more inputs and zero or more outputs. Each edge e ∈ E represents a dependency between its incident nodes. Normal edges represent dataflow between the nodes, while special edges, e.g., control dependencies, are used to enforce happens-before relationships with no data flowing along them [1].

Figure 1. ParDNN overview.

Graph partitioning is, in general, defined as splitting the graph G(V, E) into K disjoint subsets [6]. The constrained version of graph partitioning aims at partitioning in such a way that the sums of the vertex weights in each set are as equal as possible, while the sum of the weights of the edges crossing between sets is minimized [30]. An extension of general graph partitioning that aims to assign a set of communicating tasks to processors is called static mapping [6]. Static mapping does not consider the logical and temporal dependencies of the tasks; it is assumed that all the tasks simultaneously coexist throughout the program execution. Finding a spatial and temporal assignment of the set of nodes in a task graph G = (V, E) onto a set of processors resulting in the fastest possible execution, while respecting the precedence constraints expressed by all e ∈ E, is referred to as the task scheduling problem [56]. The schedule length, or makespan, is the completion time ($C_t$) of the last node in G, assuming that the graph execution starts at time 0. The goal is to minimize $C_{t_{max}}$, where $C_{t_{max}} = \max_{n \in V} C_t(n)$. Finding an optimal schedule or static mapping is NP-hard [6, 56].
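To make the cost model concrete, the following minimal sketch (ours, not ParDNN code; all names are illustrative) shows the annotated task-graph representation and the makespan definition above:

```python
# A minimal sketch (illustrative, not ParDNN code) of the annotated task graph:
# nodes carry computation costs comp(n), edges carry communication costs comm(e),
# and the makespan C_tmax is the completion time of the last node in G.

comp = {"A": 2, "B": 3, "C": 1}        # comp(n): compute time of node n
comm = {("A", "B"): 4, ("A", "C"): 1}  # comm(e): communication time of edge e

def makespan(completion_time):
    """C_tmax = max over n in V of C_t(n)."""
    return max(completion_time.values())

# completion times produced by some hypothetical schedule respecting precedence
print(makespan({"A": 2, "B": 9, "C": 4}))  # -> 9
```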
ParDNN

ParDNN offers a practical, non-intrusive, and generic method to partition a DNN on a set of processing elements (PE). The main objective of ParDNN is to minimize $C_{t_{max}}$, the makespan of the computational graph, while satisfying the memory capacity constraints of the target processing elements. It is important to mention that ParDNN does not have a runtime component. All the steps of ParDNN are done ahead of time. After running ParDNN once, the resulting partitioning can be used as long as the model parameters that affect the memory consumption do not change.

Table 1. Terminology used in this work

G = (V, E): Computational graph with vertex set V, edge set E
CP: Critical path of a graph
$C_{t_{max}}$: Makespan of G, schedule length
PE, pe: Set of processing elements, a processing element
K: Number of processing elements
comp(n): Weight of a node n (compute time)
mem(n): Memory consumption of the outputs of a node n
comm(e): Cost of an edge e (communication time)
sc: Secondary cluster, which is a node or a path
comm(sc): Total communication cost incurred by all edges that have one end in sc
tl(n): Node top level, the length of the costliest path between the source node of the graph and node n, excluding node n. The length of a path p is the sum of the computation costs of its nodes and the communication costs of its edges: $\sum_{n \in p} comp(n) + \sum_{e \in p} comm(e)$
bl(n): Node bottom level, the length of the costliest path between n and the sink node, including node n
w_lvl(n): Node weighted level, tl(n) + bl(n)
span(sc): Time between the expected finish time of the last parent of the first node in sc and the expected starting time of the first child of the last node in that path (last and first topologically)
potential(sc): Summation of the weights of all nodes that can be executed within span(sc)
st(n): Starting time of node n, the time when n is assigned to a pe to execute
ft(n): Finish time of node n, the time when the pe is done executing n
$M_{cons}(pe, t)$: Memory consumed by the processing element at time t
$M_{pot}(n, t)$: Memory potential of a node n at time t, the summation of the memory occupied by the outputs of n's direct ancestors that are executed before t and for which n is the last direct descendant in its pe, plus n's memory consumption if st(n) ≤ t ≤ ft(n)

Figure 1 shows the overall process. ParDNN takes a computational directed acyclic graph as input and annotates this graph with computation, communication, and memory consumption information gathered using offline profiling. ParDNN splits the graph into parts to be mapped to processing elements, and outputs the mapping information to be used by the execution engine of the DL framework (e.g., TensorFlow).

Our algorithm is divided into two major steps. Step-1 aims to obtain a partitioning that has a minimal makespan, and is further divided into three stages. Stage-I, graph slicing, splits the graph into K disjoint primary and S disjoint secondary clusters. This splitting enables working at a coarser level in the upcoming stages. Stage-II, mapping, merges these S secondary clusters into the K primary clusters by first merging the clusters that have no parallelism gain, then merging the rest using a novel load balancing algorithm. The final stage of Step-1 is a refinement of the mapping through path swapping and node switching. In Step-2, the result of Step-1 is validated against the memory constraints of the given devices; if the constraints are satisfied, the partition is the final output. Otherwise, the partition is refined until the memory consumption of each processing element pe at any time t ∈ [0, $C_{t_{max}}$] is less than or equal to pe's memory capacity. Next, we explain the details of each step. Table 1 summarizes the terms and notation used in the explanations.
Algorithm 1 Graph Slicing
In: K, Graph G    ▷ number of devices, DNN graph
Out: pri_clusters[], sec_clusters[]    ▷ initially empty
 1: j ← 1
 2: w_lvls ← compute_weighted_levels(G)
 3: while G ≠ ϕ do
 4:     heaviest_path ← find_heaviest_path(G, w_lvls)
 5:     if j ≤ K then
 6:         pri_clusters[j] ← heaviest_path
 7:         w_lvls ← compute_weighted_levels(G)
 8:     else
 9:         sec_clusters[j − K] ← heaviest_path
10:     end if
11:     G ← G − {heaviest_path}
12:     j ← j + 1
13: end while

Before presenting the details of this step, it is important to point out its distinction from both static task scheduling and static mapping. Unlike scheduling algorithms, we do not specify an order of task execution; we rather focus on spatially allocating tasks on a set of processors while addressing the locality-parallelism trade-off. The order-of-execution decision is left to the runtime dynamic scheduler, e.g., the TensorFlow scheduler. Unlike static mapping, ParDNN considers the logical and temporal dependencies between the tasks.

The size (|V|) of a DNN's computational graph is usually in the order of hundreds of thousands of nodes and is projected to grow to millions of operation-nodes [50]. To make Step-1 scalable, we follow the concept of the multilevel method [6], where we group vertices together and deal with groups of vertices rather than individuals. This reduces the problem size and allows our heuristics to be applied within a reasonable time. Step-1 is designed in three main stages.

Stage-I: Graph slicing. This stage groups the nodes of the graph into disjoint clusters. It iteratively finds the critical path (CP) in the graph and removes the CP's nodes and their incident edges from the graph by marking them as visited, so that they are not explored in the following iterations. This is repeated K times, resulting in K primary clusters, which are the initial partitions assigned to different processing elements. Hence, the terms primary cluster and pe are used interchangeably. After finding those primary clusters, if there are leftover nodes, we group them into secondary clusters. A secondary cluster, which is a linear cluster [56], is either a single node or a path. All the secondary clusters are identified and tagged until no node is left in the graph that is not part of some cluster. Figure 2(b) shows an example.

Algorithm 1 shows the pseudo-code of the graph slicing, which takes K and graph G as inputs and outputs primary and secondary clusters. Line 2 computes the weighted level (w_lvl(n)) for all nodes in the graph. The heaviest path (Line 4) is the CP when the w_lvl(n) values are recalculated. Finding the heaviest path is done by traversing the graph, using the computed w_lvls as priorities, until reaching a dead end. After a CP is formed, it is added to the primary clusters, and its nodes and edges are removed from the graph (Line 11). Unlike linear clustering [31], we obtain only K many CPs and then stop recalculating w_lvl(n) for the secondary clusters, since computing weighted levels is expensive. When weighted levels are not recalculated, find_heaviest_path may not return a CP; it rather returns a path of heavy cost. This aims at capturing dependent and heavily-communicating nodes in the same cluster to increase locality. If a path cannot be obtained, it returns a single node.
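For illustration, the following Python sketch shows one plausible realization of the weighted-level computation and the greedy heaviest-path descent described above. It is a simplified reading of Algorithm 1, not ParDNN's implementation, and all function and variable names are ours:

```python
import collections

def weighted_levels(topo, succ, comp, comm):
    """Compute tl(n), bl(n), and w_lvl(n) = tl(n) + bl(n) over a DAG.
    topo: nodes in topological order; succ: node -> iterable of children."""
    pred = collections.defaultdict(list)
    for n in topo:
        for c in succ.get(n, ()):
            pred[c].append(n)
    tl, bl = {}, {}
    for n in topo:            # forward sweep: top levels exclude n itself
        tl[n] = max((tl[p] + comp[p] + comm[p, n] for p in pred[n]), default=0)
    for n in reversed(topo):  # backward sweep: bottom levels include n
        bl[n] = comp[n] + max((comm[n, c] + bl[c] for c in succ.get(n, ())),
                              default=0)
    return pred, {n: tl[n] + bl[n] for n in topo}

def heaviest_path(topo, succ, pred, w_lvl, removed):
    """Descend from the heaviest unremoved source, always following the
    unremoved child with the largest weighted level, until a dead end."""
    sources = [n for n in topo
               if n not in removed and all(p in removed for p in pred[n])]
    if not sources:
        return []
    n, path = max(sources, key=w_lvl.get), []
    while n is not None:
        path.append(n)
        nxt = [c for c in succ.get(n, ()) if c not in removed]
        n = max(nxt, key=w_lvl.get) if nxt else None
    return path
```

Slicing then repeats find-and-remove: each extracted path is marked removed, and, per Algorithm 1, the weighted levels are recomputed only for the first K iterations.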
Stage-II: Mapping. This stage attaches the secondary clusters to the primaries with the objective of balancing the load among partitions and reducing communication. First, initial merging is applied to those secondary clusters for which parallel execution is not advantageous. For example, in Figure 2(c), the cluster {J} is merged with a primary because the total amount of communication (comm) incurred by the nodes in cluster {J} cannot be covered by its potential({J}). Intuitively, the potential of a cluster measures how much parallel work exists at the time of the cluster's execution and whether or not that work is sufficient to completely hide its communication. In other words, comm({J}) − potential({J}) > 0. Such a cluster is merged with the primary cluster with which it communicates the most.

Second, we apply a level-aware load balancing technique in which we merge the secondary clusters that were not merged by the initial merging. This process is referred to as cluster mapping in the scheduling literature. Several heuristics exist for it, such as wrap cluster merging [67], list-scheduling based cluster assignment [52], and Guided Load Balancing (GLB) [49]. In a comprehensive evaluation of scheduling and cluster merging algorithms [62], GLB is shown to produce the best results. However, GLB assumes that its preceding clustering step has eliminated the largest communication delays. As a result, communication delays are not considered for cluster mapping [49]. Ignoring communication cost results in a low-quality mapping when the graph becomes very large: even if each inter-cluster communication is small, the cumulative effect becomes considerable. In addition, GLB's load balancing is global rather than time-dependent (temporal). This issue is demonstrated in Figure 2(d) and (e): ignoring the temporal aspect of load balancing causes GLB to make a locally sub-optimal decision for cluster {D}, assigning it to the less loaded pe even though that pe has more work within span({D}). This assignment pattern is particularly unsuitable for TensorFlow graphs with frequent forks and joins, where the local and global loads become less correlated.

We propose a novel time-efficient heuristic called Level-Aware Load Balancing (LALB). LALB considers both communication minimization and temporal load balancing.
Algorithm 2 Mapping
InOut: pri_clusters[]    ▷ primary clusters
In: sec_clusters[]    ▷ secondary clusters
 1: for sc ∈ sec_clusters do
 2:     if comm(sc) − potential(sc) > 0 then
 3:         target ← find_most_comm(sc, pri_clusters)
 4:         target ← target + {sc}
 5:         sec_clusters ← sec_clusters − {sc}
 6:     end if
 7: end for
 8: comps[] ← ϕ, comms[] ← ϕ
 9: while sec_clusters ≠ ϕ do
10:     sc ← remove_next_secondary(sec_clusters)
11:     comps ← calc_work_at_span(span(sc), pri_clusters)
12:     comms ← calc_comms_with(sc, pri_clusters)
13:     target ← find_minimal(comps, comms)
14:     target ← target + {sc}
15: end while

We temporally balance the loads by considering the workload of every pe within span(sc), where sc is the secondary cluster that is going to be merged with one of the primary clusters. sc is mapped to a pe that has the minimal computational load within span(sc) and minimizes the incurred communication with the other processing elements. Equation (1) shows the selection criterion; in case of ties, we assign sc to the pe that has the highest communication value with it.

$$\min_{pe \in PE} \Bigg( \sum_{\substack{n \in pe \\ tl(n) \in span(sc)}} comp(n) \;+\; \sum_{\substack{(n,u) \in E \\ n \in \{PE\}-pe,\; u \in sc}} comm(n,u) \;+\; \sum_{\substack{(u,n) \in E \\ n \in \{PE\}-pe,\; u \in sc}} comm(u,n) \Bigg) \qquad (1)$$

Algorithm 2 shows the two mapping procedures. Lines 1-7 do the initial merging of the secondary clusters into the primaries. The while loop (Line 9) applies our novel load balancing. Most of the time is spent calculating the work within the span of the target secondary cluster sc (Line 11), in each of the primary clusters. We model this part as a problem of frequent range queries with updates. More specifically, we find the sum of the weights of the nodes whose levels fall in the span, and upon merging, the weights of those levels are updated. We use binary indexed trees [13] as the data structure, where the tree nodes store the weights per level. This data structure allows logarithmic-time range summation and value updates (a sketch follows the refinement discussion below). Line 12 calculates the cost of communication between the secondary cluster sc and each of the primary clusters. Line 13 applies the selection criterion defined in Eqn (1) to select the best primary cluster to merge sc with.

Stage-III: Refinement. This stage refines the partitioning with two refinement policies. The first is responsible for coarse-grained refinement at the secondary-cluster level, and the second does fine-grained refinement at the node level. The first policy searches for a secondary cluster sc for which there is another secondary cluster sc′ within its span that, when swapped with sc, results in a better quality partitioning. The better quality comes from enhancing the load balance, reducing the total communication, or both.
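The range-query structure mentioned in the mapping stage can be sketched as follows. This is a standard binary indexed tree [13]; the per-level weight bookkeeping shown here is our illustration, not ParDNN's code, and in practice one such tree would be kept per primary cluster:

```python
class FenwickTree:
    """Binary indexed tree: point updates and prefix/range sums in O(log n).
    Tree nodes store the total comp() weight merged at each graph level, so the
    work of a primary cluster within span(sc) is a single range query."""
    def __init__(self, n_levels):
        self.n = n_levels
        self.tree = [0] * (n_levels + 1)

    def add(self, level, weight):   # merge a node's weight at its level
        i = level + 1
        while i <= self.n:
            self.tree[i] += weight
            i += i & -i

    def _prefix(self, i):           # sum over levels [0, i)
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def range_sum(self, lo, hi):    # total work at levels [lo, hi]
        return self._prefix(hi + 1) - self._prefix(lo)

# querying the work of one primary cluster within a secondary cluster's span
work = FenwickTree(n_levels=1000)
work.add(level=3, weight=5.0)
work.add(level=10, weight=2.5)
print(work.range_sum(0, 10))  # -> 7.5
```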
Figure 2. In the computational graph, vertex and edge weights indicate computation and communication costs, respectively. (a) Original computational graph; source node (A) and sink node (L) are added by ParDNN. (b) The slicing stage when there are two pe(s). Clusters are found in the following order: {A, B, E, G, I, K, L}, {C, F}, {J}, {D}, {H}. The first two are primary clusters; the other three are secondaries. (c) Cluster {J} is merged to a primary cluster in initial merging, since it has a communication of 5 units that cannot be hidden by the work within its span. (d) and (e) show the LALB (ours) and GLB load balancing algorithms, respectively, after initial merging. The makespan of the LALB output is 13 units, while GLB's is 15; thus LALB results in a 15% performance gain.

The second policy handles a general issue with CP-based heuristics that is discussed in [40]. This issue arises on the communication edges among the processing elements. When there are many costly communicating operations in G, some of them may fall outside the CP. If their effect is large enough, they will create heavier CPs in the partitioned graph. Note that the CP of the graph after partitioning is probably different from the original CP. For example, in Figure 2(a) the CP is initially {A, B, E, G, I, K, L}, but after partitioning, the CP becomes {A, C, F, G, I, K, L}, as in Figure 2(d). This is because the communication between nodes in the same cluster is assumed to be zero. We repeatedly find the CP in the partitioned graph, as in Algorithm 1. Then we check the edges in that path that connect two different primaries; if moving a node incident to any of these edges to another primary shortens the CP, we switch that node's primary. This process could be repeated as long as the CP can be shortened, but since the w_lvls need to be recalculated each time, we choose to do it at most K times.

Step-2: Satisfying memory constraints. Similar to Step-1, we handle the memory constraints statically, ahead of time, for two reasons: (a) to avoid any runtime overhead and (b) to reduce the chance of conflicting with other runtime optimizations. Our approach is implemented separately from the memory management module of the DL framework and can be seamlessly used with any dynamic optimization policies provided by the framework. Step-2 is further divided into three stages.
To address the memory consumption statically, temporal modeling of the allocation and deallocation patterns is required. Such modeling necessitates knowledge about scheduling in the DL framework to estimate when an operation is going to start and finish execution, and consequently, when the memory allocated for the operation's inputs is released and when new memory is allocated to hold the operation's outputs. To estimate those values, we emulate the TensorFlow scheduler. The TensorFlow scheduler maintains a ready queue that initially contains the nodes with no ancestors. Each node has an in-degree representing the number of nodes it depends on. The nodes are executed in FIFO order. Once a node is executed, the in-degrees of its children are decremented by one, and any node whose in-degree reaches zero is pushed to the queue. Using the per-node running times and communication sizes collected from profiling, we emulate this behavior to get the expected start and end times of the operations under a given partitioning.
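A minimal sketch of such an emulation is shown below. It assumes a single FIFO ready queue and ignores per-device queues and communication delays, so it is a simplification of both TensorFlow's scheduler and ParDNN's emulator:

```python
import collections

def emulate_fifo_scheduler(nodes, succ, duration):
    """Simplified FIFO emulation: nodes with in-degree zero enter a ready
    queue; executing a node decrements its children's in-degrees. Returns
    the estimated start and finish time of every node."""
    indeg = {n: 0 for n in nodes}
    for n in nodes:
        for c in succ.get(n, ()):
            indeg[c] += 1
    ready = collections.deque(n for n in nodes if indeg[n] == 0)
    clock, start, finish = 0.0, {}, {}
    while ready:
        n = ready.popleft()      # FIFO order
        start[n] = clock
        clock += duration[n]     # profiled per-node running time
        finish[n] = clock
        for c in succ.get(n, ()):
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return start, finish
```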
In TensorFlow, from the memory consumption perspective, operation-nodes broadly fall into three main categories. First, operations whose data survives across iterations [1]; we refer to them as residual nodes (res_ns). Second, special operations that mutate a referenced tensor of the first type; we refer to them as reference nodes (ref_ns). These operations do not reserve any additional memory; however, they are co-located with the variables that they mutate and must be moved together with their referred variable nodes. Third, operations that require additional memory proportional to their output size, which we call normal nodes (nor_ns). Memory for the output of these nodes is allocated upon scheduling and released once all their direct descendants are executed. This third type covers most TensorFlow operations, such as matmul, conv, and add. There is also temporary memory allocated for an operation's local variables, which is released immediately once the operation executes.

One might think that profiling peak memory footprints alone would be sufficient to predict memory overflows. However, to handle an overflow, nodes have to be moved between partitions, and that in turn changes the schedule and the memory consumption at any given time. Our cost model takes this dynamic behavior into account and models the interplay between the scheduler and memory usage.

$$M_{cons}(pe,t) = \sum_{\substack{n \in pe \\ n \in res\_ns}} mem(n) \;+\; \sum_{\substack{n \in pe,\; n \in nor\_ns \\ st(n) \le t \le ft(n)}} mem(n) \;+\; \sum_{\substack{n \notin (pe \,\cap\, res\_ns),\; ft(n) \le t \\ \exists (n,u) \in E:\; st(u) \ge t,\; u \in pe}} mem(n) \qquad (2)$$

Eqn (2) defines the memory consumption of a pe at time t as $M_{cons}(pe, t)$. The first term is the memory consumption of the res_ns assigned to that pe. The second term is the memory consumption of the normal nodes that have started on that pe at or before t and are still executing at t. The third term accounts for the nodes that have descendants assigned to that pe whose expected starting time is ≥ t, where those nodes have finished at or before t on any processing element except pe, or are non-residual nodes that have finished at or before t on that pe.

The overall memory consumption needs to be estimated at each node (|V| time points), because changes in memory consumption are triggered by node executions: once a node starts executing, new memory space needs to be allocated, and that may cause an overflow. Estimating the memory consumption is done by visiting all the nodes in the graph in the order of their estimated starting times, obtained from the scheduler emulator, and keeping track of the accumulated memory consumption. In the same pass, the memory potential values of the nodes ($M_{pot}$ in Table 1) are obtained. A node's memory consumption is added to the cumulative value once the node is visited, and subtracted after its last descendant is visited, unless it is a res_ns.
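The sweep can be sketched as follows (illustrative names, not ParDNN code; residual handling is condensed and ref_ns are omitted):

```python
import heapq

def peak_memory_per_pe(start_order, mem, release_time, placement, residual):
    """Visit nodes by estimated start time; charge mem(n) to n's pe when n
    starts, and release it once its last direct descendant has started,
    unless n is residual (res_ns), whose memory survives across iterations.
    start_order: (estimated start time, node) pairs, ascending."""
    usage, peak, pending = {}, {}, []  # pending: min-heap of (release time, node)
    for t, n in start_order:
        while pending and pending[0][0] <= t:
            _, done = heapq.heappop(pending)
            usage[placement[done]] -= mem[done]
        pe = placement[n]
        usage[pe] = usage.get(pe, 0) + mem[n]
        peak[pe] = max(peak.get(pe, 0), usage[pe])
        if not residual[n]:
            heapq.heappush(pending, (release_time[n], n))
    return peak
```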
After estimating the memory consumption, we traverse the graph starting from the sink node and keep the nodes in a heap data structure, namely nodes_heap. When the memory consumed exceeds the limit, we treat the overflow as a 0-1 min-knapsack problem [11]. The min-knapsack problem is formulated as follows: given N pairs of positive integers $(c_j, a_j)$ and a positive integer O, find $x_1, x_2, \ldots, x_N$ so as to

$$\text{minimize} \sum_{j=1}^{N} c_j x_j \quad \text{s.t.} \quad \sum_{j=1}^{N} a_j x_j \ge O, \quad x_j \in \{0, 1\} \qquad (3)$$

In our case, O represents the amount of memory overflow, and $a_j$ represents $M_{pot}(n, t)$. For the cost $c_j$, we use the summation of the node's computation cost and its communication with its direct ancestors and descendants located on the same pe, as shown in Eqn (4), which defines the move cost:

$$move\_cost(n) = comp(n) \;+\; \sum_{\substack{u:(u,n) \in E \\ u \in pe(n)}} comm(u,n) \;+\; \sum_{\substack{v:(n,v) \in E \\ v \in pe(n)}} comm(n,v) \qquad (4)$$
Table 2. Complexity of each step of ParDNN

Step-1: Partitioning to Minimize Makespan
    Graph Slicing (incl. sorting): O(K(|V| + |E|))
    Mapping: O(|V| log |V|)
    Refinement: O(K(|V| + |E|))
Step-2: Satisfying Memory Constraints
    TensorFlow Scheduler Emulator: O(|V| + |E|)
    Tracking Memory Consumption: O(|V|)
    Addressing Overflow: O(|V|)

The idea behind the move cost is that when a node is moved from one pe to another, it incurs a computational load imbalance proportional to its weight and extra communication proportional to its communication with the nodes assigned to the same pe. Our goal is to find a set of operation-nodes whose summed memory consumption potentials at the overflow time are ≥ the overflow, while their total movement cost is minimized. The movement criterion is to pick the node with the lowest move_cost / $M_{pot}(n, t)$; in other words, we choose the node that alleviates the overflow while incurring the least amount of communication and computation imbalance.

The nodes_heap is a min-heap in which the movement criterion is the ordering key. To avoid choosing a node that has a low movement criterion but a high move_cost, each node for which $M_{pot}(n, t)$ > overflow is inserted into another heap in which the sorting key is move_cost. When selecting, the top node is removed from both heaps, the one with the lower move_cost is chosen, and the other is returned to its heap. The selected node is moved to another pe if the target pe has sufficient memory to accommodate that node's memory potential. Otherwise, the node is not considered again and another node is picked from the heap. The algorithm terminates when either the overflow is completely eliminated or we run out of nodes without addressing it.

Table 2 summarizes the time complexity of each step of ParDNN. Detailed explanations of the time complexities can be found in Appendix A. The reported complexities are relaxed ones, and for some stages tighter bounds may be derived with amortized analysis. Splitting the partitioning strategy into a set of simple yet efficient sub-stages permits lowering the complexity. The nodes are grouped into clusters in the first step; for most of the later stages, ParDNN works at the cluster rather than node granularity, which considerably reduces the instance size it deals with. In practice, running ParDNN on the DNN models listed in Table 3 takes up to 2 minutes on a typical laptop processor, namely an Intel i7-7600U CPU @ 2.80GHz. Considering that the training time of those models is in the order of days or even weeks, ParDNN offers an extremely lightweight and practical approach to partitioning the computational graphs of DNNs.
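As an illustration of the selection criterion, a condensed greedy version could look like the sketch below; ParDNN's second heap, tie-breaking, and target-pe capacity check are omitted for brevity, and all names are ours:

```python
import heapq

def pick_nodes_to_move(candidates, overflow):
    """candidates: {node: (move_cost, m_pot)} for movable nodes at the overflow
    time. Repeatedly take the node minimizing move_cost / M_pot until the
    selected nodes' memory potentials cover the overflow."""
    heap = [(cost / m_pot, cost, m_pot, name)
            for name, (cost, m_pot) in candidates.items()]
    heapq.heapify(heap)  # min-heap keyed by the movement criterion
    moved, freed = [], 0.0
    while heap and freed < overflow:
        _, cost, m_pot, name = heapq.heappop(heap)
        moved.append(name)
        freed += m_pot
    return moved if freed >= overflow else None  # None: overflow unresolved

# e.g., three movable nodes, needing to free 6 units of memory
print(pick_nodes_to_move({"a": (2.0, 4.0), "b": (1.0, 3.0), "c": (5.0, 2.0)}, 6.0))
# -> ['b', 'a']
```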
Our algorithm takes as input the device count, the devices' memory capacities, the interconnection bandwidth and latency between them, the model's computational graph, profiling data, and operation metadata. The profiling data contain execution time measurements and the output size of each operation-node. The operation metadata contain the operation types (Section 3.2.2). TensorFlow's standard APIs provide the profiling information, including per-node time, memory consumption, and communication sizes, at the granularity of graph nodes, for regular as well as user-defined operators.

To estimate the memory consumption, we implemented an emulator of TensorFlow's scheduler as described in [1]. It is important to note that if ParDNN is intended to be used with another DL framework, another emulator can be written to emulate its scheduler, if needed, without modifying our partitioning algorithm. When handling memory constraints, there is a trade-off between overhead and accuracy: static handling prioritizes overhead reduction over accuracy, while dynamic handling targets the opposite. For efficiency and maintainability reasons, we adopt the static approach. To accommodate sacrificing the exact details of the memory management optimizations and allocation details, such as fragmentation and temporary memory for local variables, we spare 10% of the device memory and constrain ourselves to the remaining 90%. This threshold was sufficient to run all our experiments without going OOM. Nevertheless, this ratio might need tuning, and it is the only parameter of ParDNN that requires tuning.

As shown in Figure 1, the output of our algorithm is a single file containing the operation placement as key-value pairs. Each key is an operation-node name, and the value is the device on which the operation should be placed. To control the placement at operation-node granularity, the TensorFlow back-end reads the node-to-device assignment from the placement file generated by our algorithm.
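The placement file can also be consumed at the user level without a modified back-end. The sketch below (TF 1.x graph mode; the JSON file name and format are our assumptions, loosely based on the key-value pairs in Figure 1) uses a TensorFlow device function to pin each operation:

```python
import json
import tensorflow as tf  # TF 1.x graph mode assumed

# Hypothetical placement file: {"op_name": device_id, ...}
with open("placement.json") as f:
    placement = json.load(f)

def pardnn_device(op):
    """Device function: map each operation-node to its assigned GPU,
    defaulting to GPU 0 for ops absent from the mapping."""
    return "/device:GPU:%d" % placement.get(op.name, 0)

graph = tf.Graph()
with graph.as_default(), tf.device(pardnn_device):
    a = tf.random.uniform([1024, 1024], name="a")
    b = tf.matmul(a, a, name="matmul")  # placed per the mapping above
```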
ParDNN on multiple nodes: Although ParDNN is designed so that it can partition a DNN across multiple nodes, in this work we assume a single node where the processing elements are identical GPUs connected to a common host. This is because the number of GPUs per node has been steadily increasing over time; for instance, systems with 16 or more GPUs per node are in production (e.g., NVIDIA DGX SuperPOD). As suggested by many state-of-the-art works [21, 50, 55], we argue that a hybrid approach of data parallelism across compute nodes and ParDNN inside the compute node is a practical choice. This approach benefits from the efficiency and non-invasiveness of our method in tackling the memory capacity issue at the node level, while also harnessing the weak scaling properties of data parallelism across nodes.
Evaluation

This section is organized into three parts. The first part compares the performance of ParDNN against related work: explicit model parallelism, redundant recomputation, and an out-of-core method. The second part evaluates the scaling of ParDNN, and the last part performs an overhead and fidelity analysis of ParDNN. The key findings of each part are as follows:
• Comparison with related work: (i) ParDNN achieves similar performance to the distributed tensor computation framework Mesh-TensorFlow [54] but provides much higher user productivity. (ii) ParDNN outperforms Gradient Checkpointing [7] combined with data parallelism in many cases, yielding up to a 2.8x speedup. More importantly, ParDNN enables training models for which applying Gradient Checkpointing results in out of memory (OOM) even with a batch size of 1. (iii) ParDNN outperforms CUDA Unified Memory for all configurations and GPU counts.
• Scaling: (i) For the same number of GPUs, ParDNN enables the use of more than 9x the batch size over the maximum possible with data parallelism, on average. (ii) Superlinear speedup is observed for most models and configurations when going from one GPU to 16 GPUs.
• Overhead and fidelity: (i) The empirical overhead of ParDNN is no more than 2 minutes for the largest model over 16 GPUs. (ii) Replacing any of ParDNN's steps with other heuristics, or using alternative approaches, results in a significant drop in performance or a huge increase in overhead, demonstrating and justifying the design choices and efficiency of ParDNN's algorithmic steps.

Table 3. Specifications of models and datasets. (C)HSD: (Character) Hidden State Dimension, SL: Sequence Length, ED: Embedding Dimensions, RU: Residual Units, WF: Widening Factor, MD: Model Dimension, FS: Filter Size, P SZ: Patch Size.

RNN for Word-Level Language [58] / Tiny Shakespeare [29]:
    Word-RNN: 8 layers, HSD 2048, SL 28; 0.34B parameters; 10578 nodes
    Word-RNN-2: 8 layers, HSD 4096, SL 25; 1.28B parameters; 10578 nodes
Character-Aware Neural Language Models [32] / Penn Treebank (PTB) [38]:
    Char-CRN: 8 layers, CHSD 2048, ED 15; 0.23B parameters; 22748 nodes
    Char-CRN-2: 32 layers, CHSD 2048, ED 15; 1.09B parameters; 86663 nodes
We conducted all our experiments on an NVIDIA DGX-2 with 16 Tesla V100 SXM3 32GB GPUs connected via NVSwitch. The throughput measurements are conducted over the interval between the 100th and the 150th training iterations to get stable results. We use TensorFlow 1.14 and CUDA 10.0. We experimented with five large models representing four main tracks of DL applications: image classification, translation, video prediction, and language modeling. All models and datasets used in the experiments are listed in Table 3 and detailed in Appendix A. We focus our analysis on the performance of ParDNN rather than pursuing accuracy, since ParDNN has no effect on the learning aspect of the model: ParDNN does not alter the model or its hyper-parameters.

Figure 3. (a) ParDNN speedup over Mesh-TensorFlow, (b) ParDNN speedup over gradient checkpointing combined with data parallelism to run on multiple GPUs, (c) ParDNN speedup over CUDA Unified Memory (UM) using larger models. X-axis: number of GPUs (batch size).
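The measurement methodology above amounts to a warm-up window followed by a timed window; a sketch (with a hypothetical run_step helper, not our benchmarking harness) looks like this:

```python
import time

# Throughput is averaged over iterations 100-150 to skip warm-up effects
# such as kernel autotuning and memory-pool growth.
WARMUP, MEASURED = 100, 50

def measure_throughput(run_step, batch_size):
    for _ in range(WARMUP):
        run_step()
    t0 = time.perf_counter()
    for _ in range(MEASURED):
        run_step()
    return MEASURED * batch_size / (time.perf_counter() - t0)  # items/second
```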
We compare ParDNN to three state-of-the-art approaches used to circumvent the memory limitation when training DNNs: (i) Mesh-TensorFlow [54] for explicit model parallelism, (ii) gradient checkpointing [7] combined with data parallelism for redundant recomputation, and (iii) CUDA Unified Memory for out-of-core computing. Although other graph-based solutions exist, we cannot compare with them directly, either because we are not aware of any open-source implementation [41] or because the implementation is available for MXNet only [64]. It is worth mentioning, however, that ParDNN takes no more than 2 minutes for the largest configuration we tested, compared to tens of hours reported by the other graph-based methods, and that ParDNN works on models 2.3x as large as what the other methods experimented with [41].
Mesh-TensorFlow [54], an extension to TensorFlow, was proposed to overcome the memory limitations of a single device; it permits specifying a general class of distributed tensor computations. We compare the performance of ParDNN with Mesh-TensorFlow using the Transformer model, which the original authors used to demonstrate scaling [54]. Figure 3(a) shows the speedup of ParDNN over Mesh-TensorFlow using 4 and 8 GPUs. We report all split permutations [60] possible with the maximum trainable batch size for Mesh-TensorFlow. ParDNN is on par with Mesh-TensorFlow; however, unlike Mesh-TensorFlow: (a) ParDNN requires no knowledge of the DNN structure from the user, while with Mesh-TensorFlow it is the user's responsibility to rewrite the model using Mesh-TensorFlow syntax; (b) ParDNN entirely automates the partitioning, while with Mesh-TensorFlow users have to manually specify the tensor dimensions to be split across a multi-dimensional processor mesh, and finding the best assignment is an NP-hard problem; (c) Mesh-TensorFlow has a non-negligible pre-run overhead, which doubles when doubling the number of GPUs.

Gradient checkpointing [10] enables DNN training with a sublinear memory cost (O(√N)) when training an N-layer network, by recomputing the activations during backpropagation instead of holding the forward-pass results. In our comparison, we use a TensorFlow-based open-source implementation [7]. Figure 3(b) shows the speedup of ParDNN over gradient checkpointing combined with data parallelism to run on multiple GPUs. For both ParDNN and checkpointing, we used the common largest possible batch sizes. ParDNN outperforms gradient checkpointing in most cases. In a few cases, checkpointing is better than ParDNN; this happens mainly when the degree of parallelism inherent in the graph is not sufficient to utilize all the GPUs. More importantly, however, ParDNN is qualitatively superior to gradient checkpointing, since it enables the training of models that checkpointing fails to fit in device memory even with a batch size of one. For example, Figure 3(b) shows several configurations where gradient checkpointing goes out of memory at a batch size of one. Moreover, the overhead of gradient checkpointing can be up to 5 hours [7].

Figure 3(c) shows the speedup of ParDNN over CUDA Unified Memory (UM). UM, to the authors' knowledge, is the only out-of-core solution that has an available TensorFlow implementation. In all cases, ParDNN's throughput improves going from 4 to 16 GPUs while increasing the batch size. UM's performance degrades when increasing the batch size, due to the page-faulting penalty [3].
We experimented with models under the two main use-cases of ParDNN. The first covers model instances that fit into a single device's memory only with very small batch sizes; small here is relative to the numbers used by the DL community and reported in the literature. In such a case, ParDNN provides a qualitative advantage over data parallelism (DP), which splits the input over different GPUs that hold replicas of the model. The second use-case covers model instances that do not fit into a single GPU's memory even with small batch sizes; these are larger variants of each model in Figure 4(a).
246 85 151 292 ( ) ( ) ( ) ( ) ( ) ( ) Word-RNN-2 Char-CRN-2 T h r o u g hpu t ( i t e m / s e c o nd ) . . . . . . . . . . . . . . . . . . . ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) Word-RNN Char-CRN WRN TRN E3D
ParDNN Speedup over Single GPU
Batch Size Scaling Increase Over Ideal DP Model/
ParDNN Throughput (a) (b) (c)
Figure 4. (a) Maximum batch sizes (bsz) made possible by ParDNN. Bsz on a single GPU is the maximum that could fit without triggeringOOM. Table also shows the multiplier by which ParDNN could increase the bsz over ideal data parallelism (DP). For use-cases-1, DP isassumed to applied on top of a single GPU reference point. For use-cases-2, ParDNN enables ≥ Training with large batch sizes offers more parallelism anddrastically reduces the overall training time. Authors in [17]proposed a method to scale batch sizes, which reduced thetraining of RESNET-50 on ImageNet to one hour. Anotherwork harnessed very large batch sizes to reduce BERT train-ing time from 3 days to 76 mins [69]. ParDNN enables super-linear scaling of the batch sizes while increasing the numberof GPUs. Figure 4 (a) shows the batch size scaling for allof our experiments. We could increase the batch size byup to 256 x for use-cases-1 and 64 x for use-cases-2. Thisgives ParDNN a qualitative advantage even for models thatfit into a single GPU since ParDNN enables training withmuch larger batch sizes than what can be achieved with DP.ParDNN achieves superlinear scaling of the batch sizefirstly because with ParDNN, the parameters are not repli-cated but distributed. A large fraction of the memory con-sumed by the large models is to store the parameters andvariables that survive through iterations. For instance, for1 .
For instance, for the 1.91-billion-parameter WRN, TensorFlow allocates around 8GB for those variables. Using ParDNN these parameters are distributed, whereas with DP they need to be replicated. Secondly, for some operations, the memory consumption does not scale linearly with the batch size. For example, in Word-RNN and Char-CRN, the outputs of matrix multiplication operations have the largest memory consumption ratio. When doubling the batch size, the memory consumed by the matrix multiplication results barely increases: when multiplying a matrix of dimensions a × batch_size by another of dimensions batch_size × b, the result has dimensions a × b regardless of the batch size. So the memory allocated to store the output of that operation does not increase, and this effect propagates to its descendants, which take its output as their input.
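The shape argument can be checked directly; the snippet below (an illustration with NumPy, not from the paper) shows that such a product's footprint is constant across batch sizes:

```python
import numpy as np

# An (a x batch) @ (batch x b) product has shape (a, b), so its output
# memory is independent of the batch size.
a, b = 512, 512
for batch in (32, 64, 128):
    out = np.zeros((a, batch)) @ np.zeros((batch, b))
    print(batch, out.shape, out.nbytes)  # shape and bytes identical in all cases
```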
Figure 4(b) and (c) show the speedup over a single GPU for the small models and the throughput scaling of ParDNN for the large models, respectively, using the batch sizes in Figure 4(a). In Figure 4(b), ParDNN shows a substantial improvement on 2 GPUs and superlinear speedups up to 4 GPUs for all the models. The sharp performance increase happens because, when a model fits into a single GPU's memory only with a small batch size, the resources are extremely underutilized; pushing larger batches while doubling the number of GPUs improves the device utilization considerably. Beyond 4 GPUs, the batch size could only be doubled when doubling the GPU count. The performance behavior beyond this point depends on the inherent DoP (degree of parallelism) of the graph and its CCR (the ratio of total communication cost to total computation cost) [44, 56]. Both Word-RNN and Char-CRN have large DoP; as a result, they continue to give superlinear speedups up to 8 and 16 GPUs, respectively. TRN exhibits modest improvement beyond 4 GPUs even though its graph has a higher DoP than WRN. This is because it has a larger CCR (e.g., on 4-GPU configurations the CCR of TRN is 1.58 compared to 0.59 for WRN), so a considerable amount of time is spent on communication. E3D scales better beyond 4 GPUs in comparison to WRN due to having higher DoP. However, its speedup compared to a single GPU is smaller, because E3D's main operation is 3D convolution, which heavily utilizes the GPU even with small batch sizes.

In Figure 4(c), going from 4 to 8 GPUs enables much larger batches in all cases but two. This in turn enhances resource utilization and results in substantial throughput improvements. Char-CRN-2 scales perfectly up to 16 GPUs due to its high DoP. Word-RNN-2 and WRN-2 scale modestly from 8 to 16 GPUs: for Word-RNN-2, a batch size of 8 is already sufficient to saturate the GPUs, while for WRN-2, the modest scaling is due to its low DoP.
ParDNN has a negligible overhead thanks to the low complexity of each step. The longest partitioning time among all the combinations of batch sizes, GPUs, and model configurations used in this work was 117 secs, in the case of partitioning TRN-2 over 16 GPUs. The minimum time, 18 secs, was taken to partition Word-RNN over 2 GPUs. Even though handling the memory overflow takes most of the overall partitioning time, the time taken to handle it is much lower than the theoretical upper bound. This is because Step-2 of ParDNN depends on how many nodes need to be moved between clusters to address the overflow, which is much less than |V| in practice. The average ratio of nodes moved across all our experiments is 8%.

Figure 5. (a) ParDNN's speedup over Round Robin (RR) and over ParDNN without refinement; the values are normalized over RR, and the four-GPU configuration is used. (b) Makespan of ParDNN over that of linear clustering (lower is better); K is the number of partitions.

To analyze the impact of the slicing-mapping-refinement stages of Step-1, we replace Step-1 with a naive approach that simply distributes the graph nodes to the devices in a round-robin fashion (RR), iterating over the nodes in topological order. Figure 5(a) shows the performance improvement of ParDNN over RR. In addition, the figure shows ParDNN's performance without the refinement stage. Compared to RR, ParDNN doubles the training throughput on average. Applying refinement has a non-negligible effect, contributing a 5-25% improvement.
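For reference, the RR baseline amounts to the following (a sketch with illustrative names):

```python
def round_robin_placement(topo_nodes, k):
    """The RR baseline described above: iterate the graph nodes in
    topological order and deal them out to the k devices cyclically."""
    return {n: i % k for i, n in enumerate(topo_nodes)}

print(round_robin_placement(["A", "B", "C", "D", "E"], k=4))
# {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 0}
```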
ParDNN is not a standalone scheduling algorithm; it leaves the order-of-execution decision to the dynamic scheduler. However, it can still serve as an efficient phase in static scheduling. To show this feature, and the advantage of our multi-staged approach over a high-complexity single heuristic, we compare ParDNN with linear clustering (LC). For a fair comparison, we implemented LC with GLB and Earliest Estimated Time First (EST First) [62] as the task-ordering heuristic, since this combination gave the best results. For ParDNN, we used EST First as well to derive the execution order of tasks and omitted the memory constraints (Step-2). Figure 5(b) shows that in all the experiments ParDNN outperforms or is on par with LC. In particular, ParDNN produces much better results than LC when the degree of parallelism is high, as in Char-CRN and Word-RNN. Another advantage comes from the overhead: for the largest graph, WRN, with hundreds of thousands of nodes, ParDNN took 36 secs while LC took considerably longer.

Related Work

Systems-level approaches: Mirhoseini et al. proposed a reinforcement learning-based method to place dataflow graphs on multiple devices [41, 42]. This approach suffers from significant time and resource consumption: the proposed policy was trained for hours using 16 workers to produce placements for models having fewer than 100K operations. A more efficient approach was proposed by Wang et al. in [64]. However, it requires a description language to specify computations and cannot describe all the operations used in DL. Moreover, it partitions all operators and tensors across all workers, resulting in poor resource utilization.

DL-level approaches: Explicit model parallelism, where each worker is responsible for a subset of the layers, suffers from two major limitations: requiring complex cost models on a case-by-case basis and leaving the partitioning burden to the programmer [43]. Pipeline parallelism provides good resource utilization, yet some implementations require a single layer to fit in a single device [22], which may not be the case for models with 3D inputs [39], while in others, extra memory overhead proportional to the size of the model weights is necessary to address the statistical efficiency issue, i.e., to prevent the model from failing to converge [43]. In [12, 27, 33, 55], non-generic techniques were proposed to parallelize specific types of DL models, some focusing on CNNs while others relying on the Transformer in their optimizations.
Virtualization and recomputation: These methods relax the memory requirements. vDNN [51] is a memory manager that virtualizes GPU memory in DNN training. ooc_cuDNN [25] extends cuDNN and applies cuDNN-compatible operators even when a layer exceeds GPU memory capacity, by swapping at the granularity of individual tensor dimensions. Gradient checkpointing [10] reduces the memory needed to store the intermediate outputs and gradients, at the cost of doubling the forward-pass computational cost [10, 26]. PoocH [24] and Capuchin [47] propose hybrid approaches that select either recomputation or swapping for certain layers, based on profiling data, to reduce the performance overhead.
Graph partitioning: To deal with a directed graph, existing graph partitioning libraries convert every directed edge to an undirected one, even though this conversion loses crucial information [4]. For this reason, the Scotch static mapper [45, 46] and the MinCut optimizer result in a 2 to 10 times slowdown when applied to graphs of DL models [41, 42]. In [20], new techniques are proposed to deal with directed graphs, and [44] builds on top of those techniques for a clustering-based scheduler. They aim at producing an acyclic partitioning: if there is a cut edge from partition a to b and another from b to a, the partitioning is considered cyclic and is not acceptable. Since the graphs produced by TensorFlow are full of fork-joins, applying their technique to our DNN models results in unbalanced partitions.

Static graph scheduling: Plenty of sophisticated and high-quality algorithms have been proposed in this area [18, 23, 35, 36, 68]. The vast majority of these algorithms were developed in the 1990s to handle small graphs, and they were later evaluated using instances having up to 3000 nodes [14, 18, 37, 62, 63]. A recent evaluation on large graphs shows that they either do not scale, due to their high time complexity, or produce low-quality allocations, due to their inability to capture the global structure of the graph [44].

Conclusion

ParDNN presents a lightweight approach to partitioning the computational graphs of very large DNN models. It permits the training of models that do not fit into a single device's memory. The experiments on five large DNNs and comparisons with related work demonstrate its high efficiency and superlinear scaling of batch size and training throughput.
Acknowledgement
Authors from Koç University are supported by the Turkish Science and Technology Research Centre, Grant No. 118E801. This work was partially supported by JST-CREST under Grant Number JPMJCR19F5. The research presented in this paper has benefited from the Experimental Infrastructure for Exploration of Exascale Computing (eX3), which is financially supported by the Research Council of Norway under contract 270053.
References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
[2] Md Zahangir Alom, Tarek M. Taha, Chris Yakopcic, Stefan Westberg, Vasit Sagan, Mst Shamima Nasrin, Mahmudul Hasan, Brian C. Van Essen, Abdul A. S. Awwal, and Vijayan K. Asari. 2019. A State-of-the-Art Survey on Deep Learning Theory and Architectures. Electronics 8, 3 (2019), 292. https://doi.org/10.3390/electronics8030292
[3] Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, and Dhabaleswar K. Panda. 2018. OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC). IEEE, 143–152.
[4] David A. Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner. 2013. Graph Partitioning and Graph Clustering. Vol. 588. American Mathematical Society, Providence, RI.
[5] James Bergstra, Frédéric Bastien, Olivier Breuleux, Pascal Lamblin, Razvan Pascanu, Olivier Delalleau, Guillaume Desjardins, David Warde-Farley, Ian Goodfellow, Arnaud Bergeron, et al. 2011. Theano: Deep learning on GPUs with Python. In NIPS 2011, BigLearning Workshop, Granada, Spain, Vol. 3. Citeseer, 1–48.
[6] Charles-Edmond Bichot and Patrick Siarry. 2011. Graph Partitioning. Wiley Online Library.
[7] Yaroslav Bulatov. 2018. gradient-checkpointing. https://github.com/cybertronai/gradient-checkpointing
[8] Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2016. The IWSLT 2016 evaluation campaign. In International Workshop on Spoken Language Translation.
[9] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
[10] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174 (2016).
[11] János Csirik. 1991. Heuristics for the 0-1 min-knapsack problem. Acta Cybernetica 10, 1-2 (1991), 15–20.
[12] Nikoli Dryden et al. 2019. Channel and Filter Parallelism for Large-Scale CNN Training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '19). Article 46, 13 pages.
[13] Peter M. Fenwick. 1994. A new data structure for cumulative frequency tables. Software: Practice and Experience 24, 3 (1994), 327–336.
[14] Apostolos Gerasoulis and Tao Yang. 1992. A comparison of clustering heuristics for scheduling directed acyclic graphs on multiprocessors. Journal of Parallel and Distributed Computing 16, 4 (1992), 276–291.
[15] Amir Gholami, Ariful Azad, Peter Jin, Kurt Keutzer, and Aydin Buluc. 2017. Integrated model, batch and domain parallelism in training neural networks. arXiv preprint arXiv:1712.04432 (2017).
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
[18] Kun He, Xiaozhu Meng, Zhizhou Pan, Ling Yuan, and Pan Zhou. 2018. A novel task-duplication based clustering algorithm for heterogeneous computing environments. IEEE Transactions on Parallel and Distributed Systems 30, 1 (2018), 2–14.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[20] Julien Herrmann, Jonathan Kho, Bora Uçar, Kamer Kaya, and Ümit V. Çatalyürek. 2017. Acyclic partitioning of large directed acyclic graphs. In 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE, 371–380.
[21] Yanping Huang et al. 2018. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. arXiv preprint arXiv:1811.06965 (2018).
[22] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems. 103–112.
[23] Jing-Jang Hwang, Yuan-Chieh Chow, Frank D. Anger, and Chung-Yee Lee. 1989. Scheduling precedence graphs in systems with interprocessor communication times. SIAM J. Comput. 18, 2 (1989), 244–257.
[24] Yuki Ito, Haruki Imai, Tung D. Le, Yasushi Negishi, Kiyokuni Kawachiya, Ryo Matsumiya, and Toshio Endo. 2019. Profiling based out-of-core hybrid method for large neural networks: poster. arXiv preprint arXiv:1907.05013 (2019).
[25] Yuki Ito, Ryo Matsumiya, and Toshio Endo. 2017. ooc_cuDNN: Accommodating convolutional neural networks over GPU memory capacity. In 2017 IEEE International Conference on Big Data (Big Data). 183–192. https://doi.org/10.1109/BigData.2017.8257926
[26] Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Breaking the Memory Wall with Optimal Tensor Rematerialization. In Proceedings of Machine Learning and Systems 2020. 497–511.
[27] Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken. 2018. Exploring hidden dimensions in parallelizing convolutional neural networks. arXiv preprint arXiv:1802.04924 (2018).
[28] Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond Data and Model Parallelism for Deep Neural Networks. arXiv preprint arXiv:1807.05358 (2018). http://arxiv.org/abs/1807.05358
[29] Andrej Karpathy. 2015. tinyshakespeare. https://github.com/karpathy/char-rnn/tree/master/data/tinyshakespeare
[30] George Karypis and Vipin Kumar. 1995. Multilevel graph partitioning schemes. In ICPP (3). 113–122.
[31] Sung J. Kim. 1988. A general approach to multiprocessor scheduling. (1988).
[32] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence.
[33] Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 (2014).
[34] Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
[35] Yu-Kwong Kwok and Ishfaq Ahmad. 1995. Bubble scheduling: A quasi dynamic algorithm for static allocation of tasks to parallel architectures. In Proceedings, Seventh IEEE Symposium on Parallel and Distributed Processing. IEEE, 36–43.
[36] Yu-Kwong Kwok and Ishfaq Ahmad. 1996. Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors. IEEE Transactions on Parallel and Distributed Systems 7, 5 (1996), 506–521.
[37] Jing-Chiou Liou and Michael A. Palis. 1997. A comparison of general approaches to multiprocessor scheduling. In Proceedings 11th International Parallel Processing Symposium. IEEE, 152–156.
[38] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology. Association for Computational Linguistics, 114–119.
[39] Amrita Mathuriya et al. 2018. CosmoFlow: Using Deep Learning to Learn the Universe at Scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '18). IEEE Press, Piscataway, NJ, USA, Article 65, 11 pages.
[40] C. L. McCreary, M. A. Cleveland, and A. A. Khan. 1996. The problem with critical path scheduling algorithms. Master's Thesis, Department of Computer Science and Engineering, Auburn University, USA (1996).
[41] Azalia Mirhoseini, Anna Goldie, Hieu Pham, Benoit Steiner, Quoc V. Le, and Jeff Dean. 2018. A hierarchical model for device placement. In International Conference on Learning Representations.
[42] Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. 2017. Device Placement Optimization with Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (ICML '17). JMLR.org, 2430–2439.
[43] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15.
[44] M. Yusuf Özkaya, Anne Benoit, Bora Uçar, Julien Herrmann, and Ümit V. Çatalyürek. 2019. A scalable clustering-based task scheduler for homogeneous processors using DAG partitioning. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 155–165.
[45] François Pellegrini. 2009. Distillating knowledge about Scotch. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
[46] François Pellegrini and Jean Roman. 1996. Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs. In International Conference on High-Performance Computing and Networking. Springer, 493–498.
[47] Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-Based GPU Memory Management for Deep Learning. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 891–905. https://doi.org/10.1145/3373376.3378505
[48] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Technical Report (2019).
[49] Andrei Radulescu and Arjan J. C. Van Gemund. 1998. GLB: A low-cost scheduling algorithm for distributed-memory architectures. In Proceedings, Fifth International Conference on High Performance Computing (Cat. No. 98EX238). IEEE, 294–301.
[50] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2019. ZeRO: Memory Optimization Towards Training A Trillion Parameter Models. arXiv preprint arXiv:1910.02054 (2019).
[51] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 18.
[52] Vivek Sarkar. 1988. Partitioning and scheduling parallel programs for execution on multiprocessors. (1988).
[53] Taro Sekiyama, Takashi Imamichi, Haruki Imai, and Rudy Raymond. 2018. Profile-guided memory optimization for deep neural networks. arXiv preprint arXiv:1804.10001 (2018).
[54] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. 2018. Mesh-TensorFlow: Deep Learning for Supercomputers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS '18). Curran Associates Inc., Red Hook, NY, USA, 10435–10444.
[55] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053 (2019).
[56] Oliver Sinnen. 2007. Task Scheduling for Parallel Systems. Vol. 60. John Wiley & Sons.
[57] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. 2015. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning. 843–852.
[58] Sung Kim. 2017. Multi-layer Recurrent Neural Networks (LSTM, RNN) for word-level language models in Python using TensorFlow. https://github.com/hunkim/word-rnn-tensorflow
[59] Ilya Sutskever, James Martens, and Geoffrey E. Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 1017–1024.
[60] Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Lukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for Neural Machine Translation. arXiv preprint arXiv:1803.07416 (2018). http://arxiv.org/abs/1803.07416
[61] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[62] Huijun Wang and Oliver Sinnen. 2018. List-scheduling versus cluster-scheduling. IEEE Transactions on Parallel and Distributed Systems.
[63] IEEE, 708–713.
[64] Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019. 1–17.
[65] Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, and Li Fei-Fei. 2018. Eidetic 3D LSTM: A model for video prediction and beyond. (2018).
[66] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[67] Tao Yang. 1993. Scheduling and Code Generation for Parallel Architectures. Ph.D. Dissertation. Citeseer.
[68] Tao Yang and Apostolos Gerasoulis. 1994. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems 5, 9 (1994), 951–967.
[69] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2019. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations.
[70] Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146 (2016).
A Appendix
A.1 Time Complexity of Each Step of ParDNN
Complexity of Graph Slicing:
The most expensive part of Algorithm 1 is computing the weighted levels of all the nodes. This operation performs a variant of topological sorting and has a time complexity of O(|V| + |E|) [56]. It is done K times, resulting in an overall complexity of O(K(|V| + |E|)), as opposed to linear clustering, which would cost O(|V|(|E| + |V|)) [62].
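For concreteness, the following minimal Python sketch computes such weighted (bottom) levels in a single reverse topological pass; the adjacency-list representation and the names `succ` and `weight` are our illustrative assumptions, not ParDNN's implementation.

```python
from collections import defaultdict, deque

def weighted_levels(nodes, succ, weight):
    """Bottom-up weighted levels of a DAG in O(|V| + |E|).

    `nodes`: iterable of node ids; `succ`: node -> list of successors;
    `weight`: node -> computation cost (illustrative names).
    """
    indeg = defaultdict(int)
    for n in nodes:
        for s in succ.get(n, []):
            indeg[s] += 1
    # Kahn's algorithm yields a topological order; walking it in reverse
    # guarantees every successor's level is final before its predecessors'.
    order = []
    queue = deque(n for n in nodes if indeg[n] == 0)
    while queue:
        n = queue.popleft()
        order.append(n)
        for s in succ.get(n, []):
            indeg[s] -= 1
            if indeg[s] == 0:
                queue.append(s)
    wl = {}
    for n in reversed(order):
        wl[n] = weight[n] + max((wl[s] for s in succ.get(n, [])), default=0)
    return wl
```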
Complexity of Mapping: Since the clusters are disjoint paths or singleton nodes and share no nodes (a node belongs to exactly one cluster), the total number of update operations is bounded by |V|. The number of range-summation queries is bounded by the number of paths, which is again bounded by |V|. The cost of either operation is logarithmic in the number of levels, and the number of levels is ≤ |V|, so we end up with O(|V| log |V|). Before starting LALB, we sort the clusters by their weights (heaviest clusters first, due to their importance in balancing the loads); this has an upper bound of O(|V| log |V|), since the number of clusters is upper-bounded by the number of nodes. Hence, the overall complexity of the mapping stage is O(|V| log |V|).
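The logarithmic update and range-summation costs come from binary indexed (Fenwick) trees [13] that maintain the load a device carries per graph level. A minimal sketch of that structure, under our own naming:

```python
class FenwickTree:
    """Binary indexed tree [13]: point updates and range sums in O(log n)."""

    def __init__(self, num_levels):
        self.tree = [0] * (num_levels + 1)  # levels are 1-indexed

    def add(self, level, load):
        """Add `load` at `level`, e.g. when a node lands on this device."""
        while level < len(self.tree):
            self.tree[level] += load
            level += level & (-level)

    def prefix_sum(self, level):
        total = 0
        while level > 0:
            total += self.tree[level]
            level -= level & (-level)
        return total

    def range_sum(self, lo, hi):
        """Load carried by this device between levels `lo` and `hi`."""
        return self.prefix_sum(hi) - self.prefix_sum(lo - 1)
```

With one such tree per device, assigning a cluster issues at most one `add` per node (≤ |V| updates overall), and querying a device's load over a cluster's span is a single `range_sum`, matching the bound above.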
Complexity of Refinement: For swapping, we initially sort the clusters by the tl(n) of their source nodes so that the clusters within the span of a given cluster can be found by binary search. Since the number of disjoint clusters is bounded by the number of nodes, this step costs O(|V| log |V|). Once two clusters are swapped, they are marked and not considered again, leading to at most |V| cluster, hence node, swaps. With each swap, the binary indexed trees are updated to reflect the new workloads. Since each update takes O(log |V|), the overall complexity is O(|V| log |V|). The node-level refinement is repeated K times, and each time we recalculate the weighted levels and the CP; upon a node switch, we update the trees. The overall time complexity is O(K(|V| + |E|)).
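As an illustration of the binary-search step only (a sketch assuming the clusters are kept sorted by the tl(n) of their source nodes; this is not ParDNN's exact data layout):

```python
import bisect

def clusters_in_span(sorted_starts, span_start, span_end):
    """Indices of clusters whose source node's tl(n) falls inside a span.

    `sorted_starts` holds the tl(n) of every cluster's source node, sorted
    once up front (the O(|V| log |V|) step above); each query then costs
    only two binary searches.
    """
    lo = bisect.bisect_left(sorted_starts, span_start)
    hi = bisect.bisect_right(sorted_starts, span_end)
    return range(lo, hi)
```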
Complexity of Scheduler Emulator: The scheduler emulator estimates the starting time st(n) and finishing time ft(n) of the nodes in the graph. The emulator has a time complexity of O(|V| + |E|).
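One plausible minimal sketch with the same O(|V| + |E|) behavior, assuming a placement `place`, per-node cost `weight`, and a cross-device transfer delay `comm(u, v)` (all names are our assumptions, not ParDNN's emulator):

```python
from collections import defaultdict

def emulate_schedule(topo_order, pred, weight, place, comm):
    """Estimate st(n) and ft(n) of a placed DAG in O(|V| + |E|).

    `pred`: node -> list of predecessors; `place`: node -> device id;
    `comm(u, v)`: transfer delay when u and v sit on different devices.
    """
    st, ft = {}, {}
    device_free = defaultdict(float)  # earliest idle time per device
    for n in topo_order:
        ready = 0.0
        for p in pred.get(n, []):
            delay = comm(p, n) if place[p] != place[n] else 0.0
            ready = max(ready, ft[p] + delay)
        st[n] = max(ready, device_free[place[n]])
        ft[n] = st[n] + weight[n]
        device_free[place[n]] = ft[n]
    return st, ft
```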
Complexity of Tracking Memory Consumption: Tracking the memory consumption requires O(|V|) time, since it is done in one pass over the graph nodes while keeping the cumulative values and calculating the potentials.
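A hedged sketch of such a single pass, assuming each node's output occupies `mem[n]` bytes and the tensors freed after a node executes are known up front (the potentials are computed earlier in the paper and are abstracted away here):

```python
from collections import defaultdict

def peak_memory_per_device(topo_order, place, mem, freed_after):
    """One O(|V|) pass over the nodes, keeping cumulative consumption.

    `mem`: bytes a node's output occupies; `freed_after`: node -> list of
    byte counts released once that node has executed (our assumptions).
    """
    current, peak = defaultdict(int), defaultdict(int)
    for n in topo_order:
        d = place[n]
        current[d] += mem[n]
        peak[d] = max(peak[d], current[d])
        current[d] -= sum(freed_after.get(n, []))
    return peak
```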
Complexity of Addressing Overflow: We solve the knapsack greedily, as the complexity of the dynamic-programming solution is impractical. When an overflow is detected, we pick nodes from the heaps in logarithmic time. Any node that is moved to another partition is guaranteed not to be moved again, since it is moved only if the destination pe can accommodate it, meaning that it can neither cause nor solve an overflow on that pe. As a result, there is no repetition, and a node can enter or exit a heap at most once, giving O(|V| log |V|) for the heap operations. When a node is moved, the new potentials and memory consumption need to be recalculated (O(|V|)); this may happen at most |V| times. Overall, the complexity is O(|V|²).
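A minimal sketch of this greedy relief in the spirit of the 0-1 min-knapsack heuristic [11]; the candidate ordering and the `fits_on_destination` predicate are our simplifications of ParDNN's selection policy:

```python
import heapq

def resolve_overflow(overflow_bytes, candidates, fits_on_destination):
    """Greedily free `overflow_bytes` from a device by moving nodes away.

    `candidates`: iterable of (move_cost, mem_bytes, node). Each node is
    popped at most once, so heap traffic is O(|V| log |V|) overall.
    """
    heap = [(cost, i, mem, node)
            for i, (cost, mem, node) in enumerate(candidates)]
    heapq.heapify(heap)
    moved, freed = [], 0
    while heap and freed < overflow_bytes:
        _, _, mem, node = heapq.heappop(heap)
        if fits_on_destination(node):  # a moved node never moves again
            moved.append(node)
            freed += mem
    return moved, freed
```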
A.2 Models and Datasets
From language modeling, we use Word-RNN, a multi-layer recurrent neural network for word-level language modeling inspired by character-level modeling [59], and Character-Aware Neural Language Models (Char-CRN) [32]. Both models can be enlarged by increasing the number of layers or the hidden-state size. While the Penn Treebank text corpus [38] is used to train Char-CRN, Word-RNN is trained using Tiny Shakespeare [29].
From computer vision, we experiment with WRN [70], a widened version of the residual network model. In WRN, the width of the convolutional layers is configurable, and the model size grows quadratically when widened. WRN has achieved better accuracy when the model is widened [70]. WRN is trained using CIFAR [34], as it is the dataset used by the original authors.
TRN (Transformer) [61] is a widely used model that had a significant influence on the design of state-of-the-art Transformer-based models in the NLP domain, such as GPT-2 [48] and Megatron-LM [55]. Transformer can be enlarged by increasing the number of layers, which deepens the model, and by widening the inner-layer dimensionality. Deeper [22] and wider [61] configurations of Transformer are shown to give higher accuracy. We trained Transformer using the IWSLT 2016 German–English parallel corpus [8].
E3D is Eidetic 3D LSTM [65], a model for video prediction that achieves state-of-the-art performance in future-frame prediction. E3D is closely related to convolutional recurrent networks, with the dimensions of the memory states increased and 3D-Convs adopted as the basic operators for state transitions. E3D can be enlarged by increasing the number of hidden-state channels on the memory dimensions. We trained E3D using the Moving MNIST dataset [57].