VirtualFlow: Decoupling Deep Learning Models from the Underlying Hardware
Andrew Or
Princeton University
Haoyu Zhang
Google AI
Michael J. Freedman
Princeton University
Abstract
State-of-the-art deep learning systems tightly couple model execution with the underlying hardware. This coupling poses important challenges in a world where the scale of deep learning workloads is growing rapidly: workloads with high resource requirements are inaccessible to most users, experimentation on smaller test beds is impossible, and results are difficult to reproduce across different hardware.

We propose VirtualFlow, a novel system approach leveraging virtual node processing to decouple model execution from the hardware. In each execution step, the batch is divided and processed with data parallelism on many virtual nodes instead of physical devices (GPUs, TPUs), and the gradients are aggregated and applied to the model after all virtual nodes finish processing their data. With multiple virtual nodes mapped to each device, the system allows users to run models at much larger batch sizes that would otherwise exceed the memory limits of the underlying physical resources. VirtualFlow significantly improves model training reproducibility across different hardware, and enables models to run on shared clusters with dynamically changing resources for better efficiency.

Our implementation of VirtualFlow enables virtual node processing with elasticity for TensorFlow. Evaluation with representative deep learning models (ResNet, BERT, Transformer) demonstrates strong convergence guarantees on different hardware with out-of-the-box hyperparameters, and up to 48% lower job completion times with resource elasticity.
Introduction

The scale of deep learning workloads continues to rise. In recent years, model sizes have grown to billions of parameters [38, 42], dataset sizes to hundreds of GBs [12, 37], and batch sizes to 64K to allow for increased parallelism [26, 43], and none of these trends show signs of stopping. Hardware advances have been slow to catch up, however, leading to high computational requirements for these larger workloads. For instance, pre-training BERT takes more than an hour even on 1,024 TPUs [51], and Megatron-LM, an 8.3 billion parameter model, was trained on hundreds of GPUs [42].
Figure 1: Virtual node processing. A batch is divided into 16 equally sized virtual nodes (colored shapes), where each virtual node is mapped to a different GPU. Virtual nodes assigned to the same GPU are executed sequentially, allowing 4 GPUs to train the same model as 16 GPUs with the same set of hyperparameters, including the batch size.
These recent trends raise three important challenges:
Resource requirements.
First, many of these workloads require large clusters of expensive computing devices, such as GPUs or TPUs. As training data continues to grow in size, the resource requirement will only increase. Systems techniques have been proposed to mitigate this issue, but only to a limited extent. For instance, even with the optimizations from DeepSpeed with ZeRO [31] to reduce memory footprint, training Turing-NLG still required 256 V100 GPUs [38], a set of resources inaccessible to most.

Ease of experimentation.
Second, even with enough resources, experimentation becomes difficult at this scale. Users may wish to experiment on a small testbed before running the model on a large cluster. While this approach would enable a shorter feedback loop and reduce deployment costs, it is not possible today: important hyperparameters such as the batch size must be carefully picked to fit within the memory limits of the physical devices. This often leads to different convergence behavior across the two settings, making it difficult to measure the effects of specific changes to the model.
Reproducibility.
In many workloads, the batch size is designed to maximize the efficiency on a specific type and number of available physical devices [17, 26, 42, 43, 51]. The same set of resources may not be available on a subsequent execution, however, and the batch size may have to be adjusted. Yet this adjustment leads to additional issues: other hyperparameters such as the learning rate depend on the batch size, and thus they also must be retuned accordingly to preserve the convergence properties of the model [17]. However, this is cumbersome in practice, and techniques proposed for one workload may not work for another [40].
All three of the above challenges largely stem from a central drawback in today's deep learning systems: a tight coupling between model execution and the underlying hardware. The most widely used frameworks today, such as TensorFlow [5] and PyTorch [35], embed cluster configuration information into the model graph itself. Tensor operations are explicitly placed on specific computing devices, and communication operations often involve a fixed set of devices. The batch size, an important hyperparameter that has a large effect on the convergence trajectory of a model [27], is often tied to the memory capacity of individual computing devices and the number of devices in the cluster [17, 26, 43]. If the global batch size exceeds the cluster-wide memory limit, the workload will simply fail.

In this paper, we argue that deep learning systems should decouple model specifications from the underlying physical hardware. A model should converge to the same accuracy regardless of the type and number of computing devices it is trained on. Running the same workload on a different cluster, such as a smaller test bed, should produce the same results without the user having to retune the hyperparameters or apply different optimization techniques. The same model configuration should be reusable, regardless of the physical layout of the underlying cluster, across a design space that trades off performance for lower resource requirements.

The same philosophy can be observed in many big-data analytics systems. In MapReduce-style batch processing [10, 53] and newer stream processing workloads [1, 2], the system always computes the same answers regardless of the level of parallelism and amount of resources assigned to the job. The input data is sliced into many small partitions to be processed in multiple sequential waves of tasks, and the job would not fail if the amount of data processed in a single wave does not fit in the aggregate memory of the system.

Towards this goal of separating the model from the hardware, this paper introduces VirtualFlow, which leverages virtual nodes as a substrate for distributing computation across physical devices (Figure 1). In this paradigm, a batch of training data is partitioned among virtual nodes instead of physical devices. One or more virtual nodes are then mapped to each physical device and processed sequentially on the device, thus producing one or more MapReduce-style waves of execution within each training step.

VirtualFlow's approach leverages the insight that all virtual nodes share the same model parameters. This allows the model to be cached in each physical device's memory at the beginning of the step and efficiently reused by all virtual nodes mapped to that device. The gradients produced by the virtual nodes are first aggregated in a local memory buffer, and then synchronized across all devices at the end of the step, after all virtual nodes have been processed.

With VirtualFlow, model convergence behavior can be preserved across different sets of resources by fixing the number of virtual nodes. The batch size and other hyperparameters remain unchanged, e.g., regardless of whether the virtual nodes run on 2 or 32 GPUs. Instead, only the mapping between virtual nodes and physical devices needs to be adjusted to utilize different hardware configurations. Workloads that previously required large, expensive clusters can now be packed into smaller deployments by mapping many virtual nodes to each physical device.
Experimentation on smaller test beds is now possible, and results obtained by other users can now be reproduced on a different set of resources without the user having to modify any hyperparameter or optimization strategy. Virtual nodes enable several other important applications:

• Resource elasticity can now be naturally expressed as redistributing the existing set of virtual nodes across a new set of physical devices. Dynamically resizing a job while maintaining convergence guarantees (previously an open challenge [34]) is now possible.

• Fault tolerance can be implemented by reassigning the virtual nodes mapped to a failed physical device to a new, healthy device.

• Hyperparameter exploration. Batch sizes previously exceeding the aggregate limits of the underlying cluster are now in the exploration space. In certain workloads, being able to access these batch sizes can lead to higher model accuracies (Figure 2).
Figure 2: BERT-LARGE finetuning on the Recognizing Textual Entailment (RTE) [47] task on a single RTX 2080 Ti GPU, with and without virtual node processing. Training with a batch size of 16, which converged to a higher accuracy, was previously not possible on this GPU due to memory limits.
We implemented VirtualFlow on top of TensorFlow and evaluated the system's reproducibility and elasticity on a set of representative deep learning models (ResNet [19], BERT [13], Transformer [46]). In our evaluation, VirtualFlow demonstrates strong model convergence guarantees across different hardware configurations, improves cluster utilization by 20%, and reduces job completion time by 48% with elasticity.
Background

In this section, we discuss two important ways deep learning workloads are tightly coupled with the underlying hardware in state-of-the-art systems (§2.1, §2.2). We then describe how each step of training proceeds under this framework (§2.3) to highlight the differences with virtual node processing.
Hyperparameters, such as the batch size, learning rate, and dropout rate, have important effects on the convergence of a model. For this reason, significant effort is often put into tuning these hyperparameters to achieve desirable results. The batch size refers to the number of input examples, e.g., images in an image classification model, processed within a training step. In the multi-GPU setting, the batch is divided evenly among the computing devices, such that each device processes roughly the same amount of data in each step.

Using a larger batch size generally improves the throughput of the training process. Within a single computing device, the device batch size is often set to the maximum size possible within the limits of the device memory capacity. This increases utilization of the device and reduces the number of kernel launches on it. Across multiple devices, the global batch size is simply the maximum possible per-device batch size multiplied by the number of devices in the system. Thus, the larger the global batch size, the more devices can be used to process the batch in parallel.

However, prior work has shown that extremely large batch sizes tend to deteriorate model convergence [27]. In order to preserve convergence behavior while scaling a workload, various efforts have proposed to adjust hyperparameters dependent on the batch size, such as the learning rate [17], or even to apply custom optimization algorithms [26, 43, 50, 51]. This poses a significant burden on the user when scaling out a workload, since the batch size typically increases linearly with the number of computing devices in the system.
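To make this coupling concrete, the relationship can be written as a simple identity (the notation here is ours, not the paper's):

    B = b × N

where B is the global batch size, b the per-device batch size, and N the number of devices. For example, with the per-V100 maximum of b = 256 used for ResNet-50 later in this paper (§6.2), N = 32 GPUs gives B = 8192; shrinking the cluster to N = 16 forces B down to 4096 unless other hyperparameters are retuned.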
Hurdles for reproducibility.
Thus, reproducing existing results on a different set of hardware requires significant effort and expertise. In some cases, it is even impossible. For example, the results from training the BERT model using a batch size of 32,000 examples on 1,024 TPUs [51] and 1,472 GPUs [32] cannot be reproduced on a smaller test bed of 16 GPUs, as the same batch size would not fit in the smaller cluster's GPU memory. On the other hand, reducing the batch size would inevitably lead to very different convergence trajectories that require retuning various hyperparameters. This poses a major hurdle for experimentation as well as scaling.
Another source of coupling between the model and the hardware lies in the model graph, which specifies the network of operations to perform on the input data. Modern deep learning frameworks compile and optimize this graph once at the beginning of training and reuse it for the rest of the job. In addition to tensor operations, information regarding the underlying cluster configuration is also embedded into the model graph. In both TensorFlow and PyTorch, for instance, the graph is defined under a distribution strategy that specifies how model parameters should be synchronized in different settings, such as single GPU, single machine multi-GPU, and distributed multi-GPU.
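As a concrete illustration, the following sketch shows how this coupling surfaces in TensorFlow's user-facing API; the toy model is ours, and only the strategy-scoping pattern itself is TensorFlow's.

```python
import tensorflow as tf

# The distribution strategy captures the set of devices visible at creation
# time; variables and synchronization ops built under its scope assume that
# fixed device set.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables created here are mirrored across the devices captured above.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="sgd", loss="mse")

# Changing the device set later requires rebuilding the graph under a new
# strategy and restoring parameters from a checkpoint.
```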
Hurdles for resource elasticity.
Once the model graph is created under a particular distribution strategy, subsequent training will use synchronization operations that involve a fixed set of computing devices. If the user wishes to resize the cluster or replace a subset of computing devices during training, they will have to rebuild the entire graph under a new distribution strategy and reload previously trained model parameters from a checkpoint. Depending on the size of the graph, this rebuild/reload process can be expensive, taking minutes. This inconvenience has served as a sufficiently large hurdle to prevent most users from leveraging resource elasticity for their workloads [34].
Synchronous training.
It is worth noting that the rigid assignment of resources to jobs is in some ways an artifact of the widely used synchronous training, which enforces a synchronization barrier across all workers at the end of each step. Traditionally, deep learning workloads instead relied on asynchronous training using the parameter server architecture [29], where workers push and pull their parameters independently of each other. This architecture comes with a certain degree of flexibility in terms of resource allocation, but suffers from stale gradients [9], which do not exist in synchronous training. Thus, to solve the resource elasticity problem for modern deep learning workloads, we must design an approach that targets the synchronous training setting.

Figure 3: Memory footprint of a single training step.
Today, a set of devices processes each batch of training examples largely in parallel, as shown in Figure 3. The system first forms an input pipeline that reads a subset of the dataset into main memory, divides it into batches, and preprocesses the training examples into formats that can be consumed by the model graph. Input preprocessing is primarily performed on CPUs and often pipelined with the graph operations executed on GPUs (or TPUs). The preprocessed data are then prefetched from main memory to GPU memory (Step 1) to hide the memory copy overhead behind the computation.

Once the inputs are ready on GPUs, the system performs the forward and backward passes as defined by the model. In the forward pass (Step 2), activations are computed for each layer of the model and retained in memory for gradient computation later. In the backward pass (Step 3), local gradients are computed on each GPU and then synchronized across the cluster (Step 4). Synchronization is typically a simple averaging operation performed via the all-reduce mechanism [39, 45]. Aggregated gradients are then used to update the model parameters (Step 5).
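The following is a simplified sketch of this per-step cycle for one replica; the names are ours, not a framework API, and Step 1 (prefetching inputs to device memory) is handled by the input pipeline and omitted.

```python
import tensorflow as tf

@tf.function
def train_step(model, optimizer, loss_fn, images, labels):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)   # Step 2: forward pass
        loss = loss_fn(labels, logits)
    # Step 3: backward pass computes local gradients on this device.
    grads = tape.gradient(loss, model.trainable_variables)
    # Step 4: in multi-device runs, the distribution strategy all-reduces
    # these gradients across devices before they are applied.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # Step 5
    return loss
```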
Virtual Node Processing

The key concept in VirtualFlow is virtual node processing, a layer of indirection between the model and the underlying hardware that abstracts resource requirements away from application semantics. From the application's perspective, the model uses virtual nodes, rather than physical devices, to perform the computation. With an appropriate number of virtual nodes, the model can run at arbitrary desirable batch sizes as if there were enough physical resources available. VirtualFlow manages the mapping between virtual nodes and physical devices on behalf of the user, who can then tune the model's hyperparameters once and use the same set of hyperparameters on multiple different clusters.
Figure 4: Virtual node trade-off between resource requirement and time requirement. VN in this figure refers to the number of virtual nodes assigned to each GPU. The design space for today's deep learning workloads is limited to only (a).
In today's deep learning systems, each batch of training examples is divided evenly across the devices in the spatial dimension. Each device is assigned exactly one slice, and the size of each slice is limited by the memory capacity of the device (per Fig. 4a). The only way to train the same model on fewer devices (e.g., 2 GPUs instead of 4) would be to reduce the total number of slices (boxes in the figure), which would also reduce the batch size. Without carefully adjusting other hyperparameters, this would hurt the model's convergence.

Instead, VirtualFlow additionally divides the work to do in each training step in the time dimension. It achieves this by processing multiple virtual nodes assigned to each device sequentially (per Fig. 4b and 4c) while maintaining the total number of slices to preserve the global batch size. By time slicing each batch of inputs, VirtualFlow allows models to gracefully fall back to run on fewer devices with longer training times. Users have the freedom to explore the trade-off between hardware capacity and processing time by adjusting the amount of resources assigned to their workloads.

Figure 5: Memory footprint of a single training step with virtual node processing. Each batch is divided across D × V virtual nodes (D devices and V virtual nodes per device). The level of parallelism in the system is still D, but there are now V forward and backward passes within each step. Gradients are synchronized only at the end of the batch, after all virtual nodes have been processed.

This flexibility is crucial to model reproducibility and exploratory experimentation. It guarantees reproducibility in that the training results of a different run of the same model can be replicated using the exact same set of hyperparameters across different clusters. Without using virtual nodes, this is difficult because the batch size often has to change to adapt to the underlying cluster's resource limits. It also simplifies experimentation across different hardware, allowing users to first experiment with a large workload on a small cluster (e.g., Fig. 4c) before deploying it in production.

In each training step, the batch of inputs is split among the virtual nodes in a manner analogous to how a job in MapReduce is divided into tasks. Virtual nodes assigned to the same physical device are processed sequentially, while virtual nodes assigned to different devices are still processed in parallel. This produces one or more waves of execution, similar to MapReduce workloads where the number of tasks is often a small multiple of the number of CPU slots in the system.

Figure 5 traces the steps involved in processing a single batch of data with virtual node processing. The cycle is largely similar to regular processing (Fig. 3), except multiple forward and backward passes may be computed before gradients are synchronized. Local gradients computed on a GPU are aggregated into a gradient buffer at the end of each backward pass (Step 4). After all virtual nodes have been processed on all GPUs, the aggregated gradients are synchronized across the cluster and each GPU independently applies them to its copy of the model as before (Step 5).
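A minimal sketch of this cycle on one device is shown below; it illustrates the idea and is not VirtualFlow's actual code. Here batch_slices holds the V per-virtual-node slices of this device's share of the batch.

```python
import tensorflow as tf

def virtual_node_step(model, optimizer, loss_fn, batch_slices):
    # The gradient buffer persists across the V sequential passes.
    grad_buffer = [tf.zeros_like(v) for v in model.trainable_variables]
    for images, labels in batch_slices:  # V sequential forward/backward passes
        with tf.GradientTape() as tape:
            loss = loss_fn(labels, model(images, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        # Step 4: aggregate into the persistent local gradient buffer.
        grad_buffer = [b + g for b, g in zip(grad_buffer, grads)]
    # Step 5: in the distributed setting, grad_buffer would be all-reduced
    # across devices here; each device then applies the averaged update.
    num_vn = float(len(batch_slices))
    optimizer.apply_gradients(
        [(g / num_vn, v) for g, v in zip(grad_buffer, model.trainable_variables)])
```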
[Figure 6 plot: memory usage (bytes) over time elapsed (s) for the first three training steps; legend: inputs (173.41MB), activations (8.17GB), kernel_temp (788.81MB), other (507.96MB), parameters (102.45MB), unknown (341.84MB).]
Figure 6: Memory usage in the first 3 steps of training ResNet-50 on ImageNet [12] on a single 2080 Ti GPU, broken down by category. Activations constitute the vast majority of memory usage at the peak. The first step is slower due to initial graph optimizations performed by the framework.
One important observation from Figure 5 is that activations constitute the single largest category of memory usage during peak memory consumption (Step 3). This is because activations scale with the batch size while other categories such as the model parameters do not. For example, when training ResNet-50 on ImageNet, the activations typically require over 8GB, while the model is only around 104MB (see Fig. 6).

VirtualFlow aggregates intermediate gradients produced by each virtual node in a local gradient buffer that stays in the device memory throughout the step, creating an extra memory overhead. The size of this buffer is typically on par with the size of the model, which is much smaller than the activations for most workloads. Importantly, since the same gradient buffer is shared among all virtual nodes assigned to each device, the memory overhead is a constant regardless of the number of virtual nodes per device. We demonstrate this across a variety of models in §6.5.
Resource Elasticity
Jobs that run on shared clusters can benefit significantly from resource elasticity. This enables jobs to adapt their resource usage to changing allocations from the cluster scheduler, which may perform such adjustments to enforce fairness [25, 52], preemption [22], and utility-based [54] scheduling policies. Elasticity has been a desirable feature in many other distributed systems, including ones from batch processing [33], stream processing [16], cluster management [44], and cloud computing [7, 8, 14], with important benefits such as higher cluster utilization and lower job completion time. In §6.4, we demonstrate that enabling elasticity in deep learning workloads using virtual nodes can yield the same benefits.
VirtualFlow maintains a mapping between virtual nodes and physical devices, but this mapping need not be fixed over time. To enable resource elasticity, virtual nodes can be redistributed across the physical devices dynamically in response to cluster demand.

More specifically, downsizing a job can be expressed in terms of reassigning the virtual nodes of the released computing devices to the remaining ones that are still allocated to the job. Similarly, upsizing a job can be expressed in terms of migrating a subset of the virtual nodes assigned to existing devices to the new devices that were added. In both cases, the total number of virtual nodes remains the same, and therefore model convergence behavior is preserved across resizes.

Figure 1 illustrates an example of resizing a job from 16 GPUs to 4 GPUs. Each batch is split into 16 equally sized virtual nodes, which were each assigned to a different GPU initially. The virtual nodes are then redistributed among the 4 remaining GPUs, such that each GPU is assigned 4 virtual nodes (instead of 1) in the new configuration.
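A sketch of this remapping is shown below; assign_virtual_nodes is a hypothetical helper, not VirtualFlow's API. The fixed set of virtual nodes is simply redistributed round-robin over whichever devices are currently allocated.

```python
def assign_virtual_nodes(num_virtual_nodes, devices):
    # Round-robin assignment: the set of virtual nodes stays fixed, only the
    # virtual-node-to-device mapping changes across resizes.
    mapping = {d: [] for d in devices}
    for vn in range(num_virtual_nodes):
        mapping[devices[vn % len(devices)]].append(vn)
    return mapping

# Figure 1's example: 16 virtual nodes, resized from 16 GPUs down to 4.
print(assign_virtual_nodes(16, ["gpu:0", "gpu:1", "gpu:2", "gpu:3"]))
# {'gpu:0': [0, 4, 8, 12], 'gpu:1': [1, 5, 9, 13],
#  'gpu:2': [2, 6, 10, 14], 'gpu:3': [3, 7, 11, 15]}
```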
To showcase the benefits of expressing elasticity in terms of virtual node processing, we built a simple event-driven cluster scheduler that allocates resources based on the weighted fair shares (WFS) [11] of the outstanding jobs in an elastic manner. These fair shares are computed based on the priority of the jobs, which can be set to arbitrary attributes of the job to express a variety of scheduling objectives, such as Shortest Job First (SJF) and Shortest Remaining Time First (SRTF).

The main scheduling logic is summarized in Algorithm 1. Every time a job arrival, completion, or resize event is triggered, the scheduler first attempts to expand the allocations of currently running jobs in decreasing priority order to fill idle GPUs, if any. Then, it schedules new jobs until the allocations of existing higher priority jobs are affected. Finally, we issue resize requests to all currently running jobs according to the new allocations.
Algorithm 1: Elastic WFS Scheduler

function schedule(running_jobs, job_queue):
    new_allocations = expand current allocations
    while job_queue not empty do
        fair_allocations = compute fair shares(running_jobs, job_queue.peek())
        if no higher priority job allocations are affected then
            new_allocations = fair_allocations
            running_jobs += job_queue.dequeue()
        else
            break
    resize jobs(new_allocations)

In §6.4, we demonstrate that this scheduler produces significant improvements in terms of cluster utilization and job completion time, compared to a simple priority scheduler that does not perform elasticity.
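For readers who prefer runnable form, the following is a Python rendering of Algorithm 1, as a sketch: expand_current_allocations, compute_fair_shares, affects_higher_priority, and resize_jobs are assumed helpers implied by the description above, not a real API.

```python
def schedule(running_jobs, job_queue, cluster):
    # Expand current allocations in decreasing priority order to fill idle GPUs.
    new_allocations = expand_current_allocations(running_jobs, cluster)
    while job_queue:
        candidate = job_queue[0]  # peek at the next waiting job
        fair = compute_fair_shares(running_jobs + [candidate], cluster)
        if affects_higher_priority(fair, candidate, running_jobs):
            break  # admitting this job would shrink a higher priority job
        new_allocations = fair
        running_jobs.append(job_queue.pop(0))  # dequeue
    resize_jobs(new_allocations)  # issue resize requests to running jobs
```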
We implemented VirtualFlow with elasticity support on top of TensorFlow 2, in roughly 1,400 lines of code, of which 300 lines involved changes in TensorFlow's Keras engine and about 1,100 lines were additions outside of it.

Implementation

In this section, we describe additional practical challenges that VirtualFlow faces and how we address these challenges in our implementation.
Batch normalization.
A common technique for improving model convergence, batch normalization [24] interacts with VirtualFlow as follows. In each step, batch normalization scales the activation values by their mean and variance across all examples in the batch (size B). However, in practice, the mean and variance are often computed over the smaller batch of examples assigned to each device instead (size B/N, for N devices), so as to minimize communication across devices [21]. VirtualFlow preserves this behavior as long as the total number of virtual nodes (N) stays the same. This is because the mean and variance are computed over the same number of examples (B/N) as before in each forward pass.

Batch normalization also computes moving averages of the mean and variance across batches, though these moving averages are used primarily for evaluation and inference, not training. The semantics here differ slightly with virtual nodes: all virtual nodes assigned to the same physical device will now share the moving averages, as opposed to each one of them having its own copy. In our evaluation, however, we found that this difference had little impact on convergence (Table 1), so we leave resolving it for future work.
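The following sketch shows why the per-forward-pass statistics are preserved: with batch size B and N total virtual nodes, each pass normalizes over the same B/N examples no matter how virtual nodes are packed onto devices. Shapes and values here are illustrative.

```python
import tensorflow as tf

B, N = 8192, 32
activations = tf.random.normal([B // N, 64])      # one virtual node's activations
mean, var = tf.nn.moments(activations, axes=[0])  # moments over B/N examples
normalized = (activations - mean) / tf.sqrt(var + 1e-5)
```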
Stateful kernel migration.
In VirtualFlow, resource elasticity is expressed in terms of redistributing virtual nodes. When new physical devices are added to a job, certain state on the virtual nodes must be migrated to the new devices, including the model parameters and certain stateful kernels. One such example is the batch normalization moving mean and variance, which are computed independently on each device and never synchronized. Naive migration of virtual nodes without also migrating these stateful kernels would effectively reset their internal state, potentially hurting convergence. For this reason, VirtualFlow migrates these stateful kernels as well as regular model parameters during job resizes.
Data visitation guarantees.
Deep learning workloads typically use one of the following approaches to distribute the training dataset: replicate it with random shuffling on each device, or partition it across the devices. For replicated datasets, the migration of virtual nodes is trivial: simply start a new, independent shuffled data pipeline on the new device.

For partitioned datasets, however, maintaining exactly-once semantics is more challenging. This is because checkpointing the data pipeline state at the step boundary is in general not supported in existing frameworks. Due to this limitation, VirtualFlow can only provide the exactly-once visitation guarantee if the job is resized at the epoch boundary. However, deep learning models are usually robust to small amounts of noise, and we have not observed noticeable training degradation in practice. Thus, we defer providing the exactly-once data visitation guarantee to future work.
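A minimal sketch of the partitioned-dataset case is shown below: each virtual node reads a disjoint shard, so redistributing virtual nodes moves their shards with them. The helper name and file pattern are illustrative; the tf.data calls themselves are standard.

```python
import tensorflow as tf

def shard_for_virtual_node(file_pattern, num_virtual_nodes, vn_index):
    # Deterministic file listing so every virtual node sees the same ordering,
    # then a disjoint shard per virtual node.
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    return files.shard(num_shards=num_virtual_nodes, index=vn_index)
```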
Evaluation

In this section, we evaluate VirtualFlow's effectiveness in reproducing results across cluster sizes (§6.2), exploring previously inaccessible hyperparameters for models (§6.3), and providing cluster-level benefits through resource elasticity while preserving application-level semantics (§6.4).
Most end-to-end experiments are performed on 2 servers, each with 8 NVIDIA V100 GPUs (16GB), 64 Intel Xeon CPUs (2.2GHz), and 250GB of DRAM, connected over a 16 Gbps connection. For some experiments, we also run microbenchmarks using a different type of GPU (specifically, 2 NVIDIA GeForce RTX 2080 Ti) on a server with 32 Intel(R) Xeon(R) E5-2620 v4 CPUs (2.1GHz) and 64GB of DRAM.
GPUs | VirtualFlow BS | VN/GPU | VirtualFlow Acc (%) | TF* BS | TF* Acc (%)
1    | 8192 | 32 | 75.92 | 256  | 69.25
2    | 8192 | 16 | 75.96 | 512  | 67.30
4    | 8192 | 8  | 75.99 | 1024 | 70.68
8    | 8192 | 4  | 75.83 | 2048 | 73.04
16   | 8192 | 2  | 75.68 | –    | –
2†   | 8192 | 32 | ...   | –    | –

Table 1: Final top 1 validation accuracies for the same ResNet-50 experiment shown in Figure 8. VirtualFlow preserves the target accuracy of 76% in all configurations. VN/GPU refers to the number of virtual nodes per GPU, and † refers to training on RTX 2080 Ti GPUs instead of V100 GPUs.
We demonstrate that VirtualFlow can reproduce training results across different cluster sizes for two well-known deep learning workloads: ResNet-50 [19] on ImageNet [12] and BERT [13] finetuning on GLUE [47]. In these experiments, we varied the number of GPUs from 1 to 16 (8 for BERT) while fixing the batch sizes, and observed almost identical convergence trajectories for both workloads.
Baseline.
To highlight the differences with the state of the art, we compare VirtualFlow with a version of vanilla TensorFlow 2 (TF*) that trains each workload with the largest batch size that fits on the allocated GPUs and does not retune any other hyperparameters.

In this experiment, we train ResNet-50 on the ImageNet dataset for 90 epochs using a batch size of 8192, a widely used benchmark that is known to converge to the vicinity of 76% [17]. To demonstrate that VirtualFlow can preserve convergence across GPU types, we ran this workload on both V100 and RTX 2080 Ti GPUs. Each V100 GPU can fit a batch of 256 examples at a given time, so we use 32 total virtual nodes for these runs. For the smaller RTX 2080 Ti GPUs, we use 64 total virtual nodes instead.

Table 1 demonstrates that VirtualFlow reaches the target accuracy for all runs, reproducing the results for the same workload on even a single GPU.
[Figure 7 panels: BERT-BASE finetuning on QNLI, SST-2, and CoLA (validation accuracy vs. epoch); series: VF on 1, 2, 4, and 8 GPUs (BS 64), and TF* on 1, 2, 4, and 8 GPUs (BS 8, 16, 32, and 64).]
Figure 7: VirtualFlow preserves the convergence trajectory across different numbers of GPUs by fixing the batch size at 64. In this case, the naive solution (TF*) also happens to converge to the same accuracies because these workloads are less sensitive to varying the batch size in this range.
Figure 8: VirtualFlow preserves the convergence trajectory across different numbers of GPUs by fixing the batch size at 8192. Naively attempting to reproduce the same workload on fewer GPUs without retuning the hyperparameters (TF*) yields lower accuracies.

In contrast, attempts to reproduce this workload on fewer GPUs without retuning the hyperparameters (TF*) led to diverged models, e.g., doing so on 1 GPU led to a final accuracy of only 69.25%, far short of the target 76%.

VirtualFlow preserves not only the final accuracy but also the entire convergence trajectory. In Figure 8, all VirtualFlow lines trace each other closely, especially in the latter portion of the training. In contrast, the baseline TensorFlow solution (TF*) that does not retune hyperparameters converged to various accuracies, all conspicuously below the target.

In short, without relying on virtual nodes, the user must retune the hyperparameters every time a workload is run on a different cluster, a cumbersome process that often involves significant human expertise. While there are well-known guidelines for tuning these hyperparameters for ResNet, this is not the case for arbitrary workloads. Instead, training with the exact same set of hyperparameters using VirtualFlow guarantees that the convergence behavior of the model is preserved, allowing the user to focus on application semantics instead.
GPUs | BS | VN/GPU | QNLI Acc (%) | SST-2 Acc (%) | CoLA Acc (%)
1    | 64 | 8      | 90.86        | 92.07         | 83.01
2    | 64 | 4      | 91.05        | 92.35         | 84.08
4    | 64 | 2      | 90.86        | 92.20         | 83.50
8    | 64 | 1      | 90.88        | 91.86         | 82.45
Table 2: Final top 1 validation accuracies achieved by VirtualFlow for the same BERT-BASE experiment shown in Figure 7. Across a variety of finetuning tasks, VirtualFlow converged to the same final accuracies regardless of the number of GPUs assigned. VN/GPU refers to the number of virtual nodes per GPU.
The second workload that showcases the reproducibility of VirtualFlow is finetuning BERT-BASE on the GLUE dataset using a fixed batch size of 64. The GLUE tasks considered in this experiment are QNLI (a natural language inference task on question-answer pairs), SST-2 (a sentiment classification task), and CoLA (which predicts whether a sentence is linguistically acceptable). For QNLI and SST-2, we use 1/10th of the original dataset in each epoch and train for 20 epochs. For CoLA, we train on the whole dataset for 50 epochs.

As with the ResNet workload, VirtualFlow converged to the same final accuracies for all runs within each GLUE task (Table 2) by preserving the batch size and the total number of virtual nodes across different numbers of GPUs. Similarly, Figure 7 shows that the entire convergence trajectory is also preserved across different cluster sizes.

Unlike in the ResNet case, however, the naive approach of not retuning hyperparameters across different cluster sizes (TF*) also happened to converge to the same accuracies in these workloads. This illustrates that these workloads are less sensitive to a changing batch size within this range (8 to 64). Thus, while VirtualFlow did not lead to higher accuracies in this case, it still guaranteed that results for the batch size of 64 can be consistently reproduced across different clusters. Additionally, VirtualFlow enabled the user to better understand the convergence characteristics of the larger batch size of 64 using just a single GPU. We will return to this in §6.3.
[Figure 9 panels: BERT-BASE finetuning on QNLI, SST-2, and CoLA; throughput (examples/s) of TF* vs. VF on 1, 2, 4, and 8 GPUs, with each bar annotated with its final accuracy.]
Figure 9: VirtualFlow can increase the training throughput by up to 19.2% (SST-2 with 1 GPU) compared to running without virtual nodes (TF*) on the same set of resources, while converging to the same accuracies. This is because the use of virtual nodes allows VirtualFlow to perform fewer model updates, a direct result of using larger batch sizes that would have previously exceeded the memory limits of the cluster.
[Figure 10 panels: BERT-LARGE finetuning on RTE, SST-2, and MRPC (validation accuracy vs. epoch); series: TF (BS 4) and VF (BS 8, 16, 32, 64, 128).]
Figure 10: Batch size exploration with VirtualFlow on a single RTX 2080 Ti GPU. VirtualFlow expands the space of possible batch sizes on this GPU from 4 (TF) to [4, 8, 16, 32, 64, 128], and can support even larger batch sizes. In some cases, such as in RTE (left), being able to access larger batch sizes can lead to significantly higher final accuracies (+7.1% with a batch size of 16).
One important side effect of using larger batch sizes is that the number of gradient synchronizations and model updates decreases proportionally. In VirtualFlow, this can be achieved by using V virtual nodes per device, which reduces the number of model updates by a factor of V while leaving the number of forward and backward passes unchanged. This can improve the overall throughput of the system.

Figure 9 illustrates this effect for the BERT-BASE finetuning workload. The fewer GPUs used to train with the same batch size, the more virtual nodes are needed per GPU, and the larger the difference in throughput compared to not using virtual nodes. For example, for one GPU, VirtualFlow can use a batch size of 64 while vanilla TensorFlow must use a batch size of 8 or less. In this case, using the larger batch size in VirtualFlow led to a throughput improvement of 19.2%, 17.8%, and 16.1% for SST-2, CoLA, and QNLI respectively. Thus, even if VirtualFlow did not improve the final accuracy of the model (which was not a goal of VirtualFlow in the first place), it can still help reduce the training time by lowering the model update frequency. In the distributed setting, this also reduces the number of expensive gradient synchronizations across the network.

An important use case of being able to reproduce results across different clusters is exploration. In this section, we demonstrate how VirtualFlow allows the user to explore the convergence characteristics of larger batch sizes that would have previously exceeded the memory limits of the same set of resources, and how this can, in some cases, lead to improved model accuracies.

In this experiment, we finetune BERT-LARGE on three GLUE tasks: RTE (a textual entailment task), SST-2, and MRPC (a classification task for paraphrasing). All tasks are trained for 10 epochs on a single RTX 2080 Ti GPU. Unlike before, we vary the number of virtual nodes, and consequently the batch size, while holding the number of GPUs constant.

Figure 10 plots the model convergence for this experiment. Unlike before, since the batch size changes across runs, so do the convergence trajectory and the final accuracy. This allows the user to explore the convergence characteristics of various batch sizes, without deploying the resources that would have been necessary to run these batch sizes using vanilla TensorFlow (e.g., 32 GPUs for a batch size of 128).

In some cases, VirtualFlow can even achieve higher accuracies in the batch sizes explored. For RTE, using a larger batch size of 16 is now possible on 1 GPU, even though this batch size would require 4 GPUs without the use of virtual nodes. This configuration ended up improving the final accuracy by 7 percentage points compared to running vanilla TensorFlow with the maximum batch size (4) previously available to the same GPU.
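To make the update-frequency effect above concrete, the relation can be written out (notation follows Figure 5; |S| and b below are our shorthand for the number of training examples per epoch and the per-device batch size):

    B = b · D · V,    updates per epoch = |S| / B = |S| / (b · D · V)

Holding b and D fixed, increasing V from 1 to 8 therefore cuts the number of model updates and gradient synchronizations per epoch by 8x, while each device still performs the same |S| / (b · D) forward and backward passes.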
[Figure 11 panels: BERT-LARGE finetuning on RTE, SST-2, and MRPC; throughput (examples/s) of TF (BS 4) vs. VF at BS 8 (2 VN), 16 (4 VN), 32 (8 VN), 64 (16 VN), and 128 (32 VN). Final accuracies per bar: RTE 0.66, 0.72, 0.73, 0.59, 0.70, 0.67; SST-2 0.92, 0.92, 0.92, 0.93, 0.92, 0.92; MRPC 0.87, 0.86, 0.86, 0.87, 0.84, 0.82.]
Figure 11: Throughputs of the same batch size exploration experiment shown in Figure 10. For RTE, using VirtualFlow not only leads to higher accuracies, but also higher training throughputs (up to +18.5% using BS 16, or +28.7% using BS 128). Other tasks see similar throughput improvements. The number at the bottom of each bar refers to the final accuracy achieved in that run, and the hatched bar represents the configuration with the highest final accuracy within each task.
[Figure 12 panels: (a) GPUs allocated over time under the VF scheduler with elasticity; (b) GPUs allocated over time under the priority scheduler without elasticity; (c) final validation accuracy per job, VF vs. Priority; (d) job completion time (JCT) per job in minutes, VF vs. Priority.]
Figure 12: Elasticity with VirtualFlow reduces the makespan by 38% and the job completion time (JCT) for the highest priority job by 45%, while preserving model accuracies. In this workload, 3 jobs share 4 V100 GPUs on a single machine. Job priorities are (1, 5, 10), and job GPU demands are (4, 2, 4) respectively. For (a) and (b), dotted lines indicate when Jobs 1 and 2 are submitted.

The observation that larger batch sizes can lead to improved throughputs on the same set of resources (§6.2.3) is also relevant in this experiment. For all the tasks considered, using a larger batch size reduced the training time by lowering the model update frequency as before (Fig. 11). For RTE, using a batch size of 16, as enabled by VirtualFlow, not only improved the final accuracy by 7.1%, but also improved the throughput by 18.5%. Using a batch size of 128 further improved the throughput by 28.7% without affecting convergence.
Another important use case enabled by VirtualFlow is resource elasticity: a job can be resized dynamically during training by adjusting the number of virtual nodes per GPU.
Model       | Dataset  | Batch sizes                         | VN/GPU
ResNet-56   | cifar10  | 64, 128                             | 1
ResNet-50   | ImageNet | 256, 512, 1024, 2048, 4096, 8192    | 1, 2, 4
BERT-BASE   | CoLA     | 8, 16, 32, 64, 128                  | 1, 2
BERT-BASE   | SST-2    | 8, 16, 32, 64, 128                  | 1, 2
Transformer | WMT      | 4096, 8192, 16384, 32768, 65536     | 1, 2
Table 3: Mix of workloads used in the 20-job elasticity experiment. Each job in the trace is selected uniformly at random from this set of workloads and assigned a random priority chosen from (1, 5, 10).
This section describes experimental results that highlight thecluster-level benefits of this approach.
Using the scheduling framework described in §4.2, we ran two traces with and without VirtualFlow. The first is a simple 3-job trace designed to illustrate a scenario in which elasticity can have significant effects on cluster-level objectives. Job 0 finetunes BERT-BASE on SST-2, Job 1 trains ResNet-56 on cifar10 [28], and Job 2 finetunes BERT-BASE on QNLI. The BERT jobs both demand 4 GPUs, while the ResNet job demands only 2 GPUs. The jobs arrive in the order of increasing priority, with Job 2 being the highest.

Figure 12 compares running this trace with the VirtualFlow scheduler, which dynamically resizes jobs to satisfy cluster-level Weighted Fair Shares (WFS), to running it with a simple priority scheduler that orders jobs in descending priority but does not resize any job. With VirtualFlow, existing jobs downsize as soon as a new job with higher priority arrives. With the static priority scheduler, however, the high priority Job 2 is stuck behind Job 1 for a long time, leaving 2 GPUs idle for the entire duration of Job 1.

Observe that although all 3 jobs resized over the course of their respective lifetimes in the VirtualFlow case, they all converged to the same accuracies as their counterparts in the simple priority scheduler case. Thus, the VirtualFlow scheduler is able to reduce the makespan by 38% and the high priority job completion time (JCT) by 45%, while preserving the application-level semantics of each job.
Figure 13: Elasticity with VirtualFlow (top) increases average cluster utilization by 19.5% and reduces makespan by 45.5%, compared to a simple priority scheduler (bottom) that does not perform resource elasticity. In this trace, 20 jobs arrive at a rate of 12 jobs per hour according to a Poisson distribution. Each colored box corresponds to a job. Boxes resize for the elastic scheduler (top) but not for the static scheduler (bottom).
Figure 14: In the same 20-job experiment shown in Figure 13, VirtualFlow reduces the median JCT by 47.6% and the median queuing delay by 99.3% by resizing jobs dynamically.
Next, we evaluate VirtualFlow on a more realistic trace consisting of 20 jobs arriving with a Poisson distribution, with an average load of 12 jobs per hour (average interarrival time of 5 minutes). The mixture of workloads used in this trace is selected uniformly at random from Table 3. To speed up the experiment, we train each job for only a subset of the steps or epochs needed for convergence.

Figure 13 depicts the GPU allocations for both schedulers over time. Compared to the simple priority scheduler, enabling elasticity with VirtualFlow improved average cluster utilization from 71.1% to 90.6%, reduced the makespan by 45.5%, the median JCT by 47.6%, and the median queuing delay by 99.3%. The largest gain from using elasticity is the reduction in queuing delay (Figure 14): most jobs are assigned some GPUs as soon as they are submitted instead of being queued behind other potentially long jobs. This is especially true for high priority jobs, which can partially preempt lower priority jobs by downsizing them.
Figure 15: Peak memory on a RTX 2080 Ti GPU, normalized bythe peak memory of not using virtual nodes (TF). Memory overheadscales with the model size and is constant across virtual nodes.
Figure 16: Throughput on a RTX 2080 Ti GPU, normalized by the throughput of not using virtual nodes (TF). For large models such as BERT-LARGE, throughput increases with the number of virtual nodes due to fewer model updates. For smaller models, throughput is not affected by the number of virtual nodes.
Virtual node processing adds a gradient buffer in memory to aggregate gradients across virtual nodes. Figure 15 plots the memory overhead of this buffer for three different workloads. The gradient buffer is the same size as the model, which is why the gap between 1 and 2 virtual nodes is much larger for BERT-LARGE than for ResNet-50. Beyond 2 virtual nodes, however, the memory overhead stays constant as the number of virtual nodes increases. For all the workloads considered, the memory overhead does not exceed 20%.

Figure 16 plots the throughput across a range of virtual nodes for the same three workloads. For these workloads, using virtual nodes at worst leaves throughput unchanged, and can increase it by 1.3x in some cases, especially when the model is large (BERT-LARGE). This is because the batch size scales with the total number of virtual nodes in the system, and a larger batch size means fewer model updates for the same amount of data (§6.2.3). For large models, model updates can be a significant cost.
Future Work
Virtual node processing can be further extended to supportother important use cases not explored in this paper, includingheterogeneous training and fault tolerance. We briefly touchupon these two use cases now, and additionally discuss howVirtualFlow can be extended to support model parallelism.
Heterogeneous training.
Modern deep learning frameworks require the set of resources assigned to a job to be homogeneous. Meanwhile, hardware vendors have been gradually rolling out new generations of accelerators, each with improved compute and memory capacity. This often results in a heterogeneous mix of devices in on-premise clusters and public clouds [6, 15]. Being able to combine different types of computing devices (e.g., V100 and K80 GPUs) in the same job can increase throughput and cluster utilization.

Heterogeneous training is a generalization of virtual node processing: each computing device is still assigned one or more virtual nodes, but the number and size of the virtual nodes need not be the same across different resource types. For example, if V100 GPUs achieve roughly 50% higher throughput than K80 GPUs on a given workload, then the framework may try to place 3 virtual nodes on each V100 GPU, and 2 virtual nodes on each K80 GPU. This is a classic resource packing problem that cluster schedulers elsewhere have long supported [20].
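A sketch of such proportional packing is shown below; pack_virtual_nodes is a hypothetical helper, not VirtualFlow's API, and a real implementation would also rebalance so the counts sum exactly to the desired total.

```python
def pack_virtual_nodes(device_throughputs, total_virtual_nodes):
    # Assign virtual node counts roughly proportional to measured throughput.
    total = sum(device_throughputs.values())
    return {dev: max(1, round(total_virtual_nodes * tput / total))
            for dev, tput in device_throughputs.items()}

# The text's example: V100s ~50% faster than K80s, 10 virtual nodes in total.
print(pack_virtual_nodes(
    {"v100:0": 3.0, "v100:1": 3.0, "k80:0": 2.0, "k80:1": 2.0}, 10))
# {'v100:0': 3, 'v100:1': 3, 'k80:0': 2, 'k80:1': 2}
```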
Fault tolerance.
State-of-the-art solutions for fault tolerance in (synchronous) deep learning workloads rely on periodic checkpoints. Even if a single worker fails, the entire job needs to be restarted and the model restored from a potentially stale checkpoint. However, since the model is typically replicated across all workers, we can reuse the elasticity mechanisms in VirtualFlow to simply replace failed workers. As long as there is one healthy worker left, the new workers can fetch the model parameters from existing workers instead of reading them from a checkpoint. This would ensure training is uninterrupted from the application's perspective.
Model parallelism.
Recent years have seen a rise of extremely large models that deliver unprecedented accuracies, but do not fit within the memory capacity of a single device. Training these workloads relies on model parallelism, which partitions the model across the computing devices in the system. These models typically require a large number of training examples, so model parallel training is often employed together with data parallelism [23, 41].

VirtualFlow is still applicable in this setting: the model can be partitioned by virtual nodes instead of by physical devices. Virtual nodes that are assigned the same partition of the model can then be preferably colocated, so that each model partition can be shared across all the virtual nodes assigned to the same device. Reducing the resource requirement for these models in this manner will bring the same reproducibility and experimentation benefits to these workloads as well.
Related Work

Gradient aggregation.
The most similar line of work to virtual node processing is a variant of asynchronous training that synchronizes gradients every k steps, where k is either fixed [56] or changes over time [48, 55]. Their main goal, however, is providing an alternative to synchronous SGD that provides faster convergence, rather than reproducing results across hardware. PyTorch recently introduced gradient accumulation [30], a mechanism that also allows users to simulate larger batch sizes, but does not explore its benefits and convergence characteristics in a dynamic resource setting.

Resource elasticity.
Resource elasticity in deep learning has been proposed in [34]. The authors also explore autoscaling heuristics for these workloads, which are complementary to our approach. However, their system requires retuning the hyperparameters and hence lacks strong convergence guarantees, and only considers elasticity for individual jobs. Concurrent with this work, Elastic Horovod [3] and TorchElastic [4] also implement elasticity for deep learning workloads, but likewise leave the burden of fixing the batch size, and thus preserving model convergence, up to the users.
Cluster scheduling.
Dynamic resource allocation has also been explored in the context of multi-tenant GPU clusters. Optimus [36] models throughput and convergence based on online feedback to schedule jobs using the parameter server architecture. Tiresias [18] proposes a multi-queue scheduling algorithm that preempts low priority jobs to minimize JCT, where priority is expressed in terms of least attained service (LAS). Gandiva [49] time slices GPUs across multiple jobs and dynamically migrates jobs to increase cluster utilization. However, unlike VirtualFlow, these approaches do not focus on preserving the application semantics of individual jobs, and they rely on checkpoint-based mechanisms that restart jobs entirely across resizes, leading to unnecessary GPU idle time.
Conclusion

VirtualFlow is an important step towards decoupling deep learning model execution from its underlying physical hardware. Leveraging the idea of virtual node processing, VirtualFlow allows users to reproduce training results consistently across different clusters, experiment with large workloads on small testbeds, and reap the benefits of resource elasticity without worrying about model convergence, all without a single change to the model specification or the hyperparameters. The benefits of virtual nodes are not limited to the use cases and settings explored in this paper. In the future, we expect to see more sources of systems complexity in deep learning frameworks hidden from the user, freeing them to focus on achieving desirable results with their models instead.

References

[1] Apache Flink. https://flink.apache.org/.
[2] Apache Storm. https://storm.apache.org/.
[3] Elastic Horovod: https://horovod.readthedocs.io/en/latest/elastic_include.html, 2020.
[4] TorchElastic: https://pytorch.org/elastic/0...html, 2020.
[5] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[6] AWS. Accelerated Computing: https://aws.amazon.com/ec2/instance-types/.
[7] AWS. Auto Scaling: https://aws.amazon.com/autoscaling/.
[8] Azure. Autoscale: https://azure.microsoft.com/en-us/features/autoscale/.
[9] Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. In International Conference on Learning Representations Workshop Track, 2016.
[10] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[11] Alan Demers, Srinivasan Keshav, and Scott Shenker. Analysis and simulation of a fair queueing algorithm. ACM SIGCOMM Computer Communication Review, 19(4):1–12, 1989.
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[14] Google Cloud Engine. Autoscaling groups of instances: https://cloud.google.com/compute/docs/autoscaler/.
[15] Google Cloud Engine. TPU Pod Pricing: https://cloud.google.com/tpu/pricing.
[16] Buğra Gedik, Scott Schneider, Martin Hirzel, and Kun-Lung Wu. Elastic scaling for data stream processing. IEEE Transactions on Parallel and Distributed Systems, 25(6):1447–1463, 2013.
[17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[18] Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. Tiresias: A GPU cluster manager for distributed deep learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 485–500, 2019.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[20] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11), pages 295–308, 2011.
[21] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.
[22] Chi-Yao Hong, Matthew Caesar, and P. Brighten Godfrey. Finishing flows quickly with preemptive scheduling. ACM SIGCOMM Computer Communication Review, 42(4):127–138, 2012.
[23] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems 32, pages 103–112, 2019.
[24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of Machine Learning Research, volume 37, pages 448–456, Lille, France, 2015. PMLR.
[25] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 261–276, 2009.
[26] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205, 2018.
[27] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[28] Alex Krizhevsky. Convolutional deep belief networks on CIFAR-10.
[29] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014.
[30] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. PyTorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704, 2020.
[31] Rangan Majumder and Junhua Wang. ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters. https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/, 2020.
[32] Shar Narasimhan. NVIDIA clocks world's fastest BERT training time and largest transformer based model, paving path for advanced conversational AI. https://developer.nvidia.com/blog/training-bert-with-gpus/, 2019.
[33] Andrew Or. Apache Spark Dynamic Resource Allocation: https://spark.apache.org/docs/latest/job-scheduling.html, 2014.
[34] Andrew Or, Haoyu Zhang, and Michael J. Freedman. Resource elasticity in distributed deep learning. In Proceedings of the 3rd Conference on Machine Learning and Systems, 2020.
[35] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037, 2019.
[36] Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, and Chuanxiong Guo. Optimus: An efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference, pages 1–14, 2018.
[37] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[38] Corby Rosset. Turing-NLG: A 17-billion-parameter language model by Microsoft. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/, 2020.
[39] Alexander Sergeev and Mike Del Balso. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018.
[40] Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.
[41] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. Mesh-TensorFlow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems 31, pages 10414–10423. Curran Associates, Inc., 2018.
[42] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053, 2019.
[43] Peng Sun, Wansen Feng, Ruobing Han, Shengen Yan, and Yonggang Wen. Optimizing network performance for distributed DNN training on GPU clusters: ImageNet/AlexNet training in 1.5 minutes. arXiv preprint arXiv:1902.06855, 2019.
[44] Jerzy Szczepkowski and Marcin Wielgus. Autoscaling in Kubernetes: https://kubernetes.io/blog/2016/07/autoscaling-in-kubernetes/, 2016.
[45] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications, 19(1):49–66, 2005.
[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[47] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019.
[48] Jianyu Wang and Gauri Joshi. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. In Proceedings of the 2nd Conference on Systems and Machine Learning, 2019.
[49] Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, et al. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 595–610, 2018.
[50] Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888, 2017.
[51] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2019.
[52] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems, pages 265–278, 2010.
[53] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 15–28, 2012.
[54] Haoyu Zhang, Logan Stafman, Andrew Or, and Michael J. Freedman. SLAQ: Quality-driven scheduling for distributed machine learning. In Proceedings of the 2017 Symposium on Cloud Computing, pages 390–404. ACM, 2017.
[55] X. Zhao, M. Papagelis, A. An, B. X. Chen, J. Liu, and Y. Hu. Elastic bulk synchronous parallel model for distributed deep learning. Pages 1504–1509, 2019.
[56] Fan Zhou and Guojing Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization.