How to Train your DNN: The Network Operator Edition
Michael Alan Chang, Domenic Bottini, Lisa Jian, Pranay Kumar, Aurojit Panda, Scott Shenker
Deep Neural Nets have hit quite a crest
But physical networks are where they must rest
And here we put them all to the test
To see which network optimization is best
Deep Neural Networks have gained significant traction in both academia and industry. One type of deep neural network – feedforward convolutional neural networks (CNNs) – has been the driving thrust behind critical applications such as image recognition [15, 26], drug discovery [30], and medical diagnosis [18, 21, 27, 30]. Maintaining the accuracy of CNNs often requires frequent retraining, so it is important to reduce CNN training time. Efforts to do so have targeted nearly all layers of the software and hardware stack, and increasingly involve distributing training across machines in a cluster. Early work on distributed CNN training adopted the parameter server model [19], where computation is performed by several worker nodes and one or more parameter servers are used to aggregate and distribute results from individual workers. Recent work has also looked at a variety of topics, such as improving the performance of distributed CNN training through better scheduling [14] and improving network transfers [29]. In this paper we focus on optimizations that involve the cluster network. A variety of network-oriented solutions have been proposed, but with little resulting clarity about which proposal (or combination of proposals) achieves the best end-to-end performance. This paper seeks to answer this question.

We first observe that one can divide network optimizations into two broad categories: those that change the network fabric and those that only change software at end hosts. This is a useful distinction because changing the network fabric is typically more difficult (involving router software and perhaps hardware), while host software changes are significantly easier to implement. The first fabric-based change we consider is IP multicast (a feature that is supported by most routers but often not enabled). IP multicast has previously been used to accelerate HPC workloads using MPI [6, 17, 31], which in turn has been used as a communication primitive by a number of distributed CNN frameworks [2, 29]. The other class of fabric-based optimization is in-network aggregation, as can be implemented using programmable switches. The use of in-network aggregation for CNNs has already been proposed in Daiet [25] and by Luo et al. [22]; here our goal is to understand its performance impact relative to other network optimizations.

The host-based techniques we consider move away from the parameter server model. These include ring-reduce [3] and all-reduce (e.g., Rabenseifner [24] or butterfly mixing [5]), which avoid the use of a centralized server for aggregation and have been successfully used to speed up HPC jobs in the past. We provide a more detailed list of such approaches in §3.3.2.

We analyze these various approaches – in isolation and in combination where possible – to answer four questions. First, how do these various optimizations rank in terms of effectiveness? Second, given this ranking, is it necessary to resort to fabric-based mechanisms, or do host-based mechanisms suffice? Third, how robust are these results to possible future changes in CNNs (e.g., more layers)? Fourth, how robust are these results to possible future changes in host behavior (e.g., changes in TensorFlow)?

We rely on trace-driven simulation to address these questions. Simulation allowed us great latitude in testing a variety of proposals, including ones which require changes to the network hardware and were hence infeasible for us to test in practice. Simulation also allowed us to avoid certain approximations and non-determinism that would have been difficult to consider in a tractable analytical model. Thus, simulation provided a good balance between the accuracy of our results and the ability to try out a wide range of optimizations. To further ensure realism, we seed our simulations with traces generated from training CNNs on real hardware using distributed TensorFlow. We describe our techniques for generating traces and the design of our simulator in greater detail in §5.

We ran our simulator on four image recognition models (described in greater detail in §7), and we present evaluation results from these runs later in the paper. At a high level, we found that in the typical case, using fabric-based mechanisms to speed up training in the parameter server model yields lower benefits than using host-based mechanisms that abandon the parameter server model in favor of other reduce strategies. We found that this held even when we combined both fabric-based mechanisms. Our basic conclusion is that optimizing communication for CNN training does not necessitate changes to the network fabric.
In this section, we begin by discussing the computational model for CNN training. Following this, we provide an overview of communication paradigms for distributed CNN training. Overall, we provide overviews of the following mechanisms that we explore in this paper: in-network aggregation, IP multicast, ring-reduce, ring-reduce with multicast, and butterfly mixing. Next, we discuss how changes to the network fabric could be used in conjunction with the aforementioned communication paradigms – and thus further accelerate training.
We now consider the computational process: how the model computation (forward pass and backpropagation) interleaves with the communication. We consider data parallel approaches to distributed learning, where each worker operates on the entire model but uses different training data. Data parallel learning is the most prevalent approach today.

Fig. 1. Steps in a distributed training job with a parameter server.
Distributed CNN training proceeds in four steps:
Distribution:
When using parameter servers, the parameter servers' updates to the workers constitute the distribution phase. For butterfly mixing or ring-reduce, this would be the last communication phase, right before all workers receive the same updated model.
Forward Pass:
Each worker selects a sample of the training data and uses the parameters to compute labels for this training sample. This computation is commonly referred to as the forward pass since each layer operates on the training image in order. The training data used in this step is loaded concurrently during the distribution step and is not on the critical path. As a result, our analysis does not consider the time taken to load training data.
Backpropagation:
Next, each worker uses the supplied labels (from the training set) and the computed labels to determine each layer's contribution to the training error, and then computes an appropriate change to the layer. This computation is commonly referred to as backpropagation and proceeds from the last layer in the neural net to the first one. Backpropagation utilizes results computed during the forward pass and, as a result of this data dependency, cannot progress until the worker has finished the forward pass.
Aggregation:
As the backpropagation progresses, the worker will send updates to a reducer. In the parameter server model, parameter servers are responsible for both aggregating and applying updates. An iteration is considered to have completed only when the parameter servers have received updates from all workers, thus enforcing a global barrier (i.e., across all workers) between the distribution and aggregation steps.

Within the parameter server model, each iteration of the algorithm consists of the four steps shown in Figure 1. These four steps can be partially pipelined in the following two ways. First, the forward pass proceeds layer-by-layer, and as a result a worker can begin the forward pass step as soon as it has received the parameters for the first layer of the CNN. Second, backpropagation also proceeds layer-by-layer, and workers can begin sending updates as soon as they have computed updates for a layer. To the best of our knowledge, all commonly used CNN frameworks employ pipelining to improve training performance.

For end-host mechanisms (i.e., butterfly mixing and ring-reduce), the forward pass is not pipelined with the distribution phase. Backpropagation pipelining operates the same way.
One way to facilitate distributed CNN training makes use of one or more centralized parameter servers [19]. These algorithms are implemented as part of several CNN frameworks including TensorFlow [11], Caffe2 [13], and MXNet [8].

For parameter server based training, there can be two distinct communication phases during each training iteration: (i) a distribution phase where CNN parameters are distributed to workers, which then execute a local training algorithm using these parameters, and (ii) an aggregation phase where each worker sends the local training algorithm's updates to one or more parameter servers.

The distributed CNN training algorithms we consider are iterative. Distributed training algorithms can be further classified into synchronous and asynchronous algorithms.
Synchronous training algorithms require that all workers agree on the model at the beginning of a training iteration; this is implemented by having the parameter server impose a barrier across workers.
Asynchronous training algorithms do not impose consistency requirements across workers. While asynchronous training algorithms decrease the time taken by each iteration, they increase the total number of iterations required and can thus slow down overall training time [7]. Some companies like Google tend to favor synchronous training algorithms [1]. In this paper we focus on synchronous training algorithms because the presence of synchronization barriers allows us to more easily reason about iteration time.
Parameter servers present several issues to practitioners, beginning with selecting the correct ratio between the number of parameter servers and workers. Rather than aggregating parameter updates in centralized nodes, others have advocated efficient all-reduce algorithms that exchange parameters solely between worker nodes. In the context of training CNNs, two algorithms are most commonly discussed: ring-reduce [28] and butterfly mixing [5].

Ring-reduce, popularized by Uber's implementation (Horovod [28]), requires that the workers connect in a ring. There are two communication phases. First, parameters in the model are assigned to workers in a round-robin fashion. In the first phase (analogous to the aggregation phase), each worker begins computing gradient updates; when a worker completes the computation for a parameter assigned to it, it immediately sends that parameter to the next node in the ring. Upon receiving the update, each worker incrementally averages the update with its locally calculated gradient. Once the last worker in the ring has received the parameter, it has the complete averaged model parameter from the entire cluster. In the second phase, this updated model is passed around the ring a second time so that all workers come to possess it.

Butterfly mixing, on the other hand, scales logarithmically with the number of workers. In each phase of butterfly mixing, each worker simultaneously sends the entirety of its model to one other worker. After receiving an update, the worker averages its local model with the received model. Suppose there are four workers: A, B, C, D. In the first phase of communication, A and B simultaneously exchange their entire models, while C and D follow suit. At this point, A and B contain the same averaged model, as do C and D. In the second phase of communication, A sends its averaged model to C (and vice versa), while B sends its averaged model to D. This concludes the all-reduce. Butterfly mixing reduces the number of communication phases required during updates at the cost of sending a larger amount of data over the network fabric.
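To make these schedules concrete, the following is a minimal sketch of the message patterns the two algorithms induce. This is our own illustration rather than Horovod's or any framework's implementation; it assumes a power-of-two worker count for butterfly mixing and shows only who sends to whom and when, omitting the incremental averaging itself.

```python
import math

def ring_reduce_schedule(workers):
    """Two rings: the first accumulates each round-robin-assigned
    parameter around the ring, the second redistributes the result."""
    w = len(workers)
    schedule = []
    for phase in ("aggregate", "distribute"):
        for step in range(w - 1):
            for i, src in enumerate(workers):
                schedule.append((phase, step, src, workers[(i + 1) % w]))
    return schedule

def butterfly_schedule(workers):
    """log2(W) rounds; in round r, worker i exchanges its full
    (partially averaged) model with worker i XOR 2^r."""
    w = len(workers)
    assert w & (w - 1) == 0, "sketch assumes a power-of-two worker count"
    schedule = []
    for r in range(int(math.log2(w))):
        for i in range(w):
            peer = i ^ (1 << r)
            if i < peer:  # record each pairwise exchange once
                schedule.append((r, workers[i], workers[peer]))
    return schedule

print(butterfly_schedule(["A", "B", "C", "D"]))
# [(0, 'A', 'B'), (0, 'C', 'D'), (1, 'A', 'C'), (1, 'B', 'D')]
# round 0: A<->B and C<->D; round 1: A<->C and B<->D, as in the text.
```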
The network fabric is capable of providing support for distributed training at multiple levels. Here, we discuss two proposals and how they can be used in conjunction with the all-reduce algorithms presented earlier.

Multicast [10] implementations are designed to be bandwidth efficient. For CNN training, multicast ensures that the amount of data traversing any network link does not scale with worker count. In practice, deploying multicast requires addressing a litany of considerations (e.g., membership, discovery, reliability). However, we do not address them in this paper, noting only that CNN training tasks involve bulk transfers (making reliability easier to solve, since one has time to identify and recover from errors, and various reliable multicast implementations are available) and tend to run over several hours with a constant set of workers, simplifying the management problems.

When using a parameter server, IP multicast assists solely with the distribution phase. During the distribution phase each parameter server sends each worker all of its parameters – w copies of the same data when deployed in a cluster with w workers. Enabling IP multicast would allow each parameter server to send a single copy of its parameters. It is also possible to use IP multicast with ring-reduce, during the second ring's model distribution.

In-network aggregation improves training time when using parameter servers. The data sent during the aggregation phase varies by worker. To reduce traffic in this phase, the network needs to implement the model's aggregation semantics. This is possible either through the use of software switches operating on an overlay network [23] or through the use of programmable switch ASICs [25] such as Barefoot Tofino [4]. In this approach the network buffers worker updates and aggregates them before sending them to the parameter servers. The benefits of this approach mirror those achieved through multicast; this mechanism also presents several deployment challenges, including requiring the use of new, specialized switching hardware, allocation of compute and memory resources on switches, and mechanisms for isolating CNN training traffic. We also do not address these issues in this paper, and refer the interested reader to recent discussions on this topic [25].
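For intuition about what "implementing the model's aggregation semantics" entails, below is a toy software model of an aggregating switch. This is a simplified sketch, not the design of [25] or a real Tofino program: it merely buffers per-parameter updates and emits a single summed message once every worker has contributed.

```python
from collections import defaultdict

class AggregatingSwitch:
    """Toy model of in-network aggregation: buffer each worker's
    update for a parameter and forward one summed update to the
    parameter server once all workers have contributed."""
    def __init__(self, n_workers):
        self.n = n_workers
        self.buf = defaultdict(lambda: [0.0, 0])  # param -> [sum, count]

    def on_update(self, param, value):
        entry = self.buf[param]
        entry[0] += value
        entry[1] += 1
        if entry[1] == self.n:              # all workers heard from
            del self.buf[param]
            return ("to_ps", param, entry[0])  # one aggregated message
        return None                         # buffered; nothing sent upstream

switch = AggregatingSwitch(n_workers=2)
print(switch.on_update("fc8/weights", 0.5))  # None (buffered)
print(switch.on_update("fc8/weights", 0.3))  # ('to_ps', 'fc8/weights', 0.8)
```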
Our initial attempts to derive an analytical model for training performance provided intuition about mechanism scaling, but failed to explain the relative performance of the various mechanisms. In particular, an analytical model fails to capture two specific components of CNN training that significantly affect performance.
First, the backpropagation phase consists of fine-grained, causal interleavings between compute and network. This set of interleavings is highly model specific: each model has hundreds of composite operations that are interspersed over highly uneven units of time. Moreover, radical new CNN proposals appear frequently; thus developing a novel analytical model for each new CNN design is highly burdensome. Rather than attempt to model this complex interaction, our simulator uses a comprehensive set of empirical traces that are easy to collect.
Second, different phases of computation (e.g., distribution and aggregation) overlap non-deterministically.
Recall that while there is a global barrier at the parameter server, each worker has a local barrier before it initiates backpropagation. If we assume that (1) every local barrier is reached simultaneously and (2) there is no variance in the compute part of backpropagation, this would – within the constraints of our computational model – maximize the amount of incast at the parameter server. On the other hand, any delta between workers hitting the local barrier reduces the overlap between the workers' backpropagation phases; this delay reduces incast. We refer to this delta as backpropagation staggering. Backpropagation staggering is influenced by two factors that are extremely difficult to model analytically. First, it requires reasoning across both the overlapping distribution and aggregation phases: the mechanism used to distribute parameters, as well as the qualities of the individual parameters themselves (i.e., how long each takes to compute and to send over the network), affect the amount of backpropagation staggering. Second, there is natural variation in worker processing time that further influences the staggering. This is not limited to the parameter server framework; in the case of end-host mechanisms like butterfly mixing, pipelined parameter mixing between workers can interfere with each other. In Section 8, we show in much greater detail how the amount of backpropagation staggering influences how each mechanism and model performs. Thus, an analytic model which fails to capture this phenomenon will not be sufficient to answer the questions we posed.
We seek to develop a simulator that captures the performance nuances described in Section 4, while simultaneously being sufficiently flexible to accommodate a limited scope of potential changes in feedforward CNN models and TensorFlow. We therefore developed a trace-driven simulator, which we describe in greater detail in this section.

To ensure that trace collection is simple and effective, a trace must have two attributes. First, it must be network agnostic, so the same trace can be used for different network settings. Second, it must accurately model pipelining. The traces were generated by adding minor instrumentation to TensorFlow 1.4. Computation in TensorFlow is represented as a dataflow graph, where individual operations are represented as nodes on the graph, while parameters (i.e., tensors) are transferred along the edges of the graph. Our instrumentation records send operations along the dataflow graphs. The trace is very simple: for each operation, it records five attributes: (1) event time; (2) parameter name, identified by an edge name (e.g., conv1/weights/read tells us that the first convolution operation's weights are being read by the worker); (3) size of the parameter ready to be queued onto the network; (4) source device (e.g., worker0); (5) destination device (e.g., parameter server). This trace is simple to generate, parse, and even modify if the operator wants to run simulations on synthetic models.
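To illustrate how little each trace row carries, the sketch below models one send operation with the five attributes just listed. The class and field names are our own paraphrase, not the actual format emitted by our TensorFlow instrumentation.

```python
from dataclasses import dataclass

@dataclass
class SendEvent:
    """One send operation recorded from TensorFlow's dataflow graph."""
    event_time: float   # seconds, relative to the first event in the trace
    parameter: str      # edge name, e.g. "conv1/weights/read"
    size_bits: int      # parameter size queued onto the network
    src: str            # e.g. "worker0"
    dst: str            # e.g. "ps0"

def parse_trace(lines):
    """Parse a whitespace-separated trace dump into SendEvents."""
    events = []
    for line in lines:
        t, name, size, src, dst = line.split()
        events.append(SendEvent(float(t), name, int(size), src, dst))
    return events

demo = ["0.000 conv1/weights/read 37748736 ps0 worker0"]
print(parse_trace(demo)[0])
```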
NN Name             1 PS    2 PS    4 PS    8 PS
VGG-16 Sim          21.0    22.5    19.3    18.2
VGG-16 Real         22.5    22.8    20.8    19.3
Inception-v3 Sim    2.29    2.29    1.37    0.852
Inception-v3 Real   2.16    2.16    1.49    1.3
Resnet-200 Sim      7.15    3.34    2.3     2.29
Resnet-200 Real     5.89    2.3     1.71    1.71
Resnet-101 Sim      4.57    2.37    1.52    1.5
Resnet-101 Real     3.7     1.58    0.855   0.9

Table 1. Comparison of measured (Real) iteration times to simulation-predicted (Sim) iteration times on a cluster of 8 workers.
For each neural net model, we collected a different representative trace. On each iteration, we partitioned the events into two traces: an aggregation trace and a distribution trace.

For the aggregation trace, we identified the specific send operations that are gradient operations triggered automatically by TensorFlow's SyncReplicaOptimizer. The aggregation trace allows us to accurately model how the parameters are sent over the network. Recall from the computational model that, due to extensive pipelining, each calculated gradient is delivered to the parameter server as soon as that parameter gradient is computed. In gathering this data, note that not every operation from the worker to the parameter server falls on the critical path. The beginning of the aggregation is marked by a dependency operation that varies from model to model. We only consider the send operations that occur following the dependency operation, since all send messages prior to that dependency operation are pipelined with the worker's forward pass and thus do not fall on the critical path. For example, it is common for CNN training to use batch normalization, which normalizes the inputs to each layer. The worker will send the result of its batch normalization computation to the parameter server so the PS can compute moving averages over multiple iterations. Such operations must be removed from the aggregation trace. There are other operations that must be filtered out that arise occasionally on a per-model basis (such as the use of auxiliary logits).

For the distribution trace, we used the TensorFlow timeline tool to identify the particular send operations originating from the parameter server that triggered forward pass operations on the worker. The purpose of the distribution trace is to obtain the order in which parameters are queued on the network. We demonstrate in the evaluation (§8) how the ordering of parameters affects the end-to-end performance of distributed training. Additionally, we profiled the forward pass time on a single GPU. We later use this for emulating the pipeline effects in the forward pass, which we find plays an insignificant role in typical use cases; this is due to the one-to-many communication overhead arising from the parameter server.

To capture normal computational performance variance across GPUs, we recorded traces for each worker in clusters of different sizes and across several distinct clusters. To ensure that the trace is agnostic to the network and the size of the cluster, the recorded time of a trace event is relative to the first event in that trace. Different workers in a cluster finish receiving the model at different times due to network topology and parameter ordering. The absolute times in the aggregation trace are affected by network conditions, so aggregation times are recorded relative to the first aggregation event. Thus, we are able to simulate network effects across a wide variety of cluster sizes.
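A minimal sketch of this rebasing step, reusing the hypothetical SendEvent records from the sketch above: every event time is made relative to the first event in its trace, which is what keeps a trace agnostic to network conditions and cluster size.

```python
def rebase(events):
    """Rebase a trace so times are relative to its first event,
    making recorded timings independent of the network and of
    cluster size (per the aggregation-trace rule above)."""
    if not events:
        return events
    t0 = min(e.event_time for e in events)
    for e in events:
        e.event_time -= t0
    return events
```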
CNN Name        Layers   Model Size
Inception-v3    21       0.7 Gb
VGG-16          22       6.58 Gb
Resnet-101      103      –
Resnet-200      202      –

Table 2. Complexity of the CNN models we considered, including both weight and pooling layers in our layer count.
CNN             Fwd Pass Comp   Bkprop Comp   Bkprop Net, 25 Gbps   Comp:Net Ratio
Inception-v3    0.176 sec       0.296 sec     0.028 sec             10.6
VGG-16          0.169 sec       0.024 sec     0.263 sec             0.09
Resnet-101      0.176 sec       0.180 sec     0.052 sec             3.46
Resnet-200      0.357 sec       0.34 sec      0.082 sec             4.14

Table 3. Compute and network times during the backpropagation of each model. Note that the backprop compute time does not include the time to compute the first layer of backpropagation.
We validated our simulator by comparing its results against actual runs on a cluster of 8 workers; the results are shown in Table 1. For the most part, our simulation results accurately predict the performance trend with more parameter servers. In many of the cases that use multiple parameter servers, the CNN weights are not evenly distributed among the parameter servers, causing the performance improvements to plateau; our simulation effectively reflects this behavior. There are two notable points at the far ends of the spectrum – Inception-v3 with 8 PSs, and Resnet-200 with 1 PS – where our simulation fails to match our empirical measurement, although both still capture the general scaling trend. There are several possible explanations for this. Our simulation assumes that parameter distribution occurs in a round-robin fashion over the workers. While this assumption is sufficient for most of the settings that we looked at, our observations of the actual distribution send-traces reveal some minor overlaps in the way that worker parameters are sent. This suggests that the round-robin assumption is slightly stronger than reality – especially in the case of 1 PS, where the amount of backpropagation staggering is most pronounced.
Which CNN model characteristics influence a mechanism's performance benefit? We have identified four such model attributes that impact the extent to which acceleration mechanisms will benefit performance, and we show where our CNN models fall along those dimensions. Other, non-deterministic factors not related to CNN characteristics may also influence training performance; these were discussed earlier in Section 4. With the exception of the forward pass, these attributes are all extracted directly from the trace. We discuss them in the context of the four distinct, commonly deployed image classification models we deployed and analyzed: Inception-v3, Resnet-200, Resnet-101, and VGG16, all trained using ImageNet data.
Distribution of parameter sizes over the model, in particular the size of the last parameter. Many models exhibit a very parameter-heavy fully connected last layer, which represents a significant fraction of the model size. Inception-v3 and VGG16 both possess very memory-expensive fully connected layers, while Resnet-200 and Resnet-101 are relatively even throughout.
Fig. 2. Aggregation phase of a computationally/communicationally even, 3-layer (op) toy model, with staggered backpropagation start times; improvement over baseline: 28%. PS = parameter server.
Computation/network bottleneck after the first layer of backpropagation.
For all future references to the backpropagation compute/network ratio, the first layer of backpropagation compute is not included. Once the first layer of backpropagation has been calculated, how interspersed are the computational and network elements of a CNN model? Table 3 shows the amount of time spent in communication and computation, and the resulting compute:net ratio. VGG16 spends nearly all its computational time calculating the first backpropagation parameter, thus exhibiting the smallest compute:network ratio. On the other hand, Inception-v3, a model which is similarly skewed, is compute intensive even after the first layer of backpropagation is computed.
Forward Pass Time: see Table 3.
Raw size of the model. We show model sizes in Table 2; they range from very large (6.58 Gb) to small (0.7 Gb).

As we will show in Section 8, the first two characteristics listed above heavily influence the amount of backpropagation staggering, as they both affect the overlapping distribution and aggregation phases.
In this section, we evaluate the efficacy of the mechanisms described in previous sections and explore how the CNN model characteristics interact with each mechanism; we then explain how they jointly impact performance. We also present head-to-head comparisons of competing mechanisms. Finally, we show that our rankings and intuitions generalize to two types of future training conditions: (1) larger models and (2) faster processors.

The traces used in the simulations were derived from clusters in AWS EC2 running TensorFlow 1.4, with a fixed batch size of 32 training instances per worker unit.

Our simulations show that in all cases, an end-host mechanism (ring-reduce) offers performance improvements greater than or equal to any in-network mechanism. Thus, rather than having to make a trade-off between performance and infrastructure cost, operators can instead deploy software mechanisms running on the end host and expect to get equal or better performance.
We look at network fabric optimizations – in-network aggregation and multicast – which are primarily used to accelerate training when using a parameter server.

Model Name      Aggregation Only   Multicast Only   Multicast + Aggregation
Inception-v3    1.34x              1.69x            3.28x
VGG-16          1.89x              1.94x            22.0x
Resnet-101      1.65x              1.79x            6.07x
Resnet-200      1.65x              1.85x            6.7x

Table 4. Factor speedup of network-supported training relative to a baseline with no network support. 32 workers, 25 Gbps.

Factor 1: Large last layer of the CNN reduces impact: In Section 4, we defined backpropagation staggering. Recall that increased backpropagation staggering reduces incast, which consequently reduces the impact of in-network aggregation. The amount of backpropagation staggering is positively correlated with the time taken to execute the penultimate layer(s) of the CNN. During the distribution phase, parameters are distributed between workers in a round-robin manner. Thus, the forward pass on each worker is blocked until that single parameter (which can be over 5 Gb in the case of VGG16) reaches the worker in its entirety. Larger final parameter(s) cause each worker node to begin its individual backpropagation at increasingly staggered times. To illustrate this, we consider a simple example with a 3-operation CNN model. Each operation takes three seconds to compute and three seconds to send over the network. In the case where all the workers start backpropagation simultaneously (not shown), aggregating the new model takes 21 seconds. When using in-network aggregation, aggregation takes 12 seconds (a 43% improvement). In contrast, Figure 2 shows aggregation in the same setup when backpropagation start times are staggered between workers. While performance without in-network aggregation stays constant (21 seconds), in-network aggregation only improves performance by 28%. Why? In-network aggregation reduces the parameter server's network bottleneck by aggregating parameter updates from all workers, and this aggregation cannot proceed until the last worker has communicated that parameter update. Both VGG16 and Inception-v3 have large last layers that will further stagger the time at which each worker begins backpropagation, making them particularly susceptible to this effect.
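The toy example's numbers can be reproduced with a few lines of arithmetic. The sketch below is a simplified re-derivation under the stated assumptions (3 operations, 3 s compute and 3 s network each, 2 workers, and a parameter-server ingress link that serializes transfers); it is not our actual simulator.

```python
def agg_phase_time(n_ops=3, comp=3.0, net=3.0, workers=2,
                   stagger=0.0, in_network_agg=False):
    """Time for the PS to receive the full updated model. Each worker
    finishes op k's gradient after (k+1)*comp seconds of backprop
    (offset by its stagger); the PS ingress link is the bottleneck
    and serializes transfers at `net` seconds each."""
    ready = [[w * stagger + (k + 1) * comp for k in range(n_ops)]
             for w in range(workers)]
    link_free = 0.0
    if in_network_agg:
        # One aggregated update per op, available only once the
        # slowest worker has produced that op's gradient.
        for k in range(n_ops):
            link_free = max(link_free, max(r[k] for r in ready)) + net
    else:
        # Every (worker, op) update crosses the PS link individually.
        order = sorted(((w, k) for w in range(workers)
                        for k in range(n_ops)),
                       key=lambda wk: ready[wk[0]][wk[1]])
        for w, k in order:
            link_free = max(link_free, ready[w][k]) + net
    return link_free

print(agg_phase_time())                                   # 21.0 s baseline
print(agg_phase_time(in_network_agg=True))                # 12.0 s (43% better)
print(agg_phase_time(stagger=3.0))                        # 21.0 s (unchanged)
print(agg_phase_time(stagger=3.0, in_network_agg=True))   # 15.0 s (28% better)
```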
Factor 2: Network-dominated backpropagation time increases impact: Recall that during the backpropagation process, parameter updates are sent as soon as they are calculated. If the calculation time between parameter updates is relatively short, parameter updates queue up inside the network, waiting to be sent to the parameter server. The degree to which the backpropagation is network-bound cannot be assessed solely from the overall backpropagation time. VGG16 has the longest overall backpropagation time, yet enjoys the largest percentage improvement in performance from in-network aggregation. For VGG16, the majority of the backpropagation computation is spent computing the first backpropagation layer, an expensive fully connected layer. Once this first computational step completes, the remaining model parameters are quickly calculated and queued in the network. Conversely, Inception-v3 spends a significant amount of time doing backpropagation computation even after the first backprop layer is computed.
Results:
For 32 workers and a 25 Gbps network, Table 4 shows the performance improvements derived from using just in-network aggregation to accelerate distributed training. Inception-v3 (which has a compute-intensive backpropagation) experiences the least performance gain from in-network aggregation, while VGG16 (which has a network-intensive backpropagation) experiences the most.
Factor 1 – Model Size increases impact
Unlike in-network aggregation, the distribution phase is initiated with a global barrier. All the parameters are ready simultaneously, so there is no pipelining on the send side as there is in the case of in-network aggregation. Thus performance gains from multicast are more directly a function of model size.
Factor 2 – Forward Pass Pipelining decreases impact (slightly)
As the worker receives model parameters, it partially executes the forward pass. However, the forward pass is unlikely to be the bottleneck in the distribution phase, especially as the number of workers grows. Recall that the simulated parameter server distributes parameters to the workers in a round-robin fashion; more workers allow for more computation time between parameter distributions. Consequently, forward pass pipelining has a very slight (if any) impact on multicast performance.
Non-Factor – decreased backpropagation staggering: When using multicast, workers receive parameters at roughly the same time (within minimal link latency). Thus backpropagation staggering is decreased, as we expect workers to initiate backpropagation (approximately) simultaneously. Does the resulting increased incast hurt iteration performance? We find this not to be the case unless the following condition is met. Let D be the delay between worker backpropagation start times, B be the full backpropagation time (i.e., including both the computation and network transfer time), and C be the compute time for the backpropagation of the first layer. In order for the decreased backpropagation staggering to give back performance, the following must hold: D > B − C (a short check after the results below makes this concrete). This is highly unlikely in the multicast case; multicast should beget a very minimal D.

Results: Again, we refer to Table 4. Multicast impact is more directly proportional to the size of the model. Resnet-101 (1.79x) and Resnet-200 (1.85x) are both larger than Inception-v3 and smaller than VGG16; their performance gains likewise fall between the performance gains of those models.
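Returning to the condition above: a tiny check with hypothetical values of D, B, and C (real values depend on model and hardware) shows why multicast's near-zero D makes the condition essentially never hold.

```python
def staggering_reduction_costs_performance(D, B, C):
    """Per the condition above: reducing stagger only hurts when the
    inter-worker delay D exceeds full backprop time B minus the
    first-layer backprop compute time C."""
    return D > B - C

# Hypothetical values: with multicast, D is near zero, so the
# condition essentially never holds.
print(staggering_reduction_costs_performance(D=0.001, B=0.5, C=0.2))  # False
```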
As shown in Table 4, multicast outperforms or approximates the performance gains from using in-network aggregation in all cases. Later, we will show that both approaches are weaker than end-host based acceleration mechanisms, but we briefly provide intuition for why multicast is more effective than in-network aggregation. Fundamentally, in-network aggregation is tied to backpropagation while multicast is tied to the forward pass. Generally speaking, the forward pass is significantly faster than the backpropagation; refer to Table 5. Moreover, the extent to which the forward pass is bottlenecked on compute decreases with more workers. On the other hand, the compute:network ratio of backpropagation remains constant with the number of workers. In fact, increasing workers only leads to staggered backpropagation start times, which reduces the benefits of in-network aggregation.
Multicast individually is more impactful than in-network aggregation alone.
Table 4 indicates that multicast provides larger performance gains than in-network aggregation across the board.
CNN Name        GPU Model         Forward Pass   Backprop
Resnet-200      Maxwell Titan X   170 ms         384 ms
Resnet-200      Pascal Titan X    315 ms         520 ms
VGG-16          Maxwell Titan X   173 ms         416 ms
VGG-16          Pascal Titan X    98.2 ms        260 ms
Resnet-101      Maxwell Titan X   109 ms         190 ms
Resnet-101      Pascal Titan X    162 ms         258 ms
Inception-v1    Maxwell Titan X   91.3 ms        141 ms
Inception-v1    Pascal Titan X    57.5 ms        85.9 ms

Table 5. Forward pass vs. backpropagation benchmarks [9].
Model Name      Ring-Reduce   Ring-Reduce + Multicast   Butterfly Mixing
VGG-16          24.6x         24.6x                     11.3x
Resnet-200      6.75x         6.76x                     6.79x
Resnet-101      6.55x         6.71x                     6.46x
Inception-v3    3.35x         3.41x                     3.41x

Table 6. Ring-reduce is most effective when parameters are evenly distributed, while butterfly mixing performs well at 25 Gbps.

Multicast plus in-network aggregation
Finally, what happens when multicast is combined with in-network aggregation? Because multicast supports the distribution phase and in-network aggregation supports the aggregation phase, both approaches can be used simultaneously to improve performance. In fact, multicast puts in-network aggregation in the best position to succeed because it strongly decreases backpropagation staggering. Table 4 shows clearly that using multicast with in-network aggregation yields substantially better performance gains than using either just multicast or just in-network aggregation.
Results:
Multicast plus in-network aggregation benefits VGG16 the most, as the performance gains increase from 1.9x to 22.0x. For all models, using multicast with in-network aggregation results in more-than-additive performance gains over the individual mechanisms, for the reasons described in the previous paragraph.
Looking only at in-network mechanisms, we find that the optimizations rank as: multicast + aggregation, multicast, aggregation. This leads us to conclude that if one is required to use the parameter server model, then using multicast jointly with in-network aggregation yields the largest performance improvements. While both multicast and in-network aggregation present their own deployment challenges, multicast outperforms in-network aggregation on the CNNs we tested. As evidenced by the CNN characteristics that factored into each mechanism's impact, the interaction between the distribution phase and the aggregation phase across all workers is key. While the aggregation phase holds more complexity in terms of network/compute interleavings, it is actually the distribution-phase optimizations that dictate how much the aggregation phase sits on the critical path. Future work should not evaluate the efficacy of accelerating either the distribution phase or the aggregation phase in isolation.
We analyze two all-reduce algorithms: Horovod ring-reduce and butterfly mixing. Table 6 shows speedup over baseline for both algorithms when run with 32 workers on 25 Gbps links. The table also shows improvements when ring-reduce is combined with multicast.
Recall that in ring-reduce, the parameters are assigned to workers round robin. One critical issue that must be addressed when doing this is that often a significant fraction of a model's raw size comes from a single parameter (e.g., VGG16, Inception-v3). Analytically, the communication overhead of ring-reduce is 2(W − 1) × (max parameter), where W is the worker count. This overhead can be prohibitively large when a model has a single huge parameter, e.g., VGG16's 5.4 Gb fully-connected layer. This is consistent with our simulation with 32 workers at 10 Gbps, where the ring-reduce iteration time was 34.0 seconds. CNN models (especially those ending in a fully connected layer) tend to have a few layers that take up a significant percentage of the overall model size; thus even an optimal assignment of model parameters would still result in parameter size imbalances for a worker on a ring.

To address this, we modified our simulator to use parameter messaging, where parameters are partitioned evenly between workers (discussed further in §9.2). In the rest of this section, all ring-reduce results make use of this messaging mechanism. What model characteristics positively influence ring-reduce performance relative to baseline?

Factor: Network-dominated backpropagation increases impact: For ring-reduce, each worker begins backpropagation at the same time. This is imposed by a global barrier. In the optimal case for ring-reduce, each worker sends its assigned model parameter updates at the same time; then every link in the ring would be nearly identically utilized and network contention would be minimized. However, each worker can only send a model parameter update once that gradient has been calculated. Longer backpropagation computation on each parameter results in deviation from the optimal case. Table 6 shows that the model with the most compute-bound backpropagation, Inception-v3, has the lowest performance improvement from ring-reduce (3.3x). In contrast, VGG16, the model with the largest performance improvement (24.6x), has the most network-bound backpropagation process.
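To gauge the magnitude of this overhead, one can divide the expression above by link bandwidth to obtain a time lower bound; this conversion is our own reading of the formula. Plugging in the in-text VGG16 figures (5.4 Gb largest parameter, 32 workers, 10 Gbps) lands close to the simulated 34.0 second iteration time.

```python
def ring_reduce_lower_bound_s(max_param_gbits, workers, bw_gbps):
    """2*(W-1) sequential transfers of the largest parameter,
    converted to seconds by dividing by link bandwidth."""
    return 2 * (workers - 1) * max_param_gbits / bw_gbps

# VGG16's 5.4 Gb fully connected layer, 32 workers, 10 Gbps links:
print(ring_reduce_lower_bound_s(5.4, 32, 10))  # ~33.5 s, close to the
                                               # simulated 34.0 s iteration
```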
Recall that, for butterfly mixing, the step before the first communication phase involves only the forward pass: at the point the forward pass begins, each worker has already received the complete set of model parameters. Once each parameter's update is calculated, it will be sent log(W) times, where W is the number of workers. While this communication is sequential, the process can be pipelined between workers. What factors impact performance for butterfly mixing?

Factor: Compute-dominated backpropagation increases impact:
During backpropagation, longer gradient computations between parameters give workers a chance to pipeline communications among the log(W) steps. However, models with network-dominated backpropagation still experience a substantial boost from butterfly mixing; for example, Table 6 indicates that VGG16 still gets an 11.3x speedup.

Butterfly mixing and ring-reduce are both impacted by a CNN model's backpropagation compute/network interactions, but in opposite ways: butterfly mixing helps more for compute-bound backpropagations, and ring-reduce helps more for network-bound backpropagations. In Table 6, ring-reduce and butterfly mixing perform comparably for all models except VGG16, which has the most network-bound backpropagation; this leads to stronger improvement from ring-reduce. Compare this to a lower bandwidth, 10 Gbps: there, ring-reduce and butterfly mixing only perform comparably for Inception-v3, which has the most compute-bound backpropagation. Conversely, ring-reduce outperforms butterfly mixing for Resnet-200 (Figure 4) and VGG16 (Figure 5).
Ring-reduce offers performance impact superior or equal to butterfly mixing.
In this section, we present overall simulation results for the top performing mechanisms seen so far: ring-reduce with messaging, and multicast/in-network aggregation. We include results for butterfly mixing as another competitive reference point, but the focus of this section is on ring-reduce vs. multicast/in-network aggregation.

First, we fix the number of workers to 32 and vary bandwidth. The results for Inception-v3, Resnet-200, and VGG16 are shown in Figure 3, Figure 4, and Figure 5, respectively. Next, for the same acceleration mechanisms, we show how performance scales with the number of workers, while keeping bandwidth fixed at 25 Gbps. The results for Inception-v3, Resnet-200, and VGG16 are shown in Figure 6, Figure 7, and Figure 8, respectively. Resnet-101 is left off this panel of graphs but shows trends consistent with Resnet-200.

Fig. 3. Varying Bandwidth: mechanism rankings for Inception-v3 training time on 32 workers.
Fig. 4. Varying Bandwidth: mechanism rankings for Resnet-200 training time on 32 workers.
Fig. 5. Varying Bandwidth: mechanism rankings for VGG16 training time on 32 workers.
Fig. 6. Varying Worker Count: Inception-v3 performance at 25 Gbps.
Fig. 7. Varying Worker Count: Resnet-200 performance at 25 Gbps.
Fig. 8. Varying Worker Count: VGG16 performance at 25 Gbps.

These results indicate that with 32 workers – across all bandwidths – ring-reduce outperforms the combination of multicast and in-network aggregation. For Resnet-200, Resnet-101, and Inception-v3, ring-reduce and multicast+in-network aggregation perform very similarly. However, ring-reduce holds a key advantage on the VGG16 model. As shown in Figure 8, ring-reduce has nearly a 2x performance advantage with 4 workers, and a 1.3x performance advantage with 32 workers. While the gap closes with more workers, we do not observe a case with VGG16 where multicast+in-network aggregation outperforms ring-reduce. The gap between the two mechanisms is also more pronounced at low bandwidths, as seen in Figure 5.

Ring-reduce has one key advantage over multicast + in-network aggregation. The first and second rings of ring-reduce are equivalent to the parameter server's aggregation and distribution phases, respectively. While there is a per-worker local barrier between the aggregation and distribution phases, no such barrier exists for ring-reduce: the second (distribution) ring can proceed even while other parameters are still circulating their first ring. Thus the two phases can be pipelined, which enhances performance impact. The difference is particularly stark for models with disproportionately large penultimate layers, such as VGG16 in Figure 8.
Ring-reduce offers performance impact superior or equal to multicast with in-network aggregation.

8.4 All-reduce with network support
In this subsection, we explore the possibility of using in-network support to accelerate all-reduce algorithms. The primary combination we discuss is ring-reduce with multicast. Note that multicast does not offer any performance benefits to butterfly mixing; the same applies to in-network aggregation for both ring-reduce and butterfly mixing. When using ring-reduce, multicast could be used to improve the performance of the second ring. When parameters have been distributed between workers equally, we find in our simulations that multicast in the second phase of ring-reduce has very limited impact on performance. This can be reasoned about analytically. The communication overhead in the second loop is (Model size × (W − 1)) / (W × Bandwidth). When using multicast, the communication overhead is Model size / Bandwidth; note the bottleneck is on the receiving worker side. When the number of workers is large, these result in very similar performance. Our simulation results (Figure 6) confirm that across the board, ring-reduce with multicast performs equivalently to just ring-reduce.
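A quick numeric check of the two expressions above shows why multicast buys so little at scale. The 6.58 Gb model size is VGG-16's total size; the worker counts are illustrative.

```python
def second_ring_time(model_gbits, workers, bw_gbps):
    """Second-ring overhead: Model size * (W - 1) / (W * Bandwidth)."""
    return model_gbits * (workers - 1) / (workers * bw_gbps)

def multicast_time(model_gbits, bw_gbps):
    """Multicast overhead: Model size / Bandwidth (receiver-limited)."""
    return model_gbits / bw_gbps

for w in (4, 32):
    print(w, round(second_ring_time(6.58, w, 25), 3),
          round(multicast_time(6.58, 25), 3))
# At W=32 the two differ by only ~3%, so multicast buys little.
```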
To this point in the paper, we have demonstrated that ring-reduce outperforms the other proposals. Next we ask: how could these models change over time, and would our results still hold? Recent model proposals have simply modified existing CNN models through the addition of convolutional layers [12, 15, 16]. To test this, we add between 1 and 125 modules to the Inception-v3 model. We capture the full breadth of possible layers by adding one of two types of modules: compute intensive (35x35x288 module) and network intensive (17x17x768 module). Again, we run our simulations with 32 workers at 25 Gbps. First, we observe in Figures 9 and 10 that our mechanism rankings are preserved over both strains of synthetic models. However, the relative impact of the speedup mechanisms changes. With the compute intensive synthetic model, the performance impact of in-network aggregation quickly drops to zero, since pipelining in backpropagation provides ample time for parameters to be sent over the network between computations. In contrast, multicast provides the same amount of performance impact regardless of added layers, and ultimately equals the performance of the other mechanisms with only 25 layers added. If models trend towards becoming computationally expensive, operators dedicated to using parameter servers could turn away from in-network aggregation for good, as multicast alone equals the impact of using both multicast and in-network aggregation. For the network intensive synthetic model, multicast with in-network aggregation, ring-reduce, and butterfly mixing grow linearly with the number of layers. Even in the extreme case with 125 additional layers, neither in-network aggregation nor multicast alone reaches its maximal performance gain of 2x. While this is hardly an exhaustive list of potential model changes, creating synthetic models from existing model traces is extremely simple; operators and developers can easily modify existing traces to project necessary changes to the network fabric or end-host design, as sketched below.
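As a sketch of how simple that trace surgery can be, the function below (our own illustration, reusing the hypothetical SendEvent records from the §5 sketch) appends scaled copies of an existing module's send events: stretching times emulates compute-heavy layers, while stretching sizes emulates network-heavy ones.

```python
import copy

def add_synthetic_modules(events, module_events, n_modules,
                          comp_scale=1.0, size_scale=1.0):
    """Append n_modules scaled copies of an existing module's send
    events to an otherwise unmodified trace."""
    out = list(events)
    for m in range(n_modules):
        t_end = max(e.event_time for e in out)
        for e in module_events:
            e2 = copy.copy(e)
            e2.event_time = t_end + e.event_time * comp_scale
            e2.size_bits = int(e.size_bits * size_scale)
            e2.parameter = f"{e.parameter}/synthetic{m}"
            out.append(e2)
    return out
```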
Next, we examine the case of potential enhancements in compute capabilities (e.g., faster hardware accelerators). In particular, do faster computations change the relative effectiveness of the acceleration mechanisms? In our experiments, we increased the speed of convolutions by various factors. Inception-v3 is shown in Figure 11 and Resnet-200 is shown in Figure 12. First, we observe that there is a point where the pipelined phases of the parameter server model become so network bound that the parameter server paradigm (i.e., multicast with in-network aggregation) wins out. For most models, this point occurs at around a 2.5x speedup, where multicast/in-network aggregation surpasses the performance of both ring-reduce and butterfly mixing. One standout trend from Figure 12 is Resnet-200 at a 3x computation speedup, where butterfly mixing results in only a 10x speedup while multicast with in-network aggregation leads to a 19.5x speedup. In fact, for all models except Inception-v3, butterfly mixing's gains flag with faster computation. This is consistent with our observation that butterfly mixing's impact decreases as backpropagation becomes more network bound. Inception-v3 happens to be so compute-bound that even at a 3x compute speedup, butterfly mixing keeps pace with ring-reduce and multicast/in-network aggregation.

Our results indicate that faster compute capabilities could lead to better performance impact when jointly using multicast and in-network aggregation. We hasten to mention that performance is a function of many factors (e.g., bandwidth, memory, etc.) that grow at their own pace. The complexity of these assessments reinforces the need for a simple simulator that can be used to assess the current state of models and hardware.
Fig. 9. Synthetic Model (Network-heavy): speedups relative to baseline increase as more network-heavy layers are added.
Fig. 10. Synthetic Model (Compute-heavy): speedups relative to baseline decrease as more compute-heavy layers are added.
Fig. 11. Faster GPU: Inception-v3 performance gains as compute speeds increase.
Fig. 12. Faster GPU: Resnet-200 performance gains as compute speeds increase.
From most to least performance impactful, our mechanism rankings are as follows: (1) ring-reduce, (2) multicast with in-network aggregation, (3) butterfly mixing, (4) multicast, (5) in-network aggregation. Not only does an individual end-host mechanism outperform the joint usage of network support, but these performance rankings should continue to hold as models evolve and computation becomes increasingly accelerated. Ultimately, we hypothesize that the reason for this ranking is that current implementations of end-host mechanisms more effectively overlap distinct pipelined phases, thus exploiting available bandwidth more effectively over the entire iteration. This difference in pipelining efficiency grows more pronounced when the penultimate layers of the model are disproportionately large. As models inevitably change, operators must carefully examine the qualities of the final CNN layers when considering how to accelerate training.
Much like how we expect the CNN models to change, training software running on the end host will also change. During simulator development, we identified several end-host design decisions that influence performance outcomes. Ultimately, we find that our findings are (mostly) robust to these end-host configurations. First, we look at equal assignment of parameters to parameter servers. Next, we look at three design configurations that are not currently integrated into TensorFlow: parameter distribution, message pipelining, and removing the parameter server's global barrier. In this section, we rely on simulations to gain intuition about host-level changes from a communication overhead point of view. Our simulations do not include system-level overheads for these approaches, and we leave evaluating this to future work.
Num PS   CNN Model   Min %   Max %   Ideal %

Table 7. Empirical measurements of parameter assignment to parameter servers. The columns show the percentage (in terms of bytes) of weights that are placed on the most and least occupied parameter server. Resnet-101 is similar to Resnet-200.
CNN Model       Multiagg (s)   8 PS Multiagg (s)   Ring-Reduce (s)
VGG-16          0.765          0.539               0.683
Resnet-200      0.830          0.820               0.824
Resnet-101      0.598          0.551               0.556
Inception-v3    0.569          0.549               0.562

Table 8. Using 8 parameter servers with a theoretically optimal distribution for multicast + aggregation does not provide substantive gains over ring-reduce. 32 workers, 25 Gbps.
By default, TensorFlow iterates through the parameters in the model and assigns them to the parameter servers in a round-robin fashion. While this effectively balances the number of parameters per PS, the weight in bytes on each of them can be vastly different. The uneven distributions across several models are shown in Table 7. For example, in VGG-16, the fully connected layer alone consists of 5.44 Gb (out of 6.58 Gb total over the entire model). Consistent with actual executions of TensorFlow, the simulations we have explored to this point assign parameters to parameter servers using TensorFlow's default heuristic. Here, we explore the possibility of dividing parameters evenly.

To simulate this fairly, we aggressively split each parameter between 8 parameter servers and 32 workers. Our results are shown in Table 8. With the exception of VGG16, ring-reduce continues to equal the combined efforts of multicast and in-network aggregation. If this end-host design change comes to fruition and models trend towards VGG16's characteristics (i.e., short backpropagation and an exceptionally large last layer), operators should consider hardware network support accordingly. With the exception of VGG16, parameter assignment does not change the fact that end-host acceleration mechanisms outperform in-network acceleration.
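A minimal sketch of the two assignment policies compared here: TensorFlow's default round robin over whole parameters versus the aggressive even split, where every parameter is sharded across all parameter servers so bytes are balanced by construction. The function names and example sizes are our own, chosen to mimic VGG-16's skew.

```python
def round_robin(param_sizes, n_ps):
    """Default heuristic: whole parameters assigned round robin.
    Balances parameter count, not bytes."""
    shards = {ps: [] for ps in range(n_ps)}
    for i, (name, size) in enumerate(param_sizes.items()):
        shards[i % n_ps].append((name, size))
    return shards

def split_evenly(param_sizes, n_ps):
    """Aggressive split: every parameter is sharded across all
    parameter servers, so each PS holds 1/n_ps of every parameter."""
    shards = {ps: [] for ps in range(n_ps)}
    for name, size in param_sizes.items():
        for ps in range(n_ps):
            shards[ps].append((f"{name}/shard{ps}", size / n_ps))
    return shards

# VGG-16-like skew: one 5.44 Gb layer dominates the 6.58 Gb model.
sizes = {"fc": 5.44e9, "conv1": 0.4e9, "conv2": 0.74e9}
rr = round_robin(sizes, 2)
print([sum(s for _, s in rr[ps]) for ps in rr])  # [6.18e9, 0.4e9] - skewed
ev = split_evenly(sizes, 2)
print([sum(s for _, s in ev[ps]) for ps in ev])  # [3.29e9, 3.29e9] - balanced
```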
NN Model        Multiagg (s)   Ring-reduce (s)   Multiagg no barrier (s)
VGG-16          1.53           1.37              1.76
Resnet-200      1.65           1.65              1.65
Resnet-101      1.17           1.13              1.08
Inception-v3    1.14           1.13              0.988

Table 9. Removing the global barrier improves multicast + aggregation iteration time, but does not cause a decisive lead over ring-reduce. 32 workers, 25 Gbps.
In both the distribution and aggregation phases of distributed training, a node typically waits for the entirety of a parameter to arrive before sending it forward. This can be inefficient when individual parameters are large; for example, the largest parameter in the VGG16 model is in excess of 5 Gb. Instead, these parameters can be split up into smaller messages and forwarded when ready.

We incorporated message pipelining into our simulator and found, surprisingly, that for all models we explored, communication within the parameter server model does not benefit whatsoever from message pipelining. The improvements that would result from message pipelining are swallowed by compute in the backpropagation. Only ring-reduce benefits significantly from messaging; thus our ring-reduce evaluations in Section 8 remain equipped with messaging. Overall, the conclusions we presented in the evaluation are robust to message pipelining.
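Message pipelining itself is just chunking; a sketch follows, with an assumed 64 MB message size (the real message size is a tunable we do not model here).

```python
def to_messages(param_bits, msg_bits=64 * 8 * 2**20):  # assumed 64 MB messages
    """Split one large parameter into messages that can be forwarded
    as soon as each chunk is ready, instead of waiting for the whole
    parameter to arrive."""
    full, rem = divmod(param_bits, msg_bits)
    return [msg_bits] * full + ([rem] if rem else [])

# VGG16's >5 Gb fully connected layer becomes ~11 forwardable chunks:
print(len(to_messages(int(5.4e9))))
```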
Up to this point in the paper, we have used a global barrier at the parameter server. While this creates greater flexibility in operations that can be conducted over the entire model, it inhibits pipelining between iterations. Alternatively, this global barrier could be removed: when the parameter server receives all updates (from workers) to a parameter, it can immediately forward that update [32]. To fairly capture the performance change of removing the global barrier, we run three iterations and measure the time between (a) when the parameter server receives all updates for the first parameter during the latter part of the first iteration and (b) when that first parameter is received from all workers during the latter part of the third iteration.

Table 9 shows that the removal of the global barrier increases the impact of multicast plus in-network aggregation to the point that it becomes roughly equal to the impact of ring-reduce. Removing the global barrier improves training time because the aggregation phase can be pipelined with the distribution phase. However, note that some of that improvement is given back because the worker cannot initiate the forward pass until the first model layer arrives, and that first layer is the final parameter computed in the backpropagation. Taking out the global barrier evens out the impact of ring-reduce and multicast/in-network aggregation, but our initial claim continues to hold.
In the parameter server model, there are two ways of distributing parameters: round-robin distribution (one model parameter at a time) and block distribution (send all model parameters in their entirety to each worker, one worker at a time). While parameter distribution has been observed to be random [14], simulator validation has shown that round-robin distribution order closely mimics actual empirical experiments. Surprisingly, in our experiments we found that block distribution outperforms round-robin distribution.
CNN Name        Bandwidth (Gbps)   Agg (s)   Block Distr. (s)
Inception-v3    10                 2.99      3.1
VGG16           10                 22.3      21.7
Resnet-101      10                 4.9       4.94
Resnet-200      10                 7.77      7.79
Inception-v3    100                0.71      0.77
VGG16           100                2.23      2.27
Resnet-101      100                0.89      0.94
Resnet-200      100                1.19      1.45

Table 10. With 32 workers, the training time using in-network aggregation and using block (i.e., not round-robin) parameter distribution is roughly the same.
When using round-robin distribution, workers progress at roughly the same pace. Thus, until one of the workers begins backpropagation, the parameter server's ingress bandwidth is unutilized. When using block distribution, the parameter server's bandwidth is utilized as soon as the first worker completes its forward pass. Moreover, only a single worker is likely to be doing backpropagation at a time, reducing incast.

How does block distribution compare to in-network aggregation? Recall that in-network aggregation benefits performance most when backpropagation staggering is minimal. When using block parameter distribution, under what analytical conditions would we see a similar performance impact from in-network aggregation and block distribution? Let $B$ be the computation time of just the first layer of backpropagation, $B_N$ the compute time of the entire backpropagation, $N$ the time to communicate the entire model, and $Rem_{FP}$ the remaining forward-pass computation after the worker has received the entire model update from the parameter server. The condition is $B + N + Rem_{FP} > Rem_{FP} + B_N$, which simplifies to $B + N > B_N$. Thus, block distribution approximates the performance gains of in-network aggregation for models with unusually large last layers and long network transfer times (which depend on model size and available network bandwidth).

Table 10 shows simulation results in which block distribution performs similarly to, or better than, in-network aggregation at vastly different bandwidths. Block distribution thus raises additional questions about the efficiency of using in-network aggregation for distributed CNN training.
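As a sanity check on this condition, the back-of-the-envelope calculation below plugs VGG-16's model size into $B + N > B_N$ using assumed (not measured) compute times, illustrating how the comparison hinges on available bandwidth:

```python
# Illustrative check of the condition B + N > B_N. The compute times are
# assumed for illustration only; the model size is VGG-16's 6.58 Gb.
B, B_N = 0.05, 0.40     # s: first-layer and full backprop compute (assumed)
model_bits = 6.58e9     # bits

for gbps in (10, 100):
    N = model_bits / (gbps * 1e9)   # s to communicate the entire model
    print(f"{gbps} Gbps: N = {N:.3f} s, B + N > B_N is {B + N > B_N}")
# 10 Gbps:  N ~ 0.658 s -> condition holds
# 100 Gbps: N ~ 0.066 s -> condition fails for these assumed compute times
```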
10 Discussion
Next we briefly discuss the impact of other optimization strategiesand systems considerations.
Gradient Compression:
Other work [20] has also looked at using gradient compression to reduce the amount of aggregation traffic sent during CNN training. Gradient compression and other compression techniques reduce model size but do not affect the number of network transfers. As a result, applying these methods is analogous to using a smaller CNN and is covered by our analysis.
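A minimal sketch of one compression flavor, top-k gradient sparsification (in the spirit of [20], which adds further machinery such as momentum correction), makes the point concrete: each transfer shrinks, but one transfer per gradient still occurs every iteration.

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries; ship (indices, values)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

grad = np.random.randn(1_000_000).astype(np.float32)
idx, vals = top_k_sparsify(grad, k=10_000)  # ~1% of the original payload
```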
Asynchronous training:
Our analysis thus far has assumed the use of synchronous training algorithms (§3.3.1). We focused on these algorithms for ease of analysis and exposition, and our simulator and analysis techniques can be applied to asynchronous training algorithms. However, neither of the in-network mechanisms can be used for asynchronous training due to the lack of barriers between iterations.
11 Conclusion
We began this work wanting to develop network optimizations that improve CNN training performance. We found that, despite a great deal of excitement about this area, little was understood about which types of optimizations were promising, or even how current optimizations impact end-to-end CNN training performance. We therefore sought to address this question, and in doing so found that end-host based solutions, which are arguably easier to deploy, generally provide better improvements than in-network solutions. We developed a trace-driven simulator that simplifies the analysis of how network changes impact CNN performance. We hope that this simulator will provide a foundation for the community to develop and evaluate optimizations for improving CNN performance. We plan to open source the simulator and our data so as to allow the community to leverage and extend our findings.
12 Acknowledgements
We thank members of the NetSys lab at UC Berkeley for their useful feedback, as well as Peter Gao and Qiyin Wu for their feedback on an early version of this work. This work was funded by NSF Grants 1817115, 1817116, and 1704941, and was supported by Intel, VMware, and Microsoft.
References
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zhang. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.
[2] A. A. Awan, J. Bédorf, C.-H. Chu, H. Subramoni, and D. K. Panda. Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: Characterization, designs, and performance evaluation. CoRR, abs/1810.11112, 2018.
[3] Baidu Silicon Valley Lab. Bringing HPC techniques to deep learning. http://research.baidu.com/bringing-hpc-techniques-deep-learning/.
[4] Barefoot Technology. https://barefootnetworks.com/technology/.
[5] J. F. Canny and H. Zhao. Butterfly mixing: Accelerating incremental-update algorithms on clusters. In Proceedings of the 13th SIAM International Conference on Data Mining, May 2–4, 2013, Austin, Texas, USA, pages 785–793, 2013.
[6] H. A. Chen, Y. O. Carrasco, and A. W. Apon. MPI collective operations over IP multicast. In IPDPS Workshops, 2000.
[7] J. Chen, R. Monga, S. Bengio, and R. Józefowicz. Revisiting distributed synchronous SGD. CoRR, abs/1604.00981, 2016.
[8] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
[9] CNN Benchmarks. https://github.com/jcjohnson/cnn-benchmarks.
[11] CoRR, abs/1711.10604, 2017.
[12] T. Elsken, J. Hendrik Metzen, and F. Hutter. Neural architecture search: A survey. arXiv e-prints, arXiv:1808.05377, Aug. 2018.
[13] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.
[14] S. H. Hashemi, S. A. Jyothi, and R. H. Campbell. Communication scheduling as a first-class citizen in distributed machine learning systems. CoRR, abs/1803.03288, 2018.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.
[17] T. Hoefler, C. Siebert, and W. Rehm. A practically constant-time MPI broadcast algorithm for large-scale InfiniBand clusters with multicast. Pages 1–8, 2007.
[18] J. Ker, L. Wang, J. Rao, and T. Lim. Deep learning applications in medical image analysis. IEEE Access, 6:9375–9389, 2018.
[19] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.
[20] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. CoRR, abs/1712.01887, 2017.
[21] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
[22] L. Luo, M. Liu, J. Nelson, L. Ceze, A. Phanishayee, and A. Krishnamurthy. Motivating in-network aggregation for distributed deep neural network training. In Workshop on Approximate Computing Across the Stack. ACM, 2017.
[23] L. Mai, C. Hong, and P. Costa. Optimizing network performance in distributed machine learning. In HotCloud, 2015.
[24] R. Rabenseifner. Optimization of collective reduction operations. In M. Bubak, G. D. van Albada, P. M. A. Sloot, and J. Dongarra, editors, Computational Science – ICCS 2004, pages 1–9, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.
[25] A. Sapio, I. Abdelaziz, A. Aldilaijan, M. Canini, and P. Kalnis. In-network computation is a dumb idea whose time has come. In HotNets, 2017.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[27] N.-S. Tomov and S. Tomov. On deep neural networks for detecting heart disease. CoRR, abs/1808.07168, 2018.
[28] Uber. Meet Horovod: Uber's open source distributed deep learning framework for TensorFlow. https://eng.uber.com/horovod/.
[29] A. Vishnu, C. Siegel, and J. Daily. Distributed TensorFlow with MPI. CoRR, abs/1603.02339, 2016.
[30] I. Wallach, M. Dzamba, and A. Heifets. AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. CoRR, abs/1510.02855, 2015.
[31] X. Yuan, S. Daniels, A. Faraj, and A. Karwande. Group management schemes for implementing MPI collective communication over IP-multicast. In JCIS, 2002.
[32] H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In USENIX ATC, 2017.