Scaling Distributed Training with Adaptive Summation

Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi
Microsoft Research

Tianju Xu, Vadim Eksarevskiy, Jaliya Ekanayake, Emad Barsoum
Microsoft

{saemal, madanm, toddm, olsaarik, tix, vaeksare, jaliyaek, ebarsoum}@microsoft.com
Abstract
Stochastic gradient descent (SGD) is an inherently sequential training algorithm: computing the gradient at batch i depends on the model parameters learned from batch i − 1. Prior approaches that break this dependence do not honor it (e.g., they sum the gradients for each batch, which is not what sequential SGD would do) and thus potentially suffer from poor convergence. This paper introduces a novel method to combine gradients called Adasum (for adaptive sum) that converges faster than prior work. Adasum is easy to implement, almost as efficient as simply summing gradients, and is integrated into the open-source toolkit Horovod.

This paper first provides a formal justification for Adasum and then empirically demonstrates that Adasum is more accurate than prior gradient accumulation methods. It then introduces a series of case studies to show that Adasum works with multiple frameworks (TensorFlow and PyTorch) and scales multiple optimizers (Momentum-SGD, Adam, and LAMB) to larger batch sizes while still giving good downstream accuracy. Finally, it proves that Adasum converges.

To summarize, Adasum scales Momentum-SGD on the MLPerf ResNet-50 benchmark to 64K examples before communication (no MLPerf entry converged with more than 16K), the Adam optimizer to 64K examples before communication on BERT-Large (prior work showed Adam stopped scaling at 16K), and the LAMB optimizer to 128K before communication on BERT-Large (prior work used 64K), all while maintaining downstream accuracy metrics. Finally, if a user does not need to scale, we show LAMB with Adasum on BERT-Large converges in 30% fewer steps than the baseline.

1 Introduction

Recent trends in deep learning demonstrate that increasing model size, coupled with an increase in training data, results in improved model performance. This has led to progressively larger models, such as BERT [14], GPT-2 [27], Megatron [34], and UniLM [15]. This trend, along with the end of Moore's law, means that these large models require massively parallel architectures to train. An important source of parallelism in training is data parallelism, where individual nodes train on a subset of data and periodically exchange model updates. Unfortunately, this is at odds with the sequential nature of stochastic gradient descent (SGD), which is the most common algorithm to train these models. Some prior approaches break this sequential dependence with asynchronous SGD [13, 30], where individual nodes asynchronously update a global model ignoring potential staleness of model updates. Recent advances in hardware with powerful compute nodes and fast interconnects [1, 8, 9] have led to synchronous SGD where one trains with very large minibatch sizes. Neither approach is a panacea, as both staleness and naively increased minibatch sizes reduce model convergence [17, 20, 21].

This paper proposes a new approach to data parallelism based on two key insights. First, rather than asynchronously updating a global model or increasing the minibatch size, this approach attempts to emulate a sequential execution in parallel. The basic idea is to combine the individual model updates from nodes, each obtained by running a (small) minibatch from a starting model, into an update that would have resulted had these nodes run one after the other from the same starting model. Second, this sequential emulation allows us to sample multiple paths simultaneously. Intuitively, SGD is a stochastic process with each path representing a sample of the possible outcomes. By sampling many paths we dramatically reduce the variance of the estimate.

This paper shows (in Section 3) that the following combiner achieves the two properties above:
Adasum(g_1, g_2) = (1 − (g_1^T · g_2)/(2‖g_1‖²)) · g_1 + (1 − (g_1^T · g_2)/(2‖g_2‖²)) · g_2

Here g_1 and g_2 are gradients from individual minibatches, g_1^T · g_2 is their dot product, and ‖g‖ represents the norm of the vector g. This combiner, when recursively applied on gradients from all nodes, generates a final gradient which can be used to update the starting model with an appropriate learning rate. We call this approach Adasum as the combiner represents an adaptive sum of the two gradients with the gradients scaled by an appropriate constant.

Adasum achieves significant algorithmic efficiency with large batch sizes when compared to synchronous SGD by requiring far fewer epochs to converge to the same loss or model performance. This remains true even when using various learning-rate optimizers, such as Momentum-SGD [31], Adam [23], and LAMB [38]. Alternately, one can use the improved algorithmic efficiency to scale to a much larger effective batch size. For example, for ResNet-50 Adasum enables the Momentum-SGD optimizer to converge with an effective batch size of 64K, which is four times larger than the largest batch size we have seen reported for ResNet-50 as per MLPerf v0.5 submissions. Similarly, for BERT-Large, Adasum enables the Adam optimizer to scale to an effective batch size of 128K. Lack of convergence of Adam beyond 16K was the motivation for more sophisticated optimizers such as LARS and LAMB. When combined with LAMB, which is the state of the art optimizer for BERT-Large, Adasum converges in 20% fewer epochs than with LAMB alone when using an effective batch size of 64K. In addition, we demonstrate that LAMB with Adasum can also scale to a 128K effective batch size. These results indicate that Adasum emulates the behavior of a much smaller batch size even when running with large batch sizes.

A desirable property of the Adasum operation, as evident from the equation above, is that it has no hyperparameters. In all our experiments, we simply reused the recommended hyper-parameters for the baseline synchronous SGD. The only additional tuning Adasum entails is a search for a suitable base learning rate.

In summary, the contributions of this paper are:

• Adasum, a new way to combine gradients that scales synchronous SGD to unprecedented batch sizes, and a proof of its convergence.

• A detailed discussion of how Adasum is implemented in Horovod, a popular distributed training framework for PyTorch and TensorFlow.

• An evaluation that demonstrates Adasum scales existing optimizers well beyond what was possible in prior work. For example, we demonstrate Adam can scale to 64K examples per allreduce on BERT (16K before), LAMB to 128K examples per allreduce on BERT (64K before), and Momentum-SGD to 64K examples per allreduce on ResNet-50 (16K before), all while maintaining downstream accuracy and with little hyper-parameter tuning.

• An evaluation that demonstrates that for effective batch sizes similar to prior work, Adasum converges faster than that prior work. For example, LAMB with Adasum on 64K examples per allreduce on BERT converges in 30% fewer steps than LAMB when just averaging gradients.

2 Background

This section provides the background for the paper, introducing notation and concepts used throughout.
2.1 Stochastic Gradient Descent

Machine learning involves learning a model parameterized by a set of weights w based on some training data. Given a set of training examples {(x_1, y_1), ..., (x_k, y_k)}, where each x_i is a vector representing the input instance i and y_i is its label, training involves finding a w that minimizes some loss function L = ∑_i L_i(w, x_i, y_i), the sum of the individual loss functions L_i of the model w on input (x_i, y_i).

Most training uses stochastic gradient descent (SGD). SGD starts from an appropriately initialized model w_0 and progressively updates the model at step i as w_{i+1} = w_i − α_i · g_i. Here α_i is the learning rate at this step as determined by a learning rate schedule, and g_i = (1/b) ∑_{j=1}^{b} ∇L_j(w_i) is the sum of the gradients of the individual loss functions for a randomly chosen minibatch of data of size b, normalized by the minibatch size. The stochasticity of SGD arises because g_i is only an estimate of the true gradient of the loss function at w_i. For deep neural networks, the gradients can be computed by the backpropagation algorithm [16] that requires a forward and a backward evaluation of the model.

2.2 Data Parallelism

As models get larger, training requires parallelizing them on distributed hardware. While there are other important sources of parallelism such as model parallelism and pipeline parallelism, these techniques are orthogonal to the data parallelism studied in this paper.

A common approach to data parallelism is synchronous SGD, where one computes the gradients for each dataset in a minibatch in parallel. This process works as follows. Each node processes a microbatch of data and computes its local gradient. The microbatch size is usually determined by the amount of memory available in the local node. Then a communication step sums up the local gradients from all nodes to compute the minibatch gradient. This process is called allreduce, after the MPI primitive that computes the sum of vectors from each node and stores the result in all the nodes. Optionally, each node can perform gradient accumulation to sum up multiple microbatches before communicating with other nodes. Adasum, the technique proposed in this paper, replaces allreduce for improved parallelism.

2.3 Algorithmic Efficiency vs System Efficiency
One way to increase data parallelism is to increase the minibatch size. However, this has the effect of reducing the stochasticity of SGD, resulting in reduced convergence. Thus, there is a delicate balance between parallelism and model performance in distributed ML training. To capture this tradeoff, we define two notions.
System Efficiency represents the raw throughput of the system in the amount of training data processed per unit of time. Obviously, naively increasing the minibatch size will increase the system efficiency of synchronous SGD.
Algorithmic Efficiency is the inverse of the amount of training data that needs to be processed in order to achieve some desired model accuracy. Increasing the minibatch size decreases the algorithmic efficiency, requiring more data, or equivalently more iterations/epochs on a given training dataset, to achieve the same level of accuracy. In the worst case, training might not converge when the minibatch size is increased beyond a certain threshold.

As one essentially cares about the training time to desired accuracy, the net efficiency of distributed training is a combination of system efficiency and algorithmic efficiency.
2.4 Learning-Rate Optimizers

While one can naively use the resulting gradient to update the model, researchers have proposed various learning-rate optimizers that adaptively use different learning rates for different parts of the model. For instance, Adam [23] computes individual adaptive learning rates for different parameters based on estimates of the first and second moments of gradients. LARS [37] uses layer-wise adaptive learning rates for greater training stability. The recent LAMB optimizer [38] extends LARS to effectively train models like BERT with large minibatch sizes. The benefits of Adasum are orthogonal to and can be used in concert with these optimizers.
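To make the layer-wise adaptation concrete, the following is a minimal Python sketch of the trust ratio LARS computes per layer; the coefficient eta and the zero-norm guard here are illustrative assumptions, as exact formulations vary across implementations.

import torch

def lars_trust_ratio(weight: torch.Tensor, grad: torch.Tensor,
                     eta: float = 0.001) -> float:
    # Ratio between the norm of a layer's weights and the norm of its
    # gradient; LARS scales the layer's learning rate by this quantity.
    w_norm, g_norm = weight.norm(), grad.norm()
    if w_norm == 0 or g_norm == 0:
        return 1.0  # fall back to the unscaled learning rate
    return float(eta * w_norm / g_norm)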
3 The Adasum Algorithm

We present the intuitions behind the Adasum algorithm here, while a mathematical treatment with a convergence proof is in the Appendix.

Consider two nodes that compute gradients g_1 and g_2 on minibatches b_1 and b_2, respectively. When using synchronous SGD, the effective gradient is the average (g_1 + g_2)/2. But as it is common to increase the learning rate proportionally to the increased effective batch size, the combination amounts to a sum in practice. The main proposal behind this paper is to use an adaptive sum of the two gradients, called the Adasum.
Adasum(g_1, g_2) = (1 − (g_1^T · g_2)/(2‖g_1‖²)) · g_1 + (1 − (g_1^T · g_2)/(2‖g_2‖²)) · g_2

Despite its apparent complexity, Adasum simply adds the two gradients after scaling them with appropriate scalars. Using this operation instead of a sum or average has the following two properties. First, the Adasum operation approximates the sequential execution of the two nodes running one after the other, thereby achieving the convergence properties of smaller minibatch sizes. Additionally, it samples both possible orders of visiting the minibatches: b_1, b_2 and b_2, b_1. We then show how to extend the Adasum operation to more than two minibatches.

Consider two steps of SGD starting from model w_0. Say the first step computes a gradient g_1(w_0) of the loss function at w_0 for b_1. SGD updates the model to

w_1 = w_0 − α · g_1(w_0)

The second step computes its gradient g_2(w_1) at w_1 for b_2. Assuming we are using the same learning rate for both steps, the final model after the second step is

w_{1,2} = w_1 − α · g_2(w_1) = w_0 − α · (g_1(w_0) + g_2(w_1))    (1)

Here the subscript for w indicates that SGD processed b_1 before b_2. Comparing this with what a synchronous SGD algorithm would have computed (with a corresponding doubling of the learning rate),

w_0 − α · (g_1(w_0) + g_2(w_0))

we see a difference because gradient g_2 is computed at w_0 instead of w_1. As previously observed [24, 39], one can use second order reasoning to remove this staleness. Neglecting higher order terms in the Taylor expansion of g_2(w_1), we have

g_2(w_1) = g_2 − α · H_2 · g_1    (2)

where H_2 is the Hessian matrix of the loss function at w_0.

One nice property of standard synchronous SGD is that it adds no hyperparameters during gradient combination: we simply sum or average the two gradients. It is desirable to have the same property for Adasum. First, using a standard theorem [2, 18] for estimating the Hessian matrix of negative log likelihood loss functions (details in Appendix A.1), we can estimate

g_2(w_1) = g_2 − α · g_2 · g_2^T · g_1    (3)

where for terseness we have dropped the model term from g_1 and g_2 when computed at w_0. This, along with the assumption that α is chosen optimally (Appendix A.2), gives us

g_2(w_1) = g_2 − ((g_1^T · g_2)/‖g_2‖²) · g_2    (4)

Essentially, by scaling g_2 with an appropriate scalar, we can emulate the sequential execution of the two SGD steps in Equation 1 as

w_{1,2} = w_0 − α · [g_1 + (1 − (g_1^T · g_2)/‖g_2‖²) · g_2]    (5)

The ability to emulate a sequential execution shown above provides an intriguing possibility. SGD is a stochastic process that samples a path defined by the order of the training data it processes. Now, the emulation above provides us a way to sample multiple paths for the cost of one! For instance, if SGD had processed minibatch b_2 before b_1, the final model would be

w_{2,1} = w_0 − α · [g_2 + (1 − (g_1^T · g_2)/‖g_1‖²) · g_1]

Averaging the two samples, the final model would be

(w_{1,2} + w_{2,1})/2 = w_0 − α · [(1 − (g_1^T · g_2)/(2‖g_1‖²)) · g_1 + (1 − (g_1^T · g_2)/(2‖g_2‖²)) · g_2]
                      = w_0 − α · Adasum(g_1, g_2)

This equation motivates our design of the Adasum operator.
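For concreteness, here is a minimal NumPy sketch of the pairwise combiner; it is an illustration of the formula above, not the distributed implementation described in Section 4.

import numpy as np

def adasum_pair(g1: np.ndarray, g2: np.ndarray) -> np.ndarray:
    # Scale each gradient down by its projection onto the other, then
    # add: orthogonal gradients are summed, while parallel gradients of
    # equal norm are averaged.
    dot = np.dot(g1, g2)
    return (1.0 - dot / (2.0 * np.dot(g1, g1))) * g1 \
         + (1.0 - dot / (2.0 * np.dot(g2, g2))) * g2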
We can extend Adasum to more than two gradients by recursively applying the operator as follows. Let g_{[1,n]} be the result of applying Adasum to minibatches b_1 ... b_n. We can consider this as the effective gradient of these minibatches when emulating the behavior of SGD on b_{n+1}. Using the same arguments as above, we have

Adasum(g_{[1,n+1]}) = Adasum(Adasum(g_{[1,n]}), g_{n+1})

Since we double the number of paths of SGD emulated at each step, we achieve the effect of emulating exponentially many SGD paths.

As discussed in Section 4, one can reuse the standard ring algorithm used to sum all the gradients in synchronous SGD to implement the Adasum operation. One complexity is that the operation cannot be performed in a streaming manner, as we need to compute the dot product and norm of the two gradients. For a more bandwidth optimal implementation, we use the following recursive application in practice.
Adasum(g_{[1,n]}) = Adasum(Adasum(g_{[1,n/2]}), Adasum(g_{[n/2+1,n]}))
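This recursion is easy to express over a list of gradients held locally; the following sketch (again illustrative, assuming plain NumPy vectors) combines the two halves of the list with the pairwise rule.

import numpy as np

def adasum_tree(grads: list) -> np.ndarray:
    # Recursively combine the first and second half of the list,
    # mirroring the bandwidth-friendly recursion above.
    if len(grads) == 1:
        return grads[0]
    mid = len(grads) // 2
    g1, g2 = adasum_tree(grads[:mid]), adasum_tree(grads[mid:])
    dot = np.dot(g1, g2)
    return (1.0 - dot / (2.0 * np.dot(g1, g1))) * g1 \
         + (1.0 - dot / (2.0 * np.dot(g2, g2))) * g2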
Consider Adasum(g_1, g_2) when g_1 and g_2 are orthogonal. Their dot product g_1^T · g_2 is zero. Therefore, Adasum simply adds the two gradients. Now consider the case when they are parallel. Their dot product is simply the product of their norms. So, Adasum becomes the average of the two gradients. Intuitively, when the two gradients are pointing in orthogonal directions, Adasum behaves as if their loss functions are locally independent and aggressively sums the two gradients. Doing so when the two gradients are parallel has the danger of "overshooting" the minimum, particularly when the learning rate is also aggressive, and therefore Adasum safely averages the gradients. This adaptiveness becomes important as we later show that gradients from different batches tend to point in the same direction during the initial parts of the training. This is because the initial model is completely random and all gradients agree on the general direction the model should progress. However, the gradients become progressively orthogonal in later parts of the training. Adasum automatically and adaptively interpolates between an aggressive sum and a safe average as training proceeds.

For further insight, Figure 1 shows the per-layer orthogonality of gradients during training. At different points during the training of ResNet-50 and BERT-Large with 64 GPUs, we compared the norm of the result of Adasum on the gradients from individual layers across all 64 nodes with the individual norms of the gradients. Orthogonality of a set of gradients g_1 ... g_n for a given layer is defined as ‖Adasum(g_{[1,n]})‖² / ∑_i ‖g_i‖². This compares the squared norm of the result of Adasum on a set of gradients with the sum of their individual squared norms. Because of the properties of Adasum described above, orthogonality is 1 when the gradients are orthogonal to each other and reaches the minimum value of 1/64 when the gradients are parallel to each other and are of the same norm. Of the other ways of measuring orthogonality, this was the easiest for us to collect experimentally.

Figure 1: Orthogonality of gradients during ResNet-50 (a) and BERT-Large (b) training. The x-axis represents the number of samples or epochs processed during training. A value of 1 on the y-axis means that the gradients are orthogonal, with lower values meaning less orthogonality. The bold red lines show orthogonality averaged across all layers, while the others show the orthogonality of individual layers with different colors.

Figures 1a and 1b show the orthogonality of different layers and their average, shown by the bold red lines, for both ResNet-50 and BERT-Large. We can clearly see from the average of the orthogonality (red lines) that the gradients start out pointing in the same direction but very soon become orthogonal as the training proceeds. Each layer also demonstrates a similar pattern, shown with a different color in Figure 1. Of course, there are too many layers to distinguish each color individually. However, general trends are still visible. While most layers tend to become orthogonal as training proceeds, they do not do so at the same rate. This discrepancy is more visible for BERT-Large, where some layers have low orthogonality throughout the training process. Note that there are clear drops in the orthogonality during the training for both benchmarks. These drops happen exactly at the boundaries of learning rate schedule changes.

To exploit this observation, we perform the Adasum operation per layer as opposed to applying it to the whole gradient. This allows us to adaptively adjust the combination based on per-layer orthogonality.
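Under these definitions, the metric can be sketched directly on top of the adasum_tree helper from the earlier sketch; our implementation instead collects the same quantities per layer during the allreduce.

import numpy as np

def layer_orthogonality(grads: list) -> float:
    # ‖Adasum(g_[1,n])‖² / Σ‖g_i‖²: evaluates to 1 for mutually
    # orthogonal gradients and to 1/n for n parallel gradients of
    # equal norm.
    combined = adasum_tree(grads)  # tree combiner from the earlier sketch
    return float(np.dot(combined, combined)
                 / sum(np.dot(g, g) for g in grads))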
To validate our intuitions behind Adasum, we empirically evaluate how closely Adasum emulates a sequential execution. As shown earlier in Equation 2, we can reduce the staleness of gradients using the Hessian of the loss function. Luckily, for small models such as LeNet-5, we can compute the Hessian matrix exactly. Specifically, we downloaded the PyTorch MNIST tutorial example, modified the code to ensure deterministic runs, and used PyTorch's autograd facilities to compute the exact Hessian at each step during a parallel run with 64 nodes (that reached the target accuracy of 99.3%). After every communication step, we computed the model with sequential emulation using the exact Hessian, using the Adasum operator, and using the baseline synchronous SGD. Figure 2 shows the relative error of Adasum and synchronous SGD with respect to the sequential emulation using the Hessian.

Figure 2: Relative error of Adasum and synchronous SGD when compared to a sequential emulation that uses the exact Hessian matrix.

As we can see, Adasum has a lower approximation error, reaching close to zero at some steps. Note that the synchronous SGD error also goes down with the number of steps; the reason is that the norm of the gradients ‖g‖ decays as the model approaches the optimal answer, and H ≈ g · g^T decays quadratically. In a real run, the error shown here is accumulated during the training process. This further validates our intuition that Adasum should achieve faster convergence than synchronous SGD, which we demonstrate empirically in Section 5.

4 Implementation

Adasum is implemented in Horovod [3] and is publicly available in the main branch of its open-source repository. Horovod integrates with multiple machine learning frameworks, such as TensorFlow and PyTorch. Horovod has a C++ backend targeting multiple transports such as Ethernet/IB or NVLINK. Adasum works with CUDA-aware MPI (when available) and we implemented Adasum for all of these backends.
Adasum is easy to use by specifying an option to the DistributedOptimizer API of Horovod as follows:

opt = hvd.DistributedOptimizer(opt, op=hvd.Adasum)

When enabled, the distributed optimizer calls the necessary Adasum allreduce operations to synchronize global model updates. As with Horovod, the user is responsible for partitioning data across nodes and initializing the model correctly in all nodes.

Providing the option is the only change required for users wanting to use an existing DistributedOptimizer such as Adam or LAMB. For more fine-grained control, we also expose the Adasum operator through Horovod's allreduce using the same option:

hvd.allreduce(tensor, op=hvd.Adasum)

This is useful when users want to perform additional operations, such as gradient clipping, beyond those implemented in a DistributedOptimizer.

One subtlety in the implementation is that the Adasum operation should be performed after the optimizer update, as shown in Figure 3. In contrast, synchronous SGD performs allreduce before the optimizer update. Intuitively, this is because Adasum does not increase the minibatch size like synchronous SGD does when distributing across multiple nodes. As such, the logic of optimizers should only apply to the smaller minibatches per node. This is similar to BMUF [11]. Users have to replicate this logic explicitly when not using an existing DistributedOptimizer directly.

params = list(model.parameters())
starts = [p.data.clone() for p in params]  # snapshot before the local step
opt.step()                                 # local optimizer update
for start, current in zip(starts, params):
    effective_gradient = current.data - start        # local model delta
    effective_gradient = hvd.allreduce(effective_gradient, op=hvd.Adasum)
    current.data = start                   # rewind to the starting model
    current.data.add_(effective_gradient)  # apply the combined delta

Figure 3: Implementation of Adasum with optimizers such as Adam or LAMB.
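Putting the pieces together, a minimal end-to-end training loop with Adasum might look as follows. This is a sketch: MyModel and loader stand in for a user's model and per-node data shard.

import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = MyModel().cuda()   # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
opt = hvd.DistributedOptimizer(opt,
                               named_parameters=model.named_parameters(),
                               op=hvd.Adasum)
# Start every node from the same weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for x, y in loader:        # placeholder per-node data shard
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
    loss.backward()
    opt.step()             # the Adasum allreduce happens inside the optimizer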
This section describes the implementation of the Adasum operator in Horovod's allreduce. The Message Passing Interface (MPI) provides capabilities to perform user-defined reduction operations for allreduce. However, since these custom reductions can only be elementwise, Adasum cannot be implemented as a user-defined reduction.

Our implementation of Adasum uses a modified recursive vector-halving (RVH) algorithm for allreduce [10, 35], which is both latency and bandwidth optimal in hypercube and fully connected networks. On each step of the reduce-scatter phase of the algorithm, each node exchanges half of its data with its neighbor and applies the reduction on its own half. In the baseline algorithm, since each application of the reduction operation is only given a part of the data, the operation must be elementwise to ensure a correct result.
Algorithm 1 Recursive vector-halving with Adasum
Require: size > 1
 1: procedure ADASUMRVH(x, d)
 2:     mid = ⌊|x|/2⌋
 3:     if ⌊rank/d⌋ is even then                ▷ Left neighbor
 4:         nghr = rank + d
 5:         SEND(x[mid:|x|], nghr)              ▷ Send right half
 6:         a = x[0:mid]
 7:         b = RECV(nghr)                      ▷ Receive left half
 8:     else                                    ▷ Right neighbor
 9:         nghr = rank − d
10:         SEND(x[0:mid], nghr)                ▷ Send left half
11:         a = RECV(nghr)                      ▷ Receive right half
12:         b = x[mid:|x|]
13:     end if
14:     d′ = 2 · d
15:     v = [a · b, a · a, b · b]               ▷ Partial dot products
16:     group = [⌊rank/d′⌋ · d′ + i for i = 0 .. d′ − 1]
17:     v = ALLREDUCE(v, +, group)              ▷ Finish dot products
18:     x′ = (1 − v₀/(2v₁)) · a + (1 − v₀/(2v₂)) · b   ▷ Apply Adasum
19:     if d′ < size then
20:         x′ = ADASUMRVH(x′, d′)
21:     end if
22:     SEND(x′, nghr)                          ▷ Send my half
23:     y = RECV(nghr)                          ▷ Receive neighbor's half
24:     x = x′ ++ y if ⌊rank/d⌋ is even else y ++ x′
25: end procedure

To work around this problem, we modify the RVH algorithm to perform each Adasum operation in two phases, with an additional step of communication in between. Algorithm 1 describes these modifications. Each process is identified by a zero-based index called its rank. A process can send a vector v to another rank r with SEND(v, r) and receive one with v = RECV(r). The algorithm also uses another allreduce as a primitive to sum partial dot products and squared norms across subgroups of ranks: ALLREDUCE(v_i, op, group) returns a pointwise reduction of the vectors v_i from all ranks i ∈ group, where group is a list of the ranks participating in the reduction.

Each level of recursion in Algorithm 1 starts with ranks exchanging half of their vector x with a neighbor at distance d (lines 2-13). Here the two halves a and b are assigned such that the left neighbor's half is in a and the right neighbor's half is in b.

Lines 15-18 of Algorithm 1 represent the main modification to the baseline RVH algorithm. First, on line 15, each rank calculates a dot product and squared norms for a and b, which are slices of a larger logical vector shared across exactly the ranks in group (line 16). Line 17 then sums the partial products among the ranks in group to produce the complete results in v. The reduction is finally applied locally using the values in v (line 18).

The algorithm continues as normal, using recursion on line 20 until all ranks share slices of the same reduced vector, followed by an all-gather phase on lines 22-24.

Figure 4: Latency of AdasumRVH vs. NCCL for various message sizes.

Horovod uses AdasumRVH to reduce tensors whenever hvd.allreduce or hvd.DistributedOptimizer is used with op=hvd.Adasum. Additionally, if the HOROVOD_HIERARCHICAL_ALLREDUCE environment variable is set, Horovod performs a hierarchical allreduce using the NVIDIA Collective Communications Library (NCCL). This variant starts and ends with a NCCL reduce-scatter and allgather, respectively, for communication among the GPUs inside a node, with cross-node reduction handled by AdasumRVH. This is useful on some hardware configurations on which NCCL offers higher throughput than CUDA-aware MPI.
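To illustrate the pairwise combining pattern without the vector-halving machinery, the following is a simplified mpi4py sketch assuming a power-of-two number of ranks. Each rank keeps its full gradient, so unlike Algorithm 1 it is not bandwidth-optimal, and the dot products need no extra group allreduce.

import numpy as np
from mpi4py import MPI

def adasum_allreduce(g: np.ndarray, comm: MPI.Comm) -> np.ndarray:
    # After log2(size) rounds every rank holds the same tree-Adasum of
    # all ranks' gradients (assumes size is a power of two).
    rank, size = comm.Get_rank(), comm.Get_size()
    d = 1
    while d < size:
        nghr = rank + d if (rank // d) % 2 == 0 else rank - d
        other = np.empty_like(g)
        comm.Sendrecv(g, dest=nghr, recvbuf=other, source=nghr)
        # Order the pair consistently so both ranks compute the same result.
        a, b = (g, other) if nghr > rank else (other, g)
        dot = float(np.dot(a, b))
        g = (1 - dot / (2 * np.dot(a, a))) * a \
          + (1 - dot / (2 * np.dot(b, b))) * b
        d *= 2
    return g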
AdasumRVH Performance

We now evaluate the latency of AdasumRVH on 16 Azure nodes with 4 V100s per node (PCIe interconnect) and a single 100 Gb/s Infiniband connection between them. As a baseline, we compare to NCCL's sum operation. A point (x, y) on Figure 4 shows the latency of an allreduce operation (y) as a function of the number of bytes reduced (x). For each point on the x axis, we allocate 64 tensors in each GPU's memory so that their sizes sum to that number of bytes. The figure demonstrates that despite the additional logic required to perform an adaptive summation, the performance of AdasumRVH is roughly equal to the highly optimized NCCL library simply doing a summation. Note that these results use vectorization as well as tensor fusion with a threshold of 2MB, as discussed in Sections 4.4.2 and 4.4.3.

As Section 3.4 described, there are two ways of applying the pairwise Adasum operation on a set of gradients. AdasumRVH performs the "tree" reduction. An alternate way is to apply the Adasum operator linearly. We additionally implemented this approach and optimized it using techniques similar to the ring allreduce algorithm commonly used for synchronous SGD. We found that this "ring" implementation provided less throughput than the baseline NCCL allreduce and AdasumRVH on the architectures we evaluated. Nevertheless, we believe the ring allreduce version of Adasum could be competitive on other architectures.
For large models such as BERT-Large, the memory available in a GPU only fits a small microbatch size. In such cases, to increase the effective microbatch size, we use the GPUs available in a single node to accumulate local gradients and use the Adasum operation across nodes. In these scenarios, we parallelize the Adasum computation across all these local GPUs.

Our approach is inspired by the optimizer-state partitioning algorithm pioneered by Marian [4], a deep learning toolkit optimized for NLP workloads. Optimizers like Adam or LAMB maintain additional state per gradient element to estimate a moving mean and/or variance. The optimizer parameters are identical for all GPUs and thus it is not necessary to replicate them. Marian partitions this state and parallelizes the updates to it. Note that this memory optimization is not related to model parallelism, as model parameters are still replicated on all GPUs.

We use the same insight to parallelize the Adasum computation. Looking at Figure 3, we partition the optimizer state as in Marian. In addition, we also partition the effective_gradient across the local GPUs. A key difference from the Marian approach is that rather than distributing this state uniformly, we partition it to ensure that state corresponding to one neural network layer falls in the same partition. This greatly simplifies our implementation as we do not have to modify the code of the underlying optimizer.

After the optimizer update, which is done in parallel (as in Marian), the effective_gradient in Figure 3 is already partitioned across local GPUs. Now, each GPU does an Adasum allreduce only on the layers in its partition by communicating with the corresponding GPUs that share its partitions in other nodes. Finally, the GPU broadcasts its partition of effective_gradient locally to all other GPUs in the same node to update the model parameters. To optimize the cost of this local broadcast, we overlap this communication with the Adasum operation of the next layer.
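The text above does not spell out the exact assignment policy, so the following sketch uses a simple greedy balancing by element count as an illustrative assumption; the important property is that whole layers, never slices of one, land in a single partition.

def partition_by_layer(params, num_local_gpus):
    # Assign each layer's parameter tensor (torch-style, with .numel())
    # to the least loaded local GPU; each GPU then runs the optimizer
    # update and Adasum allreduce for the layers in its own bucket.
    buckets = [[] for _ in range(num_local_gpus)]
    loads = [0] * num_local_gpus
    for p in sorted(params, key=lambda p: -p.numel()):
        i = loads.index(min(loads))
        buckets[i].append(p)
        loads[i] += p.numel()
    return buckets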
         Throughput (samples/s)   Model update (s)   Microbatch
Without  154.7                    1.82               22
With     168.5                    0.97               36

Table 1: Performance improvement without and with Adasum parallelization.

Table 1 shows the performance improvement of this optimization for a PyTorch implementation of BERT-Large on an Azure VM with 4 V100s (16 GB RAM) connected by PCIe, for a maximum sequence length of 128. Since this optimization reduces the memory usage, we can increase the microbatch size by 60% as shown in the last column. To measure the impact of this larger microbatch, we evaluate 256 microbatches before an Adasum operation. The first column in Table 1 provides the throughput per GPU. In other words, the 60% larger microbatch yields nearly a 10% improvement in per-GPU throughput. To measure the impact of parallelizing the Adasum operation, we evaluate a microbatch size of 1 per model update. The second column in Table 1 shows that this time drops by nearly 1.87×.

This section describes implementation details that are crucial for improving the system and algorithmic efficiency of Adasum.
Recent trends in ML training have shown the promise of using low-precision formats such as fp16 for compute and communication efficiency. Our implementation of Adasum integrates with the low-precision support in Horovod to obtain these benefits automatically.

We discuss two important subtleties in our implementation of low-precision support. First, Adasum requires computing the dot product and norms of the combined gradients. The accumulation of the values for these two operations happens in a double even if the gradients are in lower precision. This does not incur any measurable overhead for either the CPU or GPU implementations. On the other hand, the improved floating point stability is crucial for the improved convergence of Adasum.

Second, when using lower precision, it is common to use dynamic scaling [25]. The basic idea is to maintain a scale for all tensors to ensure that the values are always in the dynamic range of the low-precision format. During training, these scales have to be periodically adjusted when the values exceed the range (resulting in NaNs). We perform dynamic scaling for tensors we introduce, such as the effective_gradient in Figure 3.
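A sketch of the first subtlety, with NumPy standing in for the CPU/GPU kernels: the reductions for the dot product and norms accumulate in float64 even when the gradients themselves are fp16.

import numpy as np

def adasum_pair_fp16(g1: np.ndarray, g2: np.ndarray) -> np.ndarray:
    # Promote only the reductions to float64; the stored gradients
    # remain fp16 on the wire and in memory.
    a = g1.astype(np.float64)
    b = g2.astype(np.float64)
    dot, n1, n2 = np.dot(a, b), np.dot(a, a), np.dot(b, b)
    out = (1 - dot / (2 * n1)) * a + (1 - dot / (2 * n2)) * b
    return out.astype(np.float16)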
Adasum runs on both CPU and GPU hardware in fp16, fp32, and fp64. For CPU hardware, we manually vectorize the loop bodies that perform both dot products and summations. When Horovod is compiled with CUDA-aware MPI, we implement these same loops as GPU kernel calls that operate directly on GPU memory and thus save on the transfer from GPU to CPU. This is particularly important on hardware that supports GPUDirect RDMA, as GPU memory need not be copied to the CPU for the Adasum operator.

If, at the time of allreducing a tensor, tensors from other layers are also ready and available on all hosts, Horovod fuses these tensors into a single one, performs an allreduce on the fused tensor, and then copies from the fused tensor back to the individual tensors. This optimization significantly reduces latency for small tensors, as the overhead of potentially many individual allreduce calls is amortized into a single one. To enable this optimization with Adasum, we do additional bookkeeping to keep track of tensor boundaries in the fused tensor, as Adasum requires these boundaries to compute dot products per layer. Because all hosts 1) fuse the same set of tensors and 2) have the same layer sizes, this bookkeeping is stored locally and does not increase communication overheads. Note that Horovod uses an extra buffer for copying the tensors from different layers into a consecutive array so that underlying libraries such as MPI or NCCL can be called once. Adasum uses the same buffer for this optimization. The size of this buffer is controlled by HOROVOD_FUSION_THRESHOLD. A default value between 2MB and 64MB usually works well.
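The per-layer bookkeeping amounts to tracking offsets inside the fused buffer. A local NumPy sketch, assuming (as noted above) that layer_sizes is identical on every host:

import numpy as np

def fused_adasum(buf1: np.ndarray, buf2: np.ndarray, layer_sizes) -> np.ndarray:
    # Apply the pairwise Adasum rule per layer inside a fused buffer,
    # using the locally known layer boundaries.
    out = np.empty_like(buf1)
    offset = 0
    for n in layer_sizes:
        a = buf1[offset:offset + n]
        b = buf2[offset:offset + n]
        dot = np.dot(a, b)
        out[offset:offset + n] = (1 - dot / (2 * np.dot(a, a))) * a \
                               + (1 - dot / (2 * np.dot(b, b))) * b
        offset += n
    return out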
5 Evaluation

To show that Adasum works across a variety of real-world training scenarios, we evaluate its performance through a sequence of case studies. Throughout the experiments, we use our implementation of Adasum in Horovod described in Section 4 on models implemented in both PyTorch and TensorFlow.

We first study the algorithmic and system efficiency of Adasum for ResNet-50 and BERT-Large. Then, we use LeNet-5, which is small enough to allow extensive hyperparameter tuning, to show that Adasum enables robust scaling without the need for additional hyperparameter tuning. Finally, we finish with a short summary of our experience with production models.
This section evaluates Adasum on PyTorch's ResNet-50 using the Momentum-SGD optimizer on hardware with a fast interconnect.
ResNet-50 [19] on ImageNet [32] is a popular model for studying the performance of training algorithms and implementations. We used PyTorch's ResNet-50 model modified to run with Horovod and compared the performance of Adasum with Horovod's default Sum operator as the baseline for synchronous SGD. We ran experiments on Azure's Standard_NC24rs_v3 virtual machines, each of which has 4 NVIDIA Tesla V100 GPUs connected with PCIe, dual-socket Intel Xeon E5-2690 v4 CPUs, and 448 GiB of memory, and which are connected via Infiniband. Hierarchical allreduce, as described in Section 4.2.2, was faster for both the baseline and Adasum runs, so we used it here.

We train on 64 V100s with 2K and 16K examples per allreduce and use the default hyper-parameters that ship with the benchmark for its momentum based SGD optimizer.

Figure 5: Time-to-accuracy chart for ResNet-50 with 64 GPUs on 16 Standard_NC24rs_v3 VMs.
The number of epochs required for each configuration to reach the target accuracy are as follows:

Sum 2k   Sum 16k   Adasum 2k   Adasum 16k
62       -         62          69
Because Sum with a 16k batch size never reaches 74.9% validation accuracy (we let it run for 120 epochs), its algorithmic efficiency is zero. Adasum, on the other hand, sees only an 11% decline in its algorithmic efficiency as we increased the batch size. This is more than made up for in increased system efficiency.
The times per epoch for each configuration are as follows:

Sum 2k     Sum 16k    Adasum 2k   Adasum 16k
5.61 min   2.12 min   5.72 min    2.23 min
Adasum closely matches the system efficiency of Horovod's sum implementation at both 2k and 16k batch sizes. Increasing the batch size from 2k to 16k results in a 61% and 62% improvement for Adasum and sum, respectively.

The total efficiency can be seen in Figure 5, which shows the time (x-axis) to accuracy (y-axis) for each configuration. Sum 16k plateaus below the target accuracy. Adasum 16k, on the other hand, gets to a top-1 accuracy of 74.9%, being 2.3X faster in time to accuracy than Adasum 2k, while using the same number of GPUs.

Local steps before communicating   16       1
Effective batch size               64K      4K
Minutes per epoch                  1.98     2.58
Epochs till convergence            84       68
Time to accuracy (min)             166.32   175.44

Table 2: Algorithmic and system efficiency for TensorFlow ResNet-50 on TCP. Adasum enables faster time to accuracy by doing more compute per communication.
Horovod and Adasum run on all types of hardware. Often, networks have TCP rather than IB, and as such, communication limits scaling to multiple nodes. This section demonstrates that Adasum enables a larger effective batch size, which reduces communication overhead, and yet still converges fast enough to reduce time to accuracy.
This section demonstrates how a TensorFlow implementation of ResNet-50 from MLPerf's v0.5 reference implementation [5] is able to scale to 16 V100s (4 32GB GPU cards per node with a PCIe gen 3 interconnect) with TCP (40 Gb/s) interconnects in between.

We downloaded the MLPerf v0.5 reference implementation and slightly modified it to use Horovod with Adasum. We did a small hyper-parameter search over the learning rate but did not change any other hyper-parameters. We found 4× the default learning rate provided the fastest convergence. This benchmark uses the Momentum SGD optimizer from TensorFlow. MLPerf v0.5 converges when the test accuracy is greater than or equal to 74.9%.

Unlike simply adding the gradients together for gradient accumulation, the TensorFlow Adasum-enabled distributed optimizer uses local SGD steps to update weights; when it is time for an allreduce, the gradient is estimated via a delta from the model's state since the prior allreduce, as sketched below.

Table 2 shows the results of this experiment. Adasum for TensorFlow enables a form of gradient accumulation where the toolkit makes many local steps before communicating. The first row denotes how many local steps to make before initiating an allreduce with the Adasum operation. Note that when it takes 1 local step before communicating, there is no gradient accumulation. In contrast, 16 local steps before communicating means that the allreduce is called once every 16 local steps. With 16 GPUs and 256 examples per GPU, the convergence is fast, at 68 epochs (fourth row in Table 2). That convergence slows to 84 epochs with an effective batch size of 64K. It is important to note that none of the submissions to MLPerf v0.5 used more than a 16K effective batch size (we verified this by looking at the result submissions). Thus Adasum is able to scale the TensorFlow momentum optimizer to 64K examples before communicating, 4× more than prior art.

When communicating after every step, the system efficiency is low as communication dominates (third row in Table 2). In contrast, when we communicate every 16 local steps, the time for 1 epoch is much faster. Total running time is given by the last row (minutes per epoch times epochs till convergence), and clearly communicating less frequently has a big impact on overall running time despite the slight decrease in algorithmic efficiency. Thus, a developer can exploit fast distributed hardware even without a fast interconnect.
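A PyTorch-flavored sketch of this local-steps scheme; model, opt, loss_fn, and batches are placeholders, and the actual TensorFlow implementation lives inside Horovod's distributed optimizer:

import horovod.torch as hvd

def local_steps_then_adasum(model, opt, loss_fn, batches, k):
    # Run k local optimizer steps with no communication, then
    # Adasum-allreduce the model delta accumulated since the snapshot.
    starts = [p.data.clone() for p in model.parameters()]
    for x, y in batches[:k]:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    for start, p in zip(starts, model.parameters()):
        delta = hvd.allreduce(p.data - start, op=hvd.Adasum)
        p.data = start + delta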
This section shows Adasum scales both Adam [22] andLAMB [38] for the PyTorch NVIDIA implementation ofBERT-Large [7].
Training BERT-Large [14], a natural language processing (NLP) model, takes place in two stages. First is an unsupervised pre-training on two large text corpora, Wikipedia and BookCorpus. Then, the model is fine-tuned for a "downstream" NLP task such as SQuAD question-answering [28, 29]. A target F1 score of 90.5 on SQuAD 1.1, averaged over 5 tries with different seeds, is generally accepted for BERT-Large pre-training [14, 38].

Pre-training BERT-Large requires tokenizing input sentences into a maximum sequence length. A maximum sequence length of 512 is computationally expensive, and thus Devlin et al. suggest [14] breaking pre-training into two phases: phase 1 with a maximum sequence length of 128 for 90% of the training iterations, and phase 2 with a maximum sequence length of 512 for the remaining 10%.

NVIDIA's repo contains scripts that 1) download and preprocess data, 2) run phase 1 and 2 pre-training, and 3) run SQuAD fine-tuning and evaluation. NVIDIA's pre-training scripts use mixed precision training in addition to data parallelism with NCCL [6]. This codebase is our baseline, and for the Adasum implementation we replaced its use of torch.distributed with the Adasum operator in Horovod.

The system that we used for this case study is a cluster of DGX-2 nodes where each node has 16 V100 GPUs with 32GB of memory per GPU and NVSwitch intra-node connectivity. Each node has 8 NICs with Infiniband support, capable of delivering a throughput of 800 Gb/s per node.
                       Number of iterations
Algorithm              Phase 1   Phase 2
Baseline-Adam          -         -
Baseline-LAMB [38]     7039      1563
Adasum-Adam            7039      1563
Adasum-LAMB - 20%      5639      1250
Adasum-LAMB - 30%      5039      1563
Adasum-LAMB - 128K     4574      1563

Table 3: Algorithmic efficiency results on BERT-Large. The table shows the number of iterations required for Phase 1 and Phase 2 to achieve the target SQuAD score of 90.5, when using an effective batch size of 64K for Phase 1 and 32K for Phase 2.
Table 3 describes the algorithmic efficiency of Adasum with the Adam and LAMB optimizers when using an effective batch size of 64K for Phase 1 and 32K for Phase 2. As reported in prior work, the Adam optimizer does not scale to batch sizes beyond 16K. This motivated the study of more sophisticated optimizers such as LARS and LAMB. For instance, our runs of the LAMB optimizer achieve the target SQuAD score with 7039 iterations of phase 1 and 1563 iterations of phase 2, as shown in the second row of Table 3.

The next two rows of Table 3 show the performance of Adasum. In contrast to the Adam baseline, the Adasum-Adam optimizer converges with 64K when run with the same number of iterations for Phase 1 and Phase 2 as the LAMB baseline. This is an interesting result as, despite the advances of optimizers such as LAMB, the Adam optimizer continues to be popular for some models. When compared to prior work [38], it is important to note that Adasum adds no additional hyperparameters, simply reusing the baseline parameters of the Adam optimizer. On the other hand, improvements provided by Adasum are orthogonal to improvements in optimizers. As shown in Table 3, Adasum-LAMB provides close to 20% faster convergence compared to the LAMB baseline, requiring 5639 iterations for phase 1 and 1250 iterations for phase 2.

We also performed two variations of our Adasum-LAMB runs. First, we aggressively reduced the number of Phase 1 iterations by 30%. With an equivalent aggressive reduction in Phase 2, we slightly missed the target SQuAD score, by 0.5. However, we did achieve the target accuracy with the full 1563 iterations in Phase 2, which is what we report in the table. With a more fine grained search for Phase 2 iterations, we believe we can achieve the target SQuAD score with fewer Phase 2 iterations. These results are still interesting as Phase 1 takes a larger percentage of training time than Phase 2.

For the second variation, we increased the effective batch size of Phase 1 to 128K. We were able to achieve the target SQuAD score with 4574 Phase 1 iterations, while using the standard 1563 iterations of Phase 2 with a 32K batch size. To the best of our knowledge, this is the largest reported effective batch size for BERT-Large.

        Phase 1 speedup    Phase 2 speedup    Time (minutes)
GPUs    Sum     Adasum     Sum     Adasum     Sum     Adasum
64      1       0.98       1       0.99       997     809
256     3.79    3.61       3.89    3.92       260     214
512     7.47    6.48       7.24    7.28       135     118

Table 4: System efficiency on BERT-Large for an effective batch size of 64K and 32K for phase 1 and phase 2, respectively. The speedup numbers are relative to the throughput of Baseline-LAMB with 64 GPUs, which is 12.2K examples per second for Phase 1 and 4.6K examples per second for Phase 2. The improved convergence time of Adasum is a result of the 20% improvement in algorithmic efficiency shown in Table 3.

The algorithmic efficiency of Adasum is only useful if the algorithm can be implemented without substantial degradation in system efficiency. Table 4 shows the speedup of Adasum-LAMB when compared to Baseline-LAMB for an effective batch size of 64K. On our GPU cluster, Baseline-LAMB processes 12.2K examples per second during Phase 1 and 4.6K examples per second during Phase 2. This reduction in throughput arises because Phase 2 has more computation due to the increased sequence length. The speedup numbers in the table are relative to this baseline. For instance, at 256 GPUs, the baseline scales to a speedup of 3.79 (perfect scaling would be 4). Just as a comparison with published NVIDIA numbers [7], the baseline finishes BERT-Large in 260 minutes, which is slower than the 236 minutes reported by NVIDIA. This difference is due to the NVIDIA cluster having a slightly more performant DGX-2H configuration with a higher clock speed compared to our DGX-2 cluster.

Though Adasum performs more computation during allreduce, the reduction in throughput for 64 GPUs is less than 2% for Phase 1 and less than 1% for Phase 2. As we increase the number of GPUs, the additional computation results in lower scaling efficiency for Phase 1. For instance, Adasum incurs roughly a 5% reduction (13% reduction) in throughput when compared to the baseline for Phase 1 on 256 (512) GPUs. On the other hand, the throughput for Phase 2 with Adasum shows similar scalability to the baseline as we increase the number of GPUs, with Adasum sometimes being faster. This is due to the fact that Phase 2 does more computation.

The reduction in system efficiency is more than compensated for by the 20% improvement in algorithmic efficiency. As such, Adasum achieves faster time to accuracy than the baseline. In particular, Adasum completes BERT-Large in 214 minutes, which is faster than NVIDIA's reported numbers on 256 GPUs [7]. On 512 GPUs, Adasum reaches the desired accuracy in 118 minutes. We observed that some of the overhead is due to CUDA-aware MPI (OpenMPI + UCX) not being as fast as NCCL. We are in the process of porting Adasum's allreduce to NCCL as a consequence.

Figure 6: LeNet-5 accuracies under an aggressive sequential learning rate schedule for various distributed configurations.
LeNet-5 is a model for the MNIST dataset that is commonly used as a tutorial for neural networks. This pioneering image recognition model is very small by today's standards. While speeding up the training of LeNet-5 is not a priority, a benefit of evaluating Adasum on it is that it is practical to do an exhaustive hyperparameter search. This allows us to thoroughly explore the convergence properties of Adasum using the following experimental setup. First, we find a very aggressive learning rate schedule for sequential training that barely reaches a baseline accuracy. Then, keeping the number of epochs fixed, we compare different distributed training configurations by how far off they are from the sequential accuracy.

We used the PyTorch version of LeNet-5 found in Horovod's examples with a momentum based SGD optimizer with a batch size of 32. While the original example has a fixed learning rate schedule of 10 epochs, we were able to bring this down to 2 epochs using a linear warmup from zero and decay back to zero.

The maximum accuracy of LeNet-5 on MNIST is in the 99.3%-99.4% range. Using 99.3% as our target accuracy, we used an Azure cluster of NC-series VMs with 333 NVIDIA K80 GPUs to optimize the hyperparameters of the training script. The following configuration reliably reaches the target accuracy in sequential training in 2 epochs (while no reliable configuration was found for 1.75 epochs):
Epochs   Max LR   Warmup %
2        0.0328   17%
We keep the number of epochs constant in the following experiments. Since MNIST has 60000 images, one GPU will take 1875 steps per epoch, while 32 GPUs would take only 58 steps per epoch.

We evaluated Adasum and Sum on 4, 8, 16, or 32 GPUs with both an unmodified learning rate as well as an optimized one, which we searched for separately for each combination of method and number of GPUs. Figure 6 shows the accuracies reached by each configuration under the aggressive learning rate schedule for sequential training.

Without learning rate tuning, Sum fails to converge at more than 8 GPUs, while Adasum still converges at 32 GPUs without any hyperparameter search. This highlights the easy scalability that Adasum enables. Furthermore, even with a tuned learning rate, Sum is far below even untuned Adasum at 32 GPUs, and is still beaten by Adasum with a tuned learning rate at 16 GPUs.

Consider the tuned learning rates for each configuration:
Method   4 GPUs   8 GPUs   16 GPUs   32 GPUs
Adasum   0.0328   0.0147   0.012     0.0204
Sum      0.0275   0.017    0.0089    0.0043
For Sum, going from 16 to 32 GPUs is coupled with a halving of the learning rate, which means that the per-iteration step size stays the same even though twice as many GPUs are participating in each iteration. In contrast, Adasum can maintain much higher learning rates at 16 and 32 GPUs.
The prior case studies demonstrate the efficacy of Adasum on public models. Over the past three years, we have applied Adasum to a variety of production models from Company X (name removed for anonymity) and observed similar improvements in convergence with little or no hyperparameter tuning. Due to space constraints, we only provide a summary of these results to show the broad applicability of Adasum. For instance, a team relying on an LSTM-based model for predicting the next command a user is likely to issue in an application was able to use Adasum to train on four times the amount of data, resulting in a 6% improvement in downstream accuracy. Similarly, for a speech translation model with the Momentum optimizer, Adasum allowed the team to scale to 16 GPUs, dramatically reducing the turnaround time for their engineers. On a GPT-2 based model modified to work with code, Adasum provided 15% faster time to accuracy.

6 Related Work

Previous works for enabling large-batch training have focused on the problem of adapting the learning rate appropriately. This is helpful because, while large batch sizes permit taking larger steps in many situations, the appropriate learning rate changes during training. Adam [23] adjusts the step size based on the variance of gradients, taking smaller steps when variance is high. LARS [37] adapts learning rates per layer using a trust ratio calculated as the ratio between the norm of the layer weights and the norm of gradients, the intuition being that divergence happens when steps are large in relation to the parameters being updated. LAMB [38] can be seen as LARS applied to Adam instead of vanilla SGD. These approaches that use statistical measures to adapt the learning rate are qualitatively different from Adasum, which exploits a specific property (orthogonality) of gradients to take bigger steps when appropriate. Adasum and learning rate adaptation methods are in many cases complementary, as we have shown in our experiments successfully combining Adasum with Adam and LAMB.

Asynchronous SGD [12, 13] approaches can address two issues in distributed synchronous SGD: the synchronization overhead of faster nodes having to wait for stragglers to finish the iteration, and the non-overlapping of compute and communication. However, stale gradients present another potential source of degraded convergence. Specifically, the DC-ASGD algorithm [39] addressed this staleness using an approximation of the Hessian as used in Adasum. They only use the diagonal elements of the g · g^T approximation of the Hessian and require an additional hyperparameter which requires careful tuning over time. It was also only evaluated for SGD and Momentum-SGD. Our approach was motivated to be a drop-in replacement of the allreduce operation; thus we eliminate all hyperparameters in our combination and it is optimizer agnostic. Similarly, Maleki et al. [24] use the Hessian to reduce staleness and use a Johnson-Lindenstrauss projection to get a low rank approximation of a semantics-preserving model combiner. Their approach only works with exact Hessian computation and is unlikely to scale to DNNs.

While large-batch training methods decrease the amount of communication needed, gradient compression approaches reduce the cost of each communication round. In gradient quantization approaches, gradients are cast to a lower bit-width datatype for communication, with bit-widths ranging all the way down to 1 bit [33].
Low-rank compression methods communicate the most important dimensions of gradients [36]. Any lossy gradient compression presents yet another potential source for loss of convergence.

7 Conclusion

Adasum unlocks an unprecedented level of scalability for data parallel distributed training of large models. Our case studies show that Adasum scales BERT-Large with LAMB to a 128K batch size, BERT-Large with Adam to a 64K batch size, and ResNet-50 with Momentum to a 64K batch size, while maintaining downstream accuracy. Adasum is publicly available through the open-source Horovod package [3], can simply be enabled with an additional flag, and requires no additional hyper-parameters.

Finally, Adasum opens a new direction for future work on lightweight ways to exploit orthogonality in model updates. We expect Adasum's empirical observations to drive future work in adaptive learning rate methods and further communication optimizations.

References

[1] Cloud TPU. https://cloud.google.com/tpu. Accessed: 2020-05-22.

[2] Fisher information and the Hessian of log likelihood. http://mark.reid.name/blog/fisher-information-and-log-likelihood.html. Accessed: 2020-05-22.

[3] Horovod: Overview. https://horovod.readthedocs.io/en/latest/summary_include.html. Accessed: 2020-05-22.

[4] Marian: Fast neural machine translation in C++. https://marian-nmt.github.io. Accessed: 2020-05-22.

[5] MLPerf: Fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services. https://mlperf.org/. Accessed: 2020-05-22.

[6] NCCL: The NVIDIA Collective Communications Library. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/overview.html. Accessed: 2020-05-22.

[7] NVIDIA deep learning examples for Tensor Cores. https://github.com/NVIDIA/DeepLearningExamples. Accessed: 2020-05-22.

[8] NVIDIA DGX-1. Accessed: 2020-05-22.

[9] NVIDIA DGX-2. Accessed: 2020-05-22.

[10] Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.

[11] Kai Chen and Qiang Huo. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In ICASSP-2016, March 2016.

[12] Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 571–582, Broomfield, CO, October 2014. USENIX Association.

[13] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223–1231. Curran Associates, Inc., 2012.

[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[15] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32, December 2019.

[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, chapter 6.5, pages 200–220. MIT Press, 2016.

[17] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.

[18] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.

[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[20] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1731–1741. Curran Associates, Inc., 2017.

[21] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima, 2016. http://arxiv.org/abs/1609.04836.

[22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. http://arxiv.org/abs/1412.6980.

[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations (ICLR), 2015.

[24] S. Maleki, M. Musuvathi, and T. Mytkowicz. Semantics-preserving parallelization of stochastic gradient descent. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 224–233, 2018.

[25] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training, 2017. http://arxiv.org/abs/1710.03740.

[26] Boris Polyak and Y.Z. Tsypkin. Pseudogradient adaptation and training algorithms. Automation and Remote Control, 34:377–397, 1973.

[27] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

[28] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD, 2018. http://arxiv.org/abs/1806.03822.

[29] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text, 2016. http://arxiv.org/abs/1606.05250.

[30] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 693–701. Curran Associates, Inc., 2011.

[31] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, October 1986.

[32] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[33] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Haizhou Li, Helen M. Meng, Bin Ma, Engsiong Chng, and Lei Xie, editors, INTERSPEECH, pages 1058–1062. ISCA, 2014.

[34] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2019. http://arxiv.org/abs/1909.08053.

[35] R.A. Vandegeijn. On global combine operations.
Jour-nal of Parallel and Distributed Computing , 22(2):324 –328, 1994.[36] Thijs Vogels, Sai Praneeth Karimireddy, and MartinJaggi. Powersgd: Practical low-rank gradient com-pression for distributed optimization. In H. Wallach,H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox,and R. Garnett, editors,
Advances in Neural InformationProcessing Systems 32 , pages 14259–14268. Curran As-sociates, Inc., 2019.[37] Yang You, Igor Gitman, and Boris Ginsburg. Largebatch training of convolutional networks. arXiv preprintarXiv:1708.03888 , 2017.[38] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, San-jiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, JamesDemmel, and Cho-Jui Hsieh. Large batch optimizationfor deep learning: Training bert in 76 minutes. Techni-cal Report UCB/EECS-2019-103, EECS Department,University of California, Berkeley, Jun 2019.[39] Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen,Nenghai Yu, Zhiming Ma, and Tie-Yan Liu. Asyn-chronous stochastic gradient descent with delay com-pensation for distributed deep learning.
CoRR ,abs/1609.08326, 2016.
A Deriving sequential emulation
Suppose L_1(w) and L_2(w) are two loss functions corresponding to two different examples. Starting from model w_0, sequential SGD calculates w_1 = w_0 − α∇L_1(w_0) followed by w_2 = w_1 − α∇L_2(w_1), where α is a properly set learning rate for both iterations. With forward substitution, w_2 = w_0 − α(∇L_1(w_0) + ∇L_2(w_1)). Alternatively, ∇L_1(w_0) and ∇L_2(w_0) (note that both gradients are at w_0) are computed in parallel and w_0 is updated with w_2' = w_0 − α(∇L_1(w_0) + ∇L_2(w_0)). Clearly w_2' and w_2 are different because ∇L_2 was computed at a different point.

A.1 Using Taylor Expansion

Adasum estimates ∇L_2(w_1) using a Taylor expansion to capture the effect of the first update on the second update. Note that ∇L_2(w_1) is a convenient notation for the first-order derivative of L_2 at w_1 and can be rewritten as ∂L_2/∂w |_{w_1}. Therefore:

    ∇L_2(w_1) = ∂L_2/∂w |_{w_1}
              = ∂L_2/∂w |_{w_0 + (w_1 − w_0)}
              = ∂L_2/∂w |_{w_0} + ∂²L_2/∂w² |_{w_0} · (w_1 − w_0) + O(‖w_1 − w_0‖²)
              = ∂L_2/∂w |_{w_0} − α ∂²L_2/∂w² |_{w_0} · ∇L_1(w_0) + α² O(‖∇L_1(w_0)‖²)
              ≈ ∇L_2(w_0) − α H_2(w_0) · ∇L_1(w_0)                                       (6)

where the error is α² O(‖∇L_1(w_0)‖²) and H_2(w_0) is the Hessian matrix of the loss function L_2 at w_0. The quadratic relationship between the error in Formula 6 and the learning rate α helps this approximation since, in most training practices, the learning rate decays as training progresses. However, computing H_2(w_0) requires significant computational power: the size of this matrix is the number of model parameters squared, and with millions of parameters, even storing the matrix is infeasible.

[18] shows that for models with negative log-likelihood loss functions (which is the case for all models studied in this paper), the Hessian matrix can be approximated by the outer product of the gradients: H_2(w_0) ≈ ∇L_2(w_0) · ∇L_2(w_0)^T. Using Equation 6 and this approximation, ∇L_2(w_1) can be rewritten as:

    ∇L_2(w_1) ≈ ∇L_2(w_0) − α ∇L_2(w_0) · ∇L_2(w_0)^T · ∇L_1(w_0)                         (7)
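As a concrete illustration of Equations 6 and 7, the following minimal numpy sketch compares the true second gradient ∇L_2(w_1) against both approximations on a toy least-squares loss. The data, names, and loss are our own hypothetical choices, not from the paper's code; for a quadratic loss, Equation 6 is exact, while Equation 7's error comes from the rank-1 Hessian approximation.

    import numpy as np

    # Toy losses L_i(w) = 0.5 * (x_i @ w - y_i)^2, so grad L_i(w) = (x_i @ w - y_i) * x_i.
    rng = np.random.default_rng(0)
    x1, x2 = rng.normal(size=10), rng.normal(size=10)
    y1, y2 = 1.0, -1.0
    w0, alpha = rng.normal(size=10), 0.01

    grad = lambda x, y, w: (x @ w - y) * x
    g1, g2 = grad(x1, y1, w0), grad(x2, y2, w0)
    w1 = w0 - alpha * g1                     # first sequential SGD step

    exact = grad(x2, y2, w1)                 # true gradient of L2 at w1
    H2 = np.outer(x2, x2)                    # true Hessian of L2 (constant for a quadratic)
    eq6 = g2 - alpha * (H2 @ g1)             # Equation 6: exact here since L2 is quadratic
    eq7 = g2 - alpha * (g2 @ g1) * g2        # Equation 7: Hessian ~ outer product of gradient

    print(np.linalg.norm(exact - eq6))       # ~0 for this quadratic loss
    print(np.linalg.norm(exact - eq7))       # O(alpha): rank-1 approximation error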
A.2 Choosing the optimal learning rate

For this discussion, we assumed that α was properly chosen for the sequential SGD algorithm. We use a Taylor expansion of the loss function L_2(w_0 − α∇L_2(w_0)) and take its derivative with respect to α to find the optimal value:

    L_2(w_0 − α∇L_2(w_0)) ≈ L_2(w_0) − α ∇L_2(w_0)^T · ∇L_2(w_0) + (α²/2) ∇L_2(w_0)^T · H_2(w_0) · ∇L_2(w_0)

    ⟹ ∂L_2(w_0 − α∇L_2(w_0))/∂α ≈ −∇L_2(w_0)^T · ∇L_2(w_0) + α ∇L_2(w_0)^T · H_2(w_0) · ∇L_2(w_0) = 0

    ⟹ α ‖∇L_2(w_0)‖⁴ = ‖∇L_2(w_0)‖²  ⟹  α = 1/‖∇L_2(w_0)‖²                                (8)

where the last line is derived using the outer-product approximation for the Hessian matrix. Putting Equations 8 and 7 together, ∇L_2(w_1) can be approximated by:

    ∇L_2(w_1) ≈ ∇L_2(w_0) − (∇L_2(w_0) · ∇L_2(w_0)^T / ‖∇L_2(w_0)‖²) · ∇L_1(w_0)           (9)

Therefore, to approximate the sequential SGD semantics in a parallel setting, Adasum uses:

    w_2 = w_0 − α (∇L_1(w_0) + ∇L_2(w_1))
        ≈ w_0 − α (∇L_1(w_0) + ∇L_2(w_0) − (∇L_2(w_0) · ∇L_2(w_0)^T / ‖∇L_2(w_0)‖²) · ∇L_1(w_0))
        = w_0 − α (g_1 + g_2 − (g_2 · g_2^T / ‖g_2‖²) · g_1)                                (10)

where in the last equality ∇L_1(w_0) was replaced by g_1 and ∇L_2(w_0) by g_2 for simplicity (note that the gradients in the last equality are all at w_0). Equation 5 is derived from Equation 10 by rearranging the terms; in particular, averaging Equation 10's combined gradient over the two possible orderings of the examples yields the symmetric form of Equation 5.
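To make the operator concrete, here is a minimal numpy sketch of the two-gradient Adasum combiner in the symmetric form of Equation 5 (the function name and the epsilon guard are ours, not Horovod's API), together with a sanity check that it equals the average of Equation 10's combined gradient over the two orderings:

    import numpy as np

    def adasum_pair(g1, g2, eps=1e-12):
        """Adasum for two gradients (Equation 5):
        (1 - g1.g2/(2||g1||^2)) g1 + (1 - g1.g2/(2||g2||^2)) g2"""
        dot = g1 @ g2
        return ((1 - dot / (2 * (g1 @ g1) + eps)) * g1
                + (1 - dot / (2 * (g2 @ g2) + eps)) * g2)

    # Sanity check: Equation 5 is the average of Equation 10's combined gradient
    # over the two possible orderings of the sequential emulation.
    rng = np.random.default_rng(0)
    g1, g2 = rng.normal(size=8), rng.normal(size=8)
    emu_12 = g1 + g2 - (g2 @ g1) / (g2 @ g2) * g2   # emulate L1 then L2 (Equation 10)
    emu_21 = g2 + g1 - (g1 @ g2) / (g1 @ g1) * g1   # emulate L2 then L1
    assert np.allclose(adasum_pair(g1, g2), (emu_12 + emu_21) / 2)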
A.3 Convergence Proof for Adasum

[26] discusses the requirements for a training algorithm to converge to its optimal answer. Here we present a simplified version of Theorem 1 and Corollary 1 from [26]. Suppose that there are N training examples for a model with loss functions L_1(w), ..., L_N(w), where w is the model parameter and w_0 is the initial model. Define L(w) = (1/N) ∑_i L_i(w). Also assume that w* is the optimal model, i.e., L(w*) ≤ L(w) for all w. A training algorithm is pseudogradient if:
• It is an iterative algorithm where w_{i+1} = w_i − α_i h_i, with h_i a random vector and α_i a scalar.
• ∀ε > 0 ∃δ > 0: E(h_i)^T · ∇L(w_i) ≥ δ > 0 whenever L(w_i) ≥ L(w*) + ε.
• E(‖h_i‖²) < C, where C is a constant.
• ∀i: α_i ≥ 0, ∑_i α_i = ∞, and ∑_i α_i² < ∞.
The following theorem is taken from [26].

Theorem A.1. A pseudogradient training algorithm converges to the optimal model w*.

In this section, we assume that the true gradient ∇L is bounded at any point. As a reminder,

    Adasum(g_1, g_2) = (1 − g_1^T·g_2 / (2‖g_1‖²)) · g_1 + (1 − g_1^T·g_2 / (2‖g_2‖²)) · g_2.

As discussed in Section 3.4, the Adasum operator reduces N gradients in a binary-tree manner. We will prove that the final combined gradient meets all the requirements of a pseudogradient. First, we discuss the inner product of the final Adasum vector with ∇L(w):
Lemma A.2. Suppose X = {x_1, ..., x_N} is a random variable distribution. For a and b independently chosen from X, define Y = Adasum(a, b). Let θ be the angle between E(X) and E(Y). Then cos θ ≥ 2√2/3 ≈ 0.94.

Proof.

    E(Y) = E(Adasum(a, b))
         = E((I − a·a^T/(2‖a‖²)) · b + (I − b·b^T/(2‖b‖²)) · a)
         = E(a) + E(b) − E(a·a^T/(2‖a‖²)) · E(b) − E(b·b^T/(2‖b‖²)) · E(a)
         = 2 E(X) − E(a·a^T/‖a‖²) · E(X)                                                   (11)

where the last equation comes from the independence of a and b. Next we calculate η, the angle between E(X) and 2E(X) − (a·a^T/‖a‖²)·E(X) for some arbitrary a. Denote E(X) by r and let γ be the angle between r and a. By the properties of the inner product:

    cos η = r^T · (2r − (a·a^T/‖a‖²)·r) / (‖r‖ · ‖2r − (a·a^T/‖a‖²)·r‖)
          = (2‖r‖² − ‖r‖² cos²γ) / (‖r‖ · √(4‖r‖² − 3‖r‖² cos²γ))
          = (2 − cos²γ) / √(4 − 3 cos²γ)                                                   (12)

By taking the derivative with respect to γ, we find the minimum value of cos η to be 2√2/3 ≈ 0.94, attained at cos²γ = 2/3; hence η is at most ≈ 0.11π. Since in Formula 11 E(Y) is calculated over an average of all possible a vectors, we can still guarantee that E(Y) and E(X) have an angle of at most ≈ 0.11π, since this value was derived for the worst-case a.
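The bound in Equation 12 is easy to verify numerically. The following short sketch (our own, for illustration) evaluates (2 − t)/√(4 − 3t) over t = cos²γ ∈ [0, 1] and recovers both the minimum and the resulting angle bound:

    import numpy as np

    # cos(eta) as a function of t = cos^2(gamma), per Equation 12.
    t = np.linspace(0.0, 1.0, 100001)
    cos_eta = (2 - t) / np.sqrt(4 - 3 * t)

    print(cos_eta.min())                       # ~0.9428 = 2*sqrt(2)/3
    print(t[cos_eta.argmin()])                 # ~0.6667, i.e. cos^2(gamma) = 2/3
    print(np.arccos(cos_eta.min()) / np.pi)    # ~0.108, i.e. eta <= ~0.11*pi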
Lemma A.3. With the same assumptions as in Lemma A.2, ‖E(X)‖ ≤ ‖E(Y)‖ ≤ 2‖E(X)‖.

Proof. As discussed in Lemma A.2, E(Y) = 2E(X) − E(a·a^T/‖a‖²)·E(X) = (2I − E(a·a^T/‖a‖²))·E(X). It is straightforward to check that the matrix 2I − E(a·a^T/‖a‖²) is symmetric with eigenvalues between 1 and 2. Therefore, λ_min ‖E(X)‖ ≤ ‖E(Y)‖ ≤ λ_max ‖E(X)‖, where λ_min and λ_max are respectively the smallest and largest eigenvalues of this matrix.
Assumption: Lemma A.2 showed that in the worst case E(Y) can rotate by at most ≈ 0.11π with respect to E(X). Even meeting this worst case requires carefully crafted x_i s. If Adasum is applied recursively on X (Y = Adasum(X, X), Z = Adasum(Y, Y), ...) k times, the expected value of the final distribution will have an angle of at most ≈ 0.11kπ from E(X), which is only possible if each Adasum application meets the worst-case scenario and each worst case stacks on top of the previous one. As one can imagine, this is an extremely unlikely scenario. For the case where the x_i s are gradients, we therefore assume that recursive Adasum always keeps the angle with E(X) at most σ, where cos σ > 0. Using this assumption and Lemma A.3, we can prove that the Adasum algorithm is a pseudogradient training algorithm.
Theorem A.4.
Adasum, applied in an iterative manner using a proper learning rate on a set of N gradients G = {g_1, ..., g_N} computed in parallel, is a pseudogradient training algorithm.

Proof. Given that Adasum follows the iterative update of SGD, the first requirement of a pseudogradient training algorithm is met. Also, since we use the learning rate schedule of the converging sequential SGD, the requirement on the learning rates is trivially met. Section 3.4 discussed how Adasum reduces all gradients in a binary-tree manner with log N steps. The distribution at the leaf level of this binary tree is G_0 = G, and level i+1's distribution is G_{i+1} = Adasum(G_i, G_i). At the top of the tree, after log N steps, we have G_{log N}. Using Lemma A.3, ‖E(G_0)‖ ≤ ‖E(G_{log N})‖ ≤ 2^{log N} ‖E(G_0)‖ = N ‖E(G_0)‖. Since E(G_0) is the true gradient ∇L(w_i), ‖E(G_0)‖ is bounded by assumption and therefore so is ‖E(G_{log N})‖. This meets the requirement on the norm of the h_i s. Finally, Lemma A.3 gives ‖E(G_0)‖ ≤ ‖E(G_{log N})‖, and by the assumption above, E(G_{log N})^T · E(G_0) ≥ ‖E(G_{log N})‖ ‖E(G_0)‖ cos σ ≥ ‖E(G_0)‖² cos σ. For any w_i which is not w*, ‖E(G_0)‖² cos σ > 0. Therefore, the positive inner-product requirement is also met, which concludes that Adasum is a pseudogradient training algorithm and it converges.
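The binary-tree reduction used in the proof can be sketched in a few lines of Python. This is an illustrative serial version reusing the adasum_pair helper defined above (in Horovod the pairing is distributed across nodes, not computed on one machine); the function name and power-of-two restriction are our simplifications:

    import numpy as np

    def adasum_tree(grads):
        """Reduce a list of gradients pairwise in a binary tree:
        level i+1 combines adjacent pairs of level i with adasum_pair."""
        assert len(grads) > 0 and (len(grads) & (len(grads) - 1)) == 0, \
            "this sketch assumes a power-of-two number of gradients"
        while len(grads) > 1:
            grads = [adasum_pair(grads[2 * j], grads[2 * j + 1])
                     for j in range(len(grads) // 2)]
        return grads[0]

    # Example: reduce N = 4 random gradients in log2(4) = 2 levels.
    rng = np.random.default_rng(1)
    gs = [rng.normal(size=5) for _ in range(4)]
    combined = adasum_tree(gs)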
A.4 Adasum Convergence Rate
The convergence rate of Adasum depends strongly on the orthogonality of the gradients. In the worst case, when all gradients are parallel, the algorithm converges at 1/N the rate of sequential SGD, where N is the number of processors; in the best case, when all gradients are orthogonal, we expect Adasum to converge as fast as sequential SGD.
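The two regimes are visible directly in the operator: for identical (parallel) gradients Adasum degenerates to an average, so N workers make no more progress than one, while for orthogonal gradients it degenerates to a plain sum, emulating N sequential steps. A quick check with the adasum_pair sketch above:

    import numpy as np

    g = np.array([3.0, 4.0])
    print(adasum_pair(g, g))     # == g: parallel gradients average out
    h = np.array([-4.0, 3.0])    # orthogonal to g
    print(adasum_pair(g, h))     # == g + h: orthogonal gradients sum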
A.5 Convergence of Adasum with Learning Rate Optimizers

The kernel of the computation in optimizers such as Adam, LAMB, or LARS is computing the gradients. These optimizers differ in their per-parameter learning rate mechanisms; generally, one can think of each of them as having a dynamic learning rate per layer. At each iteration i and for each layer l, define Δ_i^l = w_{i+1}^l − w_i^l to be the update term for layer l. If g_i^l is the corresponding gradient at iteration i, then Δ_i^l ≈ C_i^l · g_i^l, where C_i^l is a scalar. This approximation holds in expectation for all optimizers that use an unbiased gradient. Adasum reduces the Δ_i^l s across all GPUs for each layer, and since Δ_i^l is approximately a constant multiple of the gradient, applying Adasum to the Δ_i^l s is the same as applying it to the g_i^l s.
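A minimal sketch of this Δ-based reduction under the stated approximation (names and structure are ours; Horovod's actual optimizer integration is different): each worker takes one local optimizer step, extracts Δ = w_new − w_old per layer, and the Δs are Adasum-reduced before being applied to the shared model. Note that the coefficients in adasum_pair are invariant under a common scaling of both inputs, so scaling a gradient by C_i^l scales the combined result by the same factor.

    import numpy as np

    def local_update_delta(w, g, optimizer_step):
        """Run one local optimizer step and return the update term delta."""
        return optimizer_step(w, g) - w       # delta already includes sign and learning rate

    def apply_adasum_deltas(w, deltas):
        """Adasum-reduce the workers' deltas and apply them to the shared model."""
        return w + adasum_tree(deltas)        # adasum_tree from the sketch in Appendix A.3

    # Toy example with plain SGD as the (hypothetical) local optimizer.
    sgd_step = lambda w, g: w - 0.01 * g
    rng = np.random.default_rng(2)
    w = rng.normal(size=5)
    deltas = [local_update_delta(w, rng.normal(size=5), sgd_step) for _ in range(4)]
    w = apply_adasum_deltas(w, deltas)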