Revisiting Distributed Synchronous SGD
Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz
Jianmin Chen*, Xinghao Pan*†, Rajat Monga, Samy Bengio
Google Brain, Mountain View, CA, USA
{jmchen,xinghao,rajatmonga,bengio}@google.com

Rafal Jozefowicz
OpenAI, San Francisco, CA, USA
[email protected]

ABSTRACT
Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise introduced from asynchrony. In contrast, the synchronous approach is often thought to be impractical due to idle time wasted on waiting for straggling workers. We revisit these conventional beliefs in this paper, and examine the weaknesses of both approaches. We demonstrate that a third approach, synchronous optimization with backup workers, can avoid asynchronous noise while mitigating the worst stragglers. Our approach is empirically validated and shown to converge faster and to better test accuracies.
1 INTRODUCTION
The recent success of deep learning approaches for domains like speech recognition (Hinton et al., 2012) and computer vision (Ioffe & Szegedy, 2015) stems from many algorithmic improvements but also from the fact that the size of available training data has grown significantly over the years, together with the computing power, in terms of both CPUs and GPUs. While a single GPU often provides algorithmic simplicity and speed up to a given scale of data and model, there exists an operating point where a distributed implementation of training algorithms for deep architectures becomes necessary.

Currently, popular distributed training algorithms include mini-batch versions of stochastic gradient descent (SGD) and other stochastic optimization algorithms such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), and ADAM (Kingma & Ba, 2014). Unfortunately, bulk-synchronous implementations of stochastic optimization are often slow in practice due to the need to wait for the slowest machine in each synchronous batch. To circumvent this problem, practitioners have resorted to asynchronous approaches which emphasize speed by using potentially stale information for computation. While asynchronous training has proven to be faster than its synchronous counterpart, it often results in convergence to poorer results.

In this paper, we revisit synchronous learning, and propose a method for mitigating stragglers in synchronous stochastic optimization. Specifically, we synchronously compute a mini-batch gradient with only a subset of worker machines, thus alleviating the straggler effect while avoiding any staleness in our gradients. The primary contributions of our paper are:

• Illustration of how gradient staleness in asynchronous training negatively impacts test accuracy and is exacerbated by deep models.
• Measurement of machine response times for synchronous stochastic optimization in a large deployment of 100 GPUs, showing how stragglers in the tail end affect convergence speed.
• Proposal of synchronous stochastic optimization with backup workers to mitigate straggler effects without gradient staleness.
• Establishing the need to measure both speed of convergence and test accuracy of the optimum for empirical validation.
• Empirical demonstration that our proposed synchronous training method outperforms asynchronous training by converging faster and to better test accuracies.

* Joint first authors. † UC Berkeley, Berkeley, CA, USA, [email protected]. This is an extension of our ICLR 2016 workshop extended abstract (Chen et al., 2016).

The remainder of this paper is organized as follows. We briefly present preliminaries and notation in Section 1.1. Section 2 describes asynchronous stochastic optimization and presents experimental evidence of gradient staleness in deep neural network models. We present our approach in Section 3, and exhibit straggler effects that motivate the approach. We then empirically evaluate our approach in Section 4. Related work is discussed in Section 5, and we conclude in Section 6.

1.1 PRELIMINARIES AND NOTATION
Given a dataset $\mathcal{X} = \{x_i : i = 1, \dots, |\mathcal{X}|\}$, our goal is to learn the parameters $\theta$ of a model with respect to an empirical loss function $f$, defined as $f(\theta) \triangleq \frac{1}{|\mathcal{X}|} \sum_{i=1}^{|\mathcal{X}|} F(x_i; \theta)$, where $F(x_i; \theta)$ is the loss with respect to a datapoint $x_i$ and the model $\theta$.

A first-order stochastic optimization algorithm achieves this by iteratively updating $\theta$ using a stochastic gradient $G \triangleq \nabla F(x_i; \theta)$ computed at a randomly sampled $x_i$, producing a sequence of models $\theta^{(0)}, \theta^{(1)}, \dots$. Stochastic optimization algorithms differ in their update equations. For example, the update of SGD is $\theta^{(t+1)} = \theta^{(t)} - \gamma_t G^{(t)} = \theta^{(t)} - \gamma_t \nabla F(x_i; \theta^{(t)})$, where $\gamma_t$ is the learning rate or step size at iteration $t$. A mini-batch version of the stochastic optimization algorithm computes the stochastic gradient over a mini-batch of size $B$ instead of a single datapoint, i.e., $G \triangleq \frac{1}{B} \sum_{i=1}^{B} \nabla F(\tilde{x}_i; \theta^{(t)})$, where the $\tilde{x}_i$'s are randomly sampled from $\mathcal{X}$. We will often evaluate performance on an exponential moving average $\bar{\theta}^{(t)} = \alpha \bar{\theta}^{(t-1)} + (1-\alpha)\theta^{(t)}$ with decay rate $\alpha$.

Our interest is in distributed stochastic optimization using $N$ worker machines in charge of computing stochastic gradients that are sent to $M$ parameter servers. Each parameter server $j$ is responsible for storing a subset $\theta[j]$ of the model, and for performing updates on $\theta[j]$. In the synchronous setting, we will also introduce $b$ additional backup workers for straggler mitigation.
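To make the notation concrete, the following is a minimal NumPy sketch of the serial mini-batch SGD update and the exponential moving average described above; the per-example gradient function grad_F and the dataset X are hypothetical placeholders, not part of the paper's code.

```python
import numpy as np

def minibatch_sgd_step(theta, X, grad_F, batch_size, lr):
    """One mini-batch SGD step: average gradients over B sampled datapoints."""
    idx = np.random.randint(0, len(X), size=batch_size)       # sample B datapoints
    G = np.mean([grad_F(X[i], theta) for i in idx], axis=0)   # G = (1/B) sum grad F
    return theta - lr * G                                     # theta^(t+1) = theta^(t) - gamma_t G

def update_ema(theta_ema, theta, alpha):
    """Exponential moving average of parameters, used for evaluation."""
    return alpha * theta_ema + (1.0 - alpha) * theta
```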
2 ASYNCHRONOUS STOCHASTIC OPTIMIZATION

An approach for a distributed stochastic gradient descent algorithm was presented in Dean et al. (2012), consisting of two main ingredients. First, the parameters of the model are distributed on multiple servers, depending on the architecture. This set of servers are called the parameter servers. Second, there can be multiple workers processing data in parallel and communicating with the parameter servers. Each worker processes a mini-batch of data independently of the others, as follows:

• The worker fetches from the parameter servers the most up-to-date parameters of the model needed to process the current mini-batch;
• It then computes gradients of the loss with respect to these parameters;
• Finally, these gradients are sent back to the parameter servers, which then update the model accordingly.

Since each worker communicates with the parameter servers independently of the others, this is called
Asynchronous Stochastic Gradient Descent (Async-SGD), or more generally,
Asynchronous Stochastic Optimization (Async-Opt). A similar approach was later proposed by Chilimbi et al. (2014). Async-Opt is presented in Algorithms 1 and 2.

In practice, the updates of Async-Opt differ from those of serially running the stochastic optimization algorithm, for two reasons. Firstly, the read operation (Algorithm 1, Line 2) on a worker may be interleaved with updates by other workers to different parameter servers, so the resultant $\hat{\theta}_k$ may not be consistent with any parameter incarnation $\theta^{(t)}$. Secondly, model updates may have occurred while a worker is computing its stochastic gradient; hence, the resultant gradients are typically computed with respect to outdated parameters. We refer to these as stale gradients, and define the staleness of a gradient as the number of updates that have occurred between its corresponding read and update operations.
Algorithm 1: Async-SGD worker k
Input: dataset X, mini-batch size B
while True do
    Read θ̂_k = (θ[0], ..., θ[M]) from parameter servers.
    G_k^(t) := 0
    for i = 1, ..., B do
        Sample datapoint x̃_i from X.
        G_k^(t) ← G_k^(t) + (1/B) ∇F(x̃_i; θ̂_k)
    end
    Send G_k^(t) to parameter servers.
end
Algorithm 2: Async-SGD parameter server j
Input: learning rates γ_0, γ_1, ...; decay rate α; model initialization θ^(0)
for t = 0, 1, ... do
    Wait for gradient G from any worker.
    θ^(t+1)[j] ← θ^(t)[j] − γ_t G[j]
    θ̄^(t)[j] = α θ̄^(t−1)[j] + (1 − α) θ^(t)[j]
end
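For illustration, here is a minimal single-process simulation of Algorithms 1 and 2 using Python threads, with a queue standing in for the network between workers and a single parameter-server shard; the per-example gradient and data are hypothetical placeholders, and a real deployment would shard θ across parameter servers and communicate over the network.

```python
import threading, queue
import numpy as np

dim, B, lr, alpha = 10, 4, 0.1, 0.9
theta = np.zeros(dim)             # parameter "server" state (single shard for simplicity)
theta_ema = np.zeros(dim)
grad_queue = queue.Queue()        # gradients sent from workers to the parameter server
data = np.random.randn(1000, dim)
grad_F = lambda x, th: 2 * (th - x)   # hypothetical per-example gradient

def worker(steps):
    for _ in range(steps):
        theta_read = theta.copy()                 # read; possibly stale by update time
        idx = np.random.randint(0, len(data), B)
        G = np.mean([grad_F(data[i], theta_read) for i in idx], axis=0)
        grad_queue.put(G)                         # send gradient to the parameter server

def parameter_server(total_updates):
    global theta, theta_ema
    for _ in range(total_updates):
        G = grad_queue.get()                      # apply gradients as they arrive (Async-Opt)
        theta = theta - lr * G
        theta_ema = alpha * theta_ema + (1 - alpha) * theta

workers = [threading.Thread(target=worker, args=(50,)) for _ in range(4)]
ps = threading.Thread(target=parameter_server, args=(4 * 50,))
for th in workers + [ps]: th.start()
for th in workers + [ps]: th.join()
```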
Figure 1: Gradient staleness dependence on model layer. Gradients are computed in a bottom-up forward propagation step followed by a top-down back propagation step. Parameters are read from servers in the forward prop, but gradients are sent to servers during the back prop. Thus, gradients of lower layers are more stale than those of top layers.
Figure 2: Degradation of test classification error with increasing average gradient staleness in the MNIST CNN model.
Understanding the theoretical impact of staleness is difficult and is the topic of many recent papers, e.g. Recht et al. (2011); Duchi et al. (2013); Leblond et al. (2016); Reddi et al. (2015); De Sa et al. (2015); Mania et al. (2015), most of which focus on individual algorithms, under strong assumptions that may not hold up in practice. This is further complicated by deep models with multiple layers, since the times at which model parameters are read and at which gradients are computed and sent depend on the depth of the layers (Figure 1). To better understand this dependence in real models, we collected staleness statistics from an Async-Opt run with 40 workers on an 18-layer Inception model (Szegedy et al., 2016) trained on the ImageNet Challenge dataset (Russakovsky et al., 2015), as shown in Table 1.
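As a concrete illustration of how statistics such as those in Table 1 can be produced, the sketch below computes per-layer staleness summaries from hypothetical logs of (read_step, update_step) pairs; staleness is the number of parameter updates between a gradient's read and its update, as defined above.

```python
import numpy as np

def staleness_stats(logs):
    """logs: dict mapping layer id -> list of (read_step, update_step) pairs
    recorded at the parameter servers. Returns per-layer staleness statistics."""
    stats = {}
    for layer, pairs in logs.items():
        staleness = np.array([upd - read for read, upd in pairs])
        stats[layer] = {
            "min": staleness.min(), "mean": staleness.mean(),
            "median": np.median(staleness), "max": staleness.max(),
            "std": staleness.std(), "count": len(staleness),
        }
    return stats
```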
Layer | Min | Mean  | Median | Max | Std Dev | Count
18    | 4   | 14.54 | 13.94  | 29  | 3.83    | 10908
12    | 5   | 11.35 | 11.3   | 23  | 3.09    | 44478
11    | 8   | 19.8  | 19.59  | 34  | 3.65    | 187
0     | 24  | 38.97 | 38.43  | 61  | 5.43    | 178
Table 1:
Staleness of gradients in an 18-layer Inception model. Gradients were collected in a run of asynchronous training using 40 machines. Staleness of a gradient is measured as the number of updates that have occurred between its corresponding read and update operations. The mean staleness of gradients increases from ∼14.5 steps in the top layer (Layer 18) to ∼39.0 steps in the bottom layer (Layer 0).

Despite the abovementioned problems, Async-Opt has been shown to scale well up to a few dozen workers for some models. However, at larger scales, increasing the number of machines (and thus the staleness of gradients) can result in poorer trained models.

2.1 IMPACT OF STALENESS ON TEST ACCURACY
We explore how increased staleness contributes to the training of poorer models. In order to mimic the setting on a smaller scale, we trained a state-of-the-art MNIST CNN model but simulated staleness by using old gradients for the parameter updates (a minimal sketch of this delayed-gradient setup is given after the list below). Details of the model and training are provided in Appendix A.1.

The best final classification error on a test set was 0.36%, which increases to 0.47% with an average gradient staleness of 20 steps, and up to 0.79% with 50 steps (see Figure 2).

Once the average simulated staleness was chosen to be more than 15 steps, the results started to significantly deteriorate and the training itself became much less stable. We had to employ the following tricks to prevent the results from blowing up:

• Slowly increase the staleness over the first 3 epochs of training. This mimics increasing the number of asynchronous workers and is also very important in practice for some of the models we experimented with (e.g. large word-level language models). The trick was not relevant with a simulated staleness of less than 15 but became crucial for larger values.
• Use lower initial learning rates when staleness is at least 20, which reduces the frequency of explosions (train error goes to 90%). This observation is similar to what we found in other experiments: we were able to use much larger learning rates with synchronous training and the results were also more stable.
• Even with the above tricks, divergence occurs occasionally, and we found that restarting training from random weights can lead to more successful runs. The best results were then chosen based on validation set performance.
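A minimal sketch of how staleness can be simulated on a single machine, assuming hypothetical compute_gradient and apply_update callables: gradients are pushed into a FIFO queue, and the update at a given step uses the gradient computed `staleness` steps earlier.

```python
from collections import deque

def train_with_simulated_staleness(theta, compute_gradient, apply_update,
                                   num_steps, staleness):
    """Apply the gradient computed `staleness` steps ago, mimicking async workers."""
    pending = deque()                       # FIFO of delayed gradients
    for step in range(num_steps):
        pending.append(compute_gradient(theta))
        if len(pending) > staleness:        # only update once the queue is "full"
            stale_grad = pending.popleft()  # gradient from `staleness` steps earlier
            theta = apply_update(theta, stale_grad, step)
    return theta
```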
3 REVISITING SYNCHRONOUS STOCHASTIC OPTIMIZATION
Both Dean et al. (2012) and Chilimbi et al. (2014) use versions of Async-SGD where the main potential problem is that each worker computes gradients over a potentially old version of the model. In order to remove this discrepancy, we propose here to reconsider a synchronous version of distributed stochastic gradient descent (Sync-SGD), or more generally, Synchronous Stochastic Optimization (Sync-Opt), where the parameter servers wait for all workers to send their gradients, aggregate them, and send the updated parameters to all workers afterward. This ensures that the actual algorithm is a true mini-batch stochastic gradient descent, with an effective batch size equal to the sum of all the mini-batch sizes of the workers.

While this approach solves the staleness problem, it also introduces the potential problem that the actual update time now depends on the slowest worker. Although workers have equivalent computation and network communication workloads, slow stragglers may result from failing hardware, from contention on shared underlying hardware resources in data centers, or even from preemption by other jobs.

To alleviate the straggler problem, we introduce backup workers (Dean & Barroso, 2013) as follows: instead of having only N workers, we add b extra workers, but as soon as the parameter servers receive gradients from any N workers, they stop waiting and update their parameters using those N gradients. The slowest b workers' gradients are dropped when they arrive. Our method is presented in Algorithms 3 and 4.
Algorithm 3: Sync-SGD worker k, where k = 1, ..., N + b
Input: dataset X, mini-batch size B
for t = 0, 1, ... do
    Wait to read θ^(t) = (θ^(t)[0], ..., θ^(t)[M]) from parameter servers.
    G_k^(t) := 0
    for i = 1, ..., B do
        Sample datapoint x̃_{k,i} from X.
        G_k^(t) ← G_k^(t) + (1/B) ∇F(x̃_{k,i}; θ^(t))
    end
    Send (G_k^(t), t) to parameter servers.
end
Algorithm 4: Sync-SGD parameter server j
Input: learning rates γ_0, γ_1, ...; decay rate α; number of mini-batches to aggregate N; model initialization θ^(0)
for t = 0, 1, ... do
    𝒢 = {}
    while |𝒢| < N do
        Wait for (G, t′) from any worker.
        if t′ == t then 𝒢 ← 𝒢 ∪ {G} else drop gradient G.
    end
    θ^(t+1)[j] ← θ^(t)[j] − (γ_t / N) Σ_{G∈𝒢} G[j]
    θ̄^(t)[j] = α θ̄^(t−1)[j] + (1 − α) θ^(t)[j]
end
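Below is a minimal sketch of the parameter-server side of Algorithm 4, assuming a hypothetical recv_gradient() that returns (gradient, worker_step) pairs as they arrive: only the first N gradients tagged with the current step t are aggregated, and gradients from the b slowest (or stale) workers are dropped.

```python
import numpy as np

def sync_update_with_backups(theta, t, N, lr, recv_gradient):
    """Aggregate the first N gradients for step t; late or stale gradients are dropped."""
    collected = []
    while len(collected) < N:
        grad, worker_step = recv_gradient()   # blocks until some worker sends a gradient
        if worker_step == t:
            collected.append(grad)            # gradient computed against theta^(t)
        # else: stale gradient from a slow (backup) worker -- drop it
    return theta - lr * np.mean(collected, axis=0)
```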
3.1 STRAGGLER EFFECTS

The use of backup workers is motivated by the need to mitigate slow stragglers while maximizing computation. We investigate the effect of stragglers on Sync-Opt model training here.

We ran Sync-Opt with N = 100 workers, b = 0 backups, and 19 parameter servers on the Inception model. Using one variable as a proxy, we collected for each iteration both the start time of the iteration and the time when the k-th gradient of that variable arrived at the parameter server. These times are presented in Figure 3 for several values of k. Note that 80% of the 98th gradients arrive in under 2s, whereas only 30% of the final gradients do. Furthermore, the time to collect the final few gradients grows exponentially, resulting in wasted idle resources and time expended waiting for the slowest gradients. This exponential increase is also seen in Figure 4.

Figure 3: CDF of time taken to aggregate gradients from N machines. For clarity, only the shorter times are shown; the maximum observed time is 310s.
Figure 4: Mean and median times, across all iterations, to collect k gradients on N = 100 workers and b = 0 backups. Most mean times fall between 1.4s and 1.8s, except for the final few gradients.

Thus, one might choose to drop slow stragglers to decrease the iteration time. However, using fewer machines implies a smaller effective mini-batch size and thus greater gradient variance, which in turn could require more iterations for convergence. We examine this relationship by running Sync-Opt with several values of N between 50 and 100 and b = 6, and note the number of iterations required for convergence in Figure 5. Additional details of this training are provided in Appendix A.2. As N is doubled from 50 to 100, the number of iterations to converge nearly halves.

Figure 5: Number of iterations to converge when aggregating gradients from N machines.
Figure 6: Estimated time to converge when aggregating gradients from N machines on an N + b = 100 machine configuration. Convergence is fastest when choosing N = 96, b = 4.

Since we are interested in gradient quality and convergence behavior but not running time in this experiment, the backups serve only to reduce our data collection time and do not affect our analysis.

Consider a budget of N + b = 100 machines, where we wish to choose the configuration of N and b that minimizes the running time to convergence. For each configuration, we can estimate the iterations required from Figure 5 (linearly interpolating for values of N for which we did not collect data). We can multiply this by the mean iteration times (Figure 4) to obtain the running time required to converge for each setting of N and b. These results are shown in Figure 6, indicating that N = 96, b = 4 converges fastest. This motivates our choice to use a few backup workers for mitigating stragglers.
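A sketch of this configuration search under the stated assumptions: given measured mean iteration times and (interpolated) iteration counts for each N, total time is their product, and the best split of the machine budget is the argmin. The measurement arrays and the mean_iter_time function are hypothetical placeholders.

```python
import numpy as np

def best_backup_config(total_machines, measured_N, measured_iters, mean_iter_time):
    """Pick N (and b = total - N) minimizing iterations(N) * iteration_time(N).

    measured_N:     increasing values of N for which convergence was measured
    measured_iters: iterations to converge at each measured N
    mean_iter_time: function N -> mean seconds per iteration when waiting for N gradients
    """
    candidates = np.arange(min(measured_N), total_machines + 1)
    iters = np.interp(candidates, measured_N, measured_iters)               # interpolate Figure 5
    total_time = iters * np.array([mean_iter_time(n) for n in candidates])  # Figure 4 x Figure 5
    best = candidates[np.argmin(total_time)]
    return int(best), int(total_machines - best)                            # (N, b)
```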
4 EXPERIMENTS

In this section, we present our empirical comparisons of synchronous and asynchronous distributed stochastic optimization algorithms as applied to models such as Inception and PixelCNN. All experiments in this paper use the TensorFlow system (Abadi et al., 2015).

4.1 METRICS OF COMPARISON: FASTER CONVERGENCE, BETTER OPTIMUM
We are interested in two metrics of comparison for our empirical validation: (1) test error or accuracy, and (2) speed of convergence. We point out that for non-convex deep learning models, it is possible to converge faster to a poorer local optimum. Here we show a simple example with Inception using different learning rates.
Initial rate γ_0 | Test precision at convergence | Epochs to converge
1.125 | 77.29% | 52628
2.25  | 77.75% | 65811
4.5   | 78.15% | 76209
9.0   | 78.17% | 77235
Table 2: Test accuracies at convergence and number of epochs to converge for different initial learning rates γ_0. Low initial learning rates result in faster convergence to a poorer local optimum.
Figure 7: Convergence of Sync-Opt on the Inception model using N = 100 workers and b = 6 backups, with varying initial learning rates γ_0. Panel (a): convergence; panel (b): epochs to reach ε test precision @ 1. To reach a lower ε test precision, small γ_0's require fewer epochs than large γ_0's. However, small γ_0's either fail to attain high ε precision, or take more epochs than higher γ_0's.

We ran Sync-Opt on Inception with N = 100 and b = 6, but varied the initial learning rate γ_0 between 1.125 and 9.0. (Learning rates are exponentially decreased with iterations.) Table 2 shows that smaller γ_0's converge faster, but to poorer test precisions. Focusing on speed in an early phase of training could lead to misleading conclusions if we fail to account for eventual convergence. For example, Figure 7b shows that the smallest learning rate γ_0 = 1.125 reaches ε = 75% precision faster than γ_0 = 4.5, but is slower to reach higher precisions and fails to reach the highest precisions.

4.2 INCEPTION
We conducted experiments on the Inception model (Szegedy et al., 2016) trained on the ImageNet Challenge dataset (Russakovsky et al., 2015), where the task is to classify images into 1000 categories. We used several configurations, varying N + b from 53 to 212 workers. Additional details of the training are provided in Appendix A.3. An epoch is a synchronous iteration for Sync-Opt, or a full pass of N updates for Async-Opt, which represent similar amounts of computation. Results of this experiment are presented in Figure 8.

Figure 8b shows that Sync-Opt outperforms Async-Opt in test precision, attaining higher test precision @ 1 across the range of N + b workers. Furthermore, Sync-Opt converges 6h and 18h faster than Async-Opt for 106 and 212 workers respectively, and is 3h slower when 53 workers are used, as seen in Figure 8d. This difference in speed is largely due to the fewer epochs needed by Sync-Opt (Figure 8c), while epoch times are comparable or better (Figure 8e).

Figure 8: Convergence of Sync-Opt and Async-Opt on the Inception model using varying numbers of machines; panels show (a) convergence, (b) test precision @ 1, (c) epochs to converge, (d) time to converge, and (e) mean epoch time. Sync-Opt with backup workers converges faster, with fewer epochs, to higher test accuracies.

4.3 PIXELCNN EXPERIMENTS
The second model we experimented on is PixelCNN (Oord et al., 2016), a conditional image generation deep neural network, which we train on the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset. Several configurations with varying numbers of workers were used; for Sync-Opt, we always used b = 1 backup worker. Additional details are provided in Appendix A.4.
Figure 9: Convergence of synchronous and asynchronous training on the PixelCNN model. For clarity, we show the best NLL reached up to that point in time. Sync-Opt achieves a lower negative log likelihood in less time than Async-Opt.

Figure 9a shows that Async-Opt performs best with N = 1 worker, with performance degrading as N increases from 8 to 16. Figure 9b further shows the time taken to reach a given ε test NLL. Sync-Opt reduces the time to reach a fixed test NLL from 40h to a small fraction of that, and reaches NLL values that Async-Opt does not achieve at all.

5 RELATED WORK
Multicore and distributed optimization algorithms have received much attention in recent years. Asynchronous algorithms include Recht et al. (2011); Duchi et al. (2013); Zhang et al. (2015a); Reddi et al. (2015); Leblond et al. (2016). Implementations of asynchronous optimization include Xing et al. (2015); Li et al. (2014); Chilimbi et al. (2014). Attempts have also been made in Zinkevich et al. (2010) and Zhang & Jordan (2015) to algorithmically improve synchronous SGD.

An alternative solution, "softsync", was presented in Zhang et al. (2015b), which proposed batching gradients from multiple machines before performing an asynchronous SGD update, thereby reducing the effective staleness of gradients. Similar to our proposal, softsync avoids stragglers by not forcing updates to wait for the slowest worker. However, softsync allows the use of stale gradients while we do not. The two solutions provide different explorations of the trade-off between high accuracy (by minimizing staleness) and fast throughput (by avoiding stragglers).

Watcharapichat et al. (2016) introduces a distributed deep learning system without parameter servers, by having workers interleave gradient computation and communication in a round-robin pattern. Like Async-Opt, this approach suffers from staleness. We also note that, in principle, workers in Sync-Opt can double as parameter servers and execute the update operations, avoiding the need to partition hardware resources between workers and servers.

Das et al. (2016) analyzes distributed stochastic optimization and optimizes the system by solving detailed system balance equations. We believe this approach is complementary to our work, and could potentially be applied to guide the choice of system configurations for Sync-Opt.

Keskar et al. (2016) suggests that large batch sizes for synchronous stochastic optimization lead to poorer generalization. Our effective batch size increases linearly with the number of workers N. However, we did not observe this effect in our experiments; we believe we are not yet in the large batch size regime examined by Keskar et al. (2016).

6 CONCLUSION AND FUTURE WORK
Distributed training strategies for deep learning architectures will become ever more important as the size of datasets increases. In this work, we have shown how both synchronous and asynchronous distributed stochastic optimization suffer from their respective weaknesses of stragglers and staleness. This has motivated our development of synchronous stochastic optimization with backup workers, which we show to be a viable and scalable strategy.

We are currently experimenting with different kinds of datasets, including word-level language models where parts of the model (the embedding layers) are often very sparse, which involves very different communication constraints. We are also working on further improving the performance of synchronous training, for example by combining gradients from multiple workers sharing the same machine before sending them to the parameter servers, to reduce the communication overhead. An alternative of using time-outs instead of backup workers is also being explored.

REFERENCES
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/.

Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.

T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, 2014.

Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, and Pradeep Dubey. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709, 2016.

Christopher M. De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of Hogwild-style algorithms. In Advances in Neural Information Processing Systems, pp. 2674–2682, 2015.

J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. A. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56:74–80, 2013. URL http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

John Duchi, Michael I. Jordan, and Brendan McMahan. Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems, pp. 2832–2840, 2013.

G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29:82–97, 2012.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: Asynchronous parallel SAGA. arXiv preprint arXiv:1606.04809, 2016.

Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, pp. 583–598, 2014.

Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.

Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328, 2016.

Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.

Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pp. 2647–2655, 2015.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2016.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.

Pijika Watcharapichat, Victoria Lopez Morales, Raul Castro Fernandez, and Peter Pietzuch. Ako: Decentralised deep learning with partial gradient exchange. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pp. 84–97. ACM, 2016.

Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.

Sixin Zhang, Anna E. Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pp. 685–693, 2015a.

Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. Staleness-aware async-SGD for distributed deep learning. arXiv preprint arXiv:1511.05950, 2015b.

Yuchen Zhang and Michael I. Jordan. Splash: User-friendly programming interface for parallelizing stochastic algorithms. arXiv preprint arXiv:1506.07552, 2015.

Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2595–2603, 2010.
A DETAILS OF MODELS AND TRAINING
A.1 MNIST CNN, SECTION 2.1

Test errors were evaluated on the exponential moving average θ̄ of the parameters, with decay rate α. The initial learning rate was set to 0.1 and linearly annealed to 0 in the last 10 epochs. We also used small image rotations and zooms as a data augmentation scheme.

A.2 INCEPTION, SECTION 3.1
Mini-batch size B = 32 was used. Initial learning rates γ_0 were scaled with the number of workers N, which we found to provide good test precisions for Inception. Learning rates were also exponentially decreased with decay rate β as $\gamma_0 \beta^{tN/(2T)}$, where $T = |\mathcal{X}|/B$ is the number of mini-batches in the dataset. Test precisions were evaluated on the exponential moving average θ̄.

A.3 INCEPTION, SECTION 4.2
For N + b = 53 workers, 17 parameter servers were used; for N + b = 106 workers, we used 27 parameter servers; and 37 parameter servers were used for N + b = 212.

In the asynchronous training mode, gradient clipping is also needed for stabilization, which requires each worker to collect the gradient across all layers of the deep model, compute the global norm ||G||, and then clip all gradients accordingly. However, synchronous training turns out to be very stable, so gradient clipping is no longer needed, which means that we can pipeline the update of parameters in different layers: the gradients of the top layers' parameters can be sent to the parameter servers while the gradients of the lower layers are still being computed.

The underlying optimizer is RMSProp with momentum, with decay of 0.9 and momentum of 0.9. Mini-batch size B = 32 was used. Initial learning rates γ_0 for Async-Opt were set to 0.045; for Sync-Opt, we found as a rule of thumb that an initial learning rate scaled with N worked well for this model. Learning rates were then exponentially decayed with decay rate β as $\gamma_0 \beta^{t/(2T)}$ for Async-Opt, where $T = |\mathcal{X}|/B$ is the number of mini-batches in the dataset. For Sync-Opt, learning rates were exponentially decreased at the rate $\gamma_0 \beta^{tN/(2T)}$, so that the learning rates after computing the same number of datapoints are comparable for Async-Opt and Sync-Opt. Test precisions were evaluated on the exponential moving average θ̄.
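A sketch of these two decay schedules under the stated assumptions (the decay rate β and the Sync-Opt scaling of the initial rate are placeholders, since the exact constants are garbled in this copy):

```python
def async_learning_rate(gamma0, beta, t, T):
    """Async-Opt: gamma_0 * beta^(t / (2T)), where T = |X| / B mini-batches per epoch."""
    return gamma0 * beta ** (t / (2.0 * T))

def sync_learning_rate(gamma0, beta, t, T, N):
    """Sync-Opt: gamma_0 * beta^(tN / (2T)); each synchronous step consumes N mini-batches,
    so the rate after the same number of datapoints matches the asynchronous schedule."""
    return gamma0 * beta ** (t * N / (2.0 * T))
```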
A.4 PIXELCNN, SECTION 4.3

Each worker used a k80 GPU, and 10 parameter servers were used. For Sync-Opt, we always used b = 1 backup worker. The underlying optimizer is RMSProp with momentum, using decay of 0.95 and momentum of 0.9. Initial learning rates γ_0 were set to a small value and slowly decreased to a lower value after 200,000 iterations. Mini-batch size B = 4 was used.