Oscars: Adaptive Semi-Synchronous Parallel Model for Distributed Deep Learning with Global View
Sheng Huang
Shanghai University of Electric Power
Abstract
Deep learning has become an indispensable part of daily life, powering applications such as face recognition and NLP, but training deep models has always been challenging. In recent years, the complexity of training data and models has grown explosively, so training has gradually shifted to distributed settings. The classical synchronous strategy guarantees accuracy, but its frequent communication leads to slow training; the asynchronous strategy trains quickly but cannot guarantee accuracy. Moreover, when training on heterogeneous clusters, neither works efficiently: on the one hand they cause serious waste of resources, and on the other hand frequent communication further slows training. This paper therefore proposes a semi-synchronous training strategy based on local-SGD that effectively improves the resource utilization of heterogeneous clusters and reduces communication overhead, accelerating training while preserving model accuracy.
Introduction
Copyright © 2021, SUEP

It has been widely acknowledged that machine learning has become fundamentally important in a wide range of research and engineering areas, including autonomous driving, face recognition, speech recognition (Deng et al. 2013), text understanding (Mikolov et al. 2013; Liang et al. 2017), image classification (Yan et al. 2019, 2016), etc. There is an imperative need to improve training performance, especially in the presence of larger volumes of data and increasingly complex models. Neural networks with hundreds of layers are now common; for example, the Bert (Devlin et al. 2018) language model proposed by Google contains 300 million parameters, and the ImageNet (Deng et al. 2009) data set contains 20,000 categories with a total of 15 million images.

Parameter servers (Li et al. 2013) are widely used in today's distributed training systems, such as MXNet (Chen et al. 2015) and TensorFlow (Abadi et al. 2016). The architecture is shown in Fig. 1. A parameter server architecture consists of a logical server group and many workers. Under this architecture, each worker holds different training data and an identical copy of the model. Each worker computes gradients locally and periodically pushes them to the parameter server; the parameter server aggregates the gradients from all workers and updates the model parameters. Finally, each worker pulls the latest parameters and continues training. In addition, there is a decentralized architecture distinct from the parameter server, called Ring-AllReduce. In this architecture all nodes form a logical ring and each node communicates only with its neighbor nodes, effectively avoiding the bandwidth congestion caused by centralization. However, due to the characteristics of this architecture, only synchronous algorithms can be used, so a straggler causes more serious problems here. There is also work that optimizes the ring, such as Horovod (Sergeev and Del Balso 2018; Gibiansky 2017; Jia et al. 2018; Mikami et al. 2018). In this work, we focus on the parameter server.

At present, the prevalent synchronization paradigms are Bulk Synchronous Parallel (BSP) (Gerbessiotis and Valiant 1994), Asynchronous Parallel (ASP), and Stale Synchronous Parallel (SSP) (Ho et al. 2013). BSP is a well-known general synchronization model in distributed computing. Due to its stability and reliability, it offers the same convergence stability as SGD on a single machine, so mainstream distributed training systems take it as the default parallel strategy. Barriers are a critical component of BSP: each node must stop after completing its own tasks and wait until all nodes have finished theirs. Although this ensures a high degree of consistency among the models on different nodes, it has serious drawbacks. In a heterogeneous or volatile cloud environment, the performance of each node differs. This means each node takes a different amount of time to process the same amount of data, and a large amount of time is spent waiting for the slowest node in each synchronization round. A typical example is shown in Figure 2. A natural idea is therefore to relax the synchronization requirements.
According to this naive idea, the ASP paradigm removes the strict barriers so that each node can work asynchronously. However, because the progress of the nodes diverges too much, the model oscillates and takes longer to converge. SSP (Ho et al. 2013) is a compromise between the two: as long as the gap between the fastest node and the slowest node does not exceed a stale threshold, each node can work asynchronously. But because SSP introduces an extra threshold, an unreasonable setting also hurts: a threshold that is too small leaves training seriously affected by stragglers, with the fastest nodes frequently stopping to wait for the slowest one, while a threshold that is too large can prevent convergence.

Figure 1: The architecture of Parameter Server and Ring-AllReduce.

Figure 2: Under the restriction of the barrier, the fastest node must wait for the slowest node after each iteration, which is very inefficient.

To make up for the shortcomings of the above parallel paradigms, we propose an adaptive load-balancing approach. Its aim is to relax the strict synchronization requirement of classical BSP and to improve resource utilization. The core idea is to let slower workers do less computation between synchronizations and faster workers do more. Under global control, the waiting time at each synchronization is minimized, and thus the overall training time is shortened. Our contributions are summarized as follows.

Motivation
Mini-batch stochastic gradient descent (SGD) is the state of the art in large-scale distributed training (see Figure 3). The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice because it often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck, recent works propose reducing the communication frequency. One algorithm of this type is local SGD (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010; Zhang et al. 2014; McMahan et al. 2017), which runs SGD independently in parallel on different workers and averages the sequences only once in a while.

Local SGD requires all workers to average their individual solutions every I iterations; no synchronization among workers is needed before averaging. However, the fastest worker still needs to wait until all the other workers finish I iterations of SGD, even if it finishes its own I iterations much earlier (see Figure 4 for a 4-worker example where one worker is significantly faster than the others). As a consequence, the computation capability of the faster workers is wasted. This issue arises quite often in heterogeneous networks where nodes are equipped with different hardware.

In this paper, we present asynchronous local SGD with load balancing (Figure 5), which does not require the local sequences to be synchronized. This not only reduces communication bottlenecks; by using load-balancing techniques the algorithm can also be tuned optimally to heterogeneous settings (slower workers do less computation between synchronizations, and faster workers do more).

Figure 3: Mini-batch SGD in a homogeneous environment. The green arrows represent computation.

Figure 4: Local SGD in a heterogeneous environment. The green arrows represent computation and the gray arrows represent the idle state.

Figure 5: Local SGD with load balancing in a heterogeneous environment. The green arrows represent computation and the gray arrows represent the idle state.
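To make the local SGD scheme concrete, the toy simulation below runs K workers on a simple quadratic objective: each worker performs I local SGD steps on its own data, then all local models are averaged, mimicking the periodic synchronization described above. This is a minimal hypothetical sketch (the function and variable names are illustrative, not from the paper's implementation).

```python
import random

def local_sgd(num_workers=4, local_steps=8, rounds=10, lr=0.1, seed=0):
    """Toy local SGD: worker k minimizes f_k(x) = 0.5 * (x - c_k)^2 on its
    own 'data' c_k; local models are averaged every `local_steps` updates."""
    rng = random.Random(seed)
    centers = [rng.uniform(-1, 1) for _ in range(num_workers)]  # per-worker data
    x = 0.0  # shared initial model
    for _ in range(rounds):
        local_models = []
        for c in centers:
            xk = x  # each worker starts from the last synchronized model
            for _ in range(local_steps):
                grad = xk - c          # gradient of 0.5 * (xk - c)^2
                xk -= lr * grad        # local SGD step, no communication
            local_models.append(xk)
        x = sum(local_models) / num_workers  # synchronization by averaging
    return x, sum(centers) / num_workers

model, optimum = local_sgd()
```

After a few rounds the averaged model approaches the minimizer of the global objective (the mean of the per-worker optima), even though workers communicate only once every `local_steps` iterations.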
Problem Formulation
Most data centers have high availability, so we assume that they run in a stable environment and that each worker's computational speed is in a steady state.

According to the case study, the time spent by a worker before the global barrier, where the barrier is notated as T, can be decomposed into three parts. The first part is the gradient computation time, notated as t_i^iter. The second part is the idle time spent waiting for other workers to synchronize parameters, notated as t_i^w. The third part is the time to synchronize the parameters; we assume that the bandwidth between workers in the data center is very high, so we ignore this part.

For N workers in a heterogeneous cluster, each worker has to process a certain number of iterations before the global barrier, where each iteration on the same worker takes a similar amount of time. Indexing the N workers by i, the local iteration times can be notated as t_1^iter, t_2^iter, ..., t_N^iter. Given a global barrier T, we obtain the t_i^w of each worker:

    t_i^w = mod(T, t_i^iter)    (1)

Before a synchronization, the maximum waiting time reflects the idle degree of the workers. If the maximum wait time is as small as possible, all workers can complete their last batch computation exactly at the global barrier, and computing resources are fully utilized. We define the maximum wait time as:

    max_i mod(T, t_i^iter)    (2)

We therefore look for the optimal T* that minimizes Eq. 2 over all possible T, and formulate the following optimization problem:

    T* = argmin_T max_i mod(T, t_i^iter)    (3)

    s.t. floor(T / min_i(t_i^iter)) - floor(T / max_i(t_i^iter)) < M

where floor(T / t_i^iter) represents the maximum number of iterations the i-th worker can complete before the barrier, notated as τ_i, and M limits the difference in local steps between workers to an appropriate range.

Table 1: Frequently used notations.

    t_i^iter    Time of a local iteration of worker i
    t_i^w       Wait time of worker i
    T           The time point of the global synchronization
    N           Number of workers

Approach
In this section we present load-balanced local SGD. It not only uses load-balancing techniques so that the algorithm can be tuned optimally to heterogeneous settings (slower workers do less computation between synchronizations, and faster workers do more), but also reduces the network overhead caused by frequent communication.
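The barrier-search idea formulated above can be sketched in a few lines. The snippet below is a hypothetical Python illustration (names are ours), assuming per-worker iteration times are measured as integers, e.g. in milliseconds: it scans candidate barriers T, keeps the one minimizing the worst-case idle time mod(T, t_i), and derives the per-worker local step counts τ_i.

```python
def balance(iter_times, M):
    """Find a synchronization period T* minimizing the worst-case idle
    time max_i (T mod t_i), while the gap in local step counts between
    the fastest and slowest worker stays below M."""
    t_min, t_max = min(iter_times), max(iter_times)
    best_T, best_wait = None, float("inf")
    T = t_max  # the barrier cannot be shorter than the slowest iteration
    # Scan candidate barriers in unit steps while the step-gap constraint
    # floor(T/t_min) - floor(T/t_max) < M still holds.
    while T // t_min - T // t_max < M:
        wait = max(T % t for t in iter_times)  # worst-case idle time at T
        if wait < best_wait:
            best_T, best_wait = T, wait
        T += 1
    steps = [best_T // t for t in iter_times]  # local iterations tau_i
    return best_T, best_wait, steps

# Example: three workers taking 3, 4, and 6 time units per iteration.
T_star, wait, taus = balance([3, 4, 6], M=4)
```

For iteration times (3, 4, 6) the search settles on T* = 12 (their least common multiple), where every worker hits the barrier with zero idle time and the local step counts are (4, 3, 2) — faster workers do more local work, exactly the load-balancing behavior described above.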
Local-SGD load balancing
To minimize the wait time and improve cluster utilization, we propose a fast and efficient algorithm based on the principle of the least common multiple. The algorithmic flow is given in Algorithm 1 and described as follows. First, let T = max_i(t_i^iter). Next, use a loop to compute the maximum value of mod(T, t_i^iter), then increment T. The loop is repeated until the constraint is no longer satisfied. Finally, take the T* for which the maximum value of mod(T, t_i^iter) is minimal. The total computation complexity is O(MN), which is linear, so the search brings no significant additional overhead to the training system.

After obtaining the optimal T*, we can calculate the number of iterations for each worker according to t_i^iter. The number of iterations of each worker is expressed as τ_1, τ_2, ..., τ_N, and the model update rule is:

    x_{t+1}^i = (1/N) Σ_{k=1}^{N} (x_t^k − η g(x_t^k))    if t mod τ_i = 0
    x_{t+1}^i = x_t^i − η g(x_t^i)                        otherwise    (4)

where x_t^i denotes the model parameters on the i-th worker.

Data partition load balancing
Evaluation Setup
Testbed
We conduct our experiments on a GPU server. The server has 2 NVIDIA RTX 2080 GPUs interconnected with 10 Gbps PCI-E and runs Ubuntu Server 18.06. We used the PyTorch framework to build our algorithm prototypes.
Dataset and DL Models
We used the CIFAR-10 dataset for image classification tasks. The dataset has 50,000 training images and 10,000 test images. We chose ResNet101 as our deep neural network baseline to evaluate our approach.

Algorithm 1: Load-Balance Algorithm

    Input: t_i^iter, M
    Output: T*
    Initialize: T = max_i(t_i^iter)
    while floor(T / min_i(t_i^iter)) − floor(T / max_i(t_i^iter)) < M do
        record max_i mod(T, t_i^iter); T = T + 1
    return the T* minimizing the recorded maximum

The performance metrics include scalability and rate of convergence. Scalability denotes the speedup in throughput (number of iterations finished per hour) compared with single-node DL.

Evaluations

Figure 6: Loss over time (minutes) for async_lb_slow_dynpart_rank__1, async_lb_fast_dynpart_rank__0, and bsp_1.

Figure 7: Rate of convergence.

Figure 6 plots the training curves for ResNet101. We set the training time to 1 hour and the batch size to 128. It can be clearly observed in the figure that the curve of BSP is smoother. This is due to the strong synchronization characteristics of BSP, which ensure the correctness of the gradients from different workers and avoid oscillation. Although the convergence process of BSP is very stable, its convergence speed is very slow. This is because in a heterogeneous environment the performance of each worker differs, so different workers take different amounts of time to process data of the same size, and some well-performing workers are idle most of the time. Our approach uses load balancing to improve the utilization of computing resources so that more data can be iterated in the same time, thereby accelerating convergence.

Scalability

Related Works

Asynchronous SGD

For large-scale machine learning optimization problems, parallel mini-batch SGD suffers from synchronization delay due to a few slow machines, slowing down the entire computation. To mitigate synchronization delay, asynchronous SGD methods are studied in (Recht et al. 2011; De Sa et al. 2015; Lian et al. 2015). These methods, though faster than synchronized methods, suffer from convergence error due to stale gradients. (Agarwal and Duchi 2011) shows that a limited amount of delay can be tolerated while preserving linear speedup for convex optimization problems. Furthermore, (Zhou et al.
2018) indicates that even polynomially growing delays can be tolerated by utilizing a quasilinear step-size sequence, though without achieving linear speedup.

Large batch SGD

Recent schemes for scaling training to a large number of workers rely on standard mini-batch SGD with very large overall batch sizes (You et al. 2018; Goyal et al. 2017), i.e., increasing the global batch size linearly with the number of workers K. (Yu and Jin 2019) has shown that, remarkably, with an exponentially growing mini-batch size it is possible to achieve linear speedup (i.e., error of O(1/KT)) with only log T iterations of the algorithm; implemented in a distributed setting, this corresponds to log T rounds of communication. The result of (Yu and Jin 2019) implies that SGD with exponentially increasing batch sizes has a convergence behavior similar to that of full-fledged (non-stochastic) gradient descent.

While the algorithm of (Yu and Jin 2019) provides a way of reducing communication in a distributed setting, for a large number of iterations it requires large mini-batches, which washes away the computational benefit of stochastic gradient descent over its deterministic counterpart. Furthermore, it has been found that increasing the mini-batch size often increases generalization error, which limits distributivity (Li et al. 2014). Our work is complementary to the approach of (Yu and Jin 2019), as we focus on approaches that use local updates with a fixed mini-batch size, which in our experiments is a hyperparameter tuned to the data set.

Local SGD

Motivated by better balancing the available system resources (computation vs. communication), local SGD (a.k.a. local-update SGD, parallel SGD, or federated averaging) has recently attracted increased research interest (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010; Zhang et al. 2014; McMahan et al. 2017).
In local SGD, each worker evolves a local model by performing H sequential SGD updates with mini-batch size B before communication (synchronization by averaging) among the workers.

A main research question is whether local-update SGD provides a linear speedup with respect to the number of workers K, similar to mini-batch SGD. Recent work partially confirms this, under the assumption that H is not too large compared to the total number of iterations T. (Stich 2018) shows convergence at O((KT)^{-1}) on strongly convex and smooth objective functions when H = O(T^{1/2}). For smooth non-convex objective functions, (Yu, Yang, and Zhu 2019) give an improved result of O((KT)^{-1/2}) when H = O(T^{1/4}). (Zhang et al. 2016) empirically study the effect of the averaging frequency on the quality of the solution for some problem cases and observe that more frequent averaging at the beginning of the optimization can help. Similarly, (Bijral, Sarwate, and Srebro 2016) argue for averaging more frequently at the beginning.

Although existing work provides convergence guarantees for local-update SGD, no effort has yet focused on optimally tuning local-update SGD to heterogeneous settings (slower workers do less computation between synchronizations, and faster workers do more) using load-balancing techniques.

Conclusion

We proposed Oscars, an adaptive semi-synchronous parallel model based on local SGD with load balancing. By choosing a global synchronization point that minimizes the maximum per-worker wait time and letting faster workers perform more local iterations, our approach improves resource utilization in heterogeneous clusters, reduces communication overhead, and accelerates training while preserving model accuracy.

References

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), 265–283.

Agarwal, A.; and Duchi, J. C. 2011. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, 873–881.

Bijral, A. S.; Sarwate, A. D.; and Srebro, N. 2016. On data dependence in distributed stochastic optimization. arXiv preprint arXiv:1603.04379.

Chen, T.; Li, M.; Li, Y.; Lin, M.; Wang, N.; Wang, M.; Xiao, T.; Xu, B.; Zhang, C.; and Zhang, Z. 2015.
MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.

De Sa, C. M.; Zhang, C.; Olukotun, K.; and Ré, C. 2015. Taming the wild: A unified analysis of Hogwild-style algorithms. In Advances in Neural Information Processing Systems, 2674–2682.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.

Deng, L.; Li, J.; Huang, J.-T.; Yao, K.; Yu, D.; Seide, F.; Seltzer, M.; Zweig, G.; He, X.; Williams, J.; et al. 2013. Recent advances in deep learning for speech research at Microsoft. In ICASSP, 8604–8608. IEEE.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Gerbessiotis, A. V.; and Valiant, L. G. 1994. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing.

Gibiansky, A. 2017. Bringing HPC techniques to deep learning. URL http://research.baidu.com/bringing-hpc-techniquesdeep-learning.

Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.

Ho, Q.; Cipar, J.; Cui, H.; Lee, S.; Kim, J. K.; Gibbons, P. B.; Gibson, G. A.; Ganger, G.; and Xing, E. P. 2013. More effective distributed ML via a stale synchronous parallel parameter server. In Advances in Neural Information Processing Systems, 1223–1231.

Jia, X.; Song, S.; He, W.; Wang, Y.; Rong, H.; Zhou, F.; Xie, L.; Guo, Z.; Yang, Y.; Yu, L.; et al. 2018. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205.

Li, M.; Zhang, T.; Chen, Y.; and Smola, A. J. 2014. Efficient mini-batch training for stochastic optimization.
In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 661–670.

Li, M.; Zhou, L.; Yang, Z.; Li, A.; Xia, F.; Andersen, D. G.; and Smola, A. 2013. Parameter server for distributed machine learning. In Big Learning NIPS Workshop, volume 6, 2.

Lian, X.; Huang, Y.; Li, Y.; and Liu, J. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, 2737–2745.

Liang, X.; Hu, Z.; Zhang, H.; Gan, C.; and Xing, E. P. 2017. Recurrent topic-transition GAN for visual paragraph generation. In Proceedings of the IEEE International Conference on Computer Vision, 3362–3371.

McDonald, R.; Hall, K.; and Mann, G. 2010. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 456–464.

McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.

Mikami, H.; Suganuma, H.; Tanaka, Y.; Kageyama, Y.; et al. 2018. ImageNet/ResNet-50 training in 224 seconds. arXiv preprint arXiv:1811.05233.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Recht, B.; Re, C.; Wright, S.; and Niu, F. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 693–701.

Sergeev, A.; and Del Balso, M. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.

Stich, S. U. 2018. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767.

Yan, Z.; Piramuthu, R.; Jagadeesh, V.; Di, W.; and Decoste, D. 2019. Hierarchical deep convolutional neural network for image classification. US Patent 10,387,773.

Yan, Z.; Zhang, H.; Wang, B.; Paris, S.; and Yu, Y. 2016.
Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics (TOG).

You, Y.; Zhang, Z.; Hsieh, C.-J.; Demmel, J.; and Keutzer, K. 2018. ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, 1–10.

Yu, H.; and Jin, R. 2019. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. arXiv preprint arXiv:1905.04346.

Yu, H.; Yang, S.; and Zhu, S. 2019. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5693–5700.

Zhang, J.; De Sa, C.; Mitliagkas, I.; and Ré, C. 2016. Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365.

Zhang, X.; Trmal, J.; Povey, D.; and Khudanpur, S. 2014. Improving deep neural network acoustic models using generalized maxout networks. In ICASSP, 215–219. IEEE.

Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Glynn, P. W.; Ye, Y.; Li, L.-J.; and Li, F.-F. 2018. Distributed asynchronous optimization with unbounded delays: How slow can you go?

Zinkevich, M.; Weimer, M.; Li, L.; and Smola, A. J. 2010. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, 2595–2603.