Oscars: Adaptive Semi-Synchronous Parallel Model for Distributed Deep Learning with Global View
Sheng Huang
Shanghai University of Electric Power
Abstract
Deep learning has become an indispensable part of daily life, powering applications such as face recognition and NLP, but training deep models has always been challenging. In recent years, the complexity of training data and models has grown explosively, so training has gradually shifted to distributed settings. The classical synchronous strategy guarantees accuracy, but its frequent communication leads to slow training; the asynchronous strategy trains quickly but cannot guarantee accuracy. Moreover, when training on heterogeneous clusters, neither works efficiently: on the one hand they cause serious waste of resources, and on the other hand frequent communication further slows training. This paper therefore proposes a semi-synchronous training strategy based on local-SGD that effectively improves the resource utilization of heterogeneous clusters and reduces communication overhead, accelerating training while preserving model accuracy.
Introduction
Copyright © 2021, SUEP

It has been widely acknowledged that machine learning has become fundamentally important in a wide range of research and engineering areas, including autonomous driving, face recognition, speech recognition (Deng et al. 2013), text understanding (Mikolov et al. 2013; Liang et al. 2017), image classification (Yan et al. 2019, 2016), etc. There is an imperative need to improve training performance, especially in the presence of larger volumes of data and increasingly complex models. Neural networks with hundreds of layers are now common; for example, the Bert (Devlin et al. 2018) language model proposed by Google contains 300 million parameters, and the ImageNet (Deng et al. 2009) data set contains 20,000 categories with a total of 15 million images.

Parameter servers (Li et al. 2013) are widely used in today's distributed training systems, such as MXNet (Chen et al. 2015) and TensorFlow (Abadi et al. 2016). The architecture is shown in Fig. 1. A parameter server architecture consists of a logical server group and many workers. Under this architecture, each worker holds different training data and an identical copy of the model. Each worker computes gradients locally and periodically pushes them to the parameter server; the parameter server aggregates the gradients from all workers and updates the model parameters. Finally, each worker pulls the latest parameters and continues training. In addition, there is a decentralized architecture distinct from the parameter server, called Ring-AllReduce. In this architecture all nodes form a logical ring and each node communicates only with its neighbor nodes, effectively avoiding the bandwidth congestion caused by centralization. However, due to the characteristics of this architecture, only synchronous algorithms can be used, so a straggler causes more serious problems here. There is also work that optimizes the ring, such as Horovod (Sergeev and Del Balso 2018; Gibiansky 2017; Jia et al. 2018; Mikami et al. 2018). In this work, we focus on the parameter server.

At present, the prevalent synchronization paradigms are Bulk Synchronous Parallel (BSP) (Gerbessiotis and Valiant 1994), Asynchronous Parallel (ASP), and Stale Synchronous Parallel (SSP) (Ho et al. 2013). BSP is a well-known general synchronization model in distributed computing. Due to its stability and reliability, it offers the same convergence stability as SGD on a single machine, so mainstream distributed training systems take it as the default parallel strategy. Barriers are a critical component of BSP: each node must stop after completing its own tasks and wait until all nodes have finished theirs. Although this ensures a high degree of consistency among the models on different nodes, it has serious drawbacks. In a heterogeneous or volatile cloud environment, the performance of each node differs. This means each node takes a different amount of time to process the same amount of data, and a large amount of time is spent waiting for the slowest node in each synchronization round. A typical example is shown in Figure 2. A natural idea is therefore to relax the synchronization requirements.
According to this naive idea, the ASP paradigm removes the strict barriers so that each node can work asynchronously. However, because the progress of the nodes diverges too much, the model oscillates and takes longer to converge. SSP (Ho et al. 2013) is a compromise between the two: as long as the gap between the fastest node and the slowest node does not exceed a stale threshold, each node can work asynchronously. But because SSP introduces an extra threshold, an unreasonable setting also hurts: a threshold that is too small leaves training seriously affected by stragglers, with the fastest nodes frequently stopping to wait for the slowest one, while a threshold that is too large can prevent convergence.

Figure 1: The architecture of Parameter Server and Ring-AllReduce.

Figure 2: Under the restriction of the barrier, the fastest node must wait for the slowest node after each iteration, which is very inefficient.

To make up for the shortcomings of the above parallel paradigms, we propose an adaptive load-balancing approach. Its aim is to relax the strict synchronization requirement of classical BSP and to improve resource utilization. The core idea is to let slower workers do less computation between synchronizations and faster workers do more. Under global control, the waiting time at each synchronization is minimized, and thus the overall training time is shortened. Our contributions are summarized as follows.

Motivation
Mini-batch stochastic gradient descent (SGD) is the state of the art in large-scale distributed training (see Figure 3). The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice because it often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck, recent works propose reducing the communication frequency. One algorithm of this type is local SGD (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010; Zhang et al. 2014; McMahan et al. 2017), which runs SGD independently in parallel on different workers and averages the sequences only once in a while.

Local SGD requires all workers to average their individual solutions every I iterations; no synchronization among workers is needed before averaging. However, the fastest worker still needs to wait until all the other workers finish I iterations of SGD, even if it finishes its own I iterations much earlier (see Figure 4 for a 4-worker example where one worker is significantly faster than the others). As a consequence, the computation capability of the faster workers is wasted. This issue arises quite often in heterogeneous networks where nodes are equipped with different hardware.

In this paper, we present asynchronous local SGD with load balancing (Figure 5), which does not require the local sequences to be synchronized. This not only reduces communication bottlenecks; by using load-balancing techniques the algorithm can also be tuned optimally to heterogeneous settings (slower workers do less computation between synchronizations, and faster workers do more).

Figure 3: Mini-batch SGD in a homogeneous environment. The green arrows represent computation.

Figure 4: Local SGD in a heterogeneous environment. The green arrows represent computation and the gray arrows represent the idle state.

Figure 5: Local SGD with load balancing in a heterogeneous environment. The green arrows represent computation and the gray arrows represent the idle state.
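To make the local SGD scheme concrete, the toy simulation below runs K workers on a simple quadratic objective: each worker performs I local SGD steps on its own data, then all local models are averaged, mimicking the periodic synchronization described above. This is a minimal hypothetical sketch (the function and variable names are illustrative, not from the paper's implementation).

```python
import random

def local_sgd(num_workers=4, local_steps=8, rounds=10, lr=0.1, seed=0):
    """Toy local SGD: worker k minimizes f_k(x) = 0.5 * (x - c_k)^2 on its
    own 'data' c_k; local models are averaged every `local_steps` updates."""
    rng = random.Random(seed)
    centers = [rng.uniform(-1, 1) for _ in range(num_workers)]  # per-worker data
    x = 0.0  # shared initial model
    for _ in range(rounds):
        local_models = []
        for c in centers:
            xk = x  # each worker starts from the last synchronized model
            for _ in range(local_steps):
                grad = xk - c          # gradient of 0.5 * (xk - c)^2
                xk -= lr * grad        # local SGD step, no communication
            local_models.append(xk)
        x = sum(local_models) / num_workers  # synchronization by averaging
    return x, sum(centers) / num_workers

model, optimum = local_sgd()
```

After a few rounds the averaged model approaches the minimizer of the global objective (the mean of the per-worker optima), even though workers communicate only once every `local_steps` iterations.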
Problem Formulation
Most data centers have high availability, so we assume that they run in a stable environment and that each worker's computational speed is in a steady state.

According to the case study, the time spent by a worker before the global barrier, where the barrier is notated as T, can be decomposed into three parts. The first part is the gradient computation time, notated as t_i^iter. The second part is the idle time spent waiting for other workers to synchronize parameters, notated as t_i^w. The third part is the time to synchronize the parameters; we assume that the bandwidth between workers in the data center is very high, so we ignore this part.

For N workers in a heterogeneous cluster, each worker has to process a certain number of iterations before the global barrier, where each iteration on the same worker takes a similar amount of time. Indexing the N workers by i, the local iteration times can be notated as t_1^iter, t_2^iter, ..., t_N^iter. Given a global barrier T, we obtain the t_i^w of each worker:

    t_i^w = mod(T, t_i^iter)    (1)

Before a synchronization, the maximum waiting time reflects the idle degree of the workers. If the maximum wait time is as small as possible, all workers can complete their last batch computation exactly at the global barrier, and computing resources are fully utilized. We define the maximum wait time as:

    max_i mod(T, t_i^iter)    (2)

We therefore look for the optimal T* that minimizes Eq. 2 over all possible T, and formulate the following optimization problem:

    T* = argmin_T max_i mod(T, t_i^iter)    (3)

    s.t. floor(T / min_i(t_i^iter)) - floor(T / max_i(t_i^iter)) < M

where floor(T / t_i^iter) represents the maximum number of iterations the i-th worker can complete before the barrier, notated as τ_i, and M limits the difference in local steps between workers to an appropriate range.

Table 1: Frequently used notations.

    t_i^iter    Time of a local iteration of worker i
    t_i^w       Wait time of worker i
    T           The time point of the global synchronization
    N           Number of workers

Approach
In this section we present load-balanced local SGD. It not only uses load-balancing techniques so that the algorithm can be tuned optimally to heterogeneous settings (slower workers do less computation between synchronizations, and faster workers do more), but also reduces the network overhead caused by frequent communication.
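The barrier-search idea formulated above can be sketched in a few lines. The snippet below is a hypothetical Python illustration (names are ours), assuming per-worker iteration times are measured as integers, e.g. in milliseconds: it scans candidate barriers T, keeps the one minimizing the worst-case idle time mod(T, t_i), and derives the per-worker local step counts τ_i.

```python
def balance(iter_times, M):
    """Find a synchronization period T* minimizing the worst-case idle
    time max_i (T mod t_i), while the gap in local step counts between
    the fastest and slowest worker stays below M."""
    t_min, t_max = min(iter_times), max(iter_times)
    best_T, best_wait = None, float("inf")
    T = t_max  # the barrier cannot be shorter than the slowest iteration
    # Scan candidate barriers in unit steps while the step-gap constraint
    # floor(T/t_min) - floor(T/t_max) < M still holds.
    while T // t_min - T // t_max < M:
        wait = max(T % t for t in iter_times)  # worst-case idle time at T
        if wait < best_wait:
            best_T, best_wait = T, wait
        T += 1
    steps = [best_T // t for t in iter_times]  # local iterations tau_i
    return best_T, best_wait, steps

# Example: three workers taking 3, 4, and 6 time units per iteration.
T_star, wait, taus = balance([3, 4, 6], M=4)
```

For iteration times (3, 4, 6) the search settles on T* = 12 (their least common multiple), where every worker hits the barrier with zero idle time and the local step counts are (4, 3, 2) — faster workers do more local work, exactly the load-balancing behavior described above.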
Local-SGD load balancing
To minimize the wait time and improve cluster utilization, we propose a fast and efficient algorithm based on the principle of the least common multiple. The algorithmic flow is given in Algorithm 1 and described as follows. First, let T = max_i(t_i^iter). Next, use a loop to compute the maximum value of mod(T, t_i^iter), then increment T. The loop is repeated until the constraint is no longer satisfied. Finally, take the T* for which the maximum value of mod(T, t_i^iter) is minimal. The total computation complexity is O(MN), which is linear, so the search brings no significant additional overhead to the training system.

After obtaining the optimal T*, we can calculate the number of iterations for each worker according to t_i^iter. The number of iterations of each worker is expressed as τ_1, τ_2, ..., τ_N, and the model update rule is:

    x_{t+1}^i = (1/N) Σ_{k=1}^{N} (x_t^k − η g(x_t^k))    if t mod τ_i = 0
    x_{t+1}^i = x_t^i − η g(x_t^i)                        otherwise    (4)

where x_t^i denotes the model parameters on the i-th worker.

Data partition load balancing
Evaluation Setup
Testbed
We conduct our experiments on a GPU server. The server has 2 NVIDIA RTX 2080 GPUs interconnected with 10 Gbps PCI-E and runs Ubuntu Server 18.06. We used the PyTorch framework to build our algorithm prototypes.
Dataset and DL Models
We used the CIFAR-10 dataset for image classification tasks. The dataset has 50,000 training images and 10,000 test images. We chose ResNet101 as our deep neural network baseline to evaluate our approach.

Algorithm 1: Load-Balance Algorithm

    Input: t_i^iter, M
    Output: T*
    Initialize: T = max_i(t_i^iter)
    while floor(T / min_i(t_i^iter)) − floor(T / max_i(t_i^iter)) < M do
        record max_i mod(T, t_i^iter); T = T + 1
    return the T* minimizing the recorded maximum

The performance metrics include scalability and rate of convergence. Scalability denotes the speedup in throughput (number of iterations finished per hour) compared with single-node DL.

Evaluations

Figure 6: Loss over time (minutes) for async_lb_slow_dynpart_rank__1, async_lb_fast_dynpart_rank__0, and bsp_1.

Figure 7: Rate of convergence.

Figure 6 plots the training curves for ResNet101. We set the training time to 1 hour and the batch size to 128. It can be clearly observed in the figure that the curve of BSP is smoother. This is due to the strong synchronization characteristics of BSP, which ensure the correctness of the gradients from different workers and avoid oscillation. Although the convergence process of BSP is very stable, its convergence speed is very slow. This is because in a heterogeneous environment the performance of each worker differs, so different workers take different amounts of time to process data of the same size, and some well-performing workers are idle most of the time. Our approach uses load balancing to improve the utilization of computing resources so that more data can be iterated in the same time, thereby accelerating convergence.

Scalability

Related Works

Asynchronous SGD

For large-scale machine learning optimization problems, parallel mini-batch SGD suffers from synchronization delay due to a few slow machines, slowing down the entire computation. To mitigate synchronization delay, asynchronous SGD methods are studied in (Recht et al. 2011; De Sa et al. 2015; Lian et al. 2015). These methods, though faster than synchronized methods, suffer from convergence error due to stale gradients. (Agarwal and Duchi 2011) shows that a limited amount of delay can be tolerated while preserving linear speedup for convex optimization problems. Furthermore, (Zhou et al.
2018) indicates that even polynomially growing delays can be tolerated by utilizing a quasilinear step-size sequence, though without achieving linear speedup.

Large batch SGD

Recent schemes for scaling training to a large number of workers rely on standard mini-batch SGD with very large overall batch sizes (You et al. 2018; Goyal et al. 2017), i.e., increasing the global batch size linearly with the number of workers K. (Yu and Jin 2019) has shown that, remarkably, with an exponentially growing mini-batch size it is possible to achieve linear speedup (i.e., error of O(1/KT)) with only log T iterations of the algorithm; implemented in a distributed setting, this corresponds to log T rounds of communication. The result of (Yu and Jin 2019) implies that SGD with exponentially increasing batch sizes has a convergence behavior similar to that of full-fledged (non-stochastic) gradient descent.

While the algorithm of (Yu and Jin 2019) provides a way of reducing communication in a distributed setting, for a large number of iterations it requires large mini-batches, which washes away the computational benefit of stochastic gradient descent over its deterministic counterpart. Furthermore, it has been found that increasing the mini-batch size often increases generalization error, which limits distributivity (Li et al. 2014). Our work is complementary to the approach of (Yu and Jin 2019), as we focus on approaches that use local updates with a fixed mini-batch size, which in our experiments is a hyperparameter tuned to the data set.

Local SGD

Motivated by better balancing the available system resources (computation vs. communication), local SGD (a.k.a. local-update SGD, parallel SGD, or federated averaging) has recently attracted increased research interest (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010; Zhang et al. 2014; McMahan et al. 2017).
In local SGD, each worker evolves a local model by performing H sequential SGD updates with mini-batch size B before communication (synchronization by averaging) among the workers.

A main research question is whether local-update SGD provides a linear speedup with respect to the number of workers K, similar to mini-batch SGD. Recent work partially confirms this, under the assumption that H is not too large compared to the total number of iterations T. (Stich 2018) shows convergence at O((KT)^{-1}) on strongly convex and smooth objective functions when H = O(T^{1/2}). For smooth non-convex objective functions, (Yu, Yang, and Zhu 2019) give an improved result of O((KT)^{-1/2}) when H = O(T^{1/4}). (Zhang et al. 2016) empirically study the effect of the averaging frequency on the quality of the solution for some problem cases and observe that more frequent averaging at the beginning of the optimization can help. Similarly, (Bijral, Sarwate, and Srebro 2016) argue for averaging more frequently at the beginning.

Although existing work provides convergence guarantees for local-update SGD, no effort has yet focused on optimally tuning local-update SGD to heterogeneous settings (slower workers do less computation between synchronizations, and faster workers do more) using load-balancing techniques.

Conclusion

We proposed Oscars, an adaptive semi-synchronous parallel model based on local SGD with load balancing. By choosing a global synchronization point that minimizes the maximum per-worker wait time and letting faster workers perform more local iterations, our approach improves resource utilization in heterogeneous clusters, reduces communication overhead, and accelerates training while preserving model accuracy.

References

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), 265–283.

Agarwal, A.; and Duchi, J. C. 2011. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, 873–881.

Bijral, A. S.; Sarwate, A. D.; and Srebro, N. 2016. On data dependence in distributed stochastic optimization. arXiv preprint arXiv:1603.04379.

Chen, T.; Li, M.; Li, Y.; Lin, M.; Wang, N.; Wang, M.; Xiao, T.; Xu, B.; Zhang, C.; and Zhang, Z. 2015.
MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.

De Sa, C. M.; Zhang, C.; Olukotun, K.; and Ré, C. 2015. Taming the wild: A unified analysis of Hogwild-style algorithms. In Advances in Neural Information Processing Systems, 2674–2682.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.

Deng, L.; Li, J.; Huang, J.-T.; Yao, K.; Yu, D.; Seide, F.; Seltzer, M.; Zweig, G.; He, X.; Williams, J.; et al. 2013. Recent advances in deep learning for speech research at Microsoft. In ICASSP, 8604–8608. IEEE.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Gerbessiotis, A. V.; and Valiant, L. G. 1994. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing.

Gibiansky, A. 2017. Bringing HPC techniques to deep learning. URL http://research.baidu.com/bringing-hpc-techniquesdeep-learning.

Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.

Ho, Q.; Cipar, J.; Cui, H.; Lee, S.; Kim, J. K.; Gibbons, P. B.; Gibson, G. A.; Ganger, G.; and Xing, E. P. 2013. More effective distributed ML via a stale synchronous parallel parameter server. In Advances in Neural Information Processing Systems, 1223–1231.

Jia, X.; Song, S.; He, W.; Wang, Y.; Rong, H.; Zhou, F.; Xie, L.; Guo, Z.; Yang, Y.; Yu, L.; et al. 2018. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205.

Li, M.; Zhang, T.; Chen, Y.; and Smola, A. J. 2014. Efficient mini-batch training for stochastic optimization.
In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 661–670.

Li, M.; Zhou, L.; Yang, Z.; Li, A.; Xia, F.; Andersen, D. G.; and Smola, A. 2013. Parameter server for distributed machine learning. In Big Learning NIPS Workshop, volume 6, 2.

Lian, X.; Huang, Y.; Li, Y.; and Liu, J. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, 2737–2745.

Liang, X.; Hu, Z.; Zhang, H.; Gan, C.; and Xing, E. P. 2017. Recurrent topic-transition GAN for visual paragraph generation. In Proceedings of the IEEE International Conference on Computer Vision, 3362–3371.

McDonald, R.; Hall, K.; and Mann, G. 2010. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 456–464.

McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.

Mikami, H.; Suganuma, H.; Tanaka, Y.; Kageyama, Y.; et al. 2018. ImageNet/ResNet-50 training in 224 seconds. arXiv preprint arXiv:1811.05233.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Recht, B.; Re, C.; Wright, S.; and Niu, F. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 693–701.

Sergeev, A.; and Del Balso, M. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.

Stich, S. U. 2018. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767.

Yan, Z.; Piramuthu, R.; Jagadeesh, V.; Di, W.; and Decoste, D. 2019. Hierarchical deep convolutional neural network for image classification. US Patent 10,387,773.

Yan, Z.; Zhang, H.; Wang, B.; Paris, S.; and Yu, Y. 2016.
Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics (TOG).

You, Y.; Zhang, Z.; Hsieh, C.-J.; Demmel, J.; and Keutzer, K. 2018. ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, 1–10.

Yu, H.; and Jin, R. 2019. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. arXiv preprint arXiv:1905.04346.

Yu, H.; Yang, S.; and Zhu, S. 2019. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5693–5700.

Zhang, J.; De Sa, C.; Mitliagkas, I.; and Ré, C. 2016. Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365.

Zhang, X.; Trmal, J.; Povey, D.; and Khudanpur, S. 2014. Improving deep neural network acoustic models using generalized maxout networks. In ICASSP, 215–219. IEEE.

Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Glynn, P. W.; Ye, Y.; Li, L.-J.; and Li, F.-F. 2018. Distributed asynchronous optimization with unbounded delays: How slow can you go?

Zinkevich, M.; Weimer, M.; Li, L.; and Smola, A. J. 2010. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, 2595–2603.