Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training
Liang Luo∗, Jacob Nelson†, Luis Ceze∗, Amar Phanishayee†, Arvind Krishnamurthy∗
∗University of Washington, †Microsoft Research
ABSTRACT
Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs get bigger, training requires going distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger DNN models and faster compute engines shift the training performance bottleneck from computation to communication. Our experiments show that existing DNN training frameworks do not scale in a typical cloud environment due to insufficient network bandwidth and inefficient parameter server software stacks.

We propose PBox, a scalable central PS hardware design that balances compute and communication resources, and PHub, a high performance parameter server (PS) software design that provides an optimized network stack and a streamlined gradient processing pipeline; PHub benefits common PS setups and is required to fully utilize PBox. We show that in a typical cloud environment, PBox can achieve up to 3.8x speedup over state-of-the-art designs when training ImageNet. We discuss future directions of integrating PBox with programmable switches for in-network aggregation during training, leveraging the datacenter network topology to reduce bandwidth usage and localize data movement.
The goal of this work is to accelerate distributed DNN training in cloud environments. This work focuses on "data" parallelism, where workers process different samples and share the same model. A training iteration in this paradigm has two main components: computation-heavy forward and backward passes, and a communication-heavy model update step. As DNN models get larger and speedier accelerators emerge, the performance bottleneck of distributed DNN training has shifted from computation to communication.
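To make the structure of a data-parallel iteration concrete, the sketch below shows a hypothetical worker-side loop against a generic parameter server interface; the ps.push/ps.pull calls and the model methods are illustrative assumptions, not the actual API of MXNet or PHub.

```python
# Hypothetical worker-side loop for data-parallel training with a parameter
# server (PS). `ps.push`/`ps.pull` and the model methods are illustrative
# names, not MXNet's or PHub's actual API.
def train_worker(model, data_shard, ps, num_iterations):
    for _ in range(num_iterations):
        batch = data_shard.next_batch()
        # Compute-heavy part: forward and backward passes on the local GPU.
        loss = model.forward(batch)
        grads = model.backward(loss)              # one gradient tensor per key
        # Communication-heavy part: exchange gradients through the PS.
        for key, grad in grads.items():
            ps.push(key, grad)                    # send this worker's gradient
        for key in grads:
            model.weights[key] = ps.pull(key)     # fetch aggregated, updated weights
```

Every worker runs the same loop on a different data shard; the push/pull step is where the bandwidth pressure discussed below appears.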
Larger DNN models require more gradient communication per iteration. The throughput of GPUs on ResNet, a recent DNN, has increased by 35x on modern cloud-based GPUs (Figure 1a), effectively demanding a similar increase in network bandwidth given a fixed batch size. However, the network bandwidth in compute instances on major cloud providers such as EC2 has not improved across generational upgrades [2]. Further, existing parameter exchange mechanisms have problems scaling up the total throughput on a standard cloud network stack (Table 1).
Figure 1: Distributed training as a communication-bound workload in the cloud. (a) Per-chip GPU throughput for ResNet 269 training in EC2 has increased by 35x since 2012 (K520 through V100 with Tensor Cores). (b) Increased network overhead in training as GPUs get faster (GPU active time vs. total time per iteration).
Table 1: Total training throughput as the number of workers increases.

Framework    Local   2 workers   4 workers   8 workers
TensorFlow   152     213         410         634
Caffe2       195     266         343         513
MXNet        190     187         375         …

The compound effect of these factors dramatically increases communication overhead during DDNN training.

Figure 1b summarizes the throughput of modest-scale DNN training with 8 workers and 8 colocated PSs (the PS process and the worker training process reside in the same machine) on EC2 with 10Gbps links and a per-GPU batch size of 4 (maximizing GPU memory usage on GRID 520): although modern DNN training frameworks can overlap backward passes with model updates, they can no longer hide the latency of communication due to faster computation. One solution is to increase the per-GPU batch size, leading to a larger global batch size given a fixed number of GPUs. Large global batch sizes hurt statistical efficiency [3, 6, 7]; also, GPUs have limited memory. Techniques such as [5] could alleviate that pressure, but at a higher computational cost.

Communication overhead will likely worsen as the gap between computation and communication capability widens. New accelerators continue to reduce computation time, but networks are not getting faster at the same rate. Over the last 5 years, 100Gbps networks have become available, but they are costly and have limited deployment.

These observations suggest that DDNN training has shifted from a compute-bound problem to one that also has a significant network-bound component. It is critical to perform model updates efficiently.
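As a rough illustration of why faster GPUs create network pressure, the following back-of-envelope calculation uses assumed numbers (a 100M-parameter fp32 model and an arbitrary baseline iteration rate); these are not measurements from this work.

```python
# Back-of-envelope estimate of per-worker bandwidth demand. The model size and
# iteration rates below are assumptions for illustration only.
params = 100e6                               # assumed model with 100M fp32 parameters
grad_bytes = params * 4                      # ~400 MB of gradients per iteration
iters_per_sec_old = 0.25                     # assumed iteration rate on an older GPU
iters_per_sec_new = iters_per_sec_old * 35   # a 35x faster GPU (cf. Figure 1a)

for label, rate in [("old GPU", iters_per_sec_old), ("new GPU", iters_per_sec_new)]:
    # Each worker both sends gradients and receives updated parameters.
    gbps = 2 * grad_bytes * rate * 8 / 1e9
    print(f"{label}: ~{gbps:.1f} Gbps per worker")
# old GPU: ~1.6 Gbps; new GPU: ~56 Gbps, far beyond a 10Gbps cloud link.
```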
Model updates are usually performed in a parameter server (PS), a key-value store for the current model [10, 11, 15, 16].
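The sketch below shows the server side of such a key-value store under a simple synchronous SGD protocol; the class and its methods are illustrative assumptions, not the interface of any of the cited systems.

```python
import numpy as np

# Minimal server-side sketch of the PS model-update step for synchronous SGD.
# Class and method names are illustrative, not a real PS implementation.
class ParameterServer:
    def __init__(self, init_weights, lr=0.1):
        self.weights = dict(init_weights)     # key -> np.ndarray
        self.lr = lr

    def update(self, key, worker_grads):
        """Aggregate one key's gradients from all workers, then optimize.
        Copying, aggregation, and optimization all happen here, which is why
        this step can become the bottleneck as computation gets faster."""
        agg = np.mean(worker_grads, axis=0)   # gradient aggregation
        self.weights[key] -= self.lr * agg    # plain SGD as the optimizer
        return self.weights[key]              # pulled back by the workers
```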
Figure 2: The PHub software and hardware architecture. (a) Fine-grained key chunking and balanced chunk-to-core assignment scheme in PHub: keys are split into virtual key chunks, with queue pairs and completion queues assigned to cores. (b) PBox has multiple NICs per NUMA domain to balance IO, memory, and network bandwidth.
Figure 3: Training performance on an EC2-like 10Gbps network for AlexNet, VGG-19, GoogleNet, Inception V3, ResNet-269, and ResNext-269, with PHub software alone and with PHub software plus the PBox hardware. Results are normalized to sharded MXNet. Batch size per GPU: 8 for ResNext, 16 for ResNet, 32 for others. GPU: GTX 1080 Ti. Higher speedup is possible with the latest GPUs.

We base our work on MXNet, a widely used, state-of-the-art DDNN training framework that is known to be fast (Table 1, [4, 14]) and supports native distributed training. Our profiling of MXNet reveals two problems: (1) insufficient network bandwidth (more so with colocated PSs than with non-colocated servers) and (2) an inefficient PS software stack. We found that data copy, aggregation, and optimization are the main bottlenecks in the model update process: they prevent the PS from scaling to higher throughput with high-bandwidth networks.

We first propose PHub, a high performance PS design for DDNN training. We briefly summarize the main optimizations in PHub.

Network Stack: Optimized InfiniBand support for lower network overhead, with one-shot registration, zero copy, and minimized metadata, so that all bandwidth is dedicated to gradient payload.
Aggregation and Optimization: Fine-grained key chunking (32KB) for maximized overlap of gradient processing and network transfer, and optimal load balancing across processor cores; locality-preserving, vectorized implementations of the aggregator and optimizer.
Gradient Memory Layout: NUMA-aware, balanced scheme for assigning each key chunk to a processor core, through a series of load-balanced, locality-preserving assignments of queue pairs, interfaces, and completion queues to cores (Figure 2a); a minimal sketch of the chunking and assignment scheme appears after this list. PHub incurs zero synchronization between cores or between NUMA domains.
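The sketch below illustrates fine-grained key chunking and a static, locality-preserving chunk-to-core assignment. The chunk size follows the 32KB figure above, but the data structures and the round-robin policy are simplifying assumptions, not PHub's actual code.

```python
import numpy as np

CHUNK_BYTES = 32 * 1024                          # 32KB chunks, as in PHub
ELEMS_PER_CHUNK = CHUNK_BYTES // 4               # fp32 gradient elements

def make_chunks(key_sizes):
    """Split each key (gradient tensor) into fixed-size virtual key chunks."""
    chunks = []
    for key, size in key_sizes.items():          # size = number of fp32 elements
        for start in range(0, size, ELEMS_PER_CHUNK):
            chunks.append((key, start, min(start + ELEMS_PER_CHUNK, size)))
    return chunks

def assign_chunks(chunks, cores_per_numa, numa_domains):
    """Statically map each chunk to one core so it is always processed on the
    same core (no cross-core or cross-NUMA synchronization), with balanced load."""
    total_cores = cores_per_numa * numa_domains
    return {chunk: i % total_cores for i, chunk in enumerate(chunks)}

def aggregate_chunk(worker_buffers):
    """Sum one chunk across workers on its assigned core; a real implementation
    would use a vectorized, in-place, cache-friendly loop."""
    return np.sum(worker_buffers, axis=0)
```

Because each chunk is pinned to one core (and hence one NUMA domain), gradient processing overlaps with network transfer at chunk granularity without synchronization.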
Figure 4: PBox can fully utilize the hardware during training (PCIe cache line transfers per second vs. number of active workers, shown against the InfiniBand/PCIe limit and a microbenchmark limit). Its performance is bottlenecked by the PCIe/cache controller link.
Figure 5: A potential PHub implementation with a programmable ToR switch, using hybrid aggregation for workers in different racks.

These software optimizations benefit centralized or sharded PS configurations. However, to scale up a central PS, software alone is not sufficient: the hardware in a typical server is unbalanced, with significantly more compute resources than network resources. Typically, a single interface/connection in the server must handle traffic for all participating workers. We propose PBox, a new server architecture that balances IO, memory, and network bandwidth. Our prototype PBox is built using a server with ten 56Gbps network interfaces (5 per NUMA domain; Figure 2b). PBox takes full advantage of the PHub software and essentially forms micro-shards inside a single box. We integrated PHub with MXNet; Figure 3 shows the speedup of PHub over an MXNet colocated, sharded baseline (multiple PS processes, each in charge of a partition of keys and each residing on a worker machine) when training ImageNet-winning networks with 8 workers. PHub on PBox can provide up to 3.8x speedup over the state of the art on a cloud-like 10Gbps network (all interfaces have a negotiated speed of 10Gbps with the switch in this experiment). The speedup using 56Gbps links is similar, ranging from 2x to 7x depending on the DNN being trained.

PBox shows linear scaling with our 8-worker cluster running all workloads, and provides higher throughput than other parameter exchange patterns using MPI or collectives, because PHub uses only one round of communication and the minimum total data transfer per iteration. To understand its limits, we used a special ZeroComputeEngine that simulates infinitely fast computation in MXNet, performing only parameter exchange operations. We found that PBox performance is limited only by the bandwidth between the PCIe controller and the memory system (Figure 4). This limit is hard to hit in real training: we estimate that a single PBox will support up to 120 workers training ResNet-50 with a batch size of 32 per GPU. If each worker includes 4 GPUs, that translates to a global batch size of roughly 15K, surpassing the maximum suggested in [6]. Recent work suggests larger batch sizes may impede training [3, 6–8], but if higher scalability is desired, sharding or the use of new platforms with more PCIe throughput (e.g., [1]) would enable PBox to provide higher throughput.
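For clarity, the global batch size figure above follows from simple arithmetic over the estimated worker count:

```python
workers = 120          # estimated workers a single PBox can support (ResNet-50)
gpus_per_worker = 4
batch_per_gpu = 32
global_batch = workers * gpus_per_worker * batch_per_gpu
print(global_batch)    # 15360, i.e., roughly 15K
```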
The PBox results show the benefit of a high-bandwidth centralized PS. Recent programmable switches [9, 12, 13] enable a new approach to building centralized PS designs that offload gradient aggregation operations to the network. Figure 5 shows how the PBox architecture running in a top-of-rack (ToR) switch can benefit from the full bisection bandwidth inside a server rack, performing centralized aggregation of gradients inside a rack; only a single aggregated stream must be sent to higher-level switches for further aggregation across racks. This hybrid synchronization reduces bandwidth usage and localizes data movement in the datacenter.

Current switches have limited computational capabilities: most can perform only integer operations, with little on-switch storage, and only on a small region of each packet. Our future work includes exploring the hardware requirements necessary for efficient DDNN training, as our emulation-based experiments show that these limits lead to unsatisfactory throughput on current switches.
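A minimal sketch of the hybrid aggregation pattern described above: gradients are first summed inside each rack, and only one aggregated stream per rack crosses the rack boundary. The rack layout and tensor sizes are assumptions for illustration, and the in-rack step stands in for whatever device (ToR switch or rack-local PBox) performs it.

```python
import numpy as np

def rack_aggregate(worker_grads):
    # In-rack step: exploits full bisection bandwidth inside the rack.
    return np.sum(worker_grads, axis=0)

def global_aggregate(racks):
    # Cross-rack step: one stream per rack instead of one per worker, cutting
    # traffic to higher-level switches by a factor of (workers per rack).
    return np.sum([rack_aggregate(grads) for grads in racks], axis=0)

# Example: 2 racks x 4 workers, each holding a 1M-element fp32 gradient.
racks = [[np.ones(1_000_000, dtype=np.float32) for _ in range(4)]
         for _ in range(2)]
total = global_aggregate(racks)    # elementwise sum over all 8 workers
```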
REFERENCES
[5] arXiv preprint arXiv:1604.06174 (2016).
[6] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677 (2017).
[7] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv preprint arXiv:1609.04836 (2016).
[8] Y. LeCun, L. Bottou, and G. Orr. [n. d.]. Efficient BackProp in Neural Networks: Tricks of the Trade (Orr, G. and Müller, K., eds.). Lecture Notes in Computer Science.
[9] In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 467–483. http://dl.acm.org/citation.cfm?id=3026877.3026914
[10] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14). USENIX Association, Berkeley, CA, USA, 583–598. http://dl.acm.org/citation.cfm?id=2685048.2685095
[11] Mu Li, David G. Andersen, Alexander Smola, and Kai Yu. 2014. Communication Efficient Distributed Machine Learning with the Parameter Server. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14). MIT Press, Cambridge, MA, USA, 19–27. http://dl.acm.org/citation.cfm?id=2968826.2968829
[12] Xiaozhou Li, Raghav Sethi, Michael Kaminsky, David G. Andersen, and Michael J. Freedman. 2016. Be Fast, Cheap and in Control with SwitchKV.
[13] SIGOPS Oper. Syst. Rev. 51, 2 (April 2017), 795–809. https://doi.org/10.1145/3093315.3037731
[14] Shaohuai Shi, Qiang Wang, Pengfei Xu, and Xiaowen Chu. 2016. Benchmarking State-of-the-Art Deep Learning Software Tools. arXiv preprint arXiv:1608.07249 (2016).
[15] Alexander Smola and Shravan Narayanamurthy. 2010. An Architecture for Parallel Topic Models. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 703–710. https://doi.org/10.14778/1920841.1920931
[16] Ce Zhang and Christopher Ré. 2014. DimmWitted: A Study of Main-memory Statistical Analytics. Proc. VLDB Endow. 7, 12 (Aug. 2014), 1283–1294. https://doi.org/10.14778/2732977.2733001