MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
Tianqi Chen (U. Washington), Mu Li∗ (CMU), Yutian Li (Stanford), Min Lin (NUS), Naiyan Wang (TuSimple), Minjie Wang (NYU), Tianjun Xiao (Microsoft), Bing Xu (U. Alberta), Chiyuan Zhang (MIT), Zheng Zhang (NYU Shanghai)

∗ Corresponding author ([email protected])
Abstract
MXNet is a multi-language machine learning (ML) library to ease the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers auto differentiation to derive gradients. MXNet is computation and memory efficient and runs on various heterogeneous systems, ranging from mobile devices to distributed GPU clusters. This paper describes both the API design and the system implementation of MXNet, and explains how embedding of both symbolic expression and tensor operation is handled in a unified fashion. Our preliminary experiments reveal promising results on large scale deep neural network applications using multiple GPU machines.
The scale and complexity of machine learning (ML) algorithms are becoming increasingly large. Almost all recent ImageNet challenge [12] winners employ neural networks with very deep layers, requiring billions of floating-point operations to process a single sample. The rise of structural and computational complexity poses interesting challenges to ML system design and implementation.

Most ML systems embed a domain-specific language (DSL) into a host language (e.g. Python, Lua, C++). Possible programming paradigms range from imperative, where the user specifies exactly "how" computation needs to be performed, to declarative, where the user specification focuses on "what" is to be done. Examples of imperative programming include numpy and Matlab, whereas packages such as Caffe and CXXNet program over layer definitions, which abstract away and hide the inner workings of the actual implementation. The dividing line between the two can be muddy at times. Frameworks such as Theano and the more recent TensorFlow can also be viewed as a mixture of both: they declare a computational graph, yet the computation within the graph is imperatively specified.

Related to the issue of programming paradigms is how the computation is carried out. Execution can be concrete, where the result is returned right away on the same thread, or asynchronous/delayed, where the statements are first gathered and transformed into a dataflow graph as an intermediate representation before being released to the available devices. These two execution models have different implications on how inherent parallelism is discovered. Concrete execution is restrictive (e.g. parallelism is limited to individual operations such as matrix multiplication), whereas asynchronous/delayed execution additionally identifies all parallelism within the scope of an instance of the dataflow graph automatically.

The combination of the programming paradigm and execution model yields a large design space, some of which is more interesting (and valid) than others. In fact, our team has collectively explored a number of them, as has the rest of the community. For example, Minerva [14] combines imperative programming with asynchronous execution, while Theano takes a declarative approach, enabling more global graph-aware optimization. A similar discipline was adopted in Purine2 [10]. CXXNet, instead, adopts declarative programming (over a tensor abstraction) and concrete execution, similar to Caffe [7]. Table 2 gives more examples.

| | Imperative Program | Declarative Program |
|---|---|---|
| Execute a = b + 1 | Eagerly compute and store the result in a, with the same type as b. | Return a computation graph; bind data to b and do the computation later. |
| Advantages | Conceptually straightforward, and often works seamlessly with the host language's built-in data structures, functions, debugger, and third-party libraries. | Obtain the whole computation graph before execution, beneficial for optimizing performance and memory utilization. Also convenient to implement functions such as load, save, and visualization. |

Table 1: Comparison of the imperative and declarative paradigms for domain-specific languages.

| System | Core Lang | Binding Langs | Devices (beyond CPU) | Distributed | Imperative Program | Declarative Program |
|---|---|---|---|---|---|---|
| Caffe [7] | C++ | Python/Matlab | GPU | × | × | √ |
| Torch7 [3] | Lua | - | GPU/FPGA | × | √ | × |
| Theano [1] | Python | - | GPU | × | × | √ |
| TensorFlow [11] | C++ | Python | GPU/Mobile | √ | × | √ |
| MXNet | C++ | Python/R/Julia/Go | GPU/Mobile | √ | √ | √ |

Table 2: Comparison to other popular open-source ML libraries.

Our combined new effort resulted in
MXNet (or "mix-net"), which intends to blend the advantages of the different approaches. Declarative programming offers a clear boundary on the global computation graph, discovering more optimization opportunities, whereas imperative programs offer more flexibility. In the context of deep learning, declarative programming is useful for specifying the computation structure in neural network configurations, while imperative programming is more natural for parameter updates and interactive debugging. We also took the effort to embed MXNet into multiple host languages, including C++, Python, R, Go and Julia.

Despite the support of multiple languages and the combination of different programming paradigms, we are able to fuse execution into the same backend engine. The engine tracks data dependencies across computation graphs and imperative operations, and schedules them jointly and efficiently. We aggressively reduce the memory footprint, performing in-place updates and memory space reuse whenever possible. Finally, we designed a compact communication API so that an MXNet program runs on multiple machines with little change.

Compared to other open-source ML systems, MXNet provides a superset of the programming interfaces of Torch7 [3], Theano [1], Chainer [5] and Caffe [7], and supports more systems such as GPU clusters. Besides supporting the optimization of declarative programs as TensorFlow [11] does, MXNet additionally embeds imperative tensor operations to provide more flexibility. MXNet is lightweight, e.g. the prediction code fits into a single 50K-line C++ source file with no other dependencies, and it has more language supports. More detailed comparisons are shown in Table 2.
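To make the contrast in Table 1 concrete, here is a small toy sketch (plain Python with numpy, not MXNet code; the Add class is purely illustrative) of eager versus deferred evaluation of a = b + 1:

```python
import numpy as np

# Imperative: the result is computed eagerly and stored in `a`.
b = np.ones((2, 3))
a = b + 1                        # `a` already holds the values here

# Declarative (toy sketch): build a graph node now, compute later.
class Add:
    """A hypothetical deferred expression node: records inputs, computes on demand."""
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs
    def eval(self, bindings):
        lhs = bindings.get(self.lhs, self.lhs)   # bind free variables by name
        rhs = bindings.get(self.rhs, self.rhs)
        return lhs + rhs

graph = Add("b", 1)                              # nothing is computed yet
print(graph.eval({"b": np.ones((2, 3))}))        # computation happens only now
```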
Symbol: Declarative Symbolic Expressions
Figure 1: MXNet overview. (The stack spans language bindings for C/C++, Python, R, and Julia; NDArray and symbolic expressions with a binder; the KVStore and communication layer; a dependency engine; and BLAS backends running on CPU, GPU, Android, and iOS.)

MXNet uses multi-output symbolic expressions, Symbol, to declare the computation graph. Symbols are composited from operators, such as simple matrix operations (e.g. "+") or a complex neural network layer (e.g. a convolution layer). An operator can take several input variables, produce more than one output variable, and have internal state variables. A variable can be either free, in which case we can bind it with a value later, or the output of another symbol. Figure 2 shows the construction of a multi-layer perceptron symbol by chaining a variable, which represents the input data, with several layer operators.
```julia
using MXNet

mlp = @mx.chain mx.Variable(:data)        =>
      mx.FullyConnected(num_hidden=64)    =>
      mx.Activation(act_type=:relu)       =>
      mx.FullyConnected(num_hidden=10)    =>
      mx.Softmax()
```
Figure 2: Symbol expression construction in Julia.

```python
>>> import mxnet as mx
>>> a = mx.nd.ones((2, 3),
...                mx.gpu())
>>> print (a * 2).asnumpy()
[[ 2.  2.  2.]
 [ 2.  2.  2.]]
```

Figure 3: NDArray interface in Python.

To evaluate a symbol, we need to bind the free variables with data and declare the required outputs. Besides evaluation ("forward"), a symbol supports automatic symbolic differentiation ("backward"). Other functions, such as load, save, memory estimation, and visualization, are also provided for symbols.
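As a rough Python counterpart of Figure 2, the following sketch builds a similar MLP as a Symbol, binds its free variables, and runs forward and backward. It assumes the mx.sym operator names and the executor's simple_bind/forward/backward methods; exact names and defaults may differ between MXNet versions.

```python
import mxnet as mx

# Declare the computation graph: chain a free variable through layer operators.
data = mx.sym.Variable('data')
net  = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=64)
net  = mx.sym.Activation(data=net, name='relu1', act_type='relu')
net  = mx.sym.FullyConnected(data=net, name='fc2', num_hidden=10)
net  = mx.sym.SoftmaxOutput(data=net, name='softmax')

# Bind the free variables to concrete shapes and a device, then evaluate.
exe = net.simple_bind(ctx=mx.cpu(), data=(32, 100))
exe.forward(is_train=True)        # "forward": compute the declared outputs
exe.backward()                    # "backward": automatic symbolic differentiation
print(exe.outputs[0].shape)       # (32, 10)
print(exe.grad_arrays[0].shape)   # gradient w.r.t. the 'data' variable
```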
NDArray: Imperative Tensor Computation
MXNet offers NDArray with imperative tensor computation to fill the gap between the declarative symbolic expression and the host language. Figure 3 shows an example that performs matrix-constant multiplication on the GPU and then prints the result via numpy.ndarray.

The NDArray abstraction works seamlessly with the executions declared by Symbol, so we can mix the imperative tensor computation of the former with the latter. For example, given a symbolic neural network and the weight updating function, e.g. w = w − ηg, we can implement gradient descent by

while(1) { net.forward_backward(); net.w -= eta * net.g; }

The above is as efficient as an implementation using a single, but often much more complex, symbolic expression. The reason is that MXNet uses lazy evaluation of NDArray and the backend engine can correctly resolve the data dependency between the two.
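Continuing the sketch above (reusing net and exe; the loop structure and names are illustrative, not MXNet's training module), the mixed style looks roughly like this in Python:

```python
# Hypothetical training loop: symbolic forward/backward, imperative updates.
eta = 0.1
for _ in range(100):                                   # a few illustrative iterations
    exe.forward(is_train=True)
    exe.backward()
    for name, w, g in zip(net.list_arguments(), exe.arg_arrays, exe.grad_arrays):
        if g is None or name in ('data', 'softmax_label'):
            continue                                   # only update the learned parameters
        w -= eta * g                                   # imperative NDArray update, lazily scheduled
```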
KVStore: Data Synchronization Over Devices
The KVStore is a distributed key-value store for data synchronization over multiple devices. It supports two primitives: push a key-value pair from a device to the store, and pull the value of a key from the store. In addition, a user-defined updater can specify how to merge the pushed values. Finally, model divergence is controlled via a consistency model [8]. Currently, we support sequential and eventual consistency.

The following example implements distributed gradient descent with data parallelism:

while(1) { kv.pull(net.w); net.forward_backward(); kv.push(net.g); }

where the weight updating function is registered to the KVStore, and each worker repeatedly pulls the newest weight from the store and then pushes out the locally computed gradient.

The above mixed implementation has the same performance as a single declarative program, because the actual data push and pull are executed by lazy evaluation and are scheduled by the backend engine just like all other operations.
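A minimal single-machine Python sketch of these primitives, assuming the mx.kv interface (the key, updater, and learning rate are illustrative; a distributed store is created the same way with a different store type):

```python
import mxnet as mx

kv = mx.kv.create('local')                 # single-machine store; 'dist_sync' would be distributed
shape = (2, 3)
kv.init(3, mx.nd.ones(shape))              # key 3 holds the shared weight

# Register a user-defined updater that merges pushed gradients into the stored weight.
def sgd_update(key, grad, weight):
    weight -= 0.1 * grad                   # w = w - eta * g
kv._set_updater(sgd_update)

kv.push(3, mx.nd.ones(shape))              # push a locally computed "gradient"
out = mx.nd.zeros(shape)
kv.pull(3, out=out)                        # pull the merged weight back
print(out.asnumpy())                       # 0.9 everywhere after one update
```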
MXNet ships with tools to pack arbitrarily sized examples into a single compact file to facilitate both sequential and random seeks. Data iterators are also provided. Data pre-fetching and pre-processing are multi-threaded, reducing the overhead due to possible remote file store reads and/or image decoding and transformation.

The training module implements commonly used optimization algorithms, such as stochastic gradient descent. It trains a model given a symbolic module and data iterators, optionally in a distributed manner if an additional KVStore is provided.
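As a sketch of how these pieces could be wired together, assuming the NDArrayIter data iterator and a Module-style fit interface (the parameter names here are our assumption and may not match the training module exactly):

```python
import numpy as np
import mxnet as mx

# Synthetic data wrapped in a provided data iterator.
X = np.random.uniform(size=(1000, 100)).astype('float32')
y = np.random.randint(0, 10, size=(1000,))
train_iter = mx.io.NDArrayIter(X, y, batch_size=32, shuffle=True)

# `net` is the MLP symbol constructed in the earlier sketch.
mod = mx.mod.Module(symbol=net, context=mx.cpu())
mod.fit(train_iter,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.05},
        num_epoch=2,
        kvstore='local')   # pass a distributed KVStore here to train across machines
```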
Implementation

Figure 4: Computation graph for both forward and backward. (Nodes include the fullc and relu operations with inputs X, W, b and the corresponding gradient nodes ∂fullc, ∂relu, ∂X, ∂W, ∂b.)

A bound symbolic expression is represented as a computation graph for evaluation. Figure 4 shows part of the graph for both the forward and backward passes of the MLP symbol in Figure 2. Before evaluation, MXNet transforms the graph to optimize efficiency and allocates memory to internal variables.
Graph Optimization.
We explore the following straightforward optimizations. First, only the subgraph required to obtain the outputs specified during binding is needed. For example, in prediction only the forward graph is needed, while for extracting features from internal layers, the last layers can be skipped, as in the sketch below. Second, operators can be grouped into a single one. For example, a × b + 1 is replaced by a single BLAS or GPU call. Finally, we manually implemented well-optimized "big" operations, such as a neural network layer.
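For instance, a feature-extraction bind only evaluates the needed subgraph. A small sketch reusing the net symbol from the earlier example (get_internals and the '_output' naming are assumptions about the symbol API):

```python
# Extract features from an internal layer: the fc2 and softmax layers are never executed.
feat = net.get_internals()['relu1_output']
feat_exe = feat.simple_bind(ctx=mx.cpu(), data=(32, 100))
feat_exe.forward(is_train=False)
print(feat_exe.outputs[0].shape)   # (32, 64): only the subgraph up to relu1 was evaluated
```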
Memory Allocation.

Note that each variable's lifetime, namely the period between its creation and the last time it is used, is known for a computation graph. So we can reuse memory for variables whose lifetimes do not intersect. However, an ideal allocation strategy requires O(n²) time complexity, where n is the number of variables.

We propose two heuristic strategies with linear time complexity. The first, called inplace, simulates the procedure of traversing the graph, keeping a reference counter of the depending nodes that have not yet been used; when the counter reaches zero, the memory is recycled. The second, named co-share, allows two nodes to share a piece of memory if and only if they cannot be run in parallel. Exploring co-share imposes one additional dependency constraint. In particular, each time upon scheduling, among the pending paths in the graph, we find the longest path and perform the needed memory allocations.
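The inplace heuristic can be illustrated with a toy simulation (a simplified sketch, not MXNet's actual allocator): walk the graph in topological order, keep a counter of remaining consumers for each node, and recycle a node's buffer as soon as its counter reaches zero.

```python
# Toy simulation of the "inplace" heuristic on a small graph.
# Each node is (name, list_of_input_names); topological order is assumed.
graph = [
    ("x",    []),
    ("fc1",  ["x"]),
    ("relu", ["fc1"]),
    ("fc2",  ["relu"]),
    ("out",  ["fc2"]),
]

# Reference counter: how many later nodes still need each output.
consumers = {name: 0 for name, _ in graph}
for _, inputs in graph:
    for i in inputs:
        consumers[i] += 1

free_buffers, buffer_of, next_id = [], {}, 0
for name, inputs in graph:
    # Allocate: reuse a recycled buffer if one is available.
    if free_buffers:
        buffer_of[name] = free_buffers.pop()
    else:
        buffer_of[name] = next_id
        next_id += 1
    # After "executing" this node, release inputs whose counters reach zero.
    for i in inputs:
        consumers[i] -= 1
        if consumers[i] == 0:
            free_buffers.append(buffer_of[i])

print(buffer_of)                                  # {'x': 0, 'fc1': 1, 'relu': 0, 'fc2': 1, 'out': 0}
print(next_id, "buffers instead of", len(graph))  # 2 buffers instead of 5
```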
Dependency Engine.

In MXNet, each resource unit, including NDArray, random number generators and temporal space, is registered to the engine with a unique tag. Any operation, such as a matrix operation or data communication, is then pushed into the engine with its required resource tags specified. The engine continuously schedules the pushed operations for execution once their dependencies are resolved. Since there usually exist multiple computation resources such as CPUs, GPUs, and the memory/PCIe buses, the engine uses multiple threads to schedule the operations for better resource utilization and parallelization.

Differently from most dataflow engines [14], our engine tracks mutations on existing resource units. That is, it supports specifying the tags that an operation will write in addition to those it will read. This enables scheduling of array mutations as in numpy and other tensor libraries. It also enables easier memory reuse of parameters, by representing parameter updates as mutations of the parameter arrays, and it makes scheduling some special operations easier. For example, when generating two random numbers with the same random seed, we can inform the engine that they will both write the seed, so that they should not be executed in parallel. This helps reproducibility.
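The effect of tracking writes as well as reads can be sketched with a toy conflict check (a simplified illustration, not the engine's implementation): two operations may run in parallel only if neither writes a tag that the other reads or writes.

```python
# Toy model of mutation-aware dependency tracking: each pushed operation
# declares the resource tags it reads and the tags it writes (mutates).
def conflicts(op_a, op_b):
    """Two operations conflict if one writes a tag the other reads or writes."""
    ra, wa = op_a
    rb, wb = op_b
    return bool(wa & (rb | wb) or wb & (ra | wa))

# (reads, writes) over tags: 'w' (weights), 'g' (gradient), 'seed'
forward_backward = ({'w'}, {'g'})      # reads weights, writes the gradient
sgd_update       = ({'g'}, {'w'})      # reads the gradient, mutates weights in place
rand_a           = (set(), {'seed'})   # both random ops mutate the shared seed,
rand_b           = (set(), {'seed'})   # so they must not run in parallel
eval_a           = ({'w'}, set())      # two pure reads of the weights
eval_b           = ({'w'}, set())

print(conflicts(forward_backward, sgd_update))   # True : must be serialized
print(conflicts(rand_a, rand_b))                 # True : preserves reproducibility
print(conflicts(eval_a, eval_b))                 # False: reads can proceed in parallel
```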
Figure 5: Communication. (Two-level structure: devices dev0 and dev1 within worker0 and worker1; a level-1 server per machine and a level-2 server across machines.)

We implemented the KVStore based on the parameter server [8, 9, 4] (Figure 5). It differs from previous work in two aspects. First, we use the engine to schedule the KVStore operations and manage the data consistency. This strategy not only makes data synchronization work seamlessly with computation, but also greatly simplifies the implementation. Second, we adopt a two-level structure. A level-1 server manages the data synchronization between the devices within a single machine, while a level-2 server manages inter-machine synchronization. Outbound data from a level-1 server can be aggregated, reducing the bandwidth requirement; intra- and inter-machine synchronization can use different consistency models (e.g. intra-machine is sequential and inter-machine is eventual).

Figure 6: Comparison of MXNet with TensorFlow, Caffe, and Torch7 on single forward-backward performance (time in ms) for alexnet, googlenet, and vgg.

Figure 7: Internal memory usage (GB) of MXNet under the naive, inplace, co-share, and inplace & co-share allocation strategies, for forward only (left) and forward-backward (right) with batch size 64.
Raw performance
We first compare MXNet with Torch7, Caffe, and TensorFlow on the popular "convnet-benchmarks" [2]. All these systems are compiled with CUDA 7.5 and CUDNN 3, except for TensorFlow, which only supports CUDA 7.0 and CUDNN 2. We use batch size 32 for all networks and run the experiments on a single Nvidia GTX 980 card. Results are shown in Figure 6. As expected, MXNet has performance similar to Torch7 and Caffe, because most of the computation is spent in the CUDA/CUDNN kernels. TensorFlow is consistently 2x slower, which might be due to its use of a lower CUDNN version.
Memory usage
Figure 7 shows the memory usage of the internal variables, excluding the outputs. As can be seen, both "inplace" and "co-share" effectively reduce the memory footprint. Combining them leads to a 2x reduction for all networks during model training, and improves this further to a 4x reduction for model prediction. For instance, even for the most expensive VGG net, the extra memory needed for training remains modest.

Figure 8: Test accuracy of googlenet versus the number of data passes on the ILSVRC12 dataset, on 1 and 10 machines.
Scalability
We run the experiment on Amazon EC2 g2.8x instances, each of which ships with four Nvidia GK104 GPUs and 10G Ethernet. We train googlenet with batch normalization [6] on the ILSVRC12 dataset [13], which consists of 1.3 million images and 1,000 classes. We fix the learning rate, momentum, and weight decay, and feed each GPU with a fixed number of images per batch.

The convergence results are shown in Figure 8. As can be seen, compared to a single machine, the distributed training converges more slowly at the beginning, but outperforms it after 10 data passes. The average cost of a data pass is 14K and 1.4K seconds on a single machine and on 10 machines, respectively. Consequently, this experiment reveals a super-linear speedup.

Conclusion

MXNet is a machine learning library combining symbolic expression with tensor computation to maximize efficiency and flexibility. It is lightweight, embeds in multiple host languages, and can be run in a distributed setting. Experimental results are encouraging. While we continue to explore new design choices, we believe it can already benefit the relevant research community. The code is available at http://dmlc.io.

Acknowledgment.
We sincerely thank Dave Andersen, Carlos Guestrin, Tong He, Chuntao Hong, Qiang Kou, Hu Shiwen, Alex Smola, Junyuan Xie, Dale Schuurmans, and all other contributors.

References

[1] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012.
[2] Soumith Chintala. Easy benchmarking of all public open-source implementations of convnets, 2015. https://github.com/soumith/convnet-benchmarks.
[3] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Neural Information Processing Systems, 2012.
[5] Chainer Developers. Chainer: A powerful, flexible, and intuitive framework of neural networks, 2015. http://chainer.org/.
[6] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[7] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[8] M. Li, D. G. Andersen, J. Park, A. J. Smola, A. Amhed, V. Josifovski, J. Long, E. Shekita, and B. Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.
[9] M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In Neural Information Processing Systems, 2014.
[10] Min Lin, Shuo Li, Xuan Luo, and Shuicheng Yan. Purine: A bi-graph based deep learning framework. arXiv preprint arXiv:1412.6249, 2014.
[11] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems. 2015.
[12] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[13] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge.