SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks
Linnan Wang §, Jinmian Ye †, Yiyang Zhao †, Wei Wu ‡, Ang Li ⨿, Shuaiwen Leon Song ⨿, Zenglin Xu †, Tim Kraska ♦§
§ Brown University  † Univ. of Electr. Sci. & Tech. of China  ‡ Los Alamos National Laboratory  ⨿ Pacific Northwest National Laboratory  ♦ Massachusetts Institute of Technology
Abstract
Going deeper and wider in neural architectures improves their accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desired network architectures, or nontrivially dissect a network across multiple GPUs. These distract DL practitioners from concentrating on their original machine learning tasks. We present SuperNeurons: a dynamic GPU memory scheduling runtime to enable network training far beyond the GPU DRAM capacity. SuperNeurons features 3 memory optimizations, Liveness Analysis, Unified Tensor Pool, and Cost-Aware Recomputation; together they effectively reduce the network-wide peak memory usage down to the maximal memory usage among layers. We also address the performance issues in those memory-saving techniques. Given the limited GPU DRAM, SuperNeurons not only provisions the necessary memory for the training, but also dynamically allocates the memory for convolution workspaces to achieve high performance. Evaluations against Caffe, Torch, MXNet and TensorFlow demonstrate that SuperNeurons trains at least 3.2432x deeper networks than current ones with the leading performance. Particularly, SuperNeurons can train ResNet2500 that has $10^4$ basic network layers on a 12GB K40c.

Keywords: Neural Networks, GPU Memory Management, Runtime Scheduling
Deep Neural Networks (DNN) are efficient at modeling complex nonlinearities thanks to the unparalleled representation power from millions of parameters. This implies that scaling up neural networks is an effective approach to improve the generalization performance. The Deep Learning (DL) community now widely acknowledges that either going deeper or going wider on the nonlinear architecture improves the quality of image recognition tasks. For example, the 9-layer AlexNet won the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) with a top-5 error of 17%. GoogLeNet (Inception v1) refreshed the top-5 error rate to 6.67% with 22 inception units in the 2014 ILSVRC, and ResNet further reduced the error rate down to 3.57% in the 2015 ILSVRC with 152 residual units.

While DL practitioners are enthusiastically seeking deeper and wider nonlinear networks, the limited size of GPU DRAM becomes a major restriction. Training a deep network is inherently a computing-intensive task. Almost every AI lab today, either in academia or industry, is deploying the network training on GPUs for the demand of high performance [3]. Data need to reside in GPU DRAM for GPU computing, but the largest commercial GPU DRAM so far is 24 GB. This is still far from sufficient to accommodate a deep neural network. For example, the latest Inception v4 has 515 basic layers consuming 44.3 GB of memory in the training. The deeper or wider we go, the higher the memory usage will be. Therefore, this deep trend subjects the rigid GPU DRAM to severe insufficiency.

Major DL frameworks, such as Caffe or MXNet, have tried to alleviate the GPU memory shortage with several static memory reduction techniques. Those techniques, due to their static nature, are not well tuned to address the new data and dependency variations in non-linear networks. For example, Caffe and Torch do not fully support the data flow analysis on non-linear neural networks; the trading-computation-for-memory strategy in MXNet is limited for ignoring the memory variations across network layers. These limitations have motivated us to propose a dynamic approach for the emerging deep nonlinear neural architectures.

In this paper, we present the first dynamic GPU memory scheduling runtime for training deep non-linear neural networks. The runtime allows DL practitioners to explore a much deeper and wider model beyond the physical limitations of GPU memory. It utilizes tensors as the fundamental scheduling units to be consistent with the layer-wise computations enforced in the DL performance primitives cuDNN [7]. The runtime seamlessly orchestrates the tensor placement, movement, allocation and deallocation so that the underlying memory operations are entirely transparent to users.

Our runtime guarantees the minimal peak memory usage, $peak_m = \max(l_i)$, at the layer-wise granularity. We denote the memory usage of the $i$th layer as $l_i$, and the superscript, e.g. $l_i^f$ or $l_i^b$, as the forward/backward. The peak memory usage during the forward and backward computations is denoted as $peak_m$. First, Liveness Analysis recycles no-longer-needed tensors to reduce $peak_m$ from the baseline $\sum_{i=1}^{N} l_i^f + \sum_{i=1}^{N} l_i^b$ to $\sum_{i=1}^{N} l_i^f + l_N^b$ (defined in Sec.3). Secondly, Unified Tensor Pool (UTP) offloads tensors in compute-intensive layers, referred to as checkpoints, to the external physical memory. This further reduces $peak_m$ from $\sum_{i=1}^{N} l_i^f + l_N^b$ to $\sum_{i=1}^{N} (l_i^f \notin checkpoints) + l_N^b$. Finally, Cost-Aware Recomputation drops the forward results of cheap-to-compute or non-checkpoint layers and reconstructs them, to reduce $peak_m$ from $\sum_{i=1}^{N} (l_i^f \notin checkpoints) + l_N^b$ to $peak_m = \max(l_i)$. The final $peak_m$ indicates that the largest computable network is bounded by the maximal memory usage among layers.

Our runtime also features three performance optimizations to improve the efficiency of Liveness Analysis and UTP. First, GPUs require memory allocations to create tensors and deallocations to free tensors. Thus, the high-frequency large tensor allocations/deallocations incur non-negligible overhead in Liveness Analysis [26]. The runtime successfully amortizes the cost by directly reusing memory segments from a huge preallocated memory pool, managed by a heap-based GPU memory management utility. Secondly, UTP swaps tensors among different physical memory spaces, while modern GPUs are equipped with independent Direct Memory Access (DMA) engines, exposing opportunities to hide communications under computations. The runtime therefore meticulously overlaps communications with computations. However, the overlapping opportunity is limited given the fixed amount of computations. We propose an LRU-based Tensor Cache built on GPU DRAM to minimize the total communications through tensor reuse.

This paper claims the following contributions:
• We demonstrate the new memory scheduling challenges in nonlinear neural networks, and discuss the key limitations of existing approaches.
• We design and implement SuperNeurons to enable DL practitioners to explore deep neural networks; the largest computable network of SuperNeurons is only bounded by the max memory usage among layers.
• By dynamically allocating memory for convolution workspaces, SuperNeurons delivers the leading performance among state-of-the-art DL systems on the GPU.
Figure 1.
The non-linear connections, (a) fan and (b) join, in Inception v4 (fan), ResNet (join, left) and DenseNet (join, right). DenseNet utilizes a full-join.
Traditional Convolutional Neural Networks (CNN) [17, 18, 21] are typically composed of several basic building layers, including Convolution (CONV), Pooling (POOL), Activation (ACT), Softmax, Fully Connected (FC), Local Response Normalization (LRN), Batch Normalization (BN), and Dropout. For linear CNNs, these layers are independent and inter-connected to their neighbors in a sequential manner: $1 \leftrightarrow 2 \leftrightarrow \cdots \leftrightarrow n$. Recently, several deep non-linear neural architectures have been proposed to further improve the state-of-the-art accuracy on the 1K ImageNet recognition challenge, e.g., Inception v4 [22], ResNet [13], and DenseNet [14]. These prominent network designs (especially the ones that solve the classic gradient vanishing problem [4]) pave the algorithmic foundation for DL practitioners to harness the unparalleled representation power brought forth by super-deep non-linear neural architectures. For example, the latest Inception v4 delivers 95% top-5 accuracy with 515 basic building layers, while ResNet151 achieves 94.3% top-5 accuracy with 567 layers (here 151 represents the number of convolutional units). In Figure 1, we illustrate two classic types of non-linear connections: fan and join. Compared with the linear connection pattern, the sparse fan-out connection (Figure 1a) avoids one huge computing-inefficient dense layer [23], while the join connection prevents gradients from quickly vanishing in the back-propagation [13].

Training these super-deep and complex non-linear neural architectures is a computation-intensive task. Due to their DL-friendly architecture designs and massive parallelism, GPUs have been widely adopted in today's industry and academia for efficient neural network training. However, there are critical issues for efficiently training these newly developed super-deep non-linear neural architectures: limited GPU resident memory and a high degree of variation in computational dependencies.

Challenge I: Limited GPU Resident Memory.
The prominent deep neural architectures share a common feature: high memory demand and computation intensity.
Figure 2.
The left axis depicts the memory usages of the networks. The batch size of AlexNet is 200, and the rest use 32. The right axis and red x marks depict the speedup (imgs/s) with and without convolution workspaces.

Figure 2 illustrates the network-wide memory usages of several recent DNNs during training, with and without convolution workspaces (buffers). Among them, AlexNet and VGG are linear networks while the others are non-linear. We can observe that the non-linear networks demand a significant amount of GPU memory, e.g., ResNet152 and Inception v4 require up to 18.5GB and 44.3GB at a batch size of only 32, respectively. However, these sizes either approach or surpass the resident memory sizes of commercial GPUs on the market today. For instance, the newest generations of NVIDIA Pascal and Volta GPUs only have 16GB with HBM2 enabled (e.g., P100 and V100), while the one with the most memory available in recent generations is the P40 with 24GB GDDR5. This limitation poses a major bottleneck for deep learning practitioners exploring deep and wide neural architectures [19, 22, 23]. The most straightforward solution is to split the network across GPUs, i.e. Model Parallelism. However, splitting either the computations of a network or a layer incurs excessive intra-network and intra-layer communications that drastically deteriorate the performance. For example, recent work has suggested the deficiency of applying model parallelism to deep neural networks: it compromises at least 40% of speed when scaling the training of a network with 1.3 billion parameters from 36 GPUs to 64 GPUs [8]. To address the performance issues of Model Parallelism, Data Parallelism has been widely adopted in today's mainstream deep learning frameworks such as Caffe [15], TensorFlow [2], Torch [9], and MXNet [5]. In this model, each GPU holds a network replica, and one GPU computes one sub-gradient with a sub-batch. Subsequently, all sub-gradients are aggregated into one global gradient to update the network parameters [25]. Although this process does not incur intra-network or intra-layer communications besides the necessary gradient exchanges, it requires the network training to fit in the limited GPU DRAM. In this paper, we focus on addressing the GPU memory shortage for training deep neural networks under the data parallelism model while taking the training performance into design considerations.
Challenge II: Variations in Computational Dependencies for Nonlinear Networks.
Nonlinear networks exhibit a high degree of dependency variations, while linear networks follow a fixed sequential execution pattern with predictable data dependencies [20]. Fig.3 illustrates the data dependency graphs for linear (a) and nonlinear (b and c) neural architectures. One typical training iteration consists of two phases: forward and backward propagation. For linear networks, data is sequentially propagated in the forward pass, and a layer's backward computation is simply contingent upon the previous layer, as illustrated in Figure 3a. Thus their computation and dependency patterns are static regardless of the total layers involved.

However, nonlinear networks show a high degree of variation in their computational dependencies. Fig.3b and 3c show two simple examples of join and fan nonlinear connections. A join connection forwards a layer's output tensor to another layer, creating a dependency between the two layers. For example, the join in Fig.3b forwards t0 from the DATA layer to the FC layer in the forward pass. The dependency of join-based non-linear networks is non-deterministic as any two layers can be connected with a join, e.g., in DenseNet. A fan connection creates multiple branches in the execution flow: the DATA layer forks two branches and joins them before the FC layer. Separate branches, each with a different number of layers, have to finish before joining back to the original branch, making this execution sequence nonlinear. Although the two basic nonlinear scenarios shown here are intuitive, a typical deep nonlinear network today has hundreds of joins and fans convoluted together, resulting in a complex network architecture. These significantly complicate runtime resource management compared to the static computational pattern of linear networks. Therefore, the memory scheduling of deep non-linear neural networks demands a dynamic solution to effectively address these variations in both the execution flow and the computational dependencies.

Figure 3. Data dependencies of different neural architectures: (a) linear, (b) join (nonlinear), (c) fan (nonlinear). Tensors in red are ready to free when back-propagation reaches the POOL layer. Solid lines represent forward dependencies and dashed lines represent backward dependencies.

Several static memory reduction techniques have been implemented in today's deep learning frameworks to address the GPU memory shortage at the data parallelism level. For example, Caffe and Torch directly reuse the forward data tensors for the backward data propagation, which saves up to 50% of memory on a linear network [1]. Although this technique works well on linear networks, it requires extra tensors to hold the future dependencies for training non-linear networks, thereby limiting the effectiveness and efficiency. Also, these frameworks still have to fit the entire network into GPU DRAM without leveraging NUMA architectures, and this level of reuse is arguably not adequate for contemporary deep nonlinear neural networks. MXNet and TensorFlow are built with a Directed Acyclic Graph (DAG) execution engine [28]. Users explicitly define the computation flow and tensor dependencies, which provide the necessary information for the DAG engine to analyze the life span of tensors. Both systems then free tensors that are no longer needed in order to save memory. MXNet also implements a per-layer recomputation strategy that is similar to Resilient Distributed Datasets (RDD) in Spark. Basically, it frees the tensors produced by computing-cheap layers in the forward pass, and recomputes the freed dependencies for the backward pass by doing another forward. However, this method neglects the non-uniform memory distribution across network layers, consequently demanding large unnecessary memory usages. TensorFlow swaps long-lived data tensors from GPU DRAM to CPU DRAM, but it fails to optimize data communications between the two (e.g., by utilizing pinned data transfers), which compromises at least 50% of the communication speed.

More importantly, none of the aforementioned DL frameworks utilize a dynamic scheduling policy that provisions the necessary memory space for deep nonlinear network training while at the same time optimizing the training speed given the existing GPU DRAM resource. In other words, these static memory-saving techniques aggressively reduce the GPU memory usage at the expense of speed. Users either painstakingly tune the performance or suffer from insufficient memory during the execution. Additionally, these frameworks either have no optimization strategy or adopt a naive method for allocating the convolution workspace (see Section 3.5), which is a decisive factor determining CNN training speed on the GPU. In summary, these challenges motivate us to design a dynamic scheduling runtime that provisions the necessary memory for the training while maximizing the memory for convolution workspaces to optimize the training speed.
This section elaborates on three memory optimization techniques and their related performance issues in SuperNeurons. From a high-level perspective, SuperNeurons provisions the necessary memory spaces for the training while maximizing the speed by seeking convolution workspaces within the constraint of the native GPU memory size.
Notations and Baseline Definition:
To facilitate the analysis of the proposed techniques, we denote the forward memory usage of the $i$th layer as $l_i^f$, and the backward as $l_i^b$. We denote the peak memory usage as $peak_m$. We use the naive network-wide tensor allocation strategy as the baseline, which allocates an independent tensor for each memory request. Thus, the $peak_m$ of the baseline is $\sum_{i=1}^{N} l_i^f + \sum_{i=1}^{N} l_i^b$. We also denote the maximal memory usage among layers as $l_{peak} = \max(l_i)$, where $i \in [1, N]$ and $N$ represents the network length. $t_i$ represents the $i$th tensor.

First, Liveness Analysis reduces the baseline $peak_m$ to $\sum_{i=1}^{N} l_i^f + l_N^b$ by recycling free tensors amid back-propagation, demonstrating up to 50% memory saving. This technique is guaranteed to work on various non-linear architectures, and it is constructed in $O(N^2)$. Liveness Analysis involves high-frequency memory operations on large memory chunks, while the native memory utilities, e.g. cudaMalloc and cudaFree, incur nontrivial overhead. We address this issue with a preallocated heap managed by the runtime.

Secondly, Unified Tensor Pool (UTP) further reduces $peak_m$ to $\sum_{i=1}^{N} (l_i^f \notin checkpoints) + l_N^b$, where checkpoints represent the compute-intensive layers such as FC and CONV. UTP provides a consolidated memory abstraction over external memory pools to supply the training. Instead of using naive on-demand data transfers, it hides communications under computations. While the overlapping opportunity is limited given the fixed amount of computations, UTP further introduces a Tensor Cache built on the GPU to reduce communications.

Finally, Cost-Aware Recomputation reduces $peak_m$ to $\max(l_i)$, the minimum at the layer-wise granularity. The method keeps track of the memory distributions among checkpoints to minimize the extra computations while ensuring $peak_m \le \max(l_i)$.

Figure 4. The structure of tensors used in DNN.

A typical DNN network layer computes on a 4-dimensional tensor indexed by batches (N), image channels (C), height (H) and width (W) (Fig.4). Since cuDNN operates at the layer granularity, we use tensors as the basic memory scheduling unit.

Figure 5. Applying Liveness Analysis on the nonlinear network shown in Fig.3c. The number after the layer name (e.g., DATA0, CONV1, etc.) represents the step, as calculated by Alg.1. We mark the prerequisite tensors for a layer in red, e.g. the tensors required by CONV9. Each in and out set tracks the live tensors before and after the layer's computation.

Algorithm 1: Construct execution steps for nonlinear neural architectures
Data: neural architecture definitions
Result: execution order

Function RouteConstruct(layer):
    if layer is NULL then return
    layer→counter_inc()
    if layer→get_counter() < size of prev layers then return
    computation_route.push(layer)
    next_layers = layer→get_next()
    for next_l ∈ next_layers do
        RouteConstruct(next_l)
    reset layer→counter to 0
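As an illustration of the recursion in Algorithm 1, the following is a minimal C++ sketch, assuming a hypothetical Layer struct with predecessor and successor lists and a dependency counter; the names are illustrative and not the runtime's actual interfaces.

```cpp
#include <vector>

struct Layer {
    std::vector<Layer*> prev;   // incoming dependencies
    std::vector<Layer*> next;   // outgoing edges
    int counter = 0;            // number of finished predecessors
};

// Appends layers to 'route' in an order that respects joins:
// a layer is emitted only after all of its predecessors have been visited.
void route_construct(Layer* layer, std::vector<Layer*>& route) {
    if (layer == nullptr) return;
    layer->counter++;                                   // one more predecessor finished
    if (layer->counter < (int)layer->prev.size())       // wait at a join
        return;
    route.push_back(layer);                             // all inputs ready: schedule it
    for (Layer* n : layer->next)                        // DFS into successors
        route_construct(n, route);
    layer->counter = 0;                                 // reset for the next iteration
}
```

Calling route_construct on the first network layer yields the forward execution order; the backward order mirrors it in reverse, as the forward/backward step pairs in Fig.6 suggest.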
Figure 6. Execution route created by Algorithm 1 on a non-linear network. The left digit represents the forward step, while the right digit represents the backward step.

Alg.1 describes how SuperNeurons constructs execution steps for nonlinear neural architectures. The input is the first network layer; Alg.1 then recursively explores the subsequent layers in Depth-First Searching (DFS) order, except that at a join all prior layers must finish before proceeding. This behavior is achieved by the counter in each layer that tracks the input dependencies. Fig.6 shows the route constructed on a nonlinear network with layers a to j. Note that this network has two fan structures (layers b, c, d and layers f, g, h) nested together. Alg.1 successfully identifies layers e, g and h as the prerequisites for executing i.

Liveness analysis enables different tensors to reuse the same physical memory at different time partitions. Our runtime implements a simple yet effective variant of the traditional data flow analysis, constructed in $O(N^2)$, for various nonlinear neural networks. The general procedure is as follows:
1. We construct an in and an out set for every layer to track the live tensors before and after the layer, which costs $O(N)$, where $N$ is the network length.
2. The runtime populates a layer's in and out sets by checking the dependencies of subsequent layers. It eliminates tensors from the out set if no subsequent layers need them. The cost is $N(N-1)/2 \sim O(N^2)$ as each check costs $N-1, N-2, \dots, 2, 1$, respectively.

Fig.5 demonstrates the detailed procedure of Liveness Analysis on the network shown in Fig.3c. It explicitly lists the content of the in and out sets at each step. For instance, for FC7, in = {t0, t1, t3, t2, t5}. It needs to create tensor t6 to finalize the current computation. Since t2 and t5 are no longer needed after FC7, the runtime eliminates them from FC7's out set (step 7).

Liveness Analysis reduces the baseline $peak_m = \sum_{i=1}^{N} l_i^f + \sum_{i=1}^{N} l_i^b$ to $\sum_{i=1}^{N} l_i^f + l_N^b$. To simplify the analysis, let's assume identical memory usage on every layer, i.e. $l_i^f = l_i^b$ where $i \in [1, N]$. In the network training, the results of the forward pass are needed by the backward propagation [7, 27] (not all layers require the previous forward output for the back-propagation; again, we simplify the case for the analysis). Therefore, the forward total memory usage at step $k$ is $cost_k^f = \sum_{i=1}^{k} l_i^f$, where $k \le N$. During the back-propagation, Liveness Analysis frees $l_i^f$ and $l_i^b$ for $i \in [k+1, N]$ at the backward step $k$, since there are no future dependencies on them, as demonstrated in Fig.5. Therefore, the backward total memory usage at step $k$ is $cost_k^b = \sum_{i=1}^{k} l_i^f + l_k^b$ with $k \le N$. Since $l_i > 0$, $peak_m = \max(\max_k(cost_k^f), \max_k(cost_k^b)) = \sum_{i=1}^{N} l_i^f + l_N^b$. Therefore, Liveness Analysis saves up to 50% memory from the baseline.
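To make the two-step procedure above concrete, here is a small C++ sketch that derives, for each execution step, which of the already-produced tensors must stay live; the Step representation and integer tensor ids are illustrative only, not the runtime's internal data structures.

```cpp
#include <set>
#include <unordered_map>
#include <vector>

struct Step {
    std::vector<int> uses;   // tensor ids this step reads
    int produces;            // tensor id this step writes
};

// For every step k, compute the tensors that must stay resident after k:
// everything produced at or before k that a later step still uses.
std::vector<std::set<int>> liveness(const std::vector<Step>& steps) {
    std::unordered_map<int, size_t> last_use;            // tensor id -> last step that reads it
    for (size_t k = 0; k < steps.size(); ++k)
        for (int t : steps[k].uses) last_use[t] = k;

    std::vector<std::set<int>> out(steps.size());
    std::set<int> produced;
    for (size_t k = 0; k < steps.size(); ++k) {
        produced.insert(steps[k].produces);
        for (int t : produced)
            if (last_use.count(t) && last_use[t] > k)
                out[k].insert(t);                         // still live after step k
        // tensors in 'produced' but not in out[k] can be recycled at step k
    }
    return out;
}
```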
As both the empty initial in set at step 0 and the empty final out set at step 11 in Fig.5 demonstrate, Liveness Analysis frequently stashes and frees tensors on the fly within a training iteration, while a typical training phase consists of millions of iterations; such intense memory operations incur nontrivial overhead when using the native cudaMalloc and cudaFree [26]. According to our experiment, ResNet50 wastes 36.28% of the training time on memory allocations/deallocations with cudaMalloc and cudaFree. To alleviate this performance issue, we implement a fast heap-based GPU memory pool utility. The core concept is to remove the allocation/deallocation overhead by preallocating a big chunk of GPU memory as a shared memory pool. We then divide the entire GPU memory pool into 1KB blocks as the basic storage unit. The memory pool contains a list of allocated nodes and a list of empty memory nodes. Each node in the two lists contains a memory address, the occupied blocks and a node ID. For an allocation request, the memory pool finds the first node with enough free memory from the empty list. After that, it updates the empty list and creates a new node in the allocated list to track the current allocation. For a deallocation request, the memory pool locates the node in the allocated list with an ID-to-node hash table, then places the node back onto the empty list.
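A simplified first-fit version of such a pool is sketched below, assuming one large cudaMalloc region carved into 1KB-aligned segments; the real utility tracks allocated/empty node lists with node IDs and an ID-to-node hash table as described above, so treat this class purely as an illustration (it does not, for instance, coalesce adjacent free segments).

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <list>
#include <unordered_map>

class GpuPool {
    struct Node { char* addr; size_t bytes; };
    char* base_ = nullptr;
    std::list<Node> free_;                       // empty segments
    std::unordered_map<void*, size_t> used_;     // address -> size of live allocations
public:
    explicit GpuPool(size_t bytes) {
        cudaMalloc(reinterpret_cast<void**>(&base_), bytes);   // one big allocation up front
        free_.push_back({base_, bytes});
    }
    void* alloc(size_t bytes) {
        bytes = (bytes + 1023) / 1024 * 1024;    // round up to 1KB blocks
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->bytes < bytes) continue;     // first fit
            char* p = it->addr;
            it->addr += bytes; it->bytes -= bytes;
            if (it->bytes == 0) free_.erase(it);
            used_[p] = bytes;
            return p;
        }
        return nullptr;                          // pool exhausted
    }
    void dealloc(void* p) {                      // return the segment to the empty list
        auto it = used_.find(p);
        if (it == used_.end()) return;
        free_.push_back({static_cast<char*>(p), it->second});
        used_.erase(it);
    }
    ~GpuPool() { cudaFree(base_); }
};
```

Because all allocations come out of a single preallocated region, the per-request cost is a list walk plus hash-table update rather than a driver call, which is the source of the speedups reported later in Table 2.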
If the depth of a neural network goes to 10 , the ImageNettraining still consumes at least 10 GB memory. Therefore,
Liveness Analysis alone is inadequate for the emerging deep nonlinear neural architectures. We provide the Unified Tensor Pool (UTP) to further alleviate the GPU DRAM shortage by asynchronously transferring tensors in and out of external memory. UTP is a consolidated memory pool abstraction for tensor allocations/deallocations, using various external physical memories such as CPU DRAM, the DRAM of other GPUs, or remote CPU/GPU DRAM. In this paper, we focus on the scenario of using local CPU DRAM as the external pool for its fast and efficient interconnect, but the abstraction also applies to the other cases shown in Fig.7. UTP intelligently manages the tensor placement, movement, allocation and deallocation, so that the underlying memory management is entirely transparent to DL practitioners.

Figure 7. The Unified Tensor Pool provides a consolidated memory abstraction to include various physical memory pools for tensor allocations.
Figure 8. The percentages of execution time and memory usage by layer type (CONV, FC, DROPOUT, SOFTMAX, POOL, ACT, BN, LRN) in different networks: (a) breakdown of execution time by layer types; (b) breakdown of memory usages by layer types. Note that the execution time includes both forward and backward passes.
Not all the layers are suitable for Offloading and Prefetching. We define transferring tensors from the GPU to external physical pools as Offloading, and the reversed operation as Prefetching. Fig.8a and Fig.8b demonstrate that POOL, ACT, BN and LRN all together occupy over 50% of the total memory, while their computations only account for an average of 20% of the entire workload. Thus, offloading these layers incurs a great overhead due to the insufficient overlapping of communications and computations. It is also not fruitful to offload Dropout, Softmax and FC layers since they only use less than 1% of the total memory. Therefore, we only offload the tensors from CONV layers.
Offloading: the runtime asynchronously transfers the forward outputs of CONV layers to the preallocated pinned CPU memory. It records an event for this data transfer and frees the tensor's GPU memory once the event has completed. The runtime has an independent thread running in the background to check the events of memory copies; this enables GPU-to-CPU data transfers to overlap with the forward computations from the current CONV layer to the next one.
Prefetching: the runtime asynchronously brings the offloaded and soon-to-be-reused tensors back to GPU DRAM. At any CONV layer in the backward pass, the runtime asynchronously fetches the required tensors for the previous CONV layer. This enables the CPU-to-GPU data transfer to overlap with the backward computation from the current CONV layer to the previous one.
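The following CUDA sketch illustrates the event-based overlap described above; it assumes a dedicated copy stream, a preallocated pinned host buffer (e.g. obtained with cudaMallocHost), and a GPU buffer taken from the memory pool, with all names being illustrative rather than SuperNeurons' actual interfaces.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

struct OffloadedTensor {
    void*       dev   = nullptr;   // GPU copy (freed once the offload completes)
    void*       host  = nullptr;   // preallocated pinned CPU buffer
    size_t      bytes = 0;
    cudaEvent_t done;              // signals that the copy finished
};

// Offloading: start an async GPU->CPU copy on a side stream and record an event.
// A background thread later polls the event and frees t.dev back to the pool.
void offload(OffloadedTensor& t, cudaStream_t copy_stream) {
    cudaEventCreateWithFlags(&t.done, cudaEventDisableTiming);
    cudaMemcpyAsync(t.host, t.dev, t.bytes, cudaMemcpyDeviceToHost, copy_stream);
    cudaEventRecord(t.done, copy_stream);
}

// Prefetching: bring the tensor back while the backward pass computes other layers.
// The compute stream waits on the event only when it actually needs the data.
void prefetch(OffloadedTensor& t, void* new_dev, cudaStream_t copy_stream,
              cudaStream_t compute_stream) {
    t.dev = new_dev;               // freshly allocated from the GPU pool
    cudaMemcpyAsync(t.dev, t.host, t.bytes, cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(t.done, copy_stream);
    cudaStreamWaitEvent(compute_stream, t.done, 0);
}
```

Using a pinned host buffer is what allows the copy engine to run the transfer fully asynchronously with respect to the compute stream.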
Offloading and Prefetching reduce $peak_m$ after Liveness Analysis to $\sum_{i=1}^{N} (l_i^f \notin checkpoints) + l_N^b$, where $checkpoints = \{CONV\}$. Since layers in checkpoints are offloaded, the total memory consumption at each backward step is $cost(k) = \sum_{i=1}^{k} (l_i^f \notin checkpoints) + l_k^b$, where $k \in [1, N]$. The memory usage of each layer is non-negative, thus $peak_m = \max(cost(k)) = \sum_{i=1}^{N} (l_i^f \notin checkpoints) + l_N^b$.

Figure 9. (a) Speed-Centric Recomputation, (b) Memory-Centric Recomputation, (c) Cost-Aware Recomputation. The speed-centric strategy only recomputes the segment once, and other backward layers within the segment reuse the recomputed tensors. Thus, it only incurs $O(N)$ additional computations, but its memcost is $\sum_{i=1}^{seg} l_i^f + l_{seg}^b$. The memory-centric strategy recomputes the forward dependencies every time for each backward layer. Though it incurs $O(N^2)$ additional computations, its memcost is the least, i.e. $l_i^b$. Cost-Aware Recomputation profiles the memory usages across recomputation segments. It uses the speed-centric strategy (red) if the memcost of a segment is less than $l_{peak}$, and the most memory-saving strategy (blue) otherwise.

Algorithm 2:
The basic LRU operations
Data:
Tensor (T) and LRU
Result: Tensor with the GPU memory.

Function LRU.in(T):
    T.Lock ← false    /* A layer will lock its dependent tensors in the computation. */
    LRU.insertFront(T)

Function LRU.out(T):
    freedMem ← 0
    while freedMem < T.size do
        T' = LRU.getLastUnlockedTensor()
        freedMem = freedMem + T'.size
        remove T' from LRU list
        offload T'.GA to T'.CA    /* CA is CPU Addr */
    T.GA ← Malloc(T.size)

Function Check(LRU, T):
    isFound ← LRU.find(T)
    if isFound = false then
        T.GA ← Malloc(T.size)    /* GA is GPU Addr */
        if T.GA = ∅ then
            T.GA ← LRU.out(T)
        LRU.in(T)    /* cache miss */
    else
        LRU.placeToFront(T)    /* cache hit */
    return T.GA
Prefetching / Offloading protocol can quickly ex-haust the chance. Nowadays CPU-to-GPU data movementsover PCI-E, GPU-to-GPU data movements over the same PCI-E switch, and GPU-to-remote GPU over GPU-Direct RDMAdeliver a practical speed of 8 GB/s, 10 GB/s, and 6 GB/s, buttransferring Gigabytes data in each training iterations incursthe nontrivial overhead. Therefore, this on-demand tensortransfer protocol must be optimized. SuperNeurons proposesa
Tensor Cache to exploit the temporal localities of tensors.It caches tensors on GPU DRAM to maximize their reuses and to minimize the global communications. With
Prefetch-ing and
Offloading , the runtime only triggers data transferswhen GPU DRAM is insufficient.We adopt Least Recent Used (LRU) tensor replacementpolicy to build
Tensor Cache . Since the back-propagationdemonstrates the head-to-tail and tail-to-head computationpattern, it subjects the most recent used tensors to the earli-est reusing as suggested in Fig.5. This motivates us to design
Tensor Cache with a simple variant of LRU. While there areother sophisticated cache replacement policies might be bet-ter fit the scenario, thorough discussions of them fall out thescope of this paper.Alg.2 demonstrates the three key operations of proposedLRU. 1)
LRU.in function intends to place a tensor into LRU.Each tensor has a lock, and a tensor cannot be removedfrom LRU if locked. A layer will lock dependent tensors atcalculations. LRU is implemented by a list with the frontas Most Frequently Used (MFU) and the tail otherwise. 2)
LRU.out function intends to remove enough bytes for a newtensor. It offloads the unlocked Least Recent Used tensors toCPU RAM till having enough free memory for the new one.3)
Check function decides what to operate on the tensor. Ittakes in a tensor to check if the tensor is in
LRU based onthe object address (line 2). If found, we place the tensor tothe MFU position, i.e. the list front (line 9), and return thetensor’s GPU address. This is the hit scenario. If not found,we call
LRU.out to free enough memory for the new tensorbefore inserting it into LRU, and this is the miss scenario.
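A compact C++ rendering of the three operations in Alg.2 is given below; the pool_malloc and offload_to_host hooks stand in for the memory pool and Offloading facilities described earlier, and the Tensor fields are illustrative rather than the runtime's actual data layout.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>

struct Tensor {
    void*  gpu_addr = nullptr;   // GA
    void*  cpu_addr = nullptr;   // CA (pinned host buffer)
    size_t size     = 0;
    bool   locked   = false;     // locked while a layer computes on it
};

// Hooks into the facilities described earlier; declared only, for illustration.
void* pool_malloc(size_t bytes);     // GPU memory pool allocation (may return nullptr)
void  offload_to_host(Tensor* t);    // async GPU->CPU copy of t (Offloading)

class TensorCache {
    std::list<Tensor*> lru_;                                          // front = MFU, back = LRU
    std::unordered_map<Tensor*, std::list<Tensor*>::iterator> pos_;

    void in(Tensor* t) { t->locked = false; lru_.push_front(t); pos_[t] = lru_.begin(); }

    void out(size_t need) {                        // evict unlocked LRU tensors until 'need' bytes freed
        size_t freed = 0;
        auto it = lru_.end();
        while (freed < need && it != lru_.begin()) {
            --it;                                  // walk from the LRU end toward the front
            Tensor* v = *it;
            if (v->locked) continue;
            offload_to_host(v);                    // GA -> CA
            freed += v->size;
            it = lru_.erase(it);
            pos_.erase(v);
        }
    }
public:
    void* check(Tensor* t) {                       // Alg.2 Check()
        auto hit = pos_.find(t);
        if (hit != pos_.end()) {                   // cache hit: move to the MFU position
            lru_.splice(lru_.begin(), lru_, hit->second);
            return t->gpu_addr;
        }
        t->gpu_addr = pool_malloc(t->size);        // cache miss: allocate, evicting if needed
        if (t->gpu_addr == nullptr) { out(t->size); t->gpu_addr = pool_malloc(t->size); }
        in(t);
        return t->gpu_addr;
    }
};
```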
POOL, ACT, LRN and BN all together use an average of 50% of the memory, while their forward computations only account for less than 10% of the total time. This exposes an additional 50% memory saving, with only a fraction of performance loss, by recomputing the forward dependencies in the back-propagation. Basically, the runtime frees the tensors in cheap-to-compute layers such as POOL and reconstructs them later. In general, there are memory-centric and speed-centric strategies for this recomputation.

The speed-centric strategy keeps the recomputed tensors so that other backward layers can directly reuse them. Fig.9a denotes the procedure in red: at a backward step within a segment, it performs one forward pass from the segment's checkpoint to rebuild the dependencies, and keeps the recomputed tensors so that they can be reused for the backward computations of the other layers in the segment. MXNet [6] adopts this strategy. It incurs the least, $O(N)$, additional computations, but its memcost is $\sum_{i=1}^{seg} l_i^f + l_{seg}^b$; memcost will exceed $l_{peak}$ if $l_{peak}$ is within the segment.

The memory-centric strategy always recomputes the dependencies for each backward layer. In contrast to the speed-centric one, it fully exploits the memory-saving opportunity by freeing the recomputed intermediate results: the same forward dependencies are recomputed again for each subsequent backward layer, as demonstrated by the blue lines in Fig.9b. The memcost stays at $l_i^b$, guaranteed to be $\le l_{peak}$, but the strategy incurs $O(N^2)$ additional computations.

We present a new Cost-Aware Recomputation that leverages the advantages of both methods. It is motivated by the observation that the memory costs of most recomputation segments are $\le l_{peak}$, i.e. $\sum_{i=1}^{seg} l_i^f + l_{seg}^b \le l_{peak}$. That implies we can use the fewest recomputations, as in the speed-centric strategy, while still guaranteeing the memory usage to be $\le l_{peak}$, as in the memory-centric strategy. The general procedure of Cost-Aware Recomputation is as follows:
1. The runtime iterates over all the layers to find $l_{peak} = \max(l_i)$ as the threshold.
2. In a recomputation segment, the runtime applies the speed-centric strategy (marked in red in Fig.9c) if $\sum_{i=1}^{seg} l_i^f + l_{seg}^b \le l_{peak}$, and the memory-centric strategy (marked in blue in Fig.9c) otherwise.

Table 1 summarizes the extra recomputations for the two basic strategies and Cost-Aware Recomputation. Our cost-aware method keeps $peak_m$ consistent with the memory-centric strategy, while the extra recomputations are comparable to the speed-centric strategy.

Cost-Aware Recomputation finally reduces $peak_m$ to $\max(l_i)$. Previously, Liveness Analysis and Offloading jointly reduce $cost_k^b$ to $\sum_{i=1}^{k} (l_i^f \notin checkpoints) + l_k^b$. Since non-checkpoint layers will be freed for recomputation, only the nearest checkpoint layer exists in GPU memory; thus $cost_k^b = l_{checkpoint}$. During the recomputations, $cost_k^b$ can be either $\sum_{i=1}^{k} l_i^f + l_k^b \le l_{peak}$ or $l_i^b$ depending on which recomputation strategy is used, whereas Cost-Aware Recomputation guarantees $cost_k^b \le l_{peak} = \max(l_i)$ (see the analyses above). Thus, the final network-wide $peak_m = \max(cost_k^b) = l_{peak}$, which is the minimal $peak_m$ achievable at the layer-wise granularity.
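A minimal sketch of the per-segment decision is shown below; the Segment fields and names are illustrative, and the actual freeing and recomputation of tensors is left to the runtime's bookkeeping.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

enum class Strategy { SpeedCentric, MemoryCentric };

// A recomputation segment: the non-checkpoint layers guarded by one checkpoint.
struct Segment {
    std::vector<size_t> fwd_mem;   // l_i^f for each layer in the segment
    size_t backward_mem;           // l_seg^b of the layer being back-propagated
};

// Cost-Aware Recomputation: keep the recomputed tensors (speed-centric) whenever
// the segment's recomputation footprint stays under l_peak; otherwise fall back
// to the memory-centric strategy that frees intermediates after each use.
Strategy choose_strategy(const Segment& s, size_t l_peak) {
    size_t memcost = std::accumulate(s.fwd_mem.begin(), s.fwd_mem.end(), size_t{0})
                   + s.backward_mem;               // sum_i l_i^f + l_seg^b
    return memcost <= l_peak ? Strategy::SpeedCentric : Strategy::MemoryCentric;
}
```

Here l_peak is the threshold obtained in step 1 of the procedure by iterating over all layers once before training.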
Table 1. The counts of recomputations (extra) and $peak_m$ (in MB) using the speed-centric, memory-centric and Cost-Aware Recomputation strategies.

              speed-centric        memory-centric       cost-aware
              extra   peak_m       extra   peak_m       extra   peak_m
AlexNet       14      993.018      23      886.23       17      886.23
ResNet50      84      455.125      118     401          85      401
ResNet101     169     455.125      237     401          170     401
The speed of CONV layers significantly impacts the training, as they account for over 50% of the total computing time (Fig.8). cuDNN provides several convolution algorithms, e.g. FFT, Winograd and GEMM, for different contexts. Some of them, FFT in particular, require temporary convolution workspaces to deliver the maximal speed, as demonstrated in Fig.2. Therefore, memory is also a critical factor for high-performance training.

We implement a dynamic strategy for allocating convolution workspaces. It is dynamic because the memory left for convolution workspaces constantly changes at every step according to Liveness Analysis, UTP and Cost-Aware Recomputation. Since convolution workspaces do not affect the functionality, the allocations of functional tensors such as data and parameters are prioritized. The runtime then steps into each layer to profile the free bytes left in GPU DRAM after those memory techniques have been applied. With the free-byte information at individual layers, the runtime benchmarks all the memory-feasible convolution algorithms to pick the fastest one; convolution algorithms that require more memory than can be provided are skipped. Each layer thus selects the fastest algorithm under the remaining GPU DRAM, maximizing the performance of CONV layers and of the entire training.
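A sketch of this per-layer selection with the cuDNN 7 API is shown below; it assumes the tensor, filter and convolution descriptors have already been configured and that free_bytes is the per-step budget profiled by the runtime. Whether SuperNeurons uses this exact cuDNN call is an assumption; the sketch only illustrates picking the fastest algorithm that fits the remaining memory.

```cpp
#include <cudnn.h>

// Returns the fastest forward algorithm whose workspace fits in free_bytes,
// falling back to the no-workspace implicit GEMM algorithm otherwise.
cudnnConvolutionFwdAlgo_t pick_fwd_algo(cudnnHandle_t handle,
                                        cudnnTensorDescriptor_t x,
                                        cudnnFilterDescriptor_t w,
                                        cudnnConvolutionDescriptor_t conv,
                                        cudnnTensorDescriptor_t y,
                                        size_t free_bytes) {
    const int kMax = CUDNN_CONVOLUTION_FWD_ALGO_COUNT;
    cudnnConvolutionFwdAlgoPerf_t perf[kMax];
    int returned = 0;
    // Benchmarks every algorithm and returns them sorted by measured time.
    cudnnFindConvolutionForwardAlgorithm(handle, x, w, conv, y, kMax, &returned, perf);
    for (int i = 0; i < returned; ++i) {
        if (perf[i].status == CUDNN_STATUS_SUCCESS && perf[i].memory <= free_bytes)
            return perf[i].algo;          // fastest algorithm that fits the budget
    }
    return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;   // needs no extra workspace
}
```

Because free_bytes changes from step to step, the chosen algorithm (and hence the workspace size) can differ for the same CONV layer across forward and backward passes.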
In this section, we present the results of our experimental studies that evaluate each memory and performance technique in SuperNeurons. We also conduct end-to-end evaluations against TensorFlow, MXNet, Caffe and Torch on various neural networks to justify the design.
We use the naive network-wide tensor allocation strategy as the baseline. Thus, the $peak_m$ of the baseline is $\sum_{i=1}^{N} l_i^f + \sum_{i=1}^{N} l_i^b$, where $N$ is the network length (defined in Sec.3). Since cuDNN operates at the layer-wise granularity, $peak_m$ is bounded from below by the maximal memory usage among layers, i.e. $l_{peak}$.

Figure 10. The evaluations of Liveness Analysis (a: liveness analysis), Prefetching/Offloading (b: prefetching/offloading + liveness) and Cost-Aware Recomputation (c: recomputation + previous two) on AlexNet at the batch size of 200. AlexNet has 23 layers, and a training iteration consists of forward steps 1 → 23 and backward steps 24 → 46. The blue curve (left axis) depicts the memory usage at each step, while the orange curve (right axis) depicts the live tensor count at each step. (a) demonstrates how Liveness Analysis affects memory usages w.r.t. the baseline (horizontal lines). (b) demonstrates how Offloading/Prefetching improve Liveness Analysis by comparing the memory usages of both techniques (blue dashed lines in (b)) with Liveness alone (solid blue curve in (b)). Similarly, (c) demonstrates how Cost-Aware Recomputation improves the previous two; the dashed lines in (c) are from (b).

Liveness Analysis reduces the baseline's $peak_m$ to $\sum_{i=1}^{N} l_i^f + l_N^b$. Fig.10a demonstrates how Liveness Analysis affects memory usages and live tensor counts at each forward/backward step on AlexNet. Since AlexNet has 23 layers (CONV1 → RELU1 → LRN1 → POOL1 → CONV2 → RELU2 → LRN2 → POOL2 → CONV3 → RELU3 → CONV4 → RELU4 → CONV5 → RELU5 → POOL5 → FC1 → RELU6 → Dropout1 → FC2 → RELU7 → Dropout2 → FC3 → Softmax), there are 23 forward steps and 23 backward steps. The central vertical line separates forward and backward, each containing 23 computational steps. The baseline allocates 36 data tensors consuming 2189.437MB, while
Liveness Analysis uses up to 17 tensors with a peak memory usage of 1489.355MB. This demonstrates a 31.9% improvement over the baseline in terms of $peak_m$. It is also observable that the location of $peak_m$ is not necessarily consistent with the peak tensor count. This confirms our claim that memory is unevenly distributed across network layers.

To verify the cost model, i.e. $cost_k^b = \sum_{i=1}^{k} l_i^f + l_k^b$, we delve into the memory usage of the peak layer. Fig.10a suggests that the 32nd step reaches $peak_m$. This corresponds to the backward POOL5 in AlexNet, and $k = 14$ because of 46 - 32. The forward layers up to and including POOL5 stash 5 tensors, consuming 1409.277MB ($\sum_{i \le k} l_i^f$), while the backward POOL5 stashes 3 tensors, consuming 80.078MB ($l_k^b$). Therefore, $cost_k^b = peak_m$.

Prefetching and Offloading reduce the $peak_m$ after Liveness Analysis to $\sum_{i=1}^{N} (l_i^f \notin checkpoints) + l_N^b$. Fig.10b demonstrates the updated memory usages and live tensor counts after Prefetching/Offloading are applied on top of Liveness Analysis. We set CONV layers as checkpoints for offloading. The new $peak_m$ is 1132.155 MB at the 39th step, the backward of POOL2. This further reduces the previous $peak_m$ by 357.2MB, a total 48.29% improvement over the baseline's $peak_m$. The new $peak_m$ shifts from POOL5 to POOL2 because of the number of CONV layers ahead of them: CONV1, CONV2, CONV3, and CONV4 are located before POOL5, and they consume 221.56MB, 142.38MB, 49.51MB and 49.51MB, respectively. The runtime offloads CONV1 ∼ CONV4 in the forward pass. To verify the updated cost model, i.e. $cost_k^b = \sum_{i=1}^{k} (l_i^f \notin checkpoints) + l_k^b$, we compare the calculated live tensor count from the model with the actual measurement. There are 2 checkpoints, CONV1 and CONV2, before POOL2, and the runtime prefetches CONV2 in the backward pass. As a result, the calculated live tensor count at POOL2 is 10 (measured live tensors before POOL2) - 1 (CONV1) = 9. This is the same as our actual measurement of 9 tensors at POOL2. Therefore, the updated cost model after Prefetching/Offloading is still valid.

Finally, Cost-Aware Recomputation reduces $peak_m$ to $\max(l_i)$. In theory, $\max(l_i)$ is the minimal $peak_m$ at the layer-wise granularity, as cuDNN needs to at least stash the tensors in a layer to compute. Fig.10c demonstrates the stepwise memory usages and live tensor counts with all three techniques. By iterating through every layer, we profile that $\max(l_i) = 886.23$ MB at the backward LRN1. Fig.10c demonstrates a $peak_m$ of 886 MB at the 44th step, which is the backward of LRN1. Therefore, the three proposed memory-saving techniques successfully reduce $peak_m$ from $\sum_{i=1}^{N} l_i^f + \sum_{i=1}^{N} l_i^b$ to $\max(l_i)$.

The runtime is equipped with a GPU Memory Pool and a Tensor Cache to improve the performance of the memory techniques, and a dynamic strategy for allocating convolution workspaces to accelerate the training speed. More specifically, the GPU Memory Pool amortizes the non-trivial overhead of high-frequency memory allocations/deallocations in
Liveness Analysis, and the Tensor Cache enables tensor reuse to minimize data transfers in Prefetching/Offloading. Fig.10c demonstrates that the GPU free space dynamically changes at each forward and backward step due to the 3 memory techniques. The runtime allocates convolution workspaces within the free memory at a step. As a result, the performance is optimized at individual layers under different stepwise memory constraints.

Table 2. The improvement of the GPU memory pool over cudaMalloc and cudaFree on various networks, in img/s. The batch size for AlexNet is 128, while the rest use 16.
          AlexNet   VGG16   InceptionV4   ResNet50   ResNet101   ResNet152
CUDA      359.4     12.1    6.77          21.5       11.3        7.46
Ours      401.6     14.4    10.0          32.9       18.95       13.2
speedup   1.12x     1.19x   1.48x         1.53x      1.68x       1.77x
Table 3.
Communications (in GB) with/without Tensor Cache. We benchmark the result on AlexNet by increasing the batch size from 256 to 1024.

Batch size             256    384    512    640    896    1024
Without Tensor Cache   2.56   3.72   4.88   6.03   8.35   9.50
With Tensor Cache      0      0      0      0      0      0.88
Figure 11.
Normalized performance with and without Tensor Cache. The batch size of AlexNet is 128, and 32 for the rest.
The GPU Memory Pool amortizes the non-trivial overhead of intensive memory operations in Liveness Analysis by preallocating a big chunk of GPU memory. Table 2 illustrates the performance improvement of using the GPU Memory Pool over cudaMalloc and cudaFree. Linear networks such as AlexNet and VGG involve much fewer memory operations than nonlinear ones such as InceptionV4 and ResNet50 → 152 due to their limited depth. Therefore, the speedups on nonlinear networks (ResNet50 → 152 and InceptionV4) are more significant than on linear networks (AlexNet, VGG).
Tensor Cache intends to reduce unnecessary data transfers in Prefetching/Offloading. Specifically, the offloading is unnecessary if a network can fit into the GPU DRAM. In Table 3, we can see that Tensor Cache successfully avoids communications at batch sizes of 256 → 896, while communications without Tensor Cache linearly increase along with the batch size. The training performance will deteriorate if communications outweigh computations. Fig.11 demonstrates up to a 33.33% performance loss without using Tensor Cache. It is also noticeable that the speedup on linear networks (AlexNet, VGG16) is less significant than on nonlinear ones (ResNet50 → 152): since their computations are sufficient to hide the communications in Prefetching/Offloading, Tensor Cache does not provide a comparable speedup for AlexNet and VGG16.
Dynamic Convolution Workspace Allocation intends to optimize each layer's training speed together with the 3 memory techniques. Convolution workspaces are critical to high performance, while the free memory for convolution workspaces constantly changes at different computing steps, as demonstrated in Fig.10c. The runtime picks the fastest memory-feasible convolution algorithm at a particular step. Fig.12a and Fig.12b demonstrate that the runtime automatically reduces CONV workspaces to accommodate functional tensors with the increasing batch size. Specifically, the runtime prioritizes the functional tensor allocations at batch 300 under a 3 GB memory pool (Fig.12b), while it provisions the most workspace for the maximal speed at batch 100 (Fig.12a). In general, a higher speed is observable with more convolution workspaces. Fig.12c and Fig.12d demonstrate that the training speed (images per second) increases from 203 img/s to 240 img/s with additional CONV workspaces.

Figure 12. Dynamic CONV workspace allocations in the runtime under different batch sizes and memory pool budgets ((a) batch=100, pool=3G; (b) batch=300, pool=3G). The digit on the x-axis represents the ith CONV layer, while f/b represent the forward and backward.

Table 4. Going Deeper: the deepest ResNet that different frameworks can reach on a 12GB NVIDIA K40. The batch size is fixed at 16. ResNet has 4 for-loops to control its depth: depth = 2 * (n1 + n2 + n3 + n4) + 2, where n_i is the upper limit of the ith for-loop. We fix three of the loop limits (two at 32, one at 6) while varying the remaining one to increase the depth.

                Caffe   MXNet   Torch   TensorFlow   SuperNeurons
ResNet depth    148     480     152     592          1920
Our primary goal is to enable ML practitioners to explore deeper and wider neural architectures within the limited GPU DRAM. In this section, we conduct end-to-end comparisons against TensorFlow, MXNet, Caffe and Torch with several mainstream linear networks (AlexNet, VGG16) and nonlinear ones (ResNet50 → 152 and InceptionV4).

Table 5. Going Wider: the largest batch size that several mainstream neural architectures can reach in different frameworks with a 12GB NVIDIA K40.

Peak batch    Caffe   MXNet   Torch   TensorFlow   SuperNeurons
AlexNet       768     768     1024    1408         1792
VGG16         48      64      48      80           224
InceptionV4   16      N/A     N/A     64           240
ResNet50      24      80      32      128          384
ResNet101     16      48      16      80           256
ResNet152     16      32      16      48           176

Figure 13. Going Wider: the corresponding memory usages (in GB) for the batch sizes in Table 5.

We increase the batch size to go wider. Table 5 demonstrates that SuperNeurons consistently trains considerably larger batches than the second best framework. SuperNeurons can train ResNet101 at the batch of 256, which is 3x larger than the second best, TensorFlow. Fig.13 demonstrates the corresponding memory requirement for the peak batches in Table 5. The translation is non-linear because of the convolution workspace. We calculate the memory requirement with $\sum_{i=1}^{N} l_i^f + \sum_{i=1}^{N} l_i^b$, where $l_i$ is the sum of the memory usages of all tensors in the layer. It is observable that SuperNeurons handles up to a 19.8x larger model than Caffe.

We add layers to go deeper. Table 4 demonstrates that SuperNeurons trains 12.9730x, 12.6316x, 4.0000x, and 3.2432x deeper ResNets than Caffe, Torch, MXNet, and TensorFlow, respectively. Particularly, SuperNeurons can train a ResNet with up to 2500 residual units, having approximately $10^4$ basic layers, at the batch size of 1 on a 12GB GPU.

The training speed is measured by the processed images per second. Fig.14 presents an end-to-end training speed comparison of SuperNeurons to mainstream DL systems. SuperNeurons consistently demonstrates the leading speed on various linear networks (AlexNet, VGG16) and nonlinear ones (ResNet50 → 152 and InceptionV4).
Figure 14. An end-to-end evaluation of different DL frameworks: training speed (img/s) versus batch size on (a) AlexNet, (b) ResNet50, (c) VGG16, (d) ResNet101, (e) InceptionV4 and (f) ResNet152. We benchmark the data on a TITAN XP.

Several solutions have been proposed to address the GPU DRAM shortage for training large-scale neural networks. Model Parallelism provides a straightforward solution to the large network training. DistBelief [10] partitions a network across multiple machines so that each machine holds a segment of the original network. Coates et al. [8] discuss another partition scheme on multi-GPUs. Model Parallelism demands huge intra-network communications for synchronizations. Therefore, most DL systems parallelize the training with Data Parallelism for high performance [2, 5, 9, 15]. In this paper, we focus on the GPU DRAM shortage issue for Data Parallelism.

Under Data Parallelism, vDNN [20] proposes a prefetching and offloading technique to utilize the CPU DRAM as an external buffer for the GPU. It tries to overlap communications with computations by asynchronously swapping the data between CPU and GPU amid the back-propagation. The performance of this method largely depends on the communication/computation ratio. Some layers such as POOL are very cheap to compute, while the GPU processing speed is several orders of magnitude faster than the PCI-E 16x bus. In nonlinear networks, the performance will quickly deteriorate once computations are inadequate to overlap with communications. Chen et al. [6] also introduce a recomputation strategy to trade computations for memory. However, their method fails to fully exploit the memory-saving opportunities and computation efficiency for ignoring the memory variations among layers.

Removing the parameter redundancy also reduces the memory usage. For example, network pruning [11, 12] removes near-zero parameters, and quantization [24] or precision reduction [16] utilize low-precision floats to save memory. Although parameter reduction has immense benefits in deploying neural networks on embedded systems, parameters only account for a negligible portion of the memory usage in training. Therefore, these approaches are quite limited for the training.

In this paper, we focus on the GPU memory scheduling problem for training deep neural networks, and we propose a novel dynamic scheduling runtime to tackle the issue. The runtime features three memory techniques to reduce $peak_m$ to $\max(l_i)$, which is the minimum at the layer-wise granularity. We also propose several performance optimizations to guarantee high performance. Evaluations against state-of-the-art DL frameworks have demonstrated the effectiveness and efficiency of the proposed dynamic scheduling runtime. It creates new opportunities for DL practitioners to explore deeper and wider neural architectures, and the new accuracy record is awaiting to be refreshed with even deeper and wider designs.

This research is funded in part by the DARPA Award 16-43-D3M-FP-040 and gifts from Google, VMware, Mellanox, and Oracle. Zenglin Xu and Jianmian Ye were supported by a grant from the Natural Science Foundation of China (No. 61572111), and a Fundamental Research Fund for the Central Universities of China (No. ZYGX2016Z003).
References [1] Mxnet’s graph representation of neural networks. http://mxnet.io/architecture/note_memory.html .[2] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin,M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A systemfor large-scale machine learning. In
OSDI (2016), vol. 16, pp. 265–283.[3] Bahrampour, S., Ramakrishnan, N., Schott, L., and Shah, M. Com-parative study of caffe, neon, theano, and torch for deep learning.[4] Bengio, Y., Simard, P., and Frasconi, P. Learning long-term depen-dencies with gradient descent is difficult.
IEEE transactions on neuralnetworks 5 , 2 (1994), 157–166.[5] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B.,Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machinelearning library for heterogeneous distributed systems. arXiv preprintarXiv:1512.01274 (2015).[6] Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep netswith sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016).[7] Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J.,Catanzaro, B., and Shelhamer, E. cudnn: Efficient primitives fordeep learning. arXiv preprint arXiv:1410.0759 (2014).[8] Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew,N. Deep learning with cots hpc systems. In
International Conferenceon Machine Learning (2013), pp. 1337–1345.[9] Collobert, R., Bengio, S., and Mariéthoz, J. Torch: a modularmachine learning software library. Tech. rep., Idiap, 2002. [10] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M.,Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributeddeep networks. In
Advances in neural information processing systems (2012), pp. 1223–1231.[11] Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., andDally, W. J. Eie: efficient inference engine on compressed deep neu-ral network. In
Proceedings of the 43rd International Symposium onComputer Architecture (2016), IEEE Press, pp. 243–254.[12] Hassibi, B., and Stork, D. G. Second order derivatives for networkpruning: Optimal brain surgeon. In
Advances in neural informationprocessing systems (1993), pp. 164–171.[13] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning forimage recognition. In
Proceedings of the IEEE conference on computervision and pattern recognition (2016), pp. 770–778.[14] Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten,L. Densely connected convolutional networks. arXiv preprintarXiv:1608.06993 (2016).[15] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick,R., Guadarrama, S., and Darrell, T. Caffe: Convolutional archi-tecture for fast feature embedding. In
Proceedings of the 22nd ACMinternational conference on Multimedia (2014), ACM, pp. 675–678.[16] Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., Jerger,N. E., and Moshovos, A. Proteus: Exploiting numerical precision vari-ability in deep neural networks. In
Proceedings of the 2016 InternationalConference on Supercomputing (2016), ACM, p. 23.[17] Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classifica-tion with deep convolutional neural networks. In
Advances in neuralinformation processing systems (2012), pp. 1097–1105.[18] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-basedlearning applied to document recognition.
Proceedings of the IEEE 86 ,11 (1998), 2278–2324.[19] Pleiss, G., Chen, D., Huang, G., Li, T., van der Maaten, L., andWeinberger, K. Q. Memory-efficient implementation of densenets. arXiv preprint arXiv:1707.06990 (2017).[20] Rhu, M., Gimelshein, N., Clemons, J., Zulfiqar, A., and Keckler,S. W. vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design. In
Microarchitecture (MICRO), 201649th Annual IEEE/ACM International Symposium on (2016), IEEE, pp. 1–13.[21] Simonyan, K., and Zisserman, A. Very deep convolutional networksfor large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).[22] Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Inception-v4,inception-resnet and the impact of residual connections on learning.In
AAAI (2017), pp. 4278–4284.[23] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D.,Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper withconvolutions. In
Proceedings of the IEEE conference on computer visionand pattern recognition (2015), pp. 1–9.[24] Vanhoucke, V., Senior, A., and Mao, M. Z. Improving the speed ofneural networks on cpus. In
Proc. Deep Learning and UnsupervisedFeature Learning NIPS Workshop (2011), vol. 1, p. 4.[25] Wang, L., Wu, W., Bosilca, G., Vuduc, R., and Xu, Z. Efficient com-munications in training large scale neural networks. arXiv preprintarXiv:1611.04255 (2016).[26] Wang, L., Wu, W., Xu, Z., Xiao, J., and Yang, Y. Blasx: A high per-formance level-3 blas library for heterogeneous multi-gpu computing.In
Proceedings of the 2016 International Conference on Supercomputing (2016), ACM, p. 20.[27] Wang, L., Yang, Y., Min, R., and Chakradhar, S. Accelerating deepneural network training with inconsistent stochastic gradient descent.
Neural Networks (2017).[28] Wu, W., Bouteiller, A., Bosilca, G., Faverge, M., and Dongarra, J.Hierarchical dag scheduling for hybrid distributed systems. In
Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International (2015), IEEE.