Large-Scale Training System for 100-Million Classification at Alibaba
Liuyihan Song, Pan Pan, Kang Zhao, Hao Yang, Yiming Chen, Yingya Zhang, Yinghui Xu, Rong Jin
Machine Intelligence Technology Lab, Alibaba Group
liuyihan.slyh, panpan.pp, zhaokang.zk, yh136073, charles.cym, yingya.zyy, renji.xyh, [email protected]
ABSTRACT
In the last decades, extreme classification has become an essential topic for deep learning. It has achieved great success in many areas, especially in computer vision and natural language processing (NLP). However, it is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer. In this paper, we propose a large-scale training system to address these challenges. First, we build a hybrid parallel training framework to make the training process feasible. Second, we propose a novel softmax variation named KNN softmax, which reduces both the GPU memory consumption and computation costs and improves the throughput of training. Then, to eliminate the communication overhead, we propose a new overlapping pipeline and a gradient sparsification method. Furthermore, we design a fast continuous convergence strategy to reduce the total training iterations by adaptively adjusting the learning rate and updating the model parameters. With the help of all the proposed methods, we increase the throughput of our training system by 3.9× and reduce almost 60% of the training iterations. The experimental results show that, using an in-house 256-GPU cluster, we can train a classifier of 100 million classes on the Alibaba Retail Product Dataset in about five days while achieving an accuracy comparable to the naive softmax training process.
CCS CONCEPTS
• Computing methodologies → Computer vision tasks; • Information systems → Clustering and classification.
KEYWORDS
Extreme Classification, Distributed Deep Learning, KNN Softmax, Communication Optimization, Fast Convergence
ACM Reference Format:
Liuyihan Song, Pan Pan, Kang Zhao, Hao Yang, Yiming Chen, Yingya Zhang, Yinghui Xu, Rong Jin. 2018. Large-Scale Training System for 100-Million Classification at Alibaba. In
KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 22–27, 2020, San Diego, CA.
ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/1122445.1122456
Figure 1: Overview of the overall extreme classification system architecture. It contains three major components: (i) a KNN softmax loss module for fast computation and derivation; (ii) a communication module including the hybrid pipeline and gradient sparsification; (iii) a fast convergence module with a large batch optimizer and a learning rate scheduler.
INTRODUCTION
In recent years, extreme classification has attracted significant interest in the areas of computer vision and NLP. It is a vanilla multi-class classification problem in which the number of classes is extremely large. Such large classifiers have achieved remarkable success, especially in applications like face recognition [24] and language modeling [5], when trained on industry-level datasets. At Alibaba, the Retail Product Dataset contains up to billions of images across 100 million classes. Each image is labeled at the stock keeping unit (SKU) level. We want to build a 100-million-level extreme classification system with this dataset to improve the recognition abilities of our vision system. However, building an extreme classification system poses a number of challenges, as follows:
Memory and computation costs:
As the parameter size of the last fully connected layer is proportional to the number of classes, it may go beyond the GPU memory capacity when training a large classifier straightforwardly. The computational cost, which is dominated by the dot products between the class weights and the input features, also increases significantly.
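For a sense of scale, assuming a 512-dimensional feature and 32-bit weights (the setup used later in this paper), the last layer alone already requires roughly
$$10^{8}\ \text{classes} \times 512\ \text{dims} \times 4\ \text{bytes} \approx 2\times 10^{11}\ \text{bytes} \approx 190\ \text{GB},$$
far beyond the memory of any single GPU.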
Communication overhead:
To accelerate the training process as much as possible, one can add more GPU machines to process more data samples synchronously. However, as the number of GPU machines grows, the communication overhead among machines becomes the bottleneck of training speed.
Convergence:
In parallel training, synchronous stochastic gradient descent (SGD) is often used to train large-scale deep neural networks. As the number of GPU nodes increases, training with a large batch usually results in lower model accuracy. Moreover, the convergence speed is unacceptable when a large number of epochs is required, e.g., 90 epochs for ImageNet-1K training [9].
Prior efforts tackle these difficulties in several ways. To reduce the resource cost, feature embedding methods [2, 20, 25] have been proposed. These methods project the inputs and the classes into a low-dimensional subspace instead of using a large fully connected layer at the end. Nevertheless, training embedding models requires a pairwise loss function, which uses a large number of training pairs and needs carefully designed negative sampling. Another solution is hierarchical softmax [7, 19], but it is often difficult to extend these methods to other domains, and they cannot guarantee accuracy on image classification tasks. Meanwhile, hierarchical softmax is not parallel friendly, which means it is hard to support multi-GPU training. To the best of our knowledge, using a standard softmax with a cross-entropy based classifier avoids these issues.
In this paper, considering the drawbacks of the methods mentioned above, we propose an extreme classification system built through the collaboration of algorithm and engineering teams at Alibaba. Unlike [11, 28], which use a data parallel framework to train ImageNet-1K in a few minutes, we use a hybrid parallel framework to partition the model across a GPU cluster. In this way, we can train such a "big head" neural network using a standard softmax with cross-entropy loss. Figure 1 shows the overview of our extreme classification system architecture. We summarize our contributions as follows:
1) We introduce an effective softmax implementation named KNN softmax to classify 100 million classes of images straightforwardly. Compared with selective softmax [29] or MACH [17], our approach achieves the same accuracy as the standard softmax. Furthermore, our proposed method saves computation and GPU memory, which improves training speed correspondingly.
2) We propose a new communication strategy, which includes an overlapping pipeline and a gradient sparsification method. For our hybrid parallel training framework, this communication strategy reduces the overhead and accelerates training.
3) As large batch training plays an essential part in our training framework, we propose a new training strategy to update the model parameters and adjust the learning rate adaptively. In this way, we significantly reduce the number of training iterations and achieve an accuracy comparable to the naive softmax training process.
The rest of the paper is organized as follows. Section 2 briefly reviews the related work. The proposed framework and methods are described in detail in Section 3. Experimental evaluations are shown in Section 4. Finally, Section 5 concludes this paper.
RELATED WORK
Building an extreme classification system involves four primary aspects. First, we need a parallel training method for extreme classification. Second, an accuracy-lossless softmax computation algorithm should be carefully designed. Third, an efficient communication strategy is needed in a large GPU cluster. Besides, fast convergence is also essential for an efficient training process. Taking the applied techniques into account, prior practices for building an extreme classification system can be summarized along the following aspects.
Parallel Training:
Recently, [14] proposed a training scheme that uses data parallelism and model parallelism together to parallelize the training of convolutional neural networks with stochastic gradient descent (SGD). Deng et al. [6] employ a parallel training strategy to efficiently support millions of identities on a single machine. In our scenario, we extend the hybrid parallel training scheme to a larger GPU cluster. Meanwhile, we also optimize the training pipeline to accelerate training.
Softmax Variations:
1) Selective Softmax: [29] proposes a new method to solve the extreme classification problem. In particular, they develop an effective method based on the dynamic class level to approach the optimal selection. This method has two drawbacks. First, it is not a completely GPU-based implementation, since the entire W is maintained in CPU RAM. Moreover, the performance of selective softmax is inferior to the full softmax, especially in large-scale experiments, which is not acceptable in practice. 2) Merged-Average Classifiers via Hashing: To solve the K-class classification problem, a simple hashing-based divide-and-conquer algorithm, MACH (Merged-Average Classification via Hashing) [17], was proposed. Compared with a traditional linear classifier, it only needs a small model size. However, the method is still unable to obtain performance comparable to the standard softmax. As stated in [17], on the ImageNet dataset MACH achieves an accuracy of 11%, while the full softmax achieves the best result of 17%.
Efficient Communication:
Large-scale distributed parallel SGD training [3] requires gradient/parameter synchronization among tens of nodes. With increasing numbers of nodes, communication overhead becomes the bottleneck and prohibits training scalability. As centralized network frameworks like the parameter server [15] are limited by network congestion at the central nodes, decentralized network frameworks with collective communication operations (all-reduce, all-gather, etc.) are widely used in large-scale distributed training. Besides utilizing expensive high-performance networks (100 Gbps Ethernet, InfiniBand, etc.), multiple methods have been proposed to mitigate communication overhead. Pipelining overlaps the gradient computation of bottom layers with the gradient communication of top layers during backpropagation. It has been widely used in distributed machine learning frameworks such as PyTorch [21] and MXNet [4]. Recently, gradient compression methods that reduce the transmitted bits per iteration have drawn much attention. Sparsification methods [1, 16] select part of the gradients based on their magnitude and preserve ImageNet-1K accuracy with gradient sparsity up to 99.9%. Quantization methods [12] encode gradients into 1 bit, achieving up to a 1/32 compression ratio. Low-rank factorization [26] communicates a low-rank lossy approximation of the gradients to reduce network traffic.
Fast Convergence:
Early works mostly focus on learning rate adjustment to deal with large batch training. [8] sets the initial learning rate as a function of the batch size according to a linear scale-up rule, and manages to train a ResNet-50 network on ImageNet-1K with a batch size of 8,000 over 256 GPUs. In [28], training with a much larger batch size of 32K is finished in 20 minutes with LARS. Since the gradients of a DNN may vary significantly in the early steps, a large learning rate may cause divergence. To avoid divergence, a warm-up strategy that increases the learning rate gradually from a very small value is proposed in [8]. However, all the above techniques are only proved to work for training ResNet-50 on ImageNet-1K, and less attention is paid to the training of larger or more complex models on other datasets.
Figure 2: Hybrid parallel training framework.
HYBRID PARALLEL TRAINING FRAMEWORK
To train a classifier of 100 million classes, how to store the large fully connected (fc) layer is the first and primary problem to be addressed. Assuming the dimension of the input feature is 512, the total GPU memory cost of the fully connected layer is about 190 GB, which cannot fit into a single GPU. Therefore, training such a large classifier is barely possible using a data parallel training framework.
As mentioned in [14], we split the large fc layer into different sublayers and place each sublayer on a different GPU in a model parallel way. This has two advantages: 1) the computation cost of the fully connected layer is reduced proportionally to the number of GPUs used; 2) the communication overhead of synchronizing gradients is reduced significantly, since the fully connected layer is updated locally compared to data parallel training.
Since the fully connected layer is split in a model parallel way, we can reuse the remaining memory space of each GPU for the feature extraction part before the fully connected layer. Meanwhile, the feature extraction part is trained with data parallelism. Therefore, the proposed framework of this work belongs to hybrid parallel training. Figure 2 presents the overall hybrid parallel training framework.
As depicted in Figure 2, we use GPU-N to elaborate how the hybrid parallel training works: 1) Data batch-N is fed into GPU-N. 2) GPU-N uses convolutional neural networks to extract the features of batch-N, and then gathers the features extracted by the other GPUs. 3) GPU-N forwards the features through the N-th sublayer of the fully connected layer. 4) The softmax with cross-entropy loss is computed distributedly using all GPUs. 5) GPU-N backpropagates gradients through the whole network; it is worth mentioning that the gradient of f_N needs to be merged with the corresponding items from the other GPUs. 6) For the model parallel parts, GPU-N updates the weights with the local gradient; for the data parallel parts, it merges the corresponding gradients of the feature extraction part over all GPUs and then updates the weights (step 6 is not shown in Figure 2).
Additionally, as mixed-precision training [18] has been widely used in computer vision and NLP tasks without sacrificing accuracy, we adopt this method to accelerate our training. For our hybrid parallel training framework, we convert all the layers except batch norm [10] to float16, and the gradients are calculated in float16 too. Meanwhile, all parameters keep a copy in float32 for parameter updating. Besides, we use loss scaling [18] to preserve small gradient values during training.
After implementing the hybrid parallel training framework for 100-million classification, we ran an end-to-end training profiling on the in-house GPU cluster. According to the profiling, almost 80% of the GPU memory is consumed by the fully connected layer, which leads to low training throughput since less memory can be used for the feature extraction part. Therefore, new methods that reduce memory and computation costs are strongly needed to train the large classifier. The following sections describe the proposed methods that overcome the difficulties of such a training task.
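To make the data flow concrete, the following is a minimal PyTorch-style sketch of one hybrid-parallel forward step (steps 2–4 above). The helper names (`feature_extractor`, `fc_sublayer`) and the exact collectives used are illustrative assumptions rather than the system's actual implementation.

```python
import torch
import torch.distributed as dist

def hybrid_parallel_forward(images, feature_extractor, fc_sublayer, world_size):
    """One hybrid-parallel forward step (sketch).

    feature_extractor: the data-parallel CNN replica on this GPU.
    fc_sublayer: the local shard of the huge fc layer (model parallel),
                 mapping features to C / world_size classes.
    """
    # Step 2: extract features for the local mini-batch, then gather the
    # features computed on every other GPU.
    local_feat = feature_extractor(images)                        # [b, D]
    gathered = [torch.empty_like(local_feat) for _ in range(world_size)]
    dist.all_gather(gathered, local_feat)
    all_feat = torch.cat(gathered, dim=0)                         # [b * world_size, D]

    # Step 3: forward the gathered features through the local fc shard,
    # producing logits only for the classes stored on this GPU.
    local_logits = fc_sublayer(all_feat)                          # [b * world_size, C / world_size]

    # Step 4: distributed softmax. The normalizer needs the max and the
    # sum of exponentials over *all* shards, so reduce them across GPUs.
    local_max = local_logits.max(dim=1, keepdim=True).values
    dist.all_reduce(local_max, op=dist.ReduceOp.MAX)
    exp_logits = (local_logits - local_max).exp()
    denom = exp_logits.sum(dim=1, keepdim=True)
    dist.all_reduce(denom, op=dist.ReduceOp.SUM)
    return exp_logits / denom       # local slice of the full softmax probabilities
```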
KNN SOFTMAX
As mentioned above, the last fully connected layer consumes a large amount of GPU memory. In experiments training a classifier of 100 million classes, we also note that in each iteration more than 80% of the time is spent in the softmax stage (mainly including fc forward, softmax forward, softmax backward, and fc backward), and over 10 GB of GPU memory is used for the output of the last fc layer. In order to further improve the throughput of the system, we propose a new method called KNN softmax. Specifically, we adopt active classes to speed up the softmax stage and save memory, as in [29]. Moreover, by combining a normalization strategy and a KNN graph-based selection approach, we achieve lossless performance compared with the standard softmax, which is essential in practice. Finally, we provide a completely GPU-based training pipeline for our method.
Inspired by selective softmax [29], we also select the active classes for each mini-batch and calculate the forward/backward passes based on them. Differently, we adopt a new way to select the active classes. Let N be the total number of classes; we denote W ∈ R^{N×D} as the weight parameters of the last fc layer, where each row w_j represents the weight vector of the j-th class. For each training example x_i, we use its weight vector w_{y_i} (y_i is the label of x_i), instead of x_i itself, to select the active classes. In this way, we can quickly fetch the active classes from a k-nearest neighbor (KNN) graph of W that we build in advance, avoiding the time cost of searching for the active classes using x. Specifically, we first compute the L2 normalization of X (the extracted features of the mini-batch) and W during the training process. Next, a KNN graph of W_norm (the L2 normalization of W) is constructed (the building process is described in the following section). Assuming this graph has already been built, in each iteration of the training process we can quickly obtain the active classes of the mini-batch: W_active = [list_{y_1}, ..., list_{y_m}], where list_{y_i} is the KNN result of w_{y_i}. Since W has been normalized, w_{y_i} is always ranked first in list_{y_i}. After that, we remove the duplicated w from W_active, and then compare the number of remaining active classes with M (the number of active classes set for each iteration) to select the final active classes. Algorithm 1 summarizes the KNN graph-based active classes selection.
Algorithm 1: KNN Graph-based Active Classes Selection
Input: A KNN graph G = [list_0, ..., list_{N-1}] ∈ R^{N×K}; the entire weight matrix W ∈ R^{N×D}; the mini-batch features X_norm ∈ R^{m×D}; the number of selected active classes M.
Output: The active classes of the current mini-batch, W*_active.
  Initialize the active classes set W_active = ∅
  for each sample x_i in X_norm do
    insert list_{y_i} into W_active
  end for
  W'_active = deduplicate(W_active)
  if W'_active.size < M then
    W_random = randomly sample (M − W'_active.size) weights from W \ W'_active (the weights in W not yet chosen)
    W*_active = W'_active ∪ W_random
  else
    W*_active = the top M weights of W'_active ranked by their scores
  end if
  return W*_active

Generally, one would like to use approximate nearest neighbor (ANN) methods to build the graph [30], which trade off accuracy against construction time. However, we empirically find that the quality of the KNN graph has a great influence on the final accuracy. An ANN graph cannot guarantee that all true nearest neighbors are recalled. Once some nearest neighbors are lost, the active classes of certain samples will inevitably be missing during training, which leads to a performance gap compared to the full softmax. Consequently, we use linear search to ensure the precision of the nearest neighbors.
Brute-force graph building is very time-consuming, so we only rebuild the graph after a large number of iterations. Moreover, to save computational resources, we reuse the training GPUs to construct the graph (training is suspended during that time).
With W normalized, the Euclidean distance and the inner product are equivalent, and the inner product is a matrix multiplication on CUDA, which is easy to implement. As mentioned above, W is stored on different nodes, so we use the ring structure in Figure 3 (b) to transfer the local w among nodes. After a node receives the local w from its predecessor, it performs the matrix multiplication, updates its NN lists, and then sends the received weights to the next node. Compared with gathering all w onto one node (Figure 3 (a)), our method avoids exhausting GPU memory when W is too large (the matrix multiplication also requires a large amount of temporary memory).
Figure 3: Distributed graph building with multiple GPUs.
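As a simplified, single-GPU illustration of the brute-force construction (the actual system distributes W across nodes in the ring of Figure 3 and uses float16 as described next), exact nearest neighbors can be obtained with chunked, normalized matrix multiplications; the function and chunk size below are illustrative assumptions.

```python
import torch

def build_knn_graph(W, k, chunk=4096):
    """Exact KNN graph over the class weights via normalized matmul (sketch).

    W: [N, D] fc weights. Returns [N, k] neighbor indices; because W is
    L2-normalized, each row's first neighbor is the class itself.
    """
    W_norm = torch.nn.functional.normalize(W, dim=1)          # L2 normalization
    neighbors = torch.empty(W.size(0), k, dtype=torch.long, device=W.device)
    for start in range(0, W.size(0), chunk):
        block = W_norm[start:start + chunk]                   # [c, D] slice of classes
        sims = block @ W_norm.t()                             # inner product == cosine similarity
        neighbors[start:start + chunk] = sims.topk(k, dim=1).indices
    return neighbors
```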
In addition, we convert W from float32 to float16 and use Tensor Cores to accelerate the matrix multiplication. For the sake of graph quality, we first recall the k'-nearest neighbors (k' is larger than k), and then perform a standard float32 calculation on these k' candidates to obtain the final kNN (the cost of this float32 refinement is almost negligible). Tensor Cores speed up the whole process by about three times. In practice, we rebuild the graph after each training epoch (to be fair, the graph building time is taken into account when evaluating the efficiency of KNN softmax in the experiment section). Thanks to the efficient GPU pipeline, building the graph for 100 million classes takes at most 0.75 hours on 256 V100 GPUs.
After the graph construction, we intend to make the training run completely on GPU as well. For the 100M classification, we set K large enough to ensure the performance, and the resulting graph (of size 100M × K) is about 372 GB. When training begins, each node needs to query the complete graph to select the active classes. In other words, the complete graph would have to be stored on every node, which is very hard to load into CPU memory, let alone GPU memory.
To make full use of the GPUs to train our classifier, we take two steps to solve the graph storage problem (the GPUs we use are 32 GB V100s):
(i) Graph compression. Keeping a complete graph on each node is redundant, because the w that are not stored on this node can never be selected by any mini-batch gathered on this node. As a result, we can delete all such redundant w indices from the graph stored on each node. With 256 GPUs for training, the storage is compressed to 372 GB / 256 = 1.45 GB per node on average.
(ii) Quick access. With graph compression, the number of neighbors (K) of each w in the graph is no longer the same, so we turn the two-dimensional tensor (100M × K) into a one-dimensional tensor. A new problem arises: we can no longer obtain W_active efficiently as before, since a one-dimensional tensor cannot be accessed directly with the label as the index. To tackle this, we add a new kernel function to the PyTorch framework for quick access to the compressed graph. More concretely, we first store the new K value of each w in a tensor, and use another tensor of the same size to accumulate the K values (the accumulated result is the offset of each w in the compressed graph). During training, different threads look up the offset of each sample in the compressed graph.
We summarize the core GPU pipeline of KNN softmax as follows, which mainly consists of three steps:
(1) Graph Building: We use distributed GPUs to build the graph of W_norm, and adopt method (i), graph compression, to compress and store the graph on all GPUs.
(2) Normalization: The normalization of X and W is executed on GPU during the training process.
(3) Active Classes Selection: For the normalized mini-batch X_norm, method (ii), quick access, is employed to perform the active classes selection on GPU.
Although both methods select active classes, our KNN softmax is totally different from selective softmax [29] in the following three aspects: 1) we use a KNN graph to select the active classes, instead of the hashing forest used by selective softmax; 2) selective softmax is not completely implemented on GPU; 3) last but not least, we maintain the same precision as the full softmax, which is hard to achieve with selective softmax.
COMMUNICATION STRATEGY
As KNN softmax significantly reduces GPU memory and computation cost, communication overhead becomes the bottleneck in large-scale distributed parallel SGD training. Based on the training profiling of our hybrid parallel framework, we apply an efficient hybrid parallel pipeline to introduce more overlapping in the forward and backward stages. Besides, we implement an efficient gradient sparsification method [16] to reduce the transmitted bits during backpropagation. Under the premise of ensuring model convergence, our strategies reduce the wall clock time per iteration and improve the throughput of large-scale mini-batch training.
Hybrid parallel pipelining. Data parallelism only involves inter-node communication to synchronize gradients, and typical pipelining overlaps the synchronization and computation of gradients during backpropagation. In our large-scale classification hybrid parallel framework, however, communication involves: a) the transmission of features from the data-parallel feature extraction (FE) part to the fc layer; b) communication among the fc sublayers to compute the softmax; c) gradient synchronization during backpropagation. Part b) is insignificant due to its tiny message size. As shown in Figure 4 (a), the fc sublayers are idle until all the feature extraction parts have computed their features and accumulated them through an all-gather communication, and vice versa during backpropagation.
To overlap computation and communication in our hybrid parallel framework, we divide the mini-batch into micro-batches and make computation and communication asynchronous across micro-batches. Figure 4 (b) shows our pipelining strategy. The fc layers collect the features from the different nodes with an all-gather communication as soon as a micro-batch finishes its forward computation, thus overlapping with the forward computation of the feature extraction part. In backpropagation, we overlap the fc layer gradient computation and the all-reduce communication among micro-batches. For the feature extraction part, we follow the common data-parallel pipelining method. With our hybrid parallel overlapping pipeline, we achieve more overlapping between communication and computation.
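A simplified sketch of the forward-side overlap follows: the all-gather for micro-batch i is launched asynchronously and overlaps with the feature extraction of micro-batch i+1. The helper names and the use of `async_op=True` are assumptions for illustration, not the system's actual code.

```python
import torch
import torch.distributed as dist

def pipelined_feature_gather(micro_batches, feature_extractor, world_size):
    """Overlap feature extraction with feature all-gather across micro-batches (sketch)."""
    pending, gathered = [], []
    for x in micro_batches:
        feat = feature_extractor(x)                           # compute this micro-batch's features
        bufs = [torch.empty_like(feat) for _ in range(world_size)]
        handle = dist.all_gather(bufs, feat, async_op=True)   # start communication, do not block
        pending.append((handle, bufs))                        # next micro-batch's computation
                                                              # overlaps this transfer
    for handle, bufs in pending:
        handle.wait()                                         # communication completes here
        gathered.append(torch.cat(bufs, dim=0))               # features from all nodes
    return gathered
```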
Figure 4: (a) is the baseline communication/computation overlapping for hybrid parallel training and (b) is our proposed hybrid parallel overlapping pipeline. (The blue chunks indicate computation while the green ones indicate communication.)
Furthermore, the divided micro-batches save GPU memory usage and enable a larger batch size on a single GPU.
Top-k sparsification. Hybrid parallel pipelining overlaps the feature extraction net and the fc sublayers via the micro-batch split. For the backpropagation of the feature extraction net, gradient compression methods can be utilized to further mitigate communication overhead. Deep gradient compression (DGC) [16] performs a layer-wise top-k selection on each layer's tensor and only communicates the selected gradients among nodes. With tricks including momentum correction and momentum factor masking, DGC raises the communication tensor sparsity to 99.9% while preserving the accuracy of training a ResNet-50 model on the ImageNet-1K dataset. However, DGC is not widely used in industry, mainly because the layer-wise top-k selection is time-consuming. For example, [23] implemented DGC on 56 Gbps Ethernet and observed no throughput improvement. DGC proposes a sampled top-k to reduce the selection time, but this makes the top-k selection approximate and is still inefficient for a model with a low communication-to-computation ratio like ResNet.
To deal with the top-k selection overhead, we apply a divide-and-conquer top-k selection and group tensors of similar size, which makes full use of the GPU's parallel computing ability and greatly reduces the computation overhead without any approximation. As shown in Figure 5, we divide a single top-k selection over a large tensor into two steps. First, we split the large tensor into M small chunks and select the top-k from every chunk (M × k candidates) simultaneously. Then, a second top-k selection is carried out over the selected M × k candidates. Grouping tensors of similar size makes the layer-wise top-k selection more highly integrated. With our top-k selection implementation, the extra computation overhead is negligible. Combined with hybrid parallel pipelining, the end-to-end iteration wall clock time is further reduced.
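A minimal sketch of this two-stage selection is given below (operating on a single flattened gradient tensor; chunk handling and naming are illustrative assumptions). Because every chunk keeps its own top-k candidates, the second stage recovers the exact global top-k.

```python
import torch

def divide_and_conquer_topk(grad, k, num_chunks):
    """Two-stage top-k (by magnitude) over a large flattened gradient tensor (sketch).

    Stage 1 selects top-k inside each of num_chunks chunks in parallel;
    stage 2 selects the global top-k among the num_chunks * k candidates.
    The result is exact (assuming the gradient has at least k nonzero entries).
    """
    flat = grad.reshape(-1)
    pad = (-flat.numel()) % num_chunks               # pad so the tensor splits evenly
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    chunks = flat.view(num_chunks, -1)

    # Stage 1: per-chunk top-k by magnitude, batched on the GPU.
    vals, idx = chunks.abs().topk(k, dim=1)
    offsets = torch.arange(num_chunks, device=flat.device).unsqueeze(1) * chunks.size(1)
    cand_idx = (idx + offsets).reshape(-1)           # chunk-local -> global positions
    cand_vals = vals.reshape(-1)

    # Stage 2: global top-k among the num_chunks * k candidates.
    _, top_pos = cand_vals.topk(k)
    top_idx = cand_idx[top_pos]
    return flat[top_idx], top_idx                    # selected gradient values and indices
```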
Figure 5: Divide-and-conquer top-k selection.
FAST CONTINUOUS CONVERGENCE STRATEGY
The extremely large fc layer involves the update of tens of billions of parameters in each iteration. Given the limited GPU resources, accelerating convergence becomes challenging: 1) to prevent such a large model from under-fitting or over-fitting, a large amount of training data is required, resulting in a very long training time; 2) with limited computation resources, any sophisticated learning strategy that involves trial and error is undesirable.
To address these challenges, we develop a very aggressive convergence algorithm, called the fast continuous convergence strategy (FCCS), to solve the fast convergence problem of large-scale classification models on large-scale datasets. We divide the convergence strategy into a global policy and a local policy. The local policy takes advantage of the local learning rate calculation in LARS [27], which enables large batch training, to overcome the inefficiency of massive data training. The global policy aims to control the speed of model convergence, which gives us the opportunity to complete the training within a finite number of iterations.
The global policy mainly focuses on the adjustment of the batch size and the learning rate. We divide it into two phases. The first is the warm-up phase, during which only the learning rate is adjusted while the batch size remains unchanged. The second phase progressively and continuously increases the batch size while the learning rate is kept constant.
Let B_t be the batch size of the t-th iteration and η_t be the learning rate of the t-th iteration. The learning rate adjustment includes a warm-up stage, i.e., training starts from a small learning rate that gradually increases to a large value η; the learning rate then remains constant after the warm-up stage:
$$\eta_t = \begin{cases} \dfrac{t}{T_{warm}}\,\eta & \text{if } t < T_{warm} \\[4pt] \eta & \text{if } t \ge T_{warm} \end{cases}$$
where T_warm is the total number of warm-up iterations.
The batch size adjustment is divided into an initialization stage and an aggressive continuous increase stage:
$$B_t = \begin{cases} B_0 & \text{if } t < T_{ini} \\ \lfloor f(t) \rfloor & \text{if } t \ge T_{ini} \end{cases}$$
where f(t) is defined as
$$f(t) = B_{min} + \frac{1}{2}\,(B_{max} - B_{min})\left(1 - \cos\left(\frac{t - T_{ini}}{T_{final} - T_{ini}}\,\pi\right)\right).$$
During the initialization stage, the batch size is kept at a small constant value B_0, which guarantees a sufficient update frequency in the warm-up period. During the aggressive continuous increase stage, the batch size grows quickly as the number of iterations increases.
According to the theory of [22], increasing the batch size is, to some extent, equivalent to decaying the learning rate, so we replace the traditional learning rate decay process with a batch size increase. It should be emphasized that in our method the batch size is increased in a continuous manner. If the increase were discontinuous, e.g., a piece-wise policy, more experiments would be needed to determine the hyper-parameter settings. Our continuous growth policy avoids the choice of these hyper-parameters and only needs to control the speed of batch size growth.
Besides, to overcome the limitation of GPU memory, we apply gradient accumulation to enlarge the batch size. By accumulating the gradients n times without updating the parameters, the actual batch size can be considered as n × b.
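A minimal sketch of the two global schedules follows, assuming the cosine form of f(t) given above (growing from B_min to B_max); the function and parameter names are illustrative.

```python
import math

def fccs_schedule(t, eta, T_warm, T_ini, T_final, B0, B_min, B_max):
    """Learning rate and batch size at iteration t under FCCS (sketch).

    The learning rate warms up linearly to eta and then stays constant;
    the batch size stays at B0 during initialization and then grows
    continuously from B_min toward B_max along a cosine curve.
    """
    # Warm-up: scale the learning rate linearly, then keep it fixed.
    lr = eta * t / T_warm if t < T_warm else eta

    # Batch size: constant B0, then continuous cosine growth.
    if t < T_ini:
        bs = B0
    else:
        progress = (t - T_ini) / (T_final - T_ini)
        bs = int(B_min + 0.5 * (B_max - B_min) * (1 - math.cos(math.pi * progress)))
    return lr, bs
```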
This also brings another benefit: the total communication cost can be reduced to as little as 1/n of that with a constant batch size.
EXPERIMENTS
In this section, we conduct extensive experiments to evaluate the performance of each algorithm in our system. To evaluate each component, we randomly sample three subsets of images containing 1 million, 10 million, and 100 million classes from the Alibaba Retail Product Dataset. We call these three datasets SKU-1M, SKU-10M, and SKU-100M for convenience. The overall information of each dataset is listed in Table 1.
Table 1: Overview of three datasets.
For the three datasets, we use ResNet-50 [9] as the base model of the feature extraction part in our hybrid parallel training framework. The final dimension of the features is 512. Moreover, we apply the mixed-precision training method in our training framework by default.
All of the experiments run on the in-house GPU cluster. This cluster contains 32 machines, and each machine has 8 NVIDIA Tesla V100 (32 GB) GPUs interconnected with NVIDIA NVLink. For network connectivity, the machines use a 25 Gbit Ethernet network card for communication. We use PyTorch [21] for our distributed training implementation.
We assess the classification accuracy and the throughput respectively. For softmax accuracy, we compare our approach with the following state-of-the-art methods:
• Selective Softmax [29]: we use the HF-A version (hyper-parameters L, T, and τ_cp);
• MACH [17]: we set different B and R for datasets of different scales;
• Full Softmax: the traditional softmax with the hybrid parallel training framework;
• KNN Softmax: the method proposed in this paper. We vary k (12 for 1M, 120 for 10M, 1200 for 100M) to choose the active classes.
To make the comparison fair, for the throughput evaluation we only compare the methods that reach the same accuracy as the full softmax ("low" accuracy methods are ignored, since one can always sacrifice accuracy to improve throughput).
Table 2: The classification accuracy of different methods in the three datasets.
Method                    1M        10M       100M
Selective Softmax [29]    86.39%    79.02%    71.98%
MACH [17]                 80.11%    71.34%    59.82%
KNN Softmax               87.46%    80.99%    74.54%
Full Softmax              87.43%    81.01%    74.52%
We compare the classification accuracy of the different methods on the three large-scale datasets, as shown in Table 2. Clearly, our KNN softmax achieves the same accuracy as the full softmax on all datasets. Different from our lossless KNN graph, the hashing forest adopted by selective softmax cannot ensure that all the true active classes are recalled during training. MACH performs worse than selective softmax since it uses a divide-and-conquer algorithm to approximate the original classification, which cannot guarantee the final accuracy. Our KNN softmax takes advantage of the exact, linear-search KNN graph to ensure that no nearest neighbors are lost. The experimental results demonstrate that our KNN strategy is effective.
Table 3: The throughput improvement of KNN softmax in the three datasets.
                1M      10M     100M
Full Softmax    1.0×    1.0×    1.0×
KNN Softmax     1.2×    –       –
We set the throughput of the full softmax on each of the three datasets as the baseline and evaluate the speed-up of our KNN softmax. Selective softmax and MACH are ignored due to their lower accuracy. Table 3 shows that at the scale of 100M, a speed-up of more than three times is achieved by our approach. Meanwhile, the speed-up becomes more significant as the size of the dataset increases. This is because, as the dataset grows, the proportion of time spent in the softmax stage also increases, resulting in an increasing speed-up ratio for the softmax stage. Thanks to the fully GPU-based pipeline, our KNN softmax is superior to the other state-of-the-art methods when considering both accuracy and computational efficiency.
In this part, we evaluate the large-scale classification training throughput speedup with our proposed communication strategies.
Effect of hybrid parallel pipelining.
We compare the training throughput of hybrid parallel pipelining with the hybrid parallel baseline on the three large-scale datasets. As shown in Table 4, the overlapping achieves relative performance boosts of 4.2%, 4.7%, and 5.4%, respectively. We tuned the micro-batch size for better network bandwidth usage.
Table 4: The training speedup with communication optimization in the three datasets.
                              1M        10M       100M
hybrid parallel baseline      –         –         –
+ overlapping                 1.042×    1.047×    1.054×
+ layer-wise sparsification   1.162×    –         1.123×

Table 5: The training accuracy with layer-wise sparsification in the three datasets.
                              1M        10M       100M
baseline                      87.43%    81.01%    74.52%
layer-wise sparsification     87.40%    81.05%    74.45%

Table 6: The wall clock time with different top-k methods (average of 1000 trials).
sampling top-k [16]           83.27
divide-and-conquer top-k      –

Effect of gradient sparsification.
As shown in Table 4, combining our efficient hybrid parallel pipeline with gradient sparsification accelerates training throughput up to 1.123× on the SKU-100M dataset. Layer-wise top-k sparsification updates only part of the parameters in a single iteration, so we also evaluate its influence on the final classification accuracy. Table 5 shows that introducing layer-wise top-k gradient sparsification in our hybrid parallel framework causes no accuracy degradation on any of the three datasets. Table 6 shows the efficiency of our proposed top-k selection method, which saves 94.2% of the wall clock time compared with a plain for-loop implementation and is 7× faster than the sampling top-k implementation.
In this subsection, experiments are conducted to show the efficiency of the fast continuous convergence strategy. For the comparison between the baselines and the fast continuous convergence strategy, we combine the aforementioned optimization methods with KNN softmax and communication/computation overlapping, to ensure no other factor is incorporated. We evaluate the accuracy and training speed of the following methods: 1) Piece-wise decay, the traditional learning rate decay policy, which decays the learning rate by a factor of 1/10 every five epochs. 2) Adam [13], with the initial learning rate set to 1e-3. 3) FCCS without the batch size policy, i.e., B_max = B_min = B_0, so that only the learning rate policy in FCCS is kept. 4) FCCS, our proposed method, with fixed B_max, B_min = B_0, and T_final. We also keep the same initial batch size B_0 and the same initial learning rate η for each method in each task.
Figure 6: Comparison of the convergence speed of FCCS and the traditional piece-wise decay learning rate policy.
Figure 7: Comparison of the batch size adjustment of FCCS and the traditional piece-wise decay learning rate policy.
We compare the accuracy of the different convergence strategies on each large classification task. As shown in Table 7, piece-wise decay achieves the best accuracy on all three tasks, and FCCS obtains very similar and competitive results, which proves the effectiveness of the batch size adjustment. Compared with the variant without batch size adjustment, FCCS improves the accuracy from 68.12% to 87.40% with the batch size increase policy, which further shows the power of FCCS. The same improvement can be found in the other two large classification tasks. These results all indicate that the batch size increase policy in FCCS can take the place of the learning rate decay policy in traditional methods. In contrast, the other baseline, Adam, brings an obvious loss of accuracy on nearly all three tasks.
As shown in Figure 6, on the 1M classification task FCCS reaches a final accuracy of 87.40% in 8 epochs while the baseline reaches 87.46% in nearly 20 epochs. Since the total training procedure is sped up by 2.5×, the slight loss in test accuracy (87.40% vs. 87.46%) is tolerable. As shown in Figure 7, which demonstrates the difference between FCCS and piece-wise decay in batch size adjustment,
we increase the batch size at the beginning of every epoch to simulate the continuous change of the cosine curve. Note that if we shorten the plateaus in piece-wise decay to make training faster, the final accuracy may become lower, and it takes repeated effort to adjust the learning rate. Besides, the adaptive optimizer may converge fast at the beginning of training, but it finally brings a noticeable accuracy loss compared with our method. The reason our method outperforms Adam is that we change the batch size within a proper range, keeping it neither too large nor too small.
Figure 8: The training speedup with the proposed methods.
Table 7: The test accuracy of different training methods in the three datasets.
As mentioned above, we deploy all the proposed methods together in our extreme classification system to train a classifier of 100 million classes on the SKU-100M dataset. Since these methods are orthogonal to each other, we can make full use of them to maximize the training speed of our system.
As depicted in Figure 8, we present the system throughput obtained by adding KNN softmax, hybrid parallel overlapping, and layer-wise top-k gradient sparsification sequentially. Compared with the full softmax baseline, the final throughput of our system reaches a 3.9× improvement (about 51,800 images/sec). We also adopt the fast continuous convergence strategy (FCCS) to accelerate convergence, which reduces training from 20 epochs to 8 epochs, equivalent to a 2.5× speed-up.
The final results are shown in Table 8. Compared with naive softmax training without FCCS, our proposed method reduces the total training time to about five days while reaching a comparable accuracy.
Table 8: Final results on SKU-100M dataset.
DEPLOYMENT
After finishing the training of the 100-million-class classifier, we deploy the large model using the in-house retrieval system [30]. For the weights of the fully connected layer W, we treat the weight vector w_j as the feature embedding of the j-th class. We then use all of these embeddings to build a graph index for classifying images. The online classification process is as follows: 1) receive the query image and pre-process it; 2) feed the query image into the feature extraction model to obtain its feature embedding; 3) use the feature embedding to search the whole index and take the nearest neighbor as the final class; 4) return the classification result. It takes only one GPU to deploy a feature extraction model with the retrieval system. Moreover, we can add more GPUs incrementally to handle a large number of queries.
CONCLUSION
In this work, we propose an extreme classification system at the scale of 100 million classes. We deploy a KNN softmax implementation to reduce GPU memory consumption and computation costs. As the system runs on an in-house GPU cluster, we design a new communication strategy that contains a hybrid parallel overlapping pipeline and layer-wise top-k gradient sparsification to reduce communication overhead. We also propose a fast continuous convergence strategy to accelerate training by adaptively adjusting the learning rate and updating the parameters. All of these methods improve the speed of training the extreme classifier. The experimental results show that, using an in-house 256-GPU cluster, we reduce the total training time to five days and reach an accuracy comparable to the naive softmax training process.
REFERENCES
[1] Alham Fikri Aji and Kenneth Heafield. 2017. Sparse Communication for Distributed Gradient Descent. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). 440–445.
[2] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. 2015. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems (NeurIPS). 730–738.
[3] Léon Bottou, Frank E Curtis, and Jorge Nocedal. 2018. Optimization methods for large-scale machine learning. SIAM Review 60, 2 (2018), 223–311.
[4] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
[5] Wenlin Chen, David Grangier, and Michael Auli. 2016. Strategies for Training Large Vocabulary Neural Language Models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). 1975–1985.
[6] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. ArcFace: Additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4690–4699.
[7] Joshua Goodman. 2001. Classes for fast maximum entropy training. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1. IEEE, 561–564.
[8] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
[10] Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML). 448–456.
[11] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. 2018. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205 (2018).
[12] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. 2019. Error Feedback Fixes SignSGD and other Gradient Compression Schemes. In International Conference on Machine Learning (ICML). 3252–3261.
[13] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[14] Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 (2014).
[15] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation (OSDI). 583–598.
[16] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. 2017. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887 (2017).
[17] Tharun Kumar Reddy Medini, Qixuan Huang, Yiqiu Wang, Vijai Mohan, and Anshumali Shrivastava. 2019. Extreme Classification in Log Memory using Count-Min Sketch: A Case Study of Amazon Search with 50M Products. In Advances in Neural Information Processing Systems (NeurIPS). 13244–13254.
[18] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017).
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[20] Priyanka Nigam, Yiwei Song, Vijai Mohan, Vihan Lakshman, Weitian Ding, Ankit Shingavi, Choon Hui Teo, Hao Gu, and Bing Yin. 2019. Semantic product search. In Proceedings of the 25th International Conference on Knowledge Discovery and Data Mining (SIGKDD). 2876–2885.
[21] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS). 8024–8035.
[22] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. 2017. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489 (2017).
[23] Peng Sun, Wansen Feng, Ruobing Han, Shengen Yan, and Yonggang Wen. 2019. Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes. arXiv preprint arXiv:1902.06855 (2019).
[24] Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. 2014. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems (NeurIPS). 1988–1996.
[25] Yukihiro Tagami. 2017. AnnexML: Approximate nearest neighbor search for extreme multi-label classification. In Proceedings of the 23rd International Conference on Knowledge Discovery and Data Mining (SIGKDD). 455–464.
[26] Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. 2019. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems (NeurIPS). 14236–14245.
[27] Yang You, Igor Gitman, and Boris Ginsburg. 2017. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888 (2017).
[28] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. 2018. ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing (ICPP). ACM, 1.
[29] Xingcheng Zhang, Lei Yang, Junjie Yan, and Dahua Lin. 2018. Accelerated training for massive classification via dynamic class selection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
[30] Kang Zhao, Pan Pan, Yun Zheng, Yanhao Zhang, Changxu Wang, Yingya Zhang, Yinghui Xu, and Rong Jin. 2019. Large-Scale Visual Search with Binary Distributed Graph at Alibaba. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM).