Empirical Evaluation of Parallel Training Algorithms on Acoustic Modeling
Wenpeng Li, Binbin Zhang, Lei Xie*, Dong Yu
School of Computer Science, Northwestern Polytechnical University, Xi'an, China
Tencent AI Lab, Seattle, USA
{wpli, bbzhang, lxie}@nwpu-aslp.org, [email protected]
*Corresponding author

Abstract
Deep learning models (DLMs) are state-of-the-art techniques in speech recognition. However, training good DLMs can be time consuming, especially for production-size models and corpora. Although several parallel training algorithms have been proposed to improve training efficiency, there is no clear guidance on which one to choose for the task at hand due to the lack of a systematic and fair comparison among them. In this paper we aim at filling this gap by comparing four popular parallel training algorithms in speech recognition, namely asynchronous stochastic gradient descent (ASGD), blockwise model-update filtering (BMUF), bulk synchronous parallel (BSP) and elastic averaging stochastic gradient descent (EASGD), on the 1000-hour LibriSpeech corpus using feed-forward deep neural networks (DNNs) and convolutional, long short-term memory, DNNs (CLDNNs). Based on our experiments, we recommend BMUF as the top choice for training acoustic models since it is the most stable, scales well with the number of GPUs, achieves reproducible results, and in many cases even outperforms single-GPU SGD. ASGD can be used as a substitute in some cases.
Index Terms: speech recognition, parallel algorithm, ASGD, BMUF, BSP, EASGD
1. Introduction
Since 2010, the year in which deep neural networks (DNNs) were successfully applied to large vocabulary continuous speech recognition (LVCSR) tasks [1, 2, 3] and led to significant recognition accuracy improvements over the then state of the art, various deep learning models, such as convolutional neural networks (CNNs) [4, 5, 6, 7, 8, 9], long short-term memory (LSTM) recurrent neural networks (RNNs) [10, 11, 12, 13, 14, 15, 16, 17] and their variants [18, 19, 20, 21, 22, 23], have been developed to further improve the performance of automatic speech recognition (ASR) systems. Albeit achieving state-of-the-art performance, these deep learning models are time consuming to train well, especially on a single GPU. Trade-offs often need to be made between the scale of the model and training corpus (and thus recognition accuracy) and the training time, because even with today's massively parallel GPUs it usually takes days or weeks to train large models to the desired accuracy on a single GPU.

Many parallel training algorithms have been proposed to speed up training. These algorithms can be categorized into two classes: model parallelism (e.g., [24, 25]), which exploits and splits the structure of neural networks to distribute computation across GPUs, and data parallelism (e.g., [24, 26, 27]), which splits and distributes data across GPUs to achieve speedup. Model parallelism focuses on computing more parameters at the same time; it allows, and is more suitable for, training models that are too big to fit in the memory of a single device. Data parallelism, on the other hand, concentrates on processing more training samples at the same time and is thus best used when there are enormous numbers of training samples. In speech recognition, data parallelism is more important since ASR models usually fit well on a single GPU while the training set is often large.

The core problem data parallelism algorithms try to solve is the difficulty of parallelizing the minibatch-based stochastic gradient descent (SGD) algorithm [28], which is the most popular technique for training deep learning models (DLMs). Several successful techniques, such as asynchronous stochastic gradient descent (ASGD) [29, 30, 31], blockwise model-update filtering (BMUF) [32], bulk synchronous parallel (BSP) [33, 34, 35, 36], 1-bit SGD [26] and elastic averaging stochastic gradient descent (EASGD) [37], have been proposed recently. Unfortunately, these techniques solve the problem with different assumptions and strategies, have been evaluated only on vastly different data sets and tasks, and there is no theoretical guarantee on their behavior when used to train DLMs. This makes it difficult to select the right parallel algorithm for training models on industrial-size corpora.

In this paper, we evaluate and systematically compare four parallel training algorithms, namely BSP, ASGD, BMUF and EASGD, with regard to training speed, convergence behavior, final model performance, reproducibility, and robustness across models, numbers of GPUs, and learning control parameters. To the best of our knowledge, this is the first time these algorithms are compared relatively thoroughly on ASR tasks. It is also the first time EASGD is evaluated for acoustic model training. All four algorithms were implemented in the Kaldi toolkit [38] using the message passing interface (MPI) for parameter exchange across GPUs. Using the same communication protocol guarantees that the comparison is fair and reliable.
To evaluate these algorithms, we train DNNs and CLDNNs [23] (an architecture that stacks CNNs, LSTMs and DNNs) on the 1000-hour LibriSpeech [39] corpus.

The rest of the paper is organized as follows. In Section 2, we introduce BSP, ASGD, BMUF and EASGD, and discuss the relationships between them. In Section 3 we describe the experimental setups and report the results. We conclude the paper in Section 4.
2. Parallel training algorithms
The bulk synchronous parallel (BSP) [33] algorithm is often re-ferred to as model averaging. In this model, data are distributedacross multiple workers. Each worker updates its local modelreplica independently using its own portion of data with SGD.Periodically the local models are averaged and the generatedglobal model is synchronized across workers. We denote w it asthe i -th worker’s local model at time t . The global model ˜ w t is a r X i v : . [ c s . C L ] J u l arameter Server
The global model \tilde{w}_t is computed as

    \tilde{w}_t = \bar{w}_t = \frac{1}{N} \sum_{i=1}^{N} w_t^i,    (1)

where N is the number of local workers and \bar{w}_t is the average of the local models. This algorithm is easy to implement and can achieve linear speedup when the communication cost can be ignored (e.g., with a large synchronization period), at the cost of recognition accuracy degradation, especially when the number of workers becomes large.

Figure 1: ASGD architecture. Arrows indicate communication between the parameter server and the workers.

The asynchronous stochastic gradient descent (ASGD) algorithm is the distributed version of SGD. It is proved [40] that ASGD converges for convex problems. As shown in Figure 1, ASGD uses a parameter server and several local workers. Each worker independently and asynchronously pulls the latest global model \tilde{w}_t from the parameter server, computes the gradient \nabla w_t^i with a new minibatch, and sends it to the parameter server. The parameter server always keeps the current model. When it receives the gradient \nabla w_t^i from worker i, it generates the new model

    \tilde{w}_{t+k+1} = \tilde{w}_{t+k} - \eta \nabla w_t^i,    (2)

where \eta is the learning rate. Before worker i sends the gradient \nabla w_t^i back to the parameter server, some other workers may have already added their local gradients to the model and updated it k times to become \tilde{w}_{t+k}. Therefore ASGD essentially adds a "delayed" gradient \nabla w_t^i, computed based on the model \tilde{w}_t, to the model \tilde{w}_{t+k} [41]. This may be the reason that ASGD can be unstable: sometimes the model converges to the same accuracy as that trained with SGD but with more iterations, and sometimes it never achieves the same performance as SGD, especially when there are many workers.

The blockwise model-update filtering (BMUF) algorithm [32] can be considered an improved model averaging technique in which the global model update is implemented as a filter. In BMUF, the full training set D is partitioned into M non-overlapping blocks and each block is further partitioned into N non-overlapping splits, where N is the number of workers. Each worker updates its local model with its portion of the data. The N optimized local models are then averaged using Eq. (1). Unlike BSP, which treats the average model \bar{w}_t as the global model, BMUF generates the global model \tilde{w}_t as

    \tilde{w}_t = \tilde{w}_{t-1} + \Delta_t,    (3)

where

    \Delta_t = \zeta \Delta_{t-1} + \eta G_t,    0 \le \zeta < 1,  \eta > 0,    (4)

is the global-model update,

    G_t = \bar{w}_t - \tilde{w}_{t-1}    (5)

is the model update resulting from a block, \zeta is called the block momentum (BM) and \eta is called the block learning rate (BLR). We use the formula

    \frac{\eta}{N (1 - \zeta)} = C    (6)

to set the values of \zeta and \eta empirically, where C is a constant slightly larger than 1 and N is the number of workers. In practice, \eta and C are set to fixed values and \zeta is then calculated from Eq. (6). We implemented CBM-BMUF [32] in this work.

In elastic averaging stochastic gradient descent (EASGD) [37], the loss function is defined as

    \min_{w_t^1, \ldots, w_t^N, \tilde{w}_t} \sum_{i=1}^{N} \left[ f(D \mid w_t^i) + \lambda \| w_t^i - \tilde{w}_t \|^2 \right],    (7)

where D is the training set, f(.) is the loss function for local sequential training, \lambda is a hyper-parameter for the quadratic penalty term, w_t^i represents the model of the i-th worker, and \tilde{w}_t represents the global model. From Eq. (7) we observe that EASGD minimizes the loss summed over all workers, as well as the quadratic difference between the global model and the local models. The term \lambda \| w_t^i - \tilde{w}_t \|^2 is a quadratic regularization term which forces local workers to stay close to the global model.
By taking the derivatives with respect to w_t^i and \tilde{w}_t in Eq. (7), we get the update rules for w_t^i and \tilde{w}_t in synchronous EASGD:

    w_{t+1}^i = w_t^i - \eta \nabla w_t^i - \eta\lambda (w_t^i - \tilde{w}_t),
    \tilde{w}_{t+1} = \tilde{w}_t - \eta\lambda \sum_{i=1}^{N} (\tilde{w}_t - w_t^i),    (8)

where \nabla w_t^i is the stochastic gradient of f(.) with respect to w_t^i. In asynchronous EASGD, \nabla w_t^i is only used in the local update, and the update rules for the local and global models become

    w_{t+1}^i = w_t^i - \alpha (w_t^i - \tilde{w}_t),
    \tilde{w}_{t+1} = \tilde{w}_t - \alpha (\tilde{w}_t - w_t^i),    (9)

where \alpha = \eta\lambda controls the update step. A small \alpha allows for more exploration, since it allows w^i to fluctuate further from \tilde{w}, while a large \alpha makes the local models perform more exploitation. We only implemented asynchronous EASGD in this work.

These four algorithms, although different, are related. First, ASGD and EASGD are asynchronous algorithms based on the client/server framework, in which the global model is stored on and updated by a parameter server, and each worker computes gradients and updates its local model independently. Workers only exchange parameters with the server and do not communicate with each other. BSP and BMUF, on the other hand, are synchronous algorithms that do not use a server; all workers exchange parameters synchronously with each other. Second, in ASGD the global model is updated based on the local gradients computed by and sent from the workers. In BSP, EASGD and BMUF, however, the global model is a weighted sum of local models instead of gradients. Third, EASGD and BMUF both introduce extra hyper-parameters whose values may affect the training behavior, while ASGD and BSP have no extra hyper-parameters and thus require less tuning in practice. Fourth, we argue that BMUF actually minimizes the difference between the global and local models,

    F(\tilde{w}_t) = \frac{1}{2} \sum_{i=1}^{N} \| w_t^i - \tilde{w}_t \|^2.    (10)

Taking the derivative with respect to \tilde{w}_t gives

    \nabla \tilde{w}_t = \sum_{i=1}^{N} (\tilde{w}_t - w_t^i).    (11)

Setting \nabla \tilde{w}_t = 0 and solving directly for \tilde{w}_t, we get

    \tilde{w}_t = \frac{1}{N} \sum_{i=1}^{N} w_t^i,    (12)

which is the same as BSP in Eq. (1). If we instead optimize \tilde{w}_t using SGD, then

    \tilde{w}_t = \tilde{w}_{t-1} - \eta \sum_{i=1}^{N} (\tilde{w}_{t-1} - w_t^i),    (13)

which has the same form as the global update of EASGD in Eq. (8). Further, if we optimize \tilde{w}_t using momentum SGD, then

    \tilde{w}_t = \tilde{w}_{t-1} - \eta \sum_{i=1}^{N} (\tilde{w}_{t-1} - w_t^i) - \zeta \nabla \tilde{w}_{t-1}
               = \tilde{w}_{t-1} + \eta' G_t + \zeta' \Delta_{t-1}
               = \tilde{w}_{t-1} + \Delta_t,    (14)

where \eta' = N\eta and \zeta' = N\zeta. This is exactly the BMUF update rule in Eq. (3). Therefore we conclude that the global model updates of BSP, EASGD and BMUF are derived from the same objective function with different optimization strategies.
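To make these update rules concrete, the sketch below implements the four global-model updates on plain NumPy arrays. It is only our own illustration of the notation above, not the Kaldi/MPI implementation used in this work, and the function and argument names (e.g., block_lr, block_momentum) are ours.

```python
import numpy as np

def bsp_update(local_models):
    """BSP / model averaging, Eq. (1): the global model is the plain average."""
    return np.mean(local_models, axis=0)

def asgd_update(global_model, worker_grad, lr):
    """ASGD, Eq. (2): the server applies a (possibly delayed) worker gradient."""
    return global_model - lr * worker_grad

def bmuf_update(global_model, local_models, prev_delta, block_lr, block_momentum):
    """BMUF, Eqs. (3)-(5): the block-level model update is filtered with momentum."""
    g = np.mean(local_models, axis=0) - global_model        # G_t, Eq. (5)
    delta = block_momentum * prev_delta + block_lr * g       # Delta_t, Eq. (4)
    return global_model + delta, delta                       # Eq. (3)

def easgd_update(global_model, local_model, alpha):
    """Asynchronous EASGD, Eq. (9): elastic pull between one worker and the center."""
    new_local = local_model - alpha * (local_model - global_model)
    new_global = global_model - alpha * (global_model - local_model)
    return new_global, new_local
```

In this view, BSP is simply BMUF with block_momentum = 0 and block_lr = 1, consistent with the derivation above.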
3. Experiments
In this work, all the models are trained on the 1000-hour LibriSpeech [39] dataset. The 40-dimensional FBANK features computed on a 25ms window shifted by 10ms are used. The lexicon and language model (LM) are provided by the dataset. Specifically, the results reported here are all obtained with the full 3-gram LM. We used the test-clean and test-other sets for evaluation.

To evaluate the parallel training algorithms, we trained two types of DLMs: DNNs and CLDNNs [23]. The input to the DNNs is the 40-dim FBANK feature with first- and second-order time derivatives and an 11-frame context. The input to the CLDNNs is the same as that to the DNNs but without the second-order time derivatives. The DNN has 6 hidden layers, each containing 1024 neurons. With 5723 HMM tied-states as output classes, it has about 13.5 million parameters. The CLDNN consists of 1 CNN layer (128 feature maps), 2 DNN layers (1024 neurons each) and 2 LSTM layers (1024 memory blocks and 512 projections). With the same output classes as the DNN, the CLDNN has about 13.8 million parameters. Both models use the ReLU activation function.

To ensure the fairness of the comparison and the credibility of the experimental results, we took the following measures. First, all experiments in this work were carried out on the same computing node with 8 GTX 1080 GPUs. Second, the four parallel training algorithms were implemented in the Kaldi toolkit, and the parameter exchange among GPUs is based on OpenMPI. Third, in all parallel training runs we used the same initial model, obtained by one epoch of minibatch SGD on a single GPU. Finally, we used the identical learning rate schedule and the same initial learning rate: the learning rate is kept fixed as long as the cross-entropy loss on a cross-validation (CV) set decreases by at least 1%; it is then halved each epoch until the optimization terminates when the CV loss decreases by less than 0.1%.
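The learning rate rule above can be summarized in a few lines. The following sketch is only an illustration of that rule as stated; the helper name and return convention are ours and not part of the Kaldi recipe.

```python
def update_learning_rate(lr, prev_cv_loss, cv_loss, halving):
    """One epoch of the learning-rate schedule described above (illustrative sketch).

    The rate stays fixed while the relative CV cross-entropy improvement is at
    least 1%; once it drops below 1% the rate is halved every epoch, and training
    stops when the improvement falls below 0.1%.
    """
    improvement = (prev_cv_loss - cv_loss) / prev_cv_loss   # relative CV-loss decrease
    stop = improvement < 0.001                              # below 0.1%: terminate
    halving = halving or improvement < 0.01                 # below 1%: start (and keep) halving
    if halving:
        lr /= 2.0
    return lr, halving, stop
```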
Table 1: WER and training speedup of DNNs trained by single-GPU minibatch-SGD, ASGD, BMUF, BSP, and EASGD. The synchronization period is 5 minibatches for each algorithm.

Parallel Algorithm   GPU Number   Training Speedup   WER(%) test-clean   WER(%) test-other
ASGD                 4            2.74X              5.91                15.55
ASGD                 8            4.72X              6.09                16.20
BMUF                 4            2.68X              5.70                15.01
BMUF                 8            -                  -                   -
BSP                  4            2.68X              6.01                16.03
BSP                  8            4.56X              6.21                16.55
EASGD                4            2.80X              6.04                16.02
EASGD                8            5.00X              6.22                16.64
Minibatch-SGD        1            1.0X               5.83                15.44
Figure 2: Learning curves of CE loss on the CV set with different algorithms and GPU numbers for DNN training.
In DNN training, the minibatch size was set to 4096. Table 1 compares the training speedups and word error rates (WER) on the test sets for the four parallel training algorithms on 4 GPUs and 8 GPUs. When using the same synchronization period, EASGD achieved the best training speedup, followed by ASGD, BMUF and BSP. However, BMUF achieved the best WER in both the 4- and 8-GPU cases, and even outperformed single-GPU SGD. As Figure 2 shows, the convergence trend of BMUF on the CV set is similar to that of single-GPU minibatch-SGD. To further verify these conclusions, we chose the most appropriate synchronization period for each algorithm based on the literature. The results in Table 2 show that BMUF still performed the best.
In CLDNN training, we computed gradients on 100 subsequences from different utterances at the same time. Truncated BPTT with a truncation step of 20 was used to train the models. The appropriate synchronization period was chosen for each algorithm based on the literature; specifically, the synchronization periods for BSP, ASGD, BMUF, and EASGD are 5, 5, 80, and 64 minibatches, respectively. Table 3 shows that BMUF achieved the best WER and training speedup with 4 GPUs, and ASGD achieved the best WER with 8 GPUs.
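For concreteness, the following sketch shows one way to form such minibatches of truncated subsequences. It is our own illustration (the helper name and zero-padding scheme are assumptions), not the Kaldi code used for these experiments.

```python
import numpy as np

def tbptt_chunks(utterances, num_streams=100, trunc=20):
    """Sketch of the truncated-BPTT batching described above.

    `utterances` is a list of [T_i, feat_dim] float arrays. Frames from
    `num_streams` utterances are advanced in parallel, and the backward pass
    is truncated every `trunc` frames; in training, the LSTM states would be
    carried over between successive chunks of the same stream.
    """
    feat_dim = utterances[0].shape[1]
    for start in range(0, len(utterances), num_streams):
        streams = utterances[start:start + num_streams]
        max_len = max(u.shape[0] for u in streams)
        for t0 in range(0, max_len, trunc):
            # One minibatch: [num_streams, trunc, feat_dim], zero-padded at the ends.
            chunk = np.zeros((len(streams), trunc, feat_dim), dtype=np.float32)
            for i, u in enumerate(streams):
                seg = u[t0:t0 + trunc]
                chunk[i, :seg.shape[0]] = seg
            yield chunk
```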
The choice of synchronization period τ affects the behavior of BMUF and ASGD. As Table 4 shows, the WER of ASGD gradually increased, and training even diverged, as τ was increased. This is because ASGD suffers from the problem of delayed gradient updates, and a larger synchronization period results in greater latency. In contrast, we observed no performance degradation for BMUF when varying the synchronization period. As for training speedup, when the synchronization period τ is small, the parameter exchange among GPUs is frequent and the communication overhead is the main bottleneck of training speed. As τ increases (from 5 to 20 minibatches in Table 4), the communication overhead decreases and the training speedup increases. As τ increases further, the communication overhead becomes small enough that the computation speed is the main bottleneck, so the training speedup remains almost unchanged (from 20 to 80 minibatches in Table 4).

Table 2: WER and training speedup of DNNs trained by ASGD, BMUF, BSP and EASGD. The most appropriate synchronization period is chosen for each algorithm, respectively.

Parallel Algorithm   Sync Period   Training Speedup   WER(%) test-clean   WER(%) test-other
ASGD                 1             2.22X              5.85                15.54
BMUF                 80            2.93X              5.74                14.87
BSP                  5             2.68X              6.01                16.03
EASGD                64            2.99X              6.06                15.97

Table 3: WER and training speedup of CLDNNs trained by ASGD, BMUF, BSP and EASGD. The most appropriate synchronization period is chosen for each algorithm, respectively.

Parallel Algorithm   GPU Number   Training Speedup   WER(%) test-clean   WER(%) test-other
ASGD                 4            3.42X              5.37                14.08
ASGD                 8            6.11X              -                   -
BMUF                 4            3.84X              -                   -

Table 4: WER and training speedup of DNNs trained by ASGD and BMUF on 4 GPUs with various synchronization periods.

Parallel Algorithm   Sync Period   Training Speedup   WER(%) test-clean   WER(%) test-other
ASGD                 5             2.74X              5.91                15.55
ASGD                 20            2.96X              5.97                15.78
ASGD                 80            divergence         -                   -
BMUF                 5             2.68X              5.70                15.01
BMUF                 20            2.90X              5.73                14.86
BMUF                 80            2.93X              5.74                14.87
Minibatch-SGD        1             1.0X               5.83                15.44

Table 5 compares three different minibatch sizes in BMUF. With single-GPU training, the training speed is lower when a smaller minibatch size is used, because a small minibatch does not fully utilize the GPU and the model is updated more frequently.

Table 5: WER and training speedup of DNNs trained by BMUF on 4 GPUs with various minibatch sizes.

Minibatch Size   Sync Period   Training Speedup   WER(%) test-clean   WER(%) test-other
256              Single-GPU    1.0X               5.80                15.13
256              5             1.85X              5.86                15.31
256              20            2.54X              5.86                15.26
256              80            3.21X              5.87                15.13
1024             Single-GPU    1.0X               5.72                14.80
1024             5             2.14X              5.71                15.02
1024             20            2.76X              5.74                15.02
1024             80            2.98X              5.69                14.90
4096             Single-GPU    1.0X               5.83                15.44
4096             5             2.68X              5.70                15.01
4096             20            2.90X              5.73                14.86
4096             80            2.93X              5.74                14.87

The training speedup s of multi-GPU training can be written as

    s = \frac{t_s}{f(t_s, N) + t_c},    (15)

where t_s is the computation time for one epoch of the dataset on a single GPU, N is the number of GPUs, t_c is the communication overhead, and

    f(t_s, N) = \alpha \frac{t_s}{N}    (16)

is a decreasing function of N; \alpha equals 1 when the minibatch can fill the GPU and is greater than 1 otherwise. Combining Eqs. (15) and (16) gives

    s = \frac{1}{\alpha/N + t_c/t_s}.    (17)

This means the training speedup s of multi-GPU parallel training depends on the ratio t_c/t_s. When the synchronization period τ is small, the communication overhead t_c is large due to frequent parameter exchange among GPUs. Although both t_c and t_s decrease as the minibatch size increases, t_c decreases more quickly, so a larger minibatch size leads to a greater training speedup. When τ is large, however, t_c is already small; t_s then decreases more quickly with increasing minibatch size, which lowers the speedup, although the absolute training speed still benefits from the larger minibatch. Eq. (17) also explains why the speedup of CLDNNs (Table 3) is much larger than that of DNNs (Table 1): with the same number of GPUs and synchronization period, t_c is similar for DNNs and CLDNNs, but t_s of CLDNNs is larger.
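As a quick illustration of Eq. (17), the snippet below plugs hypothetical timings into the formula; the numbers are invented for illustration only, not measured values from our experiments.

```python
def speedup(n_gpus, t_s, t_c, alpha=1.0):
    """Multi-GPU training speedup from Eq. (17): s = 1 / (alpha/N + t_c/t_s)."""
    return 1.0 / (alpha / n_gpus + t_c / t_s)

# Hypothetical example: one epoch costs t_s = 100 (arbitrary time units) of
# computation on a single GPU. With frequent synchronization the communication
# overhead t_c is large and 4 GPUs fall well short of a 4x speedup; a longer
# synchronization period shrinks t_c and recovers most of it.
print(speedup(4, t_s=100.0, t_c=20.0))  # ~2.2x
print(speedup(4, t_s=100.0, t_c=5.0))   # ~3.3x
```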
4. Conclusions
We implemented four parallel training algorithms, discussed the relationships among them, and evaluated them on speech recognition tasks. The experimental results show that BMUF and ASGD consistently outperform BSP and EASGD. BMUF, in particular, achieved the best performance without frequent synchronization, and even outperformed single-GPU SGD in some cases. We conjecture that the momentum used in the BMUF global model update makes the global model depend not only on each local model but also on the previous global model. ASGD also achieved good performance and a larger training speedup for the same synchronization period. Benefiting from its asynchronous nature, ASGD is insensitive to differences in the computing capacity of the workers, but it is sensitive to the synchronization period and suffers from poor reproducibility.
5. Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 61571363).

6. References

[1] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech & Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[2] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in INTERSPEECH, 2011, pp. 437–440.
[3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, and T. N. Sainath, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[4] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in ICASSP, 2012, pp. 4277–4280.
[5] O. Abdel-Hamid, L. Deng, and D. Yu, "Exploring convolutional neural network structures and optimization techniques for speech recognition," in INTERSPEECH, 2013, pp. 3366–3370.
[6] T. N. Sainath, B. Kingsbury, A.-r. Mohamed, G. E. Dahl, G. Saon, H. Soltau, T. Beran, A. Y. Aravkin, and B. Ramabhadran, "Improvements to deep convolutional neural networks for LVCSR," in ASRU, 2013, pp. 315–320.
[7] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in ICASSP, 2013, pp. 8614–8618.
[8] O. Abdel-Hamid, A. R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech & Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
[9] A.-r. Mohamed, "Deep neural network acoustic models for ASR," Ph.D. dissertation, University of Toronto, 2014.
[10] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," arXiv preprint arXiv:1402.1128, 2014.
[11] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in INTERSPEECH, 2014, pp. 338–342.
[12] Y. Miao, J. Li, Y. Wang, S. X. Zhang, and Y. Gong, "Simplifying long short-term memory acoustic models for fast training and decoding," in ICASSP, 2016, pp. 2284–2288.
[13] H. Sak, A. Senior, K. Rao, and F. Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," arXiv preprint arXiv:1507.06947, 2015.
[14] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk, "Learning acoustic frame labeling for speech recognition with recurrent neural networks," in ICASSP, 2015, pp. 4280–4284.
[15] A. Senior, H. Sak, and I. Shafran, "Context dependent phone models for LSTM RNN acoustic modelling," in ICASSP, 2015, pp. 4585–4589.
[16] K. Chen and Q. Huo, "Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach," IEEE/ACM Transactions on Audio, Speech & Language Processing, vol. 24, no. 7, pp. 1185–1193, 2016.
[17] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, "Highway long short-term memory RNNs for distant speech recognition," in ICASSP, 2016, pp. 5755–5759.
[18] D. Yu, W. Xiong, J. Droppo, A. Stolcke, G. Ye, J. Li, and G. Zweig, "Deep convolutional neural networks with layer-wise context expansion and attention," in INTERSPEECH, 2016, pp. 17–21.
[19] S. Sun, B. Zhang, L. Xie, and Y. Zhang, "An unsupervised deep domain adaptation approach for robust speech recognition," Neurocomputing, 2017.
[20] T. Sercu and V. Goel, "Advances in very deep convolutional neural networks for LVCSR," in INTERSPEECH, 2016, pp. 3429–3433.
[21] S. Zhang, C. Liu, H. Jiang, S. Wei, L. Dai, and Y. Hu, "Feedforward sequential memory networks: A new structure to learn long-term dependency," arXiv preprint arXiv:1512.08301, 2015.
[22] S. Zhang, H. Jiang, S. Xiong, S. Wei, and L. R. Dai, "Compact feedforward sequential memory networks for large vocabulary continuous speech recognition," in INTERSPEECH, 2016, pp. 3389–3393.
[23] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in ICASSP, 2015, pp. 4580–4584.
[24] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., "Large scale distributed deep networks," in NIPS, 2012, pp. 1223–1231.
[25] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro, "Deep learning with COTS HPC systems," in ICML, 2013.
[26] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, "1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs," in INTERSPEECH, 2014, pp. 1058–1062.
[27] A. Agarwal and J. C. Duchi, "Distributed delayed stochastic optimization," in NIPS, 2011, pp. 873–881.
[28] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors. MIT Press, 1986.
[29] G. Heigold, E. Mcdermott, V. Vanhoucke, and A. Senior, "Asynchronous stochastic optimization for sequence training of deep neural networks," in ICASSP, 2014, pp. 5587–5591.
[30] S. Zhang, C. Zhang, Z. You, and R. Zheng, "Asynchronous stochastic gradient descent for DNN training," in ICASSP, 2013, pp. 6660–6663.
[31] T. Paine, H. Jin, J. Yang, Z. Lin, and T. Huang, "GPU asynchronous stochastic gradient descent to speed up neural network training," arXiv preprint arXiv:1312.6186, 2013.
[32] K. Chen and Q. Huo, "Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering," in ICASSP, 2016, pp. 5880–5884.
[33] L. G. Valiant, "A bridging model for parallel computation," Communications of the ACM, vol. 33, no. 8, pp. 103–111, 1990.
[34] H. Ma, F. Mao, and G. W. Taylor, "Theano-MPI: a Theano-based distributed training framework," arXiv preprint arXiv:1605.08325, 2016.
[35] D. Povey, X. Zhang, and S. Khudanpur, "Parallel training of DNNs with natural gradient and parameter averaging," arXiv preprint arXiv:1410.7455, 2014.
[36] H. Su and H. Chen, "Experiments on parallel training of deep neural network using model averaging," arXiv preprint arXiv:1507.01239, 2015.
[37] S. Zhang, A. E. Choromanska, and Y. LeCun, "Deep learning with elastic averaging SGD," in NIPS, 2015, pp. 685–693.
[38] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in ASRU, 2011.
[39] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in ICASSP, 2015, pp. 5206–5210.
[40] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121–2159, 2011.
[41] S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z.-M. Ma, and T.-Y. Liu, "Asynchronous stochastic gradient descent with delay compensation for distributed deep learning," arXiv preprint arXiv:1609.08326, 2016.