A COMPARATIVE STUDY OF BATCH CONSTRUCTION STRATEGIES FOR RECURRENT NEURAL NETWORKS IN MXNET
Patrick Doetsch, Pavel Golik, Hermann Ney
AppTek, 6867 Elm St, Suite 300, McLean, VA 22101, United States
{pdoetsch, pgolik, hney}@apptek.com

ABSTRACT
In this work we compare different batch construction methods for mini-batch training of recurrent neural networks. While popular implementations like TensorFlow and MXNet suggest a bucketing approach to improve the parallelization capabilities of the recurrent training process, we propose a simple ordering strategy that arranges the training sequences in a stochastic alternatingly sorted way. We compare our method to sequence bucketing as well as various other batch construction strategies on the CHiME-4 noisy speech recognition corpus. The experiments show that our alternated sorting approach is able to compete both in training time and recognition performance while being conceptually simpler to implement.
Index Terms: bucketing, batches, recurrent neural networks
1. INTRODUCTION
Neural network based acoustic modeling became the de-facto standard in automatic speech recognition (ASR) and related tasks. Modeling contextual information over long distances in the input signal hereby showed to be of fundamental importance for optimal system performance. Modern acoustic models therefore use recurrent neural networks (RNN) to model long temporal dependencies. In particular the long short-term memory (LSTM) [1] has been shown to work very well on these tasks and most current state-of-the-art systems incorporate LSTMs into their acoustic models. While it is common practice to train the models on a frame-by-frame labeling obtained from a previously trained system, sequence-level criteria that optimize the acoustic model and the alignment model jointly are becoming increasingly popular. As an example, the connectionist temporal classification (CTC) [2] enables fully integrated training of acoustic models without assuming a frame-level alignment to be given. Sequence-level criteria, however, require training on full utterances, while frame-wise labeled systems can be trained on sub-utterances of any resolution.

Training of recurrent neural networks for large vocabulary continuous speech recognition (LVCSR) tasks is computationally very expensive, and the sequential nature of the recurrent process prohibits parallelizing the training over input frames. A robust optimization requires working on large batches of utterances, and both training time and recognition performance can vary strongly depending on how batches were put together. The main reason is that combining utterances of different lengths in a mini-batch requires extending the length of each utterance to that of the longest utterance within the batch, usually by appending zeros. These zero frames are ignored later on when gradients are computed, but the forward-propagation of zeros through the RNN is a waste of computing power.

A straight-forward strategy to minimize zero padding is to sort the utterances by length and to partition them into batches afterwards. However, there are significant drawbacks to this method. First, the sequence order remains constant in each epoch and therefore the intra-batch variability is very low, since the same sequences are usually combined into the same batch. Second, the strategy favors putting similar utterances into the same batch, since short utterances often tend to share other properties. One way to overcome this limitation was proposed within TensorFlow and is also used as the recommended strategy in MXNet. The idea is to perform a bucketing of the training corpus, where each bucket represents a range of utterance lengths and each training sample is assigned to the bucket that corresponds to its length. Afterwards a batch is constructed by drawing sequences from a randomly chosen bucket. The concept somewhat mitigates the issue of zero padding if suitable length ranges can be defined, while still allowing for some level of randomness, at least when sequences are selected within a bucket. However, buckets have to be made very large in order to ensure a sufficiently large variability within batches. On the other hand, making buckets too large will increase training time due to irrelevant computations on zero padded frames. Setting these hyper-parameters correctly is therefore of fundamental importance for fast and robust acoustic model training.

In this work we propose a simple batch construction strategy that is easier to parametrize and implement.
The method produces batches with a large variability of sequences while at the same time reducing irrelevant computation to a minimum. In the following sections we give an overview of current batch construction strategies and compare them w.r.t. training time and variability. We then derive our proposed method and discuss its properties on a theoretical level, followed by an empirical evaluation on the CHiME-4 noisy speech recognition task.
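To make the zero-padding overhead described above concrete, the following minimal sketch counts which fraction of the frames in a set of mini-batches is padding when utterances are batched in random versus length-sorted order. It uses synthetic utterance lengths; all names and values are illustrative and do not come from the paper.

```python
# Illustration of zero-padding overhead: every utterance in a mini-batch is
# padded to the length of the longest one, and the padded frames are still
# forward-propagated even though they contribute nothing to the gradient.
import random

def padded_fraction(lengths, batch_size):
    """Fraction of all processed frames that are zero padding."""
    total, padded = 0, 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        max_len = max(batch)
        total += max_len * len(batch)
        padded += sum(max_len - l for l in batch)
    return padded / total

random.seed(0)
lengths = [random.randint(100, 2000) for _ in range(1000)]  # synthetic utterance lengths

print("random order:", padded_fraction(lengths, 32))
print("sorted order:", padded_fraction(sorted(lengths), 32))
```

Sorting by length drastically reduces the padded fraction, but at the cost of the low intra-batch variability discussed above.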
2. RELATED WORK
While mini-batch training was studied extensively for feed-forward networks [3], authors rarely reveal the batch construction strategy they used during training when RNN experiments are reported. This is because the systems are either trained in a frame-wise fashion [4] or because the analysis uses sequences of very similar length as in [5]. We studied in an earlier work [6] how training on sub-sequences in those cases can lead to significantly faster and often also more robust training. In [7] the problem of having sequences of largely varying lengths in a batch was identified and the authors suggested to adapt their proposed batch-normalization method to a frame-level normalization, although a sequence-level normalization sounds theoretically more reasonable. In [8] a curriculum learning strategy is proposed where sequences follow a specific scheduling in order to reduce overfitting.

Modern machine learning frameworks like TensorFlow [9] and MXNet [10] implement a bucketing approach based on the length distribution of the sequences. In [11] the authors extend this idea by selecting optimal sequences within each bucket using a dynamic programming technique.
3. BUCKETING IN MXNET
Borrowed from TensorFlow's sequence training example, MXNet implements bucketing by clustering sequences into bins depending on their length. The size of each bin, i.e. the span of sequence lengths associated with this bin, has to be specified by the user and optimal values depend on the ASR task. The sampling process can be done in logarithmic time, since for each sequence length in the training set a binary search over the bins has to be performed.

In each iteration of the mini-batch training a bucket is then selected randomly. Within the selected bucket a random span of sequences is chosen to be used as data batch. Note that this random shuffling only ensures a large inter-batch variance w.r.t. the sequence length, while the variance within each batch can be small.

Bucketing is especially useful if the RNN model itself does not support dynamic unrolling and is not able to handle arbitrarily long sequences, but instead requires storing an unrolled version of the network for every possible length. In those cases bucketing allows the framework to assign each batch to the shortest possible unrolled network, while still optimizing the same shared weights.

Fig. 1: Resulting sequence ordering for different batch construction strategies. The graphs show the utterance lengths of 1000 randomly selected samples of the CHiME-4 training set. The Y-axis shows the length and the X-axis the utterance index for a given ordering. Bucketing was done with a bucket size of 250 and in the proposed approach we used 12 bins.
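The bucketing procedure described in Section 3 can be sketched as follows. This is a simplified illustration under the stated assumptions (user-defined length boundaries, binary search per utterance, batches drawn from a randomly chosen bucket); it is not MXNet's actual bucketing iterator implementation, and all names are ours.

```python
# Sketch of length-based bucketing: assign each utterance to a bucket by binary
# search over user-defined length boundaries, then draw a batch from a randomly
# chosen bucket.
import bisect
import random

def assign_to_buckets(lengths, boundaries):
    """Map each utterance index to the first bucket whose upper length bound fits it."""
    buckets = [[] for _ in boundaries]
    for idx, length in enumerate(lengths):
        b = bisect.bisect_left(boundaries, length)  # logarithmic in the number of buckets
        if b == len(boundaries):
            b -= 1  # put overly long utterances into the last bucket
        buckets[b].append(idx)
    return buckets

def draw_batch(buckets, batch_size):
    """Pick a random non-empty bucket, then a random span of utterances inside it."""
    bucket = random.choice([b for b in buckets if b])
    start = random.randrange(max(1, len(bucket) - batch_size + 1))
    return bucket[start:start + batch_size]

random.seed(0)
lengths = [random.randint(50, 1500) for _ in range(1000)]
buckets = assign_to_buckets(lengths, boundaries=[250, 500, 750, 1000, 1250, 1500])
print(draw_batch(buckets, batch_size=16))
```

Because every batch is drawn from within a single bucket, the inter-batch length variance is large while the intra-batch variance stays small, which is exactly the limitation discussed above.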
4. PROPOSED APPROACH
In order to improve the intra-batch variability we propose a stochastic bucketing process. At the beginning of each epoch the utterances are arranged randomly and then partitioned into bins of equal size. Each bin is then sorted in alternating directions, such that two consecutive bins are sorted in reverse order to each other. Finally, the constructed ordering is partitioned into batches. The overall algorithm can be summarized as follows:

For each epoch:
1. shuffle the training data
2. partition the resulting sequence into N bins
3. sort each bin n by the utterance length:
   • in ascending order if n is odd
   • in descending order if n is even
4. draw consecutive batches of the desired size from the resulting sequence

(A short sketch of this procedure is given at the end of this section.)

Due to the initial shuffling and subsequent partitioning, the probability for two sequences of any length being put into the same bin is 1/(N·(N−1)), so by increasing the number of bins the variability within a partition decreases quadratically while the variability among different partitions increases. The alternated sorting approach ensures that utterances at the boundaries of two consecutive bins are of similar length, such that the final partitioning into batches requires minimal zero padding.

Figure 1 shows the utterance lengths for random and sorted sequence ordering as well as for bucketing in MXNet and the proposed approach. Note that in the case of bucketing, batches are put together by randomly choosing one of the buckets first, so the ordering does not directly represent the final set of batches.
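The following minimal sketch implements the four steps listed above; the function and variable names are ours and do not come from RETURNN or MXNet.

```python
# Sketch of the proposed batch construction: shuffle, partition into N bins,
# sort bins in alternating directions, and cut the resulting ordering into
# consecutive batches.
import random

def alternated_sorting_batches(lengths, num_bins, batch_size, seed=None):
    rng = random.Random(seed)
    order = list(range(len(lengths)))
    rng.shuffle(order)                                   # 1. shuffle the training data
    bin_size = (len(order) + num_bins - 1) // num_bins
    bins = [order[i:i + bin_size]                        # 2. partition into N bins
            for i in range(0, len(order), bin_size)]
    for n, b in enumerate(bins):                         # 3. sort bins in alternating directions
        b.sort(key=lambda idx: lengths[idx], reverse=(n % 2 == 1))
    ordering = [idx for b in bins for idx in b]
    return [ordering[i:i + batch_size]                   # 4. draw consecutive batches
            for i in range(0, len(ordering), batch_size)]

lengths = [random.randint(50, 1500) for _ in range(1000)]
batches = alternated_sorting_batches(lengths, num_bins=12, batch_size=16, seed=0)
print([lengths[i] for i in batches[0]])  # lengths within one batch are similar
```

Since consecutive bins end and start with utterances of similar length, the batches cut across bin boundaries require only minimal zero padding, while the initial shuffle keeps the batch composition different in every epoch.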
5. EXPERIMENTAL SETUP
The 4th CHiME Speech Separation and Recognition Challenge [12] consists of noisy utterances spoken by speakers in challenging environments. Recording was done using a 6-channel microphone array on a tablet. The dataset revisits the CHiME-3 corpus that was published one year before the challenge took place [13].

We extracted 16-dimensional MFCC vectors as in [14] from the six sub-corpora and used them as features for the neural network models. The context-dependent HMM states were clustered into 1500 classes using a classification and regression tree (CART). We trained a GMM-based baseline model with three-state HMMs without skip transitions in order to obtain a frame-wise labeling of the training data. A network of three bi-directional LSTM layers followed by a softmax layer was trained to minimize the frame-wise cross-entropy. Optimization was done with the Adam optimizer and a constant learning rate of 0.01. We used MXNet [10] for experiments using the bucketing approach and RETURNN [15] for the other methods.

After training, the state posterior estimates from the neural network are normalized by the state priors and used as likelihoods in a conventional hybrid HMM decoder using the RASR toolkit [16]. A 4-gram LM was used during decoding with a language model scale of 12.
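The prior normalization used for hybrid decoding can be written in log space as log p(x|s) ∝ log p(s|x) − κ · log p(s). The small sketch below illustrates this conversion; the prior scale κ and all array names are assumptions for illustration only, not values or identifiers from the RASR setup described above.

```python
# Illustrative conversion of state posteriors into scaled pseudo-likelihoods for
# a hybrid HMM decoder: subtract the (scaled) log state priors from the log
# posteriors, up to a constant that does not affect decoding.
import numpy as np

def posteriors_to_scaled_log_likelihoods(log_posteriors, log_priors, prior_scale=1.0):
    """log p(x|s) ~ log p(s|x) - prior_scale * log p(s)."""
    return log_posteriors - prior_scale * log_priors

# toy example: 3 frames, 1500 CART-clustered states, uniform priors
log_post = np.log(np.random.dirichlet(np.ones(1500), size=3))
log_prior = np.log(np.full(1500, 1.0 / 1500))
print(posteriors_to_scaled_log_likelihoods(log_post, log_prior).shape)
```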
6. EXPERIMENTS
In order to provide some perspective on the impact of temporal context on the CHiME-4 task, we performed experiments on sub-utterances (chunks), which are presented in Table 1. For different sub-utterance lengths, we report the processing speed measured in utterances per second, the memory requirement, and the word error rate (WER) on the evaluation set of the CHiME-4 database. Here we constrained batches to contain only 5,000 frames in total, such that the overall number of updates is constant in all experiments. We can observe that longer chunks consistently improve the word error rate at the cost of slower training, with only small gains beyond a chunk size of 100 frames.
Table 1: Training time and recognition performance when training is done on sub-utterance level. The first column shows the maximal sub-utterance length after partitioning of the original sequence. The last row shows the results obtained without partitioning into sub-utterances.

Chunk size   Utt./sec   Memory [GB]   WER [%]
10           36.7       1.6           21.3
50           31.1       1.6           10.1
100          29.6       1.6           9.2
500          17.3       1.6           9.0
max
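Conceptually, the sub-utterance experiments cut every utterance into chunks of at most the given length and fill each batch up to a fixed budget of 5,000 frames, so that the number of parameter updates stays comparable across chunk sizes. The rough sketch below illustrates this under exactly those assumptions; it is not the actual RETURNN or MXNet data pipeline, and frame counts here ignore padding.

```python
# Illustrative chunking of utterances into sub-utterances with a fixed
# frames-per-batch budget.
def make_chunked_batches(lengths, chunk_size, frames_per_batch=5000):
    chunks = []
    for utt, length in enumerate(lengths):
        for start in range(0, length, chunk_size):
            chunks.append((utt, start, min(start + chunk_size, length)))
    batches, current, current_frames = [], [], 0
    for chunk in chunks:
        frames = chunk[2] - chunk[1]
        if current and current_frames + frames > frames_per_batch:
            batches.append(current)
            current, current_frames = [], 0
        current.append(chunk)
        current_frames += frames
    if current:
        batches.append(current)
    return batches

print(len(make_chunked_batches([1200, 800, 450, 2000], chunk_size=100)))
```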
Table 2: An evaluation of different sequence batch construction methods on the CHiME-4 database. Training time per epoch and memory consumption are presented in the first two columns, while the last column shows the word error rate of the corresponding acoustic model on the evaluation set.

Approach   Utt./sec   Memory [GB]   WER [%]
Random     ∼          ∼             ∼

7. CONCLUSIONS

In this work we presented a novel strategy to construct sequence-level batches for recurrent neural network acoustic model training. While not much attention is usually given to the topic of batch construction, we demonstrate that different strategies can lead to large variations both in training time and recognition performance. Most deep-learning frameworks rely on a bucketing approach, clustering sequences of similar length into bins and drawing batches from each bin individually. We showed that we can achieve a better runtime performance using a simpler batch design, by partitioning a shuffled sequence order and sorting the partitions in alternating order. The method was evaluated on the CHiME-4 noisy speech recognition task and compared to standard approaches like random sequence shuffling and the bucketing approach of MXNet, where our method was able to reach a better trade-off between training time and recognition performance while being easier to parametrize than the bucketing method.
8. ACKNOWLEDGEMENTS
We want to thank Jahn Heymann, Lukas Drude and Reinhold Haeb-Umbach from University of Paderborn, Germany for their CHiME-4 front-end which we used in this work.
9. REFERENCES

[1] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, Vol. 9, No. 8, pp. 1735–1780, 1997.
[2] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Intern. Conf. on Machine Learning (ICML), Beijing, China, 2014, pp. 1764–1772.
[3] M. Li, T. Zhang, Y. Chen, and A. J. Smola, "Efficient mini-batch training for stochastic optimization," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, Aug. 2014, pp. 661–670.
[4] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "Achieving human parity in conversational speech recognition," Tech. Rep. MSR-TR-2016-71, Feb. 2017.
[5] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Intern. Conf. on Machine Learning (ICML), Atlanta, GA, USA, Jun. 2013, pp. 1310–1318.
[6] P. Doetsch, M. Kozielski, and H. Ney, "Fast and robust training of recurrent neural networks for offline handwriting recognition," in International Conference on Frontiers in Handwriting Recognition, Crete, Greece, Sep. 2014, pp. 279–284.
[7] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, "Batch normalized recurrent neural networks," in IEEE Intern. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Shanghai, China, Mar. 2016, pp. 2657–2661.
[8] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, Dec. 2015, pp. 1171–1179.
[9] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," Nov. 2015. [Online]. Available: http://tensorflow.org/
[10] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," in Neural Information Processing Systems, Workshop on Machine Learning Systems, Montreal, Canada, Dec. 2015.
[11] V. Khomenko, O. Shyshkov, O. Radyvonenko, and K. Bokhan, "Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization," in IEEE First International Conference on Data Stream Mining Processing (DSMP), Lviv, Ukraine, Aug. 2016, pp. 100–103.
[12] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, "An analysis of environment, microphone and data simulation mismatches in robust speech recognition," Computer Speech and Language, 2016.
[13] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec. 2015, pp. 504–511.
[14] T. Menne, J. Heymann, A. Alexandridis, K. Irie, A. Zeyer, M. Kitza, P. Golik, I. Kulikov, L. Drude, R. Schlüter, H. Ney, R. Haeb-Umbach, and A. Mouchtaris, "The RWTH/UPB/FORTH system combination for the 4th CHiME challenge evaluation," in The 4th International Workshop on Speech Processing in Everyday Environments, San Francisco, CA, USA, Sep. 2016, pp. 39–44.
[15] P. Doetsch, A. Zeyer, P. Voigtlaender, I. Kulikov, R. Schlüter, and H. Ney, "RETURNN: the RWTH extensible training framework for universal recurrent neural networks," in IEEE Intern. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, LA, USA, Mar. 2017, pp. 5345–5349.
[16] D. Rybach, S. Hahn, P. Lehnen, D. Nolden, M. Sundermeyer, Z. Tüske, S. Wiesler, R. Schlüter, and H. Ney, "RASR - the RWTH Aachen University open source speech recognition toolkit," in