Batch Normalized Recurrent Neural Networks
César Laurent, Gabriel Pereyra, Philémon Brakel, Ying Zhang, Yoshua Bengio
César Laurent ∗
Université de Montréal

Gabriel Pereyra ∗
University of Southern California

Philémon Brakel
Université de Montréal

Ying Zhang
Université de Montréal

Yoshua Bengio †
Université de Montréal

∗ Equal contribution
† CIFAR Senior Fellow
Abstract
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks [1]. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs does not help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion, but does not seem to improve the generalization performance on either our language modelling or our speech recognition task. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
1 Introduction

Recurrent Neural Networks (RNNs) have received renewed interest due to their recent success in various domains, including speech recognition [2], machine translation [3, 4] and language modelling [5]. The so-called Long Short-Term Memory (LSTM) [6] type RNN has been particularly successful. Often, it seems beneficial to train deep architectures in which multiple RNNs are stacked on top of each other [2]. Unfortunately, the training cost for large datasets and deep architectures of stacked RNNs can be prohibitively high, often an order of magnitude greater than for simpler models such as n-grams [7]. Because of this, recent work has explored methods for parallelizing RNNs across multiple graphics cards (GPUs). In [3], an LSTM type RNN was distributed layer-wise across multiple GPUs, and in [8] a bidirectional RNN was distributed across time. However, due to the sequential nature of RNNs, it is difficult to achieve linear speed-ups relative to the number of GPUs.

Another way to reduce training time is through a better conditioned optimization procedure. Standardizing or whitening of input data has long been known to improve the convergence of gradient-based optimization methods [9]. Extending this idea to multi-layered networks suggests that normalizing or whitening intermediate representations can similarly improve convergence. However, applying these transforms exactly would be extremely costly. In [1], batch normalization was used to standardize intermediate representations by approximating the population statistics with sample-based approximations obtained from small subsets of the data, often called mini-batches, that are also used to obtain gradient approximations for stochastic gradient descent, the most commonly used optimization method for neural network training. It has also been shown that convergence can be improved even more by whitening intermediate representations instead of simply standardizing them [10]. These methods reduced the training time of Convolutional Neural Networks (CNNs) by an order of magnitude and additionally provided a regularization effect, leading to state-of-the-art results in object recognition on the ImageNet dataset [11]. In this paper, we explore how to leverage normalization in RNNs and show that training time can be reduced.

2 Batch Normalization

In optimization, feature standardization or whitening is a common procedure that has been shown to improve the convergence of gradient-based methods [9]. Extending the idea to deep neural networks, one can think of an arbitrary layer as receiving samples from a distribution that is shaped by the layer below. This distribution changes during the course of training, making any layer but the first responsible not only for learning a good representation but also for adapting to a changing input distribution. This distribution variation is termed
Internal Covariate Shift, and reducing it is hypothesized to help the training procedure [1].

To reduce this internal covariate shift, we could whiten each layer of the network. However, this often turns out to be too computationally demanding. Batch normalization [1] approximates the whitening by standardizing the intermediate representations using the statistics of the current mini-batch. Given a mini-batch x, we can calculate the sample mean and sample variance of each feature k along the mini-batch axis:

$$\bar{x}_k = \frac{1}{m} \sum_{i=1}^{m} x_{i,k}, \qquad (1)$$

$$\sigma_k^2 = \frac{1}{m} \sum_{i=1}^{m} (x_{i,k} - \bar{x}_k)^2, \qquad (2)$$

where m is the size of the mini-batch. Using these statistics, we can standardize each feature as follows:

$$\hat{x}_k = \frac{x_k - \bar{x}_k}{\sqrt{\sigma_k^2 + \epsilon}}, \qquad (3)$$

where ε is a small positive constant added for numerical stability.

However, standardizing the intermediate activations reduces the representational power of the layer. To account for this, batch normalization introduces additional learnable parameters γ and β, which respectively scale and shift the data, leading to a layer of the form

$$\mathrm{BN}(x_k) = \gamma_k \hat{x}_k + \beta_k. \qquad (4)$$

By setting γ_k to σ_k and β_k to x̄_k, the network can recover the original layer representation. So, for a standard feedforward layer in a neural network,

$$y = \phi(W x + b), \qquad (5)$$

where W is the weight matrix, b is the bias vector, x is the input of the layer and φ is an arbitrary activation function, batch normalization is applied as follows:

$$y = \phi(\mathrm{BN}(W x)). \qquad (6)$$

Note that the bias vector has been removed, since its effect is cancelled by the standardization. Since the normalization is now part of the network, the back-propagation procedure needs to be adapted to propagate gradients through the mean and variance computations as well.

At test time, we cannot use the statistics of the mini-batch. Instead, we can estimate them either by forwarding several training mini-batches through the network and averaging their statistics, or by maintaining a running average computed over each mini-batch seen during training.
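As a concrete illustration of equations (1)-(4) and (6), here is a minimal NumPy sketch of the transform; the function names and the (batch, features) layout are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch axis, then scale and shift.

    x: (batch_size, num_features) mini-batch
    gamma, beta: learnable per-feature scale and shift, shape (num_features,)
    """
    mean = x.mean(axis=0)                    # eq. (1): per-feature mini-batch mean
    var = ((x - mean) ** 2).mean(axis=0)     # eq. (2): per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # eq. (3): standardized features
    return gamma * x_hat + beta              # eq. (4): scale and shift

def bn_dense_layer(x, W, gamma, beta, phi=np.tanh):
    """Feedforward layer with batch normalization, eq. (6): y = phi(BN(xW)).
    The bias is dropped, since the mean subtraction cancels it."""
    return phi(batch_norm(x @ W, gamma, beta))
```

At test time, the mini-batch mean and variance above would be replaced by the estimated population statistics.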
3 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) extend feedforward neural networks to sequential data. Given an input sequence of vectors (x_1, ..., x_T), they produce a sequence of hidden states (h_1, ..., h_T), computed at time step t as follows:

$$h_t = \phi(W_h h_{t-1} + W_x x_t), \qquad (7)$$

where W_h is the recurrent weight matrix, W_x is the input-to-hidden weight matrix, and φ is an arbitrary activation function.

If we have access to the whole input sequence, we can use information not only from the past time steps, but also from the future ones, allowing for bidirectional RNNs [12]:

$$\overrightarrow{h}_t = \phi(\overrightarrow{W}_h \overrightarrow{h}_{t-1} + \overrightarrow{W}_x x_t), \qquad (8)$$

$$\overleftarrow{h}_t = \phi(\overleftarrow{W}_h \overleftarrow{h}_{t+1} + \overleftarrow{W}_x x_t), \qquad (9)$$

$$h_t = [\overrightarrow{h}_t : \overleftarrow{h}_t], \qquad (10)$$

where [x : y] denotes the concatenation of x and y. Finally, we can stack RNNs by using h as the input to another RNN, creating deeper architectures [13]:

$$h_t^l = \phi(W_h h_{t-1}^l + W_x h_t^{l-1}). \qquad (11)$$

In vanilla RNNs, the activation function φ is usually a sigmoidal function, such as the hyperbolic tangent. Training such networks is known to be particularly difficult, because of vanishing and exploding gradients [14].

A commonly used recurrent structure is the Long Short-Term Memory (LSTM). It addresses the vanishing gradient problem commonly found in vanilla RNNs by incorporating gating functions into its state dynamics [6]. At each time step, an LSTM maintains a hidden vector h and a cell vector c responsible for controlling state updates and outputs. More concretely, we define the computation at time step t as follows [15]:

$$i_t = \mathrm{sigmoid}(W_{hi} h_{t-1} + W_{xi} x_t), \qquad (12)$$
$$f_t = \mathrm{sigmoid}(W_{hf} h_{t-1} + W_{xf} x_t), \qquad (13)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{hc} h_{t-1} + W_{xc} x_t), \qquad (14)$$
$$o_t = \mathrm{sigmoid}(W_{ho} h_{t-1} + W_{xo} x_t + W_{co} c_t), \qquad (15)$$
$$h_t = o_t \odot \tanh(c_t), \qquad (16)$$

where sigmoid(·) is the logistic sigmoid function, tanh is the hyperbolic tangent, W_{h·} are the recurrent weight matrices and W_{x·} are the input-to-hidden weight matrices. i_t, f_t and o_t are respectively the input, forget and output gates, and c_t is the cell.

Table 1: Best frame-wise cross entropy (FCE) and frame error rate (FER) on the training and development sets for the speech experiments.

Model         Train             Dev
              FCE      FER      FCE      FER
BiRNN         0.95     0.28
BiRNN (BN)

4 Batch Normalization for RNNs

From equation 6, an analogous way to apply batch normalization to an RNN would be as follows:

$$h_t = \phi(\mathrm{BN}(W_h h_{t-1} + W_x x_t)). \qquad (17)$$

However, in our experiments, when batch normalization was applied in this fashion, it did not help the training procedure (see appendix A for more details). Instead, we propose to apply batch normalization only to the input-to-hidden transition (W_x x_t), i.e. as follows:

$$h_t = \phi(W_h h_{t-1} + \mathrm{BN}(W_x x_t)). \qquad (18)$$

This idea is similar to the way dropout [16] can be applied to RNNs [17]: batch normalization is applied only to the vertical connections (i.e. from one layer to another) and not to the horizontal connections (i.e. within the recurrent layer). We use the same principle for LSTMs: batch normalization is only applied after multiplication with the input-to-hidden weight matrices W_{x·}, as in the sketch below.
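To make equation (18) concrete, here is a minimal NumPy sketch of one step of a vanilla RNN with batch normalization on the input-to-hidden transition only, using the statistics of the current mini-batch at that time step; names are illustrative, not the authors' implementation:

```python
import numpy as np

def rnn_step_input_bn(h_prev, x_t, W_h, W_x, gamma, beta, eps=1e-5):
    """Eq. (18): h_t = phi(W_h h_{t-1} + BN(W_x x_t)).

    h_prev: (batch, hidden), x_t: (batch, input)
    W_h: (hidden, hidden), W_x: (input, hidden)
    """
    a = x_t @ W_x                            # input-to-hidden pre-activation
    mean = a.mean(axis=0)                    # mini-batch statistics of this term only
    var = ((a - mean) ** 2).mean(axis=0)
    a_bn = gamma * (a - mean) / np.sqrt(var + eps) + beta
    # The recurrent (hidden-to-hidden) term is left unnormalized.
    return np.tanh(h_prev @ W_h + a_bn)
```

For an LSTM, the same normalization would be applied to each of the W_{x·} x_t products in equations (12)-(15), leaving the recurrent terms untouched.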
4.1 Frame-wise and Sequence-wise Normalization

In experiments where we do not have access to future frames, as in language modelling where the goal is to predict the next character, we are forced to compute the normalization separately at each time step:

$$\hat{x}_{k,t} = \frac{x_{k,t} - \bar{x}_{k,t}}{\sqrt{\sigma_{k,t}^2 + \epsilon}}. \qquad (19)$$

We will refer to this as frame-wise normalization.

In applications like speech recognition, we usually have access to the entire sequences. However, those sequences may have variable length. Usually, when using mini-batches, the shorter sequences are padded with zeros to match the length of the longest sequence in the mini-batch. In such setups we cannot use frame-wise normalization, because the number of unpadded frames decreases along the time axis, leading to increasingly poorer statistics estimates. To solve this problem, we apply a sequence-wise normalization, where we compute the mean and variance of each feature along both the time and batch axes using

$$\bar{x}_k = \frac{1}{n} \sum_{i=1}^{m} \sum_{t=1}^{T} x_{i,t,k}, \qquad (20)$$

$$\sigma_k^2 = \frac{1}{n} \sum_{i=1}^{m} \sum_{t=1}^{T} (x_{i,t,k} - \bar{x}_k)^2, \qquad (21)$$

where T is the length of each sequence (so the sums run over the unpadded frames only) and n is the total number of unpadded frames in the mini-batch. We will refer to this type of normalization as sequence-wise normalization; a sketch of the computation is given below.
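A minimal NumPy sketch of the sequence-wise statistics of equations (20)-(21), assuming a zero-padded mini-batch with an explicit mask marking the unpadded frames; the mask convention and names are assumptions:

```python
import numpy as np

def sequence_wise_normalize(x, mask, gamma, beta, eps=1e-5):
    """Normalize each feature with statistics pooled over the time and batch
    axes, counting only unpadded frames (eqs. 20-21).

    x:    (batch, time, features), zero-padded along the time axis
    mask: (batch, time), 1.0 for real frames and 0.0 for padding
    """
    m = mask[..., None]                                   # broadcast the mask over features
    n = mask.sum()                                        # total number of unpadded frames
    mean = (x * m).sum(axis=(0, 1)) / n                   # eq. (20)
    var = (((x - mean) ** 2) * m).sum(axis=(0, 1)) / n    # eq. (21)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```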
5 Experiments

We ran experiments on a speech recognition task and a language modelling task. The models were implemented using Theano [18] and Blocks [19].

5.1 Speech Recognition

For the speech task, we used the Wall Street Journal (WSJ) [20] speech corpus. We used the si284 split as the training set and evaluated our models on the dev93 development set. The raw audio was transformed into 40-dimensional log mel filter-banks (plus energy), with deltas and delta-deltas. As in [21], the forced alignments were generated from the Kaldi recipe tri4b, leading to 3546 clustered triphone states. Because of memory issues, we removed from the training set the sequences longer than 1300 frames (4.6% of the set), leading to a training set of 35746 sequences.

The baseline model (BL) is a stack of 5 bidirectional LSTM layers with 250 hidden units each, followed by a softmax output layer of size 3546. All the weights were initialized using the Glorot scheme [22] and all the biases were set to zero. For the batch normalized model (BN), we applied sequence-wise normalization to each LSTM of the baseline model. Both networks were trained using standard SGD with momentum, with a fixed learning rate of 1e-4 and a fixed momentum factor of 0.9. The mini-batch size was 24.

Figure 1: Frame-wise cross entropy on WSJ for the baseline (blue) and batch normalized (red) networks. The dotted lines are the training curves and the solid lines are the validation curves.
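For reference, here is a minimal sketch of SGD with classical momentum as used to train both speech models; the exact momentum formulation used by the authors is an assumption:

```python
import numpy as np

def sgd_momentum_step(params, grads, velocities, lr=1e-4, momentum=0.9):
    """One in-place parameter update of SGD with classical momentum
    (fixed learning rate 1e-4 and momentum 0.9, mini-batch size 24)."""
    for p, g, v in zip(params, grads, velocities):
        v *= momentum          # decay the accumulated velocity
        v -= lr * g            # add the scaled negative gradient
        p += v                 # apply the update in place
```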
5.2 Language Modeling

We used the Penn Treebank (PTB) [23] corpus for our language modelling experiments. We use the standard split (929k training words, 73k validation words, and 82k test words) and a vocabulary of 10k words. We train a small, a medium and a large LSTM as described in [17].

All models consist of two stacked LSTM layers and are trained with stochastic gradient descent (SGD) with a learning rate of 1 and a mini-batch size of 32.

The small LSTM has two layers of 200 memory cells, with parameters initialized from a uniform distribution with range [-0.1, 0.1]. We back-propagate across 20 time steps and the gradients are rescaled (norm-clipped) whenever their norm exceeds 10. We train for 15 epochs and halve the learning rate every epoch after the 6th.

The medium LSTM has a hidden size of 650 for both layers, with parameters initialized from a uniform distribution with range [-0.05, 0.05]. We apply dropout with probability 50% between all layers. We back-propagate across 35 time steps and the gradients are rescaled whenever their norm exceeds 5. We train for 40 epochs and divide the learning rate by 1.2 every epoch after the 6th.

The large LSTM has two layers of 1500 memory cells, with parameters initialized from a uniform distribution with range [-0.04, 0.04]. We apply dropout between all layers. We back-propagate across 35 time steps and the gradients are rescaled whenever their norm exceeds 5. We train for 55 epochs and divide the learning rate by 1.15 every epoch after the 15th.
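The gradient rescaling used above is the usual clipping of the global gradient norm; a minimal sketch follows (the helper name and list-of-arrays convention are illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so their global L2 norm does not exceed
    max_norm (10 for the small LSTM, 5 for the medium and large ones)."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```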
6 Results and Discussion

Figure 1 shows the training and development frame-wise cross entropy curves for both networks of the speech experiments. As we can see, the batch normalized network trains faster (at some points about twice as fast as the baseline), but it overfits more. The best results, reported in table 1, are comparable to the ones obtained in [21].

Figure 2 shows the training and validation perplexity for the large LSTM network of the language modelling experiment. We can also observe that the training is faster when we apply batch normalization to the network, but, as in the speech experiments, the validation performance does not improve (table 2).

Figure 2: Large LSTM on Penn Treebank for the baseline (blue) and the batch normalized (red) networks. The dotted lines are the training curves and the solid lines are the validation curves.
Table 2: Training and validation perplexity on Penn Treebank for the language modelling experiments.

Model               Train    Valid
Small LSTM          78.5     119.2
Small LSTM (BN)     62.5     120.9
Medium LSTM         49.1     89.0
Medium LSTM (BN)    41.0     90.6
Large LSTM          49.3
Large LSTM (BN)

Acknowledgments
Part of this work was funded by Samsung. We also want to thank Nervana Systems for providing GPUs.
References

[1] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.

[2] Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, "Speech recognition with deep recurrent neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645–6649.

[3] Ilya Sutskever, Oriol Vinyals, and Quoc Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.

[5] Tomáš Mikolov, "Statistical language models based on neural networks," Presentation at Google, Mountain View, 2nd April, 2012.

[6] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[7] Will Williams, Niranjani Prasad, David Mrva, Tom Ash, and Tony Robinson, "Scaling recurrent neural network language models," arXiv preprint arXiv:1502.00512, 2015.

[8] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al., "Deep Speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.

[9] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller, "Efficient backprop," in Neural Networks: Tricks of the Trade, pp. 9–48. Springer, 2012.

[10] Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and Koray Kavukcuoglu, "Natural neural networks," arXiv preprint arXiv:1507.00210, 2015.

[11] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), pp. 1–42, April 2015.

[12] Mike Schuster and Kuldip K. Paliwal, "Bidirectional recurrent neural networks," Signal Processing, IEEE Transactions on, vol. 45, no. 11, pp. 2673–2681, 1997.

[13] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio, "How to construct deep recurrent neural networks," arXiv preprint arXiv:1312.6026, 2013.

[14] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, "On the difficulty of training recurrent neural networks," arXiv preprint arXiv:1211.5063, 2012.

[15] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber, "Learning precise timing with LSTM recurrent networks," The Journal of Machine Learning Research, vol. 3, pp. 115–143, 2003.

[16] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[17] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.

[18] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio, "Theano: new features and speed improvements," Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

[19] B. van Merriënboer, D. Bahdanau, V. Dumoulin, D. Serdyuk, D. Warde-Farley, J. Chorowski, and Y. Bengio, "Blocks and Fuel: Frameworks for deep learning," arXiv e-prints, June 2015.

[20] Douglas B. Paul and Janet M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357–362.

[21] Alan Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.

[22] Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.

[23] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini, "Building a large annotated corpus of English: The Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.
A Experiments with Normalization Inside the Recurrence
In our first experiments, we investigated whether batch normalization can be applied in the same way as in a feedforward network (equation 17). We tried it on a language modelling task on the Penn Treebank dataset, where the goal was to predict the next character of a fixed-length sequence of 100 symbols. The network is composed of a lookup table of dimension 250, followed by 3 layers of simple recurrent networks with 250 hidden units each. A softmax layer of dimension 50 is added on top. In the batch normalized networks, we apply batch normalization to the hidden-to-hidden transition, as in equation 17, meaning that we compute one mean and one variance for each of the 250 features at each time step. For inference, we also keep track of the statistics for each time step. However, we use the same γ and β for every time step.

The lookup table is randomly initialized using an isotropic Gaussian with zero mean and unit variance. All the other matrices of the network are initialized using the Glorot scheme [22] and all the biases are set to zero. We used SGD with momentum. We performed a random search over the learning rate (sampled from the range [0.0001, 1]), the momentum (with possible values of 0.5, 0.8, 0.9, 0.95, 0.995), and the batch size (32, 64 or 128). We let each experiment run for 20 epochs. A total of 52 experiments were performed.

In every experiment that we ran, the performance of the batch normalized networks was always slightly worse than (or at best equivalent to) that of the baseline networks, except for the runs where the learning rate was too high and the baseline diverged while the batch normalized network was still able to train. Figure 3 shows an example of a working experiment. We observed that in practically all the experiments that converged, the normalization was actually harming the performance. Table 3 shows the results of the best baseline and batch normalized networks. We can observe that both best networks have similar performance. The settings for the best baseline are: learning rate 0.42, momentum 0.95, batch size 32. The settings for the best batch normalized network are: learning rate 3.71e-4, momentum 0.995, batch size 128.

Table 3: Best frame-wise cross entropy for the best baseline network and for the best batch normalized one.

Model              Train    Valid
Best Baseline      1.05     1.10
Best Batch Norm    1.07     1.11

Those results suggest that this way of applying batch normalization in recurrent networks is not optimal. It seems that batch normalization hurts the training procedure. This may be due to the fact that we estimate new statistics at each time step, or to the repeated application of γ and β at every time step.
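As a concrete picture of this setup, here is a minimal NumPy sketch of normalization inside the recurrence (equation 17), with separate statistics at each time step but a single γ and β shared across time; names are illustrative, not the authors' Theano implementation:

```python
import numpy as np

def rnn_forward_recurrent_bn(x, h0, W_h, W_x, gamma, beta, eps=1e-5):
    """Unroll a vanilla RNN with batch normalization on the whole
    pre-activation (eq. 17). Statistics are recomputed at every time
    step; gamma and beta are shared across time steps.

    x: (time, batch, input), h0: (batch, hidden)
    """
    h, states = h0, []
    for x_t in x:
        a = h @ W_h + x_t @ W_x                 # hidden-to-hidden + input-to-hidden
        mean = a.mean(axis=0)                   # per-feature stats for this time step
        var = ((a - mean) ** 2).mean(axis=0)
        h = np.tanh(gamma * (a - mean) / np.sqrt(var + eps) + beta)
        states.append(h)
    return np.stack(states)                     # (time, batch, hidden)
```

At inference time, the per-time-step means and variances tracked during training would replace the mini-batch statistics above.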