On Training Bi-directional Neural Network Language Model with Noise Contrastive Estimation
Tianxing He
Shanghai Jiao Tong University [email protected]
Yu Zhang
Jasha Droppo
Microsoft Research [email protected]
Kai Yu
Shanghai Jiao Tong University [email protected]
Abstract
We propose to train a bi-directional neural network language model (NNLM) with noise contrastive estimation (NCE). Experiments are conducted on a rescoring task on the PTB data set. It is shown that the NCE-trained bi-directional NNLM outperforms the one trained by conventional maximum likelihood training. Regretfully, however, it does not outperform the baseline uni-directional NNLM.
1 Introduction

Recent years have witnessed exciting performance improvements in the field of language modeling, largely due to the introduction of a series of neural network language models (NNLM). Although the conventional back-off n-gram language model has been widely used in the automatic speech recognition (ASR) and machine translation (MT) communities for its simplicity and effectiveness, it has long suffered from the curse-of-dimensionality problem caused by the huge number of possible word combinations in real-world text. Various smoothing techniques [1] have been proposed to address this issue, but the improvements have been limited. Recently, neural network based language models have attracted great interest due to their effective encoding of word context history [2, 3, 4, 5]. In neural network based language models, the word context is projected into a continuous space and the projection, represented by the transformation matrices in the neural network, is learned during training. The projected continuous word vectors are also referred to as word embeddings. With the continuous context representation, feed-forward neural network language models (FNNLM) [2, 3, 4, 5, 6] have achieved both better perplexity (PPL) and better word error rate (WER) when embedded into a real-world system.

Despite the benefits of effective context representation brought by word embeddings, FNNLM is still a short-span language model and is not capable of utilizing long-term word history (e.g. context that is 5 or 6 words away) for the target word prediction. To address this issue, the recurrent neural network language model (RNNLM), which introduces a recurrent connection in the hidden layer, was proposed to preserve long-term context. It has achieved significant perplexity and word error rate (WER) gains on various data sets [7, 8, 9, 10, 11, 12], out-performing traditional back-off n-gram models and FNNLMs.

However, RNN training generally suffers from the "vanishing gradient" problem [13]: the gradient flow decays sharply through a non-linear operation. The LSTM [14] structure alleviates this problem by introducing a "memory cell" structure which allows the gradient to travel without being squashed by a non-linear operation. It also has a set of gates which enable the model to decide whether to memorize, forget, or output information. By introducing the LSTM structure into RNNLM [9], the LSTMLM is able to remember longer context information and obtains further performance gains. It has also been shown that dropout [15] can be used to regularize the LSTMLM. Inspired by its success, several variants of LSTM have been proposed; recently the gated recurrent unit (GRU) [16] has been gaining popularity because it has matching performance with LSTM but a simpler structure. More recently, [17] proposed to introduce the concept of memory into NNLM. By fetching memories from previous time steps, the model is able to "explicitly" utilize long-term dependency without a recurrence structure.

While these research efforts have been focusing on better utilization of history information, it would be desirable if the model could utilize context information from both sides. In the literature, very few attempts have been made to train a proper bi-directional neural network language model, even though bi-directional NN has already been successfully applied to other fields [18].
This is because the bi-directional model is not by itself normalized, due to the generative nature of the language model, which makes the conventional maximum likelihood training framework improper for its training. In this work, attempts have been made to train a bi-directional neural network language model with noise contrastive estimation, an alternative to maximum likelihood training which does not have the constraint that the model to be trained is inherently normalized. The rest of the paper is organized as follows: in section 2, the motivation of this work is discussed; in section 3 the formulation of the model is elaborated in detail; implementation is covered in section 4; finally, experiment results are shown in section 5 and related works are discussed in section 6.

2 Motivation

Statistical language models assign a probability P(W) to a given sentence W = <w_1, w_2, ..., w_n>, which can be decomposed into a product of word-level probabilities using the rule of conditional probability:

P(W) = \prod_i P(w_i | w_{1..i-1})    (1)

Language models of this formulation predict the probability distribution of the next word given its former words (history). Since the prediction only depends on history information, in this work this kind of model is denoted as a uni-directional language model. All types of language models mentioned in section 1 fall into this category, but note that short-span models like the N-gram make the "Markov chain" assumption P(w_i | w_{1..i-1}) ≈ P(w_i | w_{i-N..i-1}) to alleviate the data-sparsity problem.

For uni-directional language models, as long as each word-level probability is properly normalized, normalization is also guaranteed on the sentence level:

\sum_W P_{LM}(W) = 1    (2)

This is the key reason why the maximum likelihood estimation (MLE) training framework, which requires the model to be inherently probabilistic, has been successfully applied to the parameter estimation (training) of uni-directional language models. And recent years of research effort in the field of neural network language models has focused on getting a better representation of history context using sophisticated recurrent neural network structures like LSTM [9].

Unfortunately, while bi-directional neural networks like BI-RNN or BI-LSTM have recently been successfully applied to many tasks, it is not trivial to apply this powerful model to language modeling. The main challenge is that the bi-directional information will break the sentence-level normalization, making the model no longer valid for the MLE training framework (please refer to section 3 for more details).

In this work, noise contrastive estimation (NCE) [19] is used to train a bi-directional neural network based LM. One big advantage of NCE over MLE is that it does not require the model to be self-normalized. This enables the utilization of bi-directional information for word-level scoring. Formulations of this work will be elaborated in the next section.

3 Model formulation

If some bi-directional model like P(w_i | w_{1..i-1, i+1..N}) is used as the word-level LM, equation 1, and hence equation 2, will no longer hold.

Figure 1: Illustration of the network structure
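To make the normalization contrast concrete, here is a minimal Python sketch (our illustration, not from the paper; the two-word vocabulary and the probability tables are made up). It enumerates all two-word sentences over a toy vocabulary: the chain-rule product of equation (1) sums to one over all sentences, as in equation (2), while a product of bi-directional word-level conditionals generally does not.

# Toy illustration (not from the paper): why the chain-rule product is
# sentence-level normalized, while a product of bi-directional
# conditionals P(w_i | other words) in general is not.
import itertools

vocab = ["a", "b"]

# A uni-directional model: P(w1) and P(w2 | w1), each row normalized.
p_first = {"a": 0.6, "b": 0.4}
p_next = {"a": {"a": 0.7, "b": 0.3},
          "b": {"a": 0.2, "b": 0.8}}

def p_chain(w1, w2):
    # Equation (1): P(W) = P(w1) * P(w2 | w1)
    return p_first[w1] * p_next[w1][w2]

def p_bidir(w1, w2):
    # A "bi-directional" word-level product: condition each word on the other.
    # p_next is reused in both directions purely for illustration.
    return p_next[w2][w1] * p_next[w1][w2]

chain_total = sum(p_chain(w1, w2) for w1, w2 in itertools.product(vocab, repeat=2))
bidir_total = sum(p_bidir(w1, w2) for w1, w2 in itertools.product(vocab, repeat=2))

print("sum over sentences, chain rule     :", round(chain_total, 3))  # exactly 1.0
print("sum over sentences, bi-directional :", round(bidir_total, 3))  # here 1.25, not 1.0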
In this work, P(W) is the product of word-level scores (similar to a uni-directional LM) and a learned normalization scalar c, required by the NCE framework to ensure normalization:

f'(W) = \prod_i f_i(W)
P_{NCE}(W) = f'(W) / \exp(c)    (3)

where f_i is the score given by a bi-directional neural network at each word index, and the "NCE" superscript of P_{NCE}(W) indicates that the normalization is induced by NCE training.

In this work, the same bi-directional neural network structure that has been used in [18, 20] is applied; it is shown in figure 1 and formulated below (we are aware that other variants of BI-RNN exist [21], but they are not fundamentally different with regard to this work):

v_i = W_{xh} x_i
\overrightarrow{h}_i = g(\overrightarrow{h}_{i-1}, v_i)
\overleftarrow{h}_i = g(\overleftarrow{h}_{i+1}, v_i)
h_i = \tanh(W_{hf} \overrightarrow{h}_i + W_{hr} \overleftarrow{h}_i + b)
u_i = \exp(W_{ho} h_i + b_o)    (4)

where the W matrices and b vectors are the transformation matrix and bias vector parameters of the neural network, and x_i is the one-hot representation of w_i. Finally, f_i(W) is obtained after a normalizing operation over the vocabulary (denoted as V) on u_i:

f_i(W) = u_i(w_i) / \sum_{w_j \in V} u_i(w_j)    (5)

Note that the word-level normalization is not strictly needed in this work (u_i(w_i) can be used directly as f_i(W)), but experiments show that keeping the word-level normalization gives better results.

In this work, the gated recurrent unit is used as the recurrent structure h_t = g(h_{t-1}, v_t) because it is faster, uses less memory, and has matching performance with the LSTM structure [16]. Our NN model is therefore denoted BI-GRULM, and its formulation is given below:

z_t = \sigma(W_{hz} h_{t-1} + W_{xz} v_t + b_z)
r_t = \sigma(W_{hr} h_{t-1} + W_{xr} v_t + b_r)
\tilde{h}_t = \tanh(W_{h\tilde{h}} (r_t * h_{t-1}) + W_{x\tilde{h}} v_t + b_h)
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t    (6)

where \sigma is the sigmoid function \sigma(x) = 1 / (1 + e^{-x}) and * is element-wise multiplication. Note that a different set of parameters is used for the forward and backward connections in the bi-directional neural network.

Finally, we also write down the formulation of a one-layer uni-directional GRULM (UNI-GRULM) here, since it will be used as the baseline model:

h_i = g(h_{i-1}, W_{xh} x_i)
u_i = \exp(W_{ho} h_i + b_o)    (7)

Note that other than being uni-directional, the only other difference between these two models is the normalization scalar c. In this work, the dropout operation is applied on h_i for both models.

As stressed in section 2, the MLE framework is not suitable for training a bi-directional NNLM; still, in this work MLE training is tried as a baseline experiment. Denoting the data distribution as P_{data}(W), the MLE objective function is formulated as:

J_{MLE}(\theta) = E_{P_{data}(W)}[\log f'_\theta(W)]    (8)

Note that the normalization scalar c does not exist in this model. In this work, noise contrastive estimation [19] is applied to train the bi-directional NNLM.
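As a concrete reading of equations (4)-(6), the following numpy sketch scores each word of a toy sentence with a small bi-directional GRU. It is only an illustration under assumed toy dimensions and random parameters; the actual model in this work is trained with CNTK, and the names GRUCell and bigru_word_scores are ours, not from the paper.

# A minimal numpy sketch of the BI-GRULM word scorer, equations (4)-(6).
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 10, 8, 6   # vocab size, embedding size, hidden size (toy values)

def init(shape):
    return rng.normal(scale=0.1, size=shape)

class GRUCell:
    """One direction of the recurrence h_t = g(h_{t-1}, v_t), equation (6)."""
    def __init__(self):
        self.Whz, self.Wxz, self.bz = init((H, H)), init((H, E)), np.zeros(H)
        self.Whr, self.Wxr, self.br = init((H, H)), init((H, E)), np.zeros(H)
        self.Whh, self.Wxh, self.bh = init((H, H)), init((H, E)), np.zeros(H)

    def step(self, h_prev, v):
        sig = lambda x: 1.0 / (1.0 + np.exp(-x))
        z = sig(self.Whz @ h_prev + self.Wxz @ v + self.bz)
        r = sig(self.Whr @ h_prev + self.Wxr @ v + self.br)
        h_tilde = np.tanh(self.Whh @ (r * h_prev) + self.Wxh @ v + self.bh)
        return (1 - z) * h_prev + z * h_tilde

def bigru_word_scores(word_ids, emb, fwd, bwd, Whf, Whr, b, Who, bo):
    """Return f_i(W) for every position, equations (4)-(5)."""
    vs = [emb[:, w] for w in word_ids]                 # v_i = W_xh x_i
    h_f, h_b = np.zeros(H), np.zeros(H)
    fwd_states, bwd_states = [], []
    for v in vs:                                       # left-to-right pass
        h_f = fwd.step(h_f, v); fwd_states.append(h_f)
    for v in reversed(vs):                             # right-to-left pass
        h_b = bwd.step(h_b, v); bwd_states.append(h_b)
    bwd_states.reverse()
    scores = []
    for i, w in enumerate(word_ids):
        h = np.tanh(Whf @ fwd_states[i] + Whr @ bwd_states[i] + b)
        u = np.exp(Who @ h + bo)                       # u_i over the vocabulary
        scores.append(u[w] / u.sum())                  # word-level normalization (5)
    return scores

emb = init((E, V))
fwd, bwd = GRUCell(), GRUCell()        # separate parameters per direction
Whf, Whr, b = init((H, H)), init((H, H)), np.zeros(H)
Who, bo = init((V, H)), np.zeros(V)

sentence = [3, 1, 4, 1, 5]             # toy word indices
f = bigru_word_scores(sentence, emb, fwd, bwd, Whf, Whr, b, Who, bo)
print("f_i(W)      :", [round(x, 4) for x in f])
print("log f'(W)   :", float(np.sum(np.log(f))))      # equation (3), before dividing by exp(c)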
NCE introduces a noise distribution P_{noise}(W) into training and a "to-be-learned" normalization scalar c into the model. Its basic idea is that instead of maximizing the likelihood of the data samples, the model is asked to discriminate samples from the data distribution against samples from the noise distribution:

J_{NCE}(\theta) = E_{P_{data}(W)}[\log P(D=1 | W; \theta)] + k E_{P_{noise}(W)}[\log P(D=0 | W; \theta)]
P(D=1 | W; \theta) = P^{NCE}_\theta(W) / (P^{NCE}_\theta(W) + k P_{noise}(W))
P(D=0 | W; \theta) = k P_{noise}(W) / (P^{NCE}_\theta(W) + k P_{noise}(W))    (9)

assuming a noise ratio of k. The gradients are:

\partial \log P(D=1 | W; \theta) / \partial \theta = [k P_{noise}(W) / (P^{NCE}_\theta(W) + k P_{noise}(W))] \cdot \partial \log P^{NCE}_\theta(W) / \partial \theta
\partial \log P(D=0 | W; \theta) / \partial \theta = -[P^{NCE}_\theta(W) / (P^{NCE}_\theta(W) + k P_{noise}(W))] \cdot \partial \log P^{NCE}_\theta(W) / \partial \theta    (10)

For NCE to get good performance, a noise distribution that is close to the real data distribution is preferred, so in our case it is natural to use a good uni-directional LM as the noise distribution. In this work, an N-gram LM is used as the noise distribution since it is efficient to sample from. Details about the implementation and training process are covered in section 4.

4 Implementation

Mini-batch based stochastic gradient descent (SGD) is used to train the bi-directional NNLM in this work. The training process is very similar to [12], but several changes need to be made for sentence-level bi-directional NNLM training. Since NN training in this work is sentence-level, the data (consisting of real data samples and noise model samples) are processed in chunks, as illustrated in figure 2. Moreover, a batch of data streams is processed together to utilize the computing power of the GPU. It is relatively easy to realize this training process of the BI-RNN with the help of neural network training tool-kits like CNTK [22]. In this work, the chunk size is set to 90 (which is larger than the longest sentence in the PTB data-set) and the batch size is set to 64.
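The per-sentence NCE objective of equation (9) that this training procedure optimizes can be written compactly in log space. The sketch below is a simplified illustration, not the CNTK implementation used in this work: the log scores passed in stand for log f'(W) from the bi-directional scorer and log P_noise(W) from the n-gram noise LM, and the numbers in the usage line are made up.

import math

def _logaddexp(a, b):
    # Numerically stable log(exp(a) + exp(b)).
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def nce_loss(log_f_prime_data, log_pnoise_data,
             log_f_prime_noise, log_pnoise_noise, c, k):
    """Negative of the per-sentence NCE objective in equation (9).

    log_f_prime_*  : log f'(W) = sum_i log f_i(W) from the bi-directional scorer
    log_pnoise_*   : log P_noise(W) under the n-gram noise LM
    c              : learned normalization scalar, log P_NCE(W) = log f'(W) - c
    k              : noise ratio (len(log_f_prime_noise) == k)
    """
    def log_p_d1(log_model, log_noise):
        return log_model - _logaddexp(log_model, math.log(k) + log_noise)

    def log_p_d0(log_model, log_noise):
        lk_noise = math.log(k) + log_noise
        return lk_noise - _logaddexp(log_model, lk_noise)

    loss = -log_p_d1(log_f_prime_data - c, log_pnoise_data)
    for lf, ln in zip(log_f_prime_noise, log_pnoise_noise):
        loss -= log_p_d0(lf - c, ln)
    return loss

# Toy usage: one data sentence, k = 2 noise sentences (made-up log scores).
print(nce_loss(-40.0, -45.0, [-60.0, -55.0], [-42.0, -44.0], c=0.0, k=2))

With automatic differentiation, the gradient of this loss with respect to the model parameters reduces to the weighted terms given in equation (10).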
Figure 2: Illustration of the parallel training implementation

A validation-based learning strategy is used: the learning rate is fixed to a large value at first, and starts to decay at a rate of 0.6 when no significant improvement on the validation data is observed; training is stopped when that happens again. Further, an L2 regularization with coefficient 1e-5 is used. Finally, the SRILM [23] toolkit is used for N-gram LM training in this work. In our training the N-gram noise is generated on-the-fly, so noise samples won't be the same between iterations.
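Since the n-gram noise sentences are sampled on-the-fly, each iteration sees fresh noise. A rough sketch of such sentence-level sampling is shown below; it uses a toy add-one-smoothed bigram in place of the SRILM-trained n-gram actually used in this work, so the corpus, smoothing, and function names are illustrative assumptions only.

import random
from collections import defaultdict

# Toy corpus; the real noise model is an SRILM-trained n-gram on PTB.
corpus = [["no", "it", "was", "n't", "black", "monday"],
          ["it", "was", "black"],
          ["no", "it", "was", "monday"]]

BOS, EOS = "<s>", "</s>"
counts = defaultdict(lambda: defaultdict(int))
vocab = {EOS}
for sent in corpus:
    words = [BOS] + sent + [EOS]
    vocab.update(sent)
    for prev, cur in zip(words, words[1:]):
        counts[prev][cur] += 1

def sample_noise_sentence(rng, max_len=20):
    """Draw one sentence from the add-one-smoothed bigram noise model."""
    out, prev = [], BOS
    for _ in range(max_len):
        words = sorted(vocab)
        weights = [counts[prev][w] + 1 for w in words]   # add-one smoothing
        cur = rng.choices(words, weights=weights, k=1)[0]
        if cur == EOS:
            break
        out.append(cur)
        prev = cur
    return out

rng = random.Random(0)
for _ in range(3):
    print(" ".join(sample_noise_sentence(rng)))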
5 Experiments

In this section, results of experiments designed to test the performance of the proposed bi-directional NNLM trained by NCE are presented. Since the training process is very time-costly when the noise ratio k is large (in our training framework, it costs at least k times the time needed to train the baseline UNI-GRULM model), we confined our experiments to the Penn Treebank (PTB) portion of the WSJ corpus, which is publicly available and has been used extensively in the LM community. There are 930k tokens, 74k tokens, and 82k tokens for training, validation and testing, respectively, and the vocabulary size is 10k.

Further, since there is no guarantee that the trained model will be properly normalized, the evaluation of perplexity (PPL), which is the most conventional evaluation for LM, can no longer be applied. Instead, we need to resort to some discriminative task in which the LM is asked to tell "good" sentences from "bad" sentences, like its application in decoding or rescoring in systems like speech recognition or machine translation. But still, we want the training corpus and vocabulary size to be small enough to enable us to try a large noise ratio k: since sentence-level sampling is considered in this work, it is expected that k needs to be large enough for the training to work.

In light of the above concerns, a rescoring task is created directly on the PTB dataset, denoted as ptb-rescore. (This test set and the scripts for reproducing the N-gram baseline are available at https://bitbucket.org/cloudygoose/ptb_rescore.) In this test, small random errors are introduced to each sentence of the original test corpus of the PTB dataset, and the LM is then asked to recognize the original sentence from the tampered ones by assigning it the highest score. In this work, three types of error, namely substitution, deletion and insertion, are generated. For each error type, 9 decoys (each decoy has only one error) are generated for each test sentence, constituting three test sets; so a uniform guess will have an accuracy of 10%. Further, a mixed set where each decoy can be of any of the three types of error is also added, denoted as test-sdi. Some examples are shown in table 1. Note that in this test set all random numbers (for the position or new word index) are drawn from a uniform distribution, and the test-s set is similar to the MSR sentence completion task [24].
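The decoy construction described above is simple to sketch in code. The snippet below is an illustration written for this description rather than the released bitbucket script; it draws the error position and the substituted or inserted word uniformly, as stated above.

import random

def make_decoy(sentence, error_type, vocab, rng):
    """Create one decoy with a single error; position and word drawn uniformly."""
    decoy = list(sentence)
    pos = rng.randrange(len(decoy))
    if error_type == "s":                       # substitution
        decoy[pos] = rng.choice(vocab)
    elif error_type == "d":                     # deletion
        del decoy[pos]
    elif error_type == "i":                     # insertion
        decoy.insert(pos, rng.choice(vocab))
    return decoy

def make_test_group(sentence, error_types, vocab, rng, n_decoys=9):
    """Original sentence plus 9 decoys; the LM should rank the original first."""
    decoys = [make_decoy(sentence, rng.choice(error_types), vocab, rng)
              for _ in range(n_decoys)]
    return [list(sentence)] + decoys

rng = random.Random(0)
vocab = ["no", "it", "was", "n't", "black", "monday", "revoke", "cracks"]
sent = ["no", "it", "was", "n't", "black", "monday"]

for candidate in make_test_group(sent, ["s"], vocab, rng):   # a test-s style group
    print(" ".join(candidate))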
Although perplexity cannot be used to evaluate the bi-directional NNLM, it is still interesting to see what PPL the trained model will assign to the test sentences. Besides the original test set of the PTB data, two additional text sets are generated: one consists of sentences sampled from the 4-GRAM baseline model (denoted as 4gram-text), the other of sentences sampled from a completely uniform distribution (denoted as uniform-text). All three sets have around 4,000 sentences. A well-behaved LM is expected to assign the lowest PPL to the first set, relatively low PPL to the second, and very high (bad) PPL to the last one. The results are shown in table 2.

Original: no it was n't black monday
s-error:  no it was n't black revoke
d-error:  no it was n't monday
i-error:  no it cracks was n't black monday

Table 1: Examples of decoys in the ptb-rescore test set

Model                            | test-ptb | 4gram-text | uniform-text
UNI-GRULM                        | 103.7    | 431.0      | 91935.7
BI-GRULM(MLE)                    | 1.12     | 1.16       | 3.358
BI-GRULM(NCE with noise ratio )  | 15.5     | 3846.4     | 99565.4

Table 2: Pseudo-PPL result of different trained LMs on three test sets.

It is shown that the BI-GRULM (detailed configuration will be discussed in section 5.3) trained with NCE has similar behavior to the baseline uni-directional model, meaning that NCE is helping the model with sentence-level normalization. On the contrary, the BI-GRULM trained with MLE assigns extremely low PPL to every test set, indicating that the model is not properly normalized. But surprisingly, the relative order of the PPLs from the MLE-trained BI-GRULM is correct.

In this section, accuracy results on the ptb-rescore task are presented. Three models are trained as baseline models: 4-GRAM, UNI-GRULM, and BI-GRULM trained by MLE. Note that unless otherwise mentioned, all GRULMs trained in this work have 300 neurons in the hidden layer and only one layer is used (in the BI-GRULM case, one layer means one forward layer and one backward layer). This setting is chosen because adding more neurons or more layers gives no significant improvement on the test PPL for the baseline UNI-GRULM model. During training, dropout is applied for the UNI-GRULM, but no dropout is applied in the reported experiments for the BI-GRULM, because it is found that dropout does not give a performance gain in that case.

The baseline results are shown in the upper part of table 3. Overall, the UNI-GRULM model gives the best performance, as expected. An interesting observation is that all models have extremely poor performance on the test-d set. This behavior, however, is not so surprising, since the LM score of a sentence is after all a product of word-level probabilities, so decoys with one less word have a big advantage. It is found that this problem can be alleviated by a length-norm trick:

score_{length-norm}(W) = score(W) / l = \log f(W) / l = (\sum_{i=1}^{l} \log f_i(W)) / l    (11)

assuming sentence W is of length l (including the sentence-end token). Note that this trick is equivalent to ranking the sentences using PPL instead of sentence-level log likelihood, and it does some harm to the performance on the test-i set, although not much.
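As a small worked example of rescoring with and without the length-norm trick of equation (11), the sketch below picks the best candidate from one decoy group under both criteria. The per-word log scores are made-up placeholders for log f_i(W) from a trained model; note how the shorter deletion decoy wins under the raw sum but not after length normalization.

def sentence_score(word_log_scores):
    # Sentence-level log score: sum_i log f_i(W)
    return sum(word_log_scores)

def length_norm_score(word_log_scores):
    # Equation (11): divide by the sentence length l (sentence-end token included)
    return sum(word_log_scores) / len(word_log_scores)

# One decoy group: candidate -> per-word log f_i(W) (made-up numbers).
# The deletion decoy is shorter, so its raw sum looks deceptively good.
group = {
    "no it was n't black monday </s>": [-3.1, -2.0, -1.5, -2.2, -4.0, -2.8, -1.0],
    "no it was n't monday </s>":       [-3.1, -2.0, -1.5, -2.2, -6.5, -1.0],
    "no it was n't black revoke </s>": [-3.1, -2.0, -1.5, -2.2, -4.0, -9.5, -1.2],
}

best_raw = max(group, key=lambda s: sentence_score(group[s]))
best_ln = max(group, key=lambda s: length_norm_score(group[s]))
print("best by raw log score :", best_raw)   # picks the deletion decoy
print("best by length-norm   :", best_ln)    # picks the original sentence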
Results of BI-GRULM trained by NCE are shown in the lower part of table 3. It is observed that the length-norm trick also helps in this case, and the overall performance improves with larger and larger noise ratios; however, it became unaffordable for us to run experiments with a ratio larger than 100. One strange observation is that performance on the test-d set degrades with a larger noise ratio, and this causes performance on the test-sdi set to become worse. Also, compared with the BI-GRULM(MLE) result, BI-GRULMs trained by NCE with a large noise ratio have overall better performance, indicating that NCE has the potential to utilize the power of the BI-GRULM structure more properly.

Model          | noise ratio | test-s      | test-d      | test-i      | test-sdi
BI-GRULM(MLE)  | -           | 50.0 / 50.0 | 0.31 / 21.9 | 95.3 / 31.5 | 6.8 / 27.1
BI-GRULM(NCE)  | 1           | 31.9 / 31.9 | 3.9 / 12.8  | 67.4 / 53.0 | 10.9 / 17.8
               | 10          | 39.9 / 39.9 | 8.8 / 19.4  | 61.8 / 48.8 | 20.5 / 26.2
               | 20          | 39.2 / 39.2 |     /       |     /       |     / 26.3
               | 50          | 48.4 / 48.4 | 6.8 / 19.8  | 74.2 / 54.9 | 18.1 / 29.0
               | 100         |             |             |             |

Table 3: Accuracy results of BI-GRULM models trained by NCE. Each cell shows accuracy (%) / accuracy after length-norm (%).

Unfortunately, the proposed model failed to out-perform the best UNI-GRULM baseline model on any test set. Results on the test-s set show that improvement can only be obtained by growing the noise ratio exponentially; this matches our concern in section 5.1: the sentence-level sampling space may be too sparse for our sampling to properly cover.
6 Related works

In [20], a bi-directional LSTMLM is trained with MLE and tested by LM rescoring in an ASR task. However, no improvement is observed over the uni-directional baseline model. On the other hand, NCE has been used in uni-directional LM training both for FNNLM [25] and RNNLM [26]; the main goal there was to speed up the training and evaluation of these two models, because under NCE training the final softmax operation on the output layer is no longer necessary. Note that, different from these two works, NCE is applied at the sentence level in this work.
7 Conclusion

In this work, noise contrastive estimation is used to train a bi-directional neural network language model. Experiments are conducted on a rescoring task on the PTB data set. It is shown that the NCE-trained bi-directional NNLM outperforms the one trained by conventional maximum likelihood training. But still, it does not out-perform the baseline uni-directional NNLM. The key reason may be that the sentence-level sampling space is too sparse for our sampling to cover.
Acknowledgements

The authors want to thank Abdelrahman Mohamed, Kaisheng Yao, Geoffrey Zweig, Dong Yu, Mike Seltzer, and Da Zheng for valuable discussions.
References

[1] Stanley F. Chen and Joshua Goodman, "An empirical study of smoothing techniques for language modeling," in Proc. ACL, 1996, pp. 310-318, Association for Computational Linguistics.
[2] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.
[3] Holger Schwenk, "Continuous space language models," Computer Speech & Language, vol. 21, no. 3, pp. 492-518, 2007.
[4] Frederic Morin and Yoshua Bengio, "Hierarchical probabilistic neural network language model," in AISTATS, 2005, pp. 246-252.
[5] J. Park, X. Liu, M. J. F. Gales, and P. C. Woodland, "Improved neural network based language modelling and adaptation," in Proc. InterSpeech, 2010.
[6] Andriy Mnih and Geoffrey Hinton, "Three new graphical models for statistical language modelling," in Proc. ICML, 2007, pp. 641-648.
[7] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur, "Recurrent neural network based language model," in Proc. InterSpeech, 2010.
[8] Martin Sundermeyer, Ilya Oparin, Ben Freiberg, Ralf Schlüter, and Hermann Ney, "Comparison of feedforward and recurrent neural network language models," in Proc. ICASSP, 2013.
[9] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, "LSTM neural networks for language modeling," in Proc. InterSpeech, 2012.
[10] Zhiheng Huang, Geoffrey Zweig, and Benoit Dumoulin, "Cache based recurrent neural network language model inference for first pass speech recognition," in Proc. ICASSP, 2014.
[11] X. Liu, Y. Wang, X. Chen, M. J. F. Gales, and P. C. Woodland, "Efficient lattice rescoring using recurrent neural network language models," in Proc. ICASSP, 2014.
[12] X. Chen, Y. Wang, X. Liu, M. J. F. Gales, and P. C. Woodland, "Efficient GPU-based training of recurrent neural network language models using spliced sentence bunch," in Proc. InterSpeech, 2014.
[13] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," 2001.
[14] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, Nov. 1997.
[15] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals, "Recurrent neural network regularization," CoRR, vol. abs/1409.2329, 2014.
[16] Junyoung Chung, Çağlar Gülçehre, KyungHyun Cho, and Yoshua Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014.
[17] Shiliang Zhang, Hui Jiang, Si Wei, and Li-Rong Dai, "Feedforward sequential memory neural networks without recurrent feedback," CoRR, vol. abs/1510.02693, 2015.
[18] Alex Graves and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, 21-26 June 2014, pp. 1764-1772.
[19] Michael U. Gutmann and Aapo Hyvärinen, "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics," Journal of Machine Learning Research, vol. 13, no. 1, pp. 307-361, Feb. 2012.
[20] Ebru Arisoy, Abhinav Sethy, Bhuvana Ramabhadran, and Stanley Chen, "Bidirectional recurrent neural network language models for automatic speech recognition," in Proc. ICASSP, 2015.
[21] Mathias Berglund, Tapani Raiko, Mikko Honkala, Leo Kärkkäinen, Akos Vetek, and Juha Karhunen, "Bidirectional recurrent neural networks as generative models - reconstructing gaps in time series," CoRR, vol. abs/1504.01575, 2015.
[22] "An introduction to computational networks and the computational network toolkit," Microsoft Research Technical Report MSR-TR-2014-112, 2014.
[23] Andreas Stolcke, "SRILM - an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, November 2002, pp. 257-286.
[24] "The Microsoft Research sentence completion challenge," Microsoft Research Technical Report MSR-TR-2011-129, 2011.
[25] Andriy Mnih and Yee Whye Teh, "A fast and simple algorithm for training neural probabilistic language models," in Proceedings of the 29th International Conference on Machine Learning, 2012, pp. 1751-1758.
[26] X. Chen, X. Liu, M. J. F. Gales, and P. C. Woodland, "Recurrent neural network language model training with noise contrastive estimation for speech recognition," in