A Neural Network Approach for Mixing Language Models
Youssef Oualil, Dietrich Klakow
Spoken Language Systems (LSV)
Collaborative Research Center on Information Density and Linguistic Encoding
Saarland University, Saarbrücken, Germany
{firstname.lastname}@lsv.uni-saarland.de

ABSTRACT
The performance of Neural Network (NN)-based language models is steadily improving due to the emergence of new architectures, which are able to learn different natural language characteristics. This paper presents a novel framework, which shows that a significant improvement can be achieved by combining different existing heterogeneous models in a single architecture. This is done through 1) a feature layer, which separately learns different NN-based models, and 2) a mixture layer, which merges the resulting model features. In doing so, this architecture benefits from the learning capabilities of each model with no noticeable increase in the number of model parameters or the training time. Extensive experiments conducted on the Penn Treebank (PTB) and the Large Text Compression Benchmark (LTCB) corpus showed a significant reduction of the perplexity when compared to state-of-the-art feedforward as well as recurrent neural network architectures.
Index Terms — Neural networks, mixture models, language modeling
1. INTRODUCTION
For many language technology applications such as speech recognition [1] and machine translation [2], a high quality Language Model (LM) is considered to be a key component to success. Traditionally, LMs aim to predict probable sequences of predefined linguistic units, which are typically words. These predictions are guided by the semantic and syntactic properties that are encoded by the LM.

The recent advances in neural network-based approaches for language modeling led to a significant improvement over the standard N-gram models [3, 4]. This is mainly due to the continuous word representations they provide, which typically overcome the exponential growth of parameters that N-gram models require. NN-based LMs were first introduced by Bengio et al. [5], who proposed a Feedforward Neural Network (FNN) model as an alternative to N-grams. Although FNNs were shown to perform very well for different tasks [6, 7], their fixed context (word history) size constraint was a limiting factor for their performance. In order to overcome this constraint, Mikolov et al. [8, 9] proposed a Recurrent Neural Network (RNN), which allows context information to cycle in the network. Investigating the inherent shortcomings of RNNs led to the Long Short-Term Memory (LSTM)-based LMs [10], which explicitly control the longevity of context information in the network. This chain of novel NN-based LMs continued with more complex and advanced models such as Convolutional Neural Networks (CNN) [11] and autoencoders [12], to name a few.

This research was funded by the German Research Foundation (DFG) as part of SFB 1102.
LM performance has been shown to significantly improve using model combination. This is typically done by either 1) designing deep networks with different architectures at the different layers, as it was done in [11], which combines LSTM, CNN and a highway network, or by 2) combining different models at the output layer, as it is done in the maximum entropy RNN model [13], which uses direct N-gram connections to the output layer, or using the classical linear interpolation [14]. While the former category requires a careful selection of the architectures to combine for a well-suited feature design, and can be difficult/slow to train, the second category sees a significant increase in the number of parameters when combining multiple models.

Motivated by the work in [13], we have recently proposed a Sequential Recurrent Neural Network (SRNN) [15], which combines FNN information with an RNN. In this paper, we continue along this line of work by proposing a generalized framework to combine different heterogeneous NN-based architectures in a single mixture model. More particularly, the proposed architecture uses 1) a hidden feature layer to separately learn each of the models to be combined, and 2) a hidden mixture layer, which combines the resulting model features. Moreover, this architecture uses a single word embedding matrix, which is learned from all models, and a single output layer. This framework is, in principle, able to combine different NN-based LMs (e.g., FNN, RNN, LSTM, etc.) with no direct constraints on the number of models to combine or their configurations.

We proceed as follows. Section 2 presents an overview of the basic NN-based LMs. Section 3 introduces the proposed neural mixture model. Then, Section 4 evaluates the proposed network in comparison to different state-of-the-art language models for perplexity on the PTB and the LTCB corpus. Finally, we conclude in Section 5.
2. NEURAL NETWORK LANGUAGE MODELS
The goal of a language model is to estimate the probability distribution $p(w_1^T)$ of word sequences $w_1^T = w_1, \cdots, w_T$. Using the chain rule, this distribution can be expressed as

$$p(w_1^T) = \prod_{t=1}^{T} p(w_t \mid w_1^{t-1}) \quad (1)$$

Let $U$ be a word embedding matrix and let $W$ be the hidden-to-output weights. NN-based LMs (NNLMs), which consider word embeddings as input, approximate each of the terms involved in this product in a bottom-up evaluation of the network according to

$$H_t = \mathcal{M}(\mathcal{P}, R_{t-1}, U) \quad (2)$$
$$O_t = g\left(H_t \cdot W\right) \quad (3)$$

where $\mathcal{M}$ represents a particular NN-based model, which can be a deep architecture, $\mathcal{P}$ denotes its parameters and $R_{t-1}$ denotes its recurrent information at time $t$. $g(\cdot)$ is the softmax function.

The rest of this section briefly introduces $\mathcal{M}$, $\mathcal{P}$ and $R_{t-1}$ for the basic architectures, namely FNN, RNN and LSTM, which were investigated and evaluated as different components in the proposed mixture model. The proposed architecture, however, is general and can include all NNLMs that consider word embeddings as input.

Similarly to N-gram models, FNN uses the Markov assumption of order $N-1$ to approximate (1). That is, the current word depends only on the last $N-1$ words. Subsequently, $\mathcal{M}$ is given by

$$E_{t-i} = X_{t-i} \cdot U, \quad i = N-1, \cdots, 1 \quad (4)$$
$$H_t = f\left(\sum_{i=1}^{N-1} E_{t-i} \cdot V_i\right) \quad (5)$$

$X_{t-i}$ is a one-hot encoding of the word $w_{t-i}$. Thus, $E_{t-i}$ is the continuous representation of the word $w_{t-i}$. $f(\cdot)$ is an activation function. Hence, for an FNN model $\mathcal{M}$, $\mathcal{P} = \{V_i\}_{i=1}^{N-1}$ and $R_{t-1} = \emptyset$.

RNN attempts to capture the complete history in a context vector $h_t$, which represents the state of the network and evolves in time. Therefore, RNN approximates each term in (1) as $p(w_t \mid w_1^{t-1}) \approx p(w_t \mid h_t)$. As a result, $\mathcal{M}$ for an RNN is given by

$$H_t = f\left(X_{t-1} \cdot U + H_{t-1} \cdot V\right) \quad (6)$$

Thus, for an RNN model $\mathcal{M}$, $\mathcal{P} = V$ and $R_{t-1} = H_{t-1}$.

In order to alleviate the rapidly changing context issue in standard RNNs and control the longevity of the dependencies modeled in the network, the LSTM architecture [10] introduces an internal memory state $C_t$, which explicitly controls the amount of information to forget or to add to the network before estimating the current hidden state. Formally, an LSTM model $\mathcal{M}$ is given by

$$E_{t-1} = X_{t-1} \cdot U \quad (7)$$
$$\{i, f, o\}_t = \sigma\left(V_w^{i,f,o} \cdot E_{t-1} + V_h^{i,f,o} \cdot H_{t-1}\right) \quad (8)$$
$$\tilde{C}_t = f\left(V_w^{c} \cdot E_{t-1} + V_h^{c} \cdot H_{t-1}\right) \quad (9)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad (10)$$
$$H_t = o_t \odot f\left(C_t\right) \quad (11)$$

where $\odot$ is the element-wise product, $\tilde{C}_t$ is the memory candidate, whereas $i_t$, $f_t$ and $o_t$ are the input, forget and output gates of the network, respectively. Hence, for an LSTM model $\mathcal{M}$, $R_t = \{H_t, C_t\}$ and $\mathcal{P} = \{V_w^{i,f,o,c}, V_h^{i,f,o,c}\}$.
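To make the three feature functions concrete, the following is a minimal numpy sketch of Eqs. (2)-(11). All sizes, the ReLU/tanh choices for $f$, and the extra input projection added for the RNN (the paper instead ties the RNN embedding size to its hidden size) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 10_000, 100, 400            # vocab, embedding and hidden sizes (illustrative)
U = rng.normal(0, 0.01, (V, d))       # word embedding matrix U
W = rng.normal(0, 0.01, (H, V))       # hidden-to-output weights W

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Eqs. (4)-(5): FNN features for a model of order N = 5 (ReLU assumed for f).
N = 5
V_i = [rng.normal(0, 0.01, (d, H)) for _ in range(N - 1)]
def fnn_features(ctx_ids):            # ctx_ids: ids of the last N-1 words
    return np.maximum(0, sum(U[w] @ Vi for w, Vi in zip(ctx_ids, V_i)))

# Eq. (6): RNN features (an input projection A is added here only because the
# shared embedding size d differs from H in this sketch).
A = rng.normal(0, 0.01, (d, H))
V_rec = rng.normal(0, 0.01, (H, H))
def rnn_features(w_id, h_prev):
    return np.tanh(U[w_id] @ A + h_prev @ V_rec)

# Eqs. (7)-(11): LSTM features with sigmoid gates and tanh standing in for f.
Vw = {g: rng.normal(0, 0.01, (d, H)) for g in "ifoc"}
Vh = {g: rng.normal(0, 0.01, (H, H)) for g in "ifoc"}
def lstm_features(w_id, h_prev, c_prev):
    e = U[w_id]                                        # Eq. (7)
    i = sigmoid(e @ Vw["i"] + h_prev @ Vh["i"])        # Eq. (8)
    f = sigmoid(e @ Vw["f"] + h_prev @ Vh["f"])
    o = sigmoid(e @ Vw["o"] + h_prev @ Vh["o"])
    c_tilde = np.tanh(e @ Vw["c"] + h_prev @ Vh["c"])  # Eq. (9)
    c = f * c_prev + i * c_tilde                       # Eq. (10)
    return o * np.tanh(c), c                           # Eq. (11)

# Eqs. (2)-(3): any of the feature functions feeds the softmax output layer.
h_t = fnn_features([1, 2, 3, 4])
p_next = softmax(h_t @ W)             # O_t, a distribution over the vocabulary
```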
3. NEURAL NETWORK MIXTURE MODELS
In contrast to the large number of research directions on improving or designing (new) particular neural architectures for language modeling, the work presented in this paper is an attempt to design a general architecture, which is able to combine different types of existing heterogeneous models rather than investigating new ones.
The work presented in this paper is motivated by recent research showing that model combination can lead to a significant improvement in LM performance [14]. This is typically done by either 1) designing deep networks with different architectures at the different layers, as it was done in [11]; this category of model combination, however, requires a careful selection of the architectures to combine for a well-suited feature design, and can be difficult/slow to train; or 2) combining different models at the output layer, as it is done in the maximum entropy RNN model [13] or using the classical linear interpolation [14]; this category typically leads to a significant increase in the number of parameters when combining multiple models.

In a first attempt to circumvent these problems, we have recently proposed an SRNN model [15], which combines FNN information and RNN through additional sequential connections at the hidden layer. Although SRNN was successful and did not noticeably suffer from the aforementioned problems, it was solely designed to combine RNN and FNN and is, therefore, not well-suited for other architectures. This paper continues along this line of work by proposing a general architecture to combine different heterogeneous neural models with no direct constraints on the number or type of models.
This section introduces the mathematical formulation of the proposed mixture model. Let $\{\mathcal{M}_m\}_{m=1}^{M}$ be a set of $M$ models to combine, and let $\{\mathcal{P}_m, R_m^t\}_{m=1}^{M}$ be their corresponding model parameters and recurrent information at time $t$, respectively. For the basic NNLMs, namely FNN, RNN and LSTM, $\mathcal{M}_m$, $\mathcal{P}_m$ and $R_m^t$ were introduced in Section 2.

Let $U$ be the shared word embedding matrix, which is learned during training from all models in the mixture. The mixture model is given by the following steps (see illustration in Fig. 1):
1) Feature layer: update each model and calculate its features

$$H_m^t = \mathcal{M}_m(\mathcal{P}_m, R_m^{t-1}, U), \quad m = 1, \cdots, M \quad (12)$$
2) Mixture layer: combine the different features

$$H_{mixture}^t = f_{mixture}\left(\sum_{m=1}^{M} H_m^t \cdot S_m\right) \quad (13)$$
3) Output layer: calculate the output using a softmax function (a code sketch of these three steps is given below)

$$O_t = g\left(H_{mixture}^t \cdot W\right) \quad (14)$$

$f_{mixture}$ is a non-linear mixing function, whereas $S_m$, $m = 1, \cdots, M$ are the mixture weights (matrices). Although the experiments conducted in this work mainly include FNN, RNN and LSTM, the set of possible model selections for $\mathcal{M}_m$ is not restricted to these but includes all NN-based models that take word embeddings as input.

The proposed mixture model uses a single word embedding matrix and a single output layer with predefined and fixed sizes. The latter are independent of the sizes of the mixture models. In doing so, this model does not suffer from a significant parameter growth when increasing the number of models in the mixture. We can also see that this architecture does not impose any direct constraints on the number of models to combine, their size or their type. Hence, we can combine, for instance, models of the same type but with different sizes/configurations, as we can combine heterogeneous models, such as recurrent and non-recurrent models, in a single mixture. Moreover, the mixture models can also be deep architectures with multiple hidden layers.

Fig. 1: Neural Mixture Model (NMM) architecture. Red (back) arrows show the error propagation during training.
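As a structural illustration of steps (12)-(14), here is a small numpy sketch of one NMM forward step. The stand-in feature functions, the sizes, and the ReLU choice for $f_{mixture}$ are assumptions for illustration; in the full model the feature functions would be the FNN/RNN/LSTM computations of Section 2, all reading the shared embedding $U$.

```python
import numpy as np

rng = np.random.default_rng(1)
Hm, Hmix, V = 400, 400, 10_000     # per-model feature size, mixture size, vocab
M = 3                               # number of models in the mixture

# Stand-ins for the per-model feature functions M_m of Eq. (12); in the real
# model these would be the FNN / RNN / LSTM features built on the shared U.
model_params = [rng.normal(0, 0.01, (Hm, Hm)) for _ in range(M)]
def model_features(m, x):
    return np.tanh(x @ model_params[m])

S = [rng.normal(0, 0.01, (Hm, Hmix)) for _ in range(M)]  # mixture weights S_m
W = rng.normal(0, 0.01, (Hmix, V))                        # single output layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nmm_forward(x):
    feats = [model_features(m, x) for m in range(M)]               # Eq. (12)
    h_mix = np.maximum(0, sum(h @ Sm for h, Sm in zip(feats, S)))  # Eq. (13)
    return softmax(h_mix @ W)                                      # Eq. (14)

p_next = nmm_forward(rng.normal(size=Hm))  # toy input standing in for U-based features
```

Note how the parameter count is dominated by the shared $U$ and $W$: adding a model to the mixture only adds its internal weights and one $S_m$ matrix, which is the source of the modest parameter growth reported in Section 4.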
NMM training follows the standard back-propagation algorithm used to train neural architectures. More particularly, the error at the output layer is propagated to all models in the mixture. At this stage, each model receives a network error, updates its parameters, and propagates its error to the shared word embedding (input) layer. We should also mention here that recurrent models can be "unfolded" in time, independently of the other models in the mixture, as it is done for standard networks. Once each model is updated, the continuous word representations are then updated as well while taking into account the individual network errors emerging from the different models in the mixture (see illustration in Fig. 1).

The joint training of the mixture models is expected to lead to a "complementarity" effect. We mean by "complementarity" that the mixture models may perform poorly when evaluated separately but lead to a much better performance when tested jointly. This is typically a result of the models learning and modeling, eventually, different features. Moreover, the joint learning is also expected to lead to richer and more expressive word embeddings.
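The key property of this joint training, namely that the shared embedding accumulates the error signals of all models, can be seen in a toy PyTorch check; the two linear branches below are hypothetical stand-ins for mixture components, not the paper's architecture.

```python
import torch

emb = torch.nn.Embedding(10, 4)          # toy shared embedding matrix U
branch_a = torch.nn.Linear(4, 3)         # stand-in for one mixture component
branch_b = torch.nn.Linear(4, 3)         # stand-in for another component

e = emb(torch.tensor([2]))               # both branches read the same embedding
loss = branch_a(e).sum() + branch_b(e).sum()   # schematic joint output error
loss.backward()

# Row 2 of the embedding gradient holds the *sum* of the errors propagated
# back by both branches, matching the update rule described above.
print(emb.weight.grad[2])
```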
In order to 1) enforce model co-training and 2) avoid network over-fitting when the number of models in the mixture is large, we use a model dropout technique, which is inspired by the standard dropout regularization [16] that is widely used to train neural networks. The idea here is to have "models" replace "neurons" in the standard dropout. Therefore, for each training example, a model is dropped with a probability $p_d$. Then, only models that are selected contribute to the mixture and have their parameters and mixing weights $S_m$ updated. Similarly to standard dropout, model dropout is applied only to non-recurrent models in the mixture.
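A minimal sketch of this model-dropout rule follows, assuming a dropout probability $p_d = 0.2$ purely for illustration; the rule is applied per training example and skips recurrent models, as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
p_d = 0.2   # model dropout probability (illustrative value)

def sample_active_models(is_recurrent):
    """Keep every recurrent model; drop each non-recurrent one with prob. p_d."""
    return [rec or rng.random() >= p_d for rec in is_recurrent]

# Example: a mixture of two FNNs (non-recurrent) and one LSTM (recurrent).
active = sample_active_models([False, False, True])
# For this training example, the mixture layer sums H_m @ S_m only over the
# models with active[m] == True, and only those models and their S_m are updated.
```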
4. EXPERIMENTS AND RESULTS

4.1. Experimental Setup
We evaluated the proposed architecture on two different benchmark tasks. The first set of experiments was conducted on the Penn Treebank (PTB) corpus using the standard division, e.g., [9, 17]: sections 0-20 are used for training while sections 21-22 and 23-24 are used for validation and testing, respectively. The vocabulary was limited to the 10k most frequent words, while the remaining words were all mapped to the token <unk>. In order to evaluate how the proposed approach scales to large corpora, we ran a set of experiments on the Large Text Compression Benchmark (LTCB) [18]. This corpus is based on the enwik9 dataset, which contains the first 10^9 bytes of enwiki-20060303-pages-articles.xml. We adopted the same training-test-validation data split and pre-processing as [17]. The vocabulary was limited to the 80k most frequent words. Details about the sizes of these two corpora and the percentage of Out-Of-Vocabulary (OOV) words that were mapped to <unk> can be found in Table 1.

Table 1: Corpus size in number of words and <unk> rate.

              Train             Dev              Test
Corpus    words   <unk>     words   <unk>    words   <unk>
PTB       930K    6.52%     82K     6.47%    74K     7.45%
LTCB      133M    1.43%     7.8M    2.15%    7.9M    2.30%

The results reported below compare the proposed Neural Mixture Model (NMM) approach to the baseline NNLMs. In particular, we compare our model to the FNN-based LM [5], the full RNN [9] (without classes) as well as RNN with maximum entropy (RNNME) [13]. We also report results for the LSTM architecture [10], and the recently proposed SRNN model [15].

Although the proposed approach was not designed for a particular mixture of models, we only report results for different combinations of FNN, RNN and LSTM, which are considered to be the baseline NNLMs. For clarity, an NMM result is presented as $F_{S_1,\cdots,S_f}^{N_1,\cdots,N_f} + R_{S_1,\cdots,S_r} + L_{S_1,\cdots,S_l}$, where $f$ is the number of FNNs in the mixture, $S_m, m = 1, \cdots, f$ are their corresponding hidden layer sizes (that are fed to the mixture) and $N_m, m = 1, \cdots, f$ are their fixed history sizes. The same notation holds for RNN and LSTM, where $r$ and $l$ are the numbers of RNNs and LSTMs in the mixture, respectively, and $N_r = N_l = 1$. The number of models in the mixture is given by $f + r + l$. Moreover, the notation $F_{S_f}^{N_b - N_e}$ means that this model combines $N_e - N_b + 1$ consecutive FNN models with respective history sizes $N_b, N_b + 1, \cdots, N_e$, with all models having the same hidden layer size $S_f$. For example, $F_{400}^{2-5}$ denotes a mixture of four FNNs with history sizes 2, 3, 4 and 5, each with a hidden layer of size 400.

For the PTB experiments, all models have a hidden layer size of 400, with FNN and SRNN using the Rectified Linear Unit (ReLU), i.e., $f(x) = \max(0, x)$, as activation function and having 2 hidden layers. ReLU is also used as activation function for the mixture layer in NMMs, which use a single hidden layer. The embedding size is 100 for SRNN and NMMs, whereas it is set to 400 for RNN and 200 for FNN and LSTM. The training is performed using the stochastic gradient descent algorithm with a mini-batch size of 200. The learning rate is initialized to 0.4 and the momentum is set to 0.9; the weight decay and the model dropout probability are kept fixed throughout training, which is done in epochs. The weight initialization follows the normalized initialization proposed in [19]. Similarly to [8], the learning rate is halved when no significant improvement in the log-likelihood of the validation data is observed. The BPTT was set to 5 time steps for all recurrent models.
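The learning-rate schedule quoted from [8] can be sketched as follows; the improvement threshold is an assumed placeholder, since the paper only states "no significant improvement".

```python
def next_learning_rate(lr, prev_valid_ll, valid_ll, min_gain=0.1):
    """Halve the learning rate when the validation log-likelihood gain
    falls below min_gain (threshold value assumed for illustration)."""
    return lr / 2.0 if (valid_ll - prev_valid_ll) < min_gain else lr

lr = 0.4                                   # initial value from the setup above
lr = next_learning_rate(lr, prev_valid_ll=-5.02, valid_ll=-5.01)  # -> 0.2
```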
In the tables below, the results are reported in terms of perplexity (PPL), Number of model Parameters (NoP) and the Parameter Growth (PG) for NMM, which is defined as the relative increase in the number of parameters of NMM w.r.t. the baseline model in the table. In order to demonstrate the power of the joint training, we also report the perplexity PPL and NoP of the Linearly Interpolated (LI) models in the mixture after training them separately. In this case, each model learns its own word embedding and output layer.

Table 2: LMs performance on the PTB test set.

model        PPL   NoP      PG        PPL(LI)   NoP(LI)
FNN (N=5)    114   6.49M    —         —         —
F            117   5.27M    -18.80%   120.0     6.10M
F            110   5.61M    -13.56%   112.0     12.28M
LSTM         105   6.97M    —         —         —
L + F
L + R
R + F        109   5.18M    -36.60%   119       5.05M
R + F        105   5.86M    -28.27%   108       17.41M
RNNME        117   10G      —         —         —
WD-SRNN      104   6.33M    —         —         —
WI-SRNN      104   5.33M    —         —         —

The PTB results reported in Table 2 show clearly that combining different small-size models with a reduced word embedding size results in a better perplexity performance compared to the baseline models, with a significant decrease in the NoP required by the mixture. More particularly, we can see that adding a single FNN model to a small-size LSTM or RNN is sufficient to outperform the baseline models while markedly reducing the number of parameters. The same conclusion can be drawn when combining an RNN with an LSTM. We can also see that adding more FNN models to each of these mixtures leads to additional improvements while keeping the number of parameters significantly small. Table 2 also shows that training the small-size models (in the mixture) separately, and then linearly interpolating them, results in a slightly worse performance compared to the mixture model, with a noticeable increase in the NoP. This conclusion emphasizes the importance of the joint training. Moreover, we can also see that mixing RNN and FNNs leads to a comparable performance to SRNN, which was particularly designed to enhance RNN with FNN information. The proposed approach, however, does not particularly encode the individual characteristics of the models in the mixture, which reflects its ability to include different types of NNLMs. We can also conclude that combining FNN with recurrent models leads to a more significant improvement when compared to mixtures of FNNs. This conclusion shows, similarly to other work, e.g., [15, 13], that recurrent models can be further improved using N-gram/feedforward information, given that they model different linguistic features.

Fig. 2 is an extension of Table 2, which shows the change in the perplexity and NoP of different NMMs when iteratively adding more FNN models to the mixture. This figure confirms that combining heterogeneous models (combining LSTM or RNN with FNNs) achieves a better performance compared to combining only FNN models. We can also conclude from this figure that the improvement becomes very slow after adding 4 FNN models to each mixture.
Fig. 2: Perplexity vs. parameter growth of different mixture models while iteratively adding more FNN models to the mixture (x-axis: history size N-1; left panel: PPL, right panel: NoP in millions; curves: standard FNN, F, L + F, R + F).

The LTCB experiments use the same PTB setup with minor changes. The results shown in Table 3 follow the same experimental setup used in [15]. More precisely, these results were obtained without momentum, model dropout or weight decay, whereas the mini-batch size was set to 400. The FNN architecture contains 2 hidden layers of size 600, whereas RNN, LSTM, SRNN and NMM have a single hidden layer of size 600.
Table 3: LMs perplexity on the LTCB test set.

model                    PPL   NoP      PG
FNN[4*200]-600-600-80k   110   64.92M   —
F                        102   66.24M   2.03%
F                        92    64.98M   0.09%
RNN[600]-600-80k         85    96.44M   —
R + F                    84    64.80M   -32.81%
R + F                    77    66.40M   -31.15%
LSTM[600]-600-80k        66    66.00M   —
L + R                    64    65.44M   -1.51%
L + F                    64    65.28M   -1.75%

The LTCB results in Table 3 confirm the PTB conclusions; in particular, the R + F mixtures achieve a lower perplexity with roughly 31% fewer parameters compared to the original RNN model. The FNN mixture results show a more significant improvement when combining multiple small-size (100) models compared to mixing few large models (600). This conclusion shows that the strength of mixture models lies in their ability to combine the learning capabilities of different models, even with small sizes.
5. CONCLUSION AND FUTURE WORK
We have presented a neural mixture model which is able to combine heterogeneous NN-based LMs in a single architecture. Experiments on the PTB and LTCB corpora have shown that this architecture substantially outperforms many state-of-the-art neural systems, due to its ability to combine the learning capabilities of different architectures. Further gains could be made using a more advanced model selection or feature combination at the mixing layer instead of the simple model weighting. These will be investigated in future work.

6. REFERENCES
[1] Slava M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 3, pp. 400-401, Mar. 1987.
[2] Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin, "A statistical approach to machine translation," Computational Linguistics, vol. 16, no. 2, pp. 79-85, Jun. 1990.
[3] Ronald Rosenfeld, "Two decades of statistical language modeling: where do we go from here?," Proceedings of the IEEE, vol. 88, no. 8, pp. 1270-1278, Aug. 2000.
[4] Reinhard Kneser and Hermann Ney, "Improved backing-off for m-gram language modeling," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Detroit, Michigan, USA, May 1995, pp. 181-184.
[5] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137-1155, Mar. 2003.
[6] Holger Schwenk and Jean-Luc Gauvain, "Training neural network language models on very large corpora," in Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct. 2005, pp. 201-208.
[7] Joshua Goodman, "A bit of progress in language modeling, extended version," Tech. Rep. MSR-TR-2001-72, Microsoft Research, 2001.
[8] Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, Makuhari, Chiba, Japan, Sep. 2010, pp. 1045-1048.
[9] Tomas Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, "Extensions of recurrent neural network language model," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, Czech Republic, May 2011, pp. 5528-5531.
[10] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, "LSTM neural networks for language modeling," in INTERSPEECH, Portland, Oregon, USA, Sep. 2012, pp. 194-197.
[11] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush, "Character-aware neural language models," in AAAI Conference on Artificial Intelligence, 2016.
[12] Sarath Chandar A P, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha, "An autoencoder approach to learning bilingual word representations," in Advances in Neural Information Processing Systems 27, pp. 1853-1861, 2014.
[13] Tomas Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký, "Strategies for training large scale neural network language models," in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), Waikoloa, Hawaii, USA, Dec. 11-15, 2011, pp. 196-201.
[14] Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukáš Burget, and Jan Černocký, "Empirical evaluation and combination of advanced language modeling techniques," in INTERSPEECH, Florence, Italy, Aug. 27-31, 2011, pp. 605-608.
[15] Youssef Oualil, Clayton Greenberg, Mittul Singh, and Dietrich Klakow, "Sequential recurrent neural network for language modeling," in INTERSPEECH, San Francisco, California, USA, Sep. 2016.
[16] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[17] Shiliang Zhang, Hui Jiang, Mingbin Xu, Junfeng Hou, and Li-Rong Dai, "The fixed-size ordinally-forgetting encoding method for neural network language models," in Annual Meeting of the Association for Computational Linguistics (ACL), July 2015, vol. 2, pp. 495-500.
[18] Matt Mahoney, "Large text compression benchmark," 2011.
[19] Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.