Dialog Context Language Modeling with Recurrent Neural Networks
Bing Liu, Ian Lane

Electrical and Computer Engineering, Carnegie Mellon University
Language Technologies Institute, Carnegie Mellon University

[email protected], [email protected]
ABSTRACT
In this work, we propose contextual language models that incorporate dialog level discourse information into language modeling. Previous works on contextual language modeling treat preceding utterances as a sequence of inputs, without considering dialog interactions. We design recurrent neural network (RNN) based contextual language models that specially track the interactions between speakers in a dialog. Experiment results on the Switchboard Dialog Act Corpus show that the proposed model outperforms the conventional single-turn based RNN language model by 3.3% on perplexity. The proposed models also demonstrate advantageous performance over other competitive contextual language models.
Index Terms — RNNLM, contextual language model, dialog modeling, dialog act
1. INTRODUCTION
Language models play an important role in many natural language processing systems, such as automatic speech recognition [1, 2] and machine translation systems [3, 4]. Recurrent neural network (RNN) based models [5, 6] have recently shown success in language modeling, outperforming conventional n-gram based models. Long short-term memory [7, 8] is a widely used RNN variant for language modeling due to its superior performance in capturing longer term dependencies.

A conventional RNN based language model uses a hidden state to represent the summary of the preceding words in a sentence without considering context signals. Mikolov et al. proposed a context dependent RNN language model [9] by connecting a contextual vector to the RNN hidden state. This contextual vector is produced by applying Latent Dirichlet Allocation [10] on preceding text. Several other contextual language models were later proposed, using bag-of-words [11] and RNN [12] methods to learn larger context representations beyond the target sentence.

The previously proposed contextual language models treat preceding sentences as a sequence of inputs, and they are suitable for document level context modeling. In dialog modeling, however, dialog interactions between speakers play an important role. Modeling utterances in a dialog as a sequence of inputs might not well capture the pauses, turn-taking, and grounding phenomena [13] in a dialog. In this work, we propose contextual RNN language models that specially track the interactions between speakers. We expect such models to generate better representations of the dialog context.

The remainder of the paper is organized as follows. In Section 2, we introduce the background on contextual language modeling. In Section 3, we describe the proposed dialog context language models. Section 4 discusses the evaluation procedures and results. Section 5 concludes the work.
2. BACKGROUND

2.1. RNN Language Model
A language model assigns a probability to a sequence of words $\mathbf{w} = (w_1, w_2, \ldots, w_T)$ following a probability distribution. Using the chain rule, the likelihood of the word sequence $\mathbf{w}$ can be factorized as:

$$P(\mathbf{w}) = P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}) \quad (1)$$

3. METHODS

The previously proposed contextual language models focus on applying context by encoding preceding text, without considering interactions in dialogs. These models may not be well suited for dialog language modeling, as they are not designed to capture dialog interactions, such as clarifications and confirmations. By making special design choices for learning dialog interactions, we expect the models to generate better representations of the dialog context, and thus lower perplexity of the target dialog turn or utterance.

In this section, we first explain the context dependent RNN language model that operates on the utterance or turn level. Following that, we describe the two proposed contextual language models that utilize the dialog level context.

Let $D = (U_1, U_2, \ldots, U_K)$ be a dialog that has $K$ turns and involves two speakers. Each turn may have one or more utterances. The $k$th turn $U_k = (w_1, w_2, \ldots, w_{T_k})$ is represented as a sequence of $T_k$ words. Conditioning on information of the preceding text in the dialog, the probability of the target turn $U_k$ can be calculated as:

$$P(U_k \mid U_1, \ldots, U_{k-1}) = \prod_{t=1}^{T_k} P(w_t \mid w_1, \ldots, w_{t-1}, U_1, \ldots, U_{k-1})$$

In neural network based language models, the dialog context can be represented as a dense continuous vector. This context vector can be produced in a number of ways.

One simple approach is to use bag-of-word embeddings. However, a bag-of-word context representation does not take word order into consideration. An alternative approach is to use an RNN to read the preceding text. The last hidden state of the RNN encoder can be seen as the representation of the text and be used as the context vector for the next turn. To generate a document level context representation, one may cascade all sentences in a document by removing the sentence boundaries. The last RNN hidden state of the previous utterance serves as the initial RNN state of the next utterance. As in [12], we refer to this model as DRNNLM. Alternatively, in the CCDCLM model proposed in [12], the last RNN hidden state of the previous utterance is fed to the RNN hidden state of the target utterance at each time step.

The previously proposed contextual language models, such as DRNNLM and CCDCLM, treat dialog history as a sequence of inputs, without modeling dialog interactions. A dialog turn from one speaker may not only be a direct response to the other speaker's query, but is also likely to be a continuation of his own previous statement. Thus, when modeling turn $k$ in a dialog, we propose to connect the last RNN state of turn $k-2$ directly to the starting RNN state of turn $k$, instead of letting it propagate through the RNN for turn $k-1$. The last RNN state of turn $k-1$ serves as the context vector to turn $k$, which is fed to turn $k$'s RNN hidden state at each time step together with the word input. The model architecture is shown in Figure 2. The context vector $c$ and the initial RNN hidden state for the $k$th turn $h^{U_k}_0$ are defined as:

$$c = h^{U_{k-1}}_{T_{k-1}}, \quad h^{U_k}_0 = h^{U_{k-2}}_{T_{k-2}} \quad (6)$$

where $h^{U_{k-1}}_{T_{k-1}}$ and $h^{U_{k-2}}_{T_{k-2}}$ represent the last RNN hidden states of turns $k-1$ and $k-2$ respectively. This model also allows the context signal from previous turns to propagate through the network in fewer steps, which helps to reduce information loss along the propagation. We refer to this model as Interactive Dialog Context Language Model (IDCLM).
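As a rough illustration of how the wiring in Eq. (6) could be realized, the sketch below feeds the previous turn's last hidden state as a per-step context vector and initializes the turn RNN from the same speaker's preceding turn. It is a minimal sketch assuming PyTorch and an LSTM turn RNN; the class name, dimensions, and the choice to carry both hidden and cell states across turns are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class IDCLMTurn(nn.Module):
    """One turn of an IDCLM-style language model (illustrative sketch)."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The context vector c (last hidden state of turn k-1) is concatenated
        # with the word embedding at every time step of turn k.
        self.lstm = nn.LSTM(emb_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, words_k, state_km1, state_km2):
        """words_k:   (batch, T_k) word ids of turn k
        state_km1: (h, c) last LSTM state of turn k-1 -> context vector (Eq. 6)
        state_km2: (h, c) last LSTM state of turn k-2 -> initial state (Eq. 6);
                   zero states could be used for the first turns (an assumption)."""
        emb = self.embed(words_k)                      # (batch, T_k, emb_dim)
        ctx = state_km1[0][-1]                         # (batch, hidden_dim)
        ctx = ctx.unsqueeze(1).expand(-1, emb.size(1), -1)
        inp = torch.cat([emb, ctx], dim=-1)            # feed c at each time step
        # Initialize turn k's RNN from the last state of turn k-2, i.e. the same
        # speaker's previous turn; carrying the cell state too is an assumption.
        output, last_state = self.lstm(inp, state_km2)
        logits = self.out(output)                      # next-word scores per step
        return logits, last_state
```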
Fig. 2. Interactive Dialog Context Language Model (IDCLM).

The propagation of dialog context can be seen as a series of updates of a hidden dialog context state along the growing dialog. IDCLM models these hidden dialog context state changes implicitly in the turn level RNN state. Such dialog context state updates can also be modeled in a separate RNN. As shown in the architecture in Figure 3, we use an external RNN to model the context changes explicitly. Input to the external state RNN is the vector representation of the previous dialog turns. The external state RNN output serves as the dialog context for the next turn:

$$s_{k-1} = \mathrm{RNN}_{ES}\left(s_{k-2}, h^{U_{k-1}}_{T_{k-1}}\right) \quad (7)$$

where $s_{k-1}$ is the output of the external state RNN after the processing of turn $k-1$. The context vector $c$ and the initial RNN hidden state for the $k$th turn $h^{U_k}_0$ are then defined as:

$$c = s_{k-1}, \quad h^{U_k}_0 = h^{U_{k-2}}_{T_{k-2}} \quad (8)$$

We refer to this model as External State Interactive Dialog Context Language Model (ESIDCLM).

Fig. 3. External State Interactive Dialog Context Language Model (ESIDCLM).

Comparing to IDCLM, ESIDCLM releases the burden of the turn level RNN by using an external RNN to model dialog context state changes. One drawback of ESIDCLM is that there are additional RNN model parameters to be learned during model training, which may make the model more prone to overfitting when training data size is limited.

4. EXPERIMENTS

4.1. Data Set

We use the Switchboard Dialog Act Corpus (SwDA, available at http://compprag.christopherpotts.net/swda.html) in evaluating our contextual language models. The SwDA corpus extends the Switchboard-1 Telephone Speech Corpus with turn and utterance-level dialog act tags. The utterances are also tagged with part-of-speech (POS) tags. We split the data in folders sw00 to sw09 as the training set, folder sw10 as the test set, and folders sw11 to sw13 as the validation set. The training, validation, and test sets contain 98.7K turns (190.0K utterances), 5.7K turns (11.3K utterances), and 11.9K turns (22.2K utterances) respectively. Maximum turn length is set to 160. The vocabulary is defined as the 10K most frequent words.

We compare IDCLM and ESIDCLM to several baseline methods, including an n-gram based model, a single turn RNNLM, and various context dependent RNNLMs.

n-gram: A 5-gram language model with modified Kneser-Ney smoothing [16].

Single-Turn-RNNLM: Conventional RNNLM that operates on the single turn level with no context information.

BoW-Context-RNNLM: Contextual RNNLM with a BoW representation of the preceding text as context.

DRNNLM: Contextual RNNLM with the turn level context vector connected to the initial RNN state of the target turn.

CCDCLM: Contextual RNNLM with the turn level context vector connected to the RNN hidden state of the target turn at each time step. We implement this model following the design in [12].

In order to investigate the potential performance gain that can be achieved by introducing context, we also compare the proposed methods to RNNLMs that use true dialog act tags as context. Although human labeled dialog acts might not be the best option for modeling the dialog context state, they provide a reasonable estimation of the best gain that can be achieved by introducing linguistic context. The dialog act sequence is modeled by a separate RNN, similar to the external state RNN used in ESIDCLM. We refer to this model as Dialog Act Context Language Model (DACLM).

DACLM: RNNLM with the true dialog act context vector connected to the RNN state of the target turn at each time step.
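To make the shared mechanism concrete, the sketch below shows one possible form of the external state recurrence in Eq. (7), which DACLM mirrors by running a separate RNN over the true dialog act tags. It assumes PyTorch; the GRU cell and all names are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class ExternalStateRNN(nn.Module):
    """External RNN over turn summaries, in the spirit of Eq. (7)."""
    def __init__(self, input_dim, state_dim):
        super().__init__()
        # A GRU cell is used here for brevity; the paper's basic RNN unit is an
        # LSTM, and the cell type of the external RNN is an assumption.
        self.cell = nn.GRUCell(input_dim, state_dim)

    def forward(self, s_prev, turn_summary):
        """s_prev:       s_{k-2}, dialog state after turn k-2       (batch, state_dim)
        turn_summary: h^{U_{k-1}}_{T_{k-1}} for ESIDCLM, or a dialog act
                      embedding of turn k-1 for DACLM               (batch, input_dim)"""
        s_new = self.cell(turn_summary, s_prev)  # Eq. (7): s_{k-1}
        return s_new                             # serves as context vector c (Eq. 8)
```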
In this work, we use the LSTM cell [7] as the basic RNN unit for its stronger capability in capturing long-range dependencies in a word sequence compared to the simple RNN. We use pre-trained word vectors [17] that are trained on the Google News dataset to initialize the word embeddings. These word embeddings are fine-tuned during model training. We conduct mini-batch training using the Adam optimization method, following the suggested parameter setup in [18]. The maximum norm is set to 5 for gradient clipping. For regularization, we apply dropout on the non-recurrent connections [19] of the LSTM. In addition, we apply L2 regularization on the weights and biases of the RNN output layer.

The experiment results on language modeling perplexity for models using different dialog turn sizes are shown in Table 1. The K value indicates the number of turns in the dialog. Perplexity is calculated on the last turn, with the preceding turns used as context to the model.

Table 1. Language modeling perplexities on the SwDA corpus with various dialog context turn sizes (K).

Model    K=1    K=2    K=3    K=5
DACLM    -      58.2   57.9   58.0

As can be seen from the results, all RNN based models outperform the n-gram model by a large margin. The BoW-Context-RNNLM and DRNNLM beat the Single-Turn-RNNLM consistently. Our implementation of the context dependent CCDCLM performs worse than the Single-Turn-RNNLM. This might be due to the fact that the target turn word prediction depends too much on the previous turn context vector, which connects directly to the hidden state of the current turn RNN at each time step. The model performance on the training set might not generalize well during inference given the limited size of the training set.

The proposed IDCLM and ESIDCLM beat the single turn RNNLM consistently under different context turn sizes. ESIDCLM shows the best language modeling performance under dialog turn sizes of 3 and 5, outperforming IDCLM by a small margin. IDCLM beats all baseline models when using a dialog turn size of 5, and produces slightly worse perplexity than DRNNLM when using a dialog turn size of 3.

To analyze the best potential gain that may be achieved by introducing linguistic context, we compare the proposed contextual models to DACLM, the model that uses the true dialog act history for dialog context modeling. As shown in Table 1, the gap between our proposed models and DACLM is not wide. This gives a positive hint that the proposed contextual models may implicitly capture the dialog context state changes.

For fine-grained analyses of the model performance, we further compute the test set perplexity per POS tag and per dialog act tag. We selected the most frequent POS tags and dialog act tags in the SwDA corpus, and report the tag based perplexity relative changes (%) of the proposed models compared to the Single-Turn-RNNLM. A negative number indicates a performance gain.

Table 2. Perplexity relative change (%) per POS tag.

POS Tag    IDCLM    ESIDCLM    DACLM
PRP        -16.8    -5.8       -10.1
IN         -2.0     -5.5       -1.8
RB         -4.1     -8.9       -4.3
NN         13.4     8.1        2.3
UH         -0.4     7.7        -9.7

Table 2 shows the model perplexity per POS tag. All three context dependent models produce consistent performance gains over the Single-Turn-RNNLM for pronouns, prepositions, and adverbs, with pronouns having the largest perplexity improvement. However, the proposed contextual models are less effective in capturing nouns. This suggests that the proposed contextual RNN language models exploit the context to achieve superior prediction on certain but not all POS types. Further exploration of the model design is required if we want to better capture words of a specific type.
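The tag-based numbers above (and the dialog act breakdown in Table 3 below) can be reproduced from per-token losses roughly as sketched here. The exact grouping procedure is not spelled out in the paper, so this is an assumption, and the function names are illustrative.

```python
import math
from collections import defaultdict

def perplexity_by_tag(token_losses, token_tags):
    """token_losses: per-token negative log-likelihoods (natural log)
    token_tags: POS (or dialog act) tag assigned to each test token"""
    sums, counts = defaultdict(float), defaultdict(int)
    for loss, tag in zip(token_losses, token_tags):
        sums[tag] += loss
        counts[tag] += 1
    # Perplexity per tag: exp of the mean negative log-likelihood over that tag
    return {tag: math.exp(sums[tag] / counts[tag]) for tag in sums}

def relative_change(model_ppl, baseline_ppl):
    """Relative perplexity change (%) vs. the Single-Turn-RNNLM baseline;
    a negative number indicates a performance gain."""
    return {tag: 100.0 * (model_ppl[tag] - baseline_ppl[tag]) / baseline_ppl[tag]
            for tag in model_ppl}
```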
Table 3. Perplexity relative change (%) per dialog act tag.

DA Tag                   IDCLM    ESIDCLM    DACLM
Statement-non-opinion    -1.8     -0.5       -1.6
Acknowledge              -2.6     11.4       -16.3
Statement-opinion        4.9      -0.9       -1.0
Agree/Accept             14.7     2.7        -15.1
Appreciation             0.7      -3.8       -6.5

For the dialog act tag based results in Table 3, the three contextual models show a consistent performance gain on Statement-non-opinion type utterances. The perplexity changes for other dialog act tags vary for different models.

5. CONCLUSIONS

In this work, we propose two dialog context language models that are specially designed to model dialog interactions. Our evaluation results on the Switchboard Dialog Act Corpus show that the proposed models outperform the conventional RNN language model by 3.3% on perplexity. The proposed models also demonstrate advantageous performance over several competitive contextual language models. Perplexity of the proposed dialog context language models is higher than that of the model using true dialog act tags as context by only a small margin. This indicates that the proposed models may implicitly capture the dialog context state for language modeling.

6. REFERENCES

[1] Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition," 1993.

[2] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in ICASSP. IEEE, 2016, pp. 4960–4964.

[3] Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin, "A statistical approach to machine translation," Computational Linguistics, vol. 16, no. 2, pp. 79–85, 1990.

[4] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.

[5] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in Interspeech, 2010, vol. 2, p. 3.

[6] Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, "Extensions of recurrent neural network language model," in ICASSP. IEEE, 2011, pp. 5528–5531.

[7] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[8] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, "LSTM neural networks for language modeling," in Interspeech, 2012, pp. 194–197.

[9] Tomas Mikolov and Geoffrey Zweig, "Context dependent recurrent neural network language model," in SLT, 2012, pp. 234–239.

[10] David M. Blei, Andrew Y. Ng, and Michael I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.

[11] Tian Wang and Kyunghyun Cho, "Larger-context language modelling," arXiv preprint arXiv:1511.03729, 2015.

[12] Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein, "Document context language models," arXiv preprint arXiv:1511.03962, 2015.

[13] Herbert H. Clark and Susan E. Brennan, "Grounding in communication," Perspectives on Socially Shared Cognition, vol. 13, no. 1991, pp. 127–149, 1991.
[14] Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li, "Hierarchical recurrent neural network for document modeling," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 899–907.

[15] Quan Hung Tran, Ingrid Zukerman, and Gholamreza Haffari, "Inter-document contextual language model," in Proceedings of NAACL-HLT, 2016, pp. 762–766.

[16] Stanley F. Chen and Joshua Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1996, pp. 310–318.

[17] T. Mikolov and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, 2013.

[18] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[19] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.