An End-to-End Trainable Neural Network Model with Belief Tracking for Task-Oriented Dialog
Bing Liu, Ian Lane
Electrical and Computer Engineering, Carnegie Mellon University
Language Technologies Institute, Carnegie Mellon University
[email protected], [email protected]
Abstract
We present a novel end-to-end trainable neural network model for task-oriented dialog systems. The model is able to track dialog state, issue API calls to a knowledge base (KB), and incorporate structured KB query results into system responses to successfully complete task-oriented dialogs. The proposed model produces well-structured system responses by jointly learning belief tracking and KB result processing conditioned on the dialog history. We evaluate the model in a restaurant search domain using a dataset converted from the second Dialog State Tracking Challenge (DSTC2) corpus. Experiment results show that the proposed model can robustly track dialog state given the dialog history. Moreover, our model demonstrates promising results in producing appropriate system responses, outperforming prior end-to-end trainable neural network models on the per-response accuracy evaluation metric.
Index Terms: spoken dialog systems, end-to-end model, task-oriented, dialog state tracking, language understanding
1. Introduction
A task-oriented spoken dialog system is a prominent component in today's virtual personal assistants, enabling people to perform everyday tasks by interacting with devices via voice input. Traditional task-oriented dialog systems have complex pipelines, with a number of independently developed and modularly connected components. There are usually separate modules in a pipeline for natural language understanding (NLU), dialog state tracking (DST), dialog management (DM), and natural language generation (NLG) [1, 2, 3, 4]. One limitation of such a pipeline approach is that it is inherently hard to adapt a system to new domains, as all these modules are trained and fine-tuned independently. Moreover, errors made in upstream modules may propagate to downstream components, making it tedious to identify and track the source of an error [5].

To address these limitations, efforts have been made recently in designing end-to-end frameworks for task-oriented dialogs. Wen et al. [6] proposed an end-to-end trainable neural network model with modularly connected neural networks for each system component. Zhao and Eskenazi [5] introduced an end-to-end reinforcement learning framework that jointly performs dialog state tracking and policy learning. Li et al. [7] proposed an end-to-end learning framework that leverages both supervised and reinforcement learning signals and showed promising dialog modeling performance. Such end-to-end trainable neural network models can be optimized directly towards the final system objective functions (e.g. task success rate) and thus ameliorate the challenges of credit assignment and online adaptation [5].

In this work, we present an end-to-end trainable neural network model for task-oriented dialog that applies a unified network for belief tracking, knowledge base (KB) operation, and response creation.
The model is able to track dialog state, interface with a KB, and incorporate structured KB query results into system responses to successfully complete task-oriented dialogs. We show that our proposed model can robustly track belief state given the dialog history. Our model also demonstrates promising performance in providing appropriate system responses and conducting task-oriented dialogs compared to prior end-to-end trainable neural network models.
2. Related Work
In spoken dialog systems, dialog state tracking, or belief tracking, refers to the task of maintaining a distribution over possible dialog states, which directly determines the system's actions. A dialog state tracker is a core component in many state-of-the-art task-oriented spoken dialog systems [6, 8]. Conventional approaches for DST include rule-based systems and generative methods that model the dialog as a dynamic Bayesian network [9]. Discriminative approaches using sequence models such as CRFs [10] or RNNs [11, 12] address the limitations of generative models with the flexibility to explore arbitrary features [13] and achieve state-of-the-art DST performance.
Conventional task-oriented dialog systems typically require a large number of domain-specific rules and handcrafted features, which makes it hard to extend a well-performing model to new application domains. Recent approaches to task-oriented dialog cast the task as a partially observable Markov decision process (POMDP) [4] and use reinforcement learning for online policy optimization by interacting with users [14]. The dialog state and system action spaces have to be carefully designed in order to make the reinforcement policy learning tractable [4].

With the success of end-to-end trainable neural network models in non-task-oriented chit-chat settings [15, 16], efforts have been made in carrying over the good performance of end-to-end trainable models to task-oriented dialogs. Wen et al. [6] proposed a neural network based model that is end-to-end trainable yet still modularly connected. The model has separate modules for intent estimation, belief tracking, policy learning, and response generation. Our model, on the other hand, uses a unified network for belief tracking, KB operation, and response generation, to fully exploit knowledge that can be shared among the different tasks. Bordes and Weston [17] recently proposed modeling the dialog with a reasoning approach using end-to-end memory networks. Their model selects the best system response directly from a list of response candidates without explicitly tracking dialog state. Compared to this approach, our model tracks dialog state over the sequence of turns explicitly, as it is shown in [18] that robust dialog state tracking is likely to boost the success rate in task completion. Moreover, when generating the final system response, instead of letting the model select a final response directly from a large pool of candidate responses, we let the model select a skeletal sentence structure from a short list of candidates and then replace the delexicalised tokens with the state tracking outputs. This method helps to reduce the number of training samples required [11] and makes the model more robust to noise in the dialog state.

Figure 1: System architecture of the proposed end-to-end trainable neural network model for task-oriented dialog. (The figure shows the dialog-level LSTM states s_{k-1}, s_k, s_{k+1}, the per-slot value distributions, e.g. for the food and price slots, and a ranked list of KB query results, e.g. 1. la_margherita, 2. prezzo, ..., 5. caffe_uno.)
3. Proposed Method
We model task-oriented dialog as a multi-task sequence learning problem, with components for encoding user input, tracking belief state, issuing API calls, processing KB results, and generating system responses. The model architecture is shown in Figure 1. The sequence of turns in a dialog is encoded using LSTM [19] recurrent neural networks. Conditioned on the dialog history, the state of the conversation is maintained in the LSTM state. The LSTM state vector is used to generate: (1) a skeletal sentence structure, by selecting from a list of delexicalised system response candidates, (2) a probability distribution over values for each slot in the belief tracker, and (3) a pointer to an entity in the retrieved KB results that matches the user's query. The final system response is generated by replacing the delexicalised tokens with the predicted slot values and entity attribute values. Each model component is described in detail in the sections below.
3.1. Utterance Encoding

Utterance encoding here refers to encoding a sequence of words into a continuous dense vector. Popular methods include using bag-of-means on word embeddings and RNNs [20, 21]. We use a bidirectional LSTM to encode the user input into an utterance vector. Let U_k = (w_1, w_2, ..., w_{T_k}) be the user input at the k-th turn with T_k words. The user utterance vector is represented by U_k = [→h_{T_k}, ←h_1], where →h_{T_k} and ←h_1 are the final states of the forward and backward utterance-level LSTMs at the k-th turn.

3.2. Belief Tracking

Belief tracking, or dialog state tracking, maintains and adjusts the state of a conversation, such as the user's goals, by accumulating evidence along the sequence of a dialog. After collecting new evidence from a user's input at turn k, the neural dialog model updates the probability distribution P(S_k^m) over candidate values for each slot type m ∈ M. For example, in the restaurant search domain, the model maintains a multinomial probability distribution over each of the user's goals on restaurant area, food type, and price range. At turn k, the dialog-level LSTM (LSTM_D) updates its hidden state s_k and uses it to infer any updates on the user's goals after taking in the user input encoding U_k and the KB indicator I_k (to be described below):

    s_k = LSTM_D(s_{k-1}, [U_k, I_k])                    (1)
    P(S_k^m | U_{≤k}, I_{≤k}) = SlotDist_m(s_k)          (2)

where SlotDist_m is a multilayer perceptron (MLP) with a softmax activation function over the slot type m ∈ M.

3.3. Issuing API Calls

Conditioned on the state of the conversation, the model may issue an API call to query the KB based on the belief tracking outputs. A simple API call command template is first generated by the model.
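As a concrete illustration, the belief tracking update of Eqs. (1) and (2) can be sketched as below. This is a minimal NumPy sketch, not the trained model: the utterance encoder is the bag-of-means baseline rather than the Bi-LSTM, the dialog-level LSTM is replaced by a simple recurrent affine update, and all dimensions, weights, and slot inventories are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, UTT_DIM = 8, 6
SLOT_VALUES = {"food": ["italian", "chinese", "dontcare"],
               "pricerange": ["cheap", "expensive", "dontcare"]}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def encode_utterance(embeddings):
    # Toy stand-in for the Bi-LSTM encoder: mean of the word embeddings
    # (the bag-of-means baseline mentioned in the paper).
    return np.mean(embeddings, axis=0)

# One SlotDist_m head per slot type m; the paper uses an MLP with softmax,
# a single random affine layer stands in for it here.
W_s = rng.standard_normal((STATE_DIM, STATE_DIM + UTT_DIM + 1)) * 0.1
slot_heads = {m: rng.standard_normal((len(v), STATE_DIM)) * 0.1
              for m, v in SLOT_VALUES.items()}

def dialog_step(s_prev, U_k, I_k):
    """Eq. (1): update the dialog-level state from [U_k, I_k];
    Eq. (2): emit P(S_k^m) = SlotDist_m(s_k) for every slot type m."""
    x = np.concatenate([s_prev, U_k, [float(I_k)]])
    s_k = np.tanh(W_s @ x)                        # stand-in for LSTM_D
    slot_dists = {m: softmax(W @ s_k) for m, W in slot_heads.items()}
    return s_k, slot_dists

words = rng.standard_normal((4, UTT_DIM))         # 4 toy word embeddings
s_k, dists = dialog_step(np.zeros(STATE_DIM), encode_utterance(words), I_k=0)
for m, p in dists.items():
    assert abs(p.sum() - 1.0) < 1e-9              # a proper multinomial per slot
```

Each turn thus yields one probability distribution per slot type, conditioned on the running dialog state.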
The final API call command is produced by replacing the slot type tokens in the command template with the best hypothesis for each of the goal slots from the belief tracker. In the restaurant search domain, a simple API call command template can be "api_call <area> <food> <pricerange>", and the slot type tokens are replaced with the belief tracker outputs to form the final API call command, e.g. "api_call west italian dontcare".

3.4. KB Results Processing

Once the neural dialog model receives the KB query results, it suggests options to users by selecting entities from the returned list. Instead of treating KB results as unstructured text (more specifically, as user utterances as in [17, 22, 23]) and processing them with a machine reading comprehension approach, we treat KB results as a list of structured entities and let the model select appropriate entity pointers. Outputs from a KB search or database query typically have well-defined structures, with entity attributes associated with an entity index. Rather than letting the model learn such entity-attribute associations purely from the training dialog corpus as in [17, 22, 23], we keep this structural information in our system and let the model learn to select the proper entity pointer from a ranked list.

At turn k of a dialog, a binary KB indicator I_k is passed to the neural dialog model. This indicator is determined by the number of entities retrieved by the last API call and the current entity pointer. When the system is in a state to suggest an entity to the user, if a zero value of I_k is received, the model is likely to inform the user that no entity matches the current query. Otherwise, if I_k has a value of one, the model will likely pick an entity from the retrieved results based on the updated probability distribution of the entity pointer P(E_k):

    P(E_k | U_{≤k}, I_{≤k}) = EntityPointerDist(s_k)     (3)

where EntityPointerDist is an MLP with softmax activation.
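Taken together, API call construction and entity pointer selection over the ranked KB results can be sketched as follows. The template string follows the paper's example; the slot inventories, probabilities, and KB rows are made-up values for illustration only.

```python
import numpy as np

def build_api_call(template, slot_dists, slot_values):
    """Fill each slot-type token with the belief tracker's best hypothesis."""
    cmd = template
    for m, dist in slot_dists.items():
        best = slot_values[m][int(np.argmax(dist))]
        cmd = cmd.replace("<" + m + ">", best)
    return cmd

slot_values = {"area": ["west", "south"],
               "food": ["italian", "chinese"],
               "pricerange": ["cheap", "dontcare"]}
slot_dists = {"area": [0.8, 0.2], "food": [0.7, 0.3], "pricerange": [0.1, 0.9]}

cmd = build_api_call("api_call <area> <food> <pricerange>", slot_dists, slot_values)
assert cmd == "api_call west italian dontcare"    # the paper's example command

# KB results come back as a ranked list of structured entities; the binary
# indicator I_k only records whether anything matched the query.
kb_results = [{"R_name": "la_margherita"}, {"R_name": "prezzo"}]
I_k = 1 if kb_results else 0

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Eq. (3): EntityPointerDist(s_k) scores the ranked entities; the raw scores
# below are placeholders for the MLP output.
p_entity = softmax(np.array([0.2, 1.5]))
chosen = kb_results[int(np.argmax(p_entity))]
assert chosen["R_name"] == "prezzo"
```

Because the entities stay structured, the model only has to learn to point at the right row rather than to parse the KB output as text.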
3.5. System Response Generation

At the k-th turn of a dialog, a skeletal sentence structure R_k is selected from a list of delexicalised response candidates. The final system response is produced by replacing the delexicalised tokens with the predicted slot values and entity attribute values; for example, replacing <food> with italian and <R_name> with prezzo, as in Figure 1.

    P(R_k | U_{≤k}, I_{≤k}) = ResponseDist(s_k)          (4)

where ResponseDist is an MLP with softmax activation.
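The lexicalisation step reduces to a token substitution over the selected skeleton. The sketch below mirrors the Figure 1 example (italian, prezzo); the exact spelling of the delexicalised tokens is an assumption about the response vocabulary.

```python
def lexicalise(skeleton, predictions):
    """Replace delexicalised tokens in the selected skeletal response
    with the predicted slot values and entity attribute values."""
    out = skeleton
    for token, value in predictions.items():
        out = out.replace(token, value)
    return out

# Skeletal structure chosen by ResponseDist(s_k) in Eq. (4), then filled in
# with the belief tracker output and the selected entity's attribute.
skeleton = "<R_name> is a nice restaurant serving <food> food"
final = lexicalise(skeleton, {"<food>": "italian", "<R_name>": "prezzo"})
assert final == "prezzo is a nice restaurant serving italian food"
```

Selecting among a short list of skeletons and filling them in is what keeps the response classification space small compared to ranking full candidate responses.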
3.6. Model Training

We train the neural dialog model by finding the parameter set θ that minimizes the cross-entropy between the predicted and true distributions for the goal slot labels, entity pointer, and delexicalised system response jointly:

    min_θ Σ_{k=1}^{K} −[ Σ_{m=1}^{M} λ_{S^m} log P(S_k^{m*} | U_{≤k}, I_{≤k}; θ)
                         + λ_E log P(E_k^* | U_{≤k}, I_{≤k}; θ)
                         + λ_R log P(R_k^* | U_{≤k}, I_{≤k}; θ) ]        (5)

where the λs are the linear interpolation weights for the cost of each system output, and S_k^{m*}, E_k^*, and R_k^* are the ground truth labels for each task at the k-th turn.

3.7. Alternative Model Architectures

The model architecture (Figure 1) described above assumes that the hidden state of the dialog-level LSTM implicitly captures the complete state of the conversation, i.e. the user goal estimation and the previous system actions. Intuitively, the model is likely to provide a better response if it is informed about the goal slot value estimates explicitly and is aware of the responses it previously made to the user. We thus design and evaluate a few alternative model architectures to verify this assumption:

(1) Model with the previously emitted delexicalised system response connected back to the dialog-level LSTM state:

    s_k = LSTM_D(s_{k-1}, [U_k, I_k, R_{k-1}])                           (6)

(2) Model with the previously emitted slot labels connected back to the dialog-level LSTM state:

    s_k = LSTM_D(s_{k-1}, [U_k, I_k, S_{k-1}^1, ..., S_{k-1}^M])         (7)

(3) Model with both the previously emitted response and slot labels connected back to the dialog-level LSTM state:

    s_k = LSTM_D(s_{k-1}, [U_k, I_k, R_{k-1}, S_{k-1}^1, ..., S_{k-1}^M]) (8)
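For a single turn, the training objective of Eq. (5) reduces to a weighted sum of cross-entropy terms, one per output head. The sketch below computes that per-turn loss in NumPy; the probabilities and label indices are illustrative, and a single shared slot weight lam_s stands in for the per-slot weights λ_{S^m} of the paper.

```python
import numpy as np

def turn_loss(slot_probs, true_slots, entity_probs, true_entity,
              resp_probs, true_resp, lam_s=1.0, lam_e=1.0, lam_r=1.0):
    """Per-turn term of Eq. (5): weighted negative log-likelihood of the
    ground-truth slot labels, entity pointer, and delexicalised response."""
    loss = sum(-lam_s * np.log(p[true_slots[m]]) for m, p in slot_probs.items())
    loss += -lam_e * np.log(entity_probs[true_entity])
    loss += -lam_r * np.log(resp_probs[true_resp])
    return float(loss)

slot_probs = {"food": np.array([0.7, 0.2, 0.1])}      # P(S_k^food)
loss = turn_loss(slot_probs, {"food": 0},
                 np.array([0.6, 0.4]), 0,             # P(E_k), true pointer
                 np.array([0.1, 0.8, 0.1]), 1)        # P(R_k), true response
assert abs(loss - (-np.log(0.7) - np.log(0.6) - np.log(0.8))) < 1e-12
```

The full objective of Eq. (5) is this quantity summed over the turns k = 1, ..., K of each training dialog.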
4. Experiments
We use data from DSTC2 [24] for our model evaluation. The challenge is designed in the restaurant search domain. Bordes and Weston [17] transformed the original DSTC2 corpus by adding system commands and removing the dialog state annotations. This transformed corpus contains the additional API calls that the system would make to the KB and the corresponding KB query results. In this study, we combine the original DSTC2 corpus and this transformed version by keeping the dialog state annotations and adding the system commands (API calls). We can thus perform a more complete evaluation of our model's capability in tracking the dialog state, processing KB query results, and conducting complete dialogs. Statistics of this dataset are summarized in Table 1.

Table 1: Statistics of the converted DSTC2 dataset.

    Num of train & dev / test dialogs                        2118 / 1117
    Avg num of turns per dialog (incl. API call commands)    7.9
    Num of area / food / pricerange options                  5 / 91 / 3
    Num of delexicalised response candidates                 78
We perform mini-batch model training with a batch size of 32 using the Adam optimization method [25]. Regularization with dropout is applied to the non-recurrent connections [26] during model training, with a dropout rate of 0.5. We set the maximum norm for gradient clipping to 5 to prevent exploding gradients. The hidden layer sizes of the dialog-level LSTM and the utterance-level LSTM are set to 200 and 150, respectively. Word embeddings of size 300 are randomly initialized. We also explore using pre-trained word vectors [27] trained on the Google News dataset to initialize the word embeddings.
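The gradient clipping step described above can be sketched as follows. Whether clipping is applied per tensor or over the global norm is not stated, so the global-norm variant below is an assumption, using the reported maximum norm of 5.

```python
import numpy as np

MAX_GRAD_NORM = 5.0   # maximum norm reported in the paper

def clip_by_global_norm(grads, max_norm=MAX_GRAD_NORM):
    """Rescale all gradients jointly so that their global L2 norm
    does not exceed max_norm (gradients below the threshold pass through)."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return [g * scale for g in grads]

grads = [np.full(4, 10.0), np.full(2, -3.0)]   # toy gradients with a large norm
clipped = clip_by_global_norm(grads)
global_norm = np.sqrt(sum(float(np.sum(g * g)) for g in clipped))
assert global_norm <= MAX_GRAD_NORM + 1e-9
```

Rescaling all gradients by one shared factor preserves the gradient direction, which is why global-norm clipping is the common choice for RNN training.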
Similar to the evaluation methods used in [17, 22, 23], we evaluate the task-oriented dialog model in a ranking setting. We report the prediction accuracy for the goal slot values, the entity pointer, the delexicalised system response, and the final system response in which the delexicalised tokens are replaced by the predicted values.

We first experiment with different text encoding methods and recurrent model architectures to find the best performing model. Table 2 shows the evaluation results of models using different user utterance encoding methods and different word embedding initializations. The bidirectional LSTM (Bi-LSTM) shows a clear advantage in encoding user utterances compared to the bag-of-means on word embeddings (BoW Emb) method, improving the joint goal prediction accuracy by 4.6% and the final system response accuracy by 1.4%. Using pre-trained word vectors (word2vec) boosts the model performance further. These results show that the semantic similarities of words captured in the pre-trained word vectors are helpful in generating a better representation of the user input, especially when the utterance contains words or entities that are rarely observed during training.

Table 2: Prediction accuracy for entity pointer, joint user goal, delexicalised system response, and final system response on the transformed DSTC2 test set, using different encoding methods and word vector initialization.

    Model               Entity Pointer   Joint Goal   De-lex Res   Final Res
    BoW Emb Encoder     93.5             72.6         55.4         51.2
    + word2vec          93.6             74.3         55.9         51.5
    Bi-LSTM Encoder     93.8             –            –            –

Table 3 shows the evaluation results of different recurrent model architectures. The Hierarchical LSTM model in Table 3 refers to the last model in Table 2. Models in rows 2 to 4 of Table 3 refer to the three recurrent model architectures discussed in section 3.7. As illustrated by these results, models conditioned on previously emitted labels in generating the system response achieve lower prediction accuracy across all four evaluation metrics. These observations are contrary to our intuition and the analysis made in the previous section. We believe the degraded performance is mainly due to the data sparsity of the dataset used in the experiment. Given a certain dialog context, there may be multiple system response candidates that could generate a suitable final response. With a limited number of training samples, the model is likely to overfit the training set and not generalize well during inference. The overfitting issue might be less of a problem in a word-by-word response generation setting; this is to be studied further in our future work.

Table 3:
Prediction accuracy with different recurrent model architectures.

    Model                    Entity Pointer   Joint Goal   De-lex Res   Final Res
    Hierarchical LSTM        –                –            –            –
    + feed de-lex res (1)    93.6             74.8         55.4         51.8
    + feed goal slots (2)    94.1             75.3         55.3         51.8
    + feed both (3)          93.7             72.7         55.3         51.6

As observed in Table 3, the joint goal tracking performance is directly related to the final response prediction accuracy. To further investigate our model's capability in belief tracking, we conduct an error breakdown analysis for each goal slot. We compare our model to two other recently proposed belief tracking models, an RNN based model [11] and the Neural Belief Tracker [28], in the setting of using only the live ASR hypothesis as model input. As the results in Table 4 show, our system achieves promising belief tracking performance, comparable to the state-of-the-art systems.

Table 4:
Dialog state tracking performance on the DSTC2 test set, compared to previous approaches.

    Model               Area Goal   Food Goal   Price Goal   Joint Goal
    RNN                 92          86          86           69
    RNN + sem. dict     91          86          93           73
    NBT-DNN [28]        90          84          94           72
    NBT-CNN [28]        90          83          93           72
    Hierarchical LSTM   90          84          93           73

Finally, we report the model performance in producing final system responses and compare it to other published results following the per-response accuracy metric used in prior work. Even though we use the same evaluation measurement, our model is designed with slightly different settings compared to the other published models in Table 5. Instead of using additional match type features (i.e. KB entity type features for each word, e.g. whether a word is a food type or an area type, etc.) as in [17, 23], we use the user's goal slots at each turn, mapped from the original DSTC2 dataset, as additional supervised signals in our model. Moreover, instead of treating KB query results as unstructured text, we treat them as structured entities and let our model pick the right entity by selecting the most appropriate entity pointer. Our proposed model successfully predicts 52.8% of the true system responses, outperforming prior end-to-end trainable neural dialog systems.

Table 5:
Performance of the proposed model in per-response accuracy, compared to previous approaches.

    Model                           Per-response Accuracy
    Memory Networks [17]            41.1
    Gated Memory Networks [29]      48.7
    Sequence-to-Sequence [22]       48.0
    Query-Reduction Networks [23]   51.1
    Hierarchical LSTM               52.8
To further understand the prediction errors made by our model, we conduct a human evaluation by inviting 10 users to evaluate the appropriateness of the responses generated by our system. While some of the errors are made in generating proper API calls, due to errors in the dialog state tracking results, we also find quite a number of responses that are considered appropriate by our judges but do not match the reference responses in the test set. For example, there are cases where our system directly issues the correct API call (e.g. "api_call south italian expensive") based on the user's inputs, instead of asking the user to confirm a goal type (e.g. "Did you say you are looking for a restaurant in the south of town?") as in the reference corpus. Taking such factors into consideration, our system was able to generate appropriate responses 73.6% of the time based on feedback from the judges. These results show that the per-response accuracy evaluation metric may not correlate well with human judgments [30], and better dialog evaluation measurements should be explored further.
5. Conclusions
In this work, we propose a novel end-to-end trainable neural network model for task-oriented dialog systems. The model is able to track the dialog belief state, interface with knowledge bases by issuing API calls, and incorporate structured query results into system responses to successfully complete task-oriented dialogs. In an evaluation in a restaurant search domain using a dataset converted from the second Dialog State Tracking Challenge corpus, our proposed model shows robust performance in tracking dialog state over the sequence of dialog turns. The model also demonstrates promising performance in generating appropriate system responses, outperforming prior end-to-end trainable neural network models.

6. References

[1] A. I. Rudnicky, E. H. Thayer, P. C. Constantinides, C. Tchou, R. Shern, K. A. Lenzo, W. Xu, and A. Oh, "Creating natural dialogs in the Carnegie Mellon Communicator system," in Eurospeech, 1999.
[2] S. Young, "Using POMDPs for dialog management," in Spoken Language Technology Workshop, 2006. IEEE, 2006, pp. 8–13.
[3] A. Raux, B. Langner, D. Bohus, A. W. Black, and M. Eskenazi, "Let's go public! Taking a spoken dialog system to the real world," in Proc. of Interspeech 2005, 2005.
[4] S. Young, M. Gašić, B. Thomson, and J. D. Williams, "POMDP-based statistical spoken dialog systems: A review," Proceedings of the IEEE, vol. 101, no. 5, pp. 1160–1179, 2013.
[5] T. Zhao and M. Eskenazi, "Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning," in Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2016.
[6] T.-H. Wen, D. Vandyke, N. Mrkšić, M. Gašić, L. M. Rojas-Barahona, P.-H. Su, S. Ultes, and S. Young, "A network-based end-to-end trainable task-oriented dialogue system," arXiv preprint arXiv:1604.04562, 2016.
[7] X. Li, Y.-N. Chen, L. Li, and J. Gao, "End-to-end task-completion neural dialogue systems," arXiv preprint arXiv:1703.01008, 2017.
[8] J. Williams, A. Raux, and M. Henderson, "The dialog state tracking challenge series: A review," Dialogue & Discourse, vol. 7, no. 3, pp. 4–33, 2016.
[9] J. D. Williams and S. Young, "Partially observable Markov decision processes for spoken dialog systems," Computer Speech & Language, vol. 21, no. 2, pp. 393–422, 2007.
[10] S. Lee, "Structured discriminative model for dialog state tracking," in Proceedings of the SIGDIAL 2013 Conference, 2013, pp. 442–451.
[11] M. Henderson, B. Thomson, and S. Young, "Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation," in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 360–365.
[12] M. Henderson, B. Thomson, and S. Young, "Word-based dialog state tracking with recurrent neural networks," in Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2014, pp. 292–299.
[13] M. Henderson, "Machine learning for dialog state tracking: A review," in Proc. of The First International Workshop on Machine Learning in Spoken Language Processing, 2015.
[14] M. Gašić, C. Breslin, M. Henderson, D. Kim, M. Szummer, B. Thomson, P. Tsiakoulis, and S. Young, "On-line policy optimisation of Bayesian spoken dialogue systems via human interaction," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8367–8371.
[15] L. Shang, Z. Lu, and H. Li, "Neural responding machine for short-text conversation," arXiv preprint arXiv:1503.02364, 2015.
[16] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau, "Building end-to-end dialogue systems using generative hierarchical neural network models," arXiv preprint arXiv:1507.04808, 2015.
[17] A. Bordes and J. Weston, "Learning end-to-end goal-oriented dialog," arXiv preprint arXiv:1605.07683, 2016.
[18] F. Jurčíček, B. Thomson, and S. Young, "Reinforcement learning for parameter estimation in statistical spoken dialogue systems," Computer Speech & Language, vol. 26, no. 3, pp. 168–192, 2012.
[19] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[20] B. Liu and I. Lane, "Attention-based recurrent neural network models for joint intent detection and slot filling," in Proceedings of The 17th Annual Meeting of the International Speech Communication Association, 2016.
[21] D. Hakkani-Tür, G. Tur, A. Celikyilmaz, Y.-N. Chen, J. Gao, L. Deng, and Y.-Y. Wang, "Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM," in Proceedings of The 17th Annual Meeting of the International Speech Communication Association, 2016.
[22] M. Eric and C. D. Manning, "A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue," in EACL, 2017.
[23] M. Seo, S. Min, A. Farhadi, and H. Hajishirzi, "Query-reduction networks for question answering," in International Conference on Learning Representations, 2017.
[24] M. Henderson, B. Thomson, and J. Williams, "The second dialog state tracking challenge," in Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2014.
[25] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[26] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
[27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[28] N. Mrkšić, D. Ó Séaghdha, T.-H. Wen, B. Thomson, and S. Young, "Neural belief tracker: Data-driven dialogue state tracking," arXiv preprint arXiv:1606.03777, 2016.
[29] J. Perez and F. Liu, "Gated end-to-end memory networks," arXiv preprint arXiv:1610.04211, 2016.
[30] C.-W. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau, "How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation," arXiv preprint arXiv:1603.08023, 2016.