Sequential Dialogue Context Modeling for Spoken Language Understanding
Ankur Bapna [email protected]
Gokhan T ¨ur
Google Research, Mountain View
Dilek Hakkani-T ¨ur { gokhan.tur, dilek, larry.heck } @ieee.org Larry HeckAbstract
Spoken Language Understanding (SLU) is a key component of goal oriented dialogue systems that parses user utterances into semantic frame representations. Traditionally, SLU does not utilize the dialogue history beyond the previous system turn, and contextual ambiguities are resolved by the downstream components. In this paper, we explore novel approaches for modeling dialogue context in a recurrent neural network (RNN) based language understanding system. We propose the Sequential Dialogue Encoder Network, which allows encoding context from the dialogue history in chronological order. We compare the performance of our proposed architecture with two context models: one that uses just the previous turn context, and another that encodes dialogue context in a memory network but loses the order of utterances in the dialogue history. Experiments with a multi-domain dialogue dataset demonstrate that the proposed architecture results in reduced semantic frame error rates.
Introduction

Goal oriented dialogue systems help users accomplish tasks, like making restaurant reservations or booking flights, by interacting with them in natural language. The capability to understand user utterances and break them down into task specific semantics is a key requirement for these systems. This is accomplished in the spoken language understanding module, which typically parses user utterances into semantic frames, composed of domains, intents and slots (Tur and De Mori, 2011), that can then be processed by downstream dialogue system components. An example semantic frame is shown for a restaurant reservation related query in Figure 1.

[Figure 1: An example semantic parse of an utterance (u) with slot (S), domain (D) and intent (I) annotations, following the IOB (in-out-begin) representation for slot values. The example parses the user turn "table for 2 at Pho Nam" with domain restaurants and intent reserve restaurant.]

As the complexity of the task supported by a dialogue system increases, there is a need for increased back and forth interaction between the user and the agent. For example, a restaurant reservation task might require the user to specify a restaurant name, date, time and number of people for the reservation. Additionally, based on reservation availability, the user might need to negotiate the date, time, or any other attribute with the agent. This puts the burden of parsing in-dialogue contextual user utterances on the language understanding module. The complexity increases further when the system supports more than one task and the user is allowed to have goals spanning multiple domains within the same dialogue.
Natural language utterances are often ambiguous, and the context from previous user and system turns could help resolve the errors arising from these ambiguities.

In this paper, we explore approaches to improve dialogue context modeling within a Recurrent Neural Network (RNN) based spoken language understanding system. We propose a novel model architecture to improve dialogue context modeling for spoken language understanding on a multi-domain dialogue dataset. The proposed architecture is an extension of Hierarchical Recurrent Encoder Decoders (HRED) (Sordoni et al., 2015), where we combine the query level encodings with a representation of the current utterance, before feeding it into the session level encoder. We compare the performance of this model to an RNN tagger injected with just the previous turn context, and to a single hop memory network that uses an attention weighted combination of the dialogue context (Chen et al., 2016; Weston et al., 2014).

[Figure 2: Architecture of the memory and current utterance context encoder.]

Furthermore, we describe a dialogue recombination technique to enhance the complexity of the training dataset by injecting synthetic domain switches, to create a better match with the mixed domain dialogues in the test dataset. This is, in principle, a multi-turn extension of (Jia and Liang, 2016). Instead of inducing and composing grammars to synthetically enhance single turn text, we combine single domain dialogue sessions into multi-domain dialogues to provide richer context during training.

Related Work

The task of understanding a user utterance is typically broken down into 3 tasks: domain classification, intent classification and slot-filling (Tur and De Mori, 2011). Most modern approaches to spoken language understanding involve training machine learning models on labeled training data (Young, 2002; Hahn et al., 2011; Wang et al., 2005, among others).
More recently, recurrent neural network (RNN) based approaches have been shown to perform exceedingly well on spoken language understanding tasks (Mesnil et al., 2015; Hakkani-Tür et al., 2016; Kurata et al., 2016, among others). RNN based approaches have also been applied successfully to other tasks for dialogue systems, like dialogue state tracking (Henderson, 2015; Henderson et al., 2014; Perez and Liu, 2016, among others), policy learning (Su et al., 2015) and system response generation (Wen et al., 2015, 2016, among others).

In parallel, joint modeling of tasks and addition of contextual signals has been shown to result in performance gains for several applications. Modeling domain, intent and slots in a joint RNN model was shown to result in a reduction of overall frame error rates (Hakkani-Tür et al., 2016). Joint modeling of intent classification and language modeling showed promising improvements in intent recognition, especially in the presence of noisy speech recognition (Liu and Lane, 2016).

Similarly, models incorporating more context from dialogue history (Chen et al., 2016) or semantic context from the frame (Dauphin et al., 2014; Bapna et al., 2017) tend to outperform models without context and have shown potential for greater generalization on spoken language understanding and related tasks. (Dhingra et al., 2016) show improved performance on an informational dialogue agent by incorporating knowledge base context into their dialogue system. Using dialogue context was shown to boost performance for end to end dialogue (Bordes and Weston, 2016) and next utterance prediction (Serban et al., 2015).

In the next few sections, we describe the proposed model architecture, the dataset and our dialogue recombination approach. This is followed by experimental results and analysis.
Model Architecture

We compare the performance of 3 model architectures for encoding dialogue context on a multi-domain dialogue dataset. Let the dialogue be a sequence of system and user utterances D_t = {u_1, u_2, ..., u_t}; at time step t we are trying to output the parse of a user utterance u_t, given D_t. Let any utterance u_k be a sequence of tokens given by {x_k1, x_k2, ..., x_kn_k}.

We divide the model into 2 components: the context encoder, which acts on D_t to produce a vector representation of the dialogue context denoted by h_t = H(D_t), and the tagger, which takes the dialogue context encoding h_t and the current utterance u_t as input and produces the domain, intent and slot annotations as output.

[Figure 3: Architecture of the dialogue context encoder for the cosine similarity based memory network.]

Context Encoder Architectures

In this section we describe the architectures of the context encoders used for our experiments. We compare the performance of 3 different architectures that encode varying levels of dialogue context.
Previous Utterance Encoder

This is the baseline context encoder architecture. We feed the embeddings corresponding to tokens in the previous system utterance, u_{t-1} = {x_{t-1,1}, x_{t-1,2}, ..., x_{t-1,n_{t-1}}}, into a single Bidirectional RNN (BiRNN) layer with Gated Recurrent Unit (GRU) (Chung et al., 2014) cells and 128 dimensions (64 in each direction). The embeddings are shared with the tagger. The final state of the context encoder GRU is used as the dialogue context:

h_t = BiGRU_c(u_{t-1})    (1)

Memory Network

This architecture is identical to the approach described in (Chen et al., 2016). We encode all dialogue context utterances, {u_1, u_2, ..., u_{t-1}}, into memory vectors denoted by {m_1, m_2, ..., m_{t-1}}, using a Bidirectional GRU (BiGRU) encoder with 128 dimensions (64 in each direction). To add temporal context to the dialogue history utterances, we append special positional tokens to each utterance.

m_k = BiGRU_m(u_k)  for 1 ≤ k ≤ t-1    (2)

We also encode the current utterance with another BiGRU encoder with 128 dimensions (64 in each direction), into a context vector denoted by c, as in equation 3. This is conceptually depicted in Figure 2.

c = BiGRU_c(u_t)    (3)

Let M be a matrix whose i-th row is given by m_i. We obtain the cosine similarity between each memory vector, m_i, and the context vector c. The softmax of this similarity is used as an attention distribution over the memory M, and an attention weighted sum of M is used to produce the dialogue context vector h_t (Equation 4). This is conceptually depicted in Figure 3.

a = softmax(Mc)
h_t = a^T M    (4)

Sequential Dialogue Encoder Network

We enhance the memory network architecture described above by adding a session encoder (Sordoni et al., 2015) that temporally combines a joint representation of the current utterance encoding, c (Eq. 3), and the memory vectors, {m_1, m_2, ..., m_{t-1}} (Eq. 2).
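Equations 2-4 of the memory network amount to an attention readout over the history encodings. The following numpy sketch is illustrative only: the real model computes m_k and c with 128-dimensional BiGRU encoders, and normalization is added here so the scores are true cosine similarities as the text describes (Eq. 4 writes the raw product Mc).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def memory_attention(memories, context):
    """Cosine-similarity attention over dialogue history (Eqs. 2-4).

    memories: array of shape (t-1, d), rows are encodings m_k of the
    history utterances; context: shape (d,), encoding c of u_t.
    Returns the attention distribution a and the context vector h_t."""
    # Normalize rows and query so the dot product is a cosine similarity.
    M = memories / np.linalg.norm(memories, axis=1, keepdims=True)
    c = context / np.linalg.norm(context)
    a = softmax(M @ c)      # softmax over history positions
    h_t = a @ memories      # attention-weighted sum of the memory vectors
    return a, h_t
```

Note that the weighted sum is taken over the original (unnormalized) memory vectors, matching h_t = a^T M.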
We combine the context vector c with each memory vector m_k, for 1 ≤ k ≤ t-1, by concatenating them and passing the result through a feed forward layer (FF) to produce 128 dimensional context encodings, denoted by {g_1, g_2, ..., g_{t-1}}:

g_k = sigmoid(FF(m_k, c))  for 1 ≤ k ≤ t-1    (5)

These context encodings are fed as token level inputs into the session encoder, which is a 128 dimensional BiGRU layer. The final state of the session encoder represents the dialogue context encoding h_t:

h_t = BiGRU_s({g_1, g_2, ..., g_{t-1}})    (6)

The architecture is depicted in Figure 4.

[Figure 4: Architecture of the Sequential Dialogue Encoder Network. The feed-forward networks share weights across all memories.]

Tagger Architecture

For all our experiments we use a stacked BiRNN tagger to jointly model domain classification, intent classification and slot-filling, similar to the approach described in (Hakkani-Tür et al., 2016). We feed learned 256 dimensional embeddings corresponding to the current utterance tokens into the tagger.

The first RNN layer uses GRU cells with 256 dimensions (128 in each direction) as in equation 7. The token embeddings are fed into the token level inputs of the first RNN layer to produce the token level outputs o^1 = {o^1_1, o^1_2, ..., o^1_{n_t}}:

o^1 = BiGRU_1(u_t)    (7)

The second layer uses Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) cells with 256 dimensions (128 in each direction). We use an LSTM based second layer since that improved slot-filling performance on the validation set for all architectures. We apply dropout to the outputs of both layers. The initial states of both forward and backward LSTMs of the second tagger layer are initialized with the dialogue encoding h_t, as in equation 8. The token level outputs of the first RNN layer, o^1, are fed as input into the second RNN layer to produce token level outputs o^2 = {o^2_1, o^2_2, ..., o^2_{n_t}} and the final state s^2.
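The dialogue encoding h_t used here is, in the SDEN case, produced by Equations 5 and 6. The following numpy mock-up uses toy dimensions and random weights, with a plain uni-directional GRU standing in for the 128-dimensional session BiGRU; it is a sketch of the computation pattern, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sden_context(memories, c, W, b, Wz, Wr, Wh):
    """Toy sketch of the Sequential Dialogue Encoder Network (Eqs. 5-6).

    Eq. 5: each memory m_k is concatenated with the current-utterance
    encoding c and passed through a sigmoid feed-forward layer.
    Eq. 6: a GRU scans g_1..g_{t-1} in chronological order; its final
    state stands in for the dialogue encoding h_t."""
    g = [sigmoid(W @ np.concatenate([m, c]) + b) for m in memories]  # Eq. 5
    h = np.zeros(Wz.shape[0])
    for x in g:                                   # chronological scan (Eq. 6)
        xh = np.concatenate([x, h])
        z = sigmoid(Wz @ xh)                      # update gate
        r = sigmoid(Wr @ xh)                      # reset gate
        h_new = np.tanh(Wh @ np.concatenate([x, r * h]))
        h = (1 - z) * h + z * h_new
    return h

# Toy dimensions: memories are 4-dim, g_k is 3-dim, session state is 2-dim.
memories = rng.normal(size=(5, 4))
c = rng.normal(size=4)
W, b = rng.normal(size=(3, 8)), rng.normal(size=3)
Wz, Wr, Wh = (rng.normal(size=(2, 5)) for _ in range(3))
h_t = sden_context(memories, c, W, b, Wz, Wr, Wh)
```

Because the scan runs over the memories in order, h_t reflects the chronological structure of the history, which the order-free attention of the memory network discards.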
o^2, s^2 = BiLSTM(o^1, h_t)    (8)

The final state of the second layer, s^2, is used as input to the classification layers for domain and intent classification:

p_domain = softmax(U s^2)
p_intent = sigmoid(V s^2)    (9)

The token level outputs of the second layer, o^2, are used as input to a softmax layer that outputs the IOB slot labels. This results in a softmax layer with 2N+1 dimensions for a domain with N slots.

p_slot_i = softmax(S o^2_i)  for 1 ≤ i ≤ n_t    (10)

The architecture is depicted in Figure 5.

[Figure 5: Architecture of the stacked BiRNN tagger. The dialogue context obtained from the context encoder is fed into the initial states of the second RNN layer.]

Dataset

We crowd sourced multi-turn dialogue sessions for 3 tasks: buying movie tickets, searching for a restaurant and reserving tables at a restaurant. Our data collection process comprises two steps: (i) generating user-agent interactions, comprising dialogue acts and slots, based on the interplay of a simulated user and a rule based dialogue policy; (ii) using a crowd sourcing platform to elicit natural language utterances that align with the semantics of the generated interactions.

The goal of the spoken language understanding module of our dialogue system is to map each user utterance into frame based semantics that can be processed by the downstream components. Tables describing the intents and slots present in the dataset can be found in the appendix.

We use a stochastic agenda-based user simulator (Schatzmann et al., 2007; Shah et al., 2016) for interplay with our rule based system policy. The user goal is specified in terms of a tuple of slots, which denote the user constraints. Some constraints might be unspecified, in which case the user is indifferent to the value of those slots. At any given turn, the simulator samples a user dialogue act from a set of acceptable actions based on (i) the user goal and agenda, which includes slots that still need to be specified, (ii) a randomly chosen user profile (co-operative/aggressive, verbose/succinct, etc.) and (iii) the previous user and system actions.
Domain              | Attributes
movies              | date, movie, num tickets, theatre name, time
find-restaurants    | category, location, meal, price range, rating, restaurant name
reserve-restaurant  | date, num people, restaurant name, time

Table 1: List of attributes supported for each domain.

Based on the chosen user dialogue act, the rule based policy might make a backend call to inquire about restaurant or movie availability. Based on the user act and the backend response, the system responds with a dialogue act or a combination of dialogue acts, following a hand designed rule based policy. These generated interactions were then translated to their natural language counterparts and sent out to crowd workers for paraphrasing into natural language human-machine dialogues.

The simulator and policy were also extended to handle multiple goals spanning different domains. In this set-up, the user goal for the simulator would include multiple tasks, and slot values could be conditioned on the previous task; for example, the simulator would ask for booking a table "after the movie", or search for a restaurant "near the theater". The set of slots supported by the simulator is enumerated in Table 1. We collected 1319 dialogues for restaurant reservation, 976 dialogues for finding restaurants and 1048 dialogues for buying movie tickets. All single domain datasets were used for training. The multi-domain simulator was used to collect 467 dialogues for training, 50 for validation and 273 for the test set. Since the natural language dialogues were paraphrased versions of known dialogue act and slot combinations, they were automatically labeled. These labels were verified by an expert annotator, and turns with missing annotations were manually annotated by the expert.
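The simulator's act-selection loop can be caricatured in a few lines. The act names, profile fields and the sampling probability below are invented for illustration; they do not reproduce the actual simulator of (Schatzmann et al., 2007).

```python
import random

def sample_user_act(agenda, profile, prev_system_act, rng=random):
    """Caricature of agenda-based user act selection: the next act depends
    on (i) pending goal constraints in the agenda, (ii) the user profile,
    and (iii) the previous system action. All names are illustrative."""
    if not agenda:                        # goal fully specified
        return ("affirm", {})
    if prev_system_act == "request":
        slot, value = agenda.pop(0)       # answer the system's request
        act = ("inform", {slot: value})
        # Verbose users may volunteer an extra constraint in the same turn.
        if profile.get("verbose") and agenda and rng.random() < 0.5:
            slot, value = agenda.pop(0)
            act[1][slot] = value
        return act
    return ("greeting", {})
```

A real agenda-based simulator maintains a richer agenda (including deny/negotiate acts) and samples from a distribution over all acceptable acts rather than following fixed rules.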
Dialogue Recombination

As described in the previous section, we train our models on a large set of single domain dialogue datasets and a small set of multi-domain dialogues. These models are then evaluated on a test set composed of multi-domain dialogues, where the user attempts to fulfill multiple goals spanning several domains. This results in a distribution drift that might cause performance degradation. To counter this drift in the training-test data distributions, we devise a dialogue recombination scheme to generate multi-domain dialogues from the single domain training datasets.

Dialogue x (movies):
U: Get me 5 tickets to see Inferno.
S: Sure, when is this booking for?
U: Around 5 pm tomorrow night.
S: Do you have a theatre in mind?
U: AMC newpark 12.
S: Does 4:45 pm work for you?
U: Yes.
S: Your booking is complete.

Dialogue y (find-restaurants):
U: Find italian restaurants in Mountain View
S: What price range are you looking for?
U: cheap
S: Ristorante Giovanni is a nice Italian restaurant in Mountain View.
U: That works. thanks.

Recombined dialogue d_r:
U: Get me 5 tickets to see Inferno.
S: Sure, when is this booking for?
U: Around 5 pm tomorrow night.
S: Do you have a theatre in mind?
U: Find italian restaurants in Mountain View
S: What price range are you looking for?
U: cheap
S: Ristorante Giovanni is a nice Italian restaurant in Mountain View.
U: That works. thanks.

Table 2: A sample dialogue obtained from recombining a dialogue from the movies and find-restaurants datasets.

The key idea behind the recombination approach is the conditional independence of sub-dialogues aimed at performing distinct tasks (Grosz and Sidner, 1986). We exploit the presence of task intents, i.e. intents that denote a switch in the primary task the user is trying to perform, since they are a strong indicator of a switch in the focus of the dialogue. We exploit the independence of the sub-dialogue following these intents from the previous dialogue context to generate synthetic dialogues with multi-domain context.
The recombination process is described as follows. Let a dialogue d be defined as a sequence of turns and corresponding semantic labels (domain, intent and slot annotations) {(t_1^d, f_1^d), (t_2^d, f_2^d), ..., (t_{n_d}^d, f_{n_d}^d)}. To obtain a recombined dataset composed of dialogues from dataset_1 and dataset_2, we repeat the following steps 10000 times, for each combination of (dataset_1, dataset_2) from the three single domain datasets:

• Sample dialogues x and y from dataset_1 and dataset_2 respectively.
• Find the first user utterance labeled with a task intent in y. Let this be turn l.
• Randomly sample an insertion point in dialogue x. Let this be turn k.
• The new recombined dialogue is {(t_1^x, f_1^x), ..., (t_k^x, f_k^x), (t_l^y, f_l^y), ..., (t_{n_y}^y, f_{n_y}^y)}.

A sample dialogue generated using the above procedure is shown in Table 2. We drop the utterances from dialogue x following the insertion point (turn k) in the recombined dialogue, since these turns become ambiguous or confusing in the absence of preceding context. In a sense, our approach is one of partial dialogue recombination.

Experiments

We compare the domain classification, intent classification and slot-filling performances, and the overall frame error rates, of the encoder-decoder, memory network and Sequential Dialogue Encoder Network on the dataset described above. The frame error rate of a SLU system is the percentage of utterances where it makes a wrong prediction, i.e. any of domain, intent or slots is predicted incorrectly.

We trained all 3 models with RMSProp for 100000 training steps with a batch size of 100. We started with a learning rate of 0.0003, which was decayed by a factor of 0.95 every 3000 steps. Gradient norms were clipped if they exceeded a magnitude of 2.5. All model and optimization hyper-parameters were chosen based on a grid search, to minimize validation set frame error rates.
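The recombination steps above can be sketched directly in Python. A dialogue is represented here as a list of (turn, frame) pairs; the frame dictionary layout is an assumption of this sketch.

```python
import random

def recombine(dialogue_x, dialogue_y, task_intents, rng=random):
    """Recombine two single-domain dialogues into one multi-domain dialogue.

    Finds the first user turn in dialogue_y labeled with a task intent
    (turn l), samples an insertion point k in dialogue_x, and returns
    x[:k+1] followed by y[l:]; the tail of x after turn k is dropped."""
    l = next(i for i, (_, frame) in enumerate(dialogue_y)
             if frame["intent"] in task_intents)
    k = rng.randrange(len(dialogue_x))   # random insertion point
    return dialogue_x[:k + 1] + dialogue_y[l:]
```

Repeating this 10000 times for each pair of single domain datasets yields the recombined training set described above.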
Model     | Domain F1 | Intent F1 | Slot Token F1 | Frame Error Rate
ED        | 0.937     | 0.865     | 0.891         | 31.87%
MN        | 0.964     | 0.890     | 0.896         | 26.72%
SDEN      | 0.960     | 0.870     | 0.896         | 31.31%
ED + DR   | 0.936     | 0.885     | 0.911         | 30.72%
MN + DR   | 0.968     |

Table 3: Test set performances for the encoder decoder (ED) model, the Memory Network (MN) and the Sequential Dialogue Encoder Network (SDEN), with and without recombined data (DR).

Utterance (dialogue history)                                   | MN+DR | SDEN+DR
hi!                                                            | 0.00  | 0.13
hello ! i want to buy movie tickets for pm at cinelux plaza    | 0.05  |
which movie , how many , and what day ?                        | 0.13  |
Trolls , tickets for today                                     |       |

        | True              | ED+DR  | MN+DR  | SDEN+DR
Domain  | buy-movie-tickets | movies | movies | movies
Intent      | contextual | contextual | contextual | contextual
date        | today      | today      | today      | today
num tickets |            |            |            |
movie       | Trolls     | Trolls     | -          | Trolls

Table 4: Dialogue from the test set with predictions from the Encoder Decoder with recombined data (ED+DR), the Memory Network with recombined data (MN+DR) and the Sequential Dialogue Encoder Network with dialogue recombination (SDEN+DR). Tokens that have been italicized in the dialogue were out of vocabulary or replaced with special tokens. The columns to the right of the dialogue history detail the attention distributions. For SDEN+DR, we use the magnitude of the change in the session GRU state as a proxy for the attention distribution. Attention weights might not sum up to 1 if there is non-zero attention on history padding.

We restrict the model vocabularies to contain only tokens occurring more than 10 times in the training set, to prevent over-fitting to training set entities. Digits were replaced with a special token.
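The vocabulary restriction amounts to a simple frequency cutoff. In the sketch below, the special token names are assumptions, not the paper's exact symbols.

```python
from collections import Counter

def build_vocab(tokenized_utterances, min_count=10,
                specials=("<unk>", "<digit>")):
    """Keep only tokens occurring more than min_count times in training;
    everything else maps to <unk> at lookup time."""
    counts = Counter(t for utt in tokenized_utterances for t in utt)
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok in sorted(counts):
        if counts[tok] > min_count and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def lookup(vocab, token):
    """Map an out-of-vocabulary token to the <unk> id."""
    return vocab.get(token, vocab["<unk>"])
```

Rare entity names in the training set are thereby collapsed into a single unknown token, so the models cannot memorize them.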
Results

The encoder decoder model trained on just the previous turn context performs worst on almost all metrics, irrespective of the presence of recombined data. This can be explained by worse performance on in-dialogue utterances, where just the previous turn context isn't sufficient to accurately identify the domain and, in several cases, the intents and slots of the utterance.

The memory network is the best performing model in the absence of recombined data, indicating that the model is able to encode additional context effectively to improve performance on all tasks, even when only a small amount of multi-domain data is available.

The Sequential Dialogue Encoder Network performs slightly worse than the memory network in the absence of recombined data. This could be explained by the model over-fitting to the single domain context seen during training and failing to utilize context effectively in a multi-domain setting. In the presence of recombined dialogues it outperforms all other implementations.

Apart from increasing the noise in the dialogue context, adding recombined dialogues to the training set increases the average turn length of the training data, bringing it closer to that of the test dialogues. Our augmentation approach is, in spirit, an extension of the data recombination described in (Jia and Liang, 2016) to conversations. We hypothesize that the presence of synthetic context has a regularization-like effect on the models. Similar effects were observed by (Jia and Liang, 2016), where training with longer, synthetically-augmented utterances resulted in improved semantic parsing performance on a simpler test set. This is also supported by the observation that the performance improvements obtained by the addition of recombined data increase as the complexity of the model increases.

Utterance (dialogue history)                                                  | MN+DR | SDEN+DR
hello                                                                         | 0.01  | 0.10
hello . i need to buy tickets at cinemark redwood downtown for xd at : pm     | 0.00  | 0.06
which movie do you want to see at what time and date .                        | 0.00  | 0.04
I didn't understand that.                                                     | 0.00  | 0.03
please tell which movie , the time and date of the movie                      | 0.01  | 0.02
the movie is queen of katwe today and the number of tickets is : pm showing   | 0.02  | 0.01
yes                                                                           | 0.01  | 0.01
I bought you tickets for the : pm showing of queen of katwe at cinemark redwood downtown |  |
... Brazilian restaurant                                                      |       |
which one of Fogo de Chão Brazilian steakhouse , Espetus Churrascaria san mateo or Fogo de Chão would you prefer | 0.02 |
Fogo de Chão Brazilian steakhouse                                             |       |

                | True                              | ED+DR  | MN+DR            | SDEN+DR
Domain          | find-restaurants                  | movies | find-restaurants | find-restaurants
Intent          | affirm(restaurant)                | -      | -                | -
restaurant name | Fogo de Chão Brazilian steakhouse | -      | -                | Fogo de Chão Brazilian steakhouse

Table 5: Dialogue from the test set with predictions from the Encoder Decoder with recombined data (ED+DR), the Memory Network with recombined data (MN+DR) and the Sequential Dialogue Encoder Network with dialogue recombination (SDEN+DR). Tokens that have been italicized in the dialogue were out of vocabulary or replaced with special tokens. The columns to the right of the dialogue history detail the attention distributions. For SDEN+DR, we use the magnitude of the change in the session GRU state as a proxy for the attention distribution. Attention weights might not sum up to 1 if there is non-zero attention on history padding.
Table 4 demonstrates an example dialogue from the test set, along with the gold and model annotations from all 3 models. We observe that the Encoder Decoder (ED) and the Sequential Dialogue Encoder Network (SDEN) are able to successfully identify the domain, intent and slots, while the Memory Network (MN) fails to identify the movie name. Looking at the attention distributions, we notice that the MN attention is very diffused, whereas SDEN is focusing on the most recent 2 utterances, which directly identify the domain and the presence of the movie slot in the final user utterance. ED is also able to identify the presence of a movie in the final user utterance from the previous utterance context.

Table 5 displays another example where the SDEN model outperforms both MN and ED. Constrained to just the previous utterance, ED is unable to correctly identify the domain of the user utterance. The MN model correctly identifies the domain, using its strong focus on the task-intent bearing utterance, but it is unable to identify the presence of a restaurant in the user utterance. This highlights its failure to combine context from multiple history utterances. On the other hand, as indicated by its attention distribution on the final two utterances, SDEN is able to successfully combine context from the dialogue to correctly identify the domain and the restaurant name from the user utterance, despite the presence of several out-of-vocabulary tokens.

The above two examples hint that SDEN performs better in scenarios where multiple history utterances encode complementary information that could be useful to interpret user utterances. This is usually the case in more natural goal oriented dialogues, where several tasks and sub tasks go in and out of the focus of the conversation (Grosz, 1979). On the other hand, we also observed that SDEN performs significantly worse in the absence of recombined data.
Due to its complex architecture and a much larger set of parameters, SDEN is prone to over-fitting in low data scenarios.

Conclusions

In this paper, we collect a multi-domain dataset of goal oriented human-machine conversations, and analyze and compare the SLU performance of multiple neural network based model architectures that can encode varying amounts of context. Our experiments suggest that encoding more context from the dialogue, and enabling the model to combine contextual information in a sequential order, results in a reduction in the overall frame error rate. We also introduce a data augmentation scheme to generate longer dialogues with richer context, and empirically demonstrate that it results in performance improvements for multiple model architectures.
Acknowledgements

We would like to thank Pararth Shah, Abhinav Rastogi, Anna Khasin and Georgi Nikolov for their help with the user-machine conversation data collection and labeling. We would also like to thank the anonymous reviewers for their insightful comments.
References
Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür, and Larry Heck. 2017. Towards zero-shot frame semantic parsing for domain scaling. In Proceedings of the Interspeech. Stockholm, Sweden.

Antoine Bordes and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.

Y.-N. Chen, D. Hakkani-Tür, G. Tur, J. Gao, and L. Deng. 2016. End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In Proceedings of the Interspeech. San Francisco, CA.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Y. Dauphin, G. Tur, D. Hakkani-Tür, and L. Heck. 2014. Zero-shot learning and clustering for semantic utterance classification. In Proceedings of the ICLR.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2016. End-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777.

Barbara J Grosz. 1979. Focusing and description in natural language dialogues. Technical report, DTIC Document.

Barbara J Grosz and Candace L Sidner. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics.

S. Hahn et al. 2011. Comparing stochastic approaches to spoken language understanding in multiple languages. IEEE Transactions on Audio, Speech, and Language Processing.

D. Hakkani-Tür et al. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Proceedings of the Interspeech. San Francisco, CA.

Matthew Henderson. 2015. Machine learning for dialog state tracking: A review. In Proceedings of The First International Workshop on Machine Learning in Spoken Language Processing.

Matthew Henderson, Blaise Thomson, and Jason D Williams. 2014. The second dialog state tracking challenge.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. arXiv preprint arXiv:1606.03622.

G. Kurata, B. Xiang, B. Zhou, and M. Yu. 2016. Leveraging sentence-level information with encoder LSTM for semantic slot filling. In Proceedings of the EMNLP. Austin, TX.

Bing Liu and Ian Lane. 2016. Joint online spoken language understanding and language modeling with recurrent neural networks. CoRR abs/1609.01462. http://arxiv.org/abs/1609.01462.

G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tür, X. He, L. Heck, G. Tur, and D. Yu. 2015. Using recurrent neural networks for slot filling in spoken language understanding. IEEE Transactions on Audio, Speech, and Language Processing.

Julien Perez and Fei Liu. 2016. Dialog state tracking, a machine reading approach using memory network. arXiv preprint arXiv:1606.04052.

Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Hui Ye, and Steve Young. 2007. Agenda-based user simulation for bootstrapping a POMDP dialogue system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers. Association for Computational Linguistics, pages 149-152.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2015. Hierarchical neural network generative models for movie dialogues. CoRR abs/1507.04808. http://arxiv.org/abs/1507.04808.

Pararth Shah, Dilek Hakkani-Tür, and Larry Heck. 2016. Interactive reinforcement learning for task-oriented dialogue management.

Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, New York, NY, USA, CIKM '15, pages 553-562. https://doi.org/10.1145/2806416.2806493.

Pei-Hao Su, David Vandyke, Milica Gasic, Nikola Mrksic, Tsung-Hsien Wen, and Steve Young. 2015. Reward shaping with recurrent neural networks for speeding up on-line policy learning in spoken dialogue systems. arXiv preprint arXiv:1508.03391.

Gokhan Tur and Renato De Mori. 2011. Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons.

Y.-Y. Wang, L. Deng, and A. Acero. 2005. Spoken language understanding - an introduction to the statistical framework. IEEE Signal Processing Magazine.

Tsung-Hsien Wen et al. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1603.01232.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

S. Young. 2002. Talking to machines (statistically speaking). In Proceedings of the ICSLP. Denver, CO.

Appendix
Table 6: Supported intents: list of intents and dialogue acts supported by the user simulator, with descriptions and representative examples. Acts parametrized with slot can be instantiated for any attribute supported within the domain.

Intent             | Intent description                                            | Sample utterance
affirm             | generic affirmation                                           | U: sounds good.
cant understand    | expressing failure to understand system utterance             | U: What do you mean ?
deny               | generic negation                                              | U: That doesn't work.
good bye           | expressing end of dialogue                                    | U: bye
thank you          | expressing gratitude                                          | U: thanks a lot!
greeting           | greeting                                                      | U: Hi
request alts       | request alternatives to a system offer                        | S: Doppio Zero is a nice italian restaurant near you. U: Are there any other options available ?
affirm(slot)       | affirming values corresponding to a particular attribute      | U: 5 pm sounds good to me.
deny(slot)         | negating a particular attribute                               | U: None of those times would work for me.
dont care(slot)    | expressing that any value is acceptable for a given attribute | U: Any time should be ok.
movies             | explicit intent to buy movie tickets                          | U: Get me 3 tickets to Inferno
reserve-restaurant | explicit intent to reserve a table at a restaurant            | U: make a reservation at Max Brenner's
find-restaurants   | explicit intent to search for restaurants                     | U: find cheap italian restaurants near me
contextual         | implicit intent continuing from context, also used in place of inform | S: What time works for you ? U: 5 pm tomorrow.
unknown intent     | intents not supported by the dialogue system                  | U: What's the weather like in San Francisco ?
Table 7: Sample dialogue: sample dialogue generated using a crowd working platform. The LHS consists of the instructions shown to the crowd workers based on the dialogue act interactions between the user simulator and the rule based policy. The RHS describes the natural language dialogue generated by the crowd workers.