Generative Encoder-Decoder Models for Task-Oriented Spoken Dialog Systems with Chatting Capability
Tiancheng Zhao, Allen Lu, Kyusong Lee and Maxine Eskenazi
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
{tianchez,arlu,kyusongl,max+}@cs.cmu.edu

Abstract
Generative encoder-decoder models offer great promise in developing domain-general dialog systems. However, they have mainly been applied to open-domain conversations. This paper presents a practical and novel framework for building task-oriented dialog systems based on encoder-decoder models. This framework enables encoder-decoder models to accomplish slot-value independent decision-making and interact with external databases. Moreover, this paper shows the flexibility of the proposed method by interleaving chatting capability with a slot-filling system for better out-of-domain recovery. The models were trained on both real-user data from a bus information system and human-human chat data. Results show that the proposed framework achieves good performance in both offline evaluation metrics and in task success rate with human users.
Introduction

Task-oriented spoken dialog systems have transformed human-computer interaction by enabling people to interact with computers via spoken language (Raux et al., 2005; Young, 2006; Bohus and Rudnicky, 2003). The task-oriented SDS is usually domain-specific. The system creators first map the user utterances into semantic frames that contain domain-specific slots and intents using spoken language understanding (SLU) (De Mori et al., 2008). Then a set of domain-specific dialog state variables is tracked to retain the context information over turns (Williams et al., 2013). Lastly, the dialog policy decides the next move from a list of dialog acts that covers the expected communicative functions of the system.

Although the above approach has been successfully applied to many practical systems, it has limited ability to generalize to out-of-domain (OOD) requests and to scale up to new domains. For example, even within a simple domain, real users often make requests that are not included in the semantic specifications. Because of this, proper error handling strategies that guide users back to the in-domain conversation are crucial to dialog success (Bohus and Rudnicky, 2005). Past error handling strategies were limited to a set of predefined dialog acts, e.g. request repeat, clarification, etc., which constrained the system's ability to keep users engaged. Moreover, there has been increased interest in extending task-oriented systems to multiple topics (Lee et al., 2009; Gašić et al., 2015b) and multiple skills, e.g. grouping heterogeneous types of dialogs into a single system (Zhao et al., 2016). Both cases require the system to be flexible enough to extend to new slots and actions.

Our goal is to move towards a domain-general task-oriented SDS framework that is flexible enough to expand to new domains and skills by removing domain-specific assumptions on the dialog state and dialog acts (Bordes and Weston, 2016).
To achieve this goal, the neural encoder-decoder model (Cho et al., 2014; Sutskever et al., 2014) is a suitable choice, since it has achieved promising results in modeling open-domain conversations (Vinyals and Le, 2015; Sordoni et al., 2015). It encodes the dialog history using deep neural networks and then generates the next system utterance word-by-word via recurrent neural networks (RNNs). Therefore, unlike the traditional SDS pipeline, the encoder-decoder model is theoretically only limited by its input/output vocabulary.

A naive implementation of an encoder-decoder-based task-oriented system would use RNNs to encode the raw dialog history and generate the next system utterance using a separate RNN decoder. However, while this implementation might achieve good performance in an offline evaluation on a closed dataset, it would certainly fail when used by humans. There are several reasons for this: 1) real users can mention new entities that do not appear in the training data, such as a new restaurant name. These entities are, however, essential in delivering the information that matches users' needs in a task-oriented system. 2) A task-oriented SDS obtains information from a knowledge base (KB) that is constantly updated ("today's" weather will be different every day), so simply memorizing KB results that occurred in the training data would produce false information. Instead, an effective model should learn to query the KB constantly to get the most up-to-date information. 3) Users may give OOD requests (e.g. saying "how is your day" to a slot-filling system), which must be handled gracefully in order to keep the conversation moving in the intended direction.

This paper proposes an effective encoder-decoder framework for building task-oriented SDSs. We propose entity indexing to tackle the challenges of out-of-vocabulary (OOV) entities and to query the KB.
Moreover, we show the extensibility of the proposed model by adding chatting capability to a task-oriented encoder-decoder SDS for better OOD recovery. This approach was assessed on the Let's Go Bus Information data from the 1st Dialog State Tracking Challenge (Williams et al., 2013), and we report performance on both offline metrics and with real human users. Results show that this model attains good performance on both counts.

Related Work

Past research in developing domain-general dialog systems can be broadly divided into three branches. The first one focuses on learning a domain-independent dialog state representation while still using hand-crafted dialog acts as system actions. Researchers proposed the idea of extracting slot-value independent statistics as the dialog state (Wang et al., 2015; Gašić et al., 2015a), so that the dialog state representation can be shared across systems serving different knowledge sources. Another approach uses RNNs to automatically learn a distributed vector representation of the dialog state by accumulating the observations at each turn (Williams and Zweig, 2016; Zhao and Eskenazi, 2016; Dhingra et al., 2016; Williams et al., 2017). The learned dialog state is then used by the dialog policy to select the next action. The second branch of research develops a domain-general action space for the dialog policy. Prior work replaced the domain-specific dialog acts with a domain-independent natural language semantic schema as the action space of dialog managers (Eshghi and Lemon, 2014), e.g. Dynamic Syntax (Kempson et al., 2000). More recently, Wen et al. (2016) have shown the feasibility of using an RNN as the decoder to generate the system utterances word by word, and the dialog policy of the proposed model can be fine-tuned using reinforcement learning (Su et al., 2016).
Furthermore, to deal with the challenge of developing end-to-end task-oriented dialog models that are able to interface with an external KB, prior work has unified the special KB query actions via deep reinforcement learning (Zhao and Eskenazi, 2016) and soft attention over the database (Dhingra et al., 2016). The third branch strives to solve both problems at the same time by building an end-to-end model that maps an observable dialog history directly to the word sequences of the system's response. Using an encoder-decoder model, this approach has been successfully applied to open-domain conversational models (Serban et al., 2015; Li et al., 2015, 2016; Zhao et al., 2017), as well as to task-oriented systems (Bordes and Weston, 2016; Yang et al., 2016; Eric and Manning, 2017). In order to better predict the next correct system action, this branch has focused on investigating various neural network architectures to improve the machine's ability to reason over user input and model long-term dialog context.

This paper is closely related to the third branch, but differs in the following ways: 1) our models are slot-value independent by leveraging a domain-general entity recognizer, which is more extensible to OOV entities; 2) our models emphasize the interactive nature of dialog and address out-of-domain handling by interleaving chatting with task-oriented conversations; 3) instead of testing on a synthetic dataset, this approach focuses on real-world use by testing the system on human users via a spoken interface.
Proposed Method
Our proposed framework consists of three steps, as shown in Figure 2: a) entity indexing (EI), b) slot-value independent encoder-decoder (SiED), and c) system utterance lexicalization (UL). The intuition is to leverage domain-general named entity recognition (NER) (Tjong Kim Sang and De Meulder, 2003) techniques to extract salient entities in the raw dialog history and convert the lexical values of the entities into entity indexes. The encoder-decoder model is then trained to focus solely on reasoning over the entity indexes in a dialog history and to make decisions about the next utterance to produce (including KB queries). In this way, the model can be unaffected by the inclusion of new entities and new KBs, while maintaining its domain-general input/output interface for easy extension to new types of conversation skills. Lastly, the output from the decoder network is lexicalized by replacing the entity indexes and special KB tokens with natural language. The following sections explain each step in detail.
Entity Indexing

EI has two parts. First, EI utilizes an existing domain-general NER to extract entities from both the user and system utterances. Note that the entity here is assumed to be a super-set of the slots in the domain. For example, a flight-booking system may contain two slots, [from-LOCATION] and [to-LOCATION], for the departure and arrival city, respectively. However, EI only extracts every mention of [LOCATION] in the utterances and leaves the task of distinguishing between departure and arrival to the encoder-decoder model. Furthermore, this step replaces each KB search result with its search query (e.g. the weather is cloudy → [kb-search]-[DATETIME-0]). The second step of EI involves constructing an indexed entity table. Each entity is indexed by its order of occurrence in the conversation. Figure 1 shows an example in which there are two [LOCATION] mentions.

Properties of Entity Indexing
In this section, several properties of EI and their assumptions are addressed. First, each entity is indexed uniquely by its entity type and index. Note that the index is not associated with the entity value, but rather solely with the order of appearance in the dialog. Despite the actual words being hidden, a human can still easily predict which entity the system should confirm or search for in the KB based on logical reasoning. Therefore, EI not only alleviates the OOV problem of deploying the encoder-decoder model in the real world, but also forces the encoder-decoder model to focus on learning the reasoning process of task-oriented dialogs instead of leaning too heavily on language modeling.

(Figure 1: An example of entity indexing and utterance lexicalization.)

Moreover, most slot-filling SDSs, apart from informing the concepts from KBs, usually do not introduce novel entities to users. Instead, systems mostly corroborate the entities introduced by the users. Under this assumption, every entity mention in the system utterances can always be found in the users' utterances in the dialog history, and therefore can also be found in the indexed entity table. This property reduces the grounding behavior of the conventional task-oriented dialog manager to selecting an entity from the indexed entity table and confirming it with the user.
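The EI step can be sketched in a few lines of Python. This is a minimal illustration: the regex patterns below are hypothetical stand-ins for a trained domain-general NER, and the entity table is threaded through every turn so that indexes reflect order of first appearance in the dialog.

```python
import re

# Hypothetical entity patterns standing in for a domain-general NER;
# a real system would use a trained recognizer (e.g. a CRF).
ENTITY_PATTERNS = {
    "LOCATION": r"\b(?:CMU|airport|downtown)\b",
    "TIME": r"\b\d{1,2}:\d{2}\s?(?:AM|PM)\b",
}

def entity_index(utterance, entity_table):
    """Replace each entity mention with a typed index token.

    entity_table maps (entity_type, value) -> index token and must be
    shared across all turns of one dialog, so that indexes reflect the
    order of first appearance in the conversation.
    """
    def replace(match, etype):
        value = match.group(0)
        key = (etype, value)
        if key not in entity_table:
            # next free index for this entity type
            n = sum(1 for (t, _) in entity_table if t == etype)
            entity_table[key] = f"[{etype}-{n}]"
        return entity_table[key]

    for etype, pattern in ENTITY_PATTERNS.items():
        utterance = re.sub(pattern,
                           lambda m, t=etype: replace(m, t),
                           utterance, flags=re.IGNORECASE)
    return utterance

table = {}
print(entity_index("leave from CMU and go to the airport", table))
# -> leave from [LOCATION-0] and go to the [LOCATION-1]
```

Both mentions become [LOCATION-n] tokens; distinguishing departure from arrival is left to the encoder-decoder model, as described above.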
Utterance Lexicalization is the reverse of EI. Since EI is a deterministic process, its effect can always be reversed by finding the corresponding entity in the indexed entity table and replacing the index with its words. For a KB search, a simple string matching algorithm can search for the special [kb-search] token and take the following generated entities as the arguments to the KB. Then the actual KB results can replace the original KB query. Figure 1 shows an example of utterance lexicalization.
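Utterance lexicalization can be sketched by inverting the entity table and string-matching the [kb-search] token. The `kb` callable below is a hypothetical KB interface; any entity values are toy examples.

```python
def lexicalize(utterance, entity_table, kb=None):
    """Reverse of entity indexing: map index tokens back to words.

    entity_table maps (entity_type, value) -> index token, as built
    during EI; we invert it here.  If the utterance contains the
    special [kb-search] token, the tokens after it are the query
    arguments, and the (hypothetical) kb callable supplies the result.
    """
    inverse = {token: value for (_etype, value), token in entity_table.items()}
    words = [inverse.get(w, w) for w in utterance.split()]
    if "[kb-search]" in words and kb is not None:
        args = words[words.index("[kb-search]") + 1:]
        return kb(*args)  # replace the query with the actual KB result
    return " ".join(words)

table = {("LOCATION", "CMU"): "[LOCATION-0]",
         ("LOCATION", "the airport"): "[LOCATION-1]"}
print(lexicalize("leaving from [LOCATION-0] , is that right ?", table))
# -> leaving from CMU , is that right ?
```

For a generated KB query such as "[kb-search] [LOCATION-0] [LOCATION-1]", passing `kb=lambda *a: ...` replaces the query with the up-to-date KB result instead of a memorized one.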
The encoder-decoder model can then read in the EI-processed dialog history and predict the system's next utterance in EI format. Specifically, a dialog history of k turns is represented by [(a_1, u_1, c_1), ..., (a_{k-1}, u_{k-1}, c_{k-1})], in which a_i, u_i and c_i are, respectively, the system utterance, user utterance and ASR confidence score at turn i. Each utterance in the dialog history is encoded into a fixed-size vector using the Convolutional Neural Networks (CNNs) proposed in (Kim, 2014). Specifically, each word in an utterance x is mapped to its word embedding, so that an utterance is represented as a matrix R ∈ R^{|x|×D}, in which D is the size of the word embedding. Then L filters of window sizes 1, 2 and 3 conduct convolutions on R to obtain a feature map c of n-gram features in window sizes 1, 2 and 3. c is then passed through a nonlinear ReLU (Glorot et al., 2011) layer, followed by a max-pooling layer, to obtain a compact summary of salient n-gram features, i.e. e_u(x) = maxpool(ReLU(c + b)). Using CNNs to capture word-order information is crucial, because the encoder-decoder has to be able to distinguish fine-grained differences between entities. For example, a simple bag-of-words embedding approach will fail to distinguish between the two location entities in "leave from [LOCATION-0] and go to [LOCATION-1]", while a CNN encoder can capture the context information of these two entities.

(Figure 2: The proposed pipeline for task-oriented dialog systems.)

After obtaining the utterance embeddings, a turn-level dialog history encoder network similar to the one proposed in (Zhao and Eskenazi, 2016) is used. The turn embedding is a simple concatenation of the system utterance embedding, the user utterance embedding and the confidence score: t_i = [e_u(a_i); e_u(u_i); c_i].
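The CNN utterance encoder can be sketched in pure Python with toy dimensions. This is only an illustration of the convolution → ReLU → max-pool pipeline: the embeddings and filter weights are random stand-ins, and a real implementation would use a neural network library with learned parameters (the paper uses 100 feature maps per window size, not 3).

```python
import random
random.seed(0)

EMB = 4       # word embedding size D (toy; the paper uses 100)
FILTERS = 3   # feature maps per window size (toy; the paper uses 100)

def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]

# toy embedding lookup and one filter bank per window size 1, 2, 3
vocab = ["leave", "from", "[LOCATION-0]", "go", "to", "[LOCATION-1]"]
embedding = {w: rand_vec(EMB) for w in vocab}
filters = {n: [rand_vec(n * EMB) for _ in range(FILTERS)] for n in (1, 2, 3)}

def encode_utterance(words):
    """CNN sentence encoder: n-gram convolutions -> ReLU -> max-pool."""
    features = []
    for n in (1, 2, 3):
        for f in filters[n]:
            activations = []
            for i in range(len(words) - n + 1):
                # concatenate the n word embeddings in this window
                window = sum((embedding[w] for w in words[i:i + n]), [])
                score = sum(a * b for a, b in zip(f, window))
                activations.append(max(0.0, score))                    # ReLU
            features.append(max(activations) if activations else 0.0)  # max-pool
    return features  # fixed size regardless of utterance length

e = encode_utterance("leave from [LOCATION-0] go to [LOCATION-1]".split())
print(len(e))  # -> 9, i.e. 3 window sizes x 3 feature maps
```

Because the window-2 and window-3 filters see word order, "leave from [LOCATION-0]" and "go to [LOCATION-0]" produce different features, which is exactly what a bag-of-words average cannot do.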
Then a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) network reads the sequence of turn embeddings in the dialog history via the recursive state update s_{i+1} = LSTM(t_{i+1}, h_i), in which h_i is the output of the LSTM hidden state.

Decoding with/without Attention
A vanilla decoder takes in the last hidden state of the encoder as its initial state and decodes the next system utterance word by word, as shown in (Sutskever et al., 2014). This assumes that the fixed-size hidden state is expressive enough to encode all important information about the history of a dialog. However, this assumption may often be violated for a task that has long-term dependencies or requires complex reasoning over the entire source sequence. The attention mechanism proposed in (Bahdanau et al., 2014) in the machine translation community has helped encoder-decoder models improve state-of-the-art performance in various tasks (Bahdanau et al., 2014; Xu et al., 2015). Attention allows the decoder to look over every hidden state in the encoder and dynamically decide the importance of each hidden state at each decoding step, which significantly improves the model's ability to handle long-term dependencies. We experiment with decoders both with and without attention. Attention is computed similarly to the multiplicative attention described in (Luong et al., 2015). We denote the hidden state of the decoder at time step j by s_j, and the hidden state output of the encoder at turn i by h_i. We then predict the next word by:

    a_{ji} = softmax(h_i^T W_a s_j + b_a)       (1)
    c_j = Σ_i a_{ji} h_i                        (2)
    s̃_j = tanh(W_s [s_j; c_j])                 (3)
    p(w_j | s_j, c_j) = softmax(W_o s̃_j)       (4)

The decoder's next state is updated by s_{j+1} = LSTM(s_j, e(w_{j+1}), s̃_j).

Leveraging Chat Data to Improve OOD Recovery

Past work has shown that simple supervised learning is usually inadequate for learning a robust sequential decision-making policy (Williams and Young, 2003; Ross et al., 2011). This is because the model is only exposed to the expert demonstration, but not to examples of how to recover from its own mistakes or from users' OOD requests.
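Returning to the attentional decoder, Equations (1)-(4) can be sketched numerically in pure Python. The dimensions are toy values and the weight matrices are random stand-ins; the point is only to show how the attention weights, context vector, and output distribution are computed.

```python
import math
import random
random.seed(0)

H = 4  # hidden size of both encoder and decoder (toy)
V = 5  # toy output vocabulary size

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

W_a = rand_matrix(H, H)       # attention weight matrix
b_a = 0.0                     # attention bias, taken as 0 for simplicity
W_s = rand_matrix(H, 2 * H)   # combines [s_j; c_j]
W_o = rand_matrix(V, H)       # output projection

def attention_step(s_j, encoder_states):
    """One decoding step of multiplicative attention, Eqs. (1)-(4)."""
    scores = [dot(h_i, matvec(W_a, s_j)) + b_a for h_i in encoder_states]
    a_j = softmax(scores)                                        # Eq. (1)
    c_j = [sum(a * h[k] for a, h in zip(a_j, encoder_states))
           for k in range(H)]                                    # Eq. (2)
    s_tilde = [math.tanh(x) for x in matvec(W_s, s_j + c_j)]     # Eq. (3); s_j + c_j concatenates
    p_w = softmax(matvec(W_o, s_tilde))                          # Eq. (4)
    return a_j, p_w

encoder_states = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(3)]  # 3 turns
a_j, p_w = attention_step([0.0] * H, encoder_states)
# a_j: one weight per encoder turn; p_w: distribution over the vocabulary
```

Both `a_j` and `p_w` are proper distributions (they sum to 1), and `a_j` has one entry per encoder hidden state, which is what makes the per-step attention weights visualizable.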
We present a simple yet effective technique that leverages the extensibility of the encoder-decoder model in order to obtain a more robust policy in the supervised learning setting. Specifically, we artificially augment a task-oriented dialog dataset with chat data from an open-domain conversation corpus. This has been shown to be effective in improving the performance of task-oriented systems (Yu et al., 2017). Let the original dialog dataset with N dialogs be D = [d_1, ..., d_n, ..., d_N], where d_n is a multi-turn task-oriented dialog of |d_n| turns. Furthermore, we assume we have access to a chat dataset D_c = [(q_1, r_1), ..., (q_m, r_m), ..., (q_M, r_M)], where q_m, r_m are common adjacency pairs that appear in chats (e.g. q = hello, r = hi, how are you). Then we can create a new dataset D* by repeating the following process a certain number of times:

1. Randomly sample a dialog d_n from D
2. Randomly sample a turn t_i = [a_i, u_i] from d_n
3. Randomly sample an adjacency pair (q_m, r_m) from D_c
4. Replace the user utterance of t_i with q_m, so that t_i = [a_i, q_m]
5. Insert a new turn after t_i, i.e. t_{i+1} = [r_m + e_{i+1}, u_i]

(Figure 3: Illustration of data augmentation. The turn in the dashed line is inserted in the original dialog.)

In Step 5, e_{i+1} is an error-handling system utterance issued after the system answers the user's OOD request q_m. In this study, we experimented with the simple case where e_{i+1} = a_i, so that the system repeats its previous prompt after responding to q_m via r_m. Figure 3 shows an example of an augmented turn. Eventually, we train the model on the union of the two datasets, D+ = D ∪ D*.

Discussion: There are several reasons that the above data augmentation process is appealing. First, the model effectively learns an OOD recovery strategy from D*, i.e. it first gives chatting answers to users' OOD requests and then tries to pull users back to the main-task conversation. Second, chat data usually has a larger vocabulary and more diverse natural language expressions, which can reduce the chance of OOVs and enable the model to learn more robust word embeddings and language models.

Dataset

The CMU Let's Go Bus Information System (Raux et al., 2005) is a task-oriented spoken dialog system that provides bus information. We combined the train1a and train1b datasets from DSTC 1 (Williams et al., 2013), which contain 2608 dialogs in total. The average dialog length is 9.07 turns. The dialogs were randomly split into 85/5/10 proportions for train/dev/test data. The data is noisy, since the dialogs were collected from real users via telephone lines. Furthermore, this version of Let's Go used an in-house database containing the Port Authority bus schedule. In the current version, that database was replaced with the Google Directions API, which both reduces the human burden of maintaining a database and opens the possibility of extending Let's Go to cities other than Pittsburgh.
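The five-step augmentation procedure described above can be sketched as follows. This is a minimal illustration: the dialog and chat data are toy stand-ins, and the error-handling utterance simply repeats the previous system prompt (e_{i+1} = a_i), as in the experiments.

```python
import random
random.seed(1)

def augment(dialogs, chat_pairs, n_inject):
    """Create D* by splicing chat adjacency pairs into task dialogs.

    Each injection replaces a sampled user utterance with an OOD chat
    query q_m, then inserts a new turn whose system side is the chat
    reply r_m followed by the error-handling prompt (here: the repeated
    previous system prompt), keeping the original user reply u_i.
    """
    augmented = []
    for _ in range(n_inject):
        d = [list(turn) for turn in random.choice(dialogs)]  # step 1 (copied)
        i = random.randrange(len(d))                         # step 2
        q, r = random.choice(chat_pairs)                     # step 3
        a_i, u_i = d[i]
        d[i] = [a_i, q]                                      # step 4
        d.insert(i + 1, [r + " " + a_i, u_i])                # step 5, e = a_i
        augmented.append(d)
    return augmented

dialogs = [[["Where do you want to go?", "to the airport"],
            ["Leaving at what time?", "10:30 AM"]]]
chat = [("hello", "hi, how are you")]
d_star = augment(dialogs, chat, n_inject=1)
# the augmented dialog has one extra turn: chat reply + repeated prompt
```

Training on D ∪ D* then exposes the model to OOD requests and to the recovery move that follows them, which plain expert demonstrations never contain.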
Connecting to the Google Directions API involves a POST call to their URL with our access key as well as the required parameters: the departure place, arrival place, departure time, and the travel mode, which we always set to TRANSIT to obtain relevant bus routes. There are 14 distinct dialog acts available to the system, and each system utterance contains one or more dialog acts. Lastly, the system vocabulary size is 1311 and the user vocabulary size is 1232. After the EI process, the sizes become 214 and 936, respectively.

For chat data, we use the publicly available chat corpus used in (Yu et al., 2015) (github.com/echoyuzhou/ticktock text api). In total, there are 3793 chatting adjacency pairs. We limit the number of injected turns to 30% of the number of turns in the original DSTC dataset, which leads to a user vocabulary size of 3537 and a system vocabulary size of 4047.

Training Details

For all experiments, the word embedding size was 100. The sizes of the LSTM hidden states for both the encoder and decoder were 500, with 1 layer. The attention context size was also 500. We tied the CNN weights for encoding system and user utterances. Each CNN has 3 filter window sizes, 1, 2 and 3, with 100 feature maps each. We trained the model end-to-end using Adam (Kingma and Ba, 2014), with a learning rate of 1e-3 and a batch size of 40. To combat overfitting, we apply dropout (Zaremba et al., 2014) to the LSTM layer outputs and to the CNN outputs after the max-pooling layer, with a dropout rate of 40%.
Evaluation

This approach was assessed in both offline and online evaluations. The offline evaluation uses standard metrics for open-domain encoder-decoder dialog models (Li et al., 2015; Serban et al., 2015). System performance was assessed from three perspectives that are essential for task-oriented systems: dialog acts, slot-values, and KB queries. The online evaluation consists of the objective task success rate, the number of turns, and subjective satisfaction with human users.
Dialog Acts: Each system utterance is made up of one or more dialog acts, e.g. "leaving at [TIME-0], where do you want to go?" → [implicit-confirm, request(arrival place)]. To evaluate whether a generated utterance has the same dialog acts as the ground truth, we trained a multi-label dialog act tagger using one-vs-rest Support Vector Machines (SVMs) (Tsoumakas and Katakis, 2006), with bag-of-bigram features for each dialog act label. Since the natural language generation module in Let's Go is handcrafted, the dialog act tagger achieved 99.4% average label accuracy on a held-out dataset. We used this dialog act tagger to tag both the ground truth and the generated responses. Then we computed the micro-averaged precision, recall, and F-score.

Slots:
This metric measures the model's performance in generating the correct slot-values. The slot-values mostly occur in grounding utterances (e.g. explicit/implicit confirms) and KB queries. We compute precision, recall, and F-score.
KB Queries:
Although the slots metric already covers the KB queries, the precision/recall/F-score of system utterances that contain KB queries are also measured explicitly here, due to their importance. Specifically, this metric measures whether the system is able to generate the special [kb-query] symbol to initiate a KB query, as well as how accurate the corresponding KB query arguments are.
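All three metrics above reduce to micro-averaged precision/recall/F1 over per-utterance label sets (dialog acts, slot-values, or KB-query arguments alike). A minimal sketch, with hypothetical example labels:

```python
def micro_prf(predicted, reference):
    """Micro-averaged precision/recall/F1 over parallel lists of label sets.

    Counts are pooled across all utterances before computing the
    ratios, so frequent labels dominate (micro-averaging).
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)   # correctly predicted labels
        fp += len(pred - ref)   # predicted but not in the reference
        fn += len(ref - pred)   # in the reference but missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

pred = [{"implicit-confirm", "request"}, {"kb-query"}]
ref = [{"implicit-confirm", "request"}, {"inform"}]
print(micro_prf(pred, ref))  # -> p = r = f1 = 2/3
```

The same function serves dialog acts (sets of act labels), slots (sets of slot-value pairs), and KB queries (sets of query arguments), by changing only what goes into the sets.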
BLEU (Papineni et al., 2002): BLEU compares n-gram precision with a length penalty, and has been a popular score for evaluating the performance of natural language generation (Wen et al., 2015) and open-domain dialog models (Li et al., 2016). Corpus-level BLEU-4 is reported.

Table 1: Performance of each model on automatic measures.

    Metrics        Vanilla          EI               EI+Attn          EI+Attn+Chat
    DA (p/r/f1)    83.5/77.9/80.5   79.7/80.1/80.0   80.0/83.1/81.5   81.8/83.5/82.7
    Slot (p/r/f1)  42.0/30.3/35.2   60.6/63.6/62.1   63.7/64.7/64.2   64.6/69.1/66.8
    KB (p/r/f1)    N/A              48.9/55.3/51.9   55.4/70.8/62.2   58.2/71.9/64.4
    BLEU           36.9             54.6             59.3             60.5

Four systems were compared: the basic encoder-decoder model without EI (Vanilla), the basic model with EI pre-processing (EI), the model with an attentional decoder (EI+Attn), and the model trained on the dataset augmented with chat data (EI+Attn+Chat). The comparison was carried out on exactly the same held-out test set, which contains 261 dialogs. Table 1 shows the results. It can be seen that all four models achieve similar performance on the dialog act metrics, even the vanilla model. This confirms the capacity of encoder-decoder models to learn the "shape" of a conversation, since they have achieved impressive results in more challenging settings, e.g. modeling open-domain conversations. Furthermore, since the DSTC1 data was collected over several months, minor updates were made to the dialog manager during that period. Therefore, there are inherent ambiguities in the data (the dialog manager may take different actions in the same situation). We conjecture that ∼
80% is near the upper limit for our data in modeling the system's next dialog act given the dialog history.

On the other hand, the proposed methods significantly improved the metrics related to slots and KB queries. The inclusion of EI alone improved the F-score on slots by a relative 76%, which confirms that EI is crucial in developing slot-value independent encoder-decoder models for task-oriented dialogs. Likewise, the inclusion of attention further improved the prediction of slots in system utterances. Adding attention also improved the performance on predicting KB queries, more so than the overall slot accuracy. This is expected, since KB queries are usually issued near the end of a conversation, which requires global reasoning over the entire dialog history. Attention allows the decoder to look over the history and make better decisions, rather than simply depending on the context summary in the last hidden layer of the encoder. Because of the good performance achieved by the models with the attentional decoder, we visualized the attention weights of Equation 1 at every step of the decoding process for two example dialogs from the test data. For both figures, the vertical axes show the dialog history flowing from top to bottom. Each row is a turn in the format of (system utterance, user utterance).
Real User Evaluation

Although the model achieves good performance in the offline evaluation, this may not carry over to real user dialogs, where the system must simultaneously deal with several challenges, such as automatic speech recognition (ASR) errors, OOD requests, etc. Therefore, a study with real users was conducted to evaluate the performance of the proposed systems in the real world. Due to the limited number of real users, only the two best-performing systems were compared: EI+Attn and EI+Attn+Chat. Users talked to the dialog systems via speech through a web interface. The Google Chrome Speech API served as the ASR and text-to-speech (TTS) modules. Turn-taking was handled by the built-in Chrome voice activity detection (VAD) plus a finite state machine-based end-of-turn detector (Zhao et al., 2015). Lastly, a hybrid named entity recognizer (NER) was trained using a Conditional Random Field (CRF) (McCallum and Li, 2003) and rules to extract 4 types of entities (location, hour, minute, am/pm) for the EI process.

The experiment setup is as follows: when a user logs into the website, the system prompts the user with a goal, which is a randomly chosen combination of departure place, arrival place and time (e.g. leave from CMU and go to the airport at 10:30 AM). The system also instructs the user to say goodbye if he/she thinks the goal has been achieved or wants to give up. The user then begins a conversation with one of the two evaluated systems, chosen with a 50/50 chance (not visible to the user). After the session is finished, the system asks the user to give two scores between 1 and 5 for the correctness and naturalness of the system, respectively. The subjects in this study consist of undergraduate and graduate students. However, many subjects did not follow the prompted goal, but rather asked about bus routes of their own. Therefore, the dialogs were manually labeled for dialog success. A dialog is successful if and only if the system gives at least one bus schedule that matches all three slots expressed by the user. Table 2 shows the results for the two systems; the difference was not statistically significant due to the limited number of subjects. The precision of grounding the correct slots and predicting the correct KB query was also manually labeled. The EI+Attn model performs slightly better than the EI+Attn+Chat model in slot precision, while the latter performs significantly better in KB query precision. In addition, EI+Attn+Chat leads to slightly longer dialogs, because it sometimes generates chatting utterances when it cannot understand users' utterances.

Finally, we investigated the log files and identified the following major sources of dialog failure:

RNN Decoder Invalid Output:
Occasionally, the RNN decoder outputs system utterances such as "Okay, going to [LOCATION-2]. Did I get that right?", in which [LOCATION-2] cannot be found in the indexed entity table. Such invalid output confuses users. This occurred in 149 of the dialogs, where 4.1% of system utterances contained invalid symbols.
Imitation of Suboptimal Dialog Policy: Since our models are only trained to imitate the suboptimal hand-crafted dialog policy, their limitations show when the original dialog manager cannot handle the situation, such as failing to understand slots that appear in compound utterances. Future plans involve improving the models to perform better than the suboptimal teacher policy.
Conclusion

In conclusion, this paper discussed constructing task-oriented dialog systems using generative encoder-decoder models. EI is effective in solving both the OOV entity and the KB query challenges for encoder-decoder-based task-oriented SDSs. Additionally, the novel data augmentation technique of interleaving a task-oriented dialog corpus with chat data led to better model performance in both online and offline evaluation. Future work includes developing more advanced encoder-decoder models that better deal with long-term dialog history and complex reasoning challenges than current models do. Furthermore, inspired by the success of mixing chatting with slot-filling dialogs, we will take full advantage of the extensibility of encoder-decoder models by investigating how to make systems that can interleave various conversational tasks, e.g. different domains, chatting or task-oriented, which in turn can create a more versatile conversational agent.

References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Dan Bohus and Alexander I. Rudnicky. 2003. RavenClaw: Dialog management using hierarchical task decomposition and an expectation agenda.

Dan Bohus and Alexander I. Rudnicky. 2005. Error handling in the RavenClaw dialog management framework. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 225-232.

Antoine Bordes and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Renato De Mori, Frédéric Bechet, Dilek Hakkani-Tur, Michael McTear, Giuseppe Riccardi, and Gokhan Tur. 2008. Spoken language understanding. IEEE Signal Processing Magazine.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2016. Towards end-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777.

Mihail Eric and Christopher D. Manning. 2017. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. arXiv preprint arXiv:1701.04024.

Arash Eshghi and Oliver Lemon. 2014. How domain-general can we be? Learning incremental dialogue systems without dialogue acts. In DialWatt/SemDial 2014, page 53.

M. Gašić, N. Mrkšić, Pei-hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2015a. Policy committee for adaptation in multi-domain spoken dialogue systems. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, pages 806-812.

Milica Gašić, Dongho Kim, Pirros Tsiakoulis, and Steve Young. 2015b. Distributed dialogue policies for multi-domain statistical dialogue management. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, pages 5371-5375.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In AISTATS, volume 15, page 275.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Ruth Kempson, Wilfried Meyer-Viol, and Dov Gabbay. 2000. Dynamic Syntax: The Flow of Language Understanding. Wiley-Blackwell.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Cheongjae Lee, Sangkeun Jung, Seokhwan Kim, and Gary Geunbae Lee. 2009. Example-based dialog modeling for practical multi-domain dialog system. Speech Communication.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Andrew McCallum and Wei Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. Association for Computational Linguistics, pages 188-191.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 311-318.

Antoine Raux, Brian Langner, Dan Bohus, Alan W. Black, and Maxine Eskenazi. 2005. Let's go public! Taking a spoken dialog system to the real world. In Proceedings of Interspeech 2005.

Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, volume 1, page 6.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015. Building end-to-end dialogue systems using generative hierarchical neural network models. arXiv preprint arXiv:1507.04808.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.

Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Continuously learning neural dialogue management. arXiv preprint arXiv:1606.02689.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
Advances in neural information process-ing systems . pages 3104–3112.Erik F Tjong Kim Sang and Fien De Meulder.2003. Introduction to the conll-2003 shared task:Language-independent named entity recognition. In
Proceedings of the seventh conference on Naturallanguage learning at HLT-NAACL 2003-Volume 4 .Association for Computational Linguistics, pages142–147.Grigorios Tsoumakas and Ioannis Katakis. 2006.Multi-label classification: An overview.
Interna-tional Journal of Data Warehousing and Mining arXiv preprint arXiv:1506.05869 .Zhuoran Wang, Tsung-Hsien Wen, Pei-Hao Su,and Yannis Stylianou. 2015. Learning domain-independent dialogue policies via ontology parame-terisation. In . page 412.Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015.Semantically conditioned lstm-based natural lan-guage generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745 .Tsung-Hsien Wen, David Vandyke, Nikola Mrksic,Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su,Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialoguesystem. arXiv preprint arXiv:1604.04562 .Jason Williams, Antoine Raux, Deepak Ramachan-dran, and Alan Black. 2013. The dialog state track-ing challenge. In
Proceedings of the SIGDIAL 2013Conference . pages 404–413.Jason Williams and Steve Young. 2003. Using wizard-of-oz simulations to bootstrap reinforcement-learning-based dialog management systems. In
Proceedings of the 4th SIGDIAL Workshop onDiscourse and Dialogue .Jason D Williams, Kavosh Asadi, and GeoffreyZweig. 2017. Hybrid code networks: practical and efficient end-to-end dialog control with super-vised and reinforcement learning. arXiv preprintarXiv:1702.03274 .Jason D Williams and Geoffrey Zweig. 2016. End-to-end lstm-based dialog control optimized with su-pervised and reinforcement learning. arXiv preprintarXiv:1606.01269 .Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho,Aaron C Courville, Ruslan Salakhutdinov, Richard SZemel, and Yoshua Bengio. 2015. Show, attend andtell: Neural image caption generation with visual at-tention. In
ICML . volume 14, pages 77–81.Zichao Yang, Phil Blunsom, Chris Dyer, and WangLing. 2016. Reference-aware language models. arXiv preprint arXiv:1611.01628 .Steve J Young. 2006. Using pomdps for dialog man-agement. In
SLT . pages 8–13.Zhou Yu, Alan W Black, and Alexander I Rudnicky.2017. Learning conversational systems that inter-leave task and non-task content. arXiv preprintarXiv:1703.00099 .Zhou Yu, Alexandros Papangelis, and Alexander Rud-nicky. 2015. Ticktock: A non-goal-oriented mul-timodal dialog system with engagement awareness.In
Proceedings of the AAAI Spring Symposium .Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals.2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 .Tiancheng Zhao, Alan W Black, and Maxine Eskenazi.2015. An incremental turn-taking model with ac-tive system barge-in for spoken dialog systems. In . page 42.Tiancheng Zhao and Maxine Eskenazi. 2016. Towardsend-to-end learning for dialog state tracking andmanagement using deep reinforcement learning. In .Tiancheng Zhao, Maxine Eskenazi, and Kyusong Lee.2016. Dialport: A general framework for aggregat-ing dialog systems.
EMNLP 2016 page 32.Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi.2017. Learning discourse-level diversity for neuraldialog models using conditional variational autoen-coders. arXiv preprint arXiv:1703.10960arXiv preprint arXiv:1703.10960