Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems
Bing Liu∗, Gokhan Tur, Dilek Hakkani-Tur, Pararth Shah, Larry Heck†
Carnegie Mellon University, Pittsburgh, PA, USA
Google Research, Mountain View, CA, USA
Samsung Research, Mountain View, CA, USA
[email protected], {dilekh,pararth}@google.com, {gokhan.tur,larry.heck}@ieee.org

∗ Work done while the author was an intern at Google.
† Work done while at Google Research.

Abstract
In this work, we present a hybrid learning method for training task-oriented dialogue systems through online user interactions. Popular methods for learning task-oriented dialogues include applying reinforcement learning with user feedback on supervised pre-training models. The efficiency of such learning methods may suffer from the mismatch of dialogue state distributions between the offline training and online interactive learning stages. To address this challenge, we propose a hybrid imitation and reinforcement learning method, with which a dialogue agent can effectively learn from its interactions with users by learning from human teaching and feedback. We design a neural network based task-oriented dialogue agent that can be optimized end-to-end with the proposed learning method. Experimental results show that our end-to-end dialogue agent can learn effectively from the mistakes it makes via imitation learning from user teaching. Applying reinforcement learning with user feedback after the imitation learning stage further improves the agent's capability in successfully completing a task.
1 Introduction

Task-oriented dialogue systems assist users to complete tasks in specific domains by understanding the user's request and aggregating useful information from external resources within several dialogue turns. Conventional task-oriented dialogue systems have a complex pipeline (Rudnicky et al., 1999; Raux et al., 2005; Young et al., 2013) consisting of independently developed and modularly connected components for natural language understanding (NLU) (Mesnil et al., 2015; Liu and Lane, 2016; Hakkani-Tür et al., 2016), dialogue state tracking (DST) (Henderson et al., 2014c;
Mrkšić et al., 2016), and dialogue policy learning (Gasic and Young, 2014; Shah et al., 2016; Su et al., 2016, 2017). These system components are usually trained independently, and their optimization targets may not fully align with the overall system evaluation criteria (e.g. task success rate and user satisfaction). Moreover, errors made in the upstream modules of the pipeline propagate to downstream components and get amplified, making it hard to track the source of errors.

To address these limitations of conventional task-oriented dialogue systems, recent efforts have been made in designing end-to-end learning solutions with neural network based methods. Both supervised learning (SL) based (Wen et al., 2017; Bordes and Weston, 2017; Liu and Lane, 2017a) and deep reinforcement learning (RL) based systems (Zhao and Eskenazi, 2016; Li et al., 2017; Peng et al., 2017) have been studied in the literature. Compared to chit-chat dialogue models that are usually trained offline using single-turn context-response pairs, a task-oriented dialogue model involves reasoning and planning over multiple dialogue turns. This makes it especially important for a system to be able to learn from users in an interactive manner. Compared to SL models, systems trained with RL by receiving feedback during user interactions showed improved model robustness against diverse dialogue scenarios (Williams and Zweig, 2016; Liu and Lane, 2017b).

A critical step in learning RL based task-oriented dialogue models is dialogue policy learning. Training a dialogue policy online from scratch typically requires a large number of interactive learning sessions before an agent can reach a satisfactory performance level. Recent works (Henderson et al., 2008; Williams et al., 2017; Liu et al., 2017) explored pre-training the dialogue model using human-human or human-machine dialogue corpora before performing interactive learning with RL to address this concern. A potential drawback of such a pre-training approach is that the model may suffer from the mismatch of dialogue state distributions between the supervised training and interactive learning stages. While interacting with users, the agent's response at each turn has a direct influence on the distribution of dialogue states that the agent will operate on in the upcoming dialogue turns. If the agent makes a small mistake and reaches an unfamiliar state, it may not know how to recover from it and get back to a normal dialogue trajectory. This is because such recovery situations may be rare for good human agents and thus are not well covered in the supervised training corpus. This results in compounding errors in a dialogue, which may lead to failure of the task. RL exploration might eventually find the actions needed to recover from a bad state, but the search process can be very inefficient.

To ameliorate the effect of dialogue state distribution mismatch between offline training and RL interactive learning, we propose a hybrid imitation and reinforcement learning method. We first let the agent interact with users using its own policy learned from supervised pre-training. When the agent makes a mistake, we ask users to correct the mistake by demonstrating to the agent the right actions to take at each turn. This user-corrected dialogue sample, which is guided by the agent's own policy, is then added to the existing training corpus.
We fine-tune the dialogue policy with this dialogue sample aggregation (Ross et al., 2011) and continue such a user teaching process for a number of cycles. Since asking for user teaching at each dialogue turn is costly, we want to reduce the number of user teaching cycles as much as possible and continue the learning process with RL by collecting simple forms of user feedback (e.g. a binary feedback, positive or negative) only at the end of a dialogue.

Our main contributions in this work are:

• We design a neural network based task-oriented dialogue system which can be optimized end-to-end for natural language understanding, dialogue state tracking, and dialogue policy learning.

• We propose a hybrid imitation and reinforcement learning method for end-to-end model training that addresses the challenge of dialogue state distribution mismatch between offline training and interactive learning.

The remainder of the paper is organized as follows. In Section 2, we discuss related work in building end-to-end task-oriented dialogue systems. In Section 3, we describe the proposed model and learning method in detail. In Section 4, we describe the experiment setup and discuss the results. Section 5 gives the conclusions.
2 Related Work

Popular approaches in learning task-oriented dialogue include modeling the task as a partially observable Markov Decision Process (POMDP) (Young et al., 2013). RL can be applied in the POMDP framework to learn the dialogue policy online by interacting with users (Gašić et al., 2013). The dialogue state and system action space have to be carefully designed in order to make the policy learning tractable (Young et al., 2013), which limits the model's usage to restricted domains.

Recent efforts have been made in designing end-to-end solutions for task-oriented dialogues, inspired by the success of encoder-decoder based neural network models in non-task-oriented conversational systems (Serban et al., 2015; Li et al., 2016). Wen et al. (2017) designed an end-to-end trainable neural dialogue model with modularly connected system components. This system is a supervised learning model which is evaluated on fixed dialogue corpora. It is unknown how well the model performance generalizes to unseen dialogue states during user interactions. Our system is trained by a combination of supervised and deep RL methods, as it has been shown that RL may effectively improve dialogue success rate by exploring a large dialogue action space (Henderson et al., 2008; Li et al., 2017).

Bordes and Weston (2017) proposed a task-oriented dialogue model using end-to-end memory networks. In the same line of research, people explored using query-regression networks (Seo et al., 2016), gated memory networks (Liu and Perez, 2017), and copy-augmented networks (Eric and Manning, 2017) to learn the dialogue state. These systems directly select a final response from a list of response candidates conditioned on the dialogue history, without doing slot filling or user goal tracking. Our model, on the other hand, explicitly tracks the user's goal for effective integration with knowledge bases (KBs). Robust dialogue state tracking has been shown (Jurčíček et al., 2012) to be critical in improving dialogue success in task completion.

Dhingra et al. (2017) proposed an end-to-end RL dialogue agent for information access. Their model focuses on bringing differentiability to the KB query operation by introducing a "soft" retrieval process in selecting the KB entries. Such a soft-KB lookup is prone to entity updates and additions in the KB, which are common in real world information systems. In our model, we use symbolic queries and leave the selection of KB entities to external services (e.g. a recommender system), as entity ranking in real world systems can be made with much richer features (e.g. user profiles, location and time context, etc.). The quality of the generated symbolic query is directly related to the belief tracking performance. In our proposed end-to-end system, belief tracking can be optimized together with other system components (e.g. language understanding and policy) during interactive learning with users.

Williams et al. (2017) proposed a hybrid code network for task-oriented dialogue that can be trained with supervised and reinforcement learning. They show that RL performed on a supervised pre-training model using labeled dialogues improves the learning speed dramatically. They did not discuss the potential issue of dialogue state distribution mismatch between supervised pre-training and RL interactive learning, which is addressed in our dialogue learning framework.
3 Proposed Method

Figure 1 shows the overall system architecture of the proposed end-to-end task-oriented dialogue model. We use a hierarchical LSTM neural network to encode a dialogue as a sequence of turns. User input to the system in natural language format is encoded to a continuous vector via a bidirectional LSTM utterance encoder. This user utterance encoding, together with the encoding of the previous system action, serves as the input to a dialogue-level LSTM. The state of this dialogue-level LSTM maintains a continuous representation of the dialogue state. Based on this state, the model generates a probability distribution over candidate values for each of the tracked goal slots. A query command can then be formulated with the state tracking outputs and issued to a knowledge base to retrieve the requested information. Finally, the system produces a dialogue action, which is conditioned on information from the dialogue state, the estimated user's goal, and the encoding of the query results. This dialogue action, together with the user goal tracking results and the query results, is used to generate the final natural language system response via a natural language generator (NLG). We describe each core model component in detail in the following sections.
We use a bidirectional LSTM to encode the user utterance to a continuous representation. We refer to this LSTM as the utterance-level LSTM. The user utterance vector is generated by concatenating the last forward and backward LSTM states. Let $U_k = (w_1, w_2, \ldots, w_{T_k})$ be the user utterance at turn $k$ with $T_k$ words. These words are first mapped to an embedding space and further serve as the step inputs to the bidirectional LSTM. Let $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ represent the forward and backward LSTM state outputs at time step $t$. The user utterance vector $U_k$ is produced by $U_k = [\overrightarrow{h}_{T_k}, \overleftarrow{h}_1]$, where $\overrightarrow{h}_{T_k}$ and $\overleftarrow{h}_1$ are the last states in the forward and backward LSTMs.
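To make this concrete, the following is a minimal PyTorch sketch of the bidirectional utterance encoder; the class and variable names are ours (not from a released implementation), and the embedding size of 300 and hidden size of 150 follow the training settings reported in the experiments section.

import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    # Bidirectional LSTM utterance encoder producing U_k as the
    # concatenation of the final forward and final backward states.
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=150):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids):
        # word_ids: (batch, T_k) token ids of the user utterance at turn k
        embedded = self.embedding(word_ids)         # (batch, T_k, embed_dim)
        _, (h_n, _) = self.bilstm(embedded)         # h_n: (2, batch, hidden_dim)
        # h_n[0]: last forward state; h_n[1]: last backward state
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # U_k: (batch, 2 * hidden_dim)

For example, encoding a single 12-token utterance with UtteranceEncoder(vocab_size=10000) yields a 300-dimensional utterance vector $U_k$.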
Dialogue state tracking, or belief tracking, maintains the state of a conversation, such as the user's goals, by accumulating evidence along the sequence of dialogue turns. Our model maintains the dialogue state in a continuous form in the dialogue-level LSTM ($\mathrm{LSTM}_D$) state $s_k$. $s_k$ is updated after the model processes each dialogue turn by taking in the encoding of the user utterance $U_k$ and the encoding of the previous turn's system output $A_{k-1}$. This dialogue state serves as the input to the dialogue state tracker. The tracker updates its estimation of the user's goal, represented by a list of slot-value pairs. A probability distribution $P(l^m_k)$ is maintained over candidate values for each goal slot type $m \in M$:

$s_k = \mathrm{LSTM}_D(s_{k-1}, [U_k, A_{k-1}])$   (1)

$P(l^m_k \mid U_{\le k}, A_{<k}) = \mathrm{SlotDist}_m(s_k)$   (2)

where $\mathrm{SlotDist}_m$ maps the dialogue state $s_k$ to a distribution over candidate values for slot type $m$.

Figure 1: Proposed end-to-end task-oriented dialogue system architecture.

An API call to the KB is produced by replacing the tokens in a query command template with the best hypothesis for each goal slot from the dialogue state tracking output. Alternatively, an n-best list of API calls can be generated with the most probable candidate values for the tracked goal slots. In interfacing with KBs, instead of using a soft KB lookup as in (Dhingra et al., 2017), our model sends symbolic queries to the KB and leaves the ranking of the KB entities to an external recommender system. Entity ranking in real world systems can be made with much richer features (e.g. user profiles, local context, etc.) in the back-end system, rather than just following entity posterior probabilities conditioned on a user utterance. Hence ranking of the KB entities is not a part of our proposed neural dialogue model. In this work, we assume that the model receives a ranked list of KB entities according to the issued query and other available sources, such as user models.

Once the KB query results are returned, we save the retrieved entities to a queue and encode the result summary to a vector. Rather than encoding the real KB entity values as in (Bordes and Weston, 2017; Eric and Manning, 2017), we only encode a summary of the query results (i.e. item availability and number of matched items). This encoding serves as a part of the input to the policy network.

Figure 2: Dialogue state and policy network.

A dialogue policy selects the next system action in response to the user's input based on the current dialogue state. We use a deep neural network to model the dialogue policy. There are three inputs to the policy network: (1) the dialogue-level LSTM state $s_k$, (2) the log probabilities of candidate values from the belief tracker $v_k$, and (3) the encoding of the query results summary $E_k$. The policy network emits a system action in the form of a dialogue act conditioned on these inputs:

$P(a_k \mid U_{\le k}, A_{<k}, E_{\le k}) = \mathrm{PolicyNet}(s_k, v_k, E_k)$   (3)
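The following is a minimal PyTorch sketch of equations (1)-(3). The concrete layer choices (an LSTM cell for the dialogue-level LSTM, one linear-softmax head per goal slot, and a single-hidden-layer policy network with tanh activation) and all class and argument names are our own illustrative assumptions; the state and hidden sizes follow the training settings reported in the experiments section.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DialogueStateAndPolicy(nn.Module):
    # Sketch of equations (1)-(3): dialogue-level LSTM, per-slot belief
    # trackers, and the policy network.
    def __init__(self, utt_dim, act_embed_dim, kb_summary_dim,
                 slot_value_counts, num_actions,
                 state_dim=200, policy_hidden=100):
        super().__init__()
        # Eq. (1): dialogue-level LSTM over [U_k, A_{k-1}].
        self.dialogue_lstm = nn.LSTMCell(utt_dim + act_embed_dim, state_dim)
        # Eq. (2): one value classifier per tracked goal slot.
        self.slot_heads = nn.ModuleList(
            [nn.Linear(state_dim, n) for n in slot_value_counts])
        # Eq. (3): policy network over (s_k, v_k, E_k).
        policy_in = state_dim + sum(slot_value_counts) + kb_summary_dim
        self.policy = nn.Sequential(
            nn.Linear(policy_in, policy_hidden), nn.Tanh(),
            nn.Linear(policy_hidden, num_actions))

    def forward(self, u_k, prev_act_embed, kb_summary, state):
        h_prev, c_prev = state
        # Update the dialogue state from the current utterance encoding
        # and the previous system action embedding.
        s_k, c_k = self.dialogue_lstm(
            torch.cat([u_k, prev_act_embed], dim=-1), (h_prev, c_prev))
        # Per-slot log distributions over candidate values; their
        # concatenation is the belief vector v_k fed to the policy.
        slot_log_probs = [F.log_softmax(head(s_k), dim=-1)
                          for head in self.slot_heads]
        v_k = torch.cat(slot_log_probs, dim=-1)
        action_log_probs = F.log_softmax(
            self.policy(torch.cat([s_k, v_k, kb_summary], dim=-1)), dim=-1)
        return slot_log_probs, action_log_probs, (s_k, c_k)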
By connecting all the system components, we have an end-to-end model for task-oriented dialogue. Each system component is a neural network that takes in the underlying components' outputs in a continuous form that is fully differentiable, and the entire system (utterance encoding, dialogue state tracking, and policy network) can be trained end-to-end.

We first train the system in a supervised manner by fitting task-oriented dialogue samples. The model predicts the true user goal slot values and the next system action at each turn of a dialogue. We optimize the model parameter set $\theta$ by minimizing a linear interpolation of cross-entropy losses for dialogue state tracking and system action prediction:

$\min_{\theta} \sum_{k=1}^{K} - \Big[ \sum_{m=1}^{M} \lambda_{l^m} \log P(l^{m*}_k \mid U_{\le k}, A_{<k}; \theta) + \lambda_{a} \log P(a^{*}_k \mid U_{\le k}, A_{<k}, E_{\le k}; \theta) \Big]$   (4)

where $l^{m*}_k$ and $a^{*}_k$ are the ground-truth slot value for slot type $m$ and the ground-truth system action at turn $k$. The overall interactive learning procedure is summarized below:

Algorithm 1: Dialogue Learning with Human Teaching and Feedback

  Train the model end-to-end on dialogue samples D with MLE and obtain policy π_θ(a|s)   ▷ eq. (4)
  for learning iteration k = 1 : K do
      Run π_θ(a|s) with users to collect new dialogue samples D_π
      Ask users to correct the mistakes in the tracked user's goal for each dialogue turn in D_π
      Add the newly labeled dialogue samples to the existing corpora: D ← D ∪ D_π
      Train the model end-to-end on D and obtain an updated policy π_θ(a|s)   ▷ eq. (4)
  end for
  for learning iteration k = 1 : N do
      Run π_θ(a|s) with a user for a new dialogue
      Collect user feedback as reward r
      Update the model end-to-end and obtain an updated policy π_θ(a|s)   ▷ eq. (5)
  end for

Collecting dialogue samples in this way directly targets the limitations of the currently learned dialogue model, as these newly collected dialogue samples are driven by the agent's own policy. Specifically, in this study we let an expert user correct the mistakes made by the agent in tracking the user's goal at the end of each dialogue turn. This new batch of annotated dialogues is then added to the existing training corpus. We start the next round of supervised model training on this aggregated corpus to obtain an updated dialogue policy, and continue these dialogue imitation learning cycles.

Learning from human teaching can be costly, as it requires expert users to provide corrections at each dialogue turn. We want to minimize the number of such imitation dialogue learning cycles and continue to improve the agent via a form of supervision signal that is easier to obtain. After the imitation learning stage, we further optimize the neural dialogue system with RL by letting the agent interact with users and learn from user feedback. Different from the turn-level corrections in the imitation dialogue learning stage, the feedback is only collected at the end of a dialogue. A positive reward is collected for successful tasks, and a zero reward is collected for failed tasks. A step penalty is applied to each dialogue turn to encourage the agent to complete the task in fewer steps. In this work, we only use task completion as the metric in designing the dialogue reward. One can extend it by introducing additional factors into the reward function, such as naturalness of interactions or costs associated with KB queries.

To encourage the agent to explore the dialogue action space, we let the agent follow a softmax policy during RL training by sampling system actions from the policy network outputs. We apply the REINFORCE algorithm (Williams, 1992) in optimizing the network parameters. The objective function can be written as $J_k(\theta) = \mathbb{E}_{\theta}[R_k] = \mathbb{E}_{\theta}\big[\sum_{t=0}^{K-k} \gamma^t r_{k+t}\big]$, with $\gamma \in [0, 1)$ being the discount factor. With the likelihood ratio gradient estimator, the gradient of the objective function can be derived as:

$\nabla_{\theta} J_k(\theta) = \nabla_{\theta} \mathbb{E}_{\theta}[R_k] = \sum_{a_k} \pi_{\theta}(a_k \mid s_k)\, \nabla_{\theta} \log \pi_{\theta}(a_k \mid s_k)\, R_k = \mathbb{E}_{\theta}\big[\nabla_{\theta} \log \pi_{\theta}(a_k \mid s_k)\, R_k\big]$   (5)

This last expression gives us an unbiased gradient estimator.
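A minimal sketch of the REINFORCE update in equation (5) is given below, assuming the per-turn action log-probabilities (kept in the computation graph so that the gradient reaches all system components) and per-turn rewards have already been collected from one dialogue with a user. The function name, the absence of a baseline, and the concrete discount value are our assumptions.

import torch

def reinforce_update(action_log_probs, rewards, optimizer, gamma=0.95):
    # action_log_probs: list of scalar tensors log pi_theta(a_k | s_k),
    # one per turn, produced while sampling actions from the softmax policy.
    # rewards: list of per-turn scalar rewards for the same dialogue.
    # Discounted return R_k = sum_t gamma^t * r_{k+t}, computed backwards.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    # Policy-gradient loss: - sum_k log pi_theta(a_k | s_k) * R_k (eq. 5).
    loss = -sum(lp * R for lp, R in zip(action_log_probs, returns))
    optimizer.zero_grad()
    loss.backward()   # gradients flow end-to-end through all components
    optimizer.step()
    return float(loss)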
4 Experiments

We evaluate the proposed method on the DSTC2 (Henderson et al., 2014a) dataset in the restaurant search domain and an internally collected dialogue corpus in the movie booking domain. The movie booking dialogue corpus has an average of 8.4 turns per dialogue. Its training set has 100K dialogues, and the development set and test set each have 10K dialogues.

The movie booking dialogue corpus is generated (Shah et al., 2018) using a finite state machine based dialogue agent and an agenda based user simulator (Schatzmann et al., 2007), with natural language utterances rewritten by real users. The user simulator can be configured with different personalities, showing various levels of randomness and cooperativeness. This user simulator is also used to interact with our end-to-end training agent during the imitation and reinforcement learning stages. We randomly select a user profile when conducting each dialogue simulation. During model evaluation, we use an extended set of natural language surface forms over the ones used during training time to evaluate the generalization capability of the proposed end-to-end model in handling diverse natural language inputs. The dataset can be accessed via https://github.com/google-research-datasets/simulated-dialogue.

The size of the dialogue-level and utterance-level LSTM state is set to 200 and 150 respectively. The word embedding size is 300. The embedding size for system actions and slot values is set to 32. The hidden layer size of the policy network is set to 100. We use the Adam optimization method (Kingma and Ba, 2014) with an initial learning rate of 1e-3. A dropout rate of 0.5 is applied during supervised training to prevent the model from over-fitting.

In imitation learning, we perform a mini-batch model update after collecting every 25 dialogues. System actions are sampled from the learned policy to encourage exploration. The system action is defined with the act and slot types from a dialogue act (Henderson et al., 2013). For example, the dialogue act "confirm(date=monday)" is mapped to a system action "confirm_date" and a candidate value "monday" for slot type "date". The slot types and values are from the dialogue state tracking output.

In RL optimization, we update the model with every mini-batch of 25 samples. A dialogue is considered successful based on two conditions: (1) the goal slot values estimated from dialogue state tracking fully match the user's true goal values, and (2) the system is able to confirm with the user the tracked goal values and offer an entity which is finally accepted by the user. The maximum allowed number of dialogue turns is set to 15. A positive reward of +15.0 is given at the end of a successful dialogue, and a zero reward is given to a failed case. We apply a step penalty of -1.0 for each turn to encourage shorter dialogues for task completion (this reward assignment is sketched below).

Table 1 and Table 2 show the supervised learning model performance on DSTC2 and the movie booking corpus. Evaluation is made on DST accuracy. For the evaluation on the DSTC2 corpus, we use the live ASR transcriptions as the user input utterances. Our proposed model achieves near state-of-the-art dialogue state tracking results on the DSTC2 corpus, on both individual slot tracking and joint slot tracking, compared to the recently published results using RNNs (Henderson et al., 2014b) and the neural belief tracker (NBT) (Mrkšić et al., 2016). In the movie booking domain, our model also achieves promising performance on both individual slot tracking and joint slot tracking accuracy. Instead of using ASR hypotheses as model input as in DSTC2, here we use text based input, which has a much lower noise level in the evaluation of the movie booking tasks. This partially explains the higher DST accuracy in the movie booking domain compared to DSTC2.
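As a small illustration, the helper below turns the success conditions and reward values described above into the per-turn reward sequence that the REINFORCE sketch consumes; the function and argument names are ours.

def dialogue_reward(num_turns, goal_slots_match, entity_accepted,
                    success_reward=15.0, step_penalty=-1.0, max_turns=15):
    # A dialogue is successful only if the tracked goal slot values fully
    # match the user's true goal and an offered entity is confirmed and
    # accepted by the user within the allowed number of turns.
    assert num_turns >= 1
    success = goal_slots_match and entity_accepted and num_turns <= max_turns
    # Every turn incurs a step penalty; the task reward (or zero, on
    # failure) is added at the final turn.
    rewards = [step_penalty] * num_turns
    rewards[-1] += success_reward if success else 0.0
    return success, rewards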
Table 1: Dialogue state tracking results on DSTC2

Model            Area  Food  Price  Joint
RNN              92    86    86     69
RNN + sem. dict  92    86    92     71
NBT              90    84    94     72
Our SL model     90    84    92     72

Table 2: DST results on the movie booking dataset

Goal slot        Accuracy
Num of Tickets   98.22
Movie            91.86
Theater Name     97.33
Date             99.31
Time             97.71
Joint            84.57

Evaluations of interactive learning with imitation and reinforcement learning are made on metrics of (1) task success rate, (2) dialogue turn size, and (3) DST accuracy. Figures 3, 4, and 5 show the learning curves for the three evaluation metrics. In addition, we compare model performance on task success rate using two different RL training settings, end-to-end training and policy-only training, to show the advantages of performing end-to-end system optimization with RL.

Task Success Rate

As shown in the learning curves in Figure 3, the SL model performs poorly. This might largely be due to the compounding errors caused by the mismatch of dialogue state distributions between offline training and interactive learning. We use an extended set of user NLG templates during interactive evaluation. Many of the test NLG templates are not seen by the supervised training agent. Any mistake made by the agent in understanding the user's request may lead to compounding errors in the following dialogue turns, which cause final task failure.

Figure 3: Interactive learning curves on task success rate.

The red curve (SL + RL) shows the performance of the model that has RL applied on the supervised pre-training model. We can see that interactive learning with RL using a weak form of supervision from user feedback continuously improves the task success rate with the growing number of user interactions. We further conduct experiments in learning the dialogue model from scratch using only RL (i.e. without supervised pre-training), and the task success rate remains at a very low level after 10K dialogue simulations. We believe that this is because the dialogue state space is too complex for the agent to learn from scratch, as it has to learn a good NLU model in combination with a good policy to complete the task. The yellow curve (SL + IL 500 + RL) shows the performance of the model that has 500 episodes of imitation learning over the SL model and continues with RL optimization. It is clear from the results that applying imitation learning on the supervised training model efficiently improves the task success rate. RL optimization after imitation learning increases the task success rate further. The blue curve (SL + IL 1000 + RL) shows the performance of the model that has 1000 episodes of imitation learning over the SL model and continues with RL. Similarly, it hints that imitation learning may effectively adapt the supervised training model to the dialogue state distribution during user interactions.

Average Dialogue Turn Size

Figure 4 shows the curves for the average turn count of successful dialogues. We observe a decreasing number of dialogue turns in completing a task along the growing number of interactive learning sessions. This shows that the dialogue agent learns better strategies to successfully complete the task with fewer dialogue turns.

Figure 4: Interactive learning curves on average dialogue turn size.

The red curve, with RL applied directly after the supervised pre-training model, gives the lowest average number of turns at the end of the interactive learning cycles, compared to the models with imitation dialogue learning.
This seems to be contrary to our observation in Figure 3 that imitation learning with human teaching helps in achieving a higher task success rate. By looking into the generated dialogues, we find that the SL + RL model can handle easy tasks well but fails to complete more challenging tasks. Such easy tasks can typically be handled with a smaller number of turns, which results in the low average turn count for the SL + RL model. On the other hand, the imitation plus RL models attempt to learn better strategies to handle those more challenging tasks, resulting in higher task success rates and also slightly increased dialogue length compared to the SL + RL model.

Dialogue State Tracking Accuracy

Similar to the results on task success rate, we see that imitation learning with human teaching quickly improves dialogue state tracking accuracy in just a few hundred interactive learning sessions. The joint slot tracking accuracy in the evaluation of the SL model using a fixed corpus is 84.57%, as in Table 2. The accuracy drops to 50.51% in the interactive evaluation with the introduction of new NLG templates. Imitation learning with human teaching effectively adapts the neural dialogue model to the new user input and dialogue state distributions, improving the DST accuracy to 67.47% after only 500 imitation dialogue learning sessions. Another encouraging observation is that RL on top of the SL model and IL model not only improves the task success rate by optimizing the dialogue policy, but also further improves dialogue state tracking performance. This shows the benefits of performing end-to-end optimization of the neural dialogue model with RL during interactive learning.

Figure 5: Interactive learning curves on dialogue state tracking accuracy.

Figure 6: Interactive learning curves on task success rate with different RL training settings.

End-to-End RL Optimization

To further show the benefit of performing end-to-end optimization of the dialogue agent, we compare models with two different RL training settings: end-to-end training and policy-only training. End-to-end RL training is what we applied in the previous evaluation sections, in which the gradient propagates from the system action output layer all the way back to the natural language user input layer. Policy-only training refers to only updating the policy network parameters during interactive learning with RL, with all the other underlying system parameters fixed. The evaluation results are shown in Figure 6. From these learning curves, we see a clear advantage of performing end-to-end model updates in achieving a higher dialogue task success rate during interactive learning, compared to only updating the policy network.

Human User Evaluations

We further evaluate the proposed method with human judges recruited via Amazon Mechanical Turk. Each judge is asked to read a dialogue between our model and the user simulator and rate each system turn on a scale of 1 (frustrating) to 5 (optimal way to help the user). Each turn is rated by 3 different judges. We collect and rate 100 dialogues for each of the three models: (i) the SL model, (ii) the SL model followed by 1000 episodes of IL, and (iii) SL and IL followed by RL. Table 3 lists the mean and standard deviation of human scores over all system turns. Performing interactive learning with imitation and reinforcement learning clearly improves the quality of the model according to human judges.

Table 3: Human evaluation results. Mean and standard deviation of crowd worker scores (between 1 and 5).

Model               Score
SL                  3.987 ±
SL + IL 1000        ±
SL + IL 1000 + RL   ±
5 Conclusions

In this work, we focus on training task-oriented dialogue systems through user interactions, where the agent improves through communicating with users and learning from the mistakes it makes. We propose a hybrid learning approach for such systems using an end-to-end trainable neural network model. We present a hybrid imitation and reinforcement learning method, where we first train a dialogue agent in a supervised manner by learning from dialogue corpora, and continue to improve it by learning from user teaching and feedback with imitation and reinforcement learning. We evaluate the proposed learning method with both offline evaluation on fixed dialogue corpora and interactive evaluation with users. Experimental results show that the proposed neural dialogue agent can effectively learn from user teaching and improve task success rate with imitation learning. Applying reinforcement learning with user feedback after imitation learning with user teaching improves the model performance further, not only on the dialogue policy but also on dialogue state tracking in the end-to-end training framework.

References

Antoine Bordes and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations.
Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards end-to-end reinforcement learning of dialogue agents for information access. In ACL.
Mihail Eric and Christopher D Manning. 2017. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In EACL.
Milica Gašić, Catherine Breslin, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis, and Steve Young. 2013. On-line policy optimisation of bayesian spoken dialogue systems via human interaction. In ICASSP.
Milica Gasic and Steve Young. 2014. Gaussian processes for pomdp-based dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional rnn-lstm. In Interspeech.
James Henderson, Oliver Lemon, and Kallirroi Georgila. 2008. Hybrid reinforcement/supervised learning of dialogue policies from fixed data sets. Computational Linguistics.
Matthew Henderson, Blaise Thomson, and Jason Williams. 2013. Dialog state tracking challenge 2 & 3. http://camdial.org/~mh521/dstc/.
Matthew Henderson, Blaise Thomson, and Jason Williams. 2014a. The second dialog state tracking challenge. In SIGDIAL.
Matthew Henderson, Blaise Thomson, and Steve Young. 2014b. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In IEEE SLT.
Matthew Henderson, Blaise Thomson, and Steve Young. 2014c. Word-based dialog state tracking with recurrent neural networks. In SIGDIAL.
Filip Jurčíček, Blaise Thomson, and Steve Young. 2012. Reinforcement learning for parameter estimation in statistical spoken dialogue systems. Computer Speech & Language.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In ACL.
Xiujun Li, Yun-Nung Chen, Lihong Li, and Jianfeng Gao. 2017. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008.
Bing Liu and Ian Lane. 2016. Joint online spoken language understanding and language modeling with recurrent neural networks. In SIGDIAL.
Bing Liu and Ian Lane. 2017a. An end-to-end trainable neural network model with belief tracking for task-oriented dialog. In Interspeech.
Bing Liu and Ian Lane. 2017b. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In Proceedings of IEEE ASRU.
Bing Liu, Gokhan Tur, Dilek Hakkani-Tur, Pararth Shah, and Larry Heck. 2017. End-to-end optimization of task-oriented dialogue model with deep reinforcement learning. In NIPS Workshop on Conversational AI.
Fei Liu and Julien Perez. 2017. Gated end-to-end memory networks. In EACL.
Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2015. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP).
Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2016. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777.
Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In Proceedings of EMNLP.
Antoine Raux, Brian Langner, Dan Bohus, Alan W Black, and Maxine Eskenazi. 2005. Let's go public! Taking a spoken dialog system to the real world. In Interspeech.
Stéphane Ross and Drew Bagnell. 2010. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668.
Stéphane Ross, Geoffrey J Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635.
Alexander I Rudnicky, Eric H Thayer, Paul C Constantinides, Chris Tchou, R Shern, Kevin A Lenzo, Wei Xu, and Alice Oh. 1999. Creating natural dialogs in the Carnegie Mellon Communicator system. In Eurospeech.
Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Hui Ye, and Steve Young. 2007. Agenda-based user simulation for bootstrapping a POMDP dialogue system. In NAACL-HLT.
Minjoon Seo, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Query-regression networks for machine comprehension. arXiv preprint arXiv:1606.04582.
Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015. Building end-to-end dialogue systems using generative hierarchical neural network models. arXiv preprint arXiv:1507.04808.
Pararth Shah, Dilek Hakkani-Tür, Bing Liu, and Gokhan Tür. 2018. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In NAACL-HLT.
Pararth Shah, Dilek Hakkani-Tür, and Larry Heck. 2016. Interactive reinforcement learning for task-oriented dialogue management. In NIPS 2016 Deep Learning for Action and Interaction Workshop.
Pei-Hao Su, Pawel Budzianowski, Stefan Ultes, Milica Gasic, and Steve Young. 2017. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. In SIGDIAL.
Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. In ACL.
Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In EACL.
Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In ACL.
Jason D Williams and Geoffrey Zweig. 2016. End-to-end lstm-based dialog control optimized with supervised and reinforcement learning. arXiv preprint arXiv:1606.01269.
Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.
Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE.
Tiancheng Zhao and Maxine Eskenazi. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In SIGDIAL.