Adversarial Learning of Task-Oriented Neural Dialog Models
Bing Liu
Carnegie Mellon University, Electrical and Computer Engineering
[email protected]
Ian Lane
Carnegie Mellon University, Electrical and Computer Engineering, Language Technologies Institute
[email protected]
Abstract
In this work, we propose an adversarial learning method for reward estimation in reinforcement learning (RL) based task-oriented dialog models. Most current RL based task-oriented dialog systems require access to a reward signal from either user feedback or user ratings. Such user ratings, however, may not always be consistent or available in practice. Furthermore, online dialog policy learning with RL typically requires a large number of queries to users, suffering from a sample efficiency problem. To address these challenges, we propose an adversarial learning method to learn dialog rewards directly from dialog samples. Such rewards are further used to optimize the dialog policy with policy gradient based RL. In an evaluation in a restaurant search domain, we show that the proposed adversarial dialog learning method achieves an advanced dialog success rate compared with strong baseline methods. We further discuss the covariate shift problem in online adversarial dialog learning and show how we can address it with partial access to user feedback.
Introduction

Task-oriented dialog systems are designed to assist users in completing daily tasks, such as making reservations and providing customer support. Compared with chit-chat systems that are usually modeled with single-turn context-response pairs (Li et al., 2016; Serban et al., 2016), task-oriented dialog systems (Young et al., 2013; Williams et al., 2017) involve retrieving information from external resources and reasoning over multiple dialog turns. This makes it especially important for a system to be able to learn interactively from users.

Recent efforts on task-oriented dialog systems focus on learning dialog models from a data-driven approach using human-human or human-machine conversations. Williams et al. (2017) designed a hybrid supervised and reinforcement learning end-to-end dialog agent. Dhingra et al. (2017) proposed an RL based model for information access that can learn online via user interactions. Such systems assume the model has access to a reward signal at the end of a dialog, either in the form of a binary user feedback or a continuous user score. A challenge with such learning systems is that user feedback may be inconsistent (Su et al., 2016) and may not always be available in practice. Furthermore, online dialog policy learning with RL usually suffers from a sample efficiency issue (Su et al., 2017), which requires an agent to make a large number of feedback queries to users.

To reduce the high demand for user feedback in online policy learning, solutions have been proposed to design or to learn a reward function that can be used to generate a reward in approximation to a user feedback. Designing a good reward function is not easy (Walker et al., 1997), as it typically requires strong domain knowledge. El Asri et al. (2014) proposed a learning based reward function that is trained with task completion transfer learning. Su et al. (2016) proposed an online active learning method for reward estimation using Gaussian process classification. These methods still require annotations of dialog ratings by users, and thus may also suffer from the rating consistency and learning efficiency issues.

To address the above challenges, we investigate the effectiveness of learning dialog rewards directly from dialog samples. Inspired by the success of adversarial training in computer vision (Denton et al., 2015) and natural language generation (Li et al., 2017a), we propose an adversarial learning method for task-oriented dialog systems. We jointly train two models: a generator that interacts with the environment to produce task-oriented dialogs, and a discriminator that marks a dialog sample as being successful or not. The generator is a neural network based task-oriented dialog agent. The environment that the dialog agent interacts with is the user. The quality of a dialog produced by the agent and the user is measured by the likelihood that it fools the discriminator into believing that the dialog is a successful one conducted by a human agent. We treat dialog agent optimization as a reinforcement learning problem.
The output from the discriminator serves as a reward to the dialog agent, pushing it towards completing a task in a way that is indistinguishable from how a human agent completes it.

In this work, we discuss how the adversarial learning reward function compares to designed reward functions in learning a good dialog policy. Our experimental results in a restaurant search domain show that dialog agents optimized with the proposed adversarial learning method achieve an advanced task success rate compared with strong baseline methods. We discuss the impact of the size of annotated dialog samples on the effectiveness of dialog adversarial learning. We further discuss the covariate shift issue in interactive adversarial learning and show how we can address it with partial access to user feedback.

Related Work

Task-Oriented Dialog Learning
Popular approaches in learning task-oriented dialog systems include modeling the task as a partially observable Markov Decision Process (POMDP) (Young et al., 2013). Reinforcement learning can be applied in the POMDP framework to learn dialog policy online by interacting with users (Gašić et al., 2013). Recent efforts have been made in designing end-to-end solutions (Williams and Zweig, 2016; Liu and Lane, 2017a; Li et al., 2017b; Liu et al., 2018) for task-oriented dialogs. Wen et al. (2017) designed a supervised training end-to-end neural dialog model with modularly connected components. Bordes and Weston (2017) proposed a neural dialog model using end-to-end memory networks. These models are trained offline using fixed dialog corpora, and thus it is unknown how well the model performance generalizes to online user interactions. Williams et al. (2017) proposed a hybrid code network for task-oriented dialog that can be trained with supervised and reinforcement learning. Dhingra et al. (2017) proposed an RL dialog agent for information access. Such models are trained against rule-based user simulators. A dialog reward from the user simulator is expected at the end of each turn or each dialog.
Dialog Reward Modeling
Dialog reward estimation is an essential step for policy optimization in task-oriented dialogs. Walker et al. (1997) proposed the PARADISE framework, in which user satisfaction is estimated using a number of dialog features such as the number of turns and the elapsed time. Yang et al. (2012) proposed a collaborative filtering based method for estimating user satisfaction in dialogs. Su et al. (2015) studied using convolutional neural networks in rating dialog success. Su et al. (2016) further proposed an online active learning method based on Gaussian processes for dialog reward learning. These methods still require various levels of annotation of dialog ratings by users, either offline or online. On the other side of the spectrum, Paek and Pieraccini (2008) proposed inferring a reward directly from dialog corpora with inverse reinforcement learning (IRL) (Ng et al., 2000). However, most IRL algorithms are very expensive to run (Ho and Ermon, 2016), requiring reinforcement learning in an inner loop. This hinders IRL based dialog reward estimation methods from scaling to complex dialog scenarios.
Adversarial Networks
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have recently been successfully applied in computer vision and natural language generation (Li et al., 2017a). The network training process is framed as a game in which a generator is trained to produce samples that fool a discriminator, and the discriminator is trained to distinguish samples produced by the generator from real ones. The generator and the discriminator are jointly trained until convergence. GANs were first applied in image generation and have recently been used in language tasks. Li et al. (2017a) proposed conducting adversarial learning for response generation in open-domain dialogs. Yang et al. (2017) proposed using adversarial learning in neural machine translation. The use of adversarial learning in task-oriented dialogs has not been well studied. Peng et al. (2018) recently explored using an adversarial loss as an extra critic in addition to the main reward function based on task completion. This method still requires prior knowledge of a user's goal, which can be hard to collect in practice, in defining the completion of a task. Our proposed method uses the adversarial reward as the only source of reward signal for policy optimization, addressing this challenge.
Adversarial Learning for Task-Oriented Dialog

In this section, we describe the proposed adversarial learning method for policy optimization in task-oriented neural dialog models. Our objective is to learn a dialog agent (i.e. the generator, G) that is able to effectively communicate with a user over a multi-turn conversation to complete a task. This can be framed as a sequential decision making problem, in which the agent generates the best action to take at every dialog turn given the dialog context. The action can be in the form of either a dialog act (Henderson et al., 2013) or a natural language utterance. We work at the dialog act level in this study. Let U_k and A_k represent the user input and the agent outputs (i.e. the agent act a_k and the slot-value predictions) at turn k. Given the current user input U_k, the agent estimates the user's goal and selects the best action a_k to take conditioned on the dialog history.

In addition, we want to learn a reward function (i.e. the discriminator, D) that is able to provide guidance to the agent for learning a better policy. We expect the reward function to give a higher reward to the agent if the conversation it had with the user is closer to how a human agent completes the task. The output of the reward function is the probability of a given dialog being successfully completed. We train the reward function by forcing it to distinguish successful dialogs from dialogs conducted by the machine agent. At the same time, we also update the dialog agent parameters with policy gradient based reinforcement learning using the reward produced by the reward function. We keep updating the dialog agent and the reward function until the discriminator can no longer distinguish dialogs from a human agent and dialogs from a machine agent. In the subsequent sections, we describe in detail the design of our dialog agent and reward function, and the proposed adversarial dialog learning method.
Figure 1: Design of the task-oriented neural dialog agent. (The figure shows the LSTM dialog state s_k, the query results encoding E_k, the slot value probabilities v_k, the previous system output A_{k-1}, the user input encoding U_k, the policy network, and the system action a_k.)
The generator is a neural network based task-oriented dialog agent. The model architecture is shown in Figure 1. The agent uses an LSTM recurrent neural network to model the sequence of turns in a dialog. At each turn, the agent takes the best system action conditioned on the current dialog state. A continuous-form dialog state is maintained in the LSTM state s_k. At each dialog turn k, the user input U_k and the previous system output A_{k-1} are first encoded to continuous representations. The user input can be either in the form of a dialog act or a natural language utterance. We use dialog act form user input in our experiment. The dialog act representation is obtained by concatenating the embeddings of the act and the slot-value pairs. If the natural language form of input is used, we can encode the sequence of words using a bidirectional RNN and take the concatenation of the last forward and backward states as the utterance representation, similar to (Yang et al., 2016) and (Liu and Lane, 2017a). With the user input U_k and the agent input A_{k-1}, the dialog state s_k is updated from the previous state s_{k-1} by:

$s_k = \mathrm{LSTM}_G(s_{k-1}, [U_k, A_{k-1}])$    (1)
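As a concrete illustration of this state update, below is a minimal PyTorch sketch of a turn-level state tracker. The class and parameter names, the embedding sizes, and the assumption that dialog-act inputs arrive as act and slot-value indices are ours for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DialogStateLSTM(nn.Module):
    """Turn-level dialog state tracker: s_k = LSTM_G(s_{k-1}, [U_k, A_{k-1}])."""

    def __init__(self, num_acts, num_slot_values, embed_dim=64, state_dim=150):
        super().__init__()
        # Embeddings for dialog acts and slot-value pairs (hypothetical vocabulary sizes).
        self.act_embed = nn.Embedding(num_acts, embed_dim)
        self.slot_value_embed = nn.Embedding(num_slot_values, embed_dim)
        # The turn input is the concatenation of the user encoding U_k
        # and the previous agent output encoding A_{k-1}.
        self.cell = nn.LSTMCell(input_size=4 * embed_dim, hidden_size=state_dim)

    def encode_dialog_act(self, act_id, slot_value_id):
        # Dialog-act representation: concatenation of act and slot-value embeddings.
        return torch.cat(
            [self.act_embed(act_id), self.slot_value_embed(slot_value_id)], dim=-1
        )

    def forward(self, user_act, user_slot_value, prev_act, prev_slot_value, state):
        u_k = self.encode_dialog_act(user_act, user_slot_value)      # U_k
        a_prev = self.encode_dialog_act(prev_act, prev_slot_value)   # A_{k-1}
        return self.cell(torch.cat([u_k, a_prev], dim=-1), state)    # Eq. (1)
```

The hidden part of the returned state plays the role of s_k, which the belief tracker and the policy network described next would consume.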
Belief Tracking

Belief tracking maintains the state of a conversation, such as a user's goals, by accumulating evidence along the sequence of dialog turns. A user's goal is represented by a list of slot-value pairs. The belief tracker updates its estimation of the user's goal by maintaining a probability distribution P(l^m_k) over candidate values for each tracked goal slot type m ∈ M. With the current dialog state s_k, the probability over candidate values for each tracked goal slot is calculated by:

$P(l^m_k \mid U_{\leq k}, A_{<k}) = \mathrm{SlotDist}_m(s_k)$    (2)

where SlotDist_m is a feed-forward network with a softmax output over the candidate values for slot type m.

Dialog Policy

We model the agent's policy with a deep neural network. Following the policy, the agent selects the next action in response to the user's input based on the current dialog state. In addition, information retrieved from external resources may also affect the agent's next action. Therefore, the inputs to our policy module are the current dialog state s_k, the probability distribution of estimated user goal slot values v_k, and the encoding of the information retrieved from external sources E_k. Here, instead of encoding the actual query results, we encode a summary of the retrieved items (i.e. the count and availability of the returned items). Based on these inputs, the policy network produces a probability distribution over the next system actions:

$P(a_k \mid U_{\leq k}, A_{<k}, E_{\leq k}) = \mathrm{PolicyNet}(s_k, v_k, E_k)$    (3)

Dialog Reward Estimator

The discriminator model is a binary classifier that takes in a dialog with a sequence of turns and outputs a label indicating whether the dialog is a successful one or not. The logistic function returns a probability of the input dialog being successful. The discriminator model design is shown in Figure 2. We use a bidirectional LSTM to encode the sequence of turns. At each dialog turn k, the input to the discriminator model is the concatenation of (1) the encoding of the user input U_k, (2) the encoding of the query result summary E_k, and (3) the encoding of the agent output A_k. The discriminator LSTM output at each step k, h_k, is a concatenation of the forward LSTM output and the backward LSTM output: $h_k = [\overrightarrow{h}_k, \overleftarrow{h}_k]$.

Figure 2: Design of the dialog reward estimator: Bidirectional LSTM with max pooling.

Once we obtain the discriminator LSTM state outputs {h_1, ..., h_K}, we experiment with four different methods of combining these state outputs to generate the final dialog representation d for the binary classifier:

BiLSTM-last: Produce the final dialog representation d by concatenating the last LSTM state outputs from the forward and backward directions: $d = [\overrightarrow{h}_K, \overleftarrow{h}_1]$.

BiLSTM-max: Max-pooling. Produce the final dialog representation d by selecting the maximum value over each dimension of the LSTM state outputs.

BiLSTM-avg: Average-pooling. Produce the final dialog representation d by taking the average value over each dimension of the LSTM state outputs.

BiLSTM-attn: Attention-pooling. Produce the final dialog representation d by taking the weighted sum of the LSTM state outputs, with the weights calculated by an attention mechanism:

$d = \sum_{k=1}^{K} \alpha_k h_k$    (4)

$\alpha_k = \frac{\exp(e_k)}{\sum_{t=1}^{K} \exp(e_t)}, \quad e_k = g(h_k)$    (5)

where g is a feed-forward neural network with a single output node. Finally, the discriminator produces a value indicating the likelihood of the input dialog being a successful one:

$D(d) = \sigma(W_o d + b_o)$    (6)

where W_o and b_o are the weights and bias in the discriminator output layer, and σ is a logistic function.
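To make the reward estimator concrete, here is a minimal PyTorch sketch of the bidirectional LSTM discriminator with the four pooling variants described above. The class name, the dimensions, and the way turn encodings are packed into a single tensor are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DialogRewardEstimator(nn.Module):
    """Bidirectional-LSTM dialog discriminator D(d) with selectable pooling (sketch)."""

    def __init__(self, turn_input_dim, hidden_dim=200, pooling="max"):
        super().__init__()
        self.bilstm = nn.LSTM(turn_input_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # g(.) in Eq. (5), for attention pooling
        self.out = nn.Linear(2 * hidden_dim, 1)    # W_o, b_o in Eq. (6)
        self.pooling = pooling

    def forward(self, turns):
        # turns: (batch, K, turn_input_dim); each turn encodes [U_k; E_k; A_k]
        h, _ = self.bilstm(turns)                  # (batch, K, 2 * hidden_dim)
        half = h.size(-1) // 2
        if self.pooling == "last":                 # d = [forward h_K, backward h_1]
            d = torch.cat([h[:, -1, :half], h[:, 0, half:]], dim=-1)
        elif self.pooling == "max":                # max over the turn dimension
            d, _ = h.max(dim=1)
        elif self.pooling == "avg":                # average over the turn dimension
            d = h.mean(dim=1)
        else:                                      # attention pooling, Eqs. (4)-(5)
            alpha = torch.softmax(self.attn(h), dim=1)
            d = (alpha * h).sum(dim=1)
        return torch.sigmoid(self.out(d)).squeeze(-1)   # Eq. (6): probability of success
```

In the experiments reported later, the max-pooling variant (BiLSTM-max) is the one used for adversarial reward learning.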
Adversarial Model Training

Once we obtain a dialog sample initiated by the agent and a dialog reward from the reward function, we optimize the dialog agent using REINFORCE (Williams, 1992) with the given reward. The reward D(d) is only received at the end of a dialog, i.e. r_K = D(d). We discount this final reward with a discount factor γ ∈ [0, 1] to assign a reward R_k to each dialog turn. The objective function can thus be written as

$J_k(\theta_G) = \mathbb{E}_{\theta_G}[R_k] = \mathbb{E}_{\theta_G}\left[\sum_{t=k}^{K} \gamma^{t-k} r_t - V(s_k)\right]$

with r_k = D(d) for k = K and r_k = 0 for k < K. V(s_k) is the state value function, which serves as a baseline value. The state value function is a feed-forward neural network with a single-node value output. We optimize the generator parameters θ_G to maximize J_k(θ_G). With the likelihood ratio gradient estimator, the gradient of J_k(θ_G) can be derived as:

$\nabla_{\theta_G} J_k(\theta_G) = \nabla_{\theta_G} \mathbb{E}_{\theta_G}[R_k] = \sum_{a_k \in \mathcal{A}} G(a_k \mid \cdot)\, \nabla_{\theta_G} \log G(a_k \mid \cdot)\, R_k = \mathbb{E}_{\theta_G}\left[\nabla_{\theta_G} \log G(a_k \mid \cdot)\, R_k\right]$    (7)

where $G(a_k \mid \cdot) = G(a_k \mid s_k, v_k, E_k; \theta_G)$. The expression above gives us an unbiased gradient estimator. We sample the agent action a_k following a softmax policy at each dialog turn and compute the policy gradient. At the same time, we update the discriminator parameters θ_D to maximize the probability of assigning the correct labels to the successful dialogs from human demonstration and the dialogs conducted by the machine agent:

$\nabla_{\theta_D} \left[ \mathbb{E}_{d \sim \theta_{\mathrm{demo}}}[\log D(d)] + \mathbb{E}_{d \sim \theta_G}[\log(1 - D(d))] \right]$    (8)

We continue to update both the dialog agent and the reward function via dialog simulation or real user interaction until convergence.

Algorithm 1: Adversarial Learning for Task-Oriented Dialog
  Required: dialog corpus S_demo, user simulator U, generator G, discriminator D
  Pretrain a dialog agent (i.e. the generator) G on dialog corpus S_demo with MLE
  Simulate dialogs S_simu between U and G
  Sample successful dialogs S(+) and random dialogs S(-) from {S_demo, S_simu}
  Pretrain a reward function (i.e. the discriminator) D with S(+) and S(-)    (eq. 8)
  for number of training iterations do
    for G-steps do
      Simulate dialogs S_b between U and G
      Compute reward r for each dialog in S_b with D    (eq. 6)
      Update G with reward r    (eq. 7)
    end for
    for D-steps do
      Sample dialogs S(b+) from S(+)
      Update D with S(b+) and S_b (with S_b as negative examples)    (eq. 8)
    end for
  end for
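The two updates above can be sketched in a few lines of PyTorch. The function names, the per-dialog list representation of log-probabilities and baseline values, and the joint policy/value loss are illustrative assumptions; they show the shape of the updates in Eqs. (7) and (8), not the authors' exact training code.

```python
import torch

def reinforce_update(log_probs, values, final_reward, optimizer, gamma=0.95):
    """One REINFORCE step (Eq. 7) for a single simulated dialog of K turns (sketch).

    log_probs:    list of log G(a_k | .) for the action sampled at each turn.
    values:       list of baseline values V(s_k) from the value network.
    final_reward: scalar D(d) from the discriminator, received only at dialog end.
    """
    K = len(log_probs)
    policy_loss, value_loss = 0.0, 0.0
    for k in range(K):
        # Only the terminal reward r_K = D(d) is non-zero; discount it back to turn k.
        return_k = (gamma ** (K - 1 - k)) * final_reward
        advantage = return_k - values[k].detach()                # R_k - V(s_k)
        policy_loss = policy_loss - log_probs[k] * advantage     # maximize J_k(theta_G)
        value_loss = value_loss + (values[k] - return_k) ** 2    # fit the baseline
    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()

def discriminator_update(discriminator, optimizer, pos_dialogs, neg_dialogs):
    """One discriminator step (Eq. 8) on a mini-batch of dialogs (sketch)."""
    d_pos = discriminator(pos_dialogs)   # dialogs sampled from the positive pool S(+)
    d_neg = discriminator(neg_dialogs)   # freshly simulated dialogs S_b as negatives
    loss = -(torch.log(d_pos + 1e-8).mean() + torch.log(1.0 - d_neg + 1e-8).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```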
Experiments

We use data from the second Dialog State Tracking Challenge (DSTC2) (Henderson et al., 2014) in the restaurant search domain for our model training and evaluation. We add entity information to each dialog sample in the original DSTC2 dataset. This makes entity information a part of the model training process, enabling the agent to handle entities during interactive evaluation with users. Different from the agent action definition used in DSTC2, actions in our system are produced by concatenating the act and slot types in the original dialog act output (e.g. "confirm(food=italian)" maps to "confirm_food"). The slot values (e.g. italian) are captured in the belief tracking outputs. Table 1 shows the statistics of the dataset used in our experiments.

We use a user simulator for our interactive training and evaluation with adversarial learning. Instead of using a rule-based user simulator as in much prior work (Zhao and Eskenazi, 2016; Peng et al., 2017), in our study we use a model-based simulator trained on the DSTC2 dataset. We follow the design and training procedures of (Liu and Lane, 2017b) in building the model-based simulator. The stochastic policy used in the simulator introduces additional diversity in user behavior during dialog simulation.

Before performing interactive adversarial learning with RL, we pretrain the dialog agent and the discriminative reward function with offline supervised learning on the DSTC2 dataset. We find this helpful in enabling the adversarial policy learning to start with a good initialization. The dialog agent is pretrained to minimize the cross-entropy losses on agent action and slot value predictions. Once we obtain a supervised training dialog agent, we simulate dialogs between the agent and the user simulator. These simulated dialogs, together with the dialogs in the DSTC2 dataset, are then used to pretrain the discriminative reward function. We sample 500 successful dialogs as positive examples and 500 random dialogs as negative examples in pretraining the discriminator. During dialog simulation, a dialog is marked as successful if the agent's belief tracking outputs fully match the informable (Henderson et al., 2013) user goal slot values and all user requested slots are fulfilled. This is the same evaluation criterion as used in (Wen et al., 2017) and (Liu and Lane, 2017b). It is important to note that such a dialog success signal is usually not available during real user interactions, unless we explicitly ask users to provide this feedback.

During supervised pretraining, for the dialog agent we use an LSTM with a state size of 150. The hidden layer size for the policy network MLP is set to 100. For the discriminator model, a state size of 200 is used for the bidirectional LSTM. We perform mini-batch training with a batch size of 32 using the Adam optimization method (Kingma and Ba, 2014) with an initial learning rate of 1e-3. Dropout is applied during model training to prevent the model from over-fitting. The gradient clipping threshold is set to 5.

During interactive learning with adversarial RL, we set the maximum allowed number of dialog turns to 20. A simulation is forced to terminate after 20 dialog turns. We update the model with every mini-batch of 25 samples. Dialog rewards are calculated by the discriminative reward function. The reward discount factor γ is set to 0.95. These rewards are used to update the agent model via policy gradient. At the same time, this mini-batch of simulated dialogs is used as negative examples to update the discriminator.

Results and Discussion

In this section, we show and discuss our empirical evaluation results. We first compare dialog agents trained using the proposed adversarial reward to those using a human designed reward and those using an oracle reward. We then discuss the impact of discriminator model design and model pretraining on the adversarial learning performance. Last but not least, we discuss the potential issue of covariate shift during interactive adversarial learning and show how we address it with partial access to user feedback.

We first compare the performance of dialog agents using the adversarial reward to those using the designed reward and the oracle reward on dialog success rate. Designed reward refers to a reward function that is designed by humans with domain knowledge. In our experiment, based on the dialog success criteria defined above, we design the following reward function for RL policy learning (a code sketch of this reward follows the list):

• +1 for each informable slot that is correctly estimated by the agent at the end of a dialog.
• If ALL informable slots are tracked correctly, +1 for each requestable slot successfully handled by the agent.
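A minimal sketch of this designed reward, together with the dialog success check described earlier, might look as follows; the function names and the dictionary/set representations of goals and requests are hypothetical, chosen only to make the scoring rule explicit.

```python
def dialog_success(tracked_goal, user_goal, requested_slots, fulfilled_slots):
    """Success criterion: all informable goal slots tracked correctly and all requests fulfilled."""
    informables_correct = all(
        tracked_goal.get(slot) == value for slot, value in user_goal.items()
    )
    return informables_correct and requested_slots.issubset(fulfilled_slots)


def designed_reward(tracked_goal, user_goal, requested_slots, fulfilled_slots):
    """Hand-designed reward: +1 per correctly tracked informable slot;
    if ALL informables are correct, +1 per successfully handled requestable slot."""
    reward = sum(
        1 for slot, value in user_goal.items() if tracked_goal.get(slot) == value
    )
    if reward == len(user_goal):  # every informable slot tracked correctly
        reward += len(requested_slots & fulfilled_slots)
    return reward
```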
In addition to the comparison to the human designed reward, we further compare to the case of using an oracle reward during agent policy optimization. Using the oracle reward refers to having access to the final dialog success status. We apply a reward of +1 for a successful dialog and a reward of 0 for a failed dialog. Performance of the agent using the oracle reward serves as an upper bound for those using other types of reward. For the learning with adversarial rewards, we use BiLSTM-max as the discriminator model. During RL training, we normalize the rewards produced by the different reward functions.

Figure 3 shows the RL learning curves for models trained using the different reward functions. The dialog success rate at each evaluation point is calculated by averaging over the success status of 1000 dialog simulations at that point. The pretrain baseline in the figure refers to the supervised pretraining model. This model does not get updated during interactive learning, and thus its curve stays flat during the RL training cycle.

Figure 3: RL policy optimization performance comparing adversarial reward, designed reward, and oracle reward.

As shown in these curves, all three types of reward functions lead to an improved dialog success rate along the interactive learning process. The agent trained with the designed reward falls behind the agent trained with the oracle reward by a large margin. This shows that the reward designed with domain knowledge may not fully align with the final evaluation metric. Designing a reward function that provides an agent with enough supervision signal and also aligns well with the final system objective is not a trivial task (Popov et al., 2017). In practice, it is often difficult to exactly specify what we expect an agent to do, and we usually end up with simple and imperfect measures. In our experiment, the agent using the adversarial reward achieves a 7.4% improvement in dialog success rate over the supervised pretraining baseline at the end of 6000 interactive dialog learning episodes, outperforming the agent using the designed reward (4.2%). This shows the advantage of performing adversarial training in learning directly from expert demonstrations and in addressing the challenge of designing a proper reward function. Another important point we observe in our experiments is that RL agents trained with the adversarial reward, although they enjoy higher performance in the end, suffer from larger variance and instability in model performance during the RL training process, compared with agents using the human designed reward. This is because during RL training the agent faces a moving target, rather than a fixed objective measure as in the case of using the designed reward or the oracle reward. The model performance gradually stabilizes when both the dialog agent and the reward model are close to convergence.

We study the impact of different discriminator model designs on the adversarial learning performance. We compare the four pooling methods described above for producing the final dialog representation. Table 2 shows the offline evaluation results on 1000 simulated test dialog samples. Among the four pooling methods, max-pooling on bidirectional LSTM outputs achieves the best classification accuracy in our experiment. Max-pooling also assigns the highest probability to successful dialogs in the test set compared with the other pooling methods.
The attention-pooling based LSTM model achieves the lowest performance across all three offline evaluation metrics in our study. This is probably due to the limited number of training samples we used in pretraining the discriminator. Learning good attention weights usually requires more data samples, and the model may thus overfit the small training set. We observe similar trends during the interactive learning evaluation: the attention-based discriminator leads to divergence of policy optimization more often than the other three pooling methods. The max-pooling discriminator gives the most stable performance during our interactive RL training.

Model         Prediction Accuracy   Success Prob.   Fail Prob.
BiLSTM-last   0.674                 0.580           0.275
BiLSTM-max    —                     —               —
BiLSTM-attn   0.652                 0.541           0.285

Table 2: Performance of different discriminator model designs, on prediction accuracy and probabilities assigned to successful and failed dialogs.

Annotating dialogs for model training requires additional human effort. We investigate the impact of the size of the annotated dialog samples on discriminator model training. The amount of annotated dialogs required for learning a good discriminator depends mainly on the complexity of a task. Given the rather simple nature of the slot-filling based DSTC2 restaurant search task, we experiment with annotating 100 to 1000 discriminator training samples. We use the BiLSTM-max discriminator model in these experiments.

Figure 4: Impact of discriminator training sample size on RL dialog learning performance.

The adversarial RL training curves with different levels of discriminator training samples are shown in Figure 4. As these results illustrate, with 100 annotated dialogs as positive samples for discriminator training, the discriminator is not able to produce dialog rewards that are useful in learning a good policy. Learning with 250 positive samples does not lead to concrete improvement in dialog success rate either. With a growing number of annotated samples, the dialog agent becomes more likely to learn a better policy, resulting in a higher dialog success rate at the end of the interactive learning sessions.

A potential issue with RL based interactive adversarial learning is the covariate shift (Ross and Bagnell, 2010; Ho and Ermon, 2016) problem. Part of the positive examples for discriminator training are generated based on the supervised pretraining dialog policy, before the interactive learning stage. During interactive RL training, the agent's policy gets updated. The newly generated dialog samples based on the updated policy may be equally good compared with the initial set of positive dialogs, but they may look very different. In this case, the discriminator is likely to give these dialogs low rewards, as the pattern presented in these dialogs is different from what the discriminator was initially trained on. The agent will thus be discouraged from producing such successful dialogs in the future by these negative rewards.

To address this covariate shift issue, we apply a DAgger (Ross et al., 2011) style imitation learning method to the dialog adversarial learning. We assume that during interactive learning with users, we can occasionally receive feedback from users indicating the quality of the conversation they had with the agent. We then add the dialogs with good feedback as additional training samples to the pool of positive dialogs used in discriminator model training.
With this, the discriminator can learn to assign high rewards to such good dialogs in the future. In our empirical evaluation, we experiment with the agent receiving positive feedback 10% and 20% of the time during its interaction with users. The experimental results are shown in Figure 5.

Figure 5: Addressing covariate shift in online adversarial dialog learning with partial access to user feedback.

As illustrated by these curves, the proposed DAgger style learning method can effectively improve the dialog adversarial learning with RL, leading to a higher dialog success rate.
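A sketch of this DAgger-style data augmentation, under the assumption that an explicit user feedback signal is available for a fraction of simulated dialogs, could look as follows; the names, the feedback-rate argument, and the feedback accessor are illustrative only.

```python
import random

def augment_positive_pool(positive_pool, simulated_dialogs, feedback_rate=0.1):
    """DAgger-style fix for covariate shift (sketch).

    For a small fraction of freshly simulated dialogs, we assume the user
    provides explicit feedback on dialog quality. Dialogs with positive
    feedback are added to the positive pool used to train the discriminator,
    so that good dialogs produced by the updated policy are not penalized.
    """
    for dialog in simulated_dialogs:
        if random.random() < feedback_rate:          # feedback available ~10-20% of the time
            if dialog.user_feedback_is_positive():   # hypothetical feedback accessor
                positive_pool.append(dialog)
    return positive_pool
```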
Conclusions

In this work, we investigate the effectiveness of applying adversarial training in learning task-oriented dialog models. The proposed method is an attempt towards addressing the rating consistency and learning efficiency issues in online dialog policy learning with user feedback. We show that with a limited number of annotated dialogs, the proposed adversarial learning method can effectively learn a reward function and use it to guide policy optimization with policy gradient based reinforcement learning. In an experiment in a restaurant search domain, we show that the proposed adversarial learning method achieves an advanced dialog success rate compared with baseline methods using other forms of reward. We further discuss the covariate shift issue during interactive adversarial learning and show how we can address it with partial access to user feedback.

References

Antoine Bordes and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations.

Emily L Denton, Soumith Chintala, Rob Fergus, et al. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of ACL.

Layla El Asri, Romain Laroche, and Olivier Pietquin. 2014. Task completion transfer learning for reward inference. In Proc. of MLIS.

Milica Gašić, Catherine Breslin, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis, and Steve Young. 2013. On-line policy optimisation of Bayesian spoken dialogue systems via human interaction. In ICASSP, pages 8367–8371. IEEE.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Matthew Henderson, Blaise Thomson, and Jason Williams. 2013. Dialog state tracking challenge 2 & 3.

Matthew Henderson, Blaise Thomson, and Jason Williams. 2014. The second dialog state tracking challenge. In SIGDIAL.

Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR.

Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proc. of ACL.

Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. 2017a. Adversarial learning for neural dialogue generation. In Proceedings of ACL.

Xiujun Li, Yun-Nung Chen, Lihong Li, and Jianfeng Gao. 2017b. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008.

Bing Liu and Ian Lane. 2017a. An end-to-end trainable neural network model with belief tracking for task-oriented dialog. In Interspeech.

Bing Liu and Ian Lane. 2017b. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In Proceedings of the 2017 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

Bing Liu, Gokhan Tur, Dilek Hakkani-Tur, Pararth Shah, and Larry Heck. 2018. Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems. In NAACL.

Andrew Y Ng, Stuart J Russell, et al. 2000. Algorithms for inverse reinforcement learning. In ICML, pages 663–670.

Tim Paek and Roberto Pieraccini. 2008. Automating spoken dialogue management design using machine learning: An industry perspective. Speech Communication, 50(8-9):716–729.

Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Yun-Nung Chen, and Kam-Fai Wong. 2018. Adversarial advantage actor-critic model for task-completion dialogue policy learning. In ICASSP.

Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2231–2240.

Ivaylo Popov, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin Riedmiller. 2017. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073.

Stéphane Ross and Drew Bagnell. 2010. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635.

Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16).

Pei-Hao Su, Paweł Budzianowski, Stefan Ultes, Milica Gasic, and Steve Young. 2017. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 147–157, Saarbrücken, Germany. Association for Computational Linguistics.

Pei-Hao Su, Milica Gašić, Nikola Mrkšić, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. In Proceedings of ACL.

Pei-Hao Su, David Vandyke, Milica Gasic, Dongho Kim, Nikola Mrksic, Tsung-Hsien Wen, and Steve Young. 2015. Learning from real users: Rating dialogue success with neural networks for reinforcement learning in spoken dialogue systems. In Interspeech.

Marilyn A Walker, Diane J Litman, Candace A Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 271–280. Association for Computational Linguistics.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL.
Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning. In ACL.

Jason D Williams and Geoffrey Zweig. 2016. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning. arXiv preprint arXiv:1606.01269.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Zhaojun Yang, Gina-Anne Levow, and Helen Meng. 2012. Predicting user satisfaction in spoken dialog system evaluation with collaborative filtering. IEEE Journal of Selected Topics in Signal Processing, 6(8):971–981.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2017. Improving neural machine translation with conditional sequence generative adversarial nets. arXiv preprint arXiv:1703.04887.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Tiancheng Zhao and Maxine Eskenazi. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In SIGDIAL.