Guided Dialog Policy Learning without Adversarial Learning in the Loop
Ziming Li, Sungjin Lee, Baolin Peng, Jinchao Li, Julia Kiseleva, Maarten de Rijke, Shahin Shayandeh, Jianfeng Gao
University of Amsterdam, Amazon, Microsoft
[email protected], [email protected], {baolin.peng,jincli,shahins,jfgao}@microsoft.com

Abstract
Reinforcement-based training methods have emerged as the most popular choice for training an efficient and effective dialog policy. However, these methods suffer from a sparse and unstable reward signal that is usually returned by the user simulator only at the end of the dialog. Moreover, the reward signal is manually designed by human experts, which requires domain knowledge. A number of adversarial learning methods have been proposed to learn the reward function together with the dialog policy. However, to alternately update the dialog policy and the reward model on the fly, the algorithms for updating the dialog policy are limited to policy-gradient-based methods, such as REINFORCE and PPO. Moreover, the alternating training of the dialog agent and the reward model can easily get stuck in a local optimum or result in mode collapse. In this work, we propose to decompose the previous adversarial training into two separate steps. We first train the discriminator with an auxiliary dialog generator and then incorporate the trained reward model into a standard reinforcement learning method to train a high-quality dialog agent. This approach is applicable to both on-policy and off-policy reinforcement learning methods. With several experiments, we show that the proposed method achieves remarkable task success and can transfer knowledge from existing domains to a new domain.
Introduction

Task-oriented dialog systems, such as Siri, Google Assistant and Amazon Alexa, aim to assist users in completing tasks through interaction. The development of reinforcement learning in robotics and other domains offers another way of learning the dialog policy (Williams and Young, 2007; Gašić and Young, 2014; Su et al., 2017; Li et al., 2019). As it is not practical to interact with a real user in the policy training loop, a common but essential strategy is to build a user simulator that provides replies to the dialog agent (Schatzmann et al., 2007; Li et al., 2016). In a real dialog system, the agent aims to maximize the positive feedback it can obtain from the user. To simulate user feedback during training, a reward function is designed and embedded into the user simulator; it returns a reward signal to the dialog agent according to the given dialog context and system action (Peng et al., 2018b; Williams et al., 2017; Dhingra et al., 2016; Su et al., 2016). The reward signal can take the form of binary feedback or a continuous score. The most straightforward way to design such a reward function is to provide the agent with different reward signals based on the dialog status: if the dialog ends successfully, a large positive reward is returned; if the dialog fails, the reward is a large negative value; if the dialog is still ongoing, a small negative signal is returned to encourage shorter sessions (Peng et al., 2018b). However, this solution assigns the same negative signal to all system actions in the dialog except the last one, so the qualities of different actions are not distinguishable. Moreover, the truly meaningful reward signal is only returned at the end of a dialog, which delays the penalty for low-quality actions and the reward for high-quality actions. Liu and Lane (2018) address these difficulties by adopting adversarial training for policy learning: they jointly train two systems, (1) a policy model that decides on the actions to take at each turn, and (2) a discriminator that marks a dialog as successful or not. Feedback from the discriminator is used as a reward signal to push the policy model to complete a task in a way that is indistinguishable from how a human agent completes it. Following this solution, Takanobu et al. (2019) replace the discriminator with a reward function that has a specific architecture and takes as input the dialog state, the system action and the next dialog state. This method achieves higher performance with respect to success rate and other metrics.

However, to alternately update the dialog policy and the reward model on the fly, the algorithms for updating the dialog policy are limited to policy-gradient-based algorithms, such as REINFORCE (Williams, 1992) and PPO (Schulman et al., 2017), while off-policy methods are not able to benefit from the self-learned reward functions. Moreover, the alternating training of the dialog agent and the reward model can easily get stuck in a local optimum or result in mode collapse. To alleviate these potential problems, we decompose the adversarial learning method for dialog policy learning into two sequential steps. We first learn the reward function using an auxiliary dialog state generator, so that the loss from the discriminator can be backpropagated to the generator directly. In the next step, we discard the state generator and keep only the trained discriminator as the dialog reward model.
The trained reward model is then incorporated into the reinforcement learning process and is not updated further. In this way, we can use any reinforcement learning algorithm to update the dialog policy, including both on-policy and off-policy methods. Besides, we show how to use the pretrained reward functions to transfer knowledge learned in existing domains to a new dialog domain. To summarize, we make the following contributions:

• A reward learning method that is applicable to off-policy reinforcement learning methods in dialog training.

• A reward learning method that alleviates the problem of local optima in adversarial dialog training.

• A reward function that can transfer knowledge learned in existing domains to a new dialog domain.
Related Work

Building a dialog system that can handle conversations across different domains has attracted a lot of attention in the last few years. Rule-based dialog systems become powerless in multi-domain scenarios because of the rich and diverse interactions: it is intractable to take into account all possible situations and prepare corresponding solutions by predefining a set of rules. Reinforcement learning methods (Peng et al., 2017; Lipton et al., 2018; Li et al., 2017; Su et al., 2018; Dhingra et al., 2016; Williams et al., 2017) have been widely used to train a dialog agent by interacting with users. With the help of reinforcement learning, the dialog agent is able to explore dialog contexts that may not exist in the previously observed data. However, the reward signal used to update the dialog policy usually comes from a reward function predefined with domain knowledge, and designing it becomes very tricky in multi-domain dialog scenarios. To provide the dialog policy with a high-quality reward signal, Peng et al. (2018a) proposed to use the adversarial loss as an extra critic to shape the main reward function. Inspired by the success of adversarial learning in other research fields, Liu and Lane (2018) learn the reward function directly from dialog samples by alternately updating the dialog policy and the reward function. The reward function is in fact a discriminator that aims to assign high values to real human dialogues and low values to dialogues generated by the current dialog policy. In turn, the dialog policy attempts to obtain higher rewards from the discriminator for the dialogs it generates. Following this solution, Takanobu et al. (2019) replace the discriminator with a reward function that has a specific architecture and report higher performance with respect to success rate and other metrics.
Different from previous adversarial training methods (Liu and Lane, 2018; Takanobu et al., 2019), in our method the dialog policy and the reward model are trained consecutively rather than alternately at different time steps. We believe this can avoid potential training issues such as mode collapse and local optima. To achieve this goal, we introduce an auxiliary generator in the first step, which is used to explore potential dialog situations. The advantage of this setup is that we transfer the SeqGAN setting (Yu et al., 2017) to a vanilla GAN setting (Goodfellow et al., 2014). The SeqGAN setup refers to the adversarial training style in which policy gradient is used to deliver the update signal from the discriminator to the dialog agent. In contrast, in the vanilla GAN framework the discriminator can directly backpropagate the reward signal to the generator. Once we have obtained a high-quality reward model with the auxiliary generator in the first step, we can use it in standard reinforcement learning methods to update the dialog agent. Since the reward model is kept fixed during policy training, we can adopt different kinds of reinforcement learning methods, whereas the adversarial learning methods are restricted to policy-gradient-based methods.
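As an illustrative sketch (not the authors' released implementation), the two-step procedure can be written in PyTorch-style Python as follows. All component interfaces here — the generator, discriminator, expert-pair sampler, RL agent and environment — are hypothetical placeholders, and the noise dimension, batch size and step counts are arbitrary.

```python
import torch

def train_reward_model(generator, discriminator, expert_pairs, g_opt, d_opt, steps=10000):
    """Step 1: adversarial training of the reward model against an auxiliary generator."""
    for _ in range(steps):
        fake = generator(torch.randn(64, 64))         # fake (state ⊕ action) pairs from 64-dim noise
        real = expert_pairs.sample(64)                 # encoded expert (state ⊕ action) pairs
        d_loss = -(torch.log(discriminator(real) + 1e-8).mean()
                   + torch.log(1 - discriminator(fake.detach()) + 1e-8).mean())
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        fake = generator(torch.randn(64, 64))
        g_loss = torch.log(1 - discriminator(fake) + 1e-8).mean()
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return discriminator                               # the auxiliary generator is discarded afterwards

def train_policy(agent, env, reward_model, episodes=5000):
    """Step 2: the frozen reward model guides any on-policy or off-policy RL algorithm."""
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = agent.act(state)
            next_state, human_reward, done = env.step(action)
            with torch.no_grad():                      # the reward model is never updated in this step
                pair = torch.cat([env.encode_state(state), env.encode_action(action)], dim=-1)
                reward = human_reward + torch.log(reward_model(pair) + 1e-8).item()
            agent.update(state, action, reward, next_state, done)
            state = next_state
```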
We reuse the rule-based dialog state tracker in ConvLab (Lee et al., 2019) (more details about ConvLab are given in Section 4.1) to keep track of the information that emerges in the interactions between users and the dialog agent. The state tracker plays an important role in dialog systems, since its output is the foundation for the dialog policy decisions in the next step. The embedded tracker in ConvLab is able to handle multi-domain interactions. The output of the NLU module is fed to the dialog state tracker to extract informative features, including the informable slots that describe the constraints from users and the requestable slots that indicate what users want to inquire about. In addition, a belief vector is maintained and updated for each slot in every domain.
Dialog State
The scattered information from the dialog state tracker is integrated into a structured state representation state_t at time step t. The final representation contains six main feature segments: the embedded results of returned entities for a query, the availability of the booking option for a given domain, the state of informable slots, the state of requestable slots, the last user action, and the number of times the last user action has been repeated without interruption. The final state representation S is an information vector with 392 dimensions, and each position is filled with 0 or 1.
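The assembly of this state vector can be sketched as follows; the individual segment widths are placeholders, since the text specifies only that six binary segments concatenate to 392 dimensions.

```python
import numpy as np

def build_state_vector(query_result_feats, bookable_feats, informable_feats,
                       requestable_feats, last_user_action_feats, repeat_feats):
    """Concatenate the six binary feature segments produced by the state tracker."""
    segments = [query_result_feats, bookable_feats, informable_feats,
                requestable_feats, last_user_action_feats, repeat_feats]
    state = np.concatenate([np.asarray(seg, dtype=np.float32) for seg in segments])
    assert state.shape[0] == 392, "the flattened tracker features should have 392 dimensions"
    assert set(np.unique(state)) <= {0.0, 1.0}, "every position is filled with 0 or 1"
    return state
```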
Dialog Action

The action space consists of two different sets. In the first action set, each action is a concatenation of a domain name, an action type and a slot name, such as Attraction_Inform_Address and Hotel_Request_Internet. Since, in realistic scenarios, a response from a human or a dialog agent can cover several of the single actions defined in the first set, we extract the most frequently used dialog actions from the human-human dialog dataset to form the second action set. In other words, every action in the second set is a combination of two or three single actions from the first set. For example, [Attraction_Inform_Address, Hotel_Request_Internet] is regarded as a new action that the policy agent can execute. In the end, the final action space A contains 300 different dialog actions. We use a one-hot embedding to represent the actions.
We aim to train a reward function that can distinguish high-quality dialogs from unreasonable and inappropriate ones. We use a generator Gen to explore the possible dialog scenarios that could happen in real life. The dialog scenario at time t is a pair of a dialog state s_t and the corresponding system action a_t at the same time step t. The dialog state-action pairs produced by this generator are fed to the reward model as negative samples. During reward training, the reward function benefits from the rich, high-quality negative instances produced by the generator Gen, which improves its discriminability. The dialog simulation step can be formulated as

(s, a)_fake = Gen(z_sa),

where z_sa is a sampled Gaussian noise vector and each z_sa corresponds to one potential state-action pair (s, a).

To simulate the dialog actions, we adopt an MLP as the action generator Gen_a, followed by a Gumbel-Softmax function with 300 dimensions, where each dimension corresponds to a specific action in the action space. The Gumbel-Max trick (Gumbel, 1954) is commonly used to draw samples u from a categorical distribution with class probabilities p:

u = one_hot(argmax_i [g_i + log p_i]),

where g_i is independently sampled from Gumbel(0, 1). However, the argmax operation is not differentiable, so no gradient can be backpropagated through u. Instead, we adopt the soft-argmax approximation (Jang et al., 2016) as a continuous and differentiable approximation to argmax and generate k-dimensional sample vectors

y_i = exp((log(p_i) + g_i)/τ) / Σ_{j=1}^{k} exp((log(p_j) + g_j)/τ),   for i = 1, ..., k.

When the temperature τ → 0, the argmax operation is exactly recovered and samples from the Gumbel-Softmax distribution become one-hot; however, the gradient vanishes as τ approaches 0. Conversely, as τ grows, the Gumbel-Softmax samples become similar to samples from a uniform distribution over the k categories. In practice, τ should be chosen to balance the approximation bias and the magnitude of the gradient variance. In our case, p corresponds to the action distribution p(a | z_sa) and k equals the action dimension, 300.
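A minimal PyTorch version of this sampling procedure is shown below; PyTorch's built-in torch.nn.functional.gumbel_softmax implements the same trick.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    """Soft-argmax relaxation: y_i = softmax((log p_i + g_i) / tau) with g_i ~ Gumbel(0, 1).
    Logits differ from log p_i only by a constant, which cancels inside the softmax."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

logits = torch.randn(1, 300)                       # unnormalized log-probabilities over 300 actions
y_soft = gumbel_softmax_sample(logits, tau=0.5)    # low tau -> nearly one-hot, higher gradient variance

# The library version; hard=True additionally applies the straight-through estimator.
y_hard = F.gumbel_softmax(logits, tau=0.5, hard=True)
```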
In our setting, the state representation is a vector filled with discrete values, which means we cannot connect the generator with the discriminator directly. Similar to the action generation method, the Gumbel-Softmax trick could serve as the bridge that delivers the gradient from the discriminator to the state generator Gen_s. In this solution, however, we would have to attach a separate Gumbel-Softmax function to the output of Gen_s for every meaningful segment in the state representation. The Gumbel-Softmax trick is powerless in our setting because the state representation contains a large number of independent meaningful segments. Besides, a preprocessing step would be needed to expand the discrete representation into a concatenation of one-hot embeddings, which requires familiarity with the state structure. These disadvantages lead us to an alternative solution for state simulation that uses a pretrained Variational AutoEncoder (Kramer, 1991; Kingma and Welling, 2013).

State transferring with Variational AutoEncoder

Compared to GAN scenarios in computer vision, the output of the generator in our setting is a discrete vector, which makes it challenging to backpropagate the loss from the discriminator to the generator directly. To address this problem, we propose to project the discrete representation s in the expert demonstrations to a continuous space with an encoder Enc from a pretrained variational autoencoder (Kingma and Welling, 2013). Assuming the expert-like dialogue state s is generated by a latent variable z_vae via the distribution p(s | z_vae; ψ), the variable z_vae can serve as the representation we aim for in a continuous space. Given a human-generated state s, the VAE uses a conditional probabilistic encoder Enc to infer the latent z_vae:

z_vae ∼ Enc(s) = q_ω(z_vae | s),

where ω are the variational parameters of the encoder and ψ those of the decoder. The optimization objective is

L_vae(ω, ψ) = −E_{z_vae ∼ q_ω(z_vae | s)}[log p_ψ(s | z_vae)] + KL(q_ω(z_vae | s) || p(z_vae)).

The first term on the right-hand side is the reconstruction loss, which encourages the decoder parameterized by ψ to reconstruct the input s. The second term is the KL-divergence between the encoder's distribution q_ω(z_vae | s) and a standard Gaussian prior p(z_vae) = N(0, I). The benefit of projecting the state representations to a different space is that we can directly simulate dialog states in the continuous space, just like generating realistic images in computer vision. Besides, similar dialog states are embedded into nearby latent representations. As shown in Figure 1, we use a variational autoencoder to learn the state projection function Enc_ω(s) from the dialog states in real human dialogs. In summary, we transfer the discrete dialog state s from the state tracker to a continuous state space S_embed through a pretrained state encoder Enc_ω, and all subsequent training happens in this latent continuous space rather than the original state space.
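A sketch of the state VAE and its training loss is given below. The 392-dimensional input and 64-dimensional latent follow the implementation details reported later; the hidden width is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateVAE(nn.Module):
    """Sketch: encode the 392-dim discrete dialog state into a 64-dim latent z_vae."""
    def __init__(self, state_dim=392, latent_dim=64, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)       # mean of q_w(z_vae | s)
        self.logvar_head = nn.Linear(hidden, latent_dim)   # log-variance of q_w(z_vae | s)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, state_dim))

    def forward(self, s):
        h = self.enc(s)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(recon_logits, s, mu, logvar):
    """Reconstruction term (state entries are 0/1) plus KL(q(z|s) || N(0, I))."""
    recon = F.binary_cross_entropy_with_logits(recon_logits, s, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

At inference time only the mean µ is kept as the embedded representation Enc_ω(s) of a state.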
By applying the Gumbel-Softmax to action simulation and the state transfer to state simulation, we can simulate the real state-action distribution in a differentiable setup. The whole process of Gen_θ can be formulated as follows:

h = MLP(z_sa),
a_fake = f_Gumbel(MLP(h)),
s_fake = MLP(h),
(s, a)_fake = s_fake ⊕ a_fake,

where θ denotes all the parameters in the generator and ⊕ is the concatenation operation. During adversarial training, the generator Gen_θ takes noise z_sa as input, outputs a sample (s, a)_fake, and aims to obtain a higher reward signal from the discriminator D_φ. The training loss for the generator Gen_θ is

L_G(θ) = −E_{(s,a)_fake ∼ Gen_θ}[R_φ((s, a)_fake)],

where R_φ((s, a)_fake) = −log(1 − D_φ((s, a)_fake)) and D_φ denotes the discriminator measuring how realistic the generated state-action pairs (s, a)_fake are.
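The generator above can be sketched as follows; the hidden widths and temperature are placeholders, while the 64-dimensional state output and 300-dimensional action output follow the dimensions used elsewhere in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateActionGenerator(nn.Module):
    """Sketch of Gen_theta: noise z_sa -> (s_fake in the VAE latent space) ⊕ (one-hot a_fake)."""
    def __init__(self, noise_dim=64, hidden=128, state_dim=64, action_dim=300, tau=0.5):
        super().__init__()
        self.tau = tau
        self.shared = nn.Sequential(nn.Linear(noise_dim, hidden), nn.ReLU())        # h = MLP(z_sa)
        self.state_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, state_dim))               # s_fake = MLP(h)
        self.action_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                         nn.Linear(hidden, action_dim))             # logits for f_Gumbel

    def forward(self, z):
        h = self.shared(z)
        s_fake = self.state_head(h)
        a_fake = F.gumbel_softmax(self.action_head(h), tau=self.tau, hard=True)     # a_fake = f_Gumbel(MLP(h))
        return torch.cat([s_fake, a_fake], dim=-1)                                  # (s, a)_fake = s_fake ⊕ a_fake

def generator_loss(discriminator, fake_pairs):
    """L_G = -E[R_phi] with R_phi = -log(1 - D_phi((s, a)_fake)): minimize log(1 - D(fake))."""
    return torch.log(1.0 - discriminator(fake_pairs) + 1e-8).mean()
```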
Figure 1: The architecture used to simulate state-action representations with a variational autoencoder; z_sa is the sampled Gaussian noise.

The discriminator D_φ in this work is an MLP that takes as input a state-action pair (s, a) and outputs the probability D(s, a) that the sample comes from the real data distribution. Since the discriminator's goal is to assign a higher probability to real data and a lower score to fake data, its objective can be given as the average log probability it assigns to the correct classification. Given an equal mixture of real data samples and generated samples from the generator Gen_θ, the loss function for the discriminator D_φ is

L_D(φ) = −E_{(s',a) ∼ data}[log D_φ(Enc_ω(s'), a)] − E_{(s,a)_fake ∼ Gen_θ}[log(1 − D_φ((s, a)_fake))],

where s' denotes the discrete state representation from the state tracker.
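A matching sketch of the discriminator and its loss, under the same dimensional assumptions:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of D_phi: an MLP scoring a concatenated (latent state, one-hot action) pair."""
    def __init__(self, state_dim=64, action_dim=300, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, pair):
        return self.net(pair)          # probability that the pair comes from the real data distribution

def discriminator_loss(d, real_pairs, fake_pairs):
    """L_D = -E_real[log D(Enc(s'), a)] - E_fake[log(1 - D((s, a)_fake))]."""
    real_term = torch.log(d(real_pairs) + 1e-8).mean()
    fake_term = torch.log(1.0 - d(fake_pairs) + 1e-8).mean()
    return -(real_term + fake_term)
```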
MultiWOZ

MultiWOZ (Budzianowski et al., 2018) is a multi-domain dialogue dataset spanning 7 distinct domains and consisting of 10,438 dialogues. The main scenario in this dataset is a dialogue agent trying to satisfy the demands of tourists, such as booking a restaurant or recommending a hotel with specific requirements. The interactions between the dialogue agent and users can happen in 7 different domains: Attraction, Hospital, Police, Hotel, Restaurant, Taxi, Train. The average number of turns is 8.93 for single-domain and 15.39 for multi-domain dialogs.
ConvLab
ConvLab (Lee et al., 2019) is an open-source multi-domain end-to-end dialogue system platform. ConvLab offers the annotated MultiWOZ dataset and associated pre-trained reference models. We reuse the rule-based dialog state tracker from ConvLab to keep track of the information that emerges in the interactions between users and the dialog agent. Besides, an agenda-based user simulator (Schatzmann et al., 2007) is embedded in ConvLab and has been adapted for multi-domain dialogue scenarios.
Encoder

The encoder is a two-layer MLP that takes the discrete state representation (392 dimensions) as input and outputs two intermediate embeddings (64 dimensions each) corresponding to the mean and the variance, respectively. At inference time, we regard the mean µ as the embedded representation of a given state input s.
Generator

The generator takes a randomly sampled Gaussian noise vector as input and outputs a continuous state representation and a one-hot action embedding. The input noise is first fed to a one-layer MLP, followed by the state generator and the action generator. The state generator is implemented as a two-layer MLP whose output is the simulated state representation (64 dimensions) corresponding to the input noise. The main component of the action generator is a two-layer MLP followed by a Gumbel-Softmax function. The output of the Gumbel-Softmax function is a one-hot representation (300 dimensions). Specifically, in order to sample a discrete action, we implement the "Straight-Through" Gumbel-Softmax estimator (Jang et al., 2016) with a fixed temperature.
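A minimal version of the straight-through estimator is sketched below (the temperature shown is illustrative); it is equivalent to calling torch.nn.functional.gumbel_softmax with hard=True.

```python
import torch
import torch.nn.functional as F

def straight_through_gumbel(logits, tau):
    """Hard one-hot sample in the forward pass, soft Gumbel-Softmax gradient in the backward pass."""
    y_soft = F.gumbel_softmax(logits, tau=tau, hard=False)
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    # Numerically equals y_hard, but gradients flow through y_soft only.
    return (y_hard - y_soft).detach() + y_soft

action_logits = torch.randn(8, 300)                               # a batch over the 300-way action space
one_hot_actions = straight_through_gumbel(action_logits, tau=0.5)
```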
Discriminator

The discriminator is a three-layer MLP that takes as input the concatenation of the latent state representation (64 dimensions) and the one-hot encoding (300 dimensions) of the action. During adversarial training, the real samples come from the human dialogues in the training set, while the fake samples have three different sources. The main source is the output of the generator introduced above. As a second source, we randomly sample state-action pairs from the training set and replace the action in each pair with a different action to build a fake state-action pair. Besides, we keep a history buffer of limited size that records fake state-action pairs from the generator; the pairs in the buffer are randomly replaced by newly generated pairs. To strengthen the reward signal, we combine the human reward signal r(Human) with the pretrained reward function and use the mixed reward r = r(Human) + log D(s, a) as the final reward to train the dialog agent.

In this work, we validate our pretrained reward with two different types of reinforcement learning methods: Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). DQN (Mnih et al., 2015) is an off-policy reinforcement learning algorithm, while PPO (Schulman et al., 2017) is a policy-gradient-based algorithm. Note that the adversarial learning methods can only be applied to PPO or other policy-gradient algorithms. Besides, to speed up training, we extend the vanilla DQN to WDQN, where the dialog policy has access to the expert data from the training set at the very beginning. We implemented the DQN and PPO algorithms based on the reinforcement learning module in ConvLab.
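As an illustration of how the fixed reward model enters an off-policy update, a simplified DQN step is sketched below; this is not the ConvLab implementation, and the q_net, target_net, state_encoder and reward_model interfaces are placeholders.

```python
import torch
import torch.nn.functional as F

def dqn_step(q_net, target_net, optimizer, reward_model, state_encoder,
             states, actions, human_rewards, next_states, dones, gamma=0.99):
    """One DQN update in which the frozen discriminator supplies a turn-level reward."""
    with torch.no_grad():
        # mixed reward r = r(Human) + log D(Enc(s), a); the reward model stays fixed
        pairs = torch.cat([state_encoder(states), F.one_hot(actions, 300).float()], dim=-1)
        rewards = human_rewards + torch.log(reward_model(pairs).squeeze(-1) + 1e-8)
        target_q = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=-1).values

    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q, target_q)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```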
The handcrafted reward signal is defined as follows: at the end of a dialogue, if the dialog agent successfully accomplishes the task within T turns, it receives a positive reward proportional to T; otherwise, it receives −T as a penalty. T is the maximum number of turns in a dialogue and is fixed throughout the experiments. Furthermore, the dialogue agent receives a small negative intermediate reward for every ongoing turn of the dialogue. We use r(Human) to denote the handcrafted reward function.
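For reference, r(Human) can be sketched as below. The exact constants (the success bonus scale, the per-turn penalty and the maximum number of turns) are not recoverable from the text, so the values used here are assumptions for illustration only.

```python
def handcrafted_reward(done, success, max_turns=40, success_scale=2, turn_penalty=-1):
    """r(Human): per-turn shaping plus a terminal success/failure signal (constants assumed)."""
    if not done:
        return turn_penalty                 # small negative reward for every ongoing turn
    return success_scale * max_turns if success else -max_turns
```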
For the DQN-based methods, we have DQN(Human), DQN(GAN-AE) and DQN(GAN-VAE), where GAN-VAE is our method and GAN-AE denotes the variant in which the variational autoencoder is replaced with a vanilla autoencoder. For WDQN, we likewise use the three different reward signals Human, GAN-AE and GAN-VAE. For the PPO-based methods, we implemented Generative Adversarial Imitation Learning (GAIL) (Ho and Ermon, 2016). In GAIL, the reward signal is provided by a discriminator whose parameters are updated during the adversarial training process. To compare the efficiency of different reward signals in a fair setup, the discriminator in GAIL is pretrained, but the dialog policies are initialized randomly for all methods. We report the average performance of each method over several runs with different random seeds. In the rest of this paper, we use GAN-VAE to denote the reward function trained with GAN and VAE, and similarly for GAN-AE.
Figure 2: The learning process with different reward functions and training agents.
Figure 2 shows the learning process with different reward functions but the same user simulator. Among the DQN agents, the dialogue policy trained with GAN-VAE shows the best performance in terms of convergence speed and success rate. Compared to GAN-VAE and GAN-AE, the update signal from the handcrafted reward function r(Human) can still optimize the dialog policy to a reasonable performance, but at a slower pace. This indirectly confirms that denser reward signals can speed up the training of a dialog policy. Besides, the policy with the handcrafted reward function r(Human) converges to a lower success rate than GAN-VAE and GAN-AE. We believe that, to some extent, the pretrained reward functions have captured the underlying information needed to measure the quality of given state-action pairs, and that the knowledge the reward function acquired during the adversarial learning step generalizes to unseen dialog states and actions, helping to avoid local optima. In contrast, the dialog agent DQN(Human) relies only on the final reward signal from the simulator at the end of a dialog, which provides little guidance for the ongoing dialogue turns. This could be the reason why DQN(Human) shows a lower success rate than DQN(GAN-VAE) and DQN(GAN-AE). The quality of the learned state embeddings explains the performance difference between GAN-VAE and GAN-AE: the VAE brings more benefits to the reward function because of its better generalizability.

For the WDQN agents, all three methods reach their inflection points early in training. Comparing DQN(Human) and WDQN(Human), we find that the expert dialog pairs from the training set do alleviate the problem of sparse reward signals for the handcrafted reward function during the initial stage of policy training. Similar results can be observed for the agents with pretrained reward functions. Later in training, the curve of WDQN(Human) coincides with that of DQN(Human), and they converge to the same point in the end. The faster convergence of WDQN(Human) does not bring a higher success rate, because the dialog policy still has no access to precise intermediate reward signals for the ongoing dialogue turns.
Dialog Agent          Success Rate   Average Turn
WDQN_keep(Human)      0.741          9.572
WDQN_keep(GAN-AE)     0.879          7.559
WDQN(Human)           0.906          6.790
WDQN(GAN-AE)          0.911          6.649
WDQN(GAN-VAE)         0.937          6.130
DQN(Human)            0.870          7.480
DQN(GAN-AE)           0.953          6.150
DQN(GAN-VAE)          —              —

Table 1: WDQN_keep means the dialog policy has access to the expert state-action pairs during the whole training stage; WDQN is the agent described in Section 4.4, where the expert dialogues are gradually removed from the expert buffer. We report the average over multiple runs of each method.

Table 1 reports the final performance of the different dialog agents at test time. All agents were trained for the same number of frames, and we keep the model with the best performance observed during training.
One interesting finding is that DQN(GAN-VAE) outperforms WDQN(GAN-VAE), while WDQN(Human) beats DQN(Human). The warm-up stage in WDQN(GAN-VAE) improves the training speed but has the side effect of a lower success rate in the final stage. The likely reason is that the expert dialogs provide a strong update signal at the beginning of training but also limit the exploration ability of the dialog agent. To verify this argument, we designed two more WDQN agents, WDQN_keep(Human) and WDQN_keep(GAN-AE), shown in Table 1. In these two agents, the expert dialogues are kept throughout the whole training stage instead of being removed gradually. With the human-designed reward function, there is a large performance gap, almost 0.17 in success rate, between WDQN_keep(Human) and WDQN(Human). The performance difference between WDQN_keep(GAN-AE) and WDQN(GAN-AE) is much smaller, because the pretrained reward function provides more precise and consistent update signals that were explored and disclosed during the adversarial training step.
Figure 3: The learning process with different reward functions and PPO agents.
In the adversarial training methods, the reward functions are updated on the fly and only policy-gradient-based reinforcement learning algorithms are applicable. To compare the performance of a pretrained reward function with reward functions updated in real time, we use the PPO algorithm to train the dialog agent with the different reward functions. Depending on how the discriminator in GAIL is designed, we have two different variants of PPO(GAIL). For the first variant, we pretrain the discriminator by first training a dialogue policy with imitation learning to generate negative samples; these generated samples are then used to pretrain the discriminator. The second variant reuses the pretrained reward function from our proposed method, PPO(GAN-VAE), but keeps updating it during the training process. The training performance is shown in Figure 3. Note that the dialog agents shown in Figure 3 are not pretrained; only the corresponding reward models have been tuned in advance. PPO(GAN-VAE) manages to increase the success rate through interaction with the user simulator. For the reward functions updated in real time, corresponding to the two GAIL variants, the success rate increases gradually early in training, then slows down over the following interactions and gets stuck in a local optimum. The learning curve of the human-designed reward function keeps growing, albeit slowly.
Figure 4: The learning process with different reward functions and DQN agents in the domain-transfer experiment.
When defining the action space, we keep the most frequent actions from the MultiWOZ dataset and use a one-hot embedding to represent the actions. As shown in Figure 1, the action representation is concatenated to the state representation to denote a specific state-action pair. However, this way of formulating the action space ignores the relations between different actions. For example, Restaurant_Inform_Price and Restaurant_Request_People should be close in the same conversation, since they belong to the same domain. On the other hand, there are also connections between actions from two different domains: the slot types Inform_Price and Request_People can also appear in the Hotel domain, corresponding to the actions Hotel_Inform_Price and Hotel_Request_People. We are curious whether we can transfer the knowledge learned in several domains to a new, never-seen domain through the pretrained reward function. To verify this hypothesis, we first reformulate the action representation as a concatenation of three segments: Onehot(Domain), Onehot(Diact), Onehot(Slot). In this way, actions containing similar information are linked through the corresponding segments in the action representation (see the sketch below).
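The segmented representation can be sketched as follows; the domain, act and slot vocabularies here are small illustrative subsets, not the full MultiWOZ inventories.

```python
import numpy as np

DOMAINS = ["Restaurant", "Hotel", "Attraction", "Train", "Taxi", "Bus", "Hospital", "Police"]
DIACTS = ["Inform", "Request", "Book", "Recommend"]
SLOTS = ["Price", "People", "Address", "Internet", "Parking", "Stars"]

def onehot(item, vocab):
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab.index(item)] = 1.0
    return vec

def factored_action(domain, diact, slot):
    """Onehot(Domain) ⊕ Onehot(Diact) ⊕ Onehot(Slot)."""
    return np.concatenate([onehot(domain, DOMAINS), onehot(diact, DIACTS), onehot(slot, SLOTS)])

a1 = factored_action("Restaurant", "Inform", "Price")
a2 = factored_action("Hotel", "Inform", "Price")   # shares the act and slot segments with a1
```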
Following this formulation, we retrain our reward function on several domains and incorporate it into the training process of a dialogue agent in a new domain. In our work, the existing domains are Restaurant, Bus, Attraction and Train, and the test domain is Hotel, since Hotel has the most slot types and some of them, such as Internet, Parking and Stars, are unique to it. As shown in Figure 4, the agent DQN(GAN-VAE + NoHotel) still benefits from the reward function trained on the other domains and manages to outperform DQN(Human). The two full-domain agents, DQN(GAN-VAE + FullDomain), correspond to dialog agents trained on all domains: in one the action is represented with a single one-hot embedding, and the other is obtained by replacing that action representation with the segmented one. As expected, the reward function trained on all domains performs better than the one trained only on the other domains.

Conclusion

In this work, we propose a guided dialog policy training method that does not use adversarial training in the loop. We first train the reward model with an auxiliary generator and then incorporate this trained reward model into a standard reinforcement learning method to train a high-quality dialog agent. With several experiments, we show that the proposed method achieves remarkable task success and can transfer knowledge from existing domains to a new domain.
References
Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ – a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2016. Towards end-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777.

Milica Gašić and Steve Young. 2014. Gaussian processes for POMDP-based dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1):28–40.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Emil Julius Gumbel. 1954. Statistical Theory of Extreme Values and Some Practical Applications: A Series of Lectures, volume 33. US Government Printing Office.

Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. In NIPS, pages 4565–4573.

Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.

Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Mark A Kramer. 1991. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2):233–243.

Sungjin Lee, Qi Zhu, Ryuichi Takanobu, Xiang Li, Yaoqin Zhang, Zheng Zhang, Jinchao Li, Baolin Peng, Xiujun Li, Minlie Huang, et al. 2019. ConvLab: Multi-domain end-to-end dialog system platform. arXiv preprint arXiv:1904.08637.

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008.

Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688.

Ziming Li, Julia Kiseleva, and Maarten de Rijke. 2019. Dialogue generation: From imitation learning to inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6722–6729.

Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. 2018. BBQ-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Thirty-Second AAAI Conference on Artificial Intelligence.

Bing Liu and Ian Lane. 2018. Adversarial learning of task-oriented neural dialog models. In Proceedings of the SIGDIAL 2018 Conference, pages 350–359.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529.

Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Yun-Nung Chen, and Kam-Fai Wong. 2018a. Adversarial advantage actor-critic model for task-completion dialogue policy learning. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6149–6153. IEEE.

Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, and Kam-Fai Wong. 2018b. Deep Dyna-Q: Integrating planning for task-completion dialogue policy learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2182–2192.

Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. arXiv preprint arXiv:1704.03084.

Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Hui Ye, and Steve Young. 2007. Agenda-based user simulation for bootstrapping a POMDP dialogue system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 149–152. Association for Computational Linguistics.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Pei-Hao Su, Pawel Budzianowski, Stefan Ultes, Milica Gasic, and Steve Young. 2017. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. arXiv preprint arXiv:1707.00130.

Pei-Hao Su, Milica Gasic, Nikola Mrkšić, Lina M Rojas Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2431–2441.

Shang-Yu Su, Xiujun Li, Jianfeng Gao, Jingjing Liu, and Yun-Nung Chen. 2018. Discriminative deep Dyna-Q: Robust planning for dialogue policy learning. In EMNLP.

Ryuichi Takanobu, Hanlin Zhu, and Minlie Huang. 2019. Guided dialog policy learning: Reward estimation for multi-domain task-oriented dialog. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 100–110.

Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning. arXiv preprint arXiv:1702.03274.

Jason D Williams and Steve Young. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.