Meta Dialogue Policy Learning
Yumo Xu∗
School of Informatics, University of Edinburgh, UK
[email protected]
Chenguang Zhu
Microsoft Cognitive Services Research, Redmond, WA, USA
[email protected]
Baolin Peng
Microsoft Research, Redmond, WA, USA
[email protected]
Michael Zeng
Microsoft Cognitive Services Research, Redmond, WA, USA
[email protected]
Abstract
Dialogue policy determines the next-step actions for agents and hence is central to a dialogue system. However, when migrated to novel domains with little data, a policy model can fail to adapt due to insufficient interactions with the new environment. We propose the Deep Transferable Q-Network (DTQN) to utilize shareable low-level signals between domains, such as dialogue acts and slots. We decompose the state and action representation space into feature subspaces corresponding to these low-level components to facilitate cross-domain knowledge transfer. Furthermore, we embed DTQN in a meta-learning framework and introduce Meta-DTQN with a dual-replay mechanism to enable effective off-policy training and adaptation. In experiments, our model outperforms baseline models in terms of both success rate and dialogue efficiency on the multi-domain dialogue dataset MultiWOZ 2.0.
Introduction
Task-oriented dialogue systems aim to assist users to efficiently accomplish daily tasks such as booking a hotel or reserving dinner at a restaurant. Complex systems like Alexa and Siri often contain thousands of task domains. However, a successful model on one task often requires hundreds or thousands of carefully labelled domain-specific dialogues, which consumes a large amount of human effort. Therefore, how to agilely adapt an existing dialogue system to new domains with a scant number of training samples is an essential task in task-oriented dialogue.
In this paper, we investigate dialogue policy, or dialogue management, which lies at the center of a task-oriented dialogue system. Dialogue policy determines the next-step action of the agent given dialogue states and the user's goals. As a dialogue is composed of multiple turns, the feedback to a dialogue policy's decision is often delayed until the end of the conversation. Therefore, Reinforcement Learning (RL) is usually leveraged to improve the efficiency and success rate in dialogue policy learning [1].
There have been a number of methods applying dialogue policy in multi-domain settings [2–4]. These models usually employ an all-in-one multi-hot representation for dialogue states. The state embedding vector is a concatenation of multiple segments, each a multi-hot vector for the states in one domain. However, when there are unseen domains at inference time, the corresponding parameters of their dialogue acts and slots are not optimized. This significantly limits the adaptation performance of policy models.
∗Work done during an internship at Microsoft.
To alleviate this problem, we note that there is often shareable low-level information between different domains. For instance, suppose the source domain is taxi-booking and the target domain is hotel-booking. Although the two domains have different ontologies, both share certain dialogue slots (e.g., start time and location) and dialogue acts (e.g., request and inform). These shared concepts bear a lot of similarities both in textual representation and in the corresponding agent policies. Thus, it is feasible to transfer domain knowledge via these commonalities in ontologies.
To this end, we propose a Deep Transferable Q-Network (DTQN), based on the Deep Q-Network (DQN) [5] in reinforcement learning, which learns to predict accurate Q-function values given dialogue states and system actions. In DTQN, we factorize the dialogue state space into a set of lower-level feature spaces. Specifically, we hierarchically model cross-domain relations at domain level, act level and slot level. State representations are then composed of several shareable sub-embeddings. For instance, slots like start time in different domains will now share the same slot-level embedding. Furthermore, instead of treating actions as independent regression classes as in DQN, we decompose the dialogue action space so that our model learns to represent actions based on common knowledge between domains.
To adapt DTQN to few-shot learning scenarios, we leverage the meta-learning framework. Meta-learning aims to guide the model to rapidly learn knowledge from new environments with only a few labelled samples [6, 7]. Previously, meta-learning has been successfully employed in the Natural Language Generation (NLG) module of dialogue systems [8]. However, NLG is supervised learning by nature.
Comparatively, there has been little work on applying meta-learning to dialogue policy, as it is known that applying RL under meta-learning, a.k.a. meta-RL, is a much harder problem than meta supervised learning [9].
To verify this fact, we train the DTQN model under the Model-Agnostic Meta-Learning (MAML) framework [6]. However, we find through experiments that the canonical MAML fails to let the policy model converge, because the task training phase leverages off-policy learning while the task evaluation and meta-adaptation phases employ an on-policy strategy. Thus, the model initially receives very sparse reward signals, especially on complex composite-domain tasks. As a result, the dialogue agent is prone to overfit the on-policy data and to get stuck at a local minimum in the policy space.
Therefore, we further propose Meta-DTQN with a dual-replay mechanism. To support effective off-policy learning in meta dialogue policy optimization, we construct a task evaluation memory to cache dialogue trajectories and prefill it with rule-based experiences for task evaluation. This dual-replay strategy ensures the consistency of the off-policy strategy in both meta-training and meta-adaptation, and provides richer dialogue trajectory records to enhance the quality of the learned policy model. Empirical results show that the dual-replay mechanism can effectively increase the success rate of DTQN while reducing the dialogue length, and that Meta-DTQN with dual replay outperforms strong baselines on the multi-domain task-oriented dialogue dataset MultiWOZ 2.0 [10].
Dialogue Policy Learning
Dialogue policy, also known as the dialogue manager, is the controlling module in task-oriented dialogue that determines the agent's next action. Early work on dialogue policy was built on manual rules [11]. As the outcome of a dialogue does not emerge until the end of the conversation, dialogue policy is often trained via Reinforcement Learning (RL) [12]. For instance, deep RL has proven useful for strategic conversations [13], and a sample-efficient online RL algorithm has been proposed to learn from only a few hundred dialogues [14]. Towards more effective completion of complex tasks, hierarchical RL is employed to learn a multi-level policy either through temporal control [2] or subgoal discovery [15]. Model-based RL also helps a dialogue agent plan for the future during conversations [1]. While RL for multi-domain dialogue policy learning has attracted increasing attention from researchers, dialogue policy transfer remains under-studied.
Meta-Learning
Meta-learning is a framework for adapting models to new tasks with a small amount of data [16]. It can be achieved either by finding an effective prior as the initialization for new-task learning [16], or by a meta-learner that optimizes the model so it can quickly adapt to new domains [17]. In particular, the Model-Agnostic Meta-Learning (MAML) framework [6] applies to any optimizable system. It associates the model's performance with its adaptability to new systems, so that the resulting model can achieve maximal improvement on new tasks after a small number of updates.
In dialogue systems, meta-learning has been applied to response generation. The domain-adaptive dialogue generation method DAML [8] is an end-to-end dialogue system that can adapt to new domains with a few training samples. It places the state encoder and response generator into the MAML framework to learn general features across multiple tasks.
Reinforced Dialogue Agent
Task-oriented dialogue management is usually formulated as a Markov Decision Process (MDP): a dialogue agent interacts with a user through sequential actions based on the observed dialogue states s to fulfill the target conversational goal. At step t, given the current state s_t of the dialogue, the agent selects a system action a_t based on its policy π, i.e., a_t = π(s_t), and receives a reward r_t from the environment.¹ The expected total reward of taking action a under state s is defined as the Q-function Q(s, a):

Q(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{T-t} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\, a_t = a\right]   (1)

where T is the maximum number of turns in the dialogue and γ ∈ [0, 1] is a discount factor. The policy π is trained to find the optimal Q-function Q*(s, a) so that the expected total reward at each state is maximized. The optimal policy acts greedily: π*(s) = argmax_{a ∈ A} Q*(s, a).

To better explore the action space, an ε-greedy policy is employed to select the action based on the state s: with probability ε, a random action is chosen; with probability 1 − ε, the greedy action a = argmax_{a'} Q*(s, a'; θ_Q) is taken. Here, the Q-function is modeled by a Deep Q-Network (DQN) [18] with parameters θ_Q. To train this network, state-action transitions (s_t, a_t, r_t, s_{t+1}) are stored in a replay buffer M. At each training step, a batch of samples is drawn from M to update the policy network via the 1-step temporal difference (TD) error, implemented with a mean-squared error loss:

\mathcal{L}(\theta_Q) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{M}}\left[\big(y - Q(s, a; \theta_Q)\big)^{2}\right]   (2)
y = r + \gamma \max_{a'} Q'(s', a'; \theta_{Q'})   (3)

where Q' is the target network, which is only periodically replaced by Q to stabilize training.

¹The reward r measures the degree of success of a dialogue. In ConvLab [4], for example, success leads to a reward of 2L, where L is the maximum number of turns in a dialogue (set to 40 by default), while failure leads to a reward of −L. To encourage shorter dialogues, the agent also receives a reward of −1 at each turn.
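For concreteness, the following is a minimal sketch of the ε-greedy action selection and the 1-step TD update of Equations (2)-(3). The network, buffer, and hyperparameter names (QNet output shape, epsilon, gamma) are illustrative assumptions rather than the paper's actual implementation.

```python
import random
import torch
import torch.nn as nn

def select_action(q_net, state, epsilon, num_actions):
    """Epsilon-greedy action selection over Q(s, ·)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)          # explore
    with torch.no_grad():
        return int(q_net(state).argmax().item())      # exploit: argmax_a Q(s, a)

def dqn_loss(q_net, target_net, batch, gamma=0.9):
    """1-step TD loss of Eq. (2)-(3) on a minibatch (s, a, r, s', done)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)                   # Q(s, a; θ_Q)
    with torch.no_grad():
        y = r + gamma * (1 - done) * target_net(s_next).max(1).values      # Eq. (3), target network
    return nn.functional.mse_loss(q_sa, y)                                  # Eq. (2)
```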
Environment and Domain
The dialogue environment typically includes a database that can be queried by the system, and a user simulator that mimics human actions to interact with the agent. At the beginning of a conversation, the user simulator specifies a dialogue goal, and the agent is optimized to accomplish it. Dialogue goals are generated from one or multiple domain(s). For instance, in the benchmark multi-domain dialogue dataset MultiWOZ [10], there are in total 7 domains and 25 domain compositions. Each domain composition consists of one or more domains, e.g., {hotel} and {hotel, restaurant, taxi}. We split all domains into source domains and target domains to fit the meta-learning scenario (see Section 5 for details).
State Representation
We show the dialogue state representation for classic DQN in Figure 1(A). After receiving a system action a, the environment responds with a user action, which is then fed into a dialogue state tracker (DST) to update the dialogue agenda. The DST maintains the entire dialogue record with a state dictionary, and the DQN has a state encoder to embed the dictionary into a state vector. In detail, this state encoder represents states with multi-hot state vectors covering six primary feature categories [4], e.g., request and inform. As shown in the bottom-left corner of Figure 1(A), each category is encoded as the concatenation of a few domain-specific multi-hot vectors from its relevant domains, and the concatenation of the six category representations forms a binary state representation (see Appendix A for details).

We argue that two major issues in the classic DQN system prohibit its generalization to unseen domains: (1) the input states adopt multi-hot representations where no inter-state relation is considered, and (2) given the state input, actions in different domains are modeled as independent regression classes. However, there is a considerable amount of domain knowledge that can be shared across actions and states, e.g., both taxi-booking and hotel-reserving tasks share dialogue slots such as start time and location and dialogue acts such as request. These types of information elicit similar text representations and policy handling.

Figure 1: Framework of (A) classic DQN for Dialogue Policy Learning and (B) our Deep Transferable Q-Network (DTQN) for Cross-Domain Dialogue Policy Learning.
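To make the multi-hot scheme concrete, here is a small sketch (with invented domains and slots) of how a per-category state segment is built by concatenating one multi-hot block per domain; a slot position belonging only to an unseen domain stays all-zero and never receives training signal, which is exactly the generalization problem described above.

```python
import numpy as np

# Hypothetical ontology: each domain owns its own slot positions.
DOMAIN_SLOTS = {
    "taxi":  ["leave_at", "destination"],
    "hotel": ["name", "price_range", "stars"],
}

def inform_segment(filled):
    """Multi-hot 'inform' segment: one block per domain, 1 where a slot is filled."""
    blocks = []
    for domain, slots in DOMAIN_SLOTS.items():
        block = np.zeros(len(slots))
        for i, slot in enumerate(slots):
            if (domain, slot) in filled:
                block[i] = 1.0
        blocks.append(block)
    return np.concatenate(blocks)   # length grows linearly with the number of domains

print(inform_segment({("taxi", "leave_at"), ("hotel", "stars")}))  # [1. 0. 0. 0. 1.]
```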
To enable effective knowledge learning and transfer across different domains, we reformulate cross-domain dialogue policy learning as a state-action matching problem. As shown in Figure 1(B), we propose DTQN, a Deep Transferable Q-Network that jointly optimizes the policy network and domain knowledge representations.

Driven by the structure of dialogue knowledge, we assume that the dialogue state space S and the system action space A are factorized by a set of lower-level feature spaces. Based on this hypothesis, we aim to model cross-domain relations at different levels in DTQN: domain level, act level and slot level. To this end, we hierarchically decompose states and actions into four embedding subspaces, shared across all dialogue sessions: domains D, dialogue acts C, slots O, and values V. Both states and actions are encoded by joining different sets of subspace embeddings.

We retain the existing categorization of dialogue state features mentioned in Section 3, considering its effectiveness in dialogue management [1, 4]. We first represent each feature category as a dense vector s_h, h ∈ [1, H], and then concatenate the H = 6 category vectors into the state representation s ∈ R^{d_s}:

s = \mathrm{ReLU}\big(W_s\,[s_1, s_2, \ldots, s_H]\big).   (4)

Note that each s_h, h ∈ [1, H], consists of |D_h| domain-specific features, each corresponding to a domain. In the classic DQN state representation [4], few features are shared across domains. As a result, an agent cannot generalize its policy from source domains to a target domain if the target state space remains unseen and the target action space is mostly unexplored. Besides, the length of the state representation grows linearly with |D_h|.

Here, we propose to use a fixed-length state vector s_h to represent the h-th feature category. To do that, we use cross-domain features to aggregate state information from different domains. In detail, the i-th domain-specific component of s_h is denoted by ŝ_{h,i}. For example, for the inform category,

\hat{s}_{h,i} = \big[\,d_{h,i},\; \overline{o \odot v},\; u_{h,i}\,\big]   (5)

where d_{h,i} denotes the embedding of the domain, \overline{o \odot v} is the average of the inner products between general slot embeddings and their value embeddings, and the binary feature u_{h,i} tracks whether the corresponding domain is active, i.e., whether the essential domain slots are already filled.
Algorithm 1 Meta Dialogue Policy Learning
1:  function MetaPolicyLearning
2:    Initialize Q(s, a; θ_Q) and Q'(s, a; θ_{Q'}) with θ_{Q'} ← θ_Q            ▷ Policy network and target network
3:    Initialize experience replay memories M_tr and M_ev using Replay Buffer Spiking (RBS)   ▷ Dual replay
4:    Set K domains {d_k}_{k=1}^K and gather the domain compositions {T_{d_k}}_{k=1}^K, where d_k ∈ T_{d_k}, 1 ≤ k ≤ K
5:    for n ← 1 to N do                                                          ▷ Outer loop for meta-training
6:      Generate K dialogue goals {t_1, ..., t_K} from d_k or Uniform(T_{d_k})   ▷ Single or composite domain
7:      Initialize meta-training loss L ← 0
8:      for k ← 1 to K do                                                        ▷ Inner loop for task data collection and training
9:        θ' ← θ_Q and load agent with θ'
10:       EnvInteract(t_k, M_tr, B_tr)                                           ▷ Task training data collection
11:       Sample random minibatches of (t_k, s, a, r, s') from M_tr
12:       Update θ' via Z-step minibatch SGD
13:       EnvInteract(t_k, M_ev, B_ev)                                           ▷ Task evaluation data collection
14:       Sample random minibatches of (t_k, s, a, r, s') from M_ev
15:       Forward pass with the minibatches and obtain L_{t_k}
16:       L ← L + L_{t_k}
17:     end for
18:     Load agent with θ_Q and update with respect to L via minibatch SGD
19:     Every C steps reset θ_{Q'} ← θ_Q                                         ▷ Target network update
20:   end for
21: end function
To obtain the fixed-length representation for a feature category, we aggregate its domain features from all relevant domains via a non-linear transformation with a residual connection:

s_h = \frac{1}{|\mathcal{D}_h|}\sum_{i=1}^{|\mathcal{D}_h|}\Big(\hat{s}_{h,i} + \mathrm{ReLU}\big(W_h\,\hat{s}_{h,i} + b_h\big)\Big)   (6)

where W_h projects domain-specific features into a feature space shared across domains, and we acquire the final state representation s via Equation (4).

Different from DQN, which encodes only dialogue states and incorporates no prior information about actions, we explicitly model the structural information of system actions with an action encoder in DTQN to maximize knowledge sharing across domains. Action encoding follows a procedure analogous to state encoding, except that it does not use the value space V. For each system action a, the domains that contain this action form a set D_a. We encode its ℓ-th domain feature (1 ≤ ℓ ≤ |D_a|) as â_ℓ = [d_ℓ, c_ℓ, ō], where c_ℓ is the embedding of the dialogue act, e.g., request or booking, and ō is the average of the slot embeddings. We then obtain the system action embedding a = \frac{1}{|\mathcal{D}_a|}\sum_{\ell=1}^{|\mathcal{D}_a|}\big(\hat{a}_\ell + \mathrm{ReLU}(W_a\,\hat{a}_\ell + b_a)\big).

All embedding tables are shared between the state and action encoders. We stack all action vectors and denote the action matrix as A ∈ R^{|A| × d_a}, which is then used to produce the Q-values:

Q(s, \cdot) = \frac{1}{\sqrt{d_a}}\, A\, W_q\, s \;\in\; \mathbb{R}^{|\mathcal{A}|}   (7)

where W_q ∈ R^{d_a × d_s} is a parameter matrix.

To adapt the Q-network to few-shot learning scenarios, we propose to use a meta-learning framework [6] and present an instantiation of this framework with the DTQN as the policy network Q and the target network Q'. Algorithm 1 shows the pseudocode of our methodology for meta dialogue policy learning.

At the beginning of each outer loop of meta-training, we first sample K dialogue goals as training tasks. For the k-th inner-loop step, the agent interacts with the environment using task t_k to collect trajectories and stores them in the replay buffer M_tr (see Appendix B for details of the function EnvInteract). Then, we sample from M_tr a minibatch of experiences of task t_k, denoted B_tr^{t_k}.
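As a reading aid, the following is a minimal sketch of the DTQN scoring path of Equations (4)-(7): shared domain/act/slot embedding tables, residual cross-domain aggregation, and a bilinear match between the state vector and the stacked action matrix. Dimensions, the feature assembly, and the helper names (aggregate, category_feats, action_list) are illustrative assumptions, not the released implementation.

```python
import math
import torch
import torch.nn as nn

class DTQNScorer(nn.Module):
    """Sketch of Eq. (4)-(7): shared embeddings -> state vector s and action matrix A -> Q = A W_q s / sqrt(d_a)."""
    def __init__(self, n_domain, n_act, n_slot, d=32, n_cat=6):
        super().__init__()
        self.domain_emb = nn.Embedding(n_domain, d)   # D, shared by states and actions
        self.act_emb = nn.Embedding(n_act, d)         # C
        self.slot_emb = nn.Embedding(n_slot, d)       # O (value embeddings omitted in this sketch)
        self.W_h = nn.Linear(2 * d + 1, 2 * d + 1)    # per-category projection of Eq. (6)
        self.W_s = nn.Linear(n_cat * (2 * d + 1), d)  # Eq. (4)
        self.W_a = nn.Linear(3 * d, 3 * d)            # action-side residual projection
        self.W_q = nn.Linear(d, 3 * d, bias=False)    # bilinear map of Eq. (7)

    @staticmethod
    def aggregate(feats, proj):
        # mean_i (x_i + ReLU(W x_i + b)): residual cross-domain aggregation, Eq. (6)
        return (feats + torch.relu(proj(feats))).mean(dim=0)

    def encode_state(self, category_feats):
        # category_feats: list of n_cat tensors, each [num_domains_h, 2d+1] built as [d_emb, mean(o*v), u]
        s_h = [self.aggregate(f, self.W_h) for f in category_feats]
        return torch.relu(self.W_s(torch.cat(s_h)))                          # Eq. (4)

    def encode_action(self, domain_ids, act_ids):
        # one action shared by several domains: per-domain features [d_emb, c_emb, mean slot emb]
        o_bar = self.slot_emb.weight.mean(dim=0).expand(len(domain_ids), -1)
        feats = torch.cat([self.domain_emb(domain_ids), self.act_emb(act_ids), o_bar], dim=-1)
        return self.aggregate(feats, self.W_a)

    def forward(self, category_feats, action_list):
        s = self.encode_state(category_feats)
        A = torch.stack([self.encode_action(d, c) for d, c in action_list])  # [|A|, 3d]
        return A @ self.W_q(s) / math.sqrt(A.size(-1))                       # Eq. (7)
```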
[Table 1 layout: columns report Success, Reward, and Turns for Hotel, Train, Police, and Average; rows cover the few-shot models DQN-1K and DTQN-1K and the adaptive models Vanilla DQN, DQN, DTQN, Meta-DTQN-SR, and Meta-DTQN.]
Table 1: System performance in the single-domain setting on 2,000 dialogues in the target domains.

The loss function L_{t_k} is from Equation (2). We compute task-specific updated parameters θ_Q^{(k)} from θ_Q:

\theta_Q^{(k)} = \theta_Q - \alpha\,\nabla_{\theta_Q} \mathcal{L}_{t_k}\big(\theta_Q;\, \mathcal{B}_{tr}^{t_k} \sim \mathcal{M}_{tr}\big)   (8)

where

\nabla_{\theta_Q} \mathcal{L}_{t_k}(\theta_Q) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}_{tr}^{t_k}}\Big[\big(r + \gamma \max_{a'} Q'(s', a'; \theta_{Q'}) - Q(s, a; \theta_Q)\big)\,\nabla_{\theta_Q} Q(s, a; \theta_Q)\Big].   (9)

With the updated parameters θ_Q^{(k)}, the agent interacts with the environment and obtains the trajectory B_ev^{t_k}. According to MAML [6], the task evaluation loss L_{t_k}(θ_Q^{(k)}; B_ev^{t_k}) should be directly used to update θ_Q with learning rate β:

\theta_Q \leftarrow \theta_Q - \beta\,\nabla_{\theta_Q} \sum_{k=1}^{K} \mathcal{L}_{t_k}\big(\theta_Q^{(k)};\, \mathcal{B}_{ev}^{t_k}\big).   (10)

However, this on-policy learning suffers from very sparse rewards, especially at the initial learning stage. This is due to the inherent difficulties of cross-domain dialogue learning: i) the state-action space to explore is much larger, and ii) the conversation required to complete the task is often longer [2]. As a result, the dialogue agent is prone to overfit the on-policy data and to get stuck at a local minimum in the policy space.

To alleviate this problem, we propose a dual-replay framework to support efficient off-policy learning in meta-RL. Apart from the main replay buffer M_tr for meta-training, we construct a task evaluation memory M_ev. We note that it is essential to separate M_tr and M_ev, since the task evaluation buffer serves the evaluation purpose for each task and should not be seen during task training.

Moreover, we adopt a variant of imitation learning, Replay Buffer Spiking (RBS) [19], to warm up the learning process. Before our agent interacts with the environment, we employ a rule-based agent crafted for MultiWOZ to initialize both M_tr and M_ev. Then, in steps 14-16, we collect new trajectories with our agent and push them into M_ev. We uniformly sample from M_ev a minibatch B_ev^{t_k}, which can be a mixture of on-policy and relevant off-policy data, to calculate the task evaluation loss L_{t_k}. As a result, θ_Q is updated as:

\theta_Q \leftarrow \theta_Q - \beta\,\nabla_{\theta_Q} \sum_{k=1}^{K} \mathcal{L}_{t_k}\big(\theta_Q^{(k)};\, \mathcal{B}_{ev}^{t_k} \sim \mathcal{M}_{ev}\big).   (11)

At test time, for an unseen domain, we adopt a similar off-policy approach for meta-adaptation. This train-test consistency circumvents the known difficulty of on-policy meta-adaptation with off-policy meta-training [7]. In fact, classic MAML for RL can be seen as a special case of our dual-replay architecture obtained by setting the task evaluation memory size |M_ev| to |B_ev^{t_k}|.
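The dual-replay update of Equations (8)-(11) can be sketched as follows. The agent/buffer interfaces (collect_episode, td_loss, sample, extend) are assumed helpers, and the update is a first-order simplification rather than the exact training code.

```python
import copy
import torch

def meta_step(q_net, target_net, env, tasks, m_tr, m_ev, alpha=1e-3, beta=1e-4):
    """One outer meta-training step with dual replay (Eq. 8 and Eq. 11), first-order approximation."""
    meta_grads = [torch.zeros_like(p) for p in q_net.parameters()]
    for task in tasks:
        # --- task training: off-policy data from the training memory M_tr ---
        m_tr.extend(collect_episode(env, q_net, task))                   # EnvInteract(t_k, M_tr)
        adapted = copy.deepcopy(q_net)                                   # θ' ← θ_Q
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=alpha)
        inner_loss = td_loss(adapted, target_net, m_tr.sample(task))     # Eq. (8)-(9)
        inner_opt.zero_grad(); inner_loss.backward(); inner_opt.step()   # θ^(k) = θ_Q − α ∇L

        # --- task evaluation: dual replay, M_ev mixes fresh and cached rule-based trajectories ---
        m_ev.extend(collect_episode(env, adapted, task))                 # EnvInteract(t_k, M_ev)
        eval_loss = td_loss(adapted, target_net, m_ev.sample(task))      # L_{t_k}(θ^(k); B_ev ∼ M_ev)
        grads = torch.autograd.grad(eval_loss, list(adapted.parameters()))
        meta_grads = [g_sum + g for g_sum, g in zip(meta_grads, grads)]

    # --- meta update of θ_Q from the summed task evaluation gradients (Eq. 11, first-order) ---
    with torch.no_grad():
        for p, g in zip(q_net.parameters(), meta_grads):
            p -= beta * g
```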
[Table 2 layout: columns report Success, Reward, and Turns for Hotel, Train, and Average; rows cover the same few-shot and adaptive models as Table 1.]
Table 2: System performance in the composite-domain setting on 2,000 dialogues in the target domains. We show results in Hotel and Train, as Police has only single-domain dialogue goals.
[Table 3 layout: columns report Success, Reward, and Turns for the Single, Composite, and Average settings; rows cover DQN, DTQN, Meta-DTQN-SR, and Meta-DTQN.]
Table 3: System performance on 2,000 dialogues in the training domains.

We use the benchmark multi-domain dialogue dataset MultiWOZ 2.0 [10] for the evaluation. We adopt attraction, restaurant, taxi and hospital as source domains for training (source task size K = 4), and use hotel, train and police as target domains for adaptation. This split makes sure that both the training and test splits have domains with various frequency levels (see Appendix C for details). We propose two experiment settings: single-domain and composite-domain. For the single-domain setting, agents are trained and tested with only single-domain dialogue goals. In the composite-domain setting, for each task in meta-training, we first select a seed domain d* and then sample a domain composition which contains d*. The trained model is then adapted and evaluated in various domain compositions containing d*.
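A small sketch of how composite-domain training tasks could be drawn under this setup; the domain list and composition table are illustrative stand-ins for the MultiWOZ ontology, not the actual task sampler.

```python
import random

SOURCE_DOMAINS = ["attraction", "restaurant", "taxi", "hospital"]   # K = 4 source domains
# Hypothetical compositions observed in the corpus, keyed by the domains they contain.
COMPOSITIONS = [
    {"attraction"}, {"restaurant"}, {"taxi"}, {"hospital"},
    {"attraction", "restaurant"}, {"restaurant", "taxi"}, {"attraction", "restaurant", "taxi"},
]

def sample_composite_tasks():
    """One task per seed domain: pick a composition that contains the seed (composite setting)."""
    tasks = []
    for seed in SOURCE_DOMAINS:
        candidates = [c for c in COMPOSITIONS if seed in c]
        tasks.append(random.choice(candidates))   # Uniform(T_{d_k}) in Algorithm 1
    return tasks

print(sample_composite_tasks())
```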
Systems
We developed different baseline task-oriented dialogue systems: DQN is standard deep Q-learning, which uses binary state representations, and DTQN is our proposed model without the meta-learning framework. We also build Vanilla DQN without Replay Buffer Spiking (RBS) [19] to show the warm-up effect of adaptation from rule-based off-policy data. In addition, we build Meta-DTQN-SR with only a single replay buffer to show the effect of the dual-replay mechanism we propose. During adaptation to target domains, we simulate the data-scarcity scenario by using only 1,000 frames (i.e., 10% of the training data). Besides, to examine the effects of the two-stage paradigm of training and adaptation, we also report the results of two few-shot models, DQN-1K and DTQN-1K. Both models are trained from scratch with the 1,000 frames in the target domains.
Implementation Details
We developed all variants of agents based on ConvLab [4]. We used a batch size of 16 for both training and adaptation. We set the sizes of the training replay buffer and the evaluation replay buffer to 50,000. We initialized the replay buffers with Replay Buffer Spiking (RBS) [19] during the first 1,000 episodes of meta-training, the first 10 episodes of single-domain adaptation and the first 50 episodes of composite-domain adaptation (see Appendix D for details).
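These hyperparameters can be summarized in a small configuration block; the field names below are hypothetical and chosen only to mirror the values reported in this paragraph.

```python
# Hypothetical configuration mirroring the reported settings (field names are illustrative).
TRAIN_CONFIG = {
    "batch_size": 16,              # training and adaptation
    "replay_buffer_size": 50_000,  # M_tr
    "eval_buffer_size": 50_000,    # M_ev (dual replay)
    "rbs_episodes": {
        "meta_training": 1_000,
        "single_domain_adaptation": 10,
        "composite_domain_adaptation": 50,
    },
}
```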
Table 1 shows that our models (Meta-DTQN and DTQN) considerably outperform baseline systems on single-domain tasks in hotel and train. Dialogue tasks in police are relatively easy to accomplish, on which DQN-1K, trained from scratch with only 1,000 frames, can completely succeed. On the contrary, DQN-1K fails on all the tasks in hotel. Also, note that DTQN-1K significantly outperforms DQN-1K across all domains. This demonstrates the effectiveness of modeling the dependency between the state and action spaces. Besides, the performance gain from meta-training is more significant in the train domain (i.e., 4.8% in success rate), which can be attributed to the similarity of the state and action spaces between train and the source domain taxi.

Table 2 shows the adaptation results in the composite-domain setting, which is a much harder dialogue task.
Figure 2: Development success rate (above) and training loss (below) of Meta-DTQN with evaluation replay buffers of different sizes on composite-domain tasks. Shadow denotes variance.

Here, Meta-DTQN has a clear advantage over the other agents on both hotel and train, showing that meta-learning can boost the quality of agents adapted with a small amount of data in complex dialogue tasks (see Appendix E for dialogue examples). Table 3 lists the performance of various models when evaluated on the source domains. Here, meta-learning also helps to achieve better results, and the gain is larger in the more complex composite-domain setting.

It is worth noting that on all tasks, Meta-DTQN shows superior results to its single-replay counterpart, Meta-DTQN-SR, and the performance gap is particularly large on composite-domain dialogue tasks, where the agent is more prone to suffer from initial reward sparsity.
Effects of dual replay
We further investigate the effects of the proposed dual-replay method. In Figure 2, we show the performance of our model with task evaluation memories of varied sizes. We start with pure on-policy evaluation, |M_ev| = 16, i.e., the batch size |B_ev^{t_k}|, and experiment with different buffer sizes: 16, 1,000, 3,000, 5,000, 10,000, and 50,000. As shown, when the replay buffer is relatively small, the success rate fails to improve. We argue that this optimization difficulty is due to overfitting to on-policy data with sparse rewards at the beginning of the learning phase. This can be verified by the loss curve: the training loss abruptly drops from high values (100-500) to extremely low values (less than 10) soon after the RBS warm-up phase. When the evaluation memory size increases, our model is able to escape from the local minimum and gets optimized continuously.
Figure 3: Performance of Meta-DTQN with adaptation data of varied sizes on composite tasks.
Effects of adaptation data size
In addition, we show how the size of the adaptation data affects the agent's performance on target domains in Figure 3. We test Meta-DTQN with adaptation data ranging from 100 frames (1% of the training data) to 2,500 frames (25% of the training data). As shown, the agent's performance positively correlates with the amount of data available in the target domain. Note that, as one episode has on average 10 frames and we adopt RBS for the first 50 episodes, the agent is adapted only with off-policy rule-based experiences when the number of frames is less than 500. Therefore, the large performance gap between 500 and 1,000 frames indicates that our model can considerably benefit from a very small amount of on-policy data.
Conclusion
Dialogue policy is the central controller of a dialogue system and is usually optimized via reinforcement learning. However, it often suffers from insufficient training data, especially in multi-domain scenarios. In this paper, we propose the Deep Transferable Q-Network (DTQN) to share multi-level information between domains, such as slots and acts. We also modify the meta-learning framework MAML and introduce a dual-replay mechanism. Empirical results show that our method outperforms traditional deep reinforcement learning models without domain knowledge sharing, in terms of both success rate and dialogue length. As future work, we plan to generalize our method to more meta-RL applications in multi-domain and few-shot learning scenarios.
Broader Impact
Our work can contribute to dialogue research and applications, especially in new domains with scant training data. Our framework helps models quickly adapt to unseen domains to bootstrap applications. The outcome is a more effective and efficient dialogue agent system that facilitates activities in human society.
However, one needs to be cautious when collecting dialogue data, which may raise privacy issues. Anonymization methods must be used to protect personal privacy.
References
[1] Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Kam-Fai Wong, and Shang-Yu Su. Deep Dyna-Q: Integrating planning for task-completion dialogue policy learning. arXiv preprint arXiv:1801.06176, 2018.
[2] Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2231–2240, 2017.
[3] Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. BBQ-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[4] Sungjin Lee, Qi Zhu, Ryuichi Takanobu, Zheng Zhang, Yaoqin Zhang, Xiang Li, Jinchao Li, Baolin Peng, Xiujun Li, Minlie Huang, et al. ConvLab: Multi-domain end-to-end dialog system platform. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 64–69, 2019.
[5] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017.
[7] Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pages 5331–5340, 2019.
[8] Kun Qian and Zhou Yu. Domain adaptive dialog generation via meta learning. arXiv preprint arXiv:1906.03520, 2019.
[9] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
[10] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278, 2018.
[11] Zhao Yan, Nan Duan, Peng Chen, Ming Zhou, Jianshe Zhou, and Zhoujun Li. Building task-oriented dialogue systems for online shopping. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[12] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[13] Heriberto Cuayáhuitl, Simon Keizer, and Oliver Lemon. Strategic dialogue management via deep reinforcement learning. arXiv preprint arXiv:1511.08099, 2015.
[14] Olivier Pietquin, Matthieu Geist, and Senthilkumar Chandramohan. Sample efficient on-line learning of optimal dialogue policies with Kalman temporal differences. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[15] Da Tang, Xiujun Li, Jianfeng Gao, Chong Wang, Lihong Li, and Tony Jebara. Subgoal discovery for hierarchical dialogue policy learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2298–2309, 2018.
[16] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
[17] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.
[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[19] Zachary C Lipton, Jianfeng Gao, Lihong Li, Xiujun Li, Faisal Ahmed, and Li Deng. Efficient exploration for dialog policy learning with deep BBQ networks & replay buffer spiking.