Domain Adaptation in Dialogue Systems using Transfer and Meta-Learning
Rui Ribeiro, Alberto Abad, José Lopes
INESC-ID Lisboa, Portugal; Instituto Superior Técnico, Universidade de Lisboa, Portugal; Heriot-Watt University, Edinburgh, United Kingdom
[email protected]
Abstract
Current generative-based dialogue systems are data-hungry and fail to adapt to new unseen domains when only a small amount of target data is available. Additionally, in real-world applications, most domains are underrepresented, so there is a need to create a system capable of generalizing to these domains using minimal data. In this paper, we propose a method that adapts to unseen domains by combining both transfer and meta-learning (DATML). DATML improves the previous state-of-the-art dialogue model, DiKTNet, by introducing a different learning technique: meta-learning. We use Reptile, a first-order optimization-based meta-learning algorithm, as our improved training method. We evaluated our model on the MultiWOZ dataset and outperformed DiKTNet in both BLEU and Entity F1 scores when the same amount of data is available.
Index Terms: dialogue systems, domain adaptation, transfer learning, meta-learning
1. Introduction
With the appearance of chatbots like Siri and Alexa, capable of having fluent and consistent conversations, dialogue systems have become very popular these days. Additionally, the emergence of deep learning techniques in natural language processing contributes to this popularity, and various new models were created in order to surpass previous rule-based models. However, these generative-based models are data-hungry: they need large amounts of training data in order to obtain good results, they produce dull responses, and they fail to adapt to new unseen domains when only a few examples of data are available. Besides, in real-world applications, most domains are underrepresented, so there is a need to create a model capable of generalizing to these domains using the minimum amount of data available.

In this paper, we study the importance of generalizing to unseen domains using minimal data and aim to design a novel model to surpass this problem. We believe that for successful adaptation to new domains, two key features are essential for improving the overall performance of a dialogue system: better representation learning and better learning techniques. Following this belief, we are concerned with the exploration of a method able to learn a more general dialogue representation from a large data source and able to incorporate this information into a dialogue system.

We follow this reasoning and introduce Domain Adaptation using Transfer and Meta-Learning (DATML), a model that combines transfer learning with meta-learning for the purpose of adapting to unseen domains. Our model builds upon the approach of the Dialogue Knowledge Transfer Network (DiKTNet) [1] by enhancing its learning method while keeping the strong representation learning present in both ELMo [2] contextual embeddings and latent representations. For that, we divide the training method into three training stages: 1. a pre-training phase, where latent representations are leveraged from a domain-agnostic dataset; 2. source training, with all data except dialogues from the target domain; 3. fine-tuning, using a few examples from the target domain.

We incorporate meta-learning in source training as this method proved to be promising at capturing domain-agnostic dialogue representations [3]. However, instead of using the Model-Agnostic Meta-Learning (MAML) [4] algorithm, we use a first-order optimization-based method, Reptile [5], which has been shown to achieve similar or even better results than MAML for low-resource NLU tasks while being more lightweight in terms of memory consumption [6].

We evaluate our model on the MultiWOZ dataset [7] and compare our approach with both Zero-Shot Dialog Generation (ZSDG) [8] and the current state-of-the-art model in few-shot dialogue generation, DiKTNet. As the code for both baselines is openly available online, we adapt and evaluate their implementations on the MultiWOZ corpus. Our model outperforms both ZSDG and DiKTNet when the same amount of data is available. Furthermore, DATML achieves superior performance with 3% of available target data in comparison to DiKTNet with 10%, which shows that DATML surpasses DiKTNet in terms of both performance and data efficiency.
2. Related Work
The reduced amount of available data has always been a problem in domain adaptation tasks. Methods such as meta-learning [4], transfer learning [9, 10, 11], and few-shot learning [12, 13, 14] were introduced to solve this problem in machine learning. However, there were only a few attempts to solve the problem of domain adaptation in end-to-end dialogue systems.

Perhaps one of the first studies to pursue this direction was the work on ZSDG [8], where the authors performed zero-shot dialogue generation using minimal data in the form of seed responses. The model is described as "zero-shot" and does not use complete dialogues; however, it still depends on human-annotated data. Although this approach seems promising, ZSDG relies on these annotations for seed responses, and in a real-world scenario, if collecting data for underrepresented domains is already difficult enough, access to annotated data becomes infeasible.

More recent studies attempt to perform domain adaptation without the need for human-annotated data and adopt the methods presented above: Domain Adaptive Dialog Generation via Meta-Learning (DAML) [3] incorporates meta-learning into the Sequicity [15] model to train a dialogue system able to generalize to unseen domains. This approach seems promising, yet DAML was evaluated on a synthetic dataset. DiKTNet [1] applies transfer learning by leveraging general latent representations from a large data source and incorporating them into a Hierarchical Recurrent Encoder-Decoder (HRED). We describe this model in detail in the following sections as it represents a key feature of our solution.
3. Base Model
As mentioned in the previous section, our base model is DiKTNet [1]. The basic idea in DiKTNet is to learn reusable latent representations from a domain-agnostic dataset and incorporate that knowledge when training with minimal data from the target domains. DiKTNet's base model is the same as ZSDG's: an HRED with an attention-based copying mechanism.

More formally, the base model's HRED $\mathcal{F}$ is optimized according to the following loss function:

$$\mathcal{L}_{HRED} = \log p_{\mathcal{F}_d}(x_{sys} \mid \mathcal{F}_e(c, x_{usr})), \qquad (1)$$

where $x_{usr}$ is the user's request, $x_{sys}$ is the system's response, and $c$ is the context.

Although each domain has its specific dialogue structure, every domain still shares a general representation. Thus, the authors consider the Latent Action Encoder-Decoder (LAED) framework [16]. LAED is, in essence, a Variational Auto-Encoder (VAE) representation method that allows discovering interpretable and meaningful representations of utterances as discrete latent variables. LAED introduces a recognition network $R$ that maps an utterance to a latent variable $z$ and a generation network $G$ that is used to train $z$'s representation. The goal is to represent the latent variable $z$ independently of the context $c$, so it can capture general dialogue semantics. LAED is an HRED model, and the authors introduced two versions of it: the Discrete Information Variational Auto-Encoder (DI-VAE) and the Discrete Information Variational Skip-Thought (DI-VST).

DI-VAE works as a typical VAE by reconstructing the input $x$ and minimizing the error between the generated and the original data. The loss function that optimizes the VAE model can be described as:

$$\mathcal{L}_{DI\text{-}VAE} = \mathbb{E}_{q_R(z \mid x)\, p(x)}[\log p_G(x \mid z)] - KL(q(z) \,\|\, p(z)), \qquad (2)$$

where $p(z)$ and $q(z)$ are, respectively, the prior and posterior distributions of $z$, $KL$ is the Kullback-Leibler divergence, and $\mathbb{E}$ is the expectation.

The DI-VAE model aims to capture utterance representations by reconstructing each word of the utterance. However, it is also possible to capture the meaning by inferring it from the surrounding context, as dialogue meaning is very context-dependent. With this, the authors propose another version, DI-VST, which is inspired by the Skip-Thought representation [17]. DI-VST uses the same recognition network as DI-VAE to output the posterior distribution $q(z)$; however, two generators are now used to predict both the previous utterance $x_p$ and the following utterance $x_n$. The loss function that optimizes DI-VST can be described as:

$$\mathcal{L}_{DI\text{-}VST} = \mathbb{E}_{q_R(z \mid x)\, p(x)}[\log p^n_G(x_n \mid z) + \log p^p_G(x_p \mid z)] - KL(q(z) \,\|\, p(z)). \qquad (3)$$

DiKTNet learns this domain-agnostic representation from a large data source and uses the LAED models to perform this task. DiKTNet uses the DI-VAE model to obtain a latent representation of the user's request, $z_{usr} = \text{DI-VAE}(x_{usr})$. As for the system's response, the model also wants to predict a latent representation $z_{sys}$. In order to achieve that, DiKTNet uses the DI-VST model together with a context-aware hierarchical encoder-decoder that takes as input the user's request $x_{usr}$ and the context $c$. This encoder-decoder differs from DI-VST in that, instead of predicting the previous and the following utterances, it predicts only the following utterance, which is, in fact, the system's response. The authors argue that DI-VAE captures the user utterance representation and that DI-VST predicts the system's action.
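To make the objective in Equation 2 concrete, the following is a minimal, self-contained PyTorch sketch, not DiKTNet's actual code: it assumes a single discrete latent variable, a bag-of-words generator, a uniform prior, and a per-example KL term (a simplification; the original DI-VAE uses batch prior regularization over the aggregated posterior). All module and variable names are illustrative.

```python
# A toy DI-VAE-style objective: recognition network q_R(z|x), discrete
# latent z, generation network p_G(x|z), and a KL penalty to the prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDIVAE(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, num_values=20):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, emb_dim)    # utterance encoder
        self.recognition = nn.Linear(emb_dim, num_values)    # q_R(z | x)
        self.generation = nn.Linear(num_values, vocab_size)  # p_G(x | z)
        self.num_values = num_values

    def forward(self, token_ids):
        # Recognition network: posterior over the discrete latent z.
        logits = self.recognition(self.embed(token_ids))
        q_z = F.softmax(logits, dim=-1)

        # Differentiable "sample" of z via Gumbel-softmax (one common choice).
        z = F.gumbel_softmax(logits, tau=1.0, hard=True)

        # Generation network: reconstruct the input words from z
        # (bag-of-words decoder for brevity).
        log_p_x = F.log_softmax(self.generation(z), dim=-1)
        recon_ll = log_p_x.gather(1, token_ids).sum(dim=1)   # log p_G(x | z)

        # Per-example KL to a uniform prior over the discrete values.
        prior = torch.full_like(q_z, 1.0 / self.num_values)
        kl = (q_z * (q_z.clamp_min(1e-9).log() - prior.log())).sum(dim=-1)

        # Maximize recon_ll - KL, i.e. minimize the negative ELBO of Eq. (2).
        return -(recon_ll - kl).mean()

batch = torch.randint(0, 1000, (8, 12))  # 8 fake utterances of 12 tokens
loss = ToyDIVAE()(batch)
loss.backward()
```

A DI-VST variant of this sketch would keep the recognition network but attach two generators that score the previous and the following utterances, matching Equation 3.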
When training with minimal data from the target domain, and after learning the latent representations $z_{usr}$ and $z_{sys}$, these variables are incorporated into the HRED $\mathcal{F}$ through an updated version of the loss function from Equation 1:

$$\mathcal{L}_{HRED} = \mathbb{E}_{p(x_{usr}, c)\, p(z_{usr} \mid x_{usr})\, p(z_{sys} \mid x_{usr}, c)}[\log p_{\mathcal{F}_d}(x_{sys} \mid \{\mathcal{F}_e(c, x_{usr}), z_{usr}, z_{sys}\})], \qquad (4)$$

where $\{\cdot\}$ is the concatenation operator. With this, we ensure that the decoder is conditioned on the latent representations inferred in the pre-training phase and can now fine-tune on the target domain while taking those domain-agnostic representations into account. DiKTNet is also augmented with ELMo's [2] deep contextualized representations as word embeddings.

Instead of performing joint training as in the original work, we first train the model with only the source domains and then fine-tune it using a few example dialogues from the target domain. Below, we present how we enhanced our base model's performance using an improved training strategy.
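As an illustration of the concatenation in Equation 4, here is a hedged sketch, with assumed dimensions and module names rather than DiKTNet's real ones, of how a decoder's initial state can be conditioned on the encoder output and the two pre-trained latent codes:

```python
# Fuse the HRED encoder state F_e(c, x_usr) with the LAED latents
# z_usr and z_sys before decoding, per Eq. (4).
import torch
import torch.nn as nn

enc_dim, latent_dim, hid_dim = 256, 20, 512
decoder_init = nn.Linear(enc_dim + 2 * latent_dim, hid_dim)

def decoder_initial_state(enc_state, z_usr, z_sys):
    """Concatenate {F_e(c, x_usr), z_usr, z_sys} and project to the decoder size."""
    fused = torch.cat([enc_state, z_usr, z_sys], dim=-1)
    return torch.tanh(decoder_init(fused))

enc_state = torch.randn(8, enc_dim)  # F_e(c, x_usr) for a batch of 8 turns
z_usr = torch.randn(8, latent_dim)   # DI-VAE code of the user request
z_sys = torch.randn(8, latent_dim)   # DI-VST-predicted code of the response
h0 = decoder_initial_state(enc_state, z_usr, z_sys)
```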
4. Meta-learning
As mentioned in Section 1, better training techniques improve the overall system performance when adapting to new unseen domains using minimal data. In the following sections, we present our chosen meta-learning algorithm and describe how we adapted it to our base model.
In Section 2, we described DAML [3], which incorporates the MAML [4] algorithm into the Sequicity model. This optimization-based meta-learning technique aims to learn a good initialization for the model on source domains that can be efficiently adapted to target domains with minimal fine-tuning.

More formally, in each iteration of MAML, two batches of the training corpus are sampled from a source domain $d$: $D^d_s$ and $D^d_q$, named, respectively, the source and the query set. Instead of calculating the gradient step and updating the model, in each episode low-resource fine-tuning is simulated: the model's parameters $\theta$ are preserved and, for each domain $d$ in the source domains, new temporary parameters are calculated according to:

$$\theta^d = \theta - \beta \nabla_\theta \mathcal{L}(\theta, D^d_s), \qquad (5)$$

where $\beta$ is the inner learning rate. We could update the model's original parameters with the sum of the losses from all source domains; however, we choose to update the parameters after each domain iteration, as this method performs better, as shown in [18]. After each episode, the model's parameters are updated using the temporary ones calculated in Equation 5:

$$\theta = \theta - \alpha \nabla_\theta \mathcal{L}(\theta^d, D^d_q), \qquad (6)$$

where $\alpha$ is the outer learning rate. As our model incorporates both context and knowledge-base information for each dialogue, and as MAML also consumes too much memory, we instead adopt a lightweight version of the MAML algorithm, which we describe below.

The Reptile [5] algorithm is a first-order meta-learning algorithm where, instead of sampling source and query sets, $k > 1$ batches $D^d = (D^d_1, \ldots, D^d_k)$ are retrieved for each domain and used to create the temporary model parameters. The temporary model is trained with the Adam [19] optimizer according to:

$$\theta^d = \text{Adam}^k(\theta, D^d, \beta), \qquad (7)$$

where $\beta$ is the inner learning rate and $k$ is the number of updates on $D^d$. After each episode, the model's original parameters are updated using the ones calculated in Equation 7:

$$\theta = \theta + \alpha (\theta^d - \theta), \qquad (8)$$

where $\alpha$ is the outer learning rate. Reptile is shown in [5] to produce equivalent or even better updates than MAML while consuming less memory.

Our final model, DATML, is an adaptation of the DiKTNet architecture with a modified training technique, while maintaining the strong representation learning. Instead of the two training stages of the original work, we split joint training into source training and fine-tuning (a sketch of the resulting source-training loop is given after the list below):
1. Pre-training: we maintain the first phase, where we learn the latent general representations for each turn using the DI-VAE and DI-VST models.
2. Source training: in this phase, we exclude all data from the target domain and improve the training method by employing the Reptile meta-learning algorithm.
3. Fine-tuning: finally, we fine-tune the model using only a few example dialogues from the target domain.
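The following minimal sketch shows one Reptile episode over the source domains, instantiating Equations 7 and 8. The `model`, `sample_batches`, and `compute_loss` names are assumptions for illustration; the real training loop additionally handles dialogue contexts and knowledge-base inputs.

```python
# One Reptile episode: per-domain inner Adam updates (Eq. 7), then
# interpolation of the original parameters toward the adapted ones (Eq. 8).
import copy
import torch

def reptile_episode(model, domains, sample_batches, compute_loss,
                    inner_lr=1e-3, outer_lr=0.1, k=10):
    for domain in domains:
        # Clone theta and fine-tune the clone for k Adam steps on
        # batches D^d_1..D^d_k from this source domain (Eq. 7).
        temp = copy.deepcopy(model)
        inner_opt = torch.optim.Adam(temp.parameters(), lr=inner_lr)
        for batch in sample_batches(domain, k):
            inner_opt.zero_grad()
            compute_loss(temp, batch).backward()
            inner_opt.step()

        # Move theta toward theta^d after each domain iteration (Eq. 8):
        # theta <- theta + alpha * (theta^d - theta).
        with torch.no_grad():
            for p, p_d in zip(model.parameters(), temp.parameters()):
                p.add_(outer_lr * (p_d - p))
```

Note that updating `model` inside the domain loop mirrors the per-domain update order discussed above, rather than accumulating losses over all source domains before a single update.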
5. Experiments
In this section, we describe how we evaluated the ZSDG and DiKTNet baselines as well as DATML. We also analyze and suggest possible limitations of our approach.
The dataset used to obtain the latent actions in the pre-training phase for DiKTNet and DATML was MetalWOZ [20], a dataset specifically constructed for the task of generalizing to unseen domains and designed to help develop meta-learning models. This dataset contains about 37k task-oriented dialogues in 47 domains, such as schedules, apartment search, alarm setting, and banking. The data was collected in a Wizard-of-Oz fashion, where one person acted as the robot/system and another acted as the user.

Table 1: Excluded domains from MetalWOZ for each target domain on the MultiWOZ dataset.
MultiWOZ      MetalWOZ
hotel         HOTEL RESERVE
restaurant    MAKE RESTAURANT RESERVATIONS, RESTAURANT PICKER
attraction    EVENT RESERVE
Both baselines and our approach were evaluated on the three most represented domains from the Multi-Domain Wizard-of-Oz dataset [7]: hotel, restaurant, and attraction, each containing more than 1500 dialogues. MultiWOZ is a large-scale multi-domain corpus containing human-to-human conversations with rich semantic labels (dialogue acts and domain-specific slot-values) from various domains and topics and, like MetalWOZ, was collected in a Wizard-of-Oz fashion.
In the pre-training stage, we choose to learn the latent representations on the MetalWOZ dataset, as it is a domain-agnostic corpus introduced specifically for learning general representations. In order to make the evaluation as fair as possible, we exclude all dialogues from MetalWOZ domains that could relate to the target domain on MultiWOZ, as described in Table 1.

For source training, we train DATML on the MultiWOZ dataset and exclude all dialogues from the target domains, including the multi-domain dialogues that contain turns from the target domain. In the fine-tuning phase, we use low-resource data that varies from 1% to 10% of the target-domain data, following the approach of [1].

For both the baselines and DATML, we follow the original settings of [8] and [1] and use the Adam optimizer with a learning rate of − and Dropout (p = 0.−) [21]. All RNNs have a hidden size of 512 and were trained for 50 epochs, using early stopping if the validation accuracy does not improve over half of the already completed epochs. In the pre-training phase, we train both the DI-VAE and DI-VST based LAED with a y size of − and a k size of −, where y represents the number of latent variables and k the number of possible discrete values for each variable. For Reptile, we use a k size of − and train the model for − episodes. The inner and outer learning rates are − and −, respectively.

For ZSDG, we followed the original authors' [8] setting and used 150 seed responses for each domain. In order to fairly compare our model with the state-of-the-art DiKTNet, we choose the same target-domain data for both models by setting the random seed to a fixed value, with no particular reason for selecting that number.

We follow the work of DiKTNet [1] and ZSDG [8] and report BLEU and Entity F1 for each domain. These scores are calculated for each turn, where BLEU measures the similarity between the predicted and the reference responses, and Entity F1 determines the ability of the model to retrieve correct entities from the knowledge base.
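As an illustration of the per-turn Entity F1 computation described above, the following is a simplified sketch; the substring-based entity matching is our assumption for illustration, while the reported numbers come from the ZSDG and DiKTNet evaluation code.

```python
# Simplified per-turn Entity F1: precision/recall over knowledge-base
# entities found in the predicted vs. reference response.
def entity_f1(predicted: str, reference: str, kb_entities: set) -> float:
    # Naive matching: an entity "appears" if its surface form is a substring.
    pred = {e for e in kb_entities if e in predicted.lower()}
    gold = {e for e in kb_entities if e in reference.lower()}
    if not gold:
        return 1.0 if not pred else 0.0
    if not pred:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = entity_f1("the acorn guest house is in the north",
                  "acorn guest house , in the north end",
                  {"acorn guest house", "north"})
```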
6. Results and Discussion
Table 2 shows the results on the MultiWOZ dataset. DATML outperforms both baselines, ZSDG and DiKTNet, in all low-resource scenarios.

Table 2: Results on the MultiWOZ dataset.
Model            hotel                 restaurant            attraction
                 BLEU%   Entity F1%    BLEU%   Entity F1%    BLEU%   Entity F1%
ZSDG             5.0     8.0           4.7     14.3          6.0     16.0
DiKTNet - 1%     10.7    17.3          12.4    17.5          10.2    18.6
DiKTNet - 3%     11.4    18.2          13.4    26.0          12.4    20.6
DiKTNet - 5%     11.6    17.6          16.6    25.7          12.0    27.1
DiKTNet - 10%    13.1    16.8          16.9    28.2          12.3    27.4
DATML - 1%       −       −             −       −             −       −
DATML - 3%       −       −             −       −             −       −
DATML - 5%       −       −             −       −             −       −
DATML - 10%      −       −             −       −             −       −
Figure 1: BLEU score for different amounts of target data in the restaurant domain.
We investigate how the use of different amounts of target-domain data impacts the system's performance. Table 2 shows that our model's performance correlates with the amount of available data from the unseen domain. Figures 1 and 2 illustrate that correlation for the restaurant domain and compare DiKTNet and DATML in terms of data usage. While only small improvements can be observed when just 1% of target-domain data is available, for each domain DATML achieves better results in all metrics with 3% of target data than DiKTNet with 10% of available target data. This shows that DATML outperforms DiKTNet in terms of both performance and data efficiency.

Table 2 also confirms that DiKTNet and DATML outperform ZSDG while using no annotated data, thus discarding the human effort of annotating dialogues. This confirms that DATML achieves state-of-the-art results in data efficiency and is most suitable for real-world applications, as in underrepresented domains the amount of annotated data is almost nonexistent.

The results demonstrate that using optimization-based meta-learning improves the overall model performance and validate our initial idea that better learning techniques are a key feature when adapting to unseen domains using minimal data. Although our results seem promising and DATML outperforms the previous state-of-the-art DiKTNet, these low scores are far from sufficient for real-world applications, and more work is essential to surpass the problem of data scarcity in dialogue systems.

Figure 2: Entity F1 score for different amounts of target data in the restaurant domain.
7. Conclusions
Domain adaptation in dialogue systems is extremely important, as most domains are underrepresented. We proposed a model that improves on the previous state-of-the-art method by enhancing the training method. However, the evaluation results indicate that our model is far from being suited for real-world applications and show that this field requires further study. Future work includes improving the retrieval of latent representations and their integration into our model. We would also like to note that, after submitting this paper, we started some experiments with BERT-based [22] embeddings, which are left for future work.
8. Acknowledgements
This work has been partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UIDB/50021/2020.
9. References

[1] I. Shalyminov, S. Lee, A. Eshghi, and O. Lemon, "Data-efficient goal-oriented conversation with dialogue knowledge transfer networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[2] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018.
[3] K. Qian and Z. Yu, "Domain adaptive dialog generation via meta learning," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[4] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 70, 2017, pp. 1126–1135. [Online]. Available: http://proceedings.mlr.press/v70/finn17a.html
[5] A. Nichol, J. Achiam, and J. Schulman, "On first-order meta-learning algorithms," arXiv preprint arXiv:1803.02999, 2018.
[6] Z.-Y. Dou, K. Yu, and A. Anastasopoulos, "Investigating meta-learning algorithms for low-resource natural language understanding tasks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[7] P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić, "MultiWOZ – a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
[8] T. Zhao and M. Eskenazi, "Zero-shot dialog generation with cross-domain latent actions," in Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, 2018.
[9] M. Long, H. Zhu, J. Wang, and M. I. Jordan, "Deep transfer learning with joint adaptation networks," in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 70, 2017, pp. 2208–2217. [Online]. Available: http://proceedings.mlr.press/v70/long17a.html
[10] D. George, H. Shen, and E. Huerta, "Deep transfer learning: A new deep learning glitch classification method for advanced LIGO," arXiv preprint arXiv:1706.07446, 2017.
[11] Y. Yao and G. Doretto, "Boosting for transfer learning with multiple sources," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1855–1862.
[12] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," in Advances in Neural Information Processing Systems 30, 2017, pp. 4077–4087. [Online]. Available: http://papers.nips.cc/paper/6996-prototypical-networks-for-few-shot-learning.pdf
[13] V. Garcia and J. Bruna, "Few-shot learning with graph neural networks," arXiv preprint arXiv:1711.04043, 2017.
[14] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, "Learning to compare: Relation network for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
[15] W. Lei, X. Jin, M.-Y. Kan, Z. Ren, X. He, and D. Yin, "Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.
[16] T. Zhao, K. Lee, and M. Eskenazi, "Unsupervised discrete sentence representation learning for interpretable neural dialog generation," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.
[17] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler, "Skip-thought vectors," in Advances in Neural Information Processing Systems 28, 2015, pp. 3294–3302.
[18] A. Antoniou, H. Edwards, and A. J. Storkey, "How to train your MAML," CoRR, vol. abs/1810.09502, 2018. [Online]. Available: http://arxiv.org/abs/1810.09502
[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[20] S. Lee, H. Schulz, A. Atkinson, J. Gao, M. Adada, K. Suleman, et al., "Multi-domain task-completion dialog challenge," in Dialog System Technology Challenges 8, 2019.
[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html
[22] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.