An Empirical Study of Cross-Lingual Transferability in Generative Dialogue State Tracker
Yen-Ting Lin, Yun-Nung Chen
National Taiwan University
[email protected], [email protected]
Abstract
There has been rapid development in data-driven task-oriented dialogue systems with the benefit of large-scale datasets. However, the progress of dialogue systems in low-resource languages lags far behind due to the lack of high-quality data. To advance cross-lingual technology for building dialogue systems, DSTC9 introduces the task of cross-lingual dialogue state tracking, where the DST module is tested in a low-resource language given a rich-resource training dataset. This paper studies the transferability of a cross-lingual generative dialogue state tracking system using a multilingual pre-trained seq2seq model. We experiment under different settings, including joint training and pre-training on cross-lingual and cross-ontology datasets. We find low cross-lingual transferability in our approaches and provide investigation and discussion.
Introduction
Dialogue state tracking is one of the essential building blocks in task-oriented dialogue systems. With active research breakthroughs in data-driven task-oriented dialogue technology and the popularity of personal assistants in the market, the need for task-oriented dialogue systems capable of providing similar services in low-resource languages is expanding. However, building a new dataset for task-oriented dialogue systems in a low-resource language is even more laborious and costly. It would be desirable to use existing data in a high-resource language to train models for low-resource languages. Therefore, if cross-lingual transfer learning can be applied effectively and efficiently to dialogue state tracking, the development of task-oriented dialogue systems in low-resource languages can be accelerated.

The Ninth Dialog System Technology Challenge (DSTC9) Track 2 (Gunasekara et al. 2020) proposed a cross-lingual multi-domain dialogue state tracking task. The main goal is to build a cross-lingual dialogue state tracker with a rich-resource-language training set and a small development set in the low-resource language. The organizers adopt MultiWOZ 2.1 (Eric et al. 2019) and CrossWOZ (Zhu et al. 2020) as the datasets and provide automatic translations of these two datasets for development. In this paper's setting, our task is to build a cross-lingual dialogue state tracker for CrossWOZ-en, the English translation of CrossWOZ. In the following, we use cross-lingual datasets to refer to datasets in different languages, such as MultiWOZ-zh and CrossWOZ-en, and cross-ontology datasets to refer to datasets with different ontologies, such as MultiWOZ-en and CrossWOZ-en.

Cross-lingual transfer learning aims to transfer knowledge across different languages. However, in our experiments, we encounter tremendous impediments in joint training on cross-lingual or even cross-ontology datasets.
To the best of our knowledge, all previous cross-lingual dialogue state trackers approach DST as a classification problem (Mrkšić et al. 2017; Liu et al. 2019), which does not guarantee the success of transferability for our generative dialogue state tracker.

The contributions of this paper are three-fold:
• This paper explores the transferability of a cross-lingual generative dialogue state tracking system.
• This paper compares joint training and pre-train-then-fine-tune methods with cross-lingual and cross-ontology datasets.
• This paper analyzes and opens a discussion on the colossal performance drop when training with cross-lingual or cross-ontology datasets.
Problem Formulation
In this paper, we study the cross-lingual multi-domain dialogue state tracking task. Here we define the multi-domain dialogue state tracking problem and introduce the cross-lingual DST datasets.
Multi-domain Dialogue State Tracking
The dialogue state in multi-domain dialogue state tracking is a set of (domain, slot name, value) triplets, where the domain indicates the service that the user is requesting, the slot name represents the goal from the user, and the value is the explicit constraint of the goal. For dialogue states not mentioned in the dialogue context, we assign a null value, ∅, to the corresponding values. For example, (Hotel, type, luxury) summarizes one of the user's constraints of booking a luxury hotel, and (Attraction, fee, 20 yuan or less) means the user wants to find a tourist attraction with a ticket price equal to or lower than 20 yuan. An example is presented in Figure 1.

Our task is to predict the dialogue state at the t-th turn, B_t = {(D_i, S_i, V_i) | 1 ≤ i ≤ I}, where I is the number of states to be tracked, given the historical dialogue context until now, defined as C_t = {U_1, R_1, U_2, R_2, ..., R_{t-1}, U_t}, where U_i and R_i are the user utterance and system response, respectively, at the i-th turn.

Dataset
MultiWOZ is a task-oriented dataset often used as the benchmark for task-oriented dialogue system tasks, including dialogue state tracking, dialogue policy optimization, and NLG. MultiWOZ 2.1 is a cleaner version of its predecessor, with more than 30% of the dialogue state annotations updated. CrossWOZ is a Chinese multi-domain task-oriented dataset with more than 6,000 dialogues, five domains, and 72 slots. Both of the above datasets collect human-to-human dialogues in Wizard-of-Oz settings. Table 1 lists the details of the datasets.

In DSTC9 Track 2, the organizers translate MultiWOZ and CrossWOZ into Chinese and English, respectively, and we refer to the translated versions of MultiWOZ and CrossWOZ as MultiWOZ-zh and CrossWOZ-en, respectively. The public and private test sets of CrossWOZ-en in DSTC9 each have 250 dialogues, but only the public test set has annotations. Therefore, we use the public one as the test set in our experiments.
Table 1: Dataset statistics.

Metric      MultiWOZ    CrossWOZ
Language    English     Chinese (Simplified)
Domains     7           5
Slots       24          72
Related Work
Dialogue State Tracker
Traditionally, dialogue state tracking depends on fixed-vocabulary approaches where retrieval-based models rank slot candidates from a given slot ontology (Ramadan, Budzianowski, and Gašić 2018; Lee, Lee, and Kim 2019; Shan et al. 2020). However, recent research efforts in DST have moved towards generation-based approaches where the models generate slot values given the dialogue history. (Wu et al. 2019) proposed a generative multi-domain DST model with a copy mechanism, which ensures the capability to generate unseen slot values. (Kim et al. 2019) introduced a selective overwriting mechanism, a memory-based approach that increases efficiency in training and inference. (Le, Socher, and Hoi 2020) adopted a non-autoregressive architecture to model potential dependencies among (domain, slot) pairs and significantly reduce real-time DST latency. (Hosseini-Asl et al. 2020) took advantage of the powerful generation ability of large-scale auto-regressive language models and formulated the DST problem as a causal language modeling problem.
Multilingual Transfer Learning in Task-oriented Dialogue

(Schuster et al. 2019) introduced a multilingual multi-domain NLU dataset. (Mrkšić et al. 2017) annotated two additional languages for WOZ 2.0 (Mrkšić et al. 2017), and (Liu et al. 2019) proposed mixed-language training for cross-lingual NLU and DST tasks. Note that all previous multilingual DST methods modeled the dialogue state tracking task as a classification problem (Mrkšić et al. 2017; Liu et al. 2019).
Methods
This paper considers multi-domain dialogue state tracking as a sequence generation task by adopting a sequence-to-sequence framework.
Architecture
Following (Liu et al. 2020), we use the sequence-to-sequence Transformer architecture (Vaswani et al. 2017) with 12 layers in each of the encoder and decoder. We denote our model as seq2seq in the following.
DST as Sequence Generation
The input sequence is the concatenation of the dialogue context, x_t = {U_1; R_1; U_2; R_2; ...; R_{t-1}; U_t}, where ; denotes the concatenation of texts.

For the target dialogue state, we only consider the slots whose values are non-empty. The target sequence is the concatenation of the (domain, slot, value) triplets with a non-empty value, y_t = {D_i; S_i; V_i | 1 ≤ i ≤ I ∧ V_i ≠ ∅}. The model prediction is

ŷ_t = seq2seq(x_t).

We fix the order of the (domain, slot name, value) triplets for consistency. The training objective is to minimize the cross-entropy loss between the ground-truth sequence y_t and the predicted sequence ŷ_t.

Post-processing
The predicted sequence ŷ_t is then parsed by heuristic rules to construct B̂_t = {(D_i, S_i, V̂_i) | 1 ≤ i ≤ I}. Utilizing the possible values of each slot in the ontology, for predicted slot values V̂ that do not appear in the ontology, we choose the ontology value with the best match to our predicted value (implemented with difflib.get_close_matches in Python).

Experiments
In the following sections, we describe the evaluation metrics and experiment settings, then present the experimental results.

Figure 1: Illustration of dialogue state tracking. The dialogue is sampled from CrossWOZ-en. After the user asks about the price of Jinjiang Inn (Beijing Yizhuang Culture Park), the dialogue state contains (Hotel, name, Jinjiang Inn (Beijing Yizhuang Culture Park)); after the user further asks for an attraction with a duration of 1 hour and a rating of 4 points or above, the state additionally contains (Attraction, duration, 1 hour) and (Attraction, rating, 4 points or above).
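As a concrete sketch of the sequence construction described in DST as Sequence Generation and the ontology matching described in Post-processing (an illustrative example, not the authors' released code; the separator format, the ontology, and all example values are invented):

```python
import difflib

# Sketch of DST as sequence generation plus ontology-based post-processing.
# The separator format, ontology, and values below are invented for illustration.

def build_source(turns):
    # x_t = {U_1; R_1; ...; R_{t-1}; U_t}: concatenate the dialogue context.
    return " ; ".join(turns)

def build_target(state):
    # y_t: concatenate (domain, slot, value) triplets with non-empty values,
    # in a fixed (sorted) order for consistency.
    triplets = sorted((d, s, v) for (d, s), v in state.items() if v is not None)
    return " ; ".join(f"{d} = {s} = {v}" for d, s, v in triplets)

def normalize_value(predicted, ontology_values):
    # Post-processing: map a predicted value outside the ontology to its
    # closest ontology entry via difflib.get_close_matches.
    if predicted in ontology_values:
        return predicted
    matches = difflib.get_close_matches(predicted, ontology_values, n=1)
    return matches[0] if matches else predicted

context = ["I need a cheap hotel.", "Sure, which area?", "In the centre."]
state = {("Hotel", "pricerange"): "cheap",
         ("Hotel", "area"): "centre",
         ("Hotel", "name"): None}  # null slots are dropped from the target

src = build_source(context)
tgt = build_target(state)  # "Hotel = area = centre ; Hotel = pricerange = cheap"

# A slightly garbled prediction is snapped back onto the ontology:
fixed = normalize_value("moderat", ["cheap", "moderate", "expensive"])  # "moderate"
```

Fixing the triplet order (here via sorting) matters because the cross-entropy objective compares token sequences, so an arbitrary triplet order would penalize semantically correct states.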
Evaluation Metrics
We use joint goal accuracy and slot F1 as the metrics to evaluate our dialogue state tracking system.
• Joint Goal Accuracy: the proportion of dialogue turns where the predicted dialogue state matches the ground-truth dialogue state exactly.
• Slot F1: the macro-averaged F1 score over all slots in each turn.
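The two metrics can be sketched as follows (a minimal illustration, not the official DSTC9 scorer; states are represented as {(domain, slot): value} dicts, and slot F1 here is computed per turn over (domain, slot, value) triplets):

```python
# Sketch of the two evaluation metrics described above.

def joint_goal_accuracy(preds, golds):
    """Fraction of turns whose predicted state matches the gold state exactly."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

def slot_f1(pred, gold):
    """Per-turn F1 over (domain, slot, value) triplets."""
    pred_set = {(d, s, v) for (d, s), v in pred.items()}
    gold_set = {(d, s, v) for (d, s), v in gold.items()}
    if not pred_set and not gold_set:
        return 1.0
    tp = len(pred_set & gold_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

gold = {("Hotel", "area"): "centre", ("Hotel", "pricerange"): "cheap"}
pred = {("Hotel", "area"): "centre", ("Hotel", "pricerange"): "moderate"}
# One of two triplets matches, so slot F1 is 0.5, while joint goal
# accuracy over this single turn is 0.0 (the match must be exact).
```

This contrast is why joint goal accuracy is much lower than slot F1 throughout the paper: a single wrong slot zeroes out the whole turn.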
Experiment Settings
We want to examine how different settings affect the performance on the target low-resource dataset, CrossWOZ-en. We conduct our experiments in the settings below:
• Direct Fine-tuning
• Cross-Lingual Training (CLT)
• Cross-Ontology Training (COT)
• Cross-Lingual Cross-Ontology Training (CL/COT)
• Cross-Lingual Pre-Training (CLPT)
• Cross-Ontology Pre-Training (COPT)
• Cross-Lingual Cross-Ontology Pre-Training (CL/COPT)
Tables 2 and 3 show the datasets used for training and pre-training in the different settings. For experiments with pre-training, all models are pre-trained on the pre-training dataset and then fine-tuned on CrossWOZ-en. The baseline model provided by DSTC9 is SUMBT (Lee, Lee, and Kim 2019), an ontology-based model trained on CrossWOZ-en. In our experimental circumstances, English is the low-resource language, since the original language of CrossWOZ is Chinese.
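The difference between the joint-training and pre-training settings can be sketched as follows (an illustrative sketch in which a stub train() stands in for actual seq2seq optimization; the schedule logic is our assumption about the settings, not released code):

```python
# Sketch of the two training schedules compared in this paper.
# train() is a stub that only records which datasets each stage used.

def train(stages, datasets):
    return stages + [sorted(datasets)]

TARGET = "CrossWOZ-en"

def joint_training(extra_datasets):
    # One stage over the target plus extra data mixed together,
    # e.g. CLT adds CrossWOZ-zh; CL/COT adds all three other datasets.
    return train([], {TARGET} | set(extra_datasets))

def pretrain_then_finetune(pretrain_datasets):
    # Two stages: pre-train on the extra data, then fine-tune on the target,
    # e.g. COPT pre-trains on MultiWOZ-en.
    model = train([], set(pretrain_datasets))
    return train(model, {TARGET})

clt = joint_training({"CrossWOZ-zh"})
copt = pretrain_then_finetune({"MultiWOZ-en"})
# clt  -> one mixed stage:  [["CrossWOZ-en", "CrossWOZ-zh"]]
# copt -> two stages:       [["MultiWOZ-en"], ["CrossWOZ-en"]]
```

The key design difference is that in the pre-training schedule the final optimization stage only sees the target distribution, whereas joint training mixes distributions in every batch.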
Multilingual Denoising Pre-training
All of our models are initialized from mBART25 (Liu et al. 2020). mBART25 is trained with a denoising auto-encoding task on mono-lingual data in 25 languages, including English and Simplified Chinese. (Liu et al. 2020) show that denoising auto-encoding pre-training on multiple languages improves the performance of low-resource machine translation. We hope that using mBART25 as the initial weights would improve cross-lingual transferability.
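As a toy illustration of the denoising auto-encoding objective (word-level and heavily simplified; the real mBART operates on SentencePiece subwords and uses span masking with sampled span lengths plus sentence permutation, so the helper below is hypothetical):

```python
# Toy sketch of a denoising auto-encoding training pair: the input is the
# text with a contiguous span replaced by a mask token, and the target is
# the original text. This only illustrates the idea behind mBART's objective.

MASK = "<mask>"

def make_denoising_pair(tokens, span_start, span_len):
    """Return (noised input, reconstruction target) for one masked span."""
    noised = tokens[:span_start] + [MASK] + tokens[span_start + span_len:]
    return noised, tokens

tokens = "the hotel is in the city centre".split()
noised, target = make_denoising_pair(tokens, span_start=2, span_len=3)
# noised == ["the", "hotel", "<mask>", "city", "centre"]
# target is the original token list
```

The seq2seq model is trained to reconstruct the target from the noised input, which is the same encoder-decoder interface our DST formulation later reuses.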
Implementation Details
In all experiments, the models are optimized with AdamW (Loshchilov and Hutter 2017) for 4 epochs. The best model is selected by validation loss and used for testing. During training, the decoder of our model is trained in the teacher-forcing fashion (Williams and Zipser 1989), and greedy decoding (Vinyals and Le 2015) is applied at inference time. Following mBART (Liu et al. 2020), we use the SentencePiece tokenizer. Due to GPU memory constraints, source sequences longer than 512 tokens are truncated at the front, and target sequences longer than 256 tokens are truncated at the back. The models are implemented with Transformers (Wolf et al. 2019), PyTorch (Paszke et al. 2019), and PyTorch Lightning (Falcon 2019).

Results and Discussion
The results for all experiment settings are shown in Tables 2 and 3.
Additional Training Data Causes Degeneration
Table 2: Training data used in each joint-training setting.

Experiment           MultiWOZ-en   MultiWOZ-zh   CrossWOZ-en   CrossWOZ-zh
Baseline                                         ✓
Direct Fine-tuning                               ✓
CLT                                              ✓             ✓
COT                  ✓                           ✓
CL/COT               ✓             ✓             ✓             ✓

Direct Fine-tuning significantly outperforms the other settings, including the official baseline. We assumed that English and Chinese data sharing the same ontology would help mBART bridge the gap between the two languages and improve the performance. However, in
Table 3: Pre-training data used in each pre-training setting and results on CrossWOZ-en (all models are then fine-tuned on CrossWOZ-en).

Experiment           MultiWOZ-en   MultiWOZ-zh   CrossWOZ-zh   JGA     SF1
Direct Fine-tuning                                             16.82   66.35
CL/COPT              ✓             ✓
COPT                 ✓
CLPT                                             ✓
Cross-Lingual Training, training on the English and Chinese versions of CrossWOZ leads to catastrophic performance on CrossWOZ-en. In Cross-Ontology Training, we combine two datasets in the same language but with different ontologies; the performance only marginally increases over Cross-Lingual Training, which suggests that more extensive mono-lingual data with unmatched domains, slots, and ontologies confuses the model during inference. In Cross-Lingual Cross-Ontology Training, we collect all four datasets for training, and the performance is still far from Direct Fine-tuning. In conclusion, additional data deteriorates the performance on CrossWOZ-en regardless of whether the language or ontology matches.
Does "First Pre-training, then Fine-tuning" Help?
We hypothesize that training with additional data causes performance degeneration; therefore, one possible improvement is to first pre-train the model on cross-lingual / cross-ontology data and then fine-tune it on the target dataset, CrossWOZ-en. Table 3 shows the results. Comparing COPT to COT and CL/COPT to CL/COT, the relative performance gain is over 37% in terms of slot F1. The "pre-training, then fine-tuning" framework may partially alleviate the catastrophic performance drop observed in joint training.
Domain Performance Differences across Experiment Settings
This section further investigates the cause of the performance decrease by comparing the slot F1 of different models across five domains in Figure 2. Generally speaking, in the attraction, restaurant, and hotel domains, "pre-train then fine-tune" methods beat their "joint training" counterparts by an observable margin. By contrast, in the metro and taxi domains, despite poor performance overall, "joint training" settings beat their "pre-train then fine-tune" counterparts. The only two trackable slots in the metro and taxi domains, "from" and "to," usually take the address or name of buildings and are highly non-transferable across datasets. We conjecture that pre-training on cross-lingual or cross-ontology datasets does not help, and may even hurt, those non-transferable slots.
Figure 2: Slot F1 across 5 domains in CrossWOZ-en in different settings.

Conclusion
In this paper, we build a cross-lingual multi-domain generative dialogue state tracker with a multilingual seq2seq model, test it on CrossWOZ-en, and investigate the tracker's transferability under different training settings. We find that jointly training the dialogue state tracker on cross-lingual or cross-ontology data degrades the performance.
Pre-training on cross-lingual or cross-ontology data and then fine-tuning may alleviate the problem, and we find empirical evidence of relative improvement in slot F1. A finding from the domain performance shift is that performance on some non-transferable slots, such as name, from, and to, may be limited by the pre-training approach. A future research direction is to investigate why such a significant performance decline occurs in joint training and to try to bridge it.

References
Eric, M.; Goel, R.; Paul, S.; Kumar, A.; Sethi, A.; Ku, P.; Goyal, A. K.; Agarwal, S.; Gao, S.; and Hakkani-Tur, D. 2019. MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines.

Falcon, W. 2019. PyTorch Lightning. GitHub. URL https://github.com/PyTorchLightning/pytorch-lightning.

Kim, S.; Yang, S.; Kim, G.; and Lee, S.-W. 2019. Efficient Dialogue State Tracking by Selectively Overwriting Memory. arXiv. URL http://arxiv.org/abs/1911.03906.

Le, H.; Socher, R.; and Hoi, S. C. H. 2020. Non-Autoregressive Dialog State Tracking. 1–21. URL http://arxiv.org/abs/2002.08024.

Lee, H.; Lee, J.; and Kim, T.-Y. 2019. SUMBT: Slot-Utterance Matching for Universal and Scalable Belief Tracking. 5478–5483. doi:10.18653/v1/p19-1546.

Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; and Zettlemoyer, L. 2020. Multilingual Denoising Pre-training for Neural Machine Translation. URL https://arxiv.org/abs/2001.08210.

Liu, Z.; Winata, G. I.; Lin, Z.; Xu, P.; and Fung, P. 2019. Attention-Informed Mixed-Language Training for Zero-shot Cross-lingual Task-oriented Dialogue Systems. arXiv. URL http://arxiv.org/abs/1911.09273.

Loshchilov, I.; and Hutter, F. 2017. Decoupled Weight Decay Regularization. URL http://arxiv.org/abs/1711.05101.

Mrkšić, N.; Ó Séaghdha, D.; Wen, T.-H.; Thomson, B.; and Young, S. 2017. Neural Belief Tracker: Data-Driven Dialogue State Tracking. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), 1: 1777–1788. doi:10.18653/v1/P17-1163.

Mrkšić, N.; Vulić, I.; Ó Séaghdha, D.; Leviant, I.; Reichart, R.; Gašić, M.; Korhonen, A.; and Young, S. 2017. Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. arXiv. URL http://arxiv.org/abs/1706.00374.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32, 8024–8035. Curran Associates, Inc. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Ramadan, O.; Budzianowski, P.; and Gašić, M. 2018. Large-Scale Multi-Domain Belief Tracking with Knowledge Sharing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Schuster, S.; Gupta, S.; Shah, R.; and Lewis, M. 2019. Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Vinyals, O.; and Le, Q. 2015. A Neural Conversational Model. arXiv preprint arXiv:1506.05869.

Williams, R. J.; and Zipser, D. 1989. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation.