CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP
Libo Qin, Minheng Ni, Yue Zhang, Wanxiang Che
Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China
School of Engineering, Westlake University, China
Institute of Advanced Technology, Westlake Institute for Advanced Study
{lbqin, mhni, car}@ir.hit.edu.cn, [email protected]

Abstract
Multi-lingual contextualized embeddings, such as multilingual BERT (mBERT), have shown success in a variety of zero-shot cross-lingual tasks. However, these models are limited by having inconsistent contextualized representations of subwords across different languages. Existing work addresses this issue with bilingual projection and fine-tuning techniques. We propose a data augmentation framework that generates multi-lingual code-switching data to fine-tune mBERT, encouraging the model to align representations from the source and multiple target languages at once by mixing their context information. Compared with existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages. Experimental results on five tasks with 19 languages show that our method leads to significantly improved performance on all tasks compared with mBERT.
Introduction

Neural network models for NLP rely on the availability of labeled data for effective training [Yin et al., 2019]. For languages such as English and Chinese, there exist manually labeled datasets for a variety of tasks, trained over which neural models can rival human performance. However, for most languages manually labeled data is scarce. As a result, cross-lingual transfer learning stands as a useful research direction [Ruder et al., 2017]. The main idea is to make use of knowledge learned from a resource-rich language to enhance model performance on a low-resource language. In particular, zero-shot cross-lingual learning, which requires no labeled data for a target language, has attracted much research attention [Wang et al., 2019]. In this paper, we consider this transfer setting.

Recent state-of-the-art results have been achieved by methods based on cross-lingual contextualized embeddings [Conneau and Lample, 2019; Huang et al., 2019; Liu et al., 2019b; Devlin et al., 2019]. In particular, a common set of subwords is extracted across different languages and taken as the basis for training contextualized embeddings. For such training, raw sentences from multiple languages are merged into a single training set, so that shared subword embeddings and other parameters are tuned across different languages. A representative model is mBERT [Devlin et al., 2019], a multi-lingually trained version of BERT.

Figure 1: Prior work (a) vs. our method (b). (a) A shared semantic space within two languages (English and Spanish; English and Thai). (b) A shared semantic space within multiple languages (English, Thai and Spanish).

While the method above gives strong results for zero-shot cross-lingual adaptation through shared subwords and parameters, it has a salient limitation. The context for training cross-lingual embeddings is still mono-lingual, which can lead to inconsistent contextualized representations of subwords across different languages. Several recent methods try to bridge this inconsistency of contextualized embeddings across languages. As shown in Figure 1(a), two main lines of work are considered. One learns a mapping function from a source contextualized subword embedding to its target counterpart using word alignment information [Wang et al., 2019]; the other uses code mixing to construct training sentences that contain both source and target phrases in order to fine-tune mBERT [Liu et al., 2019b]. Unfortunately, both lines of work only consider a pair of source and target languages at a time, resulting in a separate model for each target language.

We consider enhancing mBERT without creating multiple additional models, by dynamically constructing code-switched data in multiple languages for better fine-tuning. To this end, a set of English raw sentences and the bilingual dictionaries of MUSE [Lample et al., 2018] are used as the basis (for languages not covered by MUSE, we extract dictionaries from PanLex [Kamholz et al., 2014]).

Figure 2: Augmentation process. The source language sentences (a), the sentence selection step (b), the token selection step (c) and the replacement selection step (d); different shades of yellow in (d) represent translations into different languages.

As shown in Figure 2, three data augmentation steps are taken. First, a set of sentences is randomly selected for code mixing. Second, in each selected sentence, a set of words is randomly chosen to be replaced with their translations in a different language. Third, for each word to translate, a target language is randomly selected. The above procedure is executed dynamically at the batch level to fine-tune mBERT. The intuition is to help the model automatically and implicitly align the replaced word vectors in the source and all target languages by mixing their context information.

Compared with existing methods, our method has the following advantages.
First, the resulting model is as simple to use as mBERT, without the need to know the test language beforehand; in addition, one training process serves all target languages. Second, unlike most existing methods, our method does not rely on parallel sentences, which is especially practical for low-resource languages. Third, the method is dynamic in the sense that a different set of code-switched sentences is constructed in each batch during training, thereby increasing the diversity of data instances [Liu et al., 2019a]. Finally, contextualized embeddings for all the languages are aligned into the same space, whereas prior work can only align representations of the source language and one target language at a time. This advantage is illustrated in Figure 1(b).

We conduct experiments on five zero-shot cross-lingual tasks: natural language inference, sentiment classification, document classification, spoken language understanding and dialogue state tracking. Results show that our method leads to significantly improved performance on all tasks compared with mBERT. For some tasks, our model gives the best results with only 1/10 of the English training data. All code is publicly available at: https://github.com/kodenii/CoSDA-ML.
Background

In this section, we describe the background of mBERT and how it is applied to cross-lingual classification and sequence labeling tasks.

mBERT follows the same model architecture and training procedure as BERT [Devlin et al., 2019]. It adopts a 12-layer Transformer, but instead of being trained only on monolingual English data, it is trained on the Wikipedia pages of 104 languages with a shared word-piece vocabulary, which allows the model to share embeddings across languages.
Given an input utterance s = (s_1, s_2, ..., s_n) from a source language (i.e., English), we first construct the input sequence by adding special tokens, s = ([CLS], s_1, s_2, ..., s_n, [SEP]), where [CLS] is the special symbol representing the whole sequence and [SEP] is the special symbol separating non-consecutive token sequences [Devlin et al., 2019]. mBERT takes the constructed input sequence of no more than 512 tokens and outputs the representation of the sequence h = (h_CLS, h_1, ..., h_n, h_SEP).

For classification tasks, mBERT feeds h_CLS into a classification layer to find the label c:

c = softmax(W h_CLS + b),  (1)

where W is a task-specific parameter matrix. We fine-tune all the parameters of mBERT as well as W jointly by maximizing the log-probability of the correct label.

For sequence labeling tasks, we feed the final hidden states of the input tokens into a softmax layer to classify the tokens. Note that BERT produces embeddings at the wordpiece level with WordPiece tokenization; we use the hidden state corresponding to the first sub-token as input to classify a word:

y_n = softmax(W_s h_n + b_s),  (2)

where h_n is the representation of the first sub-token of word x_n.

The baseline mBERT models, trained on source-language classification and sequence labeling tasks, perform zero-shot cross-lingual transfer by being used directly for the target language. We assume that there are labeled training data for each task in English, and transfer the trained model to a target language without labeled training data.
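For concreteness, the two heads in Eq. (1) and Eq. (2) can be written down as follows. This is a minimal sketch assuming PyTorch and the HuggingFace transformers implementation of mBERT, not the authors' released code; the class and variable names are ours, and the softmax is left to the loss function as usual.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class MBertHeads(nn.Module):
    """mBERT encoder with a sentence-level head (Eq. 1) and a token-level head (Eq. 2)."""
    def __init__(self, num_classes, num_tags, name="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        self.cls_head = nn.Linear(hidden, num_classes)   # W, b in Eq. (1)
        self.tag_head = nn.Linear(hidden, num_tags)      # W_s, b_s in Eq. (2)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        cls_logits = self.cls_head(h[:, 0])   # h_CLS -> sentence label scores
        tag_logits = self.tag_head(h)         # per-subword scores; first sub-token used per word
        return cls_logits, tag_logits

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
batch = tokenizer(["it 's a very sincere work"], return_tensors="pt",
                  padding=True, truncation=True, max_length=512)
model = MBertHeads(num_classes=3, num_tags=10)
cls_logits, tag_logits = model(batch["input_ids"], batch["attention_mask"])
```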
Approach

Our framework performs cross-lingual tasks in two steps: fine-tuning mBERT with augmented multi-lingual code-switched data, and applying it for zero-shot testing, as illustrated in Figure 3. Given a batch of training data S = {s^(n)} (n = 1, ..., N) from a source language, the dynamic augmentation generator adopts Algorithm 1 to generate code-switched training data for fine-tuning mBERT. Formally, the procedure can be written as:

T = Generator(S),  (3)
out = Fine-tune(mBERT, T),  (4)

where T represents the generated code-switched data and out denotes the output of all tasks. In the zero-shot test, the fine-tuned mBERT is used directly for the target languages.

Algorithm 1: Multi-lingual code-switching data augmentation framework.
Input: source language training data S = {s^(n)}, n = 1..N; a set of bilingual dictionaries dict; sentence and token replacement ratios {α, β}; target language set LAN.
Output: multi-lingual code-switching training data T = {t^(n)}, n = 1..N.
for n ← 1 ... N do
    if random() < α then
        i ← 1; t^(n) ← ∅;
        while s^(n)_i ≠ [SEP] do
            if random() < β then
                tgt ← random(LAN);
                s^(n)_i ← dict_src→tgt[s^(n)_i];
            else
                s^(n)_i ← s^(n)_i;
            end
            t^(n) ← t^(n) ∪ {s^(n)_i};
            i ← i + 1;
        end
    else
        t^(n) ← s^(n);
    end
end
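As a concrete reading of Algorithm 1, the generator can be transcribed roughly as follows. This is our own sketch, not the released implementation: dicts is assumed to map each target language to a source-word → list-of-translations dictionary, words missing from the dictionary are left unchanged, and the default ratios are placeholders.

```python
import random

def generate_code_switched(batch, dicts, languages, alpha=0.9, beta=0.5):
    """Algorithm 1 (sketch): replace words of randomly chosen sentences with
    random target-language translations drawn from bilingual dictionaries."""
    augmented = []
    for sentence in batch:                        # sentence: list of source-language tokens
        if random.random() < alpha:               # sentence selection
            new_tokens = []
            for token in sentence:
                if random.random() < beta:        # token selection
                    tgt = random.choice(languages)         # replacement selection
                    translations = dicts[tgt].get(token)   # dict_src->tgt[token]
                    new_tokens.append(random.choice(translations) if translations else token)
                else:
                    new_tokens.append(token)
            augmented.append(new_tokens)
        else:
            augmented.append(sentence)            # unselected sentences stay unchanged
    return augmented
```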
The augmentation method consists of three steps: sentence selection, token selection and replacement selection.

(i) Sentence Selection: Given a batch of training data S, we randomly select sentences for generating code-switched sentences; the unselected sentences are kept in the original language. Take the sentences in Figure 2(b) for example: we randomly select the first and the third sentence while leaving the second sentence unchanged.

(ii) Token Selection: For each sentence chosen in the sentence selection step, we randomly choose words to translate. Take the example in Figure 2(c): the word "very" in the first sentence and "What" in the third sentence are chosen.

(iii) Replacement Selection: After obtaining the selected words, we randomly choose a target language for each of them according to a bilingual dictionary. As shown in Figure 2(d), different target languages can be mixed in the code-switched data. It is worth noticing that a word in the source language can have multiple translations in the target language; in this case, we randomly choose any of the translations as the replacement word. Though we cannot guarantee that this is the correct word-to-word translation in context, we consider it part of the data augmentation strategy for our tasks.

Algorithm 1 shows pseudocode for the multi-lingual code-switching data augmentation process, where lines 1-2 correspond to the sentence selection step, lines 3-6 to the token selection step and lines 7-11 to the replacement selection step. The augmentation is performed per batch dynamically, so the model trains on different augmented data in each batch, which increases the diversity of data instances [Liu et al., 2019a]. Intuitively, training with augmented code-switched data makes the model automatically align the replaced word in a target language and the original word in the source language into a similar vector space, according to their similar context information.

Figure 3: Illustration of our training and zero-shot test process with multiple languages. Red denotes Chinese and blue denotes German. Better viewed in color.
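Because the generator is called inside the training loop, each batch is re-augmented from scratch. A schematic fine-tuning loop, building on the MBertHeads and generate_code_switched sketches above (all names are ours; the model, tokenizer, data loader, optimizer and loss are passed in by the caller):

```python
def finetune_with_cosda(model, tokenizer, train_loader, dicts, languages,
                        optimizer, criterion, alpha, beta, num_epochs):
    """Schematic fine-tuning loop: augmentation is re-run for every batch (Eqs. 3-4),
    so the model never sees the same code-switched corpus twice (the dynamic setting).
    Generating T once before training would give the static variant discussed later."""
    for _ in range(num_epochs):
        for sentences, labels in train_loader:    # sentences: lists of source-language tokens
            mixed = generate_code_switched(sentences, dicts, languages, alpha, beta)
            enc = tokenizer(mixed, is_split_into_words=True, padding=True,
                            truncation=True, max_length=512, return_tensors="pt")
            cls_logits, _ = model(enc["input_ids"], enc["attention_mask"])
            loss = criterion(cls_logits, labels)  # fine-tune mBERT on the augmented batch T
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```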
Natural Language Inference.
We use XNLI [Conneau et al., 2018], which covers 15 languages, for natural language inference. We feed a pair of sentences directly into the mBERT encoder, and a task-specific classification layer is used for classification. Models are evaluated by classification accuracy (ACC).
Sentiment Classification.
Following Barnes et al. [2018], we use the OpeNER English and Spanish datasets and the MultiBooked Catalan and Basque datasets. We directly provide the sentence to the mBERT encoder, and the [CLS] representation is fed into a linear layer for classification. Models are evaluated by macro F1.
Document Classification.
We use MLDoc [Schwenk and Li, 2018] for document classification, a balanced subset of the Reuters corpus covering 8 languages. As in sentiment classification, we directly provide the document to the mBERT encoder, and the [CLS] representation is fed into a linear layer for classification. Models are evaluated by classification accuracy (ACC).

Model en fr es de el bg ru tr ar vi th zh hi sw ur Average
Artetxe and Schwenk [2018] 73.9 71.9 72.9 72.6 73.1 74.2 71.5 69.7 71.4 72.0 69.2 71.4 65.5 62.2 61.0 70.2
XLM [Conneau and Lample, 2019] 84.1 77.1 78.0 75.0 74.1 75.1 72.4 70.0 70.6 71.5 68.3 73.2 66.7 67.5 62.2 72.4
+CoSDA-ML
mBERT from Wu and Dredze [2019] 82.1 73.8 74.3 71.1 66.4 68.9 69.0 61.6 64.9 69.5 55.8 69.3 60.0 50.4 58.0 66.3
+CoSDA-ML 82.9 76.7 76.9 74.1 70.9 72.7 73.2 63.9 68.0 73.6 59.8 73.8 65.5 51.0 62.3 69.7
Table 1: Natural Language Inference experiments.
Model es eu ca
BLSE [Barnes et al., 2018] 74.6 41.2 69.3 30.0 72.9 35.9
XLM [Conneau and Lample, 2019] 86.1 32.9 - - - -
+CLCSA 91.3 46.8 - - - -
mBERT [Devlin et al., 2019] 93.1 51.0 73.1 35.1 83.5 52.3
+CLCSA
Table 2: Sentiment classification experiments.
Dialogue State Tracking (DST).
Following prior work [Liu et al., 2019b], we use the Multilingual WOZ 2.0 dataset [Mrkšić et al., 2017], which includes German and Italian. DST aims to predict the slot-value pairs given the current utterance and the previous system acts. It can be viewed as a collection of binary prediction problems, with a distinct estimator for each slot-value pair [Chen et al., 2018]. We concatenate the current utterance and the previous system act as input to mBERT and obtain the [CLS] representation. We also feed each slot-value pair into mBERT and obtain another [CLS] representation. Finally, the two representations are provided to a classification layer to decide whether the slot-value pair should be selected. Following prior work, we use turn-level request tracking accuracy, joint goal tracking accuracy and slot tracking accuracy for evaluation.
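Read literally, this amounts to one binary scorer applied to every candidate slot-value pair. A rough sketch of such a scorer, assuming PyTorch and HuggingFace transformers and a single linear layer over the two concatenated [CLS] vectors; the class name, the context string format and the head design are our assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class DSTScorer(nn.Module):
    """Binary scorer: does the dialogue context select a given slot-value pair?"""
    def __init__(self, name="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.out = nn.Linear(2 * self.bert.config.hidden_size, 2)

    def encode_cls(self, texts, tokenizer):
        enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        return self.bert(**enc).last_hidden_state[:, 0]   # [CLS] vectors

    def forward(self, context, slot_values, tokenizer):
        ctx = self.encode_cls([context] * len(slot_values), tokenizer)  # utterance + system act
        cand = self.encode_cls(slot_values, tokenizer)                   # e.g. "food = italian"
        return self.out(torch.cat([ctx, cand], dim=-1))                  # selected / not selected

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
scorer = DSTScorer()
logits = scorer("any preference on food? italian please",
                ["food = italian", "food = thai"], tokenizer)
```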
Spoken Language Understanding.
We follow Schuster et al. [2019b] and use the cross-lingual spoken language understanding dataset, which contains English, Spanish and Thai. We adopt a joint model that provides the utterance to mBERT; the [CLS] representation is used for intent detection, and the token representations are used for slot prediction as a local classification on each word, which can be treated as a sequence labeling task. Intent detection is evaluated by classification accuracy (ACC) and slot filling by F1 score.
We evaluate the effectiveness of our proposed method across 19 languages on five tasks. In addition to mBERT, we also conduct all experiments on the recent strong pre-trained cross-lingual model XLM [Conneau and Lample, 2019]. XLM outperforms mBERT on XNLI, but underperforms mBERT on some other tasks [Liu et al., 2019b]; we choose it as a secondary baseline to verify the generalizability of our augmentation method.
For all tasks, no preprocessing is performed except tokenization of words into subwords with WordPiece. Following Devlin et al. [2019], we use WordPiece embeddings with a 110k token vocabulary. We use the base cased multilingual BERT (mBERT), which has N = 12 attention heads and M = 12 Transformer blocks. In fine-tuning, we select the best hyper-parameters by searching over combinations of batch size, learning rate, the number of fine-tuning epochs, and the token and sentence replacement ratios. Note that the best models are selected by development performance on English. We include the following state-of-the-art baselines:
Natural Language Inference.
Artetxe and Schwenk [2018] use multilingual sentence representations pre-trained with a sequence-to-sequence model. This model requires bitext for training.
Sentiment Classification.
BLSE [Barnes et al., 2018] jointly represents sentiment information in a source and a target language and achieves state-of-the-art performance in zero-shot cross-lingual sentiment classification.
Document Classification.
1) Schwenk and Li [2018] use MultiCCA, multilingual word embeddings trained with a bilingual dictionary, together with convolutional neural networks. 2) Artetxe and Schwenk [2018] also obtain promising performance; the model is described in the Natural Language Inference paragraph above.
Dialogue State Tracking (DST).
1) XL-NBT [Chen et al., 2018] uses a state tracker for the source language as a teacher, and then distills and transfers its knowledge to student state trackers in the target languages. 2) Attention-Informed Mixed Training: Liu et al. [2019b] use attention-informed code-switched data for training, achieving state-of-the-art performance.
Spoken Language Understanding.
1) Multi. CoVe: Schuster et al. [2019a] use Multilingual CoVe [Yu et al., 2018] as the encoder and add an autoencoder objective to produce more general representations for semantically similar sentences across languages. 2) Attention-Informed Mixed Training: Liu et al. [2019b] use attention to generate code-switched sentences, achieving the previous best result; the method translates only one word in each augmented sentence.
Model en de zh es fr it ja ru Average
Schwenk and Li [2018] 92.2 81.2 74.7 72.5 72.4 69.4 67.6 60.8 73.9
Artetxe and Schwenk [2018] 89.9 84.8 71.9 77.3 78.0 69.4 60.3 67.8 74.9
XLM [Conneau and Lample, 2019] 94.2 76.8 46.2 64.0 70.5 - - 61.5 68.9
+CLCSA 93.4 81.4 71.1 73.1 83.7 - - 68.3 78.5
mBERT [Devlin et al., 2019] 94.2 80.2 76.9 72.6 72.6 68.9 56.5 73.7 74.5
+CLCSA
Table 3: Document classification experiments.
Model | German: slot acc., joint goal acc., request acc. | Italian: slot acc., joint goal acc., request acc.
XL-NBT [Chen et al., 2018] | 55.0 30.8 68.4 | 72.0 41.2 81.2
Attention-Informed Mixed Training [Liu et al., 2019b] | 69.5 32.2 86.3 | 69.5 31.4 85.2
XLM from Liu et al. [2019b] | 58.0 16.3 75.7 | - - -
+CLCSA | 77.4 48.7 88.3 | - - -
mBERT [Devlin et al., 2019] | 57.6 15.0 75.3 | 54.6 12.6 77.3
+CLCSA
Table 4: Dialog State Tracking experiments.
Model | Spanish: Intent acc., Slot F1 | Thai: Intent acc., Slot F1
Multi. CoVe [Yu et al., 2018] | 53.9 19.3 | 70.7 35.6
Attention-Informed Mixed Training [Liu et al., 2019b] | 86.5 74.4 | 70.6 28.5
XLM from Liu et al. [2019b] | 62.3 42.3 | 31.6 7.9
+CLCSA | 90.3 69.0 |
mBERT [Devlin et al., 2019] | 73.7 51.7 | 28.2 10.6
+CLCSA (Static) | 92.8 75.2 | 74.8 28.1
+CLCSA
Table 5: Slot filling and Intent detection experiments.
We perform t-tests for all experiments to measure whether the results from the proposed model are significantly better than the baselines; numbers with asterisks indicate that the improvement is statistically significant. "-" indicates languages absent from the XLM models, for which we cannot report results. The results for the five tasks are shown in Tables 1, 2, 3, 5 and 4, respectively. Across the tasks, we observe that: 1) mBERT achieves strong performance on all zero-shot cross-lingual tasks, which demonstrates that mBERT is a surprisingly effective cross-lingual model for a wide range of NLP tasks; this is consistent with the observation of Wu and Dredze [2019]. Additionally, XLM achieves much better performance than mBERT on XNLI and promising performance on the four other tasks. 2) Our method outperforms mBERT and XLM by a large margin and achieves state-of-the-art performance on all the tasks, which demonstrates its effectiveness. Note that we could not reproduce the XNLI results of the original XLM paper because the exact best hyper-parameters are not available, as is also mentioned in issues on GitHub (https://github.com/facebookresearch/XLM/issues/199); we therefore run their open-source code (https://github.com/facebookresearch/XLM) to obtain the results and apply CoSDA-ML to it with the same hyper-parameters. Besides, our method not only obtains a 2.9% improvement in average score but also outperforms the reported results (average 75.1) from Conneau and Lample [2019], which further demonstrates its effectiveness. 3) Our method outperforms Attention-Informed Mixed Training on both the DST and SLU tasks, which indicates that our dynamic sampling and multi-lingual code-switched training are more effective for aligning representations between source and target languages than translating only one word into the target language.
Robustness.
To verify the robustness of CoSDA-ML, we conduct experiments with different token replacement ratios β during fine-tuning, keeping the sentence replacement ratio α at 1. The results are shown in Figures 4(a) and 4(b). For all values of β, our model consistently outperforms the state-of-the-art model (Attention-Informed Mixed Training) on slot filling and intent detection.

Varying Amounts of Training Data.
We study the effectiveness of CoSDA-ML with varying amounts of training data. Figures 4(c) and 4(d) report the results of Attention-Informed Mixed Training and our model with varying amounts of training data. We have two interesting observations: 1) Our augmentation method consistently outperforms the baseline at all training data sizes, which demonstrates its consistency. 2) Using only 1/10 of the training data, our approach performs better than Attention-Informed Mixed Training using 100% of the training data, demonstrating that our approach is particularly useful when only small amounts of training data are available.

Figure 4: Comparison between our model (solid lines) and the Attention-Informed Mixed Training (Att.) model (dashed lines): results with different β in (a) and (b), and with different subset sizes of training data in (c) and (d). In (c) and (d), the dashed line denotes Att. performance with 100% of the training data, while the solid line represents our model's performance as the proportion of training data is varied. Panels: (a) slot filling, (b) intent detection, (c) slot filling, (d) intent detection.

Figure 5: t-SNE visualization of the sentence vector space from mBERT (a) and with our CoSDA-ML method (b). Different colors represent different languages, and dots of the same color denote sentence representations with the same intent.
Effectiveness of Dynamic Sampling.
To verify the effectiveness of the proposed dynamic augmentation mechanism, we compare it with a static augmentation method, in which Algorithm 1 is applied once to obtain the augmented multi-lingual code-switched training data used for all batches. The results are shown in the static row of Table 5. We find that the dynamic method outperforms the static method on all tasks. We attribute this to the fact that the dynamic mechanism generates more varied code-switched multi-lingual data during batch-wise training, while the static method augments the original training data only once. Dynamic sampling allows the model to align the representations of more words across multiple languages.
Visualization.
Figure 6: Evaluation results based on BiLSTM (intent accuracy and slot F1 on Spanish and Thai, our method vs. BiLSTM).

To see whether our framework aligns the representations of the source language and all the target languages, we select three intents with 100 sentences each and visualize their [CLS] sentence vectors for our method and for mBERT. The mBERT results are shown in Figure 5(a): there is nearly no overlap between different languages, which shows that representations of different languages with the same intent remain distant. In contrast, the representations from our CoSDA-ML fine-tuned model in Figure 5(b) become closer across languages and overlap with each other, which further demonstrates that our method effectively aligns the representations of different languages.
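For readers who want to reproduce a plot like Figure 5, a generic recipe (not the authors' plotting code) is to collect the [CLS] vector of each selected sentence and project the set with scikit-learn's t-SNE; the function below is a minimal sketch with illustrative names.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(cls_vectors, lang_ids, path="tsne.png"):
    """cls_vectors: (num_sentences, hidden) array of [CLS] embeddings;
    lang_ids: integer language id per sentence (one color per language)."""
    points = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(np.asarray(cls_vectors))
    plt.scatter(points[:, 0], points[:, 1], c=lang_ids, s=8)
    plt.savefig(path, dpi=200)
```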
CoSDA-ML with BiLSTM.
A natural question is whether our augmentation method is effective for a general encoder beyond the Transformer. To investigate this, we replace mBERT with a BiLSTM and keep the other components the same. The BiLSTM does not include any information pre-trained over the Wikipedia pages of multiple languages, so this experiment tests whether our method depends strongly on the pre-trained model. The results are shown in Figure 6: our framework outperforms the plain BiLSTM on all metrics in all languages, which further demonstrates that our augmentation method is effective not only on top of mBERT but also with a general encoder.
Related Work
Zero-shot Cross-lingual Transfer.
The main strands of work have focused on learning cross-lingual word embeddings. Ruder et al. [2017] surveyed methods [Klementiev et al., 2012; Kočiský et al., 2014; Guo et al., 2016] for learning cross-lingual word embeddings by either joint training or post-training mappings of monolingual embeddings. Xing et al. [2015], Lample et al. [2018] and Chen and Cardie [2018] proposed to take pre-trained monolingual word embeddings of different languages as input and align them into a shared semantic space. Our work follows the recent line of cross-lingual contextualized embedding methods [Huang et al., 2019; Devlin et al., 2019; Wu and Dredze, 2019; Conneau and Lample, 2019; Artetxe et al., 2019], which are trained with masked language modeling or other auxiliary pre-training tasks to bring representations in the source and target language spaces closer, achieving state-of-the-art performance on a variety of zero-shot cross-lingual NLP tasks. Our work is also related to recent work [Conneau et al., 2020] analyzing the effectiveness of anchor points. We propose a data augmentation framework that dynamically constructs multi-lingual code-switching data for training, which implicitly encourages the model to align similar words in different languages into the same space.
Data Augmentation.
Recently, some augmentation methods have been successfully applied in the cross-lingual setting. Liu et al. [2019b] proposed an attention mechanism to select the most important word to translate into the target language for training. In contrast, our framework augments data dynamically in each epoch to encourage the model to align representations in different languages, and generates code-switched data in multiple languages, so that the model is trained once and can be tested directly on all target languages. Zhang et al. [2019] proposed using code-mixing to perform syntactic transfer in dependency parsing; however, they need a high-accuracy translator to obtain multi-lingual data, which can be difficult to train for low-resource languages. In contrast, our method uses existing bilingual dictionaries, which is more practical.
Conclusion

We proposed an augmentation framework that generates multi-lingual code-switching data to fine-tune mBERT, aligning representations from the source and multiple target languages. Experiments on five tasks show that our method consistently and significantly outperforms the mBERT and XLM baselines. In addition, our method is flexible and can be used to fine-tune any base encoder model. Future work includes applying CoSDA-ML to multi-lingual language modeling, so that a more general version of multi-lingual contextual embeddings can be investigated.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) via grants 61976072, 61632011 and 61772153. This work was also supported by a Westlake-BrightDreams Robotics research grant. We thank Yijia Liu for the helpful discussion and the anonymous reviewers for their insightful comments. Wanxiang Che and Yue Zhang are the corresponding authors.
References

[Artetxe and Schwenk, 2018] Mikel Artetxe and Holger Schwenk. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv preprint arXiv:1812.10464, 2018.
[Artetxe et al., 2019] Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856, 2019.
[Barnes et al., 2018] Jeremy Barnes, Roman Klinger, and Sabine Schulte im Walde. Bilingual sentiment embeddings: Joint projection of sentiment across languages. In Proc. of ACL, pages 2483–2493, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[Chen and Cardie, 2018] Xilun Chen and Claire Cardie. Unsupervised multilingual word embeddings. arXiv preprint arXiv:1808.08933, 2018.
[Chen et al., 2018] Wenhu Chen, Jianshu Chen, Yu Su, Xin Wang, Dong Yu, Xifeng Yan, and William Yang Wang. XL-NBT: A cross-lingual neural belief tracking framework. In Proc. of EMNLP, October-November 2018.
[Conneau and Lample, 2019] Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067, 2019.
[Conneau et al., 2018] Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Proc. of EMNLP, 2018.
[Conneau et al., 2020] Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. Emerging cross-lingual structure in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6022–6034, Online, July 2020. Association for Computational Linguistics.
[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL, 2019.
[Guo et al., 2016] Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. A representation learning framework for multi-source transfer parsing. In Proc. of AAAI, 2016.
[Huang et al., 2019] Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In Proc. of EMNLP, November 2019.
[Kamholz et al., 2014] David Kamholz, Jonathan Pool, and Susan Colowick. PanLex: Building a resource for panlingual lexical translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 2014.
[Klementiev et al., 2012] Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. Inducing crosslingual distributed representations of words. In Proc. of COLING, 2012.
[Kočiský et al., 2014] Tomáš Kočiský, Karl Moritz Hermann, and Phil Blunsom. Learning bilingual word representations by marginalizing alignments. In Proc. of ACL, June 2014.
[Lample et al., 2018] Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. In International Conference on Learning Representations, 2018.
[Liu et al., 2019a] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[Liu et al., 2019b] Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, and Pascale Fung. Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems, 2019.
[Mrkšić et al., 2017] Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics, 5:309–324, 2017.
[Ruder et al., 2017] Sebastian Ruder, Ivan Vulić, and Anders Søgaard. A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902, 2017.
[Schuster et al., 2019a] Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. Cross-lingual transfer learning for multilingual task oriented dialog. In Proc. of NAACL, June 2019.
[Schuster et al., 2019b] Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proc. of NAACL, June 2019.
[Schwenk and Li, 2018] Holger Schwenk and Xian Li. A corpus for multilingual document classification in eight languages. In Proceedings of the 11th Language Resources and Evaluation Conference, May 2018.
[Wang et al., 2019] Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, and Ting Liu. Cross-lingual BERT transformation for zero-shot dependency parsing. In Proc. of EMNLP, November 2019.
[Wu and Dredze, 2019] Shijie Wu and Mark Dredze. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proc. of EMNLP, 2019.
[Xing et al., 2015] Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. Normalized word embedding and orthogonal transform for bilingual word translation. In Proc. of NAACL, 2015.
[Yin et al., 2019] Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dialog state tracking with reinforced data augmentation. arXiv preprint arXiv:1908.07795, 2019.
[Yu et al., 2018] Katherine Yu, Haoran Li, and Barlas Oguz. Multilingual seq2seq training with similarity loss for cross-lingual document classification. In Proceedings of The Third Workshop on Representation Learning for NLP, July 2018.
[Zhang et al., 2019] Meishan Zhang, Yue Zhang, and Guohong Fu. Cross-lingual dependency parsing using code-mixed TreeBank. In