El Volumen Louder Por Favor: Code-switching in Task-oriented Semantic Parsing
Arash Einolghozati, Abhinav Arora, Lorena Sainz-Maza Lecanda, Anuj Kumar, Sonal Gupta
Facebook
{arashe, abhinavarora, lorenasml, anujk, sonalgupta}@fb.com

Abstract
Being able to parse code-switched (CS) utterances, such as Spanish+English or Hindi+English, is essential to democratize task-oriented semantic parsing systems for certain locales. In this work, we focus on Spanglish (Spanish+English) and release a dataset, CSTOP, containing 5800 CS utterances alongside their semantic parses. We examine the CS generalizability of various cross-lingual (XL) models and exhibit the advantage of pre-trained XL language models when data for only one language is present. As such, we focus on improving the pre-trained models for the case when only an English corpus alongside either zero or a few CS training instances is available. We propose two data augmentation methods for the zero-shot and the few-shot settings: fine-tuning using translate-and-align, and augmentation using a generation model followed by match-and-filter. Combining the few-shot setting with the above improvements decreases the initial accuracy gap between the zero-shot and the full-data settings by two thirds.

1 Introduction

Code-switching (CS) is the alternation of languages within an utterance or a conversation (Poplack, 2004). It occurs under certain linguistic constraints but can vary from one locale to another (Joshi, 1982). We envision two usages of CS for virtual assistants. First, CS is very common in locales where there is a heavy influence of a foreign language (usually English) on the native "substrate" language (e.g., Hindi or Latin-American Spanish). Second, for other native languages, the prevalence of English tech words (e.g., Internet, screen) or media vocabulary (e.g., movie names) is very common. While in the second case a model with contextual understanding should be able to parse the utterance, the first form of CS, which is our focus in this paper, needs cross-lingual (XL) capabilities in order to infer the meaning.

There are various challenges for CS semantic parsing. First, collecting CS data is hard because it requires bilingual annotators; this gets even worse considering that the number of CS language pairs grows quadratically with the number of languages. Moreover, CS is very dynamic and changes significantly by occasion and over time (Poplack, 2004). As such, we need extensible solutions that need little or no CS data while making use of the more commonly available English data. In this paper, we first focus on the zero-shot setup, for which we only use EN data for the same task domains (we call this in-domain EN data). We show that by translating the utterances to ES and aligning the slot values, we can achieve high accuracy on the CS data. Moreover, we show that having a limited amount of CS data, alongside augmentation with synthetically generated data, can significantly improve the performance.

Our contributions are as follows: 1) We release a code-switched task-oriented dialog dataset, CSTOP, containing 5800 Spanglish utterances and a corresponding parsing task. To the best of our knowledge, this is the first code-switched parsing dataset of such size that contains utterances for both training and testing. 2) We evaluate strong baselines under various resource constraints. 3) We introduce two data augmentation techniques that improve the code-switching performance using monolingual data. The dataset can be downloaded from https://fb.me/cstop data.

2 Task

In task-oriented dialog, the language understanding task consists of classifying the intent of an utterance, i.e., sentence classification, alongside tagging the slots, i.e., sequence labeling.
Figure 1: Example CS sentence and its annotation, for the sequence [IN:GET_WEATHER Dime el clima [SL:DATE_TIME para next Friday]]
We use the Task-Oriented Parsing (TOP) dataset released by Schuster et al. (2019a) as our EN monolingual dataset. We release a similar dataset, CSTOP, of around 5800 Spanglish utterances over two domains, Weather and Device, collected and annotated by native Spanglish speakers. An example from CSTOP alongside its annotation is shown in Fig. 1. Note that the intent and slot labels start with IN: and SL:, respectively. Our task is to classify the sentence intent, here IN:GET_WEATHER, as well as the label and value of the slots, here SL:DATE_TIME corresponding to the span para next Friday. All other words are classified as having no label, i.e., the O class. We discuss the details of this dataset in the next section.

One of the unique challenges of this task, compared with common NER and language identification CS tasks, is the constant evolution of CS data. Since the task is concerned with spoken language, the nature of CS is very dynamic and keeps evolving from domain to domain and from one community to another. Furthermore, cross-lingual data for this task is very rare. Most existing techniques either combine monolingual representations (Winata et al., 2019a) or combine datasets to synthesize code-switched data (Liu et al., 2019); a lack of monolingual data for the substrate language (very realistic if you replace ES with a less common language) would make those techniques inapplicable.

In order to evaluate the model in a task-oriented dialog setting, we use the exact-match accuracy (from now on, accuracy) as the primary metric. This is simply defined as the percentage of utterances for which the full parse, i.e., the intent and all the slots, has been correctly predicted.
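Concretely, the metric can be computed as below; this is a minimal sketch, and the dictionary representation of a parse is our own illustration rather than the dataset's storage format:

```python
def exact_match_accuracy(predictions, references):
    """Exact-match accuracy: an utterance counts as correct only if the
    intent and the complete set of (slot_label, slot_span) pairs match."""
    correct = sum(
        pred["intent"] == gold["intent"] and pred["slots"] == gold["slots"]
        for pred, gold in zip(predictions, references)
    )
    return correct / len(references)

gold = [{"intent": "IN:GET_WEATHER",
         "slots": {("SL:DATE_TIME", "para next Friday")}}]
pred = [{"intent": "IN:GET_WEATHER", "slots": set()}]  # missed the slot
print(exact_match_accuracy(pred, gold))  # 0.0
```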
3 The CSTOP Dataset

In this section, we provide details of the CSTOP dataset. We collected around 5800 CS utterances over two domains, Weather and Device. We picked these two domains as they exhibit complementary behavior: while Weather contains slot-heavy utterances (on average 1.6 slots per utterance), Device is an intent-heavy domain with only 0.8 slots per utterance on average. We split the data into 4077, 1167, and 559 utterances for training, testing, and validation, respectively.

CS data collection proceeded in the following steps:
1. One of the authors, a native speaker of Spanish who uses Spanglish on a daily basis, generated a small set of CS utterances for the Weather and Device domains. Additionally, we recruited bilingual EN/ES speakers who met our Spanglish-speaker criteria, established following Escobar and Potowski (2015).
2. We wrote Spanglish data creation instructions and asked participants to produce Spanish-English CS utterances for each intent (e.g., ask for the weather, set device brightness).
3. We then filtered this pool to retain only utterances that exhibited true intra-sentential CS.
4. The collected utterances were labeled by two annotators, who identified the intent and slot spans. If the two annotators disagreed on the annotation of an utterance, a third annotator resolved the disagreement to provide the final annotation.

Table 1 shows the number of distinct intents and slots as well as the number of utterances for each domain in CSTOP. Table 2 shows the 15 most common intents in the training set and a representative Spanglish example alongside its slot values for each; the first value in a slot tuple is the slot label and the second is the slot value. We can see that while most verbs and stop words are in Spanish, nouns and slot values are mostly in English. We further quantify the prevalence of Spanish and English words by using a 20k-word vocabulary file for each language: each token in the CSTOP training set is assigned to the language in which that token has the lower frequency rank (sketched in code after Table 1). The resulting Spanish-to-English token ratio matches our anecdotal observation and was consistent even when increasing the vocabulary size to 40k.

Domain  | Intents | Slots | Utterances
Weather | 2       | 4     | 3692
Device  | 17      | 6     | 2112
Table 1: CSTOP Statistics
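The language-assignment heuristic referenced above amounts to a rank comparison; below is a minimal sketch, where the rank dictionaries, lower-casing, and out-of-vocabulary handling are our own assumptions:

```python
def token_language(token, es_rank, en_rank):
    """Assign a token to the language in which it is more frequent,
    i.e., where it has the lower rank in a frequency-ranked vocabulary.
    Out-of-vocabulary tokens get an infinite rank; ties default to EN."""
    r_es = es_rank.get(token.lower(), float("inf"))
    r_en = en_rank.get(token.lower(), float("inf"))
    return "es" if r_es < r_en else "en"

es_rank = {"dime": 410, "el": 3, "clima": 880, "para": 11}  # toy ranks
en_rank = {"next": 120, "friday": 890, "el": 15000}
tokens = "Dime el clima para next Friday".split()
print([token_language(t, es_rank, en_rank) for t in tokens])
# ['es', 'es', 'es', 'es', 'en', 'en']
```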
4 Models and Experiments

Our base model is a bidirectional LSTM with separate projections for intent classification and slot tagging (Yao et al., 2013). We use the aligned word embeddings of MUSE (Lample et al., 2018) with a vocabulary size of 25k for both EN and ES. Our experiments showed that for the best XL generalization, it is best to freeze the word embeddings when the training data contains only EN or ES utterances. We refer to this model simply as MUSE.

We also use state-of-the-art pre-trained XL models: XLM (Conneau and Lample, 2019) and XLM-R (Conneau et al., 2020). These models are pre-trained via Masked Language Modeling (MLM) (Devlin et al., 2019) on massive multilingual data. They share the word-piece token representation, BPE (Sennrich et al., 2016) and SentencePiece (Kudo and Richardson, 2018), as well as a common MLM transformer across languages. Moreover, while XLM is pre-trained on Wikipedia, XLM-R is trained on crawled web data, which contains more non-English and possibly CS data. In order to adapt these models for the joint intent classification and slot tagging task, we use the method described in Chen et al. (2019). For classification, we add a linear classifier on top of the first hidden state of the Transformer. A typical slot tagging model feeds the hidden states corresponding to each token to a CRF layer (Mesnil et al., 2015); to make this compatible with XLM and XLM-R, we use the hidden state corresponding to the first sub-word of every token as the input to the CRF layer.

Table 3 shows the accuracy of the above models on CSTOP. We also list the performance when the models were first fine-tuned on the EN data (CS+EN). We observe that in-domain fine-tuning can almost halve the gap between XLM-R and XLM, which is faster at inference than XLM-R. The training details for all our models and the validation results are listed in the Appendix.
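A minimal sketch of the first-subword selection described above; the word_ids mapping (such as the one fast word-piece tokenizers expose) and the tensor shapes are assumptions for illustration:

```python
import torch

def first_subword_states(hidden, word_ids):
    """Select, for every word, the hidden state of its first sub-word.

    hidden:   [num_pieces, dim] transformer outputs for one utterance.
    word_ids: per-piece word index, with None for special tokens.
    Returns a [num_words, dim] tensor to feed the CRF layer.
    """
    first, seen = [], set()
    for i, w in enumerate(word_ids):
        if w is not None and w not in seen:
            seen.add(w)
            first.append(i)
    return hidden[torch.tensor(first)]

# Toy example: 6 word pieces covering 3 words plus two special tokens.
hidden = torch.randn(6, 8)
word_ids = [None, 0, 0, 1, 2, None]
print(first_subword_states(hidden, word_ids).shape)  # torch.Size([3, 8])
```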
The bottom part of Table 3 shows the CS test accuracy when using only in-domain monolingual data. Our EN dataset is the task-oriented parsing dataset (Schuster et al., 2019a) described in the previous section. Since the original TOP dataset did not include any utterances from the Device domain, we also release a dataset of EN Device utterances for the experiments using the EN data. In order to showcase the effect of monolingual ES data, we also experiment with an in-domain ES dataset, i.e., ES Weather and Device queries.

We observe that having monolingual data in both languages yields very high accuracy, only a few points shy of training directly on the CS data. Moreover, in this setting, even simpler models such as MUSE can yield results competitive with XLM-R while being much faster. However, the advantage of XL pre-training becomes evident when only one of the languages is present: having only the substrate language (i.e., ES) is almost as good as having both languages for XLM-R.

Note that we do not use the ES data for any other results in this paper. Obtaining a semantic parsing dataset in another language is expensive, and often only EN data is available. Our experiments show a huge performance gap when using only the EN data; thus, in the rest of this paper, we focus on using the EN data alongside zero or a few CS instances.
Here, we explore how much of the zero-shot performance can be attributed to the XL embeddings, as opposed to the shared XL representation. To this end, we experiment with replacing the MUSE embeddings in the LSTM model described above with the following strategies: (1) Random embeddings: the ES and EN word embeddings are learned from scratch. (2) Randomly-initialized SentencePiece (RSP) (Kudo and Richardson, 2018): words are represented by word-piece tokens learned from a large unlabeled multilingual corpus, with randomly initialized embeddings. (3) Pre-trained XLM-R SentencePiece (XLSP): the 250k embedding vectors learned during the pre-training of XLM-R (a loading sketch follows Table 4).

We show the effect of these embeddings in the zero-shot setting in Table 4. We can see that, given monolingual datasets in both languages, even random embeddings can yield high performance.
Intent | Utterance | Slots
GET_WEATHER | ¿cómo estará el clima en Miami este weekend? | (LOCATION, Miami), (DATE_TIME, este weekend)
UNSUPPORTED_WEATHER | how many centimeters va a llover hoy | (DATE_TIME, hoy)
OPEN_RESOURCE | Abreme el gallery | (RESOURCE, el gallery)
CLOSE_RESOURCE | Cierra maps | (RESOURCE, maps)
TURN_ON | Prende el privacy mode | (COMPONENT, el privacy mode)
TURN_OFF | Desactiva el speaker | (COMPONENT, el speaker)
WAKE_UP | Quita sleep mode | -
SLEEP | prende el modo sleep | -
OPEN_HOMESCREEN | Go to pagina de inicio | -
MUTE_VOLUME | Desactiva el sound | -
UNMUTE_VOLUME | Prende el sound | -
SET_BRIGHTNESS | subir el brigtness al 80 | (PERCENT, 80)
INCREASE_BRIGHTNESS | Ponlo mas bright | -
DECREASE_BRIGHTNESS | baja el brightness | -
SET_VOLUME | Turn the volumen al nivel 10 | (PRECISE_AMOUNT, 10)
INCREASE_VOLUME | aumenta el volumen a little bit | -
DECREASE_VOLUME | Bájale a la music | -
Table 2: Examples from CSTOP intents
Lang/Model | MUSE | XLM  | XLM-R
CS         | 87.0 | 86.6 | 94.4
CS + EN    | 88.1 | 93.0 | 95.4
EN         | 39.2 | 54.8 | 66.6
ES         | 69.9 | 78.3 | 88.1
EN+ES      | 88.2 | 87.8 | 91.2
Table 3: Full-training (top) and zero-shot (bottom) accuracy of XL models when using different monolingual corpora. ES is an internal dataset to showcase the effect of having a big Spanish corpus.

By removing one of the languages, unsurprisingly, the code-switching generalizability drops sharply for all strategies, but much less so for XLSP and MUSE. Moreover, even though the XLSP embeddings, unlike MUSE, are not constrained to only EN and ES, they yield results comparable to the word-based MUSE embeddings.

We can also see that when ES data is available, RSP provides some code-switching generalizability compared with the Random strategy, but not when only EN data is available. We hypothesize that the shared sub-word tokens are more helpful for generalizing to the slot values (which in the code-switched data are mostly in EN) than to the non-slot words, which are more commonly in ES. This is also supported by the observation that most of the gains of RSP over Random in the ES-only scenario come from the slot tagging accuracy rather than from intent detection.

As a final note, we observe that a substantial portion of the XLM-R gains can be captured by using the pre-trained sentence-piece embeddings alone, while the rest comes from the shared XL representation pre-trained on massive unlabeled data. In the rest of the paper, we focus on the XLM-R model.
Lang/Embedding | Random | RSP  | XLSP | MUSE
EN             | 13.5   | 12.2 | 30.3 |
ES             | 38.2   | 48.0 |      |
EN+ES          | 81.1   | 84.3 |      |
Table 4: Zero-shot accuracy for a simple LSTM model when using different monolingual corpora and different embedding strategies.
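As an illustration of the XLSP strategy, the sketch below loads XLM-R's embedding table as a frozen lookup for the LSTM tagger. It assumes the public HuggingFace xlm-roberta-large checkpoint as a stand-in for the PyText model used in our experiments:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
xlmr = AutoModel.from_pretrained("xlm-roberta-large")

# Copy the ~250k-row SentencePiece embedding table learned during XLM-R
# pre-training and freeze it; the BiLSTM tagger then trains on top of it.
table = xlmr.get_input_embeddings().weight.detach().clone()
embedding = torch.nn.Embedding.from_pretrained(table, freeze=True)

pieces = tokenizer("Dime el clima para next Friday", return_tensors="pt")
vectors = embedding(pieces["input_ids"])  # [1, num_pieces, dim] BiLSTM inputs
```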
5 Data Augmentation

In this section, we discuss two data augmentation approaches. The first operates in the zero-shot setting and uses only EN data to improve performance on the Spanglish test set. The second assumes a limited amount of Spanglish data and uses the EN data to augment the few-shot setting.
5.1 Zero-shot: Translate and Align

We explore creating synthetic ES data from the EN dataset using machine translation. Since ours is a joint intent and slot tagging task, creating a synthetic ES corpus consists of two parts: a) obtaining a parallel EN-ES corpus by machine-translating utterances from EN to ES, and b) projecting the gold annotations from the EN utterances onto their ES counterparts via word alignment (Tiedemann et al., 2014; Lee et al., 2019b). Once the words in the two languages are aligned, the slot annotations are simply copied over from EN to ES (see Figure 2).
Figure 2: An example comparison between the two methods of slot label projection. The image on the left shows attention alignment, where every source token gets projected to a single target token; as a result, percent in EN is aligned only with ciento in ES. The image on the right shows fast-align, which allows a many-to-many alignment, so percent is correctly aligned with por ciento.

For word alignment, we explore the two methods explained below. In some cases, word alignment may produce discontinuous slot tokens in ES, which we handle by introducing new slots of the same type for all discontinuous slot fragments.

Our first method leverages the attention scores (Bahdanau et al., 2015) obtained from an existing EN-to-ES NMT model. We adopt the simplifying assumption that each source word is aligned to one target-language word (Brown et al., 1993): for every slot token in the source language, we pick the target word with the highest attention score.

Our second approach to annotation projection makes use of unsupervised word alignment from statistical machine translation. Specifically, we use the fast-align toolkit (Dyer et al., 2013) to obtain alignments between EN and ES tokens. Since fast-align generates asymmetric alignments, we generate two sets of alignments, EN-to-ES and ES-to-EN, and symmetrize them using the grow-diag-final-and heuristic (Koehn et al., 2003) to obtain the final alignments.

In Table 5, we show the CS zero-shot accuracy when fine-tuning on the newly generated ES data (called ES*) alongside the original EN data. Unsupervised alignment results in around 2.5 absolute points of accuracy improvement. On the other hand, attention alignment ends up hurting the accuracy, perhaps due to the slot noise it introduces: the assumption that a single source token aligns with a single target token leads to incorrect annotations whenever the length of a translated slot differs between EN and ES. Figure 2 shows an example utterance where attention alignment produces an incorrect annotation while unsupervised alignment does not.

EN | EN+ES* (Attn) | EN+ES* (aligned)

Table 5: Zero-shot accuracy when fine-tuning XLM-R on EN monolingual data as well as the auto-translated and aligned ES data (called ES*).
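The projection step itself is mechanical once an alignment is fixed. Below is a minimal sketch, with token-index spans and an alignment set as assumed representations (the real pipeline symmetrizes the two fast-align directions before this step):

```python
def project_slots(en_slots, alignment, num_es_tokens):
    """Copy slot labels from EN tokens onto ES tokens via a word alignment.

    en_slots:  list of (label, start, end) token spans on the EN side.
    alignment: iterable of (en_idx, es_idx) pairs, e.g. symmetrized
               fast-align output.
    Returns a per-ES-token label list (None = the O class); a discontinuous
    projection yields separate fragments of the same label, which then
    become separate slots of that type.
    """
    es_labels = [None] * num_es_tokens
    for label, start, end in en_slots:
        for en_idx, es_idx in alignment:
            if start <= en_idx < end:
                es_labels[es_idx] = label
    return es_labels

# "set volume to 10" -> "ajuste el volumen a 10", alignment as (en, es) pairs.
alignment = {(0, 0), (1, 2), (2, 3), (3, 4)}
print(project_slots([("SL:PRECISE_AMOUNT", 3, 4)], alignment, 5))
# [None, None, None, None, 'SL:PRECISE_AMOUNT']
```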
5.2 Few-shot: Generate, Match, and Filter

Here, we assume having a limited amount of high-quality in-domain CS data; as such, we construct a few-shot subset of CSTOP by sampling from its original training set. We make sure that every individual slot and intent (though not necessarily every combination) is present in the subset and randomly sample the rest. We perform this sampling three times and report few-shot results as the average performance. This setting is of paramount importance for bringing up a domain in a new locale when the EN data is already available. The first column of Table 6 shows the CS few-shot (FS) performance, alongside fine-tuning on the EN data and on the aligned translated data, averaged over the three samples of the subset.

In order to improve the FS performance, we perform data augmentation on the few-shot subset. Unlike methods such as Pratapa et al. (2018), we seek generic methods that do not need extra resources such as constituency parsers. Instead, we explore using pre-trained generative models while taking advantage of the EN data. We use BART (Lewis et al., 2020), a denoising autoencoder trained on a massive amount of web data, as the generative model.
Model/Training Data        | Few-shot | Few-shot + Generate-and-Filter
XLM-R                      | 61.2     | 70.3
XLM-R fine-tuned on EN     | 82.6     | 83.7
XLM-R fine-tuned on EN+ES* |          |

Table 6: Accuracy when only a few CS instances are available during training, with and without the data augmentation. ES* is the auto-translated and aligned data.

EN source: [IN:GET_WEATHER show me the weather [SL:DATE_TIME for next Monday]]
CS target: [IN:GET_WEATHER Dime el clima [SL:DATE_TIME para next Friday]]
Generated: [IN:GET_WEATHER Quiero saber el clima [SL:DATE_TIME para next Monday]]
           [IN:GET_WEATHER Dime el clima esperado [SL:DATE_TIME para next Friday]]
           [IN:GET_WEATHER Dime el pronóstico [SL:DATE_TIME hasta el 15]]

Figure 3: Match-and-filter data augmentation: 1) for each CS utterance (target), find the closest EN neighbor (source); 2) learn a generative model from source to target; 3) perform beam search to generate more targets from the source utterances.
Our goal is to generate diverse Spanglish data from the EN data. Even though BART was trained on English, we found it very effective for this task. We hypothesize this is due to the abundance of Spanish text within the EN web data and the proximity of their word-piece tokens. We also experimented with multilingual BART (Liu et al., 2020a) but found it very challenging to fine-tune for this task.

First, we convert the data to a bracket format (Vinyals et al., 2015), called the seqlogical form in Gupta et al. (2018); examples of this format are shown in Fig. 3. In the seqlogical form, we include the intent (i.e., the sentence label) at the beginning, and for each slot, we include the label and text in brackets.

We perform our data augmentation in the following steps (a code sketch follows this subsection):
1. Find the top K closest EN neighbors of every CS query in the few-shot subset. We require the neighbors to have the same parse as the CS utterance, i.e., the same intent and slot labels, and use the Levenshtein distance to rank the EN sequences.
2. Treating the top-K EN neighbors as the source and the original CS query as the target, fine-tune the BART model on this parallel corpus. We use K=10 in our experiments to enlarge the parallel data.
3. At inference time, use a beam size of 5 to decode CS utterances from the same EN source data. Since both the source and target sequences are in the seqlogical form, the generated CS sequences are already annotated.

In Fig. 3, we show the closest EN neighbor corresponding to the original CS example in Fig. 1; the CS utterance can be seen as a rough translation of the EN sentence. We also show the top three generated CS utterances for the EN example.

In order to reduce noise, we filter out generated sequences that already exist in the few-shot subset, are not valid trees, or have a semantic parse different from the original utterance. We augment the few-shot subset with the remaining data and fine-tune the XLM-R baseline.

In the second column of Table 6, we show the average data augmentation improvement over the three few-shot samples. Even after fine-tuning on the EN monolingual data (second row), the augmentation technique improves this strong baseline. In the last row, we first use the translation alignment of the previous section to obtain ES*; after fine-tuning on this set combined with the EN data, we further fine-tune on the few-shot subset. The best model thus enjoys improvements from both the zero-shot (translate-and-align) and the few-shot (generate-and-filter) augmentation techniques. We also note that the p-values corresponding to the second- and third-row gains are 0.018 and 0.055, respectively.
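For concreteness, a minimal sketch of the match and filter steps follows; the parse dictionaries are our own assumed representation, and difflib's similarity ratio stands in for the Levenshtein ranking used in our experiments:

```python
import difflib

def top_k_neighbors(cs, en_pool, k=10):
    """Match step: keep the K EN utterances whose parse matches the CS
    utterance (same intent, same multiset of slot labels), ranked by
    string similarity of their seqlogical forms."""
    same_parse = [
        en for en in en_pool
        if en["intent"] == cs["intent"]
        and sorted(en["slot_labels"]) == sorted(cs["slot_labels"])
    ]
    return sorted(
        same_parse,
        key=lambda en: difflib.SequenceMatcher(
            None, en["seqlogical"], cs["seqlogical"]).ratio(),
        reverse=True,
    )[:k]

def is_valid_tree(seqlogical):
    """Filter step: discard generations whose brackets are unbalanced."""
    depth = 0
    for ch in seqlogical:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```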
6 Related Work

Most of the initial work on pre-trained XL representations focused on embedding alignment (Xing et al., 2015; Zhang et al., 2017; Lample et al., 2018). Recent developments in this area have focused on context-aware XL alignment of contextual representations (Schuster et al., 2019b; Aldarmaki and Diab, 2019; Wang et al., 2019; Cao et al., 2020). More recently, pre-trained multilingual language models such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2020) have been introduced, and Pires et al. (2019) demonstrate their effectiveness on sequence labeling tasks. Separately, Liu et al. (2020a) introduce mBART, a sequence-to-sequence denoising autoencoder pre-trained on monolingual corpora in many languages using the BART objective (Lewis et al., 2020).
Following the ACL shared tasks, CS is mostly discussed in the context of word-level language identification (Molina et al., 2016) and NER (Aguilar et al., 2018). Techniques such as curriculum learning (Choudhury et al., 2017) and attention over different embeddings (Wang et al., 2018; Winata et al., 2019a) have been among the successful approaches. CS parsing and the use of monolingual parses are discussed in Sharma et al. (2016) and Bhat et al. (2017, 2018). Sharma et al. (2016) introduce a Hinglish test set for a shallow parsing pipeline. In Bhat et al. (2017), the outputs of two monolingual dependency parsers are combined to achieve a CS parse; Bhat et al. (2018) extend this test set with training data and transfer knowledge from monolingual treebanks. Duong et al. (2017) introduce a CS test set for semantic parsing, curated by combining utterances from two monolingual datasets. In contrast, CSTOP is procured independently of the monolingual data and exhibits much more linguistic diversity. In Pratapa et al. (2018), linguistic rules are used to generate CS data, which has been shown to be effective in reducing the perplexity of a CS language model. In contrast, our augmentation techniques are generic and do not require rules or constituency parsers.
Most approaches to cross-lingual data augmentation use machine translation and slot projection for sequence labeling tasks (Jain et al., 2019). Wei and Zou (2019) use simple operations such as synonym replacement, and Lee et al. (2019a) use phrase replacement from a parallel corpus to augment the training data. Singh et al. (2019) present XLDA, which augments data by replacing segments of the input text with their translations in other languages. Some recent approaches (Chang et al., 2019; Winata et al., 2019b) also train generative models to artificially generate CS data. More recently, Kumar et al. (2020) study data augmentation using pre-trained transformer models by incorporating label information during fine-tuning. Concurrent to our work, Bari et al. (2020) introduce MultiMix, where data augmentation from pre-trained multilingual language models and self-learning are used for semi-supervised learning. Liu et al. (2019) generate CS data by translating keywords picked based on attention scores from a monolingual model, and generating CS data has also been studied in Liu et al. (2020b).
The intent/slot framework is the most common way of performing language understanding for task-oriented dialog. A bidirectional LSTM sentence representation with separate projection layers for intent classification and slot tagging is the typical architecture for the joint task (Yao et al., 2013; Mesnil et al., 2015; Hakkani-Tür et al., 2016). Such representations can accommodate trees of depth up to two, as is the case in CSTOP. More recently, extensions of this framework have been introduced to fit deeper trees (Gupta et al., 2018; Rongali et al., 2020).
7 Conclusion

In this paper, we propose a new task for code-switched semantic parsing and release a dataset, CSTOP, containing 5800 Spanglish utterances over two domains. We hope this fosters further research on the code-switching phenomenon, which has been held back by the paucity of sizeable curated datasets. We show that cross-lingual pre-trained models generalize better than traditional models to the code-switched setting when monolingual data from only one language is available. For the setting with only EN data, we introduce generic augmentation techniques based on translation and generation: translating and aligning the EN data significantly improves zero-shot performance, and generating code-switched data using a generation model and a match-and-filter approach yields improvements in the few-shot setting. We leave exploring and combining other augmentation techniques to future work.
References
Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Mona Diab, Julia Hirschberg, and Thamar Solorio. 2018. Named entity recognition on code-switched data: Overview of the CALCS 2018 shared task. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 138–147, Melbourne, Australia. Association for Computational Linguistics.

Hanan Aldarmaki and Mona Diab. 2019. Context-aware cross-lingual mapping. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3906–3911, Minneapolis, Minnesota. Association for Computational Linguistics.

Ahmed Aly, Kushal Lakhotia, Shicong Zhao, Mrinal Mohit, Barlas Oguz, Abhinav Arora, Sonal Gupta, Christopher Dewan, Stef Nelson-Lindall, and Rushin Shah. 2018. PyText: A seamless path from NLP research to production. CoRR, abs/1812.08729.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).

M. Saiful Bari, Muhammad Tasnim Mohiuddin, and Shafiq R. Joty. 2020. MultiMix: A robust data augmentation strategy for cross-lingual NLP. CoRR, abs/2004.13240.

Irshad Bhat, Riyaz A. Bhat, Manish Shrivastava, and Dipti Sharma. 2017. Joining hands: Exploiting monolingual treebanks for parsing of code-mixing data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 324–330, Valencia, Spain. Association for Computational Linguistics.

Irshad Bhat, Riyaz A. Bhat, Manish Shrivastava, and Dipti Sharma. 2018. Universal dependency parsing for Hindi-English code-switching. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 987–998, New Orleans, Louisiana. Association for Computational Linguistics.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Multilingual alignment of contextual word representations. In International Conference on Learning Representations (ICLR). OpenReview.net.

Ching-Ting Chang, Shun-Po Chuang, and Hung-yi Lee. 2019. Code-switching sentence generation by generative adversarial networks and its application to data augmentation. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 554–558. ISCA.

Qian Chen, Zhu Zhuo, and Wen Wang. 2019. BERT for joint intent classification and slot filling. ArXiv, abs/1902.10909.

Monojit Choudhury, Kalika Bali, Sunayana Sitaram, and Ashutosh Baheti. 2017. Curriculum design for code-switching: Experiments with language identification and language modeling with deep neural networks. In Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017), pages 65–74, Kolkata, India. NLP Association of India.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8440–8451. Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 7057–7067.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Long Duong, Hadi Afshar, Dominique Estival, Glen Pink, Philip Cohen, and Mark Johnson. 2017. Multilingual semantic parsing and code-switching. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 379–389, Vancouver, Canada. Association for Computational Linguistics.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Anna Maria Escobar and Kim Potowski. 2015. El Español de los Estados Unidos. Cambridge University Press.

Sonal Gupta, Rushin Shah, Mrinal Mohit, Anuj Kumar, and Mike Lewis. 2018. Semantic parsing for task oriented dialog using hierarchical representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2787–2792. Association for Computational Linguistics.

Dilek Hakkani-Tür, Gökhan Tür, Asli Çelikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 8-12, 2016, pages 715–719. ISCA.

Alankar Jain, Bhargavi Paranjape, and Zachary C. Lipton. 2019. Entity projection via machine translation for cross-lingual NER. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1083–1092, Hong Kong, China. Association for Computational Linguistics.

Aravind K. Joshi. 1982. Processing of sentences with intra-sentential code-switching. In Coling 1982: Proceedings of the Ninth International Conference on Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 127–133.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. CoRR, abs/2003.02245.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations (ICLR). OpenReview.net.

Grandee Lee, Xianghu Yue, and Haizhou Li. 2019a. Linguistically motivated parallel data augmentation for code-switch language modeling. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 3730–3734. ISCA.

Kyungjae Lee, Sunghyun Park, Hojae Han, Jinyoung Yeo, Seung-won Hwang, and Juho Lee. 2019b. Learning with limited data for multilingual reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2840–2850, Hong Kong, China. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020a. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguistics, 8:726–742.

Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, and Pascale Fung. 2019. Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems.

Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, and Pascale Fung. 2020b. Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8433–8440.

Grégoire Mesnil, Yann N. Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tür, Xiaodong He, Larry P. Heck, Gökhan Tür, Dong Yu, and Geoffrey Zweig. 2015. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Trans. Audio Speech Lang. Process., 23(3):530–539.

Giovanni Molina, Fahad AlGhamdi, Mahmoud Ghoneim, Abdelati Hawwari, Nicolas Rey-Villamizar, Mona Diab, and Thamar Solorio. 2016. Overview for the second shared task on language identification in code-switched data. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, pages 40–49, Austin, Texas. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4996–5001. Association for Computational Linguistics.

Shana Poplack. 2004. Code-Switching, pages 589–596.

Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, and Kalika Bali. 2018. Language modeling for code-mixing: The role of linguistic theory based synthetic data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1543–1553, Melbourne, Australia. Association for Computational Linguistics.

Subendhu Rongali, Luca Soldaini, Emilio Monti, and Wael Hamza. 2020. Don't parse, generate! A sequence to sequence architecture for task-oriented semantic parsing. In WWW '20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, pages 2962–2968. ACM / IW3C2.

Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2019a. Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 3795–3805. Association for Computational Linguistics.

Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019b. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613, Minneapolis, Minnesota. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Arnav Sharma, Sakshi Gupta, Raveesh Motlani, Piyush Bansal, Manish Shrivastava, Radhika Mamidi, and Dipti M. Sharma. 2016. Shallow parsing pipeline - Hindi-English code-mixed social media text. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1340–1345, San Diego, California. Association for Computational Linguistics.

Jasdeep Singh, Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2019. XLDA: Cross-lingual data augmentation for natural language inference and question answering. CoRR, abs/1905.11471.

Jörg Tiedemann, Željko Agić, and Joakim Nivre. 2014. Treebank translation for cross-lingual parser induction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 130–140, Ann Arbor, Michigan. Association for Computational Linguistics.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2773–2781. Curran Associates, Inc.

Changhan Wang, Kyunghyun Cho, and Douwe Kiela. 2018. Code-switched named entity recognition with embedding attention. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 154–158, Melbourne, Australia. Association for Computational Linguistics.

Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, and Ting Liu. 2019. Cross-lingual BERT transformation for zero-shot dependency parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5721–5727, Hong Kong, China. Association for Computational Linguistics.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388, Hong Kong, China. Association for Computational Linguistics.

Genta Indra Winata, Zhaojiang Lin, Jamin Shin, Zihan Liu, and Pascale Fung. 2019a. Hierarchical meta-embeddings for code-switching named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3541–3547, Hong Kong, China. Association for Computational Linguistics.

Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2019b. Code-switched language models using neural based synthetic data from parallel sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning, CoNLL 2019, Hong Kong, China, November 3-4, 2019, pages 271–280. Association for Computational Linguistics.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011, Denver, Colorado. Association for Computational Linguistics.

Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In Interspeech, pages 2524–2528.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1959–1970, Vancouver, Canada. Association for Computational Linguistics.
Appendix
Here, we describe the details regarding the training as well as the validation results.
A.1 Model and Training Parameters
In Table 7, we show the training details for all our models. We use Adam (Kingma and Ba, 2015) with the Learning Rate (LR), Weight Decay (WD), and Batch Size (BSz) listed for each model. We also show the number of epochs and the average training time for the full CS data using Nvidia V100 GPUs. For all our XLM-R experiments, we use XLM-R large from PyText (Aly et al., 2018) (https://pytext.readthedocs.io/en/master/xlm_r.html), which is pre-trained on 100 languages. For the XLM experiments, we use XLM-20, pre-trained on 20 languages, with the same fine-tuning parameters as XLM-R but more epochs.

For the LSTM models, we use a two-layer LSTM with dropout on all connections, and one MLP layer for both the slot tagging and the intent classification. We use an ensemble of five models for all the LSTM experiments to reduce the variance. The LSTM models with SentencePiece embeddings in Table 4 were trained with the same embedding dimension as the XLM-R model.

Model | BSz | LR       | WD      | Epochs | Avg Time
XLM-R | 8   | 0.000005 | 0.0001  | 15     | 5 hr
XLM   | 8   | 0.000005 | 0.0001  | 20     | 1 hr
LSTM  | 64  | 0.03     | 0.00001 | 45     | 45 min

Table 7: Training Parameters

A.2 Validation Results

In Table 9, we show the validation results when using the full CS training data. We do not show the corresponding results for the zero-shot experiments, as no validation data was used and the monolingual models were tested off the shelf. In Table 8, we show the validation results for the few-shot setting.
Model/Training Data        | Few-shot | Few-shot + Generate-and-Filter
XLM-R                      | 61.7     | 70.4
XLM-R fine-tuned on EN     | 83.3     | 83.9
XLM-R fine-tuned on EN+ES* |          |

Table 8: Validation accuracy when only a few CS instances (FS) are available during training. FS+G refers to augmenting the few-shot instances with generated CS data. ES* is the auto-translated and aligned data.
Lang/Model | MUSE | XLM | XLM-R