Cross-lingual Alignment Methods for Multilingual BERT: A Comparative Study
Saurabh Kulshreshtha∗
Department of Computer Science, University of Massachusetts Lowell
[email protected]

José Luis Redondo-García
Amazon Alexa, Cambridge, UK
[email protected]

Ching-Yun Chang
Amazon Alexa, Cambridge, UK
[email protected]

∗ Work done during an internship at Amazon.
Abstract
Multilingual BERT (mBERT) has shown reasonable capability for zero-shot cross-lingual transfer when fine-tuned on downstream tasks. Since mBERT is not pre-trained with explicit cross-lingual supervision, transfer performance can further be improved by aligning mBERT with cross-lingual signal. Prior work proposes several approaches to align contextualised embeddings. In this paper we analyse how different forms of cross-lingual supervision and various alignment methods influence the transfer capability of mBERT in the zero-shot setting. Specifically, we compare parallel corpora vs. dictionary-based supervision and rotational vs. fine-tuning based alignment methods. We evaluate the performance of different alignment methodologies across eight languages on two tasks: Named Entity Recognition and Semantic Slot Filling. In addition, we propose a novel normalisation method which consistently improves the performance of rotation-based alignment, including a notable 3% F1 improvement for distant and typologically dissimilar languages. Importantly, we identify the biases of the alignment methods towards the type of task and the proximity to the transfer language. We also find that supervision from parallel corpora is generally superior to dictionary alignments.
Multilingual BERT (mBERT) (Devlin et al., 2019) is the BERT architecture trained on data from 104 languages, where all languages are embedded in the same vector space. Due to the multilingual and contextual representation properties of mBERT, it has gained popularity in various multilingual and cross-lingual tasks (Karthikeyan et al., 2020; Wu and Dredze, 2019). In particular, it has demonstrated good zero-shot cross-lingual transfer performance on many downstream tasks, such as Document Classification, NLI, NER, POS tagging and Dependency Parsing (Wu and Dredze, 2019), when the source and the target languages are similar. Many experiments (Ahmad et al., 2019) suggest that to achieve reasonable performance in the zero-shot setup, the source and the target languages need to share similar grammatical structure or lie in the same language family. In addition, since mBERT is not trained with explicit cross-lingual signal, mBERT's multilingual representations are less effective for languages with little lexical overlap (Patra et al., 2019). One branch of work is therefore dedicated to improving the multilingual properties of mBERT by aligning the embeddings of different languages with cross-lingual supervision.

Broadly, two methods have been proposed in prior work to induce cross-lingual signals in contextual embeddings: 1) Rotation Alignment, described in Section 2, aims at learning a linear rotation transformation to project source language embeddings into their respective locations in the target language space (Schuster et al., 2019b; Wang et al., 2019; Aldarmaki and Diab, 2019); 2) Fine-tuning Alignment, explained in Section 3, internally aligns language sub-spaces in mBERT by tuning its weights such that distances between embeddings of word translations decrease while the informativity of the embeddings is preserved (Cao et al., 2020). Additionally, two sources of cross-lingual signal have been considered in the literature to align languages: parallel corpora and bilingual dictionaries. While each alignment method and source of supervision has a variety of advantages and disadvantages, it is unclear how these affect the performance of the aligned spaces across languages and various tasks.

In this paper, we empirically investigate the effect of these cross-lingual alignment methodologies and the applicable sources of cross-lingual supervision by evaluating their performance on zero-shot Named Entity Recognition (NER), a structured prediction task, and Semantic Slot-filling (SF), a semantic labelling task, across eight language pairs. The motivation for the choice of these tasks is two-fold: 1. Prior work has already studied alignment methods on sentence-level tasks; Cao et al. (2020) show the effectiveness of mBERT alignment methods on XNLI (Conneau et al., 2018). 2. Word-level tasks do not benefit from more pre-training, unlike other language tasks that improve by simply supplementing with more pre-training data. In experiments over the XTREME benchmark, Hu et al. (2020) find that transfer performance improves across all tasks when multilingual language models are pre-trained with more data, with the sole exception of word-level tasks. They note that this indicates current deep pre-trained models do not fully exploit the pre-training data to transfer to word-level tasks.
We believe that NER and Slot-filling are strong candidate tasks for assessing alignment methods due to the limited cross-lingual transfer capacity of current models on these tasks.

To the authors' knowledge, this is the first paper comparing alignment methods for contextual embedding spaces (rotation vs. fine-tuning alignment) and the two applicable sources of cross-lingual supervision (dictionary vs. parallel corpus supervision) on a set of tasks of structural and semantic nature over a wide range of languages. From the results, we find that parallel corpora are better suited for aligning contextual embeddings. In addition, we find that rotation alignment is more robust for the primarily structural NER downstream tasks, while fine-tuning alignment is found to improve performance across the semantic SF tasks. We further propose a novel normalisation procedure which consistently improves rotation alignment, motivated by the structure of the mBERT space and how languages are distributed across it. We also examine the effect of language proximity on the transfer improvement obtained by these alignment methods.

Mikolov et al. (2013) proposed to learn a linear transformation W_{s→t} which projects an embedding in the source language, e_s, to its translation in the target language space, e_t, by minimising the distances between the projected source embeddings and their corresponding target embeddings:

\[ \min_{W \in \mathbb{R}^{d \times d}} \lVert W X_s - X_t \rVert \quad (1) \]

X_s and X_t are matrices of size d × K, where d is the dimensionality of the embeddings and K is the number of parallel words from word-aligned corpora, or word pairs from a bilingual dictionary, between the source and target languages. Further work by Xing et al. (2015) demonstrated that restricting W to a purely rotational transform improves cross-lingual transfer across similar languages. The orthogonality assumption reduces Eq. (1) to the so-called Procrustes problem with the closed-form solution:

\[ W = U V^{T}, \quad (2) \]

where

\[ U \Sigma V^{T} = \mathrm{SVD}\left( X_t X_s^{T} \right) \quad (3) \]

and the SVD operator stands for Singular Value Decomposition.

A purely rotational transformation can align two embedding spaces only if the two spaces are roughly isometric and are distributed about the same mean. If the two embedding distributions are not centered around the same mean, meaning the two spaces have little overlap and are shifted by a translation offset, they cannot be aligned solely through rotation. Since the linear transformation W_{s→t} derived from solving the Procrustes problem only rotates the vector space, it assumes that the embeddings of the two languages are zero-centered. However, Libovický et al. (2019) observe that language distributions in mBERT have distinct and separable centroids, and that different language families have well-separated sub-spaces in the mBERT embedding vector space. To address this discrepancy, we propose a new normalisation mechanism which entails:

Step 1. Normalising the embeddings of both languages so that they have zero mean:

\[ \hat{X}_s = X_s - \bar{X}_s \quad \text{and} \quad \hat{X}_t = X_t - \bar{X}_t \quad (4) \]

where \bar{X}_s and \bar{X}_t are the centroids of the source and target embeddings X_s and X_t, and \hat{X}_s and \hat{X}_t are the mean-centered source and target language embedding matrices whose corresponding columns are embeddings of word translations. Next, \hat{X}_s and \hat{X}_t are used to compute the transformation matrix \hat{W}_{s→t} by solving Eq. (2) and Eq. (3).

Step 2. During training on a downstream task, the embedding of a source language word, e_s, needs to be re-centered, rotated and finally translated to the target language subspace to derive the projection e_{t*}:

\[ e_{t*} = \hat{W}_{s \to t} \left( e_s - \bar{X}_s \right) + \bar{X}_t \quad (5) \]

This helps the task-specific model, particularly in the zero-shot setting, by projecting the source language task data to the same locality as the target language.
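As a concrete illustration of the above, the following is a minimal NumPy sketch of rotation alignment (Eqs. 1-3) combined with the proposed normalisation (Eqs. 4-5). It is a sketch of ours, not the released implementation: the matrices here store one embedding per row (the transpose of the d × K convention above), and the function names are illustrative.

```python
import numpy as np

def learn_rotation(X_s, X_t, normalise=True):
    """Learn an orthogonal source-to-target map from aligned embedding pairs.

    X_s, X_t: (K, d) arrays; row k of each holds the embeddings of the k-th
    aligned word pair (average anchors or parallel word occurrences).
    """
    mu_s = X_s.mean(axis=0) if normalise else np.zeros(X_s.shape[1])
    mu_t = X_t.mean(axis=0) if normalise else np.zeros(X_t.shape[1])
    Xs_hat, Xt_hat = X_s - mu_s, X_t - mu_t   # Step 1: mean-centering (Eq. 4)
    # Closed-form Procrustes solution (Eqs. 2-3); Xt_hat.T @ Xs_hat equals
    # X_t X_s^T in the paper's column-wise convention.
    U, _, Vt = np.linalg.svd(Xt_hat.T @ Xs_hat)
    W = U @ Vt
    return W, mu_s, mu_t

def project_to_target(e_s, W, mu_s, mu_t):
    """Step 2: re-center, rotate, then shift into the target subspace (Eq. 5)."""
    return W @ (e_s - mu_s) + mu_t
```

With normalise=False this reduces to plain rotation alignment; with normalise=True the centroids mu_s and mu_t must be kept and applied again at projection time.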
In this section we describe how existing work utilises two different cross-lingual signals, bilingual dictionaries and parallel corpora, to supervise rotation alignment. Additionally, we analyse the advantages and disadvantages of the two choices.
In order to utilise a bilingual dictionary to supervise the embedding alignment, each word in the dictionary needs to have a single representation. However, the same word can have many representations in the contextualised language model vector space, depending on the context it occurs in. Schuster et al. (2019b) observe that the contextual embeddings of the same word form a tight cluster, a word cloud, whose centroid is distinct and separable for individual words. They further propose that the centroid of a word cloud can be considered the context-independent representation of the word, called its average word anchor. These word anchors are computed by averaging embeddings over all occurrences of a word in a monolingual corpus, where the word occurs in a variety of contexts. Formally, the mBERT embedding of a source language word s_m in context c_h is denoted e_{s_m, c_h}. If this word occurs a total of p times in the monolingual corpus, that is in contexts c_1, c_2, ..., c_p, the anchor embedding A_{s_m} for word s_m across all the contexts is the average:

\[ A_{s_m} = \sum_{h=1}^{p} \frac{e_{s_m, c_h}}{p} \quad (6) \]

The average word anchor pairs (A^i_{s_m}, A^i_{t_{m*}}), where i is the mBERT layer, for all word pairs (s_m, t_{m*}) from the dictionary form the columns of matrices X^i_s and X^i_t respectively, which are then used to solve Eq. (2) and Eq. (3), resulting in an alignment transformation matrix W^i_{s→t}.

However, there are limitations to this approach. Zhang et al. (2019) found that the word cloud of a multi-sense word, such as "bank", which can mean either the financial institution or the edge of a river depending on the context, is further composed of clearly separable clusters, one for every word sense. Averaging over multiple contextual embeddings therefore loses a certain degree of contextual information on both the source and the target side. Figure 1a visualises the word anchor calculation and also highlights this limitation; a short code sketch of the anchor computation is given at the end of this section. On the other hand, one advantage of this method is that bilingual dictionaries are available even for very low-resource languages.

Word-aligned parallel sentences can also be utilised as a source of cross-lingual signal to align contextual embeddings (Aldarmaki and Diab, 2019; Wang et al., 2019). Given a parallel corpus, let s_m and t_{m*} be aligned source and target language words appearing in contexts c_h and c_{h*}, respectively. The parallel word embedding matrices X^i_s and X^i_t for mBERT layer i are composed from the contextual embeddings e^i_{s_m, c_h} and e^i_{t_{m*}, c_{h*}} respectively, and are used to solve Eq. (2) and Eq. (3) to derive an alignment transformation matrix W^i_{s→t}.

Figures 1a and 1b illustrate how parallel supervision is better suited to aligning contextual embeddings than dictionary supervision, where multiple senses of a word are compressed into a single word anchor. However, parallel corpora rarely come with word-alignment annotations; these are often generated automatically by off-the-shelf tools such as fast_align (Dyer et al., 2013), which can be noisy. It is worth noting that the word alignment error rate of an off-the-shelf tool drops as the number of parallel sentences increases, so parallel corpus supervision is favourable for languages where more parallel data is available.
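As an illustration of the dictionary-based supervision described above, the sketch below computes average word anchors (Eq. 6) from a monolingual corpus. The embed_words helper is a hypothetical function, not part of the paper, assumed to return one layer-i mBERT vector per word of a sentence (for instance by averaging subword vectors).

```python
from collections import defaultdict
import numpy as np

def average_word_anchors(sentences, embed_words, layer=9):
    """sentences: iterable of tokenised sentences (lists of word strings).
    Returns a dict mapping each word to its average anchor embedding (Eq. 6)."""
    sums, counts = defaultdict(lambda: 0.0), defaultdict(int)
    for sentence in sentences:
        for word, vec in zip(sentence, embed_words(sentence, layer)):
            sums[word] = sums[word] + vec   # accumulate contextual embeddings
            counts[word] += 1
    return {word: sums[word] / counts[word] for word in sums}
```

Anchors computed this way for both languages, paired through the bilingual dictionary, would fill the rows of X_s and X_t in the rotation sketch above; under parallel-corpus supervision those rows would instead hold the contextual embeddings of word-aligned occurrences.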
Rotation alignment makes a strong assumption that the two language spaces (or sub-spaces, in the case of mBERT) are approximately isometric (Søgaard et al., 2018). Patra et al. (2019) reported that the geometry of language embeddings becomes dissimilar for distant languages, and the isometry assumption degrades the alignment performance in such cases. In addition, as explained in Section 2.1, rotation alignment alone cannot achieve an effective mapping when two language spaces have separate centroids. Therefore, we next consider existing work that aligns two language spaces non-linearly.

Cao et al. (2020) proposed to directly align languages within the mBERT model through fine-tuning.
Figure 1: (a) Alignment with dictionary; (b) alignment with parallel sentences. In Figure 1a, contextual embeddings of the word "bank" are averaged across all word senses, denoted by different colors, into a single word anchor embedding. Figure 1b illustrates supervision from parallel corpora, where word alignments correspond to translations in similar contexts, denoted by similar colors (lighter for English); this provides more fine-grained supervision for contextualised alignment of mBERT.
The objective of the fine-tuning is to minimise the distance between the two contextual representations of an aligned word pair from parallel corpora:

\[ L^{i}_{align} = \min \sum_{m, m^{*}} \lVert e^{i}_{s_m} - e^{i}_{t_{m^{*}}} \rVert \quad (7) \]

However, fine-tuning with only the above objective would lose the semantic information in mBERT learnt during pre-training, since a trivial solution to Eq. (7) is simply to make all embeddings equal. To deal with this, Cao et al. (2020) also proposed a regularisation loss that does not allow the embedding of a source language word to stray too far from its original location \bar{e}^{i}_{s_m} in the pre-trained mBERT model, namely:

\[ L^{i}_{regularise} = \min \sum_{m} \lVert \bar{e}^{i}_{s_m} - e^{i}_{s_m} \rVert \quad (8) \]

Note that \bar{e}^{i}_{s_m} is generated from a copy of the original pre-trained mBERT model whose parameters are kept frozen. The alignment and regularisation losses are combined and jointly optimised in order to align the two language subspaces while maintaining the informativity of the embeddings:

\[ L_{finetune} = \min \sum_{i = n_s}^{n_e} \left( L^{i}_{align} + L^{i}_{regularise} \right) \quad (9) \]

Here n_s to n_e is the range of mBERT layers that are aligned. We experimented with two variants of the fine-tuning approach: 1) moving the target language towards the source language while keeping the source embeddings approximately fixed through the regularisation term in Eq. (8); 2) moving the source language embeddings towards the target space while keeping the target language space relatively fixed, in which case the regularisation loss becomes:

\[ L^{i}_{regularise} = \min \sum_{m^{*}} \lVert \bar{e}^{i}_{t_{m^{*}}} - e^{i}_{t_{m^{*}}} \rVert \quad (10) \]
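The combined objective can be sketched in PyTorch as below. This is an illustration of the loss only, not the authors' implementation: it operates on pre-extracted per-layer embeddings of aligned word pairs rather than the full mBERT forward pass, and it shows the src→tgt variant (Eq. 7 plus the Eq. 10 regularisation on the target side).

```python
import torch

def finetune_alignment_loss(src_layers, tgt_layers, tgt_layers_frozen):
    """src_layers, tgt_layers: lists of (n_pairs, d) tensors from the mBERT being
    tuned, one per aligned layer n_s..n_e; rows are embeddings of aligned word
    pairs. tgt_layers_frozen: matching tensors from a frozen copy of the
    pre-trained model, used as regularisation targets."""
    loss = torch.zeros(())
    for e_s, e_t, e_t_bar in zip(src_layers, tgt_layers, tgt_layers_frozen):
        align = (e_s - e_t).norm(dim=-1).sum()                    # Eq. (7)
        regularise = (e_t_bar.detach() - e_t).norm(dim=-1).sum()  # Eq. (10), src->tgt variant
        loss = loss + align + regularise                          # summed over layers, Eq. (9)
    return loss
```

For the tgt→src variant, the regularisation term is instead computed between the tuned and frozen source embeddings (Eq. 8), keeping the source subspace approximately fixed while the target embeddings move.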
In this section, we first describe the resources and implementation details of the alignment methods, followed by the zero-shot NER and SF tasks used to evaluate the alignments. In addition, we briefly explain the datasets used in the experiments.

Our baseline model is a pre-trained mBERT (available at https://github.com/google-research/bert/blob/master/multilingual.md) with 12 transformer layers, 12 attention heads and 768 hidden dimensions, denoted mBERT Baseline. When a word is tokenised into multiple subwords by the tokeniser, we average the corresponding subword embeddings to obtain an embedding for the word. Following Wang et al. (2019), we collect 30k parallel sentences for each of the language pairs from publicly available parallel corpora. For the European languages, German, Italian, Spanish and Dutch, the Europarl corpus (Koehn, 2005) is used; for Hindi, Turkish and Thai, the OpenSubtitles corpus (Lison and Tiedemann, 2016) is used; for Armenian, the parallel sentences are extracted from the QED Corpus (Abdelali et al., 2014). We obtain contextual and average anchor embeddings, as described in Section 2.2.1, by passing the corpora described above through pre-trained mBERT.

We use the bilingual dictionaries provided with the MUSE framework (Lample et al., 2018) as the source of dictionary supervision. As for the parallel corpus supervision, since none of the collected parallel sentences contains word-level alignment information, we utilise fast_align (Dyer et al., 2013) to automatically derive word alignment signals.
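The subword-averaging step mentioned above can be sketched as follows. The word_spans argument (the range of subword positions belonging to each word) would come from the tokeniser, which is omitted here, so this only illustrates the pooling itself.

```python
import numpy as np

def pool_subwords_to_words(subword_embs, word_spans):
    """subword_embs: (n_subwords, d) array of one mBERT layer's outputs for a
    sentence. word_spans: list of (start, end) subword index ranges, one per word.
    Returns an (n_words, d) array with one averaged embedding per word."""
    return np.stack([subword_embs[start:end].mean(axis=0) for start, end in word_spans])
```

Word embeddings pooled this way, paired either through the fast_align output or through dictionary anchors, populate the matrices used by the alignment methods of Sections 2 and 3.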
For the rotation alignment, we compute four independent transformation matrices, one for each of the last four transformer layers, similar to Wang et al. (2019). We use RotateAlign and NormRotateAlign to refer to the rotation alignment learnt without and with the proposed language-centering normalisation, respectively. To be consistent, for the fine-tuning alignment we align the word representations in the last four transformer layers of the mBERT model, denoted FineTuneAlign.

We evaluate the learnt alignments using two downstream tasks: Named Entity Recognition (NER) and Semantic Slot Filling (SF), both of which aim to predict a label for each token in a sentence. NER is a more structural task with fewer entity types and involves less semantic understanding of the context compared to SF. Examples of the tasks can be found in Table 2.

We use the same model architecture and hyper-parameters as Wang et al. (2019), two BiLSTM layers followed by a CRF layer, with the learning rate set to different values for the European languages and for the other languages, determined on the validation set. In order to measure the effectiveness of a learnt alignment, all experiments are conducted in the zero-shot setting, similar to Wang et al. (2019): the source language data is first transformed to the target language space and then used to train a BiLSTM-CRF model. The target language validation set is used for hyper-parameter tuning and for reporting the evaluation results. For each experiment we report F1 scores averaged over 5 runs.

We use the following four families of datasets, where all datasets within a family share the same set of labels. A summary of the datasets can be found in Table 1. Example utterances and annotations are shown in Table 2.
CoNLL-NER: This includes the CoNLL 2002 and 2003 NER benchmark tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), containing entity annotations for news articles in English, German, Spanish and Dutch. We also include in this family PioNER (Ghukasyan et al., 2018), a manually annotated dataset in Armenian, which is typographically different from the other languages in this family (PioNER only has PER, LOC and ORG labels and does not contain MISC). In this dataset family, target language data is sourced from local news articles and is not generated through translation from the source data.
ATIS-SF: The ATIS Corpus (Price, 1990) is an English dataset containing conversational queries about flight booking. Upadhyay et al. (2018) manually translated a subset of the data into two languages, Turkish and Hindi, along with crowd-sourced phrase-level annotations.
FB-SF: Schuster et al. (2019a) introduced the Multilingual Task-Oriented Dialog Corpus in English, Spanish and Thai across three domains: weather, alarm and reminders, where the Spanish and Thai data were manually translated and annotated from a subset of the English data.
SNIPS-SF: A multi-domain slot-filling dataset in English released by Coucke et al. (2018). Bellomaria et al. (2019) automatically translated this dataset into Italian and then manually labelled the translation, where entities were substituted by Italian entities collected from the Web.
The evaluation results of each alignment method on the downstream NER and SF tasks are reported in Table 3 and Figure 2. In addition to the mBERT Baseline, and for comparison purposes, we also list relevant results found in the literature (Wu and Dredze, 2019; Wang et al., 2019; Upadhyay et al., 2018; Schuster et al., 2019a; Bellomaria et al., 2019) that have been evaluated on the same datasets.

mBERT Baseline numbers can be indicative of how well languages are already aligned in the mBERT space. High zero-shot scores for German, Dutch, Spanish and Italian indicate that European languages are extremely well aligned to English in mBERT. However, distant languages such as Thai and Turkish, which belong to different language families (Kra-Dai and Turkic) than English, have poor alignment, with low F1 scores of 9.58 and 21.15, respectively. Finally, moderately distant languages such as Armenian and Hindi, which fall within the larger Indo-European language family, have moderate alignment with English, with scores of 62.38 and 50.84, respectively.

Table 1: Summary of NER and SF dataset families. English, marked in bold, is treated as the source language.
CoNLL-NER: [U.N.]ORG official [Ekeus]PER heads for [Baghdad]LOC .
ATIS-SF: show the [latest]flight_mod flight from [denver]fromloc.city_name to [boston]toloc.city_name
FB-SF: do you have [wednesday's]datetime [weather forecast]weather_noun for [half moon bay]location
SNIPS-SF: add this [track]music_item to [my]playlist_owner [global funk]playlist
Table 2: Examples from the datasets.
RotateAlign improves performance by 19% absolute for ATIS-Turkish, going from a baseline of 21.15 to 38.18 F1. For ATIS-Hindi the performance improves from 50.84 to 57.86 F1 (7 points), and for PioNER-Armenian by 4% absolute, from 62.38 to 66.56. These numbers show how RotateAlign can improve performance over the mBERT Baseline for moderately close languages such as Hindi, Turkish and Armenian, while there is only around a 1 point improvement for European languages. This implies that the Hindi, Turkish and Armenian subspaces are geometrically similar to English, but are misaligned in terms of rotation in the mBERT Baseline. However, in the case of Thai, which is a distant language from English, RotateAlign does not improve performance over the mBERT Baseline. This suggests that the Thai and English embedding spaces are structurally dissimilar.
Applying the proposed language-centering normalisation from Section 2.1 before performing the rotation alignment, namely NormRotateAlign in Table 3, is found to further improve downstream performance across all tasks and languages. The improvement over RotateAlign is up to 3% absolute F1 for Thai, around 1% absolute for moderately close languages like Hindi, Turkish and Armenian, and around 0.5% absolute F1 for closer target languages such as German. Note that Thai, which does not benefit from rotation alignment alone, improves by an average of 2.3 points after applying the normalisation. These results corroborate that language families that are further away from each other have more separable sub-spaces in the mBERT Baseline, and that bringing the language distributions closer helps the downstream task's performance.
Amongst the cases where RotateAlign improves performance over the mBERT Baseline, parallel-corpus supervised RotateAlign is superior to dictionary supervision, with the exception of Hindi. This could be explained by the fact that word anchors collapse multiple word senses, so the cross-lingual signal is poorer compared to parallel word alignments. This is in line with observations from Zhang et al. (2019).
From Table 3 and Figure 2 we can see that FineTuneAlign, explained in Section 3, improves performance over RotateAlign for the semantic tasks (SF), with the only exception of ATIS-Hindi. On the other hand, FineTuneAlign underperforms RotateAlign for the structural tasks (NER), and in some cases even falls behind the mBERT Baseline. Note that we observe no clear trend between FineTuneAlign^src→tgt and FineTuneAlign^tgt→src.

Dataset-Task                                                  CoNLL-NER                     ATIS-SF        FB-SF          SNIPS-SF
Transfer pair                                                 en→de  en→nl  en→es  en→hy    en→hi  en→tk   en→es  en→th   en→it
Baselines from literature
mBERT (Wu and Dredze, 2019)                                   69.56  77.75  74.96  -        -      -       -      -       -
mBERT Rotation Alignment: Parallel (Wang et al., 2019)        70.54  79.03  75.77  -        -      -       -      -       -
BERT, 1400 Target Language Train (Bellomaria et al., 2019)†   -      -      -      -        -      -       -      -       83.04
Non-contextual Zero-shot Baseline (Upadhyay et al., 2018)∗    -      -      -      -        ∼40    ∼40     -      -       -
Translate train (Schuster et al., 2019a)‡                     -      -      -      -        -      -       72.87  55.43   -
Our experiments
mBERT Baseline                                                66.15  77.55  74.80  62.38    50.84  21.15   74.66  9.58    76.70

Table 3: Performance (F1 score) of the alignment methods on the zero-shot NER and SF tasks. Top scores within our experiments are marked in bold. No results are reported for Armenian dictionary alignments since no English-Armenian dictionary was available in the MUSE framework. † Bellomaria et al. (2019) use 1400 Italian instances as part of the training data. ∗ Numbers read from a chart in the paper. ‡ Schuster et al. (2019a) use a machine translation model to translate this dataset and word alignments generated from attention weights to infer annotations.
FineTuneAlign^src→tgt improves over the best rotation alignment, NormRotateAlign^parallel, by 7.8% absolute for the ATIS-Turkish task, from 38.18 to 45.98, and significantly outperforms the mBERT Baseline by 24 points. For FB-Thai, FineTuneAlign^src→tgt surpasses NormRotateAlign^dict by 8.39% absolute F1, from 12.38 to 20.77, which is 11 points higher than the mBERT Baseline. For FB-Spanish we observe an improvement from 74.73 to 80.90 (6% absolute) compared to RotateAlign, and similarly +6 points compared to the mBERT Baseline. For SNIPS-Italian, FineTuneAlign improves performance over NormRotateAlign from 77.87 to 80.21 (2.5 points) and is 3.5 points better than the mBERT Baseline.

All the SF tasks considered are generated by translation from the source language data. This may indicate that the fine-tuning approach performs better than rotation-based methods for translated datasets, where there is a high correlation between the utterance structure of the training data in the source language and the evaluation data in the target language. On the other hand, rotation-based alignments generalise better when the downstream target sentence distribution is dissimilar from the source sentence distribution, as is the case for the non-translated NER tasks.
FineTuneAlign^src→tgt achieves a top F1 score of 80.21 on the SNIPS-Italian dataset, which is not far from the score of 83 obtained by a BERT-based model trained on 1400 manually annotated Italian utterances (Bellomaria et al., 2019). Also, our best alignment score of 80.90 for FB-Spanish (FineTuneAlign^src→tgt) surpasses the translate-train baseline (Schuster et al., 2019a), where the annotations are automatically inferred from an NMT model. This suggests that for closer target languages, fine-tuning based alignment is not far behind unaligned models trained on additional target language labelled examples.

The performance improvement from fine-tuning alignment on translated datasets should not be attributed to superficial transfer of entity information from the source language. Evidence to support this claim is the strong performance on the SNIPS Italian-SF dataset, which has been translated from the SNIPS dataset (Bellomaria et al., 2019) with English entities replaced by Italian entities collected from the Web during dataset preparation. Therefore, during validation, the model came across utterances with similar structure but different entities, which shows that the improvement from fine-tuning alignment is largely independent of language-specific entity memorisation.

Figure 2: Trend of improvement from various alignment methods. Rotation alignment improves performance for NER, while fine-tuning alignment is found to be better for the SF tasks. Improvements increase initially with the distance between source and target languages and diminish for distant languages.

Aldarmaki and Diab (2019) propose to align ELMo embeddings (Peters et al., 2018) with word-level and sentence-level alignments. They compare the aligned ELMo with static character-level embeddings under similar alignments. Cao et al. (2020) originally proposed fine-tuning alignment of mBERT language sub-spaces. They claim these methods are strictly stronger than rotation alignment methods, based solely on zero-shot experimentation on the XNLI task (Conneau et al., 2018), a semantic sentence-level classification task generated through translation from the source language. On the contrary, we observe that fine-tuning does not improve performance across all tasks, particularly structural tasks, where the utterance structure changes and there is a higher incidence of domain shift. This raises the question whether translated datasets are biased towards fine-tuning alignment, and whether such datasets are a good evaluation test-bed for general cross-lingual transfer.

Wang et al. (2019) apply rotational alignment to mBERT and report results on the CoNLL NER tasks; however, the main focus of their work is on the overlap of static bilingual embeddings, and they do not extend a similar analysis to contextualised embeddings. In our work, drawing on the observations made by Libovický et al. (2019) about the distribution of languages in the mBERT space, we propose a normalisation mechanism to increase the overlap of two languages' distributions prior to computing the rotational alignment.

Schuster et al. (2019b) originally proposed dictionary supervision to align ELMo with a rotational transform. They claim supervision from a dictionary is superior to using parallel word-aligned corpora; however, they do not substantiate this through comparative experiments. We observe that parallel corpus supervision is stronger than dictionary supervision, possibly because it provides contextualised alignment.
In this paper, we investigate cross-lingual alignment methods for multilingual BERT. We empirically evaluate their effect on zero-shot transfer for downstream tasks of two types: structural NER and semantic Slot-filling, across a set of diverse languages. Specifically, we compare rotation alignment and fine-tuning cross-lingual alignment, and we compare the effect of dictionary and parallel corpora supervision across all tasks. We also propose a novel normalisation technique, motivated by how languages are distributed across the mBERT space, that improves state-of-the-art performance on zero-shot NER and Semantic Slot-filling downstream tasks. Our experimental settings cover four dataset families (one for NER and three for SF) across eight language pairs.

Key findings of this paper are as follows: (1) rotation-based alignments show large performance improvements (up to +19% absolute for Turkish ATIS-SF) on moderately close languages, only a small improvement for very close target languages and no improvement for very distant languages; (2) we propose a novel normalisation which centers language distributions prior to learning rotation maps and is consistently shown to improve rotation alignment across all tasks, particularly for Thai, by up to 3% absolute; (3) rotational alignments are more robust and generalise well for structural tasks such as NER, which may have higher utterance variability and domain shift; (4) supervision from parallel corpora generally leads to better alignment than dictionary-based supervision, since it offers the possibility of generating contextualised alignments; (5) fine-tuning alignment improves performance for semantic tasks such as slot-filling, where the source language data has minimal shift in utterance structure or domain from the target language data, and particularly improves performance for extremely distant languages (up to +8.39% absolute for Thai FB-SF) compared to rotation alignment; (6) for close languages and tasks with similar utterance structure, zero-shot fine-tuning alignment is competitive with unaligned models trained on additional annotated data in the target language.

This work aims to pave the way for optimising language transfer capability in contextual multilingual models. In the future, we would like to further investigate patterns in the embedding space and apply alignment methods to specific regions of the multilingual hyperspace to obtain more tailor-suited alignments between language pairs. We would also like to evaluate the zero-shot capabilities of alignments when applied to other language tasks.
References
Ahmed Abdelali, Francisco Guzman, Hassan Sajjad, and Stephan Vogel. 2014. The AMARA Corpus: Building Parallel Language Resources for the Educational Domain. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1856–1862, Reykjavik, Iceland. European Language Resources Association (ELRA).
Wasi Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng. 2019. On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2440–2452, Minneapolis, Minnesota. Association for Computational Linguistics.
Hanan Aldarmaki and Mona Diab. 2019. Context-Aware Cross-Lingual Mapping. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3906–3911, Minneapolis, Minnesota. Association for Computational Linguistics.
Valentina Bellomaria, Giuseppe Castellucci, Andrea Favalli, and Raniero Romagnoli. 2019. Almawave-SLU: A new dataset for SLU in Italian.
Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Multilingual alignment of contextual word representations. arXiv preprint arXiv:2002.03518.
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating Cross-lingual Sentence Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.
Tsolak Ghukasyan, Garnik Davtyan, Karen Avetisyan, and Ivan Andrianov. 2018. pioNER: Datasets and Baselines for Armenian Named Entity Recognition, pages 56–61.
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. CoRR, abs/2003.11080.
Kaliyaperumal Karthikeyan, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-Lingual Ability of Multilingual BERT: An Empirical Study. ArXiv, abs/1912.07840.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.
Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data.
Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2019. How Language-Neutral is Multilingual BERT?
Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2015: Extracting Large Parallel Corpora from Movie and TV Subtitles. In International Conference on Language Resources and Evaluation.
Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting Similarities among Languages for Machine Translation. CoRR, abs/1309.4168.
Barun Patra, Joel Ruben Antony Moniz, Sarthak Garg, Matthew R. Gormley, and Graham Neubig. 2019. Bilingual Lexicon Induction with Semi-supervision in Non-Isometric Embedding Spaces. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 184–193, Florence, Italy. Association for Computational Linguistics.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
P. J. Price. 1990. Evaluation of Spoken Language Systems: the ATIS Domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2019a. Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3795–3805, Minneapolis, Minnesota. Association for Computational Linguistics.
Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019b. Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613, Minneapolis, Minnesota. Association for Computational Linguistics.
Anders Søgaard, Sebastian Ruder, and Ivan Vulić. 2018. On the Limitations of Unsupervised Bilingual Dictionary Induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 778–788.
Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
S. Upadhyay, M. Faruqui, G. Tür, D. Hakkani-Tür, and L. Heck. 2018. (Almost) Zero-Shot Cross-Lingual Spoken Language Understanding, pages 6034–6038.
Zirui Wang, Jiateng Xie, Ruochen Xu, Yiming Yang, Graham Neubig, and Jaime Carbonell. 2019. Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple Unified Framework.
Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.
Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. In