Analyzing Zero-shot Cross-lingual Transfer in Supervised NLP Tasks
Hyunjin Choi, Judong Kim, Seongho Joe, Seungjai Min, Youngjune Gwon
Samsung SDS
Abstract—In zero-shot cross-lingual transfer, a supervised NLP task trained on a corpus in one language is directly applicable to another language without any additional training. A source of cross-lingual transfer can be as straightforward as lexical overlap between languages (e.g., use of the same scripts, shared subwords) that naturally forces text embeddings to occupy a similar representation space. Recently introduced cross-lingual language model (XLM) pretraining brings out neural parameter sharing in Transformer-style networks as the most important factor for the transfer. In this paper, we aim to validate the hypothetically strong cross-lingual transfer properties induced by XLM pretraining. Particularly, we take XLM-RoBERTa (XLM-R) in our experiments that extend semantic textual similarity (STS), SQuAD and KorQuAD for machine reading comprehension, sentiment analysis, and alignment of sentence embeddings under various cross-lingual settings. Our results indicate that the presence of cross-lingual transfer is most pronounced in STS, sentiment analysis next, and MRC last. That is, the complexity of a downstream task softens the degree of cross-lingual transfer. All of our results are empirically observed and measured, and we make our code and data publicly available.
I. INTRODUCTION
Pretraining language models at a large scale has dramatically improved natural language understanding. According to a comprehensive analysis [1] on the limitations in pretraining a multilingual model, more languages lead to better cross-lingual performance for low-resource languages only up to a certain point as the number of languages increases. The phenomenon is dubbed the curse of multilinguality, which can only be freed up by scaling up the model size.

Recent experimental results show that multilingual models can outperform their monolingual counterparts. For a low-resource language that lacks labeled examples, such results are an encouraging breakthrough for building NLP applications. In cross-lingual language understanding, XLM by Conneau & Lample [2], despite being pretrained by only masked language modeling (MLM), has reported the state of the art on downstream benchmarks. Shared lexical features (e.g., subwords, scripts, anchor points) across languages have been suspected as the primary source of learning the language-independent representations that lead to cross-lingual transfer. Recent studies, however, show that parameter sharing induced by the Transformer architecture is instead the most attributable factor for the transfer.

We are motivated by this progress in language modeling. This work focuses on empirical analysis of cross-lingual transfer in supervised NLP tasks fine-tuned over XLM. In particular, we are interested in zero-shot transfer settings where no additional training is done using target-language examples after fine-tuning in the source language.
We experiment with XLM-RoBERTa (XLM-R) [1], a large XLM model with 550 million parameters and a 250k vocabulary size, by extending semantic textual similarity, SQuAD [3] & KorQuAD [4] question answering, and sentiment classification to various cross-lingual settings.

Finally, beyond previous work that has attempted to align word embeddings across different languages [5], we compute a projection that directly maps sentence embeddings of one language to those of another. We then analyze the effect of fine-grained alignment of sentences across different languages on the quality of zero-shot cross-lingual transfer, manifested through the aforementioned NLP task performances measured empirically.

We make the following contributions.
• We provide rigorous results on cross-lingual transfer present in three important supervised NLP tasks that require high-level natural language understanding, namely STS, MRC, and sentiment classification.
• We propose to directly compute a cross-lingual mapping that aligns sentence embeddings of different languages, whereas previous work has focused on word-level embeddings.
• We furthermore show benefits of the fine-grained cross-lingual sentence alignment that enables directly comparing sentences from different languages for sentence-pair regression tasks.

The rest of this paper is organized as follows. In Section II, we describe our approach by presenting the zero-shot cross-lingual evaluation framework. Section III discusses our experimental methodology and empirical results. Section IV concludes the paper.

II. OUR APPROACH
XLM pretraining is known to effectively promote cross-lingual transfer, where a supervised model fine-tuned in one language is applied to another without additional training.
A. Zero-shot Cross-lingual Evaluation Framework
We propose a simple approach to transfer a supervised model learned in one language to another for zero-shot cross-lingual evaluation, as illustrated in Fig. 1. First, we place a pretrained XLM; in our case, the 550 million-parameter XLM-RoBERTa (XLM-R) [1] trained on 100 languages is used. We then fine-tune XLM-R for a downstream task using labeled data in language A. Lastly, we evaluate the fine-tuned downstream task in both languages A and B. Note that running a test set from language B through the fine-tuned task evaluates zero-shot cross-lingual transfer.

Fig. 1. Zero-shot cross-lingual evaluation (1. cross-lingual model pretraining on 100 languages; 2. fine-tune the pretrained XLM model on a supervised task in language A; 3. test the supervised task on language B).
B. Sentence Embedding and Pair Modeling
Transformer [6] models such as BERT produce contextualized representations that are central to building a high-performance downstream task. XLM-R is a BERT variant whose output constitutes token embeddings (up to 512 token vectors of 768 dimensions each) for a given input. To produce the fixed-size sentence embeddings necessary for a task like semantic textual similarity (STS), we average the token embedding output to obtain a single 768-dimensional pooled vector.

For text regression (or classification), one learns a function that maps sentence embeddings to a target value. Sentence-pair modeling gives an important primitive that underlies supervised NLP tasks such as STS. We adopt the siamese network architecture of Sentence-BERT [7] that avoids the combinatorial explosion of forming sentence pairs. Fig. 2 depicts our sentence-pair modeling for STS.
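The average pooling described above can be sketched in a few lines; a minimal NumPy sketch in which a toy hidden size of 3 stands in for XLM-R's 768 dimensions and `mean_pool` is an illustrative helper name (padding positions are masked out before averaging, a common convention):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors into one fixed-size sentence embedding.

    token_embeddings: (seq_len, hidden) array, e.g. one encoder output
    attention_mask:   (seq_len,) 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # (hidden,)
    counts = mask.sum()                             # number of real tokens
    return summed / counts

# Toy example: 4 token vectors (the last one is padding), hidden size 3
tokens = np.array([[1., 2., 3.],
                   [3., 2., 1.],
                   [2., 2., 2.],
                   [9., 9., 9.]])  # padding row, ignored by the mask
mask = np.array([1, 1, 1, 0])
emb = mean_pool(tokens, mask)      # -> array([2., 2., 2.])
```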
Fig. 2. Siamese net for sentence-pair modeling (two XLM-R encoders with average pooling produce sentence embeddings s_A and s_B; their cosine similarity is trained against the gold score with an MSE loss).
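The siamese objective of Fig. 2 can be sketched as follows. The rescaling of gold STS scores from [0, 5] to [0, 1] is an assumption (a common convention in Sentence-BERT-style training, not stated explicitly here), and the function names are illustrative:

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sts_mse_loss(emb_a, emb_b, gold_scores):
    """MSE between the cosine similarity of pooled sentence embeddings
    and gold STS scores rescaled from [0, 5] to [0, 1] (an assumption)."""
    preds = np.array([cosine_sim(a, b) for a, b in zip(emb_a, emb_b)])
    target = np.asarray(gold_scores) / 5.0
    return float(np.mean((preds - target) ** 2))

# Two toy sentence pairs: an identical pair and an orthogonal pair
emb_a = np.array([[1.0, 0.0], [1.0, 1.0]])
emb_b = np.array([[1.0, 0.0], [-1.0, 1.0]])
loss = sts_mse_loss(emb_a, emb_b, gold_scores=[5.0, 0.0])  # -> 0.0
```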
C. Cross-lingual Mapping for Fine-grained Alignment of Sentence Embeddings
Cross-lingual mapping for word embeddings has been widely studied. Because context awareness is the key to language understanding, learning cross-lingual mappings as sentence-level transformations can be valuable. A sentence is less ambiguous than its words, since words must be interpreted within a specific context.

We learn cross-lingual sentence mappings directly from sentence-pair examples. Note that sentence embeddings produced from contextualized cross-lingual word embeddings would imply only loosely aligned sentences. Similar to the projection-based cross-lingual word embedding framework [5], [8], we use linear-algebraic methods to compute a projection matrix that achieves fine-grained alignment of sentence embeddings across different languages. We also use a single-layer neural net that can iteratively learn the same projection by gradient descent.
System of least squares via normal equation.

Suppose languages A and B are the source and the target languages of the projection Φ. We seek the solution to the problem S_A Φ = S_B with

S_A = [s_A^(1) s_A^(2) ... s_A^(n)]^⊤, S_B = [s_B^(1) s_B^(2) ... s_B^(n)]^⊤,
s_A^(i) = [a_1^(i) a_2^(i) ... a_d^(i)]^⊤, s_B^(i) = [b_1^(i) b_2^(i) ... b_d^(i)]^⊤   (1)

where S_A and S_B are datasets that contain n sentence embeddings for languages A and B, with each sentence embedding s ∈ ℝ^d. With Φ = [φ^(1) φ^(2) ... φ^(j) ... φ^(d)], whose element φ^(j) ∈ ℝ^d is a column vector, each S_A φ^(j) = [b_j^(1) b_j^(2) ... b_j^(n)]^⊤ gives a least-squares problem. Since j = 1, ..., d, we have a system of d least-squares problems that can be solved linear-algebraically via the normal equation: Φ* = (S_A^⊤ S_A)^(−1) S_A^⊤ S_B.

Solving the Procrustes problem.

Given two data matrices, a source S_A and a target S_B, the orthogonal Procrustes problem [9] describes a matrix approximation searching for an orthogonal projection that most closely maps S_A to S_B. Formally, we write

Ψ* = argmin_Ψ ‖S_A Ψ − S_B‖_F  s.t.  Ψ^⊤ Ψ = I   (2)

The solution to Eq. (2) has the closed form Ψ* = UV^⊤ with UΣV^⊤ = SVD(S_A^⊤ S_B), where SVD denotes the singular value decomposition.

Fully-connected single-layer neural net with linearly activated neurons.

Contrasted to the linear-algebraic solutions Φ* and Ψ*, a neural net can be used to compute the projection matrix iteratively via gradient descent. We consider a fully-connected single-layer neural net with linear activation functions, as illustrated in Fig. 3. We use the neural net as an array of linear regressors with mean squared error (MSE) objectives

S_A W = S_B (feedforward)  s.t.  ‖S_A w^(j) − [b_j^(1) b_j^(2) ... b_j^(n)]^⊤‖ < ε  ∀ j   (3)

where W = [w^(1) w^(2) ... w^(j) ... w^(d)] contains the weight parameters of the neural net. Instead of a cross-entropy loss, we impose the MSE loss function to optimize each w^(j) by stochastic gradient descent.

Fig. 3. Fully-connected single-layer neural net with neurons having linear activation functions (inputs s_A^(i), outputs s_B^(i), weights W).
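Under the definitions above, all three solvers can be sketched on synthetic embeddings; a minimal NumPy sketch in which d = 16 stands in for the 768-dimensional sentence embeddings, and the target is an exactly linear image of the source so that all three methods are easy to check:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16                      # the paper's d is 768; a small d for illustration
S_A = rng.normal(size=(n, d))       # source-language sentence embeddings
W_true = rng.normal(size=(d, d))
S_B = S_A @ W_true                  # target embeddings, exactly linearly related here

# 1) Normal equation: Phi* = (S_A^T S_A)^(-1) S_A^T S_B
Phi = np.linalg.solve(S_A.T @ S_A, S_A.T @ S_B)

# 2) Orthogonal Procrustes: Psi* = U V^T where U Sigma V^T = SVD(S_A^T S_B)
U, _, Vt = np.linalg.svd(S_A.T @ S_B)
Psi = U @ Vt                        # orthogonal by construction

# 3) A single linear layer trained by mini-batch gradient descent on an MSE loss
W, lr, batch = np.zeros((d, d)), 0.05, 50
for epoch in range(500):
    for start in range(0, n, batch):
        sl = slice(start, start + batch)
        err = S_A[sl] @ W - S_B[sl]         # residual on the mini-batch
        W -= lr * S_A[sl].T @ err / batch   # gradient of the mean squared error
```

On this noiseless toy data the normal-equation solution and the iteratively trained weights both recover the generating matrix, while the Procrustes solution is the best orthogonal approximation to it.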
III. EXPERIMENTS
Throughout all our experiments, we use the pretrained XLM-RoBERTa (XLM-R) [1] downloaded from Hugging Face [10] unmodified, upon which we build supervised NLP tasks and fine-tune. We focus on experimenting with sentence-level representations and their cross-lingual transfer quality when used in a downstream task under zero-shot settings. There is a considerable amount of existing literature on evaluating the cross-lingual transfer quality of word representations, which we will not cover in this paper.

A. Semantic Textual Similarity (STS)
Task & dataset.
The first of our cross-lingual experiments are on the STS benchmark (STSb) [11], Korean STS (KorSTS) [12], SemEval-2017 Spanish, and SemEval-2017 Arabic. STSb is a set of English data originated for the STS task evaluations in the International Workshop on Semantic Evaluation (SemEval) [13]–[17] between 2012 and 2017. STSb is distributed as one of the four similarity and paraphrase tasks in the GLUE benchmark [18]. The STSb dataset includes 8,628 sentence pairs from image captions, news headlines, and user forums, partitioned into train (5,749), dev (1,500), and test (1,379) sets.

The STSb sentence pairs are labeled with a similarity score ranging from 0 to 5 that indicates how similar the sentences are in terms of semantic relatedness. KorSTS is a dataset translated from STSb and has exactly the same structure. SemEval-2017 Spanish and Arabic are evaluation sets from SemEval-2017 Task 1 [19], with 250 test pairs per language.
Fine-tuning.
We run the GLUE benchmark code as-is from Hugging Face to fine-tune STS tasks. This means that a text input to XLM-R is in the Sentence A [SEP] Sentence B format, which is the same as in pretraining. We use the Rectified Adam (RAdam) optimizer with a linear learning rate warm-up over 10% of the training data and a learning rate of ×10−. We have run 4 training epochs using a batch size of 32.

To evaluate zero-shot cross-lingual transfer, we fine-tune on the STSb train set and test using the STSb, KorSTS, SemEval-2017 Spanish, and SemEval-2017 Arabic test sets, and similarly for fine-tuning and testing on KorSTS. Furthermore, we carry out the following mixed instances: 1) fine-tune on STSb first and KorSTS next; 2) fine-tune on KorSTS first and STSb next; 3) fine-tune on sentence-pair examples uniformly drawn from STSb and KorSTS.

Results.
The upper portion of Table I reports the STS performances on zero-shot cross-lingual testing with 4 languages. We immediately find the presence of cross-lingual transfer strong for STS. When fine-tuned on English (the STSb train set), zero-shot testing with Korean results in a 1.24% decrease in Spearman's rank correlation. On the other hand, when fine-tuned using the KorSTS train set, zero-shot testing with English results in a 3.40% degradation.

For Spanish and Arabic, we observe better performance when fine-tuned on English. We find particularly low scores for Arabic and suspect that it is a relatively lower-resource language compared to the others. In fact, XLM-R uses 28.0GB of Arabic resources, while 54.2GB is used for Korean, 53.3GB for Spanish, and 300.8GB for English [1].

The lower portion of Table I shows how two-stage fine-tuning mixed with two different languages affects the performance in each language. Although the performance numbers are similar regardless of the fine-tuning order, the last language fine-tuned slightly outperforms the others.
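The quoted degradation percentages appear to be relative drops of the zero-shot score against the in-language baseline (fine-tune and test in the same language), read off the Spearman correlations of Table I; a sketch of that reading:

```python
# Relative drop of a zero-shot score against the in-language baseline,
# i.e. the score obtained by fine-tuning and testing in the same language.
def relative_drop(zero_shot: float, in_language: float) -> float:
    return 100.0 * (in_language - zero_shot) / in_language

# Fine-tuned on KorSTS, tested zero-shot on English (84.47) vs. the
# English-on-English baseline (87.44), Spearman correlations from Table I:
drop_en = relative_drop(zero_shot=84.47, in_language=87.44)
print(round(drop_en, 2))  # -> 3.4
```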
B. Machine Reading Comprehension (MRC)
Task & dataset.
Reading comprehension has been one of the most challenging tasks for machines, combining natural language understanding and generation with knowledge about the world. We use the Stanford Question Answering Dataset (SQuAD) [3], the Korean Question Answering Dataset (KorQuAD) [4], and Spanish SQuAD (SQuAD-es) [20] for the cross-lingual transfer evaluation of machine reading comprehension (MRC) tasks. Both SQuAD and KorQuAD consist of crowdsourced question-answer pairs from English and Korean Wikipedia articles, respectively. SQuAD-es is a translated dataset of SQuAD for Spanish.

Using SQuAD v1.1, KorQuAD v1.0, and SQuAD-es v1.1, we run the following eight cross-lingual MRC tasks. We prepare three copies of XLM-R and fine-tune them on 1) SQuAD, 2) KorQuAD, and 3) SQuAD-es for testing with the SQuAD (English), KorQuAD (Korean), and SQuAD-es (Spanish) dev sets. We then fine-tune cross-lingually again using 4) KorQuAD on the SQuAD fine-tuned XLM-R, 5) SQuAD-es on the SQuAD fine-tuned XLM-R, and 6) SQuAD on the KorQuAD fine-tuned XLM-R for another round of testing with the dev sets. Additionally, we fine-tune XLM-R with 7) a mixed set of SQuAD and KorQuAD and 8) a mixed set of SQuAD, KorQuAD, and SQuAD-es.

TABLE I. EVALUATION ON STS TASKS. NUMBERS REPRESENT THE SPEARMAN (PEARSON) CORRELATIONS IN PERCENTILE.

Fine-tuning Task(s)            English         Korean          Spanish         Arabic
Zero-shot
  STSb (English)               87.44 (87.43)   82.34 (82.27)   85.58 (87.02)   72.67 (70.54)
  KorSTS (Korean)              84.47 (84.40)   83.38 (83.16)   84.94 (85.00)   70.99 (69.66)
Mixed-language fine-tuning
  STSb → KorSTS                86.43 (86.47)   83.54 (83.42)   85.47 (86.05)   73.85 (73.39)
  KorSTS → STSb                88.33 (88.34)   85.12 (85.12)   86.77 (87.83)   73.37 (72.37)
  STSb + KorSTS                87.71 (87.84)   84.37 (84.48)   86.53 (86.99)   75.72 (75.22)

TABLE II. EVALUATION ON MRC TASKS. NUMBERS REPRESENT F1 SCORE, AND NUMBERS IN PARENTHESES ARE EXACT MATCHES.

Fine-tuning Task(s)            English         Korean          Spanish
Zero-shot
  SQuAD (English)              88.81 (81.68)   80.92 (45.08)   72.07 (53.18)
  KorQuAD (Korean)             72.03 (61.93)   89.58 (65.29)   58.65 (43.09)
  SQuAD-es (Spanish)           84.75 (74.51)   78.87 (42.76)   76.11 (59.68)
Mixed-language fine-tuning
  SQuAD → KorQuAD              85.81 (77.16)   90.17 (66.02)   70.54 (52.40)
  SQuAD → SQuAD-es             86.73 (76.78)   78.16 (36.87)   76.70 (59.87)
  KorQuAD → SQuAD              89.16 (82.20)   88.42 (62.83)   72.78 (53.92)
  SQuAD + KorQuAD              84.41 (75.93)   86.79 (62.45)   67.72 (48.49)
  SQuAD + KorQuAD + SQuAD-es   89.29 (81.98)   90.41 (66.36)   76.75 (59.66)
Fine-tuning.
We use the RAdam optimizer with a linear learning rate warm-up over 10% of the training data and a learning rate of ×10−. We have found that running just 3 training epochs with a batch size of 48 is sufficient.

Results.
The upper portion of Table II reports the cross-lingual MRC performance evaluated on the SQuAD, KorQuAD, and SQuAD-es dev sets. Fine-tuned on SQuAD, zero-shot testing with Korean and Spanish degrades 9.67% and 5.30% in F1 score, respectively. (Here, the compared baseline is the KorQuAD dev set tested on KorQuAD-fine-tuned XLM-R, and likewise for Spanish.) Fine-tuned on KorQuAD, however, zero-shot testing with English and Spanish degrades 18.89% and 22.94%, respectively. The results with SQuAD-es show 4.57% and 11.96% decreases for English and Korean. Compared to the performance on STS tasks, the degradation measured in F1 score and exact match is much larger for MRC tasks.

The lower portion of Table II reports the cross-lingual MRC performance for mixed-language fine-tuning cases. The results show a similar trend as in the STS tasks. In general, fine-tuning with an additional language seems to improve MRC performance regardless of the testing language. Fine-tuning with all three languages yields the best MRC performance, as shown in the last row of Table II.
C. Sentiment Analysis
Task & dataset.
For sentiment analysis, we use two datasets of similar origin, namely the Large Movie Review Dataset (LMRD) [21] and the Naver Sentiment Movie Corpus (NSMC) [22]. LMRD is a movie review dataset in English. The dataset provides a set of 50,000 reviews with labels indicating whether a review is positive or negative. NSMC uses the same labeling system for movie reviews written in Korean. The dataset consists of 200,000 reviews.

Using LMRD and NSMC, we have experimented with five cross-lingual evaluations: fine-tune using 1) LMRD, 2) NSMC, 3) NSMC on the LMRD fine-tuned XLM-R, 4) LMRD on the NSMC fine-tuned XLM-R, and 5) a mixed set of LMRD and NSMC. All of these tasks are evaluated on the LMRD and NSMC test sets.
Fine-tuning.
Again, using the RAdam optimizer with a linear learning rate warm-up over 5% of the training data and a learning rate of ×10−, we run 5 training epochs with a batch size of 48.

Results.
The upper portion of Table III presents the zero-shot cross-lingual transfer results on sentiment analysis tasks. The numbers represent classification accuracy in percentage. Zero-shot testing with NSMC (Korean) on the LMRD fine-tuned XLM-R results in a 12.05% accuracy degradation, whereas zero-shot testing with English shows a 7.63% decrease in classification accuracy.

The lower portion of Table III presents the cross-lingual sentiment analysis performance for mixed-language fine-tuning cases. Here, the performance of the last language fine-tuned is improved while that of the first language fine-tuned degrades a little. When fine-tuned on the train set mixed with both languages, the sentiment analysis performance improves for both languages.
TABLE III. EVALUATION ON SENTIMENT CLASSIFICATION TASKS. THE NUMBERS REPRESENT CLASSIFICATION ACCURACY IN PERCENTAGE.

Fine-tuning Task(s)            English   Korean
Zero-shot
  LMRD (English)               93.52     79.24
  NSMC (Korean)                86.38     90.10
Mixed-language fine-tuning
  LMRD → NSMC                  90.65     90.12
  NSMC → LMRD                  93.69     89.47
  LMRD + NSMC                  93.80     90.24
D. Cross-lingual Mapping for Fine-grained Alignment of Sentence Embeddings
Using the analytical findings of Section II.C, we have determined the cross-lingual mappings Φ* and Ψ* linear-algebraically. We have applied the mappings to align the translated sentence pairs of STSb and KorSTS. Precisely, we set the source S_A to English sentences from STSb, and the target S_B to Korean sentences from KorSTS. The quality of alignment via the linear projections Φ* and Ψ* is very similar. Based on the average cosine similarity of the translated sentence pairs, we find Φ* slightly better than Ψ*.

We determine W by stochastic gradient descent on the single-layer neural net of Fig. 3. Using the translated sentence pairs, we set the input S_A of the neural net to English sentences from STSb, and the output S_B to Korean sentences from KorSTS. The average cosine similarity of the translated sentence pairs after alignment via the Φ* projection is 0.7131, whereas the average cosine similarity for the neural net is 0.7265. Without alignment by the projection matrix or the neural net, the average cosine similarity would have been 0.4636. Fig. 4 illustrates the t-SNE plots that visualize the effect of the sentence alignment. The top plots show unaligned English, aligned English, and Korean sentences via the Φ* projection, whereas the bottom plots show unaligned English, aligned English, and Korean sentences via the neural net.

TABLE IV. STS EVALUATION WITH CROSS-LINGUAL SENTENCE PAIRS.

Fine-tuning Task    Zero-shot Transfer   Cross-lingual Mapping
  STSb              49.03                59.16
  KorSTS            43.23                47.24
In Table IV, we compare the cosine similarity of aligned English and Korean translated sentence pairs of STSb and KorSTS through the fine-grained cross-lingual mapping against zero-shot transfer. Cross-lingual mapping, computed linear-algebraically or by the use of a neural net, outperforms zero-shot cross-lingual transfer by 9.3–20% in cosine similarity matching of the translated sentence pairs of STSb and KorSTS.
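The average-cosine-similarity comparison before and after alignment can be reproduced in outline on synthetic data; `avg_cosine` is an illustrative helper, and the least-squares projection stands in for the Φ* of Section II.C:

```python
import numpy as np

def avg_cosine(S_A, S_B):
    """Average cosine similarity over aligned (translated) sentence pairs."""
    num = np.sum(S_A * S_B, axis=1)
    den = np.linalg.norm(S_A, axis=1) * np.linalg.norm(S_B, axis=1)
    return float(np.mean(num / den))

rng = np.random.default_rng(3)
n, d = 300, 16
S_A = rng.normal(size=(n, d))                  # e.g. English sentence embeddings
M = rng.normal(size=(d, d))
S_B = S_A @ M + 0.1 * rng.normal(size=(n, d))  # noisy linear image of the source

Phi, *_ = np.linalg.lstsq(S_A, S_B, rcond=None)  # least-squares projection
before = avg_cosine(S_A, S_B)                  # unaligned pairs
after = avg_cosine(S_A @ Phi, S_B)             # aligned pairs score much higher
```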
E. Discussion
Generally, we find that cross-lingual transfer is present in important supervised NLP tasks that require high-level natural language understanding, namely STS, MRC, and sentiment classification. Our empirical evaluation suggests that the presence of cross-lingual transfer is most pronounced in STS; sentiment analysis comes next, and MRC last. It seems that the more complex a task is, the less effective the cross-lingual transfer becomes. For STS, we have observed the transfer quality in two different measures, the Spearman's rank and Pearson correlation coefficients, and found them concordant. For MRC, while zero-shot transfer performance measured by F1 score is reasonable, it suffers significantly more under the exact match (EM) metric. Interestingly, if we fine-tune XLM-R with both source and target languages, the last language fine-tuned has the strongest impact on the performance.

IV. CONCLUSION
This paper focuses on the empirical validation of the cross-lingual transfer properties induced by XLM pretraining. We have experimented with XLM-RoBERTa (XLM-R), a large cross-lingual language model, and extended semantic textual similarity (STS), SQuAD and KorQuAD for machine reading comprehension (MRC), and sentiment analysis to cross-lingual settings. Our results suggest that the presence of cross-lingual transfer is most pronounced in STS, sentiment analysis next, and MRC last. We compute matrix projections linear-algebraically that directly map sentence embeddings of one language to another for analyzing the effect of fine-grained alignment of sentences in zero-shot cross-lingual transfer. We have shown that such a mapping can also be determined iteratively using a simple neural net. Our future work includes more systematic evaluations on a broader range of low- and high-resource languages to generalize the quality of cross-lingual transfer manifested through important NLP tasks.

Fig. 4. t-SNE plots of English and Korean translated pairs from STSb and KorSTS. The leftmost plot on the top row shows unaligned English sentences (source), the middle shows English aligned via the linear projection Φ*, and the rightmost shows Korean (target). The middle and rightmost plots are aligned, showing similar patterns in t-SNE. The bottom plots show unaligned English, aligned English, and Korean sentences via the fully-connected single-layer neural net whose weight parameters W are learned by stochastic gradient descent.

REFERENCES

[1] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, "Unsupervised cross-lingual representation learning at scale," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[2] A. Conneau and G. Lample, "Cross-lingual Language Model Pretraining," in Advances in Neural Information Processing Systems 32, 2019.
[3] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
[4] S. Lim, M. Kim, and J. Lee, "KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension," arXiv preprint arXiv:1909.07005, 2019.
[5] G. Glavaš, R. Litschko, S. Ruder, and I. Vulić, "How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is All you Need," in Advances in Neural Information Processing Systems 30, 2017.
[7] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[8] T. Mikolov, Q. V. Le, and I. Sutskever, "Exploiting Similarities among Languages for Machine Translation," arXiv preprint arXiv:1309.4168, 2013.
[9] P. Schönemann, "A Generalized Solution of the Orthogonal Procrustes Problem," Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
[10] Hugging Face, "Open Source NLP," https://huggingface.co, 2020.
[11] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation," in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017.
[12] J. Ham, Y. J. Choe, K. Park, I. Choi, and H. Soh, "KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding," arXiv preprint arXiv:2004.03289, 2020.
[13] E. Agirre, D. Cer, M. Diab, and A. Gonzalez-Agirre, "SemEval-2012 task 6: A pilot on semantic textual similarity," in *SEM 2012: The First Joint Conference on Lexical and Computational Semantics and Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), 2012.
[14] E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, and W. Guo, "*SEM 2013 shared task: Semantic textual similarity," in Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1, 2013.
[15] E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe, "SemEval-2014 task 10: Multilingual semantic textual similarity," in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014.
[16] E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, I. Lopez-Gazpio, M. Maritxalar, R. Mihalcea, G. Rigau, L. Uria, and J. Wiebe, "SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability," in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015.
[17] E. Agirre, C. Banea, D. Cer, M. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe, "SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation," in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016.
[18] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.
[19] D. M. Cer, M. T. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "SemEval-2017 task 1: Semantic textual similarity, multilingual and cross-lingual focused evaluation," 2017.
[20] C. P. Carrino, M. R. Costa-jussà, and J. A. Fonollosa, "Automatic Spanish Translation of the SQuAD Dataset for Multilingual Question Answering," arXiv preprint arXiv:1912.05200, 2019.
[21] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, "Learning Word Vectors for Sentiment Analysis," in