Analyzing Zero-shot Cross-lingual Transfer in Supervised NLP Tasks
Hyunjin Choi, Judong Kim, Seongho Joe, Seungjai Min, Youngjune Gwon
Samsung SDS
Abstract—In zero-shot cross-lingual transfer, a supervised NLP task trained on a corpus in one language is directly applicable to another language without any additional training. A source of cross-lingual transfer can be as straightforward as lexical overlap between languages (e.g., use of the same scripts, shared subwords) that naturally forces text embeddings to occupy a similar representation space. Recently introduced cross-lingual language model (XLM) pretraining brings out neural parameter sharing in Transformer-style networks as the most important factor for the transfer. In this paper, we aim to validate the hypothetically strong cross-lingual transfer properties induced by XLM pretraining. Particularly, we take XLM-RoBERTa (XLM-R) in our experiments that extend semantic textual similarity (STS), SQuAD and KorQuAD for machine reading comprehension, sentiment analysis, and alignment of sentence embeddings under various cross-lingual settings. Our results indicate that the presence of cross-lingual transfer is most pronounced in STS, sentiment analysis next, and MRC last. That is, the complexity of a downstream task softens the degree of cross-lingual transfer. All of our results are empirically observed and measured, and we make our code and data publicly available.
I. INTRODUCTION
Pretraining language models at a large scale has dramatically improved natural language understanding. According to a comprehensive analysis [1] on the limitations in pretraining a multilingual model, more languages lead to better cross-lingual performance for low-resource languages only up to a certain point as the number of languages increases. The phenomenon is dubbed the curse of multilinguality, which can only be freed up by scaling up the model size.

Recent experimental results show that multilingual models can outperform their monolingual counterparts. For a low-resource language that lacks labeled examples, such results are an encouraging breakthrough for building NLP applications. In cross-lingual language understanding, XLM by Conneau & Lample [2], despite being pretrained by only masked language modeling (MLM), has reported the state of the art on downstream benchmarks. Shared lexical features (e.g., subwords, scripts, anchor points) across languages have been suspected as the primary source of learning the language-independent representations that lead to cross-lingual transfer. Recent studies, however, show that parameter sharing induced by the Transformer architecture is instead the most attributable factor for the transfer.

We are motivated by this progress in language modeling. This work focuses on empirical analysis of cross-lingual transfer in supervised NLP tasks fine-tuned over XLM. In particular, we are interested in zero-shot transfer settings where no additional training is done using target-language examples after fine-tuning in the source language.
We experiment with XLM-RoBERTa (XLM-R) [1], a large XLM model with 550 million parameters and a 250k vocabulary size, by extending semantic textual similarity, SQuAD [3] & KorQuAD [4] question answering, and sentiment classification to various cross-lingual settings.

Finally, beyond previous work that has attempted to align word embeddings across different languages [5], we compute a projection that directly maps sentence embeddings of one language to those of another. We then analyze the effect of fine-grained alignment of sentences across different languages on the quality of zero-shot cross-lingual transfer, manifested through the aforementioned NLP task performances measured empirically.

We make the following contributions.
• We provide rigorous results on cross-lingual transfer present in three important supervised NLP tasks that require high-level natural language understanding, namely STS, MRC, and sentiment classification.
• We propose to directly compute a cross-lingual mapping that aligns sentence embeddings of different languages, whereas previous work has focused on word-level embeddings.
• We furthermore show benefits of the fine-grained cross-lingual sentence alignment that enables directly comparing sentences from different languages for sentence-pair regression tasks.

The rest of this paper is organized as follows. In Section II, we describe our approach by presenting the zero-shot cross-lingual evaluation framework. Section III discusses our experimental methodology and empirical results. Section IV concludes the paper.

II. OUR APPROACH
XLM pretraining is known to effectively promote cross-lingual transfer, where a supervised model fine-tuned in one language is applied to another without additional training.
A. Zero-shot Cross-lingual Evaluation Framework
We propose a simple approach to transfer a supervised model learned in one language to another for zero-shot cross-lingual evaluation, as illustrated in Fig. 1. First, we place a pretrained XLM; in our case, the 550 million-parameter XLM-RoBERTa (XLM-R) [1] trained on 100 languages is used. We then fine-tune XLM-R for a downstream task using labeled data in language A. Lastly, we evaluate the fine-tuned downstream task in both languages A and B. Note that running a test set from language B through the fine-tuned task evaluates zero-shot cross-lingual transfer.

Fig. 1. Zero-shot cross-lingual evaluation (1. cross-lingual model pretraining on 100 languages; 2. fine-tune the pretrained XLM model on a supervised task in language A; 3. test the supervised task on language B).
B. Sentence Embedding and Pair Modeling
Transformer [6] models such as BERT produce contextualized representations that are central to building a high-performance downstream task. XLM-R is a BERT variant whose output constitutes token embeddings (up to 512 token vectors of 768 dimensions each) for a given input. To produce the fixed-size sentence embeddings necessary for a task like semantic textual similarity (STS), we average the token embedding output to obtain a single 768-dimensional pooled vector.

For text regression (or classification), one learns a function that maps sentence embeddings to a target value. Sentence-pair modeling gives an important primitive that underlies supervised NLP tasks such as STS. We adopt the siamese network architecture of Sentence-BERT [7] that avoids the combinatorial explosion of forming sentence pairs. Fig. 2 depicts our sentence-pair modeling for STS.
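The average pooling described above can be sketched in a few lines; a minimal NumPy sketch in which a toy hidden size of 3 stands in for XLM-R's 768 dimensions and `mean_pool` is an illustrative helper name (padding positions are masked out before averaging, a common convention):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors into one fixed-size sentence embedding.

    token_embeddings: (seq_len, hidden) array, e.g. one encoder output
    attention_mask:   (seq_len,) 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # (hidden,)
    counts = mask.sum()                             # number of real tokens
    return summed / counts

# Toy example: 4 token vectors (the last one is padding), hidden size 3
tokens = np.array([[1., 2., 3.],
                   [3., 2., 1.],
                   [2., 2., 2.],
                   [9., 9., 9.]])  # padding row, ignored by the mask
mask = np.array([1, 1, 1, 0])
emb = mean_pool(tokens, mask)      # -> array([2., 2., 2.])
```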
Fig. 2. Siamese net for sentence-pair modeling (two XLM-R encoders with average pooling produce sentence embeddings s_A and s_B; their cosine similarity is trained against the gold score with an MSE loss).
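The siamese objective of Fig. 2 can be sketched as follows. The rescaling of gold STS scores from [0, 5] to [0, 1] is an assumption (a common convention in Sentence-BERT-style training, not stated explicitly here), and the function names are illustrative:

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sts_mse_loss(emb_a, emb_b, gold_scores):
    """MSE between the cosine similarity of pooled sentence embeddings
    and gold STS scores rescaled from [0, 5] to [0, 1] (an assumption)."""
    preds = np.array([cosine_sim(a, b) for a, b in zip(emb_a, emb_b)])
    target = np.asarray(gold_scores) / 5.0
    return float(np.mean((preds - target) ** 2))

# Two toy sentence pairs: an identical pair and an orthogonal pair
emb_a = np.array([[1.0, 0.0], [1.0, 1.0]])
emb_b = np.array([[1.0, 0.0], [-1.0, 1.0]])
loss = sts_mse_loss(emb_a, emb_b, gold_scores=[5.0, 0.0])  # -> 0.0
```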
C. Cross-lingual Mapping for Fine-grained Alignment of Sentence Embeddings
Cross-lingual mapping for word embeddings has been widely studied. Because context awareness is the key to language understanding, learning cross-lingual mappings as sentence-level transformations can be valuable. A sentence is less ambiguous than its words, since words must be interpreted within a specific context.

We learn cross-lingual sentence mappings directly from sentence-pair examples. Note that sentence embeddings produced from contextualized cross-lingual word embeddings would imply only loosely aligned sentences. Similar to the projection-based cross-lingual word embedding framework [5], [8], we use linear-algebraic methods to compute a projection matrix that achieves fine-grained alignment of sentence embeddings across different languages. We also use a single-layer neural net that can iteratively learn the same projection by gradient descent.
System of least squares via normal equation.

Suppose languages A and B are the source and the target languages of the projection Φ. We seek the solution to the problem S_A Φ = S_B with

S_A = [s_A^(1) s_A^(2) ... s_A^(n)]^⊤, S_B = [s_B^(1) s_B^(2) ... s_B^(n)]^⊤,
s_A^(i) = [a_1^(i) a_2^(i) ... a_d^(i)]^⊤, s_B^(i) = [b_1^(i) b_2^(i) ... b_d^(i)]^⊤   (1)

where S_A and S_B are datasets that contain n sentence embeddings for languages A and B, with each sentence embedding s ∈ ℝ^d. With Φ = [φ^(1) φ^(2) ... φ^(j) ... φ^(d)], whose element φ^(j) ∈ ℝ^d is a column vector, each S_A φ^(j) = [b_j^(1) b_j^(2) ... b_j^(n)]^⊤ gives a least-squares problem. Since j = 1, ..., d, we have a system of d least-squares problems that can be solved linear-algebraically via the normal equation: Φ* = (S_A^⊤ S_A)^(−1) S_A^⊤ S_B.

Solving the Procrustes problem.

Given two data matrices, a source S_A and a target S_B, the orthogonal Procrustes problem [9] describes a matrix approximation searching for an orthogonal projection that most closely maps S_A to S_B. Formally, we write

Ψ* = argmin_Ψ ‖S_A Ψ − S_B‖_F  s.t.  Ψ^⊤ Ψ = I   (2)

The solution to Eq. (2) has the closed form Ψ* = UV^⊤ with UΣV^⊤ = SVD(S_A^⊤ S_B), where SVD denotes the singular value decomposition.

Fully-connected single-layer neural net with linearly activated neurons.

Contrasted to the linear-algebraic solutions Φ* and Ψ*, a neural net can be used to compute the projection matrix iteratively via gradient descent. We consider a fully-connected single-layer neural net with linear activation functions, as illustrated in Fig. 3. We use the neural net as an array of linear regressors with mean squared error (MSE) objectives

S_A W = S_B (feedforward)  s.t.  ‖S_A w^(j) − [b_j^(1) b_j^(2) ... b_j^(n)]^⊤‖ < ε  ∀ j   (3)

where W = [w^(1) w^(2) ... w^(j) ... w^(d)] contains the weight parameters of the neural net. Instead of a cross-entropy loss, we impose the MSE loss function to optimize each w^(j) by stochastic gradient descent.

Fig. 3. Fully-connected single-layer neural net with neurons having linear activation functions (inputs s_A^(i), outputs s_B^(i), weights W).
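Under the definitions above, all three solvers can be sketched on synthetic embeddings; a minimal NumPy sketch in which d = 16 stands in for the 768-dimensional sentence embeddings, and the target is an exactly linear image of the source so that all three methods are easy to check:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16                      # the paper's d is 768; a small d for illustration
S_A = rng.normal(size=(n, d))       # source-language sentence embeddings
W_true = rng.normal(size=(d, d))
S_B = S_A @ W_true                  # target embeddings, exactly linearly related here

# 1) Normal equation: Phi* = (S_A^T S_A)^(-1) S_A^T S_B
Phi = np.linalg.solve(S_A.T @ S_A, S_A.T @ S_B)

# 2) Orthogonal Procrustes: Psi* = U V^T where U Sigma V^T = SVD(S_A^T S_B)
U, _, Vt = np.linalg.svd(S_A.T @ S_B)
Psi = U @ Vt                        # orthogonal by construction

# 3) A single linear layer trained by mini-batch gradient descent on an MSE loss
W, lr, batch = np.zeros((d, d)), 0.05, 50
for epoch in range(500):
    for start in range(0, n, batch):
        sl = slice(start, start + batch)
        err = S_A[sl] @ W - S_B[sl]         # residual on the mini-batch
        W -= lr * S_A[sl].T @ err / batch   # gradient of the mean squared error
```

On this noiseless toy data the normal-equation solution and the iteratively trained weights both recover the generating matrix, while the Procrustes solution is the best orthogonal approximation to it.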
III. EXPERIMENTS
Throughout all our experiments, we use the pretrained XLM-RoBERTa (XLM-R) [1] downloaded from Hugging Face [10] unmodified, upon which we build supervised NLP tasks and fine-tune. We focus on experimenting with sentence-level representations and their cross-lingual transfer quality when used in a downstream task under zero-shot settings. There is a considerable amount of existing literature on evaluating the cross-lingual transfer quality of word representations, which we will not cover in this paper.

A. Semantic Textual Similarity (STS)
Task & dataset.
The first of our cross-lingual experiments are on the STS benchmark (STSb) [11], Korean STS (KorSTS) [12], SemEval-2017 Spanish, and SemEval-2017 Arabic. STSb is a set of English data originated for the STS task evaluations in the International Workshop on Semantic Evaluation (SemEval) [13]–[17] between 2012 and 2017. STSb is distributed as one of the four similarity and paraphrase tasks in the GLUE benchmark [18]. The STSb dataset includes 8,628 sentence pairs from image captions, news headlines, and user forums, partitioned into train (5,749), dev (1,500), and test (1,379) sets.

The STSb sentence pairs are labeled with a similarity score ranging from 0 to 5 that indicates how similar the sentences are in terms of semantic relatedness. KorSTS is a dataset translated from STSb and has exactly the same structure. SemEval-2017 Spanish and Arabic are evaluation sets from SemEval-2017 Task 1 [19], with 250 test pairs per language.
Fine-tuning.
We run the GLUE benchmark code as-is from Hugging Face to fine-tune STS tasks. This means that a text input to XLM-R is in the Sentence A [SEP] Sentence B format, which is the same as in pretraining. We use the Rectified Adam (RAdam) optimizer with a linear learning rate warm-up over 10% of the training data and a learning rate of ×10−. We have run 4 training epochs using a batch size of 32.

To evaluate zero-shot cross-lingual transfer, we fine-tune on the STSb train set and test using the STSb, KorSTS, SemEval-2017 Spanish, and SemEval-2017 Arabic test sets, and similarly for fine-tuning and testing on KorSTS. Furthermore, we carry out the following mixed instances: 1) fine-tune on STSb first and KorSTS next; 2) fine-tune on KorSTS first and STSb next; 3) fine-tune on sentence-pair examples uniformly drawn from STSb and KorSTS.

Results.
The upper portion of Table I reports the STS performances on zero-shot cross-lingual testing with 4 languages. We immediately find the presence of cross-lingual transfer strong for STS. When fine-tuned on English (the STSb train set), zero-shot testing with Korean results in a 1.24% decrease in Spearman's rank correlation. On the other hand, when fine-tuned using the KorSTS train set, zero-shot testing with English results in a 3.40% degradation.

For Spanish and Arabic, we observe better performance when fine-tuned on English. We find particularly low scores for Arabic and suspect that it is a relatively lower-resource language compared to the others. In fact, XLM-R uses 28.0GB of Arabic resources, while 54.2GB is used for Korean, 53.3GB for Spanish, and 300.8GB for English [1].

The lower portion of Table I shows how two-stage fine-tuning mixed with two different languages affects the performance in each language. Although the performance numbers are similar regardless of the fine-tuning order, the last language fine-tuned slightly outperforms the others.
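The quoted degradation percentages appear to be relative drops of the zero-shot score against the in-language baseline (fine-tune and test in the same language), read off the Spearman correlations of Table I; a sketch of that reading:

```python
# Relative drop of a zero-shot score against the in-language baseline,
# i.e. the score obtained by fine-tuning and testing in the same language.
def relative_drop(zero_shot: float, in_language: float) -> float:
    return 100.0 * (in_language - zero_shot) / in_language

# Fine-tuned on KorSTS, tested zero-shot on English (84.47) vs. the
# English-on-English baseline (87.44), Spearman correlations from Table I:
drop_en = relative_drop(zero_shot=84.47, in_language=87.44)
print(round(drop_en, 2))  # -> 3.4
```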
B. Machine Reading Comprehension (MRC)
Task & dataset.
Reading comprehension has been one of the most challenging tasks for machines, combining natural language understanding and generation with knowledge about the world. We use the Stanford Question Answering Dataset (SQuAD) [3], the Korean Question Answering Dataset (KorQuAD) [4], and Spanish SQuAD (SQuAD-es) [20] for the cross-lingual transfer evaluation of machine reading comprehension (MRC) tasks. Both SQuAD and KorQuAD consist of crowdsourced question-answer pairs from English and Korean Wikipedia articles, respectively. SQuAD-es is a translated dataset of SQuAD for Spanish.

Using SQuAD v1.1, KorQuAD v1.0, and SQuAD-es v1.1, we run the following eight cross-lingual MRC tasks. We prepare three copies of XLM-R and fine-tune them on 1) SQuAD, 2) KorQuAD, and 3) SQuAD-es for testing with the SQuAD (English), KorQuAD (Korean), and SQuAD-es (Spanish) dev sets. We then fine-tune cross-lingually again using 4) KorQuAD on the SQuAD fine-tuned XLM-R, 5) SQuAD-es on the SQuAD fine-tuned XLM-R, and 6) SQuAD on the KorQuAD fine-tuned XLM-R for another round of testing with the dev sets. Additionally, we fine-tune XLM-R with 7) a mixed set of SQuAD and KorQuAD and 8) a mixed set of SQuAD, KorQuAD, and SQuAD-es.

TABLE I. EVALUATION ON STS TASKS. NUMBERS REPRESENT THE SPEARMAN (PEARSON) CORRELATIONS IN PERCENTILE.

Fine-tuning Task(s)            English         Korean          Spanish         Arabic
Zero-shot
  STSb (English)               87.44 (87.43)   82.34 (82.27)   85.58 (87.02)   72.67 (70.54)
  KorSTS (Korean)              84.47 (84.40)   83.38 (83.16)   84.94 (85.00)   70.99 (69.66)
Mixed-language fine-tuning
  STSb → KorSTS                86.43 (86.47)   83.54 (83.42)   85.47 (86.05)   73.85 (73.39)
  KorSTS → STSb                88.33 (88.34)   85.12 (85.12)   86.77 (87.83)   73.37 (72.37)
  STSb + KorSTS                87.71 (87.84)   84.37 (84.48)   86.53 (86.99)   75.72 (75.22)

TABLE II. EVALUATION ON MRC TASKS. NUMBERS REPRESENT F1 SCORE, AND NUMBERS IN PARENTHESES ARE EXACT MATCHES.

Fine-tuning Task(s)            English         Korean          Spanish
Zero-shot
  SQuAD (English)              88.81 (81.68)   80.92 (45.08)   72.07 (53.18)
  KorQuAD (Korean)             72.03 (61.93)   89.58 (65.29)   58.65 (43.09)
  SQuAD-es (Spanish)           84.75 (74.51)   78.87 (42.76)   76.11 (59.68)
Mixed-language fine-tuning
  SQuAD → KorQuAD              85.81 (77.16)   90.17 (66.02)   70.54 (52.40)
  SQuAD → SQuAD-es             86.73 (76.78)   78.16 (36.87)   76.70 (59.87)
  KorQuAD → SQuAD              89.16 (82.20)   88.42 (62.83)   72.78 (53.92)
  SQuAD + KorQuAD              84.41 (75.93)   86.79 (62.45)   67.72 (48.49)
  SQuAD + KorQuAD + SQuAD-es   89.29 (81.98)   90.41 (66.36)   76.75 (59.66)
Fine-tuning.
We use the RAdam optimizer with a linear learning rate warm-up over 10% of the training data and a learning rate of ×10−. We have found that running just 3 training epochs with a batch size of 48 is sufficient.

Results.
The upper portion of Table II reports the cross-lingual MRC performance evaluated on the SQuAD, KorQuAD, and SQuAD-es dev sets. Fine-tuned on SQuAD, zero-shot testing with Korean and Spanish degrades 9.67% and 5.30% in F1 score, respectively. (Here, the compared baseline is the KorQuAD dev set tested on KorQuAD-fine-tuned XLM-R, and likewise for Spanish.) Fine-tuned on KorQuAD, however, zero-shot testing with English and Spanish degrades 18.89% and 22.94%, respectively. The results with SQuAD-es show 4.57% and 11.96% decreases for English and Korean. Compared to the performance on STS tasks, the degradation measured in F1 score and exact match is much larger for MRC tasks.

The lower portion of Table II reports the cross-lingual MRC performance for mixed-language fine-tuning cases. The results show a similar trend as in the STS tasks. In general, fine-tuning with an additional language seems to improve MRC performance regardless of the testing language. Fine-tuning with all three languages yields the best MRC performance, as shown in the last row of Table II.
C. Sentiment Analysis
Task & dataset.
For sentiment analysis, we use two datasets of similar origin, namely the Large Movie Review Dataset (LMRD) [21] and the Naver Sentiment Movie Corpus (NSMC) [22]. LMRD is a movie review dataset in English. The dataset provides a set of 50,000 reviews with labels indicating whether a review is positive or negative. NSMC uses the same labeling system for movie reviews written in Korean. The dataset consists of 200,000 reviews.

Using LMRD and NSMC, we have experimented with five cross-lingual evaluations: fine-tune using 1) LMRD, 2) NSMC, 3) NSMC on the LMRD fine-tuned XLM-R, 4) LMRD on the NSMC fine-tuned XLM-R, and 5) a mixed set of LMRD and NSMC. All of these tasks are evaluated on the LMRD and NSMC test sets.
Fine-tuning.
Again, using the RAdam optimizer with a linear learning rate warm-up over 5% of the training data and a learning rate of ×10−, we run 5 training epochs with a batch size of 48.

Results.
The upper portion of Table III presents the zero-shot cross-lingual transfer results on sentiment analysis tasks. The numbers represent classification accuracy in percentage. Zero-shot testing with NSMC (Korean) on the LMRD fine-tuned XLM-R results in a 12.05% accuracy degradation, whereas zero-shot testing with English shows a 7.63% decrease in classification accuracy.

The lower portion of Table III presents the cross-lingual sentiment analysis performance for mixed-language fine-tuning cases. Here, the performance of the last language fine-tuned is improved while that of the first language fine-tuned degrades a little. When fine-tuned on the train set mixed with both languages, the sentiment analysis performance improves for both languages.
TABLE III. EVALUATION ON SENTIMENT CLASSIFICATION TASKS. THE NUMBERS REPRESENT CLASSIFICATION ACCURACY IN PERCENTAGE.

Fine-tuning Task(s)            English   Korean
Zero-shot
  LMRD (English)               93.52     79.24
  NSMC (Korean)                86.38     90.10
Mixed-language fine-tuning
  LMRD → NSMC                  90.65     90.12
  NSMC → LMRD                  93.69     89.47
  LMRD + NSMC                  93.80     90.24
D. Cross-lingual Mapping for Fine-grained Alignment of Sentence Embeddings
Using the analytical findings of Section II.C, we have determined the cross-lingual mappings Φ* and Ψ* linear-algebraically. We have applied the mappings to align the translated sentence pairs of STSb and KorSTS. Precisely, we set the source S_A to English sentences from STSb, and the target S_B to Korean sentences from KorSTS. The quality of alignment via the linear projections Φ* and Ψ* is very similar. Based on the average cosine similarity of the translated sentence pairs, we find Φ* slightly better than Ψ*.

We determine W by stochastic gradient descent on the single-layer neural net of Fig. 3. Using the translated sentence pairs, we set the input S_A of the neural net to English sentences from STSb, and the output S_B to Korean sentences from KorSTS. The average cosine similarity of the translated sentence pairs after alignment via the Φ* projection is 0.7131, whereas the average cosine similarity for the neural net is 0.7265. Without alignment by the projection matrix or the neural net, the average cosine similarity would have been 0.4636. Fig. 4 illustrates the t-SNE plots that visualize the effect of the sentence alignment. The top plots show unaligned English, aligned English, and Korean sentences via the Φ* projection, whereas the bottom plots show unaligned English, aligned English, and Korean sentences via the neural net.

TABLE IV. STS EVALUATION WITH CROSS-LINGUAL SENTENCE PAIRS.

Fine-tuning Task    Zero-shot Transfer   Cross-lingual Mapping
  STSb              49.03                59.16
  KorSTS            43.23                47.24
In Table IV, we compare the cosine similarity of aligned English and Korean translated sentence pairs of STSb and KorSTS through the fine-grained cross-lingual mapping against zero-shot transfer. Cross-lingual mapping, computed linear-algebraically or by the use of a neural net, outperforms zero-shot cross-lingual transfer by 9.3–20% in cosine similarity matching of the translated sentence pairs of STSb and KorSTS.
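The average-cosine-similarity comparison before and after alignment can be reproduced in outline on synthetic data; `avg_cosine` is an illustrative helper, and the least-squares projection stands in for the Φ* of Section II.C:

```python
import numpy as np

def avg_cosine(S_A, S_B):
    """Average cosine similarity over aligned (translated) sentence pairs."""
    num = np.sum(S_A * S_B, axis=1)
    den = np.linalg.norm(S_A, axis=1) * np.linalg.norm(S_B, axis=1)
    return float(np.mean(num / den))

rng = np.random.default_rng(3)
n, d = 300, 16
S_A = rng.normal(size=(n, d))                  # e.g. English sentence embeddings
M = rng.normal(size=(d, d))
S_B = S_A @ M + 0.1 * rng.normal(size=(n, d))  # noisy linear image of the source

Phi, *_ = np.linalg.lstsq(S_A, S_B, rcond=None)  # least-squares projection
before = avg_cosine(S_A, S_B)                  # unaligned pairs
after = avg_cosine(S_A @ Phi, S_B)             # aligned pairs score much higher
```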
E. Discussion
Generally, we find that cross-lingual transfer is present in important supervised NLP tasks that require high-level natural language understanding, namely STS, MRC, and sentiment classification. Our empirical evaluation suggests that the presence of cross-lingual transfer is most pronounced in STS; sentiment analysis comes next, and MRC last. It seems that the more complex a task is, the less effective the cross-lingual transfer becomes. For STS, we have observed the transfer quality in two different measures, the Spearman's rank and Pearson correlation coefficients, and found them concordant. For MRC, while zero-shot transfer performance measured by F1 score is reasonable, it suffers significantly more under the exact match (EM) metric. Interestingly, if we fine-tune XLM-R with both source and target languages, the last language fine-tuned has the strongest impact on the performance.

IV. CONCLUSION
This paper focuses on the empirical validation of the cross-lingual transfer properties induced by XLM pretraining. We have experimented with XLM-RoBERTa (XLM-R), a large cross-lingual language model, and extended semantic textual similarity (STS), SQuAD and KorQuAD for machine reading comprehension (MRC), and sentiment analysis to cross-lingual settings. Our results suggest that the presence of cross-lingual transfer is most pronounced in STS, sentiment analysis next, and MRC last. We compute matrix projections linear-algebraically that directly map sentence embeddings of one language to another for analyzing the effect of fine-grained alignment of sentences in zero-shot cross-lingual transfer. We have shown that such a mapping can also be determined iteratively using a simple neural net. Our future work includes more systematic evaluations on a broader range of low- and high-resource languages to generalize the quality of cross-lingual transfer manifested through important NLP tasks.

Fig. 4. t-SNE plots of English and Korean translated pairs from STSb and KorSTS. The leftmost plot on the top row shows unaligned English sentences (source), the middle shows English aligned via the linear projection Φ*, and the rightmost shows Korean (target). The middle and rightmost plots are aligned, showing similar patterns in t-SNE. The bottom plots show unaligned English, aligned English, and Korean sentences via the fully-connected single-layer neural net whose weight parameters W are learned by stochastic gradient descent.

REFERENCES

[1] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, "Unsupervised cross-lingual representation learning at scale," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[2] A. Conneau and G. Lample, "Cross-lingual Language Model Pretraining," in Advances in Neural Information Processing Systems 32, 2019.
[3] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
[4] S. Lim, M. Kim, and J. Lee, "KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension," arXiv preprint arXiv:1909.07005, 2019.
[5] G. Glavaš, R. Litschko, S. Ruder, and I. Vulić, "How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is All you Need," in Advances in Neural Information Processing Systems 30, 2017.
[7] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[8] T. Mikolov, Q. V. Le, and I. Sutskever, "Exploiting Similarities among Languages for Machine Translation," arXiv preprint arXiv:1309.4168, 2013.
[9] P. Schönemann, "A Generalized Solution of the Orthogonal Procrustes Problem," Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
[10] Hugging Face, "Open Source NLP," https://huggingface.co, 2020.
[11] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation," in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017.
[12] J. Ham, Y. J. Choe, K. Park, I. Choi, and H. Soh, "KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding," arXiv preprint arXiv:2004.03289, 2020.
[13] E. Agirre, D. Cer, M. Diab, and A. Gonzalez-Agirre, "SemEval-2012 task 6: A pilot on semantic textual similarity," in *SEM 2012: The First Joint Conference on Lexical and Computational Semantics and Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), 2012.
[14] E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, and W. Guo, "*SEM 2013 shared task: Semantic textual similarity," in Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1, 2013.
[15] E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe, "SemEval-2014 task 10: Multilingual semantic textual similarity," in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014.
[16] E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, I. Lopez-Gazpio, M. Maritxalar, R. Mihalcea, G. Rigau, L. Uria, and J. Wiebe, "SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability," in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015.
[17] E. Agirre, C. Banea, D. Cer, M. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe, "SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation," in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016.
[18] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.
[19] D. M. Cer, M. T. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "SemEval-2017 task 1: Semantic textual similarity, multilingual and cross-lingual focused evaluation," 2017.
[20] C. P. Carrino, M. R. Costa-jussà, and J. A. Fonollosa, "Automatic Spanish Translation of the SQuAD Dataset for Multilingual Question Answering," arXiv preprint arXiv:1912.05200, 2019.
[21] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, "Learning Word Vectors for Sentiment Analysis," in