Meemi: A Simple Method for Post-processing and Integrating Cross-lingual Word Embeddings
Yerai Doval [email protected]
Grupo COLE, Escola Superior de Enxeñaría Informática, Universidade de Vigo, Spain
Jose Camacho-Collados [email protected]
Luis Espinosa-Anke [email protected]
Steven Schockaert [email protected]
School of Computer Science and Informatics, Cardiff University, UK
Abstract
Word embeddings have become a standard resource in the toolset of any Natural Language Processing practitioner. While monolingual word embeddings encode information about words in the context of a particular language, cross-lingual embeddings define a multilingual space where word embeddings from two or more languages are integrated together. Current state-of-the-art approaches learn these embeddings by aligning two disjoint monolingual vector spaces through an orthogonal transformation which preserves the structure of the monolingual counterparts. In this work, we propose to apply an additional transformation after this initial alignment step, which aims to bring the vector representations of a given word and its translations closer to their average. Since this additional transformation is non-orthogonal, it also affects the structure of the monolingual spaces. We show that our approach improves both the integration of the monolingual spaces and the quality of the monolingual spaces themselves. Furthermore, because our transformation can be applied to an arbitrary number of languages, we are able to effectively obtain a truly multilingual space. The resulting (monolingual and multilingual) spaces show consistent gains over the current state-of-the-art in standard intrinsic tasks, namely dictionary induction and word similarity, as well as in extrinsic tasks such as cross-lingual hypernym discovery and cross-lingual natural language inference.
1. Introduction
An increasingly popular research direction in multilingual Natural Language Processing (NLP) consists in learning mappings between two or more monolingual word embedding spaces. These mappings, together with the initial monolingual spaces, define a multilingual word embedding space in which words from different languages with a similar meaning are represented as similar vectors. Such multilingual embeddings do not only play a central role in multilingual NLP tasks, but they also provide a natural tool for transferring models that were trained on resource-rich languages (typically English) to other languages, where the availability of annotated data may be more limited.

State-of-the-art models for aligning monolingual word embeddings currently rely on learning an orthogonal mapping from the monolingual embedding of a source language into the embedding of a target language. Somewhat surprisingly, perhaps, this restriction to orthogonal mappings, as opposed to arbitrary linear or even non-linear mappings, has proven crucial to obtain optimal results. The advantages of using orthogonal transformations are two-fold. First, because they are more constrained than arbitrary linear transformations, they can be learned from noisy data in a more robust way. This plays a particularly important role in settings where alignments between monolingual spaces have to be learned from small and/or noisy dictionaries (Artetxe, Labaka, & Agirre, 2017), including dictionaries that have been heuristically induced in a purely unsupervised way (Artetxe, Labaka, & Agirre, 2018b; Conneau, Lample, Ranzato, Denoyer, & Jégou, 2018a). Second, orthogonal transformations preserve the distances between the word vectors, which means that the internal structure of the monolingual spaces is not affected by the alignment. Approaches that rely on orthogonal transformations thus have to assume that the word embedding spaces for different languages are approximately isometric (Barone, 2016). However, it has been argued that this assumption is not always satisfied (Søgaard, Ruder, & Vulić, 2018; Kementchedjhieva, Ruder, Cotterell, & Søgaard, 2018). Moreover, rather than treating the monolingual embeddings as fixed elements, we may intuitively expect that embeddings from different languages may actually be used to improve each other. This idea was exploited by Faruqui and Dyer (2014), who learn linear mappings from two monolingual spaces onto a new, shared, multilingual space. They found that the resulting changes to the internal structure of the monolingual spaces can indeed bring benefits. In multilingual evaluation tasks, however, their method is outperformed by approaches that rely on orthogonal transformations (Artetxe, Labaka, & Agirre, 2016).

In this article, we propose a simple method which combines the advantages of orthogonal transformations with the potential benefit of allowing monolingual spaces to affect each other's internal structure. Specifically, we first align the given monolingual spaces by learning an orthogonal transformation using an existing state-of-the-art method. Subsequently, we aim to reduce any remaining discrepancies by trying to find the middle ground between the aligned monolingual spaces. Specifically, let (w, v) be an entry from a bilingual dictionary (i.e., v is the translation of w), and let w and v be the vector representations of w and v in the aligned monolingual spaces.
Our aim is to learn linear mappings M_s and M_t such that wM_s ≈ vM_t ≈ (w + v)/2, for each entry (w, v) from a given dictionary. Crucially, because we start from monolingual spaces which are already aligned, applying the mappings M_s and M_t can be thought of as a fine-tuning step. We will refer to this proposed fine-tuning step as Meemi (Meeting in the middle). Our experimental analysis reveals that this combination of an orthogonal transformation followed by a simple non-orthogonal fine-tuning step consistently, and often substantially, outperforms existing approaches in cross-lingual evaluation tasks. We also find that the proposed transformation leads to improvements in the monolingual spaces, which, as already mentioned, is not possible with orthogonal transformations. This article extends our earlier work in (Doval, Camacho-Collados, Espinosa-Anke, & Schockaert, 2018) in the following ways:

1. We introduce a new variant of Meemi, in which the averages that are used to compute the linear transformations are weighted by word frequencies.
2. We generalize the approach to an arbitrary number of languages, thus allowing us to learn truly multilingual vector spaces.

3. We more thoroughly compare the obtained multilingual models, extending the number of baselines and evaluation tasks. We now also include a more extensive analysis of the results, e.g. studying the impact of the size of the bilingual dictionaries in more detail.

4. In the evaluation, we now include two distant languages which do not use the Latin alphabet: Farsi and Russian.

1. Code is available at https://github.com/yeraidm/meemi. This page will be updated with pre-trained models for new languages.
2. Background: Cross-lingual Alignment Methods
In this article we analyze cross-lingual word embedding models that are based on aligning monolingual vector spaces. The overall process underpinning these methods is as follows. Given two monolingual corpora, a word vector space is first learned independently for each language. This can be achieved with standard word embedding models such as Word2vec (Mikolov, Chen, Corrado, & Dean, 2013a), GloVe (Pennington, Socher, & Manning, 2014) or FastText (Bojanowski, Grave, Joulin, & Mikolov, 2017). Second, a linear alignment strategy is used to map the monolingual embeddings to a common bilingual vector space.

These linear transformations are learned from a supervision signal in the form of a bilingual dictionary (although some methods can also deal with dictionaries that are automatically generated as part of the alignment process; see below). This approach was popularized by Mikolov, Le, and Sutskever (2013b). Specifically, they proposed to learn a matrix W which minimizes the following objective:

$$\sum_{i=1}^{n} \lVert \mathbf{x}_i W - \mathbf{z}_i \rVert^2 \qquad (1)$$

where we write x_i for the vector representation of some word x_i in the source language and z_i is the vector representation of the translation z_i of x_i in the target language. This optimization problem corresponds to a standard least-squares regression problem, whose exact solution can be efficiently computed. Note that this approach relies on a bilingual dictionary containing the training pairs (x_1, z_1), ..., (x_n, z_n). However, once the matrix W has been learned, for any word w in the source language with vector representation x, we can use xW as a prediction of the vector representation of its translation. In particular, to predict which word in the target language is the most likely translation of the word w from the source language, we can then simply take the word z whose vector z is closest to the prediction xW.

The restriction to linear mappings might intuitively seem overly strict. However, it was found that higher-quality alignments can be found by being even more restrictive. In particular, Xing, Wang, Liu, and Lin (2015) suggested to normalize the word vectors in the monolingual spaces, and restrict the matrix W to an orthogonal matrix (i.e., imposing the constraint that WW^T = I). Under this restriction, the optimization problem (1) is known as the orthogonal Procrustes problem, whose exact solution can still be computed efficiently. Another approach was taken by Faruqui and Dyer (2014), who proposed to learn linear transformations W_s and W_t, which respectively map vectors from the source and target language word embeddings onto a shared vector space. They used Canonical Correlation Analysis to find the transformations W_s and W_t which minimize the dimension-wise covariance between XW_s and ZW_t, where X is a matrix whose rows are x_1, ..., x_n and similarly Z is a matrix whose rows are z_1, ..., z_n. Note that while the aim of Xing et al. (2015) is to avoid making changes to the cosine similarities between word vectors from the same language, Faruqui and Dyer (2014) specifically want to take into account information from the other language with the aim of improving the monolingual embeddings themselves. Artetxe et al. (2016) propose a model which combines ideas from Xing et al. (2015) and Faruqui and Dyer (2014). Specifically, they use the formulation in (1) with the constraint that W be orthogonal, as in Xing et al. (2015), but they also apply a preprocessing strategy called mean centering which is closely related to the model from Faruqui and Dyer (2014).
On top of this, in (Artetxe, Labaka, & Agirre, 2018a) they propose a multi-step framework in which they experiment with several pre-processing and post-processing strategies. These include whitening (which involves applying a linear transformation to the word vectors such that their covariance matrix is the identity matrix), re-weighting each coordinate according to its cross-correlation (which means that the relative importance of those coordinates with the strongest agreement between both languages is increased), de-whitening (i.e., inverting the whitening step to restore the original covariances), and a dimensionality reduction step, which is seen as an extreme form of re-weighting (i.e., those coordinates with the least agreement across both languages are simply dropped). They also consider the possibility of using orthogonal mappings from both embedding spaces into a shared space, rather than mapping one embedding space onto the other, where the objective is based on maximizing cross-covariance. Other approaches that have been proposed for aligning monolingual word embedding spaces include models which replace (1) with a max-margin objective (Lazaridou, Dinu, & Baroni, 2015) and models which rely on neural networks to learn non-linear transformations (Lu, Wang, Bansal, Gimpel, & Livescu, 2015).

A central requirement of the aforementioned methods is that they need a sufficiently large bilingual dictionary. Several approaches have been proposed to address this limitation, showing that high-quality results can be obtained in a purely unsupervised way. For instance, Artetxe et al. (2017) propose a method that can work with a small synthetic seed dictionary, e.g., only containing pairs of identical numerals (1,1), (2,2), (3,3), etc. To this end, they alternatingly use the current dictionary to learn a corresponding orthogonal transformation and then use the learned cross-lingual embedding to improve the synthetic dictionary. This improved dictionary is constructed by assuming that the translation of a given word w is the nearest neighbor of xW among all words from the target language. This approach was subsequently improved in (Artetxe et al., 2018b), where state-of-the-art results were obtained without even assuming the availability of a synthetic seed dictionary. The key idea underlying their approach, called VecMap, is to initialize the seed dictionary in a fully unsupervised way, based on the idea that the histogram of similarity scores between a given word w and the other words from the source language should be similar to the histogram of similarity scores between its translation z and the other words from the target language. Another approach which aims to learn bilingual word embeddings in a fully unsupervised way, called MUSE, is proposed in (Conneau et al., 2018a). The main difference with VecMap lies in how the initial seed dictionary is learned. For this purpose, MUSE relies on adversarial training (Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, & Bengio, 2014), similarly to earlier models (Barone, 2016; Zhang, Liu, Luan, & Sun, 2017a) but using a simpler formulation, based on the model in (1) with the orthogonality constraint on W.
The main intuition is to choose W such that it is difficult for a classifier to distinguish between word vectors z sampled from the target word embedding and vectors xW, with x sampled from the source word embedding. There have been other approaches to create this initial bilingual dictionary without supervision via adversarial training (Zhang, Liu, Luan, & Sun, 2017b; Hoshen & Wolf, 2018; Xu, Yang, Otani, & Wu, 2018) or stochastic processes (Alvarez-Melis & Jaakkola, 2018), but their performance has not generally surpassed existing methods (Artetxe et al., 2018b; Glavaš, Litschko, Ruder, & Vulić, 2019).

In this work, we make use of the three mentioned variants of VecMap, namely the supervised implementation based on the multi-step framework from Artetxe et al. (2018a), which will be referred to as VecMap_multistep, the orthogonal method (VecMap_ortho) (Artetxe et al., 2016) and its unsupervised version (VecMap_uns) (Artetxe et al., 2018b). Similarly, we will consider the supervised and unsupervised variants of MUSE (MUSE and MUSE_uns, respectively) (Conneau et al., 2018a). In the next section we present our proposed post-processing method, based on an unconstrained linear transformation, to improve the results of the previous methods.
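Before moving on, it may help to make the two basic mappings discussed in this section concrete. The following minimal sketch (our own illustration, not code from any of the cited systems) computes the unconstrained solution of objective (1) and its orthogonally constrained counterpart, the latter via the standard SVD-based solution of the orthogonal Procrustes problem; X and Z are assumed to hold the dictionary vectors x_1, ..., x_n and z_1, ..., z_n as rows:

```python
import numpy as np

def least_squares_map(X, Z):
    """Unconstrained solution of objective (1): argmin_W ||XW - Z||^2."""
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W

def procrustes_map(X, Z):
    """Solution of objective (1) under the constraint W W^T = I
    (the orthogonal Procrustes problem of Xing et al., 2015)."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

# Toy data: rows of X and Z play the role of dictionary vector pairs.
rng = np.random.default_rng(0)
X, Z = rng.normal(size=(1000, 300)), rng.normal(size=(1000, 300))
W = procrustes_map(X, Z)
assert np.allclose(W @ W.T, np.eye(300))  # the learned mapping is orthogonal
```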
3. Fine-tuning Cross-lingual Embeddings by Meeting in the Middle
After the initial alignment of the monolingual spaces, we propose to apply a post-processing step which aims to bring the two monolingual spaces closer together. To this end, we learn an unconstrained linear transformation that maps word vectors from one space onto the average of that word vector and the vector representation of its translation (according to a given bilingual dictionary). This approach, which we call Meemi (Meeting in the middle), is illustrated in Figure 1. In particular, the figure illustrates the two-step nature of our method: we first learn an orthogonal transformation (using VecMap or MUSE), which aligns the two monolingual spaces as much as possible without changing their internal structure. Then, our approach aims to find a middle ground between the two resulting monolingual spaces. This involves applying a non-orthogonal transformation to both monolingual spaces. However, because we start from aligned spaces, the changes which are made by this transformation are relatively small. Our transformation is thus intuitively fine-tuning the usual orthogonal transformation, rather than replacing it. Note that this approach can naturally be applied to more than two monolingual spaces (Section 3.2). First, however, we will consider the standard bilingual case.
Figure 1: Step-by-step integration of two monolingual embedding spaces: (1) obtaining isolated monolingual spaces, (2) aligning these spaces through an orthogonal linear transformation, and (3) mapping both spaces using an unconstrained linear transformation learned on the averages of translation pairs.

3.1 The Bilingual Case

Let D be the given bilingual dictionary, encoded as a set of word pairs (w, w′). Using the pairs in D as training data, we learn a linear mapping X such that wX ≈ (w + w′)/2 for all (w, w′) ∈ D, where we write w for the vector representation of word w in the given (aligned) monolingual space. This mapping X can then be used to predict the averages for words outside the given dictionary. To find the mapping X, we solve the following least squares linear regression problem:

$$E = \sum_{(w, w') \in D} \left\lVert \mathbf{w}X - \frac{\mathbf{w} + \mathbf{w}'}{2} \right\rVert^2 \qquad (2)$$

Similarly, we separately learn a mapping X′ such that w′X′ ≈ (w + w′)/2.

We also consider a weighted variant of Meemi where the linear model is trained on weighted averages based on word frequency. Specifically, let f_w be the occurrence count of word w in the corresponding monolingual corpus; then (w + w′)/2 is replaced by:

$$\frac{f_w \mathbf{w} + f_{w'} \mathbf{w}'}{f_w + f_{w'}} \qquad (3)$$

The intuition behind this weighted model is that the word w might be much more prevalent in the first language than the word w′ is in the second language. A clear example is when w = w′, which may be the case, among others, if w is a named entity. For instance, suppose that w is the name of a Spanish city. Then, we may expect to see more occurrences of w in a Spanish corpus than in an English corpus. In such cases, it may be beneficial to consider the word vector obtained from the Spanish corpus to be of higher quality, and thus give more weight to it in the average.

We will write Meemi(M) to refer to the model obtained by applying Meemi after the base method M, where M may be any variant of VecMap or MUSE. Similarly, we will write Meemi_w(M) in those cases where the weighted version of Meemi was used.

3.2 The Multilingual Case

To apply Meemi in a multilingual setting, we exploit the fact that bilingual orthogonal methods such as VecMap (without re-weighting) and MUSE do not modify the target monolingual space but only apply an orthogonal transformation to the source. Hence, by simply applying this method to multiple language pairs while fixing the target language (i.e., for languages l_1, l_2, ..., l_n, we construct pairs of the form (l_i, l_n) with i ∈ {1, ..., n−1}), we can obtain a multilingual space in which all of the corresponding monolingual models are aligned with, or mapped onto, the same target embedding space. Note, however, that if we applied a re-weighting strategy, as suggested in Artetxe et al. (2018a) for VecMap, the target space would no longer remain fixed for all source languages and would instead change depending on the source in each case. While most previous work has been limited to bilingual settings, multilingual models involving more than two languages have already been studied by Ammar, Mulcaire, Tsvetkov, Lample, Dyer, and Smith (2016), who used an approach based on Canonical Correlation Analysis.
As in our approach, they also fix one specific language as the reference language.

Formally, let D be the given multilingual dictionary, encoded as a set of tuples (w_1, w_2, ..., w_n), where n is the number of languages. Using the tuples in D as training data, we learn a linear mapping X_i for each language, such that w_i X_i ≈ (w_1 + ... + w_n)/n for all (w_1, ..., w_n) ∈ D. This mapping X_i can then be used to predict the averages for words in the i-th language outside the given dictionary. To find the mappings X_i, we solve the following least squares linear regression problem for each language:

$$E_{\text{multi}} = \sum_{(w_1, \ldots, w_n) \in D} \left\lVert \mathbf{w}_i X_i - \frac{\mathbf{w}_1 + \cdots + \mathbf{w}_n}{n} \right\rVert^2 \qquad (4)$$

Note that while a weighted variant of this model can straightforwardly be formulated, we will not consider this in the experiments.
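The following is a minimal sketch of Equations (2)-(4), under the assumption that the aligned monolingual spaces are given as NumPy matrices and that the dictionary is encoded as row indices into those matrices (the function and variable names are ours, not part of the released code):

```python
import numpy as np

def meemi(spaces, dict_idx, freqs=None):
    """Meemi fine-tuning: learn one unconstrained linear map per language
    that sends each dictionary word towards the (weighted) average of its
    translations, then apply it to the whole space.

    spaces:   list of k aligned embedding matrices, each of shape (V_i, d).
    dict_idx: (n, k) integer array; row j holds the row index of the j-th
              dictionary tuple in each of the k spaces.
    freqs:    optional list of (V_i,) word-frequency arrays; if given,
              the weighted average of Equation (3) is used instead.
    """
    # Dictionary vectors per language, stacked into shape (k, n, d).
    W = np.stack([S[idx] for S, idx in zip(spaces, dict_idx.T)])
    if freqs is None:
        target = W.mean(axis=0)               # (w_1 + ... + w_k) / k
    else:
        f = np.stack([fr[idx] for fr, idx in zip(freqs, dict_idx.T)])
        target = (f[..., None] * W).sum(axis=0) / f.sum(axis=0)[..., None]
    # One least-squares problem per language: w_i X_i ~ target average.
    maps = [np.linalg.lstsq(Wi, target, rcond=None)[0] for Wi in W]
    return [S @ Xi for S, Xi in zip(spaces, maps)]
```

With k = 2 and no frequency weighting this yields the bilingual mappings X and X′ of Equation (2); with a tuple dictionary over n languages it implements Equation (4).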
4. Experimental Setting
In this section we explain the common training settings for all experiments. First, the monolingual corpora that were used, as well as other training details that pertain to the initial monolingual embeddings, are discussed in Section 4.1. Then, in Section 4.2 we explain which bilingual and multilingual dictionaries were used as supervision signals. Finally, all compared systems are listed in Section 4.3.
4.1 Corpora and Monolingual Embeddings

Instead of using comparable corpora such as Wikipedia, as in much of the previous work (Artetxe et al., 2017; Conneau et al., 2018a), we make use of independent corpora extracted from the web. This represents a more realistic setting where alignments are harder to obtain, as already noted by Artetxe et al. (2018b). For English we use the UMBC WebBase Corpus (Han, Kashyap, Finin, Mayfield, & Weese, 2013), containing over 3 billion words. For Spanish we used the Spanish Billion Words Corpus (Cardellino, 2016), consisting of over a billion words. For Italian and German, we use the itWaC and sdeWaC corpora from the WaCky project (Baroni, Bernardini, Ferraresi, & Zanchetta, 2009), containing 2 and 0.8 billion words, respectively. For Finnish and Russian, we use their corresponding Common Crawl monolingual corpora from the Machine Translation of News Shared Task, composed of 2.8B and 1.1B words, respectively. Finally, for Farsi we leverage the newswire Hamshahri corpus (AleAhmad, Amiri, Darrudi, Rahgozar, & Oroumchian, 2009), composed of almost 200M words.

2. The same English, Spanish, and Italian corpora are used as input corpora for the hypernym discovery SemEval task (Section 6.1).

In a preprocessing step, all corpora were tokenized using the Stanford tokenizer (Manning, Surdeanu, Bauer, Finkel, Bethard, & McClosky, 2014) and lowercased. Then we trained FastText word embeddings (Bojanowski et al., 2017) on the preprocessed corpora for each language. The dimensionality of the vectors was set to 300, using the default values for the remaining hyperparameters.
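As an illustration of this training setup, a roughly equivalent configuration using gensim's FastText implementation (a stand-in assumption on our part; the reference fastText tool can be configured in the same way) might look as follows, with a hypothetical corpus file name:

```python
from gensim.models import FastText

class Corpus:
    """Stream the tokenized, lowercased corpus one sentence per line."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

sentences = Corpus("corpus.tokenized.lc.txt")  # hypothetical file name
model = FastText(vector_size=300)              # 300-d vectors, defaults elsewhere
model.build_vocab(corpus_iterable=sentences)
model.train(corpus_iterable=sentences,
            total_examples=model.corpus_count, epochs=model.epochs)
model.wv.save("fasttext.vectors.kv")           # keep only the word vectors
```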
4.2 Training Dictionaries

We use the training dictionaries provided by Conneau et al. (2018a) as supervision. These bilingual dictionaries were compiled using the internal translation tools from Facebook. To make the experiments comparable across languages, we randomly extracted 8,000 training pairs from these splits for all language pairs considered, as this is the size of the smallest available dictionary. For completeness we also present results for fully unsupervised systems (see the following section), which do not take advantage of any dictionaries.
4.3 Compared Systems

We have trained both bilingual and multilingual models involving up to seven languages. In the bilingual case, we consider the supervised and unsupervised variants of VecMap and MUSE to obtain the base alignments and then apply plain Meemi and weighted Meemi on the results. For supervised VecMap we compare with its orthogonal version VecMap_ortho and the multi-step procedure VecMap_multistep. For the multilingual case we follow the procedure described in Section 3.2, making use of all seven languages considered in the evaluation, i.e., English, Spanish, Italian, German, Finnish, Farsi, and Russian. Note that in the bilingual case all three variants of VecMap can be used, whereas in the multilingual setting we can only use VecMap_ortho.
5. Intrinsic evaluation
In this section we assess the intrinsic performance of our post-processing techniques in cross-lingual (Section 5.1) and monolingual (Section 5.2) settings.
5.1 Cross-lingual Evaluation

We evaluate the performance of all compared cross-lingual embedding models on standard purely cross-lingual tasks, namely dictionary induction (Section 5.1.1) and cross-lingual word similarity (Section 5.1.2).
5.1.1 Bilingual Dictionary Induction

Also referred to as word translation, this task consists in automatically retrieving the word translations in a target language for words in a source language. Acting on the corresponding cross-lingual embedding space which integrates the two (or more) languages of a particular test case, we obtain the nearest neighbors to the source word in the target language as our translation candidates. The performance is measured with precision at k (P@k), defined as the proportion of test instances where the correct translation candidate for a given source word was among the k highest ranked candidates. The nearest neighbors ranking is obtained by using cosine similarity as the scoring function. For this evaluation we use the corresponding test dictionaries released by Conneau et al. (2018a).
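This retrieval and scoring procedure can be sketched in a few lines; the illustration below (ours, with hypothetical variable names) assumes the aligned source and target embedding matrices and a test dictionary given as row-index pairs:

```python
import numpy as np

def precision_at_k(src, tgt, test_pairs, k=10):
    """P@k for dictionary induction: the proportion of test source words
    whose gold translation is among the k nearest target neighbours
    under cosine similarity.

    src, tgt:   aligned (V_src, d) and (V_tgt, d) embedding matrices.
    test_pairs: iterable of (source_row, gold_target_row) index pairs.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    hits, total = 0, 0
    for s, gold in test_pairs:
        sims = tgt @ src[s]                    # similarity to all target words
        top_k = np.argpartition(-sims, k)[:k]  # k highest-ranked candidates
        hits += gold in top_k
        total += 1
    return hits / total
```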
Table 1: P@k performance of different cross-lingual embedding models in the bilingual dictionary induction task.

We show the results attained by a wide array of models in Table 1, where we can observe that the best figures are generally obtained by Meemi over the bilingual VecMap models. Not surprisingly, the impact of Meemi is more apparent when used in combination with the orthogonal base models. On the other hand, using the weighted version of Meemi (i.e., Meemi_w) does not seem to be particularly beneficial on this task, with the only exception of English-Farsi. In general, the performance of unsupervised models (i.e., VecMap_uns and MUSE_uns) is competitive in closely-related languages such as English-Spanish or English-German, but they considerably under-perform for distant languages, especially English-Finnish and English-Russian. Finally, the results obtained by the multilingual model that includes all seven languages considered, i.e., Meemi-multi (VecMap_ortho), improve over the base orthogonal model, but they do not improve over the results of our bilingual model. We further discuss the impact of adding languages to the multilingual model in Section 7.3.

5.1.2 Cross-lingual Word Similarity

Cross-lingual word similarity constitutes a straightforward benchmark to test the quality of bilingual embeddings. In this case, and in contrast to monolingual similarity, the words in a given pair (a, b) belong to different languages, e.g., a belonging to English and b to Farsi. For this task we make use of the SemEval-17 multilingual similarity benchmark (Camacho-Collados, Pilehvar, Collier, & Navigli, 2017), considering in particular the four cross-lingual datasets that include English as target language, but discarding multi-word expressions. Performance is computed in terms of Pearson and Spearman correlation with respect to the gold standard.

Table 2 shows the results of the different embedding models in the cross-lingual word similarity task. Except in a few cases for the VecMap_multistep model, our Meemi transformation proves superior to the base models, and to all their unsupervised variants. In Farsi, which is the most distant language with a different alphabet, the results are lower overall, but in this case our Meemi transformation proves essential, outperforming the best VecMap model by almost four percentage points (from 36.5% to 40.4%).
Similarly to the bilingual dictionary induction task, the weighted version of Meemi proves robust only on English-Farsi, which suggests that this weighting scheme is most useful for distant languages: in this case the Farsi monolingual space (which is learned from a smaller corpus and hence, as we will see in the next section, has a lower quality) gets closer to the English monolingual space. As far as the multilingual model is concerned, it proves beneficial in all cases with respect to the orthogonal version of VecMap, as well as compared to the bilingual variant of Meemi.

5.2 Monolingual Word Similarity

One of the advantages of breaking the orthogonality of the transformation is the potential to improve the monolingual quality of the embeddings. To test the difference between the original word embeddings and the embeddings obtained after applying the Meemi transformation, we take monolingual word similarity as a benchmark. Given a word pair, this task consists in assessing the semantic similarity between both words in the pair, in this case from the same language. The evaluation is then performed in terms of Spearman and Pearson correlation with respect to human judgements. In particular, we use the monolingual datasets (English, Spanish, Italian, German, and Farsi) from the SemEval-17 task on multilingual word similarity. The results provided by the original monolingual FastText embeddings are also reported as a baseline.

Table 3 shows the results on the monolingual word similarity task. In this task our multilingual model representing seven languages in a single space clearly stands out, obtaining the best overall results for English, Spanish and Italian, and improving over the base VecMap_ortho model on the rest. With the exception of German, where the multi-step framework of Artetxe et al. (2018a) proves most effective, the plain Meemi transformation improves over the base models, for both VecMap and MUSE.
Table 2: Cross-lingual word similarity results in terms of Pearson (r) and Spearman (ρ) correlation. Language codes: English-EN, Spanish-ES, Italian-IT, German-DE, and Farsi-FA.
Table 3: Monolingual word similarity results in terms of Pearson (r) and Spearman (ρ) correlation.
6. Extrinsic evaluation
We complement the intrinsic evaluation experiments, which are typically a valuable source for understanding the properties of the vector spaces, with downstream extrinsic cross-lingual tasks. This evaluation is especially necessary given that intrinsic behaviour does not always correlate well with downstream performance (Bakarov, Suvorov, & Sochenkov, 2018; Glavaš et al., 2019). In particular, for this extrinsic evaluation we will focus on the following question: how does our post-processing method help alleviate limitations of cross-lingual models that are due to their use of orthogonality constraints? To this end, we perform experiments with the orthogonal model of VecMap (i.e., VecMap_ortho), in combination with the proposed Meemi strategy, both in bilingual and multilingual settings. For the latter case, we considered all six source languages, i.e., Spanish, Italian, German, Finnish, Farsi, and Russian, keeping English as the target language. The tasks considered are cross-lingual hypernym discovery (Section 6.1) and cross-lingual natural language inference (Section 6.2).

6.1 Cross-lingual Hypernym Discovery

Hypernymy is an important lexical relation, which, if properly modeled, directly impacts downstream NLP tasks such as semantic search (Hoffart, Milchevski, & Weikum, 2014; Roller & Erk, 2016), question answering (Prager, Chu-Carroll, Brown, & Czuba, 2008; Yahya, Berberich, Elbassuoni, & Weikum, 2013) or textual entailment (Geffet & Dagan, 2005). Hypernyms, in addition, are the backbone of taxonomies and lexical ontologies (Yu, Wang, Lin, & Wang, 2015), which are in turn useful for organizing, navigating, and retrieving online content (Bordea, Lefever, & Buitelaar, 2016). We propose to evaluate the quality of a range of cross-lingual vector spaces in the extrinsic task of hypernym discovery, i.e., given an input word (e.g., "cat"), retrieve or discover its most likely (set of) valid hypernyms (e.g., "animal", "mammal", "feline", and so on). Intuitively, by leveraging a bilingual vector space condensing the semantics of two languages, one of them being English, the need for large amounts of training data in the target language may be reduced.

The base model is a (cross-lingual) linear transformation trained with hyponym-hypernym pairs (Espinosa-Anke, Camacho-Collados, Delli Bovi, & Saggion, 2016), which is afterwards used to predict the most likely (set of) hypernyms given a new term (sketched in the code example below). Training and evaluation data come from the SemEval 2018 Shared Task on Hypernym Discovery (Camacho-Collados, Delli Bovi, Espinosa-Anke, Oramas, Pasini, Santus, Shwartz, Navigli, & Saggion, 2018). Note that current state-of-the-art systems aimed at modeling hypernymy (Shwartz, Goldberg, & Dagan, 2016; Bernier-Colborne & Barriere, 2018) combine large amounts of annotated data along with language-specific rules and cue phrases such as Hearst Patterns (Hearst, 1992), both of which are generally scarcely (if at all) available for languages other than English. As a reference, we have included the best performing unsupervised system for both Spanish and Italian (we will refer to this baseline as BestUns). This unsupervised baseline is based on the distributional models described in Shwartz, Santus, and Schlechtweg (2017).

As such, we report experiments (Table 4) with training data only from English (11,779 hyponym-hypernym pairs), and enriched models informed with relatively few training pairs (500, 1K, and 2K) from the target languages.
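As a rough sketch of the base model described above (our own illustration; the implementation of Espinosa-Anke et al. (2016) differs in its details), the transformation can be fitted by least squares on the training pairs, after which candidate hypernyms are ranked by cosine similarity to the projected query vector:

```python
import numpy as np

def train_hypernym_map(hypo_vecs, hyper_vecs):
    """Fit a linear map M such that hypo_vecs @ M approximates hyper_vecs,
    by least squares over the hyponym-hypernym training pairs."""
    M, *_ = np.linalg.lstsq(hypo_vecs, hyper_vecs, rcond=None)
    return M

def discover_hypernyms(query_vec, M, candidate_vecs, candidate_words, top_n=5):
    """Rank candidate hypernyms by cosine similarity to the projected query."""
    pred = query_vec @ M
    sims = candidate_vecs @ pred / (
        np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(pred) + 1e-9)
    return [candidate_words[i] for i in np.argsort(-sims)[:top_n]]
```

In the cross-lingual setting, the training pairs may come from English (and optionally a few target-language pairs), while queries and candidates come from the target language side of the shared space.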
Evaluation is conducted with the same metrics as in the original SemEval task, i.e., Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and precision at 5 (P@5). These measures explain the behavior of a model from complementary prisms, namely how often at least one valid hypernym was highly ranked (MRR), and, in cases where there is more than one correct hypernym, to what extent they were all correctly retrieved (MAP and P@5). We report comparative results between the following systems: VecMap_uns (the unsupervised variant), VecMap_ortho (the orthogonal transformation variant), VecMap_multistep (the supervised multi-stage variant) and three Meemi variants: Meemi (VecMap), Meemi_w (VecMap) and Meemi-multi (VecMap).

Table 4: Cross-lingual hypernym discovery results. In this case, VecMap = VecMap_ortho.

The first noticeable trend is the better performance of the unsupervised VecMap version versus its supervised orthogonal and multi-step counterparts. Nevertheless, we find remarkably consistent gains over both VecMap variants when applying Meemi, across all configurations for the two language pairs considered. In fact, the weighted (Meemi_w) version brings an increase in performance of between 1 and 2 MRR and MAP points across the whole range of target language supervision (from zero to 2K pairs). This is in contrast to the intrinsic evaluation, where the weighted model did not seem to provide noticeable improvements over the plain version of Meemi. Finally, concerning the fully multilingual model, the experimental results suggest that, while still better than the orthogonal baselines, it falls short when compared to the weighted bilingual version of Meemi. This result suggests that exploring weighting schemes for the multilingual setting may bring further gains, but we leave this extension for future work.

6.2 Cross-lingual Natural Language Inference

The task of natural language inference (NLI) consists in detecting entailment, contradiction or neutral relations in pairs of sentences. In our case, we test a zero-shot cross-lingual transfer setting where a system is trained with English corpora and is then evaluated on a different language. We base our approach on the assumption that better aligned cross-lingual embeddings should lead to better NLI models, and that the impact of the input embeddings may become more apparent in simple methods, as opposed to, for instance, complex neural network architectures. Hence, and also to account for the coarser linguistic granularity of this task (being a sentence classification problem rather than a word-level one), we employ a simple bag-of-words approach where a sentence embedding is obtained through word vector averaging. We then train a linear classifier to predict one of the three possible labels in this task, namely entailment, contradiction or neutral. We use the full MultiNLI English corpus (Williams, Nangia, & Bowman, 2018) for training and the Spanish and German test sets from XNLI (Conneau, Rinott, Lample, Williams, Bowman, Schwenk, & Stoyanov, 2018b) for testing. For comparison, we also include a lower bound obtained by considering English monolingual embeddings as input; in this case FastText trained on the UMBC corpus, which is the same model used to obtain the multilingual embeddings.
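A minimal sketch of this pipeline is shown below. How premise and hypothesis embeddings are combined is not specified above, so the concatenation used here, as well as the scikit-learn logistic regression standing in for the linear classifier, are assumptions of ours:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_embedding(tokens, word_vectors, dim=300):
    """Bag-of-words sentence embedding: average of the known word vectors."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def featurize(pairs, word_vectors):
    """Represent a premise/hypothesis pair by concatenating their averages."""
    return np.array([np.concatenate([sentence_embedding(p, word_vectors),
                                     sentence_embedding(h, word_vectors)])
                     for p, h in pairs])

# Train on English MultiNLI pairs, then test zero-shot on XNLI pairs from
# another language; both sides are featurized in the same cross-lingual space.
clf = LogisticRegression(max_iter=1000)
# clf.fit(featurize(mnli_pairs, crossling_vectors), mnli_labels)
# accuracy = clf.score(featurize(xnli_pairs, crossling_vectors), xnli_labels)
```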
Accuracy results are shown in Table 5. The main conclusion in light of these results is the remarkable performance of the unsupervised VecMap model and, most notably, of multilingual Meemi for both Spanish and German, clearly outperforming the orthogonal bilingual mapping baseline. Our results are encouraging for two reasons: first, they suggest that, at least for this task, collapsing several languages into a unified vector space is better than performing pairwise alignments; and second, they show the inherent benefit of having one single model accounting for an arbitrary number of languages.

Table 5: Accuracy on the XNLI task using different cross-lingual embeddings as features.

4. The codebase for these experiments is that of SentEval (Conneau & Kiela, 2018).
7. Analysis
We complement our quantitative (intrinsic and extrinsic) evaluations with a qualitative analysis which aims at discovering the most salient properties of the transformation performed by Meemi and their linguistic implications. We perform a qualitative analysis with examples in Section 7.1, as well as an analysis of the impact of the size of the training dictionaries in Section 7.2 and of the performance of the multilingual model in Section 7.3.

7.1 Qualitative Analysis

Table 6 lists a number of examples where, for a source English word, we explore its highest ranked cross-lingual synonyms (or word translations) in a target language. We select Spanish as a use case.

crazy
  VecMap:      loco, tonto, enloquecere, locos, enloqueci
  Meemi:       loco, loca, enloquecí, enloquecías, locos
  Meemi-multi: chifladas, locos, loca, estúpidas, alocadas

telegraph
  VecMap:      telégrafo, telégrafos, telegráfico, telegráfica, telegrafo
  Meemi:       telegráfico, telégrafo, telegráfono, telegraf, telegráfo
  Meemi-multi: telegraph, telegraaf, telegraphone, telegráfono, telégrafo

conventions
  VecMap:      convenciones, internacional7, convención, 1961naciones, internacionales3
  Meemi:       internaciones, 1972naciones, protocolos, convenios, 1961naciones
  Meemi-multi: convenios, reglas, convención, normas, legislacionesnacionales

discover
  VecMap:      descubrirá, descubr, descubrirán, descubren, descubriron
  Meemi:       descubre, descubrir, descubriendo, descubra, descubrira
  Meemi-multi: descubri, descubrirán, descubrirnos, descubrira, descubrire

remarks
  VecMap:      astrométricos, observacionales, astrométricas, astronométricas, predicciones
  Meemi:       lobservaciones, mediciones, lasobservaciones, deobservaciones, susobservaciones
  Meemi-multi: observaciones, observacionales, observacional, predicciones, mediciones

lyon
  VecMap:      rocquigny, rémilly, martignac, beaubois, chambourcy
  Meemi:       beaubois, bourgmont, marcigny, rémilly, jacquemont
  Meemi-multi: marcigny, lyon, pierreville, jacquemont, beaubois
Table 6: Word translation examples from English to Spanish, comparing VecMap with the bilingual and multilingual variants of Meemi. For each source word, we show its five nearest cross-lingual synonyms. Bold translations are correct, according to the source test dictionary (cf. Section 5.1.1).

Let us study the examples listed in Table 6, as they constitute illustrative cases of linguistic phenomena which go beyond correct or incorrect translations. First, the word 'crazy' is correctly translated by both VecMap and Meemi; loco (masculine singular), locos (masculine plural) or loca (feminine) being standard translations, with no further connotations, of the source word. However, the most interesting finding lies in the fact that for Meemi-multi, the preferred translation is a colloquial (or even vulgar) translation which was not considered as correct in the gold test dictionary: the Spanish word chifladas translates to English as 'going mental' or 'losing it'. Similarly, we would like to highlight the case of 'telegraph'. This word is used in two major senses, namely to refer to a message transmitter and as a reference to media outlets (several newspapers have the word 'telegraph' in their name). VecMap and Meemi (correctly) translate this word into the common translation telégrafo (the transmission device), whereas Meemi-multi prefers its named-entity sense. Other cases, such as 'conventions' and 'discover', are examples which illustrate the behaviour for common ambiguous nouns. In both cases, candidate translations are either misspellings of the correct translation (descubr for 'discover'), or misspellings involving tokens conflating two words whose compositional meaning is actually a correct candidate translation for the source word; e.g., legislacionesnacionales ('national rulings') for 'conventions'. Finally, 'remarks' offers an example of a case where ambiguity causes major disruptions. In particular, 'remark' translates in Spanish to observación, which in turn has an astronomical sense; 'astronomical observatory' translates to observatorio astronómico.

7.2 Impact of the Training Dictionary Size

Throughout this article we have discussed the role of different supervision signals in both intrinsic and extrinsic tasks. We provide the reader with a visual illustration of this phenomenon, with the task of cross-lingual word similarity as a use case. Figure 2 shows the absolute improvement (in percentage points) over VecMap obtained by applying Meemi, using different training dictionary sizes for supervision.

As can be observed, Meemi obtains improvements in all cases with the dictionaries of 8K, 5K and 3K word pairs, but its performance drops heavily with dictionaries of smaller sizes (i.e., 1K and especially 100). In fact, having a larger dictionary helps avoid overfitting, which is a recurring problem in cross-lingual word embedding learning (Zhang et al., 2017a). The most remarkable case is that of Farsi, where Meemi improves the most, but where a large dictionary becomes even more important. This behavior clearly shows under which conditions our proposed final transformation can be applied with higher success rates. We leave exploring larger dictionaries and their impact on different tasks and languages for future work.
7.3 Multilingual Integration

In this section we assess the benefits of our proposed multilingual integration (cf. Section 3.2). To this end, we measure fluctuations in performance as more languages are added to the initially bilingual model. Thus, starting from a bilingual embedding space obtained with VecMap_ortho, we apply Meemi over a number of aligned spaces, which ultimately leads to a fully multilingual space containing the following languages: Spanish, Italian, German, Finnish, Farsi, Russian, and English. This latter language is used as the target embedding space for the orthogonal transformations due to it being the richest in terms of resource availability.

To avoid a lengthy and overly exhaustive approach where all possible combinations from two to seven languages are evaluated, we opted for conducting an experiment where languages are added one by one in a fixed order, starting from the languages which are closer to English in terms of language family and alphabet (i.e., Spanish, Italian, German and then Finnish, Farsi and Russian). However, this approach does not allow us to use, for example, the English-Farsi test set until reaching the fifth step. To solve this, if the language that is needed for the test set has not yet been included, we replace the last language that was added by the one that is needed for the test set. For instance, while we normally add Italian as the second source language (resulting in the trilingual space en-es-it), for the English-German test set the results are instead based on a space where we added German instead of Italian (i.e., the trilingual space en-es-de). In Table 7 we show the results obtained by the multilingual models in bilingual dictionary induction.

The best results are achieved when more than two languages are involved in the training, which correlates with the results obtained in the rest of the tasks and highlights the ability of Meemi to successfully exploit multilingual information to improve the quality of the embedding models involved. In general, the performance fluctuates more significantly when adding the first language to the bilingual models and then stabilizes at a level similar to the bilingual case when adding more distant languages.
Figure 2: Absolute improvement (in terms of Pearson correlation percentage points) obtained by applying Meemi over the two base orthogonal models, (a) VecMap and (b) MUSE, on the cross-lingual word similarity task, with different training dictionary sizes. As data points on the X-axis we selected 100, 1000, 3000, 5000 and 8000 word pairs in the dictionary.

Table 7: Dictionary induction results obtained with the multilingual extension of Meemi over VecMap_ortho. The sequence in which source languages are added to the multilingual models is: Spanish, Italian, German, Finnish, Farsi, and Russian (English is the target). The x indicates the use of the test language in each case (if the test language is already included, the following language in the sequence is added). We also include the scores of the original VecMap_ortho as a baseline.
8. Conclusion
In this article, we have presented an extended study of Meemi, a simple post-processing method for improving cross-lingual word embeddings which was first presented in Doval et al. (2018). Our initial goal was to learn improved bilingual alignments from those obtained by state-of-the-art cross-lingual methods such as VecMap (Artetxe et al., 2018a) or MUSE (Conneau et al., 2018a). We do this by applying a final unconstrained linear transformation to their initial mappings. In this work, we have also gone beyond the bilingual setting by exploring an extension of the original Meemi model to align embeddings from an arbitrary number of languages in a single shared vector space. In particular, we take advantage of the fact that, assuming the initial alignment was obtained with an orthogonal mapping, Meemi can naturally be applied to any number of languages through a single linear transformation per language.

Regarding the evaluation, we extended the language set to include, in addition to the usual Indo-European languages such as English, Spanish, Italian or German, other distant languages such as Finnish, Farsi, and Russian. The results we report in this article show that Meemi is highly competitive, consistently yielding better results than competing baselines, especially in the case of distant languages. We are particularly encouraged by the multilingual results, which show that bringing together distant languages from different families in a shared vector space appears to be beneficial in most cases.
9. Future Work
We will continue to explore the possibilities of post-processing multilingual models, investigating their impact in different tasks. Given that going from restrictive orthogonal transformations to the more unconstrained Meemi transformation seems clearly beneficial for the integration of monolingual models, it remains to be seen whether some form of constrained non-linear transformation can be successfully applied on top of the current models obtained with Meemi.

Finally, the possibilities that are opened up by multilingual models have not been fully explored in this article, nor in the recent literature more generally. A key (and perhaps obvious) advantage of learning a multilingual model is the fact that more than one language can be used for training. This has implications, for instance, in cases where there is training data for more than one language beyond the target language. For instance, we may consider a scenario where annotated data could be easily obtained for English and German, but not for Farsi, and where we would want to combine the available training data from the two aforementioned languages to train a model for Farsi. Such a combination of languages can alleviate one of the main problems in cross-lingual transfer, namely dealing with distant languages, especially those with different word orderings, as shown in a recent case study on dependency parsing (Ahmad, Zhang, Ma, Hovy, Chang, & Peng, 2019).
Acknowledgments
Yerai Doval is supported by the Spanish Ministry of Economy, Industry, and Competitiveness (MINECO) through projects FFI2014-51978-C2-2-R, TIN2017-85160-C2-1-R, and TIN2017-85160-C2-2-R; the Spanish State Secretariat for Research, Development, and Innovation (which belongs to MINECO) and the European Social Fund (ESF) under an FPI fellowship (BES-2015-073768) associated to project FFI2014-51978-C2-1-R; and by the Galician Regional Government under project ED431D 2017/12. This work was partly supported by ERC Starting Grant 637277.
References
Ahmad, W. U., Zhang, Z., Ma, X., Hovy, E., Chang, K.-W., & Peng, N. (2019). Ondifficulties of cross-lingual transfer with order differences: A case study on dependencyparsing. In
Proceedings of NAACL .AleAhmad, A., Amiri, H., Darrudi, E., Rahgozar, M., & Oroumchian, F. (2009). Hamshahri:A standard persian text collection.
Knowledge-Based Systems , (5), 382–387.Alvarez-Melis, D., & Jaakkola, T. (2018). Gromov-Wasserstein alignment of word embed-ding spaces. In Proceedings of the 2018 Conference on Empirical Methods in NaturalLanguage Processing , pp. 1881–1890, Brussels, Belgium. Association for Computa-tional Linguistics. oval, Camacho-Collados, Espinosa-Anke, & Schockaert Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C., & Smith, N. A. (2016).Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925 .Artetxe, M., Labaka, G., & Agirre, E. (2016). Learning principled bilingual mappings ofword embeddings while preserving monolingual invariance. In
Proceedings of the 2016Conference on Empirical Methods in Natural Language Processing , pp. 2289–2294.Artetxe, M., Labaka, G., & Agirre, E. (2017). Learning bilingual word embeddings with(almost) no bilingual data. In
Proceedings of the 55th Annual Meeting of the Associa-tion for Computational Linguistics (Volume 1: Long Papers) , pp. 451–462, Vancouver,Canada. Association for Computational Linguistics.Artetxe, M., Labaka, G., & Agirre, E. (2018a). Generalizing and improving bilingual wordembedding mappings with a multi-step framework of linear transformations. In
Pro-ceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) ,pp. 5012–5019.Artetxe, M., Labaka, G., & Agirre, E. (2018b). A robust self-learning method for fullyunsupervised cross-lingual mappings of word embeddings. In
Proceedings of ACL , pp.789–798.Bakarov, A., Suvorov, R., & Sochenkov, I. (2018). The limitations of cross-language wordembeddings evaluation. In
Proceedings of the Seventh Joint Conference on Lexicaland Computational Semantics , pp. 94–100.Barone, A. V. M. (2016). Towards cross-lingual distributed representations without paralleltext trained with adversarial autoencoders. In
Proceedings of the 1st Workshop onRepresentation Learning for NLP , pp. 121–126.Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The wacky wide web: a col-lection of very large linguistically processed web-crawled corpora.
Language resourcesand evaluation , (3), 209–226.Bernier-Colborne, G., & Barriere, C. (2018). Crim at semeval-2018 task 9: A hybrid ap-proach to hypernym discovery. In Proceedings of The 12th International Workshopon Semantic Evaluation , pp. 722–728, New Orleans, Louisiana.Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors withsubword information.
Transactions of the Association of Computational Linguistics , (1), 135–146.Bordea, G., Lefever, E., & Buitelaar, P. (2016). Semeval-2016 task 13: Taxonomy extractionevaluation (texeval-2). In Proceedings of the 10th International Workshop on SemanticEvaluation (SemEval-2016) , pp. 1081–1091.Camacho-Collados, J., Delli Bovi, C., Espinosa-Anke, L., Oramas, S., Pasini, T., Santus,E., Shwartz, V., Navigli, R., & Saggion, H. (2018). SemEval-2018 Task 9: HypernymDiscovery. In
Proceedings of SemEval , pp. 712–724, New Orleans, LA, United States.Camacho-Collados, J., Pilehvar, M. T., Collier, N., & Navigli, R. (2017). Semeval-2017 task2: Multilingual and cross-lingual semantic word similarity. In
Proceedings of the 11thInternational Workshop on Semantic Evaluation (SemEval-2017) , pp. 15–26. eemi: post-processing Cross-lingual Word Embeddings Cardellino, C. (2016). Spanish Billion Words Corpus and Embeddings. http://crscardellino.me/SBWCE/ .Conneau, A., & Kiela, D. (2018). SentEval: An Evaluation Toolkit for Universal SentenceRepresentations. In
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., & Jégou, H. (2018a). Word translation without parallel data. In Proceedings of ICLR.
Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S. R., Schwenk, H., & Stoyanov, V. (2018b). XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Doval, Y., Camacho-Collados, J., Espinosa-Anke, L., & Schockaert, S. (2018). Improving cross-lingual word embeddings by meeting in the middle. In Proceedings of EMNLP, pp. 294–304.
Espinosa-Anke, L., Camacho-Collados, J., Delli Bovi, C., & Saggion, H. (2016). Supervised distributional hypernym discovery via domain adaptation. In Proceedings of EMNLP, pp. 424–435.
Faruqui, M., & Dyer, C. (2014). Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 462–471.
Geffet, M., & Dagan, I. (2005). The distributional inclusion hypotheses and lexical entailment. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 107–114. Association for Computational Linguistics.
Glavaš, G., Litschko, R., Ruder, S., & Vulić, I. (2019). How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 710–721, Florence, Italy. Association for Computational Linguistics.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
Han, L., Kashyap, A., Finin, T., Mayfield, J., & Weese, J. (2013). UMBC EBIQUITY-CORE: Semantic textual similarity systems. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics, pp. 44–52.
Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proc. of COLING 1992, pp. 539–545.
Hoffart, J., Milchevski, D., & Weikum, G. (2014). STICS: Searching with strings, things, and cats. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1247–1248. ACM.
Hoshen, Y., & Wolf, L. (2018). Non-adversarial unsupervised word translation. In Proceedings of EMNLP, Brussels, Belgium.
Kementchedjhieva, Y., Ruder, S., Cotterell, R., & Søgaard, A. (2018). Generalizing Procrustes analysis for better bilingual dictionary induction. In Proceedings of the Conference on Computational Natural Language Learning, pp. 211–220.
Lazaridou, A., Dinu, G., & Baroni, M. (2015). Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 270–280.
Lu, A., Wang, W., Bansal, M., Gimpel, K., & Livescu, K. (2015). Deep multilingual correlation for improved word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 250–256.
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
Mikolov, T., Le, Q. V., & Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of EMNLP, pp. 1532–1543.
Prager, J., Chu-Carroll, J., Brown, E. W., & Czuba, K. (2008). Question answering by predictive annotation. In Advances in Open Domain Question Answering, pp. 307–347. Springer.
Roller, S., & Erk, K. (2016). Relations such as hypernymy: Identifying and exploiting Hearst patterns in distributional vectors for lexical entailment. In Proceedings of EMNLP, pp. 2163–2172, Austin, Texas.
Shwartz, V., Goldberg, Y., & Dagan, I. (2016). Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of ACL, pp. 2389–2398, Berlin, Germany.
Shwartz, V., Santus, E., & Schlechtweg, D. (2017). Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. In Proceedings of EACL, Valencia, Spain. Association for Computational Linguistics.
Søgaard, A., Ruder, S., & Vulić, I. (2018). On the limitations of unsupervised bilingual dictionary induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 778–788. Association for Computational Linguistics.
Williams, A., Nangia, N., & Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics.
Xing, C., Wang, D., Liu, C., & Lin, Y. (2015). Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1006–1011.
Xu, R., Yang, Y., Otani, N., & Wu, Y. (2018). Unsupervised cross-lingual transfer of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2465–2474. Association for Computational Linguistics.
Yahya, M., Berberich, K., Elbassuoni, S., & Weikum, G. (2013). Robust question answering over the web of linked data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 1107–1116. ACM.
Yu, Z., Wang, H., Lin, X., & Wang, M. (2015). Learning term embeddings for hypernymy identification. In Proceedings of IJCAI, pp. 1390–1397.
Zhang, M., Liu, Y., Luan, H., & Sun, M. (2017a). Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1959–1970.
Zhang, M., Liu, Y., Luan, H., & Sun, M. (2017b). Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1934–1945, Copenhagen, Denmark. Association for Computational Linguistics.