Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context
Shyam Upadhyay, Kai-Wei Chang, Matt Taddy, Adam Kalai, James Zou
University of Illinois at Urbana-Champaign, Urbana, IL, USA; University of Virginia, Charlottesville, VA, USA; Microsoft Research, Cambridge, MA, USA; Stanford University, Stanford, CA, USA
[email protected], [email protected], {taddy,adum}@microsoft.com, [email protected]

Abstract
Word embeddings, which represent a word as a point in a vector space, have become ubiquitous in several NLP tasks. A recent line of work uses bilingual (two-language) corpora to learn a different vector for each sense of a word, by exploiting crosslingual signals to aid sense identification. We present a multi-view Bayesian non-parametric algorithm which improves multi-sense word embeddings by (a) using multilingual (i.e., more than two languages) corpora to significantly improve sense embeddings beyond what one achieves with bilingual information, and (b) using a principled approach to learn a variable number of senses per word, in a data-driven manner. Ours is the first approach with the ability to leverage multilingual corpora efficiently for multi-sense representation learning. Experiments show that multilingual training significantly improves performance over monolingual and bilingual training, by allowing us to combine different parallel corpora to leverage multilingual context. Multilingual training yields comparable performance to a state-of-the-art monolingual model trained on five times more training data.
Introduction

Word embeddings (Turian et al., 2010; Mikolov et al., 2013, inter alia) represent a word as a point in a vector space. This space is able to capture semantic relationships: vectors of words with similar meanings have high cosine similarity (Turney, 2006; Turian et al., 2010). Embeddings used as features have been shown to benefit several NLP tasks and to serve as good initializations for deep architectures, ranging from dependency parsing (Bansal et al., 2014) to named entity recognition (Guo et al., 2014b).

Although these representations are now ubiquitous in NLP, most algorithms for learning word embeddings do not allow a word to have different meanings in different contexts, a phenomenon known as polysemy. For example, the word bank assumes different meanings in financial (e.g., "bank pays interest") and geographical (e.g., "river bank") contexts, which cannot be represented adequately with a single embedding vector. Unfortunately, there are no large sense-tagged corpora available, and such polysemy must be inferred from the data during the embedding process.
Figure 1: Benefit of multilingual information (beyond bilingual): two different senses of the word "interest" and their translations to French and Chinese (word translations shown in [bold]). While the surface forms of both senses are the same in French, they are different in Chinese.
Several attempts (Reisinger and Mooney, 2010; Neelakantan et al., 2014; Li and Jurafsky, 2015) have been made to infer multi-sense word representations by modeling the sense as a latent variable in a Bayesian non-parametric framework. These approaches rely on the "one sense per collocation" heuristic (Yarowsky, 1995), which assumes that the presence of nearby words correlates with the sense of the word of interest. This heuristic provides only a weak signal for sense identification, and such algorithms require large amounts of training data to achieve competitive performance.

Recently, several approaches (Guo et al., 2014a; Šuster et al., 2016) propose to learn multi-sense embeddings by exploiting the fact that different senses of the same word may be translated into different words in a foreign language (Dagan and Itai, 1994; Resnik and Yarowsky, 1999; Diab and Resnik, 2002; Ng et al., 2003). For example, bank in English may be translated to banc or banque in French, depending on whether the sense is financial or geographical. Such bilingual distributional information allows the model to identify which sense of a word is being used during training.

However, bilingual distributional signals often do not suffice. It is common that polysemy for a word survives translation. Fig. 1 shows an illustrative example: both senses of interest get translated to intérêt in French. However, this becomes much less likely as the number of languages under consideration grows. Looking at the Chinese translations in Fig. 1, we observe that the two senses translate to different surface forms. Note that the opposite can also happen (i.e., same surface forms in Chinese, but different in French). Existing crosslingual approaches are inherently bilingual and cannot naturally extend to include additional languages due to several limitations (details in Section 4). Furthermore, works like that of Šuster et al. (2016) set a fixed number of senses for each word, leading to inefficient use of parameters and unnecessary model complexity.

This paper addresses these limitations by proposing a multi-view Bayesian non-parametric word representation learning algorithm which leverages multilingual distributional information. Our representation learning framework is the first multilingual (not merely bilingual) approach, allowing us to utilize arbitrarily many languages to disambiguate words in English. To move to a multilingual system, it is necessary to ensure that the embeddings of each foreign language are relatable to each other (i.e., they live in the same space). We solve this by proposing an algorithm in which word representations are learned jointly across languages, using English as a bridge. While large parallel corpora between two languages are scarce, using our approach we can concatenate multiple parallel corpora to obtain a large multilingual corpus. The parameters are estimated in a Bayesian non-parametric framework that allows our algorithm to associate a word with a new sense vector only when evidence (from either same-language or foreign-language context) requires it. As a result, the model infers a different number of senses for each word in a data-driven manner, avoiding wasted parameters. (Most words in conventional English are monosemous, i.e., have a single sense, e.g., the word monosemous itself.)

Together, these two ideas (multilingual distributional information and non-parametric sense modeling) allow us to disambiguate multiple senses using far less data than is necessary for previous methods.
We experimentally demonstrate that our algorithm achieves competitive performance after training on a small multilingual corpus, comparable to a model trained monolingually on a much larger corpus. We present an analysis discussing the effect of various parameters (choice of language family for deriving the multilingual signal, crosslingual window size, etc.) and also show qualitative improvements in the embedding space.

Related Work

Work on inducing multi-sense embeddings can be divided into two broad categories: two-staged approaches and joint learning approaches. Two-staged approaches (Reisinger and Mooney, 2010; Huang et al., 2012) induce multi-sense embeddings by first clustering the contexts and then using the clustering to obtain the sense vectors. The contexts can be topics induced using latent topic models (Liu et al., 2015a,b), Wikipedia (Wu and Giles, 2015), or coarse part-of-speech tags (Qiu et al., 2014). A more recent line of work in the two-staged category is that of retrofitting (Faruqui et al., 2015; Jauhar et al., 2015), which aims to infuse semantic ontologies from resources like WordNet (Miller, 1995) and FrameNet (Baker et al., 1998) into embeddings during a post-processing step. Such resources list (albeit not exhaustively) the senses of a word, and by retrofitting it is possible to tease apart the different senses of a word. While some resources like WordNet (Miller, 1995) are available for many languages, they are not exhaustive in listing all possible senses. Indeed, the number of senses of a word is highly dependent on the task and cannot be pre-determined using a lexicon (Kilgarriff, 1997). Ideally, the senses should be inferred in a data-driven manner, so that new senses not listed in such lexicons can be discovered. While recent work has attempted to remedy this by using parallel text for retrofitting sense-specific embeddings (Ettinger et al., 2016), their procedure requires the creation of sense graphs, which introduces additional tuning parameters. In contrast, our approach only requires two tuning parameters (the prior α and the maximum number of senses T).

Joint learning approaches (Neelakantan et al., 2014; Li and Jurafsky, 2015), on the other hand, jointly learn the sense clusters and embeddings by using non-parametrics. Our approach belongs to this category. The closest non-parametric approach to ours is that of Bartunov et al. (2016), who proposed a multi-sense variant of the skip-gram model which learns a different number of sense vectors for each word from a large monolingual corpus (e.g., English Wikipedia). Our work can be viewed as a multi-view extension of their model which leverages both monolingual and crosslingual distributional signals for learning the embeddings. In our experiments, we compare our model to a monolingually trained version of their model.

Incorporating crosslingual distributional information is a popular technique for learning word embeddings, and it improves performance on several downstream tasks (Faruqui and Dyer, 2014; Guo et al., 2016; Upadhyay et al., 2016). However, there has been little work on learning multi-sense embeddings using crosslingual signals (Bansal et al., 2012; Guo et al., 2014a; Šuster et al., 2016), with only Šuster et al. (2016) being a joint approach. Kawakami and Dyer (2015) also used bilingual distributional signals in a deep neural architecture to learn context-dependent representations for words, though they do not learn separate sense vectors.
Model

Let E = {x^e_1, .., x^e_i, .., x^e_{N_e}} denote the words of the English side and F = {x^f_1, .., x^f_i, .., x^f_{N_f}} the words of the foreign side of the parallel corpus. We assume access to word alignments A_{e→f} and A_{f→e} mapping words in an English sentence to their translations in the foreign sentence (and vice versa), so that x^e and x^f are aligned if A_{e→f}(x^e) = x^f. We define Nbr(x, L, d) as the neighborhood in language L of size d (on either side) around word x in its sentence. The English and foreign neighboring words are denoted by y^e and y^f, respectively. Note that y^e and y^f need not be translations of each other. Each word x^f in the foreign vocabulary is associated with a dense vector x^f in R^m, and each word x^e in the English vocabulary admits at most T sense vectors, with the k-th sense vector denoted as x^e_k. (We also maintain a context vector for each word in the English and foreign vocabularies; the context vector is used as the representation of a word when it appears as the context of another word.) As our main goal is to model multiple senses for words in English, we do not model polysemy in the foreign language and use a single vector to represent each word in the foreign vocabulary.

We model the joint conditional distribution of the context words y^e, y^f given an English word x^e and its corresponding translation x^f on the parallel corpus:

    P(y^e, y^f | x^e, x^f; α, θ),    (1)

where θ are the model parameters (i.e., all embeddings) and α governs the hyper-prior on latent senses.

Assuming x^e has multiple senses, indexed by the random variable z, Eq. (1) can be rewritten as

    ∫_β ∑_z P(y^e, y^f, z, β | x^e, x^f, α; θ) dβ,

where β are the parameters determining the model probability of each sense for x^e (i.e., the weight on each possible value of z). We place a Dirichlet process (Ferguson, 1973) prior on the sense assignment for each word. Thus, adding the word-x subscript to emphasize that these are word-specific senses,

    P(z_x = k | β_x) = β_{xk} ∏_{r=1}^{k-1} (1 − β_{xr}),    (2)
    β_{xk} | α ∼ Beta(β_{xk} | 1, α),  k = 1, 2, ...    (3)

That is, the potentially infinite number of senses for each word x have probabilities determined by the sequence of independent stick-breaking weights β_{xk} in the constructive definition of the DP (Sethuraman, 1994). The hyper-prior concentration α provides information on the number of senses we expect to observe in our corpus.

After conditioning upon the word sense, we decompose the context probability,

    P(y^e, y^f | z, x^e, x^f; θ) = P(y^e | x^e, x^f, z; θ) P(y^f | x^e, x^f, z; θ).

Both the first and the second term are sense-dependent, and each factors as

    P(y | x^e, x^f, z = k; θ) ∝ Ψ(x^e, z = k, y) Ψ(x^f, y) = exp(y^T x^e_k) exp(y^T x^f) = exp(y^T (x^e_k + x^f)),

where x^e_k is the embedding corresponding to the k-th sense of the word x^e, and y is either y^e or y^f. The factor Ψ(x^e, z = k, y) uses the corresponding sense vector in a skip-gram-like formulation. This results in a total of four factors,

    P(y^e, y^f | z, x^e, x^f; θ) ∝ Ψ(x^e, z, y^e) Ψ(x^f, y^f) Ψ(x^e, z, y^f) Ψ(x^f, y^e).    (4)

See Figure 2 for an illustration of each factor. This modeling approach is reminiscent of Luong et al. (2015), who jointly learned embeddings for two languages l_1 and l_2 by optimizing a joint objective containing four skip-gram terms for the aligned pair (x^e, x^f): two predicting monolingual contexts (l_1 → l_1, l_2 → l_2) and two predicting crosslingual contexts (l_1 → l_2, l_2 → l_1).
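To make the generative story concrete, the following minimal NumPy sketch computes the stick-breaking sense probabilities of Eqs. (2)-(3) and the unnormalized four-factor context score of Eq. (4). All names, dimensions and values are illustrative, and the exp(y^T x) factor stands in for the full likelihood, which would additionally require a normalization (or negative-sampling) term.

    import numpy as np

    def sense_probs(beta_x):
        # Eq. (2): p(z_x = k) = beta_{x,k} * prod_{r<k} (1 - beta_{x,r}),
        # where each beta_{x,k} ~ Beta(1, alpha) a priori (Eq. (3)).
        beta_x = np.asarray(beta_x, dtype=float)
        sticks = np.concatenate(([1.0], np.cumprod(1.0 - beta_x[:-1])))
        return beta_x * sticks

    def context_score(y, x_e_senses, k, x_f):
        # Eq. (4) factors: Psi(x^e, z=k, y) * Psi(x^f, y) = exp(y^T (x^e_k + x^f)),
        # evaluated for one context word y (English or foreign), unnormalized.
        return np.exp(y @ (x_e_senses[k] + x_f))

    # Toy usage: 3 candidate senses in a 4-dimensional space.
    rng = np.random.default_rng(0)
    x_e_senses = rng.normal(size=(3, 4))    # sense vectors x^e_1 .. x^e_3
    x_f = rng.normal(size=4)                # vector of the aligned foreign word
    y = rng.normal(size=4)                  # one context-word vector
    prior = sense_probs([0.6, 0.5, 0.5])    # truncated stick-breaking weights
    posterior = prior * np.array([context_score(y, x_e_senses, k, x_f) for k in range(3)])
    posterior /= posterior.sum()            # responsibility over senses for this context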
Learning.
Learning involves maximizing the log-likelihood

    P(y^e, y^f | x^e, x^f; α, θ) = ∫_β ∑_z P(y^e, y^f, z, β | x^e, x^f, α; θ) dβ,

for which we use a variational approximation. Let

    q(z, β) = q(z) q(β),  where  q(z) = ∏_i q(z_i)  and  q(β) = ∏_{w=1}^{V} ∏_{k=1}^{T} q(β_{wk}),    (5)

be the fully factorized variational approximation of the true posterior P(z, β | y^e, y^f, x^e, x^f, α), where V is the size of the English vocabulary and T is the maximum number of senses for any word. The optimization problem solves for θ, q(z) and q(β) using the stochastic variational inference technique (Hoffman et al., 2013), similarly to Bartunov et al. (2016) (see that work for details).

The resulting learning algorithm is shown as Algorithm 1. The first for-loop (line 1) updates the English sense vectors using the crosslingual and monolingual contexts. First, the expected sense distribution for the current English word w is computed using the current estimate of q(β) (line 4). The sense distribution is updated (line 7) using the combined monolingual and crosslingual contexts (line 5) and re-normalized (line 8). Using the updated sense distribution, the sufficient statistics of q(β) are re-computed (line 9), and the global parameter θ is updated (line 10) as follows:

    θ ← θ + ρ_t ∇_θ ∑_{k: z_{ik} > ε} ∑_{y ∈ y_c} z_{ik} log p(y | x_i, k, θ).    (6)

Note that in the above sum, a sense participates in an update only if its probability exceeds a threshold ε (= 0.001). The final model retains only those sense vectors whose sense probability exceeds the same threshold. The last for-loop (line 11) jointly optimizes the foreign embeddings using English context with the standard skip-gram updates.

Figure 2: The aligned pair (interest, intérêt) is used to predict monolingual and crosslingual contexts in both languages (see the factors in Eq. (4)); for the sentence pair "The bank paid me [interest] on my savings." / "La banque m'a payé des [intérêts] sur mes économies.", the four factors are Ψ(interest, 2, savings), Ψ(intérêts, économies), Ψ(interest, 2, banque) and Ψ(intérêts, savings). We pick the corresponding sense vector (here the 2nd) for interest to perform the weighted update. We only model polysemy in English.
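The gradient step of Eq. (6) lets only senses whose responsibility exceeds ε contribute. A minimal sketch of that step (names are illustrative; the exp(y^T x) factor again stands in for the full likelihood, which in practice is handled with negative sampling as in skip-gram):

    import numpy as np

    def thresholded_update(x_senses, z, context_vecs, rho_t, eps=1e-3):
        # Eq. (6): theta <- theta + rho_t * grad of
        #   sum_{k: z_k > eps} sum_{y in y_c} z_k * log p(y | x, k, theta).
        # With the unnormalized factor log p ~ y^T x^e_k, the gradient w.r.t. x^e_k is y.
        for k in range(len(x_senses)):
            if z[k] <= eps:
                continue                      # pruned senses never receive updates
            for y in context_vecs:
                x_senses[k] += rho_t * z[k] * y
        return x_senses

The same threshold is what decides, at the end of training, which sense vectors are retained in the final model.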
Disambiguation.
Similar to Bartunov et al. (2016), we can disambiguate the sense of the word x^e given a monolingual context y^e as follows:

    P(z | x^e, y^e) ∝ P(y^e | x^e, z; θ) ∑_β P(z | x^e, β) q(β).    (7)

Although the model trains embeddings using both monolingual and crosslingual context, we only use monolingual context at test time. We found that as long as the model has been trained with multilingual context, it performs well in sense disambiguation on new data even if that data contains only monolingual context. A similar observation was made by Šuster et al. (2016).
Multilingual Extension.
Bilingual distributional signal alone may not be sufficient, as polysemy may survive translation in the second language. Unlike existing approaches, we can easily incorporate multilingual distributional signals in our model. To use languages l_1 and l_2 to learn multi-sense embeddings for English, we train on the concatenation of an En-l_1 parallel corpus with an En-l_2 parallel corpus. This technique can easily be generalized to more than two foreign languages to obtain a large multilingual corpus.

Algorithm 1: Pseudocode of the learning algorithm.
Input: parallel corpus E = {x^e_1, .., x^e_i, .., x^e_{N_e}} and F = {x^f_1, .., x^f_i, .., x^f_{N_f}}, alignments A_{e→f} and A_{f→e}, hyper-parameters α and T, window sizes d and d'.
Output: θ, q(β), q(z)
1:  for i = 1 to N_e do                          ▷ update English vectors
2:      w ← x^e_i
3:      for k = 1 to T do
4:          z_{ik} ← E_{q(β_w)}[log p(z_i = k | β_w, x^e_i)]
5:      y_c ← Nbr(x^e_i, E, d) ∪ Nbr(x^f_i, F, d') ∪ {x^f_i}, where x^f_i = A_{e→f}(x^e_i)
6:      for y in y_c do
7:          SENSE-UPDATE(x^e_i, y, z_i)
8:      Renormalize z_i using softmax
9:      Update sufficient statistics for q(β) as in Bartunov et al. (2016)
10:     Update θ using Eq. (6)
11: for i = 1 to N_f do                          ▷ jointly update foreign vectors
12:     y_c ← Nbr(x^f_i, F, d) ∪ Nbr(x^e_i, E, d') ∪ {x^e_i}, where x^e_i = A_{f→e}(x^f_i)
13:     for y in y_c do
14:         SKIP-GRAM-UPDATE(x^f_i, y)
15: procedure SENSE-UPDATE(x_i, y, z_i)
16:     z_{ik} ← z_{ik} + log p(y | x_i, k, θ)

Value of Ψ(y^e, x^f). The factor modeling the dependence of the English context word y^e on the foreign word x^f is crucial to performance when using multiple languages. Consider the case of using French and Spanish contexts to disambiguate the financial sense of the English word bank. In this case, the (financial) sense vector of bank will be used to predict the vector of banco (Spanish context) and the vector of banque (French context). If the vectors for banco and banque do not reside in the same space or are not close, the model will incorrectly assume they are different contexts and introduce a new sense for bank. This is precisely why bilingual models, like that of Šuster et al. (2016), cannot be extended to the multilingual setting: they pre-train the embeddings of the second language before running the multi-sense embedding process. As a result of this naive pre-training, the French and Spanish vectors of semantically similar pairs like (banco, banque) will lie in different spaces and need not be close. A similar reason holds for Guo et al. (2014a), as they use a two-step approach instead of joint learning.

To avoid this, the vectors for pairs like banco and banque should lie in the same space, close to each other and to the sense vector for bank. The Ψ(y^e, x^f) term attempts to ensure this by using the vectors for banco and banque to predict the vector of bank. In this way, the model brings the embedding spaces for Spanish and French closer by using English as a bridge language during joint training. A similar idea of using English as a bridging language was used in the models proposed by Hermann and Blunsom (2014) and Coulmance et al. (2015). Besides its benefit in the multilingual case, the Ψ(y^e, x^f) term improves performance in the bilingual case as well, as it forces the English and second-language embeddings to remain close in space.

To show the value of the Ψ(y^e, x^f) factor in our experiments, we ran a variant of Algorithm 1 without the Ψ(y^e, x^f) factor, by only using the monolingual neighborhood Nbr(x^f_i, F) in line 12 of Algorithm 1. We call this variant the ONE-SIDED model, and the model in Algorithm 1 the FULL model.
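Because every parallel corpus shares English as one side, building the multilingual training set is literally a concatenation of the individual En-l corpora. A minimal sketch of such a stream (read_pairs, train and the file names are placeholders, not part of our released tooling):

    from itertools import chain

    def multilingual_corpus(*corpora):
        # Each corpus is an iterable of (english_tokens, foreign_tokens, alignment)
        # triples from one En-l parallel corpus (e.g. En-Fr, En-Zh). Training walks
        # the concatenation: English sense vectors receive updates from every pair,
        # while each foreign word is updated only from pairs of its own language,
        # with English acting as the bridge between the foreign spaces.
        return chain(*corpora)

    # e.g. train(multilingual_corpus(read_pairs("en-fr.tsv"), read_pairs("en-zh.tsv")))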
Experimental Setup

We first describe the datasets and the preprocessing methods used to prepare them. We also describe the Word Sense Induction task that we used to compare and evaluate our method.
Parallel Corpora.
We use parallel corpora in English (En), French (Fr), Spanish (Es), Russian (Ru) and Chinese (Zh) in our experiments. Corpus statistics for all datasets used in our experiments are shown in Table 1. For En-Zh, we use the FBIS parallel corpus (LDC2003E14). For En-Fr, we use the first 10M lines from the Giga-EnFr corpus released as part of the WMT shared task (Callison-Burch et al., 2011). Note that the domain from which a parallel corpus has been derived can affect the final result. To understand which choice of languages provides a suitable disambiguation signal, it is necessary to control for domain across all parallel corpora. To this end, we also used the En-Fr, En-Es, En-Zh and En-Ru sections of the MultiUN parallel corpus (Eisele and Chen, 2010). Word alignments were generated using the fast_align tool (Dyer et al., 2013) in the symmetric intersection mode. Tokenization and other preprocessing were performed using the cdec toolkit (github.com/redpony/cdec). The Stanford Segmenter (Tseng et al., 2005) was used to preprocess the Chinese corpora.

Table 1: Corpus statistics (in millions): corpus, source, number of lines (M) and number of English words (M) for each parallel corpus. Horizontal lines demarcate corpora from the same domain.
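A minimal sketch of how the symmetrized fast_align output can be turned into the A_{e→f} and A_{f→e} maps used by Algorithm 1, assuming the standard Pharaoh format (one line per sentence pair, whitespace-separated "i-j" links from English position i to foreign position j); the file name is illustrative:

    def read_alignments(path):
        # Each line looks like "0-0 1-2 3-1", linking English token i to foreign token j.
        # With intersection symmetrization the links are at most one-to-one, so the
        # two dictionaries below are both well defined.
        with open(path, encoding="utf-8") as f:
            for line in f:
                links = [tuple(map(int, pair.split("-"))) for pair in line.split()]
                a_ef = {i: j for i, j in links}   # A_{e->f}: English position -> foreign position
                a_fe = {j: i for i, j in links}   # A_{f->e}: foreign position -> English position
                yield a_ef, a_fe

    # e.g. for a_ef, a_fe in read_alignments("en-fr.intersect.align"): ...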
Word Sense Induction (WSI).
We evaluate our approach on the word sense induction task. In this task, we are given several sentences showing usages of the same word, and are required to cluster all sentences which use the same sense (Nasiruddin, 2013). The predicted clustering is then compared against a provided gold clustering. Note that WSI is a harder task than Word Sense Disambiguation (WSD) (Navigli, 2009): unlike WSD, it does not involve any supervision or explicit human knowledge about the senses of words. We use the disambiguation approach in Eq. (7) to predict the sense given the target word and four context words.

To allow for a fair comparison with earlier work, we use the same benchmark datasets as Bartunov et al. (2016): SemEval-2007, SemEval-2010 and Wikipedia Word Sense Induction (WWSI). We report the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) in the experiments, as ARI is a stricter and more precise metric than F-score and V-measure.
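For the WSI evaluation, each occurrence of the target word is assigned its most probable sense via Eq. (7) using only the monolingual context words, and the induced clustering is scored against the gold clustering with ARI. A minimal sketch using scikit-learn (variable names and the toy data are illustrative; the exp(y^T x_k) factor again stands in for the full likelihood):

    import numpy as np
    from sklearn.metrics import adjusted_rand_score

    def predict_sense(sense_vecs, prior, context_vecs):
        # Eq. (7): p(z=k | x^e, y^e) is proportional to
        #   p(y^e | x^e, z=k; theta) * expected prior mass on sense k under q(beta).
        scores = np.log(np.asarray(prior) + 1e-12)
        for k, x_k in enumerate(sense_vecs):
            scores[k] += sum(float(y @ x_k) for y in context_vecs)
        return int(np.argmax(scores))

    # Toy usage: two senses, four context words per occurrence, three occurrences.
    rng = np.random.default_rng(1)
    sense_vecs = rng.normal(size=(2, 4))
    prior = [0.7, 0.2]                                     # truncated stick-breaking weights
    occurrences = [rng.normal(size=(4, 4)) for _ in range(3)]
    pred = [predict_sense(sense_vecs, prior, ctx) for ctx in occurrences]
    gold = [0, 0, 1]                                       # gold sense clustering
    print(adjusted_rand_score(gold, pred))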
Parameter Tuning.
For fairness, we used five context words on either side to update each English word vector in all the experiments. In the monolingual setting, all five words are English; in the multilingual settings, we used four neighboring English words plus the one foreign word aligned to the word being updated (d = 4, d' = 0 in Algorithm 1). We also analyze the effect of varying d', the context window size in the foreign sentence, on model performance.

We tune the parameters α and T by maximizing the log-likelihood of a held-out English text (the first 100k lines from the En-Fr Europarl corpus (Koehn, 2005)), choosing each from a small grid of candidate values. All models were trained for 10 iterations with a decaying learning rate of 0.025, decayed to 0. Unless otherwise stated, all embeddings are 100-dimensional.

Under various choices of α and T, we identify only about 10-20% of the vocabulary as polysemous using monolingual training, and 20-25% using multilingual training. It is evident that using the non-parametric prior leads to a substantially more efficient representation compared to previous methods with a fixed number of senses per word.
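The α and T selection above can be read as a small grid search over held-out log-likelihood. A minimal sketch (the candidate grids, train_fn and heldout_loglik_fn are placeholders, not the values or functions used in the paper):

    from itertools import product

    def select_hyperparameters(alphas, max_senses, train_fn, heldout_loglik_fn):
        # Pick (alpha, T) maximizing the log-likelihood of a held-out English text.
        best, best_ll = None, float("-inf")
        for alpha, T in product(alphas, max_senses):
            model = train_fn(alpha=alpha, T=T)
            ll = heldout_loglik_fn(model)
            if ll > best_ll:
                best, best_ll = (alpha, T), ll
        return best

    # e.g. select_hyperparameters(alphas=[0.05, 0.1, 0.5], max_senses=[5, 10],
    #                             train_fn=train_model, heldout_loglik_fn=score_heldout)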
Results

We performed extensive experiments to evaluate the benefit of leveraging bilingual and multilingual information during training. We also analyze how different choices of language family (i.e., using more distant vs. more similar languages) affect the performance of the embeddings.

Table 2: ARI on the WSI datasets and correlation score on SCWS for each training setting. MONO is trained on the English side of the respective parallel corpus. Best result for each column is shown in bold.

Setting  | Model     | S-2007 | S-2010 | WWSI | avg. ARI | SCWS
En-Fr    | MONO      | .044   | .064   | .112 | .073     | 41.1
         | ONE-SIDED | .054   | .074   | .116 | .081     |
         | FULL      | .055   | .086   | .105 | .082     |
En-Zh    | MONO      | .054   | .074   | .073 | .067     | 42.6
         | ONE-SIDED | .059   | .084   | .078 | .074     |
         | FULL      | .055   | .090   | .079 | .075     |
En-FrZh  | MONO      | .056   | .086   | .103 | .082     | 44.6
         | ONE-SIDED | .067   | .085   | .113 | .088     |
         | FULL      | .065   | .094   | .120 | .093     |
The results for WSI are shown in Table 2. Recall that the ONE-SIDED model is the variant of Algorithm 1 without the Ψ(y^e, x^f) factor. MONO refers to the AdaGram model of Bartunov et al. (2016) trained on the English side of the parallel corpus. In all cases, the MONO model is outperformed by the ONE-SIDED and FULL models, showing the benefit of using crosslingual signal in training. The best performance is attained by the multilingual model (En-FrZh), showing the value of the multilingual signal. The value of the Ψ(y^e, x^f) term is also verified by the fact that the ONE-SIDED model performs worse than the FULL model.
We can also compare (unfairly to our FULL model) with the best results described in Bartunov et al. (2016), which achieved ARI scores of 0.069, 0.097 and 0.286 on the three datasets respectively after training 300-dimensional embeddings on English Wikipedia, a much larger monolingual corpus.

For completeness, we report correlation scores on the Stanford contextual word similarity dataset (SCWS) (Huang et al., 2012) in Table 2. The task requires computing the similarity between two words given their contexts. While the bilingually trained model outperforms the monolingually trained model, surprisingly the multilingually trained model does not perform well on SCWS. We believe this may be due to our parameter tuning strategy; most works tune directly on the test dataset for word similarity tasks (Faruqui et al., 2016).

Intuitively, the choice of language can affect the results of crosslingual training, as some languages may provide better disambiguation signals than others. We performed a systematic set of experiments to evaluate whether we should choose languages from a closer family (Indo-European languages) or a farther family (non-Indo-European languages) as training data alongside English. To control for domain, here we use the MultiUN corpus; Šuster et al. (2016) compared different languages but did not control for domain. We use English paired with French and Spanish to represent Indo-European languages, and English paired with Russian and Chinese to represent non-Indo-European languages.

Table 3: Effect (in ARI) of language family distance on the WSI task. Best result for each column is shown in bold. The improvement from MONO to FULL is also shown as (3) - (1). Note that these results are not comparable to those in Table 2, as we use a different training corpus to control for the domain.

Train         | S-2007            | S-2010            | WWSI              | Avg. ARI
Setting       | En-FrEs | En-RuZh | En-FrEs | En-RuZh | En-FrEs | En-RuZh | En-FrEs | En-RuZh
(1) MONO      | .035    | .033    | .046    | .049    | .054    | .049    | .045    | .044
(2) ONE-SIDED | .044    | .044    | .055    | .063    | .062    | .057    | .054    | .055
(3) FULL      | .046    | .040    | .056    | .070    | .068    | .069    | .057    | .059
(3) - (1)     | .011    | .007    | .010    | .021    | .014    | .020    | .012    | .015

From Table 3, we see that using non-Indo-European languages yields a slightly higher improvement on average than using Indo-European languages. This suggests that using languages from a distant family aids better disambiguation. Our findings echo those of Resnik and Yarowsky (1999), who found that the tendency to lexicalize senses of an English word differently in a second language correlates with language distance.

Figure 3d shows the effect of increasing the crosslingual window d' on the average ARI on the WSI task for the En-Fr and En-Zh models. While increasing the window size improves the average score for the En-Zh model, the score for the En-Fr model goes down. This suggests that it might be beneficial to have a separate window parameter per language. It also aligns with the earlier observation that different language families have different suitability (a bigger crosslingual context from a distant family helped) and different requirements for optimal performance.

As an illustration of the effects of multilingual training, Figure 3 shows PCA plots for 11 sense vectors of 9 words using the monolingual, bilingual and multilingual models. From Fig. 3a, we note that with monolingual training the senses are poorly separated. Although the model infers two senses for bank, both senses of bank are close to financial terms, suggesting their distinction was not recognized. The same observation can be made for the senses of apple.
Figure 3: Qualitative: PCA plots for the vectors of {apple, bank, interest, itunes, potato, west, monetary, desire}, with multiple sense vectors for apple, interest and bank, obtained using (a) monolingual (the En side of En-Zh), (b) bilingual (En-Zh) and (c) multilingual (En-FrZh) training. Window tuning: panel (d) plots average ARI against the crosslingual window size for En-Fr and En-Zh.

In Fig. 3b, with bilingual training, the model infers the two senses of bank correctly, and the two senses of apple become more distant. The model can still improve, e.g., by pulling interest towards the financial sense of bank, and itunes towards apple_2. Finally, in Fig. 3c, all senses of the words are more clearly clustered, improving over the clustering of Fig. 3b. The senses of apple, interest and bank are well separated and are close to sense-specific words, showing the benefit of multilingual training.

Conclusion

We presented a multi-view, non-parametric word representation learning algorithm which can leverage multilingual distributional information. Our approach effectively combines the benefits of crosslingual training and Bayesian non-parametrics. Ours is the first multi-sense representation learning algorithm capable of using multilingual distributional information efficiently, by combining several parallel corpora to obtain a large multilingual corpus. Our experiments show how this multi-view approach learns higher-quality embeddings using substantially less data and fewer parameters than the prior state of the art. We also analyzed the effect of various parameters, such as the choice of language family and the crosslingual window size, on performance. While we focused on improving the embeddings of English words in this work, the same algorithm could learn better multi-sense embeddings for other languages. Exciting avenues for future research include extending our approach to model polysemy in the foreign languages as well. The sense vectors could then be aligned across languages to generate a multilingual WordNet-like resource in a completely unsupervised manner, thanks to our joint training paradigm.

References
Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In ACL.

Mohit Bansal, John DeNero, and Dekang Lin. 2012. Unsupervised translation sense clustering. In NAACL.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In ACL.

Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry Vetrov. 2016. Breaking sticks and ambiguities with adaptive skip-gram. In AISTATS.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F. Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In WMT Shared Task.

Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, fast cross-lingual word-embeddings. In EMNLP.

Ido Dagan and Alon Itai. 1994. Word sense disambiguation using a second language monolingual corpus. Computational Linguistics.

Mona Diab and Philip Resnik. 2002. An unsupervised method for word sense tagging using parallel corpora. In ACL.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In NAACL.

Andreas Eisele and Yu Chen. 2010. MultiUN: A multilingual corpus from United Nation documents. In LREC.

Allyson Ettinger, Philip Resnik, and Marine Carpuat. 2016. Retrofitting sense-specific word vectors using parallel text. In NAACL.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In NAACL.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In EACL.

Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. 2016. Problems with evaluation of word embeddings using word similarity tasks. In the 1st Workshop on Evaluating Vector-Space Representations for NLP (RepEval).

Thomas S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. The Annals of Statistics.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014a. Learning sense-specific word embeddings by exploiting bilingual resources. In COLING.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014b. Revisiting embedding features for simple semi-supervised learning. In EMNLP.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2016. A representation learning framework for multi-source transfer parsing. In AAAI.

Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual distributed representations without word alignment. In ICLR.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John William Paisley. 2013. Stochastic variational inference. JMLR.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In ACL.

Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification.

Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In NAACL.

Kazuya Kawakami and Chris Dyer. 2015. Learning to represent words in context with multilingual supervision. ICLR Workshop.

Adam Kilgarriff. 1997. I don't believe in word senses. Computers and the Humanities.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79-86.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? In EMNLP.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2015a. Learning context-sensitive word embeddings with neural tensor skip-gram model. In IJCAI.

Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015b. Topical word embeddings. In AAAI.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Workshop on Vector Space Modeling for NLP.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In NAACL.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM.

Mohammad Nasiruddin. 2013. A state of the art of word sense induction: A way towards word sense disambiguation for under-resourced languages. arXiv preprint arXiv:1310.1425.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR).

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In EMNLP.

Hwee Tou Ng, Bin Wang, and Yee Seng Chan. 2003. Exploiting parallel texts for word sense disambiguation: An empirical study. In ACL.

Lin Qiu, Yong Cao, Zaiqing Nie, Yong Yu, and Yong Rui. 2014. Learning word representation considering proximity and ambiguity. In AAAI.

Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In NAACL.

Philip Resnik and David Yarowsky. 1999. Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation. NLE.

Jayaram Sethuraman. 1994. A constructive definition of Dirichlet priors. Statistica Sinica.

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter for SIGHAN bakeoff 2005. In Proc. of SIGHAN.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In ACL.

Peter D. Turney. 2006. Similarity of semantic relations. Computational Linguistics.

Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In ACL.

Simon Šuster, Ivan Titov, and Gertjan van Noord. 2016. Bilingual learning of multi-sense embeddings with discrete autoencoders. In NAACL.

Zhaohui Wu and C. Lee Giles. 2015. Sense-aware semantic analysis: A multi-prototype word representation model using Wikipedia. In AAAI.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In ACL.