Novel Ranking-Based Lexical Similarity Measure for Word Embedding
Jakub Dutkiewicz & Czesław Jędrzejek
Politechnika Poznańska
Poznań, Poland
{jakub.dutkiewicz, czeslaw.jedrzejek}@put.poznan.pl

Abstract
Distributional semantics models derive a word space from linguistic items in context. Meaning is obtained by defining a distance measure between vectors corresponding to lexical entities. Such vectors present several problems. In this paper we provide a guideline for post-process improvements to the baseline vectors. We focus on refining the similarity aspect, address imperfections of the model by applying a hubness reduction method, implement relational knowledge into the model, and provide a new ranking similarity definition that gives maximum weight to the top-1 component value. This feature ranking is similar to the one used in information retrieval. All these enrichments outperform current literature results on the joint ESL and TOEFL test sets. Since single word embedding is a basic element of any semantic task, one can expect a significant improvement of results for these tasks. Moreover, our improved method of text processing can be translated to continuous distributed representations of biological sequences for deep proteomics and genomics.
1 Introduction

Distributional language models are frequently used to measure word similarity in natural language (e.g. Frackowiak et al. (2017)). Recent works usually rely on the Distributional Hypothesis (Harris (1954)) to generate the language models. Such a model often consists of a set of vectors; each vector corresponds to a character string, which represents a word. Mikolov et al. (2013) and Pennington et al. (2014) implement word embedding (WE) algorithms. Vector components in language models created by these algorithms are latent. Similarity between words is defined as a function of the vectors corresponding to the given words. The cosine measure is the most frequently used similarity function. Santus et al. (2016) highlight the fact that the cosine can be outperformed by ranking-based functions. Vector space word representations obtained from purely distributional information of words in large unlabelled corpora are not enough to beat state-of-the-art results in query answering benchmarks, because they suffer from four types of weaknesses:

1. Inadequate definition of similarity,
2. Inability to account for senses of words,
3. Appearance of hubness, which distorts distances between vectors,
4. Inability to distinguish synonyms from antonyms.

In this paper we use an existing word embedding model together with several post-process enhancement techniques. We address three out of four of these issues. In particular, we define a novel similarity measure dedicated to language models.

The Euclidean distance is based on the locations of points in the vector space. Similarity is a function that decreases monotonically with distance: as the distance between two given entities gets shorter, the entities become more similar. This holds for language models. Similarity between words is equal to similarity between their corresponding vectors. There are various definitions of distance. The most common, the Euclidean distance, is defined as follows:

$d(p_1, p_2) = \sqrt{\sum_{c \in p} (c_{p_1} - c_{p_2})^2}$  (1)

Similarity based on the Euclidean definition is inverse to the distance:

$sim(p_1, p_2) = \frac{1}{1 + d(p_1, p_2)}$  (2)

The angular definition of distance uses the cosine function:

$d(p_1, p_2) = 1 - \cos(p_1, p_2)$  (3)

We define angular similarity as:

$sim(p_1, p_2) = \cos(p_1, p_2)$  (4)

Both the Euclidean and the cosine definitions of distance can be viewed as an analysis of vector components. Simple operations, like addition and multiplication, work well in low-dimensional spaces. We believe that applying these metrics in spaces of high dimensionality is not ideal; hence we compare cosine similarity to a measure of distance dedicated to high-dimensional spaces.

2 Related work

Santus et al. (2016) introduce the ranking-based similarity function APSyn. In their experiment APSyn outperforms the cosine similarity, reaching 73% accuracy in the best cases (an improvement of 27% over cosine) on the ESL dataset, and 70% accuracy (an improvement of 10% over cosine) on the TOEFL dataset. In contrast to our work, they use the Positive Pointwise Mutual Information algorithm to create their language model.

A successful avenue to enhance WE was pointed out by Faruqui et al. (2014), using WordNet (Miller & Fellbaum (2007)) and the Paraphrase Database (Ganitkevitch et al. (2013)) to provide synonymy relation information to vector optimization equations. They call this process retrofitting, a pattern we adapt to the angular definition of distance, which is more suitable to our case.
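To make the distance and similarity definitions of equations (1)-(4) concrete, the following is a minimal sketch assuming NumPy; the function names and the toy vectors are illustrative choices of ours, not part of the paper's implementation.

```python
import numpy as np

def euclidean_similarity(p1, p2):
    """Eqs. (1)-(2): similarity as the inverse of Euclidean distance."""
    d = np.sqrt(np.sum((p1 - p2) ** 2))  # eq. (1)
    return 1.0 / (1.0 + d)               # eq. (2)

def cosine_similarity(p1, p2):
    """Eqs. (3)-(4): angular similarity, sim = cos(p1, p2) = 1 - d(p1, p2)."""
    return np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))

# Toy usage on two random 300-dimensional vectors standing in for word vectors.
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=300), rng.normal(size=300)
print(euclidean_similarity(v1, v2), cosine_similarity(v1, v2))
```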
We also address hubness reduction. Hubness is related to the phenomenon of concentration of distances, i.e. the fact that points get closer together at large vector dimensionalities. Hubness is very pronounced for vector dimensions of the order of thousands. We apply the localized centering method for hubness reduction (Feldbauer & Flexer (2016)) to the language models.
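A quick numeric illustration of this concentration effect, entirely ours and assuming NumPy: as dimensionality grows, the spread of pairwise distances shrinks relative to their mean.

```python
import numpy as np

# As dimensionality grows, the relative spread of pairwise Euclidean
# distances shrinks: distances "concentrate" around their mean.
rng = np.random.default_rng(0)
for dim in (3, 30, 300, 3000):
    points = rng.normal(size=(1000, dim))
    dists = np.linalg.norm(points[:500] - points[500:], axis=1)  # 500 random pairs
    print(f"dim={dim:5d}  std/mean of distances = {dists.std() / dists.mean():.3f}")
```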
3 Method

In our work we define the language model as a set of word representations. Each word is represented by its vector. We refer to the vector corresponding to a word w_i as v_i. A complete set of words for a given language is referred to as a vector space model. We define similarity between words w_i and w_j as a function of the vectors v_i and v_j:

$sim(w_i, w_j) = f(v_i, v_j)$  (5)

We present an algorithm for obtaining an optimized similarity measure given a vector space model for word embedding. The algorithm consists of six steps:

1. Refine the vector space using the L2 retrofit algorithm;
2. Obtain the vector space of centroids;
3. Obtain vectors for a given pair of words and, optionally, for given context words;
4. Recalculate the vectors using the localized centering method;
5. Calculate the ranking of vector components for the given pair of words;
6. Use the ranking-based similarity function to obtain the similarity between the given pair of words.

We use all of these methods together to achieve a significant improvement over the baseline method. We present details of the algorithm in the following sections.

3.1 Baseline

The cosine function provides the baseline similarity measure:

$sim(w_1, w_2) = \cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \|v_2\|}$  (6)

The cosine function has been shown to achieve a reasonable baseline. It is superior to the Euclidean similarity measure and is used in various works related to word similarity. In our work we apply several post-process modifications to the vector space model; we also redefine the similarity measure.

3.2 Implementing relational knowledge into the vector space model

Let us define a lexicon of synonyms L. Each row in the lexicon consists of a word and the set of its synonyms:

$L(w_i) = \{w_j : synonymity(w_i, w_j)\}$  (7)

A basic method of implementing synonym knowledge into the vector space model was previously described in Faruqui et al. (2014). We refer to that method as retrofit; it uses an iterative algorithm that moves a vector towards the average vector of its synonyms according to the following formula:

$v'_i = \alpha_i v_i + \frac{\sum_{w_j \in L(w_i)} \beta_j v_j}{\|L(w_i)\|}$  (8)

The parameters α and β allow us to weigh the importance of particular synonyms. The basic retrofit method moves the vector towards its destination (i.e. shortens the distance between the average synonym vector and the given vector) using the Euclidean definition of distance. This is not consistent with the cosine distance. Instead, we improve on the idea of Faruqui et al. (2014) by performing the operations in spherical space, normalizing the vectors and thus preserving the angular definition of distance. This amounts to rotating the vector instead of translating it. We implemented the exact transformation for the rotation; however, it proved to be time consuming, which affected our work on the subject. Therefore, for simplicity, we exploit the fact that the average of two normalized vectors lies precisely between the given vectors in both the Euclidean and the angular definition of distance. This gives the following formula:

$v'_i = \|v_i\| \left( \frac{v_i}{\|v_i\|} + \frac{\sum_{w_j \in L(w_i)} \frac{v_j}{\|v_j\|}}{\|L(w_i)\|} \right)$  (9)

3.3 Hubness reduction

We obtain a centroid c_i from the k nearest neighbors of the given vector v_i. We apply the cosine distance measure to calculate the nearest neighbors:

$c_i = \frac{\sum_{v_j \in k\text{-}NN(v_i)} v_j}{N}$  (10)

In Feldbauer & Flexer (2016), the authors point out that the skewness of a space has a direct connection to the hubness of its vectors. We follow the pattern presented in that work and recalculate the vectors using the following formula:

$v'_i = v_i - c_i^{\gamma}$  (11)

The parameter γ in the equation is equal to the skewness of the space.
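A minimal Python/NumPy sketch of these two refinement steps, under our reading of equations (9)-(11): uniform synonym weighting in the L2 retrofit, a renormalization back to the original vector length to match the stated intent of rotating rather than translating, brute-force cosine nearest neighbours with N in eq. (10) taken as the neighbourhood size k, and an elementwise signed power for c_i^γ. All names and defaults are illustrative; this is not the authors' released implementation.

```python
import numpy as np

def l2_retrofit(vectors, lexicon, iterations=10):
    """L2 retrofitting in the spirit of eq. (9): average the normalized
    synonym vectors, then rescale so the updated vector keeps its original
    norm -- a rotation rather than a translation."""
    vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, synonyms in lexicon.items():
            synonyms = [s for s in synonyms if s in vecs]
            if word not in vecs or not synonyms:
                continue
            v = vecs[word]
            syn_mean = sum(vecs[s] / np.linalg.norm(vecs[s])
                           for s in synonyms) / len(synonyms)
            direction = v / np.linalg.norm(v) + syn_mean
            vecs[word] = np.linalg.norm(v) * direction / np.linalg.norm(direction)
    return vecs

def localized_centering(matrix, k=10, gamma=9.0):
    """Hubness reduction, eqs. (10)-(11): subtract the gamma-th (signed,
    elementwise) power of the local centroid of each vector's k nearest
    cosine neighbours."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)             # a vector is not its own neighbour
    nearest = np.argsort(-sims, axis=1)[:, :k]  # k nearest neighbours
    centroids = matrix[nearest].mean(axis=1)    # eq. (10), assuming N = k
    return matrix - np.sign(centroids) * np.abs(centroids) ** gamma  # eq. (11)
```

The signed power keeps the sign of negative centroid components, which matters when γ is not an odd integer; the exact form of the correction depends on the reconstruction of eq. (11) above.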
3.3.1 Ranking based similarity function

We propose a component ranking function as the similarity measure. This idea was originally introduced by Santus et al. (2016), who proposed the APSyn ranking function. Let us define the vector v_i as a list of its components:

$v_i = [f_1, f_2, ..., f_n]$  (12)

We then obtain the ranking r_i^d by sorting the list in descending order (d in the equation denotes the type of ordering), denoting each of the components with its rank on the list:

$r_i^d = \{f_1 : rank_i^d(f_1), ..., f_n : rank_i^d(f_n)\}$  (13)

APSyn is calculated on the intersection of the N components with the highest scores:

$APSyn(w_i, w_j) = \sum_{f_k \in top(r_i^d) \cap top(r_j^d)} \frac{2}{rank_i(f_k) + rank_j(f_k)}$  (14)

APSyn was originally computed on the PPMI language model, which has the unique feature of non-negative vector components. As this feature is not given for every language model, we also take into account negative values of the components. We define the negative ranking by sorting the components in ascending order (a in the equation denotes the type of ordering):

$r_i^a = \{f_1 : rank_i^a(f_1), ..., f_n : rank_i^a(f_n)\}$  (15)

As we want our ranking-based similarity function to preserve some of the cosine properties, we define score values for each of the components and, similarly to the cosine function, multiply the scores for each component. As the distribution of component values is Gaussian, we use the exponential function:

$s_{i,f_k} = e^{-\frac{k \cdot rank_i(f_k)}{d}}$  (16)

The parameters k and d correspond, respectively, to the weighting of the score function and the dimensionality of the space. With high k values, the highest-ranked component is the most influential one. The rationale is maximizing information gain. Our measure is similar to the infAP and infNDCG measures used in information retrieval (Roberts et al. (2017)), which give maximum weight to the top-1 result, whereas P@10 gives equal weight to the top 10 results. Lower k values increase the impact of lower-ranked components at the expense of the 'long tail' of ranked components. We use the default value k = 10. The score function is identical for both ascending and descending rankings.

We address the problem of polysemy with a differential analysis process. Similarity between a pair of words is captured by discovering the sense of each word and then comparing the two given senses. The sense of a word is discovered by analysis of its context. We define the differential analysis of a component as the sum of all scores for that exact component in each of the context vectors:

$h_{i,f_k} = \sum_{w_j \in context(w_i)} s_{j,f_k}$  (17)

Finally, we define the Ranking-based Exponential Similarity Measure (RESM) as follows:

$RESM^a(w_i, w_j) = \sum_{f_k \in top(r_i^a) \cap top(r_j^a)} \frac{s^a_{i,f_k} \cdot s^a_{j,f_k}}{h^a_{i,f_k}}$  (18)

The equation is similar to the cosine function: both the cosine and RESM measures multiply the values of each component and sum the obtained results. Contrary to the cosine function, RESM scales with a given context. It should be noted that we apply differential analysis with the context function h. An equation in this form is dedicated to the test sets we use in the evaluation. The final value is calculated as the sum of the RESM for both types of ordering:

$RESM(w_i, w_j) = RESM^a(w_i, w_j) + RESM^d(w_i, w_j)$  (19)
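The ranking, scoring, and differential-analysis steps of equations (13)-(19) can be sketched as follows, assuming NumPy. The top-N cutoff (top_n), which the text leaves unspecified, and all function names are our assumptions.

```python
import numpy as np

def ranks(v, ascending):
    """Eqs. (13)/(15): rank 1 goes to the smallest (ascending) or the
    largest (descending) component."""
    order = np.argsort(v) if ascending else np.argsort(-v)
    r = np.empty(len(v), dtype=int)
    r[order] = np.arange(1, len(v) + 1)
    return r

def scores(v, ascending, k=10):
    """Eq. (16): s = exp(-k * rank / d), with d the dimensionality."""
    return np.exp(-k * ranks(v, ascending) / len(v))

def resm_one_sided(v_i, v_j, context, ascending, top_n=50, k=10):
    """Eq. (18) for one ordering, with the differential term h of eq. (17)
    summed over the context vectors (the candidate answers)."""
    shared = (ranks(v_i, ascending) <= top_n) & (ranks(v_j, ascending) <= top_n)
    s_i, s_j = scores(v_i, ascending, k), scores(v_j, ascending, k)
    h = sum(scores(c, ascending, k) for c in context)  # eq. (17)
    return float(np.sum(s_i[shared] * s_j[shared] / h[shared]))

def resm(v_i, v_j, context, top_n=50, k=10):
    """Eq. (19): sum of the ascending and descending one-sided measures."""
    return (resm_one_sided(v_i, v_j, context, True, top_n, k)
            + resm_one_sided(v_i, v_j, context, False, top_n, k))
```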
3.4 Implementation

The algorithm has been implemented in C. The source code is available at https://github.com/dudenzz/DistributionalModel.

4 Evaluation

We have tested our method against the TOEFL and ESL test sets. TOEFL consists of 80 questions; ESL consists of 50 questions. Questions in ESL are significantly harder. Both tests consist of questions designed for non-native speakers of English. Each question consists of a question word with a set of four answers. It is worth pointing out that the context given by the set of possible answers often defines the question. The example questions in Table 1 highlight the problem.

Table 1: Example questions.

Q. word   P1     P2      P3       P4
Iron      Wood   Metal   Plastic  Stone
Iron      Wood   Crop    Grass    Arrow

In the first question, all of the possible answers are building materials; Wood should be rejected, as there is a more appropriate answer. In the second question, among the possible answers only Wood is a building material, which makes it a good candidate for the correct answer. This is the basis for applying differential analysis in the similarity measure. Table 2 illustrates state-of-the-art results for both test sets. The TOEFL test set was introduced in Landauer & Dumais (1997); the ESL test set was introduced in Turney (2001).

Table 2: State of the art results for the TOEFL and ESL test sets.

                               TOEFL     ESL
Bullinaria & Levy (2012)       100.0%    66.0%
Osterlund et al. (2015)
Jarmasz & Szpakowicz (2012)    79.7%     82.0%
Lu et al. (2011)               97.5%     86.0%

4.1 Experimental setup

We use the unmodified vector space model trained on 840 billion words from Common Crawl data with the GloVe algorithm introduced in Pennington et al. (2014). The model consists of 2.2 million unique vectors; each vector consists of 300 components. The model can be obtained via the GloVe authors' website. We run several experiments with the following settings: in the evaluation, the skewness γ = 9 and k = 10. All of the possible answers are taken as context words for the differential analysis. In our runs we use all of the described methods separately and conjunctively. We refer to the methods in the following way: we denote the localized centering method for hubness reduction as HR; we use the Paraphrase Database lexicon introduced in Ganitkevitch et al. (2013) for the retrofitting, and denote L2 retrofitting as RETRO.

The procedure we provide in this paper is capable of achieving the best results in the word embedding category and (nearly) state-of-the-art results for any current method. The work of Lu et al. (2011) employed two fitting constants (and it is not clear that they were the same for all questions) for answering the TOEFL test where only 50 questions are used. The techniques introduced in this paper are lightweight and easy to implement, yet they provide a significant performance boost to the language model. Recently, a lot of progress has been achieved in relating antonyms to synonyms (Santus et al. (2016), Nguyen et al. (2017)). We tried the antonym RETRO method to take into account relational knowledge on antonyms; the method repels two vectors that form an antonym pair. Contrary to Nguyen et al. (2017), who obtained a minimal improvement, in our case accuracy does not improve (for any value of the repulsion strength), which could be because of the relatively large window (10) in the original GloVe work. All works that could distinguish antonyms from synonyms using word embeddings used much smaller context windows. We plan to extend our GloVe-based calculations to smaller windows. This needs to be studied more thoroughly.
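Putting the pieces together, answering a TOEFL/ESL-style item amounts to taking all candidate answers as the differential-analysis context (Section 4.1) and picking the candidate with the highest RESM score. A sketch, reusing resm from the earlier listing; the glove dictionary of word vectors is a hypothetical stand-in for the loaded model.

```python
def answer_question(question_word, candidates, vectors):
    """Pick the candidate most similar to the question word under RESM,
    with all candidates serving as the differential-analysis context."""
    context = [vectors[c] for c in candidates]
    q = vectors[question_word]
    return max(candidates, key=lambda c: resm(q, vectors[c], context))

# Hypothetical usage with the first example from Table 1:
# answer_question("iron", ["wood", "metal", "plastic", "stone"], glove)
```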
Since single word embedding is a basic element of any semantic task, one can expect a significant improvement of results for these tasks. In particular, the SemEval-2017 International Workshop on Semantic Evaluation ran (among others) the following tasks in the category Semantic comparison for words and texts:

1. Task 1: Semantic Textual Similarity
2. Task 2: Multilingual and Cross-lingual Semantic Word Similarity
3. Task 3: Community Question Answering

Another immediate application would be information retrieval (IR). Expanding queries by adding potentially relevant terms is a common practice for improving relevance in IR systems. There are many methods of query expansion. Relevance feedback takes the documents at the top of a ranking list and adds terms appearing in these documents to a new query. In this work we use the idea to add synonyms and other similar terms to the query terms before the pseudo-relevance feedback. This type of expansion can be divided into two categories: the first involves the use of ontologies or lexicons (relational knowledge); the second is word embedding (WE). Here the close words chosen for expansion have to be very precise, otherwise query drift may occur, and the precision and accuracy of retrieval may deteriorate.

Moreover, our improved method of text processing can be translated to continuous distributed representations of biological sequences for deep proteomics and genomics. A protein sequence is typically notated as a string of letters, listing the amino acids starting at the amino-terminal end through to the carboxyl-terminal end. Either a three-letter code or a single-letter code can be used to represent the 20 naturally occurring amino acids. Until recently, most methods used n-grams. The immediate application is the protein family classification task (Asgari & Mofrad (2015)).

Acknowledgement

This work was supported by the PUT DS grant no. 04/45/DSPB/0149.