Novel Ranking-Based Lexical Similarity Measure for Word Embedding
Jakub Dutkiewicz & Czesław Jędrzejek
Politechnika Poznańska
Poznań, Poland
{jakub.dutkiewicz, czeslaw.jedrzejek}@put.poznan.pl

Abstract
Distributional semantics models derive a word space from linguistic items in context. Meaning is obtained by defining a distance measure between vectors corresponding to lexical entities. Such vectors present several problems. In this paper we provide a guideline for post-process improvements to the baseline vectors. We focus on refining the similarity aspect, address imperfections of the model by applying a hubness reduction method, implement relational knowledge into the model, and provide a new ranking similarity definition that gives maximum weight to the top-1 component value. This feature ranking is similar to the one used in information retrieval. All these enrichments outperform current literature results on the joint ESL and TOEFL test sets. Since single word embedding is a basic element of any semantic task, one can expect a significant improvement of results for these tasks. Moreover, our improved method of text processing can be translated to continuous distributed representations of biological sequences for deep proteomics and genomics.
1 Introduction

Distributional language models are frequently used to measure word similarity in natural language (e.g. Frackowiak et al. (2017)). Recent works usually rely on the Distributional Hypothesis (Harris (1954)) to generate the language models. Such a model often consists of a set of vectors; each vector corresponds to a character string, which represents a word. Mikolov et al. (2013) and Pennington et al. (2014) implement word embedding (WE) algorithms. Vector components in language models created by these algorithms are latent. Similarity between words is defined as a function of the vectors corresponding to the given words. The cosine measure is the most frequently used similarity function. Santus et al. (2016) highlight the fact that the cosine can be outperformed by ranking-based functions. Vector space word representations obtained from purely distributional information of words in large unlabelled corpora are not enough to beat state-of-the-art results in query answering benchmarks, because they suffer from four types of weaknesses:

1. Inadequate definition of similarity,
2. Inability to account for senses of words,
3. Appearance of hubness, which distorts distances between vectors,
4. Inability to distinguish synonyms from antonyms.

In this paper we use an existing word embedding model together with several post-process enhancement techniques. We address three out of four of these issues. In particular, we define a novel similarity measure dedicated to language models.

The Euclidean distance is based on the locations of points in the vector space. Similarity is a function that decreases monotonically with distance: as the distance between two given entities gets shorter, the entities become more similar. This holds for language models. Similarity between words is equal to similarity between their corresponding vectors. There are various definitions of distance. The most common, the Euclidean distance, is defined as follows:

$d(p_1, p_2) = \sqrt{\sum_{c \in p} (c_{p_1} - c_{p_2})^2}$  (1)

Similarity based on the Euclidean definition is inverse to the distance:

$sim(p_1, p_2) = \frac{1}{1 + d(p_1, p_2)}$  (2)

The angular definition of distance uses the cosine function:

$d(p_1, p_2) = 1 - \cos(p_1, p_2)$  (3)

We define angular similarity as:

$sim(p_1, p_2) = \cos(p_1, p_2)$  (4)

Both the Euclidean and the cosine definitions of distance can be viewed as an analysis of vector components. Simple operations, like addition and multiplication, work well in low-dimensional spaces. We believe that applying these metrics in spaces of high dimensionality is not ideal; hence we compare cosine similarity to a measure of distance dedicated to high-dimensional spaces.

2 Related work

Santus et al. (2016) introduce the ranking-based similarity function APSyn. In their experiment APSyn outperforms the cosine similarity, reaching 73% accuracy in the best cases (an improvement of 27% over cosine) on the ESL dataset, and 70% accuracy (an improvement of 10% over cosine) on the TOEFL dataset. In contrast to our work, they use the Positive Pointwise Mutual Information algorithm to create their language model.

A successful avenue to enhance WE was pointed out by Faruqui et al. (2014), using WordNet (Miller & Fellbaum (2007)) and the Paraphrase Database (Ganitkevitch et al. (2013)) to provide synonymy relation information to vector optimization equations. They call this process retrofitting, a pattern we adapt to the angular definition of distance, which is more suitable to our case.
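To make the distance and similarity definitions of equations (1)-(4) concrete, the following is a minimal sketch assuming NumPy; the function names and the toy vectors are illustrative choices of ours, not part of the paper's implementation.

```python
import numpy as np

def euclidean_similarity(p1, p2):
    """Eqs. (1)-(2): similarity as the inverse of Euclidean distance."""
    d = np.sqrt(np.sum((p1 - p2) ** 2))  # eq. (1)
    return 1.0 / (1.0 + d)               # eq. (2)

def cosine_similarity(p1, p2):
    """Eqs. (3)-(4): angular similarity, sim = cos(p1, p2) = 1 - d(p1, p2)."""
    return np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))

# Toy usage on two random 300-dimensional vectors standing in for word vectors.
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=300), rng.normal(size=300)
print(euclidean_similarity(v1, v2), cosine_similarity(v1, v2))
```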
We also address hubness reduction. Hubness is related to the phenomenon of concentration of distances, i.e. the fact that points get closer together at large vector dimensionalities. Hubness is very pronounced for vector dimensions of the order of thousands. We apply the localized centering method for hubness reduction (Feldbauer & Flexer (2016)) to the language models.
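A quick numeric illustration of this concentration effect, entirely ours and assuming NumPy: as dimensionality grows, the spread of pairwise distances shrinks relative to their mean.

```python
import numpy as np

# As dimensionality grows, the relative spread of pairwise Euclidean
# distances shrinks: distances "concentrate" around their mean.
rng = np.random.default_rng(0)
for dim in (3, 30, 300, 3000):
    points = rng.normal(size=(1000, dim))
    dists = np.linalg.norm(points[:500] - points[500:], axis=1)  # 500 random pairs
    print(f"dim={dim:5d}  std/mean of distances = {dists.std() / dists.mean():.3f}")
```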
3 Method

In our work we define the language model as a set of word representations. Each word is represented by its vector. We refer to the vector corresponding to a word w_i as v_i. A complete set of words for a given language is referred to as a vector space model. We define similarity between words w_i and w_j as a function of the vectors v_i and v_j:

$sim(w_i, w_j) = f(v_i, v_j)$  (5)

We present an algorithm for obtaining an optimized similarity measure given a vector space model for word embedding. The algorithm consists of six steps:

1. Refine the vector space using the L2 retrofit algorithm;
2. Obtain the vector space of centroids;
3. Obtain vectors for a given pair of words and, optionally, for given context words;
4. Recalculate the vectors using the localized centering method;
5. Calculate the ranking of vector components for the given pair of words;
6. Use the ranking-based similarity function to obtain the similarity between the given pair of words.

We use all of these methods together to achieve a significant improvement over the baseline method. We present details of the algorithm in the following sections.

3.1 Baseline

The cosine function provides the baseline similarity measure:

$sim(w_1, w_2) = \cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \|v_2\|}$  (6)

The cosine function has been shown to achieve a reasonable baseline. It is superior to the Euclidean similarity measure and is used in various works related to word similarity. In our work we apply several post-process modifications to the vector space model; we also redefine the similarity measure.

3.2 Implementing relational knowledge into the vector space model

Let us define a lexicon of synonyms L. Each row in the lexicon consists of a word and the set of its synonyms:

$L(w_i) = \{w_j : synonymity(w_i, w_j)\}$  (7)

A basic method of implementing synonym knowledge into the vector space model was previously described in Faruqui et al. (2014). We refer to that method as retrofit; it uses an iterative algorithm that moves a vector towards the average vector of its synonyms according to the following formula:

$v'_i = \alpha_i v_i + \frac{\sum_{w_j \in L(w_i)} \beta_j v_j}{\|L(w_i)\|}$  (8)

The parameters α and β allow us to weigh the importance of particular synonyms. The basic retrofit method moves the vector towards its destination (i.e. shortens the distance between the average synonym vector and the given vector) using the Euclidean definition of distance. This is not consistent with the cosine distance. Instead, we improve on the idea of Faruqui et al. (2014) by performing the operations in spherical space, normalizing the vectors and thus preserving the angular definition of distance. This amounts to rotating the vector instead of translating it. We implemented the exact transformation for the rotation; however, it proved to be time consuming, which affected our work on the subject. Therefore, for simplicity, we exploit the fact that the average of two normalized vectors lies precisely between the given vectors in both the Euclidean and the angular definition of distance. This gives the following formula:

$v'_i = \|v_i\| \left( \frac{v_i}{\|v_i\|} + \frac{\sum_{w_j \in L(w_i)} \frac{v_j}{\|v_j\|}}{\|L(w_i)\|} \right)$  (9)

3.3 Hubness reduction

We obtain a centroid c_i from the k nearest neighbors of the given vector v_i. We apply the cosine distance measure to calculate the nearest neighbors:

$c_i = \frac{\sum_{v_j \in k\text{-}NN(v_i)} v_j}{N}$  (10)

In Feldbauer & Flexer (2016), the authors point out that the skewness of a space has a direct connection to the hubness of its vectors. We follow the pattern presented in that work and recalculate the vectors using the following formula:

$v'_i = v_i - c_i^{\gamma}$  (11)

The parameter γ in the equation is equal to the skewness of the space.
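A minimal Python/NumPy sketch of these two refinement steps, under our reading of equations (9)-(11): uniform synonym weighting in the L2 retrofit, a renormalization back to the original vector length to match the stated intent of rotating rather than translating, brute-force cosine nearest neighbours with N in eq. (10) taken as the neighbourhood size k, and an elementwise signed power for c_i^γ. All names and defaults are illustrative; this is not the authors' released implementation.

```python
import numpy as np

def l2_retrofit(vectors, lexicon, iterations=10):
    """L2 retrofitting in the spirit of eq. (9): average the normalized
    synonym vectors, then rescale so the updated vector keeps its original
    norm -- a rotation rather than a translation."""
    vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, synonyms in lexicon.items():
            synonyms = [s for s in synonyms if s in vecs]
            if word not in vecs or not synonyms:
                continue
            v = vecs[word]
            syn_mean = sum(vecs[s] / np.linalg.norm(vecs[s])
                           for s in synonyms) / len(synonyms)
            direction = v / np.linalg.norm(v) + syn_mean
            vecs[word] = np.linalg.norm(v) * direction / np.linalg.norm(direction)
    return vecs

def localized_centering(matrix, k=10, gamma=9.0):
    """Hubness reduction, eqs. (10)-(11): subtract the gamma-th (signed,
    elementwise) power of the local centroid of each vector's k nearest
    cosine neighbours."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)             # a vector is not its own neighbour
    nearest = np.argsort(-sims, axis=1)[:, :k]  # k nearest neighbours
    centroids = matrix[nearest].mean(axis=1)    # eq. (10), assuming N = k
    return matrix - np.sign(centroids) * np.abs(centroids) ** gamma  # eq. (11)
```

The signed power keeps the sign of negative centroid components, which matters when γ is not an odd integer; the exact form of the correction depends on the reconstruction of eq. (11) above.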
3.3.1 Ranking based similarity function

We propose a component ranking function as the similarity measure. This idea was originally introduced by Santus et al. (2016), who proposed the APSyn ranking function. Let us define the vector v_i as a list of its components:

$v_i = [f_1, f_2, ..., f_n]$  (12)

We then obtain the ranking r_i^d by sorting the list in descending order (d in the equation denotes the type of ordering), denoting each of the components with its rank on the list:

$r_i^d = \{f_1 : rank_i^d(f_1), ..., f_n : rank_i^d(f_n)\}$  (13)

APSyn is calculated on the intersection of the N components with the highest scores:

$APSyn(w_i, w_j) = \sum_{f_k \in top(r_i^d) \cap top(r_j^d)} \frac{2}{rank_i(f_k) + rank_j(f_k)}$  (14)

APSyn was originally computed on the PPMI language model, which has the unique feature of non-negative vector components. As this feature is not given for every language model, we also take into account negative values of the components. We define the negative ranking by sorting the components in ascending order (a in the equation denotes the type of ordering):

$r_i^a = \{f_1 : rank_i^a(f_1), ..., f_n : rank_i^a(f_n)\}$  (15)

As we want our ranking-based similarity function to preserve some of the cosine properties, we define score values for each of the components and, similarly to the cosine function, multiply the scores for each component. As the distribution of component values is Gaussian, we use the exponential function:

$s_{i,f_k} = e^{-\frac{k \cdot rank_i(f_k)}{d}}$  (16)

The parameters k and d correspond, respectively, to the weighting of the score function and the dimensionality of the space. With high k values, the highest-ranked component is the most influential one. The rationale is maximizing information gain. Our measure is similar to the infAP and infNDCG measures used in information retrieval (Roberts et al. (2017)), which give maximum weight to the top-1 result, whereas P@10 gives equal weight to the top 10 results. Lower k values increase the impact of lower-ranked components at the expense of the 'long tail' of ranked components. We use the default value k = 10. The score function is identical for both ascending and descending rankings.

We address the problem of polysemy with a differential analysis process. Similarity between a pair of words is captured by discovering the sense of each word and then comparing the two given senses. The sense of a word is discovered by analysis of its context. We define the differential analysis of a component as the sum of all scores for that exact component in each of the context vectors:

$h_{i,f_k} = \sum_{w_j \in context(w_i)} s_{j,f_k}$  (17)

Finally, we define the Ranking-based Exponential Similarity Measure (RESM) as follows:

$RESM^a(w_i, w_j) = \sum_{f_k \in top(r_i^a) \cap top(r_j^a)} \frac{s^a_{i,f_k} \cdot s^a_{j,f_k}}{h^a_{i,f_k}}$  (18)

The equation is similar to the cosine function: both the cosine and RESM measures multiply the values of each component and sum the obtained results. Contrary to the cosine function, RESM scales with a given context. It should be noted that we apply differential analysis with the context function h. An equation in this form is dedicated to the test sets we use in the evaluation. The final value is calculated as the sum of the RESM for both types of ordering:

$RESM(w_i, w_j) = RESM^a(w_i, w_j) + RESM^d(w_i, w_j)$  (19)
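The ranking, scoring, and differential-analysis steps of equations (13)-(19) can be sketched as follows, assuming NumPy. The top-N cutoff (top_n), which the text leaves unspecified, and all function names are our assumptions.

```python
import numpy as np

def ranks(v, ascending):
    """Eqs. (13)/(15): rank 1 goes to the smallest (ascending) or the
    largest (descending) component."""
    order = np.argsort(v) if ascending else np.argsort(-v)
    r = np.empty(len(v), dtype=int)
    r[order] = np.arange(1, len(v) + 1)
    return r

def scores(v, ascending, k=10):
    """Eq. (16): s = exp(-k * rank / d), with d the dimensionality."""
    return np.exp(-k * ranks(v, ascending) / len(v))

def resm_one_sided(v_i, v_j, context, ascending, top_n=50, k=10):
    """Eq. (18) for one ordering, with the differential term h of eq. (17)
    summed over the context vectors (the candidate answers)."""
    shared = (ranks(v_i, ascending) <= top_n) & (ranks(v_j, ascending) <= top_n)
    s_i, s_j = scores(v_i, ascending, k), scores(v_j, ascending, k)
    h = sum(scores(c, ascending, k) for c in context)  # eq. (17)
    return float(np.sum(s_i[shared] * s_j[shared] / h[shared]))

def resm(v_i, v_j, context, top_n=50, k=10):
    """Eq. (19): sum of the ascending and descending one-sided measures."""
    return (resm_one_sided(v_i, v_j, context, True, top_n, k)
            + resm_one_sided(v_i, v_j, context, False, top_n, k))
```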
3.4 Implementation

The algorithm has been implemented in C. The source code is available at https://github.com/dudenzz/DistributionalModel.

4 Evaluation

We have tested our method against the TOEFL and ESL test sets. TOEFL consists of 80 questions; ESL consists of 50 questions. Questions in ESL are significantly harder. Both tests consist of questions designed for non-native speakers of English. Each question consists of a question word with a set of four answers. It is worth pointing out that the context given by the set of possible answers often defines the question. The example questions in Table 1 highlight the problem.

Table 1: Example questions.

Q. word   P1     P2      P3       P4
Iron      Wood   Metal   Plastic  Stone
Iron      Wood   Crop    Grass    Arrow

In the first question, all of the possible answers are building materials; Wood should be rejected, as there is a more appropriate answer. In the second question, among the possible answers only Wood is a building material, which makes it a good candidate for the correct answer. This is the basis for applying differential analysis in the similarity measure. Table 2 illustrates state-of-the-art results for both test sets. The TOEFL test set was introduced in Landauer & Dumais (1997); the ESL test set was introduced in Turney (2001).

Table 2: State of the art results for the TOEFL and ESL test sets.

                               TOEFL     ESL
Bullinaria & Levy (2012)       100.0%    66.0%
Osterlund et al. (2015)
Jarmasz & Szpakowicz (2012)    79.7%     82.0%
Lu et al. (2011)               97.5%     86.0%

4.1 Experimental setup

We use the unmodified vector space model trained on 840 billion words from Common Crawl data with the GloVe algorithm introduced in Pennington et al. (2014). The model consists of 2.2 million unique vectors; each vector consists of 300 components. The model can be obtained via the GloVe authors' website. We run several experiments with the following settings: in the evaluation, the skewness γ = 9 and k = 10. All of the possible answers are taken as context words for the differential analysis. In our runs we use all of the described methods separately and conjunctively. We refer to the methods in the following way: we denote the localized centering method for hubness reduction as HR; we use the Paraphrase Database lexicon introduced in Ganitkevitch et al. (2013) for the retrofitting, and denote L2 retrofitting as RETRO.

The procedure we provide in this paper is capable of achieving the best results in the word embedding category and (nearly) state-of-the-art results for any current method. The work of Lu et al. (2011) employed two fitting constants (and it is not clear that they were the same for all questions) for answering the TOEFL test where only 50 questions are used. The techniques introduced in this paper are lightweight and easy to implement, yet they provide a significant performance boost to the language model. Recently, a lot of progress has been achieved in relating antonyms to synonyms (Santus et al. (2016), Nguyen et al. (2017)). We tried the antonym RETRO method to take into account relational knowledge on antonyms; the method repels two vectors that form an antonym pair. Contrary to Nguyen et al. (2017), who obtained a minimal improvement, in our case accuracy does not improve (for any value of the repulsion strength), which could be because of the relatively large window (10) in the original GloVe work. All works that could distinguish antonyms from synonyms using word embeddings used much smaller context windows. We plan to extend our GloVe-based calculations to smaller windows. This needs to be studied more thoroughly.
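Putting the pieces together, answering a TOEFL/ESL-style item amounts to taking all candidate answers as the differential-analysis context (Section 4.1) and picking the candidate with the highest RESM score. A sketch, reusing resm from the earlier listing; the glove dictionary of word vectors is a hypothetical stand-in for the loaded model.

```python
def answer_question(question_word, candidates, vectors):
    """Pick the candidate most similar to the question word under RESM,
    with all candidates serving as the differential-analysis context."""
    context = [vectors[c] for c in candidates]
    q = vectors[question_word]
    return max(candidates, key=lambda c: resm(q, vectors[c], context))

# Hypothetical usage with the first example from Table 1:
# answer_question("iron", ["wood", "metal", "plastic", "stone"], glove)
```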
Since single word embedding is a basic element of any semantic task, one can expect a significant improvement of results for these tasks. In particular, the SemEval-2017 International Workshop on Semantic Evaluation ran (among others) the following tasks in the category Semantic comparison for words and texts:

1. Task 1: Semantic Textual Similarity
2. Task 2: Multilingual and Cross-lingual Semantic Word Similarity
3. Task 3: Community Question Answering

Another immediate application would be information retrieval (IR). Expanding queries by adding potentially relevant terms is a common practice for improving relevance in IR systems. There are many methods of query expansion. Relevance feedback takes the documents at the top of a ranking list and adds terms appearing in these documents to a new query. In this work we use the idea to add synonyms and other similar terms to the query terms before the pseudo-relevance feedback. This type of expansion can be divided into two categories: the first involves the use of ontologies or lexicons (relational knowledge); the second is word embedding (WE). Here the close words chosen for expansion have to be very precise, otherwise query drift may occur, and the precision and accuracy of retrieval may deteriorate.

Moreover, our improved method of text processing can be translated to continuous distributed representations of biological sequences for deep proteomics and genomics. A protein sequence is typically notated as a string of letters, listing the amino acids starting at the amino-terminal end through to the carboxyl-terminal end. Either a three-letter code or a single-letter code can be used to represent the 20 naturally occurring amino acids. Until recently, most methods used n-grams. The immediate application is the protein family classification task (Asgari & Mofrad (2015)).

Acknowledgement

This work was supported by the PUT DS grant no. 04/45/DSPB/0149.