Automatic Synonym Discovery with Knowledge Bases
Meng Qu
University of Illinois at Urbana-Champaign, [email protected]
Xiang Ren
University of Illinois at Urbana-Champaign, [email protected]
Jiawei Han
University of Illinois at Urbana-Champaign, [email protected]
ABSTRACT
Recognizing entity synonyms from text has become a crucial task in many entity-leveraging applications. However, discovering entity synonyms from domain-specific text corpora (e.g., news articles, scientific papers) is rather challenging. Current systems take an entity name string as input to find other names that are synonymous, ignoring the fact that a name string can often refer to multiple entities (e.g., "apple" could refer to both Apple Inc. and the fruit apple). Moreover, most existing methods require training data manually created by domain experts to construct supervised-learning systems. In this paper, we study the problem of automatic synonym discovery with knowledge bases, that is, identifying synonyms for knowledge base entities in a given domain-specific corpus. The manually curated synonyms for each entity stored in a knowledge base not only form a set of name strings that disambiguate the meaning for each other, but also serve as "distant" supervision to help determine important features for the task. We propose a novel framework, called DPE, to integrate two kinds of mutually complementing signals for synonym discovery: distributional features based on corpus-level statistics and textual patterns based on local contexts. In particular, DPE jointly optimizes the two kinds of signals in conjunction with distant supervision, so that they can mutually enhance each other in the training stage. At the inference stage, both signals are utilized to discover synonyms for the given entities. Experimental results prove the effectiveness of the proposed framework.

ACM Reference format:
Meng Qu, Xiang Ren, and Jiawei Han. 2017. Automatic Synonym Discovery with Knowledge Bases. In Proceedings of KDD'17, August 13-17, 2017, Halifax, NS, Canada. ACM ISBN 978-1-4503-4887-4/17/08. DOI: 10.1145/3097983.3098185
INTRODUCTION

People often have a variety of ways to refer to the same real-world entity, forming different synonyms for the entity (e.g., the entity United States can be referred to using "America" and "USA"). Automatic synonym discovery is an important task in text analysis and understanding, as the extracted synonyms (i.e., the alternative ways to refer to the same entity) can benefit many downstream applications [1, 32, 33, 37]. For example, by forcing synonyms of an entity to be assigned to the same topic category, one can constrain the topic modeling process and yield topic representations with higher quality [33]. Another example is document retrieval [26], where
we can leverage entity synonyms to enhance the process of query expansion, and thus improve retrieval performance.

Figure 1: Distant supervision for synonym discovery. We link entity mentions in the text corpus to knowledge base entities, and collect training seeds from the knowledge bases.

One straightforward approach for obtaining entity synonyms is to leverage publicly available knowledge bases such as Freebase and WordNet, in which popular synonyms for the entities are manually curated by human crowds. However, the coverage of knowledge bases can be rather limited, especially for newly emerging entities, as the manual curation process entails high costs and does not scale. For example, the entities in Freebase have only 1.1 synonyms on average. To increase the synonym coverage, we expect to automatically extract, from massive domain-specific text corpora, synonyms that are not yet in knowledge bases. Many approaches address this problem through supervised [19, 27, 29] or weakly supervised learning [11, 20], treating some manually labeled synonyms as seeds to train a synonym classifier or to detect local patterns for synonym discovery. Though quite effective in practice, such approaches still rely on careful seed selection by humans.

To retrieve training seeds automatically, there has recently been growing interest in the distant supervision strategy, which aims to automatically collect training seeds from existing knowledge bases. The typical workflow is: i) detect entity mentions in the given corpus, ii) map the detected entity mentions to the entities in a given knowledge base, and iii) collect training seeds from the knowledge base. Such techniques have proven effective in a variety of applications, such as relation extraction [10], entity typing [17] and emotion classification [14]. Inspired by this strategy, a promising direction for automatic synonym discovery is to collect training seeds (i.e., sets of synonymous strings) from knowledge bases.

Although distant supervision helps collect training seeds automatically, it also poses a challenge due to the string ambiguity problem, that is, the same entity surface string can be mapped to different entities in knowledge bases. For example, consider the string "Washington" in Figure 1. The "Washington" in the first sentence represents a state of the United States, while in the second sentence it refers to a person. As strings like "Washington" have ambiguous meanings, directly inferring synonyms for such strings may yield a set of synonyms that mixes multiple entities. For example, the synonyms of Washington returned by current systems may contain both state names and person names, which is not desirable. To address this challenge, instead of using ambiguous strings as queries, a better way is to use specific concepts as queries to disambiguate, such as entities in knowledge bases.

This motivates us to define a new task: automatic synonym discovery for entities with knowledge bases. Given a domain-specific corpus, we aim to collect existing name strings of entities from knowledge bases as seeds. For each query entity, its existing name strings disambiguate the meaning for each other, and we let them vote to decide whether a given candidate string is a synonym of the query entity. Based on that, the key task for this problem is to predict whether a pair of strings is synonymous. For this task, the collected seeds can serve as supervision to help determine the important features. However, as the synonym seeds from knowledge bases are usually quite limited, how to use them effectively becomes a major challenge. There are broadly two kinds of efforts towards exploiting a limited number of seed examples.

The distributional based approaches [9, 13, 19, 27, 29] consider corpus-level statistics, and they assume that strings which often appear in similar contexts are likely to be synonyms. For example, the strings "USA" and "United States" are usually mentioned in similar contexts, and they are synonyms of the country USA. Based on this assumption, the distributional based approaches usually represent strings with their distributional features, and treat the synonym seeds as labels to train a classifier that predicts whether a given pair of strings is synonymous. Since most synonymous strings appear in similar contexts, such approaches usually have high recall. However, this strategy also brings noise, since some non-synonymous strings, such as "USA" and "Canada", may also share similar contexts and could be labeled as synonyms incorrectly.

Alternatively, the pattern based approaches [5, 15, 20, 22] consider local contexts, and they infer the relation of two strings by analyzing sentences mentioning both of them. For example, from the sentence "The United States of America is commonly referred to as America.", we can infer that "United States of America" and "America" have the synonym relation, while the sentence "The USA is adjacent to Canada." may imply that "USA" and "Canada" are not synonymous. To leverage this observation, the pattern based approaches extract textual patterns from sentences in which two synonymous strings co-occur, and discover more synonyms with the learned patterns. Different from the distributional based approaches, the pattern based approaches can treat the patterns as concrete evidence to support the discovered synonyms, which is more convincing and interpretable. However, as many synonymous strings are never co-mentioned in any sentence, such approaches usually suffer from low recall.

Ideally, we would wish to combine the merits of both approaches, and in this paper we propose such a solution named DPE (distributional and pattern integrated embedding framework). Our framework consists of a distributional module and a pattern module. The distributional module predicts the synonym relation from the global distributional features of strings, while in the pattern module we aim to discover synonyms from the local contexts. Both modules are built on top of a set of shared string embeddings, which preserve the semantic meanings of strings.
Figure 2: Comparison of the distributional based and pattern based approaches. To predict the relation of two target strings, the distributional based approaches analyze their distributional features, while the pattern based approaches analyze the local patterns extracted from sentences mentioning both of the target strings.

During training, both modules treat the embeddings as features for synonym prediction, and in turn update the embeddings based on the supervision from the synonym seeds. The string embeddings are shared across the modules, and therefore each module can leverage the knowledge discovered by the other module to improve the learning process.

To discover missing synonyms for an entity, one may directly rank all candidate strings with both modules. However, such a strategy can have high time costs, as the pattern module needs to extract and analyze all sentences mentioning a pair of given strings when predicting their relation. To speed up synonym discovery, our framework first utilizes the distributional module to rank all candidate strings, and extracts a set of top ranked candidates as high-potential ones. After that, we re-rank the high-potential candidates with both modules, and treat the top ranked candidates as the discovered synonyms.

The major contributions of the paper are summarized as follows:
• We propose to study the problem of automatic synonym discovery with knowledge bases, i.e., aiming to discover missing synonyms for entities by collecting training seeds from knowledge bases.
• We propose a novel approach DPE, which naturally integrates the distributional based approaches and the pattern based approaches for synonym discovery.
• We conduct extensive experiments on real-world text corpora. Experimental results prove the effectiveness of our proposed approach over many competitive baseline approaches.
PRELIMINARIES

In this section, we define several concepts and our problem.
Synonym.
A synonym is a string (i.e., a word or phrase) that means exactly or nearly the same as another string in the same language [21]. Synonyms widely exist in human languages. For example, "Aspirin" and "Acetylsalicylic Acid" refer to the same drug; "United States" and "USA" represent the same country. All these pairs of strings are synonymous.
Entity Synonym.
For an entity, its synonyms are strings that can be used as alternative names to describe that entity. For example, both the strings "USA" and "United States" serve as alternative names of the entity United States, and therefore they are synonyms of this entity.
Knowledge Base.
A knowledge base consists of manually constructed facts about a set of entities. In this paper, we focus only on the existing entity synonyms provided by knowledge bases, and we collect those existing synonyms as training seeds to help discover other missing synonyms.

Figure 3: Framework Overview (seed collection from the linked corpus; joint learning of the distributional module and the pattern module over shared string embeddings; two-stage inference from high-potential candidates to discovered synonyms).
Problem Definition.
Given the above concepts, we formally define our problem as follows.
Definition 2.1. (Problem Definition)
Given a knowledge base K and a text corpus D, our problem is to extract the missing synonyms for the query entities.

METHODOLOGY

In this section, we introduce our approach DPE for entity synonym discovery with knowledge bases. To infer the synonyms of a query entity, we leverage its name strings collected from knowledge bases to disambiguate the meaning for each other, and let them vote to decide whether a given candidate string is a synonym of the query entity. Therefore, the key task for this problem is to predict whether a pair of strings is synonymous. For this task, the synonym seeds collected from knowledge bases can serve as supervision to guide the learning process. However, as the number of synonym seeds is usually small, how to leverage them effectively is quite challenging. Existing approaches either train a synonym classifier with the distributional features, or learn some textual patterns for synonym discovery, and thus cannot exploit the seeds sufficiently. To address this challenge, our framework naturally integrates the distributional based approaches and the pattern based approaches. Specifically, our framework consists of a distributional module and a pattern module. Given a pair of target strings, the distributional module predicts the synonym relation from the global distributional features of each string, while the pattern module considers the local contexts mentioning both target strings. During training, the two modules mutually enhance each other. At the inference stage, we leverage both modules to find high-quality synonyms for the query entities.
Framework Overview.
The overall framework of DPE (Figure 3) is summarized below (a minimal code sketch of the full pipeline follows the list):
(1) Detect entity mentions in the given text corpus and link them to entities in the given knowledge base. Collect synonym seeds from knowledge bases as supervision.
(2) Jointly optimize the distributional and the pattern modules. The distributional module predicts synonym relations with the global distributional features, while the pattern module considers the local contexts mentioning both target strings.
(3) Discover missing synonyms for the query entities with both the distributional module and the pattern module.
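This pipeline is not accompanied by code in the paper; the following Python sketch merely mirrors the three steps above. The helper names detect_and_link_mentions, train_dpe and rank_synonyms are hypothetical placeholders, while collect_seeds shows the false-seed filtering rule described in the Seed Collection subsection below.

    # Minimal sketch of the three-step DPE pipeline (hypothetical helpers).

    def collect_seeds(linked_mentions, kb_synonyms):
        """Step 1, second half: keep a mention-entity link only if the
        mention's surface string already appears in the entity's synonym
        list, then group the surviving surface strings per entity."""
        seeds = {}
        for surface, entity in linked_mentions:
            if surface in kb_synonyms.get(entity, set()):  # drop false links
                seeds.setdefault(entity, set()).add(surface)
        return seeds

    def run_dpe(corpus, kb_synonyms, query_entities):
        linked = detect_and_link_mentions(corpus)     # step 1: NER + linking
        seeds = collect_seeds(linked, kb_synonyms)
        model = train_dpe(corpus, seeds)              # step 2: joint training
        return {e: rank_synonyms(model, e)            # step 3: inference
                for e in query_entities}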
Seed Collection.
To automatically collect synonym seeds, our approach first detects entity mentions (strings that represent entities) in the given text corpus and links them to entities in the given knowledge base. After that, we retrieve the existing synonyms in knowledge bases as our training seeds. An illustrative example is presented in Figure 1.

Specifically, we first apply existing named-entity recognition (NER) tools [8] (http://stanfordnlp.github.io/CoreNLP/) to detect entity mentions and phrases in the given text corpus. Then entity linking techniques such as DBpedia Spotlight [3] (https://github.com/dbpedia-spotlight/dbpedia-spotlight) are applied to map the detected entity mentions to the given knowledge base. During entity linking, some mentions can be linked to incorrect entities, leading to false synonym seeds. To remove such seeds, for each mention and its linked entity, if the surface string of that mention is not in the existing synonym list of that entity, we remove the link between the mention and the entity.

After entity mention detection and linking, the synonym seeds are collected from the linked corpus. Specifically, for each entity, we collect all mentions linked to that entity, and treat all corresponding surface strings as synonym seeds.

Joint Optimization.
After extracting synonym seeds from knowledge bases, we formulate an optimization framework to jointly learn the distributional module and the pattern module.

To preserve the semantic meanings of different strings, our framework introduces a low-dimensional vector (a.k.a. embedding) to represent each entity surface string (i.e., strings that are linked to entities in knowledge bases) and each unlinkable string (i.e., words and phrases that are not linked to any entities). For identical strings that are linked to different entities, as they have different semantic meanings, we introduce different embeddings. For example, the string "Washington" can be linked to a state or a person, and we use two embeddings to represent Washington (state) and Washington (person) respectively.

The two modules of our framework are built on top of these string embeddings. Specifically, both modules treat the embeddings as features for synonym prediction, and in turn update the embeddings based on the supervision from the synonym seeds, which may bring stronger predictive abilities to the learned embeddings. Meanwhile, since the string embeddings are shared between the two modules, each module is able to leverage the knowledge discovered by the other module, so that the two modules can mutually enhance each other to improve the learning process.

The overall objective of our framework is summarized as follows:

    $O = O_D + O_P$    (1)

where O_D is the objective of the distributional module and O_P is the objective of the pattern module. Next, we introduce the details of each module.

Distributional Module.
The distributional module of our framework considers the global distributional features for synonym discovery. The module consists of an unsupervised part and a supervised part. In the unsupervised part, a co-occurrence network encoding the distributional information of strings is constructed, and we try to preserve this distributional information in the string embeddings. Meanwhile, in the supervised part, the synonym seeds are used to learn a distributional score function, which takes string embeddings as features to predict whether two strings are synonymous.
Unsupervised Part.
In the unsupervised part, we first construct a co-occurrence network between different strings, which captures their distributional information. Formally, all strings (i.e., entity surface strings and other unlinkable strings) within a sliding window of a certain size w in the text corpus are considered to co-occur with each other. The weight of each pair of strings in the co-occurrence network is defined as their co-occurrence count.

After network construction, we aim to preserve the encoded distributional information in the string embeddings, so that strings with similar semantic meanings have similar embeddings. To preserve the distributional information, we observe that the co-occurrence counts of strings are related to the following factors.

Observation 3.1 (Co-occurrence Observation). (1) If two strings have similar semantic meanings, then they are more likely to co-occur with each other. (2) If a string tends to appear in the context of another one, then they tend to co-occur frequently.

The above observation is quite intuitive. If two strings have similar semantic meanings, they are more likely to be mentioned in the same topics, and therefore have a larger co-occurrence probability. For example, the strings "data mining" and "text mining" are highly correlated, while both have quite different meanings from the word "physics", and we can observe that the co-occurrence chances between "data mining" and "text mining" are much larger than those between "data mining" and "physics". On the other hand, some string pairs with very different meanings may also have large co-occurrence counts, when one tends to appear in the context of the other. For example, the word "capital" often appears in the context of "USA", even though they have very different meanings.

To exploit the above observation, for each string u, besides its embedding vector x_u, we also introduce a context vector c_u, which describes what kinds of strings are likely to be co-mentioned with u. Given a pair of strings (u, v), we model the conditional probability p(u|v) as follows:

    $p(u|v) = \frac{\exp(x_u^T x_v + x_u^T c_v)}{Z}$    (2)

where Z is a normalization term. We see that if u and v have similar embedding vectors, meaning they have similar semantic meanings, the first part (x_u^T x_v) of the equation will be large, leading to a large conditional probability, which corresponds to part (1) of Observation 3.1. On the other hand, if the embedding vector of u is similar to the context vector of v, meaning u tends to appear in the context of v, the second part (x_u^T c_v) becomes large, which also leads to a large conditional probability, corresponding to part (2) of Observation 3.1.

To preserve the distributional information of strings, we expect the estimated distribution p(·|v) to be close to the empirical distribution p'(·|v) (i.e., p'(u|v) = w_{u,v}/d_v, where w_{u,v} is the co-occurrence count between u and v, and d_v is the degree of v in the network) for each string v. Therefore, we minimize the KL divergence between p(·|v) and p'(·|v), which is equivalent to the following objective [24]:

    $L_C = \sum_{u,v \in V} w_{u,v} \log p(u|v)$    (3)

where V is the vocabulary of all strings.

Directly optimizing the above objective is computationally expensive, since computing the conditional probability involves traversing all strings in the vocabulary. Therefore, we leverage the negative sampling technique [9] to speed up the learning process, which modifies the conditional probability p(u|v) in Eqn. 3 as follows:

    $\log \sigma(x_u^T x_v + x_u^T c_v) + \sum_{n=1}^{N} E_{u_n \sim P_{neg}(u)} \left[ -\log \sigma(x_{u_n}^T x_v + x_{u_n}^T c_v) \right]$    (4)

where σ(x) = 1/(1 + exp(−x)) is the sigmoid function. The first term tries to maximize the probabilities of the observed string pairs, while the second term tries to minimize the probabilities of N noisy pairs, where u_n is sampled from a noise distribution $P_{neg}(u) \propto d_u^{3/4}$ and d_u is the degree of string u in the network.
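The paper gives no implementation for this part; below is a minimal numpy sketch of the unsupervised objective under the assumptions stated above (a sliding window of size w, N negative samples, and a noise distribution proportional to d^{3/4}). The update uses the standard word2vec-style negative-sampling gradient, which is our reading of Eqn. (4) rather than the authors' exact code.

    import numpy as np
    from collections import defaultdict

    def build_cooccurrence(corpus, w=5):
        """Co-occurrence counts of string pairs within a window of size w."""
        counts = defaultdict(float)
        for tokens in corpus:
            for i, u in enumerate(tokens):
                for v in tokens[i + 1:i + 1 + w]:
                    counts[(u, v)] += 1.0
                    counts[(v, u)] += 1.0
        return counts

    def noise_distribution(degrees):
        """Noise distribution P_neg(u) proportional to d_u^(3/4)."""
        p = np.asarray(degrees, dtype=float) ** 0.75
        return p / p.sum()

    def update_pair(x, c, u, v, negatives, lr=0.025):
        """One stochastic update for an observed pair (u, v) plus negatives."""
        for t, label in [(u, 1.0)] + [(n, 0.0) for n in negatives]:
            s = x[t] @ (x[v] + c[v])                 # x_t^T x_v + x_t^T c_v
            g = lr * (label - 1.0 / (1.0 + np.exp(-s)))
            grad_t = g * (x[v] + c[v])               # gradient w.r.t. x_t
            x[v] += g * x[t]                         # gradient w.r.t. x_v
            c[v] += g * x[t]                         # gradient w.r.t. c_v
            x[t] += grad_t

Here x and c can be, for example, dictionaries mapping each string to a randomly initialized 100-dimensional vector; with negative sampling, the normalization term Z in Eqn. (2) never needs to be computed explicitly.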
Supervised Part.
The unsupervised part of the distributional module can effectively preserve the distributional information of strings in the learned string embeddings. In the supervised part, we utilize the collected synonym seeds to train a distributional score function, which treats the string embeddings as features to predict whether two strings have the synonym relation.

To measure how likely two strings are to be synonymous, we introduce a score for each pair of strings. Inspired by an existing study [36], we use the following bilinear function to define the score of a string pair (u, v):

    $Score_D(u, v) = x_u W_D x_v^T$    (5)

where x_u is the embedding of string u and W_D is a parameter matrix for the score function. For efficiency, in this paper we constrain W_D to be a diagonal matrix.

To learn the parameters W_D of the score function, we expect the synonymous string pairs to have larger scores than randomly sampled pairs. Therefore, we adopt the following ranking based objective for learning:

    $L_S = \sum_{(u,v) \in S_{seed}} \sum_{v' \in V} \min(1, Score_D(u, v) - Score_D(u, v'))$    (6)

where S_seed is the set of synonymous string pairs and v' is a string randomly sampled from the string vocabulary. By maximizing the above objective, the learned parameter matrix W_D becomes able to distinguish synonymous pairs from others. Meanwhile, we also update the string embeddings to maximize the objective, which brings more predictive abilities to the learned embeddings.

Pattern Module.
For a pair of target strings, the pattern module of our framework predicts their relation from the sentences mentioning both of them. We achieve this by extracting a pattern from each of such sentences, and collecting lexical features and syntactic features to represent each pattern. Based on the extracted features, a pattern classifier is trained to predict whether a pattern expresses the synonym relation between the target strings. Finally, we integrate the prediction results from all these patterns to decide the relation of the target strings.

Figure 4: Examples of patterns and their features. For a pair of target strings in each sentence, we define the pattern as the <token, POS tag, dependency label> triples on the shortest dependency path, and we collect both lexical and syntactic features for pattern classification. Example 1: sentence "Illinois, which is also called IL, is a state in the US.", pattern (ENT, NNP, nsubj) (called, VBN, acl:relcl) (ENT, NN, xcomp). Example 2: sentence "Michigan, also known as MI, consists of two peninsulas.", pattern (ENT, NNP, nsubj) (known, VBN, acl) (ENT, NNP, xcomp).

We first introduce the definition of the pattern used in our framework. Following existing pattern based approaches [11, 35], given two target strings in a sentence, we define the pattern as the sequence of <lexical string, part-of-speech tag, dependency label> triples collected from the shortest dependency path connecting the two strings. Two examples can be found in Figure 4.

For each pattern, we extract features and predict whether the pattern expresses the synonym relation. We expect the extracted features to capture the functional correlations between patterns; in other words, patterns expressing synonym relations should have similar features. For example, consider the two sentences in Figure 4. The patterns in both sentences express the synonym relation between the target strings, and therefore we anticipate that the two patterns have similar features.

Towards this goal, we extract both lexical and syntactic features for each pattern (a code sketch of this feature extraction is given at the end of this subsection). For the lexical features, we average the embeddings of all strings in a pattern. As the string embeddings preserve the semantic meanings of strings, such lexical features can effectively capture the semantic correlations between different patterns. Take the sentences in Figure 4 as an example: since the strings "called" and "known" usually appear in similar contexts, they have quite similar embeddings, and therefore the two patterns have similar lexical features, which is desirable. For the syntactic features, we expect them to capture the syntactic structures of the patterns. Therefore, for each pattern, we treat all n-grams (1 ≤ n ≤ N) in the part-of-speech tag sequence and the dependency label sequence as its syntactic features. Some examples are presented in Figure 4, where we set N as 2.

Based on the extracted features, a pattern classifier is trained to predict whether a pattern expresses the synonym relation. To collect positive examples for training, we extract patterns from all sentences mentioning a pair of synonymous strings, and treat these patterns as positive examples. For the negative examples, we randomly sample string pairs without the synonym relation, and treat the corresponding patterns as negative examples. We select the linear logistic classifier for classification. Given a pattern pat and its feature vector f_pat, we define the probability that pattern pat expresses the synonym relation as follows:

    $P(y_{pat} = 1 | pat) = \frac{1}{1 + \exp(-W_P^T f_{pat})}$    (7)

where W_P is the parameter vector of the classifier. We learn W_P by maximizing the log-likelihood objective function, which is defined as follows:

    $O_P = \sum_{pat \in S_{pat}} \log P(y_{pat} | pat)$    (8)

where S_pat is the set of all training patterns and y_pat is the label of pattern pat.
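To make the feature extraction concrete, here is a minimal Python sketch. It assumes the shortest-dependency-path triples have already been produced by a dependency parser such as CoreNLP; the triple format, the embedding lookup and the default dimension are illustrative assumptions rather than the authors' code.

    import numpy as np

    def pattern_features(triples, embeddings, max_n=2, dim=100):
        """Feature vector of one pattern, given its dependency-path triples,
        e.g. [("ENT", "NNP", "nsubj"), ("called", "VBN", "acl:relcl"),
        ("ENT", "NN", "xcomp")]."""
        # Lexical features: average the embeddings of the pattern's tokens.
        vecs = [embeddings[tok] for tok, _, _ in triples if tok in embeddings]
        lexical = np.mean(vecs, axis=0) if vecs else np.zeros(dim)

        # Syntactic features: all n-grams (1 <= n <= max_n) over the POS tag
        # sequence and over the dependency label sequence.
        syntactic = set()
        for seq in ([pos for _, pos, _ in triples],
                    [dep for _, _, dep in triples]):
            for n in range(1, max_n + 1):
                for i in range(len(seq) - n + 1):
                    syntactic.add(tuple(seq[i:i + n]))
        return lexical, syntactic

For the first pattern of Figure 4, this yields unigrams and bigrams such as (NNP,), (NNP, VBN) and (nsubj, acl:relcl), matching the syntactic features listed in the figure.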
By maximizing the above objective, the learned classifier can effectively predict whether a pattern expresses the synonym relation. Meanwhile, we also update the string embeddings during training, and therefore the learned string embeddings will have better predictive abilities for the synonym discovery problem.

After learning the pattern classifier, we can use it for synonym prediction. Specifically, for a pair of target strings u and v, we first collect all sentences mentioning both strings and extract the corresponding patterns from them; then we measure the possibility that u and v are synonymous using the following score function Score_P(u, v):

    $Score_P(u, v) = \frac{\sum_{pat \in S_{pat}(u,v)} P(y_{pat} = 1 | pat)}{|S_{pat}(u,v)|}$    (9)

where S_pat(u, v) is the set of all corresponding patterns. Basically, our approach classifies all corresponding patterns, and the different patterns vote to decide whether u and v are synonymous.
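A minimal sketch of the two learned score functions, Eqns. (5) and (9); the diagonal W_D is stored as a vector, and classify_pattern is a hypothetical stand-in for the logistic classifier of Eqn. (7).

    import numpy as np

    def score_d(x_u, x_v, w_d):
        """Eqn. (5) with a diagonal W_D: x_u W_D x_v^T."""
        return float(np.sum(x_u * w_d * x_v))

    def score_p(patterns, classify_pattern):
        """Eqn. (9): the patterns extracted for (u, v) vote by averaging the
        classifier's positive-class probabilities."""
        if not patterns:
            return 0.0
        return sum(classify_pattern(p) for p in patterns) / len(patterns)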
MODEL LEARNING AND INFERENCE

In this section, we introduce our optimization algorithm and how we discover missing synonyms for entities.

Optimization Algorithm.
The overall objective function of our framework consists of three parts: two of them (L_C and L_S) come from the distributional module, and the other one (O_P) comes from the pattern module. To optimize the objective, we adopt the edge sampling strategy [24]. In each iteration, we alternately sample a training example from the three parts, and then update the corresponding parameters. We summarize the optimization procedure in Algorithm 1.

Algorithm 1: Optimization Algorithm of DPE
Input: a co-occurrence network between strings N_occur, a set of seed synonym pairs S_seed, a set of training patterns S_pat.
Output: the string embeddings x, the parameters W_D of the distributional score function, the parameters W_P of the pattern classifier.
while iter ≤ I do
    // Optimize L_C
    Sample a string pair (u, v) from N_occur.
    Randomly sample N negative string pairs {(u, v_n)}, n = 1..N.
    Update x and c w.r.t. L_C.
    // Optimize L_S
    Sample a string pair (u, v) from S_seed.
    Randomly sample a negative string pair (u, v_n).
    Update x and W_D w.r.t. L_S.
    // Optimize O_P
    Sample a pattern from S_pat.
    Update x and W_P w.r.t. O_P.
end while
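For illustration, the alternating loop of Algorithm 1 could be rendered in Python as follows. update_pair is the L_C step from the earlier sketch, while update_ls and update_op are hypothetical placeholders for the L_S and O_P gradient steps; the sampling here is simplified (the paper samples negatives from P_neg(u) ∝ d_u^{3/4}).

    import random

    def train(edges, seed_pairs, patterns, x, c, w_d, w_p, iterations, n_neg=5):
        """Edge-sampling optimization: one stochastic update per objective
        (L_C, L_S, O_P) in every iteration, as in Algorithm 1."""
        for _ in range(iterations):
            u, v = random.choice(edges)                  # optimize L_C
            negatives = [random.choice(edges)[0] for _ in range(n_neg)]
            update_pair(x, c, u, v, negatives)

            u, v = random.choice(seed_pairs)             # optimize L_S
            update_ls(x, w_d, u, v)                      # hypothetical step

            pat = random.choice(patterns)                # optimize O_P
            update_op(x, w_p, pat)                       # hypothetical step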
Synonym Inference.
To infer the synonyms of a query entity, our framework leverages both the distributional module and the pattern module. Formally, given a query entity e, suppose the set of its name strings collected from knowledge bases is S_syn(e). Then, for each candidate string u, we measure the possibility that u is a synonym of e using the following score function:

    $Score(e, u) = \sum_{s \in S_{syn}(e)} \{ Score_D(s, u) + \lambda \, Score_P(s, u) \}$    (10)

Score_D (Eqn. 5) and Score_P (Eqn. 9) measure how likely two target strings are to be synonymous, and are learned by the distributional module and the pattern module respectively; λ is a parameter controlling the relative weights of the two parts. The definition of the score function is quite intuitive: each candidate string is compared with all existing name strings of the query entity, and these existing name strings vote to decide whether the candidate string is a synonym of the query entity.
However, the above method is not scalable. The reason is that the computational cost of the pattern score
Score_P is very high, as we need to collect and analyze all the sentences mentioning both target strings. When the number of candidate strings is very large, calculating the pattern scores for all candidate strings can be very time-consuming. To solve this problem, since the distributional score Score_D between two target strings is easy to calculate, a more efficient solution is to first utilize the distributional score Score_D to construct a set of high-potential candidates, and then use the integrated score Score to find the synonyms among those high-potential candidates.

Therefore, for each query entity e, we first rank the candidate strings according to their distributional scores Score_D, and extract the top ranked candidate strings as the high-potential candidates. After that, we re-rank the high-potential candidates with the integrated score Score, and treat the top ranked candidate strings as the discovered synonyms of entity e. With such a two-step strategy, we are able to discover synonyms both precisely and efficiently. A minimal sketch of this two-step inference appears below.
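The paper provides no inference code; the sketch below illustrates the two-step strategy under simple assumptions. score_d_fn and score_p_fn implement Eqns. (5) and (9) at the string-pair level, and the pool size of 100 follows the parameter settings reported in the experiments.

    def discover_synonyms(entity_names, candidates, score_d_fn, score_p_fn,
                          lam=0.1, pool_size=100, top_k=5):
        """Two-step inference: rank with the cheap distributional score first,
        then re-rank the high-potential pool with the integrated Eqn. (10)."""
        # Step 1: coarse ranking with Score_D only.
        coarse = sorted(
            candidates,
            key=lambda u: sum(score_d_fn(s, u) for s in entity_names),
            reverse=True,
        )[:pool_size]

        # Step 2: the expensive Score_P is computed only for the pool.
        def integrated(u):
            return sum(score_d_fn(s, u) + lam * score_p_fn(s, u)
                       for s in entity_names)

        return sorted(coarse, key=integrated, reverse=True)[:top_k]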
EXPERIMENTS

Datasets.
Three datasets are constructed in our experiments. (1) Wiki + Freebase: we treat the first 100K articles in the Wikipedia dataset as the text data, and Freebase [4] (https://developers.google.com/freebase/) as the knowledge base. (2) PubMed + UMLS: we collect around 1.5M paper abstracts from the PubMed dataset, and use the UMLS dataset as our knowledge base. (3) NYT + Freebase: we randomly sample 118,664 documents from the 2013 New York Times news articles, and we select Freebase as the knowledge base. For each dataset, we adopt the Stanford CoreNLP package [8] (http://stanfordnlp.github.io/CoreNLP/) to perform tokenization, part-of-speech tagging and dependency parsing. We filter out strings that appear fewer than 10 times. The window size is set to 5 when constructing the co-occurrence network between strings. The statistics of the datasets are summarized in Table 1.

Table 1: Statistics of the Datasets (columns: Wiki, PubMed, NYT).
Performance Evaluation.
For each dataset, we randomly sample some linked entities as the training entities, and all their synonyms are used as seeds by the compared approaches. We also randomly sample a few linked entities as test entities, which are used for evaluation.

Two settings are considered in our experiments: the warm-start setting and the cold-start setting. In the warm-start setting, for each test entity, we assume that 50% of its synonyms are already given, and we aim to use them to infer the remaining 50%. In the cold-start setting, we are only given the original name of each test entity, and our goal is to infer all its synonyms in knowledge bases.

During evaluation, we treat all unlinkable strings (i.e., words or phrases that are not linked to any entities in the knowledge base) as the candidate strings. In both settings, we add the ground-truth synonyms of each test entity to the set of candidate strings, and we aim to rank the ground-truth synonyms at the top positions among all candidate strings. As evaluation metrics, we report the Precision at position K (P@K), Recall at position K (R@K) and F1 score at position K (F1@K); a short sketch of these metrics follows the parameter settings below.

Compared Algorithms.
We select the following algorithms to compare. (1) Patty [11]: a pattern based approach for relation extraction, which can be applied to our problem by treating the collected synonym seeds as training instances. (2) SVM [29]: a distributional based approach, which uses bag-of-words features and learns an SVM classifier for synonym discovery. (3) word2vec [9]: a word embedding approach; we use the learned string embeddings as features and train the score function in Eqn. 5 for synonym discovery. (4) GloVe [13]: another word embedding approach; similar to word2vec, we use the learned string embeddings as features and train a score function for synonym discovery. (5) PTE [23]: a text embedding approach, which is able to exploit both the text data and the entity types provided in knowledge bases to learn string embeddings; after embedding learning, we apply the score function in Eqn. 5 for synonym discovery. (6) RKPM [27]: a knowledge powered string embedding approach, which utilizes both the raw text and the synonym seeds for synonym discovery. (7) DPE: our proposed embedding framework, which integrates both the distributional features and local patterns for synonym discovery. (8) DPE-NoP: a variant of our framework, which only deploys the distributional module (O_D). (9) DPE-TwoStep: a variant of our framework, which first trains the distributional module (O_D) and then the pattern module (O_P), without jointly optimizing them.

Parameter Settings.
For all embedding based approaches, we set the embedding dimension as 100. For DPE and its variants, we set the learning rate as 0.01, and the number of negative samples N when optimizing the co-occurrence objective L_C is set as 5. When collecting the syntactic features in the pattern module, we set the n-gram length N as 3. The parameter λ, which controls the relative weights of the two modules during synonym discovery, is set as 0.1 by default. We set the number of iterations as 10 billion. During synonym inference, we first adopt the distributional module to extract the top 100 ranked strings as the high-potential candidates, and then use both modules to re-rank them. For word2vec and PTE, the number of negative examples is also set as 5, and the initial learning rate is set as 0.025, as suggested by [9, 23, 24]. The number of iterations is set as 20 for word2vec, and for PTE we sample 10 billion edges to ensure convergence. For GloVe, we use the default parameter settings as in [13]. For RKPM, we set the learning rate as 0.01, and the number of iterations is set as 10 billion to ensure convergence.
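For clarity, here is a minimal sketch of how P@K, R@K and F1@K can be computed for a single test entity; this is our reading of the metrics, not the authors' evaluation script.

    def metrics_at_k(ranked, ground_truth, k):
        """P@K, R@K and F1@K for one entity's ranked candidate list."""
        hits = sum(1 for s in ranked[:k] if s in ground_truth)
        p = hits / k
        r = hits / len(ground_truth) if ground_truth else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        return p, r, f1

    # Example: ground truth {"USA", "America"}; "USA" is ranked first.
    print(metrics_at_k(["USA", "Canada"], {"USA", "America"}, k=1))
    # -> (1.0, 0.5, 0.666...)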
Table 2: Quantitative results on the warm-start setting.

             | Wiki + Freebase                     | PubMed + UMLS                       | NYT + Freebase
Algorithm    | P@1   R@1   F1@1  P@5   R@5   F1@5  | P@1   R@1   F1@1  P@5   R@5   F1@5  | P@1   R@1   F1@1  P@5   R@5   F1@5
Patty        | 0.102 0.075 0.086 0.049 0.167 0.076 | 0.352 0.107 0.164 0.164 0.248 0.197 | 0.101 0.081 0.090 0.038 0.141 0.060
SVM          | 0.508 0.374 0.431 0.273 0.638 0.382 | 0.696 0.211 0.324 0.349 0.515 0.416 | 0.481 0.384 0.427 0.248 0.616 0.354
word2vec     | 0.387 0.284 0.328 0.247 0.621 0.353 | 0.784 0.238 0.365 0.464 0.659 0.545 | 0.367 0.293 0.326 0.216 0.596 0.317
GloVe        | 0.254 0.187 0.215 0.104 0.316 0.156 | 0.536 0.163 0.250 0.279 0.417 0.334 | 0.203 0.162 0.180 0.084 0.283 0.130
PTE          | 0.445 0.328 0.378 0.252 0.612 0.357 | 0.800 0.243 0.373 0.476 0.674 0.558 | 0.456 0.364 0.405 0.233 0.606 0.337
RKPM         | 0.500 0.368 0.424 0.302 0.681 0.418 | 0.804 0.244 0.374 0.480 0.677 0.562 | 0.506 0.404 0.449 0.302 0.707 0.423
DPE-NoP      | 0.641 0.471 0.543 0.414 0.790 0.543 | 0.816 0.247 0.379 0.532 0.735 0.617 | 0.532 0.424 0.472 0.305 0.687 0.422
DPE-TwoStep  | 0.684 0.503 0.580 0.417 0.782 0.544 | 0.836 0.254 0.390 0.538 0.744 0.624 | 0.557 0.444 0.494 0.344 0.768 0.475
DPE          |
Table 3: Quantitative results on the cold-start setting.
             | Wiki + Freebase                     | PubMed + UMLS                       | NYT + Freebase
Algorithm    | P@1   R@1   F1@1  P@5   R@5   F1@5  | P@1   R@1   F1@1  P@5   R@5   F1@5  | P@1   R@1   F1@1  P@5   R@5   F1@5
Patty        | 0.131 0.056 0.078 0.065 0.136 0.088 | 0.413 0.064 0.111 0.191 0.148 0.167 | 0.125 0.054 0.075 0.062 0.132 0.084
SVM          | 0.371 0.158 0.222 0.150 0.311 0.202 | 0.707 0.110 0.193 0.381 0.297 0.334 | 0.347 0.150 0.209 0.165 0.347 0.224
word2vec     | 0.411 0.175 0.245 0.196 0.401 0.263 | 0.627 0.098 0.170 0.408 0.318 0.357 | 0.361 0.156 0.218 0.151 0.317 0.205
GloVe        | 0.251 0.107 0.150 0.105 0.221 0.142 | 0.480 0.075 0.130 0.264 0.206 0.231 | 0.181 0.078 0.109 0.084 0.180 0.115
PTE          | 0.474 0.202 0.283 0.227 0.457 0.303 | 0.647 0.101 0.175 0.389 0.303 0.341 | 0.403 0.174 0.243 0.166 0.347 0.225
RKPM         | 0.480 0.204 0.286 0.227 0.455 0.303 | 0.700 0.109 0.189 0.447 0.348 0.391 | 0.403 0.186 0.255 0.170 0.353 0.229
DPE-NoP      | 0.491 0.209 0.293 0.246 0.491 0.328 | 0.700 0.109 0.189 0.456 0.355 0.399 | 0.417 0.180 0.251 0.180 0.371 0.242
DPE-TwoStep  | 0.537 0.229 0.321 0.269 0.528 0.356 | 0.720 0.112 0.194 0.477 0.372 0.418 | 0.431 0.186 0.260 0.183 0.376 0.246
DPE          |
Figure 5: Precision and Recall at different positions on the Wiki dataset (panels: Precision and Recall under the warm-start and cold-start settings).
1. Comparing DPE with other baseline approaches.
Table 2, Table 3 and Figure 5 present the results under the warm-start and cold-start settings. In both settings, we see that the pattern based approach Patty does not perform well, and our proposed approach DPE significantly outperforms it. This is because most synonymous strings never co-appear in any sentence, leading to the low recall of Patty. Also, many patterns discovered by Patty are not reliable, which may harm the precision of the discovered synonyms. DPE addresses this problem by incorporating the distributional information, which can effectively complement and regularize the pattern information, leading to higher recall and precision. Comparing DPE with the distributional based approaches (word2vec, GloVe, PTE, RKPM), DPE still significantly outperforms them. The performance gains mainly come from: (1) we exploit the co-occurrence observation 3.1 during training, which enables us to better capture the semantic meanings of different strings; and (2) we incorporate the pattern information to improve the performance.
2. Comparing DPE with its variants.
To better understand why DPE achieves better results, we also compare DPE with several of its variants. From Table 2 and Table 3, we see that in most cases the distributional module alone (DPE-NoP) can already outperform the best baseline approach RKPM. This is because we utilize the co-occurrence observation 3.1 in our distributional module, which helps us capture the semantic meanings of strings more effectively. By training the pattern module separately after the distributional module and using both modules for synonym discovery (DPE-TwoStep), the results are further improved, which demonstrates that the two modules can indeed mutually complement each other for synonym discovery. If we jointly train both modules (DPE), we obtain even better results, which shows that our proposed joint optimization framework benefits the training process and therefore helps achieve better results.
3. Performances w.r.t. the weights of the modules.
During synonym discovery, DPE considers the scores from both the distributional module and the pattern module, and the parameter λ controls their relative weights. Next, we study how DPE behaves under different λ. The results on the Wiki dataset are presented in Figure 6. We see that when λ is either very small or very large, the performance is not good: a small λ emphasizes only the distributional module, while a large λ assigns too much weight to the pattern module. Neither module alone can discover synonyms effectively, so we must integrate them during synonym discovery.
Figure 6: Performances w.r.t. λ on the Wiki dataset (warm-start and cold-start). A small λ emphasizes the distributional module; a large λ emphasizes the pattern module. Neither extreme discovers synonyms effectively.
Figure 7: Performance change of DPE (a) under different percentages of training entities; and (b) with respect to the number of entity name strings used in inference.
4. Performances w.r.t. the percentage of the training entities.
During training, DPE uses the synonyms of the training entities as seeds to guide the learning. To understand how the number of training entities affects the results, we report the performance of DPE under different percentages of training entities. Figure 7(a) presents the results on the Wiki dataset under the warm-start setting. We see that, compared with RKPM, DPE needs less labeled data to converge. This is because the two modules in our framework mutually complement each other, and therefore reduce the demand for training entities.
5. Performances w.r.t. the number of entity name stringsused in inference.
Our framework aims to discover synonyms at the entity level. Specifically, for each query entity, we use its existing name strings to disambiguate the meaning for each other, and let them vote to discover the missing synonyms. In this section, we study how the number of name strings used in inference affects the results. We sample a number of test entities from the Wiki dataset, and utilize 1∼ name strings during inference.

Table 4: Example outputs on the Wiki dataset. Strings in red are the true synonyms.

Entity:   US dollar                    | World War II
Method:   DPE-NoP        DPE           | DPE-NoP            DPE
Output:   US Dollars     U.S. dollar   | Second World War   Second World War
          U.S. dollars   US dollars    | World War Two      World War Two
          Euros          U.S. dollars  | World War One      WW II
          U.S. dollar    U.S. $        | WW I               world war
          RMB            Euros         | world wars         world wars
Table 5: Top ranked patterns expressing the synonym relation. Strings in red are the target strings.

Pattern                                                   | Corresponding Sentence
(-,NN,nsubj) (-lrb-,JJ,amod) (known,VBN,acl) (-,NN,nmod)  | … Olympia (commonly known as L'Olympia) is a music hall …
(-,NN,dobj) (-,NN,appos)                                  | … many hippies used cannabis (marijuana), considering it …
(-,NNP,nsubj) (known,VBN,acl) (-,NN,nmod)                 | … BSE, commonly known as "mad cow disease", is a …
1. Example output.
Next, we present some example outputs of DPE-NoP and DPE on the Wiki dataset. The results are shown in Table 4. From the learned synonym lists, we have filtered out all existing synonyms in knowledge bases, and the strings in red are the new synonyms discovered by our framework. We see that our framework finds many new synonyms that have not been included in knowledge bases. Besides, by introducing the pattern module, some false synonyms (RMB and WW I) obtained by DPE-NoP are filtered out by DPE, which demonstrates that combining the distributional features and the local patterns can indeed improve the performance.
2. Top ranked positive patterns.
To exploit the local patterns in our framework, our pattern module learns a pattern classifier to predict whether a pattern expresses the synonym relation between the target strings. To test whether the learned classifier can precisely discover positive patterns for synonym discovery, we show some top-ranked positive patterns learned by the classifier together with their corresponding sentences. Table 5 presents the results, in which the strings in red are the target strings. We see that all three patterns indeed express the synonym relation between the target strings, which proves that our learned pattern classifier can effectively find positive patterns and therefore benefit synonym discovery.
RELATED WORK

Synonym Discovery.
Various approaches have been proposed to discover synonyms from different kinds of information. Most of them exploit structured knowledge such as query logs [2, 16, 30] for synonym discovery. Different from them, we aim to discover synonyms from raw text corpora, which is more challenging. There are also methods that try to discover string relations (e.g., the synonym, antonym and hypernym relations) from raw texts, including distributional based approaches and pattern based approaches; both can be applied to our setting. Given some training seeds, the distributional based approaches [6, 12, 19, 25, 27, 29] discover synonyms by representing strings with their distributional features and learning a classifier to predict the relation between strings. Different from them, the pattern based approaches [5, 11, 15, 20, 22, 35] consider the sentences mentioning a pair of synonymous strings, and learn textual patterns from these sentences, which are further used to discover more synonyms. Our proposed approach naturally integrates the two types of approaches and enjoys the merits of both.
Text Embedding.
Our work is also related to text embedding techniques, which learn low-dimensional vector representations for strings from raw texts. The learned embeddings capture semantic correlations between strings, and can be used as features for synonym extraction. Most text embedding approaches [9, 13, 24] only exploit the text data, and cannot leverage information from knowledge bases to guide the embedding learning. There are also studies that incorporate knowledge bases to improve embedding learning: [18, 23] exploit entity types to enhance the learned embeddings, and [7, 28, 31, 34] exploit existing relation facts in knowledge bases as constraints to improve the performance. Compared with these methods, our embedding approach can better preserve the semantic correlations of strings through the co-occurrence observation 3.1. Besides, both the distributional module and the pattern module of our approach provide supervision for the embedding learning, which brings stronger predictive abilities to the learned embeddings for the synonym discovery problem.
CONCLUSIONS

In this paper, we studied the problem of automatic synonym discovery with knowledge bases, aiming to discover missing synonyms for entities in knowledge bases. We proposed a framework called DPE, which naturally integrates the distributional based approaches and the pattern based approaches. We conducted extensive experiments on three real-world datasets, and the experimental results proved the effectiveness of our proposed framework.
ACKNOWLEDGMENTS
REFERENCES
[1] R. Angheluta, R. De Busser, and M.-F. Moens. The use of topic segmentation for automatic summarization. In ACL Workshop on Automatic Summarization, 2002.
[2] S. Chaudhuri, V. Ganti, and D. Xin. Exploiting web search to generate synonyms for entities. In WWW, 2009.
[3] J. Daiber, M. Jakob, C. Hokamp, and P. N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems (I-Semantics), 2013.
[4] Google. Freebase data dumps. https://developers.google.com/freebase/data.
[5] M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, pages 539-545, 1992.
[6] D. Lin, S. Zhao, L. Qin, and M. Zhou. Identifying synonyms among distributionally similar words. In IJCAI, 2003.
[7] Q. Liu, H. Jiang, S. Wei, Z.-H. Ling, and Y. Hu. Learning semantic word embeddings based on ordinal knowledge constraints. In ACL-IJCNLP, 2015.
[8] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations, 2014.
[9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[10] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In ACL, 2009.
[11] N. Nakashole, G. Weikum, and F. Suchanek. Patty: A taxonomy of relational patterns with semantic types. In EMNLP, 2012.
[12] P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu, and V. Vyas. Web-scale distributional similarity and entity set expansion. In EMNLP, 2009.
[13] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[14] M. Purver and S. Battersby. Experimenting with distant supervision for emotion classification. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 482-491, 2012.
[15] L. Qian, G. Zhou, F. Kong, and Q. Zhu. Semi-supervised learning for semantic relation classification using stratified sampling strategy. In EMNLP, 2009.
[16] X. Ren and T. Cheng. Synonym discovery for structured entities on heterogeneous graphs. In WWW, 2015.
[17] X. Ren, A. El-Kishky, C. Wang, F. Tao, C. R. Voss, and J. Han. ClusType: Effective entity recognition and typing by relation phrase-based clustering. In KDD, 2015.
[18] X. Ren, W. He, M. Qu, C. R. Voss, H. Ji, and J. Han. Label noise reduction in entity typing by heterogeneous partial-label embedding. In KDD, 2016.
[19] S. Roller, K. Erk, and G. Boleda. Inclusive yet selective: Supervised distributional hypernymy detection. In COLING, 2014.
[20] R. Snow, D. Jurafsky, and A. Y. Ng. Learning syntactic patterns for automatic hypernym discovery. In NIPS, 2004.
[21] M. Stanojević et al. Cognitive synonymy: A general overview. FACTA UNIVERSITATIS - Linguistics and Literature, 7(2):193-200, 2009.
[22] A. Sun and R. Grishman. Semi-supervised semantic pattern discovery with guidance from unsupervised pattern clusters. In ACL, 2010.
[23] J. Tang, M. Qu, and Q. Mei. PTE: Predictive text embedding through large-scale heterogeneous text networks. In KDD, 2015.
[24] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In WWW, 2015.
[25] P. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. 2001.
[26] E. M. Voorhees. Query expansion using lexical-semantic relations. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 61-69, 1994.
[27] H. Wang, F. Tian, B. Gao, J. Bian, and T.-Y. Liu. Solving verbal comprehension questions in IQ test by knowledge-powered word embedding. arXiv preprint arXiv:1505.07909, 2015.
[28] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph and text jointly embedding. In EMNLP, 2014.
[29] J. Weeds, D. Clarke, J. Reffin, D. J. Weir, and B. Keller. Learning to distinguish hypernyms and co-hyponyms. In COLING, 2014.
[30] X. Wei, F. Peng, H. Tseng, Y. Lu, and B. Dumoulin. Context sensitive synonym discovery for web search queries. In CIKM, 2009.
[31] J. Weston, A. Bordes, O. Yakhnenko, and N. Usunier. Connecting language and knowledge bases with embedding models for relation extraction. arXiv preprint arXiv:1307.7973, 2013.
[32] F. Wu and D. S. Weld. Open information extraction using Wikipedia. In ACL, 2010.
[33] P. Xie, D. Yang, and E. P. Xing. Incorporating word correlation knowledge into topic modeling. In HLT-NAACL, 2015.
[34] C. Xu, Y. Bai, J. Bian, B. Gao, G. Wang, X. Liu, and T.-Y. Liu. RC-NET: A general framework for incorporating knowledge into word representations. In CIKM, 2014.
[35] M. Yahya, S. Whang, R. Gupta, and A. Y. Halevy. ReNoun: Fact extraction for nominal attributes. In EMNLP, 2014.
[36] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
[37] Q. T. Zeng, D. Redd, T. C. Rindflesch, and J. R. Nebeker. Synonym, topic model and predicate-based query expansion for retrieving clinical documents. In