An Unsupervised Language-Independent Entity Disambiguation Method and its Evaluation on the English and Persian Languages
Majid Asgari-Bidhendi, Behrooz Janfada, Amir Havangi, Sayyed Ali Hossayni, Behrouz Minaei-Bidgoli
A Preprint
Majid Asgari-Bidhendi
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
[email protected]

Behrooz Janfada
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
[email protected]

Amir Havangi
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
[email protected]

Sayyed Ali Hossayni
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
[email protected]

Behrouz Minaei-Bidgoli
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
[email protected]
February 2, 2021

Abstract
Entity linking is one of the essential tasks of information extraction and natural language understanding. It mainly consists of two sub-tasks: recognition and disambiguation of named entities. Most studies address these two tasks separately or focus only on one of them. Moreover, most state-of-the-art entity linking algorithms are either supervised, and thus perform poorly in the absence of annotated corpora, or language-dependent, and thus unsuitable for multilingual applications. In this paper, we introduce an Unsupervised Language-Independent Entity Disambiguation method (ULIED), which utilizes a novel approach to disambiguate and link named entities. Evaluation of ULIED on different English entity linking datasets as well as the only available Persian dataset illustrates that ULIED in most cases outperforms the state-of-the-art unsupervised multilingual approaches.
Keywords: Entity Linking, Named Entity Disambiguation, Multilingual, Knowledge Base
1 Introduction

In this section, we first present a general introduction to entity linking (EL). We then introduce knowledge bases as an essential requirement for the entity linking task and discuss its general steps. At the end of this section, we outline the general structure of this paper.
Entity linking (EL) is the task of linking a set of entities mentioned in a text to an external dataset. Entity linking plays an essential role in text analysis, information extraction, question answering, text understanding, and recommender systems [1]. It also allows users to access background knowledge about the entities in a text [2]. However, two types of ambiguity make this task challenging. Firstly, entities may have different names, even within a single document; for example, a person may appear in the text by first name, last name, or nickname, and EL should link all of these names to a single entity in the knowledge base. Secondly, different entities may share the same name, and the entity linking system must be able to resolve them to distinct entities in the knowledge base. Therefore, information about entities is crucial for choosing the correct ones [3, 2, 4].

With few exceptions, most entity linking methods address the mention detection (also known as entity detection or named entity detection) and entity disambiguation stages separately [5]. Our approach also focuses on entity disambiguation. The proposed approach is an Unsupervised Language-Independent Entity Disambiguation method, dubbed ULIED.
A knowledge base (KB) is one of the fundamental components of entity linking systems. Generally, a KB consists of a set of entities, information about them, semantic categories, and the relationships between entities. Knowledge bases used in EL systems should have features such as public availability, machine readability, persistent identifiers, and credibility [6]. Several knowledge bases are currently available for EL systems, such as DBpedia [7], YAGO [8], Freebase [9], and Probase [10]. For low-resource languages, cross-lingual methods are used when no knowledge base with the above characteristics exists [11]. Fu et al. [12] showed that cross-lingual methods are heavily dependent on Wikipedia and only work well on Wikipedia texts; they perform poorly on non-Wikipedia texts and require cross-lingual resources beyond Wikipedia to improve their performance. This study employs FarsBase [13], the first multi-source KB specially designed for the Persian language, which includes more than 500,000 entities with 25 million relations between them. FarsBase provides information about various kinds of entities such as locations, persons, and organizations.

Generally, the EL process includes four subtasks, which are consistent across most supervised and unsupervised EL systems. The first step, Mention Detection (MD), is the operation of locating named entities in the raw natural-language input text. The last three steps can be grouped under the Entity Disambiguation (ED) subtask: ED is the operation of disambiguating a named entity using a set of candidate entities and then linking it to a knowledge base.
Mention Detection
An MD algorithm takes raw text and marks the positions of named entity occurrences in its output. Most EL studies [14, 15, 16] employ existing MD algorithms provided by other research and focus on the remaining three modules.
Candidate Entity Generation
In this step, the system proposes a set of candidate entities for every entity mention in the text provided by the previous step [4, 17]. In this regard, most studies [4, 18, 19, 20] use features such as redirect pages, disambiguation pages, and hyperlinks in Wikipedia (or other knowledge bases and resources) to build a name dictionary that maps each entity mention to a set of candidate entities.
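Dictionary-based candidate generation of this kind can be sketched as follows. This is a minimal illustration, not ULIED's actual implementation: the redirect and disambiguation mappings are toy stand-ins for tables that would be extracted from Wikipedia dumps, and the function names are assumptions.

```python
def build_candidate_dictionary(redirects, disambiguations):
    """Map each surface form (lowercased) to the set of entities it may refer to."""
    candidates = {}
    for surface, target in redirects.items():
        candidates.setdefault(surface.lower(), set()).add(target)
    for surface, targets in disambiguations.items():
        candidates.setdefault(surface.lower(), set()).update(targets)
    return candidates

def generate_candidates(mention, candidates):
    """Return the candidate entity set for a mention (empty if unknown)."""
    return candidates.get(mention.lower(), set())

# Toy data standing in for Wikipedia redirect and disambiguation pages.
redirects = {"NYC": "New_York_City"}
disambiguations = {"Washington": ["Washington,_D.C.", "George_Washington"]}
cand = build_candidate_dictionary(redirects, disambiguations)
print(generate_candidates("Washington", cand))
```

In practice, the dictionary would be bulk-built from a Wikipedia dump so that lookups at linking time are constant-time.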
Candidate Entity Ranking
In most cases, the candidate entity generation module generates multiple candidate entities for a single mention. Therefore, candidate entities should be ranked by the EL system to find the most likely entity from the knowledge base [6]. The EL system can use two types of features for ranking candidate entities: context-independent features and context-dependent features [21]. In the literature, the term "entity disambiguation phase" [22, 23, 24] has the same meaning as candidate entity ranking. Both supervised and unsupervised methods can be used for this step. Supervised ranking methods depend on an annotated training dataset, whose annotation must be done manually; for low-resource languages, such a resource is not available, and alternative solutions are needed. For example, Klie et al. [25] provide a novel domain-agnostic Human-In-The-Loop (HITL) annotation approach that uses recommenders suggesting potential concepts and adaptive candidate ranking to speed up the overall annotation process and make it less tedious for users. NERank+, proposed by Wang et al. [26], is another entity ranking approach; it utilizes a Topical Tripartite Graph consisting of document, topic, and entity nodes, together with a random walk algorithm that propagates prior entity and topic ranks over the graph model.
Unlinkable Mention Prediction
In cases where entity mentions do not have any relevant entities in the knowledge base, unlinkable entity mentions are separated from other entities and tagged as NIL. Researchers suggest different approaches to separate unlinkable entity mentions: (1) ignoring unlinkable entity mentions [2, 27, 14], (2) ignoring low-probability candidates (NIL threshold) [28, 24], and (3) supervised machine learning techniques [4, 6, 29, 30].

The rest of this paper is organized as follows. Section 2 presents an overview of EL, with a particular view to entity disambiguation and to language-independent and unsupervised approaches. Section 3 introduces the only available entity disambiguation dataset for the Persian language and reviews multilingual entity linking datasets as well as entity disambiguation datasets for the English language. Section 4 describes the proposed approach for unsupervised, language-independent entity disambiguation. Experimental results and comparisons with baseline methods on the English and Persian datasets are discussed in Section 5. The last section concludes this research and outlines future work.
2 Related Work

Supervised approaches need adequate resources and are not suitable for low-resource languages, and only a limited number of EL systems focus on multilingual strategies. This manuscript targets multilingual unsupervised entity disambiguation as the best approach for entity linking in low-resource languages. In this section, we first describe the related literature on unsupervised EL and multilingual EL, and then introduce some popular existing datasets for entity disambiguation.
Various algorithms have been proposed to perform unsupervised EL; here we review notable studies in this field. Some researchers [19, 22, 18, 20] used methods based on the Vector Space Model (VSM) [31] for unsupervised candidate ranking. In these methods, the first step is calculating the similarity between the vector representations of the entity mention and the candidate entity; the system then links the candidate entity with the highest similarity to the entity mention. These methods differ in their vector representations and similarity calculations [4]. Methods have also been proposed that combine different SVM models as an ensemble; for example, Alokaili and Menai proposed ensemble learning using SVM [32], which produces competitive performance compared to well-known entity annotation systems and ensemble models on different benchmark corpora.

Cucerzan [22] used entity mentions and the Wikipedia articles of the candidate entities to build vectors. The system chooses the candidate that maximizes vector similarity and has the same category as the entity mention. This system achieved 91.4% accuracy on a news dataset.

Chen et al. [19] built the entity mention and candidate entity vectors with a Bag-of-Words model, using the context of the article to capture word co-occurrence information, and computed the similarity between them with TF-IDF similarity. They reported 71.2% accuracy on the TAC-KBP2010 dataset.

Han and Zhao [18] used two types of similarity measures: Wikipedia semantic knowledge-based similarity alongside Bag-of-Words-based similarity. For the first measure, the method detects Wikipedia concepts in the candidate entities and in the context of the entity mention; it then computes the vector similarity of the entity mention and candidate entities using a weighted average of the semantic relations between the articles of the Wikipedia concepts and the context of the entity mention.
The two types of similarity are then merged into a final similarity score for each candidate entity, and the entity that maximizes this merged similarity is chosen. Their system achieves 76.7% accuracy on the TAC-KBP2009 dataset.

Xu et al. [20] applied a linking approach for medical texts that exploits name similarity, entity popularity, category consistency, context similarity, and the semantic correlation between the entity mention and candidate entities, ranking candidate entities by combining these features. They called their ranking measure the Confidence Score; on average, it achieved about 82% precision on their medical dataset.

Zhang et al. [33] proposed an unsupervised bilingual entity linker inspired by the work of Han et al. [2] and Yamada et al. [24]. As discussed before, they utilized a pre-built dictionary for candidate generation and then used probabilistic generative methods to disambiguate the entities. Their system achieves 91.2% precision on the CoNLL dataset.

Pan et al. [34] used Abstract Meaning Representation (AMR) [35] to select high-quality sets of entities for their similarity measure. They claimed that a representation using AMR can capture some contextual properties
that are very critical and helpful for disambiguating entities without using training data. They then compared the contexts of the entities with an unsupervised graph to obtain the final results, and reported 92.12% precision on a dataset annotated from news and discussion forum posts.

Xie et al. [36] proposed the graph-ranking collective Chinese entity linking (GRCCEL) algorithm, which utilizes both the structured relationships between entities in the local knowledge base and the additional background information offered by external knowledge sources. To measure similarity, they used improved weighted word2vec and improved PageRank methods. They reported the effectiveness of GRCCEL on the Chinese entity linking task and demonstrated the superiority of their method over state-of-the-art methods for Chinese.
Some studies target multilingual entity linking. Babelfy [37] is one of the most distinguished studies on unsupervised multilingual EL and Word Sense Disambiguation (WSD). Moro et al. used a unified graph-based approach to EL and WSD based on a loose identification of candidate meanings coupled with a densest-subgraph heuristic that selects high-coherence semantic interpretations.

Hoffart et al. proposed the AIDA [16] system, which provides an integrated NED method using popularity, similarity, and graph-based coherence, and includes robustness tests for self-adaptive behavior. Later, they extended their approach [38], presenting a novel notion of semantic relatedness between two entities represented as sets of weighted (multi-word) keyphrases, with consideration of partially overlapping phrases. This measure improves the quality of prior link-based models and also eliminates the need for (usually Wikipedia-centric) explicit interlinkage between entities.

Usbeck et al. [39] present AGDISTIS, a knowledge-base-agnostic approach for named entity disambiguation. Their approach combines the Hypertext-Induced Topic Search (HITS) algorithm with label expansion strategies and string similarity measures. They extended AGDISTIS into a multilingual approach named MAG [40].

Rosales et al. [41] introduce VoxEL, a benchmark dataset for multilingual entity linking in German, English, Spanish, French, and Italian, based on 15 news articles from VoxEurop, a multilingual newsletter, totaling 94 sentences.
Their study compares 15 multilingual entity linkers using the General Entity Annotation Benchmark Framework (GERBIL) [42]: KIM [43], TagME [44], SDA [45], ualberta [46], HITS [47], THD [48], DBpedia Spotlight [49, 50], Wang-Tang [51], AGDISTIS [39], Babelfy [37], FREME [52], WikiME [53], FEL [54], FOX [55], and MAG [40], and checks, for each system, whether it supports entity recognition and whether a demo, an API, and the source code are available.

DBpedia Spotlight, proposed by Mendes et al. [49], is a system for automatically annotating text documents with DBpedia URIs. Its algorithm uses the same four-step approach together with the VSM and TF-IDF similarity measures described earlier.
3 Datasets

In this section, we review some popular entity disambiguation datasets for the English language, which are used in the evaluation of this research, as well as the only published dataset for the Persian language.
ACE 2004

The ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic, and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus includes various types of data annotated for entities and relations, and was created by the Linguistic Data Consortium with support from the ACE Program and additional assistance from the DARPA TIDES program.
AIDA/CoNLL

This dataset contains assignments of entities to the named entity mentions annotated for the original CoNLL 2003 entity recognition task [16]. It consists of proper-noun annotations for 1,393 Reuters newswire articles, all hand-annotated with corresponding entities in YAGO2. Two experts disambiguated each mention, and in case of conflict, another expert resolved it.
AQUAINT

The AQUAINT Corpus [56], Linguistic Data Consortium (LDC) catalog number LDC2002T31 and ISBN 1-58563-240-6, consists of English-language newswire text data from three sources: the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service. It was prepared for the AQUAINT Project by the LDC and is used by the National Institute of Standards and Technology (NIST) in official benchmark evaluations.
DBpedia Spotlight

DBpedia Spotlight [49] is a system that automatically annotates text documents with DBpedia URIs. DBpedia Spotlight enables users to configure annotations to their specific needs using the DBpedia Ontology and quality measures such as prominence, topical relevance, contextual ambiguity, and disambiguation confidence.
KORE50

The goal of KORE50 [38] is to stress-test EL systems with highly ambiguous mentions in hand-crafted sentences that pose difficult disambiguation tasks. KORE is a new notion of entity relatedness based on the overlap of two sets of keyphrases, allowing, e.g., partial matches between phrases.
Many existing evaluation datasets rely on either Wikipedia or DBpedia. Noullet et al. [57] have recently extended KORE50 to accommodate EL tasks not only for DBpedia but also for YAGO, Wikidata, and Crunchbase.
IITB

IITB [58] was established in 2009 and has the highest entity/document annotation density. Its ground-truth annotations (called "IITB") were produced using a browser-based annotation system. The manually annotated documents were gathered from links to popular websites in a handful of domains, including sports, culture, science and technology, and education. The annotations are in the public domain.
N3 Reuters-128

N3 Reuters-128 [59] includes 128 news articles sampled randomly from the Reuters-21578 news corpus and manually annotated by domain experts.
N3 RSS-500

N3 RSS-500 [59] consists of 1,457 RSS feeds scraped from a list of all major newspapers around the world, covering a wide range of topics. The RSS list was compiled during a 76-hour crawl, leading to a corpus of approximately 11.7 million sentences. A subset of this corpus was generated by randomly selecting 1% of the contained words, and domain experts manually annotated the resulting corpus.
ERD2014

The ERD2014 [60] dataset was constructed for the 2014 Entity Recognition and Disambiguation Challenge (ERD'14), which took place from March to June 2014 and was summarized in a dedicated workshop at SIGIR 2014. The ERD challenge's main goal was to promote research in the recognition and disambiguation of entities in unstructured text. For the short-text track, the dataset was built by sampling 500 queries from a commercial search engine's query log to form a development set and 500 queries for the test set; the average query length was four words. For the long-text track, the dataset was built by sampling 100 web pages for the development set and 100 web pages for the test set. All HTML tags were stripped from the web pages, and various heuristics were applied to extract the main content from each document; in particular, boilerplate content from headers and side panes was removed. Among all documents, 50% were sampled from general web pages, and the remaining 50% were news articles from msn.com.
MSNBC

Silviu Cucerzan released the MSNBC dataset [22] in 2007. The dataset contains unique surface-form (SF) media reports and distinctive lexicalizations.
ParsEL-Social

To evaluate ULIED and competing methods in Persian, we used the ParsEL-Social dataset [61], introduced in the conference version of this paper. ParsEL-Social is constructed from social media content derived from 10 Telegram channels in 10 different categories: sport, economics, gaming, general news, IT news, travel, art, academic, entertainment, and health. To create this dataset, entity mentions are first identified automatically in the raw text, and a list of candidates is created based on the redirect and disambiguation links of each entity mention in Wikipedia. The text with the identified entity mentions and candidate lists is then given to a Persian linguistics expert, who selects one of the candidates for each mention. In some cases, a mention must be linked to an entity, but the right entity is missing from the candidates due to an error in the automatic candidate generation algorithm; in such cases, the expert may add it to the candidate list and select it as the right link. It should be noted that the automatic candidate generator serves only as an aid to the expert.
4 The Proposed Approach

In this section, we describe the architecture of our unsupervised language-independent entity disambiguation system (ULIED) and propose our new approach, in which the entity mentions of an input text are disambiguated and linked to Wikipedia. We first look at the architecture as a whole and then describe each of its modules separately.
The ULIED system is implemented in a pipeline architecture. It is assumed that the input text arrives with its entity mentions already marked, and the system should disambiguate these entity mentions and link them to Wikipedia at the output. To this end, a list of candidates is first generated in the Candidate Generation module for each entity mention. Four candidate entity weighting modules are used in the ED component: two context-dependent modules and two context-independent modules. Figure 1 shows a block diagram of the architecture and data flow of ULIED.

In this architecture, the Candidate Generation module generates candidates for all entity mentions in the text using Wikipedia redirect and disambiguation pages. The output of this module is a set of entity mentions and several candidates for each mention.

The Entity Disambiguation component consists of five modules: four are grouped as the Candidate Entity Weighting subsystem, and one is the Top-weight Entity Selection and Linking module. We utilize both context-dependent and context-independent features in the ranking step. Context-dependent features rely on the context in which the entity mention appears, whereas context-independent features are independent of context and rely only on the entity mention and candidate entities [4]. These modules are described below.
In this module, we examine whether the type of a candidate, derived from Wikipedia's infobox, can match the context surrounding the original entity mention. Some entities have very generic names that may cause a high level of ambiguity. For instance, the Persian phrase meaning "At the age of 40" is the title of an Iranian movie, while it can also be part of a general sentence, e.g., "Vahid died at the age of 40". Such names are widespread in artworks (e.g., movies or books) and in a limited number of other specialized classes. To improve the disambiguation process, if the candidate entity belongs to one of these classes, we look for more evidence in the context using a hand-made reference list. Considering the above example, "At the age of 40", the surrounding context is required to contain phrases such as director, artist, channel, cinema, ticket, or movie. Otherwise, the algorithm multiplies the candidate's current score by a predefined constant between 0 and 1, chosen per class.
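The class-context check described above can be sketched as follows. The class names, evidence words, penalty constant, and function name are illustrative assumptions, not the actual hand-made reference list used by ULIED.

```python
# Toy stand-in for the hand-made reference list: for each ambiguous
# class, the context words that count as supporting evidence.
AMBIGUOUS_CLASS_EVIDENCE = {
    "Film": {"director", "artist", "channel", "cinema", "ticket", "movie"},
}
PENALTY = 0.3  # assumed constant in (0, 1); the paper does not give its value

def apply_class_context_penalty(score, candidate_class, context_tokens):
    """Down-weight a candidate from an ambiguous class when the
    surrounding context lacks supporting evidence words."""
    evidence = AMBIGUOUS_CLASS_EVIDENCE.get(candidate_class)
    if evidence is None:
        return score          # class is not considered ambiguous
    if evidence & set(context_tokens):
        return score          # supporting evidence found in context
    return score * PENALTY    # no evidence: multiply by the constant
```

For example, a "Film" candidate in the sentence "Vahid died at the age of 40" would be penalized, while the same candidate in a sentence mentioning a movie or a director would keep its score.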
In this module, using the surface features of the document that contains a given entity mention and the corresponding Wikipedia pages of its candidates, we compute a vector similarity to determine the degree of similarity between each candidate and its corresponding entity mention.
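A minimal sketch of this kind of vector similarity, using a from-scratch TF-IDF weighting with cosine similarity for illustration; the tokenization, the choice of documents, and the function names are assumptions, not ULIED's exact implementation.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) for tokenized documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))           # document frequency per term
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(mention_context, candidate_articles):
    """Score each candidate's Wikipedia article text against the mention context."""
    docs = [mention_context] + list(candidate_articles.values())
    vecs = tf_idf_vectors(docs)
    scores = {name: cosine(vecs[0], vecs[i + 1])
              for i, name in enumerate(candidate_articles)}
    return max(scores, key=scores.get), scores

# Toy example: a mention context and two hypothetical candidate articles.
mention_context = "saadi was a persian poet from shiraz".split()
candidate_articles = {
    "Saadi_(poet)": "saadi persian poet shiraz literature ghazal".split(),
    "Saadi,_Tehran": "saadi neighborhood tehran district".split(),
}
best, scores = rank_candidates(mention_context, candidate_articles)
print(best)  # → Saadi_(poet)
```

The candidate whose article shares more informative (high-IDF) terms with the mention's surrounding text receives the higher similarity score.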
Figure 1: The architecture of the unsupervised language-independent entity disambiguation system (ULIED)
Previous research has shown that internal links between Wikipedia pages can be a useful feature for examining and measuring the semantic relevance of concepts [62]. The main idea behind the Link-Graph modules is that candidates with more Wikipedia internal links to other candidates are more likely to be suitable candidates for disambiguation. We explain this method in more detail below.

Given the candidate list $CL$, including all candidates of all mentions in the context, for each candidate $c_i \in CL$ we create a list $LLC_i$ of all candidates that are linked in the corresponding Wikipedia article of $c_i$, together with their frequencies in that article, $(link_{ij}, count_{ij})$. For example, the Wikipedia article of "Saadi" contains ten links to the "Shiraz" article, four links to "Persian", 12 links to "Poet", and so on. The list $LLC_{Saadi}$ will be:

$$LLC_{Saadi} = [(Shiraz, 10), (Persian, 4), (Poet, 12), \ldots] \quad (1)$$

In the next step, we assign a weight to each $c_i \in CL$:

$$w_{c_i} = \sum_{(link_{ij},\, count_{ij}) \in LLC_i} count_{ij} \times e_{ij} \quad (2)$$

where

$$e_{ij} = \begin{cases} 1 & link_{ij} \in CL \\ 0 & \text{otherwise} \end{cases} \quad (3)$$

In the last step, for each mention, we choose the entity with the highest weight.

The Level 1 Link-Graph Module only counts the number of links per candidate and accordingly assigns a weight to each candidate. The Level 2 Link-Graph Module is not limited to first-level links: instead, it builds an $LLC$ list equal to the $LLC$ of the entity plus the $LLC$s of the entities linked from it. In the previous example about "Saadi", we add all of the links of the "Shiraz", "Persian", and "Poet" articles to the $LLC$ of "Saadi".

To accelerate this phase, we pre-compute a cached $LLC$ for all articles in the Wikipedia dump, so the system can fetch the list for each member of $CL$ instantly from the cache.

Suppose the following text as the input: "Saadi was born in the city of Shiraz."
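Equations (2) and (3) amount to the following computation; the candidate names and link counts below are toy values based on the "Saadi" example, not real Wikipedia statistics.

```python
def link_graph_weights(candidate_list, llc):
    """Level-1 Link-Graph weighting (Eqs. 2 and 3).

    candidate_list: all candidates of all mentions in the context (CL).
    llc: maps each candidate to {linked_article: link_count} (its LLC list)."""
    cl = set(candidate_list)
    weights = {}
    for c in candidate_list:
        # e_ij = 1 only when the linked article is itself in CL,
        # so only links to other candidates contribute count_ij.
        weights[c] = sum(count for link, count in llc.get(c, {}).items()
                         if link in cl)
    return weights

# Toy LLC lists: "Saadi" links to "Shiraz" (a fellow candidate) 10 times.
llc = {
    "Saadi": {"Shiraz": 10, "Persian": 4, "Poet": 12},
    "Saadi,_Tehran": {"Tehran": 7, "District": 2},
    "Shiraz": {"Saadi": 3, "Iran": 5},
}
cl = ["Saadi", "Saadi,_Tehran", "Shiraz"]
print(link_graph_weights(cl, llc))
```

Here "Saadi" earns weight 10 from its links to the co-occurring candidate "Shiraz", while "Saadi,_Tehran" earns nothing because none of its links point at other candidates in the context.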
Level-1 graph formation:
Suppose "Saadi" has three candidate entities, $A_1$, $A_2$, and $A_3$; "city" can also refer to two entities, $B_1$ and $B_2$; and "Shiraz" has four candidates, $C_1$, $C_2$, $C_3$, and $C_4$. The rest of the words have no candidate entity. In this example, a graph is formed with the following nodes:

$$A_1, A_2, A_3, B_1, B_2, C_1, C_2, C_3, C_4$$

If the Wikipedia page of a node has $n$ internal links to another node's Wikipedia page, an edge of weight $n$ is created between the two nodes.

Consider the first two candidates of "Saadi". Suppose the Wikipedia page of candidate $A_1$ is linked to the "Shiraz" Wikipedia page ($C_1$), while candidate $A_2$ is not linked to this page. Given that the city of Shiraz ($C_1$) is one of the candidates for the word "Shiraz" in the text and therefore appears in the graph, the score of the first candidate of "Saadi" ($A_1$) will be higher than that of the second candidate ($A_2$).

Level-2 graph formation:
In the previous example, suppose that Saadi's candidate $A_1$ is linked to four other entities, $D_1$, $D_2$, $D_3$, and $D_4$. In the second-level graph, these entities are also added, and the same adding operation is done for all other level-1 nodes ($A_2$, $A_3$, $B_1$, $B_2$, $C_1$, $C_2$, $C_3$, $C_4$). The level-2 graph is always more crowded, but the criterion for forming edges is the same as in the previous graph: the internal Wikipedia links between the entities' pages. Its node set is:

$$A_1, A_2, A_3, B_1, B_2, C_1, C_2, C_3, C_4, D_1, D_2, D_3, D_4, \ldots \text{ (links of all other candidates)}$$

Now suppose the previous example with some differences: "Saadi" has two candidates, $A_1$ and $A_2$, and neither candidate has a link to the Shiraz Wikipedia page ($C_1$). However, $A_1$ is linked to another Wikipedia page $D_1$, and $D_1$ is linked to $C_1$, whereas $A_2$ has no connection to $C_1$, even through one intermediate node. In this example, $C_1$ exists in the level-2 graph, and the score of the first candidate $A_1$ will be higher than that of the second candidate $A_2$.

Top-weight Entity Selection and Linking

In this module, we first multiply, for each candidate, all the weights obtained from the previous four modules; we then sort all candidates by the resulting products and take the highest-weighted candidate as the selected one. The mention is linked to the Wikipedia page of this candidate, and as such, the entity has been disambiguated. The other entities are added to the entity mention's "ambiguity list" to persist the rejected candidates for possible future applications such as error checking. After candidate generation and ranking, the NIL-threshold method is used for unlinkable mention prediction: if the score of the top-ranked candidate entity is lower than a pre-defined threshold, the entity mention is tagged as NIL, and the system adds all of the candidate entities to the ambiguity list.
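The selection and NIL-tagging step described above can be sketched as follows; the threshold value and function names are illustrative assumptions, not ULIED's actual settings.

```python
NIL_THRESHOLD = 0.05  # assumed value; the paper does not state the threshold

def select_entity(module_scores, threshold=NIL_THRESHOLD):
    """module_scores: maps each candidate to its list of per-module scores.
    Returns (linked_entity_or_None, ambiguity_list)."""
    final = {}
    for cand, scores in module_scores.items():
        product = 1.0
        for s in scores:
            product *= s          # multiply the four module weights
        final[cand] = product
    ranked = sorted(final, key=final.get, reverse=True)
    best = ranked[0]
    if final[best] < threshold:
        return None, ranked       # NIL: keep all candidates in the ambiguity list
    return best, ranked[1:]       # link best; persist the rejected candidates

print(select_entity({"A": [0.9, 0.8, 1.0, 0.5], "B": [0.2, 0.3, 0.4, 0.1]}))
# → ('A', ['B'])
```

A multiplicative aggregation like this means any single module can veto a candidate by assigning it a near-zero weight, which matches the penalty behavior of the class-context module.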
5 Evaluation

We implemented our proposed method, ULIED, for the Persian and English languages using existing datasets and compared the evaluation results with Babelfy and DBpedia Spotlight, the other state-of-the-art multilingual unsupervised methods. We then analyze these results and evaluations.

We evaluate ULIED on the ParsEL-Social dataset, and the results are reported in Figure 2. To evaluate ULIED on the English language, we use ten different datasets that are widely used in entity linking evaluations, introduced in Section 3.

To evaluate the system, we input the documents of each dataset, with the span of each mention within the text, in the NIF standard format. Given the position of each mention in the text, ULIED selects candidate Wikipedia articles for the mention, performs the ED phase, and links the mention to an entity. If there is no candidate entity for the mention, or the confidence values of all entities are under a threshold, ULIED does not link the mention to any entity.

Rosales et al. [41] report the results of several state-of-the-art unsupervised language-independent EL systems on their multilingual dataset and report the superiority of Babelfy and DBpedia Spotlight over the other systems.

Figure 2 depicts the results of ULIED, DBpedia Spotlight, and Babelfy based on the micro F1-measure. The results show that ULIED outperforms Babelfy on 5 datasets, while Spotlight records the best performance only on the DBpedia Spotlight dataset. Therefore, the performance of ULIED in the English language is comparable with the other multilingual approaches, and it outperforms them in the Persian language.
Figure 2: Evaluation of entity disambiguation results using the proposed method ULIED on 10 different datasets, compared with DBpedia Spotlight and Babelfy as baseline algorithms. Results are reported according to the F1 measure.

Moreover, it is notable that Persian Wikipedia has almost 500,000 articles (i.e., one-twentieth of English Wikipedia); consequently, the number of candidates per mention in Persian is much smaller than in English. This causes ULIED's F-score in Persian to be meaningfully higher than its F-score on the English datasets.

To evaluate ULIED on the Persian language, we use Babelfy as the baseline, which works based on BabelNet 3.0. In the first step, we ran Babelfy on our dataset through its public API. Babelfy returns all BabelNet synsets for each token in the text, and each synset is linked to several sources, including Wikipedia articles. The FarsBase knowledge base is constructed from Persian Wikipedia and uses Wikipedia articles to construct its entities; therefore, we retrieve only the Persian Wikipedia source for each BabelNet synset and convert it to its corresponding entity in FarsBase.

Babelfy (despite its multilingual nature) is not expected to perform comparably with ULIED in Persian because, in English, Babelfy utilizes additional lexical data sources such as WordNet, whereas, in Persian, neither Babelfy nor ULIED has access to lexical or knowledge resources beyond Wikipedia. Indeed, this is the reason for the large difference between the ULIED and Babelfy F-scores in the Persian language. The Babelfy API does not return entity candidates in its results; thus, comparing its reported recall with ParsEL is not meaningful, and predictably the recall of our baseline method is lower than ULIED's. It should be noted that it was not possible to compare ULIED with DBpedia Spotlight.
DBpedia Spotlight supports entity disambiguation only for a specific set of languages, and Persian is not one of them.
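The synset-to-entity conversion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the annotation dictionaries only mimic the shape of a Babelfy-style API response (the `text` and `WikipediaURL` field names are assumptions), and the FarsBase URI scheme used below is hypothetical.

```python
from typing import Optional
from urllib.parse import unquote, urlparse

def wikipedia_title(url: str) -> Optional[str]:
    """Extract the article title from a Persian Wikipedia URL.

    Returns None for non-Persian sources, mirroring the step in the
    text where only Persian Wikipedia sources of a synset are kept.
    """
    parsed = urlparse(url)
    if parsed.netloc != "fa.wikipedia.org":
        return None
    return unquote(parsed.path.rsplit("/", 1)[-1])

def to_farsbase(annotations: list) -> list:
    """Map Babelfy-style annotations to FarsBase entities.

    Keeps only annotations whose synset is linked to a Persian
    Wikipedia article and converts each article to an entity URI.
    """
    linked = []
    for ann in annotations:
        title = wikipedia_title(ann.get("WikipediaURL", ""))
        if title is None:
            continue
        linked.append({
            "mention": ann["text"],
            # Hypothetical FarsBase URI scheme, for illustration only.
            "entity": "http://farsbase.net/resource/" + title,
        })
    return linked

# Two mock annotations: only the Persian-Wikipedia one survives.
annotations = [
    {"text": "تهران", "WikipediaURL": "https://fa.wikipedia.org/wiki/تهران"},
    {"text": "Tehran", "WikipediaURL": "https://en.wikipedia.org/wiki/Tehran"},
]
print(to_farsbase(annotations))
```

In practice the annotations would come from the Babelfy disambiguation endpoint, and the URI construction would be replaced by a lookup against the FarsBase index.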
Conclusion

In this paper, we presented an approach for unsupervised, language-independent entity disambiguation, dubbed ULIED. ULIED utilizes only the link graph of Wikipedia pages (for each language) and the raw text of their corresponding articles.

ULIED is compared to Babelfy and DBpedia Spotlight as state-of-the-art unsupervised and language-independent entity disambiguation methods. On the English datasets, the ULIED F-score is almost similar to (and in some cases better than) Babelfy's and almost always better than Spotlight's. On the Persian dataset (proposed in the conference version of this paper [61]), however, ULIED meaningfully outperforms the Babelfy F-score. This expected superiority arises because Babelfy utilizes additional data resources (e.g., WordNet) in English, and its performance degrades when such additional lexical and knowledge resources are absent or inaccessible.

As an unsupervised method, ULIED is suggested for applications in which training data is unavailable or costly to produce. However, ULIED is not suggested for problems where sufficiently large annotated entity linking corpora are available; for those, supervised methods are the best solution.

As future work, we plan to extend the proposed method into an end-to-end approach by focusing on the mention detection sub-task of entity linking. Additionally, using phrase embeddings can also improve the proposed method, especially in the context similarity detection phase. The weight aggregation strategy is also expected to be upgraded. Moreover, the proposed method is expected to be evaluated in other languages, especially on multilingual entity linking datasets.

The authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this paper.
References

[1] Weiqian Yan and Kanchan Khurad. Entity linking with people entity on Wikipedia. CoRR, abs/1705.01042, 2017.

[2] Xianpei Han, Le Sun, and Jun Zhao. Collective entity linking in web text: a graph-based method. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 765–774. ACM, 2011.

[3] Octavian-Eugen Ganea, Marina Ganea, Aurelien Lucchi, Carsten Eickhoff, and Thomas Hofmann. Probabilistic bag-of-hyperlinks model for entity linking. In Proceedings of the 25th International Conference on World Wide Web, pages 927–938. International World Wide Web Conferences Steering Committee, 2016.

[4] Wei Shen, Jianyong Wang, and Jiawei Han. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460, 2014.

[5] Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann. End-to-end neural entity linking. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 519–529. Association for Computational Linguistics, 2018.

[6] Pavel Taufer. Named entity recognition and linking. Master's thesis, Univerzita Karlova, Matematicko-fyzikální fakulta, 2017.

[7] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2007.

[8] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM, 2007.

[9] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
[10] Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 481–492. ACM, 2012.

[11] Shuyan Zhou, Shruti Rijhwani, John Wieting, Jaime Carbonell, and Graham Neubig. Improving candidate generation for low-resource cross-lingual entity linking. Transactions of the Association for Computational Linguistics, 8:109–124, 2020.

[12] Xingyu Fu, Weijia Shi, Zian Zhao, Xiaodong Yu, and Dan Roth. Design challenges for low-resource cross-lingual entity linking. CoRR, abs/2005.00692, 2020.

[13] Majid Asgari, Ali Hadian, and Behrouz Minaei-Bidgoli. FarsBase: The Persian knowledge graph. Semantic Web, pages 1169–1196, 2019.

[14] Maria Pershina, Yifan He, and Ralph Grishman. Personalized page rank for named entity disambiguation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 238–243. The Association for Computational Linguistics, 2015.

[15] Chenwei Ran, Wei Shen, and Jianyong Wang. An attention factor graph model for tweet entity linking. In Proceedings of the 2018 World Wide Web Conference, pages 1135–1144. International World Wide Web Conferences Steering Committee, 2018.

[16] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics, 2011.

[17] Gongqing Wu, Ying He, and Xuegang Hu. Entity linking: an issue to extract corresponding entity with knowledge base. IEEE Access, 6:6220–6231, 2018.

[18] Xianpei Han and Jun Zhao. NLPR_KBP in TAC 2009 KBP track: A two-stage method to entity linking. In TAC, page 8. Citeseer, 2009.

[19] Zheng Chen, Suzanne Tamang, Adam Lee, Xiang Li, Wen-Pin Lin, Matthew G. Snover, Javier Artiles, Marissa Passantino, and Heng Ji. CUNY-BLENDER TAC-KBP2010 entity linking and slot filling system description. In Proceedings of the Third Text Analysis Conference, page 16. NIST, 2010.

[20] Jing Xu, Liang Gan, Mian Cheng, and Quanyuan Wu. Unsupervised medical entity recognition and linking in Chinese online medical text. Journal of Healthcare Engineering, 2018:2548537, 2018.

[21] Hongda Shen, David Francois Huynh, Grace Chung, Chen Zhou, Yanlai Huang, and Guanghua Li. Ranking search results based on entity metrics, November 19 2015. US Patent App. 14/651,332.

[22] Silviu Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716. ACL, 2007.

[23] Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin. Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 277–285. Association for Computational Linguistics, 2010.

[24] Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259. ACL, 2016.

[25] Jan-Christoph Klie, Richard Eckart de Castilho, and Iryna Gurevych. From zero to hero: Human-in-the-loop entity linking in low resource domains. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6982–6993. Association for Computational Linguistics, 2020.

[26] Chengyu Wang, Guomin Zhou, Xiaofeng He, and Aoying Zhou. NERank+: a graph-based approach for entity ranking in document collections. Frontiers of Computer Science, 12(3):504–517, 2018.

[27] Xianpei Han and Le Sun. An entity-topic model for entity linking. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 105–115. Association for Computational Linguistics, 2012.

[28] Wei Shen, Jianyong Wang, Ping Luo, and Min Wang. LINDEN: linking named entities with knowledge base via semantic knowledge. In Proceedings of the 21st International Conference on World Wide Web, pages 449–458. ACM, 2012.

[29] Wei Zhang, Chew Lim Tan, Yan Chuan Sim, and Jian Su. NUS-I2R: Learning a combined system for entity linking. In Text Analysis Conference 2010 (TAC 2010), page 5. NIST, 2010.
[30] Wei Zhang, Yan Chuan Sim, Jian Su, and Chew Lim Tan. Entity linking with effective acronym expansion, instance selection and topic modeling. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, pages 1909–1914. AAAI Press, 2011.

[31] Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.

[32] Amal Alokaili and Mohamed El Bachir Menai. SVM ensembles for named entity disambiguation. Computing, 102(4):1051–1076, 2020.

[33] Jing Zhang, Yixin Cao, Lei Hou, Juanzi Li, and Hai-Tao Zheng. XLink: An unsupervised bilingual entity linking system. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data - 16th China National Conference, CCL 2017, and 5th International Symposium, NLP-NABD 2017, pages 172–183. Springer, 2017.

[34] Xiaoman Pan, Taylor Cassidy, Ulf Hermjakob, Heng Ji, and Kevin Knight. Unsupervised entity linking with abstract meaning representation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1130–1139. The Association for Computational Linguistics, 2015.

[35] Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186. The Association for Computational Linguistics, 2013.

[36] Tao Xie, Bin Wu, Bingjing Jia, and Bai Wang. Graph-ranking collective Chinese entity linking algorithm. Frontiers of Computer Science, 14(2):291–303, 2020.

[37] Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244, 2014.

[38] Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. KORE: keyphrase overlap relatedness for entity disambiguation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 545–554. ACM, 2012.

[39] Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Michael Röder, Daniel Gerber, Sandro Athaide Coelho, Sören Auer, and Andreas Both. AGDISTIS - graph-based disambiguation of named entities using linked data. In Proceedings of the 13th International Semantic Web Conference - Part I, pages 457–471. Springer, 2014.

[40] Diego Moussallem, Ricardo Usbeck, Michael Röder, and Axel-Cyrille Ngonga Ngomo. MAG: A multilingual, knowledge-base agnostic and deterministic entity linking approach. In Proceedings of the Knowledge Capture Conference, pages 9:1–9:8. ACM, 2017.

[41] Henry Rosales-Méndez, Aidan Hogan, and Barbara Poblete. VoxEL: A benchmark dataset for multilingual entity linking. In Proceedings of the 17th International Semantic Web Conference, pages 170–186. Springer, 2018.

[42] Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, et al. GERBIL: general entity annotator benchmarking framework. In Proceedings of the 24th International Conference on World Wide Web, pages 1133–1143. International World Wide Web Conferences Steering Committee, 2015.

[43] Borislav Popov, Atanas Kiryakov, Damyan Ognyanoff, Dimitar Manov, and Angel Kirilov. KIM - a semantic platform for information extraction and retrieval. Natural Language Engineering, 10(3-4):375–392, 2004.

[44] Paolo Ferragina and Ugo Scaiella. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1625–1628. ACM, 2010.

[45] Eric Charton, Michel Gagnon, and Benoit Ozell. Automatic semantic web annotation of named entities. In Advances in Artificial Intelligence - 24th Canadian Conference on Artificial Intelligence, pages 74–85. Springer, 2011.

[46] Zhaochen Guo, Ying Xu, Filipe de Sá Mesquita, Denilson Barbosa, and Grzegorz Kondrak. ualberta at TAC-KBP 2012: English and cross-lingual entity linking. In Proceedings of the Fifth Text Analysis Conference, TAC. NIST, 2012.

[47] Angela Fahrni, Thierry Göckel, and Michael Strube. HITS' monolingual and cross-lingual entity linking system at TAC 2012: A joint approach. In Proceedings of the Fifth Text Analysis Conference, TAC. NIST, 2012.

[48] Milan Dojchinovski and Tomáš Kliegr. Recognizing, classifying and linking entities with Wikipedia and DBpedia. In Workshop on Intelligent and Knowledge Oriented Technologies (WIKT), pages 41–44, 2012.
[49] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8. ACM, 2011.

[50] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems, pages 121–124. ACM, 2013.

[51] Zhichun Wang, Juanzi Li, and Jie Tang. Boosting cross-lingual knowledge linking via concept annotation. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pages 2733–2739. IJCAI/AAAI, 2013.

[52] Felix Sasaki, Milan Dojchinovski, and Jan Nehring. Chainable and extendable knowledge integration web services. In Knowledge Graphs and Language Technology - ISWC 2016 International Workshops: KEKI and NLP&DBpedia, volume 10579 of Lecture Notes in Computer Science, pages 89–101. Springer, 2016.

[53] Chen-Tse Tsai and Dan Roth. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598. The Association for Computational Linguistics, 2016.

[54] Aasish Pappu, Roi Blanco, Yashar Mehdad, Amanda Stent, and Kapil Thadani. Lightweight multilingual entity extraction and linking. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 365–374. ACM, 2017.

[55] René Speck and Axel-Cyrille Ngonga Ngomo. Ensemble learning of named entity recognition algorithms using multilayer perceptron for the multilingual web of data. In Proceedings of the Knowledge Capture Conference, page 26. ACM, 2017.

[56] Dick Crouch, Roser Saurí, and Abraham Fowler. AQUAINT pilot knowledge-based evaluation: Annotation guidelines, 2005.

[57] Kristian Noullet, Rico Mix, and Michael Färber. KORE 50DYWC: An evaluation data set for entity linking based on DBpedia, YAGO, Wikidata, and Crunchbase. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 2389–2395. European Language Resources Association, 2020.

[58] Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 457–466. ACM, 2009.

[59] Michael Röder, Ricardo Usbeck, Sebastian Hellmann, Daniel Gerber, and Andreas Both. N³ - a collection of datasets for named entity recognition and disambiguation in the NLP interchange format. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, pages 3529–3533. European Language Resources Association (ELRA), 2014.

[60] David Carmel, Ming-Wei Chang, Evgeniy Gabrilovich, Bo-June Paul Hsu, and Kuansan Wang. ERD'14: entity recognition and disambiguation challenge. In ACM SIGIR Forum, volume 48, pages 63–77. ACM, 2014.

[61] Farzaneh Fakhrian, Majid Asgari-Bidhendi, and Behrouz Minaei-Bidgoli. An unsupervised end-to-end language-independent entity linking method and its evaluation on ParsEL as the first Persian entity linking corpus. Manuscript submitted for publication, 2020.

[62] Xinhua Zhu, Qingsong Guo, Bo Zhang, and Fei Li. An efficient approach for measuring semantic relatedness using Wikipedia bidirectional links.