OpenTapioca: Lightweight Entity Linking for Wikidata
Antonin Delpeuch
Department of Computer Science, University of Oxford, UK
[email protected]
Abstract
We propose a simple Named Entity Linking system that can be trained from Wikidata only. This demonstrates the strengths and weaknesses of this data source for this task and provides an easily reproducible baseline to compare other systems against. Our model is lightweight to train, to run and to keep synchronous with Wikidata in real time.
Named Entity Linking is the task of detecting mentions of entities from a knowledge base in free text, as illustrated in Figure 1.

Most of the entity linking literature focuses on target knowledge bases which are derived from Wikipedia, such as DBpedia (Auer et al., 2007) or YAGO (Suchanek et al., 2007). These bases are curated automatically by harvesting information from the info-boxes and categories on each Wikipedia page and are therefore not editable directly.

Wikidata (Vrandečić and Krötzsch, 2014) is an editable, multilingual knowledge base which has recently gained popularity as a target database for entity linking (Klang and Nugues, 2014; Weichselbraun et al., 2018; Sorokin and Gurevych, 2018; Raiman and Raiman, 2018). As these new approaches to entity linking also introduce novel learning methods, it is hard to tell apart the benefits that come from the new models and those which come from the choice of knowledge graph and the quality of its data.

We review the main differences between Wikidata and static knowledge bases extracted from Wikipedia, and analyze their implications for entity linking. We illustrate these differences by building a simple entity linker, OpenTapioca, which only uses data from Wikidata, and show that it is competitive with other systems with access to larger data sources for some tasks. OpenTapioca can be trained easily from a Wikidata dump only, and can be efficiently kept up to date in real time as Wikidata evolves. We also propose tools to adapt existing entity linking datasets to Wikidata, and offer a new entity linking dataset, consisting of affiliation strings extracted from research articles. (The implementation and datasets are available at https://github.com/wetneb/opentapioca and the demo can be found at https://opentapioca.org/.)

Figure 1: Example of an annotated sentence. [The sentence "Associated Press writer Julie Pace contributed from Washington." is annotated with Associated Press (Q40469), Julie Pace (Q34666768) and Washington D.C. (Q61), with an employer (P108) relation linking Julie Pace to the Associated Press.]
Wikidata is a wiki itself, meaning that it can be edited by anyone, but it differs from usual wikis by its data model: information about an entity can only be input as structured data, in a format that is similar to RDF.

Wikidata stores information about the world in a collection of items, which are structured wiki pages. Items are identified by their Q-id, such as Q40469, and they are made of several data fields. The label stores the preferred name for the entity. It is supported by a description, a short phrase describing the item to disambiguate it from namesakes, and aliases are alternate names for the entity. These three fields are stored separately for each language supported by Wikidata. Items also hold a collection of statements: these are RDF-style claims which have the item as subject. They can be backed by references and be made more precise with qualifiers, which all rely on a controlled vocabulary of properties (similar to RDF predicates). Finally, items can have site links, connecting them to the corresponding page for the entity in other Wikimedia projects (such as Wikipedia). Note that Wikidata items do not need to be associated with any Wikipedia page: in fact, Wikidata's policy on the notability of the subjects it covers is much more permissive than in Wikipedia. For a more detailed introduction to Wikidata's data model we refer the reader to Vrandečić and Krötzsch (2014) and Geiß et al. (2017).

Our goal is to evaluate the usefulness of this crowdsourced structured data for entity linking. We will therefore refrain from augmenting it with any external data (such as phrases and topical information extracted from Wikipedia pages), as is generally done when working with DBpedia or YAGO. By avoiding a complex mash-up of data coming from disparate sources, our entity linking system is also simpler and easier to reproduce. Finally, it is possible to keep OpenTapioca in real-time synchronization with the live version of Wikidata, with a lag of a few seconds only. This means that users are able to fix or improve the knowledge graph, for instance by adding a missing alias on an item, and immediately see the benefits on their entity linking task. This contrasts with all other systems we are aware of, where the user either cannot directly intervene on the underlying data, or there is a significant delay in propagating these updates to the entity linking system.

We review the dominant architecture of entity linking heuristics following Shen et al. (2015), and assess its applicability to Wikidata. Entities in the knowledge base are associated with a set (or probability distribution) of possible surface forms. Given a text to annotate, candidate entities are generated by looking for occurrences of their surface forms in the text. Because of homonymy, many of these candidate occurrences turn out to be false matches, so a classifier is used to predict their correctness. We can group the features they tend to use in the following categories:

• local compatibility: these features assess the adequacy between an entity and the phrase that refers to it. This relies on the dictionary of surface forms mentioned above, and does not take into account the broader context of the phrase to link.

• topic similarity: this measures the compatibility between the topics in the text to annotate and the topics associated with the candidate entity. Topics can be represented in various ways, for instance with a bag of words model.
• mapping coherence: entities mentioned in the same text are often related, so linking decisions are inter-dependent. This relies on a notion of proximity between entities, which can be defined with random walks in the knowledge graph for instance.

Local compatibility features compare the phrase to annotate with the known surface forms for the entity. Collecting such forms is often done by extracting mentions from Wikipedia (Cucerzan, 2007). Link labels, redirects, disambiguation pages and bold text in abstracts can all be useful to discover alternate names for an entity. It is also possible to crawl the web for Wikipedia links to improve the coverage, often at the expense of data quality (Spitkovsky and Chang, 2012).

Beyond collecting a set of possible surface forms, these approaches count the number of times an entity e was mentioned by a phrase w. This makes it possible to use a Bayesian methodology: the compatibility of a candidate entity e with a given mention w is P(e | w) = P(e, w) / P(w), which can be estimated from the statistics collected.

In Wikidata, items have labels and aliases in multiple languages. As this information is directly curated by editors, these phrases tend to be of high quality. However, they do not come with occurrence counts. As items link to each other using their Wikidata identifiers only, it is not possible to compare the number of times USA was used to refer to United States of America (Q30) or to United States Army (Q9212) inside Wikidata. Unlike Wikipedia's page titles, which must be unique in a given language, two Wikidata items can have the same label in the same language. For instance, Curry is the English label of both the item about the Curry programming language (Q2368856) and the item about the village in Alaska (Q5195194), and the description field is used to disambiguate them.
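As an illustration of the surface-form data that Wikidata exposes, the snippet below fetches the label, aliases and description of an item from the public Special:EntityData endpoint. This is only a minimal exploratory sketch; OpenTapioca itself builds its dictionary from a full JSON dump rather than per-item API calls.

```python
import json
import urllib.request

def surface_forms(qid, lang="en"):
    """Fetch the label, aliases and description of a Wikidata item in one language."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    with urllib.request.urlopen(url) as resp:
        entity = json.load(resp)["entities"][qid]
    label = entity["labels"].get(lang, {}).get("value")
    aliases = [a["value"] for a in entity["aliases"].get(lang, [])]
    description = entity["descriptions"].get(lang, {}).get("value")
    return label, aliases, description

# e.g. United States of America (Q30): label, aliases such as "USA" or "US", description
print(surface_forms("Q30"))
```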
Manual curation of surface forms implies a fairly narrow coverage, which can be an issue for general purpose entity linking. For instance, people are commonly referred to with their given or family name only, and these names are not systematically added as aliases: at the time of writing, Trump is an alias for Donald Trump (Q22686), but Cameron is not an alias for David Cameron (Q192). As a Wikidata editor, the main incentive to add aliases to an item is to make it easier to find the item with Wikidata's auto-suggest field, so that it can be edited or linked to more easily. Aliases are not designed to offer a complete set of possible surface forms found in text: for instance, adding common misspellings of a name is discouraged.

The compatibility of the topic of a candidate entity with the rest of the document is traditionally estimated by similarity measures from information retrieval such as TF-IDF (Štajner and Mladenić, 2009; Ratinov et al., 2011) or keyword extraction (Strube and Ponzetto, 2006; Mihalcea and Csomai, 2007; Cucerzan, 2007). Wikidata items only consist of structured data, except in their descriptions. This makes it difficult to compute topical information using the methods above. Vector-based representations of entities can be extracted from the knowledge graph alone (Bordes et al., 2013; Xiao et al., 2016), but it is not clear how to compare them to topic representations for plain text, which would be computed differently. In more recent work, neural word embeddings were used to represent topical information for both text and entities (Ganea and Hofmann, 2017; Raiman and Raiman, 2018; Kolitsas et al., 2018). This requires access to large amounts of text, both to train the word vectors and to derive the entity vectors from them. These vectors have been shown to encode significant semantic information by themselves (Mikolov et al., 2013), so we refrain from using them in this study.
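For concreteness, here is a minimal sketch of the traditional TF-IDF style of topic similarity described above: candidates are scored by the textual overlap between the document and a textual profile of each entity. OpenTapioca does not use this technique; with Wikidata alone, only the short description field would be available to build such profiles. The QIDs and profile strings are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

document = "Apple unveiled a new phone at its Cupertino headquarters."
candidate_profiles = {                 # illustrative textual profiles for two candidates
    "Q312": "American multinational technology company headquartered in Cupertino",
    "Q89": "fruit of the apple tree",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([document] + list(candidate_profiles.values()))
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
for qid, score in zip(candidate_profiles, scores):
    print(qid, round(score, 3))        # higher score = better topical fit
```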
Entities mentioned in the same context are often topically related, therefore it is useful not to treat linking decisions in isolation but rather to try to maximize topical coherence in the chosen items. This is the issue on which entity linking systems differ the most, as it is harder to model.

First, we need to estimate the topical coherence of a sequence of linking decisions. This is often done by first defining a pairwise relatedness score between the target entities. For instance, a popular metric introduced by Witten and Milne (2008) considers the sets of wiki links a, b made from or to two entities and computes their relatedness as
\[ \mathrm{rel}(a, b) = 1 - \frac{\log(\max(|a|, |b|)) - \log(|a \cap b|)}{\log(|K|) - \log(\min(|a|, |b|))} \]
where |K| is the number of entities in the knowledge base.

When linking to Wikidata instead of Wikipedia, it is tempting to reuse these heuristics, replacing wikilinks by statements. However, Wikidata's linking structure is quite different from Wikipedia's: statements are generally a lot sparser than links and they have a precise semantic meaning, as editors are restricted by the available properties when creating new statements. We propose in the next section a similarity measure that we find to perform well experimentally.

Once a notion of semantic similarity is chosen, we need to integrate it in the inference process. Most approaches build a graph of candidate entities, where edges indicate semantic relatedness; the heuristics differ in the way this graph is used for the matching decisions. Moro et al. (2014) use an approximate algorithm to find the densest subgraph of the semantic graph, which determines the choice of entity for each mention. In other approaches, the initial evidence given by the local compatibility score is propagated along the edges of the semantic graph (Mihalcea and Csomai, 2007; Han et al., 2011) or aggregated at a global level with a Conditional Random Field (Ganea and Hofmann, 2017).
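To make the relatedness measure concrete, here is a small sketch evaluating the Witten and Milne (2008) formula recalled above on explicit link sets. The zero-overlap case is clamped to 0 by convention, and the toy link sets and knowledge-base size are made up.

```python
import math

def witten_milne_relatedness(links_a, links_b, kb_size):
    """rel(a, b) from Witten and Milne (2008), as recalled above.
    links_a / links_b: sets of wiki links made from or to a and b; kb_size: |K|."""
    overlap = len(links_a & links_b)
    if overlap == 0:
        return 0.0   # conventional clamp when the link sets are disjoint
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    return 1.0 - (math.log(larger) - math.log(overlap)) / (math.log(kb_size) - math.log(smaller))

# toy example: two entities sharing 3 neighbours in a knowledge base of 1,000,000 items
print(witten_milne_relatedness({"Q1", "Q2", "Q3", "Q4"}, {"Q2", "Q3", "Q4", "Q5"}, 1_000_000))
```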
We propose a model that adapts previous approaches to Wikidata. Let d be a document (a piece of text). A spot s ∈ d is a pair of start and end positions in d. It defines a phrase d[s], and a set of candidate entities E[s]: those are all Wikidata items for which d[s] is a label or alias. Given two spots s, s′, we denote by |s − s′| the number of characters between them. We build a binary classifier which predicts, for each s ∈ d and e ∈ E[s], whether s should be linked to e.

Although Wikidata makes it impossible to count how often a particular label or alias is used to refer to an entity, these surface forms are carefully curated by the community. They are therefore fairly reliable. Given an entity e and a phrase d[s], we need to compute p(e | d[s]). Having no access to such a probability distribution, we choose to approximate this quantity by p(e) / p(d[s]), where p(e) is the probability that e is linked to, and p(d[s]) is the probability that d[s] occurs in a text. In other words, we estimate the popularity of the entity and the commonness of the phrase separately.

We estimate the popularity of an entity e by a log-linear combination of its number of statements n_e, its number of site links s_e and its PageRank r(e). The PageRank is computed on the entire Wikidata graph, using statement values and qualifiers as edges. The probability p(d[s]) is estimated by a simple unigram language model that can be trained on any large unannotated dataset (to respect our constraint of using Wikidata only, we train this language model on Wikidata item labels).

The local compatibility is therefore represented by a vector of features F(e, d[s]) and computed as follows, where λ is a weights vector:
\[ F(e, d[s]) = \big( -\log p(d[s]),\ \log p(e),\ n_e,\ s_e \big), \qquad p(e \mid d[s]) \propto e^{F(e, d[s]) \cdot \lambda} \]

The issue with the features above is that they ignore the context in which a mention is found. To make the model context-sensitive, we adapt the approach of Han et al. (2011) to our setup. The general idea is to define a graph on the candidate entities, linking candidate entities which are semantically related, and then find a combination of candidate entities which have both high local compatibility and which are densely related in the graph.

For each pair of entities e, e′ we define a similarity metric s(e, e′). Let l(e) be the set of items that e links to in its statements. Consider a one-step random walk starting on e, with probability β to stay on e and probability (1 − β)/|l(e)| to reach each of the linked items. We define s(e, e′) as the probability that two such one-step random walks, starting from e and e′, end up on the same item. This can be computed explicitly as
\[ s(e, e') = \beta^2 \delta_{e = e'} + \beta(1-\beta)\left(\frac{\delta_{e \in l(e')}}{|l(e')|} + \frac{\delta_{e' \in l(e)}}{|l(e)|}\right) + (1-\beta)^2 \frac{|l(e) \cap l(e')|}{|l(e)|\,|l(e')|} \]

We then build a weighted graph G_d whose vertices are pairs (s ∈ d, e ∈ E[s]). In other words, we add a vertex for each candidate entity at a given spot. We fix a maximum distance D for edges: vertices (s, e) and (s′, e′) can only be linked if |s − s′| ≤ D and s ≠ s′. In this case, we define the weight of such an edge as (η + s(e, e′)) · (D − |s − s′|)/D, where η is a smoothing parameter. In other words, the edge weight is proportional to the smoothed similarity between the entities, discounted by the distance between the mentions.

The weighted graph G_d can be represented as an adjacency matrix. We transform it into a column-stochastic matrix M_d by normalizing its columns to sum to one. This defines a Markov chain on the candidate entities, which we use to propagate the local evidence. Han et al. (2011) first combine the local features into a local evidence score LC(d), and then spread this local evidence using the Markov chain:
\[ G(d) = (\alpha I + (1 - \alpha) M_d)^k \cdot LC(d) \qquad (1) \]

We propose a variant of this approach, where each individual local compatibility feature is propagated independently along the Markov chain. Let F be the matrix of all local features for each candidate entity: F = (F(e_1, d[s_1]), ..., F(e_n, d[s_n])). After k iterations in the Markov chain, this defines features M_d^k F. Rather than relying on these features for a fixed number of steps k, we record the features at each step, which defines the vector
\[ (F,\ M_d F,\ M_d^2 F,\ \ldots,\ M_d^k F) \]
This alleviates the need for an α parameter while keeping the number of features small. We train a linear support vector classifier on these features, and this defines the final score of each candidate entity. (It is important for this purpose that features are initially scaled to the unit interval.)
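The following sketch spells out the two formulas above: the one-step random-walk similarity between two entities, and the stacking of propagated feature matrices (F, M_d F, ..., M_d^k F). The β value, the toy transition matrix and the feature dimensions are illustrative only.

```python
import numpy as np

def one_step_similarity(e, f, links_e, links_f, beta=0.85):
    """s(e, e'): probability that two one-step random walks started at e and f
    land on the same item. links_* are the sets of items each entity links to
    in its statements; beta (illustrative value) is the staying probability."""
    s = beta ** 2 * (e == f)
    if links_f:
        s += beta * (1 - beta) * (e in links_f) / len(links_f)
    if links_e:
        s += beta * (1 - beta) * (f in links_e) / len(links_e)
    if links_e and links_f:
        s += (1 - beta) ** 2 * len(links_e & links_f) / (len(links_e) * len(links_f))
    return s

def propagated_features(M, F, k=2):
    """Stack (F, M·F, ..., M^k·F) column-wise, as in the variant described above."""
    blocks, current = [F], F
    for _ in range(k):
        current = M @ current
        blocks.append(current)
    return np.hstack(blocks)

print(one_step_similarity("Q1", "Q5", {"Q2", "Q3"}, {"Q3", "Q4"}))

# toy column-stochastic Markov chain over 3 candidate vertices, 4 local features each
M = np.array([[0.50, 0.30, 0.20],
              [0.25, 0.40, 0.30],
              [0.25, 0.30, 0.50]])
F = np.random.rand(3, 4)
print(propagated_features(M, F).shape)   # (3, 12)
```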
For each spot, our system picks the highest-scoring candidate entity that the classifier predictsas a match, if any.
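A minimal sketch of these last two steps, on synthetic data and under assumed shapes: a linear SVC is fit on the propagated feature vectors, and for each spot the best-scoring candidate is kept only if the classifier actually predicts a match. The 12-dimensional features and candidate QIDs are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic training data standing in for propagated feature vectors
# (1 = the candidate is the correct link for its spot, 0 = it is not).
X_train = np.random.rand(200, 12)
y_train = np.random.randint(0, 2, 200)
clf = LinearSVC().fit(X_train, y_train)

def link_spot(candidate_qids, candidate_features):
    """Return the chosen QID for one spot, or None if no candidate is predicted as a match."""
    scores = clf.decision_function(candidate_features)
    best = int(np.argmax(scores))
    return candidate_qids[best] if scores[best] > 0 else None

print(link_spot(["Q30", "Q9212"], np.random.rand(2, 12)))
```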
Most entity linking datasets are annotated against DBpedia or YAGO. Wikidata contains items which do not have any corresponding Wikipedia article (in any language), so these items do not have a DBpedia or YAGO URI either (this is the case of Julie Pace (Q34666768) in Figure 1). Therefore, converting an entity linking dataset from DBpedia to Wikidata requires more effort than simply following owl:sameAs links: we also need to annotate mentions of Wikidata items which do not have a corresponding DBpedia URI.

We used the RSS-500 dataset of news excerpts annotated against DBpedia and encoded in NIF format (Usbeck et al., 2015). We first translated all DBpedia URIs to Wikidata items (we built the nifconverter tool to do this conversion for any NIF dataset). Then, we used OpenRefine (Huynh et al., 2019) to extract the entities marked as not covered by DBpedia and matched them against Wikidata. After human review, this added 63 new links (out of 476 out-of-KB entities) to the 524 converted from DBpedia.

We also annotated a new dataset from scratch. The ISTEX dataset consists of one thousand author affiliation strings extracted from research articles and exposed by the ISTEX text and data mining service (the original data is available under an Etalab license). In this dataset, only 64 of the 2,624 Wikidata mentions do not have a corresponding DBpedia URI.

We use the Wikidata JSON dump of 2018-02-24 for our experiments, indexed with Solr (Lucene). We restrict the index to humans, organizations and locations, by selecting only items whose type was a subclass of (P279) human (Q5), organization (Q43229) or geographical object (Q618123). Labels and aliases in all languages are added to a case-sensitive FST index.

We trained our classifier and its hyper-parameters by five-fold cross-validation on the training sets of the ISTEX and RSS datasets. We used GERBIL (Usbeck et al., 2015) to evaluate OpenTapioca against other approaches. We report the InKB micro and macro F1 scores on test sets in Figure 2 (the full details can be found at http://w3id.org/gerbil/experiment?id=201904110006).
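As a rough sketch of the indexing step described above, the snippet below walks a Wikidata JSON dump, keeps items whose instance of (P31) values fall in an allowed type set, and collects their labels and aliases in all languages. The allowed set is assumed to already contain the full subclass closure of Q5, Q43229 and Q618123 (computed separately), the dump filename is illustrative, and the index() call is a hypothetical stand-in for the actual Solr/FST indexer.

```python
import gzip
import json

ALLOWED_TYPES = {"Q5", "Q43229", "Q618123"}   # plus all of their subclasses, in practice

def surface_forms_from_dump(path):
    """Yield (QID, set of labels and aliases) for items of the allowed types."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            item = json.loads(line)
            types = {
                claim["mainsnak"]["datavalue"]["value"]["id"]
                for claim in item.get("claims", {}).get("P31", [])
                if claim["mainsnak"].get("datavalue")
            }
            if not types & ALLOWED_TYPES:
                continue
            forms = {l["value"] for l in item.get("labels", {}).values()}
            forms |= {a["value"] for al in item.get("aliases", {}).values() for a in al}
            yield item["id"], forms

# for qid, forms in surface_forms_from_dump("wikidata-20180224-all.json.gz"):  # illustrative filename
#     index(qid, forms)   # hypothetical stand-in for the Solr / FST indexer
```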
                    AIDA-CoNLL          Microposts 2016
                    Micro     Macro     Micro     Macro
AIDA                –         –         –         –
Babelfy             0.461     0.447     0.314     0.304
DBpedia Spotlight   0.574     0.575     0.281     0.261
FREME NER           0.422     0.321     0.307     0.274
OpenTapioca         0.482     0.399     –         –
Figure 2: F1 scores on test datasets with GERBIL's weak annotation match method.

The surface forms curated by Wikidata editors are sufficient to reach honourable recall, without the need to expand them with mentions extracted from Wikipedia. Our restriction to people, locations and organizations probably helps in this regard, and we anticipate worse performance for broader domains. Our approach works best for scientific affiliations, where spelling is more canonical than in newswire. The availability of Twitter identifiers directly in Wikidata helps us to reach acceptable performance in this domain. The accuracy degrades on longer texts, which require relying more on the ambient topical context. In future work, we would like to explore the use of entity embeddings to improve our approach in this regard.
References
Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. The Semantic Web, pages 722–735.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems.

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).