CHOLAN: A Modular Approach for Neural Entity Linking on Wikipedia and Wikidata
Manoj Prabhakar Kannan Ravi, Kuldeep Singh, Isaiah Onando Mulang', Saeedeh Shekarpour, Johannes Hoffart, Jens Lehmann
Hasso Plattner Institute, University of Potsdam, Potsdam, Germany ([email protected])
Smart Data Analytics, University of Bonn, Bonn, Germany ({mulang,jens.lehmann}@cs.uni-bonn.de)
Zerotha Research and Cerence GmbH, Aachen, Germany ([email protected])
University of Dayton, Dayton, USA ([email protected])
Goldman Sachs, Frankfurt, Germany ([email protected])
Abstract
In this paper, we propose CHOLAN, a modular approach to target end-to-end entity linking (EL) over knowledge bases. CHOLAN consists of a pipeline of two transformer-based models integrated sequentially to accomplish the EL task. The first transformer model identifies surface forms (entity mentions) in a given text. For each mention, a second transformer model is employed to classify the target entity among a predefined candidate list. The latter transformer is fed with an enriched context captured from the sentence (i.e., local context) and an entity description gained from Wikipedia. Such external contexts have not been used in state-of-the-art EL approaches. Our empirical study was conducted on two well-known knowledge bases (i.e., Wikidata and Wikipedia). The empirical results suggest that CHOLAN outperforms state-of-the-art approaches on standard datasets such as CoNLL-AIDA, MSNBC, AQUAINT, ACE2004, and T-REx.
1 Introduction

The explicit schema, graph-based structure, and interlinking nature of information represented in publicly available knowledge graphs (KGs), e.g., DBpedia (Auer et al., 2007), Freebase (Bollacker et al., 2007), Wikidata (Vrandecic, 2012), or knowledge bases (KBs) such as Wikipedia, introduce a new landscape of features as well as structured knowledge and embeddings. Researchers have developed several techniques to align information available in unstructured text to the concepts of these KGs (Wu et al., 2019b; Broscheit, 2019). The end-to-end Entity Linking (hereafter EL) task follows this direction: given a sentence, EL first identifies the entity mentions in the sentence, then maps these mentions to the most likely KG/KB entities. EL comprises a three-step process. With respect to the example sentence "Soccer: Late Goals Give Japan Win Over Syria", the first step, called mention detection (MD), identifies the surface forms "Japan" and "Syria". The next step is candidate generation (CG), aiming to find a list of possible entity candidates in the KG/KB for each entity mention. For example, the candidate list for the entity mention "Japan" consists in part of Japan national football team, Japan (country), and Japan (Band), and for "Syria" it includes Syria (Roman province), Syria national football team, and Greater Syria. Finally, the third step is entity disambiguation (ED), which employs co-reference and contextual features to discriminate the most likely entity from the candidate list; here, Japan national football team and Syria national football team are the correct entities.

Entity Linking approaches are broadly categorised into three categories. The initial attempts (Hoffart et al., 2011; Piccinno and Ferragina, 2014) solve MD and ED as independent sub-tasks of EL (i.e., a pipeline-based system). However, these approaches exhibit a behaviour where errors propagate from MD to ED and hence might degrade the overall performance of the system. The second category emerged in an attempt to mitigate these errors: researchers focused on jointly modelling MD and ED, emphasising the mutual dependency of the two sub-tasks (Kolitsas et al., 2018). These two EL approaches depend on an intermediate candidate generation step and rely on a pre-computed list of entity candidates. For example, Kolitsas et al. (2018) propose a joint MD and ED model and inherit the candidate list from Ganea and Hofmann (2017). The third approach combines the three sub-steps in a joint model and illustrates that each of those tasks is interdependent (Durrett and Klein, 2014; Broscheit, 2019).

The recent EL approaches focus on jointly modelling two or three subtasks (Sevgili et al., 2020). Furthermore, the NLP research community has extensively used transformers in end-to-end models for entity linking (Broscheit 2019; Peters et al. 2019; Févry et al. 2020). Nevertheless, these works report lower performance than Kolitsas et al. (2018), which is a bi-LSTM based model. The observations regarding the limited performance of transformer-based models for EL motivate our work; in this paper, our focus is to understand the bottlenecks in the entity linking process.
We argue that the less studied task in the literature, i.e., candidate generation, has an essential role in EL models' performance, which has not been a focus in the recently proposed transformer-based entity linking models.

In this paper, we hypothesise that transformer models, though trained on a large corpus, may require additional task-specific contexts. Furthermore, inducing the context at the entity disambiguation step may positively impact the overall performance, which has not been utilised in state-of-the-art methods due to monolithic implementations (Kolitsas et al., 2018; Peters et al., 2019; Broscheit, 2019; Févry et al., 2020). Subsequently, we deviate from the joint modelling of two or three subtasks of EL and revert to the methodology adopted by earlier EL systems in 2011 (Hoffart et al., 2011), i.e., treating each sub-task independently. As such, we study the research question RQ: what is the impact of each sub-task (aka component) on the overall outcome of a transformer-based entity linking approach?

We propose an intuitive novel approach named CHOLAN, comprising a modular architecture of two transformer models to solve MD and ED independently. In the first step, CHOLAN employs a BERT (Devlin et al., 2019) model to identify mentions of the entities in an input sentence. The second step involves expanding each mention with a list of KB entity candidates. Finally, the entity mention, the sentence (local context), an entity candidate, and the entity's Wikipedia description (entity context) are fed as input sequences into a second BERT-based model to predict the correct KB entity (cf. Figure 1). We train the MD and ED steps independently, and at test time we run the CHOLAN pipeline end-to-end to predict the KB entity. The following are the novel features of CHOLAN:

• The core focus of the approach is to flexibly induce external context and candidate lists in a transformer-based model to improve EL performance. CHOLAN is independent of a particular candidate list and additional background context. We study four different configurations of CHOLAN to demonstrate the impact of the candidate generation step and background knowledge (i.e., entity and sentential context) induced in the model. CHOLAN achieves a new state-of-the-art performance on several datasets: T-REx (ElSahar et al., 2018) for Wikidata; AIDA-B, MSNBC, AQUAINT, and ACE2004 for Wikipedia (Hoffart et al., 2011; Guo and Barbosa, 2018).

• CHOLAN is the first approach empirically demonstrated to be transferable across KBs with completely different underlying structures and schemas, i.e., semi-structured Wikipedia and fully structured Wikidata.

The implementation is publicly available at https://github.com/ManojPrabhakar/CHOLAN. The paper is structured as follows: the next section summarises the related work. Section 3 describes the problem statement and approach. Section 4 explains the experimental settings, followed by results in Section 5. We conclude in Section 6.

2 Related Work

Mention Detection (MD): The first attempt to organise a named entity recognition (NER) task traces back to 1996 (Grishman and Sundheim, 1996). Since then, numerous attempts have been made, ranging from conditional random fields (CRFs) with features constructed from dictionaries (Rocktäschel et al., 2013) to feature-inferring neural networks (Collobert and Weston, 2008).
Figure 1: CHOLAN has three building blocks: i) BERT-based Mention Detection, which identifies entity mentions in the text; ii) Candidate Generation, which retrieves a set of entities for each mention; iii) Entity Disambiguation, which employs a BERT transformer model powered by background knowledge from the KB and the local sentential context.
Recently, contextual embedding based models achieve state of the art for the NER/MD task (Akbik et al., 2018; Devlin et al., 2019). We point to the survey by Yadav and Bethard (2018) for details about NER. A few early EL models performed the MD task independently (Ceccarelli et al., 2013; Cornolti et al., 2016).
Candidate Generation (CG): There are four prominent approaches for candidate generation. The first is direct matching of entity mentions with a pre-computed candidate set (Zwicklbauer et al., 2016). The second approach is dictionary lookup, where a dictionary of the associated aliases of entity mentions is compiled from several knowledge base sources (e.g., Wikipedia, WordNet) (Sevgili et al., 2020; Fang et al., 2019; Cao et al., 2017). The third approach is to generate entity candidates using an empirical probabilistic entity map p(e|m). The p(e|m) is a pre-calculated prior probability of correspondence between positive mentions and entities. A widely used entity map was built by Ganea and Hofmann (2017) from Wikipedia hyperlinks, Crosswikis (Spitkovsky and Chang, 2012), and YAGO (Hoffart et al., 2011) dictionaries. End-to-end EL approaches such as (Kolitsas et al., 2018; Cao et al., 2018) rely on the entity map built by Ganea and Hofmann. The fourth approach for generating candidates is proposed by Sakor et al. (2019). The authors build a local KG by expanding entity mentions using Wikidata and DBpedia entity labels and associated aliases. The local KG can be queried using the BM25 ranking algorithm (Logeswaran et al., 2019). The modular architecture of CHOLAN gives us the flexibility to experiment with several ways of generating entity candidates. Hence, we reused the candidate list proposed by Ganea and Hofmann (2017) and built a new CG approach based on Sakor et al. (2019).

End-to-End EL:
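As a rough illustration of BM25-style ranking over such a local alias index, the following is a self-contained sketch in pure Python; the tiny corpus, the tokenisation, and the parameter values are illustrative and not the Falcon implementation:

```python
import math
from collections import Counter

def bm25_rank(query_tokens, docs, k1=1.5, b=0.75):
    """Rank tokenised candidate labels against a mention query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))          # document frequency
    idf = {t: math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1) for t in df}
    scores = []
    for d in docs:
        tf = Counter(d)
        norm = k1 * (1 - b + b * len(d) / avgdl)           # length normalisation
        scores.append(sum(idf[t] * tf[t] * (k1 + 1) / (tf[t] + norm)
                          for t in query_tokens if t in tf))
    return sorted(range(N), key=lambda i: -scores[i])      # doc indices, best first

# Toy alias index: each document is a tokenised entity label
labels = [["japan"], ["japan", "national", "football", "team"], ["empire", "of", "japan"]]
ranking = bm25_rank(["japan", "football"], labels)
```

A production index would, of course, use an inverted index rather than scoring every document, but the scoring formula is the same.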
A few EL approaches accomplish the MD and ED tasks jointly. Nguyen et al. (2016) propose joint recognition and disambiguation of named-entity mentions using a graphical model and show that it improves EL. The work in (Kolitsas et al., 2018) also proposes a joint model for MD and ED. The authors use a bi-LSTM based model for mention detection and compute the similarity between the entity mention embedding and a set of predefined entity candidates. The work in (Broscheit, 2019) employs BERT to jointly model the three subtasks of EL. The author employs an entity vocabulary of the 700K most frequent entities to train the model. The work in (Févry et al., 2020) uses a Transformer architecture with large-scale pre-training from Wikipedia links for EL. For CG, the authors train the model to predict BIO-tagged mention boundaries to disambiguate among all entities. For the Wikidata KG, OpenTapioca is an entity linking approach which relies on a heuristic-based model for disambiguating the mentions in a text to Wikidata entities (Delpeuch, 2020). Arjun (Mulang' et al., 2020) is the most similar to our approach CHOLAN and trains two independent neural models for MD and ED. It generates candidates on the fly using a Wikidata entity alias map. Arjun does not induce any context in the model.
3 Problem Statement and Approach

We formally define the EL task as follows: given an input sequence of words W = {w_1, w_2, ..., w_n} and a set of entities E from a KG/KB, the EL task aligns the text to a subset of entities, represented as Θ: W → E′ where E′ ⊂ E. We formulate the EL task as a three-step process in which the first step is mention detection (MD). MD is a function θ: W → M, where the set of mentions is denoted by M = (m_1, m_2, ..., m_k) (k ≤ n) and each mention m_x is a sequence of words from start position i to end position j: m_x^(i,j) = (w_i, w_{i+1}, ..., w_j) (1 ≤ i ≤ j ≤ n). The next task is candidate generation, where for each mention m_x a set of candidates C(m_x) = {e_x1, ..., e_xn | e_xi ∈ E} is derived. Finally, the entity disambiguation (ED) task aims to map each mention m_x ∈ M to the most likely entity from its list of candidates. In our case, we model the ED task as a classification task and augment the input with extra signals as context. For every candidate entity c_xi ∈ C(m_x), the model estimates a probability p_i; thus, the most likely entity is the one with the highest probability, γ = argmax_{p_i} P(p_i | m_x, c_xi, W, C), where W and C are the input representations for the given sentence (local context) and the context derived from the KG/KB, respectively. As such, the probability p_i is conditioned not only on m_x and c_xi but also on W and C as contextual parameters. The CHOLAN architecture comprises three main modules, as illustrated in Figure 1.
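The three-step decomposition above can be sketched as a composition of functions, with the argmax at the ED step; the detector, candidate map, and scorer below are toy stand-ins for the learned models, not the paper's components:

```python
def link_entities(words, detect, generate, score):
    """EL as Θ: W → E′, decomposed into MD (θ: W → M), CG (C(m)), and ED (argmax)."""
    links = {}
    for m in detect(words):                  # MD: mention spans (start, end)
        candidates = generate(words, m)      # CG: candidate set C(m)
        if candidates:
            # ED: pick the candidate maximising the conditional score P(e | m, W, C)
            links[m] = max(candidates, key=lambda e: score(words, m, e))
    return links

# Toy stand-ins for the learned components (illustrative only)
words = "Late goals give Japan win over Syria".split()
detect = lambda w: [(3, 4), (6, 7)]
candidate_map = {"Japan": ["Japan", "Japan national football team"],
                 "Syria": ["Syria", "Syria national football team"]}
generate = lambda w, m: candidate_map[" ".join(w[m[0]:m[1]])]
score = lambda w, m, e: ("football team" in e) + 0.1 * ("win" in w)  # crude context signal
links = link_entities(words, detect, generate, score)
```

The modularity argued for in this paper corresponds to the fact that `detect`, `generate`, and `score` can each be swapped independently.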
Mention Detection: We adapt the vanilla BERT (Devlin et al., 2019) model for the task of entity mention detection in unstructured text. For each input sentence, we append the special tokens [CLS] and [SEP] to the beginning and end of the sentence, respectively. This is then used as input to the model, which learns a representation of the tokens in the sentence. We then introduce a (logistic regression based) classification layer on top of the BERT model to determine named entity tags for each token following the BIO format (Sang and Meulder, 2003). Our BERT† model is initialised using publicly available weights from the pretrained BERT_BASE model and is fine-tuned on the specific dataset for detecting a mention m_i. Please note that the BERT_BASE model has successfully outperformed previous approaches in various NLP tasks, including MD; thus, we reuse this model in our approach.

m_i = BERT†(w_i)     (1)

Candidate Generation: One of the critical focuses of CHOLAN is to understand the bottleneck at the CG step. Hence, we reuse the DCA candidate list and propose a novel candidate list to understand the impact of candidate generation on overall EL performance.
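The BIO tags produced by the token classifier can then be decoded into mention spans; a minimal sketch (variable names are illustrative):

```python
def bio_to_mentions(tokens, tags):
    """Decode BIO tags into (start, end, text) mention spans; end is exclusive."""
    mentions, start = [], None
    for i, tag in enumerate(tags + ["O"]):   # sentinel "O" flushes a trailing span
        if start is not None and not tag.startswith("I-"):
            mentions.append((start, i, " ".join(tokens[start:i])))
            start = None
        if tag.startswith("B-"):
            start = i
    return mentions

tokens = ["Late", "goals", "give", "Japan", "win", "over", "Syria"]
tags = ["O", "O", "O", "B-LOC", "O", "O", "B-LOC"]
mentions = bio_to_mentions(tokens, tags)
```

In practice the decoding runs over WordPiece-aligned tags, but the span logic is the same.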
DCA Candidates: Yang et al. (2019) adapt the probabilistic entity map p(e|m) created by Ganea and Hofmann (2017) (cf. Section 2) to calculate the prior probabilities of candidate entities for a given mention. In the probabilistic entity map, each entity mention has 30 potential entity candidates. Yang and colleagues also provide the associated Wikipedia description of each entity. In CHOLAN, we reuse the candidate set C(m) provided by Yang et al. (2019) and further consider the associated Wikipedia entity descriptions.

Falcon Candidates: Sakor et al. (2019) created a local index of KG items from Wikidata entities expanded with entity aliases. For example, in Wikidata the entity Q33 has the label "Finland". Sakor and colleagues expanded the entity label with other aliases from Wikidata such as "Finlande", "Finnia", "Land of Thousand Lakes", "Suomi", and "Suomen tasavalta". We adopt this local KG index to generate entity candidates per entity mention in the employed datasets. The local KG is queried using the BM25† algorithm (cf. Equation (2)), and the results are ranked by the calculated score. We build a predefined candidate set C_Falcon(m) using the top 30 Wikidata entity candidates for each entity mention. We enrich the candidate set obtained from Wikidata with the corresponding Wikipedia entities. We also add the first paragraph of Wikipedia as the entity description (only if the Wikidata entity has a corresponding Wikipedia page). By selecting two different candidate lists, our idea is to understand the impact of the candidate generation step on end-to-end entity linking performance.

e_i = BM25†(m_i)     (2)

Entity Disambiguation: To use the power of transformers, we propose "WikiBERT" to perform the ED task. In WikiBERT, our novel methodological contribution is the induction of local sentential context and global entity context at the ED step in a transformer model, which has not been used in recent EL models. WikiBERT is derived from the vanilla BERT_BASE model and fine-tuned on the two EL datasets (CoNLL-AIDA and T-REx). We view the ED task as a sequence classification task. The input to our model is a combination of two sequences. The first sequence S1 concatenates the entity mention m ∈ M and the sentence W, where the sentence acts as a local context. The second sequence S2 is a concatenation of an entity candidate e ∈ C(m) or C_Falcon(m) (obtained from Equation 2) and its corresponding Wikipedia description (entity context ct_i). The two sequences are paired together with special start and separator tokens: ([CLS] S1 [SEP] S2 [SEP]). The sequences are fed into the model, which in turn learns the input representations according to the architecture of BERT (Devlin et al., 2019). Any given token (local context word, entity mention, or entity context word) is represented as a summation of three embeddings:

i. Token embedding: refers to the embedding of the corresponding token. We note that specific tokens make the input representations for our model more specialised compared to other fine-tuning tasks. The entity mention tokens are appended at the beginning of S1 and separated from the sentence context tokens by a single vertical bar token |; likewise, for the entity context sequence S2, we prepend the entity title tokens from the KB before adding the description.

ii. Segment embedding: each of the sequences receives a single representation, such that the segment embedding for the local context E_LC refers to the representation of S1, whereas E_EC is the representation of S2.

iii. Position embedding: represents the position of the token in an input sequence. A token appearing at the i-th position in the input sequence is represented with E_i.

To train the model, we use a negative sampling approach similar to Yamada and Shindo (2019). The candidate list is generated for each identified mention.
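Assembling the paired input and the per-candidate labels can be sketched as plain string construction before tokenisation. This is a simplified sketch: the real model operates on WordPiece tokens, and `max_desc_words` is an illustrative truncation parameter:

```python
def build_ed_input(mention, sentence, entity_title, entity_desc, max_desc_words=50):
    """Pair S1 = "mention | sentence" with S2 = "entity title + KB description"."""
    s1 = f"{mention} | {sentence}"
    desc = " ".join(entity_desc.split()[:max_desc_words])  # truncate long descriptions
    s2 = f"{entity_title} {desc}".strip()
    return f"[CLS] {s1} [SEP] {s2} [SEP]"

def ed_examples(mention, sentence, candidates, gold):
    """One sequence-pair example per candidate; gold candidate labelled 1, rest 0."""
    return [(build_ed_input(mention, sentence, title, desc), int(title == gold))
            for title, desc in candidates]

examples = ed_examples(
    "Japan",
    "Late goals give Japan win over Syria",
    [("Japan", "island country in East Asia"),
     ("Japan national football team", "men's association football team representing Japan")],
    gold="Japan national football team",
)
```

Each pair is then classified as positive or negative by the sequence-pair model, mirroring the negative-sampling setup described above.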
The desired entity candidate is labelled as one, and the rest of the incorrect candidates (from the candidate list) are labelled as zero for a given mention. This process iterates over all the mentions identified using Equation 1. The training process fine-tunes BERT using the contextual input from the sentence and Wikipedia, resulting in the WikiBERT model (Equation (3)). The model predicts the relatedness of the two sequences by classifying the pair as either positive or negative.

e_i = WikiBERT(m_i, e_i, ct_i)     (3)

4 Experimental Setup

Datasets: For Wikidata EL, we rely on the T-REx dataset (ElSahar et al., 2018). We adapt the subset of T-REx used by Mulang' et al. (2020) for a fair evaluation setting. The dataset contains 983,257 sentences (786,605 in training and 196,652 in the test set) accommodating 3,133,778 instances of surface forms which are linked to 85,628 distinct Wikidata entities. T-REx does not have a separate validation set to fine-tune the hyperparameters. Therefore, we further divide the training set into a 90:10 ratio for training and validation. For EL over Wikipedia, we adapt the standard CoNLL-AIDA dataset proposed by Hoffart et al. (2011) for training. The dataset contains 18,448 linked mentions in 946 documents, a test set of 4,485 mentions in 231 documents, and a validation set of 4,791 mentions in 216 documents. For testing, we use the AIDA-B (test) dataset from Hoffart et al. (2011) and the MSNBC, AQUAINT, and ACE2004 datasets from Guo and Barbosa (2018).
Baselines: We now briefly explain the Wikidata baselines.

OpenTapioca (Delpeuch, 2020) is a heuristic-based end-to-end approach that depends on topic similarity and mapping coherence for linking Wikidata entities in an input text.

Arjun (Mulang' et al., 2020) is a pipeline of two attentive neural networks employed for MD and ED. Arjun is the SotA, and we take baseline values from Arjun's paper.

Hoffart et al. (2011) build a weighted graph of entity mentions and candidate entities. Then, the model computes a dense subgraph that predicts the best joint mention-entity mapping.

DBpedia Spotlight (Mendes et al., 2011) proposes a probabilistic model and relies on the context of the text to link the entities.

KEA (Steinmetz and Sack, 2013) employs a linguistic pipeline coupled with metadata generated from several Web sources. The candidates are ranked using a heuristic approach.

Babelfy (Moro et al., 2014) is a graph-based approach that uses loose identification of candidate meanings coupled with a densest-subgraph heuristic to link the entities.

Piccinno and Ferragina (2014) focus on mention recognition and annotation pruning to solve entity linking, and propose a voting algorithm for entity candidates using PageRank.

Kolitsas et al. (2018) train the MD and ED tasks jointly using word- and character-level embeddings. The model reuses the candidate set from Ganea and Hofmann (2017) and generates a global voting score to rank the entity candidates.

Peters et al. (2019) induce multiple KBs into a large pretrained BERT model with a knowledge attention mechanism.

Broscheit (2019) trains the MD, CG, and ED tasks jointly using a BERT-based model. Besides, an entity vocabulary containing the 700K most frequent entities in English Wikipedia was utilised.

Févry et al. (2020) consider large-scale pretraining from Wikipedia links as the context for a transformer model to predict KB entities.

In Wikipedia-based experiments, we report values from Févry et al. (2020) and Kolitsas et al. (2018) for the AIDA-B test set. On the MSNBC (MSB), AQUAINT (AQ), and ACE2004 (ACE) test datasets, only Kolitsas et al. (2018), DBpedia Spotlight (Mendes et al., 2011), KEA (Steinmetz and Sack, 2013), and Babelfy (Moro et al., 2014) report values, and we compare against them.
Hyper-parameter        Value
Epochs                 4
Batch size             8
Learning rate          e−
Learning rate decay    linear
Adam                   β1, β2

Table 1: Hyper-parameters used during fine-tuning.
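The linear learning-rate decay listed in Table 1 can be sketched as a simple schedule function; `total_steps`, the warmup-free form, and the example base rate are illustrative, not the exact training configuration:

```python
def linear_decay_lr(step, total_steps, base_lr):
    """Linearly decay the learning rate from base_lr at step 0 to 0 at total_steps."""
    return base_lr * max(0.0, 1.0 - step / float(total_steps))

# Example: rate at the start, middle, and end of a 100-step schedule
lrs = [linear_decay_lr(s, 100, 1e-5) for s in (0, 50, 100)]
```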
We configure the CHOLAN model with various candidate generation approaches, detailed below.

CHOLAN-Wikidata: we train the model on the T-REx dataset and employ the C_Falcon(m) candidate set. The ED model (WikiBERT) is fed with the sentential context but not with entity descriptions, as not all Wikidata entities have a corresponding Wikipedia entity.

CHOLAN-Wiki+FC: is trained on CoNLL-AIDA (Hoffart et al., 2011). For the CG step, we employ the Falcon candidate set C_Falcon(m). Here, the ED model (WikiBERT) is only fed with the sentential context.

CHOLAN-Wiki+DCA: we train the MD and ED models on CoNLL-AIDA. The CG step involves the DCA candidate set C(m). During the ED step (WikiBERT), the Wikipedia description associated with each entity is fed along with the sentential context.

CHOLAN: inherits CHOLAN-Wiki+FC, but in addition, Wikipedia entity descriptions are induced into the ED model (WikiBERT).
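The four configurations differ only in the training data, the candidate source, and the contexts fed to WikiBERT; as a summary sketch, they can be encoded as feature flags (the dictionary keys are illustrative):

```python
# Training data, candidate source, and context flags per configuration (as described above)
CONFIGS = {
    "CHOLAN-Wikidata": {"train": "T-REx",      "candidates": "Falcon", "sentence_ctx": True, "entity_desc": False},
    "CHOLAN-Wiki+FC":  {"train": "CoNLL-AIDA", "candidates": "Falcon", "sentence_ctx": True, "entity_desc": False},
    "CHOLAN-Wiki+DCA": {"train": "CoNLL-AIDA", "candidates": "DCA",    "sentence_ctx": True, "entity_desc": True},
    "CHOLAN":          {"train": "CoNLL-AIDA", "candidates": "Falcon", "sentence_ctx": True, "entity_desc": True},
}
```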
Metrics: In Wikidata-based experiments, we employ the standard metrics of precision (P), recall (R), and F-score (F), the same as Mulang' et al. (2020). For Wikipedia-based datasets, we use the Micro-F1 score in the strong matching setting (Kolitsas et al., 2018). Strong matching requires exactly predicting the gold mention (i.e., target entity mention) boundaries and its corresponding entity annotation in the KB. To compare the recall of the two CG approaches, we report the gold recall. Gold recall is the percentage of entity mentions for which the candidate set contains the ground-truth entity (Yao et al., 2019).

Implementation: We have implemented all our models in PyTorch (https://pytorch.org/) and optimised them using Adam (Kingma and Ba, 2015). We used the pre-trained BERT models from the Transformers library (Wolf et al., 2019). We ran all the experiments on a single GeForce GTX 1080 Ti GPU with 11GB memory. Table 1 outlines the hyper-parameters used in fine-tuning on both datasets. We followed the standard settings suggested by Devlin et al. (2019). The average run time is 9.31 hours/epoch for CHOLAN and 7.23 hours/epoch without descriptions.

5 Results

We study the following research question: what is the impact of each sub-task (aka component) on the overall outcome of the transformer-based entity linking approach? We further investigate a sub-research question: how do the external context and the candidate generation step impact the overall performance of CHOLAN? Each of our experiments systematically studies these research questions in different settings.
Model            P    R    F
Delpeuch 2020    40.7

Table 2: Comparison on the T-REx test set for Wikidata EL. Best values in bold.

Table 2 summarises CHOLAN's performance on the T-REx dataset. The CHOLAN-Wikidata configuration outperforms the baselines. We dig deeper into our reported values. We observe that for the MD task, our F-score is 94.3 (compared to a 77 F-score for Arjun (Mulang' et al., 2020)). However, the gold recall for the CG step is 81.2. We generate the entity candidates using an information retrieval approach (the BM25† algorithm) to get the top 30 candidates based on the confidence score. The Wikidata KG is challenging, and many labels share the same name, which contributes to a large loss in the F-score at the CG step. For instance, the entity mention "National Highway" matches exactly with four Wikidata ID labels, while 2,055 other entities contain the full mention in their labels. Please note that we did not retrain Kolitsas et al. (2018) (SOTA on Wikipedia EL) on the T-REx dataset, since we determined that the model is tightly coupled to and relies on the pre-computed Wikipedia candidate list from Ganea and Hofmann (2017).

We study the impact of the local context on the performance of CHOLAN. Therefore, we exclude the sentence as input to the ED step at training and testing time. Hence, the inputs to the ED model are only the entity mention and the entity candidates obtained from the CG step. We observe that the performance drops when the local sentential context is not fed (cf. Table 3). This justifies our choice of feeding the sentence to the model during the ED task.
Model                     P     R     F
CHOLAN-Wikidata           75    76    75.4
CHOLAN-Wikidata (WLC†)    72    73.5  72.7

Table 3: Ablation study on the T-REx test set for Wikidata EL. Best values in bold. WLC† denotes the model without local context. When the local sentential context is excluded from ED, the performance drops.

Table 4 reports the performance of CHOLAN's configurations on the AIDA-B test set. The first configuration is CHOLAN-Wiki+FC, in which the MD and ED models are trained using CoNLL-AIDA. We notice a clear jump in performance. We then replaced the Falcon candidate list C_Falcon(m) with the DCA candidates C(m), resulting in CHOLAN-Wiki+DCA. In the DCA candidates, the description of each entity is attached. The performance increases when additional background knowledge in the form of an entity description is fed. Our next configuration is CHOLAN, where we attach Wikipedia entity descriptions to the Falcon candidate list C_Falcon(m) (as a modification of CHOLAN-Wiki+FC). This setting outperforms all the existing baselines and previous CHOLAN configurations. Our experiments illustrate the impact of the CG step and background knowledge on end-to-end EL performance. The improvement of CHOLAN continues on the other three test datasets, where the jump is significantly higher compared to the baselines (cf. Table 5). The reported values in Table 5 also confirm the transferability of CHOLAN in cross-domain experiments.

Model                          Micro F1
Hoffart et al. 2011            72.8
Mendes et al. 2011             57.8
Steinmetz and Sack 2013        42.3
Moro et al. 2014               48.5
Piccinno and Ferragina 2014    73
Kolitsas et al. 2018           82.4
Peters et al. 2019             73.7
Broscheit 2019                 79.3
Févry et al. 2020              76.7
CHOLAN-Wiki+FC                 75.1
CHOLAN-Wiki+DCA                77.5
CHOLAN

Table 4: Comparison on AIDA-B. Best value in bold and previous SOTA value underlined.

We conducted three ablation studies to understand the behaviour of CHOLAN's configurations over the Wikipedia datasets. The first study
is to calculate the gold recall values for the various datasets. CHOLAN uses candidates from the C_Falcon(m) candidate set for each entity mention. While generating the candidate set from the local KG of Sakor et al. (2019), we observe a drop in the gold recall, as reported in Table 6. CG plays a crucial role in trading off precision and recall. We conclude that more robust CG approaches would likely improve overall performance. The second ablation study calculates the performance of our configurations for the ED step, i.e., running WikiBERT in isolation. Here, we assume that all entities are correctly recognised; thus, the focus of the study is the ED model. We report the impact of the various candidate generation approaches on the ED model in Table 7. The significant jump in performance from CHOLAN-Wiki+FC to CHOLAN is attributable to the additional background knowledge provided in CHOLAN as entity candidate descriptions. The third ablation study tests the impact of the sentential context fed into two configurations on a Wikipedia dataset. Table 8 reports the performance achieved after excluding the sentence as additional context. As expected, the performance decreases. The model shows similar behaviour on T-REx in Table 3. These observations confirm our hypothesis that the ED model is enhanced by additional contexts.

Model                      MSB    AQ    ACE
Mendes et al. 2011         40.6   45.2  60.5
Steinmetz and Sack 2013    30.9   35.9  40.3
Moro et al. 2014           39.7   35.8  17.8
Kolitsas et al. 2018       72.4   40.4  68.3
CHOLAN-Wiki+FC             77.8   70    85.7
CHOLAN-Wiki+DCA            78.3   75.9  71.3
CHOLAN

Table 5: Micro F1 scores from the comparative study over three out-of-domain datasets. The model is trained on the CoNLL-AIDA dataset. Best value in bold and previous SOTA value underlined.

Model                AIDA-B   MSB    AQ     ACE
Falcon Candidates    94       93.8   85.3   97.3
DCA Candidates       98.3     98.5   94.2   90.6

Table 6: Gold recall of the candidate generation techniques over the Wikipedia test datasets.
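Gold recall, as reported in Table 6, is a per-mention set-membership check; a sketch, with an illustrative data format of (gold entity, candidate set) pairs:

```python
def gold_recall(examples):
    """Share of mentions whose candidate set contains the ground-truth entity."""
    hits = sum(1 for gold, candidates in examples if gold in candidates)
    return hits / len(examples) if examples else 0.0

examples = [("Q33", ["Q33", "Q1", "Q2"]),   # gold entity retrieved
            ("Q17", ["Q5", "Q6"])]          # gold entity missed
recall = gold_recall(examples)
```

This metric upper-bounds end-to-end EL performance: an entity absent from the candidate set can never be linked by the ED step.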
Model                   Micro F1
Kolitsas et al. 2018    83.8
CHOLAN-Wiki+FC          78.4
CHOLAN-Wiki+DCA         79.1
CHOLAN

Table 7: Comparison on AIDA-B for ED. Best score in bold and previous SOTA value underlined.
Model                      Micro F1
CHOLAN-Wiki+DCA            77.5
CHOLAN-Wiki+DCA (WLC†)     71.2
CHOLAN
CHOLAN (WLC†)              79.6

Table 8: Ablation study on AIDA-B. We observe that when the local sentential context is removed from the ED step, the performance drops. Best values in bold. WLC† denotes the model without local context.

6 Conclusion

In the last two years, the NLP research community has extensively tried transformer-based models for the EL task. However, their performance remained lower than Kolitsas et al. (2018). This paper combines the traditional software engineering principle of modular architecture with context-induced transformers to effectively solve the EL task. Our reason to deviate from an end-to-end architecture was to provide full flexibility to our system in terms of the candidate generation list, the underlying KG, and the induction of context at the ED step. We attribute CHOLAN's outperformance to the following reasons: 1) the modular architecture, which brings flexibility and interoperability, as CHOLAN can treat each task independently. Kolitsas et al. (2018) report that shifting towards joint modelling of the MD and ED tasks helps mitigate error propagation from MD to ED. However, the performance of BERT_BASE for the MD task is significantly high (92.3 F1 on AIDA-B and 94.3 F1 on T-REx, calculated by us), remarkably reducing the errors in MD. CHOLAN leverages this capability in the MD subtask, placing more focus on the CG and ED tasks. 2) The flexibility of the architecture further permits us to induce sentences and entity descriptions as additional contexts. Furthermore, using candidate lists in a plug-and-play manner has resulted in a significant increase in performance. In earlier transformer approaches, the implementation is monolithic and context is not utilised. There is scope for improvement in our approach: Wu et al. (2019a) introduce a novel CG method that retrieves candidates in a dense space defined by a bi-encoder, which could serve as an alternate CG approach. We aim to scale CHOLAN to multilingual entity linking as a viable next step.
References
Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007.
Kurt D. Bollacker, Robert P. Cook, and Patrick Tufts. 2007. Freebase: A shared database of structured general human knowledge. In AAAI 2007.
Samuel Broscheit. 2019. Investigating entity knowledge in BERT with simple neural end-to-end entity linking. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 677–685.
Yixin Cao, Lei Hou, Juanzi Li, and Zhiyuan Liu. 2018. Neural collective entity linking. In Proceedings of the 27th International Conference on Computational Linguistics, pages 675–686.
Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge text and knowledge by learning multi-prototype entity mention embedding. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1623–1633.
Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Salvatore Trani. 2013. Dexter: An open source framework for entity linking. In Proceedings of the Sixth International Workshop on Exploiting Semantic Annotations in Information Retrieval, pages 17–20.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167.
Marco Cornolti, Paolo Ferragina, Massimiliano Ciaramita, Stefan Rüd, and Hinrich Schütze. 2016. A piggyback system for joint entity mention detection and linking in web queries. In Proceedings of the 25th International Conference on World Wide Web, pages 567–578.
Antonin Delpeuch. 2020. OpenTapioca: Lightweight entity linking for Wikidata. The 1st Wikidata Workshop co-located with International Semantic Web Conference 2020 (to appear).
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490.
Hady ElSahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon S. Hare, Frédérique Laforest, and Elena Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In LREC.
Zheng Fang, Yanan Cao, Qian Li, Dongjie Zhang, Zhenyu Zhang, and Yanbing Liu. 2019. Joint entity linking with deep reinforcement learning. In The World Wide Web Conference, pages 438–447.
Thibault Févry, Nicholas FitzGerald, Livio Baldini Soares, and Tom Kwiatkowski. 2020. Empirical evaluation of pretraining strategies for supervised entity linking. In Automated Knowledge Base Construction.
Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2619–2629.
Ralph Grishman and Beth M. Sundheim. 1996. Message Understanding Conference-6: A brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.
Zhaochen Guo and Denilson Barbosa. 2018. Robust named entity disambiguation with random walks. Semantic Web, 9(4):459–479.
Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, a meeting of SIGDAT, a Special Interest Group of the ACL, pages 782–792.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015.
Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann. 2018. End-to-end neural entity linking. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 519–529.
Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3449–3460.
Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: Shedding light on the web of documents. In I-SEMANTICS.
Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics, 2:231–244.
Isaiah Onando Mulang', Kuldeep Singh, Akhilesh Vyas, Saeedeh Shekarpour, Ahmad Sakor, Maria Esther Vidal, Sören Auer, and Jens Lehmann. 2020. Encoding knowledge graph entity aliases in attentive neural networks for Wikidata entity linking. Web Information Systems Engineering.
Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. 2016. J-NERD: Joint named entity recognition and disambiguation with rich linguistic features. Transactions of the Association for Computational Linguistics, 4:215–229.
Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54.
Francesco Piccinno and Paolo Ferragina. 2014. From TagME to WAT: A new entity annotator. In Proceedings of the First International Workshop on Entity Recognition & Disambiguation, pages 55–62.
Tim Rocktäschel, Torsten Huber, Michael Weidlich, and Ulf Leser. 2013. WBI-NER: The impact of domain-specific features on the performance of identifying and classifying mentions of drugs. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 356–363.
Ahmad Sakor, Isaiah Onando Mulang', Kuldeep Singh, Saeedeh Shekarpour, Maria-Esther Vidal, Jens Lehmann, and Sören Auer. 2019. Old is gold: Linguistic driven approach for entity and relation linking of short text. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 2336–2346. Association for Computational Linguistics.
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. CoRR, cs.CL/0306050.
Özge Sevgili, Artem Shelmanov, Mikhail Arkhipov, Alexander Panchenko, and Chris Biemann. 2020. Neural entity linking: A survey of models based on deep learning.
Valentin I. Spitkovsky and Angel X. Chang. 2012. A cross-lingual dictionary for English Wikipedia concepts.
Nadine Steinmetz and Harald Sack. 2013. Semantic multimedia information retrieval based on contextual descriptions. In Extended Semantic Web Conference, pages 382–396. Springer.
Denny Vrandecic. 2012. Wikidata: A new platform for collaborative data collection. In Proceedings of the 21st World Wide Web Conference, WWW 2012, Lyon, France, April 16-20, 2012 (Companion Volume), pages 1063–1064.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771v5.
Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2019a. Zero-shot entity linking with dense entity retrieval. arXiv preprint arXiv:1911.03814.
Shanchan Wu, Kai Fan, and Qiong Zhang. 2019b. Improving distantly supervised relation extraction with neural noise converter and conditional optimal selector. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pages 7273–7280. AAAI Press.
Vikas Yadav and Steven Bethard. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2145–2158.
Ikuya Yamada and Hiroyuki Shindo. 2019. Pre-training of deep contextualized embeddings of words and entities for named entity disambiguation. arXiv preprint arXiv:1909.00426.
Xiyuan Yang, Xiaotao Gu, Sheng Lin, Siliang Tang, Yueting Zhuang, Fei Wu, Zhigang Chen, Guoping Hu, and Xiang Ren. 2019. Learning dynamic context augmentation for global entity linking. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 271–281.
Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. KG-BERT: BERT for knowledge graph completion. arXiv preprint arXiv:1909.03193.
Stefan Zwicklbauer, Christin Seifert, and Michael Granitzer. 2016. Robust and collective entity disambiguation through semantic embeddings. In