Scalable Zero-shot Entity Linking with Dense Entity Retrieval
Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, Luke Zettlemoyer
Facebook AI Research; University College London; École Polytechnique Fédérale de Lausanne
{ledell, fabiopetroni, sriedel, lsz}@fb.com, [email protected]
Abstract
This paper introduces a conceptually simple, scalable, and highly effective BERT-based entity linking model, along with an extensive evaluation of its accuracy-speed trade-off. We present a two-stage zero-shot linking algorithm, where each entity is defined only by a short textual description. The first stage does retrieval in a dense space defined by a bi-encoder that independently embeds the mention context and the entity descriptions. Each candidate is then re-ranked with a cross-encoder that concatenates the mention and entity text. Experiments demonstrate that this approach is state of the art on recent zero-shot benchmarks (6 point absolute gains) and also on more established non-zero-shot evaluations (e.g. TACKBP-2010), despite its relative simplicity (e.g. no explicit entity embeddings or manually engineered mention tables). We also show that bi-encoder linking is very fast with nearest neighbour search (e.g. linking with 5.9 million candidates in 2 milliseconds), and that much of the accuracy gain from the more expensive cross-encoder can be transferred to the bi-encoder via knowledge distillation. Our code and models are available at https://github.com/facebookresearch/BLINK.

∗ Work done during internship with Facebook.

Introduction

Scale is a key challenge for entity linking; there are millions of possible entities to consider for each mention. To efficiently filter or rank the candidates, existing methods use different sources of external information, including manually curated mention tables (Ganea and Hofmann, 2017), incoming Wikipedia link popularity (Yamada et al., 2016), and gold Wikipedia entity categories (Gillick et al., 2019). In this paper, we show that BERT-based models set new state-of-the-art performance levels for large scale entity linking when used in a zero-shot setup, where there is no external knowledge and a short text description provides the only information we have for each entity.
We also present an extensive evaluation of the accuracy-speed trade-off inherent to large pre-trained models, and show it is possible to achieve very efficient linking with modest loss of accuracy.

More specifically, we introduce a two-stage approach for zero-shot linking (see Figure 1 for an overview), based on fine-tuned BERT architectures (Devlin et al., 2019). In the first stage, we do retrieval in a dense space defined by a bi-encoder that independently embeds the mention context and the entity descriptions (Humeau et al., 2019; Gillick et al., 2019). Each retrieved candidate is then examined more carefully with a cross-encoder that concatenates the mention and entity text, following Logeswaran et al. (2019). This overall approach is conceptually simple but highly effective, as we show through detailed experiments.

Our two-stage approach achieves a new state-of-the-art result on TACKBP-2010, with an over 30% relative error reduction. By simply reading the provided text descriptions, we are able to outperform previous methods that included many extra cues such as entity name dictionaries and link popularity. We also improve the state of the art on existing zero-shot benchmarks, including a nearly 6 point absolute gain on the recently introduced Wikia corpus (Logeswaran et al., 2019) and a more than 7 point absolute gain on WikilinksNED Unseen-Mentions (Onoe and Durrett, 2019).

Finally, we do an extensive evaluation of the accuracy-speed trade-off inherent in our bi- and cross-encoder models. We show that the two-stage method scales well in a full Wikipedia setting, by linking against all 5.9M Wikipedia entities for TACKBP-2010, while still outperforming existing models with much smaller candidate sets.
Figure 1: High-level description of our zero-shot entity linking solution. From the top left, the input gets encoded in the same dense space where all entity representations lie. A nearest neighbor search is then performed (depicted with a blue circle), k entities retrieved and supplied to the cross-encoder. The latter attends over both the input text and the entity descriptions to produce a probability distribution over the candidates.

We also show that bi-encoder linking is very fast with approximate nearest neighbor search (e.g. linking over 5.9 million candidates in 2 milliseconds), and that much of the accuracy gain from the more expensive cross-encoder can be transferred to the bi-encoder via knowledge distillation. We release our code and models, as well as a system to link entity mentions to all of Wikipedia (similar to TagME (Ferragina and Scaiella, 2011)).

Related Work

We follow most recent work in studying entity linking with gold mentions (Kolitsas et al. (2018) study end-to-end linking; our techniques should be applicable to this setting as well, but we leave this exploration to future work). The entity linking task can be broken into two steps: candidate generation and ranking. Prior work has used frequency information, alias tables, and TF-IDF-based methods for candidate generation. For candidate ranking, He et al. (2013), Sun et al. (2015), Yamada et al. (2016), Ganea and Hofmann (2017), and Kolitsas et al. (2018) have established state-of-the-art results using neural networks to model context words, spans, and entities. There is also recent work demonstrating that fine-grained entity typing information helps linking (Raiman and Raiman, 2018; Onoe and Durrett, 2019; Khalife and Vazirgiannis, 2018).

Two recent results are most closely related to our work. Logeswaran et al. (2019) proposed the zero-shot entity linking task. They use cross-encoders for entity ranking, but rely on traditional IR techniques for candidate generation and did not evaluate on large-scale benchmarks such as TACKBP. Gillick et al. (2019) show that dense embeddings work well for candidate generation, but they did not do pre-training and included external category labels in their bi-encoder architectures, limiting their linking to entities in Wikipedia. Our approach can be seen as generalizing both of these lines of work, and showing for the first time that pre-trained zero-shot architectures are both highly accurate and computationally efficient at scale.

Humeau et al. (2019) studied different architectures that use deep pre-trained bidirectional transformers, and performed a detailed comparison of three architectures, namely the bi-encoder, poly-encoder, and cross-encoder, on tasks of sentence selection in dialogues. Inspired by their work, we apply similar architectures to the problem of entity linking and, in addition, demonstrate that the bi-encoder can be a strong model for retrieval. Instead of using the poly-encoder as a trade-off between the cross-encoder and the bi-encoder, we propose to train a bi-encoder model with knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015) from a cross-encoder model to further improve the bi-encoder's performance.
Entity Linking
Given an input text document D = {w_1, ..., w_r} and a list of entity mentions M_D = {m_1, ..., m_n}, the output of an entity linking model is a list of mention-entity pairs {(m_i, e_i)}_{i in [1, n]}, where each entity is an entry in a knowledge base (KB) (e.g. Wikipedia), e ∈ E. We assume that the title and description of the entities are available, which is a common setting in entity linking (Ganea and Hofmann, 2017; Logeswaran et al., 2019). We also assume each mention has a valid gold entity in the KB, which is usually referred to as in-KB evaluation. We leave out-of-KB prediction (i.e. nil prediction) to future work.

Zero-shot Entity Linking
We also study zero-shot entity linking (Logeswaran et al., 2019). Here the document setup is the same, but the knowledge base is separate at training and test time. Formally, denoting E_train and E_test as the knowledge bases at training and test time, we require E_train ∩ E_test = ∅. The sets of text documents, mentions, and entity dictionaries are separate in training and test, so that the entities being linked at test time are unseen.

Methodology

Figure 1 shows our overall approach. The bi-encoder uses two independent BERT transformers to encode context/mention and entity into dense vectors, and each entity candidate is scored as the dot product of these vectors. The candidates retrieved by the bi-encoder are then passed to the cross-encoder for ranking. The cross-encoder encodes context/mention and entity in one transformer, and applies an additional linear layer to compute the final score for each pair.
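This retrieve-then-rerank pipeline can be sketched as follows. The toy encoders, entity names, and vectors below are invented stand-ins for illustration; the real system uses BERT encoders and FAISS for the nearest-neighbor step.

```python
# Stage 1: score every cached entity vector by dot product, keep the k best.
# Stage 2: apply the expensive joint scorer only to those k candidates.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(mention_vec, entity_vecs, k):
    """Bi-encoder retrieval over precomputed entity vectors."""
    scored = sorted(entity_vecs.items(), key=lambda kv: -dot(mention_vec, kv[1]))
    return [name for name, _ in scored[:k]]

def rerank(mention_text, candidates, cross_score):
    """Cross-encoder re-ranking of the small candidate set."""
    return max(candidates, key=lambda e: cross_score(mention_text, e))

# Toy 3-dimensional "embeddings" for three entities.
entity_vecs = {
    "Jaguar_(car)": [0.9, 0.1, 0.0],
    "Jaguar_(animal)": [0.1, 0.9, 0.0],
    "Jaguar!_(coaster)": [0.4, 0.4, 0.2],
}
mention_vec = [0.8, 0.2, 0.1]

candidates = retrieve_top_k(mention_vec, entity_vecs, k=2)
# Stand-in cross-encoder: favours the car sense for this toy mention.
best = rerank("a ride in the Jaguar", candidates,
              cross_score=lambda m, e: 1.0 if "car" in e else 0.0)
```

The key property is that the cached entity vectors make stage 1 a pure nearest-neighbor lookup, so the expensive joint scorer only ever sees k candidates.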
We use a bi-encoder architecture similar to that of Humeau et al. (2019) to model (mention, entity) pairs. This approach allows for fast, real-time inference, as the candidate representations can be cached. Both the input context and the candidate entity are encoded into vectors:

y_m = red(T_1(τ_m))   (1)
y_e = red(T_2(τ_e))   (2)

where τ_m and τ_e are input representations of the mention and entity respectively, and T_1 and T_2 are two transformers. red(.) is a function that reduces the sequence of vectors produced by the transformers into one vector. Following the experiments in Humeau et al. (2019), we choose red(.) to be the last-layer output at the [CLS] token.

Context and Mention Modeling
The representation of context and mention, τ_m, is composed of the word-piece tokens of the context surrounding the mention and of the mention itself. Specifically, we construct the input for each mention example as:

[CLS] ctxt_l [M_s] mention [M_e] ctxt_r [SEP]

where mention, ctxt_l, and ctxt_r are the word-piece tokens of the mention and of the context before and after the mention respectively, and [M_s], [M_e] are special tokens that tag the mention. The maximum length of the input representation is a hyperparameter in our model, and we find that a small value such as 32 works well in practice (see Appendix A).

Entity Modeling
The entity representation τ_e is also composed of the word-pieces of the entity title and description (for Wikipedia entities, we use the first ten sentences as the description). The input to our entity model is:

[CLS] title [ENT] description [SEP]

where title and description are the word-piece tokens of the entity title and description, and [ENT] is a special token that separates the title and description representations.

Scoring
The score of entity candidate e_i is given by the dot product:

s(m, e_i) = y_m · y_{e_i}   (3)

Optimization
The network is trained to maximize the score of the correct entity with respect to the (randomly sampled) entities of the same batch (Lerer et al., 2019; Humeau et al., 2019). Concretely, for each training pair (m_i, e_i) in a batch of B pairs, the loss is computed as:

L(m_i, e_i) = −s(m_i, e_i) + log Σ_{j=1}^{B} exp(s(m_i, e_j))   (4)

Lerer et al. (2019) presented a detailed analysis of the speed and memory efficiency of using batched random negatives in large-scale systems. In addition to in-batch negatives, we follow Gillick et al. (2019) in using hard negatives during training. The hard negatives are obtained by finding the top 10 predicted entities for each training example. We add these extra hard negatives to the random in-batch negatives.

Inference

At inference time, the entity representations for all entity candidates can be precomputed and cached. The inference task is then reduced to finding the maximum dot product between the mention representation and the entity candidate representations. In Section 5.2.3 we present efficiency/accuracy trade-offs of exact and approximate nearest neighbor search using FAISS (Johnson et al., 2019) in a large-scale setting.
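The bi-encoder input layout and the in-batch objective of Equations 1-4 can be sketched as below. Plain Python lists stand in for the [CLS] reductions y_m and y_e that the two BERT transformers would produce, and the toy batch is invented for illustration.

```python
import math

def mention_input(ctxt_l, mention, ctxt_r):
    """Mention layout from Section 4.1: [CLS] ctxt_l [Ms] mention [Me] ctxt_r [SEP]."""
    return ["[CLS]", *ctxt_l, "[Ms]", *mention, "[Me]", *ctxt_r, "[SEP]"]

def score(y_m, y_e):
    """Equation 3: dot-product score s(m, e)."""
    return sum(a * b for a, b in zip(y_m, y_e))

def in_batch_loss(mention_vecs, entity_vecs):
    """Equation 4, averaged over the batch: each mention's gold entity
    competes against the other gold entities in the batch as negatives."""
    total = 0.0
    for i, y_m in enumerate(mention_vecs):
        scores = [score(y_m, y_e) for y_e in entity_vecs]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += -scores[i] + log_z  # -s(m_i, e_i) + log sum_j exp(s(m_i, e_j))
    return total / len(mention_vecs)

tokens = mention_input(["a", "ride", "in", "the"], ["Jaguar"], ["!"])
# Toy batch of B = 2 (mention, entity) embedding pairs.
loss = in_batch_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In the real system the extra hard negatives are simply appended to `entity_vecs`, so each mention's softmax runs over the batch entities plus its top-10 mined negatives.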
Cross-encoder

Our cross-encoder is similar to the ones described by Logeswaran et al. (2019) and Humeau et al. (2019). The input is the concatenation of the context and mention representation and the entity representation described in Section 4.1 (we remove the [CLS] token from the entity representation). This allows the model to have deep cross-attention between the context and the entity descriptions. Formally, we use y_{m,e} to denote our context-candidate embedding:

y_{m,e} = red(T_cross(τ_{m,e}))   (5)

where τ_{m,e} is the input representation of mention and entity, T_cross is a transformer, and red(.) is the same function as defined in Section 4.1.

Scoring
To score entity candidates, a linear layer W is applied to the embedding y_{m,e}:

s_cross(m, e) = y_{m,e} W   (6)

Optimization
Similar to the methods in Section 4.1, the network is trained using a softmax loss to maximize s_cross(m_i, e_i) for the correct entity, given a set of entity candidates (as in Equation 4). Due to its larger memory and compute footprint, we use the cross-encoder in a re-ranking stage, over a small set of candidates retrieved with the bi-encoder. The cross-encoder is not suitable for retrieval or tasks that require fast inference.

Knowledge Distillation

To better optimize the accuracy-speed trade-off, we also report knowledge distillation experiments that use a cross-encoder as a teacher for a bi-encoder model. We follow Hinton et al. (2015) in using a softmax with temperature, where the target distribution is based on the cross-encoder logits. Concretely, let z be a vector of logits for a set of entity candidates and T a temperature, and let σ(z, T) be a (tempered) distribution over the entities with

σ(z, T)_i = exp(z_i / T) / Σ_j exp(z_j / T)   (7)

Then the overall loss function, incorporating both distillation and student losses, is calculated as

L_dist = H(σ(z_t; τ), σ(z_s; τ))   (8)
L_st = H(e, σ(z_s; 1))   (9)
L = α · L_st + (1 − α) · L_dist   (10)

where e is the ground-truth label distribution with probability 1 for the gold entity, H is the cross-entropy loss function, and α is a coefficient for mixing the distillation and student losses. The student logits z_s are the output of the bi-encoder scoring function s(m, e_i), and the teacher logits are the output of the cross-encoder scoring function s_cross(m, e).

Experiments

In this section, we perform an empirical study of our model on three challenging datasets.

Zero-shot EL was constructed by Logeswaran et al. (2019) from Wikia. The task is to link entity mentions in text to an entity dictionary with provided entity descriptions, in a set of domains.
There are 49K, 10K, and 10K examples in the train, validation, and test sets respectively. The entities in the validation and test sets are from different domains than the train set, allowing for evaluation of performance on entirely unseen entities. The entity dictionaries cover different domains and range in size from 10K to 100K entities.
TACKBP-2010 is widely used for evaluating entity linking systems (Ji et al., 2010). Following prior work, we measure in-KB accuracy (P@1). There are 1,074 and 1,020 annotated mention/entity pairs derived from 1,453 and 2,231 original news and web documents in the training and evaluation datasets, respectively. All the entities are from the TAC Reference Knowledge Base, which contains 818,741 entities with titles, descriptions, and other meta information (https://tac.nist.gov).

Method             Train  Validation  Test
BM25               76.86  76.22       69.13
Ours (bi-encoder)  93.12  91.44       82.06
Table 1: Recall@64 (%) on the Zero-shot EL dataset, for the BM25 approach and our dense-space bi-encoder-based retrieval. Results on the train/validation/test sets are reported.
WikilinksNED Unseen-Mentions was created by Onoe and Durrett (2019) from the original WikilinksNED dataset (Eshel et al., 2017), which contains a diverse set of ambiguous entities spanning a variety of domains. In the Unseen-Mentions version, no mentions in the validation and test sets appear in the training set. The train, validation, and test sets contain 2.2M, 10K, and 10K examples respectively. In this setting, the definition of unseen mentions is different from that in zero-shot entity linking: entities in the test set can be seen in the training set. However, in both definitions, no (mention, entity) pairs from the test set are observed in the training set. In the unseen-mentions test set, a fraction of the entities also appear in the training set.
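The unseen-mentions constraint described above can be sketched as a simple filter: keep a held-out example only if its mention surface form never occurs in the training set (entities, unlike in zero-shot linking, may repeat). The data below are invented for illustration, and the original construction by Onoe and Durrett may differ in details.

```python
# Filter a held-out set so that no mention string was seen in training.
def unseen_mention_split(train_examples, held_out_examples):
    train_mentions = {m for m, _entity in train_examples}
    return [(m, e) for m, e in held_out_examples if m not in train_mentions]

train = [("Jaguar", "Jaguar_(car)"), ("Big Apple", "New_York_City")]
held_out = [
    ("Jaguar", "Jaguar_(animal)"),  # mention seen in training -> dropped
    ("Gotham", "New_York_City"),    # unseen mention, seen entity -> kept
]

test_set = unseen_mention_split(train, held_out)
```

Note that the second kept example shares its entity with a training example; only the (mention, entity) pair, via the mention string, is guaranteed new.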
We experiment with both BERT-base and BERT-large (Devlin et al., 2019) for our bi-encoders and cross-encoders. Details of the training infrastructure and hyperparameters can be found in Appendix A. All models are implemented in PyTorch (https://pytorch.org) and optimized with Adam (Kingma and Ba, 2014). We use (base) and (large) to indicate the version of our model where the underlying pretrained transformer is BERT-base and BERT-large, respectively.

First, we train our bi-encoder on the training set, initializing each encoder with pre-trained BERT-base. Hyperparameters are chosen based on Recall@64 on the validation dataset; for specifics, see Appendix A.2. Our bi-encoder achieves much higher recall than BM25, as shown in Figure 2. Following Logeswaran et al. (2019), we use the top 64 retrieved candidates for the ranker, and we report Recall@64 on train, validation, and test in Table 1.

Figure 2: Top-k entity retrieval recall on the validation set of the Zero-shot EL dataset, comparing BM25 and our bi-encoder.

After training the bi-encoder for candidate generation, we train our cross-encoder (initialized with pre-trained BERT) on the top 64 candidates retrieved by the bi-encoder for each sample in the training set, and evaluate the cross-encoder on the test dataset. Overall, we obtain much better end-to-end accuracy, as shown in Table 2, largely due to the improvement in the retrieval stage.

Method                              U.Acc.
Logeswaran et al. (2019)            55.08
Logeswaran et al. (2019) (domain)†

Table 2: Performance on test domains of the Zero-shot EL dataset. U.Acc. represents the unnormalized accuracy. † indicates a model trained with domain-adaptive pre-training on the source and target domains. Average performance across a set of worlds is computed by macro-averaging.

We also report cross-encoder performance with the same retrieval method (BM25) used by Logeswaran et al. (2019) in Table 3, where performance is evaluated on the subset of test instances for which the gold entity is among the top 64 candidates retrieved by BM25. We observe that our cross-encoder obtains slightly better results than reported by Logeswaran et al. (2019), likely due to implementation and hyper-parameter details.
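The Recall@k metric used in these retrieval comparisons is the fraction of mentions whose gold entity appears among the top-k retrieved candidates. A minimal sketch, with toy candidate lists and gold labels invented for illustration:

```python
# Recall@k: did the gold entity land in the first k retrieved candidates?
def recall_at_k(retrieved, gold, k):
    hits = sum(1 for cands, g in zip(retrieved, gold) if g in cands[:k])
    return hits / len(gold)

# One ranked candidate list per mention, plus the gold entity for each.
retrieved = [["e1", "e2", "e3"], ["e9", "e4", "e7"], ["e5", "e6", "e2"]]
gold = ["e2", "e7", "e8"]

r_at_2 = recall_at_k(retrieved, gold, k=2)
r_at_3 = recall_at_k(retrieved, gold, k=3)
```

Recall@k upper-bounds the end-to-end accuracy of the two-stage system, since the re-ranker can only choose among the k retrieved candidates.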
Following prior work (Sun et al., 2015; Cao et al., 2018; Gillick et al., 2019; Onoe and Durrett, 2019), we pre-train our models on Wikipedia data. Data and model training details can be found in Appendix A.1.

Method   Valid  Test
TF-IDF†

Table 3: Normalized accuracy on the validation and test sets of Zero-shot EL, where performance is evaluated on the subset of test instances for which the gold entity is among the top-k candidates retrieved during candidate generation. † indicates methods re-implemented by Logeswaran et al. (2019).

After training our model on Wikipedia, we fine-tune it on the TACKBP-2010 training dataset. We use the top 100 candidates retrieved by the bi-encoder as training examples for the cross-encoder, and choose hyper-parameters based on cross-validation. We report accuracy results in Table 4. For ablation studies, we also report the following versions of our model:

1. Bi-encoder only: we use the bi-encoder for candidate ranking instead of the cross-encoder.
2. Full Wikipedia: we use 5.9M Wikipedia articles as our entity knowledge base, instead of the TACKBP Reference Knowledge Base.
3. Full Wikipedia w/o finetune: same as above, without fine-tuning on the TACKBP-2010 training set.

As expected, the cross-encoder performs better than the bi-encoder at ranking. However, both models exceed state-of-the-art performance levels, demonstrating that the overall approach is highly effective. We observe that our model also performs well when we change the underlying knowledge base to full Wikipedia, and even without fine-tuning on the dataset. In Table 5 we show that our bi-encoder model is highly effective at retrieving relevant entities when the underlying knowledge base is full Wikipedia.

Method             Accuracy
He et al. (2013)   81.0
Sun et al. (2015)  83.9
Yamada et al. (2016)†

Table 4: Accuracy scores of our proposed model and models from prior work on TACKBP-2010. † indicates methods doing global resolution of all mentions in a document. Our work focuses on local resolution, where each mention is modeled independently.

Method     Recall@100
AT-Prior†

Table 5: Retrieval evaluation comparison for TACKBP-2010. † indicates alias table and BM25 baselines implemented by Gillick et al. (2019). AT-Prior: alias table ordered by prior probabilities; AT-Ext: alias table extended with heuristics.

There are, however, many other cues that could potentially be added in future work. For example, Khalife and Vazirgiannis (2018) report high precision on the TACKBP-2010 dataset. However, their method is based on the strong assumption that a gold fine-grained entity type is given for each mention (and they do not attempt to do entity type prediction). Indeed, if fine-grained entity type information is given by an oracle at test time, then Raiman and Raiman (2018) report high accuracy on TACKBP-2010, indicating that improving fine-grained entity type prediction would likely improve entity linking. Our results are achieved without gold fine-grained entity type information; instead, our model learns representations of context, mentions, and entities based on text only.

Similarly to the approach described in Section 5.2.2, we first train our bi-encoder and cross-encoder models on Wikipedia examples, then fine-tune them on the training data from this dataset. We also present our model trained on Wikipedia examples
and applied directly to the test set, as well as our model trained on this dataset directly without training on Wikipedia examples. We report our models' accuracy on the test set in Table 6, along with the baseline models presented by Onoe and Durrett (2019). We observe that our model outperforms all the baseline models.

Method                                Training            Test
MOST FREQUENT                         Wiki                54.1
COSINE SIMILARITY                     Wiki                21.7
GRU+ATTN (Mueller and Durrett, 2018)  in-domain           41.2
GRU+ATTN                              Wiki                43.4
CBoW+WORD2VEC                         in-domain           43.0
CBoW+WORD2VEC                         Wiki                38.0
Onoe and Durrett (2019)               Wiki                62.2
Ours                                  in-domain           74.7
Ours                                  Wiki                75.2
Ours (bi-encoder)                     Wiki                71.5
Ours                                  Wiki and in-domain  76.8

Table 6: Accuracy on the WikilinksNED Unseen-Mentions test set. The numbers for the baseline models are from Onoe and Durrett (2019). The Training column indicates the source of the data used in training: Wiki means Wikipedia examples; in-domain means examples in the training set.
Inference time efficiency
To illustrate the efficiency of our bi-encoder model, we profiled retrieval speed on a server with an Intel Xeon CPU E5-2698 v4 @ 2.20GHz and 512GB of memory. At inference time, we first compute the embeddings for the pool of 5.9M entities. This step is resource intensive but can be parallelized; on 8 Nvidia Volta V100 GPUs, it takes about 2.8 hours to compute all entity embeddings. Given a query mention embedding, we use the FAISS (Johnson et al., 2019) IndexFlatIP index type (exact search) to obtain the top 100 entity candidates. On the WikilinksNED Unseen-Mentions test dataset, which contains 10K queries, it takes 9.2 ms on average to return the top 100 candidates per query in batch mode.

We also explore approximate search options using FAISS. We choose the IndexHNSWFlat index type following Karpukhin et al. (2020); it takes additional time for index construction but reduces the average time per query. In Table 7, we see that HNSW with search depth 256 reduces the average query time to 2.6 ms with less than a 1.2% drop in accuracy and recall, and HNSW with search depth 128 further reduces the query time to 1.4 ms with less than a 2.1% drop.

Method        Acc   R@10  R@30  R@100  ms/q
Exact Search  71.5  92.7  95.4  96.7   9.2
HNSW (search depth 256)
HNSW (search depth 128)

Table 7: Exact and approximate candidate retrieval using FAISS. Last column: average time per query (ms). HNSW settings: neighbors to store per node 128; construction-time search depth 200; search depths 256 and 128, with construction times of 2.1h and 1.8h respectively.

Figure 3: Overall model accuracy based on different choices of k (the number of entities retrieved from the bi-encoder) on the Unseen-Mentions dataset, showing bi-encoder Recall@k, cross-encoder accuracy, and overall entity linking accuracy.

Influence of number of candidates retrieved
In a two-stage entity linking system, the choice of the number of candidates retrieved influences overall model performance. Prior work often uses a fixed number k of candidates (for instance, Yamada et al. (2016) and Ganea and Hofmann (2017) choose k = 30, and Logeswaran et al. (2019) choose k = 64). When k is larger, the retrieval recall increases; however, the ranking-stage accuracy is likely to decrease. Further, increasing k often increases the run time of the ranking stage. We explore different choices of k in our model, and present the Recall@k curve, ranking-stage accuracy, and overall accuracy in Figure 3. Based on the overall accuracy, we found that k = 10 is optimal.

In this section, we present results on knowledge distillation, using our cross-encoder as a teacher model and our bi-encoder as a student model.
Mention: But surely the biggest surprise is Ronaldo's drop in value, despite his impressive record of 53 goals and 14 assists in 75 appearances for Juventus.
Bi-encoder: Ronaldo (Brazilian footballer); Cross-encoder: Cristiano Ronaldo

Mention: ... they spent eleven days in the United Kingdom and Spain, photographing things like Gothic statues, bricks, and stone pavements for use in textures.
Bi-encoder: Gothic fiction; Cross-encoder: Gothic art

Mention: To many people in many cultures, music is an important part of their way of life. Ancient Greek and Indian philosophers defined music as tones ...
Bi-encoder: Ancient Greek; Cross-encoder: Ancient Greek philosophy
Table 8: Examples of the top entities predicted by the bi-encoder and cross-encoder models. Mentions in the examples are written in orange and the correct entity prediction is in bold.

We experiment with knowledge distillation on the TACKBP-2010 and WikilinksNED Unseen-Mentions datasets. We use the bi-encoder pretrained on Wikipedia as the student model, and fine-tune it on each dataset with knowledge distillation from the teacher model, which is the best-performing cross-encoder model pretrained on Wikipedia and fine-tuned on the dataset. As baselines, we also fine-tune the student model on each dataset without the knowledge distillation component. As we can see in Table 9, the bi-encoder model trained with knowledge distillation from the cross-encoder outperforms the bi-encoder without knowledge distillation, providing another point on the accuracy-speed trade-off curve for these architectures.
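The distillation objective of Equations 7-10 can be sketched as follows. The logits, temperature, and mixing coefficient below are invented for illustration; the paper's actual settings may differ.

```python
import math

def softmax(z, T=1.0):
    """Equation 7: tempered distribution sigma(z, T)."""
    exps = [math.exp(zi / T) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, q):
    """H(p, q) with p as the target distribution."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

def kd_loss(student_logits, teacher_logits, gold_index, T, alpha):
    """Equation 10: L = alpha * L_st + (1 - alpha) * L_dist."""
    # Equation 8: tempered teacher distribution as the target for the student.
    l_dist = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
    # Equation 9: ordinary supervised loss against the one-hot gold label.
    gold = [1.0 if i == gold_index else 0.0 for i in range(len(student_logits))]
    l_st = cross_entropy(gold, softmax(student_logits, 1.0))
    return alpha * l_st + (1 - alpha) * l_dist

# Toy candidate set: the student (bi-encoder) and teacher (cross-encoder)
# both favour candidate 0, which is also the gold entity.
loss = kd_loss(student_logits=[2.0, 0.5, 0.1],
               teacher_logits=[3.0, 0.2, 0.0],
               gold_index=0, T=2.0, alpha=0.5)
```

Raising T softens both distributions, so the student is rewarded for matching the teacher's relative preferences over all candidates rather than only its top choice.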
Dataset  bi-encoder  teacher  bi-encoder-KD
Unseen   74.4        76.8     75.7
TAC2010  92.9        94.5     93.5
Table 9: Knowledge distillation results. The teacher model is the cross-encoder, and bi-encoder-KD is the bi-encoder model trained with knowledge distillation.
Table 8 presents some example predictions from our bi-encoder and cross-encoder models, to provide intuition for how the two models use context and mention for entity linking. In the first example, we see that the bi-encoder mistakenly links "Ronaldo" to the Brazilian football player, while the cross-encoder is able to use the context word "Juventus" to disambiguate. In the second example, the cross-encoder is able to identify from context that the sentence is describing art rather than fiction, where the bi-encoder fails. In the third example, the bi-encoder finds the correct entity "Ancient Greek," whereas the cross-encoder mistakenly links it to the entity "Ancient Greek philosophy," likely because the word "philosophers" is in the context. We observe that the cross-encoder is often better at utilizing context information than the bi-encoder, but can sometimes make mistakes because of misleading context cues.
Conclusion

We proposed a conceptually simple, scalable, and highly effective two-stage approach for entity linking. We showed that our BERT-based model outperforms IR methods for entity retrieval, and achieved new state-of-the-art results on the recently introduced zero-shot entity linking dataset, the WikilinksNED Unseen-Mentions dataset, and the more established TACKBP-2010 benchmark, without any task-specific heuristics or external entity knowledge. We presented evaluations of the accuracy-speed trade-off inherent to large pre-trained models, and showed that it is possible to achieve efficient linking with a modest loss of accuracy. Finally, we showed that knowledge distillation can further improve bi-encoder performance. Future work includes:

• Enriching entity representations by adding entity type and entity graph information;
• Modeling coherence by jointly resolving mentions in a document;
• Extending our work to other languages and other domains;
• Joint models for mention detection and entity linking.

Acknowledgements
We thank our colleagues Marjan Ghazvininejad, Kurt Shuster, Terra Blevins, Wen-tau Yih, and Jason Weston for fruitful discussions.
References
Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 535–541, New York, NY, USA. Association for Computing Machinery.

Yixin Cao, Lei Hou, Juanzi Li, and Zhiyuan Liu. 2018. Neural collective entity linking. In Proceedings of the 27th International Conference on Computational Linguistics, pages 675–686.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Yotam Eshel, Noam Cohen, Kira Radinsky, Shaul Markovitch, Ikuya Yamada, and Omer Levy. 2017. Named entity disambiguation for noisy text. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 58–68, Vancouver, Canada. Association for Computational Linguistics.

Paolo Ferragina and Ugo Scaiella. 2011. Fast and accurate annotation of short texts with Wikipedia pages. IEEE Software, 29(1):70–75.

Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2619–2629, Copenhagen, Denmark. Association for Computational Linguistics.

Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. 2019. Learning dense representations for entity retrieval. arXiv preprint arXiv:1909.10506.

Amir Globerson, Nevena Lazic, Soumen Chakrabarti, Amarnag Subramanya, Michael Ringaard, and Fernando Pereira. 2016. Collective entity resolution with multi-focal attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 621–631.

Nitish Gupta, Sameer Singh, and Dan Roth. 2017. Entity linking via joint encoding of types, descriptions, and context. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2681–2690.

Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. 2013. Learning entity representation for entity disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 30–34, Sofia, Bulgaria. Association for Computational Linguistics.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. stat, 1050:9.

Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring.

Heng Ji, Ralph Grishman, Hoa Trang Dang, Kira Griffitt, and Joe Ellis. 2010. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), volume 3, pages 3–3.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering.

Sammy Khalife and Michalis Vazirgiannis. 2018. Scalable graph-based individual named entity identification. CoRR, abs/1811.10547.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.

Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann. 2018. End-to-end neural entity linking. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 519–529, Brussels, Belgium. Association for Computational Linguistics.

Adam Lerer, Ledell Wu, Jiajun Shen, Timothée Lacroix, Luca Wehrstedt, Abhijit Bose, and Alexander Peysakhovich. 2019. PyTorch-BigGraph: A large-scale graph embedding system. In Proceedings of the 2nd SysML Conference.

Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3449–3460, Florence, Italy. Association for Computational Linguistics.

David Mueller and Greg Durrett. 2018. Effective use of context in noisy entity linking. In
Proceedings of the2018 Conference on Empirical Methods in NaturalLanguage Processing , pages 1024–1029.Feng Nie, Yunbo Cao, Jinpeng Wang, Chin-Yew Lin,and Rong Pan. 2018. Mention and entity descriptionco-attention for entity disambiguation. In
Thirty-Second AAAI Conference on Artificial Intelligence .Yasumasa Onoe and Greg Durrett. 2019. Fine-grainedentity typing for domain independent entity linking. arXiv preprint arXiv:1909.05780 .Jonathan Raphael Raiman and Olivier Michel Raiman.2018. Deeptype: multilingual entity linking by neu-ral type system evolution. In
Thirty-Second AAAIConference on Artificial Intelligence .Avirup Sil, Gourab Kundu, Radu Florian, and WaelHamza. 2018. Neural cross-lingual entity linking.In
Thirty-Second AAAI Conference on Artificial In-telligence .Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, Zhen-zhou Ji, and Xiaolong Wang. 2015. Modeling men-tion, context and entity with neural networks for en-tity disambiguation. In
Twenty-Fourth InternationalJoint Conference on Artificial Intelligence .Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, andYoshiyasu Takefuji. 2016. Joint learning of the em-bedding of words and entities for named entity dis-ambiguation. In
Proceedings of The 20th SIGNLLConference on Computational Natural LanguageLearning , pages 250–259, Berlin, Germany. Associ-ation for Computational Linguistics.
A Training details and hyper-parameters

Optimization

• Computing infrastructure: we use 8 Nvidia Volta V100 GPUs for model training.

• Bounds for each hyper-parameter: see Table 10. In addition, for our bi-encoders, we use a max number of tokens of [32, ] for the context/mention encoder and for the candidate encoder. In our knowledge distillation experiments, we set α = 0. and T in [2, ]. We use grid search over hyper-parameters.

• Number of model parameters: see Table 11.

• For all our experiments, we use accuracy on the validation set as the criterion for selecting hyper-parameters.
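The grid-search selection described above can be sketched as follows. This is a minimal illustration, not the paper's training code: `train_and_eval` is a hypothetical stand-in for a full training run that returns validation accuracy, and the grid values are assumed for illustration (the actual bounds are in Table 10).

```python
from itertools import product

# Hypothetical hyper-parameter grid; values assumed for illustration.
grid = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "batch_size": [128],
}

def grid_search(train_and_eval, grid):
    """Train with every combination in the grid and keep the configuration
    with the highest validation accuracy."""
    best_cfg, best_acc = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        acc = train_and_eval(**cfg)  # accuracy on the validation set
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```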
Parameter                    Bounds
Learning rate                [−, −, −, −]
Bi-encoder batch size        [128, ]
Cross-encoder batch size     [1, ]

Table 10: Bounds of hyper-parameters in our models

Model                   Number of parameters
Bi-encoder (base)       220M
Cross-encoder (base)    110M
Bi-encoder (large)      680M
Cross-encoder (large)   340M

Table 11: Number of parameters in our models
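The knowledge-distillation objective behind the temperature T and mixing weight α mentioned above follows Hinton et al. (2015): a weighted sum of the usual cross-entropy on gold labels and a KL divergence between temperature-softened teacher and student distributions, with a T² factor to keep gradient magnitudes comparable. The following is a minimal numpy sketch, not the exact training code; the α value here is illustrative, as the value used in our runs was partially lost from the tables above.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * CE(student, gold) + (1 - alpha) * T^2 * KL(teacher || student),
    averaged over the batch (Hinton et al., 2015)."""
    n = student_logits.shape[0]
    # Hard term: cross-entropy against the gold labels.
    hard = -np.log(softmax(student_logits)[np.arange(n), labels]).mean()
    # Soft term: KL between temperature-softened teacher and student.
    ps_T = softmax(student_logits, T)
    pt_T = softmax(teacher_logits, T)
    soft = (pt_T * (np.log(pt_T) - np.log(ps_T))).sum(axis=-1).mean() * T * T
    return alpha * hard + (1 - alpha) * soft
```

When the student matches the teacher exactly, the KL term vanishes and the loss reduces to α times the plain cross-entropy.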
A.1 Training on Wikipedia data
We use Wikipedia data to train our models first, then fine-tune them on the specific dataset. This approach is used in our experiments on the TACKBP-2010 and WikilinksNED Unseen-Mentions datasets. We use the May 2019 English Wikipedia dump, which includes 5.9M entities, and use the hyperlinks in articles as examples (the anchor text is the mention). We use a subset of all Wikipedia linked mentions as our training data for the bi-encoder model (a total of 9M examples), with a holdout set of 10K examples for validation. We train our cross-encoder model on the top 100 results retrieved by our bi-encoder model on Wikipedia data. For the training of the cross-encoder model, we further down-sample our training data to obtain a training set of 1M examples.
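The top-100 candidate generation step above amounts to a nearest-neighbour search over the bi-encoder's entity embeddings (in practice done with a library such as FAISS). A minimal sketch of the scoring, with plain numpy arrays standing in for the encoder outputs:

```python
import numpy as np

def top_k_candidates(mention_vec, entity_vecs, k=100):
    """Score every entity by dot product with the mention embedding and
    return the indices of the k highest-scoring entities, best first."""
    scores = entity_vecs @ mention_vec          # (num_entities,)
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k indices
    return top[np.argsort(-scores[top])]        # sorted by descending score
```

Using `argpartition` keeps the per-query cost at O(num_entities) rather than sorting the full score vector, which matters at the 5.9M-entity scale.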
Bi-encoder (large) model
Hyperparameter configuration for the best model: learning rate=−, batch size=128, max context tokens=32. Average runtime: 17.5 hours/epoch, trained for 4 epochs.

Cross-encoder (large) model
Hyperparameter configuration for the best model: learning rate=−, batch size=1, max context tokens=32. Average runtime: 37.2 hours/epoch, trained for 1 epoch.

A.2 Zero-shot Entity Linking Dataset
Dataset available at https://github.com/lajanugen/zeshel. There are 49K, 10K, and 10K examples in the train, validation, and test sets, respectively. Training details:
Bi-encoder (base) model
Hyperparameter configuration for the best model: learning rate=−, batch size=128, max context tokens=128. Average runtime: 28.2 minutes/epoch, trained for 5 epochs.

Bi-encoder (large) model
Hyperparameter configuration for the best model: learning rate=−, batch size=128, max context tokens=128. Average runtime: 38.2 minutes/epoch, trained for 5 epochs.

Cross-encoder (base) model
Hyperparameter configuration for the best model: learning rate=−, batch size=1, max context tokens=128. Average runtime: 2.6 hours/epoch, trained for 2 epochs.

Cross-encoder (large) model
Hyperparameter configuration for the best model: learning rate=−, batch size=1, max context tokens=128. Average runtime: 8.5 hours/epoch, trained for 2 epochs.

A.3 TACKBP-2010 Dataset
Dataset available at https://catalog.ldc.upenn.edu/LDC2018T16. There are 1,074 and 1,020 annotated examples in the train and test sets, respectively. We use 10-fold cross-validation on the training set. Training details:
Bi-encoder (large) model
Hyperparameter configuration for the best model: learning rate=−, batch size=128, max context tokens=32. Average runtime: 9.0 minutes/epoch, trained for 10 epochs.

Bi-encoder (large) model with Knowledge Distillation
Hyperparameter configuration for the best model: learning rate=−, batch size=128, max context tokens=32, T = 2, α = 0.. Average runtime: 11.2 minutes/epoch, trained for 10 epochs.

Cross-encoder (large) model
Hyperparameter configuration for the best model: learning rate=−, batch size=1, max context tokens=128. Average runtime: 20.4 minutes/epoch, trained for 10 epochs.

A.4 WikilinksNED Unseen-Mentions Dataset
The train, validation, and test sets contain 2.2M, 10K, and 10K examples, respectively. We use a subset of 100K examples to fine-tune our model on this dataset, as we found that more examples do not help. Training details:
Bi-encoder (large) model
Hyperparameter configuration for the best model: learning rate=−, batch size=128, max context tokens=32. Average runtime: 3.2 hours/epoch, trained for 1 epoch.

Bi-encoder (large) model with Knowledge Distillation
Hyperparameter configuration for the best model: learning rate=−, batch size=128, max context tokens=32, T = 2, α = 0.. Average runtime: 6.5 hours/epoch, trained for 1 epoch.

Cross-encoder (large) model
Hyperparameter configuration for the best model: learning rate=−6