Contextualized End-to-End Neural Entity Linking
Haotian Chen, Andrej Zukov-Gregoric, Xi David Li, Sahil Wadhwa
BlackRock
Abstract
We propose an entity linking model that jointly learns mention detection and entity disambiguation. Built upon a pre-trained language model as a text encoder, mention detection and entity disambiguation share the same contextualized features while having their own task-specific architectures. Each detected mention is projected into the entity embedding space. As a result, our model can efficiently disambiguate all mentions in a batch in one pass over the entire entity universe by cosine distance. With candidate sets that limit the search space, our model achieves state-of-the-art performance on end-to-end entity linking. Our model also enables the option of eliminating external knowledge in both training and inference, and hence allows us to study the impact of such external knowledge.
Introduction

Entity linking (EL), in our context, refers to the joint task of recognizing named entity mentions in text through mention detection (MD) and linking each mention to a unique entity in a knowledge base (KB) through entity disambiguation (ED). EL is also known as the A2KB task on the GERBIL evaluation platform (Röder et al., 2018) and as end-to-end entity linking in some literature; ED alone is also known as the D2KB task in GERBIL. For example, in the sentence The Times began publication in London under its current name in 1788, the span The Times should be detected as a named entity mention and then linked to the corresponding entity: The Times, a UK newspaper. The ambiguity in language brings difficulties to EL models, which might link this mention span to a similar but incorrect entity such as The New York Times, an American newspaper. Our model approaches EL by producing MD and ED results simultaneously out of the same contextualized feature embedding, so that the ED decision is partially informed by learned MD features. On top of the shared feature embedding, MD and ED have their own task-specific architectures and training objectives, respectively.

Within the ED sub-task, a common approach employed by previous EL models is candidate generation. Specifically, for each mention detected, a set of potential candidate entities is generated and then ranked in order to find the best entity. The candidate generation process incorporates external knowledge compiled by humans, such as the candidate entity set for a given mention and the prior probabilities of entities given a mention. Our model has the option of not relying on candidate sets and therefore on the external knowledge that comes with them. As a result, we can study the difference between using and not using candidate sets.

This paper introduces two main contributions: (i) We propose an end-to-end differentiable neural EL model that jointly performs MD and ED and achieves state-of-the-art performance. (ii) Our model enables the option of eliminating external knowledge so that we can study the impact of external knowledge on our EL model. We provide a benchmark performance of an EL model without any external knowledge in both training and inference.
Related Work

Neural-network based models have recently achieved strong results across standard datasets. Research has focused on learning better entity representations and extracting better local and global features through novel model architectures.
Entity representation. Good KB entity representations are a key component of most ED and EL models. Representation learning has been addressed by Yamada et al. (2016), Ganea and Hofmann (2017), Cao et al. (2017) and Yamada et al.

Entity Disambiguation.
Some efforts in this field only address ED modeling, disregarding the interaction between MD and ED. The underlying assumption is that mentions are labeled by some named entity recognizer (NER). Recent work on ED has focused on extracting global features (Ratinov et al., 2011; Globerson et al., 2016; Ganea and Hofmann, 2017; Le and Titov, 2018), extending the scope of ED to more non-standard datasets (Eshel et al., 2017), and positing the problem in new ways such as building separate classifiers for KB entities (Barrena et al., 2018).
Entity Linking.
Early work by Sil and Yates (2013), Luo et al. (2015) and Nguyen et al. (2016) introduced models that jointly learn NER and ED using engineered features. More recently, Kolitsas et al. (2018) propose a neural model that generates all combinations of spans as potential mentions and learns contextual similarity scores over their entity candidates. MD is handled implicitly by only considering mention spans which have non-empty candidate entity sets. On the other hand, Martins et al. (2019) propose training a multi-task NER and ED objective using Stack-LSTMs (Dyer et al., 2015).
Model

Given a document containing a sequence of $n$ tokens $w = \{w_1, \ldots, w_n\}$ with mention indicator labels $y_{md} \in \{I, O, B\}^n$ and entity ID labels $y_{ed} \in \{j \in \mathbb{Z} : j \in [1, k]\}^n$, which index a pre-trained entity embedding matrix $E \in \mathbb{R}^{k \times d}$ of entity universe size $k$ and entity embedding dimension $d$, this model is trained to tag each token with its correct mention indicator and link each mention with its correct entity ID. We use the standard inside-outside-beginning (IOB) tagging format introduced by Ramshaw and Marcus (1995).

The text input to our model is encoded by BERT (Devlin et al., 2019). We initialize the pre-trained weights from BERT-BASE (https://github.com/google-research/bert). The text input is tokenized by the cased WordPiece tokenizer (Johnson et al., 2017), and BERT produces $n$ contextualized WordPiece embeddings $h$, which are grouped to form the embedding matrix $H \in \mathbb{R}^{n \times m}$, where $m$ is the embedding dimension; in the case of BERT-BASE, $m$ is 768. The transformation from word-level to WordPiece sub-word-level labels is handled similarly to the BERT NER task, where the head WordPiece token represents the entire word, disregarding tail tokens.
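To make this label transformation concrete, the sketch below aligns word-level IOB labels to WordPiece tokens so that only head pieces carry labels; it is our illustration, not the authors' released code, and it assumes the HuggingFace transformers tokenizer as a stand-in for the original google-research/bert WordPiece implementation.

```python
# Minimal sketch of word-to-WordPiece label alignment: the head WordPiece keeps
# the word's IOB label, tail pieces are excluded from the MD loss.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
IGNORE = -100  # PyTorch cross-entropy convention for labels excluded from the loss

def align_labels(words, iob_labels):
    pieces, piece_labels = [], []
    for word, label in zip(words, iob_labels):
        word_pieces = tokenizer.tokenize(word) or [tokenizer.unk_token]
        pieces.extend(word_pieces)
        piece_labels.append(label)                               # head piece carries the word label
        piece_labels.extend([IGNORE] * (len(word_pieces) - 1))   # tail pieces are disregarded
    return pieces, piece_labels

# Example sentence from Figure 1.
words = ["Leicester", "beat", "Somerset", "County", "Cricket", "Club"]
labels = ["B", "O", "B", "I", "I", "I"]
print(align_labels(words, labels))
```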
BERT comes in two settings: feature-based and fine-tuned. Under the feature-based setting, BERT parameters are not trainable in the domain task (EL), whereas the fine-tuned setting allows BERT parameters to adapt to the domain task.

Mention detection. MD is modeled as a sequence labelling task. The contextualized embedding $h$ is passed through a feed-forward neural network and then softmaxed for classification over IOB:

$m_{md} = W_{md} h + b_{md}$  (1)

$p_{md} = \mathrm{softmax}(m_{md})$  (2)

where $b_{md} \in \mathbb{R}^3$ is the bias term, $W_{md} \in \mathbb{R}^{3 \times m}$ is a weight matrix, and $p_{md} \in \mathbb{R}^3$ is the predicted distribution across the $\{I, O, B\}$ tag set. The predicted tag is then simply:

$\hat{y}_{md} = \arg\max_i \{ p_{md}(i) \}$  (3)

Entity disambiguation. ED is modeled by finding the entity closest to the predicted entity embedding under some distance measure. Specifically, on top of the text encoder, we apply an additional ED-specific feed-forward neural network. The combination forms a projector from each token to the entity embedding space of dimension $d$:

$m_{ed} = \tanh(W_{ed} h + b_{ed})$
$p_{ed} = s(m_{ed}, E)$
$\hat{y}_{ed} = \arg\max_j \{ p_{ed}(j) \}$  (4)

where $b_{ed} \in \mathbb{R}^d$ is the bias term, $W_{ed} \in \mathbb{R}^{d \times m}$ is a weight matrix, and $m_{ed} \in \mathbb{R}^d$ has the same size as an entity embedding. $s$ is any similarity measure which relates $m_{ed}$ to every entity embedding in $E$; in our case, we use cosine similarity. Our predicted entity ID is the index of $p_{ed}$ with the highest similarity score.
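As a concrete reading of Equations (1)-(4), the following minimal PyTorch sketch implements the two task-specific heads; the class name, dimensions, and the random entity matrix are illustrative assumptions rather than the authors' code.

```python
# Illustrative PyTorch sketch of the MD and ED heads of Eqs. (1)-(4).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ELHeads(nn.Module):
    def __init__(self, hidden_dim, entity_dim, entity_embeddings):
        super().__init__()
        self.md_ffn = nn.Linear(hidden_dim, 3)            # Eq. (1): scores over {I, O, B}
        self.ed_ffn = nn.Linear(hidden_dim, entity_dim)   # Eq. (4): projector into entity space
        # Pre-trained entity embedding matrix E (k x d), kept frozen and L2-normalized
        # so that a dot product equals cosine similarity.
        self.register_buffer("E", F.normalize(entity_embeddings, dim=-1))

    def forward(self, h):
        # h: (batch, seq_len, hidden_dim) contextualized WordPiece embeddings from BERT.
        p_md = F.softmax(self.md_ffn(h), dim=-1)          # Eq. (2)
        m_ed = torch.tanh(self.ed_ffn(h))                 # Eq. (4): projection
        p_ed = F.normalize(m_ed, dim=-1) @ self.E.t()     # cosine similarity to every entity
        return p_md.argmax(-1), p_ed.argmax(-1)           # Eqs. (3)/(4): predicted tags and entity IDs

# Toy usage with random stand-ins (k = 1,000 entities, d = 300, m = 768).
E = torch.randn(1000, 300)
heads = ELHeads(hidden_dim=768, entity_dim=300, entity_embeddings=E)
y_md, y_ed = heads(torch.randn(2, 16, 768))   # two sequences of 16 WordPieces each
print(y_md.shape, y_ed.shape)
```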
[Figure 1: Architecture of the proposed model. Input WordPiece tokens are passed through BERT, forming contextualized embeddings. Each contextualized embedding is passed through two task-specific feed-forward neural networks for MD and ED, respectively. The entity ID prediction on a 'B' token is extended to the entire mention span.]

We use pre-trained entity embeddings from wikipedia2vec (Yamada et al., 2018) (https://wikipedia2vec.github.io/wikipedia2vec/pretrained/), as pre-training good entity representations is beyond the scope of this work. Ideally, the pre-trained entity embedding should come from an architecture similar to our EL model, but our experiments show strong results even when it does not. The wikipedia2vec entity embedding used in our model is trained on the 2018 Wikipedia with link graph support.

During inference, after receiving results for each token from both the MD and the ED side, each mention span is tagged by the $\{B, I\}$ indicators as shown in Figure 1. For each mention span, the first token's entity ID prediction represents the entire mention span. The remaining non-mention and non-first entity ID predictions are masked out. Such behavior is facilitated by the training objective below.

During training, we minimize the following multi-task objective, which is inspired by Redmon and Farhadi (2017) from the domain of object detection (similar to EL, object detection has two sub-tasks: locating bounding boxes and identifying each box's object):

$J(\theta) = \lambda L_{md}(\theta) + (1 - \lambda) L_{ed}(\theta)$  (5)

where $L_{md}$ is the cross entropy between the predicted and actual distributions over IOB and $L_{ed}$ is the cosine similarity between the projected entity embedding and the actual entity embedding. We tentatively explored triplet loss and contrastive loss with some simple negative mining strategies for ED but did not observe a significant gain in performance. The two loss functions are weighted by the hyperparameter $\lambda$. Note that $L_{md}$ is calculated for all non-pad head WordPiece tokens, while $L_{ed}$ is calculated only for the first token of every labeled entity mention.
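The snippet below is a hedged PyTorch sketch of the objective in Equation (5) with the masking just described; treating $L_{ed}$ as a cosine distance (1 - cos) and the example value of $\lambda$ are our assumptions, as are all variable names.

```python
# Hedged sketch of the multi-task loss in Eq. (5): cross entropy over IOB for
# all non-pad head WordPieces, cosine-distance term only on mention-head tokens.
import torch
import torch.nn.functional as F

def el_loss(md_logits, md_labels, ed_pred, ed_gold, mention_head_mask, lam=0.5):
    # md_logits: (batch, seq, 3); md_labels: (batch, seq) with -100 on pad/tail pieces.
    l_md = F.cross_entropy(md_logits.transpose(1, 2), md_labels, ignore_index=-100)
    # ed_pred / ed_gold: (batch, seq, d) projected and gold entity embeddings;
    # mention_head_mask: (batch, seq) bool, True only on the first token of each mention.
    cos = F.cosine_similarity(ed_pred, ed_gold, dim=-1)
    l_ed = (1.0 - cos)[mention_head_mask].mean()     # cosine distance on mention heads only
    return lam * l_md + (1.0 - lam) * l_ed           # Eq. (5)

# Toy usage with random tensors and one mention head per sequence.
b, n, d = 2, 16, 300
mention_head_mask = torch.zeros(b, n, dtype=torch.bool)
mention_head_mask[:, 1] = True
loss = el_loss(
    md_logits=torch.randn(b, n, 3),
    md_labels=torch.randint(0, 3, (b, n)),
    ed_pred=torch.randn(b, n, d),
    ed_gold=torch.randn(b, n, d),
    mention_head_mask=mention_head_mask,
)
print(float(loss))
```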
Experiments

We train and evaluate our model on the widely used AIDA/CoNLL dataset (Hoffart et al., 2011). It is a collection of news articles from Reuters, split into training, validation (testa), and test (testb) sets. Following convention, the evaluation metric is strong-matching span-level InKB micro and macro F1 over gold mentions where an entity annotation is available (Röder et al., 2018). Note that ED models are evaluated by an accuracy metric, while EL models are evaluated by F1, which penalizes tagging a non-mention span as an entity mention.

All the EL models we cite rely on candidate sets. Our model, in contrast, can efficiently disambiguate a mention against the entire entity universe, which consists of the 1 million most frequent entities in the 2018 Wikipedia. Consequently, our model can circumvent candidate generation as well as the external knowledge that comes with it. In order to study the impact of candidate sets on our model, we apply the candidate sets from Hoffart et al. (2011), backed by the YAGO knowledge graph (Suchanek et al., 2007). We do not limit the size of the candidate sets. Note that we do not use any other external knowledge in this work.
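For intuition, the following small sketch (ours, not the paper's) shows how a candidate set restricts the search space: similarities to entities outside a mention's candidate set are masked out before the argmax, whereas the no-candidate setting ranks the full entity universe. The candidate IDs, which in the paper would come from the Hoffart et al. (2011) alias table, are made up here.

```python
# Illustrative disambiguation with and without a candidate set.
import torch
import torch.nn.functional as F

def disambiguate(m_ed, E, candidate_ids=None):
    # m_ed: (d,) projected mention embedding; E: (k, d) entity embedding matrix.
    scores = F.cosine_similarity(m_ed.unsqueeze(0), E, dim=-1)   # (k,) similarity to every entity
    if candidate_ids is not None:
        mask = torch.full_like(scores, float("-inf"))
        mask[candidate_ids] = 0.0
        scores = scores + mask                                   # keep only candidate entities
    return int(scores.argmax())

E = torch.randn(100_000, 300)   # small stand-in for the 1M-entity universe used in the paper
mention = torch.randn(300)
print(disambiguate(mention, E))                                           # whole universe
print(disambiguate(mention, E, candidate_ids=torch.tensor([7, 42, 99])))  # restricted to candidates
```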
We train the EL model on the training split with a batch size of 4 for 50,000 steps. Similar to BERT, the model is optimized by the Adam optimizer (Kingma and Ba, 2014) with the same hyperparameters except the learning rate, which we set to 2e-5. Training was performed on a Tesla V100 GPU. Experiments are repeated 3 times to calculate an error range.
Our model is compared with four of the most recent EL models in Table 1. Our model with candidate sets (described above) achieves state-of-the-art results. Without candidate sets, identifying the correct entity over the entire 1 million entity universe remains a challenging task. This result serves as a benchmark for future models without external knowledge.
System | Validation Macro F1 | Validation Micro F1 | Test Macro F1 | Test Micro F1
Martins et al. (2019) | 82.8 | 85.2 | 81.2 | 81.9
Kolitsas et al. (2018) | 86.6 | 89.4 | 82.6 | 82.4
Cao et al. (2018) | 77.0 | 79.0 | 80.0 | 80.0
Nguyen et al. (2016) | - | - | - | 78.7
Fine-tuned BERT with candidate sets | ± | ± | ± | ±
Fine-tuned BERT without candidate sets | 82.6 ± | ± | ± | ±

Table 1: Strong-matching span-level InKB macro and micro F1 results on the validation and test splits of the AIDA/CoNLL dataset. Note that the other models cited all use candidate sets.
Ablation Study.
First, as shown in both Table 1 and Table 2, the gap in F1 scores between using and not using candidate sets shows that external knowledge such as candidate sets has a strong impact on performance under both the fine-tuned and the feature-based BERT settings. To the best of our knowledge, we are the first to disentangle EL from external knowledge and quantify this gap. Second, fine-tuned BERT shows better performance than feature-based BERT, indicating that allowing BERT to adapt to the domain task is crucial.
Ablation | Validation Macro F1 | Validation Micro F1 | Test Macro F1 | Test Micro F1
Feature-based BERT with candidate sets | 87.1 ± | ± | ± | ±
Feature-based BERT without candidate sets | 63.3 ± | ± | ± | ±
With fasttext entity embedding | 90.4 | 91.4 | 82.8 | 82.9

Table 2: Ablation results on the validation and test sets of AIDA/CoNLL.
Third, we test the impact of the entity embedding. We build a simple fasttext entity embedding that replaces wikipedia2vec: the 300-dimensional fasttext (Bojanowski et al., 2017) embedding of each entity's Wikipedia title. This fasttext entity embedding performs slightly worse than wikipedia2vec, which reveals that entity titles carry some entity representation features (for example, one can easily differentiate New York (state) from New York (novel) just by the entity title), but not as many as wikipedia2vec. Our model is thus robust to other, less sophisticated entity representations.
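To illustrate how such a title-based embedding could be built, here is a hedged sketch; the paper does not specify the fasttext model or how title tokens are aggregated, so cc.en.300.bin and get_sentence_vector are our assumptions.

```python
# Hedged sketch of a title-based entity embedding: each entity is represented
# by a 300-d fasttext embedding of its Wikipedia title. get_sentence_vector
# averages normalized word vectors; cc.en.300.bin must be downloaded locally.
import fasttext
import numpy as np

ft = fasttext.load_model("cc.en.300.bin")   # pre-trained 300-d English fasttext vectors

titles = ["The Times", "The New York Times", "New York (state)", "New York (novel)"]
entity_matrix = np.stack([ft.get_sentence_vector(t) for t in titles])   # (k, 300)
print(entity_matrix.shape)
```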
Conclusion

We propose an EL model that jointly learns the MD and ED tasks, achieving state-of-the-art results. We show that training and inference without candidate sets, or in fact without any external knowledge, are possible, and we provide benchmark results for EL without any external knowledge. For future work, we suggest studying entity representation learning through a process similar to our EL model. Additionally, exploring global EL with a language model that retains memory of global context, such as XLNet (Yang et al., 2019), and cross-lingual EL with a multilingual language model would be promising.
References
Ander Barrena, Aitor Soroa, and Eneko Agirre. 2018. Learning text representations for 500K classification tasks on named entity disambiguation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 171–180, Brussels, Belgium. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu, Chengjiang Li, Xu Chen, and Tiansi Dong. 2018. Joint representation learning of cross-lingual words and entities via attentive distant supervision. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 227–237, Brussels, Belgium. Association for Computational Linguistics.

Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge text and knowledge by learning multi-prototype entity mention embedding. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1623–1633, Vancouver, Canada. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China. Association for Computational Linguistics.

Yotam Eshel, Noam Cohen, Kira Radinsky, Shaul Markovitch, Ikuya Yamada, and Omer Levy. 2017. Named entity disambiguation for noisy text. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 58–68, Vancouver, Canada. Association for Computational Linguistics.

Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2619–2629, Copenhagen, Denmark. Association for Computational Linguistics.

Amir Globerson, Nevena Lazic, Soumen Chakrabarti, Amarnag Subramanya, Michael Ringaard, and Fernando Pereira. 2016. Collective entity resolution with multi-focal attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 621–631.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann. 2018. End-to-end neural entity linking. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 519–529, Brussels, Belgium. Association for Computational Linguistics.

Phong Le and Ivan Titov. 2018. Improving entity linking by modeling latent relations between mentions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1595–1604, Melbourne, Australia. Association for Computational Linguistics.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 879–888.

Pedro Henrique Martins, Zita Marinho, and André F. T. Martins. 2019. Joint learning of named entity recognition and entity linking. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 190–196, Florence, Italy. Association for Computational Linguistics.

Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. 2016. J-NERD: Joint named entity recognition and disambiguation with rich linguistic features. Transactions of the Association for Computational Linguistics, 4:215–229.

Lance Ramshaw and Mitch Marcus. 1995. Text chunking using transformation-based learning. In Third Workshop on Very Large Corpora.

Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1375–1384. Association for Computational Linguistics.

J. Redmon and A. Farhadi. 2017. YOLO9000: Better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525.

Michael Röder, Ricardo Usbeck, and Axel-Cyrille Ngonga Ngomo. 2018. GERBIL - benchmarking named entity recognition and linking consistently. Semantic Web, 9(5):605–625.

Avirup Sil, Gourab Kundu, Radu Florian, and Wael Hamza. 2018. Neural cross-lingual entity linking. In Thirty-Second AAAI Conference on Artificial Intelligence.

Avirup Sil and Alexander Yates. 2013. Re-ranking for joint named-entity recognition and linking. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2369–2374. ACM.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2018. Wikipedia2Vec: An optimized tool for learning embeddings of words and entities from Wikipedia. arXiv preprint arXiv:1812.06280.

Ikuya Yamada and Hiroyuki Shindo. 2019. Pre-training of deep contextualized embeddings of words and entities for named entity disambiguation. arXiv preprint arXiv:1909.00426.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259, Berlin, Germany. Association for Computational Linguistics.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2017. Learning distributed representations of texts and entities from knowledge base. Transactions of the Association for Computational Linguistics, 5:397–411.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding.