BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA
Nora Kassner, Hinrich Schütze
Center for Information and Language Processing (CIS), LMU Munich, Germany
[email protected]
Abstract
Khandelwal et al. (2020) show that a k-nearest-neighbor (kNN) component improves language modeling performance. We use this idea for open domain question answering (QA). To improve the recall of facts stated in the training text, we combine BERT (Devlin et al., 2019) with a kNN search over a large corpus. Our contributions are as follows. i) We outperform BERT on cloze-style QA by large margins without any further training. ii) We show that BERT often identifies the correct response category (e.g., central European city), but only kNN recovers the factually correct answer (e.g., "Vienna").
Pre-trained language models (PLMs) like BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019) have emerged as universal tools that capture not only a diverse range of linguistic knowledge but also (as recent evidence seems to suggest) factual knowledge.

Petroni et al. (2019) introduced LAMA (LAnguage Model Analysis) to investigate PLMs' capacity to recall factual knowledge without the use of fine-tuning. Since the PLM training objective is to predict masked tokens, question answering tasks can be reformulated as cloze questions; e.g., "Who wrote 'Ulysses'?" is reformulated as "[MASK] wrote 'Ulysses'." In this setup, Petroni et al. (2019) show that, on QA, PLMs outperform baselines trained on automatically extracted knowledge bases.

Still, given that PLMs have seen more data than any human could read in a lifetime, their performance on open domain QA seems poor. Even LAMA facts that PLMs do get right are not necessarily "recalled" from the training experience, as many of them are easy to guess (Poerner et al., 2019). Choosing BERT as our PLM, we therefore introduce BERT-kNN in this paper (see Figure 1): BERT-kNN combines BERT's predictions with a kNN search over a text collection, where the text collection can be BERT's training set or any other suitable text corpus. Due to its kNN component and its resulting ability to directly access facts stated in the searched text, BERT-kNN outperforms BERT on cloze-style QA by large margins.
Figure 1: Schematic depiction of BERT-kNN: BERT's predictions for a query q are interpolated with a kNN-search component. The query q is input to an IR step. The BERT embeddings of the retrieved contexts BERT(s), together with the target words, build a key-value datastore c-w (yellow). The kNN search runs between the BERT embedding of the query BERT(q) (red) and the keys c of the datastore. The corresponding values w of the kNNs and their distances d are returned (orange). These are aggregated and normalized. Finally, the predictions of the kNN-search component and BERT's predictions are interpolated.

In more detail, we use BERT to embed each token's context in the text collection. Each pair of context embedding and token is stored as a key-value pair in a datastore. At test time, for a cloze question q, the MASK's embedded context serves as query BERT(q) to find the k context-target pairs in the datastore that are closest. To make this more effective, we first query a separate information retrieval (IR) index with the original question and only search over the top m hits when finding the k nearest neighbors of BERT(q) in embedding space. The final prediction is an interpolation of the kNN search and the PLM predictions.

We find that the PLM often correctly predicts the answer category and therefore the correct answer is often among the top k nearest neighbors. A typical example is "Albert Einstein was born in [MASK]": the PLM knows that a city is likely to follow and maybe even that it is a German city, but it fails to pick the correct city. On the other hand, the top-ranked answer in the kNN search is "Ulm" and so the correct filler for the mask can be identified.

BERT-kNN outperforms BERT on the LAMA cloze-style QA dataset without any further training. Even though BERT-kNN is based on BERT-base, it also outperforms BERT-large on 3 out of 4 LAMA subsets. The performance gap between BERT and BERT-kNN is most pronounced on hard-to-guess facts. As this method can be applied to any kind of text collection (not just the PLM training corpus), BERT-kNN can potentially correctly give answers that BERT has never seen in its training corpus.

The LAMA dataset is a cloze-style QA dataset that allows querying PLMs for knowledge-base-like facts. A cloze question is generated from a subject-relation-object triple from a knowledge base and from a templatic statement for the relation that contains variables X and Y for subject and object (e.g., "X was born in Y"). The subject is substituted for X and [MASK] for Y. The triples are chosen such that Y is always a single-token answer.

LAMA covers different sources: The Google-RE set (https://code.google.com/archive/p/relation-extraction-corpus/) covers the three relations "place of birth", "date of birth" and "place of death". T-REx (ElSahar et al., 2018) consists of a subset of Wikidata triples covering 41 relations. ConceptNet (Li et al., 2016) combines 16 commonsense relationships between words and phrases; the underlying Open Mind Common Sense corpus provides matching statements to query the language model. SQuAD (Rajpurkar et al., 2016) is a standard question answering dataset; LAMA contains a subset of 305 context-insensitive SQuAD questions and provides manually reformulated cloze-style questions to query the model. Poerner et al. (2019) introduce LAMA-UHN, a subset of LAMA's T-REx and Google-RE questions from which easy-to-guess facts have been removed.
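To make the cloze construction concrete, the short Python sketch below turns a subject-relation-object triple and a relation template into a LAMA-style cloze question; the template strings and the example triple are illustrative stand-ins, not items taken from the LAMA release.

```python
# Illustrative sketch: build a LAMA-style cloze question from a KB triple and a
# relation template (templates and triple are made-up examples, not LAMA data).
TEMPLATES = {
    "place_of_birth": "[X] was born in [Y].",
    "author_of": "[X] wrote [Y].",
}

def triple_to_cloze(subj, relation, obj):
    """Substitute the subject for [X] and [MASK] for [Y]; the object is the gold answer."""
    question = TEMPLATES[relation].replace("[X]", subj).replace("[Y]", "[MASK]")
    return question, obj

print(triple_to_cloze("Albert Einstein", "place_of_birth", "Ulm"))
# -> ('Albert Einstein was born in [MASK].', 'Ulm')
```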
Corpus      BERT-base   BERT-large   BERT-kNN
LAMA        27.7        30.6         —
LAMA-UHN    20.6        23.0         —

Table 1: Mean precision at one (P@1) for LAMA and LAMA-UHN on the T-REx and Google-RE subsets.
BERT-kNN combines BERT-base with a kNN search component. We now describe the architecture of BERT-kNN.
BERT.
This method is applicable to any kind of PLM. We use BERT-base-uncased (Devlin et al., 2019) as our PLM since it is the top performer on LAMA. BERT estimates the probability of a masked word given its context. BERT is pre-trained on the BookCorpus (Zhu et al., 2015) as well as a crawl of English Wikipedia. During pre-training, BERT randomly masks positions and learns to fill in the masked words.
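For illustration, the snippet below queries bert-base-uncased for a masked position; it uses the HuggingFace transformers fill-mask pipeline, which is our choice for the example and not something the paper prescribes.

```python
# Query BERT-base for a masked token (illustrative; any masked-LM interface works).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("Albert Einstein was born in [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```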
Datastore.
Our text collection C is the 2016-12-21 English Wikipedia (https://dumps.wikimedia.org/enwiki/latest/). For each single-token word occurrence w in a sentence s in C, we compute the pair (c, w), where c is a context representation of s computed by BERT. We find that masking the occurrence of w in s and using the embedding of the masked token is an effective context representation c. We store all pairs (c, w) in a key-value datastore D, where c serves as key and w as value.
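A minimal sketch of this datastore construction is shown below, again using the HuggingFace transformers library; taking the last hidden layer at the masked position as the context representation c is our assumption, since the exact layer is not specified here.

```python
# Sketch: build the datastore D of (context embedding c, target word w) pairs by
# masking each single-token word occurrence and embedding the masked position.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def build_datastore(sentences):
    keys, values = [], []  # keys: context embeddings c, values: target words w
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            masked = words[:i] + [tokenizer.mask_token] + words[i + 1:]
            enc = tokenizer(" ".join(masked), return_tensors="pt", truncation=True)
            with torch.no_grad():
                hidden = bert(**enc).last_hidden_state[0]
            mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
            keys.append(hidden[mask_pos])  # c: embedding of the masked position
            values.append(w)               # w: the word that was masked out
    return torch.stack(keys), values
```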
Information Retrieval.

We found that just using the datastore D does not give good results. We therefore use the IR system of Chen et al. (2017) to first select a small subset of D using a keyword search. The IR index contains all Wikipedia articles. An article is represented as a bag of words and word bigrams. If the subject in question is specified, we use it as-is to query the IR index; otherwise, the cloze-style question q (with the [MASK] token removed) is used. Finally, we find the top 5 relevant Wikipedia articles using TF-IDF search.
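The retrieval step can be approximated as in the sketch below, which ranks article texts by unigram+bigram TF-IDF similarity with scikit-learn; this is a functional stand-in for the hashed TF-IDF index of Chen et al. (2017), not their implementation.

```python
# Stand-in for the IR step: rank Wikipedia articles by TF-IDF similarity to the
# query (the subject string if available, else the cloze question without [MASK]).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_articles(query, article_texts, m=5):
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # words and word bigrams
    doc_matrix = vectorizer.fit_transform(article_texts)
    query_vec = vectorizer.transform([query.replace("[MASK]", "")])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    return scores.argsort()[::-1][:m]                 # indices of the top-m articles
```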
Inference.

At test time, we first run the IR search between the cloze question q and the datastore D and then only consider the subset of D that corresponds to the top 5 relevant Wikipedia articles. For the kNN search, q is embedded in the same way as the context representations c in D: we set BERT(q) to the embedding computed by BERT for [MASK]. We then retrieve the k nearest neighbors of BERT(q) in the 5-Wikipedia-article subset of D, where k = 512. We convert the distances between BERT(q) and the 512 nearest neighbors to a probability distribution using softmax normalization. Since a word w can occur several times among the 512 nearest neighbors, we compute its final output probability as the sum over all its occurrences; words that do not occur have zero probability. In the final step, the probability distributions of BERT and the kNN search are interpolated with interpolation parameter λ (set to 0.6).

Following Petroni et al. (2019), we report mean precision at rank k (P@k). P@k is 1 if the true answer occurs among the top k predictions and 0 otherwise. Averaging is done first within each relation and then across relations.

Corpus       Relation      Facts   Rel   BERT-base   BERT-large   kNN    BERT-kNN
Google-RE    birth-place   2937    1     14.9        16.1         45.5   —
Google-RE    birth-date    1825    1     1.5         1.4          39.5   —
Google-RE    death-place   765     1     13.1        14.0         —      —
T-REx        N-1           20006   23    32.4        34.2         29.8   —
T-REx        N-M           13096   16    24.7        24.3         —      —
ConceptNet   Total         11458   16    15.6        —            —      —

Table 2: Mean precision at one (P@1) for BERT-base, BERT-large, the kNN search and the interpolation between BERT and the kNN search (BERT-kNN) across the set of evaluation corpora.
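The inference step just described (kNN search over the retrieved subset, softmax over distances, per-word aggregation, interpolation with BERT) can be sketched as follows; Euclidean distance and weighting the kNN distribution with λ, as in Khandelwal et al. (2020), are our assumptions where the text leaves the details open.

```python
# Sketch of BERT-kNN inference: kNN over the retrieved subset of the datastore,
# softmax over negative distances, per-word aggregation, interpolation with BERT.
import torch

def bert_knn_predict(query_emb, keys, values, bert_probs, vocab, k=512, lam=0.6):
    # keys: [n, d] context embeddings from the top-5 articles; values: target words.
    # bert_probs: dict mapping each word in vocab to BERT's probability for it.
    dists = torch.cdist(query_emb.unsqueeze(0), keys)[0]  # distances to all keys
    k = min(k, len(values))
    neg_dists, idx = torch.topk(-dists, k)                # k nearest neighbors
    weights = torch.softmax(neg_dists, dim=0)             # softmax over negative distance
    knn_probs = {}
    for weight, i in zip(weights.tolist(), idx.tolist()):
        knn_probs[values[i]] = knn_probs.get(values[i], 0.0) + weight  # sum occurrences
    # Words outside the k neighbors get zero kNN probability.
    return {w: lam * knn_probs.get(w, 0.0) + (1 - lam) * bert_probs.get(w, 0.0)
            for w in vocab}
```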
BERT-kNN outperforms BERT on the LAMA dataset. It obtains a gain of over 10 precision points over both BERT-base and BERT-large; note that our model uses BERT-base only. (The results for BERT-base and BERT-large are taken from Petroni et al. (2019), where a slightly smaller subset of BERT's original vocabulary is used.) Table 1 shows that the performance gap between original BERT and BERT-kNN becomes even larger when evaluating on LAMA-UHN, a subset of LAMA with hard-to-guess facts.

Table 2 shows performance on different LAMA subsets. We see that BERT-kNN outperforms BERT-base and BERT-large on 3 out of 4 LAMA subsets; on ConceptNet it shows competitive results. Huge gains are obtained on the Google-RE dataset. Figure 2 shows precision at 1, 5 and 10: BERT-kNN performs better in all three categories.

Table 2 also shows that neither BERT nor the kNN search alone is sufficient for good performance; only the interpolation of the two yields optimal results. In many cases, the knowledge recalled by BERT and the kNN search is complementary.
Figure 2: Mean Precision@1, Precision@5 and Precision@10 on LAMA for original BERT and BERT-kNN.

BERT is much better on ConceptNet relations. This seems to be due to the limitations of the knowledge expressed in Wikipedia articles. Note that the interpolation parameter is kept constant over all datasets. The prediction probabilities are well calibrated in the sense that BERT-kNN is able to distinguish when to rely more on BERT and when to rely more on the kNN predictions.

Table 3 shows exemplary differences between BERT and BERT-kNN predictions. We see that original BERT is good at predicting the answer category required for completing the cloze query, but only the kNN search is able to recover the actual fact.
Corpus      Query and true answer / Generation (top three tokens with probabilities)

Google-RE   Hans Gefors was born in [MASK].  (True: Stockholm)
              BERT-kNN: Stockholm (0.62), Oslo (0.08), Copenhagen (0.07)
              BERT: Oslo (0.22), Copenhagen (0.18), Bergen (0.09)
              kNN: Stockholm (0.97), Lund (0.02), Hans (0.0)
            Aglaja Orgeni died in [MASK].  (True: Vienna)
              BERT-kNN: Vienna (0.61), Bucharest (0.08), Paris (0.03)
              BERT: Bucharest (0.19), Paris (0.08), Budapest (0.04)
              kNN: Vienna (1.0), 1886 (0.0), Munich (0.0)

T-REx       Regiomontanus works in the field of [MASK].  (True: Mathematics)
              BERT-kNN: Mathematics (0.25), Astronomy (0.17), Medicine (0.04)
              BERT: Medicine (0.09), Law (0.05), Physics (0.03)
              kNN: Mathematics (0.40), Astronomy (0.28), Literature (0.04)
            The headquarter of interpol is in [MASK].  (True: Lyon)
              BERT-kNN: Lyon (0.52), Paris (0.05), Singapore (0.04)
              BERT: Paris (0.12), London (0.08), Brussels (0.05)
              kNN: Lyon (0.86), Singapore (0.07), Oslo (0.01)

ConceptNet  Ears can [MASK] sound.  (True: hear)
              BERT-kNN: hear (0.22), detect (0.16), produce (0.11)
              BERT: hear (0.28), detect (0.06), produce (0.04)
              kNN: detect (0.23), hear (0.19), produce (0.15)
            Regret is an [MASK].  (True: emotion)
              BERT-kNN: emotion (0.1), action (0.03), evolutionary (0.02)
              BERT: emotion (0.25), option (0.04), art (0.04)
              kNN: action (0.04), evolutionary (0.03), explanation (0.03)

SQuAD       [MASK] is needed to pack electrons densely together.  (True: energy)
              BERT-kNN: it (0.20), energy (0.05), this (0.04)
              BERT: it (0.5), this (0.1), energy (0.07)
              kNN: energy (0.04), electrons (0.02), material (0.01)
            The capital of the ottoman empire was [MASK].  (True: Istanbul)
              BERT-kNN: Istanbul (0.32), Constantinople (0.25), Vienna (0.02)
              BERT: Constantinople (0.48), Istanbul (0.33), Acre (0.02)
              kNN: Istanbul (0.3), Constantinople (0.1), Vienna (0.02)

Table 3: Examples of generations for BERT-base, kNN and BERT-kNN. For each query, the top three tokens are reported together with the associated probability (in brackets).

PLMs are top performers for many tasks, including QA (Kwiatkowski et al., 2019; Alberti et al., 2019). Petroni et al. (2019) introduced the LAMA cloze-style QA task to probe PLMs' performance on knowledge-base-like facts. Bosselut et al. (2019) investigate PLMs' common sense knowledge only.

DrQA (Chen et al., 2017) is a popular open-domain QA model that combines an IR step with a neural reading comprehension model. Even though we use the same IR module, our model differs significantly: DrQA does not predict masked tokens but extracts answers from text, and it uses neither PLM Transformers nor a kNN search module. Most notably, BERT-kNN is fully unsupervised and does not require any extra training.

Other work on knowledge in PLMs focuses on injecting knowledge into BERT's encoder. ERNIE (Zhang et al., 2019) and KnowBert (Peters et al., 2019) are entity-enhanced versions of BERT. They introduce additional encoder layers that are integrated into BERT's original encoder by expensive further pre-training. Our approach, on the other hand, is not limited to labeled entities, nor does it require any further training. Poerner et al. (2019) inject factual entity knowledge into BERT's embeddings without further training, by aligning Wikipedia2Vec entity vectors (Yamada et al., 2016) with BERT's word piece vocabulary; this approach is also limited to labeled entities.
Our approach is conceptually very different from entity-enhanced versions of BERT and could potentially be combined with any of the mentioned ones.

The BERT-kNN architecture is based on Khandelwal et al. (2020), where an interpolation of a PLM and a kNN search is used for language modeling. In contrast, this work analyzes QA. Architecturally, we introduce an IR step into the model that is essential for factual correctness, and we change the hidden state used for the kNN search to the masked-token embeddings.

Other work that stores previous hidden states in memory includes Grave et al. (2016) and Merity et al. (2017). These models only consider the recent history, which makes it easier to copy rare vocabulary items from the recent past; they do not use a PLM Transformer architecture. Again, these models are evaluated on language modeling and not on factual correctness.
This work introduced BERT-kNN, an interpolation of BERT predictions with a kNN search for unsupervised cloze-style QA. BERT-kNN sets a new state of the art on the LAMA dataset, with top performance on hard-to-guess facts, without any further training. The method potentially allows querying LMs for knowledge outside of the training domain with no additional training.
References
Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT baseline for the natural questions. ArXiv, abs/1901.08634.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Hady ElSahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon S. Hare, Frédérique Laforest, and Elena Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018.

Edouard Grave, Armand Joulin, and Nicolas Usunier. 2016. Improving neural language models with a continuous cache. ICLR, abs/1612.04426.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations (ICLR).

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

Xiang Li, Aynaz Taheri, Lifu Tu, and Kevin Gimpel. 2016. Commonsense knowledge base completion. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1445–1455, Berlin, Germany. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR).

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54, Hong Kong, China. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2019. BERT is not a knowledge base (yet): Factual knowledge vs. name-based reasoning in unsupervised QA. ArXiv, abs/1911.03681.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259, Berlin, Germany. Association for Computational Linguistics.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).