Bridging the Knowledge Gap: Enhancing Question Answering with World and Domain Knowledge
Travis R. Goodwin and Dina Demner-Fushman
Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, USA
Abstract
In this paper we present OSCAR (Ontology-based Semantic Composition Augmented Regularization), a method for injecting task-agnostic knowledge from an ontology or knowledge graph into a neural network during pretraining. We evaluated the impact of including OSCAR when pretraining BERT with Wikipedia articles by measuring the performance when fine-tuning on two question answering tasks involving world knowledge and causal reasoning and one requiring domain (healthcare) knowledge, and obtained improved accuracy compared to pretraining BERT without OSCAR, achieving new state-of-the-art results on two of the tasks.

1 Introduction

"The detective flashed his badge to the police officer." The nearly effortless ease with which we, as humans, can understand this simple statement belies the depth of semantic knowledge needed for its understanding: What is a detective? What is a police officer? What is a badge? What does it mean to flash a badge? Why would the detective need to flash his badge to the police officer? Understanding this sentence requires knowing the answers to all these questions and relies on the reader's knowledge about the world.

As shown in Figure 1, suppose we were interested in determining whether, upon showing the police officer his badge, it is more plausible that the detective would be let into the crime scene or that the police officer would confiscate the detective's badge. Although both scenarios are certainly possible, our accumulated expectations about the world suggest it would be very extraordinary for the police officer to confiscate the detective's badge rather than allow him to enter the crime scene.
Premise: The detective flashed his badge to the police officer.
What is the most likely effect?
A: The police officer confiscated the detective's badge.
B: The police officer let the detective enter the crime scene.
Figure 1: Example of a question requiring commonsense and causal reasoning (Roemmele et al., 2011).
As evidence of Grice's Maxim of Quantity (Grice, 1975), this shared knowledge of the world is rarely explicitly stated in text. Fortunately, some of this knowledge can be extracted from ontologies and knowledge bases. For example, ConceptNet (Speer et al., 2017) indicates that a detective is a
TYPE OF police officer and is CAPABLE OF finding evidence; that evidence can be LOCATED AT a crime scene; and that a badge is a TYPE OF authority symbol.

While neural networks have been shown to obtain state-of-the-art performance on many types of question answering and reasoning tasks from raw data (Devlin et al., 2018; Rajpurkar et al., 2016; Manning, 2015), there has been relatively little investigation into how to inject ontological knowledge into deep learning models, with most prior attempts embedding ontological information outside of the network itself (Wang et al., 2017).

In this paper we present a pretraining regularization technique we call OSCAR (Ontological Semantic Composition Augmented Regularization) which is capable of injecting world knowledge and ontological relationships into a deep neural network. We show that incorporating OSCAR into BERT's pretraining injects sufficient world knowledge to improve fine-tuned performance on three question answering datasets. The main contributions of this work are:

1. OSCAR, a regularization method for injecting ontological information and semantic composition into deep learning models;

2. Empirical evidence showing the impact of OSCAR on two tasks requiring world knowledge, causal reasoning, and discourse understanding, even with as few as 500 training examples, as well as a task requiring medical domain knowledge; and

3. An open-source implementation of OSCAR and BERT supporting mixed precision training, non-TPU model distribution, and enhanced numerical stability.

Pretraining.
The idea of training a model on a related problem before training on the problem of interest has been shown effective for many natural language processing tasks (Dai and Le, 2015; Peters et al., 2017; Howard and Ruder, 2018). More recent uses of pretraining adapt transfer learning by first training a network on a language modeling task, and then fine-tuning (retraining) that model for a supervised problem of interest (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). Pretraining in this way has the advantage that the model can build on previous parameters to reduce the amount of information it needs to learn for a specific downstream task. Conceptually, the model can be viewed as applying what it has already learned from the language modeling task when learning the downstream task.
BERT (Bidirectional Encoder Representations from Transformers) is a pretrained neural network which has been shown to obtain state-of-the-art results on eleven natural language processing tasks after fine-tuning (Devlin et al., 2018). Despite its strong empirical performance, the architecture of BERT is relatively simple: layers of transformers (Vaswani et al., 2017) are stacked to process each sentence (twelve in BERT base and twenty-four in BERT large).
Incorporating OSCAR into BERT's pretraining requires an embedded ontology and a text corpus. In our experiments we used ConceptNet 5 as our ontology, relying on the pretrained entity embeddings known as ConceptNet NumberBatch (Speer et al., 2017). Our text corpus was a 2019 dump of English Wikipedia articles with templates expanded.
Virtually all neural networks designed for natural language processing represent language as a sequence of words, subwords, or characters. By contrast, ontologies and knowledge bases encode semantic information about entities, which may correspond to individual nouns (e.g., "badge") or multiword phrases ("police officer"). Consequently, injecting world and domain knowledge from a knowledge base into the network requires semantically decomposing the information about an entity into the supporting information about its constituent words. To do this, OSCAR is implemented using a three-step approach illustrated in Figure 2:
Step 1. entities are recognized in a sentence using a Finite State Transducer (FST);
Step 2. the sequence of subwords corresponding to each entity is semantically composed to produce an entity-level encoding; and
Step 3. the average energy between the composed entity encoding and the pretrained entity encoding from the ontology is used as a regularization term in the pretraining loss function.

By training the model to compose sequences of subwords into entities, during back-propagation the semantics of each entity are decomposed and injected into the network based on the neural activations associated with its constituent words, as sketched below.
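The following minimal sketch (Python/NumPy, not the paper's actual implementation) shows how the three steps could be combined during pretraining; the callables `compose`, `project`, and `energy`, and the `weight` factor are illustrative placeholders.

```python
import numpy as np

def pretraining_loss_with_oscar(mlm_loss, subword_vectors, mentions,
                                compose, project, energy, weight=1.0):
    """Combine the three OSCAR steps into the pretraining loss.

    subword_vectors : (N, d) array of the model's subword representations.
    mentions        : (start, length, pretrained_embedding) triples from Step 1.
    compose/project/energy : stand-ins for Steps 2 and 3.
    weight          : hypothetical scaling factor; the paper adds the average
                      energy directly as a regularization term.
    """
    if not mentions:
        return mlm_loss
    energies = [energy(project(compose(subword_vectors[s:s + l])), e)
                for s, l, e in mentions]
    return mlm_loss + weight * float(np.mean(energies))
```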
Entity Detection. We recognized entities online during training and inference by (1) tokenizing each entity in our ontology using the same tokenizer used to prepare the BERT pretraining data, and (2) compiling a Finite State Transducer to detect sequences of subword IDs corresponding to entities. The FST, illustrated in Figure 3, allowed us to detect entities on-the-fly without hard-coding a specific ontology or inducing any discernible change in training or inference time. Formally, let $X = x_1, x_2, \cdots, x_N$ represent the sequence of words in a sentence. The FST processes $X$ and returns three sequences, $s_1, s_2, \cdots, s_M$; $l_1, l_2, \cdots, l_M$; and $e_1, e_2, \cdots, e_M$, representing the start offset, length, and pretrained embedded representation of every mention of any entity in the ontology. Note: including or excluding subsumed and de-masked entities (as illustrated in Figure 2) provided no discernible effect in our experiments.
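As an illustration, a simplified trie-based matcher can stand in for the compiled FST; the names below are hypothetical, and the real implementation returns pretrained NumberBatch embeddings rather than entity identifiers.

```python
def build_trie(entity_subword_ids):
    """Build a trie over the subword IDs of every ontology entity.

    entity_subword_ids: dict mapping an entity identifier to the tuple of
    subword IDs produced by the BERT tokenizer for that entity's name.
    """
    trie = {}
    for entity_id, ids in entity_subword_ids.items():
        node = trie
        for tok in ids:
            node = node.setdefault(tok, {})
        node["entity"] = entity_id            # terminal marker
    return trie

def detect_entities(subword_ids, trie):
    """Return (start, length, entity_id) triples for every mention in a sentence,
    including subsumed (overlapping) mentions."""
    mentions = []
    for start in range(len(subword_ids)):
        node, length = trie, 0
        for tok in subword_ids[start:]:
            if tok not in node:
                break
            node, length = node[tok], length + 1
            if "entity" in node:
                mentions.append((start, length, node["entity"]))
    return mentions
```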
[Figure 2 example: the pretraining sentence "British policy during the Spanish Civil War was officially that of [MASK]", in which the FST detects entities such as "British", "Policy", and "Spanish Civil War", subsumed entities (†) such as "Spanish", "Civil War", and "Intervention", and de-masked entities (‡) such as "Nonintervention".]
Figure 2: Architecture of OSCAR when injecting ontology knowledge from ConceptNet into BERT, where '†' indicates subsumed entities, '‡' indicates de-masked entities, N is the length of the input sentence, M is the number of entities detected in the sentence, and K is the number of entities with embeddings in ConceptNet.

Figure 3: Finite State Transducer (FST) used to detect entities during pretraining; each node corresponds to a word ID, and $e_i$ indicates the $i$-th pretrained entity embedding in ConceptNet's NumberBatch.

Semantic Composition. The role of semantic composition is to learn a composed representation $c_1, c_2, \cdots, c_M$ for each entity detected in $X$ such that $c_i = \operatorname{compose}(x_{s_i}, x_{s_i+1}, \cdots, x_{s_i+l_i})$. As pretraining BERT is computationally expensive, we considered three computationally efficient methods for composing words and subwords into entities.

Recurrent Additive Networks (RANs) are a simplified alternative to LSTM- or GRU-based recurrent neural networks that use only additive connections between successive layers and have been shown to obtain similar performance with 38% fewer learnable parameters (Lee et al., 2017). Given a sequence of words $x_1, x_2, \cdots, x_L$ we use the following layers to accumulate information about how the semantics of each word in an entity contribute to the overall semantics of the entity:

$$\tilde{m}_t = W_m x_t \quad \text{(1a)}$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(1b)}$$
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(1c)}$$
$$m_t = i_t \circ \tilde{m}_t + f_t \circ m_{t-1} \quad \text{(1d)}$$
$$h_t = g(m_t) \quad \text{(1e)}$$

where $[\bullet]$ represents vector concatenation, $\tilde{m}_t$ represents the content layer which encodes any new semantic information provided by word $x_t$, $\circ$ indicates an element-wise product, $i_t$ represents the input gate, $f_t$ represents the forget gate, $m_t$ represents the internal memories about the entity, and $h_t$ is the output layer encoding accumulated semantics about word $x_t$. We define the composed entity $c_i := h_{s_i + l_i}$ for the sequence beginning with $x_{s_i}$.
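A NumPy sketch of the RAN composition in Equations 1a-1e follows; the zero initial states and the choice of g = tanh are assumptions not stated in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ran_compose(x_seq, W_m, W_i, b_i, W_f, b_f, g=np.tanh):
    """Compose one entity from its subword vectors with Equations 1a-1e.

    x_seq : (L, d_in) array of subword representations for one mention.
    Shapes assumed: W_m (d, d_in); W_i, W_f (d, d + d_in); b_i, b_f (d,).
    """
    d = W_m.shape[0]
    h, m = np.zeros(d), np.zeros(d)
    for x_t in x_seq:
        m_tilde = W_m @ x_t                       # (1a) content layer
        hx = np.concatenate([h, x_t])
        i_t = sigmoid(W_i @ hx + b_i)             # (1b) input gate
        f_t = sigmoid(W_f @ hx + b_f)             # (1c) forget gate
        m = i_t * m_tilde + f_t * m               # (1d) memory update
        h = g(m)                                  # (1e) output layer
    return h   # composed entity encoding c_i
```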
Linear Recurrent Additive Networks. A second, simpler version of a RAN omits the content and output layers (i.e., Equations 1a and 1e), and Equation 1d is updated to depend on $x_t$ directly: $m_t = i_t \circ x_t + f_t \circ m_{t-1}$. We define the composed entity $c_i := m_{s_i + l_i}$ for the sequence of subwords beginning with $x_{s_i}$.
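A corresponding sketch of the Linear RAN update; the gates here are computed from the previous memory and the current subword, which is an assumption (the text leaves the gate inputs implicit once the output layer is removed).

```python
import numpy as np

def linear_ran_compose(x_seq, W_i, b_i, W_f, b_f):
    """Linear RAN composition: no content or output layers.

    x_seq must share the memory's dimensionality, since
    m_t = i_t * x_t + f_t * m_{t-1}.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    m = np.zeros(x_seq.shape[1])
    for x_t in x_seq:
        mx = np.concatenate([m, x_t])
        i_t = sigmoid(W_i @ mx + b_i)
        f_t = sigmoid(W_f @ mx + b_f)
        m = i_t * x_t + f_t * m
    return m   # composed entity encoding c_i
```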
Linear Interpolation. The third, simplest form of semantic composition represents the semantics of an entity as an unordered linear combination of the semantics of its constituent words, i.e., $c_i := W_e (x_{s_i} + x_{s_i+1} + \cdots + x_{s_i + l_i}) + l_i \cdot b_e$.

Energy Regularization. We project the composed entities into the same vector space as the pretrained entity embeddings from the ontology, and measure the average energy across all entities detected in the sentence:

$$R_{\text{OSCAR}} = \frac{1}{M} \sum_{i=1}^{M} f(W_p c_i + b_p, e_i) \quad \text{(2)}$$

where $f$ is an energy function capturing the energy between the composed entity $c_i$ and the pretrained entity embedding $e_i$. We considered three energy functions: (1) the Euclidean distance, (2) the absolute distance, and (3) the angular distance.
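A sketch of the linear composition, the three candidate energy functions, and Equation 2; reading the absolute distance as the L1 distance and the angular distance as the arccosine of cosine similarity are assumptions.

```python
import numpy as np

def linear_compose(x_seq, W_e, b_e):
    """Linear interpolation: unordered sum of subword vectors, length-scaled bias."""
    return W_e @ np.sum(x_seq, axis=0) + len(x_seq) * b_e

def euclidean_energy(c, e):
    return np.linalg.norm(c - e)

def absolute_energy(c, e):
    # "absolute distance" read here as the L1 distance
    return np.sum(np.abs(c - e))

def angular_energy(c, e):
    # scale-invariant angle between the two vectors
    cos = np.dot(c, e) / (np.linalg.norm(c) * np.linalg.norm(e) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def r_oscar(composed, pretrained, W_p, b_p, energy_fn=euclidean_energy):
    """Equation 2: average energy between the projected composed entities and
    their pretrained ontology embeddings."""
    return float(np.mean([energy_fn(W_p @ c + b_p, e)
                          for c, e in zip(composed, pretrained)]))
```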
Premise: Gina misplaced her phone at her grandparents'. It wasn't anywhere in the living room. She realized she was in the car before. She grabbed her dad's keys and ran outside.
Ending A:
She found her phone in the car.
Ending B:
She didn’t want her phone anymore.
Figure 4: Example of a Story Cloze question (correct answer is A).
Consumer Health Question:
Can sepsis be prevented. Can someone get this from a hospital?
FAQ A:
Who gets sepsis?
FAQ B:
What is the economic cost of sepsis?
Figure 5: Example of a Recognizing Question Entailment (RQE) question (correct answer is A).
We evaluated the impact of OSCAR on three question answering tasks requiring world or domain knowledge and causal reasoning.
Choice of Plausible Alternatives (CoPA) provides 500 training and 500 testing two-choice questions and requires a system to choose the most plausible cause or effect entailed by the premise, as illustrated in Figure 1 (Roemmele et al., 2011).
The Story Cloze Test evaluates story understanding, story generation, and script learning and requires a system to choose the correct ending to a four-sentence story, as illustrated in Figure 4 (Mostafazadeh et al., 2016). In our experiments we used only the 3,744 labeled stories.
Recognizing Question Entailment:
To overcome the complexity of healthcare questions, Ben Abacha and Demner-Fushman (2016) proposed to simplify clinical question answering by answering sub-questions using Recognizing Question Entailment (RQE). The RQE collection consists of 8,588 training and 302 testing pairs of consumer health questions (CHQs) and frequently asked questions (FAQs), with labels indicating whether answering the FAQ entails answering the CHQ, as illustrated in Figure 5.

Table 1 presents the results of BERT when pretrained on Wikipedia with and without OSCAR, the state-of-the-art, and the average performance of different semantic composition methods and energy functions when calculating OSCAR.
Model               CoPA    Cloze    RQE
BERT                55.2    74.200   74.834
OSCAR
SotA                71.2
OSCAR: Linear RAN

Table 1: Accuracy when fine-tuning BERT pretrained on Wikipedia data and pretrained on Wikipedia data with OSCAR.
OSCAR provided a significant improvement in accuracy for both common sense causal reasoning tasks, indicating that OSCAR was able to inject useful world knowledge into the network. While less pronounced than in the general domain, for the clinical domain OSCAR provided a modest improvement over standard BERT, and both improved over the state-of-the-art.

When comparing semantic composition methods, the Linear method had the most consistent performance across both domains; the Recurrent Additive Network (RAN) obtained the lowest performance on the general domain and the highest performance on medical texts, while the Linear RAN exhibited the opposite behavior.

In terms of energy functions, the Euclidean distance was the most consistent, the Angular distance was the best for the Story Cloze and RQE tasks, and the Absolute distance was the best for CoPA. The Angular distance (being scale invariant) is least affected by the number of subwords constituting an entity, while the Absolute distance is most affected. Consequently, we believe the Absolute distance was only effective on the CoPA evaluation because the entities in CoPA are typically very short (single words or subwords).

In future work we hope to explore the impact of incorporating different ontologies and knowledge graphs as well as alternative types of entity embeddings (Bordes et al., 2013; Lin et al., 2015; Wang et al., 2014; Nickel et al., 2011; Xiong et al., 2018).

References
Asma Ben Abacha and Dina Demner-Fushman. 2016. Recognizing question entailment for medical question answering. In AMIA 2016, American Medical Informatics Association Annual Symposium, Chicago, IL, USA, November 12-16, 2016.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2787-2795. Curran Associates, Inc.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3079-3087. Curran Associates, Inc.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

H. Paul Grice. 1975. Logic and conversation. In Syntax and Semantics 3: Speech Acts, pages 41-58.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328-339. Association for Computational Linguistics.

Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2017. Recurrent additive networks. arXiv preprint arXiv:1705.07393.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2181-2187. AAAI Press.

Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP '02, pages 63-70, Stroudsburg, PA, USA. Association for Computational Linguistics.

Christopher D. Manning. 2015. Computational linguistics and deep learning. Computational Linguistics, 41(4):701-707.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839-849. Association for Computational Linguistics.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, pages 809-816, USA. Omnipress.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756-1765. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392. Association for Computational Linguistics.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI Conference on Artificial Intelligence, pages 4444-4451.

Ole Tange. 2018. GNU Parallel 2018. Ole Tange.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc.

Q. Wang, Z. Mao, B. Wang, and L. Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724-2743.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI'14, pages 1112-1119. AAAI Press.

Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2018. One-shot relational learning for knowledge graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1980-1990. Association for Computational Linguistics.
A Hyper-parameter Tuning
A.1 Fine-tuning
For each fine-tuning task, we used a greedy approach to hyper-parameter tuning by incrementally and independently optimizing: the batch size; the initial learning rate; whether to include subsumed entities ∈ {yes, no}; and whether to include masked entities ∈ {yes, no}.

For CoPA, the Story Cloze task, and RQE we found the same optimal batch size (16) and initial learning rate. We also found that including subsumed and masked entities was optimal, at a small net improvement in accuracy.
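One possible reading of this greedy procedure, sketched with illustrative names and placeholder candidate values (the paper's candidate sets are not all legible in the source):

```python
def greedy_tune(train_and_eval, search_space, defaults):
    """Greedy tuning: optimize one hyper-parameter at a time, keeping the best
    value found for each before moving on to the next.

    train_and_eval(config) -> validation accuracy; all names are illustrative.
    """
    best = dict(defaults)
    for name, candidates in search_space.items():
        scored = {value: train_and_eval({**best, name: value}) for value in candidates}
        best[name] = max(scored, key=scored.get)
    return best

# Hypothetical usage:
# best = greedy_tune(run_finetuning,
#                    {"batch_size": [8, 16, 32],
#                     "subsumed_entities": ["yes", "no"],
#                     "masked_entities": ["yes", "no"]},
#                    defaults={"batch_size": 16,
#                              "subsumed_entities": "yes",
#                              "masked_entities": "yes"})
```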
A.2 Pretraining

We pretrained BERT using a 2019 Wikipedia dump formatted for Wikipedia's Cirrus search engine. Preprocessing relied on NLTK's Punkt sentence segmenter (Loper and Bird, 2002) and the WordPiece subword tokenizer provided with BERT. We used the vocabulary from BERT base (not large) and a maximum sequence length of 384 subwords, training with 320 warm-up steps.
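A minimal sketch of this preprocessing pipeline using NLTK's Punkt segmenter; the WordPiece tokenizer is passed in as a callable rather than reproduced here, and the function name is illustrative.

```python
import nltk

def preprocess_article(text, wordpiece_tokenize, max_len=384):
    """Segment an article into sentences and WordPiece subwords."""
    nltk.download("punkt", quiet=True)            # Punkt sentence model
    examples = []
    for sentence in nltk.sent_tokenize(text):
        subwords = wordpiece_tokenize(sentence)
        examples.append(subwords[:max_len])       # enforce the max sequence length
    return examples
```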
B BERT Modifications

We used a modified version of BERT allowing for mixed precision training. This necessitated a number of minor changes to improve numerical stability around softmax operations. Training was performed on single nodes with 4 Tesla P100s each (multiple variants of OSCAR were trained simultaneously using five such nodes at a time). Non-TPU multi-GPU support was added to BERT based on Horovod (https://eng.uber.com/horovod/) and relying on Open MPI.
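For illustration, one common numerical-stability pattern around softmax in mixed precision training (not necessarily the exact modification used here):

```python
import numpy as np

def stable_softmax(logits, axis=-1):
    """Softmax computed in float32 with max-subtraction for numerical stability."""
    logits32 = np.asarray(logits, dtype=np.float32)             # upcast from float16
    shifted = logits32 - np.max(logits32, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)
```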
C State-of-the-Art

State-of-the-art was determined using the official leaderboards maintained by the task organizers. The RQE evaluation has not concluded; as such, we use the organizers' baseline as the state-of-the-art.
D OSCAR Pretraining Recommendations
When pretraining BERT (or another model) with OSCAR, we make the following recommendations:

1. ignore subsumed entities: including subsumed entities provided only a marginal average relative improvement in accuracy at a 10% increase in total training time;

2. ignore masked entities: de-masking entities had little overall impact on model performance (a marginal average relative improvement) and no discernible effect on training time;

3. use linear composition, as it exhibits consistent performance and requires 50% less training time than the RAN and 40% less than the Linear RAN; and

4. select the energy function based on the average length of entities in the fine-tuning tasks: Angular distance with long entities, Absolute distance with short entities, and Euclidean distance with varied entities.

E Stanford Question Answering Dataset
We evaluated the impact of OSCAR on the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al. 2016), version 1.1, and observed no discernible change in performance between BERT pretrained on Wikipedia without OSCAR and with OSCAR. The lack of impact of OSCAR for SQuAD is unsurprising, as the vast majority of SQuAD questions can be answered directly by surface-level information in the text. However, the lack of impact shows that injecting world knowledge with OSCAR does not come at the expense of model performance for tasks that require little external knowledge.

CoPA's "leaderboard" is available at http://people.ict.usc.edu/~gordon/copa.html. The Story Cloze leaderboard is available at https://competitions.codalab.org/competitions/15333.