AutoKG: Constructing Virtual Knowledge Graphs from Unstructured Documents for Question Answering
Seunghak Yu, Tianxing He, and James Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
{seunghak,cloudygoose,glass}@csail.mit.edu

Abstract
Knowledge graphs (KGs) are relevant to many NLP tasks, but building a reliable domain-specific KG is time-consuming and expensive. A number of methods for constructing KGs with minimal human intervention have been proposed, but they still require a process to align extractions with a human-annotated knowledge base. To overcome this issue, we propose a novel method to automatically construct a KG from unstructured documents that does not require external alignment, and we explore its use to extract desired information. To summarize our approach: we first extract knowledge tuples in their surface form from unstructured documents, encode them using a pre-trained language model, and link the surface-entities via the encodings to form the graph structure. We perform experiments with benchmark datasets such as WikiMovies and MetaQA. The experimental results show that our method can successfully create and search a KG built from 18K documents and achieve 69.7% hits@10 (close to an oracle model) on a query retrieval task.
Knowledge graphs (KGs) have been a key component of question-answering (QA) systems (Hao et al., 2017; Chen et al., 2019; Deng et al., 2019; Sun et al., 2019). Given a query, a QA system will typically first retrieve relevant triples by traversing the KG, and then form an answer from the retrieved information. The KG structure not only enables the system to do multi-hop reasoning, but also provides a nice interpretability mechanism to inform users about how an answer is extracted.

However, creating a human-annotated KG is expensive, typically requiring large amounts of expert labor. While there exist large general-purpose knowledge bases such as FreeBase (Bollacker et al., 2008), human knowledge is continually expanding, so knowledge bases need continuous refinement or will tend to become outdated. To overcome this issue, many approaches have attempted to build KGs automatically (Dong et al., 2014; Bosselut et al., 2019), but the steps required to align extracted knowledge with pre-existing knowledge sources remain necessary, so expert labor is still required.

Given these issues, it is attractive for a QA system to use only unstructured documents (Chen et al., 2017; Wang et al., 2018; Clark and Gardner, 2018), since this approach has the potential to exclude the need for human labor. However, it loses the nice properties of the KG structure. In addition, most recent work in this direction (Feldman and El-Yaniv, 2019; Karpukhin et al., 2020) still relies on an existing KB.

In this work, we aim to build a system combining advantages from both worlds, by building a virtual KG from unstructured documents. Our approach has the following features: (1) It does not rely on any expert labor or existing knowledge base. (2) It does not rely on labeled training data (e.g. QA pairs). (3) By structuring documents as a KG, we enable a mechanism for interpretability during multi-hop reasoning.
Task Setup:
Our approach first builds a knowledge graph (KG) from a set of unstructured documents {D_1, ..., D_N}, where each document D is a sequence of sentences {S^D_1, ..., S^D_M}. Note that in the formulations we assume each document consists of M sentences just for notational convenience. Given a query q in the form of a natural-language question, our system traverses the KG, and the final output is a retrieval of the top-k most relevant paths (we define the notion of a path in Section 2.2). Finally, we judge the system's performance by checking whether the reference answer is included in any of the retrieved paths.

Sentence: The Goonies is an American film directed by Richard Donner.

sf-entity1        | sf-relation | sf-entity2
The Goonies       | is          | an American film
an American film  | directed    | by Richard Donner

Table 1: An example of the transformation from raw text to entity-relation triples by OpenIE. In the table, "sf" stands for "surface".

Our framework is roughly composed of three phases: (1) build a KG from a potentially large number of unstructured documents; (2) given a query, traverse the KG; (3) retrieve the top-k most relevant paths. In the following sections, we describe each phase in more detail.

Creating a graph from raw text is at the core of this work. We use a three-step process to do this: conversion, encoding, and surface-entity linking.
Conversion:
We start by applying OpenIE (https://openie.allenai.org/) to every sentence S to generate a list of entity-relation triples. We use the terms "surface-entity" (se) and "surface-relation" (sr) because they are not predefined entities/relations. For each document D, we merge the triples extracted from each sentence into a list {(se^1_i, sr_i, se^2_i)}. Note that OpenIE extraction is imperfect: in Table 1, for example, "by Richard Donner" is extracted instead of the correct entity "Richard Donner". More importantly, different surface-entities can refer to the same underlying entity. We address these problems in the next two steps.
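As a concrete (and heavily simplified) sketch of this conversion step, the snippet below shows the data structures involved. The `extract_triples` stub is a hypothetical stand-in for a real OpenIE call; it simply returns the Table 1 example output for its one known sentence.

```python
from typing import List, Tuple

# A surface triple keeps the raw spans exactly as extracted,
# with no alignment to a predefined entity vocabulary.
SurfaceTriple = Tuple[str, str, str]  # (sf-entity1, sf-relation, sf-entity2)

def extract_triples(sentence: str) -> List[SurfaceTriple]:
    """Stand-in for an OpenIE call. A real system would invoke an
    OpenIE extractor here; this stub returns the Table 1 example."""
    canned = {
        "The Goonies is an American film directed by Richard Donner.": [
            ("The Goonies", "is", "an American film"),
            ("an American film", "directed", "by Richard Donner"),
        ]
    }
    return canned.get(sentence, [])

def document_to_triples(document: List[str]) -> List[SurfaceTriple]:
    """Merge the triples extracted from every sentence of one document."""
    triples: List[SurfaceTriple] = []
    for sentence in document:
        triples.extend(extract_triples(sentence))
    return triples

doc = ["The Goonies is an American film directed by Richard Donner."]
print(document_to_triples(doc))  # two surface triples, as in Table 1
```

Note that, as in the paper, the second triple keeps the imperfect span "by Richard Donner"; no attempt is made to normalize it at this stage.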
Encoding:
In the second step we utilize the BERT model (Devlin et al., 2019) to encode each surface-entity or surface-relation from sentence S. We first form word-level embeddings by adding the associated word-piece embeddings from the model. The challenge is then to incorporate contextualized information into the encoding (e.g. the word "Apple" has different meanings in different contexts). Inspired by Clark et al. (2019), we adopt a "weighted embedding" technique, where the encoding of each surface-entity/relation is a simple weighted summation of its word embeddings and the output embedding of the final [CLS] token when S is fed into the BERT model. We refer readers to the spacy-transformers toolkit (https://explosion.ai/) for details. We denote the resulting encoding of se as se_enc.

Surface-entity Linking:
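To illustrate the idea (not the exact spacy-transformers weighting scheme), here is a minimal sketch in which a span encoding is a weighted sum of its word vectors plus the sentence-level [CLS] vector; the uniform per-word weights and the `cls_weight` value are illustrative assumptions.

```python
from typing import Dict, List

Vector = List[float]

def span_encoding(span_words: List[str], word_vecs: Dict[str, Vector],
                  cls_vec: Vector, cls_weight: float = 0.5) -> Vector:
    """Encode a surface-entity/relation span as a weighted sum of its
    word embeddings plus the sentence's final [CLS] embedding.
    Uniform word weights and cls_weight=0.5 are illustrative choices,
    not the exact scheme used by spacy-transformers."""
    dim = len(cls_vec)
    enc = [0.0] * dim
    w = (1.0 - cls_weight) / len(span_words)  # uniform word weights
    for word in span_words:
        vec = word_vecs[word]
        for d in range(dim):
            enc[d] += w * vec[d]
    for d in range(dim):
        enc[d] += cls_weight * cls_vec[d]  # contextual signal from [CLS]
    return enc

# Toy 2-d vectors standing in for BERT embeddings.
word_vecs = {"Richard": [1.0, 0.0], "Donner": [0.0, 1.0]}
cls_vec = [0.2, 0.2]
print(span_encoding(["Richard", "Donner"], word_vecs, cls_vec))  # ≈ [0.35, 0.35]
```

The [CLS] term is what lets two occurrences of the same surface string (e.g. "Apple") receive different encodings in different sentences.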
The third and final KG-building step performs surface-entity linking, which creates a graph structure out of the extracted entity-relation triples in a document D. The goal is to link entities with the same underlying concept together. For example, continuing Table 1, a new surface-entity "the film" in some follow-up sentence could be referring to "the Goonies", so the relations attached to "the film" should also apply to "the Goonies". In this work, we use an adaptive threshold on the cosine similarity between the encodings of surface-entities. The intuition is that if there exists an se_l with high similarity to se_i, then the acceptable similarity threshold for se_i should be higher. We denote the set of surface-entities linked to se_i as Link(se_i), formulated below:

Link(se_i) = { se_j : cos(se_enc_i, se_enc_j) ≥ λ · max_{l ∈ E_D} cos(se_enc_i, se_enc_l) }    (1)

Note that E_D denotes the set of all surface-entities existing in document D. λ is a hyperparameter controlling the adaptive threshold, and we found that a setting of 0.6 works well in our experiments.

To summarize, after the above three steps, for each document we have a list of extracted entity-relation triples {(se^1_i, sr_i, se^2_i)}, and each surface-entity/relation has a contextualized encoding. Within every document, each surface-entity se_i is linked to Link(se_i).

In this stage, we traverse the constructed KG to find information relevant to a given query q. To start the traversal, we first select a set of seed surface-entities in the virtual graph as the start points of our traversal. In most cases, we simply use the set of surface-entities that appear in the query q.
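A small sketch of the adaptive-threshold linking of Eq. (1), with a pure-Python cosine; we assume the max in Eq. (1) runs over the other surface-entities of the document (excluding se_i itself, otherwise the threshold would collapse to a fixed λ).

```python
import math
from typing import Dict, List, Set

def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def link(se_i: str, encodings: Dict[str, List[float]], lam: float = 0.6) -> Set[str]:
    """Link(se_i) per Eq. (1): accept se_j when cos(enc_i, enc_j) clears
    an adaptive threshold, lam times the best similarity that se_i attains
    against the other surface-entities of the same document.
    (Excluding se_i from the max is our assumption.)"""
    enc_i = encodings[se_i]
    others = {se: enc for se, enc in encodings.items() if se != se_i}
    best = max(cosine(enc_i, enc) for enc in others.values())
    return {se for se, enc in others.items() if cosine(enc_i, enc) >= lam * best}

# Toy 2-d encodings standing in for the BERT-based se_enc vectors.
encs = {"the Goonies": [1.0, 0.0],
        "the film": [0.9, 0.3],
        "Richard Donner": [0.0, 1.0]}
print(link("the Goonies", encs))  # {'the film'}
```

With these toy vectors, "the film" clears the adaptive threshold while "Richard Donner" does not, so only the coreferent mention gets linked.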
If that set is empty, we encode q with BERT as q_enc, and use the surface-entity whose encoding has the largest cosine similarity with q_enc as the seed entity.

Starting with the seed surface-entities, we adopt an expand-and-prune strategy that is similar in spirit to breadth-first search. For each hop, we expand the current set of paths first via surface-entity linking, and then via the entity-relation triples. The detailed traversal procedure is shown in Algorithm 1, and we provide an illustration in Figure 1.

Figure 1: An illustration of a 2-hop traversal on the constructed KG for the query "When was the director of the Goonies born?" (reference answer: 1930). Each node represents a surface-entity; "the Goonies" is used as the seed entity, and the shaded area represents the linked seed-entities. The pruning process is not illustrated in this figure.

Algorithm 1 Traversing the KG
Input: a set of seed surface-entities {se^seed_i} and a query embedding q_enc
Output: a set of traversed paths
For each seed surface-entity se^seed_i, initialize an empty path p_i = [], and denote the set of paths as P.
for each hop do
    set P̂ as an empty set
    for each p ∈ P do
        let se^tail_p be the last surface-entity in p
        for each se_i ∈ Link(se^tail_p) do
            for each triple (se_i, sr', se') in the KG which starts with se_i do
                add p̂ = p + [se_i, sr', se'] to P̂
            end for
        end for
    end for
    prune P̂ based on cos(p̂_enc, q_enc) with a beam size B
    set P̂ as the new P
end for
return P

Since the number of active paths could grow exponentially during this expansion, we design an importance score to rate and prune the paths. For each path p, we concatenate its surface-entities/relations with a period between each triple, and feed the result to the BERT model to get an encoding p_enc of the path. We use the cosine similarity between p_enc and q_enc as the importance score. After the expansion of each hop, we keep only the B most relevant paths. In our experiments, we find that setting B to 10 is enough to achieve good performance.

The product of the traversal stage is a set of paths. We use the list of surface-entities/relations traversed to represent a path: for example, p = [se_1, sr_1, se_2, se_3, sr_2, se_4], where se_3 ∈ Link(se_2), is a two-hop path. Finally, we use the cosine similarity between p_enc and q_enc to select the final top-k paths as the output.

To quantitatively evaluate our model, we adopt two popular QA benchmark datasets: WikiMovies (Miller et al., 2016; https://research.fb.com/downloads/) and MetaQA (Zhang et al., 2018; https://github.com/yuyuz/MetaQA). Since these datasets consist of pairs of questions and answers together with the related Wikipedia articles, we can leverage the entire set of articles to build a knowledge graph and use our system to find the information that contains the correct answer to a given question. WikiMovies and MetaQA use the same 18,128 movie-domain Wikipedia articles, but have different types of questions. Since our framework does not need training, we do not use the training data. We search the hyperparameter space using the WikiMovies dev data (10K), and evaluate the model with the test data. The test data consist of 9,952/14,872/14,274 QA pairs that require 1-hop (WikiMovies), 2-hop, and 3-hop (MetaQA) inference, respectively.

Dataset                 h    H@1    H@5    H@10
(Miller et al., 2016)   N/A  68.30
1-hop QA (WikiMovies)   1    43.36  64.40  69.67
Oracle                  N/A  21.13  39.82  48.01
3-hop QA (MetaQA)       1    18.27  31.48  35.50
                        2    20.91  27.67  34.07
                        3    21.48  32.71  38.53
Oracle                  N/A  22.01  43.34  59.37

Table 2: Experimental results as the number of hops used to navigate the graph varies. h is the number of hops the model traverses in the KG. The baseline is not a retrieval result, but the result of a reasoning model.

Finding a baseline to compare the performance of our model against is not straightforward, because our system differs from conventional approaches in several ways: (1) Unlike existing automatic knowledge-building methods that obtain results using external knowledge, we do not require external knowledge to align extractions. (2) Unlike learning-based retrieval models that require large amounts of labeled data to train the model, we do not require data-specific training. Miller et al. (2016) also build up a KB via an information extraction pipeline; however, they use an existing KB to clean out unrecognized entities. Moreover, they train their memory network model with 96K QA pairs that we did not use, and test their model with only 1-hop questions. Although their performance therefore cannot be directly compared to our setup, we nevertheless use it as a baseline performance indicator. As another performance metric, we set up an oracle system that uses Okapi BM25 to search for the top-k documents for a given query; if any of the resulting documents contain the answer, we consider it to be a hit.

We present the main results in Table 2, which shows that our model obtains reasonable performance considering that we did not rely on any existing KB or additional training data. We achieve 69.67 / 49.71 / 38.53 hits@10 for 1-hop QA (WikiMovies), 2-hop, and 3-hop QA (MetaQA), respectively. Hits@k (Bordes et al., 2013) is the fraction of questions for which the top-k predicted paths contain the answer. In most cases, our performance comes close to the oracle model, and in the case of MetaQA (2-hop), our model outperforms the oracle. As expected, performance is better when the number of hops used to navigate the graph is more than the number of hops needed to find the answer. Example predictions are shown in Table 3. One interesting observation is that in a significant number of failure cases, our model actually finds the right answer but is deemed wrong because the reference label is incomplete. We therefore believe that the performance of our model is being underestimated.

Dataset     Example
WikiMovies  Question: What does Jeremy Piven act in?
(1-hop)     Golden Label: So Undercover, Keeping Up with the Steins, Just Write
            Predicted Paths (top-3):
            1. White Palace film: The movie features Jeremy Piven.
            2. The Kingdom film: starring Jason Bateman, with Jeremy Piven is fictional.
            3. Very Bad Things: (...) stars Christian Slater, with Jeremy Piven in supporting roles.
MetaQA      Question: Who appeared in the same movie with Angie Everhart?
(2-hop)     Golden Label: Erika Eleniak, Dennis Miller
            Predicted Paths (top-3):
            1. Bordello of Blood: (...) starring Angie Everhart. a 1996 comedy horror film starring Dennis Miller.
            2. Bordello of Blood: (...) starring Angie Everhart. a 1996 comedy horror film starring Erika Eleniak.
            3. Bordello of Blood: (...) starring Angie Everhart. a 1996 comedy horror film starring Chris Sarandon.

Table 3: Examples where the model found the correct answers but is considered incorrect due to the missing data labels. Italics mark the ground-truth answers; bold marks words that should have been counted as correct.
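To make the expand-and-prune loop of Algorithm 1 concrete, here is a toy sketch. The KG, the link table, and the word-overlap scorer are stand-ins for the real extracted triples and the cos(p_enc, q_enc) score; as a small simplification, when the tail entity itself (rather than a linked entity) is expanded, we do not repeat it in the path.

```python
from typing import Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Path = List[str]

def traverse(seeds: List[str], kg: List[Triple], links: Dict[str, Set[str]],
             score: Callable[[Path], float], hops: int = 2, beam: int = 10) -> List[Path]:
    """Expand-and-prune traversal in the spirit of Algorithm 1.
    Each hop expands every path first via surface-entity linking, then via
    the entity-relation triples, and keeps only the `beam` best paths."""
    paths: List[Path] = [[s] for s in seeds]
    for _ in range(hops):
        expanded: List[Path] = []
        for p in paths:
            tail = p[-1]
            # expand via the tail itself plus its linked surface-entities
            for se in {tail} | links.get(tail, set()):
                for head, rel, obj in kg:
                    if head == se:
                        hop = [rel, obj] if se == tail else [se, rel, obj]
                        expanded.append(p + hop)
        expanded.sort(key=score, reverse=True)
        paths = expanded[:beam]  # prune to beam size B
    return paths

# Toy KG for "When was the director of the Goonies born?" (answer: 1930).
kg = [("the Goonies", "directed by", "Richard Donner"),
      ("Richard Donner", "was born in", "1930"),
      ("the Goonies", "released in", "1985")]
query_words = set("when was the director of the goonies born".split())
# Stand-in scorer: count path words that also occur in the query.
score = lambda p: sum(w in query_words for seg in p for w in seg.lower().split())
print(traverse(["the Goonies"], kg, {}, score))
```

After two hops, the surviving path reaches "1930" via "Richard Donner", mirroring the Figure 1 example; swapping the overlap scorer for a BERT-based cosine score recovers the paper's setup.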
Automatic knowledge graph construction:
In most research on creating knowledge graphs from unstructured text without human intervention, the popular approach is to develop a pipeline of NLP operations such as named entity recognition, entity linking, and relation extraction (Wu et al., 2019). These approaches require a predefined knowledge base to align the extracted entities or relationships (Lin et al., 2016; Zhou et al., 2016; Zhang et al., 2019; Cao et al., 2020). Unlike existing methods, our model can directly handle these tasks with surface forms extracted from unstructured documents.
Graph-based multi-hop retrievers:
To reason over documents and extract the desired information, it is necessary to combine information from multiple sentences or documents. To achieve this, Sun et al. (2019) build a question-relevant sub-graph from the knowledge base or text corpus to gather all the relevant information. This is similar to our approach in that it creates question-related sub-graphs, but differs from ours in that it creates graphs using a predefined KB. Das et al. (2019) and Asai et al. (2019) construct a Wikipedia graph using the hyperlinks within articles to extract paragraphs related to the query. Their methods contrast with ours in that human-annotated hyperlinks are essential and the minimum unit of information to be searched is a paragraph.
We propose a novel method to automatically build a knowledge graph from unstructured documents, without having to align the resulting entities with external information. Our method successfully constructs a KG from 18K documents. The performance of our system is 69.7% hits@10, which is close to an oracle model. In the future, we plan to improve multi-hop / multi-document retrieval by introducing a trainable re-ranking module.
Acknowledgments
Research was sponsored by the United States Air Force Research Laboratory and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
References
Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2019. Learning to retrieve reasoning paths over wikipedia graph for question answering. In International Conference on Learning Representations.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779.

Ermei Cao, Difeng Wang, Jiacheng Huang, and Wei Hu. 2020. Open knowledge enrichment for long-tail entities. In Proceedings of The Web Conference 2020, pages 384–394.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879.

Yu Chen, Lingfei Wu, and Mohammed J. Zaki. 2019. Bidirectional attentive memory networks for question answering over knowledge bases. In Proceedings of NAACL-HLT, pages 2913–2923.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 845–855.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Rajarshi Das, Ameya Godbole, Dilip Kavarthapu, Zhiyu Gong, Abhishek Singhal, Mo Yu, Xiaoxiao Guo, Tian Gao, Hamed Zamani, Manzil Zaheer, et al. 2019. Multi-step entity-centric information retrieval for multi-hop question answering. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 113–118.

Yang Deng, Yuexiang Xie, Yaliang Li, Min Yang, Nan Du, Wei Fan, Kai Lei, and Ying Shen. 2019. Multi-task learning with multi-view attention for answer selection and knowledge base question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6318–6325.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 601–610.

Yair Feldman and Ran El-Yaniv. 2019. Multi-hop paragraph retrieval for open-domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2296–2309, Florence, Italy. Association for Computational Linguistics.

Yanchao Hao, Yuanzhe Zhang, Kang Liu, Shizhu He, Zhanyi Liu, Hua Wu, and Jun Zhao. 2017. An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 221–231.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1400–1409.

Haitian Sun, Tania Bedrax-Weiss, and William Cohen. 2019. PullNet: Open domain question answering with iterative retrieval on knowledge bases and text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2380–2390.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018. R^3: Reinforced ranker-reader for open-domain question answering. In Thirty-Second AAAI Conference on Artificial Intelligence.

Xindong Wu, Jia Wu, Xiaoyi Fu, Jiachen Li, Peng Zhou, and Xu Jiang. 2019. Automatic knowledge graph construction: A report on the 2019 ICDM/ICBK contest. In 2019 IEEE International Conference on Data Mining (ICDM), pages 1540–1545. IEEE.

Yuyu Zhang, Hanjun Dai, Zornitsa Kozareva, Alexander J. Smola, and Le Song. 2018. Variational reasoning for question answering with knowledge graph. In Thirty-Second AAAI Conference on Artificial Intelligence.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212.