Better Call the Plumber: Orchestrating Dynamic Information Extraction Pipelines
Mohamad Yaser Jaradeh¹, Kuldeep Singh², Markus Stocker³, Andreas Both⁴, and Sören Auer³

¹ L3S Research Center, Leibniz University Hannover, Germany ([email protected])
² Zerotha Research & Cerence GmbH, Germany ([email protected])
³ TIB Leibniz Information Centre for Science and Technology, Germany ({markus.stocker, auer}@tib.eu)
⁴ Anhalt University of Applied Sciences, Germany ([email protected])
Abstract. We propose Plumber, the first framework that brings together the research community's disjoint information extraction (IE) efforts. The Plumber architecture comprises 33 reusable components for various Knowledge Graph (KG) information extraction subtasks, such as coreference resolution, entity linking, and relation extraction. Using these components, Plumber dynamically generates suitable information extraction pipelines and offers overall 264 distinct pipelines. We study the optimization problem of choosing suitable pipelines based on input sentences. To do so, we train a transformer-based classification model that extracts contextual embeddings from the input and finds an appropriate pipeline. We study the efficacy of Plumber for extracting KG triples using standard datasets over two KGs: DBpedia and the Open Research Knowledge Graph (ORKG). Our results demonstrate the effectiveness of Plumber in dynamically generating KG information extraction pipelines, outperforming all baselines agnostic of the underlying KG. Furthermore, we provide an analysis of collective failure cases, study the similarities and synergies among the integrated components, and discuss their limitations.
Keywords: Information Extraction · NLP Pipelines · Software Reusability · Semantic Search · Semantic Web
1 Introduction

In the last decade, publicly available KGs (e.g., DBpedia [2] and Wikidata [42]) have become rich sources of structured content used in various applications, including Question Answering (QA), fact checking, and dialog systems [39,4]. The research community developed numerous approaches to extract triple statements [44], keywords/topics [9], tables [45,23,22], or entities [35,36] from unstructured text to complement KGs. Despite extensive research, public KGs are not exhaustive and require continuous effort to align newly emerging unstructured information to the concepts of the KGs.
Research Problem:
This work was motivated by an observation of recent approaches [35,45,15] that automatically align unstructured text to structured data on the Web. Such approaches are not viable in practice for extracting and structuring information because they only address very specific subtasks of the overall KG information extraction problem. Consider the exemplary sentences "Rembrandt painted The Storm on the Sea of Galilee. It was painted in 1633." (cf. Figure 1). To extract statements aligned with the DBpedia KG from the given sentences, a system must first recognize the entity and relation surface forms in the first sentence. The second sentence requires the additional step of coreference resolution, where It must be mapped to the correct entity surface form (namely, The Storm on the Sea of Galilee). The last step requires the mapping of entity and relation surface forms to the respective DBpedia entities and predicates. There has been extensive research in aligning concepts in unstructured text to KGs, including entity linking [15,18], relation linking [36,38,4], and triple classification [14]. However, these efforts are disjoint, and little has been done to align unstructured text to complete KG triples (i.e., represented as subject, predicate, object) [25]. Furthermore, many entity and relation linking tools have been reused in pipelines of QA systems [39,26]. The literature suggests that once the different approaches put forward by the research community are combined, the resulting pipeline-oriented integrated systems can outperform monolithic end-to-end systems [27]. For the KG information extraction task, however, to the best of our knowledge, approaches aiming at dynamically integrating and orchestrating various existing components do not exist.
Objective and Contributions:
Based on these observations, we build a framework that enables the integration of previously disjoint efforts on the KG information extraction task under a single umbrella. We present the Plumber framework (cf. Figure 2) for creating information extraction pipelines. Plumber integrates 33 reusable components released by the research community for the subtasks entity linking (EL), relation linking (RL), text triple extraction (TE) (subject, predicate, object), and coreference resolution (CR). Overall, there are 264 different composable KG information extraction pipelines, generated from the possible combinations of the available 33 components: for DBpedia, 3 CRs, 8 TEs, and 10 EL/RLs give 3*8*10=240 pipelines, and the ORKG components give 4*3*2=24; hence, 240+24=264 pipelines in total.
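The size of this composition space follows directly from the per-task component counts. Below is a minimal sketch in Python; the component names are placeholders for illustration, not the actual Plumber registry:

from itertools import product

# Illustrative component pools per KG (placeholder names; the real
# Plumber registry holds 33 concrete components overall).
POOLS = {
    "DBpedia": {"CR": [f"cr{i}" for i in range(3)],
                "TE": [f"te{i}" for i in range(8)],
                "EL_RL": [f"link{i}" for i in range(10)]},
    "ORKG":    {"CR": [f"cr{i}" for i in range(4)],
                "TE": [f"te{i}" for i in range(3)],
                "EL_RL": [f"link{i}" for i in range(2)]},
}

def generate_pipelines(pool):
    """Every pipeline combines one CR, one TE, and one EL/RL component."""
    return list(product(pool["CR"], pool["TE"], pool["EL_RL"]))

pipelines = {kg: generate_pipelines(p) for kg, p in POOLS.items()}
# 3*8*10 = 240 for DBpedia and 4*3*2 = 24 for the ORKG, i.e., 264 overall
assert sum(len(v) for v in pipelines.values()) == 264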
Plumber implements a transformer-based classification algorithm that intelligently chooses the best pipeline based on the unstructured input text. We perform an exhaustive evaluation of Plumber on the two large-scale KGs DBpedia and the Open Research Knowledge Graph (ORKG) [24] to investigate the efficacy of Plumber in creating KG triples from unstructured text. We demonstrate that, independent of the underlying KG, Plumber can find and assemble different extraction components to produce better suited KG triple extraction pipelines, significantly outperforming existing baselines. In summary, we provide the following novel contributions: i) The Plumber framework is the first of its kind for dynamically assembling and evaluating information extraction pipelines based on sequence classification techniques for a given input text. Plumber is easily extensible and configurable, thus enabling the rapid creation and adjustment of new information extraction components and pipelines. Researchers can also use the framework for running IE components independently for specific subtasks such as triple extraction and entity linking. ii) A collection of 33 reusable IE components that can be combined to create 264 distinct IE pipelines. iii) An exhaustive evaluation and a detailed ablation study of the integrated components and composed pipelines on various input texts, which will guide future research on collaborative KG information extraction.
We motivate our work with a running example: the sentences "Rembrandt painted The Storm on the Sea of Galilee. It was painted in 1633." Multiple steps are required to extract formally represented statements from this text. First, the pronoun It in the second sentence should be replaced by The Storm on the Sea of Galilee using a coreference resolver. Next, a triple extractor should extract the correct text triples from the natural language text. Finally, the entity and relation surface forms must be linked to the KG, i.e., dbr:Rembrandt_van_Rijn for Rembrandt and dbr:The_Storm_on_the_Sea_of_Galilee for The Storm on the Sea of Galilee, and for relations: dbo:artist for painted and dbp:year for painted in. Figure 1 illustrates our running example and shows three Plumber IE pipelines with different results.

Fig. 1. Three example information extraction pipelines showing different results for the same text snippet ("Rembrandt painted The Storm on the Sea of Galilee. It was painted in 1633."). Each pipeline consists of coreference resolution, triple extraction, and entity/relation linking components (e.g., Stanford Coref Resolver, OpenIE, EARL).

In Pipeline 1, the coreference resolver is unable to map the pronoun It to the respective entity in the previous sentence. Moreover, the triple extractor generates incomplete triples, which also hinders the task of the entity and relation linker in the last step. Pipeline 2 uses a different set of components, and its output differs from the first pipeline. Here, the coreference resolution component is able to correctly co-relate the pronoun It to The Storm on the Sea of Galilee and to extract the text triple correctly. However, the overall result is only partially correct because the second triple is not extracted. Also, the linking component is not able to spot the second entity. Pipeline 3 correctly extracts both triples. This pipeline employs the same component as the second pipeline for coreference resolution but also includes an additional information extraction component (i.e., ReVerb [16]) and a joint entity and relation linking component, namely Falcon [35]. With this combination of components, the text triple extractors were able to compensate for the loss of information in the second pipeline by adding one more component. Using the extracted text triples, the last component of the pipeline, a joint entity and relation linking tool, can map both triple components correctly to the corresponding KG entities.

The remainder of this article is organized as follows. Related work is reviewed in Section 2. Section 3 presents Plumber, which is extensively evaluated in Section 4. Section 5 discusses the results, and Section 6 concludes and outlines directions for future work.
2 Related Work

In the last decade, many open source tools have been released by the research community to tackle IE tasks for KGs. These IE components are used not only for end-to-end KG triple extraction but also for various other tasks:
Text Triple Extraction: The task of open information extraction is a well-studied task in the NLP community [1]. It relies on NER (Named Entity Recognition) and RE (Relation Extraction). SalIE [33] uses MinIE [21] in combination with PageRank and clustering to find facts in the input text. Furthermore, OpenIE [1] leverages linguistic structures to extract self-contained clauses from text. A comprehensive survey by Niklaus et al. [32] provides details about such techniques.
Entity and Relation Linking: Entity and relation linking is a widely studied topic in the NLP, Web, and Information Retrieval research communities [3,4,11]. Often, entity and relation linking are performed independently. DBpedia Spotlight [10] is one of the first approaches for entity recognition and disambiguation over DBpedia. TagMe [18] links entities to DBpedia using in-link matching to disambiguate candidate entities. Other tools, such as RelMatch [38], do not perform entity linking and only focus on linking the relations in the text to the corresponding KG relations. RECON [4] uses graph neural networks to map relations between entities, with the assumption that the entities are already linked in the text. EARL [15] is a joint linking tool over DBpedia that models the task as a generalized traveling salesperson problem. Sakor et al. [35] proposed Falcon, a linguistic-rules-based tool for joint entity and relation linking over DBpedia.
Coreference Resolution: This task is used in conjunction with other tasks in NLP pipelines to disambiguate text and resolve syntactic complexities. The Stanford Coreference Resolver [34] uses a multi-pass sieve of deterministic coreference models. Clark and Manning [8] use reinforcement learning to fine-tune a neural mention-ranking model for coreference resolution. More recently, Sanh et al. [37] addressed coreference resolution within a hierarchical multi-task learning approach.
Frameworks and Dynamic Pipelines: There have been a few attempts in various domains aiming to consolidate the disjoint efforts of the research community under a single umbrella for solving a particular task. The Gerbil platform [41] provides an easy-to-use web-based platform for the agile comparison of entity linking tools using multiple datasets and uniform measuring approaches. OKBQA [26] is a community effort for the development of multilingual open knowledge bases and QA systems. Frankenstein [39] integrates 24 QA components to build QA systems collaboratively on top of the Qanary integration framework [6]. Other ETL pipeline systems exist, such as Apache NiFi; Semantic Web Pipes [31] and LarKC [17] are other prominent examples.
End-to-End Extraction Systems: More recently, end-to-end systems have gained attention due to the rise of deep learning techniques. Such systems draw on the strengths of deep models and transformers [13,29]. Kertkeidkachorn and Ichise [25] present an end-to-end system to extract triples and link them to DBpedia. Other attempts such as KG-BERT [44] leverage deep transformers (i.e., BERT [13]) for the triple classification task, given the entity and relation descriptions of a triple. KG-BERT does not attempt end-to-end alignment of KG triples from a given input text. Liu et al. [28] design an encoder-decoder framework with an attention mechanism to extract and align triples to a KG.
3 Plumber

Plumber has a modular design (see Figure 2) where each component is integrated as a microservice. To ensure consistent data exchange between components, the framework maps the output of each component to a homogeneous data representation using the Qanary [6] methodology. Plumber follows the three design principles of i) Isolation, ii) Reusability, and iii) Extensibility, inspired by [39,41].
Dynamic pipeline selection: Plumber uses a RoBERTa-based [29] classifier that, given a text and a set of requirements, predicts the best pipeline to extract KG triples. The RoBERTa model acts as an intermediary that classifies the contextual embeddings extracted from the input text into a class representing one of the possible pipelines. For RoBERTa's training, we run each input sequence through all possible pipelines and compute the F1-score (i.e., estimated performance) for each. RoBERTa is then fed the sentence, with the pipeline achieving the best sentence-level performance as the target class. Hence, in practice, the user points Plumber to a piece of text and, internally, RoBERTa classifies the text into a class (i.e., the pipeline) to execute against the input text.
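The following is a minimal sketch of this selection step using the Hugging Face transformers library; the checkpoint name, label count, and inference code are illustrative assumptions, not Plumber's actual implementation:

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

NUM_PIPELINES = 240  # one class per candidate DBpedia pipeline

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# In practice this head would be fine-tuned so that the target class for
# a training sentence is the pipeline with the best sentence-level F1.
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=NUM_PIPELINES)

def select_pipeline(text: str) -> int:
    """Classify an input sentence into the index of a pipeline configuration."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))

best = select_pipeline(
    "Rembrandt painted The Storm on the Sea of Galilee. "
    "It was painted in 1633.")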
Architecture: Plumber includes the following modules: i) IE Components Pool: all information extraction components integrated within the framework are part of this pool. The components are divided based on their respective tasks, i.e., coreference resolution, text triple extraction, as well as entity and relation linking. ii) Pipeline Generator: this module creates the possible pipelines depending on the requirements of the components (i.e., the underlying KG). Users can manually select the underlying KG and, using the metadata associated with each component, Plumber aggregates the components for the concerned KG. iii) IE Pipelines Pool: Plumber stores the configurations of the possible pipelines in the pool of pipelines for faster retrieval and easier interaction with other modules. iv) Pipeline Selector: based on the requirements (i.e., underlying KG) and the input text, a RoBERTa-based model extracts contextual embeddings from the text and classifies the input into one of the possible classes. Each class corresponds to one pipeline configuration held in the IE pipelines pool. v) Pipeline Runner: given the input text and the generated pipeline configuration, this module executes the pipeline and produces the final KG triples.

Fig. 2. Overview of Plumber's architecture highlighting the components for pipeline generation, selection, and execution (IE components pool with CR, TE, EL/RL, and E2E components; pipeline generator; IE pipelines pool; RoBERTa-based pipeline selector; pipeline runner). Plumber receives an input sentence and requirements (underlying KG) from the user. The framework intelligently selects a suitable pipeline based on the contextual features captured from the input sentence.
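A minimal sketch of the runner's control flow follows, assuming each component is wrapped behind a uniform callable interface; the interface and the toy stand-in components are our assumptions, not Plumber's actual service contracts:

from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

@dataclass
class Pipeline:
    """One configuration from the IE pipelines pool."""
    coref: Callable[[str], str]               # CR: resolve pronouns
    extractor: Callable[[str], List[Triple]]  # TE: extract text triples
    linker: Callable[[Triple], Triple]        # EL/RL: map to KG concepts

    def run(self, text: str) -> List[Triple]:
        resolved = self.coref(text)               # step 1: coreference resolution
        text_triples = self.extractor(resolved)   # step 2: triple extraction
        return [self.linker(t) for t in text_triples]  # step 3: linking

# Toy stand-ins so the sketch is executable end to end.
def toy_coref(text: str) -> str:
    return text.replace("It", "The Storm on the Sea of Galilee")

def toy_extractor(text: str) -> List[Triple]:
    return [("Rembrandt", "painted", "The Storm on the Sea of Galilee")]

def toy_linker(triple: Triple) -> Triple:
    return ("dbr:Rembrandt_van_Rijn", "dbo:artist",
            "dbr:The_Storm_on_the_Sea_of_Galilee")

pipeline = Pipeline(toy_coref, toy_extractor, toy_linker)
print(pipeline.run("Rembrandt painted The Storm on the Sea of Galilee. "
                   "It was painted in 1633."))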
4 Evaluation

In this section, we detail the empirical evaluation of the framework in comparison to the baselines on different datasets and knowledge graphs. In particular, we study the following research question: How does the dynamic selection of pipelines based on the input text affect the end-to-end information extraction task?
Knowledge Graphs: To study the effectiveness of Plumber in building dynamic KG information extraction pipelines, we use the following KGs in our evaluation:
DBpedia [2] contains information extracted automatically from Wikipedia infoboxes. DBpedia consists of approximately 11.5B triples [35].
Open Research Knowledge Graph (ORKG) [24] collects structured scholarly knowledge published in research articles, using crowdsourcing and automated techniques. In total, the ORKG consists of approximately 984K triples.
Datasets: Throughout our evaluation, we employed a set of existing and newly created datasets for structured triple extraction and alignment to knowledge graphs: the WebNLG [20] dataset for DBpedia, and COV-triples for the ORKG.
WebNLG stems from the Web Natural Language Generation Challenge, which introduced the task of aligning unstructured text to DBpedia. In total, the dataset contains 46K triples, with 9K triples in the testing and 37K in the training set.
COV-triples is a handcrafted dataset that focuses on COVID-19 related scholarly articles. It consists of 21 abstracts from peer-reviewed articles and aligns the natural language text to the corresponding KG triples in the ORKG. Three Semantic Web researchers verified annotation quality, and only triples approved by all three researchers are part of the dataset. The dataset contains only 75 triples. Hence, we use the WebNLG dataset for training, and the 75 triples are used as a test set.
Components and Implementation: The Plumber framework integrates 33 components spanning different IE tasks, from triple extraction and entity and relation linking to coreference resolution. Most of the components used are open source and have been evaluated and used by the community in their respective publications. Plumber's code and all related resources are publicly available online at https://git.io/JtT1s .
Baselines: We include the following baselines:
T2KG [25] is a static end-to-end system that aligns a given natural language text to DBpedia KG triples.
Frankenstein [39] dynamically composes Question Answering pipelines over the DBpedia KG. It employs logistic-regression-based classifiers per component to predict accuracy, and greedily composes a dynamic pipeline of the best components per task. We adapted Frankenstein for KG information extraction over DBpedia.
This section summarizes a variety of experiments comparing the Plumber framework against the baselines. Note that evaluating the performance of individual components or their combinations is out of this evaluation's scope, since they were already used, benchmarked, and evaluated in their respective publications. We report values of the standard metrics Precision (P), Recall (R), and F1 score (F1). In all experiments, end-to-end components (e.g., T2KG) are not part of Plumber.
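For illustration, here is a minimal sketch of how triple-level scores of this kind can be computed from gold and predicted triple sets; exact-match scoring is an assumption, and the scorers used in the respective publications may differ:

from typing import Set, Tuple

Triple = Tuple[str, str, str]

def prf1(gold: Set[Triple], pred: Set[Triple]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over extracted KG triples."""
    tp = len(gold & pred)  # true positives: predicted triples that are gold
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("dbr:Rembrandt_van_Rijn", "dbo:artist",
         "dbr:The_Storm_on_the_Sea_of_Galilee")}
print(prf1(gold, gold))  # a perfect extraction scores (1.0, 1.0, 1.0)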
Performance of Static Pipelines: In this experiment, we report results of the static pipelines, i.e., no dynamic selection of a pipeline based on the input text is considered. We ran all 264 pipelines, and Table 2 (T2KG & Static rows) reports the performance of the best Plumber pipeline against the baselines. The best Plumber static pipeline for DBpedia comprises NeuralCoref [8] for coreference resolution, OpenIE [1] for text triple extraction, TagMe [18] for EL, and Falcon [35] for RL. Likewise, for Frankenstein, we choose its best performing static pipeline. The results in Table 2 confirm that the static pipeline composed of the components integrated in Plumber outperforms all baselines on DBpedia. We observe that the performance of pipeline approaches is better than that of end-to-end monolithic information extraction approaches. Although the Plumber pipeline outperforms the baselines, the overall performance is relatively low. All components were trained on distinct corpora in their respective publications, and our aim was to put them together to understand their collective strengths and weaknesses. Note that Frankenstein addresses the QA pipeline problem, and not all of its components are comparable or applicable in the context of information extraction. Thus, to provide the same experimental settings, we integrated the NeuralCoref coreference resolution component and the OpenIE triple extraction component used in the Plumber static pipeline into Frankenstein.
Static Pipeline for a Scholarly KG: To assess how Plumber performs on domain-specific use cases, we evaluate the static pipelines' performance on a scholarly knowledge graph. We use the COV-triples dataset for the ORKG. To the best of our knowledge, no baseline exists for information extraction from research contribution descriptions over the ORKG. Hence, we execute all static pipelines in Plumber tailored to the ORKG to select the best one, as shown in Table 2 (COV-triples rows). Plumber pipelines over the ORKG extract statements determining the reproductive number estimates for the COVID-19 infectious disease from scientific articles, as shown below.

@prefix orkg:  <http://orkg.org/orkg/resource/> .
@prefix orkgp: <http://orkg.org/orkg/property/> .

orkg:R48100 orkgp:P16022 "2.68" .

In this example, orkg:R48100 refers to the city of Wuhan in China in the ORKG, and orkgp:P16022 is the property "has R0 estimate (average)". The number "2.68" is the reproductive number estimate.
Comparison of the Classification Approaches for Dynamic Pipeline Selection: In this experiment, we study the effect of the transformer-based pipeline selection approach implemented in Plumber against the pipeline selection approach of Frankenstein. For a comparable experimental setting, we re-use Frankenstein's classification approach in Plumber, keeping the underlying components precisely the same. We perform a 10-fold cross-validation of the classification performance of the employed approaches. Table 1 indicates that the Plumber pipeline selection significantly outperforms the baselines across the board.
Table 1. Classification performance of the pipeline selection approaches.

Pipeline Selection Approach   Dataset       Knowledge Graph      P      R     F1
Frankenstein [39]             WebNLG        DBpedia          0.732  0.751  0.741
Frankenstein [39]             COV-triples   ORKG             0.832  0.858  0.845
Plumber                       WebNLG        DBpedia
Plumber                       COV-triples   ORKG

Performance Comparison for the KG Information Extraction Task: Our third experiment focuses on comparing the performance of Plumber against the previous baselines for the end-to-end information extraction task. The results in Table 2 illustrate that the dynamic pipelines built using Plumber for KG information extraction outperform the best static pipelines of Plumber, the pipelines dynamically selected by Frankenstein (rows noted with Dynamic), and the end-to-end baseline of Kertkeidkachorn and Ichise [25]. We also observe that, in the cross-domain experiments on the COV-triples dataset, dynamically selected pipelines perform better than the static pipeline. In this cross-domain experiment, the static and dynamic Plumber pipelines perform relatively better on the ORKG than on DBpedia: unlike the components for DBpedia, the components integrated into Plumber for the ORKG are customized for KG triple extraction. We conclude that when components are integrated into a framework such as Plumber aiming at the KG information extraction task, it is crucial to select the pipeline dynamically based on the input text. The superior performance of Plumber shows that dynamic pipeline selection has a positive impact agnostic of the underlying KG and dataset. This also answers our overall research question.
Table 2. Overall performance comparison of static and dynamic pipelines for the KG information extraction task.

System                        Dataset       Knowledge Graph      P      R     F1
T2KG [25]                     WebNLG        DBpedia          0.133  0.140  0.135
Frankenstein (Static) [39]    WebNLG        DBpedia          0.177  0.189  0.181
Plumber (Static)              WebNLG        DBpedia          0.210  0.225  0.215
Plumber (Static)              COV-triples   ORKG             0.403  0.423  0.413
Frankenstein (Dynamic) [39]   WebNLG        DBpedia          0.199  0.208  0.203
Frankenstein (Dynamic) [39]   COV-triples   ORKG             0.403  0.424  0.413
Plumber (Dynamic)             WebNLG        DBpedia
Plumber (Dynamic)             COV-triples   ORKG

Ablation Studies: Plumber and the baselines render relatively low performance on all the employed datasets. Hence, in the ablation studies our aim is to provide a holistic picture of the underlying errors and the collective successes and failures of the integrated components. In the first study, we calculate the proportion of errors in Plumber. The modular architecture of the proposed framework allows us to benchmark each component independently. We consider the erroneous cases of Plumber on the test set of the WebNLG dataset and calculate the performance (F1 score) of the Plumber dynamic pipeline (cf. Table 2) at each step in the pipeline. The results show that the coreference resolution components caused 21.54% of the errors, 33.71% were caused by the text triple extractors, 18.17% by the entity linking components, and 26.58% by the relation linking components.

We conclude that the text triple extractor components contribute the largest share of the errors over DBpedia. One possible reason for their limited performance is that open-domain information extraction components were not originally released for the KG information extraction task. Also, these components do not incorporate any schema or prior knowledge to guide the extraction. We observe that errors mainly occur when the sentence is complex (with more than one entity and predicate) or when relations are not explicitly mentioned in the sentence. We further analyzed the text triple extractor errors. The error analysis at the level of the triple subject, predicate, and object showed that most errors are in predicates (40.17%), followed by objects (35.98%) and subjects (23.85%).
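A minimal sketch of such an error attribution, assuming every failed test sentence has been tagged with the pipeline stage that first went wrong; the tags below are illustrative, not the WebNLG results:

from collections import Counter

# Illustrative per-sentence tags naming the first failing pipeline stage.
failed_stage = ["TE", "CR", "RL", "TE", "EL", "TE", "RL", "CR"]

counts = Counter(failed_stage)
total = sum(counts.values())
for stage, n in counts.most_common():
    # proportion of all errors attributed to this stage
    print(f"{stage}: {100 * n / total:.2f}% of errors")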
Further Analysis: Aiming to understand why IE pipelines perform with low accuracy, we conduct a more in-depth analysis per IE task. In the first analysis, we evaluated each component independently on the WebNLG dataset. Researchers [12,40] proposed several criteria for micro-benchmarking tools/components for KG tasks (entity linking, relation linking, etc.) based on the linguistic features of a sentence. We base our analysis on the following criteria; a feature-extraction sketch follows the list.
I) Text Triple Extraction: We consider the number of words (wc) in the input sentence (a sentence is termed "simple" at an average word length of 7.41 [39]; sentences with more than seven words are complex sentences). Furthermore, having a comma in a sentence (sub-clause) to separate clauses is another factor. Atomic sentences (e.g., "cats have tails") are a type of sentence that also affects the behavior of triple extractors. Moreover, nominal relations, as in "Durin, son of Thorin", are another factor impacting performance. Uppercase and lowercase mentions of words (i.e., correct capitalization of the first character rather than the entire word) in a sentence are standard error sources for entity linking components; we consider this as a micro-benchmarking criterion as well.
II) Coreference Resolution: We focus on the length of the coreference chain (i.e., the number of aliases for a single mention). Additionally, the number of clusters is another criterion in the analysis. A cluster refers to a group of mentions that require disambiguation (e.g., in "mother bought a new phone, she is so happy about it", the first cluster is mother → she and the second is phone → it). The presence of proper nouns in the sentence is studied, as well as acronyms. Furthermore, the demonstrative nature of the sentence is also observed as a factor; demonstrative sentences are those that contain demonstrative pronouns (this, that, etc.).
III) Entity Linking: The number of entities in a sentence (e=1,2) is a crucial observation for the entity linking task. Capitalization of the surface form is another criterion for micro-benchmarking entity linking tools. An entity is termed an explicit entity when the entity's surface form in a sentence matches the KG label; an entity is implicit when there is a vocabulary mismatch. For example, in the sentence "The wife of Obama is Michelle Obama.", the surface form Obama is expected to be linked to dbr:Barack_Obama and is considered an implicit entity [40]. The last linguistic feature is the number of words (w) in an entity label (e.g., The Storm on the Sea of Galilee has seven words).
IV) Relation Linking: Similar to the entity linking criteria, we focus on the number of relations in a sentence (rel=1,2). The type of relation (i.e., explicit or implicit) is another parameter. Covered relations (sentences without a predicate surface form) are also used as a feature for micro-benchmarking, e.g., "Which companies have launched a rocket from Cape Canaveral Air Force station?", where the dbo:manufacturing relation is not mentioned in the sentence. Covered relations highly depend on common sense knowledge (i.e., reasoning) and the structure of the KG [40]. Lastly, the number of words (w <= N) in a predicate surface form is also considered.
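The sketch below shows how such surface features can be computed for a sentence; whitespace tokenization and the exact checks are simplifying assumptions mirroring the criteria above:

def micro_features(sentence: str) -> dict:
    """Surface-level features used to micro-benchmark IE components."""
    tokens = sentence.rstrip(".?!").split()
    return {
        "word_count": len(tokens),
        "complex": len(tokens) > 7,        # more than seven words is complex
        "has_subclause": "," in sentence,  # comma separating clauses
        "capitalized_mentions": sum(       # candidate entity surface forms
            1 for t in tokens[1:] if t[:1].isupper()),
        "demonstrative": any(              # demonstrative pronouns present
            t.lower() in {"this", "that", "these", "those"} for t in tokens),
    }

print(micro_features("Rembrandt painted The Storm on the Sea of Galilee."))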
Figure 3 illustrates the micro-benchmarking of the various Plumber components per task. We observe that, across IE tasks, the F1 score of the components varies significantly based on the sentence's linguistic features. In fact, no single component performs equally well on all the micro-benchmarking criteria. This observation further validates our hypothesis to design Plumber for building dynamic information extraction pipelines based on the strengths and weaknesses of the integrated components. We also note in Figure 3 that all the CR components report limited performance for demonstrative sentences (demonstratives). When there is more than one coreference cluster in a sentence, all CR components observe a discernible drop in F1 score. The NeuralCoref [8] component performs best for proper nouns, whereas PyCobalt [19] performs best for the acronyms feature (almost tied by NeuralCoref). In the TE task, Graphene [7] shows the most stable performance across all categories. However, the performance of all components (except the Dependency Parser) drops significantly when the number of words in a sentence exceeds seven (wc > 7).

Fig. 3. Comparison of F1 scores per component for different IE tasks based on various linguistic features of an input sentence (number of entities, word count in a sentence, implicit vs. explicit relation, etc.). Darker colors indicate a higher F1 score. Panels: (a) the EL task, (b) the text TE task, (c) the CR task, (d) the RL task.

5 Discussion

Even though the dynamic pipelines of Plumber outperform the static pipelines, the overall performance of Plumber and the baselines for the KG information extraction task remains low. Our detailed and exhaustive ablation studies suggest that, when individual components are plugged together, their individual performance is a major error source. However, this behavior is expected, considering that earlier research in other domains observed a similar trend, e.g., the Gerbil framework [41] in 2015 and Frankenstein [39] in 2018.
Within two years, the community has released several components dedicated to solving entity linking and relation linking [35,15,30], which were two loopholes identified by [39] for the QA task. We observe that state-of-the-art components for information extraction still have much potential to improve their performance (both in terms of runtime and F1 score). It is essential to highlight that some of the issues observed in our ablation study are very basic and have been repeatedly pointed out by researchers in the community. For instance, Derczynski et al. [12] in 2015, followed by Singh et al. [39] in 2018, showed that case sensitivity is a main challenge for EL tools. Our observation in Figure 3 again confirms that case sensitivity of entity surface forms remains an open issue even for newly released components. In contrast, on specific datasets such as CoNLL-AIDA, several EL approaches reported F1 scores higher than 0.90 [43], showing that EL tools are highly customized to particular datasets. In a real-world scenario like ours, the underlying limitations of these approaches are uncovered.
6 Conclusion and Future Work

In this paper, we presented the Plumber approach and framework for information extraction. Plumber effectively selects the best possible pipeline for a given input sentence using sentential contextual features and a state-of-the-art transformer-based classification model. Plumber has a service-oriented architecture that is scalable, extensible, reusable, and agnostic of the underlying KG. The core idea of Plumber is to combine the strengths of already existing, disjoint research efforts for KG information extraction and to build the foundation of a platform that promotes reusability in the construction of large-scale and semantically structured KGs. Our empirical results suggest that the performance of the individual components directly impacts the end-to-end information extraction accuracy.

This article does not focus on the internal system architecture or the algorithms employed in a particular IE component to analyze its failures. The focus of the ablation studies is to holistically study the collective success and failure cases for the various tasks. Our studies provide the research community with insightful results over two knowledge graphs, 33 components, and 264 pipelines. Our work is a step in a larger research agenda of offering the research community an effective way to synergistically combine and orchestrate various focused IE approaches, balancing their strengths and weaknesses while taking different application domains into account. We plan to extend our work in the following directions: i) extending Plumber to other KGs such as UMLS [5] and Wikidata [42], ii) addressing multilinguality with Plumber, and iii) creating high-performing RL components.
References
1. Angeli, G., Johnson Premkumar, M.J., Manning, C.D.: Leveraging linguistic structure for open domain information extraction. pp. 344–354. ACL (2015)
2. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A nucleus for a web of open data. In: The Semantic Web. pp. 722–735 (2007)
3. Balog, K.: Entity linking. In: Entity-Oriented Search, pp. 147–188. Springer (2018)
4. Bastos, A., Nadgeri, A., Singh, K., Mulang, I.O., Shekarpour, S., Hoffart, J.: RECON: Relation extraction using knowledge graph context in a graph neural network (2020)
5. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, D267–D270 (2004)
6. Both, A., Diefenbach, D., Singh, K., Shekarpour, S., Cherix, D., Lange, C.: Qanary - A methodology for vocabulary-driven open question answering systems. vol. 9678, pp. 625–641 (2016)
7. Cetto, M., Niklaus, C., Freitas, A., Handschuh, S.: Graphene: Semantically-linked propositions in open information extraction. In: Proceedings of the 27th COLING. pp. 2300–2311 (2018)
8. Clark, K., Manning, C.D.: Deep reinforcement learning for mention-ranking coreference models. In: Proceedings of the 2016 EMNLP. pp. 2256–2262 (2016)
9. Cui, W., Liu, S., Wu, Z., Wei, H.: How hierarchical topics evolve in large text corpora. IEEE TVCG (12), 2281–2290 (2014)
10. Daiber, J., Jakob, M., Hokamp, C., Mendes, P.N.: Improving efficiency and accuracy in multilingual entity extraction. In: Proceedings of the 9th I-Semantics (2013)
11. Delpeuch, A.: OpenTapioca: Lightweight entity linking for Wikidata (2019)
12. Derczynski, L., Maynard, D., Rizzo, G., Van Erp, M., Gorrell, G., Troncy, R., Petrak, J., Bontcheva, K.: Analysis of named entity recognition and linking for tweets. Information Processing & Management, 32–49 (2015)
13. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL. pp. 4171–4186 (2019)
14. Dong, T., Wang, Z., Li, J., Bauckhage, C., Cremers, A.B.: Triple classification using regions and fine-grained entity typing. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 77–85 (2019)
15. Dubey, M., Banerjee, D., Chaudhuri, D., Lehmann, J.: EARL: Joint entity and relation linking for question answering over knowledge graphs. In: Lecture Notes in Computer Science, pp. 108–126. Springer International Publishing (2018)
16. Fader, A., Soderland, S., Etzioni, O.: Identifying relations for open information extraction. In: Proceedings of the 2011 EMNLP. pp. 1535–1545 (Jul 2011)
17. Fensel, D., van Harmelen, F., Andersson, B., Brennan, P., Cunningham, H., Della Valle, E., Fischer, F., Huang, Z., Kiryakov, A., Lee, T.K., Witbrock, M., Zhong, N.: Towards LarKC: A platform for web-scale reasoning. In: IEEE ICSC. pp. 524–529 (2008)
18. Ferragina, P., Scaiella, U.: TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). pp. 1625–1628 (2010)
19. Freitas, A., Bermeitinger, B., Handschuh, S.: Lambda-3/PyCobalt: Coreference resolution in Python. https://github.com/Lambda-3/PyCobalt
20. Gardent, C., Shimorina, A., Narayan, S., Perez-Beltrachini, L.: Creating training corpora for NLG micro-planners. pp. 179–188 (2017)
21. Gashteovski, K., Gemulla, R., del Corro, L.: MinIE: Minimizing facts in open information extraction. In: Proceedings of the 2017 EMNLP. pp. 2630–2640 (2017)
22. Hou, Y., Jochim, C., Gleize, M., Bonin, F., Ganguly, D.: Identification of tasks, datasets, evaluation metrics, and numeric scores for scientific leaderboards construction. In: Proceedings of the 57th ACL. pp. 5203–5213 (2019)
23. Ibrahim, Y., Riedewald, M., Weikum, G., Zeinalipour-Yazti, D.: Bridging quantities in tables and text. In: 2019 IEEE 35th ICDE. pp. 1010–1021 (2019)
24. Jaradeh, M.Y., Oelen, A., Farfar, K.E., Prinz, M., D'Souza, J., Kismihók, G., Stocker, M., Auer, S.: Open Research Knowledge Graph: Next generation infrastructure for semantic scholarly knowledge. In: K-CAP 2019, Marina Del Rey (2019)
25. Kertkeidkachorn, N., Ichise, R.: T2KG: An end-to-end system for creating knowledge graph from unstructured text. In: AAAI Workshops. vol. WS-17 (2017)
26. Kim, J.D., Unger, C., Ngomo, A.C.N., Freitas, A., Hahm, Y.g., Kim, J., Nam, S., Choi, G.H., Kim, J.u., Usbeck, R., et al.: OKBQA framework for collaboration on developing natural language question answering systems (2017)
27. Liang, S., Stockinger, K., de Farias, T.M., Anisimova, M., Gil, M.: Querying knowledge graphs in natural language (2020)
28. Liu, Y., Zhang, T., Liang, Z., Ji, H., McGuinness, D.: Seq2RDF: An end-to-end application for deriving triples from natural language text (2018)
29. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach (2019)
30. Mihindukulasooriya, N., Rossiello, G., Kapanipathi, P., Abdelaziz, I., Ravishankar, S., Yu, M., Gliozzo, A., Roukos, S., Gray, A.: Leveraging semantic parsing for relation linking over knowledge bases. In: ISWC (2020)
31. Morbidoni, C., Polleres, A., Tummarello, G., Le-Phuoc, D.: Semantic Web Pipes (2007)
32. Niklaus, C., Cetto, M., Freitas, A., Handschuh, S.: A survey on open information extraction. In: Proceedings of the 27th COLING. pp. 3866–3878 (2018)
33. Ponza, M., Del Corro, L., Weikum, G.: Facts that matter. In: Proceedings of the 2018 EMNLP. pp. 1043–1048. ACL (2018)
34. Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D., Manning, C.: A multi-pass sieve for coreference resolution. In: EMNLP (2010)
35. Sakor, A., Onando Mulang', I., Singh, K., Shekarpour, S., Esther Vidal, M., Lehmann, J., Auer, S.: Old is gold: Linguistic driven approach for entity and relation linking of short text. pp. 2336–2346. ACL (2019)
36. Sakor, A., Singh, K., Patel, A., Vidal, M.E.: Falcon 2.0: An entity and relation linking tool over Wikidata. In: CIKM (2020)
37. Sanh, V., Wolf, T., Ruder, S.: A hierarchical multi-task approach for learning embeddings from semantic tasks. In: Proceedings of the AAAI, 6949–6956 (2019)
38. Singh, K., Mulang, I.O., Lytra, I., Jaradeh, M.Y., Sakor, A., Vidal, M., Lange, C., Auer, S.: Capturing knowledge in semantically-typed relational patterns to enhance relation linking. In: Proceedings of the Knowledge Capture Conference, K-CAP 2017, Austin, TX, USA, December 4-6, 2017. pp. 31:1–31:8 (2017)
39. Singh, K., Radhakrishna, A.S., Both, A., Shekarpour, S., Lytra, I., Usbeck, R., Vyas, A., Khikmatullaev, A., Punjani, D., Lange, C., Vidal, M.E., Lehmann, J., Auer, S.: Why reinvent the wheel: Let's build question answering systems together. pp. 1247–1256. WWW '18 (2018)
40. Singh, K., Saleem, M., Nadgeri, A., Conrads, F., Pan, J.Z., Ngomo, A.C.N., Lehmann, J.: QaldGen: Towards microbenchmarking of question answering systems over knowledge graphs. In: ISWC. pp. 277–292 (2019)
41. Usbeck, R., Röder, M., et al.: GERBIL: General entity annotator benchmarking framework. In: Proceedings of the 24th WWW. pp. 1133–1143 (2015)
42. Vrandečić, D., Krötzsch, M.: Wikidata: A free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)