Better Call the Plumber: Orchestrating Dynamic Information Extraction Pipelines
Mohamad Yaser Jaradeh¹, Kuldeep Singh², Markus Stocker³, Andreas Both⁴, and Sören Auer³

¹ L3S Research Center, Leibniz University Hannover, Germany ([email protected])
² Zerotha Research & Cerence GmbH, Germany ([email protected])
³ TIB Leibniz Information Centre for Science and Technology, Germany ({markus.stocker, auer}@tib.eu)
⁴ Anhalt University of Applied Sciences, Germany ([email protected])
Abstract. We propose Plumber, the first framework that brings together the research community's disjoint information extraction (IE) efforts. The Plumber architecture comprises 33 reusable components for various Knowledge Graph (KG) information extraction subtasks, such as coreference resolution, entity linking, and relation extraction. Using these components, Plumber dynamically generates suitable information extraction pipelines and offers overall 264 distinct pipelines. We study the optimization problem of choosing suitable pipelines based on input sentences. To do so, we train a transformer-based classification model that extracts contextual embeddings from the input and finds an appropriate pipeline. We study the efficacy of Plumber for extracting KG triples using standard datasets over two KGs: DBpedia and the Open Research Knowledge Graph (ORKG). Our results demonstrate the effectiveness of Plumber in dynamically generating KG information extraction pipelines, outperforming all baselines agnostic of the underlying KG. Furthermore, we provide an analysis of collective failure cases, study the similarities and synergies among the integrated components, and discuss their limitations.
Keywords: Information Extraction · NLP Pipelines · Software Reusability · Semantic Search · Semantic Web
1 Introduction

In the last decade, publicly available KGs (e.g., DBpedia [2] and Wikidata [42]) have become rich sources of structured content used in various applications, including Question Answering (QA), fact checking, and dialog systems [39,4]. The research community developed numerous approaches to extract triple statements [44], keywords/topics [9], tables [45,23,22], or entities [35,36] from unstructured text to complement KGs. Despite extensive research, public KGs are not exhaustive and require continuous effort to align newly emerging unstructured information to the concepts of the KGs.
Research Problem:
This work was motivated by an observation of recent approaches [35,45,15] that automatically align unstructured text to structured data on the Web. Such approaches are not viable in practice for extracting and structuring information because they only address very specific subtasks of the overall KG information extraction problem. Consider the exemplary sentences "Rembrandt painted The Storm on the Sea of Galilee. It was painted in 1633." (cf. Figure 1). To extract statements aligned with the DBpedia KG from the given sentences, a system must first recognize the entity and relation surface forms in the first sentence. The second sentence requires the additional step of coreference resolution, where It must be mapped to the correct entity surface form (namely, The Storm on the Sea of Galilee). The last step requires the mapping of entity and relation surface forms to the respective DBpedia entities and predicates. There has been extensive research in aligning concepts in unstructured text to KGs, including entity linking [15,18], relation linking [36,38,4], and triple classification [14]. However, these efforts are disjoint, and little has been done to align unstructured text to complete KG triples (i.e., represented as subject, predicate, object) [25]. Furthermore, many entity and relation linking tools have been reused in pipelines of QA systems [39,26]. The literature suggests that once the different approaches put forward by the research community are combined, the resulting pipeline-oriented integrated systems can outperform monolithic end-to-end systems [27]. For the KG information extraction task, however, to the best of our knowledge, approaches aiming at dynamically integrating and orchestrating various existing components do not exist.
Objective and Contributions:
Based on these observations, we build a framework that enables the integration of previously disjoint efforts on the KG information extraction task under a single umbrella. We present the Plumber framework (cf. Figure 2) for creating information extraction pipelines. Plumber integrates 33 reusable components released by the research community for the subtasks entity linking (EL), relation linking (RL), text triple extraction (TE) (subject, predicate, object), and coreference resolution (CR). Overall, there are 264 different composable KG information extraction pipelines, generated from the possible combinations of the available 33 components: for DBpedia, 3 CRs, 8 TEs, and 10 EL/RLs give 3*8*10=240 pipelines, and the ORKG components give 4*3*2=24; hence, 240+24=264 pipelines in total.
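The size of this composition space follows directly from the per-task component counts. Below is a minimal sketch in Python; the component names are placeholders for illustration, not the actual Plumber registry:

from itertools import product

# Illustrative component pools per KG (placeholder names; the real
# Plumber registry holds 33 concrete components overall).
POOLS = {
    "DBpedia": {"CR": [f"cr{i}" for i in range(3)],
                "TE": [f"te{i}" for i in range(8)],
                "EL_RL": [f"link{i}" for i in range(10)]},
    "ORKG":    {"CR": [f"cr{i}" for i in range(4)],
                "TE": [f"te{i}" for i in range(3)],
                "EL_RL": [f"link{i}" for i in range(2)]},
}

def generate_pipelines(pool):
    """Every pipeline combines one CR, one TE, and one EL/RL component."""
    return list(product(pool["CR"], pool["TE"], pool["EL_RL"]))

pipelines = {kg: generate_pipelines(p) for kg, p in POOLS.items()}
# 3*8*10 = 240 for DBpedia and 4*3*2 = 24 for the ORKG, i.e., 264 overall
assert sum(len(v) for v in pipelines.values()) == 264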
Plumber implements a transformer-based classification algorithm that intelligently chooses the best pipeline based on the unstructured input text. We perform an exhaustive evaluation of Plumber on the two large-scale KGs DBpedia and the Open Research Knowledge Graph (ORKG) [24] to investigate the efficacy of Plumber in creating KG triples from unstructured text. We demonstrate that, independent of the underlying KG, Plumber can find and assemble different extraction components to produce better suited KG triple extraction pipelines, significantly outperforming existing baselines. In summary, we provide the following novel contributions: i) The Plumber framework is the first of its kind for dynamically assembling and evaluating information extraction pipelines based on sequence classification techniques for a given input text. Plumber is easily extensible and configurable, thus enabling the rapid creation and adjustment of new information extraction components and pipelines. Researchers can also use the framework for running IE components independently for specific subtasks such as triple extraction and entity linking. ii) A collection of 33 reusable IE components that can be combined to create 264 distinct IE pipelines. iii) An exhaustive evaluation and a detailed ablation study of the integrated components and composed pipelines on various input texts, which will guide future research on collaborative KG information extraction.
We motivate our work with a running example: the sentences "Rembrandt painted The Storm on the Sea of Galilee. It was painted in 1633." Multiple steps are required to extract formally represented statements from this text. First, the pronoun It in the second sentence should be replaced by The Storm on the Sea of Galilee using a coreference resolver. Next, a triple extractor should extract the correct text triples from the natural language text. Finally, the entity and relation surface forms must be linked to the KG, i.e., dbr:Rembrandt_van_Rijn for Rembrandt and dbr:The_Storm_on_the_Sea_of_Galilee for The Storm on the Sea of Galilee, and for relations: dbo:artist for painted and dbp:year for painted in. Figure 1 illustrates our running example and shows three Plumber IE pipelines with different results.

Fig. 1. Three example information extraction pipelines showing different results for the same text snippet ("Rembrandt painted The Storm on the Sea of Galilee. It was painted in 1633."). Each pipeline consists of coreference resolution, triple extraction, and entity/relation linking components (e.g., Stanford Coref Resolver, OpenIE, EARL).

In Pipeline 1, the coreference resolver is unable to map the pronoun It to the respective entity in the previous sentence. Moreover, the triple extractor generates incomplete triples, which also hinders the task of the entity and relation linker in the last step. Pipeline 2 uses a different set of components, and its output differs from the first pipeline. Here, the coreference resolution component is able to correctly co-relate the pronoun It to The Storm on the Sea of Galilee and to extract the text triple correctly. However, the overall result is only partially correct because the second triple is not extracted. Also, the linking component is not able to spot the second entity. Pipeline 3 correctly extracts both triples. This pipeline employs the same component as the second pipeline for coreference resolution but also includes an additional information extraction component (i.e., ReVerb [16]) and a joint entity and relation linking component, namely Falcon [35]. With this combination of components, the text triple extractors were able to compensate for the loss of information in the second pipeline by adding one more component. Using the extracted text triples, the last component of the pipeline, a joint entity and relation linking tool, can map both triple components correctly to the corresponding KG entities.

The remainder of this article is organized as follows. Related work is reviewed in Section 2. Section 3 presents Plumber, which is extensively evaluated in Section 4. Section 5 discusses the results, and Section 6 concludes and outlines directions for future work.
2 Related Work

In the last decade, many open source tools have been released by the research community to tackle IE tasks for KGs. These IE components are used not only for end-to-end KG triple extraction but also for various other tasks:
Text Triple Extraction: The task of open information extraction is a well-studied task in the NLP community [1]. It relies on NER (Named Entity Recognition) and RE (Relation Extraction). SalIE [33] uses MinIE [21] in combination with PageRank and clustering to find facts in the input text. Furthermore, OpenIE [1] leverages linguistic structures to extract self-contained clauses from text. A comprehensive survey by Niklaus et al. [32] provides details about such techniques.
Entity and Relation Linking: Entity and relation linking is a widely studied topic in the NLP, Web, and Information Retrieval research communities [3,4,11]. Often, entity and relation linking are performed independently. DBpedia Spotlight [10] is one of the first approaches for entity recognition and disambiguation over DBpedia. TagMe [18] links entities to DBpedia using in-link matching to disambiguate candidate entities. Other tools, such as RelMatch [38], do not perform entity linking and only focus on linking the relations in the text to the corresponding KG relations. RECON [4] uses graph neural networks to map relations between entities, with the assumption that the entities are already linked in the text. EARL [15] is a joint linking tool over DBpedia that models the task as a generalized traveling salesperson problem. Sakor et al. [35] proposed Falcon, a linguistic-rules-based tool for joint entity and relation linking over DBpedia.
Coreference Resolution: This task is used in conjunction with other tasks in NLP pipelines to disambiguate text and resolve syntactic complexities. The Stanford Coreference Resolver [34] uses a multi-pass sieve of deterministic coreference models. Clark and Manning [8] use reinforcement learning to fine-tune a neural mention-ranking model for coreference resolution. More recently, Sanh et al. [37] addressed coreference resolution within a hierarchical multi-task learning approach.
Frameworks and Dynamic Pipelines: There have been a few attempts in various domains aiming to consolidate the disjoint efforts of the research community under a single umbrella for solving a particular task. The Gerbil platform [41] provides an easy-to-use web-based platform for the agile comparison of entity linking tools using multiple datasets and uniform measuring approaches. OKBQA [26] is a community effort for the development of multilingual open knowledge bases and QA systems. Frankenstein [39] integrates 24 QA components to build QA systems collaboratively on top of the Qanary integration framework [6]. Other ETL pipeline systems exist, such as Apache NiFi; Semantic Web Pipes [31] and LarKC [17] are other prominent examples.
End-to-End Extraction Systems: More recently, end-to-end systems have gained attention due to the rise of deep learning techniques. Such systems draw on the strengths of deep models and transformers [13,29]. Kertkeidkachorn and Ichise [25] present an end-to-end system to extract triples and link them to DBpedia. Other attempts such as KG-BERT [44] leverage deep transformers (i.e., BERT [13]) for the triple classification task, given the entity and relation descriptions of a triple. KG-BERT does not attempt end-to-end alignment of KG triples from a given input text. Liu et al. [28] design an encoder-decoder framework with an attention mechanism to extract and align triples to a KG.
3 Plumber

Plumber has a modular design (see Figure 2) where each component is integrated as a microservice. To ensure consistent data exchange between components, the framework maps the output of each component to a homogeneous data representation using the Qanary [6] methodology. Plumber follows the three design principles of i) Isolation, ii) Reusability, and iii) Extensibility, inspired by [39,41].
Dynamic pipeline selection: Plumber uses a RoBERTa-based [29] classifier that, given a text and a set of requirements, predicts the best pipeline to extract KG triples. The RoBERTa model acts as an intermediary that classifies the contextual embeddings extracted from the input text into a class representing one of the possible pipelines. For RoBERTa's training, we run each input sequence through all possible pipelines and compute the F1-score (i.e., estimated performance) for each. RoBERTa is then fed the sentence, with the pipeline achieving the best sentence-level performance as the target class. Hence, in practice, the user points Plumber to a piece of text and, internally, RoBERTa classifies the text into a class (i.e., the pipeline) to execute against the input text.
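The following is a minimal sketch of this selection step using the Hugging Face transformers library; the checkpoint name, label count, and inference code are illustrative assumptions, not Plumber's actual implementation:

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

NUM_PIPELINES = 240  # one class per candidate DBpedia pipeline

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# In practice this head would be fine-tuned so that the target class for
# a training sentence is the pipeline with the best sentence-level F1.
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=NUM_PIPELINES)

def select_pipeline(text: str) -> int:
    """Classify an input sentence into the index of a pipeline configuration."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))

best = select_pipeline(
    "Rembrandt painted The Storm on the Sea of Galilee. "
    "It was painted in 1633.")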
Architecture: Plumber includes the following modules: i) IE Components Pool: all information extraction components integrated within the framework are part of this pool. The components are divided based on their respective tasks, i.e., coreference resolution, text triple extraction, as well as entity and relation linking. ii) Pipeline Generator: this module creates the possible pipelines depending on the requirements of the components (i.e., the underlying KG). Users can manually select the underlying KG and, using the metadata associated with each component, Plumber aggregates the components for the concerned KG. iii) IE Pipelines Pool: Plumber stores the configurations of the possible pipelines in the pool of pipelines for faster retrieval and easier interaction with other modules. iv) Pipeline Selector: based on the requirements (i.e., underlying KG) and the input text, a RoBERTa-based model extracts contextual embeddings from the text and classifies the input into one of the possible classes. Each class corresponds to one pipeline configuration held in the IE pipelines pool. v) Pipeline Runner: given the input text and the generated pipeline configuration, this module executes the pipeline and produces the final KG triples.

Fig. 2. Overview of Plumber's architecture highlighting the components for pipeline generation, selection, and execution (IE components pool with CR, TE, EL/RL, and E2E components; pipeline generator; IE pipelines pool; RoBERTa-based pipeline selector; pipeline runner). Plumber receives an input sentence and requirements (underlying KG) from the user. The framework intelligently selects a suitable pipeline based on the contextual features captured from the input sentence.
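A minimal sketch of the runner's control flow follows, assuming each component is wrapped behind a uniform callable interface; the interface and the toy stand-in components are our assumptions, not Plumber's actual service contracts:

from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

@dataclass
class Pipeline:
    """One configuration from the IE pipelines pool."""
    coref: Callable[[str], str]               # CR: resolve pronouns
    extractor: Callable[[str], List[Triple]]  # TE: extract text triples
    linker: Callable[[Triple], Triple]        # EL/RL: map to KG concepts

    def run(self, text: str) -> List[Triple]:
        resolved = self.coref(text)               # step 1: coreference resolution
        text_triples = self.extractor(resolved)   # step 2: triple extraction
        return [self.linker(t) for t in text_triples]  # step 3: linking

# Toy stand-ins so the sketch is executable end to end.
def toy_coref(text: str) -> str:
    return text.replace("It", "The Storm on the Sea of Galilee")

def toy_extractor(text: str) -> List[Triple]:
    return [("Rembrandt", "painted", "The Storm on the Sea of Galilee")]

def toy_linker(triple: Triple) -> Triple:
    return ("dbr:Rembrandt_van_Rijn", "dbo:artist",
            "dbr:The_Storm_on_the_Sea_of_Galilee")

pipeline = Pipeline(toy_coref, toy_extractor, toy_linker)
print(pipeline.run("Rembrandt painted The Storm on the Sea of Galilee. "
                   "It was painted in 1633."))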
4 Evaluation

In this section, we detail the empirical evaluation of the framework in comparison to the baselines on different datasets and knowledge graphs. In particular, we study the following research question: How does the dynamic selection of pipelines based on the input text affect the end-to-end information extraction task?
Knowledge Graphs: To study the effectiveness of Plumber in building dynamic KG information extraction pipelines, we use the following KGs in our evaluation:
DBpedia [2] contains information extracted automatically from Wikipedia infoboxes. DBpedia consists of approximately 11.5B triples [35].
Open Research Knowledge Graph (ORKG) [24] collects structured scholarly knowledge published in research articles, using crowdsourcing and automated techniques. In total, the ORKG consists of approximately 984K triples.
Datasets: Throughout our evaluation, we employed a set of existing and newly created datasets for structured triple extraction and alignment to knowledge graphs: the WebNLG [20] dataset for DBpedia, and COV-triples for the ORKG.
WebNLG stems from the Web Natural Language Generation Challenge, which introduced the task of aligning unstructured text to DBpedia. In total, the dataset contains 46K triples, with 9K triples in the testing and 37K in the training set.
COV-triples is a handcrafted dataset that focuses on COVID-19 related scholarly articles. It consists of 21 abstracts from peer-reviewed articles and aligns the natural language text to the corresponding KG triples in the ORKG. Three Semantic Web researchers verified annotation quality, and only triples approved by all three researchers are part of the dataset. The dataset contains only 75 triples. Hence, we use the WebNLG dataset for training, and the 75 triples are used as a test set.
Components and Implementation: The Plumber framework integrates 33 components spanning different IE tasks, from triple extraction and entity and relation linking to coreference resolution. Most of the components used are open source and have been evaluated and used by the community in their respective publications. Plumber's code and all related resources are publicly available online at https://git.io/JtT1s .
Baselines: We include the following baselines:
T2KG [25] is a static end-to-end system that aligns a given natural language text to DBpedia KG triples.
Frankenstein [39] dynamically composes Question Answering pipelines over the DBpedia KG. It employs logistic-regression-based classifiers per component to predict accuracy, and greedily composes a dynamic pipeline of the best components per task. We adapted Frankenstein for KG information extraction over DBpedia.
This section summarizes a variety of experiments comparing the Plumber framework against the baselines. Note that evaluating the performance of individual components or their combinations is out of this evaluation's scope, since they were already used, benchmarked, and evaluated in their respective publications. We report values of the standard metrics Precision (P), Recall (R), and F1 score (F1). In all experiments, end-to-end components (e.g., T2KG) are not part of Plumber.
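For illustration, here is a minimal sketch of how triple-level scores of this kind can be computed from gold and predicted triple sets; exact-match scoring is an assumption, and the scorers used in the respective publications may differ:

from typing import Set, Tuple

Triple = Tuple[str, str, str]

def prf1(gold: Set[Triple], pred: Set[Triple]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over extracted KG triples."""
    tp = len(gold & pred)  # true positives: predicted triples that are gold
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("dbr:Rembrandt_van_Rijn", "dbo:artist",
         "dbr:The_Storm_on_the_Sea_of_Galilee")}
print(prf1(gold, gold))  # a perfect extraction scores (1.0, 1.0, 1.0)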
Performance of Static Pipelines: In this experiment, we report results of the static pipelines, i.e., no dynamic selection of a pipeline based on the input text is considered. We ran all 264 pipelines, and Table 2 (T2KG & Static rows) reports the performance of the best Plumber pipeline against the baselines. The best Plumber static pipeline for DBpedia comprises NeuralCoref [8] for coreference resolution, OpenIE [1] for text triple extraction, TagMe [18] for EL, and Falcon [35] for RL. Likewise, for Frankenstein, we choose its best performing static pipeline. The results in Table 2 confirm that the static pipeline composed of the components integrated in Plumber outperforms all baselines on DBpedia. We observe that the performance of pipeline approaches is better than that of end-to-end monolithic information extraction approaches. Although the Plumber pipeline outperforms the baselines, the overall performance is relatively low. All components were trained on distinct corpora in their respective publications, and our aim was to put them together to understand their collective strengths and weaknesses. Note that Frankenstein addresses the QA pipeline problem, and not all of its components are comparable or applicable in the context of information extraction. Thus, to provide the same experimental settings, we integrated the NeuralCoref coreference resolution component and the OpenIE triple extraction component used in the Plumber static pipeline into Frankenstein.
Static Pipeline for a Scholarly KG: To assess how Plumber performs on domain-specific use cases, we evaluate the static pipelines' performance on a scholarly knowledge graph. We use the COV-triples dataset for the ORKG. To the best of our knowledge, no baseline exists for information extraction from research contribution descriptions over the ORKG. Hence, we execute all static pipelines in Plumber tailored to the ORKG to select the best one, as shown in Table 2 (COV-triples rows). Plumber pipelines over the ORKG extract statements determining the reproductive number estimates for the COVID-19 infectious disease from scientific articles, as shown below.

@prefix orkg:  <http://orkg.org/orkg/resource/> .
@prefix orkgp: <http://orkg.org/orkg/property/> .

orkg:R48100 orkgp:P16022 "2.68" .

In this example, orkg:R48100 refers to the city of Wuhan in China in the ORKG, and orkgp:P16022 is the property "has R0 estimate (average)". The number "2.68" is the reproductive number estimate.
Comparison of the Classification Approaches for Dynamic Pipeline Selection: In this experiment, we study the effect of the transformer-based pipeline selection approach implemented in Plumber against the pipeline selection approach of Frankenstein. For a comparable experimental setting, we re-use Frankenstein's classification approach in Plumber, keeping the underlying components precisely the same. We perform a 10-fold cross-validation of the classification performance of the employed approaches. Table 1 indicates that the Plumber pipeline selection significantly outperforms the baselines across the board.
Table 1. Classification performance of the pipeline selection approaches.

Pipeline Selection Approach   Dataset       Knowledge Graph      P      R     F1
Frankenstein [39]             WebNLG        DBpedia          0.732  0.751  0.741
Frankenstein [39]             COV-triples   ORKG             0.832  0.858  0.845
Plumber                       WebNLG        DBpedia
Plumber                       COV-triples   ORKG

Performance Comparison for the KG Information Extraction Task: Our third experiment focuses on comparing the performance of Plumber against the previous baselines for the end-to-end information extraction task. The results in Table 2 illustrate that the dynamic pipelines built using Plumber for KG information extraction outperform the best static pipelines of Plumber, the pipelines dynamically selected by Frankenstein (rows noted with Dynamic), and the end-to-end baseline of Kertkeidkachorn and Ichise [25]. We also observe that, in the cross-domain experiments on the COV-triples dataset, dynamically selected pipelines perform better than the static pipeline. In this cross-domain experiment, the static and dynamic Plumber pipelines perform relatively better on the ORKG than on DBpedia: unlike the components for DBpedia, the components integrated into Plumber for the ORKG are customized for KG triple extraction. We conclude that when components are integrated into a framework such as Plumber aiming at the KG information extraction task, it is crucial to select the pipeline dynamically based on the input text. The superior performance of Plumber shows that dynamic pipeline selection has a positive impact agnostic of the underlying KG and dataset. This also answers our overall research question.
Table 2. Overall performance comparison of static and dynamic pipelines for the KG information extraction task.

System                        Dataset       Knowledge Graph      P      R     F1
T2KG [25]                     WebNLG        DBpedia          0.133  0.140  0.135
Frankenstein (Static) [39]    WebNLG        DBpedia          0.177  0.189  0.181
Plumber (Static)              WebNLG        DBpedia          0.210  0.225  0.215
Plumber (Static)              COV-triples   ORKG             0.403  0.423  0.413
Frankenstein (Dynamic) [39]   WebNLG        DBpedia          0.199  0.208  0.203
Frankenstein (Dynamic) [39]   COV-triples   ORKG             0.403  0.424  0.413
Plumber (Dynamic)             WebNLG        DBpedia
Plumber (Dynamic)             COV-triples   ORKG

Ablation Studies: Plumber and the baselines render relatively low performance on all the employed datasets. Hence, in the ablation studies our aim is to provide a holistic picture of the underlying errors and the collective successes and failures of the integrated components. In the first study, we calculate the proportion of errors in Plumber. The modular architecture of the proposed framework allows us to benchmark each component independently. We consider the erroneous cases of Plumber on the test set of the WebNLG dataset and calculate the performance (F1 score) of the Plumber dynamic pipeline (cf. Table 2) at each step in the pipeline. The results show that the coreference resolution components caused 21.54% of the errors, 33.71% were caused by the text triple extractors, 18.17% by the entity linking components, and 26.58% by the relation linking components.

We conclude that the text triple extractor components contribute the largest share of the errors over DBpedia. One possible reason for their limited performance is that open-domain information extraction components were not originally released for the KG information extraction task. Also, these components do not incorporate any schema or prior knowledge to guide the extraction. We observe that errors mainly occur when the sentence is complex (with more than one entity and predicate) or when relations are not explicitly mentioned in the sentence. We further analyzed the text triple extractor errors. The error analysis at the level of the triple subject, predicate, and object showed that most errors are in predicates (40.17%), followed by objects (35.98%) and subjects (23.85%).
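A minimal sketch of such an error attribution, assuming every failed test sentence has been tagged with the pipeline stage that first went wrong; the tags below are illustrative, not the WebNLG results:

from collections import Counter

# Illustrative per-sentence tags naming the first failing pipeline stage.
failed_stage = ["TE", "CR", "RL", "TE", "EL", "TE", "RL", "CR"]

counts = Counter(failed_stage)
total = sum(counts.values())
for stage, n in counts.most_common():
    # proportion of all errors attributed to this stage
    print(f"{stage}: {100 * n / total:.2f}% of errors")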
Further Analysis: Aiming to understand why IE pipelines perform with low accuracy, we conduct a more in-depth analysis per IE task. In the first analysis, we evaluated each component independently on the WebNLG dataset. Researchers [12,40] proposed several criteria for micro-benchmarking tools/components for KG tasks (entity linking, relation linking, etc.) based on the linguistic features of a sentence. We base our analysis on the following criteria; a feature-extraction sketch follows the list.
I) Text Triple Extraction: We consider the number of words (wc) in the input sentence (a sentence is termed "simple" at an average word length of 7.41 [39]; sentences with more than seven words are complex sentences). Furthermore, having a comma in a sentence (sub-clause) to separate clauses is another factor. Atomic sentences (e.g., "cats have tails") are a type of sentence that also affects the behavior of triple extractors. Moreover, nominal relations, as in "Durin, son of Thorin", are another factor impacting performance. Uppercase and lowercase mentions of words (i.e., correct capitalization of the first character rather than the entire word) in a sentence are standard error sources for entity linking components; we consider this as a micro-benchmarking criterion as well.
II) Coreference Resolution: We focus on the length of the coreference chain (i.e., the number of aliases for a single mention). Additionally, the number of clusters is another criterion in the analysis. A cluster refers to a group of mentions that require disambiguation (e.g., in "mother bought a new phone, she is so happy about it", the first cluster is mother → she and the second is phone → it). The presence of proper nouns in the sentence is studied, as well as acronyms. Furthermore, the demonstrative nature of the sentence is also observed as a factor; demonstrative sentences are those that contain demonstrative pronouns (this, that, etc.).
III) Entity Linking: The number of entities in a sentence (e=1,2) is a crucial observation for the entity linking task. Capitalization of the surface form is another criterion for micro-benchmarking entity linking tools. An entity is termed an explicit entity when the entity's surface form in a sentence matches the KG label; an entity is implicit when there is a vocabulary mismatch. For example, in the sentence "The wife of Obama is Michelle Obama.", the surface form Obama is expected to be linked to dbr:Barack_Obama and is considered an implicit entity [40]. The last linguistic feature is the number of words (w) in an entity label (e.g., The Storm on the Sea of Galilee has seven words).
IV) Relation Linking: Similar to the entity linking criteria, we focus on the number of relations in a sentence (rel=1,2). The type of relation (i.e., explicit or implicit) is another parameter. Covered relations (sentences without a predicate surface form) are also used as a feature for micro-benchmarking, e.g., "Which companies have launched a rocket from Cape Canaveral Air Force station?", where the dbo:manufacturing relation is not mentioned in the sentence. Covered relations highly depend on common sense knowledge (i.e., reasoning) and the structure of the KG [40]. Lastly, the number of words (w <= N) in a predicate surface form is also considered.
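The sketch below shows how such surface features can be computed for a sentence; whitespace tokenization and the exact checks are simplifying assumptions mirroring the criteria above:

def micro_features(sentence: str) -> dict:
    """Surface-level features used to micro-benchmark IE components."""
    tokens = sentence.rstrip(".?!").split()
    return {
        "word_count": len(tokens),
        "complex": len(tokens) > 7,        # more than seven words is complex
        "has_subclause": "," in sentence,  # comma separating clauses
        "capitalized_mentions": sum(       # candidate entity surface forms
            1 for t in tokens[1:] if t[:1].isupper()),
        "demonstrative": any(              # demonstrative pronouns present
            t.lower() in {"this", "that", "these", "those"} for t in tokens),
    }

print(micro_features("Rembrandt painted The Storm on the Sea of Galilee."))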
Figure 3 illustrates the micro-benchmarking of the various Plumber components per task. We observe that, across IE tasks, the F1 score of the components varies significantly based on the sentence's linguistic features. In fact, no single component performs equally well on all the micro-benchmarking criteria. This observation further validates our hypothesis to design Plumber for building dynamic information extraction pipelines based on the strengths and weaknesses of the integrated components. We also note in Figure 3 that all the CR components report limited performance for demonstrative sentences (demonstratives). When there is more than one coreference cluster in a sentence, all CR components observe a discernible drop in F1 score. The NeuralCoref [8] component performs best for proper nouns, whereas PyCobalt [19] performs best for the acronyms feature (almost tied by NeuralCoref). In the TE task, Graphene [7] shows the most stable performance across all categories. However, the performance of all components (except the Dependency Parser) drops significantly when the number of words in a sentence exceeds seven (wc > 7).

Fig. 3. Comparison of F1 scores per component for different IE tasks based on various linguistic features of an input sentence (number of entities, word count in a sentence, implicit vs. explicit relation, etc.). Darker colors indicate a higher F1 score. Panels: (a) the EL task, (b) the text TE task, (c) the CR task, (d) the RL task.

5 Discussion

Even though the dynamic pipelines of Plumber outperform the static pipelines, the overall performance of Plumber and the baselines for the KG information extraction task remains low. Our detailed and exhaustive ablation studies suggest that, when individual components are plugged together, their individual performance is a major error source. However, this behavior is expected, considering that earlier research in other domains observed a similar trend, e.g., the Gerbil framework [41] in 2015 and Frankenstein [39] in 2018.
Within two years, the community has released several components dedicated to solving entity linking and relation linking [35,15,30], which were two loopholes identified by [39] for the QA task. We observe that state-of-the-art components for information extraction still have much potential to improve their performance (both in terms of runtime and F1 score). It is essential to highlight that some of the issues observed in our ablation study are very basic and have been repeatedly pointed out by researchers in the community. For instance, Derczynski et al. [12] in 2015, followed by Singh et al. [39] in 2018, showed that case sensitivity is a main challenge for EL tools. Our observation in Figure 3 again confirms that case sensitivity of entity surface forms remains an open issue even for newly released components. In contrast, on specific datasets such as CoNLL-AIDA, several EL approaches reported F1 scores higher than 0.90 [43], showing that EL tools are highly customized to particular datasets. In a real-world scenario like ours, the underlying limitations of these approaches are uncovered.
6 Conclusion and Future Work

In this paper, we presented the Plumber approach and framework for information extraction. Plumber effectively selects the best possible pipeline for a given input sentence using sentential contextual features and a state-of-the-art transformer-based classification model. Plumber has a service-oriented architecture that is scalable, extensible, reusable, and agnostic of the underlying KG. The core idea of Plumber is to combine the strengths of already existing, disjoint research efforts for KG information extraction and to build the foundation of a platform that promotes reusability in the construction of large-scale and semantically structured KGs. Our empirical results suggest that the performance of the individual components directly impacts the end-to-end information extraction accuracy.

This article does not focus on the internal system architecture or the algorithms employed in a particular IE component to analyze its failures. The focus of the ablation studies is to holistically study the collective success and failure cases for the various tasks. Our studies provide the research community with insightful results over two knowledge graphs, 33 components, and 264 pipelines. Our work is a step in a larger research agenda of offering the research community an effective way to synergistically combine and orchestrate various focused IE approaches, balancing their strengths and weaknesses while taking different application domains into account. We plan to extend our work in the following directions: i) extending Plumber to other KGs such as UMLS [5] and Wikidata [42], ii) addressing multilinguality with Plumber, and iii) creating high-performing RL components.
References
1. Angeli, G., Johnson Premkumar, M.J., Manning, C.D.: Leveraging linguistic structure for open domain information extraction. pp. 344–354. ACL (2015)
2. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A nucleus for a web of open data. In: The Semantic Web. pp. 722–735 (2007)
3. Balog, K.: Entity linking. In: Entity-Oriented Search, pp. 147–188. Springer (2018)
4. Bastos, A., Nadgeri, A., Singh, K., Mulang, I.O., Shekarpour, S., Hoffart, J.: RECON: Relation extraction using knowledge graph context in a graph neural network (2020)
5. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, D267–D270 (2004)
6. Both, A., Diefenbach, D., Singh, K., Shekarpour, S., Cherix, D., Lange, C.: Qanary - A methodology for vocabulary-driven open question answering systems. vol. 9678, pp. 625–641 (2016)
7. Cetto, M., Niklaus, C., Freitas, A., Handschuh, S.: Graphene: Semantically-linked propositions in open information extraction. In: Proceedings of the 27th COLING. pp. 2300–2311 (2018)
8. Clark, K., Manning, C.D.: Deep reinforcement learning for mention-ranking coreference models. In: Proceedings of the 2016 EMNLP. pp. 2256–2262 (2016)
9. Cui, W., Liu, S., Wu, Z., Wei, H.: How hierarchical topics evolve in large text corpora. IEEE TVCG (12), 2281–2290 (2014)
10. Daiber, J., Jakob, M., Hokamp, C., Mendes, P.N.: Improving efficiency and accuracy in multilingual entity extraction. In: Proceedings of the 9th I-Semantics (2013)
11. Delpeuch, A.: OpenTapioca: Lightweight entity linking for Wikidata (2019)
12. Derczynski, L., Maynard, D., Rizzo, G., Van Erp, M., Gorrell, G., Troncy, R., Petrak, J., Bontcheva, K.: Analysis of named entity recognition and linking for tweets. Information Processing & Management, 32–49 (2015)
13. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL. pp. 4171–4186 (2019)
14. Dong, T., Wang, Z., Li, J., Bauckhage, C., Cremers, A.B.: Triple classification using regions and fine-grained entity typing. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 77–85 (2019)
15. Dubey, M., Banerjee, D., Chaudhuri, D., Lehmann, J.: EARL: Joint entity and relation linking for question answering over knowledge graphs. In: Lecture Notes in Computer Science, pp. 108–126. Springer International Publishing (2018)
16. Fader, A., Soderland, S., Etzioni, O.: Identifying relations for open information extraction. In: Proceedings of the 2011 EMNLP. pp. 1535–1545 (Jul 2011)
17. Fensel, D., van Harmelen, F., Andersson, B., Brennan, P., Cunningham, H., Della Valle, E., Fischer, F., Huang, Z., Kiryakov, A., Lee, T.K., Witbrock, M., Zhong, N.: Towards LarKC: A platform for web-scale reasoning. In: IEEE ICSC. pp. 524–529 (2008)
18. Ferragina, P., Scaiella, U.: TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). pp. 1625–1628 (2010)
19. Freitas, A., Bermeitinger, B., Handschuh, S.: Lambda-3/PyCobalt: Coreference resolution in Python. https://github.com/Lambda-3/PyCobalt
20. Gardent, C., Shimorina, A., Narayan, S., Perez-Beltrachini, L.: Creating training corpora for NLG micro-planners. pp. 179–188 (2017)
21. Gashteovski, K., Gemulla, R., del Corro, L.: MinIE: Minimizing facts in open information extraction. In: Proceedings of the 2017 EMNLP. pp. 2630–2640 (2017)
22. Hou, Y., Jochim, C., Gleize, M., Bonin, F., Ganguly, D.: Identification of tasks, datasets, evaluation metrics, and numeric scores for scientific leaderboards construction. In: Proceedings of the 57th ACL. pp. 5203–5213 (2019)
23. Ibrahim, Y., Riedewald, M., Weikum, G., Zeinalipour-Yazti, D.: Bridging quantities in tables and text. In: 2019 IEEE 35th ICDE. pp. 1010–1021 (2019)
24. Jaradeh, M.Y., Oelen, A., Farfar, K.E., Prinz, M., D'Souza, J., Kismihók, G., Stocker, M., Auer, S.: Open Research Knowledge Graph: Next generation infrastructure for semantic scholarly knowledge. In: K-CAP 2019, Marina Del Rey (2019)
25. Kertkeidkachorn, N., Ichise, R.: T2KG: An end-to-end system for creating knowledge graph from unstructured text. In: AAAI Workshops. vol. WS-17 (2017)
26. Kim, J.D., Unger, C., Ngomo, A.C.N., Freitas, A., Hahm, Y.g., Kim, J., Nam, S., Choi, G.H., Kim, J.u., Usbeck, R., et al.: OKBQA framework for collaboration on developing natural language question answering systems (2017)
27. Liang, S., Stockinger, K., de Farias, T.M., Anisimova, M., Gil, M.: Querying knowledge graphs in natural language (2020)
28. Liu, Y., Zhang, T., Liang, Z., Ji, H., McGuinness, D.: Seq2RDF: An end-to-end application for deriving triples from natural language text (2018)
29. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach (2019)
30. Mihindukulasooriya, N., Rossiello, G., Kapanipathi, P., Abdelaziz, I., Ravishankar, S., Yu, M., Gliozzo, A., Roukos, S., Gray, A.: Leveraging semantic parsing for relation linking over knowledge bases. In: ISWC (2020)
31. Morbidoni, C., Polleres, A., Tummarello, G., Le-Phuoc, D.: Semantic Web Pipes (2007)
32. Niklaus, C., Cetto, M., Freitas, A., Handschuh, S.: A survey on open information extraction. In: Proceedings of the 27th COLING. pp. 3866–3878 (2018)
33. Ponza, M., Del Corro, L., Weikum, G.: Facts that matter. In: Proceedings of the 2018 EMNLP. pp. 1043–1048. ACL (2018)
34. Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D., Manning, C.: A multi-pass sieve for coreference resolution. In: EMNLP (2010)
35. Sakor, A., Onando Mulang', I., Singh, K., Shekarpour, S., Esther Vidal, M., Lehmann, J., Auer, S.: Old is gold: Linguistic driven approach for entity and relation linking of short text. pp. 2336–2346. ACL (2019)
36. Sakor, A., Singh, K., Patel, A., Vidal, M.E.: Falcon 2.0: An entity and relation linking tool over Wikidata. In: CIKM (2020)
37. Sanh, V., Wolf, T., Ruder, S.: A hierarchical multi-task approach for learning embeddings from semantic tasks. In: Proceedings of the AAAI, 6949–6956 (2019)
38. Singh, K., Mulang, I.O., Lytra, I., Jaradeh, M.Y., Sakor, A., Vidal, M., Lange, C., Auer, S.: Capturing knowledge in semantically-typed relational patterns to enhance relation linking. In: Proceedings of the Knowledge Capture Conference, K-CAP 2017, Austin, TX, USA, December 4-6, 2017. pp. 31:1–31:8 (2017)
39. Singh, K., Radhakrishna, A.S., Both, A., Shekarpour, S., Lytra, I., Usbeck, R., Vyas, A., Khikmatullaev, A., Punjani, D., Lange, C., Vidal, M.E., Lehmann, J., Auer, S.: Why reinvent the wheel: Let's build question answering systems together. pp. 1247–1256. WWW '18 (2018)
40. Singh, K., Saleem, M., Nadgeri, A., Conrads, F., Pan, J.Z., Ngomo, A.C.N., Lehmann, J.: QaldGen: Towards microbenchmarking of question answering systems over knowledge graphs. In: ISWC. pp. 277–292 (2019)
41. Usbeck, R., Röder, M., et al.: GERBIL: General entity annotator benchmarking framework. In: Proceedings of the 24th WWW. pp. 1133–1143 (2015)
42. Vrandečić, D., Krötzsch, M.: Wikidata: A free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)