Creating a Scholarly Knowledge Graph from Survey Article Tables
Allard Oelen, Markus Stocker, and Sören Auer

L3S Research Center, Leibniz University of Hannover, Germany
[email protected]
TIB Leibniz Information Centre for Science and Technology, Germany
{markus.stocker,auer}@tib.eu

Abstract.
Due to the lack of structure, scholarly knowledge remains hardly accessible for machines. Scholarly knowledge graphs have been proposed as a solution. Creating such a knowledge graph requires manual effort and domain experts, and is therefore time-consuming and cumbersome. In this work, we present a human-in-the-loop methodology used to build a scholarly knowledge graph leveraging literature survey articles. Survey articles often contain manually curated and high-quality tabular information that summarizes findings published in the scientific literature. Consequently, survey articles are an excellent resource for generating a scholarly knowledge graph. The presented methodology consists of five steps, in which tables and references are extracted from PDF articles, tables are formatted and finally ingested into the knowledge graph. To evaluate the methodology, 92 survey articles, containing 160 survey tables, have been imported in the graph. In total, 2 626 papers have been added to the knowledge graph using the presented methodology. The results demonstrate the feasibility of our approach, but also indicate that manual effort is required and thus underscore the important role of human experts.
Keywords:
Scholarly Communication · Scholarly Knowledge Graphs · Tabular Data Extraction
1 Introduction

Scholarly communication is mainly document-based and the communicated scholarly knowledge is therefore hardly machine-actionable [21]. Scholarly knowledge graphs have the potential to solve these issues by making knowledge structured and thus more machine processable. Existing initiatives for scholarly information systems, e.g., the Microsoft Academic Graph [8] or Crossref [15], mainly focus on bibliographic metadata and not on the actual research contributions. The Open Research Knowledge Graph (ORKG) [10] aims to build a knowledge graph infrastructure that publishes the research contributions of scholarly publications rather than only the metadata. The approach is to crowdsource structured paper descriptions by including paper authors and domain experts. ORKG
primarily relies on synergistically combining crowdsourcing and automated extraction rather than, as other systems such as Semantic Scholar, exclusively on automated techniques to extract knowledge from scholarly articles, mainly because automated extraction methods, for example Natural Language Processing (NLP), do not have sufficient accuracy to generate the high-quality knowledge graph needed to obtain suitable state-of-the-art overviews for researchers.

In this work, we present a human-in-the-loop methodology to create a scholarly knowledge graph by extracting knowledge from survey tables. Specifically, we leverage survey tables from literature review papers. Tables in survey papers generally consist of high-quality research data that has been manually curated by domain experts. Conducting a literature review is a labour-intensive task and writing a review article is often more time-consuming than writing a research article [32]. Compared to natural text, tables present information in a semi-structured manner, making the creation of a structured graph from such data less complicated. Additionally, survey tables present relevant information, which is why the survey was conducted and published in the first place. We present a supervised approach to first extract data from survey articles and afterwards build a knowledge graph from this data. Compared to sole crowdsourcing, the approach of extracting knowledge is more efficient because the review has already been conducted by the authors of the survey paper. Taking into account the previously mentioned considerations, our work addresses the following research question: How to efficiently populate a scholarly knowledge graph with high-quality knowledge?

Fig. 1: Systematic workflow in which survey articles are used to build a scholarly knowledge graph (steps: 1. Select papers, 2. Table extraction, 3. Table formatting, 4. Reference extraction, 5. Build graph). The input of our methodology is survey articles in PDF format and the output is a scholarly knowledge graph.
We propose a methodology for extracting tabular survey data. This methodology is used to create a scholarly knowledge graph from survey articles. An overview of the systematic workflow is depicted in Figure 1.

The rest of this paper is structured as follows. Section 2 discusses the related work. Section 3 introduces the proposed five-step methodology for building the knowledge graph from survey articles. Section 4 presents the results. Section 5 discusses the present and future work. Finally, Section 6 concludes the presented work.

2 Related Work

Survey articles provide well-structured overviews of the literature [33]. The terms “literature review” and “literature survey” are sometimes used interchangeably in the literature, but we make the following distinction. We refer to the tables within review articles as literature surveys. Together with a (textual) analysis and explanation, they form the literature review. Among other things, literature reviews are helpful in delimiting the research problem, avoiding fruitless approaches [5] and discovering new research directions [6]. Conducting a literature review is a complicated and time-consuming activity [33]. When literature reviews are not available for certain fields, the development of those fields could be weakened [32]. Because of the importance of literature surveys to scientific research, leveraging surveys to build a graph results in a high-quality and relevant scholarly knowledge graph. Some existing work on semantifying literature surveys exists [4,29,23]. However, those approaches are not (semi-)automated and therefore do not scale well to larger amounts of survey articles.

One aspect of the proposed methodology is table extraction from survey articles. Portable Document Format (PDF) is the most common format for scientific articles [12]. Extracting tables from PDF documents is a cumbersome process since the tabular structure is not stored within the file itself [11].
This means that regular PDF extraction tools are only able to extract the text within a table, but lose the tabular structure. Tools that specifically focus on table extraction from PDF files use segmentation techniques to estimate the position of rows and columns [7]. Corrêa et al. conducted a literature survey on table extraction tools [3]. They concluded that Tabula (https://tabula.technology) is the most suitable open-source tool. Based on these findings, we decided to use Tabula. Tabula is criticized for its lack of documentation [26], but for our use case this is not considered problematic.

Another aspect of the proposed methodology is reference extraction from PDFs. Since every individual article referenced within a survey table is imported, metadata from this article should be collected. This is done by parsing the references that are used within a table. For this, we use the state-of-the-art PDF extraction tool GROBID [13]. GROBID focuses specifically on extracting bibliographic data from scholarly articles [19]. Lipinski et al. [17] compared GROBID to other PDF metadata extraction tools, and found that GROBID performed best.

Publishing data as structured or semantic data is a well-researched topic in various domains. For example, challenges related to publishing semantic open government data are similar to the challenges in our research. This includes extracting data from legacy documents, often in PDF format [2,3]. Furthermore, the literature describes use cases on publishing unstructured data as semantic data (e.g., [9,20,27]). These existing approaches differ from our approach since they generally aim to semantify a homogeneous set of documents. This enables them to create data-specific ontologies. In our case, this is not feasible since we work with a highly heterogeneous set of survey tables coming from different domains and comparing different aspects of papers. Table 1 provides
a related work overview. In this overview, our proposed method is compared to other related approaches. To the best of our knowledge, this work is the first to build a knowledge graph at scale from survey tables.

Table 1: Related work compared to the method presented in our study. The full comparison is available via the ORKG.

Study | Name | Method automation | Scope | Input format | Output format | KG(a) creation | Reference extraction | User interface
This study | ORKG | Semi-automatic | Survey tables | PDF | JSON, RDF(b) | Yes | Yes | Yes
[28] | SemAnn | Semi-automatic | Scholarly articles | PDF | RDF(b) | No | No | Yes
[16] | Web Tables | Automatic | Web tables | HTML | JSON | No | No | No
[18] | TableSeer | Automatic | Scholarly articles | PDF | Relational database | No | No | Yes
[25] | TEXUS | Automatic | Documents (application agnostic) | PDF | Abstract table representation | No | No | No
[1] | None | Automatic | Web tables | HTML, Spreadsheet | Relational schema | No | No | No

(a) Knowledge Graph; (b) Resource Description Framework
Use Case: Open Research Knowledge Graph.
Extracted survey data can be imported in a variety of different (scholarly) knowledge graphs, such as the Microsoft Academic Graph, Wikidata [14] or ORKG. We chose ORKG as our use case for the following reasons. The ORKG provides tools that specifically focus on building paper comparisons (i.e., literature surveys), making it the most suitable infrastructure for this study. By using the extracted survey data, the ORKG automatically generates a similar tabular survey view as was originally presented in the review paper [22]. Additionally, the literature surveys within ORKG are compliant [23] with the FAIR data principles [34], thus making them Findable, Accessible, Interoperable and Reusable. The imported survey tables are FAIR, in contrast to the tables originally presented in the non-FAIR PDF article. This has several benefits, among others:

– Comparisons can evolve over time, are not static and do not become stale after publication.
– Comparisons represent a broader community consensus, since many researchers and curators can revise, discuss and annotate them.
– Via the ORKG search interface it is possible to search for specific comparisons and to create dynamic custom comparison views.
– Survey data can be reused by other researchers more easily because of its machine-readable export formats (e.g., export as CSV or RDF).

Fig. 2: Methodology for importing survey tables into the scholarly knowledge graph. The five steps are: 1. Paper selection: manually search for review papers that contain survey tables. 2. Table extraction: semi-automatically extract survey tables from PDF papers. 3. Table formatting: clean the extracted tables and prepare them for import. 4. Reference extraction: extract the bibliographic metadata for each paper in the survey table. 5. Build graph: combine the data from the previous steps and build the graph.
3 Methodology

We now present a five-step methodology for the creation of a scholarly knowledge graph from survey tables. In order to reach sufficient quality, the methodology takes a human-in-the-loop approach in which multiple steps require human interaction. Data quality improves with human evaluation and, if needed, correction of the extracted data. The methodology is displayed in Figure 2. The scripts required to perform the steps are available online (https://doi.org/10.5281/zenodo.3739427).

3.1 Paper Selection

In the first step, suitable survey papers are selected based on multiple criteria. The purpose is to find survey papers from a diverse range of domains. Therefore, a protocol has been designed to determine which papers are suitable for data extraction. The structured nature of the selection process is needed to be able to draw conclusions about the percentage of survey papers that present their information in such a way that extracting data is relatively straightforward.
Search Strategy.
Table 2 lists the search engines used to find survey articles. Google Scholar is chosen to ensure that survey papers from various fields are searched. Additionally, the ACM Digital Library has been selected because the ORKG currently focuses mainly on the Computer Science domain.
Table 2: Search engines used to find survey articles.
Search engine | Field | Evaluated papers
Google Scholar | All | 335
ACM Digital Library | Computer Science | 80

The search is limited to 100 papers that are suitable for import. The following search criteria are used:

– Google Scholar: the article title contains the term “literature survey”.
– ACM Digital Library: queries “literature review” and “literature survey”.
– The survey article has been published after 2002.
– The results are sorted by relevance.

The rationale for selecting papers published after the year 2002 is that, in general, more recent papers are more interesting for research and should therefore have more priority in the scholarly knowledge graph. Articles published before 2002 can still be part of the graph, since this criterion only applies to the survey articles themselves, and not to the papers being reviewed in those articles.
Selection Criteria.
Papers that satisfy the inclusion criteria are selected for the import process. The inclusion criteria are defined as follows:

1. The article contains at least one table that lists scientific literature (i.e., the literature is presented in a semi-structured manner).
2. The article compares literature based on published results and does not solely textually summarize the content of original papers.
3. The survey table should be in markup format and not included as a raster image.
4. The table structure should be suitable for import (e.g., one table row should provide information about one publication).
5. The article is written in English.

Inclusion criterion 1 ensures that a survey article does not only textually summarize the literature, but also provides a semi-structured comparison (in tabular form). Although papers that textually review scientific literature are interesting for importing as well, this is out of scope for this work. Criterion 2 ensures only surveys that compare actual paper results are included. This excludes surveys researching, for instance, the growth of a field. Criterion 3 excludes tables in image format. This is because of the tabular extraction method we use, which is based on character extraction and does not use the Optical Character Recognition (OCR) needed to support image extraction [30]. Criterion 4 only selects tables that are suitable for import. Our methodology only supports paper import when one row in a table represents one paper. Although minor changes can be made manually (e.g., merging multiple tables), in case the structure of the table deviates significantly from the required format, the table is excluded. Finally, criterion 5 ensures a homogeneous semantic integration into the currently English monolingual knowledge graph. The result of this step is a set of selected papers in PDF format.
3.2 Table Extraction

This step focuses on extracting the tables from the PDF files collected in the previous step. Not only should the text within the table be extracted, but the tabular structure should be preserved as well. As explained in the related work section, we use Tabula to perform the table extraction. Each PDF article is uploaded via the Tabula user interface. Afterwards, the regions of the tables are manually selected within the interface. Although Tabula provides functionality to automatically detect tables, the accuracy is not sufficient for our use case. The performance is especially low for articles with a two-column layout. Additionally, not all tables within an article have to be extracted, since not all of them list and compare literature. Arguably, the manual selection method is most useful in this methodology since human judgment is needed in the selection process. Part of the extraction step is quality assurance after the extraction. When needed, extraction errors are manually fixed. Tabula supports two types of extraction, namely “Stream” and “Lattice”. The Stream extraction method is based on white space between columns while Lattice is based on boundary lines between columns. During the extraction it is possible to switch between the different methods, which allows for selecting the best method for a particular table. The result of this step is a set of CSV files, in which each file represents one survey table from a review article.
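The extraction step can also be scripted. The sketch below uses tabula-py, the Python binding for Tabula, to extract a manually located table to CSV; the helper functions, file names and page numbers are our own illustration, not part of the original workflow (which used the Tabula user interface).

```python
from pathlib import Path


def csv_name(pdf_path: str, table_index: int) -> str:
    """Derive the output CSV name for the n-th extracted table of a PDF."""
    return f"{Path(pdf_path).stem}_table{table_index}.csv"


def extract_table(pdf_path: str, page: int, lattice: bool, table_index: int) -> str:
    """Extract one table from a PDF page to CSV.

    lattice=True uses the boundary-line based "Lattice" method; otherwise the
    whitespace-based "Stream" method is used, mirroring the two Tabula modes
    described above.
    """
    import tabula  # requires the tabula-py package and a Java runtime

    mode = {"lattice": True} if lattice else {"stream": True}
    tables = tabula.read_pdf(pdf_path, pages=page, **mode)
    out = csv_name(pdf_path, table_index)
    tables[0].to_csv(out, index=False)
    return out


if __name__ == "__main__":
    # Hypothetical input file and page; in practice the table region is
    # selected manually because automatic detection is not accurate enough.
    print(extract_table("survey.pdf", page=3, lattice=True, table_index=1))
```

Switching between Lattice and Stream per table, as described above, then reduces to flipping the `lattice` flag.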
3.3 Table Formatting

The CSV files containing the extracted tables from the review articles should be formatted in a structure that is suitable for building a graph. Since the data from the CSV files is extracted automatically, all tables should have the same format. In this step, the formatting of the tables is changed when necessary. For some tables, a considerable amount of change is required, while for other tables only minor changes are needed. Changes can include merging, splitting, adding and removing both columns and rows. We use OpenRefine [31] to perform bulk operations on tables. A table is formatted in such a way that it adheres to the following rules:

1. The first row of the table is the header.
2. Each row represents one reviewed paper.
3. Each row has a column called “Reference”.
4. The reference cell should contain the citation key for a paper.
5. Non-literal values are prefixed with “[R]” in the column header.
6. When needed, abbreviations are replaced by the full value from the legend.
For rule 2, in some cases a multidimensional table has to be flattened. This can often be accomplished by adding additional columns to the table. Also, in some cases a table has to be transposed to ensure that each row contains one paper. Rules 3 and 4 ensure that bibliographic metadata can be fetched for each paper in the next step. Rule 5 makes a distinction between literal values and resources. The default cell type is a literal, and when [R] is prefixed to a header label, the cells are considered resources. Finally, rule 6 makes the content of the table readable without requiring the original text from the legend. Table legends are often used to condense information to improve readability.
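As an illustration, the validation and legend-expansion part of these rules can be expressed as a small pandas routine. In the actual workflow this was done interactively in OpenRefine; the function, column names and legend below are our own example.

```python
import pandas as pd


def format_survey_table(df: pd.DataFrame, legend: dict) -> pd.DataFrame:
    """Check rule 3 (a 'Reference' column must exist) and apply rule 6
    (replace abbreviations with the full values from the table legend).
    '[R]' header prefixes marking resource columns (rule 5) are kept as-is."""
    if "Reference" not in df.columns:
        raise ValueError("each row needs a 'Reference' citation key (rule 3)")
    # df.replace with a dict substitutes matching cell values table-wide.
    return df.replace(legend)


# Hypothetical extracted table and legend for demonstration.
table = pd.DataFrame({
    "Reference": ["[1]", "[2]"],
    "[R] Method": ["SVM", "CNN"],
    "Accuracy": ["0.91", "0.87"],
})
legend = {"SVM": "Support Vector Machine", "CNN": "Convolutional Neural Network"}
formatted = format_survey_table(table, legend)
```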
3.4 Reference Extraction

As mentioned earlier, each table row represents one paper. For each row, there is a value that contains the reference key from the original paper. The reference key is often a numerical reference, in the form of [n], where n represents the reference number. In another frequently used citation style, the author names combined with the publication year are used as a reference key. The citation key is used to automatically capture the bibliographic metadata for an article. In order to extract references from an article, we use the PDF extraction tool GROBID. GROBID processes the full PDF article, first to extract all citations from the paper’s reference list and then to connect the citation keys used in the text to their respective citation strings. In case a reference key cannot be extracted from the paper’s text, a reference key is generated automatically based on the author’s name and publication year.

When the citation is extracted and parsed, five additional columns are appended to the table: paper title, authors, publication month, publication year and the Digital Object Identifier (DOI). In case a citation key could not be automatically mapped to an actual citation, a citation can be provided manually. The full citation text can be copied directly from the paper (including paper title, authors, etc.) and is then parsed by GROBID to get structured bibliographic metadata. To perform the process of adding references, we created a Python script. This script first tries to automatically fetch the metadata. In case the reference is not found, a command line input field is displayed to enter the citation manually.
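A manually copied citation string can be resolved through GROBID's REST service, which exposes a processCitation endpoint returning TEI XML. The sketch below is our own minimal client, not the script used in this work; the local server URL is an assumption, and the TEI extraction is simplified to title and year.

```python
import xml.etree.ElementTree as ET

GROBID_URL = "http://localhost:8070"  # assumed locally running GROBID service
TEI = "{http://www.tei-c.org/ns/1.0}"


def tei_fields(tei_xml: str) -> dict:
    """Pull the title and year out of a TEI <biblStruct> element."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(f".//{TEI}title")
    date = root.find(f".//{TEI}date")
    year = date.get("when") if date is not None else None
    return {"title": title, "year": year}


def parse_citation(raw_citation: str) -> dict:
    """Send one raw citation string to GROBID's processCitation endpoint."""
    import requests  # deferred import; only needed when a server is available

    resp = requests.post(f"{GROBID_URL}/api/processCitation",
                         data={"citations": raw_citation})
    resp.raise_for_status()
    return tei_fields(resp.text)
```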
3.5 Build Graph

The final step is to build a knowledge graph from the previously created CSV files. An example of the resulting graph for a single paper is depicted in Figure 3. Firstly, a settings file is created which lists the table numbers, a suitable title for each table and a reference to the original survey article. The reference is required to attribute the work done by the authors of the survey article. The table title is manually created based on the original table caption. In case no suitable caption is available, a more suitable title is written.
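The step of turning formatted CSV rows into graph entries can be sketched as follows. The payload shape and the papers route are assumptions loosely based on the public ORKG API, not the actual import scripts; the mapping shows how the bibliographic columns appended during reference extraction are separated from the survey data.

```python
import csv

# Columns appended by the reference extraction step, plus the citation key.
BIBLIO_COLUMNS = {"Reference", "title", "authors", "month", "year", "doi"}


def row_to_payload(row: dict, comparison_title: str) -> dict:
    """Turn one formatted table row into a paper with a single contribution."""
    survey_data = {k: v for k, v in row.items() if k not in BIBLIO_COLUMNS}
    return {
        "paper": {
            "title": row["title"],
            "doi": row.get("doi", ""),
            "publicationYear": row.get("year", ""),
            "contributions": [{"name": comparison_title, "values": survey_data}],
        }
    }


def ingest_table(csv_path: str, comparison_title: str,
                 host: str = "https://orkg.org") -> None:
    """POST every row of a formatted CSV file as a paper (hypothetical route)."""
    import requests  # deferred; only needed for the actual API calls

    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            requests.post(f"{host}/api/papers/",
                          json=row_to_payload(row, comparison_title))
```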
Fig. 3: Example of the resulting subgraph for importing a single paper from a survey table. Metadata captured by reference extraction is displayed in blue. Data coming from the survey table is displayed in orange and ORKG-specific data is displayed in white.

Next, a Python script is used to select all rows from the tables. For each row, a paper is added to the graph via the ORKG API. For each table, a comparison is created in ORKG. The title and reference from the previously generated settings file are attached to this comparison. The comparison can be used later in ORKG to generate the same tabular literature overview as originally presented in the survey paper.

3.6 User Interface

Based on the steps from our methodology, a web User Interface (UI) was created that integrates all steps into a single interface. The interface provides a streamlined process for importing survey tables, as depicted in Figure 4. The UI is specifically designed to make importing a table an effortless task without the need to download any tools or to be able to operate these tools. In the background, the same tools from the methodology are used to extract tables (Tabula) and extract references (GROBID). The first step is to upload a PDF file and select the survey table within this file. Afterwards, the table is extracted and the formatting can be fixed with an integrated spreadsheet editor. Then, for each row the respective paper reference is extracted. Finally, the data is ingested in the knowledge graph.

The UI was not used to import the survey tables presented in Section 4. The interface is designed to import individual survey tables rather than importing large amounts of tables at once. In the UI, all steps required to import a single table must be performed consecutively.
Fig. 4: Survey table import User Interface integrating all steps from the methodology.

To increase efficiency when importing large amounts of tables, it helps to first finish a step for all papers before moving to the next step. The UI provides a method to extend the graph beyond the surveys extracted in this work. In the future, this interface will therefore be integrated in the ORKG.
4 Results

In this section, we report the results of the import process for each step of the methodology. Table 3 summarizes the results for all steps.

4.1 Paper Selection
The dataset of results is published online [24]. This set contains the selected papers, the ORKG comparisons and the ingested papers. The selected papers file lists IDs, paper titles, table references, sources and references. The IDs are used to record any additional information about the import process for a specific paper. IDs are missing for papers that were selected in the first place, but were excluded after revisiting the inclusion criteria. Additionally, table references refer to the original table references used in the survey article.

In total, 335 papers from Google Scholar were evaluated against the selection criteria described in Section 3.1. Out of these papers, 78 met the criteria and have therefore been selected for importing. From the ACM Digital Library 80 papers were evaluated and 14 papers have been selected. In total, 22% of the evaluated review papers are suitable to be imported with the presented approach.
Table 3: Summary of the results of all steps.

Description | Amount

Paper selection
  Amount of evaluated papers | 415
  Amount of selected papers | 92

Table extraction
  Total amount of extractions (partial tables) | 265
  Amount of extracted complete tables | 160

Reference extraction
  Found references | 2 069
  Not found references | 1 137

Build graph
  Individual amount of imported papers | 2 626
  Imported data cells (with metadata) | 40 584
  Imported data cells (without metadata) | 21 240
4.2 Table Extraction

We extracted 160 tables from the 92 survey articles. In 22 cases, tables stretched across multiple pages, which resulted in a total of 265 extractions performed with Tabula.

Table 4: Issues that occurred during the extraction of tables from the survey articles. Issues are counted per article.

Table 4 lists the most frequently occurring issues with the extraction. Issues 1 and 2 occur mostly when no boundary lines are present between table columns. In this case, the Stream extraction method has to be used, which often results in rows that are not correctly merged (e.g., multi-line sentences are put in separate rows while in the original table they are in the same row). Issue 3 is also mostly present when using the Stream method. When the Lattice method can be used for the extraction, the result is generally of higher quality. When no table borders (or boundary lines) are present, this method does not work and the Stream method has to be used. Issue 4 is caused by general extraction errors, which can result in tables with wrongly extracted text. Additionally, formulas and other text styling are not supported, which compounds this issue. Issues 7 and 8 result in tables that are not, or only partially, imported. The other issues are self-explanatory.
4.3 Reference Extraction

In total, we extracted 2 626 unique papers from 3 206 rows. The amount of rows is higher than the amount of extracted papers because multiple rows can refer to the same paper; each paper has only one graph entry and any additional data is added to the existing paper. For each paper, the respective citation was retrieved. In 2 069 cases the citation could be extracted automatically from the row (65% of the cases). In 1 137 cases it was not possible to automatically extract the reference (35% of the cases). For those cases, the citation was manually copied from the paper. There were multiple reasons why automatic reference extraction was not successful. Most issues occurred for references that used a numeric citation key: GROBID’s performance for extracting numeric references from tables was low, and oftentimes numeric table references were not recognized.

In case a reference is only used in a table and not elsewhere in the article, automatic reference extraction was oftentimes not possible. When an author name was used as citation key, problems occurred mostly because of the different citation styles. While some citation formats only use the last name of the first author, suffixed by “et al.”, other formats can list all author names. When a format was used that deviates from the standard implementation, automatic extraction was not possible.
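The two citation-key styles, and the simple heuristics needed to match them, can be illustrated with the following sketch; this is our own illustrative reimplementation, not the exact matching used in the import scripts.

```python
import re

# "[57]" style numeric keys.
NUMERIC = re.compile(r"\[(\d+)\]")
# "Shen et al. 2020" or "Liu et al. (2020)" style author-year keys.
AUTHOR_YEAR = re.compile(r"([A-Z][\w'-]+)(?:\s+et\s+al\.?)?,?\s+\(?(\d{4})\)?")


def match_key(cell: str):
    """Return ('numeric', n) for keys like '[57]', ('author-year', name, year)
    for keys like 'Shen et al. 2020', or None when no key is recognized."""
    m = NUMERIC.search(cell)
    if m:
        return ("numeric", int(m.group(1)))
    m = AUTHOR_YEAR.search(cell)
    if m:
        return ("author-year", m.group(1), m.group(2))
    return None
```

Formats that deviate further, e.g., full author lists, would need additional patterns, which is exactly where the manual fallback described above comes in.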
4.4 Build Graph

In total, we added 2 626 papers to the knowledge graph. These papers are used in 160 different comparisons. A complete list of the generated ORKG comparisons and a list of all ingested papers is available via [24]. In total, 21 240 table cells have been imported, excluding the bibliographic metadata. Including metadata, the total is 40 584 data cells.
5 Discussion

The presented methodology takes a human-in-the-loop approach as opposed to a fully automated approach. Compared to a fully manual approach, the proposed approach saves considerable time. In previous work [23], we manually imported only four survey articles. On average, this process took 4 hours per article. For each of the papers, a Python script was created specifically to import the survey table with its references and data. An example of such a script for one paper can be found online (https://gitlab.com/TIBHannover/orkg/orkg-papers/-/blob/master/question-answering-import.py). For the methodology used in this paper, the time to import one survey article was on average 15 minutes. Compared to the 4 hours of the manual approach, this is considerably faster (i.e., a 16-fold increase in speed). The minimum amount of time needed to import a relatively small table was 2 minutes; that table could be extracted without any issues. The maximum amount of required time was approximately 60 minutes. This was for a table with a complex layout, stretched across multiple pages and without boundary lines. Most time was spent on fixing extraction issues. To further improve time performance, we identified two tasks that are time-consuming and can potentially be improved. The first task relates to fixing errors that occurred during the table extraction by Tabula. Most errors occurred when tables did not have boundary lines between columns and rows. A potential solution, and possible future research direction, is to create an interface that supports manually drawing boundary lines between rows and columns. The second task is related to adding missing references, which have to be manually copied from the PDF article. In total, 65% of the references were extracted automatically. By applying more advanced heuristics to match reference keys with their respective references, this percentage can be improved.
The impact of the methodology relates to the amount of survey papers that are suitable for our approach (i.e., surveys representing information in tabular format). In order to provide insights on the impact, a structured search protocol has been employed in the paper selection step. As the results show, out of the 415 evaluated papers, 92 were suitable to be imported. This indicates that since 2002, 22% of the published survey papers contain comparison tables. Therefore, arguably, our methodology can have considerable impact when applied more broadly. In the paper selection, non-survey papers were excluded. However, it is not uncommon for research articles to also contain tables with related work (e.g., Table 1 in this article). Thus the paper selection step could be extended to also include other articles to have a broader impact.
The extracted knowledge graph consists of structured scholarly data. The quality of the knowledge graph could be further improved by providing more semantics for the data. Currently, a primitive method is used to map existing properties and resources. This is based on a lookup by resource label: in case a result is found, the resource is mapped; if not, a new resource is created. A more advanced mapping of resources and properties to existing ontologies would improve the machine readability of the data. Tables containing large amounts of natural text (e.g., textually describing a methodology) could be further processed using named entity recognition and linking. This results in more structured data and therefore a higher quality knowledge graph. Approaches to improve the overall quality of the graph are part of future work.

In total, we extracted 92 survey articles from a variety of domains. In the future, more survey articles will be ingested in ORKG, for multiple domains. The User Interface (UI) presented in Section 3.6 can be used to support users in importing survey tables. The UI will be further improved to make the process more efficient. Due to the dynamic nature of the interface (especially compared to a regular spreadsheet editor), mapping properties and resources to existing concepts is better supported. In the end, we aim to import as many surveys from a specific domain as possible. There are several reasons why such an approach is useful. In the first place, ORKG can serve as a digital library for literature surveys. As discussed in the related work, the platform provides tools to better find and organize surveys. Additionally, when all existing reviews for a domain are imported, the ORKG can be used as a source to find literature surveys.
In case a survey is not present in the ORKG, it means that it does not (yet) exist. This can be used as a basis to start working on new literature surveys.
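The label-based mapping described above can be sketched as a simple lookup-or-create routine. The following is a minimal, self-contained illustration of that primitive strategy; the `GraphStore` class and function names are hypothetical stand-ins, not the actual ORKG API.

```python
class GraphStore:
    """Minimal in-memory stand-in for the knowledge graph backend."""

    def __init__(self):
        self.resources = {}  # normalized label -> resource id
        self.next_id = 1

    def find_by_label(self, label):
        # Primitive mapping: exact lookup on the normalized label only,
        # without any ontology alignment or fuzzy matching.
        return self.resources.get(label.strip().lower())

    def create(self, label):
        # No match found: mint a new resource for this label.
        rid = "R{}".format(self.next_id)
        self.next_id += 1
        self.resources[label.strip().lower()] = rid
        return rid


def map_or_create(store, label):
    """Return the id of an existing resource with this label, or create one."""
    existing = store.find_by_label(label)
    return existing if existing is not None else store.create(label)
```

Because matching happens on the label alone, two table cells with the same label (e.g., from different survey tables) map to the same resource, while semantically equivalent but differently worded labels do not; this is exactly the limitation that mapping to existing ontologies would address.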
Knowledge graphs are useful to make scholarly knowledge more machine-actionable. Manually building such a knowledge graph is time-consuming and requires the expertise of paper authors and domain experts. In order to efficiently build a high-quality scholarly knowledge graph, we leverage survey tables from review articles. Generally, survey tables contain high-quality, relevant, semi-structured, and manually curated data, and are therefore an excellent source for building a scholarly knowledge graph. We presented a methodology used to extract 2 626 papers from 92 survey articles. The methodology adopts a human-in-the-loop approach to ensure the quality and usefulness of the extracted data. Compared to manually reviewing and entering research data, or to manually importing literature surveys, the methodology is considerably more efficient. In conclusion, the presented methodology provides a full pipeline that can be used to extract knowledge from PDF documents and represent the extracted knowledge in a knowledge graph. The corresponding evaluation with survey articles demonstrates the effectiveness and efficiency of the proposed methodology.
Acknowledgements.
This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and the TIB Leibniz Information Centre for Science and Technology. We want to thank our colleagues Mohamad Yaser Jaradeh and Kheir Eddine Farfar for their contributions to this work.