Towards an RDF Knowledge Graph of Scholars from Early Modern History
Presented at the 14th IEEE International Conference on Semantic Computing – Resource Track
Jennifer Blanke
Herzog August Library, Wolfenbüttel
Thomas Riechert
Leipzig University of Applied Sciences (HTWK), Leipzig
Abstract – The use of Semantic Web technologies supports research in the field of digital humanities. In this paper we focus on the creation of semantic independent online databases such as those of historical prosopography. These databases contain biographical information about historical persons. We focus on this information with an interest in German professorial career patterns from the 16th to the 18th century. In that respect, we describe the process of building an Early Modern Scholarly Career RDF Knowledge Graph from two existing online prosopography databases: the Catalogus Professorum Lipsiensium and the Catalogus Professorum Helmstadiensium. Further, we provide insight into how to query the information using KBox to answer research questions.
I. INTRODUCTION
Semantic Web technologies have brought semantics to several domains. One of these domains is historical research on online prosopography databases. In this context, the process of building a research ontology consists of using historical expertise and knowledge engineering methods in parallel. It covers the database layer, the application layer as well as the research interface layer of the Heloise Common Research Model (HCRM) [3]. Researchers start by exploring available external databases. To that end, queries are formally defined in SPARQL and can be used to access online databases available through endpoints, distributed data management frameworks such as KBox [2], or URI-dataset indices such as WIMUQ [5]. SPARQL queries can be used to extract relevant concepts and properties for research vocabularies as well as to materialize relevant data into the envisaged research ontology. This workflow enables researchers to rebuild the research ontology at any time, as long as the syntax and semantics of the data sources do not change. The usage of a common vocabulary alleviates the problem of future data inconsistencies. Additionally, the effort of exploring new databases can be minimised, as SPARQL can be used as a common vocabulary. To that end, there are many working groups targeting common vocabularies, such as Data for History.

Although Semantic Web technologies have contributed vocabularies and tools to publish a huge variety of data on the web, most of their potential can only be achieved by integrating that data. That is the case for information on German historical scholars, which is collected in different formats by each of the universities in Germany.

In this work, we discuss the different aspects of the Linked Data Life Cycle, including the challenges and results of building a domain-specific RDF Knowledge Graph (KG) of Scholars and Scholarship from the 16th to the 18th century by gathering information from different online databases.
This work describes the initial effort of fusing the Catalogus Professorum Lipsiensium and the Catalogus Professorum Helmstadiensium into the Professorial Career Patterns KG and enriching its content with data from Wikidata, DBpedia and the Deutsche Nationalbibliothek by establishing owl:sameAs links among the professors' instances. The datasets were chosen for their heterogeneity and completeness. The Catalogus Professorum Lipsiensium has information such as office and family relations. The Catalogus Professorum Helmstadiensium extends this information with data about courses, students and their written exams (Qualification Documents). The Deutsche Nationalbibliothek has detailed information about the professors' publications. Wikidata and DBpedia are used to augment the KG with information such as religion, pictures, influencers, short biographies, and alma maters.

The research aligns itself with the Digital Humanities by being interdisciplinary in nature and by combining classic historiographical research methods with Semantic Web technologies in order to enable research on scholarly careers. The interdisciplinary research is conducted through the HCRM for cross-project research in the field of academic history. We apply RDF standards to interlink historical facts and to describe the vocabulary in a formal manner. The interlinking process comprises the vocabulary and instance alignments as well as the quality control. Finally, we present the evolution of the KG and clarify how the KG can be used to conduct historical research by using SPARQL queries. The main contributions of this paper are: (1) an ontology and (2) a dataset of Early Modern scholars and scholarship, as well as (3) a detailed description of the methodological approach. Table I outlines the resources described in this work: (1) an ontology and (2) a dataset of Early Modern scholars, both available under the CC BY-SA 4.0 license.
Footnotes: Catalogus Professorum Lipsiensium: https://research.uni-leipzig.de/catalogus-professorum-lipsiensium/ ; Catalogus Professorum Helmstadiensium: http://uni-helmstedt.hab.de

TABLE I. RESOURCES DESCRIBED IN THIS PAPER.
Resource    URL                                                                        License
Ontology    https://github.com/pcp-on-web/ontology                                     CC BY-SA 4.0
Dataset     https://gitlab.imn.htwk-leipzig.de/emarx/pcp-on-web/tree/master/dataset    CC BY-SA 4.0

Fig. 1. Dataflow for creating the Professorial Career Patterns KG. Labels shown in the figure: https://catalogus-professorum.org, https://pcp-on-web.htwk-leipzig.de, http://dbpedia.org, http://wikidata.org, http://uni-helmstedt.hab.de/, https://research.uni-leipzig.de/catalogus-professorum-lipsiensium, Professorum Helmstadiensium Linked Dataset.
II. DATASET
The Professorial Career Patterns (PCP) RDF Knowledge Graph is composed of data extracted from different data sources. Figure 1 shows the flow of the data contained in the KG. The data from the Catalogus Professorum Lipsiensium and the Catalogus Professorum Helmstadiensium was first extracted from the original databases and converted to RDF (steps 1 and 2) in previous works. The datasets were then processed and fused into the PCP KG (steps 3 and 4) under a common ontology, described in Section III. The content of the KG is then enriched by interlinking the data from Wikidata (step 5), DBpedia (step 6) and the Deutsche Nationalbibliothek (step 7) through owl:sameAs links. In the following sections we discuss the fusing, interlinking, evolving and exploring processes.

III. ONTOLOGY
The ontology published in this work is bilingual: it contains labels and concept descriptions in both English and German. The bilingual support enables transcultural transfer of specific Early Modern German scholarly concepts and simplifies data consumption at a later stage. In its current version, the PCP Knowledge Graph describes classes and properties. An overview of the PCP-on-Web ontology is shown in Figure 2. The ontology is built around the class pcp:Person and its subclass pcp:Professor. In the following, we describe some of the ontology's classes.

a) Scholars: A prosopographical catalog gives basic information about persons, such as their name, birth and death data, as well as data about their academic achievements. The PCP-on-Web ontology allows describing qualifications, publications, theses, annual reports and more of the persons' academic and personal CVs. One of the key properties in the KG is the PND, which uses the Gemeinsame Normdatei (GND) of the German National Library. The GND can be used to identify historical persons. Projects such as PND/BEACON make it possible to interlink other databases using the PND identifier.

b) Period of Life: To support a fine-grained representation of different periods within the life of a person, we introduced the concept pcp:PeriodOfLife, which is associated with a person through the property pcp:hasPeriod. The ontology supports the modeling of different periods of life such as pcp:Career, pcp:Office (e.g. dean, rector), pcp:Study, pcp:Qualification (e.g. dissertation and habilitation), pcp:School, pcp:Birth and pcp:Death. Each of these period-of-life subclasses contains different properties which are used to describe a particular instance in more detail. However, all inherit the delimiting properties pcp:from and pcp:to, which specify the period. Different periods of life of the same person can overlap, e.g. pcp:Family usually overlaps with other periods.

c) Body: This class is used to describe relations among persons and organizations during a specific period of life (pcp:PeriodOfLife). Examples of bodies are the classes pcp:Academy, pcp:AcademySociety, pcp:Faculty, pcp:Party and pcp:Institution. A person can belong to different bodies.

d) Family: Family relations are represented through the class pcp:Family. Instances of the class pcp:Person are then related to an instance of the pcp:Family class using the following properties: pcp:familyChild, pcp:familyAdoptiveChild, pcp:familyFosterChild, pcp:familyParent, pcp:familyCohabitant.

e) Academy: To add information related to the academic life of the scholars, the dataset contains classes such as pcp:Enrollment, pcp:Report, pcp:Thesis, pcp:Faculty and even pcp:Course. The dataset also includes properties such as pcp:lecturer, pcp:praeses and pcp:respondent. These metadata make it possible, for instance, to know the lectures given by a professor, who his students were, and how many theses he advised.

Footnote: PND/BEACON: https://old.datahub.io/dataset/pndbeacon

TABLE II. OVERALL JOINT AND DISJOINT PROPERTIES FUSED INTO THE PROFESSORIAL CAREER PATTERNS KG.
Operator    Leipzig    Helmstedt
Joint       21         21
Disjoint    51         35
Union       72         56

TABLE III. OVERALL JOINT AND DISJOINT CLASSES FUSED INTO THE PROFESSORIAL CAREER PATTERNS KG.
Operator    Leipzig    Helmstedt
Joint       16         16
Disjoint    23         5
Union       39         21

IV. FUSING
This section describes the process of fusing the Catalogus Professorum Lipsiensium and the Catalogus Professorum Helmstadiensium into the Professorial Career Patterns KG. It uses the following namespaces: a) helmstedt: for the Catalogus Professorum Helmstadiensium; b) leipzig: for the Catalogus Professorum Lipsiensium; and c) pcp: for the Professorial Career Patterns research ontology. Table II and Table III give an overview of the properties and classes fused in this process.

A. Vocabulary Alignment
The first step of the interlinking process is vocabulary alignment. This task was performed by a team consisting of a computer scientist and a historical researcher who is a specialist on the domain-specific databases. In the process, properties and classes with the same names or without a corresponding counterpart were automatically shifted to the new namespace http://purl.org/pcp-on-web/ontology.
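The automatic part of this step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two source namespace URIs are hypothetical placeholders, the `#` separator after the pcp namespace is an assumption, and terms that have different names but the same meaning would still need the manual review described above.

```python
# Hypothetical source namespaces; the real URIs are not given in the text.
LEIPZIG_NS = "http://example.org/leipzig/"
HELMSTEDT_NS = "http://example.org/helmstedt/"
# Target namespace from the paper; the '#' separator is an assumption.
PCP_NS = "http://purl.org/pcp-on-web/ontology#"


def local_name(uri, namespace):
    """Return the part of the URI that follows the namespace."""
    if not uri.startswith(namespace):
        raise ValueError(f"{uri} is not in namespace {namespace}")
    return uri[len(namespace):]


def shift_vocabulary(leipzig_terms, helmstedt_terms):
    """Map every source term to the pcp namespace by its local name.

    Terms with the same local name in both sources collapse into one
    pcp term (the 'joint' set); the rest are moved over unchanged.
    """
    leipzig_names = {local_name(t, LEIPZIG_NS) for t in leipzig_terms}
    helmstedt_names = {local_name(t, HELMSTEDT_NS) for t in helmstedt_terms}
    joint = leipzig_names & helmstedt_names

    mapping = {}
    for name in leipzig_names:
        mapping[LEIPZIG_NS + name] = PCP_NS + name
    for name in helmstedt_names:
        mapping[HELMSTEDT_NS + name] = PCP_NS + name
    return mapping, joint


mapping, joint = shift_vocabulary(
    [LEIPZIG_NS + "surname", LEIPZIG_NS + "familyChild"],
    [HELMSTEDT_NS + "surname", HELMSTEDT_NS + "praeses"],
)
```

Here `joint` corresponds to the "Joint" row of Tables II and III, and the keys of `mapping` that are not joint correspond to the "Disjoint" rows.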
B. Instance Alignment
In this task, we use the Link Discovery Framework Limes [4] to align instances from both databases, the Catalogus Professorum Lipsiensium and the Catalogus Professorum Helmstadiensium. The aim was to merge resource instances from both catalogues that refer to one and the same person, in order to find parallels and to enrich the prosopographical data. To achieve that goal, we applied the Limes framework using two distinct configurations, the first with a higher and the second with a lower acceptance threshold. Both setups used person instances with the following attributes: name (rdfs:label), surname and forename. The algorithm used was the unsupervised version of “wombat simple”. We also tried a full match, but no person instance in the two datasets exactly matched another.

• First Setup: The configuration with the highest acceptance rate returned the Person instances leipzig:heinrichmatthiasheinrichs and helmstedt:13084, with the similarity score shown in Listing 1. We then manually checked the two instances and concluded that they refer to different persons. As can be seen in Listings 2 and 3, the instances cannot refer to the same person, as both the surnames and the forenames are different.

    leipzig:heinrichmatthiasheinrichs helmstedt:13084 0.8164965809277261
Listing 1. Similarity measure between leipzig:heinrichmatthiasheinrichs and helmstedt:13084.

    leipzig:heinrichmatthiasheinrichs leipzig:surname "Heinrichs" ;
        leipzig:forename "Heinrich Matthias" ;
        rdfs:label "Heinrich Matthias Heinrichs" .
Listing 2. Properties surname, forename and label of the instance leipzig:heinrichmatthiasheinrichs in the Leipzig professors' catalogue database.

    helmstedt:13084 rdfs:label "Andreas Heinrich Matthias" .
    helmstedt:13084 helmstedt:forename "Andreas Heinrich" .
    helmstedt:13084 helmstedt:surname "Matthias" .
Listing 3. Properties surname, forename and label of the Person instance helmstedt:13084 in the Helmstedt database.

• Second Setup: The configuration with the lowest acceptance rate encountered instances with possible alignment. The manual analysis of the results, carried out by the historian, did not find any matching instance.
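To illustrate why a high similarity score alone does not prove identity, the following Python sketch generates candidate links with an acceptance threshold. It deliberately swaps in a simple token-based Jaccard measure as a stand-in; it is not the Limes framework or the WOMBAT algorithm, and its scores are not comparable to the value in Listing 1. The names are built from the forename and surname values of Listings 2 and 3.

```python
def tokens(name):
    """Lower-cased whitespace tokens of a personal name."""
    return set(name.lower().split())


def jaccard(a, b):
    """Token-set Jaccard similarity (a stand-in measure, not WOMBAT)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def candidate_links(source, target, threshold):
    """Return (source_id, target_id, score) for pairs at or above the threshold."""
    links = []
    for s_id, s_name in source.items():
        for t_id, t_name in target.items():
            score = jaccard(s_name, t_name)
            if score >= threshold:
                links.append((s_id, t_id, score))
    # Highest scores first, so a reviewer sees the strongest candidates first.
    return sorted(links, key=lambda link: -link[2])


# Names assembled as "<forename> <surname>" from Listings 2 and 3.
leipzig = {"leipzig:heinrichmatthiasheinrichs": "Heinrich Matthias" + " " + "Heinrichs"}
helmstedt = {"helmstedt:13084": "Andreas Heinrich" + " " + "Matthias"}

loose = candidate_links(leipzig, helmstedt, threshold=0.5)   # one candidate
strict = candidate_links(leipzig, helmstedt, threshold=1.0)  # full match: none
```

The pair shares the tokens "heinrich" and "matthias", so it passes the loose threshold even though the surnames differ, which is why every candidate still goes to a historian for manual review.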
C. Quality Control
After interlinking, two historian data experts performed a quality check of the ontology. The quality check was designed to fix nomenclature errors and to enhance the descriptions of properties and classes. Surprisingly, the Helmstedt ontology concepts did not contain labels and descriptions, requiring their creation. Thus, a historian manually created the ontology's metadata in two languages, German and English, for 35 properties (see Table II) and 5 classes (see Table III). A few properties and classes were renamed to standardize the use of English and German. Property descriptions were enhanced to better describe their usage. Some of the common errors were:
• multi-lingual labels (e.g. pcp:hasMatrikel);
• different naming patterns (e.g. pcp:surname_lat instead of pcp:latinSurname); and
• wrong concept labeling (e.g. pcp:lecture became pcp:lecturer).
The full task description can be found at https://github.com/pcp-on-web/dataset/wiki/Instance-Matching:-Link-Discovery.

Fig. 2. An overview of the content fused in the Professorial Career Patterns KG. The blue classes are the content fused from the Lipsiensium catalogue. The purple classes are the content extracted from the Helmstadiensium catalogue. The green classes are the content available in both KGs. Legend: Leipzig, Helmstedt, Intersection. Node labels: pcp:Person, pcp:Professor, pcp:PeriodOfLife, pcp:Birth, pcp:Death, pcp:School, pcp:Career, pcp:Qualification, pcp:Office, pcp:Family, pcp:Publication, pcp:Body, pcp:Faculty, pcp:Academy, pcp:AcademicSociety, pcp:Institution, pcp:Party, pcp:Enrollment, pcp:Thesis, pcp:Report. Edge labels: rdf:type, pcp:hasPeriod, pcp:published, pcp:familyParent, pcp:familyChild, pcp:periodOfBody, pcp:hasAuthor, pcp:hasEnrollment, pcp:praeses.

V. INTERLINKING
Among the information that can be found in the PCP KG is the GND identifier of the professors, available through the pcp:gnd property. The main idea of the interlinking process is to use the GND to link relevant data available in other KGs and to use this data for enrichment. This section describes how PCP KG professors are interlinked with their respective instances in DBpedia, Wikidata and the Deutsche Nationalbibliothek (DNB) through owl:sameAs links, as well as the data extraction.
A. The GND standardization
Although the GND is available in the PCP KG professors' instances, it is not standardized. The Catalogus Professorum Lipsiensium uses the GND number, while the Catalogus Professorum Helmstadiensium uses the Deutsche Nationalbibliothek (DNB) GND URL. To overcome this issue, we replace the URL by the GND number. This can be done by extracting the GND number from the URL. The DNB GND URL is a composition of the GND namespace and the GND number, e.g. in the URL https://d-nb.info/gnd/118755951, the GND namespace is https://d-nb.info/gnd and the GND number is 118755951.

After the standardization, the interlinking through owl:sameAs links and the extraction of Wikidata, DBpedia and DNB data become possible. DBpedia and Wikidata both record the GND of a person in dedicated properties. By using these properties, it is possible to look up the DBpedia and Wikidata professors containing the GND URL in their respective properties.

Footnote to Section IV-C: The lecture (to give a lecture) was confused with the lecturer (the person who gives the lecture).
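The GND standardization and the subsequent linking step can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the handling of an `http://` variant of the namespace is an assumption, and the identifiers `pcp:exampleProfessor` and `ext:exampleEntity` are hypothetical placeholders rather than real instances.

```python
GND_NAMESPACE = "https://d-nb.info/gnd/"
# Assumption: some records may use the plain-http form of the same namespace.
KNOWN_PREFIXES = (GND_NAMESPACE, "http://d-nb.info/gnd/")


def normalize_gnd(value):
    """Return the bare GND number for either a GND URL or a GND number."""
    value = value.strip()
    for prefix in KNOWN_PREFIXES:
        if value.startswith(prefix):
            return value[len(prefix):].rstrip("/")
    return value


def same_as_links(pcp_gnds, external_gnds):
    """Pair PCP instances with external instances that share a GND number.

    Both arguments map an instance identifier to its (unnormalized) GND.
    """
    by_gnd = {}
    for uri, gnd in external_gnds.items():
        by_gnd.setdefault(normalize_gnd(gnd), []).append(uri)

    links = []
    for uri, gnd in pcp_gnds.items():
        for other in by_gnd.get(normalize_gnd(gnd), []):
            links.append((uri, "owl:sameAs", other))
    return links


# Hypothetical instances: one stores the GND as a URL, the other as a number.
links = same_as_links(
    {"pcp:exampleProfessor": "https://d-nb.info/gnd/118755951"},
    {"ext:exampleEntity": "118755951"},
)
```

Normalizing both sides before the join means the two catalogues' differing conventions no longer matter when the links are generated.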
B. Extractor
We also conducted the extraction of the relevant subsets of the DNB, Wikidata and DBpedia datasets. To this end, a lazy-extraction approach was designed. The approach receives a list of GNDs and a SPARQL query template. It performs the instance extraction one by one, hence lazy extraction. The aim is to avoid server timeouts and errors by executing simple and fast SPARQL queries. The approach is open source and publicly available at https://github.com/pcp-on-web/scholar.extractor. The extracted Wikidata, DNB and DBpedia data is publicly available at the dataset GitHub page. There, users can report issues or subscribe to receive update notifications. It is also possible to query the data locally using KBox [2] (see Listing 4), which simplifies sharing and querying. Table IV gives the total number of classes and properties for each of the KG subsets.

    java -jar kbox-v0.0.1.jar -kb "http://purl.org/pcp-on-web/dbpedia,http://purl.org/pcp-on-web/wikidata,http://purl.org/pcp-on-web/dnb" -sparql "Select * where {?s ?p ?o}" -install
Listing 4. Querying different Professorial Career Patterns subgraphs.

TABLE IV. PROFESSORIAL CAREER PATTERNS SUBGRAPH STATISTICS.

Fig. 3. Example of the alterations on the Professorial Career Patterns ontology, publicly available at https://github.com/pcp-on-web/ontology/commits/master.
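The lazy-extraction idea from Section V-B (one small query per GND instead of one large query) can be sketched in Python as below. This is a sketch under stated assumptions, not the code of the scholar.extractor project: the query template, the property URI `http://example.org/gnd`, and the `run_query` callable are all hypothetical, and the endpoint is simulated so that the example runs without network access.

```python
from string import Template

# Hypothetical template; "$gnd" is the placeholder filled in per instance.
QUERY_TEMPLATE = 'SELECT ?s WHERE { ?s <http://example.org/gnd> "$gnd" }'


def lazy_extract(gnds, query_template, run_query):
    """Run one small query per GND and collect results and failures separately.

    run_query is any callable that takes a SPARQL string and returns rows.
    A failure for one GND does not abort the extraction of the others.
    """
    template = Template(query_template)
    results, failures = {}, {}
    for gnd in gnds:
        query = template.substitute(gnd=gnd)
        try:
            results[gnd] = run_query(query)
        except Exception as error:  # e.g. timeout or server error
            failures[gnd] = str(error)
    return results, failures


def fake_endpoint(query):
    """Simulated endpoint: times out for one GND, echoes the query otherwise."""
    if "000000000" in query:
        raise TimeoutError("simulated timeout")
    return [query]


results, failures = lazy_extract(
    ["118755951", "000000000"], QUERY_TEMPLATE, fake_endpoint
)
```

Keeping the failures in a separate dictionary makes it possible to retry only the GNDs that timed out, rather than repeating the whole extraction.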
VI. EVOLVING
Due to the distributed character of the Web of Data, approaches that provide versioning and provenance play a central role. It is important to track the provenance of data at any step of a process involving possible changes to a database (e.g., creation, curation, linking). Provenance provides a good basis for mechanisms to track down and debug the origin of errors and to improve processes. Envisioning an approach to support collaborative database curation and research made QUIT [1] a natural choice. QUIT enables access to provenance-related metadata pertaining to the KG and provides all functionalities of a version control system using Git. Figure 3 depicts a list of alterations performed by commits in the ontology repository. It is possible to visualize and explore the changeset as well as to follow the evolution of the ontology and the KG.

VII. EXPLORING
Due to the flexibility of the SPARQL language and the lack of a practical approach to bridge the knowledge between the data and the Semantic Web experts, we apply an interactive approach to enable researchers to explore the data. The data expert provides a question in natural language to the Semantic Web expert, who formulates the SPARQL query. The SPARQL query result is then checked by the data expert, providing research insights and error analyses (Figure 4). Listings 5 and 6 provide an example of a question and its respective SPARQL query. To make the interlinked database accessible to other researchers we use KBox. The database and ontology are both available under the Knowledge Names (KN) http://purl.org/pcp-on-web/dataset and http://purl.org/pcp-on-web/ontology (see Listing 6).

    Give me the amount of qualification documents with professors as praeses, arranged by year and faculty.
Listing 5. Example research question posed in natural language by the data expert.

Footnote: Knowledge Name is a reference to the KG in KBox.

Fig. 4. Interaction between domain and Semantic Web expert.

    java -jar kbox-v0.0.1.jar -sparql "select (count(?doc) as ?docN) ?faculty ?year2 where { } group by ?faculty ?year order by asc(?year) asc(?faculty)" -kb "http://purl.org/pcp-on-web/dataset,http://purl.org/pcp-on-web/ontology,http://purl.org/pcp-on-web/dbpedia,http://purl.org/pcp-on-web/wikidata,http://purl.org/pcp-on-web/dnb" -install
Listing 6. Listing properties from the merged ontology using KBox. In this example, the namespace declaration was omitted for the purpose of simplification.
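When such KBox calls are scripted, the long command line is easier to assemble programmatically. The following Python sketch builds the argument list using only the flags that appear in Listings 4 and 6 (`-kb`, `-sparql`, `-install`); the helper name `kbox_command` is hypothetical, and the sketch only constructs the command without running it.

```python
import shlex


def kbox_command(knowledge_names, sparql, jar="kbox-v0.0.1.jar", install=True):
    """Build the argument list for a KBox query over several knowledge names.

    Passing a list to a process launcher avoids shell quoting problems with
    the braces and question marks of the SPARQL query.
    """
    args = [
        "java", "-jar", jar,
        "-kb", ",".join(knowledge_names),  # comma-separated, as in the listings
        "-sparql", sparql,
    ]
    if install:
        args.append("-install")
    return args


cmd = kbox_command(
    ["http://purl.org/pcp-on-web/dataset", "http://purl.org/pcp-on-web/ontology"],
    "Select * where {?s ?p ?o}",
)
# Shell-safe rendering for logging or copy-paste (Python 3.8+).
printable = shlex.join(cmd)
```

The resulting list could be handed to `subprocess.run(cmd)` on a machine where Java and the KBox jar are available.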
VIII. CONCLUSION
In this paper, we described the interlinking process of two prosopographical databases in order to investigate the research question of Early Modern scholarly career patterns. We gave insight into the different steps involved in the curation of the vocabulary and the databases as well as in quality control. We further discussed the data evolution and an exploratory research method engaging data domain and Semantic Web experts.

REFERENCES
[1] N. Arndt, P. Naumann, N. Radtke, M. Martin, and E. Marx. Decentralized Collaborative Knowledge Management using Git. Journal of Web Semantics, 54:29–47, 2019.
[2] E. Marx, C. Baron, T. Soru, and S. Auer. KBox – Transparently Shifting Query Execution on Knowledge Graphs to the Edge. Pages 125–132. IEEE, 2017.
[3] T. Riechert and F. Beretta. Collaborative research on academic history using linked open data: A proposal for the Heloise Common Research Model. CIAN-Revista de Historia de las Universidades, 19(0), 2016.
[4] M. A. Sherif, A.-C. N. Ngomo, and J. Lehmann. Wombat – a generalization approach for automatic link discovery. In European Semantic Web Conference, pages 103–119. Springer, 2017.
[5] A. Valdestilhas, T. Soru, and M. Saleem. More complete resultset retrieval from large heterogeneous RDF sources. In