Provenance for Linguistic Corpora Through Nanopublications

Abstract

Research in Computational Linguistics is dependent on text corpora for training and testing new tools and methodologies. While there exists a plethora of annotated linguistic information, these corpora are often not interoperable without significant manual work. Moreover, these annotations might have evolved into different versions, making it challenging for researchers to know the data's provenance. This paper addresses this issue with a case study on event annotated corpora and by creating a new, more interoperable representation of this data in the form of nanopublications. We demonstrate how linguistic annotations from separate corpora can be reliably linked from the start, and thereby be accessed and queried as if they were a single dataset. We describe how such nanopublications can be created and demonstrate how SPARQL queries can be performed to extract interesting content from the new representations. The queries show that information of multiple corpora can be retrieved more easily and effectively because the information of different corpora is represented in a uniform data format.


Provenance for Linguistic Corpora Through Nanopublications

Timo Lek, Anna de Groot, Tobias Kuhn, and Roser Morante

Department of Computer Science, VU Amsterdam
Computational Lexicology & Terminology Lab, VU Amsterdam

Abstract.

Research in Computational Linguistics is dependent on text corpora for training and testing new tools and methodologies. While there exists a plethora of annotated linguistic information, these corpora are often not interoperable without significant manual work. Moreover, these annotations might have been adapted and might have evolved into different versions, making it challenging for researchers to know the data's provenance and merge it with other annotated corpora. In other words, these variations affect the interoperability between existing corpora. This paper addresses this issue with a case study on event annotated corpora and by creating a new, more interoperable representation of this data in the form of nanopublications. We demonstrate how linguistic annotations from separate corpora can be merged through a similar format to thereby make annotation content simultaneously accessible. The process for developing the nanopublications is described, and SPARQL queries are performed to extract interesting content from the new representations. The queries show that information of multiple corpora can now be retrieved more easily and effectively with the automated interoperability of the information of different corpora in a uniform data format.

The availability of annotated linguistic corpora is crucial to develop Natural Language Processing (NLP) systems [5,25]. In the past couple of decades, many linguistic resources have been released, which contain different types of annotations, from morphosyntactic to semantic annotations [5]. Often, such annotation projects run on (subsets of) existing text corpora, which are thereby annotated with multiple annotation layers. Unfortunately, the format for linking annotations to the texts is mostly not standardized, and these links are therefore often not fully precise or directly interoperable. This makes it challenging for a researcher to figure out the data's exact provenance and to integrate it with annotations from other corpora. Moreover, the annotations often follow idiosyncratic guidelines that pursue a specific annotation goal. Integrating different annotations is valuable because it can help scholars study and understand the interdependencies between annotation layers (e.g. semantic parsers usually take syntactic structure into consideration) and get a broader understanding by combining different annotation types (e.g. how events from event annotations are linked to persons from named entity annotations).

The need to organize the various existing annotations has been recognized by the corpus linguistic community [5,10]. While a number of solutions to improve linguistic Linked Data have been presented in the past, we propose here to use a novel fine-grained and flexible Linked Data format based on nanopublications to represent annotated corpora, with the aim of boosting the interoperability of linguistic corpora and facilitating the automatic integration of annotations in a reliable and transparent manner.

As a case study, we have chosen to focus on integrating annotations from event corpora, because these corpora are widely used in the NLP community and many different types of event annotations continue to be made on the same corpora. The corpora selected for this project are FactBank [24] and PARC 3.0 [19]. We describe below how we converted these corpora into nanopublications and then report on the interoperability challenges faced and solved with our approach. Finally, we discuss, based on our results, the extent to which nanopublications can be seen as a viable alternative for improved linguistic corpora interoperability.

Our main contributions are (1) providing a fine-grained and interoperable Linked Data representation of annotated linguistic corpora in the form of nanopublications, (2) analysing the main interoperability problems that we encountered and solved with our nanopublication-based approach, and (3) demonstrating how the concrete annotations from our case study of two corpora can be integrated with nanopublications and queried with SPARQL.

Below, we introduce the relevant background on the interoperability of linguistic corpora with a special focus on event annotations, Linked Data and data provenance, and the specific concept and technique of nanopublications.

Many types of annotated linguistic corpora exist today and are available for scientists to use, but they often lack structural and conceptual interoperability [5]. The former is related to the data format of annotations, such as XML-standoff or RDF representations, while the latter refers to the possibility of exchanging information in a consistent and mappable way. Diverging annotation guidelines, the structure of annotated output, and the different amounts of annotations created per corpus pose a challenge in organizing the provenance of corpora.

As mentioned, this project focuses on event annotations. In corpus linguistics, events are generally considered as situations that can happen or occur [20]. However, existing event annotation standards cover aspects ranging from participants and timing/temporal order to factuality and entity coreference relations [25], making a standardized description of an event difficult. For example, Figure 1 [...]

Fig. 1. An example sentence with its annotations in FactBank and PARC.

[...] hurt, disposal, restructuring, said). It is interesting to combine the two annotation layers in order to extract, for example, which events occur in the content of attribution relations by the same source and what their factuality values are.

Our research is motivated by the empirical analysis of 20 event annotated corpora presented in [25], which highlighted the challenges, as well as the opportunity, of event corpora interoperability. This is where nanopublications as a fine-grained and provenance-aware data format could provide transparency and interoperability for computational linguistic research.

Linked Data is about connecting and publishing data in a structured way on the web, often in the form of an ontology using the Web Ontology Language (OWL) [17]. The data is often structured in triples using the Resource Description Framework (RDF). By using unique URIs for each entity of information, these entities can be interlinked in order to create an entire web of data [1].

Several projects have already been carried out to create linked linguistic data. Since 2012, six editions of the Linked Data in Linguistics (LDL) workshop have been hosted to gather and discuss contributions promoting the linking of linguistic data. After the second workshop, one of the presented efforts was the Linked Linguistic Open Data (LLOD) cloud (https://linguistic-lod.org/). When the LLOD cloud was created, a lot of linguistic data was already available, mostly represented using RDF. One example is WordNet, which is also part of the Semantic Web [8,6]. Furthermore, Chiarcos proposed to represent linguistic corpora in OWL and RDF to interlink all of the different resources of corpora. In this way, the annotations of the corpora could be linked using terminological resources like standardized annotation formats [5]. Existing approaches that aim to represent the structure and content of natural language and its annotations include the Ontology Lexicalisation and the Ontologies of Linguistic Annotation (OLiA) [26,4,2]. OLiA focuses on annotations of linguistic corpora and can be seen as a source that refers to the 'annotation terminology' that can be used in combination with the LLOD to capture all of the information of the annotations in the LLOD in a similar format [7].

Since the texts and annotations in a corpus may each have a different origin, i.e. provenance, the provenance ontology PROV can be used to record it. This ontology contains different classes and properties that represent provenance information in different contexts, and can thereby be used for keeping track of the provenance information of linguistic corpora. The PROV ontology has been shown to be applicable to a diversity of fields to track provenance information in a general manner [16].

Even though the RDF language is not optimized for this, the NLP Interchange Format (NIF, http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core) makes it possible to represent texts and their elements in RDF, primarily by using strings. With the NIF ontology, a context, sentence, phrase, or word can be represented as RDF triples. It is also possible to have references between different levels, so a word can refer to the sentence from which it was derived.

Moreover, many annotation guidelines exist and each is usually represented in a different structure, increasing the difficulty of creating a standard model to transform annotations into RDF. The Web Annotation Vocabulary [21], for example, provides a possible solution with relations such as 'motivatedBy' and 'hasTarget', making it possible to structure annotations in an interoperable way.

Nanopublications are a format for small data publications based on RDF triples, consisting of three parts: the assertion, provenance, and publication information [12]. The assertion contains the main content of a nanopublication, for example stating in a formal way the link between a gene and a disease. The provenance part states how this assertion came to be, by linking to the study that was performed or the paper from which the assertion was extracted. The publication information part, finally, records information about the nanopublication itself, such as by whom and when it was created [9].

In previous work, a Java library for nanopublications was created, which can also be used as a command line tool [11]. Nanopublications can be given trusty URIs as identifiers, which include a hash value of the complete content, and thereby make nanopublications immutable and verifiable [14]. Such nanopublications can then be reliably and redundantly published to the existing distributed server network [13]. Each of these servers contains all nanopublications, making it possible to retrieve a nanopublication from another server when one server is down.

Nanopublications also allow for precise and reliable versioning, as well as the definition of incremental datasets by defining nanopublication indexes [15]. Such indexes are represented as nanopublications themselves that contain links to other nanopublications and thereby define sets of nanopublications.

Fig. 2. General nanopublication scheme with examples.
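The three-part structure and the idea of a content-derived identifier described above can be sketched in a few lines of Python. This is a toy illustration only: the `Nanopub` class, the `content_id` method, and the `ex:` identifiers are hypothetical, and the hashing shown here does not reproduce the actual trusty URI algorithm of [14], which defines its own normalization and encoding.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopub:
    """Toy nanopublication: three named parts, each a tuple of triples."""
    assertion: tuple
    provenance: tuple
    pubinfo: tuple

    def content_id(self) -> str:
        # Hash a canonical (sorted) serialization of all three parts.
        # Illustration only: this is NOT the real trusty URI computation.
        lines = []
        for part in ("assertion", "provenance", "pubinfo"):
            for triple in sorted(getattr(self, part)):
                lines.append(part + "\t" + "\t".join(triple))
        return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Hypothetical example content (all identifiers are made up).
assertion = (
    ("ex:gene1", "ex:isAssociatedWith", "ex:disease1"),
    ("ex:gene1", "rdf:type", "ex:Gene"),
)
provenance = (("ex:assertion", "prov:wasDerivedFrom", "ex:study1"),)

np_a = Nanopub(assertion, provenance, (("ex:this", "dct:creator", "ex:alice"),))
# Same content, triples listed in a different order.
np_a_reordered = Nanopub(assertion[::-1], provenance,
                         (("ex:this", "dct:creator", "ex:alice"),))
# Different publication info, hence different content.
np_b = Nanopub(assertion, provenance, (("ex:this", "dct:creator", "ex:bob"),))

print(np_a.content_id() == np_a_reordered.content_id())
print(np_a.content_id() == np_b.content_id())
```

Because the identifier depends on the complete content, any change to any of the three parts yields a different identifier, which is the property that makes such publications verifiable.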

Here we introduce our methodology to address the above-mentioned problems of interoperability and provenance of text corpora by applying the concept and techniques of nanopublications. To demonstrate and evaluate this methodology, we then present a case study on the two overlapping corpora mentioned above. We introduce a number of questions on the combined annotations that can be answered in an integrated and automatic manner through SPARQL queries.

In Figure 2, we present a general scheme for representing text corpora and their annotations as a network of interconnected nanopublications. This model ensures that new information from other corpora can easily be linked to existing nanopublications, through the addition of more nodes to the network. The nanopublications that form this general structure can be divided into four groups depending on what they represent: text, annotation, corpus, and index nanopublications.


First, at the top of the figure the text-corpus nanopublication can be found. This nanopublication contains the information regarding the text corpus, which is the set of original texts. It is linked to an index nanopublication which points to all document, text, and word nanopublications that are part of the corpus. The document and text nanopublications can be part of multiple corpora and thus multiple index nanopublications. The document nanopublications contain all of the metadata on the actual texts, namely the provenance information, title, and identifier. Furthermore, the document nanopublications have a link to the nanopublication containing the actual text corresponding to the metadata. Both the document and the text nanopublication also contain a back-reference to their corpus nanopublication.

Additionally, word nanopublications are created, as words are the typical targets of annotations. They contain the word string, sentence number, and the offset values with respect to the text, and possibly other word-level information (e.g. their part-of-speech value). Thereby, each word gets its own URI as identifier, which is constructed from the text URI and the word's offset values. From the bottom of the figure, annotation corpora point to these words. Similar to the text corpora, a corpus nanopublication of an annotation corpus contains the general corpus-level information, most importantly a link to an index nanopublication pointing to all annotation nanopublications. The annotation nanopublications contain the actual annotations; their concrete content depends on the corpus, but they all point to the words they are annotating via the URIs described above. For convenience, they also point back to the overall annotation corpus nanopublication.

Modeling these elements as small self-contained entities also has the advantage that differences in licenses can be defined and handled on a fine-grained level. Often, annotations have a different, sometimes more permissive, license than the texts they annotate. For this, we introduce the nanopublication type ProtectedNanopub that we use to mark nanopublications that cannot be openly published due to license reasons. We updated the nanopublication client and server code such that an error is raised if a user accidentally tries to publish such a protected nanopublication to the server network.

Lastly and importantly, our approach is open for anybody to publish further annotations or texts as nanopublications. These would not be directly included in the datasets, as the index nanopublications do not point to them, but they can be found by querying the nanopublication network, and included if desired in a given situation. Moreover, a new version of a corpus can include such "third-party" contributions, thereby making them a proper part of the corpus while clearly attributing the third-party contributor, who is identified in the provenance and publication info parts. This also allows for corrections to be published by anybody who finds something that is wrong.
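The guard against publishing protected nanopublications described above can be pictured roughly as follows. This is a minimal sketch, not the actual client or server code: only the type name ProtectedNanopub comes from the text, while the dictionary layout, the `publish` function, the error class, and the example URIs are assumptions made for illustration.

```python
PROTECTED_TYPE = "ProtectedNanopub"  # type name taken from the text


class ProtectedNanopubError(Exception):
    """Raised when a protected nanopublication would be published openly."""


def publish(nanopub: dict, network: list) -> None:
    # Hypothetical publish step: refuse anything marked as protected,
    # otherwise record its URI on the (simulated) server network.
    if PROTECTED_TYPE in nanopub.get("types", ()):
        raise ProtectedNanopubError(
            f"{nanopub['uri']} is protected and cannot be published")
    network.append(nanopub["uri"])


# Made-up examples: a public annotation and a license-restricted text.
public_np = {"uri": "http://example.org/np/annotation1", "types": ()}
protected_np = {"uri": "http://example.org/np/text1",
                "types": (PROTECTED_TYPE,)}

network = []
publish(public_np, network)
try:
    publish(protected_np, network)
except ProtectedNanopubError as err:
    print("refused:", err)
```

Checking the marker at publication time keeps the licensing decision attached to each individual nanopublication rather than to the corpus as a whole.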

The case study workflow has three main components: (1) obtain and analyze original and annotated corpora; (2) convert these into nanopublications; and (3) assess the usefulness of the nanopublication representations. We evaluate on two event-related linguistic corpora, FactBank and PARC 3.0. For PARC, we process 34 annotation files that provide annotations on Wall Street Journal Corpus (WSJ) documents. For FactBank, we process 194 documents, which provide annotations on WSJ, New York Times (NYT), Associated Press Writer (APW) documents, and documents from several other sources. Thus, we use the intersection of 34 WSJ documents for which both corpora provide annotations to investigate the extent to which nanopublications indeed facilitate automated merging and interoperability.

Table 1. Case study questions.

    Q1  Who talked about an event, and what is the factuality value of that event?
    Q2  Which events have multiple factuality value annotations?
    Q3  What are the different factuality values expressed per document, and which values appear the most?
    Q4  How often are certain annotation values used? (e.g. the count of different factuality values represented or the count of different sources)
    Q5  How many of the annotated FactBank events are labelled with a specific attribution component?
    Q6  Where in the corpus is a specific word (or lemma) assigned a specific attribution label? (e.g. where is the verb 'to surprise' annotated as Cue?)

The PARC corpus provides annotations about attribution relations, which are made up of three components: a Source, a Content, and a Cue. The original annotations are made at the token level and are presented in XML format. The FactBank corpus includes annotations about events and their factuality values. FactBank adopts event annotations from the TimeML annotations [23], while it defines event factuality with a value that represents how certain or true an event is according to a source present in the text. FactBank has six primary factuality values present in its annotations. These annotations are contained in a set of 20 tables, each written in a separate text file. We refer the reader to [22] and [18] for detailed explanations of each annotation scheme, and we will elaborate below on their interoperability challenges.

For this case study, we came up with six general questions that somebody who would like to use the above annotations might want to ask. These questions are shown in Table 1. One of our goals is answering questions that show the merging capabilities and efficiency of nanopublications. For example, the answer to Q3, depending on the factuality values it outputs, might suggest how certain the content of the text is. An answer to Q6 might help an annotator clarify doubts during the annotation process, and Q5 demonstrates how FactBank events can be related to PARC attribution relations. Some questions ask only for information from one corpus, while others are more complex and require information from both corpora. We will turn these general questions into automatically executable queries using our integrated nanopublication model and the SPARQL query language.


We applied the nanopublication-based approach to the case study in order to address the problem of interoperability and provenance. We discuss here the different types of nanopublications created to represent the chosen corpora.

As introduced above, nanopublications consist of an assertion, provenance, and publication info part. The publication info part is similar for all our nanopublications, containing the date of creation, license, ORCID identifiers of the creators, and sometimes further comments and website links. All the text and word nanopublications, as well as the annotations of FactBank, are also marked as protected, due to their private licenses. The remaining nanopublications are public and can therefore be published to the server network for public accessibility. The dataset of all public nanopublications can be found online (https://github.com/ucds-vu/provcorp-model).

We defined four corpus nanopublications for our case study: one for the text contained in the corpus and one for the annotations contained in the corpus, for each of the two existing corpora (PARC and FactBank). They contain basic information about the corpus and link to their content via an index nanopublication, as can be seen in this example (abbreviated here and below for readability):

Via their index nanopublications, they point to 194 document nanopublications, which contain the document metadata. This is an example:

    sub:assertion {
      sub:document a foaf:Document;
        dct:title "Financing Business: @ Rogers Communications Inc.";
        dct:created "1989-11-02T00:00:00"^^xsd:dateTime;
        dct:creator <http://dbpedia.org/resource/The_Wall_Street_Journal>;
        pvcp:hasText textnp:text .
    }

These document nanopublications link to the actual text using pvcp:hasText to point to the text nanopublication. An example of the latter is shown here:

    sub:assertion {
      sub:text a nif:OffsetBasedString, dct:Text;
        rdf:value """ ROGERS COMMUNICATIONS Inc. said it plans to raise 175 million to 180 million Canadian dollars (US$148.9 million to $153.3 million) through a private placement of perpetual preferred shares. ... He declined to discuss other terms of the issue. """ .
    }

The word nanopublications point back to the text nanopublication and provide word-level information including the word string, the offset of this string with respect to the text, and the sentence number. They can also contain lemma and part-of-speech information. As this word level depends on tokenization, there can be differences in how different annotation corpora define word boundaries. For this reason, our approach allows for the generation of new word nanopublications on demand. This creates an identifier for the word based on the URI of the text nanopublication and the offset of the word, thereby ensuring that words introduced several times by different annotation corpora get the same identifier and are therefore immediately interoperable. In our case, we have 6,362 word nanopublications defined from PARC and 7,784 from FactBank. This is an example of the former:

    sub:assertion {
      textoffset:3-9 a nif:OffsetBasedString, nif:Word;
        nif:beginIndex "3"^^xsd:int;
        nif:endIndex "9"^^xsd:int;
        nif:anchorOf "ROGERS";
        nif:lemma "rogers";
        olia:POS "NNP";
        pvcp:hasSentenceNumber "0"^^xsd:int;
        pvcp:isPartOfText textnp:text .
    }
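The offset-based identifier scheme described above can be sketched as follows. The sketch assumes a URI pattern of the form `<text URI>#offset_<begin>-<end>`, which is hypothetical (the examples above only show the abbreviated `textoffset:` prefix), and it assumes three leading padding characters in the text so that the offsets line up with the identifiers in the examples. The function names and the example text URI are likewise made up.

```python
def word_uri(text_uri: str, begin: int, end: int) -> str:
    # Hypothetical URI pattern: text URI plus the word's character offsets.
    return f"{text_uri}#offset_{begin}-{end}"


def register_words(registry: dict, text_uri: str, text: str,
                   spans: list, corpus: str) -> None:
    """Add word entries on demand; identical spans share one identifier."""
    for begin, end in spans:
        uri = word_uri(text_uri, begin, end)
        entry = registry.setdefault(
            uri, {"anchor": text[begin:end], "corpora": set()})
        entry["corpora"].add(corpus)


# Assumed text layout: three padding characters before the first word.
text = "   ROGERS COMMUNICATIONS Inc. said it plans to raise funds."
text_uri = "http://example.org/np/text1"  # made-up text nanopublication URI

registry = {}
# Two corpora introduce partly overlapping words independently.
register_words(registry, text_uri, text, [(3, 9), (30, 34)], "PARC")
register_words(registry, text_uri, text, [(30, 34)], "FactBank")

said = registry[word_uri(text_uri, 30, 34)]
print(said["anchor"], sorted(said["corpora"]))
```

Since the identifier is a pure function of the text URI and the offsets, no coordination between annotation corpora is needed for their word references to coincide.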

The annotations can now link to these words. We have 136 annotation nanopublications for PARC, which associate words with content, cue, and source annotations, as can be seen in the example below:

    sub:assertion {
      sub:annotation a oa:Annotation;
        dct:isPartOf corpus:parc-annotations;
        pvcpp:hasContentAnnotatedWord textoffset:101-106, textoffset:107-114, ...;
        pvcpp:hasCueAnnotatedWord textoffset:30-34;
        pvcpp:hasSourceAnnotatedWord textoffset:10-24, textoffset:25-29, textoffset:3-9 .
    }

For FactBank, we have annotation nanopublications that declare events (7,784) and separate ones annotating their factuality values (10,948). Below is an example of an event declaration, and another example of a factuality value annotation (the factuality value is here "CT-") linking to such an event declaration:

    sub:assertion {
      sub:annotation a oa:Annotation;
        dct:isPartOf corpus:FactBank-annotations;
        oa:hasTarget textoffset:543-548;
        pvcpf:hasEID "e18" .
    }

    sub:assertion {
      sub:factvalue a oa:Annotation;
        dct:isPartOf corpus:factbank-annotations;
        pvcpf:hasFactvalue "CT-";
        pvcpf:hasSourceAnnotation sub:sourceAnnotation1, sub:sourceAnnotation2;
        pvcpf:hasTargetEvent eventnp:annotation .
      sub:sourceAnnotation1 oa:hasTarget textoffset:430-439;
        pvcpf:hasSourceText "spokesman" .
      sub:sourceAnnotation2 a pvcpf:AuthorAsSourceAnnotation .
    }

We have focused above only on the assertion part, but all these nanopublications also come with specific provenance and publication information, for example describing that the PARC annotations were created at a different time by a different person than the nanopublications themselves:

    sub:provenance {
      sub:assertion dct:created "2016-05-01T00:00:00"^^xsd:dateTime;
        dct:creator orcid:0000-0003-0955-6104 .
    }

    sub:pubinfo {
      this: dct:created "2020-05-29T11:53:00.700+02:00"^^xsd:dateTime;
        dct:creator orcid:0000-0002-3429-2879, orcid:0000-0002-5347-5750, ... .
    }

We can now analyze the interoperability challenges we faced and addressed with our approach and assess the extent to which we can now automatically answer the questions we introduced above.

With our case study as described above, we encountered a number of interoperability challenges. We managed to resolve almost all of them in our manual conversion work. However, we had to exclude ten documents of FactBank because sentence numbers were skipped when a sentence ended with a double quote ("), and we did not have enough information to correct this mistake. In Table 2, we show an overview of the challenges we successfully addressed on the remaining 194 documents. This highlights the nature and extent of the interoperability problems that come with the way such corpora are currently published, but that would be resolved if our approach was followed, i.e. if they were published as interoperable nanopublications from the start.

Table 2. Overview of addressed interoperability challenges, including their absolute and relative frequencies in the 194 documents of our case study corpora.

    Addressed challenge                                     corpus          abs.   rel.
     1. Incompatible text offsets                           PARC/FactBank   194    100.0%
     2. Metadata included in sentence number count          FactBank        194    100.0%
     3. Insufficient sentence splitting information         FactBank         64     33.0%
     4. Missing headline                                    FactBank         33     17.0%
     5. Inconsistent use of text tags                       FactBank         23     11.9%
     6. Absence of text tags to structure the document      FactBank         17      8.8%
     7. Unknown journal / source                            FactBank         16      8.2%
     8. No attribution relations for the document           PARC              4      2.0%
     9. Incompatible sentence splitting at semicolons       PARC/FactBank     2      1.0%
    10. Annotations on the headline                         FactBank          1      0.5%

The first issue on the list is text offsets. PARC and FactBank locate words in a text differently: PARC uses byte counts, whereas FactBank makes use of sentence and token numbers. Token numbers are tokenizer-specific, so FactBank's method is quite sensitive and difficult to replicate. FactBank's sentence numbers are moreover confusing as they also count the metadata fields, such as text headings, and not just the text content (challenge 2). PARC, on the other hand, also has its quirks, such as including the six characters of the initial tag in the offset count. In our nanopublications, we use text offsets that unambiguously link to the text string of the content, and thereby ensure precise and interoperable text references.

For FactBank's sentence and token number approach to refer to words in the text, it is crucial to know where a sentence ends and a new one starts. This sentence splitting, similar to the tokenization mentioned above, is sensitive to the algorithm used. This is why text corpora like WSJ provide this sentence splitting information by saving each sentence on a separate line in the provided file formats. However, the documents from NYT and APW did not have this information, and so these sentences had to be checked manually for the correct splitting (challenge 3). Moreover, sentences containing a semicolon are annotated as one sentence in PARC while they are considered two sentences in FactBank (challenge 9).

Challenges 5 and 6 point to the fact that text tags were used inconsistently. The main text is enclosed in text tags, but these were missing for 17 documents. Moreover, WSJ documents use different tags for the headline and date than NYT and APW documents do. Some documents, moreover, did not contain these tags at all, and we had to add them manually.

Normally, headlines are not annotated, but there is an exception in document APW19980213.1380, which includes an annotation of a part of the headline (challenge 10). Upon encountering this situation, we extended our model to also allow for headline annotations. Lastly, in some cases metadata was missing. Some documents did not have a headline (challenge 4) or did not include a journal or source (challenge 7). In PARC, there were also four documents that did not contain any attribution relations (challenge 8).

In summary, all documents were affected by at least two challenges, and there seems to be a long tail of infrequent exceptions or small inconsistencies that one has to take into account in order to process such corpora correctly.
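To make challenges 1 and 2 more concrete, the following sketch shows how the two addressing schemes could be mapped onto a single character-offset scheme. It is a simplified illustration under several assumptions that are not taken from the corpora themselves: one sentence per line, whitespace tokenization, zero-based sentence and token numbers, and a fixed number of metadata lines counted before the first sentence. The function names are hypothetical.

```python
def token_span(sentences, sent_no, tok_no, metadata_lines=0):
    """Map a (sentence, token) reference to character offsets.

    Offsets refer to the text formed by joining `sentences` with newlines.
    `metadata_lines` is the number of non-content lines that the sentence
    numbering also counts (assumption for this sketch).
    """
    idx = sent_no - metadata_lines
    # Characters before the target sentence (+1 per newline separator).
    base = sum(len(s) + 1 for s in sentences[:idx])
    sentence = sentences[idx]
    pos = 0
    for i, tok in enumerate(sentence.split()):
        start = sentence.index(tok, pos)
        if i == tok_no:
            return base + start, base + start + len(tok)
        pos = start + len(tok)
    raise IndexError("token number out of range")


def strip_tag_offset(begin, end, tag_len=6):
    # Shift offsets that include an initial tag of `tag_len` characters.
    return begin - tag_len, end - tag_len


sentences = ["Rogers said it plans to raise funds.",
             "He declined to comment."]
text = "\n".join(sentences)

b, e = token_span(sentences, 1, 1)
print(text[b:e])
# Same word when the numbering also counts two metadata lines.
b2, e2 = token_span(sentences, 3, 1, metadata_lines=2)
print(text[b2:e2])
print(strip_tag_offset(9, 15))
```

Once both schemes are reduced to offsets into the same content string, references from either corpus can be compared directly.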

In order to assess the practical benefits of our approach, we implemented the general questions presented in Table 1 as queries in the SPARQL language so we could use a triple store (Virtuoso in our case) to automatically answer them. The resulting SPARQL queries can be found online (https://github.com/ucds-vu/provcorp-model/tree/master/queries). They demonstrate that our nanopublication approach effectively merges the two corpora and allows for information to be extracted together in a practical and efficient way. Table 3 shows the average amount of time it took to run a query, as well as the number of rows it outputs. We managed to represent all these questions in SPARQL, and thereby to make them automatically executable.

Table 3. SPARQL query results to answer the general questions Q1 to Q6, with result count and average execution time in seconds.

    question  short query description                                  count   time
    Q1        list source and factuality value per event                 656   0.206s
    Q2        list events with more than one factuality value          1,050   0.092s
    Q3        list factuality values per document                        194   0.058s
    Q4        count of FactBank source values                            425   0.029s
    Q5        count of FactBank events with PARC source attribution        6   0.033s
    Q6        list all occurrences of "surprise" as cue                    1   0.048s

The listing below shows an example of a SPARQL query to answer one possibility for question Q4 ("How often are certain annotation values used?"), retrieving the count of words annotated as events from FactBank per PARC annotation attribution type:

    prefix pvcpp: <...>
    prefix pvcpf: <...>
    prefix oa: <...>

    select ?Attribution (count(distinct ?word) as ?Count) where {
      ?event pvcpf:hasEID ?eventid .
      ?event oa:hasTarget ?word .
      ?annotation ?Attribution ?word .
      values ?Attribution {
        pvcpp:hasContentAnnotatedWord
        pvcpp:hasCueAnnotatedWord
        pvcpp:hasSourceAnnotatedWord
      } .
    }
    group by ?Attribution

The concrete result for this query is:

    Attribution                      Count
    pvcpp:hasSourceAnnotatedWord         3
    pvcpp:hasContentAnnotatedWord      284
    pvcpp:hasCueAnnotatedWord          130
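For readers less familiar with SPARQL, the join expressed by this query can be mimicked in plain Python over a list of triples. This is only a sketch of the query logic on made-up toy data, not a replacement for a triple store; the function name and the `fb:`, `parc:`, and `w:` identifiers are hypothetical, while the predicate names are those used in the query.

```python
from collections import defaultdict

ATTRIBUTION_PREDICATES = {
    "pvcpp:hasContentAnnotatedWord",
    "pvcpp:hasCueAnnotatedWord",
    "pvcpp:hasSourceAnnotatedWord",
}


def count_event_words_per_attribution(triples):
    # Subjects that declare a FactBank event identifier.
    events = {s for s, p, _ in triples if p == "pvcpf:hasEID"}
    # Words targeted by those event annotations.
    event_words = {o for s, p, o in triples
                   if p == "oa:hasTarget" and s in events}
    # Distinct event words per PARC attribution predicate.
    words_by_attr = defaultdict(set)
    for _, p, o in triples:
        if p in ATTRIBUTION_PREDICATES and o in event_words:
            words_by_attr[p].add(o)
    return {p: len(ws) for p, ws in words_by_attr.items()}


# Toy data: two events and one PARC attribution relation.
toy_triples = [
    ("fb:ev1", "pvcpf:hasEID", "e1"),
    ("fb:ev1", "oa:hasTarget", "w:30-34"),
    ("fb:ev2", "pvcpf:hasEID", "e2"),
    ("fb:ev2", "oa:hasTarget", "w:101-106"),
    ("parc:a1", "pvcpp:hasCueAnnotatedWord", "w:30-34"),
    ("parc:a1", "pvcpp:hasContentAnnotatedWord", "w:101-106"),
    ("parc:a1", "pvcpp:hasContentAnnotatedWord", "w:107-114"),
    ("parc:a1", "pvcpp:hasSourceAnnotatedWord", "w:3-9"),
]

counts = count_event_words_per_attribution(toy_triples)
print(counts)
```

The join works only because both corpora refer to words through the same identifiers, which is the point the shared word nanopublications are meant to guarantee.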

This thereby demonstrates how FactBank and PARC are conceptually mergedand can be queried in an integrated fashion.As another example, the query representing question Q1 (“Who talked aboutan event, and what is the factuality value of that event?”) gives us this result(only the ﬁrst four rows are shown): textID eID eventWord factValue relativeSource sourcePhrase https://.../wsj_0026...

This shows again how the annotations from both corpora can be queried simultaneously. The shown portion of the result covers the file wsj_0026. The columns, from left to right, contain: the text identifier in the form of its nanopublication URI with a suffix, the event identifier of the event annotation, the word that is annotated as the event, the factuality value of the event, the source of the event (according to the FactBank annotations), and the string in the document's text that represents the source (according to the source annotated words in PARC). As such, the query's result contains information from PARC (the source) as well as information from FactBank (the factuality value and event). Moreover, the output of the query allows us to see that an event can have two different FactBank sources that each assign a different factuality value to the annotated event. This is seen in the last two rows, where the event ‘beneficiaries’ has two different factuality values, one given by the AUTHOR (the writer of the text document) and another given by the ‘officials’ (a source talked about in the text document). The different conception of ‘source’ in FactBank and PARC can also be noted here, which shows that although annotations can be merged, the annotation schema of each corpus should always be considered when analyzing the results. Such queries are also useful for revealing how a corpus's data is represented. For example, we can see that an event from FactBank always has an AUTHOR source.

This demonstrates how we can use the power of query languages like SPARQL to access corpora in an integrated and fully automated way. The results can be exported in a variety of formats, including CSV tables and JSON files, and be loaded and processed in other tools. It is then straightforward to write conversion tools to other formats, such as the CoNLL format [3], a data format commonly used by computational linguists. With our nanopublication approach, we ensure that the data is interoperable from the start, so that format conversions are relatively easy because the hard interoperability challenges have already been taken care of.
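As a rough illustration of how such a conversion tool might start, the Python sketch below flattens a result in the standard SPARQL JSON results layout (`head.vars` and `results.bindings`) into tab-separated rows. This is not the converter used for the paper and it does not produce the actual CoNLL-X column layout: the sample values are made up, the function name `bindings_to_tsv` is introduced here, and the underscore for unbound variables is an assumption borrowed from common CoNLL-style practice.

```python
import csv
import io
import json

# Hypothetical result in the SPARQL JSON results layout; the values are made up.
RESULT_JSON = """
{
  "head": {"vars": ["eID", "eventWord", "factValue"]},
  "results": {"bindings": [
    {"eID": {"type": "literal", "value": "e1"},
     "eventWord": {"type": "literal", "value": "said"},
     "factValue": {"type": "literal", "value": "CT+"}},
    {"eID": {"type": "literal", "value": "e2"},
     "eventWord": {"type": "literal", "value": "rise"}}
  ]}
}
"""


def bindings_to_tsv(result_json, missing="_"):
    """Turn SPARQL JSON bindings into tab-separated lines, one row per binding."""
    result = json.loads(result_json)
    columns = result["head"]["vars"]  # keeps the column order of the SELECT clause
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for binding in result["results"]["bindings"]:
        # Unbound variables are simply absent from a binding object.
        row = [binding[c]["value"] if c in binding else missing for c in columns]
        writer.writerow(row)
    return out.getvalue()


tsv = bindings_to_tsv(RESULT_JSON)
print(tsv, end="")
```

A real converter would additionally need to align the rows with token positions in the source text, which depends on the annotation schema of each corpus.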

Nanopublications have been mostly used with scientific data, such as data from the biomedical domain [12]. In this work, we show that linguistic corpora can also benefit from the fine-grained and provenance-aware structure of nanopublications. The collection of nanopublications presented here contains information from event annotated corpora. We demonstrate that when this data is modelled in a homologous way, the individual corpora can be automatically merged to produce valuable combinations of linguistic data or informative corpus content that was not easily attainable prior to the transformation. Specifically, our dataset can provide insight into the attribution annotations of events from FactBank.

Moreover, our nanopublications suggest the prospect of linguistic data fitting into the Linked Data ecosystem. During our dataset creation, we were able to reuse multiple existing Linked Data vocabularies. Linguistic data becoming more linked would also allow it to be connected more easily with already existing linguistic Linked Data sources from LLOD. Nevertheless, the data format of the original annotated corpora is extremely variable, which shows in the multiple textual challenges faced during production. While this project focused on only two corpora, the annotation format was inconsistent not only between the corpora, but also within a single corpus. Thus, the conversion of existing linguistic corpora into a Linked Data scheme can be expected to require tedious pre-processing efforts. Even in our dataset, several documents needed to be excluded from the final version. Another aspect that makes it challenging to fit linguistic corpora into the Linked Data ecosystem relates to copyright restrictions. For example, these allowed us to represent only part of the PARC corpus.

In this paper, interoperability issues in linguistic data, specifically event annotated corpora, are discussed and nanopublications are proposed as a solution to resolve them. When one set of annotation guidelines is used to produce annotations on one corpus, that same corpus might already have other annotations that could be useful for the researcher. However, it is currently very difficult for a researcher to know or retrieve the version or provenance of these existing annotations due to poor documentation and variable annotation formats. Aspects of nanopublications that make them useful for annotated corpora are that they can be written in different formats, they can represent versioning, and they can track provenance. They also follow the LLOD principles of using URIs for identification and RDF representation as output. Above all, they seem useful for merging linguistic annotation data. A successful model for translating existing event corpora into nanopublications is significant not only for computational linguistics, but also for the field of Linked Data.

References

1. Bizer, C., Heath, T., Berners-Lee, T.: Linked data: The story so far. In: Semantic services, interoperability and web applications: emerging concepts, pp. 205–227. IGI Global (2011)
2. Bosque-Gil, J., Gracia, J., McCrae, J., Cimiano, P., Stolk, S., Khan, F., Depuydt, K., de Does, J., Frontini, F., Kernerman, I.: The ontolex lemon lexicography module. W3C Community Group Final Report (2015)
3. Buchholz, S., Marsi, E.: CoNLL-X shared task on multilingual dependency parsing. In: Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X). pp. 149–164. Association for Computational Linguistics, New York City (Jun 2006)
4. Buitelaar, P., Cimiano, P., McCrae, J., Montiel-Ponsoda, E., Declerck, T.: Ontology lexicalisation: The lemon perspective (2011)
5. Chiarcos, C.: Interoperability of corpora and annotations. In: Linked Data in Linguistics, pp. 161–179. Springer (2012)
6. Chiarcos, C., Cimiano, P., Declerck, T., McCrae, J.P.: Linguistic linked open data (llod). introduction and overview. In: Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data. pp. i–xi (2013)
7. Chiarcos, C., Sukhareva, M.: Olia–ontologies of linguistic annotation. Semantic Web (4), 379–386 (2015)
8. Gangemi, A., Navigli, R., Velardi, P.: The ontowordnet project: extension and axiomatization of conceptual relations in wordnet. In: OTM Confederated International Conferences "On the Move to Meaningful Internet Systems". pp. 820–838. Springer (2003)
9. Groth, P., Gibson, A., Velterop, J.: The anatomy of a nanopublication. Information Services & Use (1-2), 51–56 (2010)
10. Kilgarriff, A.: Comparing corpora. International journal of corpus linguistics (1), 97–133 (2001)
11. Kuhn, T.: nanopub-java: A java library for nanopublications. arXiv preprint arXiv:1508.04977 (2015)
12. Kuhn, T., Barbano, P.E., Nagy, M.L., Krauthammer, M.: Broadening the scope of nanopublications. In: Extended Semantic Web Conference. pp. 487–501. Springer (2013)
13. Kuhn, T., Chichester, C., Krauthammer, M., Queralt-Rosinach, N., Verborgh, R., Giannakopoulos, G., Ngomo, A.C.N., Viglianti, R., Dumontier, M.: Decentralized provenance-aware publishing with nanopublications. PeerJ Computer Science, e78 (2016)
14. Kuhn, T., Dumontier, M.: Trusty uris: Verifiable, immutable, and permanent digital artifacts for linked data. In: European semantic web conference. pp. 395–410. Springer (2014)
15. Kuhn, T., Willighagen, E., Evelo, C., Queralt-Rosinach, N., Centeno, E., Furlong, L.I.: Reliable granular references to changing linked data. In: International Semantic Web Conference. pp. 436–451. Springer (2017)
16. Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S., Zhao, J.: Prov-o: The prov ontology. W3C recommendation (2013)
17. McGuinness, D.L., Van Harmelen, F., et al.: Owl web ontology language overview. W3C recommendation (10), 2004 (2004)
18. Pareti, S.: Attribution: a computational approach (2015)
19. Pareti, S.: Parc 3.0: A corpus of attribution relations. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). pp. 3914–3920 (2016)
20. Pustejovsky, J., Hanks, P., Saurí, R., See, A., Gaizauskas, R., Setzer, A., Radev, D., Sundheim, B., Day, D., Ferro, L., et al.: The timebank corpus. In: Corpus linguistics. vol. 2003, p. 40. Lancaster, UK. (2003)
21. Sanderson, R., Ciccarese, P., Young, B.: Web annotation vocabulary. W3C recommendation (2017)
22. Saurí, R.: Factbank 1.0 annotation guidelines. Ms., Brandeis University (2008)
23. Saurí, R., Littman, J., Knippen, B., Gaizauskas, R., Setzer, A., Pustejovsky, J.: Timeml annotation guidelines. Version (1), 31 (2006)
24. Saurí, R., Pustejovsky, J.: Factbank: a corpus annotated with event factuality. Language resources and evaluation (3), 227 (2009)
25. Van Son, C., Inel, O., Morante, R., Aroyo, L., Vossen, P.: Resource interoperability for sustainable benchmarking: The case of events. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)
26. Villegas, M., Bel, N.: Parole/simple ‘lemon’ ontology and lexicons. Semantic Web 6