SuperMat: Construction of a linked annotated dataset from superconductors-related publications
Luca Foppiano, Sae Dieb, Akira Suzuki, Pedro Baptista de Castro, Suguru Iwasaki, Azusa Uzuki, Miren Garbine Esparza Echevarria, Yan Meng, Kensei Terashima, Laurent Romary, Yoshihiko Takano, Masashi Ishii
Material Database Group, MaDIS, NIMS, Tsukuba, 305-0044, Japan
Nano Frontier Superconducting Materials Group, MANA, NIMS, Tsukuba, 305-0047, Japan
ALMAnaCH, Inria, Paris, 75012, France

* Corresponding authors: Luca Foppiano ([email protected]), Masashi Ishii ([email protected])

January 29, 2021

Abstract
A growing number of papers are published in the area of superconducting materials science. However, novel text and data mining (TDM) processes are still needed to efficiently access and exploit this accumulated knowledge, paving the way towards data-driven materials design. Herein, we present SuperMat (Superconductor Materials), an annotated corpus of linked data derived from scientific publications on superconductors, which comprises 142 articles, 16052 entities, and 1398 links that are characterised into six categories: the names, classes, and properties of materials; links to their respective superconducting critical temperature (Tc); and parametric conditions such as applied pressure or measurement methods. The construction of SuperMat resulted from a fruitful collaboration between computer scientists and material scientists, and its high quality is ensured through validation by domain experts. The quality of the annotation guidelines was ensured by satisfactory Inter Annotator Agreement (IAA) between the annotators and the domain experts. SuperMat includes the dataset, annotation guidelines, and annotation support tools that use automatic suggestions to help minimise human errors.

Introduction
The vast majority of scientific knowledge exists as published articles [9, 16, 35, 1]. These publications are presented mainly as text, which is challenging to parse into a machine-readable structure. Meanwhile, as a part of the text and data mining (TDM) discipline, computer-assisted information collection from the literature has become a supportive asset for scientific research [34]. In the past decades, new TDM processes were developed for several natural science disciplines to achieve automatic document processing such as information retrieval, entity extraction, and clustering. TDM has been applied in biology for identifying interactions between agents (e.g. bacteria, viruses, genes, and proteins) [12, 24, 23] to support the research on serious diseases including cancer [26]. In chemistry, it was used for the disambiguation of chemical compound names, synthesis extraction, and retrieval [11]. In both domains, the application of TDM was based on manually curated datasets (corpora) that functioned as infrastructures. Examples are the BioCreative IV CHEMDNER corpus [25] in chemistry, and Genia [17] and GENETAG [38, 33] in biology. Such datasets are crucial for developing, training, and evaluating TDM systems.

In comparison, such resources in the materials science domain are rather limited. Reported cases include NaDev [4] on nanocrystal devices research, SC-CoMIcs [39] in the superconductors domain, and a corpus for extracting synthesis recipes [21]. To address this shortage of infrastructure, experimental data is extracted manually [8], or ab-initio calculations are used [14], but they might not accurately describe the real system.
Several challenges still hinder the data-driven exploration of materials (also called Materials Informatics (MI)), namely: the lack of data standards, the infant stage of the data-driven culture, a wide variety of conflicting stakeholders, and missing incentives for researchers to contribute to large collaborative initiatives [13]. To bridge these gaps, it is necessary to create infrastructural resources to support TDM processes in materials science through the automatic construction of databases for materials and their properties. Such applications can minimise the need for humans to read new papers and extract the key information therein. Equally importantly, they enable scientists to focus and leverage computing power and human resources to find deeper relationships between superficially unrelated information. Other applications include providing semantically enriched search engines that accept fine-grained queries [29] to reduce the time needed to access specific information. These processes cannot be established without essential resources such as dictionaries, lexicons, and datasets.

Research on superconducting materials has been growing rapidly towards both fundamental science and practical applications. Superconductors display many intriguing phenomena including zero resistivity, the ability to host a high magnetic field, quantisation of the magnetic flux, and vortex pinning. Current applications of superconductors include medical instruments, high-speed trains, quantum computers, and the Large Hadron Collider (LHC) [30, 18, 2]. However, discovering a new superconductor is a challenging task, as only 3% of candidate materials were found to be superconductors [20].
The National Institute for Materials Science (NIMS) in Japan has been manually constructing databases to support materials research, and SuperCon ( http://supercon.nims.go.jp ) is a manually curated data source for the superconductor domain. These databases could help researchers design new superconducting materials with a higher superconducting critical temperature (Tc), ideally up to room temperature [10, 37]. However, the current resources are very limited and not dynamic enough to incorporate the information from new publications in a timely manner. In this paper, we present SuperMat (Superconductor Materials), an annotated linked corpus for superconducting material information. This dataset contains 142 documents with 16052 (7166 unique) entities and 1398 links that can serve as infrastructural data for TDM processes in the domain of superconducting materials. SuperMat differs from SC-CoMIcs for the following reasons: (a) it provides full papers instead of abstracts, which contain more detailed information about the research on superconducting materials, and (b) it contains linked entities. We also describe the construction guidelines for SuperMat, in the hope of supporting researchers to systematically create annotated data. Furthermore, the unique feature of links between entities in SuperMat will allow the development of more precise methodologies to associate a particular material with its properties.

Methods
Content acquisition
SuperMat originates from PDF documents of scientific articles related to superconductor research. The PDF format is the most widely used format for scientific publications [15]. The original documents were collected from the following sources: (a) the Open Access (OA) version of peer-reviewed articles referenced in the SuperCon database records; (b) articles provided by domain experts containing suitable items and potential links of material names, Tc values, measurement methods, and pressures; (c) articles from the "condensed matter" category of arXiv ( https://arxiv.org/archive/cond-mat ) selected using the search terms "superconductor", "critical temperature", and "superconductivity".

Pre-print versions of peer-reviewed articles were obtained using a lookup service for bibliographic data called biblio-glutton ( https://github.com/kermitt2/biblio-glutton ) that aggregates data from various sources: the Crossref bibliographic database, the unPaywall ( http://unpaywall.org ) service, the PubMed Central repository ( https://pubmed.ncbi.nlm.nih.gov/ ), and mappings to other databases. We queried biblio-glutton using the bibliographic data of each article referenced in SuperCon; subsequently, we downloaded the pre-print article associated with the retrieved record, if available. Although the published version may differ from the pre-print version of a document, a comparison of pre-print and peer-reviewed articles in biology [3] measured the objective differences to be around 5%.
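As a minimal sketch, the lookup described above can be expressed as a query builder for biblio-glutton's REST lookup service. The base URL is a placeholder for a concrete deployment, and the parameter names (`doi`, `atitle`, `firstAuthor`) are assumptions based on the biblio-glutton documentation:

```python
from urllib.parse import urlencode

# Placeholder base URL; substitute the address of an actual biblio-glutton instance.
GLUTTON_BASE = "https://cloud.science-miner.com/glutton"

def build_lookup_url(base, title=None, first_author=None, doi=None):
    """Build a biblio-glutton /service/lookup query from bibliographic fields.

    Only non-empty fields are included; a DOI, when available, is the most
    precise key, otherwise title and first author are combined.
    """
    params = {}
    if doi:
        params["doi"] = doi
    if title:
        params["atitle"] = title
    if first_author:
        params["firstAuthor"] = first_author
    return f"{base}/service/lookup?{urlencode(params)}"

url = build_lookup_url(GLUTTON_BASE,
                       title="Superconductivity at 39 K in magnesium diboride",
                       first_author="Nagamatsu")
```

A GET request to the resulting URL would return the aggregated bibliographic record, from which the OA pre-print link (if any) can be followed.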
A preliminary annotation study was carried out to assess the effort required from the annotators to reach an acceptable Inter Annotator Agreement (IAA > 0.7). We annotated two randomly selected OA papers, using a preliminary version of the guidelines with a limited tag set of four labels:
The tag set (also referred to as labels) represents the classes of entities and the types of links between them, which are designed to be extracted from the text (Figure 1).
Entities
Entities (also referred to as named entities, mentions, or surface forms) are chunks of text that represent information of interest, as follows:

- Class (tag:
- Material: expressed as a chemical formula (e.g. LaFeO, WB), a compositional name (e.g. magnesium diboride), or an abbreviation (e.g. YBCO). Material annotations may also include:
  - the material's shape (e.g. wire, powder, thin film) or form of material (e.g. single/poly crystal),
  - modification by a dopant (e.g. Zn-doped, Si-doped) or by percentage of doping; we also considered qualitative expressions such as overdoped, lightly doped, and pure as valid information,
  - substrate information (e.g. grown on MgO(100) film) when it was adjacent to the material name or formula in the text,
  - additional information about the sample (e.g. as-grown, untwinned, single-layer) when it was adjacent to the material name or formula in the text.
- Superconducting critical temperature (tag:
Links
The links connect entities of materials or samples to their corresponding properties, conditions, and results. The links are non-directional, and there are no restrictions on the number of links for each entity. We defined three types of links:

- material-tc: linking materials to their Tc values.
- tc-pressure: connecting Tc and the applied pressure under which it was obtained.
- tc-me_method: linking Tc and the corresponding measurement method.

Annotation guidelines

Annotation guidelines include the principles and rules that describe what constitutes desired information for the SuperMat dataset and how to annotate it. They include detailed descriptions of the specific rules defined for each type of information to be annotated, with one or more definitions and examples illustrating what to annotate in different cases, exceptions, and references. We used an online system to track the discussions and decisions when a question or a comment was raised, and provided a link to such issues in the respective description or example. In addition, the guidelines include linking rules that provide information on how to correctly connect the entities in a relationship. The guidelines were built using a lightweight markup language (reStructuredText) and stored in a git ( https://git-scm.com/ ) version control system repository. We deployed them as HTML files via the web, which were updated automatically after each modification.
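As a minimal sketch, the non-directional links described in the Links subsection can be modelled as unordered pairs of entity identifiers; the class names, label strings, and sample values below are illustrative, not part of the released schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    id: str      # document-local identifier
    text: str    # surface form as it appears in the article
    label: str   # e.g. "material" or "tcValue" (illustrative label names)

@dataclass(frozen=True)
class Link:
    kind: str        # "material-tc", "tc-pressure", or "tc-me_method"
    ends: frozenset  # unordered pair of entity ids, since links are non-directional

mat = Entity("e1", "MgB2", "material")
tc = Entity("e2", "39 K", "tcValue")
link = Link("material-tc", frozenset({mat.id, tc.id}))
```

Using a `frozenset` for the endpoints makes the two possible orderings of a link compare equal, which matches the non-directional definition while still allowing a practical storage convention.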
Annotation support tools
The task of annotating documents is tedious and requires both attention and subject knowledge from the annotators. Annotation support tools aim to maximise the efficiency of annotators and minimise human mistakes. They are composed of a web-based collaborative annotation tool, automatic annotation suggestions, and automatic corpus analysis.
Web-based collaborative annotation tool: INCEpTION
The annotation tool is the platform used for creating, correcting, and linking annotations. After evaluating several tools, we selected INCEpTION [19, 5], a web-based multi-user platform for machine-assisted rapid dataset annotation construction. INCEpTION provides supportive functionalities that include:

- Multi-layer annotation sheets that allow different annotation schemas over the same documents,
- Two annotation steps: annotation consists of manually correcting pre-imported documents, while curation allows another user to validate the annotations (Figure 5),
- On-the-fly automatic suggestions based on active learning and string matching (Figure 5),
- Bulk annotation corrections, and
- Being open source (Apache 2.0 license) and under active development at the time of this paper ( https://inception-project.github.io/ ).

Annotation suggestions
Previous works have demonstrated that annotation suggestions improve the quality of the output [7, 32, 28]. We provide two types of annotation suggestions: (i) machine-based annotated data that were assigned to the documents before loading them into the annotation tool; here, we use a machine learning (ML)-based system from a previously implemented prototype [6] to support our tag set; (ii) active learning recommendations provided by INCEpTION, assigned on-the-fly based on previous annotations. The active-learning recommendations are less precise since they aim to increase the recall, and therefore they need to be explicitly accepted by the annotator.
Automatic corpus analysis
Automatic corpus analysis is a set of scripts designed to run after the validation step. These scripts automatically find inconsistencies in the links and entities, while extracting the statistics of the corpus. We calculated the inconsistencies by examining every annotated entity and computing the frequency of the same text being annotated with different labels. The script outputs a summary table visualising each annotation value, together with its labels and frequencies. We visually inspected this table, because the reported inconsistencies can be either obvious mistakes (Table 2) or arise from ambiguities (Table 3); therefore, their context should be verified.

Although the links are conceptually non-directional, we have defined a practical convention to maintain their consistency. For example, material-tc is always represented as a link between the material and its Tc value.
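The frequency-based inconsistency check described above can be sketched as follows; the function name, the label strings, and the sample data are illustrative, not the actual SuperMat scripts:

```python
from collections import Counter, defaultdict

def find_label_inconsistencies(annotations):
    """Report surface forms annotated with more than one label.

    `annotations` is an iterable of (text, label) pairs; the result maps each
    conflicting text (lowercased) to its per-label frequencies, mirroring the
    summary table inspected by the curators.
    """
    by_text = defaultdict(Counter)
    for text, label in annotations:
        by_text[text.lower()][label] += 1
    return {t: dict(c) for t, c in by_text.items() if len(c) > 1}

report = find_label_inconsistencies([
    ("MgB2", "material"), ("MgB2", "material"),   # consistent
    ("39 K", "tcValue"), ("39 K", "pressure"),    # flagged for manual review
])
```

As in the paper's workflow, a flagged value is not automatically corrected: it may be an obvious mislabel or a genuine ambiguity, so its context must still be checked by a human.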
Annotation process
The annotation workflow (Figure 2) was designed following the MATTER (Model, Annotate, Train, Test, Evaluate, and Revise) schema [36] and other related work [4, 25]. The workflow is composed of five steps (Figure 2): data preparation, correction, validation, testing and evaluation, and revision. This workflow involves three main actors: the automatic process, computer scientists, and the domain experts.

The first step of the annotation process involves preparing the machine-based annotated data from the source PDF documents. The PDF files are converted to an XML-based format, and annotation is automatically applied. This is followed by four more steps:

- Annotation: The human annotator can select a document and manually add, remove, or modify each entity based on rules defined in the guidelines. Once the annotation is complete, the document is marked "ready" for validation.
- Validation/Curation by domain experts: Annotations from different users are validated and merged into a final document (Figure 5). The domain expert ("curator") can compare the different annotated versions and select the best combination of annotations, or add new ones. This step ensures that the annotations are cross-checked and that the document is validated by domain experts.
- Automatic consistency checks and statistical analysis: This step aims to discover obvious mistakes such as mislabelling or incorrect linking. A sequence labelling model is trained and evaluated using 10-fold cross-validation. The evaluation provides precision, recall, and f-score metrics for all the labels. The resulting model is used for producing machine-based annotated data in the following iteration.
- Review: Retrospective analysis of the past iteration, where unclear cases are discussed and documented in the annotation guidelines.
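The per-label precision, recall, and f-score reported in the evaluation step follow the standard definitions; a minimal sketch from true-positive, false-positive, and false-negative counts (the counts themselves come from comparing model output against the validated annotations):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from entity-level TP/FP/FN counts.

    Zero denominators (a label never predicted, or absent from the gold
    data) are mapped to 0.0 rather than raising a division error.
    """
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

In the 10-fold setting, these metrics would be computed per label on each held-out fold and then averaged across folds.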
Data transformation
There are two processes of data transformation (Figure 3): (a) from the source document (PDF) to the dataset format representation (XML-based), and (b) from the dataset format representation to the annotation tool exchange formats ( https://inception-project.github.io/releases/0.16.1/docs/user-guide.html ) and vice versa.

- PDF to XML-based: This step converts the PDF source document to the dataset format representation in XML following the Text Encoding Initiative (TEI, https://tei-c.org/ ) format guidelines. Such transformation is performed by leveraging the functionalities provided by GROBID ( https://github.com/kermitt2/grobid ). We developed a customised process for collecting a subset of information from the source PDF document. The process extracts the title, keywords, and abstract from the header; and paragraphs, sections, and figure and table captions from the body. All the callouts to references, tables, and figures are ignored. The resulting structured document is then encoded in XML as will be described below.
- XML to the annotation tool exchange formats: We transform our XML-formatted data into an INCEpTION-compatible import format, such as Webanno TSV 3.2 ( https://inception-project.github.io/releases/0.17.0/docs/user-guide.html ), and vice versa, using a set of Python scripts. The Webanno TSV 3.2 format is an extension of the CoNLL format, with additions of the header and column representation.
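The first stage of transformation (b) amounts to reading inline annotations out of the TEI-encoded text. A minimal sketch using the standard library XML parser; the use of a TEI `<rs>` element with the label in its `type` attribute is an assumption for illustration, and the fragment below is a made-up example, not a corpus document:

```python
import xml.etree.ElementTree as ET

# A minimal TEI-like paragraph; <rs type="..."> carrying the label is an
# assumed encoding convention, and the label names are illustrative.
tei = """<p>The compound <rs type="material">MgB2</rs> becomes superconducting
at <rs type="tcValue">39 K</rs>.</p>"""

def extract_entities(xml_fragment):
    """Return (label, surface form) pairs for every inline <rs> annotation."""
    root = ET.fromstring(xml_fragment)
    return [(rs.get("type"), rs.text) for rs in root.iter("rs")]

entities = extract_entities(tei)
```

From such (label, text) pairs plus token offsets, a converter can emit the token-per-line Webanno TSV rows expected by INCEpTION, and the reverse script reads the TSV back into the XML representation.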
Data Record
The dataset is composed of 142 PDF documents, of which 92% (130) are OA (Figure 4a). To comply with copyright restrictions, a few articles from our dataset are not publicly available in our repository. The top three publishers represented in the corpus are the American Physical Society (APS), Elsevier, and IOP Publishing (Figure 4b). Figure 4c illustrates the distribution by publication date. We summarise SuperMat's content in Table 4, with the statistics of documents, entities, and links given separately. In particular, this dataset contains 16052 (7166 unique) entities spread over six labels, and 1398 links.

Each document is encoded according to the XML TEI guidelines, which is a rich format for document representation. We have carried out no specific customisation, in order to remain fully compliant with the general TEI schema. A TEI document has two main parts: the header (within the teiHeader element) and the text (within the text element).
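The total-versus-unique entity counts reported above (16052 total, 7166 unique) can be derived from the annotated mentions with a simple aggregation; the function and the toy mention list below are illustrative:

```python
from collections import Counter

def corpus_entity_stats(mentions):
    """Total and unique mention counts per label, Table 4-style.

    `mentions` is an iterable of (label, surface form) pairs collected
    from all documents in the corpus.
    """
    totals = Counter(label for label, _ in mentions)
    uniques = {label: len({t for l, t in mentions if l == label})
               for label in totals}
    return totals, uniques

totals, uniques = corpus_entity_stats([
    ("material", "MgB2"), ("material", "MgB2"), ("material", "LaFeAsO"),
    ("tcValue", "39 K"),
])
```

Summing `totals` over the six labels would reproduce the overall entity count, while `uniques` gives the deduplicated figures.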
We transformed the source documents into these TEI-compliant structures using a simplified representation for specific content types. The general objective is to flatten the content into a generic structure where priority is given to the annotations. For instance, the keywords section, which groups together the key terms defined by the author(s) of the paper, is encoded using the generic tag element. The text is annotated with the generic The electron-doped high-