SuperMat: Construction of a linked annotated dataset from superconductors-related publications
Luca Foppiano, Sae Dieb, Akira Suzuki, Pedro Baptista de Castro, Suguru Iwasaki, Azusa Uzuki, Miren Garbine Esparza Echevarria, Yan Meng, Kensei Terashima, Laurent Romary, Yoshihiko Takano, Masashi Ishii
Material Database Group, MaDIS, NIMS, Tsukuba, 305-0044, Japan
Nano Frontier Superconducting Materials Group, MANA, NIMS, Tsukuba, 305-0047, Japan
ALMAnaCH, Inria, Paris, 75012, France

* Corresponding authors: Luca Foppiano ([email protected]), Masashi Ishii ([email protected])

January 29, 2021

Abstract
A growing number of papers are published in the area of superconducting materials science. However, novel text and data mining (TDM) processes are still needed to efficiently access and exploit this accumulated knowledge, paving the way towards data-driven materials design. Herein, we present SuperMat (Superconductor Materials), an annotated corpus of linked data derived from scientific publications on superconductors, which comprises 142 articles, 16052 entities, and 1398 links that are characterised into six categories: the names, classes, and properties of materials; links to their respective superconducting critical temperature (Tc); and parametric conditions such as applied pressure or measurement methods. The construction of SuperMat resulted from a fruitful collaboration between computer scientists and material scientists, and its high quality is ensured through validation by domain experts. The quality of the annotation guidelines was ensured by satisfactory Inter Annotator Agreement (IAA) between the annotators and the domain experts. SuperMat includes the dataset, annotation guidelines, and annotation support tools that use automatic suggestions to help minimise human errors.

Introduction
The vast majority of scientific knowledge exists as published articles [9, 16, 35, 1]. These publications are presented mainly as text, which is challenging to parse into a machine-readable structure. Meanwhile, as a part of the text and data mining (TDM) discipline, computer-assisted information collection from the literature has become a supportive asset for scientific research [34]. In the past decades, new TDM processes were developed for several natural science disciplines to achieve automatic document processing such as information retrieval, entity extraction, and clustering. TDM has been applied in biology for identifying interactions between agents (e.g. bacteria, viruses, genes, and proteins) [12, 24, 23] to support the research on serious diseases including cancer [26]. In chemistry, it was used for the disambiguation of chemical compound names, synthesis extraction, and retrieval [11]. In both domains, the application of TDM was based on manually curated datasets (corpora) that functioned as infrastructures. Examples are the BioCreative IV CHEMDNER corpus [25] in chemistry, and Genia [17] and GENETAG [38, 33] in biology. Such datasets are crucial for developing, training, and evaluating TDM systems.

In comparison, such resources in the materials science domain are rather limited. Reported cases include NaDev [4] on nanocrystal devices research, SC-CoMIcs [39] in the superconductors domain, and a corpus for extracting synthesis recipes [21]. To address this shortage of infrastructure, experimental data is extracted manually [8], or ab-initio calculations are used [14], but they might not accurately describe the real system.
Several challenges still hinder the data-driven exploration of materials (also called Materials Informatics (MI)), namely: the lack of data standards, the infant stage of the data-driven culture, a wide variety of conflicting stakeholders, and missing incentives for researchers to contribute to large collaborative initiatives [13]. To bridge these gaps, it is necessary to create infrastructural resources to support TDM processes in materials science through the automatic construction of databases for materials and their properties. Such applications can minimise the need for humans to read new papers and extract the key information therein. Equally importantly, they enable scientists to focus and leverage computing power and human resources to find deeper relationships between superficially unrelated information. Other applications include providing semantically enriched search engines that accept fine-grained queries [29] to reduce the time needed to access specific information. These processes cannot be established without essential resources such as dictionaries, lexicons, and datasets.

Research on superconducting materials has been growing rapidly towards both fundamental science and practical applications. Superconductors display many intriguing phenomena including zero resistivity, the ability to host a high magnetic field, quantisation of the magnetic flux, and vortex pinning. Current applications of superconductors include medical instruments, high-speed trains, quantum computers, and the Large Hadron Collider (LHC) [30, 18, 2]. However, discovering a new superconductor is a challenging task, as only 3% of candidate materials were found to be superconductors [20].
The National Institute for Materials Science (NIMS) in Japan has been manually constructing databases to support materials research, and SuperCon ( http://supercon.nims.go.jp ) is a manually curated data source for the superconductor domain. These databases could help researchers design new superconducting materials with a higher superconducting critical temperature (Tc), ideally up to room temperature [10, 37]. However, the current resources are very limited and not dynamic enough to incorporate the information from new publications in a timely manner. In this paper, we present SuperMat (Superconductor Materials), an annotated linked corpus for superconducting material information. This dataset contains 142 documents with 16052 (7166 unique) entities and 1398 links that can serve as infrastructural data for TDM processes in the domain of superconducting materials. SuperMat differs from SC-CoMIcs for the following reasons: (a) it provides full papers instead of abstracts, which contain more detailed information about the research on superconducting materials, and (b) it contains linked entities. We also describe the construction guidelines for SuperMat, in the hope of supporting researchers to systematically create annotated data. Furthermore, the unique feature of links between entities in SuperMat will allow the development of more precise methodologies to associate a particular material with its properties.

Methods
Content acquisition
SuperMat originates from PDF documents of scientific articles related to superconductor research. The PDF format is the most widely used format for scientific publications [15]. The original documents were collected from the following sources: (a) the Open Access (OA) version of peer-reviewed articles referenced in the SuperCon database records; (b) articles provided by domain experts containing suitable items and potential links of material names, Tc values, measurement methods, and pressures; (c) articles from the "condensed matter" category of arXiv ( https://arxiv.org/archive/cond-mat ) selected using the search terms "superconductor", "critical temperature", and "superconductivity".

Pre-print versions of peer-reviewed articles were obtained using a lookup service for bibliographic data called biblio-glutton ( https://github.com/kermitt2/biblio-glutton ) that aggregates data from various sources: the Crossref bibliographic database, the unPaywall ( http://unpaywall.org ) service, the PubMed Central repository ( https://pubmed.ncbi.nlm.nih.gov/ ), and mappings to other databases. We queried biblio-glutton using the bibliographic data of each article referenced in SuperCon; subsequently, we downloaded the pre-print article associated with the retrieved record, if available. Although the published version may differ from the pre-print version of a document, a comparison of pre-print and peer-reviewed articles in biology [3] measured the objective differences to be around 5%.
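As a minimal sketch, the lookup described above can be expressed as a query builder for biblio-glutton's REST lookup service. The base URL is a placeholder for a concrete deployment, and the parameter names (`doi`, `atitle`, `firstAuthor`) are assumptions based on the biblio-glutton documentation:

```python
from urllib.parse import urlencode

# Placeholder base URL; substitute the address of an actual biblio-glutton instance.
GLUTTON_BASE = "https://cloud.science-miner.com/glutton"

def build_lookup_url(base, title=None, first_author=None, doi=None):
    """Build a biblio-glutton /service/lookup query from bibliographic fields.

    Only non-empty fields are included; a DOI, when available, is the most
    precise key, otherwise title and first author are combined.
    """
    params = {}
    if doi:
        params["doi"] = doi
    if title:
        params["atitle"] = title
    if first_author:
        params["firstAuthor"] = first_author
    return f"{base}/service/lookup?{urlencode(params)}"

url = build_lookup_url(GLUTTON_BASE,
                       title="Superconductivity at 39 K in magnesium diboride",
                       first_author="Nagamatsu")
```

A GET request to the resulting URL would return the aggregated bibliographic record, from which the OA pre-print link (if any) can be followed.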
A preliminary annotation study was carried out to assess the effort required from the annotators to reach an acceptable Inter Annotator Agreement (IAA > 0.7). We annotated two randomly selected OA papers, using a preliminary version of the guidelines with a limited tag set of four labels:
The tag set (also referred to as labels) represents the classes of entities and the types of links between them, which are designed to be extracted from the text (Figure 1).
Entities
Entities (also referred to as named entities, mentions, or surface forms) are chunks of text that represent information of interest, as follows:

- Class (tag:
- Material: expressed as a chemical formula (e.g. LaFeO, WB), a compositional name (e.g. magnesium diboride), or an abbreviation (e.g. YBCO). Material annotations may also include:
  - the material's shape (e.g. wire, powder, thin film) or form of material (e.g. single/poly crystal),
  - modification by a dopant (e.g. Zn-doped, Si-doped) or by percentage of doping; we also considered qualitative expressions such as overdoped, lightly doped, and pure as valid information,
  - substrate information (e.g. grown on MgO(100) film) when it was adjacent to the material name or formula in the text,
  - additional information about the sample (e.g. as-grown, untwinned, single-layer) when it was adjacent to the material name or formula in the text.
- Superconducting critical temperature (tag:
Links
The links connect entities of materials or samples to their corresponding properties, conditions, and results. The links are non-directional, and there are no restrictions on the number of links for each entity. We defined three types of links:

- material-tc: linking materials to their Tc values.
- tc-pressure: connecting Tc and the applied pressure under which it was obtained.
- tc-me_method: linking Tc and the corresponding measurement method.

Annotation guidelines

Annotation guidelines include the principles and rules that describe what constitutes desired information for the SuperMat dataset and how to annotate it. They include detailed descriptions of the specific rules defined for each type of information to be annotated, with one or more definitions and examples illustrating what to annotate in different cases, exceptions, and references. We used an online system to track the discussions and decisions when a question or a comment was raised, and provided a link to such issues in the respective description or example. In addition, the guidelines include linking rules that provide information on how to correctly connect the entities in a relationship. The guidelines were built using a lightweight markup language (reStructuredText) and stored in a git ( https://git-scm.com/ ) version control system repository. We deployed them as HTML files via the web, which were updated automatically after each modification.
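As a minimal sketch, the non-directional links described in the Links subsection can be modelled as unordered pairs of entity identifiers; the class names, label strings, and sample values below are illustrative, not part of the released schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    id: str      # document-local identifier
    text: str    # surface form as it appears in the article
    label: str   # e.g. "material" or "tcValue" (illustrative label names)

@dataclass(frozen=True)
class Link:
    kind: str        # "material-tc", "tc-pressure", or "tc-me_method"
    ends: frozenset  # unordered pair of entity ids, since links are non-directional

mat = Entity("e1", "MgB2", "material")
tc = Entity("e2", "39 K", "tcValue")
link = Link("material-tc", frozenset({mat.id, tc.id}))
```

Using a `frozenset` for the endpoints makes the two possible orderings of a link compare equal, which matches the non-directional definition while still allowing a practical storage convention.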
Annotation support tools
The task of annotating documents is tedious and requires both attention and subject knowledge from the annotators. Annotation support tools aim to maximise the efficiency of annotators and minimise human mistakes. They are composed of a web-based collaborative annotation tool, automatic annotation suggestions, and automatic corpus analysis.
Web-based collaborative annotation tool: INCEpTION
The annotation tool is the platform used for creating, correcting, and linking annotations. After evaluating several tools, we selected INCEpTION [19, 5], a web-based multi-user platform for machine-assisted rapid dataset annotation construction. INCEpTION provides supportive functionalities that include:

- Multi-layer annotation sheets that allow different annotation schemas over the same documents,
- Two annotation steps: annotation consists of manually correcting pre-imported documents, while curation allows another user to validate the annotations (Figure 5),
- On-the-fly automatic suggestions based on active learning and string matching (Figure 5),
- Bulk annotation corrections, and
- Being open source (Apache 2.0 license) and under active development at the time of this paper ( https://inception-project.github.io/ ).

Annotation suggestions
Previous works have demonstrated that annotation suggestions improve the quality of the output [7, 32, 28]. We provide two types of annotation suggestions: (i) machine-based annotated data that were assigned to the documents before loading them into the annotation tool; here, we use a machine learning (ML)-based system from a previously implemented prototype [6] to support our tag set; (ii) active learning recommendations provided by INCEpTION, assigned on-the-fly based on previous annotations. The active-learning recommendations are less precise since they aim to increase the recall, and therefore they need to be explicitly accepted by the annotator.
Automatic corpus analysis
Automatic corpus analysis is a set of scripts designed to run after the validation step. These scripts automatically find inconsistencies in the links and entities, while extracting the statistics of the corpus. We calculated the inconsistencies by examining every annotated entity and computing the frequency of the same text being annotated with different labels. The script outputs a summary table visualising each annotation value, together with its labels and frequencies. We visually inspected this table, because the reported inconsistencies can be either obvious mistakes (Table 2) or arise from ambiguities (Table 3); therefore, their context should be verified.

Although the links are conceptually non-directional, we have defined a practical convention to maintain their consistency. For example, material-tc is always represented as a link between the material and its Tc value.
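The frequency-based inconsistency check described above can be sketched as follows; the function name, the label strings, and the sample data are illustrative, not the actual SuperMat scripts:

```python
from collections import Counter, defaultdict

def find_label_inconsistencies(annotations):
    """Report surface forms annotated with more than one label.

    `annotations` is an iterable of (text, label) pairs; the result maps each
    conflicting text (lowercased) to its per-label frequencies, mirroring the
    summary table inspected by the curators.
    """
    by_text = defaultdict(Counter)
    for text, label in annotations:
        by_text[text.lower()][label] += 1
    return {t: dict(c) for t, c in by_text.items() if len(c) > 1}

report = find_label_inconsistencies([
    ("MgB2", "material"), ("MgB2", "material"),   # consistent
    ("39 K", "tcValue"), ("39 K", "pressure"),    # flagged for manual review
])
```

As in the paper's workflow, a flagged value is not automatically corrected: it may be an obvious mislabel or a genuine ambiguity, so its context must still be checked by a human.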
Annotation process
The annotation workflow (Figure 2) was designed following the MATTER (Model, Annotate, Train, Test, Evaluate, and Revise) schema [36] and other related work [4, 25]. The workflow is composed of five steps (Figure 2): data preparation, correction, validation, testing and evaluation, and revision. This workflow involves three main actors: the automatic process, computer scientists, and the domain experts.

The first step of the annotation process involves preparing the machine-based annotated data from the source PDF documents. The PDF files are converted to an XML-based format, and annotation is automatically applied. This is followed by four more steps:

- Annotation: The human annotator can select a document and manually add, remove, or modify each entity based on rules defined in the guidelines. Once the annotation is complete, the document is marked "ready" for validation.
- Validation/Curation by domain experts: Annotations from different users are validated and merged into a final document (Figure 5). The domain expert ("curator") can compare the different annotated versions and select the best combination of annotations, or add new ones. This step ensures that the annotations are cross-checked and that the document is validated by domain experts.
- Automatic consistency checks and statistical analysis: This step aims to discover obvious mistakes such as mislabelling or incorrect linking. A sequence labelling model is trained and evaluated using 10-fold cross-validation. The evaluation provides precision, recall, and f-score metrics for all the labels. The resulting model is used for producing machine-based annotated data in the following iteration.
- Review: Retrospective analysis of the past iteration, where unclear cases are discussed and documented in the annotation guidelines.
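The per-label precision, recall, and f-score reported in the evaluation step follow the standard definitions; a minimal sketch from true-positive, false-positive, and false-negative counts (the counts themselves come from comparing model output against the validated annotations):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from entity-level TP/FP/FN counts.

    Zero denominators (a label never predicted, or absent from the gold
    data) are mapped to 0.0 rather than raising a division error.
    """
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

In the 10-fold setting, these metrics would be computed per label on each held-out fold and then averaged across folds.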
Data transformation
There are two processes of data transformation (Figure 3): (a) from the source document (PDF) to the dataset format representation (XML-based), and (b) from the dataset format representation to the annotation tool exchange formats ( https://inception-project.github.io/releases/0.16.1/docs/user-guide.html ) and vice versa.

- PDF to XML-based: This step converts the PDF source document to the dataset format representation in XML following the Text Encoding Initiative (TEI, https://tei-c.org/ ) format guidelines. Such transformation is performed by leveraging the functionalities provided by GROBID ( https://github.com/kermitt2/grobid ). We developed a customised process for collecting a subset of information from the source PDF document. The process extracts the title, keywords, and abstract from the header; and paragraphs, sections, and figure and table captions from the body. All the callouts to references, tables, and figures are ignored. The resulting structured document is then encoded in XML as will be described below.
- XML to the annotation tool exchange formats: We transform our XML-formatted data into an INCEpTION-compatible import format, such as Webanno TSV 3.2 ( https://inception-project.github.io/releases/0.17.0/docs/user-guide.html ), and vice versa, using a set of Python scripts. The Webanno TSV 3.2 format is an extension of the CoNLL format, with additions of the header and column representation.
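The first stage of transformation (b) amounts to reading inline annotations out of the TEI-encoded text. A minimal sketch using the standard library XML parser; the use of a TEI `<rs>` element with the label in its `type` attribute is an assumption for illustration, and the fragment below is a made-up example, not a corpus document:

```python
import xml.etree.ElementTree as ET

# A minimal TEI-like paragraph; <rs type="..."> carrying the label is an
# assumed encoding convention, and the label names are illustrative.
tei = """<p>The compound <rs type="material">MgB2</rs> becomes superconducting
at <rs type="tcValue">39 K</rs>.</p>"""

def extract_entities(xml_fragment):
    """Return (label, surface form) pairs for every inline <rs> annotation."""
    root = ET.fromstring(xml_fragment)
    return [(rs.get("type"), rs.text) for rs in root.iter("rs")]

entities = extract_entities(tei)
```

From such (label, text) pairs plus token offsets, a converter can emit the token-per-line Webanno TSV rows expected by INCEpTION, and the reverse script reads the TSV back into the XML representation.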
Data Record
The dataset is composed of 142 PDF documents, of which 92% (130) are OA (Figure 4a). To comply with copyright restrictions, a few articles from our dataset are not publicly available in our repository. The top three publishers represented in the corpus are the American Physical Society (APS), Elsevier, and IOP Publishing (Figure 4b). Figure 4c illustrates the distribution by publication date. We summarise SuperMat's content in Table 4, with the statistics of documents, entities, and links given separately. In particular, this dataset contains 16052 (7166 unique) entities spread over six labels, and 1398 links.

Each document is encoded according to the XML TEI guidelines, which is a rich format for document representation. We have carried out no specific customisation, in order to remain fully compliant with the general TEI schema. A TEI document has two main parts: the header (within the teiHeader element) and the text (within the text element).
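The total-versus-unique entity counts reported above (16052 total, 7166 unique) can be derived from the annotated mentions with a simple aggregation; the function and the toy mention list below are illustrative:

```python
from collections import Counter

def corpus_entity_stats(mentions):
    """Total and unique mention counts per label, Table 4-style.

    `mentions` is an iterable of (label, surface form) pairs collected
    from all documents in the corpus.
    """
    totals = Counter(label for label, _ in mentions)
    uniques = {label: len({t for l, t in mentions if l == label})
               for label in totals}
    return totals, uniques

totals, uniques = corpus_entity_stats([
    ("material", "MgB2"), ("material", "MgB2"), ("material", "LaFeAsO"),
    ("tcValue", "39 K"),
])
```

Summing `totals` over the six labels would reproduce the overall entity count, while `uniques` gives the deduplicated figures.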
We transformed the source documents into these TEI-compliant structures using a simplified representation for specific content types. The general objective is to flatten the content into a generic structure where priority is given to the annotations. For instance, the keywords section, which groups together the key terms defined by the author(s) of the paper, is encoded using the generic tag element. The text is annotated with the generic The electron-doped high-