A Semantically Enriched Dataset based on Biomedical NER for the COVID19 Open Research Dataset Challenge
Hermann Kroll, Jan Pirklbauer, Johannes Ruthmann, Wolf-Tilo Balke
arXiv [cs.DL], May
Hermann Kroll
Institute for Information Systems, TU Braunschweig, Braunschweig, Germany, kroll@ifis.cs.tu-bs.de
Jan Pirklbauer
Institute for Information Systems, TU Braunschweig, Braunschweig, Germany
Johannes Ruthmann
Institute for Information Systems, TU Braunschweig, Braunschweig, Germany
Wolf-Tilo Balke
Institute for Information Systems, TU Braunschweig, Braunschweig, Germany, balke@ifis.cs.tu-bs.de
ABSTRACT
Research into COVID-19 is a big challenge and highly relevant at the moment. New tools are required to assist medical experts in their research with relevant and valuable information. The COVID-19 Open Research Dataset Challenge (CORD-19) is a "call to action" for computer scientists to develop these innovative tools. Many of these applications are empowered by entity information, i.e. knowing which entities are used within a sentence. For this paper, we have developed a pipeline based on the latest Named Entity Recognition tools for Chemicals, Diseases, Genes and Species. We apply our pipeline to the COVID-19 research challenge and share the resulting entity mentions with the community.
KEYWORDS
Named Entity Recognition, COVID19 Research Challenge
PubMed, the most extensive library for biomedical research, contains nearly 30 million publications. The Allen Institute for AI selected nearly 57,000 documents as relevant for COVID19 research (V9), and around 47,000 full texts are included within this selection. Accessing such an extensive document collection and finding relevant information is a hard task for medical researchers. Especially in times when results are published within a few days, keeping an overview of the latest research can be exhausting. Novel tools are urgently needed to assist medical researchers in their workflows: novel search engines find relevant information precisely, and new access paths like summarization techniques offer new opportunities to cope with the flood of information. These tools are typically empowered by additional side information like knowledge graphs [1].

Knowledge graphs are structured storages providing fact-style knowledge about entities, e.g. Simvastatin is used in treatment of hypercholesterolemia. In the biomedical domain, entities of interest are mainly Chemicals, Diseases, Genes and Species. The central problem of utilizing structured information for text retrieval is to detect which entities are mentioned in the text. This problem is addressed by Named Entity Recognition (NER), i.e. detecting important entities in arbitrary texts. NER tools like Spotlight (DBpedia) and WAT (Wikidata) are developed to recognize a variety of different entities in several domains [5, 6]. Unfortunately, the biomedical domain contains a variety of different entities. Dictionary-based recognition tools might fail here because the exact entity mention within a sentence depends on the context. Hence, homonyms must be resolved, e.g. the gene name CYP3A4 has different ids depending on whether the sentence talks about mice or humans. Yet, Named Entity Recognition tools suitable for the biomedical domain have been designed and built by experts already.

In this paper, we utilize two biomedical NER tools, namely TaggerOne [4] and GNormPlus [8], and build a pipeline to annotate arbitrary biomedical texts. Finally, we apply our pipeline to the COVID19 dataset. The detected entity mentions are published in our GitHub repository (https://github.com/HermannKroll/CORD19BiomedicalNERDataset) for free reuse. The code will be published under the MIT license (https://opensource.org/licenses/MIT). The data is published for free reuse under the Creative Commons Attribution 4.0 International license (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). We hope that this additional entity information can serve as a solid and high-quality platform for novel tools and thus enable more research about COVID19.

First, we introduce a pipeline for biomedical Named Entity Recognition in arbitrary texts. The task of Named Entity Recognition is to detect entity mentions in texts. An entity represents a thing of interest in a specific domain, e.g. Chemicals and Diseases are of interest in the biomedical domain. Further, an entity consists of a unique id and an entity type, e.g. (Simvastatin, Chemical) is a valid entity. Entities are described by a predefined vocabulary, which is typically built by experts. Entities might be mentioned within a written text. Therefore, we understand text as a sequence of sentences and sentences as a sequence of tokens (single words). A sequence of tokens within a sentence might represent an entity.

Table 1: Benchmark results of TaggerOne [4]
Corpus               Precision   Recall   F-measure
NCBI Disease         85.1%       80.8%    82.9%
BioCreative V CDR    94.2%       88.8%    91.4%

Table 2: Benchmark results of GNormPlus (Human) [8]
Corpus               Precision   Recall   F-measure
BioCreative II GN    87.1%       86.4%    86.7%

We call this representation an entity mention. Hence, entity mentions consist of an entity and a sequence of corresponding tokens within a sentence.

The U.S. National Library of Medicine provides several expert-built tools that detect entity mentions in text with high quality. These tools can be used via command line interfaces and are freely available. We build a pipeline upon these tools to automatically detect the following entity types in text: 1. Chemicals, 2. Diseases, 3. Genes and 4. Species. Chemicals are described by the Medical Subject Headings (MeSH) vocabulary. Diseases are described either by MeSH terms or by OMIM. The NCBI Gene Vocabulary is utilized for the Genes' NER and the NCBI Species Taxonomy likewise for the Species' NER.

Chemicals and Diseases are detected by TaggerOne [4], which uses a semi-Markov structured linear classifier to run named entity recognition (NER) and normalization simultaneously, thus improving performance compared to other taggers. GNormPlus [8], which runs NER and normalization as two separate steps, is used for detecting Genes and Species. Both NER tools have been evaluated on real-world text corpora to determine the quality of their detected entity mentions. Benchmarks for the relevant corpora can be found in Table 1 for TaggerOne and Table 2 for GNormPlus. The NCBI Disease corpus is a test set for analysing diseases, and the BioCreative V corpus is a challenge for detecting Chemicals as well as Diseases. The GNormPlus evaluation is done on a gene normalisation test set for humans. Besides, GNormPlus is capable of detecting gene families in texts. For more details about both applications, see [4] for TaggerOne and [8] for GNormPlus.

Pipeline.
We have developed a pipeline utilizing TaggerOne and GNormPlus for biomedical NER. Our pipeline expects texts in the so-called PubTator format, see [7]. As input, the pipeline supports 1. a single PubTator file, 2. a composed PubTator file and 3. a directory of PubTator files. A composed PubTator file consists of the content of two PubTator files separated by two newlines. Besides, we support the tagging of multiple files in parallel. To this end, we split the input and run the underlying tools in parallel. Each recognition step stores its produced data in a relational database. Finally, the pipeline exports the annotated entity mentions in a desired format like PubTator or JSON.

Table 3: Document Counts of CORD19 Sources
General
  Number of documents                57.4K
  Number of full texts               43.5K
JSON parses by source
  PubMedCentral (PMC)                49.7K
  Elsevier                           24.8K
  medRxiv                            2.3K
  ArXiv                              1.2K
  bioRxiv                            1.1K
  Chan Zuckerberg Initiative (CZI)   0.2K

Table 4: Number of Detected Entity Mentions for the CORD-19 (Abstracts and Fulltexts)
Corpus      Chemicals   Diseases   Genes    Species
Abstracts   99K         145K       59K      165K
Fulltexts   3,407K      4,039K     2,232K   4,667K

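The input handling of the pipeline can be illustrated with a small sketch. It is not the pipeline's actual code: the function names `split_composed_pubtator` and `make_batches` are hypothetical, the round-robin batching is only one possible way to split work for parallel tagging, and the example documents are invented. It assumes the common PubTator layout in which a document starts with an `id|t|title` line followed by an `id|a|abstract` line.

```python
from typing import List


def split_composed_pubtator(content: str) -> List[str]:
    """Split a composed PubTator file into single-document chunks.

    Assumption: documents are separated by blank lines (two newlines).
    """
    chunks = [chunk.strip() for chunk in content.split("\n\n")]
    return [chunk for chunk in chunks if chunk]


def make_batches(documents: List[str], workers: int) -> List[List[str]]:
    """Distribute documents round-robin over at most `workers` batches."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    batches = [documents[i::workers] for i in range(workers)]
    # Drop empty batches when there are fewer documents than workers.
    return [batch for batch in batches if batch]


# Invented example input: two documents in one composed file.
composed = (
    "1|t|Simvastatin and hypercholesterolemia\n"
    "1|a|Simvastatin is used in treatment of hypercholesterolemia.\n"
    "\n"
    "2|t|CYP3A4 in humans\n"
    "2|a|A short abstract about CYP3A4.\n"
)
documents = split_composed_pubtator(composed)
batches = make_batches(documents, workers=2)
```

Each batch could then be written to its own file and handed to a separate tagger process; how the real pipeline schedules this work is not specified here.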
Research into COVID-19 is a big challenge and highly relevant at the moment. Therefore, scientists in the medical field must be assisted by innovative tools to access the current state of literature efficiently. The COVID-19 Open Research Dataset Challenge (CORD-19) [2] is a "call to action" for computer scientists in the natural language processing (NLP) and data mining field to develop such innovative tools. The dataset in version 9 consists of ca. 57,000 scholarly articles, of which ca. 44,000 have a PDF parse of their full text attached to them. Articles are taken from various sources, most prominently the PubMedCentral collection. The document statistics of the dataset in version 9 can be seen in Table 3. Some documents are accessible in multiple sources and are counted more than once in the statistics. The abstracts and full texts of the documents are given paragraph-wise in a JSON format, so the texts can easily be extracted and processed. Entity-centric information access plays a key role in the medical domain [3]. Hence, we run our pipeline upon the challenge dataset to assist the community with valuable entity information.
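As a rough illustration of how such a paragraph-wise JSON parse might be turned into tagger input, consider the following sketch. The field names (`paper_id`, `metadata.title`, `abstract`, `body_text`, `text`) are assumptions about the CORD-19 JSON layout and should be checked against the dataset version in use; the helper names and the example record are hypothetical.

```python
from typing import Dict, List


def paragraphs_of(parse: Dict) -> List[str]:
    """Return [title, abstract, body text 1, body text 2, ...]."""
    title = parse.get("metadata", {}).get("title", "")
    # Assumption: the abstract is stored as a list of paragraphs; join them.
    abstract = " ".join(part["text"] for part in parse.get("abstract", []))
    body = [part["text"] for part in parse.get("body_text", [])]
    return [title, abstract] + body


def to_pubtator(doc_id: str, title: str, abstract: str) -> str:
    """Render title and abstract as the two header lines of a PubTator document."""
    def clean(text: str) -> str:
        # PubTator header lines must not contain line breaks.
        return " ".join(text.split())

    return f"{doc_id}|t|{clean(title)}\n{doc_id}|a|{clean(abstract)}\n"


# Invented example record, not taken from the dataset.
example_parse = {
    "paper_id": "example-0001",
    "metadata": {"title": "A study of CYP3A4"},
    "abstract": [{"text": "First abstract paragraph."}, {"text": "Second one."}],
    "body_text": [{"text": "Introduction text."}, {"text": "Methods text."}],
}
paragraphs = paragraphs_of(example_parse)
pubtator_text = to_pubtator(example_parse["paper_id"], paragraphs[0], paragraphs[1])
```

The taggers may expect numeric document ids, in which case a mapping from CORD-19 ids to integers would be needed; that step is left out of this sketch.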
We report the number of resulting entity mentions for each entity type. We create two different dumps: the first dump contains entity mentions within titles and abstracts, and the second dump contains entity mentions in the titles, abstracts and fulltexts of the documents. Table 4 lists the number of entity mentions for both dumps grouped by entity type. Our pipeline detects nearly 99K Chemicals, 145K Diseases, 59K Genes and 165K Species in titles and abstracts. For fulltexts, the pipeline detects around 3.4M Chemicals, 4.0M Diseases, 2.2M Genes and 4.7M Species. We estimate the annotation quality to be comparable to the quality reported in the tools' original publications.
We publish the obtained entity mentions as two JSON files. The first file contains the entity mentions for titles and abstracts. The second file contains the entity mentions for titles, abstracts as well as fulltexts. We process the CORD19 fulltexts by selecting the available JSON files. These JSON files contain fulltexts as sequences of body texts. Hence, a fulltext document consists of a title, an abstract and a sequence of body texts. We publish the corresponding entity mentions in a form suitable for the given structure. Therefore, each entity mention contains an entity location in the text, including:
(1) a paragraph representing the position in the text. 0 is an entity mention in the title, 1 is an entity mention in the abstract, 2 is an entity mention in the first body text field and so on.
(2) a start position representing the position of the entity's first character within the corresponding text (title, abstract, body text element).
(3) an end position representing the position of the entity's last character within the corresponding text.

As an example, an entity location with paragraph 5, start 5 and end 10 means that the entity is mentioned in the fourth body text field, starting at character position 5 and ending at character position 10. The first character has the position 0. An entity mention contains the following components:
(1) an entity location,
(2) an entity string representing the entity's token sequence in the text,
(3) an entity type (Chemical, Disease, Gene or Species), and
(4) an entity id corresponding to the previously described vocabularies.

The computed entity mentions are shared within a JSON file. The JSON file consists of a dictionary, where each CORD19 document id is mapped to a list of entity mentions.
More details can be found in our regularly updated GitHub repository.
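The entity location scheme described in this section can be sketched as follows. This is an illustration only: the class and field names are hypothetical and need not match the keys used in the published JSON files, the entity id is a placeholder, and the end position is treated as inclusive because the description defines it as the position of the last character.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EntityLocation:
    paragraph: int  # 0 = title, 1 = abstract, 2 = first body text, ...
    start: int      # position of the first character (0-based)
    end: int        # position of the last character (assumed inclusive)


@dataclass(frozen=True)
class EntityMention:
    location: EntityLocation
    entity_str: str
    entity_type: str  # Chemical, Disease, Gene or Species
    entity_id: str


def mention_text(paragraphs: List[str], location: EntityLocation) -> str:
    """Look up the mentioned string in [title, abstract, body texts...]."""
    text = paragraphs[location.paragraph]
    return text[location.start:location.end + 1]


# Invented example document and mention.
paragraphs = [
    "Statins in practice",
    "Simvastatin is used in treatment of hypercholesterolemia.",
]
mention = EntityMention(
    location=EntityLocation(paragraph=1, start=0, end=10),
    entity_str="Simvastatin",
    entity_type="Chemical",
    entity_id="PLACEHOLDER-ID",  # hypothetical, not a real vocabulary id
)
resolved = mention_text(paragraphs, mention.location)
```

Comparing `resolved` with `entity_str` is a cheap consistency check when loading the dumps; if the two differ by one trailing character, the end position is likely exclusive rather than inclusive in the actual files.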
In this paper, we discussed the importance and usefulness of entity mentions for retrieval applications. We developed an effective pipeline to automatically annotate biomedical entity mentions in arbitrary texts. Moreover, we built our pipeline on top of the latest available biomedical NER tools to ensure the quality of our entity mentions.

Applying our pipeline to the COVID-19 open research dataset, we published the resulting entity mentions as a semantically enriched dataset for free reuse on GitHub. We will continuously update our GitHub repository whenever new versions of the COVID-19 dataset are published.
REFERENCES
[1] Laura Dietz, Alexander Kotov, and Edgar Meij. 2018. Utilizing Knowledge Graphs for Text-Centric Information Retrieval. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (Ann Arbor, MI, USA) (SIGIR '18).
[3] Journal of the American Medical Informatics Association 14, 2 (03 2007), 212–220.
[4] Robert Leaman and Zhiyong Lu. 2016. TaggerOne: joint named entity recognition and normalization with semi-Markov Models. Bioinformatics 32, 18 (06 2016), 2839–2846. https://doi.org/10.1093/bioinformatics/btw343
[5] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th Int. Conf. on Semantic Systems (Graz, Austria) (I-Semantics '11). Association for Computing Machinery, New York, NY, USA, 1–8.
[6] Francesco Piccinno and Paolo Ferragina. 2014. From TagME to WAT: A New Entity Annotator. In Proceedings of the First Int. Workshop on Entity Recognition & Disambiguation (Gold Coast, Queensland, Australia) (ERD '14). Association for Computing Machinery, New York, NY, USA, 55–62.
[7] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2013. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research 41, W1 (05 2013), W518–W522. https://doi.org/10.1093/nar/gkt441
[8] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2015. GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains.