[PDF] Building a PubMed knowledge graph

Abstract

PubMed is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguated, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID, and identifying fine-grained affiliation data from MapAffil. Through the integration of the credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving a F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities. The PKG is freely available on Figshare (this https URL, simplified version that exclude PubMed raw data) and TACC website (this http URL, full version).

Full PDF

1 / Building a PubMed knowledge graph

Jian Xu , Sunkyu Kim , Min Song , Minbyul Jeong , Donghyeon Kim , Jaewoo Kang , Justin F. Rousseau , Xin Li , Weijia Xu , Vetle I. Torvik , Yi Bu , Chongyan Chen , Islam Akef Ebeid , Daifeng Li , & Ying Ding School of Information Management, Sun Yat-sen University, Guangzhou, China. Department of Computer Science and Engineering, Korea University, Seoul, South Korea. Department of Library and Information Science, Yonsei University, Seoul, South Korea. Dell Medical School, University of Texas at Austin, Austin, TX, USA. School of Information, University of Texas at Austin, Austin, TX, USA. Texas Advanced Computing Center, Austin, TX, USA. School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, USA. Center for Complex Networks and Systems Research, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA. Corresponding authors: Ying Ding ([email protected]), Daifeng Li ([email protected])

Abstract

PubMed ® is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguated, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID ® , and identifying fine-grained affiliation data from MapAffil. Through the integration of the credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving a F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities. The PKG is freely available on Figshare (https://figshare.com/s/6327a55355fc2c99f3a2, simplified version that exclude PubMed raw data) and TACC website (http://er.tacc.utexas.edu/datasets/ped, full version). Background and Summary

Experts in healthcare and medicine communicate in their own languages, such as SNOMED CT, ICD-10, PubChem, and gene ontology. These languages equate to gibberish for laypeople, but for medical minds, they are an intricate method of transporting important semantics and consensus capable of translating diagnoses, medical procedures, and medications among millions of physicians, nurses, and medical researchers, thousands of hospitals, hundreds of pharmacies, and a multitude of health insurance companies. These languages (e.g., genes, drugs, proteins, species, and mutations) are the backbone of quality healthcare. However, they are / deeply embedded in publications, making literature searches increasingly onerous because conventional text mining tools and algorithms continue to be ineffective. Given that medical domains are deeply divided, locating collaborators across domains is arduous. For instance, if a researcher wants to study ACE2 gene related to COVID-19, he or she would like to know the following: which researchers are currently actively studying ACE2 gene, what are the related genes, diseases, or drugs discussed in these articles related to ACE2 gene, and with whom could the researcher collaborate? This is a strenuous position to be in, and the aforementioned problems diminish the curiosity directed at the topic. Many studies have been devoted to building open-access datasets to solve bio-entity recognition problems. For example, Kai et al. used a conditional random field classifier-based tool to recognize the named entities from PubMed and PubMed Central. Bell et al. performed a large-scale integration of a diverse set of bio-entities and their relationships from both bio-entity datasets and PubMed literature. Although these open-access datasets are predominantly about bio-entity recognition, researchers have also been interested in extracting other types of entities and relationships in PubMed, including the mapping of author affiliations to cities and their geocodes , author name disambiguation (AND), and author background information collections . Although the focus of previous research has been on limited types of entities, the goal of our study was to integrate a comprehensive dataset by capturing bio-entities, disambiguated authors, funding, and fine-grained affiliation information from PubMed literature. Figure 1 illustrates the bio-entity integration framework. This framework consists of two parts: (1) bio-entity extraction, which contains entity extraction, named entity recognition (NER), and multi-type normalization, and (2) integration, which connects authors, ORCID, and funding information. Figure 1. Bio-entity integration framework for PKG The process illustrated in Figure 1 can be described as follows. First, we applied the high-performance deep learning method Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) to extract bio-entities from 29 million PubMed abstracts. Based on the evaluation, this method significantly outperformed the state-of-the-art methods based on the F1 score (by 0.51%, on average). Then, we integrated two existing high-quality author disambiguation datasets: Author-ity and Semantic Scholar . We obtained the disambiguated authors of PubMed articles with full coverage and quality of 98.09% in terms of the F1 score. Next, we integrated additional fields from credible sources into our dataset, which included the projects funded by the National Institutes of Health (NIH) , the affiliation history and educational background of authors from ORCID , and fine-grained region and location information from the MapAffil 2016 dataset . We named this new interlinked dataset “PubMed Knowledge Graph” (PKG). PKG is by far the most comprehensive, up-to-date, high-quality dataset for PubMed regarding bio-entities, articles, scholars, affiliations, and funding information. Being an open dataset, PKG contains rich information ready to be deployed, / facilitating the effortless development of applications such as finding experts, searching bio-entities, analyzing scholarly impacts, and profiling scientist’s careers. Methods

Bio-Entity Extraction

The bio-entity extraction component has two models: (1) an NER model, which recognizes the named entities in PubMed abstracts based on the BioBERT model , and (2) a multi-type normalization model, which assigns unique IDs to recognize biomedical entities. (1) Named Entity Recognition (NER). The NER task recognizes a variety of domain-specific proper nouns in a biomedical corpus and is perceived as one of the most notable biomedical text mining tasks. In contrast to previous studies that have built models based on long short-term memory (LSTM) and conditional random fields (CRFs) , the recently proposed Bidirectional Encoder Representations from Transformers (BERT) model achieves excellent performance for most of the NLP tasks with minimal task-specific architecture modifications. The transformers applied in BERT connect the encoders and decoders through self-attention for greater parallelization and reduced training time. BERT was designed as a general-purpose language representation model that was pre-trained on English Wikipedia and BooksCorpus. Consequently, it is incredibly challenging to maintain high performance when applying BERT to biomedical domain texts that contain a considerable number of domain-specific proper nouns and terms (e.g., BRCA1 gene and Triton X-100 chemical). BERT required refinement, so BioBERT—a neural network-based high-performance NER model—was developed. Its purpose is to recognize the known biomedical entities and discover new biomedical entities. First, in the NER component, the case-sensitive version of BERT is used to initialize BioBERT. Second, PubMed articles and PubMed Central articles are used to pre-train BioBERT’s weights. The pre-trained weights are then fine-tuned for the NER task. While fine-tuning BERT (BioBERT), we use WordPiece tokenization to mitigate the out-of-vocabulary issue. WordPiece embedding is a method of dividing a word into several units (e.g., Immunoglobulin divided into I ) , X (i.e., a sub-token of WordPiece), [CLS] (i.e., the leading token of a sequence for classification), [SEP] (i.e., a sentence delimiter), and PAD (i.e., a padding of each word in a sentence). The NER models were fine-tuned as follows : 𝑝(𝑇 𝑖 ) = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑇 𝑖 𝑊 𝑇 + 𝑏) 𝑘 , 𝑘 = 0,1, … ,6 (1) where k represents the indexes of seven tags {B, I, O, X, [CLS], [SEP], PAD}, p is the probability distribution of assigning each k to token i , and 𝑇 𝑖 ∈ 𝑅 𝐻 is the final hidden representation, which is calculated by BioBERT for each token i . H is the hidden size of 𝑇 𝑖 , 𝑊 ∈ 𝑅 𝐾×𝐻 is a weight matrix between k and 𝑇 𝑖 , K represents the number of tags and is equal to 7, and b is a K-dimensional vector that records the bias on each k . The classification loss L is calculated as follows: 𝐿(𝛩) = − ∑ 𝑙𝑜𝑔(𝑝(𝑦 𝑖 |𝑇 𝑖 ; 𝛩) 𝑁𝑖=1 (2) where 𝛩 represents the trainable parameters, and N is the sequence length. First, a tokenizer is applied to words in a sentence on a dataset with labels in the CoNLL format . The WordPiece algorithm is then applied to the sub-words of each word. Consequently, BioBERT is able to extract diverse types of bio-entities. Furthermore, an entity or two entities with frequently-occurring token interaction will be marked with more than one / entity type span (26.2% for all PubMed abstracts ). Based on the calculated probability distribution, we are able to choose the correct entity type when entities are tagged with more than two types according to the probability-based decision rules . (2) Multi-Type Normalization. Because an entity may be referred to by several synonymous terms (synonyms), and a term can be polysemous if it refers to multiple entity types (polysemy), we require a normalization process for the extracted entities. However, it is a daunting challenge to build a single normalization tool for multiple entity types because there exist various normalization models that depend on the type of entity. We addressed this issue by combining multiple NER normalization models into one multi-type normalization model that assigns IDs to extracted entities. Table 1 illustrates the statistics of the proposed normalization model. Table 1. Multi-type normalization model and dictionaries.

Entity types Normalization models Dictionaries

Gene/Protein GNormPlus Entrez Gene MeSH , OMIM , SNOMED·CT , PolySearch2 , ChEBI , DrugBank , US FDA-approved drugs 518,223 2,571,570 5.0 Species Dictionary lookup NCBI Taxonomy 398,037 3,119,005 7.8 Mutation tmVar 2.0 dbSNP , Clin Var The multi-type normalization model is based on a normalization model per entity type (Table 1). To improve the number of normalized entities, we added the disease names from the PolySearch2 dictionary (76,001 names of 27,658 diseases) to the sieve-based entity linking dictionary (76,237 names of 11,915 diseases). We also added the drug names from DrugBank and the U.S. Food and Drug Administration (FDA) to the tmChem dictionary. Because there are no existing normalization model for species, we normalized species based on dictionary lookup. Using tmVar 2.0, we created a dictionary of mutations with normalized mutation names, in which a mutation with several names was assigned to one normalized name or ID. Author Name Disambiguation (AND)

Despite a rigorous effort to create global author IDs (e.g., ORCID and ResearcherID), most articles in PubMed, particularly those before 2003 (the year in which the field ORCID was added into PubMed), provide limited author information with respect to last name, first initial, and affiliation (only for first authors before 2014). Author information is not effective meta-data to be used directly as a unique identifier because different people may have the same names, and the names and affiliations of an individual can change over time. AND is essential for identifying unique authors. In recent decades, researchers have made several attempts to solve the AND problem, using three types of methods. The first type of method relies on manual matching of articles with authors by surveying scientists or consulting curricula vitae (CVs) gathered from the Internet . Although this type of method ensures high accuracy, a considerable amount of investment in labor is required to collect and code the data, which is impractical for huge datasets. The second type of method uses publicly-accessible registry platforms, such as ORCID or Google Scholar, to help researchers identify their own publications, which produces a source of highly accurate and low-cost accessible disambiguation of authorship for large numbers of authors. However, registries cover only a small proportion of researchers , which introduces a form of survivor bias into samples. The third type of method uses an automated approach to estimate the similarity of author instance feature combinations and identify whether they refer to the / same person. The features for automated AND include author name, author affiliation, article keywords, journal names , coauthor information , and citation patterns . Automated methods typically rely on supervised or unsupervised machine learning, in which the machine learns how to weigh the various features associated with author names and where to assign a pair of author names either to the same author or to two different authors . This type of method can potentially avoid the shortcomings of the previous two types. Moreover, automated methods have been improved to a high level of accuracy after years of development. For PubMed, automated methods are the optimal choice because they can overcome the shortcomings of the other two methods while simultaneously providing high-quality AND results for the entire dataset. Several scholars have disambiguated the authors using automated methods. Although the evaluations of these results have exhibited different levels of accuracy and coverage limitations, we believe that integrating them with due diligence can yield a high-quality AND dataset with full coverage of PubMed articles. According to our investigation, a high-quality PubMed AND dataset with complete coverage can be obtained through the integration of the following two existing AND datasets: (1) Author-ity: The Author-ity database uses diverse information about authors and publications to determine whether two or more instances of the same name (or of highly similar names) on different papers represent the same person. According to the AND evaluation based on the method discussed in the section Technical Validation , the F1 score of Author-ity is 98.16%, which is the highest accuracy result that we have observed. However, this dataset only covers authors before 2009. (2) Semantic Scholar: It trains a binary classifier to merge a pair of author names and use the pair to create author clusters incrementally. According to the AND evaluation based on the method discussed in the section

Technical Validation , the F1 score of Semantic Scholar is 96.94%, which is 1.22% lower than that of Author-ity. However, it has the most comprehensive coverage of authors. Because the Author-ity dataset has a higher F1 score than the Semantic Scholar dataset, we selected the author’s unique ID of the Author-ity dataset as the primary AND_ID. AND_ID is limited by time range (containing PubMed papers before 2009); however, we supplemented authors after 2009 using the AND result from Semantic Scholar. The following steps were applied: Step 1: Allocate the author’s unique ID to each author instance according to the Author-ity AND results such that authors from the Author-ity dataset (before 2010) have unique author IDs. Step 2: For authors that have the same Semantic Scholar AND_ID but never appear in the Author-ity dataset, we generated a new AND_ID to label them. For example, author “Pietranico R.” published two papers in 2012 and 2013 and had two corresponding author instances. Because all papers that “Pietranico R.” published were after 2009, they were not covered by Author-ity and therefore had no AND_ID allocated by Author-ity. However, the authors disambiguated correctly by Semantic Scholar were allocated unique AND_IDs in Semantic Scholar. To maintain the consistency in labeling, we generated a new AND_ID continuing AND-IDs of Author-ity to label these two author instances as disambiguated by Semantic Scholar. Step 3: For author instances with a unique AND_ID in Semantic Scholar and in which authors (at least one) had the same Author-ity AND_ID, we allocated the Author-ity AND_ID to all author instances as their unique ID. For example, “Maneksha S.” published three papers in 2007, 2009, and 2010, and the first two author instances had unique Author-ity AND_ID. However, the last one had no Author-ity AND_ID because it was beyond the time coverage of the Author-ity dataset. Nevertheless, based on the AND results of Semantic Scholar, the three author instances had an identical AND_ID. Therefore, the last author instance with no Author-ity AND_ID could be labeled with the same ID as the other two author instances. / Extended Multi-source Information Integration

In addition to bio-entity extraction by BioBERT and AND, we made a considerable effort to integrate PubMed by extending multi-source data into PKG, which exploited the mapping connections between AND_ID and the PubMed identifier (PMID) to build relationships between different objects to provide a comprehensive overview of the PubMed dataset. These integrated data includes the funding data from NIH ExPORTER, the affiliation history and educational background of authors from ORCID, and the fine-grained region and location information from the MapAffil 2016 dataset. The entities and their associated relationships are depicted in Figure 2.

Figure 2. Entities and relationships in PKG (1) Project Data from NIH ExPORTER

NIH ExPORTER provides data files that contains research projects funded by major funding agencies such as the Centers for Disease Control and Prevention (CDC), the NIH, the Agency for Healthcare Research and Quality (AHRQ), the Health Resources and Services Administration (HRSA), the Substance Abuse and Mental Health Services Administration (SAMHSA), and the U.S. Department of Veterans Affairs (VA). Furthermore, it provides publications and patents citing support from these projects. It consists of 49 data fields, including the amount of funding for each fiscal year, organization information of the PIs, and the details of the projects. According to our investigation, NIH-funded research accounts for 80.7% of all grants recorded in PubMed. The NIH ExPORTER dataset contains a unique PI_ID for each scholar who received NIH funding between 1985 and 2018, and his or her PMIDs of the published articles. Through the mapping of PMIDs in NIH ExPORTER to PMIDs in PubMed, 1:N connections between the PI and the article have been established, paving the way for investigating the article details of a specific PI, and vice versa. Furthermore, by mapping PI names (last name, first initial, and affiliation) to author names that were listed in articles supported by the PI’s projects, a 1:1 connection between the PI and the AND_ID was established, providing a way to obtain PI-related article information, regardless of whether the article was labeled with a project ID. / (2) Employment History and Educational Background Data from ORCID According to its website, “ORCID is a nonprofit organization helping to create a world in which all who participate in research, scholarship, and innovation are uniquely identified and connected to their contributions and affiliations across disciplines, borders, and time” . It maintains a registry platform for researchers to actively participate in identifying their own publications, information about formal employment relationships with organizations, and educational backgrounds. ORCID provides an open-access dataset called ORCID Public Dataset 2018 , which contains a snapshot of all public data in the ORCID Registry associated with an ORCID record that was created or claimed by an individual as of October 1, 2018. The dataset includes 7,132,113 ORCID iDs, of which 1,963,375 have educational affiliations and 1,913,610 have employment affiliations. As a result of the proliferation of ORCID identifiers, PubMed has used ORCID identifiers as alternative author identifiers since 2013 . Using the aforementioned two steps, we could map ORCID records to the PubMed authors. Our first step was to map the author instances in PubMed to an ORCID record based on the feature combinations of article DOI and author name (last name and first initial). Because the DOI is not a compulsory field for PubMed, we appended the feature combinations of article titles, journals, and author names to map the records between the two datasets. The result contained many 1:1 connections between a disambiguated author of PubMed and an ORCID record. Furthermore, 1:1 connections between AND_ID and ORCID iD, and 1:N connections between AND_ID and background information (education and employment) were established. (3) Fine-Grained Affiliation Data The MapAffil 2016 dataset resolves PubMed authors’ affiliation strings to cities and associated geocodes worldwide. This dataset was constructed based on a snapshot of PubMed (which included the Medline and PubMed-not-Medline records) acquired in the first week of October 2016. Affiliations were linked to a specific author on a specific article. Prior to 2014, PubMed only recorded the affiliation of the first author. However, MapAffil 2016 covered some PubMed records that lacked affiliations and were harvested elsewhere, such as from PMC, NIH grants, the Microsoft Academic Graph, and the Astrophysics Data System. All affiliation strings were processed using MapAffil to identify and disambiguate the most specific place names. The dataset provides the following fields: PMID, author order, last name, first name, year of publication, affiliation type, city, state, country, journal, latitude, longitude, and Federal Information Processing Standards (FIPs) code. The MapAffil 2016 dataset does have a limitation because it does not cover the PubMed data after 2015 (covering 62.9% affiliation instances in PubMed). Consequently, we performed an additional step to improve the fraction of coverage. We collected authors (who published their first article before 2016 and continued publishing articles after 2015) by their AND_IDs. The new affiliation instances of the author after 2015 succeeded their corresponding fine-grained affiliation data from the affiliation instances before 2016 (fraction of affiliation instance coverage increased to 84.2%) if the author did not change affiliation. We also applied an up-to-date open-source library Affiliation Parser to extract additional fine-grained affiliation fields from all affiliation instances, including department, institution, email, ZIP code, location, and country. Table 2 summarizes the date coverage and version information of integrated datasets and open-access software used to extract data. Table 2. Date coverage and version information of data sources Data Source Start Year End Year Version Information

PubMed 2019 baseline files / files. It also includes AND results of 93,228 papers published after 2008, and majority of them are preprints. Semantic Scholar dataset NIH ExPORTER dataset Data Records

We built PKG with bio-entities extracted from PubMed abstracts, AND results of PubMed authors, and the integrated multi-source information. This dataset is freely available on Figshare . It contains seven comma-separated value (CSV) files named “Author_List,” “Bio_entities_Main,” “Bio_entities_Mutation,” “Affiliations,” “Researcher_Employment,” “Researcher_Education,” and “NIH_Projects”. The details are presented in Table 3. PubMed raw data are not included into Figshare file set because the amount of PubMed raw data is too large and they are not generated or altered by our methods. PubMed raw data can be freely downloaded from PubMed website . We also provide download link (http://er.tacc.utexas.edu/datasets/ped), which contains both the PubMed raw data and PKG dataset to facilitate the application of PKG dataset. Table 3. Dataset details. File

Author_List 114,345,178 28,510,300 14,830,461 CSV file containing PubMed authors and AND_IDs. Bio-entities_Main 330,394,494 18,361,409 - CSV file containing all types of extracted bio-entities by BioBERT. Bio-entities_Mutation 1,388,341 312,099 - CSV file containing additional items of mutations from Bio-entities_Main file. Affiliations 46,065,099 19,601,383 8,300,984 CSV file containing affiliations and their extracted fine-grained items. Researcher_Employment 532,356 - 276,483 CSV file containing employment history from ORCID. Researcher_Education 512,267 - 268,610 CSV file containing educational background from ORCID. NIH_Porjects 12,340,431 1,790,949 102,070 CSV file containing projects from NIH ExPORTER and mapping relation between PI_ID, PMID, and AND_ID. Note: In file Author_List, about 1.3 million (1.15%) author instances cannot be disambiguated because they do not exist in Author-ity or Semantic Scholar dataset. Therefore, their AND_ID field values were set to zero.

The statistics of all five types of extracted entities are presented in Table 4. Table 4. Statistics of extracted entities

Species Disease Gene / Protein Drug/Chemical Mutation Total number of extracted entities

Distinct PMIDs for each type

Distinct entities for each type / Each data field is self-explanatory by its name, and fields with the same name in other tables follow the same data format that can be linked across tables. Tables 5–11 illustrate the field name, format, and short description of fields for each data file listed in Table 3.

Table 5. Data type for records of Author_List.