Extracting Concepts for Precision Oncology from the Biomedical Literature
Nicholas Greenspan,* Yuqi Si, MS, Kirk Roberts, PhD
Department of Computer Science, Rice University, Houston, TX, USA
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
Abstract
This paper describes an initial dataset and automatic natural language processing (NLP) method for extracting concepts related to precision oncology from biomedical research articles. We extract five concept types: CANCER, MUTATION, POPULATION, TREATMENT, and OUTCOME. A corpus of 250 biomedical abstracts was annotated with these concepts following standard double-annotation procedures. We then experiment with BERT-based models for concept extraction. The best-performing model achieved a precision of 63.8%, a recall of 71.9%, and an F1 of 67.6. Finally, we propose additional directions for research to improve extraction performance and to utilize the NLP system in downstream precision oncology applications.
1 Introduction

Precision medicine is a paradigm in which treatment decisions are based not just on a patient's disease status, but on a variety of other factors, including specific genetic, environmental, and other factors. [1] The preeminent use case for precision medicine thus far has been cancer, i.e., precision oncology. Precision oncology is a rapidly-developing field [2], with a growing number of treatments, trials, and genomic markers. Since drugs can be targeted to relatively rare mutations, the number of studied treatments is greatly expanded [3,4], and these can be referred to by a variety of names (e.g., the name used in pre-clinical trials is often different from the final drug name). Since the gene mutations can be relatively rare, clinical trial structures have had to be altered to better fit the precision medicine paradigm. [5]
And, critically, there are thousands of known genetic mutations from hundreds of cancer-related genes. [6]
Sizable effort is thus required to curate all of these types of information to make them available in a usable form to both researchers and clinicians.

Our prior work has focused on this problem from an information retrieval (IR) perspective: how does one find patient-specific information (given a type of cancer, mutation, etc.) in the vast trove of precision medicine-related publications? IR systems were evaluated for this task in the TREC Precision Medicine tracks. [7,8,9]
We also developed PRIMROSE [10], a search engine that implements many of the best aspects of precision oncology search. A consistent weakness in these IR approaches, however, was difficulty dealing with the complex semantics of precision oncology articles: identifying the exact treatments studied in an article, which types of cancer the treatment applies to, etc. This task is more consistent with a natural language processing (NLP) information extraction (IE) approach. Therefore, in this work we report the initial development of an NLP system for extracting five key elements of biomedical articles for precision oncology: the type(s) of cancer studied, the mutations that were targeted, the specific population the study is limited to, the treatment evaluated, and any available outcome information summarized in the abstract. Because of the fast-moving nature of the field, we focus on biomedical abstracts instead of full-text articles. Not only are the abstracts publicly available well before the full text, but many of the latest-breaking developments in precision oncology are presented at talks at major oncology conferences, and only the abstracts for these talks are provided.

To gauge the complexity of this NLP task, we collected a pilot corpus of 250 biomedical abstracts drawn from the TREC Precision Medicine dataset. The five concept types (CANCER, MUTATION, POPULATION, TREATMENT, and OUTCOME) were double-annotated and reconciled. Two models based on BERT [11], and specifically the BioBERT [12] model pre-trained on biomedical text, were evaluated: BioBERT-Base and BioBERT-Large. The difference between these models is the number of parameters, in terms of the number of layers, hidden units, and attention heads.

* This project was undertaken during an undergraduate internship at UTHealth-SBMI.

The remainder of this paper is organized as follows. Section 2 discusses related work in NLP for cancer and precision medicine. Section 3 describes the methods, including the data (Section 3.1), the annotation process (Section 3.2), and the NLP extraction models (Section 3.3). Section 4 presents the extraction results, and Section 5 discusses these results and directions for future work.

2 Related Work

Biomedical Literature NLP for Cancer

Cancer is one of the more frequently studied aspects of NLP for biomedical literature articles. Early works such as MedScan [13] employed rule-based systems to extract and interpret information from MEDLINE abstracts. Chun et al. [14] developed a corpus and extracted relations between prostate cancer and genes from abstracts using a maximum entropy classifier. Baker et al. developed a corpus for identifying the hallmarks of cancer from the biomedical literature and proposed a support vector machine (SVM) model [15] and later a convolutional neural network [16] to automatically classify abstracts. A different take on cancer NLP for the biomedical literature is the development of literature-based discovery (LBD) tools such as LION LBD [17] to identify implicit links within the network of literature articles. LION in particular focuses on the molecular biology of cancer. Beyond the biomedical literature, a tremendous amount of NLP research has been conducted for cancer on other data types. Most notable among these are electronic health records, for which several review articles exist that overview cancer NLP for clinical notes. [18,19,20]
Biomedical Literature NLP for Genomics
A tremendous amount of NLP work has focused on extracting information related to genomics from the literature. Early work includes EDGAR [21], which identified gene-drug relations from biomedical abstracts. Libbus et al. [22] identified genes from MEDLINE abstracts based on the Gene Ontology [23] for the purpose of linking literature-based data to structured knowledge sources. Work in pharmacogenomics has required extensive use of NLP to build resources, such as the use of SemRep [24] or the construction of the pharmacogenomics knowledge base PharmGKB. [25,26,27] In turn, PharmGKB has been utilized as a knowledge base for many further NLP studies. [28,29,30] Similarly, the PGxCorpus [31] is a manually-annotated corpus for pharmacogenomics, similar in many ways to our goal here, though that work is not specific to cancer. Finally, more general biomedical literature NLP has included genomic components, particularly the CRAFT corpus. [32,33]
Biomedical Literature NLP for Precision Oncology
There has indeed been some work specific to precision medicine NLP within the space of the current work. For instance, Deng et al. [34] classify abstracts with an SVM based on whether they focus on cancer penetrance. Bao et al. [35] extend this with a deep learning model. Instead of extracting the particular concepts, however, the focus of these works is simply to classify the entire abstract for use in downstream meta-analyses. Next, Hughes et al. [36] review how to utilize precision oncology NLP specifically for breast cancer. Finally, the TREC Precision Medicine track [7,8,9] is an ongoing information retrieval shared task focusing on identifying articles relevant to precision oncology. It has inspired the creation of many search engines, including our own, [10] for clinical decision support in precision oncology. Of the many search engines to participate in the TREC Precision Medicine track, however, none has successfully integrated biomedical knowledge sources to greatly improve retrieval performance. We believe this is partly because it is difficult to properly link the key aspects of precision oncology in an abstract to these powerful knowledge bases. Instead, most use of biomedical knowledge in such search engines is simply to expand synonyms (e.g., through query expansion), which gives at most small boosts to retrieval performance. Our goal in this paper, then, is to lay the groundwork for improvements in precision oncology search and knowledge acquisition by identifying the key elements of precision oncology in biomedical abstracts. This will allow for the downstream linking of these articles with existing biomedical knowledge bases for better semantic comprehension of the precision oncology scientific landscape.
3 Methods

The high-level study design for this paper follows the standard supervised NLP pipeline: data identification (Section 3.1), manual data annotation (Section 3.2), and automatic NLP extraction (Section 3.3). Since this is a pilot study, our primary goal has been to identify the key barriers to large-scale system development, which are discussed in more detail in the Discussion (Section 5).

3.1 Data
Since the latest developments of precision oncology research are often publicly available only in abstracts, we focus only on abstract-based annotation and extraction. Compared to biomedical research in general, precision oncology is disproportionately less represented in PubMed Central given its funding structure (less open access, more embargoed journal articles) and heavy use of abstract presentations for presenting results, which means many of the latest developments that are so important to capture are not available as full-text articles, but only as abstracts. We focus on a set of abstracts known to be relevant to precision oncology by annotating only abstracts judged relevant in the TREC 2017 Precision Medicine track [7]. A random selection of 250 abstracts was chosen from those judged relevant during the assessment process.
3.2 Annotation

The 250 abstracts were imported into Brat [37] and double-annotated with the following concept types:

1. CANCER. The type of cancer being studied in the article (e.g., "breast cancer", "non-small cell lung cancer", "mantle cell lymphoma", "solid tumor"). If the abstract mentions a type of cancer that is clearly not the cancer investigated in the study, it is additionally labeled as a Non-study cancer. If multiple types of cancer are included in the study, all are annotated.

2. MUTATION. The gene mutation being studied in the article, be it a gene with any mutation (e.g., "KRAS", "FGFR2", "PIK3R1"), a specific variant (e.g., "BRAF V600E", "KRAS G13D", "NF2 K322"), or some other form of genetic mutation (e.g., "CDK4 Amplification", "PTEN Inactivating", "EML4-ALK Fusion transcript"). Similar to cancer type, mutations mentioned in the abstract but not investigated in the study are marked as Non-study mutations.

3. POPULATION. The specific population in the study (e.g., "Hunan Province in China", "never or light smokers", "adults (>18 years)", "European patients", "no history of chemotherapy for metastatic disease"). As shown by the examples, this can include age, sex, location, ethnicity, cancer status, etc. Populations mentioned in the abstract but not investigated in the study are marked as Non-study populations.

4. TREATMENT. The drug used in the study (e.g., "sorafenib", "abemaciclib", "trastuzumab"). If the drug was used as part of a combination, each individual component is annotated separately. If the drug was a comparator but not directly investigated in the study, it is marked as a Non-study treatment (this is more common than Non-study cancers, mutations, and populations).

5. OUTCOME. The result of the study with regard to the success or failure of the treatment. Non-study outcomes are not annotated. The outcomes are generally a sentence or long phrase describing the overall outcome, e.g.:
- "Main grade 3 or 4 toxicities were rash (11 [13%] of 84 patients given erlotinib vs none of 82 patients in the chemotherapy group), neutropenia (none vs 18 [22%]), anaemia (one [1%] vs three [4%]), and increased aminotransferase concentrations (two [2%] vs 0)."
- "Treatment with crizotinib results in clinical benefit rate of 85%-90% and a median progression-free survival of 9-10 months for this molecular subset of patients."
- "Although nearly all patients with GIST treated with imatinib experienced adverse events, most events were mild or moderate in nature."

Additionally, negated concepts were marked as such.

Two annotators (the first author and a biomedically-trained graduate student) labeled each abstract in batches of 25, reconciling after each batch. Instead of using highly-refined guidelines, the goal of this annotation process was more exploratory in nature: the concepts were defined as above, but no further. The goal was to identify the range of possible ways in which the information can be expressed, without too much regard for maximizing inter-rater agreement.

Table 1: Descriptive statistics of the annotated corpus. Number of abstracts: 250; average abstract length: 278.1 tokens; total concept annotations: 4,722, spanning the five concept types (CANCER, MUTATION, POPULATION, TREATMENT, and OUTCOME).

Anecdotally, some concepts had more inconsistent agreement throughout the process (notably POPULATION and OUTCOME), while others had early disagreement that improved over time (such as how to handle acronyms with CANCER and MUTATION). These issues are ultimately reflected in the automatic extraction scores described in Section 4. Descriptive statistics of the annotated corpus are provided in Table 1. Example annotations from the corpus are shown in Figure 1.
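Brat stores each text-bound annotation as a tab-separated "T" line in a standoff .ann file alongside the abstract text. As a minimal sketch of how such concept spans can be read back for processing (the annotation contents below are illustrative, not taken from the corpus):

```python
# Minimal sketch: parse Brat standoff "T" (text-bound) annotation lines into
# (label, start, end, mention) tuples. Discontinuous spans ("start end;start end")
# are simplified to their outer boundaries here.
def parse_brat_ann(ann_text):
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):        # skip events, attributes, notes
            continue
        _, label_offsets, mention = line.split("\t")
        parts = label_offsets.split(" ")
        label = parts[0]
        offsets = " ".join(parts[1:]).replace(";", " ").split()
        start, end = int(offsets[0]), int(offsets[-1])
        spans.append((label, start, end, mention))
    return spans

ann = "T1\tCancer 24 37\tbreast cancer\nT2\tTreatment 60 71\ttrastuzumab"
print(parse_brat_ann(ann))
# [('Cancer', 24, 37, 'breast cancer'), ('Treatment', 60, 71, 'trastuzumab')]
```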
3.3 NLP Extraction

The abstracts were tokenized and split into sentences using spaCy [38]. A BILOU scheme was used for sequence classification, where B is the first token of a sequence, I an inside token, L the last token, O a token outside any sequence, and U a single-token concept. So "K-ras and PTEN mutations" (tokenized as "K", "-", "ras", "and", "PTEN", "mutations") would be [B-MUTATION, I-MUTATION, L-MUTATION, O, U-MUTATION, O]. Non-study concepts were handled by adding an N- before the concept name (e.g., B-N-TREATMENT).

We follow the standard BERT framework for named entity recognition tasks. Two variants of BioBERT [12] were evaluated: BioBERT-Base v1.1 and BioBERT-Large v1.1, which are versions of BERT-Base and BERT-Large, respectively, pre-trained on 1 million PubMed abstracts. (Note that the BioBERT v1.0 models are pre-trained on 200k PubMed abstracts and 200k PubMed Central full-text articles, whereas BioBERT v1.1 is pre-trained only on abstracts, though a larger number of them.) As such, BioBERT is an ideal starting point for a transformer-based language model for our task. BioBERT-Base has 12 layers, 768 hidden units per layer, and 12 attention heads per layer (a total of 110 million parameters); BioBERT-Large has 24 layers, 1024 hidden units per layer, and 16 attention heads per layer (a total of 340 million parameters). Generally, the larger BERT variant offers some improvement in performance, but in many cases the performance delta is negligible and not worth the additional computational cost. We therefore experiment with both models to assess whether a larger BERT model is beneficial for this task.

The data was split 70% for training the BioBERT models, 10% for validation (early stopping), and 20% for testing (results discussed below). The default BioBERT parameters were used, other than the learning rate, a maximum sequence length of 128, a training batch size of 32, a validation batch size of 8, and a test batch size of 8.

Figure 1: Example annotations.

Table 2: Results using the BioBERT-Base model (precision, recall, and F1 per concept type; overall precision 60.48, recall 70.73, F1 65.20).

Table 3: Results using the BioBERT-Large model (precision, recall, and F1 per concept type; overall precision 63.79, recall 71.90, F1 67.61).
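The BILOU encoding described above can be sketched in a few lines of Python; this helper and the pre-tokenized input are illustrative, not part of the released pipeline:

```python
# Sketch: encode token-level concept spans into BILOU tags.
# spans: list of (start_token, end_token_exclusive, concept) tuples.
def bilou_encode(tokens, spans):
    tags = ["O"] * len(tokens)
    for start, end, concept in spans:
        if end - start == 1:
            tags[start] = f"U-{concept}"          # single-token concept
        else:
            tags[start] = f"B-{concept}"          # first token
            tags[end - 1] = f"L-{concept}"        # last token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{concept}"          # inside tokens
    return tags

# The paper's own example: "K - ras and PTEN mutations"
tokens = ["K", "-", "ras", "and", "PTEN", "mutations"]
spans = [(0, 3, "MUTATION"), (4, 5, "MUTATION")]
print(bilou_encode(tokens, spans))
# ['B-MUTATION', 'I-MUTATION', 'L-MUTATION', 'O', 'U-MUTATION', 'O']
```

Non-study concepts would simply use labels like "N-TREATMENT" in place of "TREATMENT", yielding tags such as B-N-TREATMENT.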
4 Results

The results for the BioBERT-Base and BioBERT-Large models are provided in Table 2 and Table 3. Not enough Non-study concepts are present in the test set to merit an evaluation here; we thus focus on boundary extraction and type classification without the Non-study attribute.

In almost every case, the BioBERT-Large results outperform the BioBERT-Base results (the lone exception being MUTATION recall, while neither model successfully extracts any OUTCOME). The differences between BioBERT-Base and BioBERT-Large are often several points, including substantial boosts for both POPULATION (+10.74 F1) and TREATMENT (+9.15 F1). Notably, the improvements from BioBERT-Base to BioBERT-Large are roughly proportional to the number of available annotations for training, with the most common concept type (MUTATION) receiving the smallest boost.

For both models, performance across the different concept types was roughly proportional to the number of annotations available for training. While there were more MUTATION annotations than CANCER annotations, there was a far greater variety of MUTATION mentions than CANCER mentions, which likely explains why CANCER outperforms MUTATION in both models by roughly 10 points of F1. TREATMENT is the next most common concept type, and while for BioBERT-Base it performs 6.73 points of F1 worse than MUTATION, for BioBERT-Large TREATMENT actually outperforms MUTATION by 1.35 points of F1. Meanwhile, for both models POPULATION is the second-worst-performing concept type, while, as mentioned, neither model correctly identifies a single OUTCOME. The latter is almost certainly due to the combination of few annotations (130 in the entire corpus) and the long, complex nature of each concept span (28.5 tokens on average). Clearly, OUTCOME extraction is not an ideal named entity recognition task and should be handled by a different type of extraction (e.g., sentence classification).

Finally, it is interesting that, with the exception of POPULATION for BioBERT-Large, all concepts have higher recall than precision. This requires further investigation, but one possibility is that the BERT models are good at identifying instances very similar to those in the training data, but additionally predict spans with high biomedical similarity that are nonetheless not one of the annotated concepts.
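The precision, recall, and F1 figures above are entity-level scores. A minimal sketch of how such scores can be computed, under the assumption of exact matching on both boundaries and concept type (the example spans are illustrative):

```python
# Sketch: exact-match entity-level precision/recall/F1.
# gold and pred are sets of (start, end, concept) tuples.
def entity_prf(gold, pred):
    tp = len(gold & pred)                      # spans matching boundaries + type
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 3, "MUTATION"), (10, 12, "CANCER"), (20, 21, "TREATMENT")}
pred = {(0, 3, "MUTATION"), (10, 13, "CANCER"), (20, 21, "TREATMENT")}
p, r, f = entity_prf(gold, pred)               # CANCER boundaries differ -> no credit
print(round(p, 2), round(r, 2), round(f, 2))   # 0.67 0.67 0.67
```

Under exact matching, a predicted span that overlaps a gold span but differs in its boundaries (as with CANCER above) counts as both a false positive and a false negative, which is one reason long OUTCOME spans score so poorly.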
5 Discussion
This work is an initial feasibility study on the extraction of key variables for precision oncology from biomedical literature abstracts. We focus on identifying the type of cancer, mutation, population information, treatment, and outcomes. A small corpus of 250 abstracts was manually annotated, then two BioBERT models were evaluated. While none of the five concept types performed up to the level one would hope, CANCER performed reasonably well (F1 of 75.00), while MUTATION and TREATMENT showed promise (F1 of 64.94 and 66.29, respectively). POPULATION performed below a level that is likely usable (F1 of 52.94), while OUTCOME was not successfully extracted at all. Here, we discuss the successes and shortcomings of this feasibility pilot and what should come next to address the key problems.

The most obvious need for improvement is the small size of the dataset. Our point of reference for appropriate dataset sizes is the NCBI Disease Corpus, [39,40] which has 793 abstracts, or roughly three times the size of what is presented here. BioBERT's performance on that corpus is an F1 of 89.71, which we can assume is a rough upper bound for automatic extraction if the corpus were scaled up. We will note, however, that even the CANCER, MUTATION, and TREATMENT concepts themselves are more diverse than what is in the NCBI Disease Corpus, and the lexical variation seen with even these concepts is likely greater (especially TREATMENT, see Figure 1), so this would be an ambitious upper bound. Ultimately, it seems clear that increasing the corpus size would be beneficial.

Regarding the lower-performance concepts, it is likely that POPULATION needs to be refined as a concept, which would allow it to incorporate pre-defined lexicons. In this study we intentionally did not define this concept narrowly, in order to assess the range of populations mentioned in abstracts. Going forward, however, we can focus on the set of populations that are critically important to precision oncology. These usually differ from the normal medical notion of a population: instead of demographics, in precision oncology the cancer and treatment history are the primary populations of interest (e.g., "treatment-naive" in Figure 1 refers to patients who have not yet undergone chemotherapy). Regarding OUTCOME, this is clearly an item that is more appropriately tackled with a sentence classifier than via entity extraction. As can be seen in Figure 1, the OUTCOME sentences have fairly clear features not seen in the other sentences, so it is likely that a sentence classifier could identify these with relatively high efficacy.

The comparison of BioBERT-Base and BioBERT-Large is instructive. At the current size of the corpus, the larger model provides more than sufficient benefit to justify its additional complexity. Perhaps with a larger corpus, the base model would close the gap; in other works (e.g., Ji et al. [41]), the larger model performed no better than the base model. These experiments, then, should be revisited with a larger corpus.

Another logical place for improvement is the use of knowledge resources. In this study, we hoped to assess the performance of BioBERT alone, but future work should incorporate existing knowledge resources such as the NCI Thesaurus [42] for cancer names and COSMIC [43] for gene mutations. Above, we stated that the NCBI Disease Corpus performance is a good estimate of an upper bound, but the one advantage of focusing exclusively on precision oncology is that more detailed knowledge resources can be brought to bear: a more specific domain allows us to make domain-specific assumptions. This could be critical for improving performance, but there is one important note of caution, which also justifies our initial decision to evaluate a resource-free approach. Since precision oncology moves quickly as a field, the lexicon of terms used in papers is oftentimes well ahead of knowledge resources. A new oncogene may be identified months or years before it is incorporated into the appropriate knowledge base. Over-reliance on these knowledge sources may increase the NLP performance on the annotated corpus while simultaneously reducing the model's ability to recognize the very emerging concepts we are most focused on identifying. Thus, these knowledge resources cannot be integrated naively, and care should be taken in this process.

A final avenue for improvement focuses on the machine learning aspects. This includes adjusting the tagging scheme: we used BILOU in this study, but given the variance in concept length (see Table 1), other tagging schemes may be more appropriate. Not every concept type need use the same tagging scheme, either; e.g., the shorter MUTATION concepts may utilize a simpler BIO scheme. Additionally, the only form of transfer learning we experimented with in this paper is the use of the BioBERT model itself, which effectively transfers a language model pre-trained on large amounts of biomedical text. After the language modeling, but prior to fine-tuning the model on this precision oncology corpus, other existing datasets may be utilized for transfer learning, such as the NCBI Disease Corpus [39,40] and PGxCorpus [31]. This would effectively reduce the need to scale up the size of our own manual corpus, though we do not believe that even with transfer learning the current corpus size is sufficient.
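Collapsing a BILOU tag sequence to the simpler BIO scheme is a mechanical relabeling (L becomes I, U becomes B); a small illustrative sketch:

```python
# Sketch: convert BILOU tags to BIO (L -> I, U -> B), leaving O unchanged.
def bilou_to_bio(tags):
    mapping = {"L": "I", "U": "B"}
    out = []
    for tag in tags:
        if tag == "O":
            out.append(tag)
        else:
            prefix, concept = tag.split("-", 1)   # keeps N- prefixes intact
            out.append(f"{mapping.get(prefix, prefix)}-{concept}")
    return out

print(bilou_to_bio(["B-MUTATION", "I-MUTATION", "L-MUTATION", "O", "U-MUTATION"]))
# ['B-MUTATION', 'I-MUTATION', 'I-MUTATION', 'O', 'B-MUTATION']
```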
Limitations
The data evaluated in this study was taken from the TREC Precision Medicine track, [7] and specifically the subset of abstracts marked as relevant to one of the topics. As such, it is certainly not representative of the full array of biomedical literature. This decision was made for annotation convenience: these abstracts were known to be highly relevant to precision oncology. However, the real bias introduced here is the manual nature in which they were chosen. Identifying potentially relevant abstracts to annotate via keywords or machine learning would result in a corpus that is more appropriate, as these methods could be re-applied when using the precision oncology NLP model on new abstracts. A second limitation is that the training of the annotators was intentionally kept minimal so as to encourage exploration of potential concepts. Also, only one of the two annotators was biomedically trained. We have discussed the need for additional manual annotation, but this will also need to come with additional training and more refined guidelines to ensure annotation quality.
6 Conclusion

This work presents a pilot study of NLP information extraction of terms related to precision oncology from biomedical literature abstracts. Five concept types were targeted: CANCER, MUTATION, POPULATION, TREATMENT, and OUTCOME. A small corpus of 250 abstracts was manually annotated and reconciled. Two BioBERT models were evaluated for automatic extraction, with the best results ranging from an F1 of 75.0 (for CANCER) to a complete inability to extract OUTCOME information. We finally discussed a set of opportunities for future work to improve these results, including a larger corpus, the use of existing biomedical knowledge resources, and additional transfer learning.
Acknowledgments
This work was supported by the Patient-Centered Outcomes Research Institute (PCORI) under award ME2018C110963. The underlying TREC Precision Medicine data was supported by the National Institute of Standards and Technology (NIST).
References

[1] Collins FS, Varmus H. A New Initiative on Precision Medicine. New England Journal of Medicine. 2015;372:793-795.
[2] Garraway LA, Verweij J, Ballman KV. Precision Oncology: An Overview. Journal of Clinical Oncology. 2013;31(15).
[3] Hewett M, Oliver DE, Rubin DL, Easton KL, Stuart JM, Altman RB, Klein TE. PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic Acids Research. 2002;30(1):163-165.
[4] Barbarino JM, Whirl-Carrillo M, Altman RB, Klein TE. PharmGKB: A worldwide resource for pharmacogenomic information. WIREs Systems Biology and Medicine. 2018;10(4):e1417.
[5] Fountzilas E, Tsimberidou AM. Overview of precision oncology trials: challenges and opportunities. Expert Review of Clinical Pharmacology. 2018;11(8):797-804.
[6] Chakravarty D, Gao J, Phillips S, Kundra R, Zhang H, Wang J, Rudolph JE, Yaeger R, Soumerai T, Nissan MH, Chang MT, Chandarlapaty S, Traina TA, Paik PK, Ho AL, et al. OncoKB: A Precision Oncology Knowledge Base. JCO Precision Oncology. 2017;1.
[7] Roberts K, Demner-Fushman D, Voorhees EM, Hersh WR, Bedrick S, Lazar A, Pant S. Overview of the TREC 2017 Precision Medicine Track. In: Proceedings of the Twenty-Sixth Text Retrieval Conference; 2017.
[8] Roberts K, Demner-Fushman D, Voorhees EM, Hersh WR, Bedrick S, Lazar A. Overview of the TREC 2018 Precision Medicine Track. In: Proceedings of the Twenty-Seventh Text Retrieval Conference; 2018.
[9] Roberts K, Demner-Fushman D, Voorhees EM, Hersh WR, Bedrick S, Lazar A. Overview of the TREC 2019 Precision Medicine Track. In: Proceedings of the Twenty-Eighth Text Retrieval Conference; 2019.
[10] Shenoi SJ, Ly V, Soni S, Roberts K. Developing a Search Engine for Precision Medicine. In: Proceedings of the AMIA Informatics Summit; 2020. p. 579-588.
[11] Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv. 2018;abs/1810.04805. Available from: http://arxiv.org/abs/1810.04805