PharmKE: Knowledge Extraction Platform for Pharmaceutical Texts using Transfer Learning
Nasi Jofche, Kostadin Mishev, Riste Stojanov, Milos Jovanovik, Dimitar Trajanov
Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, N. Macedonia
(name.surname)@finki.ukim.mk

March 1, 2021

ABSTRACT
Background and Objectives: The challenge of recognizing named entities in a given text has been a very dynamic field in recent years. This is due to the advances in neural network architectures, the increase of computing power, and the availability of diverse labeled datasets, which deliver pre-trained, highly accurate models. These tasks are generally focused on tagging common entities, such as Person, Organization, Date, Location, etc.; however, many domain-specific use cases exist which require tagging custom entities that are not part of the pre-trained models. This can be solved either by fine-tuning the pre-trained models or by training custom models. The main challenge lies in obtaining reliable labeled training and test datasets, since manual labeling would be a highly tedious task.

Methods: In this paper we present PharmKE, a text analysis platform focused on the pharmaceutical domain, which applies deep learning through several stages for thorough semantic analysis of pharmaceutical articles. It performs text classification using state-of-the-art transfer learning models, and thoroughly integrates the results obtained through a proposed methodology. The methodology is used to create accurately labeled training and test datasets, which are then used to train models for custom entity labeling tasks, centered on the pharmaceutical domain. This methodology is applied in the process of detecting Pharmaceutical Organizations and Drugs in texts from the pharmaceutical domain, by training models for the well-known text processing libraries spaCy and AllenNLP.

Results: The obtained results are compared to the fine-tuned BERT and BioBERT models trained on the same dataset. Additionally, the PharmKE platform integrates the results obtained from named entity recognition tasks to resolve co-references of entities and analyze the semantic relations in every sentence, thus setting up a baseline for additional text analysis tasks, such as question answering and fact extraction. The recognized entities are also used to expand the knowledge graph generated by DBpedia Spotlight for a given pharmaceutical text.

Conclusions: PharmKE is a modular platform which incorporates state-of-the-art models for text categorization, pharmaceutical domain named entity recognition, co-reference resolution, semantic role labeling and knowledge extraction. The platform visualizes the results from the models, enabling pharmaceutical domain experts to better recognize the extracted knowledge from the input texts.

Keywords: Knowledge extraction · Natural language processing · Named entity recognition · Knowledge Graphs · Drugs
1 Introduction

We are currently facing a situation where huge amounts of data are being generated continuously, in all aspects of our lives. The main sources of this data are online social media platforms and news portals. Given their volume, it is generally hard for an individual to keep track of all the information stored within the data. Historically, whenever people do not have the capacity to finish a given task, they tend to invent tools that might help them. In this case, we want to use natural language processing (NLP) tools to perform intelligent knowledge extraction (KE) and use it to filter and receive only news which are of interest to us.

In this paper, we are particularly interested in extracting named entities from the pharmaceutical domain, namely entities which represent Pharmaceutical Organizations and Drugs. This NLP task is referred to as named entity recognition (NER) [1, 2]. It aims to detect the entities of a given type in a text corpus. NER takes a central place in many NLP systems, as a baseline task for information extraction, question answering, and more.

Our interest in this topic stems from a problem we are facing in our LinkedDrugs dataset [3], where the collected drug products can have active ingredients (Drug entities) and manufacturers (Pharmaceutical Organization entities) written in a variety of ways, depending on the data source, country of registration, language, etc. Our initial work showed promising results [4], and we want to build on it. The ambiguity in entity naming in our drug products dataset makes the process of data analysis imprecise, so using NER to normalize these name values for the active ingredients and manufacturers can significantly improve both the quality of the dataset and the results of any analytical task on top of it.

The recent advances in neural network architectures improved NER accuracy, mainly by leveraging bidirectional long short-term memory (LSTM) networks [5, 6], convolutional networks [7], and lately, transformer architectures [8]. Many language processing libraries have been made available to the public throughout the years [9], from both academia and industry, equipped with highly accurate pre-trained models for the extraction of common entity classes, such as Person, Date, Location, Organization, etc. However, as a given business might require detection of more specific entities in text, these models should either be fine-tuned, or trained anew with corresponding datasets for the desired entity types.

The main challenge resides in obtaining a large amount of labeled training data, which is required to train a highly accurate model. Even though multiple manually labeled, highly accurate and generic datasets exist on the Web [10], their usage might not be feasible for the task at hand. Relevant data might either be unavailable on the Internet, or not feasible to label manually.

As a solution to this problem, we propose a methodology that can be used to automatically create labeled datasets for custom entity types, showcased on texts from the pharmaceutical domain. In our case, this methodology is applied by tagging Pharmaceutical Organizations in pharmacy-related news. We show that it can be extended to tagging other custom entities in different texts in the pharmaceutical domain by tagging Drug entities as well, and assessing the obtained results. The main focus is the automatic application of common language processing tasks, such as tokenization, dealing with punctuation and stop words, and lemmatization, as well as the possibility to apply custom, business-case-related text processing functions, like joining consecutive tokens to tag a multi-token entity, or performing text similarity computations.

The overall applicability and accuracy of this methodology is assessed by using two well-known language processing libraries, spaCy [11] and AllenNLP [12], which come with a pre-trained model based on convolutional layers with residual connections and a pre-trained model based on ELMo embeddings [13], respectively. The custom trained models which are able to tag the custom entity Pharmaceutical Organization show high tagging accuracy when compared to the initial pre-trained models' accuracy while tagging the more generic Organization entity over the same testing dataset. In addition, a model trained on the same dataset by fine-tuning the state-of-the-art BERT is used for gaining better insight into the results. Lastly, a fine-tuned BioBERT [14], a model based on the BERT architecture and pre-trained on biomedical text corpora, is also used to better assess the results. A thorough explanation of the methodology used to generate the labeled datasets is given in the following sections, followed by custom model training and accuracy assessment.

The extracted entities can help us filter the documents and news which mention them, but in the current era of data overflow, this is not enough. Therefore, we go one step further and integrate these results in a platform which then extracts and visualizes the knowledge related to these entities. This platform currently integrates state-of-the-art NLP models for co-reference resolution [13] and semantic role labeling [15] in order to extract the context in which the entities of interest appear. The platform additionally offers convenient visualization of the obtained findings, which brings the relevant concepts closer to the people who use the platform.

This knowledge extraction process is then finalized by generating a Knowledge Graph (KG) using the Resource Description Framework (RDF) [16] - a graph-oriented knowledge representation of the entities and their relations. This provides two main advantages: the RDF graph data model allows seamless integration of the results from multiple knowledge extraction processes of various news sources within the platform, and at the same time links the extracted
entities to their counterparts within DBpedia [17] and the rest of the Linked Data on the Web [18]. This provides the users of the platform with uniform access to the entire knowledge extracted within the platform, and the relevant linked knowledge already present in the publicly available knowledge graphs.
2 Related Work

Named entity recognition (NER), as a key component in NLP systems for annotating entities with their corresponding classes, enriches the semantic context of the words by adding hierarchical identification. Currently, there is a lot of new work being done in this field, especially in the process of optimizing neural networks for label sequencing, which outperform early NER systems based on domain dictionaries, lexicons, orthographic feature extraction and semantic rules. Starting with [19], neural network NER systems with minimal feature engineering have become popular, due to the performance they achieve. They do so by introducing unified, task-independent neural sequence labeling models, using convolutional neural networks (CNN) and n-dimensional representations of words.

Character-level models treat text as distributions over characters, and they are able to generate embeddings for any string of characters within any textual context. With this, they improve the generalization of the model on both frequent and unseen words, which makes them popular in the biomedical domain. A model based on stacked bidirectional long short-term memory (LSTM) networks is introduced in [20]. This model takes characters as input and outputs tag probabilities for each character, achieving state-of-the-art NER performance in seven languages without using additional lexicons and hand-engineered features. In [21], the authors present a language model composed of a CNN and an LSTM, where they use characters as input to form a word representation for each token in the sentence, thus outperforming word/morpheme-level LSTM baselines.

In [22], the authors propose a Biomedical Named Entity Recognition (Bio-NER) method based on a deep neural network architecture, which leverages word representations pre-trained on unlabeled data collected from the PubMed database with a skip-gram language model. In [23], the authors utilize word embedding techniques to capture the semantics of the words in the sentence and build a generic model based on a long short-term memory network-conditional random field (LSTM-CRF), which outperforms state-of-the-art entity-specific NER tools.

Starting from 2018, Sequence-to-Sequence (Seq2Seq) architectures which work with text became a popular topic in NLP, due to their powerful ability to transform a given sequence of elements into another sequence - a concept which fits well in machine translation. Transformers are models which implement the Seq2Seq architecture by using an encoder-decoder structure. One of the latest milestones in this development is the release of Google's BERT [8], which is based on a transformer architecture and integrates an attention mechanism [24]. It produces outstanding results on many NLP tasks, including NER, due to its ability to learn contextual relations between words (or sub-words) in a text, making it applicable in the biomedical and pharmaceutical domains. Hakala and Pyysalo [25] present an approach based on Conditional Random Fields (CRF) and multilingual BERT for biomedical named entity recognition on content in Spanish. In [26], the authors explore feature-based and fine-tuning training strategies for the BERT model for NER in Portuguese. Lamurias and Couto [27] present an approach based on a transformer architecture for question answering in the biomedical domain. BioBERT [28] is a domain-specific language representation, pre-trained on large-scale biomedical corpora. It is pre-trained on large general domain corpora (English books, Wikipedia, etc.) and on biomedical domain corpora (PubMed abstracts, PMC full-text articles), using the BERT architecture. This language model provides improved results in various biomedical text mining tasks, including NER.

Transfer learning, as a machine learning method, provides the concept of re-usability in neural networks, where a model developed for one task can be reused as the starting point of the training process for another problem that has a significantly smaller training set. In recent years, transfer learning has been one of the most popular approaches in computer vision and NLP tasks, since it outperforms the state-of-the-art models in many use cases, and does so by using smaller training sets for fine-tuning and far fewer computational resources.

Transfer learning has enabled an increase of the F1 score for co-reference resolution tasks over the past few years, allowing it to reach a satisfying average of 73%. This task is focused on clustering mentions within a text that refer to the same underlying real-world entities. Different approaches use biLSTM and attention mechanisms to compute span representations and then find co-reference chains through a softmax mention ranking model [29]. Adding ELMo and coarse-to-fine & second-order inference to this approach has resulted in a significant improvement of the F1 score, achieving the above-mentioned average of 73%. This task is evaluated with the OntoNotes co-reference annotations from the CoNLL-2012 shared task [30], which involved predicting co-reference in English, Chinese, and Arabic, using the final version (5.0) of the OntoNotes corpus. It provides an accurate and integrated annotation of multiple levels of the shallow semantic structure in text in multiple languages.
On the other hand, applying transfer learning to the task of semantic role labeling shows that a simple BERT-based model can achieve state-of-the-art performance compared to the previous state-of-the-art neural models that incorporated lexical and syntactic features, such as part-of-speech tags and dependency trees [15]. The reason lies in the fact that semantic role labeling can be decomposed into four tasks: predicate detection, predicate sense disambiguation, argument identification, and argument classification, where the predicate disambiguation task is focused on identifying the correct meaning of a predicate in a given context - allowing it to be formulated as a sequence labeling task, where BERT really shines.

There are multiple ways to construct an RDF-based Knowledge Graph (KG), which generally depend on the source data. In our case, we work with extracted and labeled data, so we can utilize existing solutions which recognize and match the entities in our data with their corresponding versions in other publicly available KGs. One such tool is DBpedia Spotlight, an open source solution for automatic annotation of DBpedia entities in natural language text [31]. It provides phrase spotting and disambiguation, i.e. entity linking, for the provided input. Its disambiguation algorithm is based upon cosine similarities and a modification of TF-IDF weights. The main phrase spotting algorithm is exact string matching, which uses the Aho-Corasick implementation from LingPipe (http://alias-i.com/lingpipe).

There are many platforms, like AllenNLP [12] and spaCy [11], which aim to provide demo pages for NLP model testing, and code snippets for easier usage by machine learning experts.
On the other hand, projects like Hugging Face's Transformers [32] and DeepPavlov [33] are libraries that significantly speed up prototyping and simplify the creation of new solutions based on the existing NLP models. However, to the best of our knowledge, there is no complete solution for knowledge extraction in the pharmaceutical domain that is human-centric and enables visualisation of the results in a human-understandable format. In this paper, we present a platform which tries to fill this gap.
3 The PharmKE Platform

This section describes our PharmKE platform [34, 35], which goes a step further in understanding pharmaceutical texts: on top of identifying Drugs and Pharmaceutical Organizations, it also extracts relations in the mentioned context and constructs a Knowledge Graph from them. The platform covers the entire process of understanding a document and its content - from its classification and filtering, i.e. whether it belongs to the pharmaceutical domain, all the way to visualization of the entities and their semantic relations, as shown in Fig. 1. Each of the steps is described in more detail within this section.

The PharmKE platform can be formally represented with the functional expression (1), which shows that the platform is designed to combine the best of the available models in each of the steps, while also enabling us to fine-tune some of the models, as is the case with the fineTunedPharmaNER model, which is explained in more detail in Section 4.
At the beginning, the platform classifies whether a given text is from the pharmaceutical domain, and only the positively classified texts are accepted for further analysis. The classification model used in this step is a transferred BERT model, fine-tuned with a corpus of 5,000 documents from the pharmaceutical domain as positive samples, and general news documents as negative samples. 70% of these documents are used for fine-tuning the BERT and XLNet models, and their precision, recall and F1 measure are evaluated with the remaining 30% of the documents. Table 1 shows the results obtained by the fine-tuned models.

Figure 1: Platform workflow, available via the public instance of the platform [34].

Table 1: Pharmaceutical text classification.

Model   Precision   Recall   F1
BERT    0.9633      0.9528   0.9580
XLNet   0.9983      0.9871   0.9926
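For reference, the F1 measure reported in Table 1 is the harmonic mean of precision and recall; a quick check against the table's values (XLNet's reported 0.9926 agrees with the published precision and recall up to rounding):

```python
def f1(precision, recall):
    """F1 = harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values from Table 1
bert_f1 = f1(0.9633, 0.9528)   # → 0.9580 after rounding to 4 decimals
xlnet_f1 = f1(0.9983, 0.9871)  # ≈ 0.9927, i.e. Table 1's 0.9926 up to rounding
```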
Each correctly classified pharmaceutical text is further analyzed by recognizing combined entities through the proposed models, as well as by using BioBERT for the detection of BC5CDR (https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/) and BioNLP13CG (https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data) tags [36], which include Disease, Chemical, Cell, Organ, Organism, Gene, etc. Additionally, we use a fine-tuned BioBERT model in order to detect Pharmaceutical Organizations and Drugs, entity classes that are not covered by the standard NER tasks. We explain the fine-tuning process in more detail in Section 4. Tag collisions when combining the results from both models are avoided by giving precedence to the tags recognized by our fine-tuned model over the tags recognized by BioBERT's model (e.g. Simple Chemical). All of the recognized entities are visualized in the sentence, along with their respective tags.
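The tag-precedence rule can be sketched as follows: spans from the fine-tuned model are kept as-is, and BioBERT spans are added only when they do not overlap one of them (a simplified sketch; the span format and labels are illustrative, not the platform's internal representation):

```python
def overlaps(a, b):
    """True if the (start, end, label) character spans a and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

def merge_with_precedence(custom_spans, biobert_spans):
    """Keep every span from the fine-tuned model; add BioBERT spans
    only where they do not collide with a custom span."""
    merged = list(custom_spans)
    for span in biobert_spans:
        if not any(overlaps(span, kept) for kept in custom_spans):
            merged.append(span)
    return sorted(merged)

# The fine-tuned model tags a Drug; BioBERT tags the same span as
# Simple_chemical (dropped) and a Disease elsewhere (kept).
custom = [(0, 6, "DRUG")]
biobert = [(0, 6, "Simple_chemical"), (16, 25, "Disease")]
print(merge_with_precedence(custom, biobert))
# → [(0, 6, 'DRUG'), (16, 25, 'Disease')]
```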
The recognized entities serve as a baseline for finding all of their mentions in the entire text, by applying co-reference resolution in the background and replacing each mention ("it", "it's", "his", etc.) with its respective entity. Libraries such as AllenNLP, StanfordNLP [37] and NeuralCoref (https://github.com/huggingface/neuralcoref) provide implementations of algorithms for co-reference resolution, focused on the CoNLL-2012 shared task [30]. Our platform utilizes the NeuralCoref library for co-reference resolution due to its high accuracy, its ease of integration compared to StanfordNLP, and its capability to take into account user-specific information and the speakers in a conversation.

Once the mentions in the text are replaced with their respective entities, the final task is labeling the semantic roles in each sentence. This is performed by using the BERT-based algorithm for semantic role labeling [15]. Then, the concrete arguments, like subject and object, as well as modifier arguments like temporal, location, instrument, etc., are visualized in a sequential manner for quick understanding.

The result is a modular platform for pharmaceutical text analysis, which uses existing state-of-the-art models for entity recognition, as well as fine-tuned models for recognizing custom entities like Pharmaceutical Organization and Drug. The modular design of the platform enables a combination of results from multiple models which recognize a vast range of entities. It also allows for semantic role labeling and visualization for each entity and their respective mentions in the text, by using state-of-the-art algorithms implemented by popular libraries. The entire analysis can be exported in a JSON format, allowing it to be used for additional processing such as question answering, text summarization, fact extraction, etc.

As a final step, we annotate the entire text using the state-of-the-art knowledge extraction system DBpedia Spotlight [38]. The obtained results are then enriched with additional RDF facts which we construct from the identified Pharmaceutical Organization and Drug entities. This enriched knowledge graph is then available for further use within or outside the platform.
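The mention-replacement step can be illustrated independently of any particular co-reference library. Given the clusters a resolver returns as character spans (the data layout below is illustrative, not NeuralCoref's actual API), each non-head mention is replaced with the text of its cluster head:

```python
def resolve_mentions(text, clusters):
    """clusters: list of mention lists, each mention a (start, end) character
    span; the first mention in a cluster is the representative entity."""
    replacements = []
    for mentions in clusters:
        head = text[mentions[0][0]:mentions[0][1]]
        for start, end in mentions[1:]:
            replacements.append((start, end, head))
    # Replace from the end of the text so earlier offsets stay valid.
    for start, end, head in sorted(replacements, reverse=True):
        text = text[:start] + head + text[end:]
    return text

sentence = "Sanofi said it will expand."
print(resolve_mentions(sentence, [[(0, 6), (12, 14)]]))
# → Sanofi said Sanofi will expand.
```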
4 Methodology

Our methodology starts with a text corpus from the pharmaceutical domain and a closed set of entities that belong to a given class. In our case, we use entities that denote Pharmaceutical Organizations and Drugs. Using only these two prerequisites, we show that we can train models that can extract even unseen entities from the class of interest. Figure 2 visualizes the whole process.

Figure 2: Named entity recognition pipeline.

First, we start with the text corpus from the pharmaceutical domain that potentially contains the entities from the class of interest. This corpus consists of news collected from the following pharmacy-related websites: FiercePharma, Pharmacist and Pharmaceutical Journal. Next, we tokenize the text to extract the words, and then we try to annotate each word with respect to the set of entities of the required type. We utilize cosine similarity and the Levenshtein distance in particular [39], where we check whether a word is similar to some of the entities. The annotation process assigns start and end positions for each token in the text. Once we are done with this phase, we have initialized a labeled dataset, denoted as MD.

One of the main challenges is that the Pharmaceutical Organization entity type can be found in a given text as a multi-word phrase, such as Sanofi Pharmaceuticals Ltd. Spain, or as a single word: Sanofi. Additionally, the name of the Pharmaceutical Organization can contain pharmacy-related keywords, such as Pharmaceuticals, Pharma, Medical, Biotech, etc., which are not part of the core name of the organization, and can either be found along with it in the sentence, or not at all. This means that we should not classify the countries, legal entities, and the pharmacy-related words as parts of the Pharmaceutical Organization type. Therefore, the annotation process sequentially performs use-case-specific token filtering during the creation of the MD dataset.

This is done by using a non-entity list which contains all tokens that should be ignored. In our case, this list contains all countries in the world, together with the legal entity types for companies ("Ltd", "Inc", "GmbH", "Corp", etc.) and pharmacy-related words. After filtering out the tokens from the non-entity list, only Sanofi will remain in our example, and we can be certain that the core name is thoroughly extracted. After matching the core name in the text, we use the same lists to detect neighbouring tokens of a multi-token name, if any, as part of the organization name, using text similarity metrics.

After the application of the custom, use-case-related filtering, the MD dataset consists of the core entities that have high text similarity. Only the entities which have similarity above the customized threshold are labeled as members of the target class. In our experiments, we use a similarity threshold of 0.9. Some Pharmaceutical Organization entities consist of multiple, consecutive tokens, such as
J & J. We solve this by concatenating consecutive relevant tokens, using a custom function applied on the MD. After applying all custom text processing functions, the state of the MD is as shown in Table 2.

Table 2: State of MD after the application of the custom text processing functions.

Token             Range       Entity
Sanofi            0:14        PH_ORG
GlaxoSmithKline   258:272     PH_ORG
Regeneron         3436:3440   PH_ORG
Regeneron         3649:3654   PH_ORG
Gilead            3660:3668   PH_ORG
Sanofi            3699:3704   PH_ORG
J & J             3801:3806   PH_ORG

The MD dataset is then used to train a model which will be able to extract the named entities from the given class. Since NER models take into consideration the context in which the entities appear in a sentence, the training dataset is not required to contain a huge number of diverse entities. Here we improve the general-knowledge language model for the more specific task, using small or moderate amounts of labeled data.

In our case, we fine-tune spaCy, AllenNLP, BERT and BioBERT models. However, each of these models requires a different data format. spaCy requires an array of sentences with the tagged entities for each sentence and their start and end positions. AllenNLP requires a dataset in BIOUL or BIO notation, which differentiates the following token annotations:

• multi-word entity beginning token: (B),
• multi-word entity inside token: (I),
• multi-word entity ending token: (L),
• single-token entity: (U),
• non-entity token: (O).

The dataset adapted for BERT and BioBERT labels the entities with I-PH_ORG, regardless of the number of tokens, while all other tokens are marked with O. Therefore, we use different dataset serializers to output the training and test datasets for the fine-tuning process, in the required format.

The same methodology is used for creating labeled datasets for the Drug entity type. In this case we use the same text corpus, but this time annotated with a considerably larger set of Drug entities. Once we are done with the fine-tuning process, we have named entity recognition models able to extract the entities of a given type.
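The core of the labeling step described above - Levenshtein-based matching against the entity set, non-entity filtering, the 0.9 similarity threshold, and BIOUL serialization - can be sketched as follows. This is a simplified, single-label sketch: the tiny non-entity list, the helper names, and the example sentence are illustrative, not the platform's actual lists or code.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Normalized similarity in [0, 1] derived from the Levenshtein distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Tokens never labeled as part of a core organization name (countries,
# legal forms, pharmacy-related keywords) - a tiny excerpt for the example.
NON_ENTITY = {"ltd", "inc", "gmbh", "corp", "spain", "pharmaceuticals", "pharma"}

def bioul_tags(tokens, entity_set, label="PH_ORG", threshold=0.9):
    """Match each surviving token against the entity set and emit BIOUL tags:
    single matches get U-, runs of consecutive matches get B-/I-/L-."""
    matched = []
    for token in tokens:
        clean = token.lower().strip(".,")
        hit = clean not in NON_ENTITY and any(
            similarity(clean, e.lower()) >= threshold for e in entity_set)
        matched.append(hit)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        if matched[i]:
            j = i
            while j + 1 < len(tokens) and matched[j + 1]:
                j += 1
            if i == j:
                tags[i] = "U-" + label
            else:
                tags[i] = "B-" + label
                for k in range(i + 1, j):
                    tags[k] = "I-" + label
                tags[j] = "L-" + label
            i = j + 1
        else:
            i += 1
    return tags

tokens = "Sanofi Pharmaceuticals Ltd. Spain reported growth".split()
print(bioul_tags(tokens, {"Sanofi"}))
# → ['U-PH_ORG', 'O', 'O', 'O', 'O', 'O']
```

Only the core name Sanofi is labeled; the legal-form and country tokens fall through the non-entity filter, matching the filtering behaviour described above.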
5 Evaluation

The accuracy of our proposed approach is assessed by using a pharmacy-related news dataset, which consists of 5,000 news articles. The Pharmaceutical Organization entity set consists of 3,633 unique values, while the Drug entity set consists of 20,266 unique drug brand names. These sets were extracted and published as part of our previous work [3][4].

The evaluation is performed in two distinct scenarios for both entity classes. In the first evaluation, we split the news dataset into training and test portions, with sizes of 70% and 30% respectively, with no consideration of the distribution of the entities inside. This scenario aims to check the overall precision of the fine-tuned model. In the second evaluation scenario, we evaluate the generalization ability of our approach. Here, we split the training and test portions based on the entities they contain, such that there is no entity overlap between them. To do so, we extract the documents that contain 30% of the entities as the testing portion, and the other news articles are used for training. However, the testing portion contained more than 30% of the overall news. Therefore, in order to achieve a 70% - 30% ratio between the training and test portions, the test portion was reduced to contain exactly 30% of the news, while in the rest of the documents, the entities were replaced with other entities which do not belong to the entity set used in the testing portion.
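The entity-disjoint split of the second scenario can be sketched as follows: documents containing any held-out entity go to the test portion, so no held-out entity ever appears in training (a simplified sketch; the subsequent trimming and entity-replacement step described above is omitted):

```python
def entity_disjoint_split(doc_entities, held_out):
    """doc_entities: {doc_id: set of entities mentioned in the document};
    held_out: the entities reserved for testing.
    Returns (train_ids, test_ids) with no held-out entity in training."""
    test = {doc for doc, ents in doc_entities.items() if ents & held_out}
    train = set(doc_entities) - test
    return train, test

docs = {
    "d1": {"Sanofi"},
    "d2": {"Gilead"},
    "d3": {"Sanofi", "Pfizer"},
}
train, test = entity_disjoint_split(docs, held_out={"Sanofi"})
# train == {'d2'}; test == {'d1', 'd3'}
```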
Table 3: Evaluation of models trained on a dataset that contains known entities.

                PH_ORG               Organization
Library    Precision   F1       Precision   F1
AllenNLP   95.57       90.3     49.41       48.26
spaCy      91.36       91.54    22.22       29.10
BERT       97.65       96.66    51.65       53.18
BioBERT    98.35*      96.86*   52.12*      53.38*

Table 4: Evaluation of the models on previously unseen entities.

                PH_ORG               Organization
Library    Precision   F1       Precision   F1
AllenNLP   94.76       89.98    47.12       46.44
spaCy      90.95       88.51    21.98       28.01
BERT       97.45       97.68    51.51       55.68
BioBERT    97.52*      97.86*   52.42*      55.70*
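The precision and F1 values in Tables 3 and 4 are span-level metrics: a prediction counts as correct only if its boundaries and label both match a gold entity. A minimal sketch of the computation (exact-match scoring; the span tuples are illustrative):

```python
def span_scores(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 6, "PH_ORG"), (10, 16, "PH_ORG")]
pred = [(0, 6, "PH_ORG"), (20, 26, "PH_ORG")]
print(span_scores(gold, pred))
# → (0.5, 0.5, 0.5)
```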
The fine-tuned models for detecting Pharmaceutical Organization entities using spaCy, AllenNLP, BERT and BioBERT were tested accordingly, and the results were compared to the original models before their fine-tuning, where the task was the extraction of Organization entities. The results are given in Table 3, indicating that the fine-tuned models are able to achieve a significantly higher F1 score compared to the original models. We can also note that AllenNLP outperforms spaCy in this NER task, a result that can be attributed to the different neural architectures used by the two libraries, while the BERT model is able to outperform both. However, BioBERT, pre-trained on biomedical text, is able to slightly outperform BERT in every evaluation.

Even though the pre-trained models take into consideration the sentence context in which the entities appear, we can evaluate the fine-tuned models' generalization capability by creating a test dataset that contains only entities that were not seen during training. To achieve this, we use the joint dataset of the pharmacy-related news and generate a random sample of entities to achieve a 70% - 30% split ratio between the training and test datasets, where the test dataset contains entities not encountered in the training dataset. spaCy, AllenNLP, BERT and BioBERT models were also trained using these datasets, and the results are given in Table 4. To better illustrate the accuracy, Fig. 3 shows a sentence extracted from pharmacy-related news where the Pharmaceutical Organization entities are recognized as expected.

Table 5: Evaluation of models trained on a dataset that contains known entities (Drug).

Library    Precision   F1
AllenNLP   96.24       95.12
spaCy      90.95       94.87
BERT       98.86       95.98
BioBERT    98.92*      96.14*
spaCy, AllenNLP, BERT and BioBERT models were also created for recognizing Drug entities in texts. The evaluation results are given in Table 5 for the scenario where the same Drug entity can be present in both the training and the test dataset, while Table 6 shows the results when the test dataset does not contain any of the entities used in the training phase. Again, the train-test dataset ratio is 70% - 30%. To better illustrate the accuracy, Fig. 4 shows a sentence extracted from pharmacy-related news, where the Drug entity is recognized as expected.

Figure 3: Detecting Pharmaceutical Organization entities in text.

Table 6: Evaluation of the models on previously unseen entities (Drug).

Library    Precision   F1
AllenNLP   92.65       89.85
spaCy      88.16       89.25
BERT       98.12       95.01
BioBERT    98.65*      95.14*

Figure 4: Detecting Drug entities in text.
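The recognized Pharmaceutical Organization and Drug entities also feed the knowledge-graph step described next, which adds explicit RDF type statements for them. The enrichment amounts to emitting triples such as the following (a minimal N-Triples sketch; the helper function and the example resource URI are illustrative, not the platform's actual code):

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Extra types added on top of the DBpedia Spotlight annotations.
TYPES = {
    "pharma_org": ["http://schema.org/MedicalOrganization"],
    "drug": ["http://schema.org/Drug", "http://dbpedia.org/ontology/Drug"],
}

def enrichment_triples(entities):
    """entities: list of (resource_uri, kind) pairs, kind being a TYPES key.
    Returns one N-Triples line per added type statement."""
    lines = []
    for uri, kind in entities:
        for rdf_class in TYPES[kind]:
            lines.append(f"<{uri}> <{RDF_TYPE}> <{rdf_class}> .")
    return lines

for line in enrichment_triples([("http://dbpedia.org/resource/Sanofi", "pharma_org")]):
    print(line)
# → <http://dbpedia.org/resource/Sanofi> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/MedicalOrganization> .
```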
6 Knowledge Graph Enrichment

As a final step in the pipeline, we want to generate an RDF knowledge graph (KG) with the knowledge extracted in the previous steps. One way to create a general-purpose knowledge graph is to use a tool such as DBpedia Spotlight [38], which performs recognition of interlinked entities from the DBpedia knowledge graph. So, in theory, it can be used to recognize the drugs and pharmaceutical organizations in the texts of interest, and correctly annotate them with their semantic type. However, our experiments showed that the annotated entities are of more general types, such as schema:Organization or dbpedia:Company. In addition to that, most drug entities referenced by their brand names are not annotated at all. Therefore, we decided to use the results obtained so far by the pipeline described in the previous sections to expand the knowledge graph generated by DBpedia Spotlight with specific types: schema:MedicalOrganization for the recognized pharmaceutical organizations, and schema:Drug and dbpedia:Drug for the recognized drugs.

To properly test the benefits of this knowledge graph enrichment, we applied the technique on the test set which contains entities previously unseen while training the named entity recognition models. The results show an average expansion of 47.69% of the knowledge graph originally generated by DBpedia Spotlight. Figure 5 shows an example knowledge graph for a given input text, extracted using the DBpedia Spotlight annotation tool (left), and the knowledge graph enriched with additional knowledge about MedicalOrganization and Drug entities (right).

Figure 5: Original knowledge graph generated by DBpedia Spotlight (left) and the expanded knowledge graph (right). The additional RDF triples are highlighted.

Figure 6 shows the overall knowledge enrichment obtained by our system for the test dataset. It presents the ratio between the number of texts and the percentage of knowledge enrichment. This overview indicates a normal distribution of the enrichment over the test set.

Figure 6: Distribution of knowledge graph enrichment among the texts from the test set.

The knowledge graph generated and enriched as part of the pipeline can then be used for other purposes within or outside the platform. We are currently providing an RDF output in Turtle syntax.

7 Discussion

The platform presented in this paper emphasizes a methodology for combining the best-performing NLP models and adopting them for use in a new domain. We use a modular approach, where each model is a separate phase in the knowledge extraction pipeline, which allows for an easy upgrade with new and potentially superior models, thereby improving the performance of the entire platform.

In contrast to [12][11][32][33], the goal of our platform is to provide a knowledge extraction solution for the pharmaceutical domain that brings the state-of-the-art NLP achievements closer to the people who analyze large amounts of texts. The PharmKE platform is human-centric, meaning that it is designed to be used primarily by people who need to extract the knowledge. The outcome from each phase is visualized, which enables the users to better understand
ARCH
1, 2021the process of capturing and linking this knowledge. Since the web browser may not be the most convenient tool fordomain experts to use in the process of knowledge extraction, especially when they analyze texts from various sources,we are also publishing an Application Programming Interface (API) that exposes the results from our platform to otherapplications. With this, we enable the development of editor plugins which will potentially extract and visualize theknowledge in the tools that experts already use on a daily basis.In the current version of the PharmKE platform, we fine-tuned the Named Entity Recognition module to extract twoadditional entity types, namely
Pharmaceutical Organization and
Drug , on top of the entity types already recognizedby the superior BioBERT model. During the fine-tuning phase, we show a method for automatically creating thetraining set for the recognition of
Pharmaceutical Organizations and
Drugs , by using a text corpora from the pharma-ceutical domain and a closed set of entity instances from the types of interest. The evaluation of the fine-tuned modelshowed that this methodology enables recognition of entities that are not seen in the training set, which is a promisingresult.The knowledge graph which we generate and enrich at the end of the pipeline is aimed to show the possibility ofpackaging and reusing the knowledge generated by the pipeline in other software solutions. Namely, even though theplatform is human-centric, generating an RDF knowledge graph as the final step in the process means that the resultscan be stored, shared, combined with other RDF knowledge graphs and (re)used programmatically, outside of theplatform. The nature of RDF and knowledge graphs allows for an almost seamless combination of the results from theplatform with other RDF data which exists publicly or internally in the user environment.The PharmKE platform is open to the continuous advancements in the NLP field. One of the crucial elements in theprocess of the knowledge extraction that is not solved by the current models is the linking of the relations obtained bythe SRL model with the corresponding properties in the knowledge graph. This is the challenge that our team will tryto address in our future research, as well as incorporating any model that will have better results in some of the currenttasks. All of this is possible thanks to the modular design of the platform. Another challenge will be cleaning up theknowledge graph from erroneous conclusions made by the pipeline, which is a standard and expected problem withNLP.
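The type-expansion step described earlier can be sketched with plain triples. This is a stand-in illustration, not the platform's actual code: an implementation would use an RDF library such as rdflib and serialize the result to Turtle, and the entity URIs, type labels and NER output below are hypothetical.

```python
# Sketch of enriching a Spotlight-style graph with the specific types
# assigned by our NER models. Triples are modeled as plain tuples.
SCHEMA = "http://schema.org/"
DBO = "http://dbpedia.org/ontology/"
RDF_TYPE = "rdf:type"

# Triples as DBpedia Spotlight might produce them: only generic types.
graph = {
    ("dbr:Pfizer", RDF_TYPE, SCHEMA + "Organization"),
}

# Entities recognized by the fine-tuned NER models (hypothetical output).
recognized = {
    "dbr:Pfizer": "PharmaceuticalOrganization",
    "dbr:Ibuprofen": "Drug",
}

before = len(graph)
for entity, label in recognized.items():
    if label == "PharmaceuticalOrganization":
        graph.add((entity, RDF_TYPE, SCHEMA + "MedicalOrganization"))
    elif label == "Drug":
        graph.add((entity, RDF_TYPE, SCHEMA + "Drug"))
        graph.add((entity, RDF_TYPE, DBO + "Drug"))

# Per-text enrichment, measured as the percentage of triples added.
enrichment = 100 * (len(graph) - before) / before
```

The per-text `enrichment` value computed here corresponds to the quantity averaged in our evaluation (47.69% over the test set) and plotted in Figure 6.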
In this paper, we present a modular platform [34, 35] that incorporates state-of-the-art models for text categorization, pharmaceutical domain named entity recognition (NER), co-reference resolution (CRR), semantic role labeling (SRL) and knowledge extraction (KE). This platform is designed primarily for human users. PharmKE visualizes the results from each of the incorporated models, enabling pharmaceutical domain experts to better recognize the knowledge extracted from the input texts.

Our strategic goal is to keep the PharmKE platform current and up-to-date, and its modular design enables easy incorporation of new and potentially superior models. One such step in this direction was our extension of the more recent BioBERT model for NER with the recognition of the Pharmaceutical Organization and Drug entity types.

The platform is also publicly available [34] and open-source [35], providing reproducibility of our results. This also means that other researchers can modify their own copy of the platform, run their own instances of it and even re-purpose it, thanks to its modular design.

A common issue when training custom models for language understanding tasks is the lack of labeled datasets for training and testing. To tackle this issue, we propose a methodology that can be used to automate the labeled dataset creation process for training models for custom entity tagging. The methodology was assessed by training custom models for named entity recognition using spaCy, AllenNLP, BERT and BioBERT, and the obtained results indicate that the newly trained models outperform the pre-trained models in detecting custom entities.
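The core of the automated labeling methodology can be illustrated as a gazetteer-style matcher that marks the character spans of known entity instances in raw text, yielding training examples in the (text, {"entities": ...}) shape that spaCy expects. This is a minimal sketch under that assumption; the entity lists and sentence are illustrative, not the actual training corpus:

```python
import re

# Closed sets of known entity instances (hypothetical examples).
GAZETTEER = {
    "DRUG": ["aspirin", "ibuprofen"],
    "PHARM_ORG": ["Pfizer", "Novartis"],
}

def auto_label(text):
    """Mark every whole-word occurrence of a known entity in the text."""
    entities = []
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                entities.append((m.start(), m.end(), label))
    # Sort spans by position, as expected for NER training data.
    return (text, {"entities": sorted(entities)})

example = auto_label("Pfizer recalled a batch of aspirin last week.")
```

Running such a matcher over a pharmaceutical text corpus produces the silver-standard training set from which the custom spaCy, AllenNLP, BERT and BioBERT models were trained.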
Evaluating the performance of the proposed methodology on pharmaceutical texts gives satisfying results. However, better insight could be obtained by testing the methodology on various texts with different contexts, which may or may not include entities from the pharmaceutical domain. With this, we could evaluate the performance of the methodology in a generalized manner and compare the results to the current, task-specific evaluation. This would enable its usage in a variety of domains for training diverse models.

Shifting our focus towards the platform, the extracted semantic roles can be further parsed into RDF triples which comprise a knowledge graph. A platform optimization is planned as part of future work that would enable maintenance of the knowledge graph in the background, which would be continuously enriched with every text analysis performed by the platform.

The presence of a knowledge graph in the system will enable easy access and extraction of facts by performing simple queries over the graph; going further, it can be interconnected with other relevant knowledge graphs of the user, or public ones.
Acknowledgement
The work in this paper was partially financed by the Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje.
References

[1] V. Krishnan, V. Ganapathy, Named Entity Recognition (2005).
[2] E. F. Sang, F. De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, arXiv preprint cs/0306050 (2003).
[3] M. Jovanovik, D. Trajanov, Consolidating Drug Data on a Global Scale Using Linked Data, Journal of Biomedical Semantics 8 (1) (2017) 3.
[4] N. Jofche, M. Jovanovik, D. Trajanov, Named Entity Discovery for the Drug Domain, in: 16th International Conference on Informatics and Information Technologies, Faculty of Computer Science and Engineering, Skopje, 2019.
[5] M. Sundermeyer, R. Schlüter, H. Ney, LSTM Neural Networks for Language Modeling, in: Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[6] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural Architectures for Named Entity Recognition, arXiv preprint arXiv:1603.01360 (2016).
[7] J. P. Chiu, E. Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs, Transactions of the Association for Computational Linguistics 4 (2016) 357–370.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv preprint arXiv:1810.04805 (2018).
[9] J. Li, A. Sun, J. Han, C. Li, A Survey on Deep Learning for Named Entity Recognition, arXiv preprint arXiv:1812.09449 (2018).
[10] D. Balasuriya, N. Ringland, J. Nothman, T. Murphy, J. R. Curran, Named Entity Recognition in Wikipedia, in: Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources (People's Web), 2009, pp. 10–18.
[11] M. Honnibal, I. Montani, spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing, To appear 7 (2017).
[12] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, L. S. Zettlemoyer, AllenNLP: A Deep Semantic Natural Language Processing Platform, 2017. arXiv:1803.07640.
[13] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep Contextualized Word Representations, arXiv preprint arXiv:1802.05365 (2018).
[14] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining, arXiv preprint arXiv:1901.08746 (2019).
[15] P. Shi, J. Lin, Simple BERT Models for Relation Extraction and Semantic Role Labeling, arXiv preprint arXiv:1904.05255 (2019).
[16] O. Lassila, R. R. Swick, World Wide Web Consortium, Resource Description Framework (RDF) Model and Syntax Specification (1998).
[17] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, DBpedia: A Nucleus for a Web of Open Data, in: The Semantic Web, Springer, 2007, pp. 722–735.
[18] C. Bizer, T. Heath, K. Idehen, T. Berners-Lee, Linked Data on the Web (LDOW2008), in: Proceedings of the 17th International Conference on World Wide Web, ACM, 2008, pp. 1265–1266.
[19] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural Language Processing (Almost) from Scratch, Journal of Machine Learning Research 12 (Aug) (2011) 2493–2537.
[20] O. Kuru, O. A. Can, D. Yuret, CharNER: Character-Level Named Entity Recognition, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 911–921.
[21] Y. Kim, Y. Jernite, D. Sontag, A. M. Rush, Character-Aware Neural Language Models, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[22] L. Yao, H. Liu, Y. Liu, X. Li, M. W. Anwar, Biomedical Named Entity Recognition Based on Deep Neutral Network, Int. J. Hybrid Inf. Technol. 8 (8) (2015) 279–288.
[23] M. Habibi, L. Weber, M. Neves, D. L. Wiegandt, U. Leser, Deep Learning with Word Embeddings Improves Biomedical Named Entity Recognition, Bioinformatics 33 (14) (2017) i37–i48.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention Is All You Need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[25] K. Hakala, S. Pyysalo, Biomedical Named Entity Recognition with Multilingual BERT, in: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, Association for Computational Linguistics, Hong Kong, China, 2019, pp. 56–61.
[26] F. Souza, R. Nogueira, R. Lotufo, Portuguese Named Entity Recognition using BERT-CRF, arXiv preprint arXiv:1909.10649 (2019).
[27] A. Lamurias, F. M. Couto, LasigeBioTM at MEDIQA 2019: Biomedical Question Answering using Bidirectional Transformers and Named Entity Recognition, in: Proceedings of the 18th BioNLP Workshop and Shared Task, 2019, pp. 523–527.
[28] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining, Bioinformatics (09 2019).
[29] K. Lee, L. He, M. Lewis, L. Zettlemoyer, End-to-End Neural Coreference Resolution, arXiv preprint arXiv:1707.07045 (2017).
[30] S. Pradhan, A. Moschitti, N. Xue, O. Uryupina, Y. Zhang, CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes, in: Joint Conference on EMNLP and CoNLL - Shared Task, Association for Computational Linguistics, 2012, pp. 1–40.
[31] J. Daiber, M. Jakob, C. Hokamp, P. N. Mendes, Improving Efficiency and Accuracy in Multilingual Entity Extraction, in: Proceedings of the 9th International Conference on Semantic Systems (I-Semantics), 2013.
[32] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's Transformers: State-of-the-Art Natural Language Processing, arXiv abs/1910.03771 (2019).
[33] M. Burtsev, A. Seliverstov, R. Airapetyan, M. Arkhipov, D. Baymurzina, N. Bushkov, O. Gureenkova, T. Khakhulin, Y. Kuratov, D. Kuznetsov, et al., DeepPavlov: Open-Source Library for Dialogue Systems, in: Proceedings of ACL 2018, System Demonstrations, 2018, pp. 122–127.
[34] PharmKE Platform: Public instance, http://pharmke.env4health.finki.ukim.mk
[35] PharmKE Platform: Source code, https://gitlab.com/jofce.nasi/pharma-text-analytics