Real-time tracking of COVID-19 and coronavirus research updates through text mining
Yutong Jin, Jie Li, Xinyu Wang, Peiyao Li, Jinjiang Guo, Junfeng Wu, Dawei Leng, Lurong Pan
AIDD Group
Global Health Drug Discovery Institute, Beijing, China
[email protected] and [email protected]
Abstract
COVID-19, caused by the novel coronavirus SARS-CoV-2, is an ongoing pandemic. Studies are ongoing, with up to hundreds of publications uploaded to databases daily. We explore the use of artificial intelligence and natural language processing to sort through these publications efficiently. We demonstrate that clinical trial information, preclinical studies, and a general topic model can serve as text mining data intelligence tools that scientists all over the world can use as a resource for their own research. To evaluate our method, several metrics are used to measure the information extraction and clustering results. In addition, we demonstrate that our workflow has a use-case not only for COVID-19 but for other disease areas as well. Overall, our system aims to allow scientists to research coronavirus more efficiently. Our automatically updating modules are available on our information portal at https://ghddi-ailab.github.io/Targeting2019-nCoV/ for public viewing.
Keywords
Natural Language Processing · Text Mining · Clustering · COVID-19
The COVID-19 pandemic is an ongoing pandemic caused by the novel coronavirus (SARS-CoV-2) [1]. The symptoms are highly variable, and the virus, which spreads through the air and contaminated surfaces, is highly contagious. As of January 2021, there has yet to be a small molecule drug that is specific and effective for COVID-19. During the pandemic, countries around the world made efforts to overcome the difficulties, further reflecting the importance of unity, cooperation, and resource sharing. We are continuously exploring the value that artificial intelligence (AI) provides in the drug discovery process. In terms of pathological mechanisms, AI natural language processing (NLP) technology can replace manual curation of data and efficiently collect and sort data from global databases. Data mining is a process in which algorithms convert raw data into useful structured data. This technique is then integrated with NLP algorithms to analyze and organize the information collected from an area such as a disease field, either through rule-based text mining or through a model-based tool. As part of this effort, we have launched the GHDDI Targeting COVID-19 platform [2]. Since its launch on January 29, the platform has been continuously updated and maintained, with new functions and modules added over time. These functional modules include an NLP data mining module for in vitro experimental data on SARS-CoV-2 small molecule drugs that adds new experimental information daily, an automated NLP COVID-19 clinical trial module that provides up-to-date summaries of clinical trial data, and an NLP-based scientific literature recommendation module. In this paper, we present the details behind these three modules, which support real-time scientific intelligence on COVID-19.
In this section, we briefly introduce the three modules and how they were built using different databases. All of the NLP systems were built using a standard Python 3.6 environment from Anaconda and the associated packages mentioned below. Our system's backend and database are hosted on an Ubuntu 18.04 server using the same environment.

The data was aggregated from a variety of sources. Through automated download scripts and the available Application Programming Interfaces (APIs), abstract data was downloaded from PubMed, preprint sources, and dimensions.ai [3] using the string query "SARS-Cov-2 OR COVID-19 OR novel coronavirus". These data sources were compiled together, and the string data was cleaned using simple Python scripts that, for example, lower-cased all words and removed noise such as extra spaces or tabs. Duplicate records were then removed in a sequence of steps by dropping duplicated DOI, title, and abstract strings, in that order. This aggregated dataset is used for the subsequent NLP workflows and models.
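The cleaning and de-duplication steps described above can be sketched as follows. This is a minimal illustration, not the authors' script: the record layout (dictionaries with `doi`, `title`, and `abstract` keys) and the function names `clean_text` and `deduplicate` are assumptions made for the example, as is the choice to keep records whose comparison field is empty.

```python
import re


def clean_text(text):
    """Lower-case a string and collapse runs of whitespace (spaces, tabs, newlines)."""
    return re.sub(r"\s+", " ", text or "").strip().lower()


def deduplicate(records):
    """Drop duplicates by DOI, then by title, then by abstract, in that order."""
    for key in ("doi", "title", "abstract"):
        seen, kept = set(), []
        for rec in records:
            value = clean_text(rec.get(key))
            # Assumption: a record with an empty field cannot be compared on it,
            # so it is kept rather than treated as a duplicate.
            if value and value in seen:
                continue
            seen.add(value)
            kept.append(rec)
        records = kept
    return records


# Hypothetical records, invented for illustration only.
records = [
    {"doi": "10.1000/a", "title": "Drug  screen", "abstract": "EC50 of X."},
    {"doi": "10.1000/A", "title": "Drug screen (v2)", "abstract": "Other text."},
    {"doi": "", "title": "drug\tscreen", "abstract": "Third text."},
    {"doi": "10.1000/b", "title": "New study", "abstract": "Fourth text."},
]
unique = deduplicate(records)
```

Comparing cleaned strings rather than raw ones means that records differing only in letter case or whitespace are treated as the same entry.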
Several dictionaries were compiled and used for information filtering and information extraction. First, a dictionary of drug names was compiled from DrugBank drug names and aliases [4], the FDA drug list [5], and ChEMBL [6]. All of these drug names were merged into one list, and string length was computed to filter out outliers. The final list consisted of unigrams, bigrams, and trigrams, restricted to drug names with a string length between 5 and 75 characters.

The second dictionary is a filter list used to remove unwanted items from the drug dictionary. The DrugBank database contains several entries that do not necessarily represent drug names, such as large biologic molecules and antigens. A list of filtered items had also been compiled in a similar Kaggle competition [7]; this list was aggregated and used to remove unwanted terms from the final drug dictionary used in subsequent workflows.
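A minimal sketch of the dictionary filtering described above is given below. The function name `build_drug_dictionary`, its parameters, and the example entries are hypothetical; only the length bounds (5 to 75 characters), the limit of three tokens, and the use of a filter list come from the text.

```python
def build_drug_dictionary(names, blacklist, min_len=5, max_len=75, max_tokens=3):
    """Merge candidate names, then drop length outliers, long n-grams, and filtered terms."""
    blocked = {term.lower() for term in blacklist}
    kept = set()
    for name in names:
        name = " ".join(name.lower().split())  # normalize case and whitespace
        if not (min_len <= len(name) <= max_len):
            continue  # string-length outlier
        if len(name.split()) > max_tokens:
            continue  # longer than a trigram
        if name in blocked:
            continue  # on the filter list (e.g. antigens, large biologics)
        kept.add(name)
    return sorted(kept)


# Hypothetical candidate names and filter list, for illustration only.
candidates = [
    "Remdesivir",
    "Interferon beta",
    "ATP",
    "Remdesivir",
    "Spike glycoprotein antigen",
    "a very long four word name",
]
blacklist = ["spike glycoprotein antigen"]
drug_dictionary = build_drug_dictionary(candidates, blacklist)
```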
This step uses the aggregated dataset described in the previous section. The dataset is first passed through a filter on the keywords [EC50, IC50, CC50] to obtain the subset of abstracts that mention them. Each abstract in the subset is then matched against a corpus of known drug names, and the matching sentences are extracted. If a sentence contains both a keyword and a drug name, it is searched for a numerical value or descriptive phrase describing the relationship in that sentence. This is done using either regex (Rule 1) or a spaCy noun chunk model (Rule 2). Using regex, the system extracts the numerical value closest in word distance to the keyword. Similarly, the spaCy noun chunk model extracts the noun chunk describing the keyword. The spaCy model is an open-source English language model whose features include POS tagging, noun chunk extraction, and NER. A noun chunk is a descriptive phrase that has a significant relation to a keyword. The extracted value is then mapped onto the drug name as a direct correlation. Additionally, if the drug and the keyword are mentioned in different sentences, the system tries to extract a value as above but reports the relationship between the drug name and the experimental assay value as an indirect correlation. These results are all tabulated and uploaded to the website.

• Rule 1: Using a regex query, all numbers are extracted. The numerical token closest to the experimental keyword is mapped to the closest drug name.
• Rule 2: Using a spaCy model, all noun chunks are extracted from the sentence. The noun chunk closest in distance to the experimental keyword is identified and mapped.

Using these two rules, all data following this logic can be extracted and mapped. Because this text mining procedure relies on a list of known drugs, several metrics are used to validate the workflow. We evaluated the text mining results following a similar text mining study [8].
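Rule 1 above can be sketched roughly as follows. This is an illustrative reading of the rule, not the authors' code: the tokenizer, the function name `extract_rule1`, and the choice to measure distance from the keyword for both the number and the drug name are assumptions, and the example sentence is invented (it reuses the nafamostat value listed in Table 2).

```python
import re

KEYWORDS = {"ec50", "ic50", "cc50"}
TOKEN = re.compile(r"[a-z0-9]+(?:\.[0-9]+)?")  # words and decimal numbers
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_rule1(sentence, drug_names):
    """Return (drug, keyword, value) triples found in a single sentence."""
    tokens = TOKEN.findall(sentence.lower())
    drugs = {d.lower() for d in drug_names}  # unigram names only in this sketch
    kw_pos = [i for i, t in enumerate(tokens) if t in KEYWORDS]
    num_pos = [i for i, t in enumerate(tokens) if NUMBER.fullmatch(t)]
    drug_pos = [i for i, t in enumerate(tokens) if t in drugs]
    if not (kw_pos and num_pos and drug_pos):
        return []
    results = []
    for k in kw_pos:
        # Closest number and closest drug name, measured in tokens from the keyword.
        n = min(num_pos, key=lambda i: abs(i - k))
        d = min(drug_pos, key=lambda i: abs(i - k))
        results.append((tokens[d], tokens[k], float(tokens[n])))
    return results


# Invented example sentence, for illustration only.
sentence = "The IC50 of nafamostat was 0.0022 uM, whereas remdesivir was less potent."
triples = extract_rule1(sentence, ["nafamostat", "remdesivir"])
```

A sketch like this shares the weakness of any distance-based rule: when several numbers appear in one sentence, the nearest one is not always the assay value.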
In that study, 25 unique text items (notes) were randomly sampled and manually reviewed as a gold standard. Likewise, we evaluated Precision, Recall, and F-measure of the preclinical data mining results by randomly sampling 25 papers by DOI.

$$P = \frac{N_{\text{correct}}}{N_{\text{total}}} \tag{1}$$

$$R = \frac{N_{\text{correct}}}{N_{\text{total possible}}} \tag{2}$$

The above equations are derived from calculating precision and recall in an information extraction context [9]. Here $N_{\text{correct}}$ is the number of correctly extracted items, $N_{\text{total}}$ is the number of items the system extracted, and $N_{\text{total possible}}$ is the number of items that could have been extracted according to the gold standard.

Figure 1: Workflow of the real-time system to update modules.

The aggregated dataset built in Section 2.1 is used in this step. Figure 2 shows the monthly number of articles uploaded to our database. Because of this large number, it was important to split the articles into different categories and recommend them by topic. After preprocessing, the data is checked and tagged for n-grams up to trigrams. Additionally, the data is lemmatized and stop words are removed; a bag-of-words is subsequently created for each abstract. A Latent Dirichlet Allocation (LDA) algorithm is used to build the topic model. This is an unsupervised machine learning model that measures the distribution of words and attempts to cluster this distribution into a specified number of hidden distributions. The word distribution of each abstract determines which topic, or hidden distribution, that abstract best fits into. The bag-of-words object is then sent to the Gensim LDA API [10] for model training, and the resulting Python pickle objects and metadata are used for daily updates. A grid search optimization of this topic model was performed by maximizing the coherence score, and each topic of the best scoring model was hand-labeled. On this dataset, the best scoring model was a 30-topic model, which was used for the final output. After this model was trained, inference was performed on the original dataset, and each abstract was assigned a topic.
This result was recorded, and a data-driven filter was used to remove topics that did not contain the number of papers required to form a topic. Then, for each topic, the papers are ranked by the model's output weight (the gamma value in the Gensim model), and the top 10 papers are output in a final tabulated format. The same procedure can be applied to new data, so this module can be updated automatically in the future.
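The ranking step described above can be sketched in plain Python. The sketch assumes that per-document topic weights have already been obtained from the trained model and are available as one list of weights per abstract; the function name `top_papers_per_topic` and the example titles and weights are hypothetical.

```python
from collections import defaultdict


def top_papers_per_topic(titles, doc_topic_weights, top_n=10):
    """Assign each paper to its highest-weight topic, then rank papers within each topic."""
    by_topic = defaultdict(list)
    for title, weights in zip(titles, doc_topic_weights):
        topic = max(range(len(weights)), key=weights.__getitem__)
        by_topic[topic].append((weights[topic], title))
    return {
        topic: [
            title
            for _, title in sorted(papers, key=lambda p: p[0], reverse=True)[:top_n]
        ]
        for topic, papers in by_topic.items()
    }


# Hypothetical titles and topic weights for a two-topic model.
titles = ["A", "B", "C", "D"]
weights = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
ranked = top_papers_per_topic(titles, weights, top_n=2)
```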
In the clinical trials module, the open-access Figshare data shared by dimensions.ai was used [3]. As of January 2021, there were 7000 clinical trial records from around the world. Clinical trials comprise human experiments in different phases relating to drugs or biologics, including vaccines. The data is first preprocessed using methods similar to those described above. The data is then tagged for unigrams, bigrams, and trigrams, as in the NLP topic model. Afterwards, the information extraction process sorts the clinical trials into one of the following three types.

• Using the known drugs dictionary described in the preclinical information extraction section, all small molecule drug names are extracted from the descriptive phrases of the clinical trials. This list is filtered, and extracted samples containing animal, food, and other non-small molecule drug words are removed.
• Given the keyword "vaccine" and all its derivatives, the database was searched for these keywords, and a list of vaccine clinical trials was compiled and output. This list is filtered, and trials containing words in the blacklist are removed.
• Given several keywords relating to biological products, such as plasma, antibody, stem cell, and all of their derivative words, a list of biological products was compiled and output. This list is filtered, and trials containing words in the blacklist are removed.

Using these three rules, all information pertaining to drugs, biologicals, and vaccines was extracted from the tabulated data. The data is visualized in our information portal [2], together with word clouds for the biological drug and vaccine trials as a validation.
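The three rules above can be illustrated with a small keyword-based classifier. This is a simplified sketch under stated assumptions: the regular expressions, the single shared blacklist, the function name `categorize_trial`, and the decision to return every matching label (rather than exactly one) are choices made for the example, and the trial descriptions are invented.

```python
import re

VACCINE = re.compile(r"\bvaccin\w*")  # vaccine, vaccines, vaccination, ...
BIOLOGIC = re.compile(r"\b(?:plasma|antibod\w*|stem cells?)\b")


def contains_term(text, term):
    """Whole-word (or whole-phrase) match of a dictionary term in lower-cased text."""
    return re.search(r"\b" + re.escape(term.lower()) + r"\b", text) is not None


def categorize_trial(description, drug_dictionary, blacklist=()):
    """Return the set of trial types whose keywords appear in the description."""
    text = " ".join(description.lower().split())
    if any(contains_term(text, word) for word in blacklist):
        return set()  # filtered out by the blacklist
    labels = set()
    if any(contains_term(text, drug) for drug in drug_dictionary):
        labels.add("small molecule")
    if VACCINE.search(text):
        labels.add("vaccine")
    if BIOLOGIC.search(text):
        labels.add("biological")
    return labels


# Invented trial descriptions, for illustration only.
drugs = ["hydroxychloroquine"]
a = categorize_trial("Hydroxychloroquine versus placebo in adults", drugs)
b = categorize_trial("Safety of an inactivated SARS-CoV-2 vaccine", drugs)
c = categorize_trial("Convalescent plasma for severe COVID-19", drugs)
d = categorize_trial("Hydroxychloroquine given with food", drugs, blacklist=["food"])
```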
All of these modules are supported by real-time daily updates provided by a server set up to update automatically. This system, as shown in Figure 1, provides daily incremental updates of clinical trial data and research articles using open APIs provided by PubMed, Figshare, and other sources. This data is then stored in our database, which is updated daily. After updating the database, we use metadata to track changes in the clinical trial and research article databases. The clinical trial script runs automatically every day and completes the data processing if there is a new update. Likewise, articles newly added to our database are preprocessed and then run through the preclinical NLP processing workflow, and the updates are appended to a master list that is published on our portal. Finally, the entire abstracts database is preprocessed and run through the topic model; afterwards, the top 10 titles per topic are uploaded to the recommendation page. This model is retrained and updated monthly as new data becomes available.

Figure 2: Count of papers published every month starting from January 2020.
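One simple way to realize the incremental update described above is to keep a set of identifiers that have already been processed and pass only unseen records to the downstream workflows. The sketch below is an assumption about how such tracking could look, not a description of the authors' metadata format; the `id` field and the function name `select_new_records` are hypothetical.

```python
def select_new_records(fetched, processed_ids):
    """Return records not seen before and remember their identifiers."""
    new = [rec for rec in fetched if rec["id"] not in processed_ids]
    processed_ids.update(rec["id"] for rec in new)
    return new


# Hypothetical daily fetch: one record is already known, one is new.
processed = {"a"}
first_run = select_new_records([{"id": "a"}, {"id": "b"}], processed)
second_run = select_new_records([{"id": "a"}, {"id": "b"}], processed)
```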
The results of our modules are presented below. Full results can be found on the Targeting COVID-19 GitHub portal [2]. This section gives an in-depth description of the results that were published to the website.

Figure 2 visualizes the number of papers uploaded to the databases by month. Because the number of published papers increased exponentially in March 2020, it became impossible to manually track all experimental and clinical results published in a journal or uploaded to a preprint service. Therefore, we mined this data automatically to extract valuable information that may be of use to scientists in different fields all around the world.
For small molecule drugs, the compiled drug dictionary is used and the matches are tabulated in the following tables. Table 1 shows the 20 most common drugs found through information extraction of clinical trial records, where there are currently over 1100 clinical trials for small molecule drugs, while Table 2 shows several of the best experimental results for small molecule drugs extracted from the text of the preclinical studies literature. Note that the units are extracted from the sentence containing the experimental value; standard units are typically given as a molar concentration, such as micromolar or nanomolar.
Using the rules described in the Methods section, Figure 3 shows several examples of direct correlation sentences for nafamostat, which also appears in Table 2, labeled with an experimental value and the experiment type, from three different article abstracts [11] [12] [13]. Nafamostat is the small molecule drug that had the best experimental value of all the extracted data samples. These sentences were taken directly from the preprint or published abstracts aggregated in our database, and Figure 3 visualizes what the rule-based search engine looked for in each abstract. Several other drugs are also labeled in Figure 3 for visualization purposes, but these drugs are not described in further detail.

Table 1: Top 20 known small molecule drugs undergoing COVID-19 clinical trials.
Treatment             Count
Hydroxychloroquine    153
Ritonavir             65
Lopinavir             61
Azithromycin          60
Tocilizumab           55
Ivermectin            51
Favipiravir           38
Remdesivir            33
Chloroquine           32
Colchicine            24
Dexamethasone         23
Methylprednisolone    23
Enoxaparin            22
Nitazoxanide          20
Ruxolitinib           19
Anakinra              15
Angiotensin           15
Heparin               15
Baricitinib           14
Interferon beta       14

Table 2: Five small molecule drugs with in vitro assay results.

Drug name       Assay   Value    Units (uM)
Nafamostat      IC50    0.0022   micro
Azithromycin    EC50    0.008    um
Pralatrexate    EC50    0.008    um
Adenosine       EC50    0.01     um
Remdesivir      EC50    0.01     um

Figure 3: Several sentences from different abstracts containing the drug "nafamostat". All drugs were labeled in red, experiments in green, and numerical values in blue.

Table 3: Topic keywords for five select topics in our optimized topic model.

AI          Mental Health   Disease Analysis   Genetics   PPE
covid       covid           covid              sars_cov   mask
ct          health          risk               protein    use
score       mental          age                ace        air
use         pandemic        high               viral      respirator
image       anxiety         population         virus      particle
diagnosis   participant     mortality          human      surface
pneumonia   report          factor             cell       wear
feature     study           infection          host       environmental
lung        survey          disease            analysis   device
base        psychological   increase           genome     transmission

Figure 4: Select titles of the five topics in our optimized topic model in Table 3. Some of the titles were truncated because of the large string size.
Table 3 lists 5 of the topics taken from the LDA topic model along with their manually assigned labels. The topics were taken from a model whose number of topics was optimized by grid search, maximizing the coherence score on a corpus of over 20,000 abstracts. The best topic model was found to contain 30 topics. Several of these topics had very few samples in their cluster and were removed from the final presentation using a data-driven approach. Several paper titles for each topic are shown in Figure 4, justifying the manual label attached to each LDA topic.
We have developed several automatic modules from openly available data. The full data results are publicly available at our website, COVID-19: GHDDI Info Sharing Portal: https://ghddi-ailab.github.io/Targeting2019-nCoV

We evaluated a random subset of article abstracts against a gold standard that was manually read and labeled for the same information extraction task. The system output was compared with the manual gold standard, with the results shown in Table 4.

Table 4: Results of data extraction system on a random subset of abstracts.
Metric      Value
Precision   0.808
Recall      0.689
F1 Score    0.743

The precision of the system was assessed to be around 0.8, showing that our system extracts most of the drug names and experimental values correctly. This supports using the system to recommend articles for users to follow up on. After a manual review of the articles, it was found that many of the drug names that could not be extracted were missing from the drug dictionary we had compiled. Additionally, wrong experimental values and wrong drug name mappings were attributed to the fact that the rule-based system cannot robustly handle some content, such as multiple numerical values appearing at multiple locations in a sentence. More fine-tuning of the system's rules is needed to boost the precision. However, a high precision system is not strictly necessary in this module, as its intended purpose is mainly to gather and recommend preclinical studies for further research. Model-based systems that could be built in the future may be able to rectify these mistakes and output a higher precision final result.
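For reference, the metrics in Table 4 follow Equations (1) and (2), with the F1 score being the harmonic mean of precision and recall. The sketch below shows the computation; the function name `precision_recall_f1` and the example counts are hypothetical, since the underlying counts are not reported. The last line checks that the reported precision and recall are consistent with the reported F1 score up to rounding.

```python
def precision_recall_f1(n_correct, n_extracted, n_gold):
    """Precision, recall, and F1 from extraction counts (Equations 1 and 2)."""
    precision = n_correct / n_extracted if n_extracted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only: 8 correct out of 10 extracted,
# with 16 items in the gold standard.
p, r, f1 = precision_recall_f1(n_correct=8, n_extracted=10, n_gold=16)

# Harmonic mean of the precision and recall reported in Table 4.
f1_reported = 2 * 0.808 * 0.689 / (0.808 + 0.689)
```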
In an earlier iteration of the clinical trial analysis module, only unigrams were used for drug data extraction. This caused errors such as "chloroquine" and "phosphate" being double counted in some instances. In other instances, the keyword "interferon" was present in the clinical trial, but the drug dictionary did not contain a unigram entry for this keyword, so the drug could not be matched with the dictionary. After bigrams and trigrams were added to this module, "interferon beta" and "interferon alpha" were both successfully extracted from the clinical trial data.

The same addition was tested in the preclinical workflow, but it was ultimately not included in the current module because preliminary results showed no significant improvement in the precision of relevant data extraction. On the contrary, only unfiltered noise was extracted using n-grams. This is likely because, unlike the somewhat cleaned and structured clinical trial text, the abstract text is completely freeform, so the backend algorithm best captures pieces of information such as an experimental value or a reported experiment using single-token keywords. Including multi-word sequences in this workflow adds superfluous and confusing matches.

N-grams up to trigrams were also built into the data preprocessing pipeline of the topic model, because some word pairs are necessary for topics to be accurate and differentiable. One key example shown in Table 3 is "sars_cov" being one token instead of the two tokens "sars" and "cov". This gives the LDA model cleaner input data, especially when differentiating between the "SARS" virus and "SARS-CoV-2".
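The double-counting problem described above can be avoided by preferring the longest dictionary match at each position. The sketch below shows one possible greedy longest-match strategy over n-grams up to trigrams; it is an illustration of the idea rather than the authors' implementation, and the function name `match_drugs` and the example dictionary are hypothetical.

```python
def match_drugs(text, drug_dictionary, max_n=3):
    """Greedy longest-match lookup of dictionary n-grams (trigram first, unigram last)."""
    tokens = text.lower().split()
    drugs = {d.lower() for d in drug_dictionary}
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in drugs:
                matches.append(candidate)
                i += n  # skip the tokens consumed by this match
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return matches


# Hypothetical dictionary containing both unigram and bigram entries.
dictionary = ["chloroquine", "phosphate", "chloroquine phosphate", "interferon beta"]
found = match_drugs("chloroquine phosphate plus interferon beta", dictionary)
```

With unigrams only, the same text would yield "chloroquine" and "phosphate" as two separate hits and miss "interferon beta" entirely.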
As shown in Figure 5, the optimal topic model by maximum coherence contained 30 topics. Several topics did not include many papers, so they were excluded from the final results. This filter used a data-driven technique in which topics containing fewer than a fifth of the average number of papers per topic were excluded. As a result, topics clustered with very few papers do not appear in the final module results, and the recommended papers per topic remain clear and differentiable, as the evidence in Figure 4 shows. That figure shows that most of the top 5 highest-weighted papers from the corpus in each topic contained something in the title pertinent to the topic. One example is the hand-labeled "AI" topic, which indeed contained papers with titles discussing deep learning or CT images. Another interesting example is the "PPE" topic, where the papers clustered into this topic had titles about N95 respirators, masks, and respiratory aerosols.
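The data-driven filter described above can be written in a few lines. The sketch assumes that each abstract has already been assigned a single topic index; the function name `filter_small_topics`, its parameters, and the example assignments are hypothetical.

```python
from collections import Counter


def filter_small_topics(assignments, num_topics, fraction=0.2):
    """Keep topics holding at least `fraction` of the average number of papers per topic."""
    counts = Counter(assignments)
    average = len(assignments) / num_topics
    threshold = fraction * average
    return {topic: n for topic, n in counts.items() if n >= threshold}


# Hypothetical assignments for a 4-topic model: the average is 25 papers per topic,
# so the threshold is 5 and topics 2 and 3 are excluded.
assignments = [0] * 50 + [1] * 45 + [2] * 4 + [3] * 1
kept_topics = filter_small_topics(assignments, num_topics=4)
```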
The major limitation of the clinical trial and preclinical text mining modules is that they both rely on known drug dictionaries for the exact text search. This means that only known, approved, or experimental drugs that are already in the dictionaries can be matched. However, some experimental drugs have not yet been added to the databases, so tracking newly published data on experimental drugs is a major challenge for future AI models.

Figure 5: The coherence score compared to the number of topics. This gave an optimized number of topics for the final topic model.

The granularity of the preliminary database search was affected by two important factors: computational cost and precision. In the preclinical text mining and information extraction workflow, we tried to optimize the initial preclinical abstract search for high precision while minimizing the computational cost of text mining. Therefore, during the development of this workflow, we examined many different search queries to obtain the optimal number of abstracts to text mine. Keywords that were tried include "covid", "coronavirus", "preclinical", "in vitro", "EC50", and "experiment". While all of these keywords yielded papers of interest, we optimized our search by first using keywords such as "coronavirus" and "sars-cov-2" and then searching for the keywords "EC50" and "IC50". Compared to searching for keywords such as "preclinical" and/or "in vitro", this gave better precision in mining texts of interest and reduced the number of papers without useful information, because the more general keywords return more samples with no useful information and increase the computational effort.

In our workflow, we primarily used a rule-based text mining system to extract drug names and experimental values, as evidenced in Table 1, Table 2, and Figure 3.
Due to the nature of the pandemic, the first efforts all revolved around drug repurposing; therefore, a dictionary should be sufficient for text mining in this setting, as the known drug names and aliases were captured in our dictionary. However, if this system were to track a long-term disease or pandemic, a more robust data extraction system should be built. In our preliminary analysis, a ScispaCy named-entity recognition (NER) model [14] was assessed and compared to the rule-based system. The comparison in Figure 6 shows that although the NER model identifies unique chemicals, its precision is very low compared to the rule-based system, especially since it captures strings such as "h=15.26" and "ptdtytsvylgkfrg" as chemicals. Future work can include building an NER model with curated data labels from a set of papers in one specific disease area, or an NER model trained with more specific data labels to avoid false positives on experimental values and protein sequences.

As an extension of these text mining modules, there are several applications for AI models. Once the data mining modules are mature, data curation efforts for small molecule drugs can be reduced and automated, since robust research datasets can be produced [15]. Furthermore, the backend scripts can all be extended and developed for other use-cases. One such use-case is a module that mines data on known genes and their knock-in, knock-down, or knock-out relationships. The topic model can likewise be useful in a variety of scientific fields, including cancer and infectious diseases. Future work can look into topic models for these fields or for a specific area within one of them.
Figure 6: A comparison of an NER model (left) with our rule-based system (right).

The modules presented on our portal and in this article showcase NLP techniques that may be useful in a global pandemic, where large amounts of text data are generated daily. Our modules aim to reduce the burden of reading thousands of articles daily to reading only the recommended ones, which are updated automatically every day by our text mining systems. This allows a significant reduction in the time spent reading articles and more time dedicated to research on coronavirus. These modules also have the potential to be scaled to other applications in the life sciences.
References

[1] Covid-19 pandemic - wikipedia. https://en.wikipedia.org/wiki/COVID-19_pandemic. Accessed: 2021-01-30.
[2] Ghddi targeting covid-19 portal. https://ghddi-ailab.github.io/Targeting2019-nCoV/. Accessed: 2021-02-03.
[3] Dimensions Resources. Dimensions covid-19 publications, datasets and clinical trials, Mar 2020. URL https://dimensions.figshare.com/articles/dataset/Dimensions_COVID-19_publications_datasets_and_clinical_trials/11961063/37.
[4] Drugbank. Accessed: 2021-02-03.
[5] Drugs@fda: Fda-approved drugs. Accessed: 2021-02-03.
[6] Mark Davies, Michał Nowotka, George Papadatos, Nathan Dedman, Anna Gaulton, Francis Atkinson, Louisa Bellis, and John P. Overington. ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Research, 43(W1):W612–W620, 04 2015. ISSN 0305-1048. doi:10.1093/nar/gkv352. URL https://doi.org/10.1093/nar/gkv352.
[7] Kaggle - drug treatment extraction (taskvt). Accessed: 2021-02-03.
[8] Hua Xu, Shane P Stenner, Son Doan, Kevin B Johnson, Lemuel R Waitman, and Joshua C Denny. MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association, 17(1):19–24, 01 2010. ISSN 1067-5027. doi:10.1197/jamia.M3378. URL https://doi.org/10.1197/jamia.M3378.
[9] Kai Ming Ting. Precision and Recall, pages 781–781. Springer US, Boston, MA, 2010. ISBN 978-0-387-30164-8. doi:10.1007/978-0-387-30164-8_652. URL https://doi.org/10.1007/978-0-387-30164-8_652.
[10] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
[11] Jonathan H. Shrimp, Stephen C. Kales, Philip E. Sanderson, Anton Simeonov, Min Shen, and Matthew D. Hall. An enzymatic tmprss2 assay for assessment of clinical candidates and discovery of inhibitors as potential treatment of covid-19. bioRxiv, 2020. doi:10.1101/2020.06.23.167544.
[12] Meehyun Ko, Sangeun Jeon, Wang-Shick Ryu, and Seungtaek Kim. Comparative analysis of antiviral efficacy of fda-approved drugs against sars-cov-2 in human lung cells: Nafamostat is the most potent antiviral drug candidate. bioRxiv, 2020. doi:10.1101/2020.05.12.090035.
[13] Mizuki Yamamoto, Maki Kiso, Yuko Sakai-Tagawa, Kiyoko Iwatsuki-Horimoto, Masaki Imai, Makoto Takeda, Noriko Kinoshita, Norio Ohmagari, Jin Gohda, Kentaro Semba, Zene Matsuda, Yasushi Kawaguchi, Yoshihiro Kawaoka, and Jun-ichiro Inoue. The anticoagulant nafamostat potently inhibits sars-cov-2 infection in vitro: an existing drug with multiple possible therapeutic effects. bioRxiv, 2020. doi:10.1101/2020.04.22.054981.
[14] Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 319–327, Florence, Italy, August 2019. Association for Computational Linguistics. doi:10.18653/v1/W19-5034.
[15] Hannah L. Weeks, Cole Beck, Elizabeth McNeer, Cosmin A. Bejan, Joshua C. Denny, and Leena Choi. medextractr: A medication extraction algorithm for electronic health records using the r programming language. medRxiv, 2019. doi:10.1101/19007286.