COVIDScholar: An automated COVID-19 research aggregation and analysis platform
Amalie Trewartha, John Dagdelen, Haoyan Huo, Kevin Cruse, Zheren Wang, Tanjin He, Akshay Subramanian, Yuxing Fei, Benjamin Justus, Kristin Persson, Gerbrand Ceder
Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Department of Materials Science & Engineering, University of California, Berkeley, Berkeley, CA 94720, USA
Indian Institute of Technology Roorkee, Roorkee, Uttarakhand 247667, India
Wuhan University, Wuhan, Hubei 430072, China
Abstract. The ongoing COVID-19 pandemic has had far-reaching effects throughout society, and science is no exception. The scale, speed, and breadth of the scientific community's COVID-19 response has led to the emergence of new research literature on a remarkable scale: as of October 2020, over 81,000 COVID-19-related scientific papers have been released, at a rate of over 250 per day. This has created a challenge for traditional methods of engagement with the research literature; the volume of new research is far beyond the ability of any human to read, and the urgency of response has led to an increasingly prominent role for pre-print servers and a diffusion of relevant research across sources. These factors have created a need for new tools to change the way scientific literature is disseminated.

COVIDScholar is a knowledge portal designed with the unique needs of the COVID-19 research community in mind, utilizing NLP to aid researchers in synthesizing the information spread across thousands of emergent research articles, patents, and clinical trials into actionable insights and new knowledge. The search interface for this corpus, https://covidscholar.org, now serves over 2000 unique users weekly. We also present an analysis of trends in COVID-19 research over the course of 2020.
The scientific community has responded to the COVID-19 pandemic with unprecedented speed, and as a result an enormous amount of research literature is rapidly emerging, at a rate of over 250 papers a day [1]. The urgency and volume of emerging research have caused pre-prints to take a prominent role in lieu of traditional journals, leading to widespread usage of pre-print servers for the first time in many fields, most prominently the biomedical sciences [2, 3]. While this allows new research to be disseminated to the community sooner, it also circumvents the role of journals in filtering poor or flawed papers and highlighting relevant research [4]. Additionally, the uniquely multi-disciplinary nature of the scientific community's response to the pandemic has led to pertinent research being dispersed across many open-access and pre-print services, no single one of which captures the entirety of the COVID-19 literature.

These challenges have created a need and an opportunity for new tools and methods that rethink the way in which researchers engage with the wealth of available COVID-19 scientific literature.

COVIDScholar is an effort to address these issues by using natural language processing (NLP) techniques to aggregate, analyze, and search the COVID-19 research literature. We have developed an automated, scalable infrastructure for scraping and integrating new research as it appears, and used it to construct a targeted corpus of over 81,000 scientific papers and documents pertinent to
COVID-19 from a broad range of disciplines. The search interface for this corpus, https://covidscholar.org, now serves over 2000 unique users weekly.

While a variety of other COVID-19 literature aggregation efforts exist [5, 6, 7], COVIDScholar differs in the breadth of literature collected. In addition to the biological and medical research collected by other large-scale aggregation efforts such as CORD-19 [6] and LitCOVID [7], COVIDScholar's collection includes the full breadth of COVID-19 research, including public health, behavioural science, physical sciences, economics, psychology, and the humanities.

In this paper, we present a description of the COVIDScholar data intake pipeline and back-end infrastructure, and the NLP models used to power directed searches on the front-end search portal. We also present an analysis of the COVIDScholar corpus, and discuss trends in the dynamics of research output during the pandemic.
At the heart of COVIDScholar is the automated data intake and processing pipeline, depicted in Fig. 1. Data sources are continually checked for new or updated papers, patents, and clinical trials, which are then parsed, cleaned, analyzed with NLP models, and made searchable on https://covidscholar.org.

The COVIDScholar research corpus consists of research literature from 14 different open-access and pre-print services, listed in Table 1. For each of these, a web scraper regularly checks for new documents and updates to existing ones. Missing metadata is then collected from Crossref, and citation data is collected from OpenCitations [8]. The complete codebase for the data pipeline is available at https://github.com/COVID-19-Text-Mining.
Fig. 1: The data pipeline used to construct the COVIDScholar research corpus.
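The per-source polling and upsert cycle described above can be sketched as follows. The `fetch_new` scraper stub, the in-memory `store`, and the DOI-keyed upsert are illustrative stand-ins for the actual scrapers and database backend in the linked repository, not the production implementation.

```python
# Sketch of the intake loop: each source is polled for new or updated
# documents, which are then cleaned and staged for downstream NLP analysis.

def fetch_new(source):
    # Placeholder scraper: a real scraper would hit the source's API or
    # HTML listing and return raw document records.
    return [{"doi": "10.1000/example.1", "title": "An Example Paper",
             "abstract": "...", "source": source}]

def clean(doc):
    # Normalize fields used downstream (e.g. for deduplication).
    doc["title_key"] = doc["title"].strip().lower()
    return doc

def intake_cycle(sources, store):
    for source in sources:
        for raw in fetch_new(source):
            doc = clean(raw)
            # Upsert keyed on DOI so updated versions replace stale copies.
            store[doc["doi"]] = doc
    return store

store = intake_cycle(["preprints.org", "biorxiv"], {})
```

In the real pipeline each cycle runs continuously per source; the upsert-by-identifier pattern is what lets re-scraped updates overwrite stale copies rather than accumulate.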
| Source                        | COVID-19 Publication Count |
| preprints.org [9]             | 923   |
| osf.io [10]                   | 337   |
| lens.org [11]                 | 98    |
| SSRN [12]                     | 3491  |
| Psyarxiv [13]                 | 691   |
| CORD-19 [6]                   | 1135  |
| Dimensions.ai [14]            | 6489  |
| Elsevier [15]                 | 6735  |
| Chemrxiv [16]                 | 292   |
| LitCovid [17]                 | 51807 |
| Biorxiv [18]/Medrxiv [19]     | 8832  |
| NBER.org [20]                 | 261   |
| COVIDScholar User Submission  | 25    |

Table 1: The sources of papers, patents, and clinical trials in the COVIDScholar collection, with the count of COVID-19-related publications from each source.

After collection, these publications are parsed into a unified format, cleaned, and resolved to remove duplicates. Publications are identified as duplicates when they share any of: DOI (up to version number), PubMed ID, or uncased title. For clinical trials without valid document identifiers, a shared title is used to identify duplicates. In cases where there are multiple versions of a single paper (most commonly, a pre-print and a published version), a combined single document is produced, whose contents are selected on a field-by-field basis using a priority system. Published versions and higher version numbers (based on DOI) are given higher priority, and sources are otherwise prioritized based on the quality of their text.

In cases where full-text PDFs are available, text is parsed from the document using pdfminer (for PDFs with embedded text [21]) or OCR. However, it is our experience that, in general, text extracted in this manner is not of sufficient quality to be used by the classification and relevance NLP models, and at this time it is used solely for text searches.

Abstracts are classified based on their relevance to COVID-19, topic, discipline, and field. Publications are classified into 5 disciplines: Biological & Chemical Sciences, Medical Sciences, Public Health, Physical Sciences, and Humanities & Social Sciences. A paper may belong to any number of disciplines. Each discipline is composed of 12-15 fields. The breakdown of fields by discipline is shown in the supplementary material (S.1). Publications for which an abstract cannot be found are not classified. Keywords are also extracted from titles and abstracts using an unsupervised approach, as described in Sec. 3.

Our web portal, COVIDScholar.org, provides an accessible user interface to a variety of literature search tools and information retrieval algorithms tuned specifically for the needs of COVID-19 researchers. Because there still remains a great deal that we do not know about the disease, we have directed our efforts towards developing tools that can extend beyond information retrieval and aid researchers at the knowledge discovery phase as well. To do this, we have combined new machine learning and natural language processing techniques with proven information retrieval approaches to create the search algorithms behind COVIDScholar, which we describe in the remainder of this section.

Machine learning algorithms can be used to identify emerging trends in the literature and correlate them with similar patterns from pre-existing research. For this reason, we chose to base our search back end on the Vespa engine [22], which provides a high level of performance, wide scalability, and easy integration with custom machine learning models. For example, the default search result ranking profile on COVIDScholar.org combines the BM25 relevance score with a "COVID-19 relevance" score calculated by a classification model trained to predict whether a paper discusses the SARS-CoV-2 virus or COVID-19. We observe that papers from before the COVID-19 pandemic that are related to certain viruses/diseases tend to receive high relevance scores, especially papers on the original SARS and other respiratory diseases. SARS-CoV-2 shares 79% of its genome sequence identity with the SARS-CoV virus [23], and there are many similarities between how the two viruses enter cells, replicate, and transmit between hosts [24]. Because the relevance classification model gives a higher score to studies on these similar diseases, search results are more likely to contain relevant information, even if it is not directly focused on COVID-19. For example, the transmembrane protease TMPRSS2 plays an important role in viral entry and spread for both SARS-CoV and SARS-CoV-2, and its inhibition is a promising avenue for treating COVID-19 [25]. A wealth of information on strategies to inhibit TMPRSS2 activity and their efficacy in blocking SARS-CoV from entering host cells was available in the early days of the COVID-19 pandemic. These studies were boosted in search results because of their higher relevance scores, thereby bringing potentially useful information to the attention of researchers more directly. In comparison, results of a Google Scholar search for "TMPRSS2" (with results containing "COVID-19" and "SARS-CoV-2" filtered out) are dominated by studies on the protease's role in various cancers.

COVIDScholar also provides tools that utilize unsupervised document embeddings so that searches can be performed within "related documents", automatically linking research papers together by topics, methods, drugs, and other key pieces of information. Documents are sorted by similarity via the cosine distances between unsupervised document embeddings [26], which is then combined with the overall result-ranking score mentioned above.
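The ranking combination described here might be sketched as below. The additive combination rule, the weights, and the helper names are assumptions for illustration only; the production ranking profile lives in the Vespa configuration and is not reproduced here.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_score(bm25, covid_relevance, query_emb=None, doc_emb=None,
               w_rel=1.0, w_sim=1.0):
    # Hypothetical combination rule: lexical BM25 match boosted by the
    # classifier's COVID-19 relevance score, plus optional embedding
    # similarity for "related documents" searches.
    score = bm25 + w_rel * covid_relevance
    if query_emb is not None and doc_emb is not None:
        score += w_sim * cosine_similarity(query_emb, doc_emb)
    return score

# A pre-COVID SARS paper with a high model-predicted relevance can outrank
# a paper with a slightly better lexical match but low relevance.
sars_paper = rank_score(bm25=2.0, covid_relevance=0.9)
other_paper = rank_score(bm25=2.3, covid_relevance=0.1)
```

This is what allows the TMPRSS2/SARS-CoV studies discussed above to surface even when they never mention COVID-19 explicitly.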
This allows users to focus their results on a more specific domain without having to repeatedly pick and choose new search terms to add to their queries. Users can also filter all of the documents in the database by broader subjects relevant to COVID-19 (treatment, transmission, case reports, etc.), which are determined through the application of machine learning models trained on a smaller number of hand-labeled examples. Combined, these tools have allowed us to create much more targeted tools for literature search and knowledge discovery than would otherwise be possible.
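The duplicate-resolution scheme from the data-pipeline section (matching on DOI up to version number, PubMed ID, or uncased title, then merging field-by-field in priority order) can be sketched as follows; the field names and the priority ordering are illustrative assumptions.

```python
import re

def dedup_key(doc):
    # Documents match on any of: DOI stripped of a trailing version
    # suffix (e.g. "v2"), PubMed ID, or case-insensitive title.
    if doc.get("doi"):
        return ("doi", re.sub(r"v\d+$", "", doc["doi"]))
    if doc.get("pubmed_id"):
        return ("pmid", doc["pubmed_id"])
    return ("title", doc["title"].strip().lower())

def merge(docs, priority):
    # Field-by-field merge: for each field, take the value from the
    # highest-priority version that has it (published over pre-print).
    docs = sorted(docs, key=lambda d: priority.index(d["source"]))
    merged = {}
    for doc in docs:
        for field, value in doc.items():
            merged.setdefault(field, value)
    return merged

priority = ["publisher", "preprint"]  # illustrative ordering
versions = [
    {"source": "preprint", "doi": "10.1101/2020.01.01.12345v2",
     "full_text": "..."},
    {"source": "publisher", "doi": "10.1101/2020.01.01.12345",
     "abstract": "final"},
]
combined = merge(versions, priority)
```

Both versions share a `dedup_key`, so they collapse into one document that keeps the publisher's abstract while retaining the pre-print's full text.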
Classification of abstracts is performed using a fine-tuned SciBERT [27] model. While other BERT models pre-trained on scientific text exist (e.g. BioBERT [28], MedBERT [29], and ClinicalBERT [30]), we selected SciBERT due to its broad, multidisciplinary training corpus, which we expect to more closely resemble the COVIDScholar corpus than models pre-trained on a single discipline. SciBERT has state-of-the-art performance on the task of paper domain classification [31], as well as on a number of benchmarks in the biomedical domain [32, 33, 34], the most common discipline in the COVIDScholar corpus. A single fully-connected layer with sigmoid activation is used as a classification head, and the model is fine-tuned for 4 epochs using 2600 human-annotated abstracts. (Abstracts were annotated by members of the Rapid Reviews: COVID-19 [35] editorial team.)

ROC curves for the classifier's performance on each top-level discipline using 20-fold cross-validation are shown in Fig. 2. The classifier performs extremely well, with F1 scores above 0.73 for all disciplines. Performance metrics of the discipline classifier are displayed in Table 2, compared to a baseline random forest model using TF-IDF features.

On three disciplines (Medical Sciences, Physical Sciences, and Humanities & Social Sciences) the SciBERT-based discipline classifier offers a significant performance advantage over the baseline random forest/TF-IDF model, with F1
|                          | Biological & Chemical Sciences | Medical Sciences | Public Health | Physical Sciences | Humanities & Social Sciences |
| SciBERT       F1         | 0.92 | 0.85 | 0.73 | 0.78 | 0.92 |
|               Precision  | 0.92 | 0.80 | 0.74 | 0.78 | 0.88 |
|               Recall     | 0.92 | 0.80 | 0.75 | 0.81 | 0.92 |
|               Accuracy   | 0.92 | 0.85 | 0.73 | 0.79 | 0.92 |
| Random Forest F1         | 0.90 | 0.63 | 0.73 | 0.68 | 0.78 |
|               Precision  | 0.93 | 0.77 | 0.83 | 0.81 | 0.89 |
|               Recall     | 0.89 | 0.55 | 0.67 | 0.59 | 0.73 |
|               Accuracy   | 0.92 | 0.84 | 0.81 | 0.83 | 0.90 |

Table 2: Scoring metrics of the SciBERT [27] and baseline random forest discipline classification models. Models were evaluated using 10-fold cross-validation on 2600 labeled abstracts. Input features to the random forest model were generated using TF-IDF.

scores which are between 0.1 and 0.14 higher. These are the broadest disciplines, encompassing multiple disparate fields. The large variability of subjects within these domains may account for the inability of TF-IDF-based models to classify them well.

For the remaining two disciplines, Biological & Chemical Sciences and Public Health, the F1 scores are similar between SciBERT and the baseline model. In the case of Biological & Chemical Sciences, this may be explained by the relatively distinctive vocabulary and narrow subjects within the discipline. Public Health was observed to have the largest inter-annotator disagreement, leading to lower performance by the classifier.

It is also notable that, while precision is broadly similar between the two models, the baseline model exhibits significantly lower recall. This may be due to unbalanced training data: no single discipline accounts for more than 33% of the total corpus. For search applications, often only a relatively small number of documents is relevant to each query. In this case, high recall is more desirable than high precision; in practice, the performance gap between the two models is larger than the relative F1 scores indicate.

On the task of binary classification as related to COVID-19, our current models perform similarly well, achieving an F1 score of 0.98.
While the binary classification task is significantly simpler from an NLP perspective (the majority of related papers contain "COVID-19" or some synonym), this still represents a significant performance improvement over the baseline model, which achieves an F1 score of 0.90. Given the relative simplicity of this task, in cases where an abstract is absent we classify a paper as related to COVID-19 based on its title.

Fig. 2: ROC curves for discipline classification of paper abstracts using a fine-tuned SciBERT [27] model adapted for classification. Training is performed using a set of 2500 human-annotated abstracts, and results shown are generated with 20-fold cross-validation.
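At inference time, the classification head described above (a single fully-connected layer with sigmoid activation over SciBERT's pooled output) amounts to the following sketch. The two-dimensional "pooled embedding", the weights, and the 0.5 threshold are toy stand-ins for the real 768-dimensional model, shown only to illustrate why a sigmoid (rather than softmax) head permits multi-label output.

```python
import math

DISCIPLINES = ["Biological & Chemical Sciences", "Medical Sciences",
               "Public Health", "Physical Sciences",
               "Humanities & Social Sciences"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify(pooled, weights, bias, threshold=0.5):
    # One logit per discipline; independent sigmoids (not softmax)
    # because a paper may belong to any number of disciplines.
    labels = []
    for name, w, b in zip(DISCIPLINES, weights, bias):
        logit = sum(wi * xi for wi, xi in zip(w, pooled)) + b
        if sigmoid(logit) >= threshold:
            labels.append(name)
    return labels

# Toy 2-d "pooled embedding" and weights, for illustration only.
pooled = [1.0, -0.5]
weights = [[2.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, 0.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0, -1.0, 0.0]
labels = classify(pooled, weights, bias)
```

With these toy weights the paper is assigned two disciplines at once, which a softmax head could not do.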
For the task of unsupervised keyword extraction, 63 abstracts were annotated by humans, and two statistical methods, TextRank [36] and TF-IDF [37], and two graph-based models, RaKUn [38] and Yake [39], were tested. Models were evaluated on the overlap between human-annotated and extracted keywords; results are shown in Table 3. Note that due to the inherent subjectivity of the keyword extraction task, scores are relatively low: the best performing model, RaKUn, has an F1 score of only 0.2. However, after manual inspection, the quality of extracted keywords from this model was deemed reasonable for display on the search portal.

| Model    | Precision | Recall | F1   |
| RaKUn    | 0.17      | 0.33   | 0.20 |
| Yake     | 0.11      | 0.45   | 0.15 |
| TextRank | 0.06      | 0.36   | 0.09 |
| TF-IDF   | 0.10      | 0.09   | 0.08 |

Table 3: Precision, recall, and F1 scores for 4 unsupervised keyword extractors: RaKUn [38], Yake [39], TextRank [36], and TF-IDF [37]. Output from the keyword extractors was compared to 63 abstracts with human-annotated keywords.

To better visualize the embedding of COVID-19-related phrases and find latent relationships between biomedical terms, we designed a tool based on Embedding Projector [40]. A screenshot of the tool is shown in Fig. 3. We utilize FastText [41] embeddings for the embedding projector, with an embedding dimension of 100. Embeddings are trained on the abstracts of all papers which have been classified as relevant to COVID-19.

For the purpose of visualization, embeddings must be projected to a lower dimensional space (2D or 3D). The dimensionality reduction techniques used here include principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (t-SNE). Users can set various parameters and perform the dimensionality reduction via an
Fig. 3: A screenshot of the embedding projector visualizing tokens similar to "spike protein", using FastText [41] embeddings trained on the COVIDScholar corpus.

interactive page. They can also load and visualize cached results on the server with default parameters.

Cosine distance is used to measure the similarity between phrases. If the cosine distance between two phrases is small, they are likely to have similar meaning:

    CosineDistance(p1, p2) = 1 - (Emb(p1) · Emb(p2)) / (||Emb(p1)|| ||Emb(p2)||)

where p1 and p2 are two phrases, and Emb maps a phrase to its embedded representation in the learned semantic space.
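Taking cosine distance as one minus cosine similarity (so that a small distance indicates similar meaning), the measure can be implemented directly from the definition. The three-dimensional vectors below are stand-ins for the 100-dimensional FastText phrase embeddings.

```python
import math

def cosine_distance(u, v):
    # 1 - cos(theta): 0 for parallel vectors (similar phrases),
    # approaching 2 for vectors pointing in opposite directions.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy stand-in embeddings; real vectors come from the trained FastText model.
spike = [0.9, 0.1, 0.3]        # hypothetical vector for "spike protein"
s_protein = [0.88, 0.12, 0.3]  # hypothetical vector for "S protein"
unrelated = [-0.2, 0.9, -0.1]
```

Nearest-neighbor queries in the projector reduce to sorting candidate phrases by this distance from the query phrase's embedding.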
As of October 2020, the COVIDScholar corpus consists of 150,113 total documents, of which 143,887 are papers. The remainder is composed of 3306 patents, 1712 clinical trials, 1025 book chapters, and 183 datasets. Of the papers, 81,106 are classified as related to COVID-19, approximately equally split between preprints and published papers: 44% pre-prints, 56% published. A breakdown by discipline of the COVID-19-relevant papers is shown in Table 4. As may be expected, Public Health and Biological & Chemical Sciences are the most represented disciplines, with respectively 56% and 42% of the corpus tagged as members of these disciplines. Overlap between these two disciplines is relatively small (only 3295 papers are classified as belonging to both), and so the vast majority of the corpus, 50,787 papers, belongs to at least one of the two.

| Discipline                      | Paper Count | Fraction of Total |
| Biological & Chemical Sciences  | 23227       | 0.42              |
| Humanities & Social Sciences    | 17464       | 0.31              |
| Medical Sciences                | 21023       | 0.38              |
| Physical Sciences               | 17214       | 0.31              |
| Public Health                   | 30855       | 0.56              |

Table 4: The number of papers and the fraction of total COVID-19-related papers in the COVIDScholar corpus for each discipline. Only papers with abstracts are classified and included in the final count. Note that a given paper may have any number of discipline labels.
Fig. 4: Cumulative count by primary discipline of COVID-19 papers in the COVIDScholar database, and the total number of reported US COVID-19 cases, during the first 10 months of 2020. Papers are categorized by the classification model described in Sec. 3 and assigned to the discipline with the highest predicted likelihood. Case data from The New York Times, based on reports from state and local health agencies. Note that only those papers with abstracts available are classified, and so the publication count is somewhat lower than the total from Sec. 4.1.
Papers marked not relevant to COVID-19 are a combination of papers on related diseases, such as SARS and MERS, and papers with no relation to COVID-19.
Fig. 5: Fraction of total COVID-19 papers by primary discipline. Fractions are calculated over the previous calendar month. Papers are categorized by the classification model described in Sec. 3 and assigned to the discipline with the highest predicted likelihood.
A breakdown of research by discipline over the course of 2020 is shown in Fig. 5, which depicts the fraction of monthly COVID-19 publications primarily associated with each discipline. From January to April, the relative popularity of the disciplines showed some shifts. While Biological & Chemical Sciences comprised 45% of the total corpus in January, by April that had decreased to 28%. This is largely accounted for by an increase in papers from the Physical and Medical Sciences: over the same period, the fraction of papers from Medical Sciences increased from 15% to 20% of the total, and Physical Sciences from 5% to 8%. By April, the fraction of the corpus from each discipline seems to have stabilized, with fluctuations of relative fractions under 1%. This further supports the evidence in Fig. 4 that research output had already reached its maximum rate by April/May; this seems to hold true on a discipline-by-discipline basis as well.

We investigate this increase in Fig. 6, where we have plotted the fraction of total monthly papers on selected mental health- and lockdown-related topics. Over the April-June period, there is a clear increase in research related to the psychological impacts of lockdown and social distancing, accounting for 6-8% of total monthly papers. Between March and April, many countries and territories instituted lockdown orders, and by April, over half of the world's population was under either compulsory or recommended shelter-in-place orders [46]. The corresponding emergence of a robust literature on the associated psychological impacts is the major driving force behind the increase in COVID-19 literature from Humanities & Social Sciences.
We have developed and implemented a scalable research aggregation, analysis,and dissemination infrastructure, and created a targeted corpus of over 81,000
Fig. 6: Fraction of COVID-19 literature on mental health- and lockdown-related topics, on a monthly basis.
COVID-19-relevant research documents. The associated search portal, https://covidscholar.org, serves over 2000 weekly scientific users.

While the large amount of open data and the enormous scientific interest in COVID-19 have made it an ideal use-case, the infrastructure is domain-agnostic, and presents a blueprint for future large-scale scientific literature aggregation efforts.

While to date the COVIDScholar research corpus has primarily been used for front-end user search, it provides a rich opportunity for NLP analysis. Recent work [47] has highlighted the ability of NLP to discover latent knowledge from unstructured scientific text, utilizing information from thousands of research papers. We are now moving to employ similar techniques here, applied to problems such as drug re-purposing and predicting protein-protein interactions.
Portions of this work were supported by the C3.ai Digital Transformation Institute and the Laboratory Directed Research and Development Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy Contract No. DE-AC02-05CH11231.

The text corpus analysis and development of machine learning algorithms were supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on the response to COVID-19, with funding provided by the Coronavirus CARES Act.

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.
We are thankful to the editorial team of Rapid Reviews: COVID-19 for theirassistance in annotating text.
References
[1] url:
[2] In: PLOS Medicine. doi: 10.1371/journal.pmed.1002549. url: https://doi.org/10.1371/journal.pmed.1002549.
[3] Nicholas Fraser et al. "Preprinting the COVID-19 pandemic". In: bioRxiv (2020).
[4] In: BMC Medicine. doi: 10.1186/s12916-020-01556-3.
[5] WHO COVID-19 Database. url: https://search.bvsalud.org/global-literature-on-novel-coronavirus-2019-ncov/.
[6] Lucy Lu Wang et al. CORD-19: The COVID-19 Open Research Dataset. 2020. arXiv: 2004.10706 [cs.DL].
[7] Qingyu Chen, Alexis Allot, and Zhiyong Lu. "Keep up with the latest coronavirus research". In: Nature 579 (Mar. 2020), p. 193. doi: 10.1038/d41586-020-00694-1.
[8] S. Peroni and D. Shotton. "OpenCitations, an infrastructure organization for open scholarship". In: Quantitative Science Studies.
[9] The Multidisciplinary Preprint Platform. url:
[10] url: https://osf.io/.
[11] The Lens COVID-19 Data Initiative. url: https://about.lens.org/covid-19/.
[12] Social Science Research Network. url:
[13] Introducing PsyArXiv: a preprint service for psychological science. Oct. 2016. url: http://blog.psyarxiv.com/2016/09/19/introducing-psyarxiv/.
[14] Dimensions COVID-19 Dataset. url:
[15] Elsevier Novel Coronavirus Information Center. Nov. 2020. url:
[16] Chemrxiv. url: https://chemrxiv.org/.
[17] Qingyu Chen, Alexis Allot, and Zhiyong Lu. "Keep up with the latest coronavirus research". In: Nature. doi: 10.1038/d41586-020-00694-1.
[18] Jocelyn Kaiser et al. New Preprint Server Aims to Be Biologists' Answer to Physicists' arXiv. Dec. 2017. url:
[19] New preprint server for medical research. 2019.
[20] NBER Working Papers. url:
[21] url: https://github.com/pdfminer/pdfminer.six.
[22] url: https://vespa.ai/.
[23] Roujian Lu et al. "Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding". In: The Lancet. issn: 1474547X. doi: 10.1016/S0140-6736(20)30251-8. url: http://dx.doi.org/10.1016/S0140-6736(20)30251-8.
[24] Ali A. Rabaan et al. "SARS-CoV-2, SARS-CoV, and MERS-CoV: A comparative overview". In: Infezioni in Medicina. issn: 11249390.
[25] Konrad H. Stopsack et al. "TMPRSS2 and COVID-19: Serendipity or Opportunity for Intervention?" In: Cancer Discovery.
[26] In: Proceedings of the 31st International Conference on Machine Learning - Volume 32. ICML'14. Beijing, China: JMLR.org, 2014, II-1188-II-1196.
[27] Iz Beltagy, Kyle Lo, and Arman Cohan. "SciBERT: Pretrained Language Model for Scientific Text". In: EMNLP. 2019. eprint: arXiv:1903.10676.
[28] Jinhyuk Lee et al. "BioBERT: a pre-trained biomedical language representation model for biomedical text mining". In: Bioinformatics (Sept. 2019). issn: 1367-4803. doi: 10.1093/bioinformatics/btz682. url: https://doi.org/10.1093/bioinformatics/btz682.
[29] Laila Rasmy et al. Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction. 2020. arXiv: 2005.12833 [cs.CL].
[30] Emily Alsentzer et al. "Publicly Available Clinical BERT Embeddings". In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. Minneapolis, Minnesota, USA: Association for Computational Linguistics, June 2019, pp. 72-78. doi: 10.18653/v1/W19-1909.
[31] WWW - World Wide Web Consortium (W3C). May 2015. url:
[32] In: BMC Bioinformatics. issn: 1471-2105. doi: 10.1186/s12859-019-2813-6. url: http://dx.doi.org/10.1186/s12859-019-2813-6.
[33] Benjamin Nye et al. "A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 197-207. doi: 10.18653/v1/P18-1019.
[34] In: Database. issn: 1758-0463. doi: 10.1093/database/bay060. url: https://doi.org/10.1093/database/bay060.
[35] "Rapid Reviews: COVID-19, publishes reviews of COVID-19 preprints". In: Rapid Reviews COVID-19 (Aug. 11, 2020). url: https://rapidreviewscovid19.mitpress.mit.edu/pub/wfavs1oc.
[36] Rada Mihalcea and Paul Tarau. "TextRank: Bringing Order into Text". In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Barcelona, Spain: Association for Computational Linguistics, July 2004, pp. 404-411.
[37] In: Information Processing & Management. issn: 0306-4573. doi: 10.1016/0306-4573(88)90021-0.
[38] In: ArXiv abs/1907.06458 (2019).
[39] Ricardo Campos et al. "YAKE! Collection-Independent Automatic Keyword Extractor". Feb. 2018. doi: 10.1007/978-3-319-76941-7_80.
[40] Daniel Smilkov et al. "Embedding projector: Interactive visualization and interpretation of embeddings". In: arXiv preprint arXiv:1611.05469 (2016).
[41] Piotr Bojanowski et al. "Enriching Word Vectors with Subword Information". In: arXiv preprint arXiv:1607.04606 (2016).
[42] url:
[43] url:
[44] url:
[45] url:
[46] url:
[47] In: Nature 571 (July 2019), pp. 95-98.