COVIDScholar: An automated COVID-19 research aggregation and analysis platform
Amalie Trewartha, John Dagdelen, Haoyan Huo, Kevin Cruse, Zheren Wang, Tanjin He, Akshay Subramanian, Yuxing Fei, Benjamin Justus, Kristin Persson, Gerbrand Ceder
Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Department of Materials Science & Engineering, University of California, Berkeley, Berkeley, CA 94720, USA
Indian Institute of Technology Roorkee, Roorkee, Uttarakhand 247667, India
Wuhan University, Wuhan, Hubei 430072, China
Abstract. The ongoing COVID-19 pandemic has had far-reaching effects throughout society, and science is no exception. The scale, speed, and breadth of the scientific community's COVID-19 response has led to the emergence of new research literature on a remarkable scale: as of October 2020, over 81,000 COVID-19-related scientific papers have been released, at a rate of over 250 per day. This has created a challenge for traditional methods of engagement with the research literature; the volume of new research is far beyond the ability of any human to read, and the urgency of response has led to an increasingly prominent role for pre-print servers and a diffusion of relevant research across sources. These factors have created a need for new tools to change the way scientific literature is disseminated.

COVIDScholar is a knowledge portal designed with the unique needs of the COVID-19 research community in mind, utilizing NLP to aid researchers in synthesizing the information spread across thousands of emergent research articles, patents, and clinical trials into actionable insights and new knowledge. The search interface for this corpus, https://covidscholar.org, now serves over 2000 unique users weekly. We also present an analysis of trends in COVID-19 research over the course of 2020.
The scientific community has responded to the COVID-19 pandemic with unprecedented speed, and as a result an enormous amount of research literature is rapidly emerging, at a rate of over 250 papers a day [1]. The urgency and volume of emerging research have caused pre-prints to take a prominent role in lieu of traditional journals, leading to widespread usage of pre-print servers for the first time in many fields, most prominently the biomedical sciences [2, 3]. While this allows new research to be disseminated to the community sooner, it also circumvents the role of journals in filtering poor or flawed papers and highlighting relevant research [4]. Additionally, the uniquely multi-disciplinary nature of the scientific community's response to the pandemic has led to pertinent research being dispersed across many open-access and pre-print services, no single one of which captures the entirety of the COVID-19 literature.

These challenges have created a need and an opportunity for new tools and methods that rethink the way in which researchers engage with the wealth of available COVID-19 scientific literature.

COVIDScholar is an effort to address these issues by using natural language processing (NLP) techniques to aggregate, analyze, and search the COVID-19 research literature. We have developed an automated, scalable infrastructure for scraping and integrating new research as it appears, and used it to construct a targeted corpus of over 81,000 scientific papers and documents pertinent to
COVID-19 from a broad range of disciplines. The search interface for this corpus, https://covidscholar.org, now serves over 2000 unique users weekly.

While a variety of other COVID-19 literature aggregation efforts exist [5, 6, 7], COVIDScholar differs in the breadth of literature collected. In addition to the biological and medical research collected by other large-scale aggregation efforts such as CORD-19 [6] and LitCOVID [7], COVIDScholar's collection includes the full breadth of COVID-19 research, including public health, behavioural science, physical sciences, economics, psychology, and the humanities.

In this paper, we present a description of the COVIDScholar data intake pipeline and back-end infrastructure, and the NLP models used to power directed searches on the front-end search portal. We also present an analysis of the COVIDScholar corpus, and discuss trends in the dynamics of research output during the pandemic.
At the heart of COVIDScholar is the automated data intake and processing pipeline, depicted in Fig. 1. Data sources are continually checked for new or updated papers, patents, and clinical trials, which are then parsed, cleaned, analyzed with NLP models, and made searchable on https://covidscholar.org.

The COVIDScholar research corpus consists of research literature from 14 different open-access and pre-print services, listed in Table 1. For each of these, a web scraper regularly checks for new documents and updates to existing ones. Missing metadata is then collected from Crossref, and citation data is collected from OpenCitations [8]. The complete codebase for the data pipeline is available at https://github.com/COVID-19-Text-Mining.
Fig. 1: The data pipeline used to construct the COVIDScholar research corpus.
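The per-source polling and upsert cycle described above can be sketched as follows. The `fetch_new` scraper stub, the in-memory `store`, and the DOI-keyed upsert are illustrative stand-ins for the actual scrapers and database backend in the linked repository, not the production implementation.

```python
# Sketch of the intake loop: each source is polled for new or updated
# documents, which are then cleaned and staged for downstream NLP analysis.

def fetch_new(source):
    # Placeholder scraper: a real scraper would hit the source's API or
    # HTML listing and return raw document records.
    return [{"doi": "10.1000/example.1", "title": "An Example Paper",
             "abstract": "...", "source": source}]

def clean(doc):
    # Normalize fields used downstream (e.g. for deduplication).
    doc["title_key"] = doc["title"].strip().lower()
    return doc

def intake_cycle(sources, store):
    for source in sources:
        for raw in fetch_new(source):
            doc = clean(raw)
            # Upsert keyed on DOI so updated versions replace stale copies.
            store[doc["doi"]] = doc
    return store

store = intake_cycle(["preprints.org", "biorxiv"], {})
```

In the real pipeline each cycle runs continuously per source; the upsert-by-identifier pattern is what lets re-scraped updates overwrite stale copies rather than accumulate.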
| Source                        | COVID-19 Publication Count |
| preprints.org [9]             | 923   |
| osf.io [10]                   | 337   |
| lens.org [11]                 | 98    |
| SSRN [12]                     | 3491  |
| Psyarxiv [13]                 | 691   |
| CORD-19 [6]                   | 1135  |
| Dimensions.ai [14]            | 6489  |
| Elsevier [15]                 | 6735  |
| Chemrxiv [16]                 | 292   |
| LitCovid [17]                 | 51807 |
| Biorxiv [18]/Medrxiv [19]     | 8832  |
| NBER.org [20]                 | 261   |
| COVIDScholar User Submission  | 25    |

Table 1: The sources of papers, patents, and clinical trials in the COVIDScholar collection, with the count of COVID-19-related publications from each source.

After collection, these publications are parsed into a unified format, cleaned, and resolved to remove duplicates. Publications are identified as duplicates when they share any of: DOI (up to version number), PubMed ID, or uncased title. For clinical trials without valid document identifiers, a shared title is used to identify duplicates. In cases where there are multiple versions of a single paper (most commonly, a pre-print and a published version), a combined single document is produced, whose contents are selected on a field-by-field basis using a priority system. Published versions and higher version numbers (based on DOI) are given higher priority, and sources are otherwise prioritized based on the quality of their text.

In cases where full-text PDFs are available, text is parsed from the document using pdfminer (for PDFs with embedded text [21]) or OCR. However, it is our experience that, in general, text extracted in this manner is not of sufficient quality to be used by the classification and relevance NLP models, and at this time it is used solely for text searches.

Abstracts are classified based on their relevance to COVID-19, topic, discipline, and field. Publications are classified into 5 disciplines: Biological & Chemical Sciences, Medical Sciences, Public Health, Physical Sciences, and Humanities & Social Sciences. A paper may belong to any number of disciplines. Each discipline is composed of 12-15 fields. The breakdown of fields by discipline is shown in the supplementary material (S.1). Publications for which an abstract cannot be found are not classified. Keywords are also extracted from titles and abstracts using an unsupervised approach, as described in Sec. 3.

Our web portal, COVIDScholar.org, provides an accessible user interface to a variety of literature search tools and information retrieval algorithms tuned specifically for the needs of COVID-19 researchers. Because there still remains a great deal that we do not know about the disease, we have directed our efforts towards developing tools that can extend beyond information retrieval and aid researchers at the knowledge discovery phase as well. To do this, we have combined new machine learning and natural language processing techniques with proven information retrieval approaches to create the search algorithms behind COVIDScholar, which we describe in the remainder of this section.

Machine learning algorithms can be used to identify emerging trends in the literature and correlate them with similar patterns from pre-existing research. For this reason, we chose to base our search back end on the Vespa engine [22], which provides a high level of performance, wide scalability, and easy integration with custom machine learning models. For example, the default search result ranking profile on COVIDScholar.org combines the BM25 relevance score with a "COVID-19 relevance" score calculated by a classification model trained to predict whether a paper discusses the SARS-CoV-2 virus or COVID-19. We observe that papers from before the COVID-19 pandemic that are related to certain viruses/diseases tend to receive high relevance scores, especially papers on the original SARS and other respiratory diseases. SARS-CoV-2 shares 79% of its genome sequence identity with the SARS-CoV virus [23], and there are many similarities between how the two viruses enter cells, replicate, and transmit between hosts [24]. Because the relevance classification model gives a higher score to studies on these similar diseases, search results are more likely to contain relevant information, even if it is not directly focused on COVID-19. For example, the transmembrane protease TMPRSS2 plays an important role in viral entry and spread for both SARS-CoV and SARS-CoV-2, and its inhibition is a promising avenue for treating COVID-19 [25]. A wealth of information on strategies to inhibit TMPRSS2 activity and their efficacy in blocking SARS-CoV from entering host cells was available in the early days of the COVID-19 pandemic. These studies were boosted in search results because of their higher relevance scores, thereby bringing potentially useful information to the attention of researchers more directly. In comparison, results of a Google Scholar search for "TMPRSS2" (with results containing "COVID-19" and "SARS-CoV-2" filtered out) are dominated by studies on the protease's role in various cancers.

COVIDScholar also provides tools that utilize unsupervised document embeddings so that searches can be performed within "related documents", automatically linking research papers together by topics, methods, drugs, and other key pieces of information. Documents are sorted by similarity via the cosine distances between unsupervised document embeddings [26], which is then combined with the overall result-ranking score mentioned above.
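The ranking combination described here might be sketched as below. The additive combination rule, the weights, and the helper names are assumptions for illustration only; the production ranking profile lives in the Vespa configuration and is not reproduced here.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_score(bm25, covid_relevance, query_emb=None, doc_emb=None,
               w_rel=1.0, w_sim=1.0):
    # Hypothetical combination rule: lexical BM25 match boosted by the
    # classifier's COVID-19 relevance score, plus optional embedding
    # similarity for "related documents" searches.
    score = bm25 + w_rel * covid_relevance
    if query_emb is not None and doc_emb is not None:
        score += w_sim * cosine_similarity(query_emb, doc_emb)
    return score

# A pre-COVID SARS paper with a high model-predicted relevance can outrank
# a paper with a slightly better lexical match but low relevance.
sars_paper = rank_score(bm25=2.0, covid_relevance=0.9)
other_paper = rank_score(bm25=2.3, covid_relevance=0.1)
```

This is what allows the TMPRSS2/SARS-CoV studies discussed above to surface even when they never mention COVID-19 explicitly.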
This allows users to focus their results on a more specific domain without having to repeatedly pick and choose new search terms to add to their queries. Users can also filter all of the documents in the database by broader subjects relevant to COVID-19 (treatment, transmission, case reports, etc.), which are determined through the application of machine learning models trained on a smaller number of hand-labeled examples. Combined, these tools have allowed us to create much more targeted tools for literature search and knowledge discovery than would otherwise be possible.
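The duplicate-resolution scheme from the data-pipeline section (matching on DOI up to version number, PubMed ID, or uncased title, then merging field-by-field in priority order) can be sketched as follows; the field names and the priority ordering are illustrative assumptions.

```python
import re

def dedup_key(doc):
    # Documents match on any of: DOI stripped of a trailing version
    # suffix (e.g. "v2"), PubMed ID, or case-insensitive title.
    if doc.get("doi"):
        return ("doi", re.sub(r"v\d+$", "", doc["doi"]))
    if doc.get("pubmed_id"):
        return ("pmid", doc["pubmed_id"])
    return ("title", doc["title"].strip().lower())

def merge(docs, priority):
    # Field-by-field merge: for each field, take the value from the
    # highest-priority version that has it (published over pre-print).
    docs = sorted(docs, key=lambda d: priority.index(d["source"]))
    merged = {}
    for doc in docs:
        for field, value in doc.items():
            merged.setdefault(field, value)
    return merged

priority = ["publisher", "preprint"]  # illustrative ordering
versions = [
    {"source": "preprint", "doi": "10.1101/2020.01.01.12345v2",
     "full_text": "..."},
    {"source": "publisher", "doi": "10.1101/2020.01.01.12345",
     "abstract": "final"},
]
combined = merge(versions, priority)
```

Both versions share a `dedup_key`, so they collapse into one document that keeps the publisher's abstract while retaining the pre-print's full text.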
Classification of abstracts is performed using a fine-tuned SciBERT [27] model. While other BERT models pre-trained on scientific text exist (e.g. BioBERT [28], MedBERT [29], and ClinicalBERT [30]), we selected SciBERT due to its broad, multidisciplinary training corpus, which we expect to more closely resemble the COVIDScholar corpus than models pre-trained on a single discipline. SciBERT has state-of-the-art performance on the task of paper domain classification [31], as well as on a number of benchmarks in the biomedical domain [32, 33, 34], the most common discipline in the COVIDScholar corpus. A single fully-connected layer with sigmoid activation is used as a classification head, and the model is fine-tuned for 4 epochs using 2600 human-annotated abstracts. (Abstracts were annotated by members of the Rapid Reviews: COVID-19 [35] editorial team.)

ROC curves for the classifier's performance on each top-level discipline using 20-fold cross-validation are shown in Fig. 2. The classifier performs extremely well, with F1 scores above 0.73 for all disciplines. Performance metrics of the discipline classifier are displayed in Table 2, compared to a baseline random forest model using TF-IDF features.

On three disciplines (Medical Sciences, Physical Sciences, and Humanities & Social Sciences) the SciBERT-based discipline classifier offers a significant performance advantage over the baseline random forest/TF-IDF model, with F1
|                          | Biological & Chemical Sciences | Medical Sciences | Public Health | Physical Sciences | Humanities & Social Sciences |
| SciBERT       F1         | 0.92 | 0.85 | 0.73 | 0.78 | 0.92 |
|               Precision  | 0.92 | 0.80 | 0.74 | 0.78 | 0.88 |
|               Recall     | 0.92 | 0.80 | 0.75 | 0.81 | 0.92 |
|               Accuracy   | 0.92 | 0.85 | 0.73 | 0.79 | 0.92 |
| Random Forest F1         | 0.90 | 0.63 | 0.73 | 0.68 | 0.78 |
|               Precision  | 0.93 | 0.77 | 0.83 | 0.81 | 0.89 |
|               Recall     | 0.89 | 0.55 | 0.67 | 0.59 | 0.73 |
|               Accuracy   | 0.92 | 0.84 | 0.81 | 0.83 | 0.90 |

Table 2: Scoring metrics of the SciBERT [27] and baseline random forest discipline classification models. Models were evaluated using 10-fold cross-validation on 2600 labeled abstracts. Input features to the random forest model were generated using TF-IDF.

scores which are between 0.1 and 0.14 higher. These are the broadest disciplines, encompassing multiple disparate fields. The large variability of subjects within these domains may account for the inability of TF-IDF-based models to classify them well.

For the remaining two disciplines, Biological & Chemical Sciences and Public Health, the F1 scores are similar between SciBERT and the baseline model. In the case of Biological & Chemical Sciences, this may be explained by the relatively distinctive vocabulary and narrow subjects within the discipline. Public Health was observed to have the largest inter-annotator disagreement, leading to lower performance by the classifier.

It is also notable that, while precision is broadly similar between the two models, the baseline model exhibits significantly lower recall. This may be due to unbalanced training data: no single discipline accounts for more than 33% of the total corpus. For search applications, often only a relatively small number of documents is relevant to each query. In this case, high recall is more desirable than high precision; in practice, the performance gap between the two models is larger than the relative F1 scores indicate.

On the task of binary classification as related to COVID-19, our current models perform similarly well, achieving an F1 score of 0.98.
While the binary classification task is significantly simpler from an NLP perspective (the majority of related papers contain "COVID-19" or some synonym), this still represents a significant performance improvement over the baseline model, which achieves an F1 score of 0.90. Given the relative simplicity of this task, in cases where an abstract is absent we classify a paper as related to COVID-19 based on its title.

Fig. 2: ROC curves for discipline classification of paper abstracts using a fine-tuned SciBERT [27] model adapted for classification. Training is performed using a set of 2500 human-annotated abstracts, and results shown are generated with 20-fold cross-validation.
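At inference time, the classification head described above (a single fully-connected layer with sigmoid activation over SciBERT's pooled output) amounts to the following sketch. The two-dimensional "pooled embedding", the weights, and the 0.5 threshold are toy stand-ins for the real 768-dimensional model, shown only to illustrate why a sigmoid (rather than softmax) head permits multi-label output.

```python
import math

DISCIPLINES = ["Biological & Chemical Sciences", "Medical Sciences",
               "Public Health", "Physical Sciences",
               "Humanities & Social Sciences"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify(pooled, weights, bias, threshold=0.5):
    # One logit per discipline; independent sigmoids (not softmax)
    # because a paper may belong to any number of disciplines.
    labels = []
    for name, w, b in zip(DISCIPLINES, weights, bias):
        logit = sum(wi * xi for wi, xi in zip(w, pooled)) + b
        if sigmoid(logit) >= threshold:
            labels.append(name)
    return labels

# Toy 2-d "pooled embedding" and weights, for illustration only.
pooled = [1.0, -0.5]
weights = [[2.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, 0.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0, -1.0, 0.0]
labels = classify(pooled, weights, bias)
```

With these toy weights the paper is assigned two disciplines at once, which a softmax head could not do.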
For the task of unsupervised keyword extraction, 63 abstracts were annotated by humans, and two statistical methods, TextRank [36] and TF-IDF [37], and two graph-based models, RaKUn [38] and Yake [39], were tested. Models were evaluated on the overlap between human-annotated and extracted keywords; results are shown in Table 3. Note that due to the inherent subjectivity of the keyword extraction task, scores are relatively low: the best performing model, RaKUn, has an F1 score of only 0.2. However, after manual inspection, the quality of extracted keywords from this model was deemed reasonable for display on the search portal.

| Model    | Precision | Recall | F1   |
| RaKUn    | 0.17      | 0.33   | 0.20 |
| Yake     | 0.11      | 0.45   | 0.15 |
| TextRank | 0.06      | 0.36   | 0.09 |
| TF-IDF   | 0.10      | 0.09   | 0.08 |

Table 3: Precision, recall, and F1 scores for 4 unsupervised keyword extractors: RaKUn [38], Yake [39], TextRank [36], and TF-IDF [37]. Output from the keyword extractors was compared to 63 abstracts with human-annotated keywords.

To better visualize the embedding of COVID-19-related phrases and find latent relationships between biomedical terms, we designed a tool based on Embedding Projector [40]. A screenshot of the tool is shown in Fig. 3. We utilize FastText [41] embeddings for the embedding projector, with an embedding dimension of 100. Embeddings are trained on the abstracts of all papers which have been classified as relevant to COVID-19.

For the purpose of visualization, embeddings must be projected to a lower dimensional space (2D or 3D). The dimensionality reduction techniques used here include principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (t-SNE). Users can set various parameters and perform the dimensionality reduction via an
Fig. 3: A screenshot of the embedding projector visualizing tokens similar to "spike protein", using FastText [41] embeddings trained on the COVIDScholar corpus.

interactive page. They can also load and visualize cached results on the server with default parameters.

Cosine distance is used to measure the similarity between phrases. If the cosine distance between two phrases is small, they are likely to have similar meaning:

    CosineDistance(p1, p2) = 1 - (Emb(p1) · Emb(p2)) / (||Emb(p1)|| ||Emb(p2)||)

where p1 and p2 are two phrases, and Emb maps a phrase to its embedded representation in the learned semantic space.
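Taking cosine distance as one minus cosine similarity (so that a small distance indicates similar meaning), the measure can be implemented directly from the definition. The three-dimensional vectors below are stand-ins for the 100-dimensional FastText phrase embeddings.

```python
import math

def cosine_distance(u, v):
    # 1 - cos(theta): 0 for parallel vectors (similar phrases),
    # approaching 2 for vectors pointing in opposite directions.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy stand-in embeddings; real vectors come from the trained FastText model.
spike = [0.9, 0.1, 0.3]        # hypothetical vector for "spike protein"
s_protein = [0.88, 0.12, 0.3]  # hypothetical vector for "S protein"
unrelated = [-0.2, 0.9, -0.1]
```

Nearest-neighbor queries in the projector reduce to sorting candidate phrases by this distance from the query phrase's embedding.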
As of October 2020, the COVIDScholar corpus consists of 150,113 total documents, of which 143,887 are papers. The remainder is composed of 3306 patents, 1712 clinical trials, 1025 book chapters, and 183 datasets. Of the papers, 81,106 are classified as related to COVID-19, approximately equally split between preprints and published papers: 44% pre-prints, 56% published. A breakdown by discipline of the COVID-19-relevant papers is shown in Table 4. As may be expected, Public Health and Biological & Chemical Sciences are the most represented disciplines, with respectively 56% and 42% of the corpus tagged as members of these disciplines. Overlap between these two disciplines is relatively small (only 3295 papers are classified as belonging to both), and so the vast majority of the corpus, 50,787 papers, belongs to at least one of the two.

| Discipline                      | Paper Count | Fraction of Total |
| Biological & Chemical Sciences  | 23227       | 0.42              |
| Humanities & Social Sciences    | 17464       | 0.31              |
| Medical Sciences                | 21023       | 0.38              |
| Physical Sciences               | 17214       | 0.31              |
| Public Health                   | 30855       | 0.56              |

Table 4: The number of papers and the fraction of total COVID-19-related papers in the COVIDScholar corpus for each discipline. Only papers with abstracts are classified and included in the final count. Note that a given paper may have any number of discipline labels.
Fig. 4: Cumulative count by primary discipline of COVID-19 papers in the COVIDScholar database, and the total number of reported US COVID-19 cases, during the first 10 months of 2020. Papers are categorized by the classification model described in Sec. 3 and assigned to the discipline with the highest predicted likelihood. Case data from The New York Times, based on reports from state and local health agencies. Note that only those papers with abstracts available are classified, and so the publication count is somewhat lower than the total from Sec. 4.1.
Papers marked not relevant to COVID-19 are a combination of papers on related diseases, such as SARS and MERS, and papers with no relation to COVID-19.
Fig. 5: Fraction of total COVID-19 papers by primary discipline. Fractions are calculated over the previous calendar month. Papers are categorized by the classification model described in Sec. 3 and assigned to the discipline with the highest predicted likelihood.
A breakdown of research by discipline over the course of 2020 is shown in Fig. 5, which depicts the fraction of monthly COVID-19 publications primarily associated with each discipline. From January to April, the relative popularity of the disciplines showed some shifts. While Biological & Chemical Sciences comprised 45% of the total corpus in January, by April that had decreased to 28%. This is largely accounted for by an increase in papers from the Physical and Medical Sciences: over the same period, the fraction of papers from Medical Sciences increased from 15% to 20% of the total, and Physical Sciences from 5% to 8%. By April, the fraction of the corpus from each discipline seems to have stabilized, with fluctuations of relative fractions under 1%. This further supports the evidence in Fig. 4 that research output had already reached its maximum rate by April/May; this seems to hold true on a discipline-by-discipline basis as well.

We investigate this increase in Fig. 6, where we have plotted the fraction of total monthly papers on selected mental health- and lockdown-related topics. Over the April-June period, there is a clear increase in research related to the psychological impacts of lockdown and social distancing, accounting for 6-8% of total monthly papers. Between March and April, many countries and territories instituted lockdown orders, and by April, over half of the world's population was under either compulsory or recommended shelter-in-place orders [46]. The corresponding emergence of a robust literature on the associated psychological impacts is the major driving force behind the increase in COVID-19 literature from Humanities & Social Sciences.
We have developed and implemented a scalable research aggregation, analysis,and dissemination infrastructure, and created a targeted corpus of over 81,000
Fig. 6: Fraction of COVID-19 literature on mental health- and lockdown-related topics, on a monthly basis.
COVID-19-relevant research documents. The associated search portal, https://covidscholar.org, serves over 2000 weekly scientific users.

While the large amount of open data and the enormous scientific interest in COVID-19 have made it an ideal use-case, the infrastructure is domain-agnostic, and presents a blueprint for future large-scale scientific literature aggregation efforts.

While to date the COVIDScholar research corpus has primarily been used for front-end user search, it provides a rich opportunity for NLP analysis. Recent work [47] has highlighted the ability of NLP to discover latent knowledge from unstructured scientific text, utilizing information from thousands of research papers. We are now moving to employ similar techniques here, applied to problems such as drug re-purposing and predicting protein-protein interactions.
Portions of this work were supported by the C3.ai Digital Transformation Institute and the Laboratory Directed Research and Development Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy Contract No. DE-AC02-05CH11231.

The text corpus analysis and development of machine learning algorithms were supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on the response to COVID-19, with funding provided by the Coronavirus CARES Act.

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.
We are thankful to the editorial team of Rapid Reviews: COVID-19 for theirassistance in annotating text.
References
[1] url:
[2] In: PLOS Medicine. doi: 10.1371/journal.pmed.1002549. url: https://doi.org/10.1371/journal.pmed.1002549.
[3] Nicholas Fraser et al. "Preprinting the COVID-19 pandemic". In: bioRxiv (2020).
[4] In: BMC Medicine. doi: 10.1186/s12916-020-01556-3.
[5] WHO COVID-19 Database. url: https://search.bvsalud.org/global-literature-on-novel-coronavirus-2019-ncov/.
[6] Lucy Lu Wang et al. CORD-19: The COVID-19 Open Research Dataset. 2020. arXiv: 2004.10706 [cs.DL].
[7] Qingyu Chen, Alexis Allot, and Zhiyong Lu. "Keep up with the latest coronavirus research". In: Nature 579 (Mar. 2020), p. 193. doi: 10.1038/d41586-020-00694-1.
[8] S. Peroni and D. Shotton. "OpenCitations, an infrastructure organization for open scholarship". In: Quantitative Science Studies.
[9] The Multidisciplinary Preprint Platform. url:
[10] url: https://osf.io/.
[11] The Lens COVID-19 Data Initiative. url: https://about.lens.org/covid-19/.
[12] Social Science Research Network. url:
[13] Introducing PsyArXiv: a preprint service for psychological science. Oct. 2016. url: http://blog.psyarxiv.com/2016/09/19/introducing-psyarxiv/.
[14] Dimensions COVID-19 Dataset. url:
[15] Elsevier Novel Coronavirus Information Center. Nov. 2020. url:
[16] Chemrxiv. url: https://chemrxiv.org/.
[17] Qingyu Chen, Alexis Allot, and Zhiyong Lu. "Keep up with the latest coronavirus research". In: Nature. doi: 10.1038/d41586-020-00694-1.
[18] Jocelyn Kaiser et al. New Preprint Server Aims to Be Biologists' Answer to Physicists' arXiv. Dec. 2017. url:
[19] New preprint server for medical research. 2019.
[20] NBER Working Papers. url:
[21] url: https://github.com/pdfminer/pdfminer.six.
[22] url: https://vespa.ai/.
[23] Roujian Lu et al. "Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding". In: The Lancet. issn: 1474547X. doi: 10.1016/S0140-6736(20)30251-8. url: http://dx.doi.org/10.1016/S0140-6736(20)30251-8.
[24] Ali A. Rabaan et al. "SARS-CoV-2, SARS-CoV, and MERS-CoV: A comparative overview". In: Infezioni in Medicina. issn: 11249390.
[25] Konrad H. Stopsack et al. "TMPRSS2 and COVID-19: Serendipity or Opportunity for Intervention?" In: Cancer Discovery.
[26] In: Proceedings of the 31st International Conference on Machine Learning - Volume 32. ICML'14. Beijing, China: JMLR.org, 2014, II-1188-II-1196.
[27] Iz Beltagy, Kyle Lo, and Arman Cohan. "SciBERT: Pretrained Language Model for Scientific Text". In: EMNLP. 2019. eprint: arXiv:1903.10676.
[28] Jinhyuk Lee et al. "BioBERT: a pre-trained biomedical language representation model for biomedical text mining". In: Bioinformatics (Sept. 2019). issn: 1367-4803. doi: 10.1093/bioinformatics/btz682. url: https://doi.org/10.1093/bioinformatics/btz682.
[29] Laila Rasmy et al. Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction. 2020. arXiv: 2005.12833 [cs.CL].
[30] Emily Alsentzer et al. "Publicly Available Clinical BERT Embeddings". In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. Minneapolis, Minnesota, USA: Association for Computational Linguistics, June 2019, pp. 72-78. doi: 10.18653/v1/W19-1909.
[31] WWW - World Wide Web Consortium (W3C). May 2015. url:
[32] In: BMC Bioinformatics. issn: 1471-2105. doi: 10.1186/s12859-019-2813-6. url: http://dx.doi.org/10.1186/s12859-019-2813-6.
[33] Benjamin Nye et al. "A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 197-207. doi: 10.18653/v1/P18-1019.
[34] In: Database. issn: 1758-0463. doi: 10.1093/database/bay060. url: https://doi.org/10.1093/database/bay060.
[35] "Rapid Reviews: COVID-19, publishes reviews of COVID-19 preprints". In: Rapid Reviews COVID-19 (Aug. 11, 2020). url: https://rapidreviewscovid19.mitpress.mit.edu/pub/wfavs1oc.
[36] Rada Mihalcea and Paul Tarau. "TextRank: Bringing Order into Text". In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Barcelona, Spain: Association for Computational Linguistics, July 2004, pp. 404-411.
[37] In: Information Processing & Management. issn: 0306-4573. doi: 10.1016/0306-4573(88)90021-0.
[38] In: ArXiv abs/1907.06458 (2019).
[39] Ricardo Campos et al. "YAKE! Collection-Independent Automatic Keyword Extractor". Feb. 2018. doi: 10.1007/978-3-319-76941-7_80.
[40] Daniel Smilkov et al. "Embedding projector: Interactive visualization and interpretation of embeddings". In: arXiv preprint arXiv:1611.05469 (2016).
[41] Piotr Bojanowski et al. "Enriching Word Vectors with Subword Information". In: arXiv preprint arXiv:1607.04606 (2016).
[42] url:
[43] url:
[44] url:
[45] url:
[46] url:
[47] In: Nature 571 (July 2019), pp. 95-98.