A Qualitative Evaluation of Language Models on Automatic Question-Answering for COVID-19
David Oniani
Mayo Clinic, Kern Center for the Science of Health Care Delivery, Rochester, MN, USA
[email protected]
Yanshan Wang
Mayo Clinic, Division of Digital Health Sciences, Rochester, MN, USA
[email protected]
ABSTRACT
COVID-19 (2019 Novel Coronavirus) has resulted in an ongoing pandemic and, as of 12 June 2020, has caused more than 7.4 million cases and over 418,000 deaths. The highly dynamic and rapidly evolving situation with COVID-19 has made it difficult to access accurate, on-demand information regarding the disease. Online communities, forums, and social media provide potential venues to search for relevant questions and answers, or to post questions and seek answers from other members. However, due to the nature of such sites, there is always a limited number of relevant questions and responses to search from, and posted questions are rarely answered immediately. With the advancements in the field of natural language processing, particularly in the domain of language models, it has become possible to design chatbots that can automatically answer consumer questions. However, such models are rarely applied and evaluated in the healthcare domain to meet information needs with accurate and up-to-date healthcare data. In this paper, we propose to apply a language model for automatically answering questions related to COVID-19 and qualitatively evaluate the generated responses. We utilized the GPT-2 language model and applied transfer learning to retrain it on the COVID-19 Open Research Dataset (CORD-19) corpus. In order to improve the quality of the generated responses, we applied four different approaches, namely tf-idf (Term Frequency-Inverse Document Frequency), Bidirectional Encoder Representations from Transformers (BERT), Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT), and Universal Sentence Encoder (USE), to filter and retain relevant sentences in the responses. In the performance evaluation step, we asked two medical experts to rate the responses. We found that BERT and BioBERT, on average, outperform both tf-idf and USE in relevance-based sentence filtering tasks.
Additionally, based on the chatbot, we created a user-friendly interactive web application to be hosted online and made its source code available free of charge to anyone interested in running it locally, online, or for experimental purposes. Overall, our work has yielded significant results in both designing a chatbot that produces high-quality responses to COVID-19-related questions and comparing several embedding generation techniques.
KEYWORDS
covid-19, cord-19, gpt-2, bert, biobert, use, tf-idf, dataset, nlp, ai, semantic similarity
1 INTRODUCTION
Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [22]. As of 12 June 2020, more than 7.4 million cases have been recorded, resulting in over 418,000 deaths [44]. The sudden global outbreak of COVID-19 left millions of people quarantined due to social distancing measures. Additionally, the COVID-19 pandemic caused a historic rise in mental health problems, such as depression, post-traumatic stress disorder, and suicide, due to the state-wide quarantine. People are isolated and stressed, and may develop long-term psychological consequences beyond the quarantine period [36][16][21]. Therefore, most of the time, people rely on online and web-based resources for news and updates concerning COVID-19. Given that many web sources currently do not hold accurate information about the pandemic and misinformation campaigns are running rampant [39], it is critically important that people and patients receive accurate, up-to-date, and useful information regarding COVID-19. Online communities, forums, and social media provide potential venues to search for relevant questions and answers, or to post questions and seek answers from other members. However, due to the nature of such sites, there is always a limited number of relevant questions and responses to search from, and posted questions are rarely answered immediately. To address these issues, we propose to develop a chatbot, enhanced by neural language models, that is able to automatically answer questions related to COVID-19 through conversational interactions.

A conversational chatbot is software that is able to conduct a conversation via text and/or other means. There are different taxonomies for types of conversational chatbots. Based on how the natural language conversations are generated, there are two main categories: script chatbots and intelligent chatbots.
The entire interaction in a script chatbot is based on a pre-determined model that determines what the chatbot can and cannot do. The "script" is usually a decision tree that is manually crafted by domain experts to determine which specific path to take given a response to one question. Developing conversation decision trees is usually very labor-expensive and non-generalizable. The intelligent chatbot is built using Artificial Intelligence (AI) and Natural Language Processing (NLP) techniques that automatically generate natural language on the back end. With the advancements in AI and NLP, the functionality and performance of modern chatbots have been dramatically improved. However, these techniques are rarely applied and evaluated in the healthcare domain to meet information needs with accurate, up-to-date, and interactive healthcare information.

arXiv, June 23, 2020 David Oniani and Yanshan Wang
The outbreak of COVID-19 has motivated us to develop a chatbot with advanced NLP techniques and to evaluate the approach in automatically answering questions related to COVID-19. To the best of our knowledge, this is the first study of its kind. Our contributions are:
• We applied and compared the performance of four embedding generation approaches, namely tf-idf (Term Frequency-Inverse Document Frequency) [18], Bidirectional Encoder Representations from Transformers (BERT) [42], BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) [28], and Universal Sentence Encoder (USE) [14], for refining the automatically generated answers.
• We proposed a qualitative evaluation guideline for automatic question-answering for COVID-19.
• We assessed the performance of the proposed "hybrid" approach for automatic question-answering for COVID-19.
• We built a web-based chatbot using the language models that facilitates question-answering for users.
This paper is organized as follows. We proceed by discussing the related work in Section 2. Section 3 is dedicated to materials and Section 4 to the proposed approach. We report the chatbot evaluation strategy and the experimental results in Sections 5 and 6, respectively. Finally, we discuss the web-based chatbot built with the proposed model and future work in Section 7, and conclude the work in Section 8.
2 RELATED WORK
Recent neural language models of dialogue generation offer great promise for generating responses for intelligent chatbots. The LSTM (Long Short-Term Memory) sequence-to-sequence (seq2seq) model is one type of neural generation model that maximizes the probability of generating a response given the previous dialogue turn [37][19][41]. XLNet uses the context of a word for predicting the next word, where the context is constrained to two directions (backward or forward) [45]. SAM (Self-Attentive Associative Memory) is a technique where two memories are wired into a single sequential model capable of both memorization and relational reasoning [26]. In the GPT-2 domain, Lee and Hsiang [29] have fine-tuned GPT-2 for generating patent claims. Klein and Nabi [25] have applied GPT-2 in conjunction with BERT for automatic question generation. Zhang, Sun, et al. developed DialoGPT, a large and tunable neural conversational model, using GPT-2 [48]. Lee, Shu, et al. developed RecipeGPT for automatic generation of cooking recipes by fine-tuning GPT-2 on a large cooking recipe dataset [27]. We are unaware of prior work applying the GPT-2 model for transfer learning on CORD-19.

In regard to work comparing pretrained AI models, Jin et al. made some efforts conducting probing experiments and comparing BERT, ELMo [31], and BioBERT. Sharma and Daniel [40] compared the performance of BERT networks to that of FLAIR [12]. In the general AI-based chatbot domain, Serban et al. [38] have applied deep reinforcement learning for building a conversational AI chatbot. Adiwardana et al. [11] have developed a multi-turn open-domain chatbot trained end-to-end on data mined from social media conversations. Yin et al.
[47] have developed a deep-learning-based chatbot for psychological therapy purposes.

Semantic similarity of texts, on the other hand, has been studied for a long time, and recent breakthroughs have allowed for the development of new models such as BERT, BioBERT, and Universal Sentence Encoder (USE). Today, one of the state-of-the-art conversational AI models is GPT-2. GPT-2 is a pretrained model, so we applied transfer learning, utilizing CORD-19 for retraining. The resulting chatbot gave irregularly long responses that would not be typical of a human. We therefore decided to further filter the responses by applying embedding generation algorithms and models such as tf-idf, BERT, BioBERT, and USE, and then using semantic similarity approaches such as cosine similarity and inner product. In other words, we first let a human ask a question and have GPT-2 come up with an answer. We then further processed the response with additional filters and, ultimately, applied an embedding generation model to find the sentences that are most relevant to the question.

Cosine similarity is one of the most commonly used approaches for calculating the semantic similarity of texts and is therefore naturally employed in NLP tasks. Many NLP applications need to compute the semantic similarity between two short texts. Its flexibility allows one to apply it in virtually any setting, as long as documents can be represented as vectors. Besides, finding cosine similarity is usually not a time-consuming task and can be done quickly, so it is also commonly used for benchmarking purposes [49].

Our study has produced a chatbot that is both performant and extensible. The additional layer of filters has shown success in classifying sentences. The chatbot can also be retrained and readjusted on new data, in case there are new discoveries or scientific achievements related to COVID-19.
Furthermore, chatbot responses have been annotated by medical experts, and the results were consistent across the annotators.
3 MATERIALS
The White House Office of Science and Technology Policy, alongside a coalition of leading research groups, has released a COVID-19 machine-readable dataset, the COVID-19 Open Research Dataset (CORD-19) [2]. It consisted of over 128,000 scholarly articles regarding COVID-19, SARS-CoV-2, and related coronaviruses, including over 59,000 with full text, and called on researchers globally to develop text and data mining tools for finding answers to the questions within this content in support of the ongoing worldwide COVID-19 response efforts [30].

We used CORD-19 to train a language model that would automatically answer questions related to COVID-19. The chatbot would not only help improve information acquisition, but also serve as a knowledge base for COVID-19. We harvested the data from the initial commercial use subset of CORD-19, containing 9000 scholarly articles in the form of JSON files. We extracted the abstract and the main body of the article from every JSON file, combined them together, and used the result as a corpus for retraining the language model.
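The extraction step above can be sketched as follows. This is a minimal illustration, not the authors' code: the schema assumed here (top-level "abstract" and "body_text" lists of paragraph objects, each with a "text" field) mirrors the public CORD-19 full-text JSON release, and `extract_text` is a hypothetical helper name.

```python
import json

def extract_text(article: dict) -> str:
    """Concatenate the abstract and body paragraphs of one CORD-19 JSON article."""
    paragraphs = article.get("abstract", []) + article.get("body_text", [])
    return " ".join(p["text"] for p in paragraphs)

# Toy article mimicking the assumed CORD-19 schema; a real pipeline would
# loop over the JSON files with json.load and concatenate the results.
article = {
    "abstract": [{"text": "COVID-19 is caused by SARS-CoV-2."}],
    "body_text": [{"text": "We study transmission."},
                  {"text": "We also study incubation."}],
}
corpus = extract_text(article)
```

Concatenating all per-article strings then yields the retraining corpus.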
[Figure 1 depicts a four-stage workflow: Stage 1: a human poses a question; Stage 2: the GPT-2 language model generates a response; Stage 3: the response is filtered via regex/string manipulation; Stage 4: the filtered response is further filtered by semantic similarity to the question, yielding the final response.]

Figure 1: Workflow for Response Generation.
4 PROPOSED APPROACH
We applied a hybrid approach for generating responses: GPT-2 was used to generate the answer to the question; then an additional filtering step was applied to prune irrelevant sentences from the answer; and subsequently, semantic similarity methods were employed to retain the sentences that are most semantically similar to the question. This hybrid approach to response generation produced high-quality answers to COVID-19-related questions. Figure 1 illustrates the pipeline of the proposed approach.
GPT-2 has a Transformer-based [43] architecture which, in many ways, is similar to the OpenAI GPT model [34][33]. There are a total of four different GPT-2 models that were released by OpenAI: 124 million (124M), 355 million (355M), 774 million (774M), and 1.5 billion (1.5B) parameter models [4]. While the model with 1.5 billion parameters showed the best results in the original paper [34], in our experiments, we found that it was difficult to fine-tune and use for transfer learning. Besides, the training was unbearably slow, even when run on the TPUs (Tensor Processing Units) provided by Google Colaboratory [20], which we used as our training ground.

We therefore utilized the 774M model and ran transfer learning for 2,500 iterations with a batch size of 8. After 2,000 iterations, the loss was no longer decreasing, so we let the language model train for an additional 500 iterations and stopped the training. The batch size of 8 was chosen due to the memory limitations of Google Colaboratory. As for the optimizer, we used Adam [23] and set the learning rate to 0.0001. Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments [23]. It is highly memory-efficient and has shown good results in retraining our chatbot. We also tried SGD [24], yet Adam showed better performance, and hence we released the Adam-based retrained model.

The original GPT-2 was written in TensorFlow [10], and this is the version we used. That said, for retraining purposes, we applied the TPU-trainable version of GPT-2 [32]. As for the hardware, Google Colaboratory provided us with cloud TPUs and training capabilities. It came with 25 GB of RAM, and since we connected the Colab to Google Drive [20], we had enough storage to do transfer learning. The link for downloading the model is available on our GitHub page [17].
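The Adam update described above can be sketched in a few lines of NumPy. This is a generic, minimal implementation of the algorithm from [23], not the authors' training code; the learning rate mirrors the paper's 0.0001, while the decay rates and epsilon are Adam's standard defaults.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update using adaptive estimates of the first (m) and second (v) moments."""
    m = b1 * m + (1 - b1) * grad          # biased first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # biased second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = np.array(5.0), 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

Because the effective step size is bounded by the learning rate, Adam makes steady progress even with noisy gradients, which suits stochastic mini-batch training.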
The GPT-2 responses are usually very lengthy, and for the most part, the answer is not relevant to the question. To prune the responses generated by GPT-2, we first chunked the answer into a list of sentences using Python's built-in module for regular expressions (re [6]) and then, for each answer in the list of answers, performed the following regex/string operations:
(1) Eliminated redundant spaces
(2) Eliminated extra punctuation marks (specifically, ".", "!", and "?")
(3) Removed redundant parentheses and square brackets
(4) Further split the sentence into separate sentences if it contained a period (".")
Steps 2 and 4, once again, employed the re module, while for steps 1 and 3, the built-in string operations were sufficient (hence, no built-in or external module was used).

These operations significantly improved the quality of the answers and allowed us to pass them directly to the pretrained models for generating embeddings.

Semantic similarity is a metric that quantifies the degree to which two texts or text documents are similar to each other. The two approaches we used are cosine similarity and inner product. The difference between the two is that cosine similarity pays attention only to the angle between the vectors, while the inner product cares about both the angle and the magnitude. That said, if one has normalized data, the two approaches are nearly equivalent.

To put each sentence in a vector representation, we tested and applied four different approaches for generating embeddings:
• tf-idf [7]: a simple, tf-idf-based embedding-generation method.
• BERT: Large model, uncased (24 layers and 340M parameters) [35].
• BioBERT: Large v1.1 (+PubMed 1M) model based on BERT-large Cased (custom 30k vocabulary) [3].
• USE: Version 3, large [8].
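The four regex/string cleanup steps described above can be sketched as follows. The function name and the exact patterns are illustrative, assumed for this sketch rather than taken from the authors' code.

```python
import re

def clean_response(text: str) -> list:
    """Apply the four cleanup steps, then return a list of sentences."""
    text = " ".join(text.split())                     # (1) collapse redundant spaces
    text = re.sub(r"([.!?])\1+", r"\1", text)         # (2) collapse repeated ".", "!", "?"
    text = text.replace("()", "").replace("[]", "")   # (3) drop empty brackets
    # (4) split into separate sentences after each period
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]

sentences = clean_response("Drugs  were developed... () They inhibit IL-6.. In primates.")
```

The resulting sentence list can then be passed directly to an embedding model.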
In all cases, a similar strategy was applied for filtering sentences. The following equation defines the embedding generation process:

E = emb(S) (1)

where S = (s_1, s_2, ..., s_n, q) denotes the list of sentences obtained by performing the regex/string-manipulation split, plus the question; emb denotes one of the embedding generation approaches (i.e., tf-idf, BERT, BioBERT, or USE); and q denotes the question being asked. E denotes the embeddings generated for the list of sentences. In other words, the list of sentences is passed to the embedding generation function and the feature matrix is returned.

Once the embeddings were generated, we applied cosine similarity and inner product for ranking the sentences. Inner product was used only with USE, since its embeddings are approximately normalized. The following equation shows the ranking process:

M = (sim(e_1, e_q), sim(e_2, e_q), ..., sim(e_n, e_q)) (2)

where e_1, e_2, ..., e_n, e_q ∈ E and sim is a function that finds the cosine similarity or the inner product between two embeddings. For every sentence embedding e_1, e_2, ..., e_n, the approach calculates the similarity to the question embedding e_q. M is a similarity matrix where every column denotes the similarity vector for a particular sentence. The similarity vector contains the similarity values of the particular sentence with respect to all the sentences (including itself). The last column/similarity vector in the matrix holds the similarity measures with respect to the question embedding.

Finally, we ranked the sentences based on their similarity scores, selected the top five most similar sentences using Python's numpy [5] package, and concatenated them into a single, unified answer.

We show example responses in the form of a Human-Bot conversation generated by the GPT-2 language model, as follows.

human:
What do we know about vaccines and therapeutics of COVID-19?

bot:
Because of their critical role in viral pathogenesis and host defense against infections, several drugs have been developed against COVID-19 With this in mind, we present evidence that drugs that inhibit pro-inflammatory responses (e.g., IL-6) have the potential to inhibit COVID-19 infection and viral replication in nonhuman primates after administration up to 12 days following infection Because of their critical role in viral pathogenesis and host defense against infections, several drugs have been developed against COVID-19

human:
What do we know about COVID-19 risk factors?

bot:
However, COVID-19 is now a threat to global health security According to the latest World Health Organization report, a large number of travel is seen as one of the major causes of COVID-19 spread However, COVID-19 is now a threat to global health security

human:
What is known about transmission, incubation, and environmental stability of COVID-19?

bot:
For example, a number of reports have suggested that animal and human hosts may have provided the necessary environment for the transmission of COVID-19 in people and poultry In fact, all known animal hosts that have been associated with animal-to-human transmission of COVID-19 are wild birds, including chickens, geese, mule deer, herons, and turkey In fact, all known animal hosts that have been associated with animal-to-human transmission of COVID-19 are wild birds, including chickens, geese, mule deer, herons, and turkey

The dialogue shows a consistent pattern in generating responses. The response to the first question, regarding vaccines and therapeutics, states that "several drugs have been developed against COVID-19." Furthermore, it says that "drugs that inhibit pro-inflammatory responses (e.g., IL-6) have the potential to inhibit COVID-19 infection and viral replication in nonhuman primates." The response to the second question, concerning risk factors, addressed the question directly by stating that "a large number of travel is seen as one of the major causes of COVID-19 spread." As for the third question, about transmission, incubation, and environmental stability of COVID-19, it mentioned that "large number of travel is seen as one of the major causes of COVID-19 spread" and additionally talks about "animal-to-human transmission." In all cases, sentences were highly readable and understandable. That said, in some cases, the same sentences were repeated due to how the hybrid approach was implemented. This can be avoided, which we discuss in Section 7.
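The semantic-similarity ranking described earlier (Equations 1 and 2) can be sketched with NumPy. The embeddings below are toy two-dimensional vectors standing in for tf-idf/BERT/BioBERT/USE outputs, and `rank_sentences` is a hypothetical helper name, not the authors' implementation.

```python
import numpy as np

def rank_sentences(sent_embs, q_emb, k=5):
    """Rank sentence embeddings by cosine similarity to the question embedding; keep top k."""
    sent_embs = np.asarray(sent_embs, dtype=float)
    q = np.asarray(q_emb, dtype=float)
    # Cosine similarity: dot product divided by the vector magnitudes.
    sims = (sent_embs @ q) / (np.linalg.norm(sent_embs, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]      # indices of the most similar sentences
    return top, sims

# Three toy sentence embeddings and a question embedding.
embs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
top, sims = rank_sentences(embs, [1.0, 0.1], k=2)
```

The selected indices are then used to pull out the original sentences and concatenate them into the final answer; for approximately normalized embeddings (as with USE), replacing the cosine with the plain inner product gives nearly the same ranking.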
5 EVALUATION
In order to evaluate the performance of the proposed approaches, as well as the overall performance of the chatbot, it is crucial to have a dataset of questions that are both frequently asked and related to COVID-19. For this purpose, we decided to use 12 questions from Kaggle's COVID-19 Open Research Dataset Challenge (CORD-19) [15]. Most of the questions included the term "COVID-19," but others did not, in which case we appended the term to the end of the question. Table 1 presents all 12 questions.

For every one of the 12 questions, we generated five different answers by applying each of the four embedding generation techniques, resulting in a total of 240 answers. Therefore, the response for every question was generated exactly 5 times using the same technique. This ensured a fair and consistent distribution of both the questions and the approaches across the dataset. We made all of the answers publicly available on GitHub [1]. We then asked two experienced medical experts to evaluate the quality of these responses by assigning relevance scores according to the categories in Table 2. Having 5 categories allowed for flexibility and diversity of opinions/judgments, as well as a broad range of scores, which ultimately gave us a better way to evaluate our approaches. The evaluation was done primarily by averaging the scores for a particular approach.
Table 1: Testing questions from CORD-19.
Number Question
Table 2: 5 Rating Categories.
Category | Description | Point(s)
Relevant | The answer partially or fully answers the question and/or makes clear attempts to do so, and is related to the question | 5
Well-formed | The answer makes logical sense and is somewhat related to both the question and COVID-19, yet it does not (partially or fully) answer the question | 4
Informative | The answer is not related to the question, but provides some information about COVID-19 and makes logical sense | 3
Acceptable | The answer makes some logical sense and is weakly related to the question or COVID-19, but is mostly difficult to understand | 2
Poor | The answer is totally unrelated to the question or COVID-19 and/or does not make logical sense | 1
6 RESULTS
Table 3 lists the evaluation results of the different approaches. It shows the approach, the average score per annotator, and the overall average across the annotators. The first annotator rated BERT as the best approach, with an average score of 4.167. BioBERT shows slightly worse performance than BERT, with a score of 4.133. The tf-idf approach performs well, with a score of 3.967, yet it could not outperform either BERT or BioBERT. USE has the worst performance of all the embedding generation techniques, with a score of 3.683 out of 5. The second annotator, similarly, gave the highest average score to BERT (4.283). USE was the second best, with a score of 4.083, followed by BioBERT with approximately the same score of 4.067. The tf-idf approach yielded the worst results, rated 3.8.

In general, the results are consistent between the two annotators, with an inter-annotator agreement score of 0.521, calculated using the Pearson correlation. Models from the BERT family showed the best performance in automatically answering COVID-19 questions, with BERT (average score 4.225) slightly outperforming BioBERT (4.100). The tf-idf approach and USE show roughly similar performance (3.884 vs. 3.883), yet inferior to BERT and BioBERT. All four approaches, on average, can be considered to be in the "Well-formed" category, with BERT and BioBERT being close to the "Relevant" category. The overall average was 4.023 (Well-formed).
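The inter-annotator agreement above can be computed as follows. This is a generic sketch: the score vectors below are toy values from two hypothetical annotators, not the actual annotation data.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two annotators' score vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

# Toy relevance scores (1-5) assigned to the same six answers.
a1 = [5, 4, 3, 5, 2, 4]
a2 = [4, 4, 2, 5, 3, 4]
r = pearson(a1, a2)
```

A value near 1 indicates that the two annotators rank the answers similarly; the paper's reported agreement of 0.521 corresponds to a moderate positive correlation.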
Table 3: Average Scores of Embedding Generation Ap-proaches Across the Annotators.
Approach | A1 | A2 | Overall
tf-idf | 3.967 | 3.800 | 3.884
BERT | 4.167 | 4.283 | 4.225
BioBERT | 4.133 | 4.067 | 4.100
USE | 3.683 | 4.083 | 3.883
Table 4 shows the average scores for annotators A1 and A2, the overall average, and the difference, based on the question asked to the language model. From the table, it is clear that the proposed approach had the best results for responses to questions
7 DISCUSSION
The project had several limitations. First, due to hardware constraints and the difficulty of fine-tuning, we did not use the larger 1.5B GPT-2 model, which could potentially yield better results in generating responses. Second, the question pool was limited, comprising only 12 questions. Additionally, we tried only 4 specific embedding generation approaches, which might not be a fair representation of all such techniques in the domains of AI and NLP.

In order to make the language model more accessible to a general audience for automating response generation, we built a web-based chatbot using the trained GPT-2 with options of tf-idf,
Table 4: Question-Based Average Scores of Embedding Gen-eration Approaches Across the Annotators.
Question A1 A2 Average Difference
8 CONCLUSION
In this paper, we applied the GPT-2 language model to automatically answer questions related to COVID-19, and qualitatively evaluated the proposed approach. To refine the responses generated by GPT-2, we compared four different embedding generation techniques, namely tf-idf, BERT, BioBERT, and USE. We utilized the corpus collected from the CORD-19 task to retrain the GPT-2 model, and evaluated the automatically generated answers on twelve questions from CORD-19. The results were evaluated by two medical experts. In general, the results are consistent between the two annotators. The empirical results show that BERT achieved the best performance in automatically answering COVID-19 questions. We also built a web-based chatbot using the trained GPT-2 model and open-sourced the code.
ACKNOWLEDGMENTS
This work was supported by NIH grant R01LM11934, the Mayo Clinic Center for Health Equity and Community Engagement Research Award, and the Mayo Clinic Office of Patient Education. The funders had no role in the design of the study; in the collection, analysis, and interpretation of data; or in the preparation of the manuscript. The views presented in this report are not necessarily representative of the funders' views and belong solely to the authors.
REFERENCES
[1] 2020. Annotation Results and Generated Answers on GitHub. (2020). Retrieved 2020-15-05 from https://github.com/oniani/covid-19-chatbot/tree/master/annotations
[2] 2020. COVID-19 Open Research Dataset (CORD-19). (2020). Retrieved 2020-16-03 from https://pages.semanticscholar.org/coronavirus-research
[3] 2020. GitHub: DMIS Laboratory - Korea University. (2020). Retrieved 21-06-2020 from https://github.com/dmis-lab/biobert
[4] 2020. GPT-2 Model Card. (2020). Retrieved 21-06-2020 from https://github.com/oniani/covid-19-chatbot
[5] 2020. NumPy. (2020). Retrieved 21-06-2020 from https://numpy.org/
[6] 2020. re — Regular expression operations. (2020). Retrieved 21-06-2020 from https://docs.python.org/3/library/re.html
[7] 2020. Scikit-learn: TfidfVectorizer. (2020). Retrieved 21-06-2020 from https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
[8] 2020. TensorFlow Hub: Universal Sentence Encoder (Version 3, Large). (2020). Retrieved 21-06-2020 from https://tfhub.dev/google/universal-sentence-encoder-large/3
[9] 2020. Welcome to Flask — Flask Documentation. (2020). Retrieved 21-06-2020 from https://flask.palletsprojects.com
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). http://tensorflow.org/ Software available from tensorflow.org.
[11] Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a Human-like Open-Domain Chatbot. (2020). arXiv:2001.09977
[12] Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In COLING 2018, 27th International Conference on Computational Linguistics.
Bioinformatics (09 2019).
[42] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models. arXiv preprint arXiv:1908.08962v2.