A Qualitative Evaluation of Language Models on Automatic Question-Answering for COVID-19
David Oniani
Mayo Clinic, Kern Center for the Science of Health Care Delivery, Rochester, MN, USA
[email protected]
Yanshan Wang
Mayo Clinic, Division of Digital Health Sciences, Rochester, MN, USA
[email protected]
ABSTRACT
COVID-19 (2019 Novel Coronavirus) has resulted in an ongoing pandemic and, as of 12 June 2020, has caused more than 7.4 million cases and over 418,000 deaths. The highly dynamic and rapidly evolving situation with COVID-19 has made it difficult to access accurate, on-demand information regarding the disease. Online communities, forums, and social media provide potential venues to search for relevant questions and answers, or to post questions and seek answers from other members. However, due to the nature of such sites, there is always a limited number of relevant questions and responses to search from, and posted questions are rarely answered immediately. With the advancements in the field of natural language processing, particularly in the domain of language models, it has become possible to design chatbots that can automatically answer consumer questions. However, such models are rarely applied and evaluated in the healthcare domain to meet information needs with accurate and up-to-date healthcare data. In this paper, we propose to apply a language model for automatically answering questions related to COVID-19 and qualitatively evaluate the generated responses. We utilized the GPT-2 language model and applied transfer learning to retrain it on the COVID-19 Open Research Dataset (CORD-19) corpus. In order to improve the quality of the generated responses, we applied four different approaches, namely tf-idf (Term Frequency-Inverse Document Frequency), Bidirectional Encoder Representations from Transformers (BERT), Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT), and Universal Sentence Encoder (USE), to filter and retain relevant sentences in the responses. In the performance evaluation step, we asked two medical experts to rate the responses. We found that BERT and BioBERT, on average, outperform both tf-idf and USE in relevance-based sentence filtering tasks.
Additionally, based on the chatbot, we created a user-friendly interactive web application to be hosted online and made its source code available free of charge to anyone interested in running it locally, online, or for experimental purposes. Overall, our work has yielded significant results in both designing a chatbot that produces high-quality responses to COVID-19-related questions and comparing several embedding generation techniques.
KEYWORDS
covid-19, cord-19, gpt-2, bert, biobert, use, tf-idf, dataset, nlp, ai, semantic similarity
1 INTRODUCTION
Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [22]. As of 12 June 2020, more than 7.4 million cases have been recorded, resulting in over 418,000 deaths [44]. The sudden global outbreak of COVID-19 left millions of people quarantined due to social distancing measures. Additionally, the COVID-19 pandemic caused a historic rise in mental health problems, such as depression, post-traumatic stress disorder, and suicide, due to the state-wide quarantine. People are isolated and stressed, and may develop long-term psychological consequences beyond the quarantine period [36][16][21]. Therefore, most of the time, people rely on online and web-based resources for news and updates concerning COVID-19. Given that many web sources currently do not hold accurate information about the pandemic and misinformation campaigns are running rampant [39], it is critically important that people and patients receive accurate, up-to-date, and useful information regarding COVID-19. Online communities, forums, and social media provide potential venues to search for relevant questions and answers, or to post questions and seek answers from other members. However, due to the nature of such sites, there is always a limited number of relevant questions and responses to search from, and posted questions are rarely answered immediately. To address these issues, we propose to develop a chatbot, enhanced by neural language models, that is able to automatically answer questions related to COVID-19 through conversational interactions.

A conversational chatbot is software that is able to conduct a conversation via text and/or other means. There are different taxonomies for types of conversational chatbots. Based on how the natural language conversations are generated, there are two main categories: script chatbots and intelligent chatbots.
The entire interaction in a script chatbot is based on a pre-determined model that determines what the chatbot can and cannot do. The "script" is usually a decision tree that is manually crafted by domain experts to determine which specific path to take given a response to one question. Developing conversation decision trees is usually very labor-expensive and non-generalizable. The intelligent chatbot is built using Artificial Intelligence (AI) and Natural Language Processing (NLP) techniques that automatically generate natural language on the back end. With the advancements in AI and NLP, the functionality and performance of modern chatbots have been dramatically improved. However, these techniques are rarely applied and evaluated in the healthcare domain to meet information needs with accurate, up-to-date, and interactive healthcare information.

arXiv, June 23, 2020 David Oniani and Yanshan Wang
The outbreak of COVID-19 has motivated us to develop a chatbot with advanced NLP techniques and to evaluate the approach in automatically answering questions related to COVID-19. To the best of our knowledge, this is the first study of its kind. Our contributions are:
• We applied and compared the performance of four embedding generation approaches, namely tf-idf (Term Frequency-Inverse Document Frequency) [18], Bidirectional Encoder Representations from Transformers (BERT) [42], BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) [28], and Universal Sentence Encoder (USE) [14], for refining the automatically generated answers.
• We proposed a qualitative evaluation guideline for automatic question-answering for COVID-19.
• We assessed the performance of the proposed "hybrid" approach for automatic question-answering for COVID-19.
• We built a web-based chatbot using the language models that facilitates question-answering for users.
This paper is organized as follows. We proceed by discussing the related work in Section 2. Section 3 is dedicated to materials and Section 4 to the proposed approach. We report the chatbot evaluation strategy and the experimental results in Sections 5 and 6, respectively. Finally, we discuss the web-based chatbot built with the proposed model and future work in Section 7, and conclude the work in Section 8.
2 RELATED WORK
Recent neural language models of dialogue generation offer great promise for generating responses for intelligent chatbots. The LSTM (Long Short-Term Memory) sequence-to-sequence (seq2seq) model is one type of neural generation model that maximizes the probability of generating a response given the previous dialogue turn [37][19][41]. XLNet uses the context of a word for predicting the next word, where the context is constrained to two directions (backward or forward) [45]. SAM (Self-Attentive Associative Memory) is a technique where two memories are wired into a single sequential model capable of both memorization and relational reasoning [26]. In the GPT-2 domain, Lee and Hsiang [29] have fine-tuned GPT-2 for generating patent claims. Klein and Nabi [25] have applied GPT-2 in conjunction with BERT for automatic question generation. Zhang, Sun, et al. developed DialoGPT, a large and tunable neural conversational model, using GPT-2 [48]. Lee, Shu, et al. developed RecipeGPT for automatic generation of cooking recipes by fine-tuning GPT-2 on a large cooking recipe dataset [27]. We are unaware of prior work applying the GPT-2 model for transfer learning on CORD-19.

In regard to work comparing pretrained AI models, Jin et al. made some efforts conducting probing experiments and comparing BERT, ELMo [31], and BioBERT. Sharma and Daniel [40] compared the performance of BERT networks to that of FLAIR [12]. In the general AI-based chatbot domain, Serban et al. [38] have applied deep reinforcement learning for building a conversational AI chatbot. Adiwardana et al. [11] have developed a multi-turn open-domain chatbot trained end-to-end on data mined from social media conversations. Yin et al.
[47] have developed a deep-learning-based chatbot for psychological therapy purposes.

Semantic similarity of texts, on the other hand, has been studied for a long time, and recent breakthroughs have allowed for the development of new models such as BERT, BioBERT, and Universal Sentence Encoder (USE). Today, one of the state-of-the-art conversational AI models is GPT-2. GPT-2 is a pretrained model, so we applied transfer learning, utilizing CORD-19 for retraining. The resulting chatbot gave irregularly long responses that would not be typical of a human. We therefore decided to further filter the responses by applying embedding generation algorithms and models such as tf-idf, BERT, BioBERT, and USE, and then using semantic similarity approaches such as cosine similarity and inner product. In other words, we first let a human ask a question and have GPT-2 come up with an answer. We then further processed the response with additional filters and, ultimately, applied an embedding generation model to find the sentences that are most relevant to the question.

Cosine similarity is one of the most commonly used approaches for calculating the semantic similarity of texts and is therefore naturally employed in NLP tasks. Many NLP applications need to compute the semantic similarity between two short texts. Its flexibility allows one to apply it in virtually any setting, as long as documents can be represented as vectors. Besides, finding cosine similarity is usually not a time-consuming task and can be done quickly, so it is also commonly used for benchmarking purposes [49].

Our study has produced a chatbot that is both performant and extensible. The additional layer of filters has shown success in classifying sentences. The chatbot can also be retrained and readjusted on new data, in case there are new discoveries or scientific achievements related to COVID-19.
Furthermore, chatbot responses have been annotated by medical experts, and the results were consistent across the annotators.
3 MATERIALS
The White House Office of Science and Technology Policy, alongside a coalition of leading research groups, has released a COVID-19 machine-readable dataset, the COVID-19 Open Research Dataset (CORD-19) [2]. It consisted of over 128,000 scholarly articles regarding COVID-19, SARS-CoV-2, and related coronaviruses, including over 59,000 with full text, and called on researchers globally to develop text and data mining tools for finding answers to the questions within this content in support of the ongoing worldwide COVID-19 response efforts [30].

We used CORD-19 to train a language model that would automatically answer questions related to COVID-19. The chatbot would not only help improve information acquisition, but also serve as a knowledge base for COVID-19. We harvested the data from the initial commercial use subset of CORD-19, containing 9000 scholarly articles in the form of JSON files. We extracted the abstract and the main body of the article from every JSON file, combined them together, and used the result as a corpus for retraining the language model.
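The extraction step above can be sketched as follows. This is a minimal illustration, not the authors' code: the schema assumed here (top-level "abstract" and "body_text" lists of paragraph objects, each with a "text" field) mirrors the public CORD-19 full-text JSON release, and `extract_text` is a hypothetical helper name.

```python
import json

def extract_text(article: dict) -> str:
    """Concatenate the abstract and body paragraphs of one CORD-19 JSON article."""
    paragraphs = article.get("abstract", []) + article.get("body_text", [])
    return " ".join(p["text"] for p in paragraphs)

# Toy article mimicking the assumed CORD-19 schema; a real pipeline would
# loop over the JSON files with json.load and concatenate the results.
article = {
    "abstract": [{"text": "COVID-19 is caused by SARS-CoV-2."}],
    "body_text": [{"text": "We study transmission."},
                  {"text": "We also study incubation."}],
}
corpus = extract_text(article)
```

Concatenating all per-article strings then yields the retraining corpus.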
[Figure 1 depicts a four-stage workflow: Stage 1: a human poses a question; Stage 2: the GPT-2 language model generates a response; Stage 3: the response is filtered via regex/string manipulation; Stage 4: the filtered response is further filtered by semantic similarity to the question, yielding the final response.]

Figure 1: Workflow for Response Generation.
4 PROPOSED APPROACH
We applied a hybrid approach for generating responses: GPT-2 was used to generate the answer to the question; then an additional filtering step was applied to prune irrelevant sentences from the answer; and subsequently, semantic similarity methods were employed to retain the sentences that are most semantically similar to the question. This hybrid approach to response generation produced high-quality answers to COVID-19-related questions. Figure 1 illustrates the pipeline of the proposed approach.
GPT-2 has a Transformer-based [43] architecture which, in many ways, is similar to the OpenAI GPT model [34][33]. There are a total of four different GPT-2 models that were released by OpenAI: 124 million (124M), 355 million (355M), 774 million (774M), and 1.5 billion (1.5B) parameter models [4]. While the model with 1.5 billion parameters showed the best results in the original paper [34], in our experiments, we found that it was difficult to fine-tune and use for transfer learning. Besides, the training was unbearably slow, even when run on the TPUs (Tensor Processing Units) provided by Google Colaboratory [20], which we used as our training ground.

We therefore utilized the 774M model and ran transfer learning for 2,500 iterations with a batch size of 8. After 2,000 iterations, the loss was no longer decreasing, so we let the language model train for an additional 500 iterations and stopped the training. The batch size of 8 was chosen due to the memory limitations of Google Colaboratory. As for the optimizer, we used Adam [23] and set the learning rate to 0.0001. Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments [23]. It is highly memory-efficient and has shown good results in retraining our chatbot. We also tried SGD [24], yet Adam showed better performance, and hence we released the Adam-based retrained model.

The original GPT-2 was written in TensorFlow [10], and this is the version we used. That said, for retraining purposes, we applied the TPU-trainable version of GPT-2 [32]. As for the hardware, Google Colaboratory provided us with cloud TPUs and training capabilities. It came with 25 GB of RAM, and since we connected the Colab to Google Drive [20], we had enough storage to do transfer learning. The link for downloading the model is available on our GitHub page [17].
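The Adam update described above can be sketched in a few lines of NumPy. This is a generic, minimal implementation of the algorithm from [23], not the authors' training code; the learning rate mirrors the paper's 0.0001, while the decay rates and epsilon are Adam's standard defaults.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update using adaptive estimates of the first (m) and second (v) moments."""
    m = b1 * m + (1 - b1) * grad          # biased first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # biased second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = np.array(5.0), 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

Because the effective step size is bounded by the learning rate, Adam makes steady progress even with noisy gradients, which suits stochastic mini-batch training.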
The GPT-2 responses are usually very lengthy, and for the most part, the answer is not relevant to the question. To prune the responses generated by GPT-2, we first chunked the answer into a list of sentences using Python's built-in module for regular expressions (re [6]) and then, for each answer in the list of answers, performed the following regex/string operations:
(1) Eliminated redundant spaces
(2) Eliminated extra punctuation marks (specifically, ".", "!", and "?")
(3) Removed redundant parentheses and square brackets
(4) Further split the sentence into separate sentences if it contained a period (".")
Steps 2 and 4, once again, employed the re module, while for steps 1 and 3, the built-in string operations were sufficient (hence, no built-in or external module was used).

These operations significantly improved the quality of the answers and allowed us to pass them directly to the pretrained models for generating embeddings.

Semantic similarity is a metric that quantifies the degree to which two texts or text documents are similar to each other. The two approaches we used are cosine similarity and inner product. The difference between the two is that cosine similarity pays attention only to the angle between the vectors, while the inner product cares about both the angle and the magnitude. That said, if one has normalized data, the two approaches are nearly equivalent.

To put each sentence in a vector representation, we tested and applied four different approaches for generating embeddings:
• tf-idf [7]: a simple, tf-idf-based embedding-generation method.
• BERT: Large model, uncased (24 layers and 340M parameters) [35].
• BioBERT: Large v1.1 (+PubMed 1M) model based on BERT-large Cased (custom 30k vocabulary) [3].
• USE: Version 3, large [8].
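The four regex/string cleanup steps described above can be sketched as follows. The function name and the exact patterns are illustrative, assumed for this sketch rather than taken from the authors' code.

```python
import re

def clean_response(text: str) -> list:
    """Apply the four cleanup steps, then return a list of sentences."""
    text = " ".join(text.split())                     # (1) collapse redundant spaces
    text = re.sub(r"([.!?])\1+", r"\1", text)         # (2) collapse repeated ".", "!", "?"
    text = text.replace("()", "").replace("[]", "")   # (3) drop empty brackets
    # (4) split into separate sentences after each period
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]

sentences = clean_response("Drugs  were developed... () They inhibit IL-6.. In primates.")
```

The resulting sentence list can then be passed directly to an embedding model.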
In all cases, a similar strategy was applied for filtering sentences. The following equation defines the embedding generation process:

E = emb(S) (1)

where S = (s_1, s_2, ..., s_n, q) denotes the list of sentences obtained by performing the regex/string-manipulation split, plus the question; emb denotes one of the embedding generation approaches (i.e., tf-idf, BERT, BioBERT, or USE); and q denotes the question being asked. E denotes the embeddings generated for the list of sentences. In other words, the list of sentences is passed to the embedding generation function and the feature matrix is returned.

Once the embeddings were generated, we applied cosine similarity and inner product for ranking the sentences. Inner product was used only with USE, since its embeddings are approximately normalized. The following equation shows the ranking process:

M = (sim(e_1, e_q), sim(e_2, e_q), ..., sim(e_n, e_q)) (2)

where e_1, e_2, ..., e_n, e_q ∈ E and sim is a function that finds the cosine similarity or the inner product between two embeddings. For every sentence embedding e_1, e_2, ..., e_n, the approach calculates the similarity to the question embedding e_q. M is a similarity matrix where every column denotes the similarity vector for a particular sentence. The similarity vector contains the similarity values of the particular sentence with respect to all the sentences (including itself). The last column/similarity vector in the matrix holds the similarity measures with respect to the question embedding.

Finally, we ranked the sentences based on their similarity scores, selected the top five most similar sentences using Python's numpy [5] package, and concatenated them into a single, unified answer.

We show example responses in the form of a Human-Bot conversation generated by the GPT-2 language model, as follows.

human:
What do we know about vaccines and therapeutics of COVID-19?

bot:
Because of their critical role in viral pathogenesis and host defense against infections, several drugs have been developed against COVID-19 With this in mind, we present evidence that drugs that inhibit pro-inflammatory responses (e.g., IL-6) have the potential to inhibit COVID-19 infection and viral replication in nonhuman primates after administration up to 12 days following infection Because of their critical role in viral pathogenesis and host defense against infections, several drugs have been developed against COVID-19

human:
What do we know about COVID-19 risk factors?

bot:
However, COVID-19 is now a threat to global health security According to the latest World Health Organization report, a large number of travel is seen as one of the major causes of COVID-19 spread However, COVID-19 is now a threat to global health security

human:
What is known about transmission, incubation, and environmental stability of COVID-19?

bot:
For example, a number of reports have suggested that animal and human hosts may have provided the necessary environment for the transmission of COVID-19 in people and poultry In fact, all known animal hosts that have been associated with animal-to-human transmission of COVID-19 are wild birds, including chickens, geese, mule deer, herons, and turkey In fact, all known animal hosts that have been associated with animal-to-human transmission of COVID-19 are wild birds, including chickens, geese, mule deer, herons, and turkey

The dialogue shows a consistent pattern in generating responses. The response to the first question, regarding vaccines and therapeutics, states that "several drugs have been developed against COVID-19." Furthermore, it says that "drugs that inhibit pro-inflammatory responses (e.g., IL-6) have the potential to inhibit COVID-19 infection and viral replication in nonhuman primates." The response to the second question, concerning risk factors, addressed the question directly by stating that "a large number of travel is seen as one of the major causes of COVID-19 spread." As for the third question, about transmission, incubation, and environmental stability of COVID-19, it mentioned that "large number of travel is seen as one of the major causes of COVID-19 spread" and additionally talks about "animal-to-human transmission." In all cases, sentences were highly readable and understandable. That said, in some cases, the same sentences were repeated due to how the hybrid approach was implemented. This can be avoided, which we discuss in Section 7.
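The semantic-similarity ranking described earlier (Equations 1 and 2) can be sketched with NumPy. The embeddings below are toy two-dimensional vectors standing in for tf-idf/BERT/BioBERT/USE outputs, and `rank_sentences` is a hypothetical helper name, not the authors' implementation.

```python
import numpy as np

def rank_sentences(sent_embs, q_emb, k=5):
    """Rank sentence embeddings by cosine similarity to the question embedding; keep top k."""
    sent_embs = np.asarray(sent_embs, dtype=float)
    q = np.asarray(q_emb, dtype=float)
    # Cosine similarity: dot product divided by the vector magnitudes.
    sims = (sent_embs @ q) / (np.linalg.norm(sent_embs, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]      # indices of the most similar sentences
    return top, sims

# Three toy sentence embeddings and a question embedding.
embs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
top, sims = rank_sentences(embs, [1.0, 0.1], k=2)
```

The selected indices are then used to pull out the original sentences and concatenate them into the final answer; for approximately normalized embeddings (as with USE), replacing the cosine with the plain inner product gives nearly the same ranking.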
5 EVALUATION
In order to evaluate the performance of the proposed approaches, as well as the overall performance of the chatbot, it is crucial to have a dataset of questions that are both frequently asked and related to COVID-19. For this purpose, we decided to use 12 questions from Kaggle's COVID-19 Open Research Dataset Challenge (CORD-19) [15]. Most of the questions included the term "COVID-19," but others did not, in which case we appended the term to the end of the question. Table 1 presents all 12 questions.

For every one of the 12 questions, we generated five different answers by applying each of the four embedding generation techniques, resulting in a total of 240 answers. Therefore, the response for every question was generated exactly 5 times using the same technique. This ensured a fair and consistent distribution of both the questions and the approaches across the dataset. We made all of the answers publicly available on GitHub [1]. We then asked two experienced medical experts to evaluate the quality of these responses by assigning relevance scores according to the categories in Table 2. Having 5 categories allowed for flexibility and diversity of opinions/judgments, as well as a broad range of scores, which ultimately gave us a better way to evaluate our approaches. The evaluation was done primarily by averaging the scores for a particular approach.
Table 1: Testing questions from CORD-19.
Number Question
Table 2: 5 Rating Categories.
Category | Description | Point(s)
Relevant | The answer partially or fully answers the question and/or makes clear attempts to do so, and is related to the question | 5
Well-formed | The answer makes logical sense and is somewhat related to both the question and COVID-19, yet it does not (partially or fully) answer the question | 4
Informative | The answer is not related to the question, but provides some information about COVID-19 and makes logical sense | 3
Acceptable | The answer makes some logical sense and is weakly related to the question or COVID-19, but is mostly difficult to understand | 2
Poor | The answer is totally unrelated to the question or COVID-19 and/or does not make logical sense | 1
6 RESULTS
Table 3 lists the evaluation results of the different approaches. It shows the approach, the average score per annotator, and the overall average across the annotators. The first annotator rated BERT as the best approach, with an average score of 4.167. BioBERT shows slightly worse performance than BERT, with a score of 4.133. The tf-idf approach performs well, with a score of 3.967, yet it could not outperform either BERT or BioBERT. USE has the worst performance of all the embedding generation techniques, with a score of 3.683 out of 5. The second annotator, similarly, gave the highest average score to BERT (4.283). USE was the second best, with a score of 4.083, followed by BioBERT with approximately the same score of 4.067. The tf-idf approach yielded the worst results, rated 3.8.

In general, the results are consistent between the two annotators, with an inter-annotator agreement score of 0.521, calculated using the Pearson correlation. Models from the BERT family showed the best performance in automatically answering COVID-19 questions, with BERT (average score 4.225) slightly outperforming BioBERT (4.100). The tf-idf approach and USE show roughly similar performance (3.884 vs. 3.883), yet inferior to BERT and BioBERT. All four approaches, on average, can be considered to be in the "Well-formed" category, with BERT and BioBERT being close to the "Relevant" category. The overall average was 4.023 (Well-formed).
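The inter-annotator agreement above can be computed as follows. This is a generic sketch: the score vectors below are toy values from two hypothetical annotators, not the actual annotation data.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two annotators' score vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

# Toy relevance scores (1-5) assigned to the same six answers.
a1 = [5, 4, 3, 5, 2, 4]
a2 = [4, 4, 2, 5, 3, 4]
r = pearson(a1, a2)
```

A value near 1 indicates that the two annotators rank the answers similarly; the paper's reported agreement of 0.521 corresponds to a moderate positive correlation.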
Table 3: Average Scores of Embedding Generation Ap-proaches Across the Annotators.
Approach | A1 | A2 | Overall
tf-idf | 3.967 | 3.800 | 3.884
BERT | 4.167 | 4.283 | 4.225
BioBERT | 4.133 | 4.067 | 4.100
USE | 3.683 | 4.083 | 3.883
Table 4 shows the average scores for annotators A1 and A2, the overall average, and the difference, based on the question asked to the language model. From the table, it is clear that the proposed approach had the best results for responses to questions
7 DISCUSSION
The project had several limitations. First, due to hardware constraints and the difficulty of fine-tuning, we did not use the larger 1.5B GPT-2 model, which could potentially yield better results in generating responses. Second, the question pool was limited, comprising only 12 questions. Additionally, we tried only 4 specific embedding generation approaches, which might not be a fair representation of all such techniques in the domains of AI and NLP.

In order to make the language model more accessible to a general audience for automating response generation, we built a web-based chatbot using the trained GPT-2 with options of tf-idf,
Table 4: Question-Based Average Scores of Embedding Gen-eration Approaches Across the Annotators.
Question A1 A2 Average Difference
8 CONCLUSION
In this paper, we applied the GPT-2 language model to automatically answer questions related to COVID-19, and qualitatively evaluated the proposed approach. To refine the responses generated by GPT-2, we compared four different embedding generation techniques, namely tf-idf, BERT, BioBERT, and USE. We utilized the corpus collected from the CORD-19 task to retrain the GPT-2 model, and evaluated the automatically generated answers on twelve questions from CORD-19. The results were evaluated by two medical experts. In general, the results are consistent between the two annotators. The empirical results show that BERT achieved the best performance in automatically answering COVID-19 questions. We also built a web-based chatbot using the trained GPT-2 model and open-sourced the code.
ACKNOWLEDGMENTS
This work was supported by NIH grant R01LM11934, the Mayo Clinic Center for Health Equity and Community Engagement Research Award, and the Mayo Clinic Office of Patient Education. The funders had no role in the design of the study; in the collection, analysis, and interpretation of data; or in the preparation of the manuscript. The views presented in this report are not necessarily representative of the funders' views and belong solely to the authors.
REFERENCES
[1] 2020. Annotation Results and Generated Answers on GitHub. (2020). Retrieved 2020-15-05 from https://github.com/oniani/covid-19-chatbot/tree/master/annotations
[2] 2020. COVID-19 Open Research Dataset (CORD-19). (2020). Retrieved 2020-16-03 from https://pages.semanticscholar.org/coronavirus-research
[3] 2020. GitHub: DMIS Laboratory - Korea University. (2020). Retrieved 21-06-2020 from https://github.com/dmis-lab/biobert
[4] 2020. GPT-2 Model Card. (2020). Retrieved 21-06-2020 from https://github.com/oniani/covid-19-chatbot
[5] 2020. NumPy. (2020). Retrieved 21-06-2020 from https://numpy.org/
[6] 2020. re — Regular expression operations. (2020). Retrieved 21-06-2020 from https://docs.python.org/3/library/re.html
[7] 2020. Scikit-learn: TfidfVectorizer. (2020). Retrieved 21-06-2020 from https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
[8] 2020. TensorFlow Hub: Universal Sentence Encoder (Version 3, Large). (2020). Retrieved 21-06-2020 from https://tfhub.dev/google/universal-sentence-encoder-large/3
[9] 2020. Welcome to Flask — Flask Documentation. (2020). Retrieved 21-06-2020 from https://flask.palletsprojects.com
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). http://tensorflow.org/ Software available from tensorflow.org.
[11] Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a Human-like Open-Domain Chatbot. (2020). arXiv:2001.09977
[12] Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In COLING 2018, 27th International Conference on Computational Linguistics.
Bioinformatics (09 2019).
[42] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models. arXiv preprint arXiv:1908.08962v2.