Boosting Low-Resource Biomedical QA via Entity-Aware Masking Strategies
Gabriele Pergola, Elena Kochkina, Lin Gui, Maria Liakata, Yulan He
University of Warwick, UK
Queen Mary University of London, UK
The Alan Turing Institute, UK
{gabriele.pergola, e.kochkina, lin.gui, yulan.he}@warwick.ac.uk, [email protected]

Abstract
Biomedical question-answering (QA) has gained increased attention for its capability to provide users with high-quality information from a vast scientific literature. Although an increasing number of biomedical QA datasets has been recently made available, those resources are still rather limited and expensive to produce. Transfer learning via pre-trained language models (LMs) has been shown as a promising approach to leverage existing general-purpose knowledge. However, fine-tuning these large models can be costly and time-consuming, often yielding limited benefits when adapting to specific themes of specialised domains, such as the COVID-19 literature. To bootstrap further their domain adaptation, we propose a simple yet unexplored approach, which we call biomedical entity-aware masking (BEM). We encourage masked language models to learn entity-centric knowledge based on the pivotal entities characterizing the domain at hand, and employ those entities to drive the LM fine-tuning. The resulting strategy is a downstream process applicable to a wide variety of masked LMs, not requiring additional memory or components in the neural architectures. Experimental results show performance on par with state-of-the-art models on several biomedical QA datasets.
Introduction

Biomedical question-answering (QA) aims to provide users with succinct answers given their queries by analysing a large-scale scientific literature. It enables clinicians, public health officials and end-users to quickly access the rapid flow of specialised knowledge continuously produced. This has led the research community's effort towards developing specialised models and tools for biomedical QA and assessing their performance on benchmark datasets such as BioASQ (Tsatsaronis et al., 2015). Producing such data is time-consuming and requires involving domain experts, making it an expensive process. As a result, high-quality biomedical QA datasets are a scarce resource. The recently released CovidQA collection (Tang et al., 2020), the first manually curated dataset about COVID-19 related issues, provides only 127 question-answer pairs. Even one of the largest available biomedical QA datasets, BioASQ, only contains a few thousand questions.

[Figure 1: An excerpt of a sentence masked via the BEM strategy ("[MASK] with [MASK] ([MASK] ...) ... than those without", masking the entities "Patients", "diabetes", "HR", "composite" and "endpoints"), where the masked words were chosen through a biomedical named entity recognizer. In contrast, BERT (Devlin et al., 2019) would randomly select the words to be masked, without attention to the relevant concepts characterizing a technical domain.]

There have been attempts to fine-tune pre-trained large-scale language models for general-purpose QA tasks (Rajpurkar et al., 2016; Liu et al., 2019; Raffel et al., 2020) and then use them directly for biomedical QA. Furthermore, there has also been increasing interest in developing domain-specific language models, such as BioBERT (Lee et al., 2019) or RoBERTa-Biomed (Gururangan et al., 2020), leveraging the vast medical literature available. While achieving state-of-the-art results on the QA task, these models come with a high computational cost: BioBERT needs ten days on eight GPUs to train (Lee et al., 2019), making it prohibitive for researchers with no access to massive computing resources.

An alternative approach to incorporating external knowledge into pre-trained language models is to drive the LM to focus on pivotal entities characterising the domain at hand during the fine-tuning stage.
[Figure 2: A schematic representation of the main steps involved in fine-tuning masked language models for the QA task through the biomedical entity-aware masking (BEM) strategy.]

Similar ideas were explored in works by Zhang et al. (2019) and Sun et al. (2020), which proposed the ERNIE model. However, their adaptation strategy was designed to generally improve the LM representations rather than adapting it to a particular domain, requiring additional objective functions and memory. In this work, we aim to enrich existing general-purpose LM models (e.g. BERT (Devlin et al., 2019)) with the knowledge related to key medical concepts. In addition, we want domain-specific LMs (e.g. BioBERT) to re-encode the already acquired information around the medical entities of interest for a particular topic or theme (e.g. literature relating to COVID-19).

Therefore, to facilitate further domain adaptation, we propose a simple yet unexplored approach based on a novel masking strategy to fine-tune a LM. Our approach introduces a biomedical entity-aware masking (BEM) strategy encouraging masked language models (MLMs) to learn entity-centric knowledge (§2). We first identify a set of entities characterising the domain at hand using a domain-specific entity recogniser (SciSpacy (Neumann et al., 2019)), and then employ a subset of those entities to drive the masking strategy while fine-tuning (Figure 1). The resulting BEM strategy is applicable to a vast variety of MLMs and does not require additional memory or components in the neural architectures. Experimental results show performance on a par with the state-of-the-art models for biomedical QA tasks (§4) on several biomedical QA datasets. A further qualitative assessment provides an insight into how QA pairs benefit from the proposed approach.
The fundamental principle of a masked language model (MLM) is to generate word representations that can be used to predict the missing tokens of an input text. While this general principle is adopted in the vast majority of MLMs, the particular way in which the tokens to be masked are chosen can vary considerably. We thus first analyse the random masking strategy adopted in BERT (Devlin et al., 2019), which has inspired most of the existing approaches, and then introduce the biomedical entity-aware masking strategy used to fine-tune MLMs in the biomedical domain.
BERT Masking strategy.
The masking strategy adopted in BERT randomly replaces a predefined proportion of words with a special [MASK] token, and the model is required to predict them. In BERT, 15% of tokens are chosen uniformly at random; of these, 80% are replaced with the [MASK] token, 10% are swapped with random tokens (thus resulting in an overall 1.5% of all tokens being randomly swapped), and the remaining 10% are kept without modification. The random swaps introduce a rather limited amount of noise with the aim of making the predictions more robust to trivial associations between the masked tokens and the context.
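For illustration, a minimal sketch of this selection rule (a simplified re-implementation for exposition, not BERT's actual training code; the tokenizer and token IDs are assumed to be given):

import random

def bert_random_mask(token_ids, vocab_size, mask_id, mask_prob=0.15):
    # BERT-style masking: select ~15% of positions; of these, replace 80%
    # with [MASK], 10% with a random token, and leave 10% unchanged.
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)  # -100: position excluded from the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok  # the model must recover the original token
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_id
            elif r < 0.9:
                corrupted[i] = random.randrange(vocab_size)
            # else: keep the original token as-is
    return corrupted, labels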
Biomedical Entity-Aware Masking Strategy
We describe an entity-aware masking strategy which only masks biomedical entities detected by a domain-specific named entity recogniser (SciSpacy, https://scispacy.apps.allenai.org/). Compared to the random masking strategy described above, which is used to pre-train masked language models, the introduced entity-aware masking strategy is adopted to boost the fine-tuning process on biomedical documents. In this phase, rather than randomly choosing the tokens to be masked, we inform the model of the relevant tokens to pay attention to, and encourage the model to refine its representations using the new surrounding context.

[Table 1: Performance of language models on the CovidQA (P@1, R@3, MRR) and BioASQ 7b (SAcc, LAcc, MRR) datasets, covering BERT, RoBERTa, RoBERTa-Biomed and BioBERT, each also in the +BioASQ, +STM +BioASQ and +BEM +BioASQ configurations, plus T5 +MS-MARCO. Values referenced with * come from Tang et al. (2020) and with † from Yoon et al. (2020).]
Replacing strategy. We decompose the BEM strategy into two steps: (1) recognition and (2) sub-sampling and substitution. During the recognition phase, a set of biomedical entities E is identified in advance over a training corpus. Then, at the sub-sampling and substitution stage, we first sample a proportion ρ of biomedical entities E' ⊂ E. The resulting entity subset E' is thus dynamically computed at batch time, in order to introduce a diverse and flexible spectrum of masked entities during training. For consistency, we use the same tokeniser for the documents d_i in the batch and the entities e_j ∈ E. Then, we substitute all the k entity mentions w^k_{e_j} in d_i with the special token [MASK], making sure that no consecutive entities are replaced. The substitution takes place at batch time, so that it is a downstream process suitable for a wide typology of MLMs. A diagram synthesising the involved steps is reported in Figure 2.

Biomedical Reading Comprehension. We represent a document as d_i := (s_{i0}, ..., s_{i,j-1}), a sequence of sentences, in turn defined as s_j := (w_{j0}, ..., w_{j,k-1}), with w_k a word occurring in s_j. Given a question q, the task is to retrieve the span w_{j,s}, ..., w_{j,s+t} from a document d_i that can answer the question. We assume the extractive QA setting, where the answer span to be extracted lies entirely within one, or more than one, document d_i. In addition, for consistency with the CovidQA dataset and to compare with results in Tang et al. (2020), we consider a further, slightly modified setting in which the task consists of retrieving the sentence s_{ij} that most likely contains the exact answer. This sentence-level QA task mitigates the non-trivial ambiguities intrinsic to the definition of the exact span for an answer, an issue particularly relevant in the medical domain and well known in the literature (Voorhees and Tice, 1999). Consider, for instance, the QA pair "What is the incubation period of the virus?" with gold answer "6.4 days (95% CI 5.3 to 7.6)": a model returning just "6.4 days" would be considered wrong.
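In the extractive setting, answer spans are typically scored with per-token start and end scores; below is a minimal sketch of the span-selection step (a standard recipe given such scores, which the paper does not spell out; the scores themselves would come from the fine-tuned LM's QA head):

def best_span(start_scores, end_scores, max_span_len=30):
    # Return token indices (s, e) maximising start_scores[s] + end_scores[e],
    # subject to s <= e and a maximum span length.
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_span_len, len(end_scores))):
            if s_score + end_scores[e] > best_score:
                best_score, best = s_score + end_scores[e], (s, e)
    return best  # the answer span w_{j,s}, ..., w_{j,s+t} with t = e - s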
Question: What is the OR for severe infection in COVID-19 patients with hypertension?
BERT with STM:
- There were significant correlations between COVID-19 severity and [..], diabetes [OR=2.67], coronary heart disease [OR=2.85].
- Compared with the non-severe patient, the pooled odds ratio of hypertension, respiratory system disease, cardiovascular disease in severe patients were (OR 2.36, ..), (OR 2.46, ..) and (OR 3.42, ..).
BERT with BEM:
- There were significant correlations between COVID-19 severity and [..], diabetes [OR=2.67], coronary heart disease [OR=2.85].
- Compared with the non-severe patient, the pooled odds ratio of hypertension, respiratory system disease, cardiovascular disease in severe patients were (OR 2.36, ..), (OR 2.46, ..) and (OR 3.42, ..).

Question: What is the HR for severe infection in COVID-19 patients with hypertension?
BERT with STM:
-
BERT with BEM:
- After adjusting for age and smoking status, patients with COPD (HR 2.681), diabetes (HR 1.59), and malignancy (HR 3.50) were more likely to reach to the composite endpoints than those without.

Question: What is the RR for severe infection in COVID-19 patients with hypertension?
BERT with STM:
-
BERT with BEM:
- In univariate analyses, factors significantly associated with severe COVID-19 were male sex (14 studies; pooled RR=1.70, ...), hypertension (10 studies 2.74 ...), diabetes (11 studies ...), and CVD (..).

Table 2: Examples of questions and retrieved answers using BERT fine-tuned either with its original masking approach (STM) or with the biomedical entity-aware masking (BEM) strategy.
Datasets. We assess the performance of the proposed masking strategies on two biomedical datasets: CovidQA and BioASQ.

CovidQA (Tang et al., 2020) is a manually curated dataset based on the AI2's COVID-19 Open Research Dataset (Wang et al., 2020). It consists of 127 question-answer pairs with 27 questions and 85 unique related articles. This dataset is too small for supervised training, but is a valuable resource for zero-shot evaluation to assess the unsupervised and transfer capability of models.
BioASQ (Tsatsaronis et al., 2015) is one of the larger biomedical QA datasets available, with over 2000 question-answer pairs. To use it within the extractive question answering framework, we convert the questions into the SQuAD dataset format (Rajpurkar et al., 2016), consisting of question-answer pairs and the corresponding passages, i.e. medical articles containing the answers or clues, with a length varying from a sentence to a paragraph. When multiple passages are available for a single question, we form additional question-context pairs, combined subsequently in a postprocessing step to choose the answer with the highest probability, similarly to Yoon et al. (2020). For consistency with the CovidQA dataset, we report our evaluation exclusively on the factoid questions of the BioASQ 7b Phase B.
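For reference, a sketch of the conversion target (the standard SQuAD v1.1 JSON structure; the function and field values here are illustrative, not the authors' conversion script):

def to_squad(question_id, question_text, passage_text, answer_text):
    # Wrap one BioASQ factoid (question, passage, answer) in SQuAD v1.1 format.
    # With multiple passages per question, one such entry is created per passage.
    return {
        "data": [{
            "title": "BioASQ",
            "paragraphs": [{
                "context": passage_text,
                "qas": [{
                    "id": question_id,
                    "question": question_text,
                    "answers": [{
                        "text": answer_text,
                        # character offset of the answer within the passage
                        "answer_start": passage_text.find(answer_text),
                    }],
                }],
            }],
        }]
    }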
Baselines. We use the following unsupervised neural models as baselines: the out-of-the-box BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), as well as their variants BioBERT (Lee et al., 2019) and RoBERTa-Biomed (Gururangan et al., 2020), fine-tuned on medical and scientific corpora. To highlight the impact of different fine-tuning strategies, we examine several configurations depending on the data and the masking strategy adopted. We experiment using the BioASQ QA training pairs during the fine-tuning stage and denote the models using them with +BioASQ. When we fine-tune the models on the corpus consisting of PubMed articles referred to within the BioASQ and AI2's COVID-19 Open Research datasets, we compare two masking strategies denoted as +STM and +BEM, where +STM indicates the standard masking strategy of the model at hand and +BEM is our proposed strategy. We additionally report the T5 (Raffel et al., 2020) performance on CovidQA, which constitutes the current state of the art (Tang et al., 2020).

Metrics. To facilitate comparisons, we adopt the same evaluation scores used in Tang et al. (2020) to assess the models on the CovidQA dataset, i.e. mean reciprocal rank (MRR), precision at rank one (P@1), and recall at rank three (R@3); similarly, for the BioASQ dataset, we use the strict accuracy (SAcc), lenient accuracy (LAcc) and MRR, the BioASQ challenge's official metrics.
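As a reference, these ranking metrics can be computed as follows (a standard-definition sketch, not the authors' evaluation script; R@3 is simplified here to whether any relevant candidate appears in the top three):

def ranking_metrics(ranked_relevance):
    # ranked_relevance: one list of 0/1 flags per question, over the ranked
    # candidates (1 = the candidate contains a correct answer).
    n = len(ranked_relevance)
    p_at_1 = sum(r[0] for r in ranked_relevance) / n
    r_at_3 = sum(1 for r in ranked_relevance if any(r[:3])) / n
    mrr = 0.0
    for r in ranked_relevance:
        rank = next((i + 1 for i, hit in enumerate(r) if hit), None)
        mrr += (1.0 / rank) if rank else 0.0
    return p_at_1, r_at_3, mrr / n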
We report the results on the QA tasks in Table 1. Among the unsupervised models, BERT achieves slightly better performance than RoBERTa on CovidQA, yet the situation is reversed on BioASQ (rows 1,5). The low precision of the two models (especially on the BioASQ dataset) confirms the difficulties in generalising to the biomedical domain. Specialised language models such as RoBERTa-Biomed and BioBERT show a significant improvement on the CovidQA dataset, but a rather limited one on BioASQ (rows 9,13), highlighting the importance of having larger medical corpora to assess the model's effectiveness. A general boost in performance is shared across models fine-tuned on the QA tasks, with a large benefit from the BioASQ QA pairs. The performance gains obtained by the specialised models (BioBERT and RoBERTa-Biomed) suggest the importance of transferring not only the domain knowledge but also the ability to perform the QA task itself (rows 9,10; 13,14). (We attach supplementary results in Appendix A on SQuAD (Table A1) and on the perplexity of MLMs when fine-tuned on the medical collection with different masking strategies (Figure A1).)

A further fine-tuning step before the training on the QA pairs proved beneficial for all of the models. The BEM masking strategy significantly amplified the models' generalisability, with an increased adaptation to the biomedical themes shown by the notable improvement in R@3 and MRR, with the R@3 outperforming the state-of-the-art results of T5 fine-tuned on MS MARCO (Bajaj et al., 2018) and proving the effectiveness of the BEM strategy.

Table 2 reports questions from CovidQA related to three statistical indices (i.e. Odds Ratio, Hazard Ratio and Relative Risk) used to assess the risk of an event occurring in a group (e.g. infection or death). We notice that even though the indices are mentioned as abbreviations, BERT fine-tuned with the STM is able to retrieve sentences with the exact answer for just one of the three questions. By contrast, BERT fine-tuned with the BEM strategy succeeds in retrieving at least one correct sentence for each question. This example suggests the importance of placing the emphasis on the entities, which might be overlooked by LMs during the training process despite being available.
Related Work

Our work is closely related to two lines of research: the design of masking strategies for LMs and the development of specialised models for the biomedical domain.
Masking strategies.
Building on top of BERT's masking strategy (Devlin et al., 2019), a wide variety of approaches has been proposed (Liu et al., 2019; Yang et al., 2019; Jiang et al., 2020). A family of masking approaches aims at leveraging entity and phrase occurrences in text. SpanBERT (Joshi et al., 2020) proposed to mask and predict whole spans rather than standalone tokens, and to make use of an auxiliary objective function. ERNIE (Zhang et al., 2019) is instead developed to mask well-known named entities and phrases to improve the external knowledge encoded. Similarly, KnowBERT (Peters et al., 2019) explicitly models entity spans and uses an entity linker to an external knowledge base to form knowledge-enhanced entity-span representations. However, despite the analogies with the BEM approach, the above masking strategies were designed to generally improve the LM representations rather than adapting them to particular domains, requiring additional objective functions and memory.
Biomedical LMs.
Particular attention has been devoted to the adaptation of LMs to the medical domain, with different corpora and tasks requiring tailored methodologies. BioBERT (Lee et al., 2019) is a biomedical language model based on BERT-Base, with additional pre-training on biomedical documents from the PubMed and PMC collections using the same training settings adopted in BERT. BioMed-RoBERTa (Gururangan et al., 2020) is instead based on RoBERTa-Base (Liu et al., 2019), using a corpus of 2.27M articles from the Semantic Scholar dataset (Ammar et al., 2018). SciBERT (Beltagy et al., 2019) follows BERT's masking strategy to pre-train the model from scratch using a scientific corpus composed of papers from Semantic Scholar (Ammar et al., 2018); out of the 1.14M papers used, the majority belong to the biomedical domain.
Conclusion

We presented BEM, a biomedical entity-aware masking strategy to boost LM adaptation to low-resource biomedical QA. It uses an entity-driven masking strategy to fine-tune LMs, effectively guiding them in learning entity-centric knowledge based on the pivotal entities characterising the domain at hand. Experimental results have shown the benefits of such an approach on several metrics for biomedical QA tasks.
Acknowledgements
This work is funded by the EPSRC (grant no. EP/T017112/1, EP/V048597/1). YH is supported by a Turing AI Fellowship funded by the UK Research and Innovation (UKRI) (grant no. EP/V020579/1).

References
Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, and Oren Etzioni. 2018. Construction of the literature graph in Semantic Scholar. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 84–91.

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A human generated machine reading comprehension dataset.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), NAACL.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020.

Zhengbao Jiang, Antonios Anastasopoulos, Jun Araki, Haibo Ding, and Graham Neubig. 2020. X-FACTR: Multilingual factual knowledge retrieval from pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online. Association for Computational Linguistics.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics (TACL), 8.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2019.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, Florence, Italy.

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE 2.0: A continual pre-training framework for language understanding. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020.

Raphael Tang, Rodrigo Nogueira, Edwin Zhang, Nikhil Gupta, Phuong Cam, Kyunghyun Cho, and Jimmy Lin. 2020. Rapidly bootstrapping a question answering dataset for COVID-19.

George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. 2015. An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(1).

Ellen M. Voorhees and Dawn M. Tice. 1999. The TREC-8 question answering track evaluation. In Text Retrieval Conference TREC-8, pages 83–105.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Michael Kinney, Yunyao Li, Ziyang Liu, William Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Nancy Xin Ru Wang, Christopher Wilhelm, Boya Xie, Douglas M. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32.

Wonjin Yoon, Jinhyuk Lee, Donghyeon Kim, Minbyul Jeong, and Jaewoo Kang. 2020. Pre-trained language model for biomedical question answering. In Machine Learning and Knowledge Discovery in Databases.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy.
Appendix
We further examined whether fine-tuning on the QA pairs affects not only the model's adaptation to the QA task, but also helps to realign the representations for the domain at hand. The reported scores point out that the vanilla LMs are the ones gaining the most when using in-domain QA pairs, such as BioASQ, compared to SQuAD (rows 2,3; 9,10). The advantage tends to be reduced on already specialised LMs (rows 16,17; 23,24).
[Table A1: Performance of language models on the CovidQA (P@1, R@3, MRR) and BioASQ 7b (SAcc, LAcc, MRR) datasets, covering BERT, RoBERTa, RoBERTa-Biomed and BioBERT, each also in the +SQuAD, +BioASQ, +STM +SQuAD, +STM +BioASQ, +BEM +SQuAD and +BEM +BioASQ configurations, plus T5 +MS-MARCO. Values referenced with * come from Tang et al. (2020) and with † from Yoon et al. (2020).]

In Figure A1, we report the LM perplexity obtained when fine-tuning the models with the standard masking strategy versus the BEM strategy with different proportions of medical entities. Vanilla LMs experience a huge gain with just a small fraction of entities, while already specialised LMs show a lower but still significant improvement. This could be expected, as the specialised LMs have already encoded a large amount of domain knowledge, with representations that need to be realigned to the new ones.
[Figure A1: Perplexity on BioASQ of BERT, RoBERTa, BioBERT and RoBERTa-Biomed fine-tuned with the standard masking strategy versus the BEM strategy with different proportions of medical entities.]
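For completeness, masked-LM perplexity is typically derived from the evaluation loss as the exponentiated mean cross-entropy over the masked positions (a standard recipe, sketched below; not the authors' script):

import math

def mlm_perplexity(total_nll, n_masked_tokens):
    # Perplexity = exp(average negative log-likelihood per masked token).
    # Accumulate loss * n_masked over evaluation batches to obtain total_nll.
    return math.exp(total_nll / n_masked_tokens)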