An Evaluation of Two Commercial Deep Learning-Based Information Retrieval Systems for COVID-19 Literature
Sarvesh Soni, Kirk Roberts
School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
{sarvesh.soni, kirk.roberts}@uth.tmc.edu

Abstract
The COVID-19 pandemic has resulted in a tremendous need for access to the latest scientific information, primarily through the use of text mining and search tools. This has led to both corpora for biomedical articles related to COVID-19 (such as the CORD-19 corpus (Wang et al., 2020)) as well as search engines to query such data. While most research in search engines is performed in the academic field of information retrieval (IR), most academic search engines, though rigorously evaluated, are sparsely utilized, while major commercial web search engines (e.g., Google, Bing) dominate. This relates to COVID-19 because it can be expected that commercial search engines deployed for the pandemic will gain much higher traction than those produced in academic labs, and thus leads to questions about the empirical performance of these search tools. This paper seeks to empirically evaluate two such commercial search engines for COVID-19, produced by Google and Amazon, in comparison to the more academic prototypes evaluated in the context of the TREC-COVID track (Roberts et al., 2020). We performed several steps to reduce bias in the available manual judgments in order to ensure a fair comparison of the two systems with those submitted to TREC-COVID. We find that the top-performing system from TREC-COVID on the bpref metric performed the best among the different systems evaluated in this study on all the metrics. This has implications for developing biomedical retrieval systems for future health crises as well as trust in popular health search engines.
Introduction

There has been a surge of scientific studies related to COVID-19 due to the availability of archival sources as well as the expedited review policies of publishing venues. A systematic effort to consolidate the flood of such information content, in the form of scientific articles, along with studies from the past that may be relevant to COVID-19, is being carried out as requested by the White House (Wang et al., 2020). This effort led to the creation of CORD-19, a dataset of scientific articles related to COVID-19 and the other viruses from the coronavirus family. One of the main aims for building such a dataset is to bridge the gap between machine learning and biomedical expertise to surface insightful information from the abundance of relevant published content. The TREC-COVID challenge was introduced to target the exploration of the CORD-19 dataset by gathering the information needs of biomedical researchers (Roberts et al., 2020; Voorhees et al., 2020). The challenge involved an information retrieval (IR) task to retrieve a set of ranked relevant documents for a given query. Similar to the task of TREC-COVID, major technology companies Amazon and Google also developed their own systems for exploring the CORD-19 dataset.

Both Amazon and Google have made recent forays into biomedical natural language processing (NLP). Amazon launched Amazon Comprehend Medical (ACM) for developers to process unstructured medical data effectively (Kass-Hout and Wood, 2018). This motivated several researchers to explore the tool's capability in information extraction (Bhatia et al., 2019; Guzman et al., 2020; Heider et al., 2020). Interestingly, the same technology is also incorporated into their search engine for the CORD-19 dataset. It will be useful to assess the overall performance of their search engine that utilizes the company's NLP technology. Similarly, BERT from Google (Devlin et al., 2019) is enormously popular. BERT is a powerful language model that is trained on large raw text datasets to learn the nuances of natural language in an efficient manner. The methodology of training BERT helps it transfer the knowledge from vast raw data sources to other specific domains such as biomedicine. Several works have explored the efficacy of BERT models in the biomedical domain for tasks such as information extraction (Wu et al., 2020) and question answering (Soni and Roberts, 2020). Many biomedical and scientific variants of the model have also been built, such as BioBERT (Lee et al., 2019), Clinical BERT (Alsentzer et al., 2019), and SciBERT (Beltagy et al., 2019). Google has even incorporated BERT into their web search engine (Nayak, 2019). Since this is the same technology that powers Google's CORD-19 search explorer, it will be interesting to assess the performance of this search tool.

However, despite the popularity of these companies' products, no formal evaluation of these systems has been made available by the companies. Also, neither of these companies participated in the TREC-COVID challenge. In this paper, we aim to evaluate these two IR systems and compare them against the runs submitted to the TREC-COVID challenge to gauge the efficacy of what are likely highly utilized search engines.

Systems

We evaluate two publicly available IR systems targeted toward exploring the COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020). These systems were launched by Amazon (CORD-19 Search, https://cord19.aws) and Google (COVID-19 Research Explorer, https://covid19-research-explorer.appspot.com).
We hereafter refer to these systems by the names of their corporations, i.e., Amazon and Google. Both systems take as input a query in the form of natural language and return a list of documents from the CORD-19 dataset ranked by their relevance to the given query.

Amazon's system uses an enriched version of the CORD-19 dataset constructed by passing it through a language processing service called Amazon Comprehend Medical (ACM) (Kass-Hout and Snively, 2020). ACM is a machine learning-based natural language processing (NLP) pipeline to extract clinical concepts such as signs, symptoms, diseases, and treatments from unstructured text (Kass-Hout and Wood, 2018). The data is further mapped to clinical topics related to COVID-19, such as immunology, clinical trials, and virology, using multi-label classification and inference models. After the enrichment process, the data is indexed using Amazon Kendra, which also uses machine learning to provide natural language querying capabilities for extracting relevant documents.

Google's system is based on a semantic search mechanism powered by BERT (Devlin et al., 2019), a deep learning-based approach to pre-training and fine-tuning for downstream NLP tasks (document retrieval in this case) (Hall, 2020). Semantic search, unlike lexical term-based search that aims at phrasal matching, focuses on understanding the meaning of user queries for searching. However, deep learning models such as BERT require a substantial amount of annotated data to be tuned for a specific task or domain. Biomedical articles have very different linguistic features than the general domain upon which the BERT model is built. Thus, the model needs to be tuned for the target domain, i.e., the biomedical domain, using annotated data. For this purpose, they use biomedical IR datasets from the BioASQ challenges (http://bioasq.org). Due to the smaller size of these biomedical datasets, and the large data requirement of the neural models, they use a synthetic query generation technique to augment the existing biomedical IR datasets (Ma et al., 2020). Finally, these expanded datasets are used to fine-tune the neural model. They further enhance their system by combining term- and neural-based retrieval models, balancing their memorization and generalization dynamics (Jiang et al., 2020).

Data and Evaluation

We use a topic set collected as part of the TREC-COVID challenge for our evaluations (Roberts et al., 2020; Voorhees et al., 2020). These topics are a set of information need statements motivated by searches submitted to the National Library of Medicine and suggestions from researchers on Twitter. Each topic consists of three fields with varying levels of granularity in terms of expressing the information need, namely, a (keyword-based) query, a (natural language) question, and a (longer descriptive) narrative. A few example topics from Round 1 of the challenge are presented in Table 1. The challenge participants are required to return a ranked list of documents for each topic (also known as runs). The first round of TREC-COVID used a set of 30 topics and exploited the April 10, 2020 release of CORD-19. Round 1 of the challenge was initiated on April 15, 2020, with the runs from participants due April 23. Relevance judgments were released May 3.

Table 1: Three example topics from Round 1 of the TREC-COVID challenge.

Topic
  Query: serological tests for coronavirus
  Question: are there serological tests that detect antibodies to coronavirus?
  Narrative: looking for assays that measure immune response to coronavirus that will help determine past infection and subsequent possible immunity.

Topic
  Query: coronavirus social distancing impact
  Question: has social distancing had an impact on slowing the spread of COVID-19?
  Narrative: seeking specific information on studies that have measured COVID-19's transmission in one or more social distancing (or non-social distancing) approaches.

Topic
  Query: coronavirus remdesivir
  Question: is remdesivir an effective treatment for COVID-19?
  Narrative: seeking specific information on clinical outcomes in COVID-19 patients treated with remdesivir.

We use the question and narrative fields from the topics to query the systems developed by Amazon and Google. These fields are chosen following the recommendations set forward by the organizations, i.e., to use fully formed queries with questions and context. We use two variations for querying the systems. In the first variation, we query the systems using only the question. In the second variation, we also append the narrative to provide more context.

As we accessed these systems in the first week of May 2020, the systems could be using the latest version of CORD-19 at that time (i.e., the May 1 release). Thus, we filter the list of returned documents and only include the ones from the April 10 release to ensure a fair comparison with the submissions to Round 1 of the TREC-COVID challenge. We compare the performance of these systems (by Amazon and Google) with the 5 top submissions to TREC-COVID Round 1 (on the basis of bpref scores). It is valid to compare the Amazon and Google systems with the submissions from Round 1 because all these systems are similarly built without using any relevance judgments from TREC-COVID.

Relevance judgments (or assessments) for TREC-COVID are carried out by individuals with biomedical expertise. The assessments are performed using a pooling mechanism where only the top-ranked results from different submissions are assessed. A document is assigned one of three possible judgments, namely, relevant, partially relevant, or not relevant. We use relevance judgments from Rounds 1 and 2. However, even the combined judgments from both rounds may not ensure that relevance judgments exist for the top-n documents of both evaluated systems. It has recently been shown that pooling effects can negatively impact post-hoc evaluation of systems that did not participate in the pooling (Yilmaz et al., 2020). So, to create a level ground for comparison, we perform additional relevance assessments for the documents from the evaluated systems that may not have been covered by the combined set of judgments from TREC-COVID. In total, 141 documents were assessed by 2 individuals who are also involved in performing the relevance judgments for TREC-COVID.

The runs submitted to TREC-COVID could contain up to 1000 documents per topic. Due to the restrictions posed by the evaluated systems, we could only fetch up to 100 documents per query. This number further decreases when we remove the documents that are not covered as part of the April 10 release of CORD-19. Thus, to ensure a fair comparison of the evaluated systems with the runs submitted to TREC-COVID, we calculate the minimum number of documents per topic (we call it the topic-minimum) across the different variations of querying the evaluated systems (i.e., question or question+narrative). We then use this topic-minimum as a threshold for the maximum number of documents per topic for all evaluated systems. This ensures that each system returns the same number of documents for a particular topic.
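As an illustration of this filtering and thresholding procedure, the following is a minimal Python sketch. All identifiers and the toy document ids are our own hypothetical examples, not part of any system or dataset described above; the actual runs contain up to 100 documents per topic.

```python
from collections import defaultdict

# Hypothetical run data: run name -> topic id -> ranked list of document ids.
commercial_runs = {
    "amazon_question":           {"1": ["d1", "d2", "d3", "d9"]},
    "amazon_question_narrative": {"1": ["d2", "d7", "d3"]},
    "google_question":           {"1": ["d1", "d3", "d8", "d2"]},
    "google_question_narrative": {"1": ["d3", "d1"]},
}
april10_docs = {"d1", "d2", "d3", "d7", "d8"}  # doc ids in the reference CORD-19 release

def filter_to_release(run, release_doc_ids):
    """Drop documents that are not part of the reference CORD-19 release."""
    return {topic: [d for d in docs if d in release_doc_ids]
            for topic, docs in run.items()}

def topic_minimums(runs):
    """Smallest per-topic result count across the (filtered) query variants."""
    mins = defaultdict(lambda: float("inf"))
    for run in runs.values():
        for topic, docs in run.items():
            mins[topic] = min(mins[topic], len(docs))
    return mins

def truncate(run, mins):
    """Cut a run at the per-topic threshold so every system is scored on the same depth."""
    return {topic: docs[: int(mins[topic])] for topic, docs in run.items()}

filtered = {name: filter_to_release(run, april10_docs)
            for name, run in commercial_runs.items()}
mins = topic_minimums(filtered)
# The same per-topic cut-off is then applied to the TREC-COVID runs before scoring.
truncated = {name: truncate(run, mins) for name, run in filtered.items()}
print(mins["1"], truncated["amazon_question"]["1"])  # -> 2 ['d1', 'd2']
```

In this toy example the question+narrative variant returns only two in-release documents for the topic, so every run is cut to two documents for that topic before evaluation.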
Table 2: Evaluation results after setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID and our additional relevance assessments. The highest scores for the evaluated and TREC-COVID systems are underlined.

System                        P@5     P@10    NDCG@10  MAP     NDCG    bpref
Amazon, question              0.6733  0.6333  0.539    0.0722  0.1838  0.1049
Amazon, question + narrative  0.72    0.64    0.5583   0.0766  0.1862  0.1063
Google, question              0.5733  0.57    0.4972   0.0693  0.1831  0.1069
Google, question + narrative  0.6067  0.56    0.5112   0.0687  0.1821  0.1054
TREC-COVID
1. sab20.1.meta.docs          0.78    0.7133  0.6109   0.0999  0.2266  0.1352
2. sab20.1.merged             0.6733  0.6433  0.5555   0.0787  0.1971  0.1154
3. UIowaS Run3                0.6467  0.6367  0.5466   0.0952  0.2091  0.1279
4. smith.rm3                  0.6467  0.6133  0.5225   0.0914  0.2095  0.1303
5. udel fang run3             0.6333  0.6133  0.5398   0.0857  0.1977  0.1187
Figure 1: A box plot of the number of documents for each topic as used in our evaluations (after filtering the documents based on the April 10 release of the CORD-19 dataset and setting a threshold at the minimum number of documents for any given topic).

We use the standard measures in our evaluation as employed for TREC-COVID, namely, bpref (binary preference), NDCG@10 (normalized discounted cumulative gain over the top 10 documents), and P@5 (precision at 5 documents). Here, bpref only uses judged documents in its calculation, while the other two measures assume non-judged documents to be not relevant. Additionally, we also calculate MAP (mean average precision), NDCG, and P@10. Note that we can precisely calculate the measures that cut off the ranking at 10 or fewer documents, since we have ensured that both evaluated systems (for both query variations) have their top 10 documents manually judged (through the TREC-COVID judgments and our additional assessments as part of this study). We use the trec_eval tool (https://github.com/usnistgov/trec_eval) for our evaluations, which is the standard evaluation tool employed for the TREC challenges.

Results

The total number of documents used for each topic based on the topic-minimums is shown in the form of a box plot in Figure 1. On average, approximately 43 documents are evaluated per topic, with a median of 40.5 documents. This is another reason for using a topic-wise minimum rather than cutting off all the systems at the same level as the lowest return count (that would be 25 documents). Having a topic-wise cut-off allowed us to evaluate the runs with the maximum possible documents while keeping the evaluation fair.

The evaluation results of our study are presented in Table 2. Among the commercial systems that we evaluated as part of this study, the question plus narrative variant of the system by Amazon performed consistently better than any other variant in terms of all the included measures other than bpref. In terms of bpref, the question-only variant of the system from Google performed the best among the evaluated systems. Note that the best run from the TREC-COVID challenge, after cutting off using topic-minimums, still performed better than the other four submitted runs included in our evaluation. Interestingly, this best run also performed substantially better than all the variants of both commercial systems evaluated as part of the study on all the calculated metrics. We discuss this system further below.
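To make the distinction between bpref and the fully judged measures concrete (bpref considers only judged documents, whereas the precision-based measures treat unjudged documents as not relevant), the following is a simplified, self-contained sketch of P@k and bpref for a single topic, assuming binary relevance. The study itself used the trec_eval implementations; the document ids below are purely illustrative.

```python
def precision_at_k(ranked_docs, judgments, k):
    """P@k: fraction of the top-k retrieved documents judged relevant.
    Documents without a judgment are counted as not relevant."""
    return sum(1 for d in ranked_docs[:k] if judgments.get(d, 0) > 0) / k

def bpref(ranked_docs, judgments):
    """Binary preference: computed only over judged documents.

    judgments maps doc id -> 1 (relevant) or 0 (judged not relevant);
    unjudged documents are skipped rather than penalized.
    """
    relevant = {d for d, j in judgments.items() if j > 0}
    nonrelevant = {d for d, j in judgments.items() if j == 0}
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    score, nonrel_above = 0.0, 0
    for d in ranked_docs:
        if d in nonrelevant:
            nonrel_above += 1
        elif d in relevant:
            if N == 0:
                score += 1.0
            else:
                # Penalty grows with the judged non-relevant docs ranked above this one.
                score += 1.0 - min(nonrel_above, R) / min(R, N)
    return score / R

# Toy example: five retrieved docs, only three of them judged.
run = ["d1", "d2", "d3", "d4", "d5"]
qrels = {"d1": 1, "d3": 0, "d5": 1}   # d2 and d4 are unjudged
print(precision_at_k(run, qrels, 5))  # 0.4 (unjudged docs count as not relevant)
print(bpref(run, qrels))              # 0.5 (d5 penalized for ranking below judged non-relevant d3)
```

In practice, these scores (together with the graded NDCG variants) are produced by running trec_eval over a qrels file and a run file, which is what we did in this study.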
Discussion

We evaluate two commercial IR systems targeted toward extracting relevant documents from the CORD-19 dataset. For comparison, we also include the 5 best runs from TREC-COVID in our evaluation. We additionally annotate a total of 141 documents from the runs by the commercial systems to ensure a fair comparison between these runs and the runs from the TREC-COVID challenge. We find that the best system from TREC-COVID in terms of the bpref metric outperformed all the commercial system variants on all the evaluated measures, including P@5, NDCG@10, and bpref, which are the standard measures used in TREC-COVID.

The commercial systems often employ cutting-edge technologies, such as ACM and BERT used by Amazon and Google, while developing their systems. Also, the availability of technological resources such as CPUs and GPUs may be better in industry settings than in academic settings. This follows a common concern in academia, namely that the resource requirements for advanced machine learning methods (e.g., GPT-3 (Brown et al., 2020)) are well beyond the capabilities available to the vast majority of researchers. However, these results instead demonstrate the potential pitfalls of deploying a deep learning-based system without proper tuning. The sabir (sab20.*) system does not use machine learning at all: it is based on the very old SMART system (Buckley, 1985) and does not utilize any biomedical resources. It is instead carefully deployed based on an analysis of the data fields available in CORD-19. Subsequent rounds of TREC-COVID have since overtaken sabir (based indeed on machine learning with relevant training data). The lesson, then, for future emerging health events is that deploying "state-of-the-art" methods without event-specific data may be dangerous, and in the face of uncertainty simple may still be best.

As evident from Figure 1, many of the documents retrieved by the commercial systems were not part of the April 10 release of CORD-19. We queried these systems after another version of the CORD-19 dataset was released. New sources of papers were constantly being added to the dataset, alongside updating the content of existing papers and adding newly published research related to COVID-19. This may have led to the retrieval of more articles from the new release of the dataset. However, for a fair comparison between the commercial and the TREC-COVID systems, we pruned the list of documents and performed additional relevance judgments. We have included the evaluation results that would have resulted without our modifications in the supplemental material; the performance of these two systems drops precipitously. Yet, as addressed, this would not have been a "fair" comparison, and thus the corrective measures described above were necessary to ensure the scientific validity of our comparison.
Conclusion

We assessed the performance of two commercial IR systems using similar evaluation methods and measures as the TREC-COVID challenge. To facilitate a fair comparison between these systems and the top 5 runs submitted to TREC-COVID, we cut all the runs at different thresholds and performed more relevance judgments beyond the assessments provided by TREC-COVID. We found that the top-performing system from TREC-COVID on the bpref metric remained the best-performing system among the commercial and TREC-COVID submissions on all the evaluation metrics. Interestingly, this best-performing run comes from a simple system that is purely based on the data elements present in the CORD-19 dataset and does not apply machine learning. Thus, applying cutting-edge technologies without enough target data-specific modifications may not be sufficient for achieving optimal results.
Acknowledgments
The authors thank Meghana Gudala and Jordan Godfrey-Stovall for conducting the additional retrieval assessments. This work was supported in part by the National Science Foundation (NSF) under award OIA-1937136.
References
Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620.

Parminder Bhatia, Busra Celikkaya, Mohammed Khalilia, and Selvan Senthivel. 2019. Comprehend Medical: A Named Entity Recognition and Relationship Extraction Web Service. Pages 1844–1851.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs].

Chris Buckley. 1985. Implementation of the SMART information retrieval system. Technical Report 85-686, Cornell University.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Benedict Guzman, Isabel Metzger, Yindalon Aphinyanaphongs, and Himanshu Grover. 2020. Assessment of Amazon Comprehend Medical: Medication Information Extraction.

Keith Hall. 2020. An NLU-Powered Tool to Explore COVID-19 Scientific Literature.

Paul M. Heider, Jihad S. Obeid, and Stéphane M. Meystre. 2020. A Comparative Analysis of Speed and Accuracy for Three Off-the-Shelf De-Identification Tools. AMIA Summits on Translational Science Proceedings, 2020:241–250.

Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, and Michael C. Mozer. 2020. Characterizing Structural Regularities of Labeled Data in Overparameterized Models. arXiv:2002.03206 [cs, stat].

Taha A. Kass-Hout and Ben Snively. 2020. AWS launches machine learning enabled search capabilities for COVID-19 dataset.

Taha A. Kass-Hout and Matt Wood. 2018. Introducing medical language processing with Amazon Comprehend Medical.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, pages 1–7.

Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. 2020. Zero-shot Neural Retrieval via Domain-targeted Synthetic Query Generation. arXiv:2004.14503 [cs].

Pandu Nayak. 2019. Understanding searches better than ever before.

Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen Voorhees, Lucy Lu Wang, and William R. Hersh. 2020. TREC-COVID: Rationale and Structure of an Information Retrieval Shared Task for COVID-19. Journal of the American Medical Informatics Association.

Sarvesh Soni and Kirk Roberts. 2020. Evaluation of Dataset Selection for Pre-Training and Fine-Tuning Transformer Language Models for Clinical Question Answering. In Proceedings of the LREC, pages 5534–5540.

Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2020. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. ACM SIGIR Forum, 54:1–12.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 Open Research Dataset. arXiv:2004.10706v2.

Stephen Wu, Kirk Roberts, Surabhi Datta, Jingcheng Du, Zongcheng Ji, Yuqi Si, Sarvesh Soni, Qiong Wang, Qiang Wei, Yang Xiang, Bo Zhao, and Hua Xu. 2020. Deep learning in clinical natural language processing: A methodical review. Journal of the American Medical Informatics Association, 27:457–470.

Emine Yilmaz, Nick Craswell, Bhaskar Mitra, and Daniel Campos. 2020. On the Reliability of Test Collections for Evaluating Systems of Different Types. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2101–2104.
A Supplementary Material
The results without taking into account our additional annotations, i.e., only using the relevance judgments from TREC-COVID Rounds 1 and 2, are presented in Table 3. Similarly, the results without setting an explicit threshold on the number of returned documents by the systems are shown in Table 4. The results without either of the two modifications made by us are provided in Table 5.

Table 3: Evaluation results after setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID (WITHOUT our additional relevance assessments). The highest scores for the evaluated and TREC-COVID systems are underlined.
System                        P@5     P@10    NDCG@10  MAP     NDCG    bpref
Amazon, question              0.6467  0.5933  0.5095   0.069   0.1794  0.1035
Amazon, question + narrative  0.6933  0.5933  0.5307   0.0722  0.1804  0.1031
Google, question              0.5667  0.5133  0.4688   0.0655  0.1785  0.1048
Google, question + narrative  0.56    0.5133  0.4795   0.0656  0.1763  0.1031
TREC-COVID
1. sab20.1.meta.docs          0.78    0.7133  0.6109   0.1007  0.2278  0.1361
2. sab20.1.merged             0.6667  0.64    0.5539   0.0789  0.1968  0.1155
3. UIowaS Run3                0.6467  0.6367  0.5466   0.096   0.2099  0.1287
4. smith.rm3                  0.6467  0.6133  0.5225   0.0922  0.2107  0.1315
5. udel fang run3             0.6333  0.6133  0.5398   0.0866  0.1989  0.1196
Table 4: Evaluation results WITHOUT setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID and our additional relevance assessments. The highest scores for the evaluated and TREC-COVID systems are underlined.
System                        P@5     P@10    NDCG@10  MAP     NDCG    bpref
Amazon, question              0.6733  0.6333  0.539    0.0765  0.1931  0.1134
Amazon, question + narrative  0.72    0.64    0.5583   0.0788  0.1903  0.1105
Google, question              0.5733  0.57    0.4972   0.0775  0.2001  0.1227
Google, question + narrative  0.6067  0.56    0.5112   0.0763  0.1979  0.121
TREC-COVID
1. sab20.1.meta.docs          0.78    0.7133  0.6109   0.2037  0.4702  0.3404
2. sab20.1.merged             0.6733  0.6433  0.5555   0.1598  0.4415  0.3433
3. UIowaS Run3                0.6467  0.6367  0.5466   0.174   0.4145  0.3229
4. smith.rm3                  0.6467  0.6133  0.5225   0.1947  0.4461  0.3406
5. udel fang run3             0.6333  0.6133  0.5398   0.1911  0.4495  0.3246
Table 5: Evaluation results WITHOUT setting a threshold at the number of documents per topic using a minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID (WITHOUT our additional relevance assessments). The highest scores for the evaluated and TREC-COVID systems are underlined.
System P@5 P@10 NDCG@10 MAP NDCG bpref