Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard
Jimmy Lin, Daniel Campos, Nick Craswell, Bhaskar Mitra, and Emine Yilmaz
University of Waterloo · Microsoft AI & Research · University of Illinois Urbana-Champaign · University College London
ABSTRACT
Leaderboards are a ubiquitous part of modern research in applied machine learning. By design, they sort entries into some linear order, where the top-scoring entry is recognized as the “state of the art” (SOTA). Due to the rapid progress being made in information retrieval today, particularly with neural models, the top entry in a leaderboard is replaced with some regularity. These are touted as improvements in the state of the art. Such pronouncements, however, are almost never qualified with significance testing. In the context of the MS MARCO document ranking leaderboard, we pose a specific question: How do we know if a run is significantly better than the current SOTA? We ask this question against the backdrop of recent IR debates on scale types: in particular, whether commonly used significance tests are even mathematically permissible. Recognizing these potential pitfalls in evaluation methodology, our study proposes an evaluation framework that explicitly treats certain outcomes as distinct and avoids aggregating them into a single-point metric. Empirical analysis of SOTA runs from the MS MARCO document ranking leaderboard reveals insights about how one run can be “significantly better” than another that are obscured by the current official evaluation metric (MRR@100).
1 INTRODUCTION
Leaderboard rankings and claims of the “state of the art” (SOTA) pervade modern research in applied machine learning, particularly in natural language processing, information retrieval, and computer vision. There has been much debate in the community on the merits of such activities, compared to alternative uses of the same researcher energy, attention, and resources. Without participating in this debate, this work attempts to address what we view as a technical shortcoming of many, if not most, leaderboards today: the lack of significance testing.
Specifically, we wish to answer the question: Does a particular run significantly improve the state of the art? The nature of leaderboards and rapid progress by researchers mean that the top-scoring run is regularly overtaken and replaced by another run that reports a higher score. This is communicated (in papers, blog posts, tweets, etc.) as beating the existing SOTA and achieving a new SOTA. Such pronouncements, however, are rarely qualified with significance tests. We hope to take a small step towards rectifying this.
Our study investigates significance testing among SOTA results in the context of the MS MARCO document ranking leaderboard, but our findings and lessons learned can perhaps be generalized to other leaderboards in IR and beyond. Against the backdrop of recent debates about evaluation methodology in IR [3, 4, 9], we discover, unsurprisingly, that there is no simple answer. Our findings can be summarized as follows:
(1) The existing single-point metric for quantifying the “goodness” of a run on the MS MARCO document ranking leaderboard (MRR@100) conflates important differences in how one run can be “better” than another. Thus, the naive approach of running standard significance tests on the existing metric may lead to questionable results.
(2) To address this issue, we propose an evaluation framework that explicitly tracks outcomes separately, which then permits meaningful aggregation and significance testing. From a qualitative perspective, this framework reveals many insights about differences that are obscured by the existing official metric.
(3) Contributing to recent debates in the IR community on scale types and whether certain statistical operations are mathematically permissible [3], we find that in our framework, analysis in terms of expected search length (ESL), which is a ratio scale, and mean reciprocal rank (MRR), which is an ordinal scale, yields largely the same findings.
The contribution of this work is a novel evaluation framework that compares putative SOTA submissions in a nuanced way, informing ongoing debates in the information retrieval community about evaluation methodology. We find that runs can be “better” in different ways, but these “different ways” cannot be reconciled without appealing to a user model of utility (which is presently absent in the task definition).
It is worth emphasizing that in this paper, we are asking a very narrow question about entries on a leaderboard and significance testing with respect to a clearly defined metric (the one that determines the ranking on the leaderboard). There are a number of questions that are outside the scope of inquiry, for example: Is the new SOTA technique practically deployable (e.g., considering inference latencies, model size, etc.)? Does the model train efficiently? Is the improvement in the SOTA meaningful from a user perspective (i.e., does it improve the user experience)? Might the new SOTA technique encode some bias that would be a cause for concern? And, no doubt, many more questions.
While these are all important considerations, they raise orthogonal issues that we do not tackle here. Nevertheless, we show that even for such a narrowly framed question, there is still quite a bit of nuance that is missing in the current discourse.
2 BACKGROUND
The MS MARCO dataset was originally released in 2016 with the aim of helping academic researchers explore information access in the large-data regime, particularly in the context of models based on neural networks that were known to be data hungry [8]. Initially, the dataset was designed to study question answering on web passages, but it was later adapted into traditional ad hoc ranking tasks. Today, the document ranking and passage ranking tasks host competitive leaderboards that attract much attention from researchers around the world.
This paper focuses on the document ranking task, which is a standard ad hoc retrieval task over a corpus of 3.2M web pages with URL, title, and body text. The organizers have made available a training set with 367K queries and a development set with 5193 queries; each query has exactly one relevance judgment. There are 5793 evaluation (test) queries; relevance judgments for these queries are withheld from the public. Scores on the evaluation queries can only be obtained by a submission to the leaderboard. The official metric is mean reciprocal rank at a cutoff of 100 (MRR@100).
Of the myriad metrics that have been proposed to evaluate retrieval systems, there are those that make strong claims as to modeling user utility, such as nDCG [5] and RBP [7], and those that do not, say, precision at a fixed cutoff. Specifically in the context of the MS MARCO document ranking task, reciprocal rank (RR) makes at least some plausible claims about utility. At a high level, the metric says that the user only cares about getting a single relevant document (not unrealistic since MS MARCO models question answering “in the wild”), and that utility drops off rapidly as a function of increasing ordinal rank. While the functional form of this dropoff might be a matter of debate (similar disagreements can be had about the functional form of the discount in nDCG), there is strong empirical support for the claim in general, dating back well over a decade. In the web search context, log analysis (e.g., [1]) as well as eye-tracking experiments (e.g., [6]) have shown that user click probabilities and attention fall rapidly with increasing ordinal rank in the retrieved results.
We believe that at least some of the ongoing controversies about evaluation methodologies in information retrieval stem from confusion on whether a metric is being used simply as a useful proxy for effectiveness (to aid in quantifying model improvements) or is actually making a claim about utility. Thus, in this paper, we are careful to separate the two, and are explicit when making a claim about utility (and appealing to some user model).
The proximate motivation of this study is the recent work of Ferrante et al. [3], who argued that most IR metrics are not interval-scaled and suggested that decades of IR research may be methodologically flawed and hence the conclusions may not be reliable. We are not able to do sufficient justice to their detailed arguments due to space limitations, but the crux of the matter in our context is that for MRR, intervals are not equi-spaced; that is, the same change in rank does not correspond to the same change in score across the scale: moving a relevant document from rank 2 to rank 1 changes RR by 0.5, whereas moving it from rank 10 to rank 9 changes RR by roughly 0.01.
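To make the preceding point concrete, here is a minimal sketch (not the official evaluation script; the function name and toy ranks are our own) of how per-query reciprocal rank and MRR@100 are computed, and of why adjacent ranks are not equi-spaced on the RR scale:

```python
def rr_at_k(rank, k=100):
    """Reciprocal rank at cutoff k; rank is the 1-based position of the single
    relevant document, or None if it is not retrieved in the top k."""
    if rank is None or rank > k:
        return 0.0
    return 1.0 / rank

# MRR@100 is the arithmetic mean of per-query RR@100 over the query set.
ranks = [1, 3, None, 12]  # hypothetical positions of the relevant document
mrr_at_100 = sum(rr_at_k(r) for r in ranks) / len(ranks)

# Intervals on the RR scale are not equi-spaced: the same one-rank change
# moves the score by very different amounts depending on where it occurs.
print(rr_at_k(1) - rr_at_k(2))   # 0.5
print(rr_at_k(9) - rr_at_k(10))  # ~0.011
print(mrr_at_100)                # ~0.354
```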
3 A FIRST ATTEMPT AT SIGNIFICANCE TESTING
A summary of the MS MARCO document ranking leaderboard since its launch in August 2020 (until mid-February 2021) is shown in Figure 1, where each point represents a run: the x-axis plots the date of submission and the y-axis plots the official metric (MRR@100) reported on the leaderboard for the held-out evaluation (test) set.
Figure 1: The leaderboard of the MS MARCO document ranking task, showing the effectiveness of runs (MRR@100) on the held-out evaluation set over time. “State of the art” (SOTA) runs are shown in red.
Circles in red represent the (current and former) state-of-the-art (SOTA) runs, i.e., runs that displaced a previous run at the top of the leaderboard, beginning with the first submission that beat the baselines provided by the organizers when the leaderboard first launched. In our analysis, we specifically focused on these SOTA runs. Since the identities of the runs (and for that matter, the actual techniques they used) are not germane to our analysis, we simply denote them 𝑅 to 𝑅, arranged chronologically. That is, 𝑅 is the first SOTA run, and 𝑅 is the most recent.
If we wish to ask if one SOTA run is significantly better than another, an obvious first attempt would be to run some standard statistical test over per-query scores of the official metric (MRR@100). Among the myriad tests available, three stand out:
(1) Wilcoxon rank sum test (WRS): a non-parametric test that requires samples to be on an ordinal scale.
(2) Wilcoxon signed rank test (WSR): a non-parametric test that requires samples to be on an interval scale.
(3) Student’s 𝑡-test: a parametric test that requires samples to be on an interval scale.
As discussed in Ferrante et al. [3], there has been quite a bit of controversy (in IR and beyond) on what tests are permissible for what scale types. Even taking the most stringent position, there is no doubt that reciprocal rank is an ordinal scale: obtaining a document at rank one is preferable to obtaining it at rank two, which in turn is preferable to rank three, and so on; any of these outcomes is preferable to not having the document in the ranked list at all. This holds without making any commitments on the “distance” between any of these possible outcomes (i.e., equi-spacing). Thus, WRS is unequivocally permissible. With these caveats rendered explicit, let’s just run all the tests anyway.
The results of running all three significance tests on pairwise comparisons between 𝑅 (the earliest SOTA run) and every subsequent SOTA run {𝑅 . . . 𝑅} on the evaluation set are shown in Table 1. Additionally, we compare the two current top runs on the leaderboard, 𝑅 and 𝑅. The table reports the absolute differences in effectiveness, along with the raw 𝑝-values of the different tests, prior to the application of the Bonferroni correction for multiple hypothesis testing.
Table 1: Results of running significance tests on SOTA runs: Wilcoxon rank sum test (WRS), Wilcoxon signed rank test (WSR), and Student’s 𝑡-test. 𝑅 . . . 𝑅 are the SOTA runs, arranged chronologically.
We see that based on all three tests, the improvements from successive SOTA runs are not statistically significant until we get to 𝑅; all subsequent runs thereafter appear to be significantly better (even just focusing on WRS). The absolute difference in MRR@100 between 𝑅 and 𝑅 is 2.4 points, which is surprisingly large. Independent of the particulars of any evaluation, the general expectation would be that with a large number of queries (over 5K in our case), small significant differences (i.e., small effect sizes) should be detectable. That doesn’t appear to be the case here.
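For reference, per-query comparisons like those reported in Table 1 can be computed with standard tooling; a minimal sketch, assuming scipy and hypothetical arrays of per-query RR@100 scores for two runs, aligned by query:

```python
from scipy import stats

# Hypothetical per-query RR@100 scores for two runs, aligned by query.
run_a = [1.0, 0.5, 0.0, 0.25, 1.0, 0.1]
run_b = [0.333, 1.0, 0.2, 0.125, 0.25, 0.05]

wrs = stats.ranksums(run_a, run_b)      # Wilcoxon rank sum (ordinal scale)
wsr = stats.wilcoxon(run_a, run_b)      # Wilcoxon signed rank (paired)
ttest = stats.ttest_rel(run_a, run_b)   # paired Student's t-test

# Raw p-values; a Bonferroni correction would divide the significance
# threshold by the number of comparisons performed.
print(wrs.pvalue, wsr.pvalue, ttest.pvalue)
```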
While this unexpected finding does not in and of itself demonstrate that something is amiss, it would be natural to suspect Type II errors; that is, our tests are not powerful enough to confidently reject the null hypothesis. As it turns out, this is not the issue. As we will demonstrate in the context of our proposed evaluation framework, the fundamental issue is that MRR@100 forces aggregation across outcomes that are not directly comparable (with respect to some plausible user model).
4 EVALUATION FRAMEWORK
This section presents our evaluation framework, specifically tailored to the MS MARCO document ranking task. We begin with a few (hopefully) uncontroversial claims and from there build an approach to evaluation that explicitly avoids conflating distinct outcomes in a single-point metric. Note that since the official relevance judgments contain only one relevant document per query, the position of that relevant document on the ranked list (or its absence) alone determines the score (the metric, to be defined below) for that query; this nicely sidesteps the challenges with different “recall bases” [3], i.e., queries that have different numbers of relevant documents. There is an important detail here worth mentioning: the official evaluation script deliberately introduces a metric artifact designed to thwart (simple) attempts at “reverse-engineering” the evaluation set. Thus, the scores reported on the official leaderboard (and plotted in Figure 1) are not accurate. This artifact has no impact on the leaderboard rankings, but does impact significance testing. All our analyses are based on the true MRR@100 scores, after the removal of this artifact; thus, the absolute score differences may not line up with public leaderboard results.
Consider two hypothetical submissions to the MS MARCO document ranking leaderboard, runs 𝐴 and 𝐵, comprising ranked lists over a set of queries 𝑄. For each query 𝑞 ∈ 𝑄, there are logically the following distinct outcomes that cover all possibilities:
(1) Neither run 𝐴 nor run 𝐵 returns the relevant document in the top 𝑘 ranking. In this case, both runs are equally “bad”.
(2) Run 𝐴 returns the relevant document in the top 𝑘 ranking, while run 𝐵 does not. Run 𝐴 is thus “better”. Vice versa with 𝐵 and 𝐴 swapped.
(3) Both run 𝐴 and run 𝐵 return the relevant document in the top 𝑘 ranking, but the document has a ranking closer to the front of the ranked list in run 𝐵 (i.e., a lower ordinal rank). Run 𝐵 is thus “better”. Vice versa with 𝐵 and 𝐴 swapped.
We believe that the above assertions hold regardless of how one might choose to operationalize “bad” and “better”. However, to be more precise, let us define a metric in terms of expected search length (ESL), which has a long history in IR research dating back to the 1960s [2]. ESL quantifies how long a user needs to search (more specifically, read the ranked list) before obtaining a relevant document: a relevant document appearing at rank 1 gets a score of 1, rank 2 gets a score of 2, and so on, all the way up to rank 100 (in our case). Thus, the lower the score, the better.
Consider a straightforward user model, that of a patient user who issues a query and is willing to read 100 documents per query (at a constant pace) to find the relevant document, and then gives up (if no relevant document is found). It would be plausible to make the claim that ESL, with respect to this user model, captures utility measured in user time. (We recognize that we are making a few simplifying assumptions, such as constant document length and fixed reading speed; more realism could be added by, for example, taking into account a more accurate model of reading speed [10], but these refinements are unlikely to change our overall analysis.)
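A minimal sketch of the three-way breakdown just described and of ESL as just defined (the helper names are our own; ranks are 1-based positions of the single relevant document, with None meaning it is absent from the top 𝑘):

```python
def esl(rank, k=100):
    """Expected search length for a query: the rank (1..k) at which the single
    relevant document appears; None if it is not retrieved in the top k."""
    return rank if rank is not None and rank <= k else None

def classify(rank_a, rank_b, k=100):
    """Assign a query to one of the three outcomes when comparing runs A and B."""
    hit_a = rank_a is not None and rank_a <= k
    hit_b = rank_b is not None and rank_b <= k
    if not hit_a and not hit_b:
        return 1   # case (1): neither run retrieves the relevant document
    if hit_a != hit_b:
        return 2   # case (2): exactly one run retrieves it
    return 3       # case (3): both runs retrieve it; ranks (ESL) can be compared

# Hypothetical query: relevant document at rank 2 in run A, rank 4 in run B.
print(classify(2, 4), esl(2), esl(4))  # 3 2 4
```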
It is clear that ESL as articulated here is on a ratio scale (which is by definition also an interval scale). A difference of 1 “means the same thing” at 2 ESL as it does at 99 ESL: in both cases, the relevant document appears one rank closer to the top of the ranked list; from the utility perspective, in both cases the user has saved the same amount of time. Furthermore, being a ratio scale, 3 ESL is 3× worse than 1 ESL, and 100 ESL is 5× worse than 20 ESL (the analogy here is that ESL is comparable to measuring temperature in Kelvins). Against our user model, these are also plausible statements to make with respect to utility (time): a relevant document at rank 4 (4 ESL) costs the user twice as much utility (time) as a relevant document at rank 2 (2 ESL).
So, for case (3) in the list of outcomes above, when comparing run 𝐴 and run 𝐵 for a specific query 𝑞, we have a good alignment between ESL and utility, and we can make a number of meaningful claims. For example, the following statements are permissible and indeed meaningful:
• For one query, run 𝐴 returns the relevant document at rank 2 and run 𝐵 returns the relevant document at rank 4. For another query, run 𝐴 returns the relevant document at rank 50 and run 𝐵 returns the relevant document at rank 52. In both cases, 𝐴 is better than 𝐵 by two “ESL units”, and even more strongly, by two units of utility (time).
• For one query, run 𝐴 returns the relevant document at rank 1 and run 𝐵 returns the relevant document at rank 4. For another query, run 𝐴 returns the relevant document at rank 10 and run 𝐵 returns the relevant document at rank 40. In both cases, 𝐴 is better than 𝐵 by 4×, in terms of both “ESL units” and units of utility (once again, setting aside refinements that better model document length, reading speed, etc.).
The upshot here is that we can meaningfully average ESL values across a large set of queries 𝑄. Consider: for one query, run 𝐴 returns the relevant document at rank 1 and run 𝐵 returns the relevant document at rank 4, and for another query, run 𝐴 returns the relevant document at rank 5 and run 𝐵 returns the relevant document at rank 2. We can say that on the set comprising these two queries, both runs are equally effective, with an (arithmetic) mean ESL of 3. The arithmetic mean here encodes the assumption that each query is equally important, which seems reasonable. Furthermore, it is plausible to say that on these two queries, a user derives equal utility.
Note, critically, however, that this only applies to case (3) above, when both runs contain the relevant document in their top 𝑘 lists. For case (2), it is unclear how similar statements can be made. Consider: for one query, run 𝐴 returns the relevant document at rank 1 and run 𝐵 returns the relevant document at rank 100, and for another query, run 𝐴 doesn’t return the relevant document at all and run 𝐵 returns the relevant document at rank 99. There is little that we can meaningfully say about the effectiveness of the runs on the set comprising these two queries, both from the perspective of ESL (what ESL would we assign to run 𝐴 for the second query?) and utility. For the latter, we would need to quantify the cost of not finding a relevant document relative to an ESL or time unit. There’s nothing in the framework we’ve presented thus far that would shed light on this without a more articulated user model (absent in the current task definition).
Note that the official metric MRR@100 does encode a specific utility difference between a retrieved document at rank 100 and not retrieving the relevant document (0.01 and 0, respectively), but justifying these values would require appeal to user models and data (e.g., query logs) that are beyond the scope of the leaderboard. We argue, instead, that the best way forward is to maintain an explicit separation and breakdown of the different outcomes. In other words, cases (2) and (3) represent apples and oranges, and it would be suspect to average across them without first establishing some way to compare different fruits.
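Concretely, the separation we advocate can be reported as a per-outcome breakdown rather than a single aggregate; a sketch under the same assumptions as above, reusing the hypothetical classify and esl helpers, with per-query ranks for two runs aligned by query:

```python
def breakdown(ranks_a, ranks_b, k=100):
    """Summarize a pairwise comparison by outcome rather than a single metric.
    Assumes the classify() and esl() helpers sketched earlier are in scope."""
    cases = {1: 0, 2: 0, 3: 0}
    a_wins = b_wins = 0        # case (2) sub-counts
    esl_a, esl_b = [], []      # per-query ESL over case (3) queries only
    for ra, rb in zip(ranks_a, ranks_b):
        c = classify(ra, rb, k)
        cases[c] += 1
        if c == 2:
            if ra is not None and ra <= k:
                a_wins += 1
            else:
                b_wins += 1
        elif c == 3:
            esl_a.append(esl(ra, k))
            esl_b.append(esl(rb, k))
    n = len(ranks_a)
    return {
        "pct_case1": cases[1] / n,
        "pct_case2": cases[2] / n, "a_wins": a_wins, "b_wins": b_wins,
        "pct_case3": cases[3] / n,
        "mean_esl_a": sum(esl_a) / len(esl_a) if esl_a else None,
        "mean_esl_b": sum(esl_b) / len(esl_b) if esl_b else None,
    }

# Hypothetical ranks of the relevant document for two runs over five queries.
print(breakdown([1, None, 3, 50, None], [2, 10, 3, None, None]))
```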
5 ANALYSIS OF SOTA RUNS
Let us now apply the evaluation framework proposed above to analyze the SOTA runs on the MS MARCO document ranking leaderboard. In our analysis, we compared each of 𝑅 . . . 𝑅 against 𝑅, the results of which are shown in Table 2; we additionally compared the top two runs on the leaderboard, 𝑅 against 𝑅. We show the percentage of queries that fall under a particular outcome, i.e., case (1), (2), or (3), as described in the previous section. Case (2) is broken down into “𝐴 wins”, where 𝐴 returns the relevant document in its ranked list and 𝐵 doesn’t, and “𝐵 wins”, the opposite case. For rhetorical convenience, we will use “answered” and “unanswered” for these cases. For case (3), we show the overall percentage, as well as the mean ESL and reciprocal rank (RR) for all queries that fall under that outcome. For ESL and RR, we show 𝑝-values from the application of the Wilcoxon signed-rank test and the paired 𝑡-test. Note that since ESL is on an interval (ratio) scale, these two tests are unequivocally permissible. For RR, the application of the Wilcoxon signed-rank test and the paired 𝑡-test is subject to the potential objections raised by Ferrante et al. [3] regarding interval scales. In all cases, we report raw 𝑝-values, prior to the application of the Bonferroni correction for multiple hypothesis testing.
Table 2: Analysis of SOTA runs from the MS MARCO document ranking leaderboard, broken into distinct outcomes: (1) neither run retrieves the relevant document, (2) one run retrieves the relevant document but not the other, and (3) both runs retrieve the relevant document.
This specific case study, and more generally our proposed evaluation framework, reveals many interesting insights that are completely hidden if we simply reported the means of per-query reciprocal ranks and ran significance tests on them, as in Section 3. Focusing specifically on ESL, we highlight a number of interesting observations below:
• In the first row, comparing 𝑅 vs. 𝑅, we see that the overall MRR@100 scores are quite close, but the two runs obtain those scores in very different ways. Looking at the case (3) breakdown, we see that 𝑅 has a higher ESL than 𝑅, and this difference is (highly) statistically significant. From this perspective, 𝑅 is worse than 𝑅 (relevant results appear later in the ranked lists). However, we see that 𝑅 answered far more queries that went unanswered in 𝑅 than the other way around. A similar observation can be made in the comparison between 𝑅 and 𝑅: when focusing only on ranking, case (3), 𝑅 is significantly worse than 𝑅, but 𝑅 compensates by answering more queries that went unanswered in 𝑅, leading to a higher score in terms of MRR@100.
• Consider the comparison between 𝑅 and 𝑅 (the second row): contrary to the examples discussed above, we see that 𝑅 significantly improves ranking, case (3), but has slightly more unanswered queries compared to the baseline. This also leads to an overall improvement in terms of MRR@100.
• Consider the comparison between 𝑅 and 𝑅. Here, looking at the prevalence of case (2), there are equal percentages of cases where the query was answered by one run but not the other. However, looking at case (3), we see that 𝑅 obtains a statistically significant reduction in ESL. That is, 𝑅 is better than 𝑅 not because it obtained more relevant documents, but rather because it ranked the relevant documents more highly.
• Another interesting observation relates to 𝑅, 𝑅, 𝑅, and 𝑅, which are runs from the same team. From the rows comparing 𝑅 to {𝑅, 𝑅, 𝑅}, we see that in terms of case (3), differences in ESL are not statistically significant. That is, the runs are comparable when it comes to ranking documents that appear in the top 100. The differences in MRR@100 come primarily from case (2), where {𝑅, 𝑅, 𝑅} have fewer unanswered queries on balance.
• Looking at the current top two runs on the leaderboard, 𝑅 and 𝑅, we see that the second-best run actually does a better job at ranking, i.e., case (3), than the top run. The latter wins by virtue of answering more queries.
Examining the differences between ESL and RR, our analysis shows that they largely agree, with the exception of 𝑅 vs. 𝑅 and 𝑅 vs. 𝑅. It seems that differences in ESL are relatively larger than differences in RR, which could explain these cases. Note that two runs with the same ESL can have very different MRRs. For example, consider the case of a run retrieving relevant documents at ranks 1 and 9, and a second run retrieving relevant documents at ranks 4 and 6. Both runs would have the same ESL, but very different MRRs: 0.556 and 0.208, respectively. There may be a tempting intuitive interpretation of MRR as the reciprocal of the mean position at which a relevant document appears, but this interpretation is not accurate.
With respect to the ongoing IR debate on scale types, the conclusion seems to be that, at least in our proposed framework, it doesn’t matter much. That is, even adopting the most cautious position advocated by Ferrante et al. [3], analysis in terms of ESL (a ratio scale) and MRR (not an interval scale) does not lead to markedly different conclusions. Given this, there may be reasons to prefer RR over ESL, since the former captures a more realistic user model; it is highly unlikely that user attention can be sustained across the examination of 100 results, as suggested by ESL. Here, the theory of IR evaluation appears to diverge from its actual practice on large-scale, real-world datasets.
Back to the original question that we set out to answer: Is “this” SOTA run better than “that” SOTA run? We might say that run 𝐵 is better than run 𝐴 if run 𝐵 wins in terms of case (2) and has a smaller ESL for case (3), or alternatively, a greater MRR. We might further claim that run 𝐵 is significantly better than run 𝐴 if the improvements in both outcomes are statistically significant: for case (3), significance testing as we have performed above, and for case (2), perhaps the binomial test.
Alternatively, we might adopt a less stringent definition, somewhat akin to the Hippocratic Oath (i.e., “do no harm”): a run can be considered significantly better if it significantly increases the fraction of answered queries without significantly increasing ESL, or if it significantly decreases ESL without significantly increasing the fraction of unanswered queries. The advantage of this approach is that it provides two concrete facets of “goodness” that researchers can independently tackle while still being amenable to a linear sort order for populating a leaderboard.
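A sketch of how this “do no harm” criterion might be operationalized, under the same assumptions as the earlier sketches: a binomial test on the case (2) counts and a Wilcoxon signed-rank test on case (3) ESL; the threshold and function name are our own:

```python
from scipy import stats

def do_no_harm_better(a_wins, b_wins, esl_a, esl_b, alpha=0.05):
    """Run B vs. baseline A: B is 'significantly better' if it significantly
    increases answered queries without significantly hurting case (3) ESL,
    or significantly reduces ESL without significantly hurting answered queries.
    Assumes at least one case (2) query and paired case (3) ESL lists."""
    # Case (2): binomial (sign) test on queries answered by exactly one run.
    p_ans = stats.binomtest(b_wins, n=a_wins + b_wins, p=0.5).pvalue
    more_answered = b_wins > a_wins and p_ans < alpha
    fewer_answered = b_wins < a_wins and p_ans < alpha
    # Case (3): paired Wilcoxon signed-rank test on ESL (lower is better).
    p_esl = stats.wilcoxon(esl_a, esl_b).pvalue
    lower_esl = sum(esl_b) < sum(esl_a) and p_esl < alpha
    higher_esl = sum(esl_b) > sum(esl_a) and p_esl < alpha
    return (more_answered and not higher_esl) or (lower_esl and not fewer_answered)
```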
6 CONCLUSIONS
One attraction of leaderboards is that they define a concrete metric to optimize, and if that metric is well-defined and meaningful, it enables the community to make rapid progress on a problem. This work focuses on one important yet under-explored aspect of intrinsic validity: we show that naive significance testing applied to MRR@100 obscures many potential insights, and that our proposed framework provides a more nuanced analysis.
How might we generalize our framework to encompass other retrieval tasks and possibly beyond? We see at least one challenge that limits the broader applicability of our approach: it critically depends on having only one relevant document per query, since this property is necessary to separate the outcomes. Otherwise, it is unclear how we would compare ranked lists that retrieve different numbers of relevant documents. While this restriction would not be unrealistic for some tasks, such as question answering, there clearly needs to be more work before our approach can be generalized.
REFERENCES
[1] Eugene Agichtein, Eric Brill, Susan Dumais, and Robert Ragno. 2006. Learning User Interaction Models for Predicting Web Search Result Preferences. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006). Seattle, Washington, 3–10.
[2] William S. Cooper. 1968. Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems. Journal of the American Society for Information Science 19, 1 (1968), 31–40.
[3] Marco Ferrante, Nicola Ferro, and Norbert Fuhr. 2021. Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales. arXiv preprint arXiv:2101.02668 (2021).
[4] Marco Ferrante, Nicola Ferro, and Silvia Pontarollo. 2017. Are IR Evaluation Measures on an Interval Scale? In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR 2017). 67–74.
[5] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated Gain-Based Evaluation of IR Techniques. ACM Transactions on Information Systems 20, 4 (2002), 422–446.
[6] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2005. Accurately Interpreting Clickthrough Data as Implicit Feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005). Salvador, Brazil, 154–161.
[7] Alistair Moffat and Justin Zobel. 2008. Rank-Biased Precision for Measurement of Retrieval Effectiveness. ACM Transactions on Information Systems 27, 1 (2008), Article 2.
[8] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v1 (2016).
[9] Tetsuya Sakai. 2020. On Fuhr’s Guideline for IR Evaluation. SIGIR Forum 54 (2020), p14.
[10] Mark D. Smucker and Charles L. A. Clarke. 2012. Time-Based Calibration of Effectiveness Measures. In Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012). Portland, Oregon.