Assessing top-k preferences
CHARLES L. A. CLARKE, ALEXANDRA VTYURINA, and MARK D. SMUCKER, University of Waterloo
Assessors make preference judgments faster and more consistently than graded relevance judgments. Preference judgments can also recognize distinctions between items that appear equivalent under graded judgments. Unfortunately, preference judgments can require more than linear effort to fully order a pool of items, and evaluation measures for preference judgments are not as well established as those for graded judgments, such as NDCG. In this paper, we explore the assessment process for partial preference judgments, with the aim of identifying and ordering the top items in the pool, rather than fully ordering the entire pool. To measure the performance of a ranker, we compare its output to this preferred ordering by applying a rank similarity measure. We demonstrate the practical feasibility of this approach by crowdsourcing partial preferences for the TREC 2019 Conversational Assistance Track, replacing NDCG with a new measure that can reflect factors beyond relevance. This new measure has its most striking impact when comparing traditional IR techniques to modern neural rankers, where NDCG can fail to recognize significant differences exposed by this new measure.
ACM Reference format:
Charles L. A. Clarke, Alexandra Vtyurina, and Mark D. Smucker. 2020. Assessing top-k preferences. ACM Transactions on Information Systems 00, 00, Article 00 (2020), 18 pages.
DOI: XX.XXXX/XXXXXXX.XXXXXXX
1 INTRODUCTION

Preference judgments [7, 16, 30, 32, 38] have long been proposed as an alternative to graded relevance judgments for the offline evaluation of search and related ranking tasks, including recommendation and question answering. Instead of independently judging individual items according to defined relevance criteria, assessors make preference judgments on pairs of items by comparing them side-by-side to determine the better of the two. If we allow ties, preference judgments impose a weak ordering on a set of items. To evaluate the performance of a ranker on a query, we can directly compare this weak ordering to the actual ranking generated for that query. If we employ a rank similarity measure for this comparison, it provides a measure of the ranker's performance [13]. This approach contrasts with the more established approach of converting independently assigned relevance grades into gain values to compute measures such as NDCG [4, 20] and ERR [10].

Compared with independent relevance judgments, assessors make preference judgments faster and more consistently [7]. Preference judgments also make it easy to incorporate factors beyond relevance into offline evaluation [12]. For example, for e-commerce search, these factors might include price and quality. For a news search vertical, these factors might include recency, so that an assessor comparing two equally relevant news stories could choose the latest update. If two news articles are equally relevant and timely, an assessor might prefer a shorter, more focused,
article over a longer article containing extraneous information. Preference judgments can also take personalization into account, so that locally available items could be preferred for an e-commerce search, or concordant political views could be preferred for a news search.

Preference judgments face two criticisms. First, even if we assume transitivity, a set of n items requires O(n log n) judgments to produce a total order. If we don't assume transitivity, a set of n items may require O(n^2) preference judgments. In contrast, if we have dedicated and reliable assessors, traditional graded relevance requires exactly n judgments. Second, while NDCG and similar graded relevance measures are well established for offline evaluation in both industry and academia, widely accepted evaluation measures for preference judgments have not yet emerged [7, 37].

In prior work, we addressed the first criticism by proposing evaluation by partial preferences [13]. We focus preference judgments on identifying and carefully ordering the best items for a query, perhaps no more than four or five. Since these are the items that are most likely to be seen by a searcher [21], these are the items a ranker should return as the top results, ranked consistently with preferences. These will have the most impact on perceived search quality, and it's important to get them right. The remaining items can be grouped into larger equivalence classes, exactly as they are for graded measures, so that they still contribute to the measurement of ranker performance, but with less impact than the best items.

To address the second criticism, we measure a system's performance by its maximum similarity to an ideal ranking [12]. Partial preferences impose a weak ordering on a collection. We interpret this weak ordering as a set of ideal rankings for a query. For the best items, preference judgments can precisely define this ideal ordering. For the larger equivalence classes, any ordering of the items in the class is equally good, although we do not include the class of non-relevant items in our ideal rankings. We then apply a rank similarity measure to compare these ideal rankings to an actual ranking generated by the system we wish to measure. As our performance measure, we take the maximum similarity between the members of the ideal set and the actual ranking.

We call this process of computing maximum similarity to a set of ideal rankings computing the compatibility of the actual ranking. When compared to traditional graded relevance measures, compatibility allows us to more precisely specify the ideal response expected from a ranker, and to compare this ideal response with its actual response. We provide further details regarding compatibility in Section 3.1.
As part of computing compatibility, we use Rank-Biased Overlap (RBO) [33] to compute similarity between ideal and actual rankings. The properties of RBO make it ideally suited for this purpose, and we provide further details regarding RBO in Section 3.2.

This thread of research [12, 13] was directly motivated by our experience implementing offline evaluation metrics for a social media site. Even under carefully composed assessment guidelines, multiple items may appear to be perfect, but when these items are placed side-by-side, a clearly desirable ordering becomes apparent. For example, on social media sites popular entertainers may have multiple official accounts. As well, there may be multiple high quality and carefully curated fan accounts. On Twitter, there are at least two verified accounts for Taylor Swift, @taylorswift13 with 86M followers and @taylorswiftnation13 with 1M followers. As well, there are multiple fan accounts with over 100K followers. When independently assessed, any of these accounts could reasonably be labeled as perfect for the query "taylor swift", particularly when seen outside the context of the others. When placed side-by-side, and considering factors such as the number of followers, we might rank @taylorswift13 first, @taylorswiftnation13 second, with the various fan accounts after that.

Maximum similarity to an ideal ranking represents a radical simplification of existing offline evaluation practice. Essentially we reduce offline evaluation to the problem of answering the question: "What would an ideal system do?" Once we determine the ideal ranking for a query — or rather a set of equally ideal rankings — we apply a rank similarity measure to determine the compatibility of an actual ranking generated by a ranker to this ideal. As an offline evaluation measure, compatibility is particularly suited to partial preferences, since the weak ordering induced by partial preferences can be directly interpreted as a set of ideal rankings.

In the current paper, we extend our prior work to consider assessment methods for partial preferences. Starting from a pool of items, we examine methods for narrowing this pool to the top-k items, identifying and ordering these items, while minimizing the cost and effort required. We compare two methods. The first assumes dedicated and motivated assessors, employing a tournament structure. The second crowdsources preference judgments through Mechanical Turk. For both methods, we start with an initial graded assessment as a first step in narrowing the pool.

We focus our effort on partial preferences for a question answering task — the TREC 2019 Conversational Assistance Track (CAsT) [14]. For this task, questions were collected into conversations of between 7 and 12 questions each. Answers were drawn from a collection of passages derived from various Web sources, including Wikipedia. For each of the 479 test questions, participating systems returned a ranked list of passages intended to answer the question. Submitted runs were pooled to a depth of 10, and 173 of the questions were judged on a 5-point relevance scale. NDCG@3 formed the primary evaluation measure for the track. Through the application of preference judging, we aim to identify and order the top-five answers for these 173 previously judged questions.

The questions from the TREC CAsT Track provide some excellent examples of the problem that initially motivated us.
Figure 1 shows four passages that receive the top relevance grade ("fully meets") for the question What is taught in sociology?
MARCO 1568091: Sociology is the study of social life and the social causes and consequences of human behavior. In the words of C. Wright Mills, sociology looks for the public issues that underlie private troubles. Sociology differs from popular notions of human behavior in that it uses systematic, scientific methods of investigation and questions many of the common sense and taken-for-granted views of our social world…
MARCO 394140: What is Sociology? Sociology is the study of human social relationships and institutions. Sociology's subject matter is diverse, ranging from crime to religion, from the family to the state, from the divisions of race and social class to the shared beliefs of a common culture, and from social stability to radical change in whole societies.
CAR f62c5a5a0be476d8ba9ce5d956b519413d73eb71: Jennifer Conn used Snape's and Quidditch coach Madam Hooch's teaching methods as examples of what to avoid and what to emulate in clinical teaching, and Joyce Fields wrote that the books illustrate four of the five main topics in a typical first-year sociology class: "sociological concepts including culture, society, and socialisation; stratification and social inequality; social institutions; and social theory".
CAR 5465fd5dd01cba27c7d792b6b6453ee3da101e03: sociology of aging - sociology of architecture - sociology of art - sociology of the body - sociology of childhood - sociology of conflict - sociology of deviance - sociology of development - sociology of disaster - sociology of economic life - sociology of education - sociology of emotions - sociology of the family - …
Fig. 1. Four of the 14 passages assigned the top relevance grade ("fully meets") by the TREC 2019 Conversational Assistance Track for the question: What is taught in sociology?
2 RELATED WORK

As far back as 1990, Rorvig [30] argued for the superiority of preference judgments as a tool for estimating document utility, as opposed to graded or binary relevance judgments, explicitly recognizing that this utility may reflect differences beyond relevance. That paper raises the transitivity of preferences as a necessary requirement for this utility estimation, and it reports experiments demonstrating that document preference judgments do exhibit the required transitivity. Rorvig also outlines a procedure for constructing a test collection based on preference judgments, while noting that this test collection "would cost a great deal more to build than current collections," due to the large number of judgments required. Frei and Schäuble [16] also sidestep absolute relevance in favor of relative comparisons between items, arguing that human assessors make relative comparisons more easily and consistently.

In a 1995 paper, Yao [38] proposed preference judgments as a solution to the difficulties already then encountered in attempts to define and interpret ordinal relevance scales, which in some cases might suggest, for example, "that a document with grade 2 is equivalent to two documents with grade one." Under Yao's proposal, preference judgments define a weak ordering on the collection, where items may be tied. Just as we propose in this paper, this weak ordering might be derived from direct pairwise comparisons or from ordinal relevance grades, avoiding the need to interpret relevance grades as relevance values. Effectiveness is then measured by computing the distance between this weak ordering and a ranking generated by a search system. Yao defines axioms required for this distance metric, including the usual mathematical properties required of any distance metric. Our compatibility measure, defined in Section 3, follows this suggestion, using rank similarity measures to compare ideal and system rankings.

More recently, Carterette and Bennett, along with various collaborators, published a series of papers aiming to establish preference judgments as a practical approach to offline search evaluation [5–9, 11, 39]. Carterette et al. [7] provide evidence that preference judgments are generally transitive, so that O(n^2) judgments are not required for a pool of n items. They further recognize that prejudging non-relevant documents allows these documents to be excluded from the pool for preference judging, further reducing effort. Carterette et al. [5] describe the creation of one of the few test collections based on preferences. Along with Carterette and Bennett [6], these papers propose evaluation measures based on the discordant pairs in an actual ranking.

Zhu and Carterette [39] crowdsource preference judgments for search page layouts, providing advice that informs our current effort. Chandar and Carterette [8] employ preference judgments to generate an ideally diverse ranking. Chandar and Carterette [9] extend this work to define an evaluation measure for novelty and diversity based on preference judgments. Chen et al. [11] present an active learning approach to inferring a ranking from crowdsourced preference judgments.

Radinsky and Ailon [29] refer to the practice of inferring preferences from individual relevance judgments — both to train rankers and for evaluation — as the "IR detour". Through experiments on human subjects they conclude that "the validity of taking the IR detour is questionable." They propose an active learning method for reducing the number of preference judgments. In particular, they propose focusing preference judgments on identifying the top-k items, although they do not explore this proposal in detail. They also provide an overview of some of the earlier work in the large body of literature related to preference judgments for learning-to-rank. This literature includes research specifically focused on top-k learning-to-rank methods [25, 28, 34].

Another large body of literature explores methods for crowdsourcing relevance judgments [1, 2, 26], including preference judgments. Maddalena et al. [27] crowdsource relevance magnitudes through a process in which assessors view a series of documents and estimate relevance relative to the previously seen document. Their results call into question the standard practice of converting relevance grades into gain values for the purpose of computing NDCG. Hui and Berberich [18, 19] explore the transitivity of crowdsourced preference judgments and propose an algorithm based on a randomized quicksort to reduce judging effort by allowing ties. Yang et al. [37] compare preference, absolute, and ratio judgments through a large crowdsourced experiment, concluding that crowdsourced preferences provide similar outcomes to dedicated assessments when comparing rankers.
Bashir et al. [3] propose methods for converting preference judgments to relevance scores by adapting the Elo ratings used for chess and other games. Kim et al. [23] provide evidence that preference judgments can capture differences beyond traditional topical relevance, such as authority and recency. Hassan Awadallah and Zitouni [17] employ a classifier to reduce the effort associated with preference judgments. Kuhlman et al. [24] explore interaction methods for collecting preference judgments. Kalloori et al. [22] augment star ratings with preference judgments in a recommender system.

In a recent SIGIR 2020 paper, Sakai and Zeng [32] propose and explore two broad families of measures intended to support preference judgments. The first family is based on counts of concordant pairs, generalizing and extending ideas proposed by Carterette et al. [7] and Carterette and Bennett [6]. The second family converts preference judgments to gain values for use with traditional graded relevance measures. A unique aspect of these measures is that they work directly from a collection of preference judgments, and do not require assumptions of transitivity. As part of this work, the authors released an exhaustive set of preference judgments for an NTCIR task. Overall, the work demonstrates several important advantages of preference judgments, especially their closer agreement with SERP preferences, but questions remain regarding the costs and sensitivity of measures based on preference judgments.

Given the quality and breadth of this prior research, it is perhaps surprising that preference judgments are not yet standard for offline search evaluation. Many of the key ideas we employ in this paper have been explored, or at least proposed, in this prior work. We view the primary contribution of this paper and our related papers [12, 13] as consolidating and simplifying this prior work to establish the practical utility of preference judgments. In particular, we focus preference judgments on the top items to maximize impact while minimizing judging effort. In addition, we further establish maximum similarity to an ideal ranking as a simplified framework for offline evaluation, accommodating traditional relevance grades, preference judgments, and factors beyond relevance.
3 COMPUTING COMPATIBILITY

Computing compatibility requires two choices: 1) a choice of rank similarity measure to compare rankings, and 2) a definition of an ideal ranking, which might be a single ranking or a set of equally ideal rankings. For rank similarity we use RBO because its properties make it ideally suited for comparing rankings (see Section 3.2). For the experiments in this paper, we define the ideal rankings for a query by a set of equivalence classes, or "effectiveness levels", where each effectiveness level contains one or more items.

3.1 Ideal Rankings

Let {L_1, L_2, ..., L_T} be the set of effectiveness levels for a query. The effectiveness levels are ordered so that L_1 < L_2 < ... < L_T, with L_T being the top level. Unlike traditional graded relevance, the number of levels T can vary from query to query. We define an extra level L_0 containing all items not appearing in another level. We define an ideal ranking as any ranking containing all the items in L_T, in any order, followed by all the items in L_{T-1}, in any order, and so on down to L_1. The items in L_0 are not included.

If we have graded relevance values, these effectiveness levels correspond exactly to them, with L_0 containing items that are non-relevant, spammy, unjudged, etc. If we have an ideal ranking exactly defined by a top-k ranking of items, then we have T = k, with the first item alone in L_k, the second item alone in L_{k-1}, etc. We can also combine a top-k ranking with graded relevance by ordering the top-k items first and ordering the remaining items in the graded relevance levels below them. In this paper, we do all three.

Together, a set of equivalence levels defines a set of ideal rankings containing |L_T|! × |L_{T-1}|! × ... × |L_1|! elements. If equivalence levels are based on graded relevance, the size of this set can be a million or more for a typical TREC task. For TREC 2019 CAsT questions, the size of this set ranges from 192 ideal rankings up to 26,842,725 ideal rankings, with an average above two million. In contrast, with a top-k ranking, the sole element in the set can precisely specify what the searcher should see.

Fortunately, regardless of the number of ideal rankings, we do not need to generate all of them to compute the maximum similarity. The maximum will be obtained by the ideal ranking that has all the items in each level ordered according to the actual ranking, maximizing the number of concordant pairs [12, 13]. Items in a level that do not appear in the actual ranking are placed last in the level, in any order. Once we have chosen a rank similarity measure and defined a set of ideal rankings, we compute compatibility as the maximum similarity between members of the ideal set and the actual ranking generated by a ranker we wish to measure.
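To make this construction concrete, the following sketch (our own illustration, not the released implementation described in Appendix A; all names are ours) builds the maximizing ideal ranking directly from a list of effectiveness levels, ordering the items within each level by their position in the actual ranking:

    def closest_ideal(levels, actual):
        # levels: list of sets of item ids, highest level (L_T) first.
        # actual: the ranking produced by the system under evaluation.
        # Returns the member of the ideal set closest to the actual
        # ranking; items within a level are interchangeable, so we
        # order them by their position in the actual ranking.
        position = {item: rank for rank, item in enumerate(actual)}
        ideal = []
        for level in levels:
            ranked = sorted([x for x in level if x in position],
                            key=lambda x: position[x])
            unranked = [x for x in level if x not in position]
            ideal.extend(ranked + unranked)  # unranked items go last
        return ideal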
3.2 Rank-Biased Overlap

While in principle any rank similarity measure could be used to compute compatibility, we employ Rank-Biased Overlap (RBO). By design, its properties make it ideally suited for this purpose. In creating RBO, Webber et al. [33] carefully identified and specified the requirements of rank similarity for what they call indefinite rankings, such as the output of rankers. For example, when comparing an actual ranking generated by a ranker to an ideal ranking, the top ranks matter more and should be given greater weight. The ideal ranking may be relatively short — just the top-5, for example — while the actual ranking may be much longer — up to 1000 passages for TREC CAsT experimental runs. Moreover, not all of the items appearing in the ideal ranking need appear in the actual ranking. RBO allows us to meaningfully compare rankings with differing length and content. While we could certainly employ or invent other rank similarity measures, they would still need to satisfy the requirements of Webber et al. [33]. Further discussion can be found in our related paper [13].
    Measure         Judgments     Sensitivity   Kendall's τ   Figure
    NDCG@3          graded only   71.7%         -             -
    compatibility   graded only   71.0%         0.907         Fig. 2
    compatibility   combined      76.5%         0.851         Fig. 7
    compatibility   top-5 only    73.3%         0.814         Fig. 8
    compatibility   best only     55.2%         0.775         Fig. 9

Table 1. Sensitivity and consistency of evaluation measures and judgment sets examined in this paper. NDCG@3 forms the baseline for all experiments and for Kendall's τ.
Using RBO, we compute compatibility between an ideal ranking I and an actual ranking R as follows. Let I_i denote the top i items in I, and let R_i denote the top i items in R. We define the overlap between I and R at depth i as the size of the intersection between these lists at depth i: |I_i ∩ R_i|. We define the agreement between I and R at depth i as the overlap divided by i. RBO is then a weighted average of the agreement across depths from 1 to ∞, as follows:

    RBO(R, I) = (1 − p) Σ_{i=1}^{∞} p^{i−1} · |I_i ∩ R_i| / i.    (1)

The parameter 0 < p < 1 determines the weight placed on each depth. In practice, we truncate the sum at a depth where p^{i−1} is close to zero or where we reach the bottom of both the ideal and actual rankings; we go down to depth 1000 for this paper. Please see Webber et al. [33] for further discussion.
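For illustration, Equation 1 can be computed in a single pass over both rankings, truncating the infinite sum at a fixed depth. This is a sketch under the definitions above, not the released implementation:

    def rbo(actual, ideal, p=0.8, depth=1000):
        # Equation 1, truncated at the given depth.
        actual_seen, ideal_seen = set(), set()
        overlap = 0          # |I_i intersect R_i| at the current depth
        score = 0.0
        weight = 1.0 - p     # (1 - p) * p^(i-1) at depth i = 1
        for i in range(1, depth + 1):
            if i <= len(actual):
                actual_seen.add(actual[i - 1])
                if actual[i - 1] in ideal_seen:
                    overlap += 1
            if i <= len(ideal):
                ideal_seen.add(ideal[i - 1])
                if ideal[i - 1] in actual_seen:
                    overlap += 1
            score += weight * overlap / i
            weight *= p
        return score

Compatibility is then rbo(actual, closest_ideal(levels, actual), p), combining this function with the sketch in Section 3.1.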
3.3 Consistency and Sensitivity

Along with other analyses, we compare evaluation measures in terms of their consistency and sensitivity. By consistency we mean the degree to which evaluation measures recognize the same differences between rankers. By sensitivity we mean the ability of evaluation measures to recognize significant differences between rankers.

We measure consistency using Kendall's τ. We measure sensitivity following the approach of Sakai [31], but using paired t-tests rather than bootstraps (see Yang et al. [37], for example). We take all pairs of experimental runs and compute a paired t-test between them under each measure. A pair with p < 0.05 is considered to be distinguished. Sensitivity is then:

    sensitivity = (number of distinguished pairs) / (total number of pairs).    (2)
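Under these definitions, sensitivity can be computed with a few lines of Python. This sketch assumes per-topic scores for each run, with topics aligned across runs; the names are ours:

    from itertools import combinations
    from scipy import stats

    def sensitivity(run_scores, alpha=0.05):
        # run_scores: dict mapping run name -> list of per-topic scores,
        # with topics in the same order for every run.
        pairs = list(combinations(run_scores, 2))
        distinguished = sum(
            1 for a, b in pairs
            if stats.ttest_rel(run_scores[a], run_scores[b]).pvalue < alpha)
        return distinguished / len(pairs)  # Equation 2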
3.4 Compatibility with Graded Relevance

As detailed in Section 3.1, relevance grades alone can be used to define a set of ideal rankings, allowing compatibility to be computed. For the TREC 2019 CAsT task, there are four effectiveness levels. The top effectiveness level L_4 contains all passages judged "fully meets", L_3 contains all passages judged "highly meets", L_2 contains the "moderately meets" passages, and L_1 contains the "slightly meets" passages. Figure 2 compares compatibility and NDCG@3 on the 42 automatic runs from TREC 2019 CAsT. We compare with NDCG@3 because this is the primary evaluation measure reported for TREC 2019 CAsT [14].

Fig. 2. The relationship between NDCG@3 and compatibility on TREC 2019 Conversational Assistance Track automatic runs when ideal rankings are based on graded relevance values only. Even though compatibility does not convert relevance grades to gain values, the relationship is nearly linear, with Kendall's τ = 0.907.

The relationship between these measures is nearly linear, with relatively few inversions, especially among the higher scoring runs. Since this comparison forms a baseline for later work, we tune the value of p to provide the best match for RBO to NDCG@3 in terms of consistency and sensitivity. Tuning was entirely manual; we tried four or five values before settling on p = 0.80.
This value provides approximately the same sensitivity as NDCG@3, as well as a relatively high Kendall's τ of 0.907. Higher values of p tend to increase sensitivity and decrease τ, while lower values tend to decrease both sensitivity and τ. In general, the value of p can be adjusted to provide a close match with NDCG@n, in terms of consistency and sensitivity. For example, across multiple TREC Web Track tasks, p = 0.95 provides a close match with NDCG@20 [13].
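To see why a smaller p focuses the measure on the top ranks, note that the weights in Equation 1 form a geometric series, so the first d depths carry a fraction 1 − p^d of the total weight; the specific depths below are our own illustration:

    Σ_{i=1}^{d} (1 − p) p^{i−1} = 1 − p^d

    p = 0.80, d = 3:    1 − 0.80^3  ≈ 0.49   (about half the weight on the top three depths)
    p = 0.95, d = 20:   1 − 0.95^20 ≈ 0.64   (about two thirds of the weight on the top twenty depths)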
4 IDENTIFYING THE TOP-K

Our goal is to identify the top-k items for each query while minimizing effort. We follow a multi-step approach, depending on whether the assessment will be completed by dedicated assessors or by crowdsourced assessors. We assume that dedicated assessors will be more focused and reliable than crowdsourced assessors, so we build more redundancy into the crowdsourced process. Our overall approach is to favor simplicity. It can be summarized as follows:

(1) Perform an initial graded relevance assessment pass to "thin the herd", producing a reduced candidate pool C, with |C| ≥ k, to focus preference judgments on the most promising items (Section 4.1).

(2) If dedicated assessors are to be used, we structure assessment as a single-elimination tournament (Section 4.2).

(3) If crowdsourced assessors are to be used, we follow a two-stage process, with the first stage reducing the size of the candidate pool and the second stage determining the final order (Section 4.3):

    (a) While the size of the candidate pool is greater than some threshold F, where |C| > F > k, we generate random pairings of candidates, so that each candidate is paired with P or P + 1 other candidates, for some threshold P, where F > P > k. These pairings are then judged by crowdworkers. Items losing a majority of their pairings are eliminated, and we repeat.

    (b) Once the size of the candidate pool is less than or equal to F, we pair all remaining candidates with all other remaining candidates, and these pairings are judged by crowdworkers. Items are then ranked by the number of pairings they win, and we cut to the top k. In the case of ties at rank k, we keep all candidates with the tied score, so that in some cases the size of the final ideal ranking will be larger than k.
For the experiments in this paper, we use k = 5, F = 9, and P = 7.
The values for F and P were based on a pilot test, intended to keep our costs under $4,000.

4.1 Initial Candidate Pool

We start with an initial graded relevance assessment, giving us an initial candidate pool of higher quality items and avoiding unnecessary preference judgments against lower quality items, particularly non-relevant items. These initial judgments could be crowdsourced or use dedicated assessors. If we assume g relevance grades, with G_1, G_2, ..., G_g as the sets of items for each grade, we compute C as follows:

    i ← g
    C ← ∅
    while |C| < k and i > 0
        C ← G_i ∪ C
        i ← i − 1

For the TREC 2019 CAsT task, experimental runs were pooled down to depth 10 for assessment. A total of 29,350 passages were judged on a 5-point scale, from "fully meets" (4) down to "fails to meet" (0). Of these, 8,120 passages were assigned a positive grade. Running the algorithm above on the passages with a positive grade gives an initial candidate pool of 2,673 passages. The number of candidates varies by question, up to a high of 112 candidates for one question.
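In Python, the same construction might read as follows; this is a sketch, assuming the grade sets are available as Python sets:

    def initial_candidates(grades, k):
        # grades: list of sets [G_1, ..., G_g] of item ids, ordered
        # from the lowest positive grade to the highest.
        # Whole grade levels are added, best first, until the pool
        # holds at least k items or the grades are exhausted.
        pool, i = set(), len(grades)
        while len(pool) < k and i > 0:
            pool |= grades[i - 1]
            i -= 1
        return pool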
    Topics   k    official   extra     % extra
    173      3    29,350     3,456     +11.78%
    173      5    29,350     5,429     +18.50%
    173      10   29,350     10,691    +36.43%

Table 2. Upper bound estimates of extra judging effort to identify top-k items for the TREC 2019 CAsT task with dedicated and reliable assessment.

Some questions had an initial candidate pool with |C| ≤ F, so that for crowdsourced assessments, these candidates immediately moved to the second stage. As shown in Figure 3, not all candidates came from the top relevance grade for that question. More than a third came from below the top grade, with just over 1% coming from three levels lower. Since we are depending on the relevance grades to build the initial candidate pool, it is certainly possible that some of the top answers were missed by this process; we further discuss this possibility later in the paper.

Fig. 3. Relevance grades of passages selected for the candidate pool relative to the top relevance grade for the question.

4.2 Dedicated Assessment

If we have reliable and dedicated assessment, undertaken by a relatively small number of individuals who understand the task, we can use a single-elimination tournament structure, or heap, to determine the top-k items with no more than |C| + (k − 1)⌈log₂(|C|)⌉ preference judgments (not a tight bound). Using this formula, Table 2 provides an estimate of the preference judgments required for TREC 2019 CAsT for various values of k.

To provide a basis for comparison with crowdsourcing results, the authors applied this approach to identify a single top answer for each of the questions. Our initial goal was to identify the top 5, so we started the process with the full top-5 candidate pool described in Section 4.1. Over the course of several weeks, and requiring nearly 40 hours, we completed the first round of the single-elimination tournament. In total we made 4,125 preference judgments, including some false starts and repeats due to initial bugs in the judging interface. Since this process gave us a top answer for each question, which we could use to help validate the crowdsourced assessment, we decided not to invest the extra time to identify the remainder of the top 5.
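As an illustration of the dedicated-assessor approach, the sketch below finds the top k with a single-elimination tournament, assuming consistent and transitive preferences; prefer(a, b) stands in for a human judgment and returns the preferred item, and the function names are ours:

    def knockout(entrants, prefer, beaten):
        # One single-elimination bracket; records every item each winner
        # beats, since only items beaten directly by a removed winner
        # can hold the next rank.
        round_ = list(entrants)
        while len(round_) > 1:
            next_round = []
            for a, b in zip(round_[0::2], round_[1::2]):
                winner = prefer(a, b)
                loser = b if winner == a else a
                beaten.setdefault(winner, set()).add(loser)
                next_round.append(winner)
            if len(round_) % 2:
                next_round.append(round_[-1])  # odd item gets a bye
            round_ = next_round
        return round_[0]

    def tournament_top_k(pool, prefer, k):
        beaten = {}
        top = [knockout(list(pool), prefer, beaten)]
        candidates = set(beaten.get(top[0], ()))
        while len(top) < k and candidates:
            # Re-run a small bracket among items beaten by earlier
            # winners, costing about ceil(log2 |pool|) extra judgments
            # per additional winner.
            winner = knockout(list(candidates), prefer, beaten)
            top.append(winner)
            candidates = (candidates - {winner}) | beaten.get(winner, set())
        return top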
4.3 Crowdsourced Assessment

As described above, crowdsourcing proceeds in two stages: a) a pool reduction stage, intended to reduce the size of the candidate pool below some threshold F, after which we b) compare all remaining candidates with each other, ranking the candidates according to the number of pairings in which they win and cutting to the top k. During the pool reduction stage each candidate is randomly paired with P or P + 1 other candidates, and items losing a majority of their pairings are eliminated. If the pool remains larger than F, we repeat the process. On the TREC 2019 CAsT candidate pool, each iteration of this process reduced the size of the pool by roughly half.

During the second stage all candidates are paired against each other, giving up to F(F − 1)/2 pairings to judge. Items are then ranked by the number of pairings they win. If there are ties at rank k, we include all items tied at that rank. Otherwise, we cut to the top k. Ties also mean that some effectiveness levels will contain multiple items.

For the TREC 2019 CAsT passages, we used Amazon's Mechanical Turk to recruit and pay crowdsourced workers. Workers were required to live in the U.S. and to have completed a minimum number of approved tasks.

Fig. 4. Example assessment task for one of the questions.
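The two crowdsourcing stages can be sketched as follows. This is our simplified rendering of the process described above, with prefer(a, b) again standing in for a crowdsourced judgment:

    import random
    from itertools import combinations

    def crowdsourced_top_k(pool, prefer, k=5, F=9, P=7):
        pool = list(pool)
        # Stage one: random pairings; items losing a majority of their
        # pairings are eliminated, and the stage repeats.
        while len(pool) > F:
            losses = {c: 0 for c in pool}
            judged = {c: 0 for c in pool}
            for c in pool:
                for o in random.sample([x for x in pool if x != c], P):
                    judged[c] += 1
                    judged[o] += 1
                    loser = o if prefer(c, o) == c else c
                    losses[loser] += 1
            survivors = [c for c in pool if 2 * losses[c] <= judged[c]]
            if len(survivors) == len(pool):
                break  # guard against a round that eliminates nothing
            pool = survivors
        # Stage two: a full round robin over the remaining candidates.
        wins = {c: 0 for c in pool}
        for a, b in combinations(pool, 2):
            wins[prefer(a, b)] += 1
        ranked = sorted(pool, key=lambda c: wins[c], reverse=True)
        if len(ranked) <= k:
            return ranked
        cutoff = wins[ranked[k - 1]]
        return [c for c in ranked if wins[c] >= cutoff]  # keep ties at rank k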
Fig. 5. Comparison between local and crowdsourced judgments.
5 COMPARING ASSESSMENT METHODS

Having completed both a crowdsourced assessment for the top-5 answers and a dedicated assessment for the top answer (which we call the "local answer" for short) we can compare the two approaches. Figure 5 shows the result. For 63 questions (36%) the two assessment methods produced the same top answer. For 141 questions (82%) the local answer from the dedicated assessment appeared in the top-5 from the crowdsourced assessment. For example, of the passages in Figure 1, the first passage was selected by crowdworkers as the top answer. The second passage was ranked second by the crowdworkers, but was the top local answer. For 32 questions the local answer did not appear in the top five crowdsourced answers at all. In general, the crowdworkers appeared to prefer more direct answers, and appeared less tolerant of longer passages than the dedicated assessors.

Figure 6 compares the crowdsourced assessments with the original graded relevance assessments. Over 68% of the top-1 crowdsourced answers came from the highest relevance grade for the question, which varied from question to question. Over 61% of the top-5 crowdsourced answers came from the highest relevance grade. Nonetheless, the remaining answers came from lower relevance grades. Since we only added passages from lower relevance grades when they were needed to grow the candidate pool to sufficient size, this outcome suggests that our initial strategy for "thinning the herd" may have missed some answers that the crowdworkers would have placed in the top 5.

The values for F and P were chosen to keep us within an assessment budget of $4,000. After running a pilot study with 10% of the questions picked at random, we set F = 9 and P = 7,
which kept us under budget. Nonetheless, even if we assume fully consistent crowdworkers, there is a small chance that some of the top-5 items might be missed. The worst case occurs with a candidate pool |C| = F + 1 = 10. In this case, with P = 7, a true top-5 item might by chance be paired with all of the items preferred over it, losing a majority of its pairings and being eliminated during the pool reduction stage. Once |C| ≤ F, and we have moved to the second stage, all pairs are assessed, providing redundancy for the final top-5 ordering.
Overall, the assessment methods produced consistent, but not identical, results. By basing an initial pass on the original relevance grades, we may have missed answers that crowdworkers would have placed in the top 5. Larger values of F and P may have produced more consistent results, although at greater cost. However, assuming that the top-5 crowdsourced answers provide an acceptable approximation to the true top-5, we can move on to examine the impact of partial preferences on runs submitted to the TREC 2019 CAsT Track.

Fig. 6. Relevance grades of the top crowdsourced answer relative to the top relevance grade for the question.
6 RESULTS

The plot in Figure 7 compares the performance of automatic runs submitted to the TREC 2019 CAsT Track under compatibility vs. NDCG@3. For this comparison, we create an ideal ranking by combining the crowdsourced top-5 answers with the original graded relevance judgments. The top-five answers fill equivalence levels L_9 down to L_5; graded relevance judgments fill equivalence levels L_4 down to L_1. This approach precisely specifies the top ranks, the ones most likely to be seen by the searcher, while still taking advantage of the relevance grades to compare rankers. As shown in Table 1, the sensitivity of compatibility using this ideal ranking is 76.5%, indicating that we are better able to recognize differences between rankers.

Compatibility provides insights not provided by NDCG. The top four runs (by either measure) represent the most successful of the numerous attempts by participants to apply BERT [15] for re-ranking answers. Under compatibility the separation between these four runs and the other runs is much more dramatic. The starred run (pgbert) produces the best score under compatibility and the third-best score under NDCG. In addition to BERT for re-ranking, it applied a transfer learning approach for question re-writing [14]. Of the other three runs in the top four, one (pg2bert) is a variant of the pgbert run from the same group. The other two (h2oloo RUN2 and CFDA CLIP RUN7) both apply doc2query for expansion, as well as BERT for re-ranking [35].

The circled run was the sole run in the top ten to use only traditional IR methods. In particular, it was the only run in the top ten not to re-rank with BERT. Under NDCG, the starred run outperforms the circled run by +15%, a difference that is not significant under a paired t-test. Under compatibility, the starred run outperforms the circled run by a much larger margin, with a p-value that remains significant even after the conservative Bonferroni correction. Under NDCG, we might conclude that the modern NLP methods used for the starred run were providing only a modest and non-significant improvement over the traditional methods. Under compatibility, with an ideal ranking that precisely specifies the preferred answers, we see the more dramatic improvements we might expect from these modern methods. The remainder of the top-ten runs, plus several other runs that also apply BERT, move ahead of this traditional run under compatibility; it drops from 7th to 15th place.
Fig. 7. The relationship between NDCG@3 and compatibility on TREC 2019 Conversational Assistance Track runs when ideal rankings are based on a combination of crowdsourced top-5 answers and the original graded relevance values. The plots on the right sort runs by score under different measures and show 95% confidence intervals. The top-four runs show significant differences not captured by relevance grades alone. The runs marked with a star and a circle are discussed in Section 6.

For Figure 7 we combined the top-5 crowdsourced answers with the graded relevance judgments. Instead, we might focus exclusively on the top-5 answers, recognizing that a searcher will rarely look beyond these results. Nothing beyond the top-5 counts, as if the search engine returned nothing after that point. The set of ideal rankings now consists of a single element — this single ranking of the top-5 answers — or perhaps a small number of equivalent rankings if crowdsourcing produced ties.

As a minor point, under these circumstances ideal rankings are no longer indefinite in the sense of Webber et al. [33]. Under any circumstances, RBO always leaves a "residual", since rankings cannot practically be computed to infinity. This residual becomes vanishingly small as rankings become deeper. However, when k is small this residual can be noticeably large, and if we limit ideal rankings to just the top-k, then they are not even theoretically indefinite. As a result, in this circumstance we apply a normalization for RBO, as follows:

    NRBO(R, I) = RBO(R, I) / RBO(I, I).    (3)

Unless the ideal ranking is relatively shallow, RBO(I, I) is close to one, but if not, this formula provides a simple way to normalize out the residual.

While this normalization scales scores into the range [0, 1], it does not matter in a statistical sense, since the same constant is applied to every run. Apart from lower values, plots are identical. However, if k varies from query to query, this normalization would allow each query to contribute equally to the magnitude of the average score. While we do not vary k in this way for the experiments in this paper, we can imagine this would be helpful in the case of Web search, for example, where different values of k might be used for navigational vs. informational queries.
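Reusing the rbo sketch from Section 3.2, the normalization of Equation 3 is one line; as with the other sketches, this illustrates the definition rather than reproducing the released code:

    def nrbo(actual, ideal, p=0.8, depth=1000):
        # Equation 3: divide out the self-similarity of the ideal
        # ranking, which falls below one when the ideal ranking is
        # short and the residual is large (assumes a non-empty ideal).
        return rbo(actual, ideal, p, depth) / rbo(ideal, ideal, p, depth)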
Figure 8 shows the relationship between NDCG@3 and compatibility when ideal rankings are based solely on the crowdsourced top-5 answers. As shown in Table 1, the sensitivity of 73.3% is lower than with the combined ideal rankings of Figure 7, but higher than with graded relevance alone. The separation between the top-four runs and the rest of the runs remains.

Fig. 8. The relationship between NDCG@3 and compatibility on TREC 2019 Conversational Assistance Track runs when ideal rankings are based solely on the crowdsourced top-5 answers.

To go one step further, Figure 9 shows the relationship between NDCG@3 and compatibility when ideal rankings are based only on the single best local answer identified by the research team. Many runs now have compatibility values close to zero, even when NDCG@3 values are close to 0.2. Although sensitivity is now only 55.2%, the relative ordering of the top-four runs has not changed. Using only the single best crowdsourced answer produces a similar result (not shown).
7 CONCLUSION

It is widely recognized that offline evaluation should focus on the top ranks, those the searcher will most likely see. We often report measures of the form measure@k, for small values of k, with NDCG@3 providing a typical example. In effect, these measures evaluate rankers by asking the question: "What items did the ranker put in the top k ranks?" In this paper, we turn this question around, asking instead: "Where did the ranker put the items that should be in the top k ranks?" By doing this, we achieve an evaluation measure that is not only focused on the quality of the top ranked results, but which is also more sensitive to important differences between rankers.

It is only recently that neural rankers have begun to show significant improvements over traditional methods on IR tasks [36], and neural methods do not consistently provide the same dramatic improvements seen on many NLP tasks. We hypothesize that the lack of dramatic improvement may be due to the limitations of traditional IR evaluation methodologies, with their focus on relevance, which cannot capture important aspects of searcher preferences. In this paper, we propose partial preferences focused on the top ranks as a practical method for capturing these aspects.

While we have demonstrated that our assessment methods can be practically and affordably applied to an academic evaluation exercise, we have not as yet applied these methods in a commercial context. In addition, we have also not explored the cost-benefit tradeoffs of varying the judging parameters: k, F, and P.
Fig. 9. The relationship between NDCG@3 and compatibility on TREC 2019 Conversational Assistance Track runs when ideal rankings are based solely on the single best answer identified through dedicated assessment by the research team.

While our current method was kept as simple as possible to make it easy for others to replicate, statistical and machine learning methods from the literature might be extended to partial preferences [3, 11, 17, 29], reducing assessment effort at the cost of complexity.

In this paper, we piggybacked our work on the existing TREC 2019 CAsT graded relevance judgments. Based on our experience, if top-k partial preferences were the end goal from the start, it might be possible to simplify the initial graded relevance assessment to three grades: A: "answers the question"; B: "provides related information"; C: "not relevant". The grade-A passages would then become the initial candidate pool, unless its size is less than k, in which case the grade-B passages would be included. While this process might produce a larger initial candidate pool and increase the total number of assessments, by simplifying the initial graded assessment stage it might speed the overall process, reducing total costs. The trade-off depends on the relative cost and consistency of graded vs. preference judgments, including any savings from reducing the complexity of graded assessment.

REFERENCES

[1] Omar Alonso, Daniel E. Rose, and Benjamin Stewart. 2008. Crowdsourcing for relevance evaluation. SIGIR Forum 42, 2 (2008), 9–15.
[2] Peter Bailey, Nick Craswell, Ian Soboroff, Paul Thomas, Arjen P. de Vries, and Emine Yilmaz. 2008. Relevance assessment: Are judges exchangeable and does it matter? In 31st ACM SIGIR Conference on Research and Development in Information Retrieval. Singapore, 667–674.
[3] Maryam Bashir, Jesse Anderton, Jie Wu, Peter B. Golbus, Virgil Pavlu, and Javed A. Aslam. 2013. A document rating system for preference judgements. In 36th ACM SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 909–912.
[4] Christopher J. C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Microsoft Research Technical Report MSR-TR-2010-82.
[5] Ben Carterette, Paul Bennett, and Olivier Chapelle. 2008. A test collection of preference judgments. In SIGIR 2008 Workshop on Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments. Singapore.
[6] Ben Carterette and Paul N. Bennett. 2008. Evaluation measures for preference judgments. In 31st ACM SIGIR Conference on Research and Development in Information Retrieval. Singapore, 685–686.
[7] Ben Carterette, Paul N. Bennett, David Maxwell Chickering, and Susan T. Dumais. 2008. Here or there: Preference judgments for relevance. Computer Science Department Faculty Publication Series 46. University of Massachusetts Amherst.
[8] Praveen Chandar and Ben Carterette. 2012. Using preference judgments for novel document retrieval. In 35th ACM SIGIR Conference on Research and Development in Information Retrieval. Portland, Oregon, 861–870.
[9] Praveen Chandar and Ben Carterette. 2013. Preference based evaluation measures for novelty and diversity. In 36th ACM SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 413–422.
[10] Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. 2009. Expected reciprocal rank for graded relevance. In 18th ACM Conference on Information and Knowledge Management. Hong Kong, China, 621–630.
[11] Xi Chen, Paul N. Bennett, Kevyn Collins-Thompson, and Eric Horvitz. 2013. Pairwise ranking aggregation in a crowdsourced setting. In 6th ACM International Conference on Web Search and Data Mining. Rome, Italy, 193–202.
[12] Charles L. A. Clarke, Mark D. Smucker, and Alexandra Vtyurina. 2020. Offline evaluation by maximum similarity to an ideal ranking. In 29th ACM International Conference on Information and Knowledge Management.
[13] Charles L. A. Clarke, Alexandra Vtyurina, and Mark D. Smucker. 2020. Offline evaluation without gain. In ACM SIGIR International Conference on the Theory of Information Retrieval.
[14] Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2019. CAsT 2019: The Conversational Assistance Track overview. In 28th Text REtrieval Conference. Gaithersburg, Maryland.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics. Minneapolis, Minnesota.
[16] H. P. Frei and P. Schäuble. 1991. Determining the effectiveness of retrieval algorithms. Information Processing and Management 27, 2–3 (April 1991), 153–164.
[17] Ahmed Hassan Awadallah and Imed Zitouni. 2014. Machine-assisted search preference evaluation. In 23rd ACM International Conference on Information and Knowledge Management. Shanghai, China, 51–60.
[18] Kai Hui and Klaus Berberich. 2017. Low-cost preference judgment via ties. In European Conference on Information Retrieval. Aberdeen, Scotland, 626–632.
[19] Kai Hui and Klaus Berberich. 2017. Transitivity, time consumption, and quality of preference judgments in crowdsourcing. In European Conference on Information Retrieval. Aberdeen, Scotland, 239–251.
[20] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20, 4 (2002), 422–446.
[21] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2017. Accurately interpreting clickthrough data as implicit feedback. SIGIR Forum 51, 1 (August 2017), 4–11.
[22] Saikishore Kalloori, Francesco Ricci, and Rosella Gennari. 2018. Eliciting pairwise preferences in recommender systems. In 12th ACM Conference on Recommender Systems. Vancouver, British Columbia, 329–337.
[23] Jinyoung Kim, Gabriella Kazai, and Imed Zitouni. 2013. Relevance dimensions in preference-based IR evaluation. In 36th ACM SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 913–916.
[24] Caitlin Kuhlman, Diana Doherty, Malika Nurbekova, Goutham Deva, Zarni Phyo, Paul-Henry Schoenhagen, MaryAnn VanValkenburg, Elke Rundensteiner, and Lane Harrison. 2019. Evaluating preference collection methods for interactive ranking analytics. In CHI Conference on Human Factors in Computing Systems. Glasgow, Scotland, UK, Paper 512, 11 pages.
[25] Yanyan Lan, Shuzi Niu, Jiafeng Guo, and Xueqi Cheng. 2013. Is top-k sufficient for ranking? In 22nd ACM International Conference on Information and Knowledge Management. San Francisco, California, 1261–1270.
[26] Matthew Lease and Emine Yilmaz. 2012. Crowdsourcing for information retrieval. SIGIR Forum 45, 2 (January 2012), 66–75.
[27] Eddy Maddalena, Stefano Mizzaro, Falk Scholer, and Andrew Turpin. 2017. On crowdsourcing relevance magnitudes for information retrieval evaluation. ACM Transactions on Information Systems 35, 3 (January 2017).
[28] Shuzi Niu, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2012. Top-k learning to rank: Labeling, ranking and evaluation. In 35th ACM SIGIR Conference on Research and Development in Information Retrieval. Portland, Oregon, 751–760.
[29] Kira Radinsky and Nir Ailon. 2011. Ranking from pairs and triplets: Information quality, evaluation methods and query complexity. In 4th ACM International Conference on Web Search and Data Mining. Hong Kong, China, 105–114.
[30] Mark E. Rorvig. 1990. The simple scalability of documents. Journal of the American Society for Information Science 41 (1990).
[31] Tetsuya Sakai. 2006. Evaluating evaluation metrics based on the bootstrap. In 29th ACM SIGIR Conference on Research and Development in Information Retrieval. Seattle, Washington, 525–532.
[32] Tetsuya Sakai and Zhaohao Zeng. 2020. Good evaluation measures based on document preferences. In 43rd ACM SIGIR Conference on Research and Development in Information Retrieval. Xi'an, China.
[33] William Webber, Alistair Moffat, and Justin Zobel. 2010. A similarity measure for indefinite rankings. ACM Transactions on Information Systems 28, 4 (November 2010), 20:1–20:38.
[34] Fen Xia, Tie-Yan Liu, and Hang Li. 2009. Statistical consistency of top-k ranking. In Advances in Neural Information Processing Systems 22. Vancouver, British Columbia, 2098–2106.
[35] Jheng-Hong Yang, Sheng-Chieh Lin, Chuan-Ju Wang, Jimmy Lin, and Ming-Feng Tsai. 2019. Query and answer expansion from conversation history. In 28th Text REtrieval Conference. Gaithersburg, Maryland.
[36] Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. 2019. Critically examining the "neural hype": Weak baselines and the additivity of effectiveness gains from neural ranking models. In 42nd ACM SIGIR Conference on Research and Development in Information Retrieval. Paris, France, 1129–1132.
[37] Ziying Yang, Alistair Moffat, and Andrew Turpin. 2018. Pairwise crowd judgments: Preference, absolute, and ratio. In 23rd Australasian Document Computing Symposium. Dunedin, New Zealand.
[38] Y. Y. Yao. 1995. Measuring retrieval effectiveness based on user preference of documents. Journal of the American Society for Information Science 46, 2 (1995), 133–145.
[39] Dongqing Zhu and Ben Carterette. 2010. An analysis of assessor behavior in crowdsourced preference judgments. In SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation.

A SOFTWARE AND DATA RELEASE
Code and preference judgments are available at https://github.com/claclark/compatibility. Preference judgments are released without personally identifying information, for which we have University of Waterloo ethics approval.

The implementation of compatibility used for these experiments consists of a hundred-line Python program, which is backward compatible with the standard formats used by TREC for adhoc runs and relevance judgments. These relevance judgments are expressed as (topic-id, document-id, preference) triples (plus the required but unused "Q0" field).

Preferences can be any positive floating point or integer value. If one document's preference value is greater than another document's preference value, it indicates that the first document is preferred over the second. If preferences are tied, it indicates that the two documents belong to the same effectiveness level. The number of effectiveness levels for a topic is defined by the number of distinct preference values for that topic, and can vary from topic to topic. In this way, the program can be used directly with many existing TREC runs and qrels and extended by adding additional preference values.

By default the code computes NRBO, since this normalization is close to one unless the number of qrels is small. By default, we report p = 0.95, which provides a close match to NDCG@20, a primary measure for the older TREC Web Tracks. Overall the code should work "out of the box" for typical TREC tasks.
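For illustration, a fragment of such a preference file might look as follows. The topic number is invented for this example, and the passage identifiers echo Figure 1; the exact identifier syntax in the released judgments may differ:

    32_4  Q0  MARCO_1568091                                 5
    32_4  Q0  MARCO_394140                                  4
    32_4  Q0  CAR_f62c5a5a0be476d8ba9ce5d956b519413d73eb71  3
    32_4  Q0  CAR_5465fd5dd01cba27c7d792b6b6453ee3da101e03  3

Here the two CAR passages share a preference value and therefore fall into the same effectiveness level, while the two MARCO passages occupy levels of their own above them.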
95, which provides a close match to NDCG@20, aprimary measure for the older TREC Web Tracks. Overall the code should work “out of the box”for typical TREC tasks.95, which provides a close match to NDCG@20, aprimary measure for the older TREC Web Tracks. Overall the code should work “out of the box”for typical TREC tasks.