Learning from User Interactions with Rankings: A Unification of the Field
Harrie Oosterhuis
Academic dissertation (Academisch Proefschrift)

to obtain the degree of doctor at the University of Amsterdam, on the authority of the Rector Magnificus, prof. dr. ir. K.I.J. Maex, before a committee appointed by the Doctorate Board, to be defended in public in the Agnietenkapel on Friday, 27 November 2020, at 16:00, by
Hendrikus Roelof Oosterhuis
born in Schaijk

Doctoral committee
Promotor: Prof. dr. M. de Rijke, Universiteit van Amsterdam
Co-promotor: Prof. dr. E. Kanoulas, Universiteit van Amsterdam
Other members: Prof. dr. H. Haned, Universiteit van Amsterdam
Prof. dr. T. Joachims, Cornell University
Dr. ir. J. Kamps, Universiteit van Amsterdam
Prof. dr. C.G.M. Snoek, Universiteit van Amsterdam
Prof. dr. ir. A.P. de Vries, Radboud Universiteit Nijmegen

Faculteit der Natuurwetenschappen, Wiskunde en Informatica

The research was supported by the Netherlands Organisation for Scientific Research (NWO) under project number 612.001.551.

Copyright © 2020 Harrie Oosterhuis, Amsterdam, The Netherlands
Cover by Harrie Oosterhuis
Printed by Offpage, Amsterdam
ISBN: 978-94-93197-36-7

Acknowledgements
Over six years ago, I was invited to join the ILPS research group as an honours MSc. A.I. student. This was the start of an amazing period where I was able to learn, explore and develop myself into a true researcher. Many people have helped me on this journey, and I am truly grateful for all the support and friendship I have received along the way. I hope to inspire future students in the same way that you have all inspired me, and I will now try my best to thank each and every one of you.

First and foremost, I want to thank Maarten de Rijke, my supervisor and promotor. Maarten, I have learned more from you than I thought possible; you have taught me how to do research, how to become a better teacher and supervisor, and how to develop a research career. Your help was always there when needed, and without fail you have always gone above and beyond. You have set a great example for me and all of your students. Thank you so much.

Second, I wish to thank my co-promotor Evangelos Kanoulas. In our annual meetings you have always given me great advice. You are a truly compassionate supervisor, who cares greatly about his students. I am very happy to know that you will continue to have an amazing and caring influence on the future of the research group.

Third, I thank Hinda Haned, Thorsten Joachims, Jaap Kamps, Cees Snoek, and Arjen de Vries; I am truly honoured that you are all part of my PhD committee.

My special thanks to Ana and Tom for being my paranymphs. Through the peaks and valleys of my PhD life you have always been there for me, and I am very honoured to defend this thesis with you at my side.

Another special thanks to Anne Schuth for accepting and supervising me in ILPS when I had only just started the MSc. A.I. In the end, I hold you responsible for my interest in ranking systems and user interactions, and I cannot thank you enough. I also want to thank Petra in particular: your amazing work has made all of this possible. Without contest, I consider you the true ILPS MVP. Thank you, Petra.

Further thanks to everyone who has been part of ILPS during my journey: Adith, Alexey, Ali, Ali, Amir, Ana, Anna, Anne, Antonis, Arezoo, Arianna, Artem, Bob, Boris, Chang, Christof, Christophe, Chuan, Cristina, Daan, Damien, Dan, Dat, David, David, Dilek, Evgeny, Georgios, Hamid, Hendra, Hinda, Hosein, Ilya, Isaac, Ivan, Jiahuan, Jie, Jin, Julia, Julien, Kaspar, Katya, Ke, Maarten, Maarten, Maartje, Mahsa, Mariya, Marlies, Marzieh, Masrour, Maurits, Mohammad, Mostafa, Mozhdeh, Nikos, Olivier, Peilei, Pengjie, Petra, Praveen, Richard, Ridho, Rolf, Sam, Sami, Sebastian, Shangsong, Shaojie, Spyretta, Svitlana, Thorsten, Tobias, Tom, Trond, Vera, Wanyu, Xiaohui, Xiaojuan, Xinyi, Yangjun, Yaser, Yifan, Zhaochun, and Ziming. Together, you have all made ILPS a wonderful group to be part of; I am very grateful to call all of you my colleagues. I was also very happy to be part of several sub-groups that discussed ranking systems; thank you Ali, Arezoo, Artem, Chang, Jin, Julia, Maarten, Rolf, and Wanyu for the great discussions, and hopefully there will be many more discussions to come. In addition, special thanks to Ana, Antonis, Bob, Chang, Hosein, Maartje, Maurits, Nikos, Rolf, and Tom for being great friends as well.

Furthermore, I want to thank all the people who welcomed me abroad. For the great experiences I had during my Google internships, I thank Ajay, Ariel, Bo, Eugene, George, Guan-Lin, Heng-Tze, Larry, Maxime, Michael, Mustafa, Roger, Sujith, Vihan, and Yi-fan.
For the absolutely amazing time I had in Australia, I want to thank Andrew, Binsheng, Brian, Falk, Joel, Luke, Mark, Sarah, and Shane. Especially Joel and Binsheng, for literally travelling to the other side of the world with me. I am really grateful to have met you all and hope the future will allow me to visit you many times.

I also want to thank my fellow students Carla, Dasyel, Fabian, Jelle and Wietze for the many fond memories of my studies. Furthermore, I am grateful for my long friendship with Chiel, Don, Kit, Stefan, Mark and Luuk; it is very dear to me to have friends I have known since kindergarten.

Lastly, I want to thank my family, the most important people in my life. Marianna, Roelof, Anna and Jeroen, thank you for all your support; without you I would never have made it this far. I thank Helena and Nathalie for always welcoming us so warmly. I am also very grateful to my dear grandmother Anna, who always listens so patiently whenever I try to explain what it is I actually do at the university. Most of all, I thank my great love Emily, for always being there for me and for making every day so much brighter.

Harrie Oosterhuis
Amsterdam, October 2020

Contents
I  Novel Online Methods for Learning and Evaluating
II A Single Framework for Online and Counterfactual Learning to Rank
  5 … Top-k Rankings
    5.3 … Top-k Feedback
      5.3.1 The problem with top-k feedback
      5.3.2 Policy-aware propensity scoring
      5.3.3 Illustrative example
    5.4 Learning for Top-k Metrics
      5.4.1 Top-k metrics
      5.4.2 Monotonic upper bounding
      5.4.3 … top-k LTR
      5.4.4 Unbiased loss selection
    5.5 Experimental Setup
      5.5.1 Datasets
      5.5.2 Simulating top-k settings
      5.5.3 Experimental runs
    5.6 Results and Discussion
      5.6.1 Learning under item-selection bias
      5.6.2 Optimizing top-k metrics
    5.7 Related Work
    5.8 Conclusion
    5.A Notation Reference for Chapter 5
Bibliography
Summary
Samenvatting

1 Introduction
Search engines allow users to efficiently navigate through the enormous numbers of documents available online [7]. Underlying every search engine is a ranking system that processes documents in order to present a ranking to the user [75]. Over the years, the role of ranking systems has only become more important, as they are now used in a wide variety of settings. Users rely on them to search through many large collections of content, including images [35], scientific articles [58], e-commerce products [60], streaming videos [19], job applications/applicants [32], emails [127], and legal documents [103]. Similarly, ranking systems are used for recommendation as well, where they help to suggest content to users that matches their interests [101]. This may even be content of which users are not aware that they have an interest in [106]. In all these ranking scenarios, the best user experience is provided when the items that users prefer most are at the top of the produced rankings [52]. In other words, the ranking should help users find what they are looking for with a minimal amount of effort [105]. Without a ranking system, finding the right information in any sizeable collection becomes an impossible task. Furthermore, without recommendations many online services would lack a lot of user engagement [29]. Thus, ranking systems drive both user satisfaction – providing users with the content they prefer – and user engagement – bringing providers of content or services to interested consumers [34]. Therefore, the performance of a ranking system is very important to both the users of a service and its providers. Due to this importance, a lot of attention has gone to the evaluation of ranking systems [18, 38, 45, 55, 104, 105] and to the field of Learning to Rank (LTR), which covers methods for optimizing ranking systems [13, 58–60, 75, 129].

Traditionally, ranking evaluation and LTR methods made use of human judgements in the form of expert annotations [104]: for given pairs of queries and documents, experts are asked to annotate the relevance of a document w.r.t. a specific query. This costly process results in an annotated dataset: a collection of query and document pairs with corresponding expert annotations [17, 27, 76]. For an annotated dataset to be useful it should accurately capture: (i) the queries users typically issue; (ii) the documents that have to be ranked; and (iii) the relevance preferences of the user [105]. With such a dataset, the optimization of a ranking system can be done through supervised LTR methods. These methods optimize ranking metrics such as Precision, Average Relevance Position (ARP) or Discounted Cumulative Gain (DCG), based on the provided relevance annotations [52, 75].
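As an illustration, the standard linear-gain formulation of DCG at a cutoff k sums the annotated relevance of the top-k ranked documents, discounted logarithmically by rank (variants exist, e.g., with exponentiated gains):

$$ \mathit{DCG@k}(f, q) = \sum_{i=1}^{k} \frac{\mathit{rel}(d_i, q)}{\log_2(i + 1)}, $$

where $d_i$ is the document that ranking system $f$ places at rank $i$ for query $q$, and $\mathit{rel}(d_i, q)$ is its annotated relevance. Supervised LTR methods maximize such metrics, via differentiable approximations or bounds, over the annotated dataset.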
While very important to the LTR field, some severe limitations of this supervised approach have become apparent over the years: (i) expert annotations are expensive and time-consuming to obtain [17, 76]; (ii) in sensitive settings, acquiring expert annotations can be unethical, for instance, when gathering data for optimizing systems for search over personal documents such as emails [127]; (iii) for specific settings there may be no experts that can judge what is relevant, for instance, in the context of personalized recommendations; (iv) what users perceive as relevant is known to change over time, thus a dataset would have to be updated regularly, further increasing the associated costs [1, 71]; and (v) actual user preferences and expert annotations are often misaligned [104].

Consequently, the supervised approach is infeasible for many LTR practitioners, because they do not have the resources to create an annotated dataset or because gathering annotations is not possible in their ranking setting. Moreover, even if a dataset can be obtained, it may not lead to the optimal ranking system. Thus there is a need for an alternative to the supervised approach to ranking evaluation and LTR.

An alternative approach that has received a lot of attention is to base evaluation and optimization on user interactions [56, 99]. For rankings this usually means that user clicks are used to compare and improve ranking systems. At first glance, user interactions seem to solve the problems with annotations: (i) if a service has enough active users, interactions are virtually free and available at a large scale; (ii) gathering interactions can be done without showing sensitive items to experts for annotation; (iii) unlike annotations, interactions are an indication of the actual individual user preferences. Thus there appears to be a lot of potential for using user interactions; however, there are also drawbacks specific to using them: (i) it requires keeping track of large amounts of user behavior, something users may not consent to [94]; (ii) user behavior is very unpredictable, and clicks in particular are known to be a very noisy signal [20]; (iii) clicks are a form of implicit feedback; there are other factors besides user preference that also affect whether a click takes place, making clicks a biased signal of relevance [20, 25]. This thesis will not explore the first drawback and will instead focus on settings where acquiring user interactions is done with consent, in a privacy-respecting and ethical manner. Mainly, we will consider how methods of evaluation and optimization based on clicks can mitigate the negative effects of click-related noise and bias.

Existing methods for ranking evaluation and optimization from user interactions can roughly be divided into two families: the online family, which deals with bias through direct interaction and result-randomization [100, 132]; and the counterfactual family, which first models click behavior and then uses the inferred model to correct for bias in logged click data [58, 127]. A further division can be made; for this thesis a decomposition into five areas is relevant. We will divide the online family into three areas:

(i) Online Evaluation – methods like A/B testing and interleaving [56] that interact directly with users to compare ranking systems and randomize displayed results to mitigate biases [18, 44, 110, 111].

(ii) Feature-Based Online LTR – methods like Dueling Bandit Gradient Descent (DBGD) [132] and the Perturbed Preference Perceptron for Ranking (3PR) [100] that optimize feature-based ranking models by direct interaction with users, often relying on online evaluation [42, 111, 126].

(iii) Tabular Online LTR – methods like Cascading Bandits [68] and the Position-Based Model algorithm (PBM) [69] that optimize a single ranking for a single ranking setting, by learning from direct interactions and result randomization [67, 70, 138, 139]. Characteristic of tabular methods is that they do not use any feature-based prediction model but instead memorize the best ranking.

For the counterfactual family, we will use the following division into two areas:

(iv) Counterfactual Evaluation – methods that evaluate rankings based on historically logged clicks.
They require an inferred model of click behavior and use that model to correct for biases using, for instance, Inverse Propensity Scoring (IPS) [4, 16, 58, 92, 116].

(v) Counterfactual LTR – methods that use counterfactual evaluation to estimate performance based on historical click logs, and that optimize ranking models to maximize the estimated system performance [2, 3, 46, 58, 92, 127] (a sketch of such estimation is given below).

This division reveals a rich diversity in approaches that all share the same goal of evaluating or optimizing ranker performance based on user interactions. On the one hand, this diversity is understandable, since in some settings only one area of methods is applicable. For instance, one cannot add randomization to data that has already been logged, making the counterfactual approach the only available option if only logged data is available. On the other hand, the diversity of approaches is also unexpected and raises some questions. For instance, why would online approaches not benefit from an accurate model of click behavior if one is available, similar to the counterfactual approach?

In this thesis, we investigate whether this online/counterfactual division is truly necessary. We introduce several novel LTR methods that improve over the efficiency of existing methods and increase the applicability of LTR from user clicks. In particular, we focus on LTR methods that bridge the online/counterfactual division and that are highly effective both when applied online and when applied counterfactually. An important result of this thesis for the LTR field is that we offer a unified perspective and set of LTR methods.
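To make the counterfactual family concrete, the following is a minimal sketch of IPS-based counterfactual evaluation in the spirit of [58]; the log format, propensity model, and function names are illustrative assumptions, not a fixed interface:

```python
import numpy as np

def ips_dcg_estimate(click_log, propensity, ranker):
    """Estimate a DCG-style metric for `ranker` from clicks logged under a
    different system, correcting for position bias with inverse propensities.

    click_log:  list of (query, clicked_doc) pairs gathered by the old system.
    propensity: function (query, doc) -> probability the user examined `doc`
                when it was logged (e.g., from a position-bias click model).
    ranker:     function (query) -> ranked list of documents to evaluate.
    """
    total = 0.0
    for query, clicked_doc in click_log:
        new_rank = ranker(query).index(clicked_doc) + 1  # rank under new ranker
        # Each click is up-weighted by how unlikely it was to be observed,
        # which in expectation corrects for position bias.
        total += (1.0 / propensity(query, clicked_doc)) / np.log2(new_rank + 1)
    return total / len(click_log)
```

The unbiasedness of such an estimate hinges on the assumed click model being correct, which is exactly where the counterfactual family invests its modeling effort.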
1.1 Research Outline and Questions

The overarching question this thesis aims to answer is:
Could there be a single general theoretically-grounded approach that has competitive performance for both evaluation and LTR from user clicks on rankings, in both the counterfactual and online settings?
Our aim is to progress the LTR field towards answering this question in the affirmative. In this thesis, we explore two directions in search of such a single general theoretically-grounded approach. Firstly, we introduce novel online LTR methods that outperform existing online methods at large-scale evaluation and at optimization in the online setting. Secondly, we introduce novel counterfactual LTR methods that build on the original IPS-based counterfactual LTR approach [58]. Our novel counterfactual LTR methods expand the original counterfactual approach and make it applicable to more tasks and settings. As a result, these novel methods bridge several gaps between counterfactual LTR and the areas of supervised LTR and online LTR. Furthermore, all our novel counterfactual LTR methods are compatible with each other, and can be seen as parts of a novel counterfactual LTR framework. By the end of the thesis, our proposed framework has taken the original counterfactual LTR approach and greatly increased its applicability and effectiveness for both online and counterfactual evaluation and optimization. This leads to a more unified perspective on the LTR field, where areas that were previously largely independent are now connected.
In the first part of the thesis, we introduce two methods that greatly increase the efficiency of large-scale online evaluation and online LTR. Additionally, we take a critical look at several existing methods for online evaluation and online LTR.

Interleaving was introduced as an efficient evaluation paradigm designed to evaluate whether one ranking system outperforms another [56]. Interleaving methods take the rankings produced by two systems and combine them into an interleaved ranking [41, 96, 99]. Clicks on the interleaved ranking are interpreted directly as preference signals between the two systems, resulting in a more data-efficient approach [110], thus allowing one to efficiently estimate whether an alteration leads to an improved system. Later, the interleaving approach was extended to multileaving, which allows for comparisons that include more than two systems at once [12, 108, 109], thereby enabling efficient comparisons of large numbers of systems with each other.

In Chapter 2 we look at such multileaving methods for large-scale online ranking evaluation. Specifically, we investigate the following question:
RQ1
Does the effectiveness of online ranking evaluation methods scale to large comparisons?

We examine existing multileaving methods in terms of fidelity – are they provably unbiased in unambiguous cases [44] – and considerateness – are they safe w.r.t. the user experience during the gathering of clicks. From our theoretical analysis, we find that no existing multileaving method manages to meet both criteria. Furthermore, our empirical analysis reveals that their performance decreases as comparisons involve more ranking systems at once. As a novel alternative, we introduce the Pairwise Preference Multileaving (PPM) algorithm, which bases evaluation on inferred pairwise item preferences. We prove that it meets both the fidelity and considerateness criteria. Furthermore, our empirical results indicate that using PPM leads to a much smaller number of errors, especially in large-scale comparisons.

Besides evaluation, optimization is also very important to obtain effective ranking systems [75]. The idea of optimizing ranking systems based on clicks is long-established. One of the first theoretically-grounded approaches was Dueling Bandit Gradient Descent (DBGD) [132]. For every incoming query, DBGD samples a variation on a ranking system and then uses interleaving to estimate whether this variation is an improvement. If so, it updates the ranking system to be more similar to the variation. Over time, this process is supposed to oscillate towards the optimal ranking system. Numerous extensions have been proposed, but all have kept the overall DBGD approach of sampling variations and using online evaluation [42, 100, 111, 126]. This is somewhat puzzling, since this sampling approach is in stark contrast with all other LTR methods, which use gradient-based optimization.

In Chapter 3 we explore alternatives to the DBGD approach and ask ourselves the following question:
RQ2
Is online LTR possible without relying on model-sampling and online evaluation?

We answer this question in the affirmative by proposing a novel online LTR method: Pairwise Differentiable Gradient Descent (PDGD). Unlike DBGD, PDGD does not require model-sampling, nor does it make use of any online evaluation. Instead, PDGD optimizes a stochastic Plackett-Luce ranking model and bases its updates on inferred pairwise item preferences. PDGD weights the gradients w.r.t. item-pairs to mitigate the effect of position bias. We prove, under very mild assumptions, that the weighted gradient of PDGD is unbiased w.r.t. item-preferences. Our experimental results show that PDGD requires far fewer interactions to reach the same level of performance as DBGD. Furthermore, we show that even in ideal settings DBGD may not be able to find the optimal model and is ineffective at optimizing neural models. In contrast, PDGD does converge to near-optimal models, and reaches even higher performance when applied to neural networks.
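For illustration, the following sketch samples a ranking from a Plackett-Luce model, the stochastic model class PDGD optimizes; the function name is ours:

```python
import numpy as np

def sample_plackett_luce_ranking(scores, rng=None):
    """Sample a ranking where each next document is drawn with probability
    proportional to exp(score), from the documents not yet placed."""
    rng = rng or np.random.default_rng()
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        logits = np.array([scores[d] for d in remaining])
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        ranking.append(remaining.pop(rng.choice(len(remaining), p=probs)))
    return ranking
```

Because the model is stochastic and differentiable in the scores, PDGD can compute gradients that increase the probability of rankings matching the inferred pairwise preferences, without sampling entire model variants as DBGD does.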
The large improvements of PDGD over DBGD observed in Chapter 3 made us wonder whether DBGD is actually a reliable choice for online LTR. In response to this question, Chapter 4 tackles the following question:

RQ3

Are DBGD LTR methods reliable in terms of theoretical soundness and empirical performance?

First, we take a critical look at the theory underlying the DBGD approach, and find that its assumptions do not hold for deterministic ranking systems and common ranking metrics. Consequently, we conclude that its theory is not applicable to the large majority of existing research that utilizes the DBGD approach [42, 43, 90, 111, 125, 135]. Second, we perform an empirical analysis where DBGD and PDGD are compared in circumstances ranging from near-ideal – where interactions contain little noise and no position bias – to extremely difficult – where interactions contain extreme amounts of noise and position bias. The difference in performance between PDGD and DBGD is so large that we conclude that PDGD is by far the more reliable choice.

For the field of online LTR, this leads us to question the relevance of DBGD and its extensions, as we have found theoretical weaknesses and empirical inferiority. The fact that virtually all previous methods in the online LTR field are extensions of DBGD raises profound questions.
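For reference, the DBGD update loop criticized above can be sketched as follows; interleaved_comparison stands in for an online evaluation step and is assumed, not specified, here:

```python
import numpy as np

def dbgd_update(weights, query, interleaved_comparison,
                step_size=0.01, exploration=1.0, rng=None):
    """One DBGD step: sample a perturbed candidate ranker on the unit sphere;
    if online evaluation prefers the candidate, step toward it."""
    rng = rng or np.random.default_rng()
    direction = rng.normal(size=weights.shape)
    direction /= np.linalg.norm(direction)  # uniformly random unit vector
    candidate = weights + exploration * direction
    # interleaved_comparison returns True iff the user's clicks on an
    # interleaved result list indicate a preference for the candidate.
    if interleaved_comparison(weights, candidate, query):
        weights = weights + step_size * direction
    return weights
```

Every candidate update requires an online comparison, which is precisely the dependence PDGD removes.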
In the second part of the thesis, we expand the existing IPS-based counterfactual LTR approach [58] to create a unified framework for both online and counterfactual LTR and ranking evaluation based on clicks.

The conclusions of the first part of the thesis revealed that DBGD, which forms the basis of most previous work in online LTR, has problems in terms of performance and its theoretical basis. It is concerning that these conclusions could have been reached much earlier: previous work could have taken a critical look at the theory at any moment; furthermore, if previous work had compared DBGD performance with supervised LTR in the prevalent simulated setups, it would have observed the convergence problems of DBGD. To avoid similar issues, we chose to build upon the counterfactual LTR approach because it has a strong theoretical basis; additionally, all experimental comparisons in the second part include optimal ranking models to detect potential convergence issues.

In contrast with online LTR approaches, counterfactual LTR and evaluation make explicit assumptions about user behavior [58, 127]. By making such assumptions, the unbiasedness of counterfactual methods can be proven, thus guaranteeing optimal convergence, given that the assumptions are correct. While this provides a strong foundation for learning from historically logged clicks, the counterfactual approach is not always applicable, nor always the most effective option [50]. The following research questions consider whether counterfactual LTR could overcome its limitations and become the best choice for LTR from clicks in general.

One of the requirements for the unbiasedness of the original counterfactual LTR method is that every relevant item has to be displayed at every query [58]. This is a problem in top-k ranking settings where not all items can be displayed at once [92]. Hence, Chapter 5 concerns the question:

RQ4
Can counterfactual LTR be extended to top-k ranking settings?

We introduce the policy-aware estimator, which corrects for position bias while taking into account the behavior of a stochastic logging policy. As a result, the policy-aware estimator is unbiased even when learning from top-k feedback, as long as the policy gives every relevant item a non-zero chance of appearing in the top-k. With this extension, counterfactual LTR is thus also applicable to the top-k setting, which is especially prevalent in recommendation.
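The core of the policy-aware idea can be sketched as follows: the propensity of an item is its examination probability marginalized over the rankings the stochastic logging policy may display. The Monte-Carlo approximation and names below are illustrative:

```python
def policy_aware_propensity(doc, query, logging_policy, observe_prob,
                            k, n_samples=1000):
    """Estimate P(doc is examined) when only the top-k of each sampled
    ranking is displayed.

    logging_policy: function (query) -> one sampled ranking (list of docs).
    observe_prob:   function (rank) -> examination probability (position bias).
    """
    total = 0.0
    for _ in range(n_samples):
        rank = logging_policy(query).index(doc) + 1
        if rank <= k:  # items outside the top-k can never be examined
            total += observe_prob(rank)
    return total / n_samples
```

Weighting clicks by the inverse of this propensity stays well-defined, and the resulting estimator unbiased, exactly when the policy gives every relevant item a non-zero probability of appearing in the top-k.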
Existing work has considered how to optimize ranking metrics such as DCG using counterfactual LTR [2, 46]. Interestingly, the solutions for counterfactual LTR are very different from those in supervised LTR [13, 129]. To investigate whether this difference is really necessary, Chapter 5 also addresses the question:

RQ5

Is it possible to apply state-of-the-art supervised LTR methods to the counterfactual LTR problem?

We find that the LambdaLoss framework [129], which includes the famous LambdaMART method [13], can also be applied to counterfactual estimates of ranking metrics. Thus we show that there does not need to be a divide between state-of-the-art supervised LTR and counterfactual LTR.

So far we have not considered the area of tabular online LTR: methods that find the optimal ranking for a single query based on result randomization and direct interaction [67–70, 139]. While these methods need a lot of click data to reach decent performance, they can always find the optimal ranking, since they optimize a memorized ranking instead of using a feature-based model [138]. The downside is that, when few clicks are available for a query, tabular LTR methods are highly sensitive to noise. Thus these approaches are good for specialization: they have great performance on queries where numerous clicks have been observed, at the cost of an initial period of poor performance. Conversely, counterfactual LTR commonly optimizes feature-based models for generalization, to have robust performance on previously unseen queries, while often not reaching perfect performance at convergence. Inspired by this contrast, in Chapter 6 we ask ourselves:
RQ6
Can the specialization ability of tabular online LTR be combined with the robust feature-based approach of counterfactual LTR?

Our answer comes in the form of the novel Generalization and Specialization (GENSPEC) algorithm: it optimizes a single robust generalized policy and numerous specialized policies, each optimized for a single query. The GENSPEC meta-policy then uses high-confidence bounds to safely decide per query which policy to deploy. Consequently, for previously unseen queries GENSPEC chooses the generalized policy, which utilizes the robust feature-based ranking model, while for other queries it can decide to deploy a specialized policy, i.e., if it has enough data to confidently determine that the specialized policy has found the better ranking. For the LTR field, GENSPEC shows that specialization does not need to be unique to tabular online LTR; instead, it can be a property of counterfactual LTR as well. More generally, it shows that specialization and generalization are not mutually exclusive abilities.
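A sketch of the meta-policy decision; the confidence-bound function is a hypothetical stand-in for the bounds derived in Chapter 6:

```python
def choose_policy(query, general_policy, specialized_policies,
                  lower_confidence_bound):
    """Deploy the specialized policy for a query only if it is confidently
    better than the generalized one.

    lower_confidence_bound: hypothetical function (query, pol_a, pol_b) ->
        high-confidence lower bound on the performance difference a - b,
        estimated from the clicks logged for this query.
    """
    specialized = specialized_policies.get(query)
    if specialized is None:  # unseen query: rely on the feature-based model
        return general_policy
    if lower_confidence_bound(query, specialized, general_policy) > 0.0:
        return specialized   # with high confidence the memorized ranking wins
    return general_policy
```

The safety property follows directly: a specialized (memorized) ranking is only shown once the logged clicks rule out, with high confidence, that it is worse than the generalized model.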
While counterfactual evaluation methods are designed for using historical clicks, they can be applied online by simply treating newly gathered data as historical [16, 50]. In contrast with online evaluation methods, counterfactual evaluation is completely passive: its methods do not prescribe which rankings should be displayed. This difference leads us to ask the following question in Chapter 7:

RQ7

Can counterfactual evaluation methods for ranking be extended to perform efficient and effective online evaluation?

We answer this question positively by introducing the novel Logging-Policy Optimization Algorithm (LogOpt), which uses the available clicks to optimize the logging policy so that it minimizes the variance of counterfactual estimates of ranking metrics. By minimizing variance, LogOpt increases the data-efficiency of counterfactual evaluation, leading to more accurate estimates from fewer logged clicks. LogOpt is applied while data is still being gathered, and it changes which rankings will be displayed for future queries. Thus, with the addition of LogOpt, counterfactual evaluation is transformed into an online approach that is actively involved in how data is gathered. Our experimental results suggest that LogOpt is at least as efficient as interleaving methods, while also being proven to be unbiased under the common assumptions of counterfactual LTR.
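The quantity LogOpt drives down can be sketched as follows: under a simple single-click-per-query model, the variance of an IPS estimate is determined by the propensities the logging policy induces. This is an illustrative simplification, not the algorithm itself:

```python
import numpy as np

def ips_variance(propensities, click_probs):
    """Variance of a per-interaction IPS estimate for one query, assuming at
    most one click per interaction.

    propensities: array of P(doc examined) under a candidate logging policy.
    click_probs:  array of P(doc clicked); rarely exposed documents receive
                  large inverse-propensity weights and inflate the variance.
    """
    weights = 1.0 / propensities
    mean = np.sum(click_probs * weights)
    second_moment = np.sum(click_probs * weights ** 2)
    return second_moment - mean ** 2

def pick_logging_policy(candidate_propensities, click_probs):
    """Choose the candidate exposure distribution minimizing that variance."""
    return min(candidate_propensities,
               key=lambda p: ips_variance(p, click_probs))
```

LogOpt performs this minimization over parameterized logging policies using the clicks gathered so far, rather than over a fixed candidate set.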
The results in Chapter 2 and Chapter 7 did not show any online evaluation method converging to zero error. This led us to also ask the following question in Chapter 7:

RQ8

Are existing interleaving methods truly capable of unbiased evaluation w.r.t. position bias?
We prove that, under the assumption of basic position bias, interleaving methods are not unbiased. Furthermore, our results in Chapter 7 indicate that interleaving methods have a systematic error. Unfortunately, we are unable to estimate the impact this systematic error has on real-world comparisons. To the best of our knowledge, no empirical studies have been performed that could measure such a bias; our findings strongly suggest that such a study would be highly valuable to the field.

In Chapter 7 we show that counterfactual ranking evaluation can be as efficient as online evaluation methods, while also having the theoretical justification of counterfactual methods. Naturally, this leads to a similar question regarding LTR:
RQ9
Can the counterfactual LTR approach be extended to perform highly effective online LTR?

In Chapter 8 we answer this question by introducing the intervention-aware estimator for online/counterfactual LTR. The intervention-aware estimator corrects for position bias and trust bias while also taking into account the effect of online interventions. This means that if an intervention takes place – i.e., the logging policy changes during the gathering of data – the intervention-aware estimator takes its effect on the interaction biases into account. The result is an estimator that, on the one hand, is just as efficient as other counterfactual estimators when applied to historical data, while, on the other hand, being much more efficient than existing estimators when applied online. Moreover, its performance is comparable to online LTR methods. In contrast with online methods, including DBGD and PDGD, the intervention-aware estimator is proven to be unbiased w.r.t. ranking metrics under the standard assumptions. In other words, it is the only method that is proven to converge on the optimal model, while also being as efficient as the others. Therefore, we consider the intervention-aware estimator a bridge between online and counterfactual LTR, as it is the most reliable choice in both scenarios.
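A sketch of the central intervention-aware correction, as we read it: when the logging policy changes over time, the effective propensity of an item averages its exposure probability over all policies deployed during logging. The interface is illustrative:

```python
def intervention_aware_propensity(doc, query, deployed_policies, exposure_prob):
    """Effective examination probability of `doc` over the logging period.

    deployed_policies: logging policies, one per (equally sized) time period.
    exposure_prob:     function (policy, query, doc) -> P(doc examined),
                       e.g., computed as in the policy-aware estimator.
    """
    probs = [exposure_prob(policy, query, doc) for policy in deployed_policies]
    return sum(probs) / len(probs)  # average over every deployed policy
```

Clicks weighted by the inverse of this averaged propensity account for the interventions themselves, which is what allows the same estimator to serve both the counterfactual and the online setting.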
1.2 Main Contributions

This section summarizes the main contributions of this thesis. We differentiate between algorithmic contributions – novel algorithms introduced in the thesis – and theoretical contributions – findings that are important to the field, both in the form of formal proofs and of empirical observations.

Algorithmic contributions:
1. The Pairwise Preference Multileaving (PPM) algorithm for large-scale comparisons in online evaluation.
2. The Pairwise Differentiable Gradient Descent (PDGD) algorithm for fast and efficient online LTR.
3. The policy-aware estimator that can perform unbiased counterfactual LTR in top-k settings.
4. Three loss functions for optimizing top-k metrics with counterfactual LTR, including an adaptation of the supervised LTR LambdaLoss method.
5. The Generalization and Specialization (GENSPEC) algorithm that combines the specialization ability of tabular models with the generalization ability of feature-based models.
6. The Logging-Policy Optimization Algorithm (LogOpt) that turns counterfactual evaluation into online evaluation, so as to minimize variance by updating the logging policy during the gathering of data.
7. The intervention-aware estimator that bridges the gap between counterfactual and online LTR, by extending the policy-aware estimator to take into account the effect of online interventions.
8. An overarching framework for both online and counterfactual LTR evaluation and optimization, combining the existing counterfactual approach with the contributions of the second part of the thesis. For counterfactual/online evaluation, contributions 3, 6, and 7 can be applied simultaneously; similarly, for counterfactual/online LTR the same can be done with contributions 3, 4, 5, and 7.
Theoretical contributions:

9. An extension of the definitions of fidelity and considerateness to multileaving; in addition, we show that no existing multileaved comparison method meets both criteria simultaneously.
10. A formal proof that PDGD is unbiased w.r.t. pairwise item preferences under mild assumptions.
11. A formal proof that the assumptions of DBGD are not sound for deterministic ranking models, thus invalidating some claims of unbiasedness in previous online LTR work.
12. An extensive comparison of DBGD and PDGD under circumstances ranging from ideal to near worst-case, revealing that even in ideal circumstances DBGD is often unable to approximate the optimal model.
13. A formal proof of the unbiasedness of the policy-aware and intervention-aware estimators, proving that the former is unbiased w.r.t. position bias and item-selection bias, and the latter w.r.t. position bias, item-selection bias, and trust bias.
14. A formal demonstration of how LTR loss functions can be adapted to bound top-k metrics, including a description of how LambdaLoss can be adapted for counterfactual LTR.
15. An extension of existing bounds in order to bound the relative performance of two policies, with an additional proof that this bound is more efficient than comparing the bounds of the individual policies.
16. A formal proof that interleaving methods are not unbiased w.r.t. position bias.
17. An empirical analysis that reveals that PDGD is not unbiased w.r.t. position bias, item-selection bias, and trust bias when not applied fully online.

In addition to these contributions, the source code used to perform the experiments in each published chapter has been shared publicly to enable reproducibility.
1.3 Thesis Overview

This section provides an overview of the thesis, together with some recommendations for reading directions. This thesis consists of an introductory chapter, seven research chapters divided into two parts, and a conclusion. Each research chapter answers one or two of the thesis research questions put forward in Section 1.1, in addition to several chapter-specific research questions. The thesis research questions are important to the overarching story of the thesis, whereas the chapter-specific research questions only concern the individual contributions of the chapters.

The first chapter, which you are currently reading, introduces the subject of this thesis: LTR and ranking evaluation from user clicks. Furthermore, it lays out the thesis research questions this thesis answers, and provides an overview of its contributions and its origins.
Part I, titled Novel Online Methods for Learning and Evaluating, contains three research chapters that all consider online methods for LTR and ranking evaluation. Chapter 2 looks at multileaving methods for online evaluation, evaluates existing methods, and introduces a novel multileaving method. Chapter 3 considers online LTR and introduces PDGD, a novel debiased pairwise method. Chapter 4 performs an extensive comparison of the previous state-of-the-art online LTR method, DBGD, and our novel PDGD, in terms of theoretical guarantees and an experimental analysis.
Part II, titled A Single Framework for Online and Counterfactual Learning to Rank, contains four research chapters that build on the counterfactual approach to LTR and ranker evaluation. The chapters in this part of the thesis are complementary; most of their contributions can be applied together or build upon each other. Chapter 5 extends counterfactual LTR to top-k settings; it introduces a novel estimator for learning from top-k feedback and extends supervised LTR methods to optimize counterfactual estimates of top-k ranking metrics. Chapter 6 looks at both tabular and feature-based ranking models, and introduces an algorithm that optimizes both types of models and safely deploys different models per query, thus combining the specialization abilities of tabular models with the robust performance of feature-based models in previously unseen circumstances. Chapter 7 aims to unify counterfactual and online ranking evaluation; it introduces a method that updates the logging policy during the gathering of data, turning counterfactual evaluation into efficient online evaluation. Similarly, Chapter 8 seeks to unify counterfactual and online LTR; it proposes a novel estimator that takes into account the effect of online interventions but can also be applied counterfactually. As a result, the estimator is effective for both counterfactual LTR and online LTR. Lastly, the thesis is concluded in Chapter 9, where we summarize the findings of the thesis; in particular, we discuss whether the division between the families of online and counterfactual LTR methods has been bridged. We end that chapter with a discussion of possible future research directions.

The research chapters in this thesis are self-contained; a reader can therefore read any single chapter independently if they desire. The research chapters grew out of published papers, and we wanted to avoid creating alternate versions of published work that deviate from the originals. As a result, the notation differs somewhat between some chapters; to help the reader, we have added a table at the end of each chapter detailing the notation it uses. For the best experience, we recommend reading all the chapters in Part II, because they build on each other. For the same reason, Chapter 3 and Chapter 4 are best read together.
1.4 Origins

We will now list the publications on which the research chapters are based. Each of the publications is a conference paper written by Harrie Oosterhuis and Maarten de Rijke. In all cases, Oosterhuis came up with the main research ideas, performed all experiments, and wrote the majority of the text. De Rijke led the discussions on how each paper should be structured and contributed significantly to the text. In total, this thesis is built on six publications [81, 82, 84–88].
Chapter 2 is based on Sensitive and Scalable Online Evaluation with Theoretical Guarantees, published at CIKM '17 by Oosterhuis and de Rijke [81].

Chapter 3 is based on Differentiable Unbiased Online Learning to Rank, published at CIKM '18 by Oosterhuis and de Rijke [82].

Chapter 4 is based on Optimizing Ranking Models in an Online Setting, published at ECIR '19 by Oosterhuis and de Rijke [84].

Chapter 5 is based on Policy-Aware Unbiased Learning to Rank for Top-k Rankings, published at SIGIR '20 by Oosterhuis and de Rijke [86].

Chapter 6 is based on Robust Generalization and Safe Query-Specialization in Counterfactual Learning to Rank, submitted to WWW '21 by Oosterhuis and de Rijke [87].

Chapter 7 is based on Taking the Counterfactual Online: Efficient and Unbiased Online Evaluation for Ranking, published at ICTIR '20 by Oosterhuis and de Rijke [85].

Chapter 8 is based on Unifying Online and Counterfactual Learning to Rank, published at WSDM '21 by Oosterhuis and de Rijke [88].

In addition, this thesis also indirectly benefitted from the following publications:
• Probabilistic Multileave for Online Retrieval Evaluation, published at SIGIR '15 by Schuth et al. [109].
• Multileave Gradient Descent for Fast Online Learning to Rank, published at WSDM '16 by Schuth et al. [111].
• Probabilistic Multileave Gradient Descent, published at ECIR '16 by Oosterhuis et al. [90].
• Balancing Speed and Quality in Online Learning to Rank for Information Retrieval, published at CIKM '17 by Oosterhuis and de Rijke [80].
• Query-level Ranker Specialization, published at CEUR '17 by Jagerman et al. [49].
• Ranking for Relevance and Display Preferences in Complex Presentation Layouts, published at SIGIR '18 by Oosterhuis and de Rijke [83].
• The Potential of Learned Index Structures for Index Compression, published at ADCS '18 by Oosterhuis et al. [91].
• To Model or to Intervene: A Comparison of Counterfactual and Online Learning to Rank from User Interactions, published at SIGIR '19 by Jagerman et al. [50].
• When Inverse Propensity Scoring does not Work: Affine Corrections for Unbiased Learning to Rank, published at CIKM '20 by Vardasbi et al. [123].
• Keeping Dataset Biases out of the Simulation: A Debiased Simulator for Reinforcement Learning based Recommender Systems, published at RecSys '20 by Huang et al. [47].

Furthermore, other work helped with gaining broader research insights, without being directly related to the thesis topic:

• Semantic Video Trailers by Oosterhuis et al. [89].
• Optimizing Interactive Systems with Data-Driven Objectives by Li et al. [73].
• Actionable Interpretability through Optimizable Counterfactual Explanations for Tree Ensembles by Lucic et al. [77].
Part I

Novel Online Methods for Learning and Evaluating

2 Sensitive and Scalable Online Evaluation with Theoretical Guarantees

(This chapter was published as [81]. Appendix 2.A gives a reference for the notation used in this chapter.)
Multileaved comparison methods generalize interleaved comparison methods to provide a scalable approach for comparing ranking systems based on regular user interactions. Such methods enable the increasingly rapid research and development of search engines. However, existing multileaved comparison methods that provide reliable outcomes do so by degrading the user experience during evaluation. Conversely, current multileaved comparison methods that maintain the user experience cannot guarantee correctness. In this chapter, we address the following thesis research question:
RQ1
Does the effectiveness of online evaluation methods scale to large comparisons?
Our answer comes in the form of a two-fold contribution. First, we propose a theoretical framework for systematically comparing multileaved comparison methods using the notions of considerateness, which concerns maintaining the user experience, and fidelity, which concerns reliably correct outcomes. Second, we introduce a novel multileaved comparison method, Pairwise Preference Multileaving (PPM), that performs comparisons based on document-pair preferences, and prove that it is considerate and has fidelity. We show empirically that, compared to previous multileaved comparison methods, PPM is more sensitive to user preferences and that it scales well with the number of rankers being compared.

2.1 Introduction
Evaluation is of tremendous importance to the development of modern search engines. Any proposed change to the system should be verified to ensure it is a true improvement. Online approaches to evaluation aim to measure the actual utility of an Information Retrieval (IR) system in a natural usage environment [45]. Interleaved comparison methods are a within-subject setup for online experimentation in IR. For interleaved comparisons, two experimental conditions ("control" and "treatment") are typical. Recently, multileaved comparisons have been introduced for the purpose of efficiently comparing large numbers of rankers [12, 108]. These multileaved comparison methods were introduced as an extension to interleaving, and the majority are directly derived from their interleaving counterparts [108, 109]. The effectiveness of these methods has thus far only been measured using simulated experiments on public datasets. While this gives some insight into the general sensitivity of a method, there is no work that assesses under what circumstances these methods provide correct outcomes and when they break. Without knowledge of the theoretical properties of multileaved comparison methods, we are unable to identify when their outcomes are reliable.

In prior work on interleaved comparison methods, a theoretical framework has been introduced that provides explicit requirements that an interleaved comparison method should satisfy [44]. We take this approach as our starting point and adapt and extend it to the setting of multileaved comparison methods. Specifically, the notion of fidelity is central to the previous work of Hofmann et al. [44]; Section 2.3 describes the framework with its requirements of fidelity. In the setting of multileaved comparison methods, this means that a multileaved comparison method should always recognize an unambiguous winner of a comparison. We also introduce a second notion, considerateness, which says that a comparison method should not degrade the user experience, e.g., by allowing all possible permutations of documents to be shown to the user. In this chapter we examine all existing multileaved comparison methods and find that none satisfy both the considerateness and fidelity requirements. In other words, no existing multileaved comparison method is correct without sacrificing the user experience.

To address this gap, we propose a novel multileaved comparison method, Pairwise Preference Multileaving (PPM). PPM differs from existing multileaved comparison methods in that its comparisons are based on inferred pairwise document preferences, whereas existing multileaved comparison methods use either some form of document assignment [108, 109] or click credit functions [12, 108]. We prove that PPM meets both the considerateness and the fidelity requirements; thus PPM guarantees correct winners in unambiguous cases while maintaining the user experience at all times. Furthermore, we show empirically that PPM is more sensitive than existing methods, i.e., it makes fewer errors in the preferences it finds. Finally, unlike other multileaved comparison methods, PPM is computationally efficient and scalable, meaning that it maintains most of its sensitivity as the number of rankers in a comparison increases.

In this chapter we address thesis research question RQ1 by answering the following more specific research questions:
RQ2.1
Does PPM meet the fidelity and considerateness requirements?
RQ2.2
Is PPM more sensitive than existing methods when comparing multiple rankers?

To summarize, our contributions in this chapter are:

1. A theoretical framework for comparing multileaved comparison methods;
2. A comparison of all existing multileaved comparison methods in terms of considerateness, fidelity and sensitivity;
3. A novel multileaved comparison method that is considerate, has fidelity, and is more sensitive than existing methods.

2.2 Related Work

Evaluation of information retrieval systems is a core problem in IR. Two types of approach are common in designing reliable methods for measuring an IR system's effectiveness. Offline approaches such as the Cranfield paradigm [104] are effective for measuring topical relevance, but have difficulty taking into account contextual information, including the user's current situation, fast-changing information needs, and past interaction history with the system [45]. In contrast, online approaches to evaluation aim to measure the actual utility of an IR system in a natural usage environment. User feedback in online evaluation is usually implicit, in the form of clicks, dwell time, etc.

By far the most common type of controlled experiment on the web is A/B testing [65, 66]. This is a classic between-subject experiment, where each subject is exposed to one of two conditions: control – the current system – and treatment – an experimental system that is assumed to outperform the control.

An alternative experimental design uses a within-subject setup, where all study participants are exposed to both experimental conditions. Interleaved comparisons [54, 99] have been developed specifically for online experimentation in IR. Interleaved comparison methods have two main ingredients. First, a method for constructing interleaved result lists specifies how to select documents from the original rankings ("control" and "treatment"). Second, a method for inferring comparison outcomes, based on observed user interactions with the interleaved result list. Because of their within-subject nature, interleaved comparisons can be up to two orders of magnitude more efficient than A/B tests in effective sample size for studies of comparable dependent variables [18].

For interleaved comparisons, two experimental conditions are typical. Extensions to multiple conditions have been introduced by Schuth et al. [108]. Such multileaved comparisons are an efficient online evaluation method for comparing multiple rankers simultaneously. Similar to interleaved comparison methods [41, 56, 96, 99], a multileaved comparison infers preferences between rankers. Interleaved comparisons do this by presenting users with interleaved result lists; these represent two rankers in such a way that a preference between the two can be inferred from clicks on their documents. Similarly, for multileaved comparisons, multileaved result lists are created that allow more than two rankers to be represented in the result list. As a consequence, multileaved comparisons can infer preferences between multiple rankers from a single click. Due to this property, multileaved comparisons require far fewer interactions than interleaved comparisons to achieve the same accuracy when multiple rankers are involved [108, 109].

The general approach shared by all multileaved comparison methods is described in Algorithm 2.1; here, a comparison of a set of rankers R is performed over T user interactions. After the user submits a query q to the system (Line 4), a ranking l_i is generated for each ranker r_i in R (Line 6).
These rankings are then combined into a single result list by the multileaving method (Line 7); we refer to the resulting list m as the multileaved result list. In theory, a multileaved result list could contain the entire document set; in practice, however, a length k is chosen beforehand, since users generally only view a restricted number of result pages. This multileaved result list is presented to the user, who has the choice to interact with it or not. Any interactions are recorded in c and returned to the system (Line 8). While c could contain any interaction information [63], in practice multileaved comparison methods only consider clicks. Preferences between the rankers in R can be inferred from the interactions, and the preference matrix P is updated accordingly (Line 11). By aggregating the inferred preferences over many interactions, a multileaved comparison method can detect preferences of users between the rankers in R. Thus it provides a method of evaluation that does not require any form of explicit annotation.

Algorithm 2.1: General pipeline for multileaved comparisons.

1:  Input: set of rankers R, documents D, no. of timesteps T
2:  P ← 0                                  // initialize |R| × |R| preference matrix
3:  for t = 1, ..., T do
4:      q_t ← wait_for_user()              // receive query from user
5:      for i = 1, ..., |R| do
6:          l_i ← r_i(q_t, D)              // create ranking for query per ranker
7:      m_t ← combine_lists(l_1, ..., l_|R|)  // combine into multileaved list
8:      c ← display(m_t)                   // display to user and record interactions
9:      for i = 1, ..., |R| do
10:         for j = 1, ..., |R| do
11:             P_ij ← P_ij + infer(i, j, c, m_t)  // infer preference between rankers
12: return P

By instantiating the general pipeline shown in Algorithm 2.1, i.e., the combination method at Line 7 and the inference method at Line 11, we obtain a specific multileaved comparison method. We detail all known multileaved comparison methods in Section 2.4 below.
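A runnable sketch of Algorithm 2.1; combine_lists and infer are the two method-specific ingredients, and display stands in for the user:

```python
import numpy as np

def multileaved_comparison(rankers, documents, queries,
                           combine_lists, infer, display, k=10):
    """Generic multileaving loop of Algorithm 2.1: returns the preference
    matrix P, where P[i, j] > 0 indicates a preference for ranker i over j.

    rankers:       list of functions (query, documents) -> ranked doc list.
    combine_lists: method-specific combination of rankings into one list.
    infer:         method-specific preference signal between rankers i and j.
    display:       stand-in for the user: multileaved list -> clicked ranks.
    """
    n = len(rankers)
    P = np.zeros((n, n))
    for query in queries:  # one interaction per incoming query
        rankings = [r(query, documents) for r in rankers]
        multileaved = combine_lists(rankings)[:k]
        clicks = display(multileaved)
        for i in range(n):
            for j in range(n):
                P[i, j] += infer(i, j, clicks, multileaved)
    return P
```

Any concrete multileaved comparison method, including the PPM method introduced in this chapter, is obtained by plugging in its own combine_lists and infer.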
What we add on top of the work discussed above is a theoretical framework that allows us to assess and compare multileaved comparison methods. In addition, we propose an accurate and scalable multileaved comparison method that is the only one to satisfy the properties specified in our theoretical framework, and that also proves to be the most efficient multileaved comparison method in terms of its much reduced data requirements.

2.3 A Framework for Assessing Multileaved Comparison Methods

Before we introduce a novel multileaved comparison method in Section 2.5, we propose two theoretical requirements for multileaved comparison methods. These theoretical requirements will allow us to assess and compare existing multileaved comparison methods. Specifically, we introduce two theoretical properties: considerateness and fidelity. These properties guarantee correct outcomes in unambiguous cases while always maintaining the user experience. In Section 2.4 we show that no currently available multileaved comparison method satisfies both properties. This motivates the introduction of a method that satisfies both properties in Section 2.5.

2.3.1 Considerateness

Firstly, one of the most important properties of a multileaved comparison method is how considerate it is. Since evaluation is done online, it is important that the search experience is not substantially altered [54, 96]. In other words, users should not be obstructed in performing their search tasks during evaluation. As maintaining a user base is at the core of any search engine, methods that potentially degrade the user experience are generally avoided. Therefore, we set the following requirement: the displayed multileaved result list should never show a document d at a rank i if every ranker in R places it at a lower rank. Writing r(d, l_j) for the rank of d in the ranking l_j produced by ranker r_j, this boils down to:

$$ m_i = d \;\rightarrow\; \exists r_j \in R, \; r(d, l_j) \leq i. \tag{2.1} $$

Requirement (2.1) guarantees that a document can never be displayed higher in a multileaved result list than any ranker would display it. In addition, it guarantees that if all rankers agree on the top n documents, the resulting multileaved result list m will display the same top n.
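Requirement (2.1) is straightforward to state as a check; a small sketch, assuming all rankings cover the same document set:

```python
def is_considerate(multileaved, rankings):
    """Check Requirement (2.1): a document may appear at rank i (1-based)
    only if at least one input ranking places it at rank i or higher."""
    for i, doc in enumerate(multileaved, start=1):
        if not any(ranking.index(doc) + 1 <= i for ranking in rankings):
            return False
    return True
```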
Fidelity was introducedby Hofmann et al. [44] and describes two general cases in which the preference betweentwo rankers is unambiguous. To have fidelity the expected outcome of a method isrequired to be correct in all matching cases. However, the original notion of fidelity only considers two rankers as it was introduced for interleaved comparison methods,therefore the definition of fidelity must be expanded to the multileaved case. First wedescribe the following concepts:
Uncorrelated clicks
Clicks are considered uncorrelated if relevance has no influenceon the likelihood that a document is clicked. We write r ( d i , m ) for the rank of document d i in multileaved result list m and P ( c l | q, m l = d i ) for the probability of a click atthe rank l at which d i is displayed: l = r ( d i , m ) . Then, for a given query q uncorrelated ( q ) ⇔ ∀ l, ∀ d i,j , P ( c l | q, m l = d i ) = P ( c l | q, m l = d j ) . (2.2) Correlated clicks
We consider clicks correlated if there is a positive correlationbetween document relevance and clicks. However we differ from Hofmann et al. [44]by introducing a variable k that denotes at which rank users stop considering documents.Writing P ( c i | rel ( m i , q )) for the probability of a click at rank i if a document relevant19 . Sensitive and Scalable Online Evaluation with Theoretical Guarantees to query q is displayed at this rank, we set correlated ( q, k ) ⇔∀ i ≥ k, P ( c i ) = 0 ∧ ∀ i < k, P ( c i | rel ( m i , q )) > P ( c i | ¬ rel ( m i , q )) . (2.3)Thus under correlated clicks a relevant document is more likely to be clicked than anon-relevant one at the same rank, if they appear above rank k . Pareto domination
Ranker r_i Pareto dominates ranker r_j if all relevant documents are ranked at least as high by r_i as by r_j, and r_i ranks at least one relevant document higher. Writing rel for the set of relevant documents that are ranked above k by at least one ranker, i.e., rel = {d | rel(d, q) ∧ ∃r_n ∈ R, r(d, l_n) ≤ k}, we require that the following holds for every query q and any rank k:

  Pareto(r_i, r_j, q, k) ⇔ ∀d ∈ rel, r(d, l_i) ≤ r(d, l_j) ∧ ∃d ∈ rel, r(d, l_i) < r(d, l_j).   (2.4)

Then, fidelity for multileaved comparison methods is defined by the following two requirements:

1. Under uncorrelated clicks the expected outcome may find no preferences between any two rankers in R:

  ∀q, ∀(r_i, r_j) ∈ R, uncorrelated(q) ⇒ E[P_ij | q] = 0.   (2.5)

2. Under correlated clicks, a ranker that Pareto dominates all other rankers must win the multileaved comparison in expectation:

  ∀k, ∀q, ∀r_i ∈ R, (correlated(q, k) ∧ ∀r_j ∈ R, i ≠ j → Pareto(r_i, r_j, q, k)) ⇒ (∀r_j ∈ R, i ≠ j → E[P_ij | q] > 0).   (2.6)

Note that for the case where |R| = 2 and only k = |D| is considered, these requirements are the same as for interleaved comparison methods [44]. The k parameter was added to allow for fidelity in considerate methods, since it is impossible to detect preferences at ranks that users never consider without breaking the considerateness requirement. We argue that differences at ranks that users are not expected to observe should not affect comparison outcomes. Fidelity is important for a multileaved comparison method as it ensures that an unambiguous winner is expected to be identified. Additionally, the first requirement ensures that, in expectation, no preferences are inferred when clicks are unaffected by relevancy.
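The Pareto-domination condition of Equation 2.4 translates directly into code. The sketch below is illustrative only and simplifies the rel set to the two rankings being compared (the full definition quantifies over all rankers in R); ranks are 1-based and documents missing from a list are treated as ranked last:

def pareto_dominates(l_i, l_j, relevant, k):
    """Eq. 2.4: l_i Pareto dominates l_j w.r.t. relevant documents that
    at least one of the two lists ranks at or above cutoff k."""
    rank = lambda d, l: l.index(d) + 1 if d in l else len(l) + 1
    rel_k = [d for d in relevant if min(rank(d, l_i), rank(d, l_j)) <= k]
    no_worse = all(rank(d, l_i) <= rank(d, l_j) for d in rel_k)
    strictly_better = any(rank(d, l_i) < rank(d, l_j) for d in rel_k)
    return no_worse and strictly_better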
In addition to the two theoretical properties listed above, considerateness and fidelity, we also scrutinize multileaved comparison methods to determine whether they accurately find preferences between all rankers in R and minimize the number of user impressions required to do so. This empirical property is commonly known as sensitivity [44, 108]. In Section 2.6 we describe experiments that are aimed at comparing the sensitivity of multileaved comparison methods. Here, two aspects of every comparison are considered: the level of error at which a method converges and the number of impressions required to reach that level. Thus, a multileaved comparison method that learns faster initially but does not reach the same final level of error is deemed worse.

2.4. An Assessment of Existing Multileaved Comparison Methods
We briefly examine all existing multileaved comparison methods to determine whether they meet the considerateness and fidelity requirements. An investigation of the empirical sensitivity requirement is postponed until Sections 2.6 and 2.7.
Team-Draft Multileaving (TDM) was introduced by Schuth et al. [108] and is based on the previously proposed Team Draft Interleaving (TDI) [99]. Both methods are inspired by how team assignments are often chosen for friendly sports matches. The multileaved result list is created by sequentially sampling rankers without replacement; the first sampled ranker places its top document at the first position of the multileaved list. Subsequently, the next sampled ranker adds its top pick of the remaining documents. When all rankers have been sampled, the process continues by sampling from the entire set of rankers again. The method stops when all documents have been added. When a document is clicked, TDM assigns the click to the ranker that contributed the document. For each impression, binary preferences are inferred by comparing the number of clicks each ranker received.

It is clear that TDM is considerate, since each added document is the top pick of at least one ranker. However, TDM does not meet the fidelity requirements. This is unsurprising, as previous work has proven that TDI does not meet these requirements [41, 44, 96]. Since TDI is identical to TDM when the number of rankers is |R| = 2, TDM does not have fidelity either.
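To make the construction concrete, here is a minimal sketch of the team-draft procedure (a hedged illustration: the helper name team_draft_multileave is ours, and we assume every ranking is a permutation of the same document set):

import random

def team_draft_multileave(lists, rng=random):
    """TDM construction: rankers are drafted in random order; each drafted
    ranker contributes its highest-ranked document that is not yet placed."""
    length = len(lists[0])
    m, used = [], set()
    while len(m) < length:
        order = list(range(len(lists)))
        rng.shuffle(order)                 # start a new draft round
        for r in order:
            if len(m) == length:
                break
            top = next(d for d in lists[r] if d not in used)
            m.append(top)
            used.add(top)
    return m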
Optimized Multileaving (OM) was proposed by Schuth et al. [108] and serves as an extension of Optimized Interleaving (OI) introduced by Radlinski and Craswell [96]. The allowed multileaved result lists of OM are created by sampling rankers with replacement at each iteration and adding the top document of the sampled ranker. However, the probability that a multileaved result list is shown is not determined by this generative process. Instead, for a chosen credit function OM performs an optimization that computes a probability for each multileaved result list, so that the expected outcome is unbiased and sensitive to correct preferences.

All of the allowed multileaved result lists of OM meet the considerateness requirement, and in theory instantiations of OM could have fidelity. However, in practice OM does not meet the fidelity requirements. There are two main reasons for this. First, it is not guaranteed that a solution exists for the optimization that OM performs. For the interleaving case this was proven empirically when k = 10 [96]; however, this approach does not scale to any number of rankers. Second, unlike OI, OM allows more result lists than can be computed in a feasible amount of time. Consider the top k of all possible multileaved result lists; in the worst case this produces |R|^k lists. Computing all lists for a large value of |R| and performing linear constraint optimization over them is simply not feasible. As a solution, Schuth et al. [108] propose a method that samples from the allowed multileaved result lists and relaxes constraints when there is no exact solution. Consequently, there is no guarantee that this method does not introduce bias. Together, these two reasons show that the fidelity of OI does not imply fidelity of OM. They also show that OM is computationally very costly.

Probabilistic Multileaving (PM) [109] is an extension of Probabilistic Interleaving (PI) [41], which was designed to solve the flaws of TDI. Unlike the previous methods, PM considers every ranker as a distribution over documents, which is created by applying a soft-max to each of them. A multileaved result list is created by sampling a ranker with replacement at each iteration and sampling a document from the selected ranker. After the sampled document has been added, all rankers are renormalized to account for the removed document. During inference PM credits every ranker with the expected number of clicked documents that were assigned to it. This is done by marginalizing over the possible ways the list could have been constructed by PM. A benefit of this approach is that it allows for comparisons on historical data [41, 44].

A big disadvantage of PM is that it allows any possible ranking to be shown, albeit not with uniform probabilities. This is a big deterrent for the usage of PM in operational settings. Furthermore, it also means that PM does not meet the considerateness requirement. On the other hand, PM does meet the fidelity requirements; the proof follows from the fact that every ranker is equally likely to add a document at each location in the ranking. Moreover, if multiple rankers want to place the same document somewhere, they have to share the resulting credits. Similar to OM, PM becomes infeasible to compute for a large number of rankers |R|; the number of assignments in the worst case is |R|^k. Fortunately, PM inference can be estimated by sampling assignments in a way that maintains fidelity [90, 109].
Sample-Scored-Only Multileaving (SOSM) was introduced by Brost et al. [12] in an attempt to create a more scalable multileaved comparison method. It is the only existing multileaved comparison method that does not have an interleaved comparison counterpart. SOSM attempts to increase sensitivity by ignoring all non-sampled documents during inference. Thus, at each impression a ranker receives credits according to how it ranks the documents that were sampled for the displayed multileaved result list of size k. The preferences at each impression are made binary before being added to the mean. (Notably, Brost et al. [12] proved that if the preferences at each impression are made binary, the fidelity of PM is lost.)

Table 2.1: Overview of multileaved comparison methods and whether they meet the considerateness and fidelity requirements.
        Considerateness   Fidelity   Source
TDM     ✓                 –          [108]
OM      ✓                 –          [108]
PM      –                 ✓          [109]
SOSM    ✓                 –          [12]
PPM     ✓                 ✓          this chapter
SOSM creates multileaved result lists following the same procedure as TDM, a choice that seems arbitrary. SOSM meets the considerateness requirement for the same reason TDM does. However, SOSM does not meet the fidelity requirements. We can prove this by providing an example where preferences are found under uncorrelated clicks. Consider the two documents A and B and the three rankers with the following rankings:

  l_1 = AB,  l_2 = l_3 = BA.

The first requirement of fidelity states that under uncorrelated clicks no preferences may be found in expectation. Uncorrelated clicks are unconditioned on document relevance (Equation 2.2); however, it is possible that they display position bias [134]. Thus the probability of a click at the first rank may be greater than at the second:

  P(c_1 | q) > P(c_2 | q).

Under position-biased clicks the expected outcome for each possible multileaved result list is not zero. For instance, the following preferences are expected:

  E[P_12 | m = AB] > 0,
  E[P_12 | m = BA] < 0,
  E[P_12 | m = AB] = −E[P_12 | m = BA].

Since SOSM creates multileaved result lists following the TDM procedure, the probability P(m = BA) is twice as high as P(m = AB). As a consequence, the expected preference is biased against the first ranker:

  E[P_12] < 0.

Hence, SOSM does not have fidelity. This outcome seems to stem from a disconnect between how multileaved result lists are created and how preferences are inferred.
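As a quick numerical check of this counter-example: with only two documents, the team-draft procedure lets the first drafted ranker fix the entire list, so the creation probabilities can be estimated directly (a self-contained sketch under that simplification):

import random
from collections import Counter

def first_draft_outcome(lists, rng=random):
    # with only two documents, the first drafted ranker determines the list
    top = lists[rng.randrange(len(lists))][0]
    other = next(d for d in lists[0] if d != top)
    return (top, other)

random.seed(1)
lists = [('A', 'B'), ('B', 'A'), ('B', 'A')]    # l_1 = AB, l_2 = l_3 = BA
counts = Counter(first_draft_outcome(lists) for _ in range(30000))
print(counts[('B', 'A')] / counts[('A', 'B')])  # approx. 2: P(m=BA) = 2 P(m=AB)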
To conclude this section, Table 2.1 provides an overview of our findings thus far, i.e., the theoretical requirements that each multileaved comparison method satisfies; we have also included PPM, the multileaved comparison method that we will introduce below.

2.5. A Novel Multileaved Comparison Method
The previously described multileaved comparison methods are based around direct credit assignment, i.e., credit functions are based on single documents. In contrast, we introduce a method that estimates differences based on pairwise document preferences. We prove that this novel method is the only multileaved comparison method that meets the considerateness and fidelity requirements set out in Section 2.3.

The multileaved comparison method that we introduce is Pairwise Preference Multileaving (PPM). It infers pairwise preferences between documents from clicks and bases comparisons on the agreement of rankers with the inferred preferences. PPM is based on the assumption that a clicked document is preferred to: (i) all of the unclicked documents above it; and (ii) the next unclicked document. These assumptions are long-established [55] and form the basis of pairwise Learning to Rank (LTR) [54]. We write c_{r(d_i, m)} for a click on document d_i displayed in multileaved result list m at rank r(d_i, m). For a document pair (d_i, d_j), a click c_{r(d_i, m)} infers a preference as follows:

  c_{r(d_i, m)} ∧ ¬c_{r(d_j, m)} ∧ ((∃x, (c_x ∧ r(d_j, m) < x)) ∨ c_{r(d_j, m)−1}) ⇔ d_i >_c d_j.   (2.7)

In addition, the preference of a ranker r is denoted by d_i >_r d_j. Pairwise preferences also form the basis for Preference-Based Balanced Interleaving (PBI) introduced by He et al. [38]. However, previous work has shown that PBI does not meet the fidelity requirements [44]. Therefore, we do not use PBI as a starting point for PPM. Instead, PPM is derived directly from the considerateness and fidelity requirements. Consequently, PPM constructs multileaved result lists inherently differently, and its inference method has fidelity, in contrast with PBI.

When constructing a multileaved result list m we want to be able to infer unbiased preferences while simultaneously being considerate. Thus, with the requirement for considerateness in mind, we define a choice set as:

  Ω(i, R, D) = {d | d ∈ D ∧ ∃r_j ∈ R, r(d, l_j) ≤ i}.   (2.8)

This definition is chosen so that any document in Ω(i, R, D) can be placed at rank i without breaking the considerateness requirement (Equation 2.1). The multileaving method of PPM is described in Algorithm 2.2:

Algorithm 2.2 Multileaved result list construction for PPM.
1: Input: set of rankers R, rankings {l_1, ..., l_{|R|}}, documents D.
2: m ← []                      // initialize empty multileaving
3: for n = 1, ..., |D| do
4:   Ω̂_n ← Ω(n, R, D) \ m     // choice set of remaining documents
5:   d ← uniform_sample(Ω̂_n)  // uniformly sample next document
6:   m ← append(m, d)          // add sampled document to multileaving
7: return m

The approach is straightforward: at each rank n the set of documents Ω̂_n is determined (Line 4). This set is Ω(n, R, D) with the previously added documents removed, to avoid document repetition. Then, the next document is sampled uniformly from Ω̂_n (Line 5); thus every document in Ω̂_n has a probability

  1 / (|Ω(n, R, D)| − n + 1)   (2.9)

of being placed at position n (Line 6). Since Ω̂_n ⊆ Ω(n, R, D), the resulting m is guaranteed to be considerate.
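The construction translates almost directly into Python. The sketch below is illustrative (1-based ranks, and every document is assumed to appear in every ranking); note that |Ω̂_n| equals |Ω(n, R, D)| − n + 1, matching the denominator of Equation 2.9:

import random

def choice_set(i, lists):
    """Omega(i, R, D) of Eq. 2.8: documents some ranker places at rank <= i."""
    return {d for l in lists for d in l[:i]}

def ppm_multileave(lists, rng=random):
    """Algorithm 2.2: uniformly sample each position from the choice set
    of rank n, minus the documents that were already placed."""
    n_docs = len(lists[0])
    m = []
    for n in range(1, n_docs + 1):
        candidates = sorted(choice_set(n, lists) - set(m))
        m.append(rng.choice(candidates))  # uniform over |Omega(n)| - n + 1 docs
    return m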
While the multileaved result list creation method used by PPM is simple, its preference inference method is more complicated, as it has to meet the fidelity requirements. First, the preference found between rankers r_n and r_m from a single interaction c is determined by:

  P_nm = Σ_{d_i >_c d_j} φ(d_i, d_j, r_n, m, R) − φ(d_i, d_j, r_m, m, R),   (2.10)

which sums over all document pairs (d_i, d_j) for which interaction c inferred a preference. Before the scoring function φ can be defined, we introduce the following function:

  r̄(i, j, R) = max_{d ∈ {d_i, d_j}} min_{r_n ∈ R} r(d, l_n).   (2.11)

For succinctness we write r̄(i, j) = r̄(i, j, R). Here, r̄(i, j) provides the highest rank at which both documents d_i and d_j can appear in m. Position r̄(i, j) is important to the document pair (d_i, d_j), since if both documents are in the remaining documents Ω̂_{r̄(i, j)}, then the rest of the multileaved result list creation process is identical for both. To keep notation short we introduce:

  r̄(i, j, m) = min_{d ∈ {d_i, d_j}} r(d, m).   (2.12)

Therefore, if r̄(i, j, m) ≥ r̄(i, j) then both documents appear below r̄(i, j). This, in turn, means that both documents are equally likely to appear at any rank:

  ∀n, P(m_n = d_i | r̄(i, j, m) ≥ r̄(i, j)) = P(m_n = d_j | r̄(i, j, m) ≥ r̄(i, j)).   (2.13)

The scoring function φ is then defined as follows:

  φ(d_i, d_j, r, m) =
    0                                  if r̄(i, j, m) < r̄(i, j),
    −P(r̄(i, j, m) ≥ r̄(i, j))^{−1}     if d_i <_r d_j,
    +P(r̄(i, j, m) ≥ r̄(i, j))^{−1}     if d_i >_r d_j,   (2.14)

indicating that a zero score is given if one of the documents appears above r̄(i, j). Otherwise, the value of φ is positive or negative depending on whether the ranker r agrees with the inferred preference between d_i and d_j. Furthermore, this score is inversely weighted by the probability P(r̄(i, j, m) ≥ r̄(i, j)). Therefore, pairs that are less likely to appear below their threshold r̄(i, j) result in a higher score than more commonly occurring pairs. Algorithm 2.3 displays how the inference of PPM can be computed:

Algorithm 2.3 Preference inference for PPM.
1:  Input: rankers R, rankings {l_1, ..., l_{|R|}}, documents D, multileaved result list m, clicks c.
2:  P ← 0                                       // preference matrix of |R| × |R|
3:  for (d_i, d_j) ∈ {(d_i, d_j) | d_i >_c d_j} do
4:    if r̄(i, j, m) ≥ r̄(i, j) then
5:      w ← 1                                   // variable to store P(r̄(i, j, m) ≥ r̄(i, j))
6:      min_x ← min_{d ∈ {d_i, d_j}} min_{r_n ∈ R} r(d, l_n)
7:      for x = min_x, ..., r̄(i, j) − 1 do
8:        w ← w · (1 − (|Ω(x, R, D)| − x + 1)^{−1})
9:      for n = 1, ..., |R| do
10:       for m = 1, ..., |R| do
11:         if d_i >_{r_n} d_j ∧ n ≠ m then
12:           P_nm ← P_nm + w^{−1}              // result of scoring function φ
13:         else if n ≠ m then
14:           P_nm ← P_nm − w^{−1}
15: return P
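The probability P(r̄(i, j, m) ≥ r̄(i, j)) that Algorithm 2.3 accumulates in w has a simple closed form: at every rank x between the first rank at which one of the two documents becomes available (min_x) and r̄(i, j), exactly one document of the pair is in the choice set, and it avoids being drawn with probability 1 − 1/(|Ω(x, R, D)| − x + 1). A hedged sketch (the helper name and the omega_sizes argument, a mapping from rank x to |Ω(x, R, D)|, are ours):

def pair_weight(min_x, r_bar, omega_sizes):
    """P(rbar(i,j,m) >= rbar(i,j)) as accumulated by Algorithm 2.3.
    omega_sizes[x] must give |Omega(x, R, D)| for 1-based ranks x."""
    w = 1.0
    for x in range(min_x, r_bar):
        w *= 1.0 - 1.0 / (omega_sizes[x] - x + 1)
    return w

# phi then credits agreeing rankers with +1/w and disagreeing ones with -1/w.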
The scoring function φ was carefully chosen to guarantee fidelity; the remainder of this section sketches the proof that PPM meets its requirements. The two requirements for fidelity are discussed in order.

Requirement 1

The first fidelity requirement states that under uncorrelated clicks the expected outcome should be zero. Consider the expected preference:

  E[P_nm] = Σ_{d_i, d_j} Σ_m P(d_i >_c d_j | m) P(m) (φ(d_i, d_j, r_n, m) − φ(d_i, d_j, r_m, m)).   (2.15)

To see that E[P_nm] = 0 under uncorrelated clicks, take any multileaving m where P(m) > 0 and φ(d_i, d_j, r, m) ≠ 0, with m_x = d_i and m_y = d_j. Then there is always a multileaved result list m′ that is identical except for swapping the two documents, so that m′_x = d_j and m′_y = d_i. The scoring function only gives non-zero values if both documents appear below the threshold, i.e., r̄(i, j, m) ≥ r̄(i, j) (Equation 2.14). At this point the probability of each document appearing at any position is the same (Equation 2.13), thus the following holds:

  P(m) = P(m′),   (2.16)
  φ(d_i, d_j, r_n, m) = −φ(d_j, d_i, r_n, m′).   (2.17)

Finally, from the definition of uncorrelated clicks (Equation 2.2) the following holds:

  P(d_i >_c d_j | m) = P(d_j >_c d_i | m′).   (2.18)

As a result, the contribution of any document pair (d_i, d_j) and multileaving m to the expected outcome is cancelled by the multileaving m′. Therefore, we can conclude that E[P_nm] = 0 under uncorrelated clicks, and that PPM meets the first requirement of fidelity.

Requirement 2
The second fidelity requirement states that under correlated clicks a ranker that Pareto dominates all other rankers should win the multileaved comparison. Therefore, the expected value for a Pareto dominating ranker r_n should be:

  ∀m, n ≠ m → E[P_nm] > 0.   (2.19)

Take any other ranker r_m, which is thus Pareto dominated by r_n. The proof for the first requirement shows that E[P_nm] is not affected by any pair of documents d_i, d_j with the same relevance label. Furthermore, any pair on which r_n and r_m agree will not affect the expected outcome, since:

  (d_i >_{r_n} d_j ↔ d_i >_{r_m} d_j) ⇒ φ(d_i, d_j, r_n, m) − φ(d_i, d_j, r_m, m) = 0.   (2.20)

Then, for any relevant document d_i, consider the set of documents that r_n incorrectly prefers over d_i:

  A = {d_j | ¬rel(d_j) ∧ d_j >_{r_n} d_i}   (2.21)

and the set of documents that r_m incorrectly prefers over d_i and places higher than where r_n places d_i:

  B = {d_j | ¬rel(d_j) ∧ d_j >_{r_m} d_i ∧ r(d_j, l_m) < r(d_i, l_n)}.   (2.22)

Since r_n Pareto dominates r_m, it has the same or fewer incorrect preferences: |A| ≤ |B|. Furthermore, for any document d_j in either A or B the threshold of the pair d_i, d_j is the same:

  ∀d_j ∈ A ∪ B, r̄(i, j) = r(d_i, l_n).   (2.23)

Therefore, all pairs with documents from A and B will only get a non-zero value from φ if both documents appear at or below r(d_i, l_n). Then, using Equation 2.13 and Bayes' rule we see:

  ∀(d_j, d_l) ∈ A ∪ B,
  P(m_x = d_j, r̄(i, j, m) ≥ r̄(i, j, R)) / P(r̄(i, j, m) ≥ r̄(i, j, R)) = P(m_x = d_l, r̄(i, l, m) ≥ r̄(i, l, R)) / P(r̄(i, l, m) ≥ r̄(i, l, R)).   (2.24)

Similarly, the reweighing of φ ensures that every pair in A and B contributes the same to the expected outcome. Thus, if both rankers rank d_i at the same position, the sum:

  Σ_{d_j ∈ A ∪ B} Σ_m P(m) · [P(d_i >_c d_j | m)(φ(d_i, d_j, r_n, m) − φ(d_i, d_j, r_m, m)) + P(d_j >_c d_i | m)(φ(d_j, d_i, r_n, m) − φ(d_j, d_i, r_m, m))]   (2.25)

will be zero if |A| = |B| and positive if |A| < |B| under correlated clicks. Moreover, since r_n Pareto dominates r_m, there will be at least one document pair where:

  ∃d_i, ∃d_j, rel(d_i) ∧ ¬rel(d_j) ∧ r(d_i, l_n) = r(d_j, l_m).   (2.26)

This means that the expected outcome (Equation 2.15) will always be positive under correlated clicks, i.e., E[P_nm] > 0, for a Pareto dominating ranker r_n and any other ranker r_m.

In summary, we have introduced a new multileaved comparison method, PPM. Furthermore, we answered RQ2.1 in the affirmative, since we have shown it to be considerate and to have fidelity. We further note that PPM has polynomial complexity: to calculate P(r̄(i, j, m) ≥ r̄(i, j)), only the sizes of the choice sets Ω and the first positions at which d_i and d_j occur in Ω have to be known.

2.6. Experiments

In order to answer Research Question
RQ2.2, posed in Section 2.1, several experiments were performed to evaluate the sensitivity of PPM. The methodology of our evaluation follows previous work on interleaved and multileaved comparison methods [12, 41, 44, 108, 109] and is completely reproducible.
In order to make fair comparisons between rankers, we use the LTR datasets described in Section 2.6.2 below. From the feature representations in these datasets a handpicked set of features was taken and used as ranking models. To match the real-world scenario as closely as possible, this selection consists of features that are known to perform well as relevance signals independently. This selection includes, but is not limited to: BM25, LMIR.JM, Sitemap, PageRank, HITS and TF.IDF [108].

The ground-truth comparisons between the rankers are based on their NDCG scores computed on a held-out test set, resulting in a binary preference matrix P_nm for all ranker pairs (r_n, r_m):

  P_nm = NDCG(r_n) − NDCG(r_m).   (2.27)

The metric by which multileaved comparison methods are compared is the binary error, E_bin [12, 108, 109]. Let P̂_nm be the preference inferred by a multileaved comparison method; then the error is:

  E_bin = ( Σ_{n,m ∈ R, n ≠ m} 1[ sgn(P̂_nm) ≠ sgn(P_nm) ] ) / ( |R| · (|R| − 1) ).   (2.28)
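For reference, Equation 2.28 amounts to a few lines of NumPy (a sketch; P_hat and P_true are the |R| × |R| inferred and ground-truth matrices):

import numpy as np

def binary_error(P_hat, P_true):
    """E_bin of Eq. 2.28: fraction of ordered ranker pairs (n, m), n != m,
    whose inferred preference sign disagrees with the ground truth."""
    n = P_true.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    disagree = np.sign(P_hat) != np.sign(P_true)
    return np.sum(disagree & off_diag) / (n * (n - 1))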
Our experiments are performed over ten publicly available LTR datasets with varying sizes, representing different search tasks. Each dataset consists of a set of queries and a set of corresponding documents for every query. While queries are represented only by their identifiers, feature representations and relevance labels are available for every document-query pair. Relevance labels are graded differently by the datasets depending on the task they model; for instance, navigational datasets have binary labels for not relevant (0) and relevant (1), whereas most informational tasks have labels ranging from not relevant (0) to perfect relevancy (4). Every dataset consists of five folds, each dividing the dataset into different training, validation and test partitions.

The first publicly available LTR datasets were distributed as LETOR 3.0 and 4.0 [76]; they use representations of 45, 46, or 64 features encoding ranking models such as TF.IDF, BM25, Language Modelling, PageRank, and HITS on different parts of the documents. The datasets in LETOR are divided by their tasks, most of which come from the TREC Web Tracks between 2003 and 2008 [23, 24]. HP2003, HP2004, NP2003, NP2004, TD2003 and
TD2004 each contain between 50 and 150 queries and 1,000judged documents per query and use binary relevance labels. Due to their similaritywe report average results over these six datasets noted as
LETOR 3.0. The
OHSUMED dataset is based on the query log of the search engine on the MedLine abstract database,and contains 106 queries. The last two datasets,
MQ2007 and
MQ2008, were based on the Million Query Track [8] and consist of 1,700 and 800 queries, respectively, but have far fewer assessed documents per query. The
MSLR-WEB10K dataset [95] consists of 10,000 queries obtained from a retired labelling set of a commercial web search engine. The dataset uses 136 features to represent its documents; each query has around 125 assessed documents.

Finally, we note that there are more LTR datasets publicly available [17, 27], but there is no public information about their feature representations. Therefore, they are unfit for our evaluation, as no selection of well-performing ranking features can be made.
While experiments using real users are preferred [18, 21, 63, 133], most researchers do not have access to search engines. As a result, the most common way of comparing online evaluation methods is by using simulated user behaviour [12, 41, 44, 108, 109]. Such simulated experiments show the performance of multileaved comparison methods when user behaviour adheres to a few simple assumptions. Our experiments follow the precedent set by previous work on online evaluation:
First, a user issues a query, simulated by uniformly sampling a query from the static dataset. Subsequently, the multileaved comparison method constructs the multileaved result list of documents to display. The behavior of the user after receiving this list is simulated using a cascade click model [20, 36]. This model assumes a user examines documents in their displayed order. For each document that is considered, the user decides whether it warrants a click, which is modeled as the conditional probability P(click = 1 | R), where R is the relevance label provided by the dataset. Accordingly, cascade click model instantiations increase the probability of a click with the degree of the relevance label. After the user has clicked on a document, their information need may be satisfied; otherwise they continue considering the remaining documents. The probability of the user not examining more documents after clicking is modeled as P(stop = 1 | R), where it is more likely that the user is satisfied by a very relevant document. At each impression we display k = 10 documents to the user.

Table 2.2: Instantiations of Cascading Click Models [36] as used for simulating user behaviour in experiments.

                 P(click = 1 | R)                  P(stop = 1 | R)
R                0     1     2     3     4         0     1     2     3     4
perfect          0.0   0.2   0.4   0.8   1.0       0.0   0.0   0.0   0.0   0.0
navigational     0.05  0.3   0.5   0.7   0.95      0.2   0.3   0.5   0.7   0.9
informational    0.4   0.6   0.7   0.8   0.9       0.1   0.2   0.3   0.4   0.5

Table 2.2 lists the three instantiations of cascade click models that we use for this chapter. The first models a perfect user who considers every document and clicks on all relevant documents and nothing else. Second, the navigational instantiation models a user performing a navigational task who is mostly looking for a single highly relevant document. Finally, the informational instantiation models a user without a very specific information need who typically clicks on multiple documents. These three models have increasing levels of noise, as the behavior of each depends less on the relevance labels of the displayed documents.

Each experimental run consists of applying a multileaved comparison method to a sequence of T = 10,000 simulated user impressions. To see the effect of the number of rankers in a comparison, our runs consider |R| = 5, |R| = 15, and |R| = 40. However, only the MSLR dataset contains |R| = 40 rankers. Every run is repeated for every click model to see how different behaviours affect performance. For statistical significance every run is repeated 25 times per fold, which means that 125 runs are conducted for every dataset and click model pair. Since our evaluation covers five multileaved comparison methods, we generate over 393 million impressions in total. We test for statistically significant differences using a two-tailed t-test. Note that the results reported on the LETOR 3.0 data are averaged over six datasets and thus span 750 runs per datapoint.

The parameters of the baselines are selected based on previous work on the same datasets: for OM the sample size η = 10 was chosen, as reported by Schuth et al. [108]; for PM the degree τ = 3.0 was chosen according to Hofmann et al. [41], and the sample size η = 10,000 in accordance with Schuth et al. [109].
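Putting the pieces of this section together, the simulated interaction per impression can be sketched as follows. This is an illustrative reimplementation (not the thesis code), with the click and stop probabilities taken from Table 2.2 and relevance grades 0-4:

import random

CLICK_MODELS = {
    'perfect':       {'click': [0.0, 0.2, 0.4, 0.8, 1.0],
                      'stop':  [0.0, 0.0, 0.0, 0.0, 0.0]},
    'navigational':  {'click': [0.05, 0.3, 0.5, 0.7, 0.95],
                      'stop':  [0.2, 0.3, 0.5, 0.7, 0.9]},
    'informational': {'click': [0.4, 0.6, 0.7, 0.8, 0.9],
                      'stop':  [0.1, 0.2, 0.3, 0.4, 0.5]},
}

def simulate_impression(relevance_labels, model='navigational', k=10):
    """Cascade click model: examine the top-k documents in order, click with
    P(click=1|R), and stop examining with P(stop=1|R) after a click."""
    probs = CLICK_MODELS[model]
    clicked_ranks = []
    for rank, rel in enumerate(relevance_labels[:k]):
        if random.random() < probs['click'][rel]:
            clicked_ranks.append(rank)
            if random.random() < probs['stop'][rel]:
                break
    return clicked_ranks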
2.7. Results and Analysis

We answer Research Question RQ2.2 by evaluating the sensitivity of PPM based on the results of the experiments detailed in Section 2.6.

The results of the experiments with the smaller number of rankers (|R| = 5) are displayed in Table 2.3.
Table 2.3: The binary error E_bin of all multileaved comparison methods after 10,000 impressions on comparisons of |R| = 5 rankers. Average per dataset and click model; standard deviation in brackets. The best performance per click model and dataset is noted in bold; statistically significant improvements of PPM over each baseline are noted by ▲ (p < 0.01) and △ (p < 0.05), losses by ▼ and ▽ respectively, and ◦ for no difference.
Here we see that after 10,000 impressions PPM has a significantly lower error on many datasets and at all levels of interaction noise. Furthermore, for |R| = 5 there are no significant losses in performance under any circumstances.

When |R| = 15, as displayed in Table 2.4, we see a single case where PPM performs worse than a previous method: on MQ2007 under the perfect click model, SOSM performs significantly better than PPM. However, on the same dataset PPM performs significantly better under the informational click model. Furthermore, there are more significant improvements for |R| = 15 than for the smaller |R| = 5.

Finally, when the number of rankers in the comparison is increased to |R| = 40, as displayed in Table 2.5, PPM still provides significant improvements.

We conclude that PPM, in the experimental conditions that we considered, provides a performance that is at least as good as that of any existing method. Moreover, PPM is robust to noise, as we see more significant improvements under click models with increased noise. Furthermore, since improvements are found with the number of rankers |R| varying from 5 to 40, we conclude that PPM is scalable in the comparison size.

Table 2.4: The binary error E_bin after 10,000 impressions on comparisons of |R| = 15 rankers. Notation is identical to Table 2.3.
Additionally, the dataset type seems to affect the relative performance of the methods. For instance, on LETOR 3.0 few significant differences are found, whereas the
MSLR dataset displaysthe most significant improvements. This suggests that on more artificial data, i.e., thesmaller datasets simulating navigational tasks, the differences are fewer, while on theother hand on large commercial data the preference for PPM increases further. Lastly,Figure 2.1 displays the binary error of all multileaved comparison methods on the
MSLR dataset over 10,000 impressions. Under the perfect click model we see that all of theprevious methods display converging behavior around 3,000 impressions. In contrast,the error of PPM continues to drop throughout the experiment. The fact that the existingmethods converge at a certain level of error in the absence of click-noise is indicativethat they are lacking in sensitivity .Overall, our results show that PPM reaches a lower level of error than previousmethods seem to be capable of. This feat can be observed on a diverse set of datasets,various levels of interaction noise and for different comparison sizes. To answerResearch Question
RQ2.2: from our results we conclude that PPM is more sensitive than any existing multileaved comparison method.
Table 2.5: The binary error E_bin of all multileaved comparison methods after 10,000 impressions on comparisons of |R| = 40 rankers. Averaged over MSLR-WEB10k; notation is identical to Table 2.3.
2.8. Conclusion

In this chapter we have examined multileaved comparison methods for evaluating ranking models online.

We have presented a new multileaved comparison method, Pairwise Preference Multileaving (PPM), that is more sensitive to user preferences than existing methods. Additionally, we have proposed a theoretical framework for assessing multileaved comparison methods, with considerateness and fidelity as the two key requirements. We have shown that no method published prior to PPM has fidelity without lacking considerateness. In other words, prior to PPM no multileaved comparison method has been able to infer correct preferences without degrading the search experience of the user. In contrast, we prove that PPM has both considerateness and fidelity; thus it is guaranteed to correctly identify a Pareto dominating ranker without considerably altering the search experience. Furthermore, our experimental results spanning ten datasets show that PPM is more sensitive than existing methods, meaning that it can reach a lower level of error than any previous method. Moreover, our experiments show that the most significant improvements are obtained on the more complex datasets, i.e., larger datasets with more grades of relevance. Additionally, similar improvements are observed under different levels of noise and numbers of rankers in the comparison, indicating that PPM is robust to interaction noise and scalable to large comparisons. As an extra benefit, the computational complexity of PPM is polynomial and, unlike previous methods, does not depend on sampling or approximations.

With these findings we can answer the thesis research question
RQ1 positively: with the introduction of our novel Pairwise Preference Multileaving (PPM) method, the effectiveness of online evaluation scales to large comparisons.

The theoretical framework that we have introduced allows future research into multileaved comparison methods to guarantee improvements that generalize better than empirical results alone. In turn, properties like considerateness can further stimulate the adoption of multileaved comparison methods in production environments; future work with real-world users may yield further insights into the effectiveness of the multileaving paradigm. Rich interaction data enables the introduction of multileaved comparison methods that consider more than just clicks, as has been done for interleaving methods [63]. These methods could be extended to consider other signals such as dwell time or the order of clicks in an impression.

Furthermore, the field of Online Learning to Rank (OLTR) has depended on online evaluation from its inception [132]. The introduction of multileaving and subsequent novel multileaved comparison methods brought substantial improvements to both fields [90, 111]. Similarly, PPM and any future extensions are likely to benefit the OLTR field too.

Finally, while the theoretical and empirical improvements of PPM are convincing, future work should investigate whether the sensitivity can be made even stronger. For instance, it is possible to have clicks from which no preferences between rankers can be inferred. Can we devise a method that avoids such situations as much as possible without introducing any form of bias, thus increasing the sensitivity even further while maintaining theoretical guarantees?

In Chapter 7 we will take another look at online ranker evaluation and contrast it with counterfactual evaluation. We will see that existing interleaving methods (and by extension some multileaving methods) are biased w.r.t. the definition of position bias common in counterfactual evaluation. The novel method introduced in Chapter 7 combines aspects of counterfactual and online ranker evaluation, creating a method with strong theoretical guarantees while also being very effective.

Furthermore, similar to this chapter, Chapter 3 will look at whether a pairwise LTR method is suitable for online LTR. While different from PPM, the method introduced in Chapter 3 also infers pairwise preferences between documents, and weights inferred preferences to account for position bias.

Figure 2.1: The binary error of different multileaved comparison methods on comparisons of |R| = 15 rankers on the MSLR-WEB10k dataset, for the perfect, navigational and informational click models (binary error E_bin over 10,000 impressions).
Notation    Description
q           a user-issued query
T           the total number of interactions
r_i         an individual ranker, a.k.a. a single ranking system or ranking model
R           a set of rankers to compare
l_i         a ranking generated by ranker r_i
m           a multileaved result list
k           the length of the multileaved result lists
c           a vector indicating clicks on a displayed multileaved result list
P           a preference matrix to store inferred preferences between rankers
r(d, l_i)   the rank at which ranker r_i places document d

Differentiable Online Learning to Rank

This chapter was published as [82]. Appendix 3.A gives a reference for the notation used in this chapter.
Online Learning to Rank (OLTR) methods optimize rankers based on direct interaction with users. State-of-the-art OLTR methods rely on online evaluation and on sampling model variants; they were designed specifically for linear models, and their approach does not extend well to non-linear models such as neural networks.

To address this limitation, this chapter will consider the thesis research question:
RQ2
Is online Learning to Rank (LTR) possible without relying on model-samplingand online evaluation?
We introduce an entirely novel approach to OLTR that constructs a weighted differ-entiable pairwise loss after each interaction: Pairwise Differentiable Gradient De-scent (PDGD). PDGD breaks away from the traditional approach that relies on inter-leaving or multileaving and extensive sampling of models to estimate gradients. Instead,its gradient is based on inferring preferences between document pairs from user clicksand can optimize any differentiable model. We prove that the gradient of PDGD isunbiased w.r.t. user document pair preferences. Our experiments on the largest publiclyavailable LTR datasets show considerable and significant improvements under all levelsof interaction noise. PDGD outperforms existing OLTR methods both in terms of learn-ing speed as well as final convergence. Furthermore, unlike previous OLTR methods,PDGD also allows for non-linear models to be optimized effectively. Our results showthat using a neural network leads to even better performance at convergence than alinear model. In summary, PDGD is an efficient and unbiased OLTR approach thatprovides a better user experience than previously possible.
In order to benefit from unprecedented volumes of content, users rely on ranking systemsto provide them with the content of their liking. LTR in Information Retrieval (IR)concerns methods that optimize ranking models so that they order documents accordingto user preferences. In web search engines such models combine hundreds of signalsto rank web-pages according to their relevance to user queries [75]. Similarly, rankingmodels are a vital part of recommender systems where there is no explicit search intent
[59]. LTR is also prevalent in settings where other content is ranked, e.g., videos [19], products [60], conversations [97] or personal documents [127].

Traditionally, LTR has been applied in the offline setting where a dataset with annotated query-document pairs is available. Here, the model is optimized to rank documents according to the relevance annotations, which are based on the judgements of human annotators. Over time the limitations of this supervised approach have become apparent: annotated sets are expensive and time-consuming to create [17, 76]; when personal documents are involved such a dataset would breach privacy [127]; the relevance of documents to queries can change over time, like in a news search engine [1, 71]; and judgements of raters are not necessarily aligned with the actual users [104].

In order to overcome the issues with annotated datasets, previous work in LTR has looked into learning from user interactions. Work along these lines can be divided into approaches that learn from historical interactions, i.e., in the form of interaction logs [54], and approaches that learn in an online setting [132]. The latter regard methods that determine what to display to the user at each impression, and then immediately learn from observed user interactions and update their behavior accordingly. This online approach has the advantage that it does not require an existing ranker of decent quality, and thus can handle cold-start situations. Additionally, it is more responsive to the user by updating continuously and instantly, therefore allowing for a better experience. However, it is important that an online method can handle biases that come with user behavior: for instance, the observed interactions only take place with the displayed results, i.e., there is item-selection bias, and are more likely to occur with higher ranked items, i.e., there is position bias. Accordingly, a method should learn user preferences w.r.t. document relevance, and be robust to the forms of noise and bias present in the online setting. Overall, the online LTR approach promises to learn ranking models that are in line with user preferences, in a responsive manner, reaching good performance from few interactions, even in cold-start situations.

Despite these highly beneficial properties, previous work in OLTR has only considered linear models [42, 111, 132] or trivial variants thereof [80]. The reason for this is that existing work in OLTR has used the Dueling Bandit Gradient Descent (DBGD) algorithm [132] as a basis. While very influential and effective, we identify two main problems with the gradient estimation of the DBGD algorithm:

1. Gradient estimation is based on sampling model variants from a unit circle around the current model. This concept does not extend well to non-linear models. Computing rankings for variants is also computationally costly for larger, complex models.

2. It uses online evaluation methods, i.e., interleaving or multileaving, to determine the gradient direction from the resulting set of models.
However, these evaluation methods are designed for finding preferences between ranking systems, not (primarily) for determining how a model should be updated.

As an alternative we introduce Pairwise Differentiable Gradient Descent (PDGD), the first unbiased OLTR method that is applicable to any differentiable ranking model. PDGD infers pairwise document preferences from user interactions and constructs an unbiased gradient after each user impression. In addition, PDGD does not rely on sampling models for exploration, but instead models rankings as probability distributions over documents. Therefore, it allows the OLTR model to be very certain for specific queries and perform less exploration in those cases, while being much more explorative in other, uncertain cases. Our results show that, consequently, PDGD provides significant and considerable improvements over previous OLTR methods. This indicates that its gradient estimation is more in line with the preferences to be learned.

In this chapter, we address the thesis research question
RQ2 by answering thefollowing three specific research questions:
RQ3.1
Does using PDGD result in significantly better performance than the currentstate-of-the-art Multileave Gradient Descent?
RQ3.2
Is the gradient estimation of PDGD unbiased?
RQ3.3
Is PDGD capable of effectively optimizing different types of ranking models?

To facilitate replicability and repeatability of our findings, we provide open source implementations of PDGD and our experiments under the permissive MIT open-source license (https://github.com/HarrieO/OnlineLearningToRank).

3.2. Related Work

LTR can be applied in the offline and online setting. In the offline setting LTR is approached as a supervised problem where the relevance of each query-document pair is known. Most of the challenges with offline LTR come from obtaining annotations. For instance, gathering annotations is time-consuming and expensive [17, 76, 95]. Furthermore, in privacy-sensitive contexts it would be unethical to annotate items, e.g., for personal emails or documents [127]. Moreover, for personalization problems annotators are unable to judge what specific users would prefer. Also, (perceived) relevance changes over time, due to cognitive changes on the user's end [120] or due to changes in document collections [1] or the real world [71]. Finally, annotations are not necessarily aligned with user satisfaction, as judges may interpret queries differently from actual users [104]. Consequently, the limitations of offline LTR have led to an increased interest in alternative approaches to LTR.
OLTR is an attractive alternative to offline LTR as it learns directly from interacting with users [132]. By doing so it attempts to solve the issues with offline annotations that occur in LTR, as user preferences are expected to be better represented by interactions than by offline annotations [99]. Unlike methods in the offline setting, OLTR algorithms have to simultaneously perform ranking while also optimizing their ranking model. In other words, an OLTR algorithm decides what rankings to display to users, while at the same time learning from the interactions with the presented rankings. While the potential of learning in the online setting is great, it has its own challenges. In particular, the main difficulties of the OLTR task are bias and noise. Any user interaction that does not reflect the user's true preference is considered noise; this happens frequently, e.g., clicks often occur for unexpected reasons [104]. Bias comes in many forms; for instance, item-selection bias occurs because interactions only involve displayed documents [127]. Another common bias is position bias, a consequence of the fact that documents at the top of a ranking are more likely to be considered [134]. An OLTR method should thus take into account the biases that affect user behavior while also being robust to noise, in order to learn the true user preferences.

OLTR methods can be divided into two groups [139]: tabular methods that learn the best ranked list under some model of user interaction with the list [98, 114], such as a click model [20], and feature-based algorithms that learn the best ranker in a family of rankers [43, 132]. Model-based methods may have greater statistical efficiency, but they give up generality, essentially requiring us to learn a separate model for every query. For the remainder of this chapter, we focus on model-free OLTR methods.

State-of-the-art (model-free) OLTR approaches learn user preferences by approaching optimization as a dueling bandit problem [132]. They estimate the gradient of the model w.r.t. user satisfaction by comparing the current model to sampled variations of the model. The original DBGD algorithm [132] uses interleaving methods to make these comparisons: at each interaction the rankings of two rankers are combined to create a single result list. From a large number of clicks on such combined result lists, a user preference between the two rankers can reliably be inferred [41]. Conversely, DBGD compares its current ranking model to a different, slight variation at each impression. Then, if a click is indicative of a preference for the variation, the current model is slightly updated towards it. Accordingly, the model of DBGD will continuously update itself and oscillate towards an inferred optimum.

Other work in OLTR has used DBGD as a basis and extended upon it. Notably, Hofmann et al. [43] have proposed a method that guides exploration by only sampling variations that seem promising from historical interaction data. Unfortunately, while this approach provides faster initial learning, the historical data introduces bias, which leads the quality of the ranking model to steadily decrease over time [90]. Alternatively, Schuth et al. [111] introduced Multileave Gradient Descent (MGD); this extension replaced the interleaving of DBGD with multileaving methods. In turn, the multileaving paradigm is an extension of interleaving where a set of rankers is compared efficiently [81, 108, 109].
In contrast with interleaving, multileaving methods can combine the rankings of more than two rankers and thus infer preferences over a set of rankers from a single click. MGD uses this property to estimate the gradient more effectively by comparing a large number of model variations per user impression [90, 111]. As a result, MGD requires fewer user interactions than DBGD to converge on the same level of performance. Another alternative approach was considered by Hofmann et al. [40], who inject the ranking from the current model with randomly sampled documents. Then, after each user impression, a pairwise loss is constructed from inferred preferences between documents. This pairwise approach was not found to be more effective than DBGD.

Quite remarkably, all existing work in OLTR has only considered linear models. Recently, Oosterhuis and de Rijke [80] recognized that a tradeoff unique to OLTR arises when choosing models. High-capacity models such as neural networks [13] require more data than simpler models. On the one hand, this means that high-capacity models need more user interactions to reach the same level of performance, thus giving a worse initial user experience. On the other hand, high-capacity models are capable of finding better optima, thus leading to better final convergence and a better long-term user experience. This dilemma is named the speed-quality tradeoff, and as a solution a cascade of models can be optimized: combining the initial learning speed of a simple model with the convergence of a complex one. But there are more reasons why non-linear models have so far been absent from OLTR. Importantly, the DBGD algorithm was designed for linear models from the ground up, relying on a unit circle to sample model variants and averaging models to estimate the gradient. Furthermore, the computational cost of maintaining an extensive set of model variants for large and complex models makes this approach very impractical.

Our contribution over the work listed above is an OLTR method that is not an extension of DBGD; instead, it computes a differentiable pairwise loss to update its model. Unlike the existing pairwise approach, our loss function is unbiased, and our exploration is performed using the model's confidence over documents. Finally, we also show that this is the first OLTR method to effectively optimize neural networks in the online setting.

3.3. Method

In this section we introduce a novel OLTR algorithm: Pairwise Differentiable Gradient Descent (PDGD). First, Section 3.3.1 describes PDGD in detail, before Section 3.3.2 formalizes and proves the unbiasedness of the method. Appendix 3.A lists the notation we use.
PDGD revolves around optimizing a ranking model f_θ(d) that takes a feature representation of a query-document pair d as input and outputs a score. The aim of the algorithm is to find the parameters θ so that sorting the documents by their scores in descending order provides the most optimal rankings. Because this is an online algorithm, the method must first decide what ranking to display to the user; then, after the user has interacted with the displayed ranking, it may update θ accordingly.

Unlike previous OLTR approaches, PDGD does not rely on any online evaluation methods. Instead, a Plackett-Luce (PL) model is applied to the ranking function f_θ(·), resulting in a distribution over the document set D:

  P(d | D) = e^{f_θ(d)} / Σ_{d′ ∈ D} e^{f_θ(d′)}.   (3.1)

Figure 3.1: Left: a click on a document ranking R (documents d_1, ..., d_5) and the inferred preferences of d_3 over {d_1, d_2, d_4}. Right: the reversed pair ranking R*(d_1, d_3, R) for the document pair d_1 and d_3.

A ranking R to display to the user is then created by sampling from the distribution k times, where after each placement the distribution is renormalized to prevent duplicate placements. PL models have been used before in LTR; for instance, the ListNet method [15] optimizes such a model in the offline setting. With R_i denoting the document at position i, the probability of the ranking R then becomes:

  P(R | D) = Π_{i=1}^{k} P(R_i | D \ {R_1, ..., R_{i−1}}).   (3.2)

After the ranking R has been displayed to the user, they have the option to interact with it. The user may choose to click on some (or none) of the documents. Based on these clicks, PDGD will infer preferences between the displayed documents. We assume that clicked documents are preferred over observed unclicked documents. However, it is unknown to the algorithm which unclicked documents the user has considered. As a solution, PDGD relies on the assumption that every document preceding a clicked document and the first subsequent unclicked document were observed, as illustrated in Figure 3.1a. This preference assumption has proven useful in IR before, for instance in pairwise LTR on click logs [54] and recently in online evaluation [81]. We denote preferences between documents inferred from clicks as d_k >_c d_l, where d_k is preferred over d_l.

Then θ is updated by optimizing pairwise probabilities over the preference pairs; for each inferred document preference d_k >_c d_l, the probability that the preferred document d_k is sampled before d_l is increased [118]:

  P(d_k ≻ d_l) = P(d_k | D) / (P(d_k | D) + P(d_l | D)) = e^{f(d_k)} / (e^{f(d_k)} + e^{f(d_l)}).   (3.3)
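The sampling procedure of Equations 3.1 and 3.2 is straightforward to implement; the sketch below (illustrative, not the released PDGD code) uses a numerically stable softmax and renormalizes by removing already placed documents:

import numpy as np

def sample_ranking(scores, k):
    """Sample k documents without replacement from the Plackett-Luce model
    induced by the scores (Eqs. 3.1 and 3.2)."""
    scores = np.asarray(scores, dtype=float)
    remaining = list(range(len(scores)))
    ranking = []
    for _ in range(min(k, len(scores))):
        logits = scores[remaining]
        probs = np.exp(logits - logits.max())   # stable softmax
        probs /= probs.sum()
        idx = np.random.choice(len(remaining), p=probs)
        ranking.append(remaining.pop(idx))
    return ranking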
We have chosen pairwise optimization over listwise optimization because a pairwise method can be made unbiased by reweighing preference pairs. To do this we introduce the weighting function ρ(d_k, d_l, R, D) and estimate the gradient of the user preferences by the weighted sum:

  ∇f_θ(·) ≈ Σ_{d_k >_c d_l} ρ(d_k, d_l, R, D) [∇P(d_k ≻ d_l)]
          = Σ_{d_k >_c d_l} ρ(d_k, d_l, R, D) · ( e^{f_θ(d_k)} e^{f_θ(d_l)} / (e^{f_θ(d_k)} + e^{f_θ(d_l)})² ) (f′_θ(d_k) − f′_θ(d_l)).   (3.4)

The ρ function is based on the reversed pair ranking R*(d_k, d_l, R), which is the same ranking as R with the positions of d_k and d_l swapped. An example of a reversed pair ranking is illustrated in Figure 3.1b. The idea is that if a preference d_k >_c d_l is inferred in R and both documents are equally relevant, then the reverse preference d_l >_c d_k is equally likely to be inferred in R*(d_k, d_l, R). The ρ function reweighs the found preferences by the ratio between the probabilities of R and R*(d_k, d_l, R) occurring:

  ρ(d_k, d_l, R, D) = P(R*(d_k, d_l, R) | D) / (P(R | D) + P(R*(d_k, d_l, R) | D)).   (3.5)

This procedure has similarities with importance sampling [93]; however, we found that reweighing according to the ratio between R and R* provides a more stable performance, since it produces less extreme values. Section 3.3.2 details exactly how ρ creates an unbiased gradient.

Algorithm 3.1 Pairwise Differentiable Gradient Descent (PDGD).
1:  Input: initial weights θ_1; scoring function f; learning rate η.
2:  for t ← 1 ... ∞ do
3:    q_t ← receive_query(t)              // obtain a query from a user
4:    D_t ← preselect_documents(q_t)      // preselect documents for query
5:    R_t ← sample_list(f_{θ_t}, D_t)     // sample list according to Eq. 3.1
6:    c_t ← receive_clicks(R_t)           // show result list to the user
7:    ∇f_{θ_t} ← 0                        // initialize gradient
8:    for d_k >_c d_l ∈ c_t do
9:      w ← ρ(d_k, d_l, R, D)             // initialize pair weight (Eq. 3.5)
10:     w ← w · e^{f_{θ_t}(d_k)} e^{f_{θ_t}(d_l)} / (e^{f_{θ_t}(d_k)} + e^{f_{θ_t}(d_l)})²   // pair gradient (Eq. 3.4)
11:     ∇f_{θ_t} ← ∇f_{θ_t} + w (f′_{θ_t}(d_k) − f′_{θ_t}(d_l))   // model gradient (Eq. 3.4)
12:   θ_{t+1} ← θ_t + η∇f_{θ_t}           // update the ranking model

Algorithm 3.1 describes the PDGD method step by step: Given the initial parameters θ_1 and a differentiable scoring function f (Line 1), the method waits for a user-issued query q_t to arrive (Line 3). Then the preselected set of documents D_t for the query is fetched (Line 4); in our experiments these preselections are given in the LTR datasets that we use. A result list R_t is sampled from the current model (Line 5 and Equation 3.1) and displayed to the user. The clicks from the user are logged (Line 6) and preferences between the displayed documents are inferred (Line 8). The gradient is initialized (Line 7), and for each document pair d_k, d_l such that d_k >_c d_l, the weight ρ(d_k, d_l, R, D) is calculated (Line 9 and Equation 3.5), followed by the gradient for the pair probability (Line 10 and Equation 3.4). Finally, the gradient for the scoring function f is weighted and added to the gradient (Line 11), resulting in the estimated gradient. The model is then updated by taking an η step in the direction of the gradient (Line 12). The algorithm again waits for the next query to arrive, and thus the process continues indefinitely.
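For concreteness, one PDGD update for a linear scorer f_θ(d) = θ·x_d can be sketched as follows. This is a hedged illustration under our own helper names (log_pl_prob, pdgd_gradient); the released implementation may differ in details such as how ρ is computed:

import numpy as np

def log_pl_prob(scores, ranking):
    """log P(R | D) under the Plackett-Luce model of Eq. 3.2."""
    remaining = list(range(len(scores)))
    log_p = 0.0
    for d in ranking:
        logits = scores[remaining]
        m = logits.max()
        log_p += scores[d] - (m + np.log(np.sum(np.exp(logits - m))))
        remaining.remove(d)
    return log_p

def pdgd_gradient(theta, X, ranking, pref_pairs):
    """Gradient estimate of Eq. 3.4 for f(d) = theta . X[d];
    pref_pairs holds (k, l) pairs with d_k >_c d_l inferred from clicks."""
    scores = X @ theta
    grad = np.zeros_like(theta)
    for k, l in pref_pairs:
        # reversed pair ranking R*(d_k, d_l, R): swap the two documents
        swapped = [k if d == l else l if d == k else d for d in ranking]
        log_p = log_pl_prob(scores, ranking)
        log_p_star = log_pl_prob(scores, swapped)
        rho = 1.0 / (1.0 + np.exp(log_p - log_p_star))           # Eq. 3.5
        ek, el = np.exp(scores[k]), np.exp(scores[l])
        grad += rho * ek * el / (ek + el) ** 2 * (X[k] - X[l])   # Eq. 3.4
    return grad

# update step of Algorithm 3.1 (Line 12): theta = theta + eta * pdgd_gradient(...)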
PDGD has some notable advantages over MGD [111]. Firstly, it explicitly models uncertainty over the documents per query; thus PDGD is able to have high confidence in its ranking for one query, while being completely uncertain for another query. As a result, it varies the amount of exploration per query, allowing it to avoid exploration in cases where it is not required and to focus on areas where it can improve. In contrast, MGD does not explicitly model confidence: its degree of exploration is only affected by the norm of its linear model [80]. Consequently, MGD is unable to vary exploration per query, nor is there a way to directly measure its level of confidence. Secondly, PDGD works for any differentiable scoring function f and does not rely on sampling model variants. Conversely, MGD is based on sampling from the unit sphere around a model; this approach is very ineffective for non-linear models. Additionally, sampling large models and producing rankings for them can be very computationally expensive. Besides these beneficial properties, our experimental results in Section 3.5 show that PDGD achieves significantly higher levels of performance than MGD and other previous methods.

The previous section introduced PDGD; this section answers
RQ3.2:
Is the gradient estimation of PDGD unbiased?

First, Theorem 3.1 provides a definition of unbiasedness w.r.t. user document pair preferences. Then we state the assumptions we make about user behavior and use them to prove Theorem 3.1. Our notation uses d_k =_rel d_l to indicate no user preference between two documents d_k and d_l; d_k >_rel d_l to indicate a preference for d_k over d_l; and d_k <_rel d_l for the opposite preference.

Theorem 3.1.
The expected estimated gradient of PDGD can be written as a weighted sum, with a unique weight α_{k,l} for each possible document pair d_k and d_l in the document collection D:

E[∇f_θ(·)] = \sum_{d_k, d_l \in D} α_{k,l} (f'_{θ_t}(d_k) − f'_{θ_t}(d_l)). (3.6)

The signs of the weights α_{k,l} adhere to user preferences between documents. That is, if there is no preference:

d_k =_rel d_l ⇔ α_{k,l} = 0; (3.7)

if d_k is preferred over d_l:

d_k >_rel d_l ⇔ α_{k,l} > 0; (3.8)

and if d_l is preferred over d_k:

d_k <_rel d_l ⇔ α_{k,l} < 0. (3.9)

Therefore, in expectation PDGD performs updates that adhere to the preferences between the documents in every possible document pair.

Assumptions.
To prove Theorem 3.1, the following assumptions about user behavior will be used:
Assumption 1.
We assume that clicks from a user are position-biased and conditioned on the relevance of the current document and the previously considered documents. For a click on a document in ranking R at position i, the probability can be written as:

P(click(R_i) \mid \{R_1, \ldots, R_{i-1}, R_{i+1}\}). (3.10)

For ease of notation, we will denote the set of “other documents” as {…} from here on.

Assumption 2.
If there is no user preference between two documents d_k, d_l, denoted by d_k =_rel d_l, we assume that each is equally likely to be clicked given the same context:

d_k =_rel d_l ⇒ P(click(d_k) \mid \{…\}) = P(click(d_l) \mid \{…\}). (3.11)

Assumption 3.
If a document in the set of documents being considered is replaced with an equally preferred document, the click probability is not affected:

d_k =_rel d_l ⇒ P(click(R_i) \mid \{…, d_k\}) = P(click(R_i) \mid \{…, d_l\}). (3.12)

Assumption 4.
Similarly, given the same context, if one document is preferred over another, then it is more likely to be clicked:

d_k >_rel d_l ⇒ P(click(d_k) \mid \{…\}) > P(click(d_l) \mid \{…\}). (3.13)

Assumption 5.
Lastly, for any pair d_k >_rel d_l, the considered document set {…, d_k} and the same set with d_k replaced by d_l, {…, d_l}, we assume that the preferred d_k in the context of {…, d_l} is more likely to be clicked than d_l in the context of {…, d_k}:

d_k >_rel d_l ⇒ P(click(d_k) \mid \{…, d_l\}) > P(click(d_l) \mid \{…, d_k\}). (3.14)

These are all the assumptions we make about the user. With these assumptions, we can proceed to prove Theorem 3.1.

Proof of Theorem 3.1.
We denote the probability of inferring the preference of d_k over d_l in ranking R as P(d_k >_c d_l | R). The expected gradient ∇f_θ(·) of PDGD can then be written as:

E[∇f_θ(·)] = \sum_{R} \sum_{d_k, d_l \in D} P(d_k >_c d_l \mid R) \cdot P(R) \cdot ρ(d_k, d_l, R, D) [∇P(d_k ≻ d_l)]. (3.15)

We will rewrite this expectation using the symmetry property of the reversed pair ranking:

R_n = R^*(d_k, d_l, R_m) ⇔ R_m = R^*(d_k, d_l, R_n). (3.16)

First, we define a weight ω^R_{k,l} for every document pair d_k, d_l and ranking R so that:

ω^R_{k,l} = P(R) ρ(d_k, d_l, R, D) = \frac{P(R \mid D) P(R^*(d_k, d_l, R) \mid D)}{P(R \mid D) + P(R^*(d_k, d_l, R) \mid D)}. (3.17)

Therefore, the weight for the reversed pair ranking is equal:

ω^{R^*(d_k, d_l, R)}_{k,l} = P(R^*(d_k, d_l, R)) ρ(d_k, d_l, R^*(d_k, d_l, R), D) = ω^R_{k,l}. (3.18)

Then, using the symmetry of Equation 3.3, we see that:

∇P(d_k ≻ d_l) = −∇P(d_l ≻ d_k). (3.19)

Thus, with R^* as a shorthand for R^*(d_k, d_l, R), the expectation can be rewritten as:

E[∇f_θ(·)] = \sum_{d_k, d_l \in D} \sum_{R} ω^R_{k,l} \big( P(d_k >_c d_l \mid R) − P(d_l >_c d_k \mid R^*) \big) [∇P(d_k ≻ d_l)], (3.20)

proving that the expected gradient matches the form of Equation 3.6. Then, to prove that Equations 3.7, 3.8, and 3.9 are correct, we will show that:

d_k =_rel d_l ⇒ P(d_k >_c d_l \mid R) = P(d_l >_c d_k \mid R^*), (3.21)
d_k >_rel d_l ⇒ P(d_k >_c d_l \mid R) > P(d_l >_c d_k \mid R^*), (3.22)
d_k <_rel d_l ⇒ P(d_k >_c d_l \mid R) < P(d_l >_c d_k \mid R^*). (3.23)

If a preference R_i >_c R_j is inferred, then there are only three possible cases based on the positions:

1. The clicked document succeeds the unclicked document by more than one position: i > j + 1.
2. The clicked document precedes the unclicked document by more than one position: i + 1 < j.
3. The clicked document is one position before or after the unclicked document: i = j + 1 ∨ i = j − 1.

In the first case, the clicked document succeeds the other by more than one position; the probability of an inferred preference is then:

i > j + 1 ⇒ P(R_i >_c R_j \mid R) = P(c_i \mid R_i, \{…, R_j\}) (1 − P(c_j \mid R_j, \{…\})). (3.24)

Combining Assumptions 2 and 3 with Equation 3.24 proves Equation 3.21 for this case. Furthermore, combining Assumptions 4 and 5 with Equation 3.24 proves Equations 3.22 and 3.23 for this case as well.

In the second case, the clicked document precedes the other by more than one position; the probability of an inferred preference is then:

i + 1 < j ⇒ P(R_i >_c R_j \mid R) = P(c_i \mid R_i, \{…\}) (1 − P(c_j \mid R_j, \{…, R_i\})) P(c_{rem}), (3.25)

where P(c_{rem}) denotes the probability of an additional click that is required to add R_j to the inferred observed documents. First, due to Assumption 1 this probability will be the same for R and R^*:

P(c_{rem} \mid R_i, R_j, R) = P(c_{rem} \mid R_i, R_j, R^*). (3.26)

Combining Assumptions 2 and 3 with Equation 3.25 also proves Equation 3.21 for this case. Furthermore, combining Assumptions 4 and 5 with Equation 3.25 also proves Equations 3.22 and 3.23 for this case as well.

Lastly, in the third case, the clicked document is one position before or after the other document; the probability of the inferred preference is then:

i = j + 1 ∨ i = j − 1 ⇒ P(R_i >_c R_j \mid R) = P(c_i \mid R_i, \{…, R_j\}) (1 − P(c_j \mid R_j, \{…, R_i\})). (3.27)

Combining Assumption 3 with Equation 3.27 proves Equation 3.21 for this case as well. Then, combining Assumption 5 with Equation 3.27 also proves Equations 3.22 and 3.23 for this case.

This concludes our proof of the unbiasedness of PDGD. Hence, we answer
RQ3.2 positively: the gradient estimation of PDGD is unbiased. We have shown that the expected gradient is in line with user preferences between document pairs.
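The core of the proof, that inferring d_k >_c d_l in R mirrors inferring d_l >_c d_k in R^*, can be illustrated with a small numerical check. The examination and click probabilities below are made up for illustration, but they satisfy the stated assumptions:

```python
import numpy as np

# Illustrative (made-up) probabilities: a click requires examination
# (position bias, Assumption 1) and depends on relevance (Assumptions 4-5).
pos_bias = np.array([1.0, 0.6])
click_given_examined = {0: 0.3, 1: 0.8}  # keyed by a binary relevance grade

def pref_prob(rel_first, rel_second):
    """P(first doc clicked, second doc not clicked) in a two-document list,
    i.e., the probability of inferring a preference for the first document."""
    p1 = pos_bias[0] * click_given_examined[rel_first]
    p2 = pos_bias[1] * click_given_examined[rel_second]
    return p1 * (1.0 - p2)

# Eq. 3.21: equally relevant documents yield the same inference probability
# in R and in the reversed pair ranking R* (swapping them changes nothing).
assert pref_prob(1, 1) == pref_prob(1, 1)
# Eq. 3.22: if d_k is more relevant, inferring d_k >_c d_l in R is more
# likely than inferring d_l >_c d_k in R*(d_k, d_l, R).
assert pref_prob(1, 0) > pref_prob(0, 1)
print("symmetry checks passed")
```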
In this section we detail the experiments that were performed to answer the research questions in Section 3.1.
Our experiments are performed over five publicly available LTR datasets; we have selected three large labelled datasets from commercial search engines and two smaller research datasets. Every dataset consists of a set of queries, with each query having a corresponding preselected document set. The exact contents of the queries and documents are unknown; each query is represented only by an identifier, but each query-document pair has a feature representation and relevance label. Depending on the dataset, the relevance labels are graded differently; we have purposefully chosen datasets that have at least two grades of relevance. Each dataset is divided into training, validation and test partitions.

The oldest datasets we use are MQ2007 and MQ2008 [95], which are based on the Million Query Track [8] and consist of 1,700 and 800 queries, respectively. They use representations of 46 features that encode ranking models such as TF.IDF, BM25, Language Modeling, PageRank, and HITS on different parts of the documents. They are divided into five folds and the labels are on a three-grade scale from not relevant (0) to very relevant (2).

In 2010 Microsoft released the MSLR-WEB30k and MSLR-WEB10K datasets [95], which are both created from a retired labelling set of a commercial web search engine (Bing). The former contains 30,000 queries, with each query having 125 assessed documents on average; query-document pairs are encoded in 136 features. The latter is a subsampling of 10,000 queries from the former dataset. For practical reasons only MSLR-WEB10K was used for this chapter. Also in 2010, Yahoo! released an LTR dataset [17]. It consists of 29,921 queries and 709,877 documents encoded in 700 features, all sampled from query logs of the Yahoo! search engine. Finally, in 2016 an LTR dataset was released by the Istella search engine [27]. It is the largest with 33,118 queries, an average of 315 documents per query, and 220 features. These three commercial datasets all label relevance on a five-grade scale: from not relevant (0) to perfectly relevant (4).
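These datasets are distributed in a LETOR-style plain-text format, where each line holds a relevance label, a query identifier, and sparse feature pairs. A minimal parser sketch, assuming the common `label qid:<id> <index>:<value> ...` layout (all names illustrative):

```python
from collections import defaultdict
import numpy as np

def read_letor(path, n_features):
    """Parse a LETOR-style file into per-query feature matrices and labels."""
    feats, labels = defaultdict(list), defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.split('#')[0].strip()  # drop trailing comments
            if not line:
                continue
            parts = line.split()
            label, qid = int(parts[0]), parts[1].split(':')[1]
            vec = np.zeros(n_features)
            for pair in parts[2:]:
                idx, val = pair.split(':')
                vec[int(idx) - 1] = float(val)  # feature indices are 1-based
            feats[qid].append(vec)
            labels[qid].append(label)
    return ({q: np.stack(v) for q, v in feats.items()},
            {q: np.array(v) for q, v in labels.items()})
```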
For simulating users we follow the standard setup for OLTR simulations [38, 40, 90, 111, 137]. First, queries issued by users are simulated by uniformly sampling from the static dataset. Then the algorithm determines the result list of documents to display. User interactions with the displayed list are then simulated using a cascade click model [20, 36]. This models a user who goes through the documents one at a time in the displayed order. At each document, the user decides whether to click it or not, modelled as a probability conditioned on the relevance label R: P(click = 1 | R). After a click has occurred, the user's information need may be satisfied, and they may then stop considering documents. The probability of a user stopping after a click is modelled as P(stop = 1 | click = 1, R). For our experiments, κ = 10 documents are displayed at each impression.

The three instantiations of cascade click models that we used are listed in Table 3.1.

Table 3.1: Instantiations of Cascading Click Models [36] as used for simulating user behavior in experiments; for each relevance grade R, it lists the click probability P(click = 1 | R) and the stop probability P(stop = 1 | click = 1, R) of the perfect, navigational, and informational instantiations.

First, a perfect user is modelled who considers every document and solely clicks on all relevant documents. The second models a user with a navigational task, who searches for a single highly relevant document. Finally, an informational instantiation models a user without a specific information need, who thus typically clicks on many documents. These models have varying levels of noise, as each behavior depends on the relevance labels of documents to a different degree.
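A minimal sketch of such a cascade simulation is given below; the click and stop probabilities are illustrative placeholders, not the values from Table 3.1:

```python
import numpy as np

def simulate_cascade_clicks(rel_labels, p_click, p_stop, rng):
    """Simulate a cascading user on a displayed ranking.

    rel_labels: relevance grades of the displayed documents, top-down.
    p_click:    click probability per relevance grade.
    p_stop:     stop probability per relevance grade, applied after a click.
    """
    clicks = np.zeros(len(rel_labels), dtype=bool)
    for i, rel in enumerate(rel_labels):
        if rng.random() < p_click[rel]:
            clicks[i] = True
            if rng.random() < p_stop[rel]:
                break  # information need satisfied, stop examining
    return clicks

rng = np.random.default_rng(1)
# Placeholder probabilities for a five-grade scale (NOT the thesis values).
p_click = [0.05, 0.3, 0.5, 0.7, 0.95]
p_stop = [0.2, 0.3, 0.5, 0.7, 0.9]
print(simulate_cascade_clicks([0, 4, 1, 0, 3], p_click, p_stop, rng))
```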
For our experiments three baselines are used. First, MGD with Probabilistic Multileaving [90]; this is the highest performing existing OLTR method [80, 90]. For this chapter, n = 49 candidates were sampled per iteration from the unit sphere with δ = 1; updates are performed with η = 0. and zero initialization was used. Additionally, DBGD is used for comparison, as it is one of the most influential methods; it was run with the same parameters, except that only n = 1 candidate is sampled per iteration. Furthermore, we also let DBGD optimize a single-hidden-layer neural network with 64 hidden nodes and sigmoid activation functions with Xavier initialization [33]. These parameters were also found most effective in previous work [40, 90, 111, 132].

Additionally, the pairwise method introduced by Hofmann et al. [40] is used as a baseline. Despite not showing significant improvements over DBGD in the past [40], the comparison with PDGD is interesting because they both estimate gradients from pairwise preferences. For this baseline, η = 0. and ε = 0. are used; these parameters are chosen to maximize the performance at convergence [40].

Runs with PDGD are performed with both a linear and a neural ranking model. For the linear ranking model, η = 0. and zero initialization were used. The neural network has the same parameters as the one optimized by DBGD, except for η = 0. .

Two aspects of performance are evaluated separately: the final convergence and the ranking quality during training. Final convergence is addressed by offline performance, which is the average NDCG@10 of the ranking model over the queries in the held-out test set. The offline performance is measured after 10,000 impressions, at which point most ranking models have reached convergence. The user experience during optimization should be considered as well, since deterring users during training would compromise the goal of OLTR. To address this aspect of evaluation, online performance has been introduced [39]; it is the cumulative discounted NDCG@10 of the rankings displayed during training. For T sequential queries, with R_t as the ranking displayed to the user at timestep t, this is:

Online\ Performance = \sum_{t=1}^{T} NDCG(R_t) \cdot γ^{(t−1)}. (3.28)

This metric models the expected reward a user receives, with a 1 − γ probability that the user stops searching after each query. We follow previous work [80, 90] by choosing a discount factor of γ = 0.9995; consequently, queries beyond the horizon of 10,000 queries have a less than 1% impact.

Lastly, all experimental runs are repeated 125 times, spread evenly over the available dataset folds. Results are averaged and a two-tailed Student's t-test is used for significance testing. In total, our results are based on more than 90,000,000 user impressions.
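Equation 3.28 translates directly into code; a small sketch, assuming a list of per-impression NDCG@10 values:

```python
def online_performance(ndcg_per_impression, gamma=0.9995):
    """Discounted cumulative NDCG (Eq. 3.28): sum_t NDCG(R_t) * gamma^(t-1).

    `enumerate` starts at t = 0, which matches the exponent t - 1 for
    timesteps counted from 1.
    """
    return sum(ndcg * gamma ** t
               for t, ndcg in enumerate(ndcg_per_impression))

print(online_performance([0.4, 0.45, 0.5]))  # toy three-impression run
```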
Our main results are displayed in Table 3.2 and Table 3.3, showing the offline and online performance of all methods, respectively. Additionally, Figure 3.2 displays the offline performance on the MSLR-WEB10k dataset over 30,000 impressions, and Figure 3.3 over 1,000,000 impressions. We use these results to answer RQ3.1 – whether PDGD provides significant improvements over existing OLTR methods – and
RQ3.3 – whether PDGD is successful at optimizing different types of ranking models.
First, we consider the offline performance after 10,000 impressions as reported in Table 3.2. We see that the DBGD and MGD baselines reach similar levels of performance, with marginal differences at low levels of noise. Our results seem to suggest that MGD provides an efficient alternative to DBGD that requires fewer user interactions and is more robust to noise. However, MGD does not appear to have an improved point of convergence over DBGD; Figure 3.2 further confirms this conclusion. Additionally, Table 3.2 and Figure 3.3 reveal that DBGD is incapable of training its neural network so that it improves over the linear model, even after 1,000,000 impressions.

Alternatively, the pairwise baseline displays different behavior, providing improvements over DBGD and MGD on most datasets under all levels of noise. However, on the Istella dataset large decreases in performance are observed. Thus it is unclear whether this method provides a reliable alternative to DBGD or MGD in terms of convergence. Figure 3.2 also reveals that it converges within several hundred impressions, while DBGD and MGD continue to learn and considerably improve over the total 30,000 impressions. Because the pairwise baseline also converges sub-optimally under the perfect click model, we do not attribute its suboptimal convergence to noise but to the method being biased.

Conversely, Table 3.2 shows that PDGD reaches significantly higher performance than all the baselines within 10,000 impressions. Improvements are observed on all datasets under all levels of noise, especially on the commercial datasets, where increases of up to 0. NDCG are observed. Our results also show that PDGD learns faster than the baselines; at all time-steps, the offline performance of PDGD is at least as good as, or better than, that of all other methods, across all datasets. This increased learning speed can also be observed in Figure 3.2. Besides the faster learning, it also appears as if PDGD converges at a better optimum than DBGD or MGD. However, Figure 3.2 reveals that DBGD does not fully converge within 30,000 iterations. Therefore, we performed an additional experiment where PDGD and DBGD optimize models over 1,000,000 impressions on the MSLR-WEB10k dataset, as displayed in Figure 3.3. Clearly, the performance of DBGD plateaus at a considerably lower level than that of PDGD. Therefore, we conclude that PDGD indeed has an improved point of final convergence compared to DBGD and MGD.

Finally, Figures 3.2 and 3.3 also show the behavior predicted by the speed-quality tradeoff [80]: a more complex model will have a worse initial performance but a better final convergence. Here, we see that, depending on the level of interaction noise, the neural model requires 3,000 to 20,000 iterations to match the performance of a linear model. However, in the long run the neural model does converge at a significantly better point. Thus, we conclude that PDGD is capable of effectively optimizing different kinds of models in terms of offline performance.

In conclusion, our results show that PDGD learns faster than existing OLTR methods while also converging at significantly better levels of performance.
Besides the ranking models learned by the OLTR methods, we also consider the user experience during optimization. Table 3.3 shows that the online performance of DBGD and MGD is close; MGD has a higher online performance due to its faster learning speed [90, 111]. In contrast, the pairwise baseline has a substantially lower online performance in all cases. Because Figure 3.2 shows that the learning speed of the pairwise baseline sometimes matches that of DBGD and MGD, we attribute this difference to the exploration strategy it uses. Namely, the random insertion of uniformly sampled documents by this baseline appears to have a strong negative effect on the user experience.

The linear model optimized by PDGD obtains significant improvements over all baseline methods on all datasets and under all click models. This improvement indicates that the exploration of PDGD, which uses a distribution over documents, does not lead to a worse user experience. In conclusion, PDGD provides a considerably better user experience than all existing methods.

Finally, we also discuss the performance of the neural model optimized by PDGD. This model has both significant increases and decreases in online performance, varying per dataset and amount of interaction noise. The decrease in user experience is predicted by the speed-quality tradeoff [80]; as Figure 3.2 also shows, the neural model has a slower learning speed, leading to a worse initial user experience. A solution to this tradeoff has been proposed by Oosterhuis and de Rijke [80], which optimizes a cascade of models. In this case, the cascade could combine the user experience of the linear model with the final convergence of the neural model, providing the best of both worlds.
After having discussed the offline and online performance of PDGD, we will now answer
RQ3.1 and
RQ3.3. First, concerning
RQ3.1 (whether PDGD performs significantly better than MGD), the results of our experiments show that models optimized with PDGD learn faster and converge at better optima than MGD, DBGD, and the pairwise baseline, regardless of dataset or level of interaction noise. Moreover, the level of performance reached with PDGD is significantly higher than the final convergence of any other method. Thus, even in the long run, DBGD and MGD are incapable of reaching the offline performance of PDGD. Additionally, the online performance of a linear model optimized with PDGD is significantly better across all datasets and user models. Therefore, we answer
RQ3.1 positively: PDGD outperforms existing methods both in terms of model convergence and user experience during learning.
Then, with regard to
RQ3.3 (whether PDGD can effectively optimize different types of models), in our experiments we have successfully optimized models from two families: linear models and neural networks. Both models reach a significantly higher level of performance at convergence than previous OLTR methods, across all datasets and degrees of interaction noise. As expected, the simpler linear model provides a better initial user experience, while the more complex neural model has a better point of convergence. In conclusion, we answer
RQ3.3 positively: PDGD is applicable to different ranking models and effective for both linear and non-linear models.
In this chapter, we have introduced a novel OLTR method, PDGD, which estimates its gradient using inferred pairwise document preferences. In contrast with previous OLTR approaches, PDGD does not rely on online evaluation to update its model. Instead, after each user interaction it infers preferences between document pairs. Subsequently, it constructs a pairwise gradient that updates the ranking model according to these preferences.

We have proven that this gradient is unbiased w.r.t. user preferences; that is, if there is a preference between a document pair, then in expectation the gradient will update the model to meet this preference. Furthermore, our experimental results show that PDGD learns faster and converges at a higher performance level than existing OLTR methods. Thus, it provides better performance in the short and long term, leading to an improved user experience during training as well. On top of that, PDGD is also applicable to any differentiable ranking model; in our experiments, a linear model and a neural network were optimized effectively. Both reached significant improvements over DBGD and MGD in performance at convergence. In conclusion, the novel unbiased PDGD algorithm provides better performance than existing methods in terms of convergence and user experience. Unlike the previous state of the art, it can be applied to any differentiable ranking model.

We can now answer thesis research question
RQ2 positively: OLTR is possible without relying on model-sampling and online evaluation. Moreover, our results show that using PDGD instead leads to much higher performance, and that it is much more effective at optimizing non-linear models.

Future research could consider the regret bounds of PDGD; these could give further insights into why it outperforms DBGD-based methods. Furthermore, while we proved the unbiasedness of our method w.r.t. document pair preferences, the expected gradient weighs document pairs differently. Offline LTR methods like LambdaMART [13] use a weighted pairwise loss to create a listwise method that directly optimizes IR metrics. However, in the online setting there is no metric that is directly optimized. Instead, future work could investigate whether different weighing approaches are more in line with user preferences. Another obvious avenue for future research is to explore the effectiveness of different ranking models in the online setting. There is a large body of research on ranking models in offline LTR; with the introduction of PDGD, such an extensive exploration of models is now also possible in OLTR.

Based on the big difference in observed performance between PDGD and DBGD,
Chapter 4 will further extend this comparison to more extreme experimental conditions. Furthermore, Chapter 8 will also consider the performance of PDGD and compare it with methods inspired by counterfactual LTR. Additionally, Chapter 8 will consider applying PDGD as a counterfactual method and without debiasing weights, and finds that in both these scenarios this leads to biased convergence.

Figure 3.2: Offline performance (NDCG) on the MSLR-WEB10k dataset under three different click models (perfect, navigational, and informational; 0 to 30,000 impressions); the shaded areas indicate the standard deviation.

Figure 3.3: Long-term offline performance (NDCG) on the MSLR-WEB10k dataset under three click models (0 to 1,000,000 impressions); the shaded areas indicate the standard deviation.

Table 3.2: Offline performance (NDCG) for different instantiations of CCM (Table 3.1). The standard deviation is shown in brackets; bold values indicate the highest performance per dataset and click model; significant improvements over the DBGD, MGD and pairwise baselines are indicated by △ (p < 0.05) and ▲ (p < 0.01); no losses were measured.

Table 3.3: Online performance (Discounted Cumulative NDCG; see Equation 3.28) for different instantiations of CCM (Table 3.1). The standard deviation is shown in brackets; bold values indicate the highest performance per dataset and click model; significant improvements and losses over the DBGD, MGD and pairwise baselines are indicated by △ (p < 0.05) and ▲ (p < 0.01), and by ▽ and ▼, respectively.
Notation used in this chapter:

Notation          Description
q                 a user-issued query
d, d_k, d_l       a document
d (boldface)      feature representation of a query-document pair
D                 set of documents
R                 ranked list
R^*               the reversed pair ranking R^*(d_k, d_l, R)
R_i               document placed at rank i
ρ                 preference pair weighting function
θ                 parameters of the ranking model
f_θ(·)            ranking model with parameters θ
f(d_k)            ranking score for a document from the model
click(d)          a click on document d
d_k =_rel d_l     two documents equally preferred by users
d_k >_rel d_l     a user preference between two documents
d_k >_c d_l       document preference inferred from clicks
A Critical Comparison of Online Learning to Rank Methods
Online Learning to Rank (OLTR) methods optimize ranking models by directly interacting with users, which allows them to be very efficient and responsive. All OLTR methods introduced during the past decade have extended the original OLTR method: Dueling Bandit Gradient Descent (DBGD). In Chapter 3, a fundamentally different approach was introduced with the Pairwise Differentiable Gradient Descent (PDGD) algorithm. The empirical comparisons in Chapter 3 suggested that PDGD converges at much higher levels of performance and learns considerably faster than DBGD-based methods. In contrast, DBGD appeared unable to converge on the optimal model in scenarios with little noise or bias. Furthermore, it seemed DBGD is not effective at optimizing non-linear models. These observations are quite surprising and prompted us to further investigate DBGD. As a result, this chapter will address the thesis research question:
RQ3
Are DBGD Learning to Rank methods reliable in terms of theoretical soundness and empirical performance?
In this chapter, we investigate whether the previous conclusions about the PDGD and DBGD comparison generalize from ideal to worst-case circumstances. We do so in two ways. First, we compare the theoretical properties of PDGD and DBGD, by taking a critical look at previously proven properties in the context of ranking. Second, we estimate an upper and lower bound on the performance of methods by simulating both ideal user behavior and extremely difficult behavior, i.e., almost-random non-cascading user models. Our findings show that the theoretical bounds of DBGD do not apply to any common ranking model and, furthermore, that the performance of DBGD is substantially worse than that of PDGD in both ideal and worst-case circumstances. These results reproduce previously published findings about the relative performance of PDGD vs. DBGD and generalize them to extremely noisy and non-cascading circumstances. Overall, they show that DBGD is a very flawed method for OLTR, both in terms of theoretical guarantees and performance.
This chapter was published as [84]. Appendix 4.A gives a reference for the notation used in this chapter.

Learning to Rank (LTR) plays a vital role in information retrieval. It allows us to optimize models that combine hundreds of signals to produce rankings, thereby making large collections of documents accessible to users through effective search and recommendation. Traditionally, LTR has been approached as a supervised learning problem, where annotated datasets provide human judgements indicating relevance. Over the years, many limitations of such datasets have become apparent: they are costly to produce [17, 95] and actual users often disagree with the relevance annotations [104]. As an alternative, research into LTR approaches that learn from user behavior has increased. By learning from the implicit feedback in user behavior, users' true preferences can potentially be learned. However, such methods must deal with the noise and biases that are abundant in user interactions [134]. Roughly speaking, there are two approaches to LTR from user interactions: learning from historical interactions and Online Learning to Rank (OLTR). Learning from historical data allows for optimization without gathering new data [58], though it does require good models of the biases in logged user interactions [20]. In contrast, OLTR methods learn by interacting with the user; thus they gather their own learning data. As a result, these methods can adapt instantly and are potentially much more responsive than methods that use historical data.

Dueling Bandit Gradient Descent (DBGD) [132] is the most prevalent OLTR method; it has served as the basis of the field for the past decade. DBGD samples variants of its ranking model and compares them using interleaving to find improvements [41, 96]. Subsequent work in OLTR has extended this approach [43, 111, 125]. In Chapter 3, the first alternative approach to DBGD was introduced with
Pairwise Differentiable Gradient Descent (PDGD) [82]. PDGD estimates a pairwise gradient that is reweighed to be unbiased w.r.t. users' document pair preferences. Chapter 3 showed considerable improvements over DBGD under simulated user behavior [84]: a substantially higher point of performance at convergence and a much faster learning speed. The results in Chapter 3 are based on simulations using low-noise cascading click models. The pairwise assumption that PDGD makes, namely, that all documents preceding a clicked document were observed by the user, is always correct in these circumstances, thus potentially giving it an unfair advantage over DBGD. Furthermore, the low level of noise presents a close-to-ideal situation, and it is unclear whether the findings in Chapter 3 generalize to less perfect circumstances.

In this chapter, we contrast PDGD and DBGD. Prior to an experimental comparison, we determine whether there is a theoretical advantage of DBGD over PDGD and examine the regret bounds of DBGD for ranking problems. We then investigate whether the benefits of PDGD over DBGD reported in Chapter 3 generalize to circumstances ranging from ideal to worst-case. We simulate circumstances that are perfect for both methods – behavior without noise or position bias – and circumstances that are the worst possible scenario – almost-random, extremely biased, non-cascading behavior. These settings provide estimates of upper and lower bounds on performance, and indicate how well previous comparisons generalize to different circumstances. Additionally, we introduce a version of DBGD that is provided with an oracle interleaving method; its performance shows us the maximum performance DBGD could reach from hypothetical extensions.
In summary, we map thesis research question
RQ3 into the following more fine-grained research questions:
RQ4.1
Do the regret bounds of DBGD provide a benefit over PDGD?
RQ4.2
Do the advantages of PDGD over DBGD observed in Chapter 3 generalize to extreme levels of noise and bias?
RQ4.3
Is the performance of PDGD reproducible under non-cascading user behavior?
This section provides a brief overview of traditional LTR (Section 4.2.1), of LTR from historical interactions (Section 4.2.2), and of OLTR (Section 4.2.3).
Traditionally, LTR has been approached as a supervised problem; in the context of OLTR this approach is often referred to as offline
LTR. It requires a dataset containing relevance annotations of query-document pairs, after which a variety of methods can be applied [75]. The limitations of offline LTR mainly come from obtaining such annotations. The costs of gathering annotations are high, as it is both time-consuming and expensive [17, 95]. Furthermore, annotators cannot judge for very specific users, i.e., gathering data for personalization problems is infeasible. Moreover, for certain applications it would be unethical to annotate items, e.g., for search in personal emails or documents [127]. Additionally, annotations are stationary and cannot account for (perceived) relevance changes [1, 71, 120]. Most importantly, though, annotations are not necessarily aligned with user preferences; judges often interpret queries differently from actual users [104]. As a result, there has been a shift of interest towards LTR approaches that do not require annotated data.
The idea of LTR from user interactions is long-established; one of the earliest examples is the original pairwise LTR approach [54]. This approach uses historical click-through interactions from a search engine and considers clicks as indications of relevance. Though very influential and quite effective, this approach ignores the noise and biases inherent in user interactions. Noise, i.e., any user interaction that does not reflect the user's true preference, occurs frequently, since many clicks happen for unexpected reasons [104]. Biases are systematic forms of noise that occur due to factors other than relevance. For instance, interactions will only involve displayed documents, resulting in selection bias [127]. Another important form of bias in LTR is position bias, which occurs because users are less likely to consider documents that are ranked lower [134]. Thus, to effectively learn true preferences from user interactions, an LTR method should be robust to noise and handle biases correctly.

In recent years, counterfactual LTR methods have been introduced that correct for some of the bias in user interactions. Such methods use inverse propensity scoring to account for the probability that a user observed a ranking position [58]. Thus, clicks on positions that are observed less often due to position bias will have greater weight, to account for that difference. However, the position bias must be learned and estimated somewhat accurately [5]. On the other side of the spectrum are click models, which attempt to model user behavior completely [20]. By predicting behavior accurately, the effect of relevance on user behavior can also be estimated [11, 127].

An advantage of these approaches over OLTR is that they only require historical data, and thus no new data has to be gathered. However, unlike OLTR, they do require a fairly accurate user model, and thus they cannot be applied in cold-start situations.
OLTR differs from the approaches listed above because its methods intervene in the search experience. They have control over what results are displayed, and they can learn from their interactions instantly. Thus, the online approach performs LTR by interacting with users directly [132]. Similar to LTR methods that learn from historical interaction data, OLTR methods have the potential to learn the true user preferences. However, they also have to deal with the noise and biases that come with user interactions. Another advantage of OLTR is that the methods are very responsive, as they can apply their learned behavior instantly. Conversely, this also brings a danger, as an online method that learns incorrect preferences can also worsen the experience immediately. Thus, it is important that OLTR methods are able to learn reliably in spite of noise and biases. OLTR methods therefore have a two-fold task: they have to simultaneously present rankings that provide a good user experience and learn from user interactions with the presented rankings.

The original OLTR method is Dueling Bandit Gradient Descent (DBGD); it approaches optimization as a dueling bandit problem [132]. This approach requires an online comparison method that can compare two rankers w.r.t. user preferences; traditionally, DBGD methods use interleaving. Interleaving methods take the rankings produced by two rankers and combine them into a single result list, which is then displayed to users. From a large number of clicks on the presented lists, the interleaving methods can reliably infer a preference between the two rankers [41, 96]. At each timestep, DBGD samples a candidate model, i.e., a slight variation of its current model, and compares the current and candidate models using interleaving. If a preference for the candidate is inferred, the current model is updated towards the candidate slightly. By doing so, DBGD updates its model continuously and should oscillate towards an inferred optimum. Section 4.3 provides a complete description of the DBGD algorithm.

Virtually all work in OLTR in the decade since the introduction of DBGD has used DBGD as a basis. A straightforward extension comes in the form of Multileave Gradient Descent [111], which compares a large number of candidates per interaction [81, 108, 109]. This leads to a much faster learning process, though in the long run this method does not seem to improve the point of convergence. One of the earliest extensions of DBGD proposed a method for reusing historical interactions to guide exploration for faster learning [43]. While the initial results showed great improvements [43], later work showed performance drastically decreasing in the long term due to bias introduced by the historical data [90]. Unfortunately, OLTR work that continued this historical approach [125] also only considered short-term results; moreover, the results of some work [135] are not based on held-out data. As a result, we do not know whether these extensions provide decent long-term performance, and it is unclear whether the findings of these studies generalize to more realistic settings.

In Chapter 3, an inherently different approach to OLTR was introduced with PDGD [82]. PDGD interprets its ranking model as a distribution over documents; it estimates a pairwise gradient from user interactions with sampled rankings. This gradient is differentiable, allowing for non-linear models like neural networks to be optimized, something DBGD is ineffective at [80, 82]. Section 4.4 provides a detailed description of PDGD. In the chapter in which we introduced PDGD (Chapter 3), we claim that it provides substantial improvements over DBGD. However, those claims are based on cascading click models with low levels of noise. This is problematic because PDGD assumes a cascading user, and could thus have an unfair advantage in this setting. Furthermore, it is unclear whether DBGD with a perfect interleaving method could still improve over PDGD. Lastly, DBGD has proven regret bounds while PDGD has no such guarantees.

In this chapter, we clear up these questions about the relative strengths of DBGD and PDGD by comparing the two methods under non-cascading, high-noise click models. Additionally, by providing DBGD with an oracle comparison method, its hypothetical maximum performance can be measured; thus, we can study whether an improvement over PDGD is hypothetically possible. Finally, a brief analysis of the theoretical regret bounds of DBGD shows that they do not apply to any common ranking model, therefore hardly providing a guaranteed advantage over PDGD.

This section describes the DBGD algorithm in detail, before discussing the regret bounds of the algorithm.

Algorithm 4.1 Dueling Bandit Gradient Descent (DBGD).
 1: Input: initial weights θ_1; unit: u; learning rate η.
 2: for t ← 1 … ∞ do
 3:   q_t ← receive_query(t)                      // obtain a query from a user
 4:   θ^c_t ← θ_t + sample_from_unit_sphere(u)    // create candidate ranker
 5:   R_t ← get_ranking(θ_t, D_{q_t})             // get current ranker ranking
 6:   R^c_t ← get_ranking(θ^c_t, D_{q_t})         // get candidate ranker ranking
 7:   I_t ← interleave(R_t, R^c_t)                // interleave both rankings
 8:   c_t ← display_to_user(I_t)                  // display interleaved list, record clicks
 9:   if preference_for_candidate(I_t, c_t, R_t, R^c_t) then
10:     θ_{t+1} ← θ_t + η(θ^c_t − θ_t)            // update model towards candidate
11:   else
12:     θ_{t+1} ← θ_t                             // no update
The DBGD algorithm [132] describes an indefinite loop that aims to improve a ranking model at each step; Algorithm 4.1 provides a formal description. The algorithm starts with a given model with weights θ_1 (Line 1); then it waits for a user-submitted query (Line 3). At this point a candidate ranker is sampled from the unit sphere around the current model (Line 4), and the current and candidate models both produce a ranking for the current query (Lines 5 and 6). These rankings are interleaved (Line 7) and displayed to the user (Line 8). If the interleaving method infers a preference for the candidate ranker from subsequent user interactions, the current model is updated towards the candidate (Line 10); otherwise no update is performed (Line 12). Thus, the model optimized by DBGD should converge and oscillate towards an optimum.
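A single DBGD iteration for a linear ranker can be sketched as follows; the interleaved comparison is abstracted into a callable, and all names are illustrative rather than taken from an existing implementation:

```python
import numpy as np

def dbgd_step(theta, delta, eta, prefers_candidate, rng):
    """One DBGD iteration (Lines 4-12 of Algorithm 4.1) for a linear ranker.

    prefers_candidate: callable that interleaves the rankings of the current
    and the candidate model, displays the result, and returns True if the
    clicks indicate a preference for the candidate.
    """
    # Sample a uniformly random direction on the unit sphere (Line 4).
    direction = rng.normal(size=theta.shape)
    direction /= np.linalg.norm(direction)
    theta_candidate = theta + delta * direction
    if prefers_candidate(theta, theta_candidate):
        return theta + eta * (theta_candidate - theta)  # Line 10
    return theta  # Line 12: no update
```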
Unlike PDGD, DBGD has proven regret bounds [132], potentially providing an advantage in the form of theoretical guarantees. In this section we answer RQ4.1 by critically looking at the assumptions that form the basis of DBGD's proven regret bounds.

The original DBGD paper [132] proved a sublinear regret under several assumptions. DBGD works with the parameterized space of ranking functions W; that is, every θ ∈ W is a different set of parameters for a ranking function. For this chapter we will only consider deterministic linear models, because all existing OLTR work has dealt with them [40, 43, 82, 90, 111, 125, 132, 135]; we note, however, that the proof is easily extendable to neural networks where the output is a monotonic function applied to a linear combination of the last layer. Then there is assumed to be a concave utility function u: W → R; since this function is concave, there should only be a single instance of weights that is optimal, θ^*. Furthermore, this utility function is assumed to be L-Lipschitz smooth:

∃L ∈ R, ∀(θ_a, θ_b) ∈ W, |u(θ_a) − u(θ_b)| < L‖θ_a − θ_b‖. (4.1)

We will show that these assumptions are incorrect: there is an infinite number of optimal weights, and the utility function u cannot be L-Lipschitz smooth. Our proof relies on two assumptions that avoid cases where the ranking problem is trivial. First, the zero ranker is not the optimal model:

θ^* ≠ 0. (4.2)

Second, there should be at least two models with different utility values:

∃(θ, θ') ∈ W, u(θ) ≠ u(θ'). (4.3)

We start by defining the set of rankings a model f(·, θ) will produce as:

R_D(f(·, θ)) = \{R \mid ∀(d, d') ∈ D, [f(d, θ) > f(d', θ) → d ≻_R d']\}. (4.4)

It is easy to see that multiplying a model by a positive scalar α > 0 will not affect this set:

∀α ∈ R_{>0}, R_D(f(·, θ)) = R_D(αf(·, θ)). (4.5)

Consequently, the utility of both functions will be equal:

∀α ∈ R_{>0}, u(f(·, θ)) = u(αf(·, θ)). (4.6)

For linear models, scaling the weights has the same effect: αf(·, θ) = f(·, αθ). Thus, the first assumption cannot be true, since for any optimal model f(·, θ^*) there is an infinite set of equally optimal models: {f(·, αθ^*) | α ∈ R_{>0}}.

Then, regarding L-Lipschitz smoothness, using any positive scaling factor:

∀α ∈ R_{>0}, |u(θ_a) − u(θ_b)| = |u(αθ_a) − u(αθ_b)|, (4.7)
∀α ∈ R_{>0}, ‖αθ_a − αθ_b‖ = α‖θ_a − θ_b‖. (4.8)

Thus the smoothness assumption can be rewritten as:

∃L ∈ R, ∀α ∈ R_{>0}, ∀(θ_a, θ_b) ∈ W, |u(θ_a) − u(θ_b)| < αL‖θ_a − θ_b‖. (4.9)

However, there is always an infinite number of values for α small enough to break the assumption. Therefore, we conclude that a concave L-Lipschitz smooth utility function can never exist for a deterministic linear ranking model; thus the proof for the regret bounds is not applicable when using deterministic linear models.

Consequently, the regret bounds of DBGD do not apply to the ranking problems in previous work. One may consider other models (e.g., spherical-coordinate-based models or stochastic ranking models); however, this still means that for the simplest and most common ranking problems there are no proven regret bounds. As a result, we answer RQ4.1 negatively: the regret bounds of DBGD do not provide a benefit over PDGD for the ranking problems in LTR.
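The scaling argument is easy to verify empirically: for a deterministic linear ranker, any positive rescaling of the weights produces exactly the same ranking, and hence the same utility. A small illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
theta = rng.normal(size=10)        # weights of a linear ranking model
docs = rng.normal(size=(50, 10))   # feature vectors of one query's documents

for alpha in (0.001, 1.0, 1000.0):
    ranking = np.argsort(-(docs @ (alpha * theta)))
    print(alpha, ranking[:5])      # the same top-5 for every positive alpha
```

Since u(θ) = u(αθ) while ‖αθ_a − αθ_b‖ shrinks with α, no finite Lipschitz constant can bound the unchanged utility differences.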
The Pairwise Differentiable Gradient Descent (PDGD) [82] algorithm is formally described in Algorithm 4.2. PDGD interprets a ranking function f(·, θ) as a probability distribution over documents by applying a Plackett-Luce model:

P(d \mid D, θ) = \frac{e^{f(d, θ)}}{\sum_{d' \in D} e^{f(d', θ)}}. (4.10)

Algorithm 4.2 Pairwise Differentiable Gradient Descent (PDGD).
 1: Input: initial weights θ_1; scoring function f; learning rate η.
 2: for t ← 1 … ∞ do
 3:   q_t ← receive_query(t)                // obtain a query from a user
 4:   R_t ← sample_list(f_{θ_t}, D_{q_t})   // sample list according to Eq. 4.10
 5:   c_t ← receive_clicks(R_t)             // show result list to the user
 6:   ∇f(·, θ_t) ← 0                        // initialize gradient
 7:   for d_i ≻_c d_j ∈ c_t do
 8:     w ← ρ(d_i, d_j, R, D)               // initialize pair weight (Eq. 4.13)
 9:     w ← w × P(d_i ≻ d_j | θ_t) P(d_j ≻ d_i | θ_t)   // pair gradient (Eq. 4.12)
10:     ∇f(·, θ_t) ← ∇f(·, θ_t) + w × (f'(d_i, θ_t) − f'(d_j, θ_t))   // model gradient (Eq. 4.12)
11:   θ_{t+1} ← θ_t + η∇f(·, θ_t)           // update the ranking model

First, the algorithm waits for a user query (Line 3); then a ranking R is created by sampling documents without replacement (Line 4). PDGD then observes clicks from the user and infers pairwise document preferences from them. All documents preceding a clicked document, and the first document succeeding it, are assumed to be observed by the user. Preferences between clicked and unclicked observed documents are inferred by PDGD; this is a long-standing assumption in pairwise LTR [54]. We denote an inferred preference between documents as d_i ≻_c d_j, and the probability of the model placing d_i earlier than d_j is denoted and calculated by:

P(d_i ≻ d_j \mid θ) = \frac{e^{f(d_i, θ)}}{e^{f(d_i, θ)} + e^{f(d_j, θ)}}. (4.11)

The gradient is estimated as a sum over inferred preferences with a weight ρ per pair:

∇f(·, θ) ≈ \sum_{d_i ≻_c d_j} ρ(d_i, d_j, R, D) [∆P(d_i ≻ d_j \mid θ)]
         = \sum_{d_i ≻_c d_j} ρ(d_i, d_j, R, D) P(d_i ≻ d_j \mid θ) P(d_j ≻ d_i \mid θ) (f'(d_i, θ) − f'(d_j, θ)). (4.12)

After computing the gradient (Line 10), the model is updated accordingly (Line 11). This will change the distribution (Equation 4.10) towards the inferred preferences. This distribution models the confidence over which documents should be placed first; the exploration of PDGD is naturally guided by this confidence and can vary per query.

The weighting function ρ is used to make the gradient of PDGD unbiased w.r.t. document pair preferences. It uses the reversed pair ranking R^*(d_i, d_j, R), which is the same ranking as R but with the document positions of d_i and d_j swapped. Then ρ is the ratio between the probabilities of R and R^*:

ρ(d_i, d_j, R, D) = \frac{P(R^*(d_i, d_j, R) \mid D)}{P(R \mid D) + P(R^*(d_i, d_j, R) \mid D)}. (4.13)

In Chapter 3, the weighted gradient is proven to be unbiased w.r.t. document pair preferences under certain assumptions about the user. Here, this unbiasedness is defined by being able to rewrite the gradient as:

E[∆f(·, θ)] = \sum_{(d_i, d_j) \in D} α_{ij} (f'(d_i, θ) − f'(d_j, θ)), (4.14)

with the sign of α_{ij} agreeing with the preference of the user:

sign(α_{ij}) = sign(relevance(d_i) − relevance(d_j)). (4.15)

The proof in Chapter 3 only relies on the difference in the probabilities of inferring a preference d_i ≻_c d_j in R and the opposite preference d_j ≻_c d_i in R^*(d_i, d_j, R):

sign(P(d_i ≻_c d_j \mid R) − P(d_j ≻_c d_i \mid R^*)) = sign(relevance(d_i) − relevance(d_j)). (4.16)

As long as Equation 4.16 is true, Equations 4.14 and 4.15 hold as well. Interestingly, this means that other assumptions about the user can be made than in Chapter 3, and other variations of PDGD are possible; e.g., the algorithm could assume that all documents are observed and the proof would still hold.

Chapter 3 reports large improvements over DBGD; however, these improvements were observed under simulated cascading user models. This means that the assumptions that PDGD makes about which documents are observed are always true. As a result, it is currently unclear whether the method is really better in cases where the assumption does not hold.
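The weight of Equation 4.13 only requires the Plackett-Luce probabilities of R and of the reversed pair ranking. A direct sketch, computed in log-space for numerical stability (all names illustrative):

```python
import numpy as np

def log_pl_prob(scores, ranking):
    """Log-probability of `ranking` under the Plackett-Luce model (Eq. 4.10)."""
    log_p, remaining = 0.0, list(range(len(scores)))
    for d in ranking:
        s = scores[remaining]
        # log P(d | remaining) = f(d) - logsumexp over remaining documents.
        log_p += scores[d] - (s.max() + np.log(np.sum(np.exp(s - s.max()))))
        remaining.remove(d)
    return log_p

def pair_weight(scores, ranking, i, j):
    """rho of Eq. 4.13 for the documents at positions i and j of `ranking`."""
    reversed_ranking = list(ranking)
    reversed_ranking[i], reversed_ranking[j] = (reversed_ranking[j],
                                                reversed_ranking[i])
    log_r = log_pl_prob(scores, ranking)
    log_r_star = log_pl_prob(scores, reversed_ranking)
    # P(R*) / (P(R) + P(R*)) = 1 / (1 + exp(log P(R) - log P(R*))).
    return 1.0 / (1.0 + np.exp(log_r - log_r_star))
```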
In this section we detail the experiments that were performed to answer the research questions in Section 4.1.

Our experiments are performed over three large labelled datasets from commercial search engines, the largest publicly available LTR datasets. These datasets are the MSLR-WEB10K [95], Yahoo! Webscope [17], and Istella [27] datasets. Each contains a set of queries with corresponding preselected document sets. Query-document pairs are represented by feature vectors and five-grade relevance annotations ranging from not relevant (0) to perfectly relevant (4). Together, the datasets contain over 29,900 queries and between 136 and 700 features per representation.
In order to simulate user behavior we partly follow the standard setup for OLTR [38, 40, 90, 111, 137]. (The resources for reproducing the experiments in this chapter are available at https://github.com/HarrieO/OnlineLearningToRank.) At each step, a user-issued query is simulated by uniformly sampling from the datasets. The algorithm then decides what result list to display to the user; the result list is limited to k = 10 documents. Then user interactions are simulated using click models [20]. Past OLTR work has only considered cascading click models [36]; in contrast, we also use non-cascading click models. The probability of a click is conditioned on relevance and observance:

P(click(d) \mid relevance(d), observed(d)). (4.17)

We use two levels of noise to simulate perfect user behavior and almost random behavior [39]; Table 4.1 lists the probabilities of both. The perfect user observes all documents, never clicks on anything non-relevant, and always clicks on the most relevant documents. Two variants of almost random behavior are used. The first is based on cascading behavior: the user first observes the top document, then decides to click according to Table 4.1. If a click occurs, then, with probability P(stop | click) = 0. the user stops looking at more documents; otherwise the process continues on the next document. The second almost random behavior is simulated in a non-cascading way; here we follow [58] and model the observing probabilities as:

P(observed(d) \mid rank(d)) = \frac{1}{rank(d)}. (4.18)

Table 4.1: Click probabilities for simulated perfect or almost random behavior, listing P(click(d) | relevance(d), observed(d)) per relevance grade (0-4) for the perfect and almost random instantiations.

The important distinction is that it is safe to assume that the cascading user has observed all documents ranked before a click, while this is not necessarily true for the non-cascading user. Since PDGD makes this assumption, testing under both models can show us how much of its performance relies on this assumption. Furthermore, the almost random models have an extreme level of noise and position bias compared to the click models used in previous OLTR work [40, 90, 111], and we argue they simulate an (almost) worst-case scenario.

In our experiments we simulate runs consisting of 1,000,000 impressions; each run was repeated 125 times under each of the three click models. PDGD was run with η = 0. and zero initialization; DBGD was run using Probabilistic Interleaving [90] with zero initialization, η = 0., and the unit sphere with δ = 1. Other variants like Multileave Gradient Descent [111] are not included; previous work has shown that their performance matches that of regular DBGD after around 30,000 impressions [82, 90, 111]. The initial boost in performance comes at a large computational cost, though, as the fastest approaches keep track of at least 50 ranking models [90], which makes running long experiments extremely impractical. Instead, we introduce a novel oracle version of DBGD where, instead of interleaving, the NDCG values on the current query are calculated and the highest scoring model is selected. This simulates a hypothetical perfect interleaving method, and we argue that the performance of this oracle run indicates an upper bound on DBGD performance.

Performance is measured by NDCG@10 on a held-out test set; a two-sided t-test is performed for significance testing. We do not consider the user experience during training, because Chapter 3 has already investigated this aspect thoroughly.
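The non-cascading behavior can be simulated by sampling examination independently per rank according to Equation 4.18. A brief sketch, with illustrative near-uniform click probabilities standing in for the almost random entries of Table 4.1:

```python
import numpy as np

def simulate_noncascading_clicks(rel_labels, p_click_given_rel, rng):
    """Non-cascading clicks: examination depends only on rank (Eq. 4.18)."""
    ranks = np.arange(1, len(rel_labels) + 1)
    examined = rng.random(len(rel_labels)) < 1.0 / ranks
    p_click = np.array([p_click_given_rel[r] for r in rel_labels])
    return examined & (rng.random(len(rel_labels)) < p_click)

rng = np.random.default_rng(3)
almost_random = [0.4, 0.45, 0.5, 0.55, 0.6]  # illustrative values per grade
print(simulate_noncascading_clicks([0, 4, 1, 0, 3], almost_random, rng))
```

Unlike the cascade simulation, a document below an unclicked document can be examined here without every preceding document having been observed, which is exactly the condition that violates PDGD's observation assumption.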
4.6. Experimental Results and Analysis

Recall that in Section 4.3.2 we have already provided a negative answer to RQ4.1: the regret bounds of DBGD do not provide a benefit over PDGD for the common ranking problem in LTR. In this section we present our experimental results and answer
RQ4.2 (whether the advantages of PDGD over DBGD found in previous work generalize to extreme levels of noise and bias) and
RQ4.3 (whether the performance of PDGD is reproducible under non-cascading user behavior).

Our main results are presented in Table 4.2. Additionally, Figure 4.1 displays the average performance over 1,000,000 impressions. First, we consider the performance of DBGD; there is a substantial difference between its performance under the perfect and almost random user models on all datasets. Thus, it seems that DBGD is strongly affected by noise and bias in interactions; interestingly, there is little difference between performance under the cascading and non-cascading behavior. On all datasets the oracle version of DBGD performs significantly better than DBGD under perfect user behavior. This means there is still room for improvement, and hypothetical improvements in, e.g., interleaving could lead to significant increases in long-term DBGD performance.

Next, we look at the performance of PDGD; here, there is also a significant difference between performance under the perfect and almost random user models on all datasets. However, the effect of noise and bias is very limited compared to DBGD, and the difference at 1,000,000 impressions is always small on any dataset. To answer
RQ4.2, we compare the performance of DBGD and PDGD. Across all datasets, when comparing DBGD and PDGD under the same levels of interaction noise and bias, the performance of PDGD is significantly better in every case. Furthermore, PDGD under the perfect user model significantly outperforms the oracle run of DBGD, despite the latter being able to directly observe the NDCG of rankers on the current query. Moreover, when comparing PDGD's performance under the almost random user model with DBGD under the perfect user model, we see that the differences are limited and go in both directions. Thus, even under ideal circumstances DBGD does not consistently outperform PDGD under extremely difficult circumstances. As a result, we answer
RQ4.2 positively: our results strongly indicate that the performance of PDGD is considerably better than that of DBGD, and that these findings generalize from ideal circumstances to settings with extreme levels of noise and bias. Finally, to answer
RQ4.3, we look at the performance under the two almost random user models. Surprisingly, there is no clear difference between the performance of PDGD under cascading and non-cascading user behavior. The differences are small, and per dataset it differs which circumstances are slightly preferred. Therefore, we answer
RQ4.3 positively: the performance of PDGD is reproducible under non-cascading user behavior.
4.7. Conclusion

In this chapter, we have reproduced and generalized findings about the relative performance of Dueling Bandit Gradient Descent (DBGD) and Pairwise Differentiable Gradient Descent (PDGD). Our results show that the performance of PDGD is reproducible under non-cascading user behavior.
Table 4.2: Performance (NDCG@10) after 1,000,000 impressions for DBGD and PDGD under a perfect click model and two almost-random click models (cascading and non-cascading), and for DBGD with an oracle comparator. Standard deviations are shown in brackets. Significant improvements and losses are indicated by ▲ and ▼, respectively; ◦ indicates no significant difference. Indications are in order of: oracle, perfect, cascading, and non-cascading.
                 Yahoo            MSLR             Istella
Dueling Bandit Gradient Descent
oracle           (0.001) ▼ ▲ ▲    (0.004) ▼ ▲ ▲    (0.001) ▼ ▲ ▲
perfect          (0.002) ▼ ◦ ◦    (0.004) ▼ ▲ ▲    (0.002) ▼ ▼ ▼
cascading        (0.008) ▼ ▼ ▼    (0.006) ▼ ▼ ▼    (0.014) ▼ ▼ ▼
non-cascading    (0.010) ▼ ▼ ▼    (0.014) ▼ ▼ ▼    (0.014) ▼ ▼ ▼
Pairwise Differentiable Gradient Descent
perfect          (0.001) ▲ ▲ ▲ ▲    (0.003) ▲ ▲ ▲ ▲    (0.000) ▲ ▲ ▲ ▲
cascading        (0.003) ▼ ◦ ▲ ▲    (0.007) ▼ ▼ ▲ ▲    (0.003) ▼ ▲ ▲ ▲
non-cascading    (0.003) ▼ ◦ ▲ ▲    (0.005) ▼ ▼ ▲ ▲    (0.003) ▼ ▲ ▲ ▲

Furthermore, PDGD outperforms DBGD in both ideal and extremely difficult circumstances with high levels of noise and bias. Moreover, the performance of PDGD in extremely difficult circumstances is comparable to that of DBGD in ideal circumstances. Additionally, we have shown that the regret bounds of DBGD are not applicable to the common ranking problem in LTR. In summary, our results strongly confirm the previous finding that PDGD consistently outperforms DBGD, and generalize this conclusion to circumstances with extreme levels of noise and bias.

With these findings we can answer RQ3 mostly negatively: the theory behind DBGD is not sound for the common deterministic ranking problem; moreover, DBGD shows extremely poor performance compared to the PDGD method under varying conditions. Consequently, there appears to be no advantage to using DBGD over PDGD in either theoretical or empirical terms. In addition, a decade of OLTR work has attempted to extend DBGD in numerous ways without leading to any measurable long-term improvements. Together, this suggests that the general approach of DBGD-based methods, i.e., sampling models and comparing with online evaluation, is not an effective way of optimizing ranking models. Although the PDGD method considerably outperforms the DBGD approach, we currently do not have a theoretical explanation for this difference. Thus it seems plausible that a more effective OLTR method could be derived if the theory behind the effectiveness of OLTR methods were better understood. Due to this potential and the current lack of regret bounds applicable to OLTR, we argue that a theoretical analysis of OLTR would be a very valuable future contribution to the field.

Finally, we consider the limitations of the comparison in this chapter. As is standard in OLTR, our results are based on simulated user behavior. These simulations provide valuable insights: they enable direct control over biases and noise, and evaluation can be performed at each time step. In this chapter, the generalizability of this setup was pushed the furthest by varying the conditions to be extremely difficult. It appears unlikely that more reliable conclusions can be reached from simulated behavior. Thus we argue that the most valuable future comparisons would be in experimental settings with real users. Furthermore, with the performance improvements of PDGD, the time seems right for evaluating the effectiveness of OLTR in real-world applications.

The limited theoretical guarantees regarding OLTR methods prompted the second part of this thesis, where we consider counterfactual LTR. In contrast with OLTR, counterfactual LTR methods are founded on assumed models of user behavior and are proven to unbiasedly optimize ranking metrics if the assumed models are correct. Despite these theoretical strengths, empirical comparisons in previous work show that PDGD is more robust than existing counterfactual LTR methods.
In Chapter 8 we introduce a counterfactual LTR method that can reach the same levels of performance as PDGD when applied online.
Figure 4.1: Performance (NDCG@10) on held-out data from the Yahoo (top), MSLR (center), and Istella (bottom) datasets, under the perfect and almost random user models: cascading (casc.) and non-cascading (non-casc.). The shaded areas display the standard deviation.

[Figure 4.1 plots NDCG over iterations for PDGD (perfect, casc., non-casc.) and DBGD (oracle, perfect, casc., non-casc.).]

4.A. Notation Reference for Chapter 4

Notation: Description
t: a timestep
q: a user-issued query
d, d_k, d_l: documents
d (feature vector): feature representation of a query-document pair
D: set of documents
R: ranked list
I_t: an interleaved result list
R*: the reversed pair ranking R*(d_k, d_l, R)
ρ: preference pair weighting function
θ: parameters of the ranking model
f_θ(·): ranking model with parameters θ
f_θ(d_k): ranking score for a document from the model
c_t: a binary vector representing the clicks at timestep t

Part II: A Single Framework for Online and Counterfactual Learning to Rank

5. Policy-Aware Counterfactual Learning to Rank for Top-k Rankings
Counterfactual Learning to Rank (LTR) methods optimize ranking systems using logged user interactions that contain interaction biases. Existing methods are only unbiased if users are presented with all relevant items in every ranking. However, in prevalent top-k ranking settings not all items can be displayed at once. Therefore, there currently exists no counterfactual unbiased LTR method for top-k rankings. In this chapter we address this limitation by asking the thesis research question:
RQ4 Can counterfactual LTR be extended to top-k ranking settings?

We introduce a novel policy-aware counterfactual estimator for LTR metrics that can account for the effect of a stochastic logging policy. We prove that the policy-aware estimator is unbiased if every relevant item has a non-zero probability to appear in the top-k ranking. Our experimental results show that the performance of our estimator is not affected by the size of k: for any k, the policy-aware estimator reaches the same retrieval performance while learning from top-k feedback as when learning from feedback on the full ranking.

While the policy-aware estimator allows us to learn from top-k feedback, there is no theoretically-grounded way to optimize for top-k ranking metrics. Furthermore, existing counterfactual LTR work has mostly used novel loss functions for optimization, which are quite different from those used in supervised LTR. This led us to ask the following thesis research question:
RQ5 Is it possible to apply state-of-the-art supervised LTR to the counterfactual LTR problem?

In this chapter, we also introduce novel extensions of supervised LTR methods to perform counterfactual LTR and to optimize top-k metrics. Together, our contributions introduce the first policy-aware unbiased LTR approach that learns from top-k feedback and optimizes top-k metrics. As a result, counterfactual LTR is now applicable to the very prevalent top-k ranking setting in search and recommendation. This chapter was published as [86]. Appendix 5.A gives a reference for the notation used in this chapter.

5.1. Introduction
LTR optimizes ranking systems to provide high-quality rankings. Interest in LTR from user interactions has greatly increased in recent years with the introduction of unbiased LTR methods [58, 127]. The potential for learning from logged user interactions is great: user interactions provide valuable implicit feedback while also being cheap and relatively easy to acquire at scale [57]. However, interaction logs also contain large amounts of bias, which is the result of both user behavior and the ranker used during logging. For instance, users are more likely to examine items at the top of rankings; consequently, the display position of an item heavily affects the number of interactions it receives [128]. This effect is called position bias and it is very dominant when learning from interactions with rankings. Naively ignoring it during learning can be detrimental to ranking performance, as the learning process is then strongly impacted by which rankings were displayed during logging instead of true user preferences. The goal of unbiased LTR methods is to optimize a ranker w.r.t. the true user preferences; consequently, they have to account and correct for such forms of bias.

Previous work on unbiased LTR has mainly focused on accounting for position bias through counterfactual learning [5, 58, 127]. The prevalent approach models the probability of a user examining an item in a displayed ranking. This probability can be inferred from user interactions [4, 5, 58, 127, 128] and corrected for using inverse propensity scoring. As a result, these methods optimize a loss that in expectation is unaffected by the examination probabilities during logging; hence it is unbiased w.r.t. position bias.

This approach has been applied effectively in various ranking settings, including search for scientific articles [58], email [127] or other personal documents [128]. However, a limitation of existing approaches is that in every logged ranking they require every relevant item to have a non-zero chance of being examined [16, 58]. In this chapter, we focus on top-k rankings, where the number of displayed items is systematically limited. These rankings can display at most k items, making it practically unavoidable that relevant items are missing. Consequently, existing counterfactual LTR methods are not unbiased in these settings. We recognize this problem as item-selection bias, introduced by the selection of (only) k items to display. This is especially concerning since top-k rankings are quite prevalent, e.g., in recommendation [26, 48], mobile search [9, 124], query autocompletion [14, 127, 128], and digital assistants [112].

Our main contribution is a novel policy-aware estimator for counterfactual LTR that accounts for both a stochastic logging policy and the users' examination behavior. Our policy-aware approach can be viewed as a generalization of the existing counterfactual LTR framework [2, 58]. We prove that our policy-aware approach performs unbiased LTR and evaluation while learning from top-k feedback. Our experimental results show that while our policy-aware estimator is unaffected by the choice of k, the existing policy-oblivious approach is strongly affected even under large values of k. For instance, optimization with the policy-aware estimator on top-5 feedback reaches the same performance as when receiving feedback on all results.
Furthermore, because top-k metrics are the only relevant metrics in top-k rankings, we also propose extensions to traditional LTR approaches that are proven to optimize top-k metrics unbiasedly, and we introduce a pragmatic way to choose optimally between available loss functions.

This chapter is based around two main contributions:
1. A novel estimator for unbiased LTR from top-k feedback.
2. Unbiased losses that optimize bounds on top-k LTR metrics.

To the best of our knowledge, our policy-aware estimator is the first estimator that is unbiased in top-k ranking settings.

5.2. Background

In this section we discuss supervised LTR and counterfactual LTR [58].
5.2.1. Supervised learning to rank

The goal of LTR is to optimize ranking systems w.r.t. specific ranking metrics. Ranking metrics generally involve items d, their relevance r w.r.t. a query q, and their position in the ranking R produced by the system. We will optimize the Empirical Risk [121] over the set of queries Q, with a loss Δ(R_i | q_i, r) for a single query q_i:

\mathcal{L} = \frac{1}{|Q|} \sum_{q_i \in Q} \Delta(R_i \mid q_i, r).  (5.1)

For simplicity we assume that relevance is binary, r(q, d) ∈ {0, 1}; for brevity we write r(q, d) = r(d). Ranking metrics then commonly take the form of a sum over items:

\Delta(R \mid q, r) = \sum_{d \in R} \lambda(d \mid R) \cdot r(d),  (5.2)

where λ can be chosen for a specific metric, e.g., for Average Relevance Position (ARP) or Discounted Cumulative Gain (DCG):

\lambda_{\text{ARP}}(d \mid R) = \text{rank}(d \mid R),  (5.3)
\lambda_{\text{DCG}}(d \mid R) = -\log_2\big(1 + \text{rank}(d \mid R)\big)^{-1}.  (5.4)

In a so-called full-information setting, where the relevance values r are known, optimization can be done through traditional LTR methods [13, 54, 75, 129].

5.2.2. Counterfactual learning to rank

Optimizing a ranking loss from the implicit feedback in interaction logs requires a different approach from supervised LTR. We will assume that clicks are gathered using a logging policy π, with the probability of displaying ranking R̄ for query q denoted as π(R̄ | q). Let o_i(d) ∈ {0, 1} indicate whether d was examined by a user at interaction i, with o_i(d) ∼ P(o(d) | q_i, r, R̄_i). Furthermore, we assume that users click on all relevant items they observe and nothing else: c_i(d) = [r(d) ∧ o_i(d)]. Our goal is to find an estimator Δ̂ that provides an unbiased estimate of the actual loss; for N interactions this estimate is:

\hat{\mathcal{L}} = \frac{1}{N} \sum_{i=1}^{N} \hat{\Delta}(R_i \mid q_i, \bar{R}_i, \pi, c_i).  (5.5)

We write R_i for the ranking produced by the system for which the loss is being computed, while R̄_i is the ranking that was displayed when logging interaction i. For brevity we drop i from our notation when only a single interaction is involved. A naive estimator could simply consider every click to indicate relevance:

\hat{\Delta}_{\text{naive}}(R \mid q, c) = \sum_{d : c(d) = 1} \lambda(d \mid R).  (5.6)

Taking the expectation over the displayed ranking and observance variables results in the following expected loss:

\mathbb{E}_{o, \bar{R}}\big[\hat{\Delta}_{\text{naive}}(R \mid q, c)\big]
= \mathbb{E}_{o, \bar{R}}\Big[\sum_{d \in R} \lambda(d \mid R) \cdot c(d)\Big]
= \mathbb{E}_{o, \bar{R}}\Big[\sum_{d \in R} o(d) \cdot \lambda(d \mid R) \cdot r(d)\Big]  (5.7)
= \mathbb{E}_{\bar{R}}\Big[\sum_{d \in R} P(o(d) = 1 \mid q, r, \bar{R}) \cdot \lambda(d \mid R) \cdot r(d)\Big]
= \sum_{\bar{R} \in \pi(\cdot \mid q)} \pi(\bar{R} \mid q) \cdot \sum_{d \in R} P(o(d) = 1 \mid q, r, \bar{R}) \cdot \lambda(d \mid R) \cdot r(d).

Here, the effect of position bias is very clear: in expectation, items are weighted according to their probability of being examined. Furthermore, it shows that examination probabilities are determined by both the logging policy π and user behavior P(o(d) | q, r, R̄).

In order to avoid the effect of position bias, Joachims et al. [58] introduced an inverse-propensity-scoring estimator in the same vein as previous work by Wang et al. [127]. The main idea behind this estimator is that if the examination probabilities are known, then they can be corrected for per click:

\hat{\Delta}_{\text{oblivious}}(R \mid q, c, \bar{R}) = \sum_{d : c(d) = 1} \frac{\lambda(d \mid R)}{P(o(d) = 1 \mid q, r, \bar{R})}.  (5.8)
In contrast to the naive estimator (Eq. 5.6), this policy-oblivious estimator (Eq. 5.8) can provide an unbiased estimate of the loss:

\mathbb{E}_{o, \bar{R}}\big[\hat{\Delta}_{\text{oblivious}}(R \mid q, c, \bar{R})\big]
= \mathbb{E}_{o, \bar{R}}\Big[\sum_{d : c(d) = 1} \frac{\lambda(d \mid R)}{P(o(d) = 1 \mid q, r, \bar{R})}\Big]
= \sum_{d \in R} \mathbb{E}_{o, \bar{R}}\Big[\frac{o(d) \cdot \lambda(d \mid R) \cdot r(d)}{P(o(d) = 1 \mid q, r, \bar{R})}\Big]
= \sum_{d \in R} \mathbb{E}_{\bar{R}}\Big[\frac{P(o(d) = 1 \mid q, r, \bar{R}) \cdot \lambda(d \mid R) \cdot r(d)}{P(o(d) = 1 \mid q, r, \bar{R})}\Big]
= \sum_{d \in R} \lambda(d \mid R) \cdot r(d) = \Delta(R \mid q, r).  (5.9)

We note that the last step assumes P(o(d) = 1 | q, r, R̄) > 0, and that only relevant items (r(d) = 1) contribute to the estimate [58]. Therefore, this estimator is unbiased as long as the examination probabilities are positive for every relevant item:

\forall d, \forall \bar{R} \in \pi(\cdot \mid q): \; r(d) = 1 \rightarrow P(o(d) = 1 \mid q, r, \bar{R}) > 0.  (5.10)

Intuitively, this condition exists because propensity weighting is applied to items clicked in the displayed ranking, and items that cannot be observed can never receive clicks. Thus, there are no clicks that can be weighted more heavily to adjust for the zero observance probability of an item.

An advantageous property of the policy-oblivious estimator Δ̂_oblivious is that the logging policy π does not have to be known. That is, as long as Condition 5.10 is met, it works regardless of how interactions were logged. Additionally, Joachims et al. [58] proved that it is still unbiased under click noise. Virtually all recent counterfactual LTR methods use the policy-oblivious estimator for LTR optimization [3–5, 58, 127, 128].
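For concreteness, a minimal sketch (ours) of the policy-oblivious estimator of Eq. 5.8 in Python, with the DCG-based weight of Eq. 5.4; all inputs are assumed precomputed:

```python
import numpy as np

def lambda_dcg(rank):
    # DCG-based loss weight (Eq. 5.4); `rank` starts at 1.
    return -1.0 / np.log2(1.0 + rank)

def delta_oblivious(new_ranks, clicks, exam_probs, lambda_fn=lambda_dcg):
    """new_ranks[d]: rank of item d under the ranker being evaluated;
    clicks[d]: 1 if d was clicked in the logged interaction;
    exam_probs[d]: P(o(d)=1 | q, r, displayed ranking)."""
    estimate = 0.0
    for d, c in enumerate(clicks):
        if c:
            estimate += lambda_fn(new_ranks[d]) / exam_probs[d]
    return estimate
```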
5.3. Learning from Top-k Feedback

In this section we explain why the existing policy-oblivious counterfactual LTR framework is not applicable to top-k rankings. Subsequently, we propose a novel solution through policy-aware propensity scoring that takes the logging policy into account.

5.3.1. Item-selection bias in top-k feedback

An advantage of the existing policy-oblivious estimator for counterfactual LTR described in Section 5.2.2 is that the logging policy does not need to be known, making its application easier. However, the policy-oblivious estimator is only unbiased when Condition 5.10 is met: every relevant item must have a non-zero probability of being observed in every ranking displayed during logging.
We recognize that in top-k rankings, where only k items can be displayed, relevant items may systematically lack non-zero examination probabilities. This happens because items outside the top-k cannot be examined by the user:

\forall d, \forall \bar{R}: \; \text{rank}(d \mid \bar{R}) > k \rightarrow P(o(d) = 1 \mid q, r, \bar{R}) = 0.  (5.11)

In most top-k ranking settings it is very unlikely that Condition 5.10 is satisfied: if k is very small, the number of relevant items is large, or the logging policy π is ineffective at retrieving relevant items, it is unlikely that all relevant items will be displayed in the top-k positions. Moreover, for a small value of k the performance of the logging policy π has to be near ideal for all relevant items to be displayed. We call this effect item-selection bias, because in this setting the logging ranker makes a selection of which k items to display, in addition to the order in which to display them (position bias). The existing policy-oblivious estimator for counterfactual LTR (as described in Section 5.2.2) cannot correct for item-selection bias when it occurs, and can thus be affected by this bias when applied to top-k rankings.

Item-selection bias is inevitable in a single top-k ranking, due to the limited number of items that can be displayed. However, across multiple top-k rankings more than k items can be displayed if the displayed rankings differ enough. Thus, a stochastic logging policy can provide every item with a non-zero probability to appear in the top-k ranking. Then, the probability of examination can be calculated as an expectation over the displayed ranking:

P(o(d) = 1 \mid q, r, \pi) = \mathbb{E}_{\bar{R}}\big[P(o(d) = 1 \mid q, r, \bar{R})\big] = \sum_{\bar{R} \in \pi(\cdot \mid q)} \pi(\bar{R} \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R}).  (5.12)

This policy-dependent examination probability can be non-zero for all items, even if all items cannot be displayed in a single top-k ranking. Naturally, this leads to a policy-aware estimator:

\hat{\Delta}_{\text{aware}}(R \mid q, c, \pi) = \sum_{d : c(d) = 1} \frac{\lambda(d \mid R)}{P(o(d) = 1 \mid q, r, \pi)}.  (5.13)

By basing the propensity on the policy instead of the individual rankings, the policy-aware estimator can correct for zero observance probabilities in some displayed rankings by more heavily weighting clicks on other displayed rankings with non-zero observance probabilities. Thus, if a click occurs on an item that the logging policy rarely displays in a top-k ranking, this click may be weighted more heavily than a click on an item that is displayed in the top-k very often. In contrast, the policy-oblivious approach only corrects for the observation probability of the displayed ranking in which the click occurred; thus it does not correct for the fact that an item may be missing from the top-k in other displayed rankings.
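A minimal sketch (ours) of the policy-aware propensity of Eq. 5.12 and the estimator of Eq. 5.13; the logging policy is passed explicitly as its support, i.e., pairs of a display probability π(R̄ | q) and the per-item examination probabilities under that ranking:

```python
import numpy as np

def policy_aware_propensities(policy):
    """policy: list of (pi_R, exam_probs_R) pairs, one per ranking in the
    policy's support; exam_probs_R is an array over all items.
    Returns P(o(d)=1 | q, r, pi) for every item d (Eq. 5.12)."""
    return sum(pi_R * np.asarray(exam_R) for pi_R, exam_R in policy)

def delta_aware(new_ranks, clicks, policy, lambda_fn):
    # Eq. 5.13: weight each click by the inverse policy-aware propensity.
    propensities = policy_aware_propensities(policy)
    return sum(lambda_fn(new_ranks[d]) / propensities[d]
               for d, c in enumerate(clicks) if c)
```

In practice the support of π can be too large to enumerate; for the randomized top-k policy used later in this chapter the expectation has a simple closed form (Eq. 5.37).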
In expectation, the policy-aware estimator provides an unbiased estimate of the ranking loss:

\mathbb{E}_{o, \bar{R}}\big[\hat{\Delta}_{\text{aware}}(R \mid q, c, \pi)\big]
= \mathbb{E}_{o, \bar{R}}\Big[\sum_{d : c(d) = 1} \frac{\lambda(d \mid R)}{P(o(d) = 1 \mid q, r, \pi)}\Big]
= \sum_{d \in R} \mathbb{E}_{o, \bar{R}}\Big[\frac{o(d) \cdot \lambda(d \mid R) \cdot r(d)}{\sum_{\bar{R}' \in \pi(\cdot \mid q)} \pi(\bar{R}' \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R}')}\Big]
= \sum_{d \in R} \mathbb{E}_{\bar{R}}\Big[\frac{P(o(d) = 1 \mid q, r, \bar{R}) \cdot \lambda(d \mid R) \cdot r(d)}{\sum_{\bar{R}' \in \pi(\cdot \mid q)} \pi(\bar{R}' \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R}')}\Big]
= \sum_{d \in R} \frac{\sum_{\bar{R} \in \pi(\cdot \mid q)} \pi(\bar{R} \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R})}{\sum_{\bar{R}' \in \pi(\cdot \mid q)} \pi(\bar{R}' \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R}')} \cdot \lambda(d \mid R) \cdot r(d)
= \sum_{d \in R} \lambda(d \mid R) \cdot r(d) = \Delta(R \mid q, r).  (5.14)

In contrast to the policy-oblivious approach (Section 5.2.2), this proof is sound as long as every relevant item has a non-zero probability of being examined under the logging policy π:

\forall d: \; r(d) = 1 \rightarrow \sum_{\bar{R} \in \pi(\cdot \mid q)} \pi(\bar{R} \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R}) > 0.  (5.15)

It is easy to see that Condition 5.10 implies Condition 5.15; in other words, for all settings where the policy-oblivious estimator (Eq. 5.8) is unbiased, the policy-aware estimator (Eq. 5.13) is also unbiased. Conversely, Condition 5.15 does not imply Condition 5.10, thus there are cases where the policy-aware estimator is unbiased but the policy-oblivious estimator is not guaranteed to be.

To better understand for which policies Condition 5.15 is satisfied, we introduce a substitute Condition 5.16:

\forall d: \; r(d) = 1 \rightarrow \exists \bar{R} \, \big[\pi(\bar{R} \mid q) > 0 \wedge P(o(d) = 1 \mid q, r, \bar{R}) > 0\big].  (5.16)

Since Condition 5.16 is equivalent to Condition 5.15, we see that the policy-aware estimator is unbiased for any logging policy that provides a non-zero probability for every relevant item to appear in a position with a non-zero examination probability. Thus, to satisfy Condition 5.16 in a top-k ranking setting, every relevant item requires a non-zero probability of being displayed in the top-k.

As long as Condition 5.16 is met, a wide variety of policies can be chosen according to different criteria. Moreover, the policy can be deterministic if k is large enough to display every relevant item. Similarly, the policy-oblivious estimator can be seen as a special case of the policy-aware estimator where the policy is deterministic (or assumed to be). The big advantage of our policy-aware estimator is that it is applicable to a much larger number of settings than the existing policy-oblivious estimator, including those where feedback is only received on the top-k ranked items.

To better understand the difference between the policy-oblivious and policy-aware estimators, we introduce an illustrative example that contrasts the two. We consider a single query q and a logging policy π that chooses between two rankings to display, R̄₁ and R̄₂, with π(R̄₁ | q) > 0, π(R̄₂ | q) > 0, and π(R̄₁ | q) + π(R̄₂ | q) = 1.
Then, for a generic estimator we consider how it treats a single relevant item d_n with r(d_n) ≠ 0, using the expectation:

\mathbb{E}_{o, \bar{R}}\bigg[\frac{c(d_n) \cdot \lambda(d_n \mid R)}{\rho(o(d_n) = 1 \mid q, d_n, \bar{R}, \pi)}\bigg] = \lambda(d_n \mid R) \cdot r(d_n) \cdot \bigg(\frac{\pi(\bar{R}_1 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_1)}{\rho(o(d_n) = 1 \mid q, d_n, \bar{R}_1, \pi)} + \frac{\pi(\bar{R}_2 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_2)}{\rho(o(d_n) = 1 \mid q, d_n, \bar{R}_2, \pi)}\bigg),  (5.17)

where the propensity function ρ can be chosen to match either the policy-oblivious (Eq. 5.8) or the policy-aware (Eq. 5.13) estimator.

First, we examine the situation where d_n appears in the top-k of both rankings R̄₁ and R̄₂, so that it has a positive observance probability in both cases: P(o(d_n) = 1 | q, r, R̄₁) > 0 and P(o(d_n) = 1 | q, r, R̄₂) > 0. Here, the policy-oblivious estimator Δ̂_oblivious (Eq. 5.8) removes the effect of observation bias by adjusting for the observance probability per displayed ranking:

\bigg(\pi(\bar{R}_1 \mid q) \cdot \frac{P(o(d_n) = 1 \mid q, r, \bar{R}_1)}{P(o(d_n) = 1 \mid q, r, \bar{R}_1)} + \pi(\bar{R}_2 \mid q) \cdot \frac{P(o(d_n) = 1 \mid q, r, \bar{R}_2)}{P(o(d_n) = 1 \mid q, r, \bar{R}_2)}\bigg) \cdot \lambda(d_n \mid R) \cdot r(d_n) = \lambda(d_n \mid R) \cdot r(d_n).  (5.18)

The policy-aware estimator Δ̂_aware (Eq. 5.13) also corrects for the examination bias, but because its propensity scores are based on the policy instead of the individual rankings (Eq. 5.12), it uses the same score for both rankings:

\frac{\pi(\bar{R}_1 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_1) + \pi(\bar{R}_2 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_2)}{\pi(\bar{R}_1 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_1) + \pi(\bar{R}_2 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_2)} \cdot \lambda(d_n \mid R) \cdot r(d_n) = \lambda(d_n \mid R) \cdot r(d_n).  (5.19)

Next, we consider a different relevant item d_m with r(d_m) = r(d_n) that, unlike the previous situation, only appears in the top-k of R̄₁. Thus it only has a positive observance probability in R̄₁: P(o(d_m) = 1 | q, r, R̄₁) > 0 and P(o(d_m) = 1 | q, r, R̄₂) = 0. Consequently, no clicks can ever be received in R̄₂, i.e., R̄ = R̄₂ → c(d_m) = 0; thus the expectation for d_m only has to consider R̄₁:

\mathbb{E}_{o, \bar{R}}\bigg[\frac{c(d_m) \cdot \lambda(d_m \mid R)}{\rho(o(d_m) = 1 \mid q, d_m, \bar{R}, \pi)}\bigg] = \frac{\pi(\bar{R}_1 \mid q) \cdot P(o(d_m) = 1 \mid q, r, \bar{R}_1)}{\rho(o(d_m) = 1 \mid q, d_m, \bar{R}_1, \pi)} \cdot \lambda(d_m \mid R) \cdot r(d_m).  (5.20)

In this situation, Condition 5.10 is not satisfied, and correspondingly the policy-oblivious estimator (Eq. 5.8) does not give an unbiased estimate:

\frac{\pi(\bar{R}_1 \mid q) \cdot P(o(d_m) = 1 \mid q, r, \bar{R}_1)}{P(o(d_m) = 1 \mid q, r, \bar{R}_1)} \cdot \lambda(d_m \mid R) \cdot r(d_m) < \lambda(d_m \mid R) \cdot r(d_m).  (5.21)

Since the policy-oblivious estimator Δ̂_oblivious only corrects for the observance probability per displayed ranking, it is unable to correct for the zero probability in R̄₂, as no clicks on d_m can occur there. As a result, the estimate is affected by the logging policy π: the more item-selection bias π introduces (determined by π(R̄₂ | q)), the further the estimate will deviate.
Consequently, in expectation Δ̂_oblivious will biasedly estimate that d_n should be ranked higher than d_m, which is incorrect since both items are actually equally relevant.

In contrast, the policy-aware estimator Δ̂_aware (Eq. 5.13) avoids this issue because its propensities are based on the logging policy π. When calculating the probability of observance conditioned on π, P(o(d_m) = 1 | q, r, π) (Eq. 5.12), it takes into account that there is a π(R̄₂ | q) chance that d_m is not displayed to the user:

\frac{\pi(\bar{R}_1 \mid q) \cdot P(o(d_m) = 1 \mid q, r, \bar{R}_1)}{\pi(\bar{R}_1 \mid q) \cdot P(o(d_m) = 1 \mid q, r, \bar{R}_1)} \cdot \lambda(d_m \mid R) \cdot r(d_m) = \lambda(d_m \mid R) \cdot r(d_m).  (5.22)

Since in this situation Condition 5.16 is true (and therefore also Condition 5.15), we know beforehand that in expectation the policy-aware estimator is unaffected by position and item-selection bias.

This concludes our illustrative example; it was meant to contrast the behavior of the policy-aware and policy-oblivious estimators in two different situations. When there is no item-selection bias, i.e., an item is displayed in the top-k of all rankings the logging policy may display, both estimators provide unbiased estimates, albeit using different propensity scores. However, when there is item-selection bias, i.e., an item is not always present in the top-k, the policy-oblivious estimator Δ̂_oblivious no longer provides an unbiased estimate, while the policy-aware estimator Δ̂_aware is still unbiased w.r.t. both position bias and item-selection bias.
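To make the contrast concrete, here is a tiny numeric instance of the example with assumed values π(R̄₁ | q) = 0.8 and π(R̄₂ | q) = 0.2, and an examination probability of 0.5 for d_m in R̄₁ (and 0 in R̄₂):

```python
# Assumed illustrative values; d_m can only be examined in ranking R1.
pi_R1, pi_R2 = 0.8, 0.2
p_obs_R1, p_obs_R2 = 0.5, 0.0

# Expected propensity-weighted click mass for d_m, relative to its true
# contribution of 1.0 (no clicks ever arrive from R2, cf. Eq. 5.20):
oblivious = pi_R1 * p_obs_R1 / p_obs_R1                           # Eq. 5.21
aware = pi_R1 * p_obs_R1 / (pi_R1 * p_obs_R1 + pi_R2 * p_obs_R2)  # Eq. 5.22

print(oblivious, aware)  # 0.8 1.0: the oblivious estimate is short by pi(R2|q)
```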
5.4. Learning for Top-k Metrics

This section details how counterfactual LTR can be used to optimize top-k metrics, since these are the relevant metrics in top-k rankings.

5.4.1. Top-k metrics

Since top-k rankings only display the k highest ranked items to the user, the performance of a ranker in this setting is only determined by those items. Correspondingly, only top-k metrics matter here, where items beyond rank k have no effect:

\lambda_{\text{metric}@k}(d \mid R) = \begin{cases} \lambda_{\text{metric}}(d \mid R) & \text{if } \text{rank}(d \mid R) \leq k, \\ 0 & \text{if } \text{rank}(d \mid R) > k. \end{cases}  (5.23)

These metrics are commonly used in LTR since, usually, performance gains at the top of a ranking are the most important for the user experience. For instance, NDCG@k, the normalized version of DCG@k, is often used:

\lambda_{\text{DCG}@k}(d \mid R) = \begin{cases} -\log_2\big(1 + \text{rank}(d \mid R)\big)^{-1} & \text{if } \text{rank}(d \mid R) \leq k, \\ 0 & \text{if } \text{rank}(d \mid R) > k. \end{cases}  (5.24)

Generally in LTR, DCG is optimized in order to maximize NDCG [13, 129]. In unbiased LTR it is not trivial to estimate the normalization factor for NDCG, further motivating the optimization of DCG instead of NDCG [2, 16].

Importantly, top-k metrics bring two main challenges for LTR. First, the rank function is not differentiable, a problem for almost every LTR metric [75, 129]. Second, changes in a ranking beyond position k do not affect the metric's value, thus resulting in zero gradients. The first problem has been addressed in existing LTR methods; we now propose adaptations of these methods that address the second issue as well.

5.4.2. Bounds based on monotonic functions

A common approach for enabling optimization of ranking metrics is to find lower or upper bounds that can be maximized or minimized, respectively. For instance, similar to a hinge loss, the rank function can be upper bounded by a maximum over score differences [54, 58]. Let s be the scoring function used to rank (in descending order); then:

\text{rank}(d \mid R) \leq \sum_{d' \in R} \max\big(1 - (s(d) - s(d')), \, 0\big).  (5.25)

Alternatively, the logistic function is also a popular choice [129]:

\text{rank}(d \mid R) \leq \sum_{d' \in R} \log_2\big(1 + e^{s(d') - s(d)}\big).  (5.26)

Minimizing one of these differentiable upper bounds directly minimizes an upper bound on the ARP metric (Eq. 5.3).

Furthermore, Agarwal et al. [2] showed that this approach can be extended to any metric based on a monotonically decreasing function. For instance, if \overline{\text{rank}}(d \mid R) is an upper bound on the \text{rank}(d \mid R) function, then the following is an upper bound on the DCG loss (Eq. 5.4):

\lambda_{\text{DCG}}(d \mid R) \leq -\log_2\big(1 + \overline{\text{rank}}(d \mid R)\big)^{-1} = \hat{\lambda}_{\text{DCG}}(d \mid R).  (5.27)
More generally, let α be a monotonically decreasing function. A loss based on α is always upper bounded by:

\lambda^{\alpha}(d \mid R) = -\alpha\big(\text{rank}(d \mid R)\big) \leq -\alpha\big(\overline{\text{rank}}(d \mid R)\big) = \hat{\lambda}^{\alpha}(d \mid R).  (5.28)

Though appropriate for many standard ranking metrics, λ̂^α is not an upper bound for top-k metric losses. To understand this, consider that an item beyond rank k may still receive a negative score from λ̂^α; for instance, for the DCG upper bound: rank̄(d | R) > k → λ̂_DCG(d | R) < 0. As a result, this is not an upper bound for a DCG@k based loss.

We propose a modification of the λ̂^α function that does provide an upper bound for top-k metric losses, by simply giving a positive penalty to items beyond rank k:

\hat{\lambda}^{\alpha@k}(d \mid R) = -\alpha\big(\overline{\text{rank}}(d \mid R)\big) + \big[\overline{\text{rank}}(d \mid R) > k\big] \cdot \alpha(k).  (5.29)

The resulting function is an upper bound on top-k metric losses based on a monotonic function: λ^{α@k}(d | R) ≤ λ̂^{α@k}(d | R). The main difference with λ̂^α is that items beyond rank k acquire a positive score from λ̂^{α@k}, thus providing an upper bound on the actual metric loss. Interestingly, the gradient of λ̂^{α@k} w.r.t. the scoring function s is the same as that of λ̂^α, since we consider the indicator function to never have a non-zero gradient. Therefore, the gradient of either function optimizes an upper bound on λ^{α@k} top-k metric losses, while only λ̂^{α@k} provides an actual upper bound.

While this monotonic function-based approach is simple, it is unclear how coarse these upper bounds are. In particular, some upper bounds on the rank function (e.g., Eq. 5.25) can provide gross overestimations. As a result, these upper bounds on ranking metric losses may be very far removed from their actual values.
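As an illustration, a minimal sketch (ours) of the rank upper bounds of Eq. 5.25 and 5.26 and the top-k correction of Eq. 5.29, instantiated with the DCG-style α(rank) = 1/log₂(1 + rank); the function names are our own:

```python
import numpy as np

def rank_bound_hinge(scores, d):
    # Eq. 5.25: sum of hinges over all items (including d itself, giving >= 1).
    return np.sum(np.maximum(1.0 - (scores[d] - scores), 0.0))

def rank_bound_logistic(scores, d):
    # Eq. 5.26: base-2 log makes each tied term equal 1, keeping the bound valid.
    return np.sum(np.log2(1.0 + np.exp(scores - scores[d])))

def lambda_alpha_at_k(scores, d, k,
                      alpha=lambda r: 1.0 / np.log2(1.0 + r),
                      rank_bound=rank_bound_hinge):
    # Eq. 5.29: items whose bounded rank exceeds k receive the +alpha(k)
    # penalty, making this an upper bound on the top-k metric loss.
    r_bar = rank_bound(scores, d)
    return -alpha(r_bar) + (r_bar > k) * alpha(k)
```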
5.4.3. Counterfactual LambdaLoss

Many supervised LTR approaches, such as the well-known LambdaRank and subsequent LambdaMART methods [13], are based on Expectation Maximization (EM) procedures [28]. Recently, Wang et al. [129] introduced the LambdaLoss framework, which provides a theoretical way to prove that a method optimizes a lower bound on a ranking metric. Subsequently, it was used to prove that LambdaMART optimizes such a bound on DCG; similarly, it was also used to introduce the novel LambdaLoss method, which provides an even tighter bound on DCG. In this section, we show that the LambdaLoss framework can be used to find proven bounds on counterfactual LTR losses and top-k metrics. Since LambdaLoss is considered state-of-the-art in supervised LTR, making its framework applicable to counterfactual LTR could potentially provide competitive performance. Additionally, adapting the LambdaLoss framework to top-k metrics further expands its applicability.

The LambdaLoss framework and its EM-optimization approach work for metrics that can be expressed in item-based gains, G(d_n | q, r), and discounts based on position, D(rank(d_n | R)); for brevity we write G_n and D_n, respectively, resulting in:

\Delta(R \mid q, r) = \sum_{d_n \in R} G(d_n \mid q, r) \cdot D\big(\text{rank}(d_n \mid R)\big) = \sum_{n=1}^{|R|} G_n \cdot D_n.  (5.30)
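For instance, a small sketch (ours) of Eq. 5.30 instantiated as DCG@k with binary relevance gains, where the discounts are allowed to be zero beyond rank k:

```python
import numpy as np

def dcg_at_k_gains_discounts(relevances_in_rank_order, k):
    """Eq. 5.30 for DCG@k with binary relevance: gains G_n are per-item
    relevances, discounts D_n depend only on the rank and are zero past k."""
    n_items = len(relevances_in_rank_order)
    gains = np.asarray(relevances_in_rank_order, dtype=float)          # G_n
    ranks = np.arange(1, n_items + 1)
    discounts = np.where(ranks <= k, 1.0 / np.log2(1.0 + ranks), 0.0)  # D_n
    return np.sum(gains * discounts)
```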
For simplicity of notation, we choose indexes so that n = rank(d_n | R); thus D_n is always the discount for rank n. We differ from the existing LambdaLoss framework by allowing the discounts to be zero (∀n: D_n ≥ 0), thus also accounting for top-k metrics. Furthermore, items at the first rank are not discounted, or the metric can be scaled so that D₁ = 1. Additionally, higher ranked items should be discounted less or equally: n > m → D_n ≤ D_m. Most ranking metrics meet these criteria; for instance, G_n and D_n can be chosen to match ARP or DCG. Importantly, our adaption also allows Δ to match top-k metrics such as DCG@k or Precision@k.

In order to apply the LambdaLoss framework to counterfactual LTR, we consider a general inverse-propensity-scored estimator:

\hat{\Delta}_{\text{IPS}}(R \mid q, c, \cdot) = \sum_{d_n : c(d_n) = 1} \frac{\lambda(d_n \mid R)}{\rho\big(o(d_n) = 1 \mid q, r, \bar{R}, \pi\big)},  (5.31)

where the propensity function ρ can match either the policy-oblivious (Eq. 5.8) or the policy-aware (Eq. 5.13) estimator. By choosing

G_n = \frac{1}{\rho\big(o(d_n) = 1 \mid q, r, \bar{R}, \pi\big)} \quad \text{and} \quad D_n = \lambda(d_n \mid R),  (5.32)

the estimator can be described in terms of gains and discounts. In contrast, in the existing LambdaLoss framework [129] gains are based on item relevance. For counterfactual top-k LTR, we have designed Eq. 5.32 so that gains are based on the propensity scores of observed clicks, and the discounts can have zero values.

The EM-optimization procedure alternates between an expectation step and a maximization step. In our case, the expectation step sets the discount values D_n according to the current ranking R of the scoring function s. Then the maximization step updates s to optimize the ranking model. Following the LambdaLoss framework [129], we derive a slightly different loss. With the delta function

\delta_{nm} = D_{|n-m|} - D_{|n-m|+1},  (5.33)

our differentiable counterfactual loss becomes:
\sum_{G_n > G_m} -\log\left(\bigg(\frac{1}{1 + e^{s(d_m) - s(d_n)}}\bigg)^{\delta_{nm} \cdot |G_n - G_m|}\right).  (5.34)

The changes we made do not affect the validity of the proof provided in the original LambdaLoss paper [129]. Therefore, the counterfactual loss (Eq. 5.34) can be proven to optimize a lower bound on counterfactual estimates of top-k metrics.

Finally, in the same way, the LambdaLoss framework can also be used to derive counterfactual variants of other supervised LTR losses/methods such as LambdaRank or LambdaMART. Unlike previous work that also attempted to find a counterfactual lambda-based method by introducing a pairwise-based estimator [46], our approach is compatible with the prevalent counterfactual approach since it uses the same estimator based on single-document propensities [3–5, 58, 127, 128]. Our approach suggests that the divide between supervised and counterfactual LTR methods may disappear in the future, as a state-of-the-art supervised LTR method can now be applied to the state-of-the-art counterfactual LTR estimators.

So far we have introduced two counterfactual LTR approaches that are proven to optimize lower bounds on top-k metrics: with monotonic functions (Section 5.4.2) and through the LambdaLoss framework (Section 5.4.3). To the best of our knowledge, we are the first to introduce theoretically proven lower bounds for top-k LTR metrics. Nevertheless, previous work has also attempted to optimize top-k metrics, albeit through heuristic methods. Notably, Wang et al. [129] used a truncated version of the LambdaLoss loss to optimize DCG@k. Their loss uses the discounts D_n based on full-ranking DCG but ignores item pairs outside of the top-k:

\sum_{G_n > G_m} -\big[n \leq k \vee m \leq k\big] \cdot \log\left(\bigg(\frac{1}{1 + e^{s(d_m) - s(d_n)}}\bigg)^{\delta_{nm} \cdot |G_n - G_m|}\right).  (5.35)
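A minimal sketch (ours) of the counterfactual LambdaLoss of Eq. 5.34, with an optional cutoff argument k giving the truncated variant of Eq. 5.35; `gains` holds the inverse propensities of the clicked items (Eq. 5.32), `ranks` the current 1-based ranks from the expectation step, and `discounts` the values D_j indexed from 1 (index 0 unused, length |R| + 1):

```python
import numpy as np

def delta_nm(discounts, n, m):
    # Eq. 5.33: delta_nm = D_{|n-m|} - D_{|n-m|+1}.
    i = abs(n - m)
    return discounts[i] - discounts[i + 1]

def counterfactual_lambdaloss(scores, gains, ranks, discounts, k=None):
    loss = 0.0
    for n in range(len(scores)):
        for m in range(len(scores)):
            if gains[n] <= gains[m]:
                continue  # the loss sums over pairs with G_n > G_m only
            if k is not None and not (ranks[n] <= k or ranks[m] <= k):
                continue  # Eq. 5.35: ignore pairs entirely outside the top-k
            sig = 1.0 / (1.0 + np.exp(scores[m] - scores[n]))
            weight = delta_nm(discounts, ranks[n], ranks[m]) * (gains[n] - gains[m])
            loss += -np.log(sig) * weight
    return loss
```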
While empirical results motivate its usage, there is no known theoretical justification for this loss, and thus it is considered a heuristic.

This leaves us with a choice between two theoretically-motivated counterfactual LTR approaches for optimizing top-k metrics (Eq. 5.29 and 5.34) and an empirically-motivated heuristic (Eq. 5.35). We propose a pragmatic solution by recognizing that counterfactual estimators can unbiasedly evaluate top-k metrics. Therefore, in practice one can optimize several ranking models using various approaches and, subsequently, estimate which resulting model provides the best performance. Thus, using counterfactual evaluation to select from the resulting models is an unbiased method to choose between the available counterfactual LTR approaches.

5.5. Experimental Setup

We follow the standard setup in unbiased LTR [5, 16, 50, 58] and perform semi-synthetic experiments: queries and items are based on datasets of commercial search engines and interactions are simulated using probabilistic click models.
We use the queries and documents from two of the largest publicly available LTR datasets: MSLR-WEB30K [95] and Yahoo! Webscope [17]. Each was created by a commercial search engine and contains a set of queries with corresponding preselected document sets. Query-document pairs are represented by feature vectors and five-grade relevance annotations ranging from not relevant (0) to perfectly relevant (4). In order to binarize the relevance, we only consider the two highest relevance grades as relevant. The MSLR dataset contains 30,000 queries with on average 125 preselected documents per query, and encodes query-document pairs in 136 features. The Yahoo dataset has 29,921 queries and on average 24 documents per query, encoded in 700 features. Presumably, learning from top-k feedback is harder as k becomes a smaller percentage of the number of items. Thus, we expect the MSLR dataset, with more documents per query, to pose a more difficult problem.

The setting we simulate is one where interactions are gathered using a non-optimal but decent production ranker. We follow existing work [5, 50, 58] and use supervised optimization for the ARP metric on 1% of the training data. The resulting model simulates a real-world production ranker, since it is much better than a random initialization but leaves enough room for improvement [58].

We then simulate user-issued queries by uniformly sampling from the training partition of the dataset. Subsequently, for each query the production ranker ranks the documents preselected by the dataset. Depending on the experimental run under consideration, randomization is performed on the resulting rankings. In order for the policy-aware estimator to be unbiased, every relevant document needs a chance of appearing in the top-k (Condition 5.16). Since in a realistic setting relevance is unknown, we choose to give every document a non-zero probability of appearing in the top-k. Our randomization policy takes the ranking of the production ranker and leaves the first k − 1 documents unchanged, while the document at position k is selected by sampling uniformly from the remaining documents. The result is a minimally invasive randomized top-k ranking, since most of the ranking is unchanged and the placement of the sampled documents is limited to the least important position.

We note that many other logging policies could be applied (see Condition 5.16); e.g., an alternative policy could insert sampled documents at random ranks for less obvious randomization. Unfortunately, a full exploration of the effect of using different logging policies is beyond the scope of this chapter.

Clicks are simulated on the resulting ranking R̄ according to position bias and document relevance. Top-k position bias is modelled through the probability of observance, as follows:

P(o(d) = 1 \mid q, r, \bar{R}) = \begin{cases} \text{rank}(d \mid \bar{R})^{-1} & \text{if } \text{rank}(d \mid \bar{R}) \leq k, \\ 0 & \text{if } \text{rank}(d \mid \bar{R}) > k. \end{cases}  (5.36)

The randomization policy results in the following examination probabilities w.r.t. the logging policy (cf. Eq. 5.12):

P(o(d) = 1 \mid q, r, \pi) = \begin{cases} \text{rank}(d \mid \bar{R})^{-1} & \text{if } \text{rank}(d \mid \bar{R}) < k, \\ \big(\text{rank}(d \mid \bar{R}) \cdot (|\bar{R}| - k + 1)\big)^{-1} & \text{if } \text{rank}(d \mid \bar{R}) \geq k. \end{cases}  (5.37)

The probability of a click is conditioned on the relevance of the document according to the dataset:

P(c(d) = 1 \mid q, r, \bar{R}, o) = \begin{cases} 1 & \text{if } r(d) = 1 \wedge o(d) = 1, \\ 0.1 & \text{if } r(d) = 0 \wedge o(d) = 1, \\ 0 & \text{if } o(d) = 0. \end{cases}  (5.38)
Note that our previous assumption that clicks only take place on relevant items (Section 5.2.2) does not hold in our experiments.

Optimization is performed on training clicks simulated on the training partition of the dataset. Hyperparameter tuning is done by estimating performance on (unclipped) validation clicks simulated on the validation partition; the number of validation clicks is always 15% of the number of training clicks. Lastly, evaluation metrics are calculated on the test partition using the dataset labels.
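A minimal sketch (ours) of this logging and click simulation, where `prod_ranking` is the production ranker's ordering of document ids, `relevance` maps ids to binary labels, and the 0.1 noise level follows Eq. 5.38:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def log_interaction(prod_ranking, relevance, k):
    # Randomized top-k: keep the first k-1 positions, sample position k
    # uniformly from the remaining documents (satisfying Condition 5.16).
    displayed = list(prod_ranking[:k - 1])
    displayed.append(rng.choice(prod_ranking[k - 1:]))
    clicked = []
    for pos, d in enumerate(displayed, start=1):
        if rng.random() < 1.0 / pos:                 # observance, Eq. 5.36
            p_click = 1.0 if relevance[d] else 0.1   # click noise, Eq. 5.38
            if rng.random() < p_click:
                clicked.append(d)
    return displayed, clicked
```

Note that for k = 1 this policy presents every document with equal probability, matching the trivial case discussed in the results.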
In order to evaluate the performance of the policy-aware estimator (Eq. 5.13) and the effect of item-selection bias, we compare with the following baselines: (i) The policy-oblivious estimator (Eq. 5.8). In our setting, where the examination probabilities are known beforehand, the policy-oblivious estimator also represents methods that jointly estimate these probabilities while performing LTR, i.e., the following methods reduce to this estimator if the examination probabilities are given: [3, 5, 58, 127]. (ii) A rerank estimator, an adaption of the policy-oblivious estimator. During optimization the rerank estimator applies the policy-oblivious estimator but limits the document set of an interaction i to the k displayed items: R_i = {d | rank(d | R̄_i) ≤ k} (cf. Eq. 5.8). Thus, it is optimized to rerank the top-k of the production ranker only, but during inference it is applied to the entire document set. (iii) Additionally, we evaluate performance without any cutoff k or randomization; in these circumstances all three estimators (Policy-Aware, Policy-Oblivious, Rerank) are equivalent. (iv) Lastly, we use supervised LTR on the dataset labels to get a full-information skyline, which shows the hypothetical optimal performance.

To evaluate the effectiveness of our proposed loss functions for optimizing top-k metrics, we apply the monotonic lower bound (Eq. 5.29) with a linear (Eq. 5.25) and a logistic upper bound (Eq. 5.26). Additionally, we apply several versions of the LambdaLoss loss function (Eq. 5.34): one that optimizes full DCG, another that optimizes DCG@5, and the heuristic truncated loss also optimizing DCG@5 (Eq. 5.35). Lastly, we apply unbiased loss selection, where we select the best-performing model based on the estimated performance on the (unclipped) validation clicks.

Optimization is done with stochastic gradient descent; to maximize computational efficiency we rewrite the loss (Eq. 5.5) for a propensity scoring function ρ in the following manner:

\hat{\mathcal{L}} = \frac{1}{N} \sum_{i=1}^{N} \hat{\Delta}(R_i \mid q_i, \bar{R}_i, \pi, c_i)
= \frac{1}{N} \sum_{i=1}^{N} \sum_{d : c_i(d) = 1} \frac{\lambda(d \mid R_i)}{\rho(o_i(d) = 1 \mid q_i, r, \cdot)}
= \frac{1}{N} \sum_{q \in Q} \sum_{d \in R_q} \Bigg(\sum_{i=1}^{N} \frac{[q_i = q] \cdot c_i(d)}{\rho(o_i(d) = 1 \mid q, r, \cdot)}\Bigg) \cdot \lambda(d \mid R_q)
= \frac{1}{N} \sum_{q \in Q} \sum_{d \in R_q} \omega_d \cdot \lambda(d \mid R_q).  (5.39)
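A minimal sketch (ours) of this rewrite: clicks are first aggregated into the per-document weights ω_d, after which a loss evaluation no longer scales with the number of logged interactions N:

```python
from collections import defaultdict

def precompute_weights(interactions, propensity_fn):
    """interactions: iterable of (query, clicked_docs) pairs;
    propensity_fn(q, d): the propensity score rho of the chosen estimator."""
    omega = defaultdict(float)
    n = 0
    for q, clicked_docs in interactions:
        n += 1
        for d in clicked_docs:
            omega[(q, d)] += 1.0 / propensity_fn(q, d)  # inner sum of Eq. 5.39
    return omega, n

def estimated_loss(omega, n, lambda_fn, current_ranks):
    # current_ranks[(q, d)]: rank of d for q under the model being evaluated.
    return sum(w * lambda_fn(current_ranks[qd]) for qd, w in omega.items()) / n
```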
After precomputing the document weights ω_d, the complexity of computing the loss is only determined by the dataset size. This allows us to optimize over very large numbers of clicks with very limited increases in computational costs.

We optimize linear models, but our approach can be applied to any differentiable model [2]. Propensity clipping [58] is applied to training clicks and never to the validation clicks; we also use self-normalization [116].

5.6. Results and Discussion

In this section we discuss the results of our experiments and evaluate our policy-aware estimator and the methods for top-k LTR metric optimization empirically.
First we consider the question:

RQ5.1 Is the policy-aware estimator effective for unbiased counterfactual LTR from top-k feedback?

Figure 5.1 displays the performance of different approaches after optimization on simulated clicks under varying values of k. Both the policy-oblivious and rerank estimators are greatly affected by the item-selection bias introduced by the cutoff at k. On the MSLR dataset neither approach is able to get close to optimal ARP performance, and optimal DCG@5 is only reached at large cutoff values. On the Yahoo dataset, the policy-oblivious approach can only approximate optimal ARP and DCG@5 once k is large, and the same holds for the rerank approach. Considering that on average a query in the Yahoo dataset has only 24 preselected documents, it appears that even a little item-selection bias has a substantial effect on both estimators. Furthermore, randomization appears to have a very limited positive effect on the policy-oblivious and rerank approaches. The one exception is the policy-oblivious approach when k = 1, where it reaches optimal performance under randomization. Here, the randomization policy gives every item an equal probability of being presented, thus trivially removing item-selection bias; additionally, there is no position bias as there is only a single position. However, besides this trivial exception, the baseline estimators are strongly affected by item-selection bias, and simply logging with randomization is unable to remove its effect.

In contrast, the policy-aware approach is hardly affected by the choice of k. It consistently approximates optimal performance in terms of ARP and DCG@5 on both datasets. On the MSLR dataset, the policy-aware approach provides near optimal ARP performance; however, for larger values of k there is a small but noticeable gap. We suspect that this is the result of variance from click noise and can be closed by gathering more clicks. Across all settings, the policy-aware approach appears unaffected by the choice of k and thus by the effect of item-selection bias. Moreover, it consistently provides performance at least as good as the baselines; on the Yahoo dataset it outperforms them for smaller values of k, and on the MSLR dataset it outperforms them for all tested values of k. We note that the randomization policy is the same for all methods; in other words, under randomization the clicks for the policy-oblivious, policy-aware, and rerank approaches are acquired in the exact same way. Thus, our results show that in order to benefit from randomization, a counterfactual LTR method has to take its effect into account; hence only the policy-aware approach has improved performance.

Figure 5.2 displays the performance when learning from top-5 feedback while varying the number of clicks. Here we see that the performance of the policy-oblivious approach is stable after a certain number of clicks has been gathered; the rerank approach similarly shows stable performance after enough clicks, both when optimized for ARP and for DCG@5. Both baseline approaches show biased behavior, where adding additional data does not lead to improved performance. This confirms that their estimators are unable to deal with item-selection bias. In contrast, the policy-aware approach reaches optimal performance in all settings.
However, it appears that the policy-aware approach requires more clicks than the no-cutoff baseline; we suspect that this difference is due to variance added by the randomization and the smaller propensity scores.

In conclusion, we answer RQ5.1 positively: our results show that the policy-aware approach is unbiased w.r.t. item-selection bias and position bias. Where all baseline approaches are affected by even small amounts of item-selection bias, the policy-aware approach approximates optimal performance regardless of the cutoff value k.

Next, we consider the question:
RQ5.2 Are our novel counterfactual LTR loss functions effective for top-k LTR metric optimization?

Figure 5.3 shows the performance of the policy-aware approach after optimizing different loss functions under top-5 feedback. While on the Yahoo dataset small differences are observed, on the MSLR dataset substantial differences are found. Interestingly, there seems to be no advantage in optimizing for DCG@5 instead of full DCG with the LambdaLoss. Furthermore, the monotonic loss function works very well with a linear upper bound, yet poorly when using the logistic upper bound. On both datasets the heuristic truncated LambdaLoss loss function provides the best performance, despite being the only method without a theoretical basis. When few clicks are available, the differences change; e.g., the monotonic loss function with a logistic upper bound then outperforms the other losses on the MSLR dataset.

Finally, we consider unbiased loss selection; Figure 5.3 displays both the performance of the selected models and the estimated performance on which the selections are based. For the most part the optimal models are selected, but variance does cause mistakes in selection when few clicks are available. Thus, unbiased optimal loss selection seems effective as long as enough clicks are available.

In conclusion, we answer RQ5.2 positively: our results indicate that the truncated counterfactual LambdaLoss loss function is most effective at optimizing DCG@5. Using this loss, our counterfactual LTR method reaches state-of-the-art performance comparable to supervised LTR on both datasets. Alternatively, our proposed unbiased loss selection method can choose optimally between models that are optimized by different loss functions.
5.7. Related Work

Section 5.2.1 has discussed supervised LTR and Section 5.2.2 has described the existing counterfactual LTR framework; this section contrasts additional related work with our policy-aware approach.

Interestingly, some existing work in unbiased LTR was performed in top-k ranking settings [3, 4, 127, 128]. Our findings suggest that the results of that work are affected by item-selection bias and that there is potential for considerable improvements by applying the policy-aware method.

Carterette and Chandar [16] recognized that counterfactual evaluation cannot evaluate rankers that retrieve items that are unseen in the interaction logs, essentially due to a form of item-selection bias. Their proposed solution is to gather new interactions on rankings where previously unseen items are randomly injected. Accordingly, they adapt propensity scoring to account for the random injection strategy. In retrospect, this approach can be seen as a specific instance of our policy-aware approach. In contrast, we have focused on settings where item-selection bias takes place systematically, and we propose that logs should be gathered by any policy that meets Condition 5.16. Instead of expanding the logs to correct for missing items, our approach avoids systematic item-selection bias altogether.

Other previous work has also used propensity scores based on a logging policy and examination probabilities. Komiyama et al. [67] and subsequently Lagrée et al. [69] use such propensities to find the optimal ranking for a single query by casting the ranking problem as a multiple-play bandit. Li et al. [72] use similar propensities to counterfactually evaluate ranking policies, where they estimate the number of clicks a ranking policy will receive. Our policy-aware approach contrasts with these existing methods by providing an unbiased estimate of LTR-metric-based losses, and thus it can be used to optimize LTR models similar to supervised LTR.

Lastly, online LTR methods, where interactive processes learn from the user [132], also make use of stochastic ranking policies. They correct for biases through randomization in rankings, but do not use an explicit model of examination probabilities. In contrast with counterfactual LTR: while online LTR methods appear to provide robust performance [50], they are not proven to unbiasedly optimize LTR metrics [82, 84]. Unlike counterfactual LTR, they are not effective when applied to historical interaction logs [43].

5.8. Conclusion

In this chapter, we have proposed a policy-aware estimator for LTR, the first counterfactual method that is unbiased w.r.t. both position bias and item-selection bias. Our experimental results show that existing policy-oblivious approaches are greatly affected by item-selection bias, even when only small amounts are present. In contrast, the proposed policy-aware LTR method can learn from top-k feedback without being affected by the choice of k. Furthermore, we proposed three counterfactual LTR approaches for optimizing top-k metrics: two theoretically proven lower bounds on DCG@k, based on monotonic functions and the LambdaLoss framework, respectively, and another heuristic truncated loss. Additionally, we introduced unbiased loss selection, which can choose optimally between models optimized with different loss functions. Together, our contributions provide a method for learning from top-k feedback and for optimizing top-k metrics.

With these contributions, we can answer the thesis research questions RQ4 and
RQ5 positively: with the policy-aware estimator, counterfactual LTR is applicable to top-k ranking settings; furthermore, we have shown that the state-of-the-art supervised LTR method LambdaLoss can be used for counterfactual LTR. To the best of our knowledge, this is the first counterfactual LTR method that is unbiased in top-k ranking settings. Additionally, this chapter also serves to further bridge the gap between supervised and counterfactual LTR methods, as we have shown that state-of-the-art lambda-based supervised LTR methods can be applied to the state-of-the-art counterfactual LTR estimators. Therefore, the contributions of this chapter have greatly extended the capabilities of the counterfactual LTR approach and further connected it with the supervised LTR field.

Future work in supervised LTR could verify whether potential novel supervised methods can be applied to counterfactual losses. A limitation of the policy-aware LTR approach is that the logging policy needs to be known; future work could investigate whether a policy estimated from logs also suffices [72, 74]. Finally, existing work on bias in recommendation [107] has not considered position bias; thus we anticipate further opportunities for counterfactual LTR methods for top-k recommendations.

The remaining chapters of this thesis will continue to build on the policy-aware estimator. Chapter 6 introduces a counterfactual LTR algorithm that uses the policy-aware estimator to combine properties of tabular models and feature-based models. Furthermore, Chapter 7 looks at how the policy-aware estimator can be used for ranker evaluation. It introduces an algorithm that optimizes the logging policy to reduce variance when using the policy-aware estimator for evaluation. Lastly, Chapter 8 introduces a novel intervention-aware estimator inspired by the policy-aware estimator. This novel estimator takes the policy-aware approach even further by considering the effect of all logging policies used during data gathering. The intervention-aware approach thus also considers the case where the logging policy is updated during the gathering of data. Besides the policy-aware estimator, Chapters 6, 7, and 8 all use the adaptation of LambdaLoss for counterfactual LTR derived in this chapter.
[Figure 5.1 graphs omitted; panels: Yahoo! Webscope and MSLR-WEB30k; y-axes: Avg. Relevant Position and Normalized DCG; x-axis: Number of Display Positions (k); legend: Policy-Aware (rand.), Policy-Oblivious (no rand.), Policy-Oblivious (rand.), Rerank (no rand.), Rerank (rand.), Production, Full-Info Skyline.]
Figure 5.1: The effect of item-selection bias on different estimators. Optimization on clicks simulated on top-k rankings with varying numbers of display positions (k), with and without randomization (for each datapoint, clicks were simulated independently). Results on the Yahoo dataset and the MSLR dataset. The top graph per dataset optimizes the average relevant position through the linear upper bound (Eq. 5.25); the bottom graph per dataset optimizes DCG@5 using the truncated LambdaLoss (Eq. 5.35).

[Figure 5.2 graphs omitted; panels: Yahoo! Webscope and MSLR-WEB30k; y-axes: Avg. Relevant Position and Normalized DCG; x-axis: Number of Training Clicks; legend: No-Cutoff, Policy-Aware (rand.), Policy-Oblivious (no rand.), Policy-Oblivious (rand.), Rerank (no rand.), Rerank (rand.), Production, Full-Info Skyline.]
Figure 5.2: Performance of different estimators learning from different numbers of clicks simulated on top-5 rankings, with and without randomization. Results on the Yahoo dataset and the MSLR dataset. The top graph per dataset optimizes the average relevant position through the linear upper bound (Eq. 5.25); the bottom graph per dataset optimizes DCG@5 using the truncated LambdaLoss (Eq. 5.35).
[Figure 5.3 graphs omitted; panels: Yahoo! Webscope and MSLR-WEB30k; y-axes: Normalized DCG and Estimated DCG; x-axis: Number of Training Clicks; legend: Loss Selection, Monotonic (linear), Monotonic (log), LambdaLoss (full-DCG), LambdaLoss (DCG@5), Truncated LambdaLoss (DCG@5), Production, Full-Info Skyline.]

Figure 5.3: Performance of the policy-aware estimator (Eq. 5.13) optimizing DCG@5 using different loss functions. The loss-selection method selects the estimated optimal model based on clicks gathered on separate validation queries. Varying numbers of clicks on top-5 rankings with randomization; the number of validation clicks is 15% of the number of training clicks.

5.A Notation Reference for Chapter 5

    Notation          Description
    k                 the number of items that can be displayed in a single ranking
    i                 an iteration number
    Q                 the set of queries
    q                 a user-issued query
    d                 an item to be ranked
    r(d, q), r(d)     the relevance of item d w.r.t. query q
    R                 a ranked list
    R̄                 a ranked list that was displayed to the user
    λ(d | R)          a metric that weights items depending on their display rank
    c_i(d)            a function indicating item d was clicked at iteration i
    o_i(d)            a function indicating item d was observed at iteration i
    π                 a logging policy
    π(R̄ | q)          the probability that policy π displays ranking R̄ for query q
    rank(d | R̄)       the rank of item d in displayed ranking R̄
    ρ                 a propensity function used to represent any IPS estimator
    s(d)              the score given to item d by ranking model s, used to sort items

Combining Generalized and Specialized Models in Counterfactual Learning to Rank
So far, this thesis has only addressed feature-based Learning to Rank (LTR), the optimization of models that rank items based on their features, as opposed to tabular online LTR, which optimizes a ranking directly and thus does not use any scoring model. A big advantage of feature-based LTR is that its model can be applied to previously unseen queries and items. As a result, it provides very robust performance in previously unseen circumstances. However, its behavior is limited by the available features: in practice, these often do not provide enough information to determine the optimal ranking. In stark contrast, tabular LTR memorizes rankings instead of using features to predict them. Consequently, tabular LTR is not limited by which features are available and can potentially always find the optimal ranking. Despite this potential, tabular LTR does not generalize: it cannot transfer learned behavior to previously unseen queries or items. In other words, tabular LTR has the potential to specialize, i.e., to perform very well in circumstances encountered often, whereas feature-based LTR is good at generalization, i.e., performing well overall, including in previously unseen circumstances. In this chapter we investigate whether the advantageous properties of these two areas can be combined in the counterfactual LTR framework, and thus we address the thesis research question:
RQ6
Can the specialization ability of tabular online LTR be combined with the robust feature-based approach of counterfactual LTR?

In this chapter we introduce a framework for Generalization and Specialization (GENSPEC) for counterfactual learning from logged bandit feedback. GENSPEC is designed for problems that can be divided into many non-overlapping contexts. It simultaneously learns a generalized policy, optimized for high performance across all contexts, and many specialized policies, each optimized for high performance in a single context. Using high-confidence bounds on the relative performance of policies, GENSPEC decides per context whether to deploy a specialized policy, the general policy, or the current logging policy. By doing so, GENSPEC combines the high performance of successfully specialized policies with the safety and robustness of a generalized policy.

While GENSPEC is applicable to many different bandit problems, we focus on query-specialization for counterfactual learning to rank, where a context consists of a query submitted by a user. Here we learn both a single general feature-based model for robust performance across queries, and many memory-based models, each of which is highly specialized for a single query; GENSPEC then chooses which model to deploy on a per-query basis. Our results show that GENSPEC leads to massive performance gains on queries with sufficient click data, while still having safe and robust behavior on queries with little or noisy data.

This chapter was submitted as [87]. Appendix 6.C gives a reference for the notation used in this chapter.
6.1 Introduction

Generalization is an important goal for most machine learning algorithms: models should perform well across a large range of contexts, especially previously unseen contexts [10].
Specialization, the ability to perform well in a single context, is often disfavored over generalization because the latter is more robust [37]. Generally, the same trade-off pertains to contextual bandit problems [70, Chapter 18]. There, the goal is to find a policy that maximizes performance over the full distribution of contextual information. While a specialized policy, i.e., a policy optimized on a subset of possible contexts, could outperform a generalized policy on that subset, it most likely compromises performance on other contexts to do so, since specialization comes with a risk of overfitting: applying a policy that is specialized in a specific set of contexts to different contexts [22, 37]. As a consequence, generalization is often preferred, as it avoids this issue.

In this chapter, we argue that, depending on the circumstances, specialization may be preferable over generalization; specifically, if it can be guaranteed with high confidence that a specialized policy is only deployed in contexts where it outperforms policies optimized for generalization. We focus on counterfactual learning for contextual bandit problems where contexts can be split into non-overlapping sets. We simultaneously train (i) a generalized policy that performs well across all contexts, and (ii) many specialized policies, one for each specific set of contexts. Thus, per context there is a choice between three policies: (i) the logging policy used to gather data, (ii) the generalized policy, and (iii) the specialized policy. Depending on the circumstances, e.g., the amount of data available, noise in the data, or the difficulty of the task, a different policy will perform best in a specific context [22]. To reliably choose between policies, we estimate high-confidence bounds [119] on the relative performance differences between policies and then choose conservatively: we only apply a specialized policy instead of the generalized policy or logging policy if the lower bounds on their differences in performance are positive in a specific context. Otherwise, the generalized policy is only applied if it outperforms the logging policy across all contexts with high confidence. We call this approach the Generalization and Specialization (GENSPEC) framework: it trains both generalized and specialized policies and results in a meta-policy that chooses between them using high-confidence bounds. The GENSPEC meta-policy is particularly powerful because it can combine the properties of different models: for instance, a generalized policy using a feature-based model can be overruled by a specialized policy using a tabular model that has memorized the best actions. GENSPEC promises the best of two worlds: the safe robustness of a generalized policy with the potentially high performance of a specialized policy.

To evaluate the GENSPEC approach, we apply it to query-specialization in the setting of Counterfactual Learning to Rank (LTR). Existing approaches in this field either generalize, by learning a ranking model that ranks items based on their features and generalizes well across all queries [58], or they specialize, by learning tabular ranking models that are specific to a single query and cannot be applied to any other query [70, 138]. By viewing each query as a different context, GENSPEC learns both a generalized ranker and many specialized tabular rankers, and subsequently chooses which ranker to apply per query. Our empirical results show that GENSPEC combines the advantages of both approaches: very high performance on queries where sufficiently many interactions were observed for successful specialization, and safe, robust performance on queries where interaction data is limited or noisy.

Our main contributions are:

1. an adaptation of existing counterfactual high-confidence bounds for relative performance between ranking policies;
2. the GENSPEC framework that simultaneously learns generalized and specialized ranking policies, plus a meta-policy that decides which to deploy per context.

To the best of our knowledge, GENSPEC is the first counterfactual LTR method to simultaneously train generalized and specialized models, and to reliably choose between them using high-confidence bounds.
6.2 Background: Learning to Rank

This section covers the basics of counterfactual LTR.
The LTR task has been approached as a contextual bandit problem before [68, 70, 117, 132]. The differentiating characteristic of the LTR task is that actions are rankings; thus, they consist of an ordered set of $K$ items: $a = (d_1, d_2, \ldots, d_K)$. The contextual information often contains a user-issued search query, features based on the items available for ranking and on item-query combinations, and information about the user, among other miscellaneous information. Since our focus is query specialization, we record the query separately; thus, at each time step $i$, contextual information $x_i$ and a single query $q_i \in \{1, 2, 3, \ldots\}$ are active: $x_i, q_i \sim P(x, q)$. Let $\Delta$ indicate the reward for a ranking $a$. A policy $\pi$ should maximize the expected reward [58, 75]:
\[
R(\pi) = \iint \Big( \sum_{a} \Delta(a \mid x, q, r) \cdot \pi(a \mid x, q) \Big) P(x, q) \, dx \, dq. \tag{6.1}
\]
Commonly, in LTR the reward for a ranking $a$ is a linear combination of the relevance scores of the items in $a$, weighted according to their rank. We use $r(d \mid x, q)$ to denote the relevance score of item $d$ and $\lambda(\text{rank}(d \mid a))$ for the weight per rank, resulting in:
\[
\Delta(a \mid x, q, r) = \sum_{d \in a} \lambda\big(\text{rank}(d \mid a)\big) \cdot r(d \mid x, q). \tag{6.2}
\]
A common choice is to optimize the Discounted Cumulative Gain (DCG) metric; $\lambda$ can be chosen accordingly:
\[
\lambda_{\text{DCG}}\big(\text{rank}(d \mid a)\big) = \log_2\big(\text{rank}(d \mid a) + 1\big)^{-1}. \tag{6.3}
\]
When the relevance function $r$ is given, maximizing $R$ can be done through traditional LTR in a supervised manner [13, 75, 129].

In practice, the relevance score $r$ is often unknown or requires expensive annotation [17, 27, 95, 104]. An attractive alternative comes from LTR based on historical interaction logs, which takes a counterfactual approach [58, 127]. Let $\pi_0$ be the logging policy that was used when interactions were logged:
\[
a_i \sim \pi_0(a \mid x_i, q_i). \tag{6.4}
\]
Counterfactual LTR focuses mainly on clicks as interactions; clicks are strongly affected by position bias [25]. This bias arises because users often do not examine all items presented to them, and only click on examined items. As a result, items that are displayed in positions that are more often examined are also more likely to be clicked, without necessarily being more relevant. Let $o_i(d) \in \{0, 1\}$ indicate whether item $d$ was examined by the user or not:
\[
o_i(d) \sim P\big(o(d) \mid a_i\big). \tag{6.5}
\]
We use $c_i(d) \in \{0, 1\}$ to indicate whether $d$ was clicked at time step $i$:
\[
c_i(d) \sim P\big(c(d) \mid o_i(d), r(d \mid x, q)\big). \tag{6.6}
\]
We assume that click probabilities are only dependent on whether an item was examined, $o_i(d)$, and on its relevance, $r(d \mid x, q)$. Furthermore, we make the common assumption that clicks only occur on examined items [58, 127]; thus:
\[
P\big(c(d) = 1 \mid o(d) = 0, r(d \mid x, q)\big) = 0. \tag{6.7}
\]
Moreover, we assume that, given examination, more relevant documents are more likely to be clicked. Specifically, click probability is proportional to relevance with an offset $\mu \in \mathbb{R}_{>0}$:
\[
P\big(c(d) = 1 \mid o(d) = 1, r(d \mid x, q)\big) \propto r(d \mid x, q) + \mu. \tag{6.8}
\]
The data used for counterfactual LTR consists of observed clicks $c_i$, propensity scores $\rho_i$, contextual information $x_i$, and query $q_i$ for $N$ interactions:
\[
\mathcal{D} = \big\{ (c_i, a_i, \rho_i, x_i, q_i) \big\}_{i=1}^{N}. \tag{6.9}
\]
We apply the policy-aware approach [86] (Chapter 5) and base $\rho$ both on the examination probability of the user and on the behavior of the logging policy:
\[
\rho_i(d) = \sum_{a} P\big(o_i(d) = 1 \mid a\big) \cdot \pi_0(a \mid x_i, q_i). \tag{6.10}
\]
The estimated reward based on $\mathcal{D}$ is now:
\[
\hat{R}(\pi \mid \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \sum_{a} \hat{\Delta}(a \mid c_i, \rho_i) \cdot \pi(a \mid x_i, q_i), \tag{6.11}
\]
where $\hat{\Delta}$ is an Inverse Propensity Scoring (IPS) estimator:
\[
\hat{\Delta}(a \mid c_i, \rho_i) = \sum_{d \in a} \lambda\big(\text{rank}(d \mid a)\big) \cdot \frac{c_i(d)}{\rho_i(d)}. \tag{6.12}
\]
Since the reward $r$ is not observed directly, clicks are used as implicit feedback, which is a biased and noisy indicator of relevance. The unbiased estimate $\hat{R}$ can be used for unbiased evaluation and optimization (see Appendix 6.A for a proof), since:
\[
\arg\max_{\pi} \mathbb{E}_{o_i, a_i}\big[\hat{R}(\pi \mid \mathcal{D})\big] = \arg\max_{\pi} R(\pi). \tag{6.13}
\]
Previous work has introduced several methods for maximizing $\hat{R}$ so as to optimize different LTR metrics [2, 58].

This concludes our description of the counterfactual LTR basics; importantly, ranking policies can be optimized from clicks without being affected by the logging policy or the users' position bias.
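To make the estimator concrete, the following is a minimal Python sketch of the IPS estimate $\hat{\Delta}$ of Eq. 6.12 for a single logged interaction. The data layout (dictionaries keyed by item) is illustrative and not taken from the thesis.

```python
import math

def dcg_weight(rank):
    # The DCG weight of Eq. 6.3: 1 / log2(rank + 1), with ranks starting at 1.
    return 1.0 / math.log2(rank + 1)

def ips_delta(ranking, clicks, propensities):
    """IPS estimate of the reward of `ranking` (Eq. 6.12).

    ranking:      list of item ids in the order the evaluated policy displays them.
    clicks:       dict item -> 0/1 click indicator from one logged interaction.
    propensities: dict item -> examination propensity rho_i(d) under the logging policy.
    """
    return sum(
        dcg_weight(position) * clicks.get(item, 0) / propensities[item]
        for position, item in enumerate(ranking, start=1)
    )
```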
6.3 GENSPEC for Query Specialization

This section introduces the GENSPEC framework and applies it to query specialization for LTR. Section 6.6 details how it can be applied to the general contextual bandit problem.

We will now propose the first part of the GENSPEC framework, which produces a general policy $\pi_g$ and, for each query $q$, a specialized policy $\pi_q$. GENSPEC uses the logged data $\mathcal{D}$ both to train policies and to evaluate relative performance; to avoid overfitting, we split $\mathcal{D}$ into a training partition $\mathcal{D}^{\text{train}}$ and a policy-selection partition $\mathcal{D}^{\text{sel}}$, so that $\mathcal{D} = \mathcal{D}^{\text{train}} \cup \mathcal{D}^{\text{sel}}$ and $\mathcal{D}^{\text{train}} \cap \mathcal{D}^{\text{sel}} = \emptyset$.

A policy has optimal generalization performance if it maximizes performance across all queries. Thus, given the generalization policy space $\Pi_g$, the optimal general policy is:
\[
\pi_g = \arg\max_{\pi \in \Pi_g} \hat{R}(\pi \mid \mathcal{D}^{\text{train}}). \tag{6.14}
\]
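A sketch of the corresponding data handling, assuming (in line with the experimental setup later in this chapter) a 70%/30% train/policy-selection split; all names and the record layout are illustrative:

```python
import random
from collections import defaultdict

def split_and_filter(interactions, sel_fraction=0.3, seed=42):
    """Split D into D^train and D^sel, and group D^train per query (Eq. 6.15).
    Each interaction record is assumed to be a tuple (c_i, a_i, rho_i, x_i, q_i)."""
    rng = random.Random(seed)
    train, sel = [], []
    for record in interactions:
        (sel if rng.random() < sel_fraction else train).append(record)
    train_per_query = defaultdict(list)
    for record in train:
        train_per_query[record[-1]].append(record)  # record[-1] is the query q_i
    return train, sel, train_per_query
```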
Alternatively, we can also choose to optimize performance for a single query $q$. First, we select only the datapoints in $\mathcal{D}$ where query $q$ was issued:
\[
\mathcal{D}_q = \big\{ (c_i, a_i, \rho_i, x_i, q_i) \in \mathcal{D} \mid q_i = q \big\}. \tag{6.15}
\]
Then the policy $\pi_q$ that is specialized for query $q$ is the policy in the specialization policy space $\Pi_q$ that maximizes the performance when query $q$ is issued:
\[
\pi_q = \arg\max_{\pi \in \Pi_q} \hat{R}(\pi \mid \mathcal{D}^{\text{train}}_q). \tag{6.16}
\]
The motivation for $\pi_q$ is that it has the potential to provide better performance than $\pi_g$ when $q$ is issued. We may expect $\pi_q$ to outperform $\pi_g$ because $\pi_g$ may compromise performance on query $q$ for better performance across all queries, whereas $\pi_q$ never makes such compromises. Furthermore, $\Pi_q$ could contain better policies than $\Pi_g$, because the policies in $\Pi_g$ have to be applicable to all queries, whereas $\Pi_q$ can make use of specific properties of $q$. However, it is also possible that $\pi_g$ and $\pi_q$ provide the same performance. Moreover, since $\mathcal{D}_q$ is a subset of $\mathcal{D}$, the optimization of $\pi_q$ is more vulnerable to noise in the data. As a result, the true performance of $\pi_q$ on query $q$ could be worse than that of $\pi_g$, especially when $\mathcal{D}_q$ is substantially smaller than $\mathcal{D}$. In other words, a priori it is unclear whether $\pi_g$ or $\pi_q$ is preferred. We thus need a method to estimate the optimal choice with a reasonable amount of confidence.

We will now propose the other part of our GENSPEC framework: a meta-policy that safely chooses between deploying $\pi_g$ and $\pi_q$ per query $q$. We wish to avoid deploying $\pi_q$ when it performs worse than $\pi_g$, and, similarly, avoid deploying $\pi_g$ when it is outperformed by the logging policy $\pi_0$. Recently, a method for safe policy deployment was introduced by Jagerman et al. [51] based on high-confidence bounds [119]. The intuition behind their method is that a learned policy $\pi$ should not be deployed before we can be highly confident that it outperforms the logging policy $\pi_0$; otherwise, it is safer to keep the logging policy in deployment.

While previous work has bounded the performance of individual policies [51, 119], we instead bound the difference in performance between two policies directly. Let $\delta(\pi_1, \pi_2)$ indicate the true difference in performance between a policy $\pi_1$ and a policy $\pi_2$:
\[
\delta(\pi_1, \pi_2) = R(\pi_1) - R(\pi_2). \tag{6.17}
\]
Knowing $\delta(\pi_1, \pi_2)$ allows us to optimally choose which of the two policies to deploy. However, we can only estimate its value from historical data $\mathcal{D}$:
\[
\hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) = \hat{R}(\pi_1 \mid \mathcal{D}) - \hat{R}(\pi_2 \mid \mathcal{D}). \tag{6.18}
\]
For brevity, let $R_{i,d}$ indicate the inverse-propensity-scored difference for a single document $d$ at interaction $i$:
\[
R_{i,d} = \frac{c_i(d)}{\rho_i(d)} \sum_{a \in \pi_1 \cup \pi_2} \big(\pi_1(a \mid x_i, q_i) - \pi_2(a \mid x_i, q_i)\big) \cdot \lambda\big(\text{rank}(d \mid a)\big). \tag{6.19}
\]
Then, for computational efficiency, we rewrite:
\[
\hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \sum_{d \in a_i} R_{i,d} = \frac{1}{|\mathcal{D}| K} \sum_{(i,d) \in \mathcal{D}} K \cdot R_{i,d}. \tag{6.20}
\]
For notational purposes, we let $\sum_{(i,d) \in \mathcal{D}}$ iterate over all actions $a_i$ and the $K$ documents $d$ per action $a_i$.
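The per-document terms $R_{i,d}$ are straightforward to compute when the two policies are deterministic, in which case the sum over rankings in Eq. 6.19 collapses to a single ranking per policy. A sketch under that simplifying assumption (all function names and the record layout are illustrative):

```python
def reward_diffs(rank_fn_1, rank_fn_2, interactions, dcg_weight):
    """Per-document terms R_{i,d} of Eq. 6.19 for two deterministic policies:
    R_{i,d} = c_i(d) / rho_i(d) * (lambda(rank under pi_1) - lambda(rank under pi_2)).
    Each interaction carries its clicks dict, propensities dict, and context (x, q)."""
    diffs = []
    for clicks, rho, x, q in interactions:
        for d, clicked in clicks.items():
            if clicked:
                gap = dcg_weight(rank_fn_1(d, x, q)) - dcg_weight(rank_fn_2(d, x, q))
                diffs.append(gap / rho[d])
            else:
                diffs.append(0.0)  # unclicked documents contribute zero
    return diffs
```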
With the confidence parameter $\epsilon \in [0, 1)$, setting $b$ to be the maximum possible absolute value of $R_{i,d}$, i.e., $b = \frac{\max \lambda(\cdot)}{\min \rho}$, and
\[
\nu = \frac{2 |\mathcal{D}| K \ln\frac{2}{1 - \epsilon}}{|\mathcal{D}| K - 1} \sum_{(i,d) \in \mathcal{D}} \big( K \cdot R_{i,d} - \hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) \big)^2,
\]
we follow Thomas et al. [119] to get the high-confidence bound:
\[
CB(\pi_1, \pi_2 \mid \mathcal{D}) = \frac{7 K b \ln\frac{2}{1 - \epsilon}}{3(|\mathcal{D}| K - 1)} + \frac{1}{|\mathcal{D}| K} \cdot \sqrt{\nu}. \tag{6.21}
\]
In turn, this provides us with the following upper and lower confidence bounds on $\delta$:
\[
\begin{aligned}
LCB(\pi_1, \pi_2 \mid \mathcal{D}) &= \hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) - CB(\pi_1, \pi_2 \mid \mathcal{D}), \\
UCB(\pi_1, \pi_2 \mid \mathcal{D}) &= \hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) + CB(\pi_1, \pi_2 \mid \mathcal{D}).
\end{aligned}
\tag{6.22}
\]
As proven by Thomas et al. [119], with a probability of at least $\epsilon$ they bound the true value of $\delta(\pi_1, \pi_2)$:
\[
P\Big( \delta(\pi_1, \pi_2) \in \big[ LCB(\pi_1, \pi_2 \mid \mathcal{D}), UCB(\pi_1, \pi_2 \mid \mathcal{D}) \big] \Big) > \epsilon. \tag{6.23}
\]
These guarantees allow us to safely choose between policies per query $q$. We apply a doubly conservative strategy: $\pi_g$ is not deployed before we are confident that it outperforms $\pi_0$ across all queries; and $\pi_q$ is not deployed before we are confident that it outperforms both $\pi_g$ and $\pi_0$ on query $q$. This strategy results in the GENSPEC meta-policy $\pi_{GS}$:
\[
\pi_{GS}(a \mid x, q) =
\begin{cases}
\pi_q(a \mid x, q), & \text{if } LCB(\pi_q, \pi_g \mid \mathcal{D}^{\text{sel}}_q) > 0 \,\wedge\, LCB(\pi_q, \pi_0 \mid \mathcal{D}^{\text{sel}}_q) > 0, \\
\pi_g(a \mid x, q), & \text{if } LCB(\pi_q, \pi_g \mid \mathcal{D}^{\text{sel}}_q) \leq 0 \,\wedge\, LCB(\pi_g, \pi_0 \mid \mathcal{D}^{\text{sel}}) > 0, \\
\pi_0(a \mid x, q), & \text{otherwise}.
\end{cases}
\tag{6.24}
\]
In theory, this approach can make use of the potential gains of specialization while avoiding its risks. For instance, if the policy-selection partition $\mathcal{D}^{\text{sel}}_q$ is very small, it may be heavily affected by noise, so that the confidence bound $CB$ will be wide and $\pi_q$ will not be deployed. Simultaneously, $\mathcal{D}^{\text{sel}}$ may be large enough so that $\pi_g$ is deployed with high confidence.

We expect that, in practice, the relative bounding of GENSPEC is much more data-efficient than the Safe Exploration Algorithm (SEA) approach by Jagerman et al. [51]. SEA computes an upper bound on the trusted policy and a lower bound on a learned policy, and only deploys the learned policy if its lower bound is greater than the other's upper bound. When the learned policy has higher performance than the other, we expect the relative bounds of GENSPEC to require less data to be certain about this difference than the SEA bounds. In Appendix 6.B we theoretically analyze the difference between these approaches and conclude that the relative bounding of GENSPEC is more efficient if there is a positive covariance between $\hat{R}(\pi_1 \mid \mathcal{D})$ and $\hat{R}(\pi_2 \mid \mathcal{D})$. Because both estimates are based on the same interaction data $\mathcal{D}$, a high covariance is extremely likely.

Previous work has described safety constraints for policy deployment [51, 62, 131]. The authors assume that a baseline policy exists whose behavior is considered safe; other policies are considered unsafe if their performance is worse than the baseline policy by a certain margin. If the logging policy is taken to be the baseline policy, then GENSPEC can meet such constraints [51]. We note that while the safety guarantee is strong for a single bound (Eq. 6.23), when applied to a large number of queries the probability of at least one incorrect bound greatly increases. This problem of multiple comparisons may cause some non-optimal policies to be deployed for some queries. Since we mainly care about overall performance, this is not expected to be an issue; however, in cases where safety constraints are very important, $\epsilon$ can be chosen to account for the number of comparisons.
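Putting Eqs. 6.21-6.24 together, a minimal sketch of the bound computation and the deployment rule follows. It consumes lists of the values $K \cdot R_{i,d}$ (e.g., K times the output of reward_diffs above) and assumes at least two entries per list; the function names are illustrative.

```python
import math

def confidence_bound(scaled_diffs, epsilon, b, K):
    """High-confidence bound of Eq. 6.21. `scaled_diffs` holds the |D|*K values
    K * R_{i,d}; `b` is the maximum absolute value a single R_{i,d} can take."""
    n = len(scaled_diffs)
    mean = sum(scaled_diffs) / n
    log_term = math.log(2.0 / (1.0 - epsilon))
    nu = (2.0 * n * log_term / (n - 1)) * sum((x - mean) ** 2 for x in scaled_diffs)
    return 7.0 * K * b * log_term / (3.0 * (n - 1)) + math.sqrt(nu) / n

def lcb(scaled_diffs, epsilon, b, K):
    # Lower confidence bound on the performance difference (Eq. 6.22).
    return sum(scaled_diffs) / len(scaled_diffs) - confidence_bound(scaled_diffs, epsilon, b, K)

def genspec_choice(spec_vs_gen, spec_vs_log, gen_vs_log, epsilon, b, K):
    # The doubly conservative deployment rule of Eq. 6.24, applied per query.
    if lcb(spec_vs_gen, epsilon, b, K) > 0 and lcb(spec_vs_log, epsilon, b, K) > 0:
        return "specialized"
    if lcb(gen_vs_log, epsilon, b, K) > 0:
        return "generalized"
    return "logging"
```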
This completes our introduction of the GENSPEC framework for query specialization. Figure 6.1 visualizes our approach. We learn from historical interactions gathered using a logging policy $\pi_0$; the interactions are divided into a training and a policy-selection partition per query. Subsequently, a policy is optimized for generalization (to perform well across all queries) and, for each query, a policy is optimized for specialization (to perform well for a single query). While specialization can potentially maximize performance on a specific query, it brings more risks than generalization, since a general policy is optimized on more data and may provide better performance on previously unseen queries. As a solution to this dilemma, we propose a strategy that uses high-confidence bounds on the differences in performance between policies. These bounds are then used to choose safely between the deployment of the logging, general, and specialized policies. In theory, GENSPEC combines the best of both worlds: the high potential of specialization and the broad safety of generalization.

[Figure 6.1 diagram omitted; it shows users issuing queries to the logging policy, the logged interactions divided per context (query) into training data and confidence-bound data, a generalization policy $\pi_g$, specialization policies $\pi_1, \ldots, \pi_5$ for queries $q = 1, \ldots, 5$, and GENSPEC choosing among $\pi_0$, $\pi_g$, and $\pi_q$ per query.]

Figure 6.1: Visualization of the GENSPEC framework applied to query specialization for counterfactual LTR. The data $\mathcal{D}$ is divided per query $q$; many specialized policies $\pi_1, \pi_2, \ldots$ are each optimized for a single query $q \in \{1, 2, \ldots\}$, and a single general policy $\pi_g$ is learned on the data across all queries. Finally, GENSPEC decides which policy to deploy per context, based on high-confidence bounds.

6.4 Experimental Setup

This section discusses our experimental setup and the policies used to evaluate the GENSPEC framework.
To evaluate the GENSPEC framework, we make use of a semi-synthetic experimental setup: queries, relevance judgements, and documents come from industry datasets, while biased and noisy user interactions are simulated using probabilistic user models. This setup is very common in the counterfactual LTR and online LTR literature [2, 58, 84]. We make use of the three largest LTR industry datasets:
Yahoo! Webscope [17],
MSLR-WEB30k [95], and
Istella [27]. Each consists of a set of queries with a preselected set of documents per query; document-query combinations are only represented by feature vectors and a label indicating relevance according to expert annotators. Labels range from 0 (not relevant) to 4 (perfectly relevant): $r(d \mid x, q) \in \{0, 1, 2, 3, 4\}$. User-issued queries are simulated by uniformly sampling from the training and validation partitions of the datasets. Displayed rankings are generated by a logging ranker using a linear model optimized on part of the training partition using supervised LTR [58]. Then, user examination is simulated with probabilities inverse to the displayed rank of a document:
\[
P\big(o(d) = 1 \mid a\big) = \frac{1}{\text{rank}(d \mid a)}.
\]
Finally, user clicks are generated by a click model with a single parameter $\alpha \in \mathbb{R}$ that governs how strongly the click probability $P(c(d) = 1 \mid o(d) = 1, r(d \mid x, q))$ grows with the relevance grade (Eq. 6.25). In our experiments, we use two settings of $\alpha$: the first represents a near-ideal, low-noise setting where relevant documents receive a very large number of clicks; the second represents a noisier and harder setting where the large majority of clicks are on non-relevant documents. Clicks are only generated on the training and validation partitions; 30% of the training clicks are separated for policy selection ($\mathcal{D}^{\text{sel}}$), and hyperparameter optimization is done using counterfactual evaluation with clicks on the validation partition [58].

Some of our baselines are online bandit algorithms; for these baselines no clicks are separated for $\mathcal{D}^{\text{sel}}$, and the algorithms are run online: clicks are not gathered using the logging policy but by applying the algorithms in an online interactive setting.

The evaluation metric we use is normalized DCG (Eq. 6.3) [53] using the ground-truth labels from the datasets. Unlike most LTR work, we do not apply a rank-cutoff when computing the metric; thus, an NDCG of 1.0 indicates that all documents are ranked perfectly (not just the top-k). We separately calculate performance on the test set (Test-NDCG), to evaluate performance on previously unseen queries, and on the training set (Train-NDCG). The total number of clicks is varied, spread uniformly over all queries; the differences in Train-NDCG as more clicks are added allow us to evaluate performance on queries with different levels of popularity.
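A sketch of one step of this click simulation follows. The examination model uses the inverse-rank probabilities above; the click probability given examination is passed in as a function, since the exact parametrization of Eq. 6.25 depends on $\alpha$ and is not reproduced here. All names are illustrative.

```python
import random

def simulate_interaction(ranking, relevance, click_prob, rng=random):
    """Simulate one logged interaction under the position-based examination
    model of the setup: P(o(d)=1 | a) = 1 / rank(d | a).

    `click_prob(grade)` maps a relevance grade to P(c=1 | o=1, grade); the
    exact formula (Eq. 6.25) and its alpha parameter are supplied by the caller.
    """
    clicks = {}
    for rank, doc in enumerate(ranking, start=1):
        examined = rng.random() < 1.0 / rank
        clicked = examined and rng.random() < click_prob(relevance[doc])
        clicks[doc] = 1 if clicked else 0
    return clicks
```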
For the generalization policy space $\Pi_g$ we use feature-based ranking models. This is a natural choice as they can be applied to any query, including previously unseen ones. However, the available features could limit the possible behavior of the policies. We use linear models for $\Pi_g$; optimization is done on $\mathcal{D}^{\text{train}}$ following previous counterfactual LTR work [2]. This results in a learned scoring function $f(d, x, q) \in \mathbb{R}$ according to which items are ranked; due to score ties there can be multiple valid rankings:
\[
\mathcal{A}_g(x, q) = \big\{ a \mid \forall (d_n, d_m) \in x,\; \big( f(d_n, x, q) > f(d_m, x, q) \rightarrow d_n \succ_a d_m \big) \big\}. \tag{6.26}
\]
The general policy $\pi_g$ samples uniformly at random from the set of valid rankings:
\[
\pi_g(a \mid x, q) =
\begin{cases}
\frac{1}{|\mathcal{A}_g(x, q)|} & \text{if } a \in \mathcal{A}_g(x, q), \\
0 & \text{otherwise}.
\end{cases}
\tag{6.27}
\]
For the specialization policy space $\Pi_q$, we follow bandit-style online LTR work and take the tabular approach [69]. Documents are scored according to an unbiased estimate of the Click-Through-Rate (CTR) on query $q$:
\[
\widehat{CTR}(d, q) = \frac{1}{|\mathcal{D}^{\text{train}}_q|} \sum_{i \in \mathcal{D}^{\text{train}}_q} \frac{c_i(d)}{\rho_i(d)}, \tag{6.28}
\]
which maximizes the estimated performance (Eq. 6.12). Due to ties there can be multiple valid rankings:
\[
\mathcal{A}_q(x, q) = \big\{ a \mid \forall (d_n, d_m) \in x,\; \big( \widehat{CTR}(d_n, q) > \widehat{CTR}(d_m, q) \rightarrow d_n \succ_a d_m \big) \big\}. \tag{6.29}
\]
The specialized policy $\pi_q$ also chooses uniformly at random from the set of valid rankings:
\[
\pi_q(a \mid x, q) =
\begin{cases}
\frac{1}{|\mathcal{A}_q(x, q)|} & \text{if } a \in \mathcal{A}_q(x, q), \\
0 & \text{otherwise}.
\end{cases}
\tag{6.30}
\]
The tabular approach is not restrained by the available features and can produce any possible ranking [138]. Consequently, given enough interactions, the tabular approach can perfectly rank items according to relevance. However, CTR cannot be estimated for previously unseen queries, and there $\pi_q$ chooses uniformly at random between all possible rankings. On a query with a single click, $\pi_q$ will place the once-clicked item at the front of the ranking. Since clicks are very noisy, this behavior is very risky, and hence GENSPEC uses confidence bounds to avoid the deployment of such unsafe behavior.
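A sketch of the tabular CTR estimate of Eq. 6.28 for a single query; the record layout (clicks dict, propensities dict per interaction) is an assumption for illustration:

```python
from collections import defaultdict

def estimate_ctr(per_query_interactions):
    """Unbiased CTR estimate of Eq. 6.28 for one query: each click is weighted
    inversely to its examination propensity and averaged over |D^train_q|."""
    totals = defaultdict(float)
    n = len(per_query_interactions)
    for clicks, rho in per_query_interactions:
        for d, clicked in clicks.items():
            if clicked:
                totals[d] += 1.0 / rho[d]
    return {d: total / n for d, total in totals.items()}
```

The specialized policy then sorts documents by descending estimated CTR, breaking ties uniformly at random (Eqs. 6.29 and 6.30).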
6.5 Experimental Results and Discussion

First, we consider the behavior of GENSPEC compared with pure generalization or pure specialization policies. Figures 6.2 and 6.3 show the performance of (i) GENSPEC with different levels of confidence for its bounds ($\epsilon$), along with that of (ii) the logging policy, (iii) the pure generalization policy, and (iv) the pure specialization policies between which the GENSPEC meta-policy chooses. We see that pure generalization requires few clicks to improve over the logging policy but is not able to reach optimal levels of performance. The performance of pure specialization, on the other hand, is initially far below the logging policy. However, after enough clicks have been gathered, performance increases until the optimal ranking is found; when click noise is limited (the low-noise setting of $\alpha$), it reaches perfect performance on all three datasets (Train-NDCG). On the unseen queries where there are no clicks (Test-NDCG), the specialization policy is unable to learn anything and provides random performance (not displayed in Figures 6.2 and 6.3). The initial period of poor performance can be very detrimental to queries that do not receive a large number of clicks. Prior work has found that web-search queries follow a long-tail distribution [113, 115]; White et al. [130] found that 97% of queries received only very few clicks over six months. For such queries, users may only experience the initial poor performance of pure specialization and never see the improvements it brings at convergence. This possibility can be a large deterrent from applying pure specialization in practice [131].

Finally, the GENSPEC policy combines properties of both: after a few clicks it deploys the generalization policy and thus outperforms the logging policy; as more clicks are gathered, specialization policies are activated, further improving performance. In the low-noise setting, the GENSPEC policy with a sufficiently small confidence parameter $\epsilon$ reaches perfect Train-NDCG performance on all three datasets, similar to the pure specialization policy. However, unlike pure specialization, the performance of GENSPEC (with $\epsilon > 0$) never drops below the logging policy. Moreover, we never observe the situation where an increase in the number of clicks results in a decrease in mean performance. There is a delay between when the pure specialization policy is the optimal choice and when GENSPEC activates specialization policies. Thus, while the usage of confidence bounds prevents the performance from dropping below the level of the logging policy, it does so at the cost of this delay. When GENSPEC does not use any bounds, it deploys specialized policies earlier; however, in some cases these deployments result in worse performance than the logging policy, albeit less so than pure specialization. In all our observed results, a modest confidence level $\epsilon$ was enough to prevent any decreases in performance.

[Figure 6.2 graphs omitted; rows: Yahoo! Webscope, MSLR-WEB30k, Istella; columns: Train-NDCG, Test-NDCG; x-axis: Mean Number of Clicks per Query; legend: Generalized Model, Logging Model, Specialized Model, GENSPEC (no bounds), and GENSPEC at four confidence levels $\epsilon$.]

Figure 6.2: Performance of GENSPEC with varying levels of confidence, compared to pure generalization and pure specialization, on clicks generated with the low-noise setting of $\alpha$. We separate queries on the training set (Train-NDCG), which receive clicks, from queries on the test set (Test-NDCG), which do not receive any clicks. Clicks are spread uniformly over the training set; the x-axis indicates the total number of clicks divided by the number of training queries. Results are an average of 10 runs; the shaded area indicates the standard deviation.

[Figure 6.3 graphs omitted; same layout and legend as Figure 6.2, for the high-noise setting of $\alpha$.]

Figure 6.3: Performance of GENSPEC with varying levels of confidence, compared to pure generalization and pure specialization, on clicks generated with the high-noise setting of $\alpha$. Notation is the same as in Figure 6.2.

To conclude, our experimental results show that the GENSPEC meta-policy combines the high performance at convergence of specialization with the safe robustness of generalization. In contrast to pure specialization, which results in very poor performance when not enough clicks have been gathered, GENSPEC effectively avoids incorrect deployment, and under our tested conditions it never performs worse than the logging policy. Meanwhile, GENSPEC achieves considerable gains in performance at convergence, in contrast with pure generalization. Therefore, we conclude that GENSPEC is the best choice in situations where periods of poor performance have to be avoided [131] or when not all queries receive large numbers of clicks [130].

GENSPEC is not the first method that deploys policies based on confidence bounds. As discussed in Section 6.3, Jagerman et al. [51] previously introduced the SEA algorithm. SEA chooses between deploying a generalizing policy or keeping the logging policy in deployment by bounding the performance of both the logging and the generalization policy. When the upper bound of the logging policy is less than the lower bound of the generalizing policy, SEA deploys the latter. The big differences with GENSPEC are that SEA (i) uses two bounds to confidently estimate whether one policy outperforms another, and (ii) does not consider specialization policies. Because GENSPEC directly bounds relative performance, its comparisons only use a single bound, and thus we expect it to be more efficient w.r.t. the number of clicks required than SEA (see Appendix 6.B for a formal analysis).
For a fair comparison, we adapt SEA to choose between the same policies as GENSPEC and provide it with the same click data. Figures 6.4 and 6.5 display the results of this comparison. Across all settings, GENSPEC deploys policies much earlier than SEA with the same level of confidence. While they converge at the same levels of performance, GENSPEC requires considerably less data; e.g., on the Istella dataset GENSPEC deploys with many times less data. Thus, we conclude that the relative bounds of GENSPEC are much more efficient than the existing bounding approach of SEA.

[Figure 6.4 graphs omitted; rows: Yahoo! Webscope, MSLR-WEB30k, Istella; columns: Train-NDCG, Test-NDCG; x-axis: Mean Number of Clicks per Query; legend: Generalized Model, Logging Model, Specialized Model, and SEA and GENSPEC at two confidence levels $\epsilon$ each.]

Figure 6.4: GENSPEC compared to a meta-policy using the SEA bounds (see Section 6.5.2), on clicks generated with the low-noise setting of $\alpha$. Notation is the same as in Figure 6.2.

Obvious baselines for our experiments are methods from the counterfactual LTR field [5, 58, 127]. In our setting, where the observance probabilities are given, all these methods reduce to the method of Oosterhuis and de Rijke [86] (see Chapter 5), i.e., the method used to optimize the pure generalization policy in Figures 6.2 and 6.3. Thus, the comparison between GENSPEC and pure generalization is effectively a comparison between GENSPEC and state-of-the-art counterfactual LTR.
[Figure 6.5 graphs omitted; same layout and legend as Figure 6.4, for the high-noise setting of $\alpha$.]

Figure 6.5: GENSPEC compared to a meta-policy using the SEA bounds (see Section 6.5.2), on clicks generated with the high-noise setting of $\alpha$. Notation is the same as in Figure 6.2.

As expected, we see that GENSPEC reaches the same performance on previously unseen queries (Test-NDCG); but on queries with clicks (Train-NDCG), GENSPEC outperforms standard counterfactual LTR by enormous amounts once many clicks have been gathered. Again, there is a small delay between the moment the generalization policy outperforms the logging policy and when GENSPEC deploys it. Since this observed delay is very short, this downside seems to be heavily outweighed by the large increases in Train-NDCG performance. Thus, we conclude that GENSPEC is preferable over existing counterfactual LTR approaches, due to its ability to incorporate highly specialized models in its policy.

Other related methods are online LTR bandit algorithms [61, 68]. Unlike counterfactual LTR, these bandit methods learn using online interventions: at each timestep they choose which ranking to display to users. Thus, they have some control over the interactions they receive, and attempt to display rankings that will benefit the learning process the most. As baselines we use the hotfix algorithm [138] and the Position-Based Model algorithm (PBM) [69]. The hotfix algorithm is a very general approach; it randomly shuffles the top-n items and ranks them based on pairwise preferences inferred from clicks. The main downside of the hotfix approach is that it can be very detrimental to the user experience due to the randomization. We apply two versions of the hotfix algorithm: one for top-10 reranking, to minimize randomization, and another for reranking the complete ranking. PBM is perfectly suited for our task as it makes the same assumptions about user behavior as our experimental setting. We apply PBM-PIE [69], which results in PBM always displaying the ranking it expects to perform best, thus attempting to maximize the user experience during learning. These methods are very similar to our specialization policies: the bandit baselines memorize the best rankings and do not depend on features at all. Consequently, their learned policies cannot be applied to previously unseen queries.

[Figure 6.6 graphs omitted; rows: Yahoo! Webscope, MSLR-WEB30k, Istella; columns: the low-noise and high-noise click settings; x-axis: Mean Number of Clicks per Query; legend: Position-Based Model, Hotfix-Complete, Hotfix-Top10, Logging, GENSPEC.]

Figure 6.6: GENSPEC compared to various online LTR bandits (see Section 6.5.4). Notation is the same as in Figure 6.2.

Figure 6.6 displays the results of this comparison (we only report the performance of the ranking produced by the hotfix baselines, not of the randomized rankings used to gather clicks). We see that in the low-noise setting, Hotfix-Complete, PBM, and GENSPEC all reach perfect Train-NDCG; however, Hotfix-Complete and PBM reach convergence much earlier than GENSPEC. We attribute this difference to three causes: (i) the online interventions of the bandit baselines; (ii) GENSPEC only uses 70% of the available data for training ($\mathcal{D}^{\text{train}}$), whereas the bandit baselines use everything; and (iii) the delay in deployment added by GENSPEC's usage of confidence bounds. Similar to the pure specialization policies, the earlier moment of convergence of the bandit baselines comes at the cost of an initial period of very poor performance. We conclude that if only the moment of reaching optimal performance matters, PBM is the best choice of method. However, if periods of poor performance should be avoided [131], or if some queries may not receive large numbers of clicks [130], GENSPEC is the better choice. An additional advantage is that GENSPEC is a counterfactual method and does not have to be applied online like the bandit baselines.
Besides the bandit baselines discussed in Section 6.5.4, feature-based methods foronline LTR also exist [82, 111, 126, 132]. A direct experimental comparison withthese methods is beyond the scope of this chapter. However, previous work has alreadycompared these methods with each other [84] and the state-of-the-art method withcounterfactual LTR [50]. Based on the latter work by Jagerman et al. [50] we do notexpect considerable differences between these online LTR methods and counterfactualLTR in our settings. Therefore, we expect that a comparison would lead to similarresults as discussed in Section 6.5.3.
6.6 GENSPEC for Contextual Bandits

So far we have discussed GENSPEC for counterfactual LTR. We will now show that it is also applicable to the broader contextual bandit problem. Instead of a query $q$, we now keep track of an arbitrary context $z \in \{1, 2, \ldots\}$, where
\[
x_i, z_i \sim P(x, z). \tag{6.31}
\]
Data is gathered using a logging policy $\pi_0$:
\[
a_i \sim \pi_0(a \mid x_i, z_i). \tag{6.32}
\]
However, unlike the LTR case, the rewards $r_i$ are observed directly:
\[
r_i \sim P(r \mid a_i, x_i, z_i). \tag{6.33}
\]
With the propensities
\[
\rho_i = \pi_0(a_i \mid x_i, z_i), \tag{6.34}
\]
the data is:
\[
\mathcal{D} = \big\{ (r_i, a_i, \rho_i, x_i, z_i) \big\}_{i=1}^{N}; \tag{6.35}
\]
for specialization the data is filtered per context $z$:
\[
\mathcal{D}_z = \big\{ (r_i, a_i, \rho_i, x_i, z_i) \in \mathcal{D} \mid z_i = z \big\}. \tag{6.36}
\]
Again, data for training $\mathcal{D}^{\text{train}}$ and for policy selection $\mathcal{D}^{\text{sel}}$ are separated. The reward is estimated with an IPS estimator:
\[
\hat{R}(\pi \mid \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \frac{r_i}{\rho_i} \pi(a_i \mid x_i, z_i). \tag{6.37}
\]
With the policy spaces $\Pi_g$ and $\Pi_z$, the policy for generalization is:
\[
\pi_g = \arg\max_{\pi \in \Pi_g} \hat{R}(\pi \mid \mathcal{D}^{\text{train}}); \tag{6.38}
\]
per context $z$, the specialization policy is:
\[
\pi_z = \arg\max_{\pi \in \Pi_z} \hat{R}(\pi \mid \mathcal{D}^{\text{train}}_z). \tag{6.39}
\]
The difference between two policies is estimated by:
\[
\hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) = \hat{R}(\pi_1 \mid \mathcal{D}) - \hat{R}(\pi_2 \mid \mathcal{D}). \tag{6.40}
\]
We differ from the LTR approach by estimating the bounds using:
\[
R_i = \frac{r_i}{\rho_i} \big( \pi_1(a_i \mid x_i, z_i) - \pi_2(a_i \mid x_i, z_i) \big). \tag{6.41}
\]
Following Thomas et al. [119], the confidence bound is:
\[
CB(\pi_1, \pi_2 \mid \mathcal{D}) = \frac{7 b \ln\frac{2}{1 - \epsilon}}{3(|\mathcal{D}| - 1)} + \frac{1}{|\mathcal{D}|} \sqrt{ \frac{2 |\mathcal{D}| \ln\frac{2}{1 - \epsilon}}{|\mathcal{D}| - 1} \sum_{i \in \mathcal{D}} \big( R_i - \hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) \big)^2 }, \tag{6.42}
\]
where $b$ is the maximum possible value of $R_i$. This results in the lower bound:
\[
LCB(\pi_1, \pi_2 \mid \mathcal{D}) = \hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) - CB(\pi_1, \pi_2 \mid \mathcal{D}), \tag{6.43}
\]
which is used by the GENSPEC meta-policy:
\[
\pi_{GS}(a \mid x, z) =
\begin{cases}
\pi_z(a \mid x, z), & \text{if } LCB(\pi_z, \pi_g \mid \mathcal{D}^{\text{sel}}_z) > 0 \,\wedge\, LCB(\pi_z, \pi_0 \mid \mathcal{D}^{\text{sel}}_z) > 0, \\
\pi_g(a \mid x, z), & \text{if } LCB(\pi_z, \pi_g \mid \mathcal{D}^{\text{sel}}_z) \leq 0 \,\wedge\, LCB(\pi_g, \pi_0 \mid \mathcal{D}^{\text{sel}}) > 0, \\
\pi_0(a \mid x, z), & \text{otherwise}.
\end{cases}
\tag{6.44}
\]
As such, GENSPEC can be applied to the contextual bandit problem for any arbitrary choice of context $z$.
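A minimal sketch of the generic estimators of this section, assuming each record follows the layout of Eq. 6.35 and that pi(a, x, z) returns a policy's probability of action a; these names are illustrative:

```python
def ips_reward(pi, data):
    # The IPS reward estimate of Eq. 6.37 over records (r_i, a_i, rho_i, x_i, z_i).
    return sum(r / rho * pi(a, x, z) for r, a, rho, x, z in data) / len(data)

def relative_terms(pi_1, pi_2, data):
    # The terms R_i of Eq. 6.41; these feed the same empirical-Bernstein bound
    # as in the LTR case (Eq. 6.42), with one term per interaction.
    return [r / rho * (pi_1(a, x, z) - pi_2(a, x, z)) for r, a, rho, x, z in data]
```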
6.7 Conclusion

In this chapter we have introduced the Generalization and Specialization (GENSPEC) framework for contextual bandit problems. For an arbitrary choice of contexts, it simultaneously learns a general policy that performs well across all contexts and many specialized policies, each optimized for a single context. Then, per context, the GENSPEC meta-policy uses high-confidence bounds to choose between deploying the logging policy, the general policy, or a specialized policy. As a result, GENSPEC combines the robust safety of a general policy with the high performance of a successfully specialized policy.

We have shown how GENSPEC can be applied to query-specialization for counterfactual LTR. Our results show that GENSPEC combines the high performance of specialized policies on queries with sufficiently many interactions with robust performance on queries that were previously unseen or where little data is available. Thus, it avoids the limited performance at convergence of the feature-based models underlying the general policy, and the initial poor performance of the tabular models underlying the specialized policies. We expect that GENSPEC can also be used for other types of specialization by choosing different context divisions; e.g., personalization for LTR is a promising choice.

With these findings we can answer thesis research question
RQ6 positively: using GENSPEC, we can combine the specialization ability of bandit-style online LTR with the robust generalization of feature-based LTR. As a result, the choice between specialization and generalization can now be made in a principled, theoretically grounded manner. For the LTR field this means that bandit-style LTR and feature-based LTR can now be seen as complementary, instead of as a mutually exclusive choice.

Future work could explore other contextual bandit problems and choices of context. Additionally, we hope that the robust safety of GENSPEC further incites the application of bandit algorithms in practice. While this chapter considered GENSPEC for counterfactual LTR, Chapter 8 introduces a novel method that is effective at both counterfactual LTR and online LTR. With only small adaptations, the contributions of both chapters could be combined, potentially resulting in GENSPEC for both online and counterfactual LTR. Future work could investigate the effectiveness of this possible combined approach.
6.A Proof of Unbiasedness

This appendix proves that the IPS estimate $\hat{R}$ (Eq. 6.11) can be used to unbiasedly optimize the true reward $R$ (Eq. 6.1), as claimed in Section 6.2.2. For this proof we rely on the following assumptions: (i) LTR metrics are linear combinations of item relevances (Eq. 6.2); (ii) clicks never occur on unobserved items (Eq. 6.7); and (iii) click probabilities (conditioned on observance) are proportional to relevance (Eq. 6.8).

First, we consider the expected value of an observed click $c_i(d)$ using Eq. 6.7; for brevity we write $r(d) = r(d \mid x_i, q_i)$:
\[
\begin{aligned}
\mathbb{E}_{o_i, a_i}\big[c_i(d)\big]
&= \mathbb{E}_{a_i}\Big[ P\big(c_i(d) = 1 \mid o_i(d) = 1, r(d)\big) \cdot P\big(o_i(d) = 1 \mid a_i\big) \Big] \\
&= P\big(c_i(d) = 1 \mid o_i(d) = 1, r(d)\big) \cdot \Big( \sum_{a} P\big(o_i(d) = 1 \mid a\big) \cdot \pi_0(a \mid x_i, q_i) \Big) \\
&= \rho_i(d) \cdot P\big(c_i(d) = 1 \mid o_i(d) = 1, r(d)\big).
\end{aligned}
\tag{6.45}
\]
Then, consider the expected value of the IPS estimator, and note that $a_i$ is a historically observed action and that $a$ is the action being evaluated:
\[
\begin{aligned}
\mathbb{E}_{o_i, a_i}\big[\hat{\Delta}(a \mid c_i, \rho_i)\big]
&= \mathbb{E}_{o_i, a_i}\Big[ \sum_{d \in a} \lambda\big(\text{rank}(d \mid a)\big) \cdot \frac{c_i(d)}{\rho_i(d)} \Big] \\
&= \sum_{d \in a} \frac{\rho_i(d)}{\rho_i(d)} \cdot \lambda\big(\text{rank}(d \mid a)\big) \cdot P\big(c_i(d) = 1 \mid o_i(d) = 1, r(d)\big) \\
&= \sum_{d \in a} \lambda\big(\text{rank}(d \mid a)\big) \cdot P\big(c_i(d) = 1 \mid o_i(d) = 1, r(d)\big).
\end{aligned}
\tag{6.46}
\]
This step assumes that $\rho_i(d) > 0$, i.e., that every item has a non-zero probability of being examined [58]. While $\mathbb{E}_{o_i, a_i}[\hat{\Delta}(a \mid c_i, \rho_i)]$ and $\Delta(a \mid x_i, q_i, r)$ are not necessarily equal, using Eq. 6.8 we see that they are proportional with some offset $C$:
\[
\mathbb{E}_{o_i, a_i}\big[\hat{\Delta}(a \mid c_i, \rho_i)\big] \propto \Big( \sum_{d \in a} \lambda\big(\text{rank}(d \mid a)\big) \cdot r(d) \Big) + C = \Delta(a \mid x_i, q_i, r) + C, \tag{6.47}
\]
where $C$ is a constant: $C = \big( \sum_{i=1}^{K} \lambda(i) \big) \cdot \mu$. Therefore, in expectation, $\hat{R}$ and $R$ are also proportional with the same constant offset:
\[
\mathbb{E}_{o_i, a_i}\big[\hat{R}(\pi \mid \mathcal{D})\big] \propto R(\pi) + C. \tag{6.48}
\]
Consequently, the estimator can be used to unbiasedly estimate the preference between two policies:
\[
\mathbb{E}_{o_i, a_i}\big[\hat{R}(\pi_1 \mid \mathcal{D})\big] < \mathbb{E}_{o_i, a_i}\big[\hat{R}(\pi_2 \mid \mathcal{D})\big] \;\Leftrightarrow\; R(\pi_1) < R(\pi_2). \tag{6.49}
\]
Moreover, this implies that maximizing the estimated performance unbiasedly optimizes the actual reward:
\[
\arg\max_{\pi} \mathbb{E}_{o_i, a_i}\big[\hat{R}(\pi \mid \mathcal{D})\big] = \arg\max_{\pi} R(\pi). \tag{6.50}
\]
This concludes our proof. We have shown that $\hat{R}$ is suitable for counterfactual evaluation, since it can unbiasedly identify whether a policy outperforms another (Eq. 6.49), and, furthermore, that $\hat{R}$ can be used for unbiased LTR, i.e., it can be used to find the optimal policy (Eq. 6.50).
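The expectation steps above can be checked numerically. Below is a hypothetical Monte-Carlo verification of Eqs. 6.45-6.46 with toy values (none of these numbers come from the thesis): the IPS estimate, averaged over simulated interactions, should match the closed-form expectation up to sampling noise.

```python
import math
import random

random.seed(0)
relevance = {0: 3, 1: 1, 2: 0, 3: 2}                 # toy relevance grades r(d)
observe = {0: 1.0, 1: 0.5, 2: 1 / 3, 3: 0.25}        # P(o(d)=1 | a) under the logged ranking
mu = 0.1                                              # click offset of Eq. 6.8
p_click = {d: 0.2 * (r + mu) for d, r in relevance.items()}  # P(c=1 | o=1, r) ~ r + mu
rho = dict(observe)                                   # deterministic logging policy: rho(d) = P(o(d)=1 | a)

def dcg_weight(rank):
    return 1.0 / math.log2(rank + 1)

evaluated = [3, 0, 1, 2]                              # the ranking a being evaluated
n_samples = 100000
total = 0.0
for _ in range(n_samples):
    for pos, d in enumerate(evaluated, start=1):
        if random.random() < observe[d] and random.random() < p_click[d]:
            total += dcg_weight(pos) / rho[d]         # the IPS estimator of Eq. 6.12

mc_estimate = total / n_samples
expectation = sum(dcg_weight(p) * p_click[d] for p, d in enumerate(evaluated, start=1))
print(mc_estimate, expectation)                       # should agree up to Monte-Carlo noise
```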
6.B Efficiency of Relative Bounding by GENSPEC

Our experimental results showed that GENSPEC chooses between policies more efficiently than when using the SEA bounds [51]. In other words, when one policy has higher performance than another, the relative bounds of GENSPEC require less data to be certain about this difference than the SEA bounds. In this section, we prove that the relative bounds of GENSPEC are more efficient than the SEA bounds when the covariance between the reward estimates of the two policies is positive:
\[
\text{cov}\big( \hat{R}(\pi_1 \mid \mathcal{D}), \hat{R}(\pi_2 \mid \mathcal{D}) \big) > 0. \tag{6.51}
\]
This means that GENSPEC will deploy a policy earlier than SEA if there is high covariance; since both estimates are based on the same interaction data $\mathcal{D}$, a high covariance is very likely.

Let us first consider when GENSPEC deploys a policy. Deployment by GENSPEC depends on whether a relative confidence bound is smaller than the estimated difference in performance (cf. Eq. 6.24). For two policies $\pi_1$ and $\pi_2$, deployment happens when:
\[
\hat{R}(\pi_1 \mid \mathcal{D}) - \hat{R}(\pi_2 \mid \mathcal{D}) - CB(\pi_1, \pi_2 \mid \mathcal{D}) > 0. \tag{6.52}
\]
Thus the bound has to be smaller than the estimated performance difference:
\[
CB(\pi_1, \pi_2 \mid \mathcal{D}) < \hat{R}(\pi_1 \mid \mathcal{D}) - \hat{R}(\pi_2 \mid \mathcal{D}). \tag{6.53}
\]
In contrast, SEA does not use a single bound, but bounds the performance of both policies. For clarity, we reformulate the SEA bound in our notation. First, we have $R^{\pi_j}_{i,d}$, the observed reward for an item $d$ at interaction $i$ for policy $\pi_j$:
\[
R^{\pi_j}_{i,d} = \frac{c_i(d)}{\rho_i(d)} \sum_{a \in \pi_j} \pi_j(a \mid x_i, q_i) \cdot \lambda\big(\text{rank}(d \mid a)\big). \tag{6.54}
\]
Then we have a $\nu_{\pi_j}$ for each policy:
\[
\nu_{\pi_j} = \frac{2 |\mathcal{D}| K \ln\frac{2}{1 - \epsilon}}{|\mathcal{D}| K - 1} \sum_{(i,d) \in \mathcal{D}} \big( K \cdot R^{\pi_j}_{i,d} - \hat{R}(\pi_j \mid \mathcal{D}) \big)^2,
\]
which we use to write the confidence bound for a single policy $\pi_j$:
\[
CB(\pi_j \mid \mathcal{D}) = \frac{7 K b \ln\frac{2}{1 - \epsilon}}{3(|\mathcal{D}| K - 1)} + \frac{1}{|\mathcal{D}| K} \cdot \sqrt{\nu_{\pi_j}}. \tag{6.55}
\]
We note that the $b$ parameter has the same value for both the relative and single confidence bounds. SEA chooses between policies by comparing their upper and lower confidence bounds:
\[
\hat{R}(\pi_1 \mid \mathcal{D}) - CB(\pi_1 \mid \mathcal{D}) > \hat{R}(\pi_2 \mid \mathcal{D}) + CB(\pi_2 \mid \mathcal{D}). \tag{6.56}
\]
In this case, the sum of the bounds has to be smaller than the estimated performance difference:
\[
CB(\pi_1 \mid \mathcal{D}) + CB(\pi_2 \mid \mathcal{D}) < \hat{R}(\pi_1 \mid \mathcal{D}) - \hat{R}(\pi_2 \mid \mathcal{D}). \tag{6.57}
\]
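To make the contrast concrete before the formal comparison, here is an illustrative sketch of the two deployment tests. It uses simplified bound widths that keep only the variance term (dropping the $7Kb/(3(n-1))$ term of Eq. 6.55); both the names and this simplification are assumptions for illustration only.

```python
import math
import statistics

def bound_width(samples, epsilon):
    # Simplified width: sqrt(2 * sample-variance * ln(2 / (1 - epsilon)) / n).
    n = len(samples)
    return math.sqrt(2.0 * statistics.variance(samples) * math.log(2.0 / (1.0 - epsilon)) / n)

def genspec_deploys(rewards_1, rewards_2, epsilon):
    # Relative test of Eq. 6.53: a single bound on the per-interaction differences.
    diffs = [a - b for a, b in zip(rewards_1, rewards_2)]
    return statistics.mean(diffs) > bound_width(diffs, epsilon)

def sea_deploys(rewards_1, rewards_2, epsilon):
    # SEA test of Eq. 6.57: two separate bounds, one per policy.
    return (statistics.mean(rewards_1) - bound_width(rewards_1, epsilon)
            > statistics.mean(rewards_2) + bound_width(rewards_2, epsilon))
```

When rewards_1 and rewards_2 are positively correlated, their differences have lower variance than the individual reward lists, so genspec_deploys fires with less data; this is exactly the condition derived formally below.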
1) + 1 |D| K · √ ν π j . (6.55)We note that the b parameter has the same value for both the relative and single con-fidence bounds. SEA chooses between policies by comparing their upper and lowerconfidence bounds: ˆ R ( π | D ) − CB ( π | D ) > ˆ R ( π | D ) + CB ( π | D ) . (6.56)In this case, the summation of the bounds has to be smaller than the estimated perfor-mance difference: CB ( π | D ) + CB ( π | D ) < ˆ R ( π | D ) − ˆ R ( π | D ) . (6.57)We can now formally describe under which condition GENSPEC is more efficientthan SEA: by combining Eq. 6.53 and Eq. 6.57, we see that relative bounding is moreefficient when: CB ( π , π | D ) < CB ( π | D ) + CB ( π | D ) . (6.58)We notice that D , K , b and (cid:15) have the same value for both confidence bounds, thus weonly require: √ ν < √ ν π + √ ν π . (6.59)If we assume that D is sufficiently large, we see that √ ν approximates the standarddeviation scaled by some constant: √ ν ≈ C · (cid:113) var (cid:0) ˆ δ ( π , π |D ) (cid:1) , (6.60)where the constant is: C = (cid:114) |D| K ln (cid:0) − (cid:15) (cid:1) |D| K − . Since the purpose of the bounds is toprevent deployment until enough certainty has been gained, we think it is safe to assumethat D is large enough for this approximation before any deployment takes place.To keep our notation concise, we use the following: ˆ δ = ˆ δ ( π , π |D ) , ˆ R =ˆ R ( π |D ) , and ˆ R = ˆ R ( π |D ) . Using the same approximations for √ ν π and √ ν π we get: (cid:113) var (ˆ δ ) < (cid:113) var ( ˆ R ) + (cid:113) var ( ˆ R ) . (6.61)By making use of the Cauchy-Schwarz inequality, we can derive the following lowerbound: (cid:113) var ( ˆ R ) + var ( ˆ R ) ≤ (cid:113) var ( ˆ R ) + (cid:113) var ( ˆ R ) . (6.62)122 .C. Notation Reference for Chapter 6 Therefore, the relative bounding of GENSPEC must be more efficient when the follow-ing is true: var (ˆ δ ) < var ( ˆ R ) + var ( ˆ R ) , (6.63)i.e. the variance of the relative estimator must be less than the sum of the variances ofthe estimators for the individual policies. Finally, by rewriting var (ˆ δ ) to:var (ˆ δ ) = var ( ˆ R − ˆ R ) = var ( ˆ R ) + var ( ˆ R ) − cov ( ˆ R , ˆ R ) , (6.64)we see that the relative bounds of GENSPEC are more efficient than the multiple boundsof SEA if the covariance between ˆ R and ˆ R is positive:cov ( ˆ R , ˆ R ) > . (6.65)Remember that both estimates are based on the same interaction data: ˆ R = ˆ R ( π |D ) ,and ˆ R = ˆ R ( π |D ) . Therefore, they are based on the same clicks and propensitiesscores, thus it is extremely likely that the covariance between the estimates is positive.Correspondingly, it is also extremely likely that the relative bounds of GENSPEC aremore efficient than the bounds used by SEA. Notation Description K the number of items that can be displayed in a single ranking i an iteration number q a user-issued query x contextual information, i.e., additional features d an item to be ranked a a ranked list π a ranking policy π ( a | q ) the probability that policy π displays ranking a for query qr ( d | x, q ) the relevance of item d w.r.t. query q given context xλ (cid:0) rank ( d | a ) (cid:1) a metric function that weights items depending on their rank D the available interaction data c i ( d ) a function indicating item d was clicked at iteration io i ( d ) a function indicating item d was observed at iteration i Taking the Counterfactual Online:
Efficient and Unbiased Online Evaluation for Ranking
Counterfactual evaluation can estimate Click-Through-Rate (CTR) differences between ranking systems based on historical interaction data, while mitigating the effect of position bias and item-selection bias. In contrast, online evaluation methods, designed for ranking, estimate performance differences between ranking systems by showing interleaved rankings to users and observing their clicks. We are curious to find out whether the online interventions of online evaluation methods truly result in more efficient evaluation, and additionally, whether the popular interleaving methods are truly unbiased w.r.t. biases such as position bias. Accordingly, this chapter considers the following two thesis research questions:
RQ7
Can counterfactual evaluation methods for ranking be extended to perform efficient and effective online evaluation?
RQ8
Are existing interleaving methods truly capable of unbiased evaluation w.r.t. position bias?

We introduce the novel Logging-Policy Optimization Algorithm (LogOpt), which optimizes the policy for logging data so that the counterfactual estimate has minimal variance. As minimizing variance leads to faster convergence, LogOpt increases the data-efficiency of counterfactual estimation. LogOpt turns the counterfactual approach – which is indifferent to the logging policy – into an online approach, where the algorithm decides what rankings to display. We prove that, as an online evaluation method, LogOpt is unbiased w.r.t. position and item-selection bias, unlike existing interleaving methods. Furthermore, we perform large-scale experiments by simulating comparisons between thousands of rankers. Our results show that while interleaving methods make systematic errors, LogOpt is as efficient as interleaving without being biased. Lastly, we provide a formal proof that shows interleaving methods are not unbiased w.r.t. position bias.
This chapter was published as [85]. Appendix 7.C gives a reference for the notation used in this chapter.

7.1 Introduction
Evaluation is essential for the development of search and recommendation systems [45, 64]. Before any ranking model is widely deployed, it is important to first verify whether it is a true improvement over the currently-deployed model. A traditional way of evaluating relative differences between systems is through A/B testing, where part of the user population is exposed to the current system ("control") and the rest to the altered system ("treatment") during the same time period. Differences in behavior between these groups can then indicate if the alterations brought improvements, e.g., if the treatment group showed a higher CTR or more revenue was made with this system [18]. Interleaving has been introduced in Information Retrieval (IR) as a more efficient alternative to A/B testing [56]. Interleaving algorithms take the rankings produced by two ranking systems, and for each query create an interleaved ranking by combining the rankings from both systems. Clicks on the interleaved rankings directly indicate relative differences. Repeating this process over a large number of queries and averaging the results leads to an estimate of which ranker would receive the highest CTR [44]. Previous studies have found that interleaving requires fewer interactions than A/B testing, which enables consistent comparisons in a much shorter timespan [18, 110]. More recently, counterfactual evaluation for rankings has been proposed by Joachims et al. [58] to evaluate a ranking model based on clicks gathered using a different model. By correcting for the position bias introduced during logging, the counterfactual approach can unbiasedly estimate the CTR of a new model on historical data. To achieve this, counterfactual evaluation makes use of Inverse Propensity Scoring (IPS), where clicks are weighted inversely to the probability that a user examined them during logging [127]. A big advantage compared to interleaving and A/B testing is that counterfactual evaluation does not require online interventions.

In this chapter, we show that no existing interleaving method is truly unbiased: they are not guaranteed to correctly predict which ranker has the highest CTR. On two different industry datasets, we simulate a total of 1,000 comparisons between 2,000 different rankers. In our setup, interleaving methods converge on the wrong answer for at least 2.2% of the comparisons on both datasets. A further analysis shows that existing interleaving methods are unable to reliably estimate CTR differences of around 1% or lower. Therefore, in practice these systematic errors are expected to impact situations where rankers with a very similar CTR are compared.

We propose a novel online evaluation algorithm: the Logging-Policy Optimization Algorithm (LogOpt). LogOpt extends the existing unbiased counterfactual approach, and turns it into an online approach. LogOpt estimates which rankings should be shown to the user, so that the variance of its CTR estimate is minimized. In other words, it attempts to learn the logging policy that leads to the fastest possible convergence of the counterfactual estimation. Our experimental results indicate that our novel approach is as efficient as any interleaving method or A/B testing, without having a systematic error. As predicted by the theory, we see that the estimates of our approach converge on the true CTR difference between rankers. Therefore, we have introduced the first online evaluation method that combines high efficiency with unbiased estimation.

The main contributions of this chapter are:
1. The first logging-policy optimization method for minimizing the variance in counterfactual CTR estimation.
2. The first unbiased online evaluation method that is as efficient as state-of-the-art interleaving methods.
3. A large-scale analysis of existing online evaluation methods that reveals a previously unreported bias in interleaving methods.

7.2 Preliminaries: Ranker Comparisons
The overarching goal of ranker evaluation is to find the ranking model that provides the best rankings. For the purposes of this chapter, we define the quality of a ranker in terms of the number of clicks it is expected to receive. Let R indicate a ranking and let E[CTR(R)] ∈ ℝ_{≥0} be the expected number of clicks a ranking receives after being displayed to a user. We consider ranking R_1 to be better than R_2 if in expectation it receives more clicks: E[CTR(R_1)] > E[CTR(R_2)]. We represent a ranking model by a policy π, with π(R | q) as the probability that π displays R for a query q. With P(q) as the probability of a query q being issued, the expected number of clicks received under a ranking model π is:

  E[CTR(π)] = Σ_q P(q) Σ_R E[CTR(R)] π(R | q).  (7.1)

Our goal is to discover the E[CTR] difference between two policies:

  ∆(π_1, π_2) = E[CTR(π_1)] − E[CTR(π_2)].  (7.2)

We recognize that to correctly identify whether one policy is better than another, we merely need the corresponding binary indicator:

  ∆_bin(π_1, π_2) = sign(∆(π_1, π_2)).  (7.3)

However, in practice the magnitude of the difference can be very important: for instance, if one policy is computationally much more expensive while only having a slightly higher E[CTR], it may be preferable to use the other in production. Therefore, estimating the absolute E[CTR] difference is more desirable in practice.

Any proof regarding estimators that use user interactions must rely on assumptions about user behavior. In this chapter, we assume that only two forms of interaction bias are at play: position bias and item-selection bias. Users generally do not examine all items that are displayed in a ranking, and only click on examined items [20]. As a result, a lower probability of examination for an item also makes it less likely to be clicked. Position bias assumes that only the rank determines the probability of examination [25]. Furthermore, we assume that, given an examination, only the relevance of an item determines the click probability. Let c(d) ∈ {0, 1} indicate a click on item d and o(d) ∈ {0, 1} examination by the user. These assumptions result in the following assumed click probability:

  P(c(d) = 1 | R, q) = P(o(d) = 1 | R) P(c(d) = 1 | o(d) = 1, q) = θ_{rank(d|R)} ζ_{d,q}.  (7.4)

Here rank(d | R) indicates the rank of d in R; for brevity we use θ_{rank(d|R)} to denote the examination probability:

  θ_{rank(d|R)} = P(o(d) = 1 | R),  (7.5)

and ζ_{d,q} for the conditional click probability:

  ζ_{d,q} = P(c(d) = 1 | o(d) = 1, q).  (7.6)

We also assume that item-selection bias is present; this type of bias is an extreme form of position bias that results in zero examination probabilities for some items [86, 92]. This bias is unavoidable in top-k ranking settings, where only the k ∈ ℕ_{>0} highest ranked items are displayed. Consequently, any item beyond rank k cannot be observed or examined by the user: ∀r ∈ ℕ_{>0} (r > k → θ_r = 0). The distinction between item-selection bias and position bias is important because the original counterfactual evaluation method [58] is only able to correct for position bias when no item-selection bias is present [86, 92]. Based on these assumptions, we can now formulate the expected CTR of a ranking:

  E[CTR(R)] = Σ_{d∈R} P(c(d) = 1 | R, q) = Σ_{d∈R} θ_{rank(d|R)} ζ_{d,q}.  (7.7)
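To illustrate the click model of Eqs. 7.4–7.7, the following sketch computes the expected CTR of a ranking and samples clicks from it; the θ and ζ values are hypothetical:

```python
import random

# Hypothetical bias and relevance parameters for a top-3 ranking.
theta = [1.0, 0.5, 0.25]                   # examination probability per rank
zeta = {"d1": 0.9, "d2": 0.1, "d3": 0.5}   # P(click | examined) per item

def expected_ctr(ranking, theta, zeta):
    # Eq. 7.7: sum over displayed items of theta_rank * zeta_d.
    return sum(t * zeta[d] for d, t in zip(ranking, theta))

def sample_clicks(ranking, theta, zeta, rng=random):
    # Eq. 7.4: an item is clicked iff it is examined and then clicked.
    return {d: int(rng.random() < t * zeta[d]) for d, t in zip(ranking, theta)}

ranking = ["d1", "d2", "d3"]
print(expected_ctr(ranking, theta, zeta))  # 0.9*1.0 + 0.1*0.5 + 0.5*0.25
print(sample_clicks(ranking, theta, zeta))
```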
While we assume this model of user behavior, its parameters are still assumed unknown. Therefore, the methods in this chapter have to estimate E[CTR] without prior knowledge of θ or ζ.

Recall that our goal is to estimate the CTR difference between rankers (Eq. 7.2); online evaluation methods do this based on user interactions. Let I be the set of available user interactions; it contains N tuples of a single (issued) query q_i, the corresponding displayed ranking R_i, and the observed user clicks c_i:

  I = {(q_i, R_i, c_i)}_{i=1}^{N}.  (7.8)

Each evaluation method has a different effect on which rankings will be displayed to users. Furthermore, each evaluation method converts each interaction into a single estimate using some function f:

  x_i = f(q_i, R_i, c_i).  (7.9)

The final estimate is simply the mean over these estimates:

  ∆̂(I) = (1/N) Σ_{i=1}^{N} x_i = (1/N) Σ_{i=1}^{N} f(q_i, R_i, c_i).  (7.10)

This description fits all existing online and counterfactual evaluation methods for rankings. Every evaluation method uses a different function f to convert interactions into estimates; moreover, online evaluation methods also decide which rankings R to display when collecting I. These two choices result in different estimators. Before we discuss the individual methods, we briefly introduce the three properties we desire of each estimator: consistency, unbiasedness, and variance.
• Consistency – an estimator is consistent if it converges as the number of issued queries N increases. All existing evaluation methods are consistent, as their final estimates are means of bounded values.
• Unbiasedness – an estimator is unbiased if its estimate is equal to the true CTR difference in expectation:

  Unbiased(∆̂) ⇔ E[∆̂(I)] = ∆(π_1, π_2).  (7.11)

If an estimator is both consistent and unbiased, it is guaranteed to converge on the true E[CTR] difference.
• Variance – the variance of an estimator is the expected squared deviation between a single estimate x and the mean ∆̂(I):

  Var(∆̂) = E[(x − E[∆̂(I)])²].  (7.12)

Variance affects the rate of convergence of an estimator; for fast convergence it should be as low as possible.
In summary, our goal is to find an estimator, for the CTR difference between two ranking models, that is consistent, unbiased, and has minimal variance.
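The estimator template of Eqs. 7.9–7.12 translates directly into code; a minimal sketch (where the per-interaction function f is a hypothetical argument) could look as follows:

```python
from statistics import fmean

def estimate(interactions, f):
    """Eq. 7.10: the final estimate is the mean of the per-interaction
    estimates x_i = f(q_i, R_i, c_i)."""
    return fmean(f(q, R, c) for (q, R, c) in interactions)

def empirical_variance(interactions, f):
    """Eq. 7.12: expected squared deviation of a single estimate x_i
    from the mean estimate."""
    xs = [f(q, R, c) for (q, R, c) in interactions]
    m = fmean(xs)
    return fmean((x - m) ** 2 for x in xs)
```

Each concrete evaluation method in the next section amounts to a different choice of f (and, for online methods, a different choice of which rankings to display).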
7.3 Existing Online and Counterfactual Evaluation Methods

We describe three families of online and counterfactual evaluation methods for ranking.

7.3.1 A/B Testing

A/B testing is a well-established form of online evaluation to compare a system A with a system B [64]. Users are randomly split into two groups, and during the same time period each group is exposed to only one of the systems. In expectation, the only factor that differs between the groups is the exposure to the different systems. Therefore, by comparing the behavior of each user group, the relative effect each system has can be evaluated.

We will briefly show that A/B testing is unbiased for E[CTR] difference estimation. For each interaction, either π_1 or π_2 determines the ranking; let A_i ∈ {1, 2} indicate the assignment, with A_i ∼ P(A). Thus, if A_i = 1 then R_i ∼ π_1(R | q), and if A_i = 2 then R_i ∼ π_2(R | q). Each interaction i is converted into a single estimate x_i by f_{A/B}:

  x_i = f_{A/B}(q_i, R_i, c_i) = (1[A_i = 1]/P(A = 1) − 1[A_i = 2]/P(A = 2)) Σ_{d∈R_i} c_i(d).  (7.13)

We can prove that A/B testing is unbiased, since in expectation each individual estimate is equal to the CTR difference:

  E[f_{A/B}(q_i, R_i, c_i)]
    = Σ_q P(q) (P(A = 1) Σ_R π_1(R | q) E[CTR(R)] / P(A = 1) − P(A = 2) Σ_R π_2(R | q) E[CTR(R)] / P(A = 2))
    = Σ_q P(q) Σ_R E[CTR(R)] (π_1(R | q) − π_2(R | q))
    = E[CTR(π_1)] − E[CTR(π_2)] = ∆(π_1, π_2).  (7.14)

Variance is harder to evaluate without knowledge of π_1 and π_2. Unless ∆(π_1, π_2) = 0, some variance is unavoidable, since A/B testing alternates between estimating CTR(π_1) and CTR(π_2).
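A minimal sketch of the f_{A/B} estimate of Eq. 7.13, with a toy simulation whose CTR values are assumed purely for illustration:

```python
import random

def f_ab(assignment, clicks, p_a=0.5):
    """Eq. 7.13: inverse-probability-weighted click total for one interaction.
    assignment: 1 if pi_1 produced the ranking, 2 if pi_2 did;
    clicks: list of 0/1 click indicators on the displayed ranking."""
    sign = 1.0 / p_a if assignment == 1 else -1.0 / (1.0 - p_a)
    return sign * sum(clicks)

# Hypothetical usage: each interaction assigns a user to one system at random.
rng = random.Random(0)
interactions = []
for _ in range(10000):
    a = 1 if rng.random() < 0.5 else 2
    ctr = 0.35 if a == 1 else 0.30        # assumed true expected clicks
    clicks = [int(rng.random() < ctr)]    # toy one-item "ranking"
    interactions.append((a, clicks))

est = sum(f_ab(a, c) for a, c in interactions) / len(interactions)
print(est)  # should be near 0.35 - 0.30 = 0.05 in expectation
```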
7.3.2 Interleaving

Interleaving methods were introduced specifically for evaluation in ranking, as a more efficient alternative to A/B testing [56]. After a query is issued, interleaving methods take the rankings of two competing ranking systems and combine them into a single interleaved ranking. Any clicks on the interleaved ranking can be interpreted as a preference signal between the two ranking systems. Thus, unlike A/B testing, interleaving does not estimate the CTR of individual systems but a relative preference; the idea is that this allows it to be more efficient than A/B testing.

Each interleaving method attempts to use randomization to counter position bias, without deviating too much from the original rankings, so as to maintain the user experience [56]. Team-draft interleaving (TDI) randomly selects one ranker to place its top document first; then the other ranker places its top (unplaced) document next [99]. It then randomly decides the next two documents, and this process is repeated until all documents are placed in the interleaved ranking. Clicks on the documents are attributed to the ranker that placed them. The ranker with the most attributed clicks is inferred to be preferred by the user.
Probabilistic interleaving (PI) treats each ranking as a probability distribution over documents; at each rank, a distribution is randomly selected and a document is drawn from it [41]. After clicks have been received, probabilistic interleaving computes the expected number of clicked documents per ranking system to infer preferences.
Optimized interleaving (OI) casts the randomization as an optimization problem, and displays rankings such that, if all documents are equally relevant, no preference is found [96].

While every interleaving method attempts to deal with position bias, none is unbiased according to our definition (Section 7.2.2). This may be confusing, because previous work on interleaving makes claims of unbiasedness [41, 44, 96]. However, that work uses different definitions of the term. More precisely, TDI, PI, and OI provably converge on the correct outcome if all documents are equally relevant [41, 44, 96, 99]. Moreover, if one assumes binary relevance and π_1 ranks all relevant documents equal to or higher than π_2, the binary outcome of PI and OI is proven to be correct in expectation [44, 96]. However, beyond the confines of these unambiguous cases, we can prove that these methods do not meet our definition of unbiasedness: for every method, one can construct an example where it converges on the incorrect outcome. The rankers π_1 and π_2 and the position bias parameters θ can be chosen so that in expectation the wrong (binary) outcome is estimated; see Appendix 7.A for a proof for each of the three interleaving methods. Thus, while more efficient than A/B testing, interleaving methods make systematic errors in certain circumstances and should therefore not be considered unbiased w.r.t. CTR differences.

We note that the magnitude of the bias should also be considered. If the systematic error of an interleaving method is minuscule while the efficiency gains are very high, it may still be very useful in practice. Our experimental results (Section 7.6.2) reveal that the systematic error of all three interleaving methods considered becomes very high when comparing systems with a CTR difference of around 1% or smaller.

7.3.3 Counterfactual Evaluation

Counterfactual evaluation is based on the idea that if certain biases can be estimated well, they can also be adjusted for [57, 127]. While estimating relevance is considered the core difficulty of ranking evaluation, estimating the position bias terms θ is very doable. By randomizing rankings, e.g., by swapping pairs of documents [57] or by exploiting data logged during A/B testing [4], differences in CTR for the same item on different positions can be observed directly. Alternatively, position bias can also be estimated from logged data, using Expectation Maximization (EM) optimization [128] or a dual learning objective [5]. Once the bias terms θ have been estimated, logged clicks can be weighted so as to correct for the position bias during logging. Hence, counterfactual evaluation can work with historically logged data. Existing counterfactual evaluation algorithms do not dictate which rankings should be displayed during logging: they do not perform interventions, and thus we do not consider them to be online methods.

Counterfactual evaluation assumes that the position bias θ and the logging policy π_0 are known, in order to correct for both position bias and item-selection bias. Clicks are gathered with π_0, which decides which rankings are displayed to the user. We follow Oosterhuis and de Rijke [86] (see Chapter 5) and use as propensity scores the probability of observance in expectation over the displayed rankings:

  ρ(d | q) = E_R[P(o(d) = 1 | R) | π_0] = Σ_R π_0(R | q) P(o(d) = 1 | R).  (7.15)
Then we use λ(d | π_1, π_2) to indicate the difference in observance probability under π_1 and π_2:

  λ(d | π_1, π_2) = E_R[P(o(d) = 1 | R) | π_1] − E_R[P(o(d) = 1 | R) | π_2] = Σ_R θ_{rank(d|R)} (π_1(R | q) − π_2(R | q)).  (7.16)

Then, the IPS estimate function is formulated as:

  x_i = f_IPS(q_i, R_i, c_i) = Σ_{d: ρ(d|q_i)>0} (c_i(d)/ρ(d | q_i)) λ(d | π_1, π_2).  (7.17)

Each click is weighted inversely to its examination probability, but items with a zero examination probability, ρ(d | q_i) = 0, are excluded. We note that these items can never be clicked:

  ∀q, d (ρ(d | q) = 0 → c(d) = 0).  (7.18)

Before we prove unbiasedness, we note that, given ρ(d | q) > 0:

  E[c(d)/ρ(d | q)] = Σ_R π_0(R | q) θ_{rank(d|R)} ζ_{d,q} / ρ(d | q)
    = (Σ_R π_0(R | q) θ_{rank(d|R)} / Σ_{R'} π_0(R' | q) θ_{rank(d|R')}) ζ_{d,q} = ζ_{d,q}.  (7.19)

This, in turn, can be used to prove unbiasedness:

  E[f_IPS(q_i, R_i, c_i)] = Σ_q P(q) Σ_{d: ρ(d|q)>0} ζ_{d,q} λ(d | π_1, π_2) = E[CTR(π_1)] − E[CTR(π_2)] = ∆(π_1, π_2).  (7.20)

This proof is only valid under the following requirement:

  ∀d, q (ζ_{d,q} λ(d | π_1, π_2) ≠ 0 → ρ(d | q) > 0).  (7.21)

In practice, this means that the items in the top-k of either π_1 or π_2 need to have a non-zero examination probability under π_0, i.e., they must have a chance to appear in the top-k under π_0.

Besides Requirement 7.21, the IPS counterfactual evaluation method [57, 127] is completely indifferent to π_0, and hence we do not consider it to be an online method. In the next section, we introduce an algorithm for choosing and updating π_0 during logging to minimize the variance of the estimator. By doing so, we turn counterfactual evaluation into an online method.
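The propensities of Eq. 7.15 and the IPS estimate of Eq. 7.17 translate directly into code. A sketch, assuming the logging policy is given as an explicit (hypothetical) list of (ranking, probability) pairs:

```python
def propensities(rankings_with_probs, theta):
    """Eq. 7.15: rho(d|q) under logging policy pi_0, given as an explicit
    list of (ranking, probability) pairs; theta[r] is the examination
    probability of rank r+1, zero beyond the displayed top-k."""
    rho = {}
    for ranking, prob in rankings_with_probs:
        for rank, d in enumerate(ranking[: len(theta)]):
            rho[d] = rho.get(d, 0.0) + prob * theta[rank]
    return rho

def f_ips(clicks, rho, lam):
    """Eq. 7.17: per-interaction IPS estimate of the CTR difference.
    clicks[d] in {0, 1}; lam[d] is the observance difference of Eq. 7.16."""
    return sum(
        clicks[d] / rho[d] * lam[d]
        for d in clicks
        if rho.get(d, 0.0) > 0.0  # items with rho(d|q) = 0 can never be clicked
    )

# Hypothetical example: pi_0 mixes two rankings over three items.
pi_0 = [(["d1", "d2", "d3"], 0.7), (["d2", "d1", "d3"], 0.3)]
theta = [1.0, 0.5, 0.25]
rho = propensities(pi_0, theta)
lam = {"d1": 0.4, "d2": -0.3, "d3": 0.0}  # assumed lambda values
print(f_ips({"d1": 1, "d2": 0, "d3": 0}, rho, lam))
```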
7.4 Logging Policy Optimization for Variance Minimization

Next, we introduce a method aimed at finding a logging policy that minimizes the variance of the estimates of the counterfactual estimator.

In Section 7.3.3, we discussed counterfactual evaluation and established that it is unbiased as long as θ is known and the logging policy meets Requirement 7.21. The variance of ∆̂_IPS depends on the position bias θ, the conditional click probabilities ζ, and the logging policy π. In contrast to the user-dependent θ and ζ, the way data is logged by π is something one can control. The goal of our method is to find the optimal policy that minimizes variance while still meeting Requirement 7.21:

  π* = argmin_{π: π meets Req. 7.21} Var(∆̂^π_IPS),  (7.22)

where ∆̂^π_IPS is the counterfactual estimator based on data logged using π.

To formulate the variance, we first note that it is an expectation over queries:

  Var(∆̂) = Σ_q P(q) Var(∆̂ | q).  (7.23)

To keep notation short, for the remainder of this section we write: ∆ = ∆(π_1, π_2); θ_{d,R} = θ_{rank(d|R)}; ζ_d = ζ_{d,q}; λ_d = λ(d | π_1, π_2); and ρ_d = ρ(d | q, π). Next, we consider the probability of a click pattern c; this is simply a vector indicating a possible combination of clicked documents (c(d) = 1) and non-clicked documents (c(d) = 0):

  P(c | q) = Σ_R π(R | q) Π_{d: c(d)=1} θ_{d,R} ζ_d Π_{d: c(d)=0} (1 − θ_{d,R} ζ_d) = Σ_R π(R | q) P(c | R).  (7.24)

Here, π has some control over this probability: by deciding the distribution of displayed rankings, it can make certain click patterns more or less frequent. The variance added per query is the squared error of every possible click pattern, weighted by the probability of each pattern. Let Σ_c sum over every possible click pattern:

  Var(∆̂^π_IPS | q) = Σ_c P(c | q) (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)².  (7.25)

It is unknown whether there is a closed-form solution for π*. However, the variance function is differentiable. Taking the derivative reveals a trade-off between two potentially conflicting goals:

  δ/δπ Var(∆̂^π_IPS | q) = Σ_c ([δP(c | q)/δπ] (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)² + P(c | q) δ/δπ (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)²),  (7.26)

where the first term minimizes the frequency of high-error click patterns and the second term minimizes the error of frequent click patterns. On the one hand, the derivative reduces the frequency of click patterns that result in high-error samples, i.e., by updating π so that these are less likely to occur. On the other hand, changing π also affects the propensities ρ_d, i.e., if π makes an item d less likely to be examined, its corresponding value λ_d/ρ_d becomes larger, which can lead to a higher error for related click patterns. The optimal policy has to balance: (i) avoiding showing rankings that lead to high-error click patterns; and (ii) avoiding minimizing propensity scores, which increases the errors of corresponding click patterns. Our method applies stochastic gradient descent to optimize the logging policy w.r.t. the variance. There are two main difficulties with this approach: (i) the parameters θ and ζ are unknown a priori; and (ii) the gradients include summations over all possible rankings and all possible click patterns, both of which are computationally infeasible. In the following sections, we detail how LogOpt solves both of these problems.

In order to compute the gradient in Eq. 7.26, the parameters θ and ζ have to be known. LogOpt is based on the assumption that accurate estimates of θ and ζ suffice to find a near-optimal logging policy. We note that the counterfactual estimator only requires θ to be known for unbiasedness (see Section 7.3.3). Our approach is as follows: at given intervals during evaluation, we use the available clicks to estimate θ and ζ. Then we use the estimated θ̂ to get the current estimate ∆̂_IPS(I, θ̂) (Eq. 7.17) and optimize w.r.t. the estimated variance (Eq. 7.25) based on θ̂, ζ̂, and ∆̂_IPS(I, θ̂).

For estimating θ and ζ we use the existing EM approach by Wang et al. [128], because it works well in situations where few interactions are available and does not require randomization. We note that previous work has found randomization-based approaches to be more accurate for estimating θ [4, 30, 128]. However, they require multiple interactions per query and specific types of randomization in their results; by choosing the EM approach we avoid these requirements.

Both the variance (Eq. 7.25) and its gradient (Eq. 7.26) include a sum over all possible click patterns. Moreover, they also include the probability of a specific pattern, P(c | q), which is based on a sum over all possible rankings (Eq. 7.24). Clearly, these equations are infeasible to compute under any realistic time constraints. To solve this issue, we introduce gradient estimation based on Monte-Carlo sampling. Our approach is similar to that of Ma et al. [78]; however, we estimate gradients of the variance instead of general performance.

First, we assume that policies place the documents in order of rank and that the probability of placing an individual document at rank x only depends on the previously placed documents. Let R_{1:x−1} indicate the (incomplete) ranking from rank 1 up to rank x−1; then π(d | R_{1:x−1}, q) indicates the probability that document d is placed at rank x, given that the ranking up to rank x−1 is R_{1:x−1}. The probability of a ranking R up to rank k is thus:

  π(R_{1:k} | q) = Π_{x=1}^{k} π(R_x | R_{1:x−1}, q).  (7.27)

Let K be the length of a complete ranking R; the gradient of the probability of a ranking w.r.t. the policy is:

  δπ(R | q)/δπ = Σ_{x=1}^{K} [π(R | q)/π(R_x | R_{1:x−1}, q)] [δπ(R_x | R_{1:x−1}, q)/δπ].  (7.28)

The gradient of the propensity w.r.t. the policy (cf. Eq. 7.15) is:

  δρ(d | q)/δπ = Σ_{k=1}^{K} θ_k Σ_R π(R_{1:k−1} | q) ([δπ(d | R_{1:k−1}, q)/δπ] + Σ_{x=1}^{k−1} [π(d | R_{1:k−1}, q)/π(R_x | R_{1:x−1}, q)] [δπ(R_x | R_{1:x−1}, q)/δπ]).  (7.29)

To avoid iterating over all rankings in the Σ_R sum, we sample M rankings: R^m ∼ π(R | q), and a click pattern on each ranking: c^m ∼ P(c | R^m). This enables us to make the following approximation:

  ρ-grad(d, q) = (1/M) Σ_{m=1}^{M} Σ_{k=1}^{K} θ_k ([δπ(d | R^m_{1:k−1}, q)/δπ] + Σ_{x=1}^{k−1} [π(d | R^m_{1:k−1}, q)/π(R^m_x | R^m_{1:x−1}, q)] [δπ(R^m_x | R^m_{1:x−1}, q)/δπ]),  (7.30)

since δρ(d | q)/δπ ≈ ρ-grad(d, q). In turn, we can use this to approximate the second part of Eq. 7.26:

  error-grad(c) = 2 (∆ − Σ_{d: c(d)=1} λ_d/ρ_d) Σ_{d: c(d)=1} (λ_d/ρ_d²) ρ-grad(d).  (7.31)

We approximate the first part of Eq. 7.26 with:

  freq-grad(R, c) = (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)² Σ_{x=1}^{K} [δπ(R_x | R_{1:x−1}, q)/δπ]/π(R_x | R_{1:x−1}, q).  (7.32)

Together, they approximate the complete gradient (cf. Eq. 7.26):

  δ Var(∆̂^π_IPS | q)/δπ ≈ (1/M) Σ_{m=1}^{M} [freq-grad(R^m, c^m) + error-grad(c^m)].  (7.33)

Therefore, we can approximate the gradient of the variance w.r.t. a logging policy π, based on rankings sampled from π and our current estimated click model (θ̂, ζ̂), while staying computationally feasible. For a more detailed description, see Appendix 7.B.
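The following sketch illustrates this sampling scheme for a toy three-item Plackett-Luce policy. It implements only the freq-grad term of Eq. 7.32 and a Monte-Carlo version of Eq. 7.15; the error-grad term follows the same sampling pattern. All parameter values (θ, ζ, λ, ∆) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 3 items, Plackett-Luce logging policy with scores s.
s = np.zeros(3)                       # policy parameters
theta = np.array([1.0, 0.5, 0.25])    # position bias per rank
zeta = np.array([0.8, 0.1, 0.4])      # estimated conditional click probabilities
lam = np.array([0.3, -0.2, 0.1])      # lambda_d values (Eq. 7.16), assumed
delta = 0.05                          # current CTR-difference estimate, assumed

def sample_ranking_and_gradlog(s):
    """Sample a ranking from the Plackett-Luce policy over scores s and
    return it with the gradient of log pi(R|q) w.r.t. s (cf. Eq. 7.28)."""
    remaining, ranking = list(range(len(s))), []
    grad_log = np.zeros_like(s)
    while remaining:
        p = np.exp(s[remaining] - np.max(s[remaining]))
        p /= p.sum()
        j = rng.choice(len(remaining), p=p)
        grad_log[remaining[j]] += 1.0
        grad_log[remaining] -= p      # derivative of the log-softmax
        ranking.append(remaining.pop(j))
    return ranking, grad_log

def mc_propensities(m=4096):
    """Eq. 7.15 by sampling: rho_d is the expected examination probability."""
    rho = np.zeros_like(theta)
    for _ in range(m):
        ranking, _ = sample_ranking_and_gradlog(s)
        rho[np.asarray(ranking)] += theta
    return rho / m

def mc_freq_grad(rho, m=1024):
    """Monte-Carlo estimate of the freq-grad part of Eq. 7.26/7.32: a
    score-function gradient weighted by the squared click-pattern error."""
    grad = np.zeros_like(s)
    for _ in range(m):
        ranking, grad_log = sample_ranking_and_gradlog(s)
        click_p = theta * zeta[np.asarray(ranking)]   # click probability per rank
        clicks = rng.random(len(s)) < click_p
        clicked = np.asarray(ranking)[clicks]
        err = delta - np.sum(lam[clicked] / rho[clicked])
        grad += err ** 2 * grad_log
    return grad / m

rho = mc_propensities()
print(mc_freq_grad(rho))
```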
Algorithm 7.1
Logging-Policy Optimization Algorithm (LogOpt)

 1: Input: historical interactions I; rankers to compare: π_1, π_2.
 2: θ̂, ζ̂ ← infer_click_model(I)  // estimate bias using EM
 3: λ̂ ← estimated_observance(θ̂, π_1, π_2)  // estimate λ, cf. Eq. 7.16
 4: ∆̂(π_1, π_2) ← estimated_CTR(I, λ̂, θ̂)  // CTR difference, cf. Eq. 7.17
 5: π ← init_policy()  // initialize logging policy
 6: for j ∈ {1, 2, ...} do
 7:   q ∼ P(q | I)  // sample a query from the interactions
 8:   R ← {R^1, R^2, ..., R^M} ∼ π(R | q)  // sample M rankings
 9:   C ← {c^1, c^2, ..., c^M} ∼ P(c | R)  // sample M click patterns
10:   δ̂ ← approx_grad(R, C, λ̂, θ̂, ∆̂(π_1, π_2))  // using Eq. 7.33
11:   π ← update(π, δ̂)  // update using the approximated gradient
12: return π

We have summarized the LogOpt method in Algorithm 7.1. The algorithm requires a set of historical interactions I and two rankers π_1 and π_2 to compare. By fitting a click model on I using an EM procedure (Line 2), estimates of the observation bias θ̂ and the document relevance ζ̂ are obtained. Using θ̂, an estimate of the difference in observation probabilities λ̂ is computed (Line 3, cf. Eq. 7.16), as well as an estimate of the CTR difference ∆̂(π_1, π_2) (Line 4, cf. Eq. 7.17). Then the optimization of a new logging policy π begins: a query is sampled from I (Line 7), and for that query M rankings are sampled from the current π (Line 8); then for each ranking a click pattern is sampled using θ̂ and ζ̂ (Line 9). Finally, using the sampled rankings and clicks, θ̂, λ̂, and ∆̂(π_1, π_2), the gradient is approximated using Eq. 7.33 (Line 10) and the policy π is updated accordingly (Line 11). This process can be repeated for a fixed number of steps, or until the policy has converged.

This concludes our introduction of LogOpt: the first method that optimizes the logging policy for faster convergence in counterfactual evaluation. We argue that LogOpt turns counterfactual evaluation into online evaluation, because it instructs which rankings should be displayed for the most efficient evaluation. The ability to make interventions like this is the defining characteristic of an online evaluation method.

7.5 Experimental Setup

We ran semi-synthetic experiments that are prevalent in online and counterfactual evaluation [41, 58, 86]. User-issued queries are simulated by sampling from learning to rank datasets; each dataset contains a preselected set of documents per query. We use the Yahoo! Webscope [17] and MSLR-WEB30k [95] datasets; both contain 5-grade relevance judgements for all preselected query-document pairs. For each sampled query, we let the evaluation method decide which ranking to display, and then simulate clicks on it using probabilistic click models.

To simulate position bias, we use the rank-based probabilities of Joachims et al. [58]:

  P(o(d) = 1 | R, q) = 1/rank(d | R).  (7.34)

If observed, the click probability is determined by the relevance label of the dataset (ranging from 0 to 4). More relevant items are more likely to be clicked, yet non-relevant documents still have a non-zero click probability:

  P(c(d) = 1 | o(d) = 1, q) = 0.225 · relevance_label(q, d) + 0.1.  (7.35)
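A minimal sketch of this click simulation, following Eqs. 7.34 and 7.35 as given above:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_clicks(relevance_labels, rng):
    """One simulated query impression: position bias theta_r = 1/r
    (Eq. 7.34) and conditional click probability 0.225 * label + 0.1
    (Eq. 7.35), for relevance labels in {0, ..., 4}."""
    labels = np.asarray(relevance_labels)
    theta = 1.0 / np.arange(1, len(labels) + 1)
    zeta = 0.225 * labels + 0.1
    return (rng.random(len(labels)) < theta * zeta).astype(int)

# Clicks on a toy ranking whose items have labels 4, 0, 2, 1, 3.
print(simulate_clicks([4, 0, 2, 1, 3], rng))
```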
Spread over both datasets, we generated 2,000 rankers and created 1,000 ranker-pairs. We aimed to generate rankers that are likely to be compared in real-world scenarios; unfortunately, no simple distribution of such rankers is available. Therefore, we tried to generate rankers that have (at least) a decent CTR and that span a variety of ranking behaviors. Each ranker was optimized using LambdaLoss [129] based on the labelled data of 100 sampled queries; each ranker is based on a linear model that only uses a random sample of 50% of the dataset features. Figure 7.1 displays the resulting CTR distribution; it appears to follow a normal distribution on both datasets.

For each ranker-pair and method, we sample a large number of queries and calculate the CTR estimates at different numbers of issued queries. We consider three metrics: (i) the binary error: whether the estimate correctly predicts which ranker should be preferred; (ii) the absolute error: the absolute difference between the estimate and the true E[CTR] difference:

  absolute-error = |∆(π_1, π_2) − ∆̂(I)|;  (7.36)

and (iii) the mean squared error: the squared error per sample (not the final estimate); if the estimator is unbiased, this is equivalent to the empirical variance:

  mean-squared-error = (1/N) Σ_{i=1}^{N} (∆(π_1, π_2) − x_i)².  (7.37)

We compare LogOpt with the following baselines: (i) A/B testing (with equal probabilities for each ranker); (ii) team-draft interleaving; (iii) probabilistic interleaving (with τ = 4); and (iv) optimized interleaving (with the inverse rank scoring function). Furthermore, we compare LogOpt with other choices of logging policies: (i) uniform sampling; (ii) A/B testing: showing either the ranking of A or B with equal probability; and (iii) an Oracle logging policy: applying LogOpt to the true relevances ζ and position bias θ. We also consider LogOpt both in the case where θ is known a priori, and where it still has to be estimated. Because estimating θ and optimizing the logging policy π is time-consuming, we only update θ̂ and π at a few fixed intervals during each run. The policy LogOpt optimizes uses a neural network with 2 hidden layers consisting of 32 units each. The network computes a score for every document; then a softmax is applied to the scores to create a distribution over documents.
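A sketch of such a scoring policy, with untrained weights and the layer sizes described above; the sequential softmax sampling illustrates how per-document scores induce a distribution over rankings:

```python
import numpy as np

rng = np.random.default_rng(0)

class ScoringPolicy:
    """Sketch of the optimized logging policy: a 2-layer MLP (32 units per
    layer, as described above) scores each document; a softmax over the
    scores of the remaining documents gives the sampling distribution at
    each rank. Weights are random here, i.e., the policy is untrained."""

    def __init__(self, n_features, hidden=32):
        self.w1 = rng.normal(0.0, 0.1, (n_features, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, hidden))
        self.w3 = rng.normal(0.0, 0.1, (hidden, 1))

    def scores(self, doc_features):
        h = np.tanh(doc_features @ self.w1)
        h = np.tanh(h @ self.w2)
        return (h @ self.w3).ravel()

    def sample_ranking(self, doc_features):
        s = self.scores(doc_features)
        remaining, ranking = list(range(len(s))), []
        while remaining:
            p = np.exp(s[remaining] - np.max(s[remaining]))
            p /= p.sum()
            ranking.append(remaining.pop(rng.choice(len(remaining), p=p)))
        return ranking

policy = ScoringPolicy(n_features=10)
print(policy.sample_ranking(rng.normal(size=(5, 10))))
```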
7.6 Results

Our results are displayed in Figures 7.2, 7.3, and 7.4. Figure 7.2 shows the results comparing LogOpt with other online evaluation methods; Figure 7.3 compares LogOpt with counterfactual evaluation using other logging policies; and finally, Figure 7.4 shows the distribution of binary errors for each method at the end of the sampled queries.

[Figure 7.1: The CTR distribution of the 2,000 generated rankers; 1,000 were generated per dataset. Panels: Yahoo! Webscope, MSLR-Web30k.]

[Figure 7.2: Comparison of LogOpt with other online methods (A/B testing, team-draft interleaving, probabilistic interleaving, optimized interleaving, and LogOpt with position bias known/estimated); panels show the binary error, absolute error, and mean squared error against the number of queries issued, for Yahoo! Webscope and MSLR-Web30k; displayed results are an average over 500 comparisons.]

[Figure 7.3: Comparison of logging policies for counterfactual evaluation (A/B logging policy, uniform logging policy, LogOpt with position bias known, and the Oracle logging policy); panels as in Figure 7.2; displayed results are an average over 500 comparisons.]

In Figure 7.2 we see that, unlike the interleaving methods, counterfactual evaluation with LogOpt continues to decrease both its binary error and its absolute error as the number of queries increases. While the interleaving methods converge at a binary error of at least 2.2% and a non-vanishing absolute error, LogOpt appears to converge towards zero error on both metrics. This is expected, as LogOpt is proven to be unbiased when the position bias is known. Interestingly, we see similar behavior from LogOpt with estimated position bias. Both when the bias is known and when it is estimated, LogOpt has a lower error than the interleaving methods after a sufficiently large number of queries. Thus we conclude that interleaving methods converge faster and have an initial period where their error is lower, but are biased. In contrast, by being unbiased, LogOpt eventually converges on a lower error.

If we use Figure 7.2 to compare LogOpt with A/B testing, we see that on both datasets LogOpt has a considerably smaller mean squared error. Since both methods are unbiased, this means that LogOpt has a much lower variance and is thus expected to converge faster. On the Yahoo! dataset we observe this behavior: both in terms of binary error and absolute error, and regardless of whether the bias is estimated, LogOpt requires half as much data as A/B testing to reach the same level of error. Thus, on Yahoo! LogOpt is roughly twice as data-efficient as A/B testing. On the MSLR dataset it is less clear whether LogOpt is noticeably more efficient: early in the run, the absolute error of LogOpt is twice as high, but by the end of the run it has a lower error than A/B testing. We suspect that the relative drop in performance partway through the run is due to LogOpt overfitting on incorrect ζ̂ values; however, we were unable to confirm this. Hence, LogOpt is just as efficient as, or even more efficient than, A/B testing, depending on the circumstances.

Finally, when we use Figure 7.3 to compare LogOpt with other logging-policy choices, we see that LogOpt mostly approximates the optimal Oracle logging policy. In contrast, the uniform logging policy is very data-inefficient; on both datasets it requires around ten times the number of queries to reach the same level of error as LogOpt. The A/B logging policy is a better choice than the uniform logging policy, but apart from the dip in performance on the MSLR dataset, it appears to require twice as many queries as LogOpt. Interestingly, the performance of LogOpt is already near the Oracle when only a small number of queries have been issued. With such a small number of interactions, accurately estimating the relevances ζ should not be possible; thus it appears that the relevances ζ are not important for LogOpt to find an efficient logging policy. This must mean that only the differences in behavior between the rankers (i.e., λ) have to be known for LogOpt to be efficient. Overall, these results show that LogOpt can greatly increase the efficiency of counterfactual estimation.

Our results in Figure 7.2 clearly illustrate the bias of interleaving methods: each of them systematically infers incorrect preferences for (at least) 2.2% of the ranker-pairs. These errors are systematic, since further increasing the number of queries does not remove any of them. Additionally, the combination of the lowest mean squared error with a worse absolute error than A/B testing at the end of the run indicates that interleaving obtains a low variance at the cost of bias. To better understand when these systematic errors occur, we show the distribution of binary errors w.r.t. the CTR differences of the associated ranker-pairs in Figure 7.4. Here we see that most errors occur on ranker-pairs where the CTR difference is smaller than 1%, and that the percentage of erroneous comparisons greatly increases as the CTR difference decreases below 1%. This suggests that interleaving methods are unreliable for detecting preferences when differences are 1% CTR or less.

It is hard to judge the impact this bias may have in practice. On the one hand, a 1% CTR difference is far from negligible: generally, a 1% increase in CTR is considered an impactful improvement in the industry [102]. On the other hand, our results are based on a single click model with specific values for position bias and conditional click probabilities. While our results strongly prove that interleaving is biased, we should be careful not to generalize the size of the observed systematic error to all other ranking settings.

Previous work has performed empirical studies to evaluate various interleaving methods with real users. Chapelle et al. [18] applied interleaving methods to compare ranking systems for three different search engines, and found team-draft interleaving to correlate highly with absolute measures such as CTR. However, we note that in the study by Chapelle et al. [18] no more than six rankers were compared; such a study would likely miss a systematic error of 2.2%. In fact, Chapelle et al. [18] note themselves that they cannot confidently claim team-draft interleaving is completely unbiased. Schuth et al. [110] performed a larger comparison involving 38 ranking systems, but again, one too small to reliably detect a small systematic error.

It appears that the field is missing a large-scale comparison that involves a large enough number of rankers to observe small systematic errors. If such an error is found, the next step is to identify whether certain types of ranking behavior are erroneously and systematically disfavored. While these questions remain unanswered, we are concerned that the claims of unbiasedness in previous work on interleaving (see Section 7.3.2) give practitioners an unwarranted sense of reliability in interleaving.
7.7 Conclusion

In this chapter, we considered thesis research question RQ7: whether counterfactual evaluation methods for ranking can be extended to perform efficient and effective online evaluation. Our answer is positive: we have introduced the Logging-Policy Optimization Algorithm (LogOpt), the first method that optimizes a logging policy for minimal-variance counterfactual evaluation. Counterfactual evaluation is proven to be unbiased w.r.t. position bias and item-selection bias under a wide range of logging policies. With the introduction of LogOpt, we now have an algorithm that can decide which rankings should be displayed for the fastest convergence. Therefore, we argue that LogOpt turns the IPS-based counterfactual evaluation approach – which is indifferent to the logging policy – into an online approach – which instructs the logging policy. Our experimental results show that LogOpt can lead to better data-efficiency than A/B testing, while also showing that interleaving is biased.

This brings us to the second thesis research question that this chapter addressed,
RQ8: whether interleaving methods are truly unbiased w.r.t. position bias. We answer this question negatively: our experimental results clearly reveal a systematic error in interleaving; moreover, in Appendix 7.A we formally prove that cases exist where interleaving is affected by position bias. In other words, interleaving should not be considered unbiased under the most common definition of bias in counterfactual evaluation.

While our findings are mostly theoretical, they do suggest that future work should further investigate the bias in interleaving methods. Our results suggest that all interleaving methods make systematic errors, in particular when rankers with a similar CTR are compared. Furthermore, to the best of our knowledge, no empirical studies have been performed that could measure such a bias; our findings strongly indicate that such a study would be highly valuable to the field. Finally, LogOpt shows that, in theory, an evaluation method that is both unbiased and efficient is possible; if future work finds that these theoretical findings match empirical results with real users, this could be the start of a new line of theoretically-justified online evaluation methods.
Inspired by the success of this chapter in finding a method effective at both online and counterfactual evaluation for ranking, Chapter 8 introduces a method that is effective at both online and counterfactual Learning to Rank (LTR). Together, these chapters show that the divide between online and counterfactual optimization and evaluation can be bridged.
[Figure 7.4: Distribution of errors over the CTR differences of the rankers in the comparison, per method (team-draft interleaving, probabilistic interleaving, optimized interleaving, A/B testing, and LogOpt with estimated bias) and per dataset (Yahoo! Webscope, MSLR-Web30k); red indicates a binary error, green indicates a correctly inferred binary preference; results are based on the estimates at the end of the sampled queries.]

7.A Proof of Bias in Interleaving

Section 7.3.2 claimed that for the discussed interleaving methods, an example can be constructed so that in expectation the wrong binary outcome is estimated w.r.t. the actual expected CTR differences. These examples are enough to prove that these interleaving methods are biased w.r.t. CTR differences. In the following sections, we introduce a single example for each interleaving method.

For clarity, we keep these examples as basic as possible. We consider a ranking setting where only a single query q occurs, i.e., P(q) = 1; furthermore, there are only three documents to be ranked: A, B, and C. The two policies π_1 and π_2 in the comparison are both deterministic, so that π_1([A, B, C] | q) = 1 and π_2([B, C, A] | q) = 1. Thus π_1 will always display the ranking [A, B, C], and π_2 the ranking [B, C, A]. Furthermore, document B is completely non-relevant: ζ_B = 0; consequently, B can never receive clicks, which makes our examples even simpler. The true E[CTR] difference is thus:

  ∆(π_1, π_2) = (θ_1 − θ_3) ζ_A + (θ_3 − θ_2) ζ_C.  (7.38)

For each interleaving method, we now show that position bias parameters θ_1, θ_2, and θ_3 and relevances ζ_A and ζ_C exist for which the wrong binary outcome is estimated.

7.A.1 Team-Draft Interleaving

Team-Draft Interleaving [99] lets the rankers take turns to add their top document, and keeps track of which ranker added each document. In total there are four possible interleaving and assignment combinations, each equally probable:

Interleaving  Ranking  Assignments  Probability
R^1           A, B, C  1, 2, 1      1/4
R^2           A, B, C  1, 2, 2      1/4
R^3           B, A, C  2, 1, 1      1/4
R^4           B, A, C  2, 1, 2      1/4

Per issued query, Team-Draft Interleaving produces a binary outcome, based on which ranker had most of its assigned documents clicked. To match our CTR estimate, we use 1 to indicate π_1 receiving more clicks, and −1 for π_2.
Per interleaving, we can compute the probability of each outcome:

  P(outcome = 1 | R^1) = θ_1 ζ_A + (1 − θ_1 ζ_A) θ_3 ζ_C,
  P(outcome = 1 | R^2) = θ_1 ζ_A (1 − θ_3 ζ_C),
  P(outcome = 1 | R^3) = θ_2 ζ_A + (1 − θ_2 ζ_A) θ_3 ζ_C,
  P(outcome = 1 | R^4) = θ_2 ζ_A (1 − θ_3 ζ_C),
  P(outcome = −1 | R^1) = 0,
  P(outcome = −1 | R^2) = (1 − θ_1 ζ_A) θ_3 ζ_C,
  P(outcome = −1 | R^3) = 0,
  P(outcome = −1 | R^4) = (1 − θ_2 ζ_A) θ_3 ζ_C.

Since every interleaving is equally likely, we can easily derive the unconditional probabilities:

  P(outcome = 1) = (1/4)(θ_1 ζ_A + (1 − θ_1 ζ_A) θ_3 ζ_C + θ_1 ζ_A (1 − θ_3 ζ_C) + θ_2 ζ_A + (1 − θ_2 ζ_A) θ_3 ζ_C + θ_2 ζ_A (1 − θ_3 ζ_C)),
  P(outcome = −1) = (1/4)((1 − θ_1 ζ_A) θ_3 ζ_C + (1 − θ_2 ζ_A) θ_3 ζ_C).

With these probabilities, the expected outcome is straightforward to calculate:

  E[outcome] = P(outcome = 1) − P(outcome = −1) = (1/4)(θ_1 ζ_A + θ_1 ζ_A (1 − θ_3 ζ_C) + θ_2 ζ_A + θ_2 ζ_A (1 − θ_3 ζ_C)) > 0.

Interestingly, without knowing the values of θ, ζ_A, and ζ_C, we already know that the expected outcome is positive. Therefore, we can simply choose values that lead to a negative CTR difference, and the expected outcome will be incorrect. For this example, we can choose, for instance, the position bias θ_1 = 1.0, θ_2 = 0.8, and θ_3 = 0.2, and the relevances ζ_A = 0.1 and ζ_C = 1.0, so that ∆(π_1, π_2) = 0.08 − 0.6 < 0. As a result, the expected binary outcome of Team-Draft Interleaving does not match the true E[CTR] difference:

  ∆(π_1, π_2) < 0 ∧ E[outcome] > 0.  (7.39)

Therefore, we have proven that Team-Draft Interleaving is biased w.r.t. CTR differences.
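This construction can be verified numerically by enumerating the four interleaving/assignment combinations and all click patterns; the script below uses the example values chosen above:

```python
from itertools import product

theta = {1: 1.0, 2: 0.8, 3: 0.2}
zeta = {"A": 0.1, "B": 0.0, "C": 1.0}

def expected_outcome():
    """E[outcome] of team-draft interleaving for pi_1 = [A,B,C] and
    pi_2 = [B,C,A], enumerating the four interleaving/assignment
    combinations and all 2^3 click patterns."""
    cases = [  # (ranking, per-rank assignment), each with probability 1/4
        (["A", "B", "C"], [1, 2, 1]),
        (["A", "B", "C"], [1, 2, 2]),
        (["B", "A", "C"], [2, 1, 1]),
        (["B", "A", "C"], [2, 1, 2]),
    ]
    total = 0.0
    for ranking, assign in cases:
        for clicks in product([0, 1], repeat=3):
            p = 1.0
            for r, (d, c) in enumerate(zip(ranking, clicks), start=1):
                pc = theta[r] * zeta[d]
                p *= pc if c else (1.0 - pc)
            c1 = sum(c for c, a in zip(clicks, assign) if a == 1)
            c2 = sum(c for c, a in zip(clicks, assign) if a == 2)
            total += 0.25 * p * ((c1 > c2) - (c2 > c1))
    return total

delta = (theta[1] - theta[3]) * zeta["A"] + (theta[3] - theta[2]) * zeta["C"]
print(f"Delta = {delta:.3f} < 0, E[outcome] = {expected_outcome():.3f} > 0")
```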
7.A.2 Probabilistic Interleaving

Probabilistic Interleaving [41] treats rankings as distributions over documents; we follow the soft-max approach of Hofmann et al. [41] and use τ = 4 as suggested. Probabilistic Interleaving creates interleavings by sampling randomly from one of the rankings; unlike Team-Draft Interleaving, it does not remember which ranking added each document. Because rankings are treated as distributions, every possible permutation is a valid interleaving, leading to six possibilities with different probabilities of being displayed. When clicks are received, every possible assignment is considered and the expected outcome is computed over all possible assignments. Because there are 36 possible ranking and assignment combinations, we only report every possible ranking and the probabilities of documents A or C being added by π_1:

Interleaving  Ranking  P(add(A) = 1)  P(add(C) = 1)  Probability
R^1           A, B, C  0.9878         0.4701         0.4182
R^2           A, C, B  0.9878         0.4999         0.0527
R^3           B, A, C  0.8569         0.0588         0.2849
R^4           B, C, A  0.5000         0.0588         0.2094
R^5           C, A, B  0.9872         0.5000         0.0166
R^6           C, B, A  0.5000         0.0562         0.0182

These probabilities are enough to compute the expected outcome, similar to the procedure we used for Team-Draft Interleaving. We do not display the full calculation here, as it is extremely long; we recommend using some form of computer assistance to perform these calculations. While there are many possibilities, we can choose a position bias θ_1, θ_2, θ_3 and relevances ζ_A and ζ_C for which this computation leads to the following erroneous result:

  ∆(π_1, π_2) < 0 ∧ E[outcome] > 0.  (7.40)

Therefore, we have proven that Probabilistic Interleaving is biased w.r.t. CTR differences.

7.A.3 Optimized Interleaving

Optimized Interleaving casts interleaving as an optimization problem [96]. Optimized Interleaving works with a credit function: each clicked document produces a positive or negative credit, and the sum of all credits is the final estimated outcome. We follow Radlinski and Craswell [96] and use the linear rank difference, resulting in the following credit per document: click-credit(A) = 2, click-credit(B) = −1, and click-credit(C) = −1. Then the set of allowed interleavings is created; these are all the rankings that do not contradict a pairwise document preference on which both rankers agree. Given this set of interleavings, a distribution over them is found so that, if every document is equally relevant, no preference is found.¹ For our example, the only valid distribution over interleavings is the following:

Interleaving  Ranking  Probability
R^1           A, B, C  1/3
R^2           B, A, C  1/3
R^3           B, C, A  1/3

The expected credit outcome shows us which ranker will be preferred in expectation:

  E[credit] = (1/3)(2(θ_1 + θ_2 + θ_3) ζ_A − (θ_2 + 2θ_3) ζ_C).  (7.41)

¹ Radlinski and Craswell [96] state that if clicks are not correlated with relevance, then no preference should be found; in their click model (and ours) these two requirements are equivalent.
For instance, we can choose the position bias θ_1 = 1.0, θ_2 = 0.3, and θ_3 = 0.2, and the relevances ζ_A = 0.15 and ζ_C = 1.0. As a result, the true E[CTR] difference is positive, but Optimized Interleaving will prefer π_2 in expectation:

  ∆(π_1, π_2) > 0 ∧ E[credit] < 0.  (7.42)

Therefore, we have proven that Optimized Interleaving is biased w.r.t. CTR differences.
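Again, the example values can be checked numerically against Eq. 7.38 and Eq. 7.41:

```python
theta = {1: 1.0, 2: 0.3, 3: 0.2}
zeta = {"A": 0.15, "B": 0.0, "C": 1.0}
credit = {"A": 2, "B": -1, "C": -1}
interleavings = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]

# Eq. 7.38: true CTR difference between pi_1 = [A,B,C] and pi_2 = [B,C,A].
delta = (theta[1] - theta[3]) * zeta["A"] + (theta[3] - theta[2]) * zeta["C"]

# Eq. 7.41: expected credit under the uniform distribution over the
# three allowed interleavings.
e_credit = sum(
    (1 / 3) * theta[rank] * zeta[d] * credit[d]
    for ranking in interleavings
    for rank, d in enumerate(ranking, start=1)
)
print(f"Delta = {delta:.3f} > 0, E[credit] = {e_credit:.4f} < 0")
```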
7.B Expanded Explanation of Gradient Approximation

This section describes our Monte-Carlo approximation of the variance gradient in more detail. We repeat the steps described in Section 7.4.3 and include some additional intermediate steps; this should make it easier for a reader to verify our theory.

First, we assume that policies place the documents in order of rank and that the probability of placing an individual document at rank x only depends on the previously placed documents. Let R_{1:x−1} indicate the (incomplete) ranking from rank 1 up to rank x−1; then π(d | R_{1:x−1}, q) indicates the probability that document d is placed at rank x, given that the ranking up to rank x−1 is R_{1:x−1}. The probability of a ranking R of length K is thus:

  π(R | q) = Π_{x=1}^{K} π(R_x | R_{1:x−1}, q).  (7.43)

The probability of a ranking R up to rank k is:

  π(R_{1:k} | q) = Π_{x=1}^{k} π(R_x | R_{1:x−1}, q).  (7.44)

Therefore, the propensity (cf. Eq. 7.15) can be rewritten to:

  ρ(d | q) = Σ_{k=1}^{K} θ_k Σ_R π(R_{1:k−1} | q) π(d | R_{1:k−1}, q).  (7.45)

Before we take the gradient of the propensity, we note that the gradient of the probability of a single ranking is:

  δπ(R | q)/δπ = Σ_{x=1}^{K} [π(R | q)/π(R_x | R_{1:x−1}, q)] [δπ(R_x | R_{1:x−1}, q)/δπ].  (7.46)

Using this gradient, we can derive the gradient of the propensity w.r.t. the policy:

  δρ(d | q)/δπ = Σ_{k=1}^{K} θ_k Σ_R π(R_{1:k−1} | q) ([δπ(d | R_{1:k−1}, q)/δπ] + Σ_{x=1}^{k−1} [π(d | R_{1:k−1}, q)/π(R_x | R_{1:x−1}, q)] [δπ(R_x | R_{1:x−1}, q)/δπ]).  (7.47)

To avoid iterating over all rankings in the Σ_R sum, we sample M rankings: R^m ∼ π(R | q), and a click pattern on each ranking: c^m ∼ P(c | R^m). This enables us to make the following approximation:

  ρ-grad(d, q) = (1/M) Σ_{m=1}^{M} Σ_{k=1}^{K} θ_k ([δπ(d | R^m_{1:k−1}, q)/δπ] + Σ_{x=1}^{k−1} [π(d | R^m_{1:k−1}, q)/π(R^m_x | R^m_{1:x−1}, q)] [δπ(R^m_x | R^m_{1:x−1}, q)/δπ]),  (7.48)

since δρ(d | q)/δπ ≈ ρ-grad(d, q). The second part of Eq. 7.26 is:

  δ/δπ (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)² = 2 (∆ − Σ_{d: c(d)=1} λ_d/ρ_d) Σ_{d: c(d)=1} (λ_d/ρ_d²) [δρ_d/δπ];  (7.49)

using ρ-grad(d), we get the approximation:

  error-grad(c) = 2 (∆ − Σ_{d: c(d)=1} λ_d/ρ_d) Σ_{d: c(d)=1} (λ_d/ρ_d²) ρ-grad(d).  (7.50)

Next, we consider the gradient of a single click pattern:

  δ/δπ P(c | q) = Σ_R P(c | R) [δπ(R | q)/δπ].  (7.51)

This can then be used to reformulate the first part of Eq. 7.26:

  Σ_c [δ/δπ P(c | q)] (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)² = Σ_c Σ_R P(c | R) [δπ(R | q)/δπ] (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)².  (7.52)

Making use of Eq. 7.46, we approximate this with:

  freq-grad(R, c) = (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)² Σ_{x=1}^{K} [δπ(R_x | R_{1:x−1}, q)/δπ]/π(R_x | R_{1:x−1}, q).  (7.53)

Combining the approximations of both parts of Eq. 7.26 allows us to approximate the complete gradient:

  δ Var(∆̂^π_IPS | q)/δπ ≈ (1/M) Σ_{m=1}^{M} [freq-grad(R^m, c^m) + error-grad(c^m)].  (7.54)

This completes our expanded description of the gradient approximation. We have shown that we can approximate the gradient of the variance w.r.t. a logging policy π, based on rankings sampled from π and our current estimated click model (θ̂, ζ̂), while staying computationally feasible.

7.C Notation Reference for Chapter 7

Notation                 Description
k                        the number of items that can be displayed in a single ranking
i                        an iteration number
q                        a user-issued query
d                        an item to be ranked
R                        a ranked list
R_{1:x}                  the subranking in R from index 1 up to and including index x
π                        a ranking policy
π(R | q)                 the probability that policy π displays ranking R for query q
π(R_x | R_{1:x−1}, q)    the probability of π adding item R_x given that R_{1:x−1} is already placed
I                        the available interaction data
c                        a click pattern: a vector indicating a combination of clicked and not-clicked items
Σ_c                      a summation over every possible click pattern
c(d)                     a function indicating item d was clicked in click pattern c
o(d)                     a function indicating item d was observed at iteration i
x_i                      the estimate for a single interaction i
f(q_i, R_i, c_i)         the method-specific function that converts a single interaction into an estimate x_i
θ_{rank(d|R)}            the observation probability: P(o(d) = 1 | R)
ζ_{d,q}                  the conditional click probability: P(c(d) = 1 | o(d) = 1, q)
Unifying Online and Counterfactual Learning to Rank
In Chapter 7, we introduced the Logging-Policy Optimization Algorithm (LogOpt), which turns a counterfactual ranking evaluation method into an online evaluation method. Thus, the contributions of Chapter 7 are a significant step in bridging the divide between online and counterfactual ranking evaluation. Inspired by this contribution, this chapter considers whether something similar can be done for the gap between online and counterfactual Learning to Rank (LTR). Accordingly, in this chapter the following question will be addressed:
RQ9
Can the counterfactual LTR approach be extended to perform highly effective online LTR?

In contrast with Chapter 7, which looked at finding the best logging policy, this chapter considers a novel counterfactual estimator: we propose the novel intervention-aware estimator for both counterfactual and online LTR. The estimator corrects for the effects of position bias, trust bias, and item-selection bias using corrections based on the behavior of the logging policy and on online interventions: changes to the logging policy made during the gathering of click data. Our experimental results show that, unlike existing counterfactual LTR methods, the intervention-aware estimator can greatly benefit from online interventions. In contrast, existing online methods are hindered without online interventions and thus should not be applied counterfactually. With the introduction of the intervention-aware estimator, we aim to bridge the online/counterfactual LTR division, as it is shown to be highly effective in both online and counterfactual scenarios.
Introduction

Ranking systems form the basis for most search and recommendation applications [75]. As a result, the quality of such systems can greatly impact the user experience; thus it is important that the underlying ranking models perform well. The LTR field considers methods to optimize ranking models. Traditionally, this optimization was based on expert annotations. Over the years, the limitations of expert annotations have become apparent; some of the
most important ones are: (i) they are expensive and time-consuming to acquire [17, 95]; (ii) in privacy-sensitive settings expert annotation is unethical, e.g., in email or private document search [128]; and (iii) expert annotations often appear to disagree with actual user preferences [104].

User interaction data solves some of the problems with expert annotations: (i) interaction data is virtually free for systems with active users; (ii) it does not require experts to look at potentially privacy-sensitive content; and (iii) interaction data is indicative of users' preferences. For these reasons, interest in LTR methods that learn from user interactions has increased in recent years. However, user interactions are a form of implicit feedback and are generally also affected by factors other than user preference [57]. Therefore, to be able to reliably learn from interaction data, the effect of factors other than preference has to be corrected for. For clicks on rankings, three prevalent factors are well known: (i) position bias: users are less likely to examine, and thus click, lower-ranked items [25]; (ii) item-selection bias: users cannot click on items that are not displayed [86, 92]; and (iii) trust bias: because users trust the ranking system, they are more likely to click on highly ranked items that they do not actually prefer [3, 57]. As a result of these biases, which ranking system was used to gather clicks can have a substantial impact on the clicks that will be observed. Current LTR methods that learn from clicks can be divided into two families: counterfactual approaches [58], which learn from historical data, i.e., clicks that have been logged in the past, and online approaches [132], which can perform interventions, i.e., they can decide what rankings will be shown to users. Recent work has noticed that some counterfactual methods can be applied as online methods [50], and vice versa [6, 136]. Nonetheless, every existing method was designed for either the online or the counterfactual setting, never both.

In this chapter, we propose a novel estimator for both counterfactual and online LTR from clicks: the intervention-aware estimator. The intervention-aware estimator builds on the ideas that underlie the latest existing counterfactual methods, the policy-aware estimator [86] and the affine estimator [123], and expands them to consider the effect of online interventions. It does so by considering how the effect of bias is changed by an intervention, and it utilizes these differences in its unbiased estimation. As a result, the intervention-aware estimator is effective both when applied as a counterfactual method, i.e., when learning from historical data, and as an online method, where online interventions lead to enormous increases in efficiency. In our experimental results the intervention-aware estimator is shown to reach state-of-the-art LTR performance in both online and counterfactual settings, and it is the only method that reaches top performance in both.

The main contributions of this chapter are:
1. A novel intervention-aware estimator that corrects for position bias, trust bias, item-selection bias, and the effect of online interventions.
2. An investigation into the effect of online interventions on state-of-the-art counterfactual and online LTR methods.

Interactions with Rankings
The theory in this chapter assumes that three forms of interaction bias occur: position bias, item-selection bias, and trust bias.
Position bias occurs because users only click an item after examining it, and users are more likely to examine items displayed at higher ranks [25]. Thus the rank (a.k.a. position) at which an item is displayed heavily affects the probability of it being clicked. We model this bias using $P(E = 1 \mid k)$: the probability that an item $d$ displayed at rank $k$ is examined ($E$) by the user [128].

Item-selection bias occurs when some items have a zero probability of being examined in some displayed rankings [92]. This can happen because not all items are displayed to the user, or because the ranked list is so long that no user ever considers the entire list. We model this bias by stating:
$$\exists k, \forall k', \;\big(k' > k \rightarrow P(E = 1 \mid k') = 0\big), \qquad (8.1)$$
i.e., there exists a rank $k$ such that items ranked lower than $k$ have no chance of being examined. The distinction between position bias and item-selection bias is important because some methods can only correct for the former if the latter is not present [86].

Finally, trust bias occurs because users trust the ranking system and, consequently, are more likely to perceive top-ranked items as relevant even when they are not [57]. We model this bias using $P(C = 1 \mid k, R, E)$: the probability of a click conditioned on the displayed rank $k$, the relevance of the item $R$, and examination $E$.

To combine these three forms of bias into a single click model, we follow Agarwal et al. [3] and write:
$$P(C = 1 \mid d, k, q) = P(E = 1 \mid k)\big(P(C = 1 \mid k, R = 0, E = 1)\, P(R = 0 \mid d, q) + P(C = 1 \mid k, R = 1, E = 1)\, P(R = 1 \mid d, q)\big), \qquad (8.2)$$
where $P(R = 1 \mid d, q)$ is the probability that an item $d$ is deemed relevant w.r.t. query $q$ by the user. An analysis of real-world interaction data, performed by Agarwal et al. [3] on search services for retrieving cloud-stored files and emails, showed that this model captures click behavior better than models that only capture position bias [128].

To simplify the notation, we follow Vardasbi et al. [123] and adopt:
$$\alpha_k = P(E = 1 \mid k)\big(P(C = 1 \mid k, R = 1, E = 1) - P(C = 1 \mid k, R = 0, E = 1)\big), \qquad \beta_k = P(E = 1 \mid k)\, P(C = 1 \mid k, R = 0, E = 1). \qquad (8.3)$$
This results in a compact notation for the click probability (Eq. 8.2):
$$P(C = 1 \mid d, k, q) = \alpha_k\, P(R = 1 \mid d, q) + \beta_k. \qquad (8.4)$$
For a single ranking $y$, let $k$ be the rank at which item $d$ is displayed in $y$; we denote $\alpha_{d,y} = \alpha_k$ and $\beta_{d,y} = \beta_k$. This allows us to specify the click probability conditioned on a ranking $y$:
$$P(C = 1 \mid d, y, q) = \alpha_{d,y}\, P(R = 1 \mid d, q) + \beta_{d,y}. \qquad (8.5)$$
Finally, let $\pi$ be a ranking policy used for logging clicks, where $\pi(y \mid q)$ is the probability of $\pi$ displaying ranking $y$ for query $q$; then the click probability conditioned on $\pi$ is:
$$P(C = 1 \mid d, \pi, q) = \sum_y \pi(y \mid q)\big(\alpha_{d,y}\, P(R = 1 \mid d, q) + \beta_{d,y}\big). \qquad (8.6)$$
The proofs in the remainder of this chapter assume this model of click behavior.
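As an illustration, a minimal simulation of this click model (Eq. 8.5) in Python; the $\alpha$, $\beta$, and relevance values are assumed for the example and carry no special meaning:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trust-bias parameters per displayed rank (the alpha_k, beta_k of Eq. 8.3):
alpha = np.array([0.50, 0.40, 0.30, 0.20, 0.10])
beta  = np.array([0.30, 0.20, 0.10, 0.05, 0.02])

def click_probabilities(ranking, p_rel):
    """P(C=1 | d, y, q) = alpha_{d,y} * P(R=1|d,q) + beta_{d,y}  (Eq. 8.5)."""
    return alpha[:len(ranking)] * p_rel[ranking] + beta[:len(ranking)]

p_rel = np.array([1.0, 0.5, 0.0, 0.25, 0.75])   # assumed P(R=1|d,q) per item
ranking = np.array([0, 1, 2, 3, 4])             # a displayed top-5 ranking y
clicks = rng.random(len(ranking)) < click_probabilities(ranking, p_rel)
print(clicks)
```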
Background

In this section we cover the basics of LTR and counterfactual LTR.

The field of LTR considers methods for optimizing ranking systems w.r.t. ranking metrics. Most ranking metrics are additive w.r.t. documents; let $P(q)$ be the probability that a user-issued query is query $q$; then the metric reward $R$ commonly has the form:
$$R(\pi) = \sum_q P(q) \sum_{d \in D_q} \lambda(d \mid D_q, \pi, q)\, P(R = 1 \mid d, q). \qquad (8.7)$$
Here, the $\lambda$ function scores each item $d$ depending on how $\pi$ ranks $d$ when given the preselected item set $D_q$; $\lambda$ can be chosen to match a desired metric, for instance, the common Discounted Cumulative Gain (DCG) metric [52]:
$$\lambda_{\mathrm{DCG}}(d \mid D_q, \pi, q) = \sum_y \pi(y \mid q)\big(\log_2(\mathrm{rank}(d \mid y) + 1)\big)^{-1}. \qquad (8.8)$$
Supervised LTR methods can optimize $\pi$ to maximize $R$ if the relevances $P(R = 1 \mid d, q)$ are known [75, 129]. In practice, however, finding these relevance values is not straightforward.

Over time, limitations of the supervised LTR approach have become apparent. Most importantly, finding accurate relevance values $P(R = 1 \mid d, q)$ has proved to be impossible or infeasible in many practical situations [127]. As a solution, LTR methods have been developed that learn from user interactions instead of relevance annotations. Counterfactual LTR concerns approaches that learn from historical interactions. Let $\mathcal{D}$ be a set of collected interaction data over $T$ timesteps; for each timestep $t$ it contains the user-issued query $q_t$, the logging policy $\pi_t$ used to generate the displayed ranking $\bar{y}_t$, and the clicks $c_t$ received on the ranking:
$$\mathcal{D} = \{(\pi_t, q_t, \bar{y}_t, c_t)\}_{t=1}^{T}, \qquad (8.9)$$
where $c_t(d) \in \{0, 1\}$ indicates whether item $d$ was clicked at timestep $t$. While clicks are indicative of relevance, they are also affected by several forms of bias, as discussed in Section 8.2.

Counterfactual LTR methods utilize estimators that correct for bias to unbiasedly estimate the reward of a policy $\pi$. The prevalent methods introduce a function $\hat{\Delta}$ that transforms a single click signal to correct for bias. The general estimate of the reward is:
$$\hat{R}(\pi \mid \mathcal{D}) = \frac{1}{T} \sum_{t=1}^{T} \sum_{d \in D_{q_t}} \lambda(d \mid D_{q_t}, \pi, q_t)\, \hat{\Delta}(d \mid \pi_t, q_t, \bar{y}_t, c_t). \qquad (8.10)$$
We note the important distinction between the policy $\pi$ for which we estimate the reward and the policy $\pi_t$ that was used to gather interactions at timestep $t$. During optimization only $\pi$ is changed, in order to maximize the estimated reward.

The original Inverse Propensity Scoring (IPS) based estimator, introduced by Wang et al. [127] and Joachims et al. [58], weights clicks according to examination probabilities:
$$\hat{\Delta}_{\mathrm{IPS}}(d \mid \bar{y}_t, c_t) = \frac{c_t(d)}{P(E = 1 \mid \bar{y}_t, d)}. \qquad (8.11)$$
This estimator results in unbiased optimization under two requirements. First, every relevant item must have a non-zero examination probability in all displayed rankings:
$$\forall t, \forall d \in D_{q_t}, \;\big(P(R = 1 \mid d, q_t) > 0 \rightarrow P(E = 1 \mid \bar{y}_t, d) > 0\big). \qquad (8.12)$$
Second, the click probability conditioned on relevance for examined items should be the same at every rank:
$$\forall k, k', \;\big(P(C \mid k, R, E = 1) = P(C \mid k', R, E = 1)\big), \qquad (8.13)$$
i.e., no trust bias is present. These requirements illustrate that this estimator can only correct for position bias, and is biased when item-selection bias or trust bias is present. For a proof we refer to previous work by Joachims et al. [58] and Vardasbi et al. [123].

Oosterhuis and de Rijke [86] (Chapter 5) adapt the IPS approach to correct for item-selection bias as well.
They weight clicks according to examination probabilities conditioned on the logging policy, instead of on the single displayed ranking on which a click took place. This results in the policy-aware estimator:
$$\hat{\Delta}_{\mathrm{aware}}(d \mid \pi_t, q_t, c_t) = \frac{c_t(d)}{P(E = 1 \mid \pi_t, q_t, d)} = \frac{c_t(d)}{\sum_y \pi_t(y \mid q_t)\, P(E = 1 \mid y, d, q_t)}. \qquad (8.14)$$
This estimator can be used for unbiased optimization under two assumptions. First, every relevant item must have a non-zero examination probability under the logging policy:
$$\forall t, \forall d \in D_{q_t}, \;\big(P(R = 1 \mid d, q_t) > 0 \rightarrow P(E = 1 \mid \pi_t, d, q_t) > 0\big). \qquad (8.15)$$
Second, no trust bias is present, as described in Eq. 8.13. Importantly, the first requirement can be met under item-selection bias, since a stochastic ranking policy can always provide every item a non-zero probability of appearing in a top-$k$ ranking. Thus, even when not all items can be displayed at once, a stochastic policy can provide non-zero examination probabilities to all items. For a proof of this claim we refer to previous work by Oosterhuis and de Rijke [86].

Lastly, Vardasbi et al. [123] prove that IPS cannot correct for trust bias. As an alternative, they introduce an estimator based on affine corrections. This affine estimator penalizes an item displayed at rank $k$ by $\beta_k$ while also reweighting inversely w.r.t. $\alpha_k$:
$$\hat{\Delta}_{\mathrm{affine}}(d \mid \bar{y}_t, c_t) = \frac{c_t(d) - \beta_{d,\bar{y}_t}}{\alpha_{d,\bar{y}_t}}. \qquad (8.16)$$
The $\beta$ penalties correct for the number of clicks an item is expected to receive due to its displayed rank instead of its relevance. The affine estimator is unbiased under a single assumption, namely that the click probability of every item must be correlated with its relevance in every displayed ranking:
$$\forall t, \forall d \in D_{q_t}, \;\alpha_{d,\bar{y}_t} \neq 0. \qquad (8.17)$$
Thus, while this estimator can correct for position bias and trust bias, it cannot correct for item-selection bias. For a proof of these claims we refer to previous work by Vardasbi et al. [123].

We note that all of these estimators require knowledge of the position bias ($P(E = 1 \mid k)$) or trust bias ($\alpha$ and $\beta$). A lot of existing work has considered how these values can be inferred accurately [3, 30, 128]. The theory in this chapter assumes that these values are known.

This concludes our description of the existing counterfactual estimators on which our method expands. To summarize, each of these estimators corrects for position bias, one also corrects for item-selection bias, and another also for trust bias. Currently, there is no estimator that corrects for all three forms of bias together.
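As a compact side-by-side view of the three corrections, a sketch under assumed bias parameters (every number below is illustrative, not taken from the thesis's experiments):

```python
# One clicked item at a displayed rank; all numbers are illustrative assumptions.
c = 1.0                 # c_t(d): the item was clicked
P_E_rank = 0.5          # P(E=1 | y_t, d): examination prob. at the displayed rank
P_E_policy = 0.3        # P(E=1 | pi_t, q_t, d): examination prob. under the policy
alpha_k, beta_k = 0.4, 0.1  # trust-bias parameters of the displayed rank (Eq. 8.3)

delta_ips = c / P_E_rank               # Eq. 8.11: corrects position bias
delta_aware = c / P_E_policy           # Eq. 8.14: also corrects item-selection bias
delta_affine = (c - beta_k) / alpha_k  # Eq. 8.16: also corrects trust bias
print(delta_ips, delta_aware, delta_affine)  # 2.0, 3.333..., 2.25
```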
Related Work

One of the earliest approaches to LTR from clicks was introduced by Joachims [54]. It infers pairwise preferences between items from click logs and uses pairwise LTR to update an SVM ranking model. While this approach had some success, in later work Joachims et al. [58] note that position bias often incorrectly pushes the pairwise loss to flip the ranking displayed during logging. To avoid this biased behavior, Joachims et al. [58] proposed the idea of counterfactual LTR, in the spirit of earlier work by Wang et al. [127]. This led to estimators that correct for position bias using IPS weighting (see Section 8.3.2). This work sparked the field of counterfactual LTR, which has focused both on capturing interaction biases and on optimization methods that can correct for them. Methods for measuring position bias are based on EM optimization [128], a dual learning objective [5], or randomization [4, 30]; for trust bias only an EM-based approach is currently known [3]. Agarwal et al. [2] showed how counterfactual LTR can optimize neural networks and DCG-like metrics through upper-bounding. Oosterhuis and de Rijke [86] introduced an IPS estimator that can correct for item-selection bias (see Section 8.3.2 and Chapter 5), while also showing that the LambdaLoss framework [129] can be applied to counterfactual LTR (see Chapter 5). Lastly, Vardasbi et al. [123] proved that IPS estimators cannot correct for trust bias and introduced an affine estimator that is capable of doing so (see Section 8.3.2). There is currently no known estimator that can correct for position bias, item-selection bias, and trust bias simultaneously.

The other paradigm for LTR from clicks is online LTR [132]. The earliest method, Dueling Bandit Gradient Descent (DBGD), samples variations of a ranking model and compares them using online evaluation [41]; if an improvement is recognized, the model is updated accordingly. Most online LTR methods have increased the data-efficiency of DBGD [43, 111, 126]; later work found that DBGD is not effective at optimizing neural models [82] (Chapter 3) and often fails to find the optimal linear model even in ideal scenarios [84] (Chapter 4). In response to these limitations, alternative approaches for online LTR have been proposed. Pairwise Differentiable Gradient Descent (PDGD) takes a pairwise approach but weights pairs to correct for position bias [82] (Chapter 3). While PDGD was found to be very effective and robust to noise [50, 84] (Chapter 4), it can be proven that its gradient estimation is affected by position bias; thus we do not consider it to be unbiased. In contrast, Zhuang and Zuccon [136] introduced Counterfactual Online Learning to Rank (COLTR), which takes the DBGD approach but uses a form of counterfactual evaluation to compare candidate models. Despite making use of counterfactual estimation, Zhuang and Zuccon [136] propose the method solely for online LTR.

Interestingly, with COLTR the line between online and counterfactual LTR methods starts to blur. Recent work by Jagerman et al. [50] applied the original counterfactual approach [58] as an online method and found that it leads to improvements. Furthermore, Ai et al. [6] noted that with a small adaptation PDGD can be applied to historical data. Although this means that some existing methods can already be applied both online and counterfactually, no method has been found that is the most reliable choice in both scenarios.
An Estimator Oblivious to Online Interventions

Before we propose the main contribution of this chapter, the intervention-aware estimator, we first introduce an estimator that simultaneously corrects for position bias, item-selection bias, and trust bias, without considering the effects of interventions. The resulting intervention-oblivious estimator will subsequently serve as a method to contrast the intervention-aware estimator with.

Section 8.3.2 described how the policy-aware estimator corrects for item-selection bias by taking into account the behavior of the logging policy used to gather clicks [86]. Furthermore, Section 8.3.2 also detailed how the affine estimator corrects for trust bias by applying an affine transformation to individual clicks [123]. We will now show that a single estimator can correct for both item-selection bias and trust bias simultaneously, by combining the approaches of both these existing estimators.

First we note that the probability of a click conditioned on a single logging policy $\pi_t$ can be expressed as:
$$P(C = 1 \mid d, \pi_t, q) = \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\big(\alpha_{d,\bar{y}}\, P(R = 1 \mid d, q) + \beta_{d,\bar{y}}\big) = \mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q]\, P(R = 1 \mid d, q) + \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q], \qquad (8.18)$$
where the expected values of $\alpha$ and $\beta$ conditioned on $\pi_t$ are:
$$\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q] = \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\, \alpha_{d,\bar{y}}, \qquad \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q] = \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\, \beta_{d,\bar{y}}. \qquad (8.19)$$
By inverting Eq. 8.18, the relevance probability can be obtained from the click probability. We introduce our intervention-oblivious estimator, which applies this transformation to correct for bias:
$$\hat{\Delta}_{\mathrm{IO}}(d \mid q_t, c_t) = \frac{c_t(d) - \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q_t]}{\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q_t]}. \qquad (8.20)$$
The intervention-oblivious estimator brings together the policy-aware and affine estimators: to every click it applies an affine transformation based on the logging policy's behavior. Unlike existing estimators, we can prove that the intervention-oblivious estimator is unbiased w.r.t. our assumed click model (Section 8.2).
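Before the proof, a minimal sketch of how the Eq. 8.20 correction could be computed in practice: the expectations of Eq. 8.19 are approximated by sampling rankings from the logging policy (a toy softmax policy here; all parameters are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

n_docs, K = 5, 3
alpha = np.array([0.4, 0.3, 0.2])    # assumed trust-bias parameters per rank
beta  = np.array([0.2, 0.1, 0.05])
scores = rng.normal(size=n_docs)     # log-scores of the softmax logging policy pi_t

def sample_ranking():
    """Sample a top-K ranking from the logging policy, rank by rank."""
    remaining = list(range(n_docs))
    out = []
    for _ in range(K):
        p = np.exp(scores[remaining])
        p /= p.sum()
        out.append(remaining.pop(rng.choice(len(remaining), p=p)))
    return out

# Monte-Carlo estimates of E_y[alpha_d | pi_t, q] and E_y[beta_d | pi_t, q] (Eq. 8.19);
# items that fall outside the top-K of a sampled ranking contribute zero.
M = 50_000
exp_alpha = np.zeros(n_docs)
exp_beta = np.zeros(n_docs)
for _ in range(M):
    for k, d in enumerate(sample_ranking()):
        exp_alpha[d] += alpha[k]
        exp_beta[d] += beta[k]
exp_alpha /= M
exp_beta /= M

def delta_io(d, clicked):
    """Intervention-oblivious estimate for a single observation (Eq. 8.20)."""
    return (float(clicked) - exp_beta[d]) / exp_alpha[d]

print(delta_io(0, True), delta_io(0, False))
```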
Theorem 8.1. The estimated reward $\hat{R}$ (Eq. 8.10) using the intervention-oblivious estimator (Eq. 8.20) is unbiased w.r.t. the true reward $R$ (Eq. 8.7) under two assumptions: (1) our click model (Eq. 8.5) holds, and (2) the click probability on every item, conditioned on the logging policy per timestep $\pi_t$, is correlated with relevance:
$$\forall t, \forall d \in D_{q_t}, \;\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q_t] \neq 0. \qquad (8.21)$$

Proof.
Using Eq. 8.18 and Eq. 8.21, the relevance probability can be derived from the click probability by:
$$P(R = 1 \mid d, q) = \frac{P(C = 1 \mid d, \pi_t, q) - \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q]}{\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q]}. \qquad (8.22)$$
Eq. 8.22 can be used to show that $\hat{\Delta}_{\mathrm{IO}}$ is an unbiased indicator of relevance:
$$\mathbb{E}_{\bar{y},c}\big[\hat{\Delta}_{\mathrm{IO}}(d \mid q_t, c_t) \,\big|\, \pi_t\big] = \mathbb{E}_{\bar{y},c}\left[\frac{c_t(d) - \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q_t]}{\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q_t]} \,\middle|\, \pi_t, q_t\right] = \frac{\mathbb{E}_{\bar{y},c}[c_t(d) \mid \pi_t, q_t] - \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q_t]}{\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q_t]} = \frac{P(C = 1 \mid d, \pi_t, q_t) - \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q_t]}{\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q_t]} = P(R = 1 \mid d, q_t). \qquad (8.23)$$
Finally, combining Eq. 8.7 with Eq. 8.10 and Eq. 8.23 reveals that $\hat{R}$ based on the intervention-oblivious estimator $\hat{\Delta}_{\mathrm{IO}}$ is unbiased w.r.t. $R$:
$$\mathbb{E}_{t,q,\bar{y},c}\big[\hat{R}(\pi \mid \mathcal{D})\big] = \sum_q P(q) \sum_{d \in D_q} \lambda(d \mid D_q, \pi, q)\, \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\bar{y},c}\big[\hat{\Delta}_{\mathrm{IO}}(d \mid c, q) \,\big|\, \pi_t, q\big] = \sum_q P(q) \sum_{d \in D_q} \lambda(d \mid D_q, \pi, q)\, P(R = 1 \mid d, q) = R(\pi). \qquad (8.24)$$

Figure 8.1: Example of an online intervention and the weights used by the intervention-oblivious and intervention-aware estimators for a single item as more data is gathered (plotted per timestep: $\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q]$ and $\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q]$, together with their inverses, the click weights).

Existing estimators for counterfactual LTR are designed for a scenario where the logging policy is static:
$$\forall (\pi_t, \pi_{t'}) \in \mathcal{D}, \;\pi_t = \pi_{t'}. \qquad (8.25)$$
However, we note that if an online intervention takes place [50], meaning that the logging policy was updated during the gathering of data:
$$\exists (\pi_t, \pi_{t'}) \in \mathcal{D}, \;\pi_t \neq \pi_{t'}, \qquad (8.26)$$
the intervention-oblivious estimator is still unbiased. This was already proven in Theorem 8.1, because its assumptions cover both the scenario where online interventions take place and the scenario where they do not.

However, the individual corrections of the intervention-oblivious estimator are only based on the single logging policy that was deployed at the timestep of each specific click. It is completely oblivious to the logging policies applied at different timesteps. Although this does not lead to bias in its estimation, it does result in unintuitive behavior. We illustrate this behavior in Figure 8.1: here, a logging policy that results in $\mathbb{E}[\alpha_d \mid \pi_t, q] = 0.25$ for an item $d$ is deployed during the first $t \leq 100$ timesteps. Then an online intervention takes place and the logging policy is updated so that for $t > 100$, $\mathbb{E}[\alpha_d \mid \pi_t, q] = 0.05$. The intervention-oblivious estimator weights clicks inversely to $\mathbb{E}[\alpha_d \mid \pi_t]$; so clicks for $t \leq 100$ will be weighted by $1/0.25 = 4$ and clicks for $t > 100$ by $1/0.05 = 20$. Thus, there is a sharp and sudden difference in how clicks are treated before and after $t = 100$. What is unintuitive about this example is that the way clicks are treated after $t = 100$ is completely independent of what the situation was before $t = 100$. For instance, consider another item $d'$ where $\forall t, \mathbb{E}[\alpha_{d'} \mid \pi_t, q] = 0.05$. If both $d$ and $d'$ are clicked on timestep $t = 101$, these clicks would both be weighted by $20$, despite the fact that $d$ has so far been treated completely differently than $d'$. One would expect that in such a case the click on $d$ should be weighted less, to compensate for the high $\mathbb{E}[\alpha_d \mid \pi_t, q]$ it had in the first $100$ timesteps. The question is whether such behavior can be incorporated in an estimator without introducing bias.

The Intervention-Aware Estimator

Our goal for the intervention-aware estimator is to find an estimator whose individual corrections are not only based on single logging policies, but instead consider the entire collection of logging policies used to gather the data $\mathcal{D}$. Importantly, this estimator should also be unbiased w.r.t. position bias, item-selection bias and trust bias.

For ease of notation, we use $\Pi_T$ for the set of policies that gathered the data in $\mathcal{D}$: $\Pi_T = \{\pi_1, \pi_2, \ldots, \pi_T\}$. The probability of a click can be conditioned on this set:
$$P(C = 1 \mid d, \Pi_T, q) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\big(\alpha_{d,\bar{y}}\, P(R = 1 \mid d, q) + \beta_{d,\bar{y}}\big) = \mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q]\, P(R = 1 \mid d, q) + \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q], \qquad (8.27)$$
where the expected values of $\alpha$ and $\beta$ conditioned on $\Pi_T$ are:
$$\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q] = \frac{1}{T} \sum_{t=1}^{T} \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\, \alpha_{d,\bar{y}}, \qquad \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q] = \frac{1}{T} \sum_{t=1}^{T} \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\, \beta_{d,\bar{y}}. \qquad (8.28)$$
Thus $P(C = 1 \mid d, \Pi_T, q)$ gives us the probability of a click given that any policy from $\Pi_T$ could be deployed. We propose our intervention-aware estimator, which corrects for bias using the expectations conditioned on $\Pi_T$:
$$\hat{\Delta}_{\mathrm{IA}}(d \mid q_t, c_t) = \frac{c_t(d) - \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q_t]}{\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q_t]}. \qquad (8.29)$$
The salient difference with the intervention-oblivious estimator is that the expectations are conditioned on $\Pi_T$, i.e., on all logging policies in $\mathcal{D}$, instead of on an individual logging policy $\pi_t$. While the difference with the intervention-oblivious estimator seems small, our experimental results show that the differences in performance are actually quite sizeable. Lastly, we note that when no interventions take place, the intervention-oblivious and intervention-aware estimators are equivalent. Because the intervention-aware estimator is the only existing counterfactual LTR estimator whose corrections are influenced by online interventions, we consider it to be a step that helps to bridge the gap between counterfactual and online LTR.

Before we revisit our online intervention example with our novel intervention-aware estimator, we prove that it is unbiased w.r.t. our assumed click model (Section 8.2).
Theorem 8.2. The estimated reward $\hat{R}$ (Eq. 8.10) using the intervention-aware estimator (Eq. 8.29) is unbiased w.r.t. the true reward $R$ (Eq. 8.7) under two assumptions: (1) our click model (Eq. 8.5) holds, and (2) the click probability on every item, conditioned on the set of logging policies $\Pi_T$, is correlated with relevance:
$$\forall q, \forall d \in D_q, \;\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q] \neq 0. \qquad (8.30)$$

Proof.
Using Eq. 8.27 and Eq. 8.30, the relevance probability can be derived from the click probability by:
$$P(R = 1 \mid d, q) = \frac{P(C = 1 \mid d, \Pi_T, q) - \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q]}{\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q]}. \qquad (8.31)$$
Eq. 8.31 can be used to show that $\hat{\Delta}_{\mathrm{IA}}$ is an unbiased indicator of relevance:
$$\mathbb{E}_{t,\bar{y},c}\big[\hat{\Delta}_{\mathrm{IA}}(d \mid q_t, c_t) \,\big|\, \Pi_T\big] = \mathbb{E}_{t,\bar{y},c}\left[\frac{c_t(d) - \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q_t]}{\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q_t]} \,\middle|\, \Pi_T, q_t\right] = \frac{\mathbb{E}_{t,\bar{y},c}[c_t(d) \mid \Pi_T, q_t] - \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q_t]}{\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q_t]} = \frac{P(C = 1 \mid d, \Pi_T, q_t) - \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q_t]}{\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q_t]} = P(R = 1 \mid d, q_t). \qquad (8.32)$$
Finally, combining Eq. 8.32 with Eq. 8.10 and Eq. 8.7 reveals that $\hat{R}$ based on the intervention-aware estimator $\hat{\Delta}_{\mathrm{IA}}$ is unbiased w.r.t. $R$:
$$\mathbb{E}_{t,q,\bar{y},c}\big[\hat{R}(\pi \mid \mathcal{D})\big] = \sum_q P(q) \sum_{d \in D_q} \lambda(d \mid D_q, \pi, q)\, \mathbb{E}_{t,\bar{y},c}\big[\hat{\Delta}_{\mathrm{IA}}(d \mid c, q) \,\big|\, \Pi_T, q\big] = \sum_q P(q) \sum_{d \in D_q} \lambda(d \mid D_q, \pi, q)\, P(R = 1 \mid d, q) = R(\pi). \qquad (8.33)$$

We will now revisit the example in Figure 8.1, but this time consider how the intervention-aware estimator treats item $d$. Unlike with the intervention-oblivious estimator, clicks are weighted by $\mathbb{E}[\alpha_d \mid \Pi_T]$, which means that the exact timestep $t$ of a click does not matter, as long as $t < T$. Furthermore, the weight of a click can change as the total number of timesteps $T$ increases. In other words, as more data is gathered, the intervention-aware estimator retroactively updates the weights of all previously gathered clicks.

We see that this behavior avoids the sharp difference in weights between clicks occurring before the intervention ($t \leq 100$) and after it ($t > 100$). For instance, a click on $d$ occurring at $t = 101$ while $T = 400$ results in $\mathbb{E}[\alpha_d \mid \Pi_T] = 0.1$ and thus a weight of $1/0.1 = 10$. This is much lower than the intervention-oblivious weight of $1/0.05 = 20$, because the intervention-aware estimator also considers the initial period where $\mathbb{E}[\alpha_d \mid \pi_t, q]$ was high.
Thus we see that the intervention-aware estimator has the behavior we intuitively expected: it weights clicks based on how the item was treated throughout all timesteps. In this example, this leads to weights considerably smaller than those used by the intervention-oblivious estimator. In IPS estimators, small propensities and the resulting large weights are known to lead to high variance [58]; thus we may expect the intervention-aware estimator to reduce variance in this example.
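The arithmetic behind this example is simple enough to verify directly; a short sketch (the values are exactly those of the example above):

```python
# Arithmetic behind the Figure 8.1 example.
T_pre, a_pre = 100, 0.25    # E[alpha_d | pi_t, q] before the intervention (t <= 100)
a_post = 0.05               # E[alpha_d | pi_t, q] after the intervention (t > 100)

# Intervention-oblivious weights depend only on the policy active at the click:
print(1 / a_pre, 1 / a_post)                      # 4.0 and 20.0

# The intervention-aware expectation (Eq. 8.28) averages over all deployed policies:
T = 400
exp_alpha_ia = (T_pre * a_pre + (T - T_pre) * a_post) / T
print(exp_alpha_ia, 1 / exp_alpha_ia)             # 0.1 and 10.0
```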
While the intervention-aware estimator takes into account the effect of interventions, it does not prescribe what interventions should take place. In fact, it will work with any interventions that result in Eq. 8.30 being true, including the situation where no intervention takes place at all. For clarity, we describe here the intervention approach we applied during our experiments. Algorithm 8.1 displays our online/counterfactual approach. As input it requires a starting policy ($\pi_0$), a choice for $\lambda$, the $\alpha$ and $\beta$ parameters, a set of intervention timesteps ($\Phi$), and the final timestep $T$.

Algorithm 8.1 Our Online/Counterfactual LTR Approach
1: Input: starting policy $\pi_0$; metric weight function $\lambda$; inferred bias parameters $\alpha$ and $\beta$; intervention steps $\Phi$; end-time $T$.
2: $\mathcal{D} \leftarrow \{\}$  // initialize data container
3: $\pi \leftarrow \pi_0$  // initialize logging policy
4: for $i \in \Phi$ do
5:     $\mathcal{D} \leftarrow \mathcal{D} \cup \mathrm{gather}(\pi, i - |\mathcal{D}|)$  // observe $i - |\mathcal{D}|$ timesteps
6:     $\pi \leftarrow \mathrm{optimize}(\mathcal{D}, \alpha, \beta, \pi_0)$  // optimize based on available data
7: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathrm{gather}(\pi, T - |\mathcal{D}|)$  // expand data to $T$
8: $\pi \leftarrow \mathrm{optimize}(\mathcal{D}, \alpha, \beta, \pi_0)$  // optimize based on final data
9: return $\pi$

The algorithm starts by initializing an empty set to store the gathered interaction data (Line 2) and initializes the logging policy with the provided starting policy $\pi_0$ (Line 3). Then, for each timestep $i \in \Phi$, the dataset is expanded using the current logging policy so that $|\mathcal{D}| = i$ (Line 5). In other words, for $i - |\mathcal{D}|$ timesteps $\pi$ is used to display rankings to user-issued queries, and the resulting interactions are added to $\mathcal{D}$. Then a policy is optimized using the available data in $\mathcal{D}$, and this policy becomes the new logging policy (Line 6). For this optimization, we split the available data into training and validation partitions in order to do early stopping to prevent overfitting. We use stochastic gradient descent with $\pi_0$ as the initial model; this practice is based on the assumption that $\pi_0$ performs better than a randomly initialized model. Thus, during optimization, gradient calculation uses the intervention-aware estimator on the training partition of $\mathcal{D}$, and after each epoch, optimization is stopped if the intervention-aware estimator on the validation partition of $\mathcal{D}$ suspects overfitting. Each iteration results in an intervention, as the resulting policy replaces the logging policy and thus changes the way future data is logged. After iterating over $\Phi$ is completed, more data is gathered so that $|\mathcal{D}| = T$, and optimization is performed once more (Lines 7-8). The final policy is the end result of the procedure.

We note that, depending on $\Phi$, our approach can be either online, counterfactual, or somewhere in between. If $\Phi = \emptyset$, the approach is fully counterfactual since all data is gathered using the static $\pi_0$. Conversely, if $\Phi = \{1, 2, 3, \ldots, T\}$, it is fully online since the logging policy is updated at every timestep. In practice, we expect a fully online procedure to be infeasible, as it is computationally expensive and user queries may be issued faster than optimization can be performed. In our experiments we investigate the effect of the number of interventions on the approach's performance.
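A skeletal Python rendering of Algorithm 8.1 may help clarify the control flow; `gather` and `optimize` below are toy stand-ins supplied by the caller (the actual click logging and intervention-aware optimization are as described above), and the intervention schedule is exponentially spaced as in the experiments:

```python
import numpy as np

def run_ltr(pi_0, gather, optimize, phi, T):
    """Sketch of Algorithm 8.1: interleave click logging with interventions."""
    data = []                               # the interaction log D
    pi = pi_0                               # current logging policy
    for i in sorted(phi):
        data += gather(pi, i - len(data))   # observe until |D| = i
        pi = optimize(data, pi_0)           # intervention: new logging policy,
                                            # optimization restarts from pi_0
    data += gather(pi, T - len(data))       # expand the log to T timesteps
    return optimize(data, pi_0)             # final optimization

# Toy stand-ins so the sketch runs end-to-end:
gather = lambda pi, n: [(pi, None)] * max(n, 0)  # one logged record per timestep
optimize = lambda data, pi_0: len(data)          # pretend the model improves with data

phi = np.geomspace(10, 1_000, num=5).astype(int)  # exponentially spaced interventions
print(run_ltr(0, gather, optimize, phi, T=5_000)) # -> 5000
```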
Experimental Setup

Our experiments aim to answer the following research questions:

RQ1
Does the intervention-aware estimator lead to higher performance than existing counterfactual LTR estimators when online interventions take place?
RQ2
Does the intervention-aware estimator lead to performance comparable with existing online LTR methods?

We use the semi-synthetic experimental setup that is common in existing work on both online LTR [43, 82, 84, 136] and counterfactual LTR [58, 92, 123]. In this setup, queries and documents are sampled from a dataset based on commercial search logs, while user interactions and rankings are simulated using probabilistic click models. The advantage of this setup is that it allows us to investigate the effects of online interventions on a large scale, while also being easy to reproduce by researchers without access to live ranking systems.

We use the publicly available Yahoo Webscope dataset [17], which consists of 29,921 queries with, on average, 24 documents preselected per query. Query-document pairs are represented by 700 features and five-grade relevance annotations ranging from not relevant (0) to perfectly relevant (4). The queries are divided into training, validation and test partitions.

At each timestep, we simulate a user-issued query by uniformly sampling from the training and validation partitions. Subsequently, the preselected documents are ranked according to the logging policy, and user interactions are simulated on the top-5 of the ranking using a probabilistic click model. We apply Eq. 8.4 with rank-dependent $\alpha$ and $\beta$ vectors covering the top-5 ranks; the relevance probabilities are based on the annotations from the dataset: $P(R = 1 \mid d, q) = 0.25 \cdot \mathrm{relevance\_label}(d, q)$. The values of $\alpha$ and $\beta$ were chosen based on those reported by Agarwal et al. [3], who inferred them from real-world user behavior. In doing so, we aim to emulate a setting where realistic levels of position bias, item-selection bias, and trust bias are present.

All counterfactual methods use the approach described in Section 8.6.2. To simulate a production ranker policy, we use supervised LTR to train a ranking model on 1% of the training partition [58]. The resulting production ranker has much better performance than a randomly initialized model, yet still leaves room for improvement. We use the production ranker as the initial logging policy. The size of $\Phi$ (the intervention timesteps) varies per run, and the timesteps in $\Phi$ are evenly spread on an exponential scale. All ranking models are neural networks with two hidden layers, each containing 32 hidden units with sigmoid activations. Gradients are calculated using a Monte-Carlo method following Oosterhuis and de Rijke [85] (Chapter 7). All policies apply a softmax to the document scores produced by the ranking models to obtain a probability distribution over documents. Clipping is only applied on the training clicks: the denominators of every estimator are clipped by $1/\sqrt{|\mathcal{D}|}$ to reduce variance. Early stopping is applied based on counterfactual estimates of the loss using (unclipped) validation clicks.

The following methods are compared: (i) the intervention-aware estimator; (ii) the intervention-oblivious estimator; (iii) the policy-aware estimator [86] (Chapter 5); (iv) the affine estimator [123]; (v) PDGD [82] (Chapter 3), which we apply both online and as a counterfactual method; as noted by Ai et al. [6], this can be done by separating the logging model from the learned model and basing the debiasing weights on the logging model; (vi) Biased PDGD, identical to PDGD except that we do not apply the debiasing weights; and (vii) COLTR [136].
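As a concrete illustration of the click simulation described above, a minimal sketch; the $\alpha$ and $\beta$ vectors below are illustrative assumptions (the experiments use values inferred by Agarwal et al. [3]), and the top-5 is taken deterministically here, whereas the actual policies are softmax-based:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative trust-bias vectors for the top-5 (not the experiments' exact values):
alpha = np.array([0.50, 0.40, 0.30, 0.20, 0.10])
beta  = np.array([0.30, 0.20, 0.10, 0.05, 0.02])

def simulate_interaction(scores, labels):
    """Display a top-5 and sample clicks following Eq. 8.4."""
    top5 = np.argsort(-scores)[:5]      # deterministic top-5 for simplicity
    p_rel = 0.25 * labels[top5]         # P(R=1|d,q) from the 0-4 graded labels
    clicks = rng.random(5) < alpha * p_rel + beta
    return top5, clicks

labels = rng.integers(0, 5, size=24)    # ~24 preselected documents per query
scores = rng.normal(size=24)            # document scores from the logging policy
print(simulate_interaction(scores, labels))
```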
We compute the Normalized DCG (NDCG) of both the logging policy and of a policy trained on all available data. Every reported result is the average of 20 independent runs; figures plot the mean, and shaded areas indicate the standard deviation.

Results and Discussion

To answer the first research question, whether the intervention-aware estimator leads to higher performance than existing counterfactual LTR estimators when online interventions take place, we consider Figure 8.2, which displays the performance of LTR using different counterfactual estimators.

First we consider the top of Figure 8.2, which displays performance in the counterfactual setting where the logging policy is static. We clearly see that the affine estimator converges at a suboptimal point, a strong indication of bias. The most probable cause is that the affine estimator is heavily affected by the presence of item-selection bias. In contrast, neither the policy-aware estimator nor the intervention-aware estimator has converged within the number of queries in our runs. (Since under a static logging policy the intervention-aware and the intervention-oblivious estimators are equivalent, our conclusions here apply to both.) However, very clearly, the intervention-aware estimator quickly reaches a higher performance. While the theory guarantees that it will converge at the optimal performance, we were unable to observe the number of queries it requires to do so. From the results in the counterfactual setting, we conclude that by correcting for position bias, trust bias, and item-selection bias, the intervention-aware estimator already performs better without online interventions.
Figure 8.2: Comparison of counterfactual LTR estimators (Full-Information, Affine, Intervention-Oblivious, Policy-Aware, and Intervention-Aware; NDCG against the number of logged queries). Top: counterfactual runs (no interventions); Bottom: online runs (50 interventions).

Second, we turn to the bottom of Figure 8.2, which considers the online setting where the estimators perform 50 online interventions during logging. We see that online interventions have a positive effect on all estimators, leading to a higher performance for the affine and policy-aware estimators as well. However, interventions also introduce an enormous amount of variance for the policy-aware and intervention-oblivious estimators. In stark contrast, the variance of the intervention-aware estimator hardly increases, while it learns much faster than the other estimators.

Thus we answer the first research question positively: the intervention-aware estimator leads to higher performance than existing estimators; moreover, its data-efficiency becomes even greater when online interventions take place.
To better understand how much the intervention-aware estimator benefits from online interventions, we compare its performance under varying numbers of interventions in Figure 8.3. It shows both the performance of the model trained on the logged data (top) and the performance of the logging policy, which reveals when interventions take place (bottom). When comparing both graphs, we see that interventions lead to noticeable immediate improvements in data-efficiency. For instance, when only 5 interventions take place, the intervention-aware estimator needs more than 20 times the amount of data to reach optimal performance compared to the run with 50 interventions.
Figure 8.3: Effect of online interventions on LTR with the intervention-aware estimator (counterfactual, and with 1, 5, 10, 25, and 50 interventions; Full-Information is shown as a reference).

Despite these speedups, there are no large increases in variance. From these observations, we conclude that the intervention-aware estimator can effectively and reliably utilize the effect of online interventions for optimization, leading to enormous increases in data-efficiency.
In order to answer the second research question, whether the intervention-aware estimator leads to performance comparable with existing online LTR methods, we consider Figure 8.4, which displays the performance of two online LTR methods, PDGD and COLTR, and of the intervention-aware estimator with 100 online interventions.

First, we notice that COLTR is unable to outperform its initial policy; moreover, we see its performance drop as the number of iterations increases. We were unable to find hyperparameters for COLTR where this did not occur. It seems likely that COLTR is unable to deal with trust bias, thus causing this poor performance. However, we note that Zhuang and Zuccon [136] already showed that COLTR performs poorly when no bias or noise is present, suggesting that it is perhaps an unstable method overall.

Second, we see that the difference between PDGD and the intervention-aware estimator becomes negligible after enough queries have been logged, despite PDGD running fully online and the intervention-aware estimator performing only 100 interventions in total. We do note that PDGD initially outperforms the intervention-aware estimator; thus it appears that PDGD works better with low numbers of interactions.
Figure 8.4: Comparison with online LTR methods (Full-Information, COLTR (online), PDGD (online), Biased-PDGD (online), and Intervention-Aware (100 interventions)).

Additionally, we should also consider the difference in overhead: while PDGD requires an infrastructure that allows for fully online learning, the intervention-aware estimator only requires 100 moments of intervention, yet has comparable performance after a short initial period. By comparing Figure 8.4 to Figure 8.2, we see that the intervention-aware estimator is the first counterfactual LTR estimator that leads to stable performance while being comparably efficient to online LTR methods.

Thus we answer the second research question positively: besides an initial period of lower performance, the intervention-aware estimator has performance comparable to online LTR, and it only requires 100 online interventions to do so. To the best of our knowledge, it is the first counterfactual LTR method that can achieve this feat.
Now that we have concluded that the intervention-aware estimator reaches performance comparable to PDGD when enough online interventions take place, the opposite question seems equally interesting:
Does PDGD applied counterfactually provide performance comparable to existing counterfactual LTR methods?
To answer this question, we ran PDGD in a counterfactual way following Ai et al. [6], both fully counterfactually and with only 100 interventions. The results of these runs are displayed in Figure 8.5. Quite surprisingly, PDGD run counterfactually or with 100 interventions reaches much higher performance than the intervention-aware estimator without interventions. However, after an initial peak, the performance of PDGD starts to drop when it is not run fully online.
Figure 8.5: Effect of online interventions on PDGD (PDGD run online, fully counterfactually, and with 100 interventions, compared with the intervention-aware estimator run counterfactually).

This drop cannot be attributed to overfitting, since online PDGD does not show the same behavior. Therefore, we must conclude that PDGD is biased when not run fully online. This conclusion does not contradict the existing theory, since in Chapter 3 we only proved that it is unbiased w.r.t. pairwise preferences. In other words, PDGD is not proven to unbiasedly optimize a ranking metric, and therefore it is also not proven to converge on the optimal model. This drop is particularly unsettling because PDGD is a continuous learning algorithm: there is no known early-stopping method for PDGD. Yet these results show there is a great risk in running PDGD for too many iterations if it is not applied fully online. To answer our PDGD question: although PDGD reaches high performance when run counterfactually and appears to have great data-efficiency initially, it appears to converge at a suboptimal, biased model. Thus we cannot conclude that PDGD is a reliable method for counterfactual LTR.

To better understand PDGD, we removed its debiasing weights, resulting in the performance shown in Figure 8.4 (Biased-PDGD). Clearly, PDGD needs these weights to reach optimal performance. Similarly, from Figure 8.5 we see it also needs to be run fully online. This makes the choice between the intervention-aware estimator and PDGD complicated: on the one hand, PDGD does not require us to know the $\alpha$ and $\beta$ parameters, unlike the intervention-aware estimator; furthermore, PDGD has better initial data-efficiency even when not run fully online. On the other hand, there are no theoretical guarantees for the convergence of PDGD, and we have observed that not running it fully online can lead to large drops in performance. It seems the choice ultimately depends on what guarantees a practitioner prefers.

Conclusion

In this chapter, we have introduced the intervention-aware estimator: an extension of existing counterfactual approaches that corrects for position bias, trust bias, and item-selection bias, while also considering the effect of online interventions. Our results show that the intervention-aware estimator outperforms existing counterfactual LTR estimators and greatly benefits from online interventions in terms of data-efficiency. With only 100 interventions it is able to reach a performance comparable to state-of-the-art online LTR methods. These findings allow us to answer the thesis research question
RQ9: whether the counterfactual LTR approach can be extended to perform highly effective online LTR. From our experimental results, it appears that the answer is positive: using the intervention-aware estimator and 100 online interventions, the performance of state-of-the-art online LTR methods can be matched.

With the introduction of the intervention-aware estimator, we hope to further unify the fields of online LTR and counterfactual LTR, as it appears to be the most reliable method for both settings. Future work could investigate what kinds of interventions work best for the intervention-aware estimator, since we have already seen in Chapter 7 that such an approach is effective for counterfactual/online ranking evaluation.

In retrospect, this chapter has put many findings from previous chapters in a different perspective. Chapter 3 introduced the concept of unbiasedness w.r.t. pairwise preferences and proved that PDGD has this property. The experimental results of this chapter have shown that unbiasedness w.r.t. pairwise preferences is not enough to guarantee convergence at an optimal level of NDCG. Furthermore, Chapter 4 showed PDGD is very robust to noise and bias, but with the results of this chapter we now know that PDGD needs to be run online for this robustness. The policy-aware estimator of Chapter 5 is a precursor to the intervention-aware estimator of this chapter. While Chapter 5 realized that taking the logging policy into account is beneficial to counterfactual estimation, this chapter showed that taking the idea further, by accounting for all logging policies, provides even more benefits. Lastly, Chapter 7 looked at bridging the divide between online and counterfactual evaluation; in retrospect, the results of Chapter 7 might have been even better had it used the intervention-aware estimator. Together, Chapter 7 and this chapter suggest that an online method should both optimize its logging policy and use an intervention-aware estimator to learn, leaving a potentially very fruitful direction for future work.
Notation Reference for Chapter 8

Notation – Description
$k$ – the number of items that can be displayed in a single ranking
$t$ – a timestep number
$T$ – the total number of timesteps (gathered so far)
$\mathcal{D}$ – the available data
$R(\pi)$ – the metric reward of a policy $\pi$
$\hat{R}(\pi \mid \mathcal{D})$ – an estimate of the metric reward of a policy $\pi$
$q$ – a user-issued query
$D_q$ – the set of items to be ranked for query $q$
$d$ – an item to be ranked
$y$ – a ranked list
$\pi$ – a ranking policy
$\pi(y \mid q)$ – the probability that policy $\pi$ displays ranking $y$ for query $q$
$\pi(y_x \mid y^{x-1}, q)$ – the probability of $\pi$ adding item $y_x$ given $y^{x-1}$ is already placed
$\Pi_T$ – the set of logging policies deployed up to timestep $T$
$\lambda(d \mid D_q, \pi, q)$ – a metric function that weights items depending on their rank
$c(d)$ – a function indicating item $d$ was clicked in click pattern $c$
$o(d)$ – a function indicating item $d$ was observed

Conclusions
In Section 1.1 we stated the overarching question that we aim to answer in this thesis:
Could there be a single general theoretically-grounded approach that has competitive performance for both evaluation and Learning to Rank (LTR) from user clicks on rankings, in both the counterfactual and online settings?
The thesis has explored this question by looking at both the online and the counterfactual families of LTR methods and, in particular, by examining whether one of these approaches can be extended to be effective in both the online and counterfactual LTR scenarios. In this final chapter, we summarize the findings of the thesis and discuss how they reflect on our overarching thesis question. Finally, we consider future research directions for the field of LTR from user clicks.
Main Findings

This section looks back at the thesis research questions posed in Section 1.1. We divide our discussion into two parts, discussing online methods and counterfactual methods for LTR and evaluation, respectively.
The first part of the thesis focused on online LTR methods. Chapter 2 looked at multileaving methods [108] for comparing multiple ranking systems at once and asked:
RQ1
Does the effectiveness of online ranking evaluation methods scale to large comparisons?

We introduced the novel Pairwise Preference Multileaving (PPM) algorithm; PPM bases evaluation on inferred pairwise item preferences. Furthermore, PPM is proven to have fidelity – it is provably unbiased in unambiguous cases [44] – and considerateness – it is safe w.r.t. the user experience during the gathering of clicks. From our theoretical analysis, we find that no other existing multileaving method manages to meet both criteria. In addition, our empirical results indicate that using PPM leads to a much lower number of errors, in particular when applied to large-scale comparisons. Therefore, we answered RQ1 positively: PPM is shown to be effective at online ranking evaluation for large-scale comparisons.

Besides Chapter 2, online evaluation was also the subject of Chapter 7, which addressed the question:
RQ8
Are existing interleaving methods truly capable of unbiased evaluation w.r.t. position bias?

We showed that, under a basic rank-based model of position bias (common in counterfactual LTR [4, 58, 128]), three of the most prevalent interleaving algorithms are not unbiased: Team Draft Interleaving [99], Probabilistic Interleaving [41], and Optimized Interleaving [96]. For each of these three methods, we showed that situations exist where the binary outcome of the method does not agree with the expected binary difference in Click-Through Rate (CTR). In other words, under a basic assumption of position bias, situations exist where these interleaving methods are expected to prefer one system over another, while the latter system has a higher expected CTR than the former. Thus, we answer RQ8 negatively: the most prevalent interleaving methods are not unbiased w.r.t. position bias.

This finding can be extended to the multileaving methods Team-Draft Multileaving [108], Probabilistic Multileaving [109], and Optimized Multileaving [108], since they are equivalent to their interleaving counterparts when only two systems are compared. While we did not examine it in this thesis, it is likely that PPM also fails to be unbiased under basic position bias. Nonetheless, an evaluation method can still be effective despite being biased, for instance, if the systematic error is small or if situations where bias occurs are rare.

Chapter 3 looked at online LTR methods. Existing online LTR methods have relied on sampling model variants and comparing them using online evaluation [132]. In response to this existing online LTR approach, Chapter 3 considered the question:
RQ2
Is online LTR possible without relying on model-sampling and online evaluation?

We answered this question positively by introducing Pairwise Differentiable Gradient Descent (PDGD), an online LTR method that learns from inferred pairwise preferences and uses a debiased pairwise loss. Besides proving that PDGD is unbiased w.r.t. pairwise preferences, our experimental results show that PDGD greatly outperforms the previous state-of-the-art Dueling Bandit Gradient Descent (DBGD) [132] algorithm in terms of data-efficiency and convergence. Furthermore, PDGD is the first online LTR method that can effectively optimize neural networks as ranking models.

Chapter 8 took another look at PDGD, in particular at conditions under which PDGD is no longer effective. The results in Chapter 8 show that PDGD fails to reach optimal performance without debiasing weights or when not applied fully online. A particularly worrisome observation was that, when not applied fully online, the performance of PDGD can degrade as more interactions are gathered. While this behavior looks similar to overfitting, it is not, since PDGD does not display it when applied online. Instead, it appears that PDGD becomes severely biased when not applied fully online. Therefore, we can conclude that the fact that PDGD is unbiased w.r.t. pairwise preferences is not enough to guarantee unbiased optimization. It appears that we do not fully understand why PDGD is so effective when run online.

The results of Chapter 3 had surprising implications for DBGD; for instance, it appeared that DBGD was not able to reach the performance of PDGD at convergence. Meanwhile, DBGD forms the basis of most existing online LTR methods. This prompted us to further investigate DBGD in Chapter 4, where we asked:
RQ3
Are DBGD LTR methods reliable in terms of theoretical soundness and empirical performance?

By critically examining the theoretical assumptions underlying the DBGD method, we found that these assumptions cannot hold when optimizing a deterministic ranking model. This means that the existing theoretical guarantees of DBGD are unsound for a lot of previous work where such models were used [40, 43, 82, 90, 111, 125, 126, 132, 135]. Moreover, our empirical analysis revealed that ideal circumstances exist where DBGD is still unable to find the optimal model. In other words, even in scenarios where optimization should be very easy, DBGD was unable to get near optimal performance. These findings lead us to answer RQ3 negatively: our empirical results show that DBGD is very unreliable, and its theoretical guarantees do not cover the most common LTR ranking models.

The second part of the thesis considered counterfactual LTR methods for optimization and evaluation. In particular, we tried to widen the applicability of counterfactual LTR methods and their effectiveness as online methods.

First, Chapter 5 recognized that the original Inverse Propensity Scoring (IPS) counterfactual method [58] is not unbiased when item-selection bias occurs. This bias occurs when not all items can be displayed in a single ranking, which is unavoidable in top-$k$ ranking settings where only $k$ items can be displayed. One of the questions Chapter 5 addressed is:

RQ4
Can counterfactual LTR be extended to top-$k$ ranking settings?

We showed that one can correct for item-selection bias by basing propensity weights on both the position bias of the user and the stochastic ranking behavior of the logging policy. Our novel policy-aware estimator uses this idea to extend the original IPS approach by taking into account the logging policy behavior. We prove that, assuming rank-based position bias, the policy-aware estimator is unbiased as long as the logging policy gives every relevant item a non-zero probability of appearing in the top-$k$ of a ranking. Furthermore, in our experimental results the policy-aware estimator approximates optimal performance regardless of the amount of item-selection bias present. Therefore, we answer RQ4 positively: with the introduction of the policy-aware estimator, the applicability of counterfactual LTR has been extended to top-$k$ ranking settings.

Besides learning from top-$k$ feedback, Chapter 5 also considered optimizing for top-$k$ metrics. Interestingly, the existing counterfactual LTR methods [2, 46] for optimizing Discounted Cumulative Gain (DCG) metrics are very dissimilar from the state-of-the-art in supervised LTR [13, 129]. To address this dissimilarity, Chapter 5 posed the following question:
RQ5
Is it possible to apply state-of-the-art supervised LTR methods to the counterfactual LTR problem?

We answer this question positively by showing that, with some small adjustments, the LambdaLoss framework [129] can be applied to counterfactual LTR losses, thus enabling the application of state-of-the-art supervised LTR to counterfactual LTR. The implication of this finding is that there does not need to be a division between state-of-the-art supervised LTR and counterfactual LTR. In other words, counterfactual LTR methods can build on the best methods from the supervised LTR field.

Chapter 6 takes a look at tabular and feature-based LTR methods. Tabular methods optimize a tabular ranking model [67-70, 139], which remembers the optimal ranking, in contrast with feature-based methods, which optimize models that use the features of items to predict the optimal ranking. Tabular models are extremely expressive and can capture any possible ranking, making them always capable of converging on the optimal ranking [138]. However, their learned behavior does not generalize to previously unseen circumstances. Conversely, the learned behavior of feature-based models can generalize very well to previously unseen circumstances [10, 75]. But feature-based models can also be limited by the available features, because often the available features do not provide enough information to predict the optimal ranking. Thus feature-based LTR generalizes very well to unseen circumstances, whereas tabular LTR can specialize extremely well in specific circumstances. Inspired by this tradeoff, we asked the following question in Chapter 6:
Chapter 6 looks at tabular and feature-based LTR methods. Tabular methods optimize a tabular ranking model [67–70, 139], which memorizes the optimal ranking, in contrast with feature-based methods, which optimize models that use the features of items to predict the optimal ranking. Tabular models are extremely expressive and can capture any possible ranking, making them always capable of converging on the optimal ranking [138]. However, their learned behavior does not generalize to previously unseen circumstances. Conversely, the learned behavior of feature-based models can generalize very well to previously unseen circumstances [10, 75], but feature-based models can also be limited by the available features, because often the available features do not provide enough information to predict the optimal ranking. Thus feature-based LTR generalizes very well to unseen circumstances, whereas tabular LTR can specialize extremely well in specific circumstances. Inspired by this tradeoff, we asked the following question in Chapter 6:

RQ6 Can the specialization ability of tabular online LTR be combined with the robust feature-based approach of counterfactual LTR?

Our answer comes in the form of the novel Generalization and Specialization (GENSPEC) algorithm, a method for combining the behavior of a single robust generalized model and numerous specialized models. GENSPEC optimizes a single feature-based ranking model for performance across all queries, and many tabular ranking models, each specialized for a single query. GENSPEC then applies a meta-policy that uses high-confidence bounds to safely decide per query which model to deploy. Consequently, for previously unseen queries, GENSPEC chooses the generalized model, which utilizes robust feature-based prediction. For other queries, it can decide to deploy a specialized model, i.e., if it has enough data to confidently determine that the tabular model has found the better ranking. Our experimental results show that GENSPEC successfully combines robust performance on unseen queries with extremely high performance at convergence. Accordingly, we answer RQ6 positively: using GENSPEC, we can combine the specialization properties of tabular LTR with the robust generalization of feature-based LTR. For the LTR field, the introduction of GENSPEC shows that specialization does not need to be unique to tabular online LTR; instead, it can be a property of counterfactual LTR as well. A sketch of the meta-policy's decision rule follows below.
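To illustrate, the following is a minimal sketch of such a high-confidence decision rule; all names are hypothetical, and the actual bounds in Chapter 6 are derived from counterfactual estimates of ranking performance:

def select_ranker(n_query_interactions, lower_bound_spec, upper_bound_gen):
    """Minimal sketch of a GENSPEC-style meta-policy decision rule.

    lower_bound_spec: high-confidence lower bound on the estimated ranking
                      performance of the query-specific tabular model
    upper_bound_gen:  high-confidence upper bound on the estimated ranking
                      performance of the generalized feature-based model
    """
    if n_query_interactions == 0:
        # Unseen query: no specialized model exists yet, so rely on the
        # feature-based model, which generalizes from other queries.
        return "generalized"
    if lower_bound_spec > upper_bound_gen:
        # Even in the worst case allowed by the confidence bounds, the
        # specialized model outperforms the generalized one, so deploying
        # it is safe.
        return "specialized"
    # Otherwise the comparison is still uncertain; default to the robust model.
    return "generalized"

The asymmetry of the rule is what makes deployment safe: the specialized model is only chosen when its advantage is certain, so the worst case is the performance of the generalized model.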
As discussed above, Chapter 7 proved that several prominent interleaving methods are biased w.r.t. a basic model of position bias. Nonetheless, empirical results suggest that these online ranking evaluation methods are still very effective. This leaves a gap for a theoretically-grounded online ranking evaluation method that is also very effective. To address this gap, Chapter 7 considers counterfactual ranking evaluation, which has strong theoretical guarantees, and asks:

RQ7 Can counterfactual evaluation methods for ranking be extended to perform efficient and effective online evaluation?

We realized that, with the introduction of the policy-aware estimator in Chapter 5, the logging policy has an important role in counterfactual estimation. Using the policy-aware estimator as a starting point, we introduce the Logging-Policy Optimization Algorithm (LogOpt), which optimizes the logging policy to minimize the variance of the policy-aware estimator. LogOpt can be deployed during the gathering of data, periodically or fully online, and thus changes the logging behavior through an intervention. As such, it turns the counterfactual evaluation approach with the policy-aware estimator into an online approach. Our experimental results show that applying LogOpt increases the data-efficiency of counterfactual evaluation with the policy-aware estimator. The performance with LogOpt is comparable to A/B testing and interleaving but, in contrast with interleaving, the policy-aware estimator applied with LogOpt does not have a systematic error. Therefore, we answer RQ7 positively: by optimizing the logging policy with LogOpt, counterfactual evaluation can perform effective and data-efficient online evaluation. The optimization objective is sketched below.
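In simplified notation assumed here (not quoted from Chapter 7), let $\hat{\Delta}(\pi)$ be the policy-aware estimate of the metric difference between two rankers when clicks are logged by policy $\pi$. LogOpt then repeatedly solves

\[
\pi^{*} \,=\, \operatorname*{arg\,min}_{\pi} \; \operatorname{Var}\!\big[\hat{\Delta}(\pi)\big],
\]

using the clicks gathered so far to estimate this variance, and continues logging with $\pi^{*}$. Since the policy-aware estimator is unbiased for any logging policy that satisfies its support condition, changing the logging policy only affects the efficiency of the evaluation, not its lack of systematic error.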
Inspired by how Chapter 7 bridges part of the gap between online and counterfactual ranking evaluation, Chapter 8 addressed our final question:

RQ9 Can the counterfactual LTR approach be extended to perform highly effective online LTR?

The motivation is similar to that of the previous chapter: we would like to find a theoretically-grounded method that is effective at both counterfactual LTR and online LTR. Since counterfactual LTR has strong theoretical guarantees, we used it as a starting point. We then introduced the novel intervention-aware estimator, which does not assume a stationary logging policy. As a result, the estimator takes into account the fact that an online intervention may change the logging policy during the gathering of data. Thus, when applied online, the intervention-aware estimator does not only consider the logging policy used when a click was logged, but also all the other logging policies applied at all other timesteps. In addition, the intervention-aware estimator combines the theoretical properties of recent counterfactual LTR estimators: it is the first estimator that can correct for position bias, item-selection bias, and trust bias alike. Our experimental results show that the intervention-aware estimator results in much lower variance than an equivalent estimator that ignores the effect of interventions. Furthermore, in our experimental setting, it outperformed all existing counterfactual estimators, with especially large differences when online interventions take place. Importantly, we observed that the intervention-aware estimator matches the performance of PDGD with only a small number of interventions during learning. Besides a small initial period, LTR with the intervention-aware estimator was able to reach the performance of the most effective online LTR methods. Therefore, we answer RQ9 positively: the intervention-aware estimator extends the counterfactual LTR approach to perform highly effective online LTR. A sketch of how the estimator accounts for interventions follows below.
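The core idea can be sketched as follows, again in simplified assumed notation: if policy $\pi_t$ was deployed at timestep $t$ out of $T$ total timesteps, the intervention-aware propensity of an item $d$ averages the examination probabilities over all deployed policies,

\[
\rho(d) \,=\, \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{R \sim \pi_t}\big[\theta_{\operatorname{rank}(d \mid R)}\big],
\]

instead of using only the single policy that was active when a click was logged. Averaging over every deployed policy is what keeps the estimator unbiased when online interventions change the logging policy during data gathering; the full estimator in Chapter 8 additionally applies affine corrections to account for trust bias.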
For the LTR field, this demonstrates that methods do not have to be either part of counterfactual LTR or part of online LTR; by designing them for both applications at once, they can be highly effective in both scenarios.

Finally, we note the complementary nature of the findings in the second part of the thesis. Many of the contributions of earlier chapters were used in later chapters. For instance, the methods introduced in Chapter 6 and Chapter 7 made use of the policy-aware estimator proposed in Chapter 5, and Chapter 8 built on the policy-aware estimator to introduce the intervention-aware estimator. Similarly, the adaptation of LambdaLoss for counterfactual LTR introduced in Chapter 5 was applied in the experiments of Chapter 6 and Chapter 7. While not explored in the thesis, many of the later contributions can also be applied to methods from earlier chapters. For instance, the intervention-aware estimator from Chapter 8 is completely compatible with the LambdaLoss adaptation from Chapter 5 and with GENSPEC from Chapter 6. In particular, it could be applied in combination with LogOpt from Chapter 7, potentially leading to even more effective online ranking evaluation. Together, the contributions of the second part can be combined into a single framework for counterfactual LTR and ranking evaluation, in which our contributions complement each other. Importantly, this framework bridges several gaps between supervised LTR, online LTR, and counterfactual LTR.
Summary of Findings

The overarching question this thesis aimed to answer considered whether there could be a single, general, theoretically-grounded approach that has competitive performance for both evaluation and LTR from user clicks on rankings, in both the counterfactual and online settings.
We have looked at the family of online methods for LTR [43, 126, 132] and ranking evaluation [44, 56, 96, 108], which traditionally avoid making strong assumptions about user behavior, such as the assumption that a model of position bias is known [128]. While this makes their theory widely applicable, the theoretical guarantees of these methods are relatively weak. For instance, some interleaving and multileaving methods are proven to converge on correct outcomes if clicks are uncorrelated with relevance and thus every ranker performs equally well [41, 96]. Though such guarantees are valuable, they only cover a small group of unambiguous situations and thus leave most situations without theoretical guarantees. Online LTR methods are often motivated by empirical results from semi-synthetic experiments, where they are tested in settings with varying levels of noise and bias [42, 80, 111, 125]. The fundamental question with this type of empirical motivation is how well the results generalize; in particular, whether a method is still effective if the experimental conditions change slightly. This thesis has presented four examples of online methods that showed surprisingly poor performance when tested in new conditions: (i) on several datasets, DBGD [132] did not get close to optimal performance after a very large number of issued queries, even while learning from clicks without noise or position bias (Chapter 4); (ii) Team Draft Interleaving [99], Probabilistic Interleaving [41], and Optimized Interleaving [96] make systematic errors in some ranking comparisons when tested under rank-based position bias (Chapter 7); (iii) the performance of the COLTR algorithm [136] dropped severely when tested under position bias, item-selection bias, and trust bias (Chapter 8); and (iv) PDGD no longer converged to near-optimal performance when we ran it counterfactually or with only a limited number of online interventions, and instead showed a large drop in performance (Chapter 8). While these online LTR and evaluation methods have also shown great performance in previous work [41, 50, 96, 99, 111, 132, 136], these problematic examples illustrate why we cannot conclude that these online LTR methods are reliable. For instance, the performance of a method like PDGD was thought to be very robust to noise and bias [50] (Chapters 3 and 4), until it was tested without constant online interventions (Chapter 8). Without strong theoretical guarantees, we cannot know whether there are more currently-unknown conditions required for the robust performance of PDGD. In general, it is unclear how robust online LTR methods are in practice; this thesis has shown that there is a potential risk of detrimental performance if real-world circumstances do not match the tested experimental settings. Therefore, we conclude that online LTR methods should not be used as the basis for a single general approach to LTR and ranking evaluation from user clicks.

In the second part of the thesis, we considered the family of counterfactual methods for LTR and ranking evaluation [58, 127], which consists of theoretically-grounded methods that rely on explicit assumptions about user behavior. In contrast with the online family, counterfactual methods are less widely applicable: they only provide guarantees when the assumed models of user behavior hold. For instance, the original counterfactual LTR method assumes clicks are only affected by relevance and rank-based position bias [58, 127]; a sketch of this assumption is given below. Despite their limited applicability, counterfactual methods have very strong theoretical guarantees.
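Concretely, this assumption, often called the examination hypothesis, can be written as follows in simplified notation assumed here: a click on item $d$ displayed at rank $r$ requires the user to both examine the result and find it relevant,

\[
P(C = 1 \mid d, r) \,=\, P(E = 1 \mid r) \cdot P(C = 1 \mid E = 1, d) \,=\, \theta_r \cdot \gamma_d,
\]

where the examination probability $\theta_r$ depends only on the displayed rank and $\gamma_d$ only on the item. The corrections in the second part of this thesis start from decompositions of this kind; for instance, the trust-bias corrections of Chapter 8 extend it with rank-dependent affine terms.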
In contrast to most online LTR methods, counterfactual LTR methods guarantee convergence at the same performance as supervised LTR, given that their assumptions about user behavior hold. The findings of this thesis indicate that the strong guarantees with limited applicability of counterfactual LTR are preferable over the weak guarantees with wide applicability of online LTR. This is mainly because widening the applicability of counterfactual LTR proved very doable. In this thesis, we have expanded the applicability of counterfactual LTR and evaluation to (i) top-k settings with item-selection bias (Chapter 5), and (ii) ranking settings where both trust bias and item-selection bias occur (Chapter 8). Besides expanding the settings in which counterfactual LTR methods can be applied, we expanded the methods that perform counterfactual LTR, including: (iii) the state-of-the-art LambdaLoss supervised LTR framework [129] (Chapter 5); (iv) tabular models for extremely specialized rankings (Chapter 6); and (v) a meta-policy that safely chooses between generalized feature-based models and specialized tabular models (Chapter 6). Moreover, this thesis also introduced novel algorithms that increase the effectiveness of counterfactual LTR methods for (vi) online ranking evaluation (Chapter 7), and (vii) online LTR (Chapter 8), even with a limited number of online interventions. Together, these contributions have widened the applicability of counterfactual LTR while maintaining its strong theoretical guarantees. As a direct result of this thesis, counterfactual LTR is applicable to more settings, more LTR methods can be applied to the counterfactual LTR problem, and counterfactual LTR methods are more effective in both the counterfactual and online LTR scenarios.

In conclusion, based on the findings of this thesis, it appears that counterfactual LTR could form the basis of a general approach for LTR from user clicks. In our experimental results, counterfactual LTR provided performance competitive with online LTR methods in both the counterfactual and online settings. While the theory of counterfactual LTR does rely on stronger assumptions regarding user behavior than existing online LTR methods, counterfactual LTR provides far stronger theoretical guarantees. In contrast, it is currently unclear under what conditions online LTR methods are effective, making their performance very unpredictable. Therefore, we answer our overarching thesis question positively: the counterfactual LTR framework proposed in this thesis provides a unified approach for effective and reliable LTR from user clicks. For the LTR field, the counterfactual LTR framework bridges many gaps between the areas of online LTR, counterfactual LTR, and supervised LTR, and as such, it unifies many of the most effective methods for LTR from user clicks. We will conclude the thesis with promising research directions for future work.

Future Work

The most obvious direction is to widen the applicability of the counterfactual LTR framework. This means introducing estimators that are unbiased under other assumptions about user behavior. Joachims et al. [58] mentioned that the original counterfactual method is unbiased as long as click probabilities decompose into observation and relevance probabilities. For example, Vardasbi et al. [122] looked at the performance of counterfactual LTR when assuming cascading user behavior, an alternative to rank-based position bias. Additionally, Fang et al. [30] looked at context-dependent position bias, where the degree of bias varies per query.
It seems natural to continue this trend to more complex models of user behavior. The challenge for future work is two-fold: find LTR methods that are proven to be unbiased under more complex user behavior models, and introduce methods that can reliably find the parameters of these behavior models.

Besides learning from more complex user behavior, there is a big need for LTR based on user clicks that optimizes for more complex goals. Some existing work has already looked at complex goals: for instance, Radlinski et al. [98] introduced a bandit algorithm for tabular LTR that optimizes for both relevance and diversity within a ranking, thus using user clicks to find a ranking that contains relevant items as well as variety among the items within the ranking. Another example comes from Morik et al. [79], who use counterfactual LTR to optimize for relevance and ranking fairness. Ranking fairness metrics are based on the amount of exposure different items receive; for example, some fairness metrics measure whether certain groups of items receive similar amounts of exposure. Other areas of LTR also optimize for computational efficiency, to ensure that ranking systems can process queries in minimal amounts of time [31]. Future work could investigate whether counterfactual LTR can be used for complex goals like these, and for combinations of them.

Surprisingly, the experimental results in this thesis showed that PDGD is no longer effective when not applied fully online, and similarly, we observed very poor performance for the COLTR algorithm [136]. However, we could not find theoretically proven conditions that guarantee that PDGD or COLTR is or is not effective. It appears that we lack a theoretical approach for understanding the limits of online LTR methods. If such an approach could be found, we may be able to correct for the faults in some online LTR methods, or understand when they can be applied reliably. Thus it may be very valuable if future work reconsidered the theory behind existing online LTR methods.

Finally, most of the existing work on LTR from user interactions only considers user clicks. Existing work has already looked at additional signals that are useful for learning [63, 110]. Novel methods that learn from other interactions in addition to user clicks have the potential to better understand user preferences. However, the main challenge for this direction of research may be the availability of such data. Perhaps this direction of research mostly needs a publicly available source of data, and methods to share such data in a privacy-respecting way.

Overall, our main advice for future work is to focus on methods that forge connections between advances in the larger field of LTR; that is, methods that combine the best of different areas, as our proposed framework does for online LTR, counterfactual LTR, and supervised LTR.

Bibliography
[1] E. Adar, J. Teevan, S. T. Dumais, and J. L. Elsas. The web changes everything: Understanding the dynamics of web content. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 282–291, 2009. (Cited on pages 2, 38, 39, and 61.)
[2] A. Agarwal, K. Takatsu, I. Zaitsev, and T. Joachims. A general framework for counterfactual learning-to-rank. In Proceedings of the 42nd International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 5–14. ACM, 2019. (Cited on pages 3, 6, 78, 86, 92, 105, 109, 110, 156, and 173.)
[3] A. Agarwal, X. Wang, C. Li, M. Bendersky, and M. Najork. Addressing trust bias for unbiased learning-to-rank. In The World Wide Web Conference, pages 4–14. ACM, 2019. (Cited on pages 3, 81, 88, 91, 94, 152, 153, 156, and 163.)
[4] A. Agarwal, I. Zaitsev, X. Wang, C. Li, M. Najork, and T. Joachims. Estimating position bias without intrusive interventions. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 474–482. ACM, 2019. (Cited on pages 3, 78, 94, 131, 134, 156, and 172.)
[5] Q. Ai, K. Bi, C. Luo, J. Guo, and W. B. Croft. Unbiased learning to rank with unbiased propensity estimation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 385–394. ACM, 2018. (Cited on pages 62, 78, 81, 88, 89, 90, 91, 114, 131, and 156.)
[6] Q. Ai, T. Yang, H. Wang, and J. Mao. Unbiased learning to rank: Online or offline? arXiv preprint arXiv:2004.13574, 2020. (Cited on pages 152, 157, 164, and 167.)
[7] R. Albert, H. Jeong, and A.-L. Barabási. Diameter of the world-wide web. Nature, 401(6749):130–131, 1999. (Cited on page 1.)
[8] J. Allan, B. Carterette, J. A. Aslam, V. Pavlu, B. Dachev, and E. Kanoulas. Million query track 2007 overview. In TREC. NIST, 2007. (Cited on pages 29 and 48.)
[9] W.-T. Balke, U. Güntzer, and W. Kießling. On real-time top k querying for mobile services. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pages 125–143. Springer, 2002. (Cited on page 78.)
[10] C. M. Bishop. Pattern Recognition and Machine Learning, chapter 1.3. Springer, 2006. (Cited on pages 102 and 174.)
[11] A. Borisov, I. Markov, M. de Rijke, and P. Serdyukov. A neural click model for web search. In WWW, pages 531–541. International World Wide Web Conferences Steering Committee, 2016. (Cited on page 62.)
[12] B. Brost, I. J. Cox, Y. Seldin, and C. Lioma. An improved multileaving algorithm for online ranker evaluation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 745–748, 2016. (Cited on pages 4, 15, 16, 22, 23, 28, and 29.)
[13] C. J. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Technical report, Microsoft Research, 2010. (Cited on pages 1, 6, 41, 52, 79, 86, 87, 104, and 174.)
[14] F. Cai and M. de Rijke. A survey of query auto completion in information retrieval. Foundations and Trends in Information Retrieval, 10(4):273–363, 2016. (Cited on page 78.)
[15] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136, 2007. (Cited on page 42.)
[16] B. Carterette and P. Chandar. Offline comparative evaluation with incremental, minimally-invasive online feedback. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 705–714. ACM, 2018. (Cited on pages 3, 7, 78, 86, 89, and 94.)
[17] O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge overview. Journal of Machine Learning Research, 14:1–24, 2011. (Cited on pages 1, 2, 29, 38, 39, 48, 60, 61, 67, 89, 104, 109, 136, 152, and 163.)
[18] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS), 30(1):1–41, 2012. (Cited on pages 1, 2, 17, 29, 126, and 141.)
[19] S. Chelaru, C. Orellana-Rodriguez, and I. S. Altingovde. How useful is social feedback for learning to rank YouTube videos? World Wide Web, 17(5):997–1025, 2014. (Cited on pages 1 and 38.)
[20] A. Chuklin, I. Markov, and M. de Rijke. Click Models for Web Search. Morgan & Claypool Publishers, 2015. (Cited on pages 2, 30, 40, 48, 60, 62, 68, and 127.)
[21] A. Chuklin, A. Schuth, K. Zhou, and M. de Rijke. A comparative analysis of interleaving methods for aggregated search. ACM Transactions on Information Systems (TOIS), 33(2):1–38, 2015. (Cited on page 29.)
[22] G. Claeskens and N. L. Hjort. Model Selection and Model Averaging. Cambridge University Press, 2008. (Cited on page 102.)
[23] C. L. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2009 Web Track. In TREC. NIST, 2009. (Cited on page 29.)
[24] N. Craswell, D. Hawking, R. Wilkinson, and M. Wu. Overview of the TREC 2003 Web Track. In TREC. NIST, 2003. (Cited on page 29.)
[25] N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87–94. ACM, 2008. (Cited on pages 2, 104, 127, 152, and 153.)
[26] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 39–46. ACM, 2010. (Cited on page 78.)
[27] D. Dato, C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, N. Tonellotto, and R. Venturini. Fast ranking with additive ensembles of oblivious and non-oblivious regression trees. ACM Transactions on Information Systems (TOIS), 35(2):1–31, 2016. (Cited on pages 1, 29, 48, 67, 104, and 109.)
[28] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977. (Cited on page 87.)
[29] M. B. Dias, D. Locher, M. Li, W. El-Deredy, and P. J. Lisboa. The value of personalised recommender systems to e-business: A case study. In Proceedings of the 2008 ACM Conference on Recommender Systems, pages 291–294, 2008. (Cited on page 1.)
[30] Z. Fang, A. Agarwal, and T. Joachims. Intervention harvesting for context-dependent examination-bias estimation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 825–834, 2019. (Cited on pages 134, 156, and 178.)
[31] L. Gallagher, R.-C. Chen, R. Blanco, and J. S. Culpepper. Joint optimization of cascade ranking models. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 15–23, 2019. (Cited on page 178.)
[32] S. C. Geyik, Q. Guo, B. Hu, C. Ozcaglar, K. Thakkar, X. Wu, and K. Kenthapadi. Talent search and recommendation systems at LinkedIn: Practical challenges and lessons learned. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1353–1354, 2018. (Cited on page 1.)
[33] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010. (Cited on page 49.)
[34] C. A. Gomez-Uribe and N. Hunt. The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS), 6(4):1–19, 2015. (Cited on page 1.)
[35] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision, pages 241–257. Springer, 2016. (Cited on page 1.)
[36] F. Guo, C. Liu, and Y. M. Wang. Efficient multiple-click models in web search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 124–131, 2009. (Cited on pages 29, 30, 47, 48, and 68.)
[37] D. M. Hawkins. The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12, 2004. (Cited on page 102.)
[38] J. He, C. Zhai, and X. Li. Evaluation of methods for relative comparison of retrieval systems based on clickthroughs. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 2029–2032. ACM, 2009. (Cited on pages 1, 24, 48, and 67.)
[39] K. Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, University of Amsterdam, 2013. (Cited on pages 49 and 68.)
[40] K. Hofmann, S. Whiteson, and M. de Rijke. Balancing exploration and exploitation in learning to rank online. In European Conference on Information Retrieval, pages 251–263. Springer, 2011. (Cited on pages 40, 48, 49, 64, 67, 68, and 173.)
[41] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 249–258, 2011. (Cited on pages 4, 17, 21, 22, 28, 29, 30, 40, 60, 62, 130, 131, 136, 145, 157, 172, 176, and 177.)
[42] K. Hofmann, S. Whiteson, and M. de Rijke. Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval. Information Retrieval, 16(1):63–90, 2012. (Cited on pages 2, 5, 38, and 176.)
[43] K. Hofmann, A. Schuth, S. Whiteson, and M. de Rijke. Reusing historical interaction data for faster online learning to rank for IR. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 183–192. ACM, 2013. (Cited on pages 5, 40, 60, 62, 64, 94, 157, 163, 173, and 176.)
[44] K. Hofmann, S. Whiteson, and M. de Rijke. Fidelity, soundness, and efficiency of interleaved comparison methods. ACM Transactions on Information Systems (TOIS), 31(4):1–43, 2013. (Cited on pages 2, 4, 16, 19, 20, 21, 22, 24, 28, 29, 126, 131, 171, and 176.)
[45] K. Hofmann, L. Li, and F. Radlinski. Online evaluation for information retrieval. Foundations and Trends in Information Retrieval, 10(1):1–117, 2016. (Cited on pages 1, 15, 17, and 126.)
[46] Z. Hu, Y. Wang, Q. Peng, and H. Li. Unbiased LambdaMART: An unbiased pairwise learning-to-rank algorithm. In The World Wide Web Conference, pages 2830–2836. ACM, 2019. (Cited on pages 3, 6, 88, and 173.)
[47] J. Huang, H. Oosterhuis, M. de Rijke, and H. van Hoof. Keeping dataset biases out of the simulation: A debiased simulator for reinforcement learning based recommender systems. In Proceedings of the 2020 ACM Conference on Recommender Systems, 2020. (Cited on page 12.)
[48] N. Hurley and M. Zhang. Novelty and diversity in top-n recommendation – analysis and evaluation. ACM Transactions on Internet Technology (TOIT), 10(4):14, 2011. (Cited on page 78.)
[49] R. Jagerman, H. Oosterhuis, and M. de Rijke. Query-level ranker specialization. In CEUR Workshop Proceedings, volume 2007, 2017. (Cited on page 12.)
[50] R. Jagerman, H. Oosterhuis, and M. de Rijke. To model or to intervene: A comparison of counterfactual and online learning to rank from user interactions. In Proceedings of the 42nd International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 15–24. ACM, 2019. (Cited on pages 6, 7, 12, 89, 90, 94, 117, 152, 157, 159, and 177.)
[51] R. Jagerman, I. Markov, and M. de Rijke. Safe exploration for optimizing contextual bandits. ACM Transactions on Information Systems, 38(3):Article 24, 2020. (Cited on pages 106, 107, 108, 113, and 121.)
[52] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002. (Cited on pages 1 and 154.)
[53] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In ACM SIGIR Forum, volume 51, pages 243–250. ACM New York, NY, USA, 2017. (Cited on page 110.)
[54] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, 2002. (Cited on pages 17, 19, 24, 38, 42, 61, 65, 79, 86, and 156.)
[55] T. Joachims. Unbiased evaluation of retrieval quality using clickthrough data. In SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, volume 354, 2002. (Cited on pages 1 and 24.)
[56] T. Joachims. Evaluating retrieval performance using clickthrough data. In Text Mining. Physica Verlag, 2003. (Cited on pages 2, 4, 17, 126, 130, and 176.)
[57] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In SIGIR Forum, pages 154–161. ACM, 2005. (Cited on pages 78, 131, 132, 152, and 153.)
[58] T. Joachims, A. Swaminathan, and T. Schnabel. Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 781–789, 2017. (Cited on pages 1, 2, 3, 4, 5, 6, 60, 62, 68, 78, 79, 80, 81, 86, 88, 89, 90, 91, 92, 103, 104, 105, 109, 110, 114, 120, 126, 128, 136, 137, 152, 155, 156, 157, 162, 163, 164, 172, 173, 177, and 178.)
[59] A. Karatzoglou, L. Baltrunas, and Y. Shi. Learning to rank for recommender systems. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 493–494, 2013. (Cited on page 38.)
[60] S. K. Karmaker Santu, P. Sondhi, and C. Zhai. On application of learning to rank for e-commerce search. In SIGIR, pages 475–484. ACM, 2017. (Cited on pages 1 and 38.)
[61] S. Katariya, B. Kveton, C. Szepesvari, and Z. Wen. DCM bandits: Learning to rank with multiple clicks. In International Conference on Machine Learning, pages 1215–1224, 2016. (Cited on page 115.)
[62] A. Kazerouni, M. Ghavamzadeh, Y. A. Yadkori, and B. Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910–3919, 2017. (Cited on page 108.)
[63] E. Kharitonov, C. Macdonald, P. Serdyukov, and I. Ounis. Generalized team draft interleaving. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 773–782, 2015. (Cited on pages 18, 29, 33, and 179.)
[64] R. Kohavi and R. Longbotham. Online controlled experiments and A/B testing. Encyclopedia of Machine Learning and Data Mining, 7(8):922–929, 2017. (Cited on pages 126 and 129.)
[65] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140–181, 2009. (Cited on page 17.)
[66] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, and N. Pohlmann. Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1168–1176, 2013. (Cited on page 17.)
[67] J. Komiyama, J. Honda, and H. Nakagawa. Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 1152–1161. JMLR.org, 2015. (Cited on pages 3, 6, 94, and 174.)
[68] B. Kveton, C. Szepesvari, Z. Wen, and A. Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pages 767–776, 2015. (Cited on pages 3, 103, and 115.)
[69] P. Lagrée, C. Vernade, and O. Cappé. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems, pages 1597–1605, 2016. (Cited on pages 3, 94, 110, 115, and 116.)
[70] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. (Cited on pages 3, 6, 102, 103, and 174.)
[71] D. Lefortier, P. Serdyukov, and M. de Rijke. Online exploration for detecting shifts in fresh intent. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 589–598, 2014. (Cited on pages 2, 38, 39, and 61.)
[72] S. Li, Y. Abbasi-Yadkori, B. Kveton, S. Muthukrishnan, V. Vinay, and Z. Wen. Offline evaluation of ranking policies with click models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1685–1694. ACM, 2018. (Cited on pages 94 and 95.)
[73] Z. Li, A. Grotov, J. Kiseleva, M. de Rijke, and H. Oosterhuis. Optimizing interactive systems with data-driven objectives. arXiv preprint arXiv:1802.06306, page 11, 2018. (Cited on page 12.)
[74] Q. Liu, L. Li, Z. Tang, and D. Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366, 2018. (Cited on page 95.)
[75] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009. (Cited on pages 1, 4, 37, 61, 79, 86, 103, 104, 151, 154, and 174.)
[76] T.-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of the Workshop on Learning to Rank for Information Retrieval, 2007. (Cited on pages 1, 2, 29, 38, and 39.)
[77] A. Lucic, H. Oosterhuis, H. Haned, and M. de Rijke. Actionable interpretability through optimizable counterfactual explanations for tree ensembles. arXiv preprint arXiv:1911.12199, 2019. (Cited on page 12.)
[78] J. Ma, Z. Zhao, X. Yi, J. Yang, M. Chen, J. Tang, L. Hong, and E. H. Chi. Off-policy learning in two-stage recommender systems. In Proceedings of The Web Conference 2020, pages 463–473, 2020. (Cited on page 134.)
[79] M. Morik, A. Singh, J. Hong, and T. Joachims. Controlling fairness and bias in dynamic learning-to-rank. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 429–438, 2020. (Cited on page 178.)
[80] H. Oosterhuis and M. de Rijke. Balancing speed and quality in online learning to rank for information retrieval. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 277–286, 2017. (Cited on pages 12, 38, 41, 44, 49, 50, 51, 63, and 176.)
[81] H. Oosterhuis and M. de Rijke. Sensitive and scalable online evaluation with theoretical guarantees. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 77–86, 2017. (Cited on pages 11, 15, 40, 42, and 62.)
[82] H. Oosterhuis and M. de Rijke. Differentiable unbiased online learning to rank. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1293–1302. ACM, 2018. (Cited on pages 11, 37, 60, 63, 64, 65, 68, 94, 117, 157, 163, 164, and 173.)
[83] H. Oosterhuis and M. de Rijke. Ranking for relevance and display preferences in complex presentation layouts. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 845–854, 2018. (Cited on page 12.)
[84] H. Oosterhuis and M. de Rijke. Optimizing ranking models in an online setting. In Advances in Information Retrieval, pages 382–396, Cham, 2019. Springer International Publishing. (Cited on pages 11, 59, 60, 94, 109, 117, 157, and 163.)
[85] H. Oosterhuis and M. de Rijke. Taking the counterfactual online: Efficient and unbiased online evaluation for ranking. In Proceedings of the 2020 International Conference on the Theory of Information Retrieval. ACM, 2020. (Cited on pages 11, 125, and 164.)
[86] H. Oosterhuis and M. de Rijke. Policy-aware unbiased learning to rank for top-k rankings. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2020. (Cited on pages 11, 77, 105, 114, 128, 131, 136, 152, 153, 155, 156, 157, and 164.)
[87] H. Oosterhuis and M. de Rijke. Robust generalization and safe query-specialization in counterfactual learning to rank. In Submitted to The World Wide Web Conference. ACM, 2021. (Cited on pages 11 and 101.)
[88] H. Oosterhuis and M. de Rijke. Unifying online and counterfactual learning to rank. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM'21). ACM, 2021. (Cited on pages 11 and 151.)
[89] H. Oosterhuis, S. Ravi, and M. Bendersky. Semantic video trailers. arXiv preprint arXiv:1609.01819, 2016. (Cited on page 12.)
[90] H. Oosterhuis, A. Schuth, and M. de Rijke. Probabilistic multileave gradient descent. In European Conference on Information Retrieval, pages 661–668. Springer, 2016. (Cited on pages 5, 12, 22, 34, 40, 48, 49, 51, 62, 64, 67, 68, and 173.)
[91] H. Oosterhuis, J. S. Culpepper, and M. de Rijke. The potential of learned index structures for index compression. In Proceedings of the 23rd Australasian Document Computing Symposium, pages 1–4, 2018. (Cited on page 12.)
[92] Z. Ovaisi, R. Ahsan, Y. Zhang, K. Vasilaky, and E. Zheleva. Correcting for selection bias in learning-to-rank systems. arXiv preprint arXiv:2001.11358, 2020. (Cited on pages 3, 6, 128, 152, 153, and 163.)
[93] A. B. Owen. Monte Carlo Theory, Methods and Examples. 2013. (Cited on page 43.)
[94] E. Politou, E. Alepis, and C. Patsakis. Forgetting personal data and revoking consent under the GDPR: Challenges and proposed solutions. Journal of Cybersecurity, 4(1), 2018. (Cited on page 2.)
[95] T. Qin and T.-Y. Liu. Introducing LETOR 4.0 datasets. arXiv preprint arXiv:1306.2597, 2013. (Cited on pages 29, 39, 48, 60, 61, 67, 89, 104, 109, 136, and 152.)
[96] F. Radlinski and N. Craswell. Optimized interleaving for online retrieval evaluation. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 245–254, 2013. (Cited on pages 4, 17, 19, 21, 22, 60, 62, 131, 146, 172, 176, and 177.)
[97] F. Radlinski and N. Craswell. A theoretical framework for conversational search. In Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, pages 117–126, 2017. (Cited on page 38.)
[98] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791, 2008. (Cited on pages 40 and 178.)
[99] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 43–52. ACM, 2008. (Cited on pages 2, 4, 17, 21, 39, 130, 131, 144, 172, 176, and 177.)
[100] K. Raman, T. Joachims, P. Shivaswamy, and T. Schnabel. Stable coactive learning via perturbation. In International Conference on Machine Learning, pages 837–845, 2013. (Cited on pages 2 and 5.)
[101] P. Resnick and H. R. Varian. Recommender systems. Communications of the ACM, 40(3):56–58, 1997. (Cited on page 1.)
[102] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pages 521–530, 2007. (Cited on page 140.)
[103] A. Roegiest, G. V. Cormack, C. L. Clarke, and M. R. Grossman. TREC 2015 total recall track overview. In TREC, 2015. (Cited on page 1.)
[104] M. Sanderson. Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4(4):247–375, 2010. (Cited on pages 1, 2, 17, 38, 39, 40, 60, 61, 104, and 152.)
[105] M. Sanderson, M. L. Paramita, P. Clough, and E. Kanoulas. Do user preferences and evaluation measures line up? In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 555–562, 2010. (Cited on page 1.)
[106] J. B. Schafer, J. Konstan, and J. Riedl. Recommender systems in e-commerce. In Proceedings of the 1st ACM Conference on Electronic Commerce, pages 158–166, 1999. (Cited on page 1.)
[107] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, pages 1670–1679, 2016. (Cited on page 95.)
[108] A. Schuth, F. Sietsma, S. Whiteson, D. Lefortier, and M. de Rijke. Multileaved comparisons for fast online evaluation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 71–80, 2014. (Cited on pages 4, 15, 16, 17, 20, 21, 22, 23, 28, 29, 30, 40, 62, 171, 172, and 176.)
[109] A. Schuth, R.-J. Bruintjes, F. Büttner, J. van Doorn, C. Groenland, H. Oosterhuis, C.-N. Tran, B. Veeling, J. van der Velde, R. Wechsler, et al. Probabilistic multileave for online retrieval evaluation. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 955–958, 2015. (Cited on pages 4, 11, 16, 17, 22, 23, 28, 29, 30, 40, 62, and 172.)
[110] A. Schuth, K. Hofmann, and F. Radlinski. Predicting search satisfaction metrics with interleaved comparisons. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 463–472, 2015. (Cited on pages 2, 4, 126, 141, and 179.)
[111] A. Schuth, H. Oosterhuis, S. Whiteson, and M. de Rijke. Multileave gradient descent for fast online learning to rank. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 457–466, 2016. (Cited on pages 2, 5, 11, 34, 38, 40, 44, 48, 49, 51, 60, 62, 64, 67, 68, 117, 157, 173, 176, and 177.)
[112] I. Shalyminov, O. Dušek, and O. Lemon. Neural response ranking for social conversation: A data-efficient approach. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 1–8, 2018. (Cited on page 78.)
[113] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. In ACM SIGIR Forum, volume 33, pages 6–12. ACM New York, NY, USA, 1999. (Cited on page 111.)
[114] A. Slivkins, F. Radlinski, and S. Gollapudi. Ranked bandits in metric spaces: Learning diverse rankings over large document collections. Journal of Machine Learning Research, 14(1):399–436, 2013. (Cited on page 40.)
[115] A. Spink, S. Ozmutlu, H. C. Ozmutlu, and B. J. Jansen. US versus European web searching trends. In ACM SIGIR Forum, volume 36, pages 32–38. ACM New York, NY, USA, 2002. (Cited on page 111.)
[116] A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015. (Cited on pages 3 and 92.)
[117] A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudik, J. Langford, D. Jose, and I. Zitouni. Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems, pages 3632–3642, 2017. (Cited on page 103.)
[118] B. Szörényi, R. Busa-Fekete, A. Paul, and E. Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, pages 604–612, 2015. (Cited on page 42.)
[119] P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. (Cited on pages 102, 106, 107, and 118.)
[120] P. Vakkari and N. Hakala. Changes in relevance criteria and problem stages in task performance. Journal of Documentation, 56:540–562, 2000. (Cited on pages 39 and 61.)
[121] V. Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013. (Cited on page 79.)
[122] A. Vardasbi, M. de Rijke, and I. Markov. Cascade model-based propensity estimation for counterfactual learning to rank. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2089–2092, 2020. (Cited on page 178.)
[123] A. Vardasbi, H. Oosterhuis, and M. de Rijke. When inverse propensity scoring does not work: Affine corrections for unbiased learning to rank. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2020. (Cited on pages 12, 152, 153, 155, 156, 157, 163, and 164.)
[124] A. Vlachou, C. Doulkeridis, and K. Nørvåg. Monitoring reverse top-k queries over mobile devices. In Proceedings of the 10th ACM International Workshop on Data Engineering for Wireless and Mobile Access, pages 17–24. ACM, 2011. (Cited on page 78.)
[125] H. Wang, R. Langley, S. Kim, E. McCord-Snook, and H. Wang. Efficient exploration of gradient space for online learning to rank. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 145–154. ACM, 2018. (Cited on pages 5, 60, 63, 64, 173, and 176.)
[126] H. Wang, S. Kim, E. McCord-Snook, Q. Wu, and H. Wang. Variance reduction in gradient exploration for online learning to rank. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 835–844, 2019. (Cited on pages 2, 5, 117, 157, 173, and 176.)
[127] X. Wang, M. Bendersky, D. Metzler, and M. Najork. Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124, 2016. (Cited on pages 1, 2, 3, 6, 38, 39, 40, 61, 62, 78, 80, 81, 88, 91, 94, 104, 114, 126, 131, 132, 154, 155, 156, and 177.)
[128] X. Wang, N. Golbandi, M. Bendersky, D. Metzler, and M. Najork. Position bias estimation for unbiased learning to rank in personal search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 610–618. ACM, 2018. (Cited on pages 78, 81, 88, 94, 131, 134, 152, 153, 156, 172, and 176.)
[129] X. Wang, C. Li, N. Golbandi, M. Bendersky, and M. Najork. The LambdaLoss framework for ranking metric optimization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1313–1322. ACM, 2018. (Cited on pages 1, 6, 79, 86, 87, 88, 89, 104, 137, 154, 156, 174, and 177.)
[130] R. W. White, M. Bilenko, and S. Cucerzan. Studying the use of popular destinations to enhance web search interaction. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 159–166, 2007. (Cited on pages 111, 113, and 117.)
[131] Y. Wu, R. Shariff, T. Lattimore, and C. Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262, 2016. (Cited on pages 108, 111, 113, and 117.)
[132] Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208. ACM, 2009. (Cited on pages 2, 4, 34, 38, 39, 40, 49, 60, 62, 64, 94, 103, 117, 152, 157, 172, 173, 176, and 177.)
[133] Y. Yue, Y. Gao, O. Chapelle, Y. Zhang, and T. Joachims. Learning more powerful test statistics for click-based retrieval evaluation. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 507–514, 2010. (Cited on page 29.)
[134] Y. Yue, R. Patel, and H. Roehrig. Beyond position bias: Examining result attractiveness as a source of presentation bias in clickthrough data. In Proceedings of the 19th International Conference on World Wide Web, pages 1011–1018, 2010. (Cited on pages 23, 40, 60, and 61.)
[135] T. Zhao and I. King. Constructing reliable gradient exploration for online learning to rank. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 1643–1652, 2016. (Cited on pages 5, 63, 64, and 173.)
[136] S. Zhuang and G. Zuccon. Counterfactual online learning to rank. In European Conference on Information Retrieval, pages 415–430. Springer, 2020. (Cited on pages 152, 157, 163, 164, 166, 176, 177, and 178.)
[137] M. Zoghi, S. A. Whiteson, M. de Rijke, and R. Munos. Relative confidence sampling for efficient on-line ranker evaluation. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 73–82, 2014. (Cited on pages 48 and 67.)
[138] M. Zoghi, T. Tunys, L. Li, D. Jose, J. Chen, C. M. Chin, and M. de Rijke. Click-based hot fixes for underperforming torso queries. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 195–204, 2016. (Cited on pages 3, 7, 103, 111, 115, and 174.)
[139] M. Zoghi, T. Tunys, M. Ghavamzadeh, B. Kveton, C. Szepesvari, and Z. Wen. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 4199–4208, 2017. (Cited on pages 3, 6, 40, and 174.)
Summary

Ranking systems form the basis for online search engines and recommendation services. They process large collections of items, for instance web pages or e-commerce products, and present the user with a small ordered selection. The goal of a ranking system is to help a user find the items they are looking for with the least amount of effort. Thus the rankings they produce should place the most relevant or preferred items at the top. Learning to rank is a field within machine learning that covers methods which optimize ranking systems w.r.t. this goal. Traditional supervised learning to rank methods utilize expert judgements to evaluate and learn; however, in many situations such judgements are impossible or infeasible to obtain. As a solution, methods have been introduced that perform learning to rank based on user clicks instead. The difficulty with clicks is that they are affected not only by user preferences, but also by what rankings were displayed. Therefore, these methods have to avoid being biased by factors other than user preference. This thesis concerns learning to rank methods based on user clicks and specifically aims to unify the different families of these methods.

The first part of the thesis consists of three chapters that look at online learning to rank algorithms, which learn by directly interacting with users. Its first chapter considers large-scale evaluation and shows that existing methods cannot guarantee both correctness and user experience; we then introduce a novel method that can guarantee both. The second chapter proposes a novel pairwise method for learning from clicks that contrasts with the previously prevalent dueling-bandit methods. Our experiments show that our pairwise method greatly outperforms the dueling-bandit approach. The third chapter further confirms these findings in an extensive experimental comparison; furthermore, we also show that the theory behind the dueling-bandit approach is unsound w.r.t. deterministic ranking systems.

The second part of the thesis consists of four chapters that look at counterfactual learning to rank algorithms, which learn from historically logged click data. Its first chapter takes the existing approach and makes it applicable to top-k settings where not all items can be displayed at once. It also shows that state-of-the-art supervised learning to rank methods can be applied in the counterfactual scenario. The second chapter introduces a method that combines the robust generalization of feature-based models with the high-performance specialization of tabular models. The third chapter looks at evaluation and introduces a method for finding the optimal logging policy, which collects click data in a way that minimizes the variance of estimated ranking metrics. By applying this method during the gathering of clicks, one can turn counterfactual evaluation into online evaluation. The fourth chapter proposes a novel counterfactual estimator that considers the possibility that the logging policy has been updated during the gathering of click data. As a result, it can learn much more efficiently when deployed in an online scenario where interventions can take place. The resulting approach is thus both online and counterfactual; our experimental results show that its performance matches the state of the art in both the online and the counterfactual scenario.

As a whole, the second part of this thesis proposes a framework that bridges many gaps between the areas of online, counterfactual, and supervised learning to rank.
It has taken approaches previously considered independent and unified them into a single methodology for widely applicable and effective learning to rank from user clicks.