Learning from User Interactions with Rankings: A Unification of the Field
Harrie Oosterhuis
Academic dissertation (Academisch Proefschrift)

to obtain the degree of doctor at the University of Amsterdam, on the authority of the Rector Magnificus, prof. dr. ir. K.I.J. Maex, before a committee appointed by the Doctorate Board, to be defended in public in the Agnietenkapel on Friday, 27 November 2020, at 16:00, by
Hendrikus Roelof Oosterhuis
born in Schaijk

Doctoral committee
Promotor: Prof. dr. M. de Rijke, Universiteit van Amsterdam
Co-promotor: Prof. dr. E. Kanoulas, Universiteit van Amsterdam
Other members: Prof. dr. H. Haned, Universiteit van Amsterdam
Prof. dr. T. Joachims, Cornell University
Dr. ir. J. Kamps, Universiteit van Amsterdam
Prof. dr. C.G.M. Snoek, Universiteit van Amsterdam
Prof. dr. ir. A.P. de Vries, Radboud Universiteit Nijmegen

Faculteit der Natuurwetenschappen, Wiskunde en Informatica

The research was supported by the Netherlands Organisation for Scientific Research (NWO) under project number 612.001.551.

Copyright © 2020 Harrie Oosterhuis, Amsterdam, The Netherlands
Cover by Harrie Oosterhuis
Printed by Offpage, Amsterdam
ISBN: 978-94-93197-36-7

Acknowledgements
Over six years ago, I was invited to join the ILPS research group as an honours MSc. A.I. student. This was the start of an amazing period where I was able to learn, explore and develop myself into a true researcher. Many people have helped me on this journey, and I am truly grateful for all the support and friendship I have received along the way. I hope to inspire future students in the same way that you have all inspired me, and I will now try my best to thank each and every one of you.

First and foremost, I want to thank Maarten de Rijke, my supervisor and promotor. Maarten, I have learned more from you than I thought possible; you have taught me how to do research, how to become a better teacher and supervisor, and how to develop a research career. Your help was always there when needed, and without fail you have always gone above and beyond. You have set a great example for me and all of your students. Thank you so much.

Second, I wish to thank my co-promotor Evangelos Kanoulas. In our annual meetings you have always given me great advice. You are a truly compassionate supervisor, who cares greatly about his students. I am very happy to know that you will continue to have an amazing and caring influence on the future of the research group.

Third, I thank Hinda Haned, Thorsten Joachims, Jaap Kamps, Cees Snoek, and Arjen de Vries; I am truly honoured that you are all part of my PhD committee.

My special thanks to Ana and Tom for being my paranymphs. Through the peaks and valleys of my PhD life you have always been there for me, and I am very honoured to defend this thesis with you at my side.

Another special thanks to Anne Schuth for accepting and supervising me in ILPS when I had only just started the MSc. A.I. In the end, I hold you responsible for my interest in ranking systems and user interactions, and I cannot thank you enough. I also want to thank Petra in particular: your amazing work has made all of this possible. Without contest, I consider you the true ILPS MVP. Thank you, Petra.

Further thanks to everyone who has been part of ILPS during my journey: Adith, Alexey, Ali, Ali, Amir, Ana, Anna, Anne, Antonis, Arezoo, Arianna, Artem, Bob, Boris, Chang, Christof, Christophe, Chuan, Cristina, Daan, Damien, Dan, Dat, David, David, Dilek, Evgeny, Georgios, Hamid, Hendra, Hinda, Hosein, Ilya, Isaac, Ivan, Jiahuan, Jie, Jin, Julia, Julien, Kaspar, Katya, Ke, Maarten, Maarten, Maartje, Mahsa, Mariya, Marlies, Marzieh, Masrour, Maurits, Mohammad, Mostafa, Mozhdeh, Nikos, Olivier, Peilei, Pengjie, Petra, Praveen, Richard, Ridho, Rolf, Sam, Sami, Sebastian, Shangsong, Shaojie, Spyretta, Svitlana, Thorsten, Tobias, Tom, Trond, Vera, Wanyu, Xiaohui, Xiaojuan, Xinyi, Yangjun, Yaser, Yifan, Zhaochun, and Ziming. Together, you have all made ILPS a wonderful group to be part of; I am very grateful to call all of you my colleagues. I was also very happy to be part of several sub-groups that discussed ranking systems; thank you Ali, Arezoo, Artem, Chang, Jin, Julia, Maarten, Rolf, and Wanyu for the great discussions, and hopefully there will be many more discussions to come. In addition, special thanks to Ana, Antonis, Bob, Chang, Hosein, Maartje, Maurits, Nikos, Rolf, and Tom for being great friends as well.

Furthermore, I want to thank all the people who welcomed me abroad. For the great experiences I had during my Google internships, I thank Ajay, Ariel, Bo, Eugene, George, Guan-Lin, Heng-Tze, Larry, Maxime, Michael, Mustafa, Roger, Sujith, Vihan, and Yi-fan.
For the absolutely amazing time I had in Australia, I want to thank Andrew, Binsheng, Brian, Falk, Joel, Luke, Mark, Sarah, and Shane. Especially Joel and Binsheng, for literally travelling to the other side of the world with me. I am really grateful to have met you all and hope the future will allow me to visit you many times.

I also want to thank my fellow students Carla, Dasyel, Fabian, Jelle and Wietze for the many fond memories of my studies. Furthermore, I am grateful for my long friendship with Chiel, Don, Kit, Stefan, Mark and Luuk; it is very dear to me to have friends I have known since kindergarten.

Lastly, I want to thank my family, the most important people in my life. Marianna, Roelof, Anna and Jeroen, thank you for all your support; without you I would never have made it this far. I thank Helena and Nathalie for always welcoming us so warmly. I am also very grateful to my dear grandmother Anna, who always listens so patiently whenever I try to explain what it is I actually do at the university. Most of all, I thank my great love Emily, for always being there for me and for making every day so much brighter.

Harrie Oosterhuis
Amsterdam, October 2020

Contents
I  Novel Online Methods for Learning and Evaluating
II A Single Framework for Online and Counterfactual Learning to Rank
  5 … Top-k Rankings
    5.3 … Top-k Feedback
      5.3.1 The problem with top-k feedback
      5.3.2 Policy-aware propensity scoring
      5.3.3 Illustrative example
    5.4 Learning for Top-k Metrics
      5.4.1 Top-k metrics
      5.4.2 Monotonic upper bounding
      5.4.3 … top-k LTR
      5.4.4 Unbiased loss selection
    5.5 Experimental Setup
      5.5.1 Datasets
      5.5.2 Simulating top-k settings
      5.5.3 Experimental runs
    5.6 Results and Discussion
      5.6.1 Learning under item-selection bias
      5.6.2 Optimizing top-k metrics
    5.7 Related Work
    5.8 Conclusion
    5.A Notation Reference for Chapter 5
Bibliography
Summary
Samenvatting

1 Introduction
Search engines allow users to efficiently navigate through the enormous numbers of documents available online [7]. Underlying every search engine is a ranking system that processes documents in order to present a ranking to the user [75]. Over the years, the role of ranking systems has only become more important, as they are now used in a wide variety of settings. Users rely on them to search through many large collections of content, including images [35], scientific articles [58], e-commerce products [60], streaming videos [19], job applications/applicants [32], emails [127], and legal documents [103]. Similarly, ranking systems are used for recommendation as well, where they help to suggest content to users that matches their interests [101]. This may even be content of which users are not aware that they have an interest in [106]. In all these ranking scenarios, the best user experience is provided when the items that users prefer most are at the top of the produced rankings [52]. In other words, the ranking should help users find what they are looking for with a minimal amount of effort [105]. Without a ranking system, finding the right information in any sizeable collection becomes an impossible task. Furthermore, without recommendations many online services would lack a lot of user engagement [29]. Thus, ranking systems drive both user satisfaction – providing users with the content they prefer – and user engagement – bringing providers of content or services to interested consumers [34]. Therefore, the performance of a ranking system is very important to both the users of a service and its providers. Due to this importance, a lot of attention has gone to the evaluation of ranking systems [18, 38, 45, 55, 104, 105] and to the field of Learning to Rank (LTR), which covers methods for optimizing ranking systems [13, 58–60, 75, 129].

Traditionally, ranking evaluation and LTR methods made use of human judgements in the form of expert annotations [104]: for given pairs of queries and documents, experts are asked to annotate the relevance of a document w.r.t. a specific query. This costly process results in an annotated dataset: a collection of query and document pairs with corresponding expert annotations [17, 27, 76]. For an annotated dataset to be useful it should accurately capture: (i) the queries users typically issue; (ii) the documents that have to be ranked; and (iii) the relevance preferences of the user [105]. With such a dataset, the optimization of a ranking system can be done through supervised LTR methods. These methods optimize ranking metrics such as Precision, Average Relevance Position (ARP) or Discounted Cumulative Gain (DCG), based on the provided relevance annotations [52, 75].
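As an illustration, the standard linear-gain formulation of DCG at a cutoff k sums the annotated relevance of the top-k ranked documents, discounted logarithmically by rank (variants exist, e.g., with exponentiated gains):

$$ \mathit{DCG@k}(f, q) = \sum_{i=1}^{k} \frac{\mathit{rel}(d_i, q)}{\log_2(i + 1)}, $$

where $d_i$ is the document that ranking system $f$ places at rank $i$ for query $q$, and $\mathit{rel}(d_i, q)$ is its annotated relevance. Supervised LTR methods maximize such metrics, via differentiable approximations or bounds, over the annotated dataset.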
While very important to the LTR field, some severe limitations of this supervised approach have become apparent over the years: (i) expert annotations are expensive and time-consuming to obtain [17, 76]; (ii) in sensitive settings, acquiring expert annotations can be unethical, for instance, when gathering data for optimizing systems for search over personal documents such as emails [127]; (iii) for specific settings there may be no experts that can judge what is relevant, for instance, in the context of personalized recommendations; (iv) what users perceive as relevant is known to change over time, thus a dataset would have to be updated regularly, further increasing the associated costs [1, 71]; and (v) actual user preferences and expert annotations are often misaligned [104].

Consequently, the supervised approach is infeasible for many LTR practitioners, because they do not have the resources to create an annotated dataset or because gathering annotations is not possible in their ranking setting. Moreover, even if a dataset can be obtained, it may not lead to the optimal ranking system. Thus there is a need for an alternative to the supervised approach to ranking evaluation and LTR.

An alternative approach that has received a lot of attention is to base evaluation and optimization on user interactions [56, 99]. For rankings this usually means that user clicks are used to compare and improve ranking systems. At first glance, user interactions seem to solve the problems with annotations: (i) if a service has enough active users, interactions are virtually free and available at a large scale; (ii) gathering interactions can be done without showing sensitive items to experts for annotation; (iii) unlike annotations, interactions are an indication of the actual individual user preferences. Thus there appears to be a lot of potential for using user interactions; however, there are also drawbacks specific to using them: (i) it requires keeping track of large amounts of user behavior, something users may not consent to [94]; (ii) user behavior is very unpredictable, and clicks in particular are known to be a very noisy signal [20]; (iii) clicks are a form of implicit feedback; there are other factors besides user preference that also affect whether a click takes place, making clicks a biased signal of relevance [20, 25]. This thesis will not explore the first drawback and will instead focus on settings where acquiring user interactions is done with consent, in a privacy-respecting and ethical manner. Mainly, we will consider how methods of evaluation and optimization based on clicks can mitigate the negative effects of click-related noise and bias.

Existing methods for ranking evaluation and optimization from user interactions can roughly be divided into two families: the online family, which deals with bias through direct interaction and result-randomization [100, 132]; and the counterfactual family, which first models click behavior and then uses the inferred model to correct for bias in logged click data [58, 127]. A further division can be made; for this thesis a decomposition into five areas is relevant. We will divide the online family into three areas:

(i) Online Evaluation – methods like A/B testing and interleaving [56] that interact directly with users to compare ranking systems and randomize displayed results to mitigate biases [18, 44, 110, 111].

(ii) Feature-Based Online LTR – methods like Dueling Bandit Gradient Descent (DBGD) [132] and the Perturbed Preference Perceptron for Ranking (3PR) [100] that optimize feature-based ranking models by direct interaction with users, often relying on online evaluation [42, 111, 126].

(iii) Tabular Online LTR – methods like Cascading Bandits [68] and the Position-Based Model algorithm (PBM) [69] that optimize a single ranking for a single ranking setting, by learning from direct interactions and result randomization [67, 70, 138, 139]. Characteristic of tabular methods is that they do not use any feature-based prediction model but instead memorize the best ranking.

For the counterfactual family, we will use the following division into two areas:

(iv) Counterfactual Evaluation – methods that evaluate rankings based on historically logged clicks.
They require an inferred model of click behavior and use that model to correct for biases using, for instance, Inverse Propensity Scoring (IPS) [4, 16, 58, 92, 116].

(v) Counterfactual LTR – methods that use counterfactual evaluation to estimate performance based on historical click logs, and that optimize ranking models to maximize the estimated system performance [2, 3, 46, 58, 92, 127] (a sketch of such estimation is given below).

This division reveals a rich diversity in approaches that all share the same goal of evaluating or optimizing ranker performance based on user interactions. On the one hand, this diversity is understandable, since in some settings only one area of methods is applicable. For instance, one cannot add randomization to data that has already been logged, making the counterfactual approach the only available option if only logged data is available. On the other hand, the diversity of approaches is also unexpected and raises some questions. For instance, why would online approaches not benefit from an accurate model of click behavior if one is available, similar to the counterfactual approach?

In this thesis, we investigate whether this online/counterfactual division is truly necessary. We introduce several novel LTR methods that improve over the efficiency of existing methods and increase the applicability of LTR from user clicks. In particular, we focus on LTR methods that bridge the online/counterfactual division and that are highly effective both when applied online and when applied counterfactually. An important result of this thesis for the LTR field is that we offer a unified perspective and set of LTR methods.
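To make the counterfactual family concrete, the following is a minimal sketch of IPS-based counterfactual evaluation in the spirit of [58]; the log format, propensity model, and function names are illustrative assumptions, not a fixed interface:

```python
import numpy as np

def ips_dcg_estimate(click_log, propensity, ranker):
    """Estimate a DCG-style metric for `ranker` from clicks logged under a
    different system, correcting for position bias with inverse propensities.

    click_log:  list of (query, clicked_doc) pairs gathered by the old system.
    propensity: function (query, doc) -> probability the user examined `doc`
                when it was logged (e.g., from a position-bias click model).
    ranker:     function (query) -> ranked list of documents to evaluate.
    """
    total = 0.0
    for query, clicked_doc in click_log:
        new_rank = ranker(query).index(clicked_doc) + 1  # rank under new ranker
        # Each click is up-weighted by how unlikely it was to be observed,
        # which in expectation corrects for position bias.
        total += (1.0 / propensity(query, clicked_doc)) / np.log2(new_rank + 1)
    return total / len(click_log)
```

The unbiasedness of such an estimate hinges on the assumed click model being correct, which is exactly where the counterfactual family invests its modeling effort.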
1.1 Research Outline and Questions

The overarching question this thesis aims to answer is:
Could there be a single general theoretically-grounded approach that has competitive performance for both evaluation and LTR from user clicks on rankings, in both the counterfactual and online settings?
Our aim is to progress the LTR field towards answering this question in the affirmative. In this thesis, we explore two directions in search of such a single general theoretically-grounded approach. Firstly, we introduce novel online LTR methods that outperform existing online methods at large-scale evaluation and at optimization in the online setting. Secondly, we introduce novel counterfactual LTR methods that build on the original IPS-based counterfactual LTR approach [58]. Our novel counterfactual LTR methods expand the original counterfactual approach and make it applicable to more tasks and settings. As a result, these novel methods bridge several gaps between counterfactual LTR and the areas of supervised LTR and online LTR. Furthermore, all our novel counterfactual LTR methods are compatible with each other, and can be seen as parts of a novel counterfactual LTR framework. By the end of the thesis, our proposed framework has taken the original counterfactual LTR approach and greatly increased its applicability and effectiveness for both online and counterfactual evaluation and optimization. This leads to a more unified perspective on the LTR field, where areas that were previously largely independent are now connected.
In the first part of the thesis, we introduce two methods that greatly increase the efficiency of large-scale online evaluation and online LTR. Additionally, we take a critical look at several existing methods for online evaluation and online LTR.

Interleaving was introduced as an efficient evaluation paradigm designed to evaluate whether one ranking system outperforms another [56]. Interleaving methods take the rankings produced by two systems and combine them into an interleaved ranking [41, 96, 99]. Clicks on the interleaved ranking are interpreted directly as preference signals between the two systems, resulting in a more data-efficient approach [110], thus allowing one to efficiently estimate whether an alteration leads to an improved system. Later, the interleaving approach was extended to multileaving, which allows for comparisons that include more than two systems at once [12, 108, 109], thereby enabling efficient comparisons of large numbers of systems with each other.

In Chapter 2 we look at such multileaving methods for large-scale online ranking evaluation. Specifically, we investigate the following question:
RQ1
Does the effectiveness of online ranking evaluation methods scale to large comparisons?

We examine existing multileaving methods in terms of fidelity – are they provably unbiased in unambiguous cases [44] – and considerateness – are they safe w.r.t. the user experience during the gathering of clicks. From our theoretical analysis, we find that no existing multileaving method manages to meet both criteria. Furthermore, our empirical analysis reveals that their performance decreases as comparisons involve more ranking systems at once. As a novel alternative, we introduce the Pairwise Preference Multileaving (PPM) algorithm, which bases evaluation on inferred pairwise item preferences. We prove that it meets both the fidelity and considerateness criteria. Furthermore, our empirical results indicate that using PPM leads to a much smaller number of errors, especially in large-scale comparisons.

Besides evaluation, optimization is also very important to obtain effective ranking systems [75]. The idea of optimizing ranking systems based on clicks is long-established. One of the first theoretically-grounded approaches was Dueling Bandit Gradient Descent (DBGD) [132]. For every incoming query, DBGD samples a variation on a ranking system and then uses interleaving to estimate whether this variation is an improvement. If so, it updates the ranking system to be more similar to the variation. Over time, this process is supposed to oscillate towards the optimal ranking system. Numerous extensions have been proposed, but all have kept the overall DBGD approach of sampling variations and using online evaluation [42, 100, 111, 126]. This is somewhat puzzling, since this sampling approach is in stark contrast with all other LTR methods, which use gradient-based optimization.

In Chapter 3 we explore alternatives to the DBGD approach and ask ourselves the following question:
RQ2
Is online LTR possible without relying on model-sampling and online evaluation?

We answer this question in the affirmative by proposing a novel online LTR method: Pairwise Differentiable Gradient Descent (PDGD). Unlike DBGD, PDGD does not require model-sampling, nor does it make use of any online evaluation. Instead, PDGD optimizes a stochastic Plackett-Luce ranking model and bases its updates on inferred pairwise item preferences. PDGD weights the gradients w.r.t. item-pairs to mitigate the effect of position bias. We prove, under very mild assumptions, that the weighted gradient of PDGD is unbiased w.r.t. item-preferences. Our experimental results show that PDGD requires far fewer interactions to reach the same level of performance as DBGD. Furthermore, we show that even in ideal settings DBGD may not be able to find the optimal model and is ineffective at optimizing neural models. In contrast, PDGD does converge to near-optimal models, and reaches even higher performance when applied to neural networks.
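For illustration, the following sketch samples a ranking from a Plackett-Luce model, the stochastic model class PDGD optimizes; the function name is ours:

```python
import numpy as np

def sample_plackett_luce_ranking(scores, rng=None):
    """Sample a ranking where each next document is drawn with probability
    proportional to exp(score), from the documents not yet placed."""
    rng = rng or np.random.default_rng()
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        logits = np.array([scores[d] for d in remaining])
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        ranking.append(remaining.pop(rng.choice(len(remaining), p=probs)))
    return ranking
```

Because the model is stochastic and differentiable in the scores, PDGD can compute gradients that increase the probability of rankings matching the inferred pairwise preferences, without sampling entire model variants as DBGD does.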
The large improvements of PDGD over DBGD observed in Chapter 3 made us wonder whether DBGD is actually a reliable choice for online LTR. In response to this question, Chapter 4 tackles the following question:

RQ3

Are DBGD LTR methods reliable in terms of theoretical soundness and empirical performance?

First, we take a critical look at the theory underlying the DBGD approach, and find that its assumptions do not hold for deterministic ranking systems and common ranking metrics. Consequently, we conclude that its theory is not applicable to the large majority of existing research that utilizes the DBGD approach [42, 43, 90, 111, 125, 135]. Second, we perform an empirical analysis where DBGD and PDGD are compared in circumstances ranging from near-ideal – where interactions contain little noise and no position bias – to extremely difficult – where interactions contain extreme amounts of noise and position bias. The difference in performance between PDGD and DBGD is so large that we conclude that PDGD is by far the more reliable choice.

For the field of online LTR, this leads us to question the relevance of DBGD and its extensions, as we have found theoretical weaknesses and empirical inferiority. The fact that virtually all previous methods in the online LTR field are extensions of DBGD raises profound questions.
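For reference, the DBGD update loop criticized above can be sketched as follows; interleaved_comparison stands in for an online evaluation step and is assumed, not specified, here:

```python
import numpy as np

def dbgd_update(weights, query, interleaved_comparison,
                step_size=0.01, exploration=1.0, rng=None):
    """One DBGD step: sample a perturbed candidate ranker on the unit sphere;
    if online evaluation prefers the candidate, step toward it."""
    rng = rng or np.random.default_rng()
    direction = rng.normal(size=weights.shape)
    direction /= np.linalg.norm(direction)  # uniformly random unit vector
    candidate = weights + exploration * direction
    # interleaved_comparison returns True iff the user's clicks on an
    # interleaved result list indicate a preference for the candidate.
    if interleaved_comparison(weights, candidate, query):
        weights = weights + step_size * direction
    return weights
```

Every candidate update requires an online comparison, which is precisely the dependence PDGD removes.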
In the second part of the thesis, we expand the existing IPS-based counterfactual LTR approach [58] to create a unified framework for both online and counterfactual LTR and ranking evaluation based on clicks.

The conclusions of the first part of the thesis revealed that DBGD, which forms the basis of most previous work in online LTR, has problems in terms of performance and its theoretical basis. It is concerning that these conclusions could have been reached much earlier: previous work could have taken a critical look at the theory at any moment; furthermore, if previous work had compared DBGD performance with supervised LTR in the prevalent simulated setups, it would have observed the convergence problems of DBGD. To avoid similar issues, we chose to build upon the counterfactual LTR approach because it has a strong theoretical basis; additionally, all experimental comparisons in the second part include optimal ranking models to detect potential convergence issues.

In contrast with online LTR approaches, counterfactual LTR and evaluation make explicit assumptions about user behavior [58, 127]. By making such assumptions, the unbiasedness of counterfactual methods can be proven, thus guaranteeing optimal convergence, given that the assumptions are correct. While this provides a strong foundation for learning from historically logged clicks, the counterfactual approach is not always applicable, nor always the most effective option [50]. The following research questions consider whether counterfactual LTR could overcome its limitations and become the best choice for LTR from clicks in general.

One of the requirements for the unbiasedness of the original counterfactual LTR method is that every relevant item has to be displayed at every query [58]. This is a problem in top-k ranking settings where not all items can be displayed at once [92]. Hence, Chapter 5 concerns the question:

RQ4
Can counterfactual LTR be extended to top-k ranking settings?

We introduce the policy-aware estimator, which corrects for position bias while taking into account the behavior of a stochastic logging policy. As a result, the policy-aware estimator is unbiased even when learning from top-k feedback, as long as the policy gives every relevant item a non-zero chance of appearing in the top-k. With this extension, counterfactual LTR is thus also applicable to the top-k setting, which is especially prevalent in recommendation.
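The core of the policy-aware idea can be sketched as follows: the propensity of an item is its examination probability marginalized over the rankings the stochastic logging policy may display. The Monte-Carlo approximation and names below are illustrative:

```python
def policy_aware_propensity(doc, query, logging_policy, observe_prob,
                            k, n_samples=1000):
    """Estimate P(doc is examined) when only the top-k of each sampled
    ranking is displayed.

    logging_policy: function (query) -> one sampled ranking (list of docs).
    observe_prob:   function (rank) -> examination probability (position bias).
    """
    total = 0.0
    for _ in range(n_samples):
        rank = logging_policy(query).index(doc) + 1
        if rank <= k:  # items outside the top-k can never be examined
            total += observe_prob(rank)
    return total / n_samples
```

Weighting clicks by the inverse of this propensity stays well-defined, and the resulting estimator unbiased, exactly when the policy gives every relevant item a non-zero probability of appearing in the top-k.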
Existing work has considered how to optimize ranking metrics such as DCG using counterfactual LTR [2, 46]. Interestingly, the solutions for counterfactual LTR are very different from those in supervised LTR [13, 129]. To investigate whether this difference is really necessary, Chapter 5 also addresses the question:

RQ5

Is it possible to apply state-of-the-art supervised LTR methods to the counterfactual LTR problem?

We find that the LambdaLoss framework [129], which includes the famous LambdaMART method [13], can also be applied to counterfactual estimates of ranking metrics. Thus we show that there does not need to be a divide between state-of-the-art supervised LTR and counterfactual LTR.

So far we have not considered the area of tabular online LTR: methods that find the optimal ranking for a single query based on result randomization and direct interaction [67–70, 139]. While these methods need a lot of click data to reach decent performance, they can always find the optimal ranking, since they optimize a memorized ranking instead of using a feature-based model [138]. The downside is that, when few clicks are available for a query, tabular LTR methods are highly sensitive to noise. Thus these approaches are good for specialization: they have great performance on queries where numerous clicks have been observed, at the cost of an initial period of poor performance. Conversely, counterfactual LTR commonly optimizes feature-based models for generalization, to have robust performance on previously unseen queries, while often not reaching perfect performance at convergence. Inspired by this contrast, in Chapter 6 we ask ourselves:
RQ6
Can the specialization ability of tabular online LTR be combined with the robust feature-based approach of counterfactual LTR?

Our answer comes in the form of the novel Generalization and Specialization (GENSPEC) algorithm: it optimizes a single robust generalized policy and numerous specialized policies, each optimized for a single query. The GENSPEC meta-policy then uses high-confidence bounds to safely decide per query which policy to deploy. Consequently, for previously unseen queries GENSPEC chooses the generalized policy, which utilizes the robust feature-based ranking model, while for other queries it can decide to deploy a specialized policy, i.e., if it has enough data to confidently determine that the specialized policy has found the better ranking. For the LTR field, GENSPEC shows that specialization does not need to be unique to tabular online LTR; instead, it can be a property of counterfactual LTR as well. More generally, it shows that specialization and generalization are not mutually exclusive abilities.
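A sketch of the meta-policy decision; the confidence-bound function is a hypothetical stand-in for the bounds derived in Chapter 6:

```python
def choose_policy(query, general_policy, specialized_policies,
                  lower_confidence_bound):
    """Deploy the specialized policy for a query only if it is confidently
    better than the generalized one.

    lower_confidence_bound: hypothetical function (query, pol_a, pol_b) ->
        high-confidence lower bound on the performance difference a - b,
        estimated from the clicks logged for this query.
    """
    specialized = specialized_policies.get(query)
    if specialized is None:  # unseen query: rely on the feature-based model
        return general_policy
    if lower_confidence_bound(query, specialized, general_policy) > 0.0:
        return specialized   # with high confidence the memorized ranking wins
    return general_policy
```

The safety property follows directly: a specialized (memorized) ranking is only shown once the logged clicks rule out, with high confidence, that it is worse than the generalized model.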
While counterfactual evaluation methods are designed for using historical clicks, they can be applied online by simply treating newly gathered data as historical [16, 50]. In contrast with online evaluation methods, counterfactual evaluation is completely passive: its methods do not prescribe which rankings should be displayed. This difference leads us to ask the following question in Chapter 7:

RQ7

Can counterfactual evaluation methods for ranking be extended to perform efficient and effective online evaluation?

We answer this question positively by introducing the novel Logging-Policy Optimization Algorithm (LogOpt), which uses the available clicks to optimize the logging policy so that it minimizes the variance of counterfactual estimates of ranking metrics. By minimizing variance, LogOpt increases the data-efficiency of counterfactual evaluation, leading to more accurate estimates from fewer logged clicks. LogOpt is applied while data is still being gathered, and it changes which rankings will be displayed for future queries. Thus, with the addition of LogOpt, counterfactual evaluation is transformed into an online approach that is actively involved in how data is gathered. Our experimental results suggest that LogOpt is at least as efficient as interleaving methods, while also being proven to be unbiased under the common assumptions of counterfactual LTR.
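The quantity LogOpt drives down can be sketched as follows: under a simple single-click-per-query model, the variance of an IPS estimate is determined by the propensities the logging policy induces. This is an illustrative simplification, not the algorithm itself:

```python
import numpy as np

def ips_variance(propensities, click_probs):
    """Variance of a per-interaction IPS estimate for one query, assuming at
    most one click per interaction.

    propensities: array of P(doc examined) under a candidate logging policy.
    click_probs:  array of P(doc clicked); rarely exposed documents receive
                  large inverse-propensity weights and inflate the variance.
    """
    weights = 1.0 / propensities
    mean = np.sum(click_probs * weights)
    second_moment = np.sum(click_probs * weights ** 2)
    return second_moment - mean ** 2

def pick_logging_policy(candidate_propensities, click_probs):
    """Choose the candidate exposure distribution minimizing that variance."""
    return min(candidate_propensities,
               key=lambda p: ips_variance(p, click_probs))
```

LogOpt performs this minimization over parameterized logging policies using the clicks gathered so far, rather than over a fixed candidate set.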
The results in Chapter 2 and Chapter 7 did not show any online evaluation method converging to zero error. This led us to also ask the following question in Chapter 7:

RQ8

Are existing interleaving methods truly capable of unbiased evaluation w.r.t. position bias?
We prove that, under the assumption of basic position bias, interleaving methods are not unbiased. Furthermore, our results in Chapter 7 indicate that interleaving methods have a systematic error. Unfortunately, we are unable to estimate the impact this systematic error has on real-world comparisons. To the best of our knowledge, no empirical studies have been performed that could measure such a bias; our findings strongly suggest that such a study would be highly valuable to the field.

In Chapter 7 we show that counterfactual ranking evaluation can be as efficient as online evaluation methods, while also having the theoretical justification of counterfactual methods. Naturally, this leads to a similar question regarding LTR:
RQ9
Can the counterfactual LTR approach be extended to perform highly effective online LTR?

In Chapter 8 we answer this question by introducing the intervention-aware estimator for online/counterfactual LTR. The intervention-aware estimator corrects for position bias and trust bias while also taking into account the effect of online interventions. This means that if an intervention takes place – i.e., the logging policy changes during the gathering of data – the intervention-aware estimator takes its effect on the interaction biases into account. The result is an estimator that, on the one hand, is just as efficient as other counterfactual estimators when applied to historical data, while, on the other hand, being much more efficient than existing estimators when applied online. Moreover, its performance is comparable to online LTR methods. In contrast with online methods, including DBGD and PDGD, the intervention-aware estimator is proven to be unbiased w.r.t. ranking metrics under the standard assumptions. In other words, it is the only method that is proven to converge on the optimal model, while also being as efficient as the others. Therefore, we consider the intervention-aware estimator a bridge between online and counterfactual LTR, as it is the most reliable choice in both scenarios.
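A sketch of the central intervention-aware correction, as we read it: when the logging policy changes over time, the effective propensity of an item averages its exposure probability over all policies deployed during logging. The interface is illustrative:

```python
def intervention_aware_propensity(doc, query, deployed_policies, exposure_prob):
    """Effective examination probability of `doc` over the logging period.

    deployed_policies: logging policies, one per (equally sized) time period.
    exposure_prob:     function (policy, query, doc) -> P(doc examined),
                       e.g., computed as in the policy-aware estimator.
    """
    probs = [exposure_prob(policy, query, doc) for policy in deployed_policies]
    return sum(probs) / len(probs)  # average over every deployed policy
```

Clicks weighted by the inverse of this averaged propensity account for the interventions themselves, which is what allows the same estimator to serve both the counterfactual and the online setting.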
1.2 Main Contributions

This section summarizes the main contributions of this thesis. We differentiate between algorithmic contributions – novel algorithms introduced in the thesis – and theoretical contributions – findings that are important to the field, both in the form of formal proofs and of empirical observations.

Algorithmic contributions:
1. The Pairwise Preference Multileaving (PPM) algorithm for large-scale comparisons in online evaluation.
2. The Pairwise Differentiable Gradient Descent (PDGD) algorithm for fast and efficient online LTR.
3. The policy-aware estimator that can perform unbiased counterfactual LTR in top-k settings.
4. Three loss functions for optimizing top-k metrics with counterfactual LTR, including an adaptation of the supervised LTR LambdaLoss method.
5. The Generalization and Specialization (GENSPEC) algorithm that combines the specialization ability of tabular models with the generalization ability of feature-based models.
6. The Logging-Policy Optimization Algorithm (LogOpt) that turns counterfactual evaluation into online evaluation, so as to minimize variance by updating the logging policy during the gathering of data.
7. The intervention-aware estimator that bridges the gap between counterfactual and online LTR, by extending the policy-aware estimator to take into account the effect of online interventions.
8. An overarching framework for both online and counterfactual LTR evaluation and optimization, combining the existing counterfactual approach with the contributions of the second part of the thesis. For counterfactual/online evaluation, contributions 3, 6, and 7 can be applied simultaneously; similarly, for counterfactual/online LTR the same can be done with contributions 3, 4, 5, and 7.
Theoretical contributions:

9. An extension of the definitions of fidelity and considerateness to multileaving; in addition, we show that no existing multileaved comparison method meets both criteria simultaneously.
10. A formal proof that PDGD is unbiased w.r.t. pairwise item preferences under mild assumptions.
11. A formal proof that the assumptions of DBGD are not sound for deterministic ranking models, thus invalidating some claims of unbiasedness in previous online LTR work.
12. An extensive comparison of DBGD and PDGD under circumstances ranging from ideal to near worst-case, revealing that even in ideal circumstances DBGD is often unable to approximate the optimal model.
13. A formal proof of the unbiasedness of the policy-aware and intervention-aware estimators, proving that the former is unbiased w.r.t. position bias and item-selection bias, and the latter w.r.t. position bias, item-selection bias, and trust bias.
14. A formal demonstration of how LTR loss functions can be adapted to bound top-k metrics, including a description of how LambdaLoss can be adapted for counterfactual LTR.
15. An extension of existing bounds in order to bound the relative performance of two policies, with an additional proof that this bound is more efficient than comparing the bounds of the individual policies.
16. A formal proof that interleaving methods are not unbiased w.r.t. position bias.
17. An empirical analysis that reveals that PDGD is not unbiased w.r.t. position bias, item-selection bias, and trust bias when not applied fully online.

In addition to these contributions, the source code used to perform the experiments in each published chapter has been shared publicly to enable reproducibility.
1.3 Thesis Overview

This section provides an overview of the thesis, together with some recommendations for reading directions. This thesis consists of an introductory chapter, seven research chapters divided into two parts, and a conclusion. Each research chapter answers one or two of the thesis research questions put forward in Section 1.1, in addition to several chapter-specific research questions. The thesis research questions are important to the overarching story of the thesis, whereas the chapter-specific research questions only concern the individual contributions of the chapters.

The first chapter, which you are currently reading, introduces the subject of this thesis: LTR and ranking evaluation from user clicks. Furthermore, it lays out the thesis research questions this thesis answers, and provides an overview of its contributions and its origins.
Part I, titled Novel Online Methods for Learning and Evaluating, contains three research chapters that all consider online methods for LTR and ranking evaluation. Chapter 2 looks at multileaving methods for online evaluation, evaluates existing methods, and introduces a novel multileaving method. Chapter 3 considers online LTR and introduces PDGD, a novel debiased pairwise method. Chapter 4 performs an extensive comparison of the previous state-of-the-art online LTR method, DBGD, and our novel PDGD, in terms of theoretical guarantees and an experimental analysis.
Part II, titled A Single Framework for Online and Counterfactual Learning to Rank, contains four research chapters that build on the counterfactual approach to LTR and ranker evaluation. The chapters in this part of the thesis are complementary; most of their contributions can be applied together or build upon each other. Chapter 5 extends counterfactual LTR to top-k settings; it introduces a novel estimator for learning from top-k feedback and extends supervised LTR methods to optimize counterfactual estimates of top-k ranking metrics. Chapter 6 looks at both tabular and feature-based ranking models, and introduces an algorithm that optimizes both types of models and safely deploys different models per query, thus combining the specialization abilities of tabular models with the robust performance of feature-based models in previously unseen circumstances. Chapter 7 aims to unify counterfactual and online ranking evaluation; it introduces a method that updates the logging policy during the gathering of data, turning counterfactual evaluation into efficient online evaluation. Similarly, Chapter 8 seeks to unify counterfactual and online LTR; it proposes a novel estimator that takes into account the effect of online interventions but can also be applied counterfactually. As a result, the estimator is effective for both counterfactual LTR and online LTR. Lastly, the thesis is concluded in Chapter 9, where we summarize the findings of the thesis; in particular, we discuss whether the division between the families of online and counterfactual LTR methods has been bridged. We end that chapter with a discussion of possible future research directions.

The research chapters in this thesis are self-contained; a reader can therefore read any single chapter independently if they desire. The research chapters grew out of published papers, and we wanted to avoid creating alternate versions of published work that deviate from the originals. As a result, the notation differs somewhat between some chapters; to help the reader, we have added a table at the end of each chapter detailing the notation it uses. For the best experience, we recommend reading all the chapters in Part II, because they build on each other. For the same reason, Chapter 3 and Chapter 4 are best read together.
1.4 Origins

We will now list the publications on which the research chapters are based. Each of the publications is a conference paper written by Harrie Oosterhuis and Maarten de Rijke. In all cases, Oosterhuis came up with the main research ideas, performed all experiments, and wrote the majority of the text. De Rijke led the discussions on how each paper should be structured and contributed significantly to the text. In total, this thesis is built on six publications [81, 82, 84–88].
Chapter 2 is based on Sensitive and Scalable Online Evaluation with Theoretical Guarantees, published at CIKM '17 by Oosterhuis and de Rijke [81].

Chapter 3 is based on Differentiable Unbiased Online Learning to Rank, published at CIKM '18 by Oosterhuis and de Rijke [82].

Chapter 4 is based on Optimizing Ranking Models in an Online Setting, published at ECIR '19 by Oosterhuis and de Rijke [84].

Chapter 5 is based on Policy-Aware Unbiased Learning to Rank for Top-k Rankings, published at SIGIR '20 by Oosterhuis and de Rijke [86].

Chapter 6 is based on Robust Generalization and Safe Query-Specialization in Counterfactual Learning to Rank, submitted to WWW '21 by Oosterhuis and de Rijke [87].

Chapter 7 is based on Taking the Counterfactual Online: Efficient and Unbiased Online Evaluation for Ranking, published at ICTIR '20 by Oosterhuis and de Rijke [85].

Chapter 8 is based on Unifying Online and Counterfactual Learning to Rank, published at WSDM '21 by Oosterhuis and de Rijke [88].

In addition, this thesis also indirectly benefitted from the following publications:
• Probabilistic Multileave for Online Retrieval Evaluation, published at SIGIR '15 by Schuth et al. [109].
• Multileave Gradient Descent for Fast Online Learning to Rank, published at WSDM '16 by Schuth et al. [111].
• Probabilistic Multileave Gradient Descent, published at ECIR '16 by Oosterhuis et al. [90].
• Balancing Speed and Quality in Online Learning to Rank for Information Retrieval, published at CIKM '17 by Oosterhuis and de Rijke [80].
• Query-level Ranker Specialization, published at CEUR '17 by Jagerman et al. [49].
• Ranking for Relevance and Display Preferences in Complex Presentation Layouts, published at SIGIR '18 by Oosterhuis and de Rijke [83].
• The Potential of Learned Index Structures for Index Compression, published at ADCS '18 by Oosterhuis et al. [91].
• To Model or to Intervene: A Comparison of Counterfactual and Online Learning to Rank from User Interactions, published at SIGIR '19 by Jagerman et al. [50].
• When Inverse Propensity Scoring does not Work: Affine Corrections for Unbiased Learning to Rank, published at CIKM '20 by Vardasbi et al. [123].
• Keeping Dataset Biases out of the Simulation: A Debiased Simulator for Reinforcement Learning based Recommender Systems, published at RecSys '20 by Huang et al. [47].

Furthermore, other work helped with gaining broader research insights, without being directly related to the thesis topic:

• Semantic Video Trailers by Oosterhuis et al. [89].
• Optimizing Interactive Systems with Data-Driven Objectives by Li et al. [73].
• Actionable Interpretability through Optimizable Counterfactual Explanations for Tree Ensembles by Lucic et al. [77].
Part I

Novel Online Methods for Learning and Evaluating

2 Sensitive and Scalable Online Evaluation with Theoretical Guarantees

(This chapter was published as [81]. Appendix 2.A gives a reference for the notation used in this chapter.)
Multileaved comparison methods generalize interleaved comparison methods to provide a scalable approach for comparing ranking systems based on regular user interactions. Such methods enable the increasingly rapid research and development of search engines. However, existing multileaved comparison methods that provide reliable outcomes do so by degrading the user experience during evaluation. Conversely, current multileaved comparison methods that maintain the user experience cannot guarantee correctness. In this chapter, we address the following thesis research question:
RQ1
Does the effectiveness of online evaluation methods scale to large comparisons?
Our answer comes in the form of a two-fold contribution. First, we propose a theoretical framework for systematically comparing multileaved comparison methods using the notions of considerateness, which concerns maintaining the user experience, and fidelity, which concerns reliably correct outcomes. Second, we introduce a novel multileaved comparison method, Pairwise Preference Multileaving (PPM), that performs comparisons based on document-pair preferences, and prove that it is considerate and has fidelity. We show empirically that, compared to previous multileaved comparison methods, PPM is more sensitive to user preferences and that it scales well with the number of rankers being compared.

2.1 Introduction
Evaluation is of tremendous importance to the development of modern search engines. Any proposed change to the system should be verified to ensure it is a true improvement. Online approaches to evaluation aim to measure the actual utility of an Information Retrieval (IR) system in a natural usage environment [45]. Interleaved comparison methods are a within-subject setup for online experimentation in IR. For interleaved comparisons, two experimental conditions ("control" and "treatment") are typical. Recently, multileaved comparisons have been introduced for the purpose of efficiently comparing large numbers of rankers [12, 108]. These multileaved comparison methods were introduced as an extension to interleaving, and the majority are directly derived from their interleaving counterparts [108, 109]. The effectiveness of these methods has thus far only been measured using simulated experiments on public datasets. While this gives some insight into the general sensitivity of a method, there is no work that assesses under what circumstances these methods provide correct outcomes and when they break. Without knowledge of the theoretical properties of multileaved comparison methods, we are unable to identify when their outcomes are reliable.

In prior work on interleaved comparison methods, a theoretical framework has been introduced that provides explicit requirements that an interleaved comparison method should satisfy [44]. We take this approach as our starting point and adapt and extend it to the setting of multileaved comparison methods. Specifically, the notion of fidelity is central to the previous work of Hofmann et al. [44]; Section 2.3 describes the framework with its requirements of fidelity. In the setting of multileaved comparison methods, this means that a multileaved comparison method should always recognize an unambiguous winner of a comparison. We also introduce a second notion, considerateness, which says that a comparison method should not degrade the user experience, e.g., by allowing all possible permutations of documents to be shown to the user. In this chapter we examine all existing multileaved comparison methods and find that none satisfy both the considerateness and fidelity requirements. In other words, no existing multileaved comparison method is correct without sacrificing the user experience.

To address this gap, we propose a novel multileaved comparison method, Pairwise Preference Multileaving (PPM). PPM differs from existing multileaved comparison methods in that its comparisons are based on inferred pairwise document preferences, whereas existing multileaved comparison methods use either some form of document assignment [108, 109] or click credit functions [12, 108]. We prove that PPM meets both the considerateness and the fidelity requirements; thus PPM guarantees correct winners in unambiguous cases while maintaining the user experience at all times. Furthermore, we show empirically that PPM is more sensitive than existing methods, i.e., it makes fewer errors in the preferences it finds. Finally, unlike other multileaved comparison methods, PPM is computationally efficient and scalable, meaning that it maintains most of its sensitivity as the number of rankers in a comparison increases.

In this chapter we address thesis research question RQ1 by answering the following more specific research questions:
RQ2.1
Does PPM meet the fidelity and considerateness requirements?
RQ2.2
Is PPM more sensitive than existing methods when comparing multiple rankers?

To summarize, our contributions in this chapter are:

1. A theoretical framework for comparing multileaved comparison methods;
2. A comparison of all existing multileaved comparison methods in terms of considerateness, fidelity and sensitivity;
3. A novel multileaved comparison method that is considerate, has fidelity, and is more sensitive than existing methods.

2.2 Related Work

Evaluation of information retrieval systems is a core problem in IR. Two types of approach are common in designing reliable methods for measuring an IR system's effectiveness. Offline approaches such as the Cranfield paradigm [104] are effective for measuring topical relevance, but have difficulty taking into account contextual information, including the user's current situation, fast-changing information needs, and past interaction history with the system [45]. In contrast, online approaches to evaluation aim to measure the actual utility of an IR system in a natural usage environment. User feedback in online evaluation is usually implicit, in the form of clicks, dwell time, etc.

By far the most common type of controlled experiment on the web is A/B testing [65, 66]. This is a classic between-subject experiment, where each subject is exposed to one of two conditions: control – the current system – and treatment – an experimental system that is assumed to outperform the control.

An alternative experimental design uses a within-subject setup, where all study participants are exposed to both experimental conditions. Interleaved comparisons [54, 99] have been developed specifically for online experimentation in IR. Interleaved comparison methods have two main ingredients. First, a method for constructing interleaved result lists specifies how to select documents from the original rankings ("control" and "treatment"). Second, a method for inferring comparison outcomes, based on observed user interactions with the interleaved result list. Because of their within-subject nature, interleaved comparisons can be up to two orders of magnitude more efficient than A/B tests in effective sample size for studies of comparable dependent variables [18].

For interleaved comparisons, two experimental conditions are typical. Extensions to multiple conditions have been introduced by Schuth et al. [108]. Such multileaved comparisons are an efficient online evaluation method for comparing multiple rankers simultaneously. Similar to interleaved comparison methods [41, 56, 96, 99], a multileaved comparison infers preferences between rankers. Interleaved comparisons do this by presenting users with interleaved result lists; these represent two rankers in such a way that a preference between the two can be inferred from clicks on their documents. Similarly, for multileaved comparisons, multileaved result lists are created that allow more than two rankers to be represented in the result list. As a consequence, multileaved comparisons can infer preferences between multiple rankers from a single click. Due to this property, multileaved comparisons require far fewer interactions than interleaved comparisons to achieve the same accuracy when multiple rankers are involved [108, 109].

The general approach shared by all multileaved comparison methods is described in Algorithm 2.1; here, a comparison of a set of rankers R is performed over T user interactions. After the user submits a query q to the system (Line 4), a ranking l_i is generated for each ranker r_i in R (Line 6).
These rankings are then combined into a single result list by the multileaving method (Line 7); we refer to the resulting list m as the multileaved result list. In theory, a multileaved result list could contain the entire document set; in practice, however, a length k is chosen beforehand, since users generally only view a restricted number of result pages. This multileaved result list is presented to the user, who has the choice to interact with it or not. Any interactions are recorded in c and returned to the system (Line 8). While c could contain any interaction information [63], in practice multileaved comparison methods only consider clicks. Preferences between the rankers in R can be inferred from the interactions, and the preference matrix P is updated accordingly (Line 11). By aggregating the inferred preferences over many interactions, a multileaved comparison method can detect preferences of users between the rankers in R. Thus it provides a method of evaluation that does not require any form of explicit annotation.

Algorithm 2.1: General pipeline for multileaved comparisons.

1:  Input: set of rankers R, documents D, no. of timesteps T
2:  P ← 0                                  // initialize |R| × |R| preference matrix
3:  for t = 1, ..., T do
4:      q_t ← wait_for_user()              // receive query from user
5:      for i = 1, ..., |R| do
6:          l_i ← r_i(q_t, D)              // create ranking for query per ranker
7:      m_t ← combine_lists(l_1, ..., l_|R|)  // combine into multileaved list
8:      c ← display(m_t)                   // display to user and record interactions
9:      for i = 1, ..., |R| do
10:         for j = 1, ..., |R| do
11:             P_ij ← P_ij + infer(i, j, c, m_t)  // infer preference between rankers
12: return P

By instantiating the general pipeline shown in Algorithm 2.1, i.e., the combination method at Line 7 and the inference method at Line 11, we obtain a specific multileaved comparison method. We detail all known multileaved comparison methods in Section 2.4 below.
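A runnable sketch of Algorithm 2.1; combine_lists and infer are the two method-specific ingredients, and display stands in for the user:

```python
import numpy as np

def multileaved_comparison(rankers, documents, queries,
                           combine_lists, infer, display, k=10):
    """Generic multileaving loop of Algorithm 2.1: returns the preference
    matrix P, where P[i, j] > 0 indicates a preference for ranker i over j.

    rankers:       list of functions (query, documents) -> ranked doc list.
    combine_lists: method-specific combination of rankings into one list.
    infer:         method-specific preference signal between rankers i and j.
    display:       stand-in for the user: multileaved list -> clicked ranks.
    """
    n = len(rankers)
    P = np.zeros((n, n))
    for query in queries:  # one interaction per incoming query
        rankings = [r(query, documents) for r in rankers]
        multileaved = combine_lists(rankings)[:k]
        clicks = display(multileaved)
        for i in range(n):
            for j in range(n):
                P[i, j] += infer(i, j, clicks, multileaved)
    return P
```

Any concrete multileaved comparison method, including the PPM method introduced in this chapter, is obtained by plugging in its own combine_lists and infer.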
What we add on top of the work discussed above is a theoretical framework that allows us to assess and compare multileaved comparison methods. In addition, we propose an accurate and scalable multileaved comparison method that is the only one to satisfy the properties specified in our theoretical framework, and that also proves to be the most efficient multileaved comparison method in terms of its much reduced data requirements.

2.3 A Framework for Assessing Multileaved Comparison Methods

Before we introduce a novel multileaved comparison method in Section 2.5, we propose two theoretical requirements for multileaved comparison methods. These theoretical requirements will allow us to assess and compare existing multileaved comparison methods. Specifically, we introduce two theoretical properties: considerateness and fidelity. These properties guarantee correct outcomes in unambiguous cases while always maintaining the user experience. In Section 2.4 we show that no currently available multileaved comparison method satisfies both properties. This motivates the introduction of a method that satisfies both properties in Section 2.5.

2.3.1 Considerateness

Firstly, one of the most important properties of a multileaved comparison method is how considerate it is. Since evaluation is done online, it is important that the search experience is not substantially altered [54, 96]. In other words, users should not be obstructed in performing their search tasks during evaluation. As maintaining a user base is at the core of any search engine, methods that potentially degrade the user experience are generally avoided. Therefore, we set the following requirement: the displayed multileaved result list should never show a document d at a rank i if every ranker in R places it at a lower rank. Writing r(d, l_j) for the rank of d in the ranking l_j produced by ranker r_j, this boils down to:

$$ m_i = d \;\rightarrow\; \exists r_j \in R, \; r(d, l_j) \leq i. \tag{2.1} $$

Requirement (2.1) guarantees that a document can never be displayed higher in a multileaved result list than any ranker would display it. In addition, it guarantees that if all rankers agree on the top n documents, the resulting multileaved result list m will display the same top n.
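Requirement (2.1) is straightforward to state as a check; a small sketch, assuming all rankings cover the same document set:

```python
def is_considerate(multileaved, rankings):
    """Check Requirement (2.1): a document may appear at rank i (1-based)
    only if at least one input ranking places it at rank i or higher."""
    for i, doc in enumerate(multileaved, start=1):
        if not any(ranking.index(doc) + 1 <= i for ranking in rankings):
            return False
    return True
```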
Fidelity was introducedby Hofmann et al. [44] and describes two general cases in which the preference betweentwo rankers is unambiguous. To have fidelity the expected outcome of a method isrequired to be correct in all matching cases. However, the original notion of fidelity only considers two rankers as it was introduced for interleaved comparison methods,therefore the definition of fidelity must be expanded to the multileaved case. First wedescribe the following concepts:
Uncorrelated clicks
Clicks are considered uncorrelated if relevance has no influenceon the likelihood that a document is clicked. We write r ( d i , m ) for the rank of document d i in multileaved result list m and P ( c l | q, m l = d i ) for the probability of a click atthe rank l at which d i is displayed: l = r ( d i , m ) . Then, for a given query q uncorrelated ( q ) ⇔ ∀ l, ∀ d i,j , P ( c l | q, m l = d i ) = P ( c l | q, m l = d j ) . (2.2) Correlated clicks
We consider clicks correlated if there is a positive correlationbetween document relevance and clicks. However we differ from Hofmann et al. [44]by introducing a variable k that denotes at which rank users stop considering documents.Writing P ( c i | rel ( m i , q )) for the probability of a click at rank i if a document relevant19 . Sensitive and Scalable Online Evaluation with Theoretical Guarantees to query q is displayed at this rank, we set correlated ( q, k ) ⇔∀ i ≥ k, P ( c i ) = 0 ∧ ∀ i < k, P ( c i | rel ( m i , q )) > P ( c i | ¬ rel ( m i , q )) . (2.3)Thus under correlated clicks a relevant document is more likely to be clicked than anon-relevant one at the same rank, if they appear above rank k . Pareto domination
Ranker r_i Pareto dominates ranker r_j if all relevant documents are ranked at least as high by r_i as by r_j, and r_i ranks at least one relevant document higher. Writing rel for the set of relevant documents that are ranked above k by at least one ranker, i.e., rel = {d | rel(d, q) ∧ ∃r_n ∈ R, r(d, l_n) ≤ k}, we require that the following holds for every query q and any rank k:

  Pareto(r_i, r_j, q, k) ⇔ ∀d ∈ rel, r(d, l_i) ≤ r(d, l_j) ∧ ∃d ∈ rel, r(d, l_i) < r(d, l_j).   (2.4)

Then, fidelity for multileaved comparison methods is defined by the following two requirements:

1. Under uncorrelated clicks the expected outcome may find no preferences between any two rankers in R:

  ∀q, ∀(r_i, r_j) ∈ R, uncorrelated(q) ⇒ E[P_ij | q] = 0.   (2.5)

2. Under correlated clicks, a ranker that Pareto dominates all other rankers must win the multileaved comparison in expectation:

  ∀k, ∀q, ∀r_i ∈ R, (correlated(q, k) ∧ ∀r_j ∈ R, i ≠ j → Pareto(r_i, r_j, q, k)) ⇒ (∀r_j ∈ R, i ≠ j → E[P_ij | q] > 0).   (2.6)

Note that for the case where |R| = 2 and only k = |D| is considered, these requirements are the same as for interleaved comparison methods [44]. The k parameter was added to allow for fidelity in considerate methods, since it is impossible to detect preferences at ranks that users never consider without breaking the considerateness requirement. We argue that differences at ranks that users are not expected to observe should not affect comparison outcomes. Fidelity is important for a multileaved comparison method as it ensures that an unambiguous winner is expected to be identified. Additionally, the first requirement ensures that, in expectation, no preferences are inferred when clicks are unaffected by relevancy.
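The Pareto-domination condition of Equation 2.4 translates directly into code. The sketch below is illustrative only and simplifies the rel set to the two rankings being compared (the full definition quantifies over all rankers in R); ranks are 1-based and documents missing from a list are treated as ranked last:

def pareto_dominates(l_i, l_j, relevant, k):
    """Eq. 2.4: l_i Pareto dominates l_j w.r.t. relevant documents that
    at least one of the two lists ranks at or above cutoff k."""
    rank = lambda d, l: l.index(d) + 1 if d in l else len(l) + 1
    rel_k = [d for d in relevant if min(rank(d, l_i), rank(d, l_j)) <= k]
    no_worse = all(rank(d, l_i) <= rank(d, l_j) for d in rel_k)
    strictly_better = any(rank(d, l_i) < rank(d, l_j) for d in rel_k)
    return no_worse and strictly_better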
In addition to the two theoretical properties listed above, considerateness and fidelity, we also scrutinize multileaved comparison methods to determine whether they accurately find preferences between all rankers in R and minimize the number of user impressions required to do so. This empirical property is commonly known as sensitivity [44, 108]. In Section 2.6 we describe experiments that are aimed at comparing the sensitivity of multileaved comparison methods. Here, two aspects of every comparison are considered: the level of error at which a method converges and the number of impressions required to reach that level. Thus, a multileaved comparison method that learns faster initially but does not reach the same final level of error is deemed worse.

2.4. An Assessment of Existing Multileaved Comparison Methods
We briefly examine all existing multileaved comparison methods to determine whether they meet the considerateness and fidelity requirements. An investigation of the empirical sensitivity requirement is postponed until Sections 2.6 and 2.7.
Team-Draft Multileaving (TDM) was introduced by Schuth et al. [108] and is based on the previously proposed Team Draft Interleaving (TDI) [99]. Both methods are inspired by how team assignments are often chosen for friendly sports matches. The multileaved result list is created by sequentially sampling rankers without replacement; the first sampled ranker places its top document at the first position of the multileaved list. Subsequently, the next sampled ranker adds its top pick of the remaining documents. When all rankers have been sampled, the process continues by sampling from the entire set of rankers again. The method stops when all documents have been added. When a document is clicked, TDM assigns the click to the ranker that contributed the document. For each impression, binary preferences are inferred by comparing the number of clicks each ranker received.

It is clear that TDM is considerate, since each added document is the top pick of at least one ranker. However, TDM does not meet the fidelity requirements. This is unsurprising, as previous work has proven that TDI does not meet these requirements [41, 44, 96]. Since TDI is identical to TDM when the number of rankers is |R| = 2, TDM does not have fidelity either.
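To make the construction concrete, here is a minimal sketch of the team-draft procedure (a hedged illustration: the helper name team_draft_multileave is ours, and we assume every ranking is a permutation of the same document set):

import random

def team_draft_multileave(lists, rng=random):
    """TDM construction: rankers are drafted in random order; each drafted
    ranker contributes its highest-ranked document that is not yet placed."""
    length = len(lists[0])
    m, used = [], set()
    while len(m) < length:
        order = list(range(len(lists)))
        rng.shuffle(order)                 # start a new draft round
        for r in order:
            if len(m) == length:
                break
            top = next(d for d in lists[r] if d not in used)
            m.append(top)
            used.add(top)
    return m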
Optimized Multileaving (OM) was proposed by Schuth et al. [108] and serves as an extension of Optimized Interleaving (OI) introduced by Radlinski and Craswell [96]. The allowed multileaved result lists of OM are created by sampling rankers with replacement at each iteration and adding the top document of the sampled ranker. However, the probability that a multileaved result list is shown is not determined by this generative process. Instead, for a chosen credit function OM performs an optimization that computes a probability for each multileaved result list, so that the expected outcome is unbiased and sensitive to correct preferences.

All of the allowed multileaved result lists of OM meet the considerateness requirement, and in theory instantiations of OM could have fidelity. However, in practice OM does not meet the fidelity requirements. There are two main reasons for this. First, it is not guaranteed that a solution exists for the optimization that OM performs. For the interleaving case this was proven empirically when k = 10 [96]; however, this approach does not scale to any number of rankers. Second, unlike OI, OM allows more result lists than can be computed in a feasible amount of time. Consider the top k of all possible multileaved result lists; in the worst case this produces |R|^k lists. Computing all lists for a large value of |R| and performing linear constraint optimization over them is simply not feasible. As a solution, Schuth et al. [108] propose a method that samples from the allowed multileaved result lists and relaxes constraints when there is no exact solution. Consequently, there is no guarantee that this method does not introduce bias. Together, these two reasons show that the fidelity of OI does not imply fidelity of OM. They also show that OM is computationally very costly.

Probabilistic Multileaving (PM) [109] is an extension of Probabilistic Interleaving (PI) [41], which was designed to solve the flaws of TDI. Unlike the previous methods, PM considers every ranker as a distribution over documents, which is created by applying a soft-max to each of them. A multileaved result list is created by sampling a ranker with replacement at each iteration and sampling a document from the selected ranker. After the sampled document has been added, all rankers are renormalized to account for the removed document. During inference PM credits every ranker with the expected number of clicked documents that were assigned to it. This is done by marginalizing over the possible ways the list could have been constructed by PM. A benefit of this approach is that it allows for comparisons on historical data [41, 44].

A big disadvantage of PM is that it allows any possible ranking to be shown, albeit not with uniform probabilities. This is a big deterrent for the usage of PM in operational settings. Furthermore, it also means that PM does not meet the considerateness requirement. On the other hand, PM does meet the fidelity requirements; the proof follows from the fact that every ranker is equally likely to add a document at each location in the ranking. Moreover, if multiple rankers want to place the same document somewhere, they have to share the resulting credits. Similar to OM, PM becomes infeasible to compute for a large number of rankers |R|; the number of assignments in the worst case is |R|^k. Fortunately, PM inference can be estimated by sampling assignments in a way that maintains fidelity [90, 109].
Sample-Scored-Only Multileaving (SOSM) was introduced by Brost et al. [12] in an attempt to create a more scalable multileaved comparison method. It is the only existing multileaved comparison method that does not have an interleaved comparison counterpart. SOSM attempts to increase sensitivity by ignoring all non-sampled documents during inference. Thus, at each impression a ranker receives credits according to how it ranks the documents that were sampled for the displayed multileaved result list of size k. The preferences at each impression are made binary before being added to the mean. (Notably, Brost et al. [12] proved that if the preferences at each impression are made binary, the fidelity of PM is lost.)

Table 2.1: Overview of multileaved comparison methods and whether they meet the considerateness and fidelity requirements.
        Considerateness   Fidelity   Source
TDM     ✓                 –          [108]
OM      ✓                 –          [108]
PM      –                 ✓          [109]
SOSM    ✓                 –          [12]
PPM     ✓                 ✓          this chapter
SOSM creates multileaved result lists following the same procedure as TDM, a choice that seems arbitrary. SOSM meets the considerateness requirement for the same reason TDM does. However, SOSM does not meet the fidelity requirements. We can prove this by providing an example where preferences are found under uncorrelated clicks. Consider the two documents A and B and the three rankers with the following rankings:

  l_1 = AB,  l_2 = l_3 = BA.

The first requirement of fidelity states that under uncorrelated clicks no preferences may be found in expectation. Uncorrelated clicks are unconditioned on document relevance (Equation 2.2); however, it is possible that they display position bias [134]. Thus the probability of a click at the first rank may be greater than at the second:

  P(c_1 | q) > P(c_2 | q).

Under position-biased clicks the expected outcome for each possible multileaved result list is not zero. For instance, the following preferences are expected:

  E[P_12 | m = AB] > 0,
  E[P_12 | m = BA] < 0,
  E[P_12 | m = AB] = −E[P_12 | m = BA].

Since SOSM creates multileaved result lists following the TDM procedure, the probability P(m = BA) is twice as high as P(m = AB). As a consequence, the expected preference is biased against the first ranker:

  E[P_12] < 0.

Hence, SOSM does not have fidelity. This outcome seems to stem from a disconnect between how multileaved result lists are created and how preferences are inferred.
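As a quick numerical check of this counter-example: with only two documents, the team-draft procedure lets the first drafted ranker fix the entire list, so the creation probabilities can be estimated directly (a self-contained sketch under that simplification):

import random
from collections import Counter

def first_draft_outcome(lists, rng=random):
    # with only two documents, the first drafted ranker determines the list
    top = lists[rng.randrange(len(lists))][0]
    other = next(d for d in lists[0] if d != top)
    return (top, other)

random.seed(1)
lists = [('A', 'B'), ('B', 'A'), ('B', 'A')]    # l_1 = AB, l_2 = l_3 = BA
counts = Counter(first_draft_outcome(lists) for _ in range(30000))
print(counts[('B', 'A')] / counts[('A', 'B')])  # approx. 2: P(m=BA) = 2 P(m=AB)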
To conclude this section, Table 2.1 provides an overview of our findings thus far, i.e., the theoretical requirements that each multileaved comparison method satisfies; we have also included PPM, the multileaved comparison method that we will introduce below.

2.5. A Novel Multileaved Comparison Method
The previously described multileaved comparison methods are based around direct credit assignment, i.e., credit functions are based on single documents. In contrast, we introduce a method that estimates differences based on pairwise document preferences. We prove that this novel method is the only multileaved comparison method that meets the considerateness and fidelity requirements set out in Section 2.3.

The multileaved comparison method that we introduce is Pairwise Preference Multileaving (PPM). It infers pairwise preferences between documents from clicks and bases comparisons on the agreement of rankers with the inferred preferences. PPM is based on the assumption that a clicked document is preferred to: (i) all of the unclicked documents above it; and (ii) the next unclicked document. These assumptions are long-established [55] and form the basis of pairwise Learning to Rank (LTR) [54]. We write c_{r(d_i, m)} for a click on document d_i displayed in multileaved result list m at rank r(d_i, m). For a document pair (d_i, d_j), a click c_{r(d_i, m)} infers a preference as follows:

  c_{r(d_i, m)} ∧ ¬c_{r(d_j, m)} ∧ ((∃x, (c_x ∧ r(d_j, m) < x)) ∨ c_{r(d_j, m)−1}) ⇔ d_i >_c d_j.   (2.7)

In addition, the preference of a ranker r is denoted by d_i >_r d_j. Pairwise preferences also form the basis for Preference-Based Balanced Interleaving (PBI) introduced by He et al. [38]. However, previous work has shown that PBI does not meet the fidelity requirements [44]. Therefore, we do not use PBI as a starting point for PPM. Instead, PPM is derived directly from the considerateness and fidelity requirements. Consequently, PPM constructs multileaved result lists inherently differently, and its inference method has fidelity, in contrast with PBI.

When constructing a multileaved result list m we want to be able to infer unbiased preferences while simultaneously being considerate. Thus, with the requirement for considerateness in mind, we define a choice set as:

  Ω(i, R, D) = {d | d ∈ D ∧ ∃r_j ∈ R, r(d, l_j) ≤ i}.   (2.8)

This definition is chosen so that any document in Ω(i, R, D) can be placed at rank i without breaking the considerateness requirement (Equation 2.1). The multileaving method of PPM is described in Algorithm 2.2:

Algorithm 2.2 Multileaved result list construction for PPM.
1: Input: set of rankers R, rankings {l_1, ..., l_{|R|}}, documents D.
2: m ← []                      // initialize empty multileaving
3: for n = 1, ..., |D| do
4:   Ω̂_n ← Ω(n, R, D) \ m     // choice set of remaining documents
5:   d ← uniform_sample(Ω̂_n)  // uniformly sample next document
6:   m ← append(m, d)          // add sampled document to multileaving
7: return m

The approach is straightforward: at each rank n the set of documents Ω̂_n is determined (Line 4). This set is Ω(n, R, D) with the previously added documents removed, to avoid document repetition. Then, the next document is sampled uniformly from Ω̂_n (Line 5); thus every document in Ω̂_n has a probability

  1 / (|Ω(n, R, D)| − n + 1)   (2.9)

of being placed at position n (Line 6). Since Ω̂_n ⊆ Ω(n, R, D), the resulting m is guaranteed to be considerate.
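The construction translates almost directly into Python. The sketch below is illustrative (1-based ranks, and every document is assumed to appear in every ranking); note that |Ω̂_n| equals |Ω(n, R, D)| − n + 1, matching the denominator of Equation 2.9:

import random

def choice_set(i, lists):
    """Omega(i, R, D) of Eq. 2.8: documents some ranker places at rank <= i."""
    return {d for l in lists for d in l[:i]}

def ppm_multileave(lists, rng=random):
    """Algorithm 2.2: uniformly sample each position from the choice set
    of rank n, minus the documents that were already placed."""
    n_docs = len(lists[0])
    m = []
    for n in range(1, n_docs + 1):
        candidates = sorted(choice_set(n, lists) - set(m))
        m.append(rng.choice(candidates))  # uniform over |Omega(n)| - n + 1 docs
    return m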
While the multileaved result list creation method used by PPM is simple, its preference inference method is more complicated, as it has to meet the fidelity requirements. First, the preference found between rankers r_n and r_m from a single interaction c is determined by:

  P_nm = Σ_{d_i >_c d_j} φ(d_i, d_j, r_n, m, R) − φ(d_i, d_j, r_m, m, R),   (2.10)

which sums over all document pairs (d_i, d_j) for which interaction c inferred a preference. Before the scoring function φ can be defined, we introduce the following function:

  r̄(i, j, R) = max_{d ∈ {d_i, d_j}} min_{r_n ∈ R} r(d, l_n).   (2.11)

For succinctness we write r̄(i, j) = r̄(i, j, R). Here, r̄(i, j) provides the highest rank at which both documents d_i and d_j can appear in m. Position r̄(i, j) is important to the document pair (d_i, d_j), since if both documents are in the remaining documents Ω̂_{r̄(i, j)}, then the rest of the multileaved result list creation process is identical for both. To keep notation short we introduce:

  r̄(i, j, m) = min_{d ∈ {d_i, d_j}} r(d, m).   (2.12)

Therefore, if r̄(i, j, m) ≥ r̄(i, j) then both documents appear below r̄(i, j). This, in turn, means that both documents are equally likely to appear at any rank:

  ∀n, P(m_n = d_i | r̄(i, j, m) ≥ r̄(i, j)) = P(m_n = d_j | r̄(i, j, m) ≥ r̄(i, j)).   (2.13)

The scoring function φ is then defined as follows:

  φ(d_i, d_j, r, m) =
    0                                  if r̄(i, j, m) < r̄(i, j),
    −P(r̄(i, j, m) ≥ r̄(i, j))^{−1}     if d_i <_r d_j,
    +P(r̄(i, j, m) ≥ r̄(i, j))^{−1}     if d_i >_r d_j,   (2.14)

indicating that a zero score is given if one of the documents appears above r̄(i, j). Otherwise, the value of φ is positive or negative depending on whether the ranker r agrees with the inferred preference between d_i and d_j. Furthermore, this score is inversely weighted by the probability P(r̄(i, j, m) ≥ r̄(i, j)). Therefore, pairs that are less likely to appear below their threshold r̄(i, j) result in a higher score than more commonly occurring pairs. Algorithm 2.3 displays how the inference of PPM can be computed:

Algorithm 2.3 Preference inference for PPM.
1:  Input: rankers R, rankings {l_1, ..., l_{|R|}}, documents D, multileaved result list m, clicks c.
2:  P ← 0                                       // preference matrix of |R| × |R|
3:  for (d_i, d_j) ∈ {(d_i, d_j) | d_i >_c d_j} do
4:    if r̄(i, j, m) ≥ r̄(i, j) then
5:      w ← 1                                   // variable to store P(r̄(i, j, m) ≥ r̄(i, j))
6:      min_x ← min_{d ∈ {d_i, d_j}} min_{r_n ∈ R} r(d, l_n)
7:      for x = min_x, ..., r̄(i, j) − 1 do
8:        w ← w · (1 − (|Ω(x, R, D)| − x + 1)^{−1})
9:      for n = 1, ..., |R| do
10:       for m = 1, ..., |R| do
11:         if d_i >_{r_n} d_j ∧ n ≠ m then
12:           P_nm ← P_nm + w^{−1}              // result of scoring function φ
13:         else if n ≠ m then
14:           P_nm ← P_nm − w^{−1}
15: return P
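The probability P(r̄(i, j, m) ≥ r̄(i, j)) that Algorithm 2.3 accumulates in w has a simple closed form: at every rank x between the first rank at which one of the two documents becomes available (min_x) and r̄(i, j), exactly one document of the pair is in the choice set, and it avoids being drawn with probability 1 − 1/(|Ω(x, R, D)| − x + 1). A hedged sketch (the helper name and the omega_sizes argument, a mapping from rank x to |Ω(x, R, D)|, are ours):

def pair_weight(min_x, r_bar, omega_sizes):
    """P(rbar(i,j,m) >= rbar(i,j)) as accumulated by Algorithm 2.3.
    omega_sizes[x] must give |Omega(x, R, D)| for 1-based ranks x."""
    w = 1.0
    for x in range(min_x, r_bar):
        w *= 1.0 - 1.0 / (omega_sizes[x] - x + 1)
    return w

# phi then credits agreeing rankers with +1/w and disagreeing ones with -1/w.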
The scoring function φ was carefully chosen to guarantee fidelity; the remainder of this section sketches the proof that PPM meets its requirements. The two requirements for fidelity are discussed in order.

Requirement 1

The first fidelity requirement states that under uncorrelated clicks the expected outcome should be zero. Consider the expected preference:

  E[P_nm] = Σ_{d_i, d_j} Σ_m P(d_i >_c d_j | m) P(m) (φ(d_i, d_j, r_n, m) − φ(d_i, d_j, r_m, m)).   (2.15)

To see that E[P_nm] = 0 under uncorrelated clicks, take any multileaving m where P(m) > 0 and φ(d_i, d_j, r, m) ≠ 0, with m_x = d_i and m_y = d_j. Then there is always a multileaved result list m′ that is identical except for swapping the two documents, so that m′_x = d_j and m′_y = d_i. The scoring function only gives non-zero values if both documents appear below the threshold, i.e., r̄(i, j, m) ≥ r̄(i, j) (Equation 2.14). At this point the probability of each document appearing at any position is the same (Equation 2.13), thus the following holds:

  P(m) = P(m′),   (2.16)
  φ(d_i, d_j, r_n, m) = −φ(d_j, d_i, r_n, m′).   (2.17)

Finally, from the definition of uncorrelated clicks (Equation 2.2) the following holds:

  P(d_i >_c d_j | m) = P(d_j >_c d_i | m′).   (2.18)

As a result, the contribution of any document pair (d_i, d_j) and multileaving m to the expected outcome is cancelled by the multileaving m′. Therefore, we can conclude that E[P_nm] = 0 under uncorrelated clicks, and that PPM meets the first requirement of fidelity.

Requirement 2
The second fidelity requirement states that under correlated clicks a ranker that Pareto dominates all other rankers should win the multileaved comparison. Therefore, the expected value for a Pareto dominating ranker r_n should be:

  ∀m, n ≠ m → E[P_nm] > 0.   (2.19)

Take any other ranker r_m, which is thus Pareto dominated by r_n. The proof for the first requirement shows that E[P_nm] is not affected by any pair of documents d_i, d_j with the same relevance label. Furthermore, any pair on which r_n and r_m agree will not affect the expected outcome, since:

  (d_i >_{r_n} d_j ↔ d_i >_{r_m} d_j) ⇒ φ(d_i, d_j, r_n, m) − φ(d_i, d_j, r_m, m) = 0.   (2.20)

Then, for any relevant document d_i, consider the set of documents that r_n incorrectly prefers over d_i:

  A = {d_j | ¬rel(d_j) ∧ d_j >_{r_n} d_i}   (2.21)

and the set of documents that r_m incorrectly prefers over d_i and places higher than where r_n places d_i:

  B = {d_j | ¬rel(d_j) ∧ d_j >_{r_m} d_i ∧ r(d_j, l_m) < r(d_i, l_n)}.   (2.22)

Since r_n Pareto dominates r_m, it has the same or fewer incorrect preferences: |A| ≤ |B|. Furthermore, for any document d_j in either A or B the threshold of the pair d_i, d_j is the same:

  ∀d_j ∈ A ∪ B, r̄(i, j) = r(d_i, l_n).   (2.23)

Therefore, all pairs with documents from A and B will only get a non-zero value from φ if both documents appear at or below r(d_i, l_n). Then, using Equation 2.13 and Bayes' rule we see:

  ∀(d_j, d_l) ∈ A ∪ B,
  P(m_x = d_j, r̄(i, j, m) ≥ r̄(i, j, R)) / P(r̄(i, j, m) ≥ r̄(i, j, R)) = P(m_x = d_l, r̄(i, l, m) ≥ r̄(i, l, R)) / P(r̄(i, l, m) ≥ r̄(i, l, R)).   (2.24)

Similarly, the reweighing of φ ensures that every pair in A and B contributes the same to the expected outcome. Thus, if both rankers rank d_i at the same position, the sum:

  Σ_{d_j ∈ A ∪ B} Σ_m P(m) · [P(d_i >_c d_j | m)(φ(d_i, d_j, r_n, m) − φ(d_i, d_j, r_m, m)) + P(d_j >_c d_i | m)(φ(d_j, d_i, r_n, m) − φ(d_j, d_i, r_m, m))]   (2.25)

will be zero if |A| = |B| and positive if |A| < |B| under correlated clicks. Moreover, since r_n Pareto dominates r_m, there will be at least one document pair where:

  ∃d_i, ∃d_j, rel(d_i) ∧ ¬rel(d_j) ∧ r(d_i, l_n) = r(d_j, l_m).   (2.26)

This means that the expected outcome (Equation 2.15) will always be positive under correlated clicks, i.e., E[P_nm] > 0, for a Pareto dominating ranker r_n and any other ranker r_m.

In summary, we have introduced a new multileaved comparison method, PPM. Furthermore, we answered RQ2.1 in the affirmative, since we have shown it to be considerate and to have fidelity. We further note that PPM has polynomial complexity: to calculate P(r̄(i, j, m) ≥ r̄(i, j)), only the sizes of the choice sets Ω and the first positions at which d_i and d_j occur in Ω have to be known.

2.6. Experiments

In order to answer Research Question
RQ2.2, posed in Section 2.1, several experiments were performed to evaluate the sensitivity of PPM. The methodology of our evaluation follows previous work on interleaved and multileaved comparison methods [12, 41, 44, 108, 109] and is completely reproducible.
In order to make fair comparisons between rankers, we use the LTR datasets described in Section 2.6.2 below. From the feature representations in these datasets a handpicked set of features was taken and used as ranking models. To match the real-world scenario as closely as possible, this selection consists of features that are known to perform well as relevance signals independently. This selection includes, but is not limited to: BM25, LMIR.JM, Sitemap, PageRank, HITS and TF.IDF [108].

The ground-truth comparisons between the rankers are based on their NDCG scores computed on a held-out test set, resulting in a binary preference matrix P_nm for all ranker pairs (r_n, r_m):

  P_nm = NDCG(r_n) − NDCG(r_m).   (2.27)

The metric by which multileaved comparison methods are compared is the binary error, E_bin [12, 108, 109]. Let P̂_nm be the preference inferred by a multileaved comparison method; then the error is:

  E_bin = ( Σ_{n,m ∈ R, n ≠ m} 1[ sgn(P̂_nm) ≠ sgn(P_nm) ] ) / ( |R| · (|R| − 1) ).   (2.28)
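For reference, Equation 2.28 amounts to a few lines of NumPy (a sketch; P_hat and P_true are the |R| × |R| inferred and ground-truth matrices):

import numpy as np

def binary_error(P_hat, P_true):
    """E_bin of Eq. 2.28: fraction of ordered ranker pairs (n, m), n != m,
    whose inferred preference sign disagrees with the ground truth."""
    n = P_true.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    disagree = np.sign(P_hat) != np.sign(P_true)
    return np.sum(disagree & off_diag) / (n * (n - 1))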
Our experiments are performed over ten publicly available LTR datasets with varying sizes, representing different search tasks. Each dataset consists of a set of queries and a set of corresponding documents for every query. While queries are represented only by their identifiers, feature representations and relevance labels are available for every document-query pair. Relevance labels are graded differently by the datasets depending on the task they model; for instance, navigational datasets have binary labels for not relevant (0) and relevant (1), whereas most informational tasks have labels ranging from not relevant (0) to perfect relevancy (4). Every dataset consists of five folds, each dividing the dataset into different training, validation and test partitions.

The first publicly available LTR datasets were distributed as LETOR 3.0 and 4.0 [76]; they use representations of 45, 46, or 64 features encoding ranking models such as TF.IDF, BM25, Language Modelling, PageRank, and HITS on different parts of the documents. The datasets in LETOR are divided by their tasks, most of which come from the TREC Web Tracks between 2003 and 2008 [23, 24]. HP2003, HP2004, NP2003, NP2004, TD2003 and
TD2004 each contain between 50 and 150 queries and 1,000judged documents per query and use binary relevance labels. Due to their similaritywe report average results over these six datasets noted as
LETOR 3.0. The
OHSUMED dataset is based on the query log of the search engine on the MedLine abstract database,and contains 106 queries. The last two datasets,
MQ2007 and
MQ2008, were based on the Million Query Track [8] and consist of 1,700 and 800 queries, respectively, but have far fewer assessed documents per query. The
MSLR-WEB10K dataset [95] consists of 10,000 queries obtained from a retired labelling set of a commercial web search engine. The dataset uses 136 features to represent its documents; each query has around 125 assessed documents.

Finally, we note that there are more LTR datasets publicly available [17, 27], but there is no public information about their feature representations. Therefore, they are unfit for our evaluation, as no selection of well-performing ranking features can be made.
While experiments using real users are preferred [18, 21, 63, 133], most researchers do not have access to search engines. As a result, the most common way of comparing online evaluation methods is by using simulated user behaviour [12, 41, 44, 108, 109]. Such simulated experiments show the performance of multileaved comparison methods when user behaviour adheres to a few simple assumptions. Our experiments follow the precedent set by previous work on online evaluation:
First, a user issues a query, simulated by uniformly sampling a query from the static dataset. Subsequently, the multileaved comparison method constructs the multileaved result list of documents to display. The behavior of the user after receiving this list is simulated using a cascade click model [20, 36]. This model assumes a user examines documents in their displayed order. For each document that is considered, the user decides whether it warrants a click, which is modeled as the conditional probability P(click = 1 | R), where R is the relevance label provided by the dataset. Accordingly, cascade click model instantiations increase the probability of a click with the degree of the relevance label. After the user has clicked on a document, their information need may be satisfied; otherwise they continue considering the remaining documents. The probability of the user not examining more documents after clicking is modeled as P(stop = 1 | R), where it is more likely that the user is satisfied by a very relevant document. At each impression we display k = 10 documents to the user.

Table 2.2: Instantiations of Cascading Click Models [36] as used for simulating user behaviour in experiments.

                 P(click = 1 | R)                  P(stop = 1 | R)
R                0     1     2     3     4         0     1     2     3     4
perfect          0.0   0.2   0.4   0.8   1.0       0.0   0.0   0.0   0.0   0.0
navigational     0.05  0.3   0.5   0.7   0.95      0.2   0.3   0.5   0.7   0.9
informational    0.4   0.6   0.7   0.8   0.9       0.1   0.2   0.3   0.4   0.5

Table 2.2 lists the three instantiations of cascade click models that we use for this chapter. The first models a perfect user who considers every document and clicks on all relevant documents and nothing else. Second, the navigational instantiation models a user performing a navigational task who is mostly looking for a single highly relevant document. Finally, the informational instantiation models a user without a very specific information need who typically clicks on multiple documents. These three models have increasing levels of noise, as the behavior of each depends less on the relevance labels of the displayed documents.

Each experimental run consists of applying a multileaved comparison method to a sequence of T = 10,000 simulated user impressions. To see the effect of the number of rankers in a comparison, our runs consider |R| = 5, |R| = 15, and |R| = 40. However, only the MSLR dataset contains |R| = 40 rankers. Every run is repeated for every click model to see how different behaviours affect performance. For statistical significance every run is repeated 25 times per fold, which means that 125 runs are conducted for every dataset and click model pair. Since our evaluation covers five multileaved comparison methods, we generate over 393 million impressions in total. We test for statistically significant differences using a two-tailed t-test. Note that the results reported on the LETOR 3.0 data are averaged over six datasets and thus span 750 runs per datapoint.

The parameters of the baselines are selected based on previous work on the same datasets: for OM the sample size η = 10 was chosen, as reported by Schuth et al. [108]; for PM the degree τ = 3.0 was chosen according to Hofmann et al. [41], and the sample size η = 10,000 in accordance with Schuth et al. [109].
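Putting the pieces of this section together, the simulated interaction per impression can be sketched as follows. This is an illustrative reimplementation (not the thesis code), with the click and stop probabilities taken from Table 2.2 and relevance grades 0-4:

import random

CLICK_MODELS = {
    'perfect':       {'click': [0.0, 0.2, 0.4, 0.8, 1.0],
                      'stop':  [0.0, 0.0, 0.0, 0.0, 0.0]},
    'navigational':  {'click': [0.05, 0.3, 0.5, 0.7, 0.95],
                      'stop':  [0.2, 0.3, 0.5, 0.7, 0.9]},
    'informational': {'click': [0.4, 0.6, 0.7, 0.8, 0.9],
                      'stop':  [0.1, 0.2, 0.3, 0.4, 0.5]},
}

def simulate_impression(relevance_labels, model='navigational', k=10):
    """Cascade click model: examine the top-k documents in order, click with
    P(click=1|R), and stop examining with P(stop=1|R) after a click."""
    probs = CLICK_MODELS[model]
    clicked_ranks = []
    for rank, rel in enumerate(relevance_labels[:k]):
        if random.random() < probs['click'][rel]:
            clicked_ranks.append(rank)
            if random.random() < probs['stop'][rel]:
                break
    return clicked_ranks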
2.7. Results and Analysis

We answer Research Question RQ2.2 by evaluating the sensitivity of PPM based on the results of the experiments detailed in Section 2.6.

The results of the experiments with the smaller number of rankers (|R| = 5) are displayed in Table 2.3.
Table 2.3: The binary error E_bin of all multileaved comparison methods after 10,000 impressions on comparisons of |R| = 5 rankers. Average per dataset and click model; standard deviation in brackets. The best performance per click model and dataset is noted in bold; statistically significant improvements of PPM over each baseline are noted by ▲ (p < 0.01) and △ (p < 0.05), losses by ▼ and ▽ respectively, and ◦ for no difference.
Here we see that after 10,000 impressions PPM has a significantly lower error on many datasets and at all levels of interaction noise. Furthermore, for |R| = 5 there are no significant losses in performance under any circumstances.

When |R| = 15, as displayed in Table 2.4, we see a single case where PPM performs worse than a previous method: on MQ2007 under the perfect click model, SOSM performs significantly better than PPM. However, on the same dataset PPM performs significantly better under the informational click model. Furthermore, there are more significant improvements for |R| = 15 than for the smaller |R| = 5.

Finally, when the number of rankers in the comparison is increased to |R| = 40, as displayed in Table 2.5, PPM still provides significant improvements.

We conclude that PPM, in the experimental conditions that we considered, provides a performance that is at least as good as that of any existing method. Moreover, PPM is robust to noise, as we see more significant improvements under click models with increased noise. Furthermore, since improvements are found with the number of rankers |R| varying from 5 to 40, we conclude that PPM is scalable in the comparison size.

Table 2.4: The binary error E_bin after 10,000 impressions on comparisons of |R| = 15 rankers. Notation is identical to Table 2.3.
Additionally, the dataset type seems to affect the relative performance of the methods. For instance, on LETOR 3.0 few significant differences are found, whereas the
MSLR dataset displaysthe most significant improvements. This suggests that on more artificial data, i.e., thesmaller datasets simulating navigational tasks, the differences are fewer, while on theother hand on large commercial data the preference for PPM increases further. Lastly,Figure 2.1 displays the binary error of all multileaved comparison methods on the
MSLR dataset over 10,000 impressions. Under the perfect click model we see that all of theprevious methods display converging behavior around 3,000 impressions. In contrast,the error of PPM continues to drop throughout the experiment. The fact that the existingmethods converge at a certain level of error in the absence of click-noise is indicativethat they are lacking in sensitivity .Overall, our results show that PPM reaches a lower level of error than previousmethods seem to be capable of. This feat can be observed on a diverse set of datasets,various levels of interaction noise and for different comparison sizes. To answerResearch Question
RQ2.2: from our results we conclude that PPM is more sensitive than any existing multileaved comparison method.
Table 2.5: The binary error E_bin of all multileaved comparison methods after 10,000 impressions on comparisons of |R| = 40 rankers. Averaged over MSLR-WEB10k; notation is identical to Table 2.3.
2.8. Conclusion

In this chapter we have examined multileaved comparison methods for evaluating ranking models online.

We have presented a new multileaved comparison method, Pairwise Preference Multileaving (PPM), that is more sensitive to user preferences than existing methods. Additionally, we have proposed a theoretical framework for assessing multileaved comparison methods, with considerateness and fidelity as the two key requirements. We have shown that no method published prior to PPM has fidelity without lacking considerateness. In other words, prior to PPM no multileaved comparison method has been able to infer correct preferences without degrading the search experience of the user. In contrast, we prove that PPM has both considerateness and fidelity; thus it is guaranteed to correctly identify a Pareto dominating ranker without considerably altering the search experience. Furthermore, our experimental results spanning ten datasets show that PPM is more sensitive than existing methods, meaning that it can reach a lower level of error than any previous method. Moreover, our experiments show that the most significant improvements are obtained on the more complex datasets, i.e., larger datasets with more grades of relevance. Additionally, similar improvements are observed under different levels of noise and numbers of rankers in the comparison, indicating that PPM is robust to interaction noise and scalable to large comparisons. As an extra benefit, the computational complexity of PPM is polynomial and, unlike previous methods, does not depend on sampling or approximations.

With these findings we can answer the thesis research question
RQ1 positively: with the introduction of our novel Pairwise Preference Multileaving (PPM) method, the effectiveness of online evaluation scales to large comparisons.

The theoretical framework that we have introduced allows future research into multileaved comparison methods to guarantee improvements that generalize better than empirical results alone. In turn, properties like considerateness can further stimulate the adoption of multileaved comparison methods in production environments; future work with real-world users may yield further insights into the effectiveness of the multileaving paradigm. Rich interaction data enables the introduction of multileaved comparison methods that consider more than just clicks, as has been done for interleaving methods [63]. These methods could be extended to consider other signals such as dwell time or the order of clicks in an impression.

Furthermore, the field of Online Learning to Rank (OLTR) has depended on online evaluation from its inception [132]. The introduction of multileaving and subsequent novel multileaved comparison methods brought substantial improvements to both fields [90, 111]. Similarly, PPM and any future extensions are likely to benefit the OLTR field too.

Finally, while the theoretical and empirical improvements of PPM are convincing, future work should investigate whether the sensitivity can be made even stronger. For instance, it is possible to have clicks from which no preferences between rankers can be inferred. Can we devise a method that avoids such situations as much as possible without introducing any form of bias, thus increasing the sensitivity even further while maintaining theoretical guarantees?

In Chapter 7 we will take another look at online ranker evaluation and contrast it with counterfactual evaluation. We will see that existing interleaving methods (and by extension some multileaving methods) are biased w.r.t. the definition of position bias common in counterfactual evaluation. The novel method introduced in Chapter 7 combines aspects of counterfactual and online ranker evaluation, creating a method with strong theoretical guarantees while also being very effective.

Furthermore, similar to this chapter, Chapter 3 will look at whether a pairwise LTR method is suitable for online LTR. While different from PPM, the method introduced in Chapter 3 also infers pairwise preferences between documents, and weights inferred preferences to account for position bias.

Figure 2.1: The binary error of different multileaved comparison methods on comparisons of |R| = 15 rankers on the MSLR-WEB10k dataset, for the perfect, navigational and informational click models (binary error E_bin over 10,000 impressions).
Notation    Description
q           a user-issued query
T           the total number of interactions
r_i         an individual ranker, a.k.a. a single ranking system or ranking model
R           a set of rankers to compare
l_i         a ranking generated by ranker r_i
m           a multileaved result list
k           the length of the multileaved result lists
c           a vector indicating clicks on a displayed multileaved result list
P           a preference matrix to store inferred preferences between rankers
r(d, l_i)   the rank at which ranker r_i places document d

Differentiable Online Learning to Rank

This chapter was published as [82]. Appendix 3.A gives a reference for the notation used in this chapter.
Online Learning to Rank (OLTR) methods optimize rankers based on direct interaction with users. State-of-the-art OLTR methods rely on online evaluation and on sampling model variants; they were designed specifically for linear models, and their approach does not extend well to non-linear models such as neural networks.

To address this limitation, this chapter will consider the thesis research question:
RQ2
Is online Learning to Rank (LTR) possible without relying on model-samplingand online evaluation?
We introduce an entirely novel approach to OLTR that constructs a weighted differ-entiable pairwise loss after each interaction: Pairwise Differentiable Gradient De-scent (PDGD). PDGD breaks away from the traditional approach that relies on inter-leaving or multileaving and extensive sampling of models to estimate gradients. Instead,its gradient is based on inferring preferences between document pairs from user clicksand can optimize any differentiable model. We prove that the gradient of PDGD isunbiased w.r.t. user document pair preferences. Our experiments on the largest publiclyavailable LTR datasets show considerable and significant improvements under all levelsof interaction noise. PDGD outperforms existing OLTR methods both in terms of learn-ing speed as well as final convergence. Furthermore, unlike previous OLTR methods,PDGD also allows for non-linear models to be optimized effectively. Our results showthat using a neural network leads to even better performance at convergence than alinear model. In summary, PDGD is an efficient and unbiased OLTR approach thatprovides a better user experience than previously possible.
In order to benefit from unprecedented volumes of content, users rely on ranking systemsto provide them with the content of their liking. LTR in Information Retrieval (IR)concerns methods that optimize ranking models so that they order documents accordingto user preferences. In web search engines such models combine hundreds of signalsto rank web-pages according to their relevance to user queries [75]. Similarly, rankingmodels are a vital part of recommender systems where there is no explicit search intent
[59]. LTR is also prevalent in settings where other content is ranked, e.g., videos [19], products [60], conversations [97] or personal documents [127].

Traditionally, LTR has been applied in the offline setting where a dataset with annotated query-document pairs is available. Here, the model is optimized to rank documents according to the relevance annotations, which are based on the judgements of human annotators. Over time the limitations of this supervised approach have become apparent: annotated sets are expensive and time-consuming to create [17, 76]; when personal documents are involved such a dataset would breach privacy [127]; the relevance of documents to queries can change over time, like in a news search engine [1, 71]; and judgements of raters are not necessarily aligned with the actual users [104].

In order to overcome the issues with annotated datasets, previous work in LTR has looked into learning from user interactions. Work along these lines can be divided into approaches that learn from historical interactions, i.e., in the form of interaction logs [54], and approaches that learn in an online setting [132]. The latter regard methods that determine what to display to the user at each impression, and then immediately learn from observed user interactions and update their behavior accordingly. This online approach has the advantage that it does not require an existing ranker of decent quality, and thus can handle cold-start situations. Additionally, it is more responsive to the user by updating continuously and instantly, therefore allowing for a better experience. However, it is important that an online method can handle biases that come with user behavior: for instance, the observed interactions only take place with the displayed results, i.e., there is item-selection bias, and are more likely to occur with higher ranked items, i.e., there is position bias. Accordingly, a method should learn user preferences w.r.t. document relevance, and be robust to the forms of noise and bias present in the online setting. Overall, the online LTR approach promises to learn ranking models that are in line with user preferences, in a responsive manner, reaching good performance from few interactions, even in cold-start situations.

Despite these highly beneficial properties, previous work in OLTR has only considered linear models [42, 111, 132] or trivial variants thereof [80]. The reason for this is that existing work in OLTR has used the Dueling Bandit Gradient Descent (DBGD) algorithm [132] as a basis. While very influential and effective, we identify two main problems with the gradient estimation of the DBGD algorithm:

1. Gradient estimation is based on sampling model variants from a unit circle around the current model. This concept does not extend well to non-linear models. Computing rankings for variants is also computationally costly for larger, complex models.

2. It uses online evaluation methods, i.e., interleaving or multileaving, to determine the gradient direction from the resulting set of models.
However, these evaluation methods are designed for finding preferences between ranking systems, not (primarily) for determining how a model should be updated.

As an alternative we introduce Pairwise Differentiable Gradient Descent (PDGD), the first unbiased OLTR method that is applicable to any differentiable ranking model. PDGD infers pairwise document preferences from user interactions and constructs an unbiased gradient after each user impression. In addition, PDGD does not rely on sampling models for exploration, but instead models rankings as probability distributions over documents. Therefore, it allows the OLTR model to be very certain for specific queries and perform less exploration in those cases, while being much more explorative in other, uncertain cases. Our results show that, consequently, PDGD provides significant and considerable improvements over previous OLTR methods. This indicates that its gradient estimation is more in line with the preferences to be learned.

In this chapter, we address the thesis research question
RQ2 by answering thefollowing three specific research questions:
RQ3.1
Does using PDGD result in significantly better performance than the currentstate-of-the-art Multileave Gradient Descent?
RQ3.2
Is the gradient estimation of PDGD unbiased?
RQ3.3
Is PDGD capable of effectively optimizing different types of ranking models?

To facilitate replicability and repeatability of our findings, we provide open source implementations of PDGD and our experiments under the permissive MIT open-source license (https://github.com/HarrieO/OnlineLearningToRank).

3.2. Related Work

LTR can be applied in the offline and online setting. In the offline setting LTR is approached as a supervised problem where the relevance of each query-document pair is known. Most of the challenges with offline LTR come from obtaining annotations. For instance, gathering annotations is time-consuming and expensive [17, 76, 95]. Furthermore, in privacy-sensitive contexts it would be unethical to annotate items, e.g., for personal emails or documents [127]. Moreover, for personalization problems annotators are unable to judge what specific users would prefer. Also, (perceived) relevance changes over time, due to cognitive changes on the user's end [120] or due to changes in document collections [1] or the real world [71]. Finally, annotations are not necessarily aligned with user satisfaction, as judges may interpret queries differently from actual users [104]. Consequently, the limitations of offline LTR have led to an increased interest in alternative approaches to LTR.
OLTR is an attractive alternative to offline LTR as it learns directly from interacting with users [132]. By doing so it attempts to solve the issues with offline annotations that occur in LTR, as user preferences are expected to be better represented by interactions than by offline annotations [99]. Unlike methods in the offline setting, OLTR algorithms have to simultaneously perform ranking while also optimizing their ranking model. In other words, an OLTR algorithm decides what rankings to display to users, while at the same time learning from the interactions with the presented rankings. While the potential of learning in the online setting is great, it has its own challenges. In particular, the main difficulties of the OLTR task are bias and noise. Any user interaction that does not reflect the user's true preference is considered noise; this happens frequently, e.g., clicks often occur for unexpected reasons [104]. Bias comes in many forms; for instance, item-selection bias occurs because interactions only involve displayed documents [127]. Another common bias is position bias, a consequence of the fact that documents at the top of a ranking are more likely to be considered [134]. An OLTR method should thus take into account the biases that affect user behavior while also being robust to noise, in order to learn the true user preferences.

OLTR methods can be divided into two groups [139]: tabular methods that learn the best ranked list under some model of user interaction with the list [98, 114], such as a click model [20], and feature-based algorithms that learn the best ranker in a family of rankers [43, 132]. Model-based methods may have greater statistical efficiency, but they give up generality, essentially requiring us to learn a separate model for every query. For the remainder of this chapter, we focus on model-free OLTR methods.

State-of-the-art (model-free) OLTR approaches learn user preferences by approaching optimization as a dueling bandit problem [132]. They estimate the gradient of the model w.r.t. user satisfaction by comparing the current model to sampled variations of the model. The original DBGD algorithm [132] uses interleaving methods to make these comparisons: at each interaction the rankings of two rankers are combined to create a single result list. From a large number of clicks on such combined result lists, a user preference between the two rankers can reliably be inferred [41]. Conversely, DBGD compares its current ranking model to a different, slight variation at each impression. Then, if a click is indicative of a preference for the variation, the current model is slightly updated towards it. Accordingly, the model of DBGD will continuously update itself and oscillate towards an inferred optimum.

Other work in OLTR has used DBGD as a basis and extended upon it. Notably, Hofmann et al. [43] have proposed a method that guides exploration by only sampling variations that seem promising from historical interaction data. Unfortunately, while this approach provides faster initial learning, the historical data introduces bias, which leads the quality of the ranking model to steadily decrease over time [90]. Alternatively, Schuth et al. [111] introduced Multileave Gradient Descent (MGD); this extension replaced the interleaving of DBGD with multileaving methods. In turn, the multileaving paradigm is an extension of interleaving where a set of rankers is compared efficiently [81, 108, 109].
In contrast with interleaving, multileaving methods can combine the rankings of more than two rankers and thus infer preferences over a set of rankers from a single click. MGD uses this property to estimate the gradient more effectively by comparing a large number of model variations per user impression [90, 111]. As a result, MGD requires fewer user interactions than DBGD to converge on the same level of performance. Another alternative approach was considered by Hofmann et al. [40], who inject the ranking from the current model with randomly sampled documents. Then, after each user impression, a pairwise loss is constructed from inferred preferences between documents. This pairwise approach was not found to be more effective than DBGD.

Quite remarkably, all existing work in OLTR has only considered linear models. Recently, Oosterhuis and de Rijke [80] recognized that a tradeoff unique to OLTR arises when choosing models. High-capacity models such as neural networks [13] require more data than simpler models. On the one hand, this means that high-capacity models need more user interactions to reach the same level of performance, thus giving a worse initial user experience. On the other hand, high-capacity models are capable of finding better optima, thus leading to better final convergence and a better long-term user experience. This dilemma is named the speed-quality tradeoff, and as a solution a cascade of models can be optimized: combining the initial learning speed of a simple model with the convergence of a complex one. But there are more reasons why non-linear models have so far been absent from OLTR. Importantly, the DBGD algorithm was designed for linear models from the ground up, relying on a unit circle to sample model variants and averaging models to estimate the gradient. Furthermore, the computational cost of maintaining an extensive set of model variants for large and complex models makes this approach very impractical.

Our contribution over the work listed above is an OLTR method that is not an extension of DBGD; instead, it computes a differentiable pairwise loss to update its model. Unlike the existing pairwise approach, our loss function is unbiased, and our exploration is performed using the model's confidence over documents. Finally, we also show that this is the first OLTR method to effectively optimize neural networks in the online setting.

3.3. Method

In this section we introduce a novel OLTR algorithm: Pairwise Differentiable Gradient Descent (PDGD). First, Section 3.3.1 describes PDGD in detail, before Section 3.3.2 formalizes and proves the unbiasedness of the method. Appendix 3.A lists the notation we use.
PDGD revolves around optimizing a ranking model f_θ(d) that takes a feature representation of a query-document pair d as input and outputs a score. The aim of the algorithm is to find the parameters θ so that sorting the documents by their scores in descending order provides the most optimal rankings. Because this is an online algorithm, the method must first decide what ranking to display to the user; then, after the user has interacted with the displayed ranking, it may update θ accordingly.

Unlike previous OLTR approaches, PDGD does not rely on any online evaluation methods. Instead, a Plackett-Luce (PL) model is applied to the ranking function f_θ(·), resulting in a distribution over the document set D:

  P(d | D) = e^{f_θ(d)} / Σ_{d′ ∈ D} e^{f_θ(d′)}.   (3.1)

Figure 3.1: Left: a click on a document ranking R (documents d_1, ..., d_5) and the inferred preferences of d_3 over {d_1, d_2, d_4}. Right: the reversed pair ranking R*(d_1, d_3, R) for the document pair d_1 and d_3.

A ranking R to display to the user is then created by sampling from the distribution k times, where after each placement the distribution is renormalized to prevent duplicate placements. PL models have been used before in LTR; for instance, the ListNet method [15] optimizes such a model in the offline setting. With R_i denoting the document at position i, the probability of the ranking R then becomes:

  P(R | D) = Π_{i=1}^{k} P(R_i | D \ {R_1, ..., R_{i−1}}).   (3.2)

After the ranking R has been displayed to the user, they have the option to interact with it. The user may choose to click on some (or none) of the documents. Based on these clicks, PDGD will infer preferences between the displayed documents. We assume that clicked documents are preferred over observed unclicked documents. However, it is unknown to the algorithm which unclicked documents the user has considered. As a solution, PDGD relies on the assumption that every document preceding a clicked document and the first subsequent unclicked document were observed, as illustrated in Figure 3.1a. This preference assumption has proven useful in IR before, for instance in pairwise LTR on click logs [54] and recently in online evaluation [81]. We denote preferences between documents inferred from clicks as d_k >_c d_l, where d_k is preferred over d_l.

Then θ is updated by optimizing pairwise probabilities over the preference pairs; for each inferred document preference d_k >_c d_l, the probability that the preferred document d_k is sampled before d_l is increased [118]:

  P(d_k ≻ d_l) = P(d_k | D) / (P(d_k | D) + P(d_l | D)) = e^{f(d_k)} / (e^{f(d_k)} + e^{f(d_l)}).   (3.3)
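The sampling procedure of Equations 3.1 and 3.2 is straightforward to implement; the sketch below (illustrative, not the released PDGD code) uses a numerically stable softmax and renormalizes by removing already placed documents:

import numpy as np

def sample_ranking(scores, k):
    """Sample k documents without replacement from the Plackett-Luce model
    induced by the scores (Eqs. 3.1 and 3.2)."""
    scores = np.asarray(scores, dtype=float)
    remaining = list(range(len(scores)))
    ranking = []
    for _ in range(min(k, len(scores))):
        logits = scores[remaining]
        probs = np.exp(logits - logits.max())   # stable softmax
        probs /= probs.sum()
        idx = np.random.choice(len(remaining), p=probs)
        ranking.append(remaining.pop(idx))
    return ranking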
We have chosen pairwise optimization over listwise optimization because a pairwise method can be made unbiased by reweighing preference pairs. To do this we introduce the weighting function ρ(d_k, d_l, R, D) and estimate the gradient of the user preferences by the weighted sum:

  ∇f_θ(·) ≈ Σ_{d_k >_c d_l} ρ(d_k, d_l, R, D) [∇P(d_k ≻ d_l)]
          = Σ_{d_k >_c d_l} ρ(d_k, d_l, R, D) · ( e^{f_θ(d_k)} e^{f_θ(d_l)} / (e^{f_θ(d_k)} + e^{f_θ(d_l)})² ) (f′_θ(d_k) − f′_θ(d_l)).   (3.4)

The ρ function is based on the reversed pair ranking R*(d_k, d_l, R), which is the same ranking as R with the positions of d_k and d_l swapped. An example of a reversed pair ranking is illustrated in Figure 3.1b. The idea is that if a preference d_k >_c d_l is inferred in R and both documents are equally relevant, then the reverse preference d_l >_c d_k is equally likely to be inferred in R*(d_k, d_l, R). The ρ function reweighs the found preferences by the ratio between the probabilities of R and R*(d_k, d_l, R) occurring:

  ρ(d_k, d_l, R, D) = P(R*(d_k, d_l, R) | D) / (P(R | D) + P(R*(d_k, d_l, R) | D)).   (3.5)

This procedure has similarities with importance sampling [93]; however, we found that reweighing according to the ratio between R and R* provides a more stable performance, since it produces less extreme values. Section 3.3.2 details exactly how ρ creates an unbiased gradient.

Algorithm 3.1 Pairwise Differentiable Gradient Descent (PDGD).
1:  Input: initial weights θ_1; scoring function f; learning rate η.
2:  for t ← 1 ... ∞ do
3:    q_t ← receive_query(t)              // obtain a query from a user
4:    D_t ← preselect_documents(q_t)      // preselect documents for query
5:    R_t ← sample_list(f_{θ_t}, D_t)     // sample list according to Eq. 3.1
6:    c_t ← receive_clicks(R_t)           // show result list to the user
7:    ∇f_{θ_t} ← 0                        // initialize gradient
8:    for d_k >_c d_l ∈ c_t do
9:      w ← ρ(d_k, d_l, R, D)             // initialize pair weight (Eq. 3.5)
10:     w ← w · e^{f_{θ_t}(d_k)} e^{f_{θ_t}(d_l)} / (e^{f_{θ_t}(d_k)} + e^{f_{θ_t}(d_l)})²   // pair gradient (Eq. 3.4)
11:     ∇f_{θ_t} ← ∇f_{θ_t} + w (f′_{θ_t}(d_k) − f′_{θ_t}(d_l))   // model gradient (Eq. 3.4)
12:   θ_{t+1} ← θ_t + η∇f_{θ_t}           // update the ranking model

Algorithm 3.1 describes the PDGD method step by step: Given the initial parameters θ_1 and a differentiable scoring function f (Line 1), the method waits for a user-issued query q_t to arrive (Line 3). Then the preselected set of documents D_t for the query is fetched (Line 4); in our experiments these preselections are given in the LTR datasets that we use. A result list R_t is sampled from the current model (Line 5 and Equation 3.1) and displayed to the user. The clicks from the user are logged (Line 6) and preferences between the displayed documents are inferred (Line 8). The gradient is initialized (Line 7), and for each document pair d_k, d_l such that d_k >_c d_l, the weight ρ(d_k, d_l, R, D) is calculated (Line 9 and Equation 3.5), followed by the gradient for the pair probability (Line 10 and Equation 3.4). Finally, the gradient for the scoring function f is weighted and added to the gradient (Line 11), resulting in the estimated gradient. The model is then updated by taking an η step in the direction of the gradient (Line 12). The algorithm again waits for the next query to arrive, and thus the process continues indefinitely.
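For concreteness, one PDGD update for a linear scorer f_θ(d) = θ·x_d can be sketched as follows. This is a hedged illustration under our own helper names (log_pl_prob, pdgd_gradient); the released implementation may differ in details such as how ρ is computed:

import numpy as np

def log_pl_prob(scores, ranking):
    """log P(R | D) under the Plackett-Luce model of Eq. 3.2."""
    remaining = list(range(len(scores)))
    log_p = 0.0
    for d in ranking:
        logits = scores[remaining]
        m = logits.max()
        log_p += scores[d] - (m + np.log(np.sum(np.exp(logits - m))))
        remaining.remove(d)
    return log_p

def pdgd_gradient(theta, X, ranking, pref_pairs):
    """Gradient estimate of Eq. 3.4 for f(d) = theta . X[d];
    pref_pairs holds (k, l) pairs with d_k >_c d_l inferred from clicks."""
    scores = X @ theta
    grad = np.zeros_like(theta)
    for k, l in pref_pairs:
        # reversed pair ranking R*(d_k, d_l, R): swap the two documents
        swapped = [k if d == l else l if d == k else d for d in ranking]
        log_p = log_pl_prob(scores, ranking)
        log_p_star = log_pl_prob(scores, swapped)
        rho = 1.0 / (1.0 + np.exp(log_p - log_p_star))           # Eq. 3.5
        ek, el = np.exp(scores[k]), np.exp(scores[l])
        grad += rho * ek * el / (ek + el) ** 2 * (X[k] - X[l])   # Eq. 3.4
    return grad

# update step of Algorithm 3.1 (Line 12): theta = theta + eta * pdgd_gradient(...)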
PDGD has some notable advantages over MGD [111]. Firstly, it explicitly models uncertainty over the documents per query; thus PDGD is able to have high confidence in its ranking for one query, while being completely uncertain for another query. As a result, it varies the amount of exploration per query, allowing it to avoid exploration in cases where it is not required and to focus on areas where it can improve. In contrast, MGD does not explicitly model confidence: its degree of exploration is only affected by the norm of its linear model [80]. Consequently, MGD is unable to vary exploration per query, nor is there a way to directly measure its level of confidence. Secondly, PDGD works for any differentiable scoring function f and does not rely on sampling model variants. Conversely, MGD is based on sampling from the unit sphere around a model; this approach is very ineffective for non-linear models. Additionally, sampling large models and producing rankings for them can be very computationally expensive. Besides these beneficial properties, our experimental results in Section 3.5 show that PDGD achieves significantly higher levels of performance than MGD and other previous methods.

The previous section introduced PDGD; this section answers
RQ3.2:
Is the gradient estimation of PDGD unbiased?

First, Theorem 3.1 provides a definition of unbiasedness w.r.t. user document pair preferences. Then we state the assumptions we make about user behavior and use them to prove Theorem 3.1. Our notation uses d_k =_rel d_l to indicate no user preference between two documents d_k and d_l; d_k >_rel d_l to indicate a preference for d_k over d_l; and d_k <_rel d_l for the opposite preference.

Theorem 3.1.
The expected estimated gradient of PDGD can be written as a weighted sum, with a unique weight α_{k,l} for each possible document pair d_k and d_l in the document collection D:

E[∇f_θ(·)] = \sum_{d_k, d_l \in D} α_{k,l} (f'_{θ_t}(d_k) − f'_{θ_t}(d_l)). (3.6)

The signs of the weights α_{k,l} adhere to user preferences between documents. That is, if there is no preference:

d_k =_rel d_l ⇔ α_{k,l} = 0; (3.7)

if d_k is preferred over d_l:

d_k >_rel d_l ⇔ α_{k,l} > 0; (3.8)

and if d_l is preferred over d_k:

d_k <_rel d_l ⇔ α_{k,l} < 0. (3.9)

Therefore, in expectation PDGD performs updates that adhere to the preferences between the documents in every possible document pair.

Assumptions.
To prove Theorem 3.1, the following assumptions about user behavior will be used:
Assumption 1.
We assume that clicks from a user are position-biased and conditioned on the relevance of the current document and the previously considered documents. For a click on a document in ranking R at position i, the probability can be written as:

P(click(R_i) \mid \{R_1, \ldots, R_{i-1}, R_{i+1}\}). (3.10)

For ease of notation, we will denote the set of “other documents” as {…} from here on.

Assumption 2.
If there is no user preference between two documents d_k, d_l, denoted by d_k =_rel d_l, we assume that each is equally likely to be clicked given the same context:

d_k =_rel d_l ⇒ P(click(d_k) \mid \{…\}) = P(click(d_l) \mid \{…\}). (3.11)

Assumption 3.
If a document in the set of documents being considered is replaced with an equally preferred document, the click probability is not affected:

d_k =_rel d_l ⇒ P(click(R_i) \mid \{…, d_k\}) = P(click(R_i) \mid \{…, d_l\}). (3.12)

Assumption 4.
Similarly, given the same context, if one document is preferred over another, then it is more likely to be clicked:

d_k >_rel d_l ⇒ P(click(d_k) \mid \{…\}) > P(click(d_l) \mid \{…\}). (3.13)

Assumption 5.
Lastly, for any pair d_k >_rel d_l, the considered document set {…, d_k} and the same set with d_k replaced by d_l, {…, d_l}, we assume that the preferred d_k in the context of {…, d_l} is more likely to be clicked than d_l in the context of {…, d_k}:

d_k >_rel d_l ⇒ P(click(d_k) \mid \{…, d_l\}) > P(click(d_l) \mid \{…, d_k\}). (3.14)

These are all the assumptions we make about the user. With these assumptions, we can proceed to prove Theorem 3.1.

Proof of Theorem 3.1.
We denote the probability of inferring the preference of d_k over d_l in ranking R as P(d_k >_c d_l | R). The expected gradient ∇f_θ(·) of PDGD can then be written as:

E[∇f_θ(·)] = \sum_{R} \sum_{d_k, d_l \in D} P(d_k >_c d_l \mid R) \cdot P(R) \cdot ρ(d_k, d_l, R, D) [∇P(d_k ≻ d_l)]. (3.15)

We will rewrite this expectation using the symmetry property of the reversed pair ranking:

R_n = R^*(d_k, d_l, R_m) ⇔ R_m = R^*(d_k, d_l, R_n). (3.16)

First, we define a weight ω^R_{k,l} for every document pair d_k, d_l and ranking R so that:

ω^R_{k,l} = P(R) ρ(d_k, d_l, R, D) = \frac{P(R \mid D) P(R^*(d_k, d_l, R) \mid D)}{P(R \mid D) + P(R^*(d_k, d_l, R) \mid D)}. (3.17)

Therefore, the weight for the reversed pair ranking is equal:

ω^{R^*(d_k, d_l, R)}_{k,l} = P(R^*(d_k, d_l, R)) ρ(d_k, d_l, R^*(d_k, d_l, R), D) = ω^R_{k,l}. (3.18)

Then, using the symmetry of Equation 3.3, we see that:

∇P(d_k ≻ d_l) = −∇P(d_l ≻ d_k). (3.19)

Thus, with R^* as a shorthand for R^*(d_k, d_l, R), the expectation can be rewritten as:

E[∇f_θ(·)] = \sum_{d_k, d_l \in D} \sum_{R} ω^R_{k,l} \big( P(d_k >_c d_l \mid R) − P(d_l >_c d_k \mid R^*) \big) [∇P(d_k ≻ d_l)], (3.20)

proving that the expected gradient matches the form of Equation 3.6. Then, to prove that Equations 3.7, 3.8, and 3.9 are correct, we will show that:

d_k =_rel d_l ⇒ P(d_k >_c d_l \mid R) = P(d_l >_c d_k \mid R^*), (3.21)
d_k >_rel d_l ⇒ P(d_k >_c d_l \mid R) > P(d_l >_c d_k \mid R^*), (3.22)
d_k <_rel d_l ⇒ P(d_k >_c d_l \mid R) < P(d_l >_c d_k \mid R^*). (3.23)

If a preference R_i >_c R_j is inferred, then there are only three possible cases based on the positions:

1. The clicked document succeeds the unclicked document by more than one position: i > j + 1.
2. The clicked document precedes the unclicked document by more than one position: i + 1 < j.
3. The clicked document is one position before or after the unclicked document: i = j + 1 ∨ i = j − 1.

In the first case, the clicked document succeeds the other by more than one position; the probability of an inferred preference is then:

i > j + 1 ⇒ P(R_i >_c R_j \mid R) = P(c_i \mid R_i, \{…, R_j\}) (1 − P(c_j \mid R_j, \{…\})). (3.24)

Combining Assumptions 2 and 3 with Equation 3.24 proves Equation 3.21 for this case. Furthermore, combining Assumptions 4 and 5 with Equation 3.24 proves Equations 3.22 and 3.23 for this case as well.

In the second case, the clicked document precedes the other by more than one position; the probability of an inferred preference is then:

i + 1 < j ⇒ P(R_i >_c R_j \mid R) = P(c_i \mid R_i, \{…\}) (1 − P(c_j \mid R_j, \{…, R_i\})) P(c_{rem}), (3.25)

where P(c_{rem}) denotes the probability of an additional click that is required to add R_j to the inferred observed documents. First, due to Assumption 1 this probability will be the same for R and R^*:

P(c_{rem} \mid R_i, R_j, R) = P(c_{rem} \mid R_i, R_j, R^*). (3.26)

Combining Assumptions 2 and 3 with Equation 3.25 also proves Equation 3.21 for this case. Furthermore, combining Assumptions 4 and 5 with Equation 3.25 also proves Equations 3.22 and 3.23 for this case as well.

Lastly, in the third case, the clicked document is one position before or after the other document; the probability of the inferred preference is then:

i = j + 1 ∨ i = j − 1 ⇒ P(R_i >_c R_j \mid R) = P(c_i \mid R_i, \{…, R_j\}) (1 − P(c_j \mid R_j, \{…, R_i\})). (3.27)

Combining Assumption 3 with Equation 3.27 proves Equation 3.21 for this case as well. Then, combining Assumption 5 with Equation 3.27 also proves Equations 3.22 and 3.23 for this case.

This concludes our proof of the unbiasedness of PDGD. Hence, we answer
RQ3.2 positively: the gradient estimation of PDGD is unbiased. We have shown that the expected gradient is in line with user preferences between document pairs.
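The core of the proof, that inferring d_k >_c d_l in R mirrors inferring d_l >_c d_k in R^*, can be illustrated with a small numerical check. The examination and click probabilities below are made up for illustration, but they satisfy the stated assumptions:

```python
import numpy as np

# Illustrative (made-up) probabilities: a click requires examination
# (position bias, Assumption 1) and depends on relevance (Assumptions 4-5).
pos_bias = np.array([1.0, 0.6])
click_given_examined = {0: 0.3, 1: 0.8}  # keyed by a binary relevance grade

def pref_prob(rel_first, rel_second):
    """P(first doc clicked, second doc not clicked) in a two-document list,
    i.e., the probability of inferring a preference for the first document."""
    p1 = pos_bias[0] * click_given_examined[rel_first]
    p2 = pos_bias[1] * click_given_examined[rel_second]
    return p1 * (1.0 - p2)

# Eq. 3.21: equally relevant documents yield the same inference probability
# in R and in the reversed pair ranking R* (swapping them changes nothing).
assert pref_prob(1, 1) == pref_prob(1, 1)
# Eq. 3.22: if d_k is more relevant, inferring d_k >_c d_l in R is more
# likely than inferring d_l >_c d_k in R*(d_k, d_l, R).
assert pref_prob(1, 0) > pref_prob(0, 1)
print("symmetry checks passed")
```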
In this section we detail the experiments that were performed to answer the research questions in Section 3.1.
Our experiments are performed over five publicly available LTR datasets; we have selected three large labelled datasets from commercial search engines and two smaller research datasets. Every dataset consists of a set of queries, with each query having a corresponding preselected document set. The exact contents of the queries and documents are unknown; each query is represented only by an identifier, but each query-document pair has a feature representation and relevance label. Depending on the dataset, the relevance labels are graded differently; we have purposefully chosen datasets that have at least two grades of relevance. Each dataset is divided into training, validation and test partitions.

The oldest datasets we use are MQ2007 and MQ2008 [95], which are based on the Million Query Track [8] and consist of 1,700 and 800 queries, respectively. They use representations of 46 features that encode ranking models such as TF.IDF, BM25, Language Modeling, PageRank, and HITS on different parts of the documents. They are divided into five folds and the labels are on a three-grade scale from not relevant (0) to very relevant (2).

In 2010 Microsoft released the MSLR-WEB30k and MSLR-WEB10K datasets [95], which are both created from a retired labelling set of a commercial web search engine (Bing). The former contains 30,000 queries, with each query having 125 assessed documents on average; query-document pairs are encoded in 136 features. The latter is a subsampling of 10,000 queries from the former dataset. For practical reasons only MSLR-WEB10K was used for this chapter. Also in 2010, Yahoo! released an LTR dataset [17]. It consists of 29,921 queries and 709,877 documents encoded in 700 features, all sampled from query logs of the Yahoo! search engine. Finally, in 2016 an LTR dataset was released by the Istella search engine [27]. It is the largest with 33,118 queries, an average of 315 documents per query, and 220 features. These three commercial datasets all label relevance on a five-grade scale: from not relevant (0) to perfectly relevant (4).
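These datasets are distributed in a LETOR-style plain-text format, where each line holds a relevance label, a query identifier, and sparse feature pairs. A minimal parser sketch, assuming the common `label qid:<id> <index>:<value> ...` layout (all names illustrative):

```python
from collections import defaultdict
import numpy as np

def read_letor(path, n_features):
    """Parse a LETOR-style file into per-query feature matrices and labels."""
    feats, labels = defaultdict(list), defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.split('#')[0].strip()  # drop trailing comments
            if not line:
                continue
            parts = line.split()
            label, qid = int(parts[0]), parts[1].split(':')[1]
            vec = np.zeros(n_features)
            for pair in parts[2:]:
                idx, val = pair.split(':')
                vec[int(idx) - 1] = float(val)  # feature indices are 1-based
            feats[qid].append(vec)
            labels[qid].append(label)
    return ({q: np.stack(v) for q, v in feats.items()},
            {q: np.array(v) for q, v in labels.items()})
```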
For simulating users we follow the standard setup for OLTR simulations [38, 40, 90, 111, 137]. First, queries issued by users are simulated by uniformly sampling from the static dataset. Then the algorithm determines the result list of documents to display. User interactions with the displayed list are then simulated using a cascade click model [20, 36]. This models a user who goes through the documents one at a time in the displayed order. At each document, the user decides whether to click it or not, modelled as a probability conditioned on the relevance label R: P(click = 1 | R). After a click has occurred, the user's information need may be satisfied, and they may then stop considering documents. The probability of a user stopping after a click is modelled as P(stop = 1 | click = 1, R). For our experiments, κ = 10 documents are displayed at each impression.

The three instantiations of cascade click models that we used are listed in Table 3.1.

Table 3.1: Instantiations of Cascading Click Models [36] as used for simulating user behavior in experiments; for each relevance grade R, it lists the click probability P(click = 1 | R) and the stop probability P(stop = 1 | click = 1, R) of the perfect, navigational, and informational instantiations.

First, a perfect user is modelled who considers every document and solely clicks on all relevant documents. The second models a user with a navigational task, who searches for a single highly relevant document. Finally, an informational instantiation models a user without a specific information need, who thus typically clicks on many documents. These models have varying levels of noise, as each behavior depends on the relevance labels of documents to a different degree.
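A minimal sketch of such a cascade simulation is given below; the click and stop probabilities are illustrative placeholders, not the values from Table 3.1:

```python
import numpy as np

def simulate_cascade_clicks(rel_labels, p_click, p_stop, rng):
    """Simulate a cascading user on a displayed ranking.

    rel_labels: relevance grades of the displayed documents, top-down.
    p_click:    click probability per relevance grade.
    p_stop:     stop probability per relevance grade, applied after a click.
    """
    clicks = np.zeros(len(rel_labels), dtype=bool)
    for i, rel in enumerate(rel_labels):
        if rng.random() < p_click[rel]:
            clicks[i] = True
            if rng.random() < p_stop[rel]:
                break  # information need satisfied, stop examining
    return clicks

rng = np.random.default_rng(1)
# Placeholder probabilities for a five-grade scale (NOT the thesis values).
p_click = [0.05, 0.3, 0.5, 0.7, 0.95]
p_stop = [0.2, 0.3, 0.5, 0.7, 0.9]
print(simulate_cascade_clicks([0, 4, 1, 0, 3], p_click, p_stop, rng))
```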
For our experiments three baselines are used. First, MGD with Probabilistic Multileaving [90]; this is the highest performing existing OLTR method [80, 90]. For this chapter, n = 49 candidates were sampled per iteration from the unit sphere with δ = 1; updates are performed with η = 0. and zero initialization was used. Additionally, DBGD is used for comparison, as it is one of the most influential methods; it was run with the same parameters, except that only n = 1 candidate is sampled per iteration. Furthermore, we also let DBGD optimize a single-hidden-layer neural network with 64 hidden nodes and sigmoid activation functions with Xavier initialization [33]. These parameters were also found most effective in previous work [40, 90, 111, 132].

Additionally, the pairwise method introduced by Hofmann et al. [40] is used as a baseline. Despite not showing significant improvements over DBGD in the past [40], the comparison with PDGD is interesting because they both estimate gradients from pairwise preferences. For this baseline, η = 0. and ε = 0. are used; these parameters are chosen to maximize the performance at convergence [40].

Runs with PDGD are performed with both a linear and a neural ranking model. For the linear ranking model, η = 0. and zero initialization were used. The neural network has the same parameters as the one optimized by DBGD, except for η = 0. .

Two aspects of performance are evaluated separately: the final convergence and the ranking quality during training. Final convergence is addressed by offline performance, which is the average NDCG@10 of the ranking model over the queries in the held-out test set. The offline performance is measured after 10,000 impressions, at which point most ranking models have reached convergence. The user experience during optimization should be considered as well, since deterring users during training would compromise the goal of OLTR. To address this aspect of evaluation, online performance has been introduced [39]; it is the cumulative discounted NDCG@10 of the rankings displayed during training. For T sequential queries, with R_t as the ranking displayed to the user at timestep t, this is:

Online\ Performance = \sum_{t=1}^{T} NDCG(R_t) \cdot γ^{(t−1)}. (3.28)

This metric models the expected reward a user receives, with a 1 − γ probability that the user stops searching after each query. We follow previous work [80, 90] by choosing a discount factor of γ = 0.9995; consequently, queries beyond the horizon of 10,000 queries have a less than 1% impact.

Lastly, all experimental runs are repeated 125 times, spread evenly over the available dataset folds. Results are averaged and a two-tailed Student's t-test is used for significance testing. In total, our results are based on more than 90,000,000 user impressions.
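Equation 3.28 translates directly into code; a small sketch, assuming a list of per-impression NDCG@10 values:

```python
def online_performance(ndcg_per_impression, gamma=0.9995):
    """Discounted cumulative NDCG (Eq. 3.28): sum_t NDCG(R_t) * gamma^(t-1).

    `enumerate` starts at t = 0, which matches the exponent t - 1 for
    timesteps counted from 1.
    """
    return sum(ndcg * gamma ** t
               for t, ndcg in enumerate(ndcg_per_impression))

print(online_performance([0.4, 0.45, 0.5]))  # toy three-impression run
```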
Our main results are displayed in Table 3.2 and Table 3.3, showing the offline and online performance of all methods, respectively. Additionally, Figure 3.2 displays the offline performance on the MSLR-WEB10k dataset over 30,000 impressions, and Figure 3.3 over 1,000,000 impressions. We use these results to answer RQ3.1 – whether PDGD provides significant improvements over existing OLTR methods – and
RQ3.3 – whether PDGD is successful at optimizing different types of ranking models.
First, we consider the offline performance after 10,000 impressions as reported in Table 3.2. We see that the DBGD and MGD baselines reach similar levels of performance, with marginal differences at low levels of noise. Our results seem to suggest that MGD provides an efficient alternative to DBGD that requires fewer user interactions and is more robust to noise. However, MGD does not appear to have an improved point of convergence over DBGD; Figure 3.2 further confirms this conclusion. Additionally, Table 3.2 and Figure 3.3 reveal that DBGD is incapable of training its neural network so that it improves over the linear model, even after 1,000,000 impressions.

Alternatively, the pairwise baseline displays different behavior, providing improvements over DBGD and MGD on most datasets under all levels of noise. However, on the Istella dataset large decreases in performance are observed. Thus it is unclear whether this method provides a reliable alternative to DBGD or MGD in terms of convergence. Figure 3.2 also reveals that it converges within several hundred impressions, while DBGD and MGD continue to learn and considerably improve over the total 30,000 impressions. Because the pairwise baseline also converges sub-optimally under the perfect click model, we do not attribute its suboptimal convergence to noise but to the method being biased.

Conversely, Table 3.2 shows that PDGD reaches significantly higher performance than all the baselines within 10,000 impressions. Improvements are observed on all datasets under all levels of noise, especially on the commercial datasets, where increases of up to 0. NDCG are observed. Our results also show that PDGD learns faster than the baselines; at all time-steps, the offline performance of PDGD is at least as good as, or better than, that of all other methods, across all datasets. This increased learning speed can also be observed in Figure 3.2. Besides the faster learning, it also appears as if PDGD converges at a better optimum than DBGD or MGD. However, Figure 3.2 reveals that DBGD does not fully converge within 30,000 iterations. Therefore, we performed an additional experiment where PDGD and DBGD optimize models over 1,000,000 impressions on the MSLR-WEB10k dataset, as displayed in Figure 3.3. Clearly, the performance of DBGD plateaus at a considerably lower level than that of PDGD. Therefore, we conclude that PDGD indeed has an improved point of final convergence compared to DBGD and MGD.

Finally, Figures 3.2 and 3.3 also show the behavior predicted by the speed-quality tradeoff [80]: a more complex model will have a worse initial performance but a better final convergence. Here, we see that, depending on the level of interaction noise, the neural model requires 3,000 to 20,000 iterations to match the performance of a linear model. However, in the long run the neural model does converge at a significantly better point. Thus, we conclude that PDGD is capable of effectively optimizing different kinds of models in terms of offline performance.

In conclusion, our results show that PDGD learns faster than existing OLTR methods while also converging at significantly better levels of performance.
Besides the ranking models learned by the OLTR methods, we also consider the user experience during optimization. Table 3.3 shows that the online performance of DBGD and MGD is close; MGD has a higher online performance due to its faster learning speed [90, 111]. In contrast, the pairwise baseline has a substantially lower online performance in all cases. Because Figure 3.2 shows that the learning speed of the pairwise baseline sometimes matches that of DBGD and MGD, we attribute this difference to the exploration strategy it uses. Namely, the random insertion of uniformly sampled documents by this baseline appears to have a strong negative effect on the user experience.

The linear model optimized by PDGD obtains significant improvements over all baseline methods on all datasets and under all click models. This improvement indicates that the exploration of PDGD, which uses a distribution over documents, does not lead to a worse user experience. In conclusion, PDGD provides a considerably better user experience than all existing methods.

Finally, we also discuss the performance of the neural model optimized by PDGD. This model has both significant increases and decreases in online performance, varying per dataset and amount of interaction noise. The decrease in user experience is predicted by the speed-quality tradeoff [80]; as Figure 3.2 also shows, the neural model has a slower learning speed, leading to a worse initial user experience. A solution to this tradeoff has been proposed by Oosterhuis and de Rijke [80], which optimizes a cascade of models. In this case, the cascade could combine the user experience of the linear model with the final convergence of the neural model, providing the best of both worlds.
After having discussed the offline and online performance of PDGD, we will now answer
RQ3.1 and
RQ3.3. First, concerning
RQ3.1 (whether PDGD performs significantly better than MGD), the results of our experiments show that models optimized with PDGD learn faster and converge at better optima than MGD, DBGD, and the pairwise baseline, regardless of dataset or level of interaction noise. Moreover, the level of performance reached with PDGD is significantly higher than the final convergence of any other method. Thus, even in the long run, DBGD and MGD are incapable of reaching the offline performance of PDGD. Additionally, the online performance of a linear model optimized with PDGD is significantly better across all datasets and user models. Therefore, we answer
RQ3.1 positively: PDGD outperforms existing methods both in terms of model convergence and user experience during learning.
Then, with regard to
RQ3.3 (whether PDGD can effectively optimize different types of models), in our experiments we have successfully optimized models from two families: linear models and neural networks. Both models reach a significantly higher level of performance at convergence than previous OLTR methods, across all datasets and degrees of interaction noise. As expected, the simpler linear model provides a better initial user experience, while the more complex neural model has a better point of convergence. In conclusion, we answer
RQ3.3 positively: PDGD is applicable to different ranking models and effective for both linear and non-linear models.
In this chapter, we have introduced a novel OLTR method, PDGD, which estimates its gradient using inferred pairwise document preferences. In contrast with previous OLTR approaches, PDGD does not rely on online evaluation to update its model. Instead, after each user interaction it infers preferences between document pairs. Subsequently, it constructs a pairwise gradient that updates the ranking model according to these preferences.

We have proven that this gradient is unbiased w.r.t. user preferences; that is, if there is a preference between a document pair, then in expectation the gradient will update the model to meet this preference. Furthermore, our experimental results show that PDGD learns faster and converges at a higher performance level than existing OLTR methods. Thus, it provides better performance in the short and long term, leading to an improved user experience during training as well. On top of that, PDGD is also applicable to any differentiable ranking model; in our experiments, a linear model and a neural network were optimized effectively. Both reached significant improvements over DBGD and MGD in performance at convergence. In conclusion, the novel unbiased PDGD algorithm provides better performance than existing methods in terms of convergence and user experience. Unlike the previous state of the art, it can be applied to any differentiable ranking model.

We can now answer thesis research question
RQ2 positively: OLTR is possible without relying on model-sampling and online evaluation. Moreover, our results show that using PDGD instead leads to much higher performance, and that it is much more effective at optimizing non-linear models.

Future research could consider the regret bounds of PDGD; these could give further insights into why it outperforms DBGD-based methods. Furthermore, while we proved the unbiasedness of our method w.r.t. document pair preferences, the expected gradient weighs document pairs differently. Offline LTR methods like LambdaMART [13] use a weighted pairwise loss to create a listwise method that directly optimizes IR metrics. However, in the online setting there is no metric that is directly optimized. Instead, future work could investigate whether different weighing approaches are more in line with user preferences. Another obvious avenue for future research is to explore the effectiveness of different ranking models in the online setting. There is a large body of research on ranking models in offline LTR; with the introduction of PDGD, such an extensive exploration of models is now also possible in OLTR.

Based on the big difference in observed performance between PDGD and DBGD,
Chapter 4 will further extend this comparison to more extreme experimental conditions. Furthermore, Chapter 8 will also consider the performance of PDGD and compare it with methods inspired by counterfactual LTR. Additionally, Chapter 8 will consider applying PDGD as a counterfactual method and without debiasing weights, and finds that in both these scenarios this leads to biased convergence.

Figure 3.2: Offline performance (NDCG) on the MSLR-WEB10k dataset under three different click models (perfect, navigational, and informational; 0 to 30,000 impressions); the shaded areas indicate the standard deviation.

Figure 3.3: Long-term offline performance (NDCG) on the MSLR-WEB10k dataset under three click models (0 to 1,000,000 impressions); the shaded areas indicate the standard deviation.

Table 3.2: Offline performance (NDCG) for different instantiations of CCM (Table 3.1). The standard deviation is shown in brackets; bold values indicate the highest performance per dataset and click model; significant improvements over the DBGD, MGD and pairwise baselines are indicated by △ (p < 0.05) and ▲ (p < 0.01); no losses were measured.

Table 3.3: Online performance (Discounted Cumulative NDCG; see Equation 3.28) for different instantiations of CCM (Table 3.1). The standard deviation is shown in brackets; bold values indicate the highest performance per dataset and click model; significant improvements and losses over the DBGD, MGD and pairwise baselines are indicated by △ (p < 0.05) and ▲ (p < 0.01), and by ▽ and ▼, respectively.
Notation used in this chapter:

Notation          Description
q                 a user-issued query
d, d_k, d_l       a document
d (boldface)      feature representation of a query-document pair
D                 set of documents
R                 ranked list
R^*               the reversed pair ranking R^*(d_k, d_l, R)
R_i               document placed at rank i
ρ                 preference pair weighting function
θ                 parameters of the ranking model
f_θ(·)            ranking model with parameters θ
f(d_k)            ranking score for a document from the model
click(d)          a click on document d
d_k =_rel d_l     two documents equally preferred by users
d_k >_rel d_l     a user preference between two documents
d_k >_c d_l       document preference inferred from clicks
A Critical Comparison of Online Learning to Rank Methods
Online Learning to Rank (OLTR) methods optimize ranking models by directly interacting with users, which allows them to be very efficient and responsive. All OLTR methods introduced during the past decade have extended the original OLTR method: Dueling Bandit Gradient Descent (DBGD). In Chapter 3, a fundamentally different approach was introduced with the Pairwise Differentiable Gradient Descent (PDGD) algorithm. The empirical comparisons in Chapter 3 suggested that PDGD converges at much higher levels of performance and learns considerably faster than DBGD-based methods. In contrast, DBGD appeared unable to converge on the optimal model in scenarios with little noise or bias. Furthermore, it seemed DBGD is not effective at optimizing non-linear models. These observations are quite surprising and prompted us to further investigate DBGD. As a result, this chapter will address the thesis research question:
RQ3
Are DBGD Learning to Rank methods reliable in terms of theoretical soundness and empirical performance?
In this chapter, we investigate whether the previous conclusions about the PDGD and DBGD comparison generalize from ideal to worst-case circumstances. We do so in two ways. First, we compare the theoretical properties of PDGD and DBGD, by taking a critical look at previously proven properties in the context of ranking. Second, we estimate an upper and lower bound on the performance of methods by simulating both ideal user behavior and extremely difficult behavior, i.e., almost-random non-cascading user models. Our findings show that the theoretical bounds of DBGD do not apply to any common ranking model and, furthermore, that the performance of DBGD is substantially worse than that of PDGD in both ideal and worst-case circumstances. These results reproduce previously published findings about the relative performance of PDGD vs. DBGD and generalize them to extremely noisy and non-cascading circumstances. Overall, they show that DBGD is a very flawed method for OLTR, both in terms of theoretical guarantees and performance.
This chapter was published as [84]. Appendix 4.A gives a reference for the notation used in this chapter.

Learning to Rank (LTR) plays a vital role in information retrieval. It allows us to optimize models that combine hundreds of signals to produce rankings, thereby making large collections of documents accessible to users through effective search and recommendation. Traditionally, LTR has been approached as a supervised learning problem, where annotated datasets provide human judgements indicating relevance. Over the years, many limitations of such datasets have become apparent: they are costly to produce [17, 95] and actual users often disagree with the relevance annotations [104]. As an alternative, research into LTR approaches that learn from user behavior has increased. By learning from the implicit feedback in user behavior, users' true preferences can potentially be learned. However, such methods must deal with the noise and biases that are abundant in user interactions [134]. Roughly speaking, there are two approaches to LTR from user interactions: learning from historical interactions and Online Learning to Rank (OLTR). Learning from historical data allows for optimization without gathering new data [58], though it does require good models of the biases in logged user interactions [20]. In contrast, OLTR methods learn by interacting with the user; thus they gather their own learning data. As a result, these methods can adapt instantly and are potentially much more responsive than methods that use historical data.

Dueling Bandit Gradient Descent (DBGD) [132] is the most prevalent OLTR method; it has served as the basis of the field for the past decade. DBGD samples variants of its ranking model and compares them using interleaving to find improvements [41, 96]. Subsequent work in OLTR has extended this approach [43, 111, 125]. In Chapter 3, the first alternative approach to DBGD was introduced with
Pairwise Differentiable Gradient Descent (PDGD) [82]. PDGD estimates a pairwise gradient that is reweighed to be unbiased w.r.t. users' document pair preferences. Chapter 3 showed considerable improvements over DBGD under simulated user behavior [84]: a substantially higher point of performance at convergence and a much faster learning speed. The results in Chapter 3 are based on simulations using low-noise cascading click models. The pairwise assumption that PDGD makes, namely, that all documents preceding a clicked document were observed by the user, is always correct in these circumstances, thus potentially giving it an unfair advantage over DBGD. Furthermore, the low level of noise presents a close-to-ideal situation, and it is unclear whether the findings in Chapter 3 generalize to less perfect circumstances.

In this chapter, we contrast PDGD and DBGD. Prior to an experimental comparison, we determine whether there is a theoretical advantage of DBGD over PDGD and examine the regret bounds of DBGD for ranking problems. We then investigate whether the benefits of PDGD over DBGD reported in Chapter 3 generalize to circumstances ranging from ideal to worst-case. We simulate circumstances that are perfect for both methods – behavior without noise or position bias – and circumstances that are the worst possible scenario – almost-random, extremely biased, non-cascading behavior. These settings provide estimates of upper and lower bounds on performance, and indicate how well previous comparisons generalize to different circumstances. Additionally, we introduce a version of DBGD that is provided with an oracle interleaving method; its performance shows us the maximum performance DBGD could reach from hypothetical extensions.
In summary, we map thesis research question
RQ3 into the following more fine-grained research questions:
RQ4.1
Do the regret bounds of DBGD provide a benefit over PDGD?
RQ4.2
Do the advantages of PDGD over DBGD observed in Chapter 3 generalize to extreme levels of noise and bias?
RQ4.3
Is the performance of PDGD reproducible under non-cascading user behavior?
This section provides a brief overview of traditional LTR (Section 4.2.1), of LTR from historical interactions (Section 4.2.2), and of OLTR (Section 4.2.3).
Traditionally, LTR has been approached as a supervised problem; in the context of OLTR this approach is often referred to as offline
LTR. It requires a dataset containing relevance annotations of query-document pairs, after which a variety of methods can be applied [75]. The limitations of offline LTR mainly come from obtaining such annotations. The costs of gathering annotations are high, as it is both time-consuming and expensive [17, 95]. Furthermore, annotators cannot judge for very specific users, i.e., gathering data for personalization problems is infeasible. Moreover, for certain applications it would be unethical to annotate items, e.g., for search in personal emails or documents [127]. Additionally, annotations are stationary and cannot account for (perceived) relevance changes [1, 71, 120]. Most importantly, though, annotations are not necessarily aligned with user preferences; judges often interpret queries differently from actual users [104]. As a result, there has been a shift of interest towards LTR approaches that do not require annotated data.
The idea of LTR from user interactions is long-established; one of the earliest examples is the original pairwise LTR approach [54]. This approach uses historical click-through interactions from a search engine and considers clicks as indications of relevance. Though very influential and quite effective, this approach ignores the noise and biases inherent in user interactions. Noise, i.e., any user interaction that does not reflect the user's true preference, occurs frequently, since many clicks happen for unexpected reasons [104]. Biases are systematic forms of noise that occur due to factors other than relevance. For instance, interactions will only involve displayed documents, resulting in selection bias [127]. Another important form of bias in LTR is position bias, which occurs because users are less likely to consider documents that are ranked lower [134]. Thus, to effectively learn true preferences from user interactions, an LTR method should be robust to noise and handle biases correctly.

In recent years, counterfactual LTR methods have been introduced that correct for some of the bias in user interactions. Such methods use inverse propensity scoring to account for the probability that a user observed a ranking position [58]. Thus, clicks on positions that are observed less often due to position bias will have greater weight, to account for that difference. However, the position bias must be learned and estimated somewhat accurately [5]. On the other side of the spectrum are click models, which attempt to model user behavior completely [20]. By predicting behavior accurately, the effect of relevance on user behavior can also be estimated [11, 127].

An advantage of these approaches over OLTR is that they only require historical data, and thus no new data has to be gathered. However, unlike OLTR, they do require a fairly accurate user model, and thus they cannot be applied in cold-start situations.
OLTR differs from the approaches listed above because its methods intervene in the search experience. They have control over what results are displayed, and they can learn from their interactions instantly. Thus, the online approach performs LTR by interacting with users directly [132]. Similar to LTR methods that learn from historical interaction data, OLTR methods have the potential to learn the true user preferences. However, they also have to deal with the noise and biases that come with user interactions. Another advantage of OLTR is that the methods are very responsive, as they can apply their learned behavior instantly. Conversely, this also brings a danger, as an online method that learns incorrect preferences can also worsen the experience immediately. Thus, it is important that OLTR methods are able to learn reliably in spite of noise and biases. OLTR methods therefore have a two-fold task: they have to simultaneously present rankings that provide a good user experience and learn from user interactions with the presented rankings.

The original OLTR method is Dueling Bandit Gradient Descent (DBGD); it approaches optimization as a dueling bandit problem [132]. This approach requires an online comparison method that can compare two rankers w.r.t. user preferences; traditionally, DBGD methods use interleaving. Interleaving methods take the rankings produced by two rankers and combine them into a single result list, which is then displayed to users. From a large number of clicks on the presented lists, the interleaving methods can reliably infer a preference between the two rankers [41, 96]. At each timestep, DBGD samples a candidate model, i.e., a slight variation of its current model, and compares the current and candidate models using interleaving. If a preference for the candidate is inferred, the current model is updated towards the candidate slightly. By doing so, DBGD updates its model continuously and should oscillate towards an inferred optimum. Section 4.3 provides a complete description of the DBGD algorithm.

Virtually all work in OLTR in the decade since the introduction of DBGD has used DBGD as a basis. A straightforward extension comes in the form of Multileave Gradient Descent [111], which compares a large number of candidates per interaction [81, 108, 109]. This leads to a much faster learning process, though in the long run this method does not seem to improve the point of convergence. One of the earliest extensions of DBGD proposed a method for reusing historical interactions to guide exploration for faster learning [43]. While the initial results showed great improvements [43], later work showed performance drastically decreasing in the long term due to bias introduced by the historical data [90]. Unfortunately, OLTR work that continued this historical approach [125] also only considered short-term results; moreover, the results of some work [135] are not based on held-out data. As a result, we do not know whether these extensions provide decent long-term performance, and it is unclear whether the findings of these studies generalize to more realistic settings.

In Chapter 3, an inherently different approach to OLTR was introduced with PDGD [82]. PDGD interprets its ranking model as a distribution over documents; it estimates a pairwise gradient from user interactions with sampled rankings. This gradient is differentiable, allowing for non-linear models like neural networks to be optimized, something DBGD is ineffective at [80, 82]. Section 4.4 provides a detailed description of PDGD. In the chapter in which we introduced PDGD (Chapter 3), we claim that it provides substantial improvements over DBGD. However, those claims are based on cascading click models with low levels of noise. This is problematic because PDGD assumes a cascading user, and could thus have an unfair advantage in this setting. Furthermore, it is unclear whether DBGD with a perfect interleaving method could still improve over PDGD. Lastly, DBGD has proven regret bounds while PDGD has no such guarantees.

In this chapter, we clear up these questions about the relative strengths of DBGD and PDGD by comparing the two methods under non-cascading, high-noise click models. Additionally, by providing DBGD with an oracle comparison method, its hypothetical maximum performance can be measured; thus, we can study whether an improvement over PDGD is hypothetically possible. Finally, a brief analysis of the theoretical regret bounds of DBGD shows that they do not apply to any common ranking model, therefore hardly providing a guaranteed advantage over PDGD.

This section describes the DBGD algorithm in detail, before discussing the regret bounds of the algorithm.

Algorithm 4.1 Dueling Bandit Gradient Descent (DBGD).
 1: Input: initial weights θ_1; unit: u; learning rate η.
 2: for t ← 1 … ∞ do
 3:   q_t ← receive_query(t)                      // obtain a query from a user
 4:   θ^c_t ← θ_t + sample_from_unit_sphere(u)    // create candidate ranker
 5:   R_t ← get_ranking(θ_t, D_{q_t})             // get current ranker ranking
 6:   R^c_t ← get_ranking(θ^c_t, D_{q_t})         // get candidate ranker ranking
 7:   I_t ← interleave(R_t, R^c_t)                // interleave both rankings
 8:   c_t ← display_to_user(I_t)                  // display interleaved list, record clicks
 9:   if preference_for_candidate(I_t, c_t, R_t, R^c_t) then
10:     θ_{t+1} ← θ_t + η(θ^c_t − θ_t)            // update model towards candidate
11:   else
12:     θ_{t+1} ← θ_t                             // no update
The DBGD algorithm [132] describes an indefinite loop that aims to improve a ranking model at each step; Algorithm 4.1 provides a formal description. The algorithm starts with a given model with weights θ_1 (Line 1); then it waits for a user-submitted query (Line 3). At this point a candidate ranker is sampled from the unit sphere around the current model (Line 4), and the current and candidate models both produce a ranking for the current query (Lines 5 and 6). These rankings are interleaved (Line 7) and displayed to the user (Line 8). If the interleaving method infers a preference for the candidate ranker from subsequent user interactions, the current model is updated towards the candidate (Line 10); otherwise no update is performed (Line 12). Thus, the model optimized by DBGD should converge and oscillate towards an optimum.
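A single DBGD iteration for a linear ranker can be sketched as follows; the interleaved comparison is abstracted into a callable, and all names are illustrative rather than taken from an existing implementation:

```python
import numpy as np

def dbgd_step(theta, delta, eta, prefers_candidate, rng):
    """One DBGD iteration (Lines 4-12 of Algorithm 4.1) for a linear ranker.

    prefers_candidate: callable that interleaves the rankings of the current
    and the candidate model, displays the result, and returns True if the
    clicks indicate a preference for the candidate.
    """
    # Sample a uniformly random direction on the unit sphere (Line 4).
    direction = rng.normal(size=theta.shape)
    direction /= np.linalg.norm(direction)
    theta_candidate = theta + delta * direction
    if prefers_candidate(theta, theta_candidate):
        return theta + eta * (theta_candidate - theta)  # Line 10
    return theta  # Line 12: no update
```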
Unlike PDGD, DBGD has proven regret bounds [132], potentially providing an advantage in the form of theoretical guarantees. In this section we answer RQ4.1 by critically looking at the assumptions that form the basis of DBGD's proven regret bounds.

The original DBGD paper [132] proved a sublinear regret under several assumptions. DBGD works with the parameterized space of ranking functions W; that is, every θ ∈ W is a different set of parameters for a ranking function. For this chapter we will only consider deterministic linear models, because all existing OLTR work has dealt with them [40, 43, 82, 90, 111, 125, 132, 135]; we note, however, that the proof is easily extendable to neural networks where the output is a monotonic function applied to a linear combination of the last layer. Then there is assumed to be a concave utility function u: W → R; since this function is concave, there should only be a single instance of weights that is optimal, θ^*. Furthermore, this utility function is assumed to be L-Lipschitz smooth:

∃L ∈ R, ∀(θ_a, θ_b) ∈ W, |u(θ_a) − u(θ_b)| < L‖θ_a − θ_b‖. (4.1)

We will show that these assumptions are incorrect: there is an infinite number of optimal weights, and the utility function u cannot be L-Lipschitz smooth. Our proof relies on two assumptions that avoid cases where the ranking problem is trivial. First, the zero ranker is not the optimal model:

θ^* ≠ 0. (4.2)

Second, there should be at least two models with different utility values:

∃(θ, θ') ∈ W, u(θ) ≠ u(θ'). (4.3)

We start by defining the set of rankings a model f(·, θ) will produce as:

R_D(f(·, θ)) = \{R \mid ∀(d, d') ∈ D, [f(d, θ) > f(d', θ) → d ≻_R d']\}. (4.4)

It is easy to see that multiplying a model by a positive scalar α > 0 will not affect this set:

∀α ∈ R_{>0}, R_D(f(·, θ)) = R_D(αf(·, θ)). (4.5)

Consequently, the utility of both functions will be equal:

∀α ∈ R_{>0}, u(f(·, θ)) = u(αf(·, θ)). (4.6)

For linear models, scaling the weights has the same effect: αf(·, θ) = f(·, αθ). Thus, the first assumption cannot be true, since for any optimal model f(·, θ^*) there is an infinite set of equally optimal models: {f(·, αθ^*) | α ∈ R_{>0}}.

Then, regarding L-Lipschitz smoothness, using any positive scaling factor:

∀α ∈ R_{>0}, |u(θ_a) − u(θ_b)| = |u(αθ_a) − u(αθ_b)|, (4.7)
∀α ∈ R_{>0}, ‖αθ_a − αθ_b‖ = α‖θ_a − θ_b‖. (4.8)

Thus the smoothness assumption can be rewritten as:

∃L ∈ R, ∀α ∈ R_{>0}, ∀(θ_a, θ_b) ∈ W, |u(θ_a) − u(θ_b)| < αL‖θ_a − θ_b‖. (4.9)

However, there is always an infinite number of values for α small enough to break the assumption. Therefore, we conclude that a concave L-Lipschitz smooth utility function can never exist for a deterministic linear ranking model; thus the proof for the regret bounds is not applicable when using deterministic linear models.

Consequently, the regret bounds of DBGD do not apply to the ranking problems in previous work. One may consider other models (e.g., spherical-coordinate-based models or stochastic ranking models); however, this still means that for the simplest and most common ranking problems there are no proven regret bounds. As a result, we answer RQ4.1 negatively: the regret bounds of DBGD do not provide a benefit over PDGD for the ranking problems in LTR.
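The scaling argument is easy to verify empirically: for a deterministic linear ranker, any positive rescaling of the weights produces exactly the same ranking, and hence the same utility. A small illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
theta = rng.normal(size=10)        # weights of a linear ranking model
docs = rng.normal(size=(50, 10))   # feature vectors of one query's documents

for alpha in (0.001, 1.0, 1000.0):
    ranking = np.argsort(-(docs @ (alpha * theta)))
    print(alpha, ranking[:5])      # the same top-5 for every positive alpha
```

Since u(θ) = u(αθ) while ‖αθ_a − αθ_b‖ shrinks with α, no finite Lipschitz constant can bound the unchanged utility differences.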
The Pairwise Differentiable Gradient Descent (PDGD) [82] algorithm is formally described in Algorithm 4.2. PDGD interprets a ranking function f(·, θ) as a probability distribution over documents by applying a Plackett-Luce model:

P(d \mid D, θ) = \frac{e^{f(d, θ)}}{\sum_{d' \in D} e^{f(d', θ)}}. (4.10)

Algorithm 4.2 Pairwise Differentiable Gradient Descent (PDGD).
 1: Input: initial weights θ_1; scoring function f; learning rate η.
 2: for t ← 1 … ∞ do
 3:   q_t ← receive_query(t)                // obtain a query from a user
 4:   R_t ← sample_list(f_{θ_t}, D_{q_t})   // sample list according to Eq. 4.10
 5:   c_t ← receive_clicks(R_t)             // show result list to the user
 6:   ∇f(·, θ_t) ← 0                        // initialize gradient
 7:   for d_i ≻_c d_j ∈ c_t do
 8:     w ← ρ(d_i, d_j, R, D)               // initialize pair weight (Eq. 4.13)
 9:     w ← w × P(d_i ≻ d_j | θ_t) P(d_j ≻ d_i | θ_t)   // pair gradient (Eq. 4.12)
10:     ∇f(·, θ_t) ← ∇f(·, θ_t) + w × (f'(d_i, θ_t) − f'(d_j, θ_t))   // model gradient (Eq. 4.12)
11:   θ_{t+1} ← θ_t + η∇f(·, θ_t)           // update the ranking model

First, the algorithm waits for a user query (Line 3); then a ranking R is created by sampling documents without replacement (Line 4). PDGD then observes clicks from the user and infers pairwise document preferences from them. All documents preceding a clicked document, and the first document succeeding it, are assumed to be observed by the user. Preferences between clicked and unclicked observed documents are inferred by PDGD; this is a long-standing assumption in pairwise LTR [54]. We denote an inferred preference between documents as d_i ≻_c d_j, and the probability of the model placing d_i earlier than d_j is denoted and calculated by:

P(d_i ≻ d_j \mid θ) = \frac{e^{f(d_i, θ)}}{e^{f(d_i, θ)} + e^{f(d_j, θ)}}. (4.11)

The gradient is estimated as a sum over inferred preferences with a weight ρ per pair:

∇f(·, θ) ≈ \sum_{d_i ≻_c d_j} ρ(d_i, d_j, R, D) [∆P(d_i ≻ d_j \mid θ)]
         = \sum_{d_i ≻_c d_j} ρ(d_i, d_j, R, D) P(d_i ≻ d_j \mid θ) P(d_j ≻ d_i \mid θ) (f'(d_i, θ) − f'(d_j, θ)). (4.12)

After computing the gradient (Line 10), the model is updated accordingly (Line 11). This will change the distribution (Equation 4.10) towards the inferred preferences. This distribution models the confidence over which documents should be placed first; the exploration of PDGD is naturally guided by this confidence and can vary per query.

The weighting function ρ is used to make the gradient of PDGD unbiased w.r.t. document pair preferences. It uses the reversed pair ranking R^*(d_i, d_j, R), which is the same ranking as R but with the document positions of d_i and d_j swapped. Then ρ is the ratio between the probabilities of R and R^*:

ρ(d_i, d_j, R, D) = \frac{P(R^*(d_i, d_j, R) \mid D)}{P(R \mid D) + P(R^*(d_i, d_j, R) \mid D)}. (4.13)

In Chapter 3, the weighted gradient is proven to be unbiased w.r.t. document pair preferences under certain assumptions about the user. Here, this unbiasedness is defined by being able to rewrite the gradient as:

E[∆f(·, θ)] = \sum_{(d_i, d_j) \in D} α_{ij} (f'(d_i, θ) − f'(d_j, θ)), (4.14)

with the sign of α_{ij} agreeing with the preference of the user:

sign(α_{ij}) = sign(relevance(d_i) − relevance(d_j)). (4.15)

The proof in Chapter 3 only relies on the difference in the probabilities of inferring a preference d_i ≻_c d_j in R and the opposite preference d_j ≻_c d_i in R^*(d_i, d_j, R):

sign(P(d_i ≻_c d_j \mid R) − P(d_j ≻_c d_i \mid R^*)) = sign(relevance(d_i) − relevance(d_j)). (4.16)

As long as Equation 4.16 is true, Equations 4.14 and 4.15 hold as well. Interestingly, this means that other assumptions about the user can be made than in Chapter 3, and other variations of PDGD are possible; e.g., the algorithm could assume that all documents are observed and the proof would still hold.

Chapter 3 reports large improvements over DBGD; however, these improvements were observed under simulated cascading user models. This means that the assumptions that PDGD makes about which documents are observed are always true. As a result, it is currently unclear whether the method is really better in cases where the assumption does not hold.
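The weight of Equation 4.13 only requires the Plackett-Luce probabilities of R and of the reversed pair ranking. A direct sketch, computed in log-space for numerical stability (all names illustrative):

```python
import numpy as np

def log_pl_prob(scores, ranking):
    """Log-probability of `ranking` under the Plackett-Luce model (Eq. 4.10)."""
    log_p, remaining = 0.0, list(range(len(scores)))
    for d in ranking:
        s = scores[remaining]
        # log P(d | remaining) = f(d) - logsumexp over remaining documents.
        log_p += scores[d] - (s.max() + np.log(np.sum(np.exp(s - s.max()))))
        remaining.remove(d)
    return log_p

def pair_weight(scores, ranking, i, j):
    """rho of Eq. 4.13 for the documents at positions i and j of `ranking`."""
    reversed_ranking = list(ranking)
    reversed_ranking[i], reversed_ranking[j] = (reversed_ranking[j],
                                                reversed_ranking[i])
    log_r = log_pl_prob(scores, ranking)
    log_r_star = log_pl_prob(scores, reversed_ranking)
    # P(R*) / (P(R) + P(R*)) = 1 / (1 + exp(log P(R) - log P(R*))).
    return 1.0 / (1.0 + np.exp(log_r - log_r_star))
```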
In this section we detail the experiments that were performed to answer the research questions in Section 4.1.

Our experiments are performed over three large labelled datasets from commercial search engines, the largest publicly available LTR datasets. These datasets are the MSLR-WEB10K [95], Yahoo! Webscope [17], and Istella [27] datasets. Each contains a set of queries with corresponding preselected document sets. Query-document pairs are represented by feature vectors and five-grade relevance annotations ranging from not relevant (0) to perfectly relevant (4). Together, the datasets contain over 29,900 queries and between 136 and 700 features per representation.
In order to simulate user behavior we partly follow the standard setup for OLTR [38, 40, 90, 111, 137]. (The resources for reproducing the experiments in this chapter are available at https://github.com/HarrieO/OnlineLearningToRank.) At each step, a user-issued query is simulated by uniformly sampling from the datasets. The algorithm then decides what result list to display to the user; the result list is limited to k = 10 documents. Then user interactions are simulated using click models [20]. Past OLTR work has only considered cascading click models [36]; in contrast, we also use non-cascading click models. The probability of a click is conditioned on relevance and observance:

P(click(d) \mid relevance(d), observed(d)). (4.17)

We use two levels of noise to simulate perfect user behavior and almost random behavior [39]; Table 4.1 lists the probabilities of both. The perfect user observes all documents, never clicks on anything non-relevant, and always clicks on the most relevant documents. Two variants of almost random behavior are used. The first is based on cascading behavior: the user first observes the top document, then decides to click according to Table 4.1. If a click occurs, then, with probability P(stop | click) = 0. the user stops looking at more documents; otherwise the process continues on the next document. The second almost random behavior is simulated in a non-cascading way; here we follow [58] and model the observing probabilities as:

P(observed(d) \mid rank(d)) = \frac{1}{rank(d)}. (4.18)

Table 4.1: Click probabilities for simulated perfect or almost random behavior, listing P(click(d) | relevance(d), observed(d)) per relevance grade (0-4) for the perfect and almost random instantiations.

The important distinction is that it is safe to assume that the cascading user has observed all documents ranked before a click, while this is not necessarily true for the non-cascading user. Since PDGD makes this assumption, testing under both models can show us how much of its performance relies on this assumption. Furthermore, the almost random models have an extreme level of noise and position bias compared to the click models used in previous OLTR work [40, 90, 111], and we argue they simulate an (almost) worst-case scenario.

In our experiments we simulate runs consisting of 1,000,000 impressions; each run was repeated 125 times under each of the three click models. PDGD was run with η = 0. and zero initialization; DBGD was run using Probabilistic Interleaving [90] with zero initialization, η = 0., and the unit sphere with δ = 1. Other variants like Multileave Gradient Descent [111] are not included; previous work has shown that their performance matches that of regular DBGD after around 30,000 impressions [82, 90, 111]. The initial boost in performance comes at a large computational cost, though, as the fastest approaches keep track of at least 50 ranking models [90], which makes running long experiments extremely impractical. Instead, we introduce a novel oracle version of DBGD where, instead of interleaving, the NDCG values on the current query are calculated and the highest scoring model is selected. This simulates a hypothetical perfect interleaving method, and we argue that the performance of this oracle run indicates an upper bound on DBGD performance.

Performance is measured by NDCG@10 on a held-out test set; a two-sided t-test is performed for significance testing. We do not consider the user experience during training, because Chapter 3 has already investigated this aspect thoroughly.
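The non-cascading behavior can be simulated by sampling examination independently per rank according to Equation 4.18. A brief sketch, with illustrative near-uniform click probabilities standing in for the almost random entries of Table 4.1:

```python
import numpy as np

def simulate_noncascading_clicks(rel_labels, p_click_given_rel, rng):
    """Non-cascading clicks: examination depends only on rank (Eq. 4.18)."""
    ranks = np.arange(1, len(rel_labels) + 1)
    examined = rng.random(len(rel_labels)) < 1.0 / ranks
    p_click = np.array([p_click_given_rel[r] for r in rel_labels])
    return examined & (rng.random(len(rel_labels)) < p_click)

rng = np.random.default_rng(3)
almost_random = [0.4, 0.45, 0.5, 0.55, 0.6]  # illustrative values per grade
print(simulate_noncascading_clicks([0, 4, 1, 0, 3], almost_random, rng))
```

Unlike the cascade simulation, a document below an unclicked document can be examined here without every preceding document having been observed, which is exactly the condition that violates PDGD's observation assumption.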
4.6. Experimental Results and Analysis

Recall that in Section 4.3.2 we have already provided a negative answer to RQ4.1: the regret bounds of DBGD do not provide a benefit over PDGD for the common ranking problem in LTR. In this section we present our experimental results and answer
RQ4.2 (whether the advantages of PDGD over DBGD found in previous work generalize to extreme levels of noise and bias) and
RQ4.3 (whether the performance of PDGD is reproducible under non-cascading user behavior).

Our main results are presented in Table 4.2. Additionally, Figure 4.1 displays the average performance over 1,000,000 impressions. First, we consider the performance of DBGD; there is a substantial difference between its performance under the perfect and almost random user models on all datasets. Thus, it seems that DBGD is strongly affected by noise and bias in interactions; interestingly, there is little difference between performance under the cascading and non-cascading behavior. On all datasets the oracle version of DBGD performs significantly better than DBGD under perfect user behavior. This means there is still room for improvement, and hypothetical improvements in, e.g., interleaving could lead to significant increases in long-term DBGD performance.

Next, we look at the performance of PDGD; here, there is also a significant difference between performance under the perfect and almost random user models on all datasets. However, the effect of noise and bias is very limited compared to DBGD, and the difference at 1,000,000 impressions is always small on any dataset. To answer
RQ4.2, we compare the performance of DBGD and PDGD. Across all datasets, when comparing DBGD and PDGD under the same levels of interaction noise and bias, the performance of PDGD is significantly better in every case. Furthermore, PDGD under the perfect user model significantly outperforms the oracle run of DBGD, despite the latter being able to directly observe the NDCG of rankers on the current query. Moreover, when comparing PDGD's performance under the almost random user model with DBGD under the perfect user model, we see that the differences are limited and go in both directions. Thus, even under ideal circumstances DBGD does not consistently outperform PDGD under extremely difficult circumstances. As a result, we answer
RQ4.2 positively: our results strongly indicate that the performance of PDGD is considerably better than that of DBGD, and that these findings generalize from ideal circumstances to settings with extreme levels of noise and bias. Finally, to answer
RQ4.3, we look at the performance under the two almost random user models. Surprisingly, there is no clear difference between the performance of PDGD under cascading and non-cascading user behavior. The differences are small, and per dataset it differs which circumstances are slightly preferred. Therefore, we answer
RQ4.3 positively: the performance of PDGD is reproducible under non-cascading user behavior.
4.7. Conclusion

In this chapter, we have reproduced and generalized findings about the relative performance of Dueling Bandit Gradient Descent (DBGD) and Pairwise Differentiable Gradient Descent (PDGD). Our results show that the performance of PDGD is reproducible under non-cascading user behavior.
Table 4.2: Performance (NDCG@10) after 1,000,000 impressions for DBGD and PDGD under a perfect click model and two almost-random click models (cascading and non-cascading), and for DBGD with an oracle comparator. Standard deviations are shown in brackets. Significant improvements and losses are indicated by ▲ and ▼, respectively; ◦ indicates no significant difference. Indications are in order of: oracle, perfect, cascading, and non-cascading.
                 Yahoo            MSLR             Istella
Dueling Bandit Gradient Descent
oracle           (0.001) ▼ ▲ ▲    (0.004) ▼ ▲ ▲    (0.001) ▼ ▲ ▲
perfect          (0.002) ▼ ◦ ◦    (0.004) ▼ ▲ ▲    (0.002) ▼ ▼ ▼
cascading        (0.008) ▼ ▼ ▼    (0.006) ▼ ▼ ▼    (0.014) ▼ ▼ ▼
non-cascading    (0.010) ▼ ▼ ▼    (0.014) ▼ ▼ ▼    (0.014) ▼ ▼ ▼
Pairwise Differentiable Gradient Descent
perfect          (0.001) ▲ ▲ ▲ ▲    (0.003) ▲ ▲ ▲ ▲    (0.000) ▲ ▲ ▲ ▲
cascading        (0.003) ▼ ◦ ▲ ▲    (0.007) ▼ ▼ ▲ ▲    (0.003) ▼ ▲ ▲ ▲
non-cascading    (0.003) ▼ ◦ ▲ ▲    (0.005) ▼ ▼ ▲ ▲    (0.003) ▼ ▲ ▲ ▲

Furthermore, PDGD outperforms DBGD in both ideal and extremely difficult circumstances with high levels of noise and bias. Moreover, the performance of PDGD in extremely difficult circumstances is comparable to that of DBGD in ideal circumstances. Additionally, we have shown that the regret bounds of DBGD are not applicable to the common ranking problem in LTR. In summary, our results strongly confirm the previous finding that PDGD consistently outperforms DBGD, and generalize this conclusion to circumstances with extreme levels of noise and bias.

With these findings we can answer RQ3 mostly negatively: the theory behind DBGD is not sound for the common deterministic ranking problem; moreover, DBGD shows extremely poor performance compared to the PDGD method under varying conditions. Consequently, there appears to be no advantage to using DBGD over PDGD in either theoretical or empirical terms. In addition, a decade of OLTR work has attempted to extend DBGD in numerous ways without leading to any measurable long-term improvements. Together, this suggests that the general approach of DBGD-based methods, i.e., sampling models and comparing with online evaluation, is not an effective way of optimizing ranking models. Although the PDGD method considerably outperforms the DBGD approach, we currently do not have a theoretical explanation for this difference. Thus it seems plausible that a more effective OLTR method could be derived if the theory behind the effectiveness of OLTR methods were better understood. Due to this potential and the current lack of regret bounds applicable to OLTR, we argue that a theoretical analysis of OLTR would be a very valuable future contribution to the field.

Finally, we consider the limitations of the comparison in this chapter. As is standard in OLTR, our results are based on simulated user behavior. These simulations provide valuable insights: they enable direct control over biases and noise, and evaluation can be performed at each time step. In this chapter, the generalizability of this setup was pushed the furthest by varying the conditions to be extremely difficult. It appears unlikely that more reliable conclusions can be reached from simulated behavior. Thus we argue that the most valuable future comparisons would be in experimental settings with real users. Furthermore, with the performance improvements of PDGD, the time seems right for evaluating the effectiveness of OLTR in real-world applications.

The limited theoretical guarantees regarding OLTR methods prompted the second part of this thesis, where we consider counterfactual LTR. In contrast with OLTR, counterfactual LTR methods are founded on assumed models of user behavior and are proven to unbiasedly optimize ranking metrics if the assumed models are correct. Despite these theoretical strengths, empirical comparisons in previous work show that PDGD is more robust than existing counterfactual LTR methods.
In Chapter 8 we introduce a counterfactual LTR method that can reach the same levels of performance as PDGD when applied online.
Figure 4.1: Performance (NDCG@10) on held-out data from the Yahoo (top), MSLR (center), and Istella (bottom) datasets, under the perfect and almost random user models: cascading (casc.) and non-cascading (non-casc.). The shaded areas display the standard deviation.

[Figure 4.1 plots NDCG over iterations for PDGD (perfect, casc., non-casc.) and DBGD (oracle, perfect, casc., non-casc.).]

4.A. Notation Reference for Chapter 4

Notation: Description
t: a timestep
q: a user-issued query
d, d_k, d_l: documents
d (feature vector): feature representation of a query-document pair
D: set of documents
R: ranked list
I_t: an interleaved result list
R*: the reversed pair ranking R*(d_k, d_l, R)
ρ: preference pair weighting function
θ: parameters of the ranking model
f_θ(·): ranking model with parameters θ
f_θ(d_k): ranking score for a document from the model
c_t: a binary vector representing the clicks at timestep t

Part II: A Single Framework for Online and Counterfactual Learning to Rank

5. Policy-Aware Counterfactual Learning to Rank for Top-k Rankings
Counterfactual Learning to Rank (LTR) methods optimize ranking systems using logged user interactions that contain interaction biases. Existing methods are only unbiased if users are presented with all relevant items in every ranking. However, in prevalent top-k ranking settings not all items can be displayed at once. Therefore, there currently exists no counterfactual unbiased LTR method for top-k rankings. In this chapter we address this limitation by asking the thesis research question:
RQ4 Can counterfactual LTR be extended to top-k ranking settings?

We introduce a novel policy-aware counterfactual estimator for LTR metrics that can account for the effect of a stochastic logging policy. We prove that the policy-aware estimator is unbiased if every relevant item has a non-zero probability to appear in the top-k ranking. Our experimental results show that the performance of our estimator is not affected by the size of k: for any k, the policy-aware estimator reaches the same retrieval performance while learning from top-k feedback as when learning from feedback on the full ranking.

While the policy-aware estimator allows us to learn from top-k feedback, there is no theoretically-grounded way to optimize for top-k ranking metrics. Furthermore, existing counterfactual LTR work has mostly used novel loss functions for optimization, which are quite different from those used in supervised LTR. This led us to ask the following thesis research question:
RQ5 Is it possible to apply state-of-the-art supervised LTR to the counterfactual LTR problem?

In this chapter, we also introduce novel extensions of supervised LTR methods to perform counterfactual LTR and to optimize top-k metrics. Together, our contributions introduce the first policy-aware unbiased LTR approach that learns from top-k feedback and optimizes top-k metrics. As a result, counterfactual LTR is now applicable to the very prevalent top-k ranking setting in search and recommendation. This chapter was published as [86]. Appendix 5.A gives a reference for the notation used in this chapter.

5.1. Introduction
LTR optimizes ranking systems to provide high-quality rankings. Interest in LTR from user interactions has greatly increased in recent years with the introduction of unbiased LTR methods [58, 127]. The potential for learning from logged user interactions is great: user interactions provide valuable implicit feedback while also being cheap and relatively easy to acquire at scale [57]. However, interaction logs also contain large amounts of bias, which is the result of both user behavior and the ranker used during logging. For instance, users are more likely to examine items at the top of rankings; consequently, the display position of an item heavily affects the number of interactions it receives [128]. This effect is called position bias and it is very dominant when learning from interactions with rankings. Naively ignoring it during learning can be detrimental to ranking performance, as the learning process is then strongly impacted by which rankings were displayed during logging instead of true user preferences. The goal of unbiased LTR methods is to optimize a ranker w.r.t. the true user preferences; consequently, they have to account and correct for such forms of bias.

Previous work on unbiased LTR has mainly focused on accounting for position bias through counterfactual learning [5, 58, 127]. The prevalent approach models the probability of a user examining an item in a displayed ranking. This probability can be inferred from user interactions [4, 5, 58, 127, 128] and corrected for using inverse propensity scoring. As a result, these methods optimize a loss that in expectation is unaffected by the examination probabilities during logging; hence it is unbiased w.r.t. position bias.

This approach has been applied effectively in various ranking settings, including search for scientific articles [58], email [127] or other personal documents [128]. However, a limitation of existing approaches is that in every logged ranking they require every relevant item to have a non-zero chance of being examined [16, 58]. In this chapter, we focus on top-k rankings, where the number of displayed items is systematically limited. These rankings can display at most k items, making it practically unavoidable that relevant items are missing. Consequently, existing counterfactual LTR methods are not unbiased in these settings. We recognize this problem as item-selection bias, introduced by the selection of (only) k items to display. This is especially concerning since top-k rankings are quite prevalent, e.g., in recommendation [26, 48], mobile search [9, 124], query autocompletion [14, 127, 128], and digital assistants [112].

Our main contribution is a novel policy-aware estimator for counterfactual LTR that accounts for both a stochastic logging policy and the users' examination behavior. Our policy-aware approach can be viewed as a generalization of the existing counterfactual LTR framework [2, 58]. We prove that our policy-aware approach performs unbiased LTR and evaluation while learning from top-k feedback. Our experimental results show that while our policy-aware estimator is unaffected by the choice of k, the existing policy-oblivious approach is strongly affected even under large values of k. For instance, optimization with the policy-aware estimator on top-5 feedback reaches the same performance as when receiving feedback on all results.
Furthermore, because top-k metrics are the only relevant metrics in top-k rankings, we also propose extensions to traditional LTR approaches that are proven to optimize top-k metrics unbiasedly, and we introduce a pragmatic way to choose optimally between available loss functions.

This chapter is based around two main contributions:
1. A novel estimator for unbiased LTR from top-k feedback.
2. Unbiased losses that optimize bounds on top-k LTR metrics.

To the best of our knowledge, our policy-aware estimator is the first estimator that is unbiased in top-k ranking settings.

5.2. Background

In this section we discuss supervised LTR and counterfactual LTR [58].
5.2.1. Supervised learning to rank

The goal of LTR is to optimize ranking systems w.r.t. specific ranking metrics. Ranking metrics generally involve items d, their relevance r w.r.t. a query q, and their position in the ranking R produced by the system. We will optimize the Empirical Risk [121] over the set of queries Q, with a loss Δ(R_i | q_i, r) for a single query q_i:

\mathcal{L} = \frac{1}{|Q|} \sum_{q_i \in Q} \Delta(R_i \mid q_i, r).  (5.1)

For simplicity we assume that relevance is binary, r(q, d) ∈ {0, 1}; for brevity we write r(q, d) = r(d). Ranking metrics then commonly take the form of a sum over items:

\Delta(R \mid q, r) = \sum_{d \in R} \lambda(d \mid R) \cdot r(d),  (5.2)

where λ can be chosen for a specific metric, e.g., for Average Relevance Position (ARP) or Discounted Cumulative Gain (DCG):

\lambda_{\text{ARP}}(d \mid R) = \text{rank}(d \mid R),  (5.3)
\lambda_{\text{DCG}}(d \mid R) = -\log_2\big(1 + \text{rank}(d \mid R)\big)^{-1}.  (5.4)

In a so-called full-information setting, where the relevance values r are known, optimization can be done through traditional LTR methods [13, 54, 75, 129].

5.2.2. Counterfactual learning to rank

Optimizing a ranking loss from the implicit feedback in interaction logs requires a different approach from supervised LTR. We will assume that clicks are gathered using a logging policy π, with the probability of displaying ranking R̄ for query q denoted as π(R̄ | q). Let o_i(d) ∈ {0, 1} indicate whether d was examined by a user at interaction i, with o_i(d) ∼ P(o(d) | q_i, r, R̄_i). Furthermore, we assume that users click on all relevant items they observe and nothing else: c_i(d) = [r(d) ∧ o_i(d)]. Our goal is to find an estimator Δ̂ that provides an unbiased estimate of the actual loss; for N interactions this estimate is:

\hat{\mathcal{L}} = \frac{1}{N} \sum_{i=1}^{N} \hat{\Delta}(R_i \mid q_i, \bar{R}_i, \pi, c_i).  (5.5)

We write R_i for the ranking produced by the system for which the loss is being computed, while R̄_i is the ranking that was displayed when logging interaction i. For brevity we drop i from our notation when only a single interaction is involved. A naive estimator could simply consider every click to indicate relevance:

\hat{\Delta}_{\text{naive}}(R \mid q, c) = \sum_{d : c(d) = 1} \lambda(d \mid R).  (5.6)

Taking the expectation over the displayed ranking and observance variables results in the following expected loss:

\mathbb{E}_{o, \bar{R}}\big[\hat{\Delta}_{\text{naive}}(R \mid q, c)\big]
= \mathbb{E}_{o, \bar{R}}\Big[\sum_{d \in R} \lambda(d \mid R) \cdot c(d)\Big]
= \mathbb{E}_{o, \bar{R}}\Big[\sum_{d \in R} o(d) \cdot \lambda(d \mid R) \cdot r(d)\Big]  (5.7)
= \mathbb{E}_{\bar{R}}\Big[\sum_{d \in R} P(o(d) = 1 \mid q, r, \bar{R}) \cdot \lambda(d \mid R) \cdot r(d)\Big]
= \sum_{\bar{R} \in \pi(\cdot \mid q)} \pi(\bar{R} \mid q) \cdot \sum_{d \in R} P(o(d) = 1 \mid q, r, \bar{R}) \cdot \lambda(d \mid R) \cdot r(d).

Here, the effect of position bias is very clear: in expectation, items are weighted according to their probability of being examined. Furthermore, it shows that examination probabilities are determined by both the logging policy π and user behavior P(o(d) | q, r, R̄).

In order to avoid the effect of position bias, Joachims et al. [58] introduced an inverse-propensity-scoring estimator in the same vein as previous work by Wang et al. [127]. The main idea behind this estimator is that if the examination probabilities are known, then they can be corrected for per click:

\hat{\Delta}_{\text{oblivious}}(R \mid q, c, \bar{R}) = \sum_{d : c(d) = 1} \frac{\lambda(d \mid R)}{P(o(d) = 1 \mid q, r, \bar{R})}.  (5.8)
In contrast to the naive estimator (Eq. 5.6), this policy-oblivious estimator (Eq. 5.8) can provide an unbiased estimate of the loss:

\mathbb{E}_{o, \bar{R}}\big[\hat{\Delta}_{\text{oblivious}}(R \mid q, c, \bar{R})\big]
= \mathbb{E}_{o, \bar{R}}\Big[\sum_{d : c(d) = 1} \frac{\lambda(d \mid R)}{P(o(d) = 1 \mid q, r, \bar{R})}\Big]
= \sum_{d \in R} \mathbb{E}_{o, \bar{R}}\Big[\frac{o(d) \cdot \lambda(d \mid R) \cdot r(d)}{P(o(d) = 1 \mid q, r, \bar{R})}\Big]
= \sum_{d \in R} \mathbb{E}_{\bar{R}}\Big[\frac{P(o(d) = 1 \mid q, r, \bar{R}) \cdot \lambda(d \mid R) \cdot r(d)}{P(o(d) = 1 \mid q, r, \bar{R})}\Big]
= \sum_{d \in R} \lambda(d \mid R) \cdot r(d) = \Delta(R \mid q, r).  (5.9)

We note that the last step assumes P(o(d) = 1 | q, r, R̄) > 0, and that only relevant items (r(d) = 1) contribute to the estimate [58]. Therefore, this estimator is unbiased as long as the examination probabilities are positive for every relevant item:

\forall d, \forall \bar{R} \in \pi(\cdot \mid q): \; r(d) = 1 \rightarrow P(o(d) = 1 \mid q, r, \bar{R}) > 0.  (5.10)

Intuitively, this condition exists because propensity weighting is applied to items clicked in the displayed ranking, and items that cannot be observed can never receive clicks. Thus, there are no clicks that can be weighted more heavily to adjust for the zero observance probability of an item.

An advantageous property of the policy-oblivious estimator Δ̂_oblivious is that the logging policy π does not have to be known. That is, as long as Condition 5.10 is met, it works regardless of how interactions were logged. Additionally, Joachims et al. [58] proved that it is still unbiased under click noise. Virtually all recent counterfactual LTR methods use the policy-oblivious estimator for LTR optimization [3–5, 58, 127, 128].
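For concreteness, a minimal sketch (ours) of the policy-oblivious estimator of Eq. 5.8 in Python, with the DCG-based weight of Eq. 5.4; all inputs are assumed precomputed:

```python
import numpy as np

def lambda_dcg(rank):
    # DCG-based loss weight (Eq. 5.4); `rank` starts at 1.
    return -1.0 / np.log2(1.0 + rank)

def delta_oblivious(new_ranks, clicks, exam_probs, lambda_fn=lambda_dcg):
    """new_ranks[d]: rank of item d under the ranker being evaluated;
    clicks[d]: 1 if d was clicked in the logged interaction;
    exam_probs[d]: P(o(d)=1 | q, r, displayed ranking)."""
    estimate = 0.0
    for d, c in enumerate(clicks):
        if c:
            estimate += lambda_fn(new_ranks[d]) / exam_probs[d]
    return estimate
```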
5.3. Learning from Top-k Feedback

In this section we explain why the existing policy-oblivious counterfactual LTR framework is not applicable to top-k rankings. Subsequently, we propose a novel solution through policy-aware propensity scoring that takes the logging policy into account.

5.3.1. Item-selection bias in top-k feedback

An advantage of the existing policy-oblivious estimator for counterfactual LTR described in Section 5.2.2 is that the logging policy does not need to be known, making its application easier. However, the policy-oblivious estimator is only unbiased when Condition 5.10 is met: every relevant item must have a non-zero probability of being observed in every ranking displayed during logging.
We recognize that in top-k rankings, where only k items can be displayed, relevant items may systematically lack non-zero examination probabilities. This happens because items outside the top-k cannot be examined by the user:

\forall d, \forall \bar{R}: \; \text{rank}(d \mid \bar{R}) > k \rightarrow P(o(d) = 1 \mid q, r, \bar{R}) = 0.  (5.11)

In most top-k ranking settings it is very unlikely that Condition 5.10 is satisfied: if k is very small, the number of relevant items is large, or the logging policy π is ineffective at retrieving relevant items, it is unlikely that all relevant items will be displayed in the top-k positions. Moreover, for a small value of k the performance of the logging policy π has to be near ideal for all relevant items to be displayed. We call this effect item-selection bias, because in this setting the logging ranker makes a selection of which k items to display, in addition to the order in which to display them (position bias). The existing policy-oblivious estimator for counterfactual LTR (as described in Section 5.2.2) cannot correct for item-selection bias when it occurs, and can thus be affected by this bias when applied to top-k rankings.

Item-selection bias is inevitable in a single top-k ranking, due to the limited number of items that can be displayed. However, across multiple top-k rankings more than k items can be displayed if the displayed rankings differ enough. Thus, a stochastic logging policy can provide every item with a non-zero probability to appear in the top-k ranking. Then, the probability of examination can be calculated as an expectation over the displayed ranking:

P(o(d) = 1 \mid q, r, \pi) = \mathbb{E}_{\bar{R}}\big[P(o(d) = 1 \mid q, r, \bar{R})\big] = \sum_{\bar{R} \in \pi(\cdot \mid q)} \pi(\bar{R} \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R}).  (5.12)

This policy-dependent examination probability can be non-zero for all items, even if all items cannot be displayed in a single top-k ranking. Naturally, this leads to a policy-aware estimator:

\hat{\Delta}_{\text{aware}}(R \mid q, c, \pi) = \sum_{d : c(d) = 1} \frac{\lambda(d \mid R)}{P(o(d) = 1 \mid q, r, \pi)}.  (5.13)

By basing the propensity on the policy instead of the individual rankings, the policy-aware estimator can correct for zero observance probabilities in some displayed rankings by more heavily weighting clicks on other displayed rankings with non-zero observance probabilities. Thus, if a click occurs on an item that the logging policy rarely displays in a top-k ranking, this click may be weighted more heavily than a click on an item that is displayed in the top-k very often. In contrast, the policy-oblivious approach only corrects for the observation probability of the displayed ranking in which the click occurred; thus it does not correct for the fact that an item may be missing from the top-k in other displayed rankings.
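A minimal sketch (ours) of the policy-aware propensity of Eq. 5.12 and the estimator of Eq. 5.13; the logging policy is passed explicitly as its support, i.e., pairs of a display probability π(R̄ | q) and the per-item examination probabilities under that ranking:

```python
import numpy as np

def policy_aware_propensities(policy):
    """policy: list of (pi_R, exam_probs_R) pairs, one per ranking in the
    policy's support; exam_probs_R is an array over all items.
    Returns P(o(d)=1 | q, r, pi) for every item d (Eq. 5.12)."""
    return sum(pi_R * np.asarray(exam_R) for pi_R, exam_R in policy)

def delta_aware(new_ranks, clicks, policy, lambda_fn):
    # Eq. 5.13: weight each click by the inverse policy-aware propensity.
    propensities = policy_aware_propensities(policy)
    return sum(lambda_fn(new_ranks[d]) / propensities[d]
               for d, c in enumerate(clicks) if c)
```

In practice the support of π can be too large to enumerate; for the randomized top-k policy used later in this chapter the expectation has a simple closed form (Eq. 5.37).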
In expectation, the policy-aware estimator provides an unbiased estimate of the ranking loss:

\mathbb{E}_{o, \bar{R}}\big[\hat{\Delta}_{\text{aware}}(R \mid q, c, \pi)\big]
= \mathbb{E}_{o, \bar{R}}\Big[\sum_{d : c(d) = 1} \frac{\lambda(d \mid R)}{P(o(d) = 1 \mid q, r, \pi)}\Big]
= \sum_{d \in R} \mathbb{E}_{o, \bar{R}}\Big[\frac{o(d) \cdot \lambda(d \mid R) \cdot r(d)}{\sum_{\bar{R}' \in \pi(\cdot \mid q)} \pi(\bar{R}' \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R}')}\Big]
= \sum_{d \in R} \mathbb{E}_{\bar{R}}\Big[\frac{P(o(d) = 1 \mid q, r, \bar{R}) \cdot \lambda(d \mid R) \cdot r(d)}{\sum_{\bar{R}' \in \pi(\cdot \mid q)} \pi(\bar{R}' \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R}')}\Big]
= \sum_{d \in R} \frac{\sum_{\bar{R} \in \pi(\cdot \mid q)} \pi(\bar{R} \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R})}{\sum_{\bar{R}' \in \pi(\cdot \mid q)} \pi(\bar{R}' \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R}')} \cdot \lambda(d \mid R) \cdot r(d)
= \sum_{d \in R} \lambda(d \mid R) \cdot r(d) = \Delta(R \mid q, r).  (5.14)

In contrast to the policy-oblivious approach (Section 5.2.2), this proof is sound as long as every relevant item has a non-zero probability of being examined under the logging policy π:

\forall d: \; r(d) = 1 \rightarrow \sum_{\bar{R} \in \pi(\cdot \mid q)} \pi(\bar{R} \mid q) \cdot P(o(d) = 1 \mid q, r, \bar{R}) > 0.  (5.15)

It is easy to see that Condition 5.10 implies Condition 5.15; in other words, for all settings where the policy-oblivious estimator (Eq. 5.8) is unbiased, the policy-aware estimator (Eq. 5.13) is also unbiased. Conversely, Condition 5.15 does not imply Condition 5.10, thus there are cases where the policy-aware estimator is unbiased but the policy-oblivious estimator is not guaranteed to be.

To better understand for which policies Condition 5.15 is satisfied, we introduce a substitute Condition 5.16:

\forall d: \; r(d) = 1 \rightarrow \exists \bar{R} \, \big[\pi(\bar{R} \mid q) > 0 \wedge P(o(d) = 1 \mid q, r, \bar{R}) > 0\big].  (5.16)

Since Condition 5.16 is equivalent to Condition 5.15, we see that the policy-aware estimator is unbiased for any logging policy that provides a non-zero probability for every relevant item to appear in a position with a non-zero examination probability. Thus, to satisfy Condition 5.16 in a top-k ranking setting, every relevant item requires a non-zero probability of being displayed in the top-k.

As long as Condition 5.16 is met, a wide variety of policies can be chosen according to different criteria. Moreover, the policy can be deterministic if k is large enough to display every relevant item. Similarly, the policy-oblivious estimator can be seen as a special case of the policy-aware estimator where the policy is deterministic (or assumed to be). The big advantage of our policy-aware estimator is that it is applicable to a much larger number of settings than the existing policy-oblivious estimator, including those where feedback is only received on the top-k ranked items.

To better understand the difference between the policy-oblivious and policy-aware estimators, we introduce an illustrative example that contrasts the two. We consider a single query q and a logging policy π that chooses between two rankings to display, R̄₁ and R̄₂, with π(R̄₁ | q) > 0, π(R̄₂ | q) > 0, and π(R̄₁ | q) + π(R̄₂ | q) = 1.
Then, for a generic estimator we consider how it treats a single relevant item d_n with r(d_n) ≠ 0, using the expectation:

\mathbb{E}_{o, \bar{R}}\bigg[\frac{c(d_n) \cdot \lambda(d_n \mid R)}{\rho(o(d_n) = 1 \mid q, d_n, \bar{R}, \pi)}\bigg] = \lambda(d_n \mid R) \cdot r(d_n) \cdot \bigg(\frac{\pi(\bar{R}_1 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_1)}{\rho(o(d_n) = 1 \mid q, d_n, \bar{R}_1, \pi)} + \frac{\pi(\bar{R}_2 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_2)}{\rho(o(d_n) = 1 \mid q, d_n, \bar{R}_2, \pi)}\bigg),  (5.17)

where the propensity function ρ can be chosen to match either the policy-oblivious (Eq. 5.8) or the policy-aware (Eq. 5.13) estimator.

First, we examine the situation where d_n appears in the top-k of both rankings R̄₁ and R̄₂, so that it has a positive observance probability in both cases: P(o(d_n) = 1 | q, r, R̄₁) > 0 and P(o(d_n) = 1 | q, r, R̄₂) > 0. Here, the policy-oblivious estimator Δ̂_oblivious (Eq. 5.8) removes the effect of observation bias by adjusting for the observance probability per displayed ranking:

\bigg(\pi(\bar{R}_1 \mid q) \cdot \frac{P(o(d_n) = 1 \mid q, r, \bar{R}_1)}{P(o(d_n) = 1 \mid q, r, \bar{R}_1)} + \pi(\bar{R}_2 \mid q) \cdot \frac{P(o(d_n) = 1 \mid q, r, \bar{R}_2)}{P(o(d_n) = 1 \mid q, r, \bar{R}_2)}\bigg) \cdot \lambda(d_n \mid R) \cdot r(d_n) = \lambda(d_n \mid R) \cdot r(d_n).  (5.18)

The policy-aware estimator Δ̂_aware (Eq. 5.13) also corrects for the examination bias, but because its propensity scores are based on the policy instead of the individual rankings (Eq. 5.12), it uses the same score for both rankings:

\frac{\pi(\bar{R}_1 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_1) + \pi(\bar{R}_2 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_2)}{\pi(\bar{R}_1 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_1) + \pi(\bar{R}_2 \mid q) \cdot P(o(d_n) = 1 \mid q, r, \bar{R}_2)} \cdot \lambda(d_n \mid R) \cdot r(d_n) = \lambda(d_n \mid R) \cdot r(d_n).  (5.19)

Next, we consider a different relevant item d_m with r(d_m) = r(d_n) that, unlike the previous situation, only appears in the top-k of R̄₁. Thus it only has a positive observance probability in R̄₁: P(o(d_m) = 1 | q, r, R̄₁) > 0 and P(o(d_m) = 1 | q, r, R̄₂) = 0. Consequently, no clicks can ever be received in R̄₂, i.e., R̄ = R̄₂ → c(d_m) = 0; thus the expectation for d_m only has to consider R̄₁:

\mathbb{E}_{o, \bar{R}}\bigg[\frac{c(d_m) \cdot \lambda(d_m \mid R)}{\rho(o(d_m) = 1 \mid q, d_m, \bar{R}, \pi)}\bigg] = \frac{\pi(\bar{R}_1 \mid q) \cdot P(o(d_m) = 1 \mid q, r, \bar{R}_1)}{\rho(o(d_m) = 1 \mid q, d_m, \bar{R}_1, \pi)} \cdot \lambda(d_m \mid R) \cdot r(d_m).  (5.20)

In this situation, Condition 5.10 is not satisfied, and correspondingly the policy-oblivious estimator (Eq. 5.8) does not give an unbiased estimate:

\frac{\pi(\bar{R}_1 \mid q) \cdot P(o(d_m) = 1 \mid q, r, \bar{R}_1)}{P(o(d_m) = 1 \mid q, r, \bar{R}_1)} \cdot \lambda(d_m \mid R) \cdot r(d_m) < \lambda(d_m \mid R) \cdot r(d_m).  (5.21)

Since the policy-oblivious estimator Δ̂_oblivious only corrects for the observance probability per displayed ranking, it is unable to correct for the zero probability in R̄₂, as no clicks on d_m can occur there. As a result, the estimate is affected by the logging policy π: the more item-selection bias π introduces (determined by π(R̄₂ | q)), the further the estimate will deviate.
Consequently, in expectation Δ̂_oblivious will biasedly estimate that d_n should be ranked higher than d_m, which is incorrect since both items are actually equally relevant.

In contrast, the policy-aware estimator Δ̂_aware (Eq. 5.13) avoids this issue because its propensities are based on the logging policy π. When calculating the probability of observance conditioned on π, P(o(d_m) = 1 | q, r, π) (Eq. 5.12), it takes into account that there is a π(R̄₂ | q) chance that d_m is not displayed to the user:

\frac{\pi(\bar{R}_1 \mid q) \cdot P(o(d_m) = 1 \mid q, r, \bar{R}_1)}{\pi(\bar{R}_1 \mid q) \cdot P(o(d_m) = 1 \mid q, r, \bar{R}_1)} \cdot \lambda(d_m \mid R) \cdot r(d_m) = \lambda(d_m \mid R) \cdot r(d_m).  (5.22)

Since in this situation Condition 5.16 is true (and therefore also Condition 5.15), we know beforehand that in expectation the policy-aware estimator is unaffected by position and item-selection bias.

This concludes our illustrative example; it was meant to contrast the behavior of the policy-aware and policy-oblivious estimators in two different situations. When there is no item-selection bias, i.e., an item is displayed in the top-k of all rankings the logging policy may display, both estimators provide unbiased estimates, albeit using different propensity scores. However, when there is item-selection bias, i.e., an item is not always present in the top-k, the policy-oblivious estimator Δ̂_oblivious no longer provides an unbiased estimate, while the policy-aware estimator Δ̂_aware is still unbiased w.r.t. both position bias and item-selection bias.
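To make the contrast concrete, here is a tiny numeric instance of the example with assumed values π(R̄₁ | q) = 0.8 and π(R̄₂ | q) = 0.2, and an examination probability of 0.5 for d_m in R̄₁ (and 0 in R̄₂):

```python
# Assumed illustrative values; d_m can only be examined in ranking R1.
pi_R1, pi_R2 = 0.8, 0.2
p_obs_R1, p_obs_R2 = 0.5, 0.0

# Expected propensity-weighted click mass for d_m, relative to its true
# contribution of 1.0 (no clicks ever arrive from R2, cf. Eq. 5.20):
oblivious = pi_R1 * p_obs_R1 / p_obs_R1                           # Eq. 5.21
aware = pi_R1 * p_obs_R1 / (pi_R1 * p_obs_R1 + pi_R2 * p_obs_R2)  # Eq. 5.22

print(oblivious, aware)  # 0.8 1.0: the oblivious estimate is short by pi(R2|q)
```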
5.4. Learning for Top-k Metrics

This section details how counterfactual LTR can be used to optimize top-k metrics, since these are the relevant metrics in top-k rankings.

5.4.1. Top-k metrics

Since top-k rankings only display the k highest ranked items to the user, the performance of a ranker in this setting is only determined by those items. Correspondingly, only top-k metrics matter here, where items beyond rank k have no effect:

\lambda_{\text{metric}@k}(d \mid R) = \begin{cases} \lambda_{\text{metric}}(d \mid R) & \text{if } \text{rank}(d \mid R) \leq k, \\ 0 & \text{if } \text{rank}(d \mid R) > k. \end{cases}  (5.23)

These metrics are commonly used in LTR since, usually, performance gains at the top of a ranking are the most important for the user experience. For instance, NDCG@k, the normalized version of DCG@k, is often used:

\lambda_{\text{DCG}@k}(d \mid R) = \begin{cases} -\log_2\big(1 + \text{rank}(d \mid R)\big)^{-1} & \text{if } \text{rank}(d \mid R) \leq k, \\ 0 & \text{if } \text{rank}(d \mid R) > k. \end{cases}  (5.24)

Generally in LTR, DCG is optimized in order to maximize NDCG [13, 129]. In unbiased LTR it is not trivial to estimate the normalization factor for NDCG, further motivating the optimization of DCG instead of NDCG [2, 16].

Importantly, top-k metrics bring two main challenges for LTR. First, the rank function is not differentiable, a problem for almost every LTR metric [75, 129]. Second, changes in a ranking beyond position k do not affect the metric's value, thus resulting in zero gradients. The first problem has been addressed in existing LTR methods; we now propose adaptations of these methods that address the second issue as well.

5.4.2. Bounds based on monotonic functions

A common approach for enabling optimization of ranking metrics is to find lower or upper bounds that can be maximized or minimized, respectively. For instance, similar to a hinge loss, the rank function can be upper bounded by a maximum over score differences [54, 58]. Let s be the scoring function used to rank (in descending order); then:

\text{rank}(d \mid R) \leq \sum_{d' \in R} \max\big(1 - (s(d) - s(d')), \, 0\big).  (5.25)

Alternatively, the logistic function is also a popular choice [129]:

\text{rank}(d \mid R) \leq \sum_{d' \in R} \log_2\big(1 + e^{s(d') - s(d)}\big).  (5.26)

Minimizing one of these differentiable upper bounds directly minimizes an upper bound on the ARP metric (Eq. 5.3).

Furthermore, Agarwal et al. [2] showed that this approach can be extended to any metric based on a monotonically decreasing function. For instance, if \overline{\text{rank}}(d \mid R) is an upper bound on the \text{rank}(d \mid R) function, then the following is an upper bound on the DCG loss (Eq. 5.4):

\lambda_{\text{DCG}}(d \mid R) \leq -\log_2\big(1 + \overline{\text{rank}}(d \mid R)\big)^{-1} = \hat{\lambda}_{\text{DCG}}(d \mid R).  (5.27)
More generally, let α be a monotonically decreasing function. A loss based on α is always upper bounded by:

\lambda^{\alpha}(d \mid R) = -\alpha\big(\text{rank}(d \mid R)\big) \leq -\alpha\big(\overline{\text{rank}}(d \mid R)\big) = \hat{\lambda}^{\alpha}(d \mid R).  (5.28)

Though appropriate for many standard ranking metrics, λ̂^α is not an upper bound for top-k metric losses. To understand this, consider that an item beyond rank k may still receive a negative score from λ̂^α; for instance, for the DCG upper bound: rank̄(d | R) > k → λ̂_DCG(d | R) < 0. As a result, this is not an upper bound for a DCG@k based loss.

We propose a modification of the λ̂^α function that does provide an upper bound for top-k metric losses, by simply giving a positive penalty to items beyond rank k:

\hat{\lambda}^{\alpha@k}(d \mid R) = -\alpha\big(\overline{\text{rank}}(d \mid R)\big) + \big[\overline{\text{rank}}(d \mid R) > k\big] \cdot \alpha(k).  (5.29)

The resulting function is an upper bound on top-k metric losses based on a monotonic function: λ^{α@k}(d | R) ≤ λ̂^{α@k}(d | R). The main difference with λ̂^α is that items beyond rank k acquire a positive score from λ̂^{α@k}, thus providing an upper bound on the actual metric loss. Interestingly, the gradient of λ̂^{α@k} w.r.t. the scoring function s is the same as that of λ̂^α, since we consider the indicator function to never have a non-zero gradient. Therefore, the gradient of either function optimizes an upper bound on λ^{α@k} top-k metric losses, while only λ̂^{α@k} provides an actual upper bound.

While this monotonic function-based approach is simple, it is unclear how coarse these upper bounds are. In particular, some upper bounds on the rank function (e.g., Eq. 5.25) can provide gross overestimations. As a result, these upper bounds on ranking metric losses may be very far removed from their actual values.
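As an illustration, a minimal sketch (ours) of the rank upper bounds of Eq. 5.25 and 5.26 and the top-k correction of Eq. 5.29, instantiated with the DCG-style α(rank) = 1/log₂(1 + rank); the function names are our own:

```python
import numpy as np

def rank_bound_hinge(scores, d):
    # Eq. 5.25: sum of hinges over all items (including d itself, giving >= 1).
    return np.sum(np.maximum(1.0 - (scores[d] - scores), 0.0))

def rank_bound_logistic(scores, d):
    # Eq. 5.26: base-2 log makes each tied term equal 1, keeping the bound valid.
    return np.sum(np.log2(1.0 + np.exp(scores - scores[d])))

def lambda_alpha_at_k(scores, d, k,
                      alpha=lambda r: 1.0 / np.log2(1.0 + r),
                      rank_bound=rank_bound_hinge):
    # Eq. 5.29: items whose bounded rank exceeds k receive the +alpha(k)
    # penalty, making this an upper bound on the top-k metric loss.
    r_bar = rank_bound(scores, d)
    return -alpha(r_bar) + (r_bar > k) * alpha(k)
```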
5.4.3. Counterfactual LambdaLoss

Many supervised LTR approaches, such as the well-known LambdaRank and subsequent LambdaMART methods [13], are based on Expectation Maximization (EM) procedures [28]. Recently, Wang et al. [129] introduced the LambdaLoss framework, which provides a theoretical way to prove that a method optimizes a lower bound on a ranking metric. Subsequently, it was used to prove that LambdaMART optimizes such a bound on DCG; similarly, it was also used to introduce the novel LambdaLoss method, which provides an even tighter bound on DCG. In this section, we show that the LambdaLoss framework can be used to find proven bounds on counterfactual LTR losses and top-k metrics. Since LambdaLoss is considered state-of-the-art in supervised LTR, making its framework applicable to counterfactual LTR could potentially provide competitive performance. Additionally, adapting the LambdaLoss framework to top-k metrics further expands its applicability.

The LambdaLoss framework and its EM-optimization approach work for metrics that can be expressed in item-based gains, G(d_n | q, r), and discounts based on position, D(rank(d_n | R)); for brevity we write G_n and D_n, respectively, resulting in:

\Delta(R \mid q, r) = \sum_{d_n \in R} G(d_n \mid q, r) \cdot D\big(\text{rank}(d_n \mid R)\big) = \sum_{n=1}^{|R|} G_n \cdot D_n.  (5.30)
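For instance, a small sketch (ours) of Eq. 5.30 instantiated as DCG@k with binary relevance gains, where the discounts are allowed to be zero beyond rank k:

```python
import numpy as np

def dcg_at_k_gains_discounts(relevances_in_rank_order, k):
    """Eq. 5.30 for DCG@k with binary relevance: gains G_n are per-item
    relevances, discounts D_n depend only on the rank and are zero past k."""
    n_items = len(relevances_in_rank_order)
    gains = np.asarray(relevances_in_rank_order, dtype=float)          # G_n
    ranks = np.arange(1, n_items + 1)
    discounts = np.where(ranks <= k, 1.0 / np.log2(1.0 + ranks), 0.0)  # D_n
    return np.sum(gains * discounts)
```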
For simplicity of notation, we choose indexes so that n = rank(d_n | R); thus D_n is always the discount for rank n. We differ from the existing LambdaLoss framework by allowing the discounts to be zero (∀n: D_n ≥ 0), thus also accounting for top-k metrics. Furthermore, items at the first rank are not discounted, or the metric can be scaled so that D₁ = 1. Additionally, higher ranked items should be discounted less or equally: n > m → D_n ≤ D_m. Most ranking metrics meet these criteria; for instance, G_n and D_n can be chosen to match ARP or DCG. Importantly, our adaption also allows Δ to match top-k metrics such as DCG@k or Precision@k.

In order to apply the LambdaLoss framework to counterfactual LTR, we consider a general inverse-propensity-scored estimator:

\hat{\Delta}_{\text{IPS}}(R \mid q, c, \cdot) = \sum_{d_n : c(d_n) = 1} \frac{\lambda(d_n \mid R)}{\rho\big(o(d_n) = 1 \mid q, r, \bar{R}, \pi\big)},  (5.31)

where the propensity function ρ can match either the policy-oblivious (Eq. 5.8) or the policy-aware (Eq. 5.13) estimator. By choosing

G_n = \frac{1}{\rho\big(o(d_n) = 1 \mid q, r, \bar{R}, \pi\big)} \quad \text{and} \quad D_n = \lambda(d_n \mid R),  (5.32)

the estimator can be described in terms of gains and discounts. In contrast, in the existing LambdaLoss framework [129] gains are based on item relevance. For counterfactual top-k LTR, we have designed Eq. 5.32 so that gains are based on the propensity scores of observed clicks, and the discounts can have zero values.

The EM-optimization procedure alternates between an expectation step and a maximization step. In our case, the expectation step sets the discount values D_n according to the current ranking R of the scoring function s. Then the maximization step updates s to optimize the ranking model. Following the LambdaLoss framework [129], we derive a slightly different loss. With the delta function

\delta_{nm} = D_{|n-m|} - D_{|n-m|+1},  (5.33)

our differentiable counterfactual loss becomes:
\sum_{G_n > G_m} -\log\left(\bigg(\frac{1}{1 + e^{s(d_m) - s(d_n)}}\bigg)^{\delta_{nm} \cdot |G_n - G_m|}\right).  (5.34)

The changes we made do not affect the validity of the proof provided in the original LambdaLoss paper [129]. Therefore, the counterfactual loss (Eq. 5.34) can be proven to optimize a lower bound on counterfactual estimates of top-k metrics.

Finally, in the same way, the LambdaLoss framework can also be used to derive counterfactual variants of other supervised LTR losses/methods such as LambdaRank or LambdaMART. Unlike previous work that also attempted to find a counterfactual lambda-based method by introducing a pairwise-based estimator [46], our approach is compatible with the prevalent counterfactual approach since it uses the same estimator based on single-document propensities [3–5, 58, 127, 128]. Our approach suggests that the divide between supervised and counterfactual LTR methods may disappear in the future, as a state-of-the-art supervised LTR method can now be applied to the state-of-the-art counterfactual LTR estimators.

So far we have introduced two counterfactual LTR approaches that are proven to optimize lower bounds on top-k metrics: with monotonic functions (Section 5.4.2) and through the LambdaLoss framework (Section 5.4.3). To the best of our knowledge, we are the first to introduce theoretically proven lower bounds for top-k LTR metrics. Nevertheless, previous work has also attempted to optimize top-k metrics, albeit through heuristic methods. Notably, Wang et al. [129] used a truncated version of the LambdaLoss loss to optimize DCG@k. Their loss uses the discounts D_n based on full-ranking DCG but ignores item pairs outside of the top-k:

\sum_{G_n > G_m} -\big[n \leq k \vee m \leq k\big] \cdot \log\left(\bigg(\frac{1}{1 + e^{s(d_m) - s(d_n)}}\bigg)^{\delta_{nm} \cdot |G_n - G_m|}\right).  (5.35)
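A minimal sketch (ours) of the counterfactual LambdaLoss of Eq. 5.34, with an optional cutoff argument k giving the truncated variant of Eq. 5.35; `gains` holds the inverse propensities of the clicked items (Eq. 5.32), `ranks` the current 1-based ranks from the expectation step, and `discounts` the values D_j indexed from 1 (index 0 unused, length |R| + 1):

```python
import numpy as np

def delta_nm(discounts, n, m):
    # Eq. 5.33: delta_nm = D_{|n-m|} - D_{|n-m|+1}.
    i = abs(n - m)
    return discounts[i] - discounts[i + 1]

def counterfactual_lambdaloss(scores, gains, ranks, discounts, k=None):
    loss = 0.0
    for n in range(len(scores)):
        for m in range(len(scores)):
            if gains[n] <= gains[m]:
                continue  # the loss sums over pairs with G_n > G_m only
            if k is not None and not (ranks[n] <= k or ranks[m] <= k):
                continue  # Eq. 5.35: ignore pairs entirely outside the top-k
            sig = 1.0 / (1.0 + np.exp(scores[m] - scores[n]))
            weight = delta_nm(discounts, ranks[n], ranks[m]) * (gains[n] - gains[m])
            loss += -np.log(sig) * weight
    return loss
```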
While empirical results motivate its usage, there is no known theoretical justification for this loss, and thus it is considered a heuristic.

This leaves us with a choice between two theoretically-motivated counterfactual LTR approaches for optimizing top-k metrics (Eq. 5.29 and 5.34) and an empirically-motivated heuristic (Eq. 5.35). We propose a pragmatic solution by recognizing that counterfactual estimators can unbiasedly evaluate top-k metrics. Therefore, in practice one can optimize several ranking models using various approaches and, subsequently, estimate which resulting model provides the best performance. Thus, using counterfactual evaluation to select from the resulting models is an unbiased method to choose between the available counterfactual LTR approaches.

5.5. Experimental Setup

We follow the standard setup in unbiased LTR [5, 16, 50, 58] and perform semi-synthetic experiments: queries and items are based on datasets of commercial search engines and interactions are simulated using probabilistic click models.
We use the queries and documents from two of the largest publicly available LTR datasets: MSLR-WEB30K [95] and Yahoo! Webscope [17]. Each was created by a commercial search engine and contains a set of queries with corresponding preselected document sets. Query-document pairs are represented by feature vectors and five-grade relevance annotations ranging from not relevant (0) to perfectly relevant (4). In order to binarize the relevance, we only consider the two highest relevance grades as relevant. The MSLR dataset contains 30,000 queries with on average 125 preselected documents per query, and encodes query-document pairs in 136 features. The Yahoo dataset has 29,921 queries and on average 24 documents per query, encoded in 700 features. Presumably, learning from top-k feedback is harder as k becomes a smaller percentage of the number of items. Thus, we expect the MSLR dataset, with more documents per query, to pose a more difficult problem.

The setting we simulate is one where interactions are gathered using a non-optimal but decent production ranker. We follow existing work [5, 50, 58] and use supervised optimization for the ARP metric on 1% of the training data. The resulting model simulates a real-world production ranker, since it is much better than a random initialization but leaves enough room for improvement [58].

We then simulate user-issued queries by uniformly sampling from the training partition of the dataset. Subsequently, for each query the production ranker ranks the documents preselected by the dataset. Depending on the experimental run under consideration, randomization is performed on the resulting rankings. In order for the policy-aware estimator to be unbiased, every relevant document needs a chance of appearing in the top-k (Condition 5.16). Since in a realistic setting relevance is unknown, we choose to give every document a non-zero probability of appearing in the top-k. Our randomization policy takes the ranking of the production ranker and leaves the first k − 1 documents unchanged, while the document at position k is selected by sampling uniformly from the remaining documents. The result is a minimally invasive randomized top-k ranking, since most of the ranking is unchanged and the placement of the sampled documents is limited to the least important position.

We note that many other logging policies could be applied (see Condition 5.16); e.g., an alternative policy could insert sampled documents at random ranks for less obvious randomization. Unfortunately, a full exploration of the effect of using different logging policies is beyond the scope of this chapter.

Clicks are simulated on the resulting ranking R̄ according to position bias and document relevance. Top-k position bias is modelled through the probability of observance, as follows:

P(o(d) = 1 \mid q, r, \bar{R}) = \begin{cases} \text{rank}(d \mid \bar{R})^{-1} & \text{if } \text{rank}(d \mid \bar{R}) \leq k, \\ 0 & \text{if } \text{rank}(d \mid \bar{R}) > k. \end{cases}  (5.36)

The randomization policy results in the following examination probabilities w.r.t. the logging policy (cf. Eq. 5.12):

P(o(d) = 1 \mid q, r, \pi) = \begin{cases} \text{rank}(d \mid \bar{R})^{-1} & \text{if } \text{rank}(d \mid \bar{R}) < k, \\ \big(\text{rank}(d \mid \bar{R}) \cdot (|\bar{R}| - k + 1)\big)^{-1} & \text{if } \text{rank}(d \mid \bar{R}) \geq k. \end{cases}  (5.37)

The probability of a click is conditioned on the relevance of the document according to the dataset:

P(c(d) = 1 \mid q, r, \bar{R}, o) = \begin{cases} 1 & \text{if } r(d) = 1 \wedge o(d) = 1, \\ 0.1 & \text{if } r(d) = 0 \wedge o(d) = 1, \\ 0 & \text{if } o(d) = 0. \end{cases}  (5.38)
Note that our previous assumption that clicks only take place on relevant items (Section 5.2.2) does not hold in our experiments.

Optimization is performed on training clicks simulated on the training partition of the dataset. Hyperparameter tuning is done by estimating performance on (unclipped) validation clicks simulated on the validation partition; the number of validation clicks is always 15% of the number of training clicks. Lastly, evaluation metrics are calculated on the test partition using the dataset labels.
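A minimal sketch (ours) of this logging and click simulation, where `prod_ranking` is the production ranker's ordering of document ids, `relevance` maps ids to binary labels, and the 0.1 noise level follows Eq. 5.38:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def log_interaction(prod_ranking, relevance, k):
    # Randomized top-k: keep the first k-1 positions, sample position k
    # uniformly from the remaining documents (satisfying Condition 5.16).
    displayed = list(prod_ranking[:k - 1])
    displayed.append(rng.choice(prod_ranking[k - 1:]))
    clicked = []
    for pos, d in enumerate(displayed, start=1):
        if rng.random() < 1.0 / pos:                 # observance, Eq. 5.36
            p_click = 1.0 if relevance[d] else 0.1   # click noise, Eq. 5.38
            if rng.random() < p_click:
                clicked.append(d)
    return displayed, clicked
```

Note that for k = 1 this policy presents every document with equal probability, matching the trivial case discussed in the results.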
In order to evaluate the performance of the policy-aware estimator (Eq. 5.13) and the effect of item-selection bias, we compare with the following baselines: (i) The policy-oblivious estimator (Eq. 5.8). In our setting, where the examination probabilities are known beforehand, the policy-oblivious estimator also represents methods that jointly estimate these probabilities while performing LTR, i.e., the following methods reduce to this estimator if the examination probabilities are given: [3, 5, 58, 127]. (ii) A rerank estimator, an adaption of the policy-oblivious estimator. During optimization the rerank estimator applies the policy-oblivious estimator but limits the document set of an interaction i to the k displayed items: R_i = {d | rank(d | R̄_i) ≤ k} (cf. Eq. 5.8). Thus, it is optimized to rerank the top-k of the production ranker only, but during inference it is applied to the entire document set. (iii) Additionally, we evaluate performance without any cutoff k or randomization; in these circumstances all three estimators (Policy-Aware, Policy-Oblivious, Rerank) are equivalent. (iv) Lastly, we use supervised LTR on the dataset labels to get a full-information skyline, which shows the hypothetical optimal performance.

To evaluate the effectiveness of our proposed loss functions for optimizing top-k metrics, we apply the monotonic lower bound (Eq. 5.29) with a linear (Eq. 5.25) and a logistic upper bound (Eq. 5.26). Additionally, we apply several versions of the LambdaLoss loss function (Eq. 5.34): one that optimizes full DCG, another that optimizes DCG@5, and the heuristic truncated loss also optimizing DCG@5 (Eq. 5.35). Lastly, we apply unbiased loss selection, where we select the best-performing model based on the estimated performance on the (unclipped) validation clicks.

Optimization is done with stochastic gradient descent; to maximize computational efficiency we rewrite the loss (Eq. 5.5) for a propensity scoring function ρ in the following manner:

\hat{\mathcal{L}} = \frac{1}{N} \sum_{i=1}^{N} \hat{\Delta}(R_i \mid q_i, \bar{R}_i, \pi, c_i)
= \frac{1}{N} \sum_{i=1}^{N} \sum_{d : c_i(d) = 1} \frac{\lambda(d \mid R_i)}{\rho(o_i(d) = 1 \mid q_i, r, \cdot)}
= \frac{1}{N} \sum_{q \in Q} \sum_{d \in R_q} \Bigg(\sum_{i=1}^{N} \frac{[q_i = q] \cdot c_i(d)}{\rho(o_i(d) = 1 \mid q, r, \cdot)}\Bigg) \cdot \lambda(d \mid R_q)
= \frac{1}{N} \sum_{q \in Q} \sum_{d \in R_q} \omega_d \cdot \lambda(d \mid R_q).  (5.39)
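A minimal sketch (ours) of this rewrite: clicks are first aggregated into the per-document weights ω_d, after which a loss evaluation no longer scales with the number of logged interactions N:

```python
from collections import defaultdict

def precompute_weights(interactions, propensity_fn):
    """interactions: iterable of (query, clicked_docs) pairs;
    propensity_fn(q, d): the propensity score rho of the chosen estimator."""
    omega = defaultdict(float)
    n = 0
    for q, clicked_docs in interactions:
        n += 1
        for d in clicked_docs:
            omega[(q, d)] += 1.0 / propensity_fn(q, d)  # inner sum of Eq. 5.39
    return omega, n

def estimated_loss(omega, n, lambda_fn, current_ranks):
    # current_ranks[(q, d)]: rank of d for q under the model being evaluated.
    return sum(w * lambda_fn(current_ranks[qd]) for qd, w in omega.items()) / n
```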
After precomputing the document weights ω_d, the complexity of computing the loss is only determined by the dataset size. This allows us to optimize over very large numbers of clicks with very limited increases in computational costs.

We optimize linear models, but our approach can be applied to any differentiable model [2]. Propensity clipping [58] is applied to training clicks and never to the validation clicks; we also use self-normalization [116].

5.6. Results and Discussion

In this section we discuss the results of our experiments and evaluate our policy-aware estimator and the methods for top-k LTR metric optimization empirically.
First we consider the question:

RQ5.1 Is the policy-aware estimator effective for unbiased counterfactual LTR from top-k feedback?

Figure 5.1 displays the performance of different approaches after optimization on simulated clicks under varying values of k. Both the policy-oblivious and rerank estimators are greatly affected by the item-selection bias introduced by the cutoff at k. On the MSLR dataset neither approach is able to get close to optimal ARP performance, and optimal DCG@5 is only reached at large cutoff values. On the Yahoo dataset, the policy-oblivious approach can only approximate optimal ARP and DCG@5 once k is large, and the same holds for the rerank approach. Considering that on average a query in the Yahoo dataset has only 24 preselected documents, it appears that even a little item-selection bias has a substantial effect on both estimators. Furthermore, randomization appears to have a very limited positive effect on the policy-oblivious and rerank approaches. The one exception is the policy-oblivious approach when k = 1, where it reaches optimal performance under randomization. Here, the randomization policy gives every item an equal probability of being presented, thus trivially removing item-selection bias; additionally, there is no position bias as there is only a single position. However, besides this trivial exception, the baseline estimators are strongly affected by item-selection bias, and simply logging with randomization is unable to remove its effect.

In contrast, the policy-aware approach is hardly affected by the choice of k. It consistently approximates optimal performance in terms of ARP and DCG@5 on both datasets. On the MSLR dataset, the policy-aware approach provides near optimal ARP performance; however, for larger values of k there is a small but noticeable gap. We suspect that this is the result of variance from click noise and can be closed by gathering more clicks. Across all settings, the policy-aware approach appears unaffected by the choice of k and thus by the effect of item-selection bias. Moreover, it consistently provides performance at least as good as the baselines; on the Yahoo dataset it outperforms them for smaller values of k, and on the MSLR dataset it outperforms them for all tested values of k. We note that the randomization policy is the same for all methods; in other words, under randomization the clicks for the policy-oblivious, policy-aware, and rerank approaches are acquired in the exact same way. Thus, our results show that in order to benefit from randomization, a counterfactual LTR method has to take its effect into account; hence only the policy-aware approach has improved performance.

Figure 5.2 displays the performance when learning from top-5 feedback while varying the number of clicks. Here we see that the performance of the policy-oblivious approach is stable after a certain number of clicks has been gathered; the rerank approach similarly shows stable performance after enough clicks, both when optimized for ARP and for DCG@5. Both baseline approaches show biased behavior, where adding additional data does not lead to improved performance. This confirms that their estimators are unable to deal with item-selection bias. In contrast, the policy-aware approach reaches optimal performance in all settings.
However, it appears that the policy-aware approach requires more clicks than the no-cutoff baseline; we suspect that this difference is due to variance added by the randomization and the smaller propensity scores.

In conclusion, we answer RQ5.1 positively: our results show that the policy-aware approach is unbiased w.r.t. item-selection bias and position bias. Where all baseline approaches are affected by even small amounts of item-selection bias, the policy-aware approach approximates optimal performance regardless of the cutoff value k.

Next, we consider the question:
RQ5.2 Are our novel counterfactual LTR loss functions effective for top-k LTR metric optimization?

Figure 5.3 shows the performance of the policy-aware approach after optimizing different loss functions under top-5 feedback. While on the Yahoo dataset small differences are observed, on the MSLR dataset substantial differences are found. Interestingly, there seems to be no advantage in optimizing for DCG@5 instead of full DCG with the LambdaLoss. Furthermore, the monotonic loss function works very well with a linear upper bound, yet poorly when using the logistic upper bound. On both datasets the heuristic truncated LambdaLoss loss function provides the best performance, despite being the only method without a theoretical basis. When few clicks are available, the differences change; e.g., the monotonic loss function with a logistic upper bound then outperforms the other losses on the MSLR dataset.

Finally, we consider unbiased loss selection; Figure 5.3 displays both the performance of the selected models and the estimated performance on which the selections are based. For the most part the optimal models are selected, but variance does cause mistakes in selection when few clicks are available. Thus, unbiased optimal loss selection seems effective as long as enough clicks are available.

In conclusion, we answer RQ5.2 positively: our results indicate that the truncated counterfactual LambdaLoss loss function is most effective at optimizing DCG@5. Using this loss, our counterfactual LTR method reaches state-of-the-art performance comparable to supervised LTR on both datasets. Alternatively, our proposed unbiased loss selection method can choose optimally between models that are optimized by different loss functions.
5.7. Related Work

Section 5.2.1 has discussed supervised LTR and Section 5.2.2 has described the existing counterfactual LTR framework; this section contrasts additional related work with our policy-aware approach.

Interestingly, some existing work in unbiased LTR was performed in top-k ranking settings [3, 4, 127, 128]. Our findings suggest that the results of that work are affected by item-selection bias and that there is potential for considerable improvements by applying the policy-aware method.

Carterette and Chandar [16] recognized that counterfactual evaluation cannot evaluate rankers that retrieve items that are unseen in the interaction logs, essentially due to a form of item-selection bias. Their proposed solution is to gather new interactions on rankings where previously unseen items are randomly injected. Accordingly, they adapt propensity scoring to account for the random injection strategy. In retrospect, this approach can be seen as a specific instance of our policy-aware approach. In contrast, we have focused on settings where item-selection bias takes place systematically, and we propose that logs should be gathered by any policy that meets Condition 5.16. Instead of expanding the logs to correct for missing items, our approach avoids systematic item-selection bias altogether.

Other previous work has also used propensity scores based on a logging policy and examination probabilities. Komiyama et al. [67] and subsequently Lagrée et al. [69] use such propensities to find the optimal ranking for a single query by casting the ranking problem as a multiple-play bandit. Li et al. [72] use similar propensities to counterfactually evaluate ranking policies, where they estimate the number of clicks a ranking policy will receive. Our policy-aware approach contrasts with these existing methods by providing an unbiased estimate of LTR-metric-based losses, and thus it can be used to optimize LTR models similar to supervised LTR.

Lastly, online LTR methods, where interactive processes learn from the user [132], also make use of stochastic ranking policies. They correct for biases through randomization in rankings, but do not use an explicit model of examination probabilities. In contrast with counterfactual LTR: while online LTR methods appear to provide robust performance [50], they are not proven to unbiasedly optimize LTR metrics [82, 84]. Unlike counterfactual LTR, they are not effective when applied to historical interaction logs [43].

5.8. Conclusion

In this chapter, we have proposed a policy-aware estimator for LTR, the first counterfactual method that is unbiased w.r.t. both position bias and item-selection bias. Our experimental results show that existing policy-oblivious approaches are greatly affected by item-selection bias, even when only small amounts are present. In contrast, the proposed policy-aware LTR method can learn from top-k feedback without being affected by the choice of k. Furthermore, we proposed three counterfactual LTR approaches for optimizing top-k metrics: two theoretically proven lower bounds on DCG@k, based on monotonic functions and the LambdaLoss framework, respectively, and another heuristic truncated loss. Additionally, we introduced unbiased loss selection, which can choose optimally between models optimized with different loss functions. Together, our contributions provide a method for learning from top-k feedback and for optimizing top-k metrics.

With these contributions, we can answer the thesis research questions RQ4 and
RQ5 positively: with the policy-aware estimator, counterfactual LTR is applicable to top-k ranking settings; furthermore, we have shown that the state-of-the-art supervised LTR method LambdaLoss can be used for counterfactual LTR. To the best of our knowledge, this is the first counterfactual LTR method that is unbiased in top-k ranking settings. Additionally, this chapter also serves to further bridge the gap between supervised and counterfactual LTR methods, as we have shown that state-of-the-art lambda-based supervised LTR methods can be applied to the state-of-the-art counterfactual LTR estimators. Therefore, the contributions of this chapter have greatly extended the capabilities of the counterfactual LTR approach and further connected it with the supervised LTR field.

Future work in supervised LTR could verify whether potential novel supervised methods can be applied to counterfactual losses. A limitation of the policy-aware LTR approach is that the logging policy needs to be known; future work could investigate whether a policy estimated from logs also suffices [72, 74]. Finally, existing work on bias in recommendation [107] has not considered position bias; thus we anticipate further opportunities for counterfactual LTR methods for top-k recommendations.

The remaining chapters of this thesis will continue to build on the policy-aware estimator. Chapter 6 introduces a counterfactual LTR algorithm that uses the policy-aware estimator to combine properties of tabular models and feature-based models. Furthermore, Chapter 7 looks at how the policy-aware estimator can be used for ranker evaluation. It introduces an algorithm that optimizes the logging policy to reduce variance when using the policy-aware estimator for evaluation. Lastly, Chapter 8 introduces a novel intervention-aware estimator inspired by the policy-aware estimator. This novel estimator takes the policy-aware approach even further by considering the effect of all logging policies used during data gathering. The intervention-aware approach thus also considers the case where the logging policy is updated during the gathering of data. Besides the policy-aware estimator, Chapters 6, 7, and 8 all use the adaptation of LambdaLoss for counterfactual LTR derived in this chapter.
[Figure 5.1 graphs omitted; panels: Yahoo! Webscope and MSLR-WEB30k; y-axes: Avg. Relevant Position and Normalized DCG; x-axis: Number of Display Positions (k); legend: Policy-Aware (rand.), Policy-Oblivious (no rand.), Policy-Oblivious (rand.), Rerank (no rand.), Rerank (rand.), Production, Full-Info Skyline.]
Figure 5.1: The effect of item-selection bias on different estimators. Optimization on clicks simulated on top-k rankings with varying numbers of display positions (k), with and without randomization (for each datapoint, clicks were simulated independently). Results on the Yahoo dataset and the MSLR dataset. The top graph per dataset optimizes the average relevant position through the linear upper bound (Eq. 5.25); the bottom graph per dataset optimizes DCG@5 using the truncated LambdaLoss (Eq. 5.35).

[Figure 5.2 graphs omitted; panels: Yahoo! Webscope and MSLR-WEB30k; y-axes: Avg. Relevant Position and Normalized DCG; x-axis: Number of Training Clicks; legend: No-Cutoff, Policy-Aware (rand.), Policy-Oblivious (no rand.), Policy-Oblivious (rand.), Rerank (no rand.), Rerank (rand.), Production, Full-Info Skyline.]
Figure 5.2: Performance of different estimators learning from different numbers of clicks simulated on top-5 rankings, with and without randomization. Results on the Yahoo dataset and the MSLR dataset. The top graph per dataset optimizes the average relevant position through the linear upper bound (Eq. 5.25); the bottom graph per dataset optimizes DCG@5 using the truncated LambdaLoss (Eq. 5.35).
[Figure 5.3 graphs omitted; panels: Yahoo! Webscope and MSLR-WEB30k; y-axes: Normalized DCG and Estimated DCG; x-axis: Number of Training Clicks; legend: Loss Selection, Monotonic (linear), Monotonic (log), LambdaLoss (full-DCG), LambdaLoss (DCG@5), Truncated LambdaLoss (DCG@5), Production, Full-Info Skyline.]

Figure 5.3: Performance of the policy-aware estimator (Eq. 5.13) optimizing DCG@5 using different loss functions. The loss-selection method selects the estimated optimal model based on clicks gathered on separate validation queries. Varying numbers of clicks on top-5 rankings with randomization; the number of validation clicks is 15% of the number of training clicks.

5.A Notation Reference for Chapter 5

    Notation          Description
    k                 the number of items that can be displayed in a single ranking
    i                 an iteration number
    Q                 the set of queries
    q                 a user-issued query
    d                 an item to be ranked
    r(d, q), r(d)     the relevance of item d w.r.t. query q
    R                 a ranked list
    R̄                 a ranked list that was displayed to the user
    λ(d | R)          a metric that weights items depending on their display rank
    c_i(d)            a function indicating item d was clicked at iteration i
    o_i(d)            a function indicating item d was observed at iteration i
    π                 a logging policy
    π(R̄ | q)          the probability that policy π displays ranking R̄ for query q
    rank(d | R̄)       the rank of item d in displayed ranking R̄
    ρ                 a propensity function used to represent any IPS estimator
    s(d)              the score given to item d by ranking model s, used to sort items

Combining Generalized and Specialized Models in Counterfactual Learning to Rank
So far, this thesis has only addressed feature-based Learning to Rank (LTR), the optimization of models that rank items based on their features, as opposed to tabular online LTR, which optimizes a ranking directly and thus does not use any scoring model. A big advantage of feature-based LTR is that its model can be applied to previously unseen queries and items. As a result, it provides very robust performance in previously unseen circumstances. However, its behavior is limited by the available features: in practice, these often do not provide enough information to determine the optimal ranking. In stark contrast, tabular LTR memorizes rankings instead of using features to predict them. Consequently, tabular LTR is not limited by which features are available and can potentially always find the optimal ranking. Despite this potential, tabular LTR does not generalize: it cannot transfer learned behavior to previously unseen queries or items. In other words, tabular LTR has the potential to specialize, i.e., to perform very well in circumstances encountered often, whereas feature-based LTR is good at generalization, i.e., performing well overall, including in previously unseen circumstances. In this chapter we investigate whether the advantageous properties of these two areas can be combined in the counterfactual LTR framework, and thus we address the thesis research question:
RQ6
Can the specialization ability of tabular online LTR be combined with the robust feature-based approach of counterfactual LTR?

In this chapter we introduce a framework for Generalization and Specialization (GENSPEC) for counterfactual learning from logged bandit feedback. GENSPEC is designed for problems that can be divided into many non-overlapping contexts. It simultaneously learns a generalized policy, optimized for high performance across all contexts, and many specialized policies, each optimized for high performance in a single context. Using high-confidence bounds on the relative performance of policies, GENSPEC decides per context whether to deploy a specialized policy, the general policy, or the current logging policy. By doing so, GENSPEC combines the high performance of successfully specialized policies with the safety and robustness of a generalized policy.

While GENSPEC is applicable to many different bandit problems, we focus on query-specialization for counterfactual learning to rank, where a context consists of a query submitted by a user. Here we learn both a single general feature-based model for robust performance across queries, and many memory-based models, each of which is highly specialized for a single query; GENSPEC then chooses which model to deploy on a per-query basis. Our results show that GENSPEC leads to massive performance gains on queries with sufficient click data, while still having safe and robust behavior on queries with little or noisy data.

This chapter was submitted as [87]. Appendix 6.C gives a reference for the notation used in this chapter.
6.1 Introduction

Generalization is an important goal for most machine learning algorithms: models should perform well across a large range of contexts, especially previously unseen contexts [10].
Specialization, the ability to perform well in a single context, is often disfavored over generalization because the latter is more robust [37]. Generally, the same trade-off pertains to contextual bandit problems [70, Chapter 18]. There, the goal is to find a policy that maximizes performance over the full distribution of contextual information. While a specialized policy, i.e., a policy optimized on a subset of possible contexts, could outperform a generalized policy on that subset, it most likely compromises performance on other contexts to do so, since specialization comes with a risk of overfitting: applying a policy that is specialized in a specific set of contexts to different contexts [22, 37]. As a consequence, generalization is often preferred, as it avoids this issue.

In this chapter, we argue that, depending on the circumstances, specialization may be preferable over generalization; specifically, if it can be guaranteed with high confidence that a specialized policy is only deployed in contexts where it outperforms policies optimized for generalization. We focus on counterfactual learning for contextual bandit problems where contexts can be split into non-overlapping sets. We simultaneously train (i) a generalized policy that performs well across all contexts, and (ii) many specialized policies, one for each specific set of contexts. Thus, per context there is a choice between three policies: (i) the logging policy used to gather data, (ii) the generalized policy, and (iii) the specialized policy. Depending on the circumstances, e.g., the amount of data available, noise in the data, or the difficulty of the task, a different policy will perform best in a specific context [22]. To reliably choose between policies, we estimate high-confidence bounds [119] on the relative performance differences between policies and then choose conservatively: we only apply a specialized policy instead of the generalized policy or logging policy if the lower bounds on their differences in performance are positive in a specific context. Otherwise, the generalized policy is only applied if it outperforms the logging policy across all contexts with high confidence. We call this approach the Generalization and Specialization (GENSPEC) framework: it trains both generalized and specialized policies and results in a meta-policy that chooses between them using high-confidence bounds. The GENSPEC meta-policy is particularly powerful because it can combine the properties of different models: for instance, a generalized policy using a feature-based model can be overruled by a specialized policy using a tabular model that has memorized the best actions. GENSPEC promises the best of two worlds: the safe robustness of a generalized policy with the potentially high performance of a specialized policy.

To evaluate the GENSPEC approach, we apply it to query-specialization in the setting of Counterfactual Learning to Rank (LTR). Existing approaches in this field either generalize, by learning a ranking model that ranks items based on their features and generalizes well across all queries [58], or they specialize, by learning tabular ranking models that are specific to a single query and cannot be applied to any other query [70, 138]. By viewing each query as a different context, GENSPEC learns both a generalized ranker and many specialized tabular rankers, and subsequently chooses which ranker to apply per query. Our empirical results show that GENSPEC combines the advantages of both approaches: very high performance on queries where sufficiently many interactions were observed for successful specialization, and safe, robust performance on queries where interaction data is limited or noisy.

Our main contributions are:

1. an adaptation of existing counterfactual high-confidence bounds for relative performance between ranking policies;
2. the GENSPEC framework that simultaneously learns generalized and specialized ranking policies, plus a meta-policy that decides which to deploy per context.

To the best of our knowledge, GENSPEC is the first counterfactual LTR method to simultaneously train generalized and specialized models, and to reliably choose between them using high-confidence bounds.
6.2 Background: Learning to Rank

This section covers the basics of counterfactual LTR.
The LTR task has been approached as a contextual bandit problem before [68, 70, 117, 132]. The differentiating characteristic of the LTR task is that actions are rankings; thus, they consist of an ordered set of $K$ items: $a = (d_1, d_2, \ldots, d_K)$. The contextual information often contains a user-issued search query, features based on the items available for ranking and on item-query combinations, and information about the user, among other miscellaneous information. Since our focus is query specialization, we record the query separately; thus, at each time step $i$, contextual information $x_i$ and a single query $q_i \in \{1, 2, 3, \ldots\}$ are active: $x_i, q_i \sim P(x, q)$. Let $\Delta$ indicate the reward for a ranking $a$. A policy $\pi$ should maximize the expected reward [58, 75]:
\[
R(\pi) = \iint \Big( \sum_{a} \Delta(a \mid x, q, r) \cdot \pi(a \mid x, q) \Big) P(x, q) \, dx \, dq. \tag{6.1}
\]
Commonly, in LTR the reward for a ranking $a$ is a linear combination of the relevance scores of the items in $a$, weighted according to their rank. We use $r(d \mid x, q)$ to denote the relevance score of item $d$ and $\lambda(\text{rank}(d \mid a))$ for the weight per rank, resulting in:
\[
\Delta(a \mid x, q, r) = \sum_{d \in a} \lambda\big(\text{rank}(d \mid a)\big) \cdot r(d \mid x, q). \tag{6.2}
\]
A common choice is to optimize the Discounted Cumulative Gain (DCG) metric; $\lambda$ can be chosen accordingly:
\[
\lambda_{\text{DCG}}\big(\text{rank}(d \mid a)\big) = \log_2\big(\text{rank}(d \mid a) + 1\big)^{-1}. \tag{6.3}
\]
When the relevance function $r$ is given, maximizing $R$ can be done through traditional LTR in a supervised manner [13, 75, 129].

In practice, the relevance score $r$ is often unknown or requires expensive annotation [17, 27, 95, 104]. An attractive alternative comes from LTR based on historical interaction logs, which takes a counterfactual approach [58, 127]. Let $\pi_0$ be the logging policy that was used when interactions were logged:
\[
a_i \sim \pi_0(a \mid x_i, q_i). \tag{6.4}
\]
Counterfactual LTR focuses mainly on clicks as interactions; clicks are strongly affected by position bias [25]. This bias arises because users often do not examine all items presented to them, and only click on examined items. As a result, items that are displayed in positions that are more often examined are also more likely to be clicked, without necessarily being more relevant. Let $o_i(d) \in \{0, 1\}$ indicate whether item $d$ was examined by the user or not:
\[
o_i(d) \sim P\big(o(d) \mid a_i\big). \tag{6.5}
\]
We use $c_i(d) \in \{0, 1\}$ to indicate whether $d$ was clicked at time step $i$:
\[
c_i(d) \sim P\big(c(d) \mid o_i(d), r(d \mid x, q)\big). \tag{6.6}
\]
We assume that click probabilities are only dependent on whether an item was examined, $o_i(d)$, and on its relevance, $r(d \mid x, q)$. Furthermore, we make the common assumption that clicks only occur on examined items [58, 127]; thus:
\[
P\big(c(d) = 1 \mid o(d) = 0, r(d \mid x, q)\big) = 0. \tag{6.7}
\]
Moreover, we assume that, given examination, more relevant documents are more likely to be clicked. Specifically, click probability is proportional to relevance with an offset $\mu \in \mathbb{R}_{>0}$:
\[
P\big(c(d) = 1 \mid o(d) = 1, r(d \mid x, q)\big) \propto r(d \mid x, q) + \mu. \tag{6.8}
\]
The data used for counterfactual LTR consists of observed clicks $c_i$, propensity scores $\rho_i$, contextual information $x_i$, and query $q_i$ for $N$ interactions:
\[
\mathcal{D} = \big\{ (c_i, a_i, \rho_i, x_i, q_i) \big\}_{i=1}^{N}. \tag{6.9}
\]
We apply the policy-aware approach [86] (Chapter 5) and base $\rho$ both on the examination probability of the user and on the behavior of the logging policy:
\[
\rho_i(d) = \sum_{a} P\big(o_i(d) = 1 \mid a\big) \cdot \pi_0(a \mid x_i, q_i). \tag{6.10}
\]
The estimated reward based on $\mathcal{D}$ is now:
\[
\hat{R}(\pi \mid \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \sum_{a} \hat{\Delta}(a \mid c_i, \rho_i) \cdot \pi(a \mid x_i, q_i), \tag{6.11}
\]
where $\hat{\Delta}$ is an Inverse Propensity Scoring (IPS) estimator:
\[
\hat{\Delta}(a \mid c_i, \rho_i) = \sum_{d \in a} \lambda\big(\text{rank}(d \mid a)\big) \cdot \frac{c_i(d)}{\rho_i(d)}. \tag{6.12}
\]
Since the reward $r$ is not observed directly, clicks are used as implicit feedback, which is a biased and noisy indicator of relevance. The unbiased estimate $\hat{R}$ can be used for unbiased evaluation and optimization (see Appendix 6.A for a proof), since:
\[
\arg\max_{\pi} \mathbb{E}_{o_i, a_i}\big[\hat{R}(\pi \mid \mathcal{D})\big] = \arg\max_{\pi} R(\pi). \tag{6.13}
\]
Previous work has introduced several methods for maximizing $\hat{R}$ so as to optimize different LTR metrics [2, 58].

This concludes our description of the counterfactual LTR basics; importantly, ranking policies can be optimized from clicks without being affected by the logging policy or the users' position bias.
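To make the estimator concrete, the following is a minimal Python sketch of the IPS estimate $\hat{\Delta}$ of Eq. 6.12 for a single logged interaction. The data layout (dictionaries keyed by item) is illustrative and not taken from the thesis.

```python
import math

def dcg_weight(rank):
    # The DCG weight of Eq. 6.3: 1 / log2(rank + 1), with ranks starting at 1.
    return 1.0 / math.log2(rank + 1)

def ips_delta(ranking, clicks, propensities):
    """IPS estimate of the reward of `ranking` (Eq. 6.12).

    ranking:      list of item ids in the order the evaluated policy displays them.
    clicks:       dict item -> 0/1 click indicator from one logged interaction.
    propensities: dict item -> examination propensity rho_i(d) under the logging policy.
    """
    return sum(
        dcg_weight(position) * clicks.get(item, 0) / propensities[item]
        for position, item in enumerate(ranking, start=1)
    )
```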
6.3 GENSPEC for Query Specialization

This section introduces the GENSPEC framework and applies it to query specialization for LTR. Section 6.6 details how it can be applied to the general contextual bandit problem.

We will now propose the first part of the GENSPEC framework, which produces a general policy $\pi_g$ and, for each query $q$, a specialized policy $\pi_q$. GENSPEC uses the logged data $\mathcal{D}$ both to train policies and to evaluate relative performance; to avoid overfitting, we split $\mathcal{D}$ into a training partition $\mathcal{D}^{\text{train}}$ and a policy-selection partition $\mathcal{D}^{\text{sel}}$, so that $\mathcal{D} = \mathcal{D}^{\text{train}} \cup \mathcal{D}^{\text{sel}}$ and $\mathcal{D}^{\text{train}} \cap \mathcal{D}^{\text{sel}} = \emptyset$.

A policy has optimal generalization performance if it maximizes performance across all queries. Thus, given the generalization policy space $\Pi_g$, the optimal general policy is:
\[
\pi_g = \arg\max_{\pi \in \Pi_g} \hat{R}(\pi \mid \mathcal{D}^{\text{train}}). \tag{6.14}
\]
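A sketch of the corresponding data handling, assuming (in line with the experimental setup later in this chapter) a 70%/30% train/policy-selection split; all names and the record layout are illustrative:

```python
import random
from collections import defaultdict

def split_and_filter(interactions, sel_fraction=0.3, seed=42):
    """Split D into D^train and D^sel, and group D^train per query (Eq. 6.15).
    Each interaction record is assumed to be a tuple (c_i, a_i, rho_i, x_i, q_i)."""
    rng = random.Random(seed)
    train, sel = [], []
    for record in interactions:
        (sel if rng.random() < sel_fraction else train).append(record)
    train_per_query = defaultdict(list)
    for record in train:
        train_per_query[record[-1]].append(record)  # record[-1] is the query q_i
    return train, sel, train_per_query
```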
Alternatively, we can also choose to optimize performance for a single query $q$. First, we select only the datapoints in $\mathcal{D}$ where query $q$ was issued:
\[
\mathcal{D}_q = \big\{ (c_i, a_i, \rho_i, x_i, q_i) \in \mathcal{D} \mid q_i = q \big\}. \tag{6.15}
\]
Then the policy $\pi_q$ that is specialized for query $q$ is the policy in the specialization policy space $\Pi_q$ that maximizes the performance when query $q$ is issued:
\[
\pi_q = \arg\max_{\pi \in \Pi_q} \hat{R}(\pi \mid \mathcal{D}^{\text{train}}_q). \tag{6.16}
\]
The motivation for $\pi_q$ is that it has the potential to provide better performance than $\pi_g$ when $q$ is issued. We may expect $\pi_q$ to outperform $\pi_g$ because $\pi_g$ may compromise performance on query $q$ for better performance across all queries, whereas $\pi_q$ never makes such compromises. Furthermore, $\Pi_q$ could contain better policies than $\Pi_g$, because the policies in $\Pi_g$ have to be applicable to all queries, whereas $\Pi_q$ can make use of specific properties of $q$. However, it is also possible that $\pi_g$ and $\pi_q$ provide the same performance. Moreover, since $\mathcal{D}_q$ is a subset of $\mathcal{D}$, the optimization of $\pi_q$ is more vulnerable to noise in the data. As a result, the true performance of $\pi_q$ on query $q$ could be worse than that of $\pi_g$, especially when $\mathcal{D}_q$ is substantially smaller than $\mathcal{D}$. In other words, a priori it is unclear whether $\pi_g$ or $\pi_q$ is preferred. We thus need a method to estimate the optimal choice with a reasonable amount of confidence.

We will now propose the other part of our GENSPEC framework: a meta-policy that safely chooses between deploying $\pi_g$ and $\pi_q$ per query $q$. We wish to avoid deploying $\pi_q$ when it performs worse than $\pi_g$, and, similarly, avoid deploying $\pi_g$ when it is outperformed by the logging policy $\pi_0$. Recently, a method for safe policy deployment was introduced by Jagerman et al. [51] based on high-confidence bounds [119]. The intuition behind their method is that a learned policy $\pi$ should not be deployed before we can be highly confident that it outperforms the logging policy $\pi_0$; otherwise, it is safer to keep the logging policy in deployment.

While previous work has bounded the performance of individual policies [51, 119], we instead bound the difference in performance between two policies directly. Let $\delta(\pi_1, \pi_2)$ indicate the true difference in performance between a policy $\pi_1$ and a policy $\pi_2$:
\[
\delta(\pi_1, \pi_2) = R(\pi_1) - R(\pi_2). \tag{6.17}
\]
Knowing $\delta(\pi_1, \pi_2)$ allows us to optimally choose which of the two policies to deploy. However, we can only estimate its value from historical data $\mathcal{D}$:
\[
\hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) = \hat{R}(\pi_1 \mid \mathcal{D}) - \hat{R}(\pi_2 \mid \mathcal{D}). \tag{6.18}
\]
For brevity, let $R_{i,d}$ indicate the inverse-propensity-scored difference for a single document $d$ at interaction $i$:
\[
R_{i,d} = \frac{c_i(d)}{\rho_i(d)} \sum_{a \in \pi_1 \cup \pi_2} \big(\pi_1(a \mid x_i, q_i) - \pi_2(a \mid x_i, q_i)\big) \cdot \lambda\big(\text{rank}(d \mid a)\big). \tag{6.19}
\]
Then, for computational efficiency, we rewrite:
\[
\hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \sum_{d \in a_i} R_{i,d} = \frac{1}{|\mathcal{D}| K} \sum_{(i,d) \in \mathcal{D}} K \cdot R_{i,d}. \tag{6.20}
\]
For notational purposes, we let $\sum_{(i,d) \in \mathcal{D}}$ iterate over all actions $a_i$ and the $K$ documents $d$ per action $a_i$.
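The per-document terms $R_{i,d}$ are straightforward to compute when the two policies are deterministic, in which case the sum over rankings in Eq. 6.19 collapses to a single ranking per policy. A sketch under that simplifying assumption (all function names and the record layout are illustrative):

```python
def reward_diffs(rank_fn_1, rank_fn_2, interactions, dcg_weight):
    """Per-document terms R_{i,d} of Eq. 6.19 for two deterministic policies:
    R_{i,d} = c_i(d) / rho_i(d) * (lambda(rank under pi_1) - lambda(rank under pi_2)).
    Each interaction carries its clicks dict, propensities dict, and context (x, q)."""
    diffs = []
    for clicks, rho, x, q in interactions:
        for d, clicked in clicks.items():
            if clicked:
                gap = dcg_weight(rank_fn_1(d, x, q)) - dcg_weight(rank_fn_2(d, x, q))
                diffs.append(gap / rho[d])
            else:
                diffs.append(0.0)  # unclicked documents contribute zero
    return diffs
```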
With the confidence parameter $\epsilon \in [0, 1)$, setting $b$ to be the maximum possible absolute value of $R_{i,d}$, i.e., $b = \frac{\max \lambda(\cdot)}{\min \rho}$, and
\[
\nu = \frac{2 |\mathcal{D}| K \ln\frac{2}{1 - \epsilon}}{|\mathcal{D}| K - 1} \sum_{(i,d) \in \mathcal{D}} \big( K \cdot R_{i,d} - \hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) \big)^2,
\]
we follow Thomas et al. [119] to get the high-confidence bound:
\[
CB(\pi_1, \pi_2 \mid \mathcal{D}) = \frac{7 K b \ln\frac{2}{1 - \epsilon}}{3(|\mathcal{D}| K - 1)} + \frac{1}{|\mathcal{D}| K} \cdot \sqrt{\nu}. \tag{6.21}
\]
In turn, this provides us with the following upper and lower confidence bounds on $\delta$:
\[
\begin{aligned}
LCB(\pi_1, \pi_2 \mid \mathcal{D}) &= \hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) - CB(\pi_1, \pi_2 \mid \mathcal{D}), \\
UCB(\pi_1, \pi_2 \mid \mathcal{D}) &= \hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) + CB(\pi_1, \pi_2 \mid \mathcal{D}).
\end{aligned}
\tag{6.22}
\]
As proven by Thomas et al. [119], with a probability of at least $\epsilon$ they bound the true value of $\delta(\pi_1, \pi_2)$:
\[
P\Big( \delta(\pi_1, \pi_2) \in \big[ LCB(\pi_1, \pi_2 \mid \mathcal{D}), UCB(\pi_1, \pi_2 \mid \mathcal{D}) \big] \Big) > \epsilon. \tag{6.23}
\]
These guarantees allow us to safely choose between policies per query $q$. We apply a doubly conservative strategy: $\pi_g$ is not deployed before we are confident that it outperforms $\pi_0$ across all queries; and $\pi_q$ is not deployed before we are confident that it outperforms both $\pi_g$ and $\pi_0$ on query $q$. This strategy results in the GENSPEC meta-policy $\pi_{GS}$:
\[
\pi_{GS}(a \mid x, q) =
\begin{cases}
\pi_q(a \mid x, q), & \text{if } LCB(\pi_q, \pi_g \mid \mathcal{D}^{\text{sel}}_q) > 0 \,\wedge\, LCB(\pi_q, \pi_0 \mid \mathcal{D}^{\text{sel}}_q) > 0, \\
\pi_g(a \mid x, q), & \text{if } LCB(\pi_q, \pi_g \mid \mathcal{D}^{\text{sel}}_q) \leq 0 \,\wedge\, LCB(\pi_g, \pi_0 \mid \mathcal{D}^{\text{sel}}) > 0, \\
\pi_0(a \mid x, q), & \text{otherwise}.
\end{cases}
\tag{6.24}
\]
In theory, this approach can make use of the potential gains of specialization while avoiding its risks. For instance, if the policy-selection partition $\mathcal{D}^{\text{sel}}_q$ is very small, it may be heavily affected by noise, so that the confidence bound $CB$ will be wide and $\pi_q$ will not be deployed. Simultaneously, $\mathcal{D}^{\text{sel}}$ may be large enough so that $\pi_g$ is deployed with high confidence.

We expect that, in practice, the relative bounding of GENSPEC is much more data-efficient than the Safe Exploration Algorithm (SEA) approach by Jagerman et al. [51]. SEA computes an upper bound on the trusted policy and a lower bound on a learned policy, and only deploys the learned policy if its lower bound is greater than the other's upper bound. When the learned policy has higher performance than the other, we expect the relative bounds of GENSPEC to require less data to be certain about this difference than the SEA bounds. In Appendix 6.B we theoretically analyze the difference between these approaches and conclude that the relative bounding of GENSPEC is more efficient if there is a positive covariance between $\hat{R}(\pi_1 \mid \mathcal{D})$ and $\hat{R}(\pi_2 \mid \mathcal{D})$. Because both estimates are based on the same interaction data $\mathcal{D}$, a high covariance is extremely likely.

Previous work has described safety constraints for policy deployment [51, 62, 131]. The authors assume that a baseline policy exists whose behavior is considered safe; other policies are considered unsafe if their performance is worse than the baseline policy by a certain margin. If the logging policy is taken to be the baseline policy, then GENSPEC can meet such constraints [51]. We note that while the safety guarantee is strong for a single bound (Eq. 6.23), when applied to a large number of queries the probability of at least one incorrect bound greatly increases. This problem of multiple comparisons may cause some non-optimal policies to be deployed for some queries. Since we mainly care about overall performance, this is not expected to be an issue; however, in cases where safety constraints are very important, $\epsilon$ can be chosen to account for the number of comparisons.
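Putting Eqs. 6.21-6.24 together, a minimal sketch of the bound computation and the deployment rule follows. It consumes lists of the values $K \cdot R_{i,d}$ (e.g., K times the output of reward_diffs above) and assumes at least two entries per list; the function names are illustrative.

```python
import math

def confidence_bound(scaled_diffs, epsilon, b, K):
    """High-confidence bound of Eq. 6.21. `scaled_diffs` holds the |D|*K values
    K * R_{i,d}; `b` is the maximum absolute value a single R_{i,d} can take."""
    n = len(scaled_diffs)
    mean = sum(scaled_diffs) / n
    log_term = math.log(2.0 / (1.0 - epsilon))
    nu = (2.0 * n * log_term / (n - 1)) * sum((x - mean) ** 2 for x in scaled_diffs)
    return 7.0 * K * b * log_term / (3.0 * (n - 1)) + math.sqrt(nu) / n

def lcb(scaled_diffs, epsilon, b, K):
    # Lower confidence bound on the performance difference (Eq. 6.22).
    return sum(scaled_diffs) / len(scaled_diffs) - confidence_bound(scaled_diffs, epsilon, b, K)

def genspec_choice(spec_vs_gen, spec_vs_log, gen_vs_log, epsilon, b, K):
    # The doubly conservative deployment rule of Eq. 6.24, applied per query.
    if lcb(spec_vs_gen, epsilon, b, K) > 0 and lcb(spec_vs_log, epsilon, b, K) > 0:
        return "specialized"
    if lcb(gen_vs_log, epsilon, b, K) > 0:
        return "generalized"
    return "logging"
```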
This completes our introduction of the GENSPEC framework for query specialization. Figure 6.1 visualizes our approach. We learn from historical interactions gathered using a logging policy $\pi_0$; the interactions are divided into a training and a policy-selection partition per query. Subsequently, a policy is optimized for generalization (to perform well across all queries) and, for each query, a policy is optimized for specialization (to perform well for a single query). While specialization can potentially maximize performance on a specific query, it brings more risks than generalization, since a general policy is optimized on more data and may provide better performance on previously unseen queries. As a solution to this dilemma, we propose a strategy that uses high-confidence bounds on the differences in performance between policies. These bounds are then used to choose safely between the deployment of the logging, general, and specialized policies. In theory, GENSPEC combines the best of both worlds: the high potential of specialization and the broad safety of generalization.

[Figure 6.1 diagram omitted; it shows users issuing queries to the logging policy, the logged interactions divided per context (query) into training data and confidence-bound data, a generalization policy $\pi_g$, specialization policies $\pi_1, \ldots, \pi_5$ for queries $q = 1, \ldots, 5$, and GENSPEC choosing among $\pi_0$, $\pi_g$, and $\pi_q$ per query.]

Figure 6.1: Visualization of the GENSPEC framework applied to query specialization for counterfactual LTR. The data $\mathcal{D}$ is divided per query $q$; many specialized policies $\pi_1, \pi_2, \ldots$ are each optimized for a single query $q \in \{1, 2, \ldots\}$, and a single general policy $\pi_g$ is learned on the data across all queries. Finally, GENSPEC decides which policy to deploy per context, based on high-confidence bounds.

6.4 Experimental Setup

This section discusses our experimental setup and the policies used to evaluate the GENSPEC framework.
To evaluate the GENSPEC framework, we make use of a semi-synthetic experimental setup: queries, relevance judgements, and documents come from industry datasets, while biased and noisy user interactions are simulated using probabilistic user models. This setup is very common in the counterfactual LTR and online LTR literature [2, 58, 84]. We make use of the three largest LTR industry datasets:
Yahoo! Webscope [17],
MSLR-WEB30k [95], and
Istella [27]. Each consists of a set of queries with a preselected set of documents per query; document-query combinations are only represented by feature vectors and a label indicating relevance according to expert annotators. Labels range from 0 (not relevant) to 4 (perfectly relevant): $r(d \mid x, q) \in \{0, 1, 2, 3, 4\}$. User-issued queries are simulated by uniformly sampling from the training and validation partitions of the datasets. Displayed rankings are generated by a logging ranker using a linear model optimized on part of the training partition using supervised LTR [58]. Then, user examination is simulated with probabilities inverse to the displayed rank of a document:
\[
P\big(o(d) = 1 \mid a\big) = \frac{1}{\text{rank}(d \mid a)}.
\]
Finally, user clicks are generated by a click model with a single parameter $\alpha \in \mathbb{R}$ that governs how strongly the click probability $P(c(d) = 1 \mid o(d) = 1, r(d \mid x, q))$ grows with the relevance grade (Eq. 6.25). In our experiments, we use two settings of $\alpha$: the first represents a near-ideal, low-noise setting where relevant documents receive a very large number of clicks; the second represents a noisier and harder setting where the large majority of clicks are on non-relevant documents. Clicks are only generated on the training and validation partitions; 30% of the training clicks are separated for policy selection ($\mathcal{D}^{\text{sel}}$), and hyperparameter optimization is done using counterfactual evaluation with clicks on the validation partition [58].

Some of our baselines are online bandit algorithms; for these baselines no clicks are separated for $\mathcal{D}^{\text{sel}}$, and the algorithms are run online: clicks are not gathered using the logging policy but by applying the algorithms in an online interactive setting.

The evaluation metric we use is normalized DCG (Eq. 6.3) [53] using the ground-truth labels from the datasets. Unlike most LTR work, we do not apply a rank-cutoff when computing the metric; thus, an NDCG of 1.0 indicates that all documents are ranked perfectly (not just the top-k). We separately calculate performance on the test set (Test-NDCG), to evaluate performance on previously unseen queries, and on the training set (Train-NDCG). The total number of clicks is varied, spread uniformly over all queries; the differences in Train-NDCG as more clicks are added allow us to evaluate performance on queries with different levels of popularity.
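A sketch of one step of this click simulation follows. The examination model uses the inverse-rank probabilities above; the click probability given examination is passed in as a function, since the exact parametrization of Eq. 6.25 depends on $\alpha$ and is not reproduced here. All names are illustrative.

```python
import random

def simulate_interaction(ranking, relevance, click_prob, rng=random):
    """Simulate one logged interaction under the position-based examination
    model of the setup: P(o(d)=1 | a) = 1 / rank(d | a).

    `click_prob(grade)` maps a relevance grade to P(c=1 | o=1, grade); the
    exact formula (Eq. 6.25) and its alpha parameter are supplied by the caller.
    """
    clicks = {}
    for rank, doc in enumerate(ranking, start=1):
        examined = rng.random() < 1.0 / rank
        clicked = examined and rng.random() < click_prob(relevance[doc])
        clicks[doc] = 1 if clicked else 0
    return clicks
```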
For the generalization policy space $\Pi_g$ we use feature-based ranking models. This is a natural choice as they can be applied to any query, including previously unseen ones. However, the available features could limit the possible behavior of the policies. We use linear models for $\Pi_g$; optimization is done on $\mathcal{D}^{\text{train}}$ following previous counterfactual LTR work [2]. This results in a learned scoring function $f(d, x, q) \in \mathbb{R}$ according to which items are ranked; due to score ties there can be multiple valid rankings:
\[
\mathcal{A}_g(x, q) = \big\{ a \mid \forall (d_n, d_m) \in x,\; \big( f(d_n, x, q) > f(d_m, x, q) \rightarrow d_n \succ_a d_m \big) \big\}. \tag{6.26}
\]
The general policy $\pi_g$ samples uniformly at random from the set of valid rankings:
\[
\pi_g(a \mid x, q) =
\begin{cases}
\frac{1}{|\mathcal{A}_g(x, q)|} & \text{if } a \in \mathcal{A}_g(x, q), \\
0 & \text{otherwise}.
\end{cases}
\tag{6.27}
\]
For the specialization policy space $\Pi_q$, we follow bandit-style online LTR work and take the tabular approach [69]. Documents are scored according to an unbiased estimate of the Click-Through-Rate (CTR) on query $q$:
\[
\widehat{CTR}(d, q) = \frac{1}{|\mathcal{D}^{\text{train}}_q|} \sum_{i \in \mathcal{D}^{\text{train}}_q} \frac{c_i(d)}{\rho_i(d)}, \tag{6.28}
\]
which maximizes the estimated performance (Eq. 6.12). Due to ties there can be multiple valid rankings:
\[
\mathcal{A}_q(x, q) = \big\{ a \mid \forall (d_n, d_m) \in x,\; \big( \widehat{CTR}(d_n, q) > \widehat{CTR}(d_m, q) \rightarrow d_n \succ_a d_m \big) \big\}. \tag{6.29}
\]
The specialized policy $\pi_q$ also chooses uniformly at random from the set of valid rankings:
\[
\pi_q(a \mid x, q) =
\begin{cases}
\frac{1}{|\mathcal{A}_q(x, q)|} & \text{if } a \in \mathcal{A}_q(x, q), \\
0 & \text{otherwise}.
\end{cases}
\tag{6.30}
\]
The tabular approach is not restrained by the available features and can produce any possible ranking [138]. Consequently, given enough interactions, the tabular approach can perfectly rank items according to relevance. However, CTR cannot be estimated for previously unseen queries, and there $\pi_q$ chooses uniformly at random between all possible rankings. On a query with a single click, $\pi_q$ will place the once-clicked item at the front of the ranking. Since clicks are very noisy, this behavior is very risky, and hence GENSPEC uses confidence bounds to avoid the deployment of such unsafe behavior.
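A sketch of the tabular CTR estimate of Eq. 6.28 for a single query; the record layout (clicks dict, propensities dict per interaction) is an assumption for illustration:

```python
from collections import defaultdict

def estimate_ctr(per_query_interactions):
    """Unbiased CTR estimate of Eq. 6.28 for one query: each click is weighted
    inversely to its examination propensity and averaged over |D^train_q|."""
    totals = defaultdict(float)
    n = len(per_query_interactions)
    for clicks, rho in per_query_interactions:
        for d, clicked in clicks.items():
            if clicked:
                totals[d] += 1.0 / rho[d]
    return {d: total / n for d, total in totals.items()}
```

The specialized policy then sorts documents by descending estimated CTR, breaking ties uniformly at random (Eqs. 6.29 and 6.30).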
6.5 Experimental Results and Discussion

First, we consider the behavior of GENSPEC compared with pure generalization or pure specialization policies. Figures 6.2 and 6.3 show the performance of (i) GENSPEC with different levels of confidence for its bounds ($\epsilon$), along with that of (ii) the logging policy, (iii) the pure generalization policy, and (iv) the pure specialization policies between which the GENSPEC meta-policy chooses. We see that pure generalization requires few clicks to improve over the logging policy but is not able to reach optimal levels of performance. The performance of pure specialization, on the other hand, is initially far below the logging policy. However, after enough clicks have been gathered, performance increases until the optimal ranking is found; when click noise is limited (the low-noise setting of $\alpha$), it reaches perfect performance on all three datasets (Train-NDCG). On the unseen queries where there are no clicks (Test-NDCG), the specialization policy is unable to learn anything and provides random performance (not displayed in Figures 6.2 and 6.3). The initial period of poor performance can be very detrimental to queries that do not receive a large number of clicks. Prior work has found that web-search queries follow a long-tail distribution [113, 115]; White et al. [130] found that 97% of queries received only very few clicks over six months. For such queries, users may only experience the initial poor performance of pure specialization and never see the improvements it brings at convergence. This possibility can be a large deterrent from applying pure specialization in practice [131].

Finally, the GENSPEC policy combines properties of both: after a few clicks it deploys the generalization policy and thus outperforms the logging policy; as more clicks are gathered, specialization policies are activated, further improving performance. In the low-noise setting, the GENSPEC policy with a sufficiently small confidence parameter $\epsilon$ reaches perfect Train-NDCG performance on all three datasets, similar to the pure specialization policy. However, unlike pure specialization, the performance of GENSPEC (with $\epsilon > 0$) never drops below the logging policy. Moreover, we never observe the situation where an increase in the number of clicks results in a decrease in mean performance. There is a delay between when the pure specialization policy is the optimal choice and when GENSPEC activates specialization policies. Thus, while the usage of confidence bounds prevents the performance from dropping below the level of the logging policy, it does so at the cost of this delay. When GENSPEC does not use any bounds, it deploys specialized policies earlier; however, in some cases these deployments result in worse performance than the logging policy, albeit less so than pure specialization. In all our observed results, a modest confidence level $\epsilon$ was enough to prevent any decreases in performance.

[Figure 6.2 graphs omitted; rows: Yahoo! Webscope, MSLR-WEB30k, Istella; columns: Train-NDCG, Test-NDCG; x-axis: Mean Number of Clicks per Query; legend: Generalized Model, Logging Model, Specialized Model, GENSPEC (no bounds), and GENSPEC at four confidence levels $\epsilon$.]

Figure 6.2: Performance of GENSPEC with varying levels of confidence, compared to pure generalization and pure specialization, on clicks generated with the low-noise setting of $\alpha$. We separate queries on the training set (Train-NDCG), which receive clicks, from queries on the test set (Test-NDCG), which do not receive any clicks. Clicks are spread uniformly over the training set; the x-axis indicates the total number of clicks divided by the number of training queries. Results are an average of 10 runs; the shaded area indicates the standard deviation.

[Figure 6.3 graphs omitted; same layout and legend as Figure 6.2, for the high-noise setting of $\alpha$.]

Figure 6.3: Performance of GENSPEC with varying levels of confidence, compared to pure generalization and pure specialization, on clicks generated with the high-noise setting of $\alpha$. Notation is the same as in Figure 6.2.

To conclude, our experimental results show that the GENSPEC meta-policy combines the high performance at convergence of specialization with the safe robustness of generalization. In contrast to pure specialization, which results in very poor performance when not enough clicks have been gathered, GENSPEC effectively avoids incorrect deployment, and under our tested conditions it never performs worse than the logging policy. Meanwhile, GENSPEC achieves considerable gains in performance at convergence, in contrast with pure generalization. Therefore, we conclude that GENSPEC is the best choice in situations where periods of poor performance have to be avoided [131] or when not all queries receive large numbers of clicks [130].

GENSPEC is not the first method that deploys policies based on confidence bounds. As discussed in Section 6.3, Jagerman et al. [51] previously introduced the SEA algorithm. SEA chooses between deploying a generalizing policy or keeping the logging policy in deployment by bounding the performance of both the logging and the generalization policy. When the upper bound of the logging policy is less than the lower bound of the generalizing policy, SEA deploys the latter. The big differences with GENSPEC are that SEA (i) uses two bounds to confidently estimate whether one policy outperforms another, and (ii) does not consider specialization policies. Because GENSPEC directly bounds relative performance, its comparisons only use a single bound, and thus we expect it to be more efficient w.r.t. the number of clicks required than SEA (see Appendix 6.B for a formal analysis).
For a fair comparison, we adapt SEA to choose between the same policies as GENSPEC and provide it with the same click data. Figures 6.4 and 6.5 display the results of this comparison. Across all settings, GENSPEC deploys policies much earlier than SEA with the same level of confidence. While they converge at the same levels of performance, GENSPEC requires considerably less data; e.g., on the Istella dataset GENSPEC deploys with many times less data. Thus, we conclude that the relative bounds of GENSPEC are much more efficient than the existing bounding approach of SEA.

[Figure 6.4 graphs omitted; rows: Yahoo! Webscope, MSLR-WEB30k, Istella; columns: Train-NDCG, Test-NDCG; x-axis: Mean Number of Clicks per Query; legend: Generalized Model, Logging Model, Specialized Model, and SEA and GENSPEC at two confidence levels $\epsilon$ each.]

Figure 6.4: GENSPEC compared to a meta-policy using the SEA bounds (see Section 6.5.2), on clicks generated with the low-noise setting of $\alpha$. Notation is the same as in Figure 6.2.

Obvious baselines for our experiments are methods from the counterfactual LTR field [5, 58, 127]. In our setting, where the observance probabilities are given, all these methods reduce to the method of Oosterhuis and de Rijke [86] (see Chapter 5), i.e., the method used to optimize the pure generalization policy in Figures 6.2 and 6.3. Thus, the comparison between GENSPEC and pure generalization is effectively a comparison between GENSPEC and state-of-the-art counterfactual LTR.
[Figure 6.5 graphs omitted; same layout and legend as Figure 6.4, for the high-noise setting of $\alpha$.]

Figure 6.5: GENSPEC compared to a meta-policy using the SEA bounds (see Section 6.5.2), on clicks generated with the high-noise setting of $\alpha$. Notation is the same as in Figure 6.2.

As expected, we see that GENSPEC reaches the same performance on previously unseen queries (Test-NDCG); but on queries with clicks (Train-NDCG), GENSPEC outperforms standard counterfactual LTR by enormous amounts once many clicks have been gathered. Again, there is a small delay between the moment the generalization policy outperforms the logging policy and when GENSPEC deploys it. Since this observed delay is very short, this downside seems to be heavily outweighed by the large increases in Train-NDCG performance. Thus, we conclude that GENSPEC is preferable over existing counterfactual LTR approaches, due to its ability to incorporate highly specialized models in its policy.

Other related methods are online LTR bandit algorithms [61, 68]. Unlike counterfactual LTR, these bandit methods learn using online interventions: at each timestep they choose which ranking to display to users. Thus, they have some control over the interactions they receive, and attempt to display rankings that will benefit the learning process the most. As baselines we use the hotfix algorithm [138] and the Position-Based Model algorithm (PBM) [69]. The hotfix algorithm is a very general approach; it randomly shuffles the top-n items and ranks them based on pairwise preferences inferred from clicks. The main downside of the hotfix approach is that it can be very detrimental to the user experience due to the randomization. We apply two versions of the hotfix algorithm: one for top-10 reranking, to minimize randomization, and another for reranking the complete ranking. PBM is perfectly suited for our task as it makes the same assumptions about user behavior as our experimental setting. We apply PBM-PIE [69], which results in PBM always displaying the ranking it expects to perform best, thus attempting to maximize the user experience during learning. These methods are very similar to our specialization policies: the bandit baselines memorize the best rankings and do not depend on features at all. Consequently, their learned policies cannot be applied to previously unseen queries.

[Figure 6.6 graphs omitted; rows: Yahoo! Webscope, MSLR-WEB30k, Istella; columns: the low-noise and high-noise click settings; x-axis: Mean Number of Clicks per Query; legend: Position-Based Model, Hotfix-Complete, Hotfix-Top10, Logging, GENSPEC.]

Figure 6.6: GENSPEC compared to various online LTR bandits (see Section 6.5.4). Notation is the same as in Figure 6.2.

Figure 6.6 displays the results of this comparison (we only report the performance of the ranking produced by the hotfix baselines, not of the randomized rankings used to gather clicks). We see that in the low-noise setting, Hotfix-Complete, PBM, and GENSPEC all reach perfect Train-NDCG; however, Hotfix-Complete and PBM reach convergence much earlier than GENSPEC. We attribute this difference to three causes: (i) the online interventions of the bandit baselines; (ii) GENSPEC only uses 70% of the available data for training ($\mathcal{D}^{\text{train}}$), whereas the bandit baselines use everything; and (iii) the delay in deployment added by GENSPEC's usage of confidence bounds. Similar to the pure specialization policies, the earlier moment of convergence of the bandit baselines comes at the cost of an initial period of very poor performance. We conclude that if only the moment of reaching optimal performance matters, PBM is the best choice of method. However, if periods of poor performance should be avoided [131], or if some queries may not receive large numbers of clicks [130], GENSPEC is the better choice. An additional advantage is that GENSPEC is a counterfactual method and does not have to be applied online like the bandit baselines.
Besides the bandit baselines discussed in Section 6.5.4, feature-based methods foronline LTR also exist [82, 111, 126, 132]. A direct experimental comparison withthese methods is beyond the scope of this chapter. However, previous work has alreadycompared these methods with each other [84] and the state-of-the-art method withcounterfactual LTR [50]. Based on the latter work by Jagerman et al. [50] we do notexpect considerable differences between these online LTR methods and counterfactualLTR in our settings. Therefore, we expect that a comparison would lead to similarresults as discussed in Section 6.5.3.
6.6 GENSPEC for Contextual Bandits

So far we have discussed GENSPEC for counterfactual LTR. We will now show that it is also applicable to the broader contextual bandit problem. Instead of a query $q$, we now keep track of an arbitrary context $z \in \{1, 2, \ldots\}$, where
\[
x_i, z_i \sim P(x, z). \tag{6.31}
\]
Data is gathered using a logging policy $\pi_0$:
\[
a_i \sim \pi_0(a \mid x_i, z_i). \tag{6.32}
\]
However, unlike the LTR case, the rewards $r_i$ are observed directly:
\[
r_i \sim P(r \mid a_i, x_i, z_i). \tag{6.33}
\]
With the propensities
\[
\rho_i = \pi_0(a_i \mid x_i, z_i), \tag{6.34}
\]
the data is:
\[
\mathcal{D} = \big\{ (r_i, a_i, \rho_i, x_i, z_i) \big\}_{i=1}^{N}; \tag{6.35}
\]
for specialization the data is filtered per context $z$:
\[
\mathcal{D}_z = \big\{ (r_i, a_i, \rho_i, x_i, z_i) \in \mathcal{D} \mid z_i = z \big\}. \tag{6.36}
\]
Again, data for training $\mathcal{D}^{\text{train}}$ and for policy selection $\mathcal{D}^{\text{sel}}$ are separated. The reward is estimated with an IPS estimator:
\[
\hat{R}(\pi \mid \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \frac{r_i}{\rho_i} \pi(a_i \mid x_i, z_i). \tag{6.37}
\]
With the policy spaces $\Pi_g$ and $\Pi_z$, the policy for generalization is:
\[
\pi_g = \arg\max_{\pi \in \Pi_g} \hat{R}(\pi \mid \mathcal{D}^{\text{train}}); \tag{6.38}
\]
per context $z$, the specialization policy is:
\[
\pi_z = \arg\max_{\pi \in \Pi_z} \hat{R}(\pi \mid \mathcal{D}^{\text{train}}_z). \tag{6.39}
\]
The difference between two policies is estimated by:
\[
\hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) = \hat{R}(\pi_1 \mid \mathcal{D}) - \hat{R}(\pi_2 \mid \mathcal{D}). \tag{6.40}
\]
We differ from the LTR approach by estimating the bounds using:
\[
R_i = \frac{r_i}{\rho_i} \big( \pi_1(a_i \mid x_i, z_i) - \pi_2(a_i \mid x_i, z_i) \big). \tag{6.41}
\]
Following Thomas et al. [119], the confidence bound is:
\[
CB(\pi_1, \pi_2 \mid \mathcal{D}) = \frac{7 b \ln\frac{2}{1 - \epsilon}}{3(|\mathcal{D}| - 1)} + \frac{1}{|\mathcal{D}|} \sqrt{ \frac{2 |\mathcal{D}| \ln\frac{2}{1 - \epsilon}}{|\mathcal{D}| - 1} \sum_{i \in \mathcal{D}} \big( R_i - \hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) \big)^2 }, \tag{6.42}
\]
where $b$ is the maximum possible value of $R_i$. This results in the lower bound:
\[
LCB(\pi_1, \pi_2 \mid \mathcal{D}) = \hat{\delta}(\pi_1, \pi_2 \mid \mathcal{D}) - CB(\pi_1, \pi_2 \mid \mathcal{D}), \tag{6.43}
\]
which is used by the GENSPEC meta-policy:
\[
\pi_{GS}(a \mid x, z) =
\begin{cases}
\pi_z(a \mid x, z), & \text{if } LCB(\pi_z, \pi_g \mid \mathcal{D}^{\text{sel}}_z) > 0 \,\wedge\, LCB(\pi_z, \pi_0 \mid \mathcal{D}^{\text{sel}}_z) > 0, \\
\pi_g(a \mid x, z), & \text{if } LCB(\pi_z, \pi_g \mid \mathcal{D}^{\text{sel}}_z) \leq 0 \,\wedge\, LCB(\pi_g, \pi_0 \mid \mathcal{D}^{\text{sel}}) > 0, \\
\pi_0(a \mid x, z), & \text{otherwise}.
\end{cases}
\tag{6.44}
\]
As such, GENSPEC can be applied to the contextual bandit problem for any arbitrary choice of context $z$.
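A minimal sketch of the generic estimators of this section, assuming each record follows the layout of Eq. 6.35 and that pi(a, x, z) returns a policy's probability of action a; these names are illustrative:

```python
def ips_reward(pi, data):
    # The IPS reward estimate of Eq. 6.37 over records (r_i, a_i, rho_i, x_i, z_i).
    return sum(r / rho * pi(a, x, z) for r, a, rho, x, z in data) / len(data)

def relative_terms(pi_1, pi_2, data):
    # The terms R_i of Eq. 6.41; these feed the same empirical-Bernstein bound
    # as in the LTR case (Eq. 6.42), with one term per interaction.
    return [r / rho * (pi_1(a, x, z) - pi_2(a, x, z)) for r, a, rho, x, z in data]
```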
6.7 Conclusion

In this chapter we have introduced the Generalization and Specialization (GENSPEC) framework for contextual bandit problems. For an arbitrary choice of contexts, it simultaneously learns a general policy that performs well across all contexts and many specialized policies, each optimized for a single context. Then, per context, the GENSPEC meta-policy uses high-confidence bounds to choose between deploying the logging policy, the general policy, or a specialized policy. As a result, GENSPEC combines the robust safety of a general policy with the high performance of a successfully specialized policy.

We have shown how GENSPEC can be applied to query-specialization for counterfactual LTR. Our results show that GENSPEC combines the high performance of specialized policies on queries with sufficiently many interactions with robust performance on queries that were previously unseen or where little data is available. Thus, it avoids the limited performance at convergence of the feature-based models underlying the general policy, and the initial poor performance of the tabular models underlying the specialized policies. We expect that GENSPEC can also be used for other types of specialization by choosing different context divisions; e.g., personalization for LTR is a promising choice.

With these findings we can answer thesis research question
RQ6 positively: using GENSPEC, we can combine the specialization ability of bandit-style online LTR with the robust generalization of feature-based LTR. As a result, the choice between specialization and generalization can now be made in a principled, theoretically grounded manner. For the LTR field this means that bandit-style LTR and feature-based LTR can now be seen as complementary, instead of as a mutually exclusive choice.

Future work could explore other contextual bandit problems and choices of context. Additionally, we hope that the robust safety of GENSPEC further incites the application of bandit algorithms in practice. While this chapter considered GENSPEC for counterfactual LTR, Chapter 8 introduces a novel method that is effective at both counterfactual LTR and online LTR. With only small adaptations, the contributions of both chapters could be combined, potentially resulting in GENSPEC for both online and counterfactual LTR. Future work could investigate the effectiveness of this possible combined approach.
6.A Proof of Unbiasedness

This appendix proves that the IPS estimate $\hat{R}$ (Eq. 6.11) can be used to unbiasedly optimize the true reward $R$ (Eq. 6.1), as claimed in Section 6.2.2. For this proof we rely on the following assumptions: (i) LTR metrics are linear combinations of item relevances (Eq. 6.2); (ii) clicks never occur on unobserved items (Eq. 6.7); and (iii) click probabilities (conditioned on observance) are proportional to relevance (Eq. 6.8).

First, we consider the expected value of an observed click $c_i(d)$ using Eq. 6.7; for brevity we write $r(d) = r(d \mid x_i, q_i)$:
\[
\begin{aligned}
\mathbb{E}_{o_i, a_i}\big[c_i(d)\big]
&= \mathbb{E}_{a_i}\Big[ P\big(c_i(d) = 1 \mid o_i(d) = 1, r(d)\big) \cdot P\big(o_i(d) = 1 \mid a_i\big) \Big] \\
&= P\big(c_i(d) = 1 \mid o_i(d) = 1, r(d)\big) \cdot \Big( \sum_{a} P\big(o_i(d) = 1 \mid a\big) \cdot \pi_0(a \mid x_i, q_i) \Big) \\
&= \rho_i(d) \cdot P\big(c_i(d) = 1 \mid o_i(d) = 1, r(d)\big).
\end{aligned}
\tag{6.45}
\]
Then, consider the expected value of the IPS estimator, and note that $a_i$ is a historically observed action and that $a$ is the action being evaluated:
\[
\begin{aligned}
\mathbb{E}_{o_i, a_i}\big[\hat{\Delta}(a \mid c_i, \rho_i)\big]
&= \mathbb{E}_{o_i, a_i}\Big[ \sum_{d \in a} \lambda\big(\text{rank}(d \mid a)\big) \cdot \frac{c_i(d)}{\rho_i(d)} \Big] \\
&= \sum_{d \in a} \frac{\rho_i(d)}{\rho_i(d)} \cdot \lambda\big(\text{rank}(d \mid a)\big) \cdot P\big(c_i(d) = 1 \mid o_i(d) = 1, r(d)\big) \\
&= \sum_{d \in a} \lambda\big(\text{rank}(d \mid a)\big) \cdot P\big(c_i(d) = 1 \mid o_i(d) = 1, r(d)\big).
\end{aligned}
\tag{6.46}
\]
This step assumes that $\rho_i(d) > 0$, i.e., that every item has a non-zero probability of being examined [58]. While $\mathbb{E}_{o_i, a_i}[\hat{\Delta}(a \mid c_i, \rho_i)]$ and $\Delta(a \mid x_i, q_i, r)$ are not necessarily equal, using Eq. 6.8 we see that they are proportional with some offset $C$:
\[
\mathbb{E}_{o_i, a_i}\big[\hat{\Delta}(a \mid c_i, \rho_i)\big] \propto \Big( \sum_{d \in a} \lambda\big(\text{rank}(d \mid a)\big) \cdot r(d) \Big) + C = \Delta(a \mid x_i, q_i, r) + C, \tag{6.47}
\]
where $C$ is a constant: $C = \big( \sum_{i=1}^{K} \lambda(i) \big) \cdot \mu$. Therefore, in expectation, $\hat{R}$ and $R$ are also proportional with the same constant offset:
\[
\mathbb{E}_{o_i, a_i}\big[\hat{R}(\pi \mid \mathcal{D})\big] \propto R(\pi) + C. \tag{6.48}
\]
Consequently, the estimator can be used to unbiasedly estimate the preference between two policies:
\[
\mathbb{E}_{o_i, a_i}\big[\hat{R}(\pi_1 \mid \mathcal{D})\big] < \mathbb{E}_{o_i, a_i}\big[\hat{R}(\pi_2 \mid \mathcal{D})\big] \;\Leftrightarrow\; R(\pi_1) < R(\pi_2). \tag{6.49}
\]
Moreover, this implies that maximizing the estimated performance unbiasedly optimizes the actual reward:
\[
\arg\max_{\pi} \mathbb{E}_{o_i, a_i}\big[\hat{R}(\pi \mid \mathcal{D})\big] = \arg\max_{\pi} R(\pi). \tag{6.50}
\]
This concludes our proof. We have shown that $\hat{R}$ is suitable for counterfactual evaluation, since it can unbiasedly identify whether a policy outperforms another (Eq. 6.49), and, furthermore, that $\hat{R}$ can be used for unbiased LTR, i.e., it can be used to find the optimal policy (Eq. 6.50).
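The expectation steps above can be checked numerically. Below is a hypothetical Monte-Carlo verification of Eqs. 6.45-6.46 with toy values (none of these numbers come from the thesis): the IPS estimate, averaged over simulated interactions, should match the closed-form expectation up to sampling noise.

```python
import math
import random

random.seed(0)
relevance = {0: 3, 1: 1, 2: 0, 3: 2}                 # toy relevance grades r(d)
observe = {0: 1.0, 1: 0.5, 2: 1 / 3, 3: 0.25}        # P(o(d)=1 | a) under the logged ranking
mu = 0.1                                              # click offset of Eq. 6.8
p_click = {d: 0.2 * (r + mu) for d, r in relevance.items()}  # P(c=1 | o=1, r) ~ r + mu
rho = dict(observe)                                   # deterministic logging policy: rho(d) = P(o(d)=1 | a)

def dcg_weight(rank):
    return 1.0 / math.log2(rank + 1)

evaluated = [3, 0, 1, 2]                              # the ranking a being evaluated
n_samples = 100000
total = 0.0
for _ in range(n_samples):
    for pos, d in enumerate(evaluated, start=1):
        if random.random() < observe[d] and random.random() < p_click[d]:
            total += dcg_weight(pos) / rho[d]         # the IPS estimator of Eq. 6.12

mc_estimate = total / n_samples
expectation = sum(dcg_weight(p) * p_click[d] for p, d in enumerate(evaluated, start=1))
print(mc_estimate, expectation)                       # should agree up to Monte-Carlo noise
```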
6.B Efficiency of Relative Bounding by GENSPEC

Our experimental results showed that GENSPEC chooses between policies more efficiently than when using the SEA bounds [51]. In other words, when one policy has higher performance than another, the relative bounds of GENSPEC require less data to be certain about this difference than the SEA bounds. In this section, we prove that the relative bounds of GENSPEC are more efficient than the SEA bounds when the covariance between the reward estimates of the two policies is positive:
\[
\text{cov}\big( \hat{R}(\pi_1 \mid \mathcal{D}), \hat{R}(\pi_2 \mid \mathcal{D}) \big) > 0. \tag{6.51}
\]
This means that GENSPEC will deploy a policy earlier than SEA if there is high covariance; since both estimates are based on the same interaction data $\mathcal{D}$, a high covariance is very likely.

Let us first consider when GENSPEC deploys a policy. Deployment by GENSPEC depends on whether a relative confidence bound is smaller than the estimated difference in performance (cf. Eq. 6.24). For two policies $\pi_1$ and $\pi_2$, deployment happens when:
\[
\hat{R}(\pi_1 \mid \mathcal{D}) - \hat{R}(\pi_2 \mid \mathcal{D}) - CB(\pi_1, \pi_2 \mid \mathcal{D}) > 0. \tag{6.52}
\]
Thus the bound has to be smaller than the estimated performance difference:
\[
CB(\pi_1, \pi_2 \mid \mathcal{D}) < \hat{R}(\pi_1 \mid \mathcal{D}) - \hat{R}(\pi_2 \mid \mathcal{D}). \tag{6.53}
\]
In contrast, SEA does not use a single bound, but bounds the performance of both policies. For clarity, we reformulate the SEA bound in our notation. First, we have $R^{\pi_j}_{i,d}$, the observed reward for an item $d$ at interaction $i$ for policy $\pi_j$:
\[
R^{\pi_j}_{i,d} = \frac{c_i(d)}{\rho_i(d)} \sum_{a \in \pi_j} \pi_j(a \mid x_i, q_i) \cdot \lambda\big(\text{rank}(d \mid a)\big). \tag{6.54}
\]
Then we have a $\nu_{\pi_j}$ for each policy:
\[
\nu_{\pi_j} = \frac{2 |\mathcal{D}| K \ln\frac{2}{1 - \epsilon}}{|\mathcal{D}| K - 1} \sum_{(i,d) \in \mathcal{D}} \big( K \cdot R^{\pi_j}_{i,d} - \hat{R}(\pi_j \mid \mathcal{D}) \big)^2,
\]
which we use to write the confidence bound for a single policy $\pi_j$:
\[
CB(\pi_j \mid \mathcal{D}) = \frac{7 K b \ln\frac{2}{1 - \epsilon}}{3(|\mathcal{D}| K - 1)} + \frac{1}{|\mathcal{D}| K} \cdot \sqrt{\nu_{\pi_j}}. \tag{6.55}
\]
We note that the $b$ parameter has the same value for both the relative and single confidence bounds. SEA chooses between policies by comparing their upper and lower confidence bounds:
\[
\hat{R}(\pi_1 \mid \mathcal{D}) - CB(\pi_1 \mid \mathcal{D}) > \hat{R}(\pi_2 \mid \mathcal{D}) + CB(\pi_2 \mid \mathcal{D}). \tag{6.56}
\]
In this case, the sum of the bounds has to be smaller than the estimated performance difference:
\[
CB(\pi_1 \mid \mathcal{D}) + CB(\pi_2 \mid \mathcal{D}) < \hat{R}(\pi_1 \mid \mathcal{D}) - \hat{R}(\pi_2 \mid \mathcal{D}). \tag{6.57}
\]
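To make the contrast concrete before the formal comparison, here is an illustrative sketch of the two deployment tests. It uses simplified bound widths that keep only the variance term (dropping the $7Kb/(3(n-1))$ term of Eq. 6.55); both the names and this simplification are assumptions for illustration only.

```python
import math
import statistics

def bound_width(samples, epsilon):
    # Simplified width: sqrt(2 * sample-variance * ln(2 / (1 - epsilon)) / n).
    n = len(samples)
    return math.sqrt(2.0 * statistics.variance(samples) * math.log(2.0 / (1.0 - epsilon)) / n)

def genspec_deploys(rewards_1, rewards_2, epsilon):
    # Relative test of Eq. 6.53: a single bound on the per-interaction differences.
    diffs = [a - b for a, b in zip(rewards_1, rewards_2)]
    return statistics.mean(diffs) > bound_width(diffs, epsilon)

def sea_deploys(rewards_1, rewards_2, epsilon):
    # SEA test of Eq. 6.57: two separate bounds, one per policy.
    return (statistics.mean(rewards_1) - bound_width(rewards_1, epsilon)
            > statistics.mean(rewards_2) + bound_width(rewards_2, epsilon))
```

When rewards_1 and rewards_2 are positively correlated, their differences have lower variance than the individual reward lists, so genspec_deploys fires with less data; this is exactly the condition derived formally below.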
1) + 1 |D| K · √ ν π j . (6.55)We note that the b parameter has the same value for both the relative and single con-fidence bounds. SEA chooses between policies by comparing their upper and lowerconfidence bounds: ˆ R ( π | D ) − CB ( π | D ) > ˆ R ( π | D ) + CB ( π | D ) . (6.56)In this case, the summation of the bounds has to be smaller than the estimated perfor-mance difference: CB ( π | D ) + CB ( π | D ) < ˆ R ( π | D ) − ˆ R ( π | D ) . (6.57)We can now formally describe under which condition GENSPEC is more efficientthan SEA: by combining Eq. 6.53 and Eq. 6.57, we see that relative bounding is moreefficient when: CB ( π , π | D ) < CB ( π | D ) + CB ( π | D ) . (6.58)We notice that D , K , b and (cid:15) have the same value for both confidence bounds, thus weonly require: √ ν < √ ν π + √ ν π . (6.59)If we assume that D is sufficiently large, we see that √ ν approximates the standarddeviation scaled by some constant: √ ν ≈ C · (cid:113) var (cid:0) ˆ δ ( π , π |D ) (cid:1) , (6.60)where the constant is: C = (cid:114) |D| K ln (cid:0) − (cid:15) (cid:1) |D| K − . Since the purpose of the bounds is toprevent deployment until enough certainty has been gained, we think it is safe to assumethat D is large enough for this approximation before any deployment takes place.To keep our notation concise, we use the following: ˆ δ = ˆ δ ( π , π |D ) , ˆ R =ˆ R ( π |D ) , and ˆ R = ˆ R ( π |D ) . Using the same approximations for √ ν π and √ ν π we get: (cid:113) var (ˆ δ ) < (cid:113) var ( ˆ R ) + (cid:113) var ( ˆ R ) . (6.61)By making use of the Cauchy-Schwarz inequality, we can derive the following lowerbound: (cid:113) var ( ˆ R ) + var ( ˆ R ) ≤ (cid:113) var ( ˆ R ) + (cid:113) var ( ˆ R ) . (6.62)122 .C. Notation Reference for Chapter 6 Therefore, the relative bounding of GENSPEC must be more efficient when the follow-ing is true: var (ˆ δ ) < var ( ˆ R ) + var ( ˆ R ) , (6.63)i.e. the variance of the relative estimator must be less than the sum of the variances ofthe estimators for the individual policies. Finally, by rewriting var (ˆ δ ) to:var (ˆ δ ) = var ( ˆ R − ˆ R ) = var ( ˆ R ) + var ( ˆ R ) − cov ( ˆ R , ˆ R ) , (6.64)we see that the relative bounds of GENSPEC are more efficient than the multiple boundsof SEA if the covariance between ˆ R and ˆ R is positive:cov ( ˆ R , ˆ R ) > . (6.65)Remember that both estimates are based on the same interaction data: ˆ R = ˆ R ( π |D ) ,and ˆ R = ˆ R ( π |D ) . Therefore, they are based on the same clicks and propensitiesscores, thus it is extremely likely that the covariance between the estimates is positive.Correspondingly, it is also extremely likely that the relative bounds of GENSPEC aremore efficient than the bounds used by SEA. Notation Description K the number of items that can be displayed in a single ranking i an iteration number q a user-issued query x contextual information, i.e., additional features d an item to be ranked a a ranked list π a ranking policy π ( a | q ) the probability that policy π displays ranking a for query qr ( d | x, q ) the relevance of item d w.r.t. query q given context xλ (cid:0) rank ( d | a ) (cid:1) a metric function that weights items depending on their rank D the available interaction data c i ( d ) a function indicating item d was clicked at iteration io i ( d ) a function indicating item d was observed at iteration i Taking the Counterfactual Online:
Efficient and Unbiased Online Evaluation for Ranking
Counterfactual evaluation can estimate Click-Through-Rate (CTR) differences between ranking systems based on historical interaction data, while mitigating the effect of position bias and item-selection bias. In contrast, online evaluation methods, designed for ranking, estimate performance differences between ranking systems by showing interleaved rankings to users and observing their clicks. We are curious to find out whether the online interventions of online evaluation methods truly result in more efficient evaluation, and additionally, whether the popular interleaving methods are truly unbiased w.r.t. biases such as position bias. Accordingly, this chapter considers the following two thesis research questions:
RQ7
Can counterfactual evaluation methods for ranking be extended to perform efficient and effective online evaluation?
RQ8
Are existing interleaving methods truly capable of unbiased evaluation w.r.t. position bias?

We introduce the novel Logging-Policy Optimization Algorithm (LogOpt), which optimizes the policy for logging data so that the counterfactual estimate has minimal variance. As minimizing variance leads to faster convergence, LogOpt increases the data-efficiency of counterfactual estimation. LogOpt turns the counterfactual approach – which is indifferent to the logging policy – into an online approach, where the algorithm decides what rankings to display. We prove that, as an online evaluation method, LogOpt is unbiased w.r.t. position and item-selection bias, unlike existing interleaving methods. Furthermore, we perform large-scale experiments by simulating comparisons between thousands of rankers. Our results show that while interleaving methods make systematic errors, LogOpt is as efficient as interleaving without being biased. Lastly, we provide a formal proof that shows interleaving methods are not unbiased w.r.t. position bias.
This chapter was published as [85]. Appendix 7.C gives a reference for the notation used in this chapter.

7.1 Introduction
Evaluation is essential for the development of search and recommendation systems [45, 64]. Before any ranking model is widely deployed, it is important to first verify whether it is a true improvement over the currently-deployed model. A traditional way of evaluating relative differences between systems is through A/B testing, where part of the user population is exposed to the current system ("control") and the rest to the altered system ("treatment") during the same time period. Differences in behavior between these groups can then indicate if the alterations brought improvements, e.g., if the treatment group showed a higher CTR or more revenue was made with this system [18]. Interleaving has been introduced in Information Retrieval (IR) as a more efficient alternative to A/B testing [56]. Interleaving algorithms take the rankings produced by two ranking systems, and for each query create an interleaved ranking by combining the rankings from both systems. Clicks on the interleaved rankings directly indicate relative differences. Repeating this process over a large number of queries and averaging the results leads to an estimate of which ranker would receive the highest CTR [44]. Previous studies have found that interleaving requires fewer interactions than A/B testing, which enables consistent comparisons in a much shorter timespan [18, 110]. More recently, counterfactual evaluation for rankings has been proposed by Joachims et al. [58] to evaluate a ranking model based on clicks gathered using a different model. By correcting for the position bias introduced during logging, the counterfactual approach can unbiasedly estimate the CTR of a new model on historical data. To achieve this, counterfactual evaluation makes use of Inverse Propensity Scoring (IPS), where clicks are weighted inversely to the probability that a user examined them during logging [127]. A big advantage compared to interleaving and A/B testing is that counterfactual evaluation does not require online interventions.

In this chapter, we show that no existing interleaving method is truly unbiased: they are not guaranteed to correctly predict which ranker has the highest CTR. On two different industry datasets, we simulate a total of 1,000 comparisons between 2,000 different rankers. In our setup, interleaving methods converge on the wrong answer for at least 2.2% of the comparisons on both datasets. A further analysis shows that existing interleaving methods are unable to reliably estimate CTR differences of around 1% or lower. Therefore, in practice these systematic errors are expected to impact situations where rankers with a very similar CTR are compared.

We propose a novel online evaluation algorithm: the Logging-Policy Optimization Algorithm (LogOpt). LogOpt extends the existing unbiased counterfactual approach, and turns it into an online approach. LogOpt estimates which rankings should be shown to the user, so that the variance of its CTR estimate is minimized. In other words, it attempts to learn the logging policy that leads to the fastest possible convergence of the counterfactual estimation. Our experimental results indicate that our novel approach is as efficient as any interleaving method or A/B testing, without having a systematic error. As predicted by the theory, we see that the estimates of our approach converge on the true CTR difference between rankers. Therefore, we have introduced the first online evaluation method that combines high efficiency with unbiased estimation.

The main contributions of this chapter are:
1. The first logging-policy optimization method for minimizing the variance in counterfactual CTR estimation.
2. The first unbiased online evaluation method that is as efficient as state-of-the-art interleaving methods.
3. A large-scale analysis of existing online evaluation methods that reveals a previously unreported bias in interleaving methods.

7.2 Preliminaries: Ranker Comparisons
The overarching goal of ranker evaluation is to find the ranking model that provides the best rankings. For the purposes of this chapter, we define the quality of a ranker in terms of the number of clicks it is expected to receive. Let R indicate a ranking and let E[CTR(R)] ∈ ℝ_{≥0} be the expected number of clicks a ranking receives after being displayed to a user. We consider ranking R_1 to be better than R_2 if in expectation it receives more clicks: E[CTR(R_1)] > E[CTR(R_2)]. We represent a ranking model by a policy π, with π(R | q) as the probability that π displays R for a query q. With P(q) as the probability of a query q being issued, the expected number of clicks received under a ranking model π is:

  E[CTR(π)] = Σ_q P(q) Σ_R E[CTR(R)] π(R | q).  (7.1)

Our goal is to discover the E[CTR] difference between two policies:

  ∆(π_1, π_2) = E[CTR(π_1)] − E[CTR(π_2)].  (7.2)

We recognize that to correctly identify whether one policy is better than another, we merely need the corresponding binary indicator:

  ∆_bin(π_1, π_2) = sign(∆(π_1, π_2)).  (7.3)

However, in practice the magnitude of the difference can be very important: for instance, if one policy is computationally much more expensive while only having a slightly higher E[CTR], it may be preferable to use the other in production. Therefore, estimating the absolute E[CTR] difference is more desirable in practice.

Any proof regarding estimators that use user interactions must rely on assumptions about user behavior. In this chapter, we assume that only two forms of interaction bias are at play: position bias and item-selection bias. Users generally do not examine all items that are displayed in a ranking, and only click on examined items [20]. As a result, a lower probability of examination for an item also makes it less likely to be clicked. Position bias assumes that only the rank determines the probability of examination [25]. Furthermore, we assume that, given an examination, only the relevance of an item determines the click probability. Let c(d) ∈ {0, 1} indicate a click on item d and o(d) ∈ {0, 1} examination by the user. These assumptions result in the following assumed click probability:

  P(c(d) = 1 | R, q) = P(o(d) = 1 | R) P(c(d) = 1 | o(d) = 1, q) = θ_{rank(d|R)} ζ_{d,q}.  (7.4)

Here rank(d | R) indicates the rank of d in R; for brevity we use θ_{rank(d|R)} to denote the examination probability:

  θ_{rank(d|R)} = P(o(d) = 1 | R),  (7.5)

and ζ_{d,q} for the conditional click probability:

  ζ_{d,q} = P(c(d) = 1 | o(d) = 1, q).  (7.6)

We also assume that item-selection bias is present; this type of bias is an extreme form of position bias that results in zero examination probabilities for some items [86, 92]. This bias is unavoidable in top-k ranking settings, where only the k ∈ ℕ_{>0} highest ranked items are displayed. Consequently, any item beyond rank k cannot be observed or examined by the user: ∀r ∈ ℕ_{>0} (r > k → θ_r = 0). The distinction between item-selection bias and position bias is important because the original counterfactual evaluation method [58] is only able to correct for position bias when no item-selection bias is present [86, 92]. Based on these assumptions, we can now formulate the expected CTR of a ranking:

  E[CTR(R)] = Σ_{d∈R} P(c(d) = 1 | R, q) = Σ_{d∈R} θ_{rank(d|R)} ζ_{d,q}.  (7.7)
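To illustrate the click model of Eqs. 7.4–7.7, the following sketch computes the expected CTR of a ranking and samples clicks from it; the θ and ζ values are hypothetical:

```python
import random

# Hypothetical bias and relevance parameters for a top-3 ranking.
theta = [1.0, 0.5, 0.25]                   # examination probability per rank
zeta = {"d1": 0.9, "d2": 0.1, "d3": 0.5}   # P(click | examined) per item

def expected_ctr(ranking, theta, zeta):
    # Eq. 7.7: sum over displayed items of theta_rank * zeta_d.
    return sum(t * zeta[d] for d, t in zip(ranking, theta))

def sample_clicks(ranking, theta, zeta, rng=random):
    # Eq. 7.4: an item is clicked iff it is examined and then clicked.
    return {d: int(rng.random() < t * zeta[d]) for d, t in zip(ranking, theta)}

ranking = ["d1", "d2", "d3"]
print(expected_ctr(ranking, theta, zeta))  # 0.9*1.0 + 0.1*0.5 + 0.5*0.25
print(sample_clicks(ranking, theta, zeta))
```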
While we assume this model of user behavior, its parameters are still assumed unknown. Therefore, the methods in this chapter have to estimate E[CTR] without prior knowledge of θ or ζ.

Recall that our goal is to estimate the CTR difference between rankers (Eq. 7.2); online evaluation methods do this based on user interactions. Let I be the set of available user interactions; it contains N tuples of a single (issued) query q_i, the corresponding displayed ranking R_i, and the observed user clicks c_i:

  I = {(q_i, R_i, c_i)}_{i=1}^{N}.  (7.8)

Each evaluation method has a different effect on which rankings will be displayed to users. Furthermore, each evaluation method converts each interaction into a single estimate using some function f:

  x_i = f(q_i, R_i, c_i).  (7.9)

The final estimate is simply the mean over these estimates:

  ∆̂(I) = (1/N) Σ_{i=1}^{N} x_i = (1/N) Σ_{i=1}^{N} f(q_i, R_i, c_i).  (7.10)

This description fits all existing online and counterfactual evaluation methods for rankings. Every evaluation method uses a different function f to convert interactions into estimates; moreover, online evaluation methods also decide which rankings R to display when collecting I. These two choices result in different estimators. Before we discuss the individual methods, we briefly introduce the three properties we desire of each estimator: consistency, unbiasedness, and variance.
• Consistency – an estimator is consistent if it converges as the number of issued queries N increases. All existing evaluation methods are consistent, as their final estimates are means of bounded values.
• Unbiasedness – an estimator is unbiased if its estimate is equal to the true CTR difference in expectation:

  Unbiased(∆̂) ⇔ E[∆̂(I)] = ∆(π_1, π_2).  (7.11)

If an estimator is both consistent and unbiased, it is guaranteed to converge on the true E[CTR] difference.
• Variance – the variance of an estimator is the expected squared deviation between a single estimate x and the mean ∆̂(I):

  Var(∆̂) = E[(x − E[∆̂(I)])²].  (7.12)

Variance affects the rate of convergence of an estimator; for fast convergence it should be as low as possible.
In summary, our goal is to find an estimator, for the CTR difference between two ranking models, that is consistent, unbiased, and has minimal variance.
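The estimator template of Eqs. 7.9–7.12 translates directly into code; a minimal sketch (where the per-interaction function f is a hypothetical argument) could look as follows:

```python
from statistics import fmean

def estimate(interactions, f):
    """Eq. 7.10: the final estimate is the mean of the per-interaction
    estimates x_i = f(q_i, R_i, c_i)."""
    return fmean(f(q, R, c) for (q, R, c) in interactions)

def empirical_variance(interactions, f):
    """Eq. 7.12: expected squared deviation of a single estimate x_i
    from the mean estimate."""
    xs = [f(q, R, c) for (q, R, c) in interactions]
    m = fmean(xs)
    return fmean((x - m) ** 2 for x in xs)
```

Each concrete evaluation method in the next section amounts to a different choice of f (and, for online methods, a different choice of which rankings to display).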
7.3 Existing Online and Counterfactual Evaluation Methods

We describe three families of online and counterfactual evaluation methods for ranking.

7.3.1 A/B Testing

A/B testing is a well-established form of online evaluation to compare a system A with a system B [64]. Users are randomly split into two groups, and during the same time period each group is exposed to only one of the systems. In expectation, the only factor that differs between the groups is the exposure to the different systems. Therefore, by comparing the behavior of each user group, the relative effect each system has can be evaluated.

We will briefly show that A/B testing is unbiased for E[CTR] difference estimation. For each interaction, either π_1 or π_2 determines the ranking; let A_i ∈ {1, 2} indicate the assignment, with A_i ∼ P(A). Thus, if A_i = 1 then R_i ∼ π_1(R | q), and if A_i = 2 then R_i ∼ π_2(R | q). Each interaction i is converted into a single estimate x_i by f_{A/B}:

  x_i = f_{A/B}(q_i, R_i, c_i) = (1[A_i = 1]/P(A = 1) − 1[A_i = 2]/P(A = 2)) Σ_{d∈R_i} c_i(d).  (7.13)

We can prove that A/B testing is unbiased, since in expectation each individual estimate is equal to the CTR difference:

  E[f_{A/B}(q_i, R_i, c_i)]
    = Σ_q P(q) (P(A = 1) Σ_R π_1(R | q) E[CTR(R)] / P(A = 1) − P(A = 2) Σ_R π_2(R | q) E[CTR(R)] / P(A = 2))
    = Σ_q P(q) Σ_R E[CTR(R)] (π_1(R | q) − π_2(R | q))
    = E[CTR(π_1)] − E[CTR(π_2)] = ∆(π_1, π_2).  (7.14)

Variance is harder to evaluate without knowledge of π_1 and π_2. Unless ∆(π_1, π_2) = 0, some variance is unavoidable, since A/B testing alternates between estimating CTR(π_1) and CTR(π_2).
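A minimal sketch of the f_{A/B} estimate of Eq. 7.13, with a toy simulation whose CTR values are assumed purely for illustration:

```python
import random

def f_ab(assignment, clicks, p_a=0.5):
    """Eq. 7.13: inverse-probability-weighted click total for one interaction.
    assignment: 1 if pi_1 produced the ranking, 2 if pi_2 did;
    clicks: list of 0/1 click indicators on the displayed ranking."""
    sign = 1.0 / p_a if assignment == 1 else -1.0 / (1.0 - p_a)
    return sign * sum(clicks)

# Hypothetical usage: each interaction assigns a user to one system at random.
rng = random.Random(0)
interactions = []
for _ in range(10000):
    a = 1 if rng.random() < 0.5 else 2
    ctr = 0.35 if a == 1 else 0.30        # assumed true expected clicks
    clicks = [int(rng.random() < ctr)]    # toy one-item "ranking"
    interactions.append((a, clicks))

est = sum(f_ab(a, c) for a, c in interactions) / len(interactions)
print(est)  # should be near 0.35 - 0.30 = 0.05 in expectation
```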
7.3.2 Interleaving

Interleaving methods were introduced specifically for evaluation in ranking, as a more efficient alternative to A/B testing [56]. After a query is issued, interleaving methods take the rankings of two competing ranking systems and combine them into a single interleaved ranking. Any clicks on the interleaved ranking can be interpreted as a preference signal between the two ranking systems. Thus, unlike A/B testing, interleaving does not estimate the CTR of individual systems but a relative preference; the idea is that this allows it to be more efficient than A/B testing.

Each interleaving method attempts to use randomization to counter position bias, without deviating too much from the original rankings, so as to maintain the user experience [56]. Team-draft interleaving (TDI) randomly selects one ranker to place its top document first; then the other ranker places its top (unplaced) document next [99]. It then randomly decides the next two documents, and this process is repeated until all documents are placed in the interleaved ranking. Clicks on the documents are attributed to the ranker that placed them. The ranker with the most attributed clicks is inferred to be preferred by the user.
Probabilistic interleaving (PI) treats each ranking as a probability distribution over documents; at each rank, a distribution is randomly selected and a document is drawn from it [41]. After clicks have been received, probabilistic interleaving computes the expected number of clicked documents per ranking system to infer preferences.
Optimized interleaving (OI) casts the randomization as an optimization problem, and displays rankings such that, if all documents are equally relevant, no preference is found [96].

While every interleaving method attempts to deal with position bias, none is unbiased according to our definition (Section 7.2.2). This may be confusing, because previous work on interleaving makes claims of unbiasedness [41, 44, 96]. However, that work uses different definitions of the term. More precisely, TDI, PI, and OI provably converge on the correct outcome if all documents are equally relevant [41, 44, 96, 99]. Moreover, if one assumes binary relevance and π_1 ranks all relevant documents equal to or higher than π_2, the binary outcome of PI and OI is proven to be correct in expectation [44, 96]. However, beyond the confines of these unambiguous cases, we can prove that these methods do not meet our definition of unbiasedness: for every method, one can construct an example where it converges on the incorrect outcome. The rankers π_1 and π_2 and the position bias parameters θ can be chosen so that in expectation the wrong (binary) outcome is estimated; see Appendix 7.A for a proof for each of the three interleaving methods. Thus, while more efficient than A/B testing, interleaving methods make systematic errors in certain circumstances and should therefore not be considered unbiased w.r.t. CTR differences.

We note that the magnitude of the bias should also be considered. If the systematic error of an interleaving method is minuscule while the efficiency gains are very high, it may still be very useful in practice. Our experimental results (Section 7.6.2) reveal that the systematic error of all three interleaving methods considered becomes very high when comparing systems with a CTR difference of around 1% or smaller.

7.3.3 Counterfactual Evaluation

Counterfactual evaluation is based on the idea that if certain biases can be estimated well, they can also be adjusted for [57, 127]. While estimating relevance is considered the core difficulty of ranking evaluation, estimating the position bias terms θ is very doable. By randomizing rankings, e.g., by swapping pairs of documents [57] or by exploiting data logged during A/B testing [4], differences in CTR for the same item on different positions can be observed directly. Alternatively, position bias can also be estimated from logged data, using Expectation Maximization (EM) optimization [128] or a dual learning objective [5]. Once the bias terms θ have been estimated, logged clicks can be weighted so as to correct for the position bias during logging. Hence, counterfactual evaluation can work with historically logged data. Existing counterfactual evaluation algorithms do not dictate which rankings should be displayed during logging: they do not perform interventions, and thus we do not consider them to be online methods.

Counterfactual evaluation assumes that the position bias θ and the logging policy π_0 are known, in order to correct for both position bias and item-selection bias. Clicks are gathered with π_0, which decides which rankings are displayed to the user. We follow Oosterhuis and de Rijke [86] (see Chapter 5) and use as propensity scores the probability of observance in expectation over the displayed rankings:

  ρ(d | q) = E_R[P(o(d) = 1 | R) | π_0] = Σ_R π_0(R | q) P(o(d) = 1 | R).  (7.15)
Then we use λ(d | π_1, π_2) to indicate the difference in observance probability under π_1 and π_2:

  λ(d | π_1, π_2) = E_R[P(o(d) = 1 | R) | π_1] − E_R[P(o(d) = 1 | R) | π_2] = Σ_R θ_{rank(d|R)} (π_1(R | q) − π_2(R | q)).  (7.16)

Then, the IPS estimate function is formulated as:

  x_i = f_IPS(q_i, R_i, c_i) = Σ_{d: ρ(d|q_i)>0} (c_i(d)/ρ(d | q_i)) λ(d | π_1, π_2).  (7.17)

Each click is weighted inversely to its examination probability, but items with a zero examination probability, ρ(d | q_i) = 0, are excluded. We note that these items can never be clicked:

  ∀q, d (ρ(d | q) = 0 → c(d) = 0).  (7.18)

Before we prove unbiasedness, we note that, given ρ(d | q) > 0:

  E[c(d)/ρ(d | q)] = Σ_R π_0(R | q) θ_{rank(d|R)} ζ_{d,q} / ρ(d | q)
    = (Σ_R π_0(R | q) θ_{rank(d|R)} / Σ_{R'} π_0(R' | q) θ_{rank(d|R')}) ζ_{d,q} = ζ_{d,q}.  (7.19)

This, in turn, can be used to prove unbiasedness:

  E[f_IPS(q_i, R_i, c_i)] = Σ_q P(q) Σ_{d: ρ(d|q)>0} ζ_{d,q} λ(d | π_1, π_2) = E[CTR(π_1)] − E[CTR(π_2)] = ∆(π_1, π_2).  (7.20)

This proof is only valid under the following requirement:

  ∀d, q (ζ_{d,q} λ(d | π_1, π_2) ≠ 0 → ρ(d | q) > 0).  (7.21)

In practice, this means that the items in the top-k of either π_1 or π_2 need to have a non-zero examination probability under π_0, i.e., they must have a chance to appear in the top-k under π_0.

Besides Requirement 7.21, the IPS counterfactual evaluation method [57, 127] is completely indifferent to π_0, and hence we do not consider it to be an online method. In the next section, we introduce an algorithm for choosing and updating π_0 during logging to minimize the variance of the estimator. By doing so, we turn counterfactual evaluation into an online method.
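The propensities of Eq. 7.15 and the IPS estimate of Eq. 7.17 translate directly into code. A sketch, assuming the logging policy is given as an explicit (hypothetical) list of (ranking, probability) pairs:

```python
def propensities(rankings_with_probs, theta):
    """Eq. 7.15: rho(d|q) under logging policy pi_0, given as an explicit
    list of (ranking, probability) pairs; theta[r] is the examination
    probability of rank r+1, zero beyond the displayed top-k."""
    rho = {}
    for ranking, prob in rankings_with_probs:
        for rank, d in enumerate(ranking[: len(theta)]):
            rho[d] = rho.get(d, 0.0) + prob * theta[rank]
    return rho

def f_ips(clicks, rho, lam):
    """Eq. 7.17: per-interaction IPS estimate of the CTR difference.
    clicks[d] in {0, 1}; lam[d] is the observance difference of Eq. 7.16."""
    return sum(
        clicks[d] / rho[d] * lam[d]
        for d in clicks
        if rho.get(d, 0.0) > 0.0  # items with rho(d|q) = 0 can never be clicked
    )

# Hypothetical example: pi_0 mixes two rankings over three items.
pi_0 = [(["d1", "d2", "d3"], 0.7), (["d2", "d1", "d3"], 0.3)]
theta = [1.0, 0.5, 0.25]
rho = propensities(pi_0, theta)
lam = {"d1": 0.4, "d2": -0.3, "d3": 0.0}  # assumed lambda values
print(f_ips({"d1": 1, "d2": 0, "d3": 0}, rho, lam))
```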
7.4 Logging Policy Optimization for Variance Minimization

Next, we introduce a method aimed at finding a logging policy that minimizes the variance of the estimates of the counterfactual estimator.

In Section 7.3.3, we discussed counterfactual evaluation and established that it is unbiased as long as θ is known and the logging policy meets Requirement 7.21. The variance of ∆̂_IPS depends on the position bias θ, the conditional click probabilities ζ, and the logging policy π. In contrast to the user-dependent θ and ζ, the way data is logged by π is something one can control. The goal of our method is to find the optimal policy that minimizes variance while still meeting Requirement 7.21:

  π* = argmin_{π: π meets Req. 7.21} Var(∆̂^π_IPS),  (7.22)

where ∆̂^π_IPS is the counterfactual estimator based on data logged using π.

To formulate the variance, we first note that it is an expectation over queries:

  Var(∆̂) = Σ_q P(q) Var(∆̂ | q).  (7.23)

To keep notation short, for the remainder of this section we write: ∆ = ∆(π_1, π_2); θ_{d,R} = θ_{rank(d|R)}; ζ_d = ζ_{d,q}; λ_d = λ(d | π_1, π_2); and ρ_d = ρ(d | q, π). Next, we consider the probability of a click pattern c; this is simply a vector indicating a possible combination of clicked documents (c(d) = 1) and non-clicked documents (c(d) = 0):

  P(c | q) = Σ_R π(R | q) Π_{d: c(d)=1} θ_{d,R} ζ_d Π_{d: c(d)=0} (1 − θ_{d,R} ζ_d) = Σ_R π(R | q) P(c | R).  (7.24)

Here, π has some control over this probability: by deciding the distribution of displayed rankings, it can make certain click patterns more or less frequent. The variance added per query is the squared error of every possible click pattern, weighted by the probability of each pattern. Let Σ_c sum over every possible click pattern:

  Var(∆̂^π_IPS | q) = Σ_c P(c | q) (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)².  (7.25)

It is unknown whether there is a closed-form solution for π*. However, the variance function is differentiable. Taking the derivative reveals a trade-off between two potentially conflicting goals:

  δ/δπ Var(∆̂^π_IPS | q) = Σ_c ([δP(c | q)/δπ] (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)² + P(c | q) δ/δπ (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)²),  (7.26)

where the first term minimizes the frequency of high-error click patterns and the second term minimizes the error of frequent click patterns. On the one hand, the derivative reduces the frequency of click patterns that result in high-error samples, i.e., by updating π so that these are less likely to occur. On the other hand, changing π also affects the propensities ρ_d, i.e., if π makes an item d less likely to be examined, its corresponding value λ_d/ρ_d becomes larger, which can lead to a higher error for related click patterns. The optimal policy has to balance: (i) avoiding showing rankings that lead to high-error click patterns; and (ii) avoiding minimizing propensity scores, which increases the errors of corresponding click patterns. Our method applies stochastic gradient descent to optimize the logging policy w.r.t. the variance. There are two main difficulties with this approach: (i) the parameters θ and ζ are unknown a priori; and (ii) the gradients include summations over all possible rankings and all possible click patterns, both of which are computationally infeasible. In the following sections, we detail how LogOpt solves both of these problems.

In order to compute the gradient in Eq. 7.26, the parameters θ and ζ have to be known. LogOpt is based on the assumption that accurate estimates of θ and ζ suffice to find a near-optimal logging policy. We note that the counterfactual estimator only requires θ to be known for unbiasedness (see Section 7.3.3). Our approach is as follows: at given intervals during evaluation, we use the available clicks to estimate θ and ζ. Then we use the estimated θ̂ to get the current estimate ∆̂_IPS(I, θ̂) (Eq. 7.17) and optimize w.r.t. the estimated variance (Eq. 7.25) based on θ̂, ζ̂, and ∆̂_IPS(I, θ̂).

For estimating θ and ζ we use the existing EM approach by Wang et al. [128], because it works well in situations where few interactions are available and does not require randomization. We note that previous work has found randomization-based approaches to be more accurate for estimating θ [4, 30, 128]. However, they require multiple interactions per query and specific types of randomization in their results; by choosing the EM approach we avoid these requirements.

Both the variance (Eq. 7.25) and its gradient (Eq. 7.26) include a sum over all possible click patterns. Moreover, they also include the probability of a specific pattern, P(c | q), which is based on a sum over all possible rankings (Eq. 7.24). Clearly, these equations are infeasible to compute under any realistic time constraints. To solve this issue, we introduce gradient estimation based on Monte-Carlo sampling. Our approach is similar to that of Ma et al. [78]; however, we estimate gradients of the variance instead of general performance.

First, we assume that policies place the documents in order of rank and that the probability of placing an individual document at rank x only depends on the previously placed documents. Let R_{1:x−1} indicate the (incomplete) ranking from rank 1 up to rank x−1; then π(d | R_{1:x−1}, q) indicates the probability that document d is placed at rank x, given that the ranking up to rank x−1 is R_{1:x−1}. The probability of a ranking R up to rank k is thus:

  π(R_{1:k} | q) = Π_{x=1}^{k} π(R_x | R_{1:x−1}, q).  (7.27)

Let K be the length of a complete ranking R; the gradient of the probability of a ranking w.r.t. the policy is:

  δπ(R | q)/δπ = Σ_{x=1}^{K} [π(R | q)/π(R_x | R_{1:x−1}, q)] [δπ(R_x | R_{1:x−1}, q)/δπ].  (7.28)

The gradient of the propensity w.r.t. the policy (cf. Eq. 7.15) is:

  δρ(d | q)/δπ = Σ_{k=1}^{K} θ_k Σ_R π(R_{1:k−1} | q) ([δπ(d | R_{1:k−1}, q)/δπ] + Σ_{x=1}^{k−1} [π(d | R_{1:k−1}, q)/π(R_x | R_{1:x−1}, q)] [δπ(R_x | R_{1:x−1}, q)/δπ]).  (7.29)

To avoid iterating over all rankings in the Σ_R sum, we sample M rankings: R^m ∼ π(R | q), and a click pattern on each ranking: c^m ∼ P(c | R^m). This enables us to make the following approximation:

  ρ-grad(d, q) = (1/M) Σ_{m=1}^{M} Σ_{k=1}^{K} θ_k ([δπ(d | R^m_{1:k−1}, q)/δπ] + Σ_{x=1}^{k−1} [π(d | R^m_{1:k−1}, q)/π(R^m_x | R^m_{1:x−1}, q)] [δπ(R^m_x | R^m_{1:x−1}, q)/δπ]),  (7.30)

since δρ(d | q)/δπ ≈ ρ-grad(d, q). In turn, we can use this to approximate the second part of Eq. 7.26:

  error-grad(c) = 2 (∆ − Σ_{d: c(d)=1} λ_d/ρ_d) Σ_{d: c(d)=1} (λ_d/ρ_d²) ρ-grad(d).  (7.31)

We approximate the first part of Eq. 7.26 with:

  freq-grad(R, c) = (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)² Σ_{x=1}^{K} [δπ(R_x | R_{1:x−1}, q)/δπ]/π(R_x | R_{1:x−1}, q).  (7.32)

Together, they approximate the complete gradient (cf. Eq. 7.26):

  δ Var(∆̂^π_IPS | q)/δπ ≈ (1/M) Σ_{m=1}^{M} [freq-grad(R^m, c^m) + error-grad(c^m)].  (7.33)

Therefore, we can approximate the gradient of the variance w.r.t. a logging policy π, based on rankings sampled from π and our current estimated click model (θ̂, ζ̂), while staying computationally feasible. For a more detailed description, see Appendix 7.B.
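The following sketch illustrates this sampling scheme for a toy three-item Plackett-Luce policy. It implements only the freq-grad term of Eq. 7.32 and a Monte-Carlo version of Eq. 7.15; the error-grad term follows the same sampling pattern. All parameter values (θ, ζ, λ, ∆) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 3 items, Plackett-Luce logging policy with scores s.
s = np.zeros(3)                       # policy parameters
theta = np.array([1.0, 0.5, 0.25])    # position bias per rank
zeta = np.array([0.8, 0.1, 0.4])      # estimated conditional click probabilities
lam = np.array([0.3, -0.2, 0.1])      # lambda_d values (Eq. 7.16), assumed
delta = 0.05                          # current CTR-difference estimate, assumed

def sample_ranking_and_gradlog(s):
    """Sample a ranking from the Plackett-Luce policy over scores s and
    return it with the gradient of log pi(R|q) w.r.t. s (cf. Eq. 7.28)."""
    remaining, ranking = list(range(len(s))), []
    grad_log = np.zeros_like(s)
    while remaining:
        p = np.exp(s[remaining] - np.max(s[remaining]))
        p /= p.sum()
        j = rng.choice(len(remaining), p=p)
        grad_log[remaining[j]] += 1.0
        grad_log[remaining] -= p      # derivative of the log-softmax
        ranking.append(remaining.pop(j))
    return ranking, grad_log

def mc_propensities(m=4096):
    """Eq. 7.15 by sampling: rho_d is the expected examination probability."""
    rho = np.zeros_like(theta)
    for _ in range(m):
        ranking, _ = sample_ranking_and_gradlog(s)
        rho[np.asarray(ranking)] += theta
    return rho / m

def mc_freq_grad(rho, m=1024):
    """Monte-Carlo estimate of the freq-grad part of Eq. 7.26/7.32: a
    score-function gradient weighted by the squared click-pattern error."""
    grad = np.zeros_like(s)
    for _ in range(m):
        ranking, grad_log = sample_ranking_and_gradlog(s)
        click_p = theta * zeta[np.asarray(ranking)]   # click probability per rank
        clicks = rng.random(len(s)) < click_p
        clicked = np.asarray(ranking)[clicks]
        err = delta - np.sum(lam[clicked] / rho[clicked])
        grad += err ** 2 * grad_log
    return grad / m

rho = mc_propensities()
print(mc_freq_grad(rho))
```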
Algorithm 7.1
Logging-Policy Optimization Algorithm (LogOpt)

 1: Input: historical interactions I; rankers to compare: π_1, π_2.
 2: θ̂, ζ̂ ← infer_click_model(I)  // estimate bias using EM
 3: λ̂ ← estimated_observance(θ̂, π_1, π_2)  // estimate λ, cf. Eq. 7.16
 4: ∆̂(π_1, π_2) ← estimated_CTR(I, λ̂, θ̂)  // CTR difference, cf. Eq. 7.17
 5: π ← init_policy()  // initialize logging policy
 6: for j ∈ {1, 2, ...} do
 7:   q ∼ P(q | I)  // sample a query from the interactions
 8:   R ← {R^1, R^2, ..., R^M} ∼ π(R | q)  // sample M rankings
 9:   C ← {c^1, c^2, ..., c^M} ∼ P(c | R)  // sample M click patterns
10:   δ̂ ← approx_grad(R, C, λ̂, θ̂, ∆̂(π_1, π_2))  // using Eq. 7.33
11:   π ← update(π, δ̂)  // update using the approximated gradient
12: return π

We have summarized the LogOpt method in Algorithm 7.1. The algorithm requires a set of historical interactions I and two rankers π_1 and π_2 to compare. By fitting a click model on I using an EM procedure (Line 2), estimates of the observation bias θ̂ and the document relevance ζ̂ are obtained. Using θ̂, an estimate of the difference in observation probabilities λ̂ is computed (Line 3, cf. Eq. 7.16), as well as an estimate of the CTR difference ∆̂(π_1, π_2) (Line 4, cf. Eq. 7.17). Then the optimization of a new logging policy π begins: a query is sampled from I (Line 7), and for that query M rankings are sampled from the current π (Line 8); then for each ranking a click pattern is sampled using θ̂ and ζ̂ (Line 9). Finally, using the sampled rankings and clicks, θ̂, λ̂, and ∆̂(π_1, π_2), the gradient is approximated using Eq. 7.33 (Line 10) and the policy π is updated accordingly (Line 11). This process can be repeated for a fixed number of steps, or until the policy has converged.

This concludes our introduction of LogOpt: the first method that optimizes the logging policy for faster convergence in counterfactual evaluation. We argue that LogOpt turns counterfactual evaluation into online evaluation, because it instructs which rankings should be displayed for the most efficient evaluation. The ability to make interventions like this is the defining characteristic of an online evaluation method.

7.5 Experimental Setup

We ran semi-synthetic experiments that are prevalent in online and counterfactual evaluation [41, 58, 86]. User-issued queries are simulated by sampling from learning to rank datasets; each dataset contains a preselected set of documents per query. We use the Yahoo! Webscope [17] and MSLR-WEB30k [95] datasets; both contain 5-grade relevance judgements for all preselected query-document pairs. For each sampled query, we let the evaluation method decide which ranking to display, and then simulate clicks on it using probabilistic click models.

To simulate position bias, we use the rank-based probabilities of Joachims et al. [58]:

  P(o(d) = 1 | R, q) = 1/rank(d | R).  (7.34)

If observed, the click probability is determined by the relevance label of the dataset (ranging from 0 to 4). More relevant items are more likely to be clicked, yet non-relevant documents still have a non-zero click probability:

  P(c(d) = 1 | o(d) = 1, q) = 0.225 · relevance_label(q, d) + 0.1.  (7.35)
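A minimal sketch of this click simulation, following Eqs. 7.34 and 7.35 as given above:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_clicks(relevance_labels, rng):
    """One simulated query impression: position bias theta_r = 1/r
    (Eq. 7.34) and conditional click probability 0.225 * label + 0.1
    (Eq. 7.35), for relevance labels in {0, ..., 4}."""
    labels = np.asarray(relevance_labels)
    theta = 1.0 / np.arange(1, len(labels) + 1)
    zeta = 0.225 * labels + 0.1
    return (rng.random(len(labels)) < theta * zeta).astype(int)

# Clicks on a toy ranking whose items have labels 4, 0, 2, 1, 3.
print(simulate_clicks([4, 0, 2, 1, 3], rng))
```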
Spread over both datasets, we generated 2,000 rankers and created 1,000 ranker-pairs. We aimed to generate rankers that are likely to be compared in real-world scenarios; unfortunately, no simple distribution of such rankers is available. Therefore, we tried to generate rankers that have (at least) a decent CTR and that span a variety of ranking behaviors. Each ranker was optimized using LambdaLoss [129] based on the labelled data of 100 sampled queries; each ranker is based on a linear model that only uses a random sample of 50% of the dataset features. Figure 7.1 displays the resulting CTR distribution; it appears to follow a normal distribution on both datasets.

For each ranker-pair and method, we sample a large number of queries and calculate the CTR estimates at different numbers of issued queries. We consider three metrics: (i) the binary error: whether the estimate correctly predicts which ranker should be preferred; (ii) the absolute error: the absolute difference between the estimate and the true E[CTR] difference:

  absolute-error = |∆(π_1, π_2) − ∆̂(I)|;  (7.36)

and (iii) the mean squared error: the squared error per sample (not the final estimate); if the estimator is unbiased, this is equivalent to the empirical variance:

  mean-squared-error = (1/N) Σ_{i=1}^{N} (∆(π_1, π_2) − x_i)².  (7.37)

We compare LogOpt with the following baselines: (i) A/B testing (with equal probabilities for each ranker); (ii) team-draft interleaving; (iii) probabilistic interleaving (with τ = 4); and (iv) optimized interleaving (with the inverse rank scoring function). Furthermore, we compare LogOpt with other choices of logging policies: (i) uniform sampling; (ii) A/B testing: showing either the ranking of A or B with equal probability; and (iii) an Oracle logging policy: applying LogOpt to the true relevances ζ and position bias θ. We also consider LogOpt both in the case where θ is known a priori, and where it still has to be estimated. Because estimating θ and optimizing the logging policy π is time-consuming, we only update θ̂ and π at a few fixed intervals during each run. The policy LogOpt optimizes uses a neural network with 2 hidden layers consisting of 32 units each. The network computes a score for every document; then a softmax is applied to the scores to create a distribution over documents.
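A sketch of such a scoring policy, with untrained weights and the layer sizes described above; the sequential softmax sampling illustrates how per-document scores induce a distribution over rankings:

```python
import numpy as np

rng = np.random.default_rng(0)

class ScoringPolicy:
    """Sketch of the optimized logging policy: a 2-layer MLP (32 units per
    layer, as described above) scores each document; a softmax over the
    scores of the remaining documents gives the sampling distribution at
    each rank. Weights are random here, i.e., the policy is untrained."""

    def __init__(self, n_features, hidden=32):
        self.w1 = rng.normal(0.0, 0.1, (n_features, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, hidden))
        self.w3 = rng.normal(0.0, 0.1, (hidden, 1))

    def scores(self, doc_features):
        h = np.tanh(doc_features @ self.w1)
        h = np.tanh(h @ self.w2)
        return (h @ self.w3).ravel()

    def sample_ranking(self, doc_features):
        s = self.scores(doc_features)
        remaining, ranking = list(range(len(s))), []
        while remaining:
            p = np.exp(s[remaining] - np.max(s[remaining]))
            p /= p.sum()
            ranking.append(remaining.pop(rng.choice(len(remaining), p=p)))
        return ranking

policy = ScoringPolicy(n_features=10)
print(policy.sample_ranking(rng.normal(size=(5, 10))))
```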
7.6 Results

Our results are displayed in Figures 7.2, 7.3, and 7.4. Figure 7.2 shows the results comparing LogOpt with other online evaluation methods; Figure 7.3 compares LogOpt with counterfactual evaluation using other logging policies; and finally, Figure 7.4 shows the distribution of binary errors for each method at the end of the sampled queries.

[Figure 7.1: The CTR distribution of the 2,000 generated rankers; 1,000 were generated per dataset. Panels: Yahoo! Webscope, MSLR-Web30k.]

[Figure 7.2: Comparison of LogOpt with other online methods (A/B testing, team-draft interleaving, probabilistic interleaving, optimized interleaving, and LogOpt with position bias known/estimated); panels show the binary error, absolute error, and mean squared error against the number of queries issued, for Yahoo! Webscope and MSLR-Web30k; displayed results are an average over 500 comparisons.]

[Figure 7.3: Comparison of logging policies for counterfactual evaluation (A/B logging policy, uniform logging policy, LogOpt with position bias known, and the Oracle logging policy); panels as in Figure 7.2; displayed results are an average over 500 comparisons.]

In Figure 7.2 we see that, unlike the interleaving methods, counterfactual evaluation with LogOpt continues to decrease both its binary error and its absolute error as the number of queries increases. While the interleaving methods converge at a binary error of at least 2.2% and a non-vanishing absolute error, LogOpt appears to converge towards zero error on both metrics. This is expected, as LogOpt is proven to be unbiased when the position bias is known. Interestingly, we see similar behavior from LogOpt with estimated position bias. Both when the bias is known and when it is estimated, LogOpt has a lower error than the interleaving methods after a sufficiently large number of queries. Thus we conclude that interleaving methods converge faster and have an initial period where their error is lower, but are biased. In contrast, by being unbiased, LogOpt eventually converges on a lower error.

If we use Figure 7.2 to compare LogOpt with A/B testing, we see that on both datasets LogOpt has a considerably smaller mean squared error. Since both methods are unbiased, this means that LogOpt has a much lower variance and is thus expected to converge faster. On the Yahoo! dataset we observe this behavior: both in terms of binary error and absolute error, and regardless of whether the bias is estimated, LogOpt requires half as much data as A/B testing to reach the same level of error. Thus, on Yahoo! LogOpt is roughly twice as data-efficient as A/B testing. On the MSLR dataset it is less clear whether LogOpt is noticeably more efficient: early in the run, the absolute error of LogOpt is twice as high, but by the end of the run it has a lower error than A/B testing. We suspect that the relative drop in performance partway through the run is due to LogOpt overfitting on incorrect ζ̂ values; however, we were unable to confirm this. Hence, LogOpt is just as efficient as, or even more efficient than, A/B testing, depending on the circumstances.

Finally, when we use Figure 7.3 to compare LogOpt with other logging-policy choices, we see that LogOpt mostly approximates the optimal Oracle logging policy. In contrast, the uniform logging policy is very data-inefficient; on both datasets it requires around ten times the number of queries to reach the same level of error as LogOpt. The A/B logging policy is a better choice than the uniform logging policy, but apart from the dip in performance on the MSLR dataset, it appears to require twice as many queries as LogOpt. Interestingly, the performance of LogOpt is already near the Oracle when only a small number of queries have been issued. With such a small number of interactions, accurately estimating the relevances ζ should not be possible; thus it appears that the relevances ζ are not important for LogOpt to find an efficient logging policy. This must mean that only the differences in behavior between the rankers (i.e., λ) have to be known for LogOpt to be efficient. Overall, these results show that LogOpt can greatly increase the efficiency of counterfactual estimation.

Our results in Figure 7.2 clearly illustrate the bias of interleaving methods: each of them systematically infers incorrect preferences for (at least) 2.2% of the ranker-pairs. These errors are systematic, since further increasing the number of queries does not remove any of them. Additionally, the combination of the lowest mean squared error with a worse absolute error than A/B testing at the end of the run indicates that interleaving obtains a low variance at the cost of bias. To better understand when these systematic errors occur, we show the distribution of binary errors w.r.t. the CTR differences of the associated ranker-pairs in Figure 7.4. Here we see that most errors occur on ranker-pairs where the CTR difference is smaller than 1%, and that the percentage of erroneous comparisons greatly increases as the CTR difference decreases below 1%. This suggests that interleaving methods are unreliable for detecting preferences when differences are 1% CTR or less.

It is hard to judge the impact this bias may have in practice. On the one hand, a 1% CTR difference is far from negligible: generally, a 1% increase in CTR is considered an impactful improvement in the industry [102]. On the other hand, our results are based on a single click model with specific values for position bias and conditional click probabilities. While our results strongly prove that interleaving is biased, we should be careful not to generalize the size of the observed systematic error to all other ranking settings.

Previous work has performed empirical studies to evaluate various interleaving methods with real users. Chapelle et al. [18] applied interleaving methods to compare ranking systems for three different search engines, and found team-draft interleaving to correlate highly with absolute measures such as CTR. However, we note that in the study by Chapelle et al. [18] no more than six rankers were compared; such a study would likely miss a systematic error of 2.2%. In fact, Chapelle et al. [18] note themselves that they cannot confidently claim team-draft interleaving is completely unbiased. Schuth et al. [110] performed a larger comparison involving 38 ranking systems, but again, one too small to reliably detect a small systematic error.

It appears that the field is missing a large-scale comparison that involves a large enough number of rankers to observe small systematic errors. If such an error is found, the next step is to identify whether certain types of ranking behavior are erroneously and systematically disfavored. While these questions remain unanswered, we are concerned that the claims of unbiasedness in previous work on interleaving (see Section 7.3.2) give practitioners an unwarranted sense of reliability in interleaving.
7.7 Conclusion

In this chapter, we considered thesis research question RQ7: whether counterfactual evaluation methods for ranking can be extended to perform efficient and effective online evaluation. Our answer is positive: we have introduced the Logging-Policy Optimization Algorithm (LogOpt), the first method that optimizes a logging policy for minimal-variance counterfactual evaluation. Counterfactual evaluation is proven to be unbiased w.r.t. position bias and item-selection bias under a wide range of logging policies. With the introduction of LogOpt, we now have an algorithm that can decide which rankings should be displayed for the fastest convergence. Therefore, we argue that LogOpt turns the IPS-based counterfactual evaluation approach – which is indifferent to the logging policy – into an online approach – which instructs the logging policy. Our experimental results show that LogOpt can lead to better data-efficiency than A/B testing, while also showing that interleaving is biased.

This brings us to the second thesis research question that this chapter addressed,
RQ8: whether interleaving methods are truly unbiased w.r.t. position bias. We answer this question negatively: our experimental results clearly reveal a systematic error in interleaving; moreover, in Appendix 7.A we formally prove that cases exist where interleaving is affected by position bias. In other words, interleaving should not be considered unbiased under the most common definition of bias in counterfactual evaluation.

While our findings are mostly theoretical, they do suggest that future work should further investigate the bias in interleaving methods. Our results suggest that all interleaving methods make systematic errors, in particular when rankers with a similar CTR are compared. Furthermore, to the best of our knowledge, no empirical studies have been performed that could measure such a bias; our findings strongly indicate that such a study would be highly valuable to the field. Finally, LogOpt shows that, in theory, an evaluation method that is both unbiased and efficient is possible; if future work finds that these theoretical findings match empirical results with real users, this could be the start of a new line of theoretically-justified online evaluation methods.
Inspired by the success of this chapter in finding a method effective at both online and counterfactual evaluation for ranking, Chapter 8 introduces a method that is effective at both online and counterfactual Learning to Rank (LTR). Together, these chapters show that the divide between online and counterfactual optimization and evaluation can be bridged.
[Figure 7.4: Distribution of errors over the CTR differences of the rankers in the comparison, per method (team-draft interleaving, probabilistic interleaving, optimized interleaving, A/B testing, and LogOpt with estimated bias) and per dataset (Yahoo! Webscope, MSLR-Web30k); red indicates a binary error, green indicates a correctly inferred binary preference; results are based on the estimates at the end of the sampled queries.]

7.A Proof of Bias in Interleaving

Section 7.3.2 claimed that for the discussed interleaving methods, an example can be constructed so that in expectation the wrong binary outcome is estimated w.r.t. the actual expected CTR differences. These examples are enough to prove that these interleaving methods are biased w.r.t. CTR differences. In the following sections, we introduce a single example for each interleaving method.

For clarity, we keep these examples as basic as possible. We consider a ranking setting where only a single query q occurs, i.e., P(q) = 1; furthermore, there are only three documents to be ranked: A, B, and C. The two policies π_1 and π_2 in the comparison are both deterministic, so that π_1([A, B, C] | q) = 1 and π_2([B, C, A] | q) = 1. Thus π_1 will always display the ranking [A, B, C], and π_2 the ranking [B, C, A]. Furthermore, document B is completely non-relevant: ζ_B = 0; consequently, B can never receive clicks, which makes our examples even simpler. The true E[CTR] difference is thus:

  ∆(π_1, π_2) = (θ_1 − θ_3) ζ_A + (θ_3 − θ_2) ζ_C.  (7.38)

For each interleaving method, we now show that position bias parameters θ_1, θ_2, and θ_3 and relevances ζ_A and ζ_C exist for which the wrong binary outcome is estimated.

7.A.1 Team-Draft Interleaving

Team-Draft Interleaving [99] lets the rankers take turns to add their top document, and keeps track of which ranker added each document. In total there are four possible interleaving and assignment combinations, each equally probable:

Interleaving  Ranking  Assignments  Probability
R^1           A, B, C  1, 2, 1      1/4
R^2           A, B, C  1, 2, 2      1/4
R^3           B, A, C  2, 1, 1      1/4
R^4           B, A, C  2, 1, 2      1/4

Per issued query, Team-Draft Interleaving produces a binary outcome, based on which ranker had most of its assigned documents clicked. To match our CTR estimate, we use 1 to indicate π_1 receiving more clicks, and −1 for π_2.
Per interleaving, we can compute the probability of each outcome:

  P(outcome = 1 | R^1) = θ_1 ζ_A + (1 − θ_1 ζ_A) θ_3 ζ_C,
  P(outcome = 1 | R^2) = θ_1 ζ_A (1 − θ_3 ζ_C),
  P(outcome = 1 | R^3) = θ_2 ζ_A + (1 − θ_2 ζ_A) θ_3 ζ_C,
  P(outcome = 1 | R^4) = θ_2 ζ_A (1 − θ_3 ζ_C),
  P(outcome = −1 | R^1) = 0,
  P(outcome = −1 | R^2) = (1 − θ_1 ζ_A) θ_3 ζ_C,
  P(outcome = −1 | R^3) = 0,
  P(outcome = −1 | R^4) = (1 − θ_2 ζ_A) θ_3 ζ_C.

Since every interleaving is equally likely, we can easily derive the unconditional probabilities:

  P(outcome = 1) = (1/4)(θ_1 ζ_A + (1 − θ_1 ζ_A) θ_3 ζ_C + θ_1 ζ_A (1 − θ_3 ζ_C) + θ_2 ζ_A + (1 − θ_2 ζ_A) θ_3 ζ_C + θ_2 ζ_A (1 − θ_3 ζ_C)),
  P(outcome = −1) = (1/4)((1 − θ_1 ζ_A) θ_3 ζ_C + (1 − θ_2 ζ_A) θ_3 ζ_C).

With these probabilities, the expected outcome is straightforward to calculate:

  E[outcome] = P(outcome = 1) − P(outcome = −1) = (1/4)(θ_1 ζ_A + θ_1 ζ_A (1 − θ_3 ζ_C) + θ_2 ζ_A + θ_2 ζ_A (1 − θ_3 ζ_C)) > 0.

Interestingly, without knowing the values of θ, ζ_A, and ζ_C, we already know that the expected outcome is positive. Therefore, we can simply choose values that lead to a negative CTR difference, and the expected outcome will be incorrect. For this example, we can choose, for instance, the position bias θ_1 = 1.0, θ_2 = 0.8, and θ_3 = 0.2, and the relevances ζ_A = 0.1 and ζ_C = 1.0, so that ∆(π_1, π_2) = 0.08 − 0.6 < 0. As a result, the expected binary outcome of Team-Draft Interleaving does not match the true E[CTR] difference:

  ∆(π_1, π_2) < 0 ∧ E[outcome] > 0.  (7.39)

Therefore, we have proven that Team-Draft Interleaving is biased w.r.t. CTR differences.
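This construction can be verified numerically by enumerating the four interleaving/assignment combinations and all click patterns; the script below uses the example values chosen above:

```python
from itertools import product

theta = {1: 1.0, 2: 0.8, 3: 0.2}
zeta = {"A": 0.1, "B": 0.0, "C": 1.0}

def expected_outcome():
    """E[outcome] of team-draft interleaving for pi_1 = [A,B,C] and
    pi_2 = [B,C,A], enumerating the four interleaving/assignment
    combinations and all 2^3 click patterns."""
    cases = [  # (ranking, per-rank assignment), each with probability 1/4
        (["A", "B", "C"], [1, 2, 1]),
        (["A", "B", "C"], [1, 2, 2]),
        (["B", "A", "C"], [2, 1, 1]),
        (["B", "A", "C"], [2, 1, 2]),
    ]
    total = 0.0
    for ranking, assign in cases:
        for clicks in product([0, 1], repeat=3):
            p = 1.0
            for r, (d, c) in enumerate(zip(ranking, clicks), start=1):
                pc = theta[r] * zeta[d]
                p *= pc if c else (1.0 - pc)
            c1 = sum(c for c, a in zip(clicks, assign) if a == 1)
            c2 = sum(c for c, a in zip(clicks, assign) if a == 2)
            total += 0.25 * p * ((c1 > c2) - (c2 > c1))
    return total

delta = (theta[1] - theta[3]) * zeta["A"] + (theta[3] - theta[2]) * zeta["C"]
print(f"Delta = {delta:.3f} < 0, E[outcome] = {expected_outcome():.3f} > 0")
```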
7.A.2 Probabilistic Interleaving

Probabilistic Interleaving [41] treats rankings as distributions over documents; we follow the soft-max approach of Hofmann et al. [41] and use τ = 4 as suggested. Probabilistic Interleaving creates interleavings by sampling randomly from one of the rankings; unlike Team-Draft Interleaving, it does not remember which ranking added each document. Because rankings are treated as distributions, every possible permutation is a valid interleaving, leading to six possibilities with different probabilities of being displayed. When clicks are received, every possible assignment is considered and the expected outcome is computed over all possible assignments. Because there are 36 possible ranking and assignment combinations, we only report every possible ranking and the probabilities of documents A or C being added by π_1:

Interleaving  Ranking  P(add(A) = 1)  P(add(C) = 1)  Probability
R^1           A, B, C  0.9878         0.4701         0.4182
R^2           A, C, B  0.9878         0.4999         0.0527
R^3           B, A, C  0.8569         0.0588         0.2849
R^4           B, C, A  0.5000         0.0588         0.2094
R^5           C, A, B  0.9872         0.5000         0.0166
R^6           C, B, A  0.5000         0.0562         0.0182

These probabilities are enough to compute the expected outcome, similar to the procedure we used for Team-Draft Interleaving. We do not display the full calculation here, as it is extremely long; we recommend using some form of computer assistance to perform these calculations. While there are many possibilities, we can choose a position bias θ_1, θ_2, θ_3 and relevances ζ_A and ζ_C for which this computation leads to the following erroneous result:

  ∆(π_1, π_2) < 0 ∧ E[outcome] > 0.  (7.40)

Therefore, we have proven that Probabilistic Interleaving is biased w.r.t. CTR differences.

7.A.3 Optimized Interleaving

Optimized Interleaving casts interleaving as an optimization problem [96]. Optimized Interleaving works with a credit function: each clicked document produces a positive or negative credit, and the sum of all credits is the final estimated outcome. We follow Radlinski and Craswell [96] and use the linear rank difference, resulting in the following credit per document: click-credit(A) = 2, click-credit(B) = −1, and click-credit(C) = −1. Then the set of allowed interleavings is created; these are all the rankings that do not contradict a pairwise document preference on which both rankers agree. Given this set of interleavings, a distribution over them is found so that, if every document is equally relevant, no preference is found.¹ For our example, the only valid distribution over interleavings is the following:

Interleaving  Ranking  Probability
R^1           A, B, C  1/3
R^2           B, A, C  1/3
R^3           B, C, A  1/3

The expected credit outcome shows us which ranker will be preferred in expectation:

  E[credit] = (1/3)(2(θ_1 + θ_2 + θ_3) ζ_A − (θ_2 + 2θ_3) ζ_C).  (7.41)

¹ Radlinski and Craswell [96] state that if clicks are not correlated with relevance, then no preference should be found; in their click model (and ours) these two requirements are equivalent.
For instance, we can choose the position bias θ_1 = 1.0, θ_2 = 0.3, and θ_3 = 0.2, and the relevances ζ_A = 0.15 and ζ_C = 1.0. As a result, the true E[CTR] difference is positive, but Optimized Interleaving will prefer π_2 in expectation:

  ∆(π_1, π_2) > 0 ∧ E[credit] < 0.  (7.42)

Therefore, we have proven that Optimized Interleaving is biased w.r.t. CTR differences.
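Again, the example values can be checked numerically against Eq. 7.38 and Eq. 7.41:

```python
theta = {1: 1.0, 2: 0.3, 3: 0.2}
zeta = {"A": 0.15, "B": 0.0, "C": 1.0}
credit = {"A": 2, "B": -1, "C": -1}
interleavings = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]

# Eq. 7.38: true CTR difference between pi_1 = [A,B,C] and pi_2 = [B,C,A].
delta = (theta[1] - theta[3]) * zeta["A"] + (theta[3] - theta[2]) * zeta["C"]

# Eq. 7.41: expected credit under the uniform distribution over the
# three allowed interleavings.
e_credit = sum(
    (1 / 3) * theta[rank] * zeta[d] * credit[d]
    for ranking in interleavings
    for rank, d in enumerate(ranking, start=1)
)
print(f"Delta = {delta:.3f} > 0, E[credit] = {e_credit:.4f} < 0")
```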
7.B Expanded Explanation of Gradient Approximation

This section describes our Monte-Carlo approximation of the variance gradient in more detail. We repeat the steps described in Section 7.4.3 and include some additional intermediate steps; this should make it easier for a reader to verify our theory.

First, we assume that policies place the documents in order of rank and that the probability of placing an individual document at rank x only depends on the previously placed documents. Let R_{1:x−1} indicate the (incomplete) ranking from rank 1 up to rank x−1; then π(d | R_{1:x−1}, q) indicates the probability that document d is placed at rank x, given that the ranking up to rank x−1 is R_{1:x−1}. The probability of a ranking R of length K is thus:

  π(R | q) = Π_{x=1}^{K} π(R_x | R_{1:x−1}, q).  (7.43)

The probability of a ranking R up to rank k is:

  π(R_{1:k} | q) = Π_{x=1}^{k} π(R_x | R_{1:x−1}, q).  (7.44)

Therefore, the propensity (cf. Eq. 7.15) can be rewritten to:

  ρ(d | q) = Σ_{k=1}^{K} θ_k Σ_R π(R_{1:k−1} | q) π(d | R_{1:k−1}, q).  (7.45)

Before we take the gradient of the propensity, we note that the gradient of the probability of a single ranking is:

  δπ(R | q)/δπ = Σ_{x=1}^{K} [π(R | q)/π(R_x | R_{1:x−1}, q)] [δπ(R_x | R_{1:x−1}, q)/δπ].  (7.46)

Using this gradient, we can derive the gradient of the propensity w.r.t. the policy:

  δρ(d | q)/δπ = Σ_{k=1}^{K} θ_k Σ_R π(R_{1:k−1} | q) ([δπ(d | R_{1:k−1}, q)/δπ] + Σ_{x=1}^{k−1} [π(d | R_{1:k−1}, q)/π(R_x | R_{1:x−1}, q)] [δπ(R_x | R_{1:x−1}, q)/δπ]).  (7.47)

To avoid iterating over all rankings in the Σ_R sum, we sample M rankings: R^m ∼ π(R | q), and a click pattern on each ranking: c^m ∼ P(c | R^m). This enables us to make the following approximation:

  ρ-grad(d, q) = (1/M) Σ_{m=1}^{M} Σ_{k=1}^{K} θ_k ([δπ(d | R^m_{1:k−1}, q)/δπ] + Σ_{x=1}^{k−1} [π(d | R^m_{1:k−1}, q)/π(R^m_x | R^m_{1:x−1}, q)] [δπ(R^m_x | R^m_{1:x−1}, q)/δπ]),  (7.48)

since δρ(d | q)/δπ ≈ ρ-grad(d, q). The second part of Eq. 7.26 is:

  δ/δπ (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)² = 2 (∆ − Σ_{d: c(d)=1} λ_d/ρ_d) Σ_{d: c(d)=1} (λ_d/ρ_d²) [δρ_d/δπ];  (7.49)

using ρ-grad(d), we get the approximation:

  error-grad(c) = 2 (∆ − Σ_{d: c(d)=1} λ_d/ρ_d) Σ_{d: c(d)=1} (λ_d/ρ_d²) ρ-grad(d).  (7.50)

Next, we consider the gradient of a single click pattern:

  δ/δπ P(c | q) = Σ_R P(c | R) [δπ(R | q)/δπ].  (7.51)

This can then be used to reformulate the first part of Eq. 7.26:

  Σ_c [δ/δπ P(c | q)] (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)² = Σ_c Σ_R P(c | R) [δπ(R | q)/δπ] (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)².  (7.52)

Making use of Eq. 7.46, we approximate this with:

  freq-grad(R, c) = (∆ − Σ_{d: c(d)=1} λ_d/ρ_d)² Σ_{x=1}^{K} [δπ(R_x | R_{1:x−1}, q)/δπ]/π(R_x | R_{1:x−1}, q).  (7.53)

Combining the approximations of both parts of Eq. 7.26 allows us to approximate the complete gradient:

  δ Var(∆̂^π_IPS | q)/δπ ≈ (1/M) Σ_{m=1}^{M} [freq-grad(R^m, c^m) + error-grad(c^m)].  (7.54)

This completes our expanded description of the gradient approximation. We have shown that we can approximate the gradient of the variance w.r.t. a logging policy π, based on rankings sampled from π and our current estimated click model (θ̂, ζ̂), while staying computationally feasible.

7.C Notation Reference for Chapter 7

Notation                 Description
k                        the number of items that can be displayed in a single ranking
i                        an iteration number
q                        a user-issued query
d                        an item to be ranked
R                        a ranked list
R_{1:x}                  the subranking in R from index 1 up to and including index x
π                        a ranking policy
π(R | q)                 the probability that policy π displays ranking R for query q
π(R_x | R_{1:x−1}, q)    the probability of π adding item R_x given that R_{1:x−1} is already placed
I                        the available interaction data
c                        a click pattern: a vector indicating a combination of clicked and not-clicked items
Σ_c                      a summation over every possible click pattern
c(d)                     a function indicating item d was clicked in click pattern c
o(d)                     a function indicating item d was observed at iteration i
x_i                      the estimate for a single interaction i
f(q_i, R_i, c_i)         the method-specific function that converts a single interaction into an estimate x_i
θ_{rank(d|R)}            the observation probability: P(o(d) = 1 | R)
ζ_{d,q}                  the conditional click probability: P(c(d) = 1 | o(d) = 1, q)
Unifying Online and Counterfactual Learning to Rank
In Chapter 7, we introduced the Logging-Policy Optimization Algorithm (LogOpt), which turns a counterfactual ranking evaluation method into an online evaluation method. Thus, the contributions of Chapter 7 are a significant step in bridging the divide between online and counterfactual ranking evaluation. Inspired by this contribution, this chapter considers whether something similar can be done for the gap between online and counterfactual Learning to Rank (LTR). Accordingly, in this chapter the following question will be addressed:
RQ9
Can the counterfactual LTR approach be extended to perform highly effective online LTR?

In contrast with Chapter 7, which looked at finding the best logging policy, this chapter considers a novel counterfactual estimator: we propose the novel intervention-aware estimator for both counterfactual and online LTR. The estimator corrects for the effects of position bias, trust bias, and item-selection bias using corrections based on the behavior of the logging policy and on online interventions: changes to the logging policy made during the gathering of click data. Our experimental results show that, unlike existing counterfactual LTR methods, the intervention-aware estimator can greatly benefit from online interventions. In contrast, existing online methods are hindered without online interventions and thus should not be applied counterfactually. With the introduction of the intervention-aware estimator, we aim to bridge the online/counterfactual LTR division, as it is shown to be highly effective in both online and counterfactual scenarios.
Introduction

Ranking systems form the basis for most search and recommendation applications [75]. As a result, the quality of such systems can greatly impact the user experience; thus it is important that the underlying ranking models perform well. The LTR field considers methods to optimize ranking models. Traditionally, this optimization was based on expert annotations. Over the years, the limitations of expert annotations have become apparent; some of the
most important ones are: (i) they are expensive and time-consuming to acquire [17, 95]; (ii) in privacy-sensitive settings expert annotation is unethical, e.g., in email or private document search [128]; and (iii) expert annotations often appear to disagree with actual user preferences [104].

User interaction data solves some of the problems with expert annotations: (i) interaction data is virtually free for systems with active users; (ii) it does not require experts to look at potentially privacy-sensitive content; and (iii) interaction data is indicative of users' preferences. For these reasons, interest in LTR methods that learn from user interactions has increased in recent years. However, user interactions are a form of implicit feedback and are generally also affected by factors other than user preference [57]. Therefore, to be able to reliably learn from interaction data, the effect of factors other than preference has to be corrected for. For clicks on rankings, three prevalent factors are well known: (i) position bias: users are less likely to examine, and thus click, lower-ranked items [25]; (ii) item-selection bias: users cannot click on items that are not displayed [86, 92]; and (iii) trust bias: because users trust the ranking system, they are more likely to click on highly ranked items that they do not actually prefer [3, 57]. As a result of these biases, which ranking system was used to gather clicks can have a substantial impact on the clicks that will be observed. Current LTR methods that learn from clicks can be divided into two families: counterfactual approaches [58], which learn from historical data, i.e., clicks that have been logged in the past, and online approaches [132], which can perform interventions, i.e., they can decide what rankings will be shown to users. Recent work has noticed that some counterfactual methods can be applied as online methods [50], and vice versa [6, 136]. Nonetheless, every existing method was designed for either the online or the counterfactual setting, never both.

In this chapter, we propose a novel estimator for both counterfactual and online LTR from clicks: the intervention-aware estimator. The intervention-aware estimator builds on the ideas that underlie the latest existing counterfactual methods, the policy-aware estimator [86] and the affine estimator [123], and expands them to consider the effect of online interventions. It does so by considering how the effect of bias is changed by an intervention, and it utilizes these differences in its unbiased estimation. As a result, the intervention-aware estimator is effective both when applied as a counterfactual method, i.e., when learning from historical data, and as an online method, where online interventions lead to enormous increases in efficiency. In our experimental results the intervention-aware estimator is shown to reach state-of-the-art LTR performance in both online and counterfactual settings, and it is the only method that reaches top performance in both.

The main contributions of this chapter are:
1. A novel intervention-aware estimator that corrects for position bias, trust bias, item-selection bias, and the effect of online interventions.
2. An investigation into the effect of online interventions on state-of-the-art counterfactual and online LTR methods.

Interactions with Rankings
The theory in this chapter assumes that three forms of interaction bias occur: position bias, item-selection bias, and trust bias.
Position bias occurs because users only click an item after examining it, and users are more likely to examine items displayed at higher ranks [25]. Thus the rank (a.k.a. position) at which an item is displayed heavily affects the probability of it being clicked. We model this bias using $P(E = 1 \mid k)$: the probability that an item $d$ displayed at rank $k$ is examined ($E$) by the user [128].

Item-selection bias occurs when some items have a zero probability of being examined in some displayed rankings [92]. This can happen because not all items are displayed to the user, or because the ranked list is so long that no user ever considers the entire list. We model this bias by stating:
$$\exists k, \forall k', \;\big(k' > k \rightarrow P(E = 1 \mid k') = 0\big), \qquad (8.1)$$
i.e., there exists a rank $k$ such that items ranked lower than $k$ have no chance of being examined. The distinction between position bias and item-selection bias is important because some methods can only correct for the former if the latter is not present [86].

Finally, trust bias occurs because users trust the ranking system and, consequently, are more likely to perceive top-ranked items as relevant even when they are not [57]. We model this bias using $P(C = 1 \mid k, R, E)$: the probability of a click conditioned on the displayed rank $k$, the relevance of the item $R$, and examination $E$.

To combine these three forms of bias into a single click model, we follow Agarwal et al. [3] and write:
$$P(C = 1 \mid d, k, q) = P(E = 1 \mid k)\big(P(C = 1 \mid k, R = 0, E = 1)\, P(R = 0 \mid d, q) + P(C = 1 \mid k, R = 1, E = 1)\, P(R = 1 \mid d, q)\big), \qquad (8.2)$$
where $P(R = 1 \mid d, q)$ is the probability that an item $d$ is deemed relevant w.r.t. query $q$ by the user. An analysis of real-world interaction data, performed by Agarwal et al. [3] on search services for retrieving cloud-stored files and emails, showed that this model captures click behavior better than models that only capture position bias [128].

To simplify the notation, we follow Vardasbi et al. [123] and adopt:
$$\alpha_k = P(E = 1 \mid k)\big(P(C = 1 \mid k, R = 1, E = 1) - P(C = 1 \mid k, R = 0, E = 1)\big), \qquad \beta_k = P(E = 1 \mid k)\, P(C = 1 \mid k, R = 0, E = 1). \qquad (8.3)$$
This results in a compact notation for the click probability (Eq. 8.2):
$$P(C = 1 \mid d, k, q) = \alpha_k\, P(R = 1 \mid d, q) + \beta_k. \qquad (8.4)$$
For a single ranking $y$, let $k$ be the rank at which item $d$ is displayed in $y$; we denote $\alpha_{d,y} = \alpha_k$ and $\beta_{d,y} = \beta_k$. This allows us to specify the click probability conditioned on a ranking $y$:
$$P(C = 1 \mid d, y, q) = \alpha_{d,y}\, P(R = 1 \mid d, q) + \beta_{d,y}. \qquad (8.5)$$
Finally, let $\pi$ be a ranking policy used for logging clicks, where $\pi(y \mid q)$ is the probability of $\pi$ displaying ranking $y$ for query $q$; then the click probability conditioned on $\pi$ is:
$$P(C = 1 \mid d, \pi, q) = \sum_y \pi(y \mid q)\big(\alpha_{d,y}\, P(R = 1 \mid d, q) + \beta_{d,y}\big). \qquad (8.6)$$
The proofs in the remainder of this chapter assume this model of click behavior.
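As an illustration, a minimal simulation of this click model (Eq. 8.5) in Python; the $\alpha$, $\beta$, and relevance values are assumed for the example and carry no special meaning:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trust-bias parameters per displayed rank (the alpha_k, beta_k of Eq. 8.3):
alpha = np.array([0.50, 0.40, 0.30, 0.20, 0.10])
beta  = np.array([0.30, 0.20, 0.10, 0.05, 0.02])

def click_probabilities(ranking, p_rel):
    """P(C=1 | d, y, q) = alpha_{d,y} * P(R=1|d,q) + beta_{d,y}  (Eq. 8.5)."""
    return alpha[:len(ranking)] * p_rel[ranking] + beta[:len(ranking)]

p_rel = np.array([1.0, 0.5, 0.0, 0.25, 0.75])   # assumed P(R=1|d,q) per item
ranking = np.array([0, 1, 2, 3, 4])             # a displayed top-5 ranking y
clicks = rng.random(len(ranking)) < click_probabilities(ranking, p_rel)
print(clicks)
```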
Background

In this section we cover the basics of LTR and counterfactual LTR.

The field of LTR considers methods for optimizing ranking systems w.r.t. ranking metrics. Most ranking metrics are additive w.r.t. documents; let $P(q)$ be the probability that a user-issued query is query $q$; then the metric reward $R$ commonly has the form:
$$R(\pi) = \sum_q P(q) \sum_{d \in D_q} \lambda(d \mid D_q, \pi, q)\, P(R = 1 \mid d, q). \qquad (8.7)$$
Here, the $\lambda$ function scores each item $d$ depending on how $\pi$ ranks $d$ when given the preselected item set $D_q$; $\lambda$ can be chosen to match a desired metric, for instance, the common Discounted Cumulative Gain (DCG) metric [52]:
$$\lambda_{\mathrm{DCG}}(d \mid D_q, \pi, q) = \sum_y \pi(y \mid q)\big(\log_2(\mathrm{rank}(d \mid y) + 1)\big)^{-1}. \qquad (8.8)$$
Supervised LTR methods can optimize $\pi$ to maximize $R$ if the relevances $P(R = 1 \mid d, q)$ are known [75, 129]. In practice, however, finding these relevance values is not straightforward.

Over time, limitations of the supervised LTR approach have become apparent. Most importantly, finding accurate relevance values $P(R = 1 \mid d, q)$ has proved to be impossible or infeasible in many practical situations [127]. As a solution, LTR methods have been developed that learn from user interactions instead of relevance annotations. Counterfactual LTR concerns approaches that learn from historical interactions. Let $\mathcal{D}$ be a set of collected interaction data over $T$ timesteps; for each timestep $t$ it contains the user-issued query $q_t$, the logging policy $\pi_t$ used to generate the displayed ranking $\bar{y}_t$, and the clicks $c_t$ received on the ranking:
$$\mathcal{D} = \{(\pi_t, q_t, \bar{y}_t, c_t)\}_{t=1}^{T}, \qquad (8.9)$$
where $c_t(d) \in \{0, 1\}$ indicates whether item $d$ was clicked at timestep $t$. While clicks are indicative of relevance, they are also affected by several forms of bias, as discussed in Section 8.2.

Counterfactual LTR methods utilize estimators that correct for bias to unbiasedly estimate the reward of a policy $\pi$. The prevalent methods introduce a function $\hat{\Delta}$ that transforms a single click signal to correct for bias. The general estimate of the reward is:
$$\hat{R}(\pi \mid \mathcal{D}) = \frac{1}{T} \sum_{t=1}^{T} \sum_{d \in D_{q_t}} \lambda(d \mid D_{q_t}, \pi, q_t)\, \hat{\Delta}(d \mid \pi_t, q_t, \bar{y}_t, c_t). \qquad (8.10)$$
We note the important distinction between the policy $\pi$ for which we estimate the reward and the policy $\pi_t$ that was used to gather interactions at timestep $t$. During optimization only $\pi$ is changed, in order to maximize the estimated reward.

The original Inverse Propensity Scoring (IPS) based estimator, introduced by Wang et al. [127] and Joachims et al. [58], weights clicks according to examination probabilities:
$$\hat{\Delta}_{\mathrm{IPS}}(d \mid \bar{y}_t, c_t) = \frac{c_t(d)}{P(E = 1 \mid \bar{y}_t, d)}. \qquad (8.11)$$
This estimator results in unbiased optimization under two requirements. First, every relevant item must have a non-zero examination probability in all displayed rankings:
$$\forall t, \forall d \in D_{q_t}, \;\big(P(R = 1 \mid d, q_t) > 0 \rightarrow P(E = 1 \mid \bar{y}_t, d) > 0\big). \qquad (8.12)$$
Second, the click probability conditioned on relevance for examined items should be the same at every rank:
$$\forall k, k', \;\big(P(C \mid k, R, E = 1) = P(C \mid k', R, E = 1)\big), \qquad (8.13)$$
i.e., no trust bias is present. These requirements illustrate that this estimator can only correct for position bias, and is biased when item-selection bias or trust bias is present. For a proof we refer to previous work by Joachims et al. [58] and Vardasbi et al. [123].

Oosterhuis and de Rijke [86] (Chapter 5) adapt the IPS approach to correct for item-selection bias as well.
They weight clicks according to examination probabilities conditioned on the logging policy, instead of on the single displayed ranking on which a click took place. This results in the policy-aware estimator:
$$\hat{\Delta}_{\mathrm{aware}}(d \mid \pi_t, q_t, c_t) = \frac{c_t(d)}{P(E = 1 \mid \pi_t, q_t, d)} = \frac{c_t(d)}{\sum_y \pi_t(y \mid q_t)\, P(E = 1 \mid y, d, q_t)}. \qquad (8.14)$$
This estimator can be used for unbiased optimization under two assumptions. First, every relevant item must have a non-zero examination probability under the logging policy:
$$\forall t, \forall d \in D_{q_t}, \;\big(P(R = 1 \mid d, q_t) > 0 \rightarrow P(E = 1 \mid \pi_t, d, q_t) > 0\big). \qquad (8.15)$$
Second, no trust bias is present, as described in Eq. 8.13. Importantly, the first requirement can be met under item-selection bias, since a stochastic ranking policy can always provide every item a non-zero probability of appearing in a top-$k$ ranking. Thus, even when not all items can be displayed at once, a stochastic policy can provide non-zero examination probabilities to all items. For a proof of this claim we refer to previous work by Oosterhuis and de Rijke [86].

Lastly, Vardasbi et al. [123] prove that IPS cannot correct for trust bias. As an alternative, they introduce an estimator based on affine corrections. This affine estimator penalizes an item displayed at rank $k$ by $\beta_k$ while also reweighting inversely w.r.t. $\alpha_k$:
$$\hat{\Delta}_{\mathrm{affine}}(d \mid \bar{y}_t, c_t) = \frac{c_t(d) - \beta_{d,\bar{y}_t}}{\alpha_{d,\bar{y}_t}}. \qquad (8.16)$$
The $\beta$ penalties correct for the number of clicks an item is expected to receive due to its displayed rank instead of its relevance. The affine estimator is unbiased under a single assumption, namely that the click probability of every item must be correlated with its relevance in every displayed ranking:
$$\forall t, \forall d \in D_{q_t}, \;\alpha_{d,\bar{y}_t} \neq 0. \qquad (8.17)$$
Thus, while this estimator can correct for position bias and trust bias, it cannot correct for item-selection bias. For a proof of these claims we refer to previous work by Vardasbi et al. [123].

We note that all of these estimators require knowledge of the position bias ($P(E = 1 \mid k)$) or trust bias ($\alpha$ and $\beta$). A lot of existing work has considered how these values can be inferred accurately [3, 30, 128]. The theory in this chapter assumes that these values are known.

This concludes our description of the existing counterfactual estimators on which our method expands. To summarize, each of these estimators corrects for position bias, one also corrects for item-selection bias, and another also for trust bias. Currently, there is no estimator that corrects for all three forms of bias together.
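As a compact side-by-side view of the three corrections, a sketch under assumed bias parameters (every number below is illustrative, not taken from the thesis's experiments):

```python
# One clicked item at a displayed rank; all numbers are illustrative assumptions.
c = 1.0                 # c_t(d): the item was clicked
P_E_rank = 0.5          # P(E=1 | y_t, d): examination prob. at the displayed rank
P_E_policy = 0.3        # P(E=1 | pi_t, q_t, d): examination prob. under the policy
alpha_k, beta_k = 0.4, 0.1  # trust-bias parameters of the displayed rank (Eq. 8.3)

delta_ips = c / P_E_rank               # Eq. 8.11: corrects position bias
delta_aware = c / P_E_policy           # Eq. 8.14: also corrects item-selection bias
delta_affine = (c - beta_k) / alpha_k  # Eq. 8.16: also corrects trust bias
print(delta_ips, delta_aware, delta_affine)  # 2.0, 3.333..., 2.25
```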
Related Work

One of the earliest approaches to LTR from clicks was introduced by Joachims [54]. It infers pairwise preferences between items from click logs and uses pairwise LTR to update an SVM ranking model. While this approach had some success, in later work Joachims et al. [58] note that position bias often incorrectly pushes the pairwise loss to flip the ranking displayed during logging. To avoid this biased behavior, Joachims et al. [58] proposed the idea of counterfactual LTR, in the spirit of earlier work by Wang et al. [127]. This led to estimators that correct for position bias using IPS weighting (see Section 8.3.2). This work sparked the field of counterfactual LTR, which has focused both on capturing interaction biases and on optimization methods that can correct for them. Methods for measuring position bias are based on EM optimization [128], a dual learning objective [5], or randomization [4, 30]; for trust bias only an EM-based approach is currently known [3]. Agarwal et al. [2] showed how counterfactual LTR can optimize neural networks and DCG-like metrics through upper-bounding. Oosterhuis and de Rijke [86] introduced an IPS estimator that can correct for item-selection bias (see Section 8.3.2 and Chapter 5), while also showing that the LambdaLoss framework [129] can be applied to counterfactual LTR (see Chapter 5). Lastly, Vardasbi et al. [123] proved that IPS estimators cannot correct for trust bias and introduced an affine estimator that is capable of doing so (see Section 8.3.2). There is currently no known estimator that can correct for position bias, item-selection bias, and trust bias simultaneously.

The other paradigm for LTR from clicks is online LTR [132]. The earliest method, Dueling Bandit Gradient Descent (DBGD), samples variations of a ranking model and compares them using online evaluation [41]; if an improvement is recognized, the model is updated accordingly. Most online LTR methods have increased the data-efficiency of DBGD [43, 111, 126]; later work found that DBGD is not effective at optimizing neural models [82] (Chapter 3) and often fails to find the optimal linear model even in ideal scenarios [84] (Chapter 4). In response to these limitations, alternative approaches for online LTR have been proposed. Pairwise Differentiable Gradient Descent (PDGD) takes a pairwise approach but weights pairs to correct for position bias [82] (Chapter 3). While PDGD was found to be very effective and robust to noise [50, 84] (Chapter 4), it can be proven that its gradient estimation is affected by position bias; thus we do not consider it to be unbiased. In contrast, Zhuang and Zuccon [136] introduced Counterfactual Online Learning to Rank (COLTR), which takes the DBGD approach but uses a form of counterfactual evaluation to compare candidate models. Despite making use of counterfactual estimation, Zhuang and Zuccon [136] propose the method solely for online LTR.

Interestingly, with COLTR the line between online and counterfactual LTR methods starts to blur. Recent work by Jagerman et al. [50] applied the original counterfactual approach [58] as an online method and found that it leads to improvements. Furthermore, Ai et al. [6] noted that with a small adaptation PDGD can be applied to historical data. Although this means that some existing methods can already be applied both online and counterfactually, no method has been found that is the most reliable choice in both scenarios.
An Estimator Oblivious to Online Interventions

Before we propose the main contribution of this chapter, the intervention-aware estimator, we first introduce an estimator that simultaneously corrects for position bias, item-selection bias, and trust bias, without considering the effects of interventions. The resulting intervention-oblivious estimator will subsequently serve as a method to contrast the intervention-aware estimator with.

Section 8.3.2 described how the policy-aware estimator corrects for item-selection bias by taking into account the behavior of the logging policy used to gather clicks [86]. Furthermore, Section 8.3.2 also detailed how the affine estimator corrects for trust bias by applying an affine transformation to individual clicks [123]. We will now show that a single estimator can correct for both item-selection bias and trust bias simultaneously, by combining the approaches of both these existing estimators.

First we note that the probability of a click conditioned on a single logging policy $\pi_t$ can be expressed as:
$$P(C = 1 \mid d, \pi_t, q) = \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\big(\alpha_{d,\bar{y}}\, P(R = 1 \mid d, q) + \beta_{d,\bar{y}}\big) = \mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q]\, P(R = 1 \mid d, q) + \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q], \qquad (8.18)$$
where the expected values of $\alpha$ and $\beta$ conditioned on $\pi_t$ are:
$$\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q] = \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\, \alpha_{d,\bar{y}}, \qquad \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q] = \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\, \beta_{d,\bar{y}}. \qquad (8.19)$$
By inverting Eq. 8.18, the relevance probability can be obtained from the click probability. We introduce our intervention-oblivious estimator, which applies this transformation to correct for bias:
$$\hat{\Delta}_{\mathrm{IO}}(d \mid q_t, c_t) = \frac{c_t(d) - \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q_t]}{\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q_t]}. \qquad (8.20)$$
The intervention-oblivious estimator brings together the policy-aware and affine estimators: to every click it applies an affine transformation based on the logging policy's behavior. Unlike existing estimators, we can prove that the intervention-oblivious estimator is unbiased w.r.t. our assumed click model (Section 8.2).
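Before the proof, a minimal sketch of how the Eq. 8.20 correction could be computed in practice: the expectations of Eq. 8.19 are approximated by sampling rankings from the logging policy (a toy softmax policy here; all parameters are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

n_docs, K = 5, 3
alpha = np.array([0.4, 0.3, 0.2])    # assumed trust-bias parameters per rank
beta  = np.array([0.2, 0.1, 0.05])
scores = rng.normal(size=n_docs)     # log-scores of the softmax logging policy pi_t

def sample_ranking():
    """Sample a top-K ranking from the logging policy, rank by rank."""
    remaining = list(range(n_docs))
    out = []
    for _ in range(K):
        p = np.exp(scores[remaining])
        p /= p.sum()
        out.append(remaining.pop(rng.choice(len(remaining), p=p)))
    return out

# Monte-Carlo estimates of E_y[alpha_d | pi_t, q] and E_y[beta_d | pi_t, q] (Eq. 8.19);
# items that fall outside the top-K of a sampled ranking contribute zero.
M = 50_000
exp_alpha = np.zeros(n_docs)
exp_beta = np.zeros(n_docs)
for _ in range(M):
    for k, d in enumerate(sample_ranking()):
        exp_alpha[d] += alpha[k]
        exp_beta[d] += beta[k]
exp_alpha /= M
exp_beta /= M

def delta_io(d, clicked):
    """Intervention-oblivious estimate for a single observation (Eq. 8.20)."""
    return (float(clicked) - exp_beta[d]) / exp_alpha[d]

print(delta_io(0, True), delta_io(0, False))
```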
Theorem 8.1. The estimated reward $\hat{R}$ (Eq. 8.10) using the intervention-oblivious estimator (Eq. 8.20) is unbiased w.r.t. the true reward $R$ (Eq. 8.7) under two assumptions: (1) our click model (Eq. 8.5) holds, and (2) the click probability on every item, conditioned on the logging policy per timestep $\pi_t$, is correlated with relevance:
$$\forall t, \forall d \in D_{q_t}, \;\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q_t] \neq 0. \qquad (8.21)$$

Proof.
Using Eq. 8.18 and Eq. 8.21, the relevance probability can be derived from the click probability by:
$$P(R = 1 \mid d, q) = \frac{P(C = 1 \mid d, \pi_t, q) - \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q]}{\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q]}. \qquad (8.22)$$
Eq. 8.22 can be used to show that $\hat{\Delta}_{\mathrm{IO}}$ is an unbiased indicator of relevance:
$$\mathbb{E}_{\bar{y},c}\big[\hat{\Delta}_{\mathrm{IO}}(d \mid q_t, c_t) \,\big|\, \pi_t\big] = \mathbb{E}_{\bar{y},c}\left[\frac{c_t(d) - \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q_t]}{\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q_t]} \,\middle|\, \pi_t, q_t\right] = \frac{\mathbb{E}_{\bar{y},c}[c_t(d) \mid \pi_t, q_t] - \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q_t]}{\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q_t]} = \frac{P(C = 1 \mid d, \pi_t, q_t) - \mathbb{E}_{\bar{y}}[\beta_d \mid \pi_t, q_t]}{\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q_t]} = P(R = 1 \mid d, q_t). \qquad (8.23)$$
Finally, combining Eq. 8.7 with Eq. 8.10 and Eq. 8.23 reveals that $\hat{R}$ based on the intervention-oblivious estimator $\hat{\Delta}_{\mathrm{IO}}$ is unbiased w.r.t. $R$:
$$\mathbb{E}_{t,q,\bar{y},c}\big[\hat{R}(\pi \mid \mathcal{D})\big] = \sum_q P(q) \sum_{d \in D_q} \lambda(d \mid D_q, \pi, q)\, \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\bar{y},c}\big[\hat{\Delta}_{\mathrm{IO}}(d \mid c, q) \,\big|\, \pi_t, q\big] = \sum_q P(q) \sum_{d \in D_q} \lambda(d \mid D_q, \pi, q)\, P(R = 1 \mid d, q) = R(\pi). \qquad (8.24)$$

Figure 8.1: Example of an online intervention and the weights used by the intervention-oblivious and intervention-aware estimators for a single item as more data is gathered (plotted per timestep: $\mathbb{E}_{\bar{y}}[\alpha_d \mid \pi_t, q]$ and $\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q]$, together with their inverses, the click weights).

Existing estimators for counterfactual LTR are designed for a scenario where the logging policy is static:
$$\forall (\pi_t, \pi_{t'}) \in \mathcal{D}, \;\pi_t = \pi_{t'}. \qquad (8.25)$$
However, we note that if an online intervention takes place [50], meaning that the logging policy was updated during the gathering of data:
$$\exists (\pi_t, \pi_{t'}) \in \mathcal{D}, \;\pi_t \neq \pi_{t'}, \qquad (8.26)$$
the intervention-oblivious estimator is still unbiased. This was already proven in Theorem 8.1, because its assumptions cover both the scenario where online interventions take place and the scenario where they do not.

However, the individual corrections of the intervention-oblivious estimator are only based on the single logging policy that was deployed at the timestep of each specific click. It is completely oblivious to the logging policies applied at different timesteps. Although this does not lead to bias in its estimation, it does result in unintuitive behavior. We illustrate this behavior in Figure 8.1: here, a logging policy that results in $\mathbb{E}[\alpha_d \mid \pi_t, q] = 0.25$ for an item $d$ is deployed during the first $t \leq 100$ timesteps. Then an online intervention takes place and the logging policy is updated so that for $t > 100$, $\mathbb{E}[\alpha_d \mid \pi_t, q] = 0.05$. The intervention-oblivious estimator weights clicks inversely to $\mathbb{E}[\alpha_d \mid \pi_t]$; so clicks for $t \leq 100$ will be weighted by $1/0.25 = 4$ and clicks for $t > 100$ by $1/0.05 = 20$. Thus, there is a sharp and sudden difference in how clicks are treated before and after $t = 100$. What is unintuitive about this example is that the way clicks are treated after $t = 100$ is completely independent of what the situation was before $t = 100$. For instance, consider another item $d'$ where $\forall t, \mathbb{E}[\alpha_{d'} \mid \pi_t, q] = 0.05$. If both $d$ and $d'$ are clicked on timestep $t = 101$, these clicks would both be weighted by $20$, despite the fact that $d$ has so far been treated completely differently than $d'$. One would expect that in such a case the click on $d$ should be weighted less, to compensate for the high $\mathbb{E}[\alpha_d \mid \pi_t, q]$ it had in the first $100$ timesteps. The question is whether such behavior can be incorporated in an estimator without introducing bias.

The Intervention-Aware Estimator

Our goal for the intervention-aware estimator is to find an estimator whose individual corrections are not only based on single logging policies, but instead consider the entire collection of logging policies used to gather the data $\mathcal{D}$. Importantly, this estimator should also be unbiased w.r.t. position bias, item-selection bias and trust bias.

For ease of notation, we use $\Pi_T$ for the set of policies that gathered the data in $\mathcal{D}$: $\Pi_T = \{\pi_1, \pi_2, \ldots, \pi_T\}$. The probability of a click can be conditioned on this set:
$$P(C = 1 \mid d, \Pi_T, q) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\big(\alpha_{d,\bar{y}}\, P(R = 1 \mid d, q) + \beta_{d,\bar{y}}\big) = \mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q]\, P(R = 1 \mid d, q) + \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q], \qquad (8.27)$$
where the expected values of $\alpha$ and $\beta$ conditioned on $\Pi_T$ are:
$$\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q] = \frac{1}{T} \sum_{t=1}^{T} \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\, \alpha_{d,\bar{y}}, \qquad \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q] = \frac{1}{T} \sum_{t=1}^{T} \sum_{\bar{y}} \pi_t(\bar{y} \mid q)\, \beta_{d,\bar{y}}. \qquad (8.28)$$
Thus $P(C = 1 \mid d, \Pi_T, q)$ gives us the probability of a click given that any policy from $\Pi_T$ could be deployed. We propose our intervention-aware estimator, which corrects for bias using the expectations conditioned on $\Pi_T$:
$$\hat{\Delta}_{\mathrm{IA}}(d \mid q_t, c_t) = \frac{c_t(d) - \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q_t]}{\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q_t]}. \qquad (8.29)$$
The salient difference with the intervention-oblivious estimator is that the expectations are conditioned on $\Pi_T$, i.e., on all logging policies in $\mathcal{D}$, instead of on an individual logging policy $\pi_t$. While the difference with the intervention-oblivious estimator seems small, our experimental results show that the differences in performance are actually quite sizeable. Lastly, we note that when no interventions take place, the intervention-oblivious and intervention-aware estimators are equivalent. Because the intervention-aware estimator is the only existing counterfactual LTR estimator whose corrections are influenced by online interventions, we consider it to be a step that helps to bridge the gap between counterfactual and online LTR.

Before we revisit our online intervention example with our novel intervention-aware estimator, we prove that it is unbiased w.r.t. our assumed click model (Section 8.2).
Theorem 8.2. The estimated reward $\hat{R}$ (Eq. 8.10) using the intervention-aware estimator (Eq. 8.29) is unbiased w.r.t. the true reward $R$ (Eq. 8.7) under two assumptions: (1) our click model (Eq. 8.5) holds, and (2) the click probability on every item, conditioned on the set of logging policies $\Pi_T$, is correlated with relevance:
$$\forall q, \forall d \in D_q, \;\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q] \neq 0. \qquad (8.30)$$

Proof.
Using Eq. 8.27 and Eq. 8.30, the relevance probability can be derived from the click probability by:
$$P(R = 1 \mid d, q) = \frac{P(C = 1 \mid d, \Pi_T, q) - \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q]}{\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q]}. \qquad (8.31)$$
Eq. 8.31 can be used to show that $\hat{\Delta}_{\mathrm{IA}}$ is an unbiased indicator of relevance:
$$\mathbb{E}_{t,\bar{y},c}\big[\hat{\Delta}_{\mathrm{IA}}(d \mid q_t, c_t) \,\big|\, \Pi_T\big] = \mathbb{E}_{t,\bar{y},c}\left[\frac{c_t(d) - \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q_t]}{\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q_t]} \,\middle|\, \Pi_T, q_t\right] = \frac{\mathbb{E}_{t,\bar{y},c}[c_t(d) \mid \Pi_T, q_t] - \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q_t]}{\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q_t]} = \frac{P(C = 1 \mid d, \Pi_T, q_t) - \mathbb{E}_{t,\bar{y}}[\beta_d \mid \Pi_T, q_t]}{\mathbb{E}_{t,\bar{y}}[\alpha_d \mid \Pi_T, q_t]} = P(R = 1 \mid d, q_t). \qquad (8.32)$$
Finally, combining Eq. 8.32 with Eq. 8.10 and Eq. 8.7 reveals that $\hat{R}$ based on the intervention-aware estimator $\hat{\Delta}_{\mathrm{IA}}$ is unbiased w.r.t. $R$:
$$\mathbb{E}_{t,q,\bar{y},c}\big[\hat{R}(\pi \mid \mathcal{D})\big] = \sum_q P(q) \sum_{d \in D_q} \lambda(d \mid D_q, \pi, q)\, \mathbb{E}_{t,\bar{y},c}\big[\hat{\Delta}_{\mathrm{IA}}(d \mid c, q) \,\big|\, \Pi_T, q\big] = \sum_q P(q) \sum_{d \in D_q} \lambda(d \mid D_q, \pi, q)\, P(R = 1 \mid d, q) = R(\pi). \qquad (8.33)$$

We will now revisit the example in Figure 8.1, but this time consider how the intervention-aware estimator treats item $d$. Unlike with the intervention-oblivious estimator, clicks are weighted by $\mathbb{E}[\alpha_d \mid \Pi_T]$, which means that the exact timestep $t$ of a click does not matter, as long as $t < T$. Furthermore, the weight of a click can change as the total number of timesteps $T$ increases. In other words, as more data is gathered, the intervention-aware estimator retroactively updates the weights of all previously gathered clicks.

We see that this behavior avoids the sharp difference in weights between clicks occurring before the intervention ($t \leq 100$) and after it ($t > 100$). For instance, a click on $d$ occurring at $t = 101$ while $T = 400$ results in $\mathbb{E}[\alpha_d \mid \Pi_T] = 0.1$ and thus a weight of $1/0.1 = 10$. This is much lower than the intervention-oblivious weight of $1/0.05 = 20$, because the intervention-aware estimator also considers the initial period where $\mathbb{E}[\alpha_d \mid \pi_t, q]$ was high.
Thus we see that the intervention-aware estimator has the behavior we intuitively expected: it weights clicks based on how the item was treated throughout all timesteps. In this example, this leads to weights considerably smaller than those used by the intervention-oblivious estimator. In IPS estimators, small propensities and the resulting large weights are known to lead to high variance [58]; thus we may expect the intervention-aware estimator to reduce variance in this example.
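The arithmetic behind this example is simple enough to verify directly; a short sketch (the values are exactly those of the example above):

```python
# Arithmetic behind the Figure 8.1 example.
T_pre, a_pre = 100, 0.25    # E[alpha_d | pi_t, q] before the intervention (t <= 100)
a_post = 0.05               # E[alpha_d | pi_t, q] after the intervention (t > 100)

# Intervention-oblivious weights depend only on the policy active at the click:
print(1 / a_pre, 1 / a_post)                      # 4.0 and 20.0

# The intervention-aware expectation (Eq. 8.28) averages over all deployed policies:
T = 400
exp_alpha_ia = (T_pre * a_pre + (T - T_pre) * a_post) / T
print(exp_alpha_ia, 1 / exp_alpha_ia)             # 0.1 and 10.0
```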
While the intervention-aware estimator takes into account the effect of interventions, it does not prescribe what interventions should take place. In fact, it will work with any interventions that result in Eq. 8.30 being true, including the situation where no intervention takes place at all. For clarity, we describe here the intervention approach we applied during our experiments. Algorithm 8.1 displays our online/counterfactual approach. As input it requires a starting policy ($\pi_0$), a choice for $\lambda$, the $\alpha$ and $\beta$ parameters, a set of intervention timesteps ($\Phi$), and the final timestep $T$.

Algorithm 8.1 Our Online/Counterfactual LTR Approach
1: Input: starting policy $\pi_0$; metric weight function $\lambda$; inferred bias parameters $\alpha$ and $\beta$; intervention steps $\Phi$; end-time $T$.
2: $\mathcal{D} \leftarrow \{\}$  // initialize data container
3: $\pi \leftarrow \pi_0$  // initialize logging policy
4: for $i \in \Phi$ do
5:     $\mathcal{D} \leftarrow \mathcal{D} \cup \mathrm{gather}(\pi, i - |\mathcal{D}|)$  // observe $i - |\mathcal{D}|$ timesteps
6:     $\pi \leftarrow \mathrm{optimize}(\mathcal{D}, \alpha, \beta, \pi_0)$  // optimize based on available data
7: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathrm{gather}(\pi, T - |\mathcal{D}|)$  // expand data to $T$
8: $\pi \leftarrow \mathrm{optimize}(\mathcal{D}, \alpha, \beta, \pi_0)$  // optimize based on final data
9: return $\pi$

The algorithm starts by initializing an empty set to store the gathered interaction data (Line 2) and initializes the logging policy with the provided starting policy $\pi_0$ (Line 3). Then, for each timestep $i \in \Phi$, the dataset is expanded using the current logging policy so that $|\mathcal{D}| = i$ (Line 5). In other words, for $i - |\mathcal{D}|$ timesteps $\pi$ is used to display rankings to user-issued queries, and the resulting interactions are added to $\mathcal{D}$. Then a policy is optimized using the available data in $\mathcal{D}$, and this policy becomes the new logging policy (Line 6). For this optimization, we split the available data into training and validation partitions in order to do early stopping to prevent overfitting. We use stochastic gradient descent with $\pi_0$ as the initial model; this practice is based on the assumption that $\pi_0$ performs better than a randomly initialized model. Thus, during optimization, gradient calculation uses the intervention-aware estimator on the training partition of $\mathcal{D}$, and after each epoch, optimization is stopped if the intervention-aware estimator on the validation partition of $\mathcal{D}$ suspects overfitting. Each iteration results in an intervention, as the resulting policy replaces the logging policy and thus changes the way future data is logged. After iterating over $\Phi$ is completed, more data is gathered so that $|\mathcal{D}| = T$, and optimization is performed once more (Lines 7-8). The final policy is the end result of the procedure.

We note that, depending on $\Phi$, our approach can be either online, counterfactual, or somewhere in between. If $\Phi = \emptyset$, the approach is fully counterfactual since all data is gathered using the static $\pi_0$. Conversely, if $\Phi = \{1, 2, 3, \ldots, T\}$, it is fully online since the logging policy is updated at every timestep. In practice, we expect a fully online procedure to be infeasible, as it is computationally expensive and user queries may be issued faster than optimization can be performed. In our experiments we investigate the effect of the number of interventions on the approach's performance.
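A skeletal Python rendering of Algorithm 8.1 may help clarify the control flow; `gather` and `optimize` below are toy stand-ins supplied by the caller (the actual click logging and intervention-aware optimization are as described above), and the intervention schedule is exponentially spaced as in the experiments:

```python
import numpy as np

def run_ltr(pi_0, gather, optimize, phi, T):
    """Sketch of Algorithm 8.1: interleave click logging with interventions."""
    data = []                               # the interaction log D
    pi = pi_0                               # current logging policy
    for i in sorted(phi):
        data += gather(pi, i - len(data))   # observe until |D| = i
        pi = optimize(data, pi_0)           # intervention: new logging policy,
                                            # optimization restarts from pi_0
    data += gather(pi, T - len(data))       # expand the log to T timesteps
    return optimize(data, pi_0)             # final optimization

# Toy stand-ins so the sketch runs end-to-end:
gather = lambda pi, n: [(pi, None)] * max(n, 0)  # one logged record per timestep
optimize = lambda data, pi_0: len(data)          # pretend the model improves with data

phi = np.geomspace(10, 1_000, num=5).astype(int)  # exponentially spaced interventions
print(run_ltr(0, gather, optimize, phi, T=5_000)) # -> 5000
```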
Experimental Setup

Our experiments aim to answer the following research questions:

RQ1
Does the intervention-aware estimator lead to higher performance than existing counterfactual LTR estimators when online interventions take place?
RQ2
Does the intervention-aware estimator lead to performance comparable with existing online LTR methods?

We use the semi-synthetic experimental setup that is common in existing work on both online LTR [43, 82, 84, 136] and counterfactual LTR [58, 92, 123]. In this setup, queries and documents are sampled from a dataset based on commercial search logs, while user interactions and rankings are simulated using probabilistic click models. The advantage of this setup is that it allows us to investigate the effects of online interventions on a large scale, while also being easy to reproduce by researchers without access to live ranking systems.

We use the publicly available Yahoo Webscope dataset [17], which consists of 29,921 queries with, on average, 24 documents preselected per query. Query-document pairs are represented by 700 features and five-grade relevance annotations ranging from not relevant (0) to perfectly relevant (4). The queries are divided into training, validation and test partitions.

At each timestep, we simulate a user-issued query by uniformly sampling from the training and validation partitions. Subsequently, the preselected documents are ranked according to the logging policy, and user interactions are simulated on the top-5 of the ranking using a probabilistic click model. We apply Eq. 8.4 with rank-dependent $\alpha$ and $\beta$ vectors covering the top-5 ranks; the relevance probabilities are based on the annotations from the dataset: $P(R = 1 \mid d, q) = 0.25 \cdot \mathrm{relevance\_label}(d, q)$. The values of $\alpha$ and $\beta$ were chosen based on those reported by Agarwal et al. [3], who inferred them from real-world user behavior. In doing so, we aim to emulate a setting where realistic levels of position bias, item-selection bias, and trust bias are present.

All counterfactual methods use the approach described in Section 8.6.2. To simulate a production ranker policy, we use supervised LTR to train a ranking model on 1% of the training partition [58]. The resulting production ranker has much better performance than a randomly initialized model, yet still leaves room for improvement. We use the production ranker as the initial logging policy. The size of $\Phi$ (the intervention timesteps) varies per run, and the timesteps in $\Phi$ are evenly spread on an exponential scale. All ranking models are neural networks with two hidden layers, each containing 32 hidden units with sigmoid activations. Gradients are calculated using a Monte-Carlo method following Oosterhuis and de Rijke [85] (Chapter 7). All policies apply a softmax to the document scores produced by the ranking models to obtain a probability distribution over documents. Clipping is only applied on the training clicks: the denominators of every estimator are clipped by $1/\sqrt{|\mathcal{D}|}$ to reduce variance. Early stopping is applied based on counterfactual estimates of the loss using (unclipped) validation clicks.

The following methods are compared: (i) the intervention-aware estimator; (ii) the intervention-oblivious estimator; (iii) the policy-aware estimator [86] (Chapter 5); (iv) the affine estimator [123]; (v) PDGD [82] (Chapter 3), which we apply both online and as a counterfactual method; as noted by Ai et al. [6], this can be done by separating the logging model from the learned model and basing the debiasing weights on the logging model; (vi) Biased PDGD, identical to PDGD except that we do not apply the debiasing weights; and (vii) COLTR [136].
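As a concrete illustration of the click simulation described above, a minimal sketch; the $\alpha$ and $\beta$ vectors below are illustrative assumptions (the experiments use values inferred by Agarwal et al. [3]), and the top-5 is taken deterministically here, whereas the actual policies are softmax-based:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative trust-bias vectors for the top-5 (not the experiments' exact values):
alpha = np.array([0.50, 0.40, 0.30, 0.20, 0.10])
beta  = np.array([0.30, 0.20, 0.10, 0.05, 0.02])

def simulate_interaction(scores, labels):
    """Display a top-5 and sample clicks following Eq. 8.4."""
    top5 = np.argsort(-scores)[:5]      # deterministic top-5 for simplicity
    p_rel = 0.25 * labels[top5]         # P(R=1|d,q) from the 0-4 graded labels
    clicks = rng.random(5) < alpha * p_rel + beta
    return top5, clicks

labels = rng.integers(0, 5, size=24)    # ~24 preselected documents per query
scores = rng.normal(size=24)            # document scores from the logging policy
print(simulate_interaction(scores, labels))
```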
We compute the Normalized DCG (NDCG) of both the logging policy and of a policy trained on all available data. Every reported result is the average of 20 independent runs; figures plot the mean, and shaded areas indicate the standard deviation.

Results and Discussion

To answer the first research question, whether the intervention-aware estimator leads to higher performance than existing counterfactual LTR estimators when online interventions take place, we consider Figure 8.2, which displays the performance of LTR using different counterfactual estimators.

First we consider the top of Figure 8.2, which displays performance in the counterfactual setting where the logging policy is static. We clearly see that the affine estimator converges at a suboptimal point, a strong indication of bias. The most probable cause is that the affine estimator is heavily affected by the presence of item-selection bias. In contrast, neither the policy-aware estimator nor the intervention-aware estimator has converged within the number of queries in our runs. (Since under a static logging policy the intervention-aware and the intervention-oblivious estimators are equivalent, our conclusions here apply to both.) However, very clearly, the intervention-aware estimator quickly reaches a higher performance. While the theory guarantees that it will converge at the optimal performance, we were unable to observe the number of queries it requires to do so. From the results in the counterfactual setting, we conclude that by correcting for position bias, trust bias, and item-selection bias, the intervention-aware estimator already performs better without online interventions.
Figure 8.2: Comparison of counterfactual LTR estimators (Full-Information, Affine, Intervention-Oblivious, Policy-Aware, and Intervention-Aware; NDCG against the number of logged queries). Top: counterfactual runs (no interventions); Bottom: online runs (50 interventions).

Second, we turn to the bottom of Figure 8.2, which considers the online setting where the estimators perform 50 online interventions during logging. We see that online interventions have a positive effect on all estimators, leading to a higher performance for the affine and policy-aware estimators as well. However, interventions also introduce an enormous amount of variance for the policy-aware and intervention-oblivious estimators. In stark contrast, the variance of the intervention-aware estimator hardly increases, while it learns much faster than the other estimators.

Thus we answer the first research question positively: the intervention-aware estimator leads to higher performance than existing estimators; moreover, its data-efficiency becomes even greater when online interventions take place.
To better understand how much the intervention-aware estimator benefits from online interventions, we compare its performance under varying numbers of interventions in Figure 8.3. It shows both the performance of the model trained on the logged data (top) and the performance of the logging policy, which reveals when interventions take place (bottom). When comparing both graphs, we see that interventions lead to noticeable immediate improvements in data-efficiency. For instance, when only 5 interventions take place, the intervention-aware estimator needs more than 20 times the amount of data to reach optimal performance compared to the run with 50 interventions.
Figure 8.3: Effect of online interventions on LTR with the intervention-aware estimator (counterfactual, and with 1, 5, 10, 25, and 50 interventions; Full-Information is shown as a reference).

Despite these speedups, there are no large increases in variance. From these observations, we conclude that the intervention-aware estimator can effectively and reliably utilize the effect of online interventions for optimization, leading to enormous increases in data-efficiency.
In order to answer the second research question, whether the intervention-aware estimator leads to performance comparable with existing online LTR methods, we consider Figure 8.4, which displays the performance of two online LTR methods, PDGD and COLTR, and of the intervention-aware estimator with 100 online interventions.

First, we notice that COLTR is unable to outperform its initial policy; moreover, we see its performance drop as the number of iterations increases. We were unable to find hyperparameters for COLTR where this did not occur. It seems likely that COLTR is unable to deal with trust bias, thus causing this poor performance. However, we note that Zhuang and Zuccon [136] already showed that COLTR performs poorly when no bias or noise is present, suggesting that it is perhaps an unstable method overall.

Second, we see that the difference between PDGD and the intervention-aware estimator becomes negligible after enough queries have been logged, despite PDGD running fully online and the intervention-aware estimator performing only 100 interventions in total. We do note that PDGD initially outperforms the intervention-aware estimator; thus it appears that PDGD works better with low numbers of interactions.
Figure 8.4: Comparison with online LTR methods (Full-Information, COLTR (online), PDGD (online), Biased-PDGD (online), and Intervention-Aware (100 interventions)).

Additionally, we should also consider the difference in overhead: while PDGD requires an infrastructure that allows for fully online learning, the intervention-aware estimator only requires 100 moments of intervention, yet has comparable performance after a short initial period. By comparing Figure 8.4 to Figure 8.2, we see that the intervention-aware estimator is the first counterfactual LTR estimator that leads to stable performance while being comparably efficient to online LTR methods.

Thus we answer the second research question positively: besides an initial period of lower performance, the intervention-aware estimator has performance comparable to online LTR, and it only requires 100 online interventions to do so. To the best of our knowledge, it is the first counterfactual LTR method that can achieve this feat.
Now that we have concluded that the intervention-aware estimator reaches performance comparable to PDGD when enough online interventions take place, the opposite question seems equally interesting:
Does PDGD applied counterfactually provide performance comparable to existing counterfactual LTR methods?
To answer this question, we ran PDGD in a counterfactual way following Ai et al. [6], both fully counterfactually and with only 100 interventions. The results of these runs are displayed in Figure 8.5. Quite surprisingly, PDGD run counterfactually or with 100 interventions reaches much higher performance than the intervention-aware estimator without interventions. However, after an initial peak, the performance of PDGD starts to drop when it is not run fully online.
Figure 8.5: Effect of online interventions on PDGD (PDGD run online, fully counterfactually, and with 100 interventions, compared with the intervention-aware estimator run counterfactually).

This drop cannot be attributed to overfitting, since online PDGD does not show the same behavior. Therefore, we must conclude that PDGD is biased when not run fully online. This conclusion does not contradict the existing theory, since in Chapter 3 we only proved that it is unbiased w.r.t. pairwise preferences. In other words, PDGD is not proven to unbiasedly optimize a ranking metric, and therefore it is also not proven to converge on the optimal model. This drop is particularly unsettling because PDGD is a continuous learning algorithm: there is no known early-stopping method for PDGD. Yet these results show there is a great risk in running PDGD for too many iterations if it is not applied fully online. To answer our PDGD question: although PDGD reaches high performance when run counterfactually and appears to have great data-efficiency initially, it appears to converge at a suboptimal, biased model. Thus we cannot conclude that PDGD is a reliable method for counterfactual LTR.

To better understand PDGD, we removed its debiasing weights, resulting in the performance shown in Figure 8.4 (Biased-PDGD). Clearly, PDGD needs these weights to reach optimal performance. Similarly, from Figure 8.5 we see it also needs to be run fully online. This makes the choice between the intervention-aware estimator and PDGD complicated: on the one hand, PDGD does not require us to know the $\alpha$ and $\beta$ parameters, unlike the intervention-aware estimator; furthermore, PDGD has better initial data-efficiency even when not run fully online. On the other hand, there are no theoretical guarantees for the convergence of PDGD, and we have observed that not running it fully online can lead to large drops in performance. It seems the choice ultimately depends on what guarantees a practitioner prefers.

Conclusion

In this chapter, we have introduced the intervention-aware estimator: an extension of existing counterfactual approaches that corrects for position bias, trust bias, and item-selection bias, while also considering the effect of online interventions. Our results show that the intervention-aware estimator outperforms existing counterfactual LTR estimators and greatly benefits from online interventions in terms of data-efficiency. With only 100 interventions it is able to reach a performance comparable to state-of-the-art online LTR methods. These findings allow us to answer the thesis research question
RQ9: whether the counterfactual LTR approach can be extended to perform highly effective online LTR. From our experimental results, it appears that the answer is positive: using the intervention-aware estimator and 100 online interventions, the performance of state-of-the-art online LTR methods can be matched.

With the introduction of the intervention-aware estimator, we hope to further unify the fields of online LTR and counterfactual LTR, as it appears to be the most reliable method for both settings. Future work could investigate what kinds of interventions work best for the intervention-aware estimator, since we have already seen in Chapter 7 that such an approach is effective for counterfactual/online ranking evaluation.

In retrospect, this chapter has put many findings from previous chapters in a different perspective. Chapter 3 introduced the concept of unbiasedness w.r.t. pairwise preferences and proved that PDGD has this property. The experimental results of this chapter have shown that unbiasedness w.r.t. pairwise preferences is not enough to guarantee convergence at an optimal level of NDCG. Furthermore, Chapter 4 showed PDGD is very robust to noise and bias, but with the results of this chapter we now know that PDGD needs to be run online for this robustness. The policy-aware estimator of Chapter 5 is a precursor to the intervention-aware estimator of this chapter. While Chapter 5 realized that taking the logging policy into account is beneficial to counterfactual estimation, this chapter showed that taking the idea further, by accounting for all logging policies, provides even more benefits. Lastly, Chapter 7 looked at bridging the divide between online and counterfactual evaluation; in retrospect, the results of Chapter 7 might have been even better had it used the intervention-aware estimator. Together, Chapter 7 and this chapter suggest that an online method should both optimize its logging policy and use an intervention-aware estimator to learn, leaving a potentially very fruitful direction for future work.
Notation Reference for Chapter 8

Notation – Description
$k$ – the number of items that can be displayed in a single ranking
$t$ – a timestep number
$T$ – the total number of timesteps (gathered so far)
$\mathcal{D}$ – the available data
$R(\pi)$ – the metric reward of a policy $\pi$
$\hat{R}(\pi \mid \mathcal{D})$ – an estimate of the metric reward of a policy $\pi$
$q$ – a user-issued query
$D_q$ – the set of items to be ranked for query $q$
$d$ – an item to be ranked
$y$ – a ranked list
$\pi$ – a ranking policy
$\pi(y \mid q)$ – the probability that policy $\pi$ displays ranking $y$ for query $q$
$\pi(y_x \mid y^{x-1}, q)$ – the probability of $\pi$ adding item $y_x$ given $y^{x-1}$ is already placed
$\Pi_T$ – the set of logging policies deployed up to timestep $T$
$\lambda(d \mid D_q, \pi, q)$ – a metric function that weights items depending on their rank
$c(d)$ – a function indicating item $d$ was clicked in click pattern $c$
$o(d)$ – a function indicating item $d$ was observed

Conclusions
In Section 1.1 we stated the overarching question that we aim to answer in this thesis:
Could there be a single general theoretically-grounded approach that has competitive performance for both evaluation and Learning to Rank (LTR) from user clicks on rankings, in both the counterfactual and online settings?
The thesis has explored this question by looking at both the online and the counterfactual families of LTR methods and, in particular, by examining whether one of these approaches can be extended to be effective in both the online and counterfactual LTR scenarios. In this final chapter, we summarize the findings of the thesis and discuss how they reflect on our overarching thesis question. Finally, we consider future research directions for the field of LTR from user clicks.
Main Findings

This section looks back at the thesis research questions posed in Section 1.1. We divide our discussion into two parts, discussing online methods and counterfactual methods for LTR and evaluation, respectively.
The first part of the thesis focused on online LTR methods. Chapter 2 looked at multileaving methods [108] for comparing multiple ranking systems at once and asked:
RQ1
Does the effectiveness of online ranking evaluation methods scale to large comparisons?

We introduced the novel Pairwise Preference Multileaving (PPM) algorithm; PPM bases evaluation on inferred pairwise item preferences. Furthermore, PPM is proven to have fidelity – it is provably unbiased in unambiguous cases [44] – and considerateness – it is safe w.r.t. the user experience during the gathering of clicks. From our theoretical analysis, we find that no other existing multileaving method manages to meet both criteria. In addition, our empirical results indicate that using PPM leads to a much lower number of errors, in particular when applied to large-scale comparisons. Therefore, we answered RQ1 positively: PPM is shown to be effective at online ranking evaluation for large-scale comparisons.

Besides Chapter 2, online evaluation was also the subject of Chapter 7, which addressed the question:
RQ8
Are existing interleaving methods truly capable of unbiased evaluation w.r.t. position bias?

We showed that, under a basic rank-based model of position bias (common in counterfactual LTR [4, 58, 128]), three of the most prevalent interleaving algorithms are not unbiased: Team Draft Interleaving [99], Probabilistic Interleaving [41], and Optimized Interleaving [96]. For each of these three methods, we showed that situations exist where the binary outcome of the method does not agree with the expected binary difference in Click-Through Rate (CTR). In other words, under a basic assumption of position bias, situations exist where these interleaving methods are expected to prefer one system over another, while the latter system has a higher expected CTR than the former. Thus, we answer RQ8 negatively: the most prevalent interleaving methods are not unbiased w.r.t. position bias.

This finding can be extended to the multileaving methods Team-Draft Multileaving [108], Probabilistic Multileaving [109], and Optimized Multileaving [108], since they are equivalent to their interleaving counterparts when only two systems are compared. While we did not examine it in this thesis, it is likely that PPM also fails to be unbiased under basic position bias. Nonetheless, an evaluation method can still be effective despite being biased, for instance, if the systematic error is small or if situations where bias occurs are rare.

Chapter 3 looked at online LTR methods. Existing online LTR methods have relied on sampling model variants and comparing them using online evaluation [132]. In response to this existing online LTR approach, Chapter 3 considered the question:
RQ2
Is online LTR possible without relying on model-sampling and online evaluation?

We answered this question positively by introducing Pairwise Differentiable Gradient Descent (PDGD), an online LTR method that learns from inferred pairwise preferences and uses a debiased pairwise loss. Besides proving that PDGD is unbiased w.r.t. pairwise preferences, our experimental results show that PDGD greatly outperforms the previous state-of-the-art Dueling Bandit Gradient Descent (DBGD) [132] algorithm in terms of data-efficiency and convergence. Furthermore, PDGD is the first online LTR method that can effectively optimize neural networks as ranking models.

Chapter 8 took another look at PDGD, in particular at conditions under which PDGD is no longer effective. The results in Chapter 8 show that PDGD fails to reach optimal performance without debiasing weights or when not applied fully online. A particularly worrisome observation was that, when not applied fully online, the performance of PDGD can degrade as more interactions are gathered. While this behavior looks similar to overfitting, it is not, since PDGD does not display it when applied online. Instead, it appears that PDGD becomes severely biased when not applied fully online. Therefore, we can conclude that the fact that PDGD is unbiased w.r.t. pairwise preferences is not enough to guarantee unbiased optimization. It appears that we do not fully understand why PDGD is so effective when run online.

The results of Chapter 3 had surprising implications for DBGD; for instance, it appeared that DBGD was not able to reach the performance of PDGD at convergence. Meanwhile, DBGD forms the basis of most existing online LTR methods. This prompted us to further investigate DBGD in Chapter 4, where we asked:
RQ3
Are DBGD LTR methods reliable in terms of theoretical soundness and empirical performance?

By critically examining the theoretical assumptions underlying the DBGD method, we found that these assumptions cannot hold when optimizing a deterministic ranking model. This means that the existing theoretical guarantees of DBGD are unsound for a lot of previous work where such models were used [40, 43, 82, 90, 111, 125, 126, 132, 135]. Moreover, our empirical analysis revealed that ideal circumstances exist where DBGD is still unable to find the optimal model. In other words, even in scenarios where optimization should be very easy, DBGD was unable to get near optimal performance. These findings lead us to answer RQ3 negatively: our empirical results show that DBGD is very unreliable, and its theoretical guarantees do not cover the most common LTR ranking models.

The second part of the thesis considered counterfactual LTR methods for optimization and evaluation. In particular, we tried to widen the applicability of counterfactual LTR methods and their effectiveness as online methods.

First, Chapter 5 recognized that the original Inverse Propensity Scoring (IPS) counterfactual method [58] is not unbiased when item-selection bias occurs. This bias occurs when not all items can be displayed in a single ranking, which is unavoidable in top-$k$ ranking settings where only $k$ items can be displayed. One of the questions Chapter 5 addressed is:

RQ4
Can counterfactual LTR be extended to top-$k$ ranking settings?

We showed that one can correct for item-selection bias by basing propensity weights on both the position bias of the user and the stochastic ranking behavior of the logging policy. Our novel policy-aware estimator uses this idea to extend the original IPS approach by taking into account the logging policy behavior. We prove that, assuming rank-based position bias, the policy-aware estimator is unbiased as long as the logging policy gives every relevant item a non-zero probability of appearing in the top-$k$ of a ranking. Furthermore, in our experimental results the policy-aware estimator approximates optimal performance regardless of the amount of item-selection bias present. Therefore, we answer RQ4 positively: with the introduction of the policy-aware estimator, the applicability of counterfactual LTR has been extended to top-$k$ ranking settings.

Besides learning from top-$k$ feedback, Chapter 5 also considered optimizing for top-$k$ metrics. Interestingly, the existing counterfactual LTR methods [2, 46] for optimizing Discounted Cumulative Gain (DCG) metrics are very dissimilar from the state-of-the-art in supervised LTR [13, 129]. To address this dissimilarity, Chapter 5 posed the following question:
RQ5
Is it possible to apply state-of-the-art supervised LTR methods to the counterfactual LTR problem?

We answer this question positively by showing that, with some small adjustments, the LambdaLoss framework [129] can be applied to counterfactual LTR losses, thus enabling the application of state-of-the-art supervised LTR to counterfactual LTR. The implication of this finding is that there does not need to be a division between state-of-the-art supervised LTR and counterfactual LTR. In other words, counterfactual LTR methods can build on the best methods from the supervised LTR field.

Chapter 6 takes a look at tabular and feature-based LTR methods. Tabular methods optimize a tabular ranking model [67-70, 139], which remembers the optimal ranking, in contrast with feature-based methods, which optimize models that use the features of items to predict the optimal ranking. Tabular models are extremely expressive and can capture any possible ranking, making them always capable of converging on the optimal ranking [138]. However, their learned behavior does not generalize to previously unseen circumstances. Conversely, the learned behavior of feature-based models can generalize very well to previously unseen circumstances [10, 75]. But feature-based models can also be limited by the available features, because often the available features do not provide enough information to predict the optimal ranking. Thus feature-based LTR generalizes very well to unseen circumstances, whereas tabular LTR can specialize extremely well in specific circumstances. Inspired by this tradeoff, we asked the following question in Chapter 6:
Chapter 6 looks at tabular and feature-based LTR methods. Tabular methods optimize a tabular ranking model [67–70, 139], which memorizes the optimal ranking, in contrast with feature-based methods, which optimize models that use the features of items to predict the optimal ranking. Tabular models are extremely expressive and can capture any possible ranking, making them always capable of converging on the optimal ranking [138]. However, their learned behavior does not generalize to previously unseen circumstances. Conversely, the learned behavior of feature-based models can generalize very well to previously unseen circumstances [10, 75], but feature-based models can also be limited by the available features, because often the available features do not provide enough information to predict the optimal ranking. Thus feature-based LTR generalizes very well to unseen circumstances, whereas tabular LTR can specialize extremely well in specific circumstances. Inspired by this tradeoff, we asked the following question in Chapter 6:

RQ6 Can the specialization ability of tabular online LTR be combined with the robust feature-based approach of counterfactual LTR?

Our answer comes in the form of the novel Generalization and Specialization (GENSPEC) algorithm, a method for combining the behavior of a single robust generalized model and numerous specialized models. GENSPEC optimizes a single feature-based ranking model for performance across all queries, and many tabular ranking models, each specialized for a single query. GENSPEC then applies a meta-policy that uses high-confidence bounds to safely decide per query which model to deploy. Consequently, for previously unseen queries, GENSPEC chooses the generalized model, which utilizes robust feature-based prediction. For other queries, it can decide to deploy a specialized model, i.e., if it has enough data to confidently determine that the tabular model has found the better ranking. Our experimental results show that GENSPEC successfully combines robust performance on unseen queries with extremely high performance at convergence. Accordingly, we answer RQ6 positively: using GENSPEC, we can combine the specialization properties of tabular LTR with the robust generalization of feature-based LTR. For the LTR field, the introduction of GENSPEC shows that specialization does not need to be unique to tabular online LTR; instead, it can be a property of counterfactual LTR as well. A sketch of the meta-policy's decision rule follows below.
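To illustrate, the following is a minimal sketch of such a high-confidence decision rule; all names are hypothetical, and the actual bounds in Chapter 6 are derived from counterfactual estimates of ranking performance:

def select_ranker(n_query_interactions, lower_bound_spec, upper_bound_gen):
    """Minimal sketch of a GENSPEC-style meta-policy decision rule.

    lower_bound_spec: high-confidence lower bound on the estimated ranking
                      performance of the query-specific tabular model
    upper_bound_gen:  high-confidence upper bound on the estimated ranking
                      performance of the generalized feature-based model
    """
    if n_query_interactions == 0:
        # Unseen query: no specialized model exists yet, so rely on the
        # feature-based model, which generalizes from other queries.
        return "generalized"
    if lower_bound_spec > upper_bound_gen:
        # Even in the worst case allowed by the confidence bounds, the
        # specialized model outperforms the generalized one, so deploying
        # it is safe.
        return "specialized"
    # Otherwise the comparison is still uncertain; default to the robust model.
    return "generalized"

The asymmetry of the rule is what makes deployment safe: the specialized model is only chosen when its advantage is certain, so the worst case is the performance of the generalized model.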
As discussed above, Chapter 7 proved that several prominent interleaving methods are biased w.r.t. a basic model of position bias. Nonetheless, empirical results suggest that these online ranking evaluation methods are still very effective. This leaves a gap for a theoretically-grounded online ranking evaluation method that is also very effective. To address this gap, Chapter 7 considers counterfactual ranking evaluation, which has strong theoretical guarantees, and asks:

RQ7 Can counterfactual evaluation methods for ranking be extended to perform efficient and effective online evaluation?

We realized that, with the introduction of the policy-aware estimator in Chapter 5, the logging policy has an important role in counterfactual estimation. Using the policy-aware estimator as a starting point, we introduce the Logging-Policy Optimization Algorithm (LogOpt), which optimizes the logging policy to minimize the variance of the policy-aware estimator. LogOpt can be deployed during the gathering of data, periodically or fully online, and thus changes the logging behavior through an intervention. As such, it turns the counterfactual evaluation approach with the policy-aware estimator into an online approach. Our experimental results show that applying LogOpt increases the data-efficiency of counterfactual evaluation with the policy-aware estimator. The performance with LogOpt is comparable to A/B testing and interleaving but, in contrast with interleaving, the policy-aware estimator applied with LogOpt does not have a systematic error. Therefore, we answer RQ7 positively: by optimizing the logging policy with LogOpt, counterfactual evaluation can perform effective and data-efficient online evaluation. The optimization objective is sketched below.
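In simplified notation assumed here (not quoted from Chapter 7), let $\hat{\Delta}(\pi)$ be the policy-aware estimate of the metric difference between two rankers when clicks are logged by policy $\pi$. LogOpt then repeatedly solves

\[
\pi^{*} \,=\, \operatorname*{arg\,min}_{\pi} \; \operatorname{Var}\!\big[\hat{\Delta}(\pi)\big],
\]

using the clicks gathered so far to estimate this variance, and continues logging with $\pi^{*}$. Since the policy-aware estimator is unbiased for any logging policy that satisfies its support condition, changing the logging policy only affects the efficiency of the evaluation, not its lack of systematic error.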
Inspired by how Chapter 7 bridges part of the gap between online and counterfactual ranking evaluation, Chapter 8 addressed our final question:

RQ9 Can the counterfactual LTR approach be extended to perform highly effective online LTR?

The motivation is similar to that of the previous chapter: we would like to find a theoretically-grounded method that is effective at both counterfactual LTR and online LTR. Since counterfactual LTR has strong theoretical guarantees, we used it as a starting point. We then introduced the novel intervention-aware estimator, which does not assume a stationary logging policy. As a result, the estimator takes into account the fact that an online intervention may change the logging policy during the gathering of data. Thus, when applied online, the intervention-aware estimator does not only consider the logging policy used when a click was logged, but also all the other logging policies applied at all other timesteps. In addition, the intervention-aware estimator combines the theoretical properties of recent counterfactual LTR estimators: it is the first estimator that can correct for position bias, item-selection bias, and trust bias alike. Our experimental results show that the intervention-aware estimator results in much lower variance than an equivalent estimator that ignores the effect of interventions. Furthermore, in our experimental setting, it outperformed all existing counterfactual estimators, with especially large differences when online interventions take place. Importantly, we observed that the intervention-aware estimator matches the performance of PDGD with only a small number of interventions during learning. Besides a small initial period, LTR with the intervention-aware estimator was able to reach the performance of the most effective online LTR methods. Therefore, we answer RQ9 positively: the intervention-aware estimator extends the counterfactual LTR approach to perform highly effective online LTR. A sketch of how the estimator accounts for interventions follows below.
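The core idea can be sketched as follows, again in simplified assumed notation: if policy $\pi_t$ was deployed at timestep $t$ out of $T$ total timesteps, the intervention-aware propensity of an item $d$ averages the examination probabilities over all deployed policies,

\[
\rho(d) \,=\, \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{R \sim \pi_t}\big[\theta_{\operatorname{rank}(d \mid R)}\big],
\]

instead of using only the single policy that was active when a click was logged. Averaging over every deployed policy is what keeps the estimator unbiased when online interventions change the logging policy during data gathering; the full estimator in Chapter 8 additionally applies affine corrections to account for trust bias.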
For the LTR field, this demonstrates that methods do not have to be either part of counterfactual LTR or part of online LTR; by designing them for both applications at once, they can be highly effective in both scenarios.

Finally, we note the complementary nature of the findings in the second part of the thesis. Many of the contributions of earlier chapters were used in later chapters. For instance, the methods introduced in Chapter 6 and Chapter 7 made use of the policy-aware estimator proposed in Chapter 5, and Chapter 8 built on the policy-aware estimator to introduce the intervention-aware estimator. Similarly, the adaptation of LambdaLoss for counterfactual LTR introduced in Chapter 5 was applied in the experiments of Chapter 6 and Chapter 7. While not explored in the thesis, many of the later contributions can also be applied to methods from earlier chapters. For instance, the intervention-aware estimator from Chapter 8 is completely compatible with the LambdaLoss adaptation from Chapter 5 and with GENSPEC from Chapter 6. In particular, it could be applied in combination with LogOpt from Chapter 7, potentially leading to even more effective online ranking evaluation. Together, the contributions of the second part can be combined into a single framework for counterfactual LTR and ranking evaluation, in which our contributions complement each other. Importantly, this framework bridges several gaps between supervised LTR, online LTR, and counterfactual LTR.
Summary of Findings

The overarching question this thesis aimed to answer considered whether there could be a single, general, theoretically-grounded approach that has competitive performance for both evaluation and LTR from user clicks on rankings, in both the counterfactual and online settings.
We have looked at the family of online methods for LTR [43, 126, 132] and ranking evaluation [44, 56, 96, 108], which traditionally avoid making strong assumptions about user behavior, such as the assumption that a model of position bias is known [128]. While this makes their theory widely applicable, the theoretical guarantees of these methods are relatively weak. For instance, some interleaving and multileaving methods are proven to converge on correct outcomes if clicks are uncorrelated with relevance and thus every ranker performs equally well [41, 96]. Though such guarantees are valuable, they only cover a small group of unambiguous situations and thus leave most situations without theoretical guarantees. Online LTR methods are often motivated by empirical results from semi-synthetic experiments, where they are tested in settings with varying levels of noise and bias [42, 80, 111, 125]. The fundamental question with this type of empirical motivation is how well the results generalize; in particular, whether a method is still effective if the experimental conditions change slightly. This thesis has presented four examples of online methods that showed surprisingly poor performance when tested in new conditions: (i) on several datasets, DBGD [132] did not get close to optimal performance after a very large number of issued queries, even while learning from clicks without noise or position bias (Chapter 4); (ii) Team Draft Interleaving [99], Probabilistic Interleaving [41], and Optimized Interleaving [96] make systematic errors in some ranking comparisons when tested under rank-based position bias (Chapter 7); (iii) the performance of the COLTR algorithm [136] dropped severely when tested under position bias, item-selection bias, and trust bias (Chapter 8); and (iv) PDGD no longer converged to near-optimal performance when we ran it counterfactually or with only a limited number of online interventions, and instead showed a large drop in performance (Chapter 8). While these online LTR and evaluation methods have also shown great performance in previous work [41, 50, 96, 99, 111, 132, 136], these problematic examples illustrate why we cannot conclude that these online LTR methods are reliable. For instance, the performance of a method like PDGD was thought to be very robust to noise and bias [50] (Chapters 3 and 4), until it was tested without constant online interventions (Chapter 8). Without strong theoretical guarantees, we cannot know whether there are more currently-unknown conditions required for the robust performance of PDGD. In general, it is unclear how robust online LTR methods are in practice; this thesis has shown that there is a potential risk of detrimental performance if real-world circumstances do not match the tested experimental settings. Therefore, we conclude that online LTR methods should not be used as the basis for a single general approach to LTR and ranking evaluation from user clicks.

In the second part of the thesis, we considered the family of counterfactual methods for LTR and ranking evaluation [58, 127], which consists of theoretically-grounded methods that rely on explicit assumptions about user behavior. In contrast with the online family, counterfactual methods are less widely applicable: they only provide guarantees when the assumed models of user behavior hold. For instance, the original counterfactual LTR method assumes clicks are only affected by relevance and rank-based position bias [58, 127]; a sketch of this assumption is given below. Despite their limited applicability, counterfactual methods have very strong theoretical guarantees.
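Concretely, this assumption, often called the examination hypothesis, can be written as follows in simplified notation assumed here: a click on item $d$ displayed at rank $r$ requires the user to both examine the result and find it relevant,

\[
P(C = 1 \mid d, r) \,=\, P(E = 1 \mid r) \cdot P(C = 1 \mid E = 1, d) \,=\, \theta_r \cdot \gamma_d,
\]

where the examination probability $\theta_r$ depends only on the displayed rank and $\gamma_d$ only on the item. The corrections in the second part of this thesis start from decompositions of this kind; for instance, the trust-bias corrections of Chapter 8 extend it with rank-dependent affine terms.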
In contrast to most online LTR methods, counterfactual LTR methods guarantee convergence at the same performance as supervised LTR, given that their assumptions about user behavior hold. The findings of this thesis indicate that the strong guarantees with limited applicability of counterfactual LTR are preferable over the weak guarantees with wide applicability of online LTR. This is mainly because widening the applicability of counterfactual LTR proved very doable. In this thesis, we have expanded the applicability of counterfactual LTR and evaluation to (i) top-k settings with item-selection bias (Chapter 5), and (ii) ranking settings where both trust bias and item-selection bias occur (Chapter 8). Besides expanding the settings in which counterfactual LTR methods can be applied, we expanded the methods that perform counterfactual LTR, including: (iii) the state-of-the-art LambdaLoss supervised LTR framework [129] (Chapter 5); (iv) tabular models for extremely specialized rankings (Chapter 6); and (v) a meta-policy that safely chooses between generalized feature-based models and specialized tabular models (Chapter 6). Moreover, this thesis also introduced novel algorithms that increase the effectiveness of counterfactual LTR methods for (vi) online ranking evaluation (Chapter 7), and (vii) online LTR (Chapter 8), even with a limited number of online interventions. Together, these contributions have widened the applicability of counterfactual LTR while maintaining its strong theoretical guarantees. As a direct result of this thesis, counterfactual LTR is applicable to more settings, more LTR methods can be applied to the counterfactual LTR problem, and counterfactual LTR methods are more effective in both the counterfactual and online LTR scenarios.

In conclusion, based on the findings of this thesis, it appears that counterfactual LTR could form the basis of a general approach for LTR from user clicks. In our experimental results, counterfactual LTR provided performance competitive with online LTR methods in both the counterfactual and online settings. While the theory of counterfactual LTR does rely on stronger assumptions regarding user behavior than existing online LTR methods, counterfactual LTR provides far stronger theoretical guarantees. In contrast, it is currently unclear under what conditions online LTR methods are effective, making their performance very unpredictable. Therefore, we answer our overarching thesis question positively: the counterfactual LTR framework proposed in this thesis provides a unified approach for effective and reliable LTR from user clicks. For the LTR field, the counterfactual LTR framework bridges many gaps between the areas of online LTR, counterfactual LTR, and supervised LTR, and as such, it unifies many of the most effective methods for LTR from user clicks. We will conclude the thesis with promising research directions for future work.

Future Work

The most obvious direction is to widen the applicability of the counterfactual LTR framework. This means introducing estimators that are unbiased under other assumptions about user behavior. Joachims et al. [58] mentioned that the original counterfactual method is unbiased as long as click probabilities decompose into observation and relevance probabilities. For example, Vardasbi et al. [122] looked at the performance of counterfactual LTR when assuming cascading user behavior, an alternative to rank-based position bias. Additionally, Fang et al. [30] looked at context-dependent position bias, where the degree of bias varies per query.
It seems natural to continue this trend to more complex models of user behavior. The challenge for future work is two-fold: find LTR methods that are proven to be unbiased under more complex user behavior models, and introduce methods that can reliably find the parameters of these behavior models.

Besides learning from more complex user behavior, there is a big need for LTR based on user clicks that optimizes for more complex goals. Some existing work has already looked at complex goals: for instance, Radlinski et al. [98] introduced a bandit algorithm for tabular LTR that optimizes for both relevance and diversity within a ranking, thus using user clicks to find a ranking that contains relevant items as well as variety among the items within the ranking. Another example comes from Morik et al. [79], who use counterfactual LTR to optimize for relevance and ranking fairness. Ranking fairness metrics are based on the amount of exposure different items receive; for example, some fairness metrics measure whether certain groups of items receive similar amounts of exposure. Other areas of LTR also optimize for computational efficiency, to ensure that ranking systems can process queries in minimal amounts of time [31]. Future work could investigate whether counterfactual LTR can be used for complex goals like these, and for combinations of them.

Surprisingly, the experimental results in this thesis showed that PDGD is no longer effective when not applied fully online, and similarly, we observed very poor performance for the COLTR algorithm [136]. However, we could not find theoretically proven conditions that guarantee that PDGD or COLTR is or is not effective. It appears that we lack a theoretical approach for understanding the limits of online LTR methods. If such an approach could be found, we may be able to correct for the faults in some online LTR methods, or understand when they can be applied reliably. Thus it may be very valuable if future work reconsidered the theory behind existing online LTR methods.

Finally, most of the existing work on LTR from user interactions only considers user clicks. Existing work has already looked at additional signals that are useful for learning [63, 110]. Novel methods that learn from other interactions in addition to user clicks have the potential to better understand user preferences. However, the main challenge for this direction of research may be the availability of such data. Perhaps this direction of research mostly needs a publicly available source of data, and methods to share such data in a privacy-respecting way.

Overall, our main advice for future work is to focus on methods that forge connections between advances in the larger field of LTR; that is, methods that combine the best of different areas, as our proposed framework does for online LTR, counterfactual LTR, and supervised LTR.

Bibliography
[1] E. Adar, J. Teevan, S. T. Dumais, and J. L. Elsas. The web changes everything: Understanding the dynamics of web content. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 282–291, 2009. (Cited on pages 2, 38, 39, and 61.)
[2] A. Agarwal, K. Takatsu, I. Zaitsev, and T. Joachims. A general framework for counterfactual learning-to-rank. In Proceedings of the 42nd International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 5–14. ACM, 2019. (Cited on pages 3, 6, 78, 86, 92, 105, 109, 110, 156, and 173.)
[3] A. Agarwal, X. Wang, C. Li, M. Bendersky, and M. Najork. Addressing trust bias for unbiased learning-to-rank. In The World Wide Web Conference, pages 4–14. ACM, 2019. (Cited on pages 3, 81, 88, 91, 94, 152, 153, 156, and 163.)
[4] A. Agarwal, I. Zaitsev, X. Wang, C. Li, M. Najork, and T. Joachims. Estimating position bias without intrusive interventions. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 474–482. ACM, 2019. (Cited on pages 3, 78, 94, 131, 134, 156, and 172.)
[5] Q. Ai, K. Bi, C. Luo, J. Guo, and W. B. Croft. Unbiased learning to rank with unbiased propensity estimation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 385–394. ACM, 2018. (Cited on pages 62, 78, 81, 88, 89, 90, 91, 114, 131, and 156.)
[6] Q. Ai, T. Yang, H. Wang, and J. Mao. Unbiased learning to rank: Online or offline? arXiv preprint arXiv:2004.13574, 2020. (Cited on pages 152, 157, 164, and 167.)
[7] R. Albert, H. Jeong, and A.-L. Barabási. Diameter of the world-wide web. Nature, 401(6749):130–131, 1999. (Cited on page 1.)
[8] J. Allan, B. Carterette, J. A. Aslam, V. Pavlu, B. Dachev, and E. Kanoulas. Million query track 2007 overview. In TREC. NIST, 2007. (Cited on pages 29 and 48.)
[9] W.-T. Balke, U. Güntzer, and W. Kießling. On real-time top k querying for mobile services. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pages 125–143. Springer, 2002. (Cited on page 78.)
[10] C. M. Bishop. Pattern Recognition and Machine Learning, chapter 1.3. Springer, 2006. (Cited on pages 102 and 174.)
[11] A. Borisov, I. Markov, M. de Rijke, and P. Serdyukov. A neural click model for web search. In WWW, pages 531–541. International World Wide Web Conferences Steering Committee, 2016. (Cited on page 62.)
[12] B. Brost, I. J. Cox, Y. Seldin, and C. Lioma. An improved multileaving algorithm for online ranker evaluation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 745–748, 2016. (Cited on pages 4, 15, 16, 22, 23, 28, and 29.)
[13] C. J. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Technical report, Microsoft Research, 2010. (Cited on pages 1, 6, 41, 52, 79, 86, 87, 104, and 174.)
[14] F. Cai and M. de Rijke. A survey of query auto completion in information retrieval. Foundations and Trends in Information Retrieval, 10(4):273–363, 2016. (Cited on page 78.)
[15] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136, 2007. (Cited on page 42.)
[16] B. Carterette and P. Chandar. Offline comparative evaluation with incremental, minimally-invasive online feedback. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 705–714. ACM, 2018. (Cited on pages 3, 7, 78, 86, 89, and 94.)
[17] O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge overview. Journal of Machine Learning Research, 14:1–24, 2011. (Cited on pages 1, 2, 29, 38, 39, 48, 60, 61, 67, 89, 104, 109, 136, 152, and 163.)
[18] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS), 30(1):1–41, 2012. (Cited on pages 1, 2, 17, 29, 126, and 141.)
[19] S. Chelaru, C. Orellana-Rodriguez, and I. S. Altingovde. How useful is social feedback for learning to rank YouTube videos? World Wide Web, 17(5):997–1025, 2014. (Cited on pages 1 and 38.)
[20] A. Chuklin, I. Markov, and M. de Rijke. Click Models for Web Search. Morgan & Claypool Publishers, 2015. (Cited on pages 2, 30, 40, 48, 60, 62, 68, and 127.)
[21] A. Chuklin, A. Schuth, K. Zhou, and M. de Rijke. A comparative analysis of interleaving methods for aggregated search. ACM Transactions on Information Systems (TOIS), 33(2):1–38, 2015. (Cited on page 29.)
[22] G. Claeskens and N. L. Hjort. Model Selection and Model Averaging. Cambridge University Press, 2008. (Cited on page 102.)
[23] C. L. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2009 Web Track. In TREC. NIST, 2009. (Cited on page 29.)
[24] N. Craswell, D. Hawking, R. Wilkinson, and M. Wu. Overview of the TREC 2003 Web Track. In TREC. NIST, 2003. (Cited on page 29.)
[25] N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87–94. ACM, 2008. (Cited on pages 2, 104, 127, 152, and 153.)
[26] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 39–46. ACM, 2010. (Cited on page 78.)
[27] D. Dato, C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, N. Tonellotto, and R. Venturini. Fast ranking with additive ensembles of oblivious and non-oblivious regression trees. ACM Transactions on Information Systems (TOIS), 35(2):1–31, 2016. (Cited on pages 1, 29, 48, 67, 104, and 109.)
[28] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977. (Cited on page 87.)
[29] M. B. Dias, D. Locher, M. Li, W. El-Deredy, and P. J. Lisboa. The value of personalised recommender systems to e-business: A case study. In Proceedings of the 2008 ACM Conference on Recommender Systems, pages 291–294, 2008. (Cited on page 1.)
[30] Z. Fang, A. Agarwal, and T. Joachims. Intervention harvesting for context-dependent examination-bias estimation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 825–834, 2019. (Cited on pages 134, 156, and 178.)
[31] L. Gallagher, R.-C. Chen, R. Blanco, and J. S. Culpepper. Joint optimization of cascade ranking models. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 15–23, 2019. (Cited on page 178.)
[32] S. C. Geyik, Q. Guo, B. Hu, C. Ozcaglar, K. Thakkar, X. Wu, and K. Kenthapadi. Talent search and recommendation systems at LinkedIn: Practical challenges and lessons learned. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1353–1354, 2018. (Cited on page 1.)
[33] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010. (Cited on page 49.)
[34] C. A. Gomez-Uribe and N. Hunt. The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS), 6(4):1–19, 2015. (Cited on page 1.)
[35] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision, pages 241–257. Springer, 2016. (Cited on page 1.)
[36] F. Guo, C. Liu, and Y. M. Wang. Efficient multiple-click models in web search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 124–131, 2009. (Cited on pages 29, 30, 47, 48, and 68.)
[37] D. M. Hawkins. The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12, 2004. (Cited on page 102.)
[38] J. He, C. Zhai, and X. Li. Evaluation of methods for relative comparison of retrieval systems based on clickthroughs. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 2029–2032. ACM, 2009. (Cited on pages 1, 24, 48, and 67.)
[39] K. Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, University of Amsterdam, 2013. (Cited on pages 49 and 68.)
[40] K. Hofmann, S. Whiteson, and M. de Rijke. Balancing exploration and exploitation in learning to rank online. In European Conference on Information Retrieval, pages 251–263. Springer, 2011. (Cited on pages 40, 48, 49, 64, 67, 68, and 173.)
[41] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 249–258, 2011. (Cited on pages 4, 17, 21, 22, 28, 29, 30, 40, 60, 62, 130, 131, 136, 145, 157, 172, 176, and 177.)
[42] K. Hofmann, S. Whiteson, and M. de Rijke. Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval. Information Retrieval, 16(1):63–90, 2012. (Cited on pages 2, 5, 38, and 176.)
[43] K. Hofmann, A. Schuth, S. Whiteson, and M. de Rijke. Reusing historical interaction data for faster online learning to rank for IR. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 183–192. ACM, 2013. (Cited on pages 5, 40, 60, 62, 64, 94, 157, 163, 173, and 176.)
[44] K. Hofmann, S. Whiteson, and M. de Rijke. Fidelity, soundness, and efficiency of interleaved comparison methods. ACM Transactions on Information Systems (TOIS), 31(4):1–43, 2013. (Cited on pages 2, 4, 16, 19, 20, 21, 22, 24, 28, 29, 126, 131, 171, and 176.)
[45] K. Hofmann, L. Li, and F. Radlinski. Online evaluation for information retrieval. Foundations and Trends in Information Retrieval, 10(1):1–117, 2016. (Cited on pages 1, 15, 17, and 126.)
[46] Z. Hu, Y. Wang, Q. Peng, and H. Li. Unbiased LambdaMART: An unbiased pairwise learning-to-rank algorithm. In The World Wide Web Conference, pages 2830–2836. ACM, 2019. (Cited on pages 3, 6, 88, and 173.)
[47] J. Huang, H. Oosterhuis, M. de Rijke, and H. van Hoof. Keeping dataset biases out of the simulation: A debiased simulator for reinforcement learning based recommender systems. In Proceedings of the 2020 ACM Conference on Recommender Systems, 2020. (Cited on page 12.)
[48] N. Hurley and M. Zhang. Novelty and diversity in top-n recommendation – analysis and evaluation. ACM Transactions on Internet Technology (TOIT), 10(4):14, 2011. (Cited on page 78.)
[49] R. Jagerman, H. Oosterhuis, and M. de Rijke. Query-level ranker specialization. In CEUR Workshop Proceedings, volume 2007, 2017. (Cited on page 12.)
[50] R. Jagerman, H. Oosterhuis, and M. de Rijke. To model or to intervene: A comparison of counterfactual and online learning to rank from user interactions. In Proceedings of the 42nd International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 15–24. ACM, 2019. (Cited on pages 6, 7, 12, 89, 90, 94, 117, 152, 157, 159, and 177.)
[51] R. Jagerman, I. Markov, and M. de Rijke. Safe exploration for optimizing contextual bandits. ACM Transactions on Information Systems, 38(3):Article 24, 2020. (Cited on pages 106, 107, 108, 113, and 121.)
[52] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002. (Cited on pages 1 and 154.)
[53] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In ACM SIGIR Forum, volume 51, pages 243–250. ACM New York, NY, USA, 2017. (Cited on page 110.)
[54] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, 2002. (Cited on pages 17, 19, 24, 38, 42, 61, 65, 79, 86, and 156.)
[55] T. Joachims. Unbiased evaluation of retrieval quality using clickthrough data. In SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, volume 354, 2002. (Cited on pages 1 and 24.)
[56] T. Joachims. Evaluating retrieval performance using clickthrough data. In Text Mining. Physica Verlag, 2003. (Cited on pages 2, 4, 17, 126, 130, and 176.)
[57] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In SIGIR Forum, pages 154–161. ACM, 2005. (Cited on pages 78, 131, 132, 152, and 153.)
[58] T. Joachims, A. Swaminathan, and T. Schnabel. Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 781–789, 2017. (Cited on pages 1, 2, 3, 4, 5, 6, 60, 62, 68, 78, 79, 80, 81, 86, 88, 89, 90, 91, 92, 103, 104, 105, 109, 110, 114, 120, 126, 128, 136, 137, 152, 155, 156, 157, 162, 163, 164, 172, 173, 177, and 178.)
[59] A. Karatzoglou, L. Baltrunas, and Y. Shi. Learning to rank for recommender systems. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 493–494, 2013. (Cited on page 38.)
[60] S. K. Karmaker Santu, P. Sondhi, and C. Zhai. On application of learning to rank for e-commerce search. In SIGIR, pages 475–484. ACM, 2017. (Cited on pages 1 and 38.)
[61] S. Katariya, B. Kveton, C. Szepesvari, and Z. Wen. DCM bandits: Learning to rank with multiple clicks. In International Conference on Machine Learning, pages 1215–1224, 2016. (Cited on page 115.)
[62] A. Kazerouni, M. Ghavamzadeh, Y. A. Yadkori, and B. Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910–3919, 2017. (Cited on page 108.)
[63] E. Kharitonov, C. Macdonald, P. Serdyukov, and I. Ounis. Generalized team draft interleaving. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 773–782, 2015. (Cited on pages 18, 29, 33, and 179.)
[64] R. Kohavi and R. Longbotham. Online controlled experiments and A/B testing. Encyclopedia of Machine Learning and Data Mining, 7(8):922–929, 2017. (Cited on pages 126 and 129.)
[65] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140–181, 2009. (Cited on page 17.)
[66] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, and N. Pohlmann. Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1168–1176, 2013. (Cited on page 17.)
[67] J. Komiyama, J. Honda, and H. Nakagawa. Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 1152–1161. JMLR.org, 2015. (Cited on pages 3, 6, 94, and 174.)
[68] B. Kveton, C. Szepesvari, Z. Wen, and A. Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pages 767–776, 2015. (Cited on pages 3, 103, and 115.)
[69] P. Lagrée, C. Vernade, and O. Cappé. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems, pages 1597–1605, 2016. (Cited on pages 3, 94, 110, 115, and 116.)
[70] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. (Cited on pages 3, 6, 102, 103, and 174.)
[71] D. Lefortier, P. Serdyukov, and M. de Rijke. Online exploration for detecting shifts in fresh intent. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 589–598, 2014. (Cited on pages 2, 38, 39, and 61.)
[72] S. Li, Y. Abbasi-Yadkori, B. Kveton, S. Muthukrishnan, V. Vinay, and Z. Wen. Offline evaluation of ranking policies with click models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1685–1694. ACM, 2018. (Cited on pages 94 and 95.)
[73] Z. Li, A. Grotov, J. Kiseleva, M. de Rijke, and H. Oosterhuis. Optimizing interactive systems with data-driven objectives. arXiv preprint arXiv:1802.06306, page 11, 2018. (Cited on page 12.)
[74] Q. Liu, L. Li, Z. Tang, and D. Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366, 2018. (Cited on page 95.)
[75] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009. (Cited on pages 1, 4, 37, 61, 79, 86, 103, 104, 151, 154, and 174.)
[76] T.-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of the Workshop on Learning to Rank for Information Retrieval, 2007. (Cited on pages 1, 2, 29, 38, and 39.)
[77] A. Lucic, H. Oosterhuis, H. Haned, and M. de Rijke. Actionable interpretability through optimizable counterfactual explanations for tree ensembles. arXiv preprint arXiv:1911.12199, 2019. (Cited on page 12.)
[78] J. Ma, Z. Zhao, X. Yi, J. Yang, M. Chen, J. Tang, L. Hong, and E. H. Chi. Off-policy learning in two-stage recommender systems. In Proceedings of The Web Conference 2020, pages 463–473, 2020. (Cited on page 134.)
[79] M. Morik, A. Singh, J. Hong, and T. Joachims. Controlling fairness and bias in dynamic learning-to-rank. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 429–438, 2020. (Cited on page 178.)
[80] H. Oosterhuis and M. de Rijke. Balancing speed and quality in online learning to rank for information retrieval. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 277–286, 2017. (Cited on pages 12, 38, 41, 44, 49, 50, 51, 63, and 176.)
[81] H. Oosterhuis and M. de Rijke. Sensitive and scalable online evaluation with theoretical guarantees. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 77–86, 2017. (Cited on pages 11, 15, 40, 42, and 62.)
[82] H. Oosterhuis and M. de Rijke. Differentiable unbiased online learning to rank. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1293–1302. ACM, 2018. (Cited on pages 11, 37, 60, 63, 64, 65, 68, 94, 117, 157, 163, 164, and 173.)
[83] H. Oosterhuis and M. de Rijke. Ranking for relevance and display preferences in complex presentation layouts. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 845–854, 2018. (Cited on page 12.)
[84] H. Oosterhuis and M. de Rijke. Optimizing ranking models in an online setting. In Advances in Information Retrieval, pages 382–396, Cham, 2019. Springer International Publishing. (Cited on pages 11, 59, 60, 94, 109, 117, 157, and 163.)
[85] H. Oosterhuis and M. de Rijke. Taking the counterfactual online: Efficient and unbiased online evaluation for ranking. In Proceedings of the 2020 International Conference on the Theory of Information Retrieval. ACM, 2020. (Cited on pages 11, 125, and 164.)
[86] H. Oosterhuis and M. de Rijke. Policy-aware unbiased learning to rank for top-k rankings. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2020. (Cited on pages 11, 77, 105, 114, 128, 131, 136, 152, 153, 155, 156, 157, and 164.)
[87] H. Oosterhuis and M. de Rijke. Robust generalization and safe query-specialization in counterfactual learning to rank. In Submitted to The World Wide Web Conference. ACM, 2021. (Cited on pages 11 and 101.)
[88] H. Oosterhuis and M. de Rijke. Unifying online and counterfactual learning to rank. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM'21). ACM, 2021. (Cited on pages 11 and 151.)
[89] H. Oosterhuis, S. Ravi, and M. Bendersky. Semantic video trailers. arXiv preprint arXiv:1609.01819, 2016. (Cited on page 12.)
[90] H. Oosterhuis, A. Schuth, and M. de Rijke. Probabilistic multileave gradient descent. In European Conference on Information Retrieval, pages 661–668. Springer, 2016. (Cited on pages 5, 12, 22, 34, 40, 48, 49, 51, 62, 64, 67, 68, and 173.)
[91] H. Oosterhuis, J. S. Culpepper, and M. de Rijke. The potential of learned index structures for index compression. In Proceedings of the 23rd Australasian Document Computing Symposium, pages 1–4, 2018. (Cited on page 12.)
[92] Z. Ovaisi, R. Ahsan, Y. Zhang, K. Vasilaky, and E. Zheleva. Correcting for selection bias in learning-to-rank systems. arXiv preprint arXiv:2001.11358, 2020. (Cited on pages 3, 6, 128, 152, 153, and 163.)
[93] A. B. Owen. Monte Carlo Theory, Methods and Examples. 2013. (Cited on page 43.)
[94] E. Politou, E. Alepis, and C. Patsakis. Forgetting personal data and revoking consent under the GDPR: Challenges and proposed solutions. Journal of Cybersecurity, 4(1), 2018. (Cited on page 2.)
[95] T. Qin and T.-Y. Liu. Introducing LETOR 4.0 datasets. arXiv preprint arXiv:1306.2597, 2013. (Cited on pages 29, 39, 48, 60, 61, 67, 89, 104, 109, 136, and 152.)
[96] F. Radlinski and N. Craswell. Optimized interleaving for online retrieval evaluation. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 245–254, 2013. (Cited on pages 4, 17, 19, 21, 22, 60, 62, 131, 146, 172, 176, and 177.)
[97] F. Radlinski and N. Craswell. A theoretical framework for conversational search. In Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, pages 117–126, 2017. (Cited on page 38.)
[98] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791, 2008. (Cited on pages 40 and 178.)
[99] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 43–52. ACM, 2008. (Cited on pages 2, 4, 17, 21, 39, 130, 131, 144, 172, 176, and 177.)
[100] K. Raman, T. Joachims, P. Shivaswamy, and T. Schnabel. Stable coactive learning via perturbation. In International Conference on Machine Learning, pages 837–845, 2013. (Cited on pages 2 and 5.)
[101] P. Resnick and H. R. Varian. Recommender systems. Communications of the ACM, 40(3):56–58, 1997. (Cited on page 1.)
[102] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pages 521–530, 2007. (Cited on page 140.)
[103] A. Roegiest, G. V. Cormack, C. L. Clarke, and M. R. Grossman. TREC 2015 total recall track overview. In TREC, 2015. (Cited on page 1.)
[104] M. Sanderson. Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4(4):247–375, 2010. (Cited on pages 1, 2, 17, 38, 39, 40, 60, 61, 104, and 152.)
[105] M. Sanderson, M. L. Paramita, P. Clough, and E. Kanoulas. Do user preferences and evaluation measures line up? In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 555–562, 2010. (Cited on page 1.)
[106] J. B. Schafer, J. Konstan, and J. Riedl. Recommender systems in e-commerce. In Proceedings of the 1st ACM Conference on Electronic Commerce, pages 158–166, 1999. (Cited on page 1.)
[107] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, pages 1670–1679, 2016. (Cited on page 95.)
[108] A. Schuth, F. Sietsma, S. Whiteson, D. Lefortier, and M. de Rijke. Multileaved comparisons for fast online evaluation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 71–80, 2014. (Cited on pages 4, 15, 16, 17, 20, 21, 22, 23, 28, 29, 30, 40, 62, 171, 172, and 176.)
[109] A. Schuth, R.-J. Bruintjes, F. Büttner, J. van Doorn, C. Groenland, H. Oosterhuis, C.-N. Tran, B. Veeling, J. van der Velde, R. Wechsler, et al. Probabilistic multileave for online retrieval evaluation. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 955–958, 2015. (Cited on pages 4, 11, 16, 17, 22, 23, 28, 29, 30, 40, 62, and 172.)
[110] A. Schuth, K. Hofmann, and F. Radlinski. Predicting search satisfaction metrics with interleaved comparisons. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 463–472, 2015. (Cited on pages 2, 4, 126, 141, and 179.)
[111] A. Schuth, H. Oosterhuis, S. Whiteson, and M. de Rijke. Multileave gradient descent for fast online learning to rank. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 457–466, 2016. (Cited on pages 2, 5, 11, 34, 38, 40, 44, 48, 49, 51, 60, 62, 64, 67, 68, 117, 157, 173, 176, and 177.)
[112] I. Shalyminov, O. Dušek, and O. Lemon. Neural response ranking for social conversation: A data-efficient approach. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 1–8, 2018. (Cited on page 78.)
[113] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. In ACM SIGIR Forum, volume 33, pages 6–12. ACM New York, NY, USA, 1999. (Cited on page 111.)
[114] A. Slivkins, F. Radlinski, and S. Gollapudi. Ranked bandits in metric spaces: Learning diverse rankings over large document collections. Journal of Machine Learning Research, 14(1):399–436, 2013. (Cited on page 40.)
[115] A. Spink, S. Ozmutlu, H. C. Ozmutlu, and B. J. Jansen. US versus European web searching trends. In ACM SIGIR Forum, volume 36, pages 32–38. ACM New York, NY, USA, 2002. (Cited on page 111.)
[116] A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015. (Cited on pages 3 and 92.)
[117] A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudik, J. Langford, D. Jose, and I. Zitouni. Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems, pages 3632–3642, 2017. (Cited on page 103.)
[118] B. Szörényi, R. Busa-Fekete, A. Paul, and E. Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, pages 604–612, 2015. (Cited on page 42.)
[119] P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. (Cited on pages 102, 106, 107, and 118.)
[120] P. Vakkari and N. Hakala. Changes in relevance criteria and problem stages in task performance. Journal of Documentation, 56:540–562, 2000. (Cited on pages 39 and 61.)
[121] V. Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013. (Cited on page 79.)
[122] A. Vardasbi, M. de Rijke, and I. Markov. Cascade model-based propensity estimation for counterfactual learning to rank. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2089–2092, 2020. (Cited on page 178.)
[123] A. Vardasbi, H. Oosterhuis, and M. de Rijke. When inverse propensity scoring does not work: Affine corrections for unbiased learning to rank. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2020. (Cited on pages 12, 152, 153, 155, 156, 157, 163, and 164.)
[124] A. Vlachou, C. Doulkeridis, and K. Nørvåg. Monitoring reverse top-k queries over mobile devices. In Proceedings of the 10th ACM International Workshop on Data Engineering for Wireless and Mobile Access, pages 17–24. ACM, 2011. (Cited on page 78.)
[125] H. Wang, R. Langley, S. Kim, E. McCord-Snook, and H. Wang. Efficient exploration of gradient space for online learning to rank. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 145–154. ACM, 2018. (Cited on pages 5, 60, 63, 64, 173, and 176.)
[126] H. Wang, S. Kim, E. McCord-Snook, Q. Wu, and H. Wang. Variance reduction in gradient exploration for online learning to rank. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 835–844, 2019. (Cited on pages 2, 5, 117, 157, 173, and 176.)
[127] X. Wang, M. Bendersky, D. Metzler, and M. Najork. Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124, 2016. (Cited on pages 1, 2, 3, 6, 38, 39, 40, 61, 62, 78, 80, 81, 88, 91, 94, 104, 114, 126, 131, 132, 154, 155, 156, and 177.)
[128] X. Wang, N. Golbandi, M. Bendersky, D. Metzler, and M. Najork. Position bias estimation for unbiased learning to rank in personal search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 610–618. ACM, 2018. (Cited on pages 78, 81, 88, 94, 131, 134, 152, 153, 156, 172, and 176.)
[129] X. Wang, C. Li, N. Golbandi, M. Bendersky, and M. Najork. The LambdaLoss framework for ranking metric optimization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1313–1322. ACM, 2018. (Cited on pages 1, 6, 79, 86, 87, 88, 89, 104, 137, 154, 156, 174, and 177.)
[130] R. W. White, M. Bilenko, and S. Cucerzan. Studying the use of popular destinations to enhance web search interaction. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 159–166, 2007. (Cited on pages 111, 113, and 117.)
[131] Y. Wu, R. Shariff, T. Lattimore, and C. Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262, 2016. (Cited on pages 108, 111, 113, and 117.)
[132] Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208. ACM, 2009. (Cited on pages 2, 4, 34, 38, 39, 40, 49, 60, 62, 64, 94, 103, 117, 152, 157, 172, 173, 176, and 177.)
[133] Y. Yue, Y. Gao, O. Chapelle, Y. Zhang, and T. Joachims. Learning more powerful test statistics for click-based retrieval evaluation. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 507–514, 2010. (Cited on page 29.)
[134] Y. Yue, R. Patel, and H. Roehrig. Beyond position bias: Examining result attractiveness as a source of presentation bias in clickthrough data. In Proceedings of the 19th International Conference on World Wide Web, pages 1011–1018, 2010. (Cited on pages 23, 40, 60, and 61.)
[135] T. Zhao and I. King. Constructing reliable gradient exploration for online learning to rank. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 1643–1652, 2016. (Cited on pages 5, 63, 64, and 173.)
[136] S. Zhuang and G. Zuccon. Counterfactual online learning to rank. In European Conference on Information Retrieval, pages 415–430. Springer, 2020. (Cited on pages 152, 157, 163, 164, 166, 176, 177, and 178.)
[137] M. Zoghi, S. A. Whiteson, M. de Rijke, and R. Munos. Relative confidence sampling for efficient on-line ranker evaluation. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 73–82, 2014. (Cited on pages 48 and 67.)
[138] M. Zoghi, T. Tunys, L. Li, D. Jose, J. Chen, C. M. Chin, and M. de Rijke. Click-based hot fixes for underperforming torso queries. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 195–204, 2016. (Cited on pages 3, 7, 103, 111, 115, and 174.)
[139] M. Zoghi, T. Tunys, M. Ghavamzadeh, B. Kveton, C. Szepesvari, and Z. Wen. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 4199–4208, 2017. (Cited on pages 3, 6, 40, and 174.)
Summary

Ranking systems form the basis for online search engines and recommendation services. They process large collections of items, for instance web pages or e-commerce products, and present the user with a small ordered selection. The goal of a ranking system is to help a user find the items they are looking for with the least amount of effort. Thus the rankings they produce should place the most relevant or preferred items at the top. Learning to rank is a field within machine learning that covers methods which optimize ranking systems w.r.t. this goal. Traditional supervised learning to rank methods utilize expert judgements to evaluate and learn; however, in many situations such judgements are impossible or infeasible to obtain. As a solution, methods have been introduced that perform learning to rank based on user clicks instead. The difficulty with clicks is that they are affected not only by user preferences, but also by what rankings were displayed. Therefore, these methods have to avoid being biased by factors other than user preference. This thesis concerns learning to rank methods based on user clicks and specifically aims to unify the different families of these methods.

The first part of the thesis consists of three chapters that look at online learning to rank algorithms, which learn by directly interacting with users. Its first chapter considers large-scale evaluation and shows that existing methods cannot guarantee both correctness and user experience; we then introduce a novel method that can guarantee both. The second chapter proposes a novel pairwise method for learning from clicks that contrasts with the previously prevalent dueling-bandit methods. Our experiments show that our pairwise method greatly outperforms the dueling-bandit approach. The third chapter further confirms these findings in an extensive experimental comparison; furthermore, we also show that the theory behind the dueling-bandit approach is unsound w.r.t. deterministic ranking systems.

The second part of the thesis consists of four chapters that look at counterfactual learning to rank algorithms, which learn from historically logged click data. Its first chapter takes the existing approach and makes it applicable to top-k settings where not all items can be displayed at once. It also shows that state-of-the-art supervised learning to rank methods can be applied in the counterfactual scenario. The second chapter introduces a method that combines the robust generalization of feature-based models with the high-performance specialization of tabular models. The third chapter looks at evaluation and introduces a method for finding the optimal logging policy, which collects click data in a way that minimizes the variance of estimated ranking metrics. By applying this method during the gathering of clicks, one can turn counterfactual evaluation into online evaluation. The fourth chapter proposes a novel counterfactual estimator that considers the possibility that the logging policy has been updated during the gathering of click data. As a result, it can learn much more efficiently when deployed in an online scenario where interventions can take place. The resulting approach is thus both online and counterfactual; our experimental results show that its performance matches the state of the art in both the online and the counterfactual scenario.

As a whole, the second part of this thesis proposes a framework that bridges many gaps between the areas of online, counterfactual, and supervised learning to rank.
It has taken approaches previously considered independent and unified them into a single methodology for widely applicable and effective learning to rank from user clicks.