Efficient Exploration of Gradient Space for Online Learning to Rank
Huazheng Wang, Ramsey Langley, Sonwoo Kim, Eric McCord-Snook, Hongning Wang
Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
{hw7ww,rml5tu,sak2m,esm7ky,hw5x}@virginia.edu
ABSTRACT
Online learning to rank (OL2R) optimizes the utility of returned search results based on implicit feedback gathered directly from users. To improve the estimates, OL2R algorithms examine one or more exploratory gradient directions and update the current ranker if a proposed one is preferred by users via an interleaved test. In this paper, we accelerate the online learning process by efficient exploration in the gradient space. Our algorithm, named Null Space Gradient Descent, reduces the exploration space to only the null space of recent poorly performing gradients. This prevents the algorithm from repeatedly exploring directions that have been discouraged by the most recent interactions with users. To improve the sensitivity of the resulting interleaved test, we selectively construct candidate rankers to maximize the chance that they can be differentiated by the candidate ranking documents in the current query, and we use historically difficult queries to identify the best ranker when a tie occurs in comparing the rankers. Extensive experimental comparisons with state-of-the-art OL2R algorithms on several public benchmarks confirm the effectiveness of our proposed algorithm, especially its fast learning convergence and promising ranking quality at an early stage.
CCS CONCEPTS
• Information systems → Learning to rank; • Theory of computation → Online learning algorithms

KEYWORDS
Online learning to rank; Dueling bandit; Null space exploration
ACM Reference Format:
Huazheng Wang, Ramsey Langley, Sonwoo Kim, Eric McCord-Snook, Hongning Wang. 2018. Efficient Exploration of Gradient Space for Online Learning to Rank. In SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, July 8–12, 2018, Ann Arbor, MI, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3209978.3210045
1 INTRODUCTION
The goal of learning to rank is to optimize a parameterized ranking function such that documents that are more relevant to a user's
query are ranked at higher positions [16]. A trained ranker combines hundreds of ranking features to recognize the relevance quality of a document to a query, and shows several advantages over manually crafted ranking algorithms [4]. Traditionally, such a ranker is optimized in an offline manner over a manually curated search corpus. This learning scheme, however, has become a main obstacle hampering the application of learning to rank algorithms, for a few reasons: 1) it is expensive and time-consuming to obtain reliable annotations in large-scale retrieval systems; 2) editors' annotations do not necessarily align with actual users' preferences [20]; and 3) it is difficult for an offline-trained model to reflect or capture ever-changing users' information needs in an online environment [21].

To overcome these limitations, recent research has focused on learning the rankers on the fly, by directly exploiting implicit feedback from users via their interactions with the system [5, 10, 27]. Fundamentally, online learning to rank (OL2R) algorithms operate in an iterative manner: in every iteration, the algorithm examines one or more exploratory directions, and updates the ranker if a proposed one is preferred by the users via an interleaved test [9, 23, 29, 30]. The essence of this type of OL2R algorithm is to estimate the gradient of an unknown objective function with low bias, such that online gradient descent can be used for optimization with low regret [6]; that is, the algorithm eventually finds a close-to-optimal ranker and seldom shows clearly bad results in the process. In the web search scenario, the objective function is usually considered to be the utility of search results, which can be depicted by ordinal comparisons in user feedback, such as clicks [20]. However, to maintain an unbiased estimate of the gradient, these algorithms uniformly sample random vectors in the entire parameter space. As a result, the newly proposed exploratory rankers are independent not only of the past interactions with users, but also of the current query being served. This inevitably leads to slow convergence and large variance of ranking quality during the online learning process.

Several lines of work have been proposed to improve the algorithms' online learning efficiency. Hofmann et al. [9] suggested reducing the step size in gradient descent for better empirical performance. In their follow-up work [8], historical interactions were collected to supplement the interleaved test in the current query and to preselect the candidate rankers. Schuth et al. [23] proposed exploring multiple gradient directions in one multi-interleaved test [24] so as to reduce the number of comparisons needed to evaluate the rankers. Zhao et al. [30] introduced the idea of using two uniformly sampled random vectors with opposite directions as the exploratory directions, with the hope that when they are not orthogonal to the optimal gradient, one of them should be a more effective direction than a simply uniformly sampled direction. They also developed a contextual interleaving method, which considers historical explorations when interleaving the proposed rankers for comparison, to reduce the noise from multi-interleaving. Nevertheless, all the aforementioned solutions still uniformly sample from the entire parameter space for gradient exploration.
This results in independent and isolated rankers for comparison. Therefore, less promising directions might be repeatedly tested, as historical interactions are largely ignored when proposing new rankers. More seriously, as the exploratory rankers are proposed independently of the current query, they might give the same ranking order over the candidate documents for interleaving (this happens when the difference between the feature weight vectors of two rankers is orthogonal to the feature vectors of the candidate documents). In this scenario, no click feedback can differentiate the ranking quality of those rankers for this query. When the interleaved test cannot recognize the best ranker from ordinal comparisons in a query, the tie is broken arbitrarily [23, 29]. This again leads to large variance and slow convergence of ranking quality in these types of algorithms.

We propose improving the learning convergence of OL2R algorithms by carefully exploring the gradient space. First, instead of uniformly sampling from the entire parameter space for gradient estimation, we maintain a collection of recently explored gradients that performed relatively poorly in their interleaved tests. We sample proposal directions from the null space of these gradients to avoid repeatedly exploring poorly performing directions. Second, we use the candidate ranking documents associated with the current query to preselect the proposed rankers, focusing on those that give different ranking orders over the documents. This ensures that the resulting interleaved test will have a better chance of recognizing the difference between those rankers. Third, when an interleaved test fails to recognize the best ranker for a query, e.g., two or more rankers tie, we compare the tied rankers on the most recent worst performing queries (i.e., the difficult queries) with the recorded clicks to differentiate their ranking quality. We name the resulting algorithm Null Space Gradient Descent, or NSGD for short, and extensively compare it with four state-of-the-art algorithms on five public benchmarks. The results confirm greatly improved learning efficiency in NSGD, with a remarkably fast and stable convergence rate at the early stage of the interactive learning process. This means systems equipped with NSGD can provide users with better search results much earlier, which is crucial for any interactive system.
2 RELATED WORK
Online learning to rank has recently attracted increasing attention in the information retrieval community, as it eliminates the heavy dependency on manual relevance judgments for model training and directly estimates the utility of search results from user feedback on the fly. Various algorithms have been proposed, and they can be categorized into two main branches, depending on whether they estimate the utility of individual documents directly [19] or via a parameterized function over the ranking features [29].

The first branch learns the best ranked list for each individual query by modeling user clicks with multi-armed bandit algorithms [1, 2]. Ranked bandits are studied in [19], where a k-armed bandit model is placed on each ranking position of a fixed input query to estimate the utility of candidate documents at that position. Learning is accelerated by assuming that similar documents have similar utility for the same query [25]. By assuming that skipped documents are less attractive than later clicked ones, Kveton et al. [13] developed a cascading bandit model to learn from both positive and negative feedback. To enable learning from multiple clicks in the same result list, they adopted the dependent click model [7] to infer user satisfaction after a sequence of clicks [12], and later extended this approach to broader types of click models [31]. However, such algorithms estimate the utility of ranked documents on a per-query basis, and no estimation is shared across queries. This causes them to suffer from slow convergence, making them less practical.

The other branch leverages ranking features and looks for the best ranker in the entire parametric space. Our work falls into this category. The most representative work in this line is dueling bandit gradient descent (DBGD) [28, 29], where the algorithm proposes an exploratory direction in each iteration of interaction and uses an interleaved test to validate the exploration for model updating. As only one exploratory direction is compared in each iteration of DBGD, its learning efficiency is limited. Different solutions have been proposed to address this limitation. Schuth et al. [23] proposed the Multileave Gradient Descent algorithm to explore multiple directions in each iteration; to evaluate multiple candidate rankers at once, multileaved comparison [24] is used. Zhao et al. [30] proposed the Dual-Point Dueling Bandit Gradient Descent algorithm to sample two stochastic vectors with opposite directions as the candidate gradients: when they are not orthogonal to the optimal gradient, one of the two should be a more effective gradient than a single proposal. However, all of the aforementioned algorithms uniformly sample the exploratory directions from the entire parameter space, which is usually very high-dimensional. More importantly, the uniform sampling makes the proposed rankers independent of past interactions, and thus they cannot avoid repeatedly exploring less promising directions.

Some works have recognized this deficiency and proposed different solutions. Hofmann et al. [8] record historical interactions during the learning process to supplement the interleaved test when comparing the rankers. They also suggest using historical data to preselect the proposed rankers before the interleaved test. However, only the most recent interactions are collected in these two solutions, so they are not necessarily effective in recognizing the quality of different rankers. Oosterhuis et al.
[18] create the exploratory directions via a weighted combination over a set of preselected reference documents from an offline training corpus. The reference documents are either uniformly sampled or are the clustering centroids of the corpus. However, the reference documents are fixed beforehand; this limits the quality of the learnt rankers if the offline corpus has a different feature distribution from the incoming online documents. More importantly, none of these solutions considers the feature distribution of the candidate ranking documents of a particular query when proposing exploratory rankers. It is possible that the proposed rankers are not differentiable by any click pattern for a given query, e.g., they rank the documents in the same order. When the best rankers are tied, the winner is chosen arbitrarily, which further slows down the online learning process. In our solution, we preselect the rankers that tend to provide different ranking lists for the current query, so that the resulting interleaved test has a better chance of telling the difference among those rankers. When a tie occurs, we use the most recent difficult queries to further evaluate the rankers, as those queries are expected to be more discriminative.

3 METHOD
We improve the learning convergence of OL2R algorithms by carefully exploring the gradient space. In particular, we aim to avoid repeatedly exploring recent poorly performing directions, and to focus on the rankers that can be best differentiated by the candidate ranking documents associated with the current query. We first give an overview of a basic online learning to rank algorithm, Dueling Bandit Gradient Descent [29], based on which we describe our proposed solution, Null Space Gradient Descent, in detail.
3.1 Dueling Bandit Gradient Descent
Dueling bandit gradient descent (DBGD) [29] is an OL2R algorithm that learns from interleaved comparisons between one exploratory ranker and the current ranker. Each ranker is represented as a feature weight vector w ∈ R^d, and it ranks documents by taking the inner product with their associated ranking features, i.e., a linear ranking model. As shown in Algorithm 1, at the beginning of iteration t, the algorithm receives a query and its associated candidate ranking documents, represented as a set of query-document pairs X_t = {x_1, x_2, ..., x_s}. We denote w_t as the weight vector of the current ranker. DBGD proposes an exploratory direction u_t uniformly from the unit sphere, and generates a candidate ranker w'_t = w_t + δu_t, where δ is the exploration step size. Two ranked lists generated by these two rankers, i.e., l(X_t, w_t) and l(X_t, w'_t), are then combined via an interleaving method, such as Team Draft Interleaving [20]. The resulting list is returned to the user for feedback. Based on the feedback and the specific interleaving method, the better ranker is determined. If the exploratory ranker wins, the current ranker is updated as w_{t+1} = w_t + αu_t, where α is the learning rate; otherwise the current ranker stays intact. Such exploration and comparison lead to a low-bias estimate of the gradient in expectation [6], i.e., ∇f̂(w) = (d/δ)·E[f(w + δu)·u], where f(w) is the target utility function to be estimated. This estimate requires f(w) neither to be differentiable nor even to be explicitly defined, and it is thus the theoretical basis of this family of OL2R algorithms.

However, only one exploratory direction u_t is proposed for comparison in each iteration, which limits the learning rate of DBGD. To address this limitation, Schuth et al. [23] proposed the Multileave Gradient Descent algorithm, which uniformly explores m directions at the same time, i.e., w_t + δu^i_t with i ∈ {1..m}. Zhao et al. [30] proposed the Dual-Point Dueling Bandit Gradient Descent algorithm to explore two opposite directions each time, i.e., w_t + δu_t and w_t − δu_t. Although exploring multiple candidates generally improves the learning rate, the expected improvement is still marginal, as the ranking features usually reside in a high-dimensional space and uniform sampling is very inefficient. More importantly, uniform sampling makes the proposed rankers independent of historical interactions and of the current query context. Algorithms with history-independent exploration cannot avoid repeatedly exploring less promising directions that have been discouraged by the most recent user feedback, and context-independent exploration cannot avoid the issue of multiple rankers generating indistinguishable ranking results, e.g., ranking the documents in the same order. Both issues further hamper the convergence rate of the aforementioned OL2R algorithms.
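To make the update rule concrete before the pseudocode, the following is a minimal Python sketch of one DBGD iteration, under our own naming; in particular, interleave_and_get_winner stands in for Team Draft Interleaving plus click-credit inference, which the algorithm below leaves abstract.

    import numpy as np

    def dbgd_step(w, X, interleave_and_get_winner, delta=1.0, alpha=0.1):
        """One DBGD iteration (sketch): propose a random unit direction,
        compare the exploratory ranker against the current one via an
        interleaved test, and update only if the exploration wins."""
        d = w.shape[0]
        u = np.random.randn(d)
        u /= np.linalg.norm(u)            # uniform direction on the unit sphere
        w_explore = w + delta * u         # exploratory ranker w'_t = w_t + δu_t
        # A linear ranker scores documents by inner product; argsort descending.
        l_current = np.argsort(-(X @ w))
        l_explore = np.argsort(-(X @ w_explore))
        # Stand-in for the interleaved test: returns True if exploration wins.
        if interleave_and_get_winner(l_current, l_explore):
            w = w + alpha * u             # w_{t+1} = w_t + αu_t
        return w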
Algorithm 1 Dueling Bandit Gradient Descent (DBGD) [29]
  Inputs: δ, α
  Initialize w_1 = sample_unit_vector()
  for t = 1 to T do
    Receive query X_t = {x_1, x_2, ..., x_s}
    u_t = sample_unit_vector()
    w'_t = w_t + δ·u_t
    Generate ranked lists l(X_t, w_t) and l(X_t, w'_t)
    Set L = Interleave(l(X_t, w_t), l(X_t, w'_t)) and present L to the user
    Receive click positions C_t on L, and infer click credits c_t and c'_t for w_t and w'_t accordingly
    if c'_t > c_t then w_{t+1} = w_t + α·u_t
    else w_{t+1} = w_t
    end if
  end for

3.2 Null Space Gradient Descent
Our proposed Null Space Gradient Descent (NSGD) algorithm improves over DBGD-type OL2R algorithms with a suite of carefully designed exploration strategies in the gradient space. We illustrate the procedure of NSGD in Figure 1. First, to avoid uniformly testing exploratory directions in the entire parameter space, we maintain a collection of the most recently explored gradients that performed poorly in their interleaved tests, and sample new proposal directions from the null space of these gradients. As a result, we only search in a subspace that is orthogonal to those less promising directions. This can be intuitively understood from Figure 1 part 1: since the interleaved tests in iterations t−1 and t−2 discouraged those directions, NSGD prevents w_t from exploring these less promising directions again by exploring their null space at iteration t. Second, we prefer proposed rankers that tend to generate the most distinct ranking orders from the current ranker for the current query, so that the resulting interleaved test will have a better chance of recognizing the best ranker among those being compared. We show such an example in Figure 1 part 2, where the current ranker w_t and a randomly sampled candidate ranker w^1_t rank the candidate documents in the same order. As a result, no interleaved test can differentiate their ranking quality in this query. NSGD avoids proposing w^1_t and favors w^2_t, as it ranks the documents in the reverse order and would therefore give the interleaved test a better chance of recognizing the difference between w_t and w^2_t. Third, if an interleaved test fails to recognize the best ranker in a query, e.g., a tie is encountered as shown in Figure 1 part 3, we compare the tied rankers on the most recent worst performing queries (i.e., the difficult queries) with the recorded clicks to differentiate the rankers. Eventually, NSGD aims to reach w* with a minimal number of interactions, i.e., faster convergence. The detailed procedure of NSGD is shown in Algorithm 2, and we discuss its key steps in the following.

Figure 1: Illustration of the model update procedure of Null Space Gradient Descent: (1) null space exploration to avoid repeatedly exploring less promising directions; (2) preselection of candidates that are differentiable by the documents associated with the current query; (3) tie breaking using historical difficult queries when multiple winners occur.

• Null Space Gradient Exploration.
NSGD maintains a fixed-size queue Q_g of recently explored directions and their corresponding quality, i.e., Q_g = {(g_i, q_i)}_{i=1}^{T_g}, which is constructed in lines 32 to 37 of Algorithm 2. The quality q_i of an explored direction g_i is the click-credit difference between the corresponding exploratory ranker and the default ranker (i.e., line 33). Intuitively, q_i measures the improvement in ranking quality contributed by the update direction g_i; when q_i is negative, it suggests that the direction g_i cannot improve the current ranker, and it should therefore be discouraged in the future. To realize this, after receiving a user query, NSGD first constructs G = [g_1, ..., g_{k_g}] by selecting the top k_g worst performing historical directions from Q_g (i.e., line 8), and then solves for the null space of G, denoted as G⊥ = NullSpace(G) (i.e., line 9). The new exploratory directions are sampled from G⊥ (i.e., lines 10 to 12). Because every vector in G⊥ is orthogonal to all k_g selected historical directions, those ineffective directions (and any linear combination of them) will not be tested in this query.

Our null space exploration strategy is based on two mild assumptions: queries are independent and identically distributed (i.i.d.), and the gradient of the target (unknown) utility function satisfies Lipschitz continuity, i.e., ||∇f̂(w_1) − ∇f̂(w_2)|| ≤ γ||w_1 − w_2||, where γ is a Lipschitz constant for the target utility function f(w). The assumption that queries are i.i.d. is studied and widely adopted in existing learning to rank research [14, 15]; it allows NSGD to compare gradient performance across queries and select the k_g worst performing gradients from previous queries. The Lipschitz continuity assumption implies that similar rankers share similar gradient fields for the same query; this assumption is mild and consistent with most existing learning to rank algorithms [3, 16]. Strictly speaking, however, it requires us to construct the null space from all historically explored directions whose associated rankers have a weight vector w similar to the current ranker's. This is clearly infeasible in an online learning setting, as we would have to store the entire updating history and examine it in every iteration. In NSGD, because the learning rate α is set to be small, rankers with close temporal proximity have similar feature weight vectors and therefore share a similar gradient field. Hence, NSGD only maintains the most recently tested directions in Q_g, which approximates the Lipschitz continuity requirement. In our empirical evaluation, we also tested the exhaustive solution; aside from the significantly increased storage and time complexity, little ranking performance improvement was observed, which supports our construction of the null space in NSGD.

Another benefit of sampling from the null space is that the search takes place in a reduced problem space. DBGD-type algorithms have to sample in the whole d-dimensional space, while NSGD only samples from a subspace whose dimension is d − k_g when the top k_g worst performing historical gradients are linearly independent. This advantage is especially appealing when the dimension of ranking features is high, which is usually the case in practical learning to rank applications [4, 16].

There are two ways to sample from the null space G⊥ in NSGD (i.e., line 11): uniformly selecting among the basis vectors of the null space, or sampling random unit vectors inside the null space.
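As a concrete illustration of the null space construction and these two sampling options, here is a minimal numpy sketch under our own naming; it is not the authors' implementation, only one standard way (SVD) to realize NullSpace(G).

    import numpy as np

    def null_space(G, tol=1e-10):
        """Orthonormal basis of the null space of G, whose rows are the
        k_g worst-performing gradients: the right singular vectors whose
        singular values are (numerically) zero."""
        G = np.atleast_2d(G)
        _, s, vh = np.linalg.svd(G)
        rank = int(np.sum(s > tol))
        return vh[rank:].T                  # shape (d, d - rank)

    def sample_basis_direction(B):
        """Scheme 1: uniformly pick one basis vector of the null space."""
        return B[:, np.random.randint(B.shape[1])]

    def sample_inside_null_space(B):
        """Scheme 2: a random unit vector inside the null space, i.e., a
        random linear combination of the basis vectors."""
        v = B @ np.random.randn(B.shape[1])
        return v / np.linalg.norm(v)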
Uniformly selecting among the basis vectors can maximize the coverage of sampled directions in the null space, as the basis vectors are linearly independent of each other; this improves exploration efficiency at an early stage. Zhao et al. tested a similar idea in [30], but performed it over the entire parameter space. In the later stage of model update, however, the true gradients usually concentrate in a specific region, and continuing to select those independent basis vectors becomes less effective; exploring linear combinations of the basis vectors, i.e., uniformly sampling inside the space, then emerges as the better choice. But directly sampling inside the null space from the beginning might be less effective, as it tends to introduce smaller variance when proposing different directions.

To take advantage of both sampling schemes, we propose a hybrid sampling method in the null space: comparing with the winning ranker w_{t−k} created in iteration t−k, if ||w_t − w_{t−k}|| < ϵ, we switch to sampling random unit vectors in G⊥; otherwise we uniformly select among the basis vectors of G⊥. The intuition behind this switching control is that when consecutive rankers become similar, the gradients have converged to a locally optimal region, and a refined search is needed to identify the true gradient; otherwise, the gradient direction has not yet been identified, and larger diversity is needed to accelerate the exploration of the null space. Oosterhuis and de Rijke [18] proposed a similar idea of detecting model convergence and converting to more complex models once a simpler model has converged, but their conversion might not always be feasible, e.g., when no linear mapping between models exists; we only switch the sampling schemes for exploratory directions, which imposes no additional assumption on the model space.

Algorithm 2 Null Space Gradient Descent (NSGD)
1:  Inputs: δ, α, n, m, k_g, k_h, T_g, T_h
2:  Initialize w_1 = sample_unit_vector()
3:  Set Q_g = queue(T_g) and Q_h = queue(T_h) as fixed-size queues
4:  for t = 1 to T do
5:    Receive query X_t = {x_1, x_2, ..., x_s}
6:    Generate ranked list l(X_t, w_t)
7:    x̄_t = Σ_{i=1}^{s} x_i
8:    Construct G = [g_1, ..., g_{k_g}] from the directions in Q_g with the worst recorded quality q
9:    G⊥ = NullSpace(G)
10:   for i = 1 to n do
11:     g^i_t = sample_unit_vector(G⊥)
12:   end for
13:   Select the top m gradients that maximize |x̄_t^T g^i_t| from {g^i_t}_{i=1}^{n}
14:   for i = 1 to m do
15:     w^i_t = w_t + δ·g^i_t
16:     Generate ranked list l(X_t, w^i_t)
17:   end for
18:   Set L_t = Multileave({l(X_t, w^i_t)}_{i=0}^{m}), with w^0_t = w_t, and present L_t to the user
19:   Receive click positions C_t on L_t, and infer click credits {c^i_t}_{i=0}^{m} for all rankers
20:   Infer winner set B_t from {c^i_t}_{i=0}^{m}
21:   if |B_t| > 1 then
22:     Select the k_h worst performing queries {(X_i, L_i, C_i)}_{i=1}^{k_h} from Q_h by Eval(L_i, C_i)
23:     j = argmax_{o ∈ B_t} Σ_{i=1}^{k_h} Eval(l(X_i, w^o), C_i)
24:   else
25:     Set j to the sole winner in B_t
26:   end if
27:   if j = 0 then
28:     w_{t+1} = w_t
29:   else
30:     w_{t+1} = w_t + α·g^j_t
31:   end if
32:   for i = 1 to m do
33:     q^i_t = c^i_t − c^0_t
34:     if q^i_t < 0 then
35:       Append (g^i_t, q^i_t) to Q_g
36:     end if
37:   end for
38:   Append (X_t, L_t, C_t) to Q_h
39:  end for

• Context-Dependent Ranker Preselection.
NSGD selectively constructs the candidate rankers to maximize the chance that they can be differentiated from the current best ranker w_t in the interleaved tests. A straightforward solution is to select the rankers that give totally distinct ranking orders from that of w_t. But this clearly puts too much emphasis on the exploration of new directions and ignores the exploitation of the current best ranker. Especially in the later stage of model update, when the current ranker can already provide satisfactory ranking results, a very distinct ranking indicates a higher risk of providing worse result quality. To balance the needs for exploration and exploitation, we propose a Context-Dependent Preselection (CDP) criterion, as shown in line 13 of Algorithm 2: after randomly sampling n vectors from G⊥, we select the top m of them that maximize the inner product with the aggregated document feature vector x̄_t for query X_t. This can be understood as a necessary condition for a proposed ranker to generate a different ranked list on X_t than that from w_t. More specifically, as we are learning a linear ranker, the ranking score of each document is computed by the inner product between the document feature vector x_i and the feature weight vector w_t, and the ranking scores lead to the ranked list l(X_t, w_t). To generate a different ranked list, there has to be at least one document that receives different ranking scores under the two rankers, i.e., ∃j, |x_j^T(w^i_t − w_t)| > 0. This can be simplified as ∃j, |x_j^T g^i_t| > 0. Based on the triangle inequality (i.e., |a| + |b| ≥ |a + b|), we require a differentiable ranker to satisfy |Σ_j x_j^T g^i_t| > 0. To choose the candidate rankers that can best satisfy this condition, we select the top m proposal directions that maximize this inner product.
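As an illustration (a sketch under our own naming, not the paper's code), the CDP step reduces to one matrix-vector product followed by a top-m selection:

    import numpy as np

    def preselect_cdp(candidates, X, m):
        """Context-Dependent Preselection (sketch): keep the m sampled
        directions whose inner product with the aggregated document
        feature vector, |x̄ᵀg|, is largest, so the corresponding rankers
        are more likely to rank X differently from the current ranker."""
        x_bar = X.sum(axis=0)               # x̄_t = Σ_i x_i
        scores = np.abs(candidates @ x_bar) # |x̄ᵀg| for each sampled direction
        top = np.argsort(-scores)[:m]       # indices of the top-m directions
        return candidates[top]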
• History-Dependent Tie Breaking.
NSGD is flexible in selecting the number of rankers for comparison: the hyper-parameter m in line 13 is an input to the algorithm. If multiple rankers are selected for comparison, multileaving [24] can be performed to compare the quality of the proposed rankers, i.e., infer the click credit c^i_t for each ranker w^i_t and determine the winning ranker (i.e., lines 19 and 20). However, because of position bias in user clicks [11], very few result documents will be clicked each time. This sparsity in clicks directly reduces the resolution of the interleaved test in recognizing the winning ranker; e.g., multiple rankers might share the same aggregated click credit. The situation becomes even worse when more rankers are compared. Existing solutions break the tie arbitrarily [29, 30] or heuristically take the mean vector of the rankers in the winner set [23]. Neither considers the ranking problem at hand, and they are not effective in general.

We propose leveraging historical queries, especially the most difficult ones, to choose the winner whenever a tie happens. First, in line 38, the 3-tuple comprising a historical query, its displayed ranking list, and the corresponding click positions is stored in a fixed-size queue. In future iterations, these tuples are selected in line 22 to identify the best ranker whenever a tie happens. Because only click feedback is available in online learning, we use the click positions C_t in the evaluation function Eval(L_t, C_t), such as MAP or NDCG computed by treating clicked documents as relevant, to measure the ranking quality of L_t for query X_t (i.e., lines 22 and 23). More importantly, because the ranker is improving on the fly, a poorly served query might be caused by a badly performing ranker rather than by its intrinsic difficulty. Therefore, in NSGD we only collect recent click results to select the most discriminative queries.
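A minimal sketch of the tie-breaking evaluation follows; the function names are ours, and the DCG-over-clicks scoring is one possible instantiation of Eval, which the paper leaves open as MAP or NDCG over clicked documents.

    import numpy as np

    def eval_clicks_dcg(ranked_list, clicked_docs, cutoff=10):
        """Eval(L, C) sketch: DCG@cutoff treating clicked documents as
        relevant (grade 1) and everything else as non-relevant."""
        return sum(1.0 / np.log2(rank + 2)      # rank is 0-based
                   for rank, doc in enumerate(ranked_list[:cutoff])
                   if doc in clicked_docs)

    def break_tie(tied_rankers, hard_queries, rank_fn):
        """Among tied rankers, pick the one whose rankings of the k_h most
        recent difficult queries best agree with the recorded clicks.
        hard_queries: list of (X, clicked_docs); rank_fn(w, X) -> doc ids."""
        def score(w):
            return sum(eval_clicks_dcg(rank_fn(w, X), clicks)
                       for X, clicks in hard_queries)
        return max(tied_rankers, key=score)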
4 EXPERIMENTS
In this section, we perform extensive empirical comparisons between our proposed Null Space Gradient Descent (NSGD) algorithm and four state-of-the-art OL2R algorithms on five public learning to rank benchmarks. Both quantitative and qualitative evaluations are performed to examine our proposed gradient space exploration strategies, especially their advantages over existing solutions in improving online learning efficiency.

Figure 2: Offline NDCG@10 on the MQ2007 dataset under three click models (Perfect, Navigational, Informational).
Figure 3: Standard deviation of offline NDCG@10 on the MQ2007 dataset under three click models (Perfect, Navigational, Informational).
Figure 4: Discounted cumulative NDCG@10 on the MQ2007 dataset under three click models (Perfect, Navigational, Informational).

4.1 Experiment Setup
• Datasets.
We used five benchmark datasets from the LETOR 3.0 and LETOR 4.0 collections [17]: MQ2007, MQ2008, TD2003, NP2003 and HP2003. Among them, NP2003 and HP2003 implement navigational tasks, namely named-page finding and homepage finding; TD2003 implements topic distillation, an informational task; MQ2007 and MQ2008 mix both types of tasks. Documents in the TD2003, NP2003 and HP2003 datasets are collected from the .GOV collection, which is crawled from the .gov domain, while the MQ2007 and MQ2008 datasets come from the 2007 and 2008 Million Query tracks at TREC [26]. In these datasets, each query-document pair is encoded as a vector of ranking features, including PageRank, TF-IDF, BM25, and language model scores on different parts of a document. The number of features is 46 in MQ2007 and MQ2008, and 64 in the other three datasets. In the MQ2007 and MQ2008 datasets, every document is labeled with a relevance grade between 0 and 2, while the other datasets only have binary labels. MQ2007 and MQ2008 contain 1,700 and 1,800 queries respectively, but with fewer assessments per query; each of the other three datasets contains fewer than 150 queries, but with about 1,000 assessments per query. All datasets are split into 5 folds for cross validation. We use the training set for the online experiments gathering cumulative performance, and the testing set for offline evaluation.
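For reference, a minimal sketch of reading one line of the LETOR feature format is shown below; the exact trailing comment fields vary by collection, and this parser is our own illustration rather than part of the paper.

    import numpy as np

    def parse_letor_line(line, num_features=46):
        """Parse one LETOR line of the form
        '<label> qid:<qid> 1:<v1> 2:<v2> ... # <comment>'
        and return (relevance label, query id, feature vector)."""
        body = line.split('#', 1)[0].split()
        label = int(body[0])
        qid = body[1].split(':', 1)[1]
        x = np.zeros(num_features)
        for tok in body[2:]:
            idx, val = tok.split(':', 1)
            x[int(idx) - 1] = float(val)
        return label, qid, x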
• Simulating User Clicks.
To make our reported results comparable to the existing literature, we follow the standard evaluation scheme implemented in Lerot [22], which simulates user interactions with an OL2R algorithm. We use the Cascade Click Model [7] to simulate user click behavior. The click model assumes that as a user scans through the returned list, he or she decides whether or not to click on each document; the probability of a click is conditioned on the document's relevance label. Likewise, after clicking, the user decides whether to continue looking through the documents or to stop, with a probability also conditioned on the current document's relevance label. Adjusting these probabilities allows us to simulate different types of users.
Table 1: Configurations of the simulated click models.

                   Click Probability      Stop Probability
  Relevance grade   0     1     2          0     1     2
  Perfect          0.0   0.5   1.0        0.0   0.0   0.0
  Navigational     0.05  0.5   0.95       0.2   0.5   0.9
  Informational    0.4   0.7   0.9        0.1   0.3   0.5

We use three click model configurations, as shown in Table 1: 1) the perfect user, who clicks on all relevant documents and does not stop browsing, contributing the least noise; 2) the navigational user, who stops early once a relevant document is found; and 3) the informational user, who sometimes clicks on irrelevant documents while searching for information, contributing the most noise. The length of the result list evaluated by the click models is set to 10, as a standard setting in [23, 30].
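The following is a minimal sketch of this simulation using the Table 1 probabilities; it is our own code for illustration, whereas the paper's experiments use the Lerot implementation [22].

    import random

    # Table 1: (click_prob, stop_prob) indexed by relevance grade 0/1/2.
    CLICK_MODELS = {
        "perfect":       ([0.0,  0.5, 1.0],  [0.0, 0.0, 0.0]),
        "navigational":  ([0.05, 0.5, 0.95], [0.2, 0.5, 0.9]),
        "informational": ([0.4,  0.7, 0.9],  [0.1, 0.3, 0.5]),
    }

    def simulate_clicks(relevance_labels, model="informational", list_len=10):
        """Cascade click model: scan the list top-down; click with a
        probability conditioned on the document's relevance grade, and
        after a click stop browsing with the corresponding stop probability."""
        click_prob, stop_prob = CLICK_MODELS[model]
        clicks = []
        for rank, grade in enumerate(relevance_labels[:list_len]):
            if random.random() < click_prob[grade]:
                clicks.append(rank)
                if random.random() < stop_prob[grade]:
                    break
        return clicks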
• Evaluation Metrics.
To evaluate an OL2R algorithm, cumulative Normalized Discounted Cumulative Gain (NDCG) and offline NDCG are commonly used to assess the learning rate and ranking quality of the algorithm [22]. Cumulative NDCG is calculated with a per-iteration discount factor γ set to 0.995. To assess model estimation convergence, in each iteration we also measure the cosine similarity between the weight vector updated by an OL2R algorithm and a reference weight vector, which is estimated by an offline learning to rank algorithm trained on the manual relevance judgments. In our experiments, we used LambdaRank [3] with no hidden layer to obtain such a reference ranker for each dataset, because of its superior empirical performance. For all experiments, we fix the total number of iterations T to 1,000 and randomly sample each query X_t from the dataset with replacement.
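Concretely, under this convention the cumulative (online) NDCG after T interactions would be computed as Σ_{t=1}^{T} γ^{t−1} · NDCG@10(L_t) with γ = 0.995, where L_t is the interleaved list presented to the user at iteration t; this is our reading of the standard discounted sum used in Lerot [22], stated here for completeness.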
• Evaluation Questions.
We intend to answer the following questions through empirical evaluation, to better understand the advantages of our proposed algorithm:
Q1: How does our proposed NSGD algorithm perform in comparison to various baseline OL2R methods?
Q2: Do candidate directions generated by NSGD explore the gradient space more efficiently than uniform sampling from the entire parameter space?
Q3: How do the different components of NSGD contribute to its final performance?
Q4: How do different settings of the hyper-parameters alter the performance of NSGD?

• Baselines.
We chose the following four state-of-the-art OL2R algorithms as our baselines for comparison:
- DBGD [29]: a single direction, uniformly sampled from the entire parameter space, is explored; Team Draft Interleaving combines the results of the two rankers for comparison.
- CPS [9]: a candidate preselection strategy that uses historical data to preselect the proposed rankers before the interleaved test in DBGD.
- DP-DBGD [30]: two opposite, uniformly sampled directions are explored in DBGD; in our experiments it uses both Contextual Interleaving, which favors the winning direction from the previous iteration, and Team Draft Interleaving.
- MGD [23]: multiple uniformly sampled directions are explored in a single iteration, with multileaving used to interleave the results; if there is a tie, the model updates towards the mean of all winners.
4.2 Performance Comparison
We start with our first evaluation question: how does NSGD perform in comparison with the baseline OL2R methods? We run all OL2R algorithms over all 5 datasets and all 3 click models. Following the standard hyper-parameter settings of DBGD [29] and the other baselines, we set δ to 1 and α to 0.1. For algorithms that can explore multiple candidates, including MGD and NSGD, we set the number of candidates explored in one iteration to 4 (i.e., m = 4), and set k_g = T_g = k_h = 10 and T_h = 50. We will discuss the effect of these hyper-parameters on NSGD in Section 4.3. All experiments are repeated 15 times for each fold, and we report the average performance.

Figures 2 and 4 report the offline and online performance of all OL2R methods on the MQ2007 dataset under the perfect, navigational and informational click models. We also report the standard deviation of offline NDCG in every iteration of model update on this dataset in Figure 3. Due to space limits, we cannot report the detailed performance on the other datasets, but we summarize their final performance in Tables 2 and 3 respectively. From Figure 4, we observe that CPS and NSGD, which both apply candidate preselection, perform better than the other methods in terms of cumulative NDCG. This confirms that exploring carefully selected candidate directions generally improves learning speed in the early iterations compared with the uniform sampling strategy used in the other baselines. Our proposed NSGD further improves online learning efficiency over CPS by exploring inside the null space rather than the entire parameter space. From Table 2, we observe consistent improvement of NSGD on most of the datasets and click models, demonstrating the accelerated learning enabled by more efficient gradient exploration during the online learning process.

In Figure 2 we first observe that NSGD improves offline NDCG significantly faster than the other baselines, which generally require many more interactions with users to reach the same performance. This further explains the improved learning speed of NSGD shown in Figure 4. For informational users, MGD requires more than 800 iterations to reach performance comparable to NSGD at fewer than 200 iterations. From Table 3 we observe that algorithms that explore multiple candidate directions in one iteration, including MGD and NSGD, consistently achieve better offline performance than the other methods on all 5 datasets and 3 click models. Compared with MGD, NSGD further improves the final offline NDCG on the MQ2007, MQ2008 and NP2003 datasets, especially for informational users. As discussed in Section 4.1, MQ2007 and MQ2008 contain more queries with fewer assessments per query; this improvement suggests that NSGD can better identify the most effective exploration directions even in a noisy environment.
We have also tested MGD with 9 candidates explored in one iteration (i.e., m = 9), which achieves its best performance according to [23], and observed the same consistent improvement of NSGD over MGD in online performance. Due to space limits we do not report the performance of MGD with 9 candidates in Tables 2 and 3.

Figure 3 shows the standard deviation of offline NDCG at each iteration. We observe that both NSGD and MGD enjoy a much smaller standard deviation for the perfect and navigational users, suggesting that exploring multiple directions reduces the variance introduced by random exploration. Another reason for the reduced variance in NSGD is the hybrid sampling method described in Section 3.2: the result confirms that first sampling from the basis vectors of the null space and then sampling inside the null space provides more effective exploration, which not only improves learning efficiency but also effectively reduces variance at an early stage. Informational users, who have a lower stop probability and are likely to generate more clicks, typically contribute noisier clicks and more ties in the comparison. In this case, NSGD reaches a much smaller standard deviation than MGD and all other baselines, because NSGD applies context-dependent candidate preselection to propose the most differentiable directions and uses the most difficult queries to discern tied candidates. Although CPS also uses historical interactions to preselect the rankers, it selects those interactions uniformly, so they are not necessarily informative. As a result, the ranking quality of CPS oscillates when the fidelity of user feedback is low.

Table 2: Online score (discounted cumulative NDCG@10) and standard deviation of each algorithm (DBGD, CPS, DP-DBGD, MGD, NSGD) after 1000 queries on MQ2007, MQ2008, HP2003, NP2003 and TD2003, under each of the three click models. Statistically significant improvements over the MGD baseline are indicated by N (p < 0.05).

Table 3: Offline score (NDCG@10) and standard deviation of each algorithm after 1000 queries under each of the three click models, with the same layout as Table 2. Statistically significant improvements over the MGD baseline are indicated by N (p < 0.05).
To answer the second and third evaluation questions, we design detailed ablation studies of NSGD. All experiments in this section were conducted on MQ2007 under the informational click model, as this dataset has the largest number of queries and this click model makes the retrieval task the most challenging.

In the first experiment, we trained an offline LambdaRank model [3] without any hidden layer using manual relevance labels. The model obtained the best offline NDCG performance on this dataset (around 0.437 on average); its model parameter is denoted as w*. In each iteration, we compare the cosine similarity between the weight vector estimated by NSGD and w* to that between the weight vectors generated by MGD and DBGD and w*. In Figure 5 (a) we observe that NSGD moves towards w* much faster than both MGD and DBGD, which suggests that the update directions explored by NSGD are more effective in recognizing the important ranking features. Note, however, that the final converged model in NSGD is not identical to w*, and the final offline NDCG of all OL2R algorithms is worse than LambdaRank's. This is expected: LambdaRank is trained directly on manual labels. One possible way to improve an online trained model is to pre-train its weight vector with offline data and continue training it with online user feedback, taking advantage of both training schemes.

Figure 5: Detailed experimental analysis of NSGD on the MQ2007 dataset: (a) cosine similarity between the online learnt model and the offline best model w*; (b) selection ratio comparing null space and uniform gradients; (c) ablation analysis of NSGD.

The second experiment studies the utility of the gradients proposed by NSGD. We mix uniform exploratory directions from the entire parameter space with directions proposed from the null space in the same algorithm. Specifically, we use 4 candidate rankers in total for multileaving, varying the number of candidates created from null space gradients from 4 to 0, and we report the selection ratio, i.e., the frequency of selecting null-space-proposed rankers over the current ranker versus the frequency of selecting uniformly proposed rankers over the current ranker. This ratio is normalized by the number of proposed rankers of each type, to make the results comparable. We also report the online performance of each combination to understand the consequence of selecting different types of rankers. The result is shown in Figure 5 (b). We clearly observe that, compared with the uniform exploratory rankers, rankers proposed by NSGD are always more likely to be selected as the better ranker in all combinations, and that online performance increases with the number of candidates proposed by NSGD. These results show the superior quality of the directions explored by NSGD and explain the performance improvement over the baselines that uniformly sample directions to explore.

To better understand the contribution of the different components of NSGD, we disable them in turn and experiment with the resulting variants. Specifically, we compare the following four models: 1) NSGD; 2) NSGD without tie breaking, denoted as NSGD w/o (TB); 3) NSGD without tie breaking and context-dependent preselection, denoted as NSGD w/o (CDP & TB); and 4) MGD. The result is reported in Figure 5 (c).
Comparing NSGD w/o (CDP & TB) with MGD, where the difference lies in exploring the null space versus the entire parameter space, confirms the utility of null space gradient exploration, which avoids repeatedly exploring recent, less promising directions. Notably, NSGD w/o (CDP & TB) also significantly improves the learning speed, getting close to its highest offline NDCG in fewer than 200 iterations, whereas MGD takes more than 800 iterations to achieve its highest performance. Comparing NSGD w/o (CDP & TB) against NSGD w/o (TB), we observe that our context-dependent candidate preselection further improves performance by selecting candidates that can be best differentiated by the current query in interleaved tests, compared with uniformly exploring inside the null space. Comparing NSGD w/o (TB) with NSGD, we observe that using difficult queries for tie breaking further improves performance, compared with arbitrarily breaking the tie or taking the average of the winners as suggested by [23], which often introduces unexpected variance in online learning.

4.3 Hyper-parameter Analysis
To answer the fourth evaluation question, we study the effect of different hyper-parameter settings of NSGD on its online performance on the MQ2007 dataset, under the three click models described above.

• Number of candidates.
We vary the number of proposed candidate rankers m from 1 to 10, from which the best ranker is chosen through team-draft multileaving [23]. The result is reported in Figure 6 (a). Although each click model has a different best-performing candidate size, as more candidate directions are proposed the performance generally first increases and then slightly decreases. With more candidates, more directions can be explored, but it also becomes easier to have multiple winners in the interleaved test, which introduces unnecessary complexity in recognizing the best ranker. For example, we can clearly observe a trend of decreasing performance across all click models when m is larger than 5. Specifically, since the result list length is set to 10, each ranker will on average receive only 2 clicks when 4 new rankers are proposed, which makes tied winners common. This serves as further motivation for having an effective tie-breaking function in NSGD. We do not present results for m larger than 10, as each candidate ranker could then expect less than one click and the feedback from the interleaved test becomes uninformative.

• Learning rate.
In all DBGD-type OL2R algorithms, the exploration step size is decided by δ and the learning rate for updating the current ranker by α. Here we study the effect of the learning rate α, fixing δ to 1. Figure 6 (b) shows the results of varying α from 0 to 0.5. We notice that in most cases α around 0.1 gives the best performance. This suggests that even though we explore with a large step size δ, we should use a relatively small learning rate α to avoid over-exploration.

• Number of historical gradients used to construct the null space.
As mentioned in Section 3.2, when using k_g historical gradients to construct the null space, NSGD only samples from a subspace of dimension d − k_g (when the selected gradients are linearly independent). We vary k_g from 5 to 40 (in MQ2007 the feature dimension d is 46). The result is shown in Figure 6 (c). We observe that the performance remains relatively stable as k_g increases to 20, but beyond that it decreases significantly. The key reason is that when k_g is too large, the null space overly reduces the search space, preventing NSGD from finding good directions to explore and forcing it to converge quickly to a suboptimal model.

• Number of historical queries for tie breaking.
When the algorithm receives multiple winning rankers from an interleaved test, we use the most recent difficult queries to identify the best ranker. In this experiment, we vary the number of historical queries k_h used for tie breaking from 0 (which disables our tie-breaking function) to 40. The result is shown in Figure 6 (d). Evidently, using more historical queries for tie breaking increases the algorithm's performance. However, evaluating candidates over a large number of historical queries also increases time and storage complexity. To balance computational efficiency and final performance, we set k_h = 10 for NSGD in all previous experiments.

Figure 6: Performance of NSGD under different hyper-parameter settings on the MQ2007 dataset: (a) number of candidates; (b) learning rate α; (c) number of historical gradients used to construct the null space; (d) number of historical queries for tie breaking.
5 CONCLUSION
In this paper, we propose Null Space Gradient Descent (NSGD) to accelerate and improve online learning to rank. To avoid repeatedly exploring less promising directions, NSGD reduces its exploration space to the null space of recently poorly performing directions. To identify the most effective exploratory rankers, NSGD uses a context-dependent preselection strategy to select candidate rankers that maximize the chance of being differentiated by an interleaved test for the current query. When two or more rankers tie, NSGD uses historically difficult queries to evaluate and identify the most effective ranker. We performed thorough experiments over multiple datasets and showed that NSGD outperforms both the standard DBGD algorithm and several state-of-the-art OL2R algorithms.

As future work, it is important to study the theoretical properties of NSGD, including whether the proposed directions guarantee a low-bias estimate of the true gradients. As we observed in our empirical evaluations, the online trained models are generally worse than the offline trained ones, which benefit most from manual annotations; it would be meaningful to combine these two learning schemes to maximize the utility of the learnt models. Lastly, all OL2R algorithms treat consecutive interactions with users as independent, but this is not always true, particularly when users undergo complex search tasks. In this situation, balancing exploration and exploitation with respect to the search context becomes more important. We plan to explore this direction with our NSGD algorithm, as it already incorporates both long-term and short-term interaction history into gradient exploration.
ACKNOWLEDGMENTS
We thank the anonymous reviewers for their insightful comments. This work was supported in part by National Science Foundation Grants IIS-1553568 and IIS-1618948.
REFERENCES
[1] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235–256.
[2] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 1995. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 322–331.
[3] Christopher J.C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning 11, 23-581 (2010), 81.
[4] Olivier Chapelle and Yi Chang. 2011. Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge. 1–24.
[5] Olivier Chapelle and Ya Zhang. 2009. A dynamic bayesian network click model for web search ranking. In Proceedings of the 18th International Conference on World Wide Web. ACM, 1–10.
[6] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. 2005. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 385–394.
[7] Fan Guo, Chao Liu, and Yi Min Wang. 2009. Efficient multiple-click models in web search. In Proceedings of the Second ACM International Conference on WSDM. ACM, 124–131.
[8] Katja Hofmann, Anne Schuth, Shimon Whiteson, and Maarten de Rijke. 2013. Reusing historical interaction data for faster online learning to rank for IR. In Proceedings of the Sixth ACM International Conference on WSDM. ACM, 183–192.
[9] Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2013. Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval. Information Retrieval 16, 1 (2013), 63–90.
[10] Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD. ACM, 133–142.
[11] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2017. Accurately interpreting clickthrough data as implicit feedback. In ACM SIGIR Forum, Vol. 51. ACM, 4–11.
[12] Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. 2016. DCM bandits: Learning to rank with multiple clicks. In International Conference on Machine Learning. 1215–1224.
[13] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. 2015. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 767–776.
[14] Yanyan Lan, Tie-Yan Liu, Zhiming Ma, and Hang Li. 2009. Generalization analysis of listwise learning-to-rank algorithms. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 577–584.
[15] Yanyan Lan, Tie-Yan Liu, Tao Qin, Zhiming Ma, and Hang Li. 2008. Query-level stability and generalization in learning to rank. In Proceedings of the 25th International Conference on Machine Learning. ACM, 512–519.
[16] Tie-Yan Liu et al. 2009. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3, 3 (2009), 225–331.
[17] Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong, and Hang Li. 2007. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, Vol. 310.
[18] Harrie Oosterhuis and Maarten de Rijke. 2017. Balancing speed and quality in online learning to rank for information retrieval. In Proceedings of the 2017 ACM CIKM. ACM, 277–286.
[19] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. 2008. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning. ACM, 784–791.
[20] Filip Radlinski, Madhu Kurup, and Thorsten Joachims. 2008. How does clickthrough data reflect retrieval quality?. In Proceedings of the 17th ACM CIKM. ACM, 43–52.
[21] Mark Sanderson et al. 2010. Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval 4, 4 (2010), 247–375.
[22] Anne Schuth, Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2013. Lerot: An online learning to rank framework. In Proceedings of the 2013 Workshop on Living Labs for Information Retrieval Evaluation. ACM, 23–26.
[23] Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. 2016. Multileave gradient descent for fast online learning to rank. In Proceedings of the Ninth ACM International Conference on WSDM. ACM, 457–466.
[24] Anne Schuth, Floor Sietsma, Shimon Whiteson, Damien Lefortier, and Maarten de Rijke. 2014. Multileaved comparisons for fast online evaluation. In Proceedings of the 23rd ACM CIKM. ACM, 71–80.
[25] Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. 2013. Ranked bandits in metric spaces: learning diverse rankings over large document collections. Journal of Machine Learning Research 14, Feb (2013), 399–436.
[26] Ellen M. Voorhees, Donna K. Harman, et al. 2005. TREC: Experiment and Evaluation in Information Retrieval. Vol. 1. MIT Press, Cambridge.
[27] Hongning Wang, Yang Song, Ming-Wei Chang, Xiaodong He, Ahmed Hassan, and Ryen W. White. 2014. Modeling action-level satisfaction for search task satisfaction prediction. In Proceedings of the 37th International ACM SIGIR Conference. ACM, 123–132.
[28] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. 2012. The k-armed dueling bandits problem. J. Comput. System Sci. 78, 5 (2012), 1538–1556.
[29] Yisong Yue and Thorsten Joachims. 2009. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 1201–1208.
[30] Tong Zhao and Irwin King. 2016. Constructing reliable gradient exploration for online learning to rank. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 1643–1652.
[31] Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. 2017. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning.