Learning to Rank in the Position Based Model with Bandit Feedback
Ermis Beyza, Patrick Ernst, Yannik Stein, Giovanni Zappella
{ermibeyz,peernst,syannik,zappella}@amazon.de
Amazon, Berlin, Germany
ABSTRACT
Personalization is a crucial aspect of many online experiences. In particular, content ranking is often a key component in delivering sophisticated personalization results. Commonly, supervised learning-to-rank methods are applied, which suffer from bias introduced during data collection by production systems in charge of producing the ranking. To compensate for this problem, we leverage contextual multi-armed bandits. We propose novel extensions of two well-known algorithms, viz. LinUCB and linear Thompson Sampling, to the ranking use-case. To account for the biases in a production environment, we employ the position-based click model. Finally, we show the validity of the proposed algorithms by conducting extensive offline experiments on synthetic datasets as well as customer facing online A/B experiments.
ACM Reference Format:
Ermis Beyza, Patrick Ernst, Yannik Stein, Giovanni Zappella. 2020. Learning to Rank in the Position Based Model with Bandit Feedback. ACM, New York, NY, USA, 9 pages.
1 INTRODUCTION

The content catalogue in many online experiences today is too large to be explored in full by regular customers. To explore and consume these catalogues, content providers often present a selected subset of their content which is personalized for easier consumption. For example, almost all major music streaming services rely on vertical tile interfaces, where the user interface is subdivided into rectangular blocks, vertically and horizontally. The content of every tile is a graphical banner. Usually, customers observe a limited number of tiles, which sometimes even rotate every few seconds, where only one large banner is visible at each point in time. The selected tiles displayed to the customer significantly impact the engagement with the service. Moreover, the order in which they are presented by the application strongly impacts their chance of being observed by the customer. This clearly calls for the need to consider the order as well as the bias introduced by the visualization mechanism. Generally, the selection and ranking of content are core operations in most modern recommendation and personalization systems. In this problem setting, we need to leverage all available information to improve the customer experience.
Related Work. Learning-to-rank approaches have been studied in practical settings (e.g., see [10]) and there is additional work addressing the presence of incomplete feedback, also known as "bandit" feedback (e.g., [16-18, 26]). Learning-to-rank can be cast as a combinatorial learning problem where, given a set of actions, the learner has to select the ordered subset maximizing its reward. A standard combinatorial problem with bandit feedback (e.g., see [3, 6]) would provide a single feedback (e.g., a click/no-click signal) for each subset of selected actions or tiles, making the problem unnecessarily difficult. A more benign formulation is to look at the problem as a semi-bandit problem, where the learner can observe feedback for each action, possibly transformed by a function of the action's position in the ranking. Recently, several relevant methods have been proposed for this kind of problem: non-contextual bandit methods such as [16, 18, 19, 21] do not leverage side-information about customers or content and thus do not present a viable solution for our problem setting. Different approaches offer solutions using complex click models (i.e., the cascade model [17, 27]), which can be effective in applications like re-ranking of search results, but are complex to extend to other aspects, like additional elements on the page, since in practice those are often controlled by different subsystems. The approaches described in [8, 14, 22] share the same problem space as this work, but target different aspects of the problem, such as fairness, reward models, and evaluations.
Contribution. The first contribution of this paper is two different contextual linear bandit methods for the so-called Position-Based Model (PBM) [5], which are straightforward to implement, maintain, and debug. Second, we provide an empirical study of techniques to estimate the position bias during the learning process. Specifically, we introduce new algorithms derived from LinUCB and linear Thompson Sampling with Gaussian posterior, addressing the problem of learning with bandit feedback in the PBM. This model assumes that the probability of the customer interacting with a piece of content is a function of the relevance of that content and the probability that the customer will actually inspect it, allowing the model to be used in various scenarios. To the best of our knowledge, this is the first contextual bandit approach using the PBM. Finally, we show the validity and versatility of our approach by conducting extensive experiments on synthetic datasets as well as customer facing online A/B experiments, including lessons learned with anecdotal results.
2 THE MODEL

In the following we introduce the Position-Based Model (PBM) to distinguish rewards for different ranking positions, and afterwards the linear reward learning model.

Position-Based Model. The PBM [7, 23] is a click model where the probability of getting a click on an action depends on both its relevance and its position. In this setting, each position is randomly observed by the user with some probability. The model is parameterized by $L$ action dependent relevance scores, expressing the probability that an action is judged as relevant, and $L$ position dependent examination probabilities $q \in [0, 1]^L$, where $q_\ell$ denotes the probability that position $\ell$ is observed (also known as position bias). The core assumption of the PBM is that the events of an item being relevant and being observed are independent, i.e. the probability of getting a click $C$ on action $a$ in position $\ell$ is $P(C = 1 \mid x, a, \ell) = P(E = 1 \mid \ell)\, P(R = 1 \mid x, a)$. The examination probability $q_\ell = P(E = 1 \mid \ell)$ does not depend on the item placed in position $\ell$, and we need to provide these parameters. In Section 4, we discuss how we derive them.

The Learning Model. We consider a linear bandit setting in which the action taken at each round is a list of $L$ actions chosen from a given set $\{a_1, \dots, a_K\}$ of size $K$. Accordingly, assuming semi-bandit feedback, we receive a reward in the form of a list of feedbacks corresponding to each position of the recommended list. At each round $t$ of the learning process, we obtain $K$ vectors in $\mathbb{R}^d$ that represent the available actions for the learner. We denote these by $\mathcal{A}_t = \{a_{1,t}, \dots, a_{K,t}\}$, and the action list selected at time $t$ will be denoted as $A_t = (A_{1,t}, \dots, A_{L,t})$, where $A_t$ is a permutation of $L$ elements of $\mathcal{A}_t$. The PBM is characterized by examination parameters $(q_\ell)_{1 \le \ell \le L}$, where $q_\ell$ is the probability that the user effectively observes the item in position $\ell$. At round $t$, the selection $A_t$ is shown and the learner observes the complete feedback. However, the observation $Z_{\ell,t}$ at position $\ell$ is censored, being the product of the examination variable $Y_{\ell,t}$ and the actual user feedback $C_{\ell,t}$, where $Y_{\ell,t} \sim \mathcal{B}(q_\ell)$ and $C_{\ell,t} = A_{\ell,t}^\top \theta + \eta_{\ell,t}$, with all $\eta_{\ell,t}$ being 1-subgaussian independent random variables. $Y_{\ell,t}$ is unknown to the learner, and $C_{\ell,t}$ is the reward of the item shown in position $\ell$ when the user considered it. We can then compute the expected payoff of each action in each position, conditionally on the action: $\mathbb{E}[Z_{\ell,t} \mid A_{\ell,t}] = q_\ell\, A_{\ell,t}^\top \theta$, where $\theta \in \mathbb{R}^d$ is the unknown model parameter. At each step $t$, the learner is asked to produce a list of $L$ actions $A_t$ that may depend on the history of observations and actions taken. As a consequence of this choice, the learner is rewarded with $r_{A_t} = \sum_{\ell=1}^{L} Z_{\ell,t}$, where $Z_t = (Z_{1,t}, \dots, Z_{L,t}) = (C_{1,t} Y_{1,t}, \dots, C_{L,t} Y_{L,t})$. The goal of the learner is to maximize the total reward $\sum_{t=1}^{T} r_{A_t}$ accumulated over the course of $T$ rounds.
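To make the feedback model concrete, the following is a minimal simulation of one round of censored PBM feedback under the linear reward assumption above. It is a sketch, not the authors' code: all names are illustrative, and Gaussian noise is used as one convenient 1-subgaussian choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def pbm_feedback(ranked_actions, theta, q, noise_std=0.1):
    """Simulate one round of censored PBM feedback Z_l = Y_l * C_l.

    ranked_actions: (L, d) array, the action placed in each position.
    theta:          (d,) unknown model parameter.
    q:              (L,) examination probabilities (position biases).
    """
    L = len(ranked_actions)
    examined = rng.random(L) < q                                     # Y_l ~ Bernoulli(q_l)
    clicks = ranked_actions @ theta + rng.normal(0.0, noise_std, L)  # C_l
    return examined * clicks                                         # censored observation Z_l

# toy example: 3 positions, 4-dimensional contextualized actions
theta = np.array([0.5, 0.1, 0.3, 0.2])
q = np.array([1.0, 0.6, 0.3])   # position bias decays with rank
print(pbm_feedback(rng.random((3, 4)), theta, q))
```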
3 ALGORITHMS

We now introduce two contextual bandit algorithms for learning to rank in the PBM. The first one, LinUCB-PBMRank, is a variation of LinUCB [1, 4, 9], the contextual version of the optimistic approaches inspired by UCB1. The second algorithm,
LinTS-PBMRank, is a Bayesian approach to exploration and is a variation of linear Thompson Sampling (LinTS) [2].

LinUCB-PBMRank. The LinUCB algorithm for the contextual bandit problem in the single-action case obtains, at each time $t$, a least squares estimator for $\theta$ using all past observations:
$$\hat{\theta}_t = \arg\min_{\tilde{\theta} \in \mathbb{R}^d} \sum_{s=1}^{t-1} \big(C_s - A_s^\top \tilde{\theta}\big)^2 + \lambda \|\tilde{\theta}\|^2.$$
We can now derive a conditionally unbiased estimator of the model parameter $\theta$ for the ranking case in the PBM as the least squares solution of
$$\hat{\theta}_t = \arg\min_{\tilde{\theta} \in \mathbb{R}^d} \sum_{s=1}^{t-1} \sum_{\ell=1}^{L} \big(Z_{\ell,s} - q_\ell A_{\ell,s}^\top \tilde{\theta}\big)^2 + \lambda \|\tilde{\theta}\|^2.$$

Proposition 1. The solution to the convex optimization problem formulated above gives a closed form for the estimator $\hat{\theta}$:
$$\hat{\theta}_t = V_t^{-1} b_t = \Big(\sum_{\ell=1}^{L} q_\ell^2\, V_{\ell,t} + \lambda I\Big)^{-1} \Big(\sum_{\ell=1}^{L} q_\ell\, b_{\ell,t}\Big) \quad (1)$$
where, for all $\ell \in [L]$, $V_{\ell,t} = \sum_{s=1}^{t-1} A_{\ell,s} A_{\ell,s}^\top$ and $b_{\ell,t} = \sum_{s=1}^{t-1} Z_{\ell,s} A_{\ell,s}$.

Proof. Setting the gradient of the cost function with respect to $\tilde{\theta}$ to zero leads to
$$-2 \sum_{s=1}^{t-1} \sum_{\ell=1}^{L} q_\ell A_{\ell,s} \big(Z_{\ell,s} - q_\ell A_{\ell,s}^\top \theta\big) + 2\lambda \theta = 0,$$
i.e.,
$$\sum_{\ell=1}^{L} q_\ell \sum_{s=1}^{t-1} Z_{\ell,s} A_{\ell,s} = \Big(\sum_{\ell=1}^{L} q_\ell^2 \sum_{s=1}^{t-1} A_{\ell,s} A_{\ell,s}^\top + \lambda I\Big)\, \theta,$$
which is solved by the estimator in (1).
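A direct batch translation of Proposition 1 might look as follows; the helper name and array layout are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def pbm_ridge_estimator(actions, observations, q, lam=1.0):
    """Closed-form estimator of Proposition 1.

    actions:      (T, L, d) array; actions[t, l] is the action shown
                  in position l at round t.
    observations: (T, L) array of censored feedback Z.
    q:            (L,) position biases.
    """
    T, L, d = actions.shape
    V = lam * np.eye(d)                         # lambda * I
    b = np.zeros(d)
    for l in range(L):
        A_l = actions[:, l, :]                  # all rounds, position l
        V += q[l] ** 2 * A_l.T @ A_l            # q_l^2 * V_{l,t}
        b += q[l] * A_l.T @ observations[:, l]  # q_l * b_{l,t}
    return np.linalg.solve(V, b)                # theta_hat = V^{-1} b
```

In the online algorithm below, the same quantities $V$ and $b$ are simply accumulated round by round instead of being recomputed from scratch.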
From a Bayesian point of view, the problem can be formulated as posterior estimation of the parameter $\theta$. Here, the true observation $Z_{\ell,t}$ is replaced by its conditional expectation given the censored position variables $Y_{\ell,t} \sim \mathcal{B}(q_\ell)$. We introduce the filtration $\mathcal{F}_t$ as the union of the history until time $t-1$ and the contexts at time $t$, $\mathcal{F}_t = (A_1, Z_1, \dots, A_t)$, such that for all $t, \ell$: $\mathbb{E}[Z_{\ell,t} \mid \mathcal{F}_t] = \mathbb{E}[C_{\ell,t} \mid \mathcal{F}_t]\, \mathbb{E}[Y_{\ell,t} \mid \mathcal{F}_t] = q_\ell\, (A_{\ell,t}^\top \theta)$. We present a fully Bayesian treatment of linear Thompson Sampling, where we assume $\sigma^2$ follows an Inverse-Gamma distribution and $\theta$ follows a multivariate Gaussian:
$$\sigma^2 \sim \mathrm{IG}(\alpha_0, \beta_0) =: p(\sigma^2), \qquad \theta \sim \mathcal{N}(\theta_0, \sigma^2 V_0^{-1}) =: p(\theta), \qquad Z_{\ell,t} \mid A_{\ell,t}, \theta, q_\ell, \sigma^2 \sim \mathcal{N}(q_\ell\, \theta^\top A_{\ell,t}, \sigma^2).$$
For the above model, the joint posterior $p(\theta, \sigma^2 \mid \mathcal{F}_t)$ follows a Normal-Inverse-Gamma distribution. We can compute the posterior of the full-Bayesian approach as
$$p(\theta, \sigma^2 \mid \mathcal{F}_t) \propto p(\sigma^2)\, p(\theta) \prod_{\ell=1}^{L} p(Z_{\ell,t} \mid \theta, \mathcal{F}_t) \propto \exp\Big\{-\frac{1}{2\sigma^2} \sum_{\ell=1}^{L} \big(Z_{\ell,t} - q_\ell A_{\ell,t}^\top \theta\big)^2\Big\} \exp\Big\{-\frac{1}{2\sigma^2} (\theta - \theta_0)^\top V_0 (\theta - \theta_0)\Big\}\, (\sigma^2)^{-(\alpha_0+1)} \exp\Big(-\frac{\beta_0}{\sigma^2}\Big).$$
We rearrange the posterior to obtain the posterior mean $\theta_t$ and the variance $V_t^{-1}$ in closed form. First, we rewrite the quadratic terms in the exponential as a single quadratic form:
$$Q(\theta, \sigma^2) = \sum_{\ell=1}^{L} \big(Z_{\ell,t} - q_\ell A_{\ell,t}^\top \theta\big)^2 + (\theta - \theta_0)^\top V_0 (\theta - \theta_0) = (\tilde{Z}_t - W \theta)^\top (\tilde{Z}_t - W \theta),$$
where $\tilde{Z}_t$ stacks the observations $(Z_{1,t}, \dots, Z_{L,t})$ on top of $V_0^{1/2} \theta_0$ and $W$ stacks the rows $q_\ell A_{\ell,t}^\top$ on top of $V_0^{1/2}$. In this case
$$V_t = W^\top W = \sum_{\ell=1}^{L} q_\ell^2 A_{\ell,t} A_{\ell,t}^\top + V_0 \quad \text{and} \quad \theta_t = V_t^{-1} (W^\top \tilde{Z}_t) = V_t^{-1} \Big(\sum_{\ell=1}^{L} q_\ell Z_{\ell,t} A_{\ell,t} + V_0 \theta_0\Big).$$
At each time $t$, we sample one parameter vector from the posterior and use it to compute the scores of all actions. The parameters of this posterior can be maintained in terms of the parameters at time $t-1$:
$$V_t = V_{t-1} + \sum_{\ell} q_\ell^2 A_{\ell,t} A_{\ell,t}^\top, \qquad \alpha_t = \alpha_{t-1} + \tfrac{L}{2}, \qquad \theta_t = V_t^{-1} b_t, \qquad \beta_t = \beta_0 + \tfrac{1}{2}\big(\eta_t - \theta_t^\top b_t\big),$$
where $V_0 = \lambda I$, $b_t = b_{t-1} + \sum_{\ell} q_\ell Z_{\ell,t} A_{\ell,t}$ and $\eta_t = \eta_{t-1} + \sum_{\ell} Z_{\ell,t}^2$. We can apply the Sherman-Morrison identity [24], which gives the inverse of the sum of an invertible matrix and an outer product of vectors, to maintain $V_t^{-1}$ and improve computational efficiency. The linear Thompson Sampling to rank algorithm (LinTS-PBMRank) is summarized in Algorithm 2. For dense action vectors the above update scheme is computed in $O(d^2)$.
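The sampling step can be implemented directly from the Normal-Inverse-Gamma posterior. The following is a minimal sketch with hypothetical names, assuming the statistics $V$, $b$, $\alpha$, $\beta$ are maintained as above:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_theta(V, b, alpha, beta):
    """Draw one parameter vector from the Normal-Inverse-Gamma posterior.

    V, b:        current precision matrix and weighted response vector.
    alpha, beta: Inverse-Gamma parameters of the posterior over sigma^2.
    """
    sigma2 = 1.0 / rng.gamma(alpha, 1.0 / beta)  # sigma^2 ~ IG(alpha, beta)
    mean = np.linalg.solve(V, b)                 # posterior mean theta_t
    cov = sigma2 * np.linalg.inv(V)              # posterior covariance
    return rng.multivariate_normal(mean, cov)
```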
Algorithm 1 LinUCB-PBMRank
Input: position bias parameters $(q_1, \dots, q_L)$, confidence level $\delta > 0$, regularization $\lambda$.
for $t = 1, \dots, T$ do
  Get the contextualized actions $\mathcal{A}_t$.
  Compute $\hat{\theta}_t$ as in Prop. 1 and, for all $a \in \mathcal{A}_t$, $U_t(a) = a^\top \hat{\theta}_t + \sqrt{f_{t,\delta}}\, \|a\|_{V_t^{-1}}$.
  Build the top-$L$ action list $A_t \in \arg\max \sum_{\ell} q_\ell\, U_t(A_\ell)$ (ties broken arbitrarily).
  Update $V_t \leftarrow V_{t-1} + \sum_{\ell} q_\ell^2 A_{\ell,t} A_{\ell,t}^\top$.
  Receive the feedback for round $t$.
  Update $b_t \leftarrow b_{t-1} + \sum_{\ell} q_\ell Z_{\ell,t} A_{\ell,t}$.
end for
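A minimal sketch of one round of Algorithm 1 follows. It assumes the position biases are sorted in decreasing order, so that ranking actions by their optimistic index maximizes the position-weighted objective, and it uses a fixed multiplier in place of the confidence width $\sqrt{f_{t,\delta}}$ of [1]; all names are illustrative.

```python
import numpy as np

def linucb_pbm_round(candidates, V, b, q, width=1.0):
    """Select a ranking as in Algorithm 1 (illustrative confidence width).

    candidates: (K, d) contextualized actions available this round.
    V, b:       current statistics.
    q:          (L,) position biases, assumed non-increasing, so sorting
                by the optimistic index maximizes sum_l q_l * U(a_l).
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    # U_t(a) = a^T theta_hat + width * ||a||_{V^{-1}}
    norms = np.sqrt(np.einsum("kd,de,ke->k", candidates, V_inv, candidates))
    ucb = candidates @ theta_hat + width * norms
    return np.argsort(-ucb)[: len(q)]      # action indices, best position first

def linucb_pbm_update(V, b, shown, feedback, q):
    """Accumulate the position-bias-weighted statistics of Algorithm 1."""
    for l, (a, z) in enumerate(zip(shown, feedback)):
        V += q[l] ** 2 * np.outer(a, a)
        b += q[l] * z * a
    return V, b
```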
Algorithm 2 LinTS-PBMRank
Input: position bias parameters $(q_1, \dots, q_L)$, confidence level $\delta > 0$, prior precision parameters $\alpha_0$ and $\beta_0$, so that $\sigma^2 \sim \mathrm{IG}(\alpha_0, \beta_0)$ and $p(\theta) = \mathcal{N}(0, \sigma^2 I)$.
for $t = 1, \dots, T$ do
  Get the contextualized actions $\mathcal{A}_t$.
  Sample $\tilde{\theta}_t \sim p_{t-1}$.
  Compute the scores for all $a \in \mathcal{A}_t$: $s_t(a) = a^\top \tilde{\theta}_t$.
  Build the top-$L$ action list $A_t \in \arg\max \sum_{\ell} q_\ell\, s_t(A_\ell)$.
  Update $V_t \leftarrow V_{t-1} + \sum_{\ell} q_\ell^2 A_{\ell,t} A_{\ell,t}^\top$.
  Receive the feedback for round $t$.
  Update $b_t \leftarrow b_{t-1} + \sum_{\ell} q_\ell Z_{\ell,t} A_{\ell,t}$.
end for

4 POSITION BIAS ESTIMATION

Accurate estimation of the position bias is crucial for unbiased learning-to-rank from implicit click data. We can provide these parameters either as fixed values or use an automatic parameter estimation method. Using fixed hyperparameters in a production environment with many different and continuously expanding use cases can be quite challenging in terms of maintenance and scaling. To avoid that, we evaluate three automatic estimation methods: i) estimating the click-through rate (CTR) per position, updating it online after observing each record; ii) a supervised learning approach leveraging the Bayesian Probit regression (PR) model; and iii) bias estimation using an expectation-maximization (EM) algorithm.

CTR-based estimation. One of the most commonly used quantities in click log studies is the click-through rate (CTR) at different positions [5, 15]. A common heuristic in these cases is the rank-based CTR model, where the click probability depends on the rank of the document: $P(C = 1 \mid \ell) = \rho_\ell$. Given that the click event is always observed, $\rho_\ell$ can be estimated by maximum likelihood. The likelihood for the parameter $\rho_\ell$ can be written as
$$\mathcal{L}(\rho_\ell) = \prod_{c_i \in S_c} \rho_\ell^{c_i} (1 - \rho_\ell)^{1 - c_i} \quad (2)$$
where $S_c$ is the set of click observations and $c_i$ is the value of the $i$-th click indicator for position $\ell$. Taking the log of (2), computing its derivative and equating it to zero gives the maximum likelihood estimate of $\rho_\ell$, which is the sample mean of the $c_i$'s:
$$\rho_\ell = \frac{\sum_{c_i \in S_c} c_i}{|S_c|}. \quad (3)$$
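A minimal sketch of the CTR estimator; following the convention used later in the paper for the CTR and PR estimators, the result is normalized so that the first position gets bias 1 (the helper name and array layout are our own).

```python
import numpy as np

def ctr_position_bias(clicks):
    """MLE of the per-position click rate rho_l (Eq. 3), normalized to q_1 = 1.

    clicks: (N, L) binary array; clicks[i, l] = 1 if record i had a click
            in position l.  Assumes position 1 received at least one click.
    """
    rho = clicks.mean(axis=0)   # sample mean per position
    return rho / rho[0]         # anchor the first position at 1
```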
Probit-based estimation. The CTR-based method is very intuitive, but it does not consider the actions' features and their probability of being clicked. Furthermore, it can incur the same bias-related problem as naive rankers, since clicks will likely be more frequent towards the beginning of the ranking. We aim to learn a mapping $x \to [0, 1]$ from a set of features $x$ to the probability of a click. The Bayesian Linear Probit model is a generalized linear model (GLM) with a Probit link function. The sampling distribution is given by $P(C \mid \theta, x) := \Phi(C \cdot \theta^\top x / \beta)$, where $C$ is either 1 (click) or -1 (no click) and $\Phi$ is the cumulative density function of the standard normal distribution, $\Phi(t) := \int_{-\infty}^{t} \mathcal{N}(s; 0, 1)\, ds$. It serves as the link function that maps the output of the linear model (sometimes referred to as the score) in $(-\infty, \infty)$ to a probability distribution in $[0, 1]$ over the observed data $C$. The parameter $\beta$ scales the steepness of the inverse link function. The function $P(C \mid \theta, x)$ is called the likelihood as a function of $\theta$ and the sampling distribution as a function of $C$; the latter is the generative model of the data and is a proper probability distribution, whereas the former is the weighting that the data $C$ gives to each parameter. The model uncertainty over the weight vector is captured in $P(\theta) = \mathcal{N}(\mu, \sigma^2)$. Given a feature vector $x$, the proposed sampling distribution together with the belief distribution results in a Gaussian distribution over the latent score. Given the sampling distribution $P(C \mid \theta, x)$ and the prior $P(\theta)$, the posterior is calculated as $P(\theta \mid C, x) \propto P(C \mid \theta, x)\, P(\theta)$.

We keep a Probit Regression (PR) model for each position $\ell$. Given the likelihood $P(C \mid x, a, \ell, \theta)$, the posterior is calculated as $P(\theta \mid C, x, a, \ell) \propto P(C \mid x, a, \ell, \theta)\, P(\theta)$. Then, the predictive distribution $P(C \mid x, a, \ell)$ can be computed from a given feature vector and the posterior (see [13] for details). As mentioned in Section 2, the probability of getting a click on action $a$ in position $\ell$ is $P(C = 1 \mid x, a, \ell) = P(E = 1 \mid \ell)\, P(R = 1 \mid x, a)$. Here, our goal is to compute $q_\ell = P(E = 1 \mid \ell)$, and we compute it as
$$q_\ell = \frac{P(E = 1 \mid \ell)\, P(R = 1 \mid x, a)}{P(E = 1 \mid \ell = 1)\, P(R = 1 \mid x, a)},$$
where we assume $P(E = 1 \mid \ell = 1) = 1$.
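As an illustration, the ratio above can be computed from the per-position probit models. This sketch scores with the posterior mean weights for brevity, rather than the full predictive distribution of [13]; all names are assumptions.

```python
import numpy as np
from scipy.stats import norm

def pr_position_bias(theta_by_pos, x, beta=1.0):
    """Estimate q_l as a ratio of predicted click probabilities.

    theta_by_pos: (L, d) posterior mean weights, one probit model per position.
    x:            (d,) features of a (context, action) pair.
    Assumes P(E=1 | l=1) = 1, as stated above.
    """
    p_click = norm.cdf(theta_by_pos @ x / beta)  # probit link per position
    return p_click / p_click[0]                  # relevance factor cancels out
```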
It has to be noted that, although this assumption holds for the applications considered in the experimental section of this paper, it is not guaranteed in all real-world applications. It is possible that the content on the page gets reshuffled by another system and the position of the component visualizing the ranking changes, with a significant impact on the chance of the customer observing the content. In Section 4.3, we provide experimental results where this assumption is violated.

EM-based estimation. After the observations made for the CTR and PR estimators, we decided to explore different directions in order to provide a solution that can be more robust in real-world scenarios. The Expectation-Maximization (EM) algorithm can be applied to a large family of estimation problems with latent variables. In particular, suppose we have a training set $X = \{x_1, \dots, x_n\}$ consisting of $n$ independent examples, and we wish to fit the parameters of a model $P(X, Z)$ to the data, where the likelihood is given by
$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \log P(x_i; \theta) = \sum_{i=1}^{n} \log \sum_{z_i} P(x_i, z_i; \theta). \quad (4)$$
However, explicitly finding the maximum likelihood estimates of the parameters $\theta$ may be hard, since the $z_i$'s are unobserved latent random variables. In such a setting, the EM algorithm gives an efficient method for maximum likelihood estimation: to maximize $\mathcal{L}(\theta)$, EM repeatedly constructs a lower bound on $\mathcal{L}$ (E-step) and then optimizes that lower bound (M-step). The EM estimator provided in this section can be seen as a generalization of the PR estimator, which should provide better practical performance. Given the relevance estimate $\gamma_{x,a} = P(R = 1 \mid x, a)$ and the position bias $q_\ell = P(E = 1 \mid \ell)$, where $P(C = 1 \mid x, a, \ell) = P(E = 1 \mid \ell)\, P(R = 1 \mid x, a)$, and a regular click log $\mathcal{L} = \{(c, x, a, \ell)\}$, the log likelihood of generating this data is
$$\log P(\mathcal{L}) = \sum_{(c, x, a, \ell) \in \mathcal{L}} c \log q_\ell \gamma_{x,a} + (1 - c) \log (1 - q_\ell \gamma_{x,a}). \quad (5)$$
The EM algorithm can find the parameters that maximize the log-likelihood of the whole data. In [25], the authors introduced an EM-based method to estimate the position bias from regular production clicks. The standard EM algorithm iterates over the Expectation and Maximization steps to update the position bias $q_\ell$ and the relevance parameter $\gamma_{x,a}$. In this paper, we modify the standard EM and take $\gamma_{x,a}$ equal to $\sigma(A_t^\top \hat{\theta}_t)$ at each step $t$, where $A_t$ is the contextualized action. In this way, we take the context information into account.
At iteration $t+1$, the Expectation step estimates the distribution of the hidden variables $E$ and $R$ given the parameters from iteration $t$ and the observed data in $\mathcal{L}$; for records without a click,
$$P(E = 1 \mid c = 0, x, a, \ell) = \frac{q_\ell^{(t)} \big(1 - \gamma_{x,a}^{(t)}\big)}{1 - q_\ell^{(t)} \gamma_{x,a}^{(t)}}, \quad (6)$$
while a click implies that the position was examined. For more details we refer to [25]. The Maximization step updates the parameters using the quantities from the Expectation step:
$$q_\ell^{(t+1)} = \frac{1}{T} \sum_{t=1}^{T} \left( c_{\ell,t} + (1 - c_{\ell,t})\, \frac{q_\ell^{(t)} \big(1 - \gamma_{x,a}^{(t)}\big)}{1 - q_\ell^{(t)} \gamma_{x,a}^{(t)}} \right). \quad (7)$$
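A compact sketch of these EM updates on a flat click log follows. The array layout and helper name are hypothetical; in the paper the relevance estimates are refreshed from the bandit model as learning progresses, while here they are passed in fixed.

```python
import numpy as np

def em_position_bias(clicks, positions, relevance, L, n_iter=50):
    """EM estimate of q_l from a click log (Eqs. 6-7).

    clicks:    (N,) binary click indicators c.
    positions: (N,) integer position of each record, in {0, ..., L-1};
               every position is assumed to occur at least once.
    relevance: (N,) relevance estimates gamma_{x,a}; the paper sets these
               to sigmoid(A_t^T theta_hat_t) at each step.
    """
    q = np.full(L, 0.5)  # simple initialization (see the lessons learned below)
    for _ in range(n_iter):
        # E-step (Eq. 6): a click implies the position was examined;
        # otherwise P(E=1 | c=0) = q(1 - gamma) / (1 - q * gamma).
        qp = q[positions]
        p_exam = np.where(clicks == 1, 1.0,
                          qp * (1.0 - relevance) / (1.0 - qp * relevance))
        # M-step (Eq. 7): average examination probability per position.
        for l in range(L):
            q[l] = p_exam[positions == l].mean()
    return q
```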
5 EXPERIMENTS

In this section we provide a number of empirical results to demonstrate the advantages of the proposed algorithms compared to their "naive" counterparts and other baselines. To this aim, we perform experiments on synthetic datasets with different variants of the position bias estimation in a controlled environment, showing differences that would not be possible to measure in an online environment. We also tested our algorithms in online experiments against other baselines; however, in order to avoid the risk of a negative impact on the customer experience, we ran online experiments comparing our two variants of the presented algorithms against two "safe" production-like baselines. Comparing our algorithms to baselines that provided negative results in the offline experiments would be irresponsible and completely against the customers' interest.

Synthetic Datasets. For the purpose of testing our algorithms in a controlled environment, we created two synthetic datasets with 25 available actions, and we limited the algorithm to select a maximum of 20 actions, simulating the behavior of a page that does not display all the actions to all the customers. The action vectors were generated as follows: we fixed the number of dimensions to 5, generated dense vectors of random numbers in [0, 1), and then set all the entries having a value below 0.1 to 0.0 (this introduces some sparsity). The context vectors, part of the same datasets, have 10 dimensions and are generated in the same way as the actions. A simplified version of the behaviour of the production system is reproduced in the offline experiments, so we join the action and context vectors to create a contextualized action. This is created by concatenating the action vector, the context vector, and the vectorized outer product of the two. This process generates 25 vectors, each one representing an action, and each vector is made of three blocks: the action vector, the context vector, and the cross product between the action vector and the context vector. After the vectors are generated, they are normalized by dividing them by their respective squared norms. The only vectors received by the predictors are the ones made available at the end of this process.
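A sketch of this construction; the helper name, seed, and exact sampling calls are our own.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_contextualized_actions(n_actions=25, d_action=5, d_context=10,
                                sparsity=0.1):
    """Build the contextualized actions described above."""
    actions = rng.random((n_actions, d_action))
    actions[actions < sparsity] = 0.0        # introduce some sparsity
    context = rng.random(d_context)
    context[context < sparsity] = 0.0
    vectors = []
    for a in actions:
        v = np.concatenate([a, context, np.outer(a, context).ravel()])
        vectors.append(v / np.dot(v, v))     # divide by the squared norm
    return np.stack(vectors)

print(make_contextualized_actions().shape)   # (25, 65)
```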
[Table 1: Cumulative reward on SINREAL. Rows: the compared algorithms (LinUCB, LinTS, and the PBMRank variants with different position bias estimators); columns: number of actions/positions (1, 5, 10, 20). Numeric entries not recoverable.]

[Table 2: Cumulative reward on SINBIN datasets; same layout as Table 1. Numeric entries not recoverable.]

[Table 3: Cumulative reward with 5 or 10 actions/positions on SINBIN and SINREAL when the position bias is (1 - ϵ) exp(-position), for ϵ ∈ {0.1, 0.25, 0.5}; rows include LinTS-PBMRank (Real) and its variants. Numeric entries not recoverable.]
For the dataset with real valued rewards (later called SINREAL), the rewards are generated as follows: at the beginning of the process a unit-length random vector $w$ is fixed; $w$ is used to compute the inner product with the contextualized actions, following the linear assumption made in Section 2. The reward is generated by summing the inner product between $w$ and the contextualized action vector with a noise factor uniformly sampled in the interval [-0.1, 0.1). Then, we apply floor and ceiling operations to make sure we obtain a reward in [0, 1]. In the case of the dataset with binary valued rewards (later called SINBIN), the same procedure is followed but we binarize the rewards by thresholding with a predefined hyperparameter. Before the rewards are provided to update the predictor, they are divided by the exponential of the position assigned to the corresponding action by the learning algorithm (this is done "online" and depends on the predictions made by the algorithm). The aim is to mimic the behavior observed in online experiments, where users tend to click significantly more on the top positions of the ranking. The exponential function was chosen after observing the behavior of customers in some online experiments.
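A minimal sketch of this generation process; the binarization threshold is a placeholder, since the paper does not report its value, and the helper names are our own.

```python
import numpy as np

rng = np.random.default_rng(7)

def sinreal_rewards(X, w, noise=0.1):
    """SINREAL: clipped linear reward with uniform noise."""
    r = X @ w + rng.uniform(-noise, noise, len(X))
    return np.clip(r, 0.0, 1.0)

def sinbin_rewards(X, w, threshold=0.5, noise=0.1):
    """SINBIN: binarized rewards (the threshold here is a placeholder)."""
    return (sinreal_rewards(X, w, noise) >= threshold).astype(float)

def position_discounted(rewards, positions):
    """Divide each reward by exp(position) before updating the predictor."""
    return rewards / np.exp(positions)
```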
Figure 1: Comparison of the real position biases and the position biases estimated by the CTR, PR, and EM methods for the top 5 positions. SINBIN on the left and SINREAL on the right.
Baselines. In our experiments, we compared the two algorithms presented in Section 3.1 with their counterparts that do not account for the bias introduced by the ranking position, namely LinTS and LinUCB. These algorithms select the actions by taking the top-K with the highest scores instead of the single best one as in their original definition. The update operation is performed using all the selected actions and the corresponding rewards without any re-weighting. This is equivalent to setting all the $\{q_\ell\}_{\ell=1}^L$ to 1 in the algorithms referenced above.
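Concretely, the only difference between the PBM-aware runs and the naive baselines is the vector of position biases passed to the same code. The decaying profile below mirrors the exponential discount used in the synthetic data generation and is purely illustrative:

```python
import numpy as np

L = 20
q_pbm = np.exp(-np.arange(L))   # PBM-aware variants: decaying position bias
q_naive = np.ones(L)            # naive LinUCB / LinTS: positions ignored
```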
Figure 2: Structure of the page where the experiment was run. The central-top slot is optimized using the methods described in this paper. All the items in the list eventually get displayed on the page, since the slot automatically loads the next piece of content after a fixed amount of time.

Synthetic Data Results. Tables 1 and 2 report the results of experiments run on synthetic data in order to validate our ideas in a controlled environment. The datasets used in this section are SINREAL and SINBIN, whose details are available in Section 5.1. Please note that, since the datasets are generated artificially, every potential prediction of the algorithm can receive the correct reward, and we do not need to employ techniques for running offline evaluation with biased datasets (e.g., [20]). In these offline experiments, we can observe two important trends: i) not addressing the position bias can significantly mislead algorithms, to the point that they can become worse than a random selection; ii) using an automatic method for estimating the position bias gives a clear advantage, but there is no clear winner between PR and EM.
Position Bias Estimation Results. The previous experiments show that CTR is inferior as a position bias estimation method, while PR and EM perform almost equally well. In Figure 1 we assess the quality of the estimation methods by comparing the estimated position biases with the true values used in the synthetic datasets. However, it is important to recall that for the CTR and PR estimators the parameter for the first position is artificially set to 1, while the EM method performs its estimation without any additional information. This is particularly useful in cases where the hyperparameter associated with the first position is unknown because it is controlled by external factors (e.g., the ranked content is displayed in a position where it does not catch the attention of the users). We conducted a range of experiments, reported in Table 3, to assess the sensitivity of PR with respect to this parameter. The results clearly show that the more severe the violation of the $q_1 = 1$ assumption, the larger the degradation in the performance of the estimators that rely on it.

6 ONLINE EXPERIMENTS

To validate our offline results and to show the effectiveness of our approach in a real-world scenario, we conducted two end-customer facing online A/B tests. Due to the costs and the potential negative customer experience of running A/B tests involving real paying customers, we focused on two main scenarios. In each scenario we pick one widget, a so-called carousel UI, embedded at the top of the landing page of a large music streaming service. A carousel consists of a list of banners, where only one banner is displayed at a time and rotated to the next one after a certain time period. We alter the arrangement of the list items between control A and treatment B to test different baselines against configurations of our bandit-based ranking approach. In particular we test:

• a human-curated list arrangement in control against our approach with fixed position biases, i.e. without online automatic estimation, as treatment;
• a collaborative filtering based ranking in control against our approach with online EM position bias estimation as treatment.

The customers are split equally between control and treatment, with a 50%/50% random allocation.

First experiment. In this experiment the goal is to have a confined test of the bandit learning-to-rank algorithm, and thus we purposefully do not include automatic position bias estimation. Instead, we rely on manual hyperparameters based on view events, where the parameter for position $i$ is the number of historical customer requests that viewed position $i$ divided by the total number of requests. The widget consists of 50 candidate items, represented by banners containing music spanning different genres and user tastes (e.g., audio books, music for children). Our control always shows the same order of 13 manually curated items to customers. In treatment, we apply our ranking bandit to contextually re-rank the candidate set every time the customer visits the landing page. We pick the top-13 scored banners to fill the carousel and present them to the customer. To contextualize the ranking, we leverage different types of features representing the customer, the content, and the general context, such as temporal information, customer taste profiles and customer affinities towards musical content. We see major increases of various classical ranking and engagement metrics in the treatment leveraging the ranking bandit. Overall, customers interacted more with the widget and also consumed more music.
In particular, if we compare the performance of the widget with the version provided to the control group, we are able to improve the following widget-specific metrics:

• the mean reciprocal rank (MRR) increased by about 15%,
• the amount of attributed playbacks increased by about 17%,
• the listening duration measured in log seconds increased by about 16%,
• the number of customers playing music increased by about 15%.

Beyond the widget itself, we also measured improvements at the landing-page level:
• there are 2.33% more playbacks originating from the landing page,
• the listening duration measured in log seconds over all customers increased by about 3%,
• the number of customers who played music increased by about 2%.

Figure 3: Intra-day trends for audio content of a niche genre.

Figure 4: Intra-day and intra-week trend for an item regarding music for Father's day. There is an evident trend in the days leading to the holiday where the item becomes more popular.

The ranking bandit was able to follow intra-day trends (see Figure 3), which we attribute to the temporal features provided as part of the context and to fast model updates. Additionally, we observed that the ranking bandit was able to handle seasonal content: an example is shown in Figure 4, which shows the average position of a banner targeted to Father's day (celebrated on May 30th) in Germany that was ranked high in the days leading to the holiday.
Second experiment. In this experiment, we tested the ranking bandit with position bias estimation against a matrix factorization baseline on the carousel widget, over 8 days during summer 2019 in the US. The carousel contained 10-15 banners that were manually curated and changed over the course of the experiment. In the control group, the banners were ordered by scores derived from an existing production system based on matrix factorization. In the treatment, we applied the ranking bandit with position bias estimation. To contextualize, we used temporal features, as well as several features representing the customer, such as the customer's taste profile and the scores from the matrix factorization baseline. Overall, we saw increased customer engagement in treatment compared to control. In particular, we saw improvements in the treatment along the following metrics for the targeted widget:

• the mean reciprocal rank (MRR) increased by about 5%,
• the attributed playbacks increased by about 7%,
• the listening duration measured in log seconds increased by about 7%,
• the number of customers playing music from this widget increased by about 6%.

At the landing-page level:

• the number of attributed playbacks increased by 0.8%,
• the listening duration measured in log seconds increased by less than 1%,
• the number of customers who played music increased by less than 1%.

7 LESSONS LEARNED

While most of the risks were mitigated before the deployment and everything moved quite smoothly, there are a few facts which we considered surprising.
Feature representation. We developed methods which leverage "contextualized actions", allowing us to perform an extensive amount of feature engineering. In this way we can leverage highly non-linear models trained on historical information to produce high-quality features. In the online experiments reported in this paper, we used our system to re-rank a very small pool of items (each represented by a large image) linked to a piece of musical content. It turns out that the one-hot-encoding representation of the items, combined with the context by means of the cross product and a non-linear dimensionality reduction technique, performed very well. We do not have a scientific explanation of the reasons behind this success, but we conjecture that the visual aspect of the items plays a crucial role which is hard to capture in a small set of visual features. Moreover, the small content pool compared to the number of requests served allows the algorithm to converge quickly even without information about the similarity between the actions. Verifying the contribution of the visual aspects to customers' decisions and the best way to encode the visual representation of the images associated with musical items is left as future work.
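A sketch of this representation follows; the helper is hypothetical, and the paper does not specify the dimensionality reduction technique that would be applied to the cross terms.

```python
import numpy as np

def item_context_features(item_id, n_items, context):
    """One-hot item encoding crossed with the context vector.

    In the production system a non-linear dimensionality reduction step
    (unspecified in the paper) is applied before feeding the bandit.
    """
    one_hot = np.zeros(n_items)
    one_hot[item_id] = 1.0
    cross = np.outer(one_hot, context).ravel()
    return np.concatenate([one_hot, context, cross])
```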
EM initialization. As reported in the previous section, we tested the Thompson Sampling ranking algorithm online in combination with the automatic position bias estimation leveraging expectation maximization (previously called LinTS-PBMRank (EM)).
Figure 5: Intra-day trends for a summer playlist in the beginning of July. The ranking bandit is in blue and the baseline recommender in green. The ranking bandit is able to catch the general trend earlier than the baseline recommender and also to follow the intra-day fluctuations.

Figure 6: Trend for a banner featuring a new track by a well-known American singer. The ranking bandit is in green and the baseline recommender in pink. As often happens, recently released content by popular artists catches the attention of customers outside the core artist fan base. In this plot it is evident that the ranking bandit catches the trend much earlier than the baseline recommender.

While we obtained positive results in the online experiment, we observed an unexpected behaviour in the probabilities computed by the EM algorithm, which could have been related to numerical stability issues, and we investigated the matter further. We decided to run a new online experiment where LinTS-PBMRank (EM) was compared with an instance of the same algorithm whose position bias probabilities were manually tuned leveraging historical data. This experiment terminated with a significant victory (about a 5% increase in MRR) for the algorithm using manually tuned position biases. Re-applying part of the updates to the model offline, we noticed that even after a substantial number of updates, the posterior means of the two models were not converging to the same value. Specifically, their cosine similarity was in the interval (0.6, 0.8). This is due to two main reasons: i) the random initialization of the EM model and ii) the error made by the predictor in estimating the rewards. We decided to change the initialization of the EM model to a fixed position-dependent value plus ϵ, where ϵ is a small random number (e.g., in (0, 0.1)). The same offline analysis described above provided significantly different results with this initialization, with an average cosine similarity of the posterior means of 0.93 and negligible variance. We tested a few other initialization techniques offline, with slightly worse but comparable results, and we are waiting to validate our findings in online experiments.

8 CONCLUSION

We provided extensions of two well-known contextual bandit algorithms that show a significant empirical advantage in real-world scenarios. Our online experiments, run on a large-scale music streaming service, show a significant customer impact measured by several different metrics. Moreover, the presented algorithms proved themselves easy to maintain in a production environment. There are a few directions in which we are considering extending these ranking solutions: i) performing additional experiments on the most effective representations to be used for music recommendations in visual clients, ii) scaling known techniques [11, 12] for the multi-bandit setting to support a massive number of customers, and iii) comparing our results with the ones obtained by more complex solutions based on reinforcement learning algorithms.
ACKNOWLEDGMENTS

We would like to thank Claire Vernade for the contributions made during the initial stage of this project.
REFERENCES
[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320, 2011.
[2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML '13, pages 127-135, 2013.
[3] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404-1422, 2012.
[4] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 208-214, 2011.
[5] Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. Click models for web search. Synthesis Lectures on Information Concepts, Retrieval, and Services, 7(3):1-115, 2015.
[6] Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2116-2124, 2015.
[7] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87-94. ACM, 2008.
[8] Paolo Dragone, Rishabh Mehrotra, and Mounia Lalmas. Deriving user- and content-specific rewards for contextual bandits. In The World Wide Web Conference, WWW '19, pages 2680-2686, New York, NY, USA, 2019. Association for Computing Machinery.
[9] Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, UAI '11, pages 169-178, Arlington, Virginia, United States, 2011. AUAI Press.
[10] Antonino Freno. Practical lessons from developing a large-scale recommender system at Zalando. In RecSys '17, pages 251-259, New York, NY, USA, 2017. ACM.
[11] Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. On context-dependent clustering of bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 1253-1262. JMLR.org, 2017.
[12] Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In International Conference on Machine Learning, pages 757-765, 2014.
[13] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In ICML '10, 2010.
[14] Alois Gruson, Praveen Chandar, Christophe Charbuillet, James McInerney, Samantha Hansen, Damien Tardieu, and Ben Carterette. Offline evaluation to make decisions about playlist recommendation algorithms. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM '19, pages 420-428, New York, NY, USA, 2019. Association for Computing Machinery.
[15] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In ACM SIGIR Forum, volume 51, pages 4-11. ACM, 2017.
[16] Junpei Komiyama, Junya Honda, and Akiko Takeda. Position-based multiple-play bandit problem with unknown position bias. In Advances in Neural Information Processing Systems, pages 4998-5008, 2017.
[17] Branislav Kveton, Csaba Szepesvári, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In ICML '15, pages 767-776, 2015.
[18] Paul Lagrée, Claire Vernade, and Olivier Cappé. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems 29, pages 1597-1605. Curran Associates, Inc., 2016.
[19] Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvári. TopRank: A practical algorithm for online stochastic ranking. In NeurIPS, 2018.
[20] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng Wen. Offline evaluation of ranking policies with click models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1685-1694. ACM, 2018.
[21] Alexander R. Luedtke, Emilie Kaufmann, and Antoine Chambaz. Asymptotically optimal algorithms for budgeted multiple play bandits. Preprint (https://hal.archives-ouvertes.fr/hal-01338733), 2017.
[22] Rishabh Mehrotra, James McInerney, Hugues Bouchard, Mounia Lalmas, and Fernando Diaz. Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness & satisfaction in recommendation systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, pages 2243-2251, New York, NY, USA, 2018. Association for Computing Machinery.
[23] Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pages 521-530. ACM, 2007.
[24] Jack Sherman and Winifred J. Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124-127, 1950.
[25] Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. Position bias estimation for unbiased learning to rank in personal search. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining, pages 610-618. ACM, 2018.
[26] Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In ICML '15, pages 1113-1122, 2015.
[27] Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, UAI '16, 2016.