TS-UCB: Improving on Thompson Sampling With Little to No Additional Computation
Jackie Baek, Operations Research Center, Massachusetts Institute of Technology, [email protected]
Vivek F. Farias, Sloan School of Management, Massachusetts Institute of Technology, [email protected]
Abstract
Thompson sampling has become a ubiquitous approach to online decision problems with bandit feedback. The key algorithmic task for Thompson sampling is drawing a sample from the posterior of the optimal action. We propose an alternative arm selection rule we dub TS-UCB, which requires negligible additional computational effort but provides significant performance improvements relative to Thompson sampling. At each step, TS-UCB computes a score for each arm using two ingredients: posterior sample(s) and upper confidence bounds. TS-UCB can be used in any setting where these two quantities are available, and it is flexible in the number of posterior samples it takes as input. This proves particularly valuable in heuristics for deep contextual bandits: we show that TS-UCB achieves materially lower regret on all problem instances in a deep bandit suite proposed in Riquelme et al. (2018). Finally, from a theoretical perspective, we establish optimal regret guarantees for TS-UCB for both the K-armed and linear bandit models.
1. Introduction
This paper studies the stochastic multi-armed bandit problem, a classical problem modeling sequential decision-making under uncertainty. This problem captures the inherent tradeoff between exploration and exploitation. We study the Bayesian setting, in which we are endowed with an initial prior on the mean reward for each arm.

Thompson sampling (TS) (Thompson 1933) has in recent years come to be a solution of choice for the multi-armed bandit problem. This popularity stems from the fact that the algorithm performs well empirically (Scott 2010, Chapelle and Li 2011) and also admits near-optimal theoretical performance guarantees (Agrawal and Goyal 2012, 2013b, Kaufmann et al. 2012b, Bubeck and Liu 2013, Russo and Van Roy 2014, 2016). Perhaps one of the most attractive features of Thompson sampling, though, is the simplicity of the algorithm itself: the key algorithmic task of TS is to sample once from the posterior on arm means, a task that is arguably the simplest thing one can hope to do in a Bayesian formulation of the multi-armed bandit problem.

This Paper:
Against the backdrop of Thompson sampling, we propose
TS-UCB. Given one or more samples from the posterior on arm means, TS-UCB simply provides a distinct approach to scoring the possible arms. The only additional ingredient this scoring rule relies on is the availability of so-called upper confidence bounds (UCBs) on these arm means.

Now both sampling from a posterior, as well as computing a UCB, can be a potentially hard task, especially in the context of bandit models where the payoff from an arm is a complex function of unknown parameters. A canonical example of such a hard problem variant is the contextual bandit problem, wherein the mean arm reward is given by a complicated function (say, a deep neural network) of the context. Riquelme et al. (2018) provide a recent benchmark comparison of ten different approaches to sampling from an approximate posterior on unknown arm parameters. They show that an approach that chooses to model the uncertainty in only the last layer of the neural network defining the mean reward from pulling a given arm at a given context is an effective and robust approach to posterior approximation. In such an approach, not only is (approximate) posterior sampling possible, but UCBs have a closed-form expression and can be easily computed, making possible the use of
TS-UCB. Our Contributions:
We show that
TS-UCB provides material improvements over Thompson sampling across the board on the benchmark set of deep bandit problems studied in Riquelme et al. (2018). Importantly, these improvements come with essentially zero additional computation. In contrast, an implementation of IDS (a state-of-the-art algorithm) (Russo and Van Roy 2018) did not provide consistent improvements over TS on this benchmark set, and required approximately three orders of magnitude more sampling (and thus compute) than either TS or TS-UCB.

Theoretically, we analyze TS-UCB in two specific bandit settings: the K-armed bandit and the linear bandit. In the first setting, there are K independent arms. In the linear bandit, each arm is a vector in R^d, and the rewards are linear in the chosen arm. In both settings, TS-UCB is agnostic to the time horizon. We prove the following Bayes regret bounds for TS-UCB:
- For the K-armed bandit, the Bayes regret of TS-UCB is at most O(√(KT log T)).
- For the linear bandit of dimension d, the Bayes regret of TS-UCB is at most O(d log T √T).
Both of these results match the lower bounds up to log factors. The results are stated more formally in Theorems 1 and 2.
Related Literature: Given the vast literature on bandit algorithms, we restrict our review to literature heavily related to our work, viz. literature focused on the development and analysis of upper confidence bound algorithms, literature analyzing Thompson sampling (TS), and literature on methods of applying deep learning models to bandit problems.

The UCB algorithm (Auer et al. 2002) computes an upper confidence bound for every action, and plays the action whose UCB is the highest. In the Bayesian setting, the "Bayes UCB" of an action is defined as the α'th quantile of the posterior distribution of its mean reward, and Kaufmann et al. (2012a) show that using α = 1 − 1/(t log^c t) achieves the lower bound of Lai and Robbins (1985) for K-armed bandits. For linear bandits, Dani et al. (2008) prove a lower bound of Ω(d√T) for infinite action sets, and the UCB algorithms from Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), and Abbasi-Yadkori et al. (2011) match this up to log factors. It is worth noting that neither the UCB nor the Bayes UCB algorithm is competitive on the benchmark set of problems in Riquelme et al. (2018).

As discussed, TS is a randomized Bayesian algorithm that chooses an action with the same probability that the action is optimal. Though it was initially proposed in Thompson (1933), TS has only recently gained a surge of interest, largely influenced by the strong empirical performance of TS demonstrated in Chapelle and Li (2011) and Scott (2010). Since then, many theoretical results on regret bounds for TS have been established (Agrawal and Goyal 2012, 2013a,b, 2017, Kaufmann et al. 2012b). In the Bayesian setting, Russo and Van Roy (2014) prove regret bounds of O(√(KT log T)) and O(d log T √T) for TS in the K-armed and linear bandit settings respectively. Bubeck and Liu (2013) improve the regret in the Bayesian K-armed setting to O(√(KT)), and they show this is order-optimal.

The ideas in this paper were heavily influenced by our reading of Russo and Van Roy (2014, 2018).
In the former paper, the authors use UCB algorithms as an analytical tool to analyze TS. This begs the natural question of whether an appropriate decomposition of regret can provide insight on algorithmic modifications that might improve upon TS. Russo and Van Roy (2018) provide such a decomposition and propose Information Directed Sampling (IDS). IDS has been shown to provide significant performance improvement over TS in some cases, but has heavy sampling (and thus, computational) requirements. The present paper presents yet another decomposition, providing an arm selection rule that does not require additional sampling (i.e., a single sample from the posterior continues to suffice), but nonetheless provides significant improvements over TS while being competitive with IDS.

On the deep learning front, one key idea that has been used to apply deep learning to sequential decision-making problems is to use TS (Riquelme et al. 2018, Lu and Van Roy 2017, Dwaracherla et al. 2020). Since TS requires just a single sample from the posterior, if the posterior can be approximated in some way, then TS can be readily applied. Riquelme et al. (2018) use this idea and evaluate TS on ten different posterior approximation methods for neural networks, ranging from variational methods (Graves 2011) to MCMC methods (Neal 2012), among others. The authors find that the approach of modeling uncertainty on just the last layer of the neural network (the "Neural Linear" approach) (Snoek et al. 2015, Hinton and Salakhutdinov 2008, Calandra et al. 2016) was overall one of the most effective approaches. This neural linear approach provides not just a tractable approach to approximate posterior sampling, but further provides a tractable UCB for the problem as well. As such, the neural linear approach facilitates the use of the TS-UCB arm selection rule, and we show that TS-UCB provides significant improvements over the use of TS on the deep bandit benchmark in Riquelme et al. (2018).
2. Model
An agent is given a compact set of actions A from which they must choose one to play at every time step t ≥ 1. If action a is chosen at time t, the agent immediately observes a random reward R_t(a) ∈ R. For each action a, the sequence (R_t(a))_{t≥1} is i.i.d. and independent of plays of other actions. The mean reward of each action a is f_θ(a), where θ ∈ Θ is an unknown parameter, and {f_θ : A → R | θ ∈ Θ} is a known set of deterministic functions. That is, E[R_t(a) | θ] = f_θ(a) for all a ∈ A and t ≥ 1.

Let H_t = (A_1, R_1(A_1), ..., A_{t−1}, R_{t−1}(A_{t−1})) denote the history of observations available when the agent is choosing the action for time t, and let H denote the set of all possible histories. We often refer to H_t as the "state" at time t. A policy (π_t)_{t≥1} is a deterministic sequence of functions mapping the history to a distribution over actions. An agent employing the policy plays the random action A_t distributed according to π_t(H_t), where H_t is the current history. We will often write π_t(a) instead of π_t(H_t)(a), where π_t(a) = Pr(A_t = a | H_t). Let A* : Θ → A be a function satisfying A*(θ) ∈ argmax_{a∈A} f_θ(a), which represents the optimal action if θ were known. We use A* to denote the random variable A*(θ), where θ is the true parameter.

The T-period regret of policy π is defined as

Regret(T, π, θ) = Σ_{t=1}^T E[f_θ(A*) − f_θ(A_t) | θ].

We study the Bayesian setting, in which we are endowed with a known prior q on the parameter θ. We take an expectation over this prior to define the T-period Bayes regret

BayesRegret(T, π) = Σ_{t=1}^T E[f_θ(A*) − f_θ(A_t)].

We assume that the agent can perform a Bayesian update to their prior at each step after the reward is observed. Let q(H_t) denote the posterior distribution of θ given the history H_t. In our work, we assume that the agent is able to sample from the distribution q(H_t) for any state H_t.

We end this section by describing two concrete bandit models that are the focus of our regret analysis.
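The interaction protocol just described is easy to render in code. The following is purely illustrative; all names here are ours rather than the paper's.

```python
import numpy as np

def run_bandit(sample_reward, policy, T, rng):
    """Simulate T rounds: at each step the agent maps the history H_t to an
    action A_t, then observes the random reward R_t(A_t)."""
    history = []  # H_t represented as a list of (action, reward) pairs
    for _ in range(T):
        a = policy(history, rng)    # A_t ~ pi_t(H_t)
        r = sample_reward(a, rng)   # R_t(A_t)
        history.append((a, r))
    return history

# Example: a 3-armed Bernoulli bandit under a uniform-random policy.
rng = np.random.default_rng(0)
means = np.array([0.2, 0.5, 0.8])
bernoulli = lambda a, g: float(g.random() < means[a])
uniform = lambda h, g: int(g.integers(len(means)))
history = run_bandit(bernoulli, uniform, T=100, rng=rng)
```

Any of the policies discussed in this paper (TS, UCB, TS-UCB) can be slotted in as `policy`; they differ only in how they map the history to an action.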
2.1. K-armed Bandit

In this setting, |A| = K, and each entry of the unknown parameter θ ∈ R^K corresponds to the mean of one action. That is, for the i'th action, f_θ(i) = θ_i. We assume that θ_a ∈ [0, 1] for all a, and the rewards R_t(a) are also bounded in [0, 1] for all a and t. The prior distribution q on θ, supported on [0, 1]^K, can otherwise be arbitrary.

2.2. Linear Bandit

In the linear bandit, there is a known vector X(a) ∈ R^d associated with each action, and the mean reward takes on the form f_θ(a) = ⟨θ, X(a)⟩, for θ ∈ Θ ⊆ R^d. We assume that ||θ|| ≤ S ≤ √d, ||X(a)|| ≤ L, and f_θ(a) ∈ [−1, 1] for all a ∈ A. Lastly, we assume that R_t(a) − f_θ(a) is r-sub-Gaussian for every t and a, for some r ≥ 1. All of these assumptions are standard and are the same as in Abbasi-Yadkori et al. (2011).
3. Algorithm
TS-UCB requires a set of functions U, μ̂ : H × A → R to first be specified, where U(h, a) represents the upper confidence bound of action a at history h, and μ̂(h, a) represents an estimate of f_θ(a) at history h. We require that U(h, a) − μ̂(h, a) > 0. We write U_t(a) = U(H_t, a) and μ̂_t(a) = μ̂(H_t, a), and we refer to the quantity radius_t(a) ≜ U_t(a) − μ̂_t(a) as the radius of the confidence interval.

TS-UCB proceeds as follows. At state H_t, draw m independent samples from the posterior distribution q(H_t), for some integer parameter m ≥ 1. Denote these samples by θ̃_1, ..., θ̃_m, and let f̃_i = f_{θ̃_i}(A*(θ̃_i)); that is, f̃_i is the mean reward of the best arm when the true parameter is θ̃_i. Conditioned on H_t, the distribution of f̃_i is the same as the distribution of f_θ(A*). Let f̃_t = (1/m) Σ_{i=1}^m f̃_i. For every action a, define the ratio Ψ_t(a) as

Ψ_t(a) ≜ (f̃_t − μ̂_t(a)) / (U_t(a) − μ̂_t(a)) = (f̃_t − μ̂_t(a)) / radius_t(a). (1)

TS-UCB chooses an action that minimizes this ratio, which we assume exists (clearly such a minimizer exists if A is finite; otherwise, since A is assumed to be compact, it exists if μ̂_t and U_t are continuous functions). That is, if A_t^{TS-UCB} is the random variable for the action chosen by TS-UCB at time t, then

A_t^{TS-UCB} ∈ argmin_{a∈A} Ψ_t(a). (2)

We parse the ratio Ψ_t(a): μ̂_t(a) is an estimate of the expected reward E[f_θ(a) | H_t] from playing action a, and f̃_t is an estimate of the optimal reward E[f_θ(A*) | H_t] (indeed, f̃_t → E[f_θ(A*) | H_t] as m → ∞). Then, the numerator of the ratio estimates the expected instantaneous regret from playing action a. We clearly want this to be small, but minimizing only the numerator would result in the greedy policy. The denominator enforces exploration by favoring actions with larger confidence intervals, corresponding to actions about which little is known.

TS-UCB can be applied whenever the quantities f̃_t = (1/m) Σ_{i=1}^m f̃_i and {U_t(a), μ̂_t(a)}_{a∈A} can be computed, which are exactly the quantities needed for TS (m = 1) and UCB respectively. The following example shows that TS-UCB can be applied in a general setting where the relationship between actions and rewards is modeled using a deep neural network.
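For a finite action set, one step of the selection rule (1)-(2) is only a few lines of code. The sketch below is our own illustration (helper names and the Gaussian example are assumptions, not the paper's implementation).

```python
import numpy as np

def ts_ucb_select(sample_opt_value, mu_hat, ucb, m=10):
    """One step of TS-UCB over a finite action set.

    sample_opt_value() draws one posterior sample theta~ and returns
    f_{theta~}(A*(theta~)), the optimal mean reward under that sample.
    mu_hat[a] and ucb[a] are the reward estimate and upper confidence
    bound of action a at the current history (ucb > mu_hat required)."""
    f_tilde = np.mean([sample_opt_value() for _ in range(m)])  # f~_t
    scores = (f_tilde - mu_hat) / (ucb - mu_hat)               # Psi_t(a)
    return int(np.argmin(scores))

# Example with independent Gaussian posteriors N(mu[a], sig[a]^2):
rng = np.random.default_rng(1)
mu = np.array([0.1, 0.5, 0.4])
sig = np.array([0.30, 0.05, 0.30])
sample_opt = lambda: float((mu + sig * rng.standard_normal(3)).max())
action = ts_ucb_select(sample_opt, mu_hat=mu, ucb=mu + 2.0 * sig)
```

Note that with m = 1 the score consumes a single posterior sample, exactly the information TS itself requires.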
Example 1 (Neural Linear (Riquelme et al. 2018)). Consider a contextual bandit problem where a context X_t ∈ R^d arrives at each time step, and the expected reward of taking action a ∈ A is g(X_t, a), for an unknown function g. The "Neural Linear" method models uncertainty in only the last layer of the network by considering a specific class of functions g. Specifically, consider that g allows the decomposition g(X_t, a) = h(X_t)^⊤ β_a, where h(X_t) ∈ R^d represents the outputs from the last layer of some network and β_a ∈ R^d is some parameter vector. If the function h(·) were known, then the resulting problem is a linear bandit problem for which both sampling from the posterior on β_a for all a ∈ A, as well as computing a (closed-form) UCB on β_a, are easy. In reality, h(·) is unknown, but the Neural Linear method approximates this quantity from past observations and ignores uncertainty in the estimate. As such, it is clear that TS-UCB can be used as an alternative to TS in the Neural Linear approach.

We evaluate the method described in the above example on a range of real-world datasets in Section 4.2.
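To make the last-layer computations in Example 1 concrete, here is a minimal sketch for a single action, assuming the Gaussian-Gaussian model (prior β_a ∼ N(0, I), unit noise variance, fixed features h(X_t)). The class name and structure are ours, not the benchmark's implementation.

```python
import numpy as np

class LastLayerPosterior:
    """Bayesian linear regression head for one action: a sketch of the
    Gaussian-Gaussian model with prior beta ~ N(0, I) and unit noise."""
    def __init__(self, d):
        self.precision = np.eye(d)  # posterior precision matrix
        self.b = np.zeros(d)        # sum of reward-weighted features

    def update(self, feat, reward):
        self.precision += np.outer(feat, feat)
        self.b += reward * feat

    def mean_cov(self):
        cov = np.linalg.inv(self.precision)
        return cov @ self.b, cov

    def sampled_mean_reward(self, feat, rng):   # used by TS and TS-UCB
        mean, cov = self.mean_cov()
        return float(feat @ rng.multivariate_normal(mean, cov))

    def ucb(self, feat, z):                     # closed-form Gaussian UCB
        mean, cov = self.mean_cov()
        return float(feat @ mean + z * np.sqrt(feat @ cov @ feat))

# After three observations of reward 1.0 with feature e_1, the posterior
# mean for that direction is 3/4 (shrunk toward 0 by the prior).
post = LastLayerPosterior(2)
for _ in range(3):
    post.update(np.array([1.0, 0.0]), 1.0)
```

Because the posterior on β_a is Gaussian, both the posterior sample needed for f̃_t and a quantile-based UCB are available in closed form, which is what makes TS-UCB essentially free in this setting.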
3.1. Outline of Regret Analysis

We now give an outline of the regret analysis for TS-UCB, which provides intuition on both the form of the ratio (1) and the performance of the algorithm.

First, it is useful to extend the definition of Ψ_t to randomized actions. If ν is a probability distribution over A, define

Ψ̄_t(ν) ≜ (f̃_t − E_{A_t∼ν}[μ̂_t(A_t)]) / E_{A_t∼ν}[radius_t(A_t)]. (3)

Using this definition, we show (Lemma 2) that for any policy (π_t)_{t≥1}, almost surely,

Ψ_t(A_t^{TS-UCB}) ≤ Ψ̄_t(π_t). (4)

Now, assume the following two approximations hold at every time step:
(i) f̃_t approximates the expected optimal reward: f̃_t ≈ E[f_θ(A*) | H_t].
(ii) μ̂_t(a) approximates the expected reward of action a: μ̂_t(a) ≈ E[f_θ(a) | H_t].

The Bayes regret for TS-UCB can be decomposed as

BayesRegret(T, π^{TS-UCB}) = Σ_{t=1}^T E[E[f_θ(A*) − f_θ(A_t^{TS-UCB}) | H_t]]
  ≈ Σ_{t=1}^T E[f̃_t − μ̂_t(A_t^{TS-UCB})]
  = Σ_{t=1}^T E[Ψ_t(A_t^{TS-UCB}) · radius_t(A_t^{TS-UCB})], (5)

where the second step uses (i)-(ii), and the third step uses the definition (1).

(5) decomposes the regret into the product of two terms: the ratio Ψ_t(A_t^{TS-UCB}) and the radius of the action taken. For the second piece, standard analyses for the UCB algorithm found in the literature bound regret by bounding the sum Σ_{t=1}^T E[radius_t(A_t)] for any sequence of actions A_t. Therefore, if Ψ_t(A_t^{TS-UCB}) can be upper bounded by a constant, the regret bounds found for UCB can be directly applied.

We show Ψ̄_t(π_t^{TS}) ≲ 1, where TS is the Thompson sampling policy (this is stated formally and shown in Lemma 3). In light of (4), this implies Ψ_t(A_t^{TS-UCB}) ≲ 1. Plugging this back into (5) gives us BayesRegret(T, π^{TS-UCB}) ≲ Σ_{t=1}^T E[radius_t(A_t^{TS-UCB})], which lets us apply UCB regret bounds from the literature and finishes the proof.

This method of decomposing the regret into the product of two terms (as in (5)) and minimizing one of them was used in Russo and Van Roy (2018) for the IDS policy. The optimization problem in IDS is difficult, as the term that is minimized involves evaluating the information gain, requiring computing integrals over high-dimensional spaces. The optimization problem for TS-UCB is almost trivial, but it trades off the ability to incorporate complicated information structures as IDS can.

We now apply
TS-UCB for the K-armed bandit and linear bandit using the standard definitions of upper confidence bounds found in the literature, and we formally state the main theorems. The formal proofs of the theorems can be found in the supplementary materials.

We assume T ≥ K, and we slightly modify the algorithm to pull every arm once in the first K time steps. Let N_t(a) = Σ_{s=1}^{t−1} 1(A_s = a) be the number of times that action a was played up to but not including time t. We define the upper confidence bounds in a similar way to Auer et al. (2002); namely,

μ̂_t(a) ≜ (1/N_t(a)) Σ_{s=1}^{t−1} 1(A_s = a) R_s(a),   U_t(a) ≜ μ̂_t(a) + √(2 log T / N_t(a)). (6)

This implies radius_t(a) = √(2 log T / N_t(a)). Because the term √(2 log T) appears as a multiplicative factor in the radius and the same term is used for all actions and time steps, the algorithm is agnostic to this value. That is, TS-UCB reduces to picking the action which minimizes

√(N_t(a)) (f̃_t − μ̂_t(a)). (7)

This implies that TS-UCB does not have to know the time horizon T a priori.

Remark 1. For UCB algorithms, it is well known that tuning the parameter α > 0 in the radius α √(log T / N_t(a)) can vastly change empirical performance (Russo and Van Roy 2014). One benefit of TS compared to UCB is that it does not require any such tuning. We see from (7) that such tuning is also not needed for TS-UCB.

We now state our main result for this setting.
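Before stating it, we note that the reduced rule (7) needs only per-arm play counts, reward sums, and the averaged posterior sample f̃_t. A sketch (the bookkeeping names are ours):

```python
import numpy as np

def ts_ucb_k_armed(counts, sums, f_tilde):
    """Pick argmin_a sqrt(N_t(a)) * (f~_t - mu_hat_t(a)), i.e. rule (7).
    counts[a] = N_t(a) (each >= 1 after the first K round-robin steps);
    sums[a]   = total reward observed from arm a so far."""
    mu_hat = sums / counts
    return int(np.argmin(np.sqrt(counts) * (f_tilde - mu_hat)))

# Example: empirical means [0.4, 0.8, 0.9], counts [5, 5, 1], f~_t = 0.85.
arm = ts_ucb_k_armed(np.array([5.0, 5.0, 1.0]),
                     np.array([2.0, 4.0, 0.9]),
                     f_tilde=0.85)  # selects arm 2
```

In the example, the barely explored arm 2 wins the argmin: its score √1 · (0.85 − 0.9) = −0.05 beats the well-explored arms, illustrating how the count factor trades off exploration and exploitation without any tuned confidence multiplier.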
Theorem 1.
For the K-armed bandit, using the UCBs as defined in (6),

BayesRegret(T, π^{TS-UCB}) ≤ √(KT log T) + T^{−1} + 3√T + K = O(√(KT log T)). (8)

This result matches the Ω(√(KT)) lower bound of Bubeck and Liu (2013) up to a logarithmic factor. It is worth noting that TS has been shown to match the lower bound exactly (Bubeck and Liu 2013); we believe that the logarithmic gap is a shortcoming of our analysis.

For the linear bandit, to define the functions μ̂_t and U_t, we first need to define a confidence set C_t ⊆ Θ, which contains θ with high probability. We use the confidence sets developed in Abbasi-Yadkori et al. (2011). Let X_t = X(A_t) be the vector associated with the action played at time t, and let 𝐗_t be the t × d matrix whose s'th row is X_s^⊤. Let Y_t ∈ R^t be the vector of rewards seen up to and including time t. At time t, define the positive semi-definite matrix V_t = I + Σ_{s=1}^t X_s X_s^⊤ = I + 𝐗_t^⊤ 𝐗_t, and construct the estimate θ̂_t = V_t^{−1} 𝐗_t^⊤ Y_t. Using the notation ||x||_A = √(x^⊤ A x), let C_t = {ρ : ||ρ − θ̂_t||_{V_t} ≤ √β_t}, where √β_t = r √(d log(T(1 + tL²))) + S.

Using this confidence set, the functions needed for TS-UCB are defined as

μ̂_t(a) ≜ ⟨X(a), θ̂_t⟩,   U_t(a) ≜ max_{ρ∈C_t} ⟨X(a), ρ⟩. (9)

Since U_t(a) is the solution to maximizing a linear function subject to an ellipsoidal constraint, it has a closed-form solution: U_t(a) = ⟨X(a), θ̂_t⟩ + √β_t ||X(a)||_{V_t^{−1}}, which implies radius_t(a) = √β_t ||X(a)||_{V_t^{−1}}. Then, TS-UCB reduces to picking the action which minimizes

(f̃_t − ⟨X(a), θ̂_t⟩) / ||X(a)||_{V_t^{−1}}.

Note that the √β_t term disappears, implying TS-UCB does not depend on the exact expression of this term. As in the K-armed bandit, there is no parameter tuning required and the algorithm does not have to know the time horizon T a priori.

We state our main result for this setting.

Theorem 2.
For the linear bandit, using the UCBs as defined in (9), if ||X(a)|| = 1 for all a ∈ A,

BayesRegret(T, π^{TS-UCB}) ≤ B + T^{−1} + 12√T = O(d log T √T), (10)

where B = 8 √(T d log(1 + TL²/d)) (S + r √(2 log T + d log(1 + TL²/d))) = O(d log T √T).

This result matches the Ω(d√T) lower bound of Dani et al. (2008) up to a logarithmic factor. We believe the additional assumption that ||X(a)|| = 1 is an artifact of our proof, which can likely be removed with a more refined analysis. We note that TS and IDS have been shown to achieve a regret of O(√(dT log|A|)) (Russo and Van Roy 2016, 2018), which is dependent on the total number of actions |A|.

Proofs of both Theorems 1 and 2 can be found in Section 5.
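In code, the reduced linear-bandit rule above needs only θ̂_t and V_t, since √β_t cancels out of the argmin. A sketch with our own names:

```python
import numpy as np

def ts_ucb_linear(X, V, theta_hat, f_tilde):
    """Pick argmin_a (f~_t - <X(a), theta_hat_t>) / ||X(a)||_{V_t^{-1}}.
    X: (num_actions, d) matrix of action vectors; V: the d x d design
    matrix V_t = I + sum_s X_s X_s^T; theta_hat: regularized LS estimate."""
    V_inv = np.linalg.inv(V)
    means = X @ theta_hat
    # widths[a] = sqrt(X(a)^T V^{-1} X(a)) = ||X(a)||_{V^{-1}}
    widths = np.sqrt(np.einsum('ad,de,ae->a', X, V_inv, X))
    return int(np.argmin((f_tilde - means) / widths))

# Example: two unit actions, identity design matrix (no data yet).
a = ts_ucb_linear(np.eye(2), np.eye(2),
                  theta_hat=np.array([0.5, 0.0]), f_tilde=0.6)  # selects 0
```

With identical widths, the rule reduces to the greedy choice (action 0 here); as data accumulates, V grows along frequently played directions and shrinks their widths, penalizing them in the argmin.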
4. Computational Results
We conduct two sets of experiments. The first set is entirely synthetic, for an ensemble of linear bandit problems where exact posterior samples (and a regret analysis) are available for all methods considered. Our objective here is to understand the level of improvement TS-UCB can provide over TS, and how this improvement depends on (a) natural problem features such as dimension and the level of noise, and (b) algorithmic parameters for TS-UCB such as the choice of UCB and the number of posterior samples. The second set of experiments then considers a deep bandit benchmark that consists of substantially more complex bandit problems. Here our goal is to show that TS-UCB provides state-of-the-art performance (by comparing it not just to TS but also to IDS) while being computationally cheap.
4.1. Synthetic Experiments

First, we simulate synthetic instances of the linear bandit with varying dimension and size of the prior covariance. Let d be the dimension. The number of actions is set to 2d. For each action, we choose a vector X(a) ∈ R^d uniformly at random over the unit sphere. We set the prior for θ as N(0, κI_d), where I_d is the d-dimensional identity matrix and κ > 0. The rewards are distributed as ⟨θ, X(a)⟩ + ε, where ε ∼ N(0, 1). We vary d over four values and κ over six values to get a total of 24 instances. For each instance, we simulate 500 runs over a time horizon of T = 1000. Note that instances become "easier" when κ increases: when κ is bigger, the norm of θ is also bigger, but since the variance of the noise ε stays constant at 1, the signal-to-noise ratio is higher.

As for the parameters of the algorithm, we vary the number of samples m ∈ {1, 10, 100}, and we also vary how we define the UCBs. In particular, we use the UCBs as in (9) and also use Bayes UCBs (Kaufmann et al. 2012a). For the Bayes UCBs, at every time step, we define U_t(a) as the (1 − 1/t)'th percentile of the posterior of f_θ(a) for every action a.

For each algorithm and each problem instance, we report the median regret as a percentage of the regret from the TS policy. The results are shown in Figure 1.

Figure 1: TS-UCB improves on TS across the board, particularly on harder instances (bottom). Panels: (a) TS-UCB(1), (b) TS-UCB(10), (c) TS-UCB(100). Each grid reports the median regret of each policy as a percentage of the regret of Thompson sampling over 500 runs. TS-UCB(m) refers to the algorithm using m samples. The top row uses the UCBs defined in (9), while the bottom row uses Bayes UCBs.

Performance Gain relative to TS:
We see that
TS-UCB outperforms TS across the board, in some cases halving regret. The general trend is that TS-UCB has a greater performance improvement over TS when κ is lower, which corresponds to the "harder" instances.

Impact of m and UCB type: We see that performance improves as m increases; on average, the regret decreased by 10.8% from TS-UCB(1) to TS-UCB(10), and 8.8% from TS-UCB(10) to TS-UCB(100). That said, there do exist problem settings for which a smaller m performs better; characterizing the dependence of regret on the number of samples is an interesting (but challenging) direction for future research. Lastly, performance was very similar across both UCB definitions, suggesting TS-UCB is robust to the specific UCB used.
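An end-to-end sketch of one synthetic run of the kind just described is given below. This is our simplified code, not the experimental harness: it uses exact conjugate Gaussian posterior updates, and the posterior covariance stands in for the confidence widths (to which the paper's radius is proportional in this model).

```python
import numpy as np

def simulate(d=4, n_actions=8, T=500, kappa=1.0, m=10, seed=0):
    """One run: theta ~ N(0, kappa*I), unit-sphere actions, N(0,1) noise.
    Returns (regret of TS, regret of TS-UCB(m)) on the same instance."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_actions, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # actions on unit sphere
    theta = rng.multivariate_normal(np.zeros(d), kappa * np.eye(d))
    best = (X @ theta).max()
    regrets = []
    for use_ts_ucb in (False, True):
        prec = np.eye(d) / kappa    # posterior precision (unit noise var)
        b = np.zeros(d)
        total = 0.0
        for _ in range(T):
            cov = np.linalg.inv(prec)
            mean = cov @ b
            if use_ts_ucb:
                samples = rng.multivariate_normal(mean, cov, size=m)
                f_tilde = (samples @ X.T).max(axis=1).mean()
                widths = np.sqrt(np.einsum('ad,de,ae->a', X, cov, X))
                a = int(np.argmin((f_tilde - X @ mean) / widths))
            else:  # plain Thompson sampling (m = 1)
                a = int(np.argmax(X @ rng.multivariate_normal(mean, cov)))
            r = X[a] @ theta + rng.standard_normal()
            prec += np.outer(X[a], X[a])             # conjugate update
            b += r * X[a]
            total += best - X[a] @ theta
        regrets.append(total)
    return regrets[0], regrets[1]
```

Mirroring the results above, one would typically expect the second number to be smaller, though any single seed is noisy; averaging over many seeds reproduces the qualitative trend.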
4.2. Deep Contextual Bandit Benchmark

In complex bandit models such as the deep contextual bandit discussed in Example 1, computing a posterior is challenging. Riquelme et al. (2018) evaluate a large number of posterior approximation methods on a variety of real-world datasets for such a contextual bandit problem. Their results suggest that performing posterior sampling using the "Neural Linear" method, described in Example 1, is an effective and robust approach. We evaluate TS-UCB on the benchmark problems in Riquelme et al. (2018) and compare its performance to TS and IDS.

For a finite action set, Neural Linear maintains one neural network h(·) : R^d → R^d and posterior distributions on |A| parameter vectors β_a. At time t, the posteriors on β_a are computed ignoring the uncertainty in the estimate of h(·), so that this computation is equivalent to Bayesian linear regression. (The network h(·) can be updated at every time step or at scheduled intervals, simply by fitting the network to observed rewards.) Denoting by β_a the random variable distributed according to the posterior on β_a at time t, the action picked by TS is described by the random variable argmax_a h(X_t)^⊤ β_a. For TS-UCB, we compute U_t(a) as the (1 − 1/t)'th percentile of the random variable h(X_t)^⊤ β_a (note that this can be computed in closed form in the Gaussian-Gaussian model used in the neural-linear approach of Riquelme et al. (2018)). Also, μ̂_t(a) = h(X_t)^⊤ E[β_a]. We then pick the action that minimizes Ψ_t(a). Finally, it is straightforward to implement the IDS algorithm (specifically, the variance-based approximation given by Algorithm 6 in Russo and Van Roy (2018)) given access to draws of β_a.

We replicate the experiments from Riquelme et al. (2018) with the same real-world datasets, and evaluate the performance of TS, IDS, and TS-UCB. These datasets vary widely in their properties; see Appendix A of Riquelme et al. (2018) for the details of each dataset. We use the same parameters and neural network structure as in their paper. While d varies across experiments, the last layer of the neural network has dimension d = 50. For each dataset, one "run" is defined as 2000 data points randomly drawn from the entire dataset; that is, there are 2000 time steps, and each data point (or "context") arrives sequentially in a random order. We run TS-UCB for m = 1 and m = 10 posterior samples. Lastly, we also run the IDS policy; to get meaningful performance, we require m = 5000 for IDS. We report the regret of each policy as a percentage of the regret of TS using the same method, shown in Table 1.

Table 1: Deep bandit benchmark (Riquelme et al. 2018) results for the Neural-Linear and Linear posterior approximation methods. TS-UCB provides an improvement over TS across the board. For each posterior approximation approach, the regret of TS-UCB is reported as a percentage of the regret of Thompson sampling (with 95% confidence intervals) for that approach. IDS(5000) requires five thousand samples from the posterior at each epoch; TS-UCB(1) and TS-UCB(10) require one and ten respectively.

Dataset | d | K | TS-UCB(1) | TS-UCB(10) | IDS(5000)
Statlog | 9 | 7 | 85.4 | — | —

Performance Relative to TS:
Riquelme et al. (2018) establish TS along with the neural linear approach to posterior sampling as a benchmark algorithm for deep contextual bandits. We see here that TS-UCB improves upon TS on every dataset. Moreover, TS-UCB(10) always outperforms TS-UCB(1). Finally, it is worth noting that TS-UCB requires essentially no additional computation over TS.

IDS: IDS is inconsistent across datasets; it performs well on some, like Covertype and Census, but quite poorly on others, like Statlog and Mushroom, and thus does not provide a consistent improvement over TS. The algorithm also requires substantially more posterior samples. These results suggest that both TS and TS-UCB are perhaps more robust to posterior approximation than IDS. There has been some recent work (Phan et al. 2019) on analyzing TS with approximate inference; it is an interesting future direction to study the robustness of other arm selection rules (and regret) to posterior approximation.

In summary, these experiments suggest that TS-UCB consistently improves upon state-of-the-art performance on a challenging deep contextual bandit benchmark.

5. Regret Analysis
For our analysis, we introduce lower confidence bounds (L_t)_{t≥1}, defined symmetrically to the upper confidence bounds: L_t(a) ≜ μ̂_t(a) − (U_t(a) − μ̂_t(a)).

5.1. Preliminaries

We first state three known results used in the analysis. The first result says that the confidence bounds are valid with high probability.
Lemma 1.
Using the functions {μ̂_t}_{t≥1}, {U_t}_{t≥1} as defined in (6) in the K-armed setting and (9) in the linear bandit setting, for any t, Pr(f_θ(A) > U_t(A)) ≤ T^{−1}, where A is any deterministic or random action. The analogous bound holds for the lower confidence bounds, i.e., Pr(f_θ(A) < L_t(A)) ≤ T^{−1}.

For completeness, the proof of Lemma 1 can be found in Section 5.3. The following corollary is immediate, using the law of total expectation and the fact that f_θ(A) ≥ −1.

Corollary 1. E[−f_θ(A)] ≤ E[−L_t(A)] + T^{−1}, where A is any deterministic or random action.

The next two results upper bound Σ_{t=1}^T E[radius_t(A_t)]. In particular, for the K-armed setting, the proof of Proposition 2 of Russo and Van Roy (2014) implies the following result.

Theorem 3.
For the K-armed bandit, using the UCBs as defined in (6),

Σ_{t=K+1}^T E[radius_t(A_t)] ≤ √(KT log T),

for any sequence of actions A_t.

Similarly, in the linear bandit setting, the proof of Theorem 3 of Abbasi-Yadkori et al. (2011) (using the parameters δ = T^{−1}, λ = 1) implies the following result.

Theorem 4. For the linear bandit, using the UCBs as defined in (9),

Σ_{t=1}^T E[radius_t(A_t)] ≤ √(T d log(1 + TL²/d)) (S + r √(2 log T + d log(1 + TL²/d))) = O(d log T √T),

for any sequence of actions A_t.

5.2. Proof of Main Result

There are two main steps of the proof, which are stated in the following two propositions. These results apply to both the K-armed and linear bandit settings.

Proposition 1.
Suppose radius_t(a) ∈ [r_min, r_max] for all a ∈ A and t ≥ 1. Using the UCBs as defined in (6) for the K-armed bandit, and (9) for the linear bandit,

BayesRegret(T, π^{TS-UCB}) ≤ Σ_{t=1}^T E[radius_t(A_t^{TS-UCB})] + (r_max / r_min)(T / √m) + T^{−1}. (11)

Proposition 2. Using the UCBs as defined in (6) for the K-armed bandit, and (9) for the linear bandit,

BayesRegret(T, π^{TS-UCB}) ≤ Σ_{t=1}^T E[radius_t(A_t^{TS-UCB})] + (m + 1) T^{−1}.

The proof sketch from Section 3.1 refers to the proof of Proposition 1. The approximation f̃_t ≈ E[f_θ(A*) | H_t] used in the proof sketch only holds when m is large; the fact that this doesn't hold contributes the (r_max/r_min)(T/√m) term in (11), which goes to zero as m → ∞. Proposition 2 has the opposite relationship with respect to m, so it applies when m is small. The final step of proving Theorems 1 and 2 involves combining these two propositions to remove the dependence on m, showing r_max/r_min = O(√T), and plugging in the known bounds for Σ_{t=1}^T E[radius_t(A_t)] from Theorems 3 and 4. Details of this final step are in Section 5.2.3.

We first show the result claimed in (4), whose proof is deferred to Section 5.3.
Lemma 2.
For any distribution $\nu$ over $\mathcal{A}$, $\Psi_t(A_t^{\text{TS-UCB}}) \le \bar{\Psi}_t(\nu)$ almost surely.

Next, we upper bound the ratio $\Psi_t(A_t^{\text{TS-UCB}})$ by analyzing the Thompson sampling policy.

Lemma 3.
$$\Psi_t(A_t^{\text{TS-UCB}}) \le 1 + \frac{1}{r_{\min}}\Bigl(\Pr(f_\theta(A^*) > U_t(A^*) \mid H_t) + \tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t]\Bigr)$$
almost surely. Equivalently, using (1),
$$\tilde{f}_t - \hat{\mu}_t(A_t^{\text{TS-UCB}}) \le \mathrm{radius}_t(A_t^{\text{TS-UCB}})\left(1 + \frac{1}{r_{\min}}\Bigl(\Pr(f_\theta(A^*) > U_t(A^*) \mid H_t) + \tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t]\Bigr)\right). \qquad (12)$$

Proof.
Let $\pi^{\text{TS}}$ be the Thompson sampling policy. We show the inequality for $\bar{\Psi}_t(\pi^{\text{TS}}_t)$ instead, and then use $\Psi_t(A_t^{\text{TS-UCB}}) \le \bar{\Psi}_t(\pi^{\text{TS}}_t)$ from Lemma 2 to get the desired result. By definition of TS, $\pi^{\text{TS}}_t = \pi^{\text{TS}}_t(H_t)$ is the distribution over $\mathcal{A}$ corresponding to the posterior distribution of $A^*$ conditioned on $H_t$. Then, if $A_t$ is the action chosen by TS at time $t$, we have $\mathbb{E}[U_t(A_t) \mid H_t] = \mathbb{E}[U_t(A^*) \mid H_t]$ and $\mathbb{E}[\hat{\mu}_t(A_t) \mid H_t] = \mathbb{E}[\hat{\mu}_t(A^*) \mid H_t]$. Using this, we can write $\bar{\Psi}_t(\pi^{\text{TS}}_t)$ as
$$\bar{\Psi}_t(\pi^{\text{TS}}_t) = \frac{\tilde{f}_t - \mathbb{E}[\hat{\mu}_t(A_t) \mid H_t]}{\mathbb{E}[U_t(A_t) - \hat{\mu}_t(A_t) \mid H_t]} = \frac{\tilde{f}_t - \mathbb{E}[\hat{\mu}_t(A^*) \mid H_t]}{\mathbb{E}[U_t(A^*) - \hat{\mu}_t(A^*) \mid H_t]}. \qquad (13)$$
By conditioning on the event $\{f_\theta(A^*) \le U_t(A^*)\}$, the following inequality follows from the fact that $f_\theta(A^*) \le 1$:
$$\mathbb{E}[f_\theta(A^*) \mid H_t] \le \mathbb{E}[U_t(A^*) \mid H_t] + \Pr(f_\theta(A^*) > U_t(A^*) \mid H_t). \qquad (14)$$
Consider the numerator of (13). We add and subtract $\mathbb{E}[f_\theta(A^*) \mid H_t]$ and use (14):
$$\tilde{f}_t - \mathbb{E}[\hat{\mu}_t(A^*) \mid H_t] = \mathbb{E}[f_\theta(A^*) - \hat{\mu}_t(A^*) \mid H_t] + \tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t] \le \mathbb{E}[U_t(A^*) - \hat{\mu}_t(A^*) \mid H_t] + \Pr(f_\theta(A^*) > U_t(A^*) \mid H_t) + \tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t]. \qquad (15)$$
The first term of (15) is equal to the denominator of $\bar{\Psi}_t(\pi^{\text{TS}}_t)$. Therefore,
$$\bar{\Psi}_t(\pi^{\text{TS}}_t) \le 1 + \frac{\Pr(f_\theta(A^*) > U_t(A^*) \mid H_t) + \tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t]}{\mathbb{E}[U_t(A^*) - \hat{\mu}_t(A^*) \mid H_t]} \le 1 + \frac{1}{r_{\min}}\Bigl(\Pr(f_\theta(A^*) > U_t(A^*) \mid H_t) + \tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t]\Bigr). \qquad \square$$

The next lemma simplifies the expectation of (12) using Cauchy-Schwarz.
Lemma 4.
For any $t$,
$$\mathbb{E}[\tilde{f}_t - \hat{\mu}_t(A_t^{\text{TS-UCB}})] \le \mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})] + \frac{r_{\max}}{r_{\min}}\left(T^{-1} + \frac{1}{\sqrt{m}}\right).$$

Proof. Taking the expectation of (12) gives us
$$\mathbb{E}[\tilde{f}_t - \hat{\mu}_t(A_t^{\text{TS-UCB}})] \qquad (16)$$
$$\le \mathbb{E}\left[\mathrm{radius}_t(A_t^{\text{TS-UCB}})\left(1 + \frac{1}{r_{\min}}\bigl(\Pr(f_\theta(A^*) > U_t(A^*) \mid H_t) + \tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t]\bigr)\right)\right]$$
$$= \mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})] + \frac{1}{r_{\min}}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})\Pr(f_\theta(A^*) > U_t(A^*) \mid H_t)] \qquad (17)$$
$$+ \frac{1}{r_{\min}}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})(\tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t])]. \qquad (18)$$
We will now upper bound (17) and (18) by $\frac{r_{\max}}{r_{\min}} \cdot T^{-1}$ and $\frac{r_{\max}}{r_{\min}} \cdot \frac{1}{\sqrt{m}}$ respectively, in which case the result will follow. First, consider (17). Using Cauchy-Schwarz yields
$$\frac{1}{r_{\min}}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})\Pr(f_\theta(A^*) > U_t(A^*) \mid H_t)] \qquad (19)$$
$$\le \frac{1}{r_{\min}}\sqrt{\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})^2]\,\mathbb{E}[\Pr(f_\theta(A^*) > U_t(A^*) \mid H_t)^2]} \le \frac{1}{r_{\min} T}\sqrt{\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})^2]} \le T^{-1} \cdot \frac{r_{\max}}{r_{\min}}, \qquad (20)$$
where the second step uses the following:
$$\mathbb{E}[\Pr(f_\theta(A^*) > U_t(A^*) \mid H_t)^2] = \mathbb{E}\bigl[\mathbb{E}[\mathbf{1}(f_\theta(A^*) > U_t(A^*)) \mid H_t]^2\bigr] \le \mathbb{E}\bigl[\mathbb{E}[\mathbf{1}(f_\theta(A^*) > U_t(A^*)) \mid H_t]\bigr] = \Pr(f_\theta(A^*) > U_t(A^*)) \le T^{-2}, \qquad (21)$$
where the first inequality uses Jensen's inequality, and the last inequality uses Lemma 1. Similarly, we apply Cauchy-Schwarz to (18):
$$\frac{1}{r_{\min}}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})(\tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t])] \le \frac{1}{r_{\min}}\sqrt{\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})^2]\,\mathbb{E}[(\tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t])^2]}. \qquad (22)$$
Recall that $\tilde{f}_t = \frac{1}{m}\sum_{i=1}^{m}\tilde{f}^i$, where each $\tilde{f}^i$ has the same distribution as $f_\theta(A^*)$ conditioned on $H_t$. Therefore, $\mathbb{E}[\tilde{f}_t \mid H_t] = \mathbb{E}[f_\theta(A^*) \mid H_t]$. Then, we have
$$\mathbb{E}[(\tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t])^2] = \mathbb{E}\bigl[\mathbb{E}[(\tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t])^2 \mid H_t]\bigr] = \mathbb{E}[\mathrm{Var}(\tilde{f}_t \mid H_t)] = \mathbb{E}\left[\frac{1}{m}\mathrm{Var}(\tilde{f}^i \mid H_t)\right] \le \frac{1}{m}.$$
The last inequality follows since $\tilde{f}^i \in [0,1]$, so that $\mathrm{Var}(\tilde{f}^i \mid H_t) \le 1$. Hence
$$\frac{1}{r_{\min}}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})(\tilde{f}_t - \mathbb{E}[f_\theta(A^*) \mid H_t])] \le \frac{1}{r_{\min}\sqrt{m}}\sqrt{\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})^2]} \le \frac{1}{\sqrt{m}} \cdot \frac{r_{\max}}{r_{\min}}. \qquad (23)$$
Substituting (20) and (23) into (18) yields the desired result. $\square$

5.2.1. Proof of Proposition 1.
Conditioned on $H_t$, the expectations of $f_\theta(A^*)$ and $\tilde{f}_t$ are the same, implying $\mathbb{E}[f_\theta(A^*)] = \mathbb{E}[\tilde{f}_t]$ for any $t$. Therefore, the Bayes regret can be written as $\sum_{t=1}^{T}\mathbb{E}[\tilde{f}_t - f_\theta(A_t^{\text{TS-UCB}})]$. By adding and subtracting $\hat{\mu}_t(A_t^{\text{TS-UCB}})$, we derive
$$\mathrm{BayesRegret}(T, \pi^{\text{TS-UCB}}) = \sum_{t=1}^{T}\mathbb{E}[\tilde{f}_t - \hat{\mu}_t(A_t^{\text{TS-UCB}})] + \sum_{t=1}^{T}\mathbb{E}[\hat{\mu}_t(A_t^{\text{TS-UCB}}) - f_\theta(A_t^{\text{TS-UCB}})]. \qquad (24)$$
The first sum in (24) can be bounded by $\sum_{t=1}^{T}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})] + \frac{r_{\max}}{r_{\min}}\bigl(1 + \frac{T}{\sqrt{m}}\bigr)$ using Lemma 4. Using Corollary 1, the second sum in (24) can be bounded by $\sum_{t=1}^{T}\bigl(\mathbb{E}[\hat{\mu}_t(A_t^{\text{TS-UCB}}) - L_t(A_t^{\text{TS-UCB}})] + T^{-2}\bigr) \le \sum_{t=1}^{T}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})] + T^{-1}$. Substituting these two bounds results in
$$\mathrm{BayesRegret}(T, \pi^{\text{TS-UCB}}) \le 2\sum_{t=1}^{T}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})] + \frac{r_{\max}}{r_{\min}}\left(1 + \frac{T}{\sqrt{m}}\right) + T^{-1}$$
as desired. $\square$

5.2.2. Proof of Proposition 2.

The main idea of this proof is captured in the following lemma, which says that we can essentially replace the term $\mathbb{E}[f_\theta(A^*)]$ with $\mathbb{E}[U_t(A_t^{\text{TS-UCB}})]$.

Lemma 5.
For every $t$, $\mathbb{E}[f_\theta(A^*)] \le \mathbb{E}[U_t(A_t^{\text{TS-UCB}})] + mT^{-2}$.

Proof. Fix $t$, $H_t$, and $\tilde{f}_t$. For an action $a \in \mathcal{A}$, if $U_t(a) \ge \tilde{f}_t$, then $\Psi_t(a) \le 1$; if $U_t(a) < \tilde{f}_t$, then $\Psi_t(a) > 1$. This implies that an action whose UCB is higher than $\tilde{f}_t$ will always be chosen over an action whose UCB is smaller than $\tilde{f}_t$. Therefore, in the case that $\tilde{f}_t \le \max_{a \in \mathcal{A}} U_t(a)$, it will be that $U_t(A_t^{\text{TS-UCB}}) \ge \tilde{f}_t$. Since $\tilde{f}_t \le 1$, we have
$$\mathbb{E}[\tilde{f}_t \mid H_t] \le \mathbb{E}\bigl[U_t(A_t^{\text{TS-UCB}})\,\mathbf{1}\bigl(\tilde{f}_t \le \max_{a \in \mathcal{A}} U_t(a)\bigr) \mid H_t\bigr] + \Pr\bigl(\tilde{f}_t > \max_{a \in \mathcal{A}} U_t(a) \mid H_t\bigr) \le \mathbb{E}[U_t(A_t^{\text{TS-UCB}}) \mid H_t] + \Pr\bigl(\tilde{f}_t > \max_{a \in \mathcal{A}} U_t(a) \mid H_t\bigr).$$
Since $\tilde{f}_t = \frac{1}{m}\sum_{i=1}^{m}\tilde{f}^i$, if $\tilde{f}_t$ is larger than $\max_{a \in \mathcal{A}} U_t(a)$, it must be that at least one of the elements $\tilde{f}^i$ is larger than $\max_{a \in \mathcal{A}} U_t(a)$. Then, the union bound gives us $\Pr(\tilde{f}_t > \max_{a} U_t(a) \mid H_t) \le \sum_{i=1}^{m}\Pr(\tilde{f}^i > \max_{a} U_t(a) \mid H_t)$. By definition of $\tilde{f}^i$, the distributions of $\tilde{f}^i$ and $f_\theta(A^*)$ are the same conditioned on $H_t$. Therefore,
$$\mathbb{E}[\tilde{f}_t \mid H_t] \le \mathbb{E}[U_t(A_t^{\text{TS-UCB}}) \mid H_t] + m\Pr\bigl(f_\theta(A^*) > \max_{a \in \mathcal{A}} U_t(a) \mid H_t\bigr).$$
Using the fact that $\mathbb{E}[\tilde{f}_t \mid H_t] = \mathbb{E}[f_\theta(A^*) \mid H_t]$ and taking expectations on both sides, we have
$$\mathbb{E}[f_\theta(A^*)] \le \mathbb{E}[U_t(A_t^{\text{TS-UCB}})] + m\Pr\bigl(f_\theta(A^*) > \max_{a \in \mathcal{A}} U_t(a)\bigr) \le \mathbb{E}[U_t(A_t^{\text{TS-UCB}})] + m\Pr(f_\theta(A^*) > U_t(A^*)) \le \mathbb{E}[U_t(A_t^{\text{TS-UCB}})] + mT^{-2}.$$
The last inequality uses Lemma 1. $\square$

Proof of Proposition 2.
$$\mathrm{BayesRegret}(T, \pi^{\text{TS-UCB}}) = \sum_{t=1}^{T}\mathbb{E}[f_\theta(A^*) - f_\theta(A_t^{\text{TS-UCB}})] \le \sum_{t=1}^{T}\bigl(\mathbb{E}[U_t(A_t^{\text{TS-UCB}}) - f_\theta(A_t^{\text{TS-UCB}})] + mT^{-2}\bigr) \le \sum_{t=1}^{T}\bigl(\mathbb{E}[U_t(A_t^{\text{TS-UCB}}) - L_t(A_t^{\text{TS-UCB}})] + T^{-2}\bigr) + mT^{-1} = 2\sum_{t=1}^{T}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})] + (m+1)T^{-1},$$
where the first inequality uses Lemma 5 and the second inequality uses Corollary 1. $\square$

Proof of Theorem 1.
The UCBs in (6) imply that $\mathrm{radius}_t(a) \in \bigl[\sqrt{2\log T / T},\ \sqrt{2\log T}\bigr]$ for all $a$ and $t$; therefore $\frac{r_{\max}}{r_{\min}} \le \sqrt{T}$. Then, Propositions 1 and 2 result in the following two inequalities respectively:
$$\mathrm{BayesRegret}(T, \pi^{\text{TS-UCB}}) \le 2\sum_{t=1}^{T}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})] + \sqrt{T} + \sqrt{\frac{T^3}{m}} + T^{-1},$$
$$\mathrm{BayesRegret}(T, \pi^{\text{TS-UCB}}) \le 2\sum_{t=1}^{T}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})] + mT^{-1} + T^{-1}.$$
Combining these two bounds results in
$$\mathrm{BayesRegret}(T, \pi^{\text{TS-UCB}}) \le 2\sum_{t=1}^{T}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})] + \sqrt{T} + T^{-1} + \min\left\{\sqrt{\frac{T^3}{m}},\ mT^{-1}\right\}.$$
For any value of $m > 0$, $\min\bigl\{\sqrt{T^3/m},\ mT^{-1}\bigr\} \le \sqrt{T}$. Plugging in the known bound for $\sum_{t=1}^{T}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})]$ from Theorem 3 finishes the proof of Theorem 1. $\square$

Proof of Theorem 2.
The following lemma, whose proof is deferred to Section 5.3, allows us to bound $\frac{r_{\max}}{r_{\min}}$ by $4\sqrt{T}$.

Lemma 6. For the linear bandit, using the UCBs as defined in (9), if $\|X(a)\| = 1$ for every $a$, then $\mathrm{radius}_t(a) \in \left[\frac{r}{2}\sqrt{\frac{d\log T}{T}},\ 2r\sqrt{d\log T}\right]$ for every $t$ and $a$.

Then, following the same steps as in the proof of Theorem 1,
$$\mathrm{BayesRegret}(T, \pi^{\text{TS-UCB}}) \le 2\sum_{t=1}^{T}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})] + 4\sqrt{T} + T^{-1} + \min\left\{4\sqrt{\frac{T^3}{m}},\ mT^{-1}\right\}. \qquad (25)$$
For any $m$, $\min\bigl\{4\sqrt{T^3/m},\ mT^{-1}\bigr\} \le 4\sqrt{T}$. Plugging in the known bound for $\sum_{t=1}^{T}\mathbb{E}[\mathrm{radius}_t(A_t^{\text{TS-UCB}})]$ from Theorem 4 gives us (10), finishing the proof of Theorem 2. $\square$

(The statement of Theorem 1 has an additional $+K$ term since the first $K$ time steps are used to pull each arm once, which we did not include in the proof to simplify exposition.)

Proof of Lemma 2.
Fix $H_t$ and $\tilde{f}_t$. For every action $a$, let $\Delta_a = \tilde{f}_t - \hat{\mu}_t(a)$, and hence $\Psi_t(a) = \frac{\Delta_a}{\mathrm{radius}_t(a)}$. Let $\nu$ be a distribution over $\mathcal{A}$ and let $a \sim \nu$. Then,
$$\bar{\Psi}_t(\nu) = \frac{\mathbb{E}_{a \sim \nu}[\Delta_a]}{\mathbb{E}_{a \sim \nu}[\mathrm{radius}_t(a)]}. \qquad (26)$$
Note that $\mathrm{radius}_t(a) > 0$ for every $a$, but $\Delta_a$ can be negative. We claim that the above ratio is minimized when $\nu$ puts all of its mass on one action — in particular, the action $a^* \in \mathrm{argmin}_a \frac{\Delta_a}{\mathrm{radius}_t(a)}$.

For $a \ne a^*$, let $c_a = \frac{\mathrm{radius}_t(a)}{\mathrm{radius}_t(a^*)} > 0$; set $c_{a^*} = 1$. Then, since $\Psi_t(a) \ge \Psi_t(a^*)$, we can write $\Delta_a = c_a\Delta_{a^*} + \delta_a$ for some $\delta_a \ge 0$, for every $a$. Let $p_{a^*} = \Pr(a = a^*)$ and let $E = \{a \ne a^*\}$. Substituting into (26), we get
$$\bar{\Psi}_t(\nu) = \frac{\mathbb{E}[c_a\Delta_{a^*} + \delta_a]}{\mathbb{E}[c_a\,\mathrm{radius}_t(a^*)]} = \frac{p_{a^*}\Delta_{a^*} + \mathbb{E}[c_a\Delta_{a^*} + \delta_a \mid E]\Pr(E)}{p_{a^*}\mathrm{radius}_t(a^*) + \mathbb{E}[c_a\,\mathrm{radius}_t(a^*) \mid E]\Pr(E)} = \frac{\Delta_{a^*}\bigl(p_{a^*} + \mathbb{E}[c_a \mid E]\Pr(E)\bigr) + \mathbb{E}[\delta_a \mid E]\Pr(E)}{\mathrm{radius}_t(a^*)\bigl(p_{a^*} + \mathbb{E}[c_a \mid E]\Pr(E)\bigr)} = \frac{\Delta_{a^*}}{\mathrm{radius}_t(a^*)} + \frac{\mathbb{E}[\delta_a \mid E]\Pr(E)}{\mathrm{radius}_t(a^*)\bigl(p_{a^*} + \mathbb{E}[c_a \mid E]\Pr(E)\bigr)} \ge \frac{\Delta_{a^*}}{\mathrm{radius}_t(a^*)} = \Psi_t(a^*). \qquad \square$$

Proof of Lemma 1.
In the linear bandit, this lemma follows directly from Theorem 2 of Abbasi-Yadkori et al. (2011) (using the parameters $\delta = T^{-2}$, $\lambda = 1$). In the $K$-armed setting, if $\hat{\mu}(n, a)$ is the empirical mean of the first $n$ plays of action $a$, Hoeffding's inequality implies
$$\Pr\left(f_\theta(a) - \hat{\mu}(n, a) \ge \sqrt{\frac{2\log T}{n}}\right) \le T^{-4}$$
for any $n$. Then, since the number of plays of a particular action is no larger than $T$, we have
$$\Pr\left(f_\theta(a) - \hat{\mu}_t(a) \ge \sqrt{\frac{2\log T}{N_t(a)}}\right) \le \Pr\left(\bigcup_{n=1}^{T}\left\{f_\theta(a) - \hat{\mu}(n, a) \ge \sqrt{\frac{2\log T}{n}}\right\}\right) \le T^{-3}.$$
Since $|\mathcal{A}| = K \le T$ and $A^*, A_t \in \mathcal{A}$, the result follows after taking another union bound over actions (which in fact proves the stronger statement that the bound of $T^{-2}$ holds simultaneously for all actions). $\square$

Proof of Lemma 6.
We have $\mathrm{radius}_t(a) = \sqrt{\beta_t}\,\|X(a)\|_{V_t^{-1}} = \sqrt{\beta_t}\,\|V_t^{-1/2}X(a)\|$. Then, since $\|X(a)\| = 1$ for all $a$,
$$\sqrt{\beta_t}\,\sigma_{\min}(V_t^{-1/2}) \le \mathrm{radius}_t(a) \le \sqrt{\beta_t}\,\sigma_{\max}(V_t^{-1/2}).$$
First, we lower bound $\sigma_{\min}(V_t^{-1/2})$. To do this, we can instead upper bound $\|V_t\|$, since $\sigma_{\min}(V_t^{-1/2}) = \sqrt{\sigma_{\min}(V_t^{-1})} = 1/\sqrt{\sigma_{\max}(V_t)} = 1/\sqrt{\|V_t\|}$. The triangle inequality gives $\|V_t\| \le \|I\| + \sum_{s=1}^{t}\|X_sX_s^\top\|$. Since $X_sX_s^\top$ is a rank-1 matrix, its only non-zero eigenvalue is $\|X_s\|^2 = 1$ with eigenvector $X_s$, since $(X_sX_s^\top)X_s = X_s(X_s^\top X_s)$. Therefore, $\|V_t\| \le \|I\| + \sum_{s=1}^{t}\|X_s\|^2 \le T + 1$, which implies $\sigma_{\min}(V_t^{-1/2}) \ge \frac{1}{\sqrt{T+1}} \ge \frac{1}{2\sqrt{T}}$. Recall $\sqrt{\beta_t} = r\sqrt{d\log(T^2(1+t))} + S \ge r\sqrt{d\log T}$, implying $\mathrm{radius}_t(a) \ge \frac{r}{2}\sqrt{\frac{d\log T}{T}}$.

Next, we upper bound $\sigma_{\max}(V_t^{-1/2}) = 1/\sqrt{\sigma_{\min}(V_t)}$ by lower bounding $\sigma_{\min}(V_t)$: $\sigma_{\min}(V_t) \ge \sigma_{\min}(I) = 1$. Therefore, $\sigma_{\max}(V_t^{-1/2}) \le 1$. We can upper bound $\sqrt{\beta_t}$ by $r\sqrt{d\log(T^2(1+T))} + S \le 2r\sqrt{d\log T}$, since we assumed $r \ge 1$ and $S \le \sqrt{d}$. Therefore, we have
$$\mathrm{radius}_t(a) \le \sqrt{\beta_t} \le 2r\sqrt{d\log T}. \qquad \square$$

References

Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits.
Advances in Neural Information Processing Systems, 2312–2320.
Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Conference on Learning Theory, 39–1.
Agrawal S, Goyal N (2013a) Further optimal regret bounds for Thompson sampling. Artificial Intelligence and Statistics, 99–107.
Agrawal S, Goyal N (2013b) Thompson sampling for contextual bandits with linear payoffs. International Conference on Machine Learning, 127–135.
Agrawal S, Goyal N (2017) Near-optimal regret bounds for Thompson sampling. Journal of the ACM (JACM).
Machine Learning.
Bubeck S, Liu CY (2013) Prior-free and prior-dependent regret bounds for Thompson sampling. Advances in Neural Information Processing Systems, 638–646.
Calandra R, Peters J, Rasmussen CE, Deisenroth MP (2016) Manifold Gaussian processes for regression. , 3338–3345 (IEEE).
Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 2249–2257.
Dani V, Hayes TP, Kakade SM (2008) Stochastic linear optimization under bandit feedback.
Dwaracherla V, Lu X, Ibrahimi M, Osband I, Wen Z, Van Roy B (2020) Hypermodels for exploration. International Conference on Learning Representations.
Graves A (2011) Practical variational inference for neural networks. Advances in Neural Information Processing Systems, 2348–2356.
Hinton GE, Salakhutdinov RR (2008) Using deep belief nets to learn covariance kernels for Gaussian processes. Advances in Neural Information Processing Systems, 1249–1256.
Kaufmann E, Cappé O, Garivier A (2012a) On Bayesian upper confidence bounds for bandit problems. Artificial Intelligence and Statistics, 592–600.
Kaufmann E, Korda N, Munos R (2012b) Thompson sampling: An asymptotically optimal finite-time analysis. International Conference on Algorithmic Learning Theory, 199–213 (Springer).
Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics.
Advances in Neural Information Processing Systems, 3258–3266.
Neal RM (2012) Bayesian Learning for Neural Networks, volume 118 (Springer Science & Business Media).
Phan M, Yadkori YA, Domke J (2019) Thompson sampling and approximate inference. Advances in Neural Information Processing Systems, 8801–8811.
Riquelme C, Tucker G, Snoek J (2018) Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127.
Rusmevichientong P, Tsitsiklis JN (2010) Linearly parameterized bandits. Mathematics of Operations Research.
Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Mathematics of Operations Research.
Russo D, Van Roy B (2016) An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research.
Operations Research.
Scott SL (2010) A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry.
International Conference on Machine Learning, 2171–2180.
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.