On Thompson Sampling with Langevin Algorithms
Eric Mazumdar, Aldo Pacchiano, Yi-an Ma, Peter L. Bartlett, Michael I. Jordan
Eric Mazumdar†,∗ ([email protected]), Aldo Pacchiano†,∗ ([email protected]), Yi-An Ma⋄,∗ ([email protected]), Peter L. Bartlett†,‡ ([email protected]), Michael I. Jordan†,‡ ([email protected])
† Department of Electrical Engineering and Computer Sciences, ‡ Department of Statistics, University of California, Berkeley; ∗ Equal contribution. ⋄
Google Research
Abstract
Thompson sampling for multi-armed bandit problems is known to enjoy favorable performance in both theory and practice. However, it suffers from a significant limitation computationally, arising from the need for samples from posterior distributions at every iteration. We propose two Markov Chain Monte Carlo (MCMC) methods tailored to Thompson sampling to address this issue. We construct quickly converging Langevin algorithms to generate approximate samples that have accuracy guarantees, and we leverage novel posterior concentration rates to analyze the regret of the resulting approximate Thompson sampling algorithm. Further, we specify the necessary hyperparameters for the MCMC procedure to guarantee optimal instance-dependent frequentist regret while having low computational complexity. In particular, our algorithms take advantage of both posterior concentration and a sample reuse mechanism to ensure that only a constant number of iterations and a constant amount of data is needed in each round. The resulting approximate Thompson sampling algorithm has logarithmic regret, and its computational complexity does not scale with the time horizon of the algorithm.
Sequential decision making under uncertainty has become one of the fastest developing fields of machine learning. A central theme in such problems is addressing exploration-exploitation tradeoffs [Auer et al., 2002, Lattimore and Szepesvári, 2020], wherein an algorithm must balance between exploiting its current knowledge and exploring previously unexplored options.

The classic stochastic multi-armed bandit problem has provided a theoretical laboratory for the study of exploration/exploitation tradeoffs [Lai and Robbins, 1985]. A vast literature has emerged that provides algorithms, insights, and matching upper and lower bounds in many cases. The dominant paradigm in this literature has been that of frequentist analysis; cf. in particular the analyses devoted to the celebrated upper confidence bound (UCB) algorithm [Auer et al., 2002]. Interestingly, however, Thompson sampling, a Bayesian approach first introduced almost a century ago [Thompson, 1933], has been shown to be competitive with, and sometimes to outperform, UCB algorithms in practice [Scott, 2010, Chapelle and Li, 2011]. Further, the fact that Thompson sampling, being a Bayesian method, explicitly makes use of prior information has made it particularly popular in industrial applications [see, e.g., Russo et al., 2017, and the references therein].

Although most theory in the bandit literature is focused on non-Bayesian methods, there is a smaller, but nontrivial, theory associated with Thompson sampling. In particular, Thompson sampling has been shown to achieve optimal risk bounds in multi-armed bandit settings with Bernoulli rewards and beta priors [Kaufmann et al., 2012, Agrawal and Goyal, 2013a], Gaussian rewards with Gaussian priors [Agrawal and Goyal, 2013a], one-dimensional exponential family models with uninformative priors [Korda et al., 2013], and finitely-supported priors and observations [Gopalan et al., 2014].
Thompson sampling has further been shown to asymptotically achieve optimal instance-independent performance [Russo and Van Roy, 2016]. Despite these appealing foundational results, the deployment of Thompson sampling in complex problems is often constrained by its use of samples from posterior distributions, which are often difficult to generate in regimes where the posteriors do not have closed forms. A common solution to this has been to use approximate sampling techniques to generate samples from approximations of the posteriors [Russo et al., 2017, Chapelle and Li, 2011, Gómez-Uribe, 2016, Lu and Van Roy, 2017]. Such approaches have been demonstrated to work effectively in practice [Riquelme et al., 2018, Urteaga and Wiggins, 2018], but it is unclear how to maintain performance over arbitrary time horizons while using approximate sampling. Indeed, to the best of our knowledge, the strongest regret guarantees for Thompson sampling with approximate samples are given by Lu and Van Roy [2017], who require a model whose complexity grows with the time horizon to guarantee optimal performance. Further, it was recently shown theoretically by Phan et al. [2019] that a naïve usage of approximate sampling algorithms with Thompson sampling can yield a drastic drop in performance.

Contributions
In this work we analyze Thompson sampling with approximate sampling methods in a class of multi-armed bandit problems where the rewards are unbounded, but their distributions are log-concave. In Section 3 we derive posterior contraction rates for posteriors when the rewards are generated from such distributions and under general assumptions on the priors. Using these rates, we show that Thompson sampling with samples from the true posterior achieves finite-time optimal frequentist regret. Further, the regret guarantee we derive has explicit constants and explicit dependencies on the dimension of the parameter spaces, the variance of the reward distributions, and the quality of the prior distributions.

In Section 4 we present a simple counterexample demonstrating the relationship between the approximation error to the posterior and the resulting regret of the algorithm. Building on the insight provided by this example, we propose two approximate sampling schemes based on Langevin dynamics to generate samples from approximate posteriors, and analyze their impact on the regret of Thompson sampling. We first analyze samples generated from the unadjusted Langevin algorithm (ULA) and specify the runtime, hyperparameters, and initialization required to achieve an approximation error which provably maintains the optimal regret guarantee of exact Thompson sampling over finite time horizons. Crucially, we initialize the ULA algorithm from the approximate sample generated in the previous round to make use of the posterior concentration property and ensure that only a constant number of iterations is required to achieve the optimal regret guarantee. Under slightly stronger assumptions, we then demonstrate that a stochastic gradient variant called stochastic gradient Langevin dynamics (SGLD) requires only a constant batch size in addition to the constant number of iterations to achieve logarithmic regret.
Since the computational complexity of this sampling algorithm does not scale with the time horizon, the proposed method is a true "anytime" algorithm. Finally, we conclude in Section 5 by validating these theoretical results in numerical simulations, where we find that Thompson sampling with our approximate sampling schemes maintains the desirable performance of exact Thompson sampling.

Our results suggest that tailoring approximate sampling algorithms to work with Thompson sampling can overcome the phenomenon studied in Phan et al. [2019], where approximation error in the samples can yield linear regret. Indeed, our results suggest that it is possible for Thompson sampling to achieve order-optimal regret guarantees with an efficiently implementable approximate sampling algorithm.
In this work we analyze Thompson sampling strategies for the K-armed stochastic multi-armed bandit (MAB) problem. In such problems, there is a set of K options or "arms", A = {1, ..., K}, from which a player must choose at each round t = 1, 2, .... After choosing an arm A_t ∈ A in round t, the player receives a real-valued reward X_{A_t} drawn from a fixed yet unknown distribution associated with the arm, p_{A_t}. The random rewards obtained from playing an arm repeatedly are i.i.d. and independent of the rewards obtained from choosing other arms.

Throughout this paper, we assume that the reward distribution for each arm is a member of a parametric family parametrized by θ_a ∈ R^{d_a}, such that the true reward distribution is p_a(X) = p_a(X; θ*_a), where θ*_a is unknown. Moreover, we assume throughout this paper that the parametric families are log-concave and Lipschitz smooth in θ_a:

Assumption 1-Local (Assumption on the family p_a(X | θ_a) around θ*_a). Assume that log p_a(x | θ_a) is L_a-smooth and m_a-strongly concave around θ*_a for all x ∈ R:

−log p_a(x | θ*_a) − ∇_θ log p_a(x | θ*_a)^⊤(θ_a − θ*_a) + (m_a/2)‖θ_a − θ*_a‖² ≤ −log p_a(x | θ_a) ≤ −log p_a(x | θ*_a) − ∇_θ log p_a(x | θ*_a)^⊤(θ_a − θ*_a) + (L_a/2)‖θ_a − θ*_a‖², ∀ θ_a ∈ R^{d_a}, x ∈ R.

Additionally, we make assumptions on the true distribution of the rewards:
Assumption 2 (Assumption on the true reward distribution p_a(X | θ*_a)). For every a ∈ A, assume that p_a(X; θ*_a) is strongly log-concave in X with some parameter ν_a, and that ∇_θ log p_a(x | θ*_a) is L_a-Lipschitz in X:

−(∇_x log p_a(x | θ*_a) − ∇_x log p_a(x' | θ*_a))^⊤(x − x') ≥ ν_a‖x − x'‖², ∀ x, x' ∈ R,
‖∇_θ log p_a(x | θ*_a) − ∇_θ log p_a(x' | θ*_a)‖ ≤ L_a‖x − x'‖, ∀ x, x' ∈ R.

The parameters ν_a and L_a provide lower and upper bounds on the sub- and super-Gaussianity of the true reward distributions. We further define κ_a = max{L_a/m_a, L_a/ν_a} to be the condition number of the model class. Finally, we assume that for each arm a ∈ A there is a linear map such that for all θ_a ∈ R^{d_a}, E_{x∼p_a(x|θ_a)}[X] = α_a^⊤ θ_a, with ‖α_a‖ = A_a.

We now review Thompson sampling, the pseudo-code for which is presented in Algorithm 1. A key advantage of Thompson sampling over frequentist algorithms for multi-armed bandit problems is its flexibility in incorporating prior information. In this paper, we assume that the prior distributions π_a(θ_a) over the parameters of the arms have smooth log-concave densities:

Assumption 3 (Assumptions on the prior distribution). For every a ∈ A, assume that log π_a(θ_a) is concave with L_a-Lipschitz gradients for all θ_a ∈ R^{d_a}:

‖∇_θ log π_a(θ_a) − ∇_θ log π_a(θ'_a)‖ ≤ L_a‖θ_a − θ'_a‖, ∀ θ_a, θ'_a ∈ R^{d_a}.

Thompson sampling proceeds by maintaining a posterior distribution over the parameters of each arm a at each round t. Given the likelihood family, p(X | θ_a), the prior, π(θ_a), and the n data samples from an arm a, X_{a,1}, ⋯, X_{a,n}, let F_{n,a} : R^{d_a} → R, F_{n,a}(θ_a) = (1/n) ∑_{i=1}^n log p_a(X_{a,i} | θ_a), be the average log-likelihood of the data.
Then the posterior distribution over the parameter θ_a at round t, denoted μ^{(n)}_a, satisfies:

p_a(θ_a | X_{a,1}, ⋯, X_{a,n}) ∝ π_a(θ_a) ∏_{i=1}^t (p_a(X_i | θ_a))^{I{A_i = a}} = exp( n F_{n,a}(θ_a) + log π_a(θ_a) ).

For any γ_a > 0, we define the scaled posterior, denoted μ^{(n)}_a[γ_a], as the distribution whose density is proportional to:

exp( γ_a ( n F_{n,a}(θ_a) + log π_a(θ_a) ) ).   (1)

We remark that the Lipschitz constants are all assumed to be the same to simplify notation. In Section 3 we explain why the use of scaled posteriors is required to obtain optimal regret guarantees for our bandit algorithms.

Algorithm 1 Thompson sampling
Input: Priors π_a and posterior scaling parameters γ_a for a ∈ A
Set μ^{(0)}_a = π_a for a ∈ A
for t = 0, 1, ⋯ do
  Sample θ_{a,t} ∼ μ^{(T_a(t))}_a[γ_a] for each a ∈ A
  Choose action A_t = argmax_{a∈A} α_a^⊤ θ_{a,t}
  Receive reward X_{A_t}
  Update the (approximate) posterior distribution for arm A_t: μ^{(T_a(t+1))}_a
end for

Letting T_a(t) be the number of samples received from arm a after t rounds, a Thompson sampling algorithm, at each round t, first samples the parameters of each arm a from their (scaled) posterior distributions, θ_{a,t} ∼ μ^{(T_a(t))}_a[γ_a], and then chooses the arm for which the sample has the highest value: A_t = argmax_{a∈A} α_a^⊤ θ_{a,t}.

A player's objective in MAB problems is to maximize her cumulative reward over any fixed time horizon T. The measure of performance most commonly used in the MAB literature is known as the expected regret R(T), which corresponds to the expected difference between the accrued reward and the reward that would have been accrued had the learner selected the action with the highest mean reward during all steps t = 1, ⋯, T. Recalling that r̄_a is the mean reward for arm a ∈ A, the regret is given by:

R(T) := E[ ∑_{t=1}^T ( r̄_{a*} − r̄_{A_t} ) ], where r̄_{a*} = max_{a∈A} r̄_a.

Without loss of generality, we assume throughout this paper that the optimal arm, a* = argmax_{a∈A} r̄_a, is arm 1. Further, we assume that the optimal arm is unique: r̄_1 > r̄_a for a > 1. We introduce this assumption merely for the purpose of simplifying our analysis.

We consider two settings. In the first, Thompson sampling makes use of samples from the true scaled posterior distributions, {μ^{(T_a(t))}_a[γ_a]}_{a∈A}, at each round. In the second, Thompson sampling makes use of samples coming from two approximate sampling schemes that we propose, such that the samples can be seen as corresponding to approximations of the scaled posteriors, {μ̄^{(T_a(t))}_a[γ_a]}_{a∈A}. We refer to the former as exact Thompson sampling, and the latter as approximate Thompson sampling.

For the analysis of exact Thompson sampling in Section 3, we derive posterior concentration theorems which characterize the rate at which the posterior distributions for the arms, μ^{(n)}_a, converge to delta functions centered at θ*_a as a function of n, the number of samples received from the arm. We then use these rates to show that Thompson sampling in this family of multi-armed bandit problems achieves the optimal finite-time regret. Further, our results demonstrate an explicit dependence on the quality of the priors and other problem-dependent constants, which improves upon prior work. We remark that the analysis of Thompson sampling has often been focused on a different quantity known as the Bayes regret, which is simply the expectation of R(T) over the priors: E_π[R(T)]. However, in an effort to demonstrate that Thompson sampling is an effective alternative to frequentist methods like UCB, we analyze the frequentist regret R(T).
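For concreteness, the loop in Algorithm 1 can be sketched in code for the special case of unit-variance Gaussian arms with standard normal priors, where the scaled posterior is available in closed form. The function below is our own illustration (names and constants are ours, not the paper's):

```python
import numpy as np

def thompson_sampling(means, T, gamma=1.0, rng=None):
    """Exact Thompson sampling (Algorithm 1) for unit-variance Gaussian arms
    with N(0, 1) priors. `gamma` plays the role of the posterior scaling
    parameter gamma_a; gamma < 1 inflates the posterior variance."""
    rng = np.random.default_rng(rng)
    K = len(means)
    counts = np.zeros(K)   # T_a(t): number of pulls of each arm
    sums = np.zeros(K)     # running sum of rewards per arm
    regret = 0.0
    best = max(means)
    for t in range(T):
        # Conjugate posterior for an arm after n pulls: N(sum/(n+1), 1/(n+1));
        # the gamma-scaled posterior keeps the mean and divides the variance by gamma.
        post_mean = sums / (counts + 1.0)
        post_var = 1.0 / (gamma * (counts + 1.0))
        theta = rng.normal(post_mean, np.sqrt(post_var))  # theta_{a,t}
        a = int(np.argmax(theta))                         # A_t = argmax_a theta_{a,t}
        x = rng.normal(means[a], 1.0)                     # observe reward X_{A_t}
        counts[a] += 1
        sums[a] += x
        regret += best - means[a]
    return regret, counts
```

On a two-armed instance with means 1 and 0, the suboptimal arm's pull count, and hence the regret, grows only logarithmically in T.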
In Section 4, we propose two efficiently implementable Langevin-MCMC-based sampling schemes for which the regret of approximate Thompson sampling still achieves the optimal logarithmic regret. To do so, we derive new results for the convergence of Langevin-MCMC-based sampling schemes in the Wasserstein-p distance, which we then use to prove optimal regret bounds.

In this section we first derive posterior concentration rates on the parameters of the reward distributions when the data, the priors, and the likelihoods satisfy our assumptions. We then make use of these concentration results to give finite-time regret guarantees for exact Thompson sampling in log-concave bandits.
Core to the analysis of Thompson sampling is understanding the behavior of the posterior distributions over the parameters of the arms' distributions as the algorithm progresses and samples from the arms are collected. The literature on understanding how posteriors evolve as data is collected goes back to Doob [1949] and his proof of the asymptotic normality of posteriors. More recently, there has been a line of work [see, e.g., van der Vaart and van Zanten, 2008, Ghosal and van der Vaart, 2007] that derives rates of convergence of posteriors in various regimes, mostly following the framework first developed in Ghosal et al. [2000] for finite- and infinite-dimensional models. Such results, though quite general, do not have explicit constants or forms which make them amenable for use in analyzing bandit algorithms. Indeed, finite-time rates remain an active area of research but have been developed using information-theoretic arguments [Shen and Wasserman, 2001], and more recently through the analysis of stochastic differential equations [Mou et al., 2019], though in both cases the assumptions, burn-in times, and lack of precise constants make them difficult to integrate with the analysis of Thompson sampling. Due to this, Thompson sampling has, for the most part, been well understood only for conjugate prior/likelihood families like beta/Bernoulli and Gaussian/Gaussian [Agrawal and Goyal, 2013a], or in more generality in well-behaved families such as one-dimensional exponential families with uninformative priors [Korda et al., 2013] or finitely supported prior/likelihood pairs [Gopalan et al., 2014].

To derive posterior concentration rates for parameters in d dimensions and for a large class of priors and likelihoods, we analyze the moments of a potential function along trajectories of a stochastic differential equation for which the posterior is the limiting distribution.
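To make this diffusion viewpoint concrete, the following Euler-Maruyama sketch simulates such a Langevin diffusion for the one-dimensional Gaussian-conjugate case (likelihood N(θ, 1), prior N(0, 1)), whose scaled posterior is N(n x̄/(n+1), 1/(γ(n+1))). The drift and step-size choices are our own illustration, not part of the paper's analysis:

```python
import numpy as np

def ula_posterior_sample(x, gamma=1.0, n_steps=2000, step=None, theta0=0.0, rng=None):
    """Euler-Maruyama simulation of a Langevin diffusion whose stationary
    law is the scaled posterior mu^(n)[gamma], specialized to a 1-d Gaussian
    likelihood N(theta, 1) with an N(0, 1) prior (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    n = len(x)
    xbar = float(np.mean(x))
    if step is None:
        step = 0.5 / (n + 1)   # roughly 1/(n L) for this model; our choice
    theta = theta0
    for _ in range(n_steps):
        # drift = (1/2) grad F_n(theta) + (1/(2n)) grad log pi(theta)
        drift = 0.5 * (xbar - theta) + (1.0 / (2 * n)) * (-theta)
        # noise scale 1/sqrt(n * gamma), so the stationary density is
        # proportional to exp(gamma * (n F_n(theta) + log pi(theta)))
        theta = theta + step * drift + np.sqrt(step / (n * gamma)) * rng.normal()
    return theta
```

Averaging the output of many independent chains recovers the scaled posterior mean n x̄/(n+1) up to discretization error.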
Our results expand upon the recent derivation of novel contraction rates for posterior distributions presented in Mou et al. [2019] to hold for a finite number of samples, and may be of independent interest. We make use of these concentration results to show that Thompson sampling with such priors and likelihoods results in order-optimal regret guarantees.

To begin, we note that classic results [Øksendal, 2003] guarantee that, as t → ∞, the distribution P_t of θ_t, which evolves according to

dθ_t = (1/2)∇_θ F_{n,a}(θ_t) dt + (1/(2n))∇_θ log π_a(θ_t) dt + (1/√(nγ_a)) dB_t,   (2)

satisfies

lim_{t→∞} P_t(θ | X_1, ..., X_n) ∝ exp( γ_a ( n F_{n,a}(θ) + log π_a(θ) ) )

almost surely. Comparing with Eq. (1), this limiting distribution is the scaled posterior distribution μ^{(n)}_a[γ_a]. Thus, by analyzing the limiting properties of θ_t as it evolves according to this stochastic differential equation, we can derive properties of the scaled posterior distribution.

To do so, we first show that, with high probability, the gradient of F_{n,a} at θ*_a concentrates around zero (given the data X_1, ..., X_n). More precisely, we show in Appendix B, using well-known results on the concentration of Lipschitz functions of strongly log-concave random variables, that ∇_θ F_{n,a}(θ*_a) has sub-Gaussian tails:

Proposition 1.
The random variable ‖∇_θ F_{n,a}(θ*_a)‖ is L_a √(d_a/(n ν_a))-sub-Gaussian.

Given this concentration, we analyze how the potential function V(θ_t) = (1/2) e^{ct} ‖θ_t − θ*‖², where c > 0, evolves along trajectories of the stochastic differential equation (2). By bounding the supremum of V(θ_t), we construct bounds on the higher moments of the random variable ‖θ_a − θ*_a‖ where θ_a ∼ μ^{(n)}_a[γ_a]. These moment bounds translate directly into the posterior concentration bound of θ_a ∼ μ^{(n)}_a[γ_a] around θ*_a presented in the following theorem (the proof of which is deferred to Appendix B).

Theorem 1.
Suppose that Assumptions 1-3 hold. Then for δ ∈ (0, 1):

P_{θ_a ∼ μ^{(n)}_a[γ_a]} ( ‖θ_a − θ*_a‖ > √( (e/(m_a n)) ( d_a/γ_a + log B_a + (1/γ_a + 8 d_a κ_a) log(1/δ) ) ) ) < δ,

where B_a = max_{θ∈R^{d_a}} π_a(θ)/π_a(θ*_a).

Theorem 1 guarantees that the scaled posterior distribution over the parameters of each arm concentrates at a rate of 1/√n, where n is the number of times the arm has been pulled. We remark that this posterior concentration result has a number of desirable properties. Through the presence of B_a, it reflects an explicit dependence on the quality of the prior. In particular, log B_a = 0 if the prior is properly centered such that its mode is at θ*_a, or if the prior is uninformative or nearly flat everywhere. We further remark that the concentration result also scales with the variance of θ_a, which is on the order of d_a/(γ_a m_a n). Lastly, we remark that this concentration result holds for any n > 0.

We now show that, under our assumptions, Thompson sampling with samples from the scaled posterior enjoys optimal finite-time regret guarantees. To provide these results, we proceed as is common in regret proofs for multi-armed bandits, by upper bounding T_a(T), the number of times a sub-optimal arm a ∈ A is pulled up to time T. Without loss of generality we assume throughout this section that arm 1 is the optimal arm, and define the filtration associated with a run of the algorithm as F_t = {A_1, X_1, A_2, X_2, ..., A_t, X_t}.

To upper bound the expected number of times a sub-optimal arm is pulled up to time T, we first define the low-probability event that the mean calculated from the value of θ_{a,t} sampled from the posterior at time t ≤ T, r_{a,t}(T_a(t)), is greater than r̄_1 − ε (recall that r̄_1 is the optimal arm's mean): E_a(t) = { r_{a,t}(T_a(t)) ≥ r̄_1 − ε } for some ε >
0. Given these events, we proceed to decompose the expected number of pulls of a sub-optimal arm a ∈ A as:

E[T_a(T)] = E[ ∑_{t=1}^T I(A_t = a) ] = E[ ∑_{t=1}^T I(A_t = a, E^c_a(t)) ] + E[ ∑_{t=1}^T I(A_t = a, E_a(t)) ] =: I + II.   (3)

These two terms satisfy the following standard bounds (see, e.g., Lattimore and Szepesvári [2020]):

Lemma 1 (Bounding I and II). For a sub-optimal arm a ∈ A, we have that:

I ≤ E[ ∑_{s=0}^{T−1} ( 1/p_{1,s} − 1 ) ];   (4)
II ≤ E[ ∑_{s=1}^{T} I( p_{a,s} > 1/T ) ],   (5)

where p_{a,s} = P( r_{a,t}(s) > r̄_1 − ε | F_{t−1} ), for some ε > 0.

The proofs of these results are standard for the regret of Thompson sampling and can be found in Appendix E, Lemmas 13 and 14, for completeness. Given Lemma 1, we see that to bound the regret of Thompson sampling it is sufficient to bound the two terms I and II.

To bound term I, we first show that for all times t = 1, ..., T, and any number n of samples collected from arm 1, the probability p_{1,n} = P( r_{1,t}(n) > r̄_1 − ε | F_{t−1} ) is lower bounded by a constant depending only on the quality of the prior for arm 1. This guarantees the posterior for the optimal arm is approximately optimistic with at least a constant probability, and requires a proper choice of γ_1. We note the unscaled posterior provides the correct concentration with respect to the number of data samples T_a(t), when T_a(t) is large. This is sufficient to upper bound the trailing terms of I, that is, the summands in Equation 4 for large s. Unfortunately, concentration is not enough to bound term I, since the early summands of Equation 4 corresponding to small values of s could be extremely large. Intuitively, the random variable r_{1,t}(s) can be thought of as centered around the posterior mean of arm 1.
Though this is close to the true value of r̄_1 with high probability, when T_1(t) is small, concentration alone does not preclude the possibility that the posterior mean underestimates r̄_1 by a value of at least ε. In order to ensure p_{1,s} is large enough in these cases, we require r_{1,t}(s) to have sufficient variance to overcome this potential underestimation bias. We show that a scaled posterior μ^{(T_a(t))}_a[γ_a] with γ_a = (d_a κ_a)^{−1} in Algorithm 1 ensures r_{1,t}(s) has enough variance.

Lemma 2.
Suppose the likelihood and reward distributions satisfy Assumptions 1-3. Then for all n = 1, ..., T and γ_1 = 1/(d_1 κ_1):

E[ 1/p_{1,n} ] ≤ C √(B_1 κ_1),

where C is a universal constant independent of the problem-dependent parameters.

Remark 1.
We find that a proper choice of γ_1 is required to ensure that the posterior on the optimal arm has a large enough variance to guarantee a degree of optimism despite the randomness in its mean. Scaling up the posterior was also noted to be necessary in linear bandits (see, e.g., Agrawal and Goyal [2013b], Abeille and Lazaric [2017]) to ensure optimal regret. In practice, since we do not a priori know which is the optimal arm, we must scale the posterior of each arm by a parameter γ_a.

The quantity B_1 = max_θ π_1(θ)/π_1(θ*_1) captures a worst-case dependence on the quality of the prior for the optimal arm, and can be seen as the expected number of samples from the prior until an optimistic sample is observed. By using this upper bound in conjunction with the posterior concentration result derived in Theorem 1, we can further bound I and II. We note that, in contrast with simple sub-Gaussian concentration bounds, our posterior concentration rates have a bias term decreasing at a rate of 1/√n, where n is the number of samples. In our analysis we carefully track and control the effects of this bias term, ensuring it does not compromise our logarithmic regret guarantees. Indeed, using the posterior concentration in the bounds from Lemma 1, we show that, for γ_a = 1/(d_a κ_a), there are two universal constants C_1, C_2 > 0 such that:

I ≤ C_1 √(κ_1 B_1) ⌈ (A_1/(m_1 Δ_a)) (D_1 + σ_1) ⌉ + 1;
II ≤ C_2 (A_a/(m_a Δ_a)) ( D_a + σ_a log(T) ),

where, for a ∈ A, D_a and σ_a are given by D_a = log B_a + d_a κ_a and σ_a = d_a κ_a.

Finally, combining all these observations we obtain the following regret guarantee:

Theorem 2 (Regret of Exact Thompson Sampling).
When the likelihood and true reward distributions satisfy Assumptions 1-3 and γ_a = 1/(d_a κ_a), the expected regret after T > 0 rounds of Thompson sampling with exact sampling satisfies:

E[R(T)] ≤ ∑_{a>1} [ (C A_a/(m_a Δ_a)) ( log B_a + d_a κ_a + d_a κ_a log(T) ) + √(κ_1 B_1) (C A_1/(m_1 Δ_a)) ( log B_1 + d_1 κ_1 ) + Δ_a ],

where C is a universal constant independent of problem-dependent parameters.

The proof of the theorem is deferred to Appendix E, where we also provide the exact value of the universal constant C. We remark that this regret bound gives an O(log(T)/Δ) asymptotic regret guarantee, but holds for any T >
0. This further highlights that Thompson sampling is a competitive alternative to UCB algorithms, since it achieves the optimal problem-dependent rate for multi-armed bandit algorithms first presented in Lai and Robbins [1985].

Our bound also has explicit dependencies on the dimension of the parameter space of the likelihood distributions for each arm, as well as on the quality of the priors through the presence of B_a and B_1. We note that the dependence on the priors does not distinguish between "good" and "bad" priors. Indeed, the parameter B_a ≥ 1, and the dependence is stronger for the optimal arm (O(√B_1 log(B_1))) than for sub-optimal arms (O(log(B_a))). This is also a worst-case dependence which captures the expected number of samples from the prior until an approximately optimistic sample is observed, which we believe to be unavoidable. Finally, we note that our regret bound scales with the variances of the reward and likelihood families, since m_a and ν_a reflect the variance of the likelihoods in θ_a and of the rewards X_a, respectively.

Thus, through the use of the posterior contraction rates we are able to obtain finite-time regret bounds for Thompson sampling with multi-dimensional log-concave families and arbitrary log-concave priors. This generalizes the result of Korda et al. [2013] to a more general class of priors and higher-dimensional parametric families.

In this section we present two approximate sampling schemes for generating samples from approximations of the (scaled) posteriors at each round. For both, we give the values of the hyperparameters and computation time needed to guarantee an approximation error which does not result in a drastic change in the regret of the Thompson sampling algorithm. Before doing so, however, we first present a simple counterexample to illustrate that, in the worst case, Thompson sampling with approximate samples incurs an irreducible regret dependent on the error between the posterior and the approximation to the posterior.
In particular, by allowing the approximation error todecrease over time, we extract a relationship between the order of the regret and the level of approximation.
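This relationship can be probed numerically. The sketch below is our own toy construction in this spirit: Gaussian-conjugate Thompson sampling on two unit-variance arms with means 1 and 0, except that arm 2's posterior sample is replaced, with probability n^(−α) (n being arm 2's pull count), by an optimistic point mass at 2. Slowly decaying contamination (small α) visibly inflates the regret:

```python
import numpy as np

def contaminated_ts(T, alpha, rng=0):
    """Thompson sampling with a contaminated posterior for arm 2: with
    probability n^(-alpha), the posterior sample is replaced by a point
    mass at 2 (above the optimal arm's mean of 1). Our own illustration."""
    rng = np.random.default_rng(rng)
    means = [1.0, 0.0]                  # arm 1 is optimal, gap = 1
    counts = np.zeros(2)
    sums = np.zeros(2)
    regret = 0.0
    for t in range(T):
        mu = sums / (counts + 1.0)      # conjugate posterior means
        sd = 1.0 / np.sqrt(counts + 1.0)
        theta = rng.normal(mu, sd)
        n2 = max(counts[1], 1.0)
        if rng.random() < n2 ** (-alpha):
            theta[1] = 2.0              # contamination: optimistic point mass
        a = int(np.argmax(theta))
        x = rng.normal(means[a], 1.0)
        counts[a] += 1
        sums[a] += x
        regret += 1.0 if a == 1 else 0.0
    return regret
```

With α = 0.5 the contamination forces polynomially many suboptimal pulls, while with α = 2 the approximation error decays fast enough to be nearly harmless.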
Example 1.
Consider a Gaussian bandit instance with two arms A = {1, 2} having mean rewards r̄_1 and r̄_2 and known unit variances. Further assume that the unknown parameters are the means of the distributions, such that θ*_a = r̄_a, and consider the case where the learner makes use of a zero-mean, unit-variance Gaussian prior over θ_a for a = 1, 2. Under these assumptions, after obtaining samples X_{a,1}, ⋯, X_{a,n}, the posterior updates satisfy the following well-known formula:

P_{a,n}(θ_a) ∝ N( (∑_{i=1}^n X_{a,i})/(n + 1), 1/(n + 1) ).

Let r̄_1 = 1 and r̄_2 = 0, such that arm 1 is optimal. We now show there exists an approximate posterior P̃_{2,t} of arm 2, satisfying TV(P̃_{2,t}, P_{2,t}) ≤ n^{−α}, such that if samples from P_{1,t} and P̃_{2,t} were to be used by a Thompson sampling algorithm, its regret would satisfy R(T) = Ω(T^{1−α}).

We substantiate this claim by a simple construction. Let P̃_{2,t} be (1 − n^{−α}) P_{2,t} + n^{−α} δ_2, where δ_2 denotes a delta mass centered at 2. P̃_{2,t} is a mixture distribution between the true posterior and a point mass. Clearly, for all t ≥ C for some universal constant C, with probability at least n^{−α} the posterior sample from arm 2 will be larger than the sample from arm 1. Since t > n, t^{−α} < n^{−α} for α > 0, and since the suboptimality gap equals 1, we conclude R(T) = Ω( ∑_{t=1}^T t^{−α} ). Thus, to incur logarithmic regret, one needs TV(P̃_{2,t}, P_{2,t}) = O(1/n).

Algorithm 2 (Stochastic Gradient) Langevin Algorithm for Arm a
Input:
Data {x_{a,1}, ⋯, x_{a,n}}; MCMC sample θ^{(n−1)}_{a,Nh} from the last round
Set θ_0 = θ_{a,t−1}
for i = 0, 1, ⋯, N do
  Uniformly subsample S ⊆ {x_{a,1}, ⋯, x_{a,n}}
  Compute ∇Û(θ_{ih^{(n)}}) = −(n/|S|) ∑_{x_k∈S} ∇ log p_a(x_k | θ_{ih^{(n)}}) − ∇ log π_a(θ_{ih^{(n)}})
  Sample θ_{(i+1)h^{(n)}} ∼ N( θ_{ih^{(n)}} − h^{(n)} ∇Û(θ_{ih^{(n)}}), 2h^{(n)} I )
end for
Output: θ_{a,Nh^{(n)}} = θ_{Nh^{(n)}} and θ_{a,t} ∼ N( θ_{Nh^{(n)}}, (1/(n L_a γ_a)) I )

Example 1 builds on the insights of Phan et al. [2019], who showed that constant approximation error can incur linear regret. It highlights the fact that, to achieve logarithmic regret, the total variation distance between the approximation of the posterior μ̄^{(n)}_a[γ_a] and the true posterior μ^{(n)}_a must decrease as samples are collected. In particular, it illustrates that the rate at which the approximation error decreases is directly linked to the resulting regret bound.

Given this result, we first propose an unadjusted Langevin algorithm (ULA) [Durmus and Moulines, 2017], which generates samples from an approximate posterior which monotonically approaches the true posterior as data is collected, and provably maintains the regret guarantee of exact Thompson sampling. Important to this effort, we demonstrate that the number of steps inside the ULA procedure does not scale with the time horizon, though the number of gradient evaluations scales with the number of times an arm has been pulled. To overcome this issue arising from full gradient evaluation, we propose a stochastic gradient Langevin dynamics (SGLD) [Welling and Teh, 2011] variant of ULA which has appealing computational benefits: under slightly stronger assumptions, SGLD takes a constant number of iterations as well as a constant number of data samples in the stochastic gradient estimate while maintaining the order-optimal regret of the exact Thompson sampling algorithm.

As described in Algorithm 2, in each round t we run the (stochastic gradient) Langevin algorithm for N steps to generate a sample of desirable quality for each arm.
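A minimal one-dimensional instantiation of Algorithm 2, for a Gaussian likelihood N(θ, 1) with an N(0, 1) prior, might look as follows. The warm start, step size, batch size, and the closed form of the final rescaling noise are our illustrative choices for this conjugate case, not the constants prescribed by the theory:

```python
import numpy as np

def sgld_arm_sample(x, gamma, n_steps=100, batch=32, step=None, theta0=0.0, rng=None):
    """Sketch of Algorithm 2 for a 1-d Gaussian likelihood N(theta, 1) with
    an N(0, 1) prior: SGLD on U(theta) = -sum_k log p(x_k|theta) - log pi(theta),
    warm-started at theta0 (e.g. the previous round's sample), followed by
    extra Gaussian noise that turns the output into a gamma-scaled posterior
    sample (our conjugate-case analogue of the paper's rescaling step)."""
    rng = np.random.default_rng(rng)
    n = len(x)
    if step is None:
        step = 1.0 / (2.0 * (n + 1))    # ~1/(2 n L) for this model; our choice
    theta = theta0
    for _ in range(n_steps):
        S = rng.choice(x, size=min(batch, n), replace=False)
        # stochastic gradient: -(n/|S|) sum_k grad log p(x_k|theta) - grad log pi(theta)
        grad_U = -(n / len(S)) * np.sum(S - theta) + theta
        theta = theta - step * grad_U + np.sqrt(2.0 * step) * rng.normal()
    # rescaling: the unscaled posterior has variance 1/(n+1); adding noise of
    # variance (1/gamma - 1)/(n+1) yields the scaled variance 1/(gamma*(n+1))
    extra_var = max(1.0 / gamma - 1.0, 0.0) / (n + 1)
    return rng.normal(theta, np.sqrt(extra_var))
```

Because both the number of steps and the batch size are constants, the per-round computation stays bounded as data accumulates.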
In particular, we first run a Langevin MCMC algorithm to generate a sample from an approximation to the unscaled posterior. To achieve the scaling with γ_a that we require for the analysis of the regret, we add zero-mean Gaussian noise with variance 1/(γ_a L_a n) to this sample. The distribution of the resulting sample has the same characteristics as those from the scaled posterior analyzed in Sec. 3.

Given Assumptions 1-Uniform and 3, we prove (in Theorem 5 in the Appendix) that running ULA with exact gradients provides appealing convergence properties. In particular, for a number of iterations independent of the number of rounds t or the number of samples from an arm, n = T_a(t), ULA converges to an accuracy in the Wasserstein-p distance which maintains the logarithmic regret of the exact algorithm (for more information on such metrics see Villani [2009]). We note parenthetically that working with the Wasserstein-p distance provides us with a tighter MCMC convergence analysis (than with the total variation distance used in Example 1) that helps in conjunction with the regret bounds. The proofs of the ULA and SGLD convergence require a uniform strong log-concavity and Lipschitz smoothness condition on the family p_a(X | θ_a) over the parameter θ_a, a strengthening of Assumption 1-Local.

Assumption 1-Uniform (Assumption on the family p_a(X | θ_a), strengthened for approximate sampling). Assume that log p_a(x | θ_a) is L_a-smooth and m_a-strongly concave over the parameter θ_a:

−log p_a(x | θ'_a) − ∇_θ log p_a(x | θ'_a)^⊤(θ_a − θ'_a) + (m_a/2)‖θ_a − θ'_a‖² ≤ −log p_a(x | θ_a) ≤ −log p_a(x | θ'_a) − ∇_θ log p_a(x | θ'_a)^⊤(θ_a − θ'_a) + (L_a/2)‖θ_a − θ'_a‖², ∀ θ_a, θ'_a ∈ R^{d_a}, x ∈ R.
Although the number of iterations required for ULA to converge is constant with respect to the time horizon t, the number of gradient computations over the likelihood function within each iteration is $T_a(t)$. To tackle this issue, we sub-sample the data at each iteration and use a stochastic gradient MCMC method [Ma et al., 2015]. To obtain convergence guarantees despite the larger variance this method incurs, we make a slightly stronger Lipschitz smoothness assumption on the parametric family of likelihoods.

Assumption 4 (Joint Lipschitz smoothness of the family $\log p_a(X|\theta_a)$, for SGLD). Assume a joint Lipschitz smoothness condition, which strengthens Assumptions 1-Uniform and 2 to impose Lipschitz smoothness on the entire bivariate function $\log p_a(x;\theta)$:

$$\|\nabla_\theta\log p_a(x|\theta_a) - \nabla_\theta\log p_a(x'|\theta'_a)\| \le L_a\|\theta_a-\theta'_a\| + L^*_a\|x-x'\|, \quad \forall\,\theta_a,\theta'_a\in\mathbb{R}^{d_a},\ x,x'\in\mathbb{R}.$$

Under this stronger assumption, we prove fast convergence of the SGLD method in the following Theorem 3. Specifically, we demonstrate that for a suitable choice of step size $h^{(n)}$, number of iterations N, and minibatch size $k = |S|$, samples generated by Algorithm 2 are distributed sufficiently close to the true posterior to ensure the optimal regret guarantee. By examining the number of iterations N and the minibatch size k, we confirm that the algorithmic and sample complexity of our method do not grow with the number of rounds t, as advertised.

Theorem 3 (SGLD Convergence). Assume that the family $\log p_a(x;\theta)$, the prior distributions, and the true reward distributions satisfy Assumptions 1-Uniform through 4.
If we take the batch size $k = O(\kappa_a)$, step size $h^{(n)} = O\big(\tfrac{1}{n\kappa_a L_a}\big)$, and number of steps $N = O(\kappa_a)$ in the SGLD algorithm, then for $\delta\in(0,1)$, with probability at least $1-\delta$ with respect to $X_{a,1},\dots,X_{a,n}$, we have convergence of the SGLD algorithm in the Wasserstein-$p$ distance. In particular, between the $n$-th and the $(n+1)$-th pull of arm $a$, samples $\theta_{a,t}$ approximately follow the posterior $\mu^{(n)}_a$:

$$W_p\big(\widehat\mu^{(n)}_a,\mu^{(n)}_a\big) \le \sqrt{\frac{1}{n m_a}\big(d_a + \log B_a + (32 + 8 d_a\kappa_a)\,p\big)},$$

where $\widehat\mu^{(n)}_a$ is the probability measure associated with any of the sample(s) $\theta_{a,Nh^{(n)}}$ generated between the $n$-th and the $(n+1)$-th pull of arm $a$.

We remark that we are able to keep the number of iterations N constant for both algorithms by initializing the current round of the approximate sampling algorithm at the output of the last round of the Langevin MCMC algorithm. If we instead initialized the algorithm independently from the prior, we would need $O(\log T_a(t))$ iterations to achieve this result, which would in turn yield a Thompson sampling algorithm whose computational complexity grows with the time horizon. We note that this warm-starting complicates the regret proof for the approximate Thompson sampling algorithms, since the samples used by Thompson sampling are no longer independent.

By scrutinizing the step size $h^{(n)}$ and the accuracy level of the sample distribution $W_p\big(\widehat\mu^{(n)}_a,\mu^{(n)}_a\big)$, we note that we take smaller steps to get increasingly accurate MCMC samples as more data are collected, since $\mu^{(n)}_a$ and $\mu^{(n+1)}_a$ become closer. (For simplicity of notation, we let the Lipschitz constants $L^*_a = L_a$ in the main paper.)

We restate Theorem 3 and give explicit values of the hyperparameters in Theorem 6 in the Appendix, but remark that the proof of this theorem is novel in the MCMC literature. It builds upon and strengthens Durmus and Moulines [2016] by taking into account the discretization and stochastic gradient errors to achieve strong convergence guarantees in the Wasserstein-$p$ distance up to any finite order $p$. Other related works on the convergence of ULA provide upper bounds in the Wasserstein distances only up to the second order (i.e., for p ≤
2) [see, e.g., Dalalyan and Karagulyan, 2019, Cheng and Bartlett, 2018, Ma et al., 2019, Vempala and Wibisono, 2019]. The bound in the Wasserstein-$p$ distance for arbitrarily large $p$ is necessary for guaranteeing the following Lemma 3, a concentration result for the approximate samples $\theta_{a,t}\sim\bar\mu^{(n)}_a[\gamma_a]$ analogous to Theorem 1.

Lemma 3.
Suppose that Assumptions 1-Uniform through 4 hold. Then for $\delta\in(0,1)$, the sample $\theta_{a,t}$ resulting from running the (stochastic gradient) ULA with $N$ steps, step size $h^{(n)}$, and batch size $k$ as defined in Theorem 3 satisfies:

$$\mathbb{P}_{\theta_{a,t}\sim\bar\mu^{(n)}_a[\gamma_a]}\left(\|\theta_{a,t}-\theta^*_a\| > 2e\sqrt{\frac{1}{m_a n}\Big(d_a+\log B_a+2\Big(\sigma_a+\frac{d_a\kappa_a}{\gamma_a}\Big)\log(1/\delta)\Big)}\right) < \delta,$$

where $\sigma_a = 16 + 4 d_a\kappa_a$.

Given that the concentration results for the samples from ULA and SGLD have the same form as those for exact Thompson sampling, we now show that approximate Thompson sampling achieves the same finite-time optimal regret guarantees (up to constant factors) as the exact Thompson sampling algorithm. To show this, we require a result analogous to Lemma 2 on the anti-concentration properties of the approximations to the scaled posteriors:
Lemma 4.
Suppose the likelihood and true reward distributions satisfy Assumptions 1 through 4. Then, if $\gamma_a = O\big(\tfrac{d_a}{\kappa_a}\big)$, for all $n = 1,\dots,T$, all samples from the (stochastic gradient) ULA method with the hyperparameters and runtime described in Theorem 3 satisfy:

$$\mathbb{E}\Big[\frac{1}{p_{a,n}}\Big] \le C\sqrt{B_a},$$

where $C$ is a universal constant independent of problem-dependent parameters.

The proof of Lemma 4 is similar to that of Lemma 2, but we are able to save a factor of $\sqrt{\kappa_a}$ because the last step of the approximate sampling scheme draws $\theta_{a,t}$ from a Gaussian distribution, as opposed to a strongly log-concave distribution which must be approximated by a Gaussian.

Given this lemma and the concentration results presented in the previous section, the proof of logarithmic regret is essentially the same as that for exact Thompson sampling. However, more care must be taken with the fact that the samples from the approximate posteriors are no longer independent, since we warm-start our proposed sampling algorithms from previous samples. We cope with this issue by constructing concentration rates (of a similar form as in Lemma 3) on the distributions of the samples, given that the initial sample is sufficiently well behaved (see Lemmas 11 and 12). We then show that this happens with sufficiently high probability to maintain similar upper bounds on terms I and II from Lemma 1 in Lemma 17, which in turn allows us to prove the following theorem in Appendix E.2.

Theorem 4 (Regret of Thompson sampling with a (stochastic gradient) Langevin algorithm). When the likelihood and true reward distributions satisfy Assumptions 1 through 4, we have that the expected regret after
$T > 1$ rounds of Thompson sampling with the (stochastic gradient) ULA method, with the hyperparameters and runtime described in Theorem 3, satisfies:

$$\mathbb{E}[R(T)] \le \sum_{a>1}\left[\frac{C A_a}{m_a\Delta_a}\big(\log B_a + d_a + d_a\kappa_a\log T\big) + \sqrt{B_a}\,\frac{C A_a}{m_a\Delta_a}\big(B_a + d_a\kappa_a + d_a\kappa_a\log T\big) + 3\Delta_a\right],$$

where $C$ is a universal constant that is independent of problem-dependent parameters, and the scale parameter is $\gamma_a = O\big(\tfrac{d_a}{\kappa_a}\big)$.

We note that Theorem 3 allows SGLD to be implemented with a constant number of steps per round and a constant batch size, with only the step size decreasing linearly in the number of samples. Combining this with our regret guarantee shows that an anytime algorithm for Thompson sampling with approximate samples can indeed achieve logarithmic regret.

Further, we remark that this bound exhibits a worse dependence on the quality of the prior on the optimal arm than in the exact sampling regime: we pay a $d_a\sqrt{B_a}\log T$ term in this regret bound, as opposed to $d_a\sqrt{B_a}$. Our regret bound in the approximate sampling regime does, however, exhibit a slightly better dependence on the condition number of the family. This, we believe, is an artifact of our analysis: a lower bound on the exact posterior was needed to invoke Gaussian anti-concentration results, which is not needed in the approximate sampling regime due to the design of the proposed sampling algorithm.

We empirically corroborate our theoretical results with numerical experiments on approximate Thompson sampling in log-concave multi-armed bandit instances. We benchmark against both UCB and exact Thompson sampling across three different multi-armed bandit instances: in the first instance, the priors reflect the correct ordering of the mean rewards of the arms; in the second, the priors are agnostic to the ordering; in the third, the priors reflect the completely opposite ordering.
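The following self-contained sketch mimics this comparison on a small scale; it is our own illustrative setup, not the paper's Appendix F configuration. It uses unit-variance Gaussian arms with standard normal priors, where exact posterior sampling is conjugate and the approximate sampler is a short, warm-started ULA chain:

```python
import numpy as np

def gaussian_ts(T, means, ula_steps=0, rng=None):
    """Cumulative regret of Thompson sampling on unit-variance Gaussian arms
    with N(0, 1) priors.  If ula_steps > 0, each round's posterior sample is
    produced by a short unadjusted Langevin (ULA) chain warm-started from the
    previous round, instead of exact conjugate sampling."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    sums = np.zeros(K)
    counts = np.zeros(K)
    warm = np.zeros(K)            # one warm-started chain per arm
    regret = 0.0
    for _ in range(T):
        samples = np.empty(K)
        for a in range(K):
            post_var = 1.0 / (counts[a] + 1.0)   # conjugate posterior
            post_mean = sums[a] * post_var
            if ula_steps == 0:
                samples[a] = post_mean + np.sqrt(post_var) * rng.standard_normal()
            else:
                h = 0.1 * post_var               # shrink step size with data
                x = warm[a]
                for _ in range(ula_steps):
                    grad_U = (x - post_mean) / post_var
                    x += -h * grad_U + np.sqrt(2 * h) * rng.standard_normal()
                warm[a] = x
                samples[a] = x
        a = int(np.argmax(samples))
        sums[a] += means[a] + rng.standard_normal()
        counts[a] += 1
        regret += max(means) - means[a]
    return regret
```

On a three-arm instance, both variants should incur regret far below the linear worst case, and the ULA variant uses a constant number of steps per round, mirroring the constant-complexity claim above.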
See Appendix F for details of the experimental settings. As suggested by our theoretical analysis in Section 4, we use a constant number of steps for both ULA and SGLD (with a constant number of data points in the stochastic gradient evaluation) to generate samples from the approximate posteriors. The regret of the three algorithms, averaged across 100 runs, is displayed in Figure 1, where we see that approximate Thompson sampling with samples generated by ULA and SGLD performs competitively against both exact Thompson sampling and UCB across all three instances.

We observe significant performance gains from the (approximate) Thompson sampling approach over the deterministic UCB algorithm when the priors are suggestive of, or even non-informative about, the appealing arms. When the priors are adversarial to the algorithm, the UCB algorithm outperforms the Thompson sampling approach, as expected. (This case corresponds to the constant $B_a$ in Theorems 2 and 4 being large.) Also as the theory predicts, we observe little difference between the exact and the approximate Thompson sampling methods in terms of regret. If we scrutinize further, we can see that SGLD slightly outperforms exact Thompson sampling in the adversarial prior case. This might be due to the added stochasticity of the approximate sampling techniques, which improves robustness against bad priors.

Although Thompson sampling has been used successfully in real-world problems for decades and has been shown to have appealing theoretical properties, there remains a lack of understanding of how approximate sampling affects its regret guarantees.

Figure 1: Performance of exact and approximate Thompson sampling vs. UCB on Gaussian bandits with (a) "good priors" (priors reflecting the correct ordering of the arms' means), (b) the same priors on all the arms' means, and (c) "bad priors" (priors reflecting the exact opposite ordering of the arms' means).
The shaded regions represent the 95% confidence interval around the mean regret across 100 runs of the algorithm.

In this work we derived new posterior contraction rates for log-concave likelihood families with arbitrary log-concave priors, which capture key dependencies between the posterior distributions and various problem-dependent parameters such as the prior quality and the parameter dimension. We then used these rates to show that exact Thompson sampling in MAB problems where the reward distributions are log-concave achieves the optimal finite-time regret guarantee of Lai and Robbins [1985]. As a direction for future work, we note that although our regret bound exhibits a dependence on the quality of the prior, it is still unable to capture the potential advantages of good priors.

We then demonstrated that Thompson sampling using samples generated by ULA, and, under slightly stronger assumptions, by SGLD, can still achieve the optimal regret guarantee with constant algorithmic complexity as well as constant sample complexity in the stochastic gradient estimate. Thus, by designing approximate sampling algorithms specifically for use with Thompson sampling, we were able to construct a computationally tractable anytime Thompson sampling algorithm from approximate samples with end-to-end guarantees of logarithmic regret.
References
M. Abeille and A. Lazaric. Linear Thompson sampling revisited. Electron. J. Statist., 11(2):5165–5197, 2017.
S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), pages 39.1–39.26, 2012.
S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 99–107, 2013a.
S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 127–135, 2013b.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
S. Basu and A. DasGupta. The mean, median, and mode of unimodal distributions: a characterization. Teor. Veroyatnost. i Primenen., 41:336–352, 1996.
O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24 (NeurIPS), pages 2249–2257, 2011.
X. Cheng and P. L. Bartlett. Convergence of Langevin MCMC in KL-divergence. In Proceedings of the 29th International Conference on Algorithmic Learning Theory (ALT), pages 186–211, 2018.
A. S. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stoch. Process. and their Appl., 129(12):5278–5311, 2019.
J. L. Doob. Application of the theory of martingales. Le Calcul des Probabilités et ses Applications, pages 23–27, 1949.
A. Durmus and E. Moulines. Sampling from strongly log-concave distributions with the Unadjusted Langevin Algorithm. arXiv preprint, 2016.
A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 2017.
S. Ghosal and A. W. van der Vaart. Convergence rates of posterior distributions for noniid observations. Ann. Statist., 35(1):192–223, 2007.
S. Ghosal, J. K. Ghosh, and A. W. van der Vaart. Convergence rates of posterior distributions. Ann. Statist., 28(2):500–531, 2000.
C. A. Gómez-Uribe. Online algorithms for parameter mean and variance estimation in dynamic regression. arXiv preprint, 2016.
A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 100–108, 2014.
C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subGaussian norm. arXiv preprint, 2019.
E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory (ALT), 2012.
N. Korda, E. Kaufmann, and R. Munos. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems 26 (NeurIPS), pages 1448–1456, 2013.
T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6:4–22, 1985.
T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
M. Ledoux. Concentration of measure and logarithmic Sobolev inequalities. In Séminaire de Probabilités XXXIII, pages 120–216. Springer, 1999.
M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, 2001.
X. Lu and B. Van Roy. Ensemble sampling. In Advances in Neural Information Processing Systems 30 (NeurIPS), pages 3260–3268, 2017.
Y.-A. Ma, T. Chen, and E. Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems 28 (NeurIPS), pages 2917–2925, 2015.
Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan. Sampling can be faster than optimization. Proc. Natl. Acad. Sci. U.S.A., 116(42):20881–20885, 2019.
W. Mou, N. Ho, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. A diffusion process perspective on posterior contraction rates for parameters. arXiv preprint, 2019.
B. Øksendal. Stochastic Differential Equations. Springer, Berlin, 6th edition, 2003.
M. Phan, Y. A. Yadkori, and J. Domke. Thompson sampling and approximate inference. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 8804–8813, 2019.
Y. Ren. On the Burkholder–Davis–Gundy inequalities for continuous martingales. Stat. Probabil. Lett., 78(17):3034–3039, 2008.
C. Riquelme, G. Tucker, and J. Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint, 2018.
D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. J. Mach. Learn. Res., 17:1–30, 2016.
D. Russo, B. Van Roy, A. Kazerouni, and I. Osband. A tutorial on Thompson sampling. arXiv preprint, 2017.
A. Saumard and J. A. Wellner. Log-concavity and strong log-concavity: A review. Statist. Surv., 8:45–114, 2014.
S. L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
X. Shen and L. Wasserman. Rates of convergence of posterior distributions. Ann. Statist., 29(3):687–714, 2001.
W. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
I. Urteaga and C. Wiggins. Variational inference for the multi-armed contextual bandit. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), pages 698–706, 2018.
A. W. van der Vaart and J. H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. Ann. Statist., 36(3):1435–1463, 2008.
S. Vempala and A. Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 8094–8106, 2019.
C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.
M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 681–688, 2011.
A Notation
Before presenting our proofs, we first include a table summarizing our notation.

Symbol: Meaning
$\mathcal{A}$: set of arms in the bandit environment
$K$: number of arms in the bandit environment, $K = |\mathcal{A}|$
$T$: time horizon
$A_t$: arm pulled at time $t$ by the algorithm, $A_t\in\mathcal{A}$
$T_a(t)$: number of times arm $a$ has been pulled by time $t$
$X_{A_t}$: reward from choosing arm $A_t$ at time $t$
$\theta_a$: parameters of the likelihood function, $\theta_a\in\mathbb{R}^{d_a}$
$d_a$: dimension of the parameter space for arm $a$
$p_a(x|\theta_a)$: parametric family of reward distributions for arm $a$
$\pi_a(\theta_a)$: prior distribution over the parameters for arm $a$
$\mu^{(n)}_a$: probability measure associated with the posterior over the parameters of arm $a$ after $n$ samples from arm $a$
$\mu^{(n)}_a[\gamma_a]$: probability measure associated with the (scaled) posterior over the parameters of arm $a$ after $n$ samples from arm $a$
$\widehat\mu^{(n)}_a$: probability measure resulting from running the Langevin MCMC algorithm described in Algorithm 2, which approximates $\mu^{(n)}_a$
$\bar\mu^{(n)}_a[\gamma_a]$: probability measure resulting from an approximate sampling method, which approximates $\mu^{(n)}_a[\gamma_a]$
$\theta^*_a$: true parameter value for arm $a$
$\theta_{a,t}$: sampled parameter for arm $a$ at time $t$ of the Thompson sampling algorithm: $\theta_{a,t}\sim\mu^{(n)}_a$
$\bar r_a$: mean of the reward distribution for arm $a$: $\bar r_a = \mathbb{E}[X_a|\theta^*_a]$
$\alpha_a$: vector in $\mathbb{R}^{d_a}$ such that $\bar r_a = \alpha_a^\top\theta^*_a$
$r_{a,t}(T_a(t))$: estimate of the mean of arm $a$ at round $t$: $r_{a,t}(T_a(t)) = \alpha_a^\top\theta_{a,t}$
$A_a$: norm of $\alpha_a$
$m_a$: strong log-concavity parameter of the family $p_a(x;\theta)$ in $\theta$, for all $x$
$\nu_a$: strong log-concavity parameter of the true reward distribution $p_a(x;\theta^*_a)$ in $x$
$F_{n,a}(\theta_a)$: averaged log-likelihood over the data points: $F_{n,a}(\theta_a) = \frac{1}{n}\sum_{i=1}^n\log p_a(X_i,\theta_a)$
$L_a$: Lipschitz smoothness constant of the true reward distribution and of the likelihood family $p_a(x;\theta^*)$ in $x$
$\kappa_a$: condition number of the likelihood family, $\kappa_a = \max\big(\tfrac{L_a}{m_a},\tfrac{L_a}{\nu_a}\big)$
$B_a$: reflects the quality of the prior: $B_a = \max_\theta \frac{\pi_a(\theta)}{\pi_a(\theta^*_a)}$

We also define a few notations used within the approximate sampling procedure (Algorithm 2).

Symbol: Meaning
$N$: number of steps of the approximate sampling algorithm
$h^{(n)}$: step size of the approximate sampling algorithm after $n$ samples from the arm
$\theta_{ih^{(n)}}$: MCMC sample generated at the $i$-th iteration of Algorithm 2
$\mu_{ih^{(n)}}$: measure of $\theta_{ih^{(n)}}$
$k$: batch size of the stochastic gradient Langevin algorithm

B Posterior Concentration Proof
To begin the proof of Theorem 1, we first prove that, under our assumptions, the gradient of the population likelihood function concentrates.

Proposition 2.
If the prior distribution over $\theta_a$ satisfies Assumption 3, then we have:

$$\sup_{\theta_a\in\mathbb{R}^{d_a}}\nabla\log\pi_a(\theta_a)^\top(\theta_a-\theta^*_a) \le g^*_a - \log\pi_a(\theta^*_a),$$

where $g^*_a = \max_{\theta_a\in\mathbb{R}^{d_a}}\log\pi_a(\theta_a)$.

Proof. Let $g(\theta_a) = \log\pi_a(\theta_a)$. From the concavity of $g$, we know that

$$\nabla g(\theta_a)^\top(\theta_a-\theta^*_a) \le g(\theta_a) - g(\theta^*_a).$$

Since this holds for all $\theta_a\in\mathbb{R}^{d_a}$, taking the supremum of both sides gives:

$$\sup_{\theta_a\in\mathbb{R}^{d_a}}\nabla g(\theta_a)^\top(\theta_a-\theta^*_a) \le g^*_a - g(\theta^*_a).$$

Let $\log B_a := g^*_a - \log\pi_a(\theta^*_a)$. If the prior is centered on the correct value of $\theta^*_a$, then $\log B_a = 0$. Our posterior concentration rates will depend on $B_a$.

Before proving the posterior concentration result, we first show that the norm of the gradient of the empirical likelihood function at $\theta^*_a$ is a sub-Gaussian random variable:

Proposition 3.
The random variable $\|\nabla_\theta F_{a,n}(\theta^*_a)\|$ is $L_a\sqrt{\tfrac{d_a}{n\nu_a}}$-sub-Gaussian.

Proof. Recall that the true density $p_a(x|\theta^*_a)$ is $\nu_a$-strongly log-concave in $x$ and that $\nabla_\theta\log p_a(x|\theta^*_a)$ is $L_a$-Lipschitz in $x$. Notice that $\nabla_\theta F_a(\theta^*_a) = 0$, since $\theta^*_a$ is the point maximizing the population likelihood. Consider the random variable $Z = \nabla_\theta\log p_a(X|\theta^*_a)$. Since $\mathbb{E}[Z] = \nabla_\theta F_a(\theta^*_a) = 0$, the random variable $Z$ is centered.

We start by showing that $Z$ is a sub-Gaussian random vector. Let $v\in\mathcal{S}^{d_a}$ be an arbitrary point on the $d_a$-dimensional unit sphere, and define the function $V(x) = \langle\nabla_\theta\log p_a(x|\theta^*_a), v\rangle$. This function is $L_a$-Lipschitz. Indeed, let $x_1, x_2$ be two arbitrary points:

$$|V(x_1)-V(x_2)| = |\langle\nabla_\theta\log p_a(x_1|\theta^*_a)-\nabla_\theta\log p_a(x_2|\theta^*_a), v\rangle| \le \|\nabla_\theta\log p_a(x_1|\theta^*_a)-\nabla_\theta\log p_a(x_2|\theta^*_a)\|\,\|v\| \le L_a\|x_1-x_2\|.$$

The first inequality follows from Cauchy–Schwarz, the second from the Lipschitz assumption on the gradients. By an application of Proposition 2.18 in Ledoux [2001], we conclude that $V(X)$ is sub-Gaussian with parameter $\tfrac{L_a}{\sqrt{\nu_a}}$.

Since the projection of $Z$ onto an arbitrary direction $v$ of the unit sphere is sub-Gaussian with a parameter independent of $v$, we conclude that the random vector $Z$ is sub-Gaussian with the same parameter $\tfrac{L_a}{\sqrt{\nu_a}}$. Consequently, the vector $\nabla_\theta F_{a,n}(\theta^*_a)$, being an average of $n$ i.i.d. sub-Gaussian vectors with parameter $\tfrac{L_a}{\sqrt{\nu_a}}$, is sub-Gaussian with parameter $\tfrac{L_a}{\sqrt{n\nu_a}}$. Finally, Lemma 1 of Jin et al. [2019] implies it is norm-sub-Gaussian with parameter $L_a\sqrt{\tfrac{d_a}{n\nu_a}}$.

Given these results, we now prove Theorem 1. For clarity, we restate the theorem below:

Theorem B.1.
Given $\delta\in(0,1)$ and $n$ samples $X^{(n)}_a = X_{a,1},\dots,X_{a,n}$, with probability at least $1-\delta$ over the samples, the posterior distribution satisfies:

$$\mathbb{P}_{\theta\sim\mu^{(n)}_a[\gamma_a]}\left(\|\theta_a-\theta^*_a\| > 2e\sqrt{\frac{1}{m_a n}\Big(\frac{d_a}{\gamma_a}+\log B_a+2\Big(\frac{64}{\gamma_a}+\frac{4 d_a L_a^2}{\nu_a m_a}\Big)\log(1/\delta)\Big)}\right) < \delta.$$

Proof. The proof makes use of the techniques used to prove Theorem 1 in Mou et al. [2019]: analyzing how a carefully designed potential function evolves along trajectories of the s.d.e. By a careful accounting of terms and constants, however, we are able to keep explicit constants and derive tighter bounds which hold for any finite number of samples. Throughout the proof we drop the dependence on $a$ and condition on the high-probability event $G_{a,n}(\delta)$, defined via Proposition 3, which guarantees that the norm of the likelihood gradient concentrates with probability at least $1-\delta$. Consider the s.d.e.:

$$d\theta_t = \frac{1}{2}\nabla_\theta F_n(\theta_t)\,dt + \frac{1}{2n}\nabla_\theta\log\pi(\theta_t)\,dt + \frac{1}{\sqrt{n\gamma}}\,dB_t,$$

and the potential function given by:

$$V(\theta_t) = \frac{1}{2}e^{\alpha t}\|\theta_t-\theta^*\|^2,$$

for a choice of α >
0. The idea is that bounds on the $p$-th moments of $V(\theta_t)$ can be translated into bounds on the $p$-th moments of $V(\theta)$ where $\theta\sim\mu^{(n)}$, due to the fact that $\lim_{t\to\infty}\theta_t = \theta\sim\mu^{(n)}$. The square-root growth in $p$ of these moments will imply that $\|\theta-\theta^*\|$ has sub-Gaussian tails, with a rate that we make explicit. We begin by using Itô's Lemma on $V(\theta_t)$:

$$V(\theta_t) = T_1 + T_2 + T_3 + T_4,$$

where:

$$T_1 = -\frac{1}{2}\int_0^t e^{\alpha s}\langle\theta^*-\theta_s,\nabla_\theta F_n(\theta_s)\rangle\,ds + \frac{\alpha}{2}\int_0^t e^{\alpha s}\|\theta_s-\theta^*\|^2\,ds$$
$$T_2 = \frac{1}{2n}\int_0^t e^{\alpha s}\langle\theta_s-\theta^*,\nabla_\theta\log\pi(\theta_s)\rangle\,ds$$
$$T_3 = \frac{d}{2n\gamma}\int_0^t e^{\alpha s}\,ds$$
$$T_4 = \frac{1}{\sqrt{n\gamma}}\int_0^t e^{\alpha s}\langle\theta_s-\theta^*,dB_s\rangle$$

Let us first upper bound $T_1$:

$$T_1 = -\frac{1}{2}\int_0^t e^{\alpha s}\langle\theta^*-\theta_s,\nabla_\theta F_n(\theta_s)-\nabla_\theta F_n(\theta^*)\rangle\,ds + \frac{\alpha}{2}\int_0^t e^{\alpha s}\|\theta_s-\theta^*\|^2\,ds - \frac{1}{2}\int_0^t e^{\alpha s}\langle\theta^*-\theta_s,\nabla_\theta F_n(\theta^*)\rangle\,ds$$
$$\overset{(i)}{\le} \frac{\alpha-m}{2}\int_0^t e^{\alpha s}\|\theta_s-\theta^*\|^2\,ds - \frac{1}{2}\int_0^t e^{\alpha s}\langle\theta^*-\theta_s,\nabla_\theta F_n(\theta^*)\rangle\,ds$$
$$\overset{(ii)}{\le} \frac{\alpha-m}{2}\int_0^t e^{\alpha s}\|\theta_s-\theta^*\|^2\,ds + \frac{1}{2}\int_0^t e^{\alpha s}\|\theta^*-\theta_s\|\underbrace{\|\nabla_\theta F_n(\theta^*)\|}_{:=\,\epsilon(n)}\,ds,$$

where in $(i)$ we use the strong-concavity property from Assumption 1-Local, and in $(ii)$ we use Cauchy–Schwarz. Using Young's inequality for products (with constant $m/4$) and the choice $\alpha = m/2$, so that the coefficient of the quadratic term becomes non-positive, gives:

$$T_1 \le \frac{\epsilon(n)^2}{2m}\int_0^t e^{\alpha s}\,ds \le \frac{\epsilon(n)^2}{m^2}\,e^{\alpha t}.$$

Given our assumption on the prior (Proposition 2), our choice of $\alpha = m/2$, and simple algebra, we can upper bound $T_2$:

$$T_2 = \frac{1}{2n}\int_0^t e^{\alpha s}\langle\theta_s-\theta^*,\nabla_\theta\log\pi(\theta_s)\rangle\,ds \le \frac{\log B}{2\alpha n}\big(e^{\alpha t}-1\big) \le \frac{\log B}{nm}\,e^{\alpha t},$$

and similarly:

$$T_3 = \frac{d}{2n\gamma}\int_0^t e^{\alpha s}\,ds \le \frac{d}{\gamma nm}\,e^{\alpha t}.$$
We proceed to bound $T_4$. Let us start by defining:

$$M_t = \int_0^t e^{\alpha s}\langle\theta_s-\theta^*,dB_s\rangle, \quad \text{so that} \quad T_4 = \frac{M_t}{\sqrt{n\gamma}}.$$

Combining all the upper bounds of $T_1$, $T_2$, $T_3$, and $T_4$:

$$V(\theta_t) \le \Big(\frac{\epsilon(n)^2}{m^2} + \frac{d}{\gamma nm} + \frac{\log B}{nm}\Big)e^{\alpha t} + \frac{M_t}{\sqrt{\gamma n}}.$$

To find a bound for the $p$-th moments of $V$, we upper bound the $p$-th moments of the supremum of $M_t$, where $p\ge 2$:

$$\mathbb{E}\Big[\sup_{0\le t\le T}|M_t|^p\Big] \overset{(i)}{\le} (8p)^{p/2}\,\mathbb{E}\big[\langle M,M\rangle_T^{p/2}\big] = (8p)^{p/2}\,\mathbb{E}\Big[\Big(\int_0^T e^{2\alpha s}\|\theta_s-\theta^*\|^2\,ds\Big)^{p/2}\Big]$$
$$\overset{(ii)}{\le} (8p)^{p/2}\,\mathbb{E}\Big[\Big(\sup_{0\le t\le T}e^{\alpha t}\|\theta_t-\theta^*\|^2\Big)^{p/2}\Big]\Big(\frac{e^{\alpha T}-1}{\alpha}\Big)^{p/2} \overset{(iii)}{\le} \Big(\frac{8pe^{\alpha T}}{\alpha}\Big)^{p/2}\,\mathbb{E}\Big[\Big(\sup_{0\le t\le T}e^{\alpha t}\|\theta_t-\theta^*\|^2\Big)^{p/2}\Big]$$

Inequality $(i)$ is a direct consequence of the Burkholder–Davis–Gundy inequality [Ren, 2008], $(ii)$ follows by pulling the supremum out of the integral, and $(iii)$ holds because $e^{\alpha T}-1 \le e^{\alpha T}$.

Now let us look at the moments of $V(\theta_t)$:

$$\mathbb{E}\Big[\Big(\sup_{0\le t\le T}V(\theta_t)\Big)^p\Big]^{1/p} \le \mathbb{E}\Big[\Big(\sup_{0\le t\le T}\Big(\frac{\epsilon(n)^2}{m^2}+\frac{d}{\gamma nm}+\frac{\log B}{nm}\Big)e^{\alpha t} + \sup_{0\le t\le T}\frac{|M_t|}{\sqrt{\gamma n}}\Big)^p\Big]^{1/p}$$

Since $\epsilon(n)$ is independent of $t$, we can expand the above (via Minkowski's inequality) as:

$$\mathbb{E}\Big[\Big(\sup_{0\le t\le T}V(\theta_t)\Big)^p\Big]^{1/p} \le \underbrace{\Big(\frac{d}{\gamma nm}+\frac{\log B}{nm}\Big)e^{\alpha T}}_{:=\,U_T} + \frac{e^{\alpha T}}{m^2}\,\mathbb{E}\big[\epsilon(n)^{2p}\big]^{1/p} + \mathbb{E}\Big[\Big(\sup_{0\le t\le T}\frac{|M_t|}{\sqrt{\gamma n}}\Big)^p\Big]^{1/p}$$

Since, from Proposition 3, we know that $\epsilon(n)$ is $L\sqrt{\tfrac{d}{n\nu}}$-sub-Gaussian, we have:

$$\mathbb{E}\big[\epsilon(n)^{2p}\big]^{1/p} \le \frac{4L^2 d}{n\nu}\,p.$$

Using our upper bound on the supremum of $M_t$ gives:

$$\mathbb{E}\Big[\Big(\sup_{0\le t\le T}V(\theta_t)\Big)^p\Big]^{1/p} \le U_T + \frac{4e^{\alpha T}dL^2}{\nu m^2 n}\,p + \mathbb{E}\Big[\Big(\frac{8pe^{\alpha T}}{\gamma\alpha n}\Big)^{p/2}\Big(\sup_{0\le t\le T}e^{\alpha t}\|\theta_t-\theta^*\|^2\Big)^{p/2}\Big]^{1/p} \qquad (6)$$

We proceed by bounding the last term on the right-hand side of the expression above:

$$\mathbb{E}\Big[\Big(\frac{8pe^{\alpha T}}{\gamma\alpha n}\Big)^{p/2}\Big(\sup_{0\le t\le T}e^{\alpha t}\|\theta_t-\theta^*\|^2\Big)^{p/2}\Big]^{1/p} \overset{(i)}{\le} \mathbb{E}\Big[\frac{1}{2}\Big(\frac{32pe^{\alpha T}}{\gamma\alpha n}\Big)^{p} + \frac{1}{2}\Big(\frac{1}{4}\sup_{0\le t\le T}e^{\alpha t}\|\theta_t-\theta^*\|^2\Big)^{p}\Big]^{1/p}$$
$$\overset{(ii)}{\le} \frac{1}{2^{1/p}}\,\frac{32pe^{\alpha T}}{\gamma\alpha n} + \frac{1}{2^{1/p}}\cdot\frac{1}{4}\,\mathbb{E}\Big[\Big(\sup_{0\le t\le T}e^{\alpha t}\|\theta_t-\theta^*\|^2\Big)^p\Big]^{1/p} \overset{(iii)}{\le} \frac{32pe^{\alpha T}}{\gamma\alpha n} + \underbrace{\frac{1}{2}\,\mathbb{E}\Big[\Big(\sup_{0\le t\le T}V(\theta_t)\Big)^p\Big]^{1/p}}_{I}$$

Inequality $(i)$ follows from Young's inequality for products applied inside the expectation, $(ii)$ is a consequence of Minkowski's inequality, and $(iii)$ holds because $2^{-1/p}\le 1$ and $\tfrac{1}{2}\sup_t e^{\alpha t}\|\theta_t-\theta^*\|^2 = \sup_t V(\theta_t)$. We note that the second term $I$ on the right-hand side above is exactly $\tfrac{1}{2}\,\mathbb{E}\big[\big(\sup_{0\le t\le T}V(\theta_t)\big)^p\big]^{1/p}$. Plugging this into Equation (6) and rearranging gives:

$$\frac{1}{2}\,\mathbb{E}\Big[\Big(\sup_{0\le t\le T}V(\theta_t)\Big)^p\Big]^{1/p} \le U_T + \frac{32e^{\alpha T}}{\alpha\gamma n}\,p + \frac{4e^{\alpha T}dL^2}{\nu m^2 n}\,p,$$

which, with $\alpha = m/2$, finally results in:

$$\mathbb{E}\Big[\Big(\sup_{0\le t\le T}V(\theta_t)\Big)^p\Big]^{1/p} \le \frac{2}{mn}\Big(\frac{d}{\gamma}+\log B+\Big(\frac{64}{\gamma}+\frac{4dL^2}{\nu m}\Big)p\Big)e^{\alpha T}. \qquad (7)$$

Given this control on the moments of the supremum of $V(\theta_t)$ (recall $V(\theta_t) = \tfrac{1}{2}e^{\alpha t}\|\theta_t-\theta^*\|^2$), we finally construct the bound on the moments of $\|\theta_T-\theta^*\|$:

$$\mathbb{E}\big[\|\theta_T-\theta^*\|^{p}\big]^{2/p} = \mathbb{E}\Big[\big(2e^{-\alpha T}V(\theta_T)\big)^{p/2}\Big]^{2/p} \overset{(i)}{\le} 2e^{-\alpha T}\,\mathbb{E}\Big[\Big(\sup_{0\le t\le T}V(\theta_t)\Big)^{p}\Big]^{1/p} \overset{(ii)}{\le} \frac{4}{mn}\Big(\frac{d}{\gamma}+\log B+\Big(\frac{64}{\gamma}+\frac{4dL^2}{\nu m}\Big)p\Big)$$

Inequality $(i)$ follows from bounding $V(\theta_T)$ by its supremum (together with the power-mean inequality), and $(ii)$ from plugging in the upper bound of Equation (7). Taking the limit as $T\to\infty$ and using Fatou's Lemma, we therefore have that the moments $\mathbb{E}[\|\theta-\theta^*\|^p]^{1/p}$, with probability at least $1-\delta$ over the data, grow at a rate of $\sqrt{p}$:

$$\mathbb{E}[\|\theta-\theta^*\|^p]^{1/p} \le \liminf_{T\to\infty}\,\mathbb{E}[\|\theta_T-\theta^*\|^p]^{1/p} \le 2\sqrt{\frac{1}{mn}\Big(\frac{d}{\gamma}+\log B+\Big(\frac{64}{\gamma}+\frac{4dL^2}{\nu m}\Big)p\Big)}. \qquad (8,\,9)$$

To simplify notation, let $D = \frac{d}{\gamma}+\log B$ and $\sigma^2 = \frac{64}{\gamma}+\frac{4dL^2}{\nu m}$. Therefore we have:

$$\mathbb{E}[\|\theta-\theta^*\|^p]^{1/p} \le 2\sqrt{\frac{1}{mn}\big(D+\sigma^2 p\big)}. \qquad (10)$$

The bound (10) guarantees that the norm $\|\theta-\theta^*\|$ has sub-Gaussian tails.
We make the parameters explicit via Markov's inequality:
\[ \mathbb{P}_{\theta\sim\mu_a^{(n)}}\big(\|\theta-\theta^*\| > \epsilon\big) \le \frac{\mathbb{E}\big[\|\theta-\theta^*\|^p\big]}{\epsilon^p} \le \left(\frac{4\sqrt{D+\sigma^2 p}}{\sqrt{mn}\,\epsilon}\right)^p. \]
Choosing $p = 2\log(1/\delta)$ and letting
\[ \epsilon = \sqrt e \cdot 4\sqrt{\frac{D+\sigma^2 p}{mn}} \]
gives us our desired solution:
\[ \mathbb{P}_{\theta\sim\mu_a^{(n)}[\gamma_a]}\left(\|\theta-\theta^*\| > \sqrt{\frac{16e}{mn}\Big(\frac{d}{\gamma} + \log B + 2\Big(\frac{1}{\gamma} + \frac{8dL^2}{\nu m^2}\Big)\log(1/\delta)\Big)}\right) < \delta. \]

Introduction to the Langevin Algorithms
We refer to the stochastic process defined by the following stochastic differential equation as continuous-time Langevin dynamics:
\[ \mathrm{d}\theta_t = -\nabla U(\theta_t)\,\mathrm{d}t + \sqrt2\,\mathrm{d}B_t. \]
We first encountered these continuous-time Langevin dynamics in Eq. (2), where we set $U(\theta) = -\gamma_a\big(nF_{n,a}(\theta) + \log\pi_a(\theta)\big) = -\gamma_a\sum_{i=1}^n \log p_a(x_{a,i}\,|\,\theta) - \gamma_a\log\pi_a(\theta)$ to prove posterior concentration of $\mu_a^{(n)}[\gamma_a]$. One important feature of the Langevin dynamics is that their invariant distribution is proportional to $e^{-U(\theta)}$. We can therefore also use them to generate samples distributed according to the unscaled posterior distribution $\mu_a^{(n)}$. Letting $U(\theta) = -\sum_{i=1}^n \log p_a(x_{a,i}\,|\,\theta) - \log\pi_a(\theta)$, we obtain continuous-time dynamics whose trajectories converge to the posterior distribution $\mu_a^{(n)}$ exponentially fast. To obtain an implementable algorithm, we apply the Euler–Maruyama discretization to the Langevin dynamics and arrive at the following ULA update:
\[ \theta_{(i+1)h^{(n)}} \sim \mathcal{N}\Big(\theta_{ih^{(n)}} - h^{(n)}\nabla U\big(\theta_{ih^{(n)}}\big),\; 2h^{(n)} I\Big). \]
Since $\nabla U(\theta) = -\sum_{i=1}^n \nabla\log p_a(x_{a,i}\,|\,\theta) - \nabla\log\pi_a(\theta)$ in the above update rule, the computational complexity of each iteration of the Langevin algorithm grows with the number of data points collected, $n$. To cope with the growing number of terms in $\nabla U(\theta)$, we take a stochastic gradient approach and define $\nabla\widehat U(\theta) = -\frac{n}{|S|}\sum_{x_k\in S} \nabla\log p_a(x_k\,|\,\theta) - \nabla\log\pi_a(\theta)$, where $S$ is a subset of the dataset $\{x_{a,1},\dots,x_{a,n}\}$. For simplicity, we form $S$ by subsampling uniformly from $\{x_{a,1},\dots,x_{a,n}\}$. Substituting the stochastic gradient $\nabla\widehat U$ for the full gradient $\nabla U$ in the above update rule results in the SGLD algorithm.

D Proofs for Approximate MCMC Sampling
In this appendix we supply the proofs of concentration for approximate samples generated by both the ULA and SGLD MCMC methods, and we quantify the computational complexity of generating samples whose distribution is close enough to the posterior. We begin by restating the assumptions on the likelihood that are required for the MCMC sampling methods to converge.
Assumption 1-Uniform (assumption on the family $p_a(X\,|\,\theta_a)$, strengthened for approximate sampling). Assume that $\log p_a(x\,|\,\theta_a)$ is $L_a$-smooth and $m_a$-strongly concave in the parameter $\theta_a$:
\[ -\log p_a(x\,|\,\theta'_a) - \nabla_\theta \log p_a(x\,|\,\theta'_a)^\top(\theta_a-\theta'_a) + \frac{m_a}{2}\|\theta_a-\theta'_a\|^2 \le -\log p_a(x\,|\,\theta_a) \le -\log p_a(x\,|\,\theta'_a) - \nabla_\theta \log p_a(x\,|\,\theta'_a)^\top(\theta_a-\theta'_a) + \frac{L_a}{2}\|\theta_a-\theta'_a\|^2, \quad \forall\,\theta_a,\theta'_a\in\mathbb{R}^{d_a},\ x\in\mathbb{R}. \]

Assumption 3 (assumptions on the prior distribution). For every $a\in\mathcal{A}$, assume that $\log\pi_a(\theta_a)$ is concave with $L_a$-Lipschitz gradients for all $\theta_a\in\mathbb{R}^{d_a}$:
\[ \big\|\nabla_\theta \log\pi_a(\theta) - \nabla_\theta \log\pi_a(\theta')\big\| \le L_a\|\theta-\theta'\| \quad \forall\,\theta,\theta'\in\mathbb{R}^{d_a}. \]

Assumption 4 (joint Lipschitz smoothness of the family $\log p_a(X\,|\,\theta_a)$, for SGLD). Assume a joint Lipschitz smoothness condition, which strengthens Assumptions 1-Local and 2 by imposing Lipschitz smoothness on the entire bivariate function $\log p_a(x;\theta)$:
\[ \big\|\nabla_\theta \log p_a(x\,|\,\theta_a) - \nabla_\theta \log p_a(x'\,|\,\theta'_a)\big\| \le L_a\|\theta_a-\theta'_a\| + L^*_a\|x-x'\|, \quad \forall\,\theta_a,\theta'_a\in\mathbb{R}^{d_a},\ x,x'\in\mathbb{R}. \]

We now begin by presenting the result for ULA.

D.1 Convergence of the unadjusted Langevin algorithm (ULA)
If the function $\log p_a(x;\theta)$ satisfies the Lipschitz smoothness condition in Assumption 1-Local, then we can leverage gradient-based MCMC algorithms to generate samples with convergence guarantees in the $p$-Wasserstein distance. As stated in Algorithm 2, we initialize ULA in the $n$-th round from the last iterate of the $(n-1)$-th round.

Theorem 5 (ULA Convergence). Assume that the likelihood $\log p_a(x;\theta)$ and prior $\pi_a$ satisfy Assumption 1-Uniform and Assumption 3. Take step size $h^{(n)} = \frac{n\,m_a}{16(L_a+nL_a)^2} = O\big(\frac{1}{nL_a\kappa_a}\big)$ and number of steps $N = \frac{640(L_a+nL_a)^2}{n^2m_a^2} = O(\kappa_a^2)$ in Algorithm 2. If the posterior distribution satisfies the concentration inequality $\mathbb{E}_{\theta\sim\mu_a^{(n)}}\big[\|\theta-\theta^*\|^p\big]^{1/p} \le \frac{\widetilde D}{\sqrt n}$, then for any positive even integer $p$, we have convergence of the ULA algorithm to the posterior $\mu_a^{(n)}$ in the $W_p$ distance:
\[ W_p\big(\widehat\mu_a^{(n)}, \mu_a^{(n)}\big) \le \frac{\widetilde D}{\sqrt n}, \qquad \forall\,\widetilde D \ge \sqrt{\frac{16\,d_a p}{m_a}}. \]

Proof of Theorem 5. We use induction to prove this theorem.

• For $n = 1$, we initialize at $\theta_0$, which is within a $\sqrt{d_a/m_a}$-ball of the mode of the target distribution, $\theta^*_p = \arg\max p_a(\theta\,|\,x)$, where $p_a(\theta\,|\,x) \propto p_a(x\,|\,\theta)\pi_a(\theta)$ and $-\log p_a(\theta\,|\,x)$ is $m_a$-strongly convex and $(L_a+L_a)$-Lipschitz smooth. Invoking Lemma 10, we obtain, for $\mathrm{d}\mu_a^{(1)} = p_a(\theta\,|\,x)\,\mathrm{d}\theta$, the following bound on the Wasserstein-$p$ distance between the target distribution and the point mass at its mode:
\[ W_p\big(\mu_a^{(1)}, \delta(\theta^*_p)\big) \le 5\sqrt{\frac{d_a p}{m_a}}. \]
Therefore,
\[ W_p\big(\mu_a^{(1)}, \delta(\theta_0)\big) \le W_p\big(\mu_a^{(1)}, \delta(\theta^*_p)\big) + \|\theta_0-\theta^*_p\| \le 6\sqrt{\frac{d_a p}{m_a}}. \]
We then invoke Lemma 6, with initial condition $\mu_0 = \delta(\theta_0)$, to obtain convergence at the $N$-th iteration of Algorithm 2 after the first pull of arm $a$:
\[ W_p^p\big(\mu_{Nh^{(1)}}, \mu_a^{(1)}\big) \le \Big(1 - \frac{m_a h^{(1)}}{4}\Big)^{p\cdot N} W_p^p\big(\delta(\theta_0), \mu_a^{(1)}\big) + \frac{2^p(L_a+L_a)^p}{m_a^p}\,(d_a p)^{p/2}\big(h^{(1)}\big)^{p/2}, \]
where we have substituted the strong convexity $m_a$ for $\widehat m$ and the Lipschitz smoothness $(L_a+L_a)$ for $\widehat L$. Plugging in the step size
\[ h^{(1)} = \frac{m_a}{16(L_a+L_a)^2} \le \min\Big\{\frac{m_a}{16(L_a+L_a)^2},\ \frac{m_a\,\widetilde D^2}{16(L_a+L_a)^2\,d_a p}\Big\} \]
and number of steps $N = \frac{40}{m_a h^{(1)}} = \frac{640(L_a+L_a)^2}{m_a^2}$,
\[ W_p^p\big(\widehat\mu_a^{(1)}, \mu_a^{(1)}\big) = W_p^p\big(\mu_{Nh^{(1)}}, \mu_a^{(1)}\big) \le \widetilde D^p. \]

• Assume that after the $(n-1)$-th pull and before the $n$-th pull of arm $a$, the ULA algorithm guarantees that $W_p\big(\widehat\mu_a^{(n-1)}, \mu_a^{(n-1)}\big) \le \frac{\widetilde D}{\sqrt{n-1}}$. We now prove that after the $n$-th pull and before the $(n+1)$-th pull, it is guaranteed that $W_p\big(\widehat\mu_a^{(n)}, \mu_a^{(n)}\big) \le \frac{\widetilde D}{\sqrt n}$. We first obtain from the assumed posterior concentration inequality:
\[ W_p\big(\mu_a^{(n)}, \delta(\theta^*)\big) \le \mathbb{E}_{\theta\sim\mu_a^{(n)}}\big[\|\theta-\theta^*\|^p\big]^{1/p} \le \frac{\widetilde D}{\sqrt n}. \tag{11} \]
Therefore, for $n \ge 2$,
\[ W_p\big(\mu_a^{(n)}, \mu_a^{(n-1)}\big) \le W_p\big(\mu_a^{(n)}, \delta(\theta^*)\big) + W_p\big(\mu_a^{(n-1)}, \delta(\theta^*)\big) \le \frac{3\widetilde D}{\sqrt n}. \]
We combine this bound with the induction hypothesis and obtain that
\[ W_p\big(\mu_a^{(n)}, \widehat\mu_a^{(n-1)}\big) \le W_p\big(\mu_a^{(n)}, \mu_a^{(n-1)}\big) + W_p\big(\mu_a^{(n-1)}, \widehat\mu_a^{(n-1)}\big) \le \frac{5\widetilde D}{\sqrt n}. \]
From Lemma 6, we know that for $\widehat m = n\cdot m_a$ and $\widehat L = n\cdot L_a + L_a$, with initial condition $\mu_0 = \widehat\mu_a^{(n-1)}$ and with exact gradients,
\[ W_p^p\big(\mu_{ih^{(n)}}, \mu_a^{(n)}\big) \le \Big(1 - \frac{\widehat m h^{(n)}}{4}\Big)^{p\cdot i} W_p^p\big(\widehat\mu_a^{(n-1)}, \mu_a^{(n)}\big) + \frac{2^p\,\widehat L^p}{\widehat m^p}\,(d_a p)^{p/2}\big(h^{(n)}\big)^{p/2}. \]
If we take step size
\[ h^{(n)} = \frac{\widehat m}{16\widehat L^2} \le \min\Big\{\frac{\widehat m}{16\widehat L^2},\ \frac{n\,\widehat m\,\widetilde D^2}{16\widehat L^2\,d_a p}\Big\} \]
and the number of steps taken by the ULA algorithm between the $(n-1)$-th and the $n$-th pull to be $\widehat N \ge \frac{40}{\widehat m h^{(n)}}$, then
\[ W_p^p\big(\widehat\mu_a^{(n)}, \mu_a^{(n)}\big) = W_p^p\big(\mu_{\widehat N h^{(n)}}, \mu_a^{(n)}\big) \le \Big(1 - \frac{\widehat m h^{(n)}}{4}\Big)^{p\cdot\widehat N}\,\frac{5^p\,\widetilde D^p}{n^{p/2}} + \frac{2^p\,\widehat L^p}{\widehat m^p}\,(d_a p)^{p/2}\big(h^{(n)}\big)^{p/2} \le \frac{\widetilde D^p}{n^{p/2}}, \tag{12} \]
leading to the result that $W_p\big(\widehat\mu_a^{(n)}, \mu_a^{(n)}\big) \le \frac{\widetilde D}{\sqrt n}$. Since at least one round will have passed between the $(n-1)$-th and the $n$-th pull of arm $a$, taking the number of steps in each round $t$ to be $N = \frac{40}{\widehat m h^{(n)}} = \frac{640(L_a+nL_a)^2}{n^2m_a^2}$ suffices. Therefore, $N = \frac{640(L_a+nL_a)^2}{n^2m_a^2} = O\Big(\frac{L_a^2}{m_a^2}\Big)$.

D.2 Convergence of the stochastic gradient Langevin algorithm (SGLD)
If $\log p_a(x;\theta)$ satisfies the stronger joint Lipschitz smoothness condition in Assumption 4, similar guarantees can be obtained for stochastic gradient MCMC algorithms.

Theorem 6 (SGLD Convergence). Assume that the family $\log p_a(x;\theta)$ and prior $\pi_a$ satisfy Assumption 1-Uniform, Assumption 3, and Assumption 4. Take the number of data samples in the stochastic gradient estimate to be $k = \frac{32(L^*_a)^2}{m_a\nu_a} = 32\kappa_a$, step size $h^{(n)} = \frac{n\,m_a}{16(L_a+nL_a)^2} = O\big(\frac{1}{nL_a\kappa_a}\big)$, and number of steps $N = \frac{1280(L_a+nL_a)^2}{n^2m_a^2} = O(\kappa_a^2)$ in Algorithm 2. If the posterior distribution satisfies the concentration inequality $\mathbb{E}_{\theta\sim\mu_a^{(n)}}\big[\|\theta-\theta^*\|^p\big]^{1/p} \le \frac{\widetilde D}{\sqrt n}$, then for any positive even integer $p$, we have convergence of the SGLD algorithm to the posterior $\mu_a^{(n)}$ in the $W_p$ distance:
\[ W_p\big(\widehat\mu_a^{(n)}, \mu_a^{(n)}\big) \le \frac{\widetilde D}{\sqrt n}, \qquad \forall\,\widetilde D \ge \sqrt{\frac{16\,d_a p}{m_a}}. \]

Proof of Theorem 6. Similar to Theorem 5, we use induction to prove this theorem. After the first pull of arm $a$, we take the same $\frac{640(L_a+L_a)^2}{m_a^2}$ number of steps to converge to $W_p^p\big(\widehat\mu_a^{(1)}, \mu_a^{(1)}\big) \le \widetilde D^p$. Assume that after the $(n-1)$-th pull and before the $n$-th pull of arm $a$, the SGLD algorithm guarantees that $W_p\big(\widehat\mu_a^{(n-1)}, \mu_a^{(n-1)}\big) \le \frac{\widetilde D}{\sqrt{n-1}}$. We prove that after the $n$-th pull and before the $(n+1)$-th pull, it is guaranteed that $W_p\big(\widehat\mu_a^{(n)}, \mu_a^{(n)}\big) \le \frac{\widetilde D}{\sqrt n}$. Following the proof of Theorem 5, we combine the assumed posterior concentration inequality and the induction hypothesis to obtain:
\[ W_p\big(\mu_a^{(n)}, \widehat\mu_a^{(n-1)}\big) \le W_p\big(\mu_a^{(n)}, \mu_a^{(n-1)}\big) + W_p\big(\mu_a^{(n-1)}, \widehat\mu_a^{(n-1)}\big) \le \frac{5\widetilde D}{\sqrt n}. \]
Denote by $U$ the negative log-posterior density over the parameter $\theta$.
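The role of the minibatch size $k$ in controlling the stochastic gradient error can be illustrated numerically. The sketch below is a minimal, self-contained check under an assumed toy setting (a small set of hypothetical scalar per-datapoint gradients, with the subset $S$ drawn without replacement so the expectation can be enumerated exactly); it verifies that the mean-squared error of the estimator $\frac{n}{k}\sum_{j\in S} g_j$ for $\sum_i g_i$ scales as $n^2/k$, the same scaling that appears in the bound on $\Delta^p$ below.

```python
import itertools

def minibatch_mse(grads, k):
    # Exact mean-squared error of the minibatch estimator (n/k) * sum_{j in S} g_j
    # for the full-data sum, with S a uniform size-k subset (without replacement),
    # computed by enumerating every subset.
    n = len(grads)
    full = sum(grads)
    subsets = list(itertools.combinations(range(n), k))
    errs = [((n / k) * sum(grads[j] for j in S) - full) ** 2 for S in subsets]
    return sum(errs) / len(subsets)

# Hypothetical per-datapoint scalar gradients, for illustration only.
grads = [0.3, -1.2, 0.7, 2.1, -0.4, 0.9, -1.8, 0.5]
n = len(grads)
mean = sum(grads) / n
pop_var = sum((g - mean) ** 2 for g in grads) / n

for k in (2, 4):
    mse = minibatch_mse(grads, k)
    # Classical without-replacement variance formula:
    # E[err^2] = n^2 * (pop_var / k) * (n - k) / (n - 1),
    # i.e., the error scales as n / sqrt(k), matching the n^{p/2}/k^{p/2} factor.
    formula = n**2 * (pop_var / k) * (n - k) / (n - 1)
    assert abs(mse - formula) < 1e-9
print("minibatch gradient error matches the n^2/k scaling")
```

In the proof this scaling is sharpened into a bound on all $p$-th moments via sub-Gaussianity of the per-datapoint gradients.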
From Lemma 6, we know that for $\widehat m = n\cdot m_a$ and $\widehat L = n\cdot L_a + L_a$, with initial condition $\mu_0 = \widehat\mu_a^{(n-1)}$, if the difference between the stochastic gradient $\nabla\widehat U$ and the exact one $\nabla U$ is bounded as $\mathbb{E}\big[\|\nabla U(\theta)-\nabla\widehat U(\theta)\|^p\,\big|\,\theta\big] \le \Delta^p$, then
\[ W_p^p\big(\mu_{ih^{(n)}}, \mu_a^{(n)}\big) \le \Big(1 - \frac{\widehat m h^{(n)}}{4}\Big)^{p\cdot i} W_p^p\big(\widehat\mu_a^{(n-1)}, \mu_a^{(n)}\big) + \frac{2^p\,\widehat L^p}{\widehat m^p}\,(d_a p)^{p/2}\big(h^{(n)}\big)^{p/2} + \frac{2^{p+3}\Delta^p}{\widehat m^p}. \]
We demonstrate in the following Lemma 5 that
\[ \Delta^p \le \frac{n^{p/2}}{k^{p/2}}\left(\frac{2\sqrt{d_a p}\,L^*_a}{\sqrt{\nu_a}}\right)^p. \]

Lemma 5. Denote by $\widehat U$ the stochastic estimator of $U$. Then, for a stochastic gradient estimate with $k$ data points,
\[ \mathbb{E}\big[\|\nabla\widehat U(\theta)-\nabla U(\theta)\|^p\,\big|\,\theta\big] \le \frac{n^{p/2}}{k^{p/2}}\left(\frac{2\sqrt{d_a p}\,L^*_a}{\sqrt{\nu_a}}\right)^p. \]

If we take the number of samples in the stochastic gradient estimator to be $k = \frac{32(L^*_a)^2}{m_a\nu_a}$, then
\[ \Delta^p \le 8^{-p/2}\,(n\cdot m_a)^{p/2}\,(p\cdot d_a)^{p/2} \le 2^{-p-3}\,\widehat m^p\,\frac{\widetilde D^p}{n^{p/2}} \quad \text{for any } p \ge
\]
2. Consequently, $\frac{2^{p+3}\Delta^p}{\widehat m^p} \le \frac{\widetilde D^p}{n^{p/2}}$. If we take step size
\[ h^{(n)} = \frac{\widehat m}{16\widehat L^2} \le \min\Big\{\frac{\widehat m}{16\widehat L^2},\ \frac{n\,\widehat m\,\widetilde D^2}{16\widehat L^2\,d_a p}\Big\} \]
and the number of steps taken by the SGLD algorithm between the $(n-1)$-th and the $n$-th pull to be $\widehat N \ge \frac{80}{\widehat m h^{(n)}}$, then
\[ W_p^p\big(\widehat\mu_a^{(n)}, \mu_a^{(n)}\big) = W_p^p\big(\mu_{\widehat N h^{(n)}}, \mu_a^{(n)}\big) \le \Big(1 - \frac{\widehat m h^{(n)}}{4}\Big)^{p\cdot\widehat N}\,\frac{5^p\,\widetilde D^p}{n^{p/2}} + \frac{2^p\,\widehat L^p}{\widehat m^p}\,(d_a p)^{p/2}\big(h^{(n)}\big)^{p/2} + \frac{2^{p+3}\Delta^p}{\widehat m^p} \le \frac{\widetilde D^p}{n^{p/2}}, \]
leading to the result that $W_p\big(\widehat\mu_a^{(n)}, \mu_a^{(n)}\big) \le \frac{\widetilde D}{\sqrt n}$. Since at least one round will have passed between the $(n-1)$-th and the $n$-th pull of arm $a$, taking the number of steps in each round $t$ to be $N = \frac{80}{\widehat m h^{(n)}}$ suffices. Therefore, $N = \frac{1280(L_a+nL_a)^2}{n^2m_a^2} = O\Big(\frac{L_a^2}{m_a^2}\Big)$.

Proof of Lemma 5.
We first develop the expression:
\[ \mathbb{E}\big[\|\nabla U(\theta)-\nabla\widehat U(\theta)\|^p\big] = n^p\,\mathbb{E}\Bigg\|\frac1n\sum_{i=1}^n \nabla\log p(x_i\,|\,\theta_a) - \frac1k\sum_{j=1}^k \nabla\log p(x_j\,|\,\theta_a)\Bigg\|^p = \frac{n^p}{k^p}\,\mathbb{E}\Bigg\|\sum_{j=1}^k\Big(\frac1n\sum_{i=1}^n \nabla\log p(x_i\,|\,\theta_a) - \nabla\log p(x_j\,|\,\theta_a)\Big)\Bigg\|^p. \]
We note that
\[ \nabla\log p(x_j\,|\,\theta_a) - \frac1n\sum_{i=1}^n \nabla\log p(x_i\,|\,\theta_a) = \frac1n\sum_{i\ne j}\big(\nabla\log p(x_j\,|\,\theta_a) - \nabla\log p(x_i\,|\,\theta_a)\big). \]
By the joint Lipschitz smoothness Assumption 4, we know that $\nabla\log p(x\,|\,\theta_a)$ is a Lipschitz function of $x$:
\[ \big\|\nabla\log p(x_j\,|\,\theta_a) - \nabla\log p(x_i\,|\,\theta_a)\big\| \le L^*_a\|x_j-x_i\|. \]
On the other hand, the data $x$ follow the true distribution $p(x;\theta^*)$, which by Assumption 2 is $\nu_a$-strongly log-concave. Applying Theorem 3.
16 in [Wainwright, 2019], we obtain that $\big(\nabla\log p(x_j\,|\,\theta_a) - \nabla\log p(x_i\,|\,\theta_a)\big)$ is $\frac{\sqrt2\,L^*_a}{\sqrt{\nu_a}}$-sub-Gaussian. Leveraging the Azuma–Hoeffding inequality for martingale difference sequences [Wainwright, 2019], we obtain that the sum of the $(n-$
1) sub-Gaussian random variables,
\[ \Big(\nabla\log p(x_j\,|\,\theta_a) - \frac1n\sum_{i=1}^n \nabla\log p(x_i\,|\,\theta_a)\Big), \]
is $\frac{\sqrt{2(n-1)}\,L^*_a}{n\sqrt{\nu_a}}$-sub-Gaussian. In the same vein, $\sum_{j=1}^k\big(\frac1n\sum_{i=1}^n \nabla\log p(x_i\,|\,\theta_a) - \nabla\log p(x_j\,|\,\theta_a)\big)$ is $\frac{\sqrt{2k(n-1)}\,L^*_a}{n\sqrt{\nu_a}}$-sub-Gaussian. We then invoke the norm-sub-Gaussianity of this vector to bound its $p$-th moment:
\[ \mathbb{E}\Bigg\|\sum_{j=1}^k\Big(\frac1n\sum_{i=1}^n \nabla\log p(x_i\,|\,\theta_a) - \nabla\log p(x_j\,|\,\theta_a)\Big)\Bigg\|^p \le \left(\frac{2\sqrt{d_a k(n-1)p}\,L^*_a}{\sqrt e\,n\sqrt{\nu_a}}\right)^p. \]
Therefore,
\[ \mathbb{E}\big[\|\nabla U(\theta)-\nabla\widehat U(\theta)\|^p\big] = \frac{n^p}{k^p}\,\mathbb{E}\Bigg\|\sum_{j=1}^k\Big(\frac1n\sum_{i=1}^n \nabla\log p(x_i\,|\,\theta_a) - \nabla\log p(x_j\,|\,\theta_a)\Big)\Bigg\|^p \le \frac{n^{p/2}}{k^{p/2}}\left(\frac{2\sqrt{d_a p}\,L^*_a}{\sqrt e\sqrt{\nu_a}}\right)^p \le \frac{n^{p/2}}{k^{p/2}}\left(\frac{2\sqrt{d_a p}\,L^*_a}{\sqrt{\nu_a}}\right)^p. \]

D.3 Convergence of the (Stochastic Gradient) Langevin Algorithm within Each Round
In this section, we examine convergence of the (stochastic gradient) Langevin algorithm to the posterior distribution of the $a$-th arm at the $n$-th round. Since only the $a$-th arm and the $n$-th round are considered, we drop these two indices from the notation whenever convenient. We also define some notation that will only be used within this subsection. In particular, we focus on the parameter $\theta$ and denote the posterior measure $\mathrm{d}\mu_a^{(n)}(\theta) = \mathrm{d}\mu^*(\theta) = \exp(-U(\theta))\,\mathrm{d}\theta$ as the target distribution.

Symbol: Meaning
- $\mu^*$: posterior distribution $\mu_a^{(n)}$
- $U$: potential (i.e., negative log-posterior density)
- $\theta^*_U$: minimum of the potential $U$ (or mode of the posterior $\mu^*$)
- $\theta_t$: interpolation between $\theta_{ih^{(n)}}$ and $\theta_{(i+1)h^{(n)}}$, for $t\in[ih^{(n)}, (i+1)h^{(n)}]$
- $\mu_t$: measure associated with $\theta_t$
- $\theta^*_t$: an auxiliary stochastic process with initial distribution $\mu^*$ that follows dynamics (17)
- $\widehat m$: strong convexity of the potential $U$, equal to $n m_a$
- $\widehat L$: Lipschitz smoothness of the potential $U$, equal to $nL_a + L_a$

We also formally define the Wasserstein-$p$ distance used in the main text. Given a pair of distributions $\mu$ and $\nu$ on $\mathbb{R}^d$, a coupling $\gamma$ is a joint distribution over the product space $\mathbb{R}^d\times\mathbb{R}^d$ that has $\mu$ and $\nu$ as its marginal distributions. We let $\Gamma(\mu,\nu)$ denote the space of all possible couplings of $\mu$ and $\nu$. With this notation, the Wasserstein-$p$ distance is given by
\[ W_p(\mu,\nu) = \left(\inf_{\gamma\in\Gamma(\mu,\nu)} \int_{\mathbb{R}^d\times\mathbb{R}^d} \|x-y\|^p\,\mathrm{d}\gamma(x,y)\right)^{1/p}. \tag{13} \]
We use the following (stochastic gradient) Langevin algorithm to generate approximate samples from the posterior distribution $\mu_a^{(n)}(\theta)$ at the $n$-th round. For $i = 0,\dots,T$,
\[ \theta_{(i+1)h^{(n)}} \sim \mathcal{N}\Big(\theta_{ih^{(n)}} - h^{(n)}\nabla\widehat U\big(\theta_{ih^{(n)}}\big),\; 2h^{(n)} I\Big), \tag{14} \]
where $\nabla\widehat U(\theta_{ih^{(n)}})$ is a stochastic estimate of $\nabla U(\theta_{ih^{(n)}})$. We prove in the following Lemma 6 the convergence of this algorithm within the $n$-th round.

Lemma 6.
Assume that the potential $U$ is $\widehat m$-strongly convex and $\widehat L$-Lipschitz smooth. Further assume that the $p$-th moment of the difference between the true gradient and the stochastic one satisfies:
\[ \mathbb{E}\big[\|\nabla U(\theta_{ih^{(n)}}) - \nabla\widehat U(\theta_{ih^{(n)}})\|^p\,\big|\,\theta_{ih^{(n)}}\big] \le \Delta^p. \]
Then at the $i$-th step, for $\mu_{ih^{(n)}}$ following the (stochastic gradient) Langevin algorithm with $h^{(n)} \le \frac{\widehat m}{16\widehat L^2}$,
\[ W_p^p\big(\mu_{ih^{(n)}}, \mu^*\big) \le \Big(1 - \frac{\widehat m h^{(n)}}{4}\Big)^{p\cdot i} W_p^p(\mu_0,\mu^*) + \frac{2^p\,\widehat L^p}{\widehat m^p}\,(dp)^{p/2}\big(h^{(n)}\big)^{p/2} + \frac{2^{p+3}\Delta^p}{\widehat m^p}. \tag{15} \]

Remark 2.
When $\Delta^p = 0$, Lemma 6 provides the convergence rate of the unadjusted Langevin algorithm (ULA) with exact gradients.

Proof of Lemma 6. We first interpolate a continuous-time stochastic process, $\theta_t$, between $\theta_{ih^{(n)}}$ and $\theta_{(i+1)h^{(n)}}$. For $t\in[ih^{(n)}, (i+1)h^{(n)}]$,
\[ \mathrm{d}\theta_t = -\nabla\widehat U\big(\theta_{ih^{(n)}}\big)\,\mathrm{d}t + \sqrt2\,\mathrm{d}B_t, \tag{16} \]
where $B_t$ is a standard Brownian motion. This process connects $\theta_{ih^{(n)}}$ and $\theta_{(i+1)h^{(n)}}$ and approximates the following stochastic differential equation, which leaves the exact posterior distribution invariant:
\[ \mathrm{d}\theta^*_t = -\nabla U(\theta^*_t)\,\mathrm{d}t + \sqrt2\,\mathrm{d}B_t. \tag{17} \]
For a $\theta^*_t$ initialized from $\mu^*$ and following equation (17), $\theta^*_t$ will always have distribution $\mu^*$. We therefore design a coupling between the two processes $\theta_t$ and $\theta^*_t$, where $\theta_t$ follows equation (16) (and thereby interpolates Algorithm 2) and $\theta^*_t$ initializes from $\mu^*$ and follows equation (17) (and thereby preserves $\mu^*$). By studying the difference between the two processes, we will obtain the convergence rate in terms of the Wasserstein-$p$ distance. For $t = ih^{(n)}$, we let $\theta_{ih^{(n)}}$ couple optimally with $\theta^*_{ih^{(n)}}$, so that for $\big(\theta_{ih^{(n)}}, \theta^*_{ih^{(n)}}\big) \sim \gamma^* \in \Gamma_{\mathrm{opt}}\big(\mu_{ih^{(n)}}, \mu^*_{ih^{(n)}}\big)$, we have $\mathbb{E}\big[\|\theta_{ih^{(n)}} - \theta^*_{ih^{(n)}}\|^p\big] = W_p^p\big(\mu_{ih^{(n)}}, \mu^*\big)$. For $t\in[ih^{(n)}, (i+1)h^{(n)}]$, we choose a synchronous coupling $\bar\gamma\big(\theta_t, \theta^*_t\,\big|\,\theta_{ih^{(n)}}, \theta^*_{ih^{(n)}}\big) \in \Gamma\big(\mu_t(\theta_t\,|\,\theta_{ih^{(n)}}),\ \mu^*_t(\theta^*_t\,|\,\theta^*_{ih^{(n)}})\big)$ for the laws of $\theta_t$ and $\theta^*_t$. (A synchronous coupling simply means that we use the same Brownian motion $B_t$ in defining $\theta_t$ and $\theta^*_t$.)
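The contraction produced by a synchronous coupling can be seen in a minimal numerical sketch. The example below assumes an illustrative one-dimensional quadratic potential $U(\theta) = \frac m2\theta^2$ (so $U$ is $m$-strongly convex); two discretized Langevin chains are driven by the same Gaussian increments, and the shared noise cancels in their difference, which then contracts deterministically at rate $(1-hm)$ per step, mirroring the $e^{-\widehat m t}$-type contraction used below.

```python
import random, math

# Synchronous coupling: two discretized Langevin chains sharing the SAME
# Brownian increments, for the illustrative quadratic potential U = m/2 * x^2.
m, h = 1.0, 0.05                     # curvature and a small step size
grad_U = lambda th: m * th

rng = random.Random(0)
theta, theta_star = 5.0, -3.0        # two different initializations
gap0 = abs(theta - theta_star)
steps = 200
for _ in range(steps):
    noise = math.sqrt(2 * h) * rng.gauss(0.0, 1.0)  # shared noise = the coupling
    theta = theta - h * grad_U(theta) + noise
    theta_star = theta_star - h * grad_U(theta_star) + noise

# The shared noise cancels in the difference, which contracts exactly by
# (1 - h*m) each step for this quadratic potential.
gap = abs(theta - theta_star)
assert abs(gap - gap0 * (1 - h * m) ** steps) < 1e-9
assert gap < 1e-3 * gap0
print("synchronous coupling contracts the chains")
```

For a general strongly convex potential the cancellation is not exact, but strong convexity still forces the coupled difference to shrink, which is the mechanism exploited in the derivation that follows.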
We then obtain that for any pair $(\theta_t,\theta^*_t)\sim\bar\gamma$,
\[ \frac{\mathrm{d}\|\theta_t-\theta^*_t\|^p}{\mathrm{d}t} = p\,\|\theta_t-\theta^*_t\|^{p-2}\Big\langle \theta_t-\theta^*_t,\ \frac{\mathrm{d}\theta_t}{\mathrm{d}t} - \frac{\mathrm{d}\theta^*_t}{\mathrm{d}t}\Big\rangle = p\,\|\theta_t-\theta^*_t\|^{p-2}\big\langle \theta_t-\theta^*_t,\ -\nabla U(\theta_t) + \nabla U(\theta^*_t)\big\rangle + p\,\|\theta_t-\theta^*_t\|^{p-2}\big\langle \theta_t-\theta^*_t,\ \nabla U(\theta_t) - \nabla\widehat U(\theta_{ih^{(n)}})\big\rangle \]
\[ \le -p\,\widehat m\,\|\theta_t-\theta^*_t\|^p + p\,\|\theta_t-\theta^*_t\|^{p-1}\,\big\|\nabla U(\theta_t) - \nabla\widehat U(\theta_{ih^{(n)}})\big\| \tag{18} \]
\[ \le -p\,\widehat m\,\|\theta_t-\theta^*_t\|^p + \frac{p\,\widehat m}{2}\,\|\theta_t-\theta^*_t\|^p + \Big(\frac{2(p-1)}{p\,\widehat m}\Big)^{p-1}\big\|\nabla U(\theta_t) - \nabla\widehat U(\theta_{ih^{(n)}})\big\|^p \tag{20} \]
\[ \le -\frac{p\,\widehat m}{2}\,\|\theta_t-\theta^*_t\|^p + \frac{2^{p-1}}{\widehat m^{p-1}}\,\big\|\nabla U(\theta_t) - \nabla\widehat U(\theta_{ih^{(n)}})\big\|^p, \tag{21} \]
where equation (20) follows from Young's inequality. Equivalently, we can obtain
\[ \frac{\mathrm{d}}{\mathrm{d}t}\Big(e^{p\widehat m t/2}\,\|\theta_t-\theta^*_t\|^p\Big) \le e^{p\widehat m t/2}\,\frac{2^{p-1}}{\widehat m^{p-1}}\,\big\|\nabla U(\theta_t) - \nabla\widehat U(\theta_{ih^{(n)}})\big\|^p. \]
By the fundamental theorem of calculus,
\[ \|\theta_t-\theta^*_t\|^p \le e^{-p\widehat m(t-ih^{(n)})/2}\,\big\|\theta_{ih^{(n)}}-\theta^*_{ih^{(n)}}\big\|^p + \frac{2^{p-1}}{\widehat m^{p-1}}\int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\big\|\nabla U(\theta_s) - \nabla\widehat U(\theta_{ih^{(n)}})\big\|^p\,\mathrm{d}s. \]
(22)

Taking expectations on both sides, we obtain that
\[ \mathbb{E}\big[\|\theta_t-\theta^*_t\|^p\big] = \mathbb{E}\Big[\mathbb{E}\big[\|\theta_t-\theta^*_t\|^p\,\big|\,\theta_{ih^{(n)}},\theta^*_{ih^{(n)}}\big]\Big] \le e^{-p\widehat m(t-ih^{(n)})/2}\,\mathbb{E}\big[\|\theta_{ih^{(n)}}-\theta^*_{ih^{(n)}}\|^p\big] + \frac{2^{p-1}}{\widehat m^{p-1}}\int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\mathbb{E}\big[\|\nabla U(\theta_s)-\nabla\widehat U(\theta_{ih^{(n)}})\|^p\big]\,\mathrm{d}s. \tag{23} \]
In the above expression, the integral and expectation are exchanged using Tonelli's theorem, since $\|\nabla U(\theta_s)-\nabla\widehat U(\theta_{ih^{(n)}})\|^p$ is positive and measurable. We further expand the expected error $\mathbb{E}\big[\|\nabla U(\theta_s)-\nabla\widehat U(\theta_{ih^{(n)}})\|^p\big]$:
\[ \mathbb{E}\big[\|\nabla U(\theta_s)-\nabla\widehat U(\theta_{ih^{(n)}})\|^p\big] = \mathbb{E}\big[\|\nabla U(\theta_s)-\nabla U(\theta_{ih^{(n)}}) + \nabla U(\theta_{ih^{(n)}})-\nabla\widehat U(\theta_{ih^{(n)}})\|^p\big] \]
\[ \le 2^{p-1}\,\mathbb{E}\big[\|\nabla U(\theta_s)-\nabla U(\theta_{ih^{(n)}})\|^p\big] + 2^{p-1}\,\mathbb{E}\Big[\mathbb{E}\big[\|\nabla U(\theta_{ih^{(n)}})-\nabla\widehat U(\theta_{ih^{(n)}})\|^p\,\big|\,\theta_{ih^{(n)}}\big]\Big] \le 2^{p-1}\,\widehat L^p\,\mathbb{E}\big[\|\theta_s-\theta_{ih^{(n)}}\|^p\big] + 2^{p-1}\Delta^p. \]
(24)

Plugging this into equation (23), we have that
\[ \mathbb{E}\big[\|\theta_t-\theta^*_t\|^p\big] \le e^{-p\widehat m(t-ih^{(n)})/2}\,\mathbb{E}\big[\|\theta_{ih^{(n)}}-\theta^*_{ih^{(n)}}\|^p\big] + \frac{2^{2p-2}\,\widehat L^p}{\widehat m^{p-1}}\int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\mathbb{E}\big[\|\theta_s-\theta_{ih^{(n)}}\|^p\big]\,\mathrm{d}s + \frac{2^{2p-2}\,(t-ih^{(n)})\,\Delta^p}{\widehat m^{p-1}}. \tag{25} \]
We provide an upper bound for $\int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\mathbb{E}\big[\|\theta_s-\theta_{ih^{(n)}}\|^p\big]\,\mathrm{d}s$ in the following lemma.

Lemma 7.
For $h^{(n)} \le \frac{\widehat m}{16\widehat L^2}$, and for $t\in[ih^{(n)}, (i+1)h^{(n)}]$,
\[ \int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\mathbb{E}\big[\|\theta_s-\theta_{ih^{(n)}}\|^p\big]\,\mathrm{d}s \le 2^{3p-3}\,\widehat L^p\,\big(t-ih^{(n)}\big)^{p+1}\,W_p^p\big(\mu_{ih^{(n)}},\mu^*\big) + 8^p\,\big(t-ih^{(n)}\big)^{p/2+1}(dp)^{p/2} + 2^{2p-2}\,\big(t-ih^{(n)}\big)^{p+1}\Delta^p. \tag{26} \]

Applying this upper bound to equation (25), we obtain that for $h^{(n)} \le \frac{\widehat m}{16\widehat L^2}$ and for $t\in[ih^{(n)}, (i+1)h^{(n)}]$,
\[ \mathbb{E}\big[\|\theta_t-\theta^*_t\|^p\big] \le e^{-p\widehat m(t-ih^{(n)})/2}\,\mathbb{E}\big[\|\theta_{ih^{(n)}}-\theta^*_{ih^{(n)}}\|^p\big] + \frac{2^{5p-5}\,\widehat L^{2p}}{\widehat m^{p-1}}\big(t-ih^{(n)}\big)^{p+1}\,W_p^p\big(\mu_{ih^{(n)}},\mu^*\big) + \frac{2^{2p-2}\cdot 8^p\,\widehat L^p}{\widehat m^{p-1}}\big(t-ih^{(n)}\big)^{p/2+1}(dp)^{p/2} + \frac{2^{4p-4}\,\widehat L^p}{\widehat m^{p-1}}\big(t-ih^{(n)}\big)^{p+1}\Delta^p + \frac{2^{2p-2}\,(t-ih^{(n)})\,\Delta^p}{\widehat m^{p-1}} \]
\[ \le \Big(1 - \frac{\widehat m\,(t-ih^{(n)})}{4}\Big)^{p}\,\mathbb{E}\big[\|\theta_{ih^{(n)}}-\theta^*_{ih^{(n)}}\|^p\big] + \frac{2^{5p-5}\,\widehat L^{2p}}{\widehat m^{p-1}}\big(t-ih^{(n)}\big)^{p+1}\,W_p^p\big(\mu_{ih^{(n)}},\mu^*\big) + \frac{2^{2p-2}\cdot 8^p\,\widehat L^p}{\widehat m^{p-1}}\big(t-ih^{(n)}\big)^{p/2+1}(dp)^{p/2} + \frac{2^{2p-1}\,(t-ih^{(n)})\,\Delta^p}{\widehat m^{p-1}}. \]
Recognizing that $\widehat\gamma(\theta_t,\theta^*_t) = \mathbb{E}_{(\theta_{ih^{(n)}},\theta^*_{ih^{(n)}})\sim\gamma^*}\big[\bar\gamma\big(\theta_t,\theta^*_t\,\big|\,\theta_{ih^{(n)}},\theta^*_{ih^{(n)}}\big)\big]$ is a coupling, we achieve the upper bound for $W_p^p(\mu_t,\mu^*)$:
\[ W_p^p(\mu_t,\mu^*) \le \mathbb{E}_{(\theta_t,\theta^*_t)\sim\widehat\gamma}\big[\|\theta_t-\theta^*_t\|^p\big] \le \Big(1 - \frac{\widehat m\,(t-ih^{(n)})}{4}\Big)^{p}\,W_p^p\big(\mu_{ih^{(n)}},\mu^*\big) + \frac{2^{2p-2}\cdot 8^p\,\widehat L^p}{\widehat m^{p-1}}\big(t-ih^{(n)}\big)^{p/2+1}(dp)^{p/2} + \frac{2^{2p-1}\,(t-ih^{(n)})\,\Delta^p}{\widehat m^{p-1}}, \tag{27} \]
where the two terms involving $W_p^p(\mu_{ih^{(n)}},\mu^*)$ have been merged using $\mathbb{E}_{(\theta_{ih^{(n)}},\theta^*_{ih^{(n)}})\sim\gamma^*}\big[\|\theta_{ih^{(n)}}-\theta^*_{ih^{(n)}}\|^p\big] = W_p^p(\mu_{ih^{(n)}},\mu^*)$ and the constraint $h^{(n)} \le \frac{\widehat m}{16\widehat L^2}$. Taking $t = (i+1)h^{(n)}$, the recursive bound reads
\[ W_p^p\big(\mu_{(i+1)h^{(n)}},\mu^*\big) \le \Big(1 - \frac{\widehat m h^{(n)}}{4}\Big)^{p}\,W_p^p\big(\mu_{ih^{(n)}},\mu^*\big) + \frac{2^{2p-2}\cdot 8^p\,\widehat L^p}{\widehat m^{p-1}}\,(dp)^{p/2}\big(h^{(n)}\big)^{p/2+1} + \frac{2^{2p-1}\,h^{(n)}\,\Delta^p}{\widehat m^{p-1}}. \]
We finish the proof by invoking the recursion $i$ times:
\[ W_p^p\big(\mu_{ih^{(n)}},\mu^*\big) \le \Big(1 - \frac{\widehat m h^{(n)}}{4}\Big)^{p\cdot i}\,W_p^p(\mu_0,\mu^*) + \sum_{k=0}^{i-1}\Big(1 - \frac{\widehat m h^{(n)}}{4}\Big)^{p\cdot k}\left(\frac{2^{2p-2}\cdot 8^p\,\widehat L^p}{\widehat m^{p-1}}\,(dp)^{p/2}\big(h^{(n)}\big)^{p/2+1} + \frac{2^{2p-1}\,h^{(n)}\,\Delta^p}{\widehat m^{p-1}}\right) \]
\[ \le \Big(1 - \frac{\widehat m h^{(n)}}{4}\Big)^{p\cdot i}\,W_p^p(\mu_0,\mu^*) + \frac{2^p\,\widehat L^p}{\widehat m^p}\,(dp)^{p/2}\big(h^{(n)}\big)^{p/2} + \frac{2^{p+3}\Delta^p}{\widehat m^p}. \tag{29} \]

D.3.1 Supporting proofs for Lemma 6

Proof of Lemma 7.
We use the update rule of ULA to develop $\int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\mathbb{E}\big[\|\theta_s-\theta_{ih^{(n)}}\|^p\big]\,\mathrm{d}s$:
\[ \int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\mathbb{E}\big[\|\theta_s-\theta_{ih^{(n)}}\|^p\big]\,\mathrm{d}s = \int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\mathbb{E}\Big[\big\|-(s-ih^{(n)})\big(\nabla U(\theta_{ih^{(n)}}) - \big(\nabla U(\theta_{ih^{(n)}})-\nabla\widehat U(\theta_{ih^{(n)}})\big)\big) + \sqrt2\,\big(B_s-B_{ih^{(n)}}\big)\big\|^p\Big]\,\mathrm{d}s \]
\[ \le 2^{2p-2}\,\big(t-ih^{(n)}\big)^p \int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\mathbb{E}\big[\|\nabla U(\theta_{ih^{(n)}})\|^p\big]\,\mathrm{d}s + 2^{3p/2-1}\int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\mathbb{E}\big[\|B_s-B_{ih^{(n)}}\|^p\big]\,\mathrm{d}s + 2^{2p-2}\,\big(t-ih^{(n)}\big)^p \int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\mathbb{E}\big[\|\nabla U(\theta_{ih^{(n)}})-\nabla\widehat U(\theta_{ih^{(n)}})\|^p\big]\,\mathrm{d}s \]
\[ \le 2^{2p-2}\,\widehat L^p\,\big(t-ih^{(n)}\big)^{p+1}\,\mathbb{E}\big[\|\theta_{ih^{(n)}}-\theta^*_U\|^p\big] + 2^{3p/2-1}\int_{ih^{(n)}}^t \mathbb{E}\big[\|B_s-B_{ih^{(n)}}\|^p\big]\,\mathrm{d}s + 2^{2p-2}\,\big(t-ih^{(n)}\big)^{p+1}\Delta^p, \tag{30} \]
where $\theta^*_U$ is the minimizer of $U$ (so that $\nabla U(\theta^*_U) = 0$ and hence $\|\nabla U(\theta_{ih^{(n)}})\| \le \widehat L\,\|\theta_{ih^{(n)}}-\theta^*_U\|$). We then use the following lemma to simplify the above expression.

Lemma 8.
The integrated $p$-th moment of the Brownian motion can be bounded as:
\[ \int_{ih^{(n)}}^t \mathbb{E}\big[\|B_s-B_{ih^{(n)}}\|^p\big]\,\mathrm{d}s \le \Big(\frac{3dp}{e}\Big)^{p/2}\big(t-ih^{(n)}\big)^{p/2+1}. \tag{31} \]
We also provide a bound for the $p$-th moment of $\|\theta_{ih^{(n)}}-\theta^*_U\|$.

Lemma 9.
For $\theta_{ih^{(n)}} \sim \mu_{ih^{(n)}}$,
\[ \mathbb{E}\big[\|\theta_{ih^{(n)}}-\theta^*_U\|^p\big] \le 2^{p-1}\,W_p^p\big(\mu_{ih^{(n)}},\mu^*\big) + 10^p\Big(\frac{dp}{\widehat m}\Big)^{p/2}. \tag{32} \]
Plugging these results into equation (30), we obtain that for $h^{(n)} \le \frac{\widehat m}{16\widehat L^2}$ and for $t\in[ih^{(n)}, (i+1)h^{(n)}]$,
\[ \int_{ih^{(n)}}^t e^{-p\widehat m(t-s)/2}\,\mathbb{E}\big[\|\theta_s-\theta_{ih^{(n)}}\|^p\big]\,\mathrm{d}s \le 2^{3p-3}\,\widehat L^p\,\big(t-ih^{(n)}\big)^{p+1}\,W_p^p\big(\mu_{ih^{(n)}},\mu^*\big) + 40^p\,\widehat L^p\,\big(t-ih^{(n)}\big)^{p+1}\Big(\frac{dp}{\widehat m}\Big)^{p/2} + 2^{3p/2-1}\Big(\frac3e\Big)^{p/2}(dp)^{p/2}\big(t-ih^{(n)}\big)^{p/2+1} + 2^{2p-2}\,\big(t-ih^{(n)}\big)^{p+1}\Delta^p \]
\[ \le 2^{3p-3}\,\widehat L^p\,\big(t-ih^{(n)}\big)^{p+1}\,W_p^p\big(\mu_{ih^{(n)}},\mu^*\big) + 8^p\,\big(t-ih^{(n)}\big)^{p/2+1}(dp)^{p/2} + 2^{2p-2}\,\big(t-ih^{(n)}\big)^{p+1}\Delta^p. \tag{33} \]

Proof of Lemma 8.
The Brownian motion term can be upper bounded by higher moments of a normal random variable:
\[ \int_{ih^{(n)}}^t \mathbb{E}\big[\|B_s-B_{ih^{(n)}}\|^p\big]\,\mathrm{d}s \le \big(t-ih^{(n)}\big)\,\mathbb{E}\big[\|B_t-B_{ih^{(n)}}\|^p\big] = \big(t-ih^{(n)}\big)^{p/2+1}\,\mathbb{E}\big[\|v\|^p\big], \]
where $v$ is a standard $d$-dimensional normal random variable. We then invoke the $\sqrt d$-norm-sub-Gaussianity of $v$ and have (assuming $p$ to be an even integer):
\[ \mathbb{E}\big[\|v\|^p\big] \le \frac{p!}{2^{p/2}\,(p/2)!}\,(2d)^{p/2} \le \sqrt2\,e^{1/(12p)}\Big(\frac pe\Big)^{p/2}(2d)^{p/2} \le \Big(\frac{3dp}{e}\Big)^{p/2}, \]
where the middle step uses Stirling's approximation.

Proof of Lemma 9.
For the $\mathbb{E}\big[\|\theta_{ih^{(n)}}-\theta^*_U\|^p\big]$ term, we note that any coupling of a distribution with a delta measure is their product measure. Therefore, $\mathbb{E}\big[\|\theta_{ih^{(n)}}-\theta^*_U\|^p\big]$ equals the $p$-th power of the $p$-Wasserstein distance between $\mu_{ih^{(n)}}$ and the delta measure $\delta(\theta^*_U)$:
\[ \mathbb{E}\big[\|\theta_{ih^{(n)}}-\theta^*_U\|^p\big] = W_p^p\big(\mu_{ih^{(n)}},\delta(\theta^*_U)\big) \le \Big(W_p\big(\mu_{ih^{(n)}},\mu^*\big) + W_p\big(\mu^*,\delta(\theta^*_U)\big)\Big)^p \le 2^{p-1}\,W_p^p\big(\mu_{ih^{(n)}},\mu^*\big) + 2^{p-1}\,W_p^p\big(\mu^*,\delta(\theta^*_U)\big). \]
We then bound $W_p^p\big(\mu^*,\delta(\theta^*_U)\big)$ in the following lemma.

Lemma 10.
Assume the posterior $\mu^*$ is $\widehat m$-strongly log-concave. Then for $\theta^*_U = \arg\max \mu^*$,
\[ W_p^p\big(\mu^*,\delta(\theta^*_U)\big) \le 5^p\Big(\frac{dp}{\widehat m}\Big)^{p/2}. \tag{34} \]
Therefore,
\[ \mathbb{E}\big[\|\theta_{ih^{(n)}}-\theta^*_U\|^p\big] \le 2^{p-1}\,W_p^p\big(\mu_{ih^{(n)}},\mu^*\big) + 10^p\Big(\frac{dp}{\widehat m}\Big)^{p/2}. \]

Proof of Lemma 10.
We first decompose $W_p\big(\mu^*,\delta(\theta^*_U)\big)$ into two terms:
\[ W_p\big(\mu^*,\delta(\theta^*_U)\big) \le W_p\big(\mu^*,\delta(\mathbb{E}_{\theta\sim\mu^*}[\theta])\big) + \big\|\theta^*_U - \mathbb{E}_{\theta\sim\mu^*}[\theta]\big\|. \]
By the celebrated relation between the mean and the mode of 1-unimodal distributions [see, e.g., Basu and DasGupta, 1996, Theorem 7], we can first bound the difference between mean and mode:
\[ \big(\theta^*_U - \mathbb{E}_{\theta\sim\mu^*}[\theta]\big)^\top \Sigma^{-1}\big(\theta^*_U - \mathbb{E}_{\theta\sim\mu^*}[\theta]\big) \le 3, \]
where $\Sigma$ is the covariance matrix of $\mu^*$. Therefore,
\[ \big\|\theta^*_U - \mathbb{E}_{\theta\sim\mu^*}[\theta]\big\| \le \sqrt{\frac{3}{\widehat m}}. \tag{35} \]
We then bound $W_p\big(\mu^*,\delta(\mathbb{E}_{\theta\sim\mu^*}[\theta])\big)$. Since the only coupling between $\mu^*$ and the delta measure $\delta(\mathbb{E}_{\theta\sim\mu^*}[\theta])$ is their product measure, the $p$-Wasserstein distance is given directly by the $p$-th moment of $\mu^*$:
\[ W_p^p\big(\mu^*,\delta(\mathbb{E}_{\theta\sim\mu^*}[\theta])\big) = \int \big\|\theta - \mathbb{E}_{\theta\sim\mu^*}[\theta]\big\|^p\,\mathrm{d}\mu^*(\theta). \]
We invoke the Herbst argument [see, e.g., Ledoux, 1999] to obtain the $p$-th moment bound. We first note that an $\widehat m$-strongly log-concave distribution has a log-Sobolev constant of $\widehat m$. Then, using the Herbst argument, we know that $\theta\sim\mu^*$ is a sub-Gaussian random vector with parameter $\sigma^2 = \frac{1}{\widehat m}$:
\[ \int e^{\lambda u^\top(\theta - \mathbb{E}_{\theta\sim\mu^*}[\theta])}\,\mathrm{d}\mu^*(\theta) \le e^{\frac{\lambda^2}{2\widehat m}}, \qquad \forall\,\|u\| = 1. \]
Hence $\theta$ is $2\sqrt{d/\widehat m}$ norm-sub-Gaussian, which implies that
\[ \Big(\mathbb{E}_{\theta\sim\mu^*}\big[\|\theta - \mathbb{E}_{\theta\sim\mu^*}[\theta]\|^p\big]\Big)^{1/p} \le 2e^{1/e}\sqrt{\frac{dp}{\widehat m}}. \tag{36} \]
Combining equations (35) and (36), we obtain the final result that
\[ W_p^p\big(\mu^*,\delta(\theta^*_U)\big) \le \left(2e^{1/e}\sqrt{\frac{dp}{\widehat m}} + \sqrt{\frac{3}{\widehat m}}\right)^p \le 5^p\Big(\frac{dp}{\widehat m}\Big)^{p/2}. \]

Lemma 11.
Assume that the likelihood log p a ( x ; θ ) , prior distribution, and true distributions satisfy As-sumptions 1-3, and that arm a has been chosen n = T a ( t ) times up to iteration t of the Thompson samplingalgorithm. Further, assume that we choose the stepsize step size h ( n ) = m a n ( L a + n L a ) = O (cid:16) m a nL a (cid:17) , andnumber of steps N = 640 ( L a + n L a ) m a = O (cid:16) L a m a (cid:17) in Algorithm 2 then: P θ a,t ∼ ¯ µ ( n ) a [ γ a ] (cid:32) (cid:107) θ a,t − θ ∗ a (cid:107) > (cid:115) em a n (cid:18) d a + log B a + 2 σ log 1 /δ + 2 (cid:18) σ a + m a d a L a γ a (cid:19) log 1 /δ (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) Z n − (cid:33) < δ . where Z t − = {(cid:107) θ a,t − − θ ∗ a (cid:107) ≤ C ( n ) } for: C ( n ) = (cid:114) enm a ( d a + log B a + 2 σ log 1 /δ ) ,σ = 16 + d a L a ν a m a , and where θ a,t − is the sample from the previous round of the Thompson samplingalgorithm for arm a .Proof. We begin as in the proof of Theorem 3, except that we now take µ = δ θ a,t − , where θ a,t − is thesample from the previous step of the algorithm: W pp (cid:16) µ ih ( n ) , µ ( n ) a (cid:17) ≤ (cid:18) − (cid:98) m h ( n ) (cid:19) p · i W pp (cid:16) δ ( θ a,t − ) , µ ( n ) a (cid:17) + 80 p (cid:98) L p (cid:98) m p ( dp ) p/ (cid:16) h ( n ) (cid:17) p/ . We first use the triangle inequality on the first term on the RHS: W p (cid:16) δ ( θ a,t − ) , µ ( n ) a (cid:17) ≤ W p (cid:0) δ ( θ a,t − ) , δ θ ∗ a (cid:1) + W p (cid:16) δ ( θ ∗ a ) , µ ( n ) a (cid:17) = (cid:107) θ ∗ a − θ a,t − (cid:107) + + W p (cid:16) δ ( θ ∗ a ) , µ ( n ) a (cid:17) ≤ C ( n ) + ˜ D √ n where we have used the fact that (cid:107) θ ∗ a − θ a,t − (cid:107) ≤ C ( n ) by assumption, and the definition of ˜ D from theproof of Theorem 5: (cid:101) D = (cid:113) m a ( d a + log B a + σp ) .Since: C ( n ) = (cid:114) em a ( d a + log B a + 2 σ log 1 /δ ) ,
We can further develop this upper bound:

W_p(δ_{θ_{a,t−1}}, µ_a^{(n)}) ≤ ˜D/√n + C(n) ≤ √( (10/(m_a n)) (d_a + log B_a + 2σ log 1/δ + σp) ),

where to derive this result we have used the fact that √x + √y ≤ √(2(x + y)). Letting ¯D = √( (10/m_a) (d_a + log B_a + 2σ log 1/δ + σp) ), we see that our final result is:

W_p(δ_{θ_{a,t−1}}, µ_a^{(n)}) ≤ ¯D/√n,

where ˜D < ¯D. Using the same choice of h(n) and number of steps N as in the proof of Theorem 5 guarantees that:

W_p^p(µ_{ih(n)}, µ_a^{(n)}) ≤ (¯D/√n)^p.

Further combining this with the triangle inequality, and the fact that ˜
D < ¯ D gives us that: W p ( µ ih ( n ) , δ θ ∗ ) ≤ ˜ D √ n + ¯ D √ n ≤ D √ n , Now, since the sample returned by the Langevin algorithm is given by: θ a = θ N + Z, (37)where Z ∼ N (cid:16) , nL a γ a I (cid:17) , it remains to bound the distance between the approximate posterior ˆ µ ( n ) a of θ a and the distribution of θ Nh ( n ) . Since θ a − θ Nh ( n ) = Z , for any even integer p , W pp (cid:16) ¯ µ ( n ) a , ¯ µ ( n ) a [ γ a ] (cid:17) = inf γ ∈ Γ (cid:16) ¯ µ ( n ) a , ¯ µ ( n ) a [ γ a ] (cid:17) (cid:90) (cid:107) θ a − θ N (cid:107) p d θ a d θ N /p ≤ E [ (cid:107) Z (cid:107) p ] p ≤ (cid:115) dnL a γ a (cid:32) p/ Γ ( p +12 ) √ π (cid:33) /p ≤ (cid:115) dnL a γ a (cid:18) p/ (cid:16) p (cid:17) p/ (cid:19) /p ≤ (cid:115) dpnL a γ a , where we have used upper bound of the Stirling type for the Gamma function Γ ( · ) in the second last inequality.Thus, we have, via the triangle inequality once again, that: W p (cid:16) ¯ µ ( n )[ γ a ] a , δ θ ∗ (cid:17) ≤ D √ n + (cid:115) dpnL a γ a ≤ (cid:114) m a n (cid:18) d a + log B a + 2 σ a log 1 /δ + (cid:18) σ a + d a L a γ a (cid:19) p (cid:19) , P θ a,t ∼ ¯ µ ( n ) a [ γ a ] (cid:32) (cid:107) θ a,t − θ ∗ a (cid:107) > (cid:115) em a n (cid:18) d a + log B a + 2 σ log 1 /δ + 2 (cid:18) σ a + m a d a L a γ a (cid:19) log 1 /δ (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) Z n − (cid:33) < δ . We remark that via an identical argument, the following Lemma holds as well:
Lemma 12.
Assume that the family log p a ( x ; θ ) and the prior π a satisfy Assumptions 1-4 and that arm a has been chosen n = T a ( t ) times up to iteration t of the Thompson sampling algorithm. If we take number ofdata samples in the stochastic gradient estimate k = 32 ( L ∗ a ) m a ν a , step size h ( n ) = m a n ( L a + n L a ) = O (cid:16) m a nL a (cid:17) andnumber of steps N = 1280 ( L a + n L a ) m a = O (cid:16) L a m a (cid:17) in Algorithm 2, then: P θ a,t ∼ ¯ µ ( n ) a [ γ a ] (cid:32) (cid:107) θ a,t − θ ∗ a (cid:107) > (cid:115) em a n (cid:18) d a + log B a + 2 σ log 1 /δ + 2 (cid:18) σ a + m a d a L a γ a (cid:19) log 1 /δ (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) Z n − (cid:33) < δ ., where Z t − = {(cid:107) θ a,t − − θ ∗ a (cid:107) ≤ C ( n ) } for the parameters: C ( n ) = (cid:114) enm a ( d a + log B a + 2 σ log 1 /δ ) , σ = 16 + 4 d a L a ν a m a , and θ a,t − being the sample from the previous round of the Thompson sampling algorithm over arm a . E Regret Proofs
We now present the proofs of logarithmic regret for Thompson sampling under our assumptions, both with samples from the true posterior and with the approximate sampling schemes discussed in Section 4. To provide these regret guarantees, we proceed as is common in regret proofs for multi-armed bandits by upper-bounding the number of times a sub-optimal arm a ∈ A is pulled up to time T, denoted T_a(T). Without loss of generality we assume throughout this section that arm 1 is the optimal arm, and define the filtration associated with a run of the algorithm as F_t = {A_1, X_1, A_2, X_2, ..., A_t, X_t}. To upper bound the expected number of times a sub-optimal arm is pulled up to time T, we first define the event E_a(t) = {r_{a,t}(T_a(t)) ≥ r̄ − ε} for some ε >
0. This captures the event that the mean calculated from the value of θ_a sampled from the posterior at time t ≤ T, namely r_{a,t}(T_a(t)), is greater than r̄ − ε (recall that r̄ is the optimal arm's mean). Given these events, we decompose the expected number of pulls of a sub-optimal arm a ∈ A as:

E[T_a(T)] = E[ Σ_{t=1}^T I(A_t = a) ] = E[ Σ_{t=1}^T I(A_t = a, E_a^c(t)) ] + E[ Σ_{t=1}^T I(A_t = a, E_a(t)) ] =: I + II. (38)

In Lemma 13 we upper bound (I), and we then bound term (II) in Lemma 14. We note that this proof follows a structure similar to that of the regret bound for Thompson sampling for Bernoulli bandits and bounded rewards in [Agrawal and Goyal, 2012]. However, the proof here is more involved because our regret guarantees incorporate the quality of the priors as well as the potential errors and lack of independence resulting from the approximate sampling methods discussed in Section 4.

Lemma 13 (Bounding I). For a sub-optimal arm a ∈ A, we have that:

I = E[ Σ_{t=1}^T I(A_t = a, E_a^c(t)) ] ≤ E[ Σ_{s=1}^{T−1} (1/p_{1,s} − 1) ],

where p_{a,s} = P(r_{a,t}(s) > r̄ − ε | F_{t−1}), for some ε > 0.

Proof. To bound term I of (38), we first recall that A_t is the arm achieving the largest sampled reward mean at round t. Further, we define A′_t to be the arm achieving the maximum sample mean value among all the suboptimal arms: A′_t = argmax_{a ∈ A, a ≠ 1} r_a(t, T_a(t)). Since E[I(A_t = a, E_a^c(t))] = P(A_t = a, E_a^c(t)), we aim to bound P(A_t = a, E_a^c(t) | F_{t−1}). We note that the following inequality holds:

P(A_t = a, E_a^c(t) | F_{t−1}) ≤ P(A′_t = a, E_a^c(t) | F_{t−1}) P(r_1(t, T_1(t)) ≤ r̄ − ε | F_{t−1}) = P(A′_t = a, E_a^c(t) | F_{t−1}) (1 − P(E_1(t) | F_{t−1})).
(39) We also note that the term P(A′_t = a, E_a^c(t) | F_{t−1}) can be bounded as follows:

P(A_t = 1, E_a^c(t) | F_{t−1}) ≥ (i) P(A′_t = a, E_a^c(t), E_1(t) | F_{t−1}) = P(A′_t = a, E_a^c(t) | F_{t−1}) P(E_1(t) | F_{t−1}). (40)

Inequality (i) holds because {A′_t = a, E_a^c(t), E_1(t)} ⊆ {A_t = 1, E_a^c(t), E_1(t)}. The equality is a consequence of the conditional independence of E_1(t) and {A′_t = a, E_a^c(t)} (conditioned on F_{t−1}). Assuming P(E_1(t) | F_{t−1}) > 0, putting inequalities (39) and (40) together gives the following upper bound for P(A_t = a, E_a^c(t) | F_{t−1}):

P(A_t = a, E_a^c(t) | F_{t−1}) ≤ P(A_t = 1, E_a^c(t) | F_{t−1}) · (1 − P(E_1(t) | F_{t−1})) / P(E_1(t) | F_{t−1}).

Letting P(E_1(t) | F_{t−1}) := p_{1,T_1(t)} and noting that {A_t = 1, E_a^c(t)} ⊆ {A_t = 1}:

P(A_t = a, E_a^c(t) | F_{t−1}) ≤ P(A_t = 1 | F_{t−1}) (1/p_{1,T_1(t)} − 1). (41)

Now, we use this to give an upper bound on the term of interest. The conditional independence property holds for all of our sampling mechanisms because the sample distributions for the two distinct arms (a,
1) are always conditionally independent on F t − In all the cases we consider, including approximate sampling schemes, this property holds. In that case, since the Gaussiannoise in the Langevin diffusion ensures all sets of the form ( a, b ) have nonzero probability mass. (cid:34) T (cid:88) t =1 I ( A t = a, E ca ( t )) (cid:35) ( i ) = E (cid:34) T (cid:88) t =1 E [ I ( A t = a, E ca ( t )) |F t − ] (cid:35) ( ii ) = E (cid:34) T (cid:88) t =1 P ( A t = a, E ca ( t ) |F t − ) (cid:35) ( iii ) ≤ E (cid:34) T (cid:88) t =1 P ( A t = 1 |F t − ) (cid:18) p ,T ( t ) − (cid:19)(cid:35) ( iv ) = E (cid:34) T (cid:88) t =1 E [ I ( A t = 1) |F t − ] (cid:18) p ,T ( t ) − (cid:19)(cid:35) ( v ) = E (cid:34) T (cid:88) t =1 I ( A t = 1) (cid:18) p ,T ( t ) − (cid:19)(cid:35) ( vi ) ≤ E (cid:34) T − (cid:88) s =1 p ,s − (cid:35) . Here the equality ( i ) is a consequence of the tower property, and equality ( ii ) by noting that E [ I ( A t = a, E ca ( t )) |F t − ] = P ( A t = a, E ca ( t ) |F t − ). Inequality ( iii ) follows by from Equation 41, and equality ( iv ) follows by definition.Finally, equality ( v ) follows by the tower property and the last line each the fact that T ( t ) = s and A t = 1can only happen once for every s = 1 , ..., T . This completes the proof.Given the bound on ( I ) from (3), we now present the tighter of two bounds on ( II ) which is used toprovide regret guarantees for Thompson sampling with exact samples from the posteriors. Lemma 14 (Bounding II - exact posterior) . For a sub-optimal arm a ∈ A , we have that: II = E (cid:34) T (cid:88) t =1 I ( A t = a, E a ( t )) (cid:35) ≤ E (cid:34) T (cid:88) s =1 I (cid:18) p a,s > T (cid:19)(cid:35) . where p a,s = P ( r a,t ( s ) > ¯ r − (cid:15) |F t − ) , for some (cid:15) > .Proof. The upper bound for term II in (3) follows the exact same proof as in [Agrawal and Goyal, 2012],and we recreate it for completeness below. 
Let 𝒯 = {t : p_{a,T_a(t)} > 1/T}. Then:

E[ Σ_{t=1}^T I(A_t = a, E_a(t)) ] ≤ E[ Σ_{t∈𝒯} I(A_t = a) ] + E[ Σ_{t∉𝒯} I(E_a(t)) ] =: I + II. (42)

By definition, term I in (42) satisfies:

Σ_{t∈𝒯} I(A_t = a) = Σ_{t∈𝒯} I(A_t = a, p_{a,T_a(t)} > 1/T) ≤ Σ_{s=1}^T I(p_{a,s} > 1/T).

To address term II in (42), we note that, by definition, E[I(E_a(t)) | F_{t−1}] = p_{a,T_a(t)}. Therefore, using the definition of the set of times 𝒯, we can construct this simple upper bound:

E[ Σ_{t∉𝒯} I(E_a(t)) ] = E[ Σ_{t∉𝒯} E[I(E_a(t)) | F_{t−1}] ] = E[ Σ_{t∉𝒯} p_{a,T_a(t)} ] ≤ Σ_{t∉𝒯} 1/T ≤ 1.

Combining terms I and II in (42) gives our desired result:

E[ Σ_{t=1}^T I(A_t = a, E_a(t)) ] ≤ E[ Σ_{s=1}^T I(p_{a,s} > 1/T) ] + 1.

E.1 Regret of Exact Thompson Sampling
We now present two technical lemmas for use in the proof of the regret of exact Thompson sampling. The first technical lemma provides a lower bound on the probability of an arm being optimistic in terms of the quality of the prior:
Lemma 15.
Suppose the likelihood and reward distributions satisfy Assumptions 1-3, then for all n = 1 , ..., T and γ = ν m d L : E (cid:20) p ,n (cid:21) ≤ (cid:114) L m B Proof.
Throughout this proof we drop the dependence on the arm to simplify notation (unless necessary).We first analyze (cid:107) θ ∗ − θ u (cid:107) where θ u is the mode of the posterior of arm 1 after having received n samplesfrom the arm which satisfies: 1 n ∇ log π ( θ u ) + ∇ F ,n ( θ u ) = 0Given this definition, and letting ˆ θ = θ u − θ ∗ we have that:ˆ θ T ( ∇ F n ( θ ∗ ) − ∇ F n ( θ u )) − n ˆ θ T ∇ log π ( θ u ) = ˆ θ T ∇ F n ( θ ∗ ) m (cid:107) ˆ θ (cid:107) ≤ m (cid:107) ˆ θ (cid:107) + 12 m (cid:107)∇ F n ( θ ∗ ) (cid:107) + log B n (cid:107) ˆ θ (cid:107) ≤ m (cid:107)∇ F n ( θ ∗ ) (cid:107) + 2 log B mn Noting that | a T ( θ ∗ − θ u ) | ≤ (cid:113) A (cid:107) ˆ θ (cid:107) we find that: p ,s = P r (cid:0) α T ( θ − θ u ) ≥ α T ( θ ∗ − θ u ) − (cid:15) (cid:1) ≥ P r α T ( θ − θ u ) ≥ (cid:114) A log B nm + A m (cid:107)∇ F n ( θ ∗ ) (cid:107) (cid:124) (cid:123)(cid:122) (cid:125) = t , (cid:107) F n ( θ ∗ ) (cid:107) in Proposition 1 is a 1-dimensional dL a √ nν subgaussian random variable.Now, since we know that the posterior over θ is γ ( n + 1) L -smooth and γmn -strongly log concave, withmode θ u , we know from e.g Saumard and Wellner [2014] Theorem 3.8 that the marginal density of α T θ is γ ( n +1) LA -smooth and γmnA -strongly log-concave.Thus we have that: P r (cid:0) α T ( θ − θ u ) ≥ t (cid:1) ≥ (cid:114) nm ( n + 1) L P r ( Z ≥ t )where Z ∼ N (cid:16) , A γ ( n +1) L (cid:17) .Now using a lower bound on the cumulative density function of a Gaussian random variable, we find that,for σ = A γ ( n +1) L : p ,s ≥ (cid:114) nm π ( n + 1) L σtt + σ e − t σ : t > A √ γ ( n +1) L .
34 : t ≤ A √ γ ( n +1) L Thus we have that: 1 p ,s ≤ (cid:114) π ( n + 1) Lnm t + σ σt e t σ : t > A √ γ ( n +1) L . : t ≤ A √ γ ( n +1) L ≤ (cid:114) π ( n + 1) Lnm (cid:0) tσ + 1 (cid:1) e t σ : t > A √ γ ( n +1) L t ≤ A √ γ ( n +1) L Taking the expectation of both sides with respect to the samples X , ..., X n , letting κ = L/m , and usingthe fact that n +1 n ≤ n ≥ E (cid:20) p ,s (cid:21) ≤ √ πκ + 2 √ πκ E (cid:113) A log B nm + A m (cid:107)∇ F n ( θ ∗ ) (cid:107) σ + 1 e t σ Noting that (cid:113) A log B nm + A m (cid:107)∇ F n ( θ ∗ ) T (cid:107) ≤ A (cid:113) B nm + Am (cid:107)∇ F n ( θ ∗ ) (cid:107) , and letting Y = (cid:107)∇ F n ( θ ∗ ) (cid:107) tosimplify notation, this further simplifies: E (cid:20) p ,s (cid:21) ≤ √ πκ + 2 √ πκ E (cid:20)(cid:18)(cid:112) γκ log B + Amσ Y (cid:19) e γκ log B + ( n +1) γL m Y (cid:21) Via Cauchy-Schwartz we can further develop this upper bound and find that: E (cid:20) p ,s (cid:21) ≤ √ πκ + 2 √ πκe γκ log B (cid:32)(cid:112) γκ log B E (cid:104) e ( n +1) γL m Y (cid:105) + Amσ (cid:112) E [ Y ] (cid:114) E (cid:104) e ( n +1) γLm Y (cid:105)(cid:33) Since Y is sub-Gaussian, Y is sub-exponential such that:38 (cid:104) e λY (cid:105) ≤ e and E (cid:2) Y (cid:3) ≤ dL νn for λ < nν dL . Therefore if : γ = νm dL Simplifying the bound further gives: E (cid:20) p ,s (cid:21) ≤ √ πκ + 2 √ πκe γκ log B (cid:32)(cid:112) γκ log B e + 2 (cid:114) eγ ( n + 1) Lm dL νn (cid:33) ≤ √ πκ + 2 √ πκe log B ( (cid:114) log B e + 2 √ e )where we have used the fact that κ, d ≥ L/ν ≥
1. Thus, this bound simplifies to: E (cid:20) p ,s (cid:21) ≤ √ πκ + 2 √ πκe γκ log B (cid:32)(cid:112) γκ log B e + 2 (cid:114) eγ ( n + 1) Lm dL νn (cid:33) ≤ √ πκ ( B ) (cid:32)(cid:114) log B e + 7 (cid:33) ≤ √ πκ ( B ) (cid:16)(cid:112) log B + 4 (cid:17) ≤ (cid:112) κB where we used the fact that x / ( √ log x + 4) ≤ √ x for x ≥ √ π < Lemma 16.
Suppose the likelihood, true reward distributions, and priors satisfy Assumptions 1-3, then for γ a = ν a m a d a L a : T − (cid:88) s =1 E (cid:20) p ,s − (cid:21) ≤ (cid:114) L m B (cid:24) eA m ∆ a ( D + 4 σ log 2) (cid:25) + 1 (43) T (cid:88) s =1 E (cid:20) I (cid:18) p a,s > T (cid:19)(cid:21) ≤ eA a m ∆ a ( D a + 2 σ a log( T )) (44) Where for a ∈ A , D a is given by: D a = log B a + 8 d a L a m a ν a σ a = 256 d a L a m a ν a + 8 d a L a m a ν a Proof.
We begin by showing that (43) holds. To do so, we first note that, by definition p ,s satisfies: p ,s = P ( r ,t ( s ) > ¯ r − (cid:15) |F t − ) (45)= 1 − P ( r ,t ( s ) − ¯ r < − (cid:15) |F t − ) (46) ≥ − P ( | r ,t ( s ) − ¯ r | > (cid:15) |F t − ) (47) ≥ − P θ ∼ µ ( s )1 (cid:18) (cid:107) θ − θ ∗ (cid:107) > (cid:15)A (cid:19) (48)39here the last inequality follows from the fact that r ,t ( s ) and ¯ r are A a -Lipschitz functions of θ ∼ µ ( s )1 and θ ∗ respectively.We then use the fact that the posterior distribution P θ ∼ µ ( s )1 satisfies the concentration bound fromTheorem 1. Therefore, we have that: P θ ∼ µ ( s )1 (cid:18) (cid:107) θ − θ ∗ (cid:107) > (cid:15)A (cid:19) ≤ exp (cid:18) − σ (cid:18) mn(cid:15) eA − D (cid:19)(cid:19) , (49)where we use the constant D and σ defined in the proof of Theorem 1 to simplify notation. We remarkthat this bound is not useful unless: n > eA (cid:15) m D . Thus, choosing (cid:15) = (¯ r − ¯ r a ) / a / (cid:96) as: (cid:96) = (cid:24) eA m ∆ a ( D + 2 σ log 2) (cid:25) . we proceed as follows: T − (cid:88) s = (cid:96) E (cid:20) p ,s − (cid:21) ≤ T − (cid:88) s =0 − δ ( s ) − ≤ (cid:90) ∞ s =1 − δ ( s ) − ds where: δ ( s ) = exp (cid:18) − σ (cid:18) m(cid:15) eA s (cid:19)(cid:19) , and the first inequality follows from our choice of (cid:96) and the second by upper bounding the sum by an integral.To finish, we write δ ( s ) = exp ( − c ∗ s ), and solve the integral to find that : (cid:90) ∞ s =1 − δ ( s ) − ds = log 2 − log (2 e c − c + 1 ≤ log 2 c + 1 . 
plugging in for c gives: T − (cid:88) s =1 E (cid:20) p ,s − (cid:21) ≤ (cid:96) − (cid:88) s =1 E (cid:20) p ,s − (cid:21) + 8 eA m ∆ a σ log 2 + 1 ≤ (cid:114) L m B (cid:24) eA m ∆ a ( D + 4 σ log 2) (cid:25) + 1To show that (44) holds, we do a similar derivation as in (48): T (cid:88) s =1 E (cid:20) I (cid:18) p a,s > T (cid:19)(cid:21) = T (cid:88) s =1 E (cid:20) I (cid:18) P ( r a,t ( s ) − ¯ r a > ∆ a − (cid:15) |F t − ) > T (cid:19)(cid:21) = T (cid:88) s =1 E (cid:20) I (cid:18) P ( r a,t ( s ) − ¯ r a > ∆ a |F t − ) > T (cid:19)(cid:21) ≤ T (cid:88) s =1 E (cid:20) I (cid:18) P (cid:18) | r a,t ( s ) − ¯ r a | > ∆ a (cid:12)(cid:12)(cid:12)(cid:12) F t − (cid:19) > T (cid:19)(cid:21) ≤ T (cid:88) s =1 E (cid:20) I (cid:18) P θ ∼ µ ( s ) a [ γ a ] (cid:18) (cid:107) θ − θ ∗ (cid:107) > ∆ a A a (cid:19) > T (cid:19)(cid:21) . n of arm a such that for all n ≥ ¯ n : P θ ∼ µ ( n ) a [ γ a ] (cid:18) (cid:107) θ − θ ∗ (cid:107) > ∆ a A a (cid:19) ≤ T .
Since the posterior for arm a after n pulls of arm a has the same form as in (49), we can choose n̄ as:

n̄ = (8 e A_a² / (m_a Δ_a²)) (D_a + 2 σ_a log T).

This completes the proof. Given these lemmas, the proof of Theorem 2 is straightforward. For clarity, we restate the theorem below:
Theorem E.1.
When the likelihood and true reward distributions satisfy Assumptions 1-3 and γ a = ν a m a d a L a we have that the expected regret after T > rounds of Thompson sampling with exact sampling satisfies: E [ R ( T )] ≤ (cid:88) a> CA a m a ∆ a (cid:0) log B a + d a κ a + d a κ a log( T ) (cid:1) + (cid:112) κ B CA m ∆ a (cid:0) B + d κ (cid:1) + ∆ a Where C is a universal constant independent of problem-dependent parameters.Proof. We invoke Lemmas 13 and 14, to find that: E [ T a ( T )] ≤ T − (cid:88) s =1 E (cid:20) p ,s − (cid:21)(cid:124) (cid:123)(cid:122) (cid:125) ( I ) + T (cid:88) s =1 E (cid:20) I (cid:18) − p a,s > T (cid:19)(cid:21)(cid:124) (cid:123)(cid:122) (cid:125) ( II ) (50)Now, invoking Lemma 16, we use the upper bounds for terms ( I ) and ( II ) in the regret decompositionand expanding D a and D to give that: E [ R ( T )] ≤ (cid:88) a> eA a m a ∆ a (cid:0) log B a + 8 d a κ a ( d a + 66 log( T )) (cid:1) + (cid:112) κ B eA a m ∆ a (cid:0) B + 8 d κ ( d + 132 log(2)) (cid:1) + ∆ a ≤ (cid:88) a> CA a m a ∆ a (cid:0) log B a + d a κ a + d a κ a log( T ) (cid:1) + (cid:112) κ B CA m ∆ a (cid:0) B + d κ (cid:1) + ∆ a E.2 Regret of Approximate Sampling
For the proof of Theorem 4, we proceed similarly to the proof of Theorem 2, but we require another intermediate lemma to deal with the fact that the samples from the arms are no longer conditionally independent given the filtration (since we use the last sample as the initialization of the Langevin algorithm). To do so, we first define the event Z_a(T) = ∩_{t=1}^{T−1} Z_{a,t}, where

Z_{a,t} = { ‖θ_{a,t} − θ*_a‖ < √( (e/(n m_a)) ( d_a + log B_a + 2 (16 + 4 d L_a/(ν_a m_a)) log 1/δ ) ) },

Lemma 17.
Suppose the likelihood and reward distributions satisfy Assumptions 1-4. Then the regret of a Thompson sampling algorithm with approximate sampling can be decomposed as:

E[R(T)] ≤ Σ_{a>1} Δ_a E[ T_a(T) | Z_a(T) ∩ Z_1(T) ] + 2Δ_a. (51)

Proof.
We begin by conditioning on the event Z_a(T) ∩ Z_1(T) for each a ∈ A, where we note that by construction p_Z = P(Z_a(T)^c ∪ Z_1(T)^c) ≤ P(Z_1(T)^c) + P(Z_a(T)^c) = 2Tδ (since, via Lemma 3, the probability of each event in Z_a(T)^c and Z_1(T)^c is less than δ). Therefore, we must have that:

E[T_a(T)] ≤ E[ T_a(T) | Z_a(T) ∩ Z_1(T) ] + E[ T_a(T) | Z_a(T)^c ∪ Z_1(T)^c ] p_Z
≤ E[ T_a(T) | Z_a(T) ∩ Z_1(T) ] + 2Tδ E[ T_a(T) | Z_a(T)^c ∪ Z_1(T)^c ]
≤ E[ T_a(T) | Z_a(T) ∩ Z_1(T) ] + 2δT²,

where in the first line we use the fact that 1 − p_Z ≤ 1, and in the last line the fact that T_a(T) is trivially less than T. Choosing δ = 1/T² completes the proof.

With this decomposition in hand, we can now proceed as in Lemma 15 to provide anti-concentration guarantees for the approximate posteriors.

Lemma 18.
Suppose the likelihood and true reward distributions satisfy Assumptions 1-4: then if γ = νm Lνm +4 dL ) , for all n = 1 , ..., T all samples from the the (stochastic gradient) ULA method with thehyperparameters and runtime as described in Theorem 3 satisfy: E (cid:20) p ,n (cid:21) ≤ (cid:112) B Proof.
We begin by using the last step of our Langevin Dynamics and show that it exhibits the desiredanti-concentration properties. In particular, we know that θ ,t ∼ N ( θ ,Nh , γ I ), such that: p ,s = P r (cid:0) α T ( θ − θ ,Nh ) ≥ α T ( θ ∗ − θ ,Nh ) − (cid:15) (cid:1) ≥ P r Z ≥ A (cid:107) θ ,Nh − θ ∗ (cid:107) (cid:124) (cid:123)(cid:122) (cid:125) := t where Z ∼ N (0 , A nLγ I ) by construction.Now using a lower bound on the cumulative density function of a Gaussian random variable, we find that,for σ = A nLγ : p ,s ≥ (cid:114) π σtt + σ e − t σ : t > A √ nLγ .
34 : t ≤ A √ nLγ p ,s ≤ √ π (cid:0) tσ + 1 (cid:1) e t σ : t > A √ nLγ t ≤ A √ nLγ Taking the expectation of both sides with respect to the samples X , ..., X n , we find that: E (cid:20) p ,s (cid:21) ≤ √ π + √ π E (cid:104)(cid:16)(cid:112) nLγ (cid:107) θ ,Nh − θ ∗ (cid:107) + 1 (cid:17) e nLγ (cid:107) θ ,Nh − θ ∗ (cid:107) (cid:105) ≤ √ π + (cid:112) πnLγ (cid:113) E [ (cid:107) θ ,Nh − θ ∗ (cid:107) ] (cid:113) E (cid:2) e nLγ (cid:107) θ ,Nh − θ ∗ (cid:107) (cid:3) + √ π E (cid:104) e nLγ (cid:107) θ ,Nh − θ ∗ (cid:107) (cid:105) Now, we remark that, from Theorems 5 and 6, we have that for both approximate sampling schemes: E (cid:2) (cid:107) θ ,Nh − θ ∗ (cid:107) (cid:3) ≤ mn (cid:18) d + log B + 32 + 8 dL νm (cid:19) Further, we note that (cid:107) θ ,Nh − θ ∗ (cid:107) is a sub-exponential random variable. To see this, we analyze itsmoment generating function: E [ e nLγ (cid:107) θ ,Nh − θ ∗ (cid:107) ] = 1 + ∞ (cid:88) i =1 E (cid:20) ( nLγ ) i (cid:107) θ ,Nh − θ ∗ (cid:107) i i ! (cid:21) Borrowing the notation from the proof of Theorem 1, we know that E (cid:2) (cid:107) θ ,Nh − θ ∗ (cid:107) p (cid:3) ≤ (cid:18) Dmn + 4 σpmn (cid:19) p where: D = d + log B and σ = 16 + 4 dL νm Plugging this in above gives: E [ e γ (cid:107) θ ,Nh − θ ∗ (cid:107) ] ≤ ∞ (cid:88) i =1 (cid:16) nLγD +4 nLγσimn (cid:17) i i ! ≤ ∞ (cid:88) i =1 i ! (cid:18) nLγDmn (cid:19) i + 32 ∞ (cid:88) i =1 i ! (cid:18) nLγσinm (cid:19) i ≤ e nLγDmn + 32 ∞ (cid:88) i =1 (cid:18) nLγeσnm (cid:19) i where, we have use the identities ( x + y ) i ≤ i − ( x i + y i ) for i ≥
1, and i ! ≥ ( i/e ) i to simplify the bound.If γ ≤ m Lσ , then we have that: E [ e nLγ (cid:107) θ ,Nh − θ ∗ (cid:107) ] ≤ (cid:16) e nLγDm + 2 . (cid:17) which, together with the upper bound on γ gives:43 (cid:20) p ,s (cid:21) ≤ √ π + 32 (cid:114) πnLγm ( D + 2 σ ) (cid:16) e nLγDm + 2 (cid:17) + 32 √ π (cid:16) e nLγDm + 7 . (cid:17) ≤ √ π + 32 (cid:32)(cid:114) π ( d + log B )2 σ + √ π (cid:33) (cid:16) e d +log B σ + 2 (cid:17) + 32 √ π (cid:16) e d +log B σ + 2 . (cid:17) where we used the sub-additivity of √ x , the fact that (cid:113) < , sqrt . < σ and D to simplify the boung. Finally since L mν >
1, we find that σ > max (4 d, E (cid:20) p ,s (cid:21) ≤ √ π + 32 (cid:114) π B (cid:16) B / + 2 (cid:17) + 32 √ π (cid:16) B / + 2 . (cid:17) ≤
18 + 3 √ (cid:16) B / + B / (cid:112) log B + log B + 2 B / (cid:17)(cid:124) (cid:123)(cid:122) (cid:125) I ≤
18 + 12 / √ √ B ≤ √ B where to simplify the bound we used the fact that √ π < I ≤ √ B and that 18 + 12 / √ x ≤ x for x ≥ Lemma 19.
Suppose the likelihood, true reward distributions, and priors satisfy Assumptions 1-4, the samplesare generated from the sampling schemes described in Theorem 6 and Theorem 5, and γ a = m a L a σ a then: T − (cid:88) s =1 E (cid:20) (cid:98) p ,s − (cid:12)(cid:12)(cid:12)(cid:12) Z ( T ) (cid:21) ≤ (cid:112) B (cid:24) eA m ∆ a ( d + log B + 4 σ log T + 12 d σ log 2) (cid:25) + 1 (52) T (cid:88) s =1 E (cid:20) I (cid:18)(cid:98) p a,s > T (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) Z a ( T ) (cid:21) ≤ eA a m ∆ a ( d a + log B a + 10 d a σ a log( T )) , (53) where (cid:98) p a,s is the distribution of a sample from the approximate posterior (cid:98) µ a after s samples have been collected,and for a ∈ A , σ a is given by: σ a = 16 + 4 d a L a m a ν a . Proof.
We begin by showing that (52) holds. To do so, we proceed identically as in the proof of Lemma 16 tonote that, by definition (cid:98) p ,s satisfies: (cid:98) p ,s = P ( r ,t ( s ) > ¯ r − (cid:15) |F t − ) (54)= 1 − P ( r ,t ( s ) − ¯ r < − (cid:15) |F t − ) (55) ≥ − P ( | r ,t ( s ) − ¯ r | > (cid:15) |F t − ) (56) ≥ − P θ ∼ (cid:98) µ ( s )1 (cid:18) (cid:107) θ − θ ∗ (cid:107) > (cid:15)A (cid:19) , (57)where the last inequality follows from the fact that r ,t ( s ) and ¯ r are A a -Lipschitz functions of θ ∼ µ ( s )1 and θ ∗ respectively. 44e then use the fact that conditioned on Z ( T ), the approximate posterior distribution P θ ∼ (cid:98) µ ( s )1 satisfiesthe identical concentration bounds from Lemmas 12 and Lemma 11. Substituting in the assumed value of γ ,and simplifying, we have that the distribution of the samples conditioned on Z ( T ) satisfy: P θ ,t ∼ ¯ µ ( s )1 [ γ ] (cid:18) (cid:107) θ ,t − θ ∗ (cid:107) > (cid:114) em n ( d + log B + 4 σ log T + 6 d σ log 1 /δ ) (cid:12)(cid:12)(cid:12)(cid:12) Z n − (cid:19) < δ ., Equivalently, we have that: P θ ∼ ¯ µ ( s )1 [ γ ] (cid:18) (cid:107) θ − θ ∗ (cid:107) > (cid:15)A (cid:19) ≤ exp (cid:18) − d σ (cid:18) m n(cid:15) eA − ¯ D (cid:19)(cid:19) , (58)where we define ¯ D = d + log B + 4 σ log T , to simplify notation. We remark that this bound is not usefulunless: n > eA (cid:15) m ¯ D . Thus, choosing (cid:15) = (¯ r − ¯ r a ) / a /
2, we can choose (cid:96) as: (cid:96) = (cid:24) eA m ∆ a ( ¯ D + 6 d σ log 2) (cid:25) . With this choice of (cid:96) , we proceed exactly as in the proof of Lemma 16 to find that : T − (cid:88) s =1 E (cid:20) (cid:98) p ,s − (cid:12)(cid:12)(cid:12)(cid:12) Z ( T ) (cid:21) ≤ (cid:112) B (cid:96) + T − (cid:88) s = (cid:96) E (cid:20) p ,s − (cid:12)(cid:12)(cid:12)(cid:12) Z ( T ) (cid:21) ≤ (cid:112) B (cid:24) eA m ∆ a ( ¯ D + 12 d σ log 2) (cid:25) + 1 , where we used the upper bound from Lemma 18 to bound the first (cid:96) terms in the first inequality.To show that (53) holds, we use a similar derivation as in (57): T (cid:88) s =1 E (cid:20) I (cid:18) p a,s > T (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) Z a ( T ) (cid:21) ≤ T (cid:88) s =1 E (cid:20) I (cid:18) P θ ∼ ¯ µ ( s ) a [ γ a ] (cid:18) (cid:107) θ − θ ∗ (cid:107) > ∆ a A a (cid:19) > T (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) Z a ( T ) (cid:21) Since on the event Z a ( T ), the posterior concentration result from Lemmas 12 and Lemma 11 holds, it remainsto upper bound the number of pulls ¯ n of arm a such that for all n ≥ ¯ n : P θ ∼ ¯ µ ( n ) a [ γ a ] (cid:18) (cid:107) θ − θ ∗ (cid:107) > ∆ a A a (cid:19) ≤ T .
Since the posterior for arm a after n pulls of arm a has the same form as in (49), we can choose n̄ as:

n̄ = (144 e A_a² / (m_a Δ_a²)) (¯D_a + 6 d_a σ_a log T).

Using the fact that d_a ≥ 1 completes the proof.

Theorem E.2 (Regret of Thompson sampling with (stochastic gradient) Langevin algorithm). When the likelihood and true reward distributions satisfy Assumptions 1-4, we have that the expected regret after
T > ounds of Thompson sampling with the (stochastic gradient) ULA method with the hyper-parameters andruntime as described in Lemmas 11 (and 12 respectively), and γ a = ν a m a L a ν a m a +4 d a L a ) = O (cid:16) d a κ a (cid:17) satisfies: E [ R ( T )] ≤ (cid:88) a> CA a m a ∆ a (cid:0) d a + log B a + d a κ a log T (cid:1) + C √ B A m ∆ a (cid:0) B + d κ log T + d κ (cid:1) + 3∆ a . where C is a universal constant that is independent of problem dependent parameters and κ a = L a /m a .Proof. To begin, we invoke Lemma 17, which shows that we only need to bound the number of times asuboptimal arm a ∈ A is chosen on the ‘nice’ event Z ( T ) ∩ Z a ( T ) where the gradient of the log likelihoodhas concentrated and the approximate samples have been in high probability regions of the posteriors. Wethen invoke Lemmas 13 and 14, to find that: E (cid:34) T a ( T ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Z ( T ) ∩ Z a ( T ) (cid:35) ≤ (cid:96) (59)+ T − (cid:88) s = (cid:96) E (cid:20) p ,s − (cid:12)(cid:12)(cid:12)(cid:12) Z ( T ) (cid:21)(cid:124) (cid:123)(cid:122) (cid:125) ( I ) + T (cid:88) s =1 E (cid:20) I (cid:18) − p a,s > T (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) Z a ( T ) (cid:21)(cid:124) (cid:123)(cid:122) (cid:125) ( II ) (60)Now, invoking Lemma 16, we use the upper bounds for terms ( I ) and ( II ) in the regret decomposition,use our choice of both δ and δ = 1 /T , expanding D a and D , and use the fact that (cid:100) x (cid:101) ≤ x + 1 to givethat: E [ R ( T )] ≤ (cid:88) a> eA a m a ∆ a (cid:18) d a + log B a + 10 d a (cid:18)
16 + 4 d a L a ν a m a (cid:19) log( T ) (cid:19) + 27 (cid:112) B eA m ∆ a (cid:18) d + log B + 4 (cid:18)
16 + 4 d L a ν m (cid:19) (log T + 3 d log 2) (cid:19) + 3∆ a . ≤ (cid:88) a> CA a m a ∆ a (cid:0) d a + log B a + d a κ a log T (cid:1) + C √ B A m ∆ a (cid:0) B + d κ log T + d κ (cid:1) + 3∆ a . Using the fact that κ a ≥ d ≥ F Details in the Numerical Experiments
We benchmark the effectiveness of approximate Thompson sampling against both UCB and exact Thompson sampling across three different Gaussian multi-armed bandit instances with 10 arms. We remark that we use Gaussian bandit instances because the closed form of the posteriors allows us to properly benchmark against exact Thompson sampling and UCB, though our theory applies to a broader family of prior/likelihood pairs.

In all three instances we keep the reward distributions for each arm fixed such that their means are evenly spaced from 0 to 10 (r̄_1 = 1, r̄_2 = 2, and so on), and their variances are all 1. In each instance we use different priors over the means of the arms to analyze whether the approximate Thompson sampling algorithms preserve the performance of exact Thompson sampling. In the first instance, the priors reflect the correct ordering of the means: we use Gaussian priors with variance 4 and means evenly spaced between 5 and 10, such that E_{π_1}[X] = 5 and E_{π_10}[X] = 10. In the second instance, the prior for each arm is a Gaussian with mean 7.5. In the third instance, the priors reflect the reversed ordering of the means: E_{π_1}[X] = 10 and E_{π_10}[X] = 5.

As suggested by our theoretical analysis in Section 4, we use a constant number of steps for both ULA and SGLD to generate samples from the approximate posteriors. In particular, for ULA we take N = 100, and we double that number for SGLD (N = 200). We choose the step size for both algorithms to be 1/T_a(t). For SGLD, we use a batch size of min(T_a(t), k), with k the batch-size constant from Lemma 12. Since d_a = κ_a = 1 for a Gaussian family, we take the scaling to be γ_a = 1. The regret is calculated as Σ_{t=1}^T (r̄ − r̄_{A_t}) for the three algorithms and is averaged across 100 runs.
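The approximate-sampling loop described above can be sketched in a few lines. The following is a minimal illustration of ULA-based Thompson sampling on the first instance (correctly ordered priors), not the authors' experimental code; the horizon, random seed, and unit likelihood variance are assumptions for this sketch, while the 10 arms, prior variance 4, N = 100 ULA steps, step size 1/T_a(t), and sample-reuse initialization follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10                                   # number of arms
means = np.arange(1.0, K + 1)            # true arm means 1, 2, ..., 10
prior_means = np.linspace(5.0, 10.0, K)  # instance 1: priors ordered correctly
prior_var = 4.0
T = 500                                  # horizon (an assumption for this sketch)
N_ULA = 100                              # constant number of ULA steps per round

counts = np.zeros(K, dtype=int)
sums = np.zeros(K)
theta_prev = prior_means.copy()          # sample reuse: warm-start each chain

def ula_sample(a):
    """N_ULA unadjusted-Langevin steps on arm a's Gaussian posterior
    (unit likelihood variance), initialized at the previous round's sample."""
    n = counts[a]
    h = 1.0 / max(n, 1)                  # step size 1/T_a(t), as in the text
    theta = theta_prev[a]
    for _ in range(N_ULA):
        # gradient of the log posterior: prior term + likelihood term
        grad = (prior_means[a] - theta) / prior_var + (sums[a] - n * theta)
        theta += h * grad + np.sqrt(2.0 * h) * rng.standard_normal()
    return theta

regret = 0.0
for t in range(T):
    samples = np.array([ula_sample(a) for a in range(K)])
    a = int(np.argmax(samples))          # Thompson sampling: play the argmax
    theta_prev = samples                 # reuse this round's samples next round
    x = means[a] + rng.standard_normal() # unit-variance Gaussian reward
    counts[a] += 1
    sums[a] += x
    regret += means[-1] - means[a]
```

Because the posterior concentrates as T_a(t) grows, a constant number of Langevin steps per round suffices in practice, and the per-round cost stays bounded even as the horizon grows.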
Finally, for the implementation of UCB, we use the time-horizon-tuned UCB of [Lattimore and Szepesvári, 2020] with the known variance σ² of the arms in the upper confidence bounds (to maintain a level playing field between algorithms):

UCB_a(t) = (1/T_a(t)) Σ_{i=1}^{t−1} X_{A_i} I{A_i = a} + √( 2σ² log(2T) / T_a(t) ).
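A direct implementation of this index might look as follows. This is a sketch, and the √(2σ² log(2T)/T_a(t)) bonus is our reading of the displayed formula, so the exact constant in the bonus should be treated as an assumption.

```python
import numpy as np

def ucb_index(sums, counts, T, sigma2=1.0):
    """Time-horizon-tuned UCB index: empirical mean of each arm plus a
    sqrt(2 * sigma^2 * log(2T) / T_a(t)) exploration bonus. Arms that
    have never been pulled get an infinite index so each is tried once."""
    counts = np.asarray(counts, dtype=float)
    sums = np.asarray(sums, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        mean = np.where(counts > 0, sums / counts, 0.0)
        bonus = np.sqrt(2.0 * sigma2 * np.log(2.0 * T) / counts)
    return np.where(counts > 0, mean + bonus, np.inf)

# At each round t, the UCB player simply pulls argmax_a of this index.
```

Passing the known variance σ² into the bonus keeps the comparison with the Thompson sampling variants fair, as noted above.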