Diffusion Asymptotics for Sequential Experiments
DDiffusion Asymptotics for Sequential Experiments
Stefan [email protected] Kuang [email protected] Graduate School of Business
Abstract
We propose a new diffusion-asymptotic analysis for sequentially randomized ex-periments. Rather than taking sample size n to infinity while keeping the problemparameters fixed, we let the mean signal level scale to the order 1 / √ n so as to pre-serve the difficulty of the learning task as n gets large. In this regime, we show thatthe behavior of a class of methods for sequential experimentation converges to a dif-fusion limit. This connection enables us to make sharp performance predictions andobtain new insights on the behavior of Thompson sampling. Our diffusion asymptoticsalso help resolve a discrepancy between the Θ(log( n )) regret predicted by the fixed-parameter, large-sample asymptotics on the one hand, and the Θ( √ n ) regret fromworst-case, finite-sample analysis on the other, suggesting that it is an appropriateasymptotic regime for understanding practical large-scale sequential experiments. Keywords:
Multi-arm bandit, Thompson sampling, Stochastic differential equation.
Sequential experiments, pioneered by Wald [1947] and Robbins [1952], involve collectingdata over time using a design that adapts to past experience. The promise of sequentialexperiments is that, relative to classical randomized trials, they can effectively concentratepower on studying the most promising alternatives and save on costs by helping us avoidrepeatedly taking sub-optimal actions. Such adaptivity, however, does not come for free,and sequential experiments induce intricate dependence patterns in the data that resultin delicate practical considerations; see Bubeck and Cesa-Bianchi [2012] for a review anddiscussion.To illustrate the complexity of sequential experiments, consider the following K -armsetting, also known as a K -armed bandit. There is a sequence of decision points i = 1 , , . . . at which an agent chooses which action A i ∈ { , . . . , K } to take and then observes a reward Y i ∈ R . Here, Y i is assumed to be drawn from a distribution P A i , where the action A i maydepend on past observations. Furthermore, Y i is conditionally independent from all otheraspects of the system given the realization of A i . A standard goal is for the agent to chooseactions with the highest possible expected reward, and to minimize expected regret R n relative to always taking the best action, where R n = n sup ≤ k ≤ K { µ k } − n (cid:88) i =1 A i µ A i , µ k = E P k [ Y ] . (1) Draft version: February 2021. a r X i v : . [ m a t h . S T ] F e b n this setting, Lai and Robbins [1985] show that given any fixed set of arms { P k } Kk =1 , awell-designed sequential algorithm can achieve regret that scales logarithmically with thenumber n of time steps considered, i.e., R n = O P (log( n )). Conversely, given any fixed timehorizon n , it is possible to choose probability distributions { P k } Kk =1 such that the expectedregret E [ R n ] of any sequential algorithm is lower-bounded to order √ Kn [Auer et al., 2002].A similar difficulty arises when we want to use n samples to choose an action ˆ k to deploy,and we want to control the cost of mistakes sup { µ k } − E [ µ ˆ k ]. Here, for a fixed problemsetting, the cost of mistakes can be made to decay exponentially in n [Russo, 2020], butfor fixed n it’s possible to choose problems for which the regret of any algorithm is lowerbounded by (cid:112) K/n .This discrepancy between the regret behavior of fixed- P , n → ∞ asymptotics, andworst-case, fixed- n analysis complicates the use of formal results to guide practical design ofsequential experiments. When running a large sequential experiment in practice with, say, n = 100 ,
000 samples, should one expect to pay regret on the order of √ n or log( n )? Shouldwe expect the fraction of times we pull each arm to concentrate, as is suggested by fixed- P , n → ∞ asymptotics, or should we expect it to show genuine variability based on analysestargeted at the fixed- n setting? The discrepancy further highlights the importance of a“moderate data” regime in sequential experiments that is not addressed by existing asymp-totic analysis, one in which the analyst has enough data to perform meaningful inference,but not so much data that taking optimal actions becomes asymptotically trivial.In this paper, we study a new way of performing asymptotic analysis for sequentialexperiments that eliminates this discrepancy. Rather than taking sample size n to infinitywhile otherwise keeping the problem setting fixed, we also let the reward distributions { P nk } Kk =1 change with n such as to preserve the difficulty of the learning task as n gets large,and to keep the resulting asymptotic regret in line with worst-case results from Auer et al.[2002]. Specifically, we consider a sequence of systems, indexed by n = 1 , , . . . , and for each n we consider a K -arm sequential experiment with sample size n and reward distribution { P nk } Kk =1 . Furthermore, we assume that the reward distributions satisfy µ nk = E P nk [ Y ] = µ k (cid:14) √ n, ( σ nk ) = Var P nk [ Y ] = σ k (2)for k = 1 , . . . , K , where µ k ∈ R and σ k ≥ n . The main feature ofthis scaling is that, even as n becomes large, we cannot hope to estimate µ k with arbitrarilyhigh accuracy. Rather, even after n samples are collected, there remains genuine uncertaintyabout the relative merits of each arm. This type of scaling may be particularly relevant inscientific settings where the sample size n used by the analyst is tailored to the scale of theeffects they are expecting to encounter.In this regime, we show that the behavior of a class of methods for sequential experi-mentation converges to a diffusion limit. This result applies to a wide variety of adaptiveexperimentation rules arising in statistics, machine learning and behavioral economics. Wesubsequently use this connection to derive new insights about Thompson sampling, a popularBayesian heuristic for sequential experimentation [Thompson, 1933]. We show that, in thisdiffusion regime, it is essential to use what we refer to as “asymptotically undersmoothed”Thompson sampling, i.e., to use a prior whose regularizing effect becomes vanishingly smallin the limit. Without asymptotic undersmoothing, the regret of Thompson sampling canbecome unbounded as | µ k | grows, a rather counter-intuitive finding given that one wouldexpect the learning task to become easier as the signal strength increases. If we consider translation-invariant algorithms, only the arm differences need to scale with n , and wecould consider µ nk = µ + δ k / √ n ; see Section 5 for an example. .1 Related Work The choice of the diffusion scaling (2) and the ensuing functional limit are motivated byinsights from both queueing theory and statistics. The scaling (2) plays a prominent rolein heavy-traffic diffusion approximation in queueing networks [Gamarnik and Zeevi, 2006,Harrison and Reiman, 1981, Reiman, 1984]. Here, one considers a sequence of queueingsystems in which the excessive service capacity, defined as the difference between arrival rateand service capacity, decays as 1 / √ T , where T is the time horizon. 
Under this asymptoticregime, it is shown that suitably scaled queue-length and workload processes converge toreflected Brownian motion. Like in our problem, the diffusion regime here is helpful becauseit captures the most challenging problem instances, where the system is at once stable andexhibiting non-trivial performance variability. See also Glynn [1990] for an excellent surveyof diffusion approximation in operations research.The diffusion scaling (2) is further inspired by a recurring insight from statistics that,in order for asymptotic analysis to yield a normal limit that can be used for finite-sampleinsight, we need to appropriately down-scale the signal strength as the sample size gets large.One concrete example of this phenomenon arises when we seek to learn optimal decisionrules from (quasi-)experimental data. Here, in general, optimal behavior involves regretthat decays as 1 / √ n with the sample size [Athey and Wager, 2021, Kitagawa and Tetenov,2018]; however, this worst-case regret is only achieved if we let effect sizes decay as 1 / √ n .For any fixed sampling design, it’s possible to achieve faster than 1 / √ n rates asymptotically[Luedtke and Chambaz, 2017]. The multi-arm bandit problem is a popular framework for studying sequential exper-imentation; see Bubeck and Cesa-Bianchi [2012] for a broad discussion. The Θ(log( n ))regret scaling in the fixed- P asymptotic regime was established in the seminal work of Laiand Robbins [1985]. The worst-case regret of Θ( √ Kn ) for a fixed problem horizon of n wasfirst shown in Auer et al. [2002]. It is worth-noting that the problem instance that achievesthe √ Kn regret lower bound in Auer et al. [2002] involves the same mean reward scalingas (2), further affirming the idea that the diffusion scaling proposed here captures the mostchallenging sub-family of learning tasks.Thompson sampling [Thompson, 1933] has gained considerable popularity in recent yearsthanks to its simplicity and impressive empirical performance [Chapelle and Li, 2011]. Re-gret bounds for Thompson sampling have been established in the frequentist [Agrawal andGoyal, 2017] and Bayesian [Bubeck and Liu, 2014, Lattimore and Szepesv´ari, 2019, Russoand Van Roy, 2016] settings; the setup here belongs to the first category. None of the exist-ing instance-dependent regret bounds, however, appears to have sufficient precision to yieldmeaningful characterization in our regime. For example, the upper bound in Agrawal and At a higher level, a similar phenomenon also arises when considering optimal estimation of θ from n samples drawn from a probability distribution parametrized by θ . One might conjecture that popularestimators (such as the maximum likelihood estimator) should be asymptotically optimal in general, butthis is unfortunately not true; one can design “superefficient” estimators as counterexamples. In response,Le Cam [1960] proposed local asymptotic normality as a framework for studying estimation in large samples.The key insight is that if we focus on a sequence of models parameterized by θ n = θ + h/ √ n , i.e., 1 / √ n -scale perturbations of a known θ , then the problem of estimating h using n samples from a distributionparametrized by θ n becomes asymptotically equivalent to the problem of estimating h from a single Gaussianrandom variable with mean h and variance depending only on θ . 
This fact can then be used to providea clean theory of asymptotically optimal estimation, and to reveal to role of Gaussian in a wide varietyof statistical problems; see van der Vaart [1998] for a modern textbook treatment. In our setting, weanalogously find that running a sequential experiment is asymptotically equivalent to controlling a diffusionprocess once we down-scale effect sizes by 1 / √ n , and this provides a natural generalization of Gaussianapproximation theory to sequential problems. / ∆ , where ∆ is the gap in mean rewardbetween optimal and sub-optimal arms. This constant would have led to a trivial bound of O ( n ) in our regime, where mean rewards scale as 1 / √ n . Furthermore, most of the existingfinite-time bounds seem to require delicate assumptions on the reward distributions (e.g.,bounded support, exponential family). In contrast, the diffusion asymptotics adopted in thispaper are universal in the sense that they automatically allow us to obtain approximationsfor a much wider range of reward distributions, requiring only a bounded fourth moment.Diffusion approximations have been also been used for optimal stopping in sequential ex-periments [Siegmund, 1985]. In this literature, the randomization is typically fixed through-out the horizon. In contrast, in our multi-arm bandit setting the probabilities in the ran-domization depend on the history which creates a qualitatively different limit object. Ourwork is also broadly related, in spirit, to recent work on models of learning and experimenta-tion using diffusion processes in the operations research literature [Araman and Caldentey,2019, Harrison and Sunar, 2015, Wang and Zenios, 2020]. As discussed above, the goal of this paper is to establish a diffusion limit for a class ofsequential experiments under triangular array asymptotics characterized by (2). For thispurpose, we focus on sequentially randomized experiments whose sampling probabilitiesdepend on past observations only through the state variables Q k,i = i (cid:88) j =1 ( { A j = k } ) , S k,i = i (cid:88) j =1 ( { A j = k } ) Y j , (3)where Q k,i counts the cumulative number of times arm k has been chosen by the time wecollect the i -th sample, and S k,i measures its cumulative reward. When useful, we use theconvention Q k, = S k, = 0. As shown via examples below, the abstraction of sequentiallyrandomized Markov experiments covers many popular ways of running sequential experi-ments. Definition 1. A K -arm sequentially randomized Markov experiment chooses the i -th action A i by taking a draw from a distribution A i (cid:12)(cid:12) { A , Y , . . . , A i − , Y i − } ∼ Multinomial( π i ) , (4) where the sampling probabilities are computed using a measurable sampling function ψ , ψ : [0 , K × R K → ∆ K , π i = ψ ( Q i − , S i − ) , (5) where ( Q i − ) = ( Q k,i − ) k =1 ,...,K , S i − = ( S k,i − ) k =1 ,...,K , and ∆ K is the K -dimensionalunit simplex. Example 1.
Thompson sampling is a popular Bayesian heuristic for running sequentialexperiments [Thompson, 1933]. In Thompson sampling an agent starts with a prior beliefdistribution G on the reward distributions { P k } Kk =1 . Then, at each step i , the agent drawsthe k -th arm with probability ρ k,i corresponding to their posterior belief G i − that P k hasthe highest mean, and any so-gathered information to update the posterior G i using Bayes’rule. The motivation behind Thompson sampling is that it quickly converges to pulling the4est arm, and thus achieves low regret [Agrawal and Goyal, 2017, Chapelle and Li, 2011].Thompson sampling does not always satisfy Definition 1. However, widely used modelingchoices involving exponential families for the { P k } Kk =1 and conjugate priors for G result inthese posterior probabilities ρ k,i satisfying the Markov condition (5) [Russo et al., 2018], inwhich case Thompson sampling yields a sequentially randomized Markov experiment in thesense of Definition 1. See Sections 4 and 5 for further discussion. Example 2.
Exploration sampling is a variant of Thompson sampling where, using notationfrom the above example,we pulling each arm with probability π k,i = ρ k,i (1 − ρ k,i ) / (cid:80) Kl =1 ρ l,i (1 − ρ l,i ) instead of π k,i = ρ k,i [Kasy and Sautmann, 2021]. Exploration sampling is preferred toThompson sampling when the analyst is more interested in identifying the best arm thansimply achieving low regret [Kasy and Sautmann, 2021, Russo, 2020]. Exploration samplingsatisfies Definition 1 under the same conditions as Thompson sampling. Example 3.
A greedy agent may be tempted to always pull the arm with the highestapparent mean, S i,k /Q i,k ; however, this strategy may fail to experiment enough and prema-turely discard good arms due to early unlucky draws. A tempered greedy algorithm insteadchooses π i,k = exp (cid:20) α S k,i Q k,i + c (cid:21) (cid:30) K (cid:88) l =1 exp (cid:20) α S l,i Q l,i + c (cid:21) , (6)where α, c > Example 4.
Similar learning dynamics arise in human psychology and behavioral economicswhere an agent chooses future actions with a bias towards those that have accrued higher(un-normalized) cumulative reward [Erev and Roth, 1998, Luce, 1959, Xu and Yun, 2020].A basic example of these policies, known as Luce’s rule, uses sampling probabilities π i,k = ( S k,i ∨ α ) (cid:30) K (cid:88) l =1 ( S l,i ∨ α ) , (7)where α > k a weight f ( S k,i ), where f is a non-negativepotential function, and sample actions with probabilities proportional to the weights. Thedecision rule in (7) only depends on S and thus satisfies (5). Example 5.
The Exp3 algorithm, proposed by Auer et al. [2002], uses sampling probabil-ities π i,k = exp α i − (cid:88) j =1 ( { A j = k } ) Y j π k,j (cid:30) K (cid:88) l =1 exp α i − (cid:88) j =1 ( { A j = l } ) Y j π l,j , (8)where again α > { P k } Kk =1 may be non-stationaryand change arbitrarily across samples. The sampling probabilities (8) do not satisfy (5),and so the Exp3 algorithm is not covered by the results given in this paper; however, itis plausible that a natural extension of our approach to non-stationary problems could bemade accommodate it. We leave a discussion of non-stationary problems to further work.5 iffusion scaling We now consider a sequence of experiments indexed by n . In order forsequentially randomized Markov experiments to admit a limit distribution in the triangulararray setting (5), we further need the sampling functions ψ n used in the n -th experimentto converge in an appropriate sense. As discussed further below, the natural scaling of thethe Q k,i and S k,i state variables defined in (3) is Q nk,i = 1 n i (cid:88) j =1 ( { A j = k } ) , S nk,i = 1 √ n i (cid:88) j =1 ( { A j = k } ) Y j . (9)We then say that a sequence of sampling functions ψ n is convergent if it respects this scaling. Definition 2.
Writing sampling functions in a scale-adapted way as follows, ¯ ψ n ( q, s ) = ψ n (cid:0) nq, √ ns (cid:1) , q ∈ [0 , K , s ∈ R K , (10) we say that a sequence sampling functions ψ n satisfying (5) is convergent if, for all valuesof q ∈ [0 , K and s ∈ R K , we have lim n →∞ ¯ ψ n ( q, s ) = ψ ( q, s ) (11) for a limiting sampling function ψ : [0 , K × R K → ∆ K . Our main result is that, when performed on a sequence of reward distributions satisfying(2) and under a number of regularity conditions discussed further below, the sample pathsof the scaled statistics Q nk,i and S nk,i of a sequentially randomized Markov experimentswith convergent sampling functions converge in distribution to the solution to a stochasticdifferential equation dQ k,t = ψ k ( Q t , S t ) dt,dS k,t = µ k ψ k ( Q t , S t ) dt + σ k (cid:112) ψ k ( Q t , S t ) dB k,t , (12)where B · , t is a standard K -dimensional Brownian motion, µ k and σ k and the mean andvariance parameters given in (2), and the time variable t ∈ [0 ,
1] approximates the ratio i/n . A formal statement is given in Theorem 2.We end this section by commenting briefly on conditions under which we may hope forsampling functions ψ n to be convergent in the sense of Definition 2. The tempered greedymethod from Example 3 can immediately be seen to be convergent, provided we use asequence of tuning parameters α n and c n satisfying lim n →∞ √ nα n = α and lim n →∞ nc n = c for some α, c ∈ R + , resulting in a limiting sampling function ψ ( q, s ) = exp (cid:20) α s k q k + c (cid:21) (cid:30) K (cid:88) l =1 exp (cid:20) α s l q l + c (cid:21) . (13)For tempered greedy sampling to be interesting, we in general want the limit α to bestrictly positive, else the claimed diffusion limit (12) will be trivial. Conversely, for thesecond parameter, both the limits c > c = 0 may be interesting, but working inthe c = 0 limit may lead to additional technical challenges due to us getting very close todividing by 0.Meanwhile, as discussed further in Sections 4 and 5, variants of Thompson sampling as inExamples 1 and 2 can similarly be made convergent via appropriate choices of the prior G ;6nd we will again encounter questions regarding whether a scaled parameter analogous to c n in Example 3 converges to 0 or has a strictly positive limit. Finally, similar convergenceshould hold for Luce’s rule in Example 4 provided that √ nα n converges to a positive limit. Remark . With some appropriate adjustments, onemay also express the upper-confidence bound (UCB) and (cid:15) -greedy algorithms in a form thatis consistent with Definition 1. Unfortunately, our main results on diffusion approximationdo not currently cover these two algorithms. The main reason is that the sampling functions ψ for these algorithms are discontinuous with respect to the underlying state ( Q, S ). Thiscauses a problem because the convergence to a diffusion limit, as well as the well-posed-nessof the limit stochastic integral, requires ψ to be appropriately continuous (Assumption 1).Modifying the UCB and (cid:15) -greedy in such a manner as to ensure some smoothness in thesampling probability should resolve this issue. Whether a well defined diffusion limit existseven under a discontinuous sampling function, such as that of vanilla UCB or (cid:15) -greedy,remains an open question. We state our main results in this section: Given any Lipschitz sampling functions, we willshow that a suitably scaled version of the process ( Q ni , S ni ) converges to an Itˆo diffusionprocess. We will make the following assumptions on the sampling functions: Assumption 1.
Assume the following is true:1. The limiting sampling function ψ is Lipschitz-continuous.2. The convergence of ¯ ψ n to ψ (Definition 2) occurs uniformly over compact sets.Define Q nt to be the linear interpolation of Q n (cid:98) tn (cid:99) , Q nk,t = (1 − tn + (cid:98) tn (cid:99) ) Q nk, (cid:98) tn (cid:99) + ( tn − (cid:98) tn (cid:99) ) Q nk, (cid:98) tn (cid:99) +1 , t ∈ [0 , , k = 1 , . . . , K, (14)and define the process S nt analogously. Let C be the space of continuous functions [0 , (cid:55)→ R K equipped with the uniform metric: d ( x, y ) = sup t ∈ [0 , | x ( t ) − y ( t ) | , x, y ∈ C . We havethe following result; the proof is given in Section 7.1. Theorem 2.
Fix K ∈ N , µ ∈ R K and σ ∈ R K + . Suppose that Assumption 1 holds, and ( Q n , S n ) = 0 . Then, as n → ∞ , ( Q nt , S nt ) t ∈ [0 , converges weakly to ( Q t , S t ) t ∈ [0 , ∈ C ,which is the unique solution to the following stochastic differential equation over t ∈ [0 , :for k = 1 , . . . , K dQ k,t = ψ k ( Q t , S t ) dt,dS k,t = ψ k ( Q t , S t ) µ k dt + (cid:112) ψ k ( Q t , S t ) σ k dB k,t , (15) where ( Q , S ) = 0 , and B t is a standard Brownian motion in R K . Furthermore, for anybounded continuous function f : R K (cid:55)→ R , lim n →∞ E (cid:104) f ( Q nt , S nt ) (cid:105) = E [ f ( Q t , S t )] , ∀ t ∈ [0 , . (16)7he following theorem gives a more compact representation of the stochastic differentialequations in Theorem 2, showing that they can be written as a set of ordinary differentialequations driven by a Brownian motion with a random time change, t ⇒ Q t . The resultwill be useful, for instance, in our subsequent analysis of Thompson sampling. The proof isgiven in Section 7.2. Theorem 3.
The stochastic differential equation in (21) can be equivalently written as dQ k,t = ψ k ( Q t µ + σW Q t , Q t ) dt, k = 1 , . . . , K, (17) with Q = 0 , where W is a K -dimensional standard Brownian motion. Here, Q t µ and σW Q t are understood to be vectors of element-wise products, with Q t µ = ( Q k,t µ k ) k =1 ,...,K ,and σW Q t = ( σ k W k,Q k,t ) k =1 ,...,K . In particular, we may also represent S t explicitly as afunction of Q and W : S k,t = Q k,t µ k + σ k W Q k,t , k = 1 , . . . , K, t ∈ [0 , . (18) We discuss in this section an immediate consequence of Theorem 2: We can associate thedistribution of the diffusion process as a solution to a certain partial differential equation,known commonly as the Kolmogorov backward equation. To this end, let us first introducesome notation to streamline the presentation of the result. Define Z t = ( Q t , S t ). Denoteby I S and I Q the indices in Z t corresponding to the coordinates of S and Q , respectively.Both sets are understood to be an ordered set of K elements, where the subscript is fordistinguishing whether the ordering is applied to S versus Q . For z ∈ R K + × R K , define thefunctions ( b k ) k ∈I Q ∪I S , b k ( z ) = (cid:40) ψ k ( z ) , k ∈ I Q ,ψ k ( z ) µ k , k ∈ I S . (19)For 1 ≤ k, l ≤ K , define η k,l ( z ) = (cid:40)(cid:112) ψ k ( z ) σ k , if k = l ∈ I S ,0 , otherwise. (20)Then, the Itˆo diffusion SDE in (15) can be written more compactly as dZ t = b ( Z t ) dt + η ( Z t ) dB t , t ∈ [0 , , (21)with Z = 0. Let R = R K + × R K . Define the infinitesimal generator L of the Itˆo diffusion Z t as Lf ( z ) = lim t ↓ E (cid:2) f ( Z t ) (cid:12)(cid:12) Z = z (cid:3) − f ( z ) t , z ∈ R , (22)and let D L be the set of functions f for which the above limit exists for all z ∈ R . Denote by C ( R ) the set of twice continuously differentiable functions on R with a compact support.We have the following theorem; the proof is relatively standard and achieved by recognizingthat Z t is an Itˆo diffusion and applying Dynkin’s formula. See Oksendal [2013, Theorems7.3.3, 8.1.1] and Durrett [1996, Theorem 7.3.4].8 heorem 4 (Kolmogorov Backward Equation) . Let ( Z t ) t ∈ [0 , be the limiting diffusion inTheorem 2. The following holds:1. If f ∈ C ( R ) , then f ∈ D L , and the generator can be expressed explicitly as: Lf ( z ) = (cid:88) k b k ( z ) ∂∂z k f ( z ) + 12 (cid:88) k,l ( ηη (cid:124) ) k,l ( z ) ∂ ∂z k ∂z l f ( z ) , z ∈ R . (23)
2. Fix f ∈ C ( R K ) , and let u ( z, t ) = E (cid:2) f ( Z t ) (cid:12)(cid:12) Z = z (cid:3) . (24) The function u ( · , t ) belongs to D L for all t ∈ [0 , . Furthermore, u satisfies thefollowing partial differential equation: ∂∂t u ( z, t ) =( Lu ( · , t ))( z ) , t ∈ (0 , , z ∈ R ,u ( · ,
0) = f, (25) with L defined in (22) .3. Conversely, let w ( z, t ) : R × [0 , → R be a continuous bounded function that is twicecontinuously differentiable in the first argument. Suppose that w is a solution to (25) ,with L given explicitly as in (23) , then we must have w = u , i.e., w ( z, t ) = u ( z, t ) = E (cid:2) f ( Z t ) (cid:12)(cid:12) Z = z (cid:3) . (26) Remark . Note that the PDE in (25) is given respect to the implicitly defined L in (22).Unfortunately, when L is explictily given as in (23), the theorem does not guarantee thatthe solution to (25) exists; see Section 6 for further discussion. As a first application of our framework, we consider the following one-arm sequential exper-iment. In periods i = 1 , . . . , n , an agent has an option to draw from a distribution P n with(unknown) mean µ n and (known) variance ( σ n ) , or do nothing and receive zero reward. Asdiscussed above, we are interested in the regime where n → ∞ , and µ n = µ/ √ n for somefixed µ ∈ R while σ n = σ remains constant.Following the paradigm of Thompson sampling, the agent starts with a prior beliefdistribution G n on P n . Then, at each step i , the agent draws a new sample with probability π ni = P G ni − ( µ n > G ni usingBayes’ rule. Furthermore, we assume that the agent takes P n to be a Gaussian distributionwith (unknown) mean µ n and (known) variance σ , and sets G n to be a Gaussian prior on µ n with mean 0. Thus, writing I i for the even that we draw a sample in the i -th period and Y i for the observed outcome, we get µ n (cid:12)(cid:12) G ni ∼ N (cid:32) σ − (cid:80) ij =1 I j Y j σ − (cid:80) ij =1 I j + ( ν n ) − , σ − (cid:80) ij =1 I j + ( ν n ) − (cid:33) ,π i = Φ σ − (cid:80) ij =1 I j Y j (cid:113) σ − (cid:80) ij =1 I j + ( ν n ) − , (27)9here ( ν n ) is the prior variance and Φ is the standard Gaussian cumulative distributionfunction. Qualitatively, one can motivate this sampling scheme by considering an agentgambling at a slot machine: Here, µ n represents the expected reward from playing, and theagent’s propensity to play depends on the strength of their belief that this expected rewardis positive.An interesting question in this case is how we scale the prior variance nu n used in theThompson sampling heuristic. A first choice is to choose a scaling such thatlim n →∞ ( ν n ) − /n = c > , (28)in which case Theorem 2 immediately implies that the scaled sample paths of S n and Q n converge weakly to the solution to the stochastic differential equation dQ t = π t dt, dS t = µπ t dt + √ π t dB t , π t = Φ (cid:32) S t σ (cid:112) Q t + σ c (cid:33) . (29)We refer to Thompson sampling with this scaling of the prior variance as “smoothed”Thompson sampling. As a direct corollary of Theorem 3, we obtain the following more com-pact characterization of the diffusion limit for the one-arm smoothed Thompson sampling: Theorem 6.
Let W t be a one-dimensional standard Brownian motion. Define Π( c, q ) = Φ (cid:32) qµ + W q σ (cid:112) q + σ c (cid:33) . (30) The stochastic differential equation in (29) can be equivalently written as dQ t =Π( c, Q t ) dt, Q = 0 . (31)Alternatively, one could also consider a setting where nu n = ν > n , or where nu n decays slowly enough that lim n →∞ ( ν n ) − /n = 0. This is the scaling ofThompson sampling that is most commonly considered in practice; for example, nu n issimply set to 1 in [Agrawal and Goyal, 2017]. Such “undersmoothed” Thompson samplingis not covered by Theorem 2, because the drift function ψ now violates the Lipscthiz condi-tion, and it is not clear whether the stochastic differential equation (15) admits a solution.Fortunately, the following theorem ensures that the limiting diffusion process almost surelyadmits a unique limit as c →
0; the proof makes use of the law of the iterated logarithm (LIL)of Brownian motion to show that the Π(0 , Q t ) does not exhibit too wild of an oscillationnear t = 0. We provide the proof in Section 7.3. Theorem 7 (Diffusion Limit for Undersmoothed Thompson Sampling) . The diffusion limit ( Q t ) t ∈ [0 , under Thompson sampling converges uniformly to a limit ˜ Q as c → almostsurely. Furthermore, ˜ Q is a strong solution to the stochastic differential equation: d ˜ Q t = Π(0 , ˜ Q t ) dt = Φ µ (cid:113) ˜ Q t σ + W ˜ Q t σ (cid:113) ˜ Q t dt, ˜ Q = 0 . (32)The next result combines Theorems 2 and 7 to show that if nu n scales at an appropriaterate, then the pre-limit sample path of one-arm Thompson sampling converges to the dif-fusion limit (32) with c = 0; the claim follows immediately by taking a triangulation limitacross ν n and n . 10 n = 10 / √ n , ν n = 1 / √ n µ n = 10 / √ n , ν n = 1 t t µ n = − / √ n , ν n = 1 / √ n µ n = − / √ n , ν n = 1 t t Figure 1: Convergence to the diffusion limit under one-arm Thompson sampling. The plotsshow the evolution of the scaled cumulative reward S nnt , and its limiting process S t , with n = 1500 over 1000 simulation runs. The lines and shades represent the empirical mean andone empirical standard deviation from the mean, respectively. The two columns correspondto the smoothed (left) and undersmoothed (right) Thompson sampling, respectively. Corollary 8 (Diffusion Approximation of Undersmoothed Thompson Sampling) . Thereexists a sequence ( nu n ) n ∈ N with lim n →∞ ( ν n ) − = 0 , such that ( ¯ Z n ) t ∈ [0 , converges weaklyto the solution to the stochastic differential equation in (32) as n → ∞ .
10 −5 0 5 10 . . . . mu r eg r e t −10 −5 0 5 10 . . . . . . mu d r a w s Figure 2: Regret profile under one-arm Thompson sampling, for c = 1, 1/2, 1/4, 1/8, 1/16,1/32, 1/64, 1/256, 1/1024, and finally c = 0, with variance σ = 1. The curves with positivevalues are shown in hues of red with darker-colored hues corresponding to smaller values of c , while c = 0 is shown in black. As discussed in the introduction, one standard metric for evaluating the performance ofbandit algorithms is regret (1). The guarantee (16) from Theorem 2 immediately implies thatthe large-sample behavior of regret is captured by the limiting stochastic differential equationunder our diffusion asymptotic setting, and that (appropriately scaled) regret converges indistribution: R nn √ n ⇒ R := ( µ ) + − µQ . (33)Thus, given access to the distribution of the final state Q in our diffusion limit, we alsoget access to the distribution of regret. In Figure 2, we plot both expected regret E [ R ] and E [ Q ] as a function of µ , across for several choices of smoothing parameter c (throughout,we keep σ = 1). This immediately highlights several properties of Thompson sampling;some well known, and some that are harder to see without our diffusion-based approach.First, we see that when µ = 0 we have E [ Q ] < .
5, meaning that bandits are biasedtowards being pessimistic about the value of an uncertain arm. This is in line with wellestablished results about the bias of bandit algorithms [Nie et al., 2018, Shin et al., 2019].Second, we see that these regret profiles are strikingly asymmetric: In general, getting lowregret when µ > µ <
0. This again matches what one mightexpect: When µ <
0, there is a tension between learning more about the data-generatingdistribution (which requires pulling the arm), and controlling regret (which requires notpulling the arm), resulting in a tension between exploration and exploitation. In contrast,when µ >
0, pulling the arm is optimal both for learning and for regret, and so as soon as thebandit acquires a belief that µ > c and regret. As predicted in Theorem 7, we see that regret in fact converges as c → c = 0 is veryclose to being optimal regardless of the true value of µ . When µ <
0, any deviations frominstance-wise optimality are not visible given the resolution of the plot, whereas for somevalues of µ > c = 0 is sub-optimal but not by much.Furthermore, the Bayesian heuristic behind Thompson sampling appears to be mostlyuninformative about which choices of c will perform well in terms of regret. For example,when µ = −
2, one might expect a choice c = 1 / c = 1 / ν = 2, in line with the effect size. But this is in facta remarkably poor choice here: By setting c = 0 we achieve expected scaled regret of 0.44,but with c = 1 / Motivated by the previous observations, we now pursue a a more formal analysis of theinterplay between c and µ when µ is large. Specifically, we study the regret scaling ofone-arm Thompson sampling in what we refer to as the super-diffusive regime: We firsttake the diffusion limit as n → ∞ for a fixed µ , and subsequently look at how the resultinglimiting process behaves in the limit as | µ | → ∞ . In other words, this is a regime of diffusionprocesses where the magnitude of the scaled mean-rewards, µ , is relatively large.A striking finding here is that we see a sharp separation in the regret performance of one-arm Thompson sampling, depending on whether it is smoothed ( c >
0) or undersmoothed( c = 0). Below, we will use the following asymptotic notation: We write that f ( x ) ≺ g ( x ),if for any β ∈ (0 , f ( x ) g ( x ) β → , (34)as x tends to a certain limit. We have the following theorem; the proof is given in Section7.4. Theorem 9.
Consider the diffusion limit associated with one-arm Thompson sampling,where ( ν n ) /n ↓ c as n → ∞ . Then, the following holds almost surely:1. If c > , then R → ∞ , as µ → −∞ ,R ≺ /µ, as µ → + ∞ . (35)
2. If c = 0 , then R ≺ / | µ | , as | µ | → ∞ . (36) Here, R is the scaled regret defined in (33) . We discuss below two interesting implications of Theorem 9:13 he value of undersmoothing
We see that undersmoothed Thompson sampling alwaysachieves vanishing regret in the super-diffusive limit, regardless of whether µ tends to positiveor negative infinity. In contrast, the regret of smoothed Thompson sampling explodes as µ → −∞ . This is because, as made clear in the proof, Thompson sampling with anynon-trivial smoothing will lead to the algorithm being too slow in shutting down a badarm, thus leading to inferior regret in the super-diffusive limit. The asymmetry betweenthe two smoothing methods can also be observed numerically in Figures 1 and 2. From apractical perspective, this result suggests that using smoothed Thompson sampling shouldbe considered risky, as this algorithm may have arbitrarily scaled regret—and furthermoremay have arbitrarily high excess regret relative to undersmoothed Thompson sampling. Minimax optimality of Thompson sampling
Theorem 9 also provides a quantitativerate of convergence for the regret: With undersmoothed Thompson sampling, regret decaysfaster than 1 / | µ | − (cid:15) for any (cid:15) > | µ | → ∞ . It is interesting to compare this againstknown lower bounds in the frequentist stochastic bandit literature. In this setting, it isknown that across a family of K -arm bandit problems where all but one arm has the samemean reward, the minimax regret is bounded from below by [Mannor and Tsitsiklis, 2004] CK
1∆ log (cid:18) ∆ nK (cid:19) , (37)where n is the horizon and ∆ the gap between the mean reward of the optimal arm versusthe rest and C is a universal constant. In our setting, with ∆ = | µ n | = | µ | / √ n and K = 2,(37) would suggest that E [ R n ] √ n ≥ C √ n | µ n | log (cid:18) n ( µ n ) (cid:19) = 2 C log( | µ | / | µ | . (38)Comparing this with (36) in Theorem 9 shows that the regret of undersmoothed Thompsonsampling almost matches the minimax finite-sample regret lower bound (up to an arbitrarilysmall polynomial factor). To the best of our knowledge, this behavior of Thompson samplingin the super-diffusive regime has not been reported in the literature.Furthermore, it is interesting to notice that most known algorithms that attain regretupper bounds on the order of (37) rely on substantially more sophisticated mechanisms,such as adaptive arm elimination and time-dependent confidence intervals [Auer and Ortner,2010]. It is thus both surprising and encouraging that such a simple and easily implementableheuristic as Thompson sampling should achieve near-optimal minimax regret. Admittedly,Theorem 9 is restricted to the relatively simple one-arm setting. We are hopeful that similarinsights can be generalized to Thompson sampling applied to general K -arm bandits; seeSection 6 for further discussion. One final insight given by the diffusion limit in Theorems 6 and 7 is that they give us asharp characterization of the evolution of the sampling probabilities π t = Π( c, Q t ) over time.Tracking the sampling probabilities π t for Thompson sampling is particularly interesting,since π t corresponds to the the subjective time- t belief that µ > c = 0, an application of the law ofthe iterated logarithm to the representation (30) for the sampling probabilities immediatelyimplies the following. 14 .0 0.2 0.4 0.6 0.8 1.0 . . . . . . t s a m p li ng p r obab ili t y . . . . . . t s a m p li ng p r obab ili t y µ = − µ = 3Figure 3: Sample paths of the sampling probability π t in one-arm Thompson sampling asdefined in (29), in the undersmoothed regime c = 0. Corollary 10.
In the setting of Theorem 8, regardless of the effect size µ ∈ R , we have lim n →∞ P (cid:18) sup ≤ i ≤ n π n ≥ − η (cid:19) = lim n →∞ P (cid:18) inf ≤ i ≤ n π n ≥ η (cid:19) = 1 , (39) for any η > . In other words, in the undersmoothed limit, Thompson sampling will almost alwaysat some point be arbitrarily convinced about µ having the wrong sign; and this holds nomatter how large | µ | really is. However, Thompson sampling will eventually recover, thusachieving low regret. We further illustrate sample paths in the case with c = 0 in Figure3. At the very least, this finding again challenges a perspective that would take Thompsonsampling literally as a principled Bayesian algorithm (since in this case we’d expect beliefdistributions to follow martingale updates), and instead highlights that Thompson samplinghas subtle and unexpected behaviors that can only be elucidated via dedicated methods. We now extend the discussion from the previous section to the case of a two-armed bandit.In periods i = 1 , . . . , n , an agent has chooses which of two distributions P n or P n to drawfrom a distribution P , each with (unknown) mean µ nk and (known) variance ( σ nk ) . Theagent uses a variant of translation-invariant Thompson sampling, whereby they start withone forced draw from each arm, and subsequently pull arm 1 in period i with probability π i = Φ α − i ∆ i (cid:113) α − i + ( ν n ) − , ∆ i = (cid:18) S ,i Q ,i − S ,i Q ,i (cid:19) , α i = σ iQ ,i Q ,i , (40)15 . . . . . arm gap r eg r e t . . . . . . arm gap d r a w s on bad a r m Figure 4: Regret profile for two-arm Thompson sampling, for c = 1, 1/2, 1/4, 1/8, 1/16,1/32, 1/64, 1/256, 1/1024, and finally c = 0. We use σ = 1 throughout. The left panelshows expected regret, while the right panel shows E [ Q ]. The curves with positive valuesare shown in hues of red with darker-colored hues corresponding to smaller values of c , while c = 0 is shown in black. The algorithm (40) is translation-invariant and symmetric in itstreatment of the arms, so regret only depends on δ = | µ − µ | .where nu n is interpreted as the prior standard deviation for the arm difference δ n = µ n − µ n .Here, we note that Q ,i + Q ,i = i , and so from here on out we define a single variable Q i such that Q ,i = Q t and Q ,i = i − Q i .As usual, we focus on asymptotics along a sequence of problems with µ nk = µ k / √ n and ( σ nk ) = σ . We also need to choose a scaling for nu n . Again, one option is touse non-vanishing smoothing, lim n →∞ n ( ν n ) − = c . In this case, Theorem 2 immediatelyimplies that our system converges in distribution to the solution of the stochastic differentialequation dS ,t = µ π t dt + √ π t σdB ,t , dS ,t = µ (1 − π t ) dt + √ − π t σdB ,t ,dQ t = π t dt, π t = Φ (cid:32) σ − Q t ( t − Q t ) ( S ,t /Q t − S ,t / ( t − Q t )) (cid:112) σ − tQ t ( t − Q t ) + t c (cid:33) . (41)Meanwhile, in the undersmoothed case lim n →∞ n ( ν n ) − = 0, our findings from Theorem 7and Corollary 8 suggest convergence to a version of (40) with c = 0.Figure 4 shows the limiting regret of two-arm Thompson sampling in the diffusion limit,with σ = 1. At a high level, the qualitative implications closely mirror those from theone-arm as reported in Figure 2. The behavior of Thompson sampling converges as c → c = 0 are in general very strong. If anything, the c = 0 choiceis now even more desirable than before: With the one-arm case, this choice was modestlybut perceptibly dominated by other choices of c for some values of µ >
0, but here c = 0 iseffectively optimal across all δ to within the resolution displayed in Figure 4.16 draws on bad arm e rr o r f o r a r m d i ff e r en c e count −20−100 −20 −10 0 error for bad arm e rr o r f o r good a r m count Figure 5: Distribution of the final state of two-arm Thompson sampling, with δ = µ − µ =4, σ = 1 and c = 1 / . Here, the error on each arm means S k,t /Q k, − µ k , and the erroron the arm difference means is S ,t /Q , − S ,t /Q , − ( µ − µ ).Another interesting observation from Figure 4 is that, in the undersmoothed regime, the regret is maximized around δ = 4. We note that, in a randomized trial with π t = 0 . δ = 4 corresponds to an effect size that is twice the standard deviation of itsdifference-in-means estimator. In other words, δ = 4 is an effect size that can be reliablydetected using a randomized trial run on all samples, but that would be difficult to detectusing just a fraction of the data. The fact that regret is maximized around δ = 4 is consistentwith an intuition that the hardest problems for bandit algorithms are those with effects wecan detect—but just barely.We also again find that Thompson sampling in general has unstable behavior—even whenit is operating in a regime where it achieves low regret. In Figure 5, we display realizations ofThompson sampling with δ = µ − µ = 4 and c = 1 / corresponding to nu n = δ n in (40).As seen in Figure 4, this is not the regret-optimal choice of c (and c = 0 would be better),but may correspond to something an analyst hoping to achieve stable performance woulduse. Figure 5, however, dispels any illusions of stability. Although Thompson samplingusually identifies the first arm as the better one and then de-emphasizes pulling the badarm, in some realizations it ends up convinced the second arm is better and spends almostall its draws on the bad arm. In the non-trivially smoothed regime with c >
0, an analogue to Theorem 9 holds, and the regret ofThompson sampling blows up in the super-diffusive regime. With small values of c >
0, the bump near δ = 4 in Figure 4 is just a local maximum, and the curve will diverge when δ gets very large. Here, Thompson sampling did a reasonable job learning which arm is better: It only finished with S ,t /Q , − S ,t /Q , < π t = 0 . − δ/
2) = Φ( −
2) = 2 . Discussion
In this paper, we introduced an asymptotic regime under which sequentially randomizedexperiments converge to a diffusion limit. In particular, the limit cumulative reward isobtained by applying a random time change to a constant-drift Brownian motion, wherethe time change is in turn given by cumulative sampling probabilities (Theorem 3). Wethen applied this result to derive sharp insights about the behavior of one- and two-armThompson sampling.An immediate open question is whether we can establish analogous properties of un-derstmoothed Thompson sampling (Theorem 9) to general K -arm bandits. More precisely,suppose that in a general K -arm bandit problem the gap between the mean rewards of thetop two arms are ∆ / √ n . Then, in the super-diffusive limit of ∆ → ∞ , one may expect thatthe scaled regret R ≺ / ∆ with an undersmoothed version of Thomspon sampling whereas R → ∞ under smoothed Thompson sampling.A second open question is whether we can show that a solution to the PDE in (23)exists, when L is explicitly given in (23). Classical theory on this type of PDE, known asthe Cauchy problem, typically requires that the second-order operator ηη (cid:124) be uniformlypositive semi-definite, also known as the ellipticity condition [Karatzas and Shreve, 2005].This is clearly violated in our setting, because there is no diffusion along the Q t coordinates,and therefore ηη (cid:124) is zero in the lower diagonal entries. It would be interesting to see whethersuch limitations can be overcome by exploiting additional structures of the problem.Finally, another interesting follow-up question is whether the approach used here can beused to build confidence intervals using data from sequential experiments, thus adding tothe line of work pursued by Hadad et al. [2019], Howard et al. [2018], and others. We will use the following elementary lemma repeatedly; the proof of is given in AppendixA.1.
Lemma 11 (Gaussian Tail Bounds) . For all x < − (cid:112) π/ (9 − π ) : Φ( x ) ≤ | x | exp( − x / , Φ( x ) ≥ | x | exp( − x / . (42) This immediately implies that for all x > (cid:112) π/ (9 − π ) : Φ( x ) ≤ − x exp( − x / , Φ( x ) ≥ − x exp( − x / . (43) Proof.
The proof is based on the martingale framework of Stroock and Varadhan [2007],which hinges on showing that an appropriately scaled generator of the discrete-time Markovprocess converges to the infinitesimal generator of the diffusion process. We begin with areview of the relevant results of the Stroock and Varadhan framework. Fix d ∈ N . Let( Z ni ) i ∈ N be a sequence of time-homogeneous Markov chains taking values in R d , indexed by n ∈ N . Denote by Π n the transition kernel of Z n :Π n ( z, A ) = P (cid:0) Z ni +1 ∈ A (cid:12)(cid:12) Z ni = z (cid:1) , z ∈ R d , A ⊆ R d . (44)18et Z nt be the piece-wise linear interpolation of Z nnt : Z nt = (1 − tn + (cid:98) tn (cid:99) ) Z n (cid:98) tn (cid:99) + ( tn − (cid:98) tn (cid:99) ) Z n (cid:98) tn (cid:99) +1 , t ∈ [0 , . (45)Define K n ( z, A ) to be the scaled transition kernel: K n ( z, A ) = n Π n ( z, A ) . (46)Finally, define the functions a nk,l ( z ) = (cid:90) x : | z − x |≤ ( x k − z k )( x l − z l ) K n ( z, dx ) ,b nk ( z ) = (cid:90) x : | z − x |≤ ( x k − z k ) K n ( z, dx ) , ∆ n(cid:15) ( z ) = K n ( z, { x : | x − z | > (cid:15) } ) . We will use the following result. A proof of the theorem can be found in Stroock andVaradhan [2007, Chapter 11] or Durrett [1996, Chapter 8]. For conditions that ensure theuniqueness and existence of the Itˆo diffusion (50), see Karatzas and Shreve [2005, Chapter5, Theorem 2.9].
Theorem 12.
Fix d . Let { a k,l } ≤ k,l ≤ d and { b k } ≤ k ≤ d be bounded Lipschitz-continuousfunctions from R d to R . Suppose that for all k, l ∈ { , . . . , d } and (cid:15), R > n →∞ sup z : | z | In what follows, we will use z = ( q, s ) to denote a specific state of the Markov chain Z n .The transition kernel of the pre-limit chain Z n can be written asΠ n (( q, s ) , ( q + e k /n, s + e k ds/ √ n )) = ¯ ψ nk ( q, s ) P nk ( ds ) , k = 1 , . . . , K, (54)and zero elsewhere, where e k ∈ { , } K is the unit vector where the k th entry is equal to 1and all other entries are 0, and { P nk } k =1 ,...,K are the reward probability measures. Define K n ( z, A ) = n Π n ( z, A ).We next define the limiting functions a and b . The function b is defined as in (19): b k ( z ) = (cid:40) ψ k ( z ) , k ∈ I Q ,ψ k ( z ) µ k , k ∈ I S , (55)and we let a ij ( z ) = ( ηη (cid:124) ) k,l ( z ) (56)where η is defined in (20). That is, a k,l ( z ) = (cid:40) ψ k ( z ) σ k , if k = l ∈ I S ,0 , otherwise. (57)Fix R > 0. We show that the corresponding a n and b n converge to the functions a and b defined above, uniformly over the compact set { z : | z | ≤ R } . In light of Lemma 13, itsuffices to verify the convergence in (51) through (53) for p = 4. Starting with (51), we havethat m n ( z ) = (cid:90) | z (cid:48) − z | n Π n ( z, dz (cid:48) )= K (cid:88) k =1 n ¯ ψ nk ( z ) (cid:90) w ∈ R (cid:18) n + w n (cid:19) P nk ( dw ) ≤ K (cid:88) k =1 n ¯ ψ nk ( z ) (cid:18) n + 1 n (cid:90) w ∈ R w P nk ( dw ) (cid:19) = 2 n + 1 n E Z ∼ P nk (cid:2) Z (cid:3) n →∞ −→ , (58)as n → 0, uniformly over all z , where the last step follows from the assumption that thereward distributions admit bounded fourth moments. This shows (51).20or the drift term b , we consider the following two cases; together, they prove (53). Case 1, k ∈ I Q . For all k ∈ I Q , and n ∈ N ,˜ b nk ( z ) = (cid:90) ( q (cid:48) k − q k ) K n ( z, dz (cid:48) )= 1 n ( n ¯ ψ nk ( z )) n →∞ −→ ψ k ( z ) . (59) Case 2, k ∈ I S . For all k ∈ I S ,˜ b nk ( z ) := (cid:90) ( s (cid:48) k − s k ) K n ( z, dz (cid:48) )= ¯ ψ nk ( z ) n (cid:90) w √ n P nk ( dw )= ¯ ψ nk ( z ) µ kn →∞ −→ b k ( z ) . (60)For the variance term a , we consider the following three cases: Case 1, k, l ∈ I Q . Note that under the multi-arm bandit model, only one arm can bechosen at each time step. This means that only one coordinate of Q n can be updated attime, immediately implying that for all n and k, l ∈ I Q , k (cid:54) = l ,˜ a nk,l ( z ) = (cid:90) ( q (cid:48) k − q k )( q (cid:48) l − q l ) K n ( z, dz (cid:48) ) = 0 . (61)For the case k = l , we note that for all k ∈ I Q , and all sufficiently large n ˜ a nk,k ( z ) = 1 n n ¯ ψ nk ( z ) n →∞ −→ . (62) Case 2: k ∈ I Q , l ∈ I S , or k ∈ I S and l ∈ I Q .˜ a nk,l ( z ) = (cid:90) ( q (cid:48) k − q k )( q (cid:48) l − q l ) K n ( z, dz (cid:48) )= ¯ ψ nk ( z ) n (cid:90) w ( nP nk ( √ ndw ))= ¯ ψ nk ( z ) E Z ∼ P nk [ Z ]= ¯ ψ nk ( z ) µ k / √ n n →∞ −→ . (63) Case 3: k, l ∈ I S . This case divides into two further sub-cases. Suppose that k (cid:54) = l .Similar to the logic in Case 1, because only one coordinate of Q n can be updated at a giventime step, we have ˜ a nk,l ( z ) = 0 , k (cid:54) = l. (64)21uppose now that k = l . We have˜ a nk,l ( z ) = (cid:90) ( q (cid:48) k − q k ) K n ( z, dz (cid:48) )= ¯ ψ nk ( z ) (cid:90) w ( nP nk ( √ ndw ))= ¯ ψ nk ( z ) E Z ∼ P nk (cid:2) Z (cid:3) n →∞ −→ ψ k ( z ) σ k = a k,l ( z ) . 
(65)We note that due to Assumption 1, the convergence of ˜ b n , ˜ a n and m np to their respectivelimits holds uniformly over compact sets. We have thus verified the conditions in Lemma13, further implying (47) through (48). Note that because ψ k is bounded and Lipschitz-continuous, so are a and b . This proves the convergence of Z n to the diffusion limit in C . Finally, to prove the convergence of E (cid:104) f ( Z n ) (cid:105) to E [ f ( Z t )], note that the weak con-vergence of Z n in C implies that the marginal distribution, Z nt converges weakly to Z t , as n → ∞ . The result then follows immediately from the continuous mapping theorem andthe bounded convergence theorem. This completes the proof of Theorem 2. Proof. It suffices to show that (18) holds. We will begin with a slightly different, butequivalent, characterization of the pre-limit bandit dynamics. Consider the n th probleminstance. Denote by ˜ Y k,j the reward obtained from the j th pull of arm k . Then, we havethat for a fixed k , ˜ Y k, · is an i.i.d. sequence, independent from all other aspects of the system,and S k,i = Q k,i (cid:88) j =1 ˜ Y k,j . (66)We can further write ˜ Y k,j = µ k / √ n + U k,j , (67)where U k,j is a zero-mean random variable with variance σ k . Define the scaled process: U nk,i = 1 √ n i (cid:88) j =1 U k,j . (68)We thus arrive at the following expression for the diffusion-scaled cumulative reward: S nk,t = Q nk,t µ k + U nk,nQ nk,t , i = 1 , . . . , n. (69)Denote by ¯ U nk,t to be the linear interpolation of U nk, (cid:98) nt (cid:99) for t ∈ [0 , K -dimensional standard Brownian motion W such that ¯ U n convergesto { σ k W k, · } k =1 ,...,K weakly in C . Evoking Theorem 2 and the Skorohod representationtheorem, we may construct a probability space such that all of the following convergencesin C occur almost surely: S nk,t → S k,t , Q nk,t → Q k,t , ¯ U nt k, t → σ k W k,t , (70)22s n → ∞ , where S and Q are diffusion processes satisfying the SDEs in Theorem 2.We now combine (69) and (70), along with the fact that W is uniformly continuous inthe compact interval [0 , U nk,nQ nk,i → σ k W k,Q k,t , (71)in C . This further implies that S also satisfies S k,t = Q k,t µ k + σ k W k,Q k,t , (72)proving our claim. Proof. We will use the characterization of the diffusion process in (31). Although our mainfocus is on the process Z t restricted to the [0 , 1] interval, the diffusion process itself is infact well defined on t ∈ [0 , ∞ ). A useful observation we will make here is that, for any c > µ , we have that almost surely Q t → ∞ , as t → ∞ . (73)This fact can be verfied by noting that if Q t were bounded over [0 , ∞ ), then its driftΠ( c, Q t ) would have been bounded from below by a strictly positive constant, leading to acontradiction. Define τ c ( q ) to be the first time Q t reaches q , which is well defined by theabove reasoning for all q > τ c ( q ) = inf { t : Q t ≥ q } . (74)Here, we make the dependence on c explicit. Note that for all c > Q t and τ c are increasingand continuous and Q t = τ − ( t ). From (31), and with the change of variables u = s − , wehave that for q > τ c ( q ) = (cid:90) q / Φ (cid:18) sµ + W s σ √ s + σ c (cid:19) ds = (cid:90) ∞ /q u − / Φ (cid:18) µ + uW /u σ √ u + u σ c (cid:19) du = (cid:90) ∞ /q u − / Φ (cid:32) µ + ˜ W u σ √ u + u σ c (cid:33) du = (cid:90) ∞ /q h c ( u ) du (75)where ˜ W t = tW /t , and h c ( u ) = u − / Φ (cid:32) µ + ˜ W u σ √ u + u σ c (cid:33) . (76)It is well known that if W is a standard Brownian motion, then so is ˜ W t . 
By the law ofiterated logarithm of Brownian motion, we have thatlim sup t →∞ | ˜ W t |√ t log log t = 1 , a.s. (77)23ix a sample path of W such that the above is satisfied. Then, there exist M ∈ (1 /q, ∞ )and D > 0, such that ˜ W t ≥ − D (cid:112) t log log t, ∀ t ≥ M. (78)We now consider two cases depending on the sign of µ . First, suppose that µ ≥ 0. Define g ( u ) = (cid:40) u − / Φ (cid:16) µ −| ˜ W u | σ √ u (cid:17) , ≤ u < M,u − / Φ (cid:0) − Dσ √ log log u (cid:1) , u ≥ M. (79)It follows that h c ( u ) ≤ g ( u ) , ∀ c, u ≥ . (80)We now show that g is integrable, i.e., (cid:90) ∞ M g ( u ) du < ∞ (81)for all M > 1. Recall the following lower bound on the cdf of standard normal from Lemma11: for all sufficiently l Φ( x ) ≥ √ π − x x exp( − x / , x < . (82)We thus have that, for all sufficiently large u , g ( u ) = u − / Φ (cid:18) − Dσ (cid:112) log log u (cid:19) ≤ u − √ π D σ log log u Dσ √ log log u exp (cid:18) D σ log log u (cid:19) ≤ u − √ π Dσ (cid:112) log log u (log u ) D / σ ≤ b u − (log u ) b , (83)where b and b are positive constants. Noting that (log u ) a (cid:28) √ u (84)as n → ∞ for any constant a > 0, we have that b u − (log u ) b is integrable over ( a, ∞ ) forany a ∈ (0 , g in (81).Using (80), (81) and the dominated convergence theorem, we thus conclude that, for all q > 0, lim c ↓ τ c ( q ) = τ ( q ) = (cid:90) q / Φ (cid:18) sµ + W s σ √ s (cid:19) ds, a.s. (85)Recall that Q t = ( τ c ) − ( t ), the above thus implies that, for all t ∈ [0 , Q t c ↓ −→ ˜ Q t := ( τ ) − ( t ) , a.s. (86) Here and henceforth the notation f (cid:28) g denotes the asymptotic relation f ( x ) /g ( x ) → x tends toan appropriate limit; f (cid:29) g is defined analogously, with f ( x ) /g ( x ) → ∞ . , 1] since Q t is 1-Lipschitz. This proves our claim in the case where µ ≥ 0. The case for µ < M and D in (79), recognizing that the behavior of µ + ˜ W u is largely dominated by that of ˜ W u when u is large, which can be in turn bounded by the law of iterated logarithm. This completesthe proof of Theorem 7. Proof. Define the limit cumulative regret R t as: R t = ( µ ) + t − µQ t , t ∈ [0 , , (87)where Q t is the diffusion limit associated wtih Q ni . Note that R corresponds to the scaledcumulative regret R in (33). We first prove the statements in the almost sure sense. Theproof is dived into four cases. Case 1: c > , µ → −∞ . When c > 0, the drift of Q t is given byΠ( c, Q t ) = Φ (cid:32) Q t µ + W Q t σ (cid:112) Q t + σ c (cid:33) . (88)For the sake of contradiction, suppose there exists a constant C > µ : sup t ∈ [0 , Q t ≤ C/ | µ | . (89)This would imply that there exists B > µ , Q t µ + W Q t σ (cid:112) Q t + σ c ≥ − C + B σ √ c , ∀ t ∈ [0 , . (90)That is, the drift of Q t is positive and bounded away from zero for all t , independently of µ . This implies that lim inf µ →−∞ Q > , (91)leading to a contradiction with (89). We conclude that Q (cid:29) / | µ | as µ → −∞ and hencelim µ →−∞ R = lim µ →−∞ | µ | Q = ∞ , (92)as claimed. Case 2: c > , µ → ∞ . Fix α ∈ (1 , (cid:15) µ = µ − α . (93)Decompose regret as: R = R (cid:15) µ + ( R − R (cid:15) µ ) . (94)Clearly, we have that R (cid:15) µ ≤ µ(cid:15) µ = µ − ( α − . (95)For the second term in (94), note that almost surely there exist constants b , b > Q t µ + W Q t σ (cid:112) Q t + σ c ≥ Q t µ − b b , ∀ t ∈ [0 , . (96)25his shows that, there exists constant c > Q t is greater than c forall t ∈ (0 , (cid:15) µ ). As a result, we have that for all sufficiently large µ , Q t ≥ c µ α , ∀ t ≥ (cid:15) µ . 
Case 2: $c > 0$, $\mu \to \infty$. Fix $\alpha \in (1, 2)$ and let

$$\epsilon_\mu = \mu^{-\alpha}. \quad (93)$$

Decompose regret as

$$R_1 = R_{\epsilon_\mu} + (R_1 - R_{\epsilon_\mu}). \quad (94)$$

Clearly, we have that

$$R_{\epsilon_\mu} \le \mu\epsilon_\mu = \mu^{-(\alpha-1)}. \quad (95)$$

For the second term in (94), note that almost surely there exist constants $b_1, b_2 > 0$ such that

$$\frac{Q_t\mu + W_{Q_t}}{\sigma\sqrt{Q_t + \sigma^2 c}} \ge \frac{Q_t\mu - b_1}{b_2}, \quad \forall t \in [0,1]. \quad (96)$$

This shows that there exists a constant $c_0 > 0$ such that the drift of $Q_t$ is greater than $c_0$ for all $t \in (0, \epsilon_\mu)$. As a result, we have that for all sufficiently large $\mu$,

$$Q_t \ge c_0\mu^{-\alpha}, \quad \forall t \ge \epsilon_\mu. \quad (97)$$

Using Lemma 11, (96) and (97) together imply that for all large $\mu$,

$$R_1 - R_{\epsilon_\mu} = \mu\int_{\epsilon_\mu}^1 \big(1 - \Pi(c, Q_t)\big)\,dt \le c_1\mu^{\alpha/2-1}\exp(-c_2\mu^{2-\alpha}). \quad (98)$$

Combining (95) and (98), we have that for all large $\mu$,

$$R_1 \le \mu^{-(\alpha-1)} + c_1\mu^{\alpha/2-1}\exp(-c_2\mu^{2-\alpha}) \xrightarrow{\mu\to\infty} 0, \quad a.s. \quad (99)$$

Case 3: $c = 0$, $\mu \to -\infty$. We follow a similar set of steps as in Case 2. The main difference here is that $\mu$ is now negative, and as such the arguments need to be adjusted accordingly. Fix $\alpha \in (1, 2)$ and define

$$S_\mu = \inf\{t : Q_t > |\mu|^{-\alpha}\}, \quad (100)$$

with $S_\mu = 1$ if $Q_1 \le |\mu|^{-\alpha}$. Decompose $R$ with respect to $S_\mu$ as

$$R_1 = R_{S_\mu} + (R_1 - R_{S_\mu}). \quad (101)$$

We next bound the two terms on the right-hand side of the equation above separately. Since $Q_t$ is non-decreasing, we have

$$R_{S_\mu} \le |\mu| \cdot Q_{S_\mu} = |\mu|^{-(\alpha-1)}. \quad (102)$$

For the second term, the intuition is that by the time $Q_t$ reaches $|\mu|^{-\alpha}$, the drift of $Q_t$ will have already become overwhelmingly small for the rest of the time horizon. To make this rigorous, note the following facts:

1. By the LIL of Brownian motion, almost surely there exists a constant $C$ such that

$$\limsup_{q\downarrow 0}\;\sup_{x\in[q,1]}\left|\frac{W_x}{\sqrt{x}}\right|\Big/\sqrt{\log\log(1/q)} \le C. \quad (103)$$

2. $|\mu|\sqrt{Q_{S_\mu}} = |\mu|^{1-\alpha/2}$, and therefore $|\mu|\sqrt{Q_{S_\mu}} \gg \sqrt{\log\log(1/Q_{S_\mu})}$ a.s. as $\mu \to -\infty$.

Combining these facts along with the normal cdf tail bounds from Lemma 11, we have that almost surely there exists a constant $b_1 > 0$ such that, for all sufficiently negative $\mu$,

$$R_1 - R_{S_\mu} = |\mu|\int_{S_\mu}^1 \Pi(0, Q_t)\,dt \le |\mu|\left(\sup_{t\in[S_\mu,1]}\Phi\!\left(\frac{\mu\sqrt{Q_t}}{\sigma} + \frac{W_{Q_t}}{\sigma\sqrt{Q_t}}\right)\right) \le |\mu|\exp(-|\mu|^{b_1}). \quad (104)$$

Putting together (102) and (104) shows that

$$R_1 \le |\mu|^{-(\alpha-1)} + |\mu|\exp(-|\mu|^{b_1}) \xrightarrow{\mu\to-\infty} 0, \quad a.s. \quad (105)$$
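To spell out how the two facts combine with Lemma 11 to yield the exponential bound in (104) — a sketch under our reconstruction, with illustrative constants: for $t \in [S_\mu, 1]$ we have $Q_t \ge |\mu|^{-\alpha}$, so Facts 1 and 2 give

$$\frac{\mu\sqrt{Q_t}}{\sigma} + \frac{W_{Q_t}}{\sigma\sqrt{Q_t}} \;\le\; -\frac{|\mu|^{1-\alpha/2}}{\sigma} + \frac{C}{\sigma}\sqrt{\log\log\big(|\mu|^{\alpha}\big)} \;\le\; -\frac{|\mu|^{1-\alpha/2}}{2\sigma}$$

for all sufficiently negative $\mu$, since $\alpha < 2$ makes the first term dominate the iterated-logarithm term. The Gaussian tail bound of Lemma 11 then yields

$$\Phi\!\left(-\frac{|\mu|^{1-\alpha/2}}{2\sigma}\right) \;\le\; \exp\!\left(-\frac{|\mu|^{2-\alpha}}{8\sigma^2}\right) \;\le\; \exp\!\big(-|\mu|^{b_1}\big) \quad \text{for any } b_1 \in (0,\, 2-\alpha),$$

which is exactly the supremum appearing in (104).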
Case 4: $c = 0$, $\mu \to \infty$. In this case, we would like to argue that $Q_t$ increases rapidly as $\mu$ grows, and to do so it is important to be able to characterize the drift of $Q_t$ near $t = 0$. To this end, it will be more convenient for us to work with the following re-parameterization of the diffusion process, which we have already encountered in the proof of Theorem 7, (75). Let $\eta$ be a function such that

$$\eta(x) \ll 1/x, \quad \text{as } x \to \infty. \quad (106)$$

Define

$$\tau_\mu = \inf\{t : Q_t \ge 1 - \eta(\mu)\}. \quad (107)$$

It follows from the definition that, if we can show that almost surely

$$\tau_\mu < 1 \quad (108)$$

for all sufficiently large $\mu$, then it follows that for all large $\mu$,

$$R_1 = \mu(1 - Q_1) \le \mu\eta(\mu), \quad a.s. \quad (109)$$

Using a change of variables $u = s^{-1}$, identical to that in (75), we have that

$$\tau_\mu = \int_{(1-\eta(\mu))^{-1}}^\infty u^{-2}\left(\Phi\!\left(\frac{\mu}{\sigma\sqrt{u}} + \frac{\widetilde W_u}{\sigma\sqrt{u}}\right)\right)^{-1} du = \int_{(1-\eta(\mu))^{-1}}^\infty u^{-2}\,\xi(\mu, u)\,du, \quad (110)$$

where $\widetilde W_t = t W_{1/t}$ is a standard Brownian motion, and

$$\xi(\mu, u) \triangleq \left(\Phi\!\left(\frac{\mu + \widetilde W_u}{\sigma\sqrt{u}}\right)\right)^{-1}. \quad (111)$$

We now bound the above integral using a truncation argument. For $K > (1-\eta(\mu))^{-1}$, we write

$$\tau_\mu = \int_{(1-\eta(\mu))^{-1}}^K u^{-2}\xi(\mu,u)\,du + \int_K^\infty u^{-2}\xi(\mu,u)\,du \le \left(\sup_{u\in[(1-\eta(\mu))^{-1},K]}\xi(\mu,u)\right)\int_{(1-\eta(\mu))^{-1}}^\infty u^{-2}\,du + \int_K^\infty u^{-2}\xi(\mu,u)\,du = \left(\sup_{u\in[(1-\eta(\mu))^{-1},K]}\xi(\mu,u)\right)(1-\eta(\mu)) + \int_K^\infty u^{-2}\xi(\mu,u)\,du. \quad (112)$$

The following lemma bounds the second term in the above equation; the proof is given in Appendix A.3.

Lemma 14. For any $\delta \in (0,1)$, there exists $C > 0$ such that for all large $\mu$ and $K$:

$$\int_K^\infty u^{-2}\,\xi(\mu, u)\,du \le CK^{-(1-\delta)}, \quad a.s. \quad (113)$$

Bounding the first term is more delicate, and will involve taking $\mu$ to infinity in a manner that depends on $K$. Fix $\gamma \in (0,1)$ and consider the sequence of $\mu$:

$$\mu_K = K^{1/2+\gamma}, \quad K \in \mathbb{N}. \quad (114)$$

By the LIL of Brownian motion, we have that there exists $C > 0$ such that for all large $K$,

$$\inf_{u\in[(1-\eta(\mu_K))^{-1},K]} \frac{\widetilde W_u}{\sqrt{u}} \ge -C\sqrt{\log\log K}, \quad a.s. \quad (115)$$

Combining this with the lower bound on the normal cdf (Lemma 11), we have

$$\sup_{u\in[(1-\eta(\mu_K))^{-1},K]} \xi(\mu_K, u) \le \left(\Phi\!\left(\frac{\mu_K/\sqrt{K} - C\sqrt{\log\log K}}{\sigma}\right)\right)^{-1} \le 1 + \exp(-K^\gamma) \quad (116)$$

for all large $K$.

Fix $\nu \in (0,1)$, and choose $\delta, \gamma \in (0, 1/4)$ such that

$$2 > \frac{1-\delta}{1/2+\gamma} > 2-\nu. \quad (117)$$

Note that such $\delta$ and $\gamma$ exist for any $\nu$, so long as we ensure that both $\delta$ and $\gamma$ are sufficiently close to 0. Combining (112), (116) and Lemma 14, we have that there exist $c_1, c_2 > 0$ such that for all large $K$,

$$\tau_{\mu_K} \le 1 - \eta(K^{1/2+\gamma}) + \exp(-K^\gamma) + c_1 K^{-(1-\delta)} \le 1 - \eta(K^{1/2+\gamma}) + c_2 K^{-(1-\delta)}, \quad a.s. \quad (118)$$

Fix $\alpha$ such that

$$2-\nu < \alpha < \frac{1-\delta}{1/2+\gamma} < 2. \quad (119)$$

This choice of $\alpha$ exists because of (117). We now set $\eta$ to be

$$\eta(x) = x^{-\alpha}. \quad (120)$$

We have that for all sufficiently large $K$,

$$\tau_{\mu_K} \le 1 - K^{-\alpha(1/2+\gamma)} + c_2 K^{-(1-\delta)} < 1, \quad a.s., \quad (121)$$

where the last inequality follows from (119). Combining the above equation, (108), (109) and the fact that $\nu$ can be arbitrarily close to 0, we have thus shown that for all $\alpha \in (1,2)$,

$$R_1 \le \mu\eta(\mu) = \mu^{-(\alpha-1)}, \quad a.s., \quad (122)$$

for all large $\mu$. This proves our main claim in this case, that is, almost surely,

$$R_1 \ll 1/\mu^{1-\nu} \text{ for any } \nu > 0, \quad \text{as } \mu \to \infty. \quad (123)$$
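Taken together, the four cases establish the following limiting behavior for the scaled regret $R_1$ — a recap, in our notation, of what was just proved:

$$\begin{array}{cccl} \text{Case} & c & \mu & \text{behavior of } R_1 \\ 1 & c>0 & \mu\to-\infty & R_1 \to \infty \\ 2 & c>0 & \mu\to+\infty & R_1 \to 0 \\ 3 & c=0 & \mu\to-\infty & R_1 \to 0 \\ 4 & c=0 & \mu\to+\infty & R_1 \to 0, \;\; R_1 \ll \mu^{-(1-\nu)} \text{ for all } \nu>0 \end{array}$$

In words, the only regime in which the limit regret diverges is a strongly suboptimal arm ($\mu \to -\infty$) combined with $c > 0$; in every other corner the regret vanishes.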
References

Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for Thompson sampling. Journal of the ACM, 64(5):1–24, 2017.

Victor F Araman and Rene Caldentey. Diffusion approximations for a class of sequential testing problems. Available at SSRN 3479676, 2019.

Susan Athey and Stefan Wager. Policy learning with observational data. Econometrica, 89(1):133–161, 2021.

Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Sébastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson sampling. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–9. IEEE, 2014.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

Richard Durrett. Stochastic Calculus: A Practical Introduction, volume 6. CRC Press, 1996.

Ido Erev and Alvin E Roth. Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. American Economic Review, pages 848–881, 1998.

David Gamarnik and Assaf Zeevi. Validity of heavy traffic steady-state approximations in generalized Jackson networks. The Annals of Applied Probability, 16(1):56–90, 2006.

Peter W Glynn. Diffusion approximations. Handbooks in Operations Research and Management Science, 2:145–198, 1990.

Vitor Hadad, David A Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.

J Michael Harrison and Martin I Reiman. Reflected Brownian motion on an orthant. The Annals of Probability, pages 302–308, 1981.

J Michael Harrison and Nur Sunar. Investment timing with incomplete information and multiple means of learning. Operations Research, 63(2):442–457, 2015.

Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.

Ioannis Karatzas and Steven E Shreve. Brownian Motion and Stochastic Calculus. Springer, 2005.

Maximilian Kasy and Anja Sautmann. Adaptive treatment assignment in experiments for policy choice. Econometrica, 89(1):113–132, 2021.

Toru Kitagawa and Aleksey Tetenov. Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica, 86(2):591–616, 2018.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Tor Lattimore and Csaba Szepesvári. An information-theoretic approach to minimax regret in partial monitoring. arXiv preprint arXiv:1902.00470, 2019.

Lucien Le Cam. Locally asymptotically normal families of distributions. In University of California Publications in Statistics, volume 3, pages 37–98, 1960.

R Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. John Wiley & Sons, 1959.

Alexander R Luedtke and Antoine Chambaz. Faster rates for policy learning. arXiv preprint arXiv:1704.06431, 2017.

Shie Mannor and John N Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.

Xinkun Nie, Xiaoying Tian, Jonathan Taylor, and James Zou. Why adaptively collected data have negative bias and how to correct for it. In International Conference on Artificial Intelligence and Statistics, pages 1261–1269. PMLR, 2018.

Bernt Oksendal. Stochastic Differential Equations: An Introduction with Applications. Springer Science & Business Media, 2013.

Martin I Reiman. Open queueing networks in heavy traffic. Mathematics of Operations Research, 9(3):441–458, 1984.

Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Daniel Russo. Simple Bayesian algorithms for best-arm identification. Operations Research, 68(6), 2020.

Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.

Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

Jaehyeok Shin, Aaditya Ramdas, and Alessandro Rinaldo. On the bias, risk and consistency of sample means in multi-armed bandits. arXiv preprint arXiv:1902.00746, 2019.

David Siegmund. Sequential Analysis: Tests and Confidence Intervals. Springer Science & Business Media, 1985.

Daniel W Stroock and SR Srinivasa Varadhan. Multidimensional Diffusion Processes. Springer, 2007.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Aad W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998. doi: 10.1017/CBO9780511802256.

Abraham Wald. Sequential Analysis. John Wiley & Sons, New York, 1947.

Zhengli Wang and Stefanos Zenios. Adaptive design of clinical trials: A sequential learning approach. Available at SSRN, 2020.

Kuang Xu and Se-Young Yun. Reinforcement with fading memories. Mathematics of Operations Research, 45(4):1258–1288, 2020.
A Additional Proofs

A.1 Proof of Lemma 11

Proof. For the upper bound, we have that for all $x < 0$,

$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-x}^\infty e^{-s^2/2}\,ds \overset{(a)}{\le} \frac{1}{\sqrt{2\pi}}\int_{-x}^\infty \left(-\frac{s}{x}\right) e^{-s^2/2}\,ds = \frac{1}{\sqrt{2\pi}\,|x|}\,e^{-x^2/2},$$

where $(a)$ follows from the fact that $-x/s \le 1$ for all $s \ge -x$. For the lower bound, define

$$f(x) = \frac{x}{\sqrt{2\pi}(x^2+1)}\,e^{-x^2/2} - \Phi(-x).$$

We have $f(0) = -\Phi(0) < 0$, $\lim_{x\to\infty} f(x) = 0$, and

$$f'(x) = \frac{2}{\sqrt{2\pi}(1+x^2)^2}\,e^{-x^2/2} > 0, \quad \forall x > 0. \quad (124)$$

This implies that $f(x) < 0$ for all $x > 0$, which further implies that

$$\Phi(x) \ge \frac{|x|}{\sqrt{2\pi}(x^2+1)}\,e^{-x^2/2}, \quad \forall x < 0. \quad (125)$$

The claim follows by noting that $\frac{|x|}{\sqrt{2\pi}(x^2+1)} \ge \frac{\pi}{9\sqrt{2\pi}\,|x|}$ whenever $|x| \ge \sqrt{\pi/(9-\pi)}$.

A.2 Proof of Lemma 13

Proof. The proof is based on Durrett [1996], and we include it here for completeness. For $\Delta^n_\epsilon$, note that for all $\epsilon > 0$,

$$\Delta^n_\epsilon(z) \le \epsilon^{-p}\,m^n_p(z). \quad (126)$$

The convergence of (51) thus implies that of (49). For $b^n_k$, note that

$$|\tilde b^n_k(z) - b^n_k(z)| = \left|\int_{\{x:|x-z|>1\}} (x_k - z_k)\,K_n(z, dx)\right| \le \int_{\{x:|x-z|>1\}} |x-z|\,K_n(z, dx) \le m^n_p(z), \quad (127)$$

where the last step follows from the assumption $p \ge 2$. We have thus proven that (51) and (53) together imply (48). Finally, for $a^n_{k,l}$, we have

$$|\tilde a^n_{k,l}(z) - a^n_{k,l}(z)| \le \int_{\{x:|x-z|>1\}} |(x_k - z_k)(x_l - z_l)|\,K_n(z, dx) \le \int_{\{x:|x-z|>1\}} |x-z|^2\,K_n(z, dx) \le m^n_p(z), \quad (128)$$

whenever $p \ge 2$, where the first step follows from the Cauchy–Schwarz inequality, and the second step from the observation that $(x_k - z_k)^2 \le |x-z|^2$ for all $k$. This shows that (51) and (52) together imply (47), completing our proof.

A.3 Proof of Lemma 14

We will follow a similar set of arguments as those in the proof of Theorem 7, (83), exploiting the LIL of Brownian motion along with tail bounds on the normal cdf. By the LIL, we have that there exists a constant