Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem
Department of MSIS - Rutgers University - Technical Report 2015
Wesley Cowan
cwcowan@math.rutgers.edu
Department of Mathematics, Rutgers University, 110 Frelinghuysen Rd., Piscataway, NJ 08854, USA
Junya Honda
honda@it.k.u-tokyo.ac.jp
Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba 277-8561, Japan
Michael N. Katehakis
mnk@rutgers.edu
Department of Management Science and Information Systems, Rutgers University, 100 Rockafeller Rd., Piscataway, NJ 08854, USA
Abstract
Consider the problem of sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X_{ik}$, $i = 1, \dots, N$, and $k = 1, 2, \dots$; where $X_{ik}$ denotes the outcome from population $i$ the $k$-th time it is sampled. It is assumed that for each fixed $i$, $\{X_{ik}\}_{k \geq 1}$ is a sequence of i.i.d. normal random variables, with unknown mean $\mu_i$ and unknown variance $\sigma_i^2$. The objective is to have a policy $\pi$ for deciding from which of the $N$ populations to sample at any time $t = 1, 2, \dots$ so as to maximize the expected sum of outcomes of $n$ total samples, or equivalently, to minimize the regret due to lack of information on the parameters $\mu_i$ and $\sigma_i^2$. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from Burnetas and Katehakis (1996b). Additionally, finite horizon regret bounds are given. Keywords:
Inflated Sample Means, Multi-armed Bandits, Sequential Allocation
1. Introduction and Summary
Consider the problem of a controller sampling sequentially from a finite number of $N \geq 2$ populations, or 'bandits', where the samples from population $i$ are specified by a sequence of i.i.d. random variables $\{X_{ik}\}_{k \geq 1}$, taken to be normal with finite mean $\mu_i$ and finite variance $\sigma_i^2$. The means $\{\mu_i\}$ and variances $\{\sigma_i^2\}$ are taken to be unknown to the controller. It is convenient to define the maximum mean, $\mu^* = \max_i\{\mu_i\}$, and the bandit discrepancies $\{\Delta_i\}$, where $\Delta_i = \mu^* - \mu_i \geq 0$. It is additionally convenient to define $\sigma^{*2}$ as the minimal variance of any bandit that achieves $\mu^*$, that is, $\sigma^{*2} = \min_{i: \mu_i = \mu^*} \sigma_i^2$. In this paper, given $k$ samples from population $i$ we will take the estimators $\bar{X}_{ik} = \sum_{t=1}^k X_{it}/k$ and $S_i^2(k) = \sum_{t=1}^k \left(X_{it} - \bar{X}_{ik}\right)^2/k$ for $\mu_i$ and $\sigma_i^2$ respectively. Note that the use of the biased estimator for the variance, with the $1/k$ factor in place of $1/(k-1)$, is largely for aesthetic purposes; the results presented here adapt to the use of the unbiased estimator as well.
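As a concrete illustration of these two estimators (a minimal sketch; the function name is ours, not the paper's):

```python
def estimators(samples):
    """Return the sample mean and the biased sample variance of the text:
    X-bar_ik = (1/k) sum_t X_it  and  S_i^2(k) = (1/k) sum_t (X_it - X-bar_ik)^2."""
    k = len(samples)
    mean = sum(samples) / k
    var = sum((x - mean) ** 2 for x in samples) / k  # 1/k factor, not 1/(k - 1)
    return mean, var
```

With the unbiased convention one would divide by $k-1$ instead; as noted above, the results adapt to either choice.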
1. A substantial portion of the results reported here were derived independently by Cowan and Katehakis, and by Honda.
© Wesley Cowan, Junya Honda and Michael N. Katehakis.
For any adaptive, non-anticipatory policy $\pi$, $\pi(t) = i$ indicates that the controller samples bandit $i$ at time $t$. Define $T^i_\pi(n) = \sum_{t=1}^n \mathbb{1}\{\pi(t) = i\}$, denoting the number of times bandit $i$ has been sampled during the periods $t = 1, \dots, n$ under policy $\pi$; we take, as a convenience, $T^i_\pi(0) = 0$ for all $i, \pi$. The value of a policy $\pi$ is the expected sum of the first $n$ outcomes under $\pi$, which we define to be the function $V_\pi(n)$:

$$V_\pi(n) = \mathbb{E}\left[\sum_{i=1}^N \sum_{k=1}^{T^i_\pi(n)} X_{ik}\right] = \sum_{i=1}^N \mu_i\, \mathbb{E}\left[T^i_\pi(n)\right], \qquad (1)$$

where for simplicity the dependence of $V_\pi(n)$ on the true, unknown, values of the parameters $\mu = (\mu_1, \dots, \mu_N)$ and $\sigma^2 = (\sigma_1^2, \dots, \sigma_N^2)$ is suppressed. The pseudo-regret, or simply regret, of a policy is taken to be the expected loss due to ignorance of the parameters $\mu$ and $\sigma^2$ by the controller. Had the controller complete information, she would at every round activate some bandit $i^*$ such that $\mu_{i^*} = \mu^* = \max_i\{\mu_i\}$. For a given policy $\pi$, we define the expected regret of that policy at time $n$ as

$$R_\pi(n) = n\mu^* - V_\pi(n) = \sum_{i=1}^N \Delta_i\, \mathbb{E}\left[T^i_\pi(n)\right]. \qquad (2)$$

It follows from Eqs. (1) and (2) that maximization of $V_\pi(n)$ with respect to $\pi$ is equivalent to minimization of $R_\pi(n)$. This type of loss due to ignorance of the means (regret) was first introduced in the context of an $N = 2$ problem by Robbins (1952), who considered the loss $L_\pi(n)/n = \mu^* - \sum_{i=1}^N \sum_{k=1}^{T^i_\pi(n)} X_{ik}/n$ (for which $R_\pi(n) = \mathbb{E}[L_\pi(n)]$), constructing a modified (along two sparse sequences) 'play the winner' policy, $\pi_R$, such that $L_{\pi_R}(n) = o(n)$ (a.s.) and $R_{\pi_R}(n) = o(n)$, using for his derivation only the assumption of the Strong Law of Large Numbers. Following Burnetas and Katehakis (1996b), when $n \to \infty$, if $\pi$ is such that $R_\pi(n) = o(n)$ we say policy $\pi$ is uniformly convergent (UC) (since then $\lim_{n\to\infty} V_\pi(n)/n = \mu^*$).
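Eq. (2) reduces the regret to the expected pull counts of the sub-optimal bandits; a small sketch (the numbers and the helper name are our own, purely illustrative):

```python
def pseudo_regret(means, expected_pulls):
    """R_pi(n) = sum_i Delta_i * E[T_i(n)], with Delta_i = mu* - mu_i, per Eq. (2)."""
    mu_star = max(means)
    return sum((mu_star - mu) * t for mu, t in zip(means, expected_pulls))

# Three bandits with means (1.0, 0.5, 0.0) pulled (90, 6, 4) times in
# expectation over a horizon of n = 100: regret = 0.5*6 + 1.0*4 = 7.0.
```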
However, if under a policy $\pi$, $R_\pi(n)$ grew at a slower pace, such as $R_\pi(n) = o(n^{1/2})$, or better $R_\pi(n) = o(n^{1/4})$, etc., then the controller would be assured that $\pi$ is making an effective trade-off between exploration and exploitation. It turns out that it is possible to construct 'uniformly fast convergent' (UFC) policies, also known as consistent or strongly consistent, defined as the policies $\pi$ for which $R_\pi(n) = o(n^\alpha)$, for all $\alpha > 0$ and all $(\mu, \sigma^2)$. The existence of UFC policies in the case considered here is well established; e.g., Auer et al. (2002) (fig. 4 therein) presented the following UFC policy $\pi_{ACF}$:

Policy $\pi_{ACF}$ (UCB1-NORMAL). At each $n = 1, 2, \dots$:
i) Sample from any bandit $i$ for which $T^i_{\pi_{ACF}}(n) < \lceil 8 \ln n \rceil$.
ii) If $T^i_{\pi_{ACF}}(n) \geq \lceil 8 \ln n \rceil$ for all $i = 1, \dots, N$, sample from bandit $\pi_{ACF}(n+1)$ with

$$\pi_{ACF}(n+1) = \arg\max_i \left\{ \bar{X}_{iT^i_\pi(n)} + 4\, S_i(T^i_\pi(n)) \sqrt{\frac{\ln n}{T^i_\pi(n)}} \right\}. \qquad (3)$$

(Taking, in this case, $S_i^2(k)$ as the unbiased estimator.)

Additionally, Auer et al. (2002) (in Theorem 4 therein) gave the following bound:

$$R_{\pi_{ACF}}(n) \leq M_{ACF}(\mu, \sigma^2)\ln n + C_{ACF}(\mu), \quad \text{for all } n \text{ and all } (\mu, \sigma^2), \qquad (4)$$

with

$$M_{ACF}(\mu, \sigma^2) = 256 \sum_{i:\mu_i \neq \mu^*} \frac{\sigma_i^2}{\Delta_i} + 8\sum_{i=1}^N \Delta_i, \qquad (5)$$

$$C_{ACF}(\mu) = \left(1 + \frac{\pi^2}{2}\right)\sum_{i=1}^N \Delta_i. \qquad (6)$$

Ineq. (4) readily implies that $R_{\pi_{ACF}}(n) \leq M_{ACF}(\mu, \sigma^2)\ln n + o(\ln n)$. Thus, since $\ln n = o(n^\alpha)$ for all $\alpha > 0$ and $R_{\pi_{ACF}}(n) \geq 0$, it follows that $\pi_{ACF}$ is uniformly fast convergent. Given that UFC policies exist, the question immediately follows: just how fast can they be?
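A step of $\pi_{ACF}$ can be sketched as follows (our own paraphrase of the policy as restated above, with our variable names; the $\lceil 8\ln n\rceil$ threshold is the forced-sampling rule of step i)):

```python
import math

def ucb1_normal_index(mean, var_unbiased, t_i, n):
    # Index of Eq. (3): X-bar + 4 * S * sqrt(ln n / T), with S^2 the unbiased variance.
    return mean + 4.0 * math.sqrt(var_unbiased * math.log(n) / t_i)

def ucb1_normal_choose(stats, n):
    """stats: per-bandit tuples (sample mean, unbiased variance, pull count).
    Step i): force-sample any bandit pulled fewer than ceil(8 ln n) times;
    step ii): otherwise pick the bandit maximizing the index of Eq. (3)."""
    threshold = math.ceil(8 * math.log(n))
    for i, (_, _, t_i) in enumerate(stats):
        if t_i < threshold:
            return i
    return max(range(len(stats)), key=lambda i: ucb1_normal_index(*stats[i], n))
```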
The primary motivation of this paper is the following general result, from Burnetas and Katehakis (1996b), where they showed that for any UFC policy $\pi$, the following holds:

$$\liminf_{n\to\infty} \frac{R_\pi(n)}{\ln n} \geq M_{BK}(\mu, \sigma^2), \quad \text{for all } (\mu, \sigma^2), \qquad (7)$$

where the bound itself, $M_{BK}(\mu, \sigma^2)$, is determined by the specific distributions of the populations, in this case

$$M_{BK}(\mu, \sigma^2) = \sum_{i:\mu_i \neq \mu^*} \frac{2\,\Delta_i}{\ln\left(1 + \frac{\Delta_i^2}{\sigma_i^2}\right)}. \qquad (8)$$

For comparison, depending on the specifics of the bandit distributions, there is a considerable distance between the logarithmic term of the upper bound of Eq. (4) and the lower bound implied by Eq. (8). The derivation of Ineq. (7) implies that in order to guarantee that a policy is uniformly fast convergent, sub-optimal populations have to be sampled at least a logarithmic number of times. The above bound is a special case of a more general result derived in Burnetas and Katehakis (1996b) (part 1 of Theorem 1 therein) for distributions with multiple unknown parameters (such as in the current problem of normal populations with both the mean and the variance being unknown): $M_{BK}(\mu, \sigma^2) = \sum_{i:\mu_i \neq \mu^*} \Delta_i / K_i(\mu, \sigma^2)$, with

$$K_i(\mu, \sigma^2) = \inf_{(\mu_i', \sigma_i'^2)}\left\{ I\left(f_{(\mu_i, \sigma_i^2)};\, f_{(\mu_i', \sigma_i'^2)}\right) : \mu_i' > \mu^*,\ \sigma_i'^2 > 0 \right\} = \frac{1}{2}\ln\left(1 + \frac{\Delta_i^2}{\sigma_i^2}\right),$$

where $I$ denotes the Kullback-Leibler divergence. Previously, Lai and Robbins (1985) had obtained such lower bounds for distributions with one unknown parameter (such as in the current problem of normal populations with unknown mean but known variance). Allocation policies that achieved the lower bounds were called asymptotically efficient or optimal in Lai and Robbins (1985). Ineq.
(7) motivates the definition of a uniformly fast convergent policy $\pi$ as having a uniformly maximal convergence rate (UM), or simply being asymptotically optimal, within the class of uniformly fast convergent policies, if $\lim_{n\to\infty} R_\pi(n)/\ln n = M_{BK}(\mu, \sigma^2)$, since then $V_\pi(n) = n\mu^* - M_{BK}(\mu, \sigma^2)\ln n + o(\ln n)$. Burnetas and Katehakis (1996b) proposed the following index policy $\pi_{BK}$ as one that could achieve this lower bound:
Policy $\pi_{BK}$ (UCB-NORMAL)
i) For $n = 1, 2, \dots, 2N$, sample each bandit twice, and
ii) for $n \geq 2N$, sample from bandit $\pi_{BK}(n+1)$ with

$$\pi_{BK}(n+1) = \arg\max_i \left\{ \bar{X}_{iT^i_\pi(n)} + S_i(T^i_\pi(n))\sqrt{n^{\frac{2}{T^i_\pi(n)}} - 1} \right\}. \qquad (9)$$

Burnetas and Katehakis (1996b) were not able to establish the asymptotic optimality of the $\pi_{BK}$ policy because they were not able to establish a sufficient condition (Condition A3 therein), which we express here as the following equivalent conjecture (the referenced open question in the subtitle).
Conjecture 1
For each $i$, for every $\varepsilon > 0$, and for $k \to \infty$, the following is true:

$$P\left( \bar{X}_{ij} + S_i(j)\sqrt{k^{2/j} - 1} < \mu_i - \varepsilon \ \text{ for some } 2 \leq j \leq k \right) = o(1/k). \qquad (10)$$

We show that the above conjecture is false (cf. Proposition 6 in the Appendix). This does not imply that $\pi_{BK}$ fails to be UM (i.e., to be asymptotically optimal), but this failure means that the techniques established in Burnetas and Katehakis (1996b) are insufficient to verify its optimality. All is not lost, however. One of the central results of this paper is to establish that with a small change, the policy $\pi_{BK}$ may be modified to one that is provably asymptotically optimal. We introduce in this paper the policy $\pi_{CHK}$ defined in the following way:
Policy $\pi_{CHK}$ (UCB-NORMAL)
i) For $n = 1, 2, \dots, 3N$, sample each bandit three times, and
ii) for $n \geq 3N$, sample from bandit $\pi_{CHK}(n+1)$ with

$$\pi_{CHK}(n+1) = \arg\max_i \left\{ \bar{X}_{iT^i_\pi(n)} + S_i(T^i_\pi(n))\sqrt{n^{\frac{2}{T^i_\pi(n)-2}} - 1} \right\}. \qquad (11)$$

Remark 1
1) Note that policy $\pi_{CHK}$ is only a slight modification of policy $\pi_{BK}$; the only difference between their indices is the exponent of $n$ under the radical, i.e., $2/(T^i_\pi(n)-2)$ in $\pi_{CHK}(n+1)$ replacing $2/T^i_\pi(n)$ in $\pi_{BK}(n+1)$. This change, while seemingly asymptotically negligible (as in practice $T^i_\pi(n) \to \infty$ (a.s.) with $n$), has a profound effect on what is provable about $\pi_{CHK}$.

2) We note that the indices of policy $\pi_{CHK}$ are a significant modification of those of the optimal allocation policy $\pi_\sigma$ for the case of normal bandits with known variances, cf. Burnetas and Katehakis (1996b) and Katehakis and Robbins (1995), which are:

$$\pi_\sigma(n+1) = \arg\max_i \left\{ \bar{X}_{iT^i_\pi(n)} + \sigma_i\sqrt{\frac{2\ln n}{T^i_\pi(n)}} \right\},$$

the difference being replacing the term $\sigma_i\sqrt{2\ln n / T^i_\pi(n)}$ in $\pi_\sigma$ by $S_i(T^i_\pi(n))\sqrt{n^{2/(T^i_\pi(n)-2)} - 1}$ in $\pi_{CHK}$. However, the indices of policy $\pi_{ACF}$ are a minor modification of those of $\pi_\sigma$, the difference being replacing the term $\sigma_i\sqrt{2\ln n / T^i_\pi(n)}$ in $\pi_\sigma$ by $4\,S_i(T^i_\pi(n))\sqrt{\ln n / T^i_\pi(n)}$ in $\pi_{ACF}$.
3) The $\pi_{BK}$ and $\pi_\sigma$ policies can be seen as connected in the following way, observing that $2\ln n / T^i_\pi(n)$ is a first-order approximation of $n^{2/T^i_\pi(n)} - 1 = e^{(2/T^i_\pi(n))\ln n} - 1$. Whether $\pi_{BK}$ itself, for every $(\mu, \sigma^2)$, achieves the asymptotic lower bound $M_{BK}(\mu, \sigma^2)$ remains an open question.

The structure of the rest of the paper is as follows. In Section 2, Theorem 3 establishes a finite horizon bound on the regret of $\pi_{CHK}$. From this bound, it follows that $\pi_{CHK}$ is asymptotically optimal (Theorem 4), and we provide a bound on the remainder term (Theorem 5). Additionally, in Section 3, the Thompson sampling policy of Honda and Takemura (2013) and $\pi_{CHK}$ are compared and discussed, as both achieve asymptotic optimality.
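The two indices, and the first-order connection noted above, can be sketched numerically (a sketch under our own naming; the $T-2$ in the exponent requires $T \geq 3$):

```python
import math

def chk_index(mean, var_biased, t_i, n):
    # pi_CHK index, Eq. (11): X-bar + S * sqrt(n^{2/(T-2)} - 1); needs T >= 3.
    return mean + math.sqrt(var_biased) * math.sqrt(n ** (2.0 / (t_i - 2)) - 1.0)

def bk_index(mean, var_biased, t_i, n):
    # pi_BK index, Eq. (9): identical except for the exponent 2/T.
    return mean + math.sqrt(var_biased) * math.sqrt(n ** (2.0 / t_i) - 1.0)

# Remark 1(3): n^{2/T} - 1 = e^{(2/T) ln n} - 1 is approximately (2 ln n)/T
# for large T, recovering the known-variance pi_sigma bonus to first order:
n, t = 10_000, 1_000
exact = n ** (2.0 / t) - 1.0       # about 0.0186
approx = 2.0 * math.log(n) / t     # about 0.0184
```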
2. The Optimality Theorem and Finite Time Bounds
The main results of this paper, that Conjecture 1 is false (cf. Proposition 6 in the Appendix), the asymptotic optimality, and the bounds on the behavior of $\pi_{CHK}$, all depend on the following probability bounds; we note that tighter bounds seem possible, but these are sufficient for this paper.
Proposition 2
Let $Z$, $U$ be independent random variables, $Z \sim N(0,1)$ a standard normal, and $U \sim \chi_d^2$ a chi-squared distribution with $d$ degrees of freedom, where $d \geq 2$. For $\delta > 0$, $p > 0$, the following holds for all $k \geq 2$:

$$P\left(Z \geq 2\sqrt{U} \geq 2\delta\right)k^{-d/p} \;\leq\; P\left(\delta + \sqrt{U}\sqrt{k^{2/p}-1} < Z\right) \;\leq\; \frac{e^{-d\left(1+\delta^2\right)/(2p)}}{\delta\sqrt{d}}\,\frac{k^{-d/p}}{\ln k}. \qquad (12)$$

Proof [of Proposition 2] The proof is given in the Appendix.
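The central probability in (12) is easy to explore numerically; the following Monte Carlo sketch (our own illustration, not part of the proof) estimates it for given $(d, \delta, p, k)$:

```python
import math, random

def tail_estimate(d, delta, p, k, trials=100_000, seed=1):
    """Estimate P(delta + sqrt(U) * sqrt(k^{2/p} - 1) < Z), with Z ~ N(0,1)
    and U ~ chi-squared with d degrees of freedom, independent, by simulation."""
    rng = random.Random(seed)
    radical = math.sqrt(k ** (2.0 / p) - 1.0)
    hits = 0
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)
        u = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))  # one chi^2_d draw
        if delta + math.sqrt(u) * radical < z:
            hits += 1
    return hits / trials
```

For fixed $d$, $p$, $\delta$, the estimate decays polynomially in $k$, in line with the $k^{-d/p}$ factor appearing on both sides of (12).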
Theorem 3
For policy $\pi_{CHK}$ as defined above, the following bounds hold for all $n \geq 3N$ and all $\varepsilon \in (0,1)$:

$$R_{\pi_{CHK}}(n) \leq \sum_{i:\mu_i \neq \mu^*}\left( \frac{2\,\Delta_i \ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + \sqrt{\frac{2\pi}{e}}\,\frac{4\,\sigma^{*2}}{\Delta_i\varepsilon^2}\ln\ln n + \frac{8\,\Delta_i}{\varepsilon^2} + \frac{8\,\sigma_i^2}{\Delta_i\varepsilon^2} + 7\,\Delta_i \right). \qquad (13)$$
Before giving the proof of this bound, we present two results, the first demonstrating the asymptotic optimality of $\pi_{CHK}$, the second giving an $\varepsilon$-free version of the above bound, which gives a bound on the sub-logarithmic remainder term. It is worth noting the following: the bounds of Theorem 3 can actually be improved, through the use of a modified version of Proposition 2, to eliminate the $\ln\ln n$ dependence, so that the only dependence on $n$ is through the initial $\ln n$ term. The cost of this, however, is a dependence on a larger power of $1/\varepsilon$. The particular form of the bound given in Eq. (13) was chosen to simplify the following two results, cf. Remark 4 in the proof of Proposition 2.

Theorem 4
For a policy $\pi_{CHK}$ as defined above, $\pi_{CHK}$ is asymptotically optimal in the sense that

$$\lim_{n\to\infty} \frac{R_{\pi_{CHK}}(n)}{\ln n} = M_{BK}(\mu, \sigma^2). \qquad (14)$$

Proof [of Theorem 4] For any $\varepsilon$ such that $0 < \varepsilon < 1$, we have from Theorem 3 that the following holds:

$$\limsup_{n\to\infty} \frac{R_{\pi_{CHK}}(n)}{\ln n} \leq \sum_{i:\mu_i \neq \mu^*} \frac{2\,\Delta_i}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)}. \qquad (15)$$

Taking the infimum over all such $\varepsilon$,

$$\limsup_{n\to\infty} \frac{R_{\pi_{CHK}}(n)}{\ln n} \leq \sum_{i:\mu_i \neq \mu^*} \frac{2\,\Delta_i}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\right)} = M_{BK}(\mu, \sigma^2), \qquad (16)$$

and observing the lower bound of Eq. (7) completes the result.

Theorem 5
For a policy $\pi_{CHK}$ as defined above, $R_{\pi_{CHK}}(n) \leq M_{BK}(\mu, \sigma^2)\ln n + O\left((\ln n)^{3/4}\ln\ln n\right)$, and more concretely

$$R_{\pi_{CHK}}(n) \leq M^1_{CHK}(\mu,\sigma^2)\ln n + M^2_{CHK}(\mu,\sigma^2)(\ln n)^{1/2}\ln\ln n + M^3_{CHK}(\mu,\sigma^2)(\ln n)^{3/4} + M^4_{CHK}(\mu,\sigma^2)(\ln n)^{1/2} + M^5_{CHK}(\mu,\sigma^2), \qquad (17)$$

where

$$\begin{aligned}
M^1_{CHK}(\mu,\sigma^2) &= M_{BK}(\mu,\sigma^2) \\
M^2_{CHK}(\mu,\sigma^2) &= \sqrt{\frac{2\pi}{e}}\sum_{i:\mu_i \neq \mu^*} 4\left(\frac{\sigma^{*2}}{\Delta_i}\right) \\
M^3_{CHK}(\mu,\sigma^2) &= \sum_{i:\mu_i \neq \mu^*} \frac{6\,\Delta_i\left(\sigma_i^2+\Delta_i^2\right)}{\sigma_i^2\ln^2\left(1+\frac{\Delta_i^2}{\sigma_i^2}\right)} \\
M^4_{CHK}(\mu,\sigma^2) &= \sum_{i:\mu_i \neq \mu^*} 8\left(\Delta_i+\frac{\sigma_i^2}{\Delta_i}\right) \\
M^5_{CHK}(\mu,\sigma^2) &= 7\sum_{i:\mu_i \neq \mu^*} \Delta_i.
\end{aligned} \qquad (18)$$
While the above bound admittedly has a more complex form than a bound such as that of Eq. (4), it demonstrates the asymptotic optimality of the dominating term, and bounds the sub-linear remainder term.
Proof [of Theorem 5] The bound follows directly from Theorem 3, taking $\varepsilon = (\ln n)^{-1/4}$ for $n \geq 3$, and observing the following bound: for $\varepsilon$ such that $0 < \varepsilon < 1/2$,

$$\frac{2\,\Delta_i}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} \leq \frac{2\,\Delta_i}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\right)} + \frac{6\,\Delta_i\left(\sigma_i^2+\Delta_i^2\right)}{\sigma_i^2\ln^2\left(1+\frac{\Delta_i^2}{\sigma_i^2}\right)}\,\varepsilon. \qquad (19)$$

This inequality is proven separately as Proposition 7 in the Appendix.

We make no claim that the results of Theorems 3 and 5 are the best achievable for this policy $\pi_{CHK}$. At several points in the proofs, choices of convenience were made in the bounding of terms, and different techniques may yield tighter bounds still. But they are sufficient to demonstrate the asymptotic optimality of $\pi_{CHK}$, and give useful bounds on the growth of $R_{\pi_{CHK}}(n)$.

Proof [of Theorem 3] In this proof, we take $\pi = \pi_{CHK}$ as defined above. For notational convenience, we define the index function

$$u_i(k,j) = \bar{X}_{ij} + S_i(j)\sqrt{k^{\frac{2}{j-2}} - 1}. \qquad (20)$$

The structure of this proof will be to bound the expected value of $T^i_\pi(n)$ for all sub-optimal bandits $i$, and use this to bound the regret $R_\pi(n)$. The basic techniques follow those in Katehakis and Robbins (1995) for the known variance case, modified accordingly here for the unknown variance case and assisted by the probability bound of Proposition 2. For any $i$ such that $\mu_i \neq \mu^*$, we define the following quantities: Let $1 > \varepsilon > 0$, and $\tilde\varepsilon = \Delta_i\varepsilon/2$. For $n \geq 3N$,

$$\begin{aligned}
n_i^1(n,\varepsilon) &= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ u_i(t,T^i_\pi(t)) \geq \mu^*-\tilde\varepsilon,\ \bar{X}_{iT^i_\pi(t)} \leq \mu_i+\tilde\varepsilon,\ S_i^2(T^i_\pi(t)) \leq \sigma_i^2(1+\varepsilon)\right\} \\
n_i^2(n,\varepsilon) &= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ u_i(t,T^i_\pi(t)) \geq \mu^*-\tilde\varepsilon,\ \bar{X}_{iT^i_\pi(t)} \leq \mu_i+\tilde\varepsilon,\ S_i^2(T^i_\pi(t)) > \sigma_i^2(1+\varepsilon)\right\} \\
n_i^3(n,\varepsilon) &= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ u_i(t,T^i_\pi(t)) \geq \mu^*-\tilde\varepsilon,\ \bar{X}_{iT^i_\pi(t)} > \mu_i+\tilde\varepsilon\right\} \\
n_i^4(n,\varepsilon) &= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ u_i(t,T^i_\pi(t)) < \mu^*-\tilde\varepsilon\right\}.
\end{aligned} \qquad (21)$$

Hence, we have the following relationship for $n \geq 3N$, that

$$T^i_\pi(n+1) = 3 + \sum_{t=3N}^n \mathbb{1}\{\pi(t+1)=i\} = 3 + n_i^1(n,\varepsilon) + n_i^2(n,\varepsilon) + n_i^3(n,\varepsilon) + n_i^4(n,\varepsilon). \qquad (22)$$

The proof proceeds by bounding, in expectation, each of the four terms.
Observe that, by the structure of the index function $u_i$,

$$\begin{aligned}
n_i^1(n,\varepsilon) &\leq \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ (\mu_i+\tilde\varepsilon) + \sigma_i\sqrt{1+\varepsilon}\,\sqrt{t^{2/(T^i_\pi(t)-2)}-1} \geq \mu^*-\tilde\varepsilon\right\} \\
&= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t) \leq \frac{2\ln t}{\ln\left(1+\frac{(\Delta_i-2\tilde\varepsilon)^2}{\sigma_i^2(1+\varepsilon)}\right)} + 2\right\} \\
&= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t) \leq \frac{2\ln t}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 2\right\} \\
&\leq \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t) \leq \frac{2\ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 2\right\} \\
&\leq \sum_{t=0}^n \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t) \leq \frac{2\ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 2\right\} \\
&\leq \frac{2\ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 2 + 2.
\end{aligned} \qquad (23)$$

The last inequality follows, observing that $T^i_\pi(t)$ may be expressed as the sum of $\mathbb{1}\{\pi(t)=i\}$ indicators, and seeing that the additional condition bounds the number of non-zero terms in the above sum. The additional $+2$ accounts for the $\pi(1)=i$ term and the $\pi(n+1)=i$ term. Note, this bound is sample-path-wise.

For the second term,

$$\begin{aligned}
n_i^2(n,\varepsilon) &\leq \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ S_i^2(T^i_\pi(t)) > \sigma_i^2(1+\varepsilon)\right\} \\
&= \sum_{t=3N}^n \sum_{k=3}^t \mathbb{1}\left\{\pi(t+1)=i,\ S_i^2(k) > \sigma_i^2(1+\varepsilon),\ T^i_\pi(t)=k\right\} \\
&= \sum_{t=3N}^n \sum_{k=3}^t \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t)=k\right\}\mathbb{1}\left\{S_i^2(k) > \sigma_i^2(1+\varepsilon)\right\} \\
&\leq \sum_{k=3}^n \mathbb{1}\left\{S_i^2(k) > \sigma_i^2(1+\varepsilon)\right\} \sum_{t=k}^n \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t)=k\right\} \\
&\leq \sum_{k=3}^n \mathbb{1}\left\{S_i^2(k) > \sigma_i^2(1+\varepsilon)\right\}.
\end{aligned} \qquad (24)$$

The last inequality follows as, for fixed $k$, $\{\pi(t+1)=i, T^i_\pi(t)=k\}$ may be true for at most one value of $t$. Recall that $kS_i^2(k)/\sigma_i^2$ has the distribution of a $\chi_{k-1}^2$ random variable. Letting $U_k \sim \chi_k^2$, from the above we have

$$\mathbb{E}\left[n_i^2(n,\varepsilon)\right] \leq \sum_{k=3}^n P\left(S_i^2(k) > \sigma_i^2(1+\varepsilon)\right) \leq \sum_{k=3}^\infty P\left(U_{k-1}/k > 1+\varepsilon\right) \leq \sum_{k=3}^\infty P\left(U_{k-1}/(k-1) > 1+\varepsilon\right) = \sum_{k=2}^\infty P\left(U_k > k(1+\varepsilon)\right) \leq \frac{1}{\sqrt{e^{\varepsilon}/(1+\varepsilon)}-1} \leq \frac{8}{\varepsilon^2} < \infty. \qquad (25)$$

The penultimate step is a Chernoff bound on the terms, $P\left(U_k > k(1+\varepsilon)\right) \leq \left(e^{-\varepsilon}(1+\varepsilon)\right)^{k/2}$. To bound the third term, a similar rearrangement to Eq. (24) (using the sample mean instead of the sample variance) yields:

$$n_i^3(n,\varepsilon) \leq \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ \bar{X}_{iT^i_\pi(t)} > \mu_i+\tilde\varepsilon\right\} \leq \sum_{k=3}^n \mathbb{1}\left\{\bar{X}_{ik} > \mu_i+\tilde\varepsilon\right\}. \qquad (26)$$

Recalling that $\bar{X}_{ik} - \mu_i \sim Z\sigma_i/\sqrt{k}$ for $Z$ a standard normal,

$$\mathbb{E}\left[n_i^3(n,\varepsilon)\right] \leq \sum_{k=3}^n P\left(\bar{X}_{ik} > \mu_i+\tilde\varepsilon\right) \leq \sum_{k=1}^\infty P\left(Z\sigma_i/\sqrt{k} > \tilde\varepsilon\right) \leq \frac{1}{e^{\tilde\varepsilon^2/(2\sigma_i^2)}-1} \leq \frac{2\sigma_i^2}{\tilde\varepsilon^2} < \infty. \qquad (27)$$

The penultimate step is a Chernoff bound on the terms, $P\left(Z > \delta\sqrt{k}\right) \leq e^{-k\delta^2/2}$.

To bound the $n_i^4$ term, observe that in the event $\pi(t+1)=i$, from the structure of the policy it must be true that $u_i(t,T^i_\pi(t)) = \max_j u_j(t,T^j_\pi(t))$. Thus, if $i^*$ is some bandit such that $\mu_{i^*}=\mu^*$, $u_{i^*}(t,T^{i^*}_\pi(t)) \leq u_i(t,T^i_\pi(t))$. In particular, we take $i^*$ to be a bandit that not only achieves the maximal mean $\mu^*$, but also the minimal variance among optimal bandits, $\sigma_{i^*}^2 = \sigma^{*2}$. We have the following bound,

$$n_i^4(n,\varepsilon) \leq \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ u_{i^*}(t,T^{i^*}_\pi(t)) < \mu^*-\tilde\varepsilon\right\} \leq \sum_{t=3N}^n \mathbb{1}\left\{u_{i^*}(t,T^{i^*}_\pi(t)) < \mu^*-\tilde\varepsilon\right\} \leq \sum_{t=3N}^n \mathbb{1}\left\{u_{i^*}(t,s) < \mu^*-\tilde\varepsilon \text{ for some } 3 \leq s \leq t\right\}. \qquad (28)$$

The last step follows as for $t$ in this range, $3 \leq T^{i^*}_\pi(t) \leq t$. Hence

$$\mathbb{E}\left[n_i^4(n,\varepsilon)\right] \leq \sum_{t=3N}^n P\left(u_{i^*}(t,s) < \mu^*-\tilde\varepsilon \text{ for some } 3 \leq s \leq t\right). \qquad (29)$$

As an aside, this is essentially the point at which the conjectured Eq. (10) would have come into play for the proof of the optimality of $\pi_{BK}$, bounding the growth of the corresponding term for that policy. We will essentially prove a successful version of that conjecture here. Define the events $A^*_{s,t,\varepsilon} = \left\{u_{i^*}(t,s) < \mu^*-\tilde\varepsilon\right\}$.
Observing the distributions of the sample mean and sample variance, we have (similar to Eq. (41)) for $Z$ a standard normal and $U_{s-1} \sim \chi_{s-1}^2$, with $U$, $Z$ independent,

$$P\left(A^*_{s,t,\varepsilon}\right) = P\left(\frac{\tilde\varepsilon}{\sigma^*}\sqrt{s} + \sqrt{U_{s-1}}\sqrt{t^{2/(s-2)}-1} < Z\right) \leq \frac{e^{-(\tilde\varepsilon/\sigma^*)^2 s(s-1)/(2(s-2))}}{(\tilde\varepsilon/\sigma^*)\sqrt{s}\sqrt{e(s-1)}}\left(\frac{1}{t\ln t}\right) \leq \frac{e^{-(\tilde\varepsilon/\sigma^*)^2 s/2}}{(\tilde\varepsilon/\sigma^*)\sqrt{es}}\left(\frac{1}{t\ln t}\right) \leq \left(\frac{1}{(\tilde\varepsilon/\sigma^*)\sqrt{e}}\right)\frac{e^{-(\tilde\varepsilon/\sigma^*)^2 s/2}}{\sqrt{s}}\left(\frac{1}{t\ln t}\right), \qquad (30)$$

where the first inequality follows as an application of Proposition 2 (with $\delta = \tilde\varepsilon\sqrt{s}/\sigma^*$, $d = s-1$, $p = s-2$, $k = t$, bounding $t^{-(s-1)/(s-2)} \leq t^{-1}$), and the second since $s \geq 3$. Applying a union bound to Eq. (29),

$$\mathbb{E}\left[n_i^4(n,\varepsilon)\right] \leq \sum_{t=3N}^n \sum_{s=3}^t P\left(A^*_{s,t,\varepsilon}\right) \leq \sum_{t=3N}^n \sum_{s=3}^t \left(\frac{1}{(\tilde\varepsilon/\sigma^*)\sqrt{e}}\right)\frac{e^{-(\tilde\varepsilon/\sigma^*)^2 s/2}}{\sqrt{s}}\left(\frac{1}{t\ln t}\right) \leq \left(\frac{1}{(\tilde\varepsilon/\sigma^*)\sqrt{e}}\right)\int_{s=0}^\infty \frac{e^{-(\tilde\varepsilon/\sigma^*)^2 s/2}}{\sqrt{s}}\,ds \int_{t=e}^n \frac{1}{t\ln t}\,dt = \left(\frac{1}{(\tilde\varepsilon/\sigma^*)\sqrt{e}}\right)\frac{\sqrt{2\pi}}{(\tilde\varepsilon/\sigma^*)}\ln\ln n = \sqrt{\frac{2\pi}{e}}\,\frac{\sigma^{*2}}{\tilde\varepsilon^2}\ln\ln n. \qquad (31)$$

The bounds follow, removing the dependence of the $s$-sum on $t$ by extending it to $\infty$, and bounding the sums by integrals of the (decreasing) summands by slightly extending the range of each. From the above results, and observing that $T^i_\pi(n) \leq T^i_\pi(n+1)$, it follows from Eq. (22) that for any $\varepsilon$ such that $0 < \varepsilon < 1$,

$$\mathbb{E}\left[T^i_\pi(n)\right] \leq \frac{2\ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 7 + \frac{8}{\varepsilon^2} + \frac{2\sigma_i^2}{\tilde\varepsilon^2} + \sqrt{\frac{2\pi}{e}}\,\frac{\sigma^{*2}}{\tilde\varepsilon^2}\ln\ln n \leq \frac{2\ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 7 + \frac{8}{\varepsilon^2} + \frac{8\sigma_i^2}{\Delta_i^2\varepsilon^2} + \sqrt{\frac{2\pi}{e}}\,\frac{4\sigma^{*2}}{\Delta_i^2\varepsilon^2}\ln\ln n. \qquad (32)$$

The result then follows from the definition of regret in Eq. (2).

Remark 2
Numerical Regret Comparison: Figure 1 shows the results of a small simulation study done on a set of six populations with means and variances given in Table 1. It provides plots of the regrets when implementing policies $\pi_{CHK}$, $\pi_{ACF}$, and $\pi_G$, a 'greedy' policy that always activates the bandit with the current highest average. Each policy was implemented over a horizon of 100,000 activations, each replicated 10,000 times to produce a good estimate of the average regret $R_\pi(n)$ over the times indicated. The left plot is on the time scale of the first 10,000 activations, and the right is on the full time scale of 100,000 activations.

Table 1: The means $\mu_i$ and variances $\sigma_i^2$ of the six populations used in the simulations.

Figure 1: Numerical regret comparison of $\pi_{ACF}$, $\pi_{CHK}$, and $\pi_G$; left: the $[0, 10{,}000]$ range, right: the $[0, 100{,}000]$ range.

Remark 3
Bounds and Limits: Figure 2 shows first (left) a comparison of the theoretical bounds on the regret, $B_{\pi_{ACF}}(n)$ and $B_{\pi_{CHK}}(n)$, representing the theoretical regret bounds of the RHS of Eq. (4) and Eq. (13) respectively, taking $\varepsilon = (\ln n)^{-1/4}$ in the latter case, for the means and variances indicated in Table 1. Additionally, Figure 2 (right) shows the convergence of $R_{\pi_{CHK}}(n)/\ln n$ to the theoretical lower bound $M_{BK}(\mu,\sigma^2)$.

Figure 2: Left: plots of $B_{\pi_{ACF}}(n)$ and $B_{\pi_{CHK}}(n)$. Right: convergence of $R_{\pi_{CHK}}(n)/\ln n$ to $M_{BK}(\mu,\sigma^2)$.
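The limiting constant in Figure 2 (right) can be computed directly from Eq. (8); a minimal sketch (function name ours):

```python
import math

def m_bk(means, variances):
    """M_BK(mu, sigma^2) = sum over sub-optimal i of 2*Delta_i / ln(1 + Delta_i^2/sigma_i^2),
    per Eq. (8)."""
    mu_star = max(means)
    total = 0.0
    for mu, var in zip(means, variances):
        delta = mu_star - mu
        if delta > 0:
            total += 2.0 * delta / math.log(1.0 + delta * delta / var)
    return total

# Two bandits, Delta = 1 and sigma^2 = 1 for the sub-optimal one:
# M_BK = 2/ln 2, about 2.885, the best achievable coefficient of ln n here.
```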
3. A Comparison of $\pi_{CHK}$ and Thompson Sampling
Honda and Takemura (2013) proved that for $\alpha < 0$, the following Thompson sampling algorithm is asymptotically optimal, i.e., $\lim_{n\to\infty} R_{\pi_{TS}}(n)/\ln n = M_{BK}(\mu,\sigma^2)$.
Policy $\pi_{TS}$ (TS-NORMAL$_\alpha$)
i) Initially, sample each bandit $\tilde n \geq \max(2,\ 2-\lfloor 2\alpha\rfloor)$ times.
ii) For $n \geq N\tilde n$: for each $i$ generate a random sample $U_{in}$ from a posterior distribution for $\mu_i$, given $\left(\bar{X}_{iT^i_\pi(n)},\ S_i^2(T^i_\pi(n))\right)$, and a prior for $\left(\mu_i, \sigma_i^2\right) \propto \left(\sigma_i^2\right)^{-1-\alpha}$.
iii) Then, take $\pi_{TS}(n+1) = \arg\max_i U_{in}$. (33)

Policies $\pi_{TS}$ and $\pi_{CHK}$ differ decidedly in structure. One key difference: $\pi_{TS}$ is an inherently randomized policy, while decisions under $\pi_{CHK}$ are completely determined given the bandit results at a given time. Given that both $\pi_{TS}$ and $\pi_{CHK}$ are asymptotically optimal, it is interesting to compare the performances of these two algorithms over finite time horizons, and observe any practical differences between them. To that end, two small simulation studies were done for different sets of bandit parameters $(\mu, \sigma^2)$. In each case, the uniform prior $\alpha = -1$ was used.

Figure 3: Numerical regret comparison of $\pi_{CHK}$ and $\pi_{TS}$ for the parameters of Table 1, left, and Table 2, right.

Table 2: $\mu_i$: 10, 9, 8, 7, -1, 0; $\sigma_i^2$: …

We observe from the above, and from general sampling of bandit parameters, that $\pi_{TS}$ and $\pi_{CHK}$ generally produce comparable expected regret. A general exploration of random parameters suggests that, on average, $\pi_{TS}$ is slightly superior to $\pi_{CHK}$ in cases where all bandits have roughly equal variances, while $\pi_{CHK}$ has an edge when the optimal bandits have large variance relative to the other bandits and to the size of the bandit discrepancies. It is additionally interesting to note that in the cases pictured above, the superior policy also demonstrated the smaller variance in sample regret (Figure 4). Additional numerical experiments, not pictured here, indicate that the superior policy in each case may exhibit a slightly heavier-tailed distribution towards larger regret. In general, the question of which policy is superior is largely context specific.
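Step ii) of $\pi_{TS}$ admits a closed-form posterior draw; the following is a sketch of the standard normal/inverse-gamma conjugate update under the stated prior (the helper names are ours, and the shape/scale algebra is the usual conjugate computation, not taken verbatim from the paper):

```python
import math, random

def ts_sample(mean, var_biased, t, alpha=-1.0, rng=random):
    """One posterior draw U_in for mu_i under the prior p(mu, sigma^2)
    proportional to (sigma^2)^(-1-alpha), given the sample mean, the biased
    variance S^2(t), and the pull count t:
      sigma^2 | data ~ Inv-Gamma((t - 1)/2 + alpha, t * S^2(t) / 2),
      mu | sigma^2, data ~ N(mean, sigma^2 / t).
    Requires t > 1 - 2*alpha so that the shape parameter is positive."""
    shape = (t - 1) / 2.0 + alpha
    sigma2 = (t * var_biased / 2.0) / rng.gammavariate(shape, 1.0)  # inverse-gamma draw
    return rng.gauss(mean, math.sqrt(sigma2 / t))

def ts_choose(stats, alpha=-1.0, rng=random):
    # Step iii): sample U_in for each bandit and play the arg max.
    draws = [ts_sample(m, v, t, alpha, rng) for (m, v, t) in stats]
    return max(range(len(draws)), key=lambda i: draws[i])
```

The policy's initial sampling ensures the shape parameter is positive (e.g., $t \geq 4$ when $\alpha = -1$, as in the simulations above).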
Figure 4:
Numerical comparison of the variance of sample regret for $\pi_{CHK}$ and $\pi_{TS}$ for the parameters of Table 1, left, and Table 2, right.

References
Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.

Peter Auer and Ronald Ortner. UCB revisited: improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Peter L. Bartlett and Ambuj Tewari. REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.

Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: stochastic and adversarial bandits. arXiv preprint arXiv:1202.4473, 2012.

Apostolos N. Burnetas and Michael N. Katehakis. On sequencing two types of tasks on a single processor under incomplete information. Probability in the Engineering and Informational Sciences, 7(1):85–119, 1993.

Apostolos N. Burnetas and Michael N. Katehakis. On large deviations properties of sequential allocation problems. Stochastic Analysis and Applications, 14(1):23–31, 1996a.

Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996b.

Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997a.

Apostolos N. Burnetas and Michael N. Katehakis. On the finite horizon one-armed bandit problem. Stochastic Analysis and Applications, 16(1):845–859, 1997b.

Apostolos N. Burnetas and Michael N. Katehakis. Asymptotic Bayes analysis for the finite-horizon one-armed-bandit problem. Probability in the Engineering and Informational Sciences, 17(01):53–82, 2003.

Sergiy Butenko, Panos M. Pardalos, and Robert Murphey. Cooperative Control: Models, Applications, and Algorithms. Kluwer Academic Publishers, 2003.
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Wesley Cowan and Michael N. Katehakis. An asymptotically optimal UCB policy for uniform bandits of unknown support. arXiv preprint arXiv:1505.01918, 2015a.

Wesley Cowan and Michael N. Katehakis. Asymptotic behavior of minimal-exploration allocation policies: almost sure, arbitrarily slow growing regret. arXiv preprint arXiv:1505.02865, Jul. 31 2015b.

Wesley Cowan and Michael N. Katehakis. Multi-armed bandits under general depreciation and commitment. Probability in the Engineering and Informational Sciences, 29(01):51–76, 2015c.

Savas Dayanik, Warren B. Powell, and Kazutoshi Yamazaki. Asymptotically optimal Bayesian sequential change detection and identification rules. Annals of Operations Research, 208(1):337–370, 2013.

Eric V. Denardo, Eugene A. Feinberg, and Uriel G. Rothblum. The multi-armed bandit, with constraints. In M. N. Katehakis, S. M. Ross, and J. Yang, editors, Cyrus Derman Memorial Volume I: Optimization under Uncertainty: Costs, Risks and Revenues. Annals of Operations Research, Springer, New York, 2013.

Eugene A. Feinberg, Pavlo O. Kasyanov, and Michael Z. Zgurovsky. Convergence of value iterations for total-cost MDPs and POMDPs with general state and action sets. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014 IEEE Symposium on, pages 1–8. IEEE, 2014.

Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning based on Kullback-Leibler divergence, 2010.

John C. Gittins. Bandit processes and dynamic allocation indices (with discussion). J. Roy. Stat. Soc. Ser. B, 41:335–340, 1979.

John C. Gittins, Kevin Glazebrook, and Richard R. Weber. Multi-armed Bandit Allocation Indices. John Wiley & Sons, West Sussex, U.K., 2011.

Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In COLT, pages 67–79. Citeseer, 2010.

Junya Honda and Akimichi Takemura. An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Machine Learning, 85(3):361–391, 2011.

Junya Honda and Akimichi Takemura. Optimality of Thompson sampling for Gaussian bandits depends on priors. arXiv preprint arXiv:1311.1894, 2013.

Wassim Jouini, Damien Ernst, Christophe Moy, and Jacques Palicot. Multi-armed bandit based policies for cognitive radio's decision making issues, 2009.

Michael N. Katehakis and Herbert Robbins. Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584, 1995.

Émilie Kaufmann. Analyse de stratégies bayésiennes et fréquentistes pour l'allocation séquentielle de ressources. Doctorat, ParisTech, Jul. 31 2015.

Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107–1149, 2003.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules.
Advances in AppliedMathematics , 6(1):4–22, 1985.Lihong Li, Remi Munos, and Csaba Szepesvari. On minimax optimal offline policy evaluation. arXiv preprintarXiv:1409.3653, 2014.Michael L Littman. Inducing partially observable Markov decision processes. In
ICGI , pages 145–148, 2012.Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored mdps. In
Advances inNeural Information Processing Systems , pages 604–612, 2014.Herbert Robbins. Some aspects of the sequential design of experiments.
Bull. Amer. Math. Monthly , 58:527–536, 1952.Cem Tekin and Mingyan Liu. Approximately optimal adaptive learning in opportunistic spectrum access. In
INFOCOM, 2012 Proceedings IEEE , pages 1548–1556. IEEE, 2012.Ambuj Tewari and Peter L Bartlett. Optimistic linear programming gives logarithmic regret for irreduciblemdps. In
Advances in Neural Information Processing Systems , pages 1505–1512, 2008.Richard R Weber. On the Gittins index for multiarmed bandits.
The Annals of Applied Probability , 2(4):1024–1033, 1992.
Acknowledgement:
We gratefully acknowledge support for this project from the National Science Foundation (NSF grant CMMI-14-50743).
Appendix A. Additional Proofs
Proof [of Proposition 2] Let $P = \mathbb{P}\left( \delta + \sqrt{U}\sqrt{k^{2/p} - 1} < Z \right)$. Note immediately that $P \geq \mathbb{P}\left( \delta + \sqrt{U k^{2/p}} < Z \right)$. Further,
$$P \geq \mathbb{P}\left( \delta + \sqrt{U k^{2/p}} < Z \mbox{ and } \sqrt{U k^{2/p}} \geq \delta \right) \geq \mathbb{P}\left( 2\sqrt{U k^{2/p}} < Z \mbox{ and } \sqrt{U k^{2/p}} \geq \delta \right) = \int_{\delta^2 / k^{2/p}}^{\infty} \int_{2\sqrt{u k^{2/p}}}^{\infty} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, f_d(u)\, dz\, du, \quad\quad (34)$$
where $f_d(u)$ is taken to be the density of a $\chi^2_d$-random variable. Letting $\tilde{u} = k^{2/p} u$,
$$P \geq \frac{1}{k^{2/p}} \int_{\delta^2}^{\infty} \int_{2\sqrt{\tilde{u}}}^{\infty} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, f_d\left( \frac{\tilde{u}}{k^{2/p}} \right) dz\, d\tilde{u} = \frac{1}{k^{2/p}} \int_{\delta^2}^{\infty} \int_{2\sqrt{\tilde{u}}}^{\infty} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, \frac{1}{2^{d/2}\Gamma(d/2)} \left( \frac{\tilde{u}}{k^{2/p}} \right)^{d/2 - 1} e^{-\tilde{u}/(2 k^{2/p})}\, dz\, d\tilde{u} = \left( \frac{1}{k^{2/p}} \right)^{d/2} \int_{\delta^2}^{\infty} \int_{2\sqrt{\tilde{u}}}^{\infty} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, \frac{\tilde{u}^{d/2 - 1}}{2^{d/2}\Gamma(d/2)}\, e^{-\tilde{u}/(2 k^{2/p})}\, dz\, d\tilde{u}. \quad\quad (35)$$
Observing that $k^{2/p} \geq 1$, so that $e^{-\tilde{u}/(2 k^{2/p})} \geq e^{-\tilde{u}/2}$,
$$P \geq \left( \frac{1}{k^{2/p}} \right)^{d/2} \int_{\delta^2}^{\infty} \int_{2\sqrt{\tilde{u}}}^{\infty} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, \frac{\tilde{u}^{d/2-1}}{2^{d/2}\Gamma(d/2)}\, e^{-\tilde{u}/2}\, dz\, d\tilde{u} = k^{-d/p}\, \mathbb{P}\left( 2\sqrt{U} \leq Z \mbox{ and } U \geq \delta^2 \right) = k^{-d/p}\, \mathbb{P}\left( 4U \leq Z^2 \mbox{ and } U \geq \delta^2 \right) = k^{-d/p}\, \mathbb{P}\left( Z^2/4 \geq U \geq \delta^2 \right). \quad\quad (36)$$
The exchange from integral to probability is simply the interpretation of the integrand as the joint pdf of $U$ and $Z$.
For the upper bound, we utilize the classic normal tail bound $\mathbb{P}(x < Z) \leq e^{-x^2/2}/(x\sqrt{2\pi})$ for $x > 0$:
$$P \leq \mathbb{E}\left[ \frac{e^{-\left( \delta + \sqrt{U}\sqrt{k^{2/p}-1} \right)^2/2}}{\left( \delta + \sqrt{U}\sqrt{k^{2/p}-1} \right)\sqrt{2\pi}} \right] \leq \frac{e^{-\delta^2/2}}{\delta\sqrt{2\pi}}\, \mathbb{E}\left[ e^{-\delta\sqrt{U}\sqrt{k^{2/p}-1}\, -\, U(k^{2/p}-1)/2} \right]. \quad\quad (37)$$
Observing the bound that for positive $x$, $e^{-x} \leq 1/x$, and recalling that $d \geq 2$,
$$P \leq \frac{e^{-\delta^2/2}}{\delta\sqrt{2\pi}}\, \mathbb{E}\left[ \frac{e^{-U(k^{2/p}-1)/2}}{\delta\sqrt{U}\sqrt{k^{2/p}-1}} \right] = \frac{e^{-\delta^2/2}}{\delta^2\sqrt{2\pi}\sqrt{k^{2/p}-1}}\, \mathbb{E}\left[ U^{-1/2}\, e^{-U(k^{2/p}-1)/2} \right] = \frac{e^{-\delta^2/2}}{\delta^2\sqrt{2\pi}\sqrt{k^{2/p}-1}} \left( \frac{k^{(1-d)/p}\, \Gamma\left( \frac{d-1}{2} \right)}{\sqrt{2}\, \Gamma\left( \frac{d}{2} \right)} \right). \quad\quad (38)$$
Here we utilize the following bounds: $e^x - 1 \geq (e/2)x^2$ for $x > 0$ (applied with $x = (2/p)\ln k$), which is easy to prove, and $\Gamma\left( \frac{d-1}{2} \right)/\Gamma\left( \frac{d}{2} \right) \leq \sqrt{2\pi/d}$, which may be proved by induction on integer $d \geq 2$:
$$P \leq \frac{e^{-(1+\delta^2)/2}\, p}{2\, \delta^2 \ln k}\, \frac{k^{(1-d)/p}}{\sqrt{d}}. \quad\quad (39)$$
This completes the proof.

Remark 4
Room for Improvement: The choice of the bound $e^x - 1 \geq (e/2)x^2$ above was in fact arbitrary; other bounds, such as those involving alternative powers of $x$, could be used. This would influence how the resulting bound on $P$ is utilized, for instance in the proof of Theorem 3. The use of $e^{-x} \leq 1/x$ in Eq. (38) should be considered similarly.

Proposition 6
Conjecture 1 is false, and for each $i$ and each $\varepsilon > 0$,
$$k\, \mathbb{P}\left( \bar{X}_{ij} + S_i(j)\sqrt{k^{2/j} - 1} < \mu_i - \varepsilon \mbox{ for some } 1 \leq j \leq k \right) \to \infty \quad \mbox{as } k \to \infty. \quad\quad (40)$$

Proof [of Proposition 6] Define the events $A^i_{j,k,\varepsilon} = \left\{ \bar{X}_{ij} + S_i(j)\sqrt{k^{2/j}-1} < \mu_i - \varepsilon \right\}$. As the samples are taken to be normally distributed with mean $\mu_i$ and variance $\sigma_i^2$, we have that $\bar{X}_{ij} - \mu_i \sim Z \sigma_i/\sqrt{j}$ and $S_i^2(j) \sim \sigma_i^2 U / j$, where $Z$ is a standard normal, $U \sim \chi^2_{j-1}$, and $Z$, $U$ are independent. Hence,
$$\mathbb{P}\left( A^i_{j,k,\varepsilon} \right) = \mathbb{P}\left( Z\frac{\sigma_i}{\sqrt{j}} + \sqrt{\frac{U \sigma_i^2}{j}}\sqrt{k^{2/j}-1} < -\varepsilon \right) = \mathbb{P}\left( \frac{\varepsilon\sqrt{j}}{\sigma_i} + \sqrt{U}\sqrt{k^{2/j}-1} < Z \right). \quad\quad (41)$$
The last step is simply a re-arrangement, and an observation on the symmetry of the distribution of $Z$. For $j \geq 3$, we may apply Proposition 2 here for $d = j-1$, $p = j$, and $\delta = \varepsilon\sqrt{j}/\sigma_i$, to yield
$$\mathbb{P}\left( A^i_{j,k,\varepsilon} \right) \geq \frac{k^{1/j}}{k}\, \mathbb{P}\left( Z^2/4 \geq U \geq \varepsilon^2 \sigma_i^{-2} j \right). \quad\quad (42)$$
For a fixed $j \geq 3$ and $k \geq j$, the factor $c_j = \mathbb{P}\left( Z^2/4 \geq U \geq \varepsilon^2 \sigma_i^{-2} j \right) > 0$ does not depend on $k$, and we have
$$\mathbb{P}\left( A^i_{j',k,\varepsilon} \mbox{ for some } 1 \leq j' \leq k \right) \geq \mathbb{P}\left( A^i_{j,k,\varepsilon} \right) \geq c_j\, \frac{k^{1/j}}{k}. \quad\quad (43)$$
Since $k^{1/j} \to \infty$ as $k \to \infty$, the proposition follows immediately.

Proposition 7
For $G > 0$ and $0 \leq \varepsilon < 1/2$, the following holds:
$$\frac{1}{\ln\left( 1 + \frac{G(1-\varepsilon)^2}{1+\varepsilon} \right)} \leq \frac{1}{\ln(1+G)} + \frac{10\, G}{(1+G)\ln^2(1+G)}\, \varepsilon. \quad\quad (44)$$

Proof
For any $G > 0$, the function $1 \Big/ \ln\left( 1 + \frac{G(1-\varepsilon)^2}{1+\varepsilon} \right)$ is positive, increasing, and convex on $\varepsilon \in [0, 1)$ (Proposition 8). For a given $G > 0$, noting that the above inequality holds (as equality) at $\varepsilon = 0$, due to the convexity it suffices to show that the inequality is satisfied at $\varepsilon = 1/2$, or
$$\frac{1}{\ln\left( 1 + \frac{G}{6} \right)} \leq \frac{5G}{(1+G)\ln^2(1+G)} + \frac{1}{\ln(1+G)}. \quad\quad (45)$$
Equivalently, multiplying through by $(1+G)\ln^2(1+G) > 0$, we consider the inequality
$$0 \leq 5G + (1+G)\ln(1+G) - \frac{(1+G)\ln^2(1+G)}{\ln\left( 1 + \frac{G}{6} \right)}. \quad\quad (46)$$
Define the function $F(G)$ to be the RHS of Ineq. (46). Note that as $G \to 0^+$, $F(G) \to 0$, and differentiating, we have (for $G > 0$)
$$F'(G) = 6 + \ln(1+G) - \frac{\ln(1+G)\left( \ln(1+G) + 2 \right)}{\ln\left( 1 + \frac{G}{6} \right)} + \frac{(1+G)\ln^2(1+G)}{(6+G)\ln^2\left( 1 + \frac{G}{6} \right)} \geq 0. \quad\quad (47)$$
It follows that $F(G) \geq 0$, and hence the desired inequality holds at $\varepsilon = 1/2$. This completes the proof.
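The linear bound of Ineq. (44) is easy to spot-check numerically. The following sketch (not part of the proof; the grid of $G$ and $\varepsilon$ values is an illustrative choice) evaluates both sides over the grid and confirms the gap is nonnegative, with equality at $\varepsilon = 0$:

```python
import math

def lhs(G, eps):
    # left side of (44): 1 / ln(1 + G(1-eps)^2 / (1+eps))
    return 1.0 / math.log(1.0 + G * (1.0 - eps) ** 2 / (1.0 + eps))

def rhs(G, eps):
    # right side of (44): linear-in-eps upper bound
    return (1.0 / math.log(1.0 + G)
            + 10.0 * G * eps / ((1.0 + G) * math.log(1.0 + G) ** 2))

gaps = [rhs(G, i / 100.0) - lhs(G, i / 100.0)
        for G in (0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 100.0)
        for i in range(50)]  # eps in [0, 0.49]
print(min(gaps))  # zero at eps = 0, positive otherwise
```

The gap closes as $G \to 0^+$ and $\varepsilon \to 1/2$, reflecting that the constant in the $\varepsilon$-coefficient cannot be improved uniformly in $G$.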
Proposition 8
The function $H_G(\varepsilon) = 1 \Big/ \ln\left( 1 + \frac{G(1-\varepsilon)^2}{1+\varepsilon} \right)$ is positive, increasing, and convex in $\varepsilon \in [0,1)$, for any constant $G > 0$.

Proof
That $H_G(\varepsilon)$ is positive and increasing in $\varepsilon$ follows immediately from inspection of $H_G$ and $H'_G$, given the hypotheses on $G$ and $\varepsilon$. To demonstrate convexity, by inspection of the terms of $H''_G(\varepsilon)$, it suffices to show that for all relevant $G$ and $\varepsilon$, the following inequality holds:
$$2G(1-\varepsilon)^2(3+\varepsilon)^2 + \left( -8(1+\varepsilon) + G(1-\varepsilon)^2\left( 1 + \varepsilon(6+\varepsilon) \right) \right) \ln\left( 1 + \frac{G(1-\varepsilon)^2}{1+\varepsilon} \right) \geq 0. \quad\quad (48)$$
Defining $C = G(1-\varepsilon)^2/(1+\varepsilon)$, it is sufficient to show that for all $C > 0$ and $\varepsilon \in [0,1)$ (eliminating a factor of $(1+\varepsilon)$ from the above),
$$2C(3+\varepsilon)^2 + \left( -8 + C\left( 1 + \varepsilon(6+\varepsilon) \right) \right)\ln(1+C) \geq 0. \quad\quad (49)$$
Defining $J_C(\varepsilon)$ as the LHS of the above, note that $J'_C(\varepsilon) = 2C(3+\varepsilon)\left( 2 + \ln(1+C) \right) > 0$. It suffices then to show $J_C(0) \geq 0$, or $18C + (C-8)\ln(1+C) \geq 0$. Note this holds at $C = 0$, and $d/dC\left[ J_C(0) \right] = (10 + 19C)/(1+C) + \ln(1+C) > 0$ for $C \geq 0$. Hence, $J_C(\varepsilon) \geq 0$, and $H''_G(\varepsilon) \geq 0$.
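As a quick numerical companion to Proposition 8 (a sanity check, not part of the proof; the grids of $G$, $\varepsilon$, and $C$ values are illustrative choices), one may verify positivity and monotonicity of $H_G$ directly, convexity via second differences, and the final scalar inequality $18C + (C-8)\ln(1+C) \geq 0$:

```python
import math

def H(G, eps):
    # H_G(eps) = 1 / ln(1 + G(1-eps)^2 / (1+eps))
    return 1.0 / math.log(1.0 + G * (1.0 - eps) ** 2 / (1.0 + eps))

# positivity, monotonicity, and convexity via finite differences on [0, 0.99]
for G in (0.1, 1.0, 10.0):
    vals = [H(G, i / 100.0) for i in range(100)]
    assert all(v > 0 for v in vals)                   # positive
    diffs = [b - a for a, b in zip(vals, vals[1:])]
    assert all(d > 0 for d in diffs)                  # increasing
    second = [b - a for a, b in zip(diffs, diffs[1:])]
    assert all(s > -1e-9 for s in second)             # convex

# J_C(0) = 18C + (C - 8) ln(1 + C) >= 0, the last step of the proof
for C in (x / 10.0 for x in range(1, 500)):
    assert 18 * C + (C - 8) * math.log(1 + C) >= 0

print("Proposition 8 checks passed")
```

The finite-difference tolerance absorbs floating-point noise near $\varepsilon = 1$, where $H_G$ blows up; the asserted signs match the strict inequalities established above.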