Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem
Department of MSIS - Rutgers University - Technical Report 2015
Wesley Cowan
cwcowan@math.rutgers.edu
Department of Mathematics, Rutgers University, 110 Frelinghuysen Rd., Piscataway, NJ 08854, USA
Junya Honda
honda@it.k.u-tokyo.ac.jp
Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba 277-8561, Japan
Michael N. Katehakis
mnk@rutgers.edu
Department of Management Science and Information Systems, Rutgers University, 100 Rockafeller Rd., Piscataway, NJ 08854, USA
Abstract
Consider the problem of sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X_{ik}$, $i = 1, \dots, N$, and $k = 1, 2, \dots$; where $X_{ik}$ denotes the outcome from population $i$ the $k$-th time it is sampled. It is assumed that for each fixed $i$, $\{X_{ik}\}_{k \geq 1}$ is a sequence of i.i.d. normal random variables, with unknown mean $\mu_i$ and unknown variance $\sigma_i^2$. The objective is to have a policy $\pi$ for deciding from which of the $N$ populations to sample at any time $t = 1, 2, \dots$ so as to maximize the expected sum of outcomes of $n$ total samples, or equivalently, to minimize the regret due to lack of information on the parameters $\mu_i$ and $\sigma_i^2$. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from Burnetas and Katehakis (1996b). Additionally, finite horizon regret bounds are given. Keywords:
Inflated Sample Means, Multi-armed Bandits, Sequential Allocation
1. Introduction and Summary
Consider the problem of a controller sampling sequentially from a finite number of $N \geq 2$ populations, or 'bandits', where the samples from population $i$ are specified by a sequence of i.i.d. random variables $\{X_{ik}\}_{k \geq 1}$, taken to be normal with finite mean $\mu_i$ and finite variance $\sigma_i^2$. The means $\{\mu_i\}$ and variances $\{\sigma_i^2\}$ are taken to be unknown to the controller. It is convenient to define the maximum mean, $\mu^* = \max_i\{\mu_i\}$, and the bandit discrepancies $\{\Delta_i\}$, where $\Delta_i = \mu^* - \mu_i \geq 0$. It is additionally convenient to define $\sigma^{*2}$ as the minimal variance of any bandit that achieves $\mu^*$, that is, $\sigma^{*2} = \min_{i: \mu_i = \mu^*} \sigma_i^2$. In this paper, given $k$ samples from population $i$ we will take the estimators $\bar{X}_{ik} = \sum_{t=1}^k X_{it}/k$ and $S_i^2(k) = \sum_{t=1}^k \left(X_{it} - \bar{X}_{ik}\right)^2/k$ for $\mu_i$ and $\sigma_i^2$ respectively. Note that the use of the biased estimator for the variance, with the $1/k$ factor in place of $1/(k-1)$, is largely for aesthetic purposes; the results presented here adapt to the use of the unbiased estimator as well.
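As a concrete illustration of these two estimators (a minimal sketch; the function name is ours, not the paper's):

```python
def estimators(samples):
    """Return the sample mean and the biased sample variance of the text:
    X-bar_ik = (1/k) sum_t X_it  and  S_i^2(k) = (1/k) sum_t (X_it - X-bar_ik)^2."""
    k = len(samples)
    mean = sum(samples) / k
    var = sum((x - mean) ** 2 for x in samples) / k  # 1/k factor, not 1/(k - 1)
    return mean, var
```

With the unbiased convention one would divide by $k-1$ instead; as noted above, the results adapt to either choice.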
1. A substantial portion of the results reported here were derived independently by Cowan and Katehakis, and by Honda.
© Wesley Cowan, Junya Honda and Michael N. Katehakis.
For any adaptive, non-anticipatory policy $\pi$, $\pi(t) = i$ indicates that the controller samples bandit $i$ at time $t$. Define $T^i_\pi(n) = \sum_{t=1}^n \mathbb{1}\{\pi(t) = i\}$, denoting the number of times bandit $i$ has been sampled during the periods $t = 1, \dots, n$ under policy $\pi$; we take, as a convenience, $T^i_\pi(0) = 0$ for all $i, \pi$. The value of a policy $\pi$ is the expected sum of the first $n$ outcomes under $\pi$, which we define to be the function $V_\pi(n)$:

$$V_\pi(n) = \mathbb{E}\left[\sum_{i=1}^N \sum_{k=1}^{T^i_\pi(n)} X_{ik}\right] = \sum_{i=1}^N \mu_i\, \mathbb{E}\left[T^i_\pi(n)\right], \qquad (1)$$

where for simplicity the dependence of $V_\pi(n)$ on the true, unknown, values of the parameters $\mu = (\mu_1, \dots, \mu_N)$ and $\sigma^2 = (\sigma_1^2, \dots, \sigma_N^2)$ is suppressed. The pseudo-regret, or simply regret, of a policy is taken to be the expected loss due to ignorance of the parameters $\mu$ and $\sigma^2$ by the controller. Had the controller complete information, she would at every round activate some bandit $i^*$ such that $\mu_{i^*} = \mu^* = \max_i\{\mu_i\}$. For a given policy $\pi$, we define the expected regret of that policy at time $n$ as

$$R_\pi(n) = n\mu^* - V_\pi(n) = \sum_{i=1}^N \Delta_i\, \mathbb{E}\left[T^i_\pi(n)\right]. \qquad (2)$$

It follows from Eqs. (1) and (2) that maximization of $V_\pi(n)$ with respect to $\pi$ is equivalent to minimization of $R_\pi(n)$. This type of loss due to ignorance of the means (regret) was first introduced in the context of an $N = 2$ problem by Robbins (1952), who considered the loss $L_\pi(n)/n = \mu^* - \sum_{i=1}^N \sum_{k=1}^{T^i_\pi(n)} X_{ik}/n$ (for which $R_\pi(n) = \mathbb{E}[L_\pi(n)]$), constructing a modified (along two sparse sequences) 'play the winner' policy, $\pi_R$, such that $L_{\pi_R}(n) = o(n)$ (a.s.) and $R_{\pi_R}(n) = o(n)$, using for his derivation only the assumption of the Strong Law of Large Numbers. Following Burnetas and Katehakis (1996b), when $n \to \infty$, if $\pi$ is such that $R_\pi(n) = o(n)$ we say policy $\pi$ is uniformly convergent (UC) (since then $\lim_{n\to\infty} V_\pi(n)/n = \mu^*$).
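Eq. (2) reduces the regret to the expected pull counts of the sub-optimal bandits; a small sketch (the numbers and the helper name are our own, purely illustrative):

```python
def pseudo_regret(means, expected_pulls):
    """R_pi(n) = sum_i Delta_i * E[T_i(n)], with Delta_i = mu* - mu_i, per Eq. (2)."""
    mu_star = max(means)
    return sum((mu_star - mu) * t for mu, t in zip(means, expected_pulls))

# Three bandits with means (1.0, 0.5, 0.0) pulled (90, 6, 4) times in
# expectation over a horizon of n = 100: regret = 0.5*6 + 1.0*4 = 7.0.
```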
However, if under a policy $\pi$, $R_\pi(n)$ grew at a slower pace, such as $R_\pi(n) = o(n^{1/2})$, or better $R_\pi(n) = o(n^{1/4})$, etc., then the controller would be assured that $\pi$ is making an effective trade-off between exploration and exploitation. It turns out that it is possible to construct 'uniformly fast convergent' (UFC) policies, also known as consistent or strongly consistent, defined as the policies $\pi$ for which $R_\pi(n) = o(n^\alpha)$, for all $\alpha > 0$ and all $(\mu, \sigma^2)$. The existence of UFC policies in the case considered here is well established; e.g., Auer et al. (2002) (fig. 4 therein) presented the following UFC policy $\pi_{ACF}$:

Policy $\pi_{ACF}$ (UCB1-NORMAL). At each $n = 1, 2, \dots$:
i) Sample from any bandit $i$ for which $T^i_{\pi_{ACF}}(n) < \lceil 8 \ln n \rceil$.
ii) If $T^i_{\pi_{ACF}}(n) \geq \lceil 8 \ln n \rceil$ for all $i = 1, \dots, N$, sample from bandit $\pi_{ACF}(n+1)$ with

$$\pi_{ACF}(n+1) = \arg\max_i \left\{ \bar{X}_{iT^i_\pi(n)} + 4\, S_i(T^i_\pi(n)) \sqrt{\frac{\ln n}{T^i_\pi(n)}} \right\}. \qquad (3)$$

(Taking, in this case, $S_i^2(k)$ as the unbiased estimator.)

Additionally, Auer et al. (2002) (in Theorem 4 therein) gave the following bound:

$$R_{\pi_{ACF}}(n) \leq M_{ACF}(\mu, \sigma^2)\ln n + C_{ACF}(\mu), \quad \text{for all } n \text{ and all } (\mu, \sigma^2), \qquad (4)$$

with

$$M_{ACF}(\mu, \sigma^2) = 256 \sum_{i:\mu_i \neq \mu^*} \frac{\sigma_i^2}{\Delta_i} + 8\sum_{i=1}^N \Delta_i, \qquad (5)$$

$$C_{ACF}(\mu) = \left(1 + \frac{\pi^2}{2}\right)\sum_{i=1}^N \Delta_i. \qquad (6)$$

Ineq. (4) readily implies that $R_{\pi_{ACF}}(n) \leq M_{ACF}(\mu, \sigma^2)\ln n + o(\ln n)$. Thus, since $\ln n = o(n^\alpha)$ for all $\alpha > 0$ and $R_{\pi_{ACF}}(n) \geq 0$, it follows that $\pi_{ACF}$ is uniformly fast convergent. Given that UFC policies exist, the question immediately follows: just how fast can they be?
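A step of $\pi_{ACF}$ can be sketched as follows (our own paraphrase of the policy as restated above, with our variable names; the $\lceil 8\ln n\rceil$ threshold is the forced-sampling rule of step i)):

```python
import math

def ucb1_normal_index(mean, var_unbiased, t_i, n):
    # Index of Eq. (3): X-bar + 4 * S * sqrt(ln n / T), with S^2 the unbiased variance.
    return mean + 4.0 * math.sqrt(var_unbiased * math.log(n) / t_i)

def ucb1_normal_choose(stats, n):
    """stats: per-bandit tuples (sample mean, unbiased variance, pull count).
    Step i): force-sample any bandit pulled fewer than ceil(8 ln n) times;
    step ii): otherwise pick the bandit maximizing the index of Eq. (3)."""
    threshold = math.ceil(8 * math.log(n))
    for i, (_, _, t_i) in enumerate(stats):
        if t_i < threshold:
            return i
    return max(range(len(stats)), key=lambda i: ucb1_normal_index(*stats[i], n))
```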
The primary motivation of this paper is the following general result, from Burnetas and Katehakis (1996b), where they showed that for any UFC policy $\pi$, the following holds:

$$\liminf_{n\to\infty} \frac{R_\pi(n)}{\ln n} \geq M_{BK}(\mu, \sigma^2), \quad \text{for all } (\mu, \sigma^2), \qquad (7)$$

where the bound itself, $M_{BK}(\mu, \sigma^2)$, is determined by the specific distributions of the populations, in this case

$$M_{BK}(\mu, \sigma^2) = \sum_{i:\mu_i \neq \mu^*} \frac{2\,\Delta_i}{\ln\left(1 + \frac{\Delta_i^2}{\sigma_i^2}\right)}. \qquad (8)$$

For comparison, depending on the specifics of the bandit distributions, there is a considerable distance between the logarithmic term of the upper bound of Eq. (4) and the lower bound implied by Eq. (8). The derivation of Ineq. (7) implies that in order to guarantee that a policy is uniformly fast convergent, sub-optimal populations have to be sampled at least a logarithmic number of times. The above bound is a special case of a more general result derived in Burnetas and Katehakis (1996b) (part 1 of Theorem 1 therein) for distributions with multiple unknown parameters (such as in the current problem of normal populations with both the mean and the variance being unknown): $M_{BK}(\mu, \sigma^2) = \sum_{i:\mu_i \neq \mu^*} \Delta_i / K_i(\mu, \sigma^2)$, with

$$K_i(\mu, \sigma^2) = \inf_{(\mu_i', \sigma_i'^2)}\left\{ I\left(f_{(\mu_i, \sigma_i^2)};\, f_{(\mu_i', \sigma_i'^2)}\right) : \mu_i' > \mu^*,\ \sigma_i'^2 > 0 \right\} = \frac{1}{2}\ln\left(1 + \frac{\Delta_i^2}{\sigma_i^2}\right),$$

where $I$ denotes the Kullback-Leibler divergence. Previously, Lai and Robbins (1985) had obtained such lower bounds for distributions with one unknown parameter (such as in the current problem of normal populations with unknown mean but known variance). Allocation policies that achieved the lower bounds were called asymptotically efficient or optimal in Lai and Robbins (1985). Ineq.
(7) motivates the definition of a uniformly fast convergent policy $\pi$ as having a uniformly maximal convergence rate (UM), or simply being asymptotically optimal, within the class of uniformly fast convergent policies, if $\lim_{n\to\infty} R_\pi(n)/\ln n = M_{BK}(\mu, \sigma^2)$, since then $V_\pi(n) = n\mu^* - M_{BK}(\mu, \sigma^2)\ln n + o(\ln n)$. Burnetas and Katehakis (1996b) proposed the following index policy $\pi_{BK}$ as one that could achieve this lower bound:
Policy $\pi_{BK}$ (UCB-NORMAL)
i) For $n = 1, 2, \dots, 2N$, sample each bandit twice, and
ii) for $n \geq 2N$, sample from bandit $\pi_{BK}(n+1)$ with

$$\pi_{BK}(n+1) = \arg\max_i \left\{ \bar{X}_{iT^i_\pi(n)} + S_i(T^i_\pi(n))\sqrt{n^{\frac{2}{T^i_\pi(n)}} - 1} \right\}. \qquad (9)$$

Burnetas and Katehakis (1996b) were not able to establish the asymptotic optimality of the $\pi_{BK}$ policy because they were not able to establish a sufficient condition (Condition A3 therein), which we express here as the following equivalent conjecture (the referenced open question in the subtitle).
Conjecture 1
For each $i$, for every $\varepsilon > 0$, and for $k \to \infty$, the following is true:

$$P\left( \bar{X}_{ij} + S_i(j)\sqrt{k^{2/j} - 1} < \mu_i - \varepsilon \ \text{ for some } 2 \leq j \leq k \right) = o(1/k). \qquad (10)$$

We show that the above conjecture is false (cf. Proposition 6 in the Appendix). This does not imply that $\pi_{BK}$ fails to be UM (i.e., to be asymptotically optimal), but this failure means that the techniques established in Burnetas and Katehakis (1996b) are insufficient to verify its optimality. All is not lost, however. One of the central results of this paper is to establish that with a small change, the policy $\pi_{BK}$ may be modified to one that is provably asymptotically optimal. We introduce in this paper the policy $\pi_{CHK}$ defined in the following way:
Policy $\pi_{CHK}$ (UCB-NORMAL)
i) For $n = 1, 2, \dots, 3N$, sample each bandit three times, and
ii) for $n \geq 3N$, sample from bandit $\pi_{CHK}(n+1)$ with

$$\pi_{CHK}(n+1) = \arg\max_i \left\{ \bar{X}_{iT^i_\pi(n)} + S_i(T^i_\pi(n))\sqrt{n^{\frac{2}{T^i_\pi(n)-2}} - 1} \right\}. \qquad (11)$$

Remark 1
1) Note that policy $\pi_{CHK}$ is only a slight modification of policy $\pi_{BK}$; the only difference between their indices is the exponent of $n$ under the radical, i.e., $2/(T^i_\pi(n)-2)$ in $\pi_{CHK}(n+1)$ replacing $2/T^i_\pi(n)$ in $\pi_{BK}(n+1)$. This change, while seemingly asymptotically negligible (as in practice $T^i_\pi(n) \to \infty$ (a.s.) with $n$), has a profound effect on what is provable about $\pi_{CHK}$.

2) We note that the indices of policy $\pi_{CHK}$ are a significant modification of those of the optimal allocation policy $\pi_\sigma$ for the case of normal bandits with known variances, cf. Burnetas and Katehakis (1996b) and Katehakis and Robbins (1995), which are:

$$\pi_\sigma(n+1) = \arg\max_i \left\{ \bar{X}_{iT^i_\pi(n)} + \sigma_i\sqrt{\frac{2\ln n}{T^i_\pi(n)}} \right\},$$

the difference being replacing the term $\sigma_i\sqrt{2\ln n / T^i_\pi(n)}$ in $\pi_\sigma$ by $S_i(T^i_\pi(n))\sqrt{n^{2/(T^i_\pi(n)-2)} - 1}$ in $\pi_{CHK}$. However, the indices of policy $\pi_{ACF}$ are a minor modification of those of $\pi_\sigma$, the difference being replacing the term $\sigma_i\sqrt{2\ln n / T^i_\pi(n)}$ in $\pi_\sigma$ by $4\,S_i(T^i_\pi(n))\sqrt{\ln n / T^i_\pi(n)}$ in $\pi_{ACF}$.
3) The $\pi_{BK}$ and $\pi_\sigma$ policies can be seen as connected in the following way, observing that $2\ln n / T^i_\pi(n)$ is a first-order approximation of $n^{2/T^i_\pi(n)} - 1 = e^{(2/T^i_\pi(n))\ln n} - 1$. Whether $\pi_{BK}$ itself, for every $(\mu, \sigma^2)$, achieves the asymptotic lower bound $M_{BK}(\mu, \sigma^2)$ remains an open question.

The structure of the rest of the paper is as follows. In Section 2, Theorem 3 establishes a finite horizon bound on the regret of $\pi_{CHK}$. From this bound, it follows that $\pi_{CHK}$ is asymptotically optimal (Theorem 4), and we provide a bound on the remainder term (Theorem 5). Additionally, in Section 3, the Thompson sampling policy of Honda and Takemura (2013) and $\pi_{CHK}$ are compared and discussed, as both achieve asymptotic optimality.
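The two indices, and the first-order connection noted above, can be sketched numerically (a sketch under our own naming; the $T-2$ in the exponent requires $T \geq 3$):

```python
import math

def chk_index(mean, var_biased, t_i, n):
    # pi_CHK index, Eq. (11): X-bar + S * sqrt(n^{2/(T-2)} - 1); needs T >= 3.
    return mean + math.sqrt(var_biased) * math.sqrt(n ** (2.0 / (t_i - 2)) - 1.0)

def bk_index(mean, var_biased, t_i, n):
    # pi_BK index, Eq. (9): identical except for the exponent 2/T.
    return mean + math.sqrt(var_biased) * math.sqrt(n ** (2.0 / t_i) - 1.0)

# Remark 1(3): n^{2/T} - 1 = e^{(2/T) ln n} - 1 is approximately (2 ln n)/T
# for large T, recovering the known-variance pi_sigma bonus to first order:
n, t = 10_000, 1_000
exact = n ** (2.0 / t) - 1.0       # about 0.0186
approx = 2.0 * math.log(n) / t     # about 0.0184
```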
2. The Optimality Theorem and Finite Time Bounds
The main results of this paper, that Conjecture 1 is false (cf. Proposition 6 in the Appendix), the asymptotic optimality, and the bounds on the behavior of $\pi_{CHK}$, all depend on the following probability bounds; we note that tighter bounds seem possible, but these are sufficient for this paper.
Proposition 2
Let $Z$, $U$ be independent random variables, $Z \sim N(0,1)$ a standard normal, and $U \sim \chi_d^2$ a chi-squared distribution with $d$ degrees of freedom, where $d \geq 2$. For $\delta > 0$, $p > 0$, the following holds for all $k \geq 2$:

$$P\left(Z \geq 2\sqrt{U} \geq 2\delta\right)k^{-d/p} \;\leq\; P\left(\delta + \sqrt{U}\sqrt{k^{2/p}-1} < Z\right) \;\leq\; \frac{e^{-d\left(1+\delta^2\right)/(2p)}}{\delta\sqrt{d}}\,\frac{k^{-d/p}}{\ln k}. \qquad (12)$$

Proof [of Proposition 2] The proof is given in the Appendix.
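The central probability in (12) is easy to explore numerically; the following Monte Carlo sketch (our own illustration, not part of the proof) estimates it for given $(d, \delta, p, k)$:

```python
import math, random

def tail_estimate(d, delta, p, k, trials=100_000, seed=1):
    """Estimate P(delta + sqrt(U) * sqrt(k^{2/p} - 1) < Z), with Z ~ N(0,1)
    and U ~ chi-squared with d degrees of freedom, independent, by simulation."""
    rng = random.Random(seed)
    radical = math.sqrt(k ** (2.0 / p) - 1.0)
    hits = 0
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)
        u = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))  # one chi^2_d draw
        if delta + math.sqrt(u) * radical < z:
            hits += 1
    return hits / trials
```

For fixed $d$, $p$, $\delta$, the estimate decays polynomially in $k$, in line with the $k^{-d/p}$ factor appearing on both sides of (12).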
Theorem 3
For policy $\pi_{CHK}$ as defined above, the following bounds hold for all $n \geq 3N$ and all $\varepsilon \in (0,1)$:

$$R_{\pi_{CHK}}(n) \leq \sum_{i:\mu_i \neq \mu^*}\left( \frac{2\,\Delta_i \ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + \sqrt{\frac{2\pi}{e}}\,\frac{4\,\sigma^{*2}}{\Delta_i\varepsilon^2}\ln\ln n + \frac{8\,\Delta_i}{\varepsilon^2} + \frac{8\,\sigma_i^2}{\Delta_i\varepsilon^2} + 7\,\Delta_i \right). \qquad (13)$$
Before giving the proof of this bound, we present two results, the first demonstrating the asymptotic optimality of $\pi_{CHK}$, the second giving an $\varepsilon$-free version of the above bound, which gives a bound on the sub-logarithmic remainder term. It is worth noting the following: the bounds of Theorem 3 can actually be improved, through the use of a modified version of Proposition 2, to eliminate the $\ln\ln n$ dependence, so that the only dependence on $n$ is through the initial $\ln n$ term. The cost of this, however, is a dependence on a larger power of $1/\varepsilon$. The particular form of the bound given in Eq. (13) was chosen to simplify the following two results, cf. Remark 4 in the proof of Proposition 2.

Theorem 4
For a policy $\pi_{CHK}$ as defined above, $\pi_{CHK}$ is asymptotically optimal in the sense that

$$\lim_{n\to\infty} \frac{R_{\pi_{CHK}}(n)}{\ln n} = M_{BK}(\mu, \sigma^2). \qquad (14)$$

Proof [of Theorem 4] For any $\varepsilon$ such that $0 < \varepsilon < 1$, we have from Theorem 3 that the following holds:

$$\limsup_{n\to\infty} \frac{R_{\pi_{CHK}}(n)}{\ln n} \leq \sum_{i:\mu_i \neq \mu^*} \frac{2\,\Delta_i}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)}. \qquad (15)$$

Taking the infimum over all such $\varepsilon$,

$$\limsup_{n\to\infty} \frac{R_{\pi_{CHK}}(n)}{\ln n} \leq \sum_{i:\mu_i \neq \mu^*} \frac{2\,\Delta_i}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\right)} = M_{BK}(\mu, \sigma^2), \qquad (16)$$

and observing the lower bound of Eq. (7) completes the result.

Theorem 5
For a policy $\pi_{CHK}$ as defined above, $R_{\pi_{CHK}}(n) \leq M_{BK}(\mu, \sigma^2)\ln n + O\left((\ln n)^{3/4}\ln\ln n\right)$, and more concretely

$$R_{\pi_{CHK}}(n) \leq M^1_{CHK}(\mu,\sigma^2)\ln n + M^2_{CHK}(\mu,\sigma^2)(\ln n)^{1/2}\ln\ln n + M^3_{CHK}(\mu,\sigma^2)(\ln n)^{3/4} + M^4_{CHK}(\mu,\sigma^2)(\ln n)^{1/2} + M^5_{CHK}(\mu,\sigma^2), \qquad (17)$$

where

$$\begin{aligned}
M^1_{CHK}(\mu,\sigma^2) &= M_{BK}(\mu,\sigma^2) \\
M^2_{CHK}(\mu,\sigma^2) &= \sqrt{\frac{2\pi}{e}}\sum_{i:\mu_i \neq \mu^*} 4\left(\frac{\sigma^{*2}}{\Delta_i}\right) \\
M^3_{CHK}(\mu,\sigma^2) &= \sum_{i:\mu_i \neq \mu^*} \frac{6\,\Delta_i\left(\sigma_i^2+\Delta_i^2\right)}{\sigma_i^2\ln^2\left(1+\frac{\Delta_i^2}{\sigma_i^2}\right)} \\
M^4_{CHK}(\mu,\sigma^2) &= \sum_{i:\mu_i \neq \mu^*} 8\left(\Delta_i+\frac{\sigma_i^2}{\Delta_i}\right) \\
M^5_{CHK}(\mu,\sigma^2) &= 7\sum_{i:\mu_i \neq \mu^*} \Delta_i.
\end{aligned} \qquad (18)$$
While the above bound admittedly has a more complex form than a bound such as that of Eq. (4), it demonstrates the asymptotic optimality of the dominating term, and bounds the sub-linear remainder term.
Proof [of Theorem 5] The bound follows directly from Theorem 3, taking $\varepsilon = (\ln n)^{-1/4}$ for $n \geq 3$, and observing the following bound: for $\varepsilon$ such that $0 < \varepsilon < 1/2$,

$$\frac{2\,\Delta_i}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} \leq \frac{2\,\Delta_i}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\right)} + \frac{6\,\Delta_i\left(\sigma_i^2+\Delta_i^2\right)}{\sigma_i^2\ln^2\left(1+\frac{\Delta_i^2}{\sigma_i^2}\right)}\,\varepsilon. \qquad (19)$$

This inequality is proven separately as Proposition 7 in the Appendix.

We make no claim that the results of Theorems 3 and 5 are the best achievable for this policy $\pi_{CHK}$. At several points in the proofs, choices of convenience were made in the bounding of terms, and different techniques may yield tighter bounds still. But they are sufficient to demonstrate the asymptotic optimality of $\pi_{CHK}$, and give useful bounds on the growth of $R_{\pi_{CHK}}(n)$.

Proof [of Theorem 3] In this proof, we take $\pi = \pi_{CHK}$ as defined above. For notational convenience, we define the index function

$$u_i(k,j) = \bar{X}_{ij} + S_i(j)\sqrt{k^{\frac{2}{j-2}} - 1}. \qquad (20)$$

The structure of this proof will be to bound the expected value of $T^i_\pi(n)$ for all sub-optimal bandits $i$, and use this to bound the regret $R_\pi(n)$. The basic techniques follow those in Katehakis and Robbins (1995) for the known variance case, modified accordingly here for the unknown variance case and assisted by the probability bound of Proposition 2. For any $i$ such that $\mu_i \neq \mu^*$, we define the following quantities: Let $1 > \varepsilon > 0$, and $\tilde\varepsilon = \Delta_i\varepsilon/2$. For $n \geq 3N$,

$$\begin{aligned}
n_i^1(n,\varepsilon) &= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ u_i(t,T^i_\pi(t)) \geq \mu^*-\tilde\varepsilon,\ \bar{X}_{iT^i_\pi(t)} \leq \mu_i+\tilde\varepsilon,\ S_i^2(T^i_\pi(t)) \leq \sigma_i^2(1+\varepsilon)\right\} \\
n_i^2(n,\varepsilon) &= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ u_i(t,T^i_\pi(t)) \geq \mu^*-\tilde\varepsilon,\ \bar{X}_{iT^i_\pi(t)} \leq \mu_i+\tilde\varepsilon,\ S_i^2(T^i_\pi(t)) > \sigma_i^2(1+\varepsilon)\right\} \\
n_i^3(n,\varepsilon) &= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ u_i(t,T^i_\pi(t)) \geq \mu^*-\tilde\varepsilon,\ \bar{X}_{iT^i_\pi(t)} > \mu_i+\tilde\varepsilon\right\} \\
n_i^4(n,\varepsilon) &= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ u_i(t,T^i_\pi(t)) < \mu^*-\tilde\varepsilon\right\}.
\end{aligned} \qquad (21)$$

Hence, we have the following relationship for $n \geq 3N$, that

$$T^i_\pi(n+1) = 3 + \sum_{t=3N}^n \mathbb{1}\{\pi(t+1)=i\} = 3 + n_i^1(n,\varepsilon) + n_i^2(n,\varepsilon) + n_i^3(n,\varepsilon) + n_i^4(n,\varepsilon). \qquad (22)$$

The proof proceeds by bounding, in expectation, each of the four terms.
Observe that, by the structure of the index function $u_i$,

$$\begin{aligned}
n_i^1(n,\varepsilon) &\leq \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ (\mu_i+\tilde\varepsilon) + \sigma_i\sqrt{1+\varepsilon}\,\sqrt{t^{2/(T^i_\pi(t)-2)}-1} \geq \mu^*-\tilde\varepsilon\right\} \\
&= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t) \leq \frac{2\ln t}{\ln\left(1+\frac{(\Delta_i-2\tilde\varepsilon)^2}{\sigma_i^2(1+\varepsilon)}\right)} + 2\right\} \\
&= \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t) \leq \frac{2\ln t}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 2\right\} \\
&\leq \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t) \leq \frac{2\ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 2\right\} \\
&\leq \sum_{t=0}^n \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t) \leq \frac{2\ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 2\right\} \\
&\leq \frac{2\ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 2 + 2.
\end{aligned} \qquad (23)$$

The last inequality follows, observing that $T^i_\pi(t)$ may be expressed as the sum of $\mathbb{1}\{\pi(t)=i\}$ indicators, and seeing that the additional condition bounds the number of non-zero terms in the above sum. The additional $+2$ accounts for the $\pi(1)=i$ term and the $\pi(n+1)=i$ term. Note, this bound is sample-path-wise.

For the second term,

$$\begin{aligned}
n_i^2(n,\varepsilon) &\leq \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ S_i^2(T^i_\pi(t)) > \sigma_i^2(1+\varepsilon)\right\} \\
&= \sum_{t=3N}^n \sum_{k=3}^t \mathbb{1}\left\{\pi(t+1)=i,\ S_i^2(k) > \sigma_i^2(1+\varepsilon),\ T^i_\pi(t)=k\right\} \\
&= \sum_{t=3N}^n \sum_{k=3}^t \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t)=k\right\}\mathbb{1}\left\{S_i^2(k) > \sigma_i^2(1+\varepsilon)\right\} \\
&\leq \sum_{k=3}^n \mathbb{1}\left\{S_i^2(k) > \sigma_i^2(1+\varepsilon)\right\} \sum_{t=k}^n \mathbb{1}\left\{\pi(t+1)=i,\ T^i_\pi(t)=k\right\} \\
&\leq \sum_{k=3}^n \mathbb{1}\left\{S_i^2(k) > \sigma_i^2(1+\varepsilon)\right\}.
\end{aligned} \qquad (24)$$

The last inequality follows as, for fixed $k$, $\{\pi(t+1)=i, T^i_\pi(t)=k\}$ may be true for at most one value of $t$. Recall that $kS_i^2(k)/\sigma_i^2$ has the distribution of a $\chi_{k-1}^2$ random variable. Letting $U_k \sim \chi_k^2$, from the above we have

$$\mathbb{E}\left[n_i^2(n,\varepsilon)\right] \leq \sum_{k=3}^n P\left(S_i^2(k) > \sigma_i^2(1+\varepsilon)\right) \leq \sum_{k=3}^\infty P\left(U_{k-1}/k > 1+\varepsilon\right) \leq \sum_{k=3}^\infty P\left(U_{k-1}/(k-1) > 1+\varepsilon\right) = \sum_{k=2}^\infty P\left(U_k > k(1+\varepsilon)\right) \leq \frac{1}{\sqrt{e^{\varepsilon}/(1+\varepsilon)}-1} \leq \frac{8}{\varepsilon^2} < \infty. \qquad (25)$$

The penultimate step is a Chernoff bound on the terms, $P\left(U_k > k(1+\varepsilon)\right) \leq \left(e^{-\varepsilon}(1+\varepsilon)\right)^{k/2}$. To bound the third term, a similar rearrangement to Eq. (24) (using the sample mean instead of the sample variance) yields:

$$n_i^3(n,\varepsilon) \leq \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ \bar{X}_{iT^i_\pi(t)} > \mu_i+\tilde\varepsilon\right\} \leq \sum_{k=3}^n \mathbb{1}\left\{\bar{X}_{ik} > \mu_i+\tilde\varepsilon\right\}. \qquad (26)$$

Recalling that $\bar{X}_{ik} - \mu_i \sim Z\sigma_i/\sqrt{k}$ for $Z$ a standard normal,

$$\mathbb{E}\left[n_i^3(n,\varepsilon)\right] \leq \sum_{k=3}^n P\left(\bar{X}_{ik} > \mu_i+\tilde\varepsilon\right) \leq \sum_{k=1}^\infty P\left(Z\sigma_i/\sqrt{k} > \tilde\varepsilon\right) \leq \frac{1}{e^{\tilde\varepsilon^2/(2\sigma_i^2)}-1} \leq \frac{2\sigma_i^2}{\tilde\varepsilon^2} < \infty. \qquad (27)$$

The penultimate step is a Chernoff bound on the terms, $P\left(Z > \delta\sqrt{k}\right) \leq e^{-k\delta^2/2}$.

To bound the $n_i^4$ term, observe that in the event $\pi(t+1)=i$, from the structure of the policy it must be true that $u_i(t,T^i_\pi(t)) = \max_j u_j(t,T^j_\pi(t))$. Thus, if $i^*$ is some bandit such that $\mu_{i^*}=\mu^*$, $u_{i^*}(t,T^{i^*}_\pi(t)) \leq u_i(t,T^i_\pi(t))$. In particular, we take $i^*$ to be a bandit that not only achieves the maximal mean $\mu^*$, but also the minimal variance among optimal bandits, $\sigma_{i^*}^2 = \sigma^{*2}$. We have the following bound,

$$n_i^4(n,\varepsilon) \leq \sum_{t=3N}^n \mathbb{1}\left\{\pi(t+1)=i,\ u_{i^*}(t,T^{i^*}_\pi(t)) < \mu^*-\tilde\varepsilon\right\} \leq \sum_{t=3N}^n \mathbb{1}\left\{u_{i^*}(t,T^{i^*}_\pi(t)) < \mu^*-\tilde\varepsilon\right\} \leq \sum_{t=3N}^n \mathbb{1}\left\{u_{i^*}(t,s) < \mu^*-\tilde\varepsilon \text{ for some } 3 \leq s \leq t\right\}. \qquad (28)$$

The last step follows as for $t$ in this range, $3 \leq T^{i^*}_\pi(t) \leq t$. Hence

$$\mathbb{E}\left[n_i^4(n,\varepsilon)\right] \leq \sum_{t=3N}^n P\left(u_{i^*}(t,s) < \mu^*-\tilde\varepsilon \text{ for some } 3 \leq s \leq t\right). \qquad (29)$$

As an aside, this is essentially the point at which the conjectured Eq. (10) would have come into play for the proof of the optimality of $\pi_{BK}$, bounding the growth of the corresponding term for that policy. We will essentially prove a successful version of that conjecture here. Define the events $A^*_{s,t,\varepsilon} = \left\{u_{i^*}(t,s) < \mu^*-\tilde\varepsilon\right\}$.
Observing the distributions of the sample mean and sample variance, we have (similar to Eq. (41)) for $Z$ a standard normal and $U_{s-1} \sim \chi_{s-1}^2$, with $U$, $Z$ independent,

$$P\left(A^*_{s,t,\varepsilon}\right) = P\left(\frac{\tilde\varepsilon}{\sigma^*}\sqrt{s} + \sqrt{U_{s-1}}\sqrt{t^{2/(s-2)}-1} < Z\right) \leq \frac{e^{-(\tilde\varepsilon/\sigma^*)^2 s(s-1)/(2(s-2))}}{(\tilde\varepsilon/\sigma^*)\sqrt{s}\sqrt{e(s-1)}}\left(\frac{1}{t\ln t}\right) \leq \frac{e^{-(\tilde\varepsilon/\sigma^*)^2 s/2}}{(\tilde\varepsilon/\sigma^*)\sqrt{es}}\left(\frac{1}{t\ln t}\right) \leq \left(\frac{1}{(\tilde\varepsilon/\sigma^*)\sqrt{e}}\right)\frac{e^{-(\tilde\varepsilon/\sigma^*)^2 s/2}}{\sqrt{s}}\left(\frac{1}{t\ln t}\right), \qquad (30)$$

where the first inequality follows as an application of Proposition 2 (with $\delta = \tilde\varepsilon\sqrt{s}/\sigma^*$, $d = s-1$, $p = s-2$, $k = t$, bounding $t^{-(s-1)/(s-2)} \leq t^{-1}$), and the second since $s \geq 3$. Applying a union bound to Eq. (29),

$$\mathbb{E}\left[n_i^4(n,\varepsilon)\right] \leq \sum_{t=3N}^n \sum_{s=3}^t P\left(A^*_{s,t,\varepsilon}\right) \leq \sum_{t=3N}^n \sum_{s=3}^t \left(\frac{1}{(\tilde\varepsilon/\sigma^*)\sqrt{e}}\right)\frac{e^{-(\tilde\varepsilon/\sigma^*)^2 s/2}}{\sqrt{s}}\left(\frac{1}{t\ln t}\right) \leq \left(\frac{1}{(\tilde\varepsilon/\sigma^*)\sqrt{e}}\right)\int_{s=0}^\infty \frac{e^{-(\tilde\varepsilon/\sigma^*)^2 s/2}}{\sqrt{s}}\,ds \int_{t=e}^n \frac{1}{t\ln t}\,dt = \left(\frac{1}{(\tilde\varepsilon/\sigma^*)\sqrt{e}}\right)\frac{\sqrt{2\pi}}{(\tilde\varepsilon/\sigma^*)}\ln\ln n = \sqrt{\frac{2\pi}{e}}\,\frac{\sigma^{*2}}{\tilde\varepsilon^2}\ln\ln n. \qquad (31)$$

The bounds follow, removing the dependence of the $s$-sum on $t$ by extending it to $\infty$, and bounding the sums by integrals of the (decreasing) summands by slightly extending the range of each. From the above results, and observing that $T^i_\pi(n) \leq T^i_\pi(n+1)$, it follows from Eq. (22) that for any $\varepsilon$ such that $0 < \varepsilon < 1$,

$$\mathbb{E}\left[T^i_\pi(n)\right] \leq \frac{2\ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 7 + \frac{8}{\varepsilon^2} + \frac{2\sigma_i^2}{\tilde\varepsilon^2} + \sqrt{\frac{2\pi}{e}}\,\frac{\sigma^{*2}}{\tilde\varepsilon^2}\ln\ln n \leq \frac{2\ln n}{\ln\left(1+\frac{\Delta_i^2}{\sigma_i^2}\frac{(1-\varepsilon)^2}{1+\varepsilon}\right)} + 7 + \frac{8}{\varepsilon^2} + \frac{8\sigma_i^2}{\Delta_i^2\varepsilon^2} + \sqrt{\frac{2\pi}{e}}\,\frac{4\sigma^{*2}}{\Delta_i^2\varepsilon^2}\ln\ln n. \qquad (32)$$

The result then follows from the definition of regret in Eq. (2).

Remark 2
Numerical Regret Comparison: Figure 1 shows the results of a small simulation study done on a set of six populations with means and variances given in Table 1. It provides plots of the regrets when implementing policies $\pi_{CHK}$, $\pi_{ACF}$, and $\pi_G$, a 'greedy' policy that always activates the bandit with the current highest average. Each policy was implemented over a horizon of 100,000 activations, each replicated 10,000 times to produce a good estimate of the average regret $R_\pi(n)$ over the times indicated. The left plot is on the time scale of the first 10,000 activations, and the right is on the full time scale of 100,000 activations.

Table 1: The means $\mu_i$ and variances $\sigma_i^2$ of the six populations used in the simulations.

Figure 1: Numerical regret comparison of $\pi_{ACF}$, $\pi_{CHK}$, and $\pi_G$; left: the $[0, 10{,}000]$ range, right: the $[0, 100{,}000]$ range.

Remark 3
Bounds and Limits: Figure 2 shows first (left) a comparison of the theoretical bounds on the regret, $B_{\pi_{ACF}}(n)$ and $B_{\pi_{CHK}}(n)$, representing the theoretical regret bounds of the RHS of Eq. (4) and Eq. (13) respectively, taking $\varepsilon = (\ln n)^{-1/4}$ in the latter case, for the means and variances indicated in Table 1. Additionally, Figure 2 (right) shows the convergence of $R_{\pi_{CHK}}(n)/\ln n$ to the theoretical lower bound $M_{BK}(\mu,\sigma^2)$.

Figure 2: Left: plots of $B_{\pi_{ACF}}(n)$ and $B_{\pi_{CHK}}(n)$. Right: convergence of $R_{\pi_{CHK}}(n)/\ln n$ to $M_{BK}(\mu,\sigma^2)$.
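The limiting constant in Figure 2 (right) can be computed directly from Eq. (8); a minimal sketch (function name ours):

```python
import math

def m_bk(means, variances):
    """M_BK(mu, sigma^2) = sum over sub-optimal i of 2*Delta_i / ln(1 + Delta_i^2/sigma_i^2),
    per Eq. (8)."""
    mu_star = max(means)
    total = 0.0
    for mu, var in zip(means, variances):
        delta = mu_star - mu
        if delta > 0:
            total += 2.0 * delta / math.log(1.0 + delta * delta / var)
    return total

# Two bandits, Delta = 1 and sigma^2 = 1 for the sub-optimal one:
# M_BK = 2/ln 2, about 2.885, the best achievable coefficient of ln n here.
```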
3. A Comparison of $\pi_{CHK}$ and Thompson Sampling
Honda and Takemura (2013) proved that for $\alpha < 0$, the following Thompson sampling algorithm is asymptotically optimal, i.e., $\lim_{n\to\infty} R_{\pi_{TS}}(n)/\ln n = M_{BK}(\mu,\sigma^2)$.
Policy $\pi_{TS}$ (TS-NORMAL$_\alpha$)
i) Initially, sample each bandit $\tilde n \geq \max(2,\ 2-\lfloor 2\alpha\rfloor)$ times.
ii) For $n \geq N\tilde n$: for each $i$ generate a random sample $U_{in}$ from a posterior distribution for $\mu_i$, given $\left(\bar{X}_{iT^i_\pi(n)},\ S_i^2(T^i_\pi(n))\right)$, and a prior for $\left(\mu_i, \sigma_i^2\right) \propto \left(\sigma_i^2\right)^{-1-\alpha}$.
iii) Then, take $\pi_{TS}(n+1) = \arg\max_i U_{in}$. (33)

Policies $\pi_{TS}$ and $\pi_{CHK}$ differ decidedly in structure. One key difference: $\pi_{TS}$ is an inherently randomized policy, while decisions under $\pi_{CHK}$ are completely determined given the bandit results at a given time. Given that both $\pi_{TS}$ and $\pi_{CHK}$ are asymptotically optimal, it is interesting to compare the performances of these two algorithms over finite time horizons, and observe any practical differences between them. To that end, two small simulation studies were done for different sets of bandit parameters $(\mu, \sigma^2)$. In each case, the uniform prior $\alpha = -1$ was used.

Figure 3: Numerical regret comparison of $\pi_{CHK}$ and $\pi_{TS}$ for the parameters of Table 1, left, and Table 2, right.

Table 2: $\mu_i$: 10, 9, 8, 7, -1, 0; $\sigma_i^2$: …

We observe from the above, and from general sampling of bandit parameters, that $\pi_{TS}$ and $\pi_{CHK}$ generally produce comparable expected regret. A general exploration of random parameters suggests that, on average, $\pi_{TS}$ is slightly superior to $\pi_{CHK}$ in cases where all bandits have roughly equal variances, while $\pi_{CHK}$ has an edge when the optimal bandits have large variance relative to the other bandits and to the size of the bandit discrepancies. It is additionally interesting to note that in the cases pictured above, the superior policy also demonstrated the smaller variance in sample regret (Figure 4). Additional numerical experiments, not pictured here, indicate that the superior policy in each case may exhibit a slightly heavier-tailed distribution towards larger regret. In general, the question of which policy is superior is largely context specific.
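Step ii) of $\pi_{TS}$ admits a closed-form posterior draw; the following is a sketch of the standard normal/inverse-gamma conjugate update under the stated prior (the helper names are ours, and the shape/scale algebra is the usual conjugate computation, not taken verbatim from the paper):

```python
import math, random

def ts_sample(mean, var_biased, t, alpha=-1.0, rng=random):
    """One posterior draw U_in for mu_i under the prior p(mu, sigma^2)
    proportional to (sigma^2)^(-1-alpha), given the sample mean, the biased
    variance S^2(t), and the pull count t:
      sigma^2 | data ~ Inv-Gamma((t - 1)/2 + alpha, t * S^2(t) / 2),
      mu | sigma^2, data ~ N(mean, sigma^2 / t).
    Requires t > 1 - 2*alpha so that the shape parameter is positive."""
    shape = (t - 1) / 2.0 + alpha
    sigma2 = (t * var_biased / 2.0) / rng.gammavariate(shape, 1.0)  # inverse-gamma draw
    return rng.gauss(mean, math.sqrt(sigma2 / t))

def ts_choose(stats, alpha=-1.0, rng=random):
    # Step iii): sample U_in for each bandit and play the arg max.
    draws = [ts_sample(m, v, t, alpha, rng) for (m, v, t) in stats]
    return max(range(len(draws)), key=lambda i: draws[i])
```

The policy's initial sampling ensures the shape parameter is positive (e.g., $t \geq 4$ when $\alpha = -1$, as in the simulations above).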
Figure 4:
Numerical comparison of the variance of sample regret for $\pi_{CHK}$ and $\pi_{TS}$ for the parameters of Table 1, left, and Table 2, right.

References
Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.

Peter Auer and Ronald Ortner. UCB revisited: improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Peter L. Bartlett and Ambuj Tewari. REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.

Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: stochastic and adversarial bandits. arXiv preprint arXiv:1202.4473, 2012.

Apostolos N. Burnetas and Michael N. Katehakis. On sequencing two types of tasks on a single processor under incomplete information. Probability in the Engineering and Informational Sciences, 7(1):85–119, 1993.

Apostolos N. Burnetas and Michael N. Katehakis. On large deviations properties of sequential allocation problems. Stochastic Analysis and Applications, 14(1):23–31, 1996a.

Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996b.

Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997a.

Apostolos N. Burnetas and Michael N. Katehakis. On the finite horizon one-armed bandit problem. Stochastic Analysis and Applications, 16(1):845–859, 1997b.

Apostolos N. Burnetas and Michael N. Katehakis. Asymptotic Bayes analysis for the finite-horizon one-armed-bandit problem. Probability in the Engineering and Informational Sciences, 17(01):53–82, 2003.

Sergiy Butenko, Panos M. Pardalos, and Robert Murphey. Cooperative Control: Models, Applications, and Algorithms. Kluwer Academic Publishers, 2003.
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Wesley Cowan and Michael N. Katehakis. An asymptotically optimal UCB policy for uniform bandits of unknown support. arXiv preprint arXiv:1505.01918, 2015a.

Wesley Cowan and Michael N. Katehakis. Asymptotic behavior of minimal-exploration allocation policies: almost sure, arbitrarily slow growing regret. arXiv preprint arXiv:1505.02865, Jul. 31 2015b.

Wesley Cowan and Michael N. Katehakis. Multi-armed bandits under general depreciation and commitment. Probability in the Engineering and Informational Sciences, 29(01):51–76, 2015c.

Savas Dayanik, Warren B. Powell, and Kazutoshi Yamazaki. Asymptotically optimal Bayesian sequential change detection and identification rules. Annals of Operations Research, 208(1):337–370, 2013.

Eric V. Denardo, Eugene A. Feinberg, and Uriel G. Rothblum. The multi-armed bandit, with constraints. In M. N. Katehakis, S. M. Ross, and J. Yang, editors, Cyrus Derman Memorial Volume I: Optimization under Uncertainty: Costs, Risks and Revenues. Annals of Operations Research, Springer, New York, 2013.

Eugene A. Feinberg, Pavlo O. Kasyanov, and Michael Z. Zgurovsky. Convergence of value iterations for total-cost MDPs and POMDPs with general state and action sets. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014 IEEE Symposium on, pages 1–8. IEEE, 2014.

Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning based on Kullback-Leibler divergence, 2010.

John C. Gittins. Bandit processes and dynamic allocation indices (with discussion). J. Roy. Stat. Soc. Ser. B, 41:335–340, 1979.

John C. Gittins, Kevin Glazebrook, and Richard R. Weber. Multi-armed Bandit Allocation Indices. John Wiley & Sons, West Sussex, U.K., 2011.

Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In COLT, pages 67–79. Citeseer, 2010.

Junya Honda and Akimichi Takemura. An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Machine Learning, 85(3):361–391, 2011.

Junya Honda and Akimichi Takemura. Optimality of Thompson sampling for Gaussian bandits depends on priors. arXiv preprint arXiv:1311.1894, 2013.

Wassim Jouini, Damien Ernst, Christophe Moy, and Jacques Palicot. Multi-armed bandit based policies for cognitive radio's decision making issues, 2009.

Michael N. Katehakis and Herbert Robbins. Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584, 1995.

Émilie Kaufmann. Analyse de stratégies bayésiennes et fréquentistes pour l'allocation séquentielle de ressources. Doctorat, ParisTech, Jul. 31 2015.

Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107–1149, 2003.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules.
Advances in AppliedMathematics , 6(1):4–22, 1985.Lihong Li, Remi Munos, and Csaba Szepesvari. On minimax optimal offline policy evaluation. arXiv preprintarXiv:1409.3653, 2014.Michael L Littman. Inducing partially observable Markov decision processes. In
ICGI , pages 145–148, 2012.Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored mdps. In
Advances inNeural Information Processing Systems , pages 604–612, 2014.Herbert Robbins. Some aspects of the sequential design of experiments.
Bull. Amer. Math. Monthly , 58:527–536, 1952.Cem Tekin and Mingyan Liu. Approximately optimal adaptive learning in opportunistic spectrum access. In
INFOCOM, 2012 Proceedings IEEE , pages 1548–1556. IEEE, 2012.Ambuj Tewari and Peter L Bartlett. Optimistic linear programming gives logarithmic regret for irreduciblemdps. In
Advances in Neural Information Processing Systems , pages 1505–1512, 2008.Richard R Weber. On the Gittins index for multiarmed bandits.
The Annals of Applied Probability , 2(4):1024–1033, 1992.
Acknowledgement:
We gratefully acknowledge support for this project from the National Science Foundation (NSF grant CMMI-14-50743).
Appendix A. Additional Proofs
Proof [of Proposition 2] Let $P = \mathbb{P}\left( \delta + \sqrt{U}\sqrt{k^{2/p} - 1} < Z \right)$. Note immediately that $P \geq \mathbb{P}\left( \delta + \sqrt{U k^{2/p}} < Z \right)$. Further,
$$P \geq \mathbb{P}\left( \delta + \sqrt{U k^{2/p}} < Z \mbox{ and } \sqrt{U k^{2/p}} \geq \delta \right) \geq \mathbb{P}\left( 2\sqrt{U k^{2/p}} < Z \mbox{ and } \sqrt{U k^{2/p}} \geq \delta \right) = \int_{\delta^2 / k^{2/p}}^{\infty} \int_{2\sqrt{u k^{2/p}}}^{\infty} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, f_d(u)\, dz\, du, \quad\quad (34)$$
where $f_d(u)$ is taken to be the density of a $\chi^2_d$-random variable. Letting $\tilde{u} = k^{2/p} u$,
$$P \geq \frac{1}{k^{2/p}} \int_{\delta^2}^{\infty} \int_{2\sqrt{\tilde{u}}}^{\infty} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, f_d\left( \frac{\tilde{u}}{k^{2/p}} \right) dz\, d\tilde{u} = \frac{1}{k^{2/p}} \int_{\delta^2}^{\infty} \int_{2\sqrt{\tilde{u}}}^{\infty} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, \frac{1}{2^{d/2}\Gamma(d/2)} \left( \frac{\tilde{u}}{k^{2/p}} \right)^{d/2 - 1} e^{-\tilde{u}/(2 k^{2/p})}\, dz\, d\tilde{u} = \left( \frac{1}{k^{2/p}} \right)^{d/2} \int_{\delta^2}^{\infty} \int_{2\sqrt{\tilde{u}}}^{\infty} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, \frac{\tilde{u}^{d/2 - 1}}{2^{d/2}\Gamma(d/2)}\, e^{-\tilde{u}/(2 k^{2/p})}\, dz\, d\tilde{u}. \quad\quad (35)$$
Observing that $k^{2/p} \geq 1$, so that $e^{-\tilde{u}/(2 k^{2/p})} \geq e^{-\tilde{u}/2}$,
$$P \geq \left( \frac{1}{k^{2/p}} \right)^{d/2} \int_{\delta^2}^{\infty} \int_{2\sqrt{\tilde{u}}}^{\infty} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, \frac{\tilde{u}^{d/2-1}}{2^{d/2}\Gamma(d/2)}\, e^{-\tilde{u}/2}\, dz\, d\tilde{u} = k^{-d/p}\, \mathbb{P}\left( 2\sqrt{U} \leq Z \mbox{ and } U \geq \delta^2 \right) = k^{-d/p}\, \mathbb{P}\left( 4U \leq Z^2 \mbox{ and } U \geq \delta^2 \right) = k^{-d/p}\, \mathbb{P}\left( Z^2/4 \geq U \geq \delta^2 \right). \quad\quad (36)$$
The exchange from integral to probability is simply the interpretation of the integrand as the joint pdf of $U$ and $Z$.
For the upper bound, we utilize the classic normal tail bound $\mathbb{P}(x < Z) \leq e^{-x^2/2}/(x\sqrt{2\pi})$ for $x > 0$:
$$P \leq \mathbb{E}\left[ \frac{e^{-\left( \delta + \sqrt{U}\sqrt{k^{2/p}-1} \right)^2/2}}{\left( \delta + \sqrt{U}\sqrt{k^{2/p}-1} \right)\sqrt{2\pi}} \right] \leq \frac{e^{-\delta^2/2}}{\delta\sqrt{2\pi}}\, \mathbb{E}\left[ e^{-\delta\sqrt{U}\sqrt{k^{2/p}-1}\, -\, U(k^{2/p}-1)/2} \right]. \quad\quad (37)$$
Observing the bound that for positive $x$, $e^{-x} \leq 1/x$, and recalling that $d \geq 2$,
$$P \leq \frac{e^{-\delta^2/2}}{\delta\sqrt{2\pi}}\, \mathbb{E}\left[ \frac{e^{-U(k^{2/p}-1)/2}}{\delta\sqrt{U}\sqrt{k^{2/p}-1}} \right] = \frac{e^{-\delta^2/2}}{\delta^2\sqrt{2\pi}\sqrt{k^{2/p}-1}}\, \mathbb{E}\left[ U^{-1/2}\, e^{-U(k^{2/p}-1)/2} \right] = \frac{e^{-\delta^2/2}}{\delta^2\sqrt{2\pi}\sqrt{k^{2/p}-1}} \left( \frac{k^{(1-d)/p}\, \Gamma\left( \frac{d-1}{2} \right)}{\sqrt{2}\, \Gamma\left( \frac{d}{2} \right)} \right). \quad\quad (38)$$
Here we utilize the following bounds: $e^x - 1 \geq (e/2)x^2$ for $x > 0$ (applied with $x = (2/p)\ln k$), which is easy to prove, and $\Gamma\left( \frac{d-1}{2} \right)/\Gamma\left( \frac{d}{2} \right) \leq \sqrt{2\pi/d}$, which may be proved by induction on integer $d \geq 2$:
$$P \leq \frac{e^{-(1+\delta^2)/2}\, p}{2\, \delta^2 \ln k}\, \frac{k^{(1-d)/p}}{\sqrt{d}}. \quad\quad (39)$$
This completes the proof.

Remark 4
Room for Improvement: The choice of the bound $e^x - 1 \geq (e/2)x^2$ above was in fact arbitrary; other bounds, such as those involving alternative powers of $x$, could be used. This would influence how the resulting bound on $P$ is utilized, for instance in the proof of Theorem 3. The use of $e^{-x} \leq 1/x$ in Eq. (38) should be considered similarly.

Proposition 6
Conjecture 1 is false, and for each $i$ and each $\varepsilon > 0$,
$$k\, \mathbb{P}\left( \bar{X}_{ij} + S_i(j)\sqrt{k^{2/j} - 1} < \mu_i - \varepsilon \mbox{ for some } 1 \leq j \leq k \right) \to \infty \quad \mbox{as } k \to \infty. \quad\quad (40)$$

Proof [of Proposition 6] Define the events $A^i_{j,k,\varepsilon} = \left\{ \bar{X}_{ij} + S_i(j)\sqrt{k^{2/j}-1} < \mu_i - \varepsilon \right\}$. As the samples are taken to be normally distributed with mean $\mu_i$ and variance $\sigma_i^2$, we have that $\bar{X}_{ij} - \mu_i \sim Z \sigma_i/\sqrt{j}$ and $S_i^2(j) \sim \sigma_i^2 U / j$, where $Z$ is a standard normal, $U \sim \chi^2_{j-1}$, and $Z$, $U$ are independent. Hence,
$$\mathbb{P}\left( A^i_{j,k,\varepsilon} \right) = \mathbb{P}\left( Z\frac{\sigma_i}{\sqrt{j}} + \sqrt{\frac{U \sigma_i^2}{j}}\sqrt{k^{2/j}-1} < -\varepsilon \right) = \mathbb{P}\left( \frac{\varepsilon\sqrt{j}}{\sigma_i} + \sqrt{U}\sqrt{k^{2/j}-1} < Z \right). \quad\quad (41)$$
The last step is simply a re-arrangement, and an observation on the symmetry of the distribution of $Z$. For $j \geq 3$, we may apply Proposition 2 here for $d = j-1$, $p = j$, and $\delta = \varepsilon\sqrt{j}/\sigma_i$, to yield
$$\mathbb{P}\left( A^i_{j,k,\varepsilon} \right) \geq \frac{k^{1/j}}{k}\, \mathbb{P}\left( Z^2/4 \geq U \geq \varepsilon^2 \sigma_i^{-2} j \right). \quad\quad (42)$$
For a fixed $j \geq 3$ and $k \geq j$, the factor $c_j = \mathbb{P}\left( Z^2/4 \geq U \geq \varepsilon^2 \sigma_i^{-2} j \right) > 0$ does not depend on $k$, and we have
$$\mathbb{P}\left( A^i_{j',k,\varepsilon} \mbox{ for some } 1 \leq j' \leq k \right) \geq \mathbb{P}\left( A^i_{j,k,\varepsilon} \right) \geq c_j\, \frac{k^{1/j}}{k}. \quad\quad (43)$$
Since $k^{1/j} \to \infty$ as $k \to \infty$, the proposition follows immediately.

Proposition 7
For $G > 0$ and $0 \leq \varepsilon < 1/2$, the following holds:
$$\frac{1}{\ln\left( 1 + \frac{G(1-\varepsilon)^2}{1+\varepsilon} \right)} \leq \frac{1}{\ln(1+G)} + \frac{10\, G}{(1+G)\ln^2(1+G)}\, \varepsilon. \quad\quad (44)$$

Proof
For any $G > 0$, the function $1 \Big/ \ln\left( 1 + \frac{G(1-\varepsilon)^2}{1+\varepsilon} \right)$ is positive, increasing, and convex on $\varepsilon \in [0, 1)$ (Proposition 8). For a given $G > 0$, noting that the above inequality holds (as equality) at $\varepsilon = 0$, due to the convexity it suffices to show that the inequality is satisfied at $\varepsilon = 1/2$, or
$$\frac{1}{\ln\left( 1 + \frac{G}{6} \right)} \leq \frac{5G}{(1+G)\ln^2(1+G)} + \frac{1}{\ln(1+G)}. \quad\quad (45)$$
Equivalently, multiplying through by $(1+G)\ln^2(1+G) > 0$, we consider the inequality
$$0 \leq 5G + (1+G)\ln(1+G) - \frac{(1+G)\ln^2(1+G)}{\ln\left( 1 + \frac{G}{6} \right)}. \quad\quad (46)$$
Define the function $F(G)$ to be the RHS of Ineq. (46). Note that as $G \to 0^+$, $F(G) \to 0$, and differentiating, we have (for $G > 0$)
$$F'(G) = 6 + \ln(1+G) - \frac{\ln(1+G)\left( \ln(1+G) + 2 \right)}{\ln\left( 1 + \frac{G}{6} \right)} + \frac{(1+G)\ln^2(1+G)}{(6+G)\ln^2\left( 1 + \frac{G}{6} \right)} \geq 0. \quad\quad (47)$$
It follows that $F(G) \geq 0$, and hence the desired inequality holds at $\varepsilon = 1/2$. This completes the proof.
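The linear bound of Ineq. (44) is easy to spot-check numerically. The following sketch (not part of the proof; the grid of $G$ and $\varepsilon$ values is an illustrative choice) evaluates both sides over the grid and confirms the gap is nonnegative, with equality at $\varepsilon = 0$:

```python
import math

def lhs(G, eps):
    # left side of (44): 1 / ln(1 + G(1-eps)^2 / (1+eps))
    return 1.0 / math.log(1.0 + G * (1.0 - eps) ** 2 / (1.0 + eps))

def rhs(G, eps):
    # right side of (44): linear-in-eps upper bound
    return (1.0 / math.log(1.0 + G)
            + 10.0 * G * eps / ((1.0 + G) * math.log(1.0 + G) ** 2))

gaps = [rhs(G, i / 100.0) - lhs(G, i / 100.0)
        for G in (0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 100.0)
        for i in range(50)]  # eps in [0, 0.49]
print(min(gaps))  # zero at eps = 0, positive otherwise
```

The gap closes as $G \to 0^+$ and $\varepsilon \to 1/2$, reflecting that the constant in the $\varepsilon$-coefficient cannot be improved uniformly in $G$.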
Proposition 8
The function $H_G(\varepsilon) = 1 \Big/ \ln\left( 1 + \frac{G(1-\varepsilon)^2}{1+\varepsilon} \right)$ is positive, increasing, and convex in $\varepsilon \in [0,1)$, for any constant $G > 0$.

Proof
That $H_G(\varepsilon)$ is positive and increasing in $\varepsilon$ follows immediately from inspection of $H_G$ and $H'_G$, given the hypotheses on $G$ and $\varepsilon$. To demonstrate convexity, by inspection of the terms of $H''_G(\varepsilon)$, it suffices to show that for all relevant $G$ and $\varepsilon$, the following inequality holds:
$$2G(1-\varepsilon)^2(3+\varepsilon)^2 + \left( -8(1+\varepsilon) + G(1-\varepsilon)^2\left( 1 + \varepsilon(6+\varepsilon) \right) \right) \ln\left( 1 + \frac{G(1-\varepsilon)^2}{1+\varepsilon} \right) \geq 0. \quad\quad (48)$$
Defining $C = G(1-\varepsilon)^2/(1+\varepsilon)$, it is sufficient to show that for all $C > 0$ and $\varepsilon \in [0,1)$ (eliminating a factor of $(1+\varepsilon)$ from the above),
$$2C(3+\varepsilon)^2 + \left( -8 + C\left( 1 + \varepsilon(6+\varepsilon) \right) \right)\ln(1+C) \geq 0. \quad\quad (49)$$
Defining $J_C(\varepsilon)$ as the LHS of the above, note that $J'_C(\varepsilon) = 2C(3+\varepsilon)\left( 2 + \ln(1+C) \right) > 0$. It suffices then to show $J_C(0) \geq 0$, or $18C + (C-8)\ln(1+C) \geq 0$. Note this holds at $C = 0$, and $d/dC\left[ J_C(0) \right] = (10 + 19C)/(1+C) + \ln(1+C) > 0$ for $C \geq 0$. Hence, $J_C(\varepsilon) \geq 0$, and $H''_G(\varepsilon) \geq 0$.
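As a quick numerical companion to Proposition 8 (a sanity check, not part of the proof; the grids of $G$, $\varepsilon$, and $C$ values are illustrative choices), one may verify positivity and monotonicity of $H_G$ directly, convexity via second differences, and the final scalar inequality $18C + (C-8)\ln(1+C) \geq 0$:

```python
import math

def H(G, eps):
    # H_G(eps) = 1 / ln(1 + G(1-eps)^2 / (1+eps))
    return 1.0 / math.log(1.0 + G * (1.0 - eps) ** 2 / (1.0 + eps))

# positivity, monotonicity, and convexity via finite differences on [0, 0.99]
for G in (0.1, 1.0, 10.0):
    vals = [H(G, i / 100.0) for i in range(100)]
    assert all(v > 0 for v in vals)                   # positive
    diffs = [b - a for a, b in zip(vals, vals[1:])]
    assert all(d > 0 for d in diffs)                  # increasing
    second = [b - a for a, b in zip(diffs, diffs[1:])]
    assert all(s > -1e-9 for s in second)             # convex

# J_C(0) = 18C + (C - 8) ln(1 + C) >= 0, the last step of the proof
for C in (x / 10.0 for x in range(1, 500)):
    assert 18 * C + (C - 8) * math.log(1 + C) >= 0

print("Proposition 8 checks passed")
```

The finite-difference tolerance absorbs floating-point noise near $\varepsilon = 1$, where $H_G$ blows up; the asserted signs match the strict inequalities established above.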