Infinite Arms Bandit: Optimality via Confidence Bounds
Hock Peng Chan and Hu Shouri
National University of Singapore
Abstract

The infinite arms bandit problem was initiated by Berry et al. (1997). They derived a regret lower bound for all strategies under Bernoulli rewards with a uniform prior, and proposed bandit strategies based on success runs, which however do not achieve this bound. Bonald and Proutière (2013) showed that the lower bound is achieved by their two-target algorithm, and extended optimality to Bernoulli rewards with general priors. We propose here a confidence bound target (CBT) algorithm that achieves optimality for unspecified non-negative reward distributions. For each arm we use the mean and standard deviation of its rewards to compute a confidence bound, and we play the arm with the smallest confidence bound provided it is smaller than a target mean. If the bounds are all larger, then we play a new arm. We show, for a given prior of the arm means, how the target mean can be computed to achieve optimality. In the absence of information on the prior, the target mean is determined empirically, and the regret achieved is still comparable to the regret lower bound. Numerical studies show that CBT is versatile and outperforms its competitors.
1 Introduction

Berry, Chen, Zame, Heath and Shepp (1997) initiated the infinite arms bandit problem on Bernoulli rewards. They showed, in the case of a uniform prior on the mean of an arm, a regret lower bound of order √n for n rewards, and provided algorithms based on success runs that achieve no more than 2√n regret. Bonald and Proutière (2013) provided a two-target stopping-time algorithm that can get arbitrarily close to Berry et al.'s lower bound, and is also optimal for Bernoulli rewards with general priors. Wang, Audibert and Munos (2008) considered bounded rewards and showed that their confidence bound algorithm has regret bounds that are log n times the optimal regret. Vermorel and Mohri (2005) proposed a POKER algorithm for general reward distributions and priors.

The confidence bound method is arguably the most influential approach for the fixed arm-size multi-armed bandit problem over the past thirty years. Lai and Robbins (1985) derived the smallest asymptotic regret that a multi-armed bandit algorithm can achieve. Lai (1987) showed that by constructing an upper confidence bound (UCB) for each arm and playing the arm with the largest UCB, this smallest regret is achieved in exponential families. The UCB approach was subsequently extended to unknown time-horizons and other parametric families in Agrawal (1995), Auer, Cesa-Bianchi and Fischer (2002), Burnetas and Katehakis (1996), Cappé, Garivier, Maillard, Munos and Stoltz (2013) and Kaufmann, Cappé and Garivier (2012), and it has been shown to perform well in practice, achieving optimality beyond exponential families. Chan (2019) modified the subsampling approach of Baransi, Maillard and Mannor (2014) and showed that optimality is achieved in exponential families, despite not applying parametric information in the selection of arms. The method can be viewed as applying confidence bounds that are computed empirically from subsample information, which substitutes for the missing parametric information. A related problem is the study of the multi-armed bandit with irreversible constraints, initiated by Hu and Wei (1989).

Good performance and optimality have also been achieved by Bayesian approaches to the multi-armed bandit problem; see Berry and Fristedt (1985), Gittins (1989) and Thompson (1933) for early groundwork on the Bayesian approach, and Korda, Kaufmann and Munos (2013) for more recent advances.

In this paper we show how the confidence bound method can be extended to the infinite arms bandit problem to achieve optimality. Adjustments are made to account for the infinite number of arms that are available, in particular the specification of a target mean that we desire from our best arm, and a mechanism to reject weak arms quickly. For each arm a lower confidence bound of its mean is computed, using only information on the sample mean and standard deviation of its rewards. We play an arm as long as its confidence bound is below the target mean. If it is above, then a new arm is played in the next trial.

We start with the smallest possible regret that an infinite arms bandit algorithm can achieve, as the number of rewards goes to infinity. This is followed by the target mean selection of the confidence bound target (CBT) algorithm to achieve this regret. The optimal target mean depends only on the prior distribution of the arm means and not on the reward distributions. That is, the reward distributions need not be specified and optimality is still achieved. In the absence of information on the prior, we show how to adapt via empirical determination of the target mean.
Regret guarantees of empirical CBT, relative to the baseline lower bound, are provided. Numerical studies on Bernoulli rewards and on a URL dataset show that CBT and empirical CBT outperform their competitors.

The layout of the paper is as follows. In Section 2 we review a number of infinite arms bandit algorithms and describe CBT. In Section 3 we motivate why a particular choice of the target mean leads to the smallest regret and state the optimality results. In Section 4 we introduce an empirical version of CBT to tackle unknown priors and explain intuitively why it works. In Section 5 we perform numerical studies. In Sections 6–7 we prove the optimality of CBT.

2 Infinite arms bandit algorithms

Let X_k1, X_k2, . . . be i.i.d. non-negative rewards from an arm or population Π_k, 1 ≤ k < ∞, with mean μ_k. Let μ_1, μ_2, . . . be i.i.d. with prior density g on (0, ∞). Let F_μ denote the reward distribution of an arm with mean μ, and let E_μ (P_μ) denote expectation (probability) with respect to X ∼ F_μ. Let a ∧ b denote min(a, b), ⌊·⌋ (⌈·⌉) denote the greatest (least) integer function and a⁺ denote max(0, a). Let a_n ∼ b_n if lim_{n→∞}(a_n/b_n) = 1, a_n = o(b_n) if lim_{n→∞}(a_n/b_n) = 0 and a_n = O(b_n) if lim sup_{n→∞}|a_n/b_n| < ∞.

A bandit algorithm is required to select one of the arms to be played at each trial, with the choice informed by past outcomes. We measure the effectiveness of a bandit algorithm by its regret

R_n = E(Σ_{k=1}^K n_k μ_k),

where K is the total number of arms played, n_k the number of rewards from Π_k and n = Σ_{k=1}^K n_k. Berry et al. (1997) showed that if F_μ is Bernoulli and g is uniform on (0,1), then

lim inf_{n→∞} R_n/√n ≥ √2.   (2.1)

They proposed the following strategies based on success runs.

1. f-failure strategy. The same arm is played until f failures are encountered. When this happens we switch to a new arm. We do not go back to a previously played arm, that is, the strategy is non-recalling.
2. s-run strategy. We restrict ourselves to no more than s arms, following the 1-failure strategy in each, until a success run of length s is observed in an arm. When this happens we play the arm for the remaining trials. If no success run of length s is observed in all s arms, then the arm with the highest proportion of successes is played for the remaining trials.
3. Non-recalling s-run strategy. We follow the 1-failure strategy until an arm produces a success run of length s. When this happens we play the arm for the remaining trials. If no arm produces a success run of length s, then the 1-failure strategy is used for all n trials.
4. m-learning strategy. We follow the 1-failure strategy for the first m trials, with the arm at trial m played until it yields a failure. Thereafter we play, for the remaining trials, the arm with the highest proportion of successes.
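As a concrete illustration (in Berry et al.'s convention, where a success is the favourable Bernoulli outcome), the following is a minimal sketch of the f-failure strategy. The prior sampler new_arm below, with a uniform prior on the success probability, is an assumption made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def new_arm():
    """A fresh Bernoulli arm whose success probability is drawn from the uniform prior."""
    p = rng.uniform()
    return lambda: int(rng.random() < p)

def f_failure_strategy(n, f=1):
    """Non-recalling f-failure strategy: keep playing the current arm until it has
    produced f failures, then switch to a fresh arm and never return.
    Returns the number of failures in n trials, a natural regret proxy."""
    arm, failures_this_arm, total_failures = new_arm(), 0, 0
    for _ in range(n):
        if arm() == 1:                      # success: keep playing the same arm
            continue
        failures_this_arm += 1              # failure
        total_failures += 1
        if failures_this_arm >= f:          # f failures seen: move on to a new arm
            arm, failures_this_arm = new_arm(), 0
    return total_failures
```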
Berry et al. showed that R_n ∼ n/(log n) for the f-failure strategy for any f ≥ 1, whereas for the √n-run strategy, the (log n)√n-learning strategy and the non-recalling √n-run strategy,

lim sup_{n→∞} R_n/√n ≤ 2.

Bonald and Proutière (2013) proposed a two-target algorithm that gets arbitrarily close to the lower bound in (2.1). The two targets s_1 and s_f are of order √n, with s_f roughly f times s_1, for a fixed f ≥ 1. An arm is discarded if it yields fewer than s_1 successes when it encounters its first failure, or fewer than s_f successes when it encounters its f-th failure. If both targets are met, then the arm is accepted and played for the remaining trials. Bonald and Proutière showed that for the uniform prior, the two-target algorithm satisfies

lim sup_{n→∞} R_n/√n ≤ √2 + o_f(1),   (2.2)

where the o_f(1) term vanishes as f → ∞, so the algorithm gets close to the lower bound of Berry et al. when f is large.

Bonald and Proutière extended their optimality on Bernoulli rewards to non-uniform priors. Specifically they considered g(μ) ∼ αμ^{β−1} as μ → 0, for some α > 0 and β > 0.
They extended the lower bound of Berry et al. to

lim inf_{n→∞} (n^{−β/(β+1)} R_n) ≥ C, where C = (β(β+1)/α)^{1/(β+1)},   (2.3)

and showed that their two-target algorithm, with suitably chosen targets s_1 and s_f, satisfies

lim sup_{f→∞} [lim sup_{n→∞} (n^{−β/(β+1)} R_n)] ≤ C.   (2.4)

Wang, Audibert and Munos (2008) proposed a UCB-F algorithm for rewards taking values in [0,1]. If P_g(μ_k ≤ μ) = O(μ^β) for some β > 0, then under suitable regularity conditions, R_n = O(n^{β/(β+1)} log n). In UCB-F an order of n^{β/(β+1)} arms is chosen, and confidence bounds are computed on these arms to determine which arm to play. UCB-F is different from CBT in that it pre-selects the number of arms, and it also does not have a mechanism to reject weak arms quickly. Carpentier and Valko (2015) also considered rewards taking values in [0,1], but their interest, the identification of a single good arm (simple regret), is different from the aims here and in the papers described above.

In CBT we construct a confidence bound for each arm, and play an arm as long as its confidence bound is under the target mean. Let

b_n → ∞ and c_n → ∞ with b_n + c_n = o(n^δ) for all δ > 0.   (2.5)

In our numerical studies we select b_n = c_n = log log n. For an arm k that has been played t times, its confidence bound is

L_kt = max(X̄_kt/b_n, X̄_kt − c_n σ̂_kt/√t),   (2.6)

where X̄_kt = t^{−1} S_kt, with S_kt = Σ_{u=1}^t X_ku, and σ̂²_kt = t^{−1} Σ_{u=1}^t (X_ku − X̄_kt)².

Let ζ > 0 denote the target mean. We show in Section 3 how ζ should be selected to achieve optimality. It suffices to mention here that it should be small for large n; more specifically it should decrease at a polynomial rate with respect to n. The algorithm is non-recalling: an arm is played until its confidence bound goes above ζ, and it is not played after that.

Confidence bound target (CBT)
For k = 1, 2, . . .: Draw n_k rewards from arm k, where
n_k = inf{t ≥ 1: L_kt > ζ} ∧ (n − Σ_{ℓ=1}^{k−1} n_ℓ).
The total number of arms played is K = min{k: Σ_{ℓ=1}^k n_ℓ = n}.

There are three types of arms that we need to take care of, and this explains the design of L_kt. The first type are arms with means μ_k significantly larger than ζ. We would like to reject these arms quickly. The condition that an arm be rejected when X̄_kt/b_n exceeds ζ is key to achieving this.

The second type are arms with means μ_k larger than ζ, but not by as much as those of the first type. We are unlikely to reject these arms quickly, as it is difficult to determine whether μ_k is larger or smaller than ζ based on a small sample. Rejecting arm k when X̄_kt − c_n σ̂_kt/√t exceeds ζ ensures that arm k is rejected only when it is statistically significant that μ_k is larger than ζ. Though there may be a large number of rewards from these arms, their contributions to the regret are small because their means are small.

The third type of arms are those with means μ_k smaller than ζ. For these arms the best strategy (when ζ is chosen correctly) is to play them for the remaining trials. Selecting b_n → ∞ and c_n → ∞ in (2.6) ensures that the probabilities of rejecting these arms are small.

For the two-target algorithm, the first target s_1 is designed for quick rejection of the first type of arms and the second target s_f for rejection of the second type. The difference is that whereas two-target monitors an arm for rejection only at its first and f-th failures, with f chosen large for optimality, CBT (in the case of Bernoulli rewards) checks for rejection each time a failure occurs.
The frequent monitoring of CBT is a positive feature that results in significantly better numerical performance; see Section 5.
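The following is a minimal sketch of the CBT rule built on the confidence bound (2.6), under the paper's convention that rewards are non-negative losses, so that arms with small means are the good ones. The interface new_arm(), which returns a sampler for a fresh arm, and the choice b_n = c_n = log log n are illustrative assumptions rather than part of the algorithm's specification.

```python
import numpy as np

def cbt(new_arm, n, zeta):
    """Minimal sketch of CBT: play each arm until its confidence bound (2.6)
    exceeds the target mean zeta, then move to a fresh arm (non-recalling).
    Rewards are treated as losses, so small means are desirable."""
    b_n = c_n = np.log(np.log(n))            # slowly growing, as in Section 5
    total_loss, t_total = 0.0, 0
    while t_total < n:
        arm = new_arm()                      # start a fresh arm
        s = ss = 0.0                         # running sum and sum of squares
        t = 0
        while t_total < n:
            x = arm()
            s, ss, t, t_total = s + x, ss + x * x, t + 1, t_total + 1
            total_loss += x
            xbar = s / t
            sigma = np.sqrt(max(ss / t - xbar ** 2, 0.0))
            L = max(xbar / b_n, xbar - c_n * sigma / np.sqrt(t))
            if L > zeta:                     # reject this arm and never recall it
                break
    return total_loss                        # realized analogue of the regret R_n
```

With Bernoulli(μ) losses and a uniform prior, ζ would be set to the optimal target √(2/n) of Section 3, or estimated empirically as in Section 4.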
3 Optimality

We state the regret lower bound in Section 3.1 and show that CBT achieves this bound in Section 3.2.

3.1 The regret lower bound
In Lemma 1 below we motivate the choice of ζ. Let λ = ∫_0^∞ E_μ(X | X > 0) g(μ) dμ be the (finite) mean of the first non-zero reward of a random arm. The value λ represents the cost of experimenting with a new arm. We consider E_μ(X | X > 0) instead of μ because we are able to reject an arm only when there is a non-zero reward. For Bernoulli rewards, λ = 1. Let p(ζ) = P_g(μ ≤ ζ) and v(ζ) = E_g(ζ − μ)⁺.

Consider an idealized algorithm which plays Π_k until a non-zero reward is observed, and μ_k is revealed when that happens. If μ_k > ζ, then Π_k is rejected and a new arm is played next. If μ_k ≤ ζ, then we end the experimentation stage and play Π_k for the remaining trials. Assuming that the experimentation stage uses o(n) trials and ζ is small, the regret of this algorithm is asymptotically

r_n(ζ) = λ/p(ζ) + n E_g(μ | μ ≤ ζ).   (3.1)

The first term in the expansion of r_n(ζ) approximates E(Σ_{k=1}^{K−1} n_k μ_k), whereas the second term approximates E(n_K μ_K).
Lemma 1. Let ζ_n be such that v(ζ_n) = λ/n. We have inf_{ζ>0} r_n(ζ) = r_n(ζ_n) = nζ_n.

Proof. Since E_g(ζ − μ | μ ≤ ζ) = v(ζ)/p(ζ), it follows from (3.1) that

r_n(ζ) = λ/p(ζ) + nζ − n v(ζ)/p(ζ).   (3.2)

It follows from (d/dζ) v(ζ) = p(ζ) and (d/dζ) p(ζ) = g(ζ) that

(d/dζ) r_n(ζ) = g(ζ)[n v(ζ) − λ]/p²(ζ).

Since v is continuous and strictly increasing when it is positive, the root of v(ζ) = λ/n exists, and Lemma 1 follows from solving (d/dζ) r_n(ζ) = 0. ⊓⊔
We assume the following condition on the prior density g.

(A1) For some α > 0 and β > 0, g(μ) ∼ αμ^{β−1} as μ → 0.

Under (A1), p(ζ) = ∫_0^ζ g(μ) dμ ∼ (α/β) ζ^β and v(ζ) = ∫_0^ζ p(μ) dμ ∼ (α/(β(β+1))) ζ^{β+1} as ζ → 0, hence v(ζ_n) ∼ λ/n for

ζ_n ∼ C n^{−1/(β+1)}, where C = (λβ(β+1)/α)^{1/(β+1)}.   (3.3)
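As an illustration of how Lemma 1 and (3.3) translate into a computation when g is known, the following sketch solves v(ζ) = λ/n numerically. The use of scipy's quad and brentq, and the uniform-prior check against √(2/n), are assumptions made for this example.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def optimal_target(g, lam, n, upper=1.0):
    """Solve v(zeta) = lam/n of Lemma 1, where v(zeta) = E_g(zeta - mu)^+.

    g     : prior density of the arm means
    lam   : mean of the first non-zero reward of a random arm
    n     : total number of rewards
    upper : a point at which v(upper) exceeds lam/n, so a root is bracketed
    """
    def v(zeta):
        # v(zeta) = integral over (0, zeta) of (zeta - mu) g(mu) d mu
        return quad(lambda mu: (zeta - mu) * g(mu), 0.0, zeta)[0]
    return brentq(lambda z: v(z) - lam / n, 1e-12, upper)

# Uniform prior on (0,1) with Bernoulli rewards: lambda = 1 and v(zeta) = zeta^2/2,
# so the solver should return roughly sqrt(2/n), in agreement with (3.3).
if __name__ == "__main__":
    n = 10_000
    print(optimal_target(lambda mu: 1.0, lam=1.0, n=n), np.sqrt(2.0 / n))
```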
In Lemma 2 below we state the regret lower bound. We assume there, in addition to (A1), that:

(A2) There exists a_1 > 0 such that P_μ(X > 0) ≥ a_1 min(μ, 1) for all μ.

Without condition (A2) we may play a bad arm a large number of times because its rewards are mostly zeros but are very big when non-zero.

Lemma 2. Assume (A1) and (A2). For any infinite arms bandit algorithm, its regret satisfies

R_n ≥ [1 + o(1)] nζ_n (∼ C n^{β/(β+1)}).   (3.4)
Example 1. Consider X ∼ Bernoulli(μ). Condition (A2) holds with a_1 = 1. If g is uniform on (0,1), then (A1) holds with α = β = 1. Since λ = 1, by (3.3), ζ_n ∼ (2/n)^{1/2}. Lemma 2 says that R_n ≥ [1 + o(1)]√(2n), agreeing with Theorem 3 of Berry et al. (1997).

Bonald and Proutière (2013) showed (3.4) in their Lemma 3 for Bernoulli rewards under (A1), and that the two-target algorithm gets close to the lower bound when f is large. It will be shown in Theorem 1 that the lower bound in (3.4) is achieved by CBT for general rewards.

3.2 Optimality of CBT

We state the optimality of CBT in Theorem 1, after describing the conditions on discrete rewards under (B1) and continuous rewards under (B2) for which the theorem holds. Let M_μ(θ) = E_μ e^{θX}.
(B1) The rewards are non-negative integer-valued. For 0 < δ ≤ 1, there exists θ_δ > 0 such that for all μ > 0 and 0 ≤ θ ≤ θ_δ,

M_μ(θ) ≤ e^{(1+δ)θμ},   (3.5)
M_μ(−θ) ≤ e^{−(1−δ)θμ}.   (3.6)

In addition,

P_μ(X > 0) ≤ a_2 μ for some a_2 > 0,   (3.7)
E_μ X⁴ = O(μ) as μ → 0.   (3.8)

(B2) The rewards are continuous random variables satisfying

sup_{μ>0} P_μ(X ≤ γμ) → 0 as γ → 0.   (3.9)

Moreover (3.8) holds and for 0 < δ ≤ 1, there exists τ_δ > 0 such that for 0 < θμ ≤ τ_δ,

M_μ(θ) ≤ e^{(1+δ)θμ},   (3.10)
M_μ(−θ) ≤ e^{−(1−δ)θμ}.   (3.11)

In addition, for each t ≥ 1 there exists ξ_t > 0 such that

sup_{μ ≤ ξ_t} P_μ(σ̂_t ≤ γμ) → 0 as γ → 0,   (3.12)

where σ̂²_t = t^{−1} Σ_{u=1}^t (X_u − X̄_t)² and X̄_t = t^{−1} Σ_{u=1}^t X_u for i.i.d. X_u ∼ F_μ.
Theorem 1. Assume (A1), (A2) and either (B1) or (B2). For CBT with threshold ζ_n satisfying (3.3) and b_n, c_n satisfying (2.5),

R_n ∼ nζ_n as n → ∞.   (3.13)

Theorem 1 says that CBT is optimal as it attains the lower bound given in Lemma 2. In the examples below we show that the regularity conditions (A2), (B1) and (B2) are reasonable and checkable.
Example 2. If X ∼ Bernoulli(μ) under P_μ, then M_μ(θ) = 1 − μ + μe^θ ≤ exp[μ(e^θ − 1)], and (3.5), (3.6) hold with θ_δ > 0 satisfying

e^{θ_δ} − 1 ≤ θ_δ(1 + δ) and e^{−θ_δ} − 1 ≤ −θ_δ(1 − δ).   (3.14)

In addition (3.7) holds with a_2 = 1, and (3.8) holds because E_μ X⁴ = μ. Condition (A2) holds with a_1 = 1.
Example 3. Let F_μ be a distribution with support on {0, 1, . . ., I} for some positive integer I, and let p_i = P_μ(X = i), so that μ = Σ_{i=0}^I i p_i. We check that P_μ(X > 0) ≥ I^{−1}μ and therefore (A2) holds with a_1 = I^{−1}. Let θ_δ > 0 be such that

e^{iθ} − 1 ≤ iθ(1 + δ) and e^{−iθ} − 1 ≤ −iθ(1 − δ) for 0 ≤ iθ ≤ Iθ_δ.   (3.15)

By (3.15), for 0 ≤ θ ≤ θ_δ,

M_μ(θ) = Σ_{i=0}^I p_i e^{iθ} ≤ 1 + (1 + δ)μθ,
M_μ(−θ) = Σ_{i=0}^I p_i e^{−iθ} ≤ 1 − (1 − δ)μθ,

and (3.5), (3.6) follow from 1 + x ≤ e^x. Moreover (3.7) holds with a_2 = 1 and (3.8) holds because E_μ X⁴ = Σ_{i=0}^I p_i i⁴ ≤ I³μ.
Example 4. If X ∼ Poisson(μ) under P_μ, then M_μ(θ) = exp[μ(e^θ − 1)], and (3.5), (3.6) again follow from (3.14). Since P_μ(X > 0) = 1 − e^{−μ}, (A2) holds with a_1 = 1 − e^{−1}, and (3.7) holds with a_2 = 1. In addition (3.8) holds because

E_μ X⁴ = Σ_{k=1}^∞ k⁴ μ^k e^{−μ}/k! = μe^{−μ} + e^{−μ} O(Σ_{k=2}^∞ μ^k) = O(μ) as μ → 0.
Example 5. Let Z be a continuous non-negative random variable with mean 1, and with Ee^{τ_0 Z} < ∞ for some τ_0 > 0. Let X be distributed as μZ under P_μ. Condition (A2) holds with a_1 = 1. We conclude (3.9) from

sup_{μ>0} P_μ(X ≤ γμ) = P(Z ≤ γ) → 0 as γ → 0.

Let 0 < δ ≤ 1. Since lim_{τ→0} τ^{−1} log Ee^{τZ} = EZ = 1, there exists τ_δ > 0 such that for 0 < τ ≤ τ_δ,

Ee^{τZ} ≤ e^{(1+δ)τ} and Ee^{−τZ} ≤ e^{−(1−δ)τ}.   (3.16)

Since M_μ(θ) = E_μ e^{θX} = Ee^{θμZ} and M_μ(−θ) = Ee^{−θμZ}, we conclude (3.10) and (3.11) from (3.16) with τ = θμ. We conclude (3.8) from E_μ X⁴ = μ⁴ EZ⁴, and (3.12), for arbitrary ξ_t > 0, from

P_μ(σ̂_t ≤ γμ) = P(σ̂_{tZ} ≤ γ) → 0 as γ → 0,

where σ̂²_{tZ} = t^{−1} Σ_{u=1}^t (Z_u − Z̄_t)² for Z_1, Z_2, . . . i.i.d. copies of Z.

4 Empirical CBT for unknown priors
The optimal implementation of CBT, in particular the computation of the best target mean ζ_n, assumes knowledge of how g(μ) behaves for μ near 0. For g unknown we rely on Theorem 1 to motivate an empirical implementation of CBT.

What is striking about (3.13) is that it relates the optimal threshold ζ_n to R_n/n, and moreover this relation does not depend on either the prior g or the reward distributions. We suggest therefore, in an empirical implementation of CBT, to apply the thresholds

ζ(m) := S'_m/n,   (4.1)

where S'_m is the sum of the first m rewards over all arms.

In the beginning, with m small, ζ(m) underestimates the optimal threshold, but this only encourages exploration, which is the right strategy at the beginning. As m increases, ζ(m) gets closer to the optimal threshold, and empirical CBT behaves more like CBT in deciding whether to play an arm further. Empirical CBT is recalling, unlike CBT, as it decides from among all arms which to play further.

Empirical CBT
Notation: When there are m total rewards, let n_k(m) denote the number of rewards from arm k and let K_m denote the number of arms played.
For m = 0, play arm 1. Hence K_1 = 1, n_1(1) = 1 and n_k(1) = 0 for k > 1.
For m = 1, . . ., n − 1:
1. If min_{1≤k≤K_m} L_{k,n_k(m)} ≤ ζ(m), then play the arm k minimizing L_{k,n_k(m)}.
2. If min_{1≤k≤K_m} L_{k,n_k(m)} > ζ(m), then play a new arm K_m + 1.
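Below is a minimal sketch of this rule. It reuses the confidence bound L_kt of (2.6); the arm interface new_arm() and the bookkeeping are illustrative assumptions.

```python
import numpy as np

def empirical_cbt(new_arm, n):
    """Minimal sketch of empirical CBT with the data-driven target (4.1).
    The algorithm is recalling: at each step it plays the arm with the smallest
    confidence bound if that bound is below S'_m / n, and opens a new arm otherwise."""
    b_n = c_n = np.log(np.log(n))
    arms, stats = [new_arm()], [[0.0, 0.0, 0]]   # per arm: sum, sum of squares, count
    total = 0.0                                  # S'_m, the sum of all rewards so far

    def bound(s, ss, t):
        if t == 0:
            return -np.inf                       # an arm not yet played is always eligible
        xbar = s / t
        sigma = np.sqrt(max(ss / t - xbar ** 2, 0.0))
        return max(xbar / b_n, xbar - c_n * sigma / np.sqrt(t))

    for m in range(n):
        zeta_m = total / n                       # empirical target (4.1)
        bounds = [bound(*st) for st in stats]
        k = int(np.argmin(bounds))
        if bounds[k] > zeta_m:                   # all played arms look weak: open a new arm
            arms.append(new_arm())
            stats.append([0.0, 0.0, 0])
            k = len(arms) - 1
        x = arms[k]()
        stats[k][0] += x
        stats[k][1] += x * x
        stats[k][2] += 1
        total += x
    return total                                 # realized total loss over the n rewards
```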
Empirical CBT, unlike CBT, does not achieve the smallest regret. This is because when a good arm (that is, an arm with mean below the optimal target) appears early, we are not sure whether this is due to good fortune or to the prior being disposed towards arms with small means, so we experiment with more arms before we are certain and play the good arm for the remaining trials. Similarly, when no good arm appears after some time, we may conclude that the prior is disposed towards arms with large means, and play an arm with mean above the optimal target for the remaining trials, even though it is advantageous to experiment further.

There is a price for not knowing g. Our calculations in Appendix A indicate that under empirical CBT, the regret

R_n ∼ I_β nζ_n,   (4.2)

where I_β = (β+1)^{−1/(β+1)} (2 − (β+1)^{−2}) Γ(2 − β/(β+1)) and Γ(u) = ∫_0^∞ x^{u−1} e^{−x} dx. The calculations are based on an idealized version of empirical CBT.

The constant I_β increases from 1 (at β = 0) to 2 (as β → ∞), so the worst-case inflation is not more than 2. The increase is quite slow, so for reasonable values of β it is closer to 1 than 2; for example I_1 = 1.10, I_2 = 1.17, I_3 = 1.24 and I_10 = 1.53. The predictions from (4.2), that the inflation of the regret increases with β and that it is not more than 25% for β = 1, 2 and 3, are validated by our simulations in Section 5.

5 Numerical studies

We study here arms with rewards that have Bernoulli (Example 5) as well as unknown distributions (Example 6). In our simulations 10,000 datasets are generated for each entry in Tables 1–4, and standard errors are placed after the ± sign. In both CBT and empirical CBT, we select b_n = c_n = log log n.
Example 5. We consider Bernoulli rewards with the following priors:
1. g(μ) = 1, which satisfies (A1) with α = β = 1,
2. g(μ) = (π/2) sin(πμ), which satisfies (A1) with α = π²/2 and β = 2,
3. g(μ) = 1 − cos(πμ), which satisfies (A1) with α = π²/2 and β = 3.

For all three priors, the two-target algorithm does better with f = 3 for smaller n, and with f = 6 or 9 at larger n. CBT is the best performer uniformly over n, and empirical CBT is also competitive against two-target with f fixed.

Even though optimal CBT performs better than empirical CBT, optimal CBT assumes knowledge of the prior to select the threshold ζ, which differs with the priors. On the other hand the same algorithm is used for all three priors when applying empirical CBT, and in fact the same algorithm is also used on the URL dataset in Example 6, with no knowledge of the reward distributions. Hence though empirical CBT is numerically comparable to two-target and weaker than CBT, it is more desirable as we do not need to know the prior to use it.

[Table 1 about here: regret at n = 100, 1000, 10,000 and 100,000 for Bernoulli rewards with the uniform prior g(μ) = 1 (β = 1), comparing CBT with ζ = √(2/n), empirical CBT, the √n-run, non-recalling √n-run and (log n)√n-learning strategies, two-target with f = 3, 6, 9, and UCB-F with K = ⌊√(n/2)⌋ arms.]

[Table 2 about here: regret at n = 100, 1000, 10,000 and 100,000 for the prior g(μ) = (π/2) sin(πμ) (β = 2), comparing CBT with ζ = Cn^{−1/(β+1)}, empirical CBT, two-target with f = 3, 6, 9, UCB-F and the non-recalling n^{1/(β+1)}-run strategy.]

For the uniform prior, the best performing among the algorithms in Berry et al. (1997) is the non-recalling √n-run algorithm.
[Table 3 about here: regret at n = 100, 1000, 10,000 and 100,000 for the prior g(μ) = 1 − cos(πμ) (β = 3), with the same algorithms as in Table 2.]

For UCB-F [cf. Wang et al. (2008)], the selection of K = ⌊(β/α)^{1/(β+1)} (n/(β+1))^{β/(β+1)}⌋ arms (∼ 1/p(ζ_n)) and "exploration sequence" E_m = √(log m) works well.

[Table 4 about here: the average regret (R_n/n) for URL rewards at n = 130 and n = 1300, comparing POKER, empirical CBT (212 at n = 130), ε-greedy with ε = 0.05 (733 and 431), ε-first with ε = 0.15 (725 and 411), and ε-decreasing with ε = 1.0 (738 and 411).]

Example 6. We consider here the URL dataset studied in Vermorel and Mohri (2005), where a POKER algorithm for dealing with a large number of arms is proposed. We reproduce part of their Table 1 in our Table 4, together with new simulations on empirical CBT. The dataset consists of the retrieval latency of 760 university home-pages, in milliseconds, with a sample size of more than 1300 for each home-page. The dataset can be downloaded from "sourceforge.net/projects/bandit".

In our simulations the rewards for each home-page are randomly permuted in each run; a sketch of this protocol is given at the end of this section. We see from Table 4 that POKER does better than empirical CBT at n = 130, whereas empirical CBT does better at n = 1300. The other algorithms are uniformly worse than both POKER and empirical CBT.

The algorithm ε-first refers to exploring with the first εn rewards, with random selection of the arms to be played. This is followed by pure exploitation for the remaining (1 − ε)n rewards, on the "best" arm (with the smallest sample mean). The algorithm ε-greedy refers to selecting, in each play, a random arm with probability ε, and the best arm with the remaining 1 − ε probability. The algorithm ε-decreasing is like ε-greedy except that in the m-th play, we select a random arm with probability min(1, ε/m), and the best arm otherwise. Both ε-greedy and ε-decreasing are disadvantaged by not making use of information on the total number of rewards. Vermorel and Mohri also ran simulations on more complicated strategies like LeastTaken, SoftMax, Exp3, GaussMatch and IntEstim, with average regret ranging from 287–447 for n = 130 and 189–599 for n = 1300.
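The following hedged sketch illustrates the simulation protocol described in Example 6: each run permutes every home-page's recorded latencies and presents fresh arms in a random order. The variable latency (a list of per-page latency arrays), the random presentation order and the reuse of the empirical_cbt sketch from Section 4 are assumptions for illustration, not details taken from the original study.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_url_run(latency, n):
    """One simulation run on the URL data: permute the rewards of every home-page,
    offer fresh arms in random order, and report the average loss over n retrievals
    (the quantity R_n/n reported in Table 4)."""
    permuted = [rng.permutation(np.asarray(x, dtype=float)) for x in latency]
    order = iter(rng.permutation(len(permuted)))
    used = {}

    def new_arm():
        k = next(order)                   # a fresh, not previously offered home-page
        used[k] = 0
        def draw(k=k):
            used[k] += 1
            return permuted[k][used[k] - 1]
        return draw

    return empirical_cbt(new_arm, n) / n  # reuses the empirical CBT sketch of Section 4
```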
6 Proof of Lemma 2

Let the infinite arms bandit problem be labelled as Problem A, and let R_A be the smallest regret for this problem. We shall prove Lemma 2 by considering two related problems, Problems B and C.

Proof of Lemma
2. Let Problem B be like Problem A except that whenwe observe the first non-zero reward from arm k , its mean µ k is revealed. Let R B be the smallest regret for Problem B. Since in Problem B we have accessto additional arm-mean information, R A ≥ R B .In Problem B the best solution involves an initial experimentation phasein which we play K arms, each until its first non-zero reward. This is followedby an exploitation phase in which we play the best arm for the remaining n − M trials, where M is the number of rewards in the experimentationphase. It is always advantageous to experiment first because no informationon arm mean is gained during exploitation. For continuous rewards M = K .Let µ b (= µ best ) = min ≤ k ≤ K µ k .In Problem C like in Problem B, the mean µ k of arm k is revealed uponthe observation of its first non-zero reward. The difference is that instead ofplaying the best arm for an additional n − M trials, we play it for n additionaltrials, for a total of n + M trials. Let R C be the smallest regret of ProblemC, the expected value of P Kk =1 n k µ k , with P Kk =1 n k = n + M . We can extendthe optimal solution of Problem B to a (possibly non-optimal) solution ofProblem C by simply playing the best arm with mean µ b a further M times.Hence [ R A + E ( M µ b ) ≥ ] R B + E ( M µ b ) ≥ R C . (6.1)15emma 2 follows from Lemmas 3 and 4 below. We prove the more technicalLemma 4 in Appendix B. ⊓⊔ Lemma 3. R C = nζ n for ζ n satisfying v ( ζ n ) = λn . Proof . Let arm j be the best arm after k arms have been played inthe experimentation phase, that is µ j = min ≤ i ≤ k µ i . Let φ ∗ be the strategyof trying out a new arm if and only if nv ( µ j ) > λ , or equivalently µ j > ζ n .Since we need on the average p ( ζ n ) arms before achieving µ j ≤ ζ n , the regretof φ ∗ is R ∗ = λp ( ζ n ) + nE g ( µ | µ ≤ ζ n ) = r n ( ζ n ) = nζ n , (6.2)see (3.1) and Lemma 1 for the second and third equalities in (6.2).Hence R C ≤ nζ n and to show Lemma 3, it remains to show that for anystrategy φ , its regret R φ is not less than R ∗ . Let K ∗ be the number of arms of φ ∗ and K the number of arms of φ . Let A = { K < K ∗ } (= { min ≤ k ≤ K µ k >ζ n } ) and A = { K > K ∗ } (= { min ≤ k ≤ K ∗ µ k ≤ ζ n , K > K ∗ } ). We can express R φ − R ∗ = X ℓ =1 n λE [( K − K ∗ ) A ℓ ]+ nE h(cid:16) min ≤ k ≤ K µ k − min ≤ k ≤ K ∗ µ k (cid:17) A ℓ io . (6.3)Under A , min ≤ k ≤ K µ k > ζ n and therefore by (6.2), λE [( K − K ∗ ) A ] + nE h(cid:16) min ≤ k ≤ K µ k − min ≤ k ≤ K ∗ µ k (cid:17) A i (6.4)= − λP ( A ) p ( ζ n ) + n n E h(cid:16) min ≤ k ≤ K µ k (cid:17) A i − P ( A ) E g ( µ | µ ≤ ζ n ) o ≥ P ( A ) {− λp ( ζ n ) + n [ ζ n − E g ( µ | µ ≤ ζ n )] } = 0 . The identity E [( K ∗ − K ) A ] = P ( A ) p ( ζ n ) is due to min ≤ k ≤ K µ k > ζ n when thereare K arms, and so an additional p ( ζ n ) arms on average is required understrategy φ ∗ , to get an arm with mean not more than ζ n .In view that ( K − K ∗ ) A = P ∞ j =0 { K>K ∗ + j } and (cid:16) min ≤ k ≤ K µ k − min ≤ k ≤ K ∗ µ k (cid:17) A = ∞ X j =0 (cid:16) min ≤ k ≤ K ∗ + j +1 µ k − min ≤ k ≤ K ∗ + j µ k (cid:17) { K>K ∗ + j } ,
16e can check that λE [( K − K ∗ ) A ] + nE h(cid:16) min ≤ k ≤ K µ k − min ≤ k ≤ K ∗ µ k (cid:17) A i (6.5)= ∞ X j =0 E nh λ + n (cid:16) min ≤ k ≤ K ∗ + j +1 µ k − min ≤ k ≤ K ∗ + j µ k (cid:17)i { K>K ∗ + j } o = ∞ X j =0 E nh λ − nv (cid:16) min ≤ k ≤ K ∗ + j µ k (cid:17)i { K>K ∗ + j } o ≥ . The inequality in (6.5) follows from v ( min ≤ k ≤ K ∗ + j µ k ) ≤ v ( min ≤ k ≤ K ∗ µ k ) ≤ v ( ζ n ) = λn , as v is monotone increasing. Lemma 3 follows from (6.2)–(6.5). ⊓⊔ Lemma 4. E ( M µ b ) = o ( n ββ +1 ) . Bonald and Prouti`ere (2013) also referred to Problem B in their lowerbound for Bernoulli rewards. What is different in our proof of Lemma 2 isa further simplification by considering Problem C, in which the number ofrewards in the exploitation phase is fixed to be n . We show in Lemma 3 thatunder Problem C the optimal regret has a simple expression nζ n , and reducethe proof of Lemma 2 to showing Lemma 4. We preface the proof of Theorem 1 with Lemmas 5–8. The lemmas areproved for discrete rewards in Section 7.1 and continuous rewards in Section7.2. Consider X , X , . . . i.i.d. F µ . Let S t = P tu =1 X u , ¯ X t = S t t and b σ t = t − P tu =1 ( X u − ¯ X t ) . Let T b = inf { t : S t > b n tζ n } , (7.1) T c = inf { t : S t > tζ n + c n b σ t √ t } , (7.2)with b n , c n satisfying (2.5) and ζ n ∼ Cn − β +1 for C = ( λβ ( β +1) α ) β +1 . Let d n = n − ω for some 0 < ω < β +1 . (7.3)17 emma 5. As n → ∞ , sup µ ≥ d n [min( µ, E µ T b ] = O (1) , (7.4) E g ( T b µ { µ ≥ d n } ) ≤ λ + o (1) . (7.5) Lemma 6.
Let ǫ > . As n → ∞ , sup (1+ ǫ ) ζ n ≤ µ ≤ d n [ µE µ ( T c ∧ n )] = O ( c n + log n ) , (7.6) E g [( T c ∧ n ) µ { (1+ ǫ ) ζ n ≤ µ ≤ d n } ] → . (7.7) Lemma 7.
Let < ǫ < . As n → ∞ , sup µ ≤ (1 − ǫ ) ζ n P µ ( T b < ∞ ) → . Lemma 8.
Let < ǫ < . As n → ∞ , sup µ ≤ (1 − ǫ ) ζ n P µ ( T c < ∞ ) → . The number of times an arm is played has distribution bounded above by T := T b ∧ T c . Lemmas 7 and 8 say that an arm with mean less than (1 − ǫ ) ζ n is unlikely to be rejected, whereas (7.5) and (7.7) say that the regret dueto sampling from an arm with mean more than (1 + ǫ ) ζ n is asymptoticallybounded by λ . The remaining (7.4) and (7.6) are technical relations used inthe proof of Theorem 1. Proof of Theorem
1. The number of times arm k is played is n k , andit is distributed as T b ∧ T c ∧ ( n − P k − ℓ =1 n ℓ ). Let 0 < ǫ <
1. We can express R n − nζ n = z + z + z = z + z − | z | , (7.8)where z i = E [ P k : µ k ∈ D i n k ( µ k − ζ n )] for D = [(1 + ǫ ) ζ n , ∞ ) , D = ((1 − ǫ ) ζ n , (1 + ǫ ) ζ n ) , D = (0 , (1 − ǫ ) ζ n ] . It is easy to see that z ≤ ǫnζ n . We shall show that z ≤ λ + o (1)(1 − ǫ ) β p ( ζ n ) , (7.9) | z | ≥ [( − ǫ ǫ ) β + o (1)][ nǫζ n + (1 − ǫ ) λp ( ζ n ) ] . (7.10)18e conclude Theorem 1 from (7.8)–(7.10) with ǫ → ⊓⊔ Proof of (7.9). Since T = T b ∧ T c , by Lemmas 7 and 8, q n := sup µ ≤ (1 − ǫ ) ζ n P µ ( T < ∞ ) (7.11) ≤ sup µ ≤ (1 − ǫ ) ζ n [ P µ ( T b < ∞ ) + P µ ( T c < ∞ )] → . That is an arm with mean less than (1 − ǫ ) ζ n is rejected with negligibleprobability for n large. Since the total number of played arms K is boundedabove by a geometric random variable with mean P g ( T = ∞ ) , by (7.11) and p ( ζ ) ∼ αβ ζ β as ζ → EK ≤ P g ( T = ∞ ) ≤ − q n ) p ((1 − ǫ ) ζ n ) ∼ − ǫ ) β p ( ζ n ) . (7.12)By (7.5) and (7.7), E g ( n µ { µ ≥ (1+ ǫ ) ζ n } )= E g ( n µ { (1+ ǫ ) ζ n ≤ µ ≤ d n } ) + E g ( n µ { µ ≥ d n } ) ≤ E g [( T c ∧ n ) µ { (1+ ǫ ) ζ n ≤ µ ≤ d n } ] + E g ( T b µ { µ ≥ d n } ) ≤ λ + o (1) , and (7.9) follows from (7.12) and z ≤ E g ( n µ { µ ≥ (1+ ǫ ) ζ n } ) EK . ⊓⊔ Proof of (7.10). Let ℓ be the first arm with mean not more than(1 − ǫ ) ζ n . We have | z | = E h X k : µ k ∈ D n k ( ζ n − µ k ) i (7.13) ≥ ( En ℓ ) { ζ n − E g [ µ | µ ≤ (1 − ǫ ) ζ n ] } . Since v ( ζ n ) ∼ λn and p ( ζ ) ∼ αβ ζ β , v ( ζ ) ∼ αβ ( β +1) ζ β +1 as ζ → ζ n − E g [ µ | µ ≤ (1 − ǫ ) ζ n ]= ζ n − { (1 − ǫ ) ζ n − E g [(1 − ǫ ) ζ n − µ | µ ≤ (1 − ǫ ) ζ n ] } = ζ n − [(1 − ǫ ) ζ n − v ((1 − ǫ ) ζ n ) p ((1 − ǫ ) ζ n ) ] ∼ ǫζ n + (1 − ǫ ) v ( ζ n ) p ( ζ n ) ∼ ǫζ n + (1 − ǫ ) λnp ( ζ n ) , and (7.10) thus follows from (7.13) and En ℓ ≥ [( − ǫ ǫ ) β + o (1)] n. (7.14)19et j be the first arm with mean not more than (1 + ǫ ) ζ n and M = P j − i =1 n i . We have En ℓ ≥ (1 − q n ) E ( n − M ) P ( ℓ = j ) . Since q n → P ( ℓ = j ) → ( − ǫ ǫ ) β , to show (7.14) it suffices to show that EM = o ( n ).Indeed by (7.4), (7.6) and E µ n ≤ E µ ( T ∧ n ),sup µ ≥ (1+ ǫ ) ζ n [min( µ, E µ n ] ≤ max h sup (1+ ǫ ) ζ n ≤ µ ≤ d n µE µ ( T c ∧ n ) , sup µ ≥ d n min( µ, E µ T b i = O ( c n + log n ) . Hence in view that p ((1+ ǫ ) ζ n ) = O ( n ββ +1 ) and P g ( µ > (1 + ǫ ) ζ n ) → n → ∞ , EM ≤ p ((1+ ǫ ) ζ n ) E g ( n | µ > (1 + ǫ ) ζ n )= O ( n ββ +1 ) E g [ c n +log n min( µ , (cid:12)(cid:12) µ > (1 + ǫ ) ζ n ]= O ( n ββ +1 ( c n + log n )) Z ∞ (1+ ǫ ) ζ n g ( µ )min( µ, dµ = O ( n ββ +1 ( c n + log n )) max( n − ββ +1 , log n ) = o ( n ) . The first relation in the line above follows from Z ∞ (1+ ǫ ) ζ n g ( µ )min( µ, dµ = O (1) if β > ,O (log n ) if β = 1 ,O ( n − ββ +1 ) if β < . ⊓⊔ In the case of discrete rewards, one difficulty is that for µ k small, there arepotentially multiple plays on arm k before a non-zero reward is observed.Condition (A2) is helpful in ensuring that the mean of this non-zero rewardis not too large. Proof of Lemma
5. Recall that T b = inf { t : S t > b n tζ n } , d n = n − ω for some 0 < ω < β +1 . We shall show thatsup µ ≥ d n [min( µ, E µ T b ] = O (1) , (7.15) E g ( T b µ { µ ≥ d n } ) ≤ λ + o (1) . (7.16)Let θ = 2 ω log n . Since X u is integer-valued, it follows from Markov’sinequality that P µ ( S t ≤ b n tζ n ) ≤ [ e θb n ζ n M µ ( − θ )] t ≤ { e θb n ζ n [ P µ ( X = 0) + e − θ ] } t . (7.17)By P µ ( X > ≥ a d n for µ ≥ d n [see (A2)], θb n ζ n = o ( d n ) [because θ and b n are both sub-polynomial in n and ζ n = O ( n − β +1 )] and (7.17), uniformlyover µ ≥ d n , E µ T b = 1 + ∞ X t =1 P µ ( T b > t ) (7.18) ≤ ∞ X t =1 P µ ( S t ≤ b n tζ n ) ≤ { − e θb n ζ n [ P µ ( X = 0) + e − θ ] } − = { − [1 + o ( d n )][ P µ ( X = 0) + d n ] } − = [ P µ ( X >
0) + o ( d n )] − ∼ [ P µ ( X > − . The term inside {·} in (7.17) is not more than [1 + o ( d n )](1 − a d n + d n ) < n large and this justifies the second inequality in (7.18). We conclude(7.15) from (7.18) and (A2). By (7.18), E g [ T b µ { µ ≥ d n } ] = Z ∞ d n E µ ( T b ) µg ( µ ) dµ ≤ [1 + o (1)] Z ∞ d n E µ ( X ) P µ ( X> g ( µ ) dµ = [1 + o (1)] Z ∞ d n E µ ( X | X > g ( µ ) dµ → λ, hence (7.16) holds. ⊓⊔ Proof of Lemma
6. Recall that T c = inf { t : S t > tζ n + c n b σ t √ t } andlet ǫ >
0. We want to show thatsup (1+ ǫ ) ζ n ≤ µ ≤ d n µE µ ( T c ∧ n ) = O ( c n + log n ) , (7.19) E g [( T c ∧ n ) µ { (1+ ǫ ) ζ n ≤ µ ≤ d n } ] → . (7.20)21e first show that there exists κ > n → ∞ ,sup µ ≤ d n h µ n X t =1 P µ ( b σ t ≥ κµ ) i = O (log n ) . (7.21)Since X is non-negative integer-valued, X ≤ X . Indeed by (3.8), thereexists κ > ρ µ := E µ X ≤ κµ for µ ≤ d n and n large, therefore by(3.8) again and Chebyshev’s inequality, P µ ( b σ t ≥ κµ ) ≤ P µ (cid:16) t X u =1 X u ≥ tκµ (cid:17) ≤ P µ (cid:16) t X u =1 ( X u − ρ µ ) ≥ tκµ (cid:17) ≤ t Var µ ( X )( tκµ/ = O (( tµ ) − ) , and (7.21) holds.By (7.21), uniformly over (1 + ǫ ) ζ n ≤ µ ≤ d n , E µ ( T c ∧ n ) = 1 + n − X t =1 P µ ( T c > t ) (7.22) ≤ n − X t =1 P µ ( S t ≤ tζ n + c n b σ t √ t ) ≤ n − X t =1 P µ ( S t ≤ tζ n + c n √ κµt ) + O ( log nµ ) . Let 0 < δ < to be further specified. Uniformly over t ≥ c n µ − , µt/ ( c n √ κµt ) → ∞ and therefore by (3.6), µ ≥ (1 + ǫ ) ζ n and Markov’sinequality, for n large, P µ ( S t ≤ tζ n + c n √ κµt ) ≤ P µ ( S t ≤ t ( ζ n + δµ )) (7.23) ≤ e θ δ t ( ζ n + δµ ) M tµ ( − θ δ ) ≤ e tθ δ [ ζ n − (1 − δ ) µ ] ≤ e − ηtθ δ µ , where η = 1 − δ − ǫ > δ chosen small). Since 1 − e − ηθ δ µ ∼ ηθ δ µ as µ → c n µ − + X t ≥ c n µ − e − ηtθ δ µ = O ( c n µ − ) , (7.24)22nd substituting (7.23) into (7.22) gives us (7.19). By (7.19), E g [( T c ∧ n ) µ { (1+ ǫ ) ζ n ≤ µ ≤ d n } ] = P g ((1 + ǫ ) ζ n ≤ µ ≤ d n ) O ( c n + log n )= O ( d βn ( c n + log n )) , and (7.20) holds since c n is sub-polynomial in n . ⊓⊔ Proof of Lemma
7. We want to show that P µ ( S t > tb n ζ n for some t ≥ → µ ≤ (1 − ǫ ) ζ n .By (3.7) and Bonferroni’s inequality, P µ ( S t > tb n ζ n for some t ≤ √ b n ζ n ) (7.26) ≤ P µ ( X t > t ≤ √ b n ζ n ) ≤ a µ √ b n ζ n → . By (3.5) and a change-of-measure argument, for n large, P µ ( S t > tb n ζ n for some t > √ b n ζ n ) (7.27) ≤ sup t> √ bnζn [ e − θ b n ζ n M µ ( θ )] t ≤ e − θ ( b n ζ n − µ ) / ( ζ n √ b n ) → . To see the first inequality of (7.27), let f µ be the density of X with respectto some σ -finite measure, and let E θ µ ( P θ µ ) denote expectation (probability)with respect to density f θ µ ( x ) := [ M µ ( θ )] − e θ x f µ ( x ) . Let T = inf { t > √ b n ζ n : S t > tb n ζ n } . It follows from a change of measurethat P µ ( T = t ) = M tµ ( θ ) E θ µ ( e − θ S t { T = t } ) (7.28) ≤ [ e − θ b n ζ n M µ ( θ )] t P θ µ ( T = t ) , and the first inequality of (7.27) follows from summing (7.28) over t > √ b n ζ n . ⊓⊔ Proof of Lemma
8. We want to show that P µ ( S t > tζ n + c n b σ t √ t for some t ≥ → µ ≤ (1 − ǫ ) ζ n .By (3.7) and Bonferroni’s inequality, P µ ( S t > tζ n + c n b σ t √ t for some t ≤ c n µ ) (7.30) ≤ P µ ( X t > t ≤ c n µ ) ≤ a c n → . Moreover P µ ( S t > tζ n + c n b σ t √ t for some t > c n µ ) ≤ (I) + (II) , (7.31)where (I) = P µ ( S t > tζ n + c n ( µt/ for some t > c n µ ) , (II) = P µ ( b σ t ≤ µ and S t ≥ tζ n for some t > c n µ ) . By (7.30) and (7.31), to show (7.29), it suffices to show that (I) → → < δ ≤ δ < (1 − ǫ ) − . Hence µ ≤ (1 − ǫ ) ζ n implies ζ n ≥ (1 + δ ) µ . It follows from (3.5) and change-of-measure argument [see(7.27) and (7.28)] that(I) ≤ sup t> cnµ [ e − θ δ [ tζ n + c n ( µt/ ] M tµ ( θ δ )] ≤ exp {− θ δ [ ζ n − (1 + δ ) µ ] / ( c n µ ) − θ δ ( c n / }≤ exp {− θ δ ( c n / } → . Since X u ≥ X u , the inequality S t ≥ tζ n ( ≥ tµ ) implies P tu =1 X u ≥ tµ , andthis, together with b σ t ≤ µ implies that ¯ X t ≥ µ . Hence by (3.5) and change-of-measure argument, for n large,(II) ≤ P µ ( ¯ X t ≥ q µ for some t > c n µ ) ≤ sup t> cnµ [ e − θ √ µ/ M µ ( θ )] t ≤ exp {− θ [ q µ − µ ] / ( c n µ ) }≤ exp (cid:8) − θ (cid:2) c n √ − ǫ ) ζ n − c n (cid:3)(cid:9) → . ⊓⊔ In the case of continuous rewards, the proofs are simpler due to rewards beingnon-zero, in particular λ = E g µ . 24 roof of Lemma
5. To show (7.4) and (7.5), it suffices to show thatsup µ ≥ d n E µ T b ≤ o (1) . (7.32)Let θ > P µ ( S t ≤ b n tζ n ) ≤ [ e θb n ζ n M µ ( − θ )] t . Moreover, for any γ > M µ ( − θ ) ≤ P µ ( X ≤ γµ ) + e − γθµ , hence E µ T b ≤ ∞ X t =1 P µ ( S t ≤ b n tζ n ) (7.33) ≤ { − e θb n ζ n [ P µ ( X ≤ γµ ) + e − γθµ ] } − . Let γ = n and θ = n η for some ω < η < β +1 . By (2.5), (3.9) and (7.3), for µ ≥ d n , e θb n ζ n → , e − γθµ → , P µ ( X ≤ γµ ) → , and (7.32) follows from (7.33). ⊓⊔ Proof of Lemma
6. By (3.8), for µ small, ρ µ := E µ X = E µ ( X { X< } ) + E µ ( X { X ≥ } ) ≤ E µ X + E µ X = O ( µ ) . Hence to show (7.6) and (7.7), we proceed as in the proof of Lemma 6 fordiscrete rewards, applying (3.11) in place of (3.6), with any fixed θ > θ δ in (7.23) and (7.24). ⊓⊔ Proof of Lemma
7. It follows from (3.10) with θ = τ µ and a change-of-measure argument [see (7.27) and (7.28)] that for n large, P µ ( S t > tb n ζ n for some t ≥ ≤ sup t ≥ [ e − θb n ζ n M µ ( θ )] t ≤ e − θ ( b n ζ n − µ ) → . ⊓⊔ roof of Lemma
8. Let η > δ > δ )(1 − ǫ ) <
1. It follows from (3.10) with θ = τ δ µ and a change-of-measure argumentthat for u large, P µ ( S t ≥ tζ n + c n b σ t √ t for some t > u ) (7.34) ≤ P µ ( S t ≥ tζ n for some t > u ) ≤ sup t>u [ e − θζ n M µ ( θ )] t ≤ e − uθ [ ζ n − (1+ δ ) µ ] ≤ e − uτ δ [(1 − ǫ ) − − (1+ δ )] ≤ η. By (3.12), we can select γ > n large (so that µ ≤ (1 − ǫ ) ζ n ≤ min ≤ t ≤ u ξ t ), u X t =1 P µ ( b σ t ≤ γµ ) ≤ η. (7.35)Let θ = τ µ . By (3.10), (7.35) and Bonferroni’s inequality, P µ ( S t > tζ + c n b σ t √ t for some t ≤ u ) (7.36) ≤ P µ ( S t ≥ c n b σ t √ t for some t ≤ u ) ≤ η + u X t =1 P µ ( S t ≥ c n µ √ γt ) ≤ η + u X t =1 e − θc n µ √ γt M tµ ( θ ) ≤ η + u X t =1 e − τ ( c n √ γt − t ) → η. Lemma 8 follows from (7.34) and (7.36) since η can be chosen arbitrarilysmall. ⊓⊔ A Derivation of (4.2)
The idealized algorithm in the beginning of Section 3.1 captures the essenceof how CBT behaves. The mean µ k of arm k is revealed when its first non-zero reward appears. If µ k > ζ [with optimality when ζ = ζ n , see (3.3)], thenwe stop sampling from arm k and sample next arm k + 1. If µ k ≤ ζ , then weexploit arm k a further n times before stopping the algorithm.For empirical CBT, when there are m rewards, we apply threshold ζ ( m ) = S ′ m n where S ′ m is the sum of the m rewards. In an idealized version of CBT,26e also reveal the mean µ k of arm when its first non-zero reward appears. Tocapture the essence of how threshold ζ ( m ) increases with number of rewards m , we assign a fixed cost of λ to each arm that we experiment with. That iswe replace ζ ( m ) by b ζ k = kλn , where k is the number of arms played. Whenmin ≤ i ≤ k µ i ≤ b ζ k , we stop experimenting and play the best arm a further n times. More specifically:Idealized empirical CBT1. For k = 1 , , . . . : Draw n k rewards from arm k , where n k = inf { t ≥ X kt > } .
2. Stop when there are K arms, where K = inf n k ≥ ≤ i ≤ k µ i ≤ kλn o .
3. Draw n additional rewards from arm j satisfying µ j = min ≤ k ≤ K µ k .We define the regret of this algorithm to be R n = λEK + nE (min ≤ k ≤ K µ k ). Theorem 2.
For idealized empirical CBT, its regret R n ∼ CI β n ββ +1 , where C = ( λβ ( β +1) α ) β +1 and I β = ( β +1 ) β +1 (2 − β +1) )Γ(2 − ββ +1 ) . Proof . We stop experimenting after K arms, where K = inf { k : min ≤ j ≤ k µ j ≤ b ζ k } , b ζ k = kλn . (A.1)Let D k = { b ζ k − λn < min ≤ j ≤ k − µ j ≤ b ζ k } , D k = { min ≤ j ≤ k − µ j > b ζ k , µ k ≤ b ζ k } . We check that D k , D k are disjoint, and that D k ∪ D k = { K = k } . Essentially D k is the event that K = k and the best arm is not k and D k the event that K = k and the best arm is k . 27or any fixed k ∈ Z + , P ( D k ) = [1 − p ( b ζ k − λn )] k − − [1 − p ( b ζ k )] k − (A.2)= { − p ( b ζ k ) + [1 + o (1)] λn g ( b ζ k ) } k − − [1 − p ( b ζ k )] k − ∼ { [1 − p ( b ζ k )] k − } kλn g ( b ζ k ) ∼ exp( − αλ β β k β +1 n − β ) αλ β k β n − β . Moreover E ( R n | D k ) ∼ kλ + n ( kλn ) = 2 kλ. (A.3)Likewise, P ( D k ) = { [1 − p ( b ζ k )] k − } p ( b ζ k ) (A.4) ∼ exp( − αλ β β k β +1 n − β ) αλ β β k β n − β ,E ( R n | D k ) = kλ + nE ( µ | µ ≤ b ζ k ) (A.5)= 2 kλ − nv (ˆ ζ k ) p (ˆ ζ k ) ∼ (2 − β +1 ) kλ. Combining (A.2)–(A.5) gives us R n = ∞ X k =1 [ E ( R ′ C | D k ) P ( D k ) + E ( R ′ C | D k ) P ( D k )] (A.6) ∼ ∞ X k =1 exp( − αλ β β k β +1 n − β )( αλ β +1 β k β +1 n − β )(2 β + 2 − β +1 ) , It follows from (A.6) and a change of variables x = αλ β β k β +1 n − β that R n ∼ (2 β + 2 − β +1 ) Z ∞ exp( − αλ β β k β +1 n − β )( αλ β +1 β k β +1 n − β ) dk = (2 β + 2 − β +1 ) Z ∞ β +1 ( λβα ) β +1 n ββ +1 exp( − x ) x β +1 dx = (2 − β +1) )( λβα ) β +1 Γ(2 − ββ +1 ) n ββ +1 , and Theorem 2 holds. ⊓⊔ Proof of Lemma 4
Let b K = ⌊ nζ n (log n ) β +2 ⌋ for ζ n satisfying nv ( ζ n ) = λ . Express E ( M µ b ) = P i =1 E ( M µ b D i ), where D = { µ b ≤ ζ n log n } ,D = { µ b > ζ n log n , K > b K } ,D = { ζ n log n < µ b ≤ ζ n (log n ) β +3 , K ≤ b K } ,D = { µ b > ζ n (log n ) β +3 , K ≤ b K, M > n } ,D = { µ b > ζ n (log n ) β +3 , K ≤ b K, M ≤ n } . It suffices to show that for all i , E ( M µ b D i ) = o ( n ββ +1 ) . (B.1)Since Mζ n log n ≤ nζ n log n = o ( n ββ +1 ), (B.1) holds for i = 1.Let b µ b = min k ≤ ˆ K µ k . Since M ≤ n , µ b ≤ µ and E ( µ ) ≤ λ , E ( M µ b D ) ≤ nE ( µ | µ > ζ n log n ) P ( D ) (B.2) ≤ [ λ + o (1)] nP ( b µ b > ζ n log n ) . Substituting P ( b µ b > ζ n log n ) = [1 − p ( ζ n log n )] ˆ K = exp {− [1 + o (1)] b K αβ ( ζ n log n ) β ] = O ( n − )into (B.2) shows (B.1) for i = 2.Let M j be the number of plays of Π j to its first non-zero reward (hence M = P Kj =1 M j ). It follows from condition (A2) that E µ M = P µ ( X > ≤ a min( µ, , by µ b ≤ ζ n (log n ) β +3 , E ( M µ b D ) ≤ E ( M { µ > ζn log n } ) b Kζ n (log n ) β +3 (B.3) ≤ (cid:16) Z ∞ ζn log n g ( µ ) a min( µ, dµ (cid:17) nζ n (log n ) β +5 . Substituting Z ∞ ζn log n g ( µ ) µ dµ = O (1) if β > ,O (log n ) if β = 1 ,O (( ζ n log n ) β − ) if β < , i = 3.If µ j > ζ n (log n ) β +3 , then by condition (A2), M j is bounded above by ageometric random variable with mean ν − , where ν = a ζ n (log n ) β +3 . Hencefor 0 < θ < log( − ν ), E ( e θM j { µ j >ζ n (log n ) β +3 } ) ≤ ∞ X i =1 e θi ν (1 − ν ) i − = νe θ − e θ (1 − ν ) , implying that [ e θn P ( D ) ≤ ] E ( e θM D ) ≤ ( νe θ − e θ (1 − ν ) ) ˆ K . (B.4)Consider e θ = 1 + ν and check that e θ (1 − ν ) ≤ − ν [ ⇒ θ < log( − ν )]. Itfollows from (B.4) that P ( D ) ≤ e − θn ( νe θ ν/ ) ˆ K = 2 ˆ K e θ ( ˆ K − n ) = exp[ b K log 2 + [1 + o (1)] ν ( b K − n )]= n − b K [ a − a ζ n (log n ) β +2 − log 2log n ] = O ( n − ) . Since M ≤ n , µ b ≤ µ and E ( µ ) ≤ λ , E ( M µ b D ) ≤ nE [ µ | µ > ζ n (log n ) β +3 ] P ( D ) ≤ n [ λ + o (1)] P ( D ) , and (B.1) holds for i = 4.Under D for n large,( n − M ) v ( µ b )[ > n v ( ζ n (log n ) β +3 ) ∼ n v ( ζ n )(log n ) ( β +3)( β +1) ] > λ. The optimal solution of Problem B requires further experimentation since itscost λ is less than the reduction in exploitation cost. In other words D isan event of zero probability. Therefore (B.1) holds for i = 5. References [1]
Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab.

[2] Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning.

[3] Baransi, A., Maillard, O.-A. and Mannor, S. (2014). Sub-sampling for multi-armed bandits. Proceedings of the European Conference on Machine Learning.

[4] Berry, D., Chen, R., Zame, A., Heath, D. and Shepp, L. (1997). Bandit problems with infinitely many arms. Ann. Statist. 25, 2103-2116.

[5] Berry, D. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London.

[6] Bonald, T. and Proutière, A. (2013). Two-target algorithms for infinite-armed bandits with Bernoulli rewards. Neural Information Processing Systems.

[7] Brezzi, M. and Lai, T.L. (2002). Optimal learning and experimentation in bandit problems. J. Econ. Dynamics Control.

[8] Burnetas, A. and Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Adv. Appl. Math.

[9] Cappé, O., Garivier, A., Maillard, O.-A., Munos, R. and Stoltz, G. (2013). Kullback-Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist.

[10] Carpentier, A. and Valko, M. (2015). Simple regret for infinitely many armed bandits. International Conference on Machine Learning.

[11] Chan, H.P. (2019). The multi-armed bandit problem: an efficient non-parametric solution. To appear in Ann. Statist.

[12] Gittins, J. (1989). Multi-armed Bandit Allocation Indices. Wiley, New York.

[13] Hu, I. and Wei, C.Z. (1989). Irreversible adaptive allocation rules. Ann. Statist.

[14] Kaufmann, E., Cappé, O. and Garivier, A. (2012). On Bayesian upper confidence bounds for bandit problems. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics.

[15] Korda, N., Kaufmann, E. and Munos, R. (2013). Thompson sampling for 1-dimensional exponential family bandits. Neural Information Processing Systems.

[16] Lai, T.L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist.

[17] Lai, T.L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. Appl. Math.

[18] Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika.

[19] Vermorel, J. and Mohri, M. (2005). Multi-armed bandit algorithms and empirical evaluation. Machine Learning: ECML, Springer, Berlin.

[20] Wang, Y., Audibert, J.-Y. and Munos, R. (2008). Algorithms for infinitely many-armed bandits. Neural Information Processing Systems.