Infinite Arms Bandit: Optimality via Confidence Bounds
Hock Peng Chan and Hu Shouri
National University of Singapore
Abstract

The infinite arms bandit problem was initiated by Berry et al. (1997). They derived a regret lower bound for all strategies under Bernoulli rewards with a uniform prior, and proposed bandit strategies based on success runs, which however do not achieve this bound. Bonald and Proutière (2013) showed that the lower bound is achieved by their two-target algorithm, and extended optimality to Bernoulli rewards with general priors. We propose here a confidence bound target (CBT) algorithm that achieves optimality for unspecified non-negative reward distributions. For each arm we use the mean and standard deviation of its rewards to compute a confidence bound, and we play the arm with the smallest confidence bound provided it is smaller than a target mean. If the bounds are all larger, then we play a new arm. We show, for a given prior of the arm means, how the target mean can be computed to achieve optimality. In the absence of information on the prior, the target mean is determined empirically, and the regret achieved is still comparable to the regret lower bound. Numerical studies show that CBT is versatile and outperforms its competitors.
1 Introduction

Berry, Chen, Zame, Heath and Shepp (1997) initiated the infinite arms bandit problem on Bernoulli rewards. They showed, in the case of a uniform prior on the mean of an arm, a regret lower bound of order √n for n rewards, and provided algorithms based on success runs that achieve no more than 2√n regret. Bonald and Proutière (2013) provided a two-target stopping-time algorithm that can get arbitrarily close to Berry et al.'s lower bound, and is also optimal for Bernoulli rewards with general priors. Wang, Audibert and Munos (2008) considered bounded rewards and showed that their confidence bound algorithm has regret bounds that are log n times the optimal regret. Vermorel and Mohri (2005) proposed a POKER algorithm for general reward distributions and priors.

The confidence bound method is arguably the most influential approach for the fixed arm-size multi-armed bandit problem over the past thirty years. Lai and Robbins (1985) derived the smallest asymptotic regret that a multi-armed bandit algorithm can achieve. Lai (1987) showed that by constructing an upper confidence bound (UCB) for each arm and playing the arm with the largest UCB, this smallest regret is achieved in exponential families. The UCB approach was subsequently extended to unknown time-horizons and other parametric families in Agrawal (1995), Auer, Cesa-Bianchi and Fischer (2002), Burnetas and Katehakis (1996), Cappé, Garivier, Maillard, Munos and Stoltz (2013) and Kaufmann, Cappé and Garivier (2012), and it has been shown to perform well in practice, achieving optimality beyond exponential families. Chan (2019) modified the subsampling approach of Baransi, Maillard and Mannor (2014) and showed that optimality is achieved in exponential families, despite not applying parametric information in the selection of arms. The method can be viewed as applying confidence bounds that are computed empirically from subsample information, which substitutes for the missing parametric information. A related problem is the study of the multi-armed bandit with irreversible constraints, initiated by Hu and Wei (1989).

Good performance and optimality have also been achieved by Bayesian approaches to the multi-armed bandit problem; see Berry and Fristedt (1985), Gittins (1989) and Thompson (1933) for early groundwork on the Bayesian approach, and Korda, Kaufmann and Munos (2013) for more recent advances.

In this paper we show how the confidence bound method can be extended to the infinite arms bandit problem to achieve optimality. Adjustments are made to account for the infinite number of arms that are available, in particular the specification of a target mean that we desire from our best arm, and a mechanism to reject weak arms quickly. For each arm a lower confidence bound of its mean is computed, using only information on the sample mean and standard deviation of its rewards. We play an arm as long as its confidence bound is below the target mean. If it is above, then a new arm is played in the next trial.

We start with the smallest possible regret that an infinite arms bandit algorithm can achieve, as the number of rewards goes to infinity. This is followed by the target mean selection of the confidence bound target (CBT) algorithm to achieve this regret. The optimal target mean depends only on the prior distribution of the arm means and not on the reward distributions. That is, the reward distributions need not be specified and optimality is still achieved. In the absence of information on the prior, we show how to adapt via empirical determination of the target mean.
Regret guarantees of empirical CBT, relative to the baseline lower bound, are provided. Numerical studies on Bernoulli rewards and on a URL dataset show that CBT and empirical CBT outperform their competitors.

The layout of the paper is as follows. In Section 2 we review a number of infinite arms bandit algorithms and describe CBT. In Section 3 we motivate why a particular choice of the target mean leads to the smallest regret and state the optimality results. In Section 4 we introduce an empirical version of CBT to tackle unknown priors and explain intuitively why it works. In Section 5 we perform numerical studies. In Sections 6–7 we prove the optimality of CBT.

2 Infinite arms bandit algorithms

Let X_k1, X_k2, . . . be i.i.d. non-negative rewards from an arm or population Π_k, 1 ≤ k < ∞, with mean μ_k. Let μ_1, μ_2, . . . be i.i.d. with prior density g on (0, ∞). Let F_μ denote the reward distribution of an arm with mean μ, and let E_μ (P_μ) denote expectation (probability) with respect to X ∼ F_μ. Let a ∧ b denote min(a, b), ⌊·⌋ (⌈·⌉) denote the greatest (least) integer function and a⁺ denote max(0, a). Let a_n ∼ b_n if lim_{n→∞}(a_n/b_n) = 1, a_n = o(b_n) if lim_{n→∞}(a_n/b_n) = 0 and a_n = O(b_n) if lim sup_{n→∞}|a_n/b_n| < ∞.

A bandit algorithm is required to select one of the arms to be played at each trial, with the choice informed by past outcomes. We measure the effectiveness of a bandit algorithm by its regret

R_n = E(Σ_{k=1}^K n_k μ_k),

where K is the total number of arms played, n_k the number of rewards from Π_k and n = Σ_{k=1}^K n_k. Berry et al. (1997) showed that if F_μ is Bernoulli and g is uniform on (0,1), then

lim inf_{n→∞} R_n/√n ≥ √2.   (2.1)

They proposed the following strategies based on success runs.

1. f-failure strategy. The same arm is played until f failures are encountered. When this happens we switch to a new arm. We do not go back to a previously played arm, that is, the strategy is non-recalling.
2. s-run strategy. We restrict ourselves to no more than s arms, following the 1-failure strategy in each, until a success run of length s is observed in an arm. When this happens we play the arm for the remaining trials. If no success run of length s is observed in all s arms, then the arm with the highest proportion of successes is played for the remaining trials.
3. Non-recalling s-run strategy. We follow the 1-failure strategy until an arm produces a success run of length s. When this happens we play the arm for the remaining trials. If no arm produces a success run of length s, then the 1-failure strategy is used for all n trials.
4. m-learning strategy. We follow the 1-failure strategy for the first m trials, with the arm at trial m played until it yields a failure. Thereafter we play, for the remaining trials, the arm with the highest proportion of successes.
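As a concrete illustration (in Berry et al.'s convention, where a success is the favourable Bernoulli outcome), the following is a minimal sketch of the f-failure strategy. The prior sampler new_arm below, with a uniform prior on the success probability, is an assumption made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def new_arm():
    """A fresh Bernoulli arm whose success probability is drawn from the uniform prior."""
    p = rng.uniform()
    return lambda: int(rng.random() < p)

def f_failure_strategy(n, f=1):
    """Non-recalling f-failure strategy: keep playing the current arm until it has
    produced f failures, then switch to a fresh arm and never return.
    Returns the number of failures in n trials, a natural regret proxy."""
    arm, failures_this_arm, total_failures = new_arm(), 0, 0
    for _ in range(n):
        if arm() == 1:                      # success: keep playing the same arm
            continue
        failures_this_arm += 1              # failure
        total_failures += 1
        if failures_this_arm >= f:          # f failures seen: move on to a new arm
            arm, failures_this_arm = new_arm(), 0
    return total_failures
```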
Berry et al. showed that R_n ∼ n/(log n) for the f-failure strategy for any f ≥ 1, whereas for the √n-run strategy, the (log n)√n-learning strategy and the non-recalling √n-run strategy,

lim sup_{n→∞} R_n/√n ≤ 2.

Bonald and Proutière (2013) proposed a two-target algorithm that gets arbitrarily close to the lower bound in (2.1). The two targets s_1 and s_f are of order √n, with s_f roughly f times s_1, for a fixed f ≥ 1. An arm is discarded if it yields fewer than s_1 successes when it encounters its first failure, or fewer than s_f successes when it encounters its f-th failure. If both targets are met, then the arm is accepted and played for the remaining trials. Bonald and Proutière showed that for the uniform prior, the two-target algorithm satisfies

lim sup_{n→∞} R_n/√n ≤ √2 + o_f(1),   (2.2)

where the o_f(1) term vanishes as f → ∞, so the algorithm gets close to the lower bound of Berry et al. when f is large.

Bonald and Proutière extended their optimality on Bernoulli rewards to non-uniform priors. Specifically they considered g(μ) ∼ αμ^{β−1} as μ → 0, for some α > 0 and β > 0.
They extended the lower bound of Berry et al. to

lim inf_{n→∞} (n^{−β/(β+1)} R_n) ≥ C, where C = (β(β+1)/α)^{1/(β+1)},   (2.3)

and showed that their two-target algorithm, with suitably chosen targets s_1 and s_f, satisfies

lim sup_{f→∞} [lim sup_{n→∞} (n^{−β/(β+1)} R_n)] ≤ C.   (2.4)

Wang, Audibert and Munos (2008) proposed a UCB-F algorithm for rewards taking values in [0,1]. If P_g(μ_k ≤ μ) = O(μ^β) for some β > 0, then under suitable regularity conditions, R_n = O(n^{β/(β+1)} log n). In UCB-F an order of n^{β/(β+1)} arms is chosen, and confidence bounds are computed on these arms to determine which arm to play. UCB-F is different from CBT in that it pre-selects the number of arms, and it also does not have a mechanism to reject weak arms quickly. Carpentier and Valko (2015) also considered rewards taking values in [0,1], but their interest, the identification of a single good arm (simple regret), is different from the aims here and in the papers described above.

In CBT we construct a confidence bound for each arm, and play an arm as long as its confidence bound is under the target mean. Let

b_n → ∞ and c_n → ∞ with b_n + c_n = o(n^δ) for all δ > 0.   (2.5)

In our numerical studies we select b_n = c_n = log log n. For an arm k that has been played t times, its confidence bound is

L_kt = max(X̄_kt/b_n, X̄_kt − c_n σ̂_kt/√t),   (2.6)

where X̄_kt = t^{−1} S_kt, with S_kt = Σ_{u=1}^t X_ku, and σ̂²_kt = t^{−1} Σ_{u=1}^t (X_ku − X̄_kt)².

Let ζ > 0 denote the target mean. We show in Section 3 how ζ should be selected to achieve optimality. It suffices to mention here that it should be small for large n; more specifically it should decrease at a polynomial rate with respect to n. The algorithm is non-recalling: an arm is played until its confidence bound goes above ζ, and it is not played after that.

Confidence bound target (CBT)
For k = 1, 2, . . .: Draw n_k rewards from arm k, where
n_k = inf{t ≥ 1: L_kt > ζ} ∧ (n − Σ_{ℓ=1}^{k−1} n_ℓ).
The total number of arms played is K = min{k: Σ_{ℓ=1}^k n_ℓ = n}.

There are three types of arms that we need to take care of, and this explains the design of L_kt. The first type are arms with means μ_k significantly larger than ζ. We would like to reject these arms quickly. The condition that an arm be rejected when X̄_kt/b_n exceeds ζ is key to achieving this.

The second type are arms with means μ_k larger than ζ, but not by as much as those of the first type. We are unlikely to reject these arms quickly, as it is difficult to determine whether μ_k is larger or smaller than ζ based on a small sample. Rejecting arm k when X̄_kt − c_n σ̂_kt/√t exceeds ζ ensures that arm k is rejected only when it is statistically significant that μ_k is larger than ζ. Though there may be a large number of rewards from these arms, their contributions to the regret are small because their means are small.

The third type of arms are those with means μ_k smaller than ζ. For these arms the best strategy (when ζ is chosen correctly) is to play them for the remaining trials. Selecting b_n → ∞ and c_n → ∞ in (2.6) ensures that the probabilities of rejecting these arms are small.

For the two-target algorithm, the first target s_1 is designed for quick rejection of the first type of arms and the second target s_f for rejection of the second type. The difference is that whereas two-target monitors an arm for rejection only at its first and f-th failures, with f chosen large for optimality, CBT (in the case of Bernoulli rewards) checks for rejection each time a failure occurs.
The frequent monitoring of CBT is a positive feature that results in significantly better numerical performance; see Section 5.
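The following is a minimal sketch of the CBT rule built on the confidence bound (2.6), under the paper's convention that rewards are non-negative losses, so that arms with small means are the good ones. The interface new_arm(), which returns a sampler for a fresh arm, and the choice b_n = c_n = log log n are illustrative assumptions rather than part of the algorithm's specification.

```python
import numpy as np

def cbt(new_arm, n, zeta):
    """Minimal sketch of CBT: play each arm until its confidence bound (2.6)
    exceeds the target mean zeta, then move to a fresh arm (non-recalling).
    Rewards are treated as losses, so small means are desirable."""
    b_n = c_n = np.log(np.log(n))            # slowly growing, as in Section 5
    total_loss, t_total = 0.0, 0
    while t_total < n:
        arm = new_arm()                      # start a fresh arm
        s = ss = 0.0                         # running sum and sum of squares
        t = 0
        while t_total < n:
            x = arm()
            s, ss, t, t_total = s + x, ss + x * x, t + 1, t_total + 1
            total_loss += x
            xbar = s / t
            sigma = np.sqrt(max(ss / t - xbar ** 2, 0.0))
            L = max(xbar / b_n, xbar - c_n * sigma / np.sqrt(t))
            if L > zeta:                     # reject this arm and never recall it
                break
    return total_loss                        # realized analogue of the regret R_n
```

With Bernoulli(μ) losses and a uniform prior, ζ would be set to the optimal target √(2/n) of Section 3, or estimated empirically as in Section 4.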
3 Optimality

We state the regret lower bound in Section 3.1 and show that CBT achieves this bound in Section 3.2.

3.1 The regret lower bound
In Lemma 1 below we motivate the choice of ζ. Let λ = ∫_0^∞ E_μ(X | X > 0) g(μ) dμ be the (finite) mean of the first non-zero reward of a random arm. The value λ represents the cost of experimenting with a new arm. We consider E_μ(X | X > 0) instead of μ because we are able to reject an arm only when there is a non-zero reward. For Bernoulli rewards, λ = 1. Let p(ζ) = P_g(μ ≤ ζ) and v(ζ) = E_g(ζ − μ)⁺.

Consider an idealized algorithm which plays Π_k until a non-zero reward is observed, and μ_k is revealed when that happens. If μ_k > ζ, then Π_k is rejected and a new arm is played next. If μ_k ≤ ζ, then we end the experimentation stage and play Π_k for the remaining trials. Assuming that the experimentation stage uses o(n) trials and ζ is small, the regret of this algorithm is asymptotically

r_n(ζ) = λ/p(ζ) + n E_g(μ | μ ≤ ζ).   (3.1)

The first term in the expansion of r_n(ζ) approximates E(Σ_{k=1}^{K−1} n_k μ_k), whereas the second term approximates E(n_K μ_K).
Lemma 1. Let ζ_n be such that v(ζ_n) = λ/n. We have inf_{ζ>0} r_n(ζ) = r_n(ζ_n) = nζ_n.

Proof. Since E_g(ζ − μ | μ ≤ ζ) = v(ζ)/p(ζ), it follows from (3.1) that

r_n(ζ) = λ/p(ζ) + nζ − n v(ζ)/p(ζ).   (3.2)

It follows from (d/dζ) v(ζ) = p(ζ) and (d/dζ) p(ζ) = g(ζ) that

(d/dζ) r_n(ζ) = g(ζ)[n v(ζ) − λ]/p²(ζ).

Since v is continuous and strictly increasing when it is positive, the root of v(ζ) = λ/n exists, and Lemma 1 follows from solving (d/dζ) r_n(ζ) = 0. ⊓⊔
We assume the following condition on the prior density g.

(A1) For some α > 0 and β > 0, g(μ) ∼ αμ^{β−1} as μ → 0.

Under (A1), p(ζ) = ∫_0^ζ g(μ) dμ ∼ (α/β) ζ^β and v(ζ) = ∫_0^ζ p(μ) dμ ∼ (α/(β(β+1))) ζ^{β+1} as ζ → 0, hence v(ζ_n) ∼ λ/n for

ζ_n ∼ C n^{−1/(β+1)}, where C = (λβ(β+1)/α)^{1/(β+1)}.   (3.3)
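As an illustration of how Lemma 1 and (3.3) translate into a computation when g is known, the following sketch solves v(ζ) = λ/n numerically. The use of scipy's quad and brentq, and the uniform-prior check against √(2/n), are assumptions made for this example.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def optimal_target(g, lam, n, upper=1.0):
    """Solve v(zeta) = lam/n of Lemma 1, where v(zeta) = E_g(zeta - mu)^+.

    g     : prior density of the arm means
    lam   : mean of the first non-zero reward of a random arm
    n     : total number of rewards
    upper : a point at which v(upper) exceeds lam/n, so a root is bracketed
    """
    def v(zeta):
        # v(zeta) = integral over (0, zeta) of (zeta - mu) g(mu) d mu
        return quad(lambda mu: (zeta - mu) * g(mu), 0.0, zeta)[0]
    return brentq(lambda z: v(z) - lam / n, 1e-12, upper)

# Uniform prior on (0,1) with Bernoulli rewards: lambda = 1 and v(zeta) = zeta^2/2,
# so the solver should return roughly sqrt(2/n), in agreement with (3.3).
if __name__ == "__main__":
    n = 10_000
    print(optimal_target(lambda mu: 1.0, lam=1.0, n=n), np.sqrt(2.0 / n))
```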
In Lemma 2 below we state the regret lower bound. We assume there, in addition to (A1), that:

(A2) There exists a_1 > 0 such that P_μ(X > 0) ≥ a_1 min(μ, 1) for all μ.

Without condition (A2) we may play a bad arm a large number of times because its rewards are mostly zeros but are very big when non-zero.

Lemma 2. Assume (A1) and (A2). For any infinite arms bandit algorithm, its regret satisfies

R_n ≥ [1 + o(1)] nζ_n (∼ C n^{β/(β+1)}).   (3.4)
Example 1. Consider X ∼ Bernoulli(μ). Condition (A2) holds with a_1 = 1. If g is uniform on (0,1), then (A1) holds with α = β = 1. Since λ = 1, by (3.3), ζ_n ∼ (2/n)^{1/2}. Lemma 2 says that R_n ≥ [1 + o(1)]√(2n), agreeing with Theorem 3 of Berry et al. (1997).

Bonald and Proutière (2013) showed (3.4) in their Lemma 3 for Bernoulli rewards under (A1), and that the two-target algorithm gets close to the lower bound when f is large. It will be shown in Theorem 1 that the lower bound in (3.4) is achieved by CBT for general rewards.

3.2 Optimality of CBT

We state the optimality of CBT in Theorem 1, after describing the conditions on discrete rewards under (B1) and continuous rewards under (B2) for which the theorem holds. Let M_μ(θ) = E_μ e^{θX}.
(B1) The rewards are non-negative integer-valued. For 0 < δ ≤ 1, there exists θ_δ > 0 such that for all μ > 0 and 0 ≤ θ ≤ θ_δ,

M_μ(θ) ≤ e^{(1+δ)θμ},   (3.5)
M_μ(−θ) ≤ e^{−(1−δ)θμ}.   (3.6)

In addition,

P_μ(X > 0) ≤ a_2 μ for some a_2 > 0,   (3.7)
E_μ X⁴ = O(μ) as μ → 0.   (3.8)

(B2) The rewards are continuous random variables satisfying

sup_{μ>0} P_μ(X ≤ γμ) → 0 as γ → 0.   (3.9)

Moreover (3.8) holds and for 0 < δ ≤ 1, there exists τ_δ > 0 such that for 0 < θμ ≤ τ_δ,

M_μ(θ) ≤ e^{(1+δ)θμ},   (3.10)
M_μ(−θ) ≤ e^{−(1−δ)θμ}.   (3.11)

In addition, for each t ≥ 1 there exists ξ_t > 0 such that

sup_{μ ≤ ξ_t} P_μ(σ̂_t ≤ γμ) → 0 as γ → 0,   (3.12)

where σ̂²_t = t^{−1} Σ_{u=1}^t (X_u − X̄_t)² and X̄_t = t^{−1} Σ_{u=1}^t X_u for i.i.d. X_u ∼ F_μ.
Theorem 1. Assume (A1), (A2) and either (B1) or (B2). For CBT with threshold ζ_n satisfying (3.3) and b_n, c_n satisfying (2.5),

R_n ∼ nζ_n as n → ∞.   (3.13)

Theorem 1 says that CBT is optimal as it attains the lower bound given in Lemma 2. In the examples below we show that the regularity conditions (A2), (B1) and (B2) are reasonable and checkable.
Example 2. If X ∼ Bernoulli(μ) under P_μ, then M_μ(θ) = 1 − μ + μe^θ ≤ exp[μ(e^θ − 1)], and (3.5), (3.6) hold with θ_δ > 0 satisfying

e^{θ_δ} − 1 ≤ θ_δ(1 + δ) and e^{−θ_δ} − 1 ≤ −θ_δ(1 − δ).   (3.14)

In addition (3.7) holds with a_2 = 1, and (3.8) holds because E_μ X⁴ = μ. Condition (A2) holds with a_1 = 1.
Example 3. Let F_μ be a distribution with support on {0, 1, . . ., I} for some positive integer I, and let p_i = P_μ(X = i), so that μ = Σ_{i=0}^I i p_i. We check that P_μ(X > 0) ≥ I^{−1}μ and therefore (A2) holds with a_1 = I^{−1}. Let θ_δ > 0 be such that

e^{iθ} − 1 ≤ iθ(1 + δ) and e^{−iθ} − 1 ≤ −iθ(1 − δ) for 0 ≤ iθ ≤ Iθ_δ.   (3.15)

By (3.15), for 0 ≤ θ ≤ θ_δ,

M_μ(θ) = Σ_{i=0}^I p_i e^{iθ} ≤ 1 + (1 + δ)μθ,
M_μ(−θ) = Σ_{i=0}^I p_i e^{−iθ} ≤ 1 − (1 − δ)μθ,

and (3.5), (3.6) follow from 1 + x ≤ e^x. Moreover (3.7) holds with a_2 = 1 and (3.8) holds because E_μ X⁴ = Σ_{i=0}^I p_i i⁴ ≤ I³μ.
Example 4. If X ∼ Poisson(μ) under P_μ, then M_μ(θ) = exp[μ(e^θ − 1)], and (3.5), (3.6) again follow from (3.14). Since P_μ(X > 0) = 1 − e^{−μ}, (A2) holds with a_1 = 1 − e^{−1}, and (3.7) holds with a_2 = 1. In addition (3.8) holds because

E_μ X⁴ = Σ_{k=1}^∞ k⁴ μ^k e^{−μ}/k! = μe^{−μ} + e^{−μ} O(Σ_{k=2}^∞ μ^k) = O(μ) as μ → 0.
Example 5. Let Z be a continuous non-negative random variable with mean 1, and with Ee^{τ_0 Z} < ∞ for some τ_0 > 0. Let X be distributed as μZ under P_μ. Condition (A2) holds with a_1 = 1. We conclude (3.9) from

sup_{μ>0} P_μ(X ≤ γμ) = P(Z ≤ γ) → 0 as γ → 0.

Let 0 < δ ≤ 1. Since lim_{τ→0} τ^{−1} log Ee^{τZ} = EZ = 1, there exists τ_δ > 0 such that for 0 < τ ≤ τ_δ,

Ee^{τZ} ≤ e^{(1+δ)τ} and Ee^{−τZ} ≤ e^{−(1−δ)τ}.   (3.16)

Since M_μ(θ) = E_μ e^{θX} = Ee^{θμZ} and M_μ(−θ) = Ee^{−θμZ}, we conclude (3.10) and (3.11) from (3.16) with τ = θμ. We conclude (3.8) from E_μ X⁴ = μ⁴ EZ⁴, and (3.12), for arbitrary ξ_t > 0, from

P_μ(σ̂_t ≤ γμ) = P(σ̂_{tZ} ≤ γ) → 0 as γ → 0,

where σ̂²_{tZ} = t^{−1} Σ_{u=1}^t (Z_u − Z̄_t)² for Z_1, Z_2, . . . i.i.d. copies of Z.

4 Empirical CBT for unknown priors
The optimal implementation of CBT, in particular the computation of the best target mean ζ_n, assumes knowledge of how g(μ) behaves for μ near 0. For g unknown we rely on Theorem 1 to motivate an empirical implementation of CBT.

What is striking about (3.13) is that it relates the optimal threshold ζ_n to R_n/n, and moreover this relation does not depend on either the prior g or the reward distributions. We suggest therefore, in an empirical implementation of CBT, to apply the thresholds

ζ(m) := S'_m/n,   (4.1)

where S'_m is the sum of the first m rewards over all arms.

In the beginning, with m small, ζ(m) underestimates the optimal threshold, but this only encourages exploration, which is the right strategy at the beginning. As m increases, ζ(m) gets closer to the optimal threshold, and empirical CBT behaves more like CBT in deciding whether to play an arm further. Empirical CBT is recalling, unlike CBT, as it decides from among all arms which to play further.

Empirical CBT
Notation: When there are m total rewards, let n_k(m) denote the number of rewards from arm k and let K_m denote the number of arms played.
For m = 0, play arm 1. Hence K_1 = 1, n_1(1) = 1 and n_k(1) = 0 for k > 1.
For m = 1, . . ., n − 1:
1. If min_{1≤k≤K_m} L_{k,n_k(m)} ≤ ζ(m), then play the arm k minimizing L_{k,n_k(m)}.
2. If min_{1≤k≤K_m} L_{k,n_k(m)} > ζ(m), then play a new arm K_m + 1.
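Below is a minimal sketch of this rule. It reuses the confidence bound L_kt of (2.6); the arm interface new_arm() and the bookkeeping are illustrative assumptions.

```python
import numpy as np

def empirical_cbt(new_arm, n):
    """Minimal sketch of empirical CBT with the data-driven target (4.1).
    The algorithm is recalling: at each step it plays the arm with the smallest
    confidence bound if that bound is below S'_m / n, and opens a new arm otherwise."""
    b_n = c_n = np.log(np.log(n))
    arms, stats = [new_arm()], [[0.0, 0.0, 0]]   # per arm: sum, sum of squares, count
    total = 0.0                                  # S'_m, the sum of all rewards so far

    def bound(s, ss, t):
        if t == 0:
            return -np.inf                       # an arm not yet played is always eligible
        xbar = s / t
        sigma = np.sqrt(max(ss / t - xbar ** 2, 0.0))
        return max(xbar / b_n, xbar - c_n * sigma / np.sqrt(t))

    for m in range(n):
        zeta_m = total / n                       # empirical target (4.1)
        bounds = [bound(*st) for st in stats]
        k = int(np.argmin(bounds))
        if bounds[k] > zeta_m:                   # all played arms look weak: open a new arm
            arms.append(new_arm())
            stats.append([0.0, 0.0, 0])
            k = len(arms) - 1
        x = arms[k]()
        stats[k][0] += x
        stats[k][1] += x * x
        stats[k][2] += 1
        total += x
    return total                                 # realized total loss over the n rewards
```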
Empirical CBT, unlike CBT, does not achieve the smallest regret. This is because when a good arm (that is, an arm with mean below the optimal target) appears early, we are not sure whether this is due to good fortune or to the prior being disposed towards arms with small means, so we experiment with more arms before we are certain and play the good arm for the remaining trials. Similarly, when no good arm appears after some time, we may conclude that the prior is disposed towards arms with large means, and play an arm with mean above the optimal target for the remaining trials, even though it is advantageous to experiment further.

There is a price for not knowing g. Our calculations in Appendix A indicate that under empirical CBT, the regret

R_n ∼ I_β nζ_n,   (4.2)

where I_β = (β+1)^{−1/(β+1)} (2 − (β+1)^{−2}) Γ(2 − β/(β+1)) and Γ(u) = ∫_0^∞ x^{u−1} e^{−x} dx. The calculations are based on an idealized version of empirical CBT.

The constant I_β increases from 1 (at β = 0) to 2 (as β → ∞), so the worst-case inflation is not more than 2. The increase is quite slow, so for reasonable values of β it is closer to 1 than 2; for example I_1 = 1.10, I_2 = 1.17, I_3 = 1.24 and I_10 = 1.53. The predictions from (4.2), that the inflation of the regret increases with β and that it is not more than 25% for β = 1, 2 and 3, are validated by our simulations in Section 5.

5 Numerical studies

We study here arms with rewards that have Bernoulli (Example 5) as well as unknown distributions (Example 6). In our simulations 10,000 datasets are generated for each entry in Tables 1–4, and standard errors are placed after the ± sign. In both CBT and empirical CBT, we select b_n = c_n = log log n.
Example 5. We consider Bernoulli rewards with the following priors:
1. g(μ) = 1, which satisfies (A1) with α = β = 1,
2. g(μ) = (π/2) sin(πμ), which satisfies (A1) with α = π²/2 and β = 2,
3. g(μ) = 1 − cos(πμ), which satisfies (A1) with α = π²/2 and β = 3.

For all three priors, the two-target algorithm does better with f = 3 for smaller n, and with f = 6 or 9 at larger n. CBT is the best performer uniformly over n, and empirical CBT is also competitive against two-target with f fixed.

Even though optimal CBT performs better than empirical CBT, optimal CBT assumes knowledge of the prior to select the threshold ζ, which differs with the priors. On the other hand the same algorithm is used for all three priors when applying empirical CBT, and in fact the same algorithm is also used on the URL dataset in Example 6, with no knowledge of the reward distributions. Hence though empirical CBT is numerically comparable to two-target and weaker than CBT, it is more desirable as we do not need to know the prior to use it.

[Table 1 about here: regret at n = 100, 1000, 10,000 and 100,000 for Bernoulli rewards with the uniform prior g(μ) = 1 (β = 1), comparing CBT with ζ = √(2/n), empirical CBT, the √n-run, non-recalling √n-run and (log n)√n-learning strategies, two-target with f = 3, 6, 9, and UCB-F with K = ⌊√(n/2)⌋ arms.]

[Table 2 about here: regret at n = 100, 1000, 10,000 and 100,000 for the prior g(μ) = (π/2) sin(πμ) (β = 2), comparing CBT with ζ = Cn^{−1/(β+1)}, empirical CBT, two-target with f = 3, 6, 9, UCB-F and the non-recalling n^{1/(β+1)}-run strategy.]

For the uniform prior, the best performing among the algorithms in Berry et al. (1997) is the non-recalling √n-run algorithm.
[Table 3 about here: regret at n = 100, 1000, 10,000 and 100,000 for the prior g(μ) = 1 − cos(πμ) (β = 3), with the same algorithms as in Table 2.]

For UCB-F [cf. Wang et al. (2008)], the selection of K = ⌊(β/α)^{1/(β+1)} (n/(β+1))^{β/(β+1)}⌋ arms (∼ 1/p(ζ_n)) and "exploration sequence" E_m = √(log m) works well.

[Table 4 about here: the average regret (R_n/n) for URL rewards at n = 130 and n = 1300, comparing POKER, empirical CBT (212 at n = 130), ε-greedy with ε = 0.05 (733 and 431), ε-first with ε = 0.15 (725 and 411), and ε-decreasing with ε = 1.0 (738 and 411).]

Example 6. We consider here the URL dataset studied in Vermorel and Mohri (2005), where a POKER algorithm for dealing with a large number of arms is proposed. We reproduce part of their Table 1 in our Table 4, together with new simulations on empirical CBT. The dataset consists of the retrieval latency of 760 university home-pages, in milliseconds, with a sample size of more than 1300 for each home-page. The dataset can be downloaded from "sourceforge.net/projects/bandit".

In our simulations the rewards for each home-page are randomly permuted in each run; a sketch of this protocol is given at the end of this section. We see from Table 4 that POKER does better than empirical CBT at n = 130, whereas empirical CBT does better at n = 1300. The other algorithms are uniformly worse than both POKER and empirical CBT.

The algorithm ε-first refers to exploring with the first εn rewards, with random selection of the arms to be played. This is followed by pure exploitation for the remaining (1 − ε)n rewards, on the "best" arm (with the smallest sample mean). The algorithm ε-greedy refers to selecting, in each play, a random arm with probability ε, and the best arm with the remaining 1 − ε probability. The algorithm ε-decreasing is like ε-greedy except that in the m-th play, we select a random arm with probability min(1, ε/m), and the best arm otherwise. Both ε-greedy and ε-decreasing are disadvantaged by not making use of information on the total number of rewards. Vermorel and Mohri also ran simulations on more complicated strategies like LeastTaken, SoftMax, Exp3, GaussMatch and IntEstim, with average regret ranging from 287–447 for n = 130 and 189–599 for n = 1300.
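The following hedged sketch illustrates the simulation protocol described in Example 6: each run permutes every home-page's recorded latencies and presents fresh arms in a random order. The variable latency (a list of per-page latency arrays), the random presentation order and the reuse of the empirical_cbt sketch from Section 4 are assumptions for illustration, not details taken from the original study.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_url_run(latency, n):
    """One simulation run on the URL data: permute the rewards of every home-page,
    offer fresh arms in random order, and report the average loss over n retrievals
    (the quantity R_n/n reported in Table 4)."""
    permuted = [rng.permutation(np.asarray(x, dtype=float)) for x in latency]
    order = iter(rng.permutation(len(permuted)))
    used = {}

    def new_arm():
        k = next(order)                   # a fresh, not previously offered home-page
        used[k] = 0
        def draw(k=k):
            used[k] += 1
            return permuted[k][used[k] - 1]
        return draw

    return empirical_cbt(new_arm, n) / n  # reuses the empirical CBT sketch of Section 4
```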
6 Proof of Lemma 2

Let the infinite arms bandit problem be labelled as Problem A, and let R_A be the smallest regret for this problem. We shall prove Lemma 2 by considering two related problems, Problems B and C.

Proof of Lemma
2. Let Problem B be like Problem A except that whenwe observe the first non-zero reward from arm k , its mean µ k is revealed. Let R B be the smallest regret for Problem B. Since in Problem B we have accessto additional arm-mean information, R A ≥ R B .In Problem B the best solution involves an initial experimentation phasein which we play K arms, each until its first non-zero reward. This is followedby an exploitation phase in which we play the best arm for the remaining n − M trials, where M is the number of rewards in the experimentationphase. It is always advantageous to experiment first because no informationon arm mean is gained during exploitation. For continuous rewards M = K .Let µ b (= µ best ) = min ≤ k ≤ K µ k .In Problem C like in Problem B, the mean µ k of arm k is revealed uponthe observation of its first non-zero reward. The difference is that instead ofplaying the best arm for an additional n − M trials, we play it for n additionaltrials, for a total of n + M trials. Let R C be the smallest regret of ProblemC, the expected value of P Kk =1 n k µ k , with P Kk =1 n k = n + M . We can extendthe optimal solution of Problem B to a (possibly non-optimal) solution ofProblem C by simply playing the best arm with mean µ b a further M times.Hence [ R A + E ( M µ b ) ≥ ] R B + E ( M µ b ) ≥ R C . (6.1)15emma 2 follows from Lemmas 3 and 4 below. We prove the more technicalLemma 4 in Appendix B. ⊓⊔ Lemma 3. R C = nζ n for ζ n satisfying v ( ζ n ) = λn . Proof . Let arm j be the best arm after k arms have been played inthe experimentation phase, that is µ j = min ≤ i ≤ k µ i . Let φ ∗ be the strategyof trying out a new arm if and only if nv ( µ j ) > λ , or equivalently µ j > ζ n .Since we need on the average p ( ζ n ) arms before achieving µ j ≤ ζ n , the regretof φ ∗ is R ∗ = λp ( ζ n ) + nE g ( µ | µ ≤ ζ n ) = r n ( ζ n ) = nζ n , (6.2)see (3.1) and Lemma 1 for the second and third equalities in (6.2).Hence R C ≤ nζ n and to show Lemma 3, it remains to show that for anystrategy φ , its regret R φ is not less than R ∗ . Let K ∗ be the number of arms of φ ∗ and K the number of arms of φ . Let A = { K < K ∗ } (= { min ≤ k ≤ K µ k >ζ n } ) and A = { K > K ∗ } (= { min ≤ k ≤ K ∗ µ k ≤ ζ n , K > K ∗ } ). We can express R φ − R ∗ = X ℓ =1 n λE [( K − K ∗ ) A ℓ ]+ nE h(cid:16) min ≤ k ≤ K µ k − min ≤ k ≤ K ∗ µ k (cid:17) A ℓ io . (6.3)Under A , min ≤ k ≤ K µ k > ζ n and therefore by (6.2), λE [( K − K ∗ ) A ] + nE h(cid:16) min ≤ k ≤ K µ k − min ≤ k ≤ K ∗ µ k (cid:17) A i (6.4)= − λP ( A ) p ( ζ n ) + n n E h(cid:16) min ≤ k ≤ K µ k (cid:17) A i − P ( A ) E g ( µ | µ ≤ ζ n ) o ≥ P ( A ) {− λp ( ζ n ) + n [ ζ n − E g ( µ | µ ≤ ζ n )] } = 0 . The identity E [( K ∗ − K ) A ] = P ( A ) p ( ζ n ) is due to min ≤ k ≤ K µ k > ζ n when thereare K arms, and so an additional p ( ζ n ) arms on average is required understrategy φ ∗ , to get an arm with mean not more than ζ n .In view that ( K − K ∗ ) A = P ∞ j =0 { K>K ∗ + j } and (cid:16) min ≤ k ≤ K µ k − min ≤ k ≤ K ∗ µ k (cid:17) A = ∞ X j =0 (cid:16) min ≤ k ≤ K ∗ + j +1 µ k − min ≤ k ≤ K ∗ + j µ k (cid:17) { K>K ∗ + j } ,
16e can check that λE [( K − K ∗ ) A ] + nE h(cid:16) min ≤ k ≤ K µ k − min ≤ k ≤ K ∗ µ k (cid:17) A i (6.5)= ∞ X j =0 E nh λ + n (cid:16) min ≤ k ≤ K ∗ + j +1 µ k − min ≤ k ≤ K ∗ + j µ k (cid:17)i { K>K ∗ + j } o = ∞ X j =0 E nh λ − nv (cid:16) min ≤ k ≤ K ∗ + j µ k (cid:17)i { K>K ∗ + j } o ≥ . The inequality in (6.5) follows from v ( min ≤ k ≤ K ∗ + j µ k ) ≤ v ( min ≤ k ≤ K ∗ µ k ) ≤ v ( ζ n ) = λn , as v is monotone increasing. Lemma 3 follows from (6.2)–(6.5). ⊓⊔ Lemma 4. E ( M µ b ) = o ( n ββ +1 ) . Bonald and Prouti`ere (2013) also referred to Problem B in their lowerbound for Bernoulli rewards. What is different in our proof of Lemma 2 isa further simplification by considering Problem C, in which the number ofrewards in the exploitation phase is fixed to be n . We show in Lemma 3 thatunder Problem C the optimal regret has a simple expression nζ n , and reducethe proof of Lemma 2 to showing Lemma 4. We preface the proof of Theorem 1 with Lemmas 5–8. The lemmas areproved for discrete rewards in Section 7.1 and continuous rewards in Section7.2. Consider X , X , . . . i.i.d. F µ . Let S t = P tu =1 X u , ¯ X t = S t t and b σ t = t − P tu =1 ( X u − ¯ X t ) . Let T b = inf { t : S t > b n tζ n } , (7.1) T c = inf { t : S t > tζ n + c n b σ t √ t } , (7.2)with b n , c n satisfying (2.5) and ζ n ∼ Cn − β +1 for C = ( λβ ( β +1) α ) β +1 . Let d n = n − ω for some 0 < ω < β +1 . (7.3)17 emma 5. As n → ∞ , sup µ ≥ d n [min( µ, E µ T b ] = O (1) , (7.4) E g ( T b µ { µ ≥ d n } ) ≤ λ + o (1) . (7.5) Lemma 6.
Let ǫ > . As n → ∞ , sup (1+ ǫ ) ζ n ≤ µ ≤ d n [ µE µ ( T c ∧ n )] = O ( c n + log n ) , (7.6) E g [( T c ∧ n ) µ { (1+ ǫ ) ζ n ≤ µ ≤ d n } ] → . (7.7) Lemma 7.
Let < ǫ < . As n → ∞ , sup µ ≤ (1 − ǫ ) ζ n P µ ( T b < ∞ ) → . Lemma 8.
Let < ǫ < . As n → ∞ , sup µ ≤ (1 − ǫ ) ζ n P µ ( T c < ∞ ) → . The number of times an arm is played has distribution bounded above by T := T b ∧ T c . Lemmas 7 and 8 say that an arm with mean less than (1 − ǫ ) ζ n is unlikely to be rejected, whereas (7.5) and (7.7) say that the regret dueto sampling from an arm with mean more than (1 + ǫ ) ζ n is asymptoticallybounded by λ . The remaining (7.4) and (7.6) are technical relations used inthe proof of Theorem 1. Proof of Theorem
1. The number of times arm k is played is n k , andit is distributed as T b ∧ T c ∧ ( n − P k − ℓ =1 n ℓ ). Let 0 < ǫ <
1. We can express R n − nζ n = z + z + z = z + z − | z | , (7.8)where z i = E [ P k : µ k ∈ D i n k ( µ k − ζ n )] for D = [(1 + ǫ ) ζ n , ∞ ) , D = ((1 − ǫ ) ζ n , (1 + ǫ ) ζ n ) , D = (0 , (1 − ǫ ) ζ n ] . It is easy to see that z ≤ ǫnζ n . We shall show that z ≤ λ + o (1)(1 − ǫ ) β p ( ζ n ) , (7.9) | z | ≥ [( − ǫ ǫ ) β + o (1)][ nǫζ n + (1 − ǫ ) λp ( ζ n ) ] . (7.10)18e conclude Theorem 1 from (7.8)–(7.10) with ǫ → ⊓⊔ Proof of (7.9). Since T = T b ∧ T c , by Lemmas 7 and 8, q n := sup µ ≤ (1 − ǫ ) ζ n P µ ( T < ∞ ) (7.11) ≤ sup µ ≤ (1 − ǫ ) ζ n [ P µ ( T b < ∞ ) + P µ ( T c < ∞ )] → . That is an arm with mean less than (1 − ǫ ) ζ n is rejected with negligibleprobability for n large. Since the total number of played arms K is boundedabove by a geometric random variable with mean P g ( T = ∞ ) , by (7.11) and p ( ζ ) ∼ αβ ζ β as ζ → EK ≤ P g ( T = ∞ ) ≤ − q n ) p ((1 − ǫ ) ζ n ) ∼ − ǫ ) β p ( ζ n ) . (7.12)By (7.5) and (7.7), E g ( n µ { µ ≥ (1+ ǫ ) ζ n } )= E g ( n µ { (1+ ǫ ) ζ n ≤ µ ≤ d n } ) + E g ( n µ { µ ≥ d n } ) ≤ E g [( T c ∧ n ) µ { (1+ ǫ ) ζ n ≤ µ ≤ d n } ] + E g ( T b µ { µ ≥ d n } ) ≤ λ + o (1) , and (7.9) follows from (7.12) and z ≤ E g ( n µ { µ ≥ (1+ ǫ ) ζ n } ) EK . ⊓⊔ Proof of (7.10). Let ℓ be the first arm with mean not more than(1 − ǫ ) ζ n . We have | z | = E h X k : µ k ∈ D n k ( ζ n − µ k ) i (7.13) ≥ ( En ℓ ) { ζ n − E g [ µ | µ ≤ (1 − ǫ ) ζ n ] } . Since v ( ζ n ) ∼ λn and p ( ζ ) ∼ αβ ζ β , v ( ζ ) ∼ αβ ( β +1) ζ β +1 as ζ → ζ n − E g [ µ | µ ≤ (1 − ǫ ) ζ n ]= ζ n − { (1 − ǫ ) ζ n − E g [(1 − ǫ ) ζ n − µ | µ ≤ (1 − ǫ ) ζ n ] } = ζ n − [(1 − ǫ ) ζ n − v ((1 − ǫ ) ζ n ) p ((1 − ǫ ) ζ n ) ] ∼ ǫζ n + (1 − ǫ ) v ( ζ n ) p ( ζ n ) ∼ ǫζ n + (1 − ǫ ) λnp ( ζ n ) , and (7.10) thus follows from (7.13) and En ℓ ≥ [( − ǫ ǫ ) β + o (1)] n. (7.14)19et j be the first arm with mean not more than (1 + ǫ ) ζ n and M = P j − i =1 n i . We have En ℓ ≥ (1 − q n ) E ( n − M ) P ( ℓ = j ) . Since q n → P ( ℓ = j ) → ( − ǫ ǫ ) β , to show (7.14) it suffices to show that EM = o ( n ).Indeed by (7.4), (7.6) and E µ n ≤ E µ ( T ∧ n ),sup µ ≥ (1+ ǫ ) ζ n [min( µ, E µ n ] ≤ max h sup (1+ ǫ ) ζ n ≤ µ ≤ d n µE µ ( T c ∧ n ) , sup µ ≥ d n min( µ, E µ T b i = O ( c n + log n ) . Hence in view that p ((1+ ǫ ) ζ n ) = O ( n ββ +1 ) and P g ( µ > (1 + ǫ ) ζ n ) → n → ∞ , EM ≤ p ((1+ ǫ ) ζ n ) E g ( n | µ > (1 + ǫ ) ζ n )= O ( n ββ +1 ) E g [ c n +log n min( µ , (cid:12)(cid:12) µ > (1 + ǫ ) ζ n ]= O ( n ββ +1 ( c n + log n )) Z ∞ (1+ ǫ ) ζ n g ( µ )min( µ, dµ = O ( n ββ +1 ( c n + log n )) max( n − ββ +1 , log n ) = o ( n ) . The first relation in the line above follows from Z ∞ (1+ ǫ ) ζ n g ( µ )min( µ, dµ = O (1) if β > ,O (log n ) if β = 1 ,O ( n − ββ +1 ) if β < . ⊓⊔ In the case of discrete rewards, one difficulty is that for µ k small, there arepotentially multiple plays on arm k before a non-zero reward is observed.Condition (A2) is helpful in ensuring that the mean of this non-zero rewardis not too large. Proof of Lemma
5. Recall that T b = inf { t : S t > b n tζ n } , d n = n − ω for some 0 < ω < β +1 . We shall show thatsup µ ≥ d n [min( µ, E µ T b ] = O (1) , (7.15) E g ( T b µ { µ ≥ d n } ) ≤ λ + o (1) . (7.16)Let θ = 2 ω log n . Since X u is integer-valued, it follows from Markov’sinequality that P µ ( S t ≤ b n tζ n ) ≤ [ e θb n ζ n M µ ( − θ )] t ≤ { e θb n ζ n [ P µ ( X = 0) + e − θ ] } t . (7.17)By P µ ( X > ≥ a d n for µ ≥ d n [see (A2)], θb n ζ n = o ( d n ) [because θ and b n are both sub-polynomial in n and ζ n = O ( n − β +1 )] and (7.17), uniformlyover µ ≥ d n , E µ T b = 1 + ∞ X t =1 P µ ( T b > t ) (7.18) ≤ ∞ X t =1 P µ ( S t ≤ b n tζ n ) ≤ { − e θb n ζ n [ P µ ( X = 0) + e − θ ] } − = { − [1 + o ( d n )][ P µ ( X = 0) + d n ] } − = [ P µ ( X >
0) + o ( d n )] − ∼ [ P µ ( X > − . The term inside {·} in (7.17) is not more than [1 + o ( d n )](1 − a d n + d n ) < n large and this justifies the second inequality in (7.18). We conclude(7.15) from (7.18) and (A2). By (7.18), E g [ T b µ { µ ≥ d n } ] = Z ∞ d n E µ ( T b ) µg ( µ ) dµ ≤ [1 + o (1)] Z ∞ d n E µ ( X ) P µ ( X> g ( µ ) dµ = [1 + o (1)] Z ∞ d n E µ ( X | X > g ( µ ) dµ → λ, hence (7.16) holds. ⊓⊔ Proof of Lemma
6. Recall that T c = inf { t : S t > tζ n + c n b σ t √ t } andlet ǫ >
0. We want to show thatsup (1+ ǫ ) ζ n ≤ µ ≤ d n µE µ ( T c ∧ n ) = O ( c n + log n ) , (7.19) E g [( T c ∧ n ) µ { (1+ ǫ ) ζ n ≤ µ ≤ d n } ] → . (7.20)21e first show that there exists κ > n → ∞ ,sup µ ≤ d n h µ n X t =1 P µ ( b σ t ≥ κµ ) i = O (log n ) . (7.21)Since X is non-negative integer-valued, X ≤ X . Indeed by (3.8), thereexists κ > ρ µ := E µ X ≤ κµ for µ ≤ d n and n large, therefore by(3.8) again and Chebyshev’s inequality, P µ ( b σ t ≥ κµ ) ≤ P µ (cid:16) t X u =1 X u ≥ tκµ (cid:17) ≤ P µ (cid:16) t X u =1 ( X u − ρ µ ) ≥ tκµ (cid:17) ≤ t Var µ ( X )( tκµ/ = O (( tµ ) − ) , and (7.21) holds.By (7.21), uniformly over (1 + ǫ ) ζ n ≤ µ ≤ d n , E µ ( T c ∧ n ) = 1 + n − X t =1 P µ ( T c > t ) (7.22) ≤ n − X t =1 P µ ( S t ≤ tζ n + c n b σ t √ t ) ≤ n − X t =1 P µ ( S t ≤ tζ n + c n √ κµt ) + O ( log nµ ) . Let 0 < δ < to be further specified. Uniformly over t ≥ c n µ − , µt/ ( c n √ κµt ) → ∞ and therefore by (3.6), µ ≥ (1 + ǫ ) ζ n and Markov’sinequality, for n large, P µ ( S t ≤ tζ n + c n √ κµt ) ≤ P µ ( S t ≤ t ( ζ n + δµ )) (7.23) ≤ e θ δ t ( ζ n + δµ ) M tµ ( − θ δ ) ≤ e tθ δ [ ζ n − (1 − δ ) µ ] ≤ e − ηtθ δ µ , where η = 1 − δ − ǫ > δ chosen small). Since 1 − e − ηθ δ µ ∼ ηθ δ µ as µ → c n µ − + X t ≥ c n µ − e − ηtθ δ µ = O ( c n µ − ) , (7.24)22nd substituting (7.23) into (7.22) gives us (7.19). By (7.19), E g [( T c ∧ n ) µ { (1+ ǫ ) ζ n ≤ µ ≤ d n } ] = P g ((1 + ǫ ) ζ n ≤ µ ≤ d n ) O ( c n + log n )= O ( d βn ( c n + log n )) , and (7.20) holds since c n is sub-polynomial in n . ⊓⊔ Proof of Lemma
7. We want to show that P µ ( S t > tb n ζ n for some t ≥ → µ ≤ (1 − ǫ ) ζ n .By (3.7) and Bonferroni’s inequality, P µ ( S t > tb n ζ n for some t ≤ √ b n ζ n ) (7.26) ≤ P µ ( X t > t ≤ √ b n ζ n ) ≤ a µ √ b n ζ n → . By (3.5) and a change-of-measure argument, for n large, P µ ( S t > tb n ζ n for some t > √ b n ζ n ) (7.27) ≤ sup t> √ bnζn [ e − θ b n ζ n M µ ( θ )] t ≤ e − θ ( b n ζ n − µ ) / ( ζ n √ b n ) → . To see the first inequality of (7.27), let f µ be the density of X with respectto some σ -finite measure, and let E θ µ ( P θ µ ) denote expectation (probability)with respect to density f θ µ ( x ) := [ M µ ( θ )] − e θ x f µ ( x ) . Let T = inf { t > √ b n ζ n : S t > tb n ζ n } . It follows from a change of measurethat P µ ( T = t ) = M tµ ( θ ) E θ µ ( e − θ S t { T = t } ) (7.28) ≤ [ e − θ b n ζ n M µ ( θ )] t P θ µ ( T = t ) , and the first inequality of (7.27) follows from summing (7.28) over t > √ b n ζ n . ⊓⊔ Proof of Lemma
8. We want to show that P µ ( S t > tζ n + c n b σ t √ t for some t ≥ → µ ≤ (1 − ǫ ) ζ n .By (3.7) and Bonferroni’s inequality, P µ ( S t > tζ n + c n b σ t √ t for some t ≤ c n µ ) (7.30) ≤ P µ ( X t > t ≤ c n µ ) ≤ a c n → . Moreover P µ ( S t > tζ n + c n b σ t √ t for some t > c n µ ) ≤ (I) + (II) , (7.31)where (I) = P µ ( S t > tζ n + c n ( µt/ for some t > c n µ ) , (II) = P µ ( b σ t ≤ µ and S t ≥ tζ n for some t > c n µ ) . By (7.30) and (7.31), to show (7.29), it suffices to show that (I) → → < δ ≤ δ < (1 − ǫ ) − . Hence µ ≤ (1 − ǫ ) ζ n implies ζ n ≥ (1 + δ ) µ . It follows from (3.5) and change-of-measure argument [see(7.27) and (7.28)] that(I) ≤ sup t> cnµ [ e − θ δ [ tζ n + c n ( µt/ ] M tµ ( θ δ )] ≤ exp {− θ δ [ ζ n − (1 + δ ) µ ] / ( c n µ ) − θ δ ( c n / }≤ exp {− θ δ ( c n / } → . Since X u ≥ X u , the inequality S t ≥ tζ n ( ≥ tµ ) implies P tu =1 X u ≥ tµ , andthis, together with b σ t ≤ µ implies that ¯ X t ≥ µ . Hence by (3.5) and change-of-measure argument, for n large,(II) ≤ P µ ( ¯ X t ≥ q µ for some t > c n µ ) ≤ sup t> cnµ [ e − θ √ µ/ M µ ( θ )] t ≤ exp {− θ [ q µ − µ ] / ( c n µ ) }≤ exp (cid:8) − θ (cid:2) c n √ − ǫ ) ζ n − c n (cid:3)(cid:9) → . ⊓⊔ In the case of continuous rewards, the proofs are simpler due to rewards beingnon-zero, in particular λ = E g µ . 24 roof of Lemma
5. To show (7.4) and (7.5), it suffices to show thatsup µ ≥ d n E µ T b ≤ o (1) . (7.32)Let θ > P µ ( S t ≤ b n tζ n ) ≤ [ e θb n ζ n M µ ( − θ )] t . Moreover, for any γ > M µ ( − θ ) ≤ P µ ( X ≤ γµ ) + e − γθµ , hence E µ T b ≤ ∞ X t =1 P µ ( S t ≤ b n tζ n ) (7.33) ≤ { − e θb n ζ n [ P µ ( X ≤ γµ ) + e − γθµ ] } − . Let γ = n and θ = n η for some ω < η < β +1 . By (2.5), (3.9) and (7.3), for µ ≥ d n , e θb n ζ n → , e − γθµ → , P µ ( X ≤ γµ ) → , and (7.32) follows from (7.33). ⊓⊔ Proof of Lemma
6. By (3.8), for µ small, ρ µ := E µ X = E µ ( X { X< } ) + E µ ( X { X ≥ } ) ≤ E µ X + E µ X = O ( µ ) . Hence to show (7.6) and (7.7), we proceed as in the proof of Lemma 6 fordiscrete rewards, applying (3.11) in place of (3.6), with any fixed θ > θ δ in (7.23) and (7.24). ⊓⊔ Proof of Lemma
7. It follows from (3.10) with θ = τ µ and a change-of-measure argument [see (7.27) and (7.28)] that for n large, P µ ( S t > tb n ζ n for some t ≥ ≤ sup t ≥ [ e − θb n ζ n M µ ( θ )] t ≤ e − θ ( b n ζ n − µ ) → . ⊓⊔ roof of Lemma
8. Let η > δ > δ )(1 − ǫ ) <
1. It follows from (3.10) with θ = τ δ µ and a change-of-measure argumentthat for u large, P µ ( S t ≥ tζ n + c n b σ t √ t for some t > u ) (7.34) ≤ P µ ( S t ≥ tζ n for some t > u ) ≤ sup t>u [ e − θζ n M µ ( θ )] t ≤ e − uθ [ ζ n − (1+ δ ) µ ] ≤ e − uτ δ [(1 − ǫ ) − − (1+ δ )] ≤ η. By (3.12), we can select γ > n large (so that µ ≤ (1 − ǫ ) ζ n ≤ min ≤ t ≤ u ξ t ), u X t =1 P µ ( b σ t ≤ γµ ) ≤ η. (7.35)Let θ = τ µ . By (3.10), (7.35) and Bonferroni’s inequality, P µ ( S t > tζ + c n b σ t √ t for some t ≤ u ) (7.36) ≤ P µ ( S t ≥ c n b σ t √ t for some t ≤ u ) ≤ η + u X t =1 P µ ( S t ≥ c n µ √ γt ) ≤ η + u X t =1 e − θc n µ √ γt M tµ ( θ ) ≤ η + u X t =1 e − τ ( c n √ γt − t ) → η. Lemma 8 follows from (7.34) and (7.36) since η can be chosen arbitrarilysmall. ⊓⊔ A Derivation of (4.2)
The idealized algorithm in the beginning of Section 3.1 captures the essenceof how CBT behaves. The mean µ k of arm k is revealed when its first non-zero reward appears. If µ k > ζ [with optimality when ζ = ζ n , see (3.3)], thenwe stop sampling from arm k and sample next arm k + 1. If µ k ≤ ζ , then weexploit arm k a further n times before stopping the algorithm.For empirical CBT, when there are m rewards, we apply threshold ζ ( m ) = S ′ m n where S ′ m is the sum of the m rewards. In an idealized version of CBT,26e also reveal the mean µ k of arm when its first non-zero reward appears. Tocapture the essence of how threshold ζ ( m ) increases with number of rewards m , we assign a fixed cost of λ to each arm that we experiment with. That iswe replace ζ ( m ) by b ζ k = kλn , where k is the number of arms played. Whenmin ≤ i ≤ k µ i ≤ b ζ k , we stop experimenting and play the best arm a further n times. More specifically:Idealized empirical CBT1. For k = 1 , , . . . : Draw n k rewards from arm k , where n k = inf { t ≥ X kt > } .
2. Stop when there are K arms, where K = inf n k ≥ ≤ i ≤ k µ i ≤ kλn o .
3. Draw n additional rewards from arm j satisfying µ j = min ≤ k ≤ K µ k .We define the regret of this algorithm to be R n = λEK + nE (min ≤ k ≤ K µ k ). Theorem 2.
For idealized empirical CBT, its regret R n ∼ CI β n ββ +1 , where C = ( λβ ( β +1) α ) β +1 and I β = ( β +1 ) β +1 (2 − β +1) )Γ(2 − ββ +1 ) . Proof . We stop experimenting after K arms, where K = inf { k : min ≤ j ≤ k µ j ≤ b ζ k } , b ζ k = kλn . (A.1)Let D k = { b ζ k − λn < min ≤ j ≤ k − µ j ≤ b ζ k } , D k = { min ≤ j ≤ k − µ j > b ζ k , µ k ≤ b ζ k } . We check that D k , D k are disjoint, and that D k ∪ D k = { K = k } . Essentially D k is the event that K = k and the best arm is not k and D k the event that K = k and the best arm is k . 27or any fixed k ∈ Z + , P ( D k ) = [1 − p ( b ζ k − λn )] k − − [1 − p ( b ζ k )] k − (A.2)= { − p ( b ζ k ) + [1 + o (1)] λn g ( b ζ k ) } k − − [1 − p ( b ζ k )] k − ∼ { [1 − p ( b ζ k )] k − } kλn g ( b ζ k ) ∼ exp( − αλ β β k β +1 n − β ) αλ β k β n − β . Moreover E ( R n | D k ) ∼ kλ + n ( kλn ) = 2 kλ. (A.3)Likewise, P ( D k ) = { [1 − p ( b ζ k )] k − } p ( b ζ k ) (A.4) ∼ exp( − αλ β β k β +1 n − β ) αλ β β k β n − β ,E ( R n | D k ) = kλ + nE ( µ | µ ≤ b ζ k ) (A.5)= 2 kλ − nv (ˆ ζ k ) p (ˆ ζ k ) ∼ (2 − β +1 ) kλ. Combining (A.2)–(A.5) gives us R n = ∞ X k =1 [ E ( R ′ C | D k ) P ( D k ) + E ( R ′ C | D k ) P ( D k )] (A.6) ∼ ∞ X k =1 exp( − αλ β β k β +1 n − β )( αλ β +1 β k β +1 n − β )(2 β + 2 − β +1 ) , It follows from (A.6) and a change of variables x = αλ β β k β +1 n − β that R n ∼ (2 β + 2 − β +1 ) Z ∞ exp( − αλ β β k β +1 n − β )( αλ β +1 β k β +1 n − β ) dk = (2 β + 2 − β +1 ) Z ∞ β +1 ( λβα ) β +1 n ββ +1 exp( − x ) x β +1 dx = (2 − β +1) )( λβα ) β +1 Γ(2 − ββ +1 ) n ββ +1 , and Theorem 2 holds. ⊓⊔ Proof of Lemma 4
Let b K = ⌊ nζ n (log n ) β +2 ⌋ for ζ n satisfying nv ( ζ n ) = λ . Express E ( M µ b ) = P i =1 E ( M µ b D i ), where D = { µ b ≤ ζ n log n } ,D = { µ b > ζ n log n , K > b K } ,D = { ζ n log n < µ b ≤ ζ n (log n ) β +3 , K ≤ b K } ,D = { µ b > ζ n (log n ) β +3 , K ≤ b K, M > n } ,D = { µ b > ζ n (log n ) β +3 , K ≤ b K, M ≤ n } . It suffices to show that for all i , E ( M µ b D i ) = o ( n ββ +1 ) . (B.1)Since Mζ n log n ≤ nζ n log n = o ( n ββ +1 ), (B.1) holds for i = 1.Let b µ b = min k ≤ ˆ K µ k . Since M ≤ n , µ b ≤ µ and E ( µ ) ≤ λ , E ( M µ b D ) ≤ nE ( µ | µ > ζ n log n ) P ( D ) (B.2) ≤ [ λ + o (1)] nP ( b µ b > ζ n log n ) . Substituting P ( b µ b > ζ n log n ) = [1 − p ( ζ n log n )] ˆ K = exp {− [1 + o (1)] b K αβ ( ζ n log n ) β ] = O ( n − )into (B.2) shows (B.1) for i = 2.Let M j be the number of plays of Π j to its first non-zero reward (hence M = P Kj =1 M j ). It follows from condition (A2) that E µ M = P µ ( X > ≤ a min( µ, , by µ b ≤ ζ n (log n ) β +3 , E ( M µ b D ) ≤ E ( M { µ > ζn log n } ) b Kζ n (log n ) β +3 (B.3) ≤ (cid:16) Z ∞ ζn log n g ( µ ) a min( µ, dµ (cid:17) nζ n (log n ) β +5 . Substituting Z ∞ ζn log n g ( µ ) µ dµ = O (1) if β > ,O (log n ) if β = 1 ,O (( ζ n log n ) β − ) if β < , i = 3.If µ j > ζ n (log n ) β +3 , then by condition (A2), M j is bounded above by ageometric random variable with mean ν − , where ν = a ζ n (log n ) β +3 . Hencefor 0 < θ < log( − ν ), E ( e θM j { µ j >ζ n (log n ) β +3 } ) ≤ ∞ X i =1 e θi ν (1 − ν ) i − = νe θ − e θ (1 − ν ) , implying that [ e θn P ( D ) ≤ ] E ( e θM D ) ≤ ( νe θ − e θ (1 − ν ) ) ˆ K . (B.4)Consider e θ = 1 + ν and check that e θ (1 − ν ) ≤ − ν [ ⇒ θ < log( − ν )]. Itfollows from (B.4) that P ( D ) ≤ e − θn ( νe θ ν/ ) ˆ K = 2 ˆ K e θ ( ˆ K − n ) = exp[ b K log 2 + [1 + o (1)] ν ( b K − n )]= n − b K [ a − a ζ n (log n ) β +2 − log 2log n ] = O ( n − ) . Since M ≤ n , µ b ≤ µ and E ( µ ) ≤ λ , E ( M µ b D ) ≤ nE [ µ | µ > ζ n (log n ) β +3 ] P ( D ) ≤ n [ λ + o (1)] P ( D ) , and (B.1) holds for i = 4.Under D for n large,( n − M ) v ( µ b )[ > n v ( ζ n (log n ) β +3 ) ∼ n v ( ζ n )(log n ) ( β +3)( β +1) ] > λ. The optimal solution of Problem B requires further experimentation since itscost λ is less than the reduction in exploitation cost. In other words D isan event of zero probability. Therefore (B.1) holds for i = 5. References [1]
Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab.

[2] Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning.

[3] Baransi, A., Maillard, O.-A. and Mannor, S. (2014). Sub-sampling for multi-armed bandits. Proceedings of the European Conference on Machine Learning.

[4] Berry, D., Chen, R., Zame, A., Heath, D. and Shepp, L. (1997). Bandit problems with infinitely many arms. Ann. Statist. 25, 2103-2116.

[5] Berry, D. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London.

[6] Bonald, T. and Proutière, A. (2013). Two-target algorithms for infinite-armed bandits with Bernoulli rewards. Neural Information Processing Systems.

[7] Brezzi, M. and Lai, T.L. (2002). Optimal learning and experimentation in bandit problems. J. Econ. Dynamics Control.

[8] Burnetas, A. and Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Adv. Appl. Math.

[9] Cappé, O., Garivier, A., Maillard, O.-A., Munos, R. and Stoltz, G. (2013). Kullback-Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist.

[10] Carpentier, A. and Valko, M. (2015). Simple regret for infinitely many armed bandits. International Conference on Machine Learning.

[11] Chan, H.P. (2019). The multi-armed bandit problem: an efficient non-parametric solution. To appear in Ann. Statist.

[12] Gittins, J. (1989). Multi-armed Bandit Allocation Indices. Wiley, New York.

[13] Hu, I. and Wei, C.Z. (1989). Irreversible adaptive allocation rules. Ann. Statist.

[14] Kaufmann, E., Cappé, O. and Garivier, A. (2012). On Bayesian upper confidence bounds for bandit problems. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics.

[15] Korda, N., Kaufmann, E. and Munos, R. (2013). Thompson sampling for 1-dimensional exponential family bandits. Neural Information Processing Systems.

[16] Lai, T.L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist.

[17] Lai, T.L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. Appl. Math.

[18] Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika.

[19] Vermorel, J. and Mohri, M. (2005). Multi-armed bandit algorithms and empirical evaluation. Machine Learning: ECML, Springer, Berlin.

[20] Wang, Y., Audibert, J.-Y. and Munos, R. (2008). Algorithms for infinitely many-armed bandits. Neural Information Processing Systems.