Bounded regret in stochastic multi-armed bandits
Sébastien Bubeck, Vianney Perchet* and Philippe Rigollet†
Princeton University, Université Paris Diderot and Princeton University

*Partially supported by the ANR (ANR-10-BLAN-0112).
†Partially supported by NSF grants DMS-0906424, CAREER-DMS-1053987 and a gift from the Bendheim Center for Finance.

Abstract.
We study the stochastic multi-armed bandit problem when one knows the value µ^(⋆) of an optimal arm, as well as a positive lower bound on the smallest positive gap ∆. We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows ∆, and that bounded regret of order 1/∆ is not possible if one only knows µ^(⋆).

AMS 2000 subject classifications: Primary 62L05; secondary 68T05, 62C20.
Key words and phrases:
Stochastic multi-armed bandits, bounded regret, minimax optimality, finite-time analysis.
1. INTRODUCTION
In this paper we investigate the classical stochastic multi-armed bandit problem introduced by [12] and described as follows: an agent facing K actions (or bandit arms) selects one arm at every time step until a finite time horizon n ≥ 1. Successive pulls of each arm i ∈ {1, . . . , K} yield a sequence of i.i.d. rewards Y^(i)_1, Y^(i)_2, . . . according to some unknown distribution ν_i with expected value µ^(i). Denote by ⋆ ∈ {1, . . . , K} any optimal arm, defined such that µ^(⋆) = max_{i=1,...,K} µ^(i). A policy I = {I_t} is a sequence of random variables I_t ∈ {1, . . . , K} indicating which arm to pull at each time t = 1, . . . , n, such that I_t depends only on observations strictly anterior to t. The performance of a policy I is measured by its (cumulative) regret at time n, defined by
\[
R_n = n\mu^{(\star)} - \sum_{t=1}^{n} \mathbb{E}\,\mu^{(I_t)}.
\]
Observe that if we denote by T_i(t) = Σ_{ℓ=1}^{t-1} 1{I_ℓ = i} the number of times arm i was pulled (strictly) before time t, and by ∆_i = µ^(⋆) − µ^(i) the gap between arm i and the optimal arm, then one can rewrite the regret as R_n = Σ_{i=1}^K ∆_i E T_i(n+1). This formulation will be used hereafter.

We refer the reader to [5] for a survey of the extensive literature on this problem and its variations. In this paper we investigate a phenomenon that was first observed in [8]: with some prior knowledge (in the form of lower bounds) on the maximal mean µ^(⋆) and the minimal gap ∆ = min_{i: ∆_i > 0} ∆_i, it is possible to obtain a regret that is bounded uniformly in n, which implies in particular that the regret does not tend to infinity as the time horizon n tends to infinity. This result is striking, as the seminal paper [9] indicates that, if one has no prior knowledge on the distributions, then asymptotically (in n) a regret of order log n is unavoidable.

We describe in Section 2 a simple algorithm for the two-armed bandit problem when one knows the largest expected reward µ^(⋆) and the gap ∆. In this two-armed case, this amounts to knowing µ^(1) and µ^(2) up to a permutation. We show that the regret of this algorithm is bounded by ∆ + 16/∆, uniformly in n. The optimality of this bound is assessed in Section 4, where we show that any agent knowing ∆ and µ^(⋆) must incur a regret at least of order 1/∆. These upper and lower bounds raise the following question: can such bounded regret be achieved without one of these two pieces of information? It follows from Theorems 6 and 8 that the answer is negative. Indeed, the sole knowledge of either ∆ or µ^(⋆) leads to a rescaled regret ∆R_n that is at least logarithmic in n. Interestingly, all these results are fully non-asymptotic, including the lower bounds.

What if ∆ is not perfectly known, but only some ε > 0 such that ∆ ≥ ε? We answer this question in Section 3 in the context of the general K-armed bandit problem. There, we prove an upper bound on R_n when one knows the maximal mean µ^(⋆) together with a positive lower bound ε on the smallest gap ∆. Specifically, we design a randomized policy for which
\[
R_n \le \sum_{i:\Delta_i>0} \Big\{ \Delta_i + \frac{32}{\Delta_i}\log\Big(\frac{4}{\varepsilon}\Big) \Big\}.
\]
Moreover, it follows from our main lower bound in Theorem 8 that this result cannot be improved without further assumptions, since for ε of order 1/√n (no information on the smallest gap) a logarithmic growth in n is unavoidable for the rescaled regret ∆R_n. However, for ε of order ∆ one would expect no dependency on ε (since, at least for K = 2, our policy of Section 2 attains a regret of order 1/∆).
To deal with this issue we propose an improvement of the basic policy for which the term log(1/ε) is replaced by log(∆_i/ε) log log(1/ε). In particular, if all the gaps ∆_i and ε are of the same order, the logarithmic term becomes a log-log term.

The exploration-exploitation tradeoff is a preponderant paradigm in the bandit literature. The effects of this tradeoff already appear for the case K = 2, in the form of the log n term derived in the original paper [9]. Indeed, there exist simple classes of (two!) problems over which the regret is uniformly bounded with full information but cannot be bounded uniformly with bandit feedback; see Theorem 6. Clearly, this tradeoff should become more and more apparent as the number of arms increases, but this is not our main focus. Rather, the combination of our results sheds light on an interesting phenomenon: the effects of the tradeoff vanish when both ∆ and µ^(⋆) are known, but can be seen already when K = 2 and either ∆ or µ^(⋆) is unknown.

The two-armed bandit problem where one knows the distributions of the arms up to a permutation was first investigated in [8]. The authors observed that in that case, using a policy based on the sequential likelihood ratio test, one can obtain a regret uniformly bounded over n. Both upper and lower bounds were provided. This setting was generalized in [7], where the authors considered the general multi-armed bandit problem when one knows a separating value γ between the largest mean and the other means. In that case they proved the bounded regret property for a policy based on sequential likelihood ratio tests for H_0: µ > γ vs. H_1: µ < γ (assuming exponential distributions to compute the likelihoods). They also designed a more subtle strategy for the case where only µ^(⋆) is known, and in that case too they proved a bounded regret property. The main open problems left by these works are (i) to understand the limitations of bounded regret, and (ii) to characterize the exact dependence of the regret on the parameters (when bounded regret is achievable). In this paper we make progress on both questions. Regarding the limitations of bounded regret, we prove three finite-time lower bounds, including a finite-time version of the seminal result of [9]. Ideas similar to the ones we develop in Theorems 5 and 6 already appeared in [6], but our results are fully non-asymptotic, with the exact dependence on the parameters involved. Theorem 8 is more innovative. It shows that a logarithmic growth of the rescaled regret ∆R_n is unavoidable even if one knows µ^(⋆). The proof of this result goes beyond any previous lower bound for the stochastic multi-armed bandit problem, including [7, 9], since all of them required distinguishing problems with different values of µ^(⋆) (such as the ones in Theorem 6, for example). As a consequence of this theorem, we can deduce that the policies with bounded regret derived in [7, 1] with only the knowledge of µ^(⋆) must have a suboptimal dependency on 1/∆.

The knowledge of µ^(⋆) was also exploited in other works. For instance, in [13] the authors showed that knowing µ^(⋆) allows for policies with provably better concentration properties. Their policies are based on sequential likelihood ratio tests for H_0: µ = µ^(⋆) vs. H_1: µ < µ^(⋆) (assuming Gaussian distributions to compute the likelihoods). To some extent it was to be expected that the knowledge of µ^(⋆) leads to an improved regret, as it partially removes the need for exploration: if one arm has empirical performance close to µ^(⋆), one can be confident that it is the best arm, without worrying that it might only appear to be the best because the other options have not yet been explored enough. However, the problem turns out to be more subtle than this simple argument suggests: one needs more than the knowledge of µ^(⋆) in order to obtain bounded regret with the optimal scaling in 1/∆. Indeed, Theorem 8 implies that the sole knowledge of µ^(⋆) does not warrant the boundedness of the rescaled regret ∆R_n.

Throughout the paper, we assume that the distributions ν_i are sub-Gaussian, that is,
\[
\int e^{\lambda(x-\mu^{(i)})}\,\nu_i(dx) \le e^{\lambda^2/2} \quad \text{for all } \lambda \in \mathbb{R}.
\]
Note that these include Gaussian distributions with variance at most 1 and distributions supported on an interval of length at most 2. We denote by µ̂^(i)_s = (1/s) Σ_{ℓ=1}^s Y^(i)_ℓ the empirical mean of arm i after s pulls, for s ≥ 1. Together with a Chernoff bound, it is not hard to see that the sub-Gaussian assumption implies the following concentration inequality, valid for any u > 0:
\[
\mathbb{P}\big(\widehat\mu^{(i)}_s - \mu^{(i)} > u\big) \le \exp\Big(-\frac{su^2}{2}\Big).
\]
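For completeness, the Chernoff argument behind this inequality is the standard one (a routine derivation, spelled out here for convenience): for any λ > 0, Markov's inequality applied to exp(λ s (µ̂^(i)_s − µ^(i))), together with the sub-Gaussian assumption and independence of the rewards, gives
\[
\mathbb{P}\big(\widehat\mu^{(i)}_s - \mu^{(i)} > u\big) \le e^{-\lambda s u}\,\mathbb{E}\,e^{\lambda\sum_{\ell=1}^{s}(Y^{(i)}_\ell - \mu^{(i)})} \le e^{-\lambda s u + s\lambda^2/2},
\]
and choosing λ = u yields the bound exp(−su²/2).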
2. THE TWO-ARMED CASE
In this section we investigate a toy example where K = 2 and the agent knows exactly both µ^(⋆) = 0 (without loss of generality) and ∆. While somewhat simplistic, this example offers a convenient framework to lay out the main ideas behind policies with bounded regret.

Initialization:
(0) For rounds t ∈ {1, 2}, select arm I_t = t.
For each round t = 3, 4, . . . :
(1) If µ̂^(i)_{T_i(t)} > −∆/2 and µ̂^(i)_{T_i(t)} > µ̂^(j)_{T_j(t)} (with {i, j} = {1, 2}), then select arm i, i.e., I_t = i.
(2) Otherwise select both arms, i.e., I_t = 1 and I_{t+1} = 2.

Figure 1. A policy with bounded regret for the two-armed bandit problem.
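The following Python sketch simulates Policy 1 on Gaussian rewards. It is only an illustration: the function name policy1, the Gaussian reward model and the random-number handling are assumptions made here, not part of the paper.

```python
import numpy as np

def policy1(means, delta, n, rng=None):
    """Sketch of Policy 1 for a two-armed Gaussian bandit with mu_star = 0.

    means: true arm means (their maximum is assumed to be 0),
    delta: the known gap, n: the horizon.  Returns the pull counts.
    """
    rng = np.random.default_rng(rng)
    sums = np.zeros(2)             # cumulative reward per arm
    pulls = np.zeros(2, dtype=int)

    def pull(i):
        sums[i] += rng.normal(means[i], 1.0)
        pulls[i] += 1

    # (0) initialization: one pull of each arm
    pull(0)
    pull(1)
    t = 3
    while t <= n:
        mu_hat = sums / pulls
        best = int(np.argmax(mu_hat))
        if mu_hat[best] > -delta / 2:
            # (1) an arm looks delta/2-close to mu_star = 0: exploit it
            pull(best)
            t += 1
        else:
            # (2) both arms look bad, so both estimates are suspect: pull both
            pull(0)
            pull(1)
            t += 2
    return pulls
```

By Theorem 1 below, the expected number of pulls of the suboptimal arm under this policy is at most 1 + 16/∆², uniformly in the horizon n.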
Theorem 1. Policy 1 has regret bounded as R_n ≤ ∆ + 16/∆, uniformly in n.
Proof. Without loss of generality we assume that 1 = ⋆ is the optimal arm, so that µ^(1) = 0 and µ^(2) = −∆. Observe that
\[
\{I_t = 2\} \subset \{t = 2\} \cup \{\widehat\mu^{(2)}_{T_2(t)} > -\Delta/2,\ t \ge 3,\ I_t = 2\} \cup \{\widehat\mu^{(2)}_{T_2(t)} \le -\Delta/2,\ t \ge 3,\ I_t = 2\}.
\]
Summing over t for the second event, we get
\[
(2.2)\qquad \mathbb{E}\sum_{t=3}^{n} \mathbf{1}\{\widehat\mu^{(2)}_{T_2(t)} > -\Delta/2,\ I_t = 2\} \le \mathbb{E}\sum_{t=1}^{n} \mathbf{1}\{\widehat\mu^{(2)}_{t} > -\Delta/2\} \le \sum_{t=1}^{n} \exp\big(-t\Delta^2/8\big) \le \frac{8}{\Delta^2}.
\]
For the third event we use the definition of the policy to obtain
\[
\{\widehat\mu^{(2)}_{T_2(t)} \le -\Delta/2,\ t \ge 3,\ I_t = 2\} \subset \{\widehat\mu^{(1)}_{T_1(t-1)} \le -\Delta/2,\ t \ge 3,\ I_{t-1} = 1\},
\]
and conclude as in (2.2). Altogether E T_2(n+1) ≤ 1 + 16/∆², and therefore R_n = ∆ E T_2(n+1) ≤ ∆ + 16/∆.

This policy has two weaknesses. First, one may pay a big price for misspecifying the value of ∆. Namely, if one only knows a lower bound 0 < ε ≤ ∆ and substitutes ε for ∆ in Policy 1, then it follows easily that the regret becomes of order ∆/ε². Furthermore, for essentially the same reason, the trivial generalization of this algorithm to the K-armed case would give a regret bounded by Σ_i ∆_i/∆². In the next section we show how to overcome these two issues using a new, randomized, policy.
3. A FAMILY OF POLICIES WITH BOUNDED REGRET
In this section we consider the general multi-armed case, when the agent knows µ^(⋆) = 0 (without loss of generality) and an ε > 0 such that ε ≤ ∆. Akin to Policy 1, the policy analyzed here sets a threshold at −ε/2 and, when all empirical means fall below this threshold, selects arm i with probability essentially proportional to (µ̂^(i)_{T_i(t)})^{-2}, which is an empirical estimate of ∆_i^{-2} since µ^(⋆) = 0. Policy 2 is slightly more general, as it uses a potential function ψ: ℝ_+ → ℝ_+ and selects arm i with probability inversely proportional to ψ(|µ̂^(i)_{T_i(t)}|). The natural choice is ψ(x) = x², but other choices can lead to improved performance; see Theorem 2 below. Note that we also analyze the case ε = 0 (that is, when we have no information on the smallest gap).

Initialization:
(0) For rounds t ∈ {1, . . . , K}, select arm I_t = t.
For each round t = K + 1, K + 2, . . . :
(1) If there exists i such that µ̂^(i)_{T_i(t)} ≥ −ε/2, then select I_t ∈ argmax_{1 ≤ i ≤ K} µ̂^(i)_{T_i(t)}.
(2) Otherwise select an arm at random according to the probability distribution
\[
p_{i,t} = \frac{c}{\psi\big(|\widehat\mu^{(i)}_{T_i(t)}|\big)}, \qquad \text{where } c = \Big(\sum_{j=1}^{K} \frac{1}{\psi\big(|\widehat\mu^{(j)}_{T_j(t)}|\big)}\Big)^{-1}.
\]

Figure 2. A family of policies with bounded regret for the K-armed bandit problem.
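In the same illustrative spirit, here is a Python sketch of Policy 2 (the names policy2 and psi, and the Gaussian reward model, are assumptions made here for concreteness).

```python
import numpy as np

def policy2(means, eps, n, psi=lambda x: x ** 2, rng=None):
    """Sketch of Policy 2 for a K-armed Gaussian bandit with known mu_star = 0
    and a lower bound eps on the smallest gap; psi is the potential function."""
    rng = np.random.default_rng(rng)
    K = len(means)
    sums = np.zeros(K)
    pulls = np.zeros(K, dtype=int)

    def pull(i):
        sums[i] += rng.normal(means[i], 1.0)
        pulls[i] += 1

    # (0) initialization: pull each arm once
    for i in range(K):
        pull(i)

    for t in range(K + 1, n + 1):
        mu_hat = sums / pulls
        if mu_hat.max() >= -eps / 2:
            # (1) some arm looks eps/2-close to mu_star = 0: play the leader
            i = int(np.argmax(mu_hat))
        else:
            # (2) every arm looks bad: randomize, favoring arms whose estimated
            #     gap |mu_hat| is small, with weights 1 / psi(|mu_hat|)
            w = 1.0 / np.array([psi(abs(m)) for m in mu_hat])
            i = int(rng.choice(K, p=w / w.sum()))
        pull(i)
    return pulls
```

The default psi corresponds to the choice ψ(x) = x² in Theorem 2 below; for ε > 0 the log-corrected potential of Theorem 2 can be passed as psi=lambda x: x**2 / np.log(4 * x / eps), which is well defined here because step (2) only evaluates ψ at points larger than ε/2.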
Theorem 2. Fix ε ∈ (0, 1 ∧ ∆]. Then Policy 2 associated with the potential ψ(x) = x² satisfies, for all n ≥ 1,
\[
(3.3)\qquad R_n \le \sum_{i:\Delta_i>0} \Big\{ \Delta_i + \frac{32}{\Delta_i}\log\Big(\frac{4}{\varepsilon}\Big) \Big\}.
\]
Furthermore, for ε = 0, let v = E[(Y^(⋆)_1)²]; then the regret is bounded as
\[
(3.4)\qquad R_n \le \sum_{i:\Delta_i>0} \Big\{ \Delta_i + (1 \vee v)\,\frac{4\log(9n)}{\Delta_i} \Big\}.
\]
The dependency in ε can be reduced by using the potential ψ(x) = x²/log(4x/ε), since it yields
\[
(3.5)\qquad R_n \le \sum_{i:\Delta_i>0} \Big\{ \Delta_i + \frac{32}{\Delta_i}\log\Big(\frac{4\Delta_i}{\varepsilon}\Big)\Big[1 + \log\log\Big(\frac{8}{\varepsilon}\Big)\Big] \Big\}.
\]

If ε is of the order of every ∆_i, then Equation (3.5) upper bounds the regret by a quantity of order Σ_i log log(1/∆_i)/∆_i; on the other hand, using the potential ψ(x) = x² only guarantees, under the same assumptions, a bound of order Σ_i log(1/∆_i)/∆_i.

The result for ε = 0 implies that when one has no information on the smallest gap, our policy does not attain bounded regret, but it recovers the performance of UCB [3]. As we shall see in Section 4, it is in fact impossible to obtain bounded regret scaling in 1/∆ if one only knows µ^(⋆).

Theorem 2 is deduced from the following more general regret bound for Policy 2, expressed in terms of the properties of the potential ψ.

Theorem 3. Fix ε ∈ [0, ∆] and let ψ: [ε/2, +∞) → ℝ_+ be a differentiable and increasing function. If ε > 0, Policy 2 satisfies for all n ≥ 1,
\[
(3.6)\qquad R_n \le \sum_{i:\Delta_i>0} \Big\{ \Delta_i + \frac{8}{\Delta_i} + \frac{\Delta_i}{\psi(\Delta_i/2)}\Big[ \frac{8}{\varepsilon^2}\,\psi\Big(\frac{\varepsilon}{2}\Big) + \int_{\varepsilon/2}^{+\infty} \frac{\psi'(x)}{e^{x^2/2}-1}\,dx \Big] \Big\}.
\]
Furthermore, for ε = 0 it satisfies
\[
(3.7)\qquad R_n \le \sum_{i:\Delta_i>0} \Big( \Delta_i + \frac{8}{\Delta_i} + \frac{\Delta_i}{\psi(\Delta_i/2)} \sum_{t=1}^{n} \mathbb{E}\,\psi\big(|\widehat\mu^{(1)}_t|\big) \Big).
\]
Proof. Without loss of generality we assume that 1 = ⋆ is the optimal arm. We decompose the event of a wrong selection into three events:
\[
\{I_t = i\} \subset \{t = i\} \cup \{\widehat\mu^{(i)}_{T_i(t)} > -\Delta_i/2,\ t \ge K+1,\ I_t = i\} \cup \{\widehat\mu^{(i)}_{T_i(t)} \le -\Delta_i/2,\ t \ge K+1,\ I_t = i\}.
\]
Using (2.2), one easily proves that the cumulative probability of the first two events is smaller than 1 + 8/∆_i². For the third event, it is convenient to define the random variable Z_t ∈ {0, 1, 2} that indicates whether the agent plays according to (0), (1) or (2) of Policy 2 at round t. Using the definition of the algorithm and the fact that ψ is non-decreasing, we write
\[
\mathbb{P}\{\widehat\mu^{(i)}_{T_i(t)} \le -\Delta_i/2,\ t \ge K+1,\ I_t = i\} = \mathbb{P}\{\widehat\mu^{(i)}_{T_i(t)} \le -\Delta_i/2,\ I_t = i,\ Z_t = 2\}
= \mathbb{E}\Big[ p_{i,t}\,\mathbf{1}\{\widehat\mu^{(i)}_{T_i(t)} \le -\Delta_i/2,\ Z_t = 2\} \Big]
\]
\[
= \mathbb{E}\Big[ \frac{p_{i,t}}{p_{1,t}}\, p_{1,t}\,\mathbf{1}\{\widehat\mu^{(i)}_{T_i(t)} \le -\Delta_i/2,\ Z_t = 2\} \Big]
\le \mathbb{E}\Big[ \frac{\psi\big(|\widehat\mu^{(1)}_{T_1(t)}|\big)}{\psi(\Delta_i/2)}\, p_{1,t}\,\mathbf{1}\{\widehat\mu^{(i)}_{T_i(t)} \le -\Delta_i/2,\ Z_t = 2\} \Big]
\]
\[
\le \frac{1}{\psi(\Delta_i/2)}\,\mathbb{E}\Big[ \psi\big(|\widehat\mu^{(1)}_{T_1(t)}|\big)\, p_{1,t}\,\mathbf{1}\{Z_t = 2\} \Big]
\le \frac{1}{\psi(\Delta_i/2)}\,\mathbb{E}\Big[ \psi\big(|\widehat\mu^{(1)}_{T_1(t)}|\big)\, p_{1,t}\,\mathbf{1}\{\widehat\mu^{(1)}_{T_1(t)} < -\varepsilon/2,\ t \ge K+1\} \Big],
\]
where the last step uses the fact that on {Z_t = 2} all empirical means, in particular that of arm 1, lie below −ε/2. A simple rewriting of time then concludes the proof in the case ε = 0: since p_{1,t} is the conditional probability of pulling arm 1 at round t, each value s of T_1(t) contributes at most once in expectation, so that
\[
\sum_{t=1}^{n} \mathbb{E}\Big[ \psi\big(|\widehat\mu^{(1)}_{T_1(t)}|\big)\, p_{1,t}\,\mathbf{1}\{\widehat\mu^{(1)}_{T_1(t)} \le -\varepsilon/2\} \Big] \le \sum_{t=1}^{n} \mathbb{E}\Big[ \psi\big(|\widehat\mu^{(1)}_{t}|\big)\,\mathbf{1}\{\widehat\mu^{(1)}_{t} \le -\varepsilon/2\} \Big].
\]
In the sequel we use the slight abuse of notation ψ^{-1} for the generalized inverse of ψ, and we write ψ(∞) = lim_{x→+∞} ψ(x). For ε > 0,
\[
\sum_{t=1}^{n} \mathbb{E}\Big[ \psi\big(|\widehat\mu^{(1)}_{t}|\big)\,\mathbf{1}\{\widehat\mu^{(1)}_{t} \le -\varepsilon/2\} \Big]
= \sum_{t=1}^{n} \int_0^{+\infty} \mathbb{P}\Big( \psi\big(|\widehat\mu^{(1)}_{t}|\big)\,\mathbf{1}\{\widehat\mu^{(1)}_{t} \le -\varepsilon/2\} \ge x \Big)\,dx
\]
\[
= \sum_{t=1}^{n} \Big\{ \psi\Big(\frac{\varepsilon}{2}\Big)\,\mathbb{P}\big(\widehat\mu^{(1)}_{t} \le -\varepsilon/2\big) + \int_{\psi(\varepsilon/2)}^{\psi(\infty)} \mathbb{P}\big(\widehat\mu^{(1)}_{t} \le -\psi^{-1}(x)\big)\,dx \Big\}
\le \sum_{t=1}^{n} \Big\{ \psi\Big(\frac{\varepsilon}{2}\Big)\,e^{-t\varepsilon^2/8} + \int_{\psi(\varepsilon/2)}^{\psi(\infty)} e^{-t[\psi^{-1}(x)]^2/2}\,dx \Big\}
\]
\[
\le \frac{8}{\varepsilon^2}\,\psi\Big(\frac{\varepsilon}{2}\Big) + \int_{\psi(\varepsilon/2)}^{\psi(\infty)} \frac{dx}{e^{[\psi^{-1}(x)]^2/2}-1}.
\]
Making the change of variable x = ψ(u) in the last integral concludes the proof of Theorem 3.

Theorem 2 follows from Theorem 3 with specific choices of ψ. First, take ψ(x) = x² and ε ∈ (0, 1], and observe that the integral in (3.6) can be computed as
\[
\int_{\varepsilon/2}^{+\infty} \frac{2x}{e^{x^2/2}-1}\,dx = -2\log\big(1 - e^{-\varepsilon^2/8}\big) \le 4\log\Big(\frac{4}{\varepsilon}\Big),
\]
which gives (3.3) since log(4/ε) ≥ 1 for ε ≤ 1. When ε = 0, since E ψ(|µ̂^(1)_t|) = v/t, Equation (3.7) directly gives (3.4). Next, we turn to the slightly more sophisticated potential ψ(x) = x²/log(4x/ε). Observe that for any x ≥ ε/2,
\[
\psi'(x) = \frac{2x}{\log(4x/\varepsilon)} - \frac{x}{\log^2(4x/\varepsilon)} \le \frac{2x}{\log(4x/\varepsilon)}.
\]
Therefore, for ε ∈ (0, 1], splitting the integral at x = 2 and using e^{x²/2} − 1 ≥ x²/2 on the first part,
\[
\int_{\varepsilon/2}^{+\infty} \frac{2x}{\log(4x/\varepsilon)\,\big[e^{x^2/2}-1\big]}\,dx \le \int_{\varepsilon/2}^{2} \frac{4}{x\log(4x/\varepsilon)}\,dx + \int_{2}^{+\infty} x\,e^{-x^2/4}\,dx \le 4\log\log\Big(\frac{8}{\varepsilon}\Big) - 4\log\log 2 + 1 \le 4\log\log\Big(\frac{8}{\varepsilon}\Big) + 7.
\]
This concludes the proof of (3.5).
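As a quick numerical sanity check (not part of the paper), one can run the policy2 sketch given after Figure 2 on a toy instance; the instance and seed below are arbitrary.

```python
import numpy as np

# Toy instance with mu_star = 0 and gaps 0.3 and 0.5; known lower bound eps = 0.3.
means = np.array([0.0, -0.3, -0.5])
pulls = policy2(means, eps=0.3, n=20_000, rng=0)  # policy2 from the sketch above
pseudo_regret = float(np.dot(-means, pulls))      # sum_i Delta_i * T_i(n+1)
print(pulls, pseudo_regret)  # suboptimal arms are typically pulled only a few times
```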
4. LOWER BOUNDS
We conclude our study of bounded regret in stochastic multi-armed bandits with three different lower bounds. For simplicity, we phrase these results in the simple two-armed case. First we show with Theorem 5 that if one knows both µ^(⋆) and ∆, then the best attainable regret is of order 1/∆, which matches (up to a numerical constant) the result of Theorem 1. Next we show in Theorem 6 that the sole knowledge of ∆ leads to a lower bound of order log(n∆²)/∆. This theorem implies that the bounds of [2], [4] and [10] exhibit a tight dependence in ∆ (for the two-armed case), unlike the famous result of [9]. Moreover, compared to the proof of [9], our approach is (i) much simpler, (ii) non-asymptotic and (iii) not limited to a certain class of policies. Finally we show in Theorem 8 that if one only knows µ^(⋆), then a regret of order log(n)/∆ is unavoidable (for some value of ∆).

Our proof strategy consists in rephrasing arm selection as a hypothesis testing problem, and then using well-known lower bounding techniques for the minimax risk of hypothesis testing. For instance, the proofs of Theorems 5 and 6 build upon the following result; see [14, Chapter 2] for a proof, or Lemma 7 below with λ chosen to be a Dirac mass at 1. Recall that the Kullback-Leibler divergence between two positive measures ρ, ρ′, with ρ absolutely continuous with respect to ρ′, is defined as
\[
\mathrm{KL}(\rho, \rho') = \int \log\Big(\frac{d\rho}{d\rho'}\Big)\,d\rho = \mathbb{E}_{X\sim\rho}\,\log\Big(\frac{d\rho}{d\rho'}(X)\Big).
\]

Lemma 4. Let ρ_0, ρ_1 be two probability distributions supported on some set X, with ρ_0 absolutely continuous with respect to ρ_1. Then for any measurable function ψ: X → {0, 1}, one has
\[
\mathbb{P}_{X\sim\rho_0}\big(\psi(X) = 1\big) + \mathbb{P}_{X\sim\rho_1}\big(\psi(X) = 0\big) \ge \frac12\exp\big(-\mathrm{KL}(\rho_0, \rho_1)\big).
\]

In this section we denote by ν = ν_1 ⊗ ν_2 the product distribution that generates the rewards from ν_j when pulling arm j ∈ {1, 2}. The regret of a policy that observes such rewards is denoted by R_n(ν). Finally, let P_ν denote the probability distribution associated to ν and E_ν the corresponding expectation.

Hereafter, we favor rewards that are normally distributed because they lead to simpler calculations of the KL-divergence. However, our lower bounds remain of the same order for all families of distributions {ρ_µ}_µ with expected value µ and such that KL(ρ_µ, ρ_{µ′}) ≥ C(µ − µ′)² for some absolute constant C > 0. This is the case, for example, of the Bernoulli distribution with parameter µ, as long as µ remains bounded away from 0 and 1; see, e.g., [11, Lemma 4.1].

The first lower bound illustrates that when one knows the distributions up to a permutation, the best one can hope for is a bounded regret of order 1/∆.

Theorem 5. Let ν = N(0,1) ⊗ N(−∆,1) and ν′ = N(−∆,1) ⊗ N(0,1). Then for any policy, and for every n ≥ 1,
\[
\max\big(R_n(\nu), R_n(\nu')\big) \ge \frac{\Delta}{4}\sum_{t=1}^{n} e^{-t\Delta^2}.
\]
In particular, for ∆ ≤ 1 the right-hand side is at least of order 1/∆ as soon as n ≥ 1/∆².
Proof. In this proof we assume that the policy has access to t rewards from each arm at time step t. Clearly this full-information setting is simpler than the bandit setting, and thus a lower bound for the former implies one for the latter. Using Lemma 4 as well as straightforward computations, one obtains
\[
\max\big(R_n(\nu), R_n(\nu')\big) \ge \frac12\big(R_n(\nu) + R_n(\nu')\big) = \frac{\Delta}{2}\sum_{t=1}^{n}\big(\mathbb{P}_\nu(I_t = 2) + \mathbb{P}_{\nu'}(I_t = 1)\big) \ge \frac{\Delta}{4}\sum_{t=1}^{n}\exp\big(-\mathrm{KL}(\nu^{\otimes t}, \nu'^{\otimes t})\big) = \frac{\Delta}{4}\sum_{t=1}^{n} e^{-t\Delta^2},
\]
where the last equality uses KL(ν^{⊗t}, ν′^{⊗t}) = t[KL(N(0,1), N(−∆,1)) + KL(N(−∆,1), N(0,1))] = t∆².

The above theorem ensures that the regret bound of Theorem 1 has the correct dependence in ∆. This is quite surprising, as the original bound of [9] indicates that without the knowledge of µ^(⋆) and ∆, one can incur a regret that diverges to infinity at a logarithmic rate. The next result shows that this logarithmic regret already appears when one does not know the value of µ^(⋆). Thus the knowledge of ∆ without the knowledge of µ^(⋆) is not sufficient to obtain a bounded regret. Moreover, the following lower bound matches the upper bounds (for the two-armed case) of [2], [4] and [10], thus proving their optimality.

Theorem 6. Let ν = δ_0 ⊗ N(−∆,1) and ν′ = δ_0 ⊗ N(∆,1). Then for any policy, and any n ≥ 1,
\[
\max\big(R_n(\nu), R_n(\nu')\big) \ge \frac{\log(n\Delta^2/2)}{4\Delta}.
\]
Proof. First note that
\[
\max\big(R_n(\nu), R_n(\nu')\big) \ge R_n(\nu) \ge \Delta\,\mathbb{E}_\nu T_2(n+1).
\]
Furthermore, denoting by ν_t (respectively ν′_t) the law of the observed rewards up to time t under ν (respectively under ν′), and following the same computations as in the previous proof, one also obtains
\[
\max\big(R_n(\nu), R_n(\nu')\big) \ge \frac{\Delta}{4}\sum_{t=1}^{n}\exp\big(-\mathrm{KL}(\nu_t, \nu'_t)\big).
\]
Since under ν arm 1 is uninformative, it follows from a basic calculation that KL(ν_t, ν′_t) = E_ν[T_2(t)] KL(N(−∆,1), N(∆,1)) = 2∆² E_ν T_2(t) ≤ 2∆² E_ν T_2(n+1). The above three displays yield
\[
\max\big(R_n(\nu), R_n(\nu')\big) \ge \frac{\Delta}{2}\Big( \mathbb{E}_\nu T_2(n+1) + \frac{n}{4}\exp\big(-2\Delta^2\,\mathbb{E}_\nu T_2(n+1)\big) \Big) \ge \min_{x\in[0,n]} \frac{\Delta}{2}\Big( x + \frac{n}{4}\,e^{-2\Delta^2 x} \Big) \ge \frac{\log(n\Delta^2/2)}{4\Delta}.
\]

Finally we prove that the knowledge of µ^(⋆) without the knowledge of ∆ is not sufficient either to obtain a bounded rescaled regret ∆R_n. This result is more difficult, and falls within the more general topic of lower bounds for adaptive rates. First we need to generalize Lemma 4 to deal with both a composite alternative and a rescaled risk. The proof of this result is standard and postponed to the appendix.
Lemma 7. Let ρ_0 and ρ_∆, ∆ ∈ ℝ, be probability distributions supported on some set X, with ρ_∆ absolutely continuous with respect to ρ_0. Let λ be a finite positive measure on ℝ. Then for any measurable function ψ: X → {0, 1}, one has
\[
\mathbb{P}_{X\sim\rho_0}\big(\psi(X) = 1\big) + \int \Delta\,\mathbb{P}_{X\sim\rho_\Delta}\big(\psi(X) = 0\big)\,d\lambda(\Delta) \ge \frac{1}{C_\lambda}\exp\big(-\mathrm{KL}(\rho_0, \bar\rho)\big),
\]
where ρ̄ is the positive measure on X defined by ρ̄ = ∫ ∆ ρ_∆ dλ(∆) and C_λ = 1 + ∫ ∆ dλ(∆).

Note that ∫ ∆ ρ_∆ dλ(∆) is not a probability distribution; however, it is a positive measure, so the Kullback-Leibler divergence in the above lemma is well defined.

Theorem 8. Let ν_0 = N(0,1) ⊗ N(−1,1) and ν_∆ = N(−∆,1) ⊗ N(0,1), ∆ ∈ (0,1]. Then for any policy, and any n ≥ 1,
\[
\max\Big( R_n(\nu_0),\ \sup_{\Delta\in(0,1]} \Delta\,R_n(\nu_\Delta) \Big) \ge \frac12\log\Big(\frac{n}{1024}\Big).
\]

Theorem 8 can be read as follows: for any policy, and any n ≥ 1, there exist ∆ ∈ (0,1] and a problem instance with gap ∆ and optimal value µ^(⋆) = 0 such that on this problem one has R_n ≥ log(n/1024)/(2∆).
Proof. Similarly to the previous proof, we define ν_{0,t} and ν_{∆,t} as the laws of the observed rewards up to time t. Lemma 7 yields
\[
(4.8)\qquad \max\Big( R_n(\nu_0),\ \sup_{\Delta\in(0,1]} \Delta\,R_n(\nu_\Delta) \Big) \ge \frac{1}{2C_\lambda}\sum_{t=1}^{n}\exp\Big(-\mathrm{KL}\Big(\nu_{0,t},\ \int \Delta\,\nu_{\Delta,t}\,d\lambda(\Delta)\Big)\Big).
\]
For ν ∈ {ν_0, ν_∆}, denote the expected reward of arm i ∈ {1, 2} by µ^(i)_ν, so that µ^(1)_{ν_0} = µ^(2)_{ν_∆} = 0, µ^(2)_{ν_0} = −1 and µ^(1)_{ν_∆} = −∆. Recall that a policy {I_t}_{t≥1} taking values in {1, 2} generates a sequence of rewards Y^(I_t)_t, t ≥ 1, distributed according to ν ∈ {ν_0, ν_∆}. The joint density (with respect to the Lebesgue measure) dν_t of (Y^(I_1)_1, . . . , Y^(I_t)_t) ∈ ℝ^t, where ν ∈ {ν_0, ν_∆}, can be computed easily using the chain rule for conditional densities. It is given by
\[
d\nu_t = \frac{1}{(2\pi)^{t/2}}\exp\Big(-\frac12\sum_{\ell=1}^{t}\big(Y^{(I_\ell)}_\ell - \mu^{(I_\ell)}_\nu\big)^2\Big).
\]
Choosing ν = ν_∆ and ν = ν_0 respectively, it yields
\[
\frac{d\nu_{\Delta,t}}{d\nu_{0,t}}\big(Y^{(I_1)}_1, \dots, Y^{(I_t)}_t\big)
= \exp\Big(-\frac12\sum_{\ell=1}^{t}\big[\big(Y^{(I_\ell)}_\ell - \mu^{(I_\ell)}_{\nu_\Delta}\big)^2 - \big(Y^{(I_\ell)}_\ell - \mu^{(I_\ell)}_{\nu_0}\big)^2\big]\Big)
\]
\[
= \exp\Big(-\frac12\sum_{\ell=1}^{t}\mathbf{1}_{\{I_\ell=1\}}\big[\big(Y^{(1)}_\ell + \Delta\big)^2 - \big(Y^{(1)}_\ell\big)^2\big] - \frac12\sum_{\ell=1}^{t}\mathbf{1}_{\{I_\ell=2\}}\big[\big(Y^{(2)}_\ell\big)^2 - \big(Y^{(2)}_\ell + 1\big)^2\big]\Big)
= \exp\Big(-T^{(1)}\Big(\Delta\widehat\mu^{(1)} + \frac{\Delta^2}{2}\Big) + T^{(2)}\Big(\widehat\mu^{(2)} + \frac12\Big)\Big),
\]
where we denote for simplicity
\[
T^{(i)} = T_i(t+1) = \sum_{\ell=1}^{t}\mathbf{1}\{I_\ell = i\} \qquad\text{and}\qquad \widehat\mu^{(i)} = \frac{1}{T^{(i)}}\sum_{\ell=1}^{t}\mathbf{1}_{\{I_\ell = i\}}\,Y^{(i)}_\ell, \quad i \in \{1, 2\}.
\]
Dropping the dependency in (Y^(I_1)_1, . . . , Y^(I_t)_t) from the notation, it yields
\[
\int \Delta\,\frac{d\nu_{\Delta,t}}{d\nu_{0,t}}\,d\lambda(\Delta) = \exp\Big(T^{(2)}\Big(\widehat\mu^{(2)} + \frac12\Big)\Big)\int \Delta\exp\Big(-T^{(1)}\Big(\Delta\widehat\mu^{(1)} + \frac{\Delta^2}{2}\Big)\Big)d\lambda(\Delta),
\]
and thus
\[
\mathrm{KL}\Big(\nu_{0,t}, \int \Delta\,\nu_{\Delta,t}\,d\lambda(\Delta)\Big) = -\mathbb{E}_{\nu_0}\Big[ T^{(2)}\Big(\widehat\mu^{(2)} + \frac12\Big) + \log\Big(\int \Delta\exp\Big(-T^{(1)}\Big(\Delta\widehat\mu^{(1)} + \frac{\Delta^2}{2}\Big)\Big)d\lambda(\Delta)\Big)\Big]
\]
\[
= \frac12\,\mathbb{E}_{\nu_0} T^{(2)} - \mathbb{E}_{\nu_0}\log\Big(\int \Delta\exp\Big(-T^{(1)}\Big(\Delta\widehat\mu^{(1)} + \frac{\Delta^2}{2}\Big)\Big)d\lambda(\Delta)\Big),
\]
where the last line follows from standard computations (under ν_0 one has E_{ν_0}[T^(2) µ̂^(2)] = −E_{ν_0} T^(2)). Next, it follows from the Cauchy-Schwarz inequality that the function
\[
x \mapsto \log\Big(\int \Delta\exp\big(\varphi(\Delta)\,x\big)\,d\lambda(\Delta)\Big)
\]
is convex for any function φ. Together with the Jensen inequality, and since E_{ν_0}[T^(1) µ̂^(1)] = 0, it yields
\[
\mathbb{E}_{\nu_0}\log\Big(\int \Delta\exp\Big(-T^{(1)}\Big(\Delta\widehat\mu^{(1)} + \frac{\Delta^2}{2}\Big)\Big)d\lambda(\Delta)\Big) \ge \log\Big(\int \Delta\exp\Big(-\frac{\Delta^2}{2}\,\mathbb{E}_{\nu_0} T^{(1)}\Big)d\lambda(\Delta)\Big).
\]
Define τ = E_{ν_0} T_1(n+1) ≥ E_{ν_0} T^(1) and let λ be the uniform distribution on [0, 1/√τ]. We may assume τ ≥ 1, since otherwise E_{ν_0} T_2(n+1) ≥ n − 1 and the bound of the theorem is trivial. Since u e^{−u²/2} ≥ u/2 for 0 ≤ u ≤ 1, it yields
\[
\int \Delta\exp\Big(-\frac{\Delta^2}{2}\,\mathbb{E}_{\nu_0} T^{(1)}\Big)d\lambda(\Delta) \ge \int \Delta\,e^{-\Delta^2\tau/2}\,d\lambda(\Delta) = \frac{1}{\sqrt\tau}\int_0^1 u\,e^{-u^2/2}\,du \ge \frac{1}{4\sqrt\tau}.
\]
Thus we have proved that
\[
\mathrm{KL}\Big(\nu_{0,t}, \int \Delta\,\nu_{\Delta,t}\,d\lambda(\Delta)\Big) \le \frac12\,\mathbb{E}_{\nu_0} T^{(2)} + \log\big(4\sqrt\tau\big) \le \frac12\,\mathbb{E}_{\nu_0} T_2(n+1) + \frac12\log(16n).
\]
Plugging this into (4.8), one obtains
\[
\max\Big( R_n(\nu_0),\ \sup_{\Delta\in(0,1]} \Delta\,R_n(\nu_\Delta) \Big) \ge \frac{\sqrt n}{8\,C_\lambda}\exp\Big(-\frac12\,\mathbb{E}_{\nu_0} T_2(n+1)\Big) \ge \frac{\sqrt n}{16}\exp\Big(-\frac12\,\mathbb{E}_{\nu_0} T_2(n+1)\Big),
\]
where we used the fact that τ ≥ 1, which implies C_λ = 1 + ∫ ∆ dλ(∆) = 1 + 1/(2√τ) ≤ 3/2 ≤ 2. On the other hand one also has R_n(ν_0) ≥ E_{ν_0} T_2(n+1). Therefore
\[
\max\Big( R_n(\nu_0),\ \sup_{\Delta\in(0,1]} \Delta\,R_n(\nu_\Delta) \Big) \ge \min_{x\in[0,n]} \frac12\Big( x + \frac{\sqrt n}{16}\,e^{-x/2} \Big) \ge \frac12\log\Big(\frac{n}{1024}\Big).
\]

Theorems 6 and 8 have important consequences for the exploration-exploitation tradeoff mentioned in the introduction. Indeed, consider the full-information case where, at each round, the agent observes the reward of both arms. In this case, it is not hard to see that the policy that pulls the arm with the best average reward has bounded regret of order 1/∆. Therefore, the knowledge of ∆ or µ^(⋆) alone does not alleviate the price of exploration. However, when both are known, it vanishes (see Theorem 1).

Acknowledgments.
We are indebted to Alexander Goldenshluger for bringing the reference [7] to our attention.
APPENDIX A: PROOF OF LEMMA 7
Throughout the proof, Radon-Nikodym derivatives over X are taken with respect to a common but unspecified reference measure; it does not enter the final result. It follows from Fubini's theorem that
\[
\mathbb{P}_{X\sim\rho_0}\big(\psi(X) = 1\big) + \int \Delta\,\mathbb{P}_{X\sim\rho_\Delta}\big(\psi(X) = 0\big)\,d\lambda(\Delta)
= \int_{\psi=1} d\rho_0 + \int\Big(\int_{\psi=0} \Delta\,d\rho_\Delta\Big)d\lambda(\Delta)
= \int_{\psi=1} d\rho_0 + \int_{\psi=0} d\bar\rho
= \int_{\psi=1} d\rho_0 + \int_{\psi=0} \frac{d\bar\rho}{d\rho_0}\,d\rho_0.
\]
Furthermore, the last expression is clearly minimized, over measurable ψ, by ψ(x) = 1{(dρ̄/dρ_0)(x) > 1}. It yields
\[
\int_{\psi=1} d\rho_0 + \int_{\psi=0} \frac{d\bar\rho}{d\rho_0}\,d\rho_0
\ge \int \mathbf{1}\Big\{\frac{d\bar\rho}{d\rho_0} > 1\Big\}\,d\rho_0 + \int \mathbf{1}\Big\{\frac{d\bar\rho}{d\rho_0} \le 1\Big\}\,\frac{d\bar\rho}{d\rho_0}\,d\rho_0
= \int \mathbf{1}\Big\{\frac{d\bar\rho}{d\rho_0} > 1\Big\}\,d\rho_0 + \int \mathbf{1}\Big\{\frac{d\bar\rho}{d\rho_0} \le 1\Big\}\,d\bar\rho
= \int \min\big(d\rho_0, d\bar\rho\big).
\]
Note that the latter quantity is closely related to the Hellinger affinity and does not depend on the reference measure on X; see, e.g., [14], Chapter 2. Now, using the Cauchy-Schwarz inequality and the fact that
\[
\int \min\big(d\rho_0, d\bar\rho\big) + \int \max\big(d\rho_0, d\bar\rho\big) = 1 + \int \Delta\,d\lambda(\Delta) = C_\lambda,
\]
we get
\[
\Big(\int \sqrt{d\bar\rho\,d\rho_0}\Big)^2 = \Big(\int \sqrt{\min(d\bar\rho, d\rho_0)\,\max(d\bar\rho, d\rho_0)}\Big)^2 \le \Big(\int \min(d\bar\rho, d\rho_0)\Big)\Big(\int \max(d\bar\rho, d\rho_0)\Big) \le C_\lambda \int \min(d\bar\rho, d\rho_0).
\]
The above three displays together yield
\[
\mathbb{P}_{X\sim\rho_0}\big(\psi(X) = 1\big) + \int \Delta\,\mathbb{P}_{X\sim\rho_\Delta}\big(\psi(X) = 0\big)\,d\lambda(\Delta) \ge \frac{1}{C_\lambda}\Big(\int \sqrt{d\bar\rho\,d\rho_0}\Big)^2.
\]
To complete the proof, observe that the Jensen inequality yields
\[
\Big(\int \sqrt{d\bar\rho\,d\rho_0}\Big)^2 = \Big(\int \sqrt{\frac{d\bar\rho}{d\rho_0}}\,d\rho_0\Big)^2 = \exp\Big[2\log\Big(\int \sqrt{\frac{d\bar\rho}{d\rho_0}}\,d\rho_0\Big)\Big] \ge \exp\Big[2\int \log\Big(\sqrt{\frac{d\bar\rho}{d\rho_0}}\Big)\,d\rho_0\Big] = \exp\big[-\mathrm{KL}(\rho_0, \bar\rho)\big].
\]
REFERENCES

[1] Agrawal, R., Teneketzis, D., and Anantharam, V. Asymptotically efficient adaptive allocation schemes for controlled i.i.d. processes: finite parameter space. IEEE Transactions on Automatic Control 34, 3 (1989), 258-267.
[2] Audibert, J.-Y., and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT) (2009).
[3] Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235-256.
[4] Auer, P., and Ortner, R. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61, 1 (2010), 55-65.
[5] Bubeck, S., and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5, 1 (2012), 1-122.
[6] Kulkarni, S. R., and Lugosi, G. Finite-time lower bounds for the two-armed bandit problem. IEEE Transactions on Automatic Control 45, 4 (2000), 711-714.
[7] Lai, T. L., and Robbins, H. Asymptotically optimal allocation of treatments in sequential experiments. In Design of Experiments: Ranking and Selection, T. J. Santner and A. C. Tamhane, Eds. 1984, pp. 127-142.
[8] Lai, T. L., and Robbins, H. Optimal sequential sampling from two populations. Proc. Natl. Acad. Sci. USA 81 (1984), 1284-1286.
[9] Lai, T. L., and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1985), 4-22.
[10] Perchet, V., and Rigollet, P. The multi-armed bandit problem with covariates, October 2011. arXiv:1110.6084.
[11] Rigollet, P., and Zeevi, A. Nonparametric bandits with covariates. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT) (2010), A. T. Kalai and M. Mohri, Eds., pp. 54-66.
[12] Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58 (1952), 527-535.
[13] Salomon, A., and Audibert, J.-Y. Deviations of stochastic bandit regret. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT) (2011).
[14] Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer, 2009.
Sébastien Bubeck
Department of Operations Research and Financial Engineering
Princeton University
Princeton, NJ 08544, USA
([email protected])

Vianney Perchet
LPMA, UMR 7599
Université Paris Diderot
175, rue du Chevaleret
75013 Paris, France
([email protected])

Philippe Rigollet
Department of Operations Research and Financial Engineering
Princeton University
Princeton, NJ 08544, USA
([email protected])