The Pareto Regret Frontier for Bandits
Tor Lattimore
Department of Computing Science, University of Alberta, Canada. [email protected]
Abstract
Given a multi-armed bandit problem it may be desirable to achieve a smaller-than-usual worst-case regret for some special actions. I show that the price for such unbalanced worst-case regret guarantees is rather high. Specifically, if an algorithm enjoys a worst-case regret of $B$ with respect to some action, then there must exist another action for which the worst-case regret is at least $\Omega(nK/B)$, where $n$ is the horizon and $K$ the number of actions. I also give upper bounds in both the stochastic and adversarial settings showing that this result cannot be improved. For the stochastic case the Pareto regret frontier is characterised exactly up to constant factors.

1 Introduction

The multi-armed bandit is the simplest class of problems that exhibit the exploration/exploitation dilemma. In each time step the learner chooses one of $K$ actions and receives a noisy reward signal for the chosen action. A learner's performance is measured in terms of the regret, which is the (expected) difference between the rewards it actually received and those it would have received (in expectation) by choosing the optimal action.

Prior work on the regret criterion for finite-armed bandits has treated all actions uniformly and has aimed for bounds on the regret that do not depend on which action turned out to be optimal. I take a different approach and ask what can be achieved if some actions are given special treatment. Focussing on worst-case bounds, I ask whether or not it is possible to achieve improved worst-case regret for some actions, and what is the cost in terms of the regret for the remaining actions. Such results may be useful in a variety of cases. For example, a company that is exploring some new strategies might expect an especially small regret if its existing strategy turns out to be (nearly) optimal.

This problem has previously been considered in the experts setting, where the learner is allowed to observe the reward for all actions in every round, not only for the action actually chosen. The earliest work seems to be by Hutter and Poland [2005], where it is shown that the learner can assign a prior weight to each action and pays a worst-case regret of $O(\sqrt{-n\log\rho_i})$ for expert $i$, where $\rho_i$ is the prior belief in expert $i$ and $n$ is the horizon. The uniform regret is obtained by choosing $\rho_i = 1/K$, which leads to the well-known $O(\sqrt{n\log K})$ bound achieved by the exponential weighting algorithm [Cesa-Bianchi, 2006]. The consequence of this is that an algorithm can enjoy a constant regret with respect to a single action while suffering minimally on the remainder. The problem was studied in more detail by Koolen [2013], where (remarkably) the author was able to exactly describe the Pareto regret frontier when $K = 2$.

Other related work (also in the experts setting) is where the objective is to obtain an improved regret against a mixture of available experts/actions [Even-Dar et al., 2008, Kapralov and Panigrahy, 2011]. In a similar vein, Sani et al. [2014] showed that algorithms for prediction with expert advice can be combined with minimal cost to obtain the best of both worlds. In the bandit setting I am only aware of the work by Liu and Li [2015], who study the effect of the prior on the regret of Thompson sampling in a special case. In contrast, the lower bound given here applies to all algorithms in a relatively standard setting.
The main contribution of this work is a characterisation of the Pareto regret frontier (the set of achievable worst-case regret bounds) for stochastic bandits. Let $\mu_i \in \mathbb{R}$ be the unknown mean of the $i$th arm and assume that $\sup_{i,j} \mu_i - \mu_j \le 1$. In each time step the learner chooses an action $I_t \in \{1,\ldots,K\}$ and receives reward $g_{I_t,t} = \mu_{I_t} + \eta_t$, where $\eta_t$ is a noise term that I assume to be sampled independently from a $1$-subgaussian distribution that may depend on $I_t$. This model subsumes both Gaussian and Bernoulli (or bounded) rewards. Let $\pi$ be a bandit strategy, which is a function from histories of observations to an action $I_t$. Then the $n$-step expected pseudo-regret with respect to the $i$th arm is
$$R^\pi_{\mu,i} = n\mu_i - \mathbb{E}\sum_{t=1}^n \mu_{I_t},$$
where the expectation is taken with respect to the randomness in the noise and the actions of the policy. Throughout this work $n$ will be fixed, so it is omitted from the notation. The worst-case expected pseudo-regret with respect to arm $i$ is
$$R^\pi_i = \sup_\mu R^\pi_{\mu,i}. \qquad (1)$$
This means that $R^\pi \in \mathbb{R}^K$ is a vector of worst-case pseudo-regrets with respect to each of the arms. Let $\mathcal{B} \subset \mathbb{R}^K$ be the set defined by
$$\mathcal{B} = \left\{ B \in [0,n]^K : B_i \ge \min\left\{n, \sum_{j \ne i} \frac{n}{B_j}\right\} \text{ for all } i \right\}. \qquad (2)$$
The boundary of $\mathcal{B}$ is denoted by $\partial\mathcal{B}$. The following theorem shows that $\partial\mathcal{B}$ describes the Pareto regret frontier up to constant factors.
Theorem. There exist universal constants $c_1 = 8$ and $c_2 = 252$ such that:

Lower bound: for $\eta_t \sim \mathcal{N}(0,1)$ and all strategies $\pi$ we have $c_1(R^\pi + K) \in \mathcal{B}$.

Upper bound: for all $B \in \mathcal{B}$ there exists a strategy $\pi$ such that $R^\pi_i \le c_2 B_i$ for all $i$.

Observe that the lower bound relies on the assumption that the noise term be Gaussian, while the upper bound holds for subgaussian noise. The lower bound may be generalised to other noise models such as Bernoulli, but does not hold for all subgaussian noise models. For example, it does not hold if there is no noise ($\eta_t = 0$ almost surely).

The lower bound also applies to the adversarial framework where the rewards may be chosen arbitrarily. Although I was not able to derive a matching upper bound in this case, a simple modification of the Exp3-$\gamma$ algorithm [Bubeck and Cesa-Bianchi, 2012] leads to an algorithm with $R^\pi_1 \le B_1$ and
$$R^\pi_k \lesssim \frac{nK}{B_1}\log\left(\frac{nK}{B_1}\right) \quad \text{for all } k \ge 2,$$
where the regret is the adversarial version of the expected regret. The details may be found in Appendix D.

The new results seem elegant, but disappointing. In the experts setting we have seen that the learner can distribute a prior amongst the actions and obtain a bound on the regret depending in a natural way on the prior weight of the optimal action. In contrast, in the bandit setting the learner pays an enormously higher price to obtain a small regret with respect to even a single arm. In fact, the learner must essentially choose a single arm to favour, after which the regret for the remaining arms has very limited flexibility. Unlike in the experts setting, if even a single arm enjoys constant worst-case regret, then the worst-case regret with respect to all other arms is necessarily linear.
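The membership condition in Eq. (2) is easy to test numerically. The following minimal Python sketch is my own illustration (the helper name and example values are not from the paper); it checks whether a candidate vector of worst-case regret bounds satisfies the condition and confirms that the uniform point lies on the frontier:

```python
import math

def in_frontier(B, n, tol=1e-9):
    """Eq. (2): B is in the set iff B_i >= min(n, sum_{j != i} n / B_j) for all i.
    The small tolerance guards against floating-point ties."""
    K = len(B)
    return all(
        B[i] + tol >= min(n, sum(n / B[j] for j in range(K) if j != i))
        for i in range(K)
    )

n, K = 5000, 10
uniform = [math.sqrt(n * (K - 1))] * K   # the usual sqrt(n(K-1)) worst-case bound
print(in_frontier(uniform, n))           # True: lies on the frontier
greedy = [1.0] * K                       # constant regret for every arm
print(in_frontier(greedy, n))            # False: sum_{j != i} n / B_j is far too large
```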
2 Preliminaries

I use the same notation as Bubeck and Cesa-Bianchi [2012]. Define $T_i(t)$ to be the number of times action $i$ has been chosen after time step $t$ and $\hat\mu_{i,s}$ to be the empirical estimate of $\mu_i$ from the first $s$ times action $i$ was sampled. This means that $\hat\mu_{i,T_i(t-1)}$ is the empirical estimate of $\mu_i$ at the start of the $t$th round. I use the convention that $\hat\mu_{i,0} = 0$. Since the noise model is $1$-subgaussian we have
$$\forall \varepsilon > 0 \qquad \mathbb{P}\left\{\exists s \le t : \hat\mu_{i,s} - \mu_i \ge \varepsilon/s\right\} \le \exp\left(-\frac{\varepsilon^2}{2t}\right). \qquad (3)$$
This result is presumably well known, but a proof is included in Appendix E for convenience. The optimal arm is $i^* = \arg\max_i \mu_i$ with ties broken in some arbitrary way. The optimal reward is $\mu^* = \max_i \mu_i$. The gap between the mean rewards of the $j$th arm and the optimal arm is $\Delta_j = \mu^* - \mu_j$, and $\Delta_{ji} = \mu_i - \mu_j$. The vector of worst-case regrets is $R^\pi \in \mathbb{R}^K$ and has been defined already in Eq. (1). I write $R^\pi \le B \in \mathbb{R}^K$ if $R^\pi_i \le B_i$ for all $i \in \{1,\ldots,K\}$. For a vector $R^\pi$ and $x \in \mathbb{R}$ we have $(R^\pi + x)_i = R^\pi_i + x$.

Before proving the main theorem I briefly describe the features of the regret frontier. First notice that if $B_i = \sqrt{n(K-1)}$ for all $i$, then
$$B_i = \sqrt{n(K-1)} = \sum_{j \ne i}\sqrt{n/(K-1)} = \sum_{j \ne i}\frac{n}{B_j}.$$
Thus $B \in \mathcal{B}$ as expected. This particular $B$ is witnessed up to constant factors by MOSS [Audibert and Bubeck, 2009] and OC-UCB [Lattimore, 2015], but not by UCB [Auer et al., 2002], which suffers $R^{\mathrm{ucb}}_i \in \Omega(\sqrt{nK\log n})$.

Of course the uniform choice of $B$ is not the only option. Suppose the first arm is special, so $B_1$ should be chosen especially small. Assume without loss of generality that $B_1 \le B_2 \le \ldots \le B_K \le n$. Then by the main theorem we have
$$B_1 \ge \sum_{i=2}^K \frac{n}{B_i} \ge \sum_{i=2}^k \frac{n}{B_i} \ge (k-1)\frac{n}{B_k}.$$
Therefore
$$B_k \ge (k-1)\frac{n}{B_1}. \qquad (4)$$
This also proves the claim in the abstract, since it implies that $B_K \ge (K-1)n/B_1$. If $B_1$ is fixed, then choosing $B_k = (k-1)n/B_1$ does not lie on the frontier, because
$$\sum_{k=2}^K \frac{n}{B_k} = \sum_{k=2}^K \frac{B_1}{k-1} \in \Omega(B_1\log K).$$
However, if $H = \sum_{k=2}^K 1/(k-1) \in \Theta(\log K)$, then choosing $B_k = (k-1)nH/B_1$ does lie on the frontier and is a factor of $\log K$ away from the lower bound given in Eq. (4). Therefore, up to a $\log K$ factor, points on the regret frontier are characterised entirely by a permutation determining the order of worst-case regrets and the smallest worst-case regret.

Perhaps the most natural choice of $B$ (assuming again that $B_1 \le \ldots \le B_K$) is $B_1 = n^p$ and $B_k = (k-1)n^{1-p}H$ for $k > 1$. For $p = 1/2$ this leads to a bound that is at most $\sqrt{K}\log K$ worse than that obtained by MOSS and OC-UCB, while being a factor of $\sqrt{K}$ better for a select few.

Assumptions

The assumption that $\Delta_i \in [0,1]$ is used to avoid annoying boundary problems caused by the fact that time is discrete. This means that if $\Delta_i$ is extremely large, then even a single sample from this arm can cause a large regret. This assumption is already quite common; for example, a worst-case regret of $\Omega(\sqrt{Kn})$ clearly does not hold if the gaps are permitted to be unbounded. Unfortunately there is no perfect resolution to this annoyance. Most elegant would be to allow time to be continuous with actions taken up to stopping times. Otherwise one has to deal with the discretisation/boundary problem with special cases, or make assumptions as I have done here.
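Before turning to the lower bound, the unbalanced construction above is easy to check numerically. The sketch below is again my own illustration (it reuses the hypothetical in_frontier helper from the earlier snippet); it builds the point $B_1 = n^p$, $B_k = (k-1)n^{1-p}H$ and verifies that it satisfies Eq. (2):

```python
def unbalanced_point(n, K, p=0.5):
    """B_1 = n**p and B_k = (k - 1) * n**(1 - p) * H for k >= 2,
    where H = sum_{k=2}^K 1/(k - 1) is within a constant of log K."""
    H = sum(1.0 / (k - 1) for k in range(2, K + 1))
    return [n ** p] + [(k - 1) * n ** (1 - p) * H for k in range(2, K + 1)]

n, K = 5000, 10
B = unbalanced_point(n, K)
print([round(b, 1) for b in B])   # roughly B_1 ~ 70.7 up to B_10 ~ 1800 for these values
print(in_frontier(B, n))          # True: on the frontier, at the price of the log K factor
```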
Theorem 1. Assume $\eta_t \sim \mathcal{N}(0,1)$ is sampled from a standard Gaussian. Let $\pi$ be an arbitrary strategy. Then $8(R^\pi + K) \in \mathcal{B}$.

Proof. Assume without loss of generality that $R^\pi_1 = \min_i R^\pi_i$ (if this is not the case, then simply re-order the actions). If $R^\pi_1 > n/8$, then the result is trivial. From now on assume $R^\pi_1 \le n/8$. Let $c = 4$ and define
$$\varepsilon_k = \min\left\{\frac{1}{2}, \frac{cR^\pi_k}{n}\right\} \le \frac{1}{2}.$$
Define $K$ vectors $\mu_1,\ldots,\mu_K \in \mathbb{R}^K$ by
$$(\mu_k)_j = \begin{cases} \tfrac{1}{2} + \varepsilon_1 & \text{if } j = 1 \\ \tfrac{1}{2} + \varepsilon_1 + \varepsilon_k & \text{if } j = k \ne 1 \\ \tfrac{1}{2} + \varepsilon_1 - \varepsilon_j & \text{otherwise.} \end{cases}$$
Therefore the optimal action for the bandit with means $\mu_k$ is $k$. Let $A = \{k : R^\pi_k \le n/8\}$ and $A' = \{k : k \notin A\}$, and assume $k \in A$. Then
$$R^\pi_k \overset{(a)}{\ge} R^\pi_{\mu_k,k} \overset{(b)}{\ge} \varepsilon_k\,\mathbb{E}^\pi_{\mu_k}\sum_{j \ne k} T_j(n) \overset{(c)}{=} \varepsilon_k\big(n - \mathbb{E}^\pi_{\mu_k}T_k(n)\big) \overset{(d)}{=} \frac{cR^\pi_k\big(n - \mathbb{E}^\pi_{\mu_k}T_k(n)\big)}{n},$$
where (a) follows since $R^\pi_k$ is the worst-case regret with respect to arm $k$, (b) since the gap between the means of the $k$th arm and any other arm is at least $\varepsilon_k$ (note that this is also true for $k = 1$, since $\varepsilon_1 = \min_k \varepsilon_k$), (c) follows from the fact that $\sum_i T_i(n) = n$, and (d) from the definition of $\varepsilon_k$. Therefore
$$n\left(1 - \frac{1}{c}\right) \le \mathbb{E}^\pi_{\mu_k}T_k(n). \qquad (5)$$
Therefore for $k \ne 1$ with $k \in A$ we have
$$n\left(1 - \frac{1}{c}\right) \le \mathbb{E}^\pi_{\mu_k}T_k(n) \overset{(a)}{\le} \mathbb{E}^\pi_{\mu_1}T_k(n) + n\varepsilon_k\sqrt{\mathbb{E}^\pi_{\mu_1}T_k(n)} \overset{(b)}{\le} n - \mathbb{E}^\pi_{\mu_1}T_1(n) + n\varepsilon_k\sqrt{\mathbb{E}^\pi_{\mu_1}T_k(n)} \overset{(c)}{\le} \frac{n}{c} + n\varepsilon_k\sqrt{\mathbb{E}^\pi_{\mu_1}T_k(n)},$$
where (a) follows from standard entropy inequalities and a similar argument as used by Auer et al. [1995] (details are given in Appendix C), (b) since $k \ne 1$ and $\mathbb{E}^\pi_{\mu_1}T_1(n) + \mathbb{E}^\pi_{\mu_1}T_k(n) \le n$, and (c) by Eq. (5). Therefore
$$\mathbb{E}^\pi_{\mu_1}T_k(n) \ge \frac{(1 - 2/c)^2}{\varepsilon_k^2}, \qquad R^\pi_1 \ge R^\pi_{\mu_1,1} = \sum_{k=2}^K \varepsilon_k\,\mathbb{E}^\pi_{\mu_1}T_k(n) \ge \sum_{k\in A\setminus\{1\}}\frac{(1-2/c)^2}{\varepsilon_k} = \frac{1}{16}\sum_{k\in A\setminus\{1\}}\frac{n}{R^\pi_k}.$$
Therefore for all $i \in A$ we have
$$R^\pi_i \ge \frac{1}{16}\sum_{k\in A\setminus\{1\}}\frac{n}{R^\pi_k}\cdot\frac{R^\pi_i}{R^\pi_1} \ge \frac{1}{16}\sum_{k\in A\setminus\{i\}}\frac{n}{R^\pi_k}.$$
Therefore
$$R^\pi_i + 8K \ge \frac{1}{16}\sum_{k\ne i}\frac{n}{R^\pi_k} + 8K - \frac{1}{16}\sum_{k\in A'\setminus\{i\}}\frac{n}{R^\pi_k} \ge \frac{1}{16}\sum_{k\ne i}\frac{n}{R^\pi_k},$$
which implies that $8(R^\pi + K) \in \mathcal{B}$ as required.

I now show that the lower bound derived in the previous section is tight up to constant factors. The algorithm is a generalisation of MOSS [Audibert and Bubeck, 2009] with two modifications. First, the widths of the confidence bounds are biased in a non-uniform way, and second, the upper confidence bounds are shifted. The new algorithm is functionally identical to MOSS in the special case that $B_i$ is uniform. Define $\log_+(x) = \max\{0, \log(x)\}$.

Input: $n$ and $B_1,\ldots,B_K$
$n_i = n^2/B_i^2$ for all $i$
for $t \in 1,\ldots,n$ do
$$I_t = \arg\max_i\; \hat\mu_{i,T_i(t-1)} + \sqrt{\frac{4}{T_i(t-1)}\log_+\!\left(\frac{n_i}{T_i(t-1)}\right)} - \sqrt{\frac{4}{n_i}}$$
end for
Algorithm 1: Unbalanced MOSS
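To make the pseudocode concrete, here is a minimal, self-contained Python sketch of one run of Unbalanced MOSS as I read Algorithm 1 above (the function names, the reward model, and the example values of $B$ are mine, not from the paper):

```python
import math
import random

def log_plus(x):
    return max(0.0, math.log(x))

def unbalanced_moss(n, B, sample_reward):
    """One run of (my reading of) Algorithm 1.
    n_i = n^2 / B_i^2; index_i = mean_i + sqrt(4/T_i * log+(n_i/T_i)) - sqrt(4/n_i).
    `sample_reward(i)` returns a noisy reward for arm i."""
    K = len(B)
    n_target = [n * n / (b * b) for b in B]
    counts, sums = [0] * K, [0.0] * K

    def index(i):
        if counts[i] == 0:
            return float("inf")   # an unpulled arm has an infinite index
        mean = sums[i] / counts[i]
        bonus = math.sqrt(4.0 / counts[i] * log_plus(n_target[i] / counts[i]))
        return mean + bonus - math.sqrt(4.0 / n_target[i])

    pulls = []
    for _ in range(n):
        arm = max(range(K), key=index)
        counts[arm] += 1
        sums[arm] += sample_reward(arm)
        pulls.append(arm)
    return pulls

# Hypothetical usage: two Gaussian arms, the first arm favoured.
n = 5000
mu = [0.5, 0.4]
B = [n ** 0.5, n ** 0.75]
pulls = unbalanced_moss(n, B, lambda i: random.gauss(mu[i], 1.0))
print([pulls.count(i) for i in range(len(mu))])
```

With uniform $B_i$ all arms share the same bias and shift, so the sketch reduces to ordinary MOSS, matching the remark above.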
Theorem 2.
Let $B \in \mathcal{B}$. Then the strategy $\pi$ given in Algorithm 1 satisfies $R^\pi \le 252\,B$.
Corollary 3. For all $\mu$ the following hold:

1. $R^\pi_{\mu,i^*} \le 252\,B_{i^*}$.
2. $R^\pi_{\mu,i^*} \le \min_i\big(n\Delta_i + 252\,B_i\big)$.

The second part of the corollary is useful when $B_{i^*}$ is large, but there exists an arm for which $n\Delta_i$ and $B_i$ are both small. The proof of Theorem 2 requires a few lemmas. The first is a somewhat standard concentration inequality that follows from a combination of the peeling argument and Doob's maximal inequality.
Lemma 4. Let
$$Z_i = \max_{1 \le s \le n}\left(\mu_i - \hat\mu_{i,s} - \sqrt{\frac{4}{s}\log_+\!\left(\frac{n_i}{s}\right)}\right).$$
Then $\mathbb{P}\{Z_i \ge \Delta\} \le \dfrac{20}{n_i\Delta^2}$ for all $\Delta > 0$.

Proof. Using the peeling device,
$$\mathbb{P}\{Z_i \ge \Delta\} \overset{(a)}{=} \mathbb{P}\left\{\exists s \le n : \mu_i - \hat\mu_{i,s} \ge \Delta + \sqrt{\frac{4}{s}\log_+\!\left(\frac{n_i}{s}\right)}\right\}$$
$$\overset{(b)}{\le} \sum_{k=0}^\infty \mathbb{P}\left\{\exists s < 2^{k+1} : s(\mu_i - \hat\mu_{i,s}) \ge 2^k\Delta + \sqrt{2^{k+2}\log_+\!\left(\frac{n_i}{2^{k+1}}\right)}\right\}$$
$$\overset{(c)}{\le} \sum_{k=0}^\infty \exp\left(-2^{k-2}\Delta^2\right)\min\left\{1, \frac{2^{k+1}}{n_i}\right\} \overset{(d)}{\le} \frac{20}{n_i\Delta^2},$$
where (a) is just the definition of $Z_i$, (b) follows from the union bound and re-arranging the equation inside the probability, (c) follows from Eq. (3) and the definition of $\log_+$, and (d) is obtained by upper bounding the sum by an integral.

In the analysis of traditional bandit algorithms the gap $\Delta_{ji}$ measures how quickly the algorithm can detect the difference between arms $i$ and $j$. By design, however, Algorithm 1 negatively biases its estimate of the empirical mean of arm $i$ by $\sqrt{4/n_i}$. This has the effect of shifting the gaps, which I denote by $\bar\Delta_{ji}$ and define to be
$$\bar\Delta_{ji} = \Delta_{ji} + \sqrt{4/n_j} - \sqrt{4/n_i} = \mu_i - \mu_j + \sqrt{4/n_j} - \sqrt{4/n_i}.$$
Lemma 5. Define the stopping time $\tau_{ji}$ by
$$\tau_{ji} = \min\left\{s : \hat\mu_{j,s} + \sqrt{\frac{4}{s}\log_+\!\left(\frac{n_j}{s}\right)} \le \mu_j + \bar\Delta_{ji}/2\right\}.$$
If $Z_i < \bar\Delta_{ji}/2$, then $T_j(n) \le \tau_{ji}$.

Proof. Let $t$ be the first time step such that $T_j(t-1) = \tau_{ji}$. Then
$$\hat\mu_{j,T_j(t-1)} + \sqrt{\frac{4}{T_j(t-1)}\log_+\!\left(\frac{n_j}{T_j(t-1)}\right)} - \sqrt{4/n_j} \le \mu_j + \bar\Delta_{ji}/2 - \sqrt{4/n_j} = \mu_j + \bar\Delta_{ji} - \bar\Delta_{ji}/2 - \sqrt{4/n_j}$$
$$= \mu_i - \sqrt{4/n_i} - \bar\Delta_{ji}/2 < \hat\mu_{i,T_i(t-1)} + \sqrt{\frac{4}{T_i(t-1)}\log_+\!\left(\frac{n_i}{T_i(t-1)}\right)} - \sqrt{4/n_i},$$
which implies that arm $j$ will not be chosen at time step $t$, and so also not at any subsequent time step by the same argument and induction. Therefore $T_j(n) \le \tau_{ji}$.

Lemma 6. If $\bar\Delta_{ji} > 0$, then
$$\mathbb{E}\tau_{ji} \le \frac{40}{\bar\Delta_{ji}^2} + \frac{64}{\bar\Delta_{ji}^2}\operatorname{ProductLog}\!\left(\frac{n_j\bar\Delta_{ji}^2}{64}\right).$$

Proof. Let $s_0$ be defined by
$$s_0 = \left\lceil\frac{64}{\bar\Delta_{ji}^2}\operatorname{ProductLog}\!\left(\frac{n_j\bar\Delta_{ji}^2}{64}\right)\right\rceil \;\Longrightarrow\; \sqrt{\frac{4}{s_0}\log_+\!\left(\frac{n_j}{s_0}\right)} \le \frac{\bar\Delta_{ji}}{4}.$$
Then
$$\mathbb{E}\tau_{ji} = \sum_{s=1}^n \mathbb{P}\{\tau_{ji} \ge s\} \le 1 + \sum_{s=1}^{n-1}\mathbb{P}\left\{\hat\mu_{j,s} - \mu_j \ge \frac{\bar\Delta_{ji}}{2} - \sqrt{\frac{4}{s}\log_+\!\left(\frac{n_j}{s}\right)}\right\} \le 1 + s_0 + \sum_{s=s_0+1}^{n-1}\mathbb{P}\left\{\hat\mu_{j,s} - \mu_j \ge \frac{\bar\Delta_{ji}}{4}\right\}$$
$$\le 1 + s_0 + \sum_{s=s_0+1}^{\infty}\exp\left(-\frac{s\bar\Delta_{ji}^2}{32}\right) \le 1 + s_0 + \frac{32}{\bar\Delta_{ji}^2} \le \frac{40}{\bar\Delta_{ji}^2} + \frac{64}{\bar\Delta_{ji}^2}\operatorname{ProductLog}\!\left(\frac{n_j\bar\Delta_{ji}^2}{64}\right),$$
where the last inequality follows since $\bar\Delta_{ji} \le 2$.

Proof of Theorem 2. Let $\Delta_0 = 2\sqrt{4/n_i}$ and $A = \{j : \Delta_{ji} > \Delta_0\}$. Then for $j \in A$ we have $\Delta_{ji} \le 2\bar\Delta_{ji}$ and $\bar\Delta_{ji} \ge \sqrt{4/n_i} + \sqrt{4/n_j}$. Letting $\Delta' = \sqrt{1/n_i}$ we have
$$R^\pi_{\mu,i} = \mathbb{E}\sum_{j=1}^K \Delta_{ji}T_j(n) \le n\Delta_0 + \mathbb{E}\sum_{j\in A}\Delta_{ji}T_j(n)$$
$$\overset{(a)}{\le} 4B_i + \mathbb{E}\sum_{j\in A}\Delta_{ji}\tau_{ji} + n\max_{j\in A}\left\{\Delta_{ji} : Z_i \ge \bar\Delta_{ji}/2\right\}$$
$$\overset{(b)}{\le} 4B_i + \sum_{j\in A}\left(\frac{80}{\bar\Delta_{ji}} + \frac{128}{\bar\Delta_{ji}}\operatorname{ProductLog}\!\left(\frac{n_j\bar\Delta_{ji}^2}{64}\right)\right) + 4n\,\mathbb{E}\big[Z_i\mathbb{1}\{Z_i \ge \Delta'\}\big]$$
$$\overset{(c)}{\le} 4B_i + 88\sum_{j\in A}\sqrt{n_j} + 4n\,\mathbb{E}\big[Z_i\mathbb{1}\{Z_i \ge \Delta'\}\big],$$
where (a) follows by using Lemma 5 to bound $T_j(n) \le \tau_{ji}$ when $Z_i < \bar\Delta_{ji}/2$; on the other hand, the total number of pulls of arms $j$ for which $Z_i \ge \bar\Delta_{ji}/2$ is at most $n$. (b) follows by bounding $\tau_{ji}$ in expectation using Lemma 6 and $\Delta_{ji} \le 2\bar\Delta_{ji}$. (c) follows from basic calculus and because for $j \in A$ we have $\bar\Delta_{ji} \ge \sqrt{4/n_j}$. All that remains is to bound the expectation:
$$4n\,\mathbb{E}\big[Z_i\mathbb{1}\{Z_i \ge \Delta'\}\big] \le 4n\Delta'\,\mathbb{P}\{Z_i \ge \Delta'\} + 4n\int_{\Delta'}^\infty \mathbb{P}\{Z_i \ge z\}\,dz \le \frac{160n}{n_i\Delta'} = \frac{160n}{\sqrt{n_i}} = 160B_i,$$
where I have used Lemma 4 and simple identities. Putting it together we obtain
$$R^\pi_{\mu,i} \le 4B_i + 88\sum_{j\in A}\sqrt{n_j} + 160B_i \le 252B_i,$$
where I applied the assumption $B \in \mathcal{B}$, and so $\sum_{j=1}^K \sqrt{n_j} = \sum_{j=1}^K n/B_j \le B_i$.

The above proof may be simplified in the special case that $B$ is uniform, where we recover the minimax regret of MOSS, but with perhaps a simpler proof than was given originally by Audibert and Bubeck [2009].
On Logarithmic Regret

In a recent technical report I demonstrated empirically that MOSS suffers sub-optimal problem-dependent regret in terms of the minimum gap [Lattimore, 2015]. Specifically, it can happen that
$$R^{\mathrm{moss}}_{\mu,i^*} \in \Omega\!\left(\frac{K}{\Delta_{\min}}\log n\right), \qquad (6)$$
where $\Delta_{\min} = \min_{i:\Delta_i > 0}\Delta_i$. On the other hand, the order-optimal asymptotic regret can be significantly smaller. Specifically, UCB by Auer et al. [2002] satisfies
$$R^{\mathrm{ucb}}_{\mu,i^*} \in O\!\left(\sum_{i:\Delta_i > 0}\frac{1}{\Delta_i}\log n\right), \qquad (7)$$
which for unequal gaps can be much smaller than Eq. (6) and is asymptotically order-optimal [Lai and Robbins, 1985]. The problem is that MOSS explores only enough to obtain minimax regret, but sometimes obtains minimax regret even when a more conservative algorithm would do better. It is worth remarking that this effect is harder to observe than one might think. The example given in the aforementioned technical report is carefully tuned to exploit this failing, but still requires very large $n$ and $K$ before significant problems arise. In all other experiments MOSS was performing admirably in comparison to UCB.

All these problems can be avoided by modifying UCB rather than MOSS. The cost is a factor of $O(\sqrt{\log n})$. The algorithm is similar to Algorithm 1, but chooses the action that maximises the following index:
$$I_t = \arg\max_i\; \hat\mu_{i,T_i(t-1)} + \sqrt{\frac{(2+\varepsilon)\log t}{T_i(t-1)}} - \sqrt{\frac{\log n}{n_i}},$$
where $\varepsilon > 0$ is a fixed arbitrary constant.

Theorem 7. If $\pi$ is the strategy of unbalanced UCB with $n_i = n^2/B_i^2$ and $B \in \mathcal{B}$, then the regret of unbalanced UCB satisfies:

1. (problem-independent regret) $R^\pi_{\mu,i^*} \in O\big(B_{i^*}\sqrt{\log n}\big)$.
2. (problem-dependent regret) Let $A = \left\{i : \Delta_i \ge 2\sqrt{(\log n)/n_{i^*}}\right\}$. Then
$$R^\pi_{\mu,i^*} \in O\!\left(B_{i^*}\sqrt{\log n}\,\mathbb{1}\{A = \emptyset\} + \sum_{i\in A}\frac{1}{\Delta_i}\log n\right).$$

The proof may be found in Appendix B. The indicator function in the problem-dependent bound vanishes for sufficiently large $n$ provided $n_{i^*} \in \omega(\log n)$, which is equivalent to $B_{i^*} \in o(n/\sqrt{\log n})$. Thus for reasonable choices of $B_1,\ldots,B_K$ the algorithm is going to enjoy the same asymptotic performance as UCB. Theorem 7 may be proven for any index-based algorithm for which it can be shown that
$$\mathbb{E}T_i(n) \in O\!\left(\frac{1}{\Delta_i^2}\log n\right),$$
which includes (for example) KL-UCB [Cappé et al., 2013] and Thompson sampling (see analysis by Agrawal and Goyal [2012a,b] and original paper by Thompson [1933]), but not OC-UCB [Lattimore, 2015] or MOSS [Audibert and Bubeck, 2009].
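A minimal sketch of the index just described, in the spirit of the earlier Unbalanced MOSS sketch (my reading of the displayed index; the parameter names and the choice of $\varepsilon$ are illustrative):

```python
import math

def unbalanced_ucb_index(mean, count, t, n, n_i, eps=0.1):
    """Index of one arm at time t for the unbalanced UCB variant above:
    mean + sqrt((2 + eps) * log(t) / T_i) - sqrt(log(n) / n_i), with n_i = n^2 / B_i^2."""
    if count == 0:
        return float("inf")
    return (mean
            + math.sqrt((2 + eps) * math.log(t) / count)
            - math.sqrt(math.log(n) / n_i))
```

The play loop is identical to the Unbalanced MOSS sketch, with this index substituted for the MOSS-style one.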
A Note on Constants

The constants in the statement of Theorem 2 can be improved by carefully tuning all thresholds, but the proof would grow significantly and I would not expect a corresponding boost in practical performance. In fact, the reverse is true, since the "weak" bounds used in the proof would propagate to the algorithm. Also note that the 4 appearing in the square root of the unbalanced MOSS algorithm is due to the fact that I am not assuming rewards are bounded in $[0,1]$, for which the variance is at most $1/4$. It is possible to replace the 4 with $2+\varepsilon$ for any $\varepsilon > 0$ by changing the base in the peeling argument in the proof of Lemma 4, as was done by Bubeck [2010] and others.

Experimental Results

I compare MOSS and unbalanced MOSS in two simple simulated examples, both with horizon $n = 5000$. Each data point is an empirical average over enough i.i.d. samples that error bars are too small to see. Code/data is available in the supplementary material. The first experiment has $K = 2$ arms, with $B_1$ and $B_2$ set to fixed powers of $n$ so that the first arm is favoured. I plotted the results for $\mu = (0, -\Delta)$ for varying $\Delta$. As predicted, the new algorithm performs significantly better than MOSS for positive $\Delta$, and significantly worse otherwise (Fig. 1). The second experiment has $K = 10$ arms. This time $B_1 = \sqrt{n}$ and $B_k = (k-1)H\sqrt{n}$ with $H = \sum_{k=1}^{K-1} 1/k$. Results are shown for $\mu_k = \Delta\,\mathbb{1}\{k = i^*\}$ for a range of $\Delta$ and $i^* \in \{1,\ldots,K\}$. Again, the results agree with the theory: the unbalanced algorithm is superior to MOSS when the optimal arm is one of the favoured (small-$B$) arms and inferior otherwise (Fig. 2).

[Figure 1: regret of MOSS and unbalanced MOSS in the two-armed experiment as a function of $\Delta$.]

[Figure 2: regret of MOSS and unbalanced MOSS in the ten-armed experiment, plotted against $\theta = \Delta + (i^*-1)/2$.]

Sadly, the experiments serve only to highlight the plight of the biased learner, which suffers significantly worse results than its unbiased counterpart for most actions.
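A rough replication of the first experiment can be pieced together from the Unbalanced MOSS sketch given earlier. The exponents used for $B_1$ and $B_2$ in the actual experiment are not recoverable from the text above, so the values below are placeholders:

```python
import random

def pseudo_regret(pulls, mu):
    """Pseudo-regret of one run from the sequence of pulled arms."""
    best = max(mu)
    return sum(best - mu[i] for i in pulls)

n, delta, runs = 5000, 0.05, 20
mu = [0.0, -delta]                  # mu = (0, -Delta) as in the experiment
B = [n ** 0.5, n ** 0.75]           # placeholder choice favouring arm 1
total = 0.0
for _ in range(runs):
    # assumes unbalanced_moss from the earlier sketch is in scope
    pulls = unbalanced_moss(n, B, lambda i: random.gauss(mu[i], 1.0))
    total += pseudo_regret(pulls, mu)
print(total / runs)
```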
I have shown that the cost of favouritism for multi-armed bandit algorithms is rather serious. If an algorithm exhibits a small worst-case regret for a specific action, then the worst-case regret of the remaining actions is necessarily significantly larger than the well-known uniform worst-case bound of $\Omega(\sqrt{Kn})$. This unfortunate result is in stark contrast to the experts setting, for which there exist algorithms that suffer constant regret with respect to a single expert at almost no cost for the remainder. Surprisingly, the best achievable (non-uniform) worst-case bounds are determined, up to a permutation, almost entirely by the value of the smallest worst-case regret.

There are some interesting open questions. Most notably, in the adversarial setting I am not sure if the upper or lower bound is tight (or neither). It would also be nice to know if the constant factors can be determined exactly asymptotically, but so far this has not been done even in the uniform case. For the stochastic setting it is natural to ask if the OC-UCB algorithm can also be modified. Intuitively one would expect this to be possible, but it would require re-working the very long proof.

Acknowledgements
I am indebted to the very careful reviewers who made many suggestions for improving this paper. Thank you!

References
Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2012a.
Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the Conference on Learning Theory (COLT), 2012b.
Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322–331. IEEE, 1995.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.
Sébastien Bubeck. Bandits games and clustering foundations. PhD thesis, Université des Sciences et Technologie de Lille - Lille I, 2010.
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated, 2012. ISBN 9781601986269.
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.
Nicolò Cesa-Bianchi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average. Machine Learning, 72(1-2):21–37, 2008.
Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. The Journal of Machine Learning Research, 6:639–660, 2005.
Michael Kapralov and Rina Panigrahy. Prediction strategies without loss. In Advances in Neural Information Processing Systems, pages 828–836, 2011.
Wouter M. Koolen. The Pareto regret frontier. In Advances in Neural Information Processing Systems, pages 863–871, 2013.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
Tor Lattimore. Optimally confident UCB: Improved regret for finite-armed bandits. Technical report, 2015. URL http://arxiv.org/abs/1507.07880.
Che-Yu Liu and Lihong Li. On the prior sensitivity of Thompson sampling. arXiv preprint arXiv:1506.03378, 2015.
Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. In Advances in Neural Information Processing Systems, pages 810–818, 2014.
William Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
A Table of Notation

$n$ — time horizon
$K$ — number of available actions
$t$ — time step
$k, i$ — actions
$\mathcal{B}$ — set of achievable worst-case regrets defined in Eq. (2)
$\partial\mathcal{B}$ — boundary of $\mathcal{B}$
$\mu$ — vector of expected rewards, $\mu \in [0,1]^K$
$\mu^*$ — expected return of the optimal action
$\Delta_j$ — $\mu^* - \mu_j$
$\Delta_{ji}$ — $\mu_i - \mu_j$
$\pi$ — bandit strategy
$I_t$ — action chosen at time step $t$
$R^\pi_{\mu,k}$ — regret of strategy $\pi$ with respect to the $k$th arm
$R^\pi_k$ — worst-case regret of strategy $\pi$ with respect to the $k$th arm
$\hat\mu_{k,s}$ — empirical estimate of the return of the $k$th action after $s$ samples
$T_k(t)$ — number of times action $k$ has been taken at the end of time step $t$
$i^*$ — optimal action
$\log_+(x)$ — maximum of $0$ and $\log(x)$
$\mathcal{N}(\mu,\sigma^2)$ — Gaussian with mean $\mu$ and variance $\sigma^2$
B Proof of Theorem 7

Recall that the proof of UCB depends on showing that
$$\mathbb{E}T_i(n) \in O\!\left(\frac{1}{\Delta_i^2}\log n\right).$$
Now unbalanced UCB operates exactly like UCB, but with shifted rewards. Therefore for unbalanced UCB we have
$$\mathbb{E}T_i(n) \in O\!\left(\frac{1}{\bar\Delta_i^2}\log n\right), \qquad \text{where} \qquad \bar\Delta_i \ge \Delta_i + \sqrt{\frac{\log n}{n_i}} - \sqrt{\frac{\log n}{n_{i^*}}}.$$
Define
$$A = \left\{i : \Delta_i \ge 2\sqrt{\frac{\log n}{n_{i^*}}}\right\}.$$
If $i \in A$, then $\Delta_i \le 2\bar\Delta_i$ and $\bar\Delta_i \ge \sqrt{(\log n)/n_i}$. Therefore
$$\Delta_i\,\mathbb{E}T_i(n) \in O\!\left(\frac{\Delta_i}{\bar\Delta_i^2}\log n\right) \subseteq O\!\left(\frac{1}{\bar\Delta_i}\log n\right) \subseteq O\!\left(\sqrt{n_i\log n}\right) \subseteq O\!\left(\frac{n}{B_i}\sqrt{\log n}\right).$$
For $i \notin A$ we have $\Delta_i < 2\sqrt{(\log n)/n_{i^*}}$, and thus
$$\mathbb{E}\sum_{i\notin A}\Delta_iT_i(n) \in O\!\left(n\sqrt{\frac{\log n}{n_{i^*}}}\right) \subseteq O\!\left(B_{i^*}\sqrt{\log n}\right).$$
Therefore
$$R^\pi_{\mu,i^*} = \sum_{i=1}^K \Delta_i\,\mathbb{E}T_i(n) \in O\!\left(\left(B_{i^*} + \sum_{i\in A}\frac{n}{B_i}\right)\sqrt{\log n}\right) = O\!\left(B_{i^*}\sqrt{\log n}\right),$$
as required. For the problem-dependent bound we work similarly:
$$R^\pi_{\mu,i^*} = \sum_{i=1}^K \Delta_i\,\mathbb{E}T_i(n) \in O\!\left(\sum_{i\in A}\frac{\Delta_i}{\bar\Delta_i^2}\log n + \mathbb{1}\{A = \emptyset\}B_{i^*}\sqrt{\log n}\right) \subseteq O\!\left(\sum_{i\in A}\frac{1}{\Delta_i}\log n + \mathbb{1}\{A = \emptyset\}B_{i^*}\sqrt{\log n}\right).$$
C KL Techniques

Let $\mu_1, \mu_k \in \mathbb{R}^K$ be two bandit environments as defined in the proof of Theorem 1. Here I prove the claim that
$$\mathbb{E}^\pi_{\mu_k}T_k(n) - \mathbb{E}^\pi_{\mu_1}T_k(n) \le n\varepsilon_k\sqrt{\mathbb{E}^\pi_{\mu_1}T_k(n)}.$$
The result follows along the same lines as the proof of the lower bounds given by Auer et al. [1995]. Let $\{\mathcal{F}_t\}_{t=1}^n$ be a filtration where $\mathcal{F}_t$ contains information about rewards and actions chosen up to time step $t$, so $g_{I_t,t}$ and $\mathbb{1}\{I_t = i\}$ are measurable with respect to $\mathcal{F}_t$. Let $P_1$ and $P_k$ be the measures on $\mathcal{F}_n$ induced by bandit problems $\mu_1$ and $\mu_k$ respectively. Note that $T_k(n)$ is an $\mathcal{F}_n$-measurable random variable bounded in $[0,n]$. Therefore
$$\mathbb{E}^\pi_{\mu_k}T_k(n) - \mathbb{E}^\pi_{\mu_1}T_k(n) \overset{(a)}{\le} n\sup_A |P_1(A) - P_k(A)| \overset{(b)}{\le} n\sqrt{\tfrac{1}{2}\operatorname{KL}(P_1, P_k)},$$
where the supremum in (a) is taken over all measurable sets (this is the total variation distance) and (b) follows from Pinsker's inequality. It remains to compute the KL divergence. Let $P_{1,t}$ and $P_{k,t}$ be the conditional measures on the $t$th reward. By the chain rule for the KL divergence we have
$$\operatorname{KL}(P_1, P_k) = \sum_{t=1}^n \mathbb{E}_{P_1}\operatorname{KL}(P_{1,t}, P_{k,t}) \overset{(a)}{=} 2\varepsilon_k^2\sum_{t=1}^n \mathbb{E}_{P_1}\mathbb{1}\{I_t = k\} = 2\varepsilon_k^2\,\mathbb{E}^\pi_{\mu_1}T_k(n),$$
where (a) follows by noting that if $I_t \ne k$, then the distribution of the rewards at time step $t$ is the same for both bandit problems $\mu_1$ and $\mu_k$. For $I_t = k$ the difference in means is $(\mu_k)_k - (\mu_1)_k = 2\varepsilon_k$, and since the distributions are Gaussian the KL divergence is $2\varepsilon_k^2$. For Bernoulli random noise the KL divergence is also $\Theta(\varepsilon_k^2)$ provided $(\mu_k)_k \approx (\mu_1)_k \approx 1/2$, and so a similar proof works for this case. See the work by Auer et al. [1995] for an example.
D Adversarial Bandits

In the adversarial setting I obtain something similar. First I introduce some new notation. Let $g_{i,t} \in [0,1]$ be the gain/reward from choosing action $i$ at time step $t$. This is chosen in an arbitrary way by the adversary, with $g_{i,t}$ possibly even dependent on the actions of the learner up to time step $t-1$. The regret is the difference between the gains obtained by the learner and those of the best action in hindsight:
$$R^\pi_g = \max_{i\in\{1,\ldots,K\}}\mathbb{E}\left[\sum_{t=1}^n\big(g_{i,t} - g_{I_t,t}\big)\right].$$
I make the most obvious modification to the Exp3-$\gamma$ algorithm, which is to bias the prior towards the special action and tune the learning rate accordingly. The algorithm accepts as input the prior $\rho \in [0,1]^K$, which must satisfy $\sum_i \rho_i = 1$, and the learning rate $\eta$.

Input: $K$, $\rho \in [0,1]^K$, $\eta$
$w_{i,0} = \rho_i$ for each $i$
for $t \in 1,\ldots,n$ do
  Let $p_{i,t} = w_{i,t-1}\big/\sum_{j=1}^K w_{j,t-1}$
  Choose action $I_t = i$ with probability $p_{i,t}$ and observe gain $g_{I_t,t}$
  $\tilde\ell_{t,i} = \dfrac{(1 - g_{t,i})\,\mathbb{1}\{I_t = i\}}{p_{i,t}}$
  $w_{i,t} = w_{i,t-1}\exp\big(-\eta\,\tilde\ell_{t,i}\big)$
end for
Algorithm 2: Exp3-$\gamma$
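A minimal Python sketch of Algorithm 2 (my own rendering; the gain model and the prior/learning-rate values in the usage lines are placeholders — the tuned choices are given in Corollary 9 below):

```python
import math
import random

def exp3_gamma(n, rho, eta, sample_gain):
    """Exponential weights on importance-weighted losses, initialised with the prior rho.
    `sample_gain(i, t)` must return a gain g_{i,t} in [0, 1]."""
    K = len(rho)
    w = list(rho)
    total_gain = 0.0
    for t in range(n):
        s = sum(w)
        p = [wi / s for wi in w]
        i = random.choices(range(K), weights=p)[0]
        g = sample_gain(i, t)
        total_gain += g
        loss_est = (1.0 - g) / p[i]          # only the played arm gets a non-zero loss estimate
        w[i] *= math.exp(-eta * loss_est)
    return total_gain

# Placeholder usage: action 1 (index 0) gets most of the prior mass.
n, K = 10000, 5
rho1 = 0.9
rho = [rho1] + [(1 - rho1) / (K - 1)] * (K - 1)
gain = exp3_gamma(n, rho, eta=0.01,
                  sample_gain=lambda i, t: random.random() * (0.5 + 0.5 * (i == 0)))
print(gain)
```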
The following result follows trivially from the standard proof.

Theorem 8 (Bubeck and Cesa-Bianchi [2012]). Let $\pi$ be the strategy determined by Algorithm 2. Then
$$R^\pi_g \le \eta Kn + \frac{1}{\eta}\log\frac{1}{\rho_{i^*}}.$$

Corollary 9. If $\rho$ is given by
$$\rho_i = \begin{cases}\exp\!\left(-\dfrac{B_1^2}{4Kn}\right) & \text{if } i = 1 \\[1ex] (1-\rho_1)/(K-1) & \text{otherwise}\end{cases}$$
and $\eta = B_1/(2Kn)$, then
$$R^\pi_g \le \begin{cases} B_1 & \text{if } i^* = 1 \\[1ex] B_1 + \dfrac{2Kn}{B_1}\log\!\left(\dfrac{8Kn(K-1)}{B_1^2}\right) & \text{otherwise.}\end{cases}$$
Proof. The result follows immediately from Theorem 8 by noting that for $i^* \ne 1$ we have
$$\log\frac{1}{\rho_{i^*}} = \log\frac{K-1}{1 - \exp\!\left(-\tfrac{B_1^2}{4Kn}\right)} \le \log\!\left(\frac{8Kn(K-1)}{B_1^2}\right),$$
as required.
E Concentration

The following straightforward concentration inequality is presumably well known, and the proof of an almost identical result is available in Boucheron et al. [2013], but an exact reference seems hard to find.
Theorem 10. Let $X_1, X_2, \ldots, X_n$ be independent and $1$-subgaussian. Then
$$\mathbb{P}\left\{\exists t \le n : \sum_{s=1}^t X_s \ge \varepsilon\right\} \le \exp\!\left(-\frac{\varepsilon^2}{2n}\right).$$

Proof. Since $X_i$ is $1$-subgaussian, by definition it satisfies
$$(\forall \lambda \in \mathbb{R})\qquad \mathbb{E}[\exp(\lambda X_i)] \le \exp\!\big(\lambda^2/2\big).$$
Now $X_1, X_2, \ldots$ are independent and zero mean, so by convexity of the exponential function $\exp\big(\lambda\sum_{s=1}^t X_s\big)$ is a sub-martingale. Therefore, if $\varepsilon > 0$, then by Doob's maximal inequality
$$\mathbb{P}\left\{\exists t \le n : \sum_{s=1}^t X_s \ge \varepsilon\right\} = \inf_{\lambda \ge 0}\mathbb{P}\left\{\exists t \le n : \exp\!\left(\lambda\sum_{s=1}^t X_s\right) \ge \exp(\lambda\varepsilon)\right\} \le \inf_{\lambda \ge 0}\exp\!\left(\frac{\lambda^2 n}{2} - \lambda\varepsilon\right) = \exp\!\left(-\frac{\varepsilon^2}{2n}\right).$$
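As a quick numerical sanity check of Theorem 10 (my own, with arbitrary parameters), the empirical frequency of the maximal event for standard Gaussian noise stays below the bound $\exp(-\varepsilon^2/(2n))$:

```python
import math
import random

def maximal_event_freq(n, eps, trials=10000):
    """Fraction of runs in which some partial sum of n standard Gaussians reaches eps."""
    hits = 0
    for _ in range(trials):
        s = 0.0
        for _ in range(n):
            s += random.gauss(0.0, 1.0)
            if s >= eps:
                hits += 1
                break
    return hits / trials

n, eps = 50, 15.0
print(maximal_event_freq(n, eps), "<=", math.exp(-eps * eps / (2 * n)))
```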