Unimodal Bandits: Regret Lower Bounds and Optimal Algorithms
Richard Combes
RCOMBES@KTH.SE, KTH, Royal Institute of Technology, Stockholm, Sweden
Alexandre Proutiere
ALEPRO@KTH.SE, KTH, Royal Institute of Technology, Stockholm, Sweden
Abstract
We consider stochastic multi-armed bandits where the expected reward is a unimodal function over partially ordered arms. This important class of problems has been recently investigated in (Cope, 2009; Yu & Mannor, 2011). The set of arms is either discrete, in which case arms correspond to the vertices of a finite graph whose structure represents similarity in rewards, or continuous, in which case arms belong to a bounded interval. For discrete unimodal bandits, we derive asymptotic lower bounds for the regret achieved under any algorithm, and propose OSUB, an algorithm whose regret matches this lower bound. Our algorithm optimally exploits the unimodal structure of the problem, and surprisingly, its asymptotic regret does not depend on the number of arms. We also provide a regret upper bound for OSUB in non-stationary environments where the expected rewards smoothly evolve over time. The analytical results are supported by numerical experiments showing that OSUB performs significantly better than the state-of-the-art algorithms. For continuous sets of arms, we provide a brief discussion. We show that combining an appropriate discretization of the set of arms with the UCB algorithm yields an order-optimal regret, and in practice, outperforms recently proposed algorithms designed to exploit the unimodal structure.
Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).
1. Introduction
Stochastic Multi-Armed Bandits (MAB) (Robbins, 1952; Gittins, 1989) constitute the most fundamental sequential decision problems with an exploration vs. exploitation trade-off. In such problems, the decision maker selects an arm in each round, and observes a realization of the corresponding unknown reward distribution. Each decision is based on past decisions and observed rewards. The objective is to maximize the expected cumulative reward over some time horizon by balancing exploitation (arms with higher observed rewards should be selected often) and exploration (all arms should be explored to learn their average rewards). Equivalently, the performance of a decision rule or algorithm can be measured through its expected regret, defined as the gap between the expected reward achieved by the algorithm and that achieved by an oracle algorithm always selecting the best arm. MAB problems have found many fields of application, including sequential clinical trials, communication systems, and economics, see e.g. (Cesa-Bianchi & Lugosi, 2006; Bubeck & Cesa-Bianchi, 2012).

In their seminal paper (Lai & Robbins, 1985), Lai and Robbins solve MAB problems where the successive rewards of a given arm are i.i.d., and where the expected rewards of the various arms are not related. They derive an asymptotic (when the time horizon grows large) lower bound on the regret satisfied by any algorithm, and present an algorithm whose regret matches this lower bound. This initial algorithm was quite involved, and many researchers have tried to devise simpler and yet efficient algorithms. The most popular of these algorithms are UCB (Auer et al., 2002) and its extensions, e.g. KL-UCB (Garivier & Cappé, 2011; Cappé et al., 2013) (note that the KL-UCB algorithm was initially proposed in (Lai, 1987); see (2.6) therein). When the expected rewards of the various arms are not related (Lai & Robbins, 1985), the regret of the best algorithm is essentially of the order $O(K\log(T))$, where $K$ denotes the number of arms and $T$ is the time horizon. When $K$ is very large or even infinite, MAB problems become more challenging. Fortunately, in such scenarios, the expected rewards often exhibit structural properties that the decision maker can exploit to design efficient algorithms. Various structures have been investigated in the literature, e.g., Lipschitz (Agrawal, 1995; Kleinberg et al., 2008; Bubeck et al., 2008), linear (Dani et al., 2008), and convex (Flaxman et al., 2005).

We consider bandit problems where the expected reward is a unimodal function over partially ordered arms, as in (Yu & Mannor, 2011). The set of arms is either discrete, in which case arms correspond to the vertices of a finite graph whose structure represents similarity in rewards, or continuous, in which case arms belong to a bounded interval. This unimodal structure occurs naturally in many practical decision problems, such as sequential pricing (Yu & Mannor, 2011) and bidding in online sponsored search auctions (Edelman, 2005).

Our contributions.
We mainly investigate unimodal bandits with finite sets of arms, and are primarily interested in cases where the time horizon $T$ is much larger than the number of arms $K$.

(a) For these problems, we derive an asymptotic regret lower bound satisfied by any algorithm. This lower bound does not depend on the structure of the graph, nor on its size: it actually corresponds to the regret lower bound in a classical bandit problem (Lai & Robbins, 1985) where the set of arms is just a neighborhood of the best arm in the graph.

(b) We propose OSUB (Optimal Sampling for Unimodal Bandits), a simple algorithm whose regret matches our lower bound, i.e., it optimally exploits the unimodal structure. The asymptotic regret of OSUB does not depend on the number of arms. This contrasts with LSE (Line Search Elimination), the algorithm proposed in (Yu & Mannor, 2011), whose regret scales as $O(\gamma D \log(T))$, where $\gamma$ is the maximum degree of vertices in the graph and $D$ is its diameter. We present a finite-time analysis of OSUB, and derive a regret upper bound that scales as $O(\gamma \log(T) + K)$. Hence OSUB offers better performance guarantees than LSE as soon as the time horizon satisfies $T \ge \exp(K/\gamma D)$. Although this is not explicitly mentioned in (Yu & Mannor, 2011), we believe that LSE was meant to address bandits where the number of arms is not negligible compared to the time horizon.

(c) We further investigate OSUB performance in non-stationary environments where the expected rewards smoothly evolve over time but keep their unimodal structure.

(d) We conduct numerical experiments and show that OSUB significantly outperforms LSE and other classical bandit algorithms when the number of arms is much smaller than the time horizon.

(e) Finally, we briefly discuss systems with a continuous set of arms. We show that using a simple discretization of the set of arms, UCB-like algorithms are order-optimal, and actually outperform more advanced algorithms such as those proposed in (Yu & Mannor, 2011). This result suggests that in discrete unimodal bandits with a very large number of arms, it is wise to first prune the set of arms, so as to reduce its size to a number of the order of $\sqrt{T}/\log(T)$.
2. Related work
Unimodal bandits have received relatively little attention in the literature. They are specific instances of bandits in metric spaces (Kleinberg, 2004; Kleinberg et al., 2008; Bubeck et al., 2008). In this paper, we add unimodality and show how this structure can be optimally exploited. Unimodal bandits have been specifically addressed in (Cope, 2009; Yu & Mannor, 2011). In (Cope, 2009), bandits with a continuous set of arms are studied, and the author shows that the Kiefer-Wolfowitz stochastic approximation algorithm achieves a regret of the order of $O(\sqrt{T})$ under some strong regularity assumptions on the reward function. In (Yu & Mannor, 2011), for the same problem, the authors present LSE, an algorithm whose regret scales as $O(\sqrt{T}\log(T))$ without the need for a strong regularity assumption. The LSE algorithm is based on Kiefer's golden section search algorithm. It iteratively eliminates subsets of arms based on PAC-bounds derived after appropriate sampling. By design, under LSE, the sequence of parameters used for the PAC bounds is pre-defined, and in particular does not depend on the observed rewards. As a consequence, LSE may explore sub-optimal parts of the set of arms too much. For bandits with a continuum of arms, we actually show that combining an appropriate discretization of the decision space (i.e., reducing the number of arms to $\sqrt{T}/\log(T)$ arms) and the UCB algorithm can outperform LSE in practice (this is due to the adaptive nature of UCB). Note that the parameters used in LSE to get a regret of the order $O(\sqrt{T}\log(T))$ depend on the time horizon $T$. In (Yu & Mannor, 2011), the authors also present an extension of the LSE algorithm to problems with discrete sets of arms, and provide regret upper bounds for this algorithm. These bounds depend on the structure of the graph defining the unimodal structure, and on the number of arms, as mentioned previously. LSE performs better than classical bandit algorithms only when the number of arms is very large, and actually becomes comparable to the time horizon. Here we are interested in bandits with a relatively small number of arms.

Non-stationary bandits have been studied in (Hartland et al., 2007; Garivier & Moulines, 2008; Slivkins & Upfal, 2008; Yu & Mannor, 2011). Except for (Slivkins & Upfal, 2008), these papers deal with environments where the expected rewards and the best arm change abruptly. This ensures that arms are always well separated, and in turn, simplifies the analysis. In (Slivkins & Upfal, 2008), the expected rewards evolve according to independent Brownian motions. We consider a different, but more general class of dynamic environments: here the rewards smoothly evolve over time. The challenge for such environments stems from the fact that, at some time instants, arms can have expected rewards arbitrarily close to each other.

Finally, we should mention that bandit problems with structural properties such as those we address here can often be seen as specific instances of problems in the control of Markov chains, see (Graves & Lai, 1997). We leverage this observation to derive regret lower bounds. However, algorithms developed for the control of generic Markov chains are often too complex to implement in practice. Our algorithm, OSUB, is optimal and straightforward to implement.
3. Model and Objectives
We consider a stochastic multi-armed bandit problem with $K \ge 2$ arms; we discuss problems where the set of arms is continuous in Section 6. Time proceeds in rounds indexed by $n = 1, 2, \ldots$. Let $X_k(n)$ be the reward obtained at time $n$ if arm $k$ is selected. For any $k$, the sequence of rewards $(X_k(n))_{n \ge 1}$ is i.i.d. with distribution and expectation denoted by $\nu_k$ and $\mu_k$, respectively. Rewards are independent across arms. Let $\mu = (\mu_1, \ldots, \mu_K)$ represent the expected rewards of the various arms. At each round, a decision rule or algorithm selects an arm depending on the arms chosen in earlier rounds and their observed rewards. We denote by $k^\pi(n)$ the arm selected under $\pi$ in round $n$. The set $\Pi$ of all possible decision rules consists of policies $\pi$ satisfying: for any $n \ge 1$, if $\mathcal{F}_n^\pi$ is the $\sigma$-algebra generated by $(k^\pi(t), X_{k^\pi(t)}(t))_{1 \le t \le n}$, then $k^\pi(n+1)$ is $\mathcal{F}_n^\pi$-measurable.

The expected rewards exhibit a unimodal structure, similar to that considered in (Yu & Mannor, 2011). More precisely, there exists an undirected graph $G = (V, E)$ whose vertices correspond to arms, i.e., $V = \{1, \ldots, K\}$, and whose edges characterize a partial order (initially unknown to the decision maker) among expected rewards. We assume that there exists a unique arm $k^\star$ with maximum expected reward $\mu^\star$, and that from any sub-optimal arm $k \ne k^\star$, there exists a path $p = (k_1 = k, \ldots, k_m = k^\star)$ of length $m$ (depending on $k$) such that for all $i = 1, \ldots, m-1$, $(k_i, k_{i+1}) \in E$ and $\mu_{k_i} < \mu_{k_{i+1}}$. We denote by $\mathcal{U}_G$ the set of vectors $\mu$ satisfying this unimodal structure. This notion of unimodality is quite general, and includes, as a special case, classical unimodality (where $G$ is just a line). Note that we assume that the decision maker knows the graph $G$, but ignores the best arm, and hence the partial order induced by the edges of $G$.

The model presented above concerns stationary environments, where the expected rewards of the various arms do not evolve over time. In this paper, we also consider non-stationary environments where these expected rewards may evolve over time according to some deterministic dynamics. In such scenarios, we denote by $\mu_k(n)$ the expected reward of arm $k$ at time $n$, i.e., $\mathbb{E}[X_k(n)] = \mu_k(n)$, and $(X_k(n))_{n \ge 1}$ constitutes a sequence of independent random variables with evolving mean. In non-stationary environments, the sequences of rewards are still assumed to be independent across arms. Moreover, at any time $n$, $\mu(n) = (\mu_1(n), \ldots, \mu_K(n))$ is unimodal with respect to some fixed graph $G$, i.e., $\mu(n) \in \mathcal{U}_G$ (note however that the partial order satisfied by the expected rewards may evolve over time).

The performance of an algorithm $\pi \in \Pi$ is characterized by its regret up to time $T$ (where $T$ is typically large). The way regret is defined differs depending on the type of environment.
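To make the path-based definition concrete, here is a minimal sketch (our own code, not part of the paper) that checks whether a given mean vector belongs to $\mathcal{U}_G$ for a graph given as an adjacency list:

```python
# Minimal sketch (ours): check mu in U_G by verifying that every suboptimal
# arm has a neighbor with strictly larger mean from which the best arm is
# reachable along strictly increasing means.
def is_unimodal(mu, adjacency):
    k_star = max(range(len(mu)), key=lambda k: mu[k])
    good = {k_star}  # arms with an increasing path to k_star
    # Process arms by decreasing mean: when arm k is examined, every arm
    # with a larger mean has already been classified.
    for k in sorted(range(len(mu)), key=lambda k: -mu[k]):
        if k != k_star and any(j in good and mu[j] > mu[k] for j in adjacency[k]):
            good.add(k)
    return len(good) == len(mu)

# Example: a line graph with a single peak is unimodal.
mu = [0.1, 0.3, 0.5, 0.4, 0.2]
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(is_unimodal(mu, adj))  # True
```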
Stationary Environments. In such environments, the regret $R^\pi(T)$ of an algorithm $\pi \in \Pi$ is simply defined through the number of times $t_k^\pi(T) = \sum_{1 \le n \le T} \mathbb{1}\{k^\pi(n) = k\}$ that arm $k$ has been selected up to time $T$:
$$R^\pi(T) = \sum_{k=1}^K (\mu^\star - \mu_k)\, \mathbb{E}[t_k^\pi(T)].$$
Our objectives are (1) to identify an asymptotic (when $T \to \infty$) regret lower bound satisfied by any algorithm in $\Pi$, and (2) to devise an algorithm that achieves this lower bound.
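For a single run, this regret is a simple function of the selection counts; a small sketch (ours; averaging over independent runs estimates the expectation):

```python
import numpy as np

# Sketch (ours): empirical counterpart of R(T) = sum_k (mu_star - mu_k) t_k(T),
# computed from the sequence of arms selected in one run.
def regret(mu, selections):
    mu = np.asarray(mu, dtype=float)
    counts = np.bincount(np.asarray(selections), minlength=len(mu))
    return float(np.dot(mu.max() - mu, counts))
```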
Non-stationary Environments. In such environments, the regret of an algorithm $\pi \in \Pi$ quantifies how well $\pi$ tracks the best arm over time. Let $k^\star(n)$ denote the optimal arm, with expected reward $\mu^\star(n)$, at time $n$. The regret of $\pi$ up to time $T$ is hence defined as:
$$R^\pi(T) = \sum_{n=1}^T \left( \mu^\star(n) - \mathbb{E}[\mu_{k^\pi(n)}(n)] \right).$$
4. Stationary environments
In this section, we consider unimodal bandit problems in stationary environments. We derive an asymptotic lower bound on regret when the reward distributions belong to a parametrized family of distributions, and propose OSUB, an algorithm whose regret matches this lower bound.
To simplify the presentation, we assume here that the reward distributions belong to a parametrized family of distributions. More precisely, we define a set of distributions $\mathcal{V} = \{\nu(\theta)\}_{\theta \in [0,1]}$ parametrized by $\theta \in [0,1]$. The expectation of $\nu(\theta)$ is denoted by $\mu(\theta)$ for any $\theta \in [0,1]$. $\nu(\theta)$ is absolutely continuous with respect to some positive measure $m$ on $\mathbb{R}$, and we denote by $p(x, \theta)$ its density. The Kullback-Leibler (KL) divergence between $\nu(\theta)$ and $\nu(\theta')$ is:
$$KL(\theta, \theta') = \int_{\mathbb{R}} \log\left(\frac{p(x,\theta)}{p(x,\theta')}\right) p(x,\theta)\, m(dx).$$
We denote by $\theta^\star$ a parameter (it might not be unique) such that $\mu(\theta^\star) = \mu^\star$, and we define the minimal divergence between $\nu(\theta)$ and $\nu(\theta^\star)$ as:
$$I_{\min}(\theta, \theta^\star) = \inf_{\theta' \in [0,1]:\ \mu(\theta') \ge \mu^\star} KL(\theta, \theta').$$
Finally, we say that arm $k$ has parameter $\theta_k$ if $\nu_k = \nu(\theta_k)$, and we denote by $\Theta_G$ the set of all parameters $\theta = (\theta_1, \ldots, \theta_K) \in [0,1]^K$ such that the corresponding expected rewards are unimodal with respect to graph $G$: $\mu = (\mu_1, \ldots, \mu_K) \in \mathcal{U}_G$. Of particular interest is the family of Bernoulli distributions: the support of $m$ is $\{0, 1\}$, $\mu(\theta) = \theta$, and $I_{\min}(\theta, \theta^\star) = I(\theta, \theta^\star)$, where
$$I(\theta, \theta^\star) = \theta \log\left(\frac{\theta}{\theta^\star}\right) + (1-\theta) \log\left(\frac{1-\theta}{1-\theta^\star}\right)$$
is the KL divergence between Bernoulli distributions of respective means $\theta$ and $\theta^\star$.

We are now ready to derive an asymptotic regret lower bound in parametrized unimodal bandit problems as defined above. Without loss of generality, we restrict our attention to so-called uniformly good algorithms, as defined in (Lai & Robbins, 1985) (uniformly good algorithms exist, as shown later on). We say that $\pi \in \Pi$ is uniformly good if for all $\theta \in \Theta_G$, we have that $R^\pi(T) = o(T^a)$ for all $a > 0$.
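For concreteness, a small helper (our code; the clipping constant is ours, added for numerical safety) implementing the Bernoulli divergence $I(\theta, \theta^\star)$ defined above:

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    # I(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)), clipped away from {0, 1}.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(round(kl_bernoulli(0.3, 0.5), 4))  # 0.0823
```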
Theorem 4.1 Let $\pi \in \Pi$ be a uniformly good algorithm, and assume that $\nu_k = \nu(\theta_k) \in \mathcal{V}$ for all $k$. Then for any $\theta \in \Theta_G$:
$$\liminf_{T \to +\infty} \frac{R^\pi(T)}{\log(T)} \ge c(\theta) = \sum_{(k,k^\star) \in E} \frac{\mu^\star - \mu_k}{I_{\min}(\theta_k, \theta^\star)}. \qquad (1)$$

The above theorem is a consequence of results on the optimal control of Markov chains (Graves & Lai, 1997). All proofs are presented in the appendix. As in classical discrete bandit problems, the regret scales at least logarithmically with time (the regret lower bound derived in (Lai & Robbins, 1985) is obtained from Theorem 4.1 assuming that $G$ is the complete graph). We also observe that the unimodal structure, if optimally exploited, can bring significant performance improvements: the regret lower bound does not depend on the size $K$ of the decision space. Indeed $c(\theta)$ includes only terms corresponding to arms that are neighbors in $G$ of the optimal arm (as if one could learn without regret that all other arms are sub-optimal).

In the case of Bernoulli rewards, the regret lower bound becomes $\log(T) \sum_{(k,k^\star) \in E} \frac{\mu^\star - \mu_k}{I(\theta_k, \theta^\star)}$. Note that LSE and GLSE, the algorithms proposed in (Yu & Mannor, 2011), have performance guarantees that do not match our lower bound: when $G$ is a line, LSE achieves a regret of the order of $O(\Delta^{-1}\log(T))$, whereas in the general case, GLSE incurs a regret of the order of $O(\gamma D \log(T))$, where $\gamma$ is the maximal degree of vertices in $G$ and $D$ is its diameter. The performance of LSE critically depends on the graph structure and on the number of arms. Hence there is an important gap between the performance of existing algorithms and the lower bound derived in Theorem 4.1. In the next section, we close this gap and propose an asymptotically optimal algorithm.

We now describe OSUB, a simple algorithm whose regret matches the lower bound derived in Theorem 4.1 for Bernoulli rewards, i.e., OSUB is asymptotically optimal. The algorithm is based on KL-UCB, proposed in (Lai, 1987; Cappé et al., 2013), and uses KL-divergence upper confidence bounds to define an index for each arm. OSUB can be readily extended to systems where reward distributions belong to one-parameter exponential families by simply modifying the definition of arm indices, as done in (Cappé et al., 2013). In OSUB, each arm is attached an index that resembles the KL-UCB index, but the arm selected at a given time is the arm with maximal index within the neighborhood in $G$ of the arm that yielded the highest empirical reward. Note that since the sequential choices of arms are restricted to some neighborhoods in the graph, OSUB is not an index policy. To formally describe OSUB, we need the following notation. For $p \in [0,1]$, $s \in \mathbb{N}$, and $n \in \mathbb{N}$, we define:
$$F(p, s, n) = \sup\{q \ge p : s\, I(p, q) \le \log(n) + c \log(\log(n))\}, \qquad (2)$$
with the conventions that $F(p, 0, n) = 1$ and $F(1, s, n) = 1$, and where $c$ is a positive constant. Let $k(n)$ be the arm selected under OSUB at time $n$, and let $t_k(n)$ denote the number of times arm $k$ has been selected up to time $n$. The empirical reward of arm $k$ at time $n$ is $\hat\mu_k(n) = \frac{1}{t_k(n)} \sum_{t=1}^n \mathbb{1}\{k(t) = k\} X_k(t)$ if $t_k(n) > 0$, and $\hat\mu_k(n) = 0$ otherwise. We denote by $L(n) = \arg\max_{1 \le k \le K} \hat\mu_k(n)$ the index of the arm with the highest empirical reward (ties are broken arbitrarily). Arm $L(n)$ is referred to as the leader at time $n$. Further define $l_k(n) = \sum_{t=1}^n \mathbb{1}\{L(t) = k\}$, the number of times arm $k$ has been the leader up to time $n$. Now the index of arm $k$ at time $n$ is defined as: $b_k(n) = F(\hat\mu_k(n), t_k(n), l_{L(n)}(n))$.
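The supremum in (2) can be computed numerically since $q \mapsto I(p,q)$ is increasing on $[p, 1]$, so bisection applies. A sketch (ours; the choice $c = 3$ follows common KL-UCB practice and is an assumption, not taken from the paper):

```python
import math

def index_F(p, s, n, c=3.0, iters=50):
    if s == 0 or p >= 1.0 or n < 1:
        return 1.0  # conventions F(p, 0, n) = F(1, s, n) = 1
    budget = (math.log(n) + c * math.log(max(math.log(n), 1.0))) / s
    lo, hi = p, 1.0
    for _ in range(iters):  # bisect for the largest q with I(p, q) <= budget
        mid = (lo + hi) / 2
        if kl_bernoulli(p, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```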
Finally, for any $k$, let $N(k) = \{k' : (k', k) \in E\} \cup \{k\}$ be the neighborhood of $k$ in $G$. The pseudo-code of OSUB is presented below.

Algorithm
OSUB
Input: graph $G = (V, E)$.
For $n \ge 1$, select the arm $k(n)$ where:
$$k(n) = \begin{cases} L(n) & \text{if } \frac{l_{L(n)}(n) - 1}{\gamma + 1} \in \mathbb{N}, \\ \arg\max_{k \in N(L(n))} b_k(n) & \text{otherwise}, \end{cases}$$
where $\gamma$ is the maximal degree of nodes in $G$ and ties are broken arbitrarily.

Note that OSUB forces us to select the current leader often: $L(n)$ is chosen whenever $l_{L(n)}(n) - 1$ is a multiple of $\gamma + 1$. This ensures that the number of times an arm has been selected is at least proportional to the number of times this arm has been the leader. This property significantly simplifies the regret analysis, but it could be removed.

Next we provide a finite-time analysis of the regret achieved under OSUB. Let $\Delta$ denote the minimal separation between an arm and its best adjacent arm: $\Delta = \min_{k \ne k^\star} \max_{k' : (k,k') \in E} (\mu_{k'} - \mu_k)$. Note that $\Delta$ is not known a priori.
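Putting the pieces together, a compact simulation sketch of OSUB (our implementation for Bernoulli rewards, reusing index_F and the adjacency representation from the earlier snippets):

```python
import numpy as np

def osub(mu, adjacency, T, rng):
    K = len(mu)
    gamma = max(len(adjacency[k]) for k in range(K))  # maximal degree of G
    t = np.zeros(K, dtype=int)     # t_k(n): selections of arm k
    s = np.zeros(K)                # cumulative reward of arm k
    lead = np.zeros(K, dtype=int)  # l_k(n): times arm k was the leader
    picks = np.empty(T, dtype=int)
    for n in range(T):
        mu_hat = np.where(t > 0, s / np.maximum(t, 1), 0.0)
        L = int(np.argmax(mu_hat))  # leader (ties broken by argmax)
        lead[L] += 1
        if (lead[L] - 1) % (gamma + 1) == 0:
            k = L  # forced selection of the current leader
        else:
            nbhd = sorted(adjacency[L] | {L})
            b = [index_F(mu_hat[j], int(t[j]), int(lead[L])) for j in nbhd]
            k = nbhd[int(np.argmax(b))]
        reward = float(rng.random() < mu[k])  # Bernoulli sample
        t[k] += 1
        s[k] += reward
        picks[n] = k
    return picks
```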
Theorem 4.2 Assume that the rewards lie in $[0,1]$ (i.e., the support of $\nu_k$ is included in $[0,1]$ for all $k$), and that $(\mu_1, \ldots, \mu_K) \in \mathcal{U}_G$. The number of times suboptimal arm $k$ is selected under OSUB satisfies: for all $\epsilon > 0$ and all $T$,
$$\mathbb{E}[t_k(T)] \le \begin{cases} (1+\epsilon)\,\dfrac{\log(T) + c \log(\log(T))}{I(\mu_k, \mu^\star)} + C_1 \log(\log(T)) + \dfrac{C_2}{T^{\beta(\epsilon)}} & \text{if } (k, k^\star) \in E, \\[2mm] C_1 \log(\log(T)) + \dfrac{C_2}{T^{\beta(\epsilon)}} + \dfrac{C_3}{\Delta^2} & \text{otherwise}, \end{cases}$$
where $\beta(\epsilon) > 0$ and $C_1, C_2, C_3 > 0$ are constants.

To prove this upper bound, we analyze the regret accumulated (i) when the best arm $k^\star$ is the leader, and (ii) when the leader is some arm $k \ne k^\star$. (i) When $k^\star$ is the leader, the algorithm behaves like KL-UCB restricted to the arms around $k^\star$, and the regret at these rounds can be analyzed as in (Cappé et al., 2013). (ii) Bounding the number of rounds where $k^\star$ is not the leader is more involved. To do this, we decompose this set of rounds into further subsets (such as the time instants where $k$ is the leader and its mean is not well estimated), and control their expected cardinalities using concentration inequalities. Along the way, we establish Lemma 4.3, a new concentration inequality of independent interest.

Lemma 4.3
Let $\{Z_t\}_{t \in \mathbb{Z}}$ be a sequence of independent random variables with values in $[0, B]$. Define $\mathcal{F}_n$, the $\sigma$-algebra generated by $\{Z_t\}_{t \le n}$, and the filtration $\mathcal{F} = (\mathcal{F}_n)_{n \in \mathbb{Z}}$. Consider $s \in \mathbb{N}$, $n_0 \in \mathbb{Z}$ and $T \ge n_0$. We define $S_n = \sum_{t=n_0}^n B_t (Z_t - \mathbb{E}[Z_t])$, where $B_t \in \{0, 1\}$ is an $\mathcal{F}_{t-1}$-measurable random variable. Further define $t_n = \sum_{t=n_0}^n B_t$. Define $\phi \in \{n_0, \ldots, T+1\}$, an $\mathcal{F}$-stopping time such that either $t_\phi \ge s$ or $\phi = T + 1$. Then we have:
$$\mathbb{P}[S_\phi \ge t_\phi \delta,\ \phi \le T] \le \exp(-2 s \delta^2 B^{-2}).$$
As a consequence:
$$\mathbb{P}[|S_\phi| \ge t_\phi \delta,\ \phi \le T] \le 2 \exp(-2 s \delta^2 B^{-2}).$$

Lemma 4.3 concerns the sum of products of i.i.d. random variables and of a previsible sequence, evaluated at a stopping time (for the natural filtration). We believe that concentration results for such sums can be instrumental in bandit problems, where typically we need information about the empirical rewards at some specific random time epochs (which often are stopping times). Refer to the appendix for a proof. A direct consequence of Theorem 4.2 is the asymptotic optimality of OSUB in the case of Bernoulli rewards:
Corollary 4.4
Assume that the reward distributions are Bernoulli (i.e., for any $k$, $\nu_k \sim \mathrm{Bernoulli}(\theta_k)$), and that $\theta \in \Theta_G$. Then the regret achieved under $\pi = $ OSUB satisfies:
$$\limsup_{T \to +\infty} R^\pi(T)/\log(T) \le c(\theta).$$
5. Non-stationary environments
We now consider time-varying environments. We assume that the expected reward of each arm varies smoothly over time, i.e., it is Lipschitz continuous: for all $n, n' \ge 1$ and $1 \le k \le K$:
$$|\mu_k(n) - \mu_k(n')| \le \sigma |n - n'|.$$
We further assume that the unimodal structure is preserved (with respect to the same graph $G$): for all $n \ge 1$, $\mu(n) \in \mathcal{U}_G$. Considering smoothly varying rewards is more challenging than scenarios where the environment changes abruptly. The difficulty stems from the fact that the rewards of two or more arms may become arbitrarily close to each other (this happens each time the optimal arm changes), and in such situations, regret is difficult to control. To get a chance to design an algorithm that efficiently tracks the best arm, we need to make an assumption that limits the proportion of time during which the separation between arms becomes too small. Define, for $T \in \mathbb{N}$ and $\Delta > 0$:
$$H(\Delta, T) = \sum_{n=1}^T \sum_{(k,k') \in E} \mathbb{1}\{|\mu_k(n) - \mu_{k'}(n)| < \Delta\}.$$
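In code, $H(\Delta, T)$ is a simple count (our sketch; mu_t holds the $T \times K$ matrix of means and edges the edge list of $G$):

```python
import numpy as np

def H(delta, mu_t, edges):
    # Count pairs (n, (k, k')) with |mu_k(n) - mu_k'(n)| < delta.
    left = mu_t[:, [k for k, _ in edges]]
    right = mu_t[:, [k2 for _, k2 in edges]]
    return int((np.abs(left - right) < delta).sum())
```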
There exist a function $\Phi$ and $\Delta_0 > 0$ such that for all $\Delta < \Delta_0$: $\limsup_{T \to +\infty} H(\Delta, T)/T \le \Phi(K)\, \Delta$.

To cope with the changing environment, we modify the OSUB algorithm so that decisions are based on past choices and observations over a time window of fixed duration equal to $\tau + 1$ rounds. The idea of adding a sliding window to algorithms initially designed for stationary environments is not novel (Garivier & Moulines, 2008); but here, the unimodal structure and the smooth evolution of rewards make the regret analysis more challenging. Define: $t_k^\tau(n) = \sum_{t=n-\tau}^n \mathbb{1}\{k(t) = k\}$; $\hat\mu_k^\tau(n) = (1/t_k^\tau(n)) \sum_{t=n-\tau}^n \mathbb{1}\{k(t) = k\} X_k(t)$ if $t_k^\tau(n) > 0$ and $\hat\mu_k^\tau(n) = 0$ otherwise; $L^\tau(n) = \arg\max_{1 \le k \le K} \hat\mu_k^\tau(n)$; $l_k^\tau(n) = \sum_{t=n-\tau}^n \mathbb{1}\{L^\tau(t) = k\}$. The index of arm $k$ at time $n$ then becomes: $b_k^\tau(n) = F(\hat\mu_k^\tau(n), t_k^\tau(n), l_{L^\tau(n)}^\tau(n))$. The pseudo-code of SW-OSUB is presented below.
Algorithm
SW-OSUB
Input: graph $G = (V, E)$, window size $\tau + 1$.
For $n \ge 1$, select the arm $k(n)$ where:
$$k(n) = \begin{cases} L^\tau(n) & \text{if } \frac{l^\tau_{L^\tau(n)}(n) - 1}{\gamma + 1} \in \mathbb{N}, \\ \arg\max_{k \in N(L^\tau(n))} b_k^\tau(n) & \text{otherwise}. \end{cases}$$

In non-stationary environments, achieving sublinear regret is often not possible. In (Garivier & Moulines, 2008), the environment is subject to abrupt changes or breakpoints. It is shown that if the density of breakpoints is strictly positive, which typically holds in practice, then the regret of any algorithm has to scale linearly with time. We are interested in similar scenarios, and consider smoothly varying environments where the number of times the optimal arm changes has a positive density. The next theorem provides an upper bound on the regret per unit of time achieved under SW-OSUB. This bound holds for any non-stationary environment with $\sigma$-Lipschitz rewards.
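The windowed statistics are the only change with respect to OSUB. A minimal sketch (ours) of how $t_k^\tau$ and $\hat\mu_k^\tau$ can be maintained with a deque; a full SW-OSUB would window the leader counts $l_k^\tau$ in the same way:

```python
from collections import deque
import numpy as np

class WindowStats:
    """Sliding-window counts and empirical means over the last tau+1 rounds."""
    def __init__(self, K, tau):
        self.K = K
        self.window = deque(maxlen=tau + 1)  # (arm, reward) pairs
    def update(self, arm, reward):
        self.window.append((arm, reward))
    def estimates(self):
        t = np.zeros(self.K)
        s = np.zeros(self.K)
        for arm, reward in self.window:
            t[arm] += 1
            s[arm] += reward
        return t, np.where(t > 0, s / np.maximum(t, 1), 0.0)
```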
Theorem 5.1 Let $\Delta$ be such that $\tau\sigma < \Delta < \Delta_0$. Assume that for any $n \ge 1$, $\mu(n) \in \mathcal{U}_G$ and $\mu^\star(n) \in [a, 1-a]$ for some $a > 0$. Further suppose that $\mu_k(\cdot)$ is $\sigma$-Lipschitz for any $k$. The regret per unit of time under $\pi = $ SW-OSUB with a sliding window of size $\tau + 1$ satisfies: if $a > \sigma\tau$, then for any $T \ge 1$,
$$\frac{R^\pi(T)}{T} \le \frac{H(\Delta, T)}{T}(1 + \Delta) + \frac{C_1 K \log(\tau)}{\tau (\Delta - \tau\sigma)^2} + \gamma\, g^{-1}\, \frac{\log(\tau) + c \log(\log(\tau))}{\tau} + \frac{C_2}{\tau (\Delta - \tau\sigma)^2},$$
where $C_1, C_2$ are positive constants and $g = (a - \sigma\tau)(1 - a + \sigma\tau)/2$.

Corollary 5.2
Assume that for any $n \ge 1$, $\mu(n) \in \mathcal{U}_G$ and $\mu^\star(n) \in [a, 1-a]$ for some $a > 0$, and that $\mu_k(\cdot)$ is $\sigma$-Lipschitz for any $k$. Set $\tau = \sigma^{-3/4} \log(1/\sigma)^{1/4}$. The regret per unit of time of $\pi = $ SW-OSUB with window size $\tau + 1$ satisfies:
$$\limsup_{T \to \infty} \frac{R^\pi(T)}{T} \le C\, \Phi(K)\, \sigma^{1/4} \log^{1/4}\!\left(\frac{1}{\sigma}\right) (1 + K j(\sigma)),$$
for some constant $C > 0$ and some function $j$ such that $\lim_{\sigma \to 0^+} j(\sigma) = 0$.

These results state that the regret per unit of time achieved under SW-OSUB decreases, and actually vanishes, as the speed at which the expected rewards evolve decreases to 0. Also observe that the dependence of this regret bound on the number of arms is typically mild (in many practical scenarios, $\Phi(K)$ may actually not depend on $K$).

The proof of Theorem 5.1 relies on the same types of arguments as those used in stationary environments. To establish the regret upper bound, we need to evaluate the performance of the KL-UCB algorithm in non-stationary environments (the result and the corresponding analysis are presented in the appendix).
6. Continuous Set of Arms
In this section, we briefly discuss the case where the decision space is continuous. The set of arms is $[0,1]$, and the expected reward function $\mu : [0,1] \to \mathbb{R}$ is assumed to be Lipschitz continuous and unimodal: there exists $x^\star \in [0,1]$ such that $\mu(x') \ge \mu(x)$ if $x' \in [x, x^\star]$ or $x' \in [x^\star, x]$. Let $\mu^\star = \mu(x^\star)$ denote the highest expected reward. A decision rule selects at any round $n \ge 1$ an arm $x$ and observes the corresponding reward $X(x, n)$. For any $x \in [0,1]$, $(X(x,n))_{n \ge 1}$ is an i.i.d. sequence. We make the following additional assumption on the function $\mu$.

Assumption 2
There exists $\delta_0 > 0$ such that (i) for all $x, y$ in $[x^\star, x^\star + \delta_0]$ (or in $[x^\star - \delta_0, x^\star]$), $C_1 |x - y|^\alpha \le |\mu(x) - \mu(y)|$; (ii) for $\delta \le \delta_0$, if $|x - x^\star| \le \delta$, then $|\mu(x^\star) - \mu(x)| \le C_2 \delta^\alpha$.

This assumption is more general than that used in (Yu & Mannor, 2011). In particular it holds for functions with a plateau and a peak: $\mu(x) = \max(1 - |x - x^\star|/\epsilon,\, 0)$. Now, as for the case of a discrete set of arms, we denote by $\Pi$ the set of possible decision rules, and the regret achieved under rule $\pi \in \Pi$ up to time $T$ is: $R^\pi(T) = T\mu^\star - \sum_{n=1}^T \mathbb{E}[\mu(x^\pi(n))]$, where $x^\pi(n)$ is the arm selected under $\pi$ at time $n$.

There is no known precise asymptotic lower bound for continuous bandits. However, we know that for our problem, the regret must be at least of the order of $\sqrt{T}$, up to a logarithmic factor. In (Yu & Mannor, 2011), the authors show that the LSE algorithm achieves a regret scaling as $O(\sqrt{T}\log(T))$, under more restrictive assumptions. We show that combining discretization and the UCB algorithm, as initially proposed in (Kleinberg, 2004), yields lower regret than LSE in practice (see Section 7), and is order-optimal, i.e., the regret grows as $O(\sqrt{T}\log(T))$.

For $\delta > 0$, we define a discrete bandit problem with $K = \lceil 1/\delta \rceil$ arms, where the rewards of the $k$-th arm are distributed as $X((k-1)\delta, n)$. The expected reward of the $k$-th arm is $\mu_k = \mu((k-1)\delta)$. Let $\pi$ be an algorithm running on this discrete bandit problem. The regret of $\pi$ for the initial continuous bandit problem at time $T$ is: $R^\pi(T) = T\mu^\star - \sum_{k=1}^{\lceil 1/\delta \rceil} \mu_k \mathbb{E}[t_k^\pi(T)]$. We denote by UCB($\delta$) the UCB algorithm (Auer et al., 2002) applied to the discretized bandit. In the following proposition, we show that when $\delta = (\log(T)/\sqrt{T})^{1/\alpha}$, UCB($\delta$) is order-optimal. In practice, one may not know the time horizon $T$ in advance. In this case, using the "doubling trick" (see e.g. (Cesa-Bianchi & Lugosi, 2006)) would incur an additional logarithmic multiplicative factor in the regret.
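A sketch of UCB($\delta$) (our code; reward_fn is a hypothetical sampler returning a stochastic reward for a given arm position, e.g. a Bernoulli draw with mean $\mu(x)$):

```python
import math
import numpy as np

def ucb_delta(reward_fn, T, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    delta = (math.log(T) / math.sqrt(T)) ** (1.0 / alpha)
    K = math.ceil(1.0 / delta)  # arms at positions (k-1)*delta, k = 1..K
    t = np.zeros(K)
    s = np.zeros(K)
    for n in range(T):
        if n < K:
            k = n  # initialization: play each arm once
        else:
            k = int(np.argmax(s / t + np.sqrt(2 * math.log(n + 1) / t)))
        s[k] += reward_fn(k * delta, rng)
        t[k] += 1
    return t, s

# Example with the triangular mean function of Section 7:
mu = lambda x: 0.5 - abs(x - 0.5)
bernoulli = lambda x, rng: float(rng.random() < mu(x))
counts, _ = ucb_delta(bernoulli, T=10000)
```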
Proposition 1 Consider a unimodal bandit on $[0,1]$ with rewards in $[0,1]$ and satisfying Assumption 2. Set $\delta = (\log(T)/\sqrt{T})^{1/\alpha}$. The regret under UCB($\delta$) satisfies:
$$\limsup_{T \to \infty} \frac{R^\pi(T)}{\sqrt{T}\log(T)} \le C_2\, 2^\alpha + 16/C_1.$$
7. Numerical experiments
We compare the performance of our algorithm to that of KL-UCB (Cappé et al., 2013), LSE (Yu & Mannor, 2011), UCB (Auer et al., 2002), and UCB-U. The latter algorithm is obtained by applying UCB restricted to the arms adjacent to the current leader, as in OSUB. We add the prefix "SW" to refer to sliding-window versions of these algorithms.
Stationary environments.
Table 1. $R^\pi(T)/\log(T)$ for different algorithms – 17 arms.

T:       1000    10000   100000
UCB:     30.1    35.1    39
KL-UCB:  18.8    21.4    23
UCB-U:   8.5     11.7    13.9
OSUB:    5.8     5.9     6
LSE:     36.3    271.5   999.1

Figure 1. Regret vs. time in stationary environments – K = 17 arms.

In our first experiment, we consider $K = 17$ arms with Bernoulli rewards of respective averages $\mu = (0.1, 0.2, \ldots, 0.9, 0.8, \ldots, 0.1)$. The rewards are unimodal (the graph $G$ is simply a line). The regret achieved under the various algorithms is presented in Figure 1 and Table 1. The parameters of the LSE algorithm are chosen as suggested in Proposition 4.5 of (Yu & Mannor, 2011). Regrets are averaged over independent runs. OSUB significantly outperforms all other algorithms. The regret achieved under LSE is not presented in Figure 1, because it is typically much larger than that of the other algorithms. This poor performance can be explained by the non-adaptive nature of LSE, as already discussed earlier. LSE can beat UCB when the number of arms is
not negligible compared to the time horizon (e.g. in Figure 4 of (Yu & Mannor, 2011), $K = 250$ and the time horizon is of the order of $K$): in such scenarios, UCB-like algorithms perform poorly because of their initialization phase (all arms have to be tested once).

In Figure 2, the number of arms is 129, and the expected rewards form a triangular shape as in the previous example, with minimum and maximum equal to 0.1 and 0.9, respectively. Similar observations as in the case of 17 arms can be made. We deliberately restrict the plot to small time horizons: this corresponds to scenarios where LSE can perform well.
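For readers who want to replay the stationary experiment, a hypothetical reproduction script (ours, reusing the osub and regret sketches from Sections 3 and 4; the seed and run length are our choices):

```python
import numpy as np

# 17 arms, triangular Bernoulli means 0.1, ..., 0.9, ..., 0.1 on a line graph.
mu = [0.1 * (i + 1) for i in range(9)] + [0.1 * (8 - i) for i in range(8)]
adj = {k: {j for j in (k - 1, k + 1) if 0 <= j < 17} for k in range(17)}
rng = np.random.default_rng(1)
T = 10000
picks = osub(mu, adj, T, rng)
print(regret(mu, picks) / np.log(T))  # compare with the OSUB column of Table 1
```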
Figure 2. Regret vs. time in stationary environments – K = 129 arms.

Figure 3. Regret vs. time in a slowly varying environment – K = 10 arms.

Figure 4. Regret per unit of time $R^\pi(T)/T$ vs. speed $\sigma$ – K = 10 arms.

Figure 5. Regret vs. time for a continuous set of arms.

Figure 6. Normalized regret vs. $K/T$, $T = 5 \cdot 10^4$, for a continuous set of arms.

Non-stationary environments. We now investigate the performance of SW-OSUB in a slowly varying environment. There are $K = 10$ arms whose expected rewards form a moving triangle: for $k = 1, \ldots, K$, $\mu_k(n) = (K-1)/K - |w(n) - k|/K$, where $w(n)$ denotes the position of the peak at time $n$, which sweeps the interval $[1, K]$ at a speed proportional to $\sigma$. Figure 3 presents the regret as a function of time under the various algorithms when the environment evolves slowly (small $\sigma$). The window sizes are set as follows for the various algorithms: $\tau = \sigma^{-2/3}$ for SW-UCB and SW-KL-UCB (the rationale for this choice is explained in the appendix), and $\tau = \sigma^{-3/4}\log(1/\sigma)^{1/4}$ for SW-UCB-U and SW-OSUB. In Figure 4, we show how the speed $\sigma$ impacts the regret per unit of time. SW-OSUB provides the most efficient way of tracking the optimal arm.

In Figure 5, we compare the performance of the LSE and UCB($\delta$) algorithms when the set of arms is continuous. The expected rewards form a triangle: $\mu(x) = 1/2 - |x - 1/2|$, so that $\mu^\star = 1/2$ and $x^\star = 1/2$. The parameters used in LSE are those given in (Yu & Mannor, 2011), whereas the discretization parameter $\delta$ in UCB($\delta$) is set to $\delta = \log(T)/\sqrt{T}$. UCB($\delta$) significantly outperforms LSE at any time: an appropriate discretization of continuous bandit problems might actually be more efficient than other methods based on ideas taken from classical optimization theory.

Figure 6 compares the regret of the discrete version of LSE (with optimized parameters) and of OSUB as the number of arms $K$ grows large, for $T = 50{,}000$. The average rewards of the arms are extracted from the triangle used in the continuous bandit, and we also report the regret achieved under UCB($\delta$). OSUB outperforms UCB($\delta$) even if the number of arms gets as large as 7500! OSUB also beats LSE unless the number of arms exceeds a fixed fraction of $T$.
8. Conclusion
In this paper, we address stochastic bandit problems with a unimodal structure and a finite set of arms. We provide asymptotic regret lower bounds for these problems and design an algorithm that asymptotically achieves the lowest regret possible. Hence our algorithm optimally exploits the unimodal structure of the problem. Our preliminary analysis of the continuous version of this bandit problem suggests that when the number of arms becomes very large and comparable to the time horizon, it might be wiser to prune the set of arms before actually running any algorithm.
References
Agrawal, R. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33(6):1926–1951, November 1995.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. Online optimization in X-armed bandits. In Advances in Neural Information Processing Systems 22, 2008.

Cappé, O., Garivier, A., Maillard, O., Munos, R., and Stoltz, G. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516–1541, June 2013.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Cope, E. W. Regret and convergence bounds for a class of continuum-armed bandit problems. IEEE Trans. Automat. Contr., 54(6):1243–1253, 2009.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In Proc. of Conference On Learning Theory (COLT), pp. 355–366, 2008.

Edelman, B. Strategic bidder behavior in sponsored search auctions. In Proc. of Workshop on Sponsored Search Auctions, ACM Electronic Commerce, pp. 192–198, 2005.

Flaxman, A., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proc. of ACM/SIAM Symposium on Discrete Algorithms (SODA), pp. 385–394, 2005.

Garivier, A. and Cappé, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proc. of Conference On Learning Theory (COLT), 2011.

Garivier, A. and Moulines, E. On upper-confidence bound policies for non-stationary bandit problems. In Proc. of Algorithmic Learning Theory (ALT), 2008.

Gittins, J. C. Bandit Processes and Dynamic Allocation Indices. John Wiley, 1989.

Graves, T. L. and Lai, T. L. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM J. Control and Optimization, 35(3):715–743, 1997.

Hartland, C., Baskiotis, N., Gelly, S., Teytaud, O., and Sebag, M. Change point detection and meta-bandits for online learning in dynamic environments. In Proc. of Conférence Francophone sur l'Apprentissage Automatique (CAp07), 2007.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Proc. of the 40th Annual ACM Symposium on Theory of Computing (STOC), pp. 681–690, 2008.

Kleinberg, R. D. Nearly tight bounds for the continuum-armed bandit problem. In Proc. of the Conference on Neural Information Processing Systems (NIPS), 2004.

Lai, T. L. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, 15(3):1091–1114, 1987.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Slivkins, A. and Upfal, E. Adapting to a changing environment: the Brownian restless bandits. In Proc. of Conference On Learning Theory (COLT), pp. 343–354, 2008.

Yu, J. and Mannor, S. Unimodal bandits. In Proc. of the International Conference on Machine Learning (ICML), pp. 41–48, 2011.
Appendix
The appendix is organized as follows. In Section A, we prove Theorem 4.1. In Section B, we state and prove several concentration inequalities which are the cornerstone of our regret analysis of the OSUB algorithm, for both stationary and non-stationary environments. In Section C, we prove Theorem 4.2. In Section D, we prove Theorem 5.1. Finally, Section E is devoted to the proof of Proposition 1.
A. Proof of Theorem 4.1
We derive here a regret lower bound for the unimodal bandit problem. To this aim, we apply the techniques used by Graves and Lai (Graves & Lai, 1997) to investigate efficient adaptive decision rules in controlled Markov chains. We recall here their general framework. Consider a controlled Markov chain $(X_t)_{t \ge 0}$ on a finite state space $S$ with a control set $U$. The transition probabilities given control $u \in U$ are parametrized by $\theta$ taking values in a compact metric space $\Theta$: the probability to move from state $x$ to state $y$ given the control $u$ and the parameter $\theta$ is $p(x, y; u, \theta)$. The parameter $\theta$ is not known. The decision maker is provided with a finite set of stationary control laws $G = \{g_1, \ldots, g_K\}$, where each control law $g_j$ is a mapping from $S$ to $U$: when control law $g_j$ is applied in state $x$, the applied control is $u = g_j(x)$. It is assumed that if the decision maker always selects the same control law $g$, the Markov chain is irreducible with stationary distribution $\pi_\theta^g$. Now the expected reward obtained when applying control $u$ in state $x$ is denoted by $r(x, u)$, so that the expected reward achieved under control law $g$ is: $\mu_\theta(g) = \sum_x r(x, g(x))\, \pi_\theta^g(x)$. There is an optimal control law given $\theta$, whose expected reward is denoted $\mu_\theta^\star = \max_{g \in G} \mu_\theta(g)$. Now the objective of the decision maker is to sequentially select control laws so as to maximize the expected reward up to a given time horizon $T$. As for MAB problems, the performance of a decision scheme can be quantified through the notion of regret, which compares the expected reward to that obtained by always applying the optimal control law.

We now apply the above framework to our unimodal bandit problem, and we consider $\theta \in \Theta_G$. The Markov chain has values in $S = \mathbb{R}$. The set of control laws is $G = \{1, \ldots, K\}$. These laws are constant, in the sense that the control applied by control law $k$ does not depend on the state of the Markov chain, and corresponds to selecting arm $k$. The transition probabilities are $p(x, y; k, \theta) = p(y, \theta_k)$. Finally, the reward $r(x, k)$ does not depend on the state and is equal to $\mu(\theta_k)$, which is also the expected reward obtained by always using control law $k$.

We now fix $\theta \in \Theta_G$. Define $KL^k(\theta, \lambda) = KL(\theta_k, \lambda_k)$ for any $k$. Further define the set $B(\theta)$ consisting of all bad parameters $\lambda \in \Theta_G$ such that $k^\star$ is not optimal under parameter $\lambda$, but which are statistically indistinguishable from $\theta$:
$$B(\theta) = \{\lambda \in \Theta_G : \lambda_{k^\star} = \theta_{k^\star} \text{ and } \max_k \mu(\lambda_k) > \mu(\lambda_{k^\star})\}.$$
$B(\theta)$ can be written as the union of the sets $B_k(\theta)$, $(k, k^\star) \in E$, defined as:
$$B_k(\theta) = \{\lambda \in B(\theta) : \mu(\lambda_k) > \mu(\lambda_{k^\star})\}.$$
We have $B(\theta) = \cup_{(k,k^\star) \in E} B_k(\theta)$, because if $\mu(\lambda_{k^\star}) < \max_k \mu(\lambda_k)$, then there must exist $k$ such that $(k, k^\star) \in E$ and $\mu(\lambda_k) > \mu(\lambda_{k^\star})$. By applying Theorem 1 in (Graves & Lai, 1997), we know that $c(\theta)$ is the minimal value of the following LP:
$$\min \sum_k c_k (\mu(\theta_{k^\star}) - \mu(\theta_k)) \qquad (3)$$
$$\text{s.t. } \inf_{\lambda \in B_k(\theta)} \sum_{l \ne k^\star} c_l\, KL^l(\theta, \lambda) \ge 1, \quad (k, k^\star) \in E, \qquad (4)$$
$$c_k \ge 0, \quad \forall k. \qquad (5)$$
Next we show that the constraints (4) on the $c_k$'s are equivalent to:
$$\min_{(k,k^\star) \in E} c_k\, I_{\min}(\theta_k, \mu(\theta_{k^\star})) \ge 1. \qquad (6)$$
Consider $k$ fixed with $(k, k^\star) \in E$. We prove that:
$$\inf_{\lambda \in B_k(\theta)} \sum_{l \ne k^\star} c_l\, KL^l(\theta, \lambda) = c_k\, I_{\min}(\theta_k, \mu(\theta_{k^\star})). \qquad (7)$$
This is simply due to the following two observations:

• Since $\mu(\lambda_k) > \mu(\theta_{k^\star})$ and the KL divergence is positive:
$$\sum_{l \ne k^\star} c_l\, KL^l(\theta, \lambda) \ge c_k\, KL^k(\theta, \lambda) \ge c_k\, I_{\min}(\theta_k, \mu(\theta_{k^\star})).$$

• For $\epsilon > 0$, define $\lambda^\epsilon$ as follows: $\mu(\lambda_k^\epsilon) > \mu(\theta_{k^\star})$, $KL(\theta_k, \lambda_k^\epsilon) \le I_{\min}(\theta_k, \mu(\theta_{k^\star})) + \epsilon$, and $\lambda_l^\epsilon = \theta_l$ for $l \ne k$. By construction, $\lambda^\epsilon \in B_k(\theta)$, and
$$\lim_{\epsilon \to 0} \sum_{l \ne k^\star} c_l\, KL^l(\theta, \lambda^\epsilon) = c_k\, I_{\min}(\theta_k, \mu(\theta_{k^\star})).$$

From (7), we deduce that the constraints (4) are equivalent to (6) (indeed, for $(k, k^\star) \in E$, (4) is equivalent to $c_k\, I_{\min}(\theta_k, \mu(\theta_{k^\star})) \ge 1$). With the constraints (6), the optimization problem becomes straightforward to solve, and its solution yields:
$$c(\theta) = \sum_{(k,k^\star) \in E} \frac{\mu(\theta_{k^\star}) - \mu(\theta_k)}{I_{\min}(\theta_k, \mu(\theta_{k^\star}))}. \qquad \square$$

B. Concentration inequalities and Preliminaries
B.1. Proof of Lemma 4.3
We prove Lemma 4.3, a new concentration inequality which extends Hoeffding's inequality and is used for the regret analysis in subsequent sections. We believe that Lemma 4.3 could be useful in a variety of bandit problems where an upper bound on the deviation of the empirical mean sampled at a stopping time is needed. An example would be the probability that the empirical reward of the $k$-th arm deviates from its expectation when it is sampled for the $s$-th time.

Proof.
Let $\lambda > 0$, and define $G_n = \exp(\lambda(S_n - \delta t_n))\, \mathbb{1}\{n \le T\}$. We have:
$$\mathbb{P}[S_\phi \ge t_\phi \delta, \phi \le T] = \mathbb{P}[\exp(\lambda(S_\phi - \delta t_\phi))\, \mathbb{1}\{\phi \le T\} \ge 1] = \mathbb{P}[G_\phi \ge 1] \le \mathbb{E}[G_\phi].$$
Next we provide an upper bound for $\mathbb{E}[G_\phi]$. We define the following quantities:
$$Y_t = B_t\left[\lambda (Z_t - \mathbb{E}[Z_t]) - \lambda^2 B^2/8\right], \qquad \tilde{G}_n = \exp\left(\sum_{t=n_0}^n Y_t\right) \mathbb{1}\{n \le T\},$$
so that $G_n$ can be written: $G_n = \tilde{G}_n \exp(-t_n(\lambda\delta - \lambda^2 B^2/8))$. Setting $\lambda = 4\delta/B^2$:
$$G_n = \tilde{G}_n \exp(-2 t_n \delta^2/B^2).$$
Using the fact that $t_\phi \ge s$ if $\phi \le T$, we can upper bound $G_\phi$ by:
$$G_\phi = \tilde{G}_\phi \exp(-2 t_\phi \delta^2/B^2) \le \tilde{G}_\phi \exp(-2 s \delta^2/B^2).$$
It is noted that the above inequality holds even when $\phi = T + 1$, since $G_{T+1} = \tilde{G}_{T+1} = 0$. Hence:
$$\mathbb{E}[G_\phi] \le \mathbb{E}[\tilde{G}_\phi] \exp(-2 s \delta^2/B^2).$$
We prove that $(\tilde{G}_n)_n$ is a supermartingale. We have $\mathbb{E}[\tilde{G}_{T+1} | \mathcal{F}_T] = 0 \le \tilde{G}_T$. For $n \le T - 1$, since $B_{n+1}$ is $\mathcal{F}_n$-measurable:
$$\mathbb{E}[\tilde{G}_{n+1} | \mathcal{F}_n] = \tilde{G}_n \left((1 - B_{n+1}) + B_{n+1} \mathbb{E}[\exp(Y_{n+1})]\right).$$
As proven by Hoeffding (Hoeffding, 1963, eq. 4.16), since $Z_{n+1} \in [0, B]$:
$$\mathbb{E}[\exp(\lambda(Z_{n+1} - \mathbb{E}[Z_{n+1}]))] \le \exp(\lambda^2 B^2/8),$$
so $\mathbb{E}[\exp(Y_{n+1})] \le 1$ and $(\tilde{G}_n)_n$ is indeed a supermartingale: $\mathbb{E}[\tilde{G}_{n+1} | \mathcal{F}_n] \le \tilde{G}_n$. Since $\phi \le T + 1$ almost surely, and $(\tilde{G}_n)_n$ is a supermartingale, Doob's optional stopping theorem yields $\mathbb{E}[\tilde{G}_\phi] \le \mathbb{E}[\tilde{G}_{n_0 - 1}] = 1$, and so:
$$\mathbb{P}[S_\phi \ge t_\phi \delta, \phi \le T] \le \mathbb{E}[G_\phi] \le \mathbb{E}[\tilde{G}_\phi] \exp(-2 s \delta^2/B^2) \le \exp(-2 s \delta^2/B^2),$$
which concludes the proof. The second inequality is obtained by symmetry. $\square$
Lemma B.1 states that if a set of instants $\Lambda$ can be decomposed into a family of subsets $(\Lambda(s))_{s \ge 1}$ of instants (each subset containing at most one instant) at which $k$ has been tried sufficiently many times ($t_k(n) \ge \epsilon s$ for $n \in \Lambda(s)$), then the expected number of instants in $\Lambda$ at which the average reward of $k$ is badly estimated is finite.

Lemma B.1
Let $k \in \{1, \ldots, K\}$ and $\epsilon > 0$. Define $\mathcal{F}_n$, the $\sigma$-algebra generated by $(X_k(t))_{1 \le t \le n,\, 1 \le k \le K}$. Let $\Lambda \subset \mathbb{N}$ be a (random) set of instants. Assume that there exists a sequence of (random) sets $(\Lambda(s))_{s \ge 1}$ such that (i) $\Lambda \subset \cup_{s \ge 1} \Lambda(s)$, (ii) for all $s \ge 1$ and all $n \in \Lambda(s)$, $t_k(n) \ge \epsilon s$, (iii) $|\Lambda(s)| \le 1$, and (iv) the event $n \in \Lambda(s)$ is $\mathcal{F}_n$-measurable. Then for all $\delta > 0$:
$$\mathbb{E}\left[\sum_{n \ge 1} \mathbb{1}\{n \in \Lambda, |\hat\mu_k(n) - \mathbb{E}[\hat\mu_k(n)]| > \delta\}\right] \le \frac{1}{\epsilon \delta^2}. \qquad (8)$$

Proof.
Let $T \ge 1$. For all $s \ge 1$, since $\Lambda(s)$ has at most one element, define $\phi_s = T + 1$ if $\Lambda(s) \cap \{1, \ldots, T\}$ is empty, and $\{\phi_s\} = \Lambda(s)$ otherwise. Since $\Lambda \subset \cup_{s \ge 1} \Lambda(s)$, we have:
$$\sum_{n=1}^T \mathbb{1}\{n \in \Lambda, |\hat\mu_k(n) - \mathbb{E}[\hat\mu_k(n)]| > \delta\} \le \sum_{s \ge 1} \mathbb{1}\{|\hat\mu_k(\phi_s) - \mathbb{E}[\hat\mu_k(\phi_s)]| > \delta, \phi_s \le T\}.$$
Taking expectations:
$$\mathbb{E}\left[\sum_{n=1}^T \mathbb{1}\{n \in \Lambda, |\hat\mu_k(n) - \mathbb{E}[\hat\mu_k(n)]| > \delta\}\right] \le \sum_{s \ge 1} \mathbb{P}[|\hat\mu_k(\phi_s) - \mathbb{E}[\hat\mu_k(\phi_s)]| > \delta, \phi_s \le T].$$
Since $\phi_s$ is a stopping time upper bounded by $T + 1$, and $t_k(\phi_s) \ge \epsilon s$, we can apply Lemma 4.3 to obtain:
$$\mathbb{E}\left[\sum_{n=1}^T \mathbb{1}\{n \in \Lambda, |\hat\mu_k(n) - \mathbb{E}[\hat\mu_k(n)]| > \delta\}\right] \le \sum_{s \ge 1} 2\exp(-2 s \epsilon \delta^2) \le \frac{1}{\epsilon \delta^2}.$$
We have used the inequality $\sum_{s \ge 1} e^{-sw} \le \int_0^{+\infty} e^{-uw} du = 1/w$. Since the above reasoning is valid for all $T$, we obtain the claim (8). $\square$

A useful corollary of Lemma B.1 is obtained by choosing $\delta = \Delta_{k,k'}/2$ when arms $k$ and $k'$ are separated by at least $\Delta_{k,k'}$.

Lemma B.2
Let $k, k' \in \{1, \ldots, K\}$ with $k \ne k'$, and $\epsilon > 0$. Define $\mathcal{F}_n$, the $\sigma$-algebra generated by $(X_k(t))_{1 \le t \le n,\, 1 \le k \le K}$. Let $\Lambda \subset \mathbb{N}$ be a (random) set of instants. Assume that there exists a sequence of (random) sets $(\Lambda(s))_{s \ge 1}$ such that (i) $\Lambda \subset \cup_{s \ge 1} \Lambda(s)$, (ii) for all $s \ge 1$ and all $n \in \Lambda(s)$, $t_k(n) \ge \epsilon s$ and $t_{k'}(n) \ge \epsilon s$, (iii) for all $s$ we have $|\Lambda(s)| \le 1$ almost surely, (iv) for all $n \in \Lambda$, we have $\mathbb{E}[\hat\mu_k(n)] \le \mathbb{E}[\hat\mu_{k'}(n)] - \Delta_{k,k'}$, and (v) the event $n \in \Lambda(s)$ is $\mathcal{F}_n$-measurable. Then:
$$\mathbb{E}\left[\sum_{n \ge 1} \mathbb{1}\{n \in \Lambda, \hat\mu_k(n) > \hat\mu_{k'}(n)\}\right] \le \frac{8}{\epsilon \Delta_{k,k'}^2}. \qquad (9)$$

Lemma B.3 is straightforward from (Garivier & Cappé, 2011, Theorem 10). It should be observed that this result is not a direct application of Sanov's theorem; Lemma B.3 provides sharper bounds in certain cases, and it is also valid for non-Bernoulli distributed random variables.

Lemma B.3
For $1 \le t_k(n) \le \tau$ and $\delta > 0$, if $\{X_k(i)\}_{1 \le i \le \tau}$ are independent random variables with mean $\mu_k$, we have:
$$\mathbb{P}\left[t_k(n)\, I\left(\frac{1}{t_k(n)} \sum_{i=1}^{t_k(n)} X_k(i),\ \mu_k\right) \ge \delta\right] \le 2 e \lceil \delta \log(\tau) \rceil \exp(-\delta).$$

We present results related to the KL divergence that will be instrumental when manipulating the indexes $b_k(n)$. Lemma B.4 gives an upper and a lower bound on the KL divergence. The lower bound is Pinsker's inequality. The upper bound is due to the fact that $I(p, q)$ is convex in its second argument.

Lemma B.4
For all $p, q \in [0,1]$, $p \le q$:
$$2(p-q)^2 \le I(p, q) \le \frac{(p-q)^2}{q(1-q)}, \qquad (10)$$
and
$$I(p, q) \sim \frac{(p-q)^2}{2\, q(1-q)}, \quad q \to p^+. \qquad (11)$$

Proof.
The lower bound is Pinsker's inequality. For the upper bound, we have:
$$\frac{\partial I}{\partial q}(p, q) = \frac{q - p}{q(1-q)}.$$
Since $q \mapsto \frac{\partial I}{\partial q}(p, q)$ is increasing, the fundamental theorem of calculus gives the announced result:
$$I(p, q) = \int_p^q \frac{\partial I}{\partial u}(p, u)\, du \le (q - p)\, \frac{q - p}{q(1-q)} = \frac{(p-q)^2}{q(1-q)}.$$
The equivalence comes from a Taylor expansion of $q \mapsto I(p, q)$ at $p$, since:
$$\frac{\partial I}{\partial q}(p, q)\Big|_{q=p} = 0, \qquad \frac{\partial^2 I}{\partial q^2}(p, q)\Big|_{q=p} = \frac{1}{p(1-p)}. \qquad \square$$

We prove a deviation bound similar to that of Lemma B.1 for non-stationary environments.
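As an aside, the two bounds in Lemma B.4 are easy to sanity-check numerically (our snippet, reusing kl_bernoulli from the Section 4 sketch):

```python
for p, q in [(0.1, 0.4), (0.3, 0.35), (0.5, 0.9)]:
    i = kl_bernoulli(p, q)
    assert 2 * (p - q) ** 2 <= i <= (p - q) ** 2 / (q * (1 - q))
print("Lemma B.4 bounds hold on the sampled points")
```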
Lemma B.5
Let $k \in \{1, \ldots, K\}$, $n_0 \in \mathbb{N}$ and $\epsilon > 0$. Let $\Lambda \subset \mathbb{N}$ be a (random) set of instants. Assume that there exists a sequence of (random) sets $(\Lambda(s))_{s \ge 1}$ such that (i) $\Lambda \subset \cup_{s \ge 1} \Lambda(s)$, (ii) for all $s \ge 1$ and all $n \in \Lambda(s)$, $t_k(n) \ge \epsilon s$, and (iii) for all $s \ge 1$, $|\Lambda(s) \cap [n_0, n_0 + \tau]| \le 1$. Then for all $\delta > 0$:
$$\mathbb{E}\left[\sum_{n=n_0}^{n_0+\tau} \mathbb{1}\{n \in \Lambda, |\hat\mu_k(n) - \mathbb{E}[\hat\mu_k(n)]| > \delta\}\right] \le \frac{\log(\tau)}{2\epsilon\delta^2} + 2.$$

Proof.
Fix $s_0 \ge 1$. We use the following decomposition, depending on the value of $s$ with respect to $s_0$:
$$\{n \in \Lambda, |\hat\mu_k(n) - \mathbb{E}[\hat\mu_k(n)]| > \delta\} \subset A \cup B,$$
where
$$A = \{n_0, \ldots, n_0 + \tau\} \cap \left(\cup_{1 \le s \le s_0} \Lambda(s)\right),$$
$$B = \{n_0, \ldots, n_0 + \tau\} \cap \{n \in \cup_{s \ge s_0} \Lambda(s) : |\hat\mu_k(n) - \mathbb{E}[\hat\mu_k(n)]| > \delta\}.$$
Since for all $s$, $|\Lambda(s) \cap \{n_0, \ldots, n_0 + \tau\}| \le 1$, we have $|A| \le s_0$. The expected size of $B$ is upper bounded by:
$$\mathbb{E}[|B|] \le \sum_{n=n_0}^{n_0+\tau} \mathbb{P}[n \in \cup_{s \ge s_0} \Lambda(s), |\hat\mu_k(n) - \mathbb{E}[\hat\mu_k(n)]| > \delta] \le \sum_{n=n_0}^{n_0+\tau} \mathbb{P}[|\hat\mu_k(n) - \mathbb{E}[\hat\mu_k(n)]| > \delta, t_k(n) \ge \epsilon s_0].$$
For a given $n$, we apply Lemma 4.3 with $n - \tau$ in place of $n_0$, and $\phi = n$ if $t_k(n) \ge \epsilon s_0$ and $\phi = T + 1$ otherwise. It is noted that $\phi$ is indeed a stopping time. We get:
$$\mathbb{P}[|\hat\mu_k(n) - \mathbb{E}[\hat\mu_k(n)]| > \delta, t_k(n) \ge \epsilon s_0] \le 2\exp(-2 s_0 \epsilon \delta^2).$$
Therefore, setting $s_0 = \log(\tau)/(2\epsilon\delta^2)$,
$$\mathbb{E}[|B|] \le 2\tau \exp(-2 s_0 \epsilon \delta^2) = 2.$$
Finally we obtain the announced result:
$$\mathbb{E}\left[\sum_{n=n_0}^{n_0+\tau} \mathbb{1}\{n \in \Lambda, |\hat\mu_k(n) - \mathbb{E}[\hat\mu_k(n)]| > \delta\}\right] \le \frac{\log(\tau)}{2\epsilon\delta^2} + 2. \qquad \square$$

Lemma B.6
Consider $k, k' \in \{1, \ldots, K\}$, $n_0 \in \mathbb{N}$ and $\epsilon > 0$. Let $\Lambda \subset \mathbb{N}$ be a (random) set of instants. Assume that there exists a sequence of (random) sets $(\Lambda(s))_{s \ge 1}$ such that (i) $\Lambda \subset \cup_{s \ge 1} \Lambda(s)$, (ii) for all $s \ge 1$ and all $n \in \Lambda(s)$, $t_k(n) \ge \epsilon s$ and $t_{k'}(n) \ge \epsilon s$, (iii) for all $s \ge 1$, $|\Lambda(s) \cap [n_0, n_0 + \tau]| \le 1$, and (iv) for all $n \in \Lambda$, we have $\mathbb{E}[\hat\mu_k(n)] \le \mathbb{E}[\hat\mu_{k'}(n)] - \Delta_{k,k'}$. Then:
$$\mathbb{E}\left[\sum_{n=n_0}^{n_0+\tau} \mathbb{1}\{n \in \Lambda, \hat\mu_k(n) > \hat\mu_{k'}(n)\}\right] \le \frac{4\log(\tau)}{\epsilon \Delta_{k,k'}^2} + 4.$$

C. Proofs for stationary environments
C.1. Proof of Theorem 4.2

Notations.
Throughout the proof, by a slight abuse of notation, we omit the floor/ceiling functions when this does not create ambiguity. Consider a suboptimal arm $k \ne k^\star$. Define the gap between the average rewards of $k$ and $k'$: $\Delta_{k,k'} = |\mu_{k'} - \mu_k| > 0$. We use the notation:
$$t_{k,k'}(n) = \sum_{t=1}^n \mathbb{1}\{L(t) = k, k(t) = k'\};$$
$t_{k,k'}(n)$ is the number of times, up to time $n$, that $k'$ has been selected while $k$ was the leader.

Proof.
Let
$T > 0$. The regret $R^{\mathrm{OSUB}}(T)$ of the OSUB algorithm up to time $T$ is:
$$R^{\mathrm{OSUB}}(T) = \sum_{k \ne k^\star} (\mu_{k^\star} - \mu_k)\, \mathbb{E}\left[\sum_{n=1}^T \mathbb{1}\{k(n) = k\}\right].$$
We use the following decomposition:
$$\mathbb{1}\{k(n) = k\} = \mathbb{1}\{L(n) = k^\star, k(n) = k\} + \mathbb{1}\{L(n) \ne k^\star, k(n) = k\}.$$
Now
$$\sum_{k \ne k^\star} (\mu_{k^\star} - \mu_k)\, \mathbb{E}\left[\sum_{n=1}^T \mathbb{1}\{L(n) \ne k^\star, k(n) = k\}\right] \le \sum_{k \ne k^\star} \mathbb{E}\left[\sum_{n=1}^T \mathbb{1}\{L(n) \ne k^\star, k(n) = k\}\right] \le \sum_{k \ne k^\star} \mathbb{E}[l_k(T)].$$
Observing that when $L(n) = k^\star$, the algorithm selects an arm $k$ with $(k, k^\star) \in E$ (or $k^\star$ itself), we deduce that:
$$R^{\mathrm{OSUB}}(T) \le \sum_{k \ne k^\star} \mathbb{E}[l_k(T)] + \sum_{(k,k^\star) \in E} (\mu_{k^\star} - \mu_k)\, \mathbb{E}\left[\sum_{n=1}^T \mathbb{1}\{L(n) = k^\star, k(n) = k\}\right].$$
We then analyze the two terms on the r.h.s. of the above inequality. The first term corresponds to the average number of times when $k^\star$ is not the leader, while the second term represents the regret accumulated when the leader is $k^\star$. The following result states that the first term is $O(\log(\log(T)))$:

Theorem C.1
For $k \ne k^\star$, $\mathbb{E}[l_k(T)] = O(\log(\log(T)))$.

From the above theorem, we conclude that the leader is $k^\star$ except for a negligible number of instants (in expectation). When $k^\star$ is the leader, OSUB behaves as KL-UCB restricted to the set $N(k^\star)$ of possible decisions. Following the same analysis as in (Garivier & Cappé, 2011) (the analysis of KL-UCB), we can show that for all $\epsilon > 0$ there are constants $C_1$, $C_2(\epsilon)$, and $\beta(\epsilon) > 0$ such that:
$$\mathbb{E}\left[\sum_{n=1}^T \mathbb{1}\{L(n) = k^\star, k(n) = k\}\right] \le \mathbb{E}\left[\sum_{n=1}^T \mathbb{1}\{b_k(n) \ge b_{k^\star}(n)\}\right] \le (1+\epsilon)\frac{\log(T)}{I(\mu_k, \mu_{k^\star})} + C_1 \log(\log(T)) + \frac{C_2(\epsilon)}{T^{\beta(\epsilon)}}. \qquad (12)$$
Combining the above bound with Theorem C.1, we get:
$$R^{\mathrm{OSUB}}(T) \le (1+\epsilon)\, c(\theta) \log(T) + O(\log(\log(T))), \qquad (13)$$
which concludes the proof of Theorem 4.2. $\square$

It remains to show that Theorem C.1 holds, which is done in the next section. The proof of Theorem C.1 is technical, and requires the concentration inequalities presented in Section B. The theorem itself is proved in Section C.2.
C.2. Proof of Theorem C.1
Let $k$ be the index of a suboptimal arm, and let $\delta > 0$ and $\epsilon > 0$ be small enough (we provide a more precise definition later on). We define $\bar{k} = \arg\max_{k' : (k,k') \in E} \mu_{k'}$, the best neighbor of $k$. To derive an upper bound on $\mathbb{E}[l_k(T)]$, we decompose the set of times at which $k$ is the leader into the following sets: $\{n \le T : L(n) = k\} \subset A_\epsilon \cup B_\epsilon^T$, where
$$A_\epsilon = \{n : L(n) = k,\ t_{\bar{k}}(n) \ge \epsilon l_k(n)\}, \qquad B_\epsilon^T = \{n \le T : L(n) = k,\ t_{\bar{k}}(n) \le \epsilon l_k(n)\}.$$
Hence we have: $\mathbb{E}[l_k(T)] \le \mathbb{E}[|A_\epsilon| + |B_\epsilon^T|]$. Next we provide upper bounds on $\mathbb{E}[|A_\epsilon|]$ and $\mathbb{E}[|B_\epsilon^T|]$.

Bound on $\mathbb{E}|A_\epsilon|$. Let $n \in A_\epsilon$ and assume that $l_k(n) = s$. By design of the algorithm, $t_k(n) \ge s/(\gamma+1)$. Also $t_{\bar{k}}(n) \ge \epsilon l_k(n) = \epsilon s$. We apply Lemma B.2 with $\Lambda(s) = \{n \in A_\epsilon : l_k(n) = s\}$ and $\Lambda = \cup_{s \ge 1} \Lambda(s)$. Of course, for any $s$, $|\Lambda(s)| \le 1$. We have $A_\epsilon = \{n \in \Lambda : \hat\mu_k(n) \ge \hat\mu_{\bar{k}}(n)\}$, since when $n \in A_\epsilon$, $k$ is the leader. Lemma B.2 can be applied with $k' = \bar{k}$. We get: $\mathbb{E}|A_\epsilon| < \infty$.

Bound on $\mathbb{E}|B_\epsilon^T|$. We introduce the following sets:

• $C_\delta$ is the set of instants at which the average reward of the leader $k$ is badly estimated: $C_\delta = \{n : L(n) = k, |\hat\mu_k(n) - \mu_k| > \delta\}$.

• $D_\delta = \cup_{k' \in N(k) \setminus \{k\}} D_{\delta,k'}$, where $D_{\delta,k'} = \{n : L(n) = k, k(n) = k', |\hat\mu_{k'}(n) - \mu_{k'}| > \delta\}$ is the set of instants at which $k$ is the leader, $k'$ is selected, and the average reward of $k'$ is badly estimated.

• $E_T = \{n \le T : L(n) = k, b_{\bar{k}}(n) \le \mu_{\bar{k}}\}$ is the set of instants at which $k$ is the leader and the upper confidence index $b_{\bar{k}}(n)$ underestimates the average reward $\mu_{\bar{k}}$.

We first prove that $|B_\epsilon^T| \le 2\gamma(1+\gamma)(|C_\delta| + |D_\delta| + |E_T|) + O(1)$ as $T$ grows large, and then provide upper bounds on $\mathbb{E}|C_\delta|$, $\mathbb{E}|D_\delta|$, and $\mathbb{E}|E_T|$. Let $n \in B_\epsilon^T$. When $k$ is the leader, the selected arm is in $N(k)$:
$$l_k(n) = t_{k,k}(n) + \sum_{k' \in N(k) \setminus \{k\}} t_{k,k'}(n).$$
We recall that $t_{k,k'}(n)$ denotes the number of times up to time $n$ when $k$ is the leader and $k'$ is selected. Since $n \in B_\epsilon^T$, $t_{k,\bar{k}}(n) \le \epsilon l_k(n)$, from which we deduce that:
$$(1-\epsilon)\, l_k(n) \le \sum_{k' \in N(k) \setminus \{\bar{k}\}} t_{k,k'}(n).$$
Choose $\epsilon < 1/(2(\gamma+1))$. With this choice, from the previous inequality, we must have that either (a) there exists $k' \in N(k) \setminus \{k, \bar{k}\}$ with $t_{k,k'}(n) \ge l_k(n)/(2(\gamma+1))$, or (b) $t_{k,k}(n) \ge (3/2)\, l_k(n)/(\gamma+1) + 1$.

(a) Assume that $t_{k,k'}(n) \ge l_k(n)/(2(\gamma+1))$. Since $t_{k,k'}(n)$ is only incremented when $k'$ is selected and $k$ is the leader, and since $n \mapsto l_k(n)$ is increasing, there exists a unique $\phi(n) < n$ such that $L(\phi(n)) = k$, $k(\phi(n)) = k'$, and $t_{k,k'}(\phi(n)) = \lfloor l_k(n)/(2(\gamma+1)) \rfloor$. $\phi(n)$ is indeed unique because $t_{k,k'}(\phi(n))$ is incremented at time $\phi(n)$.

Next we prove by contradiction that for $l_k(n) \ge l_0$ large enough and $\delta$ small enough, we must have $\phi(n) \in C_\delta \cup D_\delta \cup E_T$. Assume that $\phi(n) \notin C_\delta \cup D_\delta \cup E_T$. Then $b_{\bar{k}}(\phi(n)) \ge \mu_{\bar{k}}$ and $\hat\mu_{k'}(\phi(n)) \le \mu_{k'} + \delta$. Using Pinsker's inequality and the fact that $t_{k'}(\phi(n)) \ge t_{k,k'}(\phi(n))$:
$$b_{k'}(\phi(n)) \le \hat\mu_{k'}(\phi(n)) + \sqrt{\frac{\log(l_k(\phi(n))) + c\log(\log(l_k(\phi(n))))}{2\, t_{k'}(\phi(n))}} \le \mu_{k'} + \delta + \sqrt{\frac{\log(l_k(n)) + c\log(\log(l_k(n)))}{2 \lfloor l_k(n)/(2(\gamma+1)) \rfloor}}.$$
Now select $\delta < (\mu_{\bar{k}} - \mu_{k'})/4$ and $l_0$ such that
$$\sqrt{\frac{\log(l_0) + c\log(\log(l_0))}{2 \lfloor l_0/(2(\gamma+1)) \rfloor}} \le \delta.$$
If $l_k(n) \geq l_0$:
$$b_{k_0}(\phi(n)) \leq \mu_{k_0} + 2\delta < \mu_{\bar{k}} \leq b_{\bar{k}}(\phi(n)),$$
which implies that $k_0$ cannot be selected at time $\phi(n)$ (because $b_{k_0}(\phi(n)) < b_{\bar{k}}(\phi(n))$), a contradiction.

(b) Assume that $t_{k,k}(n) \geq (3/2)\,l_k(n)/(\gamma+1) + 1 = l_k(n)/(\gamma+1) + l_k(n)/(2(\gamma+1)) + 1$. There are at least $l_k(n)/(2(\gamma+1)) + 1$ instants $\tilde{n}$ such that $l_k(\tilde{n}) - 1$ is not a multiple of $\gamma+1$, $L(\tilde{n}) = k$ and $k(\tilde{n}) = k$. By the same reasoning as in (a) there exists a unique $\phi(n) < n$ such that $L(\phi(n)) = k$, $k(\phi(n)) = k$, $t_{k,k}(\phi(n)) = \lfloor l_k(n)/(2(\gamma+1))\rfloor$ and $l_k(\phi(n)) - 1$ is not a multiple of $\gamma+1$. So $b_k(\phi(n)) \geq b_{\bar{k}}(\phi(n))$. The same reasoning as that applied in (a) (replacing $k_0$ by $k$) yields $\phi(n) \in C_\delta \cup D_\delta \cup E^T$.

We define $B^T_{\epsilon,l_0} = \{n \in B^T_\epsilon : l_k(n) \geq l_0\}$, and we have that $|B^T_\epsilon| \leq l_0 + |B^T_{\epsilon,l_0}|$. We have defined a mapping $\phi$ from $B^T_{\epsilon,l_0}$ to $C_\delta \cup D_\delta \cup E^T$. To bound the size of $B^T_{\epsilon,l_0}$, we use the following decomposition:
$$B^T_{\epsilon,l_0} \subset \bigcup_{n' \in C_\delta \cup D_\delta \cup E^T}\{n \in B^T_{\epsilon,l_0} : \phi(n) = n'\}.$$
Let us fix $n'$. If $n \in B^T_{\epsilon,l_0}$ and $\phi(n) = n'$, then $\lfloor l_k(n)/(2(\gamma+1))\rfloor \in \cup_{k' \in N(k)\setminus\{k\}}\{t_{k,k'}(n')\}$ and $l_k(n)$ is incremented at time $n$ because $L(n) = k$. Therefore:
$$|\{n \in B^T_{\epsilon,l_0} : \phi(n) = n'\}| \leq 2\gamma(\gamma+1).$$
Using the union bound, we obtain the desired result:
$$|B^T_\epsilon| \leq l_0 + |B^T_{\epsilon,l_0}| \leq O(1) + 2\gamma(\gamma+1)(|C_\delta| + |D_\delta| + |E^T|).$$

Bound on $\mathbb{E}|C_\delta|$. We apply Lemma B.1 with $\Lambda(s) = \{n : L(n) = k,\ l_k(n) = s\}$, and $\Lambda = \cup_{s \geq 1}\Lambda(s)$. Then of course, $|\Lambda(s)| \leq 1$ for all $s$. Moreover by design, $t_k(n) \geq s/(\gamma+1)$ when $n \in \Lambda(s)$, so we can choose any $\epsilon_0 < 1/(\gamma+1)$ in Lemma B.1. Now $C_\delta = \{n \in \Lambda : |\hat{\mu}_k(n) - \mu_k| > \delta\}$. From (8), we get $\mathbb{E}|C_\delta| < \infty$.

Bound on $\mathbb{E}|D_\delta|$. Let $k' \in N(k)\setminus\{k\}$. Define for any $s$, $\Lambda(s) = \{n : L(n) = k,\ k(n) = k',\ t_{k'}(n) = s\}$, and $\Lambda = \cup_{s \geq 1}\Lambda(s)$. We have $|\Lambda(s)| \leq 1$, and for any $n \in \Lambda(s)$, $t_{k'}(n) = s \geq \epsilon_0 s$ for any $\epsilon_0 < 1$. We can now apply Lemma B.1 (where $k$ is replaced by $k'$). Note that $D_{\delta,k'} = \{n \in \Lambda : |\hat{\mu}_{k'}(n) - \mu_{k'}| > \delta\}$, and hence (8) leads to $\mathbb{E}|D_{\delta,k'}| < \infty$, and thus $\mathbb{E}|D_\delta| < \infty$.

Bound on $\mathbb{E}|E^T|$. We can show as in (Garivier & Cappé, 2011) (the analysis of KL-UCB) that $\mathbb{E}|E^T| = O(\log(\log(T)))$ (more precisely, this result is a simple application of Theorem 10 in (Garivier & Cappé, 2011)).

We have shown that $\mathbb{E}|B^T_\epsilon| = O(\log(\log(T)))$, and hence $\mathbb{E}[l_k(T)] = O(\log(\log(T)))$, which concludes the proof of Theorem C.1. $\square$
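The contradiction in step (a) rests on the Pinsker-type upper bound $b_k \leq \hat{\mu}_k + \sqrt{(\log l + c\log\log l)/(2t_k)}$. The following self-contained sketch checks this bound numerically against an index computed by bisection; it is an illustration only, and all numerical ranges are arbitrary.

```python
import math, random

def kl(p, q, eps=1e-12):
    """Bernoulli KL divergence I(p, q)."""
    p = min(max(p, eps), 1 - eps); q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def index(mu_hat, t, budget):
    """Largest q with t * I(mu_hat, q) <= budget, by bisection."""
    lo, hi = mu_hat, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if t * kl(mu_hat, mid) <= budget else (lo, mid)
    return lo

random.seed(0)
for _ in range(1000):
    mu_hat, t = random.random(), random.randint(1, 10_000)
    l = random.randint(t, 100_000)  # leadership count, l >= t
    budget = math.log(l) + 3.0 * math.log(max(math.log(l), 1.0))
    # Pinsker: I(p, q) >= 2 (q - p)^2, hence b_k <= mu_hat + sqrt(budget / (2 t))
    assert index(mu_hat, t, budget) <= mu_hat + math.sqrt(budget / (2 * t)) + 1e-6
```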
D. Proofs for non-stationary environments

To simplify the notation, we remove the superscript $\tau$ throughout the proofs; e.g., $t^\tau_k(n)$ and $l^\tau_k(n)$ are denoted by $t_k(n)$ and $l_k(n)$.

D.1. A lemma for sums over a sliding window
We will use Lemma D.1 repeatedly to bound the number of times some events occur over a sliding window of size $\tau$.

Lemma D.1
Let $A \subset \mathbb{N}$, and let $\tau \in \mathbb{N}$ be fixed. Define $a(n) = \sum_{t=n-\tau}^{n-1}\mathbb{1}\{t \in A\}$. Then for all $T \in \mathbb{N}$ and $s \geq 1$ we have the inequality:
$$\sum_{n=1}^{T}\mathbb{1}\{n \in A,\ a(n) \leq s\} \leq 2s\lceil T/\tau\rceil. \quad (14)$$
As a consequence, for all $k \in \{1,\dots,K\}$, we have:
$$\sum_{n=1}^{T}\mathbb{1}\{k(n) = k,\ t_k(n) \leq s\} \leq 2s\lceil T/\tau\rceil, \quad (15)$$
$$\sum_{n=1}^{T}\mathbb{1}\{L(n) = k,\ l_k(n) \leq s\} \leq 2s\lceil T/\tau\rceil.$$
These inequalities are obtained by choosing $A = \{n : k(n) = k\}$ and $A = \{n : L(n) = k\}$ in (14).

Proof.
We decompose $\{1,\dots,T\}$ into $\lceil T/\tau\rceil$ intervals of size $\tau$: $\{1,\dots,\tau\}$, $\{\tau+1,\dots,2\tau\}$, etc. We have:
$$\sum_{n=1}^{T}\mathbb{1}\{n \in A,\ a(n) \leq s\} \leq \sum_{i=0}^{\lceil T/\tau\rceil-1}\sum_{n=1}^{\tau}\mathbb{1}\{n + i\tau \in A,\ a(n + i\tau) \leq s\}. \quad (16)$$
Fix $i$ and assume that $\sum_{n=1}^{\tau}\mathbb{1}\{n + i\tau \in A,\ a(n + i\tau) \leq s\} > s+1$. Then there must exist $n' \leq \tau$ such that $n' + i\tau \in A$ and $\sum_{n=1}^{n'}\mathbb{1}\{n + i\tau \in A,\ a(n + i\tau) \leq s\} = s+1$. Now for any $n'' > n'$ with $n'' + i\tau \in A$, the $s+1$ instants counted up to $n'$ all lie within $\{n'' + i\tau - \tau,\dots,n'' + i\tau - 1\}$, so that $a(n'' + i\tau) \geq s+1 > s$ and $n''$ is not counted. Hence
$$\sum_{n=1}^{\tau}\mathbb{1}\{n + i\tau \in A,\ a(n + i\tau) \leq s\} = \sum_{n=1}^{n'}\mathbb{1}\{n + i\tau \in A,\ a(n + i\tau) \leq s\} = s+1,$$
which is a contradiction. Hence, for all $i$:
$$\sum_{n=1}^{\tau}\mathbb{1}\{n + i\tau \in A,\ a(n + i\tau) \leq s\} \leq s+1 \leq 2s,$$
and substituting in (16) gives the desired result:
$$\sum_{n=1}^{T}\mathbb{1}\{n \in A,\ a(n) \leq s\} \leq \sum_{i=0}^{\lceil T/\tau\rceil-1}2s = 2s\lceil T/\tau\rceil. \qquad\square$$
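As a quick sanity check, the following sketch verifies inequality (14) by brute force on random sets $A$; it merely illustrates the counting argument and is not part of the proof.

```python
import random

random.seed(1)
for trial in range(200):
    T, tau, s = 500, random.randint(2, 50), random.randint(1, 5)
    A = {n for n in range(1, T + 1) if random.random() < 0.3}
    def a(n):  # number of elements of A in the window {n - tau, ..., n - 1}
        return sum(1 for t in range(n - tau, n) if t in A)
    lhs = sum(1 for n in range(1, T + 1) if n in A and a(n) <= s)
    rhs = 2 * s * -(-T // tau)  # 2 s * ceil(T / tau)
    assert lhs <= rhs, (trial, lhs, rhs)
```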
D.2. Regret of SW-KL-UCB

In order to analyze the regret of SW-OSUB, we first have to analyze the regret of SW-KL-UCB, on which SW-OSUB is based.
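As a concrete illustration, here is a minimal sketch of the SW-KL-UCB index computation as we understand it: statistics are computed over the last $\tau$ observations only, and the exploration budget is $\log(\tau) + c\log\log(\tau)$. The class name, the deque-based bookkeeping, and the default value of `c` are our assumptions, not the paper's.

```python
import math
from collections import deque

def kl(p, q, eps=1e-12):
    """Bernoulli KL divergence I(p, q)."""
    p = min(max(p, eps), 1 - eps); q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

class SWKLUCB:
    """Sliding-window KL-UCB: a sketch with a window of size tau."""
    def __init__(self, K, tau, c=3.0):
        self.tau, self.c = tau, c
        self.window = deque()            # (arm, reward) pairs, most recent tau
        self.t = [0] * K                 # windowed play counts t_k(n)
        self.sums = [0.0] * K            # windowed reward sums

    def record(self, arm, reward):
        self.window.append((arm, reward))
        self.t[arm] += 1; self.sums[arm] += reward
        if len(self.window) > self.tau:  # drop the observation leaving the window
            old_arm, old_r = self.window.popleft()
            self.t[old_arm] -= 1; self.sums[old_arm] -= old_r

    def index(self, arm):
        if self.t[arm] == 0:
            return 1.0
        mu_hat = self.sums[arm] / self.t[arm]
        budget = math.log(self.tau) + self.c * math.log(max(math.log(self.tau), 1.0))
        lo, hi = mu_hat, 1.0
        for _ in range(50):              # largest q with t * I(mu_hat, q) <= budget
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if self.t[arm] * kl(mu_hat, mid) <= budget else (lo, mid)
        return lo

    def select(self):
        return max(range(len(self.t)), key=self.index)
```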
Theorem D.2
Let $\Delta$ be such that $2\tau\sigma < \Delta < 1$. Assume that for any $n \geq 1$, $\mu^\star(n) \in [a, 1-a]$ for some $a > 0$. Further suppose that $\mu_k(\cdot)$ is $\sigma$-Lipschitz for any $k$. The regret per unit time under $\pi = $ SW-KL-UCB with a sliding window of size $\tau$ satisfies: if $a > \sigma\tau$, then for any $T \geq 1$,
$$\frac{R^\pi(T)}{T} \leq \frac{H(\Delta, T)}{T}\Delta + K\big(1 + g^{-1/2}\big)\frac{\log(\tau) + c\log(\log(\tau)) + C}{\tau(\Delta - 2\tau\sigma)^2},$$
where $C$ is a positive constant and $g = (a - \sigma\tau)(1 - a + \sigma\tau)/2$.

Recall that due to the changing environment and the use of a sliding window, the empirical reward is a biased estimator of the average reward, and that its bias is upper bounded by $\sigma\tau$. To ease the regret analysis, we first provide bounds on the empirical reward. Unlike in the stationary case, the empirical reward $\hat{\mu}_k(n)$ is not a sum of $t_k(n)$ i.i.d. variables. We define $\bar{X}_k(n', n) = X_k(n') + (\mu_k(n) + \sigma|n'-n| - \mu_k(n'))$, $\underline{X}_k(n', n) = X_k(n') + (\mu_k(n) - \sigma|n'-n| - \mu_k(n'))$ and:
$$\bar{\hat{\mu}}_k(n) = \frac{1}{t_k(n)}\sum_{n'=n-\tau}^{n}\bar{X}_k(n', n)\,\mathbb{1}\{k(n') = k\},\qquad \underline{\hat{\mu}}_k(n) = \frac{1}{t_k(n)}\sum_{n'=n-\tau}^{n}\underline{X}_k(n', n)\,\mathbb{1}\{k(n') = k\}.$$
Then of course, $\underline{\hat{\mu}}_k(n) \leq \hat{\mu}_k(n) \leq \bar{\hat{\mu}}_k(n)$.

Now the regret under $\pi = $ SW-KL-UCB is given by:
$$R^\pi(T) = \sum_{n=1}^{T}\sum_{k=1}^{K}(\mu_{k^\star}(n) - \mu_k(n))\,\mathbb{P}[k(n) = k].$$
We define $I_{\min} = 2(\Delta - 2\tau\sigma)^2$. Let $\epsilon > 0$ and $K_\tau = (1+\epsilon)\frac{\log(\tau) + c\log(\log(\tau))}{I_{\min}}$. We introduce the following sets of events:

(i) $A = \cup_{k=1}^{K}A_k$, where $A_k = \{1 \leq n \leq T : k(n) = k,\ |\mu_k(n) - \mu_{k^\star}(n)| < \Delta\}$. $A_k$ is the set of times at which $k$ is chosen, and $k$ is "close" to the optimal decision. Note that, by definition, $|A| \leq H(\Delta, T)$.

(ii) $B = \{1 \leq n \leq T : b_{k^\star}(n) \leq \mu_{k^\star}(n) - \tau\sigma\}$. $B$ is the set of times at which the index $b_{k^\star}(n)$ underestimates the average reward of the optimal decision (with an error greater than the bias $\tau\sigma$).

(iii) $C = \cup_{k=1}^{K}C_k$, $C_k = \{1 \leq n \leq T : k(n) = k,\ t_k(n) \leq K_\tau\}$. $C_k$ is the set of times at which $k$ is selected and it has been tried less than $K_\tau$ times.

(iv) $D = \cup_{k=1}^{K}D_k$, $D_k = \{1 \leq n \leq T : k(n) = k,\ n \notin (A \cup B \cup C)\}$. $D_k$ is the set of times where (a) $k$ is chosen, (b) $k$ has been tried more than $K_\tau$ times, (c) $k$ is not close to the optimal decision, and (d) the average reward of the optimal decision is not underestimated.

We will show that:
$$\sum_{n \in A}(\mu^\star(n) - \mu_{k(n)}(n)) \leq \Delta H(\Delta, T), \quad (17)$$
and the following inequalities:
$$\mathbb{E}[|B|] \leq O(T/\tau),\qquad \mathbb{E}[|C_k|] \leq 2K_\tau\lceil T/\tau\rceil,\qquad \mathbb{E}[|D_k|] \leq T(\tau\log(\tau)^c)^{-2g\epsilon^2}.$$
We deduce that:
$$R^\pi(T) \leq \Delta H(\Delta, T) + O(T/\tau) + 2KK_\tau\lceil T/\tau\rceil + KT(\tau\log(\tau)^c)^{-2g\epsilon^2},$$
which proves Theorem D.2 (choosing for instance $\epsilon = (2g)^{-1/2}$ and dividing by $T$).

Proof of (17). Let $n \in A$. If $n \in A_k$, by definition we have $|\mu_{k^\star}(n) - \mu_k(n)| < \Delta$. Then if $k(n) = k$, we have that $\mu^\star(n) - \mu_{k(n)}(n) \leq \Delta$ so that:
$$\sum_{n \in A}(\mu^\star(n) - \mu_{k(n)}(n)) \leq \Delta|A| \leq \Delta H(\Delta, T),$$
which completes the proof of (17).

Bound on $\mathbb{E}[|B|]$. Let $n \in B$. Note that $\underline{\hat{\mu}}_{k^\star}(n) \leq \hat{\mu}_{k^\star}(n) \leq b_{k^\star}(n)$. Since $b_{k^\star}(n) \leq \mu_{k^\star}(n) - \sigma\tau$, we deduce that $\underline{\hat{\mu}}_{k^\star}(n) \leq \mu_{k^\star}(n) - \sigma\tau$. Now we have:
$$\mathbb{P}[n \in B] = \mathbb{P}[b_{k^\star}(n) \leq \mu_{k^\star}(n) - \sigma\tau] = \mathbb{P}[t_{k^\star}(n)\,I(\hat{\mu}_{k^\star}(n), \mu_{k^\star}(n) - \sigma\tau) \geq \log(\tau) + c\log(\log(\tau))]$$
$$\stackrel{(a)}{\leq} \mathbb{P}[t_{k^\star}(n)\,I(\underline{\hat{\mu}}_{k^\star}(n), \mu_{k^\star}(n) - \sigma\tau) \geq \log(\tau) + c\log(\log(\tau))] \stackrel{(b)}{\leq} \frac{e}{\tau(\log(\tau))^{c-2}},$$
where (a) is due to the fact that $\underline{\hat{\mu}}_{k^\star}(n) \leq \hat{\mu}_{k^\star}(n)$, and (b) is obtained applying Lemma B.3. Hence: $\mathbb{E}[|B|] \leq O(T/\tau)$.
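A small sketch of the windowed statistics used above: the empirical mean over the last $\tau$ instants, and the bias-corrected variants $\bar{\hat{\mu}}_k$ and $\underline{\hat{\mu}}_k$ obtained by shifting each sample by $\pm\sigma|n'-n|$. This is only an illustration of the definitions; it assumes access to the unknown means $\mu_k(\cdot)$, which the analysis (not the algorithm) uses.

```python
def windowed_means(history, k, n, tau, sigma, mu_k):
    """history: list of (arm, reward) pairs indexed by time; mu_k(n'): true
    mean of arm k at time n'. Returns (mu_hat, mu_bar, mu_under) computed over
    the window {n - tau, ..., n}. mu_bar / mu_under are analysis devices: each
    sample X_k(n') is shifted by mu_k(n) +/- sigma*|n'-n| - mu_k(n')."""
    plays = [(t, r) for t, (arm, r) in enumerate(history)
             if arm == k and n - tau <= t <= n]
    cnt = len(plays)
    if cnt == 0:
        return None
    mu_hat = sum(r for _, r in plays) / cnt
    mu_bar = sum(r + mu_k(n) + sigma * abs(t - n) - mu_k(t) for t, r in plays) / cnt
    mu_under = sum(r + mu_k(n) - sigma * abs(t - n) - mu_k(t) for t, r in plays) / cnt
    return mu_hat, mu_bar, mu_under
```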
Bound on $\mathbb{E}[|C_k|]$. Using Lemma D.1, we get $|C_k| \leq 2K_\tau\lceil T/\tau\rceil$, and hence $|C| \leq 2KK_\tau\lceil T/\tau\rceil$.

Bound on $\mathbb{E}[|D_k|]$. We will prove that $n \in D_k$ implies that $\bar{\hat{\mu}}_k(n)$ deviates from its expectation by at least $f(\epsilon, I_{\min}) > 0$, so that:
$$\mathbb{P}[n \in D_k] \leq \mathbb{P}\big[\bar{\hat{\mu}}_k(n) - \mathbb{E}[\bar{\hat{\mu}}_k(n)] > f(\epsilon, I_{\min})\big].$$
Let $n \in D_k$. Since $k(n) = k$ and $b_{k^\star}(n) \geq \mu_{k^\star}(n) - \sigma\tau$, we have $b_k(n) \geq \mu_{k^\star}(n) - \sigma\tau$. We decompose $D_k$ as follows: $D_k = D_{k,1} \cup D_{k,2}$,
$$D_{k,1} = \{n \in D_k : \hat{\mu}_k(n) \geq \mu_{k^\star}(n) - \sigma\tau\},\qquad D_{k,2} = \{n \in D_k : \hat{\mu}_k(n) \leq \mu_{k^\star}(n) - \sigma\tau\}.$$
If $n \in D_{k,1}$, $\bar{\hat{\mu}}_k(n) - \mathbb{E}[\bar{\hat{\mu}}_k(n)] \geq \mu_{k^\star}(n) - \mu_k(n) - 2\sigma\tau > 0$ so that $\bar{\hat{\mu}}_k(n)$ indeed deviates from its expectation. Now let $n \in D_{k,2}$. We have:
$$\mathbb{P}[n \in D_{k,2}] \leq \mathbb{P}[b_k(n) \geq \mu_{k^\star}(n) - \sigma\tau,\ n \in D_{k,2}] = \mathbb{P}[t_k(n)\,I(\hat{\mu}_k(n), \mu_{k^\star}(n) - \sigma\tau) \leq \log(\tau) + c\log(\log(\tau)),\ n \in D_{k,2}]$$
$$\stackrel{(a)}{\leq} \mathbb{P}[K_\tau I(\bar{\hat{\mu}}_k(n), \mu_{k^\star}(n) - \sigma\tau) \leq \log(\tau) + c\log(\log(\tau)),\ t_k(n) \geq K_\tau] = \mathbb{P}\Big[I(\bar{\hat{\mu}}_k(n), \mu_{k^\star}(n) - \sigma\tau) \leq \frac{I_{\min}}{1+\epsilon},\ t_k(n) \geq K_\tau\Big],$$
where in (a) we used the facts that: $\hat{\mu}_k(n) \leq \mu_{k^\star}(n) - \sigma\tau$, $\bar{\hat{\mu}}_k(n) \geq \hat{\mu}_k(n)$, and $t_k(n) \geq K_\tau$ (as $n \notin C$). It is noted that since $n \notin A_k$, by Pinsker's inequality we have that:
$$I(\mu_k(n) + \tau\sigma, \mu_{k^\star}(n) - \tau\sigma) \geq 2(\mu_{k^\star}(n) - \mu_k(n) - 2\tau\sigma)^2 \geq 2(\Delta - 2\tau\sigma)^2 = I_{\min}.$$
By continuity and monotonicity of the KL divergence, there exists a unique positive function $f$ such that:
$$I(\mu_k(n) + \sigma\tau + f(\epsilon, I_{\min}), \mu_{k^\star}(n) - \sigma\tau) = \frac{I_{\min}}{1+\epsilon},\qquad \mu_k(n) + \sigma\tau + f(\epsilon, I_{\min}) \leq \mu_{k^\star}(n) - \sigma\tau.$$
We are interested in the asymptotic behavior of $f$ when $\epsilon$ and $I_{\min}$ both tend to $0$. Define $\mu'$, $\mu''$ and $\mu$ such that
$$\mu_k(n) + \sigma\tau \leq \mu' \leq \mu'' \leq \mu = \mu_{k^\star}(n) - \sigma\tau,$$
and $I(\mu', \mu) = I_{\min}$, $I(\mu'', \mu) = I_{\min}/(1+\epsilon)$. Using the equivalent (11) given in Lemma B.4, there exists a function $a_0$ such that:
$$\frac{(\mu - \mu')^2}{2\mu(1-\mu)}(1 + a_0(\mu - \mu')) = I_{\min},\qquad \frac{(\mu - \mu'')^2}{2\mu(1-\mu)}(1 + a_0(\mu - \mu'')) = \frac{I_{\min}}{1+\epsilon},$$
with $a_0(\delta) \to 0$ when $\delta \to 0^+$. It is noted that $0 \leq \mu - \mu'' \leq \mu - \mu' = o(1)$ when $I_{\min} \to 0^+$ by continuity of the KL divergence. Hence:
$$\mu'' - \mu' = \Big(\frac{\epsilon}{2} + o(1)\Big)\sqrt{2\mu(1-\mu)I_{\min}}.$$
Using the inequality $f(\epsilon, I_{\min}) = \mu'' - (\mu_k(n) + \sigma\tau) \geq \mu'' - \mu'$ and the fact that $\mu(1-\mu) \geq (a - \sigma\tau)(1 - a + \sigma\tau)$, we have proved that:
$$f(\epsilon, I_{\min}) \geq \epsilon\sqrt{g I_{\min}} + o(\epsilon),\qquad\text{with } g = (a - \sigma\tau)(1 - a + \sigma\tau)/2.$$
Therefore, since $\mathbb{E}[\bar{\hat{\mu}}_k(n)] \leq \mu_k(n) + \sigma\tau$, as claimed, we have
$$\mathbb{P}[n \in D_k] \leq \mathbb{P}\big[\bar{\hat{\mu}}_k(n) - \mathbb{E}[\bar{\hat{\mu}}_k(n)] \geq f(\epsilon, I_{\min}),\ t_k(n) \geq K_\tau\big].$$
We now apply Lemma 4.3 with $n - \tau$ in place of $n_0$, $K_\tau$ in place of $s$ and $\phi = n$ if $t_k(n) \geq K_\tau$ and $\phi = T+1$ otherwise. We obtain, for all $n$:
$$\mathbb{P}[n \in D_k] \leq \mathbb{P}\big[\bar{\hat{\mu}}_k(n) - \mathbb{E}[\bar{\hat{\mu}}_k(n)] \geq f(\epsilon, I_{\min}),\ t_k(n) \geq K_\tau\big] \leq \exp\big(-2K_\tau f(\epsilon, I_{\min})^2\big) \leq (\tau\log(\tau)^c)^{-2g\epsilon^2},$$
and we get the desired bound by summing over $n$:
$$\mathbb{E}[|D_k|] = \sum_{n=1}^{T}\mathbb{P}[n \in D_k] \leq T(\tau\log(\tau)^c)^{-2g\epsilon^2}.$$
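The function $f(\epsilon, I_{\min})$ above is defined implicitly through the Bernoulli KL divergence; the following sketch solves for it numerically by bisection and compares it to the first-order lower bound on one example. All numerical values are arbitrary illustrations.

```python
import math

def kl(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps); q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def solve_f(mu_lo, mu, target):
    """Smallest shift f with I(mu_lo + f, mu) <= target, i.e. the solution of
    I(mu_lo + f, mu) = target on [0, mu - mu_lo] (I decreases there in f)."""
    lo, hi = 0.0, mu - mu_lo
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if kl(mu_lo + mid, mu) <= target else (mid, hi)
    return hi

# arbitrary example: mu_k + sigma*tau = 0.40, mu_star - sigma*tau = 0.55
mu_lo, mu, eps_ = 0.40, 0.55, 0.05
I_min = kl(mu_lo, mu)                  # here I(mu', mu) = I_min exactly
f = solve_f(mu_lo, mu, I_min / (1 + eps_))
g = mu * (1 - mu) / 2                  # plays the role of (a - st)(1 - a + st)/2
print(f, eps_ * math.sqrt(g * I_min))  # f dominates the first-order term
```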
D.3. Proof of Theorem 5.1

We first introduce some notations. For any set $A$ of instants, we use the notation $A[n_1, n_2] = A \cap \{n_1,\dots,n_2\}$. Let $n_1 \leq n_2$. We define $t_k(n_1, n_2)$ the number of times $k$ has been chosen during the interval $\{n_1,\dots,n_2\}$, $l_k(n_1, n_2)$ the number of times $k$ has been the leader, and $t_{k,k'}(n_1, n_2)$ the number of times $k'$ has been chosen while $k$ was the leader:
$$t_k(n_1, n_2) = \sum_{n'=n_1}^{n_2}\mathbb{1}\{k(n') = k\},\quad l_k(n_1, n_2) = \sum_{n'=n_1}^{n_2}\mathbb{1}\{L(n') = k\},\quad t_{k,k'}(n_1, n_2) = \sum_{n'=n_1}^{n_2}\mathbb{1}\{L(n') = k,\ k(n') = k'\}.$$
Note that $l_k(n-\tau, n) = l_k(n)$, $t_k(n-\tau, n) = t_k(n)$ and $t_{k,k'}(n-\tau, n) = t_{k,k'}(n)$. Given $\Delta > 0$, we define the set of instants at which the average reward of $k$ is separated from the average reward of its neighbours by at least $\Delta$:
$$N_k(\Delta) = \bigcap_{(k',k) \in E}\{n : |\mu_k(n) - \mu_{k'}(n)| > \Delta\}.$$
We further define the set of instants at which $k$ is suboptimal, $k$ is the leader, and it is well separated from its neighbors:
$$\mathcal{L}_k(\Delta) = \{n : L(n) = k \neq k^\star(n),\ n \in N_k(\Delta)\}.$$
By definition of the regret under $\pi = $ SW-OSUB:
$$R^\pi(T) = \sum_{n=1}^{T}\sum_{k \neq k^\star(n)}(\mu_{k^\star}(n) - \mu_k(n))\,\mathbb{P}[k(n) = k].$$
To bound the regret, as in the stationary case, we split the regret into two components: the regret accumulated when the leader is the optimal arm, and the regret generated when the leader is not the optimal arm. The regret when the leader is suboptimal satisfies:
$$\sum_{n=1}^{T}\sum_{k \neq k^\star(n)}(\mu_{k^\star}(n) - \mu_k(n))\,\mathbb{1}\{k(n) = k,\ L(n) \neq k^\star(n)\} \leq \sum_{n=1}^{T}\mathbb{1}\{L(n) \neq k^\star(n)\} \leq \sum_{n=1}^{T}\sum_{k \neq k^\star(n)}\mathbb{1}\{L(n) = k \neq k^\star(n)\}$$
$$\leq \sum_{n=1}^{T}\sum_{k \neq k^\star(n)}\Big(\mathbb{1}\{n \in \mathcal{L}_k(\Delta)\} + \mathbb{1}\{\exists k' : (k',k) \in E,\ |\mu_k(n) - \mu_{k'}(n)| \leq \Delta\}\Big) \leq \sum_{k=1}^{K}|\mathcal{L}_k(\Delta)[0,T]| + H(\Delta, T).$$
Therefore the regret satisfies:
$$R^\pi(T) \leq \Big(H(\Delta, T) + \sum_{k=1}^{K}\mathbb{E}[|\mathcal{L}_k(\Delta)[0,T]|]\Big) + \sum_{n=1}^{T}\sum_{(k,k^\star(n)) \in E}(\mu_{k^\star}(n) - \mu_k(n))\,\mathbb{P}[k(n) = k]. \quad (18)$$
The second term of the r.h.s. in (18) is the regret of SW-OSUB when $k^\star(n)$ is the leader. This term can be analyzed using the same techniques as those used for the analysis of SW-KL-UCB, and is upper bounded by the regret of SW-KL-UCB. It remains to bound the first term of the r.h.s. in (18).
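For concreteness, the windowed statistics introduced above can be computed from a trace of (leader, arm) pairs as in the following sketch (the function name and encoding are ours):

```python
def windowed_counts(trace, k, k_prime, n1, n2):
    """trace: list of (leader, arm) pairs indexed by time.
    Returns (t_k, l_k, t_{k,k'}) over the interval {n1, ..., n2}."""
    window = trace[n1:n2 + 1]
    t_k = sum(1 for _, arm in window if arm == k)
    l_k = sum(1 for leader, _ in window if leader == k)
    t_kk = sum(1 for leader, arm in window if leader == k and arm == k_prime)
    return t_k, l_k, t_kk
```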
Theorem D.3

Consider $\Delta > 4\tau\sigma$. Then for all $k$:
$$\mathbb{E}[|\mathcal{L}_k(\Delta)[0,T]|] \leq C\,\frac{T\log(\tau)}{\tau(\Delta - 4\tau\sigma)^2}, \quad (19)$$
where $C > 0$ does not depend on $T$, $\tau$, $\sigma$ and $\Delta$. Substituting (19) in (18), we obtain the announced result. $\square$
D.4. Proof of Theorem D.3
It remains to prove Theorem D.3. Define $\delta = (\Delta - 4\tau\sigma)/2$. We can decompose $\{1,\dots,T\}$ into at most $\lceil T/\tau\rceil$ intervals of size $\tau$. Therefore, to prove the theorem, it is sufficient to prove that for all $n_0 \in \mathcal{L}_k(\Delta)$ we have:
$$\mathbb{E}[|\mathcal{L}_k(\Delta)[n_0, n_0+\tau]|] \leq O\Big(\frac{\log(\tau)}{\delta^2}\Big).$$
In the remainder of the proof, we consider an interval $\{n_0,\dots,n_0+\tau\}$, with $n_0 \in \mathcal{L}_k(\Delta)$ fixed. It is noted that the best neighbour of $k$ changes with time. We define $\bar{k}(n)$ the best neighbor of $k$ at time $n$. From the Lipschitz assumption and the fact that $\Delta > 4\tau\sigma$, we have that for all $n \in \{n_0,\dots,n_0+\tau\}$, $\bar{k}(n) = \bar{k}(n_0)$. Indeed for all $n \in \{n_0,\dots,n_0+\tau\}$ and $k' \in N(k)\setminus\{k,\bar{k}(n_0)\}$:
$$\mu_{\bar{k}(n_0)}(n) - \mu_{k'}(n) \geq \mu_{\bar{k}(n_0)}(n_0) - \mu_{k'}(n_0) - 2(n - n_0)\sigma \geq \Delta - 2\tau\sigma \geq 2\tau\sigma > 0.$$
We write $\bar{k} = \bar{k}(n_0) = \bar{k}(n)$ when this does not create ambiguity. We will use the fact that, for all $n \in \{n_0,\dots,n_0+\tau\}$:
$$\mathbb{E}[\hat{\mu}_{\bar{k}}(n)] - \mathbb{E}[\hat{\mu}_k(n)] \geq \mu_{\bar{k}}(n) - \mu_k(n) - 2\tau\sigma \geq \mu_{\bar{k}}(n_0) - \mu_k(n_0) - 4\tau\sigma \geq \Delta - 4\tau\sigma = 2\delta > 0.$$
We decompose $\mathcal{L}_k(\Delta)[n_0, n_0+\tau] = A^{n_0}_\epsilon \cup B^{n_0}_\epsilon$, with:
$$A^{n_0}_\epsilon = \{n \in \mathcal{L}_k(\Delta)[n_0, n_0+\tau] : t_{\bar{k}}(n) \geq \epsilon l_k(n_0, n)\},\qquad B^{n_0}_\epsilon = \{n \in \mathcal{L}_k(\Delta)[n_0, n_0+\tau] : t_{\bar{k}}(n) \leq \epsilon l_k(n_0, n)\},$$
i.e., the set of times where $k$ is the leader, $k$ is not the optimal arm, and its best neighbor $\bar{k}$ has been tried sufficiently many times (respectively, has been little tried) during the interval $\{n_0,\dots,n_0+\tau\}$.

Bound on $\mathbb{E}[|A^{n_0}_\epsilon|]$. Let $n \in A^{n_0}_\epsilon$. We recall that $\mathbb{E}[\hat{\mu}_{\bar{k}}(n)] - \mathbb{E}[\hat{\mu}_k(n)] \geq 2\delta$, so that the reward of $k$ or $\bar{k}$ must be badly estimated at time $n$:
$$\mathbb{P}[n \in A^{n_0}_\epsilon] \leq \mathbb{P}[|\hat{\mu}_k(n) - \mathbb{E}[\hat{\mu}_k(n)]| > \delta] + \mathbb{P}[|\hat{\mu}_{\bar{k}}(n) - \mathbb{E}[\hat{\mu}_{\bar{k}}(n)]| > \delta].$$
We apply Lemma B.6, with $k' = \bar{k}$, $\Delta_{k,k'} = 2\delta$, $\Lambda(s) = \{n \in A^{n_0}_\epsilon : l_k(n_0, n) = s\}$, $t_{\bar{k}}(n) \geq \epsilon l_k(n_0, n) = \epsilon s$. By design of SW-OSUB: $t_k(n) \geq l_k(n_0, n)/(\gamma+1) = s/(\gamma+1)$. Using the fact that $|\Lambda(s)| \leq 1$ for all $s$, we have that:
$$\mathbb{E}[|A^{n_0}_\epsilon|] \leq O\Big(\frac{\log(\tau)}{\epsilon\delta^2}\Big).$$

Bound on $\mathbb{E}[|B^{n_0}_\epsilon|]$. Define $l_0$ such that
$$\sqrt{\frac{\log(l_0) + c\log(\log(l_0))}{2\lfloor l_0/(2(\gamma+1))\rfloor}} \leq \delta.$$
In particular we can choose $l_0 = 2(\gamma+1)(\log(1/\delta)/\delta)^2$. Indeed, with such a choice we have that
$$\sqrt{\frac{\log(l_0) + c\log(\log(l_0))}{2\lfloor l_0/(2(\gamma+1))\rfloor}} = o(\delta),\quad \delta \to 0^+.$$
Let $\epsilon < 1/(2(\gamma+1))$, and define the following sets:
- $C^{n_0}_\delta$ is the set of instants at which the average reward of the leader $k$ is badly estimated: $C^{n_0}_\delta = \{n \in \{n_0,\dots,n_0+\tau\} : L(n) = k \neq k^\star(n),\ |\hat{\mu}_k(n) - \mathbb{E}[\hat{\mu}_k(n)]| > \delta\}$;
- $D^{n_0}_\delta = \cup_{k' \in N(k)\setminus\{k\}}D^{n_0}_{\delta,k'}$ where $D^{n_0}_{\delta,k'} = \{n : L(n) = k \neq k^\star(n),\ k(n) = k',\ |\hat{\mu}_{k'}(n) - \mathbb{E}[\hat{\mu}_{k'}(n)]| > \delta\}$. $D^{n_0}_\delta$ is the set of instants at which $k$ is the leader, $k'$ is selected and the average reward of $k'$ is badly estimated;
- $E^{n_0} = \{n \leq T : L(n) = k \neq k^\star(n),\ b_{\bar{k}}(n) \leq \mathbb{E}[\hat{\mu}_{\bar{k}}(n)]\}$ is the set of instants at which $k$ is the leader, and the upper confidence index $b_{\bar{k}}(n)$ underestimates the average reward $\mathbb{E}[\hat{\mu}_{\bar{k}}(n)]$.

Let $n \in B^{n_0}_\epsilon$. Write $s = l_k(n_0, n)$, and assume that $s \geq l_0$. Since $t_{k,\bar{k}}(n_0, n) \leq \epsilon l_k(n_0, n)$ and the fact that $l_k(n_0, n) = t_{k,k}(n_0, n) + \sum_{k' \in N(k)\setminus\{k\}}t_{k,k'}(n_0, n)$, we must have that either (a) there exists $k_0 \in N(k)\setminus\{k,\bar{k}\}$ such that $t_{k,k_0}(n_0, n) \geq s/(2(\gamma+1))$, or (b) $t_{k,k}(n_0, n) \geq (3/2)\,s/(\gamma+1) + 1$.
Since $t_{k,k_0}(n_0, \cdot)$ and $t_{k,k}(n_0, \cdot)$ are incremented only at times when $k(n) = k_0$ and $k(n) = k$ respectively, there must exist a unique index $\phi(n) \in \{n_0,\dots,n_0+\tau\}$ such that either: (a) $t_{k,k_0}(n_0, \phi(n)) = \lfloor s/(2(\gamma+1))\rfloor$ and $k(\phi(n)) = k_0$; or (b) $t_{k,k}(n_0, \phi(n)) = \lfloor (3/2)\,s/(\gamma+1)\rfloor$, $k(\phi(n)) = k$ and $l_k(\phi(n)) - 1$ is not a multiple of $\gamma+1$. In both cases, as in the proof of Theorem C.1, we must have that $\phi(n) \in C^{n_0}_\delta \cup D^{n_0}_\delta \cup E^{n_0}$.

We now upper bound the number of instants $n$ which are associated to the same $\phi(n)$. Let $n, n' \in B^{n_0}_\epsilon$ and $s = l_k(n_0, n)$. We see that $\phi(n') = \phi(n)$ implies either $\lfloor l_k(n_0, n')/(2(\gamma+1))\rfloor = \lfloor l_k(n_0, n)/(2(\gamma+1))\rfloor$ or $\lfloor (3/2)\,l_k(n_0, n')/(\gamma+1)\rfloor = \lfloor (3/2)\,l_k(n_0, n)/(\gamma+1)\rfloor$. Furthermore, $n' \mapsto l_k(n_0, n')$ is incremented at time $n'$. Hence for all $n \in B^{n_0}_\epsilon$:
$$|\{n' \in B^{n_0}_\epsilon : \phi(n') = \phi(n)\}| \leq 2\gamma(\gamma+1).$$
We have established that:
$$|B^{n_0}_\epsilon| \leq l_0 + 2\gamma(\gamma+1)(|C^{n_0}_\delta| + |D^{n_0}_\delta| + |E^{n_0}|) = 2(\gamma+1)(\log(1/\delta)/\delta)^2 + 2\gamma(\gamma+1)(|C^{n_0}_\delta| + |D^{n_0}_\delta| + |E^{n_0}|).$$
We complete the proof by providing bounds on the expected sizes of the sets $C^{n_0}_\delta$, $D^{n_0}_\delta$ and $E^{n_0}$.

Bound on $\mathbb{E}[|C^{n_0}_\delta|]$: Using Lemma B.5 with $\Lambda(s) = \{n \in C^{n_0}_\delta : l_k(n_0, n) = s\}$, and by design of SW-OSUB: $t_k(n) \geq l_k(n_0, n)/(\gamma+1) = s/(\gamma+1)$. Since $|\Lambda(s)| \leq 1$ for all $s$, we have that:
$$\mathbb{E}[|C^{n_0}_\delta|] \leq O\Big(\frac{\log(\tau)}{\delta^2}\Big).$$
Bound on $\mathbb{E}[|D^{n_0}_\delta|]$: Using Lemma B.5 with $\Lambda(s) = \{n \in D^{n_0}_{\delta,k'} : t_{k,k'}(n_0, n) = s\}$, and $|\Lambda(s)| \leq 1$ for all $s$, we have that:
$$\mathbb{E}[|D^{n_0}_{\delta,k'}|] \leq O\Big(\frac{\log(\tau)}{\delta^2}\Big).$$
Bound on $\mathbb{E}[|E^{n_0}|]$: By Lemma B.3, since $l_k(n) \leq \tau$:
$$\mathbb{P}[n \in E^{n_0}] \leq e\lceil\log(\tau)(\log(\tau) + c\log(\log(\tau)))\rceil\exp\big(-(\log(\tau) + c\log(\log(\tau)))\big) \leq \frac{e}{\tau(\log(\tau))^{c-2}}.$$
Thus $\mathbb{E}[|E^{n_0}|] \leq \frac{e}{(\log(\tau))^{c-2}}$, summing over the at most $\tau+1$ instants of the interval.

Putting the various bounds together, we have:
$$\mathbb{E}[|\mathcal{L}_k(\Delta)[n_0, n_0+\tau]|] \leq O\Big(\frac{\log(\tau)}{\delta^2}\Big),$$
for all $n_0 \in \mathcal{L}_k(\Delta)$, uniformly in $\delta$, which concludes the proof. $\square$
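The choice $l_0 = 2(\gamma+1)(\log(1/\delta)/\delta)^2$ made above (and the analogous choice in the proof of Theorem C.1) can be checked numerically; the sketch below evaluates the left-hand side of the defining inequality for a few small $\delta$ (the values of $\gamma$, $c$ and $\delta$ are arbitrary illustrations).

```python
import math

def lhs(l0, gamma, c=3.0):
    return math.sqrt((math.log(l0) + c * math.log(math.log(l0)))
                     / (2 * (l0 // (2 * (gamma + 1)))))

gamma = 4
for delta in (0.03, 0.01, 0.003, 0.001):
    l0 = int(2 * (gamma + 1) * (math.log(1 / delta) / delta) ** 2)
    # each printed value is below delta, and the ratio shrinks as delta -> 0
    print(delta, lhs(l0, gamma))
```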
E. Proof of Proposition 1

The regret of UCB($\delta$) is given by:
$$R^\pi(T) = \sum_{k=1}^{\lceil 1/\delta\rceil}\mathbb{E}[t_k(T)](\mu^* - \mu_k).$$
We separate the arms into three different sets: $\{1,\dots,\lceil 1/\delta\rceil\} = A \cup B \cup C$, with: $A = \{k^*-1, k^*, k^*+1\}$ the optimal arm and its neighbors; $B = \{k : k \notin A,\ (k-1)\delta \in [x^* - \delta_0, x^* + \delta_0]\}$ the arms which are not neighbors of the optimal arm, but are in $[x^* - \delta_0, x^* + \delta_0]$; and $C = \{k : (k-1)\delta \notin [x^* - \delta_0, x^* + \delta_0]\}$ the rest of the arms.

We consider $\delta < \delta_0/2$, so that $A \subset [x^* - \delta_0, x^* + \delta_0]$. By our assumption on the reward function, if $k \in A$, $|x^* - \delta(k-1)| \leq 2\delta$, then $|\mu^* - \mu_k| \leq C_1(2\delta)^\alpha$. The regret is upper bounded by:
$$R^\pi(T) \leq TC_1(2\delta)^\alpha + \sum_{k \in B\cup C}\mathbb{E}[t_k(T)](\mu^* - \mu_k).$$
Using the fact that $\mu^* - \mu_{k^*} \leq C_1\delta^\alpha$ and $\sum_{k=1}^{\lceil 1/\delta\rceil}\mathbb{E}[t_k(T)] \leq T$, the bound becomes:
$$R^\pi(T) \leq TC_1(3\delta)^\alpha + \sum_{k \in B\cup C}\mathbb{E}[t_k(T)](\mu_{k^*} - \mu_k).$$
By (Auer et al., 2002) (the analysis of UCB), for all $k$, $\mathbb{E}[t_k(T)] \leq 8\log(T)/(\mu_{k^*} - \mu_k)^2$. Replacing in the regret upper bound:
$$R^\pi(T) \leq TC_1(3\delta)^\alpha + \sum_{k \in B\cup C}\frac{8\log(T)}{\mu_{k^*} - \mu_k}.$$
If $k \in B$, $|\delta(k^*-1) - \delta(k-1)| \geq \delta(|k^*-k|-1)$, so $\mu_{k^*} - \mu_k \geq C_2\delta^\alpha(|k^*-k|-1)^\alpha$. If $k \in C$, then $|\delta(k^*-1) - \delta(k-1)| \geq \delta_0/2$, so $\mu_{k^*} - \mu_k \geq C_2(\delta_0/2)^\alpha$. So the regret for arms in $B \cup C$ reduces to:
$$R^\pi(T) \leq TC_1(3\delta)^\alpha + \frac{8\log(T)\lceil 1/\delta\rceil}{C_2(\delta_0/2)^\alpha} + 2\sum_{k=1}^{\lceil 1/\delta\rceil}\frac{8\log(T)}{C_2(\delta k)^\alpha}.$$
Using a sum-integral comparison: $\sum_{k=1}^{\lceil 1/\delta\rceil}k^{-\alpha} \leq \sum_{k=1}^{\lceil 1/\delta\rceil}k^{-1} \leq 1 + \log(\lceil 1/\delta\rceil)$, so that:
$$R^\pi(T) \leq TC_1(3\delta)^\alpha + 8\log(T)\Big(\frac{\lceil 1/\delta\rceil}{C_2(\delta_0/2)^\alpha} + \frac{2(1 + \log(\lceil 1/\delta\rceil))}{C_2\delta^\alpha}\Big).$$
Setting $\delta = (\log(T)/\sqrt{T})^{1/\alpha}$, the regret becomes:
$$R^\pi(T) \leq TC_1 3^\alpha(\log(T)/\sqrt{T}) + 8\log(T)\Big(\frac{\lceil(\sqrt{T}/\log(T))^{1/\alpha}\rceil}{C_2(\delta_0/2)^\alpha} + \frac{2(1 + \log(T))\sqrt{T}}{C_2\log(T)}\Big),$$
where we have used the fact that $\lceil 1/\delta\rceil \leq T$. Hence:
$$R^\pi(T) \leq C_1 3^\alpha\log(T)\sqrt{T} + 8\Big(\frac{\sqrt{T}+1}{C_2(\delta_0/2)^\alpha} + \frac{2\sqrt{T}(1 + \log(T))}{C_2}\Big).$$
Letting $T \to \infty$ gives the result:
$$\limsup_{T\to\infty}\frac{R^\pi(T)}{\sqrt{T}\log(T)} \leq C_1 3^\alpha + \frac{16}{C_2}. \qquad\square$$
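The strategy analyzed above is directly implementable: discretize $[0,1]$ into $\lceil 1/\delta\rceil$ arms with $\delta = (\log(T)/\sqrt{T})^{1/\alpha}$ and run UCB on them. A minimal sketch follows; the reward model, constants and the use of the best discretized arm as a proxy for $\mu(x^*)$ are arbitrary illustrations, while the UCB index is the standard one of (Auer et al., 2002).

```python
import math, random

def ucb_on_discretization(mu, T, alpha=1.0, seed=0):
    """Discretize [0,1] with step delta = (log T / sqrt(T))^(1/alpha) and run
    UCB over the arms x_k = (k-1)*delta, for Bernoulli rewards with mean mu(x)."""
    rng = random.Random(seed)
    delta = (math.log(T) / math.sqrt(T)) ** (1 / alpha)
    arms = [min(k * delta, 1.0) for k in range(math.ceil(1 / delta))]
    K = len(arms)
    t, s = [0] * K, [0.0] * K
    regret, best = 0.0, max(mu(x) for x in arms)  # proxy for mu(x*)
    for n in range(1, T + 1):
        if n <= K:
            k = n - 1  # initialization: play each arm once
        else:
            k = max(range(K),
                    key=lambda i: s[i] / t[i] + math.sqrt(2 * math.log(n) / t[i]))
        r = 1.0 if rng.random() < mu(arms[k]) else 0.0
        t[k] += 1; s[k] += r
        regret += best - mu(arms[k])
    return regret

# unimodal reward over [0,1] with peak at x = 0.3 (arbitrary example)
print(ucb_on_discretization(lambda x: 0.9 - 0.8 * abs(x - 0.3), T=20_000))
```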