Nonstochastic Bandits with Infinitely Many Experts
NNonstochastic Bandits with Infinitely Many Experts
X. Flora Meng Tuhin Sarkar Munther A. Dahleh Abstract
We study the problem of nonstochastic banditswith infinitely many experts: A learner aims tomaximize the total reward by taking actions se-quentially based on bandit feedback while bench-marking against a countably infinite set of experts.We propose a variant of Exp4.P that, for finitelymany experts, enables inference of correct expertrankings while preserving the order of the regretupper bound. We then incorporate the variant intoa meta-algorithm that works on infinitely manyexperts. We prove a high-probability upper boundof ˜ O (cid:0) i ∗ K + √ KT (cid:1) on the regret, up to polylogfactors, where i ∗ is the unknown position of thebest expert, K is the number of actions, and T is the time horizon. We also provide an exampleof structured experts and discuss how to expeditelearning in such case. Our meta-learning algo-rithm achieves the tightest regret upper bound forthe setting considered when i ∗ = ˜ O (cid:0)(cid:112) T /K (cid:1) .If a prior distribution is assumed to exist for i ∗ ,the probability of satisfying a tight regret boundincreases with T , the rate of which can be fast.
1. Introduction
Early work on the multi-armed bandit problem commonlystudied settings where the rewards of each arm are stochas-tically generated from some unknown distribution (Robbins,1952; Lai & Robbins, 1985; Auer et al., 2002a). In general,such statistical assumptions are difficult to validate or inap-proriate for some applications such as packet transmissionin communication networks (Auer et al., 1995; 2002b). Theproblem of nonstochastic bandits, first investigated in (Aueret al., 1995; 2002b), makes no statistical assumptions abouthow the rewards are generated.A setting of the nonstochastic bandit problem allows forincorporating expert advice. The learner interacts with an adversary over time horizon T as follows. At each time, Department of Electrical Engineering and Computer Science,Massachusetts Institute of Technology, Cambridge, MA, USA.Correspondence to: X. Flora Meng < [email protected] > .Copyright 2021 by the authors. the adversary sets the rewards for the K actions and keepsthem secret. The learner then gets every expert’s adviceon the probability of choosing each action. The learnersubsequently combines the experts’ advice and samples anaction. Finally, the learner observes only the reward of theaction chosen, and the game repeats. The learner’s goal is tominimize regret , which is the gap between the total rewardgained and the expected total reward of the best expert i ∗ who is unknown a priori.The framework described is a general one. First, there isno assumption about the generation of rewards except thatthe adversary is oblivious . In other words, the adversary’schoices are independent of the learner’s strategy. Equiva-lently, all rewards can be assigned before the game starts,and the learner only observes the rewards of chosen actionssequentially. Second, we do not restrict or assume knowl-edge of how the experts come up with their advice. 
Third,experts can give deterministic advice.The problem of bandits with expert advice is not only anatural model for numerous real-world applications, suchas selecting and pricing online advertisements (McMahan& Streeter, 2009), but also important from a theoretical per-spective. Contextual bandits can be framed as a bandits withexpert advice problem by introducing policies that map acontext to a probability distribution over actions (McMahan& Streeter, 2009; Agarwal et al., 2017). Bandits with expertadvice are also closely related to online model selectionwhere experts correspond to model classes (Cesa-Bianchi &Lugosi, 2006; Foster et al., 2017; 2019).Prior work on nonstochastic bandits with expert advice typ-ically assumes the number of experts to be finite (Aueret al., 1995; 2002b; McMahan & Streeter, 2009; Beygelz-imer et al., 2011; Neu, 2015). The exp onential-weight algo-rithm for exp loration and exp loitation using exp ert advice(Exp4), introduced by (Auer et al., 1995; 2002b), has a regretupper bound of O (cid:0) √ KT ln N (cid:1) in expectation , where N isthe number of experts. This upper bound almost matches thelower bound Ω( (cid:112) ( KT ln N ) / ln K ) derived by (Agarwalet al., 2012) for the expected regret when ln N ≤ T ln K .However, Exp4 does not satisfy a similar regret guarantee with high probability due to the large variance of its es-timates. Algorithms with high-probability guarantees arepreferred for domains that need reliable methods, but such a r X i v : . [ c s . L G ] F e b onstochastic Bandits with Infinitely Many Experts algorithms require delicate analysis (Beygelzimer et al.,2011; Neu, 2015). The Exp4.P algorithm, a variant of Exp4proposed by (Beygelzimer et al., 2011), has a regret boundedfrom above by O (cid:0)(cid:112) KT ln ( N/δ ) (cid:1) with probability at least − δ . 
This bound can be improved by a constant factor withthe key idea of avoiding explicit exploration (Neu, 2015).We study the problem of nonstochastic bandits with in-finitely many experts. Our main question is: Can the learnerperform almost as well as the globally best expert i ∗ of acountably infinite set while only querying a finite number ofexperts? This question is motivated by challenges encoun-tered in practical situations where it is unfeasible to seekadvice from all experts all the time (Seldin et al., 2013). Forsearch engine advertising, a company may need to chooseamong a multitude of schemes some of which also involvehyperparameter tuning (McMahan & Streeter, 2009). Asanother example, there are often a myriad of features thatcan be used for online recommendation systems. Somefeatures tend to be more informative than others, but theirrelevance is normally unknown a priori. We can transformthis problem into bandits with expert advice where each ex-pert corresponds to a model class in a certain feature space.The number of experts can be extremely large due to thecombinatorial nature. In contrast to the large number ofexperts available, it is desirable to query only some of themeach time in consideration of computational constraints.
Our Contributions
For the general case without any as-sumption about the experts, we propose an algorithm called Be st E xpert S earch (BEES) and provide theoretical guar-antees on its performance. BEES runs a subroutine calledExp4.R, an algorithm that we obtain by modifying Exp4.P.The “R” denotes a feature of Exp4.R: it enables inferenceof correct expert rankings with high probability in addi-tion to satisfying a regret upper bound of the same order asthat proved for Exp4.P. Our main result establishes a high-probability upper bound of ˜ O (cid:0) ( i ∗ ) /α K + √ αKT (cid:1) on theregret of BEES, hiding only polylog factors, which dependson the position of the unknown best expert i ∗ and a positiveinteger-valued parameter α . This upper bound illustrates thetrade-off, controlled by α , between exploration and exploita-tion for the problem of nonstochastic bandits with infinitelymany experts. On the one hand, it is desirable to include alarge number of experts per epoch so as to approach i ∗ ata fast rate. On the other hand, consulting too many expertssimultaneously necessitates long epochs, which reduces therate at which more experts are included. Although tuning α needs the unknown index i ∗ , we can simply set α = 1 .To the best of our knowledge, our high-probability regretupper bound is the tightest for the setting considered when i ∗ = ˜ O (cid:0)(cid:112) T /K (cid:1) . This regime is less restricted than itseems at first sight. If we assume a prior distribution on i ∗ , then i ∗ = ˜ O (cid:0)(cid:112) T /K (cid:1) holds with a probability that in- creases with T , the rate of which can be fast. Inspired bythe problem of finite-time model selection for reinforcementlearning (RL), we also present an example of structured ex-perts, which simulates the trade-off between approximationand estimation. We discuss how the expert ranking propertyof Exp4.R can be used to expedite learning in such case. Related Work
A natural approach is to consider expertsas arms and use methods for infinitely many-armed ban-dits such as (Berry et al., 1997; Kleinberg et al., 2008;Rusmevichientong & Tsitsiklis, 2010; Carpentier & Valko,2015). However, such work relies on statistical assumptions,whereas our setting is a nonstochastic bandit problem. Ourquestion is also related to bandits with limited advice, firstposed by (Seldin et al., 2013) and subsequently solved by(Kale, 2014), but their setting considers a finite number ofexperts of whom only a subset can be queried at each time.To the best of our knowledge, no work achieves high-probability regret bound of ˜ O (cid:0) √ KT (cid:1) for the setting con-sidered. When configured correctly, Exp4 has a regret up-per bound of O (cid:0) √ KT ln i ∗ (cid:1) in expectation (Foster et al.,2019). However, the algorithm is computationally unfea-sible as it needs to handle infinitely many experts at everytime step. One method of making Exp4 computationallytractable is to truncate the sequence of experts to a subsetof size O (cid:0) e √ KT (cid:1) as any larger set would make the ex-pected regret superlinear in K or T . Running Exp4 withcorrect configurations on this subset of experts has a re-gret upper bound of O (cid:0) ( KT ) / + T ∆ (cid:1) in expectation where ∆ is the infimum upper bound on the suboptimalitygaps of the experts considered. For stochastic contextualbandits, Exp4.P can be used as a subroutine to achieve ahigh-probability regret bound of ˜ O (cid:0) √ dT ln T (cid:1) with an in-finite set of experts that has a finite Vapnik–Chervonenkisdimension d (Beygelzimer et al., 2011). Since the regretanalysis of Exp4.P relies on the union bound, the algorithmdoes not apply to infinitely many experts in the nonstochas-tic setting. 
If we run Exp4.P on a finite subset of expertsof size Θ (cid:0) δ exp (cid:0)(cid:112) T / (16 K ) (cid:1)(cid:1) , the regret is then boundedfrom above by O (cid:0) K / T / + T ∆ (cid:1) with probability atleast − δ . Outline
Section 2 formally defines the problem of non-stochastic bandits with infinitely many experts. In Section 3,we introduce Exp4.R for the setting of finitely many expertsand prove that it enables inference of correct expert rank-ings with high probability. Section 4 investigates the case ofinfinitely many experts and presents a meta-algorithm thatruns Exp4.R as a subroutine. We prove a high-probabilityregret upper bound and give an example to illustrate howto expedite learning when working with structured experts.Finally, we conclude in Section 5. onstochastic Bandits with Infinitely Many Experts
2. Problem Formulation
Let Z + be the set of strictly positive integers. For N ∈ Z + ,we define [ N ] (cid:44) { , , . . . , N } . Let T ∈ Z + be the timehorizon . Let A be a set of actions where |A| = K < ∞ .At each time t ∈ [ T ] , the adversary first sets a reward vector r ( t ) ∈ [0 , K where r a ( t ) is the reward of action a . Eachexpert i ∈ Z + then gives their advice ξ i ( t ) , which is a prob-ability vector over A . After querying a finite subset of theexperts’ advice but not the rewards, the learner then sam-ples an action a ( t ) . Finally, the learner receives the reward r a ( t ) ( t ) and no other information. The game proceeds totime t + 1 and finishes after T time steps. The learner’s goalis to combine the experts’ advice such that the total rewardis close to a benchmark, which we will define shortly.Let y i ( t ) (cid:44) (cid:80) a ∈A ξ ia ( t ) r a ( t ) be the expected reward ofexpert i at time t . For any time interval T ⊂ Z + such that |T | < ∞ , we denote the expected total reward of expert i during T as R i ( T ) (cid:44) (cid:80) t ∈T y i ( t ) . We define the bestexpert i ∗ ( I ; T ) of a subset I ⊆ Z + during T as the onewith the lowest index that has the highest total reward in ex-pectation, namely, i ∗ ( I ; T ) (cid:44) min { argmax i ∈I R i ( T ) } .The learner’s regret with respect to i ∗ ( I ; T ) isRegret ( T ; I ) (cid:44) R i ∗ ( I ; T ) ( T ) − (cid:88) t ∈T r a ( t ) ( t ) . For simplicity of notation, let Regret ( T ) (cid:44) Regret ([ T ]; Z + ) and i ∗ (cid:44) i ∗ ( Z + ; [ T ]) . The learner’s goal is to minimizeRegret ( T ) , the regret with respect to the globally best expert i ∗ for the time horizon considered.
3. Nonstochastic Bandits with a FiniteNumber of Experts
We start with a simplified problem where the number ofexperts is finite. Section 3.1 presents Exp4.R (Algorithm 1)and provides some intuition for its design. In Section 3.2,we show that Exp4.R not only preserves the regret upperbound of Exp4.P in terms of order but also enables inferenceof correct expert rankings with high probability.
Exp4.R (Algorithm 1) is a slight variant of Exp4.P proposedby (Beygelzimer et al., 2011). The major distinction isthat Exp4.R calculates a threshold vector (cid:15) which enablesinference of correct expert rankings with high probability.Exp4.R takes four inputs, namely, an error rate δ ∈ (0 , ,a time horizon T ∈ Z + , the minimum probability ρ ∈ (0 , /K ] of exploration, and a finite set of experts I ⊂ Z + .Without loss of generality, we suppose that |I| = N . If max i ∈I R i ( T ) does not exist, we define i ∗ ( I ; T ) = ∞ and R i ∗ ( I ; T ) ( T ) = sup i ∈I R i ( T ) . Algorithm 1
Exp4.R
Input: δ ∈ (0 , , T ∈ Z + , ρ ∈ (0 , /K ] , I ⊂ Z + Output: w ( T + 1) , (cid:15)β ← (cid:112) ln(2 N/δ ) / ( KT ) .w i (1) ← for i ∈ I . for t = 1 , . . . , T do Get ξ i ( t ) for i ∈ I . q i ( t ) ← w i ( t ) / (cid:80) i (cid:48) ∈I w i (cid:48) ( t ) for i ∈ I . p a ( t ) ← (1 − Kρ ) (cid:80) i ∈I q i ( t ) ξ ia ( t ) + ρ for a ∈ A .Sample action a ( t ) from p ( t ) .Take action a ( t ) and receive reward r a ( t ) ( t ) . for i ∈ I do ˆ y i ( t ) ← ξ ia ( t ) ( t ) r a ( t ) ( t ) p a ( t ) ( t ) , ˆ v i ( t ) ← (cid:88) a ∈A ξ ia ( t ) p a ( t ) ,w i ( t + 1) ← w i ( t ) exp (cid:16) ρ y i ( t ) + β ˆ v i ( t )] (cid:17) . end forend forfor i ∈ I do (cid:15) i ← (cid:34) KT T (cid:88) t =1 ˆ v i ( t ) (cid:35) ln (cid:18) Nδ (cid:19) . end for Exp4.R first initializes a weight w i (1) = 1 for each expert i ∈ I . At time t ∈ [ T ] , normalizing w ( t ) gives a probabilitydistribution q ( t ) over I . After getting advice ξ i ( t ) fromeach expert i , Exp4.R constructs a probability distribution p ( t ) over A by weighting all advice according to q ( t ) andmixing in uniform exploration so that p a ( t ) ≥ ρ for all a ∈ A . Specifically, for all a , let p a ( t ) = (1 − Kρ ) (cid:88) i ∈I q i ( t ) ξ ia ( t ) + ρ. (1)Exp4.R subsequently takes action a ( t ) sampled accordingto p ( t ) and receives the reward r a ( t ) ( t ) . Time t concludeswith weight updates as specified below. For i ∈ I , Exp4.Restimates y i ( t ) by ˆ y i ( t ) and calculates an upper bound onthe variance of ˆ y i ( t ) conditional on history until time t − as given by ˆ y i ( t ) = ξ ia ( t ) ( t ) r a ( t ) ( t ) p a ( t ) ( t ) , ˆ v i ( t ) = (cid:88) a ∈A ξ ia ( t ) p a ( t ) . (2)Exp4.R updates each expert’s weight w i ( t ) using w i ( t + 1) = w i ( t ) exp (cid:16) ρ y i ( t ) + β ˆ v i ( t )] (cid:17) , (3)where β = (cid:112) ln(2 N/δ ) / ( KT ) . 
The game ends in T timesteps and gives two outputs, namely, the final weight vector onstochastic Bandits with Infinitely Many Experts w ( T + 1) and a threshold vector (cid:15) , the i th entry of which is (cid:15) i = (cid:34) KT T (cid:88) t =1 ˆ v i ( t ) (cid:35) ln (cid:18) Nδ (cid:19) . We establish in Proposition 1 that, with high probability,Exp4.R not only satisfies a regret upper bound of the sameorder as that proved for Exp4.P but also reveals correctpairwise expert rankings if the corresponding weights aresufficiently separated.For simplicity of notation, we denote R i ([ T ]) (cid:44) (cid:80) Tt =1 y i ( t ) as R i ( T ) . Updating weights using (3) allows us to constructa confidence bound for each R i ( T ) . For i ∈ I , let ˆ R i ( T ) (cid:44) (cid:80) Tt =1 ˆ y i ( t ) and ˆ V i ( T ) (cid:44) (cid:80) Tt =1 ˆ v i ( t ) . For any δ ∈ (0 , ,let E ( δ ) be an event defined by ∀ i ∈ I , − ln (cid:18) Nδ (cid:19) (cid:114) KT ln N − (cid:114) ln NKT ˆ V i ( T ) ≤ R i ( T ) − ˆ R i ( T ) ≤ (cid:115) ln (cid:18) Nδ (cid:19) (cid:32) ˆ V i ( T ) √ KT + √ KT (cid:33) . Lemma 1 shows that the estimates ˆ R i ( T ) are concen-trated around the true values R i ( T ) . The proof relies on aFreedman-style inequality for martingales from (Beygelz-imer et al., 2011), which we defer to the appendix.Lemma 2 establishes an upper bound on the regret ofExp4.R. Since Lemma 2 is a slight variant of Theorem 2in (Beygelzimer et al., 2011), the proof is very similarto the original one and hence omitted here. We notethat Theorem 2 in (Beygelzimer et al., 2011) holds for asmaller regime than stated in the original paper. To bespecific, the condition T = Ω( K ln N ) is essential for ρ = (cid:112) ln N/ ( KT ) ≤ /K to be true. We make the correc-tion in Lemma 2.Lemma 3 validates the correctness of the inferred expertrankings when the concentration event E ( δ ) holds. 
Corol-lary 1 shows that the uncertainty gap for ranking any pair ofexperts is the sum of their thresholds given by Exp4.R. Wecan prove Corollary 1 by first taking the contrapositive ofthe statement in Lemma 3 and then switching i and i (cid:48) .Finally, we combine the lemmas to prove Proposition 1.Same as Exp4.P, the computational complexity of Exp4.Ris O ( KN ) for space and O ( KN T ) for runtime. Assumption 1.
The following conditions hold: (i) max { K ln N, ln(2 N/δ ) / [( e − K ] } ≤ T , (ii) and thereexists a uniform expert i ∈ I such that ξ ia ( t ) = 1 /K for all a ∈ A and t ∈ Z + . Lemma 1.
Under Assumption 1, if we run Exp4.R with ρ = (cid:112) ln N/ ( KT ) , then P ( E ( δ )) ≥ − δ for all δ ∈ (0 , . Lemma 2.
Under Assumption 1, for any δ ∈ (0 , , if E ( δ ) holds, then Exp4.R with ρ = (cid:112) ln N/ ( KT ) satisfies thatRegret ( T ; I ) ≤ (cid:112) KT ln (2 N/δ ) . Lemma 3.
Under Assumption 1, for any δ ∈ (0 , , if E ( δ ) holds, then Exp4.R with ρ = (cid:112) ln N/ ( KT ) satisfiesthat, for all i, i (cid:48) ∈ I , we have R i ( T ) > R i (cid:48) ( T ) whenever ln w i ( T + 1) − ln w i (cid:48) ( T + 1) > (cid:15) i .Proof. We fix an arbitrary δ ∈ (0 , and suppose that event E ( δ ) holds. We recall that (cid:15) i (cid:44) (cid:34) V i ( T ) KT (cid:35) ln (cid:18) Nδ (cid:19) . We assume that ln w i ( T + 1) − ln w i (cid:48) ( T + 1) > (cid:15) i for some i, i (cid:48) ∈ I . By (3) and the initialization condition w i (1) = 1 ,we have ln w i ( T + 1) = T (cid:88) t =1 ln (cid:18) w i ( t + 1) w i ( t ) (cid:19) = ρ (cid:32) ˆ R i ( T ) + (cid:114) ln(2 N/δ ) KT ˆ V i ( T ) (cid:33) . Thus, ˆ R i ( T ) = 2 ρ ln w i ( T + 1) − (cid:114) ln(2 N/δ ) KT ˆ V i ( T ) . (4)Equation (4) also holds for i (cid:48) . Thus, ˆ R i ( T ) − ˆ R i (cid:48) ( T )= 2 ρ ln (cid:18) w i ( T + 1) w i (cid:48) ( T + 1) (cid:19) − (cid:114) ln(2 N/δ ) KT (cid:16) ˆ V i ( T ) − ˆ V i (cid:48) ( T ) (cid:17) > (cid:15) i ρ − (cid:114) ln(2 N/δ ) KT (cid:16) ˆ V i ( T ) − ˆ V i (cid:48) ( T ) (cid:17) = 2 ln (cid:18) Nδ (cid:19) (cid:114) KT ln N + (cid:114) ln(2 N/δ ) KT ˆ V i (cid:48) ( T )+ ˆ V i ( T ) (cid:114) ln(2 N/δ ) KT (cid:34) (cid:114) ln(2 N/δ )ln N − (cid:35) > (cid:18) Nδ (cid:19) (cid:114) KT ln N + (cid:114) ln(2 N/δ ) KT (cid:16) ˆ V i ( T ) + ˆ V i (cid:48) ( T ) (cid:17) . (5) onstochastic Bandits with Infinitely Many Experts Event E ( δ ) implies that R i ( T ) − ˆ R i ( T ) + ˆ R i (cid:48) ( T ) − R i (cid:48) ( T ) ≥ − ln (cid:18) Nδ (cid:19) (cid:114) KT ln N − (cid:114) ln NKT ˆ V i ( T ) − (cid:115) ln (cid:18) Nδ (cid:19) (cid:32) ˆ V i (cid:48) ( T ) √ KT + √ KT (cid:33) . (6)Adding (5) and (6) and then simplifying the algebra give R i ( T ) − R i (cid:48) ( T ) > . Corollary 1.
Under the conditions of Lemma 3, for all i, i (cid:48) ∈ I , it holds that(i) if ln w i ( T + 1) − ln w i (cid:48) ( T + 1) > (cid:15) i , then R i ( T ) > R i (cid:48) ( T ); (ii) if R i ( T ) ≥ R i (cid:48) ( T ) , then ln w i ( T + 1) − ln w i (cid:48) ( T + 1) ≥ − (cid:15) i (cid:48) . Proposition 1.
Under Assumption 1, for any δ ∈ (0 , ,with probability at least − δ , Exp4.R configured with ρ = (cid:112) ln N/ ( KT ) satisfies that(i) Regret ( T ; I ) ≤ (cid:112) KT ln (2 N/δ ) ;(ii) for all i, i (cid:48) ∈ I , if ln w i ( T + 1) − ln w i (cid:48) ( T + 1) > (cid:15) i ,then R i ( T ) > R i (cid:48) ( T ) .Proof. Proposition 1 follows directly from Lemmas 1–3.
4. Selection Among Infinitely Many Experts
In this section, we study the problem of nonstochastic ban-dits with a countably infinite set of experts. We make noassumptions about the experts or how they are indexed. Forthis general case, we propose a meta-algorithm called Be st E xpert S earch (BEES, Algorithm 2) that runs Exp4.R asa subroutine and provide a high-probability upper boundon regret. Section 4.1 provides an example of structuredexperts and discusses how the expert ranking property ofExp4.R can be used to expedite learning in such case.BEES takes five inputs including an error rate δ ∈ (0 , , thenumber of epochs L ∈ Z + , and three constants α, c, C ∈ Z + that control the exponential growth of the epoch lengthand the number of experts consulted in each epoch. At a highlevel, BEES supplies Exp4.R with an increasing (but still finite) number of experts over epochs, prioritizing those withlower indices. This scheme can be considered as puttinga prior on the experts implicitly where the experts that arebelieved to perform well are given low indices. Since wemake no assumptions about the experts, they can be orderedusing domain knowledge before being given to BEES asinputs. Growing the epoch length and the number of expertsat exponential rates allows us to derive a regret upper boundof the same order as that of Exp4.R when the best expert i ∗ has a relatively low index. This idea is similar to, thoughnot the same as, the doubling trick (Besson & Kaufmann,2018) as the latter only deals with the epoch length. Forour setting, we need to increase the number of experts at anappropriate rate relative to the epoch length. Algorithm 2 Be st E xpert S earch (BEES) Input: δ ∈ (0 , , α ∈ Z + , L ∈ Z + , c ∈ Z + , C ∈ Z + for epoch l = 1 , . . . , L do N l ← c αl , T l ← C l . ρ l ← (cid:112) ln N l / ( KT l ) . I l ← [ N l ] . Exp4.R ( δ/L, T l , ρ l , I l ) . 
end for Theorem 1 provides sample complexity guarantees forBEES by establishing a high-probability regret upper bound.Corollary 2 simplifies this bound for specific parametervalues. Corollary 2 shows that BEES, when tuned right,satisfies Regret ( T ) = ˜ O (cid:16) ( i ∗ ) /α K + √ αKT (cid:17) with highprobability, where ˜ O ( · ) omits only polylog factors. This up-per bound illustrates the trade-off between exploration andexploitation for the problem of bandits with infinitely manyexperts. On the one hand, we want to include a large num-ber of experts in each epoch so as to approach i ∗ at a fastrate. On the other hand, querying too many experts simul-taneously necessitates long epochs, which reduces the rateat which more experts are included. This trade-off is con-trolled by the parameter α ∈ Z + . The term ˜ O (cid:0) ( i ∗ ) /α K (cid:1) in the upper bound corresponds to regret from not con-sidering i ∗ sooner. The other term ˜ O (cid:16) √ αKT (cid:17) is theregret that benchmarks against the best expert in eachepoch. Another consideration for not using an arbitrar-ily large value of α is that the minimum time horizon re-quired by BEES which is T = Ω( C ( α, c, K, δ )) increaseswith α . Although tuning α needs the unknown index ofthe best expert i ∗ , we can simply set α = 1 . BEES hasspace complexity O ( K (1 + T /K ) α ) and time complexity ˜ O (cid:0) K (1 + T /K ) α +1 (cid:1) .To the best of our knowledge, Theorem 1 establishes thetightest high-probability regret bound for the setting consid-ered when i ∗ = ˜ O (cid:0)(cid:112) T /K (cid:1) . This regime is less restrictedthan it seems at first sight. Assuming a prior distribution onstochastic Bandits with Infinitely Many Experts on i ∗ shows that the condition on i ∗ is satisfied with aprobability that increases with T , the rate of which canbe fast. For simplicity, let α = 1 and c = 1 . 
In orderfor Regret ( T ) = ˜ O (cid:16) √ KT (cid:17) to hold with high probability,we need i ∗ = ˜ O (cid:0)(cid:112) T /K (cid:1) . We denote the complement ofthis event as B . If we suppose that F ( i ) = P ( i ∗ > i ) for i ∈ Z + and some function F : Z + → [0 , , then P ( B ) decreases with T . For example, if F ( i ) ∝ i − s for some s > , then P ( B ) is roughly proportional to K s/ T − s/ .If F ( i ) ∝ e − si for some s > , then P ( B ) is roughlyproportional to e − s √ T/K .Before stating Theorem 1, we provide some intuition for theproof. Lemma 1 implies that (cid:80) t ∈T l ˆ y i ( t ) ≈ R i ( T l ) for eachexpert i and every epoch l with high probability. For this rea-son, we can prove an upper bound on the regret with respectto the best expert in each epoch, namely, (cid:80) Ll =1 R i ∗ l ( T l ) − (cid:80) Tt =1 r a ( t ) ( t ) = ˜ O (cid:16) √ αKT (cid:17) . We then derive an upperbound on the gap between the globally best expert and thebest expert in each epoch, which is given by R i ∗ ([ T ]) − (cid:80) Ll =1 R i ∗ l ( T l ) = ˜ O (cid:0) ( i ∗ ) /α K (cid:1) . Adding the upper bounds,we get Regret ( T ) = ˜ O (cid:16) ( i ∗ ) /α K + √ αKT (cid:17) .For simplicity of notation, we suppose that the total numberof epochs L = log [1 + T / (2 C )] so that T = (cid:80) Ll =1 T l where T l = C l for l ∈ [ L ] . We use (cid:98)·(cid:99) and (cid:100)·(cid:101) to denotethe floor and ceiling functions, respectively. For the generalcase of T ≥ C , let L = (cid:98) log [1 + T / (2 C )] (cid:99) , T l = C l for l ∈ [ L − , and T L = T − (cid:80) L − l =1 T l . Theorem 1.
If a uniform expert is available in each epoch,then there exist absolute constants α ∈ Z + and c ∈ Z + such that, for some C ( α, c, K, δ ) ∈ Z + , BEES satisfies that,for any δ ∈ (0 , , with probability at least − δ , we haveRegret ( T ) < (cid:115) αK ( T + 2 C ) ln (cid:18) cL (2 + T /C ) δ (cid:19) + 2 C (cid:18) i ∗ c (cid:19) α . Corollary 2.
Under the conditions of Theorem 1, runningBEES with α ∈ Z + , c ∈ Z + , and C = (cid:100) αK ln(16 c /δ ) (cid:101) satisfies that, for any δ ∈ (0 , , with probability at least − δ , Regret ( T ) = ˜ O (cid:16) ( i ∗ ) /α K + √ αKT (cid:17) .Proof of Theorem 1. We can show that, for all δ ∈ (0 , , α ∈ Z + , and c ∈ Z + , there exists C ( α, c, K, δ ) ∈ Z + such that K ln (cid:0) c αl (cid:1) ≤ C l and ln (cid:0) c αl +1 /δ (cid:1) ≤ ( e − CK l for all l ∈ Z + . For example, we can set C = (cid:100) αK ln(16 c /δ ) (cid:101) . Together with the definitions of N l and T l in Algorithm 2, we have that, for all α ∈ Z + and c ∈ Z + , there exists C ∈ Z + such that K ln N l ≤ T l and ln(2 N l /δ ) ≤ ( e − KT l for all l ∈ Z + . We fix suchintegers α, c, C ∈ Z + for the rest of the proof.For simplicity of notation, we first consider runningExp4.R ( δ, T l , ρ l , I l ) in each epoch l for any δ ∈ (0 , /L ] and then apply a change of variables at the end of the proof.We suppose that a uniform expert is available in each epoch.Assumption 1 is then satisfied for all epochs. For now, weassume that event E ( δ ) holds for all epochs, the probabil-ity of which will be discussed at the end of the proof. Forsimplicity of notation, let i ∗ l (cid:44) i ∗ ( I l ; T l ) for l ∈ [ L ] .Let U l (cid:44) αl + log (2 c/δ ) for l ∈ [ L ] . Recall that T l is thetime interval of epoch l where |T l | = T l . By Lemma 2, L (cid:88) l =1 R i ∗ l ( T l ) − T (cid:88) t =1 r a ( t ) ( t ) ≤ L (cid:88) l =1 (cid:115) KT l ln (cid:18) N l δ (cid:19) = 7 √ KC ln 2 L (cid:88) l =1 (cid:112) l U l ≤ (cid:112) KCU L ln 2 L (cid:88) l =1 l/ < (cid:112) KCU L (cid:16) L/ − (cid:17) . Since L = log [1 + T / (2 C )] , we have L (cid:88) l =1 R i ∗ l ( T l ) − T (cid:88) t =1 r a ( t ) ( t ) < (cid:112) KCU L (cid:32)(cid:114) T C − (cid:33) < (cid:115) K (cid:20) αL + 2 ln (cid:18) cδ (cid:19)(cid:21) (cid:18) C + T (cid:19) . (7)We first discuss the case where i ∗ / ∈ I . 
Let L (cid:48) be the lastepoch such that i ∗ is not considered in Algorithm 2. Since |I l | = N l , we have L (cid:48) = min (cid:0) L, (cid:100) α − log ( i ∗ /c ) (cid:101) − (cid:1) .Since i ∗ / ∈ I , we get L (cid:48) ≥ . By the definition of i ∗ l , wehave R i ∗ l ( T l ) ≥ R i ∗ ( T l ) for all l > L (cid:48) . Thus, R i ∗ ([ T ]) − L (cid:88) l =1 R i ∗ l ( T l ) ≤ L (cid:48) (cid:88) l =1 (cid:0) R i ∗ ( T l ) − R i ∗ l ( T l ) (cid:1) ≤ L (cid:48) (cid:88) l =1 T l < C L (cid:48) +1 < C (cid:18) i ∗ c (cid:19) α . (8)We now consider the case where i ∗ ∈ I . It follows fromAlgorithm 2 that i ∗ ∈ I l for all l . Thus, the definition onstochastic Bandits with Infinitely Many Experts of i ∗ l implies that R i ∗ l ( T l ) ≥ R i ∗ ( T l ) for all l . We define D (cid:44) R i ∗ ([ T ]) − (cid:80) Ll =1 R i ∗ l ( T l ) . We then have D ≤ .However, the definition of i ∗ implies that D ≥ . Therefore, D = 0 and (8) is satisfied.Adding (7) and (8) givesRegret ( T ) < (cid:115) K (cid:20) αL + 2 ln (cid:18) cδ (cid:19)(cid:21) (cid:18) C + T (cid:19) + 2 C (cid:18) i ∗ c (cid:19) α . (9)Using Lemma 1 and the union bound over all L epochs,we conclude that (9) holds with probability at least − Lδ .A change of variables gives that, for any δ ∈ (0 , , withprobability at least − δ , we haveRegret ( T ) < (cid:115) K (cid:20) αL + 2 ln (cid:18) cLδ (cid:19)(cid:21) (cid:18) C + T (cid:19) + 2 C (cid:18) i ∗ c (cid:19) α < (cid:115) αK ( T + 2 C ) ln (cid:18) cL (2 + T /C ) δ (cid:19) + 2 C (cid:18) i ∗ c (cid:19) α . 
In this section, we present an example of structured expertsthat is inspired by the problem of finite-time model selec-tion for RL and discuss how the expert ranking property ofExp4.R can be used to expedite learning in such case.As RL becomes increasingly integrated into autonomoussystems such as agile robots (Hwangbo et al., 2019), self-driving vehicles (Kuderer et al., 2015), customized fertilizerformulation (Binas et al., 2019), and personalized medi-cation dosing (Nemati et al., 2016), it is crucial that thetechniques are robust (Matni et al., 2019). An aspect ofrobustness is the capability to detect and adjust for model er-rors. For RL, this entails both model selection and parameterestimation. How to achieve both objectives simultaneouslywhile maintaining provably good performance is an activearea of research (Ni & Wang, 2019; Abbasi-Yadkori et al.,2020). The crux of the problem of online model selectionfor RL is to balance approximation and estimation errorsin a time-dependent manner. As an example, we supposethat there is an infinite sequence of nested model classes.This structure arises naturally when an RL algorithm incor-porates increasingly many features over time. Some new features may also just become obtainable while an RL al-gorithm is running. In fact, it is unknown a priori for manyapplications what is a minimal feature space that contains anoptimal policy. Given an infinite sequence of model classes,the best class to use depends on the horizon or, equivalently,the amount of trajectory data that will become available.Although a larger model class has a smaller approximationerror, it tends to have a higher estimation error for a fixedfinite horizon. 
Moreover, if several classes have the sameapproximation power, the simplest one is typically preferredin consideration of time and space complexity.Inspired by the problem of finite-time model selection forRL, we propose to consider experts structured in a waythat simulates the trade-off between approximation and es-timation. In particular, we suppose that the experts areranked in ascending order of complexity. Assumption 2stipulates that the total reward is weakly unimodal in expec-tation with respect to the expert index during any period oftime. In addition, the index of the globally best expert is anondecreasing function of the time horizon. See Figure 1for an illustration. The proposed time-dependent unimodalstructure is fundamentally related to oracle inequalities inempirical risk minimization (Wainwright, 2019). Althoughthe experts’ performance may fluctuate around the proposedstructure in practice, solutions to the stylized setting are oftheoretical interest.
Assumption 2.
For all 𝒯 ⊂ [T], if i ≤ i∗(Z+; 𝒯), then R_{i−1}(𝒯) ≤ R_i(𝒯). Otherwise, R_i(𝒯) ≥ R_{i+1}(𝒯). Moreover, i∗(Z+; 𝒯) ≤ i∗(Z+; 𝒯′) if 𝒯, 𝒯′ ⊂ [T] and |𝒯′| > |𝒯|.

Figure 1. An illustration of Assumption 2.
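In code, the shape condition of Assumption 2 can be checked on any finite window of experts as follows (the reward values below are illustrative, not from the paper):

```python
def is_weakly_unimodal(rewards):
    """Check the shape condition of Assumption 2 on a finite window:
    expected total rewards are nondecreasing up to the best index and
    nonincreasing after it."""
    peak = max(range(len(rewards)), key=lambda k: rewards[k])
    rising = all(rewards[k - 1] <= rewards[k] for k in range(1, peak + 1))
    falling = all(rewards[k] >= rewards[k + 1]
                  for k in range(peak, len(rewards) - 1))
    return rising and falling

# Hypothetical expected total rewards of experts 1, 2, ... over a short
# and a long horizon; the best index shifts right as the horizon grows,
# as the second clause of Assumption 2 requires.
short_horizon = [3.0, 4.5, 5.0, 4.0, 2.5]
long_horizon = [5.0, 7.0, 9.0, 9.5, 8.0]
assert is_weakly_unimodal(short_horizon)
assert is_weakly_unimodal(long_horizon)
assert short_horizon.index(max(short_horizon)) <= long_horizon.index(max(long_horizon))
```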
Under Assumption 2, the outputs of Exp4.R give a threshold rule that allows us to find a lower bound for i∗, which can accelerate the rate of approaching i∗. We modify BEES to incorporate lower bound estimation (BEES.LB, Algorithm 3). BEES.LB runs Exp4.R and Probabilistic Thresholding Search (PTS, Algorithm 4) as subroutines. In each epoch, BEES.LB eliminates experts identified as suboptimal. Lemma 4 shows that the estimated lower bound is correct if the concentration event E(δ) holds. Theorem 2 provides sample complexity guarantees for BEES.LB by establishing a high-probability regret upper bound. The proof of Theorem 2 is similar to that of Theorem 1 and is hence deferred to the appendix. PTS has space complexity O(N) and time complexity O(N²). PTS can be efficiently implemented by first sorting the input w. BEES.LB takes the same space O(K(1 + T/K)^α) as BEES. The time complexity of BEES.LB is Õ(K(1 + T/K)^{α+1} + (1 + T/K)^{2α}), which reduces to the runtime of BEES for sufficiently small α ∈ Z+.

Algorithm 3 BEES with Lower Bound (BEES.LB)
Input: δ ∈ (0, 1), α ∈ Z+, L ∈ Z+, c ∈ Z+, C ∈ Z+.
  i_1 ← 1.
  for epoch l = 1, ..., L do
    N_l ← c·2^{αl}, T_l ← C·2^l.
    ρ_l ← √(ln N_l / (KT_l)).
    I_l ← {i_l, i_l + 1, ..., i_l + N_l − 1}.
    (w_l, ε_l) ← Exp4.R(δ/L, T_l, ρ_l, I_l).
    i_{l+1} ← PTS(w_l, ε_l, i_l).
  end for

Algorithm 4 Probabilistic Thresholding Search (PTS)
Input: w ∈ (0, ∞)^N, ε ∈ (0, ∞)^N, i ∈ Z+.
Output: i_new.
  ĵ ← 1.
  for j = 1, ..., N − 1 do
    for j′ = j + 1, ..., N do
      if ln w_{j′} − ln w_j > ε_{j′} then
        ĵ ← j + 1.
      end if
    end for
  end for
  i_new ← i + ĵ − 1.

Lemma 4.
Under Assumption 2 and the conditions of Lemma 3, if event E(δ) holds for all epochs, then i_l ≤ i∗ for all l.

Proof. Under the assumption that event E(δ) holds for all epochs, we prove the statement by induction on l. The base case holds trivially as i_1 = 1. For the inductive step, we assume that i_ι ≤ i∗ for all ι ≤ l. If i_{l+1} = i_l, then i∗ ≥ i_{l+1} by the induction hypothesis. If there exists some j ≥ 1 such that i_{l+1} = i_l + j, then Algorithm 4 implies that ln w_{j′} − ln w_j > ε_{j′} for some j′ > j in epoch l. Using Assumption 2 and Lemma 3, we get i∗ ≥ i_l + j = i_{l+1}.

Theorem 2.
Under Assumption 2, if a uniform expert is available in each epoch, then there exist absolute constants α ∈ Z+ and c ∈ Z+ such that, for some C(α, c, K, δ) ∈ Z+, BEES.LB satisfies that, for any δ ∈ (0, 1), with probability at least 1 − δ, we have

Regret(T) < 20 √(2αK(T + 2C) ln(2cL(2 + T/C)/δ)) + 2C(i∗/c)^{1/α}.

The upper bound given in Theorem 2 is the same as that for the general case of unstructured experts because the lower bound from PTS can stay at 1 in the worst case. A trivial example is the one in which all experts are identical. For cases where the experts' performance differs by sufficient margins, the actual improvement of BEES.LB over BEES should become apparent.

If the globally best expert i∗ is fixed over time, then we can modify BEES.LB to additionally estimate an upper bound on i∗, initialized to ∞. We can show that the confidence interval for i∗ contracts over epochs. While the epoch length always grows exponentially, the set of experts considered in each epoch is data-dependent. If no upper bound on i∗ has been identified, then the number of experts considered will increase by a factor of 2^α in the next epoch. Otherwise, only the experts in the non-expanding confidence interval will be considered from then on.
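The thresholding rule in PTS (Algorithm 4) can be transcribed into a short Python sketch. The running counter and the handling of ties follow one plausible reading of the pseudocode, not the authors' reference implementation:

```python
import math

def pts(w, eps, i):
    """Sketch of Probabilistic Thresholding Search (PTS, Algorithm 4).

    w[j] is the Exp4.R weight of the j-th expert in the epoch's window,
    eps[j] the corresponding log-weight margin, and i the index of the
    first expert in the window. Whenever some later expert j' beats
    expert j by more than the margin (ln w[j'] - ln w[j] > eps[j']),
    expert j is deemed suboptimal under Assumption 2 and the lower-bound
    estimate moves past it.
    """
    n = len(w)
    j_hat = 1  # 1-based offset of the new lower bound within the window
    for j in range(1, n):               # j = 1, ..., N - 1
        for jp in range(j + 1, n + 1):  # j' = j + 1, ..., N
            if math.log(w[jp - 1]) - math.log(w[j - 1]) > eps[jp - 1]:
                j_hat = j + 1
    return i + j_hat - 1

# Expert 1 is dominated by expert 2 beyond the margin, so the lower
# bound advances by one; expert 2 is never dominated (illustrative values).
assert pts([1.0, 8.0, 8.0], [0.5, 0.5, 0.5], i=1) == 2
```

The naive double loop is quadratic in the window size; sorting w first permits a faster implementation, as noted above.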
5. Discussion
In this paper, we have proposed an algorithm for the problem of nonstochastic bandits with infinitely many experts under the constraint of having access to only a finite subset of experts. We have established a high-probability upper bound on the regret of our meta-algorithm BEES, which matches the lower bound up to polylog factors if the globally best expert has a relatively low index. If we assume that there exists a prior distribution on the best expert, then the probability that our regret upper bound is tight will increase with the time horizon, the rate of which can be fast. The expert ranking property of the subroutine Exp4.R enables learning acceleration if the structure of the experts is known. We have demonstrated this point with an example that is inspired by the problem of finite-time model selection for RL. One interesting direction for future work is to obtain instance-dependent upper bounds in terms of the experts' suboptimality gaps. Such instance-dependent bounds can be used to prove the learning acceleration enabled by Exp4.R. A simple implementation of our algorithms inherits the computational complexity of Exp4.P. It is worthwhile to design efficient implementations for specific applications.
Acknowledgments
The authors would like to thank John N. Tsitsiklis, Dylan J. Foster, Caroline Uhler, Devavrat Shah, and Thibaut Horel for helpful discussions. This work was supported by the OCP Group.
References
Abbasi-Yadkori, Y., Pacchiano, A., and Phan, M. Regret balancing for bandit and RL model selection. arXiv preprint arXiv:2006.05491, 2020.

Agarwal, A., Dudik, M., Kale, S., Langford, J., and Schapire, R. E. Contextual bandit learning with predictable rewards. In International Conference on Artificial Intelligence and Statistics, pp. 19–26, 2012.

Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In Conference on Learning Theory, pp. 12–38, 2017.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In IEEE 36th Annual Foundations of Computer Science, pp. 322–331, 1995.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002a.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002b.

Berry, D. A., Chen, R. W., Zame, A., Heath, D. C., and Shepp, L. A. Bandit problems with infinitely many arms. Ann. Statist., 25(5):2103–2116, 1997.

Besson, L. and Kaufmann, E. What doubling tricks can and can't do for multi-armed bandits. arXiv preprint arXiv:1803.06971, 2018.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, pp. 19–26, 2011.

Binas, J., Luginbuehl, L., and Bengio, Y. Reinforcement learning for sustainable agriculture. CCAI Workshop at the 36th International Conference on Machine Learning, 2019.

Carpentier, A. and Valko, M. Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, pp. 1133–1141, 2015.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Foster, D. J., Kale, S., Mohri, M., and Sridharan, K. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, pp. 6020–6030, 2017.

Foster, D. J., Krishnamurthy, A., and Luo, H. Model selection for contextual bandits. In Advances in Neural Information Processing Systems, pp. 14741–14752, 2019.

Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., and Hutter, M. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26), 2019.

Kale, S. Multiarmed bandits with limited expert advice. In Conference on Learning Theory, pp. 107–122, 2014.

Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In ACM Symposium on Theory of Computing, pp. 681–690, 2008.

Kuderer, M., Gulati, S., and Burgard, W. Learning driving styles for autonomous vehicles from demonstration. In IEEE International Conference on Robotics and Automation, pp. 2641–2646, 2015.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4–22, 1985.

Matni, N., Proutiere, A., Rantzer, A., and Tu, S. From self-tuning regulators to reinforcement learning and back again. In IEEE Conference on Decision and Control, pp. 3724–3740, 2019.

McMahan, H. B. and Streeter, M. Tighter bounds for multi-armed bandits with expert advice. In Conference on Learning Theory, 2009.

Nemati, S., Ghassemi, M. M., and Clifford, G. D. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 2978–2981, 2016.

Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pp. 3168–3176, 2015.

Ni, C. and Wang, M. Maximum likelihood tensor decomposition of Markov decision process. In IEEE International Symposium on Information Theory, pp. 3062–3066, 2019.

Robbins, H. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527–535, 1952.

Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

Seldin, Y., Crammer, K., and Bartlett, P. Open problem: Adversarial multiarmed bandits with limited advice. In Conference on Learning Theory, pp. 1067–1072, 2013.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
A. Proof of Lemma 1
Let E_t[·] denote the conditional expectation given the history until time t − 1. We can show that ŷ_i(t) is a conditionally unbiased estimator for y_i(t). In other words, E_t[ŷ_i(t)] = y_i(t) for all i and t. Lemma 5 shows that v̂_i(t) is an upper bound on the conditional variance of ŷ_i(t). Lemma 6 is a Freedman-style inequality for martingales from (Beygelzimer et al., 2011). The proof of Lemma 1 relies on Lemmas 5 and 6.

Lemma 5 (From proof of Lemma 3 in (Beygelzimer et al., 2011)). For all t ∈ Z+ and i ∈ I, we have E_t[(y_i(t) − ŷ_i(t))²] ≤ v̂_i(t).

Lemma 6 ((Beygelzimer et al., 2011), Theorem 1). Let X_1, ..., X_T be a sequence of real-valued random variables. For any real-valued random variable Y, we define E_t[Y] ≜ E[Y | X_1, ..., X_{t−1}]. We assume that X_t ≤ B and E_t[X_t] = 0 for all t. We define the random variables

S ≜ Σ_{t=1}^T X_t,   V ≜ Σ_{t=1}^T E_t[X_t²].

For any fixed estimate V′ > 0 of V and for any δ ∈ (0, 1), with probability at least 1 − δ, we have

S ≤ √((e − 2) ln(1/δ)) (V/√V′ + √V′)   if V′ ≥ B² ln(1/δ)/(e − 2),
S ≤ B ln(1/δ) + (e − 2)V/B   otherwise.

Proof of Lemma 1.
We now fix any i ∈ I and t ∈ Z+. By definition, we have y_i(t) ∈ [0, 1]. Using (1) and the assumption that ρ ∈ [0, 1/K], we get p_a(t) ≥ ρ for all a ∈ A. Thus, (2) implies that ŷ_i(t) ∈ [0, 1/ρ] almost surely. Let X_t = y_i(t) − ŷ_i(t). We then have −1/ρ ≤ X_t ≤ 1 almost surely. We can show that E_t[ŷ_i(t)] = y_i(t) and hence E_t[X_t] = 0. We recall that R_i(T) = Σ_{t=1}^T y_i(t). Applying Lemma 6 to (X_t)_t and (−X_t)_t respectively and then taking a union bound, we conclude that, for any δ ∈ (0, 1), with probability at least 1 − δ/N, the inequality −B₁ ≤ R_i(T) − R̂_i(T) ≤ B₂ holds, where

B₁ ≜ √((e − 2) ln(2N/δ)) (V/√V′ + √V′)   if V′ ≥ ln(2N/δ)/((e − 2)ρ²),
B₁ ≜ ln(2N/δ)/ρ + (e − 2)ρV   otherwise,

B₂ ≜ √((e − 2) ln(2N/δ)) (V/√V′ + √V′)   if V′ ≥ ln(2N/δ)/(e − 2),
B₂ ≜ ln(2N/δ) + (e − 2)V   otherwise,

V ≜ Σ_{t=1}^T E_t[X_t²].

We now fix an arbitrary δ ∈ (0, 1). Assumption 1 implies that ln(2N/δ) ≤ (e − 2)KT. Taking ρ = √(ln N/(KT)) and V′ = KT, we have

ln(2N/δ)/(e − 2) ≤ V′ < ln(2N/δ)/((e − 2)ρ²).

Lemma 5 implies that V ≤ V̂_i(T). Therefore, with probability at least 1 − δ/N, we have

−ln(2N/δ) √(KT/ln N) − √(ln N/(KT)) V̂_i(T) ≤ R_i(T) − R̂_i(T) ≤ √(ln(2N/δ)) (V̂_i(T)/√(KT) + √(KT)).

Applying the union bound over i ∈ I, we conclude that P(E(δ)) ≥ 1 − δ.

B. Proof of Theorem 2
Proof of Theorem 2.
We can show that, for all δ ∈ (0, 1), α ∈ Z+, and c ∈ Z+, there exists C(α, c, K, δ) ∈ Z+ such that K ln(c·2^{αl}) ≤ C·2^l and ln(c·2^{αl+1}/δ) ≤ (e − 2)CK·2^l for all l ∈ Z+. For example, we can set C = ⌈αK ln(16c/δ)⌉. Together with the definitions of N_l and T_l in Algorithm 3, we have that, for all α ∈ Z+ and c ∈ Z+, there exists C ∈ Z+ such that K ln N_l ≤ T_l and ln(2N_l/δ) ≤ (e − 2)KT_l for all l ∈ Z+. We fix such integers α, c, C ∈ Z+ for the rest of the proof.

For simplicity of notation, we first consider running Exp4.R(δ, T_l, ρ_l, I_l) in each epoch l of Algorithm 3 for any δ ∈ (0, 1/L] and then apply a change of variables at the end of the proof. We suppose that a uniform expert is available in each epoch. Assumption 1 is then satisfied for all epochs. For now, we assume that event E(δ) holds for all epochs, the probability of which will be discussed at the end of the proof. For simplicity of notation, let i∗_l ≜ i∗(I_l; 𝒯_l) for l ∈ [L]. Let U_l ≜ αl + log₂(2c/δ) for l ∈ [L]. Recall that 𝒯_l is the time interval of epoch l, where |𝒯_l| = T_l. By Lemma 2,

Σ_{l=1}^L R_{i∗_l}(𝒯_l) − Σ_{t=1}^T r_{a(t)}(t) ≤ Σ_{l=1}^L 7√(KT_l ln(2N_l/δ))
  = Σ_{l=1}^L 7√(KC·2^l ln(c·2^{αl+1}/δ))
  = 7√(KC ln 2) Σ_{l=1}^L √(2^l U_l)
  ≤ 7√(KC U_L ln 2) Σ_{l=1}^L 2^{l/2}
  < 20√(KC U_L)(2^{L/2} − 1).

Since L = ⌈log₂[1 + T/(2C)]⌉, we have

Σ_{l=1}^L R_{i∗_l}(𝒯_l) − Σ_{t=1}^T r_{a(t)}(t) < 20√(KC U_L)(√(2 + T/C) − 1) < 20√(K[αL + 2 ln(2c/δ)](2C + T)).   (10)

We first discuss the case where i∗ ∉ I_1. Let L′′ be the last epoch such that i∗ is not considered in Algorithm 3. In other words, L′′ ≜ max{l ∈ [L] | i∗ ∉ I_l}.
Lemma 4 implies that i∗ ∈ I_l for all l > L′′. By the definition of i∗_l, we have R_{i∗_l}(𝒯_l) ≥ R_{i∗}(𝒯_l) for all l > L′′. Thus,

R_{i∗}([T]) − Σ_{l=1}^L R_{i∗_l}(𝒯_l) ≤ Σ_{l=1}^{L′′} (R_{i∗}(𝒯_l) − R_{i∗_l}(𝒯_l)) ≤ Σ_{l=1}^{L′′} T_l < C·2^{L′′+1}.

We now provide an upper bound on L′′. By Algorithms 3 and 4, we have |I_l| = N_l and 1 ≤ i_l ≤ i_{l+1} for all l. Let L′ be the last epoch such that i∗ is not considered in the worst case where i_l = 1 for all l. In other words, L′ ≜ min(L, ⌈α^{−1} log₂(i∗/c)⌉ − 1). Under the assumption that i∗ ∉ I_1, we get L′ ≥ 1. By the definitions of L′ and L′′, we have L′′ ≤ L′ and hence

R_{i∗}([T]) − Σ_{l=1}^L R_{i∗_l}(𝒯_l) < C·2^{L′+1} ≤ 2C(i∗/c)^{1/α}.   (11)

We now consider the case where i∗ ∈ I_1. It follows from Lemma 4 that i∗ ∈ I_l for all l. Thus, the definition of i∗_l implies that R_{i∗_l}(𝒯_l) ≥ R_{i∗}(𝒯_l) for all l. We define D ≜ R_{i∗}([T]) − Σ_{l=1}^L R_{i∗_l}(𝒯_l). We then have D ≤ 0. However, the definition of i∗ implies that D ≥ 0. Therefore, D = 0 and (11) is satisfied.

Adding (10) and (11) gives

Regret(T) < 20√(K[αL + 2 ln(2c/δ)](2C + T)) + 2C(i∗/c)^{1/α}.   (12)

Using Lemma 1 and the union bound over all L epochs, we conclude that (12) holds with probability at least 1 − Lδ. A change of variables gives that, for any δ ∈ (0, 1), with probability at least 1 − δ, we have

Regret(T) < 20√(K[αL + 2 ln(2cL/δ)](2C + T)) + 2C(i∗/c)^{1/α} < 20√(2αK(T + 2C) ln(2cL(2 + T/C)/δ)) + 2C(i∗/c)^{1/α}.
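Two facts about the epoch count used above can be sanity-checked numerically under the reading L = ⌈log₂(1 + T/(2C))⌉ with epoch lengths T_l = C·2^l (this reading of the doubling schedule is an assumption of the sketch): the L epochs always cover the horizon, and 2^L < 2 + T/C, so 2^{L/2} < √(2 + T/C).

```python
import math

# Sanity check over a grid of horizons T and epoch constants C: with
# L = ceil(log2(1 + T/(2C))) epochs of length T_l = C * 2**l, the total
# epoch length sum_l T_l = 2C(2**L - 1) is at least T, and 2**L stays
# strictly below 2 + T/C.
for C in (1, 4, 16, 64):
    for T in range(1, 5001):
        L = math.ceil(math.log2(1 + T / (2 * C)))
        total = sum(C * 2 ** l for l in range(1, L + 1))  # = 2C(2**L - 1)
        assert total >= T                  # the L epochs cover the horizon
        assert 2 ** L < 2 + T / C + 1e-9   # hence 2**(L/2) < sqrt(2 + T/C)
```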