Nonstochastic Bandits with Infinitely Many Experts
NNonstochastic Bandits with Infinitely Many Experts
X. Flora Meng Tuhin Sarkar Munther A. Dahleh Abstract
We study the problem of nonstochastic banditswith infinitely many experts: A learner aims tomaximize the total reward by taking actions se-quentially based on bandit feedback while bench-marking against a countably infinite set of experts.We propose a variant of Exp4.P that, for finitelymany experts, enables inference of correct expertrankings while preserving the order of the regretupper bound. We then incorporate the variant intoa meta-algorithm that works on infinitely manyexperts. We prove a high-probability upper boundof ˜ O (cid:0) i ∗ K + √ KT (cid:1) on the regret, up to polylogfactors, where i ∗ is the unknown position of thebest expert, K is the number of actions, and T is the time horizon. We also provide an exampleof structured experts and discuss how to expeditelearning in such case. Our meta-learning algo-rithm achieves the tightest regret upper bound forthe setting considered when i ∗ = ˜ O (cid:0)(cid:112) T /K (cid:1) .If a prior distribution is assumed to exist for i ∗ ,the probability of satisfying a tight regret boundincreases with T , the rate of which can be fast.
1. Introduction
Early work on the multi-armed bandit problem commonlystudied settings where the rewards of each arm are stochas-tically generated from some unknown distribution (Robbins,1952; Lai & Robbins, 1985; Auer et al., 2002a). In general,such statistical assumptions are difficult to validate or inap-proriate for some applications such as packet transmissionin communication networks (Auer et al., 1995; 2002b). Theproblem of nonstochastic bandits, first investigated in (Aueret al., 1995; 2002b), makes no statistical assumptions abouthow the rewards are generated.A setting of the nonstochastic bandit problem allows forincorporating expert advice. The learner interacts with an adversary over time horizon T as follows. At each time, Department of Electrical Engineering and Computer Science,Massachusetts Institute of Technology, Cambridge, MA, USA.Correspondence to: X. Flora Meng < [email protected] > .Copyright 2021 by the authors. the adversary sets the rewards for the K actions and keepsthem secret. The learner then gets every expert’s adviceon the probability of choosing each action. The learnersubsequently combines the experts’ advice and samples anaction. Finally, the learner observes only the reward of theaction chosen, and the game repeats. The learner’s goal is tominimize regret , which is the gap between the total rewardgained and the expected total reward of the best expert i ∗ who is unknown a priori.The framework described is a general one. First, there isno assumption about the generation of rewards except thatthe adversary is oblivious . In other words, the adversary’schoices are independent of the learner’s strategy. Equiva-lently, all rewards can be assigned before the game starts,and the learner only observes the rewards of chosen actionssequentially. Second, we do not restrict or assume knowl-edge of how the experts come up with their advice. 
Third,experts can give deterministic advice.The problem of bandits with expert advice is not only anatural model for numerous real-world applications, suchas selecting and pricing online advertisements (McMahan& Streeter, 2009), but also important from a theoretical per-spective. Contextual bandits can be framed as a bandits withexpert advice problem by introducing policies that map acontext to a probability distribution over actions (McMahan& Streeter, 2009; Agarwal et al., 2017). Bandits with expertadvice are also closely related to online model selectionwhere experts correspond to model classes (Cesa-Bianchi &Lugosi, 2006; Foster et al., 2017; 2019).Prior work on nonstochastic bandits with expert advice typ-ically assumes the number of experts to be finite (Aueret al., 1995; 2002b; McMahan & Streeter, 2009; Beygelz-imer et al., 2011; Neu, 2015). The exp onential-weight algo-rithm for exp loration and exp loitation using exp ert advice(Exp4), introduced by (Auer et al., 1995; 2002b), has a regretupper bound of O (cid:0) √ KT ln N (cid:1) in expectation , where N isthe number of experts. This upper bound almost matches thelower bound Ω( (cid:112) ( KT ln N ) / ln K ) derived by (Agarwalet al., 2012) for the expected regret when ln N ≤ T ln K .However, Exp4 does not satisfy a similar regret guarantee with high probability due to the large variance of its es-timates. Algorithms with high-probability guarantees arepreferred for domains that need reliable methods, but such a r X i v : . [ c s . L G ] F e b onstochastic Bandits with Infinitely Many Experts algorithms require delicate analysis (Beygelzimer et al.,2011; Neu, 2015). The Exp4.P algorithm, a variant of Exp4proposed by (Beygelzimer et al., 2011), has a regret boundedfrom above by O (cid:0)(cid:112) KT ln ( N/δ ) (cid:1) with probability at least − δ . 
This bound can be improved by a constant factor withthe key idea of avoiding explicit exploration (Neu, 2015).We study the problem of nonstochastic bandits with in-finitely many experts. Our main question is: Can the learnerperform almost as well as the globally best expert i ∗ of acountably infinite set while only querying a finite number ofexperts? This question is motivated by challenges encoun-tered in practical situations where it is unfeasible to seekadvice from all experts all the time (Seldin et al., 2013). Forsearch engine advertising, a company may need to chooseamong a multitude of schemes some of which also involvehyperparameter tuning (McMahan & Streeter, 2009). Asanother example, there are often a myriad of features thatcan be used for online recommendation systems. Somefeatures tend to be more informative than others, but theirrelevance is normally unknown a priori. We can transformthis problem into bandits with expert advice where each ex-pert corresponds to a model class in a certain feature space.The number of experts can be extremely large due to thecombinatorial nature. In contrast to the large number ofexperts available, it is desirable to query only some of themeach time in consideration of computational constraints.
Our Contributions
For the general case without any as-sumption about the experts, we propose an algorithm called Be st E xpert S earch (BEES) and provide theoretical guar-antees on its performance. BEES runs a subroutine calledExp4.R, an algorithm that we obtain by modifying Exp4.P.The “R” denotes a feature of Exp4.R: it enables inferenceof correct expert rankings with high probability in addi-tion to satisfying a regret upper bound of the same order asthat proved for Exp4.P. Our main result establishes a high-probability upper bound of ˜ O (cid:0) ( i ∗ ) /α K + √ αKT (cid:1) on theregret of BEES, hiding only polylog factors, which dependson the position of the unknown best expert i ∗ and a positiveinteger-valued parameter α . This upper bound illustrates thetrade-off, controlled by α , between exploration and exploita-tion for the problem of nonstochastic bandits with infinitelymany experts. On the one hand, it is desirable to include alarge number of experts per epoch so as to approach i ∗ ata fast rate. On the other hand, consulting too many expertssimultaneously necessitates long epochs, which reduces therate at which more experts are included. Although tuning α needs the unknown index i ∗ , we can simply set α = 1 .To the best of our knowledge, our high-probability regretupper bound is the tightest for the setting considered when i ∗ = ˜ O (cid:0)(cid:112) T /K (cid:1) . This regime is less restricted than itseems at first sight. If we assume a prior distribution on i ∗ , then i ∗ = ˜ O (cid:0)(cid:112) T /K (cid:1) holds with a probability that in- creases with T , the rate of which can be fast. Inspired bythe problem of finite-time model selection for reinforcementlearning (RL), we also present an example of structured ex-perts, which simulates the trade-off between approximationand estimation. We discuss how the expert ranking propertyof Exp4.R can be used to expedite learning in such case. Related Work
A natural approach is to consider expertsas arms and use methods for infinitely many-armed ban-dits such as (Berry et al., 1997; Kleinberg et al., 2008;Rusmevichientong & Tsitsiklis, 2010; Carpentier & Valko,2015). However, such work relies on statistical assumptions,whereas our setting is a nonstochastic bandit problem. Ourquestion is also related to bandits with limited advice, firstposed by (Seldin et al., 2013) and subsequently solved by(Kale, 2014), but their setting considers a finite number ofexperts of whom only a subset can be queried at each time.To the best of our knowledge, no work achieves high-probability regret bound of ˜ O (cid:0) √ KT (cid:1) for the setting con-sidered. When configured correctly, Exp4 has a regret up-per bound of O (cid:0) √ KT ln i ∗ (cid:1) in expectation (Foster et al.,2019). However, the algorithm is computationally unfea-sible as it needs to handle infinitely many experts at everytime step. One method of making Exp4 computationallytractable is to truncate the sequence of experts to a subsetof size O (cid:0) e √ KT (cid:1) as any larger set would make the ex-pected regret superlinear in K or T . Running Exp4 withcorrect configurations on this subset of experts has a re-gret upper bound of O (cid:0) ( KT ) / + T ∆ (cid:1) in expectation where ∆ is the infimum upper bound on the suboptimalitygaps of the experts considered. For stochastic contextualbandits, Exp4.P can be used as a subroutine to achieve ahigh-probability regret bound of ˜ O (cid:0) √ dT ln T (cid:1) with an in-finite set of experts that has a finite Vapnik–Chervonenkisdimension d (Beygelzimer et al., 2011). Since the regretanalysis of Exp4.P relies on the union bound, the algorithmdoes not apply to infinitely many experts in the nonstochas-tic setting. 
If we run Exp4.P on a finite subset of expertsof size Θ (cid:0) δ exp (cid:0)(cid:112) T / (16 K ) (cid:1)(cid:1) , the regret is then boundedfrom above by O (cid:0) K / T / + T ∆ (cid:1) with probability atleast − δ . Outline
Section 2 formally defines the problem of non-stochastic bandits with infinitely many experts. In Section 3,we introduce Exp4.R for the setting of finitely many expertsand prove that it enables inference of correct expert rank-ings with high probability. Section 4 investigates the case ofinfinitely many experts and presents a meta-algorithm thatruns Exp4.R as a subroutine. We prove a high-probabilityregret upper bound and give an example to illustrate howto expedite learning when working with structured experts.Finally, we conclude in Section 5. onstochastic Bandits with Infinitely Many Experts
2. Problem Formulation
Let Z + be the set of strictly positive integers. For N ∈ Z + ,we define [ N ] (cid:44) { , , . . . , N } . Let T ∈ Z + be the timehorizon . Let A be a set of actions where |A| = K < ∞ .At each time t ∈ [ T ] , the adversary first sets a reward vector r ( t ) ∈ [0 , K where r a ( t ) is the reward of action a . Eachexpert i ∈ Z + then gives their advice ξ i ( t ) , which is a prob-ability vector over A . After querying a finite subset of theexperts’ advice but not the rewards, the learner then sam-ples an action a ( t ) . Finally, the learner receives the reward r a ( t ) ( t ) and no other information. The game proceeds totime t + 1 and finishes after T time steps. The learner’s goalis to combine the experts’ advice such that the total rewardis close to a benchmark, which we will define shortly.Let y i ( t ) (cid:44) (cid:80) a ∈A ξ ia ( t ) r a ( t ) be the expected reward ofexpert i at time t . For any time interval T ⊂ Z + such that |T | < ∞ , we denote the expected total reward of expert i during T as R i ( T ) (cid:44) (cid:80) t ∈T y i ( t ) . We define the bestexpert i ∗ ( I ; T ) of a subset I ⊆ Z + during T as the onewith the lowest index that has the highest total reward in ex-pectation, namely, i ∗ ( I ; T ) (cid:44) min { argmax i ∈I R i ( T ) } .The learner’s regret with respect to i ∗ ( I ; T ) isRegret ( T ; I ) (cid:44) R i ∗ ( I ; T ) ( T ) − (cid:88) t ∈T r a ( t ) ( t ) . For simplicity of notation, let Regret ( T ) (cid:44) Regret ([ T ]; Z + ) and i ∗ (cid:44) i ∗ ( Z + ; [ T ]) . The learner’s goal is to minimizeRegret ( T ) , the regret with respect to the globally best expert i ∗ for the time horizon considered.
3. Nonstochastic Bandits with a FiniteNumber of Experts
We start with a simplified problem where the number ofexperts is finite. Section 3.1 presents Exp4.R (Algorithm 1)and provides some intuition for its design. In Section 3.2,we show that Exp4.R not only preserves the regret upperbound of Exp4.P in terms of order but also enables inferenceof correct expert rankings with high probability.
Exp4.R (Algorithm 1) is a slight variant of Exp4.P proposedby (Beygelzimer et al., 2011). The major distinction isthat Exp4.R calculates a threshold vector (cid:15) which enablesinference of correct expert rankings with high probability.Exp4.R takes four inputs, namely, an error rate δ ∈ (0 , ,a time horizon T ∈ Z + , the minimum probability ρ ∈ (0 , /K ] of exploration, and a finite set of experts I ⊂ Z + .Without loss of generality, we suppose that |I| = N . If max i ∈I R i ( T ) does not exist, we define i ∗ ( I ; T ) = ∞ and R i ∗ ( I ; T ) ( T ) = sup i ∈I R i ( T ) . Algorithm 1
Exp4.R
Input: δ ∈ (0 , , T ∈ Z + , ρ ∈ (0 , /K ] , I ⊂ Z + Output: w ( T + 1) , (cid:15)β ← (cid:112) ln(2 N/δ ) / ( KT ) .w i (1) ← for i ∈ I . for t = 1 , . . . , T do Get ξ i ( t ) for i ∈ I . q i ( t ) ← w i ( t ) / (cid:80) i (cid:48) ∈I w i (cid:48) ( t ) for i ∈ I . p a ( t ) ← (1 − Kρ ) (cid:80) i ∈I q i ( t ) ξ ia ( t ) + ρ for a ∈ A .Sample action a ( t ) from p ( t ) .Take action a ( t ) and receive reward r a ( t ) ( t ) . for i ∈ I do ˆ y i ( t ) ← ξ ia ( t ) ( t ) r a ( t ) ( t ) p a ( t ) ( t ) , ˆ v i ( t ) ← (cid:88) a ∈A ξ ia ( t ) p a ( t ) ,w i ( t + 1) ← w i ( t ) exp (cid:16) ρ y i ( t ) + β ˆ v i ( t )] (cid:17) . end forend forfor i ∈ I do (cid:15) i ← (cid:34) KT T (cid:88) t =1 ˆ v i ( t ) (cid:35) ln (cid:18) Nδ (cid:19) . end for Exp4.R first initializes a weight w i (1) = 1 for each expert i ∈ I . At time t ∈ [ T ] , normalizing w ( t ) gives a probabilitydistribution q ( t ) over I . After getting advice ξ i ( t ) fromeach expert i , Exp4.R constructs a probability distribution p ( t ) over A by weighting all advice according to q ( t ) andmixing in uniform exploration so that p a ( t ) ≥ ρ for all a ∈ A . Specifically, for all a , let p a ( t ) = (1 − Kρ ) (cid:88) i ∈I q i ( t ) ξ ia ( t ) + ρ. (1)Exp4.R subsequently takes action a ( t ) sampled accordingto p ( t ) and receives the reward r a ( t ) ( t ) . Time t concludeswith weight updates as specified below. For i ∈ I , Exp4.Restimates y i ( t ) by ˆ y i ( t ) and calculates an upper bound onthe variance of ˆ y i ( t ) conditional on history until time t − as given by ˆ y i ( t ) = ξ ia ( t ) ( t ) r a ( t ) ( t ) p a ( t ) ( t ) , ˆ v i ( t ) = (cid:88) a ∈A ξ ia ( t ) p a ( t ) . (2)Exp4.R updates each expert’s weight w i ( t ) using w i ( t + 1) = w i ( t ) exp (cid:16) ρ y i ( t ) + β ˆ v i ( t )] (cid:17) , (3)where β = (cid:112) ln(2 N/δ ) / ( KT ) . 
The game ends in T timesteps and gives two outputs, namely, the final weight vector onstochastic Bandits with Infinitely Many Experts w ( T + 1) and a threshold vector (cid:15) , the i th entry of which is (cid:15) i = (cid:34) KT T (cid:88) t =1 ˆ v i ( t ) (cid:35) ln (cid:18) Nδ (cid:19) . We establish in Proposition 1 that, with high probability,Exp4.R not only satisfies a regret upper bound of the sameorder as that proved for Exp4.P but also reveals correctpairwise expert rankings if the corresponding weights aresufficiently separated.For simplicity of notation, we denote R i ([ T ]) (cid:44) (cid:80) Tt =1 y i ( t ) as R i ( T ) . Updating weights using (3) allows us to constructa confidence bound for each R i ( T ) . For i ∈ I , let ˆ R i ( T ) (cid:44) (cid:80) Tt =1 ˆ y i ( t ) and ˆ V i ( T ) (cid:44) (cid:80) Tt =1 ˆ v i ( t ) . For any δ ∈ (0 , ,let E ( δ ) be an event defined by ∀ i ∈ I , − ln (cid:18) Nδ (cid:19) (cid:114) KT ln N − (cid:114) ln NKT ˆ V i ( T ) ≤ R i ( T ) − ˆ R i ( T ) ≤ (cid:115) ln (cid:18) Nδ (cid:19) (cid:32) ˆ V i ( T ) √ KT + √ KT (cid:33) . Lemma 1 shows that the estimates ˆ R i ( T ) are concen-trated around the true values R i ( T ) . The proof relies on aFreedman-style inequality for martingales from (Beygelz-imer et al., 2011), which we defer to the appendix.Lemma 2 establishes an upper bound on the regret ofExp4.R. Since Lemma 2 is a slight variant of Theorem 2in (Beygelzimer et al., 2011), the proof is very similarto the original one and hence omitted here. We notethat Theorem 2 in (Beygelzimer et al., 2011) holds for asmaller regime than stated in the original paper. To bespecific, the condition T = Ω( K ln N ) is essential for ρ = (cid:112) ln N/ ( KT ) ≤ /K to be true. We make the correc-tion in Lemma 2.Lemma 3 validates the correctness of the inferred expertrankings when the concentration event E ( δ ) holds. 
Corol-lary 1 shows that the uncertainty gap for ranking any pair ofexperts is the sum of their thresholds given by Exp4.R. Wecan prove Corollary 1 by first taking the contrapositive ofthe statement in Lemma 3 and then switching i and i (cid:48) .Finally, we combine the lemmas to prove Proposition 1.Same as Exp4.P, the computational complexity of Exp4.Ris O ( KN ) for space and O ( KN T ) for runtime. Assumption 1.
The following conditions hold: (i) max { K ln N, ln(2 N/δ ) / [( e − K ] } ≤ T , (ii) and thereexists a uniform expert i ∈ I such that ξ ia ( t ) = 1 /K for all a ∈ A and t ∈ Z + . Lemma 1.
Under Assumption 1, if we run Exp4.R with ρ = (cid:112) ln N/ ( KT ) , then P ( E ( δ )) ≥ − δ for all δ ∈ (0 , . Lemma 2.
Under Assumption 1, for any δ ∈ (0 , , if E ( δ ) holds, then Exp4.R with ρ = (cid:112) ln N/ ( KT ) satisfies thatRegret ( T ; I ) ≤ (cid:112) KT ln (2 N/δ ) . Lemma 3.
Under Assumption 1, for any δ ∈ (0 , , if E ( δ ) holds, then Exp4.R with ρ = (cid:112) ln N/ ( KT ) satisfiesthat, for all i, i (cid:48) ∈ I , we have R i ( T ) > R i (cid:48) ( T ) whenever ln w i ( T + 1) − ln w i (cid:48) ( T + 1) > (cid:15) i .Proof. We fix an arbitrary δ ∈ (0 , and suppose that event E ( δ ) holds. We recall that (cid:15) i (cid:44) (cid:34) V i ( T ) KT (cid:35) ln (cid:18) Nδ (cid:19) . We assume that ln w i ( T + 1) − ln w i (cid:48) ( T + 1) > (cid:15) i for some i, i (cid:48) ∈ I . By (3) and the initialization condition w i (1) = 1 ,we have ln w i ( T + 1) = T (cid:88) t =1 ln (cid:18) w i ( t + 1) w i ( t ) (cid:19) = ρ (cid:32) ˆ R i ( T ) + (cid:114) ln(2 N/δ ) KT ˆ V i ( T ) (cid:33) . Thus, ˆ R i ( T ) = 2 ρ ln w i ( T + 1) − (cid:114) ln(2 N/δ ) KT ˆ V i ( T ) . (4)Equation (4) also holds for i (cid:48) . Thus, ˆ R i ( T ) − ˆ R i (cid:48) ( T )= 2 ρ ln (cid:18) w i ( T + 1) w i (cid:48) ( T + 1) (cid:19) − (cid:114) ln(2 N/δ ) KT (cid:16) ˆ V i ( T ) − ˆ V i (cid:48) ( T ) (cid:17) > (cid:15) i ρ − (cid:114) ln(2 N/δ ) KT (cid:16) ˆ V i ( T ) − ˆ V i (cid:48) ( T ) (cid:17) = 2 ln (cid:18) Nδ (cid:19) (cid:114) KT ln N + (cid:114) ln(2 N/δ ) KT ˆ V i (cid:48) ( T )+ ˆ V i ( T ) (cid:114) ln(2 N/δ ) KT (cid:34) (cid:114) ln(2 N/δ )ln N − (cid:35) > (cid:18) Nδ (cid:19) (cid:114) KT ln N + (cid:114) ln(2 N/δ ) KT (cid:16) ˆ V i ( T ) + ˆ V i (cid:48) ( T ) (cid:17) . (5) onstochastic Bandits with Infinitely Many Experts Event E ( δ ) implies that R i ( T ) − ˆ R i ( T ) + ˆ R i (cid:48) ( T ) − R i (cid:48) ( T ) ≥ − ln (cid:18) Nδ (cid:19) (cid:114) KT ln N − (cid:114) ln NKT ˆ V i ( T ) − (cid:115) ln (cid:18) Nδ (cid:19) (cid:32) ˆ V i (cid:48) ( T ) √ KT + √ KT (cid:33) . (6)Adding (5) and (6) and then simplifying the algebra give R i ( T ) − R i (cid:48) ( T ) > . Corollary 1.
Under the conditions of Lemma 3, for all i, i (cid:48) ∈ I , it holds that(i) if ln w i ( T + 1) − ln w i (cid:48) ( T + 1) > (cid:15) i , then R i ( T ) > R i (cid:48) ( T ); (ii) if R i ( T ) ≥ R i (cid:48) ( T ) , then ln w i ( T + 1) − ln w i (cid:48) ( T + 1) ≥ − (cid:15) i (cid:48) . Proposition 1.
Under Assumption 1, for any δ ∈ (0 , ,with probability at least − δ , Exp4.R configured with ρ = (cid:112) ln N/ ( KT ) satisfies that(i) Regret ( T ; I ) ≤ (cid:112) KT ln (2 N/δ ) ;(ii) for all i, i (cid:48) ∈ I , if ln w i ( T + 1) − ln w i (cid:48) ( T + 1) > (cid:15) i ,then R i ( T ) > R i (cid:48) ( T ) .Proof. Proposition 1 follows directly from Lemmas 1–3.
4. Selection Among Infinitely Many Experts
In this section, we study the problem of nonstochastic ban-dits with a countably infinite set of experts. We make noassumptions about the experts or how they are indexed. Forthis general case, we propose a meta-algorithm called Be st E xpert S earch (BEES, Algorithm 2) that runs Exp4.R asa subroutine and provide a high-probability upper boundon regret. Section 4.1 provides an example of structuredexperts and discusses how the expert ranking property ofExp4.R can be used to expedite learning in such case.BEES takes five inputs including an error rate δ ∈ (0 , , thenumber of epochs L ∈ Z + , and three constants α, c, C ∈ Z + that control the exponential growth of the epoch lengthand the number of experts consulted in each epoch. At a highlevel, BEES supplies Exp4.R with an increasing (but still finite) number of experts over epochs, prioritizing those withlower indices. This scheme can be considered as puttinga prior on the experts implicitly where the experts that arebelieved to perform well are given low indices. Since wemake no assumptions about the experts, they can be orderedusing domain knowledge before being given to BEES asinputs. Growing the epoch length and the number of expertsat exponential rates allows us to derive a regret upper boundof the same order as that of Exp4.R when the best expert i ∗ has a relatively low index. This idea is similar to, thoughnot the same as, the doubling trick (Besson & Kaufmann,2018) as the latter only deals with the epoch length. Forour setting, we need to increase the number of experts at anappropriate rate relative to the epoch length. Algorithm 2 Be st E xpert S earch (BEES) Input: δ ∈ (0 , , α ∈ Z + , L ∈ Z + , c ∈ Z + , C ∈ Z + for epoch l = 1 , . . . , L do N l ← c αl , T l ← C l . ρ l ← (cid:112) ln N l / ( KT l ) . I l ← [ N l ] . Exp4.R ( δ/L, T l , ρ l , I l ) . 
end for Theorem 1 provides sample complexity guarantees forBEES by establishing a high-probability regret upper bound.Corollary 2 simplifies this bound for specific parametervalues. Corollary 2 shows that BEES, when tuned right,satisfies Regret ( T ) = ˜ O (cid:16) ( i ∗ ) /α K + √ αKT (cid:17) with highprobability, where ˜ O ( · ) omits only polylog factors. This up-per bound illustrates the trade-off between exploration andexploitation for the problem of bandits with infinitely manyexperts. On the one hand, we want to include a large num-ber of experts in each epoch so as to approach i ∗ at a fastrate. On the other hand, querying too many experts simul-taneously necessitates long epochs, which reduces the rateat which more experts are included. This trade-off is con-trolled by the parameter α ∈ Z + . The term ˜ O (cid:0) ( i ∗ ) /α K (cid:1) in the upper bound corresponds to regret from not con-sidering i ∗ sooner. The other term ˜ O (cid:16) √ αKT (cid:17) is theregret that benchmarks against the best expert in eachepoch. Another consideration for not using an arbitrar-ily large value of α is that the minimum time horizon re-quired by BEES which is T = Ω( C ( α, c, K, δ )) increaseswith α . Although tuning α needs the unknown index ofthe best expert i ∗ , we can simply set α = 1 . BEES hasspace complexity O ( K (1 + T /K ) α ) and time complexity ˜ O (cid:0) K (1 + T /K ) α +1 (cid:1) .To the best of our knowledge, Theorem 1 establishes thetightest high-probability regret bound for the setting consid-ered when i ∗ = ˜ O (cid:0)(cid:112) T /K (cid:1) . This regime is less restrictedthan it seems at first sight. Assuming a prior distribution onstochastic Bandits with Infinitely Many Experts on i ∗ shows that the condition on i ∗ is satisfied with aprobability that increases with T , the rate of which canbe fast. For simplicity, let α = 1 and c = 1 . 
In orderfor Regret ( T ) = ˜ O (cid:16) √ KT (cid:17) to hold with high probability,we need i ∗ = ˜ O (cid:0)(cid:112) T /K (cid:1) . We denote the complement ofthis event as B . If we suppose that F ( i ) = P ( i ∗ > i ) for i ∈ Z + and some function F : Z + → [0 , , then P ( B ) decreases with T . For example, if F ( i ) ∝ i − s for some s > , then P ( B ) is roughly proportional to K s/ T − s/ .If F ( i ) ∝ e − si for some s > , then P ( B ) is roughlyproportional to e − s √ T/K .Before stating Theorem 1, we provide some intuition for theproof. Lemma 1 implies that (cid:80) t ∈T l ˆ y i ( t ) ≈ R i ( T l ) for eachexpert i and every epoch l with high probability. For this rea-son, we can prove an upper bound on the regret with respectto the best expert in each epoch, namely, (cid:80) Ll =1 R i ∗ l ( T l ) − (cid:80) Tt =1 r a ( t ) ( t ) = ˜ O (cid:16) √ αKT (cid:17) . We then derive an upperbound on the gap between the globally best expert and thebest expert in each epoch, which is given by R i ∗ ([ T ]) − (cid:80) Ll =1 R i ∗ l ( T l ) = ˜ O (cid:0) ( i ∗ ) /α K (cid:1) . Adding the upper bounds,we get Regret ( T ) = ˜ O (cid:16) ( i ∗ ) /α K + √ αKT (cid:17) .For simplicity of notation, we suppose that the total numberof epochs L = log [1 + T / (2 C )] so that T = (cid:80) Ll =1 T l where T l = C l for l ∈ [ L ] . We use (cid:98)·(cid:99) and (cid:100)·(cid:101) to denotethe floor and ceiling functions, respectively. For the generalcase of T ≥ C , let L = (cid:98) log [1 + T / (2 C )] (cid:99) , T l = C l for l ∈ [ L − , and T L = T − (cid:80) L − l =1 T l . Theorem 1.
If a uniform expert is available in each epoch,then there exist absolute constants α ∈ Z + and c ∈ Z + such that, for some C ( α, c, K, δ ) ∈ Z + , BEES satisfies that,for any δ ∈ (0 , , with probability at least − δ , we haveRegret ( T ) < (cid:115) αK ( T + 2 C ) ln (cid:18) cL (2 + T /C ) δ (cid:19) + 2 C (cid:18) i ∗ c (cid:19) α . Corollary 2.
Under the conditions of Theorem 1, runningBEES with α ∈ Z + , c ∈ Z + , and C = (cid:100) αK ln(16 c /δ ) (cid:101) satisfies that, for any δ ∈ (0 , , with probability at least − δ , Regret ( T ) = ˜ O (cid:16) ( i ∗ ) /α K + √ αKT (cid:17) .Proof of Theorem 1. We can show that, for all δ ∈ (0 , , α ∈ Z + , and c ∈ Z + , there exists C ( α, c, K, δ ) ∈ Z + such that K ln (cid:0) c αl (cid:1) ≤ C l and ln (cid:0) c αl +1 /δ (cid:1) ≤ ( e − CK l for all l ∈ Z + . For example, we can set C = (cid:100) αK ln(16 c /δ ) (cid:101) . Together with the definitions of N l and T l in Algorithm 2, we have that, for all α ∈ Z + and c ∈ Z + , there exists C ∈ Z + such that K ln N l ≤ T l and ln(2 N l /δ ) ≤ ( e − KT l for all l ∈ Z + . We fix suchintegers α, c, C ∈ Z + for the rest of the proof.For simplicity of notation, we first consider runningExp4.R ( δ, T l , ρ l , I l ) in each epoch l for any δ ∈ (0 , /L ] and then apply a change of variables at the end of the proof.We suppose that a uniform expert is available in each epoch.Assumption 1 is then satisfied for all epochs. For now, weassume that event E ( δ ) holds for all epochs, the probabil-ity of which will be discussed at the end of the proof. Forsimplicity of notation, let i ∗ l (cid:44) i ∗ ( I l ; T l ) for l ∈ [ L ] .Let U l (cid:44) αl + log (2 c/δ ) for l ∈ [ L ] . Recall that T l is thetime interval of epoch l where |T l | = T l . By Lemma 2, L (cid:88) l =1 R i ∗ l ( T l ) − T (cid:88) t =1 r a ( t ) ( t ) ≤ L (cid:88) l =1 (cid:115) KT l ln (cid:18) N l δ (cid:19) = 7 √ KC ln 2 L (cid:88) l =1 (cid:112) l U l ≤ (cid:112) KCU L ln 2 L (cid:88) l =1 l/ < (cid:112) KCU L (cid:16) L/ − (cid:17) . Since L = log [1 + T / (2 C )] , we have L (cid:88) l =1 R i ∗ l ( T l ) − T (cid:88) t =1 r a ( t ) ( t ) < (cid:112) KCU L (cid:32)(cid:114) T C − (cid:33) < (cid:115) K (cid:20) αL + 2 ln (cid:18) cδ (cid:19)(cid:21) (cid:18) C + T (cid:19) . (7)We first discuss the case where i ∗ / ∈ I . 
Let L (cid:48) be the lastepoch such that i ∗ is not considered in Algorithm 2. Since |I l | = N l , we have L (cid:48) = min (cid:0) L, (cid:100) α − log ( i ∗ /c ) (cid:101) − (cid:1) .Since i ∗ / ∈ I , we get L (cid:48) ≥ . By the definition of i ∗ l , wehave R i ∗ l ( T l ) ≥ R i ∗ ( T l ) for all l > L (cid:48) . Thus, R i ∗ ([ T ]) − L (cid:88) l =1 R i ∗ l ( T l ) ≤ L (cid:48) (cid:88) l =1 (cid:0) R i ∗ ( T l ) − R i ∗ l ( T l ) (cid:1) ≤ L (cid:48) (cid:88) l =1 T l < C L (cid:48) +1 < C (cid:18) i ∗ c (cid:19) α . (8)We now consider the case where i ∗ ∈ I . It follows fromAlgorithm 2 that i ∗ ∈ I l for all l . Thus, the definition onstochastic Bandits with Infinitely Many Experts of i ∗ l implies that R i ∗ l ( T l ) ≥ R i ∗ ( T l ) for all l . We define D (cid:44) R i ∗ ([ T ]) − (cid:80) Ll =1 R i ∗ l ( T l ) . We then have D ≤ .However, the definition of i ∗ implies that D ≥ . Therefore, D = 0 and (8) is satisfied.Adding (7) and (8) givesRegret ( T ) < (cid:115) K (cid:20) αL + 2 ln (cid:18) cδ (cid:19)(cid:21) (cid:18) C + T (cid:19) + 2 C (cid:18) i ∗ c (cid:19) α . (9)Using Lemma 1 and the union bound over all L epochs,we conclude that (9) holds with probability at least − Lδ .A change of variables gives that, for any δ ∈ (0 , , withprobability at least − δ , we haveRegret ( T ) < (cid:115) K (cid:20) αL + 2 ln (cid:18) cLδ (cid:19)(cid:21) (cid:18) C + T (cid:19) + 2 C (cid:18) i ∗ c (cid:19) α < (cid:115) αK ( T + 2 C ) ln (cid:18) cL (2 + T /C ) δ (cid:19) + 2 C (cid:18) i ∗ c (cid:19) α . 
In this section, we present an example of structured expertsthat is inspired by the problem of finite-time model selec-tion for RL and discuss how the expert ranking property ofExp4.R can be used to expedite learning in such case.As RL becomes increasingly integrated into autonomoussystems such as agile robots (Hwangbo et al., 2019), self-driving vehicles (Kuderer et al., 2015), customized fertilizerformulation (Binas et al., 2019), and personalized medi-cation dosing (Nemati et al., 2016), it is crucial that thetechniques are robust (Matni et al., 2019). An aspect ofrobustness is the capability to detect and adjust for model er-rors. For RL, this entails both model selection and parameterestimation. How to achieve both objectives simultaneouslywhile maintaining provably good performance is an activearea of research (Ni & Wang, 2019; Abbasi-Yadkori et al.,2020). The crux of the problem of online model selectionfor RL is to balance approximation and estimation errorsin a time-dependent manner. As an example, we supposethat there is an infinite sequence of nested model classes.This structure arises naturally when an RL algorithm incor-porates increasingly many features over time. Some new features may also just become obtainable while an RL al-gorithm is running. In fact, it is unknown a priori for manyapplications what is a minimal feature space that contains anoptimal policy. Given an infinite sequence of model classes,the best class to use depends on the horizon or, equivalently,the amount of trajectory data that will become available.Although a larger model class has a smaller approximationerror, it tends to have a higher estimation error for a fixedfinite horizon. 
Moreover, if several classes have the sameapproximation power, the simplest one is typically preferredin consideration of time and space complexity.Inspired by the problem of finite-time model selection forRL, we propose to consider experts structured in a waythat simulates the trade-off between approximation and es-timation. In particular, we suppose that the experts areranked in ascending order of complexity. Assumption 2stipulates that the total reward is weakly unimodal in expec-tation with respect to the expert index during any period oftime. In addition, the index of the globally best expert is anondecreasing function of the time horizon. See Figure 1for an illustration. The proposed time-dependent unimodalstructure is fundamentally related to oracle inequalities inempirical risk minimization (Wainwright, 2019). Althoughthe experts’ performance may fluctuate around the proposedstructure in practice, solutions to the stylized setting are oftheoretical interest.
Assumption 2.
For all 𝒯 ⊂ [T], if i ≤ i∗(Z+; 𝒯), then R_{i−1}(𝒯) ≤ R_i(𝒯). Otherwise, R_i(𝒯) ≥ R_{i+1}(𝒯). Moreover, i∗(Z+; 𝒯) ≤ i∗(Z+; 𝒯′) if 𝒯, 𝒯′ ⊂ [T] and |𝒯′| > |𝒯|.

Figure 1. An illustration of Assumption 2.
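In code, the shape condition of Assumption 2 can be checked on any finite window of experts as follows (the reward values below are illustrative, not from the paper):

```python
def is_weakly_unimodal(rewards):
    """Check the shape condition of Assumption 2 on a finite window:
    expected total rewards are nondecreasing up to the best index and
    nonincreasing after it."""
    peak = max(range(len(rewards)), key=lambda k: rewards[k])
    rising = all(rewards[k - 1] <= rewards[k] for k in range(1, peak + 1))
    falling = all(rewards[k] >= rewards[k + 1]
                  for k in range(peak, len(rewards) - 1))
    return rising and falling

# Hypothetical expected total rewards of experts 1, 2, ... over a short
# and a long horizon; the best index shifts right as the horizon grows,
# as the second clause of Assumption 2 requires.
short_horizon = [3.0, 4.5, 5.0, 4.0, 2.5]
long_horizon = [5.0, 7.0, 9.0, 9.5, 8.0]
assert is_weakly_unimodal(short_horizon)
assert is_weakly_unimodal(long_horizon)
assert short_horizon.index(max(short_horizon)) <= long_horizon.index(max(long_horizon))
```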
Under Assumption 2, the outputs of Exp4.R give a threshold rule that allows us to find a lower bound for i∗, which can accelerate the rate of approaching i∗. We modify BEES to incorporate lower bound estimation (BEES.LB, Algorithm 3). BEES.LB runs Exp4.R and Probabilistic Thresholding Search (PTS, Algorithm 4) as subroutines. In each epoch, BEES.LB eliminates experts identified as suboptimal. Lemma 4 shows that the estimated lower bound is correct if the concentration event E(δ) holds. Theorem 2 provides sample complexity guarantees for BEES.LB by establishing a high-probability regret upper bound. The proof of Theorem 2 is similar to that of Theorem 1 and is hence deferred to the appendix. PTS has space complexity O(N) and time complexity O(N²). PTS can be efficiently implemented by first sorting the input w. BEES.LB takes the same space O(K(1 + T/K)^α) as BEES. The time complexity of BEES.LB is Õ(K(1 + T/K)^{α+1} + (1 + T/K)^{2α}), which reduces to the runtime of BEES for sufficiently small α ∈ Z+.

Algorithm 3 BEES with Lower Bound (BEES.LB)
Input: δ ∈ (0, 1), α ∈ Z+, L ∈ Z+, c ∈ Z+, C ∈ Z+.
  i_1 ← 1.
  for epoch l = 1, ..., L do
    N_l ← c·2^{αl}, T_l ← C·2^l.
    ρ_l ← √(ln N_l / (KT_l)).
    I_l ← {i_l, i_l + 1, ..., i_l + N_l − 1}.
    (w_l, ε_l) ← Exp4.R(δ/L, T_l, ρ_l, I_l).
    i_{l+1} ← PTS(w_l, ε_l, i_l).
  end for

Algorithm 4 Probabilistic Thresholding Search (PTS)
Input: w ∈ (0, ∞)^N, ε ∈ (0, ∞)^N, i ∈ Z+.
Output: i_new.
  ĵ ← 1.
  for j = 1, ..., N − 1 do
    for j′ = j + 1, ..., N do
      if ln w_{j′} − ln w_j > ε_{j′} then
        ĵ ← j + 1.
      end if
    end for
  end for
  i_new ← i + ĵ − 1.

Lemma 4.
Under Assumption 2 and the conditions of Lemma 3, if event E(δ) holds for all epochs, then i_l ≤ i∗ for all l.

Proof. Under the assumption that event E(δ) holds for all epochs, we prove the statement by induction on l. The base case holds trivially as i_1 = 1. For the inductive step, we assume that i_ι ≤ i∗ for all ι ≤ l. If i_{l+1} = i_l, then i∗ ≥ i_{l+1} by the induction hypothesis. If there exists some j ≥ 1 such that i_{l+1} = i_l + j, then Algorithm 4 implies that ln w_{j′} − ln w_j > ε_{j′} for some j′ > j in epoch l. Using Assumption 2 and Lemma 3, we get i∗ ≥ i_l + j = i_{l+1}.

Theorem 2.
Under Assumption 2, if a uniform expert is available in each epoch, then there exist absolute constants α ∈ Z+ and c ∈ Z+ such that, for some C(α, c, K, δ) ∈ Z+, BEES.LB satisfies that, for any δ ∈ (0, 1), with probability at least 1 − δ, we have

Regret(T) < 20 √(2αK(T + 2C) ln(2cL(2 + T/C)/δ)) + 2C(i∗/c)^{1/α}.

The upper bound given in Theorem 2 is the same as that for the general case of unstructured experts because the lower bound from PTS can stay at 1 in the worst case. A trivial example is the one in which all experts are identical. For cases where the experts' performance differs by sufficient margins, the actual improvement of BEES.LB over BEES should become apparent.

If the globally best expert i∗ is fixed over time, then we can modify BEES.LB to additionally estimate an upper bound on i∗, initialized to ∞. We can show that the confidence interval for i∗ contracts over epochs. While the epoch length always grows exponentially, the set of experts considered in each epoch is data-dependent. If no upper bound on i∗ has been identified, then the number of experts considered will increase by a factor of 2^α in the next epoch. Otherwise, only the experts in the non-expanding confidence interval will be considered from then on.
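The thresholding rule in PTS (Algorithm 4) can be transcribed into a short Python sketch. The running counter and the handling of ties follow one plausible reading of the pseudocode, not the authors' reference implementation:

```python
import math

def pts(w, eps, i):
    """Sketch of Probabilistic Thresholding Search (PTS, Algorithm 4).

    w[j] is the Exp4.R weight of the j-th expert in the epoch's window,
    eps[j] the corresponding log-weight margin, and i the index of the
    first expert in the window. Whenever some later expert j' beats
    expert j by more than the margin (ln w[j'] - ln w[j] > eps[j']),
    expert j is deemed suboptimal under Assumption 2 and the lower-bound
    estimate moves past it.
    """
    n = len(w)
    j_hat = 1  # 1-based offset of the new lower bound within the window
    for j in range(1, n):               # j = 1, ..., N - 1
        for jp in range(j + 1, n + 1):  # j' = j + 1, ..., N
            if math.log(w[jp - 1]) - math.log(w[j - 1]) > eps[jp - 1]:
                j_hat = j + 1
    return i + j_hat - 1

# Expert 1 is dominated by expert 2 beyond the margin, so the lower
# bound advances by one; expert 2 is never dominated (illustrative values).
assert pts([1.0, 8.0, 8.0], [0.5, 0.5, 0.5], i=1) == 2
```

The naive double loop is quadratic in the window size; sorting w first permits a faster implementation, as noted above.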
5. Discussion
In this paper, we have proposed an algorithm for the problem of nonstochastic bandits with infinitely many experts under the constraint of having access to only a finite subset of experts. We have established a high-probability upper bound on the regret of our meta-algorithm BEES, which matches the lower bound up to polylog factors if the globally best expert has a relatively low index. If we assume that there exists a prior distribution on the best expert, then the probability that our regret upper bound is tight will increase with the time horizon, the rate of which can be fast. The expert ranking property of the subroutine Exp4.R enables learning acceleration if the structure of the experts is known. We have demonstrated this point with an example that is inspired by the problem of finite-time model selection for RL. One interesting direction for future work is to obtain instance-dependent upper bounds in terms of the experts' suboptimality gaps. Such instance-dependent bounds can be used to prove the learning acceleration enabled by Exp4.R. A simple implementation of our algorithms inherits the computational complexity of Exp4.P. It is worthwhile to design efficient implementations for specific applications.
Acknowledgments
The authors would like to thank John N. Tsitsiklis, Dylan J. Foster, Caroline Uhler, Devavrat Shah, and Thibaut Horel for helpful discussions. This work was supported by the OCP Group.
References
Abbasi-Yadkori, Y., Pacchiano, A., and Phan, M. Regret balancing for bandit and RL model selection. arXiv preprint arXiv:2006.05491, 2020.

Agarwal, A., Dudik, M., Kale, S., Langford, J., and Schapire, R. E. Contextual bandit learning with predictable rewards. In International Conference on Artificial Intelligence and Statistics, pp. 19–26, 2012.

Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In Conference on Learning Theory, pp. 12–38, 2017.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In IEEE 36th Annual Foundations of Computer Science, pp. 322–331, 1995.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002a.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002b.

Berry, D. A., Chen, R. W., Zame, A., Heath, D. C., and Shepp, L. A. Bandit problems with infinitely many arms. Ann. Statist., 25(5):2103–2116, 1997.

Besson, L. and Kaufmann, E. What doubling tricks can and can't do for multi-armed bandits. arXiv preprint arXiv:1803.06971, 2018.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, pp. 19–26, 2011.

Binas, J., Luginbuehl, L., and Bengio, Y. Reinforcement learning for sustainable agriculture. CCAI Workshop at the 36th International Conference on Machine Learning, 2019.

Carpentier, A. and Valko, M. Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, pp. 1133–1141, 2015.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Foster, D. J., Kale, S., Mohri, M., and Sridharan, K. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, pp. 6020–6030, 2017.

Foster, D. J., Krishnamurthy, A., and Luo, H. Model selection for contextual bandits. In Advances in Neural Information Processing Systems, pp. 14741–14752, 2019.

Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., and Hutter, M. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26), 2019.

Kale, S. Multiarmed bandits with limited expert advice. In Conference on Learning Theory, pp. 107–122, 2014.

Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In ACM Symposium on Theory of Computing, pp. 681–690, 2008.

Kuderer, M., Gulati, S., and Burgard, W. Learning driving styles for autonomous vehicles from demonstration. In IEEE International Conference on Robotics and Automation, pp. 2641–2646, 2015.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4–22, 1985.

Matni, N., Proutiere, A., Rantzer, A., and Tu, S. From self-tuning regulators to reinforcement learning and back again. In IEEE Conference on Decision and Control, pp. 3724–3740, 2019.

McMahan, H. B. and Streeter, M. Tighter bounds for multi-armed bandits with expert advice. In Conference on Learning Theory, 2009.

Nemati, S., Ghassemi, M. M., and Clifford, G. D. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 2978–2981, 2016.

Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pp. 3168–3176, 2015.

Ni, C. and Wang, M. Maximum likelihood tensor decomposition of Markov decision process. In IEEE International Symposium on Information Theory, pp. 3062–3066, 2019.

Robbins, H. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527–535, 1952.

Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

Seldin, Y., Crammer, K., and Bartlett, P. Open problem: Adversarial multiarmed bandits with limited advice. In Conference on Learning Theory, pp. 1067–1072, 2013.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
A. Proof of Lemma 1
Let E_t[·] denote the conditional expectation given the history until time t − 1. We can show that ŷ_i(t) is a conditionally unbiased estimator for y_i(t). In other words, E_t[ŷ_i(t)] = y_i(t) for all i and t. Lemma 5 shows that v̂_i(t) is an upper bound on the conditional variance of ŷ_i(t). Lemma 6 is a Freedman-style inequality for martingales from (Beygelzimer et al., 2011). The proof of Lemma 1 relies on Lemmas 5 and 6.

Lemma 5 (From proof of Lemma 3 in (Beygelzimer et al., 2011)). For all t ∈ Z+ and i ∈ I, we have E_t[(y_i(t) − ŷ_i(t))²] ≤ v̂_i(t).

Lemma 6 ((Beygelzimer et al., 2011), Theorem 1). Let X_1, ..., X_T be a sequence of real-valued random variables. For any real-valued random variable Y, we define E_t[Y] ≜ E[Y | X_1, ..., X_{t−1}]. We assume that X_t ≤ B and E_t[X_t] = 0 for all t. We define the random variables

S ≜ Σ_{t=1}^T X_t,   V ≜ Σ_{t=1}^T E_t[X_t²].

For any fixed estimate V′ > 0 of V and for any δ ∈ (0, 1), with probability at least 1 − δ, we have

S ≤ √((e − 2) ln(1/δ)) (V/√V′ + √V′)   if V′ ≥ B² ln(1/δ)/(e − 2),
S ≤ B ln(1/δ) + (e − 2)V/B   otherwise.

Proof of Lemma 1.
We now fix any i ∈ I and t ∈ Z+. By definition, we have y_i(t) ∈ [0, 1]. Using (1) and the assumption that ρ ∈ [0, 1/K], we get p_a(t) ≥ ρ for all a ∈ A. Thus, (2) implies that ŷ_i(t) ∈ [0, 1/ρ] almost surely. Let X_t = y_i(t) − ŷ_i(t). We then have −1/ρ ≤ X_t ≤ 1 almost surely. We can show that E_t[ŷ_i(t)] = y_i(t) and hence E_t[X_t] = 0. We recall that R_i(T) = Σ_{t=1}^T y_i(t). Applying Lemma 6 to (X_t)_t and (−X_t)_t respectively and then taking a union bound, we conclude that, for any δ ∈ (0, 1), with probability at least 1 − δ/N, the inequality −B₁ ≤ R_i(T) − R̂_i(T) ≤ B₂ holds, where

B₁ ≜ √((e − 2) ln(2N/δ)) (V/√V′ + √V′)   if V′ ≥ ln(2N/δ)/((e − 2)ρ²),
B₁ ≜ ln(2N/δ)/ρ + (e − 2)ρV   otherwise,

B₂ ≜ √((e − 2) ln(2N/δ)) (V/√V′ + √V′)   if V′ ≥ ln(2N/δ)/(e − 2),
B₂ ≜ ln(2N/δ) + (e − 2)V   otherwise,

V ≜ Σ_{t=1}^T E_t[X_t²].

We now fix an arbitrary δ ∈ (0, 1). Assumption 1 implies that ln(2N/δ) ≤ (e − 2)KT. Taking ρ = √(ln N/(KT)) and V′ = KT, we have

ln(2N/δ)/(e − 2) ≤ V′ < ln(2N/δ)/((e − 2)ρ²).

Lemma 5 implies that V ≤ V̂_i(T). Therefore, with probability at least 1 − δ/N, we have

−ln(2N/δ) √(KT/ln N) − √(ln N/(KT)) V̂_i(T) ≤ R_i(T) − R̂_i(T) ≤ √(ln(2N/δ)) (V̂_i(T)/√(KT) + √(KT)).

Applying the union bound over i ∈ I, we conclude that P(E(δ)) ≥ 1 − δ.

B. Proof of Theorem 2
Proof of Theorem 2.
We can show that, for all δ ∈ (0, 1), α ∈ Z+, and c ∈ Z+, there exists C(α, c, K, δ) ∈ Z+ such that K ln(c·2^{αl}) ≤ C·2^l and ln(c·2^{αl+1}/δ) ≤ (e − 2)CK·2^l for all l ∈ Z+. For example, we can set C = ⌈αK ln(16c/δ)⌉. Together with the definitions of N_l and T_l in Algorithm 3, we have that, for all α ∈ Z+ and c ∈ Z+, there exists C ∈ Z+ such that K ln N_l ≤ T_l and ln(2N_l/δ) ≤ (e − 2)KT_l for all l ∈ Z+. We fix such integers α, c, C ∈ Z+ for the rest of the proof.

For simplicity of notation, we first consider running Exp4.R(δ, T_l, ρ_l, I_l) in each epoch l of Algorithm 3 for any δ ∈ (0, 1/L] and then apply a change of variables at the end of the proof. We suppose that a uniform expert is available in each epoch. Assumption 1 is then satisfied for all epochs. For now, we assume that event E(δ) holds for all epochs, the probability of which will be discussed at the end of the proof. For simplicity of notation, let i∗_l ≜ i∗(I_l; 𝒯_l) for l ∈ [L]. Let U_l ≜ αl + log₂(2c/δ) for l ∈ [L]. Recall that 𝒯_l is the time interval of epoch l, where |𝒯_l| = T_l. By Lemma 2,

Σ_{l=1}^L R_{i∗_l}(𝒯_l) − Σ_{t=1}^T r_{a(t)}(t) ≤ Σ_{l=1}^L 7√(KT_l ln(2N_l/δ))
  = Σ_{l=1}^L 7√(KC·2^l ln(c·2^{αl+1}/δ))
  = 7√(KC ln 2) Σ_{l=1}^L √(2^l U_l)
  ≤ 7√(KC U_L ln 2) Σ_{l=1}^L 2^{l/2}
  < 20√(KC U_L)(2^{L/2} − 1).

Since L = ⌈log₂[1 + T/(2C)]⌉, we have

Σ_{l=1}^L R_{i∗_l}(𝒯_l) − Σ_{t=1}^T r_{a(t)}(t) < 20√(KC U_L)(√(2 + T/C) − 1) < 20√(K[αL + 2 ln(2c/δ)](2C + T)).   (10)

We first discuss the case where i∗ ∉ I_1. Let L′′ be the last epoch such that i∗ is not considered in Algorithm 3. In other words, L′′ ≜ max{l ∈ [L] | i∗ ∉ I_l}.
Lemma 4 implies that i∗ ∈ I_l for all l > L′′. By the definition of i∗_l, we have R_{i∗_l}(𝒯_l) ≥ R_{i∗}(𝒯_l) for all l > L′′. Thus,

R_{i∗}([T]) − Σ_{l=1}^L R_{i∗_l}(𝒯_l) ≤ Σ_{l=1}^{L′′} (R_{i∗}(𝒯_l) − R_{i∗_l}(𝒯_l)) ≤ Σ_{l=1}^{L′′} T_l < C·2^{L′′+1}.

We now provide an upper bound on L′′. By Algorithms 3 and 4, we have |I_l| = N_l and 1 ≤ i_l ≤ i_{l+1} for all l. Let L′ be the last epoch such that i∗ is not considered in the worst case where i_l = 1 for all l. In other words, L′ ≜ min(L, ⌈α^{−1} log₂(i∗/c)⌉ − 1). Under the assumption that i∗ ∉ I_1, we get L′ ≥ 1. By the definitions of L′ and L′′, we have L′′ ≤ L′ and hence

R_{i∗}([T]) − Σ_{l=1}^L R_{i∗_l}(𝒯_l) < C·2^{L′+1} ≤ 2C(i∗/c)^{1/α}.   (11)

We now consider the case where i∗ ∈ I_1. It follows from Lemma 4 that i∗ ∈ I_l for all l. Thus, the definition of i∗_l implies that R_{i∗_l}(𝒯_l) ≥ R_{i∗}(𝒯_l) for all l. We define D ≜ R_{i∗}([T]) − Σ_{l=1}^L R_{i∗_l}(𝒯_l). We then have D ≤ 0. However, the definition of i∗ implies that D ≥ 0. Therefore, D = 0 and (11) is satisfied.

Adding (10) and (11) gives

Regret(T) < 20√(K[αL + 2 ln(2c/δ)](2C + T)) + 2C(i∗/c)^{1/α}.   (12)

Using Lemma 1 and the union bound over all L epochs, we conclude that (12) holds with probability at least 1 − Lδ. A change of variables gives that, for any δ ∈ (0, 1), with probability at least 1 − δ, we have

Regret(T) < 20√(K[αL + 2 ln(2cL/δ)](2C + T)) + 2C(i∗/c)^{1/α} < 20√(2αK(T + 2C) ln(2cL(2 + T/C)/δ)) + 2C(i∗/c)^{1/α}.
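Two facts about the epoch count used above can be sanity-checked numerically under the reading L = ⌈log₂(1 + T/(2C))⌉ with epoch lengths T_l = C·2^l (this reading of the doubling schedule is an assumption of the sketch): the L epochs always cover the horizon, and 2^L < 2 + T/C, so 2^{L/2} < √(2 + T/C).

```python
import math

# Sanity check over a grid of horizons T and epoch constants C: with
# L = ceil(log2(1 + T/(2C))) epochs of length T_l = C * 2**l, the total
# epoch length sum_l T_l = 2C(2**L - 1) is at least T, and 2**L stays
# strictly below 2 + T/C.
for C in (1, 4, 16, 64):
    for T in range(1, 5001):
        L = math.ceil(math.log2(1 + T / (2 * C)))
        total = sum(C * 2 ** l for l in range(1, L + 1))  # = 2C(2**L - 1)
        assert total >= T                  # the L epochs cover the horizon
        assert 2 ** L < 2 + T / C + 1e-9   # hence 2**(L/2) < sqrt(2 + T/C)
```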