Rotting Bandits
Nir Levine
Electrical Engineering Department, The Technion, Haifa 32000, Israel
[email protected]

Koby Crammer
Electrical Engineering Department, The Technion, Haifa 32000, Israel
[email protected]

Shie Mannor
Electrical Engineering Department, The Technion, Haifa 32000, Israel
[email protected]
Abstract
The Multi-Armed Bandits (MAB) framework highlights the trade-off between acquiring new knowledge (Exploration) and leveraging available knowledge (Exploitation). In the classical MAB problem, a decision maker must choose an arm at each time step, upon which she receives a reward. The decision maker's objective is to maximize her cumulative expected reward over the time horizon. The MAB problem has been studied extensively, specifically under the assumption of the arms' reward distributions being stationary, or quasi-stationary, over time. We consider a variant of the MAB framework, which we term Rotting Bandits, where each arm's expected reward decays as a function of the number of times it has been pulled. We are motivated by many real-world scenarios such as online advertising, content recommendation, crowdsourcing, and more. We present algorithms, accompanied by simulations, and derive theoretical guarantees.
Introduction

One of the most fundamental trade-offs in stochastic decision theory is the well celebrated Exploration vs. Exploitation dilemma. Should one acquire new knowledge at the expense of a possible sacrifice in the immediate reward (Exploration), or leverage past knowledge in order to maximize instantaneous reward (Exploitation)? Solutions that have been demonstrated to perform well are those which succeed in balancing the two. First proposed by Thompson [1933] in the context of drug trials, and later formulated in a more general setting by Robbins [1985], MAB problems serve as a distilled framework for this dilemma. In the classical setting of the MAB, at each time step, the decision maker must choose (pull) between a fixed number of arms. After pulling an arm, she receives a reward which is a realization drawn from the arm's underlying reward distribution. The decision maker's objective is to maximize her cumulative expected reward over the time horizon. An equivalent objective, more typically studied, is the regret, which is defined as the difference between the optimal cumulative expected reward (under full information) and that of the policy deployed by the decision maker. The MAB formulation has been studied extensively, and was leveraged to formulate many real-world problems. Some examples of such modeling are online advertising [Pandey et al., 2007], routing of packets [Awerbuch and Kleinberg, 2004], and online auctions [Kleinberg and Leighton, 2003].

Most past work (Section 6) on the MAB framework has been performed under the assumption that the underlying distributions are stationary, or possibly quasi-stationary. In many real-world scenarios, this assumption may seem simplistic. Specifically, we are motivated by real-world scenarios where the expected reward of an arm decreases over the time instances that it has been pulled. We term this variant Rotting Bandits. For motivational purposes, we present the following two examples.

• Consider an online advertising problem where an agent must choose which ad (arm) to present (pull) to a user. It seems reasonable that the effectiveness (reward) of a specific ad on a user would deteriorate over exposures. Similarly, in the content recommendation context, Agarwal et al. [2009] showed that articles' CTR decay over the amount of exposures.

• Consider the problem of assigning projects through crowdsourcing systems [Tran-Thanh et al., 2012]. Given that the assignments primarily require human perception, subjects may fall into boredom and their performance would decay (e.g., license plate transcriptions [Du et al., 2013]).

As opposed to the stationary case, where the optimal policy is to always choose some specific arm, in the case of Rotting Bandits the optimal policy consists of choosing different arms. This results in the notion of adversarial regret vs. policy regret [Arora et al., 2012] (see Section 6). In this work we tackle the harder problem of minimizing the policy regret.

The main contributions of this paper are the following:

• Introducing a novel, real-world oriented MAB formulation, termed Rotting Bandits.
• Presenting an easy-to-follow algorithm for the general case, accompanied by theoretical guarantees.
• Refining the theoretical guarantees for the case of existing prior knowledge on the rotting models, accompanied by suitable algorithms.

The rest of the paper is organized as follows: in Section 2 we present the model and relevant preliminaries. In Section 3 we present our algorithm along with theoretical guarantees for the general case. In Section 4 we do the same for the parameterized case, followed by simulations in Section 5. In Section 6 we review related work, and conclude with a discussion in Section 7.
Model and Preliminaries

We consider the problem of Rotting Bandits (RB): an agent is given K arms and at each time step t = 1, 2, ... one of the arms must be pulled. We denote the arm that is pulled at time step t as i(t) ∈ [K] = {1, ..., K}. When arm i is pulled for the n-th time, the agent receives a time-independent, σ² sub-Gaussian random reward, r_t, with mean μ_i(n).

In this work we consider two cases: (1) there is no prior knowledge on the expected rewards, except for the 'rotting' assumption to be presented shortly, i.e., a non-parametric case (NPC); (2) there is prior knowledge that the expected rewards are comprised of an unknown constant part and a rotting part which is known to belong to a set of rotting models, i.e., a parametric case (PC).

Let N_i(t) be the number of pulls of arm i at time t, not including this round's choice (N_i(1) = 0), and Π the set of all sequences i(1), i(2), ..., where i(t) ∈ [K], ∀t ∈ ℕ; i.e., π ∈ Π is an infinite sequence of actions (arms), also referred to as a policy. We denote the arm that is chosen by policy π at time t as π(t). The objective of an agent is to maximize the expected total reward in time T, defined for policy π ∈ Π by,

J(T; π) = E[ Σ_{t=1}^{T} μ_{π(t)}( N_{π(t)}(t) + 1 ) ]   (1)

We consider the equivalent objective of minimizing the regret in time T, defined by,

R(T; π) = max_{π̃∈Π} { J(T; π̃) } − J(T; π).   (2)

Assumption 2.1. (Rotting) ∀i ∈ [K], μ_i(n) is positive, and non-increasing in n.

Our results hold for pull-number dependent variances σ²(n), by upper bounding them: σ² ≥ σ²(n), ∀n. It is fairly straightforward to adapt the results to pull-number dependent variances, but we believe that the way presented conveys the setting in the clearest way.

Optimal Policy

Let π^max be a policy defined by,

π^max(t) ∈ argmax_{i∈[K]} { μ_i( N_i(t) + 1 ) }   (3)

where, in a case of a tie, it is broken randomly.

Lemma 2.1. π^max is an optimal policy for the RB problem.

Proof:
See Appendix B of the supplementary material.
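To make the setting concrete, the following is a minimal Python sketch of a rotting-bandit environment together with the greedy oracle policy of Eq. (3). The reward functions, σ, and the Gaussian noise are illustrative assumptions (in the spirit of Example 4.1 below), not part of the paper's algorithms.

```python
import numpy as np

def pi_max_oracle(mu_fns, K, T, sigma=0.2, rng=None):
    """Oracle policy of Eq. (3): at each step pull the arm whose *next*
    expected reward mu_i(N_i + 1) is largest (ties broken randomly).
    mu_fns[i](n) returns the expected reward of arm i at its n-th pull."""
    rng = rng or np.random.default_rng(0)
    pulls = np.zeros(K, dtype=int)            # N_i(t)
    total = 0.0
    for _ in range(T):
        next_means = np.array([mu_fns[i](pulls[i] + 1) for i in range(K)])
        best = np.flatnonzero(next_means == next_means.max())
        i_t = int(rng.choice(best))           # random tie-breaking
        total += rng.normal(next_means[i_t], sigma)   # sigma sub-Gaussian reward
        pulls[i_t] += 1
    return total, pulls

# hypothetical example: two rotting arms with mu_i(n) = 0.5 * n^{-theta_i}
mu_fns = [lambda n: 0.5 * n ** -0.2, lambda n: 0.5 * n ** -0.4]
reward, pulls = pi_max_oracle(mu_fns, K=2, T=1000)
```

This oracle requires full knowledge of the μ_i(·); the algorithms in the following sections only observe noisy rewards.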
Non-Parametric Case

In the NPC setting for the RB problem, the only information we have is that the expected reward sequences are positive and non-increasing in the number of pulls. The Sliding-Window Average (SWA) approach is a heuristic for ensuring, with high probability, that at each time step the agent did not sample significantly sub-optimal arms too many times. We note that, potentially, the optimal arm changes throughout the trajectory, as Lemma 2.1 suggests. We start by assuming that we know the time horizon, and later account for the case where we do not.
Known Horizon
The idea behind the SWA approach is that after we have pulled a significantly sub-optimal arm "enough" times, the empirical average of these "enough" pulls would be distinguishable from the optimal arm for that time step; as such, at any given time step there is a bounded number of significantly sub-optimal pulls compared to the optimal policy. A pseudo-algorithm for SWA is given by Algorithm 1.
Algorithm 1 SWA

Input: K, T, α > 0
Initialize: M ← ⌈α σ^{2/3} K^{-2/3} T^{2/3} ln^{1/3}(√2 T)⌉, and N_i ← 0 for all i ∈ [K]
for t = 1, 2, .., KM do
    Ramp up: choose i(t) by Round-Robin, receive r_t, and set N_{i(t)} ← N_{i(t)} + 1; r^{N_{i(t)}}_{i(t)} ← r_t
end for
for t = KM + 1, ..., T do
    Balance: i(t) ∈ argmax_{i∈[K]} { (1/M) Σ_{n=N_i−M+1}^{N_i} r^n_i }
    Update: receive r_t, and set N_{i(t)} ← N_{i(t)} + 1; r^{N_{i(t)}}_{i(t)} ← r_t
end for

Theorem 3.1. Suppose Assumption 2.1 holds. The SWA algorithm achieves regret bounded by,

R(T; π^SWA) ≤ (α max_{i∈[K]} μ_i(1) + 4α^{-1/2}) σ^{2/3} K^{1/3} T^{2/3} ln^{1/3}(√2 T) + 3K max_{i∈[K]} μ_i(1)   (4)

Proof:
See Appendix C.1 of the supplementary material.
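As a concrete illustration, the following is a minimal Python sketch of SWA with the window size above; the reward interface `pull(i)`, σ, and α are hypothetical placeholders, and a careful implementation should follow Algorithm 1 rather than this sketch.

```python
import numpy as np
from collections import deque

def swa(pull, K, T, alpha=1.0, sigma=0.2, rng=None):
    """Sliding-Window Average (Algorithm 1): after a Round-Robin ramp-up,
    repeatedly pull the arm whose average over its last M rewards is largest.
    `pull(i)` returns a noisy reward for arm i (assumed interface)."""
    M = int(np.ceil(alpha * sigma ** (2 / 3) * K ** (-2 / 3)
                    * T ** (2 / 3) * np.log(np.sqrt(2) * T) ** (1 / 3)))
    windows = [deque(maxlen=M) for _ in range(K)]   # last M rewards per arm
    history = []
    for t in range(T):
        if t < K * M:
            i = t % K                               # ramp-up by Round-Robin
        else:
            means = [np.mean(w) for w in windows]   # Balance step
            i = int(np.argmax(means))
        windows[i].append(pull(i))
        history.append(i)
    return history
```

The window of length M discards old rewards, which is what protects against arms whose mean has rotted since their early pulls.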
We note that the upper bound obtains its minimum for α = (max_{i∈[K]} μ_i(1) / 2)^{-2/3}, which can serve as a way to choose α if max_{i∈[K]} μ_i(1) is known; however, α can also be given as an input to SWA to allow control over the averaging window size.

Unknown Horizon
In this case we use the doubling trick in order to achieve the same horizon-dependent rate for the regret. We apply the SWA algorithm with a series of increasing horizons (powers of two) until reaching the (unknown) horizon. We term this algorithm wSWA (wrapper SWA).

Corollary 3.1.1.
Suppose Assumption 2.1 holds. The wSWA algorithm achieves regret bounded by,

R(T; π^wSWA) ≤ (2^{2/3}/(2^{2/3} − 1)) (α max_{i∈[K]} μ_i(1) + 4α^{-1/2}) σ^{2/3} K^{1/3} T^{2/3} ln^{1/3}(√2 T) + 3K max_{i∈[K]} μ_i(1) (log₂ T + 1)   (5)

Proof:
See Appendix C.2 of the supplementary material.
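The doubling-trick wrapper itself is straightforward; a minimal sketch, building on the hypothetical `swa` sketch above, could look as follows.

```python
def wswa(pull, K, total_steps, alpha=1.0, sigma=0.2):
    """wSWA (wrapper SWA): restart SWA on epochs of doubling length
    (1, 2, 4, 8, ...) until the unknown horizon is exhausted; each epoch
    runs SWA tuned to that epoch's length."""
    steps_left, history, epoch_len = total_steps, [], 1
    while steps_left > 0:
        run = min(epoch_len, steps_left)
        history += swa(pull, K, run, alpha=alpha, sigma=sigma)
        steps_left -= run
        epoch_len *= 2
    return history
```

Each restart discards previous window statistics, which is what allows the known-horizon bound to be applied per epoch in the proof of Corollary 3.1.1.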
Parametric Case

In the PC setting for the RB problem, there is prior knowledge that the expected rewards are comprised of a sum of an unknown constant part and a rotting part known to belong to a set of models, Θ; i.e., the expected reward of arm i at its n-th pull is given by μ_i(n) = μ^c_i + μ(n; θ*_i), where θ*_i ∈ Θ. We denote {θ*_i}_{i=1}^{K} by Θ*. We consider two cases: the first is the asymptotically vanishing case (AV), i.e., ∀i: μ^c_i = 0; the second is the asymptotically non-vanishing case (ANV), i.e., ∀i: μ^c_i ∈ ℝ. We present a few definitions that will serve us in the following section.

Definition 4.1.
For a function f: ℕ → ℝ, we define the function f⋆↓: ℝ → ℕ ∪ {∞} by the following rule: given ζ ∈ ℝ, f⋆↓(ζ) returns the smallest N ∈ ℕ such that ∀n ≥ N: f(n) ≤ ζ, or ∞ if such N does not exist.

Definition 4.2.
For any θ1 ≠ θ2 ∈ Θ, define det_{θ1,θ2}, Ddet_{θ1,θ2}: ℕ → ℝ as,

det_{θ1,θ2}(n) = nσ² ( Σ_{j=1}^{n} μ(j; θ1) − Σ_{j=1}^{n} μ(j; θ2) )^{-2}

Ddet_{θ1,θ2}(n) = nσ² ( Σ_{j=1}^{⌊n/2⌋} [μ(j; θ1) − μ(j; θ2)] − Σ_{j=⌊n/2⌋+1}^{n} [μ(j; θ1) − μ(j; θ2)] )^{-2}

Definition 4.3.
Let bal: ℕ ∪ {∞} → ℕ ∪ {∞} be defined at each point n ∈ ℕ as the solution for,

min α   s.t.   max_{θ∈Θ} μ(α; θ) ≤ min_{θ∈Θ} μ(n; θ)

We define bal(∞) = ∞.

Assumption 4.1. (Rotting Models) μ(n; θ) is positive, non-increasing in n, and μ(n; θ) ∈ o(1), ∀θ ∈ Θ, where Θ is a discrete known set.

We present an example for which, in Appendix E, we demonstrate how the different following assumptions hold. By this we intend to achieve two things: (i) show that the assumptions are not too harsh, keeping the problem relevant and non-trivial, and (ii) present a simple example of how to verify the assumptions.
Example 4.1.
The reward of arm i for its n-th pull is distributed as N(μ^c_i + n^{-θ*_i}, σ²), where θ*_i ∈ Θ = {θ_1, θ_2, ..., θ_M} is a finite grid of decay exponents bounded away from 0.

The Closest To Origin (CTO) approach for RB is a heuristic that simply states that we hypothesize that the true underlying model for an arm is the one that best fits the past rewards. The fitting criterion is proximity to the origin of the sum of expected rewards shifted by the observed rewards. Let r^i_1, r^i_2, .., r^i_{N_i(t)} be the sequence of rewards observed from arm i up until time t. Define,

Y(i, t; Θ) = { Σ_{j=1}^{N_i(t)} r^i_j − Σ_{j=1}^{N_i(t)} μ(j; θ) }_{θ∈Θ}.   (6)

The CTO approach dictates that at each decision point, we assume that the true underlying rotting model corresponds to the following proximity-to-origin rule (hence the name),

θ̂_i(t) = argmin_{θ∈Θ} { |Y(i, t; θ)| }.   (7)

The CTO_SIM version tackles the RB problem by simultaneously detecting the true rotting models and balancing the expected rewards (following Lemma 2.1). In this approach, at every time step, each arm's rotting model is hypothesized according to the proximity rule (7). Then the algorithm simply follows an argmax rule, where the least number of pulls is used for tie breaking (randomly between an equal number of pulls). A pseudo-algorithm for CTO_SIM is given by Algorithm 2.
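A minimal Python sketch of the proximity rule (7) and the resulting detect-and-balance loop is given below; the model interface `mu(n, theta)`, the reward interface `pull(i)`, and the tie-breaking are illustrative assumptions, not the paper's code.

```python
import numpy as np

def cto_detect(rewards, mu, thetas):
    """Proximity rule (7): pick the model whose cumulative expected reward
    is closest to the cumulative observed reward of this arm."""
    n = len(rewards)
    resid = [abs(sum(rewards) - sum(mu(j, th) for j in range(1, n + 1)))
             for th in thetas]
    return thetas[int(np.argmin(resid))]

def cto_sim(pull, mu, thetas, K, T, rng=None):
    """CTO_SIM sketch: every step, re-detect each arm's rotting model by
    rule (7), then pull the arm whose hypothesized *next* expected reward is
    largest (ties broken randomly here; the paper prefers the least-pulled arm)."""
    rng = rng or np.random.default_rng(0)
    rewards = [[] for _ in range(K)]
    for t in range(T):
        if t < K:
            i = t                                    # ramp up: pull each arm once
        else:
            theta_hat = [cto_detect(rewards[i], mu, thetas) for i in range(K)]
            next_mu = np.array([mu(len(rewards[i]) + 1, theta_hat[i])
                                for i in range(K)])
            i = int(rng.choice(np.flatnonzero(next_mu == next_mu.max())))
        rewards[i].append(pull(i))
    return rewards
```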
Assumption 4.2. (Simultaneous Balance and Detection ability)

bal( max_{θ1≠θ2∈Θ} { det⋆↓_{θ1,θ2}( (1/16) ln^{-1}(ζ) ) } ) ∈ o(ζ)

That is, for a horizon T, the underlying models can be distinguished from the others, w.p. 1 − 1/T, by their sums of expected rewards, and the arms can then be balanced, all within the horizon.

Theorem 4.1.
Suppose Assumptions 4.1 and 4.2 hold. There exists a finite step T*_SIM such that for all T ≥ T*_SIM, CTO_SIM achieves regret upper bounded by o(1) (which is in turn upper bounded by max_{θ∈Θ*} μ(1; θ)). Furthermore, T*_SIM is upper bounded by the solution of the following,

min T   s.t.   T, b ∈ ℕ ∪ {0}, t ∈ ℕ^K
∀b, ∃t: ‖t‖₁ ≤ T + b
t_i ≥ max_{θ∈Θ*} { m*( 1/(K(T+b)²); θ ) }
μ(t_i + 1; θ*_i) ≤ min_{θ̃∈Θ} [ μ( max_{θ∈Θ*} { m*( 1/(K(T+b)²); θ ) }; θ̃ ) ]   (8)

Proof: See Appendix D.1 of the supplementary material.
Regret upper bounded by o(1) is achieved by proving that w.p. 1 − 1/T the regret vanishes, and in any case it is still bounded by a decaying term. The shown optimization bound stems from ensuring that the arms would be pulled enough times to be correctly detected, and then balanced (following the optimal policy, Lemma 2.1). Another upper bound for T*_SIM can be found in Appendix D.1.
We tackle this problem by estimating both the rotting models and the constant terms of the arms. The Differences Closest To Origin (D-CTO) approach is composed of two stages: first, detecting the underlying rotting models; then, estimating and controlling the pulls due to the constant terms. We denote a* = argmax_{i∈[K]} {μ^c_i}, and ∆_i = μ^c_{a*} − μ^c_i.

Assumption 4.3. (D-Detection ability)

max_{θ1≠θ2∈Θ} { Ddet⋆↓_{θ1,θ2}(ε) } ≤ D(ε) < ∞,   ∀ε > 0

This assumption ensures that for any given probability, the models can be distinguished, via the differences (in pulls) between the first and second halves of the models' sums of expected rewards.
Models Detection
In order to detect the underlying rotting models, we cancel the influence of the constant terms. Once we do this, we can detect the underlying models. Specifically, we define a criterion of proximity to the origin based on differences between the halves of the reward sequences, as follows: define,

Z(i, t; Θ) = { Σ_{j=1}^{⌊N_i(t)/2⌋} r^i_j − Σ_{j=⌊N_i(t)/2⌋+1}^{N_i(t)} r^i_j − ( Σ_{j=1}^{⌊N_i(t)/2⌋} μ(j; θ) − Σ_{j=⌊N_i(t)/2⌋+1}^{N_i(t)} μ(j; θ) ) }_{θ∈Θ}.   (9)

The D-CTO approach is that at each decision point, we assume that the true underlying model corresponds to the following rule,

θ̂_i(t) = argmin_{θ∈Θ} { |Z(i, t; θ)| }   (10)

We define the following optimization problem, indicating the number of samples required for ensuring correct detection of the rotting models w.h.p. For some arm i with (unknown) rotting model θ*_i,

min m   s.t.   P( θ̂_i(l) ≠ θ*_i ) ≤ p, ∀l ≥ m, while pulling only arm i.   (11)

We denote the solution to the above problem, when we use proximity rule (10), by m*_diff(p; θ*_i), and define m*_diff(p) = max_{θ∈Θ} { m*_diff(p; θ) }.
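For illustration, a hedged sketch of the difference-based rule (10), reusing the hypothetical `mu` model interface from the earlier sketch:

```python
import numpy as np

def dcto_detect(rewards, mu, thetas):
    """Rule (10): compare first-half minus second-half sums of the observed
    rewards against the same difference under each candidate model; the
    constant term mu_c_i cancels in this difference (exactly when the number
    of pulls is even)."""
    n = len(rewards)
    h = n // 2
    obs_diff = sum(rewards[:h]) - sum(rewards[h:])
    resid = [abs(obs_diff - (sum(mu(j, th) for j in range(1, h + 1))
                             - sum(mu(j, th) for j in range(h + 1, n + 1))))
             for th in thetas]
    return thetas[int(np.argmin(resid))]
```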
Algorithm 2 CTO_SIM

Input: K, Θ
Initialization: N_i = 0, ∀i ∈ [K]
for t = 1, 2, .., K do
    Ramp up: i(t) = t, and update N_{i(t)}
end for
for t = K + 1, ..., do
    Detect: determine {θ̂_i} by Eq. (7)
    Balance: i(t) ∈ argmax_{i∈[K]} μ( N_i + 1; θ̂_i )
    Update: N_{i(t)} ← N_{i(t)} + 1
end for
Algorithm 3 D-CTO_UCB

Input: K, Θ, δ
Initialization: N_i = 0, ∀i ∈ [K]
for t = 1, 2, .., K × m*_diff(δ/K) do
    Explore: i(t) by Round-Robin, update N_{i(t)}
end for
Detect: determine {θ̂_i} by Eq. (10)
for t = K × m*_diff(δ/K) + 1, ..., do
    UCB: i(t) according to Eq. (12)
    Update: N_{i(t)} ← N_{i(t)} + 1
end for

D-CTO_UCB
We next describe an approach with one decision point, and later on remark on the possibility of having a decision point at each time step. As explained above, after detecting the rotting models, we move to tackle the constant-term aspect of the expected rewards. This is done in a UCB1-like approach [Auer et al., 2002a]. Given a sequence of rewards from arm i, {r^i_k}_{k=1}^{N_i(t)}, we modify them using the estimated rotting model θ̂_i, then estimate the arm's constant term, and finally choose the arm with the highest estimated expected reward plus an upper confidence term; i.e., at time t, we pull arm i(t) according to the rule,

i(t) ∈ argmax_{i∈[K]} [ μ̂^c_i(t) + μ( N_i(t) + 1; θ̂_i(t) ) + c_{t, N_i(t)} ]   (12)

where θ̂_i(t) is the estimated rotting model (obtained in the first stage), and,

μ̂^c_i(t) = (1/N_i(t)) Σ_{j=1}^{N_i(t)} ( r^i_j − μ( j; θ̂_i(t) ) ),   c_{t,s} = √( 2σ² ln(t) / s )

In a case of a tie in the UCB step, it may be arbitrarily broken. A pseudo-algorithm for D-CTO_UCB is given by Algorithm 3, accompanied by the following theorem.
Theorem 4.2.
Suppose Assumptions 4.1 and 4.3 hold. For δ ∈ (0, 1), with probability of at least 1 − δ, the D-CTO_UCB algorithm achieves regret bounded at time T by,

Σ_{i∈[K], i≠a*} [ max{ m*_diff(δ/K), μ⋆↓(ε_i; θ*_i), 8σ² ln T / (∆_i − ε_i)² } × (∆_i + μ(1; θ*_{a*})) ] + C(Θ*, {μ^c_i})   (13)

for any sequence ε_i ∈ (0, ∆_i), ∀i ≠ a*, where 8σ² ln T / (∆_i − ε_i)² is the only time-dependent factor.

Proof: See Appendix D.2 of the supplementary material.
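A compact sketch of the two-stage procedure (Algorithm 3) built on the hypothetical detection helper above; the exploration constant in the bonus and the interfaces are assumptions, not the paper's exact choices.

```python
import numpy as np

def dcto_ucb(pull, mu, thetas, K, T, m_diff, sigma=0.2):
    """D-CTO_UCB sketch: Round-Robin for K*m_diff steps, detect each rotting
    model once via rule (10), then run a UCB1-like rule (12) on the de-rotted
    rewards."""
    rewards = [[] for _ in range(K)]
    for t in range(K * m_diff):                      # exploration stage
        i = t % K
        rewards[i].append(pull(i))
    theta_hat = [dcto_detect(rewards[i], mu, thetas) for i in range(K)]
    for t in range(K * m_diff, T):                   # UCB stage
        idx = []
        for i in range(K):
            n = len(rewards[i])
            mu_c_hat = np.mean([r - mu(j + 1, theta_hat[i])
                                for j, r in enumerate(rewards[i])])
            bonus = np.sqrt(2 * sigma ** 2 * np.log(t + 1) / n)   # assumed constant
            idx.append(mu_c_hat + mu(n + 1, theta_hat[i]) + bonus)
        i = int(np.argmax(idx))
        rewards[i].append(pull(i))
    return rewards
```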
A few notes on the result: instead of calculating m*_diff(δ/K), it is possible to use any upper bound (e.g., as shown in Appendix E, max_{θ1≠θ2∈Θ} Ddet⋆↓_{θ1,θ2}( (1/8) ln^{-1}(2K/δ) ), rounded up to an even number). We cannot hope for a better rate than ln T, as the stochastic MAB is a special case of the RB problem. Finally, we can convert the D-CTO_UCB algorithm to have a decision point at each step: at each time step, determine the rotting models according to proximity rule (10), followed by pulling an arm according to Eq. (12). We term this version D-CTO_SIM-UCB.

Simulations

We next compare the performance of the SWA and CTO approaches with benchmark algorithms.
Setups: for all the simulations we use Normal distributions with a common standard deviation σ and a common horizon T. Non-Parametric: K = 2. As for the expected rewards: μ_1(n) is a constant below 1 for all n, while μ_2(n) = 1 for an initial block of pulls and a smaller constant afterwards. This setup is aimed to show the importance of not relying on the whole past rewards in the RB setting.

Table 1: Number of 'wins' and p-values between the different algorithms (UCB1, DUCB, SWUCB, wSWA, and (D-)CTO), for the non-parametric (NP), AV, and ANV cases.

Figure 1: Average regret. Left: non-parametric. Middle: parametric AV. Right: parametric ANV.
Parametric AV & ANV: K = 10. The rotting models are of the form μ(j; θ) = (int(j/L) + 1)^{-θ}, where int(·) is the lower rounded integer and L is the plateau length; i.e., plateaus of length L, with decay between plateaus according to θ, where Θ is an evenly spaced grid of decay exponents. {θ*_i}_{i=1}^{K} were sampled with replacement from Θ, independently across arms and trajectories. {μ^c_i}_{i=1}^{K} (ANV) were sampled uniformly at random from a bounded interval, independently for each arm.

Algorithms: we implemented standard benchmark algorithms for non-stationary MAB: UCB1 by Auer et al. [2002a], and Discounted UCB (DUCB) and Sliding-Window UCB (SWUCB) by Garivier and Moulines [2008]. We implemented CTO_SIM, D-CTO_SIM-UCB, and wSWA for the relevant setups. We note that adversarial benchmark algorithms are not relevant in this case, as the rewards are unbounded.
Grid Searches: grid searches were performed to determine the algorithms' parameters. For DUCB, following Kocsis and Szepesvári [2006], the discount factor γ was chosen from an evenly spaced grid of values in (0, 1), the window size τ for SWUCB from a grid of powers of ten, and α for wSWA from an evenly spaced grid.

Performance: for each of the cases, we present a plot of the average regret over the trajectories, specify the number of 'wins' of each algorithm over the others, and report the p-value of a paired T-test between the (end of trajectory) regrets of each pair of algorithms. For each trajectory and two algorithms, the 'winner' is defined as the algorithm with the lesser regret at the end of the horizon.
Results: the parameters chosen by the grid search are as follows: one value of γ for the non-parametric case and a different one for the parametric cases; different window sizes τ for the non-parametric, AV, and ANV cases, respectively; and a single value of α for all cases.

The average regret for the different algorithms is given by Figure 1. Table 1 shows the number of 'wins' and p-values. The table is to be read as follows: the entries under the diagonal are the number of times the algorithms from the left column 'won' against the algorithms from the top row, and the entries above the diagonal are the p-values between the two.

While there is no clear 'winner' among the three benchmark algorithms across the different cases, wSWA, which does not require any prior knowledge, consistently and significantly outperformed them. In addition, when prior knowledge was available and CTO_SIM or D-CTO_SIM-UCB could be deployed, they outperformed all the others, including wSWA.
Related Work
We turn to reviewing related work while emphasizing the differences from our problem.
Stochastic MAB
In the stochastic MAB setting [Lai and Robbins, 1985], the underlying reward distributions are stationary over time. The notion of regret is the same as in our work, but the optimal policy in this setting is one that pulls a fixed arm throughout the trajectory. The two most common approaches for this problem are: constructing Upper Confidence Bounds, which stem from the seminal work by Gittins [1979] in which he proved that index policies that compute upper confidence bounds on the expected rewards of the arms are optimal in this case (e.g., see Auer et al. [2002a], Garivier and Cappé [2011], Maillard et al. [2011]); and Bayesian heuristics such as Thompson Sampling, which was first presented by Thompson [1933] in the context of drug treatments (e.g., see Kaufmann et al. [2012], Agrawal and Goyal [2013], Gopalan et al. [2014]).
Adversarial MAB
In the Adversarial MAB setting (also referred to as the Experts Problem; see the book of Cesa-Bianchi and Lugosi [2006] for a review), the sequence of rewards is selected by an adversary (i.e., can be arbitrary). In this setting the notion of adversarial regret is adopted [Auer et al., 2002b, Hazan and Kale, 2011], where the regret is measured against the best possible fixed action that could have been taken in hindsight. This is as opposed to the policy regret we adopt, where the regret is measured against the best sequence of actions in hindsight.
Hybrid models
Some past works consider settings between the Stochastic and the Adversarial settings. Garivier and Moulines [2008] consider the case where the reward distributions remain constant over epochs and change arbitrarily at unknown time instants, similarly to Yu and Mannor [2009] who consider the same setting, only with the availability of side observations. Chakrabarti et al. [2009] consider the case where arms can expire and be replaced with new arms with arbitrary expected reward, but as long as an arm does not expire its statistics remain the same.
Non-Stationary MAB
Most related to our problem is the so-called Non-Stationary MAB. Originally proposed by Jones and Gittins [1972], who considered a case where the reward distribution of a chosen arm can change, it gave rise to a sequence of works (e.g., Whittle et al. [1981], Tekin and Liu [2012]) which were termed Restless Bandits and Rested Bandits. In the Restless Bandits setting, termed by Whittle [1988], the reward distributions change in each step according to a known stochastic process. Komiyama and Qin [2014] consider the case where each arm decays according to a linear combination of decaying basis functions. This is similar to our parametric case in that the reward distributions decay according to possible models, but differs fundamentally in that it belongs to the Restless Bandits setup (ours to the Rested Bandits). More examples in this line of work are Slivkins and Upfal [2008], who consider evolution of rewards according to Brownian motion, and Besbes et al. [2014], who consider bounded total variation of expected rewards. The latter is related to our setting by considering the case where the total variation is bounded by a constant, but differs significantly in that it considers the case where the (unknown) expected reward sequences are not affected by the actions taken, and in addition requires bounded support as it uses EXP3 as a sub-routine. In the Rested Bandits setting, only the reward distribution of a chosen arm changes, which is the case we consider. An optimal control policy (reward processes are known, no learning required) for bandits with non-increasing rewards and a discount factor was previously presented (e.g., Mandelbaum [1987], and Kaspi and Mandelbaum [1998]). Heidari et al. [2016] consider the case where the reward decays (as we do), but with no statistical noise (deterministic rewards), which significantly simplifies the problem. Another somewhat closely related setting is suggested by Bouneffouf and Feraud [2016], in which statistical noise exists, but the expected reward shape is known up to a multiplicative factor.
Discussion

We introduced a novel variant of the Rested Bandits framework, which we termed Rotting Bandits. This setting deals with the case where the expected rewards generated by an arm decay (or, more generally, do not increase) as a function of pulls of that arm. This is motivated by many real-world scenarios.

We first tackled the non-parametric case, where there is no prior knowledge on the nature of the decay, and introduced an easy-to-follow algorithm accompanied by theoretical guarantees.

We then tackled the parametric case, and differentiated between two scenarios: expected rewards decaying to zero (AV), and decaying to different constants (ANV). For both scenarios we introduced suitable algorithms with stronger guarantees than for the non-parametric case: for the AV scenario we introduced an algorithm ensuring, in expectation, regret upper bounded by a term that decays to zero with the horizon; for the ANV scenario we introduced an algorithm ensuring, with high probability, regret upper bounded by a horizon-dependent rate which is optimal for the stationary case.

We concluded with simulations that demonstrated our algorithms' superiority over benchmark algorithms for non-stationary MAB. We note that since the RB setting is novel, there are no suitable available benchmarks, and so this paper also serves as a benchmark.

For future work we see two main interesting directions: (i) show a lower bound on the regret for the non-parametric case, and (ii) extend the scope of the parametric case to continuous parameterization.
Acknowledgment
The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Program (FP/2007-2013) / ERC Grant Agreement n. 306638.
References
D. Agarwal, B.-C. Chen, and P. Elango. Spatio-temporal models for estimating click-through rate. In Proceedings of the 18th International Conference on World Wide Web, pages 21–30. ACM, 2009.

S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In AISTATS, pages 99–107, 2013.

R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. arXiv preprint arXiv:1206.6400, 2012.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.

B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 45–53. ACM, 2004.

O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pages 199–207, 2014.

D. Bouneffouf and R. Feraud. Multi-armed bandit problem with known trend. Neurocomputing, 205:16–21, 2016.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal. Mortal multi-armed bandits. In Advances in Neural Information Processing Systems, pages 273–280, 2009.

S. Du, M. Ibrahim, M. Shehata, and W. Badawy. Automatic license plate recognition (ALPR): A state-of-the-art review. IEEE Transactions on Circuits and Systems for Video Technology, 23(2):311–325, 2013.

A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT, pages 359–376, 2011.

A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008.

J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological), pages 148–177, 1979.

A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In ICML, volume 14, pages 100–108, 2014.

E. Hazan and S. Kale. Better algorithms for benign bandits. Journal of Machine Learning Research, 12(Apr):1287–1311, 2011.

H. Heidari, M. Kearns, and A. Roth. Tight policy regret bounds for improving and decaying bandits. 2016.

D. M. Jones and J. C. Gittins. A Dynamic Allocation Index for the Sequential Design of Experiments. University of Cambridge, Department of Engineering, 1972.

H. Kaspi and A. Mandelbaum. Multi-armed bandits in discrete and continuous time. Annals of Applied Probability, pages 1270–1290, 1998.

E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer, 2012.

R. Kleinberg and T. Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 594–605. IEEE, 2003.

L. Kocsis and C. Szepesvári. Discounted UCB. Pages 784–791, 2006.

J. Komiyama and T. Qin. Time-decaying bandits for non-stationary systems. In International Conference on Web and Internet Economics, pages 460–466. Springer, 2014.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

O.-A. Maillard, R. Munos, G. Stoltz, et al. A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In COLT, pages 497–514, 2011.

A. Mandelbaum. Continuous multi-armed bandits and multiparameter processes. The Annals of Probability, pages 1527–1556, 1987.

S. Pandey, D. Agarwal, D. Chakrabarti, and V. Josifovski. Bandits for taxonomies: A model-based approach. In SDM, pages 216–227. SIAM, 2007.

H. Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer, 1985.

A. Slivkins and E. Upfal. Adapting to a changing environment: the Brownian restless bandits. In COLT, pages 343–354, 2008.

C. Tekin and M. Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588–5611, 2012.

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

L. Tran-Thanh, S. Stein, A. Rogers, and N. R. Jennings. Efficient crowdsourcing of unknown experts using multi-armed bandits. In European Conference on Artificial Intelligence, pages 768–773, 2012.

P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, pages 287–298, 1988.

P. Whittle et al. Arm-acquiring bandits. The Annals of Probability, 9(2):284–292, 1981.

J. Y. Yu and S. Mannor. Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1177–1184. ACM, 2009.

A Hoeffding's Inequality for Sub-Gaussian RVs
Let X_1, .., X_n be independent, mean-zero, σ_i-sub-Gaussian random variables. Then for all t ≥ 0,

P( Σ_{i=1}^{n} X_i ≥ t ) ≤ exp{ − t² / (2 Σ_{i=1}^{n} σ_i²) }   (14)

B Optimal Policy
B.1 Proof of Lemma 2.1
In this section we show that π^max, defined by Eq. (3), is an optimal policy for the RB problem. Assume on the contrary that π^max is not an optimal policy. Thus, there exists a time horizon, T, for which there exists some other policy π^cand that satisfies J(T; π^cand) > J(T; π^max).

Let m be the first time step in which π^cand deviates from π^max; since J(T; π^cand) > J(T; π^max) we infer that m ≤ T (i.e., there is such a time step). Let π̃ be a policy defined by,

π̃(t) = π^cand(t), if t < m
π̃(t) ∈ argmax_{i∈[K]} { μ( N_i(m) + 1; θ*_i ) }, if t = m
π̃(t) = π^cand(t − 1), if t > m

where, if there exists more than one member in argmax_{i∈[K]} { μ( N_i(m) + 1; θ*_i ) }, π̃ chooses the same action as π^max. That is, π̃ mimics π^cand until time step m, then plays according to the argmax rule, and then re-mimics π^cand. Let μ_m, μ_T be the expected rewards of the arms that π̃ chose at the m-th time step, and that π^cand chose at the T-th time step, respectively. It is easy to see that,

J(T; π̃) − J(T; π^cand) = μ_m − μ_T ≥ 0   (15)

where the second transition holds by the argmax rule combined with the assumption that the expected rewards are non-increasing (Assumption 2.1). Thus, J(T; π̃) ≥ J(T; π^cand). If we apply the above logic recursively, we obtain a series of policies with non-decreasing values of expected total reward J(T; ·), where the series ends when there is no time step which deviates from π^max, i.e., J(T; π^max) ≥ J(T; π^cand), in contradiction to π^max being non-optimal. Thus, we infer that π^max is indeed an optimal policy.

C Non-Parametric Case
C.1 Proof of Thm. 3.1
We define,

M = ⌈α σ^{2/3} K^{-2/3} T^{2/3} ln^{1/3}(√2 T)⌉
q = 2 α^{-1/2} σ^{2/3} K^{1/3} T^{-1/3} ln^{1/3}(√2 T)

and start by making two useful observations.

Observation 1: By Hoeffding's Inequality we have,

P( |X̄_M − E[X̄_M]| ≥ q ) ≤ 1/T²   (16)

where X̄_M is the empirical average of M independent σ sub-Gaussian samples.

Observation 2: Since the expected rewards of an arm only depend on the number of times it has been pulled (and not on the time step itself), the expected total reward of a policy only depends on the number of pulls of the different arms (and not on the order of pulls).

From now on we assume that |X̄_M − E[X̄_M]| < q (see Observation 1) for all arms throughout the trajectory, and later address the case where it is violated.

Step 1: bound the number of significantly sub-optimal pulls.

In what follows we prove by induction that for all the ends of time steps t ∈ [T], by applying SWA, there is no arm j for which,

|{ n : μ_j( N^{π^SWA}_j(t) − n ) < max_{i∈[K]} [ μ_i( N^{π^SWA}_i(t) ) ] − 2q,  n ∈ ℕ }| > M   (17)

where N^{π^SWA}_i(t) is the number of pulls of arm i at time t induced by policy π^SWA, which is defined by the SWA algorithm. That is, following SWA ensures that for all time steps, no arm would be pulled more than M times in which its expected reward is at least 2q lower than the expected reward of the (current) optimal arm.

Basis: for all the ends of time steps t ∈ {1, .., KM} this holds trivially since, by the definition of SWA, we pull each arm exactly M times.

Inductive hypothesis:
Assume that the above statement holds for the end of time step t′ such that KM ≤ t′ < T.

Inductive step:
We show that the above statement holds for the end of time step t′ + 1. By the non-increasing Assumption 2.1 we note two things: (1) the RHS of the inner inequality in Eq. (17) is non-increasing in t, thus if the inequality did not hold for some arm j at the end of time step t′, it can only hold for it at the end of t′ + 1 if SWA pulls arm j in that round; (2) the number of n's for which the inequality holds for some arm j can increase only by one at each time step. Combining the two with our inductive hypothesis, we simply need to show that if for some arm j, Eq. (17) holds with equality (i.e., the number of n's is M), that arm would not be pulled at t′ + 1. By the non-increasing Assumption 2.1 we know that the last M expected rewards of arm j are those which are at least 2q lower. Let i* ∈ argmax_{i∈[K]} [ μ_i( N^{π^SWA}_i(t′+1) ) ] (if this set contains more than one arm, choose arbitrarily). We have,

(1/M) Σ_{n=N^{π^SWA}_j(t′+1)−M+1}^{N^{π^SWA}_j(t′+1)} r^n_j
 <(1) E[ (1/M) Σ_{n=N^{π^SWA}_j(t′+1)−M+1}^{N^{π^SWA}_j(t′+1)} r^n_j ] + q
 ≤(2) μ_j( N^{π^SWA}_j(t′+1) − M + 1 ) + q
 <(3) μ_{i*}( N^{π^SWA}_{i*}(t′+1) ) − q
 ≤(4) E[ (1/M) Σ_{n=N^{π^SWA}_{i*}(t′+1)−M+1}^{N^{π^SWA}_{i*}(t′+1)} r^n_{i*} ] − q
 <(5) (1/M) Σ_{n=N^{π^SWA}_{i*}(t′+1)−M+1}^{N^{π^SWA}_{i*}(t′+1)} r^n_{i*}   (18)

where (1) and (5) hold by our assumption regarding |X̄_M − E[X̄_M]| < q, (2) and (4) hold by the non-increasing Assumption 2.1, and (3) holds by the definition of the inequality in Eq. (17). Since the SWA algorithm chooses in the Balance step according to the empirical averages of the last M pulls of each arm, we infer that arm j would not be pulled (i* has a higher empirical average). This concludes the inductive step proof, and hence our statement holds.

Step 2: bound J(T; π^⌈max) − J(T; π^SWA).

Let π^⌈max be a policy defined by,

π^⌈max(t) ∈ argmax_{i∈[K]} { μ_i( N_i(t) ) }   (19)

where we first pull each arm once using Round-Robin (before following the above rule), and in a case of a tie, break it using the smallest index.

Define

I^⌈max(T) = { ( i^⌈max_t, n^⌈max_t ) }_{t=1}^{T}   (20)

to be the (deterministic) set of tuples induced by applying π^⌈max, where i^⌈max_t is the arm chosen at time step t, and n^⌈max_t is the time it is being pulled. In the same manner, we define the (stochastic) set I^SWA(T), composed of (i^SWA_t, n^SWA_t) tuples, which is induced by applying π^SWA. We further define I^{⌈max\SWA}(T) = I^⌈max(T) \ {I^⌈max(T) ∩ I^SWA(T)}, I^{SWA\⌈max}(T) = I^SWA(T) \ {I^⌈max(T) ∩ I^SWA(T)}, and also μ^SWA_max(T+1) = max_{i∈[K]} [ μ_i( N^{π^SWA}_i(T+1) ) ]. By Observation 2, the difference in the policies' expected total rewards only depends on these number-of-pull sets.
Since both policies start with one Round-Robin pull of each arm, we have,

J(T; π^⌈max) − J(T; π^SWA)
 = Σ_{(i^⌈max_t, n^⌈max_t)∈I^⌈max} μ_{i^⌈max_t}(n^⌈max_t) − Σ_{(i^SWA_t, n^SWA_t)∈I^SWA} μ_{i^SWA_t}(n^SWA_t)
 = Σ_{(i^⌈max_t, n^⌈max_t)∈I^{⌈max\SWA}} μ_{i^⌈max_t}(n^⌈max_t) − Σ_{(i^SWA_t, n^SWA_t)∈I^{SWA\⌈max}} μ_{i^SWA_t}(n^SWA_t)
 ≤ μ^SWA_max(T+1) × |I^{⌈max\SWA}| − 0 × KM − (μ^SWA_max(T+1) − 2q) × (|I^{⌈max\SWA}| − KM)
 ≤ KM max_{i∈[K]} μ_i(1) + 2qT   (21)

The first inequality holds by: (1) the non-increasing Assumption 2.1, which implies that all the tuples in I^{⌈max\SWA} correspond to expected rewards upper bounded by μ^SWA_max(T+1); and (2) what we showed in Step 1, namely that there are at most KM members in I^{SWA\⌈max} that are more than 2q below μ^SWA_max(T+1), together with the positiveness of the expected rewards by Assumption 2.1. The second inequality holds by trivially bounding μ^SWA_max(T+1) ≤ max_{i∈[K]} μ_i(1), and |I^{⌈max\SWA}| = |I^{SWA\⌈max}| ≤ T.

Finally, we note that all the above analysis was done assuming that |X̄_M − E[X̄_M]| < q for all arms throughout the trajectory, and we now address the case where it is violated. By Observation 1, the probability of the inequality to be violated is ≤ 1/T². The number of times this inequality is tested throughout the trajectory is bounded by KT (for each of the arms, in every time step, during the Balance step), and if the inequality is violated (even once) then J(T; π^⌈max) − J(T; π^SWA) is trivially bounded by T max_{i∈[K]} μ_i(1) according to the non-increasing Assumption 2.1. Thus, we infer that in expectation we have,

J(T; π^⌈max) − J(T; π^SWA) ≤ KM max_{i∈[K]} μ_i(1) + 2qT + K max_{i∈[K]} μ_i(1)   (22)

Step 3: bound the regret.

We bound the regret using our previously obtained result for π^⌈max by,

R(T; π^SWA) = max_{π∈Π} { J(T; π) } − J(T; π^SWA)
 = J(T; π^max) − J(T; π^SWA)
 = ( J(T; π^max) − J(T; π^⌈max) ) + ( J(T; π^⌈max) − J(T; π^SWA) )
 ≤ K max_{i∈[K]} μ_i(1) + ( J(T; π^⌈max) − J(T; π^SWA) )
 ≤ 2K max_{i∈[K]} μ_i(1) + KM max_{i∈[K]} μ_i(1) + 2qT
 = (α max_{i∈[K]} μ_i(1) + 4α^{-1/2}) σ^{2/3} K^{1/3} T^{2/3} ln^{1/3}(√2 T) + 3K max_{i∈[K]} μ_i(1)   (23)

where the first equality holds by Lemma 2.1, the first inequality holds by Theorem 3 in Heidari et al. [2016], the second inequality holds by the bound we found in Step 2, and the last equality holds by plugging in the definitions of M and q. This establishes Theorem 3.1.

C.2 Proof of Corollary 3.1.1
For convenience, we define the following objects: R(t₁ → t₂; π) is the regret accumulated between time steps t₁ and t₂ (included), by applying policy π consistently. R(t₁ → t₂; π₂ | π₁(t₁)) is the regret accumulated between time steps t₁ and t₂, by applying π₁ until time step t₁, and then π₂ for the measured time steps. We define similar objects for the expected total reward, J.

We note that,

J(t₁ → t₂; π^max) ≤ J(t₁ → t₂; π^max | π(t₁)),   ∀π ∈ Π   (24)

The above inequality can be understood by the following argument: consider a decreasingly sorted list of all the expected rewards across all arms. By Assumption 2.1, at each time step, π^max simply pulls an arm corresponding to the highest element in that list that was not previously pulled (independently of previous pulls). Thus, J(t₁ → t₂; π^max) is the sum of the t₁-th to t₂-th elements in this list, which is the lowest possible sum of |t₂ − t₁ + 1| highest elements in the list following any |t₁ − 1| pulls.

Consider the n-th iteration of wSWA, i.e., between time steps t₁ = 2^{n−1} and t₂ = min[2^{n} − 1, T]. We have,

R(t₁ → t₂; π^wSWA)
 =(1) J(t₁ → t₂; π^max) − J(t₁ → t₂; π^wSWA)
 =(2) J(t₁ → t₂; π^max | π^max(t₁)) − J(t₁ → t₂; π^wSWA | π^wSWA(t₁))
 ≤(3) J(t₁ → t₂; π^max | π^wSWA(t₁)) − J(t₁ → t₂; π^wSWA | π^wSWA(t₁))
 =(4) J(t₁ → t₂; π^max | π^wSWA(t₁)) − J(t₁ → t₂; π^SWA | π^wSWA(t₁))
 =(5) R(t₁ → t₂; π^SWA | π^wSWA(t₁))
 ≤(6) R^bound(t₂ − t₁ + 1)   (25)

where (1) and (2) hold by definition, (3) holds by Eq. (24), (4) by noting that wSWA applies SWA between t₁ and t₂, (5) by definition, and (6) by observing that it is the regret of a known-horizon problem that satisfies Assumption 2.1, thus we can use the upper bound from Theorem 3.1, denoted by R^bound.

Let ñ = ⌊log₂ T⌋ + 1, thus 2^{ñ−1} ≤ T ≤ 2^{ñ} − 1, and we have,

R(T; π^wSWA)
 =(1) Σ_{y=1}^{ñ−1} R(2^{y−1} → 2^{y} − 1; π^wSWA) + R(2^{ñ−1} → T; π^wSWA)
 ≤(2) Σ_{y=1}^{ñ−1} R^bound(2^{y−1}) + R^bound(T − 2^{ñ−1} + 1)
 ≤(3) Σ_{y=0}^{ñ−1} R^bound(2^{y})
 =(4) Σ_{y=0}^{ñ−1} [ A 2^{2y/3} ln^{1/3}(√2 · 2^{y}) + B ]
 ≤(5) A ln^{1/3}(√2 T) Σ_{y=0}^{ñ−1} 2^{2y/3} + B (log₂ T + 1)
 ≤(6) (2^{2/3}/(2^{2/3} − 1)) A T^{2/3} ln^{1/3}(√2 T) + B (log₂ T + 1)   (26)

where (1) holds by dividing the horizon and noting that the regret is additive; (2) holds by Eq. (25); (3) holds by noting that both Theorem 3 from Heidari et al. [2016] and Step 1 from the proof of Theorem 3.1 hold for any t ∈ [T], thus the upper bound R^bound holds for any t ∈ [T]; (4) holds by plugging in R^bound and defining A = (α max_{i∈[K]} μ_i(1) + 4α^{-1/2}) σ^{2/3} K^{1/3} and B = 3K max_{i∈[K]} μ_i(1); (5) holds by monotonicity of the logarithm, and noting that A and B are independent of y; finally, (6) holds as a sum of a geometric series and simple algebra.

Plugging back A and B, we establish Corollary 3.1.1.

D Parametric Case
D.1 Proof of Thm. 4.1

Bounding number of steps to optimality
We first characterize the bound, and later show feasibility (i.e., that the analysis we show here indeed holds within the horizon).

Similarly to the definition of m*_diff(p; θ*_i) and m*_diff(p), we define m*(p; θ*_i) as the solution to optimization problem (11) using Eq. (7) as the proximity rule to hypothesize θ̂, and m*(p) = max_{θ∈Θ} m*(p; θ).

Let T be some unknown horizon. We first show that m*(1/(KT²)) is finite. Define,

θ′_i(m̃) = argmin_{θ≠θ*_i} { | Σ_{j=1}^{m̃} μ(j; θ*_i) − Σ_{j=1}^{m̃} μ(j; θ) | }   (27)

Thus we have, when we sample only from arm i,

P( θ̂_i(m̃) ≠ θ*_i ) = P( ∃θ ≠ θ*_i : |Y(i, m̃; θ)| ≤ |Y(i, m̃; θ*_i)| )
 ≤ P( | Σ_{j=1}^{m̃} r^i_j − Σ_{j=1}^{m̃} μ(j; θ*_i) | > (1/2) | Σ_{j=1}^{m̃} μ(j; θ*_i) − Σ_{j=1}^{m̃} μ(j; θ′_i(m̃)) | )
 ≤ 2 exp{ − [8 det_{θ*_i, θ′_i(m̃)}(m̃)]^{-1} }   (28)

where the first inequality holds by inclusion of events, and the second inequality holds by Eq. (14) and the definition of det_{θ*_i, θ′_i}.

Since trivially bal(n) ≥ n, by Assumption 4.2 there exists a finite m̃ for which,

max_{θ1≠θ2∈Θ} { det_{θ1,θ2}(m̃) } ≤ (1/8) ln^{-1}(2KT²)   (29)

Therefore, if we plug m̃ back into the above equation we get,

2 exp{ − [8 det_{θ*_i, θ′_i(m̃)}(m̃)]^{-1} } ≤ 1/(KT²)   (30)

Thus, we have a finite m̃ that satisfies the constraints of optimization problem (11) for p = 1/(KT²), and by definition m*(1/(KT²)) ≤ m̃; i.e., m*(1/(KT²)) is finite.

Given a rotting model θ*_i of arm i, we term that arm 'saturated' if it has been pulled at least m*(1/(KT²); θ*_i) times, which is finite since, by definition, m*(1/(KT²); θ*_i) ≤ m*(1/(KT²)). We assume that once an arm is 'saturated', it is truly detected at every time step, and omit this assertion from now on (we deal with the misdetection case later); i.e., we assume that once arm i hypothesizes its rotting model to be θ̂_i and has also been pulled at least m*(1/(KT²); θ*_i) times, then θ̂_i = θ*_i.

We next bound the number of pulls of the different arms, given the number of pulls of some other arm. Let s be the first time step for which min_{i∈[K]} {N_i(s)} = max_{θ∈Θ*} { m*(1/(KT²); θ) }. We first note that s is finite since by Assumption 4.1 we have μ(n; θ) ∈ o(1); combined with the argmax rule CTO_SIM follows and its tie breaking rule, at some finite time step all arms would be pulled the specified amount of times. By our above assumption, from this point on, all the arms' rotting models are correctly detected. Thus, for any arm j, N_j(s) can be upper bounded by the solution for,

min t_j   s.t.   t_j ∈ ℕ
t_j ≥ max_{θ∈Θ*} { m*(1/(KT²); θ) }
μ(t_j + 1; θ*_j) ≤ min_{θ̃∈Θ} [ μ( max_{θ∈Θ*} { m*(1/(KT²); θ) }; θ̃ ) ]   (31)

where the above optimization bound characterization holds since:
(1) For any arm j ∈ argmin_{i∈[K]} {N_i(s)}, this holds trivially by the explicit constraint t_j ≥ max_{θ∈Θ*} { m*(1/(KT²); θ) }.
(2) For any arm j ∉ argmin_{i∈[K]} {N_i(s)}, clearly the constraint on the lower bound holds. As for the constraint on the upper bound, it holds by noting that all the arms' hypothesized models are correct and CTO_SIM follows an argmax policy; thus j would not be pulled such that μ(N_j(s); θ*_j) < min_{θ̃∈Θ} [ μ( max_{θ∈Θ*} { m*(1/(KT²); θ) }; θ̃ ) ], as the RHS is the lowest obtainable expected reward until time step s. In addition, the tie breaking rule is least number of pulls.

Let μ_min(s; Θ*) = min_{j∈[K]} { μ(N_j(s); θ*_j) }. Following the CTO_SIM policy we infer that there exists s̃ ≥ s for which:
(1) μ(N_i(s̃) + 1; θ*_i) ≤ μ_min(s; Θ*), for all i ∈ [K].
(2) μ(N_i(s̃); θ*_i) > μ_min(s; Θ*), for all i ∉ argmin_{j∈[K]} { μ(N_j(s); θ*_j) }.
The above observation holds by noting that CTO_SIM follows an argmax rule; thus it would choose arms ∉ argmin_{j∈[K]} { μ(N_j(s); θ*_j) } to be pulled as long as their expected reward is strictly greater than the already-pulled minimal expected reward μ_min(s; Θ*), before the possibility of choosing arms with expected reward ≤ μ_min(s; Θ*). Since by Eq. (31) we have that min_{j∈[K]} { μ(N_j(s); θ*_j) } ≥ min_{θ̃∈Θ} [ μ( max_{θ∈Θ*} { m*(1/(KT²); θ) }; θ̃ ) ], we can upper bound s̃ by the following,

min ‖t‖₁   s.t.   t ∈ ℕ^K
t_i ≥ max_{θ∈Θ*} { m*(1/(KT²); θ) }, ∀i ∈ [K]
μ(t_i + 1; θ*_i) ≤ min_{θ̃∈Θ} [ μ( max_{θ∈Θ*} { m*(1/(KT²); θ) }; θ̃ ) ], ∀i ∈ [K]   (32)

We turn to show optimality starting from time step s̃. We start by showing it for s̃.

Assume on the contrary that J(s̃; π^max) ≠ J(s̃; π^CTO_SIM). On the one hand, by Lemma 2.1, we have J(s̃; π^max) ≥ J(s̃; π^CTO_SIM). On the other hand, let {q_i}_{i∈[K]} be the set of the arms' numbers of pulls at time s̃ following π^max (respectively, {s̃_i}_{i∈[K]} for CTO_SIM), i.e.,

J(s̃; π^max) = Σ_{i∈[K]} Σ_{j=1}^{q_i} μ(j; θ*_i)   (33)

We have that J(s̃; π^CTO_SIM) − J(s̃; π^max) is a sum of pairs of the form μ(l; θ*_i) − μ(h; θ*_j), where l ≤ s̃_i and h > s̃_j, for i ≠ j ∈ [K]. By the definition of {s̃_i} and the non-increasing Assumption 2.1, we have that μ(l; θ*_i) ≥ μ_min(s; Θ*) and μ_min(s; Θ*) ≥ μ(h; θ*_j), resulting in J(s̃; π^CTO_SIM) ≥ J(s̃; π^max). Hence, the regret vanishes at time step s̃, achieving optimality.

We next show that the regret remains zero for ŝ ≥ s̃. We showed optimality for time step s̃ defined above; we next show optimality for s̃ + 1. We examine the two possible cases.

Case 1: ∀i ∈ [K]: q_i = s̃_i. Since CTO_SIM follows the argmax rule as π^max does, we infer that arms with equal expected reward would be chosen by both CTO_SIM and π^max, thereby yielding J(s̃ + 1; π^max) = J(s̃ + 1; π^CTO_SIM), i.e., zero regret as stated.

Case 2: ∃i: s̃_i ≠ q_i. Therefore, there is an arm, denoted i_gap, for which s̃_{i_gap} < q_{i_gap}. By the argmax rule, CTO_SIM chooses an arm i_{s̃+1} such that μ(s̃_{i_{s̃+1}} + 1; θ*_{i_{s̃+1}}) ≥ μ(s̃_{i_gap} + 1; θ*_{i_gap}). By the non-increasing Assumption 2.1 and the definition of π^max, since q_{i_gap} ≥ s̃_{i_gap} + 1, we have μ(q_{j_{s̃+1}}; θ*_{j_{s̃+1}}) ≤ μ(q_{i_gap}; θ*_{i_gap}) ≤ μ(s̃_{i_gap} + 1; θ*_{i_gap}), where j_{s̃+1} is the arm chosen by π^max. Thus, on the one hand we have J(s̃ + 1; π^max) ≤ J(s̃ + 1; π^CTO_SIM); on the other hand, by Lemma 2.1, we have J(s̃ + 1; π^max) ≥ J(s̃ + 1; π^CTO_SIM). Combining the two, we have J(s̃ + 1; π^max) = J(s̃ + 1; π^CTO_SIM), i.e., zero regret as stated.

The above argument can be applied recursively for any ŝ > s̃, thus establishing optimality of CTO_SIM for all ŝ ≥ s̃, under true detection.

If it happens to be that ‖t‖₁ ≤ T, then for that T, CTO_SIM will achieve zero regret (starting from s̃). Since we require that the result hold from some T*_SIM onward, we need the above characterization to also hold for any T̃ ≥ T. We thereby infer that the smallest T such that for any T̃ ≥ T there exists t for which the above stated result holds (i.e., the solution to the optimization problem indeed satisfies ‖t‖₁ ≤ T̃) can serve as an upper bound for T*_SIM, resulting in T*_SIM being upper bounded by the solution of,

min T   s.t.   T, b ∈ ℕ ∪ {0}, t ∈ ℕ^K
∀b, ∃t: ‖t‖₁ ≤ T + b
t_i ≥ max_{θ∈Θ*} { m*( 1/(K(T+b)²); θ ) }
μ(t_i + 1; θ*_i) ≤ min_{θ̃∈Θ} [ μ( max_{θ∈Θ*} { m*( 1/(K(T+b)²); θ ) }; θ̃ ) ]   (34)

Feasibility
In order to show feasibility, we wish to obtain,{ ≤ T where Detection is a phase of pulling arms until the rotting models are detected with highenough probability (defined below), and Balance is a phase which at the end of it there is noarm which yields strictly higher expected reward than the minimal observed expected rewardso far, as explained in the former step, resulting in vanishing regret (similar to s and ˜ s dis-cussed above). We require that the detection of each arm is w.p of at least − KT . Define W ( T ) = max θ ,θ (cid:26) det (cid:63) ↓ θ ,θ (cid:16) ln − (cid:16) √ KT (cid:17)(cid:17) (cid:27) . As shown in the beginning of this proof, af-ter pulling an arm for W ( T ) times, the probability of misdetection its rotting model ≤ KT . Werefer to an arm that has been pulled at least W ( T ) times as ‘strongly saturated’. From now on we willassume that any ‘strongly saturated’ arm is truely detected at each decision point, and will discuss theother case later on.On the one hand, by the definition of bal () , the non-increasing assumption 2.1, and the rule oftie breaking applied by CTO SIM , we have that all arms become ‘strongly saturated’ after, at most, W ( T ) + ( K − × bal ( W ( T )) time steps.On the other hand, from the definition of bal () , and CTO SIM , we infer that no arm would be pulled bal ( W ( T )) + 1 times before all other arms would become ‘strongly saturated’.Combining the two above observations we have that, after at most W ( T ) + ( K − × bal ( W ( T )) time steps, there exists a time step in which all arms have became ‘strongly saturated’, but werenot pulled more than bal ( W ( T )) times. From that point, following the same flow at the formersubsection, the total number of pulls required in order to “balance" the arms (i.e., there is no pull thatwould yield strictly higher reward than the minimal expected reward observed so far), is bounded by K × bal ( W ( T )) . That is under the worst case scenario, where every arm that becomes ‘stronglysaturated‘ is detected to be an arm that requires bal ( W ( T )) pulls to “balance" itself w.r.t to another‘strongly saturated’ arm. Thus, we infer that,{ ≤ K × bal ( W ( T )) Let (cid:15) = (cid:16) K √ K (cid:17) − . By assumption 4.2, we have that there exists a finite ˜ T max for which, ∀ ˜ T ≥ ˜ T max : bal (cid:18) max θ (cid:54) = θ ∈ Θ (cid:26) det (cid:63) ↓ θ ,θ (cid:18)
Let $\epsilon = \left(K\sqrt{2K}\right)^{-1}$. By Assumption 4.2, there exists a finite $\tilde T_{\max}$ for which, $\forall\tilde T\ge\tilde T_{\max}$:
\[
\mathrm{bal}\left(\max_{\theta_1\ne\theta_2\in\Theta}\left\{\mathrm{det}^{\star\downarrow}_{\theta_1,\theta_2}\!\left(\tfrac{1}{16}\ln^{-1}\!\left(\tilde T\right)\right)\right\}\right) \le \epsilon\tilde T
\tag{35}
\]
We denote $T = \left(\sqrt{2K}\right)^{-1}\tilde T$, and get
\[
\forall T \ge \tilde T_{\max}/\sqrt{2K}:\quad K\times\mathrm{bal}\left(W(T)\right) \le T
\tag{36}
\]
which implies, under true detection, that $\forall T\ge\tilde T_{\max}/\sqrt{2K}$, the $\mathrm{CTO_{SIM}}$ algorithm achieves zero regret. Since by definition we have $\forall\theta\in\Theta:\ m^*\!\left(\frac{1}{KT^2};\theta\right)\le m^*\!\left(\frac{1}{KT^2}\right)$, and by the definition of $m^*\!\left(\frac{1}{KT^2}\right)$ we have $m^*\!\left(\frac{1}{KT^2}\right)\le W(T)$, we infer that there exists a (finite) $T^*_{\mathrm{SIM}}$ that satisfies the optimization problem characterization stated above (i.e., $\forall\tilde T\ge T^*_{\mathrm{SIM}}$ the optimization problem is feasible).
Misdetection and Expectation
So far, we have assumed that each 'saturated' (or 'strongly saturated') arm is truly detected. By definition, the probability that a 'saturated' (or 'strongly saturated') arm is misdetected at any given time step is upper bounded by $\frac{1}{KT^2}$. Thereby, after all the arms are 'saturated', the probability of a misdetection at each time step is upper bounded by $\frac{1}{T^2}$. The number of time steps after all the arms are 'saturated' (the first such step is referred to as the 'saturated step') is trivially bounded by $T$. Hence, the probability that a misdetection occurs after the 'saturated step' is bounded by $\frac{1}{T}$. Meaning that $\forall T\ge T^*_{\mathrm{SIM}}$, $\mathrm{CTO_{SIM}}$ achieves zero regret w.p. of at least $1-\frac{1}{T}$.

Next, we note that, for the case where we misdetect any arm,
\[
J\left(T;\pi^{\max}\right) - J\left(T;\pi^{\mathrm{CTO_{SIM}}}\right)
= \sum_{i=1}^{K}\sum_{j=1}^{N^{\max}_i(T)}\mu\left(j;\theta^*_i\right) - \sum_{i=1}^{K}\sum_{j=1}^{N^{\mathrm{CTO_{SIM}}}_i(T)}\mu\left(j;\theta^*_i\right)
\le \sum_{i=1}^{K}\mathbb{I}_{\left\{N^{\max}_i(T)>N^{\mathrm{CTO_{SIM}}}_i(T)\right\}}\sum_{j=N^{\mathrm{CTO_{SIM}}}_i(T)+1}^{N^{\max}_i(T)}\mu\left(j;\theta^*_i\right)
\le T\max_{\theta\in\Theta^*}\left\{\mu\!\left(\min_{i\in[K]}\left\{N^{\mathrm{CTO_{SIM}}}_i(T)\right\};\theta\right)\right\}
\tag{37}
\]
where the first inequality holds by only considering the cases where $N^{\max}_i(T) > N^{\mathrm{CTO_{SIM}}}_i(T)$, and not the other way around (since the expected rewards are positive by Assumption 4.1).

By applying expectation over the events (true detection or not), we get
\[
R\left(T;\pi^{\mathrm{CTO_{SIM}}}\right)
= R\left(T;\pi^{\mathrm{CTO_{SIM}}}\,\middle|\,\text{true detection}\right)\times P\left(\text{true detection}\right)
+ R\left(T;\pi^{\mathrm{CTO_{SIM}}}\,\middle|\,\text{misdetection}\right)\times P\left(\text{misdetection}\right)
\le \max_{\theta\in\Theta^*}\left\{\mu\!\left(\min_{i\in[K]}\left\{N^{\mathrm{CTO_{SIM}}}_i(T)\right\};\theta\right)\right\}
\tag{38}
\]
where the last inequality holds since, for $T\ge T^*_{\mathrm{SIM}}$, the regret under true detection is zero, the probability of misdetection is at most $\frac{1}{T}$, and the regret under misdetection is bounded by Eq. (37).
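As a sanity check on the misdetection probabilities used above, the following sketch estimates, by Monte Carlo, how often a closest-to-origin rule confuses rotting models on the parametric family of Example 4.1 (rotting part $n^{-\theta}$, Gaussian noise). The matching-by-cumulative-sum rule, the model grid, and the noise level are illustrative assumptions consistent with the $\mathrm{det}(\cdot)$ quantity in this proof, not a verbatim transcription of $\mathrm{CTO_{SIM}}$.

```python
import numpy as np

# Monte-Carlo estimate of the misdetection probability of a closest-to-origin
# rule: match the observed reward sum of an arm to the nearest model's
# cumulative mean. Grid, noise level, and pull counts are illustrative.
rng = np.random.default_rng(1)
THETAS = np.array([0.1, 0.2, 0.3, 0.4])

def misdetection_rate(theta_true, n_pulls, sigma=0.5, trials=2000):
    j = np.arange(1, n_pulls + 1, dtype=float)
    model_sums = np.array([np.sum(j ** -th) for th in THETAS])
    true_mean_sum = np.sum(j ** -theta_true)
    errors = 0
    for _ in range(trials):
        obs = true_mean_sum + sigma * rng.standard_normal(n_pulls).sum()
        detected = THETAS[np.argmin(np.abs(obs - model_sums))]
        errors += detected != theta_true
    return errors / trials

# the empirical rate drops quickly with n_pulls, in line with the 1/(K*T^2)
# target discussed above once an arm has been pulled W(T) times
print(misdetection_rate(theta_true=0.2, n_pulls=500))
print(misdetection_rate(theta_true=0.2, n_pulls=2000))
```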
Finally,
\[
T = \sum_{i=1}^{K}N^{\mathrm{CTO_{SIM}}}_i(T)
\le \min_{i\in[K]}N^{\mathrm{CTO_{SIM}}}_i(T) + (K-1)\max_{i\in[K]}N^{\mathrm{CTO_{SIM}}}_i(T)
\le \min_{i\in[K]}N^{\mathrm{CTO_{SIM}}}_i(T) + (K-1)\times\mathrm{bal}\!\left(\min_{i\in[K]}N^{\mathrm{CTO_{SIM}}}_i(T)\right)
\le K\times\mathrm{bal}\!\left(\min_{i\in[K]}N^{\mathrm{CTO_{SIM}}}_i(T)\right)
\tag{39}
\]
Hence, by Assumption 2.1, $\min_{i\in[K]}N^{\mathrm{CTO_{SIM}}}_i(T)\xrightarrow{T\to\infty}\infty$, resulting in $R\left(T;\pi^{\mathrm{CTO_{SIM}}}\right)\in o(1)$, and trivially $R\left(T;\pi^{\mathrm{CTO_{SIM}}}\right)\le\max_{\theta\in\Theta^*}\mu(1;\theta)$.

We note that, from the feasibility step, given a function $U(\epsilon)$ that satisfies, $\forall n\ge U(\epsilon)$,
\[
\mathrm{bal}\left(\max_{\theta_1\ne\theta_2\in\Theta}\left\{\mathrm{det}^{\star\downarrow}_{\theta_1,\theta_2}\!\left(\tfrac{1}{16}\ln^{-1}(n)\right)\right\}\right) \le \epsilon n
\tag{40}
\]
we have
\[
T^*_{\mathrm{SIM}} \le \frac{U\!\left(\left(K\sqrt{2K}\right)^{-1}\right)}{\sqrt{2K}}
\tag{41}
\]

Proof of Thm. 4.2

Decomposing the regret

First, we upper bound the regret by
\[
R\left(T;\pi^{\mathrm{D\text{-}CTO_{UCB}}}\right)
= \sum_{i=1}^{K}\sum_{j=1}^{\mathbb{E}\left[N^{\pi^{\max}}_i(T)\right]}\mu_i(j) - \sum_{i=1}^{K}\sum_{j=1}^{\mathbb{E}\left[N^{\pi^{\mathrm{D\text{-}CTO_{UCB}}}}_i(T)\right]}\mu_i(j)
\le \underbrace{\sum_{i\ne a^*}\sum_{j=1}^{\mu^{\star\downarrow}\left(\Delta_i;\theta^*_i\right)}\mu_i(j)}_{=\tilde C\left(\Theta^*,\{\mu^c_i\}\right)} + \sum_{j=1}^{T}\mu_{a^*}(j) - \sum_{i=1}^{K}\sum_{j=1}^{\mathbb{E}\left[N^{\pi^{\mathrm{D\text{-}CTO_{UCB}}}}_i(T)\right]}\mu_i(j)
= \tilde C\left(\Theta^*,\{\mu^c_i\}\right) + \sum_{j=\mathbb{E}\left[N^{\pi^{\max}}_{a^*}(T)\right]+1}^{T}\mu_{a^*}(j) - \sum_{i\ne a^*}\sum_{j=1}^{\mathbb{E}\left[N^{\pi^{\mathrm{D\text{-}CTO_{UCB}}}}_i(T)\right]}\mu_i(j)
\le \tilde C\left(\Theta^*,\{\mu^c_i\}\right) + \sum_{j=\mathbb{E}\left[N^{\pi^{\max}}_{a^*}(T)\right]+1}^{T}\left(\mu^c_{a^*}+\mu\left(1;\theta^*_{a^*}\right)\right) - \sum_{i\ne a^*}\sum_{j=1}^{\mathbb{E}\left[N^{\pi^{\mathrm{D\text{-}CTO_{UCB}}}}_i(T)\right]}\mu^c_i
\le \tilde C\left(\Theta^*,\{\mu^c_i\}\right) + \sum_{i\ne a^*}\mathbb{E}\left[N^{\pi^{\mathrm{D\text{-}CTO_{UCB}}}}_i(T)\right]\times\left(\Delta_i+\mu\left(1;\theta^*_{a^*}\right)\right)
\tag{42}
\]
where $\mathbb{E}\left[N^{\pi^{\max}}_i(T)\right]$ is the expected number of pulls of arm $i$ at time $T$ induced by the optimal policy $\pi^{\max}$, and $\mathbb{E}\left[N^{\pi^{\mathrm{D\text{-}CTO_{UCB}}}}_i(T)\right]$ is the expected number of pulls induced by the policy $\pi^{\mathrm{D\text{-}CTO_{UCB}}}$. The first inequality holds by noting that $\pi^{\max}$ pulls according to an argmax rule; thus any arm $i\ne a^*$ would not be pulled after yielding an expected reward not greater than $\mu^c_{a^*}$, according to the behavior of $\mu(\cdot;\cdot)$ by Assumption 2.1.

Detecting the models
Next, we show that $m^*_{\mathrm{diff}}(\delta/K)$ is finite. Define
\[
D\left(\mu(\cdot;\theta),0,n\right) = \sum_{j=1}^{\lfloor n/2\rfloor}\mu(j;\theta) - \sum_{j=\lfloor n/2\rfloor+1}^{n}\mu(j;\theta)
\tag{43}
\]
and
\[
\theta'_i(\tilde m) = \operatorname*{argmin}_{\theta\ne\theta^*_i}\left\{\left|D\left(\mu(\cdot;\theta^*_i),0,\tilde m\right) - D\left(\mu(\cdot;\theta),0,\tilde m\right)\right|\right\}
\tag{44}
\]
Thus, when we sample only from arm $i$, and for an even $\tilde m$, we have
\[
P\left(\hat\theta_i(\tilde m)\ne\theta^*_i\right)
= P\left(\exists\theta\ne\theta^*_i:\ \left|Z(i,\tilde m;\theta)\right|\le\left|Z(i,\tilde m;\theta^*_i)\right|\right)
\le P\left(\left|\sum_{j=1}^{\tilde m/2}r_{ij} - \sum_{j=\tilde m/2+1}^{\tilde m}r_{ij} - D\left(\mu(\cdot;\theta^*_i),0,\tilde m\right)\right| > \tfrac{1}{2}\left|D\left(\mu(\cdot;\theta^*_i),0,\tilde m\right) - D\left(\mu\left(\cdot;\theta'_i(\tilde m)\right),0,\tilde m\right)\right|\right)
\le 2\exp\left\{-\left(8\,\mathrm{Ddet}_{\theta^*_i,\theta'_i(\tilde m)}(\tilde m)\right)^{-1}\right\}
\tag{45}
\]
where the first inequality holds by inclusion of events, and the second inequality holds by Eq. (14), the definition of $\mathrm{Ddet}_{\theta^*_i,\theta'_i}$, and by noting that for an even $\tilde m$ we have
\[
\mathbb{E}\left[\sum_{j=1}^{\tilde m/2}r_{ij} - \sum_{j=\tilde m/2+1}^{\tilde m}r_{ij}\right] = D\left(\mu(\cdot;\theta^*_i),0,\tilde m\right)
\tag{46}
\]
By Assumption 4.3, there exists a finite, even $\tilde m$ for which
\[
\max_{\theta_1\ne\theta_2\in\Theta}\left\{\mathrm{Ddet}_{\theta_1,\theta_2}(\tilde m)\right\} \le \tfrac{1}{8}\ln^{-1}\!\left(\tfrac{2K}{\delta}\right)
\tag{47}
\]
If we plug this $\tilde m$ back into the above equation, we get
\[
2\exp\left\{-\left(8\,\mathrm{Ddet}_{\theta^*_i,\theta'_i(\tilde m)}(\tilde m)\right)^{-1}\right\} \le \frac{\delta}{K}
\tag{48}
\]
Thus, we have a finite $\tilde m$ that satisfies the constraints of Prob. (11) for $p=\delta/K$, and by definition $m^*_{\mathrm{diff}}(\delta/K)\le\tilde m$, i.e., $m^*_{\mathrm{diff}}(\delta/K)$ is finite.
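To make the detection rule behind Eqs. (43)–(45) concrete, the following minimal sketch computes the split statistic and picks the model whose $D(\cdot)$ value is closest to it, on a hypothetical instance of the Example 4.1 family. The reduced two-model grid, the noise level, and the number of pulls are assumptions of this sketch chosen so that a single run is reliable; closer models require far more pulls, which is exactly what $\mathrm{Ddet}(\cdot)$ quantifies.

```python
import numpy as np

# The split ("difference") statistic cancels the unknown constant part mu_c of
# the arm; the model whose D(.) value is closest to it is declared.
THETAS = np.array([0.1, 0.4])

def D(theta, m):
    """D(mu(.;theta), 0, m) = sum_{j<=m/2} j^-theta - sum_{m/2<j<=m} j^-theta."""
    vals = np.arange(1, m + 1, dtype=float) ** -theta
    half = m // 2
    return vals[:half].sum() - vals[half:].sum()

def detect_theta(rewards):
    """Pick the model whose D value is nearest to the empirical split statistic."""
    m = len(rewards) - (len(rewards) % 2)          # use an even prefix
    r = np.asarray(rewards[:m], dtype=float)
    z = r[: m // 2].sum() - r[m // 2:].sum()       # the constant part cancels here
    return THETAS[np.argmin([abs(z - D(th, m)) for th in THETAS])]

# Usage: m pulls of an arm with true theta = 0.4 and unknown constant mu_c.
rng = np.random.default_rng(0)
m, theta_true, mu_c, sigma = 5000, 0.4, 0.5, 0.1
pulls = np.arange(1, m + 1, dtype=float)
rewards = mu_c + pulls ** -theta_true + sigma * rng.standard_normal(m)
print(detect_theta(rewards))   # prints 0.4 with high probability
```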
Bounding number of pulls

We wish to bound $\mathbb{E}\left[N^{\pi^{\mathrm{D\text{-}CTO_{UCB}}}}_i(T)\right]$ for all $i\ne a^*$. Remember that in the exploration part (leading to the Detect step), we pull each arm $m^*_{\mathrm{diff}}(\delta/K)$ times; hence,
\[
N^{\pi^{\mathrm{D\text{-}CTO_{UCB}}}}_i(T) = m^*_{\mathrm{diff}}(\delta/K) + \sum_{t=K\times m^*_{\mathrm{diff}}(\delta/K)+1}^{T}\mathbb{I}_{\{i(t)=i\}}
\tag{49}
\]
where $\mathbb{I}_{\{\cdot\}}$ is the indicator function. Similarly to the proof of UCB1 (Auer et al. [2002a]), we have
\[
N^{\pi^{\mathrm{D\text{-}CTO_{UCB}}}}_i(T) \le l_i + \sum_{t=1}^{\infty}\sum_{s=m^*_{\mathrm{diff}}(\delta/K)}^{t-1}\sum_{s_i=l_i}^{t-1}\mathbb{I}_{\left\{\hat\mu^c_{a^*}(s)+\mu\left(s;\theta^*_{a^*}\right)+c_{t,s} \le \hat\mu^c_i(s_i)+\mu\left(s_i;\theta^*_i\right)+c_{t,s_i}\right\}}
\tag{50}
\]
where for some $\epsilon_i\in(0,\Delta_i)$ we denote $l_i = \max\left\{m^*_{\mathrm{diff}}(\delta/K),\ \mu^{\star\downarrow}\left(\epsilon_i;\theta^*_i\right),\ \left\lceil\frac{\sigma^2\ln T}{(\Delta_i-\epsilon_i)^2}\right\rceil\right\}$, and we note that we assume that we have detected the true underlying rotting models (which holds w.p. of at least $1-\delta$, as shown above).

The above indicator function holds only if at least one of the following holds:
\[
\hat\mu^c_{a^*}(s) \le \mu^c_{a^*} - c_{t,s},\qquad
\hat\mu^c_i(s_i) \ge \mu^c_i + c_{t,s_i},\qquad
\mu^c_{a^*} + \mu\left(s;\theta^*_{a^*}\right) < \mu^c_i + \mu\left(s_i;\theta^*_i\right) + 2c_{t,s_i}
\tag{51}
\]
Plugging in $c_{t,s}$ and $c_{t,s_i}$, and using Eq. (14), we have
\[
P\left(\hat\mu^c_{a^*}(s)\le\mu^c_{a^*}-c_{t,s}\right)\le t^{-4},\qquad
P\left(\hat\mu^c_i(s_i)\ge\mu^c_i+c_{t,s_i}\right)\le t^{-4}
\tag{52}
\]
And for $s_i\ge l_i$ we have
\[
\mu^c_{a^*} + \mu\left(s;\theta^*_{a^*}\right) - \mu^c_i - \mu\left(s_i;\theta^*_i\right) - 2c_{t,s_i}
\ge \mu^c_{a^*} - \mu^c_i - \mu\left(s_i;\theta^*_i\right) - 2c_{t,s_i}
\ge \mu^c_{a^*} - \mu^c_i - \epsilon_i - 2c_{t,s_i}
= \left(\Delta_i-\epsilon_i\right) - 2c_{t,s_i} \ge 0
\tag{53}
\]
where the first inequality holds by Assumption 4.1, the second inequality by $s_i\ge\mu^{\star\downarrow}\left(\epsilon_i;\theta^*_i\right)$, and the third inequality by $s_i\ge\left\lceil\frac{\sigma^2\ln T}{(\Delta_i-\epsilon_i)^2}\right\rceil$.

Thus, combining the above observations, we get
\[
\mathbb{E}\left[N^{\pi}_i(T)\right] \le l_i + \sum_{t=1}^{\infty}\sum_{s=m^*_{\mathrm{diff}}(\delta/K)}^{t-1}\sum_{s_i=l_i}^{t-1}\left(P\left(\hat\mu^c_{a^*}\le\mu^c_{a^*}-c_{t,s}\right) + P\left(\hat\mu^c_i\ge\mu^c_i+c_{t,s_i}\right)\right) \le l_i + \frac{\pi^2}{3}
\tag{54}
\]
Denoting $C\left(\Theta^*,\{\mu^c_i\}\right) = \tilde C\left(\Theta^*,\{\mu^c_i\}\right) + \sum_{i\ne a^*}\frac{\pi^2+3}{3}\left(\Delta_i+\mu\left(1;\theta^*_{a^*}\right)\right)$, and plugging back into the upper bound on the regret, we achieve the stated result.
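The indicator event in Eqs. (50)–(51) compares per-arm indices of the form "estimated constant part + detected rotting part + padding". The sketch below spells out one such index for the parametric family of Example 4.1; the Hoeffding-style padding is an assumption standing in for the paper's Eq. (14), and the function names are illustrative.

```python
import numpy as np

# Sketch of the index compared in Eqs. (50)-(51): once the rotting models have
# been detected, each arm is scored by an estimate of its constant part, plus
# the (now known) rotting part of its next pull, plus a UCB padding c_{t,s}.
# Both the parametric form n**(-theta) and the padding sqrt(4*sigma^2*ln(t)/s)
# are assumptions of this sketch, not quotations of the paper.
def ucb_index(rewards_i, theta_hat_i, t, sigma2=1.0):
    s = len(rewards_i)                               # pulls of arm i so far
    pulls = np.arange(1, s + 1, dtype=float)
    # estimate the constant part by removing the detected rotting part
    mu_c_hat = float(np.mean(np.asarray(rewards_i) - pulls ** -theta_hat_i))
    padding = np.sqrt(4.0 * sigma2 * np.log(t) / s)
    return mu_c_hat + (s + 1) ** -theta_hat_i + padding

# at time t, a D-CTO_UCB-style step pulls argmax_i ucb_index(history[i], theta_hat[i], t)
```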
E Example 4.1

Next, we show an example for which the different assumptions hold: the case where the reward of arm $i$ for its $n$-th pull is distributed as $\mathcal{N}\left(\mu^c_i + n^{-\theta^*_i},\ \sigma^2\right)$, where $\theta^*_i\in\Theta=\{\theta_1,\theta_2,\ldots,\theta_M\}$ and $\forall\theta\in\Theta:\ 0.1\le\theta\le 0.4$.

E.1 Assumption 4.1
The assumption is given by: $\mu(n;\theta)$ is positive, non-increasing in $n$, and $\mu(n;\theta)\in o(1)$, $\forall\theta\in\Theta$, where $\Theta$ is a discrete known set. Indeed, for any $\theta\in\{\theta_1,\theta_2,\ldots,\theta_M\}$, which is a discrete known set with $0.1\le\theta\le 0.4$, we have $n^{-\theta}>0$ for all $n\ge 1$. Moreover, $\frac{\partial n^{-\theta}}{\partial n} = -\theta n^{-\theta-1} < 0$ for all $n\ge 1$, and $n^{-\theta}\xrightarrow{n\to\infty}0$.

E.2 Assumption 4.2
The assumption is given by
\[
\mathrm{bal}\left(\max_{\theta_1\ne\theta_2\in\Theta}\left\{\mathrm{det}^{\star\downarrow}_{\theta_1,\theta_2}\!\left(\tfrac{1}{16}\ln^{-1}(\zeta)\right)\right\}\right) \in o(\zeta)
\tag{55}
\]
Without loss of generality, assume $\theta_1 > \theta_2$. For large enough $n$ we have
\[
\mathrm{det}_{\theta_1,\theta_2}(n)
= \frac{n\sigma^2}{\left(\sum_{j=1}^{n}j^{-\theta_1} - \sum_{j=1}^{n}j^{-\theta_2}\right)^2}
\le \frac{n\sigma^2}{\left(c_1 n^{1-\theta_2} - c_2 - c_3 n^{1-\theta_1}\right)^2}
= \frac{n\sigma^2}{c_1^2 n^{2-2\theta_2} + c_3^2 n^{2-2\theta_1} - 2c_1 c_3 n^{2-\theta_1-\theta_2} - 2c_1 c_2 n^{1-\theta_2} + 2c_2 c_3 n^{1-\theta_1} + c_2^2}
\le \frac{n\sigma^2}{\tilde c\, n^{2-2\theta_2}}
= \bar c\, n^{2\theta_2-1}
\tag{56}
\]
where $\{c_1,c_2,c_3,\tilde c,\bar c\}$ are positive constants (independent of $n$). The first inequality holds by bounding the sums by integrals, keeping in mind that $\theta_1>\theta_2$ combined with $0.1\le\theta\le 0.4$. The second inequality holds for large enough $n$ (a threshold that depends only on $\{\theta_1,\theta_2\}$, but is finite).

Next, we have
\[
\bar c\, n^{2\theta_2-1} < \tfrac{1}{16}\ln^{-1}(\zeta) \iff n > \left(16\bar c\ln(\zeta)\right)^{\frac{1}{1-2\theta_2}}
\tag{57}
\]
Meaning that, for $\zeta$ large enough (noting that $\tfrac{1}{1-2\theta_2}\le 5$ since $\theta_2\le 0.4$),
\[
\max_{\theta_1\ne\theta_2\in\Theta}\left\{\mathrm{det}^{\star\downarrow}_{\theta_1,\theta_2}\!\left(\tfrac{1}{16}\ln^{-1}(\zeta)\right)\right\} < \left(16\bar c\ln(\zeta)\right)^{5}
\tag{58}
\]
Next, we have
\[
\alpha^{-0.1} \le x^{-0.4} \implies \alpha \ge x^{4}
\tag{59}
\]
Hence, $\mathrm{bal}(x) = x^{4}$.
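As a numeric sanity check of the assumption for this example, the following sketch inverts $\mathrm{det}_{\theta_1,\theta_2}(\cdot)$ on a grid, applies $\mathrm{bal}(x)=x^4$, and verifies that the result grows only poly-logarithmically in $\zeta$. The reduced two-model grid, the numeric inversion, and the constant $\tfrac{1}{16}$ follow the reconstruction above and should be read as illustrative assumptions, not the paper's code.

```python
import numpy as np

# Check that bal( max_pairs det_star_down( (1/16) ln^{-1}(zeta) ) ) is o(zeta):
# the text bounds it by a constant times ln(zeta)**20 for this example.
THETAS = [0.1, 0.4]
SIGMA2 = 1.0
N_MAX = 100_000
J = np.arange(1, N_MAX + 1, dtype=float)
CUMS = {th: np.cumsum(J ** -th) for th in THETAS}

def det_star_down(eps):
    """Max over model pairs of the largest n with det_{t1,t2}(n) >= eps."""
    worst = 0
    for t1 in THETAS:
        for t2 in THETAS:
            if t1 == t2:
                continue
            gap = CUMS[t1] - CUMS[t2]
            with np.errstate(divide="ignore"):
                det = np.where(gap == 0.0, np.inf, J * SIGMA2 / gap ** 2)
            hard = np.nonzero(det >= eps)[0]
            if hard.size:
                worst = max(worst, int(hard[-1]) + 1)
    return worst

for zeta in [1e3, 1e6, 1e9, 1e12, 1e15, 1e18]:
    x = det_star_down(1.0 / (16.0 * np.log(zeta)))
    # the ratio bal(x)/zeta shrinks toward 0 as zeta grows
    print(f"zeta={zeta:.0e}  n={x}  bal(n)/zeta={x ** 4 / zeta:.3g}")
```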
Since $\mathrm{bal}(\cdot)$ is monotonically increasing, we have that for $\zeta$ large enough,
\[
\mathrm{bal}\left(\max_{\theta_1\ne\theta_2\in\Theta}\left\{\mathrm{det}^{\star\downarrow}_{\theta_1,\theta_2}\!\left(\tfrac{1}{16}\ln^{-1}(\zeta)\right)\right\}\right) < \hat c\,\ln^{20}(\zeta)
\tag{60}
\]
where $\hat c$ is a positive constant (independent of $\zeta$). Finally, we note that
\[
\lim_{\zeta\to\infty}\frac{\ln^{20}(\zeta)}{\zeta} = 0
\tag{61}
\]
Thus we infer that the assumption holds.

E.3 Assumption 4.3

The assumption is given by
\[
\max_{\theta_1\ne\theta_2\in\Theta}\left\{\mathrm{Ddet}^{\star\downarrow}_{\theta_1,\theta_2}(\epsilon)\right\} \le B(\epsilon) < \infty,\quad\forall\epsilon>0
\tag{62}
\]
Without loss of generality, assume $\theta_1 > \theta_2$. For large enough $n$ we have
\[
\mathrm{Ddet}_{\theta_1,\theta_2}(n)
= \frac{n\sigma^2}{\left[\left(\sum_{j=1}^{\lfloor n/2\rfloor}j^{-\theta_1} - \sum_{j=\lfloor n/2\rfloor+1}^{n}j^{-\theta_1}\right) - \left(\sum_{j=1}^{\lfloor n/2\rfloor}j^{-\theta_2} - \sum_{j=\lfloor n/2\rfloor+1}^{n}j^{-\theta_2}\right)\right]^2}
\le \frac{n\sigma^2}{\left(c_1 n^{1-\theta_2} - c_2 n^{1-\theta_1} - c_3\right)^2}
\le \frac{n\sigma^2}{\tilde c\, n^{2-2\theta_2}}
= \bar c\, n^{2\theta_2-1}
\tag{63}
\]
where the first inequality holds by bounding the sums by integrals, and $\{c_1,c_2,c_3,\tilde c,\bar c\}$ are positive constants (independent of $n$