Optimal Strategies for Graph-Structured Bandits

Abstract

We study a structured variant of the multi-armed bandit problem specified by a set of Bernoulli distributions $\nu = (\nu_{a,b})_{a \in \mathcal{A}, b \in \mathcal{B}}$ with means $(\mu_{a,b})_{a \in \mathcal{A}, b \in \mathcal{B}} \in [0,1]^{\mathcal{A} \times \mathcal{B}}$ and by a given weight matrix $\omega = (\omega_{b,b'})_{b,b' \in \mathcal{B}}$, where $\mathcal{A}$ is a finite set of arms and $\mathcal{B}$ is a finite set of users. The weight matrix $\omega$ is such that for any two users $b, b' \in \mathcal{B}$, $\max_{a \in \mathcal{A}} |\mu_{a,b} - \mu_{a,b'}| \leq \omega_{b,b'}$. This formulation is flexible enough to capture various situations, from highly-structured scenarios ($\omega \in \{0,1\}^{\mathcal{B} \times \mathcal{B}}$) to fully unstructured setups ($\omega \equiv 1$). We consider two scenarios depending on whether the learner chooses only the actions to sample rewards from or both users and actions. We first derive problem-dependent lower bounds on the regret for this generic graph structure that involve a structure-dependent linear programming problem. Second, we adapt to this setting the Indexed Minimum Empirical Divergence (IMED) algorithm introduced by Honda and Takemura (2015), and introduce the IMED-GS$^\star$ algorithm. Interestingly, IMED-GS$^\star$ does not require computing the solution of the linear programming problem more than about $\log(T)$ times after $T$ steps, while being provably asymptotically optimal. Also, unlike existing bandit strategies designed for other popular structures, IMED-GS$^\star$ does not resort to an explicit forced exploration scheme and only makes use of local counts of empirical events. We finally provide numerical illustrations of our results that confirm the performance of IMED-GS$^\star$.

Hassan Saber

HASSAN.SABER@INRIA.FR, SequeL Research Group, Inria Lille-Nord Europe & CRIStAL, Villeneuve-d'Ascq, Parc scientifique de la Haute-Borne, France

Pierre Ménard

PIERRE.MENARD@INRIA.FR, SequeL Research Group, Inria Lille-Nord Europe & CRIStAL, Villeneuve-d'Ascq, Parc scientifique de la Haute-Borne, France

Odalric-Ambrym Maillard

ODALRIC.MAILLARD@INRIA.FR, SequeL Research Group, Inria Lille-Nord Europe & CRIStAL, Villeneuve-d'Ascq, Parc scientifique de la Haute-Borne, France

Keywords: Graph-structured stochastic bandits, regret analysis, asymptotic optimality, Indexed Minimum Empirical Divergence (IMED) algorithm.

1. Introduction

The multi-armed bandit problem is a popular framework to formalize sequential decision making problems. It was first introduced in the context of medical trials (Thompson, 1933, 1935) and later formalized by Robbins (1952). In this paper, we consider a contextual and structured variant of the problem, specified by a set of distributions $\nu = (\nu_{a,b})_{a \in \mathcal{A}, b \in \mathcal{B}}$ with means $(\mu_{a,b})_{a \in \mathcal{A}, b \in \mathcal{B}}$, where $\mathcal{A}$ is a finite set of arms and $\mathcal{B}$ is a finite set of users. Such $\nu$ is called a (bandit) configuration where each $\nu_b = (\nu_{a,b})_{a \in \mathcal{A}}$ can be seen as a classical multi-armed bandit problem. The streaming protocol is the following: at each time $t \geq 1$, the learner deals with a user $b_t \in \mathcal{B}$ and chooses an arm $a_t \in \mathcal{A}$, based only on the past. We consider two scenarios: either the sequence of users is deterministic (uncontrolled scenario) or the learner has the possibility to choose the user (controlled scenario), see Section 1.1. The learner then receives and observes a reward $X_t$ sampled according to $\nu_{a_t,b_t}$, conditionally independent from the past. We assume binary rewards: each $\nu_{a,b}$ is a Bernoulli distribution $\mathrm{Bern}(\mu_{a,b})$ with mean $\mu_{a,b} \in (0,1)$ and we denote by $\mathcal{D}$ the set of such configurations. The goal of the learner is then to maximize its expected cumulative reward over $T$ rounds, or equivalently minimize the regret given by

$$R(\nu, T) = \mathbb{E}_\nu\left[ \sum_{t=1}^{T} \max_{a \in \mathcal{A}} \mu_{a,b_t} - X_t \right].$$

For this problem one can run, for example, a separate instance of a bandit algorithm for each user $b$, but we would like to exploit a known structure among the users (which we detail below).

Unstructured bandits

The classical bandit problem (when $|\mathcal{B}| = 1$) received increased attention in the middle of the 20th century. The seminal paper of Lai and Robbins (1985) established the first lower bounds on the cumulative regret, showing that designing a strategy that is optimal uniformly over a given set of configurations $\mathcal{D}$ comes with a price. The study of the lower performance bounds in multi-armed bandits successfully led to the development of asymptotically optimal strategies for specific configuration sets, such as the KL-UCB strategy (Lai, 1987; Cappé et al., 2013; Maillard, 2018) for exponential families, or alternatively the DMED and IMED strategies from Honda and Takemura (2011, 2015). The lower bounds from Lai and Robbins (1985), later extended by Burnetas and Katehakis (1997), did not cover all possible configurations, and in particular structured configuration sets were not handled until Agrawal et al. (1989) and then Graves and Lai (1997) established generic lower bounds. Here, structure refers to the fact that pulling an arm may reveal information that helps refine the estimation of other arms. Unfortunately, designing strategies that are provably optimal without incurring a high computational complexity remains a challenge for many structures.

Structured configurations

Motivated by the growing popularity of bandits in a number of industrial and societal application domains, the study of structured configuration sets has received increasing attention over the last few years. The linear bandit problem is one typical illustration (Abbasi-Yadkori et al., 2011; Srinivas et al., 2010; Durand et al., 2017), for which the linear structure considerably modifies the achievable lower bound, see Lattimore and Szepesvari (2017). The study of a unimodal structure naturally appears in the context of wireless communications, and has been considered in Combes and Proutiere (2014) from a bandit perspective, providing an explicit lower bound together with a strategy exploiting this structure. Other structures include Lipschitz bandits (Magureanu et al., 2014), and we refer to the manuscript Magureanu (2018) for other examples, such as cascading bandits that are useful in the context of recommender systems. Combes et al. (2017) introduced a generic strategy called OSSB (Optimal Structured Stochastic Bandit), paving the way towards generic multi-armed bandit strategies that are adaptive to a given structure.

Graph-structure

In this paper, we consider the following structure: for a given weight matrix $\omega = (\omega_{b,b'})_{b,b' \in \mathcal{B}} \in [0,1]^{\mathcal{B} \times \mathcal{B}}$ inducing a metric on $\mathcal{B}$, we assume that for any two users $b, b' \in \mathcal{B}$, $\|\mu_b - \mu_{b'}\|_\infty := \max_{a \in \mathcal{A}} |\mu_{a,b} - \mu_{a,b'}| \leq \omega_{b,b'}$. We see the matrix $\omega$ as an adjacency matrix of a fully connected weighted graph where each vertex represents a user and each weight $\omega_{b,b'}$ measures proximity between two users, hence we call this a "graph structure". The motivation to study such a structure is two-fold. On the one hand, in view of paving the way to solving generic structured bandits, the graph structure yields nicely interpretable lower bounds that show how $\omega$ effectively modifies the achievable optimal regret and suggests a natural strategy, while being flexible enough to interpolate between a fully unstructured and a highly structured setup. On the other hand, multi-armed bandits have been extensively applied to recommender systems: in such systems it is natural to assume that users may not react arbitrarily differently from each other, but that two users that are "close" in some sense will also react similarly when presented with the same item (action). Now, the similarity between any two users may be loosely or accurately known (by studying for instance activities of users on various social networks and refining this knowledge once in a while): the weight matrix $\omega$ makes it possible to summarize such imprecise knowledge. Indeed, $\omega_{b,b'} = 0$ means that two users behave identically, while $\omega_{b,b'} = 1$ is not informative on the true similarity $\|\mu_b - \mu_{b'}\|_\infty$, which can be anything from arbitrarily small to 1. Hence, studying this structure is motivated both by a theoretical challenge and by more applied considerations. To our knowledge this is the first work on graph structure. Other structured problems such as Clustered bandits (Gentile et al., 2014), Latent bandits (Maillard and Mannor, 2014), or Spectral bandits (Valko et al., 2014) do not deal with this particular setting.

Goal

The primary goal of this paper is to build a provably optimal strategy for this flexible notion of structure. To do so, we derive lower bounds and use them to build intuition on how to handle structure, which enables us to establish a novel bandit strategy that we prove to be optimal. Although specialized to this structure, the mechanisms leading to the strategy and introduced in the proof technique are novel and are of independent interest.

Outline and contributions

We formally introduce the graph-structure model in Section 1.2. The graph structure is simple, yet it interpolates between a fully unstructured case and highly-structured settings such as clustered bandits (see Figure 1): this makes it a convenient setting to study structured multi-armed bandits. In Section 2, we first establish in Proposition 5 a lower bound on the asymptotic number of times a sub-optimal couple must be pulled by any consistent strategy (see Definition 3), together with its corresponding lower bound on the regret (see Corollary 8) involving an optimization problem. In Section 3, we revisit the Indexed Minimum Empirical Divergence (IMED) strategy from Honda and Takemura (2011) introduced for unstructured multi-armed bandits, and adapt it to the graph-structured setting, making use of the lower bounds of Section 2. The resulting strategy is called IMED-GS in the controlled scenario and IMED-GS$_2$ in the uncontrolled scenario. Our analysis reveals that, in view of asymptotic optimality, these strategies may still not optimally exploit the graph structure in order to trade off information gathering and low regret. In order to address this difficulty, we introduce the modified IMED-GS$^\star$ strategy for the controlled scenario (and IMED-GS$^\star_2$ in the uncontrolled one). We show in Theorem 11, which is the main result of this paper, that both IMED-GS$^\star$ and IMED-GS$^\star_2$ are asymptotically optimal consistent strategies. Interestingly, IMED-GS$^\star$ does not compute a solution to the optimization problem appearing in the lower bound at each time step, unlike for instance OSSB introduced for generic structures, but only about $\log(T)$ times after $T$ steps. Also, although forced exploration does not seem to be avoidable for this problem, IMED-GS$^\star$ does not make use of an explicit forced exploration scheme but a more implicit one, based on local counters of empirical events. To our knowledge, IMED-GS$^\star$ is the first strategy with such properties, in the context of a structure requiring to solve an optimization problem, that is provably asymptotically optimal. On a broader perspective, we believe the mechanism used in IMED-GS$^\star$ as well as the proof techniques could be extended beyond the considered graph structure, thus opening promising perspectives for building structure-adaptive optimal strategies for generic structures. Last, we provide in Section 4 numerical illustrations on synthetic data. They show that IMED-GS$^\star$ is also numerically efficient in practice, both in terms of regret minimization and computation time; this contrasts with some bandit strategies introduced for other structures (as in Combes et al. (2017), Lattimore and Szepesvari (2017)) that in practice suffer from a prohibitive burn-in phase.

Let us recall that the goal of the learner is to maximize its expected cumulative reward over $T$ rounds, or equivalently minimize the regret given by

$$R(\nu, T) = \mathbb{E}_\nu\left[ \sum_{t=1}^{T} \max_{a \in \mathcal{A}} \mu_{a,b_t} - X_t \right].$$

As mentioned, for this problem one can run, for example, a separate instance of a bandit algorithm for each user $b$, but we would like to exploit the graph structure. We consider two typical scenarios.

Uncontrolled scenario

The sequence of users $(b_t)_{t \geq 1}$ is assumed deterministic and does not depend on the strategy of the learner. At each time step $t \geq 1$, the user $b_t$ is revealed to the learner.

Controlled scenario

The sequence of users $(b_t)_{t \geq 1}$ is strategy-dependent and at each time step $t \geq 1$, the learner has to choose a user $b_t$ to deal with, based only on the past.

Both scenarios are motivated by practical considerations: the uncontrolled scenario is the most common setup for recommender systems, while the controlled scenario is more natural in case the learner interacts actively with available users, as in advertisement campaigns. In an uncontrolled scenario, the frequencies of user arrivals are imposed and may be arbitrary, while in a controlled scenario all users are available and the learner has to deal with them with similar frequency (even if this means considering a subset of users). We formalize the notion of frequency in the following definition.

Definition 1 (Log-frequency of a user)

A sequence of users $(b_t)_{t \geq 1}$ has log-frequencies $\beta \in [0,1]^{\mathcal{B}}$ if, almost surely, the number of times the learner has dealt with user $b \in \mathcal{B}$ is $N_b(T) = \Theta\big(T^{\beta_b}\big)$.¹ In this case, almost surely we have

$$\forall b \in \mathcal{B}, \quad \lim_{T \to \infty} \frac{\log(N_b(T))}{\log(T)} = \beta_b.$$

1. We say that $u_T = \Theta(v_T)$ if the two sequences $u_T$ and $v_T$ are equivalent.

In an uncontrolled scenario, we assume that the sequence of users $(b_t)_{t \geq 1}$ has positive log-frequencies $\beta \in (0,1]^{\mathcal{B}}$, with $\beta$ unknown to the learner. In a controlled scenario, we focus only on strategies that induce sequences of users with the same log-frequencies, hence all equal to 1, independently of the considered configuration, that is, strategies such that, almost surely, $N_b(T) = \Theta(T)$ for all users $b \in \mathcal{B}$.

In this section, we introduce the graph structure. We assume that all bandit configurations $\nu$ belong to a set of the form

$$\overline{\mathcal{D}}_\omega := \left\{ \nu \in \mathcal{D} : \forall b, b' \in \mathcal{B}, \ \max_{a \in \mathcal{A}} |\mu_{a,b} - \mu_{a,b'}| \leq \omega_{b,b'} \right\},$$

where $\omega = (\omega_{b,b'})_{b,b' \in \mathcal{B}} \in [0,1]^{\mathcal{B} \times \mathcal{B}}$ is a weight matrix known to the learner. Intuitively, when the weights are close to 1, we expect no change to the agnostic situation. But, when the weights are close to $\|\mu_b - \mu_{b'}\|_\infty := \max_{a \in \mathcal{A}} |\mu_{a,b} - \mu_{a,b'}|$, we expect significantly lower achievable regret.

Remark 2

For the specific case where $\omega_{b,b'} = 0$, $\overline{\mathcal{D}}_\omega$ corresponds to users $b$ and $b'$ known to be perfectly clustered. The weight matrix given in Figure 1 models three smooth clusters of users. Each cluster is included in a ball of diameter $\alpha$ for the infinite norm $\|\cdot\|_\infty$.

In the sequel we assume the following properties on the weights.

Assumption 1 (Metric weight property)

The weight matrix $\omega$ satisfies:
- $\omega_{b,b} = 0$ and $\omega_{b,b'} > 0$ for all $b \neq b' \in \mathcal{B}$,
- $\omega_{b,b'} = \omega_{b',b}$ and $\omega_{b,b'} \leq \omega_{b,b''} + \omega_{b'',b'}$ for all $b, b', b'' \in \mathcal{B}$.

This comes without loss of generality. For the first property, if two users share exactly the same distribution we can see them as one unique user. For the second property, considering $\tilde{\omega}_{b,b'} = \sup_{a \in \mathcal{A}, \nu \in \overline{\mathcal{D}}_\omega} |\mu_{a,b} - \mu_{a,b'}|$ leads to the same set of configurations $\overline{\mathcal{D}}_\omega = \overline{\mathcal{D}}_{\tilde{\omega}}$ and it holds that $\tilde{\omega}_{b,b'} = \tilde{\omega}_{b',b}$ and $\tilde{\omega}_{b,b'} \leq \tilde{\omega}_{b,b''} + \tilde{\omega}_{b'',b'}$. Such a weight matrix $\omega$ naturally induces a metric on $\mathcal{B}$.

[Figure 1: A cluster structure. Left: weight matrix of three clusters (zero diagonal, weights $\alpha$ within a cluster, weights 1 across clusters). Right: range of two-armed bandit problems included in clusters with centers $(\nu_1, \nu_2, \nu_3)$ for various $\alpha$ (the larger $\alpha$, the lighter and larger the box). The value $\alpha = 0$ corresponds to perfect clusters.]

Let $\mu^\star_b = \max_{a \in \mathcal{A}} \mu_{a,b}$ denote the optimal mean for user $b$ and $\mathcal{A}^\star_b = \operatorname{argmax}_{a \in \mathcal{A}} \mu_{a,b}$ the set of optimal arms for this user. We define for a couple $(a,b) \in \mathcal{A} \times \mathcal{B}$ its gap $\Delta_{a,b} = \mu^\star_b - \mu_{a,b}$. Thus a couple is optimal if its gap is equal to zero and sub-optimal if it is positive. We denote by $\mathcal{O}^\star = \{ (a,b) \in \mathcal{A} \times \mathcal{B} : \mu_{a,b} = \mu^\star_b \}$ the set of optimal couples. Thanks to the chain rule we can rewrite the regret as follows:

$$R(\nu, T) = \sum_{(a,b) \in \mathcal{A} \times \mathcal{B}} \Delta_{a,b} \, \mathbb{E}_\nu\big[ N_{a,b}(T) \big], \quad \text{where} \quad N_{a,b}(t) = \sum_{s=1}^{t} \mathbb{I}\{ (a_s, b_s) = (a,b) \}$$

is the number of pulls of arm $a$ and user $b$ up to time $t$.

2. Regret Lower bound

In this section, we establish lower bounds on the regret for the structure $\overline{\mathcal{D}}_\omega$. In order to obtain non-trivial lower bounds we consider, as in the classical bandit problem, strategies that are consistent (uniformly good) on $\overline{\mathcal{D}}_\omega$.

Definition 3 (Consistent strategy)

A strategy is consistent on $\overline{\mathcal{D}}_\omega$ if for all configurations $\nu \in \overline{\mathcal{D}}_\omega$, for all sub-optimal couples $(a,b)$, for all $\alpha > 0$,

$$\lim_{T \to \infty} \mathbb{E}_\nu\left[ \frac{N_{a,b}(T)}{N_b(T)^\alpha} \right] = 0.$$

Remark 4

When $\mathcal{B} = \{b\}$, $N_b(T) = T$ and we recover the usual notion of consistency (Lai and Robbins, 1985).

Before we provide the lower bound on the cumulative regret, let us give some intuition. To that end, we fix a configuration $\nu \in \overline{\mathcal{D}}_\omega$ and a sub-optimal couple $(a,b)$. One key observation is that if for all $b' \neq b$ it holds that $\mu^\star_b - \mu_{a,b'} < \omega_{b,b'}$, then we can form an environment $\tilde{\nu} \in \overline{\mathcal{D}}_\omega$ such that $\tilde{\mu}_{a',b'} = \mu_{a',b'}$ for all couples $(a',b')$ except $(a,b)$, and such that $\tilde{\mu}_{a,b}$ satisfies $\mu^\star_b < \tilde{\mu}_{a,b} < \mu_{a,b'} + \omega_{b,b'}$. Indeed, in this novel environment, $\tilde{\mu}_{a,b} - \tilde{\mu}_{a,b'} < \omega_{b,b'}$ still holds but $(a,b)$ is now optimal. Hence, we can transform the sub-optimal couple $(a,b)$ into an optimal one without moving the means of the other users. Thanks to this remarkable property, and introducing $\mathrm{kl}(\mu | \mu')$ to denote the Kullback-Leibler divergence between two Bernoulli distributions $\mathrm{Bern}(\mu)$ and $\mathrm{Bern}(\mu')$ with the usual conventions, one can then prove that for all consistent strategies

$$\liminf_{T \to \infty} \mathbb{E}_\nu\left[ \frac{N_{a,b}(T)}{\log(N_b(T))} \right] \geq \frac{1}{\mathrm{kl}(\mu_{a,b} | \mu^\star_b)},$$

which is the lower bound that we get without graph structure. This suggests that only the users $b'$ such that $\mu^\star_b - \mu_{a,b'} > \omega_{b,b'}$ provide information about the behavior of user $b$. This justifies introducing for each couple $(a,b)$ the fundamental set

$$\mathcal{B}_{a,b} := \big\{ b' \in \mathcal{B} : \mu_{a,b'} < \mu^\star_b - \omega_{b,b'} \big\}.$$

It is also convenient to introduce its frontier, denoted $\partial \mathcal{B}_{a,b} := \big\{ b' \in \mathcal{B} : \mu_{a,b'} = \mu^\star_b - \omega_{b,b'} \big\}$.
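As an illustration of the two objects just introduced, the following Python sketch computes the Bernoulli divergence $\mathrm{kl}(\mu | \mu')$ and the set $\mathcal{B}_{a,b}$ on a toy configuration. The function names (`kl_bernoulli`, `informative_users`), the layout `mu[a][b]`, and all numerical values are illustrative assumptions and are not taken from the paper.

```python
import math


def kl_bernoulli(p, q):
    """kl(p | q) between Bern(p) and Bern(q), with the convention 0 * log(0) = 0."""
    if q <= 0.0 or q >= 1.0:
        # Degenerate second argument: finite only if both are the same point mass.
        return 0.0 if p == q else math.inf
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def informative_users(mu, omega, a, b):
    """Return the set B_{a,b} = {b' : mu[a][b'] < mu_star_b - omega[b][b']} as a list."""
    mu_star_b = max(mu[arm][b] for arm in range(len(mu)))
    return [bp for bp in range(len(omega)) if mu[a][bp] < mu_star_b - omega[b][bp]]


# Toy configuration (hypothetical values): 2 arms (rows) and 2 users (columns).
mu = [[0.80, 0.75],
      [0.30, 0.35]]
omega_structured = [[0.0, 0.1], [0.1, 0.0]]  # users known to be close
omega_agnostic = [[0.0, 1.0], [1.0, 0.0]]    # no usable structure

# Users that are informative about the sub-optimal couple (a, b) = (1, 0).
close_case = informative_users(mu, omega_structured, 1, 0)
agnostic_case = informative_users(mu, omega_agnostic, 1, 0)

# Unstructured bound on pulls of (1, 0), in units of log(N_b(T)).
unstructured_rate = 1.0 / kl_bernoulli(mu[1][0], mu[0][0])
```

With the close weights both users are informative about the couple $(1, 0)$, whereas with weights equal to 1 only user 0 itself remains, which is the agnostic situation described above.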
Now, in order to report the lower bounds while avoiding tedious technicalities, we slightly restrict the set $\overline{\mathcal{D}}_\omega$. To this end, we introduce the set

$$\mathcal{D}_\omega := \big\{ \nu \in \overline{\mathcal{D}}_\omega : \forall (a,b) \in \mathcal{A} \times \mathcal{B}, \ \partial \mathcal{B}_{a,b} = \emptyset \big\}.$$

This definition is justified since the closure of $\mathcal{D}_\omega$ is indeed $\overline{\mathcal{D}}_\omega$ (we only remove from $\overline{\mathcal{D}}_\omega$ sets of empty interior). We can now state the following proposition.

Proposition 5 (Graph-structured lower bounds on pulls)

Let us consider a consistent strategy. Then, for all configurations $\nu \in \mathcal{D}_\omega$, almost surely it holds for all sub-optimal couples $(a,b) \notin \mathcal{O}^\star$,

$$\lim_{T \to \infty} N_b(T) < +\infty \quad \text{or} \quad \liminf_{T \to \infty} \frac{1}{\log(N_b(T))} \sum_{b' \in \mathcal{B}_{a,b}} \mathrm{kl}\big( \mu_{a,b'} \,\big|\, \mu^\star_b - \omega_{b,b'} \big) N_{a,b'}(T) \geq 1. \tag{1}$$

We then introduce the notion of Pareto-optimality based on the lower bounds given in Proposition 5.

Definition 6 (Pareto-optimality)

A strategy is asymptotically Pareto-optimal if for all $\nu \in \mathcal{D}_\omega$,

$$\forall a \in \mathcal{A}, \quad \limsup_{T \to \infty} \min_{b : (a,b) \notin \mathcal{O}^\star} \frac{1}{\log(N_b(T))} \sum_{b' \in \mathcal{B}_{a,b}} \mathrm{kl}\big( \mu_{a,b'} \,\big|\, \mu^\star_b - \omega_{b,b'} \big) N_{a,b'}(T) \leq 1,$$

with the convention $\min \emptyset = -\infty$.

Remark 7

This proposition reveals that the set $\mathcal{B}_{a,b} = \big\{ b' \in \mathcal{B} : \mu_{a,b'} < \mu^\star_b - \omega_{b,b'} \big\}$ plays a crucial role in the graph structure. The definition of $\mathcal{D}_\omega$ excludes specific situations where there exist $b, b' \in \mathcal{B}$ and $a \in \mathcal{A}$ such that $\omega_{b,b'} = \mu^\star_b - \mu_{a,b'} = \Delta_{a,b} + \mu_{a,b} - \mu_{a,b'}$, which belong to the closed set $\overline{\mathcal{D}}_\omega$. Extending the result to $\overline{\mathcal{D}}_\omega$ seems possible but at the price of clarity due to the need to handle degenerate cases.

In order to derive an asymptotic lower bound on the regret from these asymptotic lower bounds, we have to characterize the growth of the counts $(N_b(\cdot))_{b \in \mathcal{B}}$.

Corollary 8 (Lower bounds on the regret)

Let us consider a consistent strategy and sequences of users with log-frequencies $\beta \in (0,1]^{\mathcal{B}}$ independently of the considered configuration in $\mathcal{D}_\omega$. Then, for all configurations $\nu \in \mathcal{D}_\omega$,

$$\liminf_{T \to \infty} \frac{R(\nu, T)}{\log(T)} \geq C^\star_\omega(\beta, \nu) := \min\left\{ \sum_{(a,b) \notin \mathcal{O}^\star} \Delta_{a,b} \, n_{a,b} \ : \ n \in \mathbb{R}_+^{\mathcal{A} \times \mathcal{B}} \ \text{s.t.} \ \forall (a,b) \notin \mathcal{O}^\star, \ \sum_{b' \in \mathcal{B}_{a,b}} \mathrm{kl}\big( \mu_{a,b'} \,\big|\, \mu^\star_b - \omega_{b,b'} \big) n_{a,b'} \geq \beta_b \right\}. \tag{2}$$

Hence such a strategy is asymptotically optimal if for all $\nu \in \mathcal{D}_\omega$,

$$\limsup_{T \to \infty} \frac{R(\nu, T)}{\log(T)} \leq C^\star_\omega(\beta, \nu).$$

Remark 9

In the previous corollary, log-frequencies $\beta$ may be either strategy-dependent or strategy-independent. In an uncontrolled scenario, $\beta$ is imposed by the setting and does not depend on the followed strategy, while in a controlled scenario we consider strategies that impose $\beta = 1_{\mathcal{B}} := (1)_{b \in \mathcal{B}}$.

Like other structured bandit problems (as in Combes et al. (2017), Lattimore and Szepesvari (2017)), this lower bound is characterized by a problem-dependent constant $C^\star_\omega(\beta, \nu)$, solution to an optimization problem. In the agnostic case we recover the lower bound of the classical multi-armed bandit problem. Indeed, let us introduce for $\alpha \in [0,1]$ the weight matrix $\omega_\alpha$ where all the weights are equal to $\alpha$ (except for the zero diagonal). $\omega_\alpha$ is the same weight matrix as in Figure 1 but only for one cluster. Then when there is no structure ($\omega \equiv \omega_1$), we obtain the explicit constant

$$C^\star_{\omega_1}(\beta, \nu) = \sum_{b \in \mathcal{B}} \beta_b \sum_{a \in \mathcal{A} : (a,b) \notin \mathcal{O}^\star} \frac{\Delta_{a,b}}{\mathrm{kl}(\mu_{a,b} | \mu^\star_b)}, \tag{3}$$

which corresponds to solving $|\mathcal{B}|$ bandit problems in parallel (independently of one another). Thus the graph structure allows us to interpolate smoothly between $|\mathcal{B}|$ independent bandit problems and a unique one when all the users share the same distributions. In order to illustrate the gain of information due to the graph structure we plot in Figure 2 the expectation $\mathbb{E}_{\nu \sim \mathcal{U}(\mathcal{D}_{\omega_\alpha})}\big[ C^\star_{\omega_\alpha}(1_{\mathcal{B}}, \nu) / C^\star_{\omega_1}(1_{\mathcal{B}}, \nu) \big]$ of the ratio between the constant in the structured case (2) and that in the agnostic case (3), where $\mathcal{U}(\mathcal{D}_{\omega_\alpha})$ denotes the uniform distribution over $\mathcal{D}_{\omega_\alpha}$, $\alpha \in [0,1]$.

[Figure 2: Plot of $\alpha \mapsto \mathbb{E}_{\nu \sim \mathcal{U}(\mathcal{D}_{\omega_\alpha})}\big[ C^\star_{\omega_\alpha}(1_{\mathcal{B}}, \nu) / C^\star_{\omega_1}(1_{\mathcal{B}}, \nu) \big]$, where $\omega_\alpha$ is a matrix where all the weights are equal to $\alpha$ (except for the zero diagonal) and $\nu$ is sampled uniformly at random in $\mathcal{D}_{\omega_\alpha}$.]

3. IMED type strategies for Graph-structured Bandits

In this section, we present, for both the controlled and uncontrolled scenarios, two strategies: IMED-GS$^\star$, which matches the asymptotic lower bound of Corollary 8, and IMED-GS, which has a lower computational complexity but weaker guarantees. Both are inspired by the Indexed Minimum Empirical Divergence (IMED) strategy proposed by Honda and Takemura (2011). The general idea behind this algorithm is to enforce, via a well-chosen index, the constraints (1) that appear in the optimization problem (2) of the asymptotic lower bound. These constraints intuitively serve as tests to assert whether or not a couple is optimal.
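To make these constraints concrete, here is a minimal Python sketch that checks whether a candidate allocation $n$ satisfies the constraints of problem (2), and that evaluates the closed-form constant (3) of the agnostic case. It does not solve the linear program; the helper names (`is_feasible`, `regret_rate`, `agnostic_allocation`) and the toy means and weights are assumptions made for illustration only.

```python
import math


def kl_bernoulli(p, q):
    """kl(p | q) for Bernoulli distributions, assuming 0 < q < 1."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def best_mean(mu, b):
    return max(mu[a][b] for a in range(len(mu)))


def is_feasible(n, mu, omega, beta, tol=1e-9):
    """Check the constraints of (2) for every sub-optimal couple (a, b)."""
    for b in range(len(mu[0])):
        star = best_mean(mu, b)
        for a in range(len(mu)):
            if mu[a][b] == star:
                continue  # optimal couples carry no constraint
            lhs = sum(kl_bernoulli(mu[a][bp], star - omega[b][bp]) * n[a][bp]
                      for bp in range(len(mu[0]))
                      if mu[a][bp] < star - omega[b][bp])
            if lhs < beta[b] - tol:
                return False
    return True


def regret_rate(n, mu):
    """Objective of (2): sum of gap times allocation over all couples."""
    return sum((best_mean(mu, b) - mu[a][b]) * n[a][b]
               for a in range(len(mu)) for b in range(len(mu[0])))


def agnostic_allocation(mu, beta):
    """Allocation n[a][b] = beta_b / kl(mu_ab | mu_star_b) behind formula (3)."""
    n = [[0.0] * len(mu[0]) for _ in mu]
    for b in range(len(mu[0])):
        star = best_mean(mu, b)
        for a in range(len(mu)):
            if mu[a][b] < star:
                n[a][b] = beta[b] / kl_bernoulli(mu[a][b], star)
    return n


# Hypothetical toy problem: 2 arms (rows), 2 users (columns), beta = 1_B.
mu = [[0.80, 0.75], [0.30, 0.35]]
beta = [1.0, 1.0]
omega_structured = [[0.0, 0.1], [0.1, 0.0]]
omega_agnostic = [[0.0, 1.0], [1.0, 0.0]]

n_agnostic = agnostic_allocation(mu, beta)
agnostic_constant = regret_rate(n_agnostic, mu)           # value of (3)
n_scaled = [[0.75 * x for x in row] for row in n_agnostic]  # 25% fewer pulls
```

On this toy problem the scaled allocation violates the constraints when $\omega \equiv \omega_1$, but it satisfies them under the structured weights because pulls of one user also count toward the constraint of the other. This is the mechanism by which the structure can lower the constant $C^\star_\omega$.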

3.1 IMED type strategies for the controlled scenario

We consider the controlled scenario where the sequence of users $(b_t)_{t \geq 1}$ is strategy-dependent and at each time step $t \geq 1$, the learner has to choose a user $b_t$ and an arm $a_t$, based only on the past.

3.1.1 The IMED-GS strategy

We denote by $\hat{\mu}_{a,b}(t) = \frac{1}{N_{a,b}(t)} \sum_{s=1}^{t} \mathbb{I}\{ (a_s, b_s) = (a,b) \} X_s$ if $N_{a,b}(t) > 0$, and $0$ otherwise, the empirical mean of the rewards from couple $(a,b)$. Guided by the lower bound (1), we generalize the IMED index to take into account the graph structure as follows. For a couple $(a,b)$ and at time $t$ we define

$$I_{a,b}(t) = \begin{cases} \log\big( N_{a,b}(t) \big) & \text{if } (a,b) \in \hat{\mathcal{O}}^\star(t), \\ \sum_{b' \in \hat{\mathcal{B}}_{a,b}(t)} \mathrm{kl}\big( \hat{\mu}_{a,b'}(t) \,\big|\, \hat{\mu}^\star_b(t) - \omega_{b,b'} \big) N_{a,b'}(t) + \log\big( N_{a,b'}(t) \big) & \text{otherwise}, \end{cases} \tag{4}$$

where $\hat{\mu}^\star_b(t) = \max_{a \in \mathcal{A}} \hat{\mu}_{a,b}(t)$ is the current best mean for user $b$, the current set of optimal couples is

$$\hat{\mathcal{O}}^\star(t) := \big\{ (a,b) \in \mathcal{A} \times \mathcal{B} : \hat{\mu}_{a,b}(t) = \hat{\mu}^\star_b(t) \big\},$$

and the current set of informative users for an empirically sub-optimal couple $(a,b)$ is

$$\hat{\mathcal{B}}_{a,b}(t) := \big\{ b' \in \mathcal{B} : N_{a,b'}(t) > 0 \ \text{and} \ \hat{\mu}_{a,b'}(t) < \hat{\mu}^\star_b(t) - \omega_{b,b'} \big\}.$$

This quantity can be seen as a transportation cost for "moving"² a sub-optimal couple to an optimal one, plus exploration terms (the logarithms of the numbers of pulls). When an optimal couple is considered, the transportation cost is null and only the exploration part remains. Note that, as stated in Honda and Takemura (2011), $I_{a,b}(t)$ is an index in the weaker sense since it is not determined only by samples from the couple $(a,b)$ but also uses empirical means of current optimal arms. We define IMED-GS (Indexed Minimum Empirical Divergence for Graph Structure) to be the strategy consisting of pulling a couple with minimum index, see Algorithm 1. It works well in practice, see Section 4, and has a low computational complexity (proportional to the number of couples).
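The index (4) and the selection rule of IMED-GS can be sketched in a few lines of Python. This is only an illustrative reading of the definitions above: the function names, the layout `mu_hat[a][b]` and `counts[a][b]`, the convention $\log(0) = -\infty$ for unpulled couples, and the tie-breaking rule (first minimizer) are assumptions, and the numbers are made up.

```python
import math


def kl_bernoulli(p, q):
    """kl(p | q) for Bernoulli distributions; infinite if q is degenerate and p != q."""
    if q <= 0.0 or q >= 1.0:
        return 0.0 if p == q else math.inf
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def log_count(count):
    # Assumed convention: an unpulled couple has exploration term -infinity.
    return math.log(count) if count > 0 else -math.inf


def imed_gs_indexes(mu_hat, counts, omega):
    """Compute I[a][b] following Eq. (4) from empirical means and pull counts."""
    n_arms, n_users = len(mu_hat), len(mu_hat[0])
    index = [[0.0] * n_users for _ in range(n_arms)]
    for b in range(n_users):
        best = max(mu_hat[a][b] for a in range(n_arms))
        for a in range(n_arms):
            if mu_hat[a][b] == best:
                # Empirically optimal couple: only the exploration term remains.
                index[a][b] = log_count(counts[a][b])
                continue
            total = 0.0
            for bp in range(n_users):
                target = best - omega[b][bp]
                # Sum over the empirical informative users of (a, b).
                if counts[a][bp] > 0 and mu_hat[a][bp] < target:
                    total += (kl_bernoulli(mu_hat[a][bp], target) * counts[a][bp]
                              + math.log(counts[a][bp]))
            index[a][b] = total
    return index


def select_couple(index):
    """Return a couple (a, b) with minimum index (first one in case of ties)."""
    couples = [(a, b) for a in range(len(index)) for b in range(len(index[0]))]
    return min(couples, key=lambda ab: index[ab[0]][ab[1]])


# Hypothetical statistics: 2 arms (rows), 2 users (columns).
mu_hat = [[0.80, 0.75], [0.30, 0.35]]
omega = [[0.0, 0.1], [0.1, 0.0]]
well_explored = select_couple(imed_gs_indexes(mu_hat, [[50, 40], [5, 5]], omega))
under_explored = select_couple(imed_gs_indexes(mu_hat, [[50, 40], [1, 1]], omega))
```

In the first call the sub-optimal arm has accumulated enough evidence, so an empirically optimal couple has the smallest index and is exploited. In the second call the sub-optimal arm has been pulled only once per user, its transportation cost is small, and a sub-optimal couple is selected for exploration.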
However, it is known for other structures, see Lattimore and Szepesvari (2017), that such a greedy strategy does not exploit optimally the structure of the problem. Indeed, at a high level, pulling an apparently sub-optimal couple $(a,b)$ makes it possible to gather information not only about this particular couple but also about other couples due to the structure. In order to attain optimality one needs to find couples that provide the best trade-off between information and low regret. This is exactly what is done in the optimization problem (2).

3.1.2 The IMED-GS$^\star$ strategy

In order to address this difficulty we first decide, thanks to the (weak) indexes, whether we need to exploit or explore. In the second case, in order to explore optimally according to the graph structure, we solve the optimization problem (2) parametrized by the current estimates of the means and then track the optimal numbers of pulls given by the solution of this problem. More precisely, at each round we choose, but do not immediately pull, a couple with minimum index

$$(a_t, b_t) \in \operatorname*{argmin}_{(a,b) \in \mathcal{A} \times \mathcal{B}} I_{a,b}(t).$$

Exploitation: If this couple is currently optimal, $(a_t, b_t) \in \hat{\mathcal{O}}^\star(t)$, we exploit, that is, we pull this couple.

Exploration: Else we explore arm $a_{t+1} = a_t$. To this end, let $n^{\mathrm{opt}}(t)$ be a solution of the empirical version of (2) with $\beta = 1_{\mathcal{B}}$, that is

$$n^{\mathrm{opt}}(t) \in \operatorname*{argmin}_{n \in \mathbb{R}_+^{\mathcal{A} \times \mathcal{B}}} \left\{ \sum_{(a,b) \in \mathcal{A} \times \mathcal{B}} \big( \hat{\mu}^\star_b(t) - \hat{\mu}_{a,b}(t) \big) n_{a,b} \ \ \text{s.t.} \ \forall (a,b) \notin \hat{\mathcal{O}}^\star(t), \ \hat{\mathcal{B}}_{a,b}(t) \neq \emptyset : \sum_{b' \in \hat{\mathcal{B}}_{a,b}(t)} \mathrm{kl}\big( \hat{\mu}_{a,b'}(t) \,\big|\, \hat{\mu}^\star_b(t) - \omega_{b,b'} \big) n_{a,b'} \geq 1 \right\}. \tag{5}$$

The current optimal numbers of pulls are given by

$$N^{\mathrm{opt}}_{a,b}(t) = n^{\mathrm{opt}}_{a,b}(t) \min_{b' \in \mathcal{B}} I_{a,b'}(t). \tag{6}$$

We then track

$$b_{t+1} \in \operatorname*{argmax}_{b \in \hat{\mathcal{B}}_{a_t,b_t}(t) \cup \{b_t\}} N^{\mathrm{opt}}_{a_t,b}(t) - N_{a_t,b}(t). \tag{7}$$

Asymptotically, we expect that all the sub-optimal couples are pulled roughly $\log(T)$ times. Therefore, for all sub-optimal couples $(a,b)$, the index $I_{a,b}(T)$ should be of order $\log(T)$. Thus we asymptotically recover in the definition of $N^{\mathrm{opt}}_{a,b}(\cdot)$ the optimal number of pulls of couple $(a,b)$, that is $n^{\nu}_{a,b} \log(T)$, as suggested in Corollary 8. Finally we pull the selected couple $(a_{t+1}, b_{t+1})$. In order to ensure optimality, however, such a direct tracking of the current optimal number of pulls is still a bit too aggressive and we need to force exploration in some exploration rounds. We proceed as follows: when we explore arm $a_t$ we automatically pull a couple $(a_t, b)$ if its number of pulls $N_{a_t,b}(t)$ is lower than the logarithm of the number of times we decided to explore this arm. See Algorithm 2 for details. This does not hurt the asymptotic optimality because we expect to explore a sub-optimal arm not more than $\log(T)$ times. On the bright side, this still differs from traditional forced exploration. Indeed, only few rounds are dedicated to exploration thanks to the first selection with the indexes, and among them only a logarithmic number will consist of pure exploration: thus, we expect an overall $\log\log(T)$ rounds of forced exploration. Note also that all the quantities involved in this forced exploration use empirical counters. Putting it all together we end up with strategy IMED-GS$^\star$ described in Algorithm 2.

2. This notion refers to the generic proof technique used to derive regret lower bounds. It involves a change-of-measure argument, from the initial configuration in which the couple is sub-optimal to another one chosen to make it optimal.
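The sparing nature of this forced exploration can be illustrated by a short Python sketch of the user-selection step in exploration rounds. It assumes one reading of the counters of Algorithm 2, namely that $c_a$ and $c_a^+$ both start at 1 and that $c_a^+$ is doubled each time $c_a$ reaches it; the function name `choose_user`, the fixed table `n_opt` (standing in for the quantities $N^{\mathrm{opt}}_{a,b}(t)$ of (6)), and the candidate sets are hypothetical.

```python
def choose_user(arm, candidates, all_users, n_opt, counts, counters):
    """Pick the user to pull with `arm` in an exploration round.

    Returns (user, forced), where `forced` tells whether the round was a
    forced-exploration round (least-pulled user) or a tracking round (Eq. (7)).
    """
    c, c_plus = counters[arm]
    if c == c_plus:
        # Forced exploration: happens when c hits the threshold, which then doubles.
        counters[arm] = (c + 1, 2 * c_plus)
        return min(all_users, key=lambda b: counts[arm][b]), True
    # Tracking: reduce the largest deficit with respect to the target counts.
    counters[arm] = (c + 1, c_plus)
    return max(candidates, key=lambda b: n_opt[arm][b] - counts[arm][b]), False


# Simulate 1000 exploration rounds of a single arm over three users.
counters = {0: (1, 1)}            # assumed initial values of (c_a, c_a^+)
counts = [[0, 0, 0]]              # counts[arm][user]
n_opt = [[600.0, 400.0, 0.0]]     # hypothetical fixed tracking targets
forced_rounds = []
for round_id in range(1, 1001):
    user, forced = choose_user(0, [0, 1], [0, 1, 2], n_opt, counts, counters)
    counts[0][user] += 1
    if forced:
        forced_rounds.append(round_id)
```

Under these assumptions the forced rounds are exactly the exploration rounds whose rank is a power of two, so $n$ exploration rounds of an arm contain $\lfloor \log_2 n \rfloor + 1$ forced ones. Combined with an expected number of exploration rounds of order $\log(T)$, this is consistent with the $\log\log(T)$ order stated above.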

Comparison with other strategies

IMED-GS$^\star$ combines ideas from IMED introduced by Honda and Takemura (2011) and from OSSB by Combes et al. (2017). More precisely, it generalizes the index from IMED to the graph structure. From OSSB it borrows the tracking of the optimal counts given by the asymptotic lower bound (see also Lattimore and Szepesvari (2017)) and the way to force exploration sparingly. The main difference with OSSB is that IMED-GS$^\star$ leverages the indexes to deal with the exploitation-exploration trade-off. In particular, IMED-GS$^\star$ does not need to solve the optimization problem (2) at each round. This greatly improves the computational complexity. Also, note that OSSB requires choosing a tuning parameter that must be positive to ensure theoretical guarantees but that must be set equal to 0 to work well in practice. This is not the case for IMED-GS$^\star$, which requires no parameter tuning and works well both in theory and in practice (see Section 4).

IMED type strategies for the uncontrolled scenario

In this section, we consider an uncontrolled scenario where the sequence of users $(b_t)_{t\geq 1}$ is assumed deterministic and does not depend on the strategy of the learner. We adapt the two previous strategies IMED-GS and IMED-GS⋆ to this scenario.

Algorithm 1 IMED-GS (controlled scenario)

Require: Weight matrix $(\omega_{b,b'})_{b,b'\in\mathcal{B}}$.
for $t = 1 \ldots T$ do
    Pull $(a_{t+1}, b_{t+1}) \in \operatorname{argmin}_{(a,b)\in\mathcal{A}\times\mathcal{B}} I_{a,b}(t)$
end for

Algorithm 2 IMED-GS⋆ (controlled scenario)

Require: Weight matrix $(\omega_{b,b'})_{b,b'\in\mathcal{B}}$.
$\forall a\in\mathcal{A}$, $c_a, c_a^+ \leftarrow 1$
for $t = 1 \ldots T$ do
    Choose $(a_t, b_t) \in \operatorname{argmin}_{(a,b)\in\mathcal{A}\times\mathcal{B}} I_{a,b}(t)$
    if $(a_t, b_t) \in \hat{\mathcal{O}}^\star(t)$ then
        Choose $(a_{t+1}, b_{t+1}) = (a_t, b_t)$
    else
        Set $a_{t+1} = a_t$
        if $c_{a_{t+1}} = c^+_{a_{t+1}}$ then
            $c^+_{a_{t+1}} \leftarrow 2\, c^+_{a_{t+1}}$
            Choose $b_{t+1} \in \operatorname{argmin}_{b\in\mathcal{B}} N_{a_{t+1},b}(t)$
        else
            Choose $b_{t+1} \in \operatorname{argmax}_{b\in\hat{\mathcal{B}}_{a_t,b_t}(t)\cup\{b_t\}} N^{\mathrm{opt}}_{a_{t+1},b}(t) - N_{a_{t+1},b}(t)$
        end if
        $c_{a_{t+1}} \leftarrow c_{a_{t+1}} + 1$
    end if
    Pull $(a_{t+1}, b_{t+1})$
end for

IMED-GS strategy (uncontrolled scenario). At time step $t \geq 1$ the choice of user $b_t$ is no longer strategy-dependent but is imposed by the sequence of users $(b_t)_{t\geq 1}$, which is assumed to be deterministic in the uncontrolled scenario. The learner only chooses an arm $a_t$ to pull, knowing user $b_t$. We define the uncontrolled variant of IMED-GS as the strategy that pulls an arm with minimum index, as described in Algorithm 3 of Appendix C. This variant shares the advantages and shortcomings of IMED-GS: it does not exploit optimally the structure of the problem, but it works well in practice (see Section 4) and has a low computational complexity.

IMED-GS⋆ strategy (uncontrolled scenario). In order to explore optimally according to the graph structure in the uncontrolled scenario, we also track the optimal numbers of pulls. At first glance, $\beta$ may be different from $1_{\mathcal{B}}$, which requires some normalizations. First, for every time step $t \geq 1$, $n^{\mathrm{opt}}(t)$ now denotes a solution of the empirical version of (2) with $\beta = (\hat\beta_b(t))_{b\in\mathcal{B}}$, where $\hat\beta_b(t) = \log(N_b(t))/\log(t)$ estimates the log-frequency $\beta_b$ of user $b\in\mathcal{B}$. Second, we have to consider the normalized indexes $\tilde I_{a,b}(t) = I_{a,b}(t)/\hat\beta_b(t)$ for couples $(a,b)\in\mathcal{A}\times\mathcal{B}$ in order to have $\tilde I_{a,b}(T) \sim \log(T)$ as in the controlled scenario. An additional difficulty is that at a given time step $t \geq 1$, while the indexes indicate to explore, the currently tracked user (see Equation 7) is likely to be different from the user $b_t$ with whom the learner deals. This difficulty is easy to circumvent by postponing and prioritizing the exploration until the learner deals with the tracked user. Priority in exploration phases is given first to delayed forced exploration, then to delayed exploration based on solving optimization problem (2), and finally to exploration based on current indexes (see Algorithm 4 in Appendix C). The uncontrolled variant corresponds essentially to IMED-GS⋆ with some delays due to the fact that the tracked and the current users may be different. This has no impact on its optimality since the log-frequencies of users are enforced to be positive.

In order to prove the asymptotic optimality of IMED-GS⋆ we introduce the following mild assumptions on the configuration considered.

Definition 10 (Non-peculiar configuration) A configuration $\nu\in\mathcal{D}_\omega$ is non-peculiar if the optimization problem (2) admits a unique solution and each user $b$ admits a unique optimal arm $a^\star_b$.

In Theorem 11 we state the main result of this paper, namely, the asymptotic optimality of IMED-GS⋆ and of its uncontrolled variant. We prove this result for IMED-GS⋆ in Appendix E and adapt this proof in Appendix G for the uncontrolled variant. Please refer to Proposition 20 (Appendix D) for more refined finite-time upper bounds. As a byproduct of this analysis we deduce the Pareto-optimality of IMED-GS and of its uncontrolled variant, stated in Proposition 12 and proved in Appendix G.

Theorem 11 (Asymptotic optimality) Both IMED-GS⋆ and its uncontrolled variant are consistent strategies. Further, they are asymptotically optimal on the set of non-peculiar configurations, that is, for all non-peculiar $\nu\in\mathcal{D}_\omega$, under IMED-GS⋆ the sequence of users has log-frequencies $1_{\mathcal{B}}$ and we have

$$\limsup_{T\to\infty} \frac{R(\nu,T)}{\log(T)} \leq C^\star_\omega(1_{\mathcal{B}}, \nu),$$

and, under the uncontrolled variant of IMED-GS⋆, assuming a sequence of users with log-frequencies $\beta\in(0,1]^{\mathcal{B}}$, we have

$$\limsup_{T\to\infty} \frac{R(\nu,T)}{\log(T)} \leq C^\star_\omega(\beta, \nu).$$

Proposition 12 (Asymptotic Pareto-optimality) Both IMED-GS and its uncontrolled variant are consistent strategies. Further, they are asymptotically Pareto-optimal on the set of non-peculiar configurations, that is, under IMED-GS or its uncontrolled variant, for all non-peculiar $\nu\in\mathcal{D}_\omega$,

$$\forall a\in\mathcal{A},\quad \limsup_{T\to\infty}\ \min_{b:\,(a,b)\notin\mathcal{O}^\star} \frac{1}{\log(N_b(T))} \sum_{b'\in\mathcal{B}_{a,b}} \mathrm{kl}(\mu_{a,b'}\,|\,\mu^\star_b-\omega_{b,b'})\, N_{a,b'}(T) \leq 1.$$

Discussion. Removing forced exploration remains the most challenging task for structured bandit problems. In the context of this structure, forced exploration would have meant using a criterion like "if $N_{a,b}(T) < \hat c_{a,b}(T)\log(T)$, then pull couple $(a,b)$" for some constants $\hat c_{a,b}(T)$ that depend on the minimization problem coming from the lower bound, and where $\hat c_{a,b}(T)\log(T)$ can be interpreted as an estimator of the theoretical asymptotic lower bound on the numbers of pulls of couple $(a,b)$. In stark contrast, in IMED-GS⋆ there is no forced exploration in choosing the arm to explore and, in choosing the user to explore, the criterion used is more intrinsic, as it reads "if $N_{a,b}(T) < \hat c_{a,b}(T)\, I_a(T)$, then pull couple $(a,b)$", where $I_a(T)\sim\log(T)$ but really depends on $(N_{a,b}(T))_{(a,b)\notin\mathcal{O}^\star}$. Thus, the criteria used are not asymptotic, and do not depend on the time $t$ but on the current numbers of pulls of sub-optimal arms. Since theoretical asymptotic lower bounds on the numbers of pulls are significantly larger than the current numbers of pulls in finite horizon (see Figure 3), the IMED-GS⋆ strategy is also expected to behave better than strategies based on usual (conservative) forced exploration. Although entirely removing forced exploration would be nicer, in IMED-GS⋆ forced exploration is only done sparingly.
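As a small illustration of the two normalizations used in the uncontrolled scenario, the following Python sketch computes the log-frequency estimate $\hat\beta_b(t) = \log(N_b(t))/\log(t)$ and the normalized index $\tilde I_{a,b}(t) = I_{a,b}(t)/\hat\beta_b(t)$. The function names are hypothetical, and raising an error when the estimate is not positive is a choice of this sketch, not something specified by the strategy.

```python
import math


def log_frequency(n_b, t):
    """Estimate of the log-frequency of a user seen n_b times up to time t.

    Requires t >= 2 (so that log(t) > 0) and n_b >= 1.
    """
    return math.log(n_b) / math.log(t)


def normalized_index(index_ab, n_b, t):
    """Normalized index: the index of couple (a, b) divided by the estimate."""
    beta_hat = log_frequency(n_b, t)
    if beta_hat <= 0.0:
        # Sketch-only convention: the normalization needs a positive estimate.
        raise ValueError("log-frequency estimate must be positive")
    return index_ab / beta_hat


# Hypothetical example: a user seen 100 times after 1000 steps.
print(log_frequency(100, 1000))
print(normalized_index(3.0, 100, 1000))
```

A user that appears at every step has an estimate equal to 1, in which case the normalized index coincides with the original one.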

4. Numerical experiments

In this section, we compare empirically the following strategies introduced beforehand: IMED-GS and IMED-GS⋆, described respectively in Algorithms 1 and 2; their uncontrolled variants, described respectively in Algorithms 3 and 4; and the baseline IMED by Honda and Takemura (2011), which does not exploit the structure. We compare these strategies on two setups, each with $|\mathcal{B}| = 10$ users and $|\mathcal{A}| = 5$ arms. For the uncontrolled scenario we consider the round-robin sequence of users. As expected, the strategies leveraging the graph structure perform better than the baseline IMED, which does not exploit it. Furthermore, the plots suggest that IMED-GS and IMED-GS⋆ (respectively, their uncontrolled variants) perform similarly in practice.

Fixed configuration (Figure 3, left). For these experiments we investigate these strategies on a fixed configuration. The weight matrix $\omega$ and the configuration $\nu\in\mathcal{D}_\omega$ are given in Appendix I. This also enables us to plot the asymptotic lower bound on the regret for reference: we plot the unstructured lower bound (LB_agnostic) as a dashed red line, and the structured lower bound (LB_struct) as a dashed blue line.

Random configurations (Figure 3, right). In these experiments we average regrets over random configurations. We proceed as follows: at each run we sample uniformly at random a weight matrix $\omega$ and then sample uniformly at random a configuration $\nu\in\mathcal{D}_\omega$.

Figure 3: Regret approximated over runs. Top: controlled scenario. Bottom: uncontrolled scenario, where $(b_t)_{t\geq 1}$ is the round-robin sequence of users. Left: fixed configuration. Right: random configurations.

Additional experiments in Appendix I confirm that both IMED-GS and IMED-GS⋆ induce sequences of users with log-frequencies all equal to 1.

Acknowledgments

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

Rajeev Agrawal, Demosthenis Teneketzis, and Venkatachalam Anantharam. Asymptotically efficient adaptive allocation schemes for controlled iid processes: Finite parameter space. IEEE Transactions on Automatic Control, 34(3), 1989.

Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516–1541, 2013.

Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning, 2014.

Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, pages 1763–1771, 2017.

Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. arXiv preprint arXiv:1708.00768, 2017.

Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In International Conference on Machine Learning, pages 757–765, 2014.

Todd L Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743, 1997.

Junya Honda and Akimichi Takemura. An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Machine Learning, 85(3):361–391, 2011.

Junya Honda and Akimichi Takemura. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. Journal of Machine Learning Research, 16:3721–3756, 2015.

Tze Leung Lai. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091–1114, 1987.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Tor Lattimore and Csaba Szepesvari. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737, 2017.

Stefan Magureanu. Efficient Online Learning under Bandit Feedback. PhD thesis, KTH Royal Institute of Technology, 2018.

Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bounds and optimal algorithms. Machine Learning, 35:1–25, 2014.

O-A Maillard. Boundary crossing probabilities for general exponential families. Mathematical Methods of Statistics, 27(1):1–31, 2018.

Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. In International Conference on Machine Learning, pages 136–144, 2014.

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.

Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 1015–1022. Omnipress, 2010.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

William R Thompson. On a criterion for the rejection of observations and the distribution of the ratio of deviation to sample standard deviation. The Annals of Mathematical Statistics, 6(4):214–219, 1935.

Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, 2014.

Contents

3 IMED type strategies for Graph-structured Bandits
    3.1 IMED type strategies for the controlled scenario
        3.1.1 The IMED-GS strategy
        3.1.2 The IMED-GS⋆ strategy
    3.2 IMED type strategies for the uncontrolled scenario
    3.3 Asymptotic optimality of IMED type strategies

B.1 Almost sure asymptotic lower bounds under consistent strategies
    B.1.1 $\mathbb{E}_{\tilde\nu}\big[N_b(T)^{1-\alpha}\,\mathbb{I}\{\Omega_T\}\big]$ tends to 0 when $T$ tends to infinity
    B.1.2 $\mathbb{P}_\nu(\Omega_T\cap E_T^c)$ tends to 0 when $T$ tends to infinity
B.2 Asymptotic lower bounds on the regret

C Algorithms for the uncontrolled scenario

D IMED-GS⋆: Finite-time analysis
    D.1 Strategy-based empirical bounds
    D.2 Reliable current best arm and means
    D.3 Pareto-optimality and upper bounds on the numbers of pulls of sub-optimal arms
    D.4 IMED-GS⋆ is consistent and induces sequences of users with log-frequencies $1_{\mathcal{B}}$
    D.5 The counters $c_a$ and $c_a^+$ coincide at most $O(\log(\log(T)))$ times
    D.6 All couples $(a,b)\in\mathcal{A}\times\mathcal{B}$ are asymptotically pulled an infinite number of times
    D.7 Concentration lemmas
    D.8 Proof of Lemma 18

E IMED-GS⋆: Proof of Theorem 11 (main result)
    E.1 Almost surely $n^{\mathrm{opt}}(T)$ tends to $n^\nu$
    E.2 Almost surely and in expectation, for every sub-optimal couple, $N_{a,b}(T)/\log(T)$ tends to $n^\nu_{a,b}$

F Concentration lemmas: Proofs

G IMED-GS and the uncontrolled variants of IMED-GS and IMED-GS⋆: Finite-time analysis
    G.1 IMED-GS finite-time analysis
    G.2 Uncontrolled scenario: Finite-time analysis
        G.2.1 Empirical bounds on the numbers of pulls
        G.2.2 Concentration inequality with bounded time delays

H Continuity of solutions to parametric linear programs

I Details on numerical experiments

Appendix A. Notations and reminders

For ν ∈ D ν , we deﬁne ε ν := min ( a,b ) / ∈O (cid:63) ,b (cid:48) ∈B µ a,b (cid:48) − µ (cid:63)b − ω b,b (cid:48) (cid:54) =0 (cid:40) (cid:12)(cid:12) µ a,b (cid:48) − µ (cid:63)b − ω b,b (cid:48) (cid:12)(cid:12) , µ a,b , − µ (cid:63)b (cid:41) . Then, there exists α ν : R (cid:63) + → R (cid:63) + such that lim ε → α ν ( ε ) = 0 and such that for all < ε < ε ν , for all ( a, b ) / ∈ O (cid:63) , for all b (cid:48) ∈ B , µ a,b (cid:48) (cid:54) = µ (cid:63)b − ω b,b (cid:48) ⇒ kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) )1 + α ν ( ε ) (cid:54) kl ( µ a,b (cid:48) ± ε | µ (cid:63)b − ω b,b (cid:48) ± ε ) (cid:54) (1+ α ν ( ε )) kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) . We also introduce the following constant of interest E ν := 6 e max a ∈A ,b ∈B (cid:18) − log(1 − µ a,b − ε ν )log(1 − µ a,b ) (cid:19) − (cid:32) − e − (cid:18) − log(1 − µa,b − εν )log(1 − µa,b ) (cid:19) kl ( µ a,b | µ a,b − ε ν ) (cid:33) − . Lastly, for all couple ( a, b ) ∈ A × B , for all n (cid:62) , we consider the stopping times τ na,b := inf { t (cid:62) N a,b ( t ) = n } and deﬁne (cid:98) µ na,b := (cid:98) µ a,b ( τ na,b ) . Appendix B. Proof related to the regret lower bound (Section 2)

In this section we regroup the proofs related to the lower bounds.

B.1 Almost sure asymptotic lower bounds under consistent strategies

In this section we prove Proposition 5.

Let us consider a consistent strategy on $\mathcal{D}_\omega$. Let $\nu\in\mathcal{D}_\omega$ and let us consider $(a,b)\notin\mathcal{O}^\star$. We show that almost surely $\lim_{T\to\infty} N_b(T) = +\infty$ implies

$$\liminf_{T\to\infty} \frac{1}{\log(N_b(T))} \sum_{b'\in\mathcal{B}_{a,b}} N_{a,b'}(T)\,\mathrm{kl}(\mu_{a,b'}\,|\,\mu^\star_b-\omega_{b,b'}) \geq 1,$$

where $\mathcal{B}_{a,b} = \{ b'\in\mathcal{B} : (a,b')\notin\mathcal{O}^\star \text{ and } \mu_{a,b'} < \mu^\star_b - \omega_{b,b'} \}$.

Proof. Let us consider $\tilde\nu\in\mathcal{D}_\omega$, a maximal confusing distribution for the sub-optimal couple $(a,b)$ such that $a$ is the unique optimal arm of user $b$ for $\tilde\nu$, defined as follows:
- $\forall a'\neq a,\ b'\in\mathcal{B}$, $\tilde\mu_{a',b'} = \mu_{a',b'}$
- $\forall b''\notin\mathcal{B}_{a,b}$, $\tilde\mu_{a,b''} = \mu_{a,b''}$
- $\forall b'\in\mathcal{B}_{a,b}$, $\tilde\mu_{a,b'} = \mu^\star_b - \omega_{b,b'} + \varepsilon$

where $0 < \varepsilon < \varepsilon_0 = \min_{b'\in\mathcal{B}} \big|\mu^\star_b - \omega_{b,b'} - \mu_{a,b'}\big|$. Our assumption on $\mathcal{D}_\omega$ ensures that $\varepsilon_0 > 0$. Note that $\varepsilon$ is chosen in such a way that for all $b', b''\in\mathcal{B}$, $\max_{a\in\mathcal{A}} \big|\tilde\mu_{a,b'} - \tilde\mu_{a,b''}\big| \leq \omega_{b',b''}$. Indeed we have:
- for $b', b''\in\mathcal{B}_{a,b}$: $\big|\tilde\mu_{a,b'} - \tilde\mu_{a,b''}\big| = \big|\omega_{b,b'} - \omega_{b,b''}\big| \leq \omega_{b',b''}$
- for $b', b''\notin\mathcal{B}_{a,b}$: $\big|\tilde\mu_{a,b'} - \tilde\mu_{a,b''}\big| = \big|\mu_{a,b'} - \mu_{a,b''}\big| \leq \omega_{b',b''}$
- for $b'\in\mathcal{B}_{a,b}$ and $b''\notin\mathcal{B}_{a,b}$, we have $\tilde\mu_{a,b'} - \tilde\mu_{a,b''} = \mu^\star_b - \omega_{b,b'} + \varepsilon - \mu_{a,b''}$. Since $b'\in\mathcal{B}_{a,b}$, we have $\mu_{a,b'} \leq \mu^\star_b - \omega_{b,b'}$, and since $b''\notin\mathcal{B}_{a,b}$, we have $\mu_{a,b''} \geq \mu^\star_b - \omega_{b,b''} + \varepsilon_0$. Therefore on one hand we get
$$\mu^\star_b - \omega_{b,b'} + \varepsilon - \mu_{a,b''} \geq \mu_{a,b'} - \mu_{a,b''} \geq -\omega_{b',b''},$$
and on the other hand
$$\mu^\star_b - \omega_{b,b'} + \varepsilon - \mu_{a,b''} \leq \mu^\star_b - \omega_{b,b'} + \varepsilon - (\mu^\star_b - \omega_{b,b''} + \varepsilon_0) = \varepsilon - \varepsilon_0 + \omega_{b,b''} - \omega_{b,b'} \leq \omega_{b',b''}.$$

Actually, we can choose $0 < \varepsilon < \varepsilon_\nu$ so that

$$\forall b'\in\mathcal{B}_{a,b},\quad \mathrm{kl}(\mu_{a,b'}\,|\,\tilde\mu_{a,b'}) \leq (1+\alpha_\nu(\varepsilon))\,\mathrm{kl}(\mu_{a,b'}\,|\,\mu^\star_b-\omega_{b,b'}).$$

We refer to Appendix A for the definitions of $\varepsilon_\nu$ and $\alpha_\nu(\cdot)$. Note that $\alpha_\nu(\cdot)$ is such that $\lim_{\varepsilon\to 0}\alpha_\nu(\varepsilon) = 0$.

Let $0 < c < 1$. We will show that almost surely $\lim_{T\to\infty} N_b(T) = +\infty$ implies

$$\liminf_{T\to\infty} \frac{1}{\log(N_b(T))} \sum_{b'\in\mathcal{B}_{a,b}} N_{a,b'}(T)\,\mathrm{kl}(\mu_{a,b'}\,|\,\tilde\mu_{a,b'}) \geq c.$$

We start with the following inequality

$$\mathbb{P}_\nu\Big( \liminf_{T\to\infty} \frac{1}{\log(N_b(T))} \sum_{b'\in\mathcal{B}_{a,b}} N_{a,b'}(T)\,\mathrm{kl}(\mu_{a,b'}\,|\,\tilde\mu_{a,b'}) < c,\ \lim_{T\to\infty} N_b(T) = \infty \Big) \leq \liminf_{T\to\infty} \mathbb{P}_\nu\Big( \frac{1}{\log(N_b(T))} \sum_{b'\in\mathcal{B}_{a,b}} N_{a,b'}(T)\,\mathrm{kl}(\mu_{a,b'}\,|\,\tilde\mu_{a,b'}) < c,\ \lim_{T\to\infty} N_b(T) = \infty \Big).$$

Let us consider a horizon $T \geq 1$ and let us introduce the event

$$\Omega_T = \Big\{ \sum_{b'\in\mathcal{B}_{a,b}} N_{a,b'}(T)\,\mathrm{kl}(\mu_{a,b'}\,|\,\tilde\mu_{a,b'}) < c\,\log(N_b(T)),\ \lim_{T\to\infty} N_b(T) = \infty \Big\}.$$
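The construction of the confusing configuration $\tilde\nu$ at the start of this proof can be checked numerically on a toy instance. The Python sketch below is illustrative only: the function names, the list-of-lists layout `mu[a][b]`, the numerical tolerance and the example values are assumptions of this sketch, not part of the paper.

```python
def optimal_means(mu):
    """Best mean of each user, for means stored as mu[arm][user]."""
    n_users = len(mu[0])
    return [max(row[b] for row in mu) for b in range(n_users)]


def confusing_users(mu, omega, a, b):
    """Users b' with (a, b') sub-optimal and mu[a][b'] below mu_star[b] - omega[b][b']."""
    star = optimal_means(mu)
    return [bp for bp in range(len(star))
            if mu[a][bp] < star[bp] and mu[a][bp] < star[b] - omega[b][bp]]


def confusing_configuration(mu, omega, a, b, eps):
    """Raise the means of arm a on the confusing users; leave the rest unchanged."""
    star = optimal_means(mu)
    tilde = [row[:] for row in mu]
    for bp in confusing_users(mu, omega, a, b):
        tilde[a][bp] = star[b] - omega[b][bp] + eps
    return tilde


def respects_structure(mu, omega, tol=1e-12):
    """Check that the means of any two users differ by at most their weight, for every arm."""
    users = range(len(omega))
    return all(abs(row[b1] - row[b2]) <= omega[b1][b2] + tol
               for row in mu for b1 in users for b2 in users)


# Hypothetical instance: 2 arms, 2 users, arm 0 optimal for both users.
mu = [[0.8, 0.75], [0.5, 0.55]]
omega = [[0.0, 0.1], [0.1, 0.0]]
tilde = confusing_configuration(mu, omega, a=1, b=0, eps=0.01)
print(tilde[1], respects_structure(tilde, omega))
```

On this instance both users are confusing for the couple (1, 0), arm 1 becomes optimal for user 0 under the modified means, and the weight constraints still hold.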

We want to provide an upper bound on $\mathbb{P}_\nu(\Omega_T)$ to ensure $\lim_{T\to\infty}\mathbb{P}_\nu(\Omega_T) = 0$. We start by taking advantage of the following lemma.

Lemma 13 (Change of measure) For every event $\Omega$ and random variable $Z$ both measurable with respect to $\nu$ and $\tilde\nu$,

$$\mathbb{P}_\nu(\Omega\cap E) = \mathbb{E}_{\tilde\nu}\Big[\frac{d\nu}{d\tilde\nu}(\psi)\,\mathbb{I}\{\Omega\cap E\}\Big] \leq \mathbb{E}_{\tilde\nu}\big[e^{Z}\,\mathbb{I}\{\Omega\}\big],$$

where $E = \big\{ \log\big(\frac{d\nu}{d\tilde\nu}(\psi)\big) \leq Z \big\}$ and $\psi = ((a_t,b_t), X_t)_{t=1..T}$ is the sequence of pulled couples and rewards.

Let $\alpha\in(0,1)$ and let us introduce the event

$$E_T = \Big\{ \log\Big(\frac{d\nu}{d\tilde\nu}(\psi)\Big) \leq (1-\alpha)\log(N_b(T)) \Big\}.$$

Then we can decompose the probability as follows

$$\mathbb{P}_\nu(\Omega_T) = \mathbb{P}_\nu(\Omega_T\cap E_T) + \mathbb{P}_\nu(\Omega_T\cap E_T^c) \leq \mathbb{E}_{\tilde\nu}\big[N_b(T)^{1-\alpha}\,\mathbb{I}\{\Omega_T\}\big] + \mathbb{P}_\nu(\Omega_T\cap E_T^c).$$

Now, we control successively $\mathbb{E}_{\tilde\nu}\big[N_b(T)^{1-\alpha}\,\mathbb{I}\{\Omega_T\}\big]$ and $\mathbb{P}_\nu(\Omega_T\cap E_T^c)$ and show that they both tend to 0 as $T$ tends to $\infty$.

B.1.1 $\mathbb{E}_{\tilde\nu}\big[N_b(T)^{1-\alpha}\,\mathbb{I}\{\Omega_T\}\big]$ tends to 0 when $T$ tends to infinity

We first provide an upper bound on $\mathbb{I}\{\Omega_T\}$ as follows. Since $b\in\mathcal{B}_{a,b}$, denoting $c' = c/\mathrm{kl}(\mu_{a,b}\,|\,\tilde\mu_{a,b})$, we have

$$\Omega_T \subset \Big\{ N_{a,b}(T) < c'\log(N_b(T)),\ \lim_{T\to\infty} N_b(T) = \infty \Big\} = \Big\{ N_b(T) < c'\log(N_b(T)) + \sum_{a'\neq a} N_{a',b}(T),\ \lim_{T\to\infty} N_b(T) = \infty \Big\}.$$

Thus, we have

$$\mathbb{I}\{\Omega_T\} \leq c'\,\mathbb{I}\{N_b(T)\geq 1,\ N_b(T)\to\infty\}\,\frac{\log(N_b(T))}{N_b(T)} + \sum_{a'\neq a} \frac{N_{a',b}(T)}{N_b(T)}.$$

Considering $f_\alpha : x \geq 1 \mapsto \log(x)/x^{\alpha}$, we have $f_\alpha \leq e^{-1}/\alpha$. Then, the dominated convergence theorem ensures

$$\mathbb{E}_{\tilde\nu}\Big[\mathbb{I}\{N_b(T)\geq 1,\ N_b(T)\to\infty\}\,\frac{\log(N_b(T))}{N_b(T)^{\alpha}}\Big] = o(1).$$

Furthermore, since the considered strategy is assumed consistent, we know that for $a'\neq a$, since $a'$ is a sub-optimal arm for user $b$ and configuration $\tilde\nu$,

$$\mathbb{E}_{\tilde\nu}\Big[\frac{N_{a',b}(T)}{N_b(T)^{\alpha}}\Big] = o(1),$$

therefore we get $\mathbb{E}_{\tilde\nu}\big[N_b(T)^{1-\alpha}\,\mathbb{I}\{\Omega_T\}\big] = o(1)$.

B.1.2 $\mathbb{P}_\nu(\Omega_T\cap E_T^c)$ tends to 0 when $T$ tends to infinity

For each time $t = 1,\ldots,T$, the reward $X_t$ is sampled independently from the past and according to $\nu_{a_t,b_t}$. Hence the likelihood ratio rewrites

$$\frac{d\nu}{d\tilde\nu}(\psi) = \prod_{t=1}^{T} \frac{d\nu_{a_t,b_t}}{d\tilde\nu_{a_t,b_t}}(X_t),$$

where, for all $(a,b)\in\mathcal{A}\times\mathcal{B}$ and for all $x\in\{0,1\}$, we have

$$\frac{d\nu_{a,b}}{d\tilde\nu_{a,b}}(x) = \frac{\mu_{a,b}^{x}(1-\mu_{a,b})^{1-x}}{\tilde\mu_{a,b}^{x}(1-\tilde\mu_{a,b})^{1-x}}.$$

Thus, since for all $b'\notin\mathcal{B}_{a,b}$, $\mu_{a,b'} = \tilde\mu_{a,b'}$, the log-likelihood ratio is

$$\log\Big(\frac{d\nu}{d\tilde\nu}(\psi)\Big) = \sum_{b'\in\mathcal{B}_{a,b}} \sum_{t=1}^{T} \mathbb{I}\{(a_t,b_t)=(a,b')\}\,\log\Big(\frac{d\nu_{a,b'}}{d\tilde\nu_{a,b'}}(X_t)\Big).$$

Let us introduce, for $(a,b)\in\mathcal{A}\times\mathcal{B}$, $X^n_{a,b} = X_{\tau^n_{a,b}}$ where $\tau^n_{a,b} = \min\{t\geq 1 \text{ s.t. } N_{a,b}(t) = n\}$. Note that the random variables $\tau^n_{a,b}$ are predictable stopping times, since $\{\tau^n_{a,b} = t\}$ is measurable with respect to the filtration generated by $((a_1,b_1), X_1, \ldots, (a_{t-1},b_{t-1}), X_{t-1})$. Hence we can rewrite the event $E_T$ as

$$E_T = \Big\{ \sum_{b'\in\mathcal{B}_{a,b}} \sum_{t=1}^{T} \mathbb{I}\{(a_t,b_t)=(a,b')\}\,\log\Big(\frac{d\nu_{a,b'}}{d\tilde\nu_{a,b'}}(X_t)\Big) \leq (1-\alpha)\log(N_b(T)) \Big\}$$

and, since $\Omega_T = \big\{ \sum_{b'\in\mathcal{B}_{a,b}} N_{a,b'}(T)\,\mathrm{kl}(\mu_{a,b'}\,|\,\tilde\mu_{a,b'}) < c\log(N_b(T)),\ \lim_{T\to\infty} N_b(T) = \infty \big\}$, we have

$$\Omega_T\cap E_T^c \subset \Big\{ \exists (n_{b'})_{b'\in\mathcal{B}_{a,b}} :\ \sum_{b'\in\mathcal{B}_{a,b}} n_{b'}\,\mathrm{kl}(\mu_{a,b'}\,|\,\tilde\mu_{a,b'}) < c\log(N_b(T)),\ \lim_{T\to\infty} N_b(T) = \infty \ \text{ and } \sum_{b'\in\mathcal{B}_{a,b}} \sum_{n=1}^{n_{b'}} \log\Big(\frac{d\nu_{a,b'}}{d\tilde\nu_{a,b'}}(X^n_{a,b'})\Big) > (1-\alpha)\log(N_b(T)) \Big\}.$$

For $b'\in\mathcal{B}_{a,b}$ and $n\geq 1$, let us consider $Z^n_{b'} = \log\big(\frac{d\nu_{a,b'}}{d\tilde\nu_{a,b'}}(X^n_{a,b'})\big)$. Then $Z^n_{b'}$ is bounded by $B_{b'} = \frac{1}{\tilde\mu_{a,b'}(1-\tilde\mu_{a,b'})}$, with mean $\mathbb{E}_\nu[Z^n_{b'}] = \mathrm{kl}(\mu_{a,b'}\,|\,\tilde\mu_{a,b'})$. Furthermore, the random variables $Z^n_{b'}$, for $b'\in\mathcal{B}_{a,b}$ and $n\geq 1$, are independent. Thus, it holds

$$\Omega_T\cap E_T^c \subset \Big\{ \max_{(n_{b'})\in\mathcal{N}_{a,b}} \sum_{b'\in\mathcal{B}_{a,b}} \sum_{n=1}^{n_{b'}} \big(Z^n_{b'} - \mathbb{E}_\nu[Z^n_{b'}]\big) > \Big(\frac{1-\alpha}{c} - 1\Big)\, c\,\log(N_b(T)),\ N_b(T)\to\infty \Big\},$$

where $\mathcal{N}_{a,b} := \big\{ (n_{b'})_{b'\in\mathcal{B}_{a,b}} : \sum_{b'\in\mathcal{B}_{a,b}} n_{b'}\,\mathrm{kl}(\mu_{a,b'}\,|\,\tilde\mu_{a,b'}) < c\log(T) \big\}$.
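The identity behind this step, namely that the per-sample log-likelihood ratio of two Bernoulli distributions has mean $\mathrm{kl}(\mu\,|\,\tilde\mu)$ under the first one, can be verified with a few lines of Python. This is a sketch for illustration; the function names are hypothetical and both means are assumed to lie strictly between 0 and 1.

```python
import math


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q), 0 < q < 1."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def log_likelihood_ratio(x, mu, mu_tilde):
    """Log of the likelihood ratio of Bernoulli(mu) against Bernoulli(mu_tilde) at x in {0, 1}."""
    if x == 1:
        return math.log(mu / mu_tilde)
    return math.log((1.0 - mu) / (1.0 - mu_tilde))


def expected_log_ratio(mu, mu_tilde):
    """Mean of the log-likelihood ratio when the reward is drawn from Bernoulli(mu)."""
    return (mu * log_likelihood_ratio(1, mu, mu_tilde)
            + (1.0 - mu) * log_likelihood_ratio(0, mu, mu_tilde))


# Hypothetical values: true mean 0.5, confusing mean 0.81.
print(expected_log_ratio(0.5, 0.81), kl_bernoulli(0.5, 0.81))
```

The two printed values agree up to floating-point error, which is the property used to center the sums in the inclusion above.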

21n the following, we apply Doob’s maximal inequality. For b (cid:48) ∈ B a,b and λ > , let us introduce thesuper-martingale M b (cid:48) n = exp (cid:32) λ n (cid:88) k =1 (cid:0) Z kb (cid:48) − E [ Z kb (cid:48) ] (cid:1) − nλ B b (cid:48) (cid:33) . Then noting that (cid:80) b (cid:48) ∈B a,b λ n b (cid:48) B b (cid:48) c log( T ) < (cid:80) b (cid:48) ∈B a,b λ n b (cid:48) B b (cid:48) (cid:80) b (cid:48) ∈B a,b n b (cid:48) kl ( µ a,b (cid:48) | (cid:101) µ a,b (cid:48) ) (cid:54) λ max b (cid:48) ∈B a,b B b (cid:48) b (cid:48) ∈B a,b kl ( µ a,b (cid:48) | (cid:101) µ a,b (cid:48) ) , we obtain Ω T ∩ E cT ⊂  max ( n (cid:48) b ) ∈N a,b (cid:89) b (cid:48) ∈B a,b M b (cid:48) n (cid:48) b > T (cid:104) λ ( − αc − − λ b (cid:48)∈B a,b B b (cid:48) b (cid:48)∈B a,b kl ( µa,b (cid:48)| (cid:101) µa,b (cid:48) ) (cid:105) c , N b ( T ) → ∞  ⊂ (cid:26) ∃ b (cid:48) ∈ B a,b : max n (cid:54) n max M b (cid:48) n > N b ( T ) γ , N b ( T ) → ∞ (cid:27) , where n max = c log( T )min b (cid:48)∈B a,b kl ( µ a,b (cid:48) | (cid:101) µ a,b (cid:48) ) and γ = (cid:104) λ ( − αc − − λ b (cid:48)∈B a,b B b (cid:48) b (cid:48)∈B a,b kl ( µ a,b (cid:48) | (cid:101) µ a,b (cid:48) ) (cid:105) c |B a,b | . 
Inorder to have γ > , we impose:- < α < − c (this implies − αc − > )- λ ∈ argmax λ (cid:48) (cid:62) (cid:26) λ (cid:48) ( 1 − αc − − λ (cid:48) b (cid:48)∈B a,b B b (cid:48) b (cid:48)∈B a,b kl ( µ a,b (cid:48) | (cid:101) µ a,b (cid:48) ) (cid:27) > .Thus for A > , we have P ν (Ω T ∩ E cT ) (cid:54) (cid:88) b (cid:48) ∈B a,b P ν (cid:18) max n (cid:54) n max M b (cid:48) n > N b ( T ) γ , N b ( T ) → ∞ (cid:19) (Union bound) (cid:54) (cid:88) b (cid:48) ∈B a,b P ν ( N b ( T ) γ (cid:54) A, N b ( T ) → ∞ ) + P ν (cid:18) max n (cid:54) n max M b (cid:48) n > A (cid:19) (cid:54) |B a,b | P ν ( N b ( T ) γ (cid:54) A, N b ( T ) → ∞ ) + (cid:88) b (cid:48) ∈B a,b E ν [ M b (cid:48) ] A (Doob’s maximal inequality) = |B a,b | P ν ( N b ( T ) γ (cid:54) A, N b ( T ) → ∞ ) + |B a,b | A .

Furthermore, we have

$$\lim_{T\to\infty} \mathbb{P}_\nu\big(N_b(T)^{\gamma} \leq A,\ N_b(T)\to\infty\big) \leq \mathbb{P}_\nu\Big(\limsup_{T\to\infty}\,\big(N_b(T)^{\gamma} \leq A\big),\ N_b(T)\to\infty\Big) \leq \mathbb{P}_\nu\Big(\limsup_{T\to\infty} N_b(T) < \infty,\ N_b(T)\to\infty\Big) = 0.$$

Thus we have shown

$$\forall A > 0,\quad \limsup_{T\to\infty} \mathbb{P}_\nu(\Omega_T\cap E_T^c) \leq \frac{|\mathcal{B}_{a,b}|}{A},$$

that is, $\mathbb{P}_\nu(\Omega_T\cap E_T^c) = o(1)$.

B.2 Asymptotic lower bounds on the regret

Here, we explain how we obtain the lower bounds on the regret given in Corollary 8.

Proof of Corollary 8. Let us consider a consistent strategy on $\mathcal{D}_\omega$ and let $\nu\in\mathcal{D}_\omega$. Let $(T_k)_{k\in\mathbb{N}}$ be a sub-sequence such that

$$\liminf_{T\to\infty} \frac{R(T,\nu)}{\log(T)} = \lim_{k\to\infty} \frac{R(T_k,\nu)}{\log(T_k)}.$$

We assume that this limit is finite, otherwise the result is straightforward. This implies in particular that for all $(a,b)\notin\mathcal{O}^\star$,

$$\limsup_{k\to\infty} \frac{\mathbb{E}_\nu[N_{a,b}(T_k)]}{\log(T_k)} < +\infty.$$

By Cantor's diagonal argument there exists an extraction of $(T_k)_{k\in\mathbb{N}}$, denoted by $(T'_k)_{k\in\mathbb{N}}$, such that for all $(a,b)\notin\mathcal{O}^\star$ there exists a finite limit $N_{a,b}$ such that

$$\lim_{k\to\infty} \frac{\mathbb{E}_\nu[N_{a,b}(T'_k)]}{\log(T'_k)} = N_{a,b}.$$

Hence we get

$$\liminf_{T\to\infty} \frac{R(T,\nu)}{\log(T)} = \sum_{(a,b)\notin\mathcal{O}^\star} N_{a,b}\,\Delta_{a,b}.$$

But thanks to Proposition 5 we have for all $(a,b)\notin\mathcal{O}^\star$, since user $b$ has a log-frequency $\beta_b$,

$$\sum_{b'\in\mathcal{B}_{a,b}} \mathrm{kl}(\mu_{a,b'}\,|\,\mu^\star_b-\omega_{b,b'})\,N_{a,b'} = \lim_{k\to\infty} \sum_{b'\in\mathcal{B}_{a,b}} \mathrm{kl}(\mu_{a,b'}\,|\,\mu^\star_b-\omega_{b,b'})\,\frac{\mathbb{E}_\nu[N_{a,b'}(T'_k)]}{\log(T'_k)} \geq \liminf_{k\to\infty} \sum_{b'\in\mathcal{B}_{a,b}} \mathrm{kl}(\mu_{a,b'}\,|\,\mu^\star_b-\omega_{b,b'})\,\frac{\mathbb{E}_\nu[N_{a,b'}(T'_k)]}{\log(N_b(T'_k))} \times \liminf_{k\to\infty} \frac{\log(N_b(T'_k))}{\log(T'_k)} \geq \beta_b.$$

Therefore we obtain the lower bound

$$\liminf_{T\to\infty} \frac{R(\nu,T)}{\log(T)} \geq C^\star_\omega(\beta,\nu) := \min_{n\in\mathbb{R}_+^{\mathcal{A}\times\mathcal{B}}} \sum_{(a,b)\notin\mathcal{O}^\star} n_{a,b}\,\Delta_{a,b} \quad \text{s.t.}\quad \forall (a,b)\notin\mathcal{O}^\star:\ \sum_{b'\in\mathcal{B}_{a,b}} \mathrm{kl}(\mu_{a,b'}\,|\,\mu^\star_b-\omega_{b,b'})\,n_{a,b'} \geq \beta_b.$$

Appendix C. Algorithms for the uncontrolled scenario

We gather in this section the algorithms of the uncontrolled variants of IMED-GS and IMED-GS⋆.

Algorithm 3 IMED-GS (uncontrolled scenario)

Require: Weight matrix $(\omega_{b,b'})_{b,b'\in\mathcal{B}}$.
for $t = 1 \ldots T$ do
    Pull $a_{t+1} \in \operatorname{argmin}_{a\in\mathcal{A}} \tilde I_{a,b_{t+1}}(t)$
end for

Algorithm 4 IMED-GS⋆ (uncontrolled scenario)

Require: Weight matrix $(\omega_{b,b'})_{b,b'\in\mathcal{B}}$.
$\forall a\in\mathcal{A}$, $c_a, c_a^+ \leftarrow 1$
$\forall b\in\mathcal{B}$, $\mathrm{E}(b) \leftarrow \emptyset$, $\mathrm{FE}(b) \leftarrow \emptyset$
for $t = 1 \ldots T$ do
    Choose $a_t \in \operatorname{argmin}_{a\in\mathcal{A}} \tilde I_{a,b_{t+1}}(t)$
    if $(a_t, b_{t+1}) \in \hat{\mathcal{O}}^\star(t)$ then
        Choose $a_{t+1} = a_t$
    else
        Choose $(a_t, b_t) \in \operatorname{argmin}_{(a,b)\in\mathcal{A}\times\mathcal{B}} \tilde I_{a,b}(t)$
        if $(a_t, b_t) \notin \hat{\mathcal{O}}^\star(t)$ then
            if $c_{a_t} = c^+_{a_t}$ then
                $c^+_{a_t} \leftarrow 2\, c^+_{a_t}$
                Choose $b_t \in \operatorname{argmin}_{b\in\mathcal{B}} N_{a_t,b}(t)$
                $\mathrm{FE}(b_t) \leftarrow a_t$
            else
                Choose $b_t \in \operatorname{argmax}_{b\in\hat{\mathcal{B}}_{a_t,b_t}(t)\cup\{b_t\}} N^{\mathrm{opt}}_{a_{t+1},b}(t) - N_{a_{t+1},b}(t)$
                $\mathrm{E}(b_t) \leftarrow a_t$
            end if
            $c_{a_{t+1}} \leftarrow c_{a_{t+1}} + 1$
        end if
        Priority rule in exploration phases:
        if $\mathrm{FE}(b_{t+1}) \neq \emptyset$ then
            Choose $a_{t+1} = \mathrm{FE}(b_{t+1})$ (delayed forced-exploration)
            $\mathrm{FE}(b_{t+1}) \leftarrow \emptyset$
        else if $\mathrm{E}(b_{t+1}) \neq \emptyset$ then
            Choose $a_{t+1} = \mathrm{E}(b_{t+1})$ (delayed exploration)
            $\mathrm{E}(b_{t+1}) \leftarrow \emptyset$
        else
            Choose $a_{t+1} = a_t$ (current exploration)
        end if
    end if
    Pull $a_{t+1}$
end for

Appendix D. IMED-GS⋆: Finite-time analysis

The IMED-GS⋆ strategy implies empirical lower and empirical upper bounds on the numbers of pulls (Lemma 14, Lemma 15). Based on concentration lemmas (see Appendix D.7), the strategy-based empirical lower bounds ensure the reliability of the estimators of interest (Lemma 19). Then, combining the reliability of these estimators with the obtained strategy-based empirical upper bounds, we get upper bounds on the average numbers of pulls (Proposition 20). We first show that the IMED-GS⋆ strategy is Pareto-optimal (for minimization problem 2) and that it is a consistent strategy which induces sequences of users with log-frequencies all equal to 1 (independently of the considered bandit configuration). From an asymptotic analysis, we then prove that the IMED-GS⋆ strategy is asymptotically optimal.

D.1 Strategy-based empirical bounds

IMED-GS⋆ strategy implies inequalities between the indexes that can be rewritten as inequalities on the numbers of pulls. While asymptotic analysis suggests that lower bounds involving $\log\big(N_{b_{t+1}}(t)\big)$ might be expected, we show in this non-asymptotic context lower bounds on the numbers of pulls involving instead the logarithm of the number of pulls of the currently chosen arm, $\log\big(N_{a_{t+1},b_{t+1}}(t)\big)$. In contrast, we provide upper bounds on $N_{a_{t+1},b_{t+1}}(t)$ involving $\log\big(N_{b_{t+1}}(t)\big)$.

We believe that establishing these empirical lower and upper bounds is a key element of our proof technique, which is of independent interest and not a priori restricted to the graph structure.

Lemma 14 (Empirical lower bounds)

Under IMED-GS⋆, at each step time $t \geq 1$, for all couples $(a,b) \notin \widehat{\mathcal{O}}^\star(t)$,
$$\log\big(N_{a_{t+1},b_{t+1}}(t)\big) \leq \sum_{b' \in \widehat{\mathcal{B}}_{a,b}(t)} N_{a,b'}(t)\, \mathrm{kl}\big(\widehat{\mu}_{a,b'}(t) \,\big|\, \widehat{\mu}^\star_b(t) - \omega_{b,b'}\big) + \log\big(N_{a,b'}(t)\big).$$
Furthermore, for all couples $(a,b) \in \widehat{\mathcal{O}}^\star(t)$,
$$N_{a_{t+1},b_{t+1}}(t) \leq N_{a,b}(t).$$

Proof

According to IMED-GS⋆ strategy (see Algorithm 2), $a_{t+1} = a_t$ and, for all couples $(a,b) \in \mathcal{A} \times \mathcal{B}$, $I_{a,b}(t) \geq I_{a_t,b_t}(t)$. There are three possible cases.

Case 1: $(a_{t+1}, b_{t+1}) = (a_t, b_t) \in \widehat{\mathcal{O}}^\star(t)$ and $I_{a_t,b_t}(t) = \log\big(N_{a_{t+1},b_{t+1}}(t)\big)$.

Case 2: $b_{t+1} \in \widehat{\mathcal{B}}_{a_t,b_t}(t) \cup \{b_t\}$ and
$$I_{a_t,b_t}(t) = \sum_{b' \in \widehat{\mathcal{B}}_{a_t,b_t}(t)} N_{a_t,b'}(t)\, \mathrm{kl}\big(\widehat{\mu}_{a_t,b'}(t) \,\big|\, \widehat{\mu}^\star_{b_t}(t) - \omega_{b_t,b'}\big) + \log\big(N_{a_t,b'}(t)\big).$$
Note that $b_t \in \widehat{\mathcal{B}}_{a_t,b_t}(t)$ except if $N_{a_t,b_t}(t) = 0$. Thus $I_{a_t,b_t}(t) \geq \log\big(N_{a_{t+1},b_{t+1}}(t)\big)$.

Case 3: $b_{t+1} \in \operatorname{argmin}_{b \in \mathcal{B}} N_{a_t,b}(t)$ and
$$I_{a_t,b_t}(t) \geq \min_{b \in \widehat{\mathcal{B}}_{a_t,b_t}(t)} \log\big(N_{a_{t+1},b}(t)\big) \geq \log\big(N_{a_{t+1},b_{t+1}}(t)\big).$$

This implies that, for all couples $(a,b) \in \mathcal{A} \times \mathcal{B}$, $I_{a,b}(t) \geq \log\big(N_{a_{t+1},b_{t+1}}(t)\big)$. For $(a,b) \notin \widehat{\mathcal{O}}^\star(t)$ we obtain
$$\log\big(N_{a_{t+1},b_{t+1}}(t)\big) \leq \sum_{b' \in \widehat{\mathcal{B}}_{a,b}(t)} N_{a,b'}(t)\, \mathrm{kl}\big(\widehat{\mu}_{a,b'}(t) \,\big|\, \widehat{\mu}^\star_b(t) - \omega_{b,b'}\big) + \log\big(N_{a,b'}(t)\big),$$
and for all couples $(a,b) \in \widehat{\mathcal{O}}^\star(t)$,
$$\log\big(N_{a_{t+1},b_{t+1}}(t)\big) \leq \log\big(N_{a,b}(t)\big).$$
Taking the exponential in the last inequality allows us to conclude.
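The indexes compared in the proof above can be sketched numerically. The following Python snippet is only an illustration under stated assumptions: the nested-list layout (`counts[a][b]`, `means[a][b]`, `omega[b][b2]`), the function names, and the toy numbers are hypothetical, and the set $\widehat{\mathcal{B}}_{a,b}(t)$ is taken to be the users $b'$ already pulled with arm $a$ whose empirical mean lies below $\widehat{\mu}^\star_b(t) - \omega_{b,b'}$, as suggested by the proofs of Lemma 14 and Lemma 17. Every couple is assumed to have been pulled at least once when it is a current optimal couple.

```python
import math

def kl_bernoulli(p, q):
    """Bernoulli divergence kl(p | q), with the convention 0 log 0 = 0."""
    tiny = 1e-12
    q = min(max(q, tiny), 1.0 - tiny)  # keep the logarithms finite
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total

def index(a, b, counts, means, omega):
    """Index of couple (a, b) in the form used in Lemma 14 (assumed layout)."""
    best = max(means[arm][b] for arm in range(len(means)))  # empirical best mean of user b
    if means[a][b] == best:
        # current optimal couple: the index reduces to log N_{a,b}(t)
        return math.log(counts[a][b])
    total = 0.0
    for b2 in range(len(omega)):
        target = best - omega[b][b2]
        # b2 contributes only if it was pulled and looks sub-optimal (assumption)
        if counts[a][b2] > 0 and means[a][b2] < target:
            total += counts[a][b2] * kl_bernoulli(means[a][b2], target)
            total += math.log(counts[a][b2])
    return total

# Toy configuration with two arms and two users (hypothetical values).
means = [[0.9, 0.8], [0.5, 0.6]]
counts = [[100, 80], [10, 20]]
omega = [[0.0, 0.1], [0.1, 0.0]]
print(index(0, 0, counts, means, omega), index(1, 0, counts, means, omega))
```

In this toy configuration the optimal couple (0, 0) has index $\log(100)$, while the sub-optimal couple (1, 0) has a larger index, so a rule minimizing the index would keep pulling the optimal couple until its logarithmic count catches up.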

Lemma 15 (Empirical upper bounds)

Under IMED-GS⋆, at each step time $t \geq 1$ such that $(a_{t+1}, b_{t+1}) \notin \widehat{\mathcal{O}}^\star(t)$ we have
$$\sum_{b' \in \widehat{\mathcal{B}}_{a_{t+1},b_t}(t)} N_{a_{t+1},b'}(t)\, \mathrm{kl}\big(\widehat{\mu}_{a_{t+1},b'}(t) \,\big|\, \widehat{\mu}^\star_{b_t}(t) - \omega_{b_t,b'}\big) \leq \log\big(N_{b_t}(t)\big)$$
and
$$\frac{N_{a_{t+1},b_{t+1}}(t)}{\log(t)} \leq \begin{cases} \dfrac{1}{\mathrm{kl}\big(\widehat{\mu}_{a_{t+1},b_t}(t) \,\big|\, \widehat{\mu}^\star_{b_t}(t)\big)} & \text{if } c_{a_{t+1}} = c^+_{a_{t+1}}, \\[2ex] \min\left\{ \dfrac{1}{\mathrm{kl}\big(\widehat{\mu}_{a_{t+1},b_{t+1}}(t) \,\big|\, \widehat{\mu}^\star_{b_t}(t) - \omega_{b_t,b_{t+1}}\big)},\ n^{\mathrm{opt}}_{a_{t+1},b_{t+1}}(t) \right\} & \text{otherwise.} \end{cases}$$

Proof

For all current optimal couples $(a,b) \in \widehat{\mathcal{O}}^\star(t)$, we have $I_{a,b}(t) = \log\big(N_{a,b}(t)\big) \leq \log\big(N_b(t)\big)$. This implies $I_{a_t,b_t}(t) \leq \log\big(N_{b_t}(t)\big)$. Furthermore, since $(a_{t+1}, b_{t+1}) \notin \widehat{\mathcal{O}}^\star(t)$, we have $a_{t+1} = a_t$ and the previous inequality implies
$$\sum_{b' \in \widehat{\mathcal{B}}_{a_{t+1},b_t}(t)} N_{a_{t+1},b'}(t)\, \mathrm{kl}\big(\widehat{\mu}_{a_{t+1},b'}(t) \,\big|\, \widehat{\mu}^\star_{b_t}(t) - \omega_{b_t,b'}\big) \leq \log\big(N_{b_t}(t)\big). \tag{8}$$
In the following we study separately the two cases $c_{a_{t+1}} = c^+_{a_{t+1}}$ and $c_{a_{t+1}} < c^+_{a_{t+1}}$.

Case 1: $c_{a_{t+1}} = c^+_{a_{t+1}}$. Then $b_{t+1} \in \operatorname{argmin}_{b \in \mathcal{B}} N_{a_{t+1},b}(t)$ and from Eq. 8 we get
$$N_{a_{t+1},b_{t+1}}(t) \leq N_{a_{t+1},b_t}(t) \leq \frac{\log(t)}{\mathrm{kl}\big(\widehat{\mu}_{a_{t+1},b_t}(t) \,\big|\, \widehat{\mu}^\star_{b_t}(t)\big)}.$$

Case 2: $c_{a_{t+1}} < c^+_{a_{t+1}}$. Then $b_{t+1} \in \operatorname{argmax}_{b \in \widehat{\mathcal{B}}_{a_t,b_t}(t) \cup \{b_t\}} N^{\mathrm{opt}}_{a_t,b}(t) - N_{a_t,b}(t)$ and we have
$$N_{a_{t+1},b_{t+1}}(t) \leq \frac{\log(t)}{\mathrm{kl}\big(\widehat{\mu}_{a_{t+1},b_{t+1}}(t) \,\big|\, \widehat{\mu}^\star_{b_t}(t) - \omega_{b_t,b_{t+1}}\big)}.$$
Moreover,
$$N_{a_{t+1},b_{t+1}}(t) \leq N^{\mathrm{opt}}_{a_{t+1},b_{t+1}}(t) = n^{\mathrm{opt}}_{a_{t+1},b_{t+1}}(t) \min_{b \in \mathcal{B}} I_{a_{t+1},b}(t) \leq n^{\mathrm{opt}}_{a_{t+1},b_{t+1}}(t) \log(t).$$

Lemma 16 ($N^{\mathrm{opt}}$ dominates $N$)

Under IMED-GS⋆, at each time step $t \geq 1$ such that $(a_t, b_t) \notin \widehat{\mathcal{O}}^\star(t)$ we have
$$\max_{b \in \widehat{\mathcal{B}}_{a_t,b_t}(t) \cup \{b_t\}} N^{\mathrm{opt}}_{a_t,b}(t) - N_{a_t,b}(t) \geq 0.$$

Proof

If $b_t \notin \widehat{\mathcal{B}}_{a_t,b_t}(t)$, then $N_{a_t,b_t}(t) = 0$ and $N^{\mathrm{opt}}_{a_t,b_t}(t) - N_{a_t,b_t}(t) \geq 0$. In the following we assume that $b_t \in \widehat{\mathcal{B}}_{a_t,b_t}(t) \neq \emptyset$.

From Eq. 6, since $\min_{b' \in \mathcal{B}} I_{a_t,b'}(t) = I_{a_t,b_t}(t)$, for all $b \in \widehat{\mathcal{B}}_{a_t,b_t}(t)$ we have
$$N^{\mathrm{opt}}_{a_t,b}(t) = n^{\mathrm{opt}}_{a_t,b}(t)\, I_{a_t,b_t}(t), \tag{9}$$
and, since $(a_t, b_t) \notin \widehat{\mathcal{O}}^\star(t)$, from Eq. 5 we get
$$\sum_{b' \in \widehat{\mathcal{B}}_{a_t,b_t}(t)} \mathrm{kl}\big(\widehat{\mu}_{a_t,b'}(t) \,\big|\, \widehat{\mu}^\star_{b_t}(t) - \omega_{b_t,b'}\big)\, n^{\mathrm{opt}}_{a_t,b'}(t) \geq 1. \tag{10}$$
Then Eq. 9 and 10 imply
$$\sum_{b' \in \widehat{\mathcal{B}}_{a_t,b_t}(t)} N^{\mathrm{opt}}_{a_t,b'}(t)\, \mathrm{kl}\big(\widehat{\mu}_{a_t,b'}(t) \,\big|\, \widehat{\mu}^\star_{b_t}(t) - \omega_{b_t,b'}\big) \geq I_{a_t,b_t}(t).$$
Hence, from the definitions of the indexes (Eq. 4), this implies
$$\sum_{b' \in \widehat{\mathcal{B}}_{a_t,b_t}(t)} N^{\mathrm{opt}}_{a_t,b'}(t)\, \mathrm{kl}\big(\widehat{\mu}_{a_t,b'}(t) \,\big|\, \widehat{\mu}^\star_{b_t}(t) - \omega_{b_t,b'}\big) \geq \sum_{b' \in \widehat{\mathcal{B}}_{a_t,b_t}(t)} N_{a_t,b'}(t)\, \mathrm{kl}\big(\widehat{\mu}_{a_t,b'}(t) \,\big|\, \widehat{\mu}^\star_{b_t}(t) - \omega_{b_t,b'}\big). \tag{11}$$
Since we assume $b_t \in \widehat{\mathcal{B}}_{a_t,b_t}(t) \neq \emptyset$, Eq. 11 implies
$$\max_{b \in \widehat{\mathcal{B}}_{a_t,b_t}(t) \cup \{b_t\}} N^{\mathrm{opt}}_{a_t,b}(t) - N_{a_t,b}(t) = \max_{b \in \widehat{\mathcal{B}}_{a_t,b_t}(t)} N^{\mathrm{opt}}_{a_t,b}(t) - N_{a_t,b}(t) \geq 0.$$
D.2 Reliable current best arm and means

In this subsection, we consider the subset $\mathcal{T}_{\varepsilon,\gamma}$ of times where everything behaves well, that is: the current best couples correspond to the true ones, and the empirical means of the best couples and of the couples pulled at least proportionally (with a coefficient $\gamma \in (0,1)$) to the number of pulls of the currently chosen couple are $\varepsilon$-accurate for $0 < \varepsilon < \varepsilon_\nu$, i.e.
$$\mathcal{T}_{\varepsilon,\gamma} := \left\{ t \geq 1 :\ \widehat{\mathcal{O}}^\star(t) = \mathcal{O}^\star,\ \ \forall (a,b) \text{ s.t. } N_{a,b}(t) \geq \gamma\, N_{a_{t+1},b_{t+1}}(t) \text{ or } (a,b) \in \mathcal{O}^\star,\ |\widehat{\mu}_{a,b}(t) - \mu_{a,b}| < \varepsilon \right\}.$$
We will show that its complement is finite on average. In order to prove this we decompose the set $\mathcal{T}_{\varepsilon,\gamma}$ in the following way. Let $\mathcal{E}_{\varepsilon,\gamma}$ be the set of times where the means are well estimated,
$$\mathcal{E}_{\varepsilon,\gamma} := \left\{ t \geq 1 :\ \forall (a,b) \text{ s.t. } N_{a,b}(t) \geq \gamma\, N_{a_{t+1},b_{t+1}}(t) \text{ or } (a,b) \in \widehat{\mathcal{O}}^\star(t),\ |\widehat{\mu}_{a,b}(t) - \mu_{a,b}| < \varepsilon \right\},$$
and $\Lambda_\varepsilon$ the set of times where the mean of a couple that is neither currently optimal nor pulled is underestimated,
$$\Lambda_\varepsilon := \left\{ t \geq 1 \ \middle|\ \exists a \in \mathcal{A},\ \mathcal{B}' \subset \mathcal{B} :\ \begin{array}{l} \forall b \in \mathcal{B}',\ \widehat{\mu}_{a,b}(t) < \mu_{a,b} - \varepsilon \\ \log\big(N_{a_{t+1},b_{t+1}}(t)\big) \leq \sum_{b \in \mathcal{B}'} N_{a,b}(t)\, \mathrm{kl}\big(\widehat{\mu}_{a,b}(t) \,\big|\, \mu_{a,b} - \varepsilon_\nu\big) + \log\big(N_{a,b}(t)\big) \end{array} \right\}.$$
Then we prove below the following inclusion.
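To make the definition of the set of good times concrete, the membership test can be sketched in Python as follows. This is a minimal illustration and not part of the analysis: the function name, the nested-list layout (`counts[a][b]`, `emp_means[a][b]`, `true_means[a][b]`), and the toy values are assumptions, and the true means are of course unknown to the learner, so such a check only makes sense in a simulation.

```python
def in_good_set(counts, emp_means, true_means, chosen, eps, gamma):
    """Check whether the current time belongs to the set of good times (assumed layout)."""
    n_arms, n_users = len(true_means), len(true_means[0])

    def optimal_couples(means):
        # for each user b, collect the arms attaining the best mean
        result = set()
        for b in range(n_users):
            best = max(means[a][b] for a in range(n_arms))
            result |= {(a, b) for a in range(n_arms) if means[a][b] == best}
        return result

    true_opt = optimal_couples(true_means)
    if optimal_couples(emp_means) != true_opt:
        return False  # the current best couples differ from the true ones
    a_next, b_next = chosen
    threshold = gamma * counts[a_next][b_next]
    for a in range(n_arms):
        for b in range(n_users):
            # only well-sampled couples and optimal couples must be accurate
            if counts[a][b] >= threshold or (a, b) in true_opt:
                if abs(emp_means[a][b] - true_means[a][b]) >= eps:
                    return False
    return True

# Toy configuration (hypothetical values): two arms, two users.
true_means = [[0.9, 0.8], [0.5, 0.6]]
emp_means = [[0.88, 0.81], [0.52, 0.3]]
counts = [[100, 80], [40, 2]]
print(in_good_set(counts, emp_means, true_means, (0, 0), eps=0.05, gamma=0.5))
print(in_good_set(counts, emp_means, true_means, (0, 0), eps=0.05, gamma=0.01))
```

With $\gamma = 0.5$ the poorly estimated couple (1, 1) is ignored because it has been pulled too rarely relative to the chosen couple; with a much smaller $\gamma$ it enters the accuracy requirement and the time is no longer good.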

Lemma 17 (Relations between the subsets of times)

For $0 < \varepsilon < \varepsilon_\nu$ and $\gamma \in (0,1)$,
$$\mathcal{T}^c_{\varepsilon,\gamma} \setminus \mathcal{E}^c_{\varepsilon,\gamma} \subset \Lambda_{\varepsilon_\nu}. \tag{12}$$
Refer to Appendix A for the definition of $\varepsilon_\nu$.

Proof

For all users $b \in \mathcal{B}$ it is assumed that there exists a unique optimal arm $a^\star_b \in \mathcal{A}$ such that $(a^\star_b, b) \in \mathcal{O}^\star$. We have $\mathcal{O}^\star = \bigcup_{b \in \mathcal{B}} \{(a^\star_b, b)\}$. In particular, for all time steps $t \geq 1$, if $\widehat{\mathcal{O}}^\star(t) \neq \mathcal{O}^\star$ then there exist $b \in \mathcal{B}$ and $\widehat{a}^\star_b \in \mathcal{A}$ such that $(\widehat{a}^\star_b, b) \in \widehat{\mathcal{O}}^\star(t) \setminus \mathcal{O}^\star$ (and $\widehat{a}^\star_b \neq a^\star_b$).

Let $t \in \mathcal{T}^c_{\varepsilon,\gamma} \setminus \mathcal{E}^c_{\varepsilon,\gamma}$. Then $\widehat{\mathcal{O}}^\star(t) \neq \mathcal{O}^\star$ and there exist $b \in \mathcal{B}$, $\widehat{a}^\star_b \in \mathcal{A}$ such that $(\widehat{a}^\star_b, b) \in \widehat{\mathcal{O}}^\star(t) \setminus \mathcal{O}^\star$. Thus we know that $\widehat{a}^\star_b \neq a^\star_b$. In particular, we have $\mu_{a^\star_b,b} = \mu^\star_b > \mu_{\widehat{a}^\star_b,b} + 2\varepsilon_\nu$. Since $t \in \mathcal{E}_{\varepsilon,\gamma}$ and $\varepsilon < \varepsilon_\nu$, this implies
$$\mu_{a^\star_b,b} > \widehat{\mu}_{\widehat{a}^\star_b,b}(t) + \varepsilon_\nu = \widehat{\mu}^\star_b(t) + \varepsilon_\nu \geq \widehat{\mu}_{a^\star_b,b}(t) + \varepsilon \tag{13}$$
and $(a^\star_b, b) \notin \widehat{\mathcal{O}}^\star(t)$. From Lemma 14 we have the following empirical lower bound
$$\log\big(N_{a_{t+1},b_{t+1}}(t)\big) \leq \sum_{b' \in \widehat{\mathcal{B}}_{a^\star_b,b}(t)} N_{a^\star_b,b'}(t)\, \mathrm{kl}\big(\widehat{\mu}_{a^\star_b,b'}(t) \,\big|\, \widehat{\mu}^\star_b(t) - \omega_{b,b'}\big) + \log\big(N_{a^\star_b,b'}(t)\big). \tag{14}$$
In particular, for all $b' \in \widehat{\mathcal{B}}_{a^\star_b,b}(t)$ we have $\widehat{\mu}_{a^\star_b,b'}(t) < \widehat{\mu}^\star_b(t) - \omega_{b,b'}$, and Eq. 13 together with the structure assumption $\mu_{a^\star_b,b} - \mu_{a^\star_b,b'} \leq \omega_{b,b'}$ implies
$$\widehat{\mu}_{a^\star_b,b'}(t) < \widehat{\mu}^\star_b(t) - \omega_{b,b'} < \mu_{a^\star_b,b} - \varepsilon_\nu - \omega_{b,b'} \leq \mu_{a^\star_b,b'} - \varepsilon_\nu, \tag{15}$$
and the monotonicity properties of $\mathrm{kl}(\cdot|\cdot)$ imply
$$\mathrm{kl}\big(\widehat{\mu}_{a^\star_b,b'}(t) \,\big|\, \widehat{\mu}^\star_b(t) - \omega_{b,b'}\big) \leq \mathrm{kl}\big(\widehat{\mu}_{a^\star_b,b'}(t) \,\big|\, \mu_{a^\star_b,b'} - \varepsilon_\nu\big). \tag{16}$$
Therefore, by combining Eq. 14, 15 and 16, we have for such $t$
$$\forall b' \in \widehat{\mathcal{B}}_{a^\star_b,b}(t),\quad \widehat{\mu}_{a^\star_b,b'}(t) < \mu_{a^\star_b,b'} - \varepsilon_\nu$$
and
$$\log\big(N_{a_{t+1},b_{t+1}}(t)\big) \leq \sum_{b' \in \widehat{\mathcal{B}}_{a^\star_b,b}(t)} N_{a^\star_b,b'}(t)\, \mathrm{kl}\big(\widehat{\mu}_{a^\star_b,b'}(t) \,\big|\, \mu_{a^\star_b,b'} - \varepsilon_\nu\big) + \log\big(N_{a^\star_b,b'}(t)\big),$$
which concludes the proof.

Using classical concentration arguments we prove in Appendix D.8 the following upper bounds.

Lemma 18 (Bounded subsets of times)

For < ε < ε ν and γ ∈ (0 , / , E ν [ (cid:12)(cid:12) E cε,γ (cid:12)(cid:12) ] (cid:54) γε |A| |B| E ν [ | Λ ε ν | ] (cid:54) |A| |B| (1 + E ν ) |B| . Refer to Appendix A for the deﬁnitions of ε ν and E ν . Thus combining them with (12) we obtain E ν [ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) ] (cid:54) E ν [ (cid:12)(cid:12) E cε,γ (cid:12)(cid:12) ] + E ν [ | Λ ε ν | ] (cid:54) γε |A| |B| + 2 |A| |B| (1 + E ν ) |B| . Hence, we just proved the following lemma.

Lemma 19 (Reliable estimators)

For < ε < ε ν and γ ∈ (0 , / , E ν [ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) ] (cid:54) γε |A| |B| + 2 |A| |B| (1 + E ν ) |B| . Refer to Appendix A for the deﬁnitions of ε ν and E ν . D.3 Pareto-optimality and upper bounds on the numbers of pulls of sub-optimal arms

In this section, we combine the different results of the previous sections to prove the following proposition.

Proposition 20 (Upper bounds)

Let ν ∈ D ω . Let < ε < ε ν and γ ∈ (0 , / . Let us consider T ε,γ := (cid:26) t (cid:62) (cid:98) O (cid:63) ( t ) = O (cid:63) ∀ ( a, b ) s.t. N a,b ( t ) (cid:62) γ N a t +1 ,b t +1 ( t ) or ( a, b ) ∈ O (cid:63) , | (cid:98) µ a,b ( t ) − µ a,b | < ε (cid:27) . Then under

IMED-GS (cid:63) strategy, E ν [ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) ] (cid:54) γε |A| |B| + 2 |A| |B| (1 + E ν ) |B| nd for all horizon time T (cid:62) , ∀ a ∈ A , min b : ( a,b ) / ∈O (cid:63) N b ( T )) (cid:88) b (cid:48) ∈B a,b N a,b (cid:48) ( T ) kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) (cid:54) (1 + α ν ( ε )) (cid:20) γ M ν m ν (cid:21) + M ν |T ε,γ | min b ∈B log( N b ( T )) where m ν and M ν are deﬁned as follows: m ν = min ( a,b ) / ∈O (cid:63) b (cid:48) ∈B a,b kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) , M ν = max ( a,b ) / ∈O (cid:63) (cid:88) b (cid:48) ∈B a,b kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) . Furthermore, we have ∀ ( a, b ) / ∈ O (cid:63) , N a,b ( T ) (cid:54) α ν ( ε ) m ν log( N b ( T )) + (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) . Refer to Appendix A for the deﬁnitions of ε ν , α ν ( · ) and E ν . Proof

From Lemma 19, we have: E ν [ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) ] (cid:54) γε |A| |B| + 2 |A| |B| (1 + E ν ) |B| . Let a ∈ A . Let us consider (cid:54) t (cid:54) T such that a t +1 = a , ( a t +1 , b t +1 ) / ∈ O (cid:63) and t ∈ T ε,γ . Then,according to IMED-GS (cid:63) strategy (see Algorithm 2), we have ( a, b t ) = ( a t , b t ) / ∈ (cid:98) O (cid:63) ( t ) . From Lemma 15 this implies (cid:88) b (cid:48) ∈ (cid:98) B a,bt ( t ) N a,b (cid:48) ( t ) kl (cid:16)(cid:98) µ a,b (cid:48) ( t ) (cid:12)(cid:12)(cid:12)(cid:98) µ (cid:63)b t ( t ) − ω b t ,b (cid:48) (cid:17) (cid:54) log( N b ( T )) . (17)Since t ∈ T ε,γ and ε < ε ν , we have (cid:8) b (cid:48) ∈ B a,b t : N a,b (cid:48) ( t ) (cid:62) γN a t +1 ,b t +1 ( t ) (cid:9) ⊂ (cid:98) B a,b t ( t ) . (18)Combining inequality (17) with inclusion (18), it comes (cid:88) b (cid:48) ∈B a,bt : N a,b (cid:48) ( t ) (cid:62) γN at +1 ,bt +1 ( t ) N a,b (cid:48) ( t ) kl (cid:16)(cid:98) µ a,b (cid:48) ( t ) (cid:12)(cid:12)(cid:12)(cid:98) µ (cid:63)b t ( t ) − ω b t ,b (cid:48) (cid:17) (cid:54) log( N b ( T )) . (19)Since t ∈ T ε,γ , we have (cid:12)(cid:12)(cid:12)(cid:98) µ (cid:63)b t ( t ) − µ (cid:63)b t (cid:12)(cid:12)(cid:12) < ε and ∀ b (cid:48) ∈ B a,b t s.t. N a,b (cid:48) ( t ) (cid:62) γN a t +1 ,b t +1 ( t ) , (cid:12)(cid:12)(cid:98) µ a,b (cid:48) ( t ) − µ a,b (cid:48) (cid:12)(cid:12) < ε . (20)By construction of α ν ( · ) (see Section A), since ε < ε ν , inequalities (19) and (20) give us (cid:88) b (cid:48) ∈B a,bt : N a,b (cid:48) ( t ) (cid:62) γN at +1 ,bt +1 ( t ) N a,b (cid:48) ( t ) kl (cid:16) µ a,b (cid:48) (cid:12)(cid:12)(cid:12) µ (cid:63)b t − ω b t ,b (cid:48) (cid:17) (cid:54) (1 + α ν ( ε ))) log( N b ( T )) . (21)30his implies (cid:88) b (cid:48) ∈B a,bt N a,b (cid:48) ( t ) kl (cid:16) µ a,b (cid:48) (cid:12)(cid:12)(cid:12) µ (cid:63)b t − ω b t ,b (cid:48) (cid:17) (cid:54) (1 + α ν ( ε ))) log( N b ( T )) + γM ν N a t +1 ,b t +1 ( t ) . 
Furthermore, using inequality (21), we get N a t +1 ,b t +1 ( t ) (cid:54)  N a,b t ( t ) (cid:54) (1 + α ν ( ε )) log( N b ( T )) kl (cid:16) µ a,b t (cid:12)(cid:12)(cid:12) µ (cid:63)b t (cid:17) if c a = c + a , (1 + α ν ( ε )) log( N b ( T )) kl (cid:16) µ a t +1 ,b t +1 (cid:12)(cid:12)(cid:12) µ (cid:63)b t − ω b t ,b t +1 (cid:17) if c a < c + a . (22)Thus, we have shown that for all arm a ∈ A , for all time step (cid:54) t (cid:54) T such that a t +1 = a , ( a t +1 , b t +1 ) / ∈ O (cid:63) and t ∈ T ε,γ : min b : ( a,b ) / ∈O (cid:63) N b ( T )) (cid:88) b (cid:48) ∈B a,b N a,b (cid:48) ( t ) kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) (cid:54) (1 + α ν ( ε ))) (cid:18) γ M ν m ν (cid:19) and N a t +1 ,b t +1 ( t ) (cid:54) (1 + α ν ( ε )) log( N b ( T )) m ν . This implies for all arm a ∈ A and for all time step (cid:54) t (cid:54) T , min b : ( a,b ) / ∈O (cid:63) N b ( T )) (cid:88) b (cid:48) ∈B a,b N a,b (cid:48) ( T ) kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) (cid:54) (1 + α ν ( ε )) (cid:20) γ M ν m ν (cid:21) + M ν |T ε,γ | min b ∈B log( N b ( T )) and ∀ b : ( a, b ) / ∈ O (cid:63) , N a,b ( T ) (cid:54) (1 + α ν ( ε )) log( N b ( T )) m ν + (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) . It can be easily proved that under

IMED-GS⋆, $N_b(T) \to \infty$ for all $b \in \mathcal{B}$ (see Lemma 27). From Proposition 20, we deduce the following corollary by letting $T \to \infty$, then $\varepsilon, \gamma \to 0$.

Corollary 21 (Pareto optimality)

Let ν ∈ D ω . Let a ∈ A such that { b ∈ B : ( a, b ) / ∈ O (cid:63) } (cid:54) = ∅ .Then, we have lim sup T →∞ min b : ( a,b ) / ∈O (cid:63) N b ( T )) (cid:88) b (cid:48) ∈B a,b N a,b (cid:48) ( T ) kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) (cid:54) . D.4

IMED-GS⋆ is consistent and induces sequences of users with log-frequencies $1_{\mathcal{B}}$

In this section we show that IMED-GS⋆ is a consistent strategy that induces sequences of users with log-frequencies all equal to 1, independently from the considered bandit configuration in $\mathcal{D}$.

Lemma 22 (Consistency, log-frequencies $1_{\mathcal{B}}$) IMED-GS⋆ is a consistent strategy and induces sequences of users with log-frequencies all equal to 1.

Proof

We first show that

IMED-GS (cid:63) induces sequences of users with log-frequencies all equal to .Let ν ∈ D ω and let us consider an horizon T (cid:62) . Let < ε < ε ν and γ ∈ (0 , / . Let us consideragain the set of times T ε,γ = (cid:26) T (cid:62) t (cid:62) (cid:98) O (cid:63) ( t ) = O (cid:63) ∀ ( a, b ) s.t. N a,b ( t ) (cid:62) γ N a t +1 ,b t +1 ( t ) or ( a, b ) ∈ O (cid:63) , | (cid:98) µ a,b ( t ) − µ a,b | < ε (cid:27) . Then, according to Proposition 20, under

IMED-GS (cid:63) strategy, E ν [ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) ] (cid:54) γε |A| |B| + 2 |A| |B| (1 + E ν ) |B| < ∞ . (23)and for all horizon time T (cid:62) , for all ( a, b ) / ∈ O (cid:63) , N a,b ( T ) (cid:54) α ν ( ε ) m ν log( N b ( T )) + (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) (cid:54) α ν ( ε ) m ν log( T ) + (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) , (24)where m ν = min ( a,b ) / ∈O (cid:63) b (cid:48) ∈B a,b kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) , and ε ν , α ν ( · ) , E ν deﬁned in Appendix A.Note that, under IMED-GS (cid:63) , for all t (cid:62) such that ( a t +1 , b t +1 ) ∈ (cid:98) O (cid:63) ( t ) we have ( a t +1 , b t +1 ) ∈ argmin ( a,b ) ∈ (cid:98) O (cid:63) ( t ) N a,b ( t ) . This implies by deﬁnition of T ε,γ that ∀ ( a, b ) , ( a (cid:48) , b (cid:48) ) ∈ O (cid:63) , (cid:12)(cid:12) N a,b ( T ) − N a (cid:48) ,b (cid:48) ( T ) (cid:12)(cid:12) (cid:54) (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) + 1 . (25)Indeed the difference of pulls between two optimal couples is non-decreasing only at times t (cid:62) such that the difference is greater than and (cid:98) O (cid:63) ( t ) (cid:54) = O (cid:63) . Combining Eq. 24 and 25 we get min b ∈B N b ( T ) (cid:62) min ( a,b ) ∈O (cid:63) N a,b ( T ) − (cid:62) max ( a,b ) ∈O (cid:63) N a,b ( T ) − (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) − (Eq. 24) (cid:62) |B| (cid:88) ( a,b ) ∈O (cid:63) N a,b ( T ) − (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) −

1= 1 |B|  T − (cid:88) ( a,b ) / ∈O (cid:63) N a,b ( T )  − (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) − (cid:62) |B| (cid:18) T − ( |A| − |B| (cid:20) α ν ( ε ) m ν log( T ) + (cid:12)(cid:12) T cε,γ (cid:12)(cid:12)(cid:21)(cid:19) − (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) − (Eq. 25) (cid:62) T |B| − |A| α ν ( ε ) m ν log( T ) − |A| (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) − |A| . Since E ν (cid:2)(cid:12)(cid:12) T cε,γ (cid:12)(cid:12)(cid:3) < ∞ (see Eq. 23), this implies that IMED-GS (cid:63) induces sequences of users withlog-frequencies all equal to . 32e show the consistency of IMED-GS (cid:63) in the following. Let ( a, b ) / ∈ O (cid:63) and α ∈ (0 , . Accordingto Proposition 20, N a,b ( T ) (cid:54) α ν ( ε ) m ν log( N b ( T )) + lim sup T →∞ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) , and monotone convergence theorem ensures E ν [lim sup T →∞ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) ] = lim sup T →∞ E ν [ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) ] (cid:54) γε |A| |B| + 2 |A| |B| (1 + E ν ) |B| < ∞ . This implies N a,b ( T ) N b ( T ) α (cid:54) α ν ( ε ) m ν log( N b ( T )) N b ( T ) α + lim sup T →∞ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) N b ( T ) α , and, taking the expectation, dominated convergence theorem implies E ν (cid:20) N a,b ( T ) N b ( T ) α (cid:21) (cid:54) E ν  α ν ( ε ) m ν log( N b ( T )) N b ( T ) α + lim sup T →∞ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) N b ( T ) α  → . Indeed, it can be easily shown that under

IMED-GS (cid:63) N b ( T ) → ∞ (see Lemma 27). This implies lim sup T →∞ E ν (cid:20) N a,b ( T ) N b ( T ) α (cid:21) = 0 . D.5 The counters c a and c + a coincide at most O (log(log( T ))) times Let us consider < ε < ε ν and γ ∈ (0 , / . Let us introduce T c ( T ) := (cid:110) t ∈ T ε,γ : ( a t +1 , b t +1 ) / ∈ O (cid:63) and c a t +1 ( t ) = c + a t +1 ( t ) (cid:111) , where T ε,γ is deﬁne as in Appendix D.2.In this section, we want to bound |T c ( T ) | . Lemma 23

Let < ε < ε ν and γ ∈ (0 , / . Let us consider an horizon T (cid:62) . Then, it holds |T c ( T ) | (cid:54) |A| + |A| log (cid:18) (1 + α ν ( ε )) |B| m ν log( T ) + |B| (cid:12)(cid:12) T cε,γ (cid:12)(cid:12)(cid:19) . where m ν = min ( a,b ) / ∈O (cid:63) b (cid:48) ∈B a,b kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) , and ε ν , α ν ( · ) deﬁned in Appendix A. roof From Lemma 24, we get: |T c ( T ) | (cid:54) |A| + (cid:88) a ∈A log ( c a ( T )) . Then applying Lemma 25, it comes: |T c ( T ) | (cid:54) |A| + (cid:88) a ∈A log  (cid:88) b : ( a,b ) / ∈O (cid:63) N a,b ( T ) + (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) . We end the proof by combining the previous inequality with Proposition 20 that ensures ∀ ( a, b ) / ∈ O (cid:63) , N a,b ( T ) (cid:54) (1 + α ν ( ε )) m ν log( T ) + (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) . Lemma 24

Let < ε < ε ν and γ ∈ (0 , / . Let us consider an horizon T (cid:62) . Then, it holds |T c ( T ) | (cid:54) |A| + (cid:88) a ∈A log ( c a ( T )) . Refer to Appendix A for the deﬁnition of ε ν . Proof

Let a ∈ A . By construction of ( c a ( t )) (cid:54) t (cid:54) T and ( c + a ( t )) (cid:54) t (cid:54) T , we have c + a ( T ) = 2 T (cid:80) t =1 I { ca ( t )= c + a ( t ) } − . (26)Furthermore, the following inequalities are satisﬁed |T c ( T ) | (cid:54) (cid:88) a ∈A T (cid:88) t =1 I { c a ( t )= c + a ( t ) } and ∀ a ∈ A , c + a ( T ) (cid:54) c a ( T ) . (27)Then Eq. 26 and 27 imply |T c ( T ) |−|A| (cid:54) |A| (cid:89) a ∈A c a ( T ) . Lemma 25

Let < ε < ε ν and γ ∈ (0 , / . Let us consider an horizon T (cid:62) . Then, it holds ∀ a ∈ A , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) c a ( T ) − (cid:88) b : ( a,b ) / ∈O (cid:63) N a,b ( T ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:54) (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) . Refer to Appendix A for the deﬁnition of ε ν . Proof

Let a ∈ A . At each time step t (cid:62) we increment c a ( t ) only if ( a t +1 , b t +1 ) / ∈ (cid:98) O (cid:63) ( t ) and a t +1 = a . Then, if t ∈ T ε,γ , we have (cid:98) O (cid:63) ( t ) = O (cid:63) and we increment c a ( t ) only if we increment oneof the N a,b ( t ) for b ∈ B such that ( a, b ) / ∈ O (cid:63) . 34 .6 All couples ( a, b ) ∈ A×B are asymptotically pulled an inﬁnite number of times Let < ε < ε ν (deﬁned in Appendix A) and γ ∈ (0 , / . Let us consider T ε,γ = (cid:26) t (cid:62) (cid:98) O (cid:63) ( t ) = O (cid:63) ∀ ( a, b ) s.t. N a,b ( t ) (cid:62) γ N a t +1 ,b t +1 ( t ) or ( a, b ) ∈ O (cid:63) , | (cid:98) µ a,b ( t ) − µ a,b | < ε (cid:27) . Then, according to Proposition 20, under

IMED-GS (cid:63) strategy, E ν [ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) ] < ∞ . In particular, almostsurely (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) < ∞ . Lemma 26 (The indexes tend to inﬁnity)

For any strategy we have $\lim_{t \to \infty} N_{a_{t+1},b_{t+1}}(t) = \infty$ and, under IMED-GS⋆,
$$\forall (a,b) \in \mathcal{A} \times \mathcal{B}, \quad \lim_{t \to \infty} I_{a,b}(t) = \infty.$$

Proof

For all couples $(a,b) \in \mathcal{A} \times \mathcal{B}$ such that $N_{a,b}(\infty) < \infty$, we have $\mathbb{I}_{\{(a_{t+1},b_{t+1}) = (a,b)\}} \to 0$. Then
$$\sum_{(a,b) \in \mathcal{A} \times \mathcal{B} :\ N_{a,b}(\infty) = \infty} \mathbb{I}_{\{(a_{t+1},b_{t+1}) = (a,b)\}} \to 1.$$
This implies
$$\sum_{(a,b) :\ N_{a,b}(\infty) = \infty} \mathbb{I}_{\{(a_{t+1},b_{t+1}) = (a,b)\}}\, N_{a,b}(t) \geq \min_{(a,b) :\ N_{a,b}(\infty) = \infty} N_{a,b}(t) \sum_{(a,b) :\ N_{a,b}(\infty) = \infty} \mathbb{I}_{\{(a_{t+1},b_{t+1}) = (a,b)\}} \longrightarrow \infty.$$
Thus, since
$$N_{a_{t+1},b_{t+1}}(t) = \sum_{(a,b) :\ N_{a,b}(\infty) < \infty} \mathbb{I}_{\{(a_{t+1},b_{t+1}) = (a,b)\}}\, N_{a,b}(t) + \sum_{(a,b) :\ N_{a,b}(\infty) = \infty} \mathbb{I}_{\{(a_{t+1},b_{t+1}) = (a,b)\}}\, N_{a,b}(t),$$
we have $N_{a_{t+1},b_{t+1}}(t) \longrightarrow \infty$. Furthermore, under IMED-GS⋆ strategy we have
$$\forall (a,b) \in \mathcal{A} \times \mathcal{B}, \quad I_{a,b}(t) \geq \log\big(N_{a_{t+1},b_{t+1}}(t)\big),$$
which ends the proof.

Lemma 27 (The numbers of pulls tend to infinity) Under

IMED-GS (cid:63) the numbers of pulls almostsurely satisfy ∀ ( a, b ) ∈ A × B , N a,b ( T ) → ∞ . In particular, almost surely for all ( a, b ) ∈ A × B , lim T →∞ (cid:98) µ a,b ( T ) = µ a,b . Proof

Lemma 26 ensures lim T →∞ I a,b ( T ) = ∞ , for all ( a, b ) ∈ A × B .Let ( a, b ) ∈ O (cid:63) . Since (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) < ∞ and ∀ T ∈ T ε,γ , (cid:98) O (cid:63) ( T ) = O (cid:63) , we have for all ( a, b ) ∈ O (cid:63) lim T →∞ log( N a,b ( T )) = lim T →∞ I a,b ( T ) = ∞ . Then, for all ( a, b ) ∈ O (cid:63) , lim T →∞ N a,b ( T ) = ∞ and lim T →∞ (cid:98) µ a,b ( T ) = µ a,b .Let ( a, b ) / ∈ O (cid:63) and let T ∈ T ε,γ . Then, the following inequalities occur I a,b ( T ) (cid:54) (cid:88) b (cid:48) : ( a,b (cid:48) ) / ∈O (cid:63) N a,b (cid:48) ( t ) kl ( (cid:98) µ a,b (cid:48) ( t ) | (cid:98) µ (cid:63)b ( t ) − ω b,b (cid:48) ) + log(max(1 , N a,b (cid:48) ( t ))) (28)and ∀ b ∈ B , (cid:98) µ (cid:63)b ( t ) < − ε ν . (29)Since (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) < ∞ , Eq. 28 and 29 imply lim T →∞ (cid:88) b (cid:48) : ( a,b (cid:48) ) / ∈O (cid:63) N a,b (cid:48) ( T ) → ∞ . Then, since (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) < ∞ , from Lemma 25 we get lim T →∞ c a ( T ) = ∞ . This implies argmax (cid:8) t ∈ (cid:74) , T (cid:75) : c a ( t ) = c + a ( t ) (cid:9) → ∞ and lim T →∞ min b ∈B N a,b ( T ) = ∞ . D.7 Concentration lemmas

We state two concentration lemmas that do not depend on the followed strategy. Lemma 28 comes from Lemma B.1 in Combes and Proutiere (2014) and Lemma 29 comes from Lemma 14 in Honda and Takemura (2015). Proofs are provided in Appendix F.

Lemma 28 (Concentration inequalities)

Let ν ∈ D ω . For all < ε, γ (cid:54) / and for all couples ( a, b ) , ( a (cid:48) , b (cid:48) ) ∈ A×B , E ν (cid:34)(cid:88) t (cid:62) I { ( a t +1 ,b t +1 )=( a,b ) , N a (cid:48) ,b (cid:48) ( t ) (cid:62) γN a,b ( t ) , | (cid:98) µ a (cid:48) ,b (cid:48) ( t ) − µ a (cid:48) ,b (cid:48) | (cid:62) ε } (cid:35) (cid:54) γε . emma 29 (Large deviation probabilities) Let ν ∈ D ω . For all couple ( a, b ) ∈ A × B , for all < µ < µ a,b , E ν (cid:34)(cid:88) n (cid:62) I { (cid:98) µ na,b <µ } n exp( n kl ( (cid:98) µ na,b | µ )) (cid:35) (cid:54) e (cid:18) − log(1 − µ )log(1 − µ a,b ) (cid:19) - (cid:32) − e − (cid:18) − log(1 − µ )log(1 − µa,b ) (cid:19) kl ( µ a,b | µ ) (cid:33) - , where (cid:98) µ na,b estimates µ a,b after n pulls of couple ( a, b ) (see Appendix A). D.8 Proof of Lemma 18

Using Lemma 14, for all time step t (cid:62) , we have ∀ ( a (cid:48) , b (cid:48) ) ∈ (cid:98) O (cid:63) ( t ) , N a (cid:48) ,b (cid:48) ( t ) (cid:62) N a t +1 ,b t +1 ( t ) (cid:62) γ N a t +1 ,b t +1 ( t ) . Then, based on the concentration inequalities from Lemma 28, we obtain E ν [ (cid:12)(cid:12) E cε,γ (cid:12)(cid:12) ] (cid:54) (cid:88) ( a,b ) , ( a (cid:48) ,b (cid:48) ) ∈A×B E ν (cid:34)(cid:88) t (cid:62) I { ( a t +1 ,b t +1 )=( a,b ) , N a (cid:48) ,b (cid:48) ( t ) (cid:62) γN a,b ( t ) , | (cid:98) µ a (cid:48) ,b (cid:48) ( t ) − µ a (cid:48) ,b (cid:48) | (cid:62) ε } (cid:35) (cid:54) (cid:88) ( a,b ) , ( a (cid:48) ,b (cid:48) ) ∈A×B γε (cid:54) γε |A| |B| . Furthermore, for t (cid:62) , a ∈ A and B (cid:48) ⊂ B , we have log (cid:0) N a t +1 ,b t +1 ( t ) (cid:1) (cid:54) (cid:88) b ∈B (cid:48) N a,b ( t ) kl ( (cid:98) µ a,b ( t ) | λ a,b ) + log ( N a,b ( t )) ⇔ N a t +1 ,b t +1 ( t ) (cid:54) (cid:89) b ∈B (cid:48) N a,b ( t ) e N a,b ( t ) kl ( (cid:98) µ a,b ( t ) | λ a,b ) , where λ a,b = µ a,b − ε ν for all couple ( a, b ) ∈ A × B . 
Thus, considering estimators of means based37n the numbers of pulls ( (cid:98) µ na,b ) ( a,b ) ∈A×B ,n (cid:62) (see Appendix A), we have | Λ ε ν | (cid:54) (cid:88) t (cid:62) (cid:88) a ∈AB (cid:48) ⊂B I (cid:40) ∀ b ∈B (cid:48) , (cid:98) µ a,b ( t ) <λ a,b and N at +1 ,bt +1 ( t ) (cid:54) (cid:81) b ∈B(cid:48) N a,b ( t ) e Na,b ( t ) kl ( (cid:98) µa,b ( t ) | λa,b ) (cid:41) = (cid:88) t (cid:62) a (cid:48) ,b (cid:48) ) ∈A×B (cid:88) a ∈AB (cid:48) ⊂B (cid:88) n b (cid:62) b ∈B (cid:48) I { ( a t +1 ,b t +1 )=( a (cid:48) ,b (cid:48) ) ,N a,b ( t )= n b } I (cid:40) ∀ b ∈B (cid:48) , (cid:98) µ nba,b <λ a,b , N a (cid:48) ,b (cid:48) ( t ) (cid:54) (cid:81) b ∈B(cid:48) n b e nb kl ( (cid:98) µnba,b | λa,b ) (cid:41) (cid:54) (cid:88) t (cid:62) a (cid:48) ,b (cid:48) ) ∈A×B (cid:88) a ∈AB (cid:48) ⊂B (cid:88) n b (cid:62) b ∈B (cid:48) I { ( a t +1 ,b t +1 )=( a (cid:48) ,b (cid:48) ) } I (cid:110) ∀ b ∈B (cid:48) , (cid:98) µ nba,b <λ a,b (cid:111) I (cid:40) (cid:54) N a (cid:48) ,b (cid:48) ( t ) (cid:54) (cid:81) b ∈B(cid:48) n b e nb kl ( (cid:98) µnba,b | λa,b ) (cid:41) + (cid:88) t (cid:62) a (cid:48) ,b (cid:48) ) ∈A×B I { ( a t +1 ,b t +1 )=( a (cid:48) ,b (cid:48) ) } I { N a (cid:48) ,b (cid:48) ( t )=0 } (cid:54) (cid:88) ( a (cid:48) ,b (cid:48) ) ∈A×B (cid:88) a ∈AB (cid:48) ⊂B (cid:88) n b (cid:62) b ∈B (cid:48) I (cid:110) ∀ b ∈B (cid:48) , (cid:98) µ nba,b <λ a,b (cid:111) (cid:88) t (cid:62) I { ( a t +1 ,b t +1 )=( a (cid:48) ,b (cid:48) ) } I (cid:40) (cid:54) N a (cid:48) ,b (cid:48) ( t ) (cid:54) (cid:81) b ∈B(cid:48) n b e nb kl ( (cid:98) µnba,b | λa,b ) (cid:41) + |A| |B| (cid:54) (cid:88) ( a (cid:48) ,b (cid:48) ) ∈A×B (cid:88) a ∈AB (cid:48) ⊂B (cid:88) n b (cid:62) b ∈B (cid:48) I (cid:110) ∀ b ∈B (cid:48) , (cid:98) µ nba,b <λ a,b (cid:111) (cid:89) b ∈B (cid:48) n b e n b kl ( (cid:98) µ nba,b ,λ a,b ) + |A| |B| = |A| |B| (cid:88) a ∈AB (cid:48) ⊂B (cid:88) n b (cid:62) b ∈B (cid:48) (cid:89) b ∈B (cid:48) I 
(cid:110)(cid:98) µ nba,b <λ a,b (cid:111) n b e n b kl (cid:16)(cid:98) µ nba,b | λ a,b (cid:17) + |A| |B| = |A| |B|  (cid:88) a ∈AB (cid:48) ⊂B (cid:89) b ∈B (cid:48) (cid:88) n (cid:62) I (cid:110)(cid:98) µ na,b <λ a,b (cid:111) ne n kl ( (cid:98) µ na,b | λ a,b )  and E ν [ | Λ ε ν | ] (cid:54) |A| |B|  (cid:88) a ∈AB (cid:48) ⊂B (cid:89) b ∈B (cid:48) E ν (cid:34)(cid:88) n (cid:62) I { (cid:98) µ na,b <λ a,b } ne n kl ( (cid:98) µ na,b ,λ a,b ) (cid:35) . (30)Then, by applying Lemma 29 based on large deviation inequalities, we have ∀ ( a, b ) ∈ A × B , E ν (cid:34)(cid:88) n (cid:62) I { (cid:98) µ na,b <λ a,b } ne n kl ( (cid:98) µ na,b ,λ a,b ) (cid:35) (cid:54) E ν , (31)where E ν = 6 e max a ∈A ,b ∈B (cid:16) − log(1 − λ a,b )log(1 − µ a,b ) (cid:17) − (cid:32) − e − (1 − log(1 − λa,b )log(1 − µa,b ) ) kl ( µ a,b | λ a,b ) (cid:33) − .By combining Eq. 30 and 31, we conclude that E ν [ | Λ ε ν | ] (cid:54) |A| |B| (cid:16) |A| (1 + E ν ) |B| (cid:17) (cid:54) |A| |B| (1 + E ν ) |B| . Appendix E.

IMED-GS (cid:63) : Proof of Theorem 11 (main result)

In this section we prove the asymptotic optimality of the IMED-GS⋆ strategy. The proof is based on the finite-time analysis detailed in Appendix D.

E.1 Almost surely $n^{\mathrm{opt}}(T)$ tends to $n^{\nu}$

For $a\in\mathcal{A}$ such that $\mathcal{B}_a = \{b\in\mathcal{B} : (a,b)\notin\mathcal{O}^\star\} \neq \emptyset$, let us define the linear program

$$
C^\star_{\omega,a}(\nu) := \min_{n\in\mathbb{R}_+^{\mathcal{B}_a}} \sum_{b\in\mathcal{B}_a} n_b\,\Delta_{a,b} \quad \text{s.t.}\quad \forall b\in\mathcal{B}_a :\ \sum_{b'\in\mathcal{B}_a} \mathrm{kl}^+\big(\mu_{a,b'}\,\big|\,\mu^\star_b-\omega_{b,b'}\big)\, n_{a,b'} \geq 1.
$$

Then $(n^\nu_{a,b})_{b\in\mathcal{B}_a}$ is the unique optimal solution of the previous minimization problem. Furthermore, we can state the following lemma.

Lemma 30

$$
\lim_{T\to\infty} \big(n^{\mathrm{opt}}_{a,b}(T)\big)_{b\in\mathcal{B}_a} = \big(n^\nu_{a,b}\big)_{b\in\mathcal{B}_a}.
$$

Proof

This is a direct application of Lemma 31 and Lemma 43 stated below.
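To make the linear program defining $C^\star_{\omega,a}(\nu)$ above more concrete, here is a minimal Python sketch that solves it by vertex enumeration for a hypothetical arm with two sub-optimal users. Everything numeric is an assumption chosen for illustration: the means, the weights, and the helper names `kl`, `kl_plus` and `solve_lp_2d` do not come from the paper. The sketch also assumes the usual convention $\mathrm{kl}^+(p|q)=\mathrm{kl}(p|q)$ if $p<q$ and $0$ otherwise, which is defined in Appendix A rather than here.

```python
from itertools import combinations
from math import log

def kl(p, q):
    """Bernoulli Kullback-Leibler divergence kl(p|q) for p, q in (0, 1)."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def kl_plus(p, q):
    """Assumed convention: kl(p|q) if p < q, and 0 otherwise."""
    return kl(p, q) if p < q else 0.0

def solve_lp_2d(K, delta, tol=1e-9):
    """Minimise delta . n subject to K n >= 1 and n >= 0, for two variables.

    Enumerates the intersections of the four boundary lines and keeps the
    cheapest feasible one. Returns (value, (n1, n2)), or None if infeasible.
    """
    # Each line is (c1, c2, rhs), meaning c1 * n1 + c2 * n2 = rhs.
    lines = [(K[0][0], K[0][1], 1.0), (K[1][0], K[1][1], 1.0),
             (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    best = None
    for (a1, a2, c), (b1, b2, d) in combinations(lines, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue  # parallel lines: no vertex
        n1 = (c * b2 - a2 * d) / det  # Cramer's rule
        n2 = (a1 * d - c * b1) / det
        feasible = (n1 >= -tol and n2 >= -tol and
                    all(row[0] * n1 + row[1] * n2 >= 1.0 - tol for row in K))
        if feasible:
            value = delta[0] * n1 + delta[1] * n2
            if best is None or value < best[0]:
                best = (value, (n1, n2))
    return best

# Hypothetical instance: one arm a, two users on which a is sub-optimal.
mu_a = [0.3, 0.4]                  # means of arm a for users b1, b2
mu_star = [0.8, 0.7]               # optimal means of users b1, b2
omega = [[0.0, 0.2], [0.2, 0.0]]   # weights between the two users
K = [[kl_plus(mu_a[bp], mu_star[b] - omega[b][bp]) for bp in range(2)]
     for b in range(2)]
delta = [mu_star[b] - mu_a[b] for b in range(2)]
value, n = solve_lp_2d(K, delta)
print(value, n)
```

Perturbing `K` or `delta` slightly and re-solving gives a rough numerical feel for the continuity property that Lemma 43 formalises; a general-purpose LP solver would be the natural choice beyond two users.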

Lemma 31

Let < ε < ε ν (see Appendix A) and γ ∈ (0 , . Let a ∈ A such that B a = { b ∈ B : ( a, b ) / ∈ O (cid:63) } . Let us consider for T (cid:62) , (cid:98) K a ( T ) = ( kl + ( (cid:98) µ a,b (cid:48) ( T ) | (cid:98) µ (cid:63)b ( T ) − ω b,b (cid:48) )) b,b (cid:48) ∈B a ,the vector (cid:98) ∆ a ( T ) = ( (cid:98) µ (cid:63)b ( T ) − (cid:98) µ a,b ( T )) b ∈B a and the parameter (cid:98) h a ( T ) = ( (cid:98) K a ( T ) , (cid:98) ∆ a ( T )) . We alsoconsider H a := (cid:110)(cid:98) h a ( T ) , T ∈ T ε,γ (cid:111) , where T ε,γ , deﬁned in Appendix D.2, satisﬁes (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) < ∞ . Then, we have ∀ h = ( K, ∆) ∈ H a , K (cid:54) = 0 and min h =( K, ∆) ∈H a min b ∈B a ∆ b > . Proof

Let h = ( K, ∆) ∈ H a . There exists T ∈ T ε,γ such that h = ( K, ∆) = (cid:98) h a ( T ) =( (cid:98) K a ( T ) , (cid:98) ∆ a ( T )) . Since T ∈ T ε,γ , we have (cid:98) O (cid:63) ( T ) = O (cid:63) . In particular for all b ∈ B a , K b,b = (cid:98) K a,b,b ( T ) = kl + ( (cid:98) µ a,b ( T ) | (cid:98) µ (cid:63)b ( T )) > . Furthermore, we have min b ∈B a ∆ b = min b ∈B a (cid:98) ∆ a,b ( T ) = min b ∈ (cid:98) B a ( T ) (cid:98) µ (cid:63)b ( T ) − (cid:98) µ a,b ( T ) > . Lastly since ∀ b ∈ B a , (cid:98) µ (cid:63)b ( ∞ ) = µ (cid:63)b , (cid:98) µ a,b ( ∞ ) = µ a,b , we have min b ∈B a (cid:98) ∆ a,b ( T ) → min b ∈B a µ (cid:63)b − µ a,b > and min h =( K, ∆) ∈H a min b ∈B a ∆ b = min T / ∈E∪T min b ∈B a (cid:98) ∆ a,b ( T ) > . .2 Almost surely and on expectation, for all sub-optimal couple N a,b ( T )log( T ) tends to n νa,b Combining the upper bounds from the ﬁnite analysis and the asymptotic behaviour of n opt ( · ) , weprove the asymptotic optimality of IMED-GS (cid:63) . Lemma 32 (Asymptotic upper bounds)

For every sub-optimal couple $(a,b)\notin\mathcal{O}^\star$,

$$
\limsup_{T\to\infty} \frac{N_{a,b}(T)}{\log(T)} \leq n^\nu_{a,b}.
$$

Proof

Let $0<\varepsilon<\varepsilon_\nu$ (see Appendix A) and $\gamma\in(0,1/2)$. Let $(a,b)\notin\mathcal{O}^\star$ and let us consider a horizon $T\geq 1$. Let us introduce the random variable

$$
\tau := \min\big\{t\in[\![1,T]\!] \ \text{s.t.}\ t\in\mathcal{T}_{\varepsilon,\gamma}\setminus\mathcal{T}^c(T) \ \text{and}\ (a_{t+1},b_{t+1})=(a,b)\big\},
$$

where $\mathcal{T}_{\varepsilon,\gamma}$ and $\mathcal{T}(T)$ are respectively introduced in Appendix D.2 and D.5. Then, by definition of $\tau$ and since $|\mathcal{T}^c_{\varepsilon,\gamma}|<\infty$, from Lemma 23 we have

$$
N_{a,b}(T) \leq N_{a,b}(\tau) + \big|\mathcal{T}^c_{\varepsilon,\gamma}\big| + \big|\mathcal{T}^c(T)\big| = N_{a,b}(\tau) + O(\log(\log(T))). \tag{32}
$$

Furthermore, since $\tau\notin\mathcal{T}^c(T)$ we have $c_a(\tau)\neq c^+_a(\tau)$. In addition, since $\tau\in\mathcal{T}_{\varepsilon,\gamma}$ and $(a,b)=(a_{\tau+1},b_{\tau+1})\notin\mathcal{O}^\star$, Lemma 15 implies the following empirical upper bound:

$$
N_{a,b}(\tau) \leq \log(\tau)\, n^{\mathrm{opt}}_{a,b}(\tau). \tag{33}
$$

In particular, since $\log(\tau)\leq\log(T)$, Eq. 32 and 33 imply that, almost surely,

$$
\frac{N_{a,b}(T)}{\log(T)} \leq \frac{N_{a,b}(\tau)}{\log(\tau)} + \frac{O(\log(\log(T)))}{\log(T)} \leq n^{\mathrm{opt}}_{a,b}(\tau) + \frac{O(\log(\log(T)))}{\log(T)}
$$

and, since almost surely $\lim_{T\to\infty}\tau=\infty$, from Lemma 30 we get almost surely

$$
\limsup_{T\to\infty} \frac{N_{a,b}(T)}{\log(T)} \leq \limsup_{T\to\infty} n^{\mathrm{opt}}_{a,b}(\tau) + \limsup_{T\to\infty} \frac{O(\log(\log(T)))}{\log(T)} = n^\nu_{a,b}.
$$

Lemma 33 (Asymptotic optimality)

For every sub-optimal couple $(a,b)\notin\mathcal{O}^\star$, we have almost surely

$$
\lim_{T\to\infty} \frac{N_{a,b}(T)}{\log(T)} = n^\nu_{a,b} \quad\text{and}\quad \lim_{T\to\infty} \mathbb{E}_\nu\left[\frac{N_{a,b}(T)}{\log(T)}\right] = n^\nu_{a,b}.
$$

Proof

Since IMED-GS⋆ is a consistent strategy that induces sequences of users with log-frequencies equal to 1, we have

$$
\forall (a,b)\notin\mathcal{O}^\star,\quad \liminf_{T\to\infty} \frac{1}{\log(T)} \sum_{b'\in\mathcal{B}_a} N_{a,b'}(T)\,\mathrm{kl}^+\big(\mu_{a,b'}\,\big|\,\mu^\star_b-\omega_{b,b'}\big) \geq 1.
$$

Then, the Pareto-optimality of IMED-GS⋆ combined with the asymptotic upper bounds given in Lemma 32 ensures that for all $(a,b)\notin\mathcal{O}^\star$, $N_{a,b}(T)/\log(T)\to n^\nu_{a,b}$. Since the $N_{a,b}(T)/\log(T)$ are dominated by an integrable variable (see Proposition 20 in Appendix D), these convergences also hold in expectation.

Appendix F. Concentration lemmas: Proofs

Lemma 28

Considering the stopping times τ na,b = inf { t (cid:62) , N a,b ( t ) = n } we will rewrite the sum (cid:80) t (cid:62) I { ( a t +1 ,b t +1 )=( a,b ) , N a (cid:48) ,b (cid:48) ( t ) (cid:62) γN a,b ( t ) , | (cid:98) µ a (cid:48) ,b (cid:48) ( t ) − µ a (cid:48) ,b (cid:48) | (cid:62) ε } and use an Hoeffding’s type argument. (cid:88) t (cid:62) I { ( a t +1 ,b t +1 )=( a,b ) , N a (cid:48) ,b (cid:48) ( t ) (cid:62) γN a,b ( t ) , | (cid:98) µ a (cid:48) ,b (cid:48) ( t ) − µ a (cid:48) ,b (cid:48) | (cid:62) ε } (cid:54) (cid:88) t (cid:62) (cid:88) n (cid:62) , m (cid:62) I { τ na,b = t +1 ,N a (cid:48) ,b (cid:48) ( t )= m } I (cid:110) m (cid:62) γ ( n − , (cid:12)(cid:12)(cid:12)(cid:98) µ ma (cid:48) ,b (cid:48) − µ a (cid:48) ,b (cid:48) (cid:12)(cid:12)(cid:12) (cid:62) ε (cid:111) = (cid:88) m (cid:62) (cid:88) n (cid:62) I (cid:110) m (cid:62) γ ( n − , (cid:12)(cid:12)(cid:12)(cid:98) µ ma (cid:48) ,b (cid:48) − µ a (cid:48) ,b (cid:48) (cid:12)(cid:12)(cid:12) (cid:62) ε (cid:111) (cid:88) t (cid:62) I { τ na,b = t +1 ,N a (cid:48) ,b (cid:48) ( t )= m } (cid:54) (cid:88) m (cid:62) (cid:88) n (cid:62) I (cid:110) m (cid:62) γ ( n − , (cid:12)(cid:12)(cid:12)(cid:98) µ ma (cid:48) ,b (cid:48) − µ a (cid:48) ,b (cid:48) (cid:12)(cid:12)(cid:12) (cid:62) ε (cid:111) (cid:88) t (cid:62) I { τ na,b = t +1 } (cid:54) (cid:88) m (cid:62) (cid:88) n (cid:62) I (cid:110) m (cid:62) γ ( n − , (cid:12)(cid:12)(cid:12)(cid:98) µ ma (cid:48) ,b (cid:48) − µ a (cid:48) ,b (cid:48) (cid:12)(cid:12)(cid:12) (cid:62) ε (cid:111) Taking the expectation , it comes: E ν (cid:34)(cid:88) t (cid:62) I { ( a t +1 ,b t +1 )=( a,b ) , N a (cid:48) ,b (cid:48) ( t ) (cid:62) γN a,b ( t ) , | (cid:98) µ a (cid:48) ,b (cid:48) ( t ) − µ a (cid:48) ,b (cid:48) | (cid:62) ε } (cid:35) (cid:54) (cid:88) m (cid:62) (cid:88) n (cid:62) I { m (cid:62) γ ( n − } P ν ( | (cid:98) µ ma (cid:48) − µ a (cid:48) | (cid:62) ε ) (cid:54) (cid:88) m (cid:62) (cid:88) n (cid:62) I { m 
(cid:62) γ ( n − } e − mε (Hoeffding inequality) = 2 (cid:88) m (cid:62) (cid:18) mγ + 1 (cid:19) e − mε = 2 (cid:88) m (cid:62) mγ e − mε + 2 (cid:88) m (cid:62) e − mε = 1 γ e − ε (1 − e − ε ) + 11 − e − ε = 1 γ e ε ( e ε − + e ε e ε − (cid:54) γ e / ε + e ε ε (cid:54) γε . emma Let ν ∈ D ω . For all couple ( a, b ) ∈ A×B , for all < µ < µ a,b , E ν (cid:34)(cid:88) n (cid:62) I { (cid:98) µ na,b <µ } n exp( n kl ( (cid:98) µ na,b | µ )) (cid:35) (cid:54) e (1 − log(1 − µ )log(1 − µ a,b ) ) (cid:18) − e − (1 − log(1 − µ )log(1 − µa,b ) ) kl ( µ a,b | µ ) (cid:19) , where (cid:98) µ na,b estimates µ a,b after n pulls of couple ( a, b ) (see Appendix A). Proof

The proof is based on a Chernoff-type inequality and a change-of-measure computation. It follows Honda and Takemura (2015); we make the particular case of Bernoulli distributions explicit here for completeness. Let us rephrase Proposition 11 from Honda and Takemura (2015). Since we consider Bernoulli distributions, we get a more explicit formulation.

Proposition 34

Let $\nu\in\mathcal{D}_\omega$. Let $(a,b)\in\mathcal{A}\times\mathcal{B}$ and $0<\mu<\mu_{a,b}$. Then, for all $n\geq 1$ and $u\in\mathbb{R}$, we have

$$
\mathbb{P}_\nu\big(\mathrm{kl}(\hat\mu^n_{a,b}\,|\,\mu)\geq u,\ \hat\mu^n_{a,b}\leq\mu\big) \leq
\begin{cases}
e^{-n\,\mathrm{kl}(\mu_{a,b}|\mu)} & \text{if } u\leq \dfrac{\log(1-\mu)}{\log(1-\mu_{a,b})}\,\mathrm{kl}(\mu_{a,b}\,|\,\mu),\\[2ex]
2e\left(1+\dfrac{\log(1-\mu_{a,b})}{\log(1-\mu)}\,n\right) e^{-n\frac{\log(1-\mu_{a,b})}{\log(1-\mu)}\,u} & \text{otherwise.}
\end{cases}
$$

We now rewrite equality (27) from Honda and Takemura (2015) with our notation. Let $n\geq 1$. We have from Proposition 34 that:

$$
\begin{aligned}
\mathbb{E}_\nu\Big[\mathbb{I}\{\hat\mu^n_{a,b}\leq\mu\}\, n\, e^{n\,\mathrm{kl}(\hat\mu^n_{a,b}|\mu)}\Big]
&= \int_0^\infty \mathbb{P}_\nu\Big(\mathbb{I}\{\hat\mu^n_{a,b}\leq\mu\}\, n\, e^{n\,\mathrm{kl}(\hat\mu^n_{a,b}|\mu)} > x\Big)\,\mathrm{d}x\\
&= \int_0^\infty \mathbb{P}_\nu\Big(n\, e^{n\,\mathrm{kl}(\hat\mu^n_{a,b}|\mu)} > x,\ \hat\mu^n_{a,b}\leq\mu\Big)\,\mathrm{d}x\\
&= \int_{-\infty}^{\infty} n^2 e^{nu}\, \mathbb{P}_\nu\big(\mathrm{kl}(\hat\mu^n_{a,b}|\mu) > u,\ \hat\mu^n_{a,b}\leq\mu\big)\,\mathrm{d}u \qquad (x = n e^{nu},\ \mathrm{d}x = n^2 e^{nu}\,\mathrm{d}u)\\
&= \int_{-\infty}^{\mathrm{kl}(\mu_{a,b}|\mu)\frac{\log(1-\mu)}{\log(1-\mu_{a,b})}} n^2 e^{nu}\, \mathbb{P}_\nu\big(\mathrm{kl}(\hat\mu^n_{a,b}|\mu) > u,\ \hat\mu^n_{a,b}\leq\mu\big)\,\mathrm{d}u\\
&\qquad + \int_{\mathrm{kl}(\mu_{a,b}|\mu)\frac{\log(1-\mu)}{\log(1-\mu_{a,b})}}^{\infty} n^2 e^{nu}\, \mathbb{P}_\nu\big(\mathrm{kl}(\hat\mu^n_{a,b}|\mu) > u,\ \hat\mu^n_{a,b}\leq\mu\big)\,\mathrm{d}u\\
&\leq \int_{-\infty}^{\mathrm{kl}(\mu_{a,b}|\mu)\frac{\log(1-\mu)}{\log(1-\mu_{a,b})}} n^2 e^{nu}\, e^{-n\,\mathrm{kl}(\mu_{a,b}|\mu)}\,\mathrm{d}u\\
&\qquad + \int_{\mathrm{kl}(\mu_{a,b}|\mu)\frac{\log(1-\mu)}{\log(1-\mu_{a,b})}}^{\infty} n^2 e^{nu}\, 2e\left(1+\frac{\log(1-\mu_{a,b})}{\log(1-\mu)}\,n\right) e^{-n\frac{\log(1-\mu_{a,b})}{\log(1-\mu)}u}\,\mathrm{d}u \qquad \text{(Proposition 34)}\\
&= n\, e^{-n\,\mathrm{kl}(\mu_{a,b}|\mu)} \int_{-\infty}^{\mathrm{kl}(\mu_{a,b}|\mu)\frac{\log(1-\mu)}{\log(1-\mu_{a,b})}} n\, e^{nu}\,\mathrm{d}u\\
&\qquad + 2ne\left(1+\frac{\log(1-\mu_{a,b})}{\log(1-\mu)}\,n\right) \int_{\mathrm{kl}(\mu_{a,b}|\mu)\frac{\log(1-\mu)}{\log(1-\mu_{a,b})}}^{\infty} n\, e^{-\left(\frac{\log(1-\mu_{a,b})}{\log(1-\mu)}-1\right)nu}\,\mathrm{d}u\\
&= n\, e^{-n\left(1-\frac{\log(1-\mu)}{\log(1-\mu_{a,b})}\right)\mathrm{kl}(\mu_{a,b}|\mu)} + 2ne\left(1+\frac{\log(1-\mu_{a,b})}{\log(1-\mu)}\,n\right) \frac{e^{-n\left(1-\frac{\log(1-\mu)}{\log(1-\mu_{a,b})}\right)\mathrm{kl}(\mu_{a,b}|\mu)}}{\frac{\log(1-\mu_{a,b})}{\log(1-\mu)}-1}\\
&= \left[1+\frac{2e}{\frac{\log(1-\mu_{a,b})}{\log(1-\mu)}-1}\right] n\, e^{-n\left(1-\frac{\log(1-\mu)}{\log(1-\mu_{a,b})}\right)\mathrm{kl}(\mu_{a,b}|\mu)} + \frac{2e}{1-\frac{\log(1-\mu)}{\log(1-\mu_{a,b})}}\, n^2\, e^{-n\left(1-\frac{\log(1-\mu)}{\log(1-\mu_{a,b})}\right)\mathrm{kl}(\mu_{a,b}|\mu)}.
\end{aligned}
$$

To end the proof, we use the following inequalities for $r>0$:

$$
\sum_{n\geq 1} n\, e^{-nr} \leq \frac{1}{(1-e^{-r})^2} \leq \frac{1}{(1-e^{-r})^3}, \qquad \sum_{n\geq 1} n^2 e^{-nr} \leq \frac{2}{(1-e^{-r})^3}.
$$

Appendix G. IMED-GS, IMED-GS, IMED-GS⋆: Finite-time analysis

In this subsection we rewrite and adapt the results established in Sections D, E for

IMED-GS⋆ strategy to the other considered strategies. Mainly, we rewrite the empirical lower bounds and upper bounds detailed in Lemmas 14 and 15. These inequalities form the basis of the analysis of the IMED-GS⋆ strategy. For the sake of brevity and clarity, proofs are not given.

G.1

IMED-GS ﬁnite-time analysis

Under the IMED-GS strategy we do not solve empirical versions of optimisation problem 2; we only pull the couples with minimum (pseudo) indexes. This leads to the following empirical bounds.

Lemma 35 (IMED-GS: Empirical lower bounds)

Under IMED-GS, at each time step $t\geq 1$, for every couple $(a,b)\notin\hat{\mathcal{O}}^\star(t)$,

$$
\log\big(N_{a_{t+1},b_{t+1}}(t)\big) \leq \sum_{b'\in\hat{\mathcal{B}}_{a,b}(t)} \Big[ N_{a,b'}(t)\,\mathrm{kl}\big(\hat\mu_{a,b'}(t)\,\big|\,\hat\mu^\star_b(t)-\omega_{b,b'}\big) + \log\big(N_{a,b'}(t)\big) \Big].
$$

Furthermore, for every couple $(a,b)\in\hat{\mathcal{O}}^\star(t)$,

$$
N_{a_{t+1},b_{t+1}}(t) \leq N_{a,b}(t).
$$

Lemma 36 (IMED-GS: Empirical upper bounds)

Under IMED-GS, at each time step $t\geq 1$ such that $(a_{t+1},b_{t+1})\notin\hat{\mathcal{O}}^\star(t)$, we have

$$
\sum_{b'\in\hat{\mathcal{B}}_{a_{t+1},b_{t+1}}(t)} N_{a_{t+1},b'}(t)\,\mathrm{kl}\big(\hat\mu_{a_{t+1},b'}(t)\,\big|\,\hat\mu^\star_{b_{t+1}}(t)-\omega_{b_{t+1},b'}\big) \leq \log\big(N_{b_{t+1}}(t)\big).
$$

In particular,

$$
N_{a_{t+1},b_{t+1}}(t)\,\mathrm{kl}\big(\hat\mu_{a_{t+1},b_{t+1}}(t)\,\big|\,\hat\mu^\star_{b_{t+1}}(t)\big) \leq \log\big(N_{b_{t+1}}(t)\big).
$$

Based on these lemmas, one can prove the Pareto-optimality of IMED-GS in a similar way as for the IMED-GS⋆ strategy.
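As a short worked step, here is how the last inequality of Lemma 36 already controls the number of pulls of a sub-optimal couple. This is only a rearrangement, under the assumption that the empirical divergence on the left-hand side is positive, which is expected when the pulled couple is empirically sub-optimal, i.e. $\hat\mu_{a_{t+1},b_{t+1}}(t) < \hat\mu^\star_{b_{t+1}}(t)$:

$$
N_{a_{t+1},b_{t+1}}(t)\,\mathrm{kl}\big(\hat\mu_{a_{t+1},b_{t+1}}(t)\,\big|\,\hat\mu^\star_{b_{t+1}}(t)\big) \leq \log\big(N_{b_{t+1}}(t)\big)
\quad\Longrightarrow\quad
\frac{N_{a_{t+1},b_{t+1}}(t)}{\log\big(N_{b_{t+1}}(t)\big)} \leq \frac{1}{\mathrm{kl}\big(\hat\mu_{a_{t+1},b_{t+1}}(t)\,\big|\,\hat\mu^\star_{b_{t+1}}(t)\big)}.
$$

The right-hand side still involves empirical quantities; the finite-time analysis is what replaces them by their true counterparts on the time steps where the estimates are concentrated.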

Proposition 37 (

IMED-GS : Upper bounds )

Let ν ∈ D ω . Let < ε < ε ν and γ ∈ (0 , / . Let usintroduce T ε,γ := (cid:26) t (cid:62) (cid:98) O (cid:63) ( t ) = O (cid:63) ∀ ( a, b ) s.t. N a,b ( t ) (cid:62) γ N a t +1 ,b t +1 ( t ) or ( a, b ) ∈ O (cid:63) , | (cid:98) µ a,b ( t ) − µ a,b | < ε (cid:27) . Then under

IMED-GS strategy, E ν [ (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) ] (cid:54) γε |A| |B| + 2 |A| |B| (1 + E ν ) |B| nd for all horizon time T (cid:62) , for all arm a ∈ A , min b : ( a,b ) / ∈O (cid:63) N b ( T )) (cid:88) b (cid:48) ∈B a,b N a,b (cid:48) ( T ) kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) (cid:54) (1 + α ν ( ε )) (cid:20) γ M ν m ν (cid:21) + M ν |T ε,γ | min b ∈B log( N b ( T )) where m ν and M ν are deﬁned as follows: m ν = min ( a,b ) / ∈O (cid:63) b (cid:48) ∈B a,b kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) , M ν = max ( a,b ) / ∈O (cid:63) (cid:88) b (cid:48) ∈B a,b kl ( µ a,b (cid:48) | µ (cid:63)b − ω b,b (cid:48) ) . Furthermore, we have ∀ ( a, b ) / ∈ O (cid:63) , N a,b ( T )log( N b ( T )) (cid:54) α ν ( ε ) kl ( µ a,b | µ (cid:63)b ) + (cid:12)(cid:12) T cε,γ (cid:12)(cid:12) log( N b ( T )) . Refer to Appendix A for the deﬁnitions of ε ν , α ν ( · ) and E ν . From the previous proposition, we deduce the following corollary by doing T → ∞ , then ε, γ → . Corollary 38 (

IMED-GS: Pareto optimality)

Let $\nu\in\mathcal{D}_\omega$. Under the IMED-GS strategy we have

$$
\forall a\in\mathcal{A},\quad \limsup_{T\to\infty}\ \min_{b:\,(a,b)\notin\mathcal{O}^\star} \frac{1}{\log(N_b(T))} \sum_{b'\in\mathcal{B}_{a,b}} N_{a,b'}(T)\,\mathrm{kl}\big(\mu_{a,b'}\,\big|\,\mu^\star_b-\omega_{b,b'}\big) \leq 1.
$$

G.2

Uncontrolled scenario : Finite-time analysis

When the uncontrolled scenario is considered, the learner does not choose the users to deal with, and the exploration phases may be performed with some delay. This can be formalized within the empirical bounds induced by the IMED-GS and IMED-GS⋆ strategies.

G.2.1 EMPIRICAL BOUNDS ON THE NUMBERS OF PULLS

For each time step $t\geq 1$, let us introduce the last return time of user $b_{t+1}\in\mathcal{B}$ as

$$
\tau_t := \min\big\{t-t' :\ b_{t'}=b_{t+1},\ t\geq t'\geq 1\big\}.
$$

By definition of $\tau_t$ we have

$$
N_{a_{t+1},b_{t+1}}(t) = N_{a_{t+1},b_{t+1}}(t-\tau_t).
$$

Now, empirical bounds on the numbers of pulls can be formulated for the uncontrolled scenario. These inequalities are the same as those for the controlled scenario up to (random) time delays.
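The last return time can be illustrated on a short, made-up sequence of users. The sketch below is only an illustration: the function name `last_return_time`, the example sequence, and the convention of returning `None` when user $b_{t+1}$ has never appeared before are assumptions, not part of the paper.

```python
def last_return_time(users, t):
    """Return tau_t = min{t - t' : users[t'] == users[t + 1], 1 <= t' <= t}.

    `users` is indexed from 1 (index 0 holds a dummy entry). Returns None
    when user b_{t+1} has not appeared up to time t (assumed convention).
    """
    target = users[t + 1]
    # Scan backwards from t: the first match minimises t - t'.
    for t_prime in range(t, 0, -1):
        if users[t_prime] == target:
            return t - t_prime
    return None

# Hypothetical sequence of users b_1, ..., b_5.
users = [None, "b1", "b2", "b1", "b1", "b2"]
taus = {t: last_return_time(users, t) for t in range(1, 5)}
print(taus)
```

Between time $t-\tau_t$ and time $t$ the user $b_{t+1}$ is never presented, so no couple involving that user is pulled in the meantime, which is why the corresponding counts coincide.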

Lemma 39 (

Uncontrolled scenario : Empirical lower bounds)

Under

IMED-GS and IMED-GS (cid:63) , ateach step time t (cid:62) there exists a random time delay σ t such that (cid:54) σ t (cid:54) τ t and for all couple ( a, b ) / ∈ (cid:98) O (cid:63) ( t − σ t )log (cid:0) N a t +1 ,b t +1 ( t − σ t ) (cid:1) (cid:54) (cid:88) b (cid:48) ∈ (cid:98) B a,b ( t − σ t ) N a,b (cid:48) ( t − σ t ) kl (cid:0)(cid:98) µ a,b (cid:48) ( t − σ t ) (cid:12)(cid:12)(cid:98) µ (cid:63)b ( t − σ t ) − ω b,b (cid:48) (cid:1) +log (cid:0) N a,b (cid:48) ( t − σ t ) (cid:1) . urthermore, for all couple ( a, b ) ∈ (cid:98) O (cid:63) ( t − σ t ) , N a t +1 ,b t +1 ( t − σ t ) (cid:54) N a,b ( t − σ t ) . Lemma 40 (Empirical upper bounds)

Under

IMED-GS and IMED-GS (cid:63) , at each step time t (cid:62) suchthat ( a t +1 , b t +1 ) / ∈ (cid:98) O (cid:63) ( t ) , we have (cid:88) b (cid:48) ∈ (cid:98) B at +1 ,bt ( t − σ t ) N a t +1 ,b (cid:48) ( t − σ t ) kl (cid:16)(cid:98) µ a t +1 ,b (cid:48) ( t − σ t ) (cid:12)(cid:12)(cid:12)(cid:98) µ (cid:63)b t ( t − σ t ) − ω b t ,b (cid:48) (cid:17) (cid:54) log( N b ( t − σ t )) , where σ t is a random time delay such that (cid:54) σ t (cid:54) τ t . Furthermore, we have under IMED-GS b t +1 = b t , N a t +1 ,b t +1 ( t − σ t ) kl (cid:16)(cid:98) µ a t +1 ,b t +1 ( t − σ t ) (cid:12)(cid:12)(cid:12)(cid:98) µ (cid:63)b t +1 ( t − σ t ) (cid:17) (cid:54) log (cid:0) N b t +1 ( t − σ t ) (cid:1) and under IMED-GS (cid:63) N a t +1 ,b t +1 ( t − σ t )log( t − σ t ) (cid:54)  kl (cid:16)(cid:98) µ a t +1 ,b t ( t − σ t ) (cid:12)(cid:12)(cid:12)(cid:98) µ (cid:63)b t ( t − σ t ) (cid:17) , if c a t +1 ( t − σ t ) = c + a t +1 ( t − σ t )min  kl (cid:16)(cid:98) µ a t +1 ,b t +1 ( t − σ t ) (cid:12)(cid:12)(cid:12)(cid:98) µ (cid:63)b t ( t − σ t ) − ω b t ,b t +1 (cid:17) , n opt a t +1 ,b t +1 ( t − σ t )  , else.Thus, we prove respectively the Pareto-optimality and optimality of IMED-GS and IMED-GS (cid:63) sincewe show that the empirical means (cid:98) µ a,b ( t − σ t ) of couples ( a, b ) involved in the previous inequalitiesconcentrate as in the case of the controlled scenario . This is the case as it is stated in Lemmas 41 ofthe next subsection.G.2.2 C ONCENTRATION INEQUALITY WITH BOUNDED TIME DELAYS

We prove a concentration lemma that does not depend on the followed strategy. It is a rewriting for the case of controlled scenario of Lemma 28.

Lemma 41 (Concentration inequalities)

Let ν ∈ D ω , < ε , γ (cid:54) / and ( a, b ) , ( a (cid:48) , b (cid:48) ) ∈ A × B .Then for all sequence of stopping times ( σ t ) t (cid:62) such that (cid:54) σ t (cid:54) τ t for all t (cid:62) , we have E ν (cid:34)(cid:88) t (cid:62) I { ( a t +1 ,b t +1 )=( a,b ) , N a (cid:48) ,b (cid:48) ( t − σ t ) (cid:62) γN a,b ( t − σ t ) , | (cid:98) µ a (cid:48) ,b (cid:48) ( t − σ t ) − µ a (cid:48) ,b (cid:48) | (cid:62) ε } (cid:35) (cid:54) γε . Remark 42

There is no need to adapt Lemma 29 for the case of controlled scenario since this concentration lemma does not involve the current time steps explicitly.

Proof

It is pointed out that for all time steps $t\geq 1$, $N_{a_{t+1},b_{t+1}}(t-\sigma_t) \geq N_{a_{t+1},b_{t+1}}(t-\tau_t) = N_{a_{t+1},b_{t+1}}(t)$; then we proceed as in Appendix F. Considering the stopping times $\tau^n_{a,b} = \inf\{t\geq 1,\ N_{a,b}(t)=n\}$, we rewrite the sum

$$
\sum_{t\geq 1} \mathbb{I}\big\{(a_{t+1},b_{t+1})=(a,b),\ N_{a',b'}(t-\sigma_t)\geq\gamma N_{a,b}(t-\sigma_t),\ |\hat\mu_{a',b'}(t-\sigma_t)-\mu_{a',b'}|\geq\varepsilon\big\}
$$

as follows:

$$
\begin{aligned}
&\sum_{t\geq 1} \mathbb{I}\big\{(a_{t+1},b_{t+1})=(a,b),\ N_{a',b'}(t-\sigma_t)\geq\gamma N_{a,b}(t-\sigma_t),\ |\hat\mu_{a',b'}(t-\sigma_t)-\mu_{a',b'}|\geq\varepsilon\big\}\\
&\quad\leq \sum_{t\geq 1} \mathbb{I}\big\{(a_{t+1},b_{t+1})=(a,b),\ N_{a',b'}(t-\sigma_t)\geq\gamma N_{a,b}(t),\ |\hat\mu_{a',b'}(t-\sigma_t)-\mu_{a',b'}|\geq\varepsilon\big\}\\
&\quad\leq \sum_{t\geq 1}\ \sum_{n\geq 1,\ m\geq 0} \mathbb{I}\big\{\tau^n_{a,b}=t+1,\ N_{a',b'}(t-\sigma_t)=m\big\}\ \mathbb{I}\big\{m\geq\gamma(n-1),\ |\hat\mu^m_{a',b'}-\mu_{a',b'}|\geq\varepsilon\big\}\\
&\quad= \sum_{m\geq 0}\ \sum_{n\geq 1} \mathbb{I}\big\{m\geq\gamma(n-1),\ |\hat\mu^m_{a',b'}-\mu_{a',b'}|\geq\varepsilon\big\} \sum_{t\geq 1} \mathbb{I}\big\{\tau^n_{a,b}=t+1,\ N_{a',b'}(t-\sigma_t)=m\big\}\\
&\quad\leq \sum_{m\geq 0}\ \sum_{n\geq 1} \mathbb{I}\big\{m\geq\gamma(n-1),\ |\hat\mu^m_{a',b'}-\mu_{a',b'}|\geq\varepsilon\big\} \sum_{t\geq 1} \mathbb{I}\big\{\tau^n_{a,b}=t+1\big\}\\
&\quad\leq \sum_{m\geq 0}\ \sum_{n\geq 1} \mathbb{I}\big\{m\geq\gamma(n-1),\ |\hat\mu^m_{a',b'}-\mu_{a',b'}|\geq\varepsilon\big\},
\end{aligned}
$$

where the $\hat\mu^m_{a',b'}$ are defined in Appendix A. The proof ends the same way as in Appendix F.

Appendix H. Continuity of solutions to parametric linear programs

In this section we recall Lemma 13 established in Magureanu et al. (2014) on the continuity of solutions to parametric linear programs.

Lemma 43

Consider $K\in\mathbb{R}_+^{B\times B}$, $\Delta\in\mathbb{R}_+^{B}$, and $\mathcal{H}\subset\mathbb{R}_+^{B\times B}\times\mathbb{R}_+^{B}$. Define $h=(K,\Delta)$. Consider the function $Q$ and the set-valued map $Q^\star$:

$$
Q(h) = \inf_{x\in\mathbb{R}_+^{B}} \{\Delta\cdot x \mid K\cdot x\geq 1\}, \qquad Q^\star(h) = \{x\geq 0 :\ \Delta\cdot x\leq Q(h) \mid K\cdot x\geq 1\}.
$$

Assume that:
(i) For all $h\in\mathcal{H}$, all rows and columns of $K$ are non-identically 0.
(ii) $\min_{h\in\mathcal{H}}\min_{b\in B}\Delta_b > 0$.

Then:
(a) $Q$ is continuous on $\mathcal{H}$.
(b) $Q^\star$ is upper hemicontinuous on $\mathcal{H}$.

Appendix I. Details on numerical experiments

For the fixed configuration experiments we used the weight matrix ω of Table 1 and the configuration ν described in Table 2. ω and ν have been chosen at random in such a way that the regret under IMED exceeds the structured lower bound on the regret. This means the structure ω is informative for the bandit configuration ν and not taking it into account hinders optimality.

Table 1: ω used in the fixed configuration experiment.

Table 2: ν used in the fixed configuration experiment.

Figure 4: $\min_{b\in\mathcal{B}} N_b(\cdot)$ approximated over runs. At each run we sample uniformly at random a weight matrix ω and then sample uniformly at random a configuration $\nu\in\mathcal{D}_\omega$