Explicit Best Arm Identification in Linear Bandits Using No-Regret Learners
Mohammadi Zaki
Electrical Communication Engineering, Indian Institute of Science, Bangalore 560012. [email protected]
Avi Mohan
Faculty of Electrical Engineering, Technion, Israel Institute of Technology, Haifa 3200003. [email protected]
Aditya Gopalan
Electrical Communication Engineering, Indian Institute of Science, Bangalore 560012. [email protected]
Abstract.
We study the problem of best arm identification in linearly parameterised multi-armed bandits. Given a set of feature vectors
X ⊂ R^d, a confidence parameter δ and an unknown vector θ∗, the goal is to identify argmax_{x ∈ X} x^T θ∗, with probability at least 1 − δ, using noisy measurements of the form x^T θ∗ + η. For this fixed-confidence (δ-PAC) setting, we propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Previous approaches rely on access to minimax optimization oracles. The algorithm, which we call the Phased Elimination Linear Exploration Game (PELEG), maintains a high-probability confidence ellipsoid containing θ∗ in each round and uses it to eliminate suboptimal arms in phases. PELEG achieves fast shrinkage of this confidence ellipsoid along the most confusing (i.e., close to, but not optimal) directions by interpreting the problem as a two-player zero-sum game, and sequentially converging to its saddle point using low-regret learners to compute players' strategies in each round. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.

Function optimization over structured domains is a basic sequential decision making problem. A well-known formulation of this problem is Probably Approximately Correct (PAC) best arm identification in multi-armed bandits [7], in which a learner is given a set of arms with unknown (and unrelated) means. The learner must sequentially test arms and output, as soon as possible with high confidence, a near-optimal arm (where optimality is defined in terms of the largest mean). Often, the arms (decisions) and their associated rewards possess structural relationships, allowing for more efficient learning of the rewards and transfer of learnt information; e.g., two 'close enough' arms may have similar mean rewards.
One of the best-known examples of structured decision spaces is the linear bandit, whose arms are vectors (points) in R^d. The reward or function value of an arm is an unknown linear function of its vector representation, and the goal is to find an arm with maximum reward in the shortest possible time by measuring arms' rewards sequentially with noise. This framework models an array of structured online linear optimization problems including adaptive routing [2], smooth function optimization over graphs [20], subset selection [13] and, in the nonparametric setting, black box optimization in smooth function spaces [18], among others.

∗ Under review. Please do not distribute.

Although no-regret online learning for linear bandits is a well-understood problem (see [14] and references therein), the PAC sample complexity of best arm identification in this model has not received significant attention until recently [17]. The state of the art here is the work of Fiez et al. [8], who give an algorithm with optimal (instance-dependent) PAC sample complexity. However, a closer look indicates that the algorithm assumes repeated oracle access to a minimax optimization problem; it is not clear, from a performance standpoint, in what manner (and to what accuracy) this optimization problem should be practically solved to enjoy the claimed sample complexity. Hence, the question of how to design an explicit algorithm with optimal PAC sample complexity for best arm identification in linear bandits has remained open.

In this paper, we resolve this question affirmatively by giving an explicit linear bandit best-arm identification algorithm with instance-optimal PAC sample complexity and, more importantly, a clearly quantified computational effort.
We achieve this goal using new techniques: the main ingredient in the proposed algorithm is a game-theoretic interpretation of the minimax optimization problem that is at the heart of the instance-based sample complexity lower bound. This in turn yields an adaptive, sample-based approach using carefully constructed confidence sets for the unknown parameter θ∗. The adaptive sampling strategy is driven by the interaction of two no-regret online learning subroutines that attempt to solve the minimax problem approximately, obviating the worry of (i) solving the optimal minimax allocation to a suitable precision and (ii) making an integer sampling allocation from it by rounding, both of which occur in the approach of Fiez et al. [8]. We note that the seeds of this game-theoretic approach were laid by the recent work of Degenne et al. [6] for the simple (i.e., unstructured) multi-armed bandit problem. However, our work demonstrates a novel extension of their methodology to solve best-arm learning in structured multi-armed bandits for the first time to the best of our knowledge.

The PAC best arm identification problem for linearly parameterised bandits is first studied in [17], in which an adaptive algorithm is given with a sample complexity guarantee involving a hardness term (M∗) which in general renders the sample complexity suboptimal. Tao et al. [19] take the path of constructing new estimators instead of ordinary least squares, using which they give an algorithm achieving the familiar sum-of-inverse-gaps sample complexity known for standard bandits; this is, however, not optimal for general linear bandits. The LinGapE algorithm [15] is an attempt at solving best arm identification with a fully adaptive strategy, but its sample complexity in general is not instance-optimal and can additionally scale with the total number of arms, in addition to the extra dimension-dependence known to be incurred by self-normalized inequality-based confidence set constructions [1].
Zaki et al. [21] design a fully adaptive algorithm based on the Lower-Upper Confidence Bound (LUCB) principle [11] with limited guarantees for 2- or 3-armed settings. Fiez et al. [8] give a phased elimination algorithm achieving the ideal information-theoretic sample complexity but with minimax oracle access and an additional rounding operation; we detail an explicit arm-playing strategy that eliminates both these steps, in the same high-level template. In a separate vein, game-theoretic techniques to solve minimax problems have been in existence for over a couple of decades [9]; only recently have they been combined with optimism to give a powerful framework to solve adaptive hypothesis testing problems [6]. Table 1 compares the sample complexities of various best arm identification algorithms in the literature.

Footnotes: This is, in fact, a plug-in version of a minimax optimization problem representing an information-theoretic sample complexity lower bound for the problem. For its experiments, the paper implements an (approximate) minimax oracle using the Frank-Wolfe algorithm and a heuristic stopping rule, but this is not rigorously justifiable for nonsmooth optimization; see Sec. 4.

Problem Statement and Notation
We study the problem of best arm identification in linear bandits with the arm set
X ≡ {x_1, x_2, . . . , x_K}, where each arm x_a is a vector in R^d. We will interchangeably use X and the set [K] ≡ {1, 2, . . . , K}, whenever the context is clear. In every round t = 1, 2, . . . the agent chooses an arm x_t ∈ X, and receives a reward y(x_t) = θ∗^T x_t + η_t, where θ∗ is assumed to be a fixed but unknown vector, and η_t is zero-mean noise assumed to be conditionally 1-subgaussian, i.e., ∀γ ∈ R, E[e^{γη_t} | x_1, . . . , x_{t−1}, η_1, . . . , η_{t−1}] ≤ exp(γ²/2). We denote by ν^k_{θ∗} the distribution of the reward obtained by pulling arm k ∈ [K], i.e., ∀t ≥ 1, y(x_t) ∼ ν^k_{θ∗} whenever x_t = x_k. Given two probability distributions μ, ν over R, KL(μ, ν) denotes the KL divergence of μ and ν (assuming μ ≪ ν). Given θ ∈ R^d, let a∗ ≡ a∗(θ) = argmax_{a∈[K]} θ^T x_a, where we assume that θ is such that the argmax is unique.

A learning algorithm for the best arm identification problem comprises the following rules: (1) a sampling rule, which determines, based on the past play of arms and observations, which arm to pull next, (2) a stopping rule, which controls the end of the sampling phase and is a function of the past observations and rewards, and (3) a recommendation rule, which, when the algorithm stops, offers a guess for the best arm. The goal of a learning algorithm is: given an error probability δ > 0, identify (guess) a∗ with probability ≥ 1 − δ by pulling as few (in an expected sense) arms as possible. Any algorithm that (1) stops with probability 1 and (2) returns a∗ upon stopping with probability at least 1 − δ is said to be δ-Probably Approximately Correct (δ-PAC). For clarity of exposition, we distinguish the above linear bandit setting from what we term the unstructured bandit setting, wherein K = d and x_k = ê_k, ∀k ∈ [K], the canonical basis vectors (the former setting generalizes the latter).
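As a concrete illustration of this observation model, here is a minimal simulator in Python. All names are our own, and we use standard normal noise, which is 1-subgaussian; this is a sketch of the setting, not code from the paper.

```python
import numpy as np

def pull(x, theta_star, rng):
    """Observe y(x) = theta*^T x + eta, with eta ~ N(0, 1) (1-subgaussian)."""
    return float(x @ theta_star) + rng.normal()

def best_arm(X, theta_star):
    """a* = argmax_a x_a^T theta*, over the rows of X."""
    return int(np.argmax(X @ theta_star))

rng = np.random.default_rng(0)
X = np.eye(3)                          # unstructured special case: K = d, arms = canonical basis
theta_star = np.array([0.9, 0.5, 0.1])
assert best_arm(X, theta_star) == 0    # arm 1 has the largest mean reward
```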
The (expected) number of samples τ ∈ N consumed by an algorithm in determining the optimal arm in any bandit setting (not necessarily the linear setting) is called its sample complexity. In the rest of the paper, we will assume that ‖x_k‖_2 ≤ 1, ∀x_k ∈ X. Given a positive definite matrix A, we denote by ‖x‖_A := √(x^T A x) the matrix norm induced by A. For any i ∈ [K], i ≠ a∗, we define ∆_i := θ∗^T (x_{a∗} − x_i) to be the gap between the largest expected reward and the expected reward of (suboptimal) arm x_i. Let ∆_min := min_{i∈[K], i≠a∗} ∆_i. We denote by B(z, r) the closed ball with center z and radius r. For any measurable space (Ω, F), we define P(Ω) to be the set of all probability measures on Ω. Õ is big-Oh notation that suppresses logarithmic dependence on problem parameters. For the benefit of the reader, we provide a glossary of commonly used symbols in Sec. A in the Appendix.

Algorithm | Sample Complexity | Remarks
XY-static [17] | O( (d/∆²_min)(ln(1/δ) + ln K + ln(1/∆_min)) + d ) | Static allocation, worst-case optimal; dependence on d cannot be removed
LinGapE [15] | O( dH log( dH log(1/δ) ) ) | Fully adaptive, sub-optimal in general
ALBA [19] | O( (Σ_{i=1}^d 1/∆²_i)(ln(K/δ) + ln ln(1/∆_min)) ) | Fully adaptive, sub-optimal in general (see [8])
RAGE [8] | O( D_{θ∗}^{−1} log(1/∆_min) log( K² log(1/∆_min)/δ ) ) | Instance-optimal, but minimax oracle required
PELEG (this paper) | O( log(1/∆_min) D_{θ∗}^{−1} log( (log(1/∆_min))² K²/δ ) / C ) | Instance-optimal (up to a factor of 1/C), explicitly implementable
Table 1: Comparison of sample complexities achieved by various algorithms for the linear multi-armed bandit problem in the literature. Note that K is the number of arms, d is the ambient dimension, δ is the PAC guarantee parameter and ∆_min is the minimum reward gap. H is a complicated term defined in terms of a solution to an offline optimization problem in [15]. Note: C := λ_min( Σ_{x∈X} x x^T ) is a term depending only on the geometry of the arm set.

In this section we describe the main ingredients in our algorithm design and how they build upon ideas introduced in recent work [6, 8] (the explicit algorithm appears in Sec. 4). Note that, in general, the number of arms can be much larger than the ambient dimension, i.e., d ≪ K.

The phased elimination approach: Fiez et al. [8]. We first note that a lower bound on the sample complexity of any δ-PAC algorithm for the canonical (i.e., unstructured) bandit setting [10] was generalized by Fiez et al. [8] to the linear bandit setting, assuming {η_t}_{t≥1} to be standard normal random variables. This result states that any δ-PAC algorithm in the linear setting must satisfy E_{θ∗}[τ] ≥ (log(1/2.4δ)) T_{θ∗}^{−1} ≥ (log(1/2.4δ)) D_{θ∗}^{−1}, where T_{θ∗} := max_{w∈P(X)} min_{θ: a∗(θ)≠a∗(θ∗)} Σ_{k∈[K]} w_k KL(ν^k_θ, ν^k_{θ∗}) and D_{θ∗} := max_{w∈P(X)} min_{x∈X, x≠x∗} (θ∗^T(x∗ − x))² / ‖x∗ − x‖²_{(Σ_{x∈X} w_x x x^T)^{−1}}, where x∗ = x_{a∗}. The bound suggests a natural δ-PAC strategy, namely, to sample arms according to the distribution

w∗ = argmin_{w∈P(X)} max_{x∈X\{x∗}} ‖x∗ − x‖²_{(Σ_{x∈X} w_x x x^T)^{−1}} / ((x∗ − x)^T θ∗)².   (1)

In fact, as [8, Sec.
2.2] explains, using the ordinary least squares (OLS) estimator θ̂ for θ∗ and sampling arm x ∈ X exactly ⌊w∗_x N⌋ times with N = O( log(K/δ) D_{θ∗}^{−1} ) ensures (x∗ − x)^T θ̂ > 0, ∀x ≠ x∗, with probability ≥ 1 − δ. Unfortunately, this sampling distribution cannot directly be implemented since x∗ is unknown.

Fiez et al. circumvent this difficulty by designing a nontrivial strategy (RAGE) that attempts to mimic the optimal allocation w∗ in phases. Specifically, in phase m, it tries to eliminate arms that are about 2^{−m}-suboptimal (in their gaps), by solving (1) with a plugin estimate of θ∗. The resulting fractional allocation, passed through a separate discrete rounding procedure, gives an integer pull count distribution which ensures that all surviving arms' mean differences are estimated with high precision and confidence.

Though direct and appealing, this phased elimination strategy is based crucially on solving minimax problems of the form (1). Though the inner (max) function is convex as a function of w on the probability simplex (see e.g., Lemma 1 in [21]), it is non-smooth, and it is not made explicit how, and to what extent, it must be solved in [8]. Fortunately, we are able to circumvent this obstacle by using ideas from games between no-regret online learners with optimism, as introduced by the work of Degenne et al. [6] for unstructured bandits.

From pure-exploration games to δ-PAC algorithms: Degenne et al. [6]. We briefly explain some of the insights in [6] that we leverage to design an explicit linear bandit δ-PAC algorithm with low computational complexity. For a fixed parameter θ∗ ∈ R^d, consider the two-player, zero-sum Pure-exploration Game in which the
MAX player (or column player) plays an arm k ∈ [K] while the MIN (or row) player chooses an alternative bandit model θ ∈ R^d such that a∗(θ) ≠ a∗. MAX then receives the payoff KL(ν^k_{θ∗}, ν^k_θ) from MIN.
For a given w ∈ P(X), define T_{θ∗}(w) := min_{θ: a∗(θ)≠a∗} Σ_{k∈[K]} w_k KL(ν^k_{θ∗}, ν^k_θ), and let w∗(θ∗) denote the mixed strategy that attains T_{θ∗} := max_{w∈P(X)} T_{θ∗}(w). With
MAX moving first and playing a mixed strategy w ∈ P(X), the value of the game becomes T_{θ∗}. In the unstructured bandit setting, to match the sample complexity lower bound, any algorithm must essentially sample arm k ∈ [K] at rate N^k_t / t → w∗_k(θ∗), where N^k_t is the number of times arm k has been sampled up to time t [12]. This helps explain why any δ-PAC algorithm implicitly needs to solve the Pure Exploration Game T_{θ∗}.

We crucially employ no-regret online learners to solve the Pure Exploration Game for linear bandits. More precisely, no-regret learning with the well-known Exponential Weights rule/negative-entropy mirror descent algorithm [16] on one hand, and a best-response convex programming subroutine on the other, provides a direct sampling strategy that obviates the need for separate allocation optimization and rounding for sampling as in [8]. One crucial advantage of our approach (inspired by [6]) is that we only use a best-response oracle to solve for T_{θ∗}(w), which gives us a computational edge over [8], who employ the computationally more costly max-min oracle to solve max_w T_{θ∗}(w), or its linear bandit equivalent, D_{θ∗}.

Our algorithm, which we call "Phased Elimination Linear Exploration Game" (PELEG), is presented in detail as Algorithm 1. PELEG proceeds in phases, with each phase consisting of multiple rounds, maintaining a set of active arms X_m for testing during phase m. An OLS estimate θ̂_m of θ∗ is used to estimate the mean rewards of active arms and, at the end of phase m, every active arm with a plausible reward more than ≈ 2^{−m} below that of some arm in X_m is eliminated. Let S_m := { x ∈ X \ {x∗} : θ∗^T(x∗ − x) < 2^{−m} }. If we can ensure that X_m ⊂ S_m ∪ {x∗} in every phase m ≥ 1, then PELEG will terminate within ⌈log₂(1/∆_min)⌉ phases, where ∆_min = min_{x≠x∗} θ∗^T(x∗ − x).
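The per-phase OLS estimate θ̂_m is the standard least-squares solution from the pulls made in the phase. A minimal sketch (function name ours):

```python
import numpy as np

def ols_estimate(xs, ys):
    """Ordinary least squares: theta_hat = (sum_s x_s x_s^T)^{-1} (sum_s y_s x_s)."""
    V = sum(np.outer(x, x) for x in xs)       # design matrix V
    b = sum(y * x for x, y in zip(xs, ys))
    return np.linalg.solve(V, b)

# Noiseless sanity check: with eta = 0, OLS recovers theta* exactly once V is invertible.
theta_star = np.array([1.0, -2.0])
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
ys = [x @ theta_star for x in xs]
assert np.allclose(ols_estimate(xs, ys), theta_star)
```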
This statement is proved in Corollary 2 in the Supplementary Material.

If we knew θ∗, then we could sample arms according to the optimal distribution w∗ in (1). However, since all we now have at our disposal is the knowledge that ∆_i ≤ 2^{−m}, ∀x_i ∈ X_m, we can instead construct a sampling distribution w∗_m by solving the surrogate w∗_m = argmin_{w∈P(X)} max_{x,x′∈X_m: x≠x′} ‖x − x′‖²_{(Σ_{x∈X} w_x x x^T)^{−1}}, and sampling each arm in X_m sufficiently often to produce a small enough confidence set. This is precisely what RAGE [8] does. However, solving this optimization is, as mentioned in Sec. 3, computationally expensive, and RAGE repeatedly accesses a minimax oracle to do this. Note that in simulating this algorithm, the authors implement an approximate oracle using the Frank-Wolfe method to solve the outer optimization over w [8, Sec. F]. The max operation, however, renders the optimization objective non-smooth, and it is well-known that the Frank-Wolfe iteration can fail with even simple non-differentiable objectives (see e.g., [4]). We, therefore, deviate from RAGE at this point by employing three novel techniques, the first two motivated by ideas in [6].

• We formulate the above minimax problem as a two-player, zero-sum game. We solve the game sequentially, converging to its Nash equilibrium by invoking the EXP-WTS algorithm [3]. Specifically, in each round t of a phase, PELEG supplies EXP-WTS with an appropriate loss function l^MAX_{t−1} and receives the requisite sampling distribution w_t (lines 15 & 18 of the algorithm). This w_t is then fed to the second no-regret learner, a best-response subroutine, which finds the 'most confusing' plausible model λ to focus next on (line 16). This is a minimization of a quadratic function over a union of finitely many convex sets (halfspaces intersected with a ball), which can be transparently implemented in polynomial time.
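For illustration, if the ball constraint B(0, D_m) is dropped (a relaxation the paper itself uses in its experiments), the best response over each single halfspace has a closed form: min{ ‖λ‖²_W : λ^T a ≥ ε } is attained at λ = ε W⁻¹a / (a^T W⁻¹ a), with value ε² / ‖a‖²_{W⁻¹}. A sketch under that assumption (function names ours):

```python
import numpy as np

def best_response(W, pairs, eps):
    """Minimise ||lam||_W^2 over the union of halfspaces {lam : lam^T (x' - x) >= eps},
    ignoring the ball constraint. For a single halfspace a^T lam >= eps the minimiser is
    lam = eps * W^{-1} a / (a^T W^{-1} a), with value eps^2 / ||a||_{W^{-1}}^2."""
    best_val, best_lam = np.inf, None
    for x, xp in pairs:
        a = xp - x
        Winv_a = np.linalg.solve(W, a)
        val = eps ** 2 / (a @ Winv_a)
        if val < best_val:
            best_val, best_lam = val, eps * Winv_a / (a @ Winv_a)
    return best_lam, best_val

W = np.eye(2)
lam, val = best_response(W, [(np.array([1.0, 0.0]), np.array([0.0, 1.0]))], eps=1.0)
assert np.isclose(val, 0.5)                             # eps^2 / ||a||^2 with a = (-1, 1)
assert np.isclose(lam @ np.array([-1.0, 1.0]), 1.0)     # the halfspace constraint is tight
```

The minimum over the union is simply the smallest of the per-halfspace minima, which is why the whole step is a collection of explicit quadratic programs.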
• Once the sampling distribution is found, there still remains the problem of actually sampling according to it. Given a distribution w ∈ P(X_m), approximating it by sampling x ∈ X either ⌊N w_x⌋ or ⌈N w_x⌉ times can lead to too few (resp. too many) samples. Other naive sampling strategies are, for the same reason, unusable. While [8] invokes a specialized rounding algorithm for this purpose, we opt for a more efficient tracking procedure (line 19): in each round t of phase m, we sample arm k_t := argmin_{k∈[K]} [ n^k_{t−1} − Σ_{s=1}^t w^k_s ], where n^k_t is the number of times arm k has been sampled up to time t. In Lem. 3, we show that this procedure is efficient, i.e., Σ_{s=1}^t w^k_s − (K − 1) ≤ n^k_t ≤ Σ_{s=1}^t w^k_s + 1.

• Finally, in each phase m, we need to sample arms often enough to (i) construct confidence intervals of size at most 2^{−(m+1)} around (x − x′)^T θ∗, ∀x, x′ ∈ X_m, (ii) ensure that X_{m+1} ⊂ S_m ∪ {x∗} and (iii) ensure that x∗ ∈ X_{m+1}. In Sec. E, we prove a Key Lemma (whose argument is discussed in Sec. 5) to show that our novel
Phase Stopping Criterion ensures this with high probability.

It is worth remarking that naively trying to adapt the strategy of Degenne et al. [6] to the linear bandit structure yields a suboptimal (multiplicative √d) dependence in the sample complexity; thus we adopt the phased elimination template of Fiez et al. [8]. We also find, interestingly, that this phased structure eliminates the need to use more complex, self-tuning online learners like AdaHedge [5] in favour of the simpler Exponential Weights (Hedge).

The main theoretical result of this paper is the following performance guarantee.

Algorithm 1 Phased Elimination Linear Exploration Game (PELEG)
1:  Input: X, δ.
2:  Init: m ← 1, X_m ← X.
3:  C ← λ_min( Σ_{k=1}^K x_k x_k^T ).
4:  while |X_m| > 1 do
5:    δ_m ← δ/m².
6:    D_m ← 2^{−m} √( 8 log(K²/δ_m) / ( C · max_{x,x′∈X_m, x≠x′} ‖x − x′‖² ) ).
7:    ε_m ← min{ 1, D_m √C / √(8 log(K²/δ_m)) } · (1/2)^{m+1}.
8:    ∀x ∈ X_m, C_m(x) := { λ ∈ R^d : ∃x′ ∈ X_m, x′ ≠ x, λ^T x′ ≥ λ^T x + ε_m }.
9:    t ← 0, n^k_0 ← 0, ∀k ∈ [K].
10:   Play each arm in X once and collect rewards Y_k ∼ ν^k_{θ∗}, 1 ≤ k ≤ K.    ▷ Burn-in period
11:   ∀k ∈ [K], n^k_K ← 1; V^m_K ← Σ_{k=1}^K x_k x_k^T; t ← K.
12:   Initialize A^MAX_m ≡ EXP-WTS with expert set {ê_1, · · · , ê_K} ⊂ R^K and losses l^MAX_{t−1}(·).    ▷ MAX player: EXP-WTS
13:   while min_{λ ∈ ∪_{x∈X_m} C_m(x) ∩ B(0, D_m)} ‖λ‖²_{V^m_t} ≤ 8 log(K²/δ_m) do    ▷ Phase Stopping Criterion
14:     t ← t + 1.
15:     Get w_t from A^MAX_m and form the matrix W_t = Σ_{k=1}^K w^k_t x_k x_k^T.
16:     λ_t ← argmin_{λ ∈ ∪_{x∈X_m} C_m(x) ∩ B(0, D_m)} ‖λ‖²_{W_t}.    ▷ MIN player: best response
17:     For k ∈ [K], U^k_t := (λ_t^T x_k)².
18:     Construct loss function l^MAX_t(w) = −w^T U_t.
19:     Play arm k_t := argmin_{k∈[K]} [ n^k_{t−1} − Σ_{s=1}^t w^k_s ].    ▷ Tracking
20:     n^{k_t}_t ← n^{k_t}_{t−1} + 1.
21:     Collect sample Y_t = θ∗^T x_{k_t} + η_t.
22:     V^m_t ← V^m_{t−1} + x_{k_t} x_{k_t}^T.
23:   end while
24:   N_m ← t.
25:   Update: θ̂_m ← (V^m_{N_m})^{−1} ( Σ_{s=1}^{N_m} Y_s x_{k_s} ).    ▷ Least-squares estimate of θ∗
26:   Update: X_{m+1} ← X_m \ { x ∈ X_m : ∃x′ ∈ X_m, θ̂_m^T(x′ − x) > 2^{−(m+2)} }.
27:   m ← m + 1.
28: end while
29: return X_m.    ▷ Output surviving arm
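The tracking step (line 19) can be sketched as follows; the asserted sandwich Σ_s w^k_s − (K − 1) ≤ n^k_t ≤ Σ_s w^k_s + 1 is the guarantee of Lemma 3. The Dirichlet draws below are only a stand-in for the EXP-WTS iterates w_t (an assumption of this sketch, not the algorithm's actual learner):

```python
import numpy as np

def track(cum_w, n):
    """Pull the arm whose empirical count lags its cumulative target the most:
    k_t = argmin_k [ n_k - sum_{s<=t} w_s^k ]."""
    return int(np.argmin(n - cum_w))

rng = np.random.default_rng(1)
K, T = 4, 500
n = np.zeros(K)        # pull counts n_t^k
cum_w = np.zeros(K)    # cumulative targets sum_{s<=t} w_s^k
for _ in range(T):
    w = rng.dirichlet(np.ones(K))   # stand-in for the EXP-WTS iterate w_t
    cum_w += w
    n[track(cum_w, n)] += 1
# Tracking guarantee: cum_w - (K - 1) <= n <= cum_w + 1, coordinatewise.
assert np.all(n >= cum_w - (K - 1)) and np.all(n <= cum_w + 1)
```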
Theorem 1 (Sample Complexity of Algorithm 1). With probability at least 1 − δ, PELEG returns the optimal arm after τ rounds, with

τ ≤ c · log₂(1/∆_min) D_{θ∗}^{−1} log( (log₂(1/∆_min))² K²/δ ) (log K)/C + 256 log₂(1/∆_min) D_{θ∗}^{−1} log( (log₂(1/∆_min))² K²/δ ) = Õ( (log(K²/δ)/C) D_{θ∗}^{−1} ),   (2)

where c is an absolute constant. In Sec. 5, we sketch the arguments behind the result. The proof in its entirety can be found in Sec. F in the Supplementary Material. Note 1.
As explained in Sec. 3, the optimal (oracle) allocation requires O( D_{θ∗}^{−1} log(K/δ) ) samples. Comparing this with (2), we see that our algorithm is instance optimal up to logarithmic factors, barring the 1/C term, so the optimality holds whenever C = Ω(1). Recall that C is the smallest eigenvalue of Σ_{x∈X} x x^T. C = Ω(1) is reasonable to expect given that in most applications, feature vectors (i.e., x_1, · · · , x_K) are chosen to represent the feature space well, which translates to a high value of C. Note 2.
The main computational effort in Algorithm 1 is in checking the phase stopping criterion (line 13) and implementing the best-response model learner (line 16), both of which are explicit quadratic programs. Note also that bounding the losses submitted to EXP-WTS to within B(0, D_m) is required only for the regret analysis of EXP-WTS to go through. In practice, as the simulation results show, PELEG works without this and, in fact, permits efficient solution of Step 16 in the algorithm, further reducing computational complexity.

This section outlines the proof of the δ-PAC sample complexity of Algorithm 1 (Theorem 1) and describes the main ideas and challenges involved in the analysis. At a high level, the proof of Theorem 1 involves two main parts: (1) a correctness argument for the central while loop that eliminates arms, and (2) a bound for its length, which, when added across all phases, gives the overall sample complexity bound.
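To make the Note 2 remark concrete: once the ball B(0, D_m) is dropped, the line 13 check reduces, via the per-halfspace closed form, to comparing min over ordered pairs of ε_m² / ‖x′ − x‖²_{(V^m_t)^{−1}} against the threshold 8 log(K²/δ_m). A sketch of this relaxed check (names and the illustrative threshold value are ours):

```python
import numpy as np

def phase_continues(V, active, eps, threshold):
    """Relaxed phase stopping criterion (ball constraint dropped):
    min of ||lam||_V^2 over the union of halfspaces {lam^T (x' - x) >= eps}
    equals the min over ordered pairs of eps^2 / ||x' - x||^2_{V^{-1}}."""
    Vinv = np.linalg.inv(V)
    m = min(eps ** 2 / ((xp - x) @ Vinv @ (xp - x))
            for i, x in enumerate(active)
            for j, xp in enumerate(active) if i != j)
    return m <= threshold   # keep sampling while the ellipsoid still meets a halfspace

active = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
assert phase_continues(1.0 * np.eye(2), active, eps=1.0, threshold=0.6)       # early in the phase
assert not phase_continues(10.0 * np.eye(2), active, eps=1.0, threshold=0.6)  # after enough pulls
```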
1. Ensuring progress (arm elimination) in each phase.
At the heart of the analysis is the following result, which guarantees that upon termination of the central while loop, the uncertainty in estimating all differences of means among the surviving (i.e., non-eliminated) arms remains bounded.
Lemma 1 (Key Lemma). After each phase m ≥ 1, max_{x,x′∈X_m, x≠x′} ‖x − x′‖²_{(V^m_{N_m})^{−1}} ≤ (1/2)^{2(m+1)} / (8 log(K²/δ_m)). Proof sketch.
Phase m ends at time t when the ellipsoid E(0, V^m_t, r_m), with center and shape determined by the arms played in the phase so far, becomes small enough to avoid intersecting the halfspaces C_m(x), for all surviving arms x, within the ball B(0, D_m) (Step 13 of the algorithm), which is required to keep loss functions bounded for no-regret properties.

Suppose, for the sake of simplicity, that only two arms x_i and x_j are present when phase m starts. Figure 1a depicts a possible situation when the phase ends. C_m(x_i) ≡ C_m(x_i; ε_m) and C_m(x_j; ε_m), with ε_m ≈ 2^{−m}, are halfspaces, denoted in gray, that intersect the ball B(0, D_m) in the areas colored red. In this situation, the ellipsoid E(0, V^m_t, r_m), shaded in blue, has just broken away from the red regions in the interior of the ball. Because its extent in the direction x_i − x_j lies within the strip between the two hyperplanes bounding C_m(x_i), C_m(x_j), it can be shown (see proof of lemma in appendix) that ‖x_i − x_j‖_{(V^m_t)^{−1}} is small enough to not exceed roughly 2^{−m}.

The more challenging situation is when the ellipsoid breaks away from the red regions by breaching the boundary of the ball B(0, D_m), as with the green ellipsoid in Figure 1b. The while loop terminating at this time would not satisfy the objective of controlling ‖x_i − x_j‖_{(V^m_t)^{−1}} to within 2^{−m}, since the extent of the ellipsoid in the direction x_i − x_j is larger than the gap between the halfspaces C_m(x_i) and C_m(x_j). A key idea we introduce here is to shrink the hyperplane gap (i.e., ε_m) by a factor (precisely D_m √C (8 log(K²/δ_m))^{−1/2}), which is represented by the min operation in Step 7. In doing this we bring the halfspaces closer, and then insist that the ellipsoid break away from these new halfspaces within the ball. This more stringent requirement guarantees that when the loop terminates, the extent of the final ellipsoid (shaded in blue) stays within the original, unshrunk gap, ensuring ‖x_i − x_j‖_{(V^m_t)^{−1}} ≲ 2^{−m}.
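The geometric fact underlying this sketch is that the extent of the ellipsoid {λ : ‖λ‖²_V ≤ r²} along a direction u is r · ‖u‖_{V⁻¹}, so the ellipsoid fits inside the strip {|λ^T u| ≤ ε} exactly when r‖u‖_{V⁻¹} ≤ ε. In code (function name ours):

```python
import numpy as np

def extent(V, r, u):
    """Support value max{ u^T lam : lam^T V lam <= r^2 } = r * ||u||_{V^{-1}}."""
    return r * np.sqrt(u @ np.linalg.solve(V, u))

# The ellipsoid lies between {lam^T u = -eps} and {lam^T u = +eps}
# iff extent(V, r, u) <= eps, i.e. ||u||_{V^{-1}} <= eps / r.
V = np.diag([4.0, 1.0])
u = np.array([1.0, 0.0])
assert np.isclose(extent(V, r=2.0, u=u), 1.0)   # 2 * sqrt(1/4)
```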
2. Bounding the number of arm pulls in a phase.
The main bound on the length of the central while loop is the following result.
(a) 'Easy' case: the blue ellipsoid separates from the halfspaces intersecting the ball (red) by staying within the ball. In this case its extent along (x_i − x_j) is within the gap between the hyperplanes (parallel black lines).

(b) 'Difficult' case: the green ellipsoid separates from the halfspaces intersecting the ball (red) by breaching the ball. Its extent along (x_i − x_j) exceeds the gap between the hyperplanes (parallel black lines). When forced to separate from a closer pair of halfspaces (dotted black lines), the ellipsoid's (in blue) extent is within the original gap.

Figure 1:
The phase stopping condition in Algorithm 1 ensures ‖x_i − x_j‖_{(V^m_t)^{−1}} ≲ 2^{−m} after phase m.

Lemma 2 (Phase length bound). Let B_m := min_{w∈P([K])} max_{x,x′∈X_m, x≠x′} ‖x − x′‖²_{W(w)^{−1}}, where W(w) = Σ_k w_k x_k x_k^T, and let r_m := √(8 log(K²/δ_m)). There exists δ₀ such that ∀δ < δ₀, the length N_m of any phase m is bounded as:

N_m ≤ B_m 2^{2(m+1)} [ r_m² log K / ((√2 − 1)² C) ] + 1 if ε_m = D_m √C r_m^{−1} (1/2)^{m+1},
N_m ≤ B_m 2^{2(m+1)} r_m² + 1 if ε_m = (1/2)^{m+1}.

To prove this we use the no-regret property of both the best-response
MIN learner and the EXP-WTS MAX learner (the full proof appears in the appendix). A key novelty here is the introduction of the ball B(0, D_m) as a technical device to control the 2-norm radius of the final stopped ellipsoid E(0, V^m_t, r_m) (inequality (i) in the proof) when used with the basic tracking rule over arms introduced by Degenne et al. [6].

We numerically evaluate PELEG against the algorithms
XY-static ([17]), LUCB ([11]), ALBA ([19]), LinGapE ([15]) and RAGE ([8]), for 3 common benchmark settings. The oracle lower bound is also calculated. Note: in our implementation, we ignore the ball B(0, D_m) in the phase stopping criterion; this has the advantage of making the criterion checkable in closed form. We simulate independent N(0, 1) observation noise in each round. All results reported are averaged over 50 trials. We also empirically observe a 100% success rate in identifying the best arm, although a much smaller fixed confidence value δ is passed in all cases. Setting 1: Standard bandit.
The arm set is the standard basis {e_1, e_2, . . . , e_5} in 5 dimensions. The unknown parameter θ∗ is set to (∆, 0, . . . , 0), where ∆ > 0, with ∆ swept across a grid of five values in (0, 1). As noted in [15], for ∆ close to 0, XY-static's essentially uniform allocation is optimal, since we have to estimate all directions equally accurately. However, PELEG performs better (Fig. 2(a)) due to being able to eliminate suboptimal arms earlier instead of sampling uniformly across all arms. Fig. 2(b) compares PELEG and RAGE in a smaller window of ∆ values, where PELEG is found to be competitive with (and often better than) RAGE. Setting 2: Unit sphere.
The arm set comprises vectors sampled uniformly from the surface of the unit sphere S^{d−1}. We pick the two closest arms, say u and v, and then set θ∗ = u + γ(v − u) for a small γ > 0, making u the best arm. We simulate all algorithms over a range of dimensions d starting at d = 10. This setting was first introduced in [19], and PELEG is uniformly competitive with the other algorithms (Fig. 2(c)).

Setting 3: Standard bandit with a confounding arm [17]. We instantiate d canonical basis arms {e_1, e_2, . . . , e_d} and an additional arm x_{d+1} = (cos(ω), sin(ω), 0, . . . , 0) ∈ R^d, for a range of dimensions d, with θ∗ = e_1 so that the first arm is the best arm. By setting 0 < ω ≪ 1, the (d + 1)-th arm becomes the closest competitor. Here, the performance critically depends on how much an agent focuses on comparing arm 1 and arm d + 1. LinGapE performs very well in this setting, and PELEG and RAGE are competitive with it (Fig. 2(d)).
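Setting 3's instance can be constructed as follows (a sketch; the function name is ours, and ω = 0.1 is an arbitrary illustrative choice):

```python
import numpy as np

def confounded_instance(d, omega):
    """Setting 3: canonical arms e_1..e_d plus x_{d+1} = (cos w, sin w, 0, ..., 0);
    theta* = e_1, so arm 1 is best and arm d+1 is its closest competitor for small w."""
    extra = np.r_[np.cos(omega), np.sin(omega), np.zeros(d - 2)]
    X = np.vstack([np.eye(d), extra])
    theta_star = np.eye(d)[0]
    return X, theta_star

X, theta = confounded_instance(d=5, omega=0.1)
gaps = (X[0] - X) @ theta                    # reward gaps relative to the best arm
assert int(np.argmax(X @ theta)) == 0        # arm 1 is optimal
assert np.argmin(gaps[1:]) + 1 == 5          # arm d+1 (index 5) has the smallest gap
```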
Figure 2: Sample complexity performance of linear bandit best arm identification algorithms for different settings: standard bandit (Figs. 2(a), 2(b)), unit sphere (Fig. 2(c)) and standard bandit with confounding arm (Fig. 2(d)).

We have proposed a new, explicitly described algorithm for best arm identification in linear bandits, using tools from game theory and no-regret learning to solve minimax games. Several interesting directions remain unexplored. Removing the less-than-ideal dependence on the feature C of the arm geometry and the extra logarithmic dependence on log(1/δ) are perhaps the most interesting technical questions. It is also of great interest to see whether a more direct game-theoretic strategy, along the lines of [6], exists for structured bandit problems, and whether one can extend this machinery to solve for best policies in more general Markov Decision Processes. Broader Impact.
This work is largely theoretical in its objective. However, the problem for which it attempts to lay sound theoretical foundations, feature-based search, is widely encountered in machine learning. As a result, we anticipate that its implications may carry over to domains that involve continuous, feature-based learning, such as attribute-based recommendation systems, adaptive sensing and robotics applications. Proper care must be taken in such applications to ensure that recommendations or decisions produced by the algorithms set out in this work do not transgress considerations of safety and bias. While we do not address such concerns explicitly in this work, they are important in the design and operation of automated systems that continually interact with human users.
References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Proc. NIPS, pages 2312-2320, 2011.
[2] B. Awerbuch and R. Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97-114, 2008.
[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4] E. Cheung and Y. Li. Solving separable nonsmooth problems using Frank-Wolfe with uniform affine approximations. In IJCAI, pages 2035-2041, 2018.
[5] S. de Rooij, T. van Erven, P. Grünwald, and W. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15:1281-1316, 2014.
[6] R. Degenne, W. M. Koolen, and P. Ménard. Non-asymptotic pure exploration by solving games. In NeurIPS, 2019.
[7] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res., 7:1079-1105, Dec. 2006.
[8] T. Fiez, L. Jain, K. G. Jamieson, and L. Ratliff. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, pages 10666-10676, 2019.
[9] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79-103, 1999.
[10] A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998-1027, Jun. 2016.
[11] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, 2012.
[12] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best-arm identification in multi-armed bandit models. J. Mach. Learn. Res., 17(1):1-42, Jan. 2016.
[13] Y. Kuroki, L. Xu, A. Miyauchi, J. Honda, and M. Sugiyama. Polynomial-time algorithms for combinatorial pure exploration with full-bandit feedback. arXiv preprint arXiv:1902.10582, 2019.
[14] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[15] L. Xu, J. Honda, and M. Sugiyama. Fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, 2017.
[16] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
[17] M. Soare, A. Lazaric, and R. Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems 27, pages 828-836, 2014.
[18] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010.
[19] C. Tao, S. Blanco, and Y. Zhou. Best arm identification in linear bandits with linear dimension dependency. In Proceedings of Machine Learning Research, volume 80, pages 4877-4886. PMLR, 2018.
[20] M. Valko, R. Munos, B. Kveton, and T. Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pages 46-54, 2014.
[21] M. Zaki, A. Mohan, and A. Gopalan. Towards optimal and efficient best arm identification in linear bandits. arXiv preprint arXiv:1911.01695, 2019.
Glossary of symbols

1. A^MAX_m: the EXP-WTS algorithm, used to compute the mixed strategy of the MAX player in each round of Phase m of PELEG.
2. a*: the index of the best arm, i.e., a* := argmax_{i ∈ [K]} x_i^T θ*.
3. B(0, D_m): the closed ball of radius D_m in R^d, centered at 0.
4. C := λ_min( Σ_{x ∈ X} x x^T ).
5. C_m(x) := {λ ∈ R^d : ∃ x' ∈ X_m, x' ≠ x, λ^T x' ≥ λ^T x + ε_m}, the union over x' of the halfspaces {λ ∈ R^d : λ^T(x' − x) ≥ ε_m}.
6. D_m := (2/(√2 − 1)) √( max_{x,x' ∈ X_m, x ≠ x'} ‖x − x'‖² log K / C ), the radius of the ball to which the MIN player is restricted in Phase m (Step 6 of Algorithm 1).
7. d: the dimension of the space in which the feature vectors x_1, ..., x_K reside.
8. Δ_i := (x* − x_i)^T θ*, i ≠ a*.
9. Δ_min := min_{i ≠ a*} Δ_i.
10. δ: the maximum allowable probability of erroneous arm selection (a.k.a. the confidence parameter).
11. δ_m: the confidence level used in Phase m, chosen so that Σ_{m ≥ 1} δ_m ≤ δ (e.g., δ_m ∝ δ/m²).
12. E(0, V, r) := {λ ∈ R^d : λ^T V λ ≤ r²}, the confidence ellipsoid centered at 0, shaped by V and r.
13. H(x, x') := {λ ∈ R^d : λ^T(x' − x) ≥ ε_m}, the halfspace of directions that confuse x with x' at level ε_m.
14. K := |X|, the number of feature vectors.
15. N_m: the length of Phase m.
16. ν_k: the reward distribution of Arm k; rewards from Arm k are drawn i.i.d. from ν_k.
17. P(Ω) := {p ∈ [0, 1]^{|Ω|} : ‖p‖_1 = 1}, the set of all probability measures on a given finite set Ω.
18. r_m := √( 8 log(K²/δ_m) ).
19. θ*: the fixed but unknown vector in R^d that parameterizes the means of ν_1, ..., ν_K, i.e., the mean of ν_k is x_k^T θ*.
20. n_t^k: the number of times Arm k has been sampled up to Round t of PELEG.
21. θ̂_m: the OLS estimate of θ* at the end of Phase m of PELEG.
22. V_t^m := Σ_{s ≤ t} x_s x_s^T, the design matrix in Round t of Phase m.
23. W := Σ_{x ∈ X} w_x x x^T, the design matrix formed by sampling arms according to w ∈ P(X).
24. X := {x_1, ..., x_K}, the feature set.
25. X_m: the set of arms that survive Phase m of PELEG.

B Technical lemmas
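As a quick illustration of items 12, 21 and 22 above (the OLS estimate θ̂, the design matrix V, and the ellipsoidal norm that the confidence set controls), the following sketch builds these objects on synthetic data; the dimensions and the noise model are ours, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 2000
theta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))              # played features x_1, ..., x_n
y = X @ theta_star + rng.standard_normal(n)  # noisy rewards x^T theta* + noise

V = X.T @ X                                  # design matrix: sum_s x_s x_s^T
theta_hat = np.linalg.solve(V, X.T @ y)      # OLS estimate of theta*

err = theta_hat - theta_star
weighted_err = float(err @ V @ err)          # ||theta_hat - theta*||_V^2
```

The quantity `weighted_err` is what the ellipsoid E(θ̂, V, r) bounds with high probability: it stays of order d even as n grows, while the plain Euclidean error shrinks.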
B.1 Tracking lemma
The A^MAX_m subroutine recommends a distribution w_t over the set of arms in every round t. In order to play an arm from this distribution, we use a "tracking" rule, which keeps the number of pulls of each arm k ∈ [K] close to the cumulative weight Σ_{s=1}^t w_s^k.

Lemma 3 (n_t^k tracks Σ_{s=1}^t w_s^k). In any phase m ≥ 1, suppose that for every t > K the following strategy is used to pull arms: choose arm k_t = argmin_{k ∈ [K]} ( n_{t−1}^k − Σ_{s=1}^t w_s^k ). Then, for all t ≥ K and all k ∈ [K],

Σ_{s=1}^t w_s^k − (K − 1) ≤ n_t^k ≤ Σ_{s=1}^t w_s^k + 1.

Proof. We first show the upper bound, by induction on t. Base case: at t = K, n_t^j = 1 = Σ_{s=1}^t w_s^j for all j ∈ [K]. Suppose the upper bound holds for all rounds before t; we show that it holds at t. If k ≠ k_t, then

n_t^k = n_{t−1}^k ≤ Σ_{s=1}^{t−1} w_s^k + 1 ≤ Σ_{s=1}^t w_s^k + 1.

Next, let k = k_t. By the definition of k_t (a minimum is at most the average),

n_{t−1}^{k_t} − Σ_{s=1}^t w_s^{k_t} ≤ (1/K) ( Σ_{j=1}^K n_{t−1}^j − Σ_{j=1}^K Σ_{s=1}^t w_s^j ) = ((t − 1) − t)/K < 0,

so n_t^{k_t} = n_{t−1}^{k_t} + 1 ≤ Σ_{s=1}^t w_s^{k_t} + 1. This proves the upper bound.

For the lower bound, we observe that for any k ∈ [K],

n_t^k = t − Σ_{j ≠ k} n_t^j ≥ t − Σ_{j ≠ k} ( Σ_{s=1}^t w_s^j + 1 ) = t − Σ_{j=1}^K Σ_{s=1}^t w_s^j + Σ_{s=1}^t w_s^k − (K − 1) = Σ_{s=1}^t w_s^k − (K − 1),

where the inequality uses the upper bound on n_t^j, j ∈ [K], and the last equality uses Σ_{j=1}^K Σ_{s=1}^t w_s^j = t.
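The tracking rule and the two-sided bound of Lemma 3 are easy to check empirically. The sketch below uses synthetic weight vectors w_s of our own choosing; the first K rounds pull each arm once, matching the base case of the induction:

```python
import numpy as np

# Empirical check of Lemma 3 on synthetic weight sequences.
rng = np.random.default_rng(0)
K, T = 5, 400
# first K rounds: each arm pulled once (cumulative weight 1 per arm)
W = np.vstack([np.full((K, K), 1.0 / K),
               rng.dirichlet(np.ones(K), size=T - K)])
n = np.ones(K)                    # pull counts after the first K rounds
cum = W[:K].sum(axis=0)           # cumulative weights, all ones here
ok = True
for t in range(K, T):
    cum = cum + W[t]              # sum_{s<=t} w_s^k including this round
    k = int(np.argmin(n - cum))   # tracking rule of Lemma 3
    n[k] += 1
    ok = ok and bool(np.all(cum - (K - 1) <= n + 1e-9)) \
            and bool(np.all(n <= cum + 1 + 1e-9))
```

After the loop, every arm's pull count sits within the stated band around its cumulative target, and the counts sum to the number of rounds played.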
We employ the EXP-WTS algorithm to recommend to the MAX player the distribution to be played in each round t > K. At the start of every phase m ≥ 1, an EXP-WTS subroutine is instantiated afresh, with initial weight 1 for each of the K experts. The K experts are taken to be the standard unit vectors (0, ..., 0, 1, 0, ..., 0), with 1 at the k-th position, k ∈ [K]. The EXP-WTS subroutine recommends an exponentially weighted probability distribution over the arms, depending on the weights of the experts. The loss function supplied to update the weights of each expert is indicated in Step 18 of Algorithm 1.

EXP-WTS requires a bound on the losses (rewards) in order to set its learning parameter optimally. This is ensured by passing an upper bound of D_m² (since in any Phase m, ‖λ‖ ≤ D_m; see Step 13 of Algorithm 1).

Lemma 4. In any phase m, at any round t > K, A^MAX_m has regret bounded as

R_t ≤ (√2/(√2 − 1)) D_m² √(t log K).

Proof. The proof involves a simple modification of the regret analysis of the EXP-WTS algorithm (see, for example, [3]), with the losses scaled to lie in [0, D_m²], followed by the well-known doubling trick.

C Proof of Key Lemma
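For concreteness, a minimal exponentially weighted forecaster is sketched below. It is not the exact A^MAX_m subroutine (which feeds in the loss of Step 18 of Algorithm 1 and uses the doubling trick); here the horizon is known, so a fixed learning rate suffices, and we can check the realized regret against the standard bound B·√(T log K / 2) for losses in [0, B]:

```python
import numpy as np

def exp_wts(losses, loss_bound):
    """Exponentially weighted forecaster over K experts with losses in
    [0, loss_bound].  Sketch only: fixed (known-horizon) learning rate
    instead of the doubling trick invoked by Lemma 4."""
    T, K = losses.shape
    eta = np.sqrt(8.0 * np.log(K) / T) / loss_bound
    w = np.ones(K)
    incurred = 0.0
    for t in range(T):
        p = w / w.sum()                    # mixed strategy over the K experts
        incurred += float(p @ losses[t])
        w = w * np.exp(-eta * losses[t])   # multiplicative weight update
    return incurred - float(losses.sum(axis=0).min())  # regret vs best expert

rng = np.random.default_rng(3)
L = 2.0 * rng.random((1000, 10))           # synthetic losses in [0, 2]
regret = exp_wts(L, 2.0)
```

The loss bound enters only through the learning rate, which is why PELEG must pass D_m² (the largest possible loss in Phase m) to the subroutine.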
Lemma 1 (Key Lemma). After each phase m ≥ 1,

max_{x,x' ∈ X_m, x ≠ x'} ‖x − x'‖²_{(V^m_{N_m})^{-1}} ≤ (1/2)^{2(m+1)} / (8 log(K²/δ_m)).

Proof. Recall that r_m = √(8 log(K²/δ_m)). The phase stopping criterion is

STOP at round t ≥ K if: min_{λ ∈ (∪_{x ∈ X_m} C_m(x)) ∩ B(0, D_m)} ‖λ‖²_{V_t^m} > r_m². (3)

Note that the set C_m(x) depends on the value that ε_m takes in phase m. Depending on this value, we divide the analysis into the following two cases.

Case 1. ε_m = (1/2)^{m+1}. In this case D_m √C / r_m ≥ 1. For any phase m ≥ 1 and t ≥ 1, define the ellipsoid E(0, V_t^m, r_m) := {θ ∈ R^d : ‖θ‖²_{V_t^m} ≤ r_m²}. The phase stopping rule at round t ≥ K is equivalent to:

STOP if: E(0, V_t^m, r_m) ∩ { (∪_{x ∈ X_m} C_m(x)) ∩ B(0, D_m) } = ∅ (empty set)
⇔ { E(0, V_t^m, r_m) ∩ B(0, D_m) } ∩ { ∪_{x ∈ X_m} C_m(x) } = ∅.

However, by Rayleigh's inequality followed by the fact that D_m √C / r_m ≥ 1, we have, for any θ ∈ E(0, V_t^m, r_m),

‖θ‖² ≤ ‖θ‖²_{V_t^m} / λ_min(V_t^m) (*) ≤ ‖θ‖²_{V_t^m} / λ_min( Σ_{k=1}^K x_k x_k^T ) ≤ r_m² / C ≤ D_m².

The inequality (*) follows from the fact that, for t ≥ K, V_t^m = Σ_{k=1}^K x_k x_k^T + Σ_{s=K+1}^t x_s x_s^T ⪰ Σ_{k=1}^K x_k x_k^T. Therefore E(0, V_t^m, r_m) ⊆ B(0, D_m) for all t ≥ K, and the phase stopping rule reduces to:

STOP if: E(0, V_t^m, r_m) ∩ { ∪_{x ∈ X_m} C_m(x) } = ∅
⇔ min_{λ ∈ ∪_{x ∈ X_m} C_m(x)} ‖λ‖²_{V_t^m} > r_m²
⇔ for every pair (x, x') ∈ X_m × X_m with x ≠ x': min_{λ : λ^T x' ≥ λ^T x + (1/2)^{m+1}} ‖λ‖²_{V_t^m} > r_m².
The above reduction is a minimization problem over a union of halfspaces. For any fixed pair (x, x') ∈ X_m × X_m, x ≠ x', this is a quadratic optimization problem with a single linear constraint, which can be solved explicitly by the standard Lagrange multiplier method.

Lemma 5 (Supporting lemma for Lemma 1). For any two arms x and x',

min_{λ : λ^T x' ≥ λ^T x + (1/2)^{m+1}} ‖λ‖²_{V_t^m} = (1/2)^{2(m+1)} / ‖x − x'‖²_{(V_t^m)^{-1}}.

Proof. The result follows by solving the optimization problem explicitly using the Lagrange multiplier method. (Rayleigh's inequality, used above, states that for any PSD matrix A and x ∈ R^d, λ_min(A) ≤ x^T A x / x^T x ≤ λ_max(A).)
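The closed form of Lemma 5 can be verified numerically. The sketch below builds a synthetic positive definite V, computes the Lagrange minimiser, and checks both its feasibility and its optimality against random feasible points (all instance data here are ours, for illustration):

```python
import numpy as np

# Numerical sanity check of the closed form in Lemma 5.
rng = np.random.default_rng(2)
d = 6
A = rng.standard_normal((d, d))
V = A @ A.T + np.eye(d)                  # a positive definite design matrix
x, xp = rng.standard_normal(d), rng.standard_normal(d)
eps = 0.3
g = xp - x
gap = float(g @ np.linalg.solve(V, g))   # ||x - x'||^2 in the V^{-1} norm

# Lagrange conditions give the minimiser in closed form:
lam = eps * np.linalg.solve(V, g) / gap
min_val = float(lam @ V @ lam)           # should equal eps^2 / gap

assert lam @ g >= eps - 1e-9             # feasible: lam^T x' >= lam^T x + eps
for _ in range(100):                     # no random feasible point does better
    z = rng.standard_normal(d)
    if z @ g < eps:                      # shift z onto the constraint boundary
        z = z + ((eps - z @ g) / (g @ g)) * g
    assert z @ V @ z >= min_val - 1e-9
```

The same computation, with ε_m in place of (1/2)^{m+1}, underlies the Case 2 analysis below.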
By using the above lemma, the stopping rule becomes:

STOP if: ∀ x, x' ∈ X_m, x ≠ x': ‖x − x'‖²_{(V_t^m)^{-1}} < (1/2)^{2(m+1)} / r_m².

Hence, at round t = N_m we have, for all x, x' ∈ X_m, x ≠ x',

‖x − x'‖²_{(V^m_{N_m})^{-1}} < (1/2)^{2(m+1)} / r_m² = (1/2)^{2(m+1)} / (8 log(K²/δ_m)).

Case 2. ε_m = (D_m √C / r_m)(1/2)^{m+1}. In this case, D_m √C / r_m < 1. The phase ends when, for all pairs (x, x') ∈ X_m × X_m, x ≠ x',

min_{λ ∈ {λ ∈ R^d : λ^T x' ≥ λ^T x + ε_m} ∩ B(0, D_m)} ‖λ‖²_{V_t^m} > r_m².

Let us decompose the optimization problem defining the phase stopping criterion into smaller sub-problems, one for each pair of arms in X_m. That is, we split the set ∪_{x ∈ X_m} C_m(x) in equation (3), and consider the following problem: for any pair of distinct arms (x, x') ∈ X_m × X_m,

P(x, x'): min_{λ ∈ {λ ∈ R^d : λ^T x' ≥ λ^T x + ε_m} ∩ B(0, D_m)} ‖λ‖²_{V_t^m}.

Let t_{x,x'} be the first round at which the value of P(x, x') exceeds r_m². Clearly, N_m = max_{(x,x') ∈ X_m, x ≠ x'} t_{x,x'}. In addition, for any t ≥ t_{x,x'},

‖λ‖²_{V_t^m} = λ^T ( V^m_{t_{x,x'}} + Σ_{s=t_{x,x'}+1}^t x_s x_s^T ) λ = ‖λ‖²_{V^m_{t_{x,x'}}} + Σ_{s=t_{x,x'}+1}^t (x_s^T λ)² ≥ ‖λ‖²_{V^m_{t_{x,x'}}} > r_m².

Hence, once the inequality is fulfilled for a given pair of arms (x, x'), it remains satisfied in all subsequent rounds. We now analyze the problem P(x, x') for each pair of arms (x, x') ∈ X_m individually. For any t ≥ 1, define

λ*_t ∈ argmin_{λ ∈ {λ ∈ R^d : λ^T x' ≥ λ^T x + ε_m} ∩ B(0, D_m)} ‖λ‖²_{V_t^m}.
Note that λ*_t is specific to the pair (x, x').

Claim 1. λ*_t^T (x' − x) = ε_m for all t ≥ 1.
For the proof, denote λ* ≡ λ*_t, and suppose that the claim were not true, i.e., λ*^T(x' − x) = ε_m + a for some a > 0. Let b = a / (λ*^T(x' − x)); then 0 < b < 1. Define λ' := (1 − b) λ*. By construction, λ'^T(x' − x) = ε_m, and ‖λ'‖ = (1 − b)‖λ*‖ < ‖λ*‖. Hence λ' ∈ {λ ∈ R^d : λ^T x' ≥ λ^T x + ε_m} ∩ B(0, D_m). However, ‖λ'‖²_{V_t^m} = (1 − b)² ‖λ*‖²_{V_t^m} < ‖λ*‖²_{V_t^m}, which is a contradiction.

At t = t_{x,x'}, we have min_{λ ∈ {λ ∈ R^d : λ^T x' ≥ λ^T x + ε_m} ∩ B(0, D_m)} ‖λ‖²_{V_t^m} > r_m². We consider two sub-cases, depending on the 2-norm of λ*_t.

Sub-case 1. ‖λ*_t‖ < D_m. In this case, we have the equivalence

min_{λ ∈ {λ : λ^T x' ≥ λ^T x + ε_m} ∩ B(0, D_m)} ‖λ‖²_{V_t^m} = min_{λ ∈ {λ : λ^T x' ≥ λ^T x + ε_m}} ‖λ‖²_{V_t^m},

which follows by noting that if ‖λ*_t‖ < D_m, then the Lagrange multiplier corresponding to the norm constraint is zero. Hence, at round t = t_{x,x'}, solving the resulting Lagrangian problem (as in Lemma 5, with ε_m in place of (1/2)^{m+1}) gives

‖x − x'‖²_{(V_t^m)^{-1}} < ε_m² / r_m² = (D_m² C / r_m²)(1/2)^{2(m+1)} / r_m² < (1/2)^{2(m+1)} / r_m²,

where the last inequality follows from the hypothesis of Case 2. Since N_m ≥ t_{x,x'}, we get

‖x − x'‖²_{(V^m_{N_m})^{-1}} ≤ ‖x − x'‖²_{(V^m_{t_{x,x'}})^{-1}} < (1/2)^{2(m+1)} / r_m² = (1/2)^{2(m+1)} / (8 log(K²/δ_m)).

Sub-case 2. ‖λ*_t‖ = D_m. This sub-case is more involved. At t = t_{x,x'}, λ*_t has the following properties:
• ‖λ*_t‖²_{V_t^m} > r_m²;
• ‖λ*_t‖ = D_m;
• λ*_t^T (x' − x) = ε_m.
We divide the analysis of this sub-case into two further sub-cases.
Sub-sub-case 1. r_m ‖x − x'‖_{(V_t^m)^{-1}} > ε_m. Let θ*_t := argmax_{θ ∈ E(0, V_t^m, r_m)} θ^T(x' − x). One can verify, by solving the maximization problem explicitly, that θ*_t^T(x' − x) = r_m ‖x' − x‖_{(V_t^m)^{-1}}. Let θ₁ := ( θ*_t^T(x' − x) / ‖x' − x‖² ) (x' − x), the projection of θ*_t onto the direction x' − x. The following properties of θ₁ hold by construction and are straightforward to verify:
• ‖θ₁‖ = r_m ‖x' − x‖_{(V_t^m)^{-1}} / ‖x' − x‖;
• θ₁^T (θ*_t − θ₁) = 0.

Let λ₁ := ( λ*_t^T(x' − x) / ‖x' − x‖² ) (x' − x). It follows that ‖λ₁‖ = |λ*_t^T(x' − x)| / ‖x' − x‖ = ε_m / ‖x' − x‖. Finally, let us define two more quantities:

λ₂ := ( r_m ‖x' − x‖_{(V_t^m)^{-1}} / ε_m ) λ*_t and θ₂ := ( ε_m / (r_m ‖x' − x‖_{(V_t^m)^{-1}}) ) θ*_t.

By the hypothesis of sub-sub-case 1, ‖θ₂‖ < ‖θ*_t‖, which implies θ₂ ∈ E(0, V_t^m, r_m). Next, we make the following two claims on the 2-norms of θ₂ and θ*_t − θ₁.

Claim. ‖θ₂‖ > D_m.
Suppose that θ₂ ∈ B(0, D_m). By construction, θ₂^T(x' − x) = ε_m. Hence θ₂ ∈ {λ ∈ R^d : λ^T x' ≥ λ^T x + ε_m} ∩ B(0, D_m). Since θ₂ ∈ E(0, V_t^m, r_m), this implies ‖θ₂‖²_{V_t^m} ≤ r_m². However, this is a contradiction, since at round t, min_{λ ∈ {λ : λ^T x' ≥ λ^T x + ε_m} ∩ B(0, D_m)} ‖λ‖²_{V_t^m} > r_m². Hence, we have

D_m < ‖θ₂‖ = ( ε_m / (r_m ‖x' − x‖_{(V_t^m)^{-1}}) ) ‖θ*_t‖ = (D_m / ‖λ₂‖) ‖θ*_t‖ ⟹ ‖θ*_t‖ > ‖λ₂‖.

Claim. ‖θ*_t − θ₁‖ > ‖λ₂ − θ₁‖.
First, we note that (x' − x)^T θ*_t = r_m ‖x' − x‖_{(V_t^m)^{-1}} = (x' − x)^T λ₂, so that

θ₁^T (θ*_t − λ₂) = ( θ*_t^T(x' − x) / ‖x' − x‖² ) (x' − x)^T ( θ*_t − λ₂ ) = 0.

Next, observe that

‖θ*_t − θ₁‖² = ‖θ*_t‖² + ‖θ₁‖² − 2 θ*_t^T θ₁ = ‖θ*_t‖² + ‖θ₁‖² − 2 (θ*_t − λ₂)^T θ₁ − 2 θ₁^T λ₂ = ‖θ*_t‖² + ‖θ₁‖² − 2 θ₁^T λ₂ > ‖λ₂‖² + ‖θ₁‖² − 2 θ₁^T λ₂ = ‖λ₂ − θ₁‖²,

where the inequality uses ‖θ*_t‖ > ‖λ₂‖ from the previous claim.

Putting things together: since θ*_t ∈ E(0, V_t^m, r_m), Rayleigh's inequality gives ‖θ*_t‖ ≤ r_m/√C, and hence

r_m ‖x' − x‖_{(V_t^m)^{-1}} D_m / ε_m = ‖λ₂‖ < ‖θ*_t‖ ≤ r_m/√C
⟹ ‖x' − x‖²_{(V_t^m)^{-1}} < ε_m² / (D_m² C) = (D_m² C / r_m²)(1/2)^{2(m+1)} / (D_m² C) = (1/2)^{2(m+1)} / r_m² = (1/2)^{2(m+1)} / (8 log(K²/δ_m)).
Sub-sub-case 2. r_m ‖x − x'‖_{(V_t^m)^{-1}} ≤ ε_m. This case is immediate: by the hypothesis,

‖x − x'‖²_{(V_t^m)^{-1}} ≤ ε_m² / r_m² = (D_m² C / r_m²)(1/2)^{2(m+1)} / r_m² < (1/2)^{2(m+1)} / r_m² = (1/2)^{2(m+1)} / (8 log(K²/δ_m)).

This completes the proof of the key lemma.
D Proofs of bounds on phase length
In this section we provide an upper bound on the length of any phase m ≥ 1. Clearly, the length of phase m is governed by the value of ε_m in that phase. Towards this, we have the following lemma.

Lemma 2 (Phase length bound). Let B_m := min_{w ∈ P_K} max_{x,x' ∈ X_m, x ≠ x'} ‖x − x'‖²_{W^{-1}}, where W = Σ_{k=1}^K w_k x_k x_k^T. There exists δ₀ such that, for all δ < δ₀, the length N_m of any phase m is bounded as:

N_m ≤ B_m 2^{2(m+1)} [ r_m² log K / ((√2 − 1)² C) ] + 1 if ε_m = (D_m √C / r_m)(1/2)^{m+1},
N_m ≤ B_m 2^{2(m+1)} r_m² + 1 if ε_m = (1/2)^{m+1}.
Recall that r_m = √(8 log(K²/δ_m)). Let t be the last round in phase m before the phase ends. Then, by the definition of the phase stopping rule (Step 12 of the algorithm),

r_m² ≥ min_{λ ∈ (∪_{x ∈ X_m} C_m(x)) ∩ B(0, D_m)} ‖λ‖²_{V_t^m}
(i) ≥ min_{λ ∈ (∪_{x ∈ X_m} C_m(x)) ∩ B(0, D_m)} Σ_{s=1}^t ‖λ‖²_{W_s} − K² D_m²
(ii) ≥ Σ_{s=1}^t ‖λ_s‖²_{W_s} − K² D_m²
(iii) = Σ_{s=1}^t Σ_{k=1}^K w_s^k (λ_s^T x_k)² − K² D_m²
(iv) ≥ max_{w ∈ P_K} Σ_{s=1}^t Σ_{k=1}^K w_k (λ_s^T x_k)² − (√2/(√2−1)) D_m² √(t log K) − K² D_m²
= t · max_{w ∈ P_K} (1/t) Σ_{s=1}^t ‖λ_s‖²_W − (√2/(√2−1)) D_m² √(t log K) − K² D_m²
(v) ≥ t · max_{w ∈ P_K} min_{q ∈ P((∪_{x ∈ X_m} C_m(x)) ∩ B(0, D_m))} E_{λ∼q}[ ‖λ‖²_W ] − (√2/(√2−1)) D_m² √(t log K) − K² D_m²
(vi) ≥ t · max_{w ∈ P_K} min_{q ∈ P(∪_{x ∈ X_m} C_m(x))} E_{λ∼q}[ ‖λ‖²_W ] − (√2/(√2−1)) D_m² √(t log K) − K² D_m²
(vii) = t ε_m² / B_m − (√2/(√2−1)) D_m² √(t log K) − K² D_m².

Here, the steps follow because of (i) Lemma 3 (tracking), (ii) the best response of the MIN player as given in Step 15 of the algorithm, (iii) the definition of W_s in Step 14, (iv) the regret property of the MAX player (Lemma 4), (v) the fact that the empirical distribution (1/t) Σ_{s=1}^t 1{λ = λ_s} belongs to P((∪_{x ∈ X_m} C_m(x)) ∩ B(0, D_m)), (vi) taking the minimum over a larger set, and (vii) solving the inner minimization explicitly (as in Lemma 5) and recalling the definition of B_m. We thus have

t − (√2/(√2−1)) (B_m D_m² / ε_m²) √(log K) √t ≤ (B_m/ε_m²) r_m² + (B_m/ε_m²) K² D_m². (4)

We now do the analysis depending on the value that ε_m takes in phase m.

Case 1. ε_m = (D_m √C / r_m)(1/2)^{m+1}. In this case we have D_m √C / r_m < 1. Applying the value of ε_m in eq.
(4), we obtain

t − (√2/(√2−1)) (B_m r_m² 2^{2(m+1)} / C) √(log K) √t ≤ B_m r_m⁴ 2^{2(m+1)} / (D_m² C) + B_m r_m² 2^{2(m+1)} K² / C. (5)

Let T_m := B_m r_m⁴ 2^{2(m+1)} / (D_m² C) + B_m r_m² 2^{2(m+1)} K² / C. The function t ↦ √t is differentiable and concave, so for any t, T_m > 0, √t ≤ √T_m + (t − T_m)/(2√T_m). Applying this to (5) and rearranging, we get

t ≤ T_m ( 1 − (√2/(√2−1)) (B_m r_m² 2^{2(m+1)} / C) √(log K) / (2√T_m) )^{-1},

provided the subtracted term is less than 1, which is verified below using the value of D_m. For δ small enough, the first term in the definition of T_m dominates the second, i.e., there exists δ₀^{(1)} > 0 such that for all δ < δ₀^{(1)},

B_m r_m² 2^{2(m+1)} K² / C ≤ B_m r_m⁴ 2^{2(m+1)} / (D_m² C) ⟺ r_m² ≥ K² D_m², (6)

which gives T_m ≤ 2 B_m r_m⁴ 2^{2(m+1)} / (D_m² C). We also note the following lower bound on B_m:

B_m = min_{w ∈ P_K} max_{x,x' ∈ X_m, x ≠ x'} ‖x − x'‖²_{W^{-1}} ≥ min_{w ∈ P_K} max_{x,x' ∈ X_m, x ≠ x'} λ_min(W^{-1}) ‖x − x'‖² = min_{w ∈ P_K} max_{x,x' ∈ X_m, x ≠ x'} ‖x − x'‖² / λ_max(W) ≥ max_{x,x' ∈ X_m, x ≠ x'} ‖x − x'‖²,

where the last inequality uses λ_max(W) ≤ Σ_k w_k ‖x_k‖² ≤ 1.
By using the value of D_m as given in Step 6 of the algorithm,

D_m / √( B_m 2^{2(m+1)} log K ) = (2/(√2−1)) √( max_{x,x' ∈ X_m, x ≠ x'} ‖x − x'‖² log K / C ) / √( B_m 2^{2(m+1)} log K ) ≥ (2/(√2−1)) (1/2)^{m+1} / √C,

where we used B_m ≥ max_{x,x' ∈ X_m, x ≠ x'} ‖x − x'‖². This keeps the subtracted term in the bound on t above bounded away from 1, and substituting the value of D_m then yields

t ≤ B_m 2^{2(m+1)} [ r_m² log K / ((√2−1)² C) ].

Since, by assumption, C ≡ λ_min( Σ_{k=1}^K x_k x_k^T ) = Θ(1), we have t ≤ O( B_m 2^{2(m+1)} r_m² log K ) for all δ < δ₀^{(1)}.

Case 2. ε_m = (1/2)^{m+1}. We have in this case that D_m √C / r_m ≥ 1. Applying the value of ε_m in eq. (4), we obtain

t − (√2/(√2−1)) B_m D_m² 2^{2(m+1)} √(log K) √t ≤ B_m r_m² 2^{2(m+1)} + B_m K² D_m² 2^{2(m+1)}. (7)

Let T_m := B_m r_m² 2^{2(m+1)} + B_m K² D_m² 2^{2(m+1)}. As before, noting that t ↦ √t is concave and differentiable, we have √t ≤ √T_m + (t − T_m)/(2√T_m). Applying this to (7) and rearranging, and going along the same lines as Case 1, we see that there exists δ₀^{(2)} > 0 such that for all δ < δ₀^{(2)}, T_m ≤ 2 B_m r_m² 2^{2(m+1)}, whence t ≤ B_m 2^{2(m+1)} r_m².
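The quantity B_m is itself the value of an offline minimax design problem. The following sketch approximates it by mirror descent on the simplex; this helper is ours and only for intuition, since PELEG never solves this problem offline but instead converges to the saddle point online with no-regret players. For the canonical basis in R³ the optimal design is uniform and the optimal value is 1/w_i + 1/w_j = 6:

```python
import numpy as np

def B_value(w, X, pairs, reg=1e-9):
    """max_{x != x'} ||x - x'||^2_{W(w)^{-1}} for W(w) = sum_k w_k x_k x_k^T."""
    W = np.einsum('k,ki,kj->ij', w, X, X) + reg * np.eye(X.shape[1])
    Winv = np.linalg.inv(W)
    vals = [(X[i] - X[j]) @ Winv @ (X[i] - X[j]) for i, j in pairs]
    b = int(np.argmax(vals))
    return vals[b], pairs[b], Winv

def min_max_design(X, iters=3000, eta=0.002):
    """Mirror-descent sketch for B_m = min_w max ||x - x'||^2_{W^{-1}}."""
    K, _ = X.shape
    pairs = [(i, j) for i in range(K) for j in range(i + 1, K)]
    w = np.array([0.5] + [0.5 / (K - 1)] * (K - 1))  # deliberately bad start
    best = np.inf
    for _ in range(iters):
        val, (i, j), Winv = B_value(w, X, pairs)
        best = min(best, val)
        y = Winv @ (X[i] - X[j])
        grad = -(X @ y) ** 2          # subgradient of w -> ||x_i - x_j||^2_{W^{-1}}
        w = w * np.exp(-eta * grad)   # multiplicative (mirror-descent) step
        w = w / w.sum()
    return best

B = min_max_design(np.eye(3))         # canonical basis in R^3: optimum is 6
```

The objective is convex in w, so the tracked minimum converges to the true design value; PELEG's online MAX/MIN players approximate the same saddle point one sample at a time.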
We now set δ₀ := min{ δ₀^{(1)}, δ₀^{(2)} }.

E Justification of elimination criteria
In this section, we argue that progress is made after every phase of the algorithm, and we establish its correctness. Let us define a few quantities that will be useful for the analysis. Let

S_m := { x ∈ X : θ*^T (x* − x) < 2^{−m} }.

Let B*_m := min_{w ∈ P_K} max_{(x,x') ∈ S_m, x ≠ x'} ‖x − x'‖²_{W^{-1}}, where W = Σ_{k=1}^K w_k x_k x_k^T. Finally, define T*_m to be the quantity T_m from the proof of Lemma 2, with B_m replaced by B*_m. Define a sequence of favorable events {G_m}_{m ≥ 1} by

G_m := { N_m ≤ N̄*_m } ∩ { x* ∈ X_{m+1} } ∩ { X_{m+1} ⊆ S_{m+1} },

where N̄*_m denotes the (non-random) upper bound on the phase length obtained in the proof of Lemma 2, with B_m and T_m replaced by B*_m and T*_m.
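The elimination criterion appearing in the events G_m (Step 25 of Algorithm 1) keeps an arm only if no surviving arm beats it by more than 2^{−(m+2)} under the current estimate. A sketch on a toy instance of our own:

```python
import numpy as np

def eliminate(X_m, theta_hat, m):
    """Phase-m elimination rule (sketch of Step 25 of Algorithm 1): drop
    any arm that some surviving arm beats by more than 2^-(m+2) under
    the current estimate theta_hat."""
    scores = X_m @ theta_hat
    keep = scores >= scores.max() - 2.0 ** (-(m + 2))
    return X_m[keep]

arms = np.eye(3)                       # toy instance (ours)
theta_hat = np.array([1.0, 0.9, 0.2])
survivors_early = eliminate(arms, theta_hat, m=1)  # threshold 1/8
survivors_late = eliminate(arms, theta_hat, m=5)   # threshold 1/128
```

As the threshold halves each phase, arms with ever-smaller estimated gaps to the leader are removed, which is exactly how X_{m+1} ⊆ S_{m+1} is maintained in Lemma 6 below.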
Conditioned on the event G_{m−1}, we have x* ∈ X_m and X_m ⊆ S_m. Hence B_m ≤ B*_m and T_m ≤ T*_m, so that, under G_{m−1}, N_m ≤ N̄*_m almost surely. Note here that the right-hand side is a non-random quantity.
Lemma 6. P[ G_m | G_{m−1}, ..., G_1 ] ≥ 1 − δ_m.

Proof of Lemma 6. Let y = x_i − x_j for some x_i, x_j ∈ X_m, x_i ≠ x_j. Since θ̂_m is a least squares estimate of θ*, conditioned on the realization of the set X_m, y^T(θ̂_m − θ*) is a ‖y‖_{(V^m_{N_m})^{-1}}-sub-Gaussian random variable. By the key lemma (Lemma 1), ‖y‖²_{(V^m_{N_m})^{-1}} ≤ (1/2)^{2(m+1)} / (8 log(K²/δ_m)). Using the standard sub-Gaussian tail bound, for any η ∈ (0, 1),

P[ |y^T(θ̂_m − θ*)| > √( 2 ‖y‖²_{(V^m_{N_m})^{-1}} log(2/η) ) | G_{m−1}, ..., G_1 ] ≤ η,

which implies

P[ |y^T(θ̂_m − θ*)| > √( ( 2 log(2/η) / (8 log(K²/δ_m)) ) (1/2)^{2(m+1)} ) | G_{m−1}, ..., G_1 ] ≤ η.

Taking a union bound over y ∈ Y(X_m) := {x − x' : x, x' ∈ X_m, x ≠ x'}, and setting η = 2δ_m/K², gives

P[ ∀ y ∈ Y(X_m) : |y^T(θ* − θ̂_m)| ≤ 2^{−(m+2)} | G_{m−1}, ..., G_1 ] > 1 − δ_m. (9)

Conditioned on G_{m−1}, x* ∈ X_m. Let x' ∈ X_m be such that x' ∉ S_{m+1}, i.e., θ*^T(x* − x') ≥ 2^{−(m+1)}, and let y = x* − x' ∈ Y(X_m). By eq. (9) we have, with probability ≥ 1 − δ_m:

(x* − x')^T( θ* − θ̂_m ) ≤ 2^{−(m+2)} ⟹ θ̂_m^T( x* − x' ) > 2^{−(m+1)} − 2^{−(m+2)} = 2^{−(m+2)}.

Thus arm x' gets eliminated after phase m by the elimination criterion of Algorithm 1 (see Step 25). Hence X_{m+1} ⊆ S_{m+1} w.p. ≥ 1 − δ_m.

Next, we show that conditioned on G_{m−1}, x* ∈ X_{m+1} on the same event. Suppose that x* were eliminated at the end of phase m. This means that there exists x' ∈ X_m such that θ̂_m^T( x' − x* ) > 2^{−(m+2)}. However, by eq. (9), (x' − x*)^T( θ̂_m − θ* ) ≤ 2^{−(m+2)}, so θ*^T( x' − x* ) > 0, which contradicts the optimality of x*. This, along with Note 3, shows that P[ G_m | G_{m−1}, ..., G_1 ] ≥ 1 − δ_m.

Corollary 1. P[ ∩_{m ≥ 1} G_m ] ≥ ∏_{m=1}^∞ (1 − δ_m) ≥ 1 − Σ_{m ≥ 1} δ_m ≥ 1 − δ.

Corollary 2.
The maximum number of phases of Algorithm 1 is bounded by log₂(1/Δ_min) + 1.

Proof.
Recall that Δ_min = min_{x ∈ X : x ≠ x*} θ*^T(x* − x). The proof follows by observing that after any phase m, under the favorable event G_{m−1}, X_m ⊆ S_m. Since the set S_m shrinks exponentially with the number of phases (because S_m = {x ∈ X : θ*^T(x* − x) < 2^{−m}}), once m ≥ log₂(1/Δ_min) only the best arm survives, and the result follows.

F Proof of bound on sample complexity
We begin by observing the following useful result from [8]. Recall that

D_{θ*} := max_{w ∈ Δ_K} min_{x ∈ X, x ≠ x*} ( θ*^T(x* − x) )² / ‖x* − x‖²_{W^{-1}}.

Proposition 1 ([8]). For an absolute constant c₁ > 0,

Σ_{m=1}^{log₂(1/Δ_min)} 2^{2m} B*_m ≤ c₁ log₂(1/Δ_min) / D_{θ*}.

Using Proposition 1, we now give a bound on the asymptotic sample complexity of Algorithm 1.
Theorem 2. With probability at least 1 − δ, PELEG returns the optimal arm after τ rounds, where, for absolute constants c₁, c₂ > 0,

τ ≤ (c₁ / D_{θ*}) log₂(1/Δ_min) log( log₂(1/Δ_min) K²/δ ) [ log K / ((√2 − 1)² C) ] + (c₂ / D_{θ*}) log₂(1/Δ_min) log( log₂(1/Δ_min) K²/δ ).
The proof follows from Lemma 2 (phase length bound), Corollary 2 (bound on the number of phases), Proposition 1 above, and the fact that a sum of several non-negative quantities is at least their maximum.

To begin with, the discussion in Sec. E shows that in every phase, B_m ≤ B*_m. Next, Lemma 2 gives us (w.h.p.)

τ = Σ_{m=1}^{log₂(1/Δ_min)} N_m ≤ Σ_{m=1}^{log₂(1/Δ_min)} max{ B*_m 2^{2(m+1)} [ r_m² log K / ((√2−1)² C) ], B*_m 2^{2(m+1)} r_m² }
≤ Σ_{m=1}^{log₂(1/Δ_min)} B*_m 2^{2(m+1)} [ r_m² log K / ((√2−1)² C) ] + Σ_{m=1}^{log₂(1/Δ_min)} B*_m 2^{2(m+1)} r_m².

Hence, using r_m² = 8 log(K²/δ_m) together with the fact that K²/δ_m grows at most polynomially in m ≤ log₂(1/Δ_min), and invoking Proposition 1, we get

τ ≤ (c₁ / D_{θ*}) log₂(1/Δ_min) log( log₂(1/Δ_min) K²/δ ) [ log K / ((√2−1)² C) ] + (c₂ / D_{θ*}) log₂(1/Δ_min) log( log₂(1/Δ_min) K²/δ ).
In this section, we provide some details on the implementation of each algorithm. Each experiment was repeated 50 times, and the error-bar plots show the mean sample complexity with one standard deviation.

• For the implementation of PELEG, as mentioned in Sec. 6, we ignore the intersection with the ball B(0, D_m) in the phase stopping criterion. This permits a closed-form expression for the stopping rule. The learning-rate parameter in the EXP-WTS subroutine is set equal to (1/D_m) √(K/t).
• LinGapE: In [15], LinGapE was simulated using a greedy arm-selection strategy that deviates from the algorithm that is analyzed. We instead implement the LinGapE algorithm in the form in which it is analyzed.
• For the implementation of RAGE, ALBA and