Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection
Lijie Chen, Jian Li, Mingda Qiao — Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University
Abstract
In the Best-k-Arm problem, we are given n stochastic bandit arms, each associated with an unknown reward distribution. We are required to identify the k arms with the largest means by taking as few samples as possible. In this paper, we make progress towards a complete characterization of the instance-wise sample complexity bounds for the Best-k-Arm problem. On the lower bound side, we obtain a novel complexity term that measures the number of samples every Best-k-Arm instance requires. This is derived by an interesting and nontrivial reduction from the Best-1-Arm problem. We also provide an elimination-based algorithm that matches the instance-wise lower bound within doubly-logarithmic factors. The sample complexity of our algorithm strictly dominates the state-of-the-art for Best-k-Arm (modulo constant factors).

The stochastic multi-armed bandit is a classical and well-studied model for characterizing the exploration-exploitation tradeoff in various decision-making problems in stochastic settings. The most well-known objective in the multi-armed bandit model is to maximize the cumulative gain (or equivalently, to minimize the cumulative regret) that the agent achieves. Another line of research, called the pure exploration multi-armed bandit problem, which is motivated by a variety of practical applications including medical trials [Rob85, AB10], communication networks [AB10], and crowdsourcing [ZCL14, CLTL15], has also attracted significant attention recently. In the pure exploration problem, the agent draws samples from the arms adaptively (the exploration phase), and finally commits to one of the feasible solutions specified by the problem. In a sense, the exploitation phase in the pure exploration problem simply consists of exploiting the solution to which the agent commits indefinitely. Therefore, the agent's objective is to identify the optimal (or a near-optimal) feasible solution with high probability.

In this paper, we focus on the problem of identifying the top-k arms (i.e., the k arms with the largest means) in a stochastic multi-armed bandit model. The problem is known as the Best-k-Arm problem, and has been extensively studied in the past decade [KS10, GGL12, GGLB11, KTAS12, BWV12, KK13, ZCL14, KCG15, SJR16]. We formally define the Best-k-Arm problem as follows.

Definition 1.1 (Best-k-Arm). An instance of Best-k-Arm is a set of stochastic arms I = {A_1, A_2, . . . , A_n}. Each arm has a 1-sub-Gaussian reward distribution with an unknown mean in [0, 1/2]. At each step, an algorithm A chooses an arm and observes an i.i.d. sample from its reward distribution. The goal of A is to identify the k arms with the largest means in I using as few samples as possible. Let µ_[i] denote the i-th largest mean in an instance of Best-k-Arm. We assume that µ_[k] > µ_[k+1] in order to ensure the uniqueness of the solution.

Note that in our upper bound, we assume that all reward distributions are 1-sub-Gaussian, which is a standard assumption in the multi-armed bandit literature. (A distribution D is σ-sub-Gaussian if E_{X∼D}[exp(t(X − E_{X∼D}[X]))] ≤ exp(σ²t²/2) for all t ∈ R.) In our lower bound (Theorem 1.1), however, we restrict our attention to instances in which every reward distribution is a Gaussian with unit variance. (For arbitrary reward distributions, one may be able to distinguish two distributions with very close means using very few samples, so it is impossible to establish a nontrivial lower bound in such generality.)

When we only want to identify the single best arm, we get the following Best-1-Arm problem, which is a well-studied special case of Best-k-Arm. The problem plays an important role in our lower bound for Best-k-Arm.

Definition 1.2 (Best-1-Arm). The Best-1-Arm problem is the special case of Best-k-Arm where k = 1.
Generally, we focus on algorithms that solve Best-k-Arm with probability at least 1 − δ.

Definition 1.3 (δ-correct Algorithms). A is a δ-correct algorithm for Best-k-Arm if and only if A returns the correct answer with probability at least 1 − δ on every Best-k-Arm instance I.

Before stating our results on the Best-k-Arm problem, we first define a few useful notations that characterize the hardness of Best-k-Arm instances. Let µ_A denote the mean of arm A, and let µ_[i] denote the i-th largest mean among all arms in a specific instance. We define the gap of arm A as
$$\Delta_A = \begin{cases} \mu_A - \mu_{[k+1]}, & \mu_A \ge \mu_{[k]},\\ \mu_{[k]} - \mu_A, & \mu_A \le \mu_{[k+1]}.\end{cases}$$
Note that the gap of an arm is the minimum amount by which its mean must change in order to alter the set of top k arms. We let ∆_[i] denote the gap of the arm with the i-th largest mean.

Arm groups.
Let ε_r denote 2^{-r}. For an instance I of Best-k-Arm and a positive integer r, we define the arm groups as
G^large_r = {A ∈ I : µ_A ≥ µ_[k], ∆_A ∈ (ε_{r+1}, ε_r]}, and G^small_r = {A ∈ I : µ_A ≤ µ_[k+1], ∆_A ∈ (ε_{r+1}, ε_r]}.
In other words, G^large_r and G^small_r contain the arms with gaps in (ε_{r+1}, ε_r] among and outside the best k arms, respectively. Note that since we assume that the mean of each arm is in [0, 1/2], the gap of every arm is at most 1/2. Therefore, by definition, each arm is contained in exactly one of the arm groups. We also use the following shorthand notations:
G^large_{≥r} = ∪_{i=r}^{∞} G^large_i and G^small_{≥r} = ∪_{i=r}^{∞} G^small_i.

In order to state our instance-wise lower bound precisely, we need to elaborate on what an instance is. By Definition 1.1, a given instance is a set of arms, meaning that the particular input order of the arms should not matter. Note that there indeed exist algorithms that take advantage of the input order and may perform better on some "lucky" input orders than on others. (For example, a sorting algorithm can first check whether the input sequence a_1, . . . , a_n is already in increasing order in O(n) time, and then run an O(n log n)-time algorithm. Such an algorithm is particularly fast for one particular input order.) In order to prove a tighter lower bound, we need to consider all possible input orders and take the average. From a technical perspective, we use the following definition of an instance.

Definition 1.4 (Instance). An instance is a random permutation of a sequence of arms. Consequently, the sample complexity of an algorithm on an instance is the average of the number of samples over all permutations.
In fact, the random permutation is crucial to establishing instance-wise lower bounds for Best-k-Arm (i.e., the minimum number of samples that every δ-correct algorithm for Best-k-Arm needs to take on an instance). Without the random permutation, an algorithm might use fewer samples on some "lucky" permutations than on others, and it would be impossible to prove an instance-wise lower bound as tight as ours. The use of random permutations to define instance-wise lower bounds also appears in computational geometry [ABC09] and in the Best-1-Arm problem [CL15, CL16b].

We say that an instance of Best-k-Arm is Gaussian if all reward distributions are normal distributions with unit variance.

Theorem 1.1.
There exists a constant δ_0 > 0 such that for any δ < δ_0, every δ-correct algorithm for Best-k-Arm takes
$$\Omega\bigl(H\ln\delta^{-1} + H^{\mathrm{large}} + H^{\mathrm{small}}\bigr)$$
samples in expectation on every Gaussian instance. Here
$$H = \sum_{i=1}^{n}\Delta_{[i]}^{-2},\qquad
H^{\mathrm{large}} = \sum_{i=1}^{\infty}\bigl|G^{\mathrm{large}}_{i}\bigr|\cdot\max_{j\le i}\ \varepsilon_j^{-2}\ln\bigl|G^{\mathrm{small}}_{\ge j}\bigr|,\qquad
H^{\mathrm{small}} = \sum_{i=1}^{\infty}\bigl|G^{\mathrm{small}}_{i}\bigr|\cdot\max_{j\le i}\ \varepsilon_j^{-2}\ln\bigl|G^{\mathrm{large}}_{\ge j}\bigr|.$$

We notice that Simchowitz et al. [SJR16], independently of our work, derived instance-wise lower bounds for Best-k-Arm similar to Theorem 1.1, using a somewhat different method.

Theorem 1.2. For all δ > 0, there is a δ-correct algorithm for Best-k-Arm that takes
$$O\bigl(H\ln\delta^{-1} + \widetilde{H} + \widetilde{H}^{\mathrm{large}} + \widetilde{H}^{\mathrm{small}}\bigr)$$
samples in expectation on every instance. Here
$$\widetilde{H} = \sum_{i=1}^{n}\Delta_{[i]}^{-2}\ln\ln\Delta_{[i]}^{-1},\qquad
\widetilde{H}^{\mathrm{large}} = \sum_{i=1}^{\infty}\bigl|G^{\mathrm{large}}_{i}\bigr|\sum_{j=1}^{i}\varepsilon_j^{-2}\ln\bigl|G^{\mathrm{small}}_{j}\bigr|,\qquad
\widetilde{H}^{\mathrm{small}} = \sum_{i=1}^{\infty}\bigl|G^{\mathrm{small}}_{i}\bigr|\sum_{j=1}^{i}\varepsilon_j^{-2}\ln\bigl|G^{\mathrm{large}}_{j}\bigr|.$$
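To make these complexity terms concrete, the sketch below (ours, not part of the paper) computes H, H̃, H̃^large and H̃^small for a given list of means, following the definitions above; the function names are hypothetical, constants are ignored, empty-group logarithms are treated as zero, and the ln ln factor is clamped at 1.

```python
import math

def gap(mu_sorted, k, m):
    """Gap of an arm with mean m, given the means sorted in descending order and k."""
    if m >= mu_sorted[k - 1]:          # among the top k
        return m - mu_sorted[k]
    return mu_sorted[k - 1] - m        # outside the top k

def complexity_terms(means, k, r_max=60):
    mu = sorted(means, reverse=True)
    gaps = [gap(mu, k, m) for m in means]
    H = sum(d ** -2 for d in gaps)
    H_tilde = sum(d ** -2 * math.log(max(math.log(1.0 / d), math.e)) for d in gaps)

    # Arm groups: G_r^{large/small} holds arms with gap in (2^{-(r+1)}, 2^{-r}].
    G_large = [0] * (r_max + 2)
    G_small = [0] * (r_max + 2)
    for m, d in zip(means, gaps):
        r = min(r_max, max(1, math.floor(-math.log2(d))))
        if m >= mu[k - 1]:
            G_large[r] += 1
        else:
            G_small[r] += 1

    def ln(x):                         # empty groups contribute nothing
        return math.log(x) if x > 1 else 0.0

    H_tilde_large = sum(G_large[i] * sum(4 ** j * ln(G_small[j]) for j in range(1, i + 1))
                        for i in range(1, r_max + 1))
    H_tilde_small = sum(G_small[i] * sum(4 ** j * ln(G_large[j]) for j in range(1, i + 1))
                        for i in range(1, r_max + 1))
    return H, H_tilde, H_tilde_large, H_tilde_small

# Hypothetical example: three large arms, a borderline pair, five noise arms, k = 4.
means = [0.5] * 3 + [0.3, 0.28] + [0.0] * 5
print(complexity_terms(means, k=4))
```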
The following theorem relates the H̃^large and H̃^small terms to the H^large and H^small terms in the lower bound.

Theorem 1.3. For every Best-k-Arm instance, the following statements hold:
1. H̃^large + H̃^small = O((H^large + H^small) ln ln n).
2. H̃^large + H̃^small = O(H ln k).

Table 1: Sample complexity upper bounds for Best-k-Arm.

  Source      Sample Complexity
  [GGL12]     O(H ln δ^{-1} + H ln H)
  [KTAS12]    O(H ln δ^{-1} + H ln H)
  [CLK+14]    O(H ln δ^{-1} + H ln H)
  [CGL16]     O(H ln δ^{-1} + H̃ + H ln k)
  This paper  O(H ln δ^{-1} + H̃ + H̃^large + H̃^small)

Combining Theorems 1.1, 1.2 and 1.3(1), our algorithm is instance-wise optimal within doubly-logarithmic factors (i.e., ln ln n and ln ln ∆_[i]^{-1}). In other words, the sample complexity of our algorithm on every single instance nearly matches the minimum number of samples that every δ-correct algorithm has to take on that instance.

Theorem 1.2 and Theorem 1.3(2) also imply that our algorithm strictly dominates the state-of-the-art algorithm for Best-k-Arm obtained in [CGL16], which achieves a sample complexity of
O(Σ_{i=1}^{n} ∆_[i]^{-2} (ln δ^{-1} + ln k + ln ln ∆_[i]^{-1})) = O(H ln δ^{-1} + H ln k + H̃).
In particular, we give a specific example in Appendix A in which the sample complexity achieved by Theorem 1.2 is significantly better than that obtained in [CGL16]. See Table 1 for more previous upper bounds on the sample complexity of Best-k-Arm.

Best-1-Arm. In the Best-1-Arm problem, the algorithm is required to identify the arm with the largest mean. As a special case of Best-k-Arm, the problem has a history dating back to 1954 [Bec54], and it has continued to attract significant attention over the past decade [AB10, EDMM06, MT04, JMNB14, KKS13, CL15, CL16a, GK16, CLQ16].

Combinatorial pure exploration.
The combinatorial pure exploration problem, which further generalizes the cardinality constraint in Best-k-Arm (i.e., choosing exactly k arms) to combinatorial constraints (e.g., matroid constraints), has also been studied [CLK+14, CGL16, GLG+16].

PAC learning.
In the PAC learning setting, the algorithm is required to find an approximate solution to the pure exploration problem. The sample complexity of Best-1-Arm and Best-k-Arm in the PAC setting has been extensively studied. A tight (worst-case) bound of Θ(nε^{-2} ln δ^{-1}) was obtained for the PAC version of the Best-1-Arm problem in [EDMM02, EDMM06, MT04]. The worst-case sample complexity of Best-k-Arm in the PAC setting has also been well studied [KS10, KTAS12, ZCL14, CLTL15].

Kullback-Leibler divergence.
Let KL(P, Q) denote the Kullback-Leibler divergence from distribution Q to P. The following well-known fact (e.g., a special case of [Duc07]) gives the Kullback-Leibler divergence between two normal distributions with unit variance.

Fact 2.1. Let N(µ, σ²) denote the normal distribution with mean µ and variance σ². It holds that KL(N(µ_1, 1), N(µ_2, 1)) = (µ_1 − µ_2)²/2.
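For completeness, here is the standard one-line derivation of Fact 2.1 (our own addition, not in the original text):
$$\mathrm{KL}\bigl(\mathcal{N}(\mu_1,1),\mathcal{N}(\mu_2,1)\bigr)
= \mathbb{E}_{X\sim\mathcal{N}(\mu_1,1)}\!\left[\frac{(X-\mu_2)^2-(X-\mu_1)^2}{2}\right]
= \mathbb{E}\!\left[\frac{(2X-\mu_1-\mu_2)(\mu_1-\mu_2)}{2}\right]
= \frac{(\mu_1-\mu_2)^2}{2}.$$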
Binary relative entropy. Let d(x, y) = x ln(x/y) + (1 − x) ln[(1 − x)/(1 − y)] denote the binary relative entropy function. The following monotonicity property of d(·, ·) is useful in our analysis.

Fact 2.2.
For 0 ≤ y_2 ≤ y_1 ≤ x_1 ≤ x_2 ≤ 1, d(x_2, y_2) ≥ d(x_1, y_1).

Probability and expectation. Pr_{A,I} and E_{A,I} denote the probability and expectation when algorithm A runs on instance I. These notations are useful since we frequently consider executions of different algorithms on various instances in our proof of the lower bound.

Change of Distribution.
The following "Change of Distribution" lemma, developed in [KCG15], is a useful tool for quantifying the behavior of an algorithm when the instance is modified.
Lemma 2.1 (Change of Distribution). Suppose algorithm A runs on n arms. I = (A_1, A_2, . . . , A_n) and I′ = (A′_1, A′_2, . . . , A′_n) are two sequences of arms, and τ_i denotes the number of samples taken on A_i. For any event E in F_σ, where σ is an almost-surely finite stopping time with respect to the filtration {F_t}_{t≥0}, it holds that
$$\sum_{i=1}^{n}\mathbb{E}_{A,I}[\tau_i]\,\mathrm{KL}(A_i, A'_i)\ \ge\ d\bigl(\Pr_{A,I}[E],\ \Pr_{A,I'}[E]\bigr).$$

Throughout our proof of the lower bound, we assume that the reward distributions of all arms are Gaussian with unit variance. Moreover, we assume that the number of arms is sufficiently large. This assumption is used only once, in the proof of Lemma 3.3. Note that when there are only a constant number of arms, our lower bound Ω(H^large + H^small) is implied by the Ω(H ln δ^{-1}) term.

The following simple lemma is useful for lower bounding the expected number of samples taken from an arm in the top-k set, by restricting to a Best-1-Arm instance embedded in the original Best-k-Arm instance. We postpone its proof to Appendix C.

Lemma 3.1 (Instance Embedding). Let I be a Best-k-Arm instance. Let A be an arm among the top k arms, and let I_emb be a Best-1-Arm instance consisting of A and a subset of the arms in I outside the top k arms. If some algorithm A solves I with probability 1 − δ while taking less than N samples on A in expectation, then there exists another algorithm A_emb that solves I_emb with probability 1 − δ while taking less than N samples on A in expectation.

We show a lower bound on the number of samples required for each arm separately; the lower bound stated in Theorem 1.1 then follows from a direct summation. Formally, we have the following lemma.
Lemma 3.2.
Let I be an instance of Best-k-Arm. There exist universal constants δ_0 and c such that for all 1 ≤ j ≤ i, any δ_0-correct algorithm for Best-k-Arm takes at least c ε_j^{-2} ln|G^small_{≥j}| samples in expectation on every arm A ∈ G^large_i. The same holds if we swap G^large and G^small.

Before proving Lemma 3.2, we show that Theorem 1.1 follows from Lemma 3.2 directly.

Proof of Theorem 1.1.
Since the Ω(H ln δ^{-1}) lower bound has been established in Theorem 2 of [CLK+14], it remains to prove the Ω(H^large) and Ω(H^small) bounds. Let A be a δ-correct algorithm for Best-k-Arm. According to Lemma 3.2, A draws at least c · max_{j≤i} ε_j^{-2} ln|G^small_{≥j}| samples from each arm in G^large_i. Therefore A draws at least
Σ_{i=1}^{∞} |G^large_i| · c · max_{j≤i} ε_j^{-2} ln|G^small_{≥j}| = Ω(H^large)
samples in total from the arms in the G^large groups. The Ω(H^small) lower bound is analogous.

Reduction to Best-1-Arm. In order to prove Lemma 3.2, we construct a Best-1-Arm instance consisting of one arm in G^large_i and all arms in G^small_{≥j}. By Instance Embedding (Lemma 3.1), to lower bound the number of samples taken on each arm in G^large_i, it suffices to prove that every algorithm for Best-1-Arm takes sufficiently many samples on the best arm. Formally, we would like to show the following key technical lemma.

Lemma 3.3.
Let I be an instance of Best-1-Arm consisting of one arm with mean µ and n arms with means in [µ − ∆, µ). There exist universal constants δ_1 and c_1 (independent of n and ∆) such that for any algorithm A that correctly solves I with probability 1 − δ_1, the expected number of samples drawn from the optimal arm is at least c_1 ∆^{-2} ln n.

The proof of Lemma 3.3 is somewhat technical and we present it in the next subsection. Now we prove Lemma 3.2 from Lemma 3.3, by extracting a Best-1-Arm instance from an instance of Best-k-Arm using the Instance Embedding technique. Intuitively, if an algorithm A solves Best-k-Arm without taking a sufficient number of samples from a specific arm, we may extract an instance of Best-1-Arm and derive a contradiction to Lemma 3.3.

Proof of Lemma 3.2.
Let δ_1 and c_1 be the constants in Lemma 3.3. We claim that Lemma 3.2 holds for constants δ_0 = δ_1 and c = c_1/4.

Suppose for contradiction that when a δ_0-correct algorithm A runs on a Best-k-Arm instance I, the expected number of samples drawn from some arm A ∈ G^large_i is less than c ε_j^{-2} ln|G^small_{≥j}| for some j ≤ i.

We construct a Best-1-Arm instance I_new consisting of A and all arms in G^small_{≥j}. By Instance Embedding (Lemma 3.1), there exists an algorithm A_new that solves I_new with probability 1 − δ_1, while the expected number of samples drawn from arm A is at most c ε_j^{-2} ln|G^small_{≥j}|.

However, since the mean of every arm in G^small_{≥j} lies in [µ_A − (ε_i + ε_j), µ_A), Lemma 3.3 implies that A_new must take more than
c_1 (ε_i + ε_j)^{-2} ln|G^small_{≥j}| ≥ c_1 (2ε_j)^{-2} ln|G^small_{≥j}| = c ε_j^{-2} ln|G^small_{≥j}|
samples on the optimal arm in expectation, which leads to a contradiction. The case in which G^large and G^small are swapped is analogous.

Symmetric Best-1-Arm instances. In order to prove Lemma 3.3, we first study the special case in which the instance consists of one optimal arm and several sub-optimal arms with equal means (we call this a symmetric Best-1-Arm instance). For symmetric Best-1-Arm instances, we have the following lower bound for the best arm.

Lemma 3.4.
Let I be an instance of Best-1-Arm with one arm with mean µ and n arms with mean µ − ∆. There exist universal constants δ_2 and c_2 (independent of n and ∆) such that for any algorithm A that correctly solves I with probability 1 − δ_2, the expected number of samples drawn from the optimal arm is at least c_2 ∆^{-2} ln n.

Proof of Lemma 3.4. We claim that the lemma holds for constants δ_2 = 0.01 and c_2 = 1/2.

Recall that N(µ, σ²) denotes the normal distribution with mean µ and variance σ². Let I be the instance consisting of arm A* with mean µ and n arms with mean µ − ∆, and let I_new be the instance obtained from I by replacing the reward distribution of A* with N(µ − ∆, 1). τ denotes the number of samples drawn from A*.

Let E be the event that A identifies arm A* as the best arm. Recall that Pr_{A,I} and E_{A,I} denote the probability and expectation when algorithm A runs on instance I. Since A solves I correctly with probability at least 1 − δ_2, we have Pr_{A,I}[E] ≥ 1 − δ_2. On the other hand, I_new consists of n + 1 completely identical arms. By Definition 1.4, A takes a random permutation of I_new as its input. Therefore the probability that A returns each arm is the same, and it follows that Pr_{A,I_new}[E] ≤ 1/(n + 1).

By Change of Distribution (Lemma 2.1), we have
$$\tfrac{1}{2}\,\mathbb{E}_{A,I}[\tau]\,\Delta^2
= \mathbb{E}_{A,I}[\tau]\cdot\mathrm{KL}\bigl(\mathcal{N}(\mu,1),\mathcal{N}(\mu-\Delta,1)\bigr)
\ \ge\ d\bigl(\Pr_{A,I}[E],\ \Pr_{A,I_{\mathrm{new}}}[E]\bigr)
\ \ge\ d\bigl(1-\delta_2,\ 1/(n+1)\bigr)
\ \ge\ \tfrac{1}{2}(1-\delta_2)\ln n.$$
Therefore we conclude that E_{A,I}[τ] ≥ (1 − δ_2) ∆^{-2} ln n ≥ c_2 ∆^{-2} ln n.

Given Lemma 3.4, Lemma 3.3 may appear quite intuitive, as the symmetric instance I_sym seems to be the worst case. However, a rigorous proof of Lemma 3.3 is still quite nontrivial and is in fact the most technical part of the lower bound proof. The proof consists of several steps which transform a general instance I of Best-1-Arm into a symmetric instance I_sym.

Suppose that some algorithm A violates Lemma 3.3 on a Best-1-Arm instance I. We divide the interval [µ − ∆, µ) into n^{0.5} short segments; then at least one segment contains n^{0.5} arms. We construct a smaller and denser instance I_dense consisting of the optimal arm and n^{0.5} arms from the same segment. By Instance Embedding, there exists an algorithm A_new that solves I_dense while taking few samples on the optimal arm. Note that the reduction crucially relies on the fact that, since our lower bound is logarithmic in n, the bound merely shrinks by a constant factor after the number of arms decreases to n^{0.5}.

Finally, we transform I_dense into a symmetric Best-1-Arm instance I_sym consisting of the optimal arm in I_dense along with n^{0.5} copies of one of the sub-optimal arms. We also define an algorithm A_sym that solves I_sym with few samples drawn from the optimal arm, thus contradicting Lemma 3.4. The full proof of Lemma 3.3 is postponed to Appendix C.

Algorithm and upper bounds. We start by introducing three subroutines that are useful for building our algorithm for Best-k-Arm.

PAC algorithm for Best-k-Arm. PAC-Best-k is a PAC algorithm for Best-k-Arm adapted from the PAC-SamplePrune algorithm in [CGL16].
PAC-Best-k is guaranteed to partition the given arm set into two sets S^large and S^small, such that S^large approximates the best k arms with high probability.

Lemma 4.1.
PAC-Best-k(S, k, ε, δ) takes O(|S| ε^{-2} [ln δ^{-1} + ln min(k, |S| − k)]) samples and returns a partition (S^large, S^small) of S with |S^large| = k and |S^small| = |S| − k. Let µ_[k] and µ_[k+1] denote the k-th and the (k + 1)-th largest means in S. With probability 1 − δ, it holds that
µ_A ≥ µ_[k] − ε for all A ∈ S^large,   (1)
µ_A ≤ µ_[k+1] + ε for all A ∈ S^small.   (2)
Lemma 4.1 is proved in Appendix D. We say that a specific call to PAC-Best-k returns correctly if both (1) and (2) hold.
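As an illustration of this guarantee, here is a naive stand-in for PAC-Best-k (ours, not the PAC-SamplePrune-based implementation of Appendix D): it samples every arm a fixed number of times and keeps the top k empirical means. By Hoeffding's inequality it satisfies (1) and (2) with probability 1 − δ, at the cost of a ln(|S|/δ) factor per arm instead of the ln δ^{-1} + ln min(k, |S| − k) dependence of Lemma 4.1. The pull callback is an assumed interface returning one sample of an arm.

```python
import math, random

def pac_best_k(arms, k, eps, delta, pull):
    """Naive PAC partition: pull(arm) draws one 1-sub-Gaussian sample of that arm.
    Returns (S_large, S_small) satisfying guarantees (1)-(2) with probability 1 - delta."""
    # Hoeffding: m samples put every empirical mean within eps/2 of the truth, jointly w.p. 1 - delta.
    m = math.ceil(8.0 / eps ** 2 * math.log(2 * len(arms) / delta))
    emp = {a: sum(pull(a) for _ in range(m)) / m for a in arms}
    ranked = sorted(arms, key=lambda a: emp[a], reverse=True)
    return ranked[:k], ranked[k:]

# Example with simulated unit-variance Gaussian arms (means are hypothetical).
means = {"arm%d" % i: mu for i, mu in enumerate([0.45, 0.40, 0.30, 0.10, 0.05])}
large, small = pac_best_k(list(means), k=2, eps=0.05, delta=0.1,
                          pull=lambda a: random.gauss(means[a], 1.0))
print(large, small)
```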
PAC algorithms for Best-1-Arm. EstMean-Large and EstMean-Small approximate the largest and the smallest mean among several arms, respectively. Both algorithms can be easily implemented by calling PAC-Best-k with k = 1, and then sampling the best arm identified by PAC-Best-k.

Lemma 4.2.
Both EstMean-Large(S, ε, δ) and EstMean-Small(S, ε, δ) take O(|S| ε^{-2} ln δ^{-1}) samples and output a real number. Each of the following inequalities holds with probability 1 − δ:
|EstMean-Large(S, ε, δ) − max_{A∈S} µ_A| ≤ ε,   (3)
|EstMean-Small(S, ε, δ) − min_{A∈S} µ_A| ≤ ε.   (4)
Lemma 4.2 is proved in Appendix D. We say that a specific call to EstMean-Large (or EstMean-Small) returns correctly if inequality (3) (or (4)) holds.
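A minimal sketch of EstMean-Large along the lines described in Appendix D (find an approximately best arm, then estimate its mean) is given below; the sample counts are illustrative rather than the paper's, and pull is again an assumed sampling callback. EstMean-Small follows by negating the rewards.

```python
import math

def est_mean_large(arms, eps, delta, pull):
    """Naive EstMean-Large: locate an (eps/2)-optimal arm, then estimate its mean to eps/2."""
    m1 = math.ceil(32.0 / eps ** 2 * math.log(4 * len(arms) / delta))
    best = max(arms, key=lambda a: sum(pull(a) for _ in range(m1)) / m1)
    m2 = math.ceil(8.0 / eps ** 2 * math.log(4 / delta))
    return sum(pull(best) for _ in range(m2)) / m2

def est_mean_small(arms, eps, delta, pull):
    """EstMean-Small by negating every reward, mirroring the reduction used in the paper."""
    return -est_mean_large(arms, eps, delta, lambda a: -pull(a))
```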
Elimination procedures.
Finally, Elim-Large and Elim-Small are two elimination procedures. Roughly speaking, Elim-Large guarantees that after the elimination, the fraction of remaining arms with means above the larger threshold θ^large is bounded by a constant. Meanwhile, a fixed arm with mean below the smaller threshold θ^small is unlikely to be eliminated. Analogously, Elim-Small removes arms with means below θ^small and preserves arms with means above θ^large. The properties of Elim-Large and Elim-Small are formally stated below.
Lemma 4.3.
Both Elim-Large(S, θ^small, θ^large, δ) and Elim-Small(S, θ^small, θ^large, δ) take O(|S| ε^{-2} ln δ^{-1}) samples, where ε = θ^large − θ^small, and return a set T ⊆ S. For Elim-Large and a fixed arm A* ∈ S with µ_{A*} ≤ θ^small, it holds with probability 1 − δ that A* ∈ T and
|{A ∈ T : µ_A ≥ θ^large}| ≤ |T|/4.   (5)
Similarly, for Elim-Small and a fixed arm A* ∈ S with µ_{A*} ≥ θ^large, it holds with probability 1 − δ that A* ∈ T and
|{A ∈ T : µ_A ≤ θ^small}| ≤ |T|/4.   (6)
Lemma 4.3 is proved in Appendix D. We say that a call to Elim-Large (or Elim-Small) returns correctly if inequality (5) (or (6)) holds.
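The sketch below is a naive stand-in for Elim-Small (not the Elimination procedure of [CL15]): it spends O(ε^{-2} ln(|S|/δ)) samples per arm, with ε = θ^large − θ^small, and keeps only the arms whose empirical means exceed the midpoint of the two thresholds. Under this more expensive sampling budget, every arm with mean at most θ^small is removed with probability 1 − δ, so (6) holds trivially, and any fixed arm with mean at least θ^large is preserved. Elim-Large is obtained by negating the rewards, mirroring the reduction used in Appendix D.

```python
import math

def elim_small(arms, theta_small, theta_large, delta, pull):
    """Naive stand-in for Elim-Small: drop arms whose empirical mean falls below the
    midpoint of the two thresholds.  With probability 1 - delta this keeps any fixed arm
    with mean >= theta_large and removes every arm with mean <= theta_small."""
    eps = theta_large - theta_small
    m = math.ceil(32.0 / eps ** 2 * math.log(2 * len(arms) / delta))
    mid = (theta_small + theta_large) / 2.0
    return [a for a in arms
            if sum(pull(a) for _ in range(m)) / m >= mid]

def elim_large(arms, theta_small, theta_large, delta, pull):
    """Elim-Large via negation of the rewards and thresholds."""
    return elim_small(arms, -theta_large, -theta_small, delta, lambda a: -pull(a))
```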
Our algorithm for Best- k -Arm, Bilateral-Elimination , is formally described below.
Bilateral-Elimination takes a parameter k, an instance I of Best-k-Arm, and a confidence level δ as input, and returns the best k arms in I. Throughout the algorithm, Bilateral-Elimination maintains two sets of arms S_r and T_r for each round r. S_r contains the arms that are still under consideration at the beginning of round r, while T_r denotes the set of arms that have already been included in the answer. We say that an arm is removed (or eliminated) at round r if it is in S_r \ S_{r+1}. Note that we may remove an arm either because its mean is so small that it cannot be among the best k arms, or because its mean is large enough that we decide to include it in the answer. This justifies the name of our algorithm, Bilateral-Elimination. In each round r, Bilateral-Elimination performs the following four steps.

Algorithm 1:
Bilateral-Elimination
Input: Parameter k, instance I, and confidence δ.
Output: The best k arms in I.
1:  S_1 ← I; T_1 ← ∅;
2:  for r = 1 to ∞ do
3:      k^large_r ← k − |T_r|; k^small_r ← |S_r| − k^large_r;
4:      if k^large_r = 0 then return T_r;
5:      if k^small_r = 0 then return T_r ∪ S_r;
6:      δ_r ← δ/(20r²);
7:      (S^large_r, S^small_r) ← PAC-Best-k(S_r, k^large_r, ε_r/10, δ_r);
8:      θ^large_r ← EstMean-Large(S^small_r, ε_r/10, δ_r);
9:      θ^small_r ← EstMean-Small(S^large_r, ε_r/10, δ_r);
10:     δ′_r ← δ_r / min(k^large_r, k^small_r);
11:     S_{r+1} ← Elim-Large(S^large_r, θ^large_r + ε_r/10, θ^large_r + ε_r/5, δ′_r) ∪ Elim-Small(S^small_r, θ^small_r − ε_r/5, θ^small_r − ε_r/10, δ′_r);
12:     T_{r+1} ← T_r ∪ (S^large_r \ S_{r+1});
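For readers who prefer code, the following sketch (ours) reproduces the control flow of Algorithm 1 with naive empirical-mean stand-ins for PAC-Best-k, EstMean, and Elim; the sample counts, thresholds, and the simulated Gaussian instance are illustrative only and do not reproduce the paper's guarantees or sample complexity.

```python
import math
import random

def bilateral_elimination(means, k, delta, max_rounds=40):
    """Illustrative rendering of Algorithm 1's control flow.  PAC-Best-k, EstMean and Elim
    are replaced by a single naive step that estimates every remaining arm empirically, so
    only the bilateral structure (commit confident top arms, drop confident bottom arms) is kept."""
    def emp_mean(arm, m):
        # unit-variance Gaussian rewards, as in the paper's Gaussian instances
        return sum(random.gauss(means[arm], 1.0) for _ in range(m)) / m

    S = set(means)        # arms still under consideration (S_r)
    T = set()             # arms already committed to the answer (T_r)
    for r in range(1, max_rounds + 1):
        eps_r = 2.0 ** (-r)
        k_large = k - len(T)
        k_small = len(S) - k_large
        if k_large == 0:
            return T
        if k_small == 0:
            return T | S
        delta_r = delta / (20 * r * r)
        # Estimate each remaining arm to roughly eps_r/8 accuracy with probability 1 - delta_r.
        m = math.ceil(128.0 / eps_r ** 2 * math.log(2 * len(S) / delta_r))
        emp = {a: emp_mean(a, m) for a in S}
        ranked = sorted(S, key=lambda a: emp[a], reverse=True)
        S_large, S_small = ranked[:k_large], ranked[k_large:]
        theta_large = max(emp[a] for a in S_small)   # ~ (k_large+1)-th largest mean of S_r
        theta_small = min(emp[a] for a in S_large)   # ~ k_large-th largest mean of S_r
        # Bilateral elimination: arms well above theta_large join the answer, arms well
        # below theta_small are discarded, the ambiguous middle band survives to round r+1.
        confident_top = {a for a in S_large if emp[a] >= theta_large + eps_r / 2}
        confident_bottom = {a for a in S_small if emp[a] <= theta_small - eps_r / 2}
        T |= confident_top
        S -= confident_top | confident_bottom
    raise RuntimeError("increase max_rounds: gaps too small for this naive sketch")

# Hypothetical instance: three clear winners, one borderline pair, three noise arms.
true_means = {"a%d" % i: mu for i, mu in enumerate([0.50, 0.49, 0.48, 0.30, 0.20, 0.05, 0.04, 0.03])}
print(bilateral_elimination(true_means, k=4, delta=0.05))
```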
Step 1: Initialization. Bilateral-Elimination first calculates k^large_r and k^small_r, which indicate that it needs to identify the k^large_r largest arms (or equivalently, the k^small_r smallest arms) in S_r. In the base case that either k^large_r = 0 or k^small_r = 0, it directly returns the answer.

Step 2: Find a PAC solution.
Then Bilateral-Elimination calls PAC-Best-k to partition S_r into S^large_r and S^small_r, of sizes k^large_r and k^small_r respectively, such that S^large_r is an approximation of the best k^large_r arms in S_r.

Step 3: Estimate Thresholds.
After that, Bilateral-Elimination calls EstMean-Large and EstMean-Small to compute two thresholds θ^large_r and θ^small_r. θ^large_r is an estimate of the largest mean in S^small_r, which is approximately the mean of the (k^large_r + 1)-th largest arm in S_r. Analogously, θ^small_r approximates the k^large_r-th largest mean in S_r.

It may seem odd at first glance that θ^large_r and θ^small_r approximate the (k^large_r + 1)-th and the k^large_r-th largest means respectively, implying that θ^large_r is expected to be smaller than θ^small_r. In fact, the superscript "large" in θ^large_r indicates that it is the threshold used for eliminating arms in S^large_r.

Step 4: Elimination.
Finally, Bilateral-Elimination calls Elim-Large and Elim-Small to eliminate the arms in S^large_r whose means are significantly larger than θ^large_r, and the arms in S^small_r whose means are much smaller than θ^small_r. The arms removed from S^large_r are included in the answer.

Caveats.
Note that our algorithm uses a different confidence level, δ′_r, in Step 4. Intuitively, at most min(k^large_r, k^small_r) arms among the best k^large_r arms in S_r are misclassified as "small arms" by PAC-Best-k. Therefore, during the elimination process, it is crucial that such misclassified arms are not mistakenly eliminated. As a result, we need a union bound over these arms, which contributes the min(k^large_r, k^small_r) factor to our confidence level.

We start our analysis of Bilateral-Elimination with a few simple yet useful observations.
Good events.
We define E good r as the event that in round r , all the five calls to PAC-Best-k , EstMean , and
Elim return correctly. These events are crucial to our following analysis, as they guarantee that the partition ( S large r , S small r ) and thresholds θ large r and θ small r are sufficiently accurate, and additionally, Elim eliminates asufficiently large fraction of arms. The following observation, due to a simple union bound, lower boundsthe probability of each good event.
Observation 4.1.
Pr[E^good_r] ≥ 1 − 5δ_r.

Valid executions. We say that an execution of
Bilateral-Elimination is valid at round r if and only if the following two conditions are satisfied:
• For each 1 ≤ i < r, event E^good_i happens (i.e., all calls to subroutines returned correctly in previous rounds).
• The union of T_r and the best k^large_r arms in S_r is the correct answer of the Best-k-Arm instance. In other words, no arms have been incorrectly eliminated in previous rounds.
Moreover, an execution is valid if it is valid at every round before it terminates. We define E^valid to be the event that an execution of Bilateral-Elimination is valid.
Thresholds.
In the following, we bound the thresholds θ large r and θ small r returned by subroutine EstMean conditioning on E good r . Let µ large r and µ small r denote the means of the k large r -th and the ( k large r + 1) -th largestarms in S r . We show that θ large r and θ small r are O ( ε r ) -approximations of µ small r and µ large r conditioning on thegood event E good r . The proof of the following observation is postponed to Appendix D. Observation 4.2.
Conditioning on E^good_r,
θ^large_r ∈ [µ^small_r − ε_r/10, µ^small_r + ε_r/5],   θ^small_r ∈ [µ^large_r − ε_r/5, µ^large_r + ε_r/10].

Number of remaining arms.
Finally, we show that conditioning on the validity of an execution, thenumber of remaining arms at the beginning of each round can be upper bounded in terms of | G large ≥ r | and | G small ≥ r | . The following observation, proved in Appendix D, is crucial to analyzing the sample complexity ofour algorithm. Observation 4.3.
Conditioning on E^valid, it holds that k^large_r ≤ 2|G^large_{≥r}| and k^small_r ≤ 2|G^small_{≥r}|.

Recall that E^valid denotes the event that the execution of Bilateral-Elimination is valid. The following lemma, proved in Appendix D, shows that event E^valid happens with high probability.

Lemma 4.4. Pr[E^valid] ≥ 1 − δ.

We show that
Bilateral-Elimination always returns the correct answer conditioning on E valid , thus provingthat Bilateral-Elimination is δ -correct. Lemma 4.5.
Bilateral-Elimination returns the correct answer with probability at least 1 − δ.

Proof of Lemma 4.5. It suffices to show that conditioning on E^valid, the algorithm always returns the correct answer. In fact, if Bilateral-Elimination terminates at round r, it either returns T_r at Line 4 or returns T_r ∪ S_r at Line 5. According to the second property guaranteed by validity at round r, the answer returned by Bilateral-Elimination must be correct. It remains to show that
Bilateral-Elimination does not run forever. Recall that ∆_[k] = µ_[k] − µ_[k+1] is the gap between the k-th and the (k + 1)-th largest means in the original instance I. We choose a sufficiently large r* that satisfies ε_{r*} < ∆_[k]. By definition, we have G^large_{≥r*} = G^small_{≥r*} = ∅. Then Observation 4.3 implies that k^large_{r*} = k^small_{r*} = 0 if the algorithm does not terminate before round r*. Therefore the algorithm terminates at or before round r*. This completes the proof.

4.5 Sample Complexity

We prove the following Lemma 4.6, which bounds the sample complexity of
Bilateral-Elimination conditioning on E^valid. Then Theorem 1.2 directly follows from Lemma 4.5 and Lemma 4.6. The proof of Theorem 1.3 is postponed to the appendix.

Lemma 4.6.
Conditioning on event E^valid, Bilateral-Elimination takes O(H ln δ^{-1} + H̃^large + H̃^small + H̃) samples.

Proof of Lemma 4.6. We consider the r-th round of the algorithm. Recall that k^large_r + k^small_r = |S_r|. According to Lemmas 4.1 through 4.3, PAC-Best-k takes
O(|S_r| ε_r^{-2} [ln δ_r^{-1} + ln min(k^large_r, k^small_r)])   (7)
samples. EstMean-Large and
EstMean-Small take O((k^large_r + k^small_r) ε_r^{-2} ln δ_r^{-1}) = O(|S_r| ε_r^{-2} ln δ_r^{-1}) samples in total, while Elim-Large and Elim-Small take
O(k^large_r ε_r^{-2} ln δ′_r^{-1}) + O(k^small_r ε_r^{-2} ln δ′_r^{-1}) = O(|S_r| ε_r^{-2} [ln δ_r^{-1} + ln min(k^large_r, k^small_r)])
samples conditioning on E^valid. Clearly, the sample complexity in round r is dominated by (7).

Simplify and split the sum:
By Observation 4.3, conditioning on event E^valid, k^large_r and k^small_r are bounded by 2|G^large_{≥r}| and 2|G^small_{≥r}| respectively. Thus it suffices to bound the sum over r of H^(1)_r + H^(2,large)_r + H^(2,small)_r, where
H^(1)_r = (|G^large_{≥r}| + |G^small_{≥r}|) ε_r^{-2} (ln δ^{-1} + ln r),
H^(2,large)_r = ε_r^{-2} |G^large_{≥r}| ln |G^small_{≥r}|,
H^(2,small)_r = ε_r^{-2} |G^small_{≥r}| ln |G^large_{≥r}|.
In fact, since ln δ_r^{-1} = ln δ^{-1} + ln(20r²) = O(ln δ^{-1} + ln r), the |S_r| ε_r^{-2} ln δ_r^{-1} term in (7) is bounded by O(H^(1)_r). Moreover, the |S_r| ε_r^{-2} ln min(k^large_r, k^small_r) term is at most
ε_r^{-2} (k^large_r ln k^small_r + k^small_r ln k^large_r),
and is thus upper bounded by O(H^(2,large)_r + H^(2,small)_r).
In Appendix D, we show with a straightforward calculation that
Σ_{r=1}^{∞} H^(1)_r = O(H ln δ^{-1} + H̃),   Σ_{r=1}^{∞} H^(2,large)_r = O(H̃^large),   and   Σ_{r=1}^{∞} H^(2,small)_r = O(H̃^small).
Then the lemma directly follows.
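To give a flavor of the calculation deferred to Appendix D, here is the standard argument for the first sum (our own sketch, with constants left loose). An arm A with gap ∆_A ∈ (ε_{r_A+1}, ε_{r_A}] is counted in G_{≥r} exactly for r ≤ r_A, hence
$$\sum_{r\ge 1}\bigl(|G^{\mathrm{large}}_{\ge r}|+|G^{\mathrm{small}}_{\ge r}|\bigr)\,\varepsilon_r^{-2}\bigl(\ln\delta^{-1}+\ln r\bigr)
=\sum_{A}\sum_{r=1}^{r_A}\varepsilon_r^{-2}\bigl(\ln\delta^{-1}+\ln r\bigr)
\le \tfrac{4}{3}\sum_{A}\varepsilon_{r_A}^{-2}\bigl(\ln\delta^{-1}+\ln r_A\bigr)
= O\bigl(H\ln\delta^{-1}+\widetilde{H}\bigr),$$
using $\sum_{r\le r_A}4^{r}\le\tfrac{4}{3}\,4^{r_A}$, $\varepsilon_{r_A}^{-2}\le 4\Delta_A^{-2}$, and $\ln r_A=O(\ln\ln\Delta_A^{-1})$.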
Finally, we prove our main result on the upper bound side.

Proof of Theorem 1.2. Let T = H ln δ^{-1} + H̃ + H̃^large + H̃^small. Lemma 4.5 and Lemma 4.6 together imply that, conditioning on an event that happens with probability 1 − δ, Bilateral-Elimination returns the correct answer and takes O(T) samples. Using the parallel simulation trick in [CL15, Theorem H.5], we can obtain an algorithm which uses O(T) samples in expectation (unconditionally), thus proving Theorem 1.2.

References

[AB10] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In
COLT , 2010.[ABC09] Peyman Afshani, Jérémy Barbay, and Timothy M Chan. Instance-optimal geometric algorithms.In
FOCS , 2009.[Bec54] Robert E Bechhofer. A single-sample multiple decision procedure for ranking means of normalpopulations with known variances.
The Annals of Mathematical Statistics, pages 16–39, 1954.
[BWV12] Sébastien Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits. arXiv preprint arXiv:1205.3181, 2012.
[CGL16] Lijie Chen, Anupam Gupta, and Jian Li. Pure exploration of multi-armed bandit under matroid constraints. In COLT, pages 647–669, 2016.
[CL15] Lijie Chen and Jian Li. On the optimal sample complexity for best arm identification. arXiv preprint arXiv:1511.03774, 2015.
[CL16a] Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. arXiv preprint arXiv:1605.09004, 2016.
[CL16b] Lijie Chen and Jian Li. Open problem: Best arm identification: Almost instance-wise optimality and the gap entropy conjecture. In
COLT , 2016.[CLK +
14] Shouyuan Chen, Tian Lin, Irwin King, Michael R Lyu, and Wei Chen. Combinatorial pureexploration of multi-armed bandits. In
NIPS , pages 379–387, 2014.[CLQ16] Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identifi-cation. arXiv preprint arXiv:1608.06031 , 2016.[CLTL15] Wei Cao, Jian Li, Yufei Tao, and Zhize Li. On top-k selection in multi-armed bandits and hiddenbipartite graphs. In
NIPS , pages 1036–1044, 2015.[Duc07] John Duchi. Derivations for linear algebra and optimization.
Berkeley, California, 2007.
[EDMM02] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In
COLT , pages 255–270. Springer, 2002.[EDMM06] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditionsfor the multi-armed bandit and reinforcement learning problems.
JMLR , 7:1079–1105, 2006.[GGL12] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: Aunified approach to fixed budget and fixed confidence. In
NIPS , pages 3212–3220, 2012.[GGLB11] Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Sébastien Bubeck. Multi-bandit best arm identification. In
NIPS , pages 2222–2230, 2011.12GK16] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence.In
Conference on Learning Theory (COLT) , 2016.[GLG +
16] Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Ronald Ortner, and PeterBartlett. Improved learning complexity in combinatorial pure exploration bandits. In
Proceedingsof the 19th International Conference on Artificial Intelligence and Statistics , pages 1004–1012,2016.[JMNB14] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ucb: An optimalexploration algorithm for multi-armed bandits.
COLT , 2014.[KCG15] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best arm identi-fication in multi-armed bandit models.
The Journal of Machine Learning Research , 2015.[KK13] Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subsetselection. In
COLT , pages 228–251, 2013.[KKS13] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armedbandits. In
ICML , pages 1238–1246, 2013.[KS10] Shivaram Kalyanakrishnan and Peter Stone. Efficient selection of multiple bandit arms: Theoryand practice. In
ICML , pages 511–518, 2010.[KTAS12] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. Pac subset selectionin stochastic multi-armed bandits. In
ICML , pages 655–662, 2012.[MT04] Shie Mannor and John N Tsitsiklis. The sample complexity of exploration in the multi-armedbandit problem.
JMLR , 5:623–648, 2004.[Rob85] Herbert Robbins. Some aspects of the sequential design of experiments. In
Herbert Robbins Selected Papers, pages 169–177. Springer, 1985.
[SJR16] Max Simchowitz, Kevin Jamieson, and Benjamin Recht. Towards a richer understanding of adaptive sampling in the moderate-confidence regime. 2016.
[ZCL14] Yuan Zhou, Xi Chen, and Jian Li. Optimal PAC multiple arm identification with applications to crowdsourcing. In
ICML, pages 217–225, 2014.

Organization of the Appendix
In the appendix, we present the missing proofs of this paper. In Appendix A, we first discuss the specific instance mentioned in Section 1, showing that our upper bound strictly improves on previous algorithms. In Appendix B, we prove Fact 2.2 from Section 2. In Appendix C, we prove the Instance Embedding lemma (Lemma 3.1) and the relatively technical Lemma 3.3, which relates a general instance of Best-1-Arm to a symmetric instance. In Appendix D, we discuss the implementation of the building blocks of our algorithm, prove a few useful observations, and finally complete the missing proofs of the other lemmas and theorems.

A Specific Best-k-Arm Instance

We show that our upper bound results (Theorem 1.2 and Theorem 1.3) strictly improve on the state-of-the-art algorithm for Best-k-Arm obtained in [CGL16] by calculating the sample complexity of both algorithms on a specific Best-k-Arm instance.

We consider a family of instances parametrized by an integer n and ε ∈ (0, 1/8). Each instance consists of n arms with mean 0 and n arms with mean 1/2, along with two arms with means 1/4 + ε and 1/4 − ε respectively. We are required to identify the top n + 1 arms. By definition, the gap of every arm with mean 0 or 1/2 is 1/4 + ε, while the gaps of the remaining two arms are 2ε. As ε tends to zero, the arms with gap 1/4 + ε become relatively simple: an algorithm can decide whether or not to include them in the answer with few samples. The hardness of the instance is then concentrated on the two arms with close means.

For simplicity, we assume that the confidence level δ is set to a constant. Then the O(H ln δ^{-1}) term in the upper bounds is dominated by the O(H̃) term. By a direct calculation, we have H̃ = Θ(n + ε^{-2} ln ln ε^{-1}).

Let m be the integer that satisfies 2ε ∈ (ε_{m+1}, ε_m]. Then we have |G^large_1| = |G^small_1| = n and |G^large_m| = |G^small_m| = 1. It follows from the definitions of H̃^large and H̃^small that H̃^large = H̃^small = O(n ln n + ε^{-2}).

By Theorem 1.2, for constant δ, our algorithm takes O(H̃ + H̃^large + H̃^small) = O(n ln n + ε^{-2} ln ln ε^{-1}) samples on this instance. On the other hand, the upper bound achieved by the PAC-SamplePrune algorithm is O(H̃ + H ln n) = O(n ln n + ε^{-2} ln ln ε^{-1} + ε^{-2} ln n).

Note that if ε = 1/n, our algorithm takes O(n² ln ln n) samples, while PAC-SamplePrune takes O(n² ln n) samples. This indicates that there is a logarithmic gap between the state-of-the-art upper bound and the instance-wise lower bound, while we narrow the gap down to a doubly-logarithmic factor.

B Missing Proof in Section 2
Fact 2.2 (restated)
For 0 ≤ y_2 ≤ y_1 ≤ x_1 ≤ x_2 ≤ 1, d(x_2, y_2) ≥ d(x_1, y_1).

Proof of Fact 2.2. Taking partial derivatives yields
∂d(x, y)/∂x = ln[x(1 − y)/(y(1 − x))],   ∂d(x, y)/∂y = (y − x)/(y(1 − y)).
Therefore, when x ≥ y, d(x, y) is increasing in x and decreasing in y, which proves the fact.

C Missing Proofs in Section 3
C.1 Proof of Lemma 3.1
Lemma 3.1 (restated)
Let I be a Best-k-Arm instance. Let A be an arm among the top k arms, and let I_emb be a Best-1-Arm instance consisting of A and a subset of the arms in I outside the top k arms. If some algorithm A solves I with probability 1 − δ while taking less than N samples on A in expectation, then there exists another algorithm A_emb that solves I_emb with probability 1 − δ while taking less than N samples on A in expectation.

Proof of Lemma 3.1. We construct the following algorithm A_emb for I_emb. Given the instance I_emb, A_emb first augments the instance into I by adding a fictitious arm for each arm in I \ I_emb. Then A_emb simulates A on the Best-k-Arm instance I. When A requires a sample from an arm in I_emb, A_emb draws a sample and sends it to A. If A requires a sample from an arm outside I_emb, A_emb generates a fictitious sample on its own and then sends it to A. When A terminates and returns a subset S of k arms, A_emb terminates and returns an arbitrary arm in S ∩ I_emb.

Note that when A_emb runs on instance I_emb, the algorithm A simulated by A_emb effectively runs on the instance I. It follows that, with probability 1 − δ, A returns the correct answer of the Best-k-Arm instance I, and thus A is the only arm in both I_emb and the set S returned by A. Therefore, A_emb correctly solves the Best-1-Arm instance I_emb with probability at least 1 − δ. Moreover, the expected number of samples drawn from arm A is less than N by our assumption.

C.2 Proof of Lemma 3.3
Lemma 3.3 (restated)
Let I be an instance of Best-1-Arm consisting of one arm with mean µ and n arms with means in [µ − ∆, µ). There exist universal constants δ_1 and c_1 (independent of n and ∆) such that for any algorithm A that correctly solves I with probability 1 − δ_1, the expected number of samples drawn from the optimal arm is at least c_1 ∆^{-2} ln n.

Proof of Lemma 3.3. Let δ_2 and c_2 be the constants in Lemma 3.4. We claim that Lemma 3.3 holds for constants δ_1 = δ_2/4 and c_1 = c_2 δ_2/8.

Suppose for contradiction that when algorithm A runs on a Best-1-Arm instance I, it outputs the correct answer with probability 1 − δ_1 and the optimal arm A is sampled less than c_1 ∆^{-2} ln n times in expectation.

Overview.
Our proof follows five steps.
Step 1. We apply Instance Embedding to obtain a smaller yet denser (in the sense that all suboptimal arms have almost identical means) instance I_dense, together with a new algorithm A_new that solves I_dense by taking few samples on the optimal arm with high probability.
Step 2. We obtain a symmetric instance I_sym from I_dense by making the suboptimal arms identical to each other. We also define an algorithm A_sym for instance I_sym.
Step 3. To analyze algorithm A_sym on instance I_sym, we define the notion of "mixed arms", which return a fixed number of samples from one distribution and then switch to another distribution permanently. We transform I_dense into an instance I_mix with mixed arms.
Step 4. We show by Change of Distribution that when A_new runs on I_mix, it also returns the correct answer with few samples on the optimal arm.
Step 5. We show that the execution of A_sym on I_sym is, in a sense, equivalent to the execution of A_new on I_mix. This finally leads to a contradiction to Lemma 3.4.

[Figure 1: Each rectangle denotes the execution of an algorithm on an instance (A on I; A_new on I_dense; A_new on I_mix (Expr_mix); A_sym on I_sym (Expr_sym)). The arrows specify the step in which each reduction is performed and the major technique involved: Step 1, Instance Embedding; Step 4, Change of Distribution; Step 5, Equivalence.]

The reductions involved in the proof are illustrated in Figure 1.

Step 1: Construct I_dense and A_new. We first construct a new Best-1-Arm instance I_dense in which the sub-optimal arms have almost identical means. Let µ denote the mean of the optimal arm A. We divide the interval [µ − ∆, µ) into n^{0.5} segments, each of length ∆/n^{0.5}. Set m = n^{0.5}. By the pigeonhole principle, we can assume that A_1, A_2, . . . , A_m are m suboptimal arms with means in the same segment. Let µ_i denote the mean of arm A_i. By construction, µ − µ_i ≤ ∆ for all 1 ≤ i ≤ m and |µ_i − µ_j| ≤ ∆/n^{0.5} for all 1 ≤ i, j ≤ m. We simply let I_dense = {A, A_1, A_2, . . . , A_m}. By Instance Embedding (Lemma 3.1), there exists an algorithm A_new that solves I_dense with probability 1 − δ_1 while taking less than c_1 ∆^{-2} ln n samples on A in expectation. We will focus on instance I_dense in the rest of the proof.

Recall that Pr_{A,I} and E_{A,I} denote the probability and expectation when algorithm A runs on instance I. Let τ denote the number of samples A_new takes on the optimal arm A. Then we have E_{A_new, I_dense}[τ] ≤ c_1 ∆^{-2} ln n. Let N = c_1 δ_1^{-1} ∆^{-2} ln n. By Markov's inequality, Pr_{A_new, I_dense}[τ ≥ N] ≤ c_1 ∆^{-2} ln n / N = δ_1. Let E denote the event that the algorithm returns the correct answer while taking at most N samples on arm A. The union bound implies that Pr_{A_new, I_dense}[E] ≥ 1 − 2δ_1.

Step 2: Construct I_sym and A_sym. Let I_sym be the Best-1-Arm instance consisting of arm A and m = n^{0.5} copies of arm A_1. We define algorithm A_sym as follows. Given instance I_sym, A_sym simulates algorithm A_new as if A_new were running on instance I_dense. When A_new requires a sample from an arm A′ that has not yet been pulled N times (recall that N = c_1 δ_1^{-1} ∆^{-2} ln n), A_sym draws a sample from A′ and sends it to A_new. When the number of pulls on A′ exceeds N for the first time, A_sym assigns a random number π(A′) from {1, 2, . . . , m} to arm A′, such that π(A′) is different from every number that has already been assigned to another arm. If this step cannot be performed because all numbers in
{1, 2, . . . , m} have been used up, A_sym simply terminates without returning an answer. (As shown in the analysis in Step 5, we only care about the behavior of A_sym when the labels are not used up.) After that, upon each pull of A′, A_sym sends a sample drawn from N(µ_{π(A′)}, 1) to A_new. (Recall that µ_i denotes the mean of arm A_i in I_dense.) Finally, A_sym outputs what A_new outputs.

Step 3: Construct mixed arms and I_mix. In order to analyze the execution of A_sym on instance I_sym, it is helpful to define m "mixed arms". For 1 ≤ i ≤ m, the i-th mixed arm, denoted by M_i, returns a sample drawn from N(µ_1, 1) (i.e., the reward distribution of arm A_1) when it is pulled for the first N times. After N pulls, M_i returns samples from N(µ_i, 1), as A_i does. For ease of notation, we also let M_0 denote A. Let I_mix denote the Best-1-Arm instance {M_0, M_1, M_2, . . . , M_m}.

Step 4: Run A_new on I_mix. Now suppose we run A_new on instance I_mix. In fact, we may view each arm (either A_i or M_i) as two separate "semi-arms". When A_new samples arm A_i for the first N times, it pulls the first semi-arm of A_i. After A_i has been pulled N times, A_new pulls the second semi-arm. From this perspective, I_mix is simply obtained from I_dense by changing the first semi-arm of each arm A_i (1 ≤ i ≤ m) from N(µ_i, 1) to N(µ_1, 1). Since the first semi-arm is sampled at most N times by A_new, it follows from Change of Distribution (Lemma 2.1) that
d(Pr_{A_new, I_dense}[E], Pr_{A_new, I_mix}[E]) ≤ Σ_{i=1}^{m} N · KL(N(µ_i, 1), N(µ_1, 1)) = (N/2) Σ_{i=1}^{m} (µ_i − µ_1)² ≤ c_1 δ_1^{-1} ∆^{-2} ln n · n^{0.5} · (∆/n^{0.5})²/2 = (c_1 δ_1^{-1}/2) n^{-0.5} ln n.
Here the second step follows from KL(N(µ_1, 1), N(µ_2, 1)) = (µ_1 − µ_2)²/2. The third step is due to N = c_1 δ_1^{-1} ∆^{-2} ln n, m = n^{0.5}, and |µ_1 − µ_i| ≤ ∆/n^{0.5}.

For sufficiently large n, we have (c_1 δ_1^{-1}/2) n^{-0.5} ln n < d(1 − 2δ_1, 1 − 4δ_1). Recall that Pr_{A_new, I_dense}[E] ≥ 1 − 2δ_1. It follows from the monotonicity of d(·, ·) (Fact 2.2) that Pr_{A_new, I_mix}[E] ≥ 1 − 4δ_1 = 1 − δ_2.

Step 5: Analyze A_sym and derive a contradiction to Lemma 3.4. For clarity, let Expr_mix denote the experiment in which A_new runs on I_mix, and Expr_sym the experiment in which A_sym runs on I_sym. Step 4 implies that event E happens with probability at least 1 − δ_2 in experiment Expr_mix.

In the following, we derive the likelihood of an arbitrary execution of Expr_mix in which event E happens, and prove that this execution has the same likelihood in experiment Expr_sym. As a result, A_sym also returns the correct answer with probability at least 1 − δ_2. Moreover, according to our construction, A_sym always takes at most N samples on arm A. On the other hand, since µ − µ_1 ≤ ∆, Lemma 3.4 implies that no algorithm can solve I_sym correctly with probability 1 − δ_2 = 1 − 4δ_1 while taking less than c_2 ∆^{-2} ln m = 2c_1 δ_1^{-1} · ∆^{-2} · (0.5 ln n) = N samples on A in expectation. This leads to a contradiction and finishes the proof.

Technicalities: equivalence between Expr_mix and Expr_sym. For ease of notation, we assume in the following that algorithm A_new is deterministic. (This assumption is without loss of generality: the argument still holds conditioning on the randomness of A_new.) Then the only randomness in experiment Expr_mix stems from the random permutation of arms at the beginning, and from the samples drawn from the arms.

We consider an arbitrary run of experiment Expr_mix in which event E happens (i.e., A_new returns the optimal arm before taking more than N samples from it). For 0 ≤ i ≤ m, let σ(i) denote the index of the i-th arm received by algorithm A_new (i.e., the i-th arm received by A_new is M_{σ(i)}). By definition, σ is a uniformly random permutation of
{0, 1, . . . , m}. Let obs_i denote the sequence of samples that A_new observes from the i-th arm. Then the likelihood of this execution is given by
$$\frac{1}{(m+1)!}\sum_{\sigma}\ \prod_{i=0}^{m} f_{M_{\sigma(i)}}(\mathrm{obs}_i), \qquad (8)$$
where the summation is over all permutations σ on {0, 1, 2, . . . , m}, and f_{M_{σ(i)}}(obs_i) denotes the probability density of observing obs_i on arm M_{σ(i)}.

Now we compute the likelihood that, in experiment Expr_sym, the algorithm A_new simulated by A_sym observes the same sequences of samples. Let λ denote the random permutation of arms given to A_sym. We define p* = λ^{-1}(0), Long = {i ∈ {0, 1, 2, . . . , m} : |obs_i| > N}, and Short = {0, 1, . . . , m} \ (Long ∪ {p*}). In other words, p* is the position of the optimal arm A in I_sym, Long denotes the positions of suboptimal arms that have been sampled more than N times, and Short denotes the positions of the remaining suboptimal arms. Note that since fewer than N samples are taken on the optimal arm, p* is excluded from both sets.

Another source of randomness in Expr_sym is the random numbers π(·) that A_sym assigns to different arms. In this specific execution, the function π(·) chosen by A_sym is a random injection from Long to {1, 2, . . . , m}. By our construction of A_sym, for each i ∈ Long, the algorithm A_new simulated by A_sym first observes N samples drawn from N(µ_1, 1) (i.e., the reward distribution of arm A_1) on the i-th arm. After that, A_new starts to observe samples drawn from N(µ_{π(i)}, 1). Recall that the mixed arm M_{π(i)} also returns samples in this pattern. Therefore, the likelihood of the observations on the i-th arm is exactly
f_{M_{π(i)}}(obs_i).   (9)
In fact, we may express the likelihood for all arms as in (9) by extending π to a permutation on {0, 1, 2, . . . , m}. First, we set π(p*) = 0. Recall that the optimal arm is sampled less than N times, so all the samples observed from it are drawn from N(µ, 1), which is exactly the reward distribution of M_0 = M_{π(p*)}. Therefore the likelihood of the observations obs_{p*} is given by f_{M_{π(p*)}}(obs_{p*}).

Second, we let R = {1, 2, . . . , m} \ π(Long) denote the available labels among {1, 2, . . . , m}. We define π on Short by matching
Short with R uniformly at random. Note that since all arms in Short are sampled at most N times, A_new simulated by A_sym always observes samples drawn from N(µ_1, 1), which agrees with the first N samples from every mixed arm M_i (i ≠ 0). Therefore, the likelihood of the observations on the i-th arm, for i ∈ Short, is also given by f_{M_{π(i)}}(obs_i).

According to our analysis above, the samples from the i-th arm observed by the simulated A_new in experiment Expr_sym follow the same distribution as samples drawn from M_{π(i)}. Moreover, π is a uniformly random permutation with the only condition that π(p*) = 0, which is equivalent to π^{-1}(0) = p* = λ^{-1}(0). Therefore, the likelihood is given by
$$\frac{1}{m!\,(m+1)!}\sum_{\pi^{-1}(0)=\lambda^{-1}(0)}\ \prod_{i=0}^{m} f_{M_{\pi(i)}}(\mathrm{obs}_i), \qquad (10)$$
where the sum ranges over the pairs (λ, π) satisfying the constraint. Note that, conditioning on λ^{-1}(0) = π^{-1}(0), π is still a uniformly random permutation on {0, 1, 2, . . . , m}. Therefore the two likelihoods in (8) and (10) are equal. This finishes the proof of the equivalence.

D Missing Proofs in Section 4
D.1 Building Blocks
D.1.1 PAC algorithm for Best-k-Arm

On an instance of Best-k-Arm with n arms, the PAC-SamplePrune algorithm in [CGL16] is guaranteed to return an ε-optimal answer of Best-k-Arm with probability 1 − δ, using O(nε^{-2}(ln δ^{-1} + ln k)) samples. A set of k arms T ⊆ I is called ε-optimal if, after adding ε to the mean of each arm in T, T becomes the best k arms in I.

We implement our PAC-Best-k(S, k, ε, δ) subroutine as follows. Recall that PAC-Best-k is expected to return a partition (S^large, S^small) of the arm set S. If k ≤ |S|/2, we directly run PAC-SamplePrune on the Best-k-Arm instance S and return its output as S^large. We let S^small = S \ S^large. Otherwise, we negate the mean of all arms in S and run PAC-SamplePrune to find the top |S| − k arms in the negated instance. Finally, we return the output of
PAC-SamplePrune as S small and let S large = S \ S small . In the following weprove Lemma 4.1. Proof of Lemma 4.1.
By construction, the algorithm
PAC-Best-k(S, k, ε, δ) takes O(|S| ε^{-2} [ln δ^{-1} + ln min(k, |S| − k)]) samples. In the following we prove that if k ≤ |S|/2, the set T returned by PAC-SamplePrune is ε-optimal with probability 1 − δ. The case k > |S|/2 can be proved by an analogous argument.

Let S′ denote the instance in which the mean of every arm in T is increased by ε. By the definition of ε-optimality, T contains the best k arms in S′. Note that the k-th largest mean in S′ is at least µ_[k]. Thus for each arm A ∈ T, µ_A must be at least µ_[k] − ε, since otherwise, even after µ_A increases by ε, A is still not among the best k arms.

It also holds that every arm in S \ T must have a mean smaller than or equal to µ_[k+1] + ε. Suppose for contradiction that A ∈ S \ T has a mean µ_A > µ_[k+1] + ε. Then every arm with mean less than or equal to µ_[k+1] in S still has a mean smaller than µ_A in S′. This implies that A is among the best k arms in S′, which contradicts our assumption that A ∉ T.

D.1.2 PAC algorithms for Best-1-Arm

By symmetry, it suffices to implement the subroutine
EstMean-Large and prove its property. In order to estimate the mean of the largest arm in S, we first call PAC-Best-k(S, 1, ε/2, δ/2) to find an approximately largest arm. Then we sample that arm 8ε^{-2} ln(4/δ) times, and finally return its empirical mean. We prove Lemma 4.2 as follows.

Proof of Lemma 4.2.
Let A* denote the largest arm in S, and let A denote the arm returned by PAC-Best-k(S, 1, ε/2, δ/2). According to Lemma 4.1, with probability 1 − δ/2, µ_A ∈ [µ_{A*} − ε/2, µ_{A*}]. It follows that, with probability 1 − δ/2,
|µ_A − max_{A′∈S} µ_{A′}| ≤ ε/2.
Let µ̂ denote the empirical mean of arm A. By a Chernoff bound, with probability 1 − δ/2, |µ̂ − µ_A| ≤ ε/2. It follows from a union bound that, with probability 1 − δ,
|µ̂ − max_{A′∈S} µ_{A′}| ≤ ε.
Finally, we note that PAC-Best-k consumes O(|S| ε^{-2} ln δ^{-1}) samples as k = 1, while sampling A takes O(ε^{-2} ln δ^{-1}) samples. This finishes the proof. (More precisely, when negating an instance, whenever the algorithm requires a sample from an arm, we draw a sample and return its opposite.)

D.1.3 Elimination procedures

We use the
Elimination procedure defined in [CL15] as our subroutine
Elim-Small(S, θ^small, θ^large, δ). The other building block, Elim-Large(S, θ^small, θ^large, δ), can be implemented either using a procedure symmetric to Elimination, or simply by running Elim-Small(S′, −θ^large, −θ^small, δ), where S′ is obtained from S by negating the arms. In the following, we prove Lemma 4.3.
In the following, we prove Lemma 4.3.

Proof of Lemma 4.3. Let $T$ denote the set of arms returned by Elim-Small$(S, \theta_{\mathrm{small}}, \theta_{\mathrm{large}}, \delta)$. Lemma B.4 in [CL15] guarantees that with probability $1 - \delta$, the following three properties are satisfied:

(1) Elim-Small takes $O(|S|\varepsilon^{-2}\ln\delta^{-1})$ samples, where $\varepsilon = \theta_{\mathrm{large}} - \theta_{\mathrm{small}}$;

(2) $\left|\{A \in T : \mu_A \le \theta_{\mathrm{small}}\}\right| \le |T|/4$;

(3) Let $A^*$ be the largest arm in $S$. If $\mu_{A^*} \ge \theta_{\mathrm{large}}$, then $A^* \in T$.

In fact, the proof of Lemma B.4 does not rely on the fact that $A^*$ is the largest arm in $S$. Thus property (3) holds for any fixed arm in $S$. This proves the properties of Elim-Small. The properties of
Elim-Large hold by symmetry.
D.2 Observations
D.2.1 Proof of Observation 4.2
Proof of Observation 4.2.
Let $A$ denote the arm with the largest mean in $S^{\mathrm{small}}_r$. Recall that $\mu^{\mathrm{small}}_r$ denotes the mean of the $(k^{\mathrm{large}}_r + 1)$-th largest arm in $S_r$. The correctness of PAC-Best-k and Lemma 4.1 guarantee that $\mu_A \le \mu^{\mathrm{small}}_r + \varepsilon_r/8$. Note that $\mu^{\mathrm{small}}_r$ is the $k^{\mathrm{small}}_r$-th smallest mean in $S_r$, while $\mu_A$ is the largest mean among the $k^{\mathrm{small}}_r$ arms in $S^{\mathrm{small}}_r \subseteq S_r$, so it also holds that $\mu_A \ge \mu^{\mathrm{small}}_r$. Thus we have
\[
\mu_A \in \left[\mu^{\mathrm{small}}_r,\ \mu^{\mathrm{small}}_r + \varepsilon_r/8\right].
\]
Moreover, as EstMean-Large returns correctly conditioning on $E^{\mathrm{good}}_r$, by Lemma 4.2 we have
\[
\theta^{\mathrm{large}}_r \in \left[\mu^{\mathrm{small}}_r - \varepsilon_r/8,\ \mu^{\mathrm{small}}_r + \varepsilon_r/4\right].
\]
The second property follows from a symmetric argument.

D.2.2 Proof of Observation 4.3
Proof of Observation 4.3.
Recall that $E^{\mathrm{valid}}$ denotes the event that the execution of Bilateral-Elimination is valid. We condition on $E^{\mathrm{valid}}$ in the following proof. In particular, conditioning on $E^{\mathrm{valid}}$, $E^{\mathrm{good}}_{r-1}$ happens, and $T_{r-1}$ together with the best $k^{\mathrm{large}}_{r-1}$ arms in $S_{r-1}$ constitutes the correct answer of the original instance.

Let $\mu^{\mathrm{large}}_{r-1}$ and $\mu^{\mathrm{small}}_{r-1}$ be the $k^{\mathrm{large}}_{r-1}$-th and the $(k^{\mathrm{large}}_{r-1} + 1)$-th largest means in $S_{r-1}$. As the arm with mean $\mu^{\mathrm{large}}_{r-1}$ is part of the correct answer, we have $\mu^{\mathrm{large}}_{r-1} \ge \mu_{[k]}$, where $\mu_{[k]}$ is the $k$-th largest mean in the original instance. We also have $\mu^{\mathrm{small}}_{r-1} \le \mu_{[k+1]}$ for the same reason.

Since $E^{\mathrm{good}}_{r-1}$ happens, by Observation 4.2 we have
\[
\theta^{\mathrm{large}}_{r-1} \le \mu^{\mathrm{small}}_{r-1} + \varepsilon_{r-1}/4 \le \mu_{[k+1]} + \varepsilon_{r-1}/4.
\]
Then the larger threshold used in Elim-Large is upper bounded by
\[
\theta^{\mathrm{large}}_{r-1} + \varepsilon_{r-1}/4 \le \mu_{[k+1]} + \varepsilon_{r-1}/2 = \mu_{[k+1]} + \varepsilon_r.
\]
Let $T$ denote the set of arms returned by Elim-Large in round $r - 1$. We partition $T$ into the following three parts:
\[
T^{(1)} = \left\{A \in T : \mu_A > \mu_{[k+1]} + \varepsilon_r\right\},\quad
T^{(2)} = \left\{A \in T : \mu_{[k]} \le \mu_A \le \mu_{[k+1]} + \varepsilon_r\right\},\quad
T^{(3)} = \left\{A \in T : \mu_A \le \mu_{[k+1]}\right\}.
\]
By Lemma 4.3 and the correctness of Elim-Large conditioning on $E^{\mathrm{good}}_{r-1}$, we have $|T^{(1)}| \le |T|/4$. It follows that
\[
|T^{(2)}| + |T^{(3)}| \ge 3|T|/4 \ge |T|/2.
\]
By definition of arm groups, every arm in $T^{(2)}$ is in $G^{\mathrm{large}}_{\ge r}$. In order to bound $|T^{(3)}|$, we say that an arm is misclassified into $S^{\mathrm{large}}_{r-1}$ if the arm is not among the best $k^{\mathrm{large}}_{r-1}$ arms in $S_{r-1}$, but is included in $S^{\mathrm{large}}_{r-1}$. We define misclassification into $S^{\mathrm{small}}_{r-1}$ similarly. As $|S^{\mathrm{large}}_{r-1}| = k^{\mathrm{large}}_{r-1}$, the numbers of arms misclassified into the two sides are equal.

Since the arms in $T^{(3)}$ are misclassified into $S^{\mathrm{large}}_{r-1}$, there are at least $|T^{(3)}|$ other arms misclassified into $S^{\mathrm{small}}_{r-1}$. Lemma 4.1 (along with the correctness of PAC-Best-k) guarantees that all arms misclassified into $S^{\mathrm{small}}_{r-1}$ have means smaller than or equal to $\mu_{[k+1]} + \varepsilon_{r-1}/8$. Thus, by definition of arm groups, all these $|T^{(3)}|$ arms are also in $G^{\mathrm{large}}_{\ge r}$. Therefore, we have
\[
|G^{\mathrm{large}}_{\ge r}| \ge |T^{(2)}| + |T^{(3)}| \ge |T|/2.
\]
Note that $|T| = k^{\mathrm{large}}_r$. Therefore we conclude that $k^{\mathrm{large}}_r \le 2|G^{\mathrm{large}}_{\ge r}|$. The bound on $k^{\mathrm{small}}_r$ can be proved using a symmetric argument.

D.3 Proof of Lemma 4.4
Lemma 4.4 (restated). $\Pr\left[E^{\mathrm{valid}}\right] \ge 1 - \delta$.

Proof of Lemma 4.4. We prove the lemma by upper bounding the probability of $\overline{E^{\mathrm{valid}}}$, the complement of $E^{\mathrm{valid}}$.

Split $\overline{E^{\mathrm{valid}}}$. Let $E^{\mathrm{bad}}_r$ denote the event that Bilateral-Elimination is valid at round $r$, yet it becomes invalid at round $r + 1$. Then we have
\[
\Pr\left[\overline{E^{\mathrm{valid}}}\right] = \sum_{r=1}^{\infty}\Pr\left[E^{\mathrm{bad}}_r\right].
\]
By definition of validity, event $E^{\mathrm{bad}}_r$ happens in one of the following two cases:

• Case 1: $E^{\mathrm{good}}_r$ does not happen.

• Case 2: $E^{\mathrm{good}}_r$ happens, yet $T_{r+1}$ together with the best $k^{\mathrm{large}}_{r+1}$ arms in $S_{r+1}$ is no longer the correct answer.

The probability of Case 1 is upper bounded by $5\delta_r$ according to Observation 4.1. We focus on bounding the probability of Case 2 in the following.

Misclassified arms.
Recall that $\mu^{\mathrm{large}}_r$ and $\mu^{\mathrm{small}}_r$ denote the means of the $k^{\mathrm{large}}_r$-th and the $(k^{\mathrm{large}}_r + 1)$-th largest arms in $S_r$, respectively. Conditioning on the validity of the execution at round $r$, the arm with mean $\mu^{\mathrm{large}}_r$ is among the best $k$ arms in the original instance, while the arm with mean $\mu^{\mathrm{small}}_r$ is not. Thus we have
\[
\mu^{\mathrm{large}}_r \ge \mu_{[k]} > \mu_{[k+1]} \ge \mu^{\mathrm{small}}_r.
\]
Define
\[
U^{\mathrm{large}}_r = \{A \in S^{\mathrm{large}}_r : \mu_A \le \mu^{\mathrm{small}}_r\}
\quad\text{and}\quad
U^{\mathrm{small}}_r = \{A \in S^{\mathrm{small}}_r : \mu_A \ge \mu^{\mathrm{large}}_r\}.
\]
In other words, $U^{\mathrm{large}}_r$ and $U^{\mathrm{small}}_r$ denote the sets of arms “misclassified” by the PAC-Best-k subroutine into $S^{\mathrm{large}}_r$ and $S^{\mathrm{small}}_r$ in round $r$.

Bound the number of misclassified arms.
Note that $|U^{\mathrm{large}}_r| \le |S^{\mathrm{large}}_r| = k^{\mathrm{large}}_r$. In addition, at most $k^{\mathrm{small}}_r$ arms in $S_r$ have means smaller than or equal to $\mu^{\mathrm{small}}_r$, so $|U^{\mathrm{large}}_r| \le \min(k^{\mathrm{large}}_r, k^{\mathrm{small}}_r)$. For the same reason, it holds that $|U^{\mathrm{small}}_r| \le \min(k^{\mathrm{large}}_r, k^{\mathrm{small}}_r)$.

With high probability, no misclassified arms are removed.
By Observation 4.2, conditioning on $E^{\mathrm{good}}_r$, we have $\theta^{\mathrm{large}}_r \ge \mu^{\mathrm{small}}_r - \varepsilon_r/8$. Therefore, when Elim-Large in Line 11 is called at round $r$, the smaller threshold is at least
\[
\theta^{\mathrm{large}}_r + \varepsilon_r/8 \ge \mu^{\mathrm{small}}_r,
\]
which is at least the mean of every arm in $U^{\mathrm{large}}_r$. By Lemma 4.3 and a union bound, with probability
\[
1 - |U^{\mathrm{large}}_r|\,\delta'_r \ge 1 - \min(k^{\mathrm{large}}_r, k^{\mathrm{small}}_r)\,\delta'_r = 1 - \delta_r,
\]
no arms in $U^{\mathrm{large}}_r$ are removed by Elim-Large. For the same reason, with probability $1 - \delta_r$, no arms in $U^{\mathrm{small}}_r$ are removed by Elim-Small.

Bound the probability of Case 2.
Thus, with probability at least $1 - 2\delta_r$ conditioning on $E^{\mathrm{good}}_r$, Elim-Large only removes arms with means larger than or equal to $\mu^{\mathrm{large}}_r$, and Elim-Small only removes arms with means smaller than or equal to $\mu^{\mathrm{small}}_r$ (note that no arm in $S_r$ has a mean strictly between $\mu^{\mathrm{small}}_r$ and $\mu^{\mathrm{large}}_r$, since these are the means of two consecutive order statistics of $S_r$). Consequently, every arm in $S_r$ with mean greater than or equal to $\mu^{\mathrm{large}}_r$ either moves to $T_{r+1}$ or stays in $S_{r+1}$, which implies that Case 2 does not happen.

Therefore, Case 2 happens with probability at most $2\delta_r$, and it follows that
\[
\Pr\left[E^{\mathrm{bad}}_r\right] \le 5\delta_r + 2\delta_r = 7\delta_r.
\]
Finally, we have
\[
\Pr\left[\overline{E^{\mathrm{valid}}}\right] \le \sum_{r=1}^{\infty}7\delta_r \le \delta.
\]

D.4 Missing Calculation in the Proof of Lemma 4.6
Lemma 4.6 (restated)
Conditioning on event $E^{\mathrm{valid}}$, Bilateral-Elimination takes $O\left(H\ln\delta^{-1} + \widetilde{H}^{\mathrm{large}} + \widetilde{H}^{\mathrm{small}} + \widetilde{H}\right)$ samples.

Proof (continued). Recall that
\[
H^{(1)}_r = \left(|G^{\mathrm{large}}_{\ge r}| + |G^{\mathrm{small}}_{\ge r}|\right)\varepsilon_r^{-2}\left(\ln\delta^{-1} + \ln r\right),\quad
H^{(2,\mathrm{large})}_r = \varepsilon_r^{-2}\,|G^{\mathrm{large}}_{\ge r}|\ln|G^{\mathrm{small}}_{\ge r}|,\quad
H^{(2,\mathrm{small})}_r = \varepsilon_r^{-2}\,|G^{\mathrm{small}}_{\ge r}|\ln|G^{\mathrm{large}}_{\ge r}|.
\]
Our goal is to show that
\[
\sum_{r=1}^{\infty}H^{(1)}_r = O\left(H\ln\delta^{-1} + \widetilde{H}\right),\quad
\sum_{r=1}^{\infty}H^{(2,\mathrm{large})}_r = O\left(\widetilde{H}^{\mathrm{large}}\right),\quad\text{and}\quad
\sum_{r=1}^{\infty}H^{(2,\mathrm{small})}_r = O\left(\widetilde{H}^{\mathrm{small}}\right).
\]

Upper bound the $H^{(1)}$ term: It follows from a direct calculation that
\[
\sum_{r=1}^{\infty}H^{(1)}_r
= \sum_{r=1}^{\infty}\sum_{i=r}^{\infty}\left(|G^{\mathrm{large}}_i| + |G^{\mathrm{small}}_i|\right)\varepsilon_r^{-2}\left(\ln\delta^{-1} + \ln r\right)
= \sum_{i=1}^{\infty}\left(|G^{\mathrm{large}}_i| + |G^{\mathrm{small}}_i|\right)\sum_{r=1}^{i}\varepsilon_r^{-2}\left(\ln\delta^{-1} + \ln r\right)
= O\!\left(\sum_{i=1}^{\infty}\left(|G^{\mathrm{large}}_i| + |G^{\mathrm{small}}_i|\right)\varepsilon_i^{-2}\left(\ln\delta^{-1} + \ln i\right)\right)
= O\!\left(\sum_{i=1}^{n}\Delta_{[i]}^{-2}\left(\ln\delta^{-1} + \ln\ln\Delta_{[i]}^{-1}\right)\right).
\]
Here the second step interchanges the order of summation. The third step holds since the inner summation is always dominated by its last term. Finally, the last step is due to the fact that $\Delta_A = \Theta(\varepsilon_i)$ for every arm $A \in G^{\mathrm{large}}_i \cup G^{\mathrm{small}}_i$. Therefore we have $\sum_{r=1}^{\infty}H^{(1)}_r = O(H\ln\delta^{-1} + \widetilde{H})$.

Upper bound the $H^{(2,\mathrm{large})}$ and $H^{(2,\mathrm{small})}$ terms: By definition of $H^{(2,\mathrm{large})}_r$, we have
\[
\sum_{r=1}^{\infty}H^{(2,\mathrm{large})}_r
= \sum_{r=1}^{\infty}\sum_{i=r}^{\infty}\varepsilon_r^{-2}\,|G^{\mathrm{large}}_i|\ln|G^{\mathrm{small}}_{\ge r}|
= \sum_{i=1}^{\infty}|G^{\mathrm{large}}_i|\sum_{r=1}^{i}\varepsilon_r^{-2}\ln|G^{\mathrm{small}}_{\ge r}|.
\]
Therefore we conclude that $\sum_{r=1}^{\infty}H^{(2,\mathrm{large})}_r = O(\widetilde{H}^{\mathrm{large}})$. The bound on the sum of $H^{(2,\mathrm{small})}_r$ follows from an analogous calculation.
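To spell out the third step in the bound on the $H^{(1)}$ term: since $\varepsilon_r^{-2} = 4^r$, the inner summation is a geometrically growing sum dominated by its last term,
\[
\sum_{r=1}^{i}\varepsilon_r^{-2}\left(\ln\delta^{-1} + \ln r\right)
\le \left(\ln\delta^{-1} + \ln i\right)\sum_{r=1}^{i}4^{r}
\le 2\cdot 4^{i}\left(\ln\delta^{-1} + \ln i\right)
= O\!\left(\varepsilon_i^{-2}\left(\ln\delta^{-1} + \ln i\right)\right).
\]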
D.5 Proof of Theorem 1.3

Theorem 1.3 (restated)
For every Best-$k$-Arm instance, the following two statements hold:

1. $\widetilde{H}^{\mathrm{large}} + \widetilde{H}^{\mathrm{small}} = O\left(\left(H^{\mathrm{large}} + H^{\mathrm{small}}\right)\ln\ln n\right)$.

2. $\widetilde{H}^{\mathrm{large}} + \widetilde{H}^{\mathrm{small}} = O\left(H\ln k\right)$.

Proof of Theorem 1.3.

First Upper Bound.
Recall that
\[
H^{\mathrm{large}} = \sum_{i=1}^{\infty}\left|G^{\mathrm{large}}_i\right|\cdot\max_{j\le i}\varepsilon_j^{-2}\ln\left|G^{\mathrm{small}}_{\ge j}\right|
\quad\text{and}\quad
\widetilde{H}^{\mathrm{large}} = \sum_{i=1}^{\infty}\left|G^{\mathrm{large}}_i\right|\sum_{j=1}^{i}\varepsilon_j^{-2}\ln\left|G^{\mathrm{small}}_{\ge j}\right|.
\]
For brevity, let $N_r$ denote $\varepsilon_r^{-2}\ln|G^{\mathrm{small}}_{\ge r}| = 4^r\ln|G^{\mathrm{small}}_{\ge r}|$. We fix the value $i$. Then the $i$-th term in $\widetilde{H}^{\mathrm{large}}$ reduces to $|G^{\mathrm{large}}_i|\sum_{r=1}^{i}N_r$. Let $r^* = \mathrm{argmax}_{1\le r\le i}\,N_r$. Thus the $i$-th term in $H^{\mathrm{large}}$ is simply $|G^{\mathrm{large}}_i|\,N_{r^*}$, which is in general smaller than $|G^{\mathrm{large}}_i|\sum_{r=1}^{i}N_r$. However, we will show that the ratio between the two terms is bounded by $O(\ln\ln n)$.

By definition of $r^*$, we have $N_{r^*} \ge N_i$. Substituting $N_{r^*}$ and $N_i$ yields
\[
4^{r^*}\ln|G^{\mathrm{small}}_{\ge r^*}| \ge 4^{i}\ln|G^{\mathrm{small}}_{\ge i}|.
\]
It follows that
\[
4^{\,i - r^*}\ln|G^{\mathrm{small}}_{\ge i}| \le \ln|G^{\mathrm{small}}_{\ge r^*}| \le \ln n,
\]
and thus $i - r^* = O(\ln\ln n)$.

Let $1 \le r_0 \le r^*$ be the smallest integer such that $N_{r_0} \ge 2^{\,r_0 - r^*}N_{r^*}$. By substituting $N_{r_0}$ and $N_{r^*}$, we obtain
\[
4^{r_0}\ln|G^{\mathrm{small}}_{\ge r_0}| \ge 2^{\,r_0 - r^*}\cdot 4^{r^*}\ln|G^{\mathrm{small}}_{\ge r^*}|,
\]
which further implies that
\[
2^{\,r^* - r_0}\ln|G^{\mathrm{small}}_{\ge r^*}| \le \ln|G^{\mathrm{small}}_{\ge r_0}| \le \ln n,
\]
and thus $r^* - r_0 = O(\ln\ln n)$.

Therefore we have $i - r_0 = O(\ln\ln n)$, and we can bound the sum of the $N_r$ as follows:
\[
\sum_{r=1}^{i}N_r = \sum_{r=1}^{r_0 - 1}N_r + \sum_{r=r_0}^{i}N_r
\le N_{r^*}\sum_{r=1}^{r_0 - 1}2^{\,r - r^*} + (i - r_0 + 1)N_{r^*}
\le (i - r_0 + 2)N_{r^*} = O(N_{r^*}\ln\ln n).
\]
Here the second step follows from $N_r < 2^{\,r - r^*}N_{r^*}$ for $r < r_0$ (by definition of $r_0$) and $N_r \le N_{r^*}$ for $r \ge r_0$ (by definition of $r^*$); the third step uses $\sum_{r=1}^{r_0 - 1}2^{\,r - r^*} \le 2^{\,r_0 - r^*} \le 1$.

It then follows from a direct summation over all $i$ that $\widetilde{H}^{\mathrm{large}} = O(H^{\mathrm{large}}\ln\ln n)$. The bound on $\widetilde{H}^{\mathrm{small}}$ can be proved similarly.

Second Upper Bound.
Note that
\[
\widetilde{H}^{\mathrm{large}}
= \sum_{i=1}^{\infty}\left|G^{\mathrm{large}}_i\right|\sum_{j=1}^{i}\varepsilon_j^{-2}\ln\left|G^{\mathrm{small}}_{\ge j}\right|
= \sum_{j=1}^{\infty}\varepsilon_j^{-2}\ln\left|G^{\mathrm{small}}_{\ge j}\right|\sum_{i=j}^{\infty}\left|G^{\mathrm{large}}_i\right|
= \sum_{i=1}^{\infty}\varepsilon_i^{-2}\left|G^{\mathrm{large}}_{\ge i}\right|\ln\left|G^{\mathrm{small}}_{\ge i}\right|. \tag{11}
\]
Here the second step interchanges the order of summation. By symmetry we also have
\[
\widetilde{H}^{\mathrm{small}} = \sum_{i=1}^{\infty}\varepsilon_i^{-2}\left|G^{\mathrm{small}}_{\ge i}\right|\ln\left|G^{\mathrm{large}}_{\ge i}\right|. \tag{12}
\]
It can be easily verified that for $1 \le x \le y$, we have
\[
x\ln y + y\ln x \le (x + y)(2\ln x + 1). \tag{13}
\]
Note that $\min\left(|G^{\mathrm{large}}_{\ge i}|, |G^{\mathrm{small}}_{\ge i}|\right) \le k$ for all $i$. Therefore we can bound $\widetilde{H}^{\mathrm{large}} + \widetilde{H}^{\mathrm{small}}$ as follows:
\[
\widetilde{H}^{\mathrm{large}} + \widetilde{H}^{\mathrm{small}}
= \sum_{i=1}^{\infty}\varepsilon_i^{-2}\left(|G^{\mathrm{large}}_{\ge i}|\ln|G^{\mathrm{small}}_{\ge i}| + |G^{\mathrm{small}}_{\ge i}|\ln|G^{\mathrm{large}}_{\ge i}|\right)
= O\!\left(\sum_{i=1}^{\infty}\varepsilon_i^{-2}\left(|G^{\mathrm{large}}_{\ge i}| + |G^{\mathrm{small}}_{\ge i}|\right)\ln\min\left(|G^{\mathrm{large}}_{\ge i}|, |G^{\mathrm{small}}_{\ge i}|\right)\right)
= O\!\left(\sum_{i=1}^{\infty}\varepsilon_i^{-2}\left(|G^{\mathrm{large}}_{\ge i}| + |G^{\mathrm{small}}_{\ge i}|\right)\ln k\right)
= O\left(H\ln k\right).
\]
The first step follows from (11) and (12). The second step is due to (13). The third step is due to the observation that $\min\left(|G^{\mathrm{large}}_{\ge i}|, |G^{\mathrm{small}}_{\ge i}|\right) \le k$. Finally, the last step follows from a simple rearrangement of the summation:
\[
\sum_{i=1}^{\infty}\varepsilon_i^{-2}\left(|G^{\mathrm{large}}_{\ge i}| + |G^{\mathrm{small}}_{\ge i}|\right)
= \sum_{i=1}^{\infty}\varepsilon_i^{-2}\sum_{j=i}^{\infty}\left(|G^{\mathrm{large}}_j| + |G^{\mathrm{small}}_j|\right)
= \sum_{j=1}^{\infty}\left(|G^{\mathrm{large}}_j| + |G^{\mathrm{small}}_j|\right)\sum_{i=1}^{j}\varepsilon_i^{-2}
= O\!\left(\sum_{j=1}^{\infty}\varepsilon_j^{-2}\left(|G^{\mathrm{large}}_j| + |G^{\mathrm{small}}_j|\right)\right)
= O(H).
\]
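For completeness, one way to verify inequality (13): for $1 \le x \le y$, since $\ln t \le t$ for $t \ge 1$,
\[
x\ln y = x\ln x + x\ln(y/x) \le x\ln x + y \le (x+y)\ln x + (x+y) = (x+y)(\ln x + 1),
\]
and combining this with $y\ln x \le (x+y)\ln x$ gives $x\ln y + y\ln x \le (x+y)(2\ln x + 1)$.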