Top-k Combinatorial Bandits with Full-Bandit Feedback
Idan Rejwan [email protected]
Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
Yishay Mansour [email protected]
Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel, and Google Research, Tel Aviv
Abstract
Top-k Combinatorial Bandits generalize multi-armed bandits, where at each round any subset of $k$ out of $n$ arms may be chosen and the sum of the rewards is gained. We address the full-bandit feedback, in which the agent observes only the sum of rewards, in contrast to the semi-bandit feedback, in which the agent also observes the individual arms' rewards. We present the Combinatorial Successive Accepts and Rejects (CSAR) algorithm, which generalizes SAR (Bubeck et al., 2013) to top-k combinatorial bandits. Our main contribution is an efficient sampling scheme that uses Hadamard matrices in order to estimate accurately the individual arms' expected rewards. We discuss two variants of the algorithm, the first of which minimizes the sample complexity and the second the regret. We also prove a lower bound on sample complexity, which is tight for $k = O(1)$. Finally, we run experiments and show that our algorithm outperforms other methods.

Keywords:
Multi-Armed Bandits, Combinatorial Bandits, Top-k Bandits, Hadamard Matrix, Sample Complexity, Regret Minimization
1. Introduction
Multi-armed bandit (MAB) is an extensively studied problem in statistics and machine learning. The classical version of this problem is formulated as a system of $n$ arms (or actions), each having an unknown distribution of rewards. An agent repeatedly plays these arms in order to find the best arm and maximize its reward (Robbins, 1952).

The MAB research focuses on two different objectives. The first aims to maximize the reward accumulated by the agent while playing the arms. This objective highlights the trade-off between exploration and exploitation, i.e., the balance between staying with the arm that gave the highest reward in the past and exploring new arms that might give higher reward in the future. Success in this goal is measured by regret, which is the difference between the best arm's expected reward over the time horizon and the reward accumulated by the agent over the same time. The second objective, sometimes referred to as best arm identification or pure exploration, aims to minimize the sample complexity, which is the number of steps until identifying the best arm with high probability. These two objectives might contradict each other, meaning that a policy which is good for finding the best arm quickly is not necessarily good for accumulating high reward (Bubeck et al., 2009).

An extension of the standard MAB model is the Combinatorial Bandits model (Cesa-Bianchi and Lugosi, 2012; Chen et al., 2013). In this model, instead of choosing one arm at each round, a decision set of actions is given, where each action is a subset of arms. Top-k is a special case of combinatorial bandits, in which the decision set includes all subsets of size $k$ out of $n$ arms, and each action's reward is the sum of the $k$ arms. Combinatorial Bandits have two variants, depending on the feedback observed by the agent. In the simpler one, the agent observes in each round the rewards of each of the $k$ individual arms, in addition to the aggregated reward.
Such a model is referred to as the semi-bandit feedback. This is in contrast to the full-bandit feedback, where the only feedback observed by the agent is the aggregated reward. Although much of the research studies the semi-bandit feedback (Chen et al., 2013, 2016; Combes et al., 2015; Kveton et al., 2015), in many real-life problems it is costly or even impossible to gain information on each individual arm by itself. This is the case, for example, in crowdsourcing (Lin et al., 2014) and adaptive routing (Awerbuch and Kleinberg, 2004), and also in scenarios where data privacy considerations come into play, such as online advertisement and medical trials.

Full-bandit feedback is harder than semi-bandit feedback, due to the lack of information about each individual arm. Each time a subset is sampled and an aggregated reward is observed, it is hard to assign the credit among the individual arms. One naive attempt to deal with it is to treat every possible subset as a distinct arm, and consider it as a classical MAB problem with $\binom{n}{k}$ arms. However, the number of arms is exponential, hence this approach is clearly inefficient. Additionally, it ignores the combinatorial structure that could extract some shared information between different subsets. Another attempt is to treat it as a special case of Linear Bandits. In this model, each arm $a$ is a vector in a decision set $D \subseteq \mathbb{R}^n$, and its expected reward is the inner product between $a \in D$ and the reward vector $\theta \in \mathbb{R}^n$. Combinatorial bandits are actually a special case of linear bandits, where the decision set is limited to binary vectors with exactly $k$ ones. One could hope to use LinUCB, the highly established algorithm for linear bandits (Abbasi-Yadkori et al., 2011; Dani et al., 2008; Chu et al., 2011), to solve combinatorial bandits.
This algorithm involves an optimization problem to find which subset to sample at each round; however, for combinatorial decision sets the optimization is NP-hard (Dani et al., 2008; Kuroki et al., 2019). Thus, we wish to find an algorithm that is (a) informative: it gives enough information on each individual arm; (b) efficient: it uses a small number of samples; and (c) computable in polynomial time. Our main contribution is an algorithm that fulfills all three requirements, as we show theoretically and empirically.

In this work, we describe an algorithm for full-bandit feedback that finds the optimal subset of arms efficiently. The algorithm is based on the Successive Accepts and Rejects (SAR) algorithm (Bubeck et al., 2013), which iteratively estimates the arms with an increasing level of accuracy, and accepts or rejects arms until it finds the optimal subset. While the original algorithm is designed for classical MABs, it is not clear how to estimate the expected rewards of the individual arms given full-bandit feedback. Our main novelty is thus describing an efficient method for estimating the individual arms' rewards, and by this generalizing SAR to full-bandit feedback. We present a sampling scheme that uses Hadamard matrices to estimate the arms using a small number of samples. We show that this scheme is efficient by proving that the number of samples needed to find the optimal subset with probability at least $1 - \delta$ is at most $O\big(\sum_{i=1}^n \frac{1}{\Delta_i^2}\log\frac{n}{\delta}\big)$, where the $\Delta_i$'s are the gaps between the optimal and sub-optimal arms (see Section 2 for a formal definition). We also prove a lower bound of $\Omega\big(\frac{n}{\epsilon^2}\big)$ samples for finding a subset whose expected reward is within $\epsilon$ of the optimal. Note that in the combinatorial model the feedback depends on $k$ actions, rather than a single one, thus it might be more informative. Second, we discuss regret minimization. We show that the algorithm that minimizes sample complexity does not minimize the regret.
Instead, we suggest a modification to the algorithm that achieves $O\big(\frac{nk}{\Delta}\log T\big)$ distribution-dependent and $O(k\sqrt{nT})$ distribution-independent regret, where $\Delta = \min_i \Delta_i$ and $T$ is the time horizon. To the best of our knowledge, this is the first algorithm to achieve $O(\log T)$ distribution-dependent regret in the full-bandit setting. Finally, we conduct experiments that show that the proposed algorithm achieves small sample complexity and regret compared to other methods.

Best Arm Identification. The problem of Best Arm Identification, a.k.a. Pure Exploration, was introduced by Even-Dar et al. (2006), and later by Bubeck et al. (2009), where the goal is to find the best arm using a minimal number of samples. Even-Dar et al. (2006) describe two algorithms for this end; one of them is Successive Elimination, which in each round estimates all the arms with an increasing level of accuracy and eliminates the arms that are far from the optimal arm with high confidence. This algorithm uses $O\big(\sum_{i=1}^n \frac{1}{\Delta_i^2}\log\frac{n}{\delta}\big)$ samples to find the optimal arm with probability at least $1 - \delta$. It is the conceptual basis for a number of algorithms, including the one we describe in this work.

Multiple Arms Identification.
As an extension of Best Arm Identification, the goal of Multiple Arms Identification is to find the best $k$ arms, where the samples are still of one arm in each round. This problem, a.k.a. Subset Selection or Explore-k, was introduced by Kalyanakrishnan and Stone (2010), and a variety of algorithms were designed for this end (Kalyanakrishnan and Stone, 2010; Kalyanakrishnan et al., 2012; Chen et al., 2014; Zhou et al., 2014). One notable algorithm is Successive Accepts and Rejects (SAR) (Bubeck et al., 2013), which generalizes the Successive Elimination algorithm to multiple arms identification by adding a set of accepted arms that have been identified as part of the optimal arms.
Combinatorial Bandits. Most of the works in the framework of stochastic combinatorial bandits address the semi-bandit feedback (Chen et al., 2013, 2016; Combes et al., 2015; Kveton et al., 2015; Merlis and Mannor, 2019). For full-bandit feedback, only a few algorithms have been suggested. One of them is ConfidenceBall in Dani et al. (2008), which is a polynomial-time approximation of LinUCB for linear bandits with NP-hard decision sets. Another approximation of LinUCB is described by Kuroki et al. (2019), which uses an approximate method for quadratic optimization based on graphs. A different approach is taken by Agarwal and Aggarwal (2018), which is designed for cases where the aggregated reward is not necessarily the sum of the individual arms. This algorithm is based on an Explore-then-Exploit approach and achieves a regret of $O(k^{2/3} n^{1/3} T^{2/3})$. Lin et al. (2014) consider a problem that somewhat generalizes the full-bandit setting, where the reward is not necessarily the sum of individual arms, but the feedback for the agent is a linear combination of the arms' rewards, and show an $O(T^{2/3}\log T)$ regret bound. For the sake of completeness, we note that there are also a number of works on full-bandit feedback in the adversarial setting (Cesa-Bianchi and Lugosi, 2012; Combes et al., 2015).

Lower Bounds. For best arm identification, $\Theta\big(\frac{n}{\epsilon^2}\log\frac{1}{\delta}\big)$ samples are necessary and sufficient for any $(\epsilon, \delta)$-PAC algorithm to identify the best arm (Mannor and Tsitsiklis, 2004; Even-Dar et al., 2006). For multiple arms identification, slightly more samples are needed, where the lower bound is $\Omega\big(\frac{n}{\epsilon^2}\log\frac{k}{\delta}\big)$ (Kalyanakrishnan et al., 2012; Kaufmann and Kalyanakrishnan, 2013). Our work extends these bounds to the full-bandit feedback, providing a lower bound of $\Omega\big(\frac{n}{\epsilon^2}\big)$ on the sample complexity.

As for the regret, a seminal work by Lai and Robbins (1985) bounds the regret of classical MAB from below by $\Omega\big(\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i}\big)$. This result extends to $\Omega\big(c(\theta)\log T\big)$ for combinatorial bandits, where $c(\theta)$ is the solution of an optimization problem that depends on the distribution of rewards (Talebi et al., 2017). Another type of bound is the distribution-independent regret bound, which does not depend on the distribution of the arms' rewards. For classical MABs, a well-known lower bound of $\Omega(\sqrt{nT})$ was proven by Auer et al. (2002). This result extends to $\Omega(\sqrt{knT})$ for combinatorial bandits with semi-bandit feedback (Kveton et al., 2015; Lattimore et al., 2018). For full-bandit feedback, there are even stronger results of $\Omega(k\sqrt{nT})$ (Audibert et al., 2013) and $\Omega(k\sqrt{knT})$ (Cohen et al., 2017) if the decision set is limited, i.e., not all subsets can be selected by the agent, which is not the case in our setting. Another relevant bound is for linear bandits, where the regret is bounded by $\Omega(n\sqrt{T})$ (Dani et al., 2008).
2. Preliminaries
Suppose that there are $n$ arms numbered $1, 2, \ldots, n$, and each arm $i \in [n]$ is associated with a random variable $X_i = \theta_i + \eta_i$, such that $\theta_i$ is the expected reward and $\eta_i$ is 1-subgaussian noise. We assume the arms are ordered such that $\theta_1 \geq \cdots \geq \theta_n$, but this order is not known to the agent. In each round $t$, the agent selects a subset $S_t$ of $k$ arms and observes a reward $r_t = \sum_{i \in S_t} X_i$, where each arm $X_i$ is sampled independently.

The agent's objective is to find a subset $S$ that maximizes the expected reward $\mu_S = E[r_S]$. Since the arms are independent, we can write $\mu_S = \sum_{i \in S} \theta_i$. Accordingly, the optimal subset is $S^* = \{1, \ldots, k\}$, with expected reward $\mu^* = \sum_{i=1}^k \theta_i$.

We adopt the $(\epsilon, \delta)$-PAC framework (Valiant, 1984), in which the goal of the agent is to output a subset $S$ such that for any $\epsilon, \delta > 0$,
$$Pr[\mu^* - \mu_S > \epsilon] < \delta.$$

The regret of the agent over time horizon $T$ is defined as
$$R = E\Big[\sum_{t=1}^T \big(\mu^* - r_t\big)\Big] = T\mu^* - \sum_{t=1}^T \mu_t,$$
where $\mu_t = E[r_t]$ is the expected reward at round $t$. The regret is measured in terms of the gaps between the arms. For every arm $i \in [n]$ we define the gap
$$\Delta_i = \begin{cases} \theta_i - \theta_{k+1} & i \leq k \\ \theta_k - \theta_i & i > k \end{cases}$$
Note that the gaps are defined differently than for classical MAB, as the optimal arms also have gaps, measured against the best sub-optimal arm $k+1$. Intuitively, the gaps are the arms' distances from changing their status from optimal to sub-optimal and vice versa. Finally, we define $\Delta = \min_i \Delta_i = \Delta_k = \Delta_{k+1}$.

2.1 Hadamard Matrices

The algorithm we present in this paper uses the Hadamard matrix. We define it here and discuss a few of its properties; for more information see Horadam (2012).
Definition. A square matrix $H$ of size $n$ is called Hadamard if its entries are $\pm 1$ and it satisfies $H^\top H = nI$, where $I$ is the identity matrix.

Hadamard matrices satisfy the following properties:

• Any $H$ can be normalized such that the first row contains only positive entries.

• For any $i > 1$, the $i$-th row in $H$ contains an equal number of $+1$ and $-1$ entries.

• For any $n$, there exists a Hadamard matrix of size $2^n$. It is conjectured that Hadamard matrices exist for any multiple of 4, and the matrices for most of the multiples of 4 up to 2000 are known (Đoković, 2008).

It is interesting to mention that Hadamard matrices maximize the determinant among all matrices with entries in $\pm 1$.
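For powers of two, Sylvester's recursion $H_{2m} = \big(\begin{smallmatrix} H_m & H_m \\ H_m & -H_m \end{smallmatrix}\big)$ produces Hadamard matrices explicitly. A minimal sketch in Python (the helper name `hadamard` is ours, not the paper's) that checks the defining identity and the row-balance property above:

```python
def hadamard(n):
    """Sylvester construction: a +/-1 matrix H of size n (n a power of 2)
    satisfying H^T H = n * I."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    H = [[1]]
    while len(H) < n:
        # H_{2m} = [[H, H], [H, -H]]
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

H = hadamard(8)
# First row is all +1; every other row balances +1 and -1 entries
# (the property used later to split a block of 2k arms into two size-k subsets).
assert all(x == 1 for x in H[0])
assert all(sum(row) == 0 for row in H[1:])
# Defining identity: distinct rows are orthogonal, so H H^T = H^T H = n I.
for i in range(8):
    for j in range(8):
        dot = sum(H[i][c] * H[j][c] for c in range(8))
        assert dot == (8 if i == j else 0)
```

The same recursion is what standard libraries implement for power-of-two sizes; for other multiples of 4 a table of known constructions is needed.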
3. Combinatorial Successive Accepts and Rejects Algorithm
In this section we present the Combinatorial Successive Accepts and Rejects algorithm for top-k combinatorial bandits. We begin by presenting an efficient estimation algorithm that estimates the expected rewards of all the arms, and then discuss the main algorithm, which uses the estimation algorithm in order to find the best subset of $k$ arms. Finally, we bound the sample complexity and regret achieved by the algorithm.

The first algorithm we discuss suggests an efficient method to estimate the expected rewards of the arms under full-bandit feedback. The algorithm gets as inputs a set $\mathcal{N}$ of $n$ arms, a subset size $k$, a level of accuracy $\epsilon$ and a level of confidence $\delta$. The algorithm first partitions $\mathcal{N}$ into sets of size $2k$. In each of those sets, it uses the Hadamard matrix as an instructor for the subsets to sample. Let $H$ be the Hadamard matrix of size $2k$; then for each row $H_i$ ($i \neq 1$) the algorithm partitions the arms according to the positive and negative entries in $H_i$. Since in every row exactly half of the entries are positive, the partition forms two sets of size $k$. For $i = 1$, $H_1$ has only positive entries, so the algorithm partitions arbitrarily into two sets. Each of these sets is sampled enough times to get a good estimate of its expected reward. Then, the sets' estimated rewards are summed according to their sign in $H$. This way we get a vector $\hat{Z}$ that is equal in expectation to $H\theta$. Finally, to estimate the individual arms' rewards, the algorithm applies the inverse of the Hadamard matrix, $H^{-1}\hat{Z} = \frac{1}{2k}H^\top \hat{Z}$, which is the least squares estimator for $\theta$ given $\hat{Z}$.

Remark 1.
For simplicity, we assume that $2k$ divides $n$. Otherwise, when partitioning the arms in the first step we may repeat arms in the last subset. This increases the number of estimations by at most $2k$, and thus we replace $n$ with $n + 2k$ in the number of samples $m(\epsilon, \delta)$. Since $n > 2k$, this modification does not change the order of magnitude of the sample complexity and regret.

Algorithm 1: EST1($\mathcal{N}, k, \epsilon, \delta$)
  $n = |\mathcal{N}|$; $m(\epsilon, \delta) = \frac{2}{\epsilon^2}\log\frac{2n}{\delta}$
  Partition $\mathcal{N}$ into sets of size $2k$: $\mathcal{N}_1, \ldots, \mathcal{N}_{n/2k}$
  for $l = 1 \ldots \frac{n}{2k}$ do
    Let $\mathcal{N}_l = \{j_1, \ldots, j_{2k}\}$
    $S_{1,-1} = \{j_1, \ldots, j_k\}$, $S_{1,+1} = \{j_{k+1}, \ldots, j_{2k}\}$   ($i = 1$)
    $S_{i,b} = \{j \in \mathcal{N}_l \mid H_{ij} = b\}$   ($i = 2 \ldots 2k$, $b \in \{-1, +1\}$)
    for $i \in [2k]$, $b \in \{-1, +1\}$ do
      Sample $S_{i,b}$ for $m = m(\epsilon, \delta)$ times and observe rewards $r_1, \ldots, r_m$
      $\hat{\mu}_{i,b} = \frac{1}{m}\sum_t r_t$
    $\hat{Z}_1 = \hat{\mu}_{1,+1} + \hat{\mu}_{1,-1}$   ($i = 1$)
    $\hat{Z}_i = \hat{\mu}_{i,+1} - \hat{\mu}_{i,-1}$   ($i = 2 \ldots 2k$)
    $\hat{\theta}_{\mathcal{N}_l} = \frac{1}{2k} H^\top \hat{Z}$
  return $\hat{\theta}$

Remark 2. We assume that there exists a Hadamard matrix of size $2k$. Otherwise, let $2q \in \mathbb{N}$ be a multiple of $2k$ such that there exists a Hadamard matrix of size $2q$. Partition the arms into subsets of size $2q$ (instead of $2k$); then in each row the number of positive and negative entries is a multiple of $k$. Then, partition them into $q/k$ sets of size $k$, sample each one separately, and sum them to get $\hat{\mu}_{+1}$ and $\hat{\mu}_{-1}$. This modification changes the sample complexity and regret by at most a constant factor.

Lemma 1.
For any $\epsilon, \delta > 0$ and $k$, and any set of $n$ arms $\mathcal{N}$, EST1 returns an estimated reward vector $\hat{\theta}$ such that
$$Pr\big[\forall i,\ |\hat{\theta}_i - \theta_i| \leq \epsilon\big] \geq 1 - \delta.$$

Proof. We first prove that $\hat{\theta}$ is an unbiased estimator of the reward vector $\theta$. For simplicity, fix $\mathcal{N} = \{1, \ldots, 2k\}$ and write $\hat{\theta}$ instead of $\hat{\theta}_{\mathcal{N}}$. Note that for each subset $S$, the average $\hat{\mu}_S = \frac{1}{m}\sum_t r_t$ is an unbiased estimator of the set's reward, namely $E[\hat{\mu}_S] = \mu_S = \sum_{i \in S}\theta_i$. As a consequence, for each $i \neq 1$, $\hat{Z}_i$ satisfies
$$E[\hat{Z}_i] = \mu_{i,+1} - \mu_{i,-1} = \sum_{j \in S_{i,+1}}\theta_j - \sum_{j \in S_{i,-1}}\theta_j = \sum_{j=1}^{2k} H_{ij}\theta_j = H_i^\top \theta,$$
and the same holds for $i = 1$. Thus $\hat{Z}$ satisfies $E[\hat{Z}] = H\theta$, and $E[\hat{\theta}] = \frac{1}{2k}H^\top E[\hat{Z}] = \frac{1}{2k}H^\top H\theta = \theta$.

Fix some subset $S$ sampled by the algorithm; we prove that the estimation noise $\hat{\eta}_S = \hat{\mu}_S - \mu_S$ is $\frac{k}{m}$-subgaussian. By definition,
$$\hat{\mu}_S = \frac{1}{m}\sum_{t=1}^m r_t = \frac{1}{m}\sum_{t=1}^m \sum_{i \in S} X_i = \frac{1}{m}\sum_{t=1}^m \sum_{i \in S} (\theta_i + \eta_{it}) = \mu_S + \frac{1}{m}\sum_{t=1}^m \sum_{i \in S} \eta_{it}.$$
Since the noise terms $\eta_{it}$ are 1-subgaussian, and we sum over $k$ such terms in each $t$, the total estimation noise is $\frac{k}{m}$-subgaussian. Accordingly, the estimation noise of each $\hat{Z}_i$, given by $\eta_{Z_i} = \hat{\eta}_{i,+1} - \hat{\eta}_{i,-1}$, is $\frac{2k}{m}$-subgaussian. Finally, the estimation noise $\hat{\theta}_i - \theta_i = \frac{1}{2k}\sum_{j=1}^{2k} H_{ji}\eta_{Z_j}$ is also subgaussian, with parameter $\frac{1}{(2k)^2}\cdot 2k \cdot \frac{2k}{m} = \frac{1}{m}$. Thus by the Hoeffding inequality for subgaussian random variables,
$$Pr\big[|\hat{\theta}_i - E[\hat{\theta}_i]| \geq \epsilon\big] \leq 2\exp\Big(-\frac{\epsilon^2 m}{2}\Big) = \frac{\delta}{n},$$
where we substituted the number of samples $m(\epsilon, \delta)$. Finally, the probability of error in one parameter is at most $\frac{\delta}{n}$, and thus by the union bound the probability of error in one parameter or more is at most $\delta$.
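To make the scheme concrete, here is a hedged simulation of EST1 in Python. All names are ours rather than the paper's: `sample_sum` plays the role of the environment's full-bandit feedback (a noisy sum of $k$ arms), and `sylvester` supplies the Hadamard matrix of size $2k$; with $m$ samples per subset, each recovered mean is $\frac{1}{m}$-subgaussian, as in the proof above.

```python
import random

def sylvester(n):
    """Hadamard matrix of size n (a power of 2) via Sylvester's construction."""
    H = [[1]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-x for x in r] for r in H]
    return H

def est1(theta, k, m, rng):
    """Sketch of EST1: recover per-arm means from sum-only feedback.

    theta holds the true means; it is hidden from the estimator and used only
    to simulate feedback. Assumes 2k divides n.
    """
    n = len(theta)
    H = sylvester(2 * k)

    def sample_sum(S):
        # Full-bandit feedback: sum of the k chosen arms plus unit Gaussian noise.
        return sum(theta[j] + rng.gauss(0, 1) for j in S)

    theta_hat = [0.0] * n
    for start in range(0, n, 2 * k):
        block = list(range(start, start + 2 * k))
        Z = []
        for i in range(2 * k):
            if i == 0:
                # Row 0 is all +1: split arbitrarily; Z_0 estimates the block's total.
                S_pos, S_neg = block[:k], block[k:]
                sign = +1
            else:
                # Row i splits the block into k positive and k negative arms;
                # Z_i estimates the signed sum H_i . theta.
                S_pos = [block[j] for j in range(2 * k) if H[i][j] == 1]
                S_neg = [block[j] for j in range(2 * k) if H[i][j] == -1]
                sign = -1
            mu_pos = sum(sample_sum(S_pos) for _ in range(m)) / m
            mu_neg = sum(sample_sum(S_neg) for _ in range(m)) / m
            Z.append(mu_pos + sign * mu_neg)
        # Least squares via the Hadamard inverse: theta_hat = (1/2k) H^T Z.
        for j in range(2 * k):
            theta_hat[start + j] = sum(H[i][j] * Z[i] for i in range(2 * k)) / (2 * k)
    return theta_hat

rng = random.Random(0)
theta = [1.0, 0.8, 0.5, 0.2, 0.1, 0.0, -0.3, -0.5]
est = est1(theta, k=2, m=4000, rng=rng)
assert max(abs(e - t) for e, t in zip(est, theta)) < 0.15
```

Note that only sums of $k$ arms are ever observed, yet every individual mean is recovered; the $2k$ orthogonal rows give exactly the $2k$ independent linear measurements needed per block.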
Combinatorial Successive Accepts and Rejects (CSAR), isbased on Bubeck et al. (2013) for multiple arms identification. CSAR works in phases.In each phase t it maintains a decaying level of accuracy (cid:15) t and confidence δ t and usesEST1 to estimate the arms to a given level of accuracy and confidence. Then, it sorts thearms according to their estimations ˆ θ t ≥ ˆ θ t ≥ · · · ≥ ˆ θ tn , and accepts arms whose estimatedreward is bigger than ˆ θ tk +1 by at least 2 (cid:15) t , i.e., ˆ θ ti − ˆ θ tk +1 ≥ (cid:15) t , as they are optimal withhigh confidence. Similarly, it rejects arms whose estimated reward is smaller than ˆ θ tk by atleast 2 (cid:15) t . The algorithm proceeds until n − k arms are rejected. Algorithm 2:
Combinatorial Successive Accepts and Rejects (CSAR)
  $\mathcal{N}_1 = \mathcal{N}$; $\mathcal{A}_1 = \emptyset$; $\epsilon_1 = \frac{1}{2}$; $\delta_1 = \frac{6\delta}{\pi^2}$; $t = 1$
  while $|\mathcal{N}_t \cup \mathcal{A}_t| > k$ do
    $\hat{\theta}^t = $ EST1($\mathcal{N}_t, k, \epsilon_t, \delta_t$)
    Sort $\mathcal{N}_t \cup \mathcal{A}_t$ according to $\hat{\theta}^t$ such that $\hat{\theta}^t_1 \geq \hat{\theta}^t_2 \geq \cdots \geq \hat{\theta}^t_n$
    $\mathcal{A} = \{i \in \mathcal{N}_t \mid \hat{\theta}^t_i - \hat{\theta}^t_{k+1} > 2\epsilon_t\}$
    $\mathcal{R} = \{i \in \mathcal{N}_t \mid \hat{\theta}^t_k - \hat{\theta}^t_i > 2\epsilon_t\}$
    $\mathcal{A}_{t+1} = \mathcal{A}_t \cup \mathcal{A}$
    $\mathcal{N}_{t+1} = \mathcal{N}_t \setminus (\mathcal{A} \cup \mathcal{R})$
    $\epsilon_{t+1} = \frac{\epsilon_t}{2}$; $\delta_{t+1} = \frac{\delta_1}{(t+1)^2}$; $t = t + 1$
  return $\mathcal{A}_t \cup \mathcal{N}_t$

Lemma 2.
For any $\delta > 0$, CSAR with EST1 is $(0, \delta)$-PAC, i.e., it finds the optimal subset with probability at least $1 - \delta$.

Remark 3. One can easily modify CSAR to be $(\epsilon, \delta)$-PAC. For that, we provide the algorithm also with a level of accuracy $\epsilon$, and instead of stopping only when $k$ arms are left, we may stop earlier, when $\epsilon_t \leq \frac{\epsilon}{2k}$, and return the top $k$ arms according to the last estimation. It is not hard to show that the surviving arms are $2\epsilon_t$-close to the optimal arms, and therefore the output is at most $2k\epsilon_t = \epsilon$ far from the optimal subset.

3.3 Sample Complexity

In this section we bound CSAR's sample complexity in the following theorem.
Theorem 3.
For any $\delta > 0$, the total number of samples performed by CSAR with EST1 is at most
$$M = O\bigg(\sum_{i=1}^n \frac{1}{\Delta_i^2}\Big(\log\frac{n}{\delta} + \log\log\frac{1}{\Delta_i}\Big)\bigg). \quad (1)$$

Note that CSAR's sample complexity is comparable with the $O\big(\sum_{i=1}^n \frac{1}{\Delta_i^2}\log\frac{n}{\delta}\big)$ sample complexity of the original Successive Elimination algorithm for best arm identification (Even-Dar et al., 2006), and also with algorithms for multiple arms identification (Kalyanakrishnan and Stone, 2010; Kalyanakrishnan et al., 2012), although in these models the agent samples one arm in each round and not $k$ as in the combinatorial model.

To understand how this upper bound scales, consider the following rewards distribution: $X_i \sim Ber(\frac{1}{2} + \frac{\epsilon}{k})$ for $i \in [k]$ and $X_i \sim Ber(\frac{1}{2})$ otherwise. In this case, for all arms $\Delta_i = \frac{\epsilon}{k}$, and thus the number of samples is bounded by $M = O\big(\frac{nk^2}{\epsilon^2}\log\frac{n}{\delta}\big)$ (ignoring $\log\log$ terms).

To bound the sample complexity, we first prove the following lemma, which bounds the cumulative number of times $M_i$ each arm is sampled until it is accepted or rejected. The theorem follows immediately by summing $M_i$ over all arms and dividing by $k$, since each subset sampled by the algorithm consists of $k$ arms.

Lemma 4.
For each arm $i \in [n]$, the number of times it is sampled until it is rejected (if it is sub-optimal) or accepted (if it is optimal) is bounded by
$$M_i \leq \frac{Ck}{\Delta_i^2}\Big(\log\frac{2n}{\delta} + 2\log\log\frac{1}{\Delta_i}\Big).$$

Proof. Let $i$ be an arm, and let $T_i$ be the phase at which it is accepted or rejected. In every phase $t \leq T_i$, arm $i$ is sampled as part of $2k$ subsets, and each subset is sampled $m(\epsilon_t, \delta_t)$ times, where $\epsilon_t = 2^{-t}$ and $\delta_t = \frac{6\delta}{\pi^2 t^2}$. Thus we have
$$M_i = \sum_{t=1}^{T_i} 2k\, m(\epsilon_t, \delta_t) = \sum_{t=1}^{T_i} \frac{4k}{\epsilon_t^2}\log\frac{2n}{\delta_t} = 4k\sum_{t=1}^{T_i} 4^t \log\frac{\pi^2 t^2 n}{3\delta} = 4k\Big(2\sum_{t=1}^{T_i} 4^t \log t + \log\frac{\pi^2 n}{3\delta}\sum_{t=1}^{T_i} 4^t\Big) \leq Ck\Big(\log T_i + \log\frac{2n}{\delta}\Big)\cdot 4^{T_i} \quad (2)$$
for an appropriate constant $C$. We now bound the phase $T_i$ at which $i$ is rejected. We discuss the case that $i$ is sub-optimal; the analysis for optimal arms is similar. Assuming all arms are estimated accurately (see Appendix A.1), for any phase $t$ and any arm $i$ we have $|\hat{\theta}^t_i - \theta_i| \leq \epsilon_t$. That also implies that the difference between the real $k$-th arm and the arm estimated to be in the $k$-th place satisfies $|\hat{\theta}^t_k - \theta_k| \leq \epsilon_t$, since mixing the order of the arms can happen only between arms that are within the same $\epsilon_t$-neighborhood. As long as $i$ was not rejected, i.e., for all $t = 1, \ldots, T_i - 1$, it holds that
$$2\epsilon_t \geq \hat{\theta}^t_k - \hat{\theta}^t_i \geq (\theta_k - \epsilon_t) - (\theta_i + \epsilon_t) = (\theta_k - \theta_i) - 2\epsilon_t = \Delta_i - 2\epsilon_t.$$
Substituting $\epsilon_t = 2^{-t}$, we get $\Delta_i \leq 4\epsilon_t = 4\cdot 2^{-t}$. This is true also for $t = T_i - 1$, and thus we get $T_i \leq \log_2\frac{8}{\Delta_i}$. Substituting $T_i$ in (2) yields the desired bound.

We now analyze the regret. Notice that while CSAR aims to minimize the sample complexity, it does not minimize the regret. This is because each time the algorithm chooses a subset, the regret it incurs is affected not only by the arms it selected, but also by the arms it did not select. In other words, the gap that should be considered is between the sub-optimal arm $i \in \{k+1, \ldots, n\}$ that was actually selected and the optimal arm $j \in \{1, \ldots, k\}$ that would have been selected instead. We denote this gap by $\Delta_{j:i} = \theta_j - \theta_i$. Using this notation, we may bound the regret of the algorithm.

Theorem 5.
For any $n$, $k \leq n$ and $T$, the regret of CSAR with EST1 is at most
$$R = O\bigg(\sum_{i=k+1}^n \frac{\Delta_{1:i}}{\Delta_i^2}\, k \log T\bigg).$$

Note that this bound is tight for CSAR with EST1. Consider the problem instance where each arm $i \in [n]$ is associated with a normal random variable $X_i \sim N(\theta_i, 1)$, with
$$\theta_i = \begin{cases} \Delta^+ & i < k \\ 0 & i = k \\ -\Delta^- & i > k \end{cases} \quad (3)$$
and assume $\Delta^+ \gg \Delta^-$. On this problem instance, CSAR will accept the first $k - 1$ arms after a small number of phases, while each of the remaining $n - k + 1$ arms is sampled $\frac{k}{\epsilon_t^2}\log\frac{n}{\delta_t}$ times, and it keeps being sampled until $2\epsilon_t < \Delta^-$. Therefore, the total regret of the algorithm is $\Theta\big(\frac{\Delta^+}{(\Delta^-)^2}(n-k)k\log\frac{n}{\delta}\big)$. In the following section we discuss a modification of the algorithm that helps achieve smaller regret.

The reason for the $\Delta_{1:i}$ factors in Theorem 5 is that when we identify an arm as optimal, we stop sampling it, and thus suffer regret for its absence. Instead, we consider the following modification of the algorithm in order to improve the regret. When we accept an arm, instead of preventing it from being sampled, we fix it. Namely, we sample it in every subset until the end of the run. This assures that we suffer large gaps only for a small number of rounds.

Accordingly, we modify the estimation algorithm to support fixed arms. Now, the algorithm gets as input also a set $\mathcal{A}$ of accepted arms that must be sampled in each subset. Instead of using the Hadamard matrix of size $2k$, it takes a smaller one of size $2k'$, where $k' = k - |\mathcal{A}|$ is the number of arms that can be sampled in each subset after keeping room for the fixed arms. Most of the algorithm remains the same, except for the need to have good estimates of the fixed arms' expected rewards. This is because it needs to eliminate those rewards from the sampled subsets and stay only with the arms that should be estimated. For that, we provide it with a set $\mathcal{T}$ of the top $2k$ arms, according to the last phase's estimates, and run EST1 separately on them.

Algorithm 3:
EST2($\mathcal{N}, k, \epsilon, \delta, \mathcal{A}, \mathcal{T}$)
  $n = |\mathcal{N}|$; $k' = k - |\mathcal{A}|$; $m(\epsilon, \delta, k') = \frac{2k}{k'\epsilon^2}\log\frac{2n}{\delta}$
  $(\hat{\theta}_a)_{a \in \mathcal{T}} = $ EST1($\mathcal{T}, k, \epsilon, \delta$)
  Partition $\mathcal{N}$ into sets of size $2k'$: $\mathcal{N}_1, \ldots, \mathcal{N}_{n/2k'}$
  for $l = 1 \ldots \frac{n}{2k'}$ do
    Let $\mathcal{N}_l = \{j_1, \ldots, j_{2k'}\}$
    $S'_{1,-1} = \{j_1, \ldots, j_{k'}\}$, $S'_{1,+1} = \{j_{k'+1}, \ldots, j_{2k'}\}$   ($i = 1$)
    $S'_{i,b} = \{j \in \mathcal{N}_l \mid H_{ij} = b\}$   ($i = 2 \ldots 2k'$, $b \in \{-1, +1\}$)
    for $i \in [2k']$, $b \in \{-1, +1\}$ do
      $S_{i,b} = S'_{i,b} \cup \mathcal{A}$
      Sample $S_{i,b}$ for $m = m(\epsilon, \delta, k')$ times and observe rewards $r_1, \ldots, r_m$
      $\hat{\mu}_{i,b} = \frac{1}{m}\sum_t r_t$
    $\hat{Z}_1 = \hat{\mu}_{1,+1} + \hat{\mu}_{1,-1} - 2\sum_{a \in \mathcal{A}}\hat{\theta}_a$   ($i = 1$)
    $\hat{Z}_i = \hat{\mu}_{i,+1} - \hat{\mu}_{i,-1}$   ($i = 2 \ldots 2k'$)
    $\hat{\theta}_{\mathcal{N}_l} = \frac{1}{2k'} H^\top \hat{Z}$
  return $\hat{\theta}$

Theorem 6.
For any $n, k$ and time horizon $T$, the regret of CSAR with EST2 is at most
$$R = O\bigg(\Big(\sum_{i=1}^k \frac{\Delta_{i:(k+i)}}{\Delta^2} + \sum_{i=k+1}^n \frac{1}{\Delta_i}\Big) k \log T\bigg). \quad (4)$$

Note that this is an improvement over CSAR with EST1. For example, on problem (3), the first $k - 1$ arms will be fixed after a small number of rounds, and the regret will be
$$R = \Theta\bigg(\Big(\frac{k\Delta^+}{(\Delta^-)^2} + \frac{n-k}{\Delta^-}\Big) k \log T\bigg),$$
which is better than $\Theta\big(\frac{\Delta^+}{(\Delta^-)^2}(n-k)k\log T\big)$ as long as $\frac{\Delta^+}{\Delta^-} \gtrsim \frac{k}{n}$.

To prove the upper bound on the regret of CSAR with EST2, we first prove the following lemma, which bounds the regret caused by each sub-optimal arm.

Lemma 7.
The regret caused by any sub-optimal arm $i$ is at most
$$R_i = O\Big(\frac{1}{\Delta_i}\, k \log\frac{n}{\delta}\Big).$$

Proof. Since all expressions in this proof carry a factor of $Ck\log\frac{n}{\delta}$, we omit it along the proof and multiply by it at the end. Fix a sub-optimal arm $i$. By Lemma 4, the number of times $i$ is chosen until it is rejected is at most $M_i \leq \frac{1}{\Delta_i^2}$ (in these units). We split the optimal arms $\{1, \ldots, k\}$ according to $\Delta_i$, and bound separately the regret $R_i^{\leq}$ for the arms $j \in [k]$ with $\Delta_j \leq \Delta_i$ and the regret $R_i^{>}$ for the arms with $\Delta_j > \Delta_i$.

• For any $j \in [k]$ such that $\Delta_j \leq \Delta_i$, the maximal gap we pay for taking arm $i$ instead of arm $j$ is at most $\Delta_{j:i} \leq 2\Delta_i$, and thus the regret of such a case is bounded by $R_i^{\leq} \leq M_i \cdot 2\Delta_i \leq \frac{2}{\Delta_i}$.

• For any $j \in [k]$ such that $\Delta_j > \Delta_i$, arm $j$ is accepted at some point before arm $i$ is rejected; thus, from some point on, we can be sure that arm $i$ is not played instead of arm $j$. Let $l = \arg\min_{j : \Delta_j > \Delta_i} \Delta_j$. We showed that each such optimal arm $j$ is accepted by phase $T_j$, at which point it was sampled $M_j \leq \frac{1}{\Delta_j^2}$ times; thus we can write the regret of arm $i$ against these arms as
$$R_i^{>} \leq M_1\Delta_{1:i} + (M_2 - M_1)\Delta_{2:i} + \cdots + (M_l - M_{l-1})\Delta_{l:i} = \sum_{j=1}^{l-1} M_j(\Delta_{j:i} - \Delta_{(j+1):i}) + M_l\Delta_{l:i} \leq \sum_{j=1}^{l-1}\frac{\Delta_j - \Delta_{j+1}}{\Delta_j^2} + \frac{\Delta_l + \Delta_i}{\Delta_l^2} \leq \int_{\Delta_l}^{\Delta_1}\frac{1}{x^2}\,dx + \frac{2}{\Delta_l} = \Big(\frac{1}{\Delta_l} - \frac{1}{\Delta_1}\Big) + \frac{2}{\Delta_l} \leq \frac{3}{\Delta_l} \leq \frac{3}{\Delta_i}.$$

To sum up, arm $i$'s contribution to the regret is $R_i = R_i^{\leq} + R_i^{>} \leq \frac{5}{\Delta_i}$, multiplied by $Ck\log\frac{n}{\delta}$.

Theorem 6 is implied by Lemma 7 by summing $R_i$ over all sub-optimal arms, in addition to the regret accumulated by estimating the top $2k$ arms in each phase until the end of the run. Each of them is sampled $\frac{ck}{\Delta^2}\log\frac{n}{\delta}$ times for some $c > 0$. As they are the top $2k$ arms with high probability, the worst subset that can be sampled is $\{k+1, \ldots, 2k\}$, and the gap between it and the optimal subset is $\sum_{i=1}^k \Delta_{i:(k+i)}$. Thus their regret is at most $\big(\sum_{i=1}^k \Delta_{i:(k+i)}\big)\frac{ck}{\Delta^2}\log\frac{n}{\delta}$, and together with $\sum_{i=k+1}^n R_i$ we get Theorem 6.

Corollary 8.
The distribution-independent regret is at most $O\big(k\sqrt{nT\log T}\big)$.

CSAR's distribution-independent regret is bigger by a factor of $\sqrt{k}$ than the $\Omega(\sqrt{knT})$ lower bound for semi-bandits in Lattimore et al. (2018) (ignoring log terms). In many cases it is reasonable to assume $k = O(1)$, which makes the bounds tight. As for the dependence on $k$, we leave the search for tighter bounds to further research.

Assuming all gaps are equal to $\Delta$, the regret in (4) can be written as $R = O\big(\frac{nk}{\Delta}\log T\big)$. We prove that this is tight for CSAR.

Lemma 9.
For any $n, k$ and time horizon $T$, there exists a distribution over the assignment of rewards such that the regret of CSAR with EST2 is at least $R = \Omega\big(\frac{nk}{\Delta}\log T\big)$.

Proof. Consider the following example. Each arm $i \in [n]$ is associated with a Gaussian random variable $X_i$, where $X_i \sim N(\Delta, 1)$ if $i \leq k$ and $X_i \sim N(0, 1)$ otherwise. Similarly to Lemma 4, the best arms will be identified only when $\Delta > 4\epsilon_t$, which implies that the number of phases is $\Omega(\log\frac{1}{\Delta})$; and since no arm is accepted or rejected until this phase, the total number of samples is $\Omega\big(\frac{n}{\Delta^2}\log\frac{n}{\delta}\big)$. Additionally, each subset has a gap of up to $k\Delta$. Thus, the total regret is $R = \Omega\big(k\Delta\cdot\frac{n}{\Delta^2}\log\frac{n}{\delta}\big) = \Omega\big(\frac{nk}{\Delta}\log\frac{n}{\delta}\big)$, which proves that the regret upper bound is tight.

The following table summarizes the theoretical bounds of CSAR in comparison to other algorithms for top-k combinatorial bandits with full-bandit feedback.

Algorithm                    | Sample Complexity                                  | Dep. Regret                       | Indep. Regret
CSAR                         | $O(\frac{n}{\Delta^2}\log\frac{n}{\delta})$        | $O(\frac{nk}{\Delta}\log T)$      | $O(k\sqrt{nT})$
Agarwal and Aggarwal (2018)  | –                                                  | –                                 | $O(k^{2/3}n^{1/3}T^{2/3})$
Dani et al. (2008)           | –                                                  | $O(\frac{n^2}{\Delta}\log^3 T)$   | $O(n\sqrt{nT})$
Kuroki et al. (2019)         | $O(\frac{n^2 k^2}{\Delta^2}\log\frac{n}{\delta})$  | –                                 | –
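The accept/reject loop of Algorithm 2 can also be exercised end-to-end in simulation. The sketch below (all names and constants are ours) replaces EST1 by an idealized oracle that returns every active arm's mean to within $\pm\epsilon_t$, which is exactly the guarantee Lemma 1 gives with high probability; acceptance and rejection use the algorithm's $2\epsilon_t$ margins.

```python
import random

def csar(theta, k, eps0=0.5, seed=1):
    """Sketch of the CSAR accept/reject loop with an idealized estimator.

    Each phase, every active arm's mean is estimated to within +/- eps
    (the guarantee EST1 provides with probability >= 1 - delta_t), and arms
    are accepted/rejected using the 2*eps margins of Algorithm 2.
    """
    rng = random.Random(seed)
    active = list(range(len(theta)))
    accepted = []
    eps = eps0
    while len(active) + len(accepted) > k:
        # Idealized oracle standing in for EST1: estimation error bounded by eps.
        est = {i: theta[i] + rng.uniform(-eps, eps) for i in active}
        slots = k - len(accepted)          # optimal positions still to fill
        order = sorted(active, key=lambda i: -est[i])
        # Within the active arms, the slots-th and (slots+1)-th estimates play
        # the roles of theta_hat_k and theta_hat_{k+1} in Algorithm 2.
        top, bottom = est[order[slots - 1]], est[order[slots]]
        new_acc = [i for i in active if est[i] - bottom > 2 * eps]
        rejected = {i for i in active if top - est[i] > 2 * eps}
        accepted += new_acc
        if len(accepted) == k:
            active = []                    # remaining arms are sub-optimal
        else:
            active = [i for i in active
                      if i not in new_acc and i not in rejected]
        eps /= 2
    return sorted(accepted + active)

theta = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1, 0.0, -0.2]
assert csar(theta, k=3) == [0, 1, 2]
```

Because the oracle's error is bounded by $\epsilon_t$, an arm is accepted (resp. rejected) only if its true mean really places it inside (resp. outside) the top $k$, mirroring the correctness argument of Lemma 2; the number of phases an arm survives is governed by its gap, as in Lemma 4.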
4. Lower Bound
In this section we bound the minimal number of samples necessary to identify the best subset under full-bandit feedback.

One might wonder whether the $\Omega\!\left(\frac{n}{\epsilon^2}\log\frac{1}{\delta}\right)$ lower bound for best arm identification (Mannor and Tsitsiklis, 2004), or the $\Omega\!\left(\frac{n}{\epsilon^2}\log\frac{k}{\delta}\right)$ lower bound for multiple arms identification (Kaufmann and Kalyanakrishnan, 2013; Kalyanakrishnan et al., 2012; Chen et al., 2014), applies also to combinatorial bandits. The answer is not immediate. Intuitively, sampling $k$ arms together might provide more information, so that hypothetically fewer samples would suffice to find the best subset. For example, if the goal is to detect an unknown number of counterfeit coins out of $n$ coins, and the agent is allowed to weigh any number of coins, then there exists an algorithm that identifies the counterfeit coins using only $\Theta\!\left(\frac{n}{\log n}\right)$ weighings, with or without the presence of noise (Erdős and Rényi, 1963; Söderberg and Shapiro, 1963; Bshouty, 2012).

Despite the discussion above, the following theorem proves a lower bound of $\Omega\!\left(\frac{n}{\epsilon^2}\right)$ samples for combinatorial bandits, similar to the bounds for the best- and multiple-arms identification tasks.

Theorem 10.
For any $n$ and $k \le \frac{n}{2}$, and for any $0 < \epsilon, \delta < \frac14$, there exists a distribution over the assignment of rewards such that the sample complexity of any $(\epsilon,\delta)$-PAC algorithm is at least $M = \Omega\!\left(\frac{n}{\epsilon^2}\right)$.

The proof is based on Slivkins (2019, chap. 2), but generalized to the combinatorial setting. It defines two problem instances with a small KL-divergence between them, and shows that any algorithm that uses fewer samples than required is wrong with high probability.
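The instance-distinguishing argument rests on the fact that the KL divergence between two Bernoulli distributions separated by $2x$ around $\frac12$ is of order $x^2$. The following sketch (our illustration, not part of the paper) verifies this numerically:

```python
import math

def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# KL(Ber(1/2 + x) || Ber(1/2 - x)) = 2x * log((1/2 + x)/(1/2 - x)) = 8x^2 + O(x^4),
# so telling the two instances apart with constant confidence needs ~1/x^2 samples.
for x in [0.001, 0.01, 0.05, 0.1]:
    ratio = kl_bernoulli(0.5 + x, 0.5 - x) / x ** 2
    print(f"x = {x}: KL / x^2 = {ratio:.3f}")
```

The ratio stays close to 8 across the range, which is the quadratic behavior used (with explicit constants) in the proofs of Appendix A.5.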
1. See Appendix B for elaboration on this bound.
5. Experiments
We also compared our algorithm to other methods experimentally, on simulated data. We conducted two experiments, one for the sample complexity and one for the regret. We describe the experiments briefly here; for more details see Appendix C.

For the sample complexity, we evaluate the accuracy of different sampling methods in comparison to EST1. Figure 1(a) shows the mean square error of EST1 with Hadamard matrices, along with two other sampling methods. It can be seen that Hadamard significantly outperforms the others. For the regret, we compared CSAR with the Sort & Merge algorithm of Agarwal and Aggarwal (2018). Figure 1(b) shows the cumulative regret as a function of time for both algorithms. It can be seen that CSAR achieves significantly lower regret than Sort & Merge.
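To give a feel for the Hadamard sampling idea behind EST1, here is a minimal sketch (ours, not the paper's code). It assumes we may observe a noisy sum for each $\pm 1$-partition of the arms induced by a Hadamard row, ignoring the size-$k$ constraint and the special handling of the all-ones row; since $HH^\intercal = nI$, the per-arm means are recovered by a single matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)

def sylvester_hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n, m, sigma = 8, 2000, 1.0        # arms, repetitions per row, noise level
theta = rng.uniform(0, 1, n)      # hidden per-arm expected rewards
H = sylvester_hadamard(n)

# Each row splits the arms into a +1 subset and a -1 subset; under full-bandit
# feedback we only observe noisy sums of the played subsets.
Z = np.zeros(n)
for _ in range(m):
    for i in range(n):
        plus = theta[H[i] == 1].sum() + sigma * rng.standard_normal()
        minus = theta[H[i] == -1].sum() + sigma * rng.standard_normal()
        Z[i] += plus - minus      # E[plus - minus] = H[i] @ theta
Z /= m

# H @ H.T = n * I, so inverting the linear system is one multiplication.
theta_hat = H.T @ Z / n
print(np.max(np.abs(theta_hat - theta)))   # shrinks like 1/sqrt(n * m)
```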
6. Discussion and Conclusions
In this work we proposed a novel algorithm for top-k combinatorial bandits with full-bandit feedback. We presented the Combinatorial Successive Accepts and Rejects (CSAR) algorithm, and showed that it is $(0,\delta)$-PAC with sample complexity $O\!\left(\frac{n}{\Delta^2}\log\frac{n}{\delta}\right)$ and regret $O\!\left(\frac{nk}{\Delta}\log T\right)$ for time horizon $T$. For the sample complexity, we also proved a lower bound of $\Omega\!\left(\frac{n}{\epsilon^2}\right)$. To the best of our knowledge, this is the first lower bound on the sample complexity of combinatorial bandits with full-bandit feedback.

In addition, we tested our results empirically. First, we evaluated three sampling methods and showed that our novel method using Hadamard matrices achieves higher accuracy with fewer samples than the baselines. Second, we compared the cumulative regret to Agarwal and Aggarwal (2018), and showed that CSAR outperforms the latter.

Acknowledgments
This work was supported in part by the Yandex Initiative in Machine Learning.

References
Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

Mridul Agarwal and Vaneet Aggarwal. Regret bounds for stochastic combinatorial multi-armed bandits with linear space complexity. arXiv preprint arXiv:1811.11925, 2018.

Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2013.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Baruch Awerbuch and Robert D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 45–53. ACM, 2004.

Nader H. Bshouty. On the coin weighing problem with the presence of noise. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 471–482. Springer, 2012.

Séb​astien Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits. In International Conference on Machine Learning, pages 258–265, 2013.

Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In International Conference on Algorithmic Learning Theory, pages 23–37. Springer, 2009.

Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

Shouyuan Chen, Tian Lin, Irwin King, Michael R. Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pages 379–387, 2014.

Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159, 2013.

Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, and Pinyan Lu. Combinatorial multi-armed bandit with general reward functions. In Advances in Neural Information Processing Systems, pages 1659–1667, 2016.

Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.

Alon Cohen, Tamir Hazan, and Tomer Koren. Tight bounds for bandit combinatorial optimization. arXiv preprint arXiv:1702.07539, 2017.

Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.

Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.

Dragomir Ž. Đoković. Hadamard matrices of order 764 exist. Combinatorica, 28(4):487–489, 2008.

Paul Erdős and Alfréd Rényi. On two problems of information theory. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8:229–243, 1963.

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.

Kathy J. Horadam. Hadamard Matrices and Their Applications. Princeton University Press, 2012.

Shivaram Kalyanakrishnan and Peter Stone. Efficient selection of multiple bandit arms: Theory and practice. In ICML, volume 10, pages 511–518, 2010.

Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning, volume 12, pages 655–662, 2012.

Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, pages 228–251, 2013.

Yuko Kuroki, Liyuan Xu, Atsushi Miyauchi, Junya Honda, and Masashi Sugiyama. Polynomial-time algorithms for combinatorial pure exploration with full-bandit feedback. arXiv preprint arXiv:1902.10582, 2019.

Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In AISTATS, 2015.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvari. TopRank: A practical algorithm for online stochastic ranking. In Advances in Neural Information Processing Systems, pages 3945–3954, 2018.

Tian Lin, Bruno Abrahao, Robert Kleinberg, John Lui, and Wei Chen. Combinatorial partial monitoring game with linear feedback and its applications. In International Conference on Machine Learning, pages 901–909, 2014.

Shie Mannor and John N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.

Nadav Merlis and Shie Mannor. Batch-size independent regret bounds for the combinatorial multi-armed bandit problem. In Conference on Learning Theory, 2019.

Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Aleksandrs Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, pages 18–27, 2019.

Staffan Söderberg and Harold S. Shapiro. A combinatory detection problem. The American Mathematical Monthly, 70(10):1066–1070, 1963.

Mohammad Sadegh Talebi, Zhenhua Zou, Richard Combes, Alexandre Proutiere, and Mikael Johansson. Stochastic online shortest path routing: The value of feedback. IEEE Transactions on Automatic Control, 63(4):915–930, 2017.

Leslie G. Valiant. A theory of the learnable. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pages 436–445. ACM, 1984.

Yuan Zhou, Xi Chen, and Jian Li. Optimal PAC multiple arm identification with applications to crowdsourcing. In International Conference on Machine Learning, pages 217–225, 2014.

Appendix A. Proofs
A.1 Proof of Lemma 2 (PAC)

Proof
For each phase $t$, define the event $E_t$ that at least one arm is estimated poorly, i.e., $E_t = \{\exists i : |\hat\theta_i^t - \theta_i^t| > \epsilon_t\}$, and let $E = \bigcup_t E_t$. Note that CSAR is wrong only if at some phase it rejects an optimal arm or accepts a sub-optimal arm. This might happen only under the event $E$. Hence, the probability that CSAR is wrong is bounded by the probability of $E$. By Lemma 1, for any $t$, $\Pr[E_t] \le \delta_t$, and thus by the union bound
$$\Pr[E] \le \sum_t \Pr[E_t] \le \sum_{t=1}^{T_0} \delta_t \le \frac{6\delta}{\pi^2}\sum_{t=1}^{\infty}\frac{1}{t^2} = \frac{6\delta}{\pi^2}\cdot\frac{\pi^2}{6} = \delta$$
Accordingly, the algorithm returns the optimal subset with probability at least $1-\delta$.

A.2 Proof of EST2 Correctness

Lemma 11.
For any $\epsilon, \delta > 0$ and any set of $n$ arms $N$ and set of accepted arms $\mathcal{A}$, EST2 returns an estimated reward vector $\hat\theta$ such that $\Pr\left[\forall i,\ |\hat\theta_i - \theta_i| \le \epsilon\right] \ge 1 - \delta$.

Proof
The proof is similar to Lemma 1, so we only stress the differences. First, when proving that $\hat\theta$ is an unbiased estimator of $\theta$, we have $S = S' \cup \mathcal{A}$, hence for each $i \ne 1$,
$$E[\hat Z_i] = \mu_{i,+1} - \mu_{i,-1} = \sum_{j \in S'_{i,+1}} \theta_j + \sum_{j \in \mathcal{A}} \theta_j - \Big(\sum_{j \in S'_{i,-1}} \theta_j + \sum_{j \in \mathcal{A}} \theta_j\Big) = \sum_{j \in S'_{i,+1}} \theta_j - \sum_{j \in S'_{i,-1}} \theta_j = \sum_{j=1}^{k'} H_{ij}\theta_j = H_i^\intercal\theta$$
and for $i = 1$,
$$E[\hat Z_1] = \mu_{1,-1} + \mu_{1,+1} - 2\sum_{j \in \mathcal{A}} \theta_j = \sum_{j \in S'_{1,+1}} \theta_j + \sum_{j \in \mathcal{A}} \theta_j + \sum_{j \in S'_{1,-1}} \theta_j + \sum_{j \in \mathcal{A}} \theta_j - 2\sum_{j \in \mathcal{A}} \theta_j = \sum_{j \in S'_{1,+1}} \theta_j + \sum_{j \in S'_{1,-1}} \theta_j = \sum_{j=1}^{k'} H_{1j}\theta_j = H_1^\intercal\theta$$
Second, we prove that each $\eta_{\hat Z_i}$ is subgaussian. For any $i \ne 1$ the proof remains the same. For $i = 1$,
$$\hat Z_1 = \hat\mu_{1,-1} + \hat\mu_{1,+1} - 2\sum_{j \in \mathcal{A}} \hat\theta_j \quad (5)$$
Thus the noise consists of $\eta_{1,-1}$ and $\eta_{1,+1}$, which are $\frac{k}{m}$-subgaussian, and the noise of each $\hat\theta_j$. We proved in Lemma 1 that the latter is $\frac{1}{m}$-subgaussian, and therefore, summing at most $k$ such terms and multiplying by 2, we get that the last term in (5) is $\frac{4k}{m}$-subgaussian. Summing all terms, we get that $\eta_{\hat Z_1}$ is $O\!\left(\frac{k}{m}\right)$-subgaussian.

Finally, note that the estimation noise of $\hat\theta$, given by $\hat\theta_i - \theta_i = \frac{1}{k'}\sum_{j=1}^{k'} H_{ij}\eta_{Z_j}$, is $O\!\left(\frac{k}{k'm}\right)$-subgaussian. Thus for $m = O\!\left(\frac{k}{k'\epsilon^2}\log\frac{n}{\delta}\right)$ it holds that $\Pr\left[|\hat\theta_i - E[\hat\theta_i]| \ge \epsilon\right] \le \frac{\delta}{n}$, and by the union bound the total probability of error is at most $\delta$.

A.3 Proof of Theorem 5 (Regret of CSAR with EST1)

Proof
Since only sub-optimal arms are responsible for regret, we consider only them. For any sub-optimal arm $i$, its maximal gap is $\Delta_i$. Hence, the total regret is given by $R \le \sum_{i=k+1}^{n} M_i \Delta_i$, where $M_i$ is the number of times arm $i$ is sampled. In order to translate this bound into terms of the time horizon $T$, recall that if the algorithm goes wrong, it might suffer a regret of $kT$, and this happens with probability $\delta$. To avoid it, we take $\delta = \frac{1}{kT}$. Using Lemma 4 to bound $M_i$, we get the desired regret bound.

A.4 Proof of Corollary 8 (Distribution-Independent Regret)

Proof
Consider the $(\epsilon,\delta)$-PAC variant of CSAR that stops exploring when $\epsilon_t \le \epsilon$ and then keeps selecting the best $k$ estimated arms for the rest of the time horizon $T$ (see Remark 3). Note that when the exploration stops, the arms' estimates are at most $\epsilon$ far from their real values, according to Lemma 1. Hence, the gap between the optimal subset and any subset of surviving arms is at most $2k\epsilon$, and thus their regret is at most $R_{<} \le 2k\epsilon T$. This should be added to the regret caused by the arms that were eliminated up to this stage. According to Lemma 7, the contribution of a sub-optimal arm $i$ to the regret is bounded by $\frac{k}{\Delta_i}\log\frac{n}{\delta}$. Since it was eliminated before phase $t$, it must hold that $\Delta_i > \epsilon$. The number of eliminated arms is clearly bounded by $n$, and thus their contribution to the regret is
$$R_{>} \le C \sum_{i:\Delta_i>\epsilon} \frac{k}{\Delta_i}\log T \le C \sum_{i:\Delta_i>\epsilon} \frac{k}{\epsilon}\log T \le C\,\frac{nk}{\epsilon}\log T$$
for some constant $C$. Combining both parts of the regret, we get $R \le C\frac{nk}{\epsilon}\log T + 2\epsilon k T$. This holds as long as the number of rounds is at least the number of samples taken by CSAR up to phase $t$, that is, $T \ge C'\frac{n}{\epsilon^2}\log\frac{n}{\delta}$ for some constant $C'$. Thus for $\epsilon = \sqrt{\frac{C' n}{T}\log\frac{n}{\delta}}$ and $\delta = \frac{1}{kT}$ we get the desired regret bound.

A.5 Proof of Theorem 10 (Lower Bound)

We prove the lower bound in a few steps. We first prove a bound of $\Omega\!\left(\frac{k}{\epsilon^2}\right)$ for $k \le \frac{n}{2}$, then prove a stronger bound of $\Omega\!\left(\frac{n}{\epsilon^2}\right)$ for smaller $k$, and finally combine both proofs to get the desired bound.

Lower bound for $k \le \frac{n}{2}$

To prove the lower bound for $k \le \frac{n}{2}$, we first define the following profiles with $n = 2k$ arms:
$$\mathcal{I}_1 = \{X_i \sim \mathrm{Ber}(p_i)\}_{i=1}^{2k} \quad \text{where} \quad p_i = \begin{cases} \frac12 + \frac{\epsilon}{k} & i = 1, \dots, k \\ \frac12 - \frac{\epsilon}{k} & i = k+1, \dots, 2k \end{cases}$$
$$\mathcal{I}_2 = \{X_i \sim \mathrm{Ber}(p_i)\}_{i=1}^{2k} \quad \text{where} \quad p_i = \begin{cases} \frac12 - \frac{\epsilon}{k} & i = 1, \dots, k \\ \frac12 + \frac{\epsilon}{k} & i = k+1, \dots, 2k \end{cases} \quad (6)$$
In what follows we assume that $i \in \{1, 2\}$ is selected uniformly at random and the agent plays against profile $\mathcal{I}_i$ without knowing the value of $i$.

Lemma 11.1.
Any algorithm $A$ that runs on problem (6) and selects any subset $S \subset [2k]$ of size $k$ can be simulated by an algorithm $A'$ that selects only $K_1 = \{1, \dots, k\}$ and $K_2 = \{k+1, \dots, 2k\}$, with the same number of samples.

Proof
Fix an algorithm $A$ and a subset $S$. Let $(S_1, S_2)$ be a partition of $S$, i.e., $S_1 \cap S_2 = \emptyset$ and $S_1 \cup S_2 = S$, such that $S_1 \subseteq K_1$ and $S_2 \subseteq K_2$. Assume without loss of generality that $|S_1| = s_1$, $|S_2| = s_2$ and $s_1 \ge s_2$. Then, there are at least $s_2$ arms in $S_1$ with mean $\frac12 + \frac{\epsilon}{k}$ and $s_2$ arms in $S_2$ with mean $\frac12 - \frac{\epsilon}{k}$. As we observe only the sum of the rewards, whose expectation over these arms is $s_2\left(\frac12 + \frac{\epsilon}{k}\right) + s_2\left(\frac12 - \frac{\epsilon}{k}\right) = s_2$, we may simulate these $2s_2$ arms with the same number of fair coins with probability $\frac12$.

We now show how to simulate the distribution of the remaining $s = s_1 - s_2 \le k$ arms in $S_1 \subseteq K_1$ using one sample of $K_1$. Sample $K_1$ once and let $r$ be the outcome. Create a binary vector of size $k$ with $r$ ones in random positions. Since all arms in $K_1$ are identically distributed, this vector represents the outcome of the individual arms in the subset, up to some permutation between the arms. Then, select $s$ random entries from the vector and return their sum. This procedure simulates exactly the distribution of $s$ arms in $K_1$ given that the sum of $K_1$ is $r$.

We now bound the sample complexity for $k \le \frac{n}{2}$.

Lemma 11.2.
For any $n$ and $k \le \frac{n}{2}$, and for any $\epsilon > 0$, there exists a reward distribution such that the sample complexity of any $(\epsilon,\delta)$-PAC algorithm is at least $M = \Omega\!\left(\frac{k}{\epsilon^2}\right)$.

Proof
We begin with $k = \frac{n}{2}$. Consider problem (6); we show that any algorithm has to use at least $T \ge \frac{ck}{\epsilon^2}$ samples (for some constant $c > 0$). Assume by contradiction that some algorithm uses $T \le \frac{ck}{\epsilon^2}$ samples and returns a subset $S_T$ such that for $i = 1, 2$,
$$P_i[S_T = K_i] = \Pr[S_T = K_i \mid \mathcal{I}_i] \ge \frac34 \quad (7)$$
By Lemma 11.1, it is enough to consider only algorithms that sample only $K_1$ and $K_2$. Let $\Omega = \{0, 1\}^{n \times T}$ be the sample space of possible rewards of the arms, and let $A = \{\omega \in \Omega \mid S_T = K_1\}$ be the event that the algorithm outputs $K_1$. According to Pinsker's inequality,
$$2\left(P_1[A] - P_2[A]\right)^2 \le KL(P_1, P_2) = \sum_{t=1}^{T}\sum_{i=1}^{2} KL\!\left(P^{i,t}_1, P^{i,t}_2\right)$$
where $KL$ is the Kullback-Leibler divergence between two distributions, and $P^{i,t}_j$ denotes the distribution of rewards at time $t$ given that subset $K_i$ was selected and the profile is $\mathcal{I}_j$. Note that $P^{i,t}_j$ is a binomial distribution with $k$ samples and probability $\frac12 \pm \frac{\epsilon}{k}$, and thus the KL divergence satisfies
$$KL\!\left(P^{i,t}_1, P^{i,t}_2\right) = k \cdot KL\!\left(\tfrac12 + \tfrac{\epsilon}{k},\ \tfrac12 - \tfrac{\epsilon}{k}\right) \le 16k\left(\tfrac{\epsilon}{k}\right)^2 = \frac{16\epsilon^2}{k}$$
Therefore we have
$$2\left(P_1[A] - P_2[A]\right)^2 \le \sum_{t=1}^{T}\sum_{i=1}^{2} KL\!\left(P^{i,t}_1, P^{i,t}_2\right) \le 2T \cdot \frac{16\epsilon^2}{k} \le 32c$$
where we used the assumption $T \le \frac{ck}{\epsilon^2}$. Thus, for a sufficiently small constant $c$, we have $|P_1[A] - P_2[A]| \le \frac14$. Due to assumption (7) we have $P_2[A] = \Pr[S_T = K_1 \mid \mathcal{I}_2] \le \frac14$, and therefore $P_1[A] \le P_2[A] + \frac14 \le \frac12$, contradicting (7). For $k < \frac{n}{2}$ we may add to the profiles $\mathcal{I}_1, \mathcal{I}_2$ arms with mean 0, which may only increase the number of samples.

Lower bound for smaller $k$

To prove the stronger lower bound, we use Lemma 4 from Audibert et al. (2013). For convenience, we cite the lemma.

Lemma 11.3.
Let $l$ and $k$ be integers with $1 \le l \le k$. Let $p, p', q, p_2, \dots, p_k \in (0, 1)$ with $q \in \{p, p'\}$, $p_2 = \dots = p_l = q$ and $p_{l+1} = \dots = p_k = \frac12$. Let $B$ (resp. $B'$) be the sum of $k$ independent Bernoulli distributions with parameters $p, p_2, \dots, p_k$ (resp. $p', p_2, \dots, p_k$). We have
$$KL(B, B') \le \frac{2(p' - p)^2}{(1 - p')(l + 1)q}$$

We now prove the lower bound for smaller $k$.

Lemma 11.4.
For any $n$ and $k \le \frac{n}{32}$, and for any $\epsilon > 0$, there exists a reward distribution such that the sample complexity of any $(\epsilon,\delta)$-PAC algorithm is at least $M = \Omega\!\left(\frac{n}{\epsilon^2}\right)$.

Proof For any $j \in [n]$, define the following profile:
$$\mathcal{I}_j = \begin{cases} X_i \sim \mathrm{Ber}\!\left(\frac12\right) & i \ne j \\ X_i \sim \mathrm{Ber}\!\left(\frac12 + 2\epsilon\right) & i = j \end{cases}$$
and also define $\mathcal{I}_0 = \{X_i \sim \mathrm{Ber}(\frac12) \mid i = 1, \dots, n\}$. We use the abbreviation $P_j[\cdot]$ ($E_j[\cdot]$) to denote the probability (expectation) when the arms are distributed according to $\mathcal{I}_j$.

Suppose that there exists an algorithm that runs for $T \le \frac{cn}{\epsilon^2}$ steps, for some constant $c > 0$, against $\mathcal{I}_0$ and returns a subset $S_T$. We first show that there are many arms that are sampled only a few times and are not part of $S_T$ with high probability.

For any $j \in [n]$, let $T_j$ denote the number of times arm $j$ is sampled. Then,
$$\sum_{j=1}^{n} E[T_j] = kT \le \frac{cnk}{\epsilon^2}$$
Then for at least $\frac34$ of the arms it holds that $E[T_j] \le \frac{4ck}{\epsilon^2}$ (otherwise the sum over all arms is bigger than $kT$). Accordingly, by Markov's inequality, for each of these arms $P[T_j \ge T^*] \le \frac18$, where $T^* = \frac{32ck}{\epsilon^2}$. By similar considerations, for at least $\frac34$ of the arms it holds that $P[j \in S_T] \le \frac{4k}{n} \le \frac18$ (where we assumed $k \le \frac{n}{32}$). Thus, by the pigeonhole principle there exists a subset of arms $B \subset [n]$ such that $|B| \ge \frac{n}{2}$ and for all $j \in B$ the following holds:
$$P[T_j > T^*] \le \frac18 \quad \text{and} \quad P[j \in S_T] \le \frac18 \quad (8)$$
Fix an arm $j \in B$; we prove that $P_j[j \in S_T] \le \frac12$. Let $\Omega^*$ denote the sample set of possible arms' rewards under the restriction that $j$ was sampled at most $T^*$ times, and let $P^*$ denote the corresponding distribution. By Pinsker's inequality, for any event $A \subset \Omega^*$, the distance between the two probability distributions satisfies
$$2\left(P^*[A] - P^*_j[A]\right)^2 \le KL(P^*, P^*_j) = \sum_{t=1}^{T} KL\!\left(P^{S_t}, P^{S_t}_j\right) \quad (9)$$
where $P^{S_t}_j$ denotes the reward distribution of the subset $S_t$ under profile $\mathcal{I}_j$. Note that all arms except $j$ are identically distributed under $\mathcal{I}_0$ and $\mathcal{I}_j$, and therefore for any $S_t$ that does not include $j$ the KL divergence is zero. Hence, we only need to consider rounds $t \in [T]$ in which $j$ was sampled as part of $S_t$. By Lemma 11.3 with $p = \frac12 + 2\epsilon$ and $p' = q = p_2 = \dots = p_k = \frac12$ (so $l = k$), we have
$$KL\!\left(P^{S_t}, P^{S_t}_j\right) \le \frac{2(2\epsilon)^2}{\frac12 \cdot (k+1) \cdot \frac12} = \frac{32\epsilon^2}{k+1} \le \frac{32\epsilon^2}{k}$$
Since under $\Omega^*$ arm $j$ is sampled in at most $T^*$ rounds, substituting in (9) gives
$$2\left(P^*_j[A] - P^*[A]\right)^2 \le \sum_{t: j \in S_t} KL\!\left(P^{S_t}_j, P^{S_t}\right) \le T^* \cdot \frac{32\epsilon^2}{k} = 1024c$$
Thus, for a sufficiently small constant $c$, we conclude that for any event $A \subset \Omega^*$, $P^*_j[A] \le P^*[A] + \frac18$.

Define the following events:
$$A = \{j \in S_T \wedge T_j \le T^*\} \quad \text{and} \quad A' = \{T_j > T^*\}$$
Note that both $A, A' \subset \Omega^*$, since whether $j$ is sampled more than $T^*$ times is completely determined by the first $T^*$ samples. Thus,
$$P^*_j[A] \le P^*[A] + \frac18 \le \frac18 + \frac18 = \frac14$$
$$P^*_j[A'] \le P^*[A'] + \frac18 \le \frac18 + \frac18 = \frac14$$
where the probabilities are bounded due to (8). Finally, we have
$$P_j[j \in S_T] \le P_j[j \in S_T \wedge T_j \le T^*] + P_j[T_j > T^*] \le \frac14 + \frac14 = \frac12$$
Namely, every algorithm that runs fewer than $\frac{cn}{\epsilon^2}$ rounds errs on more than half of the instances $\{\mathcal{I}_j\}_{j \in B}$ and returns an $\epsilon$-far set with probability at least $\frac12$.

Sum up
In Lemma 11.4 we showed that for $k \le \frac{n}{32}$ the sample complexity is at least $\Omega\!\left(\frac{n}{\epsilon^2}\right)$, and in Lemma 11.2 we showed that for $k \le \frac{n}{2}$ it is at least $\Omega\!\left(\frac{k}{\epsilon^2}\right)$. Note that for $\frac{n}{32} \le k \le \frac{n}{2}$ we have $k = \Theta(n)$, and thus we can combine both cases to deduce Theorem 10.

Appendix B. Sample Complexity in Kuroki et al. (2019)

We refer to the sample complexity upper bound of $O\!\left(\rho(p)\frac{k^2}{\Delta^2}\log\frac{n}{\delta}\right)$ in Kuroki et al. (2019), where $\rho(p)$ depends on the distribution $p$ of the arms' selection. We show that for any $p$, $\rho(p) \ge \frac{n}{k}$, and thus the sample complexity is given by $O\!\left(\frac{nk}{\Delta^2}\log\frac{n}{\delta}\right)$.

We start by citing some definitions. For any set $S$ of $k$ arms, let $\chi_S \in \{0,1\}^n$ denote its indicator vector. Fix an algorithm for finding the best subset, and let $p(S)$ be the probability that the algorithm selects $S$. Define $\Lambda_p = \sum_{S \subset [n]} p(S)\chi_S\chi_S^\intercal$ and $\rho(p) = \max_S \chi_S^\intercal \Lambda_p^{-1} \chi_S$. We want to bound $\rho(p)$. To that end, we first prove the following claim.

Claim B.1.
For any vector $x \in \mathbb{R}^n$ and any symmetric positive-definite matrix $A$ of size $n$,
$$(x^\intercal A x)(x^\intercal A^{-1} x) \ge \|x\|^4$$

Proof
Let $v_1, \dots, v_n$ be $A$'s eigenvectors, corresponding to the eigenvalues $\lambda_1, \dots, \lambda_n$. We write $x = \sum_{i=1}^{n} \alpha_i v_i$; then $x^\intercal A x = \sum_{i=1}^{n} \alpha_i^2 \lambda_i$ and $x^\intercal A^{-1} x = \sum_{i=1}^{n} \alpha_i^2 \lambda_i^{-1}$, since $A$ is symmetric and therefore its eigenvectors are orthonormal. According to the weighted version of the inequality of arithmetic and harmonic means, we have
$$\frac{x^\intercal A x}{\|x\|^2} = \frac{\sum_{i=1}^{n} \alpha_i^2 \lambda_i}{\sum_{i=1}^{n} \alpha_i^2} \ge \frac{\sum_{i=1}^{n} \alpha_i^2}{\sum_{i=1}^{n} \alpha_i^2 \lambda_i^{-1}} = \frac{\|x\|^2}{x^\intercal A^{-1} x}$$

Claim B.2.
For any distribution $p$, $\rho(p) \ge \frac{n}{k}$.

Proof
First consider $\Lambda_p$'s trace:
$$\mathrm{tr}(\Lambda_p) = \sum_{S \subset [n]} p(S)\,\mathrm{tr}(\chi_S \chi_S^\intercal) = \sum_{S \subset [n]} p(S)\,\mathrm{tr}(\chi_S^\intercal \chi_S) = \sum_{S \subset [n]} p(S)\,k = k$$
where we used the fact that $\chi_S$ contains exactly $k$ ones and that $\sum_{S \subset [n]} p(S) = 1$.

Next, note that entry $i,j$ of $\Lambda_p$ is the marginal probability $p_{ij}$ of arms $i, j$ being selected together according to $p$. Accordingly, the entries on the diagonal, $p_{ii}$, are the marginal probabilities of single arms. We saw that $\mathrm{tr}(\Lambda_p) = \sum_{i=1}^{n} p_{ii} = k$, namely the average is $\frac{1}{n}\sum_{i=1}^{n} p_{ii} = \frac{k}{n}$. Assume that the arms are ordered such that $p_{11} \le \dots \le p_{nn}$. Then the average of the minimal $k$ arms satisfies $\frac{1}{k}\sum_{i=1}^{k} p_{ii} \le \frac{k}{n}$. Thus, for the set $S = \{1, \dots, k\}$,
$$\chi_S^\intercal \Lambda_p \chi_S = \sum_{i=1}^{k}\sum_{j=1}^{k} p_{ij} \le k \sum_{i=1}^{k} p_{ii} \le k \cdot \frac{k^2}{n}$$
where the first inequality holds because $p_{ii} \ge p_{ij}$ for all $i, j$.

Finally, note that $\Lambda_p$ is symmetric, and thus by Claim B.1 we have
$$\chi_S^\intercal \Lambda_p^{-1} \chi_S \ge \frac{\|\chi_S\|^4}{\chi_S^\intercal \Lambda_p \chi_S} \ge \frac{k^2}{k^3/n} = \frac{n}{k}$$
which shows that $\rho(p) = \max_{S'} \chi_{S'}^\intercal \Lambda_p^{-1} \chi_{S'} \ge \chi_S^\intercal \Lambda_p^{-1} \chi_S \ge \frac{n}{k}$.

Appendix C. Experiments

We also compared our algorithm to other methods experimentally, on simulated data. We conducted two experiments, one for the sample complexity and one for the regret.
C.1 Sample Complexity
For the sample complexity, we evaluate the accuracy of different sampling methods in comparison to EST1. Figure 2(a) shows the mean square error of EST1, which uses a Hadamard matrix, along with two other sampling methods. The first is Leave One Out (LOO), which partitions the arms into sets of size $k+1$ and in each one samples all the $k+1$ subsets of size $k$. The second method samples a random $2k \times 2k$ matrix, such that in each row $k$ entries are $+1$ and $k$ are $-1$. In this experiment, each arm is a normal random variable with random mean in $[0, 1]$ and $\sigma = 1$, and we set $n = 144$ and $k = 8$. The plot shows the average and standard deviation of 1000 runs. It can be seen that Hadamard significantly outperforms the other two methods.

The high variance in the random matrices' MSE stems from the variance involved in the choice of the matrices. To see this, we tested the relation between the MSE and the condition number of different matrices. In linear regression, the condition number is defined as the ratio between the biggest and the smallest singular values, $\frac{\sigma_{\max}}{\sigma_{\min}}$, and it measures the effect of deviations in the response variable on the estimation error. Figure 2(b) shows the MSE of 1000 random matrices as a function of their condition number, where each point is the average of 100 independent experiments. It can be seen that the MSE is indeed monotone in the condition number, with a Spearman correlation of 96%. However, we note that while the relation between the two is expected to be linear according to theory, the relation observed in the experiments is not linear.

Figure 2: (a) Comparing sampling schemes (Hadamard, LOO and random). (b) MSE of random matrices as a function of their condition number.

C.2 Regret

For the regret, we compared CSAR's performance to the Sort & Merge algorithm of Agarwal and Aggarwal (2018). Figure 3(a) shows the cumulative regret as a function of time for both algorithms. In this experiment, we initialized the arms to be Bernoulli random variables with random mean in $[0, 1]$, and we set $n = 24$ and $k = 2$. The plot shows the average and standard deviation of 100 runs. It can be seen that CSAR achieves significantly lower regret than Sort & Merge. In addition, we test the consistency of this gap for different values of $k$. Figure 3(b) shows the cumulative regret after 5 million steps for different $k$ values. The plot shows the average and standard deviation of 35 runs.
Note that the large deviations in Sort & Merge's regret in both plots result from the random initialization of the arms, which might affect the exploration's duration dramatically.

Figure 3: Comparing regret (CSAR vs. Sort & Merge) as a function of: (a) time horizon $T$; (b) subset size $k$.
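The condition-number effect in Figure 2(b) is easy to reproduce. The sketch below (ours; the parameters are illustrative) compares a Hadamard matrix, whose singular values all equal $\sqrt{n}$ so its condition number is exactly 1, with random sign matrices similar to the random baseline of Appendix C.1. Note that we add an all-ones first row, our assumption rather than the paper's specification, since a matrix whose every row is exactly balanced annihilates the all-ones vector and is singular.

```python
import numpy as np

rng = np.random.default_rng(1)

def sylvester_hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 16
cond_hadamard = np.linalg.cond(sylvester_hadamard(n))
print(cond_hadamard)                 # exactly 1: all singular values equal sqrt(n)

# Random variant: an all-ones row plus rows with n/2 entries +1 and n/2 entries -1.
conds = []
for _ in range(200):
    rows = np.array([rng.permutation([1] * (n // 2) + [-1] * (n // 2))
                     for _ in range(n - 1)])
    M = np.vstack([np.ones(n, dtype=int), rows])
    conds.append(np.linalg.cond(M))
med = float(np.median(conds))
print(med)                           # typically much larger than 1
```

The much larger (and widely spread) condition numbers of the random matrices are consistent with the higher and more variable MSE observed in the experiments.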