Best-item Learning in Random Utility Models with Subset Choices
Aadirupa Saha∗, Aditya Gopalan†

Abstract
We consider the problem of PAC learning the most valuable item from a pool of n items using sequential, adaptively chosen plays of subsets of k items, when, upon playing a subset, the learner receives relative feedback sampled according to a general Random Utility Model (RUM) with independent noise perturbations to the latent item utilities. We identify a new property of such a RUM, termed the minimum advantage, that helps in characterizing the complexity of separating pairs of items based on their relative win/loss empirical counts, and that can be bounded as a function of the noise distribution alone. We give a learning algorithm for general RUMs, based on pairwise relative counts of items and hierarchical elimination, along with a new PAC sample complexity guarantee of O((n/(c²ε²)) log(k/δ)) rounds to identify an ε-optimal item with confidence 1 − δ, when the worst-case pairwise advantage in the RUM has sensitivity at least c to the parameter gaps of items. Fundamental lower bounds on PAC sample complexity show that this is near-optimal in terms of its dependence on n, k and c.

1 Introduction

Random utility models (RUMs) are a popular and well-established framework for studying behavioral choices by individuals and groups Thurstone [1927]. In a RUM with finitely many alternatives or items, a distribution on the preferred alternative(s) is assumed to arise from a random utility drawn from a distribution for each item, followed by rank ordering the items according to their utilities. Perhaps the most widely known RUM is the Plackett-Luce or multinomial logit model Plackett [1975], Luce [2012], which results when each item's utility is sampled from an additive model with a Gumbel-distributed perturbation. It is unique in the sense of enjoying the property of independence of irrelevant attributes (IIA), which is often key in permitting efficient inference of Plackett-Luce models from data Khetan and Oh [2016]. Other well-known RUMs include the probit model Bliss [1934] featuring random Gaussian perturbations to the intrinsic utilities, mixed logit, nested logit, etc.

A long line of work in statistics and machine learning focuses on estimating RUM properties from observed data Soufiani et al. [2014], Zhao et al. [2018], Soufiani et al. [2013]. Online learning and adaptive testing, on the other hand, have shown efficient ways of identifying the most attractive (i.e., highest utility) items in RUMs by learning from relative feedback from item pairs or, more generally, subsets Szörényi et al. [2015], Saha and Gopalan [2019], Jang et al. [2017]. However, almost all existing work in this vein exclusively employs the Plackett-Luce model, arguably due to its very useful IIA property, and our understanding of learning performance in other, more general RUMs has been lacking. We take a step in this direction by framing the problem of sequentially learning the best item(s) in general RUMs by adaptive testing of item subsets and observing relative RUM feedback. In the process, we uncover new structural properties in RUMs, including models with exponential, uniform and Gaussian (probit) utility distributions, and give algorithmic principles to exploit this structure, that permit provably sample-efficient online learning and allow us to go beyond Plackett-Luce.

∗ Indian Institute of Science, Bangalore, India. [email protected]
† Indian Institute of Science, Bangalore, India. [email protected]

Our contributions:
We introduce a new property of a RUM, called the (pairwise) advantage ratio, which essentially measures the worst-case relative win probabilities between an item pair across all possible contexts (subsets) where they occur. We show that this ratio can be controlled (bounded below) as an affine function of the relative strengths of item pairs for RUMs based on several common centered utility distributions, e.g., exponential, Gumbel, uniform, Gamma, Weibull, normal, etc., even when the resulting RUM does not possess analytically favorable properties such as IIA.

We give an algorithm for sequentially and adaptively PAC (probably approximately correct) learning the best item from among a finite pool when, in each decision round, a subset of fixed size can be tested and top-m rank ordered feedback from the RUM can be observed. The algorithm is based on the idea of maintaining pairwise win/loss counts among items, hierarchically testing subsets and propagating the surviving winners; these principles have been shown to work optimally in the more structured Plackett-Luce RUM Szörényi et al. [2015], Saha and Gopalan [2019].

In terms of performance guarantees, we derive a PAC sample complexity bound for our algorithm: when working with a pool of n items in total and subsets of size k chosen in each decision round, the algorithm terminates in O((n/(c²ε²)) log(k/δ)) rounds, where c is a lower bound on the advantage ratio's sensitivity to intrinsic item utilities. This can in turn be shown to be a property of only the RUM's perturbation distribution, independent of the subset size k. A novel feature of the guarantee is that, unlike existing sample complexity results for sequential testing in the Plackett-Luce model, it does not rely on specific properties such as IIA, which are not present in general RUMs. We also extend the result to cover top-m rank ordered feedback, of which winner feedback (m = 1) is a special case. Finally, we show that the sample complexity of our algorithm is order-wise optimal across RUMs having a given advantage ratio sensitivity c, by arguing an information-theoretic lower bound on the sample complexity of any online learning algorithm.

Our results and techniques represent a conceptual advance in the problem of online learning in general RUMs, moving beyond the Plackett-Luce model for the first time to the best of our knowledge.

Related Work:
For the classical multi-armed bandit setting, there is a well-studied literature on the PAC arm-identification problem Even-Dar et al. [2006], Audibert and Bubeck [2010], Kalyanakrishnan et al. [2012], Karnin et al. [2013], Jamieson et al. [2014], where the learner gets to see a noisy draw of absolute reward feedback upon playing a single arm per round. In contrast, learning to identify the best item(s) with only relative preference information (ordinal as opposed to cardinal feedback) has seen steady progress since the introduction of the dueling bandit framework Zoghi et al. [2013], in which pairs of items (size-2 subsets) can be played, and subsequent work on generalisation to broader models both in terms of distributional parameters Yue and Joachims [2009], Gajane et al. [2015], Ailon et al. [2014], Zoghi et al. [2015] as well as combinatorial subset-wise plays Mohajer et al. [2017], González et al. [2017], Saha and Gopalan [2018a], Sui et al. [2017]. There have been several developments on the PAC objective for different pairwise preference models, such as those satisfying stochastic triangle inequalities and strong stochastic transitivity [Yue and Joachims, 2011], general utility-based preference models [Urvoy et al., 2013], the Plackett-Luce model [Szörényi et al., 2015] and the Mallows model [Busa-Fekete et al., 2014a]. Recent work has studied PAC-learning objectives other than identifying the single (near) best arm, e.g., recovering a few of the top arms [Busa-Fekete et al., 2013, Mohajer et al., 2017] or the true ranking of the items [Busa-Fekete et al., 2014b, Falahatgar et al., 2017]. Some recent works have also extended the PAC-learning objective to relative subset-wise preferences Saha and Gopalan [2018b], Chen et al. [2017, 2018], Saha and Gopalan [2019], Ren et al. [2018].

However, none of the existing work considers strategies to learn efficiently in general RUMs with subset-wise preferences, and to the best of our knowledge we are the first to address this general problem setup. In a different direction, there has been work on batch (non-adaptive) estimation in general RUMs, e.g., Zhao et al. [2018], Soufiani et al. [2013]; however, this does not consider the price of active learning and the associated exploration effort, as we study here. A related body of literature lies in dynamic assortment selection, where the goal is to offer a subset of items to customers in order to maximise expected revenue; this has been studied under different choice models, e.g., multinomial logit [Talluri and Van Ryzin, 2004], Mallows and mixtures of Mallows [Désir et al., 2016a], Markov chain-based choice models [Désir et al., 2016b], the single transition model [Nip et al., 2017], etc. However, each of these works addresses a given and very specific kind of choice model, and their objective is better suited to a regret-minimization framework where playing every item comes with an associated cost.

Organization:
We give the necessary preliminaries and our general RUM-based problem setup in Section 2. The formal description of our feedback models and the details of the (ε, δ)-best arm identification problem are given in Section 3. In Section 4, we analyse the pairwise preferences of item pairs for our general RUM-based subset choice model and introduce the notion of the Advantage-Ratio connecting subsetwise scores to pairwise preferences. Our proposed algorithm, along with its performance guarantee and a matching lower bound analysis, is given in Section 5. We further extend the above results to a more general top-m ranking feedback model in Section 6. Section 7 finally concludes our work with certain future directions. All proofs are deferred to the appendix.

Notation.
We denote by [n] the set {1, 2, ..., n}. For any subset S ⊆ [n], let |S| denote the cardinality of S. When there is no confusion about the context, we often represent an (unordered) subset S as a vector, or ordered subset, S of size |S| (according to, say, a fixed global ordering of all the items [n]); in this case, S(i) denotes the item (member) at the i-th position in S. Σ_S = {σ | σ is a permutation over the items of S}, where for any permutation σ ∈ Σ_S, σ(i) denotes the element at the i-th position in σ, i ∈ [|S|]. 1(ϕ) is generically used to denote an indicator variable that takes the value 1 if the predicate ϕ is true, and 0 otherwise. x ∨ y denotes the maximum of x and y, and Pr(A) is used to denote the probability of event A, in a probability space that is clear from the context.

A discrete choice model specifies the relative preferences of two or more discrete alternatives in a given set. Random Utility Models (RUMs) are a widely-studied class of discrete choice models; they assume a (non-random) ground-truth utility score θ_i ∈ R for each alternative i ∈ [n], and assign a distribution D_i(·|θ_i) for scoring item i, where E[D_i|θ_i] = θ_i. To model a winning alternative given any set S ⊆ [n], one first draws a random utility score X_i ∼ D_i(·|θ_i) for each alternative in S, and selects the item with the highest random score. More formally, the probability that an item i ∈ S emerges as the winner in set S is given by

Pr(i|S) = Pr(X_i > X_j, ∀ j ∈ S \ {i}).   (1)

In this paper, we assume that for each item i ∈ [n], its random utility score X_i is of the form X_i = θ_i + ζ_i, where all the ζ_i ∼ D are 'noise' random variables drawn independently from a probability distribution D.

A widely used RUM is the Multinomial-Logit (MNL) or Plackett-Luce (PL) model, where the D_i's are taken to be independent Gumbel(0, 1) distributions with location parameter 0 and scale parameter 1 [Azari et al., 2012], which results in the score distributions Pr(X_i ∈ [x, x + dx]) = e^{−(x − θ_i)} e^{−e^{−(x − θ_i)}} dx, ∀ i ∈ [n]. Moreover, it can be shown that the probability that an alternative i emerges as the winner in any set S ∋ i is simply proportional to its (exponentiated) score parameter: Pr(i|S) = e^{θ_i} / Σ_{j ∈ S} e^{θ_j}.

Other families of discrete choice models can be obtained by imposing different probability distributions on the iid noise ζ_i ∼ D; e.g.,
1. Exponential noise: D is the Exponential(λ) distribution (λ > 0).
2. Noise from extreme value distributions: D is the Extreme-value-distribution(µ, σ, ξ) (µ ∈ R, σ > 0, ξ ∈ R). Many well-known distributions fall in this class, e.g., Frechet, Weibull, Gumbel. For instance, when ξ = 0, this reduces to the Gumbel(µ, σ) distribution.
3. Uniform noise: D is the (continuous) Uniform(a, b) distribution (a, b ∈ R, b > a).
4. Gaussian noise: D is the Gaussian(µ, σ) distribution (µ ∈ R, σ > 0).
5. Gamma noise: D is the Gamma(k, ξ) distribution (where k, ξ > 0).
Other distributions D can alternatively be used for modelling the noise, depending on desired tail properties, domain-specific information, etc.

Finally, we denote a RUM choice model, comprised of an instance θ = (θ_1, θ_2, ..., θ_n) (with its implicit dependence on the noise distribution D) along with a playable subset size k ≤ n, by RUM(k, θ).
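As a quick illustration of the winner probability (1), the following Python snippet (our own sketch; the utility values and helper names are illustrative assumptions, not from the paper) simulates plays of a subset and, for Gumbel(0, 1) noise, compares empirical winner frequencies against the Plackett-Luce closed form Pr(i|S) = e^{θ_i}/Σ_{j∈S} e^{θ_j}:

```python
import numpy as np

rng = np.random.default_rng(0)

def winner_frequencies(theta, subset, noise_sampler, n_rounds=200_000):
    """Empirical estimate of Pr(i | S): play `subset` n_rounds times and count winners."""
    idx = np.array(subset)
    scores = theta[idx] + noise_sampler((n_rounds, len(idx)))   # X_i = theta_i + zeta_i per round
    winners = idx[np.argmax(scores, axis=1)]
    return {i: float(np.mean(winners == i)) for i in subset}

theta = np.array([0.8, 0.5, 0.3, 0.0])            # hypothetical utility parameters
S = [0, 1, 3]
gumbel = lambda size: rng.gumbel(0.0, 1.0, size)  # Gumbel(0,1) noise yields the Plackett-Luce model
print("empirical:", winner_frequencies(theta, S, gumbel))
pl = np.exp(theta[S]) / np.exp(theta[S]).sum()
print("Plackett-Luce closed form:", dict(zip(S, np.round(pl, 4))))
```

Swapping the noise sampler (e.g., for exponential or Gaussian draws) gives the corresponding non-Plackett-Luce RUMs discussed above.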
We consider the probably approximately correct (PAC) version of the sequential decision-making problem of finding the best item in a set of n items, by making only subset-wise comparisons. Formally, the learner is given a finite set [n] of n > 1 items or 'arms' (terminology borrowed from multi-armed bandits), along with a playable subset size k ≤ n. At each decision round t = 1, 2, ..., the learner selects a subset S_t ⊆ [n] of k distinct items, and receives (stochastic) feedback depending on (a) the chosen subset S_t, and (b) a RUM(k, θ) choice model with parameters θ = (θ_1, θ_2, ..., θ_n) a priori unknown to the learner. The nature of the feedback can be of several types, as described in Section 3.1. For the purposes of analysis, we assume, without loss of generality and for ease of exposition, that θ_1 > θ_i ∀ i ∈ [n] \ {1} (under the assumption that the learner's decision rule does not contain any bias towards a specific item index). We define a best item to be one with the highest score parameter: i* ∈ argmax_{i ∈ [n]} θ_i = {1} under the assumptions above (the extension to the case where several items have the same highest parameter value is easily accomplished).

Remark 1.
Under the assumptions above, it follows that item 1 is the Condorcet winner Zoghi et al. [2014] for the underlying pairwise preference model induced by RUM(k, θ).

3.1 Feedback models

By 'feedback model' we mean the information received (from the 'environment') once the learner plays a subset S ⊆ [n] of k items. Similar to the different types of feedback models introduced earlier in the context of the specific Plackett-Luce RUM Saha and Gopalan [2019], we consider the following feedback mechanisms:

• Winner of the selected subset (WI):
The environment returns a single item I ∈ S, drawn independently from the probability distribution Pr(I = i|S) = Pr(X_i > X_j, ∀ j ∈ S \ {i}), ∀ i ∈ S, S ⊆ [n].

• Full ranking of the selected subset of items (FR):
The environment returns a full ranking σ ∈ Σ_S, drawn from the probability distribution Pr(σ|S) = ∏_{i=1}^{|S|} Pr(X_{σ(i)} > X_{σ(j)}, ∀ j ∈ {i + 1, ..., |S|}), ∀ σ ∈ Σ_S. In fact, this is equivalent to picking σ(1) according to the winner feedback from S, then picking σ(2) from S \ {σ(1)} following the same feedback model, and
so on, until all elements from S are exhausted; in other words, successively sampling |S| winners from S according to the RUM(k, θ) model, without replacement.

For a RUM(k, θ) instance with n ≥ k arms, an arm i ∈ [n] is said to be ε-optimal if θ_i > θ_1 − ε. A sequential learning algorithm that depends on feedback from an appropriate subset-wise feedback model is said to be (ε, δ)-PAC, for given constants 0 < ε ≤ 1 and 0 < δ ≤ 1, if the following properties hold when it is run on any instance RUM(k, θ): (a) it stops and outputs an arm I ∈ [n] after a finite number of decision rounds (subset plays) with probability 1, and (b) the probability that its output I is an ε-optimal arm in RUM(k, θ) is at least 1 − δ, i.e., Pr(I is ε-optimal) ≥ 1 − δ. (By an algorithm we essentially mean a causal algorithm that makes present decisions using only past observed information at each time; the technical details of defining this precisely are omitted.) Furthermore, by the sample complexity of the algorithm we mean the expected time (number of decision rounds) taken by the algorithm to stop when run on the instance RUM(k, θ).

In this section, we introduce the key concept of the Advantage ratio as a means to systematically relate subsetwise preference observations to pairwise scores in general RUMs. Consider any set S ⊆ [n], |S| = k, and recall that the probability of item i winning in S is Pr(i|S) := Pr(X_i > X_j, ∀ j ∈ S \ {i}) for all i ∈ S, S ⊆ [n]. For any two items i, j ∈ [n], let us denote ∆_ij = (θ_i − θ_j). Let us also denote by f(·), F(·) and F̄(·) the probability density function, cumulative distribution function and complementary cumulative distribution function of the noise distribution D, respectively; thus, F(x) = ∫_{−∞}^{x} f(y) dy and F̄(x) = ∫_{x}^{∞} f(y) dy = 1 − F(x) for any x ∈ Support(D). (We assume by default that all noise distributions have a density; the extension to more general noise distributions is left to future work.)

We now introduce and analyse the Advantage-Ratio (Def. 1); we will see in Sec. 5.1 how this quantity helps us derive an improved sample complexity guarantee for our (ε, δ)-PAC item identification problem.

Definition 1 (Advantage ratio and minimum advantage ratio). Given any subsetwise preference model defined on n items, we define the advantage ratio of item i over item j within the subset S ⊆ [n], i, j ∈ S, as Advantage-Ratio(i, j, S) = Pr(i|S)/Pr(j|S). Moreover, given a playable subset size k, we define the minimum advantage ratio, Min-AR, of item i over j as the least advantage ratio of i over j across size-k subsets of [n], i.e.,

Min-AR(i, j) = min_{S ⊆ [n], |S| = k, S ∋ i,j} Pr(i|S)/Pr(j|S).   (2)
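To make Definition 1 concrete, the following minimal Monte Carlo sketch (our illustration; the exponential noise and parameter values are arbitrary choices, not part of the paper's setup) estimates Pr(i|S), the advantage ratio, and Min-AR(i, j) over all size-k subsets containing i and j:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def win_prob(theta, subset, noise_sampler, n_samples=200_000):
    """Monte Carlo estimate of Pr(i | S) for every i in `subset`."""
    idx = np.array(subset)
    scores = theta[idx] + noise_sampler((n_samples, len(idx)))   # X_i = theta_i + zeta_i
    winners = idx[np.argmax(scores, axis=1)]
    return {i: np.mean(winners == i) for i in subset}

def advantage_ratio(theta, i, j, subset, noise_sampler):
    p = win_prob(theta, subset, noise_sampler)
    return p[i] / p[j]

def min_advantage_ratio(theta, i, j, k, noise_sampler):
    """Minimum advantage ratio over all size-k subsets containing both i and j."""
    others = [r for r in range(len(theta)) if r not in (i, j)]
    return min(advantage_ratio(theta, i, j, (i, j) + rest, noise_sampler)
               for rest in itertools.combinations(others, k - 2))

theta = np.array([0.9, 0.7, 0.5, 0.3, 0.1])           # hypothetical utility parameters
exp_noise = lambda size: rng.exponential(1.0, size)   # Exponential(1) perturbations (a non-IIA RUM)
print(min_advantage_ratio(theta, i=0, j=1, k=3, noise_sampler=exp_noise))
```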
[Figure 1: Illustration of Min-AR(i, j) for a fixed z. The green shaded region is where X_j > max(X_i, z), the red shaded region is where X_i > max(X_j, z), and the white rectangle is where max(X_i, X_j) < z. Note how the shapes of the green and red regions vary as z (the blue dot) moves along the real line (x-axis).]

The key intuition here is that when Min-AR(i, j) does not equal 1, it serves as a distinctive measure for separating items i and j, irrespective of the context S. We specifically build on this intuition later in Sec. 5.1 to propose a new algorithm (Alg. 1) which finds the (ε, δ)-PAC best item by relying on the unique distinguishing property of the best item, θ_1 > θ_j ∀ j ∈ [n] \ {1} (as described in Sec. 3).

The following result gives a variational lower bound, in terms of the noise distribution, on the minimum advantage ratio in a RUM(k, θ) model with independent and identically distributed (iid) noise variables, which is often amenable to explicit calculation or bounding.

Lemma 2 (Variational lower bound for the advantage ratio). For any RUM(k, θ) based subsetwise preference model and any item pair (i, j),

Min-AR(i, j) ≥ min_{z ∈ R} Pr(X_i > max(X_j, z)) / Pr(X_j > max(X_i, z)).   (3)

Moreover, for RUM(k, θ) models one can show that for any pair (i, j) and any z ∈ R, Pr(X_i > max(X_j, z)) = F(z − θ_j) F̄(z − θ_i) + ∫_{z − θ_j}^{∞} F̄(x − ∆_ij) f(x) dx, which further lower bounds Min-AR(i, j) by

min_{z ∈ R} [F(z − θ_j) F̄(z − θ_i) + ∫_{z − θ_j}^{∞} F̄(x − ∆_ij) f(x) dx] / [F(z − θ_i) F̄(z − θ_j) + ∫_{z − θ_i}^{∞} F̄(x + ∆_ij) f(x) dx].

(We take a ratio with zero denominator to be ∞ in the right-hand side of Eqn. (3).)

The proof of the result appears in Appendix A.1. Fig. 1 shows a geometrical interpretation behind
Min-AR(i, j), under the joint realization of the pair of values (ζ_i, ζ_j).

Remark 2. Let S̄ := argmin_{|S| = k, i,j ∈ S} Pr(i|S)/Pr(j|S). It is sufficient to consider the domain of z in the right-hand side of (3) to be just the set max_{r ∈ S̄ \ {i,j}} θ_r + support(D), as the proof of Lemma 2 brings out. However, for simplicity we use the smaller (and still valid) lower bound in Eqn. (3) and take z ∈ R.
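As an illustration of how Lemma 2 can be evaluated in practice (our own numerical sketch, with unit-rate exponential noise chosen purely as an example), one can discretize z and the inner integrals:

```python
import numpy as np

# Unit-rate exponential noise: density, CDF and complementary CDF on [0, inf).
f    = lambda x: np.where(x >= 0, np.exp(-x), 0.0)
F    = lambda x: np.where(x >= 0, 1.0 - np.exp(-x), 0.0)
Fbar = lambda x: 1.0 - F(x)

def lemma2_bound(theta_i, theta_j, z_grid=np.linspace(-5, 15, 400),
                 x_grid=np.linspace(0, 40, 8000)):
    """Numerically evaluate min_z of the ratio in Lemma 2 for iid noise (f, F)."""
    d, ratios = theta_i - theta_j, []
    for z in z_grid:
        # Pr(X_i > max(X_j, z)) = F(z - theta_j) Fbar(z - theta_i) + int_{z-theta_j}^inf Fbar(x - d) f(x) dx
        xi = x_grid[x_grid >= z - theta_j]
        num = F(z - theta_j) * Fbar(z - theta_i) + np.trapz(Fbar(xi - d) * f(xi), xi)
        xj = x_grid[x_grid >= z - theta_i]
        den = F(z - theta_i) * Fbar(z - theta_j) + np.trapz(Fbar(xj + d) * f(xj), xj)
        ratios.append(num / den)
    return min(ratios)

print(lemma2_bound(theta_i=0.6, theta_j=0.2))
```

The computed minimum can be compared against the closed-form bounds derived for specific noise models in Lemma 3 below.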
We next derive Min-AR(i, j) bounds for certain specific noise distributions:
Lemma 3 (Analysing Min-AR for specific noise models). Given a fixed item pair (i, j) such that θ_i > θ_j, the following bounds hold under the respective noise models in an iid RUM.
1. Exponential(λ): Min-AR(i, j) ≥ e^{∆_ij} > 1 + ∆_ij for Exponential noise with λ = 1.
2. Extreme value distribution (µ, σ, ξ): For Gumbel(µ, σ) (ξ = 0) noise, Min-AR(i, j) = e^{∆_ij/σ} > 1 + ∆_ij/σ.
3. Uniform(a, b): Min-AR(i, j) ≥ 1 + ∆_ij/(b − a) for Uniform(a, b) noise (a, b ∈ R, b > a, and ∆_ij < b − a).
4. Gamma(k, ξ): Min-AR(i, j) ≥ 1 + ∆_ij/2 for Gamma(2, 1) noise.
5. Weibull(λ, k): Min-AR(i, j) ≥ e^{λ∆_ij} > 1 + λ∆_ij for k = 1.
6. Normal N(0, 1): For ∆_ij small enough (in a neighborhood of 0), Min-AR(i, j) ≥ 1 + ∆_ij/4.

The proof is given in Appendix A.2.
In this section, we propose an algorithm (Sequential-Pairwise-Battle, Algorithm 1) for the (ε, δ)-PAC objective with winner feedback. We then analyse its correctness and sample complexity guarantee (Theorem 4) for any noise distribution D (under a mild assumption that its Min-AR is bounded away from 1). Following this, we also prove a matching lower bound for the problem, which shows that the sample complexity of Algorithm Sequential-Pairwise-Battle is unimprovable (up to a factor of log k).

5.1 The Sequential-Pairwise-Battle algorithm
Our algorithm is based on the simple idea of dividing the set of n items into sub-groups of size k, querying each subgroup 'sufficiently enough', retaining thereafter only the empirically 'strongest' item of each sub-group, and recursing on the remaining set of items until only one item remains. More specifically, it starts by partitioning the initial item pool into G := ⌈n/k⌉ mutually exclusive and exhaustive sets G_1, G_2, ..., G_G such that ∪_{j=1}^{G} G_j = S and G_j ∩ G_{j'} = ∅ ∀ j, j' ∈ [G], where |G_j| = k ∀ j ∈ [G − 1]. Each set G_g, g ∈ [G], is then queried for t = O((k/ε_ℓ²) ln(k/δ_ℓ)) rounds, and only the 'empirical winner' c_g of each group g is retained in a set S; the rest are discarded. The algorithm then recurses the same procedure on the remaining set of surviving items, until a single item is left, which is then declared the (ε, δ)-PAC best item. Algorithm 1 presents the pseudocode in more detail.
Key idea: The primary novelty here is how the algorithm reasons about the 'strongest' item in each sub-group G_g: it maintains the pairwise preferences of every item pair (i, j) in any sub-group G_g and simply chooses the item that beats every other item of the sub-group with an empirical pairwise preference greater than 1/2 (equivalently, the item that wins the maximum number of subset-wise plays). Our idea of maintaining pairwise preferences is motivated by a similar algorithm proposed in Saha and Gopalan [2019]; however, their performance guarantee applies only to the very specific class of Plackett-Luce feedback models, whereas the novelty of our current analysis reveals the power of maintaining pairwise estimates for the much more general RUM(k, θ) subsetwise model (which includes the Plackett-Luce choice model as a special case). The pseudocode of Sequential-Pairwise-Battle is given in Alg. 1.

The following is our chief result; it proves correctness and a sample complexity bound for Algorithm 1.
Theorem 4 (Sequential-Pairwise-Battle: Correctness and Sample Complexity). Consider any iid subsetwise preference model RUM(k, θ) based on a noise distribution D, and suppose that for any item pair i, j we have Min-AR(i, j) ≥ 1 + 2c∆_ij for some D-dependent constant c > 0. Then Algorithm 1, with input constant c > 0, is an (ε, δ)-PAC algorithm with sample complexity O((n/(c²ε²)) log(k/δ)).

The proof of the result appears in Appendix B.1.
Remark 3.
The linear dependence on the total number of items, n, in effect indicates the price to pay for learning the n unknown model parameters θ = (θ_1, ..., θ_n), which determine the subsetwise winning probabilities of the n items. Remarkably, however, the theorem shows that the PAC sample complexity of the (ε, δ)-best item identification problem, with only winner feedback from k-size subsets, is independent of k. One may expect improved sample complexity as the number of items tested simultaneously in each round grows, but on the other hand the sample complexity could also worsen, since it is also harder for a good item to win and show itself in a few draws against a large population of k − 1 other competitors; these effects roughly balance each other out, and the final sample complexity depends only on the total number of items n and the accuracy parameters (ε, δ).

Note that Lemma 3 gives specific values of the noise-model-dependent constant c > 0, using which we can derive specific sample complexity bounds for certain noise models:

Corollary 5 (Model-specific correctness and sample complexity guarantees). For the following representative noise distributions: Exponential(1), Gumbel(µ, σ), Gamma(2, 1), Uniform(a, b), Weibull(λ, 1), and the standard normal Normal(0, 1), Seq-PB (Alg. 1) finds an (ε, δ)-PAC item with sample complexity O((n/ε²) ln(k/δ)).

Proof sketch. The proof follows from the general performance guarantee of Seq-PB (Thm. 4) and Lem. 3. More specifically, it follows from Lem. 3 that the value of c for these specific distributions is a constant, which concludes the claim. For completeness, the distribution-specific values of c are given in Appendix B.2.

5.2 Sample Complexity Lower Bound

In this section we derive a sample complexity lower bound for any (ε, δ)-PAC algorithm on RUM(k, θ) models with Min-AR(i, j) strictly bounded away from 1 in terms of ∆_ij. Our formal claim goes as follows:

Theorem 6 (Sample Complexity Lower Bound for the RUM(k, θ) model). Given ε ∈ (0, 1], δ ∈ (0, 1), c > 0 and an (ε, δ)-PAC algorithm A with winner item feedback, there exists a RUM(k, θ) instance ν with Min-AR(i, j) ≥ 1 + 2c∆_ij for all i, j ∈ [n], on which the expected sample complexity of A is at least Ω((n/(c²ε²)) ln(1/(2.4δ))).

The proof is given in Appendix B.3. It essentially involves a change-of-measure argument demonstrating a family of Plackett-Luce models (iid Gumbel noise), with the appropriate c value, that cannot easily be teased apart by any learning algorithm.

Comparing this result with the performance guarantee of our proposed algorithm (Theorem 4) shows that the sample complexity of the algorithm is order-wise optimal (up to a log k factor). Moreover, this result also shows that the IIA (independence of irrelevant attributes) property of the Plackett-Luce choice model is not essential for exploiting pairwise preferences via rank breaking, as was claimed in Saha and Gopalan [2019]. Indeed, except for the case of Gumbel noise, none of the RUM(k, θ) based models in Corollary 5 satisfies IIA, but they all respect the O((n/ε²) ln(1/δ)) (ε, δ)-PAC sample complexity guarantee.

Remark 4.
For constant c = O(1), the fundamental sample complexity bound of Theorem 6 resembles that of PAC best arm identification in the standard multi-armed bandit (MAB) problem Even-Dar et al. [2006]. Recall that our problem objective is exactly the same as in MAB; however, our feedback model is very different, since in MAB the learner gets to see noisy rewards/scores (i.e., the exact value of X_i, which can be seen as noisy feedback on the true reward/score θ_i of item i), whereas here the learner only sees k-wise relative preference feedback based on the underlying realized values of X_i, which is a more indirect way of giving feedback on the item scores. Thus, intuitively, our problem objective is at least as hard as the MAB setup.

6 Top-m Ranking (TR) feedback model
We now address our (ε, δ)-PAC item identification problem for the case of more general top-m rank ordered feedback for the RUM(k, θ) model, which generalises both the winner-item (WI) and full ranking (FR) feedback models.

Top-m ranking of items (TR-m): In this feedback setting, the environment is assumed to return a ranking of only m items from among S; i.e., the environment first draws a full ranking σ over S according to RUM(k, θ) as in FR above, and returns the first m ranked elements of σ, i.e., (σ(1), ..., σ(m)). It can be seen that for each permutation σ on a subset S_m ⊂ S, |S_m| = m, we must have Pr(σ|S) = ∏_{i=1}^{m} Pr(X_{σ(i)} > X_{σ(j)}, ∀ j ∈ {i + 1, ..., m}), ∀ σ ∈ Σ_S^m, where Σ_S^m denotes the set of all possible m-length rankings of items in set S; it is easy to note that |Σ_S^m| = (k choose m) m!. Thus, generating such a σ is also equivalent to successively sampling m winners from S according to the underlying RUM, without replacement. It follows that TR reduces to FR when m = k = |S| and to WI when m = 1. Note that the idea of top-m ranking feedback was introduced by Saha and Gopalan [2018b], but only for the specific Plackett-Luce choice model.

6.1 Algorithm for top-m ranking feedback

In this section, we extend the algorithm proposed earlier (Alg. 1) to handle feedback from the general top-m ranking feedback model. Based on the performance analysis of our algorithm (Thm. 7), we show that an m-fold improvement in the sample complexity can be achieved with top-m ranking feedback. We also give a lower bound analysis under this general feedback model (Thm. 8), showing the fundamental performance limit of the problem; the derived lower bound shows optimality of our proposed algorithm mSeq-PB up to logarithmic factors.
Main idea: As with Seq-PB, the algorithm proposed in this section (Alg. 2) in principle follows the same sequential-elimination strategy to find the near-best item of the RUM(k, θ) model based on pairwise preferences. However, we use the idea of rank breaking [Soufiani et al., 2014, Saha and Gopalan, 2018b] to extract the pairwise preferences: formally, given any set S of size k, if σ ∈ Σ_{S_m} (S_m ⊆ S, |S_m| = m) denotes a possible top-m ranking of S, then the Rank-Breaking subroutine considers each item in S to be beaten by its preceding items in σ in a pairwise sense. For instance, given a full ranking of a set of elements S = {a, b, c, d}, say b ≻ a ≻ c ≻ d, Rank-Breaking generates the set of pairwise comparisons {(b ≻ a), (b ≻ c), (b ≻ d), (a ≻ c), (a ≻ d), (c ≻ d)} (a small sketch of this subroutine is given at the end of this section).

As a whole, our new algorithm again divides the set of n items into small groups of size k, say G_1, ..., G_G, G = ⌈n/k⌉, and plays each sub-group for some t = O((k/(mε²)) ln(1/δ)) rounds. Inside any fixed subgroup G_g, after each round of play, it uses Rank-Breaking on the top-m ranking feedback σ ∈ Σ_{G_g}^m to extract (m choose 2) + (k − m)m pairwise comparisons, which are then used to estimate the empirical pairwise preferences p̂_ij for each pair of items i, j ∈ G_g. Based on these pairwise estimates it retains only the strongest item of G_g and recurses the same procedure on the set of surviving items, until just one item is left. The complete algorithm is given in Alg. 2 (Appendix C.1).

Theorem 7 analyses the correctness and sample complexity of mSeq-PB. Note that the sample complexity bound of mSeq-PB with top-m ranking (TR) feedback is 1/m times that of the WI model (Thm. 4). This is justified since, intuitively, revealing a ranking of m items in a k-set provides about m WI feedbacks per round, which essentially leads to the m-factor improvement in the sample complexity.

Theorem 7 (mSeq-PB (Alg. 2): Correctness and Sample Complexity). Consider any RUM(k, θ) subsetwise preference model based on a noise distribution D, and suppose for any item pair i, j we have Min-AR(i, j) ≥ 1 + 2c∆_ij for some D-dependent constant c > 0. Then mSeq-PB (Alg. 2), with input constant c > 0 and run under the top-m ranking feedback model, is an (ε, δ)-PAC algorithm with sample complexity O((n/(mc²ε²)) log(k/δ)).

The proof is given in Appendix C.2. Similar to Cor. 5, for the top-m model we can again derive specific sample complexity bounds for different noise distributions, e.g., Exponential, Gumbel, Gaussian, Uniform, Gamma, etc.

6.2 Lower bound for top-m ranking feedback

In this section, we analyze the fundamental sample complexity limit for any (ε, δ)-PAC algorithm for the RUM(k, θ) model with top-m ranking feedback.

Theorem 8 (Sample Complexity Lower Bound for RUM(k, θ) with TR-m feedback). Given ε ∈ (0, 1], δ ∈ (0, 1), c > 0 and an (ε, δ)-PAC algorithm A with top-m ranking feedback, there exists a RUM(k, θ) instance ν with Min-AR(i, j) ≥ 1 + 2c∆_ij for every pair i, j ∈ [n], on which the expected sample complexity of A has to be at least Ω((n/(mc²ε²)) ln(1/(2.4δ))) for A to be (ε, δ)-PAC.

The proof is given in Appendix C.3. As in the case of winner feedback, comparing Theorem 7 with the above result shows that the sample complexity of mSeq-PB is order-wise optimal (up to logarithmic factors) for the general case of top-m ranking feedback as well.
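For concreteness, the following small Python sketch (our illustration; the function names, parameters and noise choice are ours) simulates one round of TR-m feedback by successive winner draws without replacement and then applies the Rank-Breaking step described above to extract pairwise comparisons:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def top_m_feedback(theta, subset, m, noise_sampler):
    """One round of TR-m feedback: the m highest-scoring items of `subset`, in order.
    Equivalent to successively sampling m winners from the subset without replacement."""
    idx = np.array(subset)
    scores = theta[idx] + noise_sampler(len(idx))
    return idx[np.argsort(-scores)][:m].tolist()

def rank_break(sigma, subset):
    """Rank-Breaking: each ranked item beats every item ranked after it in sigma,
    and also beats every unranked item of the subset."""
    pairs = list(itertools.combinations(sigma, 2))                 # (winner, loser) within the ranking
    unranked = [i for i in subset if i not in sigma]
    pairs += [(w, l) for w in sigma for l in unranked]             # ranked items beat unranked ones
    return pairs

theta = np.array([0.9, 0.5, 0.4, 0.1])
gumbel = lambda size: rng.gumbel(0.0, 1.0, size)
sigma = top_m_feedback(theta, subset=[0, 1, 2, 3], m=2, noise_sampler=gumbel)
print(sigma, rank_break(sigma, subset=[0, 1, 2, 3]))
```

With m ranked items in a k-set, this yields the (m choose 2) + (k − m)m pairwise comparisons counted above.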
7 Conclusion

We have identified a new principle for learning from general subset-wise preference feedback in general iid RUMs: rank breaking followed by pairwise comparisons. This has been made possible by extending the concept of pairwise advantage from the popular Plackett-Luce choice model to much more general RUMs, and by showing that the IIA property that Plackett-Luce models enjoy is not essential to obtain optimal sample complexity.

Our results suggest several interesting directions for future investigation, namely the possibility of considering correlated noise models (making the RUM more general), explicitly modeling the dependence of samples on item features or attributes, other performance objectives such as regret for online utility optimization, and extensions to learning with relative preferences in time-correlated settings like Markov Decision Processes.

References
Nir Ailon, Zohar Shay Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In ICML, volume 32, pages 856–864, 2014.
Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, 2010.
Hossein Azari, David Parkes, and Lirong Xia. Random utility theory for social choice. In Advances in Neural Information Processing Systems, pages 126–134, 2012.
Chester I Bliss. The method of probits. Science, 1934.
Róbert Busa-Fekete, Balázs Szörényi, Weiwei Cheng, Paul Weng, and Eyke Hüllermeier. Top-k selection based on adaptive sampling of noisy preferences. In International Conference on Machine Learning, pages 1094–1102, 2013.
Róbert Busa-Fekete, Eyke Hüllermeier, and Balázs Szörényi. Preference-based rank elicitation using statistical models: The case of Mallows. In Proceedings of The 31st International Conference on Machine Learning, volume 32, 2014a.
Róbert Busa-Fekete, Balázs Szörényi, and Eyke Hüllermeier. PAC rank elicitation through adaptive sampling of stochastic pairwise preferences. In AAAI, pages 1701–1707, 2014b.
Xi Chen, Sivakanth Gopi, Jieming Mao, and Jon Schneider. Competitive analysis of the top-k ranking problem. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1245–1264. SIAM, 2017.
Xi Chen, Yuanzhi Li, and Jieming Mao. A nearly instance optimal algorithm for top-k ranking under the multinomial logit model. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2504–2522. SIAM, 2018.
Antoine Désir, Vineet Goyal, Srikanth Jagabathula, and Danny Segev. Assortment optimization under the Mallows model. In Advances in Neural Information Processing Systems, pages 4700–4708, 2016a.
Antoine Désir, Vineet Goyal, Danny Segev, and Chun Ye. Capacity constrained assortment optimization under the Markov chain based choice model. Operations Research, 2016b.
Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
Moein Falahatgar, Yi Hao, Alon Orlitsky, Venkatadheeraj Pichapati, and Vaishakh Ravindrakumar. Maxing and ranking with few assumptions. In Advances in Neural Information Processing Systems, pages 7063–7073, 2017.
Pratik Gajane, Tanguy Urvoy, and Fabrice Clérot. A relative exponential weighing algorithm for adversarial utility-based dueling bandits. In Proceedings of the 32nd International Conference on Machine Learning, pages 218–227, 2015.
Javier González, Zhenwen Dai, Andreas Damianou, and Neil D. Lawrence. Preferential Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1282–1291. JMLR.org, 2017.
Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvari, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 423–439. PMLR, 2014.
Minje Jang, Sunghyun Kim, Changho Suh, and Sewoong Oh. Optimal sample complexity of m-wise data for top-k ranking. In Advances in Neural Information Processing Systems, pages 1685–1695, 2017.
Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655–662, 2012.
Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246, 2013.
Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
Ashish Khetan and Sewoong Oh. Data-driven rank breaking for efficient rank aggregation. Journal of Machine Learning Research, 17(193):1–54, 2016.
R Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2012.
Soheil Mohajer, Changho Suh, and Adel Elmahdy. Active learning for top-k rank aggregation from noisy comparisons. In International Conference on Machine Learning, pages 2488–2497, 2017.
Kameng Nip, Zhenbo Wang, and Zizhuo Wang. Assortment optimization under a single transition model. 2017.
Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202, 1975.
Pantelimon G Popescu, Silvestru Dragomir, Emil I Slusanschi, and Octavian N Stanasila. Bounds for Kullback-Leibler divergence. Electronic Journal of Differential Equations, 2016, 2016.
Wenbo Ren, Jia Liu, and Ness B Shroff. PAC ranking from pairwise and listwise queries: Lower bounds and upper bounds. arXiv preprint arXiv:1806.02970, 2018.
Aadirupa Saha and Aditya Gopalan. Battle of bandits. In Uncertainty in Artificial Intelligence, 2018a.
Aadirupa Saha and Aditya Gopalan. Active ranking with subset-wise preferences. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018b.
Aadirupa Saha and Aditya Gopalan. PAC battling bandits in the Plackett-Luce model. In Algorithmic Learning Theory, pages 700–737, 2019.
Hossein Azari Soufiani, Hansheng Diao, Zhenyu Lai, and David C Parkes. Generalized random utility models with multiple types. In Advances in Neural Information Processing Systems, pages 73–81, 2013.
Hossein Azari Soufiani, David C Parkes, and Lirong Xia. Computing parametric ranking models via rank-breaking. In ICML, pages 360–368, 2014.
Yanan Sui, Vincent Zhuang, Joel W Burdick, and Yisong Yue. Multi-dueling bandits with dependent arms. arXiv preprint arXiv:1705.00253, 2017.
Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, pages 604–612, 2015.
Kalyan Talluri and Garrett Van Ryzin. Revenue management under a general discrete choice model of consumer behavior. Management Science, 50(1):15–33, 2004.
Louis L Thurstone. A law of comparative judgment. Psychological Review, 34(4):273, 1927.
Tanguy Urvoy, Fabrice Clerot, Raphael Féraud, and Sami Naamane. Generic exploration and k-armed voting bandits. In International Conference on Machine Learning, pages 91–99, 2013.
Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208. ACM, 2009.
Yisong Yue and Thorsten Joachims. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 241–248, 2011.
Zhibing Zhao, Tristan Villamil, and Lirong Xia. Learning mixtures of random utility models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. arXiv preprint arXiv:1312.3393, 2013.
Masrour Zoghi, Shimon Whiteson, Remi Munos, Maarten de Rijke, et al. Relative upper confidence bound for the k-armed dueling bandit problem. In JMLR Workshop and Conference Proceedings, number 32, pages 10–18. JMLR, 2014.
Masrour Zoghi, Shimon Whiteson, and Maarten de Rijke. MergeRUCB: A method for large-scale online ranker evaluation. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 17–26. ACM, 2015.
Algorithm 1 Sequential-Pairwise-Battle (Seq-PB)

Input: Set of items: [n]; subset size: n ≥ k > 1; error bias: ε > 0; confidence parameter: δ > 0; noise model (D) dependent constant c > 0
Initialize: S ← [n], ε_1 ← cε/8, and δ_1 ← δ/2
  Divide S into G := ⌈n/k⌉ sets G_1, G_2, ..., G_G such that ∪_{j=1}^{G} G_j = S and G_j ∩ G_{j'} = ∅ ∀ j, j' ∈ [G], where |G_j| = k ∀ j ∈ [G − 1]
  If |G_G| < k, then set R_1 ← G_G and G ← G − 1
while ℓ = 1, 2, ... do
  Set S ← ∅, δ_ℓ ← δ_{ℓ−1}/2, ε_ℓ ← (3/4) ε_{ℓ−1}
  for g = 1, 2, ..., G do
    Play the set G_g for t := ⌈(8k/ε_ℓ²) ln(k/δ_ℓ)⌉ rounds
    w_i ← number of times i won in the t plays of G_g, ∀ i ∈ G_g
    Set c_g ← argmax_{i ∈ G_g} w_i and S ← S ∪ {c_g}
  end for
  S ← S ∪ R_ℓ
  if (|S| == 1) then
    Break (exit the while loop)
  else if |S| ≤ k then
    S' ← randomly sample k − |S| items from [n] \ S, and set S ← S ∪ S', ε_ℓ ← cε/2, δ_ℓ ← δ/2
  else
    Divide S into G := ⌈|S|/k⌉ sets G_1, G_2, ..., G_G such that ∪_{j=1}^{G} G_j = S and G_j ∩ G_{j'} = ∅ ∀ j, j' ∈ [G], where |G_j| = k ∀ j ∈ [G − 1]
    If |G_G| < k, then set R_{ℓ+1} ← G_G and G ← G − 1
  end if
end while
Output: The unique item left in S
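For concreteness, a compact Python simulation of Seq-PB under winner feedback might look as follows. This is our own illustrative sketch, not the authors' code: the noise sampler, the utility vector and the demonstration parameters are assumptions, and leftover items are simply played as a smaller group rather than being carried over as the set R_ℓ in the pseudocode.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def play_group(theta, group, t, noise_sampler):
    """Play `group` for t rounds of winner (WI) feedback; return per-item win counts."""
    idx = np.array(group)
    scores = theta[idx] + noise_sampler((t, len(group)))        # X_i = theta_i + zeta_i per round
    won, counts = np.unique(np.argmax(scores, axis=1), return_counts=True)
    wins = dict(zip(idx[won].tolist(), counts.tolist()))
    return {i: wins.get(i, 0) for i in group}

def seq_pb(theta, k, eps, delta, c, noise_sampler):
    """Sequential-Pairwise-Battle sketch: hierarchical elimination of empirical losers."""
    survivors = list(range(len(theta)))
    eps_l, delta_l = c * eps / 8.0, delta / 2.0
    while len(survivors) > 1:
        if len(survivors) <= k:                                  # final round: pad up to size k
            pad = [i for i in range(len(theta)) if i not in survivors]
            groups = [survivors + pad[: k - len(survivors)]]
            eps_l, delta_l = c * eps / 2.0, delta / 2.0
        else:                                                    # partition into size-k groups
            groups = [survivors[i:i + k] for i in range(0, len(survivors), k)]
        t = math.ceil(8 * k / eps_l ** 2 * math.log(k / delta_l))
        survivors = [max(g, key=play_group(theta, g, t, noise_sampler).get) for g in groups]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return survivors[0]

theta = np.array([0.8, 0.6, 0.4, 0.2, 0.1, 0.0])                 # hypothetical utilities; item 0 is best
gumbel = lambda size: rng.gumbel(0.0, 1.0, size)
print(seq_pb(theta, k=3, eps=0.5, delta=0.1, c=0.5, noise_sampler=gumbel))
```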
Supplementary for Best-item Learning in Random Utility Models with Subset Choices

A Appendix for Section 4

A.1 Proof of Lemma 2
Lemma 2 (Variational lower bound for the advantage ratio). For any RUM(k, θ) based subsetwise preference model and any item pair (i, j),

Min-AR(i, j) ≥ min_{z ∈ R} Pr(X_i > max(X_j, z)) / Pr(X_j > max(X_i, z)).   (3)

Moreover, for RUM(k, θ) models one can show that for any pair (i, j) and any z ∈ R, Pr(X_i > max(X_j, z)) = F(z − θ_j) F̄(z − θ_i) + ∫_{z − θ_j}^{∞} F̄(x − ∆_ij) f(x) dx, which further lower bounds Min-AR(i, j) by

min_{z ∈ R} [F(z − θ_j) F̄(z − θ_i) + ∫_{z − θ_j}^{∞} F̄(x − ∆_ij) f(x) dx] / [F(z − θ_i) F̄(z − θ_j) + ∫_{z − θ_i}^{∞} F̄(x + ∆_ij) f(x) dx].

(We take a ratio with zero denominator to be ∞ in the right-hand side of Eqn. (3).)
Proof. Let us fix any subset S and consider two items i, j ∈ S such that θ_i > θ_j; recall that ∆_ij = (θ_i − θ_j). Define the random variable X_r^S := max_{r ∈ S \ {i,j}} X_r, the maximum score taken by the rest of the items in S. Note that supp(X_r^S) = max_{r ∈ S \ {i,j}} θ_r + supp(D). Let us also denote S̄ := argmin_{S ⊆ [n], |S| = k, S ∋ i,j} Pr(i|S)/Pr(j|S). We have

Min-AR(i, j) = Pr(i|S̄) / Pr(j|S̄)
= Pr({X_i > X_j} ∩ {X_i > X_r ∀ r ∈ S̄ \ {i, j}}) / Pr({X_j > X_i} ∩ {X_j > X_r ∀ r ∈ S̄ \ {i, j}})
= Pr({X_i > X_j} ∩ {X_i > X_r^{S̄}}) / Pr({X_j > X_i} ∩ {X_j > X_r^{S̄}})
= [∫_{supp(X_r^{S̄})} Pr({X_i > x} ∩ {X_i > X_j}) f_{X_r^{S̄}}(x) dx] / [∫_{supp(X_r^{S̄})} Pr({X_j > x} ∩ {X_j > X_i}) f_{X_r^{S̄}}(x) dx]
≥ min_{z ∈ supp(X_r^{S̄})} [Pr({X_i > z} ∩ {X_i > X_j}) / Pr({X_j > z} ∩ {X_j > X_i})]
= min_{z ∈ supp(X_r^{S̄})} Pr(X_i > max(X_j, z)) / Pr(X_j > max(X_i, z))
≥ min_{z ∈ R} Pr(X_i > max(X_j, z)) / Pr(X_j > max(X_i, z)),

where the first inequality follows since the ratio of the two integrals is at least the minimum, over x, of the ratio of their integrands. Let us now denote Y = max(X_j, z). Owing to the independent and identically distributed noise assumption of the RUM(k, θ) model, we can further show that

Pr(X_i > max(X_j, z)) = Pr(X_i > Y) = Pr({X_i > Y} ∩ {Y = z}) + Pr({X_i > Y} ∩ {Y > z})
= Pr(X_i > z) Pr(X_j < z) + Pr({X_i > X_j} ∩ {X_j > z})
= Pr(ζ_i > z − θ_i) Pr(ζ_j < z − θ_j) + Pr({ζ_i > ζ_j − (θ_i − θ_j)} ∩ {ζ_j > z − θ_j})
= F̄(z − θ_i) F(z − θ_j) + ∫_{z − θ_j}^{∞} F̄(x − ∆_ij) f(x) dx,

which proves the claim.

A.2 Proof of Lemma 3
Lemma 3 (Analysing Min-AR for specific noise models). Given a fixed item pair (i, j) such that θ_i > θ_j, the following bounds hold under the respective noise models in an iid RUM.
1. Exponential(λ): Min-AR(i, j) ≥ e^{∆_ij} > 1 + ∆_ij for Exponential noise with λ = 1.
2. Extreme value distribution (µ, σ, ξ): For Gumbel(µ, σ) (ξ = 0) noise, Min-AR(i, j) = e^{∆_ij/σ} > 1 + ∆_ij/σ.
3. Uniform(a, b): Min-AR(i, j) ≥ 1 + ∆_ij/(b − a) for Uniform(a, b) noise (a, b ∈ R, b > a, and ∆_ij < b − a).
4. Gamma(k, ξ): Min-AR(i, j) ≥ 1 + ∆_ij/2 for Gamma(2, 1) noise.
5. Weibull(λ, k): Min-AR(i, j) ≥ e^{λ∆_ij} > 1 + λ∆_ij for k = 1.
6. Normal N(0, 1): For ∆_ij small enough (in a neighborhood of 0), Min-AR(i, j) ≥ 1 + ∆_ij/4.
Proof. We can derive the Min-AR(i, j) bounds for the first five distributions by directly applying the lower bound formula of Lemma 2, namely min_{z ∈ R} [F(z − θ_j) F̄(z − θ_i) + ∫_{z − θ_j}^{∞} F̄(x − ∆_ij) f(x) dx] / [F(z − θ_i) F̄(z − θ_j) + ∫_{z − θ_i}^{∞} F̄(x + ∆_ij) f(x) dx], together with the specific density functions stated below for each distribution:
1. Exponential noise: When the noise distribution D is Exponential(1), i.e., ζ_i, ζ_j iid∼ Exponential(1), note that f(x) = e^{−x}, F(x) = 1 − e^{−x}, and support(D) = [0, ∞).

2. Gumbel noise: When the noise distribution D is Gumbel(µ, σ), i.e., ζ_i, ζ_j iid∼ Gumbel(µ, σ), note that f(x) = (1/σ) e^{−(x − µ)/σ} e^{−e^{−(x − µ)/σ}}, F(x) = e^{−e^{−(x − µ)/σ}}, and support(D) = (−∞, ∞).

3. Uniform noise: When the noise distribution D is Uniform(a, b), i.e., ζ_i, ζ_j iid∼ Uniform(a, b), note that f(x) = 1/(b − a), F(x) = (x − a)/(b − a), and support(D) = [a, b].

4. Gamma noise: When the noise distribution D is Gamma(k, ξ) with k = 2 and ξ = 1, i.e., ζ_i, ζ_j iid∼ Gamma(2, 1), note that f(x) = x e^{−x}, F(x) = 1 − e^{−x} − x e^{−x}, and support(D) = [0, ∞).

5. Weibull noise: When the noise distribution D is Weibull(λ, k) with k = 1, i.e., ζ_i, ζ_j iid∼ Weibull(λ, 1), note that f(x) = λ e^{−λx}, F(x) = 1 − e^{−λx}, and support(D) = [0, ∞).
6. Argument for the Gaussian noise case.
Note that Gaussian distributions do not have closed-form CDFs, and the above formula is difficult to evaluate in general, so we give a different line of analysis specifically for the Gaussian noise case. Take the noise distribution to be standard normal, i.e., ζ_i, ζ_j iid∼ N(0, 1), with density f(x) = (1/√(2π)) e^{−x²/2}. When X_i = θ_i + ζ_i and X_j = θ_j + ζ_j with ∆_ij = θ_i − θ_j > 0, we seek a lower bound on inf_{z ∈ R} Pr(X_i > max(X_j, z)) / Pr(X_j > max(X_i, z)).

First, note that by translation we can take θ_j = 0 and θ_i = ∆ without loss of generality. Doing so allows us to write

Pr(X_i > max(X_j, z)) = F(z)(1 − F(z − ∆)) + ∫_{z}^{∞} (1 − F(y − ∆)) f(y) dy ≡ g(∆, z),

and likewise (exchanging the roles of i and j and translating),

Pr(X_j > max(X_i, z)) = F(z − ∆)(1 − F(z)) + ∫_{z − ∆}^{∞} (1 − F(y + ∆)) f(y) dy ≡ g(−∆, z − ∆).

With this notation, we wish to lower bound the ratio g(∆, z)/g(−∆, z − ∆) over z ∈ R. Notice that ∂g(∆, z)/∂∆ = F(z) f(z − ∆) + ∫_{z}^{∞} f(y − ∆) f(y) dy. Hence, up to first order, for ∆ small enough we have

g(∆, z)/g(−∆, z − ∆) ≈ [g(0, z) + ∆ ∂g(∆, z)/∂∆|_{∆=0}] / [g(0, z − ∆) − ∆ ∂g(∆̃, z − ∆)/∂∆̃|_{∆̃=0}] = [g(0, z) + ∆ (F(z) f(z) + ∫_{z}^{∞} f(y)² dy)] / [g(0, z − ∆) − ∆ (F(z − ∆) f(z − ∆) + ∫_{z − ∆}^{∞} f(y)² dy)] ≡ h_1(z)/h_2(z), say.

(The argument can be made rigorous using the Taylor expansion up to second order.) Differentiating the ratio and setting the derivative to 0 to find its minimizer z*, we obtain the condition h_1'(z*) h_2(z*) = h_1(z*) h_2'(z*), which for ∆ small is (approximately) satisfied at z* ≈ ∆/2. Evaluating the ratio at z* ≈ ∆/2, using f(∆/2), f(−∆/2) ≈ 1/√(2π) and ∫ f(y)² dy ≤ 1/(2√π), gives h_1(z*)/h_2(z*) ≥ 1 + ∆/4 for ∆ small enough.
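As a sanity check on this first-order argument (our own numerical illustration, not part of the original proof), one can evaluate the exact ratio g(∆, z)/g(−∆, z − ∆) on a grid of z for standard normal noise and small ∆, using scipy for the normal CDF and numerical integration for the tail term:

```python
import numpy as np
from scipy.stats import norm

def g(delta, z, grid=np.linspace(0.0, 12.0, 4000)):
    """g(delta, z) = Pr(X_i > max(X_j, z)) for theta_j = 0, theta_i = delta, N(0,1) noise."""
    y = z + grid
    tail = np.trapz(norm.sf(y - delta) * norm.pdf(y), y)    # integral_z^inf (1 - F(y - delta)) f(y) dy
    return norm.cdf(z) * norm.sf(z - delta) + tail

for delta in (0.05, 0.1, 0.2):
    zs = np.linspace(-4.0, 4.0, 161)
    ratio = min(g(delta, z) / g(-delta, z - delta) for z in zs)
    print(f"delta={delta:.2f}  min_z ratio ~ {ratio:.4f}")
```

The printed minima grow linearly in ∆ for small ∆, in line with the 1 + Ω(∆) behaviour asserted in Lemma 3.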
B Appendix for Section 5.1

B.1 Proof of Theorem 4
Theorem 4 (Sequential-Pairwise-Battle: Correctness and Sample Complexity). Consider any iid subsetwise preference model RUM(k, θ) based on a noise distribution D, and suppose that for any item pair i, j we have Min-AR(i, j) ≥ 1 + 2c∆_ij for some D-dependent constant c > 0. Then Algorithm 1, with input constant c > 0, is an (ε, δ)-PAC algorithm with sample complexity O((n/(c²ε²)) log(k/δ)).

Proof. We start by analyzing the required sample complexity of Sequential-Pairwise-Battle.
Note that at any iteration ℓ, each set G_g is played for exactly t = ⌈(8k/ε_ℓ²) ln(k/δ_ℓ)⌉ rounds. Also, since the algorithm retains exactly one item from each set G_g (discarding the rest), the maximum number of iterations possible is ⌈ln_k n⌉. At any iteration ℓ, at most ⌈|S_ℓ|/k⌉ ≤ 2|S_ℓ|/k groups are played and |S_ℓ| ≤ n/k^{ℓ−1}, so the sample complexity of iteration ℓ is at most (2|S_ℓ|/k) t ≤ (2n/k^{ℓ−1})(8/ε_ℓ²) ln(k/δ_ℓ). For all but the last iteration we have ε_ℓ = (cε/8)(3/4)^{ℓ−1} and δ_ℓ = δ/2^ℓ, while for the last iteration ℓ = ⌈ln_k n⌉ the sample complexity is clearly t = ⌈(32k/(c²ε²)) ln(2k/δ)⌉, since in that case ε_ℓ = cε/2, δ_ℓ = δ/2 and |S| ≤ k. Thus, the total sample complexity of Algorithm 1 is at most

Σ_{ℓ ≥ 1} (2n/k^{ℓ−1})(8/ε_ℓ²) ln(k/δ_ℓ) + (32k/(c²ε²)) ln(2k/δ) ≤ (1024 n/(c²ε²)) Σ_{ℓ ≥ 1} (16/(9k))^{ℓ−1} (ln(k/δ) + ℓ ln 2) + (32k/(c²ε²)) ln(2k/δ) = O((n/(c²ε²)) ln(k/δ)),

where the series converges for any k ≥ 2.

We next prove the (ε, δ)-PAC property of Sequential-Pairwise-Battle. Consider any fixed subgroup G of size k and two items a, b ∈ G. Denote by Pr({a, b}|G) = Pr(a|G) + Pr(b|G) the probability that either a or b wins in the subset G. Then the probability that a wins in G, given that either a or b won in G, is p_{ab|G} := Pr(a|G)/Pr({a, b}|G) = Pr(a|G)/(Pr(a|G) + Pr(b|G)); this quantity models the pairwise preference of a over b within the set G. Note that as long as θ_a > θ_b, we have p_{ab|G} > 1/2 for any G (since Pr(a|G) > Pr(b|G)). We also introduce the notation p_ab := min_{G ⊆ [n], |G| = k} p_{ab|G}.

Lemma 9.
For any item pair i, j ∈ [n] and any set S ⊆ [n] with i, j ∈ S, if the advantage ratio satisfies Pr(i|S)/Pr(j|S) ≥ 1 + α for some α > 0, then the pairwise preference of item i over j in the set S satisfies p_{ij|S} ≥ 1/2 + α/(2(2 + α)).

Proof. Note that
Pr(i|S)/Pr(j|S) ≥ 1 + α implies (Pr(i|S) − Pr(j|S))/(Pr(i|S) + Pr(j|S)) ≥ α/(2 + α), since the map x ↦ (x − 1)/(x + 1) is increasing on (0, ∞). Hence

p_{ij|S} − 1/2 = (Pr(i|S) − Pr(j|S))/(2(Pr(i|S) + Pr(j|S))) ≥ α/(2(2 + α)),

which concludes the proof.

Corollary 10.
For any item pair i, j ∈ [n], if Min-AR(i, j) ≥ 1 + α for some α > 0, then p_ij ≥ 1/2 + α/(2(2 + α)).

Proof. The claim follows directly from Lemma 9, applied to the subset S that attains the minimum in (2).

Let us denote the set of surviving items S at the beginning of phase ℓ by S_ℓ. We now claim the following crucial lemma, which shows that at any phase ℓ, the best item (the one with the highest θ parameter) retained in S_{ℓ+1} cannot be much worse than the best item of S_ℓ. The formal claim goes as follows:

Lemma 11.
At any iteration ℓ, for any G_g, if i_g := argmax_{i ∈ G_g} θ_i, then with probability at least (1 − δ_ℓ), θ_{c_g} > θ_{i_g} − ε_ℓ/c.

Proof. Let us define p̂_ij = w_i/(w_i + w_j) for all i, j ∈ G_g, i ≠ j. Then clearly p̂_{c_g i_g} ≥ 1/2, since c_g is the empirical winner over the t rounds, i.e., c_g = argmax_{i ∈ G_g} w_i. Moreover, c_g being the empirical winner of G_g, we also have w_{c_g} ≥ t/k, and thus w_{c_g} + w_{i_g} ≥ t/k as well. Let n_ij := w_i + w_j denote the number of pairwise comparisons of items i and j in the t rounds, i, j ∈ G_g; clearly 0 ≤ n_ij ≤ t. Let us analyze the probability of the 'bad event' in which c_g is such that θ_{c_g} < θ_{i_g} − ε_ℓ/c. On this event ∆_{i_g c_g} > ε_ℓ/c, so the advantage ratio of i_g over c_g in G satisfies Pr(i_g|G)/Pr(c_g|G) ≥ 1 + 2c∆_{i_g c_g} > 1 + 2ε_ℓ. By Lemma 9, this further implies p_{i_g c_g|G} > 1/2 + ε_ℓ/4 (using ε_ℓ ≤ 1), i.e., p_{c_g i_g|G} < 1/2 − ε_ℓ/4. But since c_g beats i_g empirically in the subgroup G, we have p̂_{c_g i_g} ≥ 1/2. The following argument shows that this is unlikely, more precisely that it happens with probability at most δ_ℓ/k:

Pr(p̂_{c_g i_g} ≥ 1/2)
= Pr({p̂_{c_g i_g} ≥ 1/2} ∩ {n_{c_g i_g} ≥ t/k}) + Pr({n_{c_g i_g} < t/k}) Pr({p̂_{c_g i_g} ≥ 1/2} | {n_{c_g i_g} < t/k})
= Pr({p̂_{c_g i_g} − p_{c_g i_g|G} ≥ 1/2 − p_{c_g i_g|G}} ∩ {n_{c_g i_g} ≥ t/k})
≤ Pr({p̂_{c_g i_g} − p_{c_g i_g|G} ≥ ε_ℓ/4} ∩ {n_{c_g i_g} ≥ t/k})
≤ exp(−2 (t/k)(ε_ℓ/4)²) ≤ δ_ℓ/k,

where the second equality uses n_{c_g i_g} ≥ w_{c_g} ≥ t/k, the first inequality holds since p_{c_g i_g|G} < 1/2 − ε_ℓ/4, and the last two follow from Hoeffding's inequality and the choice of t. Now, taking a union bound over all ε_ℓ-suboptimal elements i' of G_g (i.e., those with θ_{i'} < θ_{i_g} − ε_ℓ/c), we get

Pr(∃ i' ∈ G_g with θ_{i'} < θ_{i_g} − ε_ℓ/c and c_g = i') ≤ (δ_ℓ/k) |{i' ∈ G_g : θ_{i'} < θ_{i_g} − ε_ℓ/c}| ≤ δ_ℓ,

since |G_g| = k, and the claim follows.

Let us denote the single element remaining in S at termination by r ∈ [n]. Note that for the last iteration ℓ = ⌈ln_k n⌉, since ε_ℓ = cε/2 and δ_ℓ = δ/2, applying Lemma 11 to S gives Pr(θ_r < θ_{i_g} − ε/2) ≤ δ/2.

Without loss of generality, assume the best item of the RUM(k, θ) model is item 1, i.e., θ_1 > θ_i ∀ i ∈ [n] \ {1}. For any iteration ℓ, define g_ℓ ∈ [G] to be the index of the group containing the best item of the entire surviving set S_ℓ, i.e., argmax_{i ∈ S_ℓ} θ_i ∈ G_{g_ℓ}. Applying Lemma 11, with probability at least (1 − δ_ℓ), θ_{c_{g_ℓ}} > θ_{i_{g_ℓ}} − ε_ℓ/c. Note that initially, at phase ℓ = 1, i_{g_1} = 1. Then, applying Lemma 11 recursively over the iterations ℓ, we finally get

θ_r > θ_1 − (ε_1 + ε_1(3/4) + ··· + ε_1(3/4)^{⌊ln_k n⌋})/c − ε/2 ≥ θ_1 − (ε/8) Σ_{i=0}^{∞} (3/4)^i − ε/2 ≥ θ_1 − ε.
Thus, assuming the algorithm does not fail in any iteration $\ell$, we finally have $\theta_r > \theta_1 - \epsilon$; this shows that the final item output by Seq-PB is $\epsilon$-optimal. Finally, since at any phase $\ell$ the algorithm fails with probability at most $\delta_\ell$, the total failure probability of the algorithm is at most $\big(\delta_1 + \delta_2 + \cdots + \delta_{\lceil \ln_k n \rceil}\big) + \frac{\delta}{2} \le \delta$. This concludes the correctness of the algorithm, showing that it indeed satisfies the $(\epsilon, \delta)$-PAC objective.
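To make the quantities analyzed in Lemma 11 concrete, the following minimal simulation sketch plays one group of $k$ items for $t$ rounds of winner feedback and returns the empirical winner $c_g$. It assumes Gumbel noise (i.e. a Plackett-Luce instance of the RUM); the function names (`play_winner`, `run_phase`) and the numerical values are illustrative, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def play_winner(theta_group):
    """One round of winner feedback from a RUM with Gumbel noise
    (Plackett-Luce): the item with the largest perturbed utility wins."""
    utilities = theta_group + rng.gumbel(size=len(theta_group))
    return int(np.argmax(utilities))

def run_phase(theta_group, t):
    """Play the group for t rounds and return the empirical winner c_g.
    w[i] counts the rounds in which item i won (winner-vs-rest counts)."""
    w = np.zeros(len(theta_group))
    for _ in range(t):
        w[play_winner(theta_group)] += 1
    return int(np.argmax(w)), w

# Toy check on a single group of k = 8 items.
theta_group = np.array([0.8, 0.5, 0.45, 0.4, 0.3, 0.2, 0.1, 0.0])
c_g, w = run_phase(theta_group, t=2000)
i_g = int(np.argmax(theta_group))
print("empirical winner:", c_g, "gap to best:", theta_group[i_g] - theta_group[c_g])
```

With $t$ of the order prescribed by the algorithm, the printed gap falls below $\epsilon_\ell/c$ with probability at least $1 - \delta_\ell$, which is exactly the event bounded in Lemma 11.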
B.2 Proof of Corollary 5

Proof.
The proof essentially follows from the general performance guarantee of Seq-PB (Thm. 4) together with Lem. 3. More specifically, Lem. 3 shows that the value of $c$ for each of these specific noise distributions is a constant (depending only on the distribution's own parameters), which concludes the claim. For completeness, the distribution-specific scalings of $c$ are listed below:

1. a constant $c$ for Exponential noise with $\lambda = 1$;
2. $c \propto \frac{1}{\sigma}$ for Gumbel$(\mu, \sigma)$;
3. $c \propto \frac{1}{b - a}$ for Uniform$(a, b)$;
4. a constant $c$ for Gamma$(2, \lambda)$;
5. $c \propto \lambda$ for Weibull$(\lambda, k)$;
6. a constant $c$ for the standard Normal $\mathcal{N}(0, 1)$, etc.
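The qualitative claim, that the sensitivity of the advantage ratio to the parameter gap settles to a distribution-dependent constant, can be checked empirically. The sketch below is illustrative only: the noise choices, the grid of gaps and the Monte Carlo setup are assumptions, not the constants of Lem. 3. It estimates the advantage ratio $\Pr(i \mid S)/\Pr(j \mid S)$ within a single $k$-subset and reports the implied slope $(\text{AR} - 1)/\Delta_{ij}$ for a few gaps $\Delta_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(1)

NOISES = {
    "gumbel":  lambda size: rng.gumbel(size=size),
    "normal":  lambda size: rng.normal(size=size),
    "uniform": lambda size: rng.uniform(-1.0, 1.0, size=size),
}

def win_probs(theta, noise, rounds=200_000):
    """Monte Carlo estimate of Pr(i | S) for every i in a subset with utilities theta."""
    samples = theta[None, :] + noise(size=(rounds, len(theta)))
    winners = np.argmax(samples, axis=1)
    return np.bincount(winners, minlength=len(theta)) / rounds

k = 5
for name, noise in NOISES.items():
    for gap in (0.1, 0.2, 0.4):
        # item 0 has utility `gap`, the remaining k-1 items have utility 0
        theta = np.array([gap] + [0.0] * (k - 1))
        p = win_probs(theta, noise)
        adv_ratio = p[0] / p[1]   # advantage of item 0 over a zero-utility item
        print(f"{name:8s} gap={gap:.1f}  AR={adv_ratio:.3f}  slope~{(adv_ratio - 1) / gap:.2f}")
```

For each noise model the estimated slope should level off to a distribution-dependent constant as the gap shrinks, which is the behaviour Corollary 5 relies on.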
Before proving the lower bound we state a key lemma from Kaufmann et al. [2016], a general change-of-measure result for proving information-theoretic lower bounds for bandit problems. Consider a multi-armed bandit (MAB) problem with $n$ arms or actions $\mathcal{A} = [n]$. At round $t$, let $A_t$ and $Z_t$ denote the arm played and the observation (reward) received, respectively, and let $\mathcal{F}_t = \sigma(A_1, Z_1, \ldots, A_t, Z_t)$ be the sigma algebra generated by the trajectory of a sequential bandit algorithm up to round $t$.

Lemma 12 (Lemma 1, Kaufmann et al. [2016]). Let $\nu$ and $\nu'$ be two bandit models (assignments of reward distributions to arms), such that $\nu_i$ (resp. $\nu'_i$) is the reward distribution of arm $i \in \mathcal{A}$ under bandit model $\nu$ (resp. $\nu'$), and such that for all arms $i$, $\nu_i$ and $\nu'_i$ are mutually absolutely continuous. Then for any almost-surely finite stopping time $\tau$ with respect to $(\mathcal{F}_t)_t$,
\[
\sum_{i=1}^{n} \mathbb{E}_\nu[N_i(\tau)]\, KL(\nu_i, \nu'_i) \;\ge\; \sup_{\mathcal{E} \in \mathcal{F}_\tau} kl\big(\Pr_\nu(\mathcal{E}), \Pr_{\nu'}(\mathcal{E})\big),
\]
where $kl(x, y) := x \log\frac{x}{y} + (1 - x)\log\frac{1 - x}{1 - y}$ is the binary relative entropy, $N_i(\tau)$ denotes the number of times arm $i$ is played in $\tau$ rounds, and $\Pr_\nu(\mathcal{E})$ and $\Pr_{\nu'}(\mathcal{E})$ denote the probability of any event $\mathcal{E} \in \mathcal{F}_\tau$ under bandit models $\nu$ and $\nu'$, respectively.
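As a quick sanity check on the right-hand side of Lemma 12, the following few lines (an illustrative sketch, not part of the paper) evaluate the binary relative entropy $kl(x, y)$ and confirm numerically that $kl(1 - \delta, \delta) \ge \ln\frac{1}{2.4\delta}$, the inequality from Kaufmann et al. [2016] that is used later in Eqn. (5).

```python
import math

def binary_kl(x, y):
    """kl(x, y) = x log(x/y) + (1 - x) log((1 - x)/(1 - y))."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

for delta in (0.1, 0.05, 0.01, 0.001):
    lhs = binary_kl(1 - delta, delta)
    rhs = math.log(1 / (2.4 * delta))
    print(f"delta={delta:<6} kl(1-d, d)={lhs:.3f} >= ln(1/(2.4 d))={rhs:.3f}: {lhs >= rhs}")
```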
We now proceed to prove our lower bound result, Thm. 6.

Theorem 6 (Sample Complexity Lower Bound for the RUM$(k, \theta)$ model). Given $\epsilon \in (0, 1]$, $\delta \in (0, 1]$, $c > 0$ and an $(\epsilon, \delta)$-PAC algorithm $\mathcal{A}$ with winner-item feedback, there exists a RUM$(k, \theta)$ instance $\nu$ with $\textup{Min-AR}(i, j) \ge 1 + c\,\Delta_{ij}$ for all $i, j \in [n]$, on which the expected sample complexity of $\mathcal{A}$ is at least $\Omega\big(\frac{n}{c^2\epsilon^2}\ln\frac{1}{2.4\delta}\big)$.

Proof. In order to apply the change-of-measure result of Lem. 12, we construct the following specific instances of the RUM$(k, \theta)$ model, taking $\mathcal{D}$ to be Gumbel$(0, 1)$ noise:

True Instance ($\nu$): $\theta_j = 1 - \epsilon$ for all $j \in [n] \setminus \{1\}$, and $\theta_1 = 1$,

so the only $\epsilon$-optimal arm in the true instance is arm $1$. Now, for every suboptimal item $a \in [n] \setminus \{1\}$, consider the modified instance $\nu^a$:

Instance-$a$ ($\nu^a$): $\theta^a_j = 1 - \epsilon$ for all $j \in [n] \setminus \{a, 1\}$, $\theta^a_1 = 1 - \epsilon$, and $\theta^a_a = 1$.

For any problem instance $\nu^a$, $a \in [n] \setminus \{1\}$, the probability distribution associated with arm $S \in \mathcal{A}$ is given by $\nu^a_S \sim \text{Categorical}(p_1, p_2, \ldots, p_k)$, where $p_i = \Pr(i \mid S)$, $\forall i \in [k]$, $\forall S \in \mathcal{A}$, and $\Pr(i \mid S)$ is as defined in Section 3.1. Note that the only $\epsilon$-optimal arm for Instance-$a$ is arm $a$. Now, applying Lemma 12, for any event $\mathcal{E} \in \mathcal{F}_\tau$ we get
\[
\sum_{\{S \in \mathcal{A} : a \in S\}} \mathbb{E}_\nu[N_S(\tau_{\mathcal{A}})]\, KL(\nu_S, \nu^a_S) \;\ge\; kl\big(\Pr_\nu(\mathcal{E}), \Pr_{\nu^a}(\mathcal{E})\big). \tag{4}
\]
The above restriction of the sum holds owing to the straightforward observation that for any arm $S \in \mathcal{A}$ with $a \notin S$, $\nu_S$ is the same as $\nu^a_S$, hence $KL(\nu_S, \nu^a_S) = 0$ for all such $S$. For notational convenience, we will henceforth denote $\mathcal{S}_a = \{S \in \mathcal{A} : a \in S\}$. Now let us analyse the left-hand side of (4) for any set $S \in \mathcal{S}_a$.
Case-1: First, consider $S \in \mathcal{S}_a$ such that $1 \notin S$. Note that in this case
\[
\nu_S(i) = \frac{1}{k} \quad \text{for all } i \in S.
\]
On the other hand, for problem Instance-$a$ we have
\[
\nu^a_S(i) =
\begin{cases}
\dfrac{e}{(k-1)e^{1-\epsilon} + e}, & i = a,\\[2mm]
\dfrac{e^{1-\epsilon}}{(k-1)e^{1-\epsilon} + e}, & \text{otherwise.}
\end{cases}
\]
Now, using the upper bound $KL(p_1, p_2) \le \sum_{x \in \mathcal{X}} \frac{p_1(x)^2}{p_2(x)} - 1$, valid for any two probability mass functions $p_1, p_2$ on a discrete random variable $X$ [Popescu et al., 2016], we get
\[
KL(\nu_S, \nu^a_S)
\;\le\; \frac{(k-1)\big((k-1)e^{1-\epsilon} + e\big)}{k^2\, e^{1-\epsilon}} + \frac{(k-1)e^{1-\epsilon} + e}{k^2\, e} - 1
\;=\; \frac{(k-1)}{k^2}\Big(e^{\epsilon/2} - e^{-\epsilon/2}\Big)^{2}
\;=\; \frac{(k-1)}{k^2}\, e^{-\epsilon}\big(e^{\epsilon} - 1\big)^{2}
\;\le\; \frac{2\epsilon^2}{k}
\quad \text{for any } \epsilon \in \Big[0, \frac{1}{2}\Big].
\]

Case-2:
Now consider the remaining sets in $\mathcal{S}_a$, i.e. those with $S \ni 1, a$. Similarly to the earlier case, we here have
\[
\nu_S(i) =
\begin{cases}
\dfrac{e}{(k-1)e^{1-\epsilon} + e}, & i = 1,\\[2mm]
\dfrac{e^{1-\epsilon}}{(k-1)e^{1-\epsilon} + e}, & \text{otherwise,}
\end{cases}
\]
while for problem Instance-$a$,
\[
\nu^a_S(i) =
\begin{cases}
\dfrac{e^{1-\epsilon}}{(k-2)e^{1-\epsilon} + e^{1-\epsilon} + e}, & i = 1,\\[2mm]
\dfrac{e}{(k-2)e^{1-\epsilon} + e^{1-\epsilon} + e}, & i = a,\\[2mm]
\dfrac{e^{1-\epsilon}}{(k-2)e^{1-\epsilon} + e^{1-\epsilon} + e}, & \text{otherwise.}
\end{cases}
\]
Using the previously mentioned upper bound on the KL divergence, followed by some elementary calculations, one can show that for any $\epsilon \in \big[0, \frac{1}{2}\big]$,
\[
KL(\nu_S, \nu^a_S) \;\le\; \frac{8\epsilon^2}{k}.
\]
Combining the above two cases, we conclude that for any $S \in \mathcal{S}_a$, $KL(\nu_S, \nu^a_S) \le \frac{8\epsilon^2}{k}$, and, as argued above, $KL(\nu_S, \nu^a_S) = 0$ for any $S \notin \mathcal{S}_a$.

Note that the only $\epsilon$-optimal arm of Instance-$a$ is arm $a$, for every $a \in [n] \setminus \{1\}$. Now let $\mathcal{E}_1 \in \mathcal{F}_\tau$ be the event that the algorithm $\mathcal{A}$ returns the element $i = 1$, and let us analyse the right-hand side of (4) for $\mathcal{E} = \mathcal{E}_1$. Clearly, $\mathcal{A}$ being an $(\epsilon, \delta)$-PAC algorithm, we have $\Pr_\nu(\mathcal{E}_1) > 1 - \delta$ and $\Pr_{\nu^a}(\mathcal{E}_1) < \delta$ for any suboptimal arm $a \in [n] \setminus \{1\}$. Then
\[
kl\big(\Pr_\nu(\mathcal{E}_1), \Pr_{\nu^a}(\mathcal{E}_1)\big) \;\ge\; kl(1 - \delta, \delta) \;\ge\; \ln\frac{1}{2.4\,\delta}, \tag{5}
\]
where the last inequality follows from Kaufmann et al. [2016]. Now, applying (4) for each modified bandit Instance-$\nu^a$ and summing over all suboptimal items $a \in [n] \setminus \{1\}$, we get
\[
\sum_{a=2}^{n} \sum_{\{S \in \mathcal{A} \mid a \in S\}} \mathbb{E}_\nu[N_S(\tau_{\mathcal{A}})]\, KL(\nu_S, \nu^a_S) \;\ge\; (n-1)\ln\frac{1}{2.4\,\delta}. \tag{6}
\]
Using the upper bounds on $KL(\nu_S, \nu^a_S)$ derived above, the left-hand side of (6) can be further upper bounded as
\[
\sum_{a=2}^{n} \sum_{\{S \in \mathcal{A} \mid a \in S\}} \mathbb{E}_\nu[N_S(\tau_{\mathcal{A}})]\, KL(\nu_S, \nu^a_S)
\;\le\; \sum_{S \in \mathcal{A}} \mathbb{E}_\nu[N_S(\tau_{\mathcal{A}})] \sum_{\{a \in S \mid a \neq 1\}} \frac{8\epsilon^2}{k}
\;=\; \sum_{S \in \mathcal{A}} \mathbb{E}_\nu[N_S(\tau_{\mathcal{A}})]\, \big(k - \mathbf{1}(1 \in S)\big)\frac{8\epsilon^2}{k}
\;\le\; \sum_{S \in \mathcal{A}} \mathbb{E}_\nu[N_S(\tau_{\mathcal{A}})]\; 8\epsilon^2. \tag{7}
\]
Finally, noting that $\tau_{\mathcal{A}} = \sum_{S \in \mathcal{A}} N_S(\tau_{\mathcal{A}})$, and combining (6) and (7), we get
\[
8\epsilon^2\, \mathbb{E}_\nu[\tau_{\mathcal{A}}] \;=\; \sum_{S \in \mathcal{A}} \mathbb{E}_\nu[N_S(\tau_{\mathcal{A}})]\; 8\epsilon^2 \;\ge\; (n-1)\ln\frac{1}{2.4\,\delta}. \tag{8}
\]
Now note that, as derived in Lem. 3, for Gumbel$(0, 1)$ noise we have, for any pair $i, j \in [n]$, $\textup{Min-AR}(i, j) = e^{\Delta_{ij}} \ge 1 + \Delta_{ij}$, so the value of the noise-dependent constant $c$ can be taken to be $c = 1$ for this instance. Thus, rewriting Eqn. (8), we get
\[
\mathbb{E}_\nu[\tau_{\mathcal{A}}] \;\ge\; \frac{(n-1)}{8\epsilon^2}\ln\frac{1}{2.4\,\delta} \;=\; \Omega\Big(\frac{n}{c^2\epsilon^2}\ln\frac{1}{2.4\,\delta}\Big).
\]
The above construction shows the existence of a problem instance of the RUM$(k, \theta)$ model on which any $(\epsilon, \delta)$-PAC algorithm requires at least $\Omega\big(\frac{n}{c^2\epsilon^2}\ln\frac{1}{2.4\delta}\big)$ samples, concluding our proof.
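The KL bounds used in Cases 1 and 2 are easy to verify numerically. The sketch below is illustrative: the bound $8\epsilon^2/k$ is the constant used in the reconstruction above, and the instance parameters follow the construction of the proof. It computes the exact KL divergence between the categorical choice distributions of the true instance and of Instance-$a$ on a $k$-subset.

```python
import numpy as np

def choice_probs(theta):
    """Plackett-Luce / Gumbel-noise RUM winner probabilities for a subset
    with utility parameters theta: Pr(i | S) proportional to exp(theta_i)."""
    w = np.exp(theta)
    return w / w.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

k, eps = 10, 0.1

# Case 1: S contains a (index 0) but not item 1.
nu   = choice_probs(np.full(k, 1 - eps))                   # true instance: uniform
nu_a = choice_probs(np.r_[1.0, np.full(k - 1, 1 - eps)])   # instance-a: item a raised to 1
print("case 1:", kl(nu, nu_a), "<=", 8 * eps**2 / k)

# Case 2: S contains both item 1 (index 0) and item a (index 1).
nu   = choice_probs(np.r_[1.0, np.full(k - 1, 1 - eps)])
nu_a = choice_probs(np.r_[1 - eps, 1.0, np.full(k - 2, 1 - eps)])
print("case 2:", kl(nu, nu_a), "<=", 8 * eps**2 / k)
```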
Remark 5. It is worth noting that our lower bound analysis is, in spirit, essentially the same as the one proposed by Saha and Gopalan [2019] for the Plackett-Luce model. Note, however, that their PAC objective is quite different from the one considered here: their model is positive scale invariant, unlike ours, which is shift invariant w.r.t. the model parameters $\theta$. Moreover, our setting aims to find an $\epsilon$-best item in the additive sense (i.e. an item $i$ whose score difference w.r.t. the best item is at most $\epsilon > 0$, or $\theta_1 - \theta_i < \epsilon$), as opposed to the $(\epsilon, \delta)$-PAC objective considered in Saha and Gopalan [2019], which seeks a multiplicative $\epsilon$-best item (i.e. an item $i$ that matches the score of the best item up to a $(1 - \epsilon)$ factor, or $\theta_i > (1 - \epsilon)\theta_1$). Therefore the problem instance constructions for proving suitable lower bounds in these two setups are very different, which is where the novelty of our current lower bound analysis lies.

C Appendix for Section 6
C.1 Pseudo code of Sequential-Pairwise-Battle for top-m ranking feedback (mSeq-PB)

The description is given in Algorithm 2.
C.2 Proof of Theorem 7
Theorem 7 (mSeq-PB (Alg. 2): Correctness and Sample Complexity). Consider any RUM$(k, \theta)$ subsetwise preference model based on a noise distribution $\mathcal{D}$, and suppose that for any item pair $i, j$ we have $\textup{Min-AR}(i, j) \ge 1 + c\,\Delta_{ij}$ for some $\mathcal{D}$-dependent constant $c > 0$. Then mSeq-PB (Alg. 2) with input constant $c > 0$, run with the top-$m$ ranking feedback model, is an $(\epsilon, \delta)$-PAC algorithm with sample complexity $O\big(\frac{n}{m c^2 \epsilon^2}\log\frac{k}{\delta}\big)$.

Proof. As in the proof of Thm. 4, we start by analyzing the required sample complexity of the algorithm. Note that at any iteration $\ell$, any set $\mathcal{G}_g$ is played for exactly $t = \big\lceil \frac{4k}{m\epsilon_\ell^2}\ln\frac{2k}{\delta_\ell}\big\rceil$ rounds. Also, since the algorithm discards exactly $k - 1$ items from each set $\mathcal{G}_g$, the maximum number of iterations possible is $\lceil \ln_k n \rceil$. Now at any iteration $\ell$, since $G = \big\lfloor \frac{|S_\ell|}{k}\big\rfloor < \frac{|S_\ell|}{k}$, the total sample complexity of iteration $\ell$ is at most $\frac{|S_\ell|}{k}\, t \le \frac{4n}{m\,k^{\ell-1}\epsilon_\ell^2}\ln\frac{2k}{\delta_\ell}$, as $|S_\ell| \le \frac{n}{k^{\ell-1}}$ for all $\ell \in [\lfloor \ln_k n \rfloor]$. Also note that for all but the last iteration $\ell \in [\lfloor \ln_k n \rfloor]$, $\epsilon_\ell = \frac{c\epsilon}{8}\big(\frac{3}{4}\big)^{\ell - 1}$ and $\delta_\ell = \frac{\delta}{2^{\ell+1}}$. Moreover, for the last iteration $\ell = \lceil \ln_k n \rceil$, the sample complexity is clearly $t = \frac{16k}{mc^2\epsilon^2}\ln\frac{4k}{\delta}$, since in this case $\epsilon_\ell = \frac{c\epsilon}{2}$, $\delta_\ell = \frac{\delta}{2}$ and $|S| = k$. Thus the total sample complexity of Algorithm 2 is given by
\[
\sum_{\ell=1}^{\lceil \ln_k n \rceil} \frac{|S_\ell|}{m(\epsilon_\ell/2)^2}\ln\frac{2k}{\delta_\ell}
\;\le\; \sum_{\ell=1}^{\infty} \frac{4n}{m\,k^{\ell-1}\Big(\frac{c\epsilon}{8}\big(\frac{3}{4}\big)^{\ell-1}\Big)^{2}}\ln\frac{2^{\ell+2}k}{\delta} + \frac{16k}{mc^2\epsilon^2}\ln\frac{4k}{\delta}
\;\le\; \frac{256\,n}{mc^2\epsilon^2}\sum_{\ell=1}^{\infty}\frac{16^{\ell-1}}{(9k)^{\ell-1}}\Big(\ln\frac{4k}{\delta} + \ell\ln 2\Big) + \frac{16k}{mc^2\epsilon^2}\ln\frac{4k}{\delta}
\;=\; O\Big(\frac{n}{mc^2\epsilon^2}\ln\frac{k}{\delta}\Big), \quad \text{for any } k > 1.
\]
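The geometric sum above is easy to evaluate numerically. The short sketch below is illustrative only: it assumes the phase schedule $\epsilon_\ell = \frac{c\epsilon}{8}\big(\frac{3}{4}\big)^{\ell-1}$, $\delta_\ell = \frac{\delta}{2^{\ell+1}}$ and per-phase budget $t_\ell = \big\lceil \frac{4k}{m\epsilon_\ell^2}\ln\frac{2k}{\delta_\ell}\big\rceil$ used in this write-up (the exact constants are reconstructed, not quoted), and adds up the total number of subset plays over all phases, illustrating the linear-in-$n$ scaling of the budget.

```python
import math

def mseqpb_total_queries(n, k, m, eps, delta, c=1.0):
    """Total subset plays of mSeq-PB under the reconstructed phase schedule
    (an illustrative accounting, not the paper's exact constants)."""
    total, survivors = 0, n
    eps_l, delta_l = c * eps / 8, delta / 2
    while survivors > k:
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2           # phase schedule
        groups = math.ceil(survivors / k)
        t = math.ceil(4 * k / (m * eps_l ** 2) * math.log(2 * k / delta_l))
        total += groups * t
        survivors = groups                                    # one survivor per group
    # final phase on the last (at most k) surviving items
    eps_l, delta_l = c * eps / 2, delta / 2
    total += math.ceil(4 * k / (m * eps_l ** 2) * math.log(2 * k / delta_l))
    return total

for n in (5_000, 10_000, 20_000):
    print(n, mseqpb_total_queries(n, k=20, m=5, eps=0.1, delta=0.05))
```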
We are now left with proving the $(\epsilon, \delta)$-PAC correctness of the algorithm. We use the same notation as introduced in the proof of Thm. 4. We start with the crucial observation that at any phase, for any subgroup $\mathcal{G}_g$, the strongest item of $\mathcal{G}_g$ gets picked in the top-$m$ ranking quite often. More formally:

Lemma 13. Consider any particular set $\mathcal{G}_g$ at any phase $\ell$, and denote by $q_i$ the number of times an item $i \in \mathcal{G}_g$ appears in the top-$m$ rankings when the set $\mathcal{G}_g$ is queried for $t$ rounds. If $i_g := \arg\max_{i \in \mathcal{G}_g} \theta_i$, then with probability at least $\big(1 - \frac{\delta_\ell}{2k}\big)$, $q_{i_g} > (1 - \eta)\frac{mt}{k}$, for any $\eta \in \big[\frac{1}{2}, 1\big)$.

Proof. Fix any iteration $\ell$ and a set $\mathcal{G}_g$, $g \in \{1, 2, \ldots, G\}$. Define $i^\tau_g := \mathbf{1}(i_g \in \sigma_\tau)$, the indicator of the event that item $i_g$ appears in the top-$m$ ranking at round $\tau \in [t]$. Recalling the definition of the TR feedback model (Sec. 3.1), we get
\[
\mathbb{E}[i^\tau_g] \;=\; \Pr(i_g \in \sigma_\tau) \;=\; \Pr\big(\exists j \in [m] : \sigma_\tau(j) = i_g\big) \;=\; \sum_{j=1}^{m}\Pr\big(\sigma_\tau(j) = i_g\big) \;\ge\; \sum_{j=0}^{m-1}\frac{1}{k - j} \;\ge\; \frac{m}{k},
\]
since $\Pr(i_g \mid S) \ge \frac{1}{|S|}$ for any $S \subseteq \mathcal{G}_g$ ($i_g := \arg\max_{i \in \mathcal{G}_g}\theta_i$ being the best item of the set $\mathcal{G}_g$). Hence $\mathbb{E}[q_{i_g}] = \sum_{\tau=1}^{t}\mathbb{E}[i^\tau_g] \ge \frac{mt}{k}$. Now, applying the multiplicative Chernoff-Hoeffding bound to $q_{i_g}$, we get that for any $\eta \in \big[\frac{1}{2}, 1\big)$,
\[
\Pr\big(q_{i_g} \le (1 - \eta)\,\mathbb{E}[q_{i_g}]\big) \;\le\; \exp\Big(\!-\frac{\mathbb{E}[q_{i_g}]\,\eta^2}{2}\Big) \;\le\; \exp\Big(\!-\frac{mt\,\eta^2}{2k}\Big) \;=\; \exp\Big(\!-\frac{2\eta^2}{\epsilon_\ell^2}\ln\frac{2k}{\delta_\ell}\Big) \;\le\; \frac{\delta_\ell}{2k},
\]
where the last inequality holds since $\eta \ge \frac{1}{2}$ and $\epsilon_\ell \le \frac{1}{\sqrt{2}}$ at every iteration $\ell$, so that $\frac{2\eta^2}{\epsilon_\ell^2} \ge 1$. Thus, with probability at least $\big(1 - \frac{\delta_\ell}{2k}\big)$, $q_{i_g} > (1 - \eta)\,\mathbb{E}[q_{i_g}] \ge (1 - \eta)\frac{mt}{k}$, and the claim follows.

In particular, fixing $\eta = \frac{1}{2}$ in Lemma 13, we get that with probability at least $\big(1 - \frac{\delta_\ell}{2k}\big)$, $q_{i_g} > \frac{1}{2}\,\mathbb{E}[q_{i_g}] \ge \frac{mt}{2k}$. Note that, at any round $\tau \in [t]$, whenever an item $i \in \mathcal{G}_g$ appears in the top-$m$ ranking $\sigma_\tau$, the rank-breaking update ensures that it gets compared with each of the remaining $k - 1$ elements of $\mathcal{G}_g$. Based on this observation, we now prove that for any set $\mathcal{G}_g$, a near-best item (an $\epsilon_\ell$-optimal item relative to $i_g$) is retained as the winner $c_g$ with probability at least $(1 - \delta_\ell)$. More formally:

Lemma 14.
Consider any particular set $\mathcal{G}_g$ at any iteration $\ell$, and let $i_g \leftarrow \arg\max_{i \in \mathcal{G}_g} \theta_i$. Then with probability at least $(1 - \delta_\ell)$, $\theta_{c_g} > \theta_{i_g} - \frac{\epsilon_\ell}{c}$.

Proof. With top-$m$ ranking feedback, the crucial observation is that at any round $\tau \in [t]$, whenever an item $i \in \mathcal{G}_g$ appears in the top-$m$ ranking $\sigma_\tau$, the rank-breaking update ensures that it gets compared with each of the remaining $k - 1$ elements of $\mathcal{G}_g$: it is defeated by every element preceding it in $\sigma_\tau \in \Sigma^{\mathcal{G}_g}_m$ and wins over the rest. If $n_{ij} = w_{ij} + w_{ji}$ denotes the number of times items $i$ and $j$ are compared after rank breaking, $i, j \in \mathcal{G}_g$, then $n_{ij} = n_{ji}$, and from Lemma 13 (with $\eta = \frac{1}{2}$) we have $n_{i_g j} \ge \frac{mt}{2k}$ with probability at least $\big(1 - \frac{\delta_\ell}{2k}\big)$. With the above arguments in place, for any item $j \in \mathcal{G}_g \setminus \{i_g\}$, Hoeffding's inequality gives
\[
\Pr\Big(\Big\{\hat{p}_{j i_g} - p_{j i_g \mid \mathcal{G}_g} > \frac{\epsilon_\ell}{2}\Big\} \cap \Big\{n_{j i_g} \ge \frac{mt}{2k}\Big\}\Big) \;\le\; \exp\Big(\!-\frac{mt}{k}\Big(\frac{\epsilon_\ell}{2}\Big)^{2}\Big) \;\le\; \frac{\delta_\ell}{2k}.
\]
Now consider any item $j$ such that $\theta_{i_g} - \theta_j > \frac{\epsilon_\ell}{c}$. Then $\frac{\Pr(i_g \mid \mathcal{G}_g)}{\Pr(j \mid \mathcal{G}_g)} \ge 1 + \epsilon_\ell$, which by Lem. 9 implies $p_{i_g j \mid \mathcal{G}_g} \ge \frac{1}{2} + \epsilon_\ell$, or equivalently $p_{j i_g \mid \mathcal{G}_g} \le \frac{1}{2} - \epsilon_\ell$. On the other hand, we have just shown that for any item $j \in \mathcal{G}_g \setminus \{i_g\}$, with high probability $\big(1 - \frac{\delta_\ell}{2k}\big)$, $\hat{p}_{j i_g} - p_{j i_g \mid \mathcal{G}_g} \le \frac{\epsilon_\ell}{2}$; by a union bound, this holds simultaneously for all $j \in \mathcal{G}_g \setminus \{i_g\}$ with probability at least $\big(1 - \frac{\delta_\ell}{2}\big)$. Consequently, $\hat{p}_{i_g j} + \frac{\epsilon_\ell}{2} \ge p_{i_g j \mid \mathcal{G}_g} \ge \frac{1}{2}$ for all $j \in \mathcal{G}_g$, so $i_g$ is always a valid candidate for $c_g$; and if some other item $i'$ is instead retained as $c_g$, then $\hat{p}_{i' i_g} + \frac{\epsilon_\ell}{2} \ge \frac{1}{2}$, together with the concentration above, forces $p_{i' i_g \mid \mathcal{G}_g} > \frac{1}{2} - \epsilon_\ell$, i.e. $i'$ must itself be $\epsilon_\ell$-near-optimal, $\theta_{i'} > \theta_{i_g} - \frac{\epsilon_\ell}{c}$. This concludes the proof.

The correctness claim now follows using a similar argument to the one given in the proof of Thm. 4; we add the details for completeness. Without loss of generality, assume the best item of the RUM$(k, \theta)$ model is item $1$, i.e. $\theta_1 > \theta_i$ for all $i \in [n] \setminus \{1\}$. For any iteration $\ell$, define $g_\ell \in [G]$ to be the index of the set that contains the best item of the surviving set $S_\ell$, i.e. $\arg\max_{i \in S_\ell} \theta_i \in \mathcal{G}_{g_\ell}$. Then, applying Lemma 14, with probability at least $(1 - \delta_\ell)$, $\theta_{c_{g_\ell}} > \theta_{i_{g_\ell}} - \epsilon_\ell/c$. Note that initially, at phase $\ell = 1$, $i_{g_\ell} = 1$. Then, applying Lemma 14 recursively to $\mathcal{G}_{g_\ell}$ for each iteration $\ell$, we finally get
\[
\theta_r \;>\; \theta_1 - \Big(\frac{\epsilon}{8} + \frac{\epsilon}{8}\Big(\frac{3}{4}\Big) + \cdots + \frac{\epsilon}{8}\Big(\frac{3}{4}\Big)^{\lfloor \ln_k n \rfloor}\Big) - \frac{\epsilon}{2}
\;\ge\; \theta_1 - \frac{\epsilon}{8}\Big(\sum_{i=0}^{\infty}\Big(\frac{3}{4}\Big)^{i}\Big) - \frac{\epsilon}{2}
\;\ge\; \theta_1 - \epsilon.
\]
Thus, assuming the algorithm does not fail in any iteration $\ell$, we finally have $\theta_r > \theta_1 - \epsilon$, which shows that the final item output by mSeq-PB is $\epsilon$-optimal. Finally, since at each iteration $\ell$ the algorithm fails with probability at most $\delta_\ell$, the total failure probability of the algorithm is at most $\big(\delta_1 + \delta_2 + \cdots + \delta_{\lceil \ln_k n \rceil}\big) + \frac{\delta}{2} \le \delta$. This shows the correctness of the algorithm, concluding the proof.

C.3 Proof of Theorem 8

The proof proceeds almost exactly as the proof of Thm. 6; the only difference lies in the analysis of the KL-divergence terms under top-$m$ ranking feedback. Consider the exact same set of RUM$(k, \theta)$ instances $\{\nu^a\}_{a=1}^{n}$ constructed for Thm. 6. It is instructive to note how the top-$m$ ranking feedback affects the KL-divergence analysis: the KL divergence blows up by a factor of $m$, which in turn yields an $m$-fold reduction in the sample complexity lower bound. We show this formally below.

Note that with top-$m$ ranking feedback, for any problem instance $\nu^a$, $a \in [n]$, each $k$-set $S \subseteq [n]$ is associated with $\binom{k}{m}\,m!$ possible outcomes, each representing one possible ranking of a set of $m$ items of $S$, say $S_m$. The probability of any permutation $\sigma \in \Sigma^{S}_{m}$ is given by $p^a_S(\sigma) = \Pr_{\nu^a}(\sigma \mid S)$, where $\Pr_{\nu^a}(\sigma \mid S)$ is as defined for top-$m$ ranking feedback for the RUM$(k, \theta)$ problem instance $\nu^a$ (see Sec. 6). More formally, for problem Instance-$a$ we have, for all $\sigma \in \Sigma^S_m$,
\[
p^a_S(\sigma) = \Pr_{\nu^a}(\sigma \mid S) = \prod_{i=1}^{m}\Pr\big(X_{\sigma(i)} > X_{\sigma(j)},\ \forall j \in \{i+1, \ldots, m\}\big)
 = \prod_{i=1}^{m}\Pr\big(\zeta_{\sigma(i)} > \zeta_{\sigma(j)} - (\theta^a_{\sigma(i)} - \theta^a_{\sigma(j)}),\ \forall j \in \{i+1, \ldots, m\}\big).
\]
As also argued in the proof of Thm. 6, $KL(p_S, p^a_S) = 0$ for any set $S \not\ni a$. Hence, when comparing the KL divergences of instance $\nu$ against $\nu^a$, we again need to focus only on the sets containing $a$ (recall that we denote these by $\mathcal{S}_a$).
Applying the chain rule for KL divergence, we now get
\[
KL(p_S, p^a_S) = KL\big(p_S(\sigma_1), p^a_S(\sigma_1)\big) + KL\big(p_S(\sigma_2 \mid \sigma_1), p^a_S(\sigma_2 \mid \sigma_1)\big) + \cdots + KL\big(p_S(\sigma_m \mid \sigma(1\!:\!m\!-\!1)), p^a_S(\sigma_m \mid \sigma(1\!:\!m\!-\!1))\big), \tag{9}
\]
where we abbreviate $\sigma(i)$ as $\sigma_i$, and $KL(P(Y \mid X), Q(Y \mid X)) := \sum_{x}\Pr(X = x)\big[KL(P(Y \mid X = x), Q(Y \mid X = x))\big]$ denotes the conditional KL divergence. Moreover, it is easy to note that for any $\sigma \in \Sigma^S_m$ such that $\sigma(i) = a$, we have $KL\big(p_S(\sigma_{i+1} \mid \sigma(1\!:\!i)), p^a_S(\sigma_{i+1} \mid \sigma(1\!:\!i))\big) = 0$, for all $i \in [m]$.

Now, using the KL divergence upper bounds derived in the proof of Thm. 6, we have that $KL\big(p_S(\sigma_1), p^a_S(\sigma_1)\big) \le \frac{8\epsilon^2}{k}$. One can use the same line of argument to upper bound the remaining KL divergence terms of (9) as well. More formally, for all $i \in [m-1]$ one can show that
\[
KL\big(p_S(\sigma_{i+1} \mid \sigma(1\!:\!i)), p^a_S(\sigma_{i+1} \mid \sigma(1\!:\!i))\big)
= \sum_{\sigma' \in \Sigma^S_i}\Pr\big(\sigma(1\!:\!i) = \sigma'\big)\, KL\big(p_S(\sigma_{i+1} \mid \sigma(1\!:\!i) = \sigma'), p^a_S(\sigma_{i+1} \mid \sigma(1\!:\!i) = \sigma')\big) \;\le\; \frac{8\epsilon^2}{k},
\]
and hence
\[
KL(p_S, p^a_S) = KL\big(p_S(\sigma_1), p^a_S(\sigma_1)\big) + \cdots + KL\big(p_S(\sigma_m \mid \sigma(1\!:\!m\!-\!1)), p^a_S(\sigma_m \mid \sigma(1\!:\!m\!-\!1))\big) \;\le\; \frac{8m\epsilon^2}{k}. \tag{10}
\]
Eqn. (10) is precisely the main ingredient needed to derive Thm. 8: it shows an $m$-factor blow-up in the KL divergence terms owing to the top-$m$ ranking feedback. The rest of the proof follows exactly the same argument as used in the proof of Thm. 6, which readily yields the desired sample complexity lower bound.

Algorithm 2 Sequential-Pairwise-Battle (TR-m feedback)

Input:
  Set of items: [n], and subset size: k > 1 (n >= k >= m)
  Error bias: epsilon > 0, and confidence parameter: delta > 0
  Noise model (D) dependent constant c > 0
Initialize:
  S <- [n], epsilon_0 <- c*epsilon/8, delta_0 <- delta/2
  Divide S into G := ceil(n/k) sets G_1, G_2, ..., G_G such that the union of the G_j equals S and G_j, G_j' are pairwise disjoint, with |G_j| = k for all j in [G - 1]. If |G_G| < k, then set R_1 <- G_G and G <- G - 1.
while ell = 1, 2, ... do
  Set S <- emptyset, delta_ell <- delta_{ell-1}/2, epsilon_ell <- (3/4) epsilon_{ell-1}
  for g = 1, 2, ..., G do
    Initialize pairwise (empirical) win-counts w_ij <- 0 for each item pair i, j in G_g
    for tau = 1, 2, ..., t (:= ceil((4k/(m epsilon_ell^2)) ln(2k/delta_ell))) do
      Play the set G_g (one round of battle)
      Receive: the top-m ranking sigma_tau in Sigma_m^{G_g}
      Update the win-counts w_ij of each item pair i, j in G_g by applying Rank-Breaking to sigma_tau
    end for
    Define hat-p_ij = w_ij / (w_ij + w_ji) for all i, j in G_g
    If there exists i in G_g such that hat-p_ij + epsilon_ell/2 >= 1/2 for all j in G_g, then set c_g <- i; else select c_g uniformly at random from G_g. Set S <- S union {c_g}
  end for
  S <- S union R_ell
  if |S| == 1 then
    Break (exit the while loop)
  else if |S| <= k then
    S' <- randomly sample k - |S| items from [n] \ S, and S <- S union S', epsilon_ell <- c*epsilon/2, delta_ell <- delta/2
  else
    Divide S into G := ceil(|S|/k) sets G_1, ..., G_G such that the union of the G_j equals S and G_j, G_j' are pairwise disjoint, with |G_j| = k for all j in [G - 1]. If |G_G| < k, then set R_{ell+1} <- G_G and G <- G - 1.
  end if
end while
Output:
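Since Algorithm 2 relies on the Rank-Breaking subroutine to convert each observed top-m ranking into pairwise win counts, a minimal sketch of that update is given below. The convention implemented here, that every ranked item beats all items ranked below it and all unranked items of the played set, follows the description in the proof of Lemma 14; the data-structure choices (a dictionary of counts keyed by ordered pairs) and the tie default of 0.5 are illustrative.

```python
from collections import defaultdict

def rank_break(top_m_ranking, played_set, win_count):
    """Update pairwise win counts from one round of top-m ranking feedback.

    top_m_ranking : list of item ids, best first (the observed sigma_tau)
    played_set    : the k items that were played in this round
    win_count     : dict mapping (i, j) -> number of times i beat j
    """
    ranked = list(top_m_ranking)
    unranked = [i for i in played_set if i not in ranked]
    for pos, winner in enumerate(ranked):
        # `winner` beats every item ranked below it and every unranked item
        for loser in ranked[pos + 1:] + unranked:
            win_count[(winner, loser)] += 1
    return win_count

def empirical_pref(win_count, i, j):
    """hat p_ij = w_ij / (w_ij + w_ji), as used by Algorithm 2."""
    w_ij, w_ji = win_count[(i, j)], win_count[(j, i)]
    return w_ij / (w_ij + w_ji) if w_ij + w_ji > 0 else 0.5

# Example: a set of k = 5 items, top-m = 3 ranking observed in one round.
counts = defaultdict(int)
rank_break(top_m_ranking=[4, 1, 3], played_set=[0, 1, 2, 3, 4], win_count=counts)
print(dict(counts))
# item 4 beats 1, 3, 0, 2; item 1 beats 3, 0, 2; item 3 beats 0, 2
```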