Active Ranking using Pairwise Comparisons
Kevin G. Jamieson
University of Wisconsin, Madison, WI 53706, USA [email protected]
Robert D. Nowak
University of Wisconsin, Madison, WI 53706, USA [email protected]
Abstract
This paper examines the problem of ranking a collection of objects using pairwise comparisons (rankings of two objects). In general, the ranking of $n$ objects can be identified by standard sorting methods using $n \log n$ pairwise comparisons. We are interested in natural situations in which relationships among the objects may allow for ranking using far fewer pairwise comparisons. Specifically, we assume that the objects can be embedded into a $d$-dimensional Euclidean space and that the rankings reflect their relative distances from a common reference point in $\mathbb{R}^d$. We show that under this assumption the number of possible rankings grows like $n^{2d}$ and demonstrate an algorithm that can identify a randomly selected ranking using just slightly more than $d \log n$ adaptively selected pairwise comparisons, on average. If instead the comparisons are chosen at random, then almost all pairwise comparisons must be made in order to identify any ranking. In addition, we propose a robust, error-tolerant algorithm that only requires that the pairwise comparisons are probably correct. Experimental studies with synthetic and real datasets support the conclusions of our theoretical analysis.

This paper addresses the problem of ranking a set of objects based on a limited number of pairwise comparisons (rankings between pairs of the objects). A ranking over a set of $n$ objects $\Theta = (\theta_1, \theta_2, \dots, \theta_n)$ is a mapping $\sigma : \{1, \dots, n\} \to \{1, \dots, n\}$ that prescribes an order

$$\sigma(\Theta) := \theta_{\sigma(1)} \prec \theta_{\sigma(2)} \prec \cdots \prec \theta_{\sigma(n-1)} \prec \theta_{\sigma(n)} \qquad (1)$$

where $\theta_i \prec \theta_j$ means $\theta_i$ precedes $\theta_j$ in the ranking. A ranking uniquely determines the collection of pairwise comparisons between all pairs of objects. The primary objective here is to bound the number of pairwise comparisons needed to correctly determine the ranking when the objects (and hence rankings) satisfy certain known structural constraints.
Specifically, we suppose that the objects may be embedded into a low-dimensional Euclidean space such that the ranking is consistent with distances in the space. We wish to exploit such structure in order to discover the ranking using a very small number of pairwise comparisons. To the best of our knowledge, this is a previously open and unsolved problem.

There are practical and theoretical motivations for restricting our attention to pairwise rankings that are discussed in Section 2. We begin by assuming that every pairwise comparison is consistent with an unknown ranking. Each pairwise comparison can be viewed as a query: is $\theta_i$ before $\theta_j$? Each query provides 1 bit of information about the underlying ranking. Since the number of rankings is $n!$, in general, specifying a ranking requires $\Theta(n \log n)$ bits of information. This implies that at least this many pairwise comparisons are required without additional assumptions about the ranking. In fact, this lower bound can be achieved with a standard adaptive sorting algorithm like binary sort [1]. In large-scale problems or when humans are queried for pairwise comparisons, obtaining this many pairwise comparisons may be impractical and therefore we consider situations in which the space of rankings is structured and thereby less complex.

A natural way to induce a structure on the space of rankings is to suppose that the objects can be embedded into a $d$-dimensional Euclidean space so that the distances between objects are consistent with the ranking. This may be a reasonable assumption in many applications, and for instance the audio dataset used in our experiments is believed to have a 2 or 3 dimensional embedding [2]. We further discuss motivations for this assumption in Section 2. It is not difficult to show (see Section 3) that the number of full rankings that could arise from $n$ objects embedded in $\mathbb{R}^d$ grows like $n^{2d}$, and so specifying a ranking from this class requires only $O(d \log n)$ bits.
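To make the gap between the two bit counts concrete, the following minimal sketch compares $\log_2 n!$ with the leading-order term $2d \log_2 n$. It assumes the $n^{2d}$ growth rate quoted above and ignores constants; the function names are illustrative and not part of the paper.

```python
import math


def bits_for_arbitrary_ranking(n: int) -> float:
    """log2(n!): bits needed to single out one of the n! unconstrained rankings."""
    return math.lgamma(n + 1) / math.log(2)


def bits_for_embedded_ranking(n: int, d: int) -> float:
    """2 d log2(n): leading-order bit count if the number of rankings grows like n^(2d)."""
    return 2 * d * math.log2(n)


if __name__ == "__main__":
    # Illustrative values of n; d = 3 is an arbitrary choice for the comparison.
    for n in (100, 10_000, 1_000_000):
        print(n, round(bits_for_arbitrary_ranking(n)), round(bits_for_embedded_ranking(n, d=3)))
```

The first count grows roughly like $n \log_2 n$, while the second grows only logarithmically in $n$ for fixed $d$.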
The main results of the paper show that under this assumption a randomly selected ranking can be determined using $O(d \log n)$ pairwise comparisons selected in an adaptive and sequential fashion, but almost all $\binom{n}{2}$ pairwise rankings are needed if they are picked randomly rather than selectively. In other words, actively selecting the most informative queries has a tremendous impact on the complexity of learning the correct ranking.

Let $\sigma$ denote the ranking to be learned. The objective is to learn the ranking by querying the reference for pairwise comparisons of the form

$$q_{i,j} := \{\theta_i \prec \theta_j\}. \qquad (2)$$

The response or label of $q_{i,j}$ is binary and denoted as $y_{i,j} := \mathbf{1}\{q_{i,j}\}$ where $\mathbf{1}$ is the indicator function; ties are not allowed. The main results quantify the minimum number of queries or labels required to determine the reference's ranking, and they are based on two key assumptions.

A1 Embedding:
The $n$ objects are embedded in $\mathbb{R}^d$ (in general position) and we will also use $\theta_1, \dots, \theta_n$ to refer to their (known) locations in $\mathbb{R}^d$. Every ranking $\sigma$ can be specified by a reference point $r_\sigma \in \mathbb{R}^d$, as follows. The Euclidean distances between the reference and objects are consistent with the ranking in the following sense: if $\sigma$ ranks $\theta_i \prec \theta_j$, then $\|\theta_i - r_\sigma\| < \|\theta_j - r_\sigma\|$. Let $\Sigma_{n,d}$ denote the set of all possible rankings of the $n$ objects that satisfy this embedding condition.

The interpretation of this assumption is that we know how the objects are related (in the embedding), which limits the space of possible rankings. The ranking to be learned, specified by the reference (e.g., preferences of a human subject), is unknown. Many have studied the problem of finding an embedding of objects from data [3, 4, 5]. This is not the focus here, but it could certainly play a supporting role in our methodology (e.g., the embedding could be determined from known similarities between the $n$ objects, as is done in our experiments with the audio dataset). We assume the embedding is given and our interest is minimizing the number of queries needed to learn the ranking, and for this we require a second assumption.

A2 Consistency:
Every pairwise comparison is consistent with the ranking to be learned. That is, if the reference ranks $\theta_i \prec \theta_j$, then $\theta_i$ must precede $\theta_j$ in the (full) ranking.

As we will discuss later in Section 3.2, these two assumptions alone are not enough to rule out pathological arrangements of objects in the embedding for which at least $\Omega(n)$ queries must be made to recover the ranking. However, because such situations are not representative of what is typically encountered, we analyze the problem in the framework of the average-case analysis [6].

Definition 1.
With each ranking $\sigma \in \Sigma_{n,d}$ we associate a probability $\pi_\sigma$ such that $\sum_{\sigma \in \Sigma_{n,d}} \pi_\sigma = 1$. Let $\pi$ denote these probabilities and write $\sigma \sim \pi$ for shorthand. The uniform distribution corresponds to $\pi_\sigma = |\Sigma_{n,d}|^{-1}$ for all $\sigma \in \Sigma_{n,d}$, and we write $\sigma \sim U$ for this special case.

Definition 2. If $M_n(\sigma)$ denotes the number of pairwise comparisons requested by an algorithm to identify the ranking $\sigma$, then the average query complexity with respect to $\pi$ is denoted by $\mathbb{E}_\pi[M_n]$.

The main results are proven for the special case of $\pi = U$, the uniform distribution, to make the analysis more transparent and intuitive. However the results can easily be extended to general distributions $\pi$ that satisfy certain mild conditions (see Appendix A.5). All results henceforth, unless otherwise noted, will be given in terms of (uniform) average query complexity and we will say such results hold "on average."

Our main results can be summarized as follows. If the queries are chosen deterministically or randomly in advance of collecting the corresponding pairwise comparisons, then we show that almost all $\binom{n}{2}$ pairwise comparison queries are needed to identify a ranking under the assumptions above. However, if the queries are selected in an adaptive and sequential fashion according to the algorithm in Figure 1, then we show that the number of pairwise rankings required to identify a ranking is no more than a constant multiple of $d \log n$, on average. The algorithm requests a query if and only if the corresponding pairwise ranking is ambiguous (see Section 4.2), meaning that it cannot be determined from previously collected pairwise comparisons and the locations of the objects in $\mathbb{R}^d$. The efficiency of the algorithm is due to the fact that most of the queries are unambiguous when considered in a sequential fashion. For this very same reason, picking queries in a non-adaptive or random fashion is very inefficient. It is also noteworthy that the algorithm is computationally efficient, with an overall complexity no greater than $O(n \, \mathrm{poly}(d) \, \mathrm{poly}(\log n))$ (see Appendix A.1).

Query Selection Algorithm
  input: $n$ objects in $\mathbb{R}^d$
  initialize: objects $\Theta = \{\theta_1, \dots, \theta_n\}$ in uniformly random order
  for $j = 2, \dots, n$
    for $i = 1, \dots, j-1$
      if $q_{i,j}$ is ambiguous,
        request $q_{i,j}$'s label from reference;
      else
        impute $q_{i,j}$'s label from previously labeled queries.
  output: ranking of $n$ objects
Figure 1: Sequential algorithm for selecting queries. See Figure 2 and Section 4.2 for the definition of an ambiguous query.

Figure 2: Objects $\theta_1, \theta_2, \theta_3$ and queries. The reference $r_\sigma$ lies in the shaded region (consistent with the labels of $q_{1,2}$, $q_{1,3}$, $q_{2,3}$). The dotted (dashed) lines represent new queries whose labels are (are not) ambiguous given those labels.

In Section 5 we present a robust version of the algorithm of Figure 1 that is tolerant to a fraction of errors in the pairwise comparison queries. In the case of persistent errors (see Section 5) we show that we can find a probably approximately correct ranking by requesting just $O(d \log^2 n)$ pairwise comparisons. This allows us to handle situations in which either or both of the assumptions, A1 and A2, are reasonable approximations to the situation at hand, but do not hold strictly (which is the case in our experiments with the audio dataset).

Proving the main results involves an uncommon marriage of ideas from the ranking and statistical learning literatures. Geometrical interpretations of our problem derive from the seminal works of [7] in ranking and [8] in learning. From this perspective our problem bears a strong resemblance to the halfspace learning problem, with two crucial distinctions. In the ranking problem, the underlying halfspaces are not in general position and have strong dependencies with each other.
These dependencies invalidate many of the typical analyses of such problems [9, 10]. One popular method of analysis in exact learning involves the use of something called the extended teaching dimension [11]. However, because of the possible pathological situations alluded to earlier, it is easy to show that the extended teaching dimension must be at least $\Omega(n)$, making that sort of worst-case analysis uninteresting. These differences present unique challenges to learning.

The problem of learning a ranking from few pairwise comparisons is motivated by what we perceive as a significant gap in the theory of ranking and permutation learning. Most work in ranking with structural constraints assumes a passive approach to learning; pairwise comparisons or partial rankings are collected in a random or non-adaptive fashion and then aggregated to obtain a full ranking (cf. [12, 13, 14, 15]). However, this may be quite inefficient in terms of the number of pairwise comparisons or partial rankings needed to learn the (full) ranking. This inefficiency was recently noted in the related area of social choice theory [16]. Furthermore, empirical evidence suggests that, even under complex ranking models, adaptively selecting pairwise comparisons can reduce the number needed to learn the ranking [17]. It is cause for concern since in many applications it is expensive and time-consuming to obtain pairwise comparisons. For example, psychologists and market researchers collect pairwise comparisons to gauge human preferences over a set of objects, for scientific understanding or product placement. The scope of these experiments is often very limited simply due to the time and expense required to collect the data. This suggests the consideration of more selective and judicious approaches to gathering inputs for ranking. We are interested in taking advantage of underlying structure in the set of objects in order to choose more informative pairwise comparison queries.
From a learning perspective, our work adds an active learning component to a problem domain that has primarily been treated from a passive learning mindset.

We focus on pairwise comparison queries for two reasons. First, pairwise comparisons admit a halfspace representation in embedding spaces which allows for a geometrical approach to learning in such structured ranking spaces. Second, pairwise comparisons are the most common form of queries in many applications, especially those involving human subjects. For example, consider the problem of finding the most highly ranked object, as illustrated by the following familiar task. Suppose a patient needs a new pair of prescription eye lenses. Faced with literally millions of possible prescriptions, the doctor will present candidate prescriptions in a sequential fashion followed by the query: better or worse? Even if certain queries are repeated to account for possible inaccurate answers, the doctor can locate an accurate prescription with just a handful of queries. This is possible presumably because the doctor understands (at least intuitively) the intrinsic space of prescriptions and can efficiently search through it using only binary responses from the patient.

We assume that the objects can be embedded in $\mathbb{R}^d$ and that the distances between objects and the reference are consistent with the ranking (Assumption A1). The problem of learning a general function $f : \mathbb{R}^d \to \mathbb{R}$ using just pairwise comparisons that correctly ranks the objects embedded in $\mathbb{R}^d$ has previously been studied in the passive setting [12, 13, 14, 15]. The main contributions of this paper are theoretical bounds for the specific case when $f(x) = \|x - r_\sigma\|$ where $r_\sigma \in \mathbb{R}^d$ is the reference point. This is a standard model used in multidimensional unfolding and psychometrics [7, 18] and one can show that this model also contains the familiar functions $f(x) = r_\sigma^T x$ for all $r_\sigma \in \mathbb{R}^d$. We are unaware of any existing query-complexity bounds for this problem.
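As a small illustration of the model $f(x) = \|x - r_\sigma\|$, the sketch below ranks a toy set of objects by distance to a reference point and answers a single pairwise query. The coordinates, the reference point, and the function names are hypothetical and chosen only for the example.

```python
import math


def rank_by_distance(objects, reference):
    """Indices of `objects`, ordered from nearest to farthest from `reference`."""
    return sorted(range(len(objects)), key=lambda i: math.dist(objects[i], reference))


def query_label(objects, reference, i, j):
    """Label y_{i,j}: 1 if object i is strictly closer to the reference than object j."""
    return int(math.dist(objects[i], reference) < math.dist(objects[j], reference))


theta = [(0.0, 0.0), (2.0, 0.0), (0.0, 3.0), (4.0, 4.0)]  # toy embedding, d = 2
r_sigma = (1.5, 0.5)                                      # hypothetical reference point
ranking = rank_by_distance(theta, r_sigma)
print(ranking)
```

In the active setting the reference point is unknown; only answers such as `query_label` are observable, and the goal is to recover `ranking` from as few of them as possible.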
We do not assume a generative model is responsible for the relationship between rankings and embeddings, but one could. For example, the objects might have an embedding (in a feature space) and the ranking is generated by distances in this space. Or alternatively, structural constraints on the space of rankings could be used to generate a consistent embedding. Assumption A1, while arguably quite natural and reasonable in many situations, significantly constrains the set of possible rankings.

The embedding assumption A1 gives rise to geometrical interpretations of the ranking problem, which are developed in this section. The pairwise comparison $q_{i,j}$ can be viewed as the membership query: is $\theta_i$ ranked before $\theta_j$ in the (full) ranking $\sigma$? The geometrical interpretation is that $q_{i,j}$ requests whether the reference $r_\sigma$ is closer to object $\theta_i$ or object $\theta_j$ in $\mathbb{R}^d$. Consider the line connecting $\theta_i$ and $\theta_j$ in $\mathbb{R}^d$. The hyperplane that bisects this line and is orthogonal to it defines two halfspaces: one containing points closer to $\theta_i$ and the other the points closer to $\theta_j$. Thus, $q_{i,j}$ is a membership query about which halfspace $r_\sigma$ is in, and there is an equivalence between each query, each pair of objects, and the corresponding bisecting hyperplane. The set of all possible pairwise comparison queries can be represented as $\binom{n}{2}$ distinct halfspaces in $\mathbb{R}^d$. The intersections of these halfspaces partition $\mathbb{R}^d$ into a number of cells, and each one corresponds to a unique ranking of $\Theta$. Arbitrary rankings are not possible due to the embedding assumption A1, and recall that the set of rankings possible under A1 is denoted by $\Sigma_{n,d}$. The cardinality of $\Sigma_{n,d}$ is equal to the number of cells in the partition. We will refer to these cells as $d$-cells (to indicate they are subsets in $d$-dimensional space) since at times we will also refer to lower dimensional cells; e.g., $(d-1)$-cells.
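The equivalence between a pairwise query and a halfspace test can be checked directly. Expanding $\|x - \theta_i\|^2 < \|x - \theta_j\|^2$ gives $2(\theta_i - \theta_j)^T x + \|\theta_j\|^2 - \|\theta_i\|^2 > 0$, which the sketch below uses; the helper names are illustrative only.

```python
def bisector(theta_i, theta_j):
    """Return (w, b) with w.x + b > 0 exactly when x is closer to theta_i than to theta_j."""
    w = [2 * (a - c) for a, c in zip(theta_i, theta_j)]
    b = sum(c * c for c in theta_j) - sum(a * a for a in theta_i)
    return w, b


def label_from_halfspace(theta_i, theta_j, r):
    """Answer q_{i,j} by testing which side of the bisecting hyperplane r lies on."""
    w, b = bisector(theta_i, theta_j)
    return int(sum(wk * rk for wk, rk in zip(w, r)) + b > 0)


def label_from_distances(theta_i, theta_j, r):
    """Answer q_{i,j} directly from squared Euclidean distances."""
    d_i = sum((a - x) ** 2 for a, x in zip(theta_i, r))
    d_j = sum((c - x) ** 2 for c, x in zip(theta_j, r))
    return int(d_i < d_j)


theta_1, theta_2, r = (0, 0), (4, 2), (1, 1)   # toy integer coordinates
print(bisector(theta_1, theta_2))
print(label_from_halfspace(theta_1, theta_2, r), label_from_distances(theta_1, theta_2, r))
```

Both functions return the same label for every point, which is the sense in which each query corresponds to a bisecting hyperplane.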
The following lemma determines the cardinality of the set of rankings, $\Sigma_{n,d}$, under assumption A1.

Lemma 1. [7] Assume A1-2. Let $Q(n, d)$ denote the number of $d$-cells defined by the hyperplane arrangement of pairwise comparisons between these objects (i.e. $Q(n, d) = |\Sigma_{n,d}|$). $Q(n, d)$ satisfies the recursion

$$Q(n, d) = Q(n-1, d) + (n-1)\, Q(n-1, d-1), \quad \text{where } Q(1, d) = 1 \text{ and } Q(n, 0) = 1. \qquad (3)$$

In the hyperplane arrangement induced by the $n$ objects in $d$ dimensions, each hyperplane is intersected by every other and is partitioned into $Q(n-1, d-1)$ subsets or $(d-1)$-cells. The recursion, above, arises by considering the addition of one object at a time. Using this lemma in a straightforward fashion, we prove the following corollary in Appendix A.2.

Corollary 1.
Assume A1-2. There exist positive real numbers $k_1$ and $k_2$ such that

$$k_1 \frac{n^{2d}}{2^d d!} < Q(n, d) < k_2 \frac{n^{2d}}{2^d d!}$$

for $n > d + 1$. If $n \le d + 1$ then $Q(n, d) = n!$. For $n$ sufficiently large, $k_1 = 1$ and $k_2 = 2$ suffice.

Since the cardinality of the set of possible rankings is $|\Sigma_{n,d}| = Q(n, d)$, we have a simple lower bound on the number of queries needed to determine the ranking.

Theorem 1.
Assume A1-2. To reconstruct an arbitrary ranking $\sigma \in \Sigma_{n,d}$ any algorithm will require at least $\log_2 |\Sigma_{n,d}| = \Theta(2d \log_2 n)$ pairwise comparisons.

Proof. By Corollary 1 $|\Sigma_{n,d}| = \Theta(n^{2d})$, and so at least $2d \log_2 n$ bits are needed to specify a ranking. Each pairwise comparison provides at most one bit.

If each query provides a full bit of information about the ranking, then we achieve this lower bound. For example, in the one-dimensional case ($d = 1$) the objects can be ordered and binary search can be used to select pairwise comparison queries, achieving the lower bound. This is generally impossible in higher dimensions. Even in two dimensions there are placements of the objects (still in general position) that produce $d$-cells in the partition induced by queries that have $n - 1$ faces (i.e., bounded by $n - 1$ hyperplanes) as shown in Appendix A.3. It follows that the worst case situation may require at least $n - 1$ queries in dimensions $d \ge 2$. In light of this, we conclude that worst case bounds may be overly pessimistic indications of the typical situation, and so we instead consider the average case performance introduced in Section 1.1.

The geometrical representation of the ranking problem reveals that randomly choosing pairwise comparison queries is inefficient relative to the lower bound above. To see this, suppose $m$ queries were chosen uniformly at random from the possible $\binom{n}{2}$. The answers to $m$ queries narrow the set of possible rankings to a $d$-cell in $\mathbb{R}^d$. This $d$-cell may consist of one or more of the $d$-cells in the partition induced by all queries. If it contains more than one of the partition cells, then the underlying ranking is ambiguous.

Theorem 2.
Assume A1-2. Let $N = \binom{n}{2}$. Suppose $m$ pairwise comparisons are chosen uniformly at random without replacement from the possible $\binom{n}{2}$. Then for all positive integers $N \ge m \ge d$ the probability that the $m$ queries yield a unique ranking is $\binom{m}{d} / \binom{N}{d} \le \left(\frac{em}{N}\right)^d$.

Proof. No fewer than $d$ hyperplanes bound each $d$-cell in the partition of $\mathbb{R}^d$ induced by all possible queries. The probability of selecting $d$ specific queries in a random draw of $m$ is equal to

$$\binom{N-d}{m-d} \bigg/ \binom{N}{m} = \binom{m}{d} \bigg/ \binom{N}{d} \le \frac{m^d}{d!} \frac{d^d}{N^d} \le \left(\frac{m}{N}\right)^d \frac{d^d}{d!} \le \left(\frac{em}{N}\right)^d. \qquad \square$$

Note that $\binom{m}{d} / \binom{N}{d} < 1/2$ unless $m = \Omega(n^2)$. Therefore, if the queries are randomly chosen, then we will need to ask almost all queries to guarantee that the inferred ranking is probably correct.

Now consider the basic sequential process of the algorithm in Figure 1. Suppose we have ranked $k - 1$ of the $n$ objects. Call these objects $1$ through $k - 1$. This places the reference $r_\sigma$ within a $d$-cell (defined by the labels of the comparison queries between objects $1, \dots, k-1$). Call this $d$-cell $C_{k-1}$. Now suppose we pick another object at random and call it object $k$. A comparison query between object $k$ and one of objects $1, \dots, k-1$ can only be informative (i.e., ambiguous) if the associated hyperplane intersects this $d$-cell $C_{k-1}$ (see Figure 2). If $k$ is significantly larger than $d$, then it turns out that the cell $C_{k-1}$ is probably quite small and the probability that one of the queries intersects $C_{k-1}$ is very small; in fact the probability is on the order of $1/k^2$.

Consider a hyperplane $h = (h_0, h_1, \dots, h_d)$ with $(d + 1)$ parameters in $\mathbb{R}^d$ and a point $p = (p_1, \dots, p_d) \in \mathbb{R}^d$ that does not lie on the hyperplane. Checking which halfspace $p$ falls in, i.e., $h_1 p_1 + h_2 p_2 + \cdots + h_d p_d + h_0 \gtrless 0$, has a dual interpretation: $h$ is a point in $\mathbb{R}^{d+1}$ and $p$ is a hyperplane in $\mathbb{R}^{d+1}$ passing through the origin (i.e., with $d$ free parameters).

Recall that each possible ranking can be represented by a reference point $r_\sigma \in \mathbb{R}^d$. Our problem is to determine the ranking, or equivalently the vector of responses to the $\binom{n}{2}$ queries represented by hyperplanes in $\mathbb{R}^d$. Using the above observation, we see that our problem is equivalent to finding a labeling over $\binom{n}{2}$ points in $\mathbb{R}^{d+1}$ with as few queries as possible. We will refer to this alternative representation as the dual and the former as the primal.

The characterization of an ambiguous query has interpretations in both the primal and dual spaces. We will now describe the interpretation in the dual which will be critical to our analysis of the sequential algorithm of Figure 1.
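The sequential process described in this section is easiest to see in one dimension, where each query's hyperplane is the midpoint of two objects and the cell containing the reference is an interval. The following sketch simulates the loop of Figure 1 under that simplification (it is not the general $d$-dimensional ambiguity test): a label is requested only when the midpoint falls strictly inside the current interval, and is imputed otherwise. The object positions, the reference value, and the function name are assumptions made for the illustration.

```python
import math
import random


def active_rank_1d(theta, r, rng):
    """Run the Figure 1 loop for d = 1; return (labels, number of requested labels)."""
    order = list(range(len(theta)))
    rng.shuffle(order)                      # uniformly random processing order
    lo, hi = -math.inf, math.inf            # open interval known to contain r
    labels, requested = {}, 0
    for jj in range(1, len(order)):
        for ii in range(jj):
            i, j = order[ii], order[jj]
            mid = (theta[i] + theta[j]) / 2  # the "hyperplane" of q_{i,j} in 1-D
            if lo < mid < hi:
                # Ambiguous: the midpoint cuts the current cell, so ask the reference.
                requested += 1
                y = abs(theta[i] - r) < abs(theta[j] - r)
                if (theta[i] < theta[j]) == y:
                    hi = mid                 # r lies to the left of the midpoint
                else:
                    lo = mid                 # r lies to the right of the midpoint
            else:
                # Unambiguous: the side of the midpoint containing r is already known.
                r_is_left = hi <= mid
                y = (theta[i] < theta[j]) == r_is_left
            labels[(i, j)] = int(y)
    return labels, requested


rng = random.Random(0)
theta = rng.sample(range(10_000), 200)       # distinct integer positions
r = rng.randrange(10_000) + 0.25             # offset avoids ties with midpoints
labels, requested = active_rank_1d(theta, r, rng)
print(requested, "of", len(labels), "labels were requested")
```

Every imputed label agrees with the label the reference would have given, so the full ranking is recovered while only the ambiguous queries are paid for.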
Definition 3. [8] Let $S$ be a finite subset of $\mathbb{R}^d$ and let $S^+ \subset S$ be points labeled $+1$ and $S^- = S \setminus S^+$ be the points labeled $-1$ and let $x$ be any other point except the origin. If there exist two homogeneous linear separators of $S^+$ and $S^-$ that assign different labels to the point $x$, then the label of $x$ is said to be ambiguous with respect to $S$.

Lemma 2. [8, Lemma 1] The label of $x$ is ambiguous with respect to $S$ if and only if $S^+$ and $S^-$ are homogeneously linearly separable by a $(d-1)$-dimensional subspace containing $x$.

Let us consider the implications of this lemma to our scenario. Assume that we have labels for all the pairwise comparisons of $k - 1$ objects. Next consider a new object called object $k$. In the dual, the pairwise comparison between object $k$ and object $i$, for some $i \in \{1, \dots, k-1\}$, is ambiguous if and only if there exists a hyperplane that still separates the original points and also passes through this new point. In the primal, this separating hyperplane corresponds to a point lying on the hyperplane defined by the associated pairwise comparison.

An essential component of the sequential algorithm of Figure 1 is the initial random order of the objects; every sequence in which it could consider objects is equally probable. This allows us to state a nontrivial fact about the partial rankings of the first $k$ objects observed in this sequence.

Lemma 3.
Assume A1-2 and $\sigma \sim U$. Consider the subset $S \subset \Theta$ with $|S| = k$ that is randomly selected from $\Theta$ such that all $\binom{n}{k}$ subsets are equally probable. If $\Sigma_{k,d}$ denotes the set of possible rankings of these $k$ objects then every $\sigma \in \Sigma_{k,d}$ is equally probable.

Proof. Let a $k$-partition denote the partition of $\mathbb{R}^d$ into $Q(k, d)$ $d$-cells induced by $k$ objects for $1 \le k \le n$. In the $n$-partition, each $d$-cell is weighted uniformly, with weight $1/Q(n, d)$. If we uniformly at random select $k$ objects from the possible $n$ and consider the $k$-partition, each $d$-cell in the $k$-partition will contain one or more $d$-cells of the $n$-partition. If we select one of these $d$-cells from the $k$-partition, on average there will be $Q(n, d)/Q(k, d)$ $d$-cells from the $n$-partition contained in this cell. Therefore the probability mass in each $d$-cell of the $k$-partition is equal to the number of cells from the $n$-partition in this cell multiplied by the probability of each of those cells from the $n$-partition: $Q(n, d)/Q(k, d) \times 1/Q(n, d) = 1/Q(k, d)$, and $|\Sigma_{k,d}| = Q(k, d)$.

As described above, for $1 \le i \le k$ some of the pairwise comparisons $q_{i,k+1}$ may be ambiguous. The algorithm chooses a random sequence of the $n$ objects in its initialization and does not use the labels of $q_{1,k+1}, \dots, q_{j-1,k+1}, q_{j+1,k+1}, \dots, q_{k,k+1}$ to make a determination of whether or not $q_{j,k+1}$ is ambiguous. It follows that the events of requesting the label of $q_{i,k+1}$ for $i = 1, 2, \dots, k$ are conditionally independent.

Lemma 4.
Assume A1-2 and $\sigma \sim U$. Let $A(k, d, U)$ denote the probability of the event that the pairwise comparison $q_{i,k+1}$ is ambiguous for $i = 1, 2, \dots, k$. Then there exists a positive, real number constant $a$ independent of $k$ such that for $k \ge d$, $A(k, d, U) \le a \frac{2d}{k^2}$.

Proof. By Lemma 2, a point in the dual (pairwise comparison) is ambiguous if and only if there exists a separating hyperplane that passes through this point. This implies that the hyperplane representation of the pairwise comparison in the primal intersects the cell containing $r_\sigma$ (see Figure 2 for an illustration of this concept). Consider the partition of $\mathbb{R}^d$ generated by the hyperplanes corresponding to pairwise comparisons between objects $1, \dots, k$. Let $P(k, d)$ denote the number of $d$-cells in this partition that are intersected by a hyperplane corresponding to one of the queries $q_{i,k+1}$, $i \in \{1, \dots, k\}$. Then it is not difficult to show that $P(k, d)$ is bounded above by a constant independent of $n$ and $k$ times $\frac{k^{2(d-1)}}{2^{d-1}(d-1)!}$ (see Appendix A.4). By Lemma 3, every $d$-cell in the partition induced by the $k$ objects corresponds to an equally probable ranking of those objects. Therefore, the probability that a query is ambiguous is the number of cells intersected by the corresponding hyperplane divided by the total number of $d$-cells, and therefore $A(k, d, U) = \frac{P(k, d)}{Q(k, d)}$. The result follows immediately from the bounds on $P(k, d)$ and Corollary 1.

Because the individual events of requesting each query are conditionally independent, the total number of queries requested by the algorithm is just $M_n = \sum_{k=1}^{n-1} \sum_{i=1}^{k} \mathbf{1}\{\text{Request } q_{i,k+1}\}$. Using the results above, it is straightforward to prove the main theorem below (see Appendix A.5).

Theorem 3.
Assume A1-2 and $\sigma \sim U$. Let the random variable $M_n$ denote the number of pairwise comparisons that are requested in the algorithm of Figure 1, then

$$\mathbb{E}_U[M_n] \le \lceil 2da \rceil \log_2 n.$$

Furthermore, if $\sigma \sim \pi$ and $\max_{\sigma \in \Sigma_{n,d}} \pi_\sigma \le c\, |\Sigma_{n,d}|^{-1}$ for some $c > 0$, then $\mathbb{E}_\pi[M_n] \le c\, \mathbb{E}_U[M_n]$.

We now extend the algorithm of Figure 1 to situations in which the response to each query is only probably correct. If the correct label of a query $q_{i,j}$ is $y_{i,j}$, we denote the possibly incorrect response by $Y_{i,j}$. Let the probability that $Y_{i,j} = y_{i,j}$ be equal to $1 - p$, $p < 1/2$. The robust algorithm operates in the same fashion as the algorithm in Figure 1, with the exception that when an ambiguous query is encountered several (equivalent) queries are made and a decision is based on the majority vote. We will now judge performance based on two metrics: (i) how many queries are requested and (ii) how accurate the estimated ranking is with respect to the true ranking before it was corrupted. For any two rankings $\sigma$, $\widehat{\sigma}$ we adopt the popular Kendall tau distance [19]

$$d_\tau(\sigma, \widehat{\sigma}) = \sum_{(i,j) :\, \sigma(i) < \sigma(j)} \mathbf{1}\{\widehat{\sigma}(j) < \widehat{\sigma}(i)\} \qquad (4)$$

where $\mathbf{1}$ is the indicator function. Clearly, $d_\tau(\sigma, \widehat{\sigma}) = d_\tau(\widehat{\sigma}, \sigma)$ and $0 \le d_\tau(\sigma, \widehat{\sigma}) \le \binom{n}{2}$. For any ranking $\sigma \in \Sigma_{n,d}$ we wish to find an estimate $\widehat{\sigma} \in \Sigma_{n,d}$ that is close in terms of $d_\tau(\sigma, \widehat{\sigma})$ without requesting too many pairwise comparisons. For convenience, we will sometimes report results in terms of the proportion $\epsilon$ of incorrect pairwise orderings such that $d_\tau(\sigma, \widehat{\sigma}) \le \epsilon \binom{n}{2}$. Using the equivalence of the Kendall tau and Spearman's footrule distances (see [20]), if $d_\tau(\sigma, \widehat{\sigma}) \le \epsilon \binom{n}{2}$ then each object in $\widehat{\sigma}$ is, on average, no more than $O(\epsilon n)$ positions away from its position in $\sigma$. Thus, the Kendall tau distance is an intuitive measure of closeness between two rankings.

First consider the case in which each query can be repeated to obtain multiple independent responses (votes) for each comparison query. This random errors model arises, for example, in social choice theory where the "reference" is a group of people, each casting a vote. The elementary proof of the next theorem is given in Appendix A.6.

Theorem 4. Assume A1-2 and $\sigma \sim U$ but that each response to the query $q_{i,j}$ is a realization of an i.i.d. Bernoulli random variable $Y_{i,j}$ with $P(Y_{i,j} \ne y_{i,j}) \le p < 1/2$ for all distinct $i, j \in \{1, \dots, n\}$. If all ambiguous queries are decided by the majority vote of $R$ independent responses to each such query, then with probability greater than $1 - 2n \log_2(n) \exp\left(-\tfrac{1}{2}(1 - 2p)^2 R\right)$ this procedure correctly identifies the correct ranking (i.e. $\epsilon = 0$) and requests no more than $O(Rd \log n)$ queries on average.

We can deduce from the above theorem that to exactly recover the true ranking under the stated conditions with probability $1 - \delta$, one need only request $O\left(d\,(1 - 2p)^{-2} \log^2(n/\delta)\right)$ pairwise comparisons, on average.

In other situations, if we ask the same query multiple times we may get the same, possibly incorrect, response each time. This persistent errors model is natural, for example, if the reference is a single human. Under this model, if two rankings differ by only a single pairwise comparison, then they cannot be distinguished with probability greater than $1 - p$. So, in general, exact recovery of the ranking cannot be guaranteed with high probability. The best we can hope for is to exactly recover a partial ranking of the objects (i.e. the ranking over a subset of the objects) or a ranking that is merely probably approximately correct in terms of the Kendall tau distance of (4). We will first consider the task of exact recovery of a partial ranking of objects and then turn our attention to the recovery of an approximate ranking. Henceforth, we will assume the errors are persistent.

The robust query selection algorithm for persistent errors is presented in Figure 3. The key ingredient in the persistent errors setting is the design of a voting set for each ambiguous query encountered. Suppose the query $q_{i,j}$ is ambiguous in the algorithm of Figure 1. In principle, a voting set could be constructed using objects ranked between $i$ and $j$.
If object $k$ is between $i$ and $j$, then note that $y_{i,j} = y_{i,k} = y_{k,j}$. In practice, we cannot identify the subset of objects ranked between $i$ and $j$ exactly, but we can find a set that contains them. For an ambiguous query $q_{i,j}$ define

$$T_{i,j} := \{k \in \{1, \dots, n\} : q_{i,k},\ q_{k,j}, \text{ or both are ambiguous}\}. \qquad (5)$$

Then $T_{i,j}$ contains all objects ranked between $i$ and $j$ (if $k$ is ranked between $i$ and $j$, and $q_{i,k}$ and $q_{k,j}$ are unambiguous, then so is $q_{i,j}$, a contradiction). Furthermore, if the first $j - 1$ objects ranked in the algorithm were selected uniformly at random (or initialized in a random order in the algorithm) Lemma 3 implies that each object in $T_{i,j}$ is ranked between $i$ and $j$ with probability at least $1/3$ due to the uniform distribution over the rankings $\Sigma_{n,d}$ (see Appendix A.7 for an explanation).

$T_{i,j}$ will be our voting set. If we follow the sequential procedure of the algorithm of Figure 3, the first query encountered, call it $q_{1,2}$, will be ambiguous and $T_{1,2}$ will contain all the other $n - 2$ objects. However, at some point for some query $q_{i,j}$ it will become probable that the objects $i$ and $j$ are closely ranked. In that case, $T_{i,j}$ may be rather small, and so it is not always possible to find a sufficiently large voting set to accurately determine $y_{i,j}$. Therefore, we must specify a size-threshold $R \ge 0$. If the size of $T_{i,j}$ is at least $R$, then we draw $R$ indices from $T_{i,j}$ uniformly at random without replacement, call this set $\{t_l\}_{l=1}^{R}$, and decide the label for $q_{i,j}$ by voting over the responses to $\{q_{i,k}, q_{k,j} : k \in \{t_l\}_{l=1}^{R}\}$; otherwise we pass over object $j$ and move on to the next object in the list. Given that $|T_{i,j}| \ge R$ the label of $q_{i,j}$ is determined by:

$$0 \ \underset{j \prec i}{\overset{i \prec j}{\lessgtr}} \sum_{k \in \{t_l\}_{l=1}^{R}} \left( \mathbf{1}\{Y_{i,k} = 1 \wedge Y_{k,j} = 1\} - \mathbf{1}\{Y_{i,k} = 0 \wedge Y_{k,j} = 0\} \right). \qquad (6)$$

In the next section we will analyze this algorithm and show that it enjoys a very favorable query complexity while also admitting a probably approximately correct ranking.
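The voting step can be sketched in a few lines. The code below draws $R$ voters without replacement from a voting set and applies the rule of (6) to their paired responses. The function names are illustrative, and returning `None` on a tied vote is an assumption of this sketch, since the text does not say how ties are resolved.

```python
import random


def draw_voters(voting_set, R, rng):
    """Draw R distinct indices from T_{i,j}, or return None if the set is smaller than R."""
    if len(voting_set) < R:
        return None                      # the algorithm would pass over object j
    return rng.sample(sorted(voting_set), R)


def decide_label(responses):
    """Apply the vote of (6) to a list of (Y_ik, Y_kj) response pairs.

    Returns 1 for "i before j", 0 for "j before i", and None on a tied vote
    (tie handling is an assumption; the text does not specify it).
    """
    score = sum(int(a == 1 and b == 1) - int(a == 0 and b == 0) for a, b in responses)
    if score > 0:
        return 1
    if score < 0:
        return 0
    return None


votes = [(1, 1), (1, 0), (1, 1), (0, 0)]  # hypothetical noisy responses
print(decide_label(votes))
```

Pairs whose two responses disagree, such as `(1, 0)`, contribute nothing to the sum, so only voters that appear to lie consistently on one side of both $i$ and $j$ influence the decision.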
Robust Query Selection Algorithm
  input: $n$ objects in $\mathbb{R}^d$, $R \ge 1$
  initialize: objects $\Theta = \{\theta_1, \dots, \theta_n\}$ in uniformly random order, $\Theta' = \Theta$
  for $j = 2, \dots, n$
    for $i = 1, \dots, j-1$
      if $q_{i,j}$ is ambiguous,
        $T_{i,j} := \{ k \in \{1, \dots, n\} : q_{i,k}, q_{k,j}, \text{or both are ambiguous} \}$
        if $|T_{i,j}| \ge R$
          $\{t_l\}_{l=1}^{R}$ i.i.d. $\sim$ uniform$(T_{i,j})$
          request $Y_{i,k}, Y_{k,j}$ for all $k \in \{t_l\}_{l=1}^{R}$
          decide label of $q_{i,j}$ with (6)
        else
          $\Theta' \leftarrow \Theta' \setminus \theta_j$, $j \leftarrow j + 1$
      else
        impute $q_{i,j}$'s label from previously labeled queries.
  output: ranking over objects in $\Theta'$

Figure 3: Robust sequential algorithm for selecting queries of Section 5.1. See Figure 2 and Section 4.2 for the definition of an ambiguous query.

Consider the robust algorithm in Figure 3. At the end of the process, some objects that were passed over may then be unambiguously ranked (based on queries made after they were passed over) or they can be ranked without voting (and without guarantees). As mentioned in Section 5.1, if the first $j-1$ objects ranked in the algorithm of Figure 3 were chosen uniformly at random from the full set (i.e., none of the first $j-1$ objects were passed over) then there is at least a one in three chance that each object in $T_{i,j}$ for some ambiguous query $q_{i,j}$ is ranked between $i$ and $j$. With this in mind, we have the following theorem, proved in Appendix A.7.

Theorem 5.
Assume A1-2, $\sigma \sim U$, and $P(Y_{i,j} \neq y_{i,j}) = p$. For every set $T_{i,j}$ constructed in the algorithm of Figure 3, assume that an object selected uniformly at random from $T_{i,j}$ is ranked between $\theta_i$ and $\theta_j$ with probability at least $1/3$. Then for any size-threshold $R \ge 1$, with probability greater than $1 - 2n \log_2(n) \exp\!\big(-\tfrac{2}{9}(1-2p)^2 R\big)$ the algorithm correctly ranks at least $n/(2R+1)$ objects and requests no more than $O(R d \log n)$ queries on average.

Note that before the algorithm skips over an object for the first time, all objects that are ranked at such an intermediate stage are a subset chosen uniformly at random from the full set of objects, due to the initial randomization. Therefore, if $T_{i,j}$ is a voting set in this stage, an object selected uniformly at random from $T_{i,j}$ is ranked between $\theta_i$ and $\theta_j$ with probability at least $1/3$, per Lemma 3. After one or more objects are passed over, however, the distribution is no longer necessarily uniform due to this action, and so the assumption of the theorem above may not hold. The procedure of the algorithm is still reasonable, but it is difficult to give guarantees on performance without the assumption. Nevertheless, this discussion leads us to wonder how many objects the algorithm will rank before it skips over its first object. The next lemma is proved in Appendix A.8.

Lemma 5.
Consider a ranking of n objects and suppose objects are drawn sequentially, cho-sen uniformly at random without replacement. If M is the largest integer such that M ob-jects are drawn before any object is within R positions of another one in the ranking, then M ≥ (cid:113) n/R with probability at least
16 log(2) (cid:0) e − ( √ R/n +1) / − − n/ (3 R ) (cid:1) . As n/R → ∞ , P ( M ≥ (cid:113) n/R ) → √ e log(2) . Lemma 5 characterizes how many objects the robust algorithm will rank before it passes over its firstobject because if there are at least R objects between every pair of the first M objects, then T i,j ≥ R for all distinct i, j ∈ { , . . . , M } and none of the first M objects will be passed over. We canconclude from Lemma 5 and Theorem 5 that with constant probability (with respect to the initial or-dering of the objects and the randomness of the voting), the algorithm of Figure 3 exactly recovers apartial ranking of at least Ω( (cid:112) (1 − p ) n/ log n ) objects by requesting just O (cid:0) d (1 − p ) − log n (cid:1) pairwise comparisons, on average, with respect to all the rankings in Σ n,d . If we repeat the algorithmwith different initializations of the objects each time, we can boost this constant probability to anarbitrarily high probability (recall that the responses to queries will not change over the repetitions).Note, however, that the correctness of the partial ranking does not indicate how approximately cor-rect the remaining rankings will be. If the algorithm of Figure 3 ranks m objects before skipping9
Figure 4: Mean and standard deviation of requested queries (solid) in the error-free case for $n = 100$; $\log_2 |\Sigma_{n,d}|$ is a lower bound (dashed). Horizontal axis: dimension; vertical axis: number of query requests.

Table 1: Statistics for the algorithm robust to persistent errors of Section 5 with respect to all $\binom{n}{2}$ pairwise comparisons. Recall $y$ is the noisy response vector, $\tilde{y}$ is the embedding's solution, and $\hat{y}$ is the output of the robust algorithm.

  Dimension                              2       3
  % of queries requested   mean          14.5    18.5
                           std           5.3     6
  Average error            d(y, ỹ)
                           d(y, ŷ)

$\Sigma_{n,d}$ that is consistent with the probably correct partial ranking of the first $m$ objects (the output ranking of the algorithm may contain more than $m$ objects but we make no guarantees about these additional objects). The proof is available in Appendix A.9.

Lemma 6.
Assume A1-2 and $\sigma \sim U$. Suppose we select $1 \le m < n$ objects uniformly at random from the $n$ and correctly rank them amongst themselves. If $\hat{\sigma}$ is any ranking in $\Sigma_{n,d}$ that is consistent with all the known pairwise comparisons between the $m$ objects, then $E[d_\tau(\sigma, \hat{\sigma})] = O(d/m^2) \binom{n}{2}$, where the expectation is with respect to the random selection of objects and the distribution of the rankings $U$.

Combining Lemmas 5 and 6 in a straightforward way, we have the following theorem.
Theorem 6. Assume A1-2, $\sigma \sim U$, and $P(Y_{i,j} \neq y_{i,j}) = p$. If $R = \Theta\big((1-2p)^{-2} \log n\big)$ and $\hat{\sigma}$ is any ranking in $\Sigma_{n,d}$ that is consistent with all known pairwise comparisons between the subset of objects ranked in the output of the algorithm of Figure 3, then with constant probability $E[d_\tau(\sigma, \hat{\sigma})] = O\big(d (1-2p)^{-2} \log(n)/n\big) \binom{n}{2}$ and no more than $O\big(d (1-2p)^{-2} \log^2(n)\big)$ pairwise comparisons are requested, on average.

If we repeat the algorithm with different initializations of the objects until a sufficient number of objects are ranked before an object is passed over, we can boost this constant probability to an arbitrarily high probability. However, in practice, we recommend running the algorithm just once to completion since we do not believe passing over an object early on greatly affects performance.
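Lemma 6 and Theorem 6 bound the expected distance $d_\tau$ between the true and the recovered ranking. As a minimal sketch, assuming $d_\tau$ counts the pairs of objects that the two rankings order differently (which is consistent with the bounds being multiples of $\binom{n}{2}$, although definition (4) is not reproduced in this section), the distance can be computed as follows. The function name and the list-of-ids representation are illustrative choices, not part of the paper.

```python
from itertools import combinations
from typing import Hashable, Sequence


def kendall_tau_distance(sigma: Sequence[Hashable],
                         sigma_hat: Sequence[Hashable]) -> int:
    """Count the object pairs ordered differently by two rankings.

    Each ranking is a sequence of object ids, listed from first to last.
    """
    pos_a = {obj: rank for rank, obj in enumerate(sigma)}
    pos_b = {obj: rank for rank, obj in enumerate(sigma_hat)}
    if pos_a.keys() != pos_b.keys():
        raise ValueError("both rankings must contain the same objects")
    return sum(
        1
        for u, v in combinations(sigma, 2)
        if (pos_a[u] < pos_a[v]) != (pos_b[u] < pos_b[v])
    )
```

Dividing the result by $n(n-1)/2$ gives the fraction of disagreeing pairs, which is the quantity the bound $O(d/m^2)$ of Lemma 6 controls in expectation.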
In this section we present empirical results for both the error-free algorithm of Figure 1 and the robust algorithm of Figure 3. For the error-free algorithm, $n = 100$ points, representing the objects to be ranked, were simulated uniformly at random from the unit hypercube $[0,1]^d$ for $d = 1, 10, 20, \dots, 100$. The reference was simulated from the same distribution. For each value of $d$ the experiment was repeated multiple times, each time using a new simulation of the points and the reference. Because responses are error-free, exact identification of the ranking is guaranteed. The number of requested queries is plotted in Figure 4 with the lower bound of Theorem 1 for reference. The number of requested queries never exceeds twice the lower bound, which agrees with the result of Theorem 3.

The robust algorithm of Figure 3 was evaluated using a symmetric similarity matrix dataset available at [21] whose $(i,j)$th entry, denoted $s_{i,j}$, represents the human-judged similarity between audio signals $i$ and $j$ for all $i \neq j \in \{1, \dots, 100\}$. If we consider the $k$th row of this matrix, we can rank the other signals with respect to their similarity to the $k$th signal; we define $q^{(k)}_{i,j} := \{s_{k,i} > s_{k,j}\}$ and $y^{(k)}_{i,j} := \mathbf{1}\{q^{(k)}_{i,j}\}$. Since the similarities were derived from human subjects, the derived labels may be erroneous. Moreover, there is no possibility of repeating queries here and so the errors are persistent. The analysis of this dataset in [2] suggests that the relationship between signals can be well approximated by an embedding in 2 or 3 dimensions. We used non-metric multidimensional scaling [5] to find an embedding of the signals: $\theta_1, \dots, \theta_{100} \in \mathbb{R}^d$ for $d = 2$ and $3$. For each object $k$, we use the embedding to derive pairwise comparison labels between all other objects as follows: $\tilde{y}^{(k)}_{i,j} := \mathbf{1}\{\|\theta_k - \theta_i\| < \|\theta_k - \theta_j\|\}$, which can be considered as the best approximation to the labels $y^{(k)}_{i,j}$ (defined above) in this embedding.
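The embedding-derived labels $\tilde{y}^{(k)}_{i,j}$ defined above can be illustrated with a short sketch. It assumes the embedding is given as a list of coordinate tuples and uses 0-based indices; the helper names `embedding_labels` and `label_disagreement` are hypothetical, and the second helper simply reports the fraction of pairs on which two label sets differ.

```python
import math
from typing import Dict, Sequence, Tuple

Labels = Dict[Tuple[int, int], int]


def embedding_labels(theta: Sequence[Sequence[float]], k: int) -> Labels:
    """Labels 1{||theta_k - theta_i|| < ||theta_k - theta_j||} for all i < j, i, j != k."""
    others = [i for i in range(len(theta)) if i != k]
    labels: Labels = {}
    for a, i in enumerate(others):
        for j in others[a + 1:]:
            closer = math.dist(theta[k], theta[i]) < math.dist(theta[k], theta[j])
            labels[(i, j)] = int(closer)
    return labels


def label_disagreement(y: Labels, y_other: Labels) -> float:
    """Fraction of queries on which two label sets over the same pairs disagree."""
    if y.keys() != y_other.keys():
        raise ValueError("label sets must cover the same pairs")
    return sum(y[q] != y_other[q] for q in y) / len(y)
```

With reference object `k = 0` at the origin and three other points at distances 1, 3 and 2, the labels are `{(1, 2): 1, (1, 3): 1, (2, 3): 0}`.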
The output of the robust sequential algorithm, which uses only a small fraction of the similarities, is denoted by $\hat{y}^{(k)}_{i,j}$. We set $R = 15$ using Theorem 6 as a rough guide. Errors are measured with the popular Kendall tau distance $d(y^{(k)}, \hat{y}^{(k)}) = \binom{n}{2}^{-1} \sum_{i<j} \mathbf{1}\{y^{(k)}_{i,j} \neq \hat{y}^{(k)}_{i,j}\}$.

References

[1] D.E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 1998.
[2] Scott Philips, James Pitton, and Les Atlas. Perceptual feature identification for active sonar echoes. In OCEANS 2006, 2006.
[3] B. McFee and G. Lanckriet. Partial order embedding with multiple kernels. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 721–728. ACM, 2009.
[4] I. Gormley and T. Murphy. A latent space model for rank data. Statistical Network Analysis: Models, Issues, and New Directions, pages 90–102, 2007.
[5] M.A.A. Cox and T.F. Cox. Multidimensional scaling. Handbook of data visualization, pages 315–347, 2008.
[6] J.F. Traub. Information-based complexity. John Wiley and Sons Ltd., 2003.
[7] C.H. Coombs. A theory of data. Psychological review, 67(3):143–159, 1960.
[8] T.M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers, 14(3):326–334, 1965.
[9] S. Dasgupta, A.T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. The Journal of Machine Learning Research, 10:281–299, 2009.
[10] S. Hanneke. Theoretical foundations of active learning. PhD thesis, Citeseer, 2009.
[11] Tibor Hegedüs. Generalized teaching dimensions and the query complexity of learning. In Proceedings of the eighth annual conference on Computational learning theory, COLT '95, pages 108–117, New York, NY, USA, 1995. ACM.
[12] Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4:933–969, 2003.
[13] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pages 89–96. ACM, 2005.
[14] Z. Zheng, K. Chen, G. Sun, and H. Zha. A regression framework for learning ranking functions using relative relevance judgments. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 287–294. ACM, 2007.
[15] R. Herbrich, T. Graepel, and K. Obermayer. Support vector learning for ordinal regression. In Artificial Neural Networks, 1999. ICANN 99. Ninth International Conference on (Conf. Publ. No. 470), volume 1, pages 97–102. IET, 1999.
[16] T. Lu and C. Boutilier. Robust approximation and incremental elicitation in voting protocols. IJCAI-11, Barcelona, 2011.
[17] W. Chu and Z. Ghahramani. Extensions of gaussian processes for ranking: semi-supervised and active learning. Learning to Rank, page 29, 2005.
[18] J.F. Bennett and W.L. Hays. Multidimensional unfolding: Determining the dimensionality of ranked preference data. Psychometrika, 25(1):27–43, 1960.
[19] J.I. Marden. Analyzing and modeling rank data. Chapman & Hall/CRC, 1995.
[20] P. Diaconis and R.L. Graham. Spearman's footrule as a measure of disarray. Journal of the Royal Statistical Society. Series B (Methodological), pages 262–268, 1977.
[21] Similarity Learning. Aural Sonar dataset. [http://idl.ee.washington.edu/SimilarityLearning]. University of Washington Information Design Lab, 2011.

A Appendix

A.1 Computational complexity and implementation

The computational complexity of the algorithm in Figure 1 is determined by the complexity of testing whether a query is ambiguous or not and how many times we make this test. As written in Figure 1, the test would be performed $O(n^2)$ times. But if binary sort is used instead of the brute-force linear search this can be reduced to $n \log_2 n$ and, in fact, this is implemented in our simulations and the proofs of the main results.
The complexity of each test is polynomial in the number of queries requested because each one is a linear constraint. Because our results show that no more than $O(d \log n)$ queries are requested, the overall complexity is no greater than $O(n\, \mathrm{poly}(d)\, \mathrm{poly}(\log n))$.

A.2 Proof of Corollary 1

Proof. For initial conditions given in Lemma 1, if $d \ll n - 1$ a simple manipulation of (3) shows

$$\begin{aligned}
Q(n,d) &= 1 + \sum_{i=1}^{n-1} (n-i)\, Q(n-i, d-1) = 1 + \sum_{i=1}^{n-1} i\, Q(i, d-1) \\
&= 1 + \sum_{i=1}^{n-1} i \Big[ 1 + \sum_{j=1}^{i-1} j\, Q(j, d-2) \Big] \\
&= 1 + \Theta(n^2/2) + \sum_{i=1}^{n-1} \sum_{j=1}^{i-1} i\, j \Big[ 1 + \sum_{k=1}^{j-1} k\, Q(k, d-3) \Big] \\
&= 1 + \Theta(n^2/2) + \Theta(n^4/2/4) + \sum_{i=1}^{n-1} \sum_{j=1}^{i-1} \sum_{k=1}^{j-1} i\, j\, k \Big[ 1 + \sum_{l=1}^{k-1} l\, Q(l, d-4) \Big] \\
&= 1 + \Theta(n^2/2) + \cdots + \Theta\Big( \frac{n^{2d}}{2^d\, d!} \Big).
\end{aligned}$$

From simulations, this is very tight for large values of $n$. If $d \ge n - 1$ then $Q(n,d) = n!$ because any permutation of $n$ objects can be embedded in $(n-1)$-dimensional space [7].

A.3 Construction of a $d$-cell with $n-1$ sides

Situations may arise in which $\Omega(n)$ queries must be requested to identify a ranking because the $d$-cell representing the ranking is bounded by $n-1$ hyperplanes (queries) and if they are not all requested, the ranking is ambiguous. We now show how to construct this pathological situation in $\mathbb{R}^2$. Let $\Theta$ be a collection of $n$ points in $\mathbb{R}^2$ where each $\theta \in \Theta$ satisfies $\theta_2 = \theta_1^2$ and $\theta_1 \in [0,1]$, where $\theta_i$ denotes the $i$th dimension of $\theta$ ($i \in \{1, 2\}$). Then there exists a $2$-cell in the hyperplane arrangement induced by the queries that has $n-1$ sides. This follows because the slope of the parabola keeps increasing with $\theta_1$, making at least one query associated with $(n-1)$ $\theta$'s bisect the lower-left, unbounded $2$-cell. This can be observed in Figure 5. Obviously, a similar arrangement could be constructed for all $d \ge 2$.
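The unrolled recursion in the proof of Corollary 1 (A.2), $Q(n,d) = 1 + \sum_{i=1}^{n-1} i\, Q(i, d-1)$, is easy to evaluate numerically. The sketch below is a sanity check rather than the authors' code; the boundary conditions $Q(1,d) = 1$ and $Q(n,0) = 1$ are assumptions made here (Lemma 1 is not reproduced in this section), chosen because they reproduce $Q(n,1) = \binom{n}{2} + 1$ and $Q(n,d) = n!$ for $d \ge n-1$.

```python
from functools import lru_cache
from math import factorial


@lru_cache(maxsize=None)
def num_rankings(n: int, d: int) -> int:
    """Evaluate Q(n, d) = 1 + sum_{i=1}^{n-1} i * Q(i, d-1).

    Assumed boundary conditions: Q(1, d) = 1 and Q(n, 0) = 1.
    """
    if n < 1 or d < 0:
        raise ValueError("need n >= 1 and d >= 0")
    if n == 1 or d == 0:
        return 1
    return 1 + sum(i * num_rankings(i, d - 1) for i in range(1, n))


def leading_term(n: int, d: int) -> float:
    """Leading term n^(2d) / (2^d d!) from the last line of the derivation."""
    return n ** (2 * d) / (2 ** d * factorial(d))
```

For instance, `num_rankings(4, 2)` evaluates to 18, `num_rankings(4, 3)` to $24 = 4!$, and for $n = 100$, $d = 2$ the exact count is within a few percent of `leading_term(100, 2)`.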
Figure 5: The points $\Theta$ representing the objects are dots on the right, the lines are the queries, and the black, bold lines are the queries bounding the $(n-1)$-sided $2$-cell.

A.4 Proof of Lemma 4

Proof. Here we prove an upper bound on $P(k,d)$. $P(k,d)$ is equal to the number of $d$-cells in the partition induced by objects $1, \dots, k$ that are intersected by a hyperplane corresponding to a pairwise comparison query between object $k+1$ and object $i$, $i \in \{1, \dots, k\}$. This new hyperplane is intersected by all the $\binom{k}{2}$ hyperplanes in the partition. These intersections partition the new hyperplane into a number of $(d-1)$-cells. Because the $(k+1)$st object is in general position with respect to objects $1, \dots, k$, the intersecting hyperplanes will not intersect the hyperplane in any special or non-general way. That is to say, the number of $(d-1)$-cells this hyperplane is partitioned into is the same number that would occur if the hyperplane were intersected by $\binom{k}{2}$ hyperplanes in general position. Let $K = \binom{k}{2}$ for ease of notation. It follows then from [8, Theorem 3] that

$$P(k,d) = \sum_{i=0}^{d-1} \binom{K}{i} = \sum_{i=0}^{d-1} O\Big( \frac{K^i}{i!} \Big) = \sum_{i=0}^{d-1} O\Big( \frac{k^{2i}}{2^i\, i!} \Big) = O\Big( \frac{k^{2(d-1)}}{2^{d-1} (d-1)!} \Big).$$

A.5 Proof of Theorem 3

Proof. Let $B_{k+1}$ denote the total number of pairwise comparisons requested of the $(k+1)$st object; i.e., the number of ambiguous queries in the set $q_{i,k+1}$, $i = 1, \dots, k$. Because the individual events of requesting these are conditionally independent (see Section 4.3), it follows that each $B_{k+1}$ is an independent binomial random variable with parameters $A(k,d,U)$ and $k$. The total number of queries requested by the algorithm is

$$M_n = \sum_{k=1}^{n-1} \sum_{i=1}^{k} \mathbf{1}\{\text{Request } q_{i,k+1}\} = \sum_{k=1}^{n-1} B_{k+1}.$$
(7)Because Lemma 4 is only relevant for sufficiently large k , we assume that none of the pairwisecomparisons are ambiguous when k ≤ da . Recall from Section A.1 that binary sort is implementedso for these first (cid:100) da (cid:101) objects, at most (cid:100) da (cid:101) log ( (cid:100) da (cid:101) ) queries are requested. For k > da thenumber of requested queries to the k th object is upper bounded by the number of ambiguous queries13f the k th object. Then using the known mean and variance formulas for the binomial distribution E U (cid:2) M n (cid:3) = n − (cid:88) k =1 E U (cid:2) B k +1 (cid:3) ≤ (cid:100) da (cid:101) (cid:88) k =2 B k +1 + n − (cid:88) k = (cid:100) da (cid:101) +1 dak ≤ (cid:100) da (cid:101) log (cid:100) da (cid:101) + 2 da log (cid:0) n/ (cid:100) da (cid:101) (cid:1) ≤ (cid:100) da (cid:101) log n We now consider the case for a general distribution π . Enumerate the rankings of Σ n,d . Let N i denote the (random) number of requested queries needed by the algorithm to reconstruct the i thranking. Note that the randomness of N i is only due to the randomization of the algorithm. Let π i denote the probability it assigns to the i th ranking as in Definition 1. Then E π [ M n ] = Q ( n,d ) (cid:88) i =1 π i E [ N i ] . (8)Assume that the distribution over rankings is bounded above such that no ranking is overwhelminglyprobable. Specifically, assume that the probability of any one ranking is upper bounded by c/Q ( n, d ) for some constant c > that is independent of n . Under this bounded distribution assumption, E π [ M n ] is maximized by placing probability c/Q ( n, d ) on the k := Q ( n, d ) /c cells for which E [ N i ] is largest (we will assume k is an integer, but it is straightforward to extend the followingargument to the general case). Since the mass on these cells is equal, without loss of generalitywe may assume that E [ N i ] = µ , a common value on each, and we have E π [ M n ] = µ . 
For the remaining $Q(n,d) - k$ cells we know that $E[N_i] \ge d$, since each cell is bounded by at least $d$ hyperplanes/queries. Under these conditions, we can relate $E_\pi[M_n]$ to $E_U[M_n]$ as follows. First observe that

$$E_U[M_n] = \frac{1}{Q(n,d)} \sum_{i=1}^{Q(n,d)} E[N_i] \ \ge\ \frac{k}{Q(n,d)}\, \mu + d\, \frac{Q(n,d) - k}{Q(n,d)},$$

which implies

$$E_\pi[M_n] = \mu \ \le\ \frac{Q(n,d)}{k} \Big( E_U[M_n] - d\, \frac{Q(n,d) - k}{Q(n,d)} \Big) = c \Big( E_U[M_n] - d\, \frac{Q(n,d) - k}{Q(n,d)} \Big) \ \le\ c\, E_U[M_n].$$

In words, the non-uniformity constant $c > 1$ scales the expected number of queries. Under A1-2, for large $n$ we have $E_\pi[M_n] = O(c\, d \log n)$.

A.6 Proof of Theorem 4

Proof. Suppose $q_{i,j}$ is ambiguous. Let $\hat{\alpha}$ be the frequency of $Y_{i,j} = 1$ after $R$ trials. Let $E[\hat{\alpha}] = \alpha$. The majority vote decision is correct if $|\alpha - \hat{\alpha}| \le 1/2 - p$. By Chernoff's bound, $P(|\alpha - \hat{\alpha}| \ge 1/2 - p) \le 2 \exp(-2(1/2 - p)^2 R)$. The result follows from the union bound over the total number of queries considered: $n \log_2 n$ (see Appendix A.1).

A.7 Proof of Theorem 5

Suppose $q_{i,j}$ is ambiguous. Let $S_{i,j}$ denote the subset of $\Theta$ such that $\theta_k \in S_{i,j}$ if it is ranked between objects $\theta_i$ and $\theta_j$ (i.e., $S_{i,j} = \{\theta_k \in \Theta : \theta_i \prec \theta_k \prec \theta_j \text{ or } \theta_j \prec \theta_k \prec \theta_i\}$). Note that $y_{i,j} = y_{i,k} = y_{k,j}$ if and only if $\theta_k \in S_{i,j}$. If we define $E^k_{i,j} = \mathbf{1}\{Y_{i,k} = 1 \wedge Y_{k,j} = 1\} - \mathbf{1}\{Y_{i,k} = 0 \wedge Y_{k,j} = 0\}$, where $\mathbf{1}$ is the indicator function, then for any subset $T \subset \Theta$ such that $S_{i,j} \subset T$, the sign of the sum $\sum_{\theta_k \in T} E^k_{i,j}$ is a predictor of $y_{i,j}$. In fact, with respect to just the random errors, $E\big[\big| \sum_{\theta_k \in T} E^k_{i,j} \big|\big] = |S_{i,j}| (1 - 2p)$. To see this, without loss of generality let $y_{i,j} = 1$; then for $\theta_k \in S_{i,j}$

$$\begin{aligned}
E[E^k_{i,j}] &= E\big[ \mathbf{1}\{Y_{i,k} = 1 \wedge Y_{k,j} = 1\} - \mathbf{1}\{Y_{i,k} = 0 \wedge Y_{k,j} = 0\} \big] \\
&= P(Y_{i,k} = 1 \wedge Y_{k,j} = 1) - P(Y_{i,k} = 0 \wedge Y_{k,j} = 0) \\
&= (1-p)^2 - p^2 = 1 - 2p.
\end{aligned}$$
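The expectation just computed can be checked by enumerating the four error patterns of the two responses. This is a small verification sketch, assuming each response is flipped independently with probability $p$; the function name `expected_vote` and its argument names are illustrative. For comparison it can also be evaluated when the two true labels disagree, that is, when object $k$ does not lie between $i$ and $j$.

```python
from itertools import product


def expected_vote(y_ik: int, y_kj: int, p: float) -> float:
    """Exact E[ 1{Y_ik=1, Y_kj=1} - 1{Y_ik=0, Y_kj=0} ].

    y_ik and y_kj are the true 0/1 labels; each observed response equals the
    true label flipped independently with probability p.
    """
    total = 0.0
    for flip_a, flip_b in product((0, 1), repeat=2):
        prob = (p if flip_a else 1.0 - p) * (p if flip_b else 1.0 - p)
        a = y_ik ^ flip_a            # observed response to q_{i,k}
        b = y_kj ^ flip_b            # observed response to q_{k,j}
        vote = int(a == 1 and b == 1) - int(a == 0 and b == 0)
        total += prob * vote
    return total
```

With $p = 0.2$ and both true labels equal to 1 the result is $0.8^2 - 0.2^2 = 0.6 = 1 - 2p$; with true labels $(1, 0)$ the two terms cancel and the expectation is 0.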
If $\theta_k \notin S_{i,j}$ then it can be shown by a similar calculation that $E[E^k_{i,j}] = 0$.

To identify $S_{i,j}$ we use the fact that if $\theta_k \in S_{i,j}$ then $q_{i,k}$, $q_{j,k}$, or both are also ambiguous, simply because otherwise $q_{i,j}$ would not have been ambiguous in the first place (Figure 6 may be a useful aid to see this). While the converse is false, Lemma 3 says that each of the six possible rankings of $\{\theta_i, \theta_j, \theta_k\}$ is equally probable if they were chosen uniformly at random (thus partly justifying this explicit assumption in the theorem statement). It follows that if we define the subset $T_{i,j} \subset \Theta$ to be those objects $\theta_k$ with the property that $q_{i,k}$, $q_{k,j}$, or both are ambiguous, then the probability that $\theta_k \in S_{i,j}$ is at least $1/3$ if $\theta_k \in T_{i,j}$. You can convince yourself of this using Figure 6. Moreover, $E\big[\big| \sum_{k \in T_{i,j}} E^k_{i,j} \big|\big] \ge |T_{i,j}| (1 - 2p)/3$, which implies the sign of the sum $\sum_{\theta_k \in T_{i,j}} E^k_{i,j}$ is a reliable predictor of the label of $q_{i,j}$; just how reliable depends only on the size of $T_{i,j}$.

Figure 6: Let $q_{i,j}$ be ambiguous. Object $k$ will be informative to the majority vote of $y_{i,j}$ if the reference lies in the shaded region. There are six possible rankings ($\theta_j \prec \theta_k \prec \theta_i$, $\theta_i \prec \theta_k \prec \theta_j$, $\theta_i \prec \theta_j \prec \theta_k$, $\theta_j \prec \theta_i \prec \theta_k$, $\theta_k \prec \theta_j \prec \theta_i$, $\theta_k \prec \theta_i \prec \theta_j$) and if $q_{i,k}$, $q_{k,j}$, or both are ambiguous then the probability that the reference is in the shaded region is at least $1/3$.

Fix $R > 0$. Suppose $q_{i,j}$ is ambiguous and assume without loss of generality that $y_{i,j} = 1$. Given that $E\big[ \sum_{k \in T_{i,j}} E^k_{i,j} \big] \ge |T_{i,j}| (1 - 2p)/3$ from above, it follows from Hoeffding's inequality that the probability that $\sum_{k \in T_{i,j}} E^k_{i,j} \le 0$ is less than $\exp\!\big(-\tfrac{2}{9}(1-2p)^2 |T_{i,j}|\big)$. If only a subset of $T_{i,j}$ of size $R$ is used in the sum then $|T_{i,j}|$ is replaced by $R$ in the exponent.
This test is only performedwhen | T i,j | > R and clearly no more times than the number of queries considered to rank n objectsin the full ranking: n log n . Thus, all decisions using this test are correct with probability at least − n log ( n ) exp (cid:0) − (1 − p ) R (cid:1) . Only a subset of the n objects will be ranked and of those, R + 1 times more queries will be requested than in the error-free case (two queries per object in T i,j ). Thus the robust algorithm will request no more than O ( Rd log n ) queries on average.To determine the number of objects that are in the partial ranking, let Θ (cid:48) ⊂ Θ denote the subset ofobjects that are ranked in the output partial ranking. Each θ k ∈ Θ (cid:48) is associated with an index inthe true full ranking and is denoted by σ ( θ k ) . That is, if σ ( θ k ) = 5 then it is ranked fifth in the fullranking but in the partial ranking could be ranked first, second, third, fourth, or fifth. Now imaginethe real line with tick marks only at the integers , . . . , n . For each θ k ∈ Θ (cid:48) place an R -ball aroundeach θ k on these tick marks such that if σ ( θ k ) = 5 and R = 3 then , . . . , are covered by the ballaround σ ( θ k ) and and , . . . , n are not. Then the union of the balls centered at the objects in Θ (cid:48) cover , . . . , n . If this were not true then there would be an object θ j / ∈ Θ (cid:48) with | S i,j | > R for all θ i ∈ Θ (cid:48) . But S i,j ⊂ T i,j implies | T i,j | > R which implies j ∈ Θ (cid:48) , a contradiction. Because at15east n/ (2 R + 1) R -balls are required to cover , . . . , n , at least this many objects are contained in Θ (cid:48) . A.8 Proof of Lemma 5 Proof. Assume M ≤ n R . If p m denotes the probability that the ( m + 1) st object is within R positions of one of the first m objects, given that none of the first m objects are within R positionsof each other, then Rmn < p m ≤ Rmn − m and P ( M = m ) ≥ m − (cid:89) l =1 (cid:18) − Rln − l (cid:19) Rmn . 
Taking the log we find log P ( M = m ) ≥ log Rmn + m − (cid:88) l =1 log (cid:18) − Rln − l (cid:19) ≥ log Rmn + ( m − 1) log (cid:18) m − m − (cid:88) l =1 (cid:18) − Rln − l (cid:19)(cid:19) ≥ log Rmn + ( m − 1) log (cid:18) − Rmn − m + 1 (cid:19) ≥ log Rmn + ( m − 1) log (cid:18) − Rm n (cid:19) ≥ log Rmn + ( m − (cid:18) − Rmn (cid:19) where the second line follows from Jensen’s inequality, the fourth line follows from the fact that m ≤ n R , and the last line follows from the fact that (1 − x ) ≥ exp( − x ) for x ≤ / . Weconclude that P ( M = m ) ≥ Rn m exp {− Rn m } . Now if a = (cid:113) n/R we have P ( M ≥ a ) ≥ n/ (3 R ) − (cid:88) m = (cid:100) a (cid:101) Rn m exp {− Rn m }≥ (cid:90) n/ (3 R ) a +1 Rn x exp {− Rn x } dx = 16 log(2) (cid:18) e − ( √ R/n +1) / − e − log(2) n/ (3 R ) (cid:19) where the second line follows from the fact that xe − αx / is monotonically decreasing for x ≥ (cid:112) /α . Note, P ( M ≥ (cid:113) n/R ) is greater than for n/R ≥ , and for n/R ≥ . Moreover,as n/R → ∞ , P ( M ≥ (cid:113) n/R ) → √ e log(2) .16 .9 Proof of Lemma 6 Proof. Enumerate the objects such that the first m are the objects ranked amongst themselves. Let y be the pairwise comparison label vector for σ and ˆ y be the corresponding vector for (cid:98) σ . Then E [ d τ ( σ, (cid:98) σ )] = m (cid:88) k =2 k − (cid:88) l =1 { y l,k (cid:54) = ˆ y l,k } + n (cid:88) k = m +1 k − (cid:88) l =1 { y l,k (cid:54) = ˆ y l,k } = n (cid:88) k = m +1 k − (cid:88) l =1 { y l,k (cid:54) = ˆ y l,k }≤ n (cid:88) k = m +1 k − (cid:88) l =1 P { Request q l,k | labels to q s ≤ m,t ≤ m }≤ n (cid:88) k = m +1 k − (cid:88) l =1 adm ≤ adm ( n − m )( n + m + 1)2 ≤ ad (cid:18) ( n + 1) m − (cid:19) . where the third line assumes that every pairwise comparison that is ambiguous (that is, cannot beimputed using the knowledge gained from the first mm