Antithetic and Monte Carlo kernel estimators for partial rankings
Maria Lomeli, Mark Rowland, Arthur Gretton, Zoubin Ghahramani
Abstract
In the modern age, rankings data is ubiquitous and useful for a variety of applications such as recommender systems, multi-object tracking and preference learning. However, most rankings data encountered in the real world is incomplete, which prevents the direct application of existing modelling tools for complete rankings. Our contribution is a novel way to extend kernel methods for complete rankings to partial rankings, via consistent Monte Carlo estimators for Gram matrices: matrices of kernel values between pairs of observations. We also present a novel variance-reduction scheme based on an antithetic variate construction between permutations to obtain an improved estimator for the Mallows kernel. The corresponding antithetic kernel estimator has lower variance, and we demonstrate empirically that it has better performance in a variety of machine learning tasks. Both kernel estimators are based on extending kernel mean embeddings to the embedding of a set of full rankings consistent with an observed partial ranking. They form a computationally tractable alternative to previous approaches for partial rankings data. An overview of the existing kernels and metrics for permutations is also provided.
Keywords
Reproducing Kernel Hilbert Space; Partial rankings; Monte Carlo; Antithetic variates; Gram matrix
M. Lomeli
Computational and Biological Lab, University of Cambridge. E-mail: [email protected]

M. Rowland
Department of Pure Mathematics and Mathematical Statistics, University of Cambridge. E-mail: [email protected]

A. Gretton
Gatsby Computational Neuroscience Unit, University College London. E-mail: [email protected]

Z. Ghahramani
Computational and Biological Lab, University of Cambridge and Uber AI Labs. E-mail: [email protected]
1 Introduction

Permutations play a fundamental role in statistical modelling and machine learning applications involving rankings and preference data. A ranking over a set of objects can be encoded as a permutation; hence, kernels for permutations are useful in a variety of machine learning applications involving rankings, such as recommender systems, multi-object tracking and preference learning. It is of interest to construct a kernel in the space of the data in order to capture similarities between datapoints and thereby influence the pattern of generalisation. Kernels are used in many machine learning methods: for instance, a kernel input is required for the maximum mean discrepancy (MMD) two-sample test [15], kernel principal component analysis (kPCA) [29], support vector machines [5, 7], Gaussian processes (GPs) [27] and agglomerative clustering [10], among others.

Our main contributions are: (i) a novel and computationally tractable way to deal with incomplete or partial rankings, by first representing the marginalised kernel [17] as a kernel mean embedding of the set of full rankings consistent with an observed partial ranking. We then propose two estimators that can be represented as the corresponding empirical mean embeddings: (ii) a Monte Carlo kernel estimator based on sampling independent and identically distributed rankings from the set of consistent full rankings given an observed partial ranking; (iii) an antithetic variate construction for the marginalised Mallows kernel that gives a lower-variance estimator for the kernel Gram matrix. The Mallows kernel has been shown to be an expressive kernel; in particular, Mania et al. [26] show that the Mallows kernel is an example of a universal and characteristic kernel, hence a useful tool to distinguish samples from two different distributions, and it achieves the Bayes risk when used in kernel-based classification/regression [32]. Jiao & Vert [20] have proposed a fast approach for computing the Kendall marginalised kernel; however, this kernel is not characteristic [26], and hence has limited expressive power.

The resulting estimators are used for a variety of kernel machine learning algorithms in the experiments. We present comparative simulation results demonstrating the efficacy of the proposed estimators for an agglomerative clustering task, a hypothesis testing task using the maximum mean discrepancy (MMD) [15] and a Gaussian process classification task. For the latter, we extend some of the existing methods in the software library GPy [14].

Since the space of permutations is an example of a discrete space with a non-commutative group structure, the corresponding reproducing kernel Hilbert spaces (RKHS) have only recently been investigated; see Kondor et al. [24], Fukumizu et al. [13], Kondor & Barbosa [23], Jiao & Vert [20] and Mania et al. [26]. We provide an overview of the connection between kernels and certain semimetrics when working on the space of permutations. This connection allows us to obtain kernels from given semimetrics, or semimetrics from existing kernels. We can combine these semimetric-based kernels to obtain novel, more expressive kernels which can be used for the proposed Monte Carlo kernel estimator.
We first briefly introduce the theory of permutation groups. A particular application of permutations is to use them to represent rankings; in fact, there is a natural one-to-one relationship between rankings of $n$ items and permutations. For this reason, we sometimes use ranking and permutation interchangeably. In this section, we state some mathematical definitions to formalise the problem in terms of the space of permutations.

Let $[n] = \{1, 2, \ldots, n\}$ be a set of indices for $n$ items, for some $n \in \mathbb{N}$. Given a ranking of these $n$ items, we use the notation $\succ$ to denote the ordering of the items induced by the ranking, so that for distinct $i, j \in [n]$, if $i$ is preferred to $j$, we will write $i \succ j$. Note that for a full ranking, the corresponding relation $\succ$ is a total order on $\{1, \ldots, n\}$.

We now outline the correspondence between rankings on $[n]$ and the permutation group $S_n$ that we use throughout the paper. In words, given a full ranking of $[n]$, we associate it with the permutation $\sigma \in S_n$ that maps each ranking position $1, \ldots, n$ to the correct object under the ranking. More mathematically, given a ranking $a_1 \succ \cdots \succ a_n$ of $[n]$, we may associate it with the permutation $\sigma \in S_n$ given by $\sigma(j) = a_j$ for all $j = 1, \ldots, n$. For example, the ranking on $[3]$ given by $2 \succ 3 \succ 1$ corresponds to the permutation $\sigma \in S_3$ given by $\sigma(1) = 2$, $\sigma(2) = 3$, $\sigma(3) = 1$. This correspondence allows the literature relating to kernels on permutations to be leveraged for problems involving the modelling of ranking data.
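To make the correspondence concrete, the following minimal Python sketch (ours, not part of the paper; the function names are illustrative only) encodes a ranking as a permutation under the convention $\sigma(j) = a_j$ and checks pairwise preferences via the inverse permutation.

```python
# Illustrative sketch (not from the paper). Under the paper's convention the
# ranking a_1 > ... > a_n corresponds to the permutation sigma with sigma(j) = a_j,
# so sigma maps ranking positions to items. Its inverse maps items to positions,
# which is what is needed to check preferences such as sigma^{-1}(x) < sigma^{-1}(y).

def ranking_to_permutation(ranking):
    # ranking = [a_1, ..., a_n] with items labelled 1..n; sigma(j) = a_j.
    return list(ranking)

def inverse_permutation(sigma):
    # inv[item] = ranking position of the item under sigma (1-based).
    inv = [0] * (len(sigma) + 1)
    for position, item in enumerate(sigma, start=1):
        inv[item] = position
    return inv

def prefers(sigma, x, y):
    # True if x is preferred to y under the full ranking encoded by sigma.
    inv = inverse_permutation(sigma)
    return inv[x] < inv[y]

sigma = ranking_to_permutation([2, 3, 1])   # the ranking 2 > 3 > 1
assert prefers(sigma, 2, 1) and prefers(sigma, 3, 1) and not prefers(sigma, 1, 2)
```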
In the next section, we review some of the semimetrics on $S_n$ that can serve as building blocks for the construction of more expressive kernels.

2.1 Metrics for permutations and properties

Definition 1 Let $\mathcal{X}$ be any set and $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a function, which we write $d(x, y)$ for every $x, y \in \mathcal{X}$. Then $d$ is a semimetric if it satisfies the following conditions, for every $x, y \in \mathcal{X}$ [11]:
i) $d(x, y) = d(y, x)$, that is, $d$ is a symmetric function.
ii) $d(x, y) = 0$ if and only if $x = y$.
A semimetric is a metric if it additionally satisfies:
iii) $d(x, z) \leq d(x, y) + d(y, z)$ for every $x, y, z \in \mathcal{X}$, that is, $d$ satisfies the triangle inequality.

The following are some examples of semimetrics on the space of permutations $S_n$ [9]. Some of these semimetrics have the additional property of being of negative type; Theorem 1, stated below, shows that negative-type semimetrics are closely related to kernels.

1) Spearman's footrule. $d(\sigma, \sigma') = \sum_{i=1}^{n} |\sigma(i) - \sigma'(i)| = \|\sigma - \sigma'\|_1$.

2) Spearman's rank correlation. $d(\sigma, \sigma') = \sum_{i=1}^{n} (\sigma(i) - \sigma'(i))^2 = \|\sigma - \sigma'\|_2^2$.

3) Hamming distance. $d_H(\sigma, \sigma') = \#\{i \mid \sigma(i) \neq \sigma'(i)\}$. It can also be defined as the minimum number of substitutions required to change one permutation into the other.
4) Cayley distance. $d_C(\sigma, \sigma') = \sum_{j=1}^{n-1} X_j(\sigma \circ (\sigma')^{-1})$, where the composition operation of the permutation group $S_n$ is denoted by $\circ$, and $X_j(\sigma \circ (\sigma')^{-1}) = 0$ if $j$ is the largest item in its cycle and is equal to 1 otherwise [18]. It is also equal to the minimum number of pairwise transpositions taking $\sigma$ to $\sigma'$. Finally, it can be shown to be equal to $n - C(\sigma \circ (\sigma')^{-1})$, where $C(\eta)$ is the number of cycles in $\eta$.

5) Kendall distance. $d_\tau(\sigma, \sigma') = n_d(\sigma, \sigma')$, where $n_d(\sigma, \sigma')$ is the number of discordant pairs for the permutation pair $(\sigma, \sigma')$. It can also be defined as the minimum number of pairwise adjacent transpositions taking $\sigma^{-1}$ to $(\sigma')^{-1}$.

6) $\ell_p$ distances. $d_p(\sigma, \sigma') = \left( \sum_{i=1}^{n} |\sigma(i) - \sigma'(i)|^p \right)^{1/p} = \|\sigma - \sigma'\|_p$, with $p \geq 1$.

7) $\ell_\infty$ distance. $d_\infty(\sigma, \sigma') = \max_{1 \leq i \leq n} |\sigma(i) - \sigma'(i)| = \|\sigma - \sigma'\|_\infty$.

Definition 2 A semimetric is said to be of negative type if for all $n \geq 2$, $x_1, \ldots, x_n \in \mathcal{X}$ and $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ with $\sum_{i=1}^{n} \alpha_i = 0$, we have

$$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, d(x_i, x_j) \leq 0. \qquad (1)$$

In general, if we start with a Mercer kernel for permutations, that is, a symmetric and positive definite function $k: S_n \times S_n \to \mathbb{R}$, the following expression gives a semimetric $d$ that is of negative type:

$$d(\sigma, \sigma') = k(\sigma, \sigma) + k(\sigma', \sigma') - 2 k(\sigma, \sigma'). \qquad (2)$$

A useful characterisation of semimetrics of negative type is given by the following theorem, which states a connection between negative type metrics and a Hilbert space feature representation, or feature map, $\phi$.

Theorem 1 [3] A semimetric $d$ is of negative type if and only if there exists a Hilbert space $\mathcal{H}$ and an injective map $\phi: \mathcal{X} \to \mathcal{H}$ such that, for all $x, x' \in \mathcal{X}$, $d(x, x') = \|\phi(x) - \phi(x')\|_{\mathcal{H}}^2$.

Once the feature map from Theorem 1 is found, we can directly take its inner product to construct a kernel. For instance, Jiao & Vert [20] propose an explicit feature representation for the Kendall kernel, given by

$$\Phi(\sigma) = \frac{1}{\sqrt{\binom{n}{2}}} \left[ \mathbb{I}_{\{\sigma(i) > \sigma(j)\}} - \mathbb{I}_{\{\sigma(i) < \sigma(j)\}} \right]_{1 \leq i < j \leq n}.$$

Proposition 1 The Hamming distance is of negative type, with

$$d_H(\sigma, \sigma') = \frac{1}{2} \operatorname{Trace}\left[ (\Phi(\sigma) - \Phi(\sigma'))(\Phi(\sigma) - \Phi(\sigma'))^{T} \right], \qquad (3)$$

where the corresponding feature representation is a matrix given by

$$\Phi(\sigma) = \begin{pmatrix} \mathbb{I}_{\{\sigma(1)=1\}} & \ldots & \mathbb{I}_{\{\sigma(n)=1\}} \\ \mathbb{I}_{\{\sigma(1)=2\}} & \ldots & \mathbb{I}_{\{\sigma(n)=2\}} \\ \vdots & \ddots & \vdots \\ \mathbb{I}_{\{\sigma(1)=n\}} & \ldots & \mathbb{I}_{\{\sigma(n)=n\}} \end{pmatrix}.$$

Proof The Hamming distance can be written as a squared difference of indicator functions in the following way:

$$d_H(\sigma, \sigma') = \#\{i \mid \sigma(i) \neq \sigma'(i)\} = \frac{1}{2} \sum_{i=1}^{n} \sum_{\ell=1}^{n} \left( \mathbb{I}_{\{\sigma(i) = \ell\}} - \mathbb{I}_{\{\sigma'(i) = \ell\}} \right)^2,$$

where each indicator is one whenever the given entry of the permutation is equal to the corresponding element of the identity element of the group.
Let the $\ell$-th feature vector be $\phi_\ell(\sigma) = \left( \mathbb{I}_{\{\sigma(1) = \ell\}}, \ldots, \mathbb{I}_{\{\sigma(n) = \ell\}} \right)$; then

$$d_H(\sigma, \sigma') = \frac{1}{2} \sum_{\ell=1}^{n} (\phi_\ell(\sigma) - \phi_\ell(\sigma'))^{T} (\phi_\ell(\sigma) - \phi_\ell(\sigma')) = \frac{1}{2} \sum_{\ell=1}^{n} \|\phi_\ell(\sigma) - \phi_\ell(\sigma')\|^2 = \frac{1}{2} \operatorname{Trace}\left[ (\Phi(\sigma) - \Phi(\sigma'))(\Phi(\sigma) - \Phi(\sigma'))^{T} \right].$$

This is the trace of the product of the difference of the feature matrices, $\Phi(\sigma) - \Phi(\sigma')$, with its transpose, where the difference of feature matrices is given by

$$\begin{pmatrix} \mathbb{I}_{\{\sigma(1)=1\}} - \mathbb{I}_{\{\sigma'(1)=1\}} & \ldots & \mathbb{I}_{\{\sigma(n)=1\}} - \mathbb{I}_{\{\sigma'(n)=1\}} \\ \mathbb{I}_{\{\sigma(1)=2\}} - \mathbb{I}_{\{\sigma'(1)=2\}} & \ldots & \mathbb{I}_{\{\sigma(n)=2\}} - \mathbb{I}_{\{\sigma'(n)=2\}} \\ \vdots & \vdots & \vdots \\ \mathbb{I}_{\{\sigma(1)=n\}} - \mathbb{I}_{\{\sigma'(1)=n\}} & \ldots & \mathbb{I}_{\{\sigma(n)=n\}} - \mathbb{I}_{\{\sigma'(n)=n\}} \end{pmatrix}.$$

This is the square of the usual Frobenius norm for matrices, so by Theorem 1, the Hamming distance is of negative type.

Another example is Spearman's rank correlation, which is a semimetric of negative type since it is the square of the usual Euclidean distance [3].

The two alternative definitions given for some of the distances in the previous examples are handy from different perspectives. One is an expression in terms of either an injective or non-injective feature representation, while the other is in terms of the minimum number of operations needed to change one permutation into the other. Other distances can also be defined in terms of such a minimum number of operations; they are called editing metrics [8]. Editing metrics are useful from an algorithmic point of view, whereas metrics defined in terms of feature vectors are useful from a theoretical point of view. Ideally, having a particular metric in terms of both algorithmic and theoretical descriptions gives a better picture of which characteristics of the permutation the metric takes into account. For instance, the algorithmic descriptions of the Kendall and Cayley distances correspond to the bubble sort and quicksort algorithms respectively [22].

Fig. 1 Kendall and Cayley distances for permutations of n = 4. There is an edge between two permutations in the graph if they differ by one adjacent or non-adjacent transposition, respectively.

Another property shared by most of the semimetrics in the examples is the following.

Definition 3 Let $\sigma_1, \sigma_2 \in S_n$, and let $(S_n, \circ)$ denote the symmetric group of degree $n$ with the composition operation. A right-invariant semimetric [9] satisfies

$$d(\sigma_1, \sigma_2) = d(\sigma_1 \circ \eta, \sigma_2 \circ \eta) \quad \forall \sigma_1, \sigma_2, \eta \in S_n. \qquad (4)$$

In particular, if we take $\eta = \sigma_2^{-1}$, then $d(\sigma_1, \sigma_2) = d(e, \sigma_1 \circ \sigma_2^{-1})$, where $e$ corresponds to the identity element of the permutation group. This property is inherited by the distance-induced kernel from Section 2.2, Example 7. This symmetry is analogous to translation invariance for kernels defined in Euclidean spaces.
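As a concrete companion to the definitions above, the sketch below (ours, not the paper's code) implements three of the distances; the helper names are illustrative only.

```python
# Illustrative implementations (ours) of three of the distances above, under the
# paper's convention sigma(j) = item placed at ranking position j.
# Permutations are tuples p with p[j-1] = sigma(j) and items labelled 1..n.
from itertools import combinations

def inverse(sigma):
    # inv[item] = ranking position of the item (1-based).
    inv = [0] * (len(sigma) + 1)
    for pos, item in enumerate(sigma, start=1):
        inv[item] = pos
    return inv

def kendall_distance(s1, s2):
    # Number of item pairs ranked in opposite relative order (discordant pairs).
    inv1, inv2 = inverse(s1), inverse(s2)
    items = range(1, len(s1) + 1)
    return sum(1 for x, y in combinations(items, 2)
               if (inv1[x] - inv1[y]) * (inv2[x] - inv2[y]) < 0)

def hamming_distance(s1, s2):
    # Number of ranking positions at which the two permutations disagree.
    return sum(a != b for a, b in zip(s1, s2))

def cayley_distance(s1, s2):
    # n minus the number of cycles of sigma o (sigma')^{-1}.
    n, inv2 = len(s1), inverse(s2)
    comp = [s1[inv2[item] - 1] for item in range(1, n + 1)]
    seen, cycles = set(), 0
    for start in range(1, n + 1):
        if start not in seen:
            cycles += 1
            j = start
            while j not in seen:
                seen.add(j)
                j = comp[j - 1]
    return n - cycles

assert kendall_distance((2, 3, 1), (1, 2, 3)) == 2
assert hamming_distance((2, 3, 1), (1, 2, 3)) == 3
assert cayley_distance((2, 3, 1), (1, 2, 3)) == 2
```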
2.2 Kernels for $S_n$

If we specify a symmetric and positive definite function, or kernel, $k$, this corresponds to defining an implicit feature space representation of a ranking data point. The well-known kernel trick exploits the implicit nature of this representation by performing computations with the kernel function explicitly, rather than using inner products between feature vectors in a high- or even infinite-dimensional space. Any symmetric and positive definite function uniquely defines an underlying Reproducing Kernel Hilbert Space (RKHS); see Appendix A in the supplementary material for a brief overview of the RKHS. Some examples of kernels for permutations are the following:

1. The Kendall kernel [20] is given by $k_\tau(\sigma, \sigma') = \frac{n_c(\sigma, \sigma') - n_d(\sigma, \sigma')}{\binom{n}{2}}$, where $n_c(\sigma, \sigma')$ and $n_d(\sigma, \sigma')$ denote the number of concordant and discordant pairs between $\sigma$ and $\sigma'$ respectively.

2. The Mallows kernel [20] is given by $k_\nu(\sigma, \sigma') = \exp(-\nu \, n_d(\sigma, \sigma'))$.

3. The polynomial kernel of degree $m$ [26] is given by $k_P^{(m)}(\sigma, \sigma') = (1 + k_\tau(\sigma, \sigma'))^m$.

4. The Hamming kernel is given by $k_H(\sigma, \sigma') = \operatorname{Trace}\left[ \Phi(\sigma) \Phi(\sigma')^{T} \right]$.

5. An exponential semimetric kernel is given by $k_d(\sigma, \sigma') = \exp\{-\nu \, d(\sigma, \sigma')\}$, where $d$ is a semimetric of negative type.

6. The diffusion kernel [23] is given by $k_\beta(\sigma, \sigma') = \exp\{\beta \, q(\sigma \circ (\sigma')^{-1})\}$, where $\beta \in \mathbb{R}$ and $q$ is a function that must satisfy $q(\pi) = q(\pi^{-1})$ and $\sum_\pi q(\pi) = 0$. A particular case is $q(\sigma, \sigma') = 1$ if $\sigma$ and $\sigma'$ are connected by an edge in some Cayley graph representation of $S_n$, $q(\sigma, \sigma') = -\mathrm{degree}(\sigma)$ if $\sigma = \sigma'$, and $q(\sigma, \sigma') = 0$ otherwise.

7. The semimetric or distance-induced kernel [30]: if the semimetric $d$ is of negative type, then a family of kernels $k$, parameterised by a central permutation $\sigma_0$, is given by $k(\sigma, \sigma') = \frac{1}{2}\left[ d(\sigma, \sigma_0) + d(\sigma', \sigma_0) - d(\sigma, \sigma') \right]$.

If we choose any of the above kernels by itself, it will generally not be complex enough to represent the ranking data's generating mechanism. However, we can benefit from the allowable operations for kernels to combine kernels and still obtain a valid kernel. Some of the operations which render a valid kernel are the following: sum, multiplication by a positive constant, product, polynomial and exponential [4].

In the case of the symmetric group of degree $n$, $S_n$, there exist kernels that are right-invariant, as defined in Equation (4). This invariance property is useful because it is possible to write down the kernel as a function of a single argument and then obtain a Fourier representation. The caveat is that this Fourier representation is given in terms of certain unitary matrix representations, due to the non-Abelian structure of the group [19]. Even though the space is finite, and every irreducible representation is finite-dimensional [13], these Fourier representations do not have closed-form expressions. For this reason, it is difficult to work in the spectral domain, as opposed to the $\mathbb{R}^n$ case. There is also no natural measure to sample from, such as the one provided by Bochner's theorem in Euclidean spaces [35]. In the next section, we will present a novel Monte Carlo kernel estimator for the case of partial rankings data.
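The first two kernels in the list translate directly into code. The sketch below (ours; `nu` is the Mallows bandwidth parameter, an arbitrary choice) counts concordant/discordant item pairs and evaluates the Kendall and Mallows kernels.

```python
# Minimal sketches (ours) of the Kendall and Mallows kernels; concordant and
# discordant pairs are counted over item pairs.
from itertools import combinations
from math import comb, exp

def concordant_discordant(s1, s2):
    inv1 = {item: pos for pos, item in enumerate(s1)}
    inv2 = {item: pos for pos, item in enumerate(s2)}
    n_c = n_d = 0
    for x, y in combinations(s1, 2):
        prod = (inv1[x] - inv1[y]) * (inv2[x] - inv2[y])
        if prod > 0:
            n_c += 1
        elif prod < 0:
            n_d += 1
    return n_c, n_d

def kendall_kernel(s1, s2):
    n_c, n_d = concordant_discordant(s1, s2)
    return (n_c - n_d) / comb(len(s1), 2)

def mallows_kernel(s1, s2, nu=1.0):
    _, n_d = concordant_discordant(s1, s2)
    return exp(-nu * n_d)

sigma, tau = (2, 3, 1), (1, 2, 3)
print(kendall_kernel(sigma, tau), mallows_kernel(sigma, tau, nu=0.5))
```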
Having provided an overview of kernels for permutations, and reviewed the link between permutations and rankings of objects, we now turn to the practical issue that in real datasets we typically have access only to partial ranking information, such as pairwise preferences and top-$k$ rankings. Following [20], we consider the following types of partial rankings.

Definition 4 (Exhaustive partial rankings, top-$k$ rankings) Let $n \in \mathbb{N}$. A partial ranking on the set $[n]$ is specified by an ordered collection $\Omega_1 \succ \cdots \succ \Omega_l$ of disjoint non-empty subsets $\Omega_1, \ldots, \Omega_l \subseteq [n]$, for any $1 \leq l \leq n$. The partial ranking $\Omega_1 \succ \cdots \succ \Omega_l$ encodes the fact that the items in $\Omega_i$ are preferred to those in $\Omega_{i+1}$, for $i = 1, \ldots, l-1$, with no preference information specified about the items in $[n] \setminus \cup_{i=1}^{l} \Omega_i$. A partial ranking $\Omega_1 \succ \cdots \succ \Omega_l$ with $\cup_{i=1}^{l} \Omega_i = [n]$ is termed exhaustive, as all items in $[n]$ are included within the preference information. A top-$k$ partial ranking is a particular type of exhaustive ranking $\Omega_1 \succ \cdots \succ \Omega_l$, with $|\Omega_1| = \cdots = |\Omega_{l-1}| = 1$ and $\Omega_l = [n] \setminus \cup_{i=1}^{l-1} \Omega_i$.

We will frequently identify a partial ranking $\Omega_1 \succ \cdots \succ \Omega_l$ with the set $R(\Omega_1, \ldots, \Omega_l) \subseteq S_n$ of full rankings consistent with the partial ranking. Thus, $\sigma \in R(\Omega_1, \ldots, \Omega_l)$ iff for all $1 \leq i < j \leq l$, and for all $x \in \Omega_i$, $y \in \Omega_j$, we have $\sigma^{-1}(x) < \sigma^{-1}(y)$. When there is potential for confusion, we will use the term "subset partial ranking" when referring to a partial ranking as a subset of $S_n$, and "preference partial ranking" when referring to a partial ranking with the notation $\Omega_1 \succ \cdots \succ \Omega_l$.

Thus, for many practical problems, we require definitions of kernels between subsets of partial rankings rather than between full rankings, to be able to deal with datasets containing only partial ranking information. A common approach [34] is to take a kernel $K$ defined on $S_n$, and use the marginalised kernel, defined on subsets of partial rankings by

$$K(R, R') = \sum_{\sigma \in R} \sum_{\sigma' \in R'} K(\sigma, \sigma') \, p(\sigma \mid R) \, p(\sigma' \mid R') \qquad (5)$$

for all $R, R' \subseteq S_n$, for some probability distribution $p \in \mathcal{P}(S_n)$. Here, $p(\cdot \mid R)$ denotes the conditioning of $p$ to the set $R \subseteq S_n$. Jiao & Vert [20] use the convolution kernel [17] between partial rankings, given by

$$K(R, R') = \frac{1}{|R| \, |R'|} \sum_{\sigma \in R} \sum_{\sigma' \in R'} K(\sigma, \sigma'). \qquad (6)$$

This is a particular case of the marginalised kernel of Equation (5), in which we take the probability mass function to be uniform over $R$ and $R'$ respectively. In general, computation with a marginalised kernel quickly becomes computationally intractable, with the number of terms on the right-hand side of Equation (5) growing super-exponentially with $n$ for a fixed number of items in the partial rankings $R$ and $R'$; see Appendix D for a numerical example of such growth. An exception is the Kendall kernel case for two interleaving partial rankings of $k$ and $m$ items, or a top-$k$ and a top-$m$ ranking. In this case, the sum can be computed tractably, in $O(k \log k + m \log m)$ time [20].

We propose a variety of Monte Carlo methods to estimate the marginalised kernel of Equation (5) in the general case, where direct calculation is intractable.

Definition 5 The Monte Carlo estimator approximating the marginalised kernel of Equation (5), defined for a collection of partial rankings $(R_i)_{i=1}^{I}$, is given by

$$\widehat{K}(R_i, R_j) = \frac{1}{M_i M_j} \sum_{l=1}^{M_i} \sum_{m=1}^{M_j} w_l^{(i)} w_m^{(j)} K(\sigma_l^{(i)}, \sigma_m^{(j)}) \qquad (7)$$

for $i, j = 1, \ldots, I$, where $((\sigma_m^{(i)})_{m=1}^{M_i})_{i=1}^{I}$ are random permutations and $((w_m^{(i)})_{m=1}^{M_i})_{i=1}^{I}$ are random weights. Note that this general set-up allows for several possibilities (a code sketch for the first case follows the list):

– For each $i = 1, \ldots, I$, the permutations $(\sigma_m^{(i)})_{m=1}^{M_i}$ are drawn exactly from the distribution $p(\cdot \mid R_i)$. In this case, the weights are simply $w_m^{(i)} = 1$ for $m = 1, \ldots, M_i$.

– For each $i = 1, \ldots, I$, the permutations $(\sigma_m^{(i)})_{m=1}^{M_i}$ are drawn from some proposal distribution $q(\cdot \mid R_i)$, with the weights given by the corresponding importance weights $w_m^{(i)} = p(\sigma_m^{(i)} \mid R_i) / q(\sigma_m^{(i)} \mid R_i)$ for $m = 1, \ldots, M_i$.
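The following sketch (ours; the partial rankings, bandwidth and sample sizes are arbitrary example choices) illustrates the first case of Equation (7): exact uniform samples from $p(\cdot \mid R)$ when $R$ is a top-$k$ partial ranking, with unit weights and the Mallows kernel as $K$.

```python
# Minimal sketch (ours) of the Monte Carlo estimator in Equation (7) with exact
# uniform samples from the consistent set of a top-k partial ranking.
import random
from itertools import combinations
from math import exp

def sample_consistent_full_ranking(top_k, n, rng=random):
    # Uniform draw from the full rankings consistent with the top-k ranking
    # i_1 > ... > i_k: keep the observed prefix, shuffle the remaining items.
    rest = [x for x in range(1, n + 1) if x not in set(top_k)]
    rng.shuffle(rest)
    return tuple(top_k) + tuple(rest)

def mallows_kernel(s1, s2, nu=0.1):
    inv1 = {item: pos for pos, item in enumerate(s1)}
    inv2 = {item: pos for pos, item in enumerate(s2)}
    n_d = sum(1 for x, y in combinations(s1, 2)
              if (inv1[x] - inv1[y]) * (inv2[x] - inv2[y]) < 0)
    return exp(-nu * n_d)

def mc_marginalised_kernel(top_k_i, top_k_j, n, num_samples=50, nu=0.1):
    # Equation (7) with unit weights: average kernel value over sampled pairs.
    sig = [sample_consistent_full_ranking(top_k_i, n) for _ in range(num_samples)]
    tau = [sample_consistent_full_ranking(top_k_j, n) for _ in range(num_samples)]
    return sum(mallows_kernel(s, t, nu) for s in sig for t in tau) / num_samples ** 2

print(mc_marginalised_kernel([2, 5, 1], [2, 1, 4], n=8))
```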
An alternative perspective on the estimator defined in Equation (7), more in line with the literature on random feature approximations of kernels, is to define a random feature embedding for each of the partial rankings $(R_i)_{i=1}^{I}$.

More precisely, let $\mathcal{H}_K$ be the (finite-dimensional) Hilbert space associated with the kernel $K$ on the space $S_n$, and let $\Phi$ be the associated feature map, so that $\Phi(\sigma) = K(\sigma, \cdot) \in \mathcal{H}_K$ for each $\sigma \in S_n$. Then observe that we have $K(\sigma, \sigma') = \langle \Phi(\sigma), \Phi(\sigma') \rangle$ for all $\sigma, \sigma' \in S_n$. We now extend this feature embedding to partial rankings as follows. Given a partial ranking $R \subseteq S_n$, we define the feature embedding of $R$ by

$$\Phi(R) = \frac{1}{|R|} \sum_{\sigma \in R} K(\sigma, \cdot) \in \mathcal{H}_K.$$

With this extension of $\Phi$ to partial rankings, we may now directly express the marginalised kernel of Equation (5) as an inner product in the same Hilbert space $\mathcal{H}_K$:

$$K(R, R') = \langle \Phi(R), \Phi(R') \rangle$$

for all partial rankings $R, R' \subseteq S_n$. If we define a random feature embedding of the partial rankings $(R_i)_{i=1}^{I}$ by

$$\widehat{\Phi}(R_i) = \frac{1}{M_i} \sum_{m=1}^{M_i} w_m^{(i)} \Phi(\sigma_m^{(i)}),$$

then the Monte Carlo kernel estimator of Equation (7) can be expressed directly as

$$\widehat{K}(R_i, R_j) = \frac{1}{M_i M_j} \sum_{l=1}^{M_i} \sum_{m=1}^{M_j} w_l^{(i)} w_m^{(j)} K(\sigma_l^{(i)}, \sigma_m^{(j)}) = \frac{1}{M_i M_j} \sum_{l=1}^{M_i} \sum_{m=1}^{M_j} w_l^{(i)} w_m^{(j)} \langle \Phi(\sigma_l^{(i)}), \Phi(\sigma_m^{(j)}) \rangle = \left\langle \frac{1}{M_i} \sum_{l=1}^{M_i} w_l^{(i)} \Phi(\sigma_l^{(i)}), \; \frac{1}{M_j} \sum_{m=1}^{M_j} w_m^{(j)} \Phi(\sigma_m^{(j)}) \right\rangle = \langle \widehat{\Phi}(R_i), \widehat{\Phi}(R_j) \rangle \qquad (8)$$

for each $i, j \in \{1, \ldots, I\}$. This expression of the estimator as an inner product between randomised embeddings will be useful in the sequel.

We provide an illustration of the various RKHS embeddings at play in Figure 2, using the notation of the proof of Theorem 3. In this figure, $\eta$ is a partial ranking, with three consistent full rankings $\sigma_1, \sigma_2, \sigma_3$. The extended embedding $\widetilde{\Phi}$ applied to $\eta$ is the barycentre in the RKHS of the embeddings of the consistent full rankings, and a Monte Carlo approximation $\widehat{\Phi}$ to this embedding is also displayed.

Fig. 2 Visualisation of the various embeddings discussed in the proof of Theorem 3. $\sigma_1, \sigma_2, \sigma_3$ are permutations in $S_n$, which are mapped into the RKHS $\mathcal{H}_K$ by the embedding $\Phi$. $\eta$ is a partial ranking subset which contains $\sigma_1, \sigma_2, \sigma_3$, and its embedding $\widetilde{\Phi}(\eta)$ is given as the average of the embeddings of its full rankings. The Monte Carlo embedding $\widehat{\Phi}(\eta)$ induced by Equation (7) is computed by taking the average of a randomly sampled collection of consistent full rankings from $\eta$.

Theorem 2 Let $R_i \subseteq S_n$ be a partial ranking, and let $(\sigma_m^{(i)})_{m=1}^{M_i}$ be independent and identically distributed samples from $p(\cdot \mid R_i)$. The kernel Monte Carlo mean embedding

$$\widehat{\Phi}(R_i) = \frac{1}{M_i} \sum_{m=1}^{M_i} K(\sigma_m^{(i)}, \cdot)$$

is a consistent estimator of the marginalised kernel embedding

$$\widetilde{\Phi}(R_i) = \frac{1}{|R_i|} \sum_{\sigma \in R_i} K(\sigma, \cdot).$$
Proof Note that the RKHS in which these embeddings take values is finite-dimensional, and the Monte Carlo estimator is the average of i.i.d. terms, each of which is equal to the true embedding in expectation. Thus, we immediately obtain unbiasedness and consistency of the Monte Carlo embedding.

Theorem 3 The Monte Carlo kernel estimator from Equation (7) does define a positive-definite kernel; further, it yields consistent estimates of the true kernel function.

Proof We first deal with the positive-definiteness claim. Let $R_1, \ldots, R_I \subseteq S_n$ be a collection of partial rankings, and for each $i = 1, \ldots, I$, let $(\sigma_m^{(i)}, w_m^{(i)})_{m=1}^{M_i}$ be an i.i.d. weighted collection of complete rankings distributed according to $p(\cdot \mid R_i)$. To show that the Monte Carlo kernel estimator $\widehat{K}$ is positive-definite, we observe that, by Equation (8), the $I \times I$ matrix with $(i, j)$-th element given by $\widehat{K}(R_i, R_j)$ is the Gram matrix of the vectors $(\widehat{\Phi}(R_i))_{i=1}^{I}$ with respect to the inner product of the Hilbert space $\mathcal{H}_K$. We therefore immediately deduce that the matrix is positive semi-definite, and therefore the kernel estimator itself is positive-definite. Furthermore, the Monte Carlo kernel estimator is consistent; see Appendix B in the supplementary material for the proof.

Having established that the Monte Carlo estimator $\widehat{K}$ is itself a kernel, we note that when it is evaluated at two partial rankings $R, R' \subseteq S_n$, the resulting expression is not a sum of i.i.d. terms; the following result quantifies the quality of the estimator through its variance.

Theorem 4 The variance of the Monte Carlo kernel estimator evaluated at a pair of partial rankings $R_i, R_j$, with $M_i, N_j$ Monte Carlo samples respectively, is given by

$$\mathrm{Var}\left( \widehat{K}(R_i, R_j) \right) = \frac{1}{M_i} \sum_{\sigma^{(i)} \in R_i} p(\sigma^{(i)} \mid R_i) \left( \sum_{\sigma^{(j)} \in R_j} p(\sigma^{(j)} \mid R_j) K(\sigma^{(i)}, \sigma^{(j)}) \right)^2 - \frac{1}{M_i} \left( \sum_{\substack{\sigma^{(i)} \in R_i \\ \sigma^{(j)} \in R_j}} K(\sigma^{(i)}, \sigma^{(j)}) \, p(\sigma^{(i)} \mid R_i) \, p(\sigma^{(j)} \mid R_j) \right)^2 - \frac{1}{M_i N_j} \sum_{\sigma^{(i)} \in R_i} p(\sigma^{(i)} \mid R_i) \left( \sum_{\sigma^{(j)} \in R_j} p(\sigma^{(j)} \mid R_j) K(\sigma^{(i)}, \sigma^{(j)}) \right)^2 + \frac{1}{M_i N_j} \sum_{\substack{\sigma^{(i)} \in R_i \\ \sigma^{(j)} \in R_j}} K(\sigma^{(i)}, \sigma^{(j)})^2 \, p(\sigma^{(i)} \mid R_i) \, p(\sigma^{(j)} \mid R_j).$$

The proof is given in the supplementary material, Appendix C. We have presented some theoretical properties of the embedding corresponding to the Monte Carlo kernel estimator which confirm that it is a sensible embedding. In the next section, we present a lower-variance estimator based on a novel antithetic variates construction.

A common, computationally cheap variance-reduction technique in Monte Carlo estimation of expectations of a given function is to use antithetic variates [16], the purpose of which is to introduce negative correlation between samples without affecting their marginal distribution, resulting in a lower-variance estimator. Antithetic samples have been used when sampling from Euclidean vector spaces, for which antithetic samples are straightforward to define. However, to the best of our knowledge, antithetic variate constructions have not been proposed for the space of permutations. We begin by introducing a definition of antithetic samples for permutations.
Definition 6 (Antithetic permutations) Let $R \subseteq S_n$ be a top-$k$ partial ranking. The antithetic operator $A_R: R \to R$ maps each permutation $\sigma \in R$ to the permutation in $R$ of maximal distance from $\sigma$.

It is not necessarily clear a priori that the antithetic operator of Definition 6 is well-defined, but for the Kendall distance and top-$k$ partial rankings, it turns out that it is indeed well-defined.

Remark 1 For the Kendall distance and top-$k$ partial rankings, the antithetic operators of Definition 6 are well-defined, in the sense that there exists a unique distance-maximising permutation in $R$ from any given $\sigma \in R$. Indeed, the antithetic map $A_R$ when $R$ is a top-$k$ partial ranking has a particularly neat expression: if the partial ranking corresponding to $R$ is $a_1 \succ \cdots \succ a_k$, and we have a full ranking $\sigma \in R$ (so that $\sigma(1) = a_1, \ldots, \sigma(k) = a_k$), then the antithetic permutation $A_R(\sigma)$ is given by

$$A_R(\sigma)(i) = a_i \ \text{ for } i = 1, \ldots, k, \qquad A_R(\sigma)(k + j) = \sigma(n + 1 - j) \ \text{ for } j = 1, \ldots, n - k.$$

In this case, we have $d(\sigma, A_R(\sigma)) = \binom{n-k}{2}$.

This definition of antithetic samples for permutations has parallels with the standard notion of antithetic samples in vector spaces, in which typically a sampled vector $x \in \mathbb{R}^d$ is negated to form $-x$, its antithetic sample; $-x$ is the vector maximising the Euclidean distance from $x$, under the restriction of fixed norm.
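The explicit formula of Remark 1 is easy to implement: the observed prefix stays fixed and the order of the unobserved tail is reversed. The sketch below (ours, with an arbitrary example) also checks the identity $d(\sigma, A_R(\sigma)) = \binom{n-k}{2}$ numerically.

```python
# Minimal sketch (ours) of the antithetic operator of Definition 6 / Remark 1
# for a top-k partial ranking.
from itertools import combinations

def antithetic(sigma, k):
    # sigma is a full ranking consistent with a top-k partial ranking,
    # i.e. sigma[:k] lists the observed top-k items in order.
    return tuple(sigma[:k]) + tuple(reversed(sigma[k:]))

def kendall_distance(s1, s2):
    inv1 = {item: pos for pos, item in enumerate(s1)}
    inv2 = {item: pos for pos, item in enumerate(s2)}
    return sum(1 for x, y in combinations(s1, 2)
               if (inv1[x] - inv1[y]) * (inv2[x] - inv2[y]) < 0)

sigma = (2, 5, 1, 3, 4, 6)        # consistent with the top-2 ranking 2 > 5
anti = antithetic(sigma, k=2)     # (2, 5, 6, 4, 3, 1)
n, k = len(sigma), 2
assert kendall_distance(sigma, anti) == (n - k) * (n - k - 1) // 2  # = C(n-k, 2)
```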
Proposition 2 Let $R$ be a partial ranking and let $\{\sigma, A_R(\sigma)\}$ be an antithetic pair from $R$, with $\sigma$ distributed uniformly on the region $R$. Let $d: S_n \times S_n \to \mathbb{R}_+$ be the Kendall distance and $\sigma_0 \in R$ a fixed permutation, and set $X = d(\sigma, \sigma_0)$ and $Y = d(A_R(\sigma), \sigma_0)$. Then $X$ and $Y$ have negative covariance.

The proof of this proposition is presented after the relevant lemmas are proved. Since one of the main tasks in statistical inference is to compute expectations of a function of interest, denoted by $h$, once the antithetic variates are constructed, the functional form of $h$ determines whether or not the antithetic variate construction produces a lower-variance estimator for its expectation. If $h$ is a monotone function, we have the following corollary.

Corollary 3 Let $h$ be a monotone increasing (decreasing) function. Then the random variables $h(X)$ and $h(Y)$ have negative covariance.

Proof The random variable $Y$ from Proposition 2 is equal in distribution to $Y \stackrel{d}{=} K - X$, where $K$ is a constant which changes depending on whether $\sigma_0$ is a full ranking or an exhaustive partial ranking; see the proof of Proposition 2 in the next section for the specific form of the constants. By Chebyshev's integral inequality [12], the covariance between a monotone increasing (decreasing) and a monotone decreasing (increasing) function is negative.

The next theorem presents the antithetic empirical feature embedding and the corresponding antithetic kernel estimator. Indeed, if we take the inner product between two such embeddings, this yields the antithetic kernel estimator, which is a function of a pair of partial ranking subsets. In this case, the function $h$ from above is the kernel evaluated at each pair; this is an example of a U-statistic [31, Chapter 5].

Theorem 5 Let $R_i \subseteq S_n$ be a partial ranking, where $S_n$ denotes the space of permutations of $n \in \mathbb{N}$, and let $(\sigma_m^{(i)}, A_{R_i}(\sigma_m^{(i)}))_{m=1}^{M_i}$ be antithetic pairs of i.i.d. samples from the region $R_i$. The kernel antithetic Monte Carlo mean embedding is

$$\widehat{\phi}(R_i) = \frac{1}{M_i} \sum_{m=1}^{M_i} \frac{K(\sigma_m^{(i)}, \cdot) + K(A_{R_i}(\sigma_m^{(i)}), \cdot)}{2}.$$

It is a consistent estimator of the embedding that corresponds to the marginalised kernel; taking inner products of such embeddings yields the antithetic kernel estimator

$$\widehat{K}(R_i, R_j) = \frac{1}{4 N M} \sum_{n=1}^{N} \sum_{m=1}^{M} \left( K(\sigma_n, \tau_m) + K(\widetilde{\sigma}_n, \tau_m) + K(\sigma_n, \widetilde{\tau}_m) + K(\widetilde{\sigma}_n, \widetilde{\tau}_m) \right), \qquad (9)$$

where $(\sigma_n, \widetilde{\sigma}_n)_{n=1}^{N}$ and $(\tau_m, \widetilde{\tau}_m)_{m=1}^{M}$ are antithetic pairs sampled from $R_i$ and $R_j$ respectively.

Proof Since the estimator is a convex combination of Monte Carlo kernel estimators, consistency follows.

In the next section, we present the main result about the estimator from Theorem 5, namely, that it has lower asymptotic variance than the Monte Carlo kernel estimator from Equation (7).

4.1 Variance of the antithetic kernel estimator

We now establish some basic theoretical properties of antithetic samples in the context of marginalised kernel estimation. In order to do so, we require a series of lemmas to derive the main result in Theorem 6, which guarantees that the antithetic kernel estimator has lower asymptotic variance than the Monte Carlo kernel estimator for the marginalised Mallows kernel.

The following result shows that antithetic permutations may be used to obtain coupled samples which are marginally distributed uniformly on the subset of $S_n$ corresponding to a top-$k$ partial ranking.

Lemma 1 If $R \subseteq S_n$ is a top-$k$ partial ranking and $\sigma \sim \mathrm{Unif}(R)$, then $A_R(\sigma) \sim \mathrm{Unif}(R)$.

Proof The proof is immediate from Remark 1, since $A_R$ is bijective on $R$.

Lemma 1 establishes a base requirement of an antithetic sample, namely, that it has the correct marginal distribution. In the context of antithetic sampling in Euclidean spaces this property is often trivial to establish, but the discrete geometry of $S_n$ makes it less obvious. Indeed, we next demonstrate that the condition of exhaustiveness of the partial ranking in Lemma 1 is necessary.

Example 1 Let $n = 3$, and consider the partial ranking $2 \succ 1$. Note that this is not an exhaustive partial ranking, as the element 3 does not feature in the preference information. There are three full rankings consistent with this partial ranking, namely $3 \succ 2 \succ 1$, $2 \succ 3 \succ 1$ and $2 \succ 1 \succ 3$. Encoding these full rankings as permutations, as described in the correspondence outlined in Section 2, we obtain three permutations, which we respectively denote by $\sigma_A, \sigma_B, \sigma_C \in S_3$. Specifically, we have

$$\sigma_A(1) = 3, \ \sigma_A(2) = 2, \ \sigma_A(3) = 1; \qquad \sigma_B(1) = 2, \ \sigma_B(2) = 3, \ \sigma_B(3) = 1; \qquad \sigma_C(1) = 2, \ \sigma_C(2) = 1, \ \sigma_C(3) = 3.$$

Under the right-invariant Kendall distance, we obtain pairwise distances given by

$$d(\sigma_A, \sigma_B) = 1, \quad d(\sigma_A, \sigma_C) = 2, \quad d(\sigma_B, \sigma_C) = 1.$$

Thus, the marginal distribution of an antithetic sample for the partial ranking $2 \succ 1$ places no mass on $\sigma_B$ and half of its mass on each of $\sigma_A$ and $\sigma_C$, and is therefore not uniform over $R$.

We further show that the condition of right-invariance of the metric $d$ is necessary in the next example.

Example 2 Let $n = 3$, and suppose $d$ is a distance on $S_3$ such that, with the notation introduced in Example 1, we have

$$d(\sigma_A, \sigma_B) = 1, \quad d(\sigma_A, \sigma_C) = 0.5, \quad d(\sigma_B, \sigma_C) = 1.$$

Note that $d$ is not right-invariant, since $d(\sigma_A, \sigma_C) = d(\sigma_B \circ \tau, \sigma_A \circ \tau) \neq d(\sigma_B, \sigma_A)$, where $\tau \in S_3$ is given by $\tau(1) = 1$, $\tau(2) = 3$, $\tau(3) = 2$.
Then note that an antithetic sample for the kernel associated with this distance and the partial ranking $2 \succ 1$ is equal to $\sigma_B$ with probability $2/3$, and is therefore not marginally uniform over $R$.

The next two lemmas relate the distances for the pair $(\sigma, \tau)$ and the corresponding pair $(A_R(\sigma), \tau)$ in both the unconstrained and constrained cases, which correspond to not having any partial ranking information and having partial ranking information, respectively.

Lemma 2 Let $\sigma, \tau \in S_n$. Then $d(\sigma, \tau) = \binom{n}{2} - d(A_{S_n}(\sigma), \tau)$.

Proof This is immediate from the interpretation of the Kendall distance as the number of discordant pairs between two permutations; a distinct pair $i, j \in [n]$ is discordant for $\sigma, \tau$ iff it is concordant for $A_{S_n}(\sigma), \tau$.

In fact, Lemma 2 generalises in the following manner.

Lemma 3 Let $R$ be a top-$k$ ranking $a_1 \succ \cdots \succ a_l \succ [n] \setminus \{a_1, \ldots, a_l\}$, and let $\sigma, \tau \in R$. Then $d(\sigma, \tau) = \binom{n-l}{2} - d(A_R(\sigma), \tau)$.

Proof As for the proof of Lemma 2, we use the "discordant pairs" interpretation of the Kendall distance. Note that if a distinct pair $\{x, y\} \in [n]^{(2)}$ has at least one of $x, y \in \{a_1, \ldots, a_l\}$, then by virtue of the fact that $\sigma, A_R(\sigma), \tau \in R$, any pair of these permutations is concordant for $x, y$. Now observe that any distinct pair $x, y \in [n] \setminus \{a_1, \ldots, a_l\}$ is discordant for $\sigma, \tau$ iff it is concordant for $A_R(\sigma), \tau$, from the construction of $A_R(\sigma)$ described in Remark 1. The total number of such pairs is $\binom{n-l}{2}$, so we have $d(\sigma, \tau) + d(A_R(\sigma), \tau) = \binom{n-l}{2}$, as required.

Next, we show that it is possible to obtain a unique closest element in a given partial ranking set $R$, denoted by $\Pi_R(\tau)$, with respect to any given permutation $\tau \in S_n$, $\tau \notin R$. This is based on the usual generalisation of a distance between a set and a point [11]. We then use this closest element in Lemmas 5 and 6 to obtain useful decompositions of distance identities. Finally, in Lemma 7 we verify that the closest element is also distributed uniformly on a subset of the original set $R$.

Lemma 4 Let $R \subseteq S_n$ be a top-$k$ partial ranking, and let $\tau \in S_n$ be arbitrary. There is a unique closest element in $R$ to $\tau$. In other words, $\arg\min_{\sigma \in R} d(\sigma, \tau)$ is a set of size 1.

Proof We use the interpretation of the Kendall distance as the number of discordant pairs between two permutations. Let $R$ be the top-$k$ partial ranking given by $x_1 \succ \cdots \succ x_k \succ [n] \setminus \{x_1, \ldots, x_k\}$, and let $X = \{x_1, \ldots, x_k\}$. We decompose the Kendall distance between $\sigma \in R$ and $\tau$ as follows:

$$d(\sigma, \tau) = \#\{\{x, y\} \subseteq X : \text{discordant for } \sigma, \tau\} + \#\{\{x, y\} : x \in X, y \notin X, \text{discordant for } \sigma, \tau\} + \#\{\{x, y\} \subseteq [n] \setminus X : \text{discordant for } \sigma, \tau\}. \qquad (10)$$

As $\sigma$ varies in $R$, only some of these terms vary. In particular, it is only the third term that varies with $\sigma$, and it is minimised at 0 by the permutation $\sigma$ in $R$ which is in accordance with $\tau$ on the set $[n] \setminus X$.

Definition 7 Let $R \subseteq S_n$ be a top-$k$ partial ranking. Let $\Pi_R: S_n \to R$ be the map that takes a permutation to the corresponding Kendall-closest permutation in $R$; by Lemma 4, this is well-defined.

Lemma 5 (Decomposition of distances) Let $\sigma \in R$ and $\tau \in S_n$. We have the following decomposition of the distance $d(\sigma, \tau)$:

$$d(\sigma, \tau) = d(\sigma, \Pi_R(\tau)) + d(\Pi_R(\tau), \tau).$$

Proof We compute directly with the discordant-pairs definition of the Kendall distance.
Again, let $R$ be the partial ranking $x_1 \succ \cdots \succ x_k$, and let $X = \{x_1, \ldots, x_k\}$. We decompose the Kendall distance between $\sigma \in R$ and $\tau$ as before:

$$d(\sigma, \tau) = \#\{\{x, y\} \subseteq X : \text{discordant for } \sigma, \tau\} + \#\{\{x, y\} : x \in X, y \notin X, \text{discordant for } \sigma, \tau\} + \#\{\{x, y\} \subseteq [n] \setminus X : \text{discordant for } \sigma, \tau\}. \qquad (11)$$

By the construction of $\Pi_R(\tau)$ in the proof of Lemma 4, we have that

$$d(\Pi_R(\tau), \tau) = \#\{\{x, y\} \subseteq X : \text{discordant for } \sigma, \tau\} + \#\{\{x, y\} : x \in X, y \notin X, \text{discordant for } \sigma, \tau\},$$

i.e. the first two terms of the decomposition in Equation (11). Similarly, we have

$$d(\Pi_R(\tau), \sigma) = \#\{\{x, y\} \subseteq [n] \setminus X : \text{discordant for } \sigma, \tau\},$$

and so the result follows.

Lemma 6 Let $\sigma \in R$, and let $\tau \in R'$. We have the following relationship between $d(A_R(\sigma), \tau)$ and $d(\sigma, \tau)$:

$$d(A_R(\sigma), \tau) = d(\sigma, \tau) + \binom{n-k}{2} - 2 \, d(\sigma, \Pi_R(\tau)). \qquad (12)$$

Proof We begin by observing that, by Lemma 5, we have

$$d(\sigma, \tau) = d(\sigma, \Pi_R(\tau)) + d(\Pi_R(\tau), \tau), \qquad (13)$$

and

$$d(A_R(\sigma), \tau) = d(A_R(\sigma), \Pi_R(\tau)) + d(\Pi_R(\tau), \tau). \qquad (14)$$

Now, from Lemma 3, we have that $d(A_R(\sigma), \Pi_R(\tau)) = \binom{n-k}{2} - d(\sigma, \Pi_R(\tau))$. Hence, the result follows.

Lemma 7 Let $R, R' \subseteq S_n$ be top-$k$ rankings, in preference notation given by

$$R: a_1 \succ \cdots \succ a_l \succ [n] \setminus \{a_1, \ldots, a_l\}, \qquad R': b_1 \succ \cdots \succ b_m \succ [n] \setminus \{b_1, \ldots, b_m\}.$$

If $\tau \sim \mathrm{Unif}(R')$, then $\Pi_R(\tau)$ is a full ranking with distribution $\mathrm{Unif}(R'')$, where $R'' \subseteq R$ is the partial ranking given by

$$R'': a_1 \succ \cdots \succ a_l \succ b_{i_1} \succ \cdots \succ b_{i_q} \succ [n] \setminus \{a_1, \ldots, a_l, b_1, \ldots, b_m\},$$

where $\{b_{i_1}, \ldots, b_{i_q}\} = \{b_1, \ldots, b_m\} \setminus \{a_1, \ldots, a_l\}$, and $i_j < i_{j+1}$ for all $j = 1, \ldots, q - 1$.

Proof We first show that $\Pi_R$ maps $R'$ into $R''$. This is straightforward: given $\tau \in R'$, we first observe that $\Pi_R(\tau) \in R$, and so the full ranking $\Pi_R(\tau)$ is consistent with the partial ranking $a_1 \succ \cdots \succ a_l \succ [n] \setminus \{a_1, \ldots, a_l\}$. Next, since $\Pi_R(\tau)$ is concordant with $\tau$ for all pairs outside the set $\{a_1, \ldots, a_l\}$, $\Pi_R(\tau)$ must be consistent with the partial ranking $b_{i_1} \succ \cdots \succ b_{i_q} \succ [n] \setminus \{a_1, \ldots, a_l, b_1, \ldots, b_m\}$. Putting these two facts together shows that the full ranking $\Pi_R(\tau)$ must be consistent with the partial ranking

$$a_1 \succ \cdots \succ a_l \succ b_{i_1} \succ \cdots \succ b_{i_q} \succ [n] \setminus \{a_1, \ldots, a_l, b_1, \ldots, b_m\}.$$

Thus, given $\tau \sim \mathrm{Unif}(R')$, the distribution of $\Pi_R(\tau)$ is supported on $R''$. To show that it is uniform, we now argue that equally many rankings in $R'$ are mapped to each ranking in $R''$. To see this, we observe that the pre-image of a ranking in $R''$ is the set of all rankings in $R'$ which are concordant with it on all pairs in $[n] \setminus \{a_1, \ldots, a_l, b_1, \ldots, b_m\}$. The number of such rankings is independent of the selected ranking in $R''$, and so the statement of the lemma follows.

Having introduced the antithetic operator $A_R: R \to R$ for a top-$k$ partial ranking $R$ and the projection map $\Pi_R: S_n \to R$, we next study how these operations interact with one another.
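The projection map of Definition 7 has a simple algorithmic description that follows from the proof of Lemma 4: keep the observed top-$k$ items in order, and order the remaining items as they appear in $\tau$. The sketch below (ours, with illustrative names and an arbitrary example) implements it and numerically checks the decomposition of Lemma 5 on one instance.

```python
# Sketch (ours) of the projection of Definition 7 for a top-k partial ranking R
# given by items x_1 > ... > x_k, together with a numerical check of Lemma 5.
from itertools import combinations

def kendall_distance(s1, s2):
    inv1 = {item: pos for pos, item in enumerate(s1)}
    inv2 = {item: pos for pos, item in enumerate(s2)}
    return sum(1 for x, y in combinations(s1, 2)
               if (inv1[x] - inv1[y]) * (inv2[x] - inv2[y]) < 0)

def project_onto_top_k(tau, top_k):
    # Kendall-closest full ranking in R to tau: fix the observed prefix and
    # keep tau's relative order on the remaining items.
    observed = set(top_k)
    tail = [item for item in tau if item not in observed]
    return tuple(top_k) + tuple(tail)

top_k = (2, 5)
sigma = (2, 5, 1, 3, 4, 6)            # a full ranking in R
tau = (4, 1, 6, 3, 2, 5)              # an arbitrary full ranking outside R
pi = project_onto_top_k(tau, top_k)   # -> (2, 5, 4, 1, 6, 3)
# Lemma 5: d(sigma, tau) = d(sigma, Pi_R(tau)) + d(Pi_R(tau), tau).
assert kendall_distance(sigma, tau) == (
    kendall_distance(sigma, pi) + kendall_distance(pi, tau))
```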
Lemma 8 Let $R'' \subseteq R \subseteq S_n$ be top-$k$ partial rankings. Then for $\sigma \in R$, we have

$$A_{R''}(\Pi_{R''}(\sigma)) = \Pi_{R''}(A_R(\sigma)).$$

Proof We begin by introducing preference-style notation for $R$ and $R''$. Let $R$ be the top-$k$ ranking given by $a_1 \succ \cdots \succ a_l \succ [n] \setminus \{a_1, \ldots, a_l\}$, and let $R''$ be the partial ranking given by $a_1 \succ \cdots \succ a_l \succ a_{l+1} \succ \cdots \succ a_m \succ [n] \setminus \{a_1, \ldots, a_m\}$. Let $\sigma \in R$, and let the elements of $[n] \setminus \{a_1, \ldots, a_m\}$ be given by $b_1, \ldots, b_q$, with indices chosen such that $\sigma$ corresponds to the full ranking $a_1 \succ \cdots \succ a_m \succ b_1 \succ \cdots \succ b_q$. Then the ranking $A_{R''}(\Pi_{R''}(\sigma))$ is given by

$$a_1 \succ \cdots \succ a_m \succ b_q \succ \cdots \succ b_1,$$

and a straightforward calculation shows that this is also the case for $\Pi_{R''}(A_R(\sigma))$, as required.

Finally, the last lemma states the most general identity for a distance, involving the antithetic operator and the closest-element map, given a partial ranking set $R$ and a subset of it, denoted by $R''$.

Lemma 9 Let $R'' \subseteq R \subseteq S_n$ be top-$k$ partial rankings, given in preference notation by

$$R: a_1 \succ \cdots \succ a_l \succ [n] \setminus \{a_1, \ldots, a_l\}, \qquad R'': a_1 \succ \cdots \succ a_l \succ a_{l+1} \succ \cdots \succ a_m \succ [n] \setminus \{a_1, \ldots, a_m\}.$$

Let $\alpha$ be the number of unranked elements under $R$, and let $\beta$ be the additional number of elements ranked under $R''$ relative to $R$. Then for $\sigma \in R$, we have

$$d(\sigma, \Pi_{R''}(\sigma)) = ((n - l) - (m - l))(m - l) + \binom{m - l}{2} - d(A_R(\sigma), \Pi_{R''}(A_R(\sigma))).$$

Proof Again, we denote $\{b_1, \ldots, b_q\} = [n] \setminus \{a_1, \ldots, a_m\}$, with indices chosen such that $\sigma$ corresponds to the full ranking $a_1 \succ \cdots \succ a_m \succ b_1 \succ \cdots \succ b_q$. From earlier arguments, we have

$$d(\sigma, \Pi_{R''}(\sigma)) = \#\{\{x, y\} \subseteq \{a_{l+1}, \ldots, a_m\} : \text{discordant for } \sigma, \Pi_{R''}(\sigma)\} + \#\{\{x, y\} : x \in \{a_{l+1}, \ldots, a_m\}, y \in \{b_1, \ldots, b_q\}, \text{discordant for } \sigma, \Pi_{R''}(\sigma)\}.$$

Now observe that for $a_i, a_j$ with $l + 1 \leq i < j \leq m$, this pair is discordant for the pair of rankings $\sigma, \Pi_{R''}(\sigma)$ iff $a_j \succ a_i$ under $\sigma$, iff $a_i \succ a_j$ with respect to $A_R(\sigma)$, iff $a_i, a_j$ are concordant for the pair of rankings $A_R(\sigma), \Pi_{R''}(A_R(\sigma))$. Hence, we have

$$\#\{\{x, y\} \subseteq \{a_{l+1}, \ldots, a_m\} : \text{discordant for } \sigma, \Pi_{R''}(\sigma)\} + \#\{\{x, y\} \subseteq \{a_{l+1}, \ldots, a_m\} : \text{discordant for } A_R(\sigma), \Pi_{R''}(A_R(\sigma))\} = \binom{\beta}{2}.$$

By analogous reasoning, we have

$$\#\{\{x, y\} : x \in \{a_{l+1}, \ldots, a_m\}, y \in \{b_1, \ldots, b_q\}, \text{discordant for } \sigma, \Pi_{R''}(\sigma)\} + \#\{\{x, y\} : x \in \{a_{l+1}, \ldots, a_m\}, y \in \{b_1, \ldots, b_q\}, \text{discordant for } A_R(\sigma), \Pi_{R''}(A_R(\sigma))\} = (\alpha - \beta) \beta.$$

Altogether, these statements yield the result of the lemma.

Proof of Proposition 2 Case $R = S_n$: let $\sigma_0 \in S_n$ be the fixed permutation; then $\mathrm{Cov}(d(\sigma, \sigma_0), d(A_{S_n}(\sigma), \sigma_0)) < 0$. This holds true since $d(A_{S_n}(\sigma), \sigma_0) = \binom{n}{2} - d(\sigma, \sigma_0)$ for all $\sigma \in S_n$ and all $n \in \mathbb{N}$, by Lemma 2.
Case $\emptyset \subset R \subset S_n$: if $\sigma_0 \in R$, we have that $d(A_R(\sigma), \sigma_0) = \binom{n-k}{2} - d(\sigma, \sigma_0)$ for all $\sigma \in R$, by Lemma 3. In general, if $\sigma_0 \notin R$, then by Lemma 6, $d(A_R(\sigma), \sigma_0) = d(\sigma, \sigma_0) + \binom{n-k}{2} - 2 \, d(\sigma, \Pi_R(\sigma_0))$.

Having proved all the relevant lemmas, we now present our main result regarding antithetic samples, namely, that this scheme provides negatively correlated pairs of samples and hence a lower-variance estimator.

Theorem 6 Let the antithetic kernel estimator be evaluated at a pair of partial rankings $R_i, R_j$, where $(\sigma_n)_{n=1}^{N} \sim \mathrm{Unif}(R_i)$, $(\tau_m)_{m=1}^{M} \sim \mathrm{Unif}(R_j)$, and $N, M$ are the numbers of pairs of samples. If we set $\widetilde{\sigma}_n = A_{R_i}(\sigma_n)$ and $\widetilde{\tau}_m = A_{R_j}(\tau_m)$ for all $m, n$, this corresponds to the antithetic case. If instead $(\widetilde{\sigma}_n)_{n=1}^{N} \sim \mathrm{Unif}(R_i)$ and $(\widetilde{\tau}_m)_{m=1}^{M} \sim \mathrm{Unif}(R_j)$ independently, this corresponds to the i.i.d. case. Then the asymptotic variance of the estimator of the marginalised kernel of Equation (5) is lower in the antithetic case than in the i.i.d. case.

Proof It has been shown previously that the antithetic kernel estimator is unbiased (in the off-diagonal case), so showing that it has lower MSE in the antithetic case is equivalent to showing that its second moment is smaller in the antithetic case than in the i.i.d. case. The second moment is given by

$$\mathbb{E}\left[ \widehat{K}(R_i, R_j)^2 \right] = \mathbb{E}\left[ \left( \frac{1}{4NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \left( K(\sigma_n, \tau_m) + K(\widetilde{\sigma}_n, \tau_m) + K(\sigma_n, \widetilde{\tau}_m) + K(\widetilde{\sigma}_n, \widetilde{\tau}_m) \right) \right)^2 \right]$$
$$= \frac{1}{16 N^2 M^2} \sum_{n, n' = 1}^{N} \sum_{m, m' = 1}^{M} \mathbb{E}\Big[ \left( K(\sigma_n, \tau_m) + K(\widetilde{\sigma}_n, \tau_m) + K(\sigma_n, \widetilde{\tau}_m) + K(\widetilde{\sigma}_n, \widetilde{\tau}_m) \right) \times \left( K(\sigma_{n'}, \tau_{m'}) + K(\widetilde{\sigma}_{n'}, \tau_{m'}) + K(\sigma_{n'}, \widetilde{\tau}_{m'}) + K(\widetilde{\sigma}_{n'}, \widetilde{\tau}_{m'}) \right) \Big].$$

We identify three types of terms in the above sum: (i) those where $n \neq n'$ and $m \neq m'$; (ii) those where $n = n'$ but $m \neq m'$, or $m = m'$ but $n \neq n'$; (iii) those where $n = n'$ and $m = m'$.

We remark that in case (i), the 16 terms that appear in the summand all have the same distribution in the antithetic and i.i.d. cases, so terms of the form (i) contribute no difference between antithetic and i.i.d. sampling. There are $O(N^2 M + M^2 N)$ terms of the form (ii), and $O(NM)$ terms of the form (iii). We thus refer to terms of the form (ii) as cubic terms, and terms of the form (iii) as quadratic terms. We observe that, due to the proportion of cubic terms to quadratic terms diverging as $N, M \to \infty$, it is sufficient to prove that each cubic term is smaller in the antithetic case than in the i.i.d. case to establish the claim of lower MSE.

Thus, we focus on cubic terms. Let us consider a term with $n = n'$ and $m \neq m'$. The term has the form

$$\mathbb{E}\Big[ \left( K(\sigma_n, \tau_m) + K(\widetilde{\sigma}_n, \tau_m) + K(\sigma_n, \widetilde{\tau}_m) + K(\widetilde{\sigma}_n, \widetilde{\tau}_m) \right) \times \left( K(\sigma_n, \tau_{m'}) + K(\widetilde{\sigma}_n, \tau_{m'}) + K(\sigma_n, \widetilde{\tau}_{m'}) + K(\widetilde{\sigma}_n, \widetilde{\tau}_{m'}) \right) \Big].$$
Of the sixteen terms appearing in the expectation above, there are only two distinct distributions they may have. The two types of terms are given below:

$$\mathbb{E}\left[ K(\sigma_n, \tau_m) K(\sigma_n, \tau_{m'}) \right], \qquad (15)$$

and

$$\mathbb{E}\left[ K(\sigma_n, \tau_m) K(\widetilde{\sigma}_n, \tau_{m'}) \right]. \qquad (16)$$

Terms of the form in Equation (15) have the same distribution in the antithetic and i.i.d. cases, so we can ignore these. However, terms of the form in Equation (16) have differing distributions in the two cases, so we focus on these. We deal specifically with the case where $K(\sigma, \tau) = \exp(-\lambda \, d(\sigma, \tau))$, so we may rewrite the expression in Equation (16) as

$$\mathbb{E}\left[ \exp\left( -\lambda \left( d(\sigma_n, \tau_m) + d(\widetilde{\sigma}_n, \tau_{m'}) \right) \right) \right]. \qquad (17)$$

We now decompose the distances $d(\sigma_n, \tau_m)$, $d(\widetilde{\sigma}_n, \tau_{m'})$ using the series of lemmas introduced before. First, we use Lemma 5 to write

$$d(\sigma_n, \tau_m) = d(\sigma_n, \Pi_{R_i}(\tau_m)) + d(\Pi_{R_i}(\tau_m), \tau_m), \qquad d(\widetilde{\sigma}_n, \tau_{m'}) = d(\widetilde{\sigma}_n, \Pi_{R_i}(\tau_{m'})) + d(\Pi_{R_i}(\tau_{m'}), \tau_{m'}). \qquad (18)$$

We give a small example illustrating some of the variables at play in this decomposition in Figure 3.

Fig. 3 An example of the variables appearing in the decomposition in Equation (18).

Now, writing $R_0 \subseteq R_i$ for the partial ranking described by Lemma 7, we have that $\Pi_{R_i}(\tau_m), \Pi_{R_i}(\tau_{m'}) \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(R_0)$. Therefore, the distances in Equation (18) may be decomposed further:

$$d(\sigma_n, \tau_m) = d(\sigma_n, \Pi_{R_0}(\sigma_n)) + d(\Pi_{R_0}(\sigma_n), \Pi_{R_i}(\tau_m)) + d(\Pi_{R_i}(\tau_m), \tau_m),$$
$$d(\widetilde{\sigma}_n, \tau_{m'}) = d(\widetilde{\sigma}_n, \Pi_{R_0}(\widetilde{\sigma}_n)) + d(\Pi_{R_0}(\widetilde{\sigma}_n), \Pi_{R_i}(\tau_{m'})) + d(\Pi_{R_i}(\tau_{m'}), \tau_{m'}). \qquad (19)$$

We now consider each term, and argue as to whether the distribution is different in the antithetic and i.i.d. cases, recalling that in the i.i.d. case, $\widetilde{\sigma}_n$ is drawn from $R_i$ independently of $\sigma_n$, whilst in the antithetic case, $\widetilde{\sigma}_n = A_{R_i}(\sigma_n)$.

– The terms $d(\Pi_{R_i}(\tau_m), \tau_m)$ and $d(\Pi_{R_i}(\tau_{m'}), \tau_{m'})$ each have the same distribution under the i.i.d. case and the antithetic case. Further, in both cases, $d(\Pi_{R_i}(\tau_m), \tau_m)$ is independent of $\Pi_{R_i}(\tau_m)$, and $d(\Pi_{R_i}(\tau_{m'}), \tau_{m'})$ is independent of $\Pi_{R_i}(\tau_{m'})$, so these two terms are independent of all others appearing in the sum in both cases.

– The terms $d(\Pi_{R_0}(\sigma_n), \Pi_{R_i}(\tau_m))$ and $d(\Pi_{R_0}(\widetilde{\sigma}_n), \Pi_{R_i}(\tau_{m'}))$ each have the same distribution under the i.i.d. case and the antithetic case, and are independent of all other terms in both cases.

– We deal with the terms $d(\sigma_n, \Pi_{R_0}(\sigma_n))$ and $d(\widetilde{\sigma}_n, \Pi_{R_0}(\widetilde{\sigma}_n))$ using Lemma 9. More specifically, under the i.i.d. case, these two distances are clearly i.i.d.. However, under the antithetic case, the lemma tells us that the sum of these two distances is equal to the mean under the distribution of the i.i.d. case almost surely. Thus, in the antithetic case, this random variable has the same mean as in the i.i.d. case, but is more concentrated (strictly so iff $d(\sigma_n, \Pi_{R_0}(\sigma_n))$ is not a constant almost surely, which is the case iff $R_0 \neq R_i$).

Thus, $d(\sigma_n, \tau_m) + d(\widetilde{\sigma}_n, \tau_{m'})$ has the same mean under the i.i.d. and antithetic cases, but is strictly more concentrated when $R_0 \neq R_i$, which holds iff the partial rankings $R_i$ and $R_j$ do not concern exactly the same set of objects.
Thus, by a conditional version of Jensen's inequality, since $\exp(-\lambda x)$ is strictly convex as a function of $x$, we obtain the variance result.

4.2 Antithetic kernel estimator and kernel herding

In this section, having established the variance-reduction properties of antithetic samples in the context of Monte Carlo kernel estimation, we now explore connections to kernel herding [6].

Theorem 7 The antithetic variate construction of Theorem 5 is equivalent to the optimal solution for the first two steps of a kernel herding procedure in the space of permutations.

Proof Let $R$ be a partial ranking of $n$ elements. We calculate the sequence of herding samples from the uniform distribution $p(\cdot \mid R)$ over full rankings consistent with $R$, associated with the exponential semimetric kernel $K(\sigma, \sigma') = \exp(-\lambda d(\sigma, \sigma'))$, for a metric $d$ of negative definite type. Following [6], we note that the herding samples from $p(\cdot \mid R)$ associated with the kernel $K$, with RKHS embedding $\phi: S_n \to \mathcal{H}$, are defined iteratively by

$$\sigma_T = \arg\min_{\sigma_T} \left\| \mu_p - \frac{1}{T} \sum_{t=1}^{T} \phi(\sigma_t) \right\|_{\mathcal{H}} \quad \text{for } T = 1, 2, \ldots,$$

where $\mu_p$ is the RKHS mean embedding of the distribution $p$. Since $p$ is uniform over its support, any ranking $\sigma_1$ in the support of $p(\cdot \mid R)$ is a valid choice as the first sample in a herding sequence. Given such an initial sample, we then calculate the second herding sample by considering the herding objective as follows:

$$\left\| \mu_p - \frac{1}{2} \sum_{t=1}^{2} \phi(\sigma_t) \right\|_{\mathcal{H}}^2 = \|\mu_p\|_{\mathcal{H}}^2 - \sum_{t=1}^{2} \frac{1}{|R|} \sum_{\sigma \in R} K(\sigma_t, \sigma) + \frac{1}{4} \left( K(\sigma_1, \sigma_1) + 2 K(\sigma_1, \sigma_2) + K(\sigma_2, \sigma_2) \right), \qquad (20)$$

which, as a function of $\sigma_2$, is equal to $\frac{1}{2} K(\sigma_1, \sigma_2) = \frac{1}{2} \exp(-\lambda d(\sigma_1, \sigma_2))$, up to an additive constant. Thus, selecting $\sigma_2$ to minimise the herding objective is equivalent to maximising $d(\sigma_1, \sigma_2)$, which is exactly the definition of the antithetic sample to $\sigma_1$.

After this result, one would like to run a herding procedure for more than two steps. However, the greedy solution is not the same as picking $k$ herding samples simultaneously. Specifically, the following counterexample, illustrated in Figure 4, shows why. The left plot shows the result of solving the herding objective for 2 samples: the result is an antithetic pair of samples for the region $R$. If a third sample is selected greedily, with these first two samples fixed, it will yield a different result than if the herding objective is solved for 3 samples simultaneously, as illustrated on the right of the figure.

Remark 4 Theorem 7 says that if we first pick a point uniformly at random from $R$, put it into the herding objective, and then select the second point deterministically to minimise the herding objective, this is equivalent to the antithetic variate construction of Definition 6. Alternatively, we could pick the second point uniformly at random from $R$, independently of the first point. This second scheme will produce a higher value of the herding objective on average.

Fig. 4 Samples from the region R, illustrating the difference between solving the herding objective greedily, and solving for all samples simultaneously.

Having constructed the two kernel matrix estimators, we present some experiments to assess their performance in the next section.
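The two-step herding argument of Theorem 7 can be checked numerically on a small example. The sketch below (ours; the kernel bandwidth and the example partial ranking are arbitrary choices) enumerates the consistent set $R$ of a small top-$k$ partial ranking, evaluates the $\sigma_2$-dependent part of the herding objective of Equation (20), and confirms that its minimiser is the antithetic permutation of the first sample.

```python
# Numerical check (ours) of the two-step herding argument of Theorem 7 on a
# small example: with a Mallows-type kernel, the second greedy herding sample
# is the antithetic permutation of the first (tail reversed).
from itertools import combinations, permutations
from math import exp

def mallows(s1, s2, lam=0.5):
    inv1 = {v: p for p, v in enumerate(s1)}
    inv2 = {v: p for p, v in enumerate(s2)}
    n_d = sum(1 for x, y in combinations(s1, 2)
              if (inv1[x] - inv1[y]) * (inv2[x] - inv2[y]) < 0)
    return exp(-lam * n_d)

n, top_k = 5, (2, 4)
rest = [x for x in range(1, n + 1) if x not in top_k]
R = [top_k + p for p in permutations(rest)]   # all consistent full rankings

sigma1 = R[0]

def herding_objective(sigma2):
    # sigma2-dependent part of || mu_p - (phi(sigma1) + phi(sigma2)) / 2 ||^2,
    # cf. Equation (20).
    cross = sum(mallows(sigma2, s) for s in R) / len(R)
    return -cross + (2 * mallows(sigma1, sigma2) + mallows(sigma2, sigma2)) / 4

best = min(R, key=herding_objective)
tail = [x for x in sigma1 if x not in top_k]
assert best == top_k + tuple(reversed(tail))   # the antithetic sample of sigma1
```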
5 Experiments

In this section, we use the Monte Carlo and antithetic kernel estimators for a variety of unsupervised and supervised machine learning tasks: a nonparametric hypothesis test, an agglomerative clustering algorithm and a Gaussian process classifier.

Definition 6 states the antithetic permutation construction with respect to a given permutation for the Kendall distance. In order to consider partial rankings data, we should respect the observed preferences when obtaining the antithetic variate. The pseudocode in Algorithm 1 is the algorithmic description for sampling an antithetic permutation while simultaneously respecting the constraints imposed by the observed partial ranking. Namely, the antithetic permutation has the observed preferences fixed in the same locations as the original permutation and only reverses the unobserved locations. This corresponds to maximising the Kendall distance between the permutation pair while respecting the constraints, and ensures that both permutations have the right marginals, as stated in Remark 1 and Lemma 1.

Algorithm 1 SampleAntitheticConsistentFullRankings
  Input: top-k partial ranking i_1 ≻ i_2 ≻ ··· ≻ i_k, degree n
  Returns: two full rankings σ_1, σ_2 consistent with the given partial ranking
  Set σ_1(l) = σ_2(l) = i_l for l = 1, ..., k
  Obtain a random ordering j_1, ..., j_{n−k} of the remaining items {1, ..., n} \ {i_1, ..., i_k}
  Let b_1 < ··· < b_{n−k} be the remaining ranking positions, i.e. b_l = k + l
  Set σ_1(b_l) = j_l for l = 1, ..., n − k
  Set σ_2(b_l) = j_{n−k−l+1} for l = 1, ..., n − k
  Return σ_1, σ_2

5.1 Datasets

Synthetic data set. The synthetic dataset for the nonparametric hypothesis test experiment, where the null hypothesis is $H_0: P = Q$ and the alternative is $H_1: P \neq Q$, is the following: the dataset from the $P$ distribution is a mixture of Mallows distributions [9] with the Kendall and Hamming distances. The central permutations are given by the identity permutation and the reverse of the identity respectively, with lengthscale equal to one. The dataset from the $Q$ distribution is a sample from the uniform distribution over $S_n$, where $n = 6$.

Sushi dataset. This dataset contains rankings of sushi preferences given by 5000 users [21]. The users ranked 10 types of sushi, and the labels correspond to the user's region (East Japan or West Japan for the Gaussian process classifier, and ten different regions for the agglomerative clustering task).

5.2 Agglomerative clustering

In this experiment, we used both the full and a censored version of the sushi dataset from Section 5.1. We used various distances for permutations to compute the estimators of the semimetric matrix between pairs of partial ranking subsets. In order to compute our estimators, we censored the dataset by keeping the top-$k$ = 4 partial ranking per user. The Monte Carlo and antithetic kernel estimators were used to obtain negative-type semimetric matrices, using the relationship from Equation (2), in the following way:

$$\widehat{D}(R, R') = \widehat{K}(R, R) + \widehat{K}(R', R') - 2 \widehat{K}(R, R').$$

These matrices were then used as input to the average-linkage agglomerative clustering algorithm [10].
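As an illustration of this pipeline (ours, not the authors' code; it assumes NumPy and SciPy are available, and the partial rankings, bandwidth and sample sizes are arbitrary example choices), the sketch below estimates a small Gram matrix with the Monte Carlo estimator, converts it to the semimetric matrix above, and runs average-linkage clustering.

```python
# Illustrative pipeline (ours): Monte Carlo Gram matrix between top-k partial
# rankings -> negative-type semimetric matrix (Equation (2)) -> average linkage.
import random
from itertools import combinations
from math import exp
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def sample_full(top_k, n):
    rest = [x for x in range(1, n + 1) if x not in set(top_k)]
    random.shuffle(rest)
    return tuple(top_k) + tuple(rest)

def mallows(s1, s2, nu=0.2):
    inv1 = {v: p for p, v in enumerate(s1)}
    inv2 = {v: p for p, v in enumerate(s2)}
    n_d = sum(1 for x, y in combinations(s1, 2)
              if (inv1[x] - inv1[y]) * (inv2[x] - inv2[y]) < 0)
    return exp(-nu * n_d)

def estimated_gram(partial_rankings, n, num_samples=20):
    samples = [[sample_full(r, n) for _ in range(num_samples)]
               for r in partial_rankings]
    m = len(partial_rankings)
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(i, m):
            K[i, j] = K[j, i] = np.mean(
                [mallows(s, t) for s in samples[i] for t in samples[j]])
    return K

partial_rankings = [(1, 2, 3, 4), (2, 1, 3, 4), (10, 9, 8, 7), (9, 10, 7, 8)]
K_hat = estimated_gram(partial_rankings, n=10)
D_hat = np.add.outer(np.diag(K_hat), np.diag(K_hat)) - 2 * K_hat   # Equation (2)
np.fill_diagonal(D_hat, 0.0)
Z = linkage(squareform(np.maximum(D_hat, 0.0), checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))   # two clusters of "users"
```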
We report the dendrogram (tree) purity, which provides a way to assess the quality of the tree produced by the agglomerative clustering algorithm. It can be computed in the following way: given a dendrogram and all correct labels, pick uniformly at random two leaves which have the same label c and find the smallest subtree containing the two leaves. The dendrogram purity is the expected value of the fraction of leaves in this smallest subtree that also have label c, computed per class. If all leaves in each class are contained in a pure subtree, the dendrogram purity is one. Hence, values close to one correspond to high-quality trees.

Table 1 Tree purities for the sushi dataset using a subsample of 100 users, with the full Gram matrix K, a censored dataset of top-k = 4 partial rankings for the vanilla Monte Carlo estimator K̂ and the antithetic Monte Carlo estimator K̂_a, with n_mc = 20 Monte Carlo samples. Tree cut at k = 10 clusters. The median distance criterion was used to select the inverse of the lengthscale for the semimetric exponential (Semiexp) kernels.

                 Kendall        Mallows        Semiexp Hamming   Semiexp Cayley   Semiexp Spearman
K    Average     0.83           0.75           0.81              0.72             0.81
K̂    Average     0.78 (0.052)   0.79 (0.058)   0.79 (0.063)      0.82 (0.040)     0.78 (0.062)
K̂_a  Average     NA             0.77 (0.050)   NA                NA               NA

In Table 1, the true and estimated purities using the full rankings and the partial rankings datasets are reported. We assume that the true labels are given by the user's region; there are ten different possible regions. The true purity corresponds to an agglomerative clustering algorithm using the Gram matrix obtained from the full rankings. We can compute the Gram matrix for the full rankings because we have access to all of the users' rankings over the ten different types of sushi. The antithetic Monte Carlo estimator outperforms the vanilla Monte Carlo estimator in terms of average purity, since it is closer to the true purity. It also has a lower standard deviation when estimating the marginalised Mallows kernel.

5.3 Nonparametric hypothesis test with MMD

Let P and Q be probability distributions over S_n. The null hypothesis is H_0 : P = Q versus H_1 : P ≠ Q, using samples σ_1, ..., σ_m i.i.d. ∼ P and σ'_1, ..., σ'_n i.i.d. ∼ Q. We can estimate a pseudometric between P and Q and reject H_0 if the observed value of the statistic is large. The following is an unbiased estimator of the squared MMD [15]:

\widehat{\mathrm{MMD}}^{2}(P, Q) = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \neq i} K(\sigma_i, \sigma_j) + \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} K(\sigma'_i, \sigma'_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} K(\sigma_i, \sigma'_j). \qquad (21)

This statistic depends on the chosen kernel, as can be seen in Equation (21). If the kernel is characteristic [32], then the MMD is a proper metric over probability distributions. Analogously, we can compute a squared MMD estimator for sets of partial rankings, with R_1, ..., R_m i.i.d. ∼ P and R'_1, ..., R'_n i.i.d. ∼ Q, in the following way:

\widehat{\mathrm{MMD}}^{2}(P, Q) = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \neq i} \widehat{K}(R_i, R_j) + \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} \widehat{K}(R'_i, R'_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \widehat{K}(R_i, R'_j). \qquad (22)

Fig. 5 Mean p-values (y-axis) vs. number of datapoints in the synthetic dataset (x-axis).

We used the synthetic datasets for P and Q described in Section 5.1 to assess the performance of the Monte Carlo and antithetic kernel estimators in a nonparametric hypothesis test. The datasets consist of rankings over n = 10 objects, and we censored them to obtain top-k partial rankings with k = 3.
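As an illustration of Equations (21) and (22) and of the resampling procedure used in this experiment, the following Python sketch (ours; the helper names and the default number of re-partitions are assumptions) computes the unbiased squared MMD statistic from two lists of samples and estimates a p-value by randomly re-partitioning the pooled sample.

import random

def mmd2_unbiased(X, Y, kernel):
    """Unbiased MMD^2 estimate between two samples, as in Equations (21) and (22).
    X, Y are lists of observations (full or partial rankings); kernel is any function of two
    observations, e.g. an exact kernel on full rankings or a Monte Carlo / antithetic estimate
    on partial rankings."""
    m, n = len(X), len(Y)
    xx = sum(kernel(X[i], X[j]) for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    yy = sum(kernel(Y[i], Y[j]) for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    xy = sum(kernel(x, y) for x in X for y in Y) / (m * n)
    return xx + yy - 2 * xy

def permutation_test_pvalue(X, Y, kernel, n_permutations=200, seed=0):
    """Estimate the p-value of the MMD^2 statistic by re-partitioning the pooled sample,
    which consistently estimates the null distribution."""
    rng = random.Random(seed)
    observed = mmd2_unbiased(X, Y, kernel)
    pooled = list(X) + list(Y)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        exceed += mmd2_unbiased(pooled[:len(X)], pooled[len(X):], kernel) >= observed
    return (exceed + 1) / (n_permutations + 1)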
We then computed the squared MMD statistic using the samples from the two populations. Since the non-asymptotic distribution of the statistic from Equation (22) is not known, we performed a permutation test [1] in order to estimate the null distribution consistently and compute the p-value. We did this repeatedly as we varied the number of observations, for a fixed number of Monte Carlo samples, to see the effect of the sample size on the p-value computations. Specifically, Figure 5 and Table 2 show that the p-value computed with the antithetic kernel estimator has lower variance as we vary the number of observations in our dataset. Both p-values converge to zero, since the samples from the two populations come from different distributions. In Table 2 we report the standard deviations of the estimated p-values. The p-value obtained with the antithetic kernel estimator has lower variance across all sample sizes.

Table 2 Standard deviations for p-values computed with the Monte Carlo and antithetic estimators.

5.4 Gaussian process classification

Table 3 Gaussian process classification results, averaged over 10 runs with 4 Monte Carlo samples per run, using the inverse of the median distance as the lengthscale, n = 10, top-k = 6.

                        Test accuracy   Train ave. log-lik      Test ave. log-lik
Mallows, n_obs = 100
  Full model            0.9             -0.2070                 -0.5457
  MC                    0.74 (0.016)    -0.2486 (0.005)         -0.563 (0.020)
  Antithetic            0.75 (0)        -0.262 (0.001)          -0.573 (0.002)
Gaussian, n_obs = 50
  Full model            0.75            -0.2215                 -0.7014
  MC                    0.72 (0.048)    -0.2890 (0.0245)        -0.5737 (0.043)
  Antithetic            NA              NA                      NA
Kendall, n_obs = 100
  Full model            0.7             -0.311 (3.01 × 10^− )   -0.597 (3.5 × 10^− )
  MC                    0.66 (0.037)    -0.3575 (0.008)         -0.7063 (0.052)
  Antithetic            NA              NA                      NA

In Table 3, we report the results of running the Gaussian process classifier using the marginalised Mallows kernel, the marginalised Gaussian kernel and the marginalised Kendall kernel, together with the corresponding estimators. Since the Mallows kernel is based on the Kendall distance, it is a kernel specifically tailored for permutations, and it gives the best predictive performance. The Gaussian kernel is suited to Euclidean spaces and does not take the data type into account, yet it still does well. The Kendall kernel does take the data type into account, but it performs the worst. The full model corresponds to using the Gram matrix computed from the full rankings; MC and Antithetic refer to our proposed estimators. Empirically, the predictive accuracy obtained with the antithetic kernel estimator has lower variance, as expected.

6 Conclusion

We addressed the problem of extending kernels to partial rankings by introducing a novel Monte Carlo kernel estimator, and we explored variance reduction strategies via an antithetic variates construction. Our schemes lead to a computationally tractable alternative to previous approaches for partial rankings data. The Monte Carlo scheme can be used to obtain an estimator of the marginalised kernel with any of the kernels reviewed herein. The antithetic construction provides an improved version of the kernel estimator for the marginalised Mallows kernel. Our contribution is noteworthy because the computation of most of the marginalised kernels grows super-exponentially with respect to the number of elements in the collection; hence, it quickly becomes intractable even for relatively small values of the number of ranked items n. An exception is the fast approach for computing the convolution kernel proposed by Jiao & Vert [20], which is only valid for the Kendall kernel.
Mania et al. [26] have shown that the Kendall kernel is not characteristic, using non-commutative Fourier analysis to show that it has a degenerate spectrum. For this reason, using other kernels for permutations might be desirable, depending on the task at hand.

One possible direction for future work is the use of explicit feature representations from traditional random features schemes to further reduce the computational cost of the Gram matrix. Another possible application is to use our method with pairwise preference data, where users are not necessarily consistent about their preferences. In this type of data, we could still extract a partial ranking from a given user, then sample from the space of full rankings consistent with this observed partial ranking and obtain our Monte Carlo kernel estimator. This setting would benefit from our framework because having a partial ranking is in general more informative than having pairwise comparisons or star ratings.

Another natural direction for future work is to develop variance-reduction sampling techniques for a wider variety of kernels over permutations, and to extend the theoretical analysis of these constructions to discrete graphs more generally.

Acknowledgements Many thanks to Ryan Adams for insightful discussions. Maria Lomeli and Zoubin Ghahramani acknowledge support from the Alan Turing Institute (EPSRC Grant EP/N510129/1), EPSRC Grant EP/N014162/1, and donations from Google and Microsoft Research. Arthur Gretton thanks the Gatsby Charitable Foundation for financial support. Mark Rowland acknowledges support by EPSRC grant EP/L016516/1 for the Cambridge Centre for Analysis.

References

1. Alba Fernández, V., Jiménez Gamero, M. D., & Muñoz García, J. 2007. A test for the two-sample problem based on empirical characteristic functions. Computational Statistics and Data Analysis.
2. Aronszajn, N. 1950. Theory of Reproducing Kernels. Transactions of the American Mathematical Society, (3), 337–404.
3. Berg, C., Christensen, J. P. R., & Ressel, P. 1984. Harmonic Analysis on Semigroups. Springer-Verlag.
4. Berlinet, A., & Thomas-Agnan, C. 2004. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers.
5. Boser, B. E., Guyon, I. M., & Vapnik, V. N. 1992. A training algorithm for optimal margin classifiers. In: ACM Workshop on Computational Learning Theory.
6. Chen, Y., Welling, M., & Smola, A. 2010. Super-Samples from Kernel Herding. In: UAI.
7. Cortes, C., & Vapnik, V. N. 1995. Support-vector networks. Machine Learning.
8. Deza, M. M., & Deza, E. 2009. Encyclopedia of Distances. Springer-Verlag.
9. Diaconis, P. 1988. Group Representations in Probability and Statistics. Institute of Mathematical Statistics Lecture Notes.
10. Duda, R. O., & Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley and Sons, New York.
11. Dudley, R. 2002. Real Analysis and Probability. Cambridge University Press.
12. Fink, A. M., & Jodeit, M. 1984. On Chebyshev's other inequality. Inequalities in Statistics and Probability.
13. Fukumizu, K., Sriperumbudur, B., Gretton, A., & Schölkopf, B. 2009. Characteristic kernels on groups and semigroups. In: Neural Information Processing Systems.
14. GPy. since 2012. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy.
15. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. 2012. A kernel two-sample test. Journal of Machine Learning Research, 723–773.
16. Hammersley, J. M., & Morton, K. W. 1956. A new Monte Carlo technique: antithetic variates. Mathematical Proceedings of the Cambridge Philosophical Society, 449–475.
17. Haussler, D. 1999. Convolution Kernels on Discrete Structures.
18. Irurozki, E., Calvo, B., & Lozano, J. A. 2016. Sampling and learning Mallows and Generalized Mallows models under the Cayley distance. Methodology and Computing in Applied Probability.
19. James, G. D. 1978. The Representation Theory of the Symmetric Groups. Springer.
20. Jiao, Y., & Vert, J. P. 2015. The Kendall and Mallows Kernels for Permutations. In: International Conference on Machine Learning.
21. Kamishima, T., & Akaho, S. 2009. Mining Complex Data. Springer Berlin Heidelberg. Chap. Efficient Clustering for Orders, pages 261–279.
22. Knuth, D. 1998. The Art of Computer Programming. Vol. 3. Addison-Wesley.
23. Kondor, R., & Barbosa, M. 2010. Ranking with kernels in Fourier space. In: Conference on Learning Theory.
24. Kondor, R., Howard, A., & Jebara, T. 2007. Multi-object tracking with representations of the symmetric group. In: AISTATS.
25. Neal, R. M. 1998. Regression and Classification Using Gaussian Process Priors. Bayesian Statistics 6.
26. Mania, H., Ramdas, A., Wainwright, M. J., Jordan, M. I., & Recht, B. 2016. Universality of Mallows' and degeneracy of Kendall's kernels for rankings.
27. Rasmussen, C. E., & Williams, C. K. I. 2006. Gaussian Processes for Machine Learning. MIT Press.
28. Schölkopf, B., & Smola, A. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press.
29. Schölkopf, B., Smola, A. J., & Müller, K. R. 1999. Kernel principal component analysis. In: Schölkopf, B., Burges, C. J. C., & Smola, A. J. (eds), Advances in Kernel Methods. MIT Press.
30. Sejdinovic, D., Sriperumbudur, B., Gretton, A., & Fukumizu, K. 2013. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics.
31. Serfling, R. J. 1980. Approximation Theorems of Mathematical Statistics. Wiley.
32. Sriperumbudur, B. K., Fukumizu, K., & Lanckriet, G. R. G. 2011. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research.
33. Takeuchi, I., Le, Q. V., Sears, T. D., & Smola, A. 2006. Nonparametric quantile estimation. Journal of Machine Learning Research.
34. Tsuda, K., Kin, T., & Asai, K. 2002. Marginalised kernels for biological sequences. Bioinformatics.
35. Wendland, H. 2005. Scattered Data Approximation. Cambridge University Press.

A Reproducing kernel Hilbert spaces

A reproducing kernel Hilbert space (RKHS) [4] over a set X is a Hilbert space H consisting of functions on X such that for each x ∈ X there is a function k_x ∈ H with the property

\langle f, k_x \rangle_{\mathcal{H}} = f(x), \quad \forall f \in \mathcal{H}. \qquad (23)

The function k_x(·) = k(x, ·) is called the reproducing kernel of H [2]. The space H is endowed with an inner product ⟨·, ·⟩_H, and a norm can be defined from it as ‖f‖_H := √(⟨f, f⟩_H). In order to be a Hilbert space, H needs to contain the limits of all Cauchy sequences, i.e. it has to be complete. In the case of the symmetric group of degree n, X = S_n, the space is finite dimensional, which guarantees that it is complete. Finally, any symmetric and positive definite function k : X × X → R uniquely determines an RKHS.

Alternatively, a function k : X × X → R is called a kernel if there exists a Hilbert space H and a map φ : X → H such that, for all x, y ∈ X, k(x, y) = ⟨φ(x), φ(y)⟩_H. The function φ is usually referred to as the feature representation of x. Even though the RKHS induced by the kernel is unique, there can be more than one feature representation that defines the same kernel.
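Because S_n is finite, the positive definiteness that underlies this construction can be checked directly by building the full Gram matrix of a kernel over the group and inspecting its eigenvalues. The short sketch below (ours, not from the paper; it assumes numpy is available and uses the Mallows kernel with λ = 1 on S_4 purely as an example) does exactly that.

import itertools
import math
import numpy as np

def kendall_distance(r1, r2):
    """Number of item pairs ordered differently by the two rankings."""
    pos2 = {item: p for p, item in enumerate(r2)}
    n = len(r1)
    return sum(1 for i in range(n) for j in range(i + 1, n) if pos2[r1[i]] > pos2[r1[j]])

lam = 1.0
perms = list(itertools.permutations(range(4)))  # all of S_4
# Gram matrix of the Mallows kernel K(sigma, sigma') = exp(-lam * d_Kendall) over the whole group
G = np.array([[math.exp(-lam * kendall_distance(a, b)) for b in perms] for a in perms])

print("symmetric:", np.allclose(G, G.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(G).min())  # non-negative, so the matrix is positive semi-definite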
B Expectation of the Kernel Monte Carlo estimator

Proof For distinct i, j = 1, ..., I, let {σ_n^(i)}_{n=1}^{N_i} be an independent and identically distributed (i.i.d.) sample from p(σ | R_i) and {σ_m^(j)}_{m=1}^{N_j} be an i.i.d. sample from p(σ | R_j). If the weights are uniform,

E\big(\widehat{K}(R_i, R_j)\big) = \frac{1}{N_i N_j} \sum_{n=1}^{N_i} \sum_{m=1}^{N_j} E\big(K(\sigma^{(i)}_n, \sigma^{(j)}_m)\big). \qquad (24)

By linearity of expectation, since the samples are identically distributed, the expectation in the summand above reduces to

\sum_{\sigma \in R_i} \sum_{\sigma' \in R_j} K(\sigma, \sigma')\, p(\sigma \mid R_i)\, p(\sigma' \mid R_j),

as required. For the diagonal case,

E\big(\widehat{K}(R_i, R_i)\big) = \frac{1}{N_i^2} \sum_{n=1}^{N_i} \sum_{m=1}^{N_i} E\big(K(\sigma^{(i)}_n, \sigma^{(i)}_m)\big)
= \frac{1}{N_i^2} \sum_{n=1}^{N_i} E\Big[ \sum_{m \neq n}^{N_i} E\big(K(\sigma^{(i)}_n, \sigma^{(i)}_m) \mid \sigma^{(i)}_n\big) + E\big(K(\sigma^{(i)}_n, \sigma^{(i)}_n) \mid \sigma^{(i)}_n\big) \Big]
= \frac{1}{N_i^2} \sum_{n=1}^{N_i} E\Big[ (N_i - 1)\, E\big(K(\sigma^{(i)}, \sigma'^{(i)}) \mid \sigma^{(i)}_n\big) + E\big(K(\sigma'^{(i)}, \sigma'^{(i)})\big) \Big]
= \frac{N_i - 1}{N_i}\, E_{\sigma, \sigma'}\big(K(\sigma^{(i)}, \sigma'^{(i)})\big) + \frac{1}{N_i}\, E_{\sigma'}\big(K(\sigma'^{(i)}, \sigma'^{(i)})\big).

If the weights are non-uniform and are given by the importance sampling weights, namely w^{(i)} = p(σ | R_i) / q(σ | R_i), and the expectation is taken with respect to the proposal q, then

E_q\big(\widehat{K}(R_i, R_j)\big) = \frac{1}{N_i N_j} \sum_{n=1}^{N_i} \sum_{m=1}^{N_j} E_q\left( \frac{p(\sigma^{(i)}_n \mid R_i)}{q(\sigma^{(i)}_n \mid R_i)}\, \frac{p(\sigma^{(j)}_m \mid R_j)}{q(\sigma^{(j)}_m \mid R_j)}\, K(\sigma^{(i)}_n, \sigma^{(j)}_m) \right).

By linearity of expectation, since the samples are identically distributed, the expectation in the summand above reduces to

\sum_{\sigma \in R_i} \sum_{\sigma' \in R_j} K(\sigma, \sigma')\, p(\sigma \mid R_i)\, p(\sigma' \mid R_j),

as required.

C Variance of Kernel Monte Carlo estimator with i.i.d. samples

Proof The variance of the Kernel Monte Carlo estimator with uniform weights is the following:

\operatorname{Var}\big[\widehat{K}(R_i, R_j)\big] = \frac{1}{N_i^2 N_j^2} \operatorname{Var}\left[ \sum_{n=1}^{N_i} \sum_{m=1}^{N_j} K(\sigma^{(i)}_n, \sigma^{(j)}_m) \right]