Learning Mixtures of Permutations: Groups of Pairwise Comparisons and Combinatorial Method of Moments
aa r X i v : . [ m a t h . S T ] S e p Learning Mixtures of Permutations: Groups of PairwiseComparisons and Combinatorial Method of Moments
Cheng Mao and Yihong Wu ∗ September 16, 2020
Abstract
In applications such as rank aggregation, mixture models for permutations are frequentlyused when the population exhibits heterogeneity. In this work, we study the widely used Mallowsmixture model. In the high-dimensional setting, we propose a polynomial-time algorithm thatlearns a Mallows mixture of permutations on n elements with the optimal sample complexitythat is proportional to log n , improving upon previous results that scale polynomially with n .In the high-noise regime, we characterize the optimal dependency of the sample complexity onthe noise parameter. Both objectives are accomplished by first studying demixing permutationsunder a noiseless query model using groups of pairwise comparisons, which can be viewed asmoments of the mixing distribution, and then extending these results to the noisy Mallowsmodel by simulating the noiseless oracle. Contents ∗ C. Mao is with the School of Mathematics, Georgia Institute of Technology, Atlanta, GA, USA, [email protected] . Y. Wu is with Department of Statistics and Data Science, Yale University, New HavenCT, USA, [email protected] . Y. Wu is supported in part by the NSF Grant CCF-1900507, an NSF CAREERaward CCF-1651588, and an Alfred Sloan fellowship. Mallows mixture in high-dimensional regime 10
Introduction
Rank aggregation is the task that aims to combine different rankings on the same set of alternatives,to obtain a central ranking that best represents the population. The problem of rank aggregationhas been studied in social choice theory since Jean-Charles de Borda [Bor81] and Marquis deCondorcet [Con85] in the 18th century. More recently, due to the ubiquity of preference data, rankaggregation has found applications in a variety of areas, including web search, classification andrecommender systems [DKNS01, FKS03, LLQ +
07, BMR10, KCS17].In these practical applications, the population of interest is often heterogeneous in the sensethat different subpopulations have divided preferences over the alternatives. For example, multiplegroups of people may have different preferences for movies or electoral candidates [Mar95, GM08a].In such a scenario, rather than seeking a single central ranking, it is preferable to find a mixture ofrankings to represent the preferences of the population [JJ94, MM03, BOB07, GM08b, ABSV14,ZPX16, LM18, DOS18].
In this work, we adopt a statistical approach to the problem of heterogeneous rank aggregation.Let S n denote the set of permutations on [ n ] , { , . . . , n } . A ranking of n alternatives is describedby a permutation π ∈ S n . We refer to n as the size of a permutation. Furthermore, we modelthe preference of the population by a distribution on the set of permutations S n . Suppose that N independent permutations are generated from the distribution, each of which represents an observedranking.In this paper, we focus on the Mallows model M ( π, φ ) on S n , with central permutation π ∈ S n and noise parameter φ ∈ (0 ,
1) [Mal57]. In this model, the probability of generating a permutation σ ∈ S n is equal to Z ( φ ) φ d KT ( σ,π ) , where Z ( φ ) is a normalization factor, and d KT ( σ, π ) denotes the Kendall tau distance between permutations σ and π (see (8) and (9)). There have been decades ofwork studying theoretical properties and efficient learning algorithms for the Mallows model andits generalizations [FV86, DPR04, MPPB07, BM09, LB11, CPS13, BFHS14, BFFSZ19, ICL19].To model a heterogeneous population, we consider the Mallows mixture M , P ki =1 w i M ( π i , φ )with k components, where the i th component has central permutation π i ∈ S n , noise parameter φ ∈ (0 , w i ≥ γ for some γ >
0. Let us remark that, the number of components k ina mixture of permutations is typically a small quantity, so we let k be a fixed constant throughoutthis work. On the other hand, the size n of the permutations is large because it represents thenumber of alternatives.The Mallows mixture has also received considerable attention in recent years [MC10, LB14,ABSV14, CDKL15, LM18, DOS18]. More specifically, Chierichetti et al. [CDKL15] establishedthe identifiability of the Mallows mixture given sufficiently many permutations generated from M under mild conditions. The first polynomial-time algorithm to learn the Mallows mixturewith two components was proposed by Awasthi et al. [ABSV14], who particularly showed thatthe central permutations can be recovered exactly with high probability, when the sample size N exceeds poly ( n, φ (1 − φ ) , γ ). In the case of the Mallows k -mixture for any fixed constant k , Liuand Moitra [LM18] introduced a polynomial-time algorithm with sample complexity poly ( n, − φ , γ )that exactly recovers the central permutations with high probability. In general, different components may have different noise parameters φ i ∈ (0 , • First, we propose a polynomial-time algorithm that exactly recovers the central permutationsin the Mallows k -mixture with probability 1 − n − , provided that the sample size exceeds poly ( − φ , γ ) · log n . The logarithmic dependency on the size n of the permutations is a significantimprovement over the previous polynomial dependency, and is in fact optimal (see the remarkafter Corollary 3.5). • Second, to understand the precise dependency on the noise level − φ , we consider equally weightedMallows k -mixture in the high-noise regime of φ →
1. For fixed n and k , we show that the optimalsample complexity for recovering the central permutations is of the order ( − φ ) ⌊ log k ⌋ +2 . To motivate our main methodology based on pairwise comparisons , we briefly discuss why thesample complexity for learning the central permutation π in the single-component Mallows model M ( π, φ ) scales as log n . Mallows showed in his original paper [Mal57] that, for indices i, j ∈ [ n ]such that π ( i ) < π ( j ), P σ ∼ M ( π,φ ) { σ ( i ) < σ ( j ) } = π ( j ) − π ( i ) + 11 − φ π ( j ) − π ( i )+1 − π ( j ) − π ( i )1 − φ π ( j ) − π ( i ) ≥
12 + 1 − φ . In other words, the probability that a random permutation σ from M ( π, φ ) agrees with π on { i, j } is at least 1 / − φ . Therefore by Hoeffding’s inequality, given N i.i.d.random permutations from M ( π, φ ), a simple majority vote recovers { π ( i ) < π ( j ) } correctly withprobability at least 1 − e − c (1 − φ ) N for a constant c >
0. As a result, if N ≥ C log n (1 − φ ) for a constant C >
0, by a union bound, we readily obtain { π ( i ) < π ( j ) } for all distinct i, j ∈ [ n ] with highprobability, from which any comparison sort algorithm (such as quicksort or heapsort) can be usedto recover the central permutation π .Crucially, the size n of the permutations does not affect the sample complexity of learning eachpairwise comparison { π ( i ) < π ( j ) } . Instead, n enters the overall sample complexity only througha union bound of exponentially small probabilities, so that the dependency on n is logarithmic. Infact, this high-level strategy generalizes to the case of learning the Mallows k -mixture. However,the caveat is that pairwise comparisons alone are no longer sufficient for identifying a mixture ofpermutations; as such, we need to consider groups of pairwise comparisons . This framework ofdemixing permutations using groups of pairwise comparisons is rigorously developed in Section 2under a noiseless oracle model, which may be of independent interest. Later in Section 3, weextend these results to the noisy case by simulating the noiseless oracle using logarithmically manyobservations drawn from the Mallows mixture model. In the high-noise regime where φ →
1, the sample complexity ( − φ ) ⌊ log k ⌋ +2 for learning theMallows k -mixture is achieved by a method of moments of combinatorial flavor, which we nowexplain informally. For a distribution on the set S n of permutations, it is not obvious how to definean appropriate notion of moments. We show in Section 2.2 that, in fact, it is natural to view theset of all groups of m pairwise comparisons as the m th-order moment of the mixing distribution4 ki =1 w i δ π i associated with the Mallows mixture M = P ki =1 w i M ( π i , φ ). Moreover, the exponent of − φ in the optimal sample complexity is precisely determined by the maximum number of momentstwo distinct mixtures can match. Namely, there exist two distinct k -mixtures with the same first ⌊ log k ⌋ moments, but any k -mixture can be identified from the first ⌊ log k ⌋ + 1 moments, givingrise to the optimal sample complexity ( − φ ) ⌊ log k ⌋ +2 . From this perspective, learning a Mallowsmixture from groups of pairwise comparisons can be viewed as a combinatorial method of moments.Furthermore, we draw a comparison between the Mallows mixture and the better-studied Gaus-sian mixture [Pea94]. Specifically, consider the k -component n -dimensional Gaussian location mix-ture P ki =1 w i N ( µ i , I n ), where n and k are both fixed constants. It is known [MV10, HK18, WY20,DWYZ20] that the sharp sample complexity of learning the mixing distribution P ki =1 w i δ µ i up toan error ε in the Wasserstein W -distance is of order ε k − , which can be achieved by a version ofthe method of moments. In contrast to the exponential growth of the sample complexity in theGaussian mixture model, for Mallows mixtures the optimal sample complexity scales polynomially with the number of components, thanks to the discrete nature of permutations. It is worth mentioning that the identifiability of the Mallows mixture model is related to a result ofZagier in mathematical physics [Zag92]. In [Zag92, Theorem 2], Zagier computed the determinantof the matrix A ( φ ) ∈ R n ! × n ! indexed by permutations in S n and defined by A ( φ ) π,σ , φ d KT ( π,σ ) . (1)This is an instance of the group determinant associated with the symmetric group S n ; see Sec-tion 6.10 for details. In particular, Zagier showed thatdet( A ( φ )) = 0 , for all φ ∈ (0 , . (2)Note that, up to the normalization factor 1 /Z ( φ ), the row of A ( φ ) indexed by π is precisely theprobability mass function (PMF) of the Mallows model M ( π, φ ). Moreover, any number of rowsof A ( φ ) are linearly independent since the determinant of A ( φ ) is nonzero. Therefore, as long as k ≤ n ! /
2, if two Mallows mixtures P ki =1 w i M ( π i , φ ) and P ki =1 w ′ i M ( π ′ i , φ ) are identical, then thetwo sets of central permutations must coincide, and so do the corresponding weights. Therefore,Zagier’s result implies the identifiability of the Mallows mixture.However, in the finite-sample setting, as noted by Liu and Moitra [LM18], the direct quantitativeimplication of [Zag92] is very weak, as it only guarantees a sample complexity that is exponentialin n for learning the mixture. While the sample complexity is reduced to a polynomial in n in[LM18], in this paper we take a step further to achieve the optimal logarithmic sample complexity.As in [LM18], we also use Zagier’s result as a building block; see Lemma 6.4.Furthermore, we remark that another group determinant (defined in (43) which is a variantof the one studied in [Zag92, Section 3]) appears naturally in one of our technical proofs. SeeSection 6.10 for details. The remainder of the paper is organized as follows. In Section 2, we define groups of pairwisecomparisons and interpret them as moments of a mixture. Moreover, we study learning a mixture5f permutations from groups of pairwise comparisons under a generic, noiseless model. Extendingthese results to the noisy case, in Section 3, we consider the Mallows mixture and present analgorithm that achieves the sample complexity logarithmic in the size of the permutations. InSection 4, we study the sample complexity of learning the Mallows mixture in the high-noiseregime. Section 5 discusses potential extensions of our results and proof techniques. All proofs arepresented in Section 6.
General notation
Let [ n ] , { , . . . , n } and N , { , , . . . } . For n, k ∈ N , we use poly k ( n ) todenote a quantity C ( k ) · n C ( k ) where C ( k ) is a sufficiently large positive constant that may dependon k . Similarly, for n, m, k, ℓ ∈ N , let poly k,ℓ ( n, m ) denote a quantity C ( k, ℓ ) · ( nm ) C ( k,ℓ ) where C ( k, ℓ ) is a sufficiently large positive constant that may depend on k and ℓ . Let TV ( P , Q ) standfor the total variation distance between two probability distributions P and Q . Permutation and restriction
Let S n denote the set of permutations on [ n ]. When presentingconcrete instances of permutations, we use the notation π = ( π − (1) , π − (2) , · · · , π − ( n )), so thatwhen π is understood as a ranking, π − ( i ) is the element that is ranked in the i th place by π . Forexample, (3 , , ,
1) denotes the permutation π with π (3) = 1, π (2) = 2, π (4) = 3 and π (1) = 4.For a permutation π ∈ S n and a subset J ⊂ [ n ], we use the notation π ( J ) , { π ( j ) : j ∈ J } .We let π | J denote the restriction of π on J , which is an injection from J to [ n ]. Moreover, let π k J denote the bijection from J to [ | J | ] induced by π | J . That is, if σ is the increasing bijection from π ( J ) to [ | J | ], then π k J = σ ◦ π | J .For example, consider π = (3 , , , , ,
5) and J = { , , } . Then π | J (1) = 5, π | J (4) = 3 and π | J (5) = 6, while π k J (1) = 2, π k J (4) = 1 and π k J (5) = 3. We also write π k J = (4 , , π = (3 , , , , ,
5) by retaining only the elements of J .Note that π k J can be viewed as a total order on J . Moreover, by identifying the elements of J with 1 , . . . , | J | in the ascending order, we can identify bijections from J to [ | J | ] with permutationsin S | J | . Hence π k J can be equivalently understood as a permutation in S | J | . We may therefore referto π k J informally as a permutation or a relative order on J . Moreover, for nested sets J ⊂ J ′ ⊂ [ n ],we clearly have ( π k J ′ ) k J = π k J . In this section, we set up a general approach to learning mixtures of permutations: We first formalizethe notions of groups of pairwise comparisons and comparison moments, and then characterize whena mixture of permutations can be learned from groups of pairwise comparisons in a generic noiselessmodel.
Let M denote a distribution on S n . In this work, we are interested in the situation where M is acertain model for a mixture of permutations. To motivate the method of learning the mixture M from groups of pairwise comparisons, let us first consider some simple examples:6 If M is the Dirac delta measure δ π for a fixed permutation π ∈ S n , we are tasked with learning thesingle permutation π . Let us consider the pairwise comparison oracle : Given any pair of distinctindices ( i, j ) ∈ [ n ] , the oracle returns whether i is placed before j by π , that is, { π ( i ) < π ( j ) } .Based on this oracle, any comparison sorting algorithm (for example, quicksort) can be deployedto learn π . • For a general distribution M , the pairwise comparison oracle naturally extends to the following:Given any pair of distinct indices ( i, j ) ∈ [ n ] , the oracle returns the distribution of { π ( i ) < π ( j ) } where π ∼ M .However, as pointed out by Awasthi et al. [ABSV14], even for the noiseless 2-mixture M = ( δ π + δ π ), the pairwise comparison oracle is not sufficient for identifying M . For example, ifthe permutations π and π are reversals of each other, then for any pair of distinct indices ( i, j ),the output of the pairwise comparison oracle is always Bernoulli( ), which is uninformative. • Now that comparing one pair of indices at a time does not guarantee identifiability, how aboutcomparing two pairs simultaneously? This motivates the following oracle that returns a group oftwo pairwise comparisons: Given pairs of distinct indices ( i , j ) , ( i , j ) ∈ [ n ] , the oracle returnsthe distribution of (cid:18) { π ( i ) < π ( j ) } { π ( i ) < π ( j ) } (cid:19) , where π ∼ M . To illustrate why groups of two pairwise comparisons are sufficient for identifying a mixture oftwo permutations, we consider a mixture M = ( δ π + δ π ) where the two permutations satisfy π (1) < π (2) and π (1) > π (2). When we make a query on the group of pairs (1 , , ( i, j ) forany distinct indices i, j ∈ [4], the oracle returns the mixture of two delta measures at (cid:18) { π ( i ) < π ( j ) } (cid:19) and (cid:18) { π ( i ) < π ( j ) } (cid:19) respectively. Therefore, using the pair (1 ,
2) as a signature for the two permutations in themixture, we can demix the pairwise comparisons { π ( i ) < π ( j ) } and { π ( i ) < π ( j ) } forevery pair of indices ( i, j ), from which π and π can be recovered.It turns out that this argument can be made rigorous and extended to the case of a general k -mixtures (Theorem 2.6).Given these considerations, we are ready to formally define a group of pairwise comparisons. Definition 2.1 (Group of m pairwise comparisons, the (strong) oracle) . Consider a distribution M on S n and a random permutation π ∼ M . For m ∈ N , let I be the tuple of m pairs of distinctindices ( i , j ) , . . . , ( i m , j m ) ∈ [ n ] . Upon a query on I , the (strong) oracle of group of m pairwisecomparisons returns the distribution of the random vector χ ( π, I ) in { , } m , whose r th coordinateis defined by χ ( π, I ) r , { π ( i r ) < π ( j r ) } for r ∈ [ m ] . (3)We emphasize that in the tuple I of pairs of distinct indices, i r and j r are required to be distinctfor each r ∈ [ m ], but we allow i = i or i = j , for example. Moreover, throughout this work,the queries we consider are adaptive : Our algorithms make queries to the oracle in a sequentialfashion, where a given query is allowed to depend on the outcomes of previous ones.7n addition, we introduce a weaker oracle of group of pairwise comparisons. This definition ismotivated by interpreting a “mixture” as a set of permutations in S n , rather than a distribution. Definition 2.2 (Group of m pairwise comparisons, the weak oracle) . Consider a set { π , . . . , π k } of k permutations in S n . For m ∈ N , let I be a tuple of m pairs of distinct indices in [ n ] . Upon aquery on I , the weak oracle of group of m pairwise comparisons returns the set of binary vectors { χ ( π i , I ) : i ∈ [ k ] } , where χ ( π i , I ) is defined by (3) . If M is a distribution on S n supported on { π , . . . , π k } , then the set { χ ( π i , I ) : i ∈ [ k ] } returnedby Definition 2.2 is simply the support of the random vector χ ( π, I ) returned by Definition 2.1. Inthis sense, the oracle in Definition 2.2 is weaker. If | supp( M ) | = k , then the strong and the weakoracle are equivalent; otherwise the weak oracle is strictly less informative. In the special case of k = 2, they are always equivalent. We emphasize that the weak oracle only returns { χ ( π i , I ) : i ∈ [ k ] } as a collection of (possibly less than k ) distinct, unlabeled elements—it does not specifywhat each χ ( π i , I ) is. This weaker notion will be useful later when we study noisy mixtures ofpermutations.Besides groups of pairwise comparisons, it is also natural to consider ℓ -wise comparisons, whosestrong and weak versions are defined as follows. Recall the notation π k J for relative order as definedSection 1.6. Definition 2.3 ( ℓ -wise comparison, the (strong) oracle) . Consider a distribution M on S n and arandom permutation π ∼ M . For ℓ ∈ N , let J be a subset of [ n ] of cardinality | J | = ℓ . Upon a queryon J , the (strong) oracle of ℓ -wise comparison returns the distribution of the relative order π k J . Definition 2.4 ( ℓ -wise comparison, the weak oracle) . Consider a set { π , . . . , π k } of k permutationsin S n . For ℓ ∈ N , let J be a subset of [ n ] of cardinality | J | = ℓ . Upon a query on J , the weak oracleof ℓ -wise comparison returns the set of relative orders { π i k J : i ∈ [ k ] } . 
For ℓ = 2, the oracle of ℓ -wise comparison simply reduces to the pairwise comparison oracle.Moreover, for ℓ = 2 m , the (strong or weak) oracle of ℓ -wise comparison is stronger than thecorresponding oracle of group of m pairwise comparisons. This is because for any tuple I of m pairs of indices in [ n ], we can choose J ⊂ [ n ] with | J | = ℓ = 2 m that contains all indices appearingin I . Then, for any permutation π i , we can obtain the binary vector χ ( π i , I ) from the relativeorder π i k J . We now interpret groups of pairwise comparisons in Definition 2.1 as moments of the randompermutation π . Toward this end, we adopt the following notation throughout this paper. For any(random) permutation π in S n and a pair of distinct indices ( i, j ) ∈ [ n ] , we define X πi,j , { π ( i ) < π ( j ) } . (4)In this work, we frequently identify the permutation π with the array X π = { X πi,j } i = j . There iscertainly redundancy in X π as we lift π ∈ S n to X π ∈ { , } n − n . For example, X πi,j + X πj,i = 1,and if X πi,j = 1 and X πj,k = 1, then we must have X πi,k = 1.In Definition 2.1, consider the oracle that returns the distribution of χ ( π, I ) in the form of itsPMF: f χ ( π, I ) ( v ) , P { χ ( π, I ) = v } for each v ∈ { , } m . m ∈ { , } m , f χ ( π, I ) ( m ) = E [ { χ ( π, I ) = m } ] = E h m Y r =1 { π ( i r ) < π ( j r ) } i = E h m Y r =1 X πi r ,j r i , which is an m th moment of X π . This motivates the following definition. Definition 2.5 (Comparison moment) . Consider a distribution M on S n . For a random permuta-tion π ∼ M , let X π be defined by (4) . For m ∈ N , let I denote the tuple of m pairs of distinct indices ( i , j ) , . . . , ( i m , j m ) ∈ [ n ] . The comparison moment of π with index I is the vector m ( π, I ) ∈ R m ,defined by m ( π, I ) v , E h m Y r =1 (cid:0) X πi r ,j r (cid:1) v r (cid:0) − X πi r ,j r (cid:1) − v r i for v ∈ { , } m . (5)Note that the comparison moment defined above is of order at most m in the usual sense, as (cid:0) X πi r ,j r (cid:1) v r (cid:0) − X πi r ,j r (cid:1) − v r = ( X πi r ,j r if v r = 1 ,X πj r ,i r if v r = 0 . (6)Moreover, by (3), (4) and (5), we see that the PMF of the random vector χ ( π, I ) is precisely thecomparison moment m ( π, I ) as f χ ( π,I ) ( v ) = E [ { χ ( π, I ) = v } ] = E h m Y r =1 (cid:8) X πi r ,j r = v r (cid:9)i = E h m Y r =1 (cid:0) X πi r ,j r (cid:1) v r (cid:0) − X πi r ,j r (cid:1) − v r i = m ( π, I ) v . As a result, the group of pairwise comparisons on I can be equivalently defined as the oraclethat returns the comparison moment m ( π, I ). Learning a mixture of permutations from groups ofpairwise comparisons can therefore be viewed a combinatorial method of moments. With the above definitions formulated, we are ready to study demixing permutations with groupsof pairwise comparisons or ℓ -wise comparisons. In this section, we consider the following genericnoiseless model for a mixture of k permutations: M , k X i =1 w i δ π i , where π , . . . π k are permutations in S n and w , . . . , w k are nonnegative weights that sum to one.It is clear that the more pairs we compare in a group, the more information we obtain. In otherwords, the larger m is, the stronger the oracle in Definition 2.1 becomes. Similarly, the larger ℓ is,the stronger the oracle in Definition 2.3 becomes. Is there a polynomial-time algorithm that learnsthe k -mixture M from a polynomial number of groups of m pairwise comparisons for any large n ,where m only depends on k but not on n ? 
Furthermore, for a fixed k , what is the weakest oraclewe can assume, that is, what is the smallest m , so that such an algorithm exists? The analogousquestions can also be asked for the oracle of ℓ -wise comparison. As the main result of this section,the following theorem answers these questions. 9 heorem 2.6. Let k be a positive integer, and define m ∗ k , ⌊ log k ⌋ + 1 . (7) (a) For any mixture M = P ki =1 w i δ π i of permutations in S n , there is a poly ( n, k ) -time algorithmthat recovers M from groups of m ∗ k pairwise comparisons, with at most k ( n − n + 1) adaptive queries to the weak oracle.(b) Conversely, for n ≥ m ∗ k and ℓ ≤ m ∗ k − , there exist distinct mixtures M = k P ki =1 δ π i and M ′ = k P ki =1 δ π ′ i of permutations in S n , which cannot be distinguished even if all (cid:0) nℓ (cid:1) ℓ -wisecomparisons are queried from the strong oracle. As we have noted, if ℓ ≥ m , then the oracle of ℓ -wise comparison is stronger than the oracle ofgroup of m pairwise comparisons. Therefore, the above theorem implies: (1) The oracle of groupof m pairwise comparisons is sufficient for identifying the k mixture if and only if m ≥ m ∗ k ; (2) Theoracle of ℓ -wise comparison is sufficient for identifying the k mixture if and only if ℓ ≥ m ∗ k .In addition to the above theorem which studies the permutation demixing problem assuming thestrong oracles, we also have the following result that assumes the weak oracle given by Definition 2.2.Recall that here we view the mixture as a set of permutations rather than a distribution. Theorem 2.7.
Consider a set { π , . . . , π k } of k permutations in S n . There is a poly ( n, k ) -timealgorithm that learns the set { π , . . . , π k } from groups of k + 1 pairwise comparisons in the senseof Definition 2.2, with at most k ( n − n + 3) adaptive queries. Unlike Theorem 2.6, here the smallest m for the weak oracle is not precisely characterized.Nevertheless, the crucial observation is that m again only depends on k , the number of components,but not on n , the size of the permutations.While interesting in their own right, the above results have laid the foundation for studyingthe Mallows mixture in the next two sections. On the one hand, Theorem 2.7 provides a “meta-algorithm” for learning the central permutations, so it suffices to simulate the weak oracle usingsample from the Mallows mixture, which we do in Section 3. On the other hand, Theorem 2.6 shedslight on the fundamental limit of learning mixtures of permutations, which we further explore inSection 4 for the Mallows mixture in the high-noise regime. Moving from the noiseless to the noisy case, we now turn to the popular Mallows mixture model.Denote the Kendall tau distance between two permutations π, σ ∈ S n by d KT ( π, σ ) , X i,j ∈ [ n ] { π ( i ) < π ( j ) , σ ( i ) > σ ( j ) } . (8)For a central permutation π ∈ S n and a noise parameter φ ∈ (0 , M ( π, φ ) is the distribution on S n with PMF f M ( π,φ ) ( σ ) = φ d KT ( σ,π ) Z ( φ ) for σ ∈ S n , where Z ( φ ) , X σ ∈S n φ d KT ( σ, id ) . (9)10ote that φ determines the noise level of the Mallows model . As φ → M ( π, φ ) converges tothe noiseless model, a delta measure at π . On the other hand, as φ → M ( π, φ ) converges to thenoisiest model, the uniform distribution on S n .In this work, we consider a mixture M of k Mallows models M ( π , φ ) , . . . , M ( π k , φ ) with acommon noise parameter φ ∈ (0 ,
1) and respective weights w , . . . , w k > P ki =1 w i = 1.In other words, M is the distribution on S n with PMF f M ( σ ) = k X i =1 w i φ d KT ( σ,π i ) Z ( φ ) for σ ∈ S n . We also write M ( π i ) ≡ M ( π i , φ ) and M = P ki =1 w i M ( π i ) for brevity. In the special case of φ = 0, M reduces to the noiseless model P ki =1 w i δ π i considered in Section 2.Suppose that we are given N i.i.d. observations σ , . . . , σ N from the mixture M . Let M N , N P Ni =1 δ σ i denote the empirical distribution with PMF f M N ( σ ) = 1 N N X i =1 { σ i = σ } for σ ∈ S n . Assuming that the number of components k and the noise parameter φ are known , we aim toexactly recover the set of central permutations { π , . . . , π k } in the mixture.In this section, we consider the “high-dimensional” setting where the size n of the permutationsis large, and establish the logarithmic dependency of the sample complexity on n . As hintedpreviously, our strategy is to use the algorithm from Theorem 2.7 (noiseless) as a meta-algorithm,to recover the central permutations of the Mallows mixture. For this, we need to simulate the weakoracle in Definition 2.2 using noisy observations from the Mallows mixture. Furthermore, recallthat the weak oracle of ℓ -wise comparison in Definition 2.4 is stronger than the weak oracle of groupof m pairwise comparisons in Definition 2.2, provided that ℓ ≥ m . Therefore, a main goal of thissection is to introduce a subroutine, denoted by SubOrder ( J ), which simulates the weak oracle inDefinition 2.4 using logarithmically many observations from the Mallows mixture. Given i.i.d. observations σ , . . . , σ N from M = P ki =1 w i M ( π i ), the goal of SubOrder ( J ) is to learnthe set of relative orders π k J , . . . , π k k J for a given subset J ⊂ [ n ]. We reiterate that the relativeorder π i k J is the bijection from J to [ | J | ] induced by π i | J ; we are not aiming at recovering π i | J itself.Toward this end, we consider the marginalization of the Mallows mixture, as well as the obser-vations, as follows. For any distribution M on S n and a set of indices J ⊂ [ n ], we let M| J denotethe marginal distribution of σ | J where σ ∼ M . That is, the PMF of M| J is given by f M| J ( ρ ) = P σ ∼M (cid:8) σ | J = ρ (cid:9) (10) In fact, it is also common [MPPB07, BM09, ICL19] to parametrize the noise level by β = 1 / log(1 /φ ) so that φ = e − /β . Particularly, we have β ≈ − φ → ∞ as φ → We assume the knowledge of φ for technical convenience. In principle, this assumption can be removed, whichwe discuss in Section 5. ρ : J → [ n ]. Moreover, given N i.i.d. observations σ , . . . , σ N from M , theempirical version of (10) is given by f M N | J ( ρ ) = 1 N N X m =1 (cid:8) σ m | J = ρ (cid:9) . (11)Note that although our goal is to learn the relative order π i k J : J → [ | J | ] for i ∈ [ k ], not theactual values of π i ( j ) for j ∈ J , the marginalization is with respect to the restriction on J only, anddoes maintain the values of σ ( j ) for j ∈ J . This is crucial to establishing the following identifiabilityresult for marginalized Mallows mixtures. Proposition 3.1.
Consider Mallows mixtures M = P ki =1 w i M ( π i ) and M ′ = P ki =1 w ′ i M ( π ′ i ) on S n with a common noise parameter φ ∈ (0 , . Let γ , min i ∈ [ k ] ( w i ∧ w ′ i ) > . Fix a set of indices J ⊂ [ n ] and let ℓ , | J | . Suppose that the two sets of central permutations { π k J , . . . , π k k J } and { π ′ k J , . . . , π ′ k k J } are not equal (as sets). Then TV ( M| J , M ′ | J ) ≥ η ( k, ℓ, φ, γ ) , (cid:16) γ k (cid:17) (3 ℓ ) ℓ +1 (cid:16) − φℓ (cid:17) (4 ℓ ) ℓ +2 kℓ . (12)Crucially, the above lower bound is dimension-free which does not depend on n . This is one ofthe two key ingredients (the other being the concentration inequality in Proposition 3.3 below) thatenable us to achieve a sample complexity that ultimately depends logarithmically on n . The proofof Proposition 3.1 leverages the notion of block structure introduced by Liu and Moitra [LM18]; seeSection 6.5 for details.In addition, we observe a useful property of marginalized Mallows models. Lemma 3.2.
For any subset J ⊂ [ n ] , if the central permutations π, π ′ ∈ S n satisfy π | J = π ′ | J , thenthe marginalized Mallows models M ( π, φ ) | J and M ( π ′ , φ ) | J coincide for all φ ∈ (0 , .Proof. Let τ ∈ S n be a relabeling of indices 1 , . . . , n such that π ′ = π ◦ τ . Since π | J = π ′ | J , we have τ ( j ) = j for every j ∈ J . It follows that ( σ ◦ τ ) | J = σ | J for any σ ∈ S n . Moreover, it holds that d KT ( σ ◦ τ, π ′ ) = d KT ( σ ◦ τ, π ◦ τ ) = d KT ( σ, π ) by the right invariance of the Kendall-tau distance. Inview of the definition of the Mallows model and marginalization on J , we reach the conclusion.Proposition 3.1 and Lemma 3.2 together motivate the subroutine introduced in the sequel. We are ready to define
SubOrder ( J ) formally. The first step is to define a set of polynomially manycandidate models. Let S n,J denote the set of injections ρ : J → [ n ], which has cardinality at most n ℓ where ℓ = | J | . For each ρ ∈ S n,J , fix an arbitrary permutation π ρ in S n such that π ρ | J = ρ . Let L be a positive integer to be determined later. For φ ∈ (0 ,
1) and γ ∈ (0 , /k ], we define a set ofMallows mixtures by discretizing the weights M ≡ M ( n, k, φ, γ, J, L ) , (cid:26) k X i =1 r i L M ( π ρ i , φ ) : ρ i ∈ S n,J , r i ∈ [ L ] , r i ≥ γL, k X i =1 r i = L (cid:27) . (13)Note that the weights r i /L sum to 1 and each weight is at least γ . Since there are at most L choicesfor each weight and at most |S n,J | ≤ n ℓ choices for each ρ i , we have | M | ≤ L k n kℓ .12n view of the total variation lower bound in Proposition 3.1, it is natural to consider theminimum-distance estimator that selects the Mallows mixture model in M whose marginal is closestin total variation to that of the empirical distribution M N ; however, without an explicit formulafor the marginalized distribution M ′ | J for M ′ ∈ M it is difficult to directly compute the totalvariation. Fortunately, we can efficiently sample from M ′ | J and thus approximate the marginalizeddistribution sufficiently well in polynomial time. This motivates the following algorithm. SubOrder ( J ): Given observations σ , . . . , σ N ∈ S n , a subset J ⊂ [ n ], ℓ , | J | , and parameters k ∈ N , φ ∈ (0 , γ ∈ (0 , /k ], N ′ ∈ N , and L = ⌈ k/η ⌉ where η = η ( k, ℓ, φ, γ ) is defined in (12), do thefollowing:1. For each Mallows mixture M ′ ∈ M where M is defined in (13), generate N ′ i.i.d. randompermutations σ ′ , . . . , σ ′ N ′ from M ′ . Compute the marginalized empirical distribution M ′ N ′ | J = N ′ P N ′ m =1 δ σ ′ m | J .2. If for some M ′ = P ki =1 r i L M ( π ρ i , φ ) ∈ M and M N | J defined in (11), it holds that TV ( M ′ N ′ | J , M N | J ) ≤ η/ , (14)then return the set of relative orders { π ρ i k J : i ∈ [ k ] } . If there are multiple models M ′ in M satisfying (14), an arbitrary M ′ is chosen. If no models in M satisfy this condition, then return“error”.Let us remark that sampling one observation from the Mallows model can be done in O ( n )time (see, for example, [LB14]), so the computation of M ′ N ′ | J takes polynomial time in n and N ′ (which is logarithmic in n in the sequel). Moreover, since M N | J and M ′ N ′ | J are distributions withat most N and N ′ atoms respectively, computation of TV ( M ′ N ′ | J , M N | J ) is also polynomial-time.As a result, SubOrder ( J ) runs in polynomial time when k and ℓ are constants.To analyze this subroutine, we first state a concentration inequality for the marginalized em-pirical distribution for the Mallows mixture. Proposition 3.3.
For J ⊂ [ n ] , let M| J and M N | J be the marginalized Mallows mixture and themarginalized empirical distribution defined by (10) and (11) respectively. For any s ∈ (0 , , P (cid:8) TV ( M| J , M N | J ) > s (cid:9) ≤ exp (cid:16) − N s (cid:17) + 2(2 kq ) ℓ exp (cid:16) − N s (2 kq ) ℓ (cid:17) (15) where ℓ , | J | and q , − φ log ℓs (1 − φ ) . Similar to the total variation lower bound in Proposition 3.1, the above concentration inequalityis also dimension-free (independent of n ). This is possible because although M| J is a distributionon Θ( n ℓ ) elements, its “effective support size” is independent of n thanks to a basic property of theMallows model (Lemma 6.3). Propositions 3.1 and 3.3 together enable us to establish the followingtheoretical guarantee for SubOrder ( J ). Theorem 3.4.
Given i.i.d. observations σ , . . . , σ N from the Mallows mixture M = P ki =1 w i M ( π i ) on S n with a noise parameter φ ∈ (0 , . Fix a set of indices J ⊂ [ n ] and let ℓ , | J | . Fix γ > such that γ ≤ min i ∈ [ k ] w i . Fix a probability of error δ ∈ (0 , . If the sample size satisfies N ≥ poly k,ℓ ( − φ , γ ) · log δ and we choose an integer N ′ ≥ poly k,ℓ ( − φ , γ ) · log nδ , then SubOrder ( J ) returns the set of relative orders { π i k J : i ∈ [ k ] } with probability at least − δ . .3 Exact recovery of the central permutations Consider a set of indices J ⊂ [ n ] and a tuple I of pairs of distinct indices ( i , j ) , . . . , ( i m , j m ) ∈ J .For any permutation π ∈ S n , we have { π k J ( i r ) < π k J ( j r ) } = { π ( i r ) < π ( j r ) } for r ∈ [ m ]by the definition of the relative order π k J . Since SubOrder ( J ) returns the set of relative orders { π i k J : i ∈ [ k ] } with high probability, in particular, we can obtain the set of binary vectors { χ ( π i , I ) : i ∈ [ k ] } , where χ ( π i , I ) is defined by (3).Recall that the set { χ ( π i , I ) : i ∈ [ k ] } is precisely what we assume the weak oracle in Defini-tion 2.2 returns. Therefore, this oracle is indeed available with high probability for the Mallowsmixture, provided that the sample size N is sufficiently large. Consequently, the algorithm in The-orem 2.7 can be used as a meta-algorithm to recover the central permutations { π i : i ∈ [ k ] } in theMallows mixture. This argument yields the following main result. Corollary 3.5.
Given N i.i.d. observations from the Mallows mixture M = P ki =1 w i M ( π i ) on S n with a known noise parameter φ ∈ (0 , . Suppose we are given γ > such that γ ≤ min i ∈ [ k ] w i .Then there exists a poly k ( n, − φ , γ ) -time algorithm that exactly recovers the set of central permu-tations { π , . . . , π k } with probability at least − n − , provided that N ≥ poly k ( − φ , γ ) · log n .Proof. By the argument before the corollary, it suffices to guarantee that
SubOrder ( J ) succeeds onevery relevant subset J ⊂ [ n ]. Recall that Theorem 2.7 requires the tuple I in Definition 2.2 toconsist of m = k + 1 pairs of indices. Hence we can choose J of cardinality ℓ = 2 k + 2, so that J contains all the indices in the pairs in I . Since there are less than n k +2 possible subsets of [ n ]that have cardinality 2 k + 2, we can set δ = n − k − in Theorem 3.4 and take a union bound tocomplete the proof .As for the computational complexity, observe that both the algorithm from Theorem 2.7 and SubOrder ( J ) are polynomial in n , − φ and γ when k is a constant.We remark that the logarithmic dependency of the sample complexity N on the size n of thepermutations is optimal, even in the case k = 1 where we aim to learn a single central permutationin the Mallows model. More precisely, the proof of Lemma 10 of [BFFSZ19] established the fol-lowing information-theoretic lower bound: Given N random observations from the Mallows model M ( π, /
2) on S n , if N ≤ c log n for a sufficiently small constant c >
0, then any algorithm fails toexactly recover the central permutation π with a constant probability. Once the central permutations in the Mallows mixture are recovered exactly according to Corol-lary 3.5, the corresponding weights can be learned as well. To see the identifiability of the weights,we first establish a total variation bound for two Mallows mixtures with the same set of centralpermutations but different weights.
Proposition 3.6.
Consider Mallows mixtures M = P ki =1 w i M ( π i ) and M ′ = P ki =1 w ′ i M ( π i ) on S n with a common noise parameter φ ∈ (0 , . Suppose that ξ , max i ∈ [ k ] | w i − w ′ i | > . Let J Although Theorem 2.7 only makes at most 1 + k ( n − n + 3) queries to the oracle of Definition 2.2, we needto take the union bound over all possible subsets of [ n ] that have cardinality 2 k + 2, because the queries are madeadaptively. e a subset of [ n ] such that π i k J = π j k J for any distinct i, j ∈ [ k ] . Define ℓ , | J | and define η ( k/ , ℓ, φ, as in (12) . Then we have TV ( M| J , M ′ | J ) ≥ ξ · η ( k/ , ℓ, φ, . The estimator of the weights can be defined as follows. Let ˆ π , . . . , ˆ π k denote the centralpermutations returned by the algorithm in Corollary 3.5. By Lemma 6.2, we can find in polynomialtime a tuple I of k − n ] such that χ (ˆ π i , I ) = χ (ˆ π j , I ) for any distinct i, j ∈ [ k ]. Hence, if we take J to be the subset of [ n ] consisting of all indices appearing in the pairsin I , then ℓ = | J | ≤ k − π i k J = ˆ π j k J for any distinct i, j ∈ [ k ].Next, fix positive integers L and N ′ (to be determined in the proof of Theorem 3.7). Define aset of integer-valued vectors R ( L ) , (cid:26) r ∈ [ L ] k : r i ≥ γL, k X i =1 r i = L (cid:27) . For each r ∈ R ( L ), we generate N ′ i.i.d. random permutations σ ′ , . . . , σ ′ N ′ from the Mallows mixture M ′ ( r ) = P ki =1 r i L M (ˆ π i , φ ) . Then we compute the marginalized empirical distribution M ′ N ′ ( r ) | J ofthe generated sample. Finally, we define an estimator ˆ w ∈ R k of the weights byˆ w = 1 L argmin r ∈R ( L ) TV (cid:0) M ′ N ′ ( r ) | J , M N | J (cid:1) . (16)The following theorem concludes this section, which in particular bounds the error for each ˆ w i . Theorem 3.7.
Given N i.i.d. observations from the Mallows mixture M = P ki =1 w i M ( π i ) on S n with distinct central permutations π , . . . , π k and a known noise parameter φ ∈ (0 , . Supposewe are given γ > such that γ ≤ min i ∈ [ k ] w i . If N ≥ poly k ( − φ , γ ) · log n , then there exists a poly k ( n, − φ , γ ) -time algorithm which returns a mixture c M = P ki =1 ˆ w i M (ˆ π i ) such that the followingholds with probability at least − n − : Up to a relabeling, we have ˆ π i = π i and | ˆ w i − w i | ≤ N − / (log N ) k − (log n ) / poly k ( − φ ) for each i ∈ [ k ] . We turn to study the sample complexity for learning the Mallows mixture in the high-noise regime.For simplicity, we focus on the equally-weighted case. For a Mallows model on S n with noiseparameter φ ∈ (0 , ε , − φ and consider the high-noise regime where n is fixed and ε →
0, as which the Mallows model converges to the uniform distribution on S n . We are interestedin how the sample complexity scales with 1 /ε .More formally, let M ∗ denote the collection of k -mixtures of Mallows models on S n with equalweights and a common noise parameter φ ∈ (0 , M ∗ ≡ M ∗ ( n, k, φ ) , n k k X i =1 M ( π i , φ ) : π , . . . , π k ∈ S n o . (17)Some results in this section can be generalized to mixtures with different weights. However, wefocus on the case of equally weighted mixtures to ease the notation, which already includes all themain ideas. The following result characterizes the total variation distance between two Mallowsmixtures in the high-noise regime up to constant factors.15 heorem 4.1. For m ∗ k defined by (7), the following statements hold as ε = 1 − φ → :(a) Suppose that k ≤ . For any distinct Mallows mixtures M and M ′ in M ∗ , we have TV ( M , M ′ ) = Ω( ε m ∗ k ) . (b) On the other hand, for n ≥ m ∗ k , there exist distinct Mallows mixtures M and M ′ in M ∗ forwhich TV ( M , M ′ ) = O ( ε m ∗ k ) . The hidden constants in Ω( · ) and O ( · ) above may depend on n and k . The key to proving the above theorem is to view groups of pairwise comparisons as moments andrelate them to the total variation distance between two Mallows mixtures. After establishing thislink, the upper and lower bounds follow naturally from the two parts of Theorem 2.6 respectively.Note that there is a condition k ≤
255 in part (a) of the above theorem. This is purely a technicalassumption used in one step of the proof. We conjecture that the same result holds without thisrestriction on the number of components k in the mixture. See Section 6.10 for details.Theorem 4.1 characterizes the precise exponent of ε in the total variation distance between twoMallows mixtures. From this, we easily obtain matching upper and lower bounds of order 1 /ε m ∗ k on the optimal sample complexity for learning a Mallows k -mixture in the high-noise regime. Corollary 4.2.
Suppose that for a Mallows mixture
M ∈ M ∗ , we are given i.i.d. observations σ , . . . , σ N ∼ M , and let P M denote the associated probability. We let ε , − φ and consider thesetting where n is fixed and ε → . For m ∗ k defined by (7), the following statements hold:(a) Suppose that k ≤ , and that k and φ are known. Let M N denote the empirical distributionof σ , . . . , σ N with PMF f M N ( σ ) = N P Ni =1 { σ i = σ } for each σ ∈ S n . Consider the minimumtotal variation distance estimator c M , argmin M ′ ∈ M ∗ TV ( M ′ , M N ) . (18) If N ≥ C log( δ ) /ε m ∗ k for a sufficiently large constant C = C ( n, k ) > and any δ ∈ (0 , ,then we have max M∈ M ∗ P M { c M 6 = M} ≤ δ. (b) On the other hand, if n ≥ m ∗ k and N ≤ c/ε m ∗ k for a sufficiently small constant c = c ( n, k ) > , then we have min f M max M∈ M ∗ P M { f M 6 = M} ≥ / , where the estimator f M of the mixture is measurable with respect to the observations σ , . . . , σ N . In this work, we proposed a methodology to learn a mixture of permutations based on groups ofpairwise comparisons. We first set up the framework using a generic noiseless model for mixturesof permutation. Then, we studied the Mallows mixture model, and introduced a polynomial-timealgorithm for learning the central permutations with a sample complexity logarithmic in the size16f the permutations. Finally, we studied the sample complexity for learning the Mallows mixturein a high-noise regime.For the algorithms in this work, we assumed the knowledge of the noise parameter φ . Thisis indeed restrictive, but we argue that this assumption is not essential for our main result inSection 3 and can be removed in principle. More precisely, the knowledge of φ is needed in thedefinition of the class of mixtures (13), which is used in the subroutine SubOrder ( J ). In the casewhere φ is unknown, we can simply try all possible values of φ on a fine grid in (0 , φ andgood concentration properties of the model, we believe the same sample complexity can be provedwithout the knowledge of φ . We choose not to add this technical complication which does notcontribute much to our general methodology.Moreover, in general, the Mallows k -mixture allows different noise parameters φ , . . . , φ k indifferent components. We think our approach in Section 3 can be adapted to this more generalsetting. However, the technical details are more complicated. Meanwhile, the results in Section 4depend more strictly on a common, known noise parameter φ .Last but not least, our general approach of learning a mixture of permutations from groups ofpairwise comparisons has potential applications beyond the Mallows mixture model. It would beinteresting to apply the framework proposed in Section 2.3 to other models for mixtures of permu-tations, such as the Plackett-Luce model [ZPX16] and variations of the Mallows model [DOS18]. We provide the proofs of our results in this section.
Throughout the proof, we write m ≡ m ∗ k , ⌊ log k ⌋ + 1 as in (7). We start with a lemma which isthe source of the logarithmic dependency of m on k . Lemma 6.1.
Consider a set Σ of k distinct permutations in S n where n ≥ . There exists π ∗ ∈ Σ and a tuple I of ℓ pairs of distinct indices ( i , j ) , . . . , ( i ℓ , j ℓ ) ∈ [ n ] , such that ℓ ≤ ⌊ log k ⌋ and χ ( π ∗ , I ) = χ ( π, I ) for all π ∈ Σ \ { π ∗ } , where χ ( π, I ) is defined by (3) . In addition, this tuple I can be found in poly ( n, k ) time.Proof. Let us start with Σ = Σ and apply the following bisection argument iteratively. Givena nonempty set Σ r − of distinct permutations where r ≥
1, it is easy to find a pair of indices i r , j r ∈ [ n ] such that both of the following sets are nonempty:Σ + r = { π ∈ Σ r − : π ( i r ) > π ( j r ) } and Σ − r = { π ∈ Σ r − : π ( i r ) < π ( j r ) } . (19)Since Σ + r ⊔ Σ − r = Σ r − , either Σ + r or Σ − r has size at most | Σ r − | /
2. We call it Σ r so that Σ r ⊂ Σ r − and | Σ r | ≤ | Σ r − | /
2. This procedure is iterated until we have | Σ r | = 1.For any r ≥
1, we have | Σ r | ≤ k/ r by construction. In particular, | Σ ⌊ log k ⌋ | ≤ k/ ⌊ log k ⌋ < ℓ ≤ ⌊ log k ⌋ such that | Σ ℓ | = 1. We denote the permutation in Σ ℓ by π ∗ . Note thatby (19) and the definition of Σ r , we have { σ ( i r ) < σ ( j r ) } 6 = { π ( i r ) < π ( j r ) } for any σ ∈ Σ r and π ∈ Σ r − \ Σ r . Since the sets Σ r ’s are nested, it holds that { π ∗ ( i r ) < π ∗ ( j r ) } 6 = { π ( i r ) < π ( j r ) } for any π ∈ Σ r − \ Σ r where r ∈ [ ℓ ]. As a result, if we define I , (cid:0) ( i , j ) , . . . , ( i ℓ , j ℓ ) (cid:1) , then17 ( π ∗ , I ) = χ ( π, I ) for any π ∈ Σ \ { π ∗ } . It is clear that I can be found in polynomial time, so theproof is complete.We now prove Theorem 2.6(a). Recall that our goal is to recover the k -mixture M = P ki =1 w i δ π i of permutations in S n from groups of m pairwise comparisons of the form χ ( π, I ) defined in (3)where π ∼ M . For this, we do an induction on n ≥
2, as the case n = 1 is vacuous. Base case.
For n = 2 and any k ≥
1, we can simply take I to be the tuple of m copies of (1 , χ ( π, I ), from which we immediatelyread off the distribution of { π (1) < π (2) } and thus the distribution of π . Induction hypothesis.
As the induction hypothesis, we assume that the statement of Theo-rem 2.6(a) holds for n − n ≥
3. Consider a mixture P ks =1 w s δ π s of permutations in S n which we aim to learn. Then each π s k [ n − is a permutation in S n − , and by definition (3), we have χ ( π s k [ n − , I ) = χ ( π s , I ) for any tuple I of pairs of indices in [ n − P ks =1 w s δ π s k [ n − . To recover the mixture P ks =1 w s δ π s ofpermutations on [ n ] from those on [ n − n into each permutationon [ n −
1] at the correct position.
Induction step.
Toward this end, let us apply Lemma 6.1 to the distinct elements of the set { π k [ n − , . . . , π k k [ n − } of permutations in S n − . Thus there exists s ∗ ∈ [ k ] and a tuple I of ℓ pairsof distinct indices in [ n − ℓ ≤ ⌊ log k ⌋ and χ ( π s , I ) = χ ( π s ∗ , I ) for all s ∈ [ k ] \ S ∗ where we define S ∗ , (cid:8) s ∈ [ k ] : π s k [ n − = π s ∗ k [ n − (cid:9) . Next, for any index r ∈ [ n − m -tuple I r consisting of all pairs of indices in I andalso the pair ( r, n ). Such a tuple I r can be chosen because ℓ ≤ ⌊ log k ⌋ = m −
1. Then we query thegroup of m pairwise comparisons on I r (Definition 2.1) to obtain the distribution P ks =1 w s δ χ ( π s , I r ) for each r ∈ [ n − I and S ∗ guarantee that χ ( π s , I ) = χ ( π s ∗ , I ) ifand only if s ∈ S ∗ . Since I r includes all pairs of indices in I , we can distinguish those componentsof P ks =1 w s δ χ ( π s , I r ) supported at χ ( π s , I r ) with s ∈ S ∗ from those with s ∈ [ k ] \ S ∗ . Therefore, weobtain the measure P s ∈ S ∗ w s δ χ ( π s , I r ) for any r ∈ [ n − r, n ) ∈ I r , from the measure P s ∈ S ∗ w s δ χ ( π s , I r ) , we can easily compute thefunction f ( r ) , |{ P s ∈ S ∗ w s : π s ( r ) < π s ( n ) }| where r ∈ [ n − f (0) , | S ∗ | .The measure P s ∈ S ∗ w s δ π s can be recovered from the sequence of numbers { f ( r ) } n − r =1 as follows. Bydefinition, the permutations π s k [ n − for s ∈ S ∗ are all the same, so by re-indexing 1 , . . . , n −
1, wecan assume that they are all equal to the identity permutation (1 , , . . . , n −
1) to ease the notation.Then f ( r ) is simply the total weight of permutations π s that place n after r , so particularly thesequence { f ( r ) } n − r =0 is nonincreasing. Moreover, f ( r ) − f ( r + 1) is equal to the total weight of thepermutations in the mixture P s ∈ S ∗ w s δ π s satisfying π s ( n ) = r + 1. Therefore, we can recover themeasure P s ∈ S ∗ w s δ π s from the sequence { f ( r ) } n − r =1 .Finally, once we have learned the measure P s ∈ S ∗ w s δ π s , the task becomes recovering the mea-sure P s / ∈ S ∗ w s δ π s from the measure P s / ∈ S ∗ w s δ π s k [ n − , which can be done by repeating the aboveprocedure. Indeed, when querying a group of pairwise comparisons P ks =1 w s δ χ ( π s , I r ) , we can easilysubtract the components with s ∈ S ∗ to obtain P s / ∈ S ∗ w s δ χ ( π s , I r ) . Therefore, the above procedurecan be iterated to eventually yield the entire mixture P ks =1 w s δ π s . This completes the induction.18 ime and sample complexity. To finish the proof, note that every step in this algorithmicconstruction is clearly polynomial-time. For the total number of groups of pairwise comparisons,recall that in the base case n = 2, we need one query, and in the induction step from n − n ,we learn at least one component of the mixture from n − k P n − i =2 i = 1 + k ( n − n + 1). Throughout the proof, we write m ≡ m ∗ k , ⌊ log k ⌋ + 1 as in (7) and fix ℓ ≤ m −
1. Intuitively,it is harder to identify a k -mixture of permutations in S n for larger n and larger k . Indeed, let usjustify that we can assume without loss of generality that n = 2 m and k = 2 m − : • Suppose that we can prove the statement of part (b) for n = 2 m , that is, we have two mixtures k P ki =1 δ π i and k P ki =1 δ π ′ i of permutations in S m that cannot be identified using ℓ -wise compar-isons. For any n > m , we may extend each of the above permutation to a permutation in S n bydefining π s ( j ) = π ′ s ( j ) = j for all s ∈ [ k ] and 2 m < j ≤ n . Then ℓ -wise comparisons still cannotdistinguish the two mixtures, because indices larger than 2 m are completely uninformative. Asa result, we may assume n = 2 m without loss of generality. • Suppose that we can establish the desired result for k -mixtures. Then for any k ′ > k , if we define π s = π ′ s = id for all k < s ≤ k ′ , then the mixtures k ′ P k ′ i =1 δ π i and k ′ P k ′ i =1 δ π ′ i still cannot bedistinguished using groups of ℓ -wise comparisons. Hence the statement of the theorem also holdsfor k ′ in replace of k . For any fixed m ∈ N , the smallest k such that ⌊ log k ⌋ + 1 = m is equal to2 m − . Therefore, we may assume that k = 2 m − without loss of generality.With these simplifications, for a fixed m ∈ N , we now construct two sets Σ and Σ of permu-tations in S m such that | Σ | = | Σ | = 2 m − , and such that groups of ℓ -wise comparisons cannotdistinguish the two mixtures k P π ∈ Σ δ π and k P π ∈ Σ δ π . For each vector v ∈ { , } m , we define apermutation π v ∈ S m by ( π v (2 j −
1) = 2 j − , π v (2 j ) = 2 j if v j = 0 π v (2 j −
1) = 2 j, π v (2 j ) = 2 j − v j = 1 for all j ∈ [ m ] . (20)Moreover, we defineΣ , { π v : v ∈ { , } m , k v k is odd } and Σ , { π v : v ∈ { , } m , k v k is even } . It is clear that both Σ and Σ have cardinality 2 m − .Next, consider an arbitrary set of indices J ⊂ [2 m ] with | J | = 2 m −
1. We claim that (cid:8) π v k J : v ∈ { , } m , k v k is odd (cid:9) = (cid:8) π v k J : v ∈ { , } m , k v k is even (cid:9) . (21)To prove this claim, let j denote the only element of [ n ] \ J . If j is odd, we let j = j +1; otherwise,we let j = j −
1. For any v ∈ { , } m with odd k v k , we define v ′ ∈ { , } m by v ′ i = 1 − v i for i = ⌈ j / ⌉ and v ′ i = v i for i = ⌈ j / ⌉ . Since v ′ differs from v in only one coordinate, k v ′ k mustbe even. As a result, we have π v ∈ Σ and π v ′ ∈ Σ . This clearly gives a bijection between thesets Σ and Σ . Furthermore, by definition (20), π v and π v ′ only differ on the pair ( j , j ). Since J does not contain j , we must have π v k J = π v ′ k J . Consequently, equation (21) holds, so that any ℓ -wise comparison (Definition 2.3) returns the same distribution for the two mixtures k P π ∈ Σ δ π and k P π ∈ Σ δ π . This completes the proof. 19 .3 Proof of Theorem 2.7 This proof is structurally similar to Theorem 2.6(a), but the key step in the induction is different.An intricacy needs to be noted throughout this proof: The oracle in Definition 2.2 returns the set { χ ( π i , I ) : i ∈ [ k ] } . Whenever we algorithmically obtain a set , denoted by { s , . . . , s k } for example,we only have a collection of its distinct elements without labels. It is possible that |{ s , . . . , s k }| < k and we do not know how many times each element repeats. We again first establish a lemma. Lemma 6.2.
Let π , . . . , π k be k distinct permutations in S n . There exists a tuple I of ℓ pairs ofdistinct indices in [ n ] such that ℓ ≤ k − and χ ( π i , I ) = χ ( π j , I ) for any distinct i, j ∈ [ k ] , where χ ( π i , I ) is defined by (3) . In addition, this tuple I can be found in poly ( n, k ) time.Proof. We prove the lemma by induction on k ≥
2. The statement clearly holds for k = 2. Supposethat it holds for k − k ≥
3. Then we have a tuple I of ℓ pairs of distinct indices in [ n ] suchthat ℓ ≤ k − χ ( π i , I ) = χ ( π j , I ) for any distinct i, j ∈ [ k − χ ( π k , I ) = χ ( π i , I ) for all i ∈ [ k − χ ( π k , I ) = χ ( π i , I ) for exactly one i ∈ [ k − π k = π i by assumption, we can find r, s ∈ [ n ] such that { π k ( r ) < π k ( s ) } 6 = { π i ( r ) < π i ( s ) } .Adding the pair ( r, s ) to the tuple I finishes the induction. Moreover, it is clear that the tuple I can be found in polynomial time.As Theorem 2.7 is trivial in the case n = 1, we now prove it for n ≥ Base case.
For the base case n = 2, we simply take I to be the tuple of m copies of (1 , { χ ( π i , I ) : i ∈ [ k ] } , from which we immediately readoff the set { { π i (1) < π i (2) } : i ∈ [ k ] } and thus the set { π i : i ∈ [ k ] } . Induction hypothesis.
Now we assume that the theorem holds for n − n ≥
3. Considera mixture of permutations π , . . . , π k ∈ S n which we aim to learn. Then each π i k [ n − where i ∈ [ k ]is a permutation in S n − . By definition (3), we have χ ( π i k [ n − , I ) = χ ( π i , I ) for any tuple I ofpairs of indices in [ n − { π i k [ n − : i ∈ [ k ] } .Let us denote the distinct elements of { π i k [ n − : i ∈ [ k ] } by π ′ , . . . , π ′ k ′ ∈ S n − , where k ′ ≤ k . Toobtain the set { π i : i ∈ [ k ] } of permutations on [ n ] from those on [ n − n into π ′ j at the correct position for each j ∈ [ k ′ ]. Note that |{ π i k [ n − : i ∈ [ k ] }| ≤ |{ π i : i ∈ [ k ] }| ,so we may need to obtain more than one permutation in S n from some π ′ j where j ∈ [ k ′ ]. Induction step.
Applying Lemma 6.2 to π ′ , . . . , π ′ k ′ , we obtain an ℓ -tuple I of pairs of distinctindices in [ n −
1] such that ℓ ≤ k ′ − χ ( π ′ i , I ) = χ ( π ′ j , I ) for any distinct i, j ∈ [ k ′ ]. Thisguarantees that π i k [ n − = π ′ j if and only if χ ( π i , I ) = χ ( π ′ j , I ) for i ∈ [ k ] and j ∈ [ k ′ ].We now fix j ∈ [ k ′ ] and aim to recover those π i such that π i k [ n − = π ′ j . To simplify the notationin the sequel, we assume without loss of generality that π ′ j is equal to (1 , , . . . , n − n − π i has n possibilities depending on the positive of the element n .Recall that m , k + 1 ≥ ℓ + 2. Thus, for each r = 2 , . . . , n −
1, we can define an m -tuple I r containing all the ℓ pairs of indices in I , the pair ( r − , n ), and the pair ( n, r ). In the casethat m > ℓ + 2, the remaining m − ℓ − I r can be defined arbitrarily forconcreteness—we will not use the comparison information on those pairs. Moreover, we definean m -tuple I containing the ℓ pairs of indices in I and the pair ( n, m -tuple I n ℓ pairs of indices in I and the pair ( n − , n ). Then for each r ∈ [ n ], we query thegroup of pairwise comparisons on I r according to Definition 2.2 to obtain the set { χ ( π i , I r ) : i ∈ [ k ] } .Since I r includes all pairs of indices in I , we can compute a subset X j ( r ) of { χ ( π i , I r ) : i ∈ [ k ] } defined for each r ∈ [ n ] as X j ( r ) , { χ ( π i , I r ) : i ∈ [ k ] , χ ( π i , I ) = χ ( π ′ j , I ) } . Recall that π i k [ n − = π ′ j if and only if χ ( π i , I ) = χ ( π ′ j , I ). By the definition of X j ( r ), for eachfixed r ∈ [ n ], we have π i k [ n − = π ′ j if and only if χ ( π i , I r ) ∈ X j ( r ) for i ∈ [ k ].It remains to recover π i for which π i k [ n − = π ′ j from the collection of sets { X j ( r ) : r ∈ [ n ] } .First, fix some r ∈ { , . . . , n − } . Recall that the pairs ( r − , n ) and ( n, r ) are both in I r . If π i = (1 , . . . , r − , n, r, . . . , n −
1) for some i ∈ [ k ], then π i k [ n − = π ′ j , and the set X j ( r ) must containa vector χ ( π i , I r ) whose entries { π i ( r − < π i ( n ) } and { π i ( n ) < π i ( r ) } are both equal to one.Conversely, if X j ( r ) contains some χ ( π i , I r ) with entries { π i ( r − < π i ( n ) } and { π i ( n ) < π i ( r ) } equal to one, then we know that π i k [ n − = π ′ j , and π i must be equal to (1 , . . . , r − , n, r, . . . , n − . This argument clearly works for π i equal to ( n, , . . . , n −
1) or (1 , . . . , n − , n ) as well, byconsidering the set X j (1) or X j ( n ) respectively. Therefore, we are able to recover all distinct π i such that π i k [ n − = π ′ j for i ∈ [ k ].Finally, repeating the above procedure for each j ∈ [ k ′ ] yields the set { π i : i ∈ [ k ] } . Time and sample complexity.
To finish the proof, note that every step in this algorithmicconstruction is clearly polynomial-time. For the total number of groups of pairwise comparisons,recall that in the base case n = 2, we need one query, and in the induction step from n − n ,we learn at least one component of the mixture from n queries. In summary, the total number ofqueries needed is at most 1 + k P ni =3 i = 1 + k ( n − n + 3). We state some basic facts about the Mallows model that are known in the literature.
Lemma 6.3.
Consider a Mallows model M ( π, φ ) . Then for any fixed integers j ∈ [ n ] and r ≥ ,it holds that P σ ∼ M ( π,φ ) {| σ ( j ) − π ( j ) | ≥ r } ≤ φ r − φ . Proof.
See, for example, Lemma 17 of [BM09].The following lemma is essentially Lemma 3 of [LM18] in a different form, which gives a pre-liminary identifiability result for the Mallows mixture. Although this result, which follows fromZagier’s work [Zag92], appears to be extremely weak, it can be used as a building block to establishmuch stronger bounds later.
Lemma 6.4.
Consider Mallows models M ( π ) , . . . , M ( π k ) on S n with distinct central permutations π , . . . , π k and a common noise parameter φ ∈ (0 , . There exists a test function u : S n → [ − , such that E M ( π ) [ u ] ≥ n ! h (1 − φ ) n √ n ! i k , and E M ( π i ) [ u ] = 0 for i = 2 , . . . , k , where we write E M ( π i ) [ u ] ≡ E σ ∼ M ( π i ) [ u ( σ )] . roof. Using the main result from [Zag92], Lemma 4 of [LM18] establishes the following result: Let A = A ( φ ) be the n ! × n ! matrix defined in (1), where A σ,π = φ d KT ( σ,π ) for σ, π ∈ S n . By (2), A isnon-singular. Let A π denote the column of A indexed by π . Then the orthogonal projection of A π onto the orthogonal complement of A π , . . . , A π k has Euclidean norm at least (cid:2) (1 − φ ) n √ n ! (cid:3) k .Normalizing this orthogonal projection of A π yields a unit vector u , which can be identifiedwith a function u : S n → [ − , E M ( π ) [ u ] = Z ( φ ) h A π , u i ≥ Z ( φ ) (cid:2) (1 − φ ) n √ n ! (cid:3) k , and E M ( π i ) [ u ] = Z ( φ ) h A π i , u i = 0 for i = 2 , . . . , k . Finally, applying the crude bound Z ( φ ) = P σ ∈S n φ d KT ( σ,π ) ≤ n ! finishes the proof. The notion of block structure was introduced by Liu and Moitra [LM18] and is key to analyzing(mixtures of) Mallows models. In this work, we define a block structure in a slightly different way.We say that a set B of integers is contiguous if it is of the form { i, i + 1 , . . . , i + | B | − } for someinteger i . For a permutation π ∈ S n and a subset B ⊂ [ n ], we let π ( B ) denote the set { π ( i ) : i ∈ B } . Definition 6.5.
Consider pairwise disjoint sets B , . . . , B m ⊂ [ n ] and pairwise disjoint contiguoussets B ′ , . . . , B ′ m ⊂ [ n ] , such that | B j | = | B ′ j | > for each j ∈ [ m ] and max B ′ j < min B ′ j +1 for each j ∈ [ m − . We refer to the sequence of pairs B = ( B , B ′ ) , . . . , ( B m , B ′ m ) as a block structure .Moreover, we say that a permutation π ∈ S n satisfies the block structure B if π ( B j ) = B ′ j for each j ∈ [ m ] . For example, the permutation (3 , , , , , , ,
5) satisfies the block structure B = ( { , } , { , } ) , ( { , , } , { , , } ). Let M ( π, φ ) be a Mallows model, and let B be a block structure. Later in the proofs, we usethe technique developed in [LM18] of conditioning on σ ∼ M ( π, φ ) satisfying B . This techniqueof conditioning is helpful thanks to Lemma 6.6 below, which in particular restates Fact 1 andCorollary 2 of [LM18].Recall from Section 1.6 for a subset B ⊂ [ n ], their “relative ordering” under π is denoted by π k B ,which is the bijection from B to [ | B | ] induced by π | B . In addition, π k B can also be viewed as per-mutation in S | B | by identifying the elements of B with 1 , . . . , | B | in the ascending order. Therefore,it is valid to consider the Mallows model M ( π k B , φ ) on S | B | ≡ { bijections from B to [ | B | ] } . Forinstance, in the example after Definition 6.5 above, π k { , , } can be identified with the permutation(1 , ,
2) in S . Lemma 6.6.
Consider a Mallows model M ( π, φ ) and a block structure B = ( B , B ′ ) , . . . , ( B m , B ′ m ) .Let J , S mj =1 B j and J ′ , S mj =1 B ′ j . Fix a bijection τ : [ n ] \ J → [ n ] \ J ′ . For σ ∼ M ( π, φ ) ,conditional on the event { σ satisfies B and σ | [ n ] \ J = τ } , the relative orderings π k B j for j ∈ [ m ] areindependent, and each π k B j (when identified as an element of S | B j | ) is distributed as the Mallowsmodel M ( π k B j , φ ) .Consequently, given any functions u j : S | B j | → R for j ∈ [ m ] , we have E σ ∼ M ( π ) (cid:20) { σ satisfies B} · m Y j =1 u j ( σ k B j ) (cid:21) = P σ ∼ M ( π ) { σ satisfies B} · m Y j =1 E M ( π k Bj ) [ u j ] , (22)22 here we write M ( π ) ≡ M ( π, φ ) and E M ( π k Bj ) [ u j ] ≡ E σ j ∼ M ( π k Bj ) [ u j ( σ j )] .Proof. Consider permutations σ, σ ′ ∈ S n such that σ and σ ′ both satisfy the block structure B , and σ | [ n ] \ J = σ ′ | [ n ] \ J = τ . Let σ j = σ k B j and σ ′ j = σ ′ k B j for each j ∈ [ m ]. Since each B ′ j is contiguous,it is possible to have { σ ( s ) > σ ( t ) } 6 = { σ ′ ( s ) > σ ′ ( t ) } only if the indices s and t are in the sameblock B j . It follows that d KT ( π, σ ) − d KT ( π, σ ′ ) = X s,t ∈ [ n ]: π ( s ) <π ( t ) (cid:16) { σ ( s ) > σ ( t ) } − { σ ′ ( s ) > σ ′ ( t ) } (cid:17) = m X j =1 X s,t ∈ B j : π ( s ) <π ( t ) (cid:16) { σ ( s ) > σ ( t ) } − { σ ′ ( s ) > σ ′ ( t ) } (cid:17) = m X j =1 X s,t ∈ B j : π k Bj ( s ) <π k Bj ( t ) (cid:16) { σ j ( s ) > σ j ( t ) } − { σ ′ j ( s ) > σ ′ j ( t ) } (cid:17) = m X j =1 (cid:2) d KT ( π k B j , σ j ) − d KT ( π k B j , σ ′ j ) (cid:3) . As a result, the ratio between the probability masses at σ and σ ′ under the original Mallowsmodel M ( π, φ ) is equal to the ratio between the probability masses at ( σ , . . . , σ m ) and ( σ ′ , . . . , σ ′ m )under the product distribution ⊗ mj =1 M ( π k B j , φ ): φ d KT ( π,σ ) φ d KT ( π,σ ′ ) = φ d KT ( π,σ ) − d KT ( π,σ ′ ) = φ P mj =1 [ d KT ( π k Bj ,σ j ) − d KT ( π k Bj ,σ ′ j )] = m Y j =1 φ d KT ( π k Bj ,σ j ) φ d KT ( π k Bj ,σ ′ j ) . Therefore, the first statement of the lemma holds.Furthermore, since the product distribution ⊗ mj =1 M ( π k B j , φ ) does not depend on τ = σ | [ n ] \ J ,we see that the conditional distribution of σ ∼ M ( π, φ ) on σ satisfying B , marginalized over σ | [ n ] \ J ,is also the product distribution ⊗ mj =1 M ( π k B j , φ ). Therefore, both sides of (22) are equal to P σ ∼ M ( π ) { σ satisfies B} · E σ ∼ M ( π ) (cid:20) m Y j =1 u j ( σ k B j ) (cid:12)(cid:12)(cid:12) σ satisfies B (cid:21) , so the proof is complete. We now establish a crucial lower bound on the probability that a permutation from the Mallowsmodel satisfies a certain block structure. Let d H denote the Hausdorff distance between two sets A, B ⊂ Z , that is, d H ( A, B ) , max n max a ∈ A min b ∈ B | a − b | , max b ∈ B min a ∈ A | a − b | o . The following lemma provides a dimension-free lower bound on the probability of satisfying a blockstructure, whenever the central permutation satisfies the same block structure approximately (upto distance D ). This allows us to “localize” the analysis to a block structure without sacrificing thedependency of the sample complexity on n . This result significantly improves Lemma 1 of [LM18],which gives a lower bound of n − ℓ assuming that the central permutation satisfies the block structureexactly ( D = 0). 23 emma 6.7. Let M ( π, φ ) be a Mallows model on S n . For a block structure B = ( B , B ′ ) , . . . 
, ( B m , B ′ m ) , suppose that d H ( π ( B i ) , B ′ i ) ≤ D for each i ∈ [ m ] , where d H denotes the Hausdorffdistance and D ≥ . Let ℓ , P mi =1 | B i | . Then we have P σ ∼ M ( π,φ ) { σ satisfies B} ≥ φ ℓD (1 − φ ) ℓ ℓ ) ℓ . Proof.
Set R , (cid:6) log[(1 − φ ) / (4 ℓ )]log φ (cid:7) . By Lemma 6.3 and a union bound, we have P σ ∼ M ( π,φ ) (cid:8) | σ ( j ) − π ( j ) | ≤ R for all j ∈ S mi =1 B i (cid:9) ≥ / . (23)Let us define a collection of m -tuples of sets K , (cid:8) ( K , . . . , K m ) : K i ⊂ [ n ] , | K i | = | B i | , d H ( π ( B i ) , K i ) ≤ R,K i ∩ K j = ∅ for any distinct i, j ∈ [ m ] (cid:9) . Note that there are at most (cid:0) | B i | +2 R | B i | (cid:1) choices for each K i with | K i | = | B i | and d H ( π ( B i ) , K i ) ≤ R ,so the cardinality of K can be bounded as |K| ≤ m Y i =1 (cid:18) | B i | + 2 R | B i | (cid:19) ≤ m Y i =1 (cid:0) | B i | + 2 R (cid:1) | B i | ≤ (cid:0) ℓ + 2 R (cid:1) ℓ . (24)In addition, for each tuple ( K , . . . , K m ) ∈ K , we define an event S ( K , . . . , K m ) , (cid:8) σ ( B i ) = K i for all i ∈ [ m ] (cid:9) . Then by definition, we have (cid:8) | σ ( j ) − π ( j ) | ≤ R for all j ∈ S mi =1 B i (cid:9) ⊂ [ ( K ,...,K m ) ∈K S ( K , . . . , K m ) . Writing P M ( π,φ ) ≡ P σ ∼ M ( π,φ ) for brevity, we obtain from the above inclusion and (23) that X ( K ,...,K m ) ∈K P M ( π,φ ) (cid:8) S ( K , . . . , K m ) (cid:9) ≥ / . (25)Next, note that S ( B ′ , . . . , B ′ m ) = (cid:8) σ ( B i ) = B ′ i for all i ∈ [ m ] (cid:9) = (cid:8) σ satisfies B (cid:9) . We claim that for any ( K , . . . , K m ) ∈ K , P M ( π,φ ) (cid:8) S ( B ′ , . . . , B ′ m ) (cid:9) ≥ φ s ( D + R ) · P M ( π,φ ) (cid:8) S ( K , . . . , K m ) (cid:9) . (26)Assuming this claim, we conclude from (26), (25) and (24) that P M ( π,φ ) (cid:8) S ( B ′ , . . . , B ′ m ) (cid:9) ≥ φ ℓ ( D + R ) |K| X ( K ,...,K m ) ∈K P M ( π,φ ) (cid:8) S ( K , . . . , K m ) (cid:9) ≥ φ ℓ ( D + R ) |K| ≥ φ ℓ ( D + R ) ℓ + 2 R ) ℓ ≥ φ ℓD (1 − φ ) ℓ ℓ ) ℓ , R = log[(1 − φ ) / (4 ℓ )]log φ ≤ ℓ (1 − φ ) . It remains to prove (26). There is a natural bijection between the events S ( K , . . . , K m ) and S ( B ′ , . . . , B ′ m ) (viewed as subsets of S n ) as follows: For each σ ∈ S ( K , . . . , K m ), there is acorresponding σ ′ ∈ S ( B ′ , . . . , B ′ m ) defined so that σ ′ k B i = σ k B i for all i ∈ [ m ] and σ ′ k [ n ] \ ( S mi =1 B i ) = σ k [ n ] \ ( S mi =1 B i ) . That is, σ maps each B i to K i and σ ′ maps each B i to B ′ i , and their relative orders agree on each B i as well as on the complement of S mi =1 B i .Recall that d H ( π ( B i ) , K i ) ≤ R and d H ( π ( B i ) , B ′ i ) ≤ D for each i ∈ [ m ], so we have d H ( K i , B ′ i ) ≤ D + R. Since each B ′ i is contiguous, we easily see that | σ ( k ) − σ ′ ( k ) | ≤ D + R for each k ∈ S mi =1 B i . As a result, it takes at most ℓ ( D + R ) adjacent transpositions to change the permutation σ to σ ′ ,so d KT ( σ ′ , σ ) ≤ ℓ ( D + R ). By the triangle inequality, we have d KT ( σ ′ , π ) − d KT ( σ, π ) ≤ d KT ( σ ′ , σ ) ≤ ℓ ( D + R ) . Denoting the PMF of M ( π, φ ) by f M ( π,φ ) , we have f M ( π,φ ) ( σ ′ ) = φ d KT ( σ ′ ,π ) /Z ( φ ) ≥ φ ℓ ( D + R ) φ d KT ( σ,π ) /Z ( φ ) = φ ℓ ( D + R ) f M ( π,φ ) ( σ ) . Summing up this inequality over σ ∈ S ( K , . . . , K m ) (that is, over σ ′ ∈ S ( B ′ , . . . , B ′ m )) yields(26), thereby completing the proof. The following lemma is at the crux of proving the main identifiability result of Proposition 3.1.
Lemma 6.8.
Consider Mallows models M ( π ) , . . . , M ( π k ) on S n with a common noise parameter φ ∈ (0 , , and consider a set of indices J = { j , . . . , j ℓ } ⊂ [ n ] . Suppose that π k J = π i k J for any i = 2 , . . . , k . Then for any fixed C ≥ , there exists a block structure B = ( B , B ′ ) , . . . , ( B m , B ′ m ) where S mj =1 B j = J , such that:(1) P σ ∼ M ( π ) { σ satisfies B} ≥ c , (cid:2) (1 − φ ) ℓ +1 C (6 ℓ ) ℓ (cid:3) (3 ℓ ) ℓ ;(2) for each ≤ i ≤ k , we have either P σ ∼ M ( π i ) { σ satisfies B} ≤ c/C , or π i k B j = π k B j for some j ∈ [ m ] .Proof. We use an iterative argument to prove the lemma. At each step t ≥
0, we define a blockstructure B ( t ) = ( B ( t )1 , ( B ′ ) ( t )1 ) , . . . , ( B ( t ) m ( t ) , ( B ′ ) ( t ) m ( t ) ) and a constant c ( t ) that potentially satisfy theabove conditions. If not, we redefine a coarser block structure B ( t +1) and a smaller constant c ( t +1)1 in the next step, and show that the procedure must end in ℓ − terative construction Up to a relabeling of indices in J , we may assume without loss ofgenerality that π ( j ) < · · · < π ( j ℓ ). Let us start with D (0) , B (0) , ( { j } , { π ( j ) } ) , . . . , ( { j ℓ } , { π ( j ℓ ) } ). Note that we have m (0) = ℓ .Suppose that at step t ≥
0, we have a block structure B ( t ) = ( B ( t )1 , ( B ′ ) ( t )1 ) , . . . , ( B ( t ) m ( t ) , ( B ′ ) ( t ) m ( t ) )and a constant D ( t ) ≥
0, such that:(a) the blocks B ( t )1 , . . . , B ( t ) m ( t ) form an ordered partition of the ordered set { j , . . . , j ℓ } ;(b) d H ( π ( B ( t ) j ) , ( B ′ ) ( t ) j ) ≤ D ( t ) for all j ∈ [ m ( t ) ].These conditions are clearly satisfied at step t = 0.Let us define c ( t ) , φ ℓD ( t ) (1 − φ ) ℓ ℓ ) ℓ . (27)It then follows from Lemma 6.7 that P σ ∼ M ( π ) { σ satisfies B ( t ) } ≥ c ( t ) , that is, condition (1) in the statement of the lemma holds. If condition (2) also holds, then we aredone. Otherwise, there exists 2 ≤ i ≤ k such that P σ ∼ M ( π i ) { σ satisfies B ( t ) } > c ( t ) /C and π i k B ( t ) j = π k B ( t ) j for all j ∈ [ m ( t ) ] . Note that the relative orders of π i and π are the same on each block B ( t ) j but different on theirunion J (recall the assumption of π i k J = π k J ). In view of the ordering of the blocks, it is not hardto see that, there exists j ∗ ∈ [ m ( t ) − r ∈ B ( t ) j ∗ and r ∈ B ( t ) j ∗ +1 such that π ( r ) < π ( r ) while π i ( r ) > π i ( r ). If no such j ∗ exists, then every element of π i ( B ( t ) j ∗ ) is smaller than every elementof π i ( B ( t ) j ∗ +1 ) for all j ∗ ∈ [ m ( t ) − π i and π coincide on each block, this implies π i k J = π k J , which is a contradiction.Let us set s , max ( B ′ ) ( t ) j ∗ and s , min ( B ′ ) ( t ) j ∗ +1 , and we have s < s by the definition ofa block structure. Note that every σ satisfying B ( t ) must have σ ( r ) ≤ s and σ ( r ) ≥ s . Since π i ( r ) > π i ( r ), it holds that either | σ ( r ) − π i ( r ) | > ( s − s ) / | σ ( r ) − π i ( r ) | > ( s − s ) / P σ ∼ M ( π i ) (cid:8) | σ ( r ) − π i ( r ) | > ( s − s ) / (cid:9) + P σ ∼ M ( π i ) (cid:8) | σ ( r ) − π i ( r ) | > ( s − s ) / (cid:9) ≥ P σ ∼ M ( π i ) { σ satisfies B ( t ) } > c ( t ) /C . Lemma 6.3, on the other hand, gives the upper bound P σ ∼ M ( π i ) (cid:8) | σ ( r ) − π i ( r ) | > ( s − s ) / (cid:9) ≤ φ ( s − s ) / − φ , and the same bound also holds with r replaced by r . Combining the above inequalities yields4 φ ( s − s ) / − φ > c ( t )1 C = ⇒ s − s < φ c ( t )1 (1 − φ )4 C , (28)26here log φ ( · ) denotes the logarithm with respect to base φ .Intuitively, this shows that the blocks ( B ′ ) ( t ) j ∗ and ( B ′ ) ( t ) j ∗ +1 are not too far apart. We nowmerge them to define a coarser block structure B ( t +1) = ( B ( t +1)1 , ( B ′ ) ( t +1)1 ) , . . . , ( B ( t +1) m ( t +1) , ( B ′ ) ( t +1) m ( t +1) )with m ( t +1) = m ( t ) − B ( t +1) j ∗ , B ( t ) j ∗ ∪ B ( t ) j ∗ +1 , and define ( B ′ ) ( t +1) j ∗ to bethe contiguous block which extends ( B ′ ) ( t ) j ∗ by | ( B ′ ) ( t ) j ∗ +1 | elements to the right. Moreover, we set B ( t +1) j , B ( t ) j and ( B ′ ) ( t +1) j , ( B ′ ) ( t ) j for 1 ≤ j < j ∗ , and set B ( t +1) j , B ( t ) j +1 and ( B ′ ) ( t +1) j , ( B ′ ) ( t ) j +1 for j ∗ < j ≤ m ( t +1) = m ( t ) −
1. Note that B ( t +1) is a valid block structure per Definition 6.5, thatis, | B j | = | B ′ j | and B ′ j is contiguous for each j ∈ [ m ( t +1) ].Moreover, it is clear from the definition of the only new block ( B ′ ) ( t +1) j ∗ that d H (cid:16) ( B ′ ) ( t +1) j ∗ , ( B ′ ) ( t ) j ∗ ∪ ( B ′ ) ( t ) j ∗ +1 (cid:17) < s − s < φ c ( t )1 (1 − φ )4 C thanks to (28). As a result, we have d H (cid:16) π ( B ( t +1) j ∗ ) , ( B ′ ) ( t +1) j ∗ (cid:17) ( i ) ≤ d H (cid:16) π (cid:0) B ( t ) j ∗ ∪ B ( t ) j ∗ +1 (cid:1) , ( B ′ ) ( t ) j ∗ ∪ ( B ′ ) ( t ) j ∗ +1 (cid:17) + d H (cid:16) ( B ′ ) ( t ) j ∗ ∪ ( B ′ ) ( t ) j ∗ +1 , ( B ′ ) ( t +1) j ∗ (cid:17) ( ii ) ≤ D ( t ) + 2 log φ c ( t )1 (1 − φ )4 C , where (i) follows from the definition of B ( t +1) j ∗ and the triangle inequality, and (ii) follows fromcondition (b) above for step t . Hence if we define D ( t +1) , D ( t ) + 2 log φ c ( t )1 (1 − φ )4 C , (29)then condition (b) is also satisfied for step t + 1. By construction, condition (a) continues to holdfor step t + 1. Therefore, we can iterate this construction.Finally, since m (0) = ℓ and m ( t +1) = m ( t ) −
1, the procedure has to end in ℓ − m ( ℓ ) = 1. In this situation, we simply has one block in the block structure B ( ℓ − = ( J, ( B ′ ) ( ℓ − ),and condition (2) in the statement of the lemma is necessarily achieved as π i k J = π k J for all2 ≤ i ≤ k by assumption. Thus the construction ends with success. Lower bound on c ( t )1 It remains to give a lower bound on c ( t )1 . Substituting (27) into (29) yields D ( t +1) = D ( t ) + 2 log φ φ ℓD ( t ) (1 − φ ) ℓ +1 C (6 ℓ ) ℓ = D ( t ) (1 + 2 ℓ ) + 2 log φ (1 − φ ) ℓ +1 C (6 ℓ ) ℓ . Combining this relation with D (0) = 0, we see that for any t ≤ ℓ − D ( t ) ≤ ℓ ) ℓ − log φ (1 − φ ) ℓ +1 C (6 ℓ ) ℓ .
27t then follows that φ D ( t ) ≥ h (1 − φ ) ℓ +1 C (6 ℓ ) ℓ i ℓ ) ℓ − . Therefore, we conclude using definition (27) that c ( t ) ≥ (1 − φ ) ℓ ℓ ) ℓ h (1 − φ ) ℓ +1 C (6 ℓ ) ℓ i ℓ (1 + 2 ℓ ) ℓ − ≥ h (1 − φ ) ℓ +1 C (6 ℓ ) ℓ i (3 ℓ ) ℓ for any t ≤ ℓ −
1. This completes the proof.
We first state a general result that implies both Propositions 3.1 and 3.6. Let S n | J denote the setof injections from J to [ n ]. Recall that for the Mallows model M ( π, φ ), the marginalized model M ( π, φ ) | J is a distribution on S n | J defined in (10). Lemma 6.9.
Consider K Mallows models M ( π ) , . . . , M ( π K ) on S n with a common noise param-eter φ ∈ (0 , . Fix a set of indices J ⊂ [ n ] and let ℓ , | J | . Let α , . . . , α K be real numbers suchthat: (1) α = 1 ; (2) α i ≥ for every i ∈ [ K ] such that π i k J = π k J ; (3) | α i | ≤ /γ for every i ∈ [ K ] such that π i k J = π k J , where < γ ≤ . For any function v : S n | J → R , we write E i [ v ] ≡ E ρ ∼ M ( π i ) | J [ v ( ρ )] for each i ∈ [ K ] . Define η ( k, ℓ, φ, γ ) as in (12) . Then there exists a testfunction v : S n | J → [ − , such that K X i =1 α i E i [ v ] ≥ γ η ( K/ , ℓ, φ, γ ) . (30) By the assumption { π k J , . . . , π k k J } 6 = { π ′ k J , . . . , π ′ k k J } , up to a relabeling of elements within { π k J , . . . , π k k J } or { π ′ k J , . . . , π ′ k k J } , and possibly a swap of the two sets, we may assume that π k J = π ′ i k J for any i ∈ [ k ]. To prove that the total variation distance satisfies TV ( M| J , M ′ | J ) = 12 sup k v k ∞ ≤ (cid:12)(cid:12)(cid:12)(cid:12) E ρ ∼M| J [ v ( ρ )] − E ρ ∼M ′ | J [ v ( ρ )] (cid:12)(cid:12)(cid:12)(cid:12) ≥ η ( k, ℓ, φ, γ ) , it suffices to find a test function v : S n | J → [ − ,
1] such that k X i =1 w i E i [ v ] − k X i =1 w ′ i E ′ i [ v ] ≥ η ( k, ℓ, φ, γ ) . (31)Setting K = 2 k , α i = w i /w , α k + i = − w ′ i /w and π k + i = π ′ i for i ∈ [ k ], we see that all theconditions in Lemma 6.9 are satisfied. Therefore, (31) follows from (30).28 .6.2 Proof of Proposition 3.6 We need to prove TV ( M| J , M ′ | J ) = 12 sup k v k ∞ ≤ (cid:12)(cid:12)(cid:12)(cid:12) k X i =1 w i E i [ v ] − k X i =1 w ′ i E i [ v ] (cid:12)(cid:12)(cid:12)(cid:12) ≥ ξ · η ( k/ , ℓ, φ, . Hence, it suffices to find a test function v : S n | J → [ − ,
1] such that k X i =1 ( w i − w ′ i ) E i [ v ] ≥ ξ · η ( k/ , ℓ, φ, . (32)Up to a relabeling, we may assume that w − w ′ = ξ >
0. Let us apply Lemma 6.9 with K = k , γ = 1 and α i = ( w i − w ′ i ) /ξ for i ∈ [ k ]. Then | α i | ≤
1, and by assumption, π i k J = π k J for any i = 1. Hence all the conditions in Lemma 6.9 are satisfied. Therefore, (32) follows from (30). Let us define I , { i ∈ [ K ] : π i k J = π k J } . We apply Lemma 6.8 to the models M ( π ) and { M ( π i ) : i ∈ [ K ] \ I } with C = 2 K/γ , to obtaina block structure B = ( B , B ′ ) , . . . , ( B m , B ′ m ) where S j ∈ [ m ] B j = J , such that: • P σ ∼ M ( π ) { σ satisfies B} ≥ c , (cid:2) γ (1 − φ ) ℓ +1 K (6 ℓ ) ℓ (cid:3) (3 ℓ ) ℓ ; • There exists I ⊂ [ K ] \ I such that for each i ∈ I , we have π i k B j = π k B j for some j ∈ [ m ]; • For each i ∈ I , [ K ] \ ( I ∪ I ), we have that π i k B j = π k B j for all j ∈ [ m ], and that P σ ∼ M ( π i ) { σ satisfies B} ≤ cγ K .With the block structure B constructed above, we define the test function v in (30) by v ( σ | J ) , { σ satisfies B} · m Y j =1 u j ( σ k B j ) , where each u j : S | B j | → [ − ,
1] is to be specified later. Note that v is well-defined because S j ∈ [ m ] B j = J so that: (1) whether σ satisfies B is fully determined by σ | J , and (2) σ k B j isfully determined by σ | J for each j ∈ [ m ]. In addition, we clearly have k v k ∞ ≤ E i [ v ], we use the definitions of M ( π i ) | J and v to obtain E i [ v ] = E ρ ∼ M ( π i ) | J [ v ( ρ )] = E σ ∼ M ( π i ) [ v ( σ | J )] = E σ ∼ M ( π i ) (cid:20) { σ satisfies B} · m Y j =1 u j ( σ k B j ) (cid:21) . It then follows from the conditional independence in (22) that E i [ v ] = P σ ∼ M ( π i ) { σ satisfies B} · m Y j =1 E M ( π i k Bj ) [ u j ] . (33)29e now define the test function u j : S | B j | → [ − ,
1] for each j ∈ [ m ]. Since | B j | ≤ | J | = ℓ is small, we can afford to apply the crude construction in Lemma 6.4. First, if | B j | = 1, the set S | B j | consists of a singleton, and we simply define u j = 1. Next, if j ∈ [ m ] with | B j | ≥
2, let k ′ bethe number of distinct elements of { M ( π i k B j ) : i ∈ [ K ] } . Applying Lemma 6.4 with the distinctmodels in { M ( π i k B j ) : i ∈ [ K ] } , we obtain u j such that E M ( π k Bj ) [ u j ] ≥ | B j | ! h (1 − φ ) | Bj | √ | B j | ! i k ′ ≥ ℓ ! h (1 − φ ) ℓ √ ℓ ! i K if π i k B j = π k B j , E M ( π i k Bj ) [ u j ] = 0 if π i k B j = π k B j , (34)where we used the trivial bounds | B j | ≤ ℓ and k ′ ≤ K .In summary, we have: • For i = 1, we have P σ ∼ M ( π ) { σ satisfies B} ≥ c . Thus we obtain from (33) and (34) that E [ v ] ≥ c · ℓ !) m (cid:2) (1 − φ ) ℓ √ ℓ ! (cid:3) Km . • For i ∈ I \ { } , we have E i [ v ] ≥ • For i ∈ I , we have π i k B j = π k B j for some j ∈ [ m ] by the construction of the block structure;for this index j , it holds that E M ( π i k Bj ) [ u j ] = 0 by (34). Therefore, E i [ v ] = 0 by (33). • For i ∈ I , we have that P σ ∼ M ( π i ) { σ satisfies B} ≤ cγ K ≤ γ K P σ ∼ M ( π ) { σ satisfies B} and that π i k B j = π k B j for all j ∈ [ m ] by our construction of the block structure. Together with (33), thisimplies that 0 ≤ E i [ v ] ≤ γ K E [ v ] since m ≤ ℓ .Finally, combining the above with the assumption that α = 1, α i ≥ i ∈ I , and | α i | ≤ /γ for [ K ] \ I , we conclude that K X i =1 α i E i [ v ] = E [ v ] + X i ∈ I \{ } α i E i [ v ] + X i ∈ I α i E i [ v ] + X i ∈ I α i E i [ v ] ≥ E [ v ] − X i ∈ I γ · γ K E [ v ] ≥ E [ v ] ≥ h γ (1 − φ ) ℓ +1 K (6 ℓ ) ℓ i (3 ℓ ) ℓ · ℓ !) m h (1 − φ ) ℓ √ ℓ ! i Km . Using m ≤ ℓ and η ( K/ , ℓ, φ, γ ) = ( γ K ) (3 ℓ ) ℓ +1 ( − φℓ ) (4 ℓ ) ℓ + Kℓ , it is not hard to simplify the abovebound to obtain (30), thereby finishing the proof. For J ⊂ [ n ] and r ∈ N , we define an event Σ( r ) ⊂ S n byΣ( r ) , (cid:8) σ ∈ S n : there exists j ∈ J such that | σ ( j ) − π i ( j ) | ≥ r for all i ∈ [ k ] (cid:9) . Since M is a probability measure, we have M (cid:0) Σ( r ) (cid:1) = P σ ∼M { σ ∈ Σ( r ) } . Moreover, recall that S n,J denotes the set of injections ρ : J → [ n ]. Define an event ˜Σ( r ) ⊂ S n,J by˜Σ( r ) , (cid:8) ρ ∈ S n,J : there exists j ∈ J such that | ρ ( j ) − π i ( j ) | ≥ r for all i ∈ [ k ] (cid:9) . r ) and Σ( r ) impose the same constraint, with the former on σ and the latter on ρ .By the definition of M| J in (10), we obtain p , M| J (cid:0) ˜Σ( r ) (cid:1) = X ρ ∈ ˜Σ( r ) f M| J ( ρ ) = X ρ ∈ ˜Σ( r ) P σ ∼M (cid:0) σ | J = ρ (cid:1) = M (cid:0) Σ( r ) (cid:1) . For the empirical distribution M N | J defined by (11), we have p N , M N | J (cid:0) ˜Σ( r ) (cid:1) = X ρ ∈ ˜Σ( r ) f M N | J ( ρ ) = X ρ ∈ ˜Σ( r ) N N X m =1 (cid:8) σ m | J = ρ (cid:9) = 1 N N X m =1 (cid:8) σ m ∈ Σ( r ) (cid:9) . Having these quantities defined, we can bound the total variation in consideration as2 TV ( M| J , M N | J ) ≤ p + p N + X ρ ∈S n,J \ ˜Σ( r ) (cid:12)(cid:12) f M| J ( ρ ) − f M N | J ( ρ ) (cid:12)(cid:12) . (35)We now control each of the three terms on the right hand side. First, a union bound yields that p = M (cid:0) Σ( r ) (cid:1) ≤ X j ∈ J P σ ∼M (cid:8) | σ ( j ) − π i ( j ) | ≥ r for all i ∈ [ k ] (cid:9) ≤ X j ∈ J k X i =1 w i P σ ∼ M ( π i ) (cid:8) | σ ( j ) − π i ( j ) | ≥ r (cid:9) ( i ) ≤ X j ∈ J k X i =1 w i φ r − φ = 2 ℓφ r − φ , (36)where ( i ) follows from Lemma 6.3.Second, since { σ m ∈ Σ( r ) } are independent Bernoulli (cid:0) M (cid:0) Σ( r ) (cid:1)(cid:1) random variables for m ∈ [ N ], we have N · p N ∼ Binomial(
N, p ) in view of the formulas for p and p N above. Hence Bernstein’sinequality yields P (cid:8) p N > p + t (cid:9) ≤ exp (cid:16) − N t / p + t/ (cid:17) for any t >
0. Taking t = s/ P n p + p N > ℓφ r − φ + s o ≤ exp (cid:16) − N s / ℓφ r / (1 − φ ) + s/ (cid:17) . (37)Third, in view of definition (11) where { σ m | J = ρ } are independent Bernoulli (cid:0) f M| J ( ρ ) (cid:1) randomvariables, Hoeffding’s inequality yields P (cid:8) | f M N | J ( ρ ) − f M| J ( ρ ) | > t (cid:9) ≤ − N t )for any t >
0. For each ρ ∈ S n,J \ ˜Σ( r ), we have that for all j ∈ J , | ρ ( j ) − π i ( j ) | < r for some i ∈ [ k ]. Hence there are at most 2 kr possible choices for each ρ ( j ), and the cardinality of S n,J \ ˜Σ( r )is bounded by (2 kr ) ℓ . A union bound over S n,J \ ˜Σ( r ) then implies that, for any t > P n X ρ ∈S n,J \ ˜Σ( r ) (cid:12)(cid:12) f M| J ( ρ ) − f M N | J ( ρ ) (cid:12)(cid:12) > (2 kr ) ℓ t o ≤ kr ) ℓ exp( − N t ) . (38)31inally, we plug (37) and (38) into (35), choose r to be the smallest integer such that ℓφ r − φ ≤ s , and then set t = s (2 kr ) ℓ , to obtain P (cid:8) TV ( M| J , M N | J ) > s (cid:9) ≤ exp (cid:16) − N s / ℓφ r / (1 − φ ) + s/ (cid:17) + 2(2 kr ) ℓ exp (cid:16) − N s (2 kr ) ℓ (cid:17) . By the definition of r , we have exp (cid:0) − Ns / ℓφ r / (1 − φ )+ s/ (cid:1) ≤ exp( − N s ) . On the other hand, the definitionof r also implies that ℓφ r − φ > sφ . Hence we obtain r < log φ sφ (1 − φ )8 ℓ ≤ − φ log ℓs (1 − φ ) , whichcompletes the proof. We now prove Theorem 3.4. First, we apply Proposition 3.3 with s = η/ η = η ( k, ℓ, φ, γ ) isdefined in (12). Note that η is a polynomial in 1 − φ and γ when k and ℓ are constants. Therefore,if N ≥ poly k,ℓ ( − φ , γ ) · log δ for any δ ∈ (0 , . δ . Thatis, TV ( M| J , M N | J ) ≤ η/ − δ .Next, recall the collection M of Mallows model with discretized weights as defined in (13).Consider a mixture M ′ = P ki =1 r i L M ( π ρ i , φ ) ∈ M and its empirical version M ′ N ′ as constructedin SubOrder ( J ). If N ′ ≥ poly k,ℓ ( − φ , γ ) · log nδ for any δ ∈ (0 , . TV ( M ′ | J , M ′ N ′ | J ) ≤ η/ − δL k n kℓ . Recall that | M | ≤ L k n kℓ . Let E denote the event that (39)holds and (40) holds for all M ′ ∈ M . By a union bound, E has probability at least 1 − δ .We observe that there exists M ′ ∈ M for which | r i L − w i | ≤ η k and π ρ i | J = π i | J for each i ∈ [ k ].This is because L ≥ k/η , and if ρ i , π i | J then π ρ i | J = π i | J by definition. For this M ′ , we have TV ( M| J , M ′ | J ) = TV k X i =1 w i M ( π i ) | J , k X i =1 r i L M ( π ρ i ) | J ! ( i ) = TV k X i =1 w i M ( π i ) | J , k X i =1 r i L M ( π i ) | J ! ≤ k X i =1 (cid:12)(cid:12)(cid:12) w i − r i L (cid:12)(cid:12)(cid:12) ≤ · k = η , where step ( i ) follows from Lemma 3.2. This combined with (39) and (40) shows that on the event E , TV ( M N | J , M ′ N ′ | J ) ≤ TV ( M| J , M N | J ) + TV ( M| J , M ′ | J ) + TV ( M ′ | J , M ′ N ′ | J ) ≤ η/ , so that, by condition (14), SubOrder ( J ) succeeds without returning “error”.32inally, suppose that SubOrder ( J ) returns a set of relative orders { π ρ i k J : i ∈ [ k ] } that is notequal to the set { π i k J : i ∈ [ k ] } . Then Proposition 3.1 implies that TV ( M| J , M ′ | J ) ≥ η. As aresult, we have that on the event E , TV ( M N | J , M ′ N ′ | J ) ≥ TV ( M| J , M ′ | J ) − TV ( M| J , M N | J ) − TV ( M ′ | J , M ′ N ′ | J ) ≥ η/ . This contradicts condition (14). Therefore, the set of relative orders returned by
SubOrder ( J ) mustbe { π i k J : i ∈ [ k ] } . Corollary 3.5 guarantees exact recovery of the central permutations with probability at least 1 − n − , so we may assume that, up to a relabeling, ˆ π i = π i for each i ∈ [ k ]. It remains to study theestimation error for ˆ w defined in (16). Let ξ > w i . Recall that J is a subset of [ n ] such that ℓ , | J | ≤ k − π i k J = π j k J for anydistinct i, j ∈ [ k ]. The rest of the proof is analogous to that of Theorem 3.4, so we only present asketch.We first apply Proposition 3.3 with s = ξη/ η = η ( k/ , ℓ, φ, TV ( M| J , M N | J ) ≤ ξη/ − n − , if N ≥ ξ (log ξ ) ℓ +1 poly k ( − φ ) · log n . Similarly, if we choose N ′ ≥ N · k log L , then Proposition 3.3 together with a union bound over all r ∈ R ( L ) implies that TV ( M ′ ( r ) | J , M ′ N ′ ( r ) | J ) ≤ ξη/ , with probability 1 − n − . In the sequel, we condition on the event E of probability at least 1 − n − that both of the above bounds hold.Moreover, if we choose L ≥ kξη , then there exists r ∈ R ( L ) for which | r i L − w i | ≤ ξη k for any i ∈ [ k ]. Using the same argument as in the proof of Theorem 3.4, we obtain TV ( M| J , M ′ ( r ) | J ) ≤ ξη/ . As a result, for this r it holds that TV ( M ′ N ′ ( r ) | J , M N | J ) ≤ ξη/ . On the other hand, for any r ∈ R ( L ), if there exists i ∈ [ k ] for which | r i L − w i | ≥ ξ , thenProposition 3.6 implies that TV ( M ′ N ′ ( r ) | J , M N | J ) ≥ ξη/ E . Consequently, such an r cannot be equal to L ˆ w by definition (16). We concludethat ˆ w must satisfy that | ˆ w i − w i | ≤ ξ for each i ∈ [ k ].To complete the proof, we derive from the relation N ≥ ξ (log ξ ) ℓ +1 poly k ( − φ ) · log n that ξ ≤ (log N ) ℓ +1 N / (cid:0) poly k ( − φ ) · log n (cid:1) / ≤ (log N ) k − N / (cid:0) poly k ( − φ ) · log n (cid:1) / . .10 A conjecture on group determinant and the proof of Theorem 4.1 Recall that Theorem 4.1(a) is stated with a restriction on the number of components, k ≤ group determinant . Given any finite group G and variables t = ( t g : g ∈ G ), the group determinant F ( t ) is the determinant of the | G |×| G | matrix ( t g ◦ h − ) g,h ∈ G .For the symmetric group S n , a notable example is the determinant in (2) studied by Zagier [Zag92],which is F ( t ) evaluated at t σ = φ d KT ( σ, id ) with id being the identity permutation. To compute thisgroup determinant, Zagier introduced an intermediate one as follows. Fix any positive integer r .For each s ∈ [ r + 1], define a permutation τ s ∈ S r +1 by τ s ( i ) , i if i ≤ s,r + 1 if i = s + 1 ,i − i ≥ s + 2 . (41)In other words, τ s leaves the first s elements unchanged and inserts the last element right afterthem. Define a ( r + 1)! × ( r + 1)! matrix L indexed by π, σ ∈ S r +1 by˜ L π,σ , ( q s if π ◦ σ − = τ s , s ∈ [ r + 1]0 otherwise . (42)As studied in [Zag92, Theorem 2’], this is another instance of group determinant with t σ = q s if σ = τ s and 0 otherwise.Our restatement of Theorem 4.1 involves the following conjecture on a slight variant of thegroup determinant (42). Conjecture 6.10.
Define a ( r + 1)! × ( r + 1)! matrix L indexed by π, σ ∈ S r +1 by L π,σ , ( s if π ◦ σ − = τ s , s ∈ [ r + 1]0 otherwise . (43) Then the matrix L is invertible. Note that the matrix L is defined similarly to ˜ L , except that the nonzero entry q s in ˜ L isreplaced by s . Theorem 2’ of [Zag92] gives a formula for the determinant of ˜ L , which in particularimplies that ˜ L is invertible unless q is a root of unity. However, the proof technique there based onfactorizing ˜ L using group algebra does not seem to apply to the matrix L .Although we do not have a proof of Conjecture 6.10 for an arbitrary integer r , for small r theinvertibility of L can be verified numerically. In fact, since τ s (1) = 1 for any s ∈ [ r + 1], it is nothard to see that L is block-diagonal with r + 1 blocks of size r ! × r !. We are able to verify theinvertibility of the diagonal blocks up to r = 8, where each block is of size 40320 × k -componentMallows models provided that Conjecture 6.10 holds for r up to log k . Theorem 6.11 (Restatement of Theorem 4.1) . Let the class of Mallows k -mixtures M ∗ be definedby (17) . We let ε , − φ and consider the setting where n is fixed and ε → . For m ∗ k defined by(7), the following statements hold: a) Suppose that Conjecture 6.10 holds for all positive integers r ≤ r , and that k ≤ r − . Then,for any distinct Mallows mixtures M and M ′ in M ∗ , we have TV ( M , M ′ ) = Ω( ε m ∗ k ) . (b) On the other hand, for n ≥ m ∗ k , there exist distinct Mallows mixtures M and M ′ in M ∗ forwhich TV ( M , M ′ ) = O ( ε m ∗ k ) . The hidden constants in Ω( · ) and O ( · ) above may depend on n and k . As noted above, since Conjecture 6.10 holds up to r = 8, Theorem 4.1 indeed follows fromTheorem 6.11 for k ≤ − Throughout the proof, we let M = P ki =1 1 k M ( π i ) and M ′ = P ki =1 1 k M ( π ′ i ) for permutations π , . . . , π k , π ′ , . . . , π ′ k ∈ S n . The key to this proof is to relate the total variation distance be-tween M and M ′ to the comparison moments defined in Section 2.2, which allows us to leverageTheorem 2.6. The two parts of Theorem 6.11 are then established in Sections 6.11.3 and 6.11.4respectively. Write f i = f M ( π i ) and f ′ i = f M ( π ′ i ) for the PMFs of M ( π i ) and M ( π ′ i ) respectively. Then we have f i ( σ ) = 1 Z (1 − ε ) (1 − ε ) d KT ( σ,π i ) = 1 Z (1 − ε ) d KT ( σ,π i ) X ℓ =0 (cid:18) d KT ( σ, π i ) ℓ (cid:19) ( − ε ) ℓ , where Z (1 − ε ) → n ! as ε →
0. Therefore, the total variation between M and M ′ is TV ( M , M ′ ) = 12 k (cid:13)(cid:13)(cid:13)(cid:13) k X i =1 f i − k X i =1 f ′ i (cid:13)(cid:13)(cid:13)(cid:13) = 12 k X σ ∈S n (cid:12)(cid:12)(cid:12)(cid:12) k X i =1 f i ( σ ) − k X i =1 f ′ i ( σ ) (cid:12)(cid:12)(cid:12)(cid:12) = 12 kZ (1 − ε ) X σ ∈S n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k X i =1 d KT ( σ,π i ) X ℓ =0 (cid:18) d KT ( σ, π i ) ℓ (cid:19) ( − ε ) ℓ − k X i =1 d KT ( σ,π ′ i ) X ℓ =0 (cid:18) d KT ( σ, π ′ i ) ℓ (cid:19) ( − ε ) ℓ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = 12 kZ (1 − ε ) X σ ∈S n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n ( n − / X ℓ =0 k X i =1 (cid:20)(cid:18) d KT ( σ, π i ) ℓ (cid:19) − (cid:18) d KT ( σ, π ′ i ) ℓ (cid:19)(cid:21) ( − ε ) ℓ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) with the convention that (cid:0) dℓ (cid:1) , d ( d − ··· ( d − ℓ +1) ℓ ! = 0 if d < ℓ . Then TV ( M , M ′ ) = O ( ε m +1 ) if andonly if the coefficient of ε ℓ vanishes in the above formula for all ℓ ∈ [ m ] and all σ ∈ S n , that is, k X i =1 (cid:18) d KT ( σ, π i ) ℓ (cid:19) = k X i =1 (cid:18) d KT ( σ, π ′ i ) ℓ (cid:19) for all ℓ ∈ [ m ] , σ ∈ S n . By a simple inductive argument, we see that this is equivalent to k X i =1 d KT ( σ, π i ) ℓ = k X i =1 d KT ( σ, π ′ i ) ℓ for all ℓ ∈ [ m ] , σ ∈ S n . (44)Rewriting (44) in terms of expectations, we have proved the following result.35 roposition 6.12 (Distance moment matching) . Consider random permutations π ∼ k P ki =1 δ π i and π ′ ∼ k P ki =1 δ π ′ i . Under the conditions of Theorem 6.11, we have: • the order of TV ( M , M ′ ) in ε is a positive integer; • TV ( M , M ′ ) = O ( ε m +1 ) if and only if E [ d KT ( σ, π ) ℓ ] = E [ d KT ( σ, π ′ ) ℓ ] for all ℓ ∈ [ m ] , σ ∈ S n . We refer to E [ d KT ( σ, π ) ℓ ] as the ℓ th order distance moment of π at σ . By the above proposition, the order of TV ( M , M ′ ) in ε is determined by how many distancemoments are matched between the two k -mixtures π and π ′ . To characterize how this number ofmatched moments depends on k , it suffices to relate distance moments to comparison momentsdefined in Section 2.2, because we have studied identifying k -mixtures from comparison momentsin Theorem 2.6.We first set up the notation. Recall that X πi,j , { π ( i ) < π ( j ) } . Viewing each X πi,j as a variable,we use P m ( X π ) to denote any polynomial in { X πi,j } i = j of degree at most m , that is, any polynomialof the form X sets of pairs of distinct indices( i ,j ) ,..., ( i m ,j m ) ∈ [ n ] X πi ,j · · · X πi m ,j m . When we need to explicitly specify the variables, we also write P m ( X πi ,j , . . . , X πi ℓ ,j ℓ ). For example,the polynomial X , X , + X , X , + X , can be denoted by P ( X , , X , , X , , X , ).In addition, the definition of the Kendall tau distance can be written as d KT ( σ, π ) = X ( i,j ): σ ( i ) >σ ( j ) X πi,j . (45)Therefore, we have d KT ( σ, π ) m = X ( i ,j ) ,..., ( i m ,j m ): σ ( i ) >σ ( j ) , ... , σ ( i m ) >σ ( j m ) X πi ,j · · · X πi m ,j m = P m ( X π ) . (46)Moreover, consider real-valued functions a ( π ) , . . . , a n ( π ) and b ( π ) of π ∈ S n . We say that b ( π ) can be linearly constructed from the list of functions { a ( π ) , . . . , a n ( π ) } , if there exist realcoefficients c , . . . , c n that do not depend on π , such that P n i =1 c i a i ( π ) = b ( π ). If every functionin { b ( π ) , . . . , b n ( π ) } can be linearly constructed from { a ( π ) , . . . , a n ( π ) } , we write { a ( π ) , . . . , a n ( π ) } = ⇒ { b ( π ) , . . . 
, b n ( π ) } . By (46), it is clear that d KT ( σ, π ) m can be linearly constructed from the list (cid:8) X πi ,j · · · X πi m ,j m : i ℓ , j ℓ ∈ [ n ] , i ℓ = j ℓ , ℓ ∈ [ m ] (cid:9) for any σ ∈ S n . Therefore, we have (cid:8) X πi ,j · · · X πi m ,j m : i ℓ , j ℓ ∈ [ n ] , i ℓ = j ℓ , ℓ ∈ [ m ] (cid:9) = ⇒ (cid:8) d KT ( σ, π ) ℓ : σ ∈ S n , ℓ ∈ [ m ] (cid:9) . (47)Note that we do not explicitly have polynomials of degree less than m in the list on the LHS of(47). This is because, using the fact ( X πi,j ) r = X πi,j for any r ∈ N , we can write any polynomial ofdegree ℓ ≥ m ≥ ℓ by appending redundant variables ( X πi,j ) m − ℓ .The next lemma states the converse of (47), whose proof is deferred to Section 6.12.36 emma 6.13. Suppose that Conjecture 6.10 holds for all positive integers r ≤ r . Then, for anypositive integer m ≤ r , we have (cid:8) d KT ( σ, π ) ℓ : σ ∈ S n , ℓ ∈ [ m ] (cid:9) = ⇒ (cid:8) X πi ,j · · · X πi m ,j m : i ℓ , j ℓ ∈ [ n ] , i ℓ = j ℓ , ℓ ∈ [ m ] (cid:9) . (48) In other words, all polynomials in { X πi,j } i = j of degree at most m can be linearly constructed fromspecial polynomials d KT ( σ, π ) ℓ of degree ℓ ≤ m where σ ∈ S n . From (47) or (48), we easily obtain the following equivalence of distance moments and compar-ison moments.
Proposition 6.14 (Equivalence of distance and comparison moments) . Suppose that Conjec-ture 6.10 holds for all positive integers r ≤ r . Consider a random permutation π ∼ k P ki =1 δ π i ,where π , . . . , π k are unknown permutations in S n . For m ∈ N , consider the list of distant moments (cid:8) E [ d KT ( σ, π ) ℓ ] : σ ∈ S n , ℓ ∈ [ m ] (cid:9) (49) and the list of comparison moments (Definition 2.5) (cid:8) m ( π, I ) : I = (cid:0) ( i , j ) , . . . , ( i m , j m ) (cid:1) , i ℓ , j ℓ ∈ [ n ] , i ℓ < j ℓ , ℓ ∈ [ m ] (cid:9) . (50) Then (49) is a deterministic linear function of (50) , regardless of the unknown permutations π , . . . , π k . Conversely, (50) is a deterministic linear function of (49) provided that m ≤ r .Proof. Crucially, the constructions in (47) and (48) are linear and do not depend on π . Therefore,taking the expectation with respect to π ∼ k P ki =1 δ π i , we see that the list of distance moments(49) and the list (cid:8) E [ X πi ,j · · · X πi m ,j m ] : i ℓ , j ℓ ∈ [ n ] , i ℓ = j ℓ , ℓ ∈ [ m ] (cid:9) (51)are linear functions of each other, independent of π , . . . , π k . By Definition 2.5 and (6), the list ofcomparison moments (50) and the list (51) both contain all possible expectations of products of m variables, and are therefore linear functions of each other. Finally, note that (48) holds for r ≤ r ,so the converse direction holds under the same condition.Having established the equivalence of the two types of moments, we are ready to prove the twoparts of Theorem 6.11. Suppose that for the two Mallows mixtures M = P ki =1 1 k M ( π i ) and M ′ = P ki =1 1 k M ( π ′ i ), we have TV ( M , M ′ ) = O ( ε m +1 ) . Considering random permutations π ∼ k P ki =1 δ π i and π ′ ∼ k P ki =1 δ π ′ i ,we obtain from Proposition 6.12 that E [ d KT ( σ, π ) ℓ ] = E [ d KT ( σ, π ′ ) ℓ ] for all ℓ ∈ [ m ] , σ ∈ S n . As k ≤ r − m = ⌊ log k ⌋ + 1 ≤ r , Proposition 6.14 yields that m ( π, I ) = m ( π ′ , I ) forany tuple I of pairs of distinct indices ( i , j ) , . . . , ( i m , j m ) ∈ [ n ] . That is, the group of pairwisecomparisons on any I coincides for the two mixtures π and π ′ . Since m = ⌊ log k ⌋ + 1, thealgorithm from part (a) of Theorem 2.6 can recover the noiseless mixture of permutations fromgroups of m pairwise comparisons. Consequently, we must have k P ki =1 δ π i = k P ki =1 δ π ′ i , andtherefore M = M ′ .Since the the order of TV ( M , M ′ ) in ε is necessarily an integer according to Proposition 6.12,we conclude that, if M 6 = M ′ , then TV ( M , M ′ ) = Ω( ε m ).37 .11.4 Proof of part (b) By part (b) of Theorem 2.6, there exist distinct mixtures k P ki =1 δ π i and k P ki =1 δ π ′ i of permutationsin S n , that cannot be distinguished using any groups of m − π and π ′ denote random permutations from the above two mixtures respectively. Then we have m ( π, I ) = m ( π ′ , I ) for any tuple I of m − n ]. Hence Proposition 6.14 impliesthat E [ d KT ( σ, π ) ℓ ] = E [ d KT ( σ, π ′ ) ℓ ] for all ℓ ∈ [ m − , σ ∈ S n . It then follows from Proposition 6.12that TV ( M , M ′ ) = O ( ε m ) for M = P ki =1 1 k M ( π i ) and M ′ = P ki =1 1 k M ( π ′ i ). Throughout this section, we suppose that Conjecture 6.10 holds for all positive integers r ≤ r . We establish the following lemmas before proving Lemma 6.13. The following result gives conditionsunder which the polynomial X πi ,j · · · X πi m ,j m has degree strictly less than m . For example, X πab X πba ≡ X πab X πbc X πca = X πab X πbc has degree 2. 
(In the language of the next lemma, thegraph G corresponds to a double edge and a triangle respectively). Lemma 6.15.
Fix a monomial X πi ,j · · · X πi m ,j m , where ( i , j ) , . . . , ( i m , j m ) are pairs of distinctindices in [ n ] . Consider the undirected multigraph G with vertex set { i , j , . . . , i m , j m } and edgeset { ( i , j ) , . . . , ( i m , j m ) } . If G contains a cycle, then X πi ,j · · · X πi m ,j m = P m − ( X π ) .Proof. Up to a relabeling, we assume without loss of generality that the cycle is composed of undi-rected edges ( i , j ) , . . . , ( i ℓ , j ℓ ). Let the ordered vertex sequence of the cycle be ( v , v , . . . , v ℓ , v ).In particular, we have { i , j , . . . , i ℓ , j ℓ } = { v , v , . . . , v ℓ } as sets, and each pair ( i r , j r ) is equal tosome ( v s , v s +1 ) or ( v s +1 , v s ). Hence we have that either X πi r ,j r = X πv s ,v s +1 or X πi r ,j r = 1 − X πv s ,v s +1 .It then follows that X πi ,j · · · X πi ℓ ,j ℓ = X πv ,v · · · X πv ℓ − ,v ℓ X πv ℓ ,v + P ℓ − ( X π ) . In addition, if X πv ,v · · · X πv ℓ − ,v ℓ = 1, then π ( v ) < π ( v ) < · · · < π ( v ℓ ), so we must have X πv ℓ ,v = 0.As a result, it holds that X πv ,v · · · X πv ℓ − ,v ℓ X πv ℓ ,v = 0 and X πi ,j · · · X πi ℓ ,j ℓ = P ℓ − ( X π ). We concludethat X πi ,j · · · X πi m ,j m = P m − ( X π ). Lemma 6.16.
With the same notation as in Lemma 6.15, if the graph G is a tree, then there existbijections τ , . . . , τ β : [ m + 1] → { i , j , . . . , i m , j m } such that X πi ,j · · · X πi m ,j m = β X α =1 X πτ α (1) ,τ α (2) · · · X πτ α ( m ) ,τ α ( m +1) . (52) Proof.
First, since G is a tree with m edges, the cardinality of its vertex set V , { i , j , . . . , i m , j m } is exactly m + 1. This justifies the possibility of defining a bijection from [ m + 1] to V .Let ˜ G denote the directed tree with vertex set V and edge set { ( i , j ) , . . . , ( i m , j m ) } ; that is,˜ G is the directed version of G . It is well known that the reachability relations between vertices Reachability refers to the existence of a directed path from one vertex to another in a directed graph.
38f any directed tree form a partial order of the vertices. Let P denote this partial order for ˜ G .Furthermore, let τ − , . . . , τ − β : V → [ m + 1] denote all possible linear extensions of the partialorder P . That is, each τ α is a bijection such that τ − α ( i ℓ ) < τ − α ( j ℓ ), where α ∈ [ β ].Furthermore, recall that the permutation π induces a total order on V , which we denote by π k V as before. Note that the monomial X πi ,j · · · X πi m ,j m = (cid:8) π ( i ) < π ( j ) , · · · , π ( i m ) < π ( j m ) (cid:9) is equal to 1 if and only the total order π k V is compatible with P .On the other hand, we observe that X πτ α (1) ,τ α (2) · · · X πτ α ( m ) ,τ α ( m +1) = (cid:8) π (cid:0) τ α (1) (cid:1) < π (cid:0) τ α (2) (cid:1) < · · · < π (cid:0) τ α ( m + 1) (cid:1)(cid:9) = (cid:8) π k V ◦ τ α (1) < π k V ◦ τ α (2) < · · · < π k V ◦ τ α ( m + 1) (cid:9) . Since each π k V ◦ τ α is a permutation on [ m + 1], the above indicator is equal to 1 if and only if thetwo total orders π k V and τ − α coincide.Combining the above pieces, we see that (52) is equivalent to stating that (cid:8) π k V is compatible with P (cid:9) = β X α =1 (cid:8) π k V = τ − α (cid:9) , which is tautologically true, as τ − , . . . , τ − β are all the linear extensions of P by definition.For n ∈ N and r ∈ [ n ], we use the notation[ n ] r , { ( i , . . . , i r ) ∈ [ n ] r : i , . . . , i r are distinct } . Lemma 6.17.
For any fixed π ∈ S n and a positive integer r ≤ ( n − ∧ r , we have n r − Y s =1 X πi s ,i s +1 · (cid:16) r X t =1 X πi t ,i r +1 (cid:17) : ( i , . . . , i r +1 ) ∈ [ n ] r +1 o = ⇒ n r Y s =1 X πi s ,i s +1 : ( i , . . . , i r +1 ) ∈ [ n ] r +1 o . Proof.
First, we note that for any t ∈ [ r ], r − Y s =1 X πi s ,i s +1 · X πi t ,i r +1 = r − Y s =1 { π ( i s ) < π ( i s +1 ) } · { π ( i t ) < π ( i r +1 ) } = (cid:8) π ( i ) < π ( i ) < · · · < π ( i r ) , π ( i t ) < π ( i r +1 ) (cid:9) = r X s = t (cid:8) π ( i ) < π ( i ) < · · · < π ( i s ) < π ( i r +1 ) < π ( i s +1 ) < π ( i s +2 ) < · · · < π ( i r ) (cid:9) . The last equality holds because if π ( i ) < π ( i ) < · · · < π ( i r ) and π ( i t ) < π ( i r +1 ), then the index i r +1 can possibly be placed by π in any of the r − t + 1 locations after i t . Summing the above A linear extension of a partial order is a total order that is compatible with the partial order. Here, we identifyeach total order on a finite set V with a bijection from V to [ | V | ]. t ∈ [ r ] yields r − Y s =1 X πi s ,i s +1 · (cid:16) r X t =1 X πi t ,i r +1 (cid:17) = r X t =1 r X s = t (cid:8) π ( i ) < π ( i ) < · · · < π ( i s ) < π ( i r +1 ) < π ( i s +1 ) < π ( i s +2 ) < · · · < π ( i r ) (cid:9) = r X s =1 s · (cid:8) π ( i ) < π ( i ) < · · · < π ( i s ) < π ( i r +1 ) < π ( i s +1 ) < π ( i s +2 ) < · · · < π ( i r ) (cid:9) . On the other hand, we have r Y s =1 X πi s ,i s +1 = (cid:8) π ( i ) < π ( i ) < · · · < π ( i r +1 ) (cid:9) . Hence the linear construction that we need to establish is equivalent to n r X s =1 s · (cid:8) π ( i ) < π ( i ) < · · · < π ( i s ) < π ( i r +1 ) < π ( i s +1 ) < π ( i s +2 ) < · · · < π ( i r ) (cid:9) :( i , . . . , i r +1 ) ∈ [ n ] r +1 o = ⇒ n (cid:8) π ( i ) < π ( i ) < · · · < π ( i r +1 ) (cid:9) : ( i , . . . , i r +1 ) ∈ [ n ] r +1 o . (53)Note that the sets on the left and right hand sides of (53) are indexed by distinct tuples i , . . . , i r +1 . Next we fix a set of r + 1 indices in [ n ], but allow their order to vary. That is, let usfix distinct indices i , . . . , i r +1 ∈ [ n ], and consider i σ (1) , . . . , i σ ( r +1) where σ is any permutation in S r +1 . To show (53), then it suffices to establish the linear construction n r X s =1 s · (cid:8) π ( i σ (1) ) < π ( i σ (2) ) < · · · < π ( i σ ( s ) ) < π ( i σ ( r +1) ) < π ( i σ ( s +1) ) < π ( i σ ( s +2) ) < · · · < π ( i σ ( r ) ) (cid:9) : σ ∈ S r +1 o = ⇒ n (cid:8) π ( i σ (1) ) < π ( i σ (2) ) < · · · < π ( i σ ( r +1) ) (cid:9) : σ ∈ S r +1 o . (54)To prove (54), we define a vector v ∈ { , } ( r +1)! , indexed by σ ∈ S r +1 , by v σ , (cid:8) π ( i σ (1) ) < π ( i σ (2) ) < · · · < π ( i σ ( r +1) ) (cid:9) . With τ s defined by (41), we see that (54) is equivalent to n r X s =1 s · v σ ◦ τ s : σ ∈ S r +1 o = ⇒ n v σ : σ ∈ S r +1 o . (55)Finally, let L be defined by (43), and let L ′ be the matrix indexed by π, σ ∈ S r +1 defined by L ′ π,σ = L π − ,σ − . Then we have r X s =1 s · v σ ◦ τ s = X σ ′ ∈S r +1 s · v σ ′ { σ − ◦ σ ′ = τ s } = X σ ′ ∈S r +1 L σ − , ( σ ′ ) − v σ ′ = ( L ′ v ) σ , L ′ is invertible. Moreover, L ′ is invertible if and only if L is invertible,because one can be obtained from the other by shuffling the columns and rows. Finally, applyingConjecture 6.10 finishes the proof. (In fact, this is the only step of the entire proof where theconjecture is used.)For any permutation π ∈ S n and distinct indices i , i , . . . , i ℓ +1 ∈ [ n ], we define Q πi ,...,i ℓ +1 , X πi ,i ( X πi ,i + X πi ,i )( X πi ,i + X πi ,i + X πi ,i ) · · · ( X πi ,i ℓ +1 + X πi ,i ℓ +1 + · · · + X πi ℓ ,i ℓ +1 )= ℓ Y s =1 (cid:16) s X t =1 X πi t ,i s +1 (cid:17) , (56)which is a degree- ℓ polynomial in X πi,j ’s. Lemma 6.18.
For any fixed π ∈ S n and a positive integer ℓ ≤ ( n − ∧ r , we have (cid:8) Q πi ,...,i ℓ +1 : ( i , . . . , i ℓ +1 ) ∈ [ n ] ℓ +1 (cid:9) = ⇒ (cid:8) X πi ,i X πi ,i · · · X πi ℓ ,i ℓ +1 : ( i , . . . , i ℓ +1 ) ∈ [ n ] ℓ +1 (cid:9) . Proof.
The construction in Lemma 6.17 is linear and thus can be applied even if we multiplyevery polynomial by a common factor Q ℓs = r +1 (cid:0) P st =1 X πi t ,i s +1 (cid:1) . Therefore, we obtain, for each r = 1 , . . . , ℓ + 1, n r − Y s =1 X πi s ,i s +1 · ℓ Y s = r (cid:16) s X t =1 X πi t ,i s +1 (cid:17) : ( i , . . . , i ℓ +1 ) ∈ [ n ] ℓ +1 o = ⇒ n r Y s =1 X πi s ,i s +1 · ℓ Y s = r +1 (cid:16) s X t =1 X πi t ,i s +1 (cid:17) : ( i , . . . , i ℓ +1 ) ∈ [ n ] ℓ +1 o . (57)Here a product is understood as 1 if the bottom index exceeds the top, by convention. Note thatthe quantity Q r − s =1 X πi s ,i s +1 · Q ℓs = r (cid:0) P st =1 X πi t ,i s +1 (cid:1) is equal to Q πi ,...,i ℓ +1 for r = 1 and is equal to X πi ,i X πi ,i · · · X πi ℓ ,i ℓ +1 for r = ℓ + 1. As a result, applying (57) iteratively with r = 1 , , . . . , ℓ yieldsthe lemma. We are ready to prove Lemma 6.13. Let us first establish the statement for m = 1: { d KT ( σ, π ) : σ ∈ S n } = ⇒ { X πi ,j : i , j ∈ [ n ] , i = j } . (58)Toward this end, we choose σ, σ ∈ S n such that σ ( i ) = 1, σ ( j ) = 2, σ ( i ) = 2, σ ( j ) = 1, and σ ( r ) = σ ( r ) for r = i , j . Then it follows from (45) that d KT ( σ, π ) − d KT ( σ, π ) = X ( i,j ): σ ( i ) >σ ( j ) X πi,j − X ( i,j ): σ ( i ) >σ ( j ) X πi,j = X πi ,j − X πj ,i = 2 X πi ,j − . Therefore, (58) indeed holds.With the base case m = 1 established, we can prove the lemma using an induction on m .Therefore, it suffices to show that P , (cid:8) d KT ( σ, π ) ℓ : σ ∈ S n , ℓ ∈ [ m ] (cid:9) ∪ (cid:8) X πi ,j · · · X πi m − ,j m − : i ℓ , j ℓ ∈ [ n ] , i ℓ = j ℓ , ℓ ∈ [ m − (cid:9) = ⇒ (cid:8) X πi ,j · · · X πi m ,j m : i ℓ , j ℓ ∈ [ n ] , i ℓ = j ℓ , ℓ ∈ [ m ] (cid:9) , (59)41hat is, all degree- m polynomials in X πi,j ’s can be linearly constructed from d KT ( σ, π ) ℓ where ℓ ≤ m together with degree- ℓ polynomials in X πi,j ’s where ℓ ≤ m − ℓ ≤ m ≤ r , and continue to use the notation Q πi ,...,i ℓ +1 defined by (56). Moreover, we say that theindices i , . . . , i ℓ appear consecutively in π if π ( i ) = π ( i ) − π ( i ) − · · · = π ( i ℓ ) − ℓ + 1 . For example, the indices 5 , , , , , , , Step 1.
Step 1. We claim that for any fixed $\ell \in \{0, 1, \dots, m\}$, any permutation $\sigma \in S_n$, and indices $i_1, i_2, \dots, i_{\ell+1}$ appearing consecutively in $\sigma$, the polynomial
$$
d_{\mathrm{KT}}(\sigma, \pi)^{m-\ell} \, Q^\pi_{i_1, \dots, i_{\ell+1}} \tag{60}
$$
can be linearly constructed from $P$.

Toward this end, we proceed by induction on $\ell$. The base case $\ell = 0$ is trivial. Now assume that the claim holds for a fixed $\ell \in \{0, 1, \dots, m-1\}$. Consider any $\sigma_1, \sigma_2 \in S_n$ such that:

• $i_1, i_2, \dots, i_{\ell+2}$ appear consecutively in $\sigma_1$;

• $\sigma_2(i_r) = \sigma_1(i_r) + 1$ for $r \in [\ell+1]$ and $\sigma_2(i_{\ell+2}) = \sigma_1(i_1)$; that is, $\sigma_2$ is obtained from $\sigma_1$ by inserting $i_{\ell+2}$ to the position right before $i_1$ (and shifting $i_1, \dots, i_{\ell+1}$ to the right accordingly).

Since $i_1, \dots, i_{\ell+1}$ appear consecutively in both $\sigma_1$ and $\sigma_2$, the induction hypothesis implies that the polynomial (60) with either $\sigma = \sigma_1$ or $\sigma = \sigma_2$ can be linearly constructed from $P$.

Moreover, $\sigma_1$ and $\sigma_2$ only differ within the labels $i_1, \dots, i_{\ell+2}$. Hence, by definition (45),
$$
d_{\mathrm{KT}}(\sigma_2, \pi) = d_{\mathrm{KT}}(\sigma_1, \pi) - X^\pi_{i_{\ell+2}, i_1} - X^\pi_{i_{\ell+2}, i_2} - \cdots - X^\pi_{i_{\ell+2}, i_{\ell+1}} + X^\pi_{i_1, i_{\ell+2}} + X^\pi_{i_2, i_{\ell+2}} + \cdots + X^\pi_{i_{\ell+1}, i_{\ell+2}} = d_{\mathrm{KT}}(\sigma_1, \pi) + 2 \big( X^\pi_{i_1, i_{\ell+2}} + X^\pi_{i_2, i_{\ell+2}} + \cdots + X^\pi_{i_{\ell+1}, i_{\ell+2}} \big) - \ell - 1,
$$
where the second equality follows from the fact that $X^\pi_{i,j} + X^\pi_{j,i} = 1$. Applying this relation and the binomial expansion of $d_{\mathrm{KT}}(\sigma_2, \pi)^{m-\ell}$, we obtain
$$
d_{\mathrm{KT}}(\sigma_2, \pi)^{m-\ell} = d_{\mathrm{KT}}(\sigma_1, \pi)^{m-\ell} + 2(m-\ell) \, d_{\mathrm{KT}}(\sigma_1, \pi)^{m-\ell-1} \big( X^\pi_{i_1, i_{\ell+2}} + \cdots + X^\pi_{i_{\ell+1}, i_{\ell+2}} \big) + \sum_{r=2}^{m-\ell} \binom{m-\ell}{r} d_{\mathrm{KT}}(\sigma_1, \pi)^{m-\ell-r} \, P_r\big( X^\pi_{i_1, i_{\ell+2}}, \dots, X^\pi_{i_{\ell+1}, i_{\ell+2}} \big) + P_{m-\ell-1}(X^\pi).
$$
Multiplying the above equation by the degree-$\ell$ polynomial $Q^\pi_{i_1, \dots, i_{\ell+1}}$, we obtain
$$
2(m-\ell) \, d_{\mathrm{KT}}(\sigma_1, \pi)^{m-\ell-1} \, Q^\pi_{i_1, \dots, i_{\ell+2}} \tag{61}
$$
$$
= d_{\mathrm{KT}}(\sigma_2, \pi)^{m-\ell} \, Q^\pi_{i_1, \dots, i_{\ell+1}} - d_{\mathrm{KT}}(\sigma_1, \pi)^{m-\ell} \, Q^\pi_{i_1, \dots, i_{\ell+1}} + P_{m-1}(X^\pi) \tag{62}
$$
$$
- \sum_{r=2}^{m-\ell} \binom{m-\ell}{r} d_{\mathrm{KT}}(\sigma_1, \pi)^{m-\ell-r} \, Q^\pi_{i_1, \dots, i_{\ell+1}} \, P_r\big( X^\pi_{i_1, i_{\ell+2}}, \dots, X^\pi_{i_{\ell+1}, i_{\ell+2}} \big). \tag{63}
$$
The goal of the induction step is to linearly construct $d_{\mathrm{KT}}(\sigma_1, \pi)^{m-\ell-1} Q^\pi_{i_1, \dots, i_{\ell+2}}$ from $P$. Therefore, it suffices to show that each term in (62) and (63) can be linearly constructed from $P$. First, note that all three terms in (62) can be so constructed in view of the induction hypothesis, because $P$ contains all polynomials of degree at most $m-1$. Next, while it is clear that each summand in (63) is of degree at most $m$, we will show that it is in fact at most $m-1$, which will complete the proof by induction.

In view of the definition (56), it suffices to show that for each $2 \leq r \leq m - \ell$,
$$
X^\pi_{i_1, i_2} \big( X^\pi_{i_1, i_3} + X^\pi_{i_2, i_3} \big) \cdots \big( X^\pi_{i_1, i_{\ell+1}} + \cdots + X^\pi_{i_\ell, i_{\ell+1}} \big) \, P_r\big( X^\pi_{i_1, i_{\ell+2}}, \dots, X^\pi_{i_{\ell+1}, i_{\ell+2}} \big) \tag{64}
$$
is of degree at most $\ell + r - 1$. Expanding the products, it suffices to consider each monomial of the form
$$
X^\pi_{j_1, i_2} X^\pi_{j_2, i_3} \cdots X^\pi_{j_\ell, i_{\ell+1}} X^\pi_{j_{\ell+1}, i_{\ell+2}} X^\pi_{j'_{\ell+1}, i_{\ell+2}} \, P_{r-2}(X^\pi), \tag{65}
$$
where $j_r \in \{i_1, \dots, i_r\}$ for $r = 1, \dots, \ell+1$ and $j'_{\ell+1} \in \{i_1, \dots, i_{\ell+1}\}$. Consider the undirected multigraph $G$ with the $\ell+1$ vertices $i_1, i_2, \dots, i_{\ell+1}$ and the $\ell$ edges $(j_1, i_2), (j_2, i_3), \dots, (j_\ell, i_{\ell+1})$. Then $G$ is clearly connected since by assumption $j_r \in \{i_1, \dots, i_r\}$ for each $r \in [\ell]$. Now, if we add one more vertex $i_{\ell+2}$ and two more edges $(j_{\ell+1}, i_{\ell+2})$, $(j'_{\ell+1}, i_{\ell+2})$ to the graph $G$, where $j_{\ell+1}, j'_{\ell+1} \in \{i_1, \dots, i_{\ell+1}\}$, then there must be a cycle in the new multigraph that contains $i_{\ell+2}$. Hence Lemma 6.15 yields that
$$
X^\pi_{j_1, i_2} X^\pi_{j_2, i_3} \cdots X^\pi_{j_\ell, i_{\ell+1}} X^\pi_{j_{\ell+1}, i_{\ell+2}} X^\pi_{j'_{\ell+1}, i_{\ell+2}} = P_{\ell+1}(X^\pi).
$$
It follows that the polynomial (65) and thus the polynomial (64) are of degree at most $\ell + r - 1$.
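To see the degree drop provided by Lemma 6.15 in the smallest case (a small illustration of ours), recall that $X^\pi_{i,j} \in \{0,1\}$ indicates the relative order of $i$ and $j$ under $\pi$, and this order is transitive: $X^\pi_{i,j} = X^\pi_{j,k} = 1$ forces $X^\pi_{i,k} = 1$. Hence, for three distinct indices $i, j, k$ whose pairs form a triangle (the smallest cycle) in the associated multigraph,
$$
X^\pi_{i,j} \, X^\pi_{j,k} \, X^\pi_{i,k} = X^\pi_{i,j} \, X^\pi_{j,k},
$$
so a degree-$3$ monomial collapses to degree $2$. This is precisely the mechanism by which each summand in (63) loses one degree.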
Step 2. We claim that for any set of distinct indices $\{ i^{(r)}_t \in [n] : t \in [\ell_r + 1], \, r \in [s] \}$ with $\sum_{r=1}^{s} \ell_r = m$, the polynomial
$$
\prod_{r=1}^{s} Q^\pi_{i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}} \tag{66}
$$
can be linearly constructed from $P$.

In short, this follows from applying Step 1 iteratively. Specifically, we show that
$$
d_{\mathrm{KT}}(\sigma, \pi)^{m-\ell} \prod_{r=1}^{s} Q^\pi_{i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}} \tag{67}
$$
can be linearly constructed from $P$, where $\ell \in \{0, 1, \dots, m\}$, $\sum_{r=1}^{s} \ell_r = \ell$, and $\sigma$ is any permutation in $S_n$ such that $i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}$ appear consecutively in $\sigma$ for each $r \in [s]$. Note that (66) is a special case of (67) when $\ell = m$.

Moreover, (60) is a special case of (67) when $s = 1$. With this base case established, we can construct (67) inductively on $s$. That is, for $s \geq 2$, it suffices to construct (67) from
$$
P \cup \Big\{ d_{\mathrm{KT}}(\sigma, \pi)^{m - \ell'} \prod_{r=1}^{s-1} Q^\pi_{i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}} : \ell' = \sum_{r=1}^{s-1} \ell_r, \; i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1} \text{ appear consecutively in } \sigma \text{ for each } r \in [s-1] \Big\}. \tag{68}
$$
Toward this end, we apply the linear construction in Step 1, with $m$ replaced by $m - \ell'$, with $\ell$ replaced by $\ell_s$, and with the extra constraint that $i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}$ appear consecutively in $\sigma$ for each $r \in [s-1]$. This shows that
$$
d_{\mathrm{KT}}(\sigma, \pi)^{m - \ell' - \ell_s} \, Q^\pi_{i^{(s)}_1, \dots, i^{(s)}_{\ell_s + 1}}, \tag{69}
$$
where $i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}$ appear consecutively in $\sigma$ for each $r \in [s]$, can be linearly constructed from
$$
\big\{ d_{\mathrm{KT}}(\sigma, \pi)^{\ell'_s} : \ell'_s \in [m - \ell'], \; i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1} \text{ appear consecutively in } \sigma \text{ for each } r \in [s-1] \big\} \cup \big\{ X^\pi_{i_1, j_1} \cdots X^\pi_{i_{m - \ell' - 1}, j_{m - \ell' - 1}} : i_r, j_r \in [n], \, i_r \neq j_r, \, r \in [m - \ell' - 1] \big\}. \tag{70}
$$
Since the construction is linear, if we multiply (69) and each polynomial in (70) by the same factor $\prod_{r=1}^{s-1} Q^\pi_{i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}}$, the linear construction remains valid. With $\ell' = \sum_{r=1}^{s-1} \ell_r$ and $\ell = \ell' + \ell_s$, this shows that (67) can be linearly constructed from
$$
\Big\{ d_{\mathrm{KT}}(\sigma, \pi)^{\ell'_s} \prod_{r=1}^{s-1} Q^\pi_{i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}} : \ell'_s \in [m - \ell'], \; i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1} \text{ appear consecutively in } \sigma \text{ for each } r \in [s-1] \Big\} \cup \Big\{ X^\pi_{i_1, j_1} \cdots X^\pi_{i_{m - \ell' - 1}, j_{m - \ell' - 1}} \prod_{r=1}^{s-1} Q^\pi_{i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}} : i_r, j_r \in [n], \, i_r \neq j_r, \, r \in [m - \ell' - 1] \Big\}. \tag{71}
$$
Since every polynomial in (71) can be linearly constructed from (68), this completes the induction.
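Both the step just completed and the proof of Lemma 6.18 rely on the observation that a linear construction survives multiplication by a fixed polynomial; spelled out once (in our notation): if $q = \sum_{p \in P'} a_p \, p$ for some finite family $P'$ and coefficients $a_p$, then for any fixed polynomial $R$,
$$
R \, q = \sum_{p \in P'} a_p \, (R \, p),
$$
so $Rq$ is linearly constructed from $\{ R p : p \in P' \}$ with the same coefficients. This is exactly the "common factor" device used to pass from (69)–(70) to (71).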
Step 3. We claim that for any set of distinct indices $\{ i^{(r)}_t \in [n] : t \in [\ell_r + 1], \, r \in [s] \}$ with $\sum_{r=1}^{s} \ell_r = m$, the monomial
$$
\prod_{r=1}^{s} X^\pi_{i^{(r)}_1, i^{(r)}_2} X^\pi_{i^{(r)}_2, i^{(r)}_3} \cdots X^\pi_{i^{(r)}_{\ell_r}, i^{(r)}_{\ell_r + 1}} \tag{72}
$$
can be linearly constructed from $P$.

By the claim in Step 2, it suffices to prove that
$$
\Big\{ \prod_{r=1}^{s} Q^\pi_{i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}} : i^{(r)}_t \in [n], \, t \in [\ell_r + 1], \, r \in [s], \, \sum_{r=1}^{s} \ell_r = m \Big\} \implies \Big\{ \prod_{r=1}^{s} X^\pi_{i^{(r)}_1, i^{(r)}_2} X^\pi_{i^{(r)}_2, i^{(r)}_3} \cdots X^\pi_{i^{(r)}_{\ell_r}, i^{(r)}_{\ell_r + 1}} : i^{(r)}_t \in [n], \, t \in [\ell_r + 1], \, r \in [s], \, \sum_{r=1}^{s} \ell_r = m \Big\}, \tag{73}
$$
where all the indices $i^{(r)}_t$ are distinct. In fact, this follows from the fact that
$$
\big\{ Q^\pi_{i_1, \dots, i_{\ell+1}} : i_1, \dots, i_{\ell+1} \in [n] \big\} \implies \big\{ X^\pi_{i_1, i_2} X^\pi_{i_2, i_3} \cdots X^\pi_{i_\ell, i_{\ell+1}} : i_1, \dots, i_{\ell+1} \in [n] \big\}, \tag{74}
$$
which is an immediate consequence of Lemma 6.18.

To prove (73) using (74), we note that the construction in (74) is linear and thus can be applied to the indices $i^{(q)}_1, \dots, i^{(q)}_{\ell_q + 1}$ where $q \in [s]$ to obtain
$$
\Big\{ \prod_{r=1}^{q} Q^\pi_{i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}} \prod_{r=q+1}^{s} X^\pi_{i^{(r)}_1, i^{(r)}_2} \cdots X^\pi_{i^{(r)}_{\ell_r}, i^{(r)}_{\ell_r + 1}} : i^{(r)}_t \in [n], \, t \in [\ell_r + 1], \, r \in [s], \, \sum_{r=1}^{s} \ell_r = m \Big\} \implies \Big\{ \prod_{r=1}^{q-1} Q^\pi_{i^{(r)}_1, \dots, i^{(r)}_{\ell_r + 1}} \prod_{r=q}^{s} X^\pi_{i^{(r)}_1, i^{(r)}_2} \cdots X^\pi_{i^{(r)}_{\ell_r}, i^{(r)}_{\ell_r + 1}} : i^{(r)}_t \in [n], \, t \in [\ell_r + 1], \, r \in [s], \, \sum_{r=1}^{s} \ell_r = m \Big\}. \tag{75}
$$
Iteratively applying (75) with $q = s, s-1, \dots, 1$ then yields (73).
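For orientation, the smallest nontrivial instance of (74), written under the product form of $Q^\pi$ visible in (64) (our reading of definition (56)): for $\ell = 2$,
$$
Q^\pi_{i_1, i_2, i_3} = X^\pi_{i_1, i_2} \big( X^\pi_{i_1, i_3} + X^\pi_{i_2, i_3} \big) = X^\pi_{i_1, i_2} X^\pi_{i_2, i_3} + X^\pi_{i_1, i_2} X^\pi_{i_1, i_3},
$$
so the path monomial $X^\pi_{i_1, i_2} X^\pi_{i_2, i_3}$ appears as one term of $Q^\pi_{i_1, i_2, i_3}$, and the iteration (57) is what removes the remaining terms linearly.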
Step 4. To finish the proof of (59), fix $m$ pairs of distinct indices $(i_1, j_1), \dots, (i_m, j_m) \in [n]^2$. Consider the undirected multigraph $G$ consisting of the edges $(i_1, j_1), \dots, (i_m, j_m)$. If $G$ contains a cycle, then $X^\pi_{i_1, j_1} \cdots X^\pi_{i_m, j_m} = P_{m-1}(X^\pi)$ by Lemma 6.15, so it is already in $P$. Hence we can assume that $G$ is acyclic, that is, it is a forest.

Let $G_1, \dots, G_s$ denote the connected components of $G$, each of which is a tree. Let $V(G_r)$ and $E(G_r)$ denote the vertex set and the edge set of $G_r$ respectively for each $r \in [s]$. Let $\ell_r := |V(G_r)| - 1 = |E(G_r)|$, so that $\sum_{r=1}^{s} \ell_r = m$. Moreover, we can write
$$
X^\pi_{i_1, j_1} \cdots X^\pi_{i_m, j_m} = \prod_{r=1}^{s} \prod_{(i,j) \in E(G_r)} X^\pi_{i,j}. \tag{76}
$$
By Lemma 6.16 applied to $G_r$, there exist bijections $\tau^{(r)}_1, \dots, \tau^{(r)}_{\beta_r} : [\ell_r + 1] \to V(G_r)$ such that
$$
\prod_{(i,j) \in E(G_r)} X^\pi_{i,j} = \sum_{\alpha=1}^{\beta_r} X^\pi_{\tau^{(r)}_\alpha(1), \tau^{(r)}_\alpha(2)} \cdots X^\pi_{\tau^{(r)}_\alpha(\ell_r), \tau^{(r)}_\alpha(\ell_r + 1)}. \tag{77}
$$
Combining (76) and (77) and expanding the product of sums, we see that $X^\pi_{i_1, j_1} \cdots X^\pi_{i_m, j_m}$ is a sum of monomials of the form
$$
\prod_{r=1}^{s} X^\pi_{\tau^{(r)}_{\alpha_r}(1), \tau^{(r)}_{\alpha_r}(2)} \cdots X^\pi_{\tau^{(r)}_{\alpha_r}(\ell_r), \tau^{(r)}_{\alpha_r}(\ell_r + 1)}.
$$
In fact, this is of the same form as (72) and thus can be linearly constructed from $P$. This shows that $X^\pi_{i_1, j_1} \cdots X^\pi_{i_m, j_m}$ can be linearly constructed from $P$, thereby completing the proof.

Given i.i.d. observations $\sigma_1, \dots, \sigma_N \sim M$, the empirical distribution $\widehat{M}_N$ has PMF $f_{\widehat{M}_N}(\sigma) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{\sigma_i = \sigma\}$. Hoeffding's inequality then gives $\mathbb{P}\big\{ |f_{\widehat{M}_N}(\sigma) - f_M(\sigma)| > t \big\} \leq 2 \exp(-2 N t^2)$ for any $t > 0$. Taking a union bound over $\sigma \in S_n$ yields
$$
\mathbb{P}\big\{ \mathrm{TV}(M, \widehat{M}_N) > t \, n! / 2 \big\} \leq 2 \, n! \exp(-2 N t^2). \tag{78}
$$
On the other hand, by part (a) of Theorem 4.1, for any $M' \in \mathcal{M}^*$ distinct from $M$, we have $\mathrm{TV}(M, M') \geq c_1 \varepsilon^m$ for a constant $c_1 = c_1(n, k) > 0$. Choosing $t = \frac{c_1}{n!} \varepsilon^m$ in (78) yields that $\mathrm{TV}(M, \widehat{M}_N) \leq c_1 \varepsilon^m / 2$ with probability at least $1 - 2 \, n! \exp(-c_2 N \varepsilon^{2m})$ for a constant $c_2 = c_2(n, k) > 0$. On this event, the minimum total variation distance estimator $\widehat{M}$ defined by (18) is equal to $M$: indeed, by the triangle inequality, every other $M' \in \mathcal{M}^*$ satisfies $\mathrm{TV}(M', \widehat{M}_N) \geq \mathrm{TV}(M, M') - \mathrm{TV}(M, \widehat{M}_N) \geq c_1 \varepsilon^m / 2 \geq \mathrm{TV}(M, \widehat{M}_N)$. Finally, it suffices to note that if $N \geq C \log(1/\delta) / \varepsilon^{2m}$ for a sufficiently large constant $C = C(n, k) > 0$, then the failure probability can be bounded as $2 \, n! \exp(-c_2 N \varepsilon^{2m}) \leq \delta$.
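For concreteness, the following brute-force sketch (ours; the helper names are hypothetical, and a finite list of candidate PMFs stands in for $\mathcal{M}^*$) implements the minimum total variation distance estimator of (18); enumerating $S_n$ restricts it to very small $n$.

    from collections import Counter
    from itertools import permutations

    def empirical_pmf(samples, n):
        # PMF of the empirical distribution over S_n, as a dict mapping
        # each permutation (a tuple of 0, ..., n-1) to its frequency.
        counts = Counter(samples)
        N = len(samples)
        return {sigma: counts[sigma] / N for sigma in permutations(range(n))}

    def tv(f, g):
        # Total variation distance between two PMFs with the same support.
        return 0.5 * sum(abs(f[sigma] - g[sigma]) for sigma in f)

    def min_tv_estimator(samples, candidates, n):
        # Minimum total variation distance estimator: return the candidate
        # PMF (a dict over all of S_n) closest to the empirical distribution.
        f_hat = empirical_pmf(samples, n)
        return min(candidates, key=lambda f: tv(f, f_hat))

By (78) and the argument above, once $N \gtrsim \log(1/\delta) / \varepsilon^{2m}$ (with the hidden constant depending on $n$ and $k$), the returned candidate coincides with the true mixture with probability at least $1 - \delta$.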
By part (b) of Theorem 4.1, there exist distinct Mallows mixtures $M, M' \in \mathcal{M}^*$ for which $\mathrm{TV}(M, M') \lesssim_{n,k} \varepsilon^m$, where the notation $\lesssim_{n,k}$ hides a constant factor that may depend on $n$ and $k$. Let $f'$ denote the PMF of $M'$. For $n$ fixed and as $\varepsilon \to 0$, that is, as $\phi \to 1$, $f'$ converges pointwise to $1/(n!)$. Therefore, for sufficiently small $\varepsilon$, we have $f'(\sigma) \geq 1/(2 \, n!)$ for each $\sigma \in S_n$. By the reverse Pinsker inequality (see, for example, Theorem 2 of [Ver14]), it then follows that
$$
\mathrm{KL}(M \, \| \, M') \lesssim \frac{\mathrm{TV}(M, M')^2}{\min_{\sigma \in S_n} f'(\sigma)} \lesssim_n \mathrm{TV}(M, M')^2 \lesssim_{n,k} \varepsilon^{2m}.
$$
Let $M^{\otimes N}$ and $(M')^{\otimes N}$ denote the distributions of $N$ i.i.d. observations from $M$ and $M'$ respectively. Then Pinsker's inequality together with the tensorization of the KL divergence, $\mathrm{KL}(M^{\otimes N} \, \| \, (M')^{\otimes N}) = N \cdot \mathrm{KL}(M \, \| \, M')$, yields
$$
\mathrm{TV}\big( M^{\otimes N}, (M')^{\otimes N} \big) \leq \sqrt{ \tfrac{1}{2} \, \mathrm{KL}\big( M^{\otimes N} \, \| \, (M')^{\otimes N} \big) } \lesssim_n \sqrt{N} \, \varepsilon^m.
$$
Finally, applying Le Cam's two-point lower bound (cf. e.g. [Tsy09, Sec. 2.3]) gives
$$
\min_{\widehat{M}} \max_{M \in \mathcal{M}^*} \mathbb{P}_M \big\{ \widehat{M} \neq M \big\} \geq \frac{1}{2} \Big( 1 - \mathrm{TV}\big( M^{\otimes N}, (M')^{\otimes N} \big) \Big) \geq \frac{1}{4},
$$
if $N \leq c / \varepsilon^{2m}$ for a sufficiently small constant $c = c(n, k) > 0$.
References

[ABSV14] Pranjal Awasthi, Avrim Blum, Or Sheffet, and Aravindan Vijayaraghavan. Learning mixtures of ranking models. In Advances in Neural Information Processing Systems, pages 2609–2617, 2014.

[BFFSZ19] Róbert Busa-Fekete, Dimitris Fotakis, Balázs Szörényi, and Manolis Zampetakis. Optimal learning of Mallows block model. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 529–532, Phoenix, USA, 25–28 Jun 2019. PMLR.

[BFHS14] Róbert Busa-Fekete, Eyke Hüllermeier, and Balázs Szörényi. Preference-based rank elicitation using statistical models: The case of Mallows. In Proceedings of the 31st International Conference on Machine Learning – Volume 32, pages II-1071. JMLR.org, 2014.

[BM09] Mark Braverman and Elchanan Mossel. Sorting from noisy information. arXiv preprint arXiv:0910.1191, 2009.

[BMR10] Linas Baltrunas, Tadas Makcinskas, and Francesco Ricci. Group recommendations with rank aggregation and collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 119–126, 2010.

[BOB07] Ludwig M. Busse, Peter Orbanz, and Joachim M. Buhmann. Cluster analysis of heterogeneous rank data. In Proceedings of the 24th International Conference on Machine Learning, pages 113–120. ACM, 2007.

[Bor81] J. C. Borda. Mémoire sur les élections au scrutin. Histoire de l'Académie Royale des Sciences, 1781.

[CDKL15] Flavio Chierichetti, Anirban Dasgupta, Ravi Kumar, and Silvio Lattanzi. On learning mixture models for permutations. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 85–92, 2015.

[Con85] M. J. Condorcet. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. 1785.

[CPS13] Ioannis Caragiannis, Ariel D. Procaccia, and Nisarg Shah. When do noisy votes reveal the truth? In Proceedings of the Fourteenth ACM Conference on Electronic Commerce, pages 143–160, 2013.

[DKNS01] Cynthia Dwork, Ravi Kumar, Moni Naor, and Dandapani Sivakumar. Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web, pages 613–622, 2001.

[DOS18] Anindya De, Ryan O'Donnell, and Rocco Servedio. Learning sparse mixtures of rankings from noisy information. arXiv preprint arXiv:1811.01216, 2018.

[DPR04] Jean-Paul Doignon, Aleksandar Pekeč, and Michel Regenwetter. The repeated insertion model for rankings: Missing link between two subset choice models. Psychometrika, 69(1):33–54, 2004.

[DWYZ20] Natalie Doss, Yihong Wu, Pengkun Yang, and Harrison H. Zhou. Optimal estimation of high-dimensional Gaussian mixtures. arXiv preprint arXiv:2002.05818, 2020.

[FKS03] Ronald Fagin, Ravi Kumar, and Dandapani Sivakumar. Efficient similarity search and classification via rank aggregation. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 301–312, 2003.

[FV86] Michael A. Fligner and Joseph S. Verducci. Distance based ranking models. Journal of the Royal Statistical Society: Series B (Methodological), 48(3):359–369, 1986.

[GM08a] Isobel Claire Gormley and Thomas Brendan Murphy. Exploring voting blocs within the Irish electorate: A mixture modeling approach. Journal of the American Statistical Association, 103(483):1014–1027, 2008.

[GM08b] Isobel Claire Gormley and Thomas Brendan Murphy. A mixture of experts model for rank data with applications in election studies. The Annals of Applied Statistics, 2(4):1452–1477, 2008.

[HK18] Philippe Heinrich and Jonas Kahn. Strong identifiability and optimal minimax rates for finite mixture estimation. The Annals of Statistics, 46(6A):2844–2870, 2018.

[ICL19] Ekhine Irurozki, Borja Calvo, and Jose A. Lozano. Mallows and generalized Mallows model for matchings. Bernoulli, 25(2):1160–1188, 2019.

[JJ94] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.

[KCS17] Anna Korba, Stephan Clémençon, and Eric Sibony. A learning theory of ranking aggregation. In Artificial Intelligence and Statistics, pages 1001–1010, 2017.

[LB11] Tyler Lu and Craig Boutilier. Learning Mallows models with pairwise preferences. In Proceedings of the 28th International Conference on Machine Learning, pages 145–152, 2011.

[LB14] Tyler Lu and Craig Boutilier. Effective sampling and learning for Mallows models with pairwise-preference data. The Journal of Machine Learning Research, 15(1):3783–3829, 2014.

[LLQ+07] Yu-Ting Liu, Tie-Yan Liu, Tao Qin, Zhi-Ming Ma, and Hang Li. Supervised rank aggregation. In Proceedings of the 16th International Conference on World Wide Web, pages 481–490, 2007.

[LM18] Allen Liu and Ankur Moitra. Efficiently learning mixtures of Mallows models. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 627–638. IEEE, 2018.

[Mal57] Colin L. Mallows. Non-null ranking models. I. Biometrika, 44(1/2):114–130, 1957.

[Mar95] John I. Marden. Analyzing and Modeling Rank Data. Chapman and Hall/CRC, 1995.

[MC10] Marina Meilă and Harr Chen. Dirichlet process mixtures of generalized Mallows models. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 358–367, 2010.

[MM03] Thomas Brendan Murphy and Donal Martin. Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41(3-4):645–655, 2003.

[MPPB07] Marina Meilă, Kapil Phadnis, Arthur Patterson, and Jeff Bilmes. Consensus ranking under the exponential model. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, pages 285–294, 2007.

[MV10] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In 2010 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 93–102. IEEE, 2010.

[Pea94] Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110, 1894.

[Tsy09] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Verlag, New York, NY, 2009.

[Ver14] Sergio Verdú. Total variation distance and the distribution of relative information. In 2014 Information Theory and Applications Workshop (ITA), pages 1–3. IEEE, 2014.

[WY20] Yihong Wu and Pengkun Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. The Annals of Statistics, 48(4):1981–2007, 2020.

[Zag92] Don Zagier. Realizability of a model in infinite statistics. Communications in Mathematical Physics, 147(1):199–210, 1992.

[ZPX16] Zhibing Zhao, Peter Piech, and Lirong Xia. Learning mixtures of Plackett-Luce models. In Proceedings of the 33rd International Conference on Machine Learning, 2016.