Learning sparse mixtures of rankings from noisy information
Anindya De ∗ Northwestern University [email protected]
Ryan O’Donnell † Carnegie Mellon University [email protected]
Rocco A. Servedio ‡ Columbia University [email protected]
November 6, 2018
Abstract
We study the problem of learning an unknown mixture of k rankings over n elements, given access to noisy samples drawn from the unknown mixture. We consider a range of different noise models, including natural variants of the "heat kernel" noise framework and the Mallows model. For each of these noise models we give an algorithm which, under mild assumptions, learns the unknown mixture to high accuracy and runs in n^{O(log k)} time. The best previous algorithms for closely related problems have running times which are exponential in k.

∗ Supported by NSF grant CCF-1814706.
† Supported by NSF grants CCF-1618679 and CCF-1717606.
‡ Supported by NSF grants CCF-1563155 and CCF-1814873 and by the Simons Collaboration on Algorithms and Geometry. This material is based upon work supported by the National Science Foundation under the grant numbers listed above. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

1 Introduction
This paper considers the following natural scenario: there is a large heterogeneous populationwhich consists of k disjoint subgroups, and for each subgroup there is a “central preference order”specifying a ranking over a fixed set of n items (equivalently, specifying a permutation in thesymmetric group S n ). For each i ∈ { , . . . , k } , the preference order of each individual in subgroup i is assumed to be a noisy version of the central preference order (the permutation corresponding tosubgroup i ). A natural learning task which arises in this scenario is the following: given access to thepreference order of randomly selected members of the population, is it possible to learn the centralpreference orders of the k sub-populations, as well as the relative sizes of these k sub-populationswithin the overall population?Worst-case formulations of the above problem typically tend to be (difficult) variants of thefeedback arc set problem, which is known to be NP-complete [GJ79]. In view of the practicalimportance of problems of this sort, though, there has been considerable recent research interestin studying various generative models corresponding to the above scenario (we discuss some of therecent work which is most closely related to our results in Section 1.3). In this paper we will modelthe above general problem schema as follows: The k “central preference orders” of the subgroupsare given by k unknown permutations σ , . . . , σ k ∈ S n . The fraction of the population belonging tothe i -th subgroup, for 1 ≤ i ≤ k , is given by an unknown w i ≥ w + · · · + w k = 1). Finally,the noise is modeled by some family of distributions {K θ } , where each distribution K θ is supportedon S n , and the preference order of a random individual in the i -th subgroup is given by π σ i , where π ∼ K θ . Here θ is a model parameter capturing the “noise rate” (we will have much more to sayabout this for each of the specific noise models we consider below). The learning task is to recoverthe central rankings σ , . . . , σ k and their proportions w , . . . , w k , given access to preference ordersof randomly chosen individuals from the population. In other words, each sample provided to thelearner is independently generated by first choosing a random permutation σ , where σ is chosento be σ i with probability w i ; then independently drawing a random π ∼ K θ ; and finally, providingthe learner with the permutation πσ ∈ S n . Let f : S n → R ≥ denote the function which is w i at σ i and 0 otherwise. With this notation, we write “ K θ ∗ f ” to denote the distribution over noisysamples described above, and our goal is to approximately recover f given such noisy samples. Thereader may verify that the distribution defined by πσ is precisely given by the group convolution K θ ∗ f (and hence the notation). We consider a range of different noise models, corresponding to different choices for the parametricfamily {K θ } , and for each one we give an efficient algorithm for recovering the population in thepresence of that kind of noise. In this subsection we detail the three specific noise models that wewill work with (though as we discuss later, our general mode of analysis could be applied to othernoise models as well).(A.) Symmetric noise.
In the symmetric noise model, the parametric family of distributions over S_n is denoted {S_p}_{p ∈ ∆^n}. Given a vector p = (p_0, . . . , p_n) ∈ ∆^n (so each p_i ≥ 0 and ∑_{i=0}^{n} p_i = 1), a draw of π ∼ S_p is obtained as follows:

1. Choose 0 ≤ j ≤ n, where value j is chosen with probability p_j.

2. Choose a uniformly random subset A ⊆ [n] of size exactly j. Draw π uniformly from S_A; in other words, π is a uniformly random permutation over the set A and is the identity permutation on elements in [n] \ A. (We denote this uniform distribution over S_A by U_A.)

Note that in this model, if the noise vector p has p_n = 1, then every draw from S_p ∗ f is a uniform random permutation and there is no useful information available to the learner.
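To make the two-step procedure above concrete, here is a short Python sketch. It is our illustration rather than code from the paper; the function name sample_symmetric_noise and the 0-indexed ground set {0, ..., n−1} are our own choices.

```python
import random

def sample_symmetric_noise(p, n, rng=random):
    """Draw pi ~ S_p (ground set {0,...,n-1}): pick j with probability p[j], then
    apply a uniformly random permutation to a uniformly random size-j subset A,
    leaving every element outside A fixed."""
    j = rng.choices(range(n + 1), weights=p)[0]     # step 1: choose j with probability p_j
    A = rng.sample(range(n), j)                     # step 2: uniformly random subset of size j
    images = A[:]
    rng.shuffle(images)                             # a uniformly random permutation of A
    pi = list(range(n))                             # identity off of A
    for a, b in zip(A, images):
        pi[a] = b
    return pi                                       # pi[i] is the image of element i
```

A noisy sample from S_p ∗ f is then the composition of such a π with a permutation σ drawn from the mixture f.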
In order to define the next two noise models that we consider, let us recall the notion of a right-invariant metric on S_n. Such a metric d(·,·) is one that satisfies d(σ, π) = d(στ, πτ) for all σ, π, τ ∈ S_n. We note that a metric is right-invariant if and only if it is invariant under relabeling of the items 1, . . . , n, and that most metrics considered in the literature satisfy this condition (see [KV10, Dia88b] for discussions of this point). In this paper, for technical convenience, we restrict our attention to the metric d(·,·) being the Cayley distance over S_n (though see Section 1.5 for a discussion of how our methods and results could potentially be generalized to other right-invariant metrics):

Definition 1.1. Let G be the undirected graph with vertex set S_n and an edge between permutations σ and π if there is a transposition τ such that σ = τ·π. The Cayley distance over S_n is the metric induced by this graph; in other words, d(π, σ) = t where t is the smallest value such that there are transpositions τ_1, . . . , τ_t satisfying σ = τ_1 · · · τ_t·π.

Now we are ready to define the next two parameterized families of noise distributions that we consider. We note that each of the noise distributions K considered below has the natural property that the probability Pr_{π∼K}[π = π_0] decreases with d(π_0, e), where e is the identity permutation.

(B.) Heat kernel random walk under Cayley distance.
Let L be the Laplacian of the graph G from Definition 1.1. Given a "temperature" parameter t ∈ R_+, the heat kernel is the n! × n! matrix H_t = e^{−tL}. It is well known that H_t is the transition matrix of the random walk induced by choosing a Poisson-distributed time parameter T ∼ Poi(t) and then taking T steps of a uniform random walk in the graph G. With this motivation, we define the heat kernel noise model as follows: the parametric family of distributions is {H_t}_{t ∈ R_+}, where the probability weight that H_t assigns to permutation π is the probability that the above-described random walk, starting at the identity permutation e ∈ S_n, reaches π. (Observe that higher temperature parameters t correspond to higher rates of noise. More precisely, it is well known that the mixing time of a uniform random walk on G is Θ(n log n) steps, so if t grows larger than n log n then the distribution H_t converges rapidly to the uniform distribution on S_n; see [DS81] for detailed results along these lines.) We note that these probability distributions (or more precisely, the associated heat kernel H_t) have been previously studied in the context of learning rankings, see e.g. [KL02, KB10, JV18]. In some of this work, a different underlying distance measure was used over S_n rather than the Cayley distance; see our discussion of related work in Section 1.3.
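The random-walk description translates directly into a sampler; the following Python sketch is ours (not the paper's code) and works over the 0-indexed ground set {0, ..., n−1}.

```python
import numpy as np

def sample_heat_kernel(t, n, rng=None):
    """Draw pi ~ H_t: take T ~ Poisson(t) steps of the uniform random walk on the
    Cayley graph G, i.e. apply T uniformly random transpositions to the identity."""
    rng = rng or np.random.default_rng()
    pi = list(range(n))                     # start at the identity permutation
    for _ in range(rng.poisson(t)):
        i, j = rng.choice(n, size=2, replace=False)
        pi[i], pi[j] = pi[j], pi[i]         # one step: multiply by the transposition (i j)
    return pi
```

As noted above, once t grows well beyond n log n the output of this walk is essentially a uniformly random permutation, so no useful signal survives.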
(C.) Mallows-type model under Cayley distance (Cayley-Mallows / Ewens model). While the heat kernel noise model arises naturally from an analyst's perspective, a somewhat different model, called the
Mallows model, has been more popular in the statistics and machine learning literature. The Mallows model is defined using the "Kendall τ-distance" K(·,·) between permutations (defined in Section 1.3) rather than the Cayley distance d(·,·); the Mallows model with parameter θ > 0 assigns probability weight e^{−θK(π,e)}/Z_K(θ) to the permutation π, where Z_K(θ) = ∑_{π∈S_n} e^{−θK(π,e)} is a normalizing constant. As proposed by Fligner and Verducci [FV86], it is natural to consider generalizations of the Mallows model in which other distance measures take the place of the Kendall τ-distance. The model which we consider is one in which the Cayley distance is used as the distance measure; so given θ >
0, the noise distribution M θ which weconsider assigns weight e − θd ( π,e ) /Z ( θ ) to each permutation π ∈ S n , where Z ( θ ) = (cid:80) π ∈ S n e − θd ( π,e ) is a normalizing constant. In fact, this noise model was already proposed in 1972 by W. Ewensin the context of population genetics [Ewe72] and has been intensively studied in that field (wenote that [Ewe72] has been cited more than 2000 times according to Google Scholar). To alignour terminology with the strand of research in machine learning and theoretical computer sciencewhich deals with the Mallows model, in the rest of this paper we refer to M θ as the Cayley-Mallows model. For the same reason, we will also refer to the usual Mallows model (with the Kendall τ -disance) as the Kendall-Mallows model. We observe that for the Cayley-Mallows model M θ , incontrast with the heat kernel noise model now smaller values of θ correspond to higher levels ofnoise, and that when θ = 0 the distribution M θ is simply the uniform distribution over S n andthere is no useful information available to the learner. For each of the noise models defined above, we give algorithms which, under a mild technicalassumption (that no mixing weight w i is too small), provably recover the unknown central rankings σ , . . . , σ k and associated mixing weights w , . . . , w k up to high accuracy. A notable feature of ourresults is that the sample and running time dependence is only quasipolynomial in the numberof elements n and the number of sub-populations k ; as we detail in Section 1.3 below, this is incontrast with recent results for similar problems in which the dependence on k is exponential.Below we give detailed statements of our results. The following notation and terminology willbe used in these statements: for f a distribution over S n (or any function from S n to R ) we writesupp( f ) to denote the set of permutations σ ∈ S n that have f ( σ ) (cid:54) = 0. For a given noise model K , we write “ K ∗ f ” to denote the distribution over noisy samples that is provided to the learningalgorithm as described earlier. Given two functions f, g : S n → R , we write “ (cid:107) f − g (cid:107) ” to denote (cid:80) π ∈ S n | f ( π ) − g ( π ) | , the (cid:96) distance between f and g . If f and g are both distributions then wewrite d TV ( f, g ) to denote the total variation distance between f and g , which is (cid:107) f − g (cid:107) . Finally,if f is a distribution over S n in which f ( σ ) > ε for every σ such that f ( σ ) >
0, we say that f is ε -heavy . Learning from noisy rankings: Positive and negative results.
Our first algorithmic result is for the symmetric noise model (A) defined earlier. Theorem 1.2, stated below, gives an efficient algorithm as long as the vector p is "not too extreme" (i.e. not too biased towards putting almost all of its weight on large values very close to n):

Theorem 1.2 (Algorithm for symmetric noise). There is an algorithm with the following guarantee: Let f be an unknown ε-heavy distribution over S_n with |supp(f)| ≤ k. Let p = (p_0, . . . , p_n) ∈ ∆^n be such that

∑_{j=0}^{n − log k} p_j ≥ 1/n^{O(log k)}.

Given p, the value of ε > 0, a confidence parameter δ > 0, and access to random samples from S_p ∗ f, the algorithm runs in time poly(n^{log k}, 1/ε, log(1/δ)) and with probability 1 − δ outputs a distribution g : S_n → R such that d_TV(f, g) ≤ ε.

Theorem 1.3 (Algorithm for heat kernel noise). There is an algorithm with the following guarantee: Let f be an unknown ε-heavy distribution over S_n with |supp(f)| ≤ k. Let t ∈ R_+ be any value that is O(n log n). Given t, the value of ε > 0, a confidence parameter δ > 0, and access to random samples from H_t ∗ f, the algorithm runs in time poly(n^{log k}, 1/ε, log(1/δ)) and with probability 1 − δ outputs a distribution g : S_n → R such that d_TV(f, g) ≤ ε.

Recalling that the uniform random walk on the Cayley graph of S_n mixes in Θ(n log n) steps, we see that the algorithm of Theorem 1.3 is able to handle quite high levels of noise and still run quite efficiently (in quasi-polynomial time).

Our third positive result, for the Cayley-Mallows model, displays an intriguing qualitative difference from Theorems 1.2 and 1.3. To state our result, let us define the function dist : R_+ × N → R_+ as follows:

dist(θ, ℓ) := min_{j ∈ {1,...,ℓ}} |e^θ − j|,

so dist(θ, ℓ) measures the minimum distance between e^θ and any integer in {1, . . . , ℓ}. Theorem 1.4 gives an algorithm which can be quite efficient for the Cayley-Mallows noise model if the noise parameter θ is such that dist(θ, log k) is not too small:

Theorem 1.4 (Algorithm for the Cayley-Mallows model). There is an algorithm with the following guarantee: Let f be an unknown ε-heavy distribution over S_n with |supp(f)| ≤ k. Given θ > 0, the value of ε > 0, a confidence parameter δ > 0, and access to random samples from M_θ ∗ f, the algorithm runs in time poly(n^{log k}, 1/ε, log(1/δ), dist(θ, log k)^{−√(log k)}) and with probability 1 − δ outputs a distribution g : S_n → R such that d_TV(f, g) ≤ ε.

As alluded to earlier, as θ approaches 0 the difficulty of learning in the M_θ noise model increases (and indeed learning becomes impossible at θ = 0); since for small θ we have dist(θ, ℓ) ≈ θ, this is accounted for by the dist(θ, log k)^{−√(log k)} factor in our running time bound above. However, for larger values of θ the dist(θ, log k)^{−√(log k)} dependence may strike the reader as an unnatural artifact of our analysis: is it really hard to learn when θ is very close to ln 2 ≈ 0.69, easy when θ is very close to ln 2.5 ≈ 0.92, and hard again when θ is very close to ln 3 ≈ 1.10? It turns out that the dist(·,·) parameter does capture a fundamental barrier to learning in the Cayley-Mallows model. We establish this by proving the following lower bound for the Cayley-Mallows model, which shows that a dependence on dist as in Theorem 1.4 is in fact inherent in the problem:

Theorem 1.5.
Given j ∈ N , there are infinitely many values of k and m = m ( k ) such that thefollowing holds: Let θ > be such that | e θ − j | ≤ η ≤ / , and let A be any algorithm which, whengiven access to random samples from M θ ∗ f where f is a distribution over S m with | supp( f ) | ≤ k ,with probability at least 0.51 outputs a distribution h over S m that has d TV ( f, h ) ≤ . . Then A must use η − Ω (cid:16)(cid:113) log k log log k (cid:17) samples. Starting with the work of Mallows [Mal57], there is a rich line of work in machine learning andstatistics on probabilistic models of ranking data, see e.g. [Mar14, LL02, BOB07, MM09, MC10,4B11]. In order to describe the prior works which are most relevant to our paper, it will beuseful for us to define the
Kendall-Mallows model (referred to in the literature just as the Mallowsmodel) in slightly more detail than we gave earlier. Introduced by Mallows [Mal57], the Kendall-Mallows model is quite similar to the Cayley-Mallows model that we consider — it is specified bya parametric family of distributions {M τ,θ } θ ∈ R + and a central permutation σ ∈ S n , and a drawfrom the model is generated as follows: sample π ∼ M τ,θ and output π · σ . The distribution M τ,θ assigns probability weight e − θK ( π,e ) /Z K ( θ ) to the permutation π where Z K ( θ ) = (cid:80) π ∈ S n e − θK ( π,e ) is the normalizing constant and K ( · , · ) is the Kendall τ -distance (defined next): Definition 1.6.
The
Kendall τ-distance K : S_n × S_n → R_{≥0} is a distance metric on S_n defined as

K(π, π′) = |{(i, j) : i < j and ((π(i) < π(j)) ⊕ (π′(i) < π′(j))) = 1}|.

In other words, K(π, π′) is the number of inversions between π and π′. Like the Cayley distance, the Kendall τ-distance is also a right-invariant metric. Another equivalent way to define K(·,·) is to consider the undirected graph on S_n where vertices π_1 and π_2 share an edge if and only if π_1 = τ·π_2 where τ is an adjacent transposition, in other words τ = (i, i+1) for some 1 ≤ i < n. Then K(·,·) is defined as the shortest-path metric on this graph. From this perspective, the difference between the Kendall τ-distance and the Cayley distance is that the former only allows adjacent transpositions while the latter allows all transpositions.

Learning mixture models:
As mentioned earlier, probabilistic models of ranking data havebeen studied extensively in probability, statistics and machine learning. Models that have beenconsidered in this context include the Kendall-Mallows model [Mal57, LB11, MPPB07, GP18], theCayley-Mallows model (and generalizations of it) [FV86, MM03, Muk16, DH92, Dia88a, Ewe72]and the heat kernel random walk model [KL02, KB10, JV18], among others. In contrast, withintheoretical computer science interest in probabilistic models of ranking data is somewhat morerecent, and the best-studied model in this community is the Kendall-Mallows model. Bravermanand Mossel [BM08] initiated this study and (among other results) gave an efficient algorithm torecover a single Kendall-Mallows model from random samples. The question of learning mixturesof k Kendall-Mallows models was raised soon thereafter, and and Awasthi et al. [ABSV14] gave anefficient algorithm for the case k = 2. We note two key distinctions between our work and thatof [ABSV14]: (i) our results apply to the Cayley-Mallows model rather than the Kendall-Mallowsmodel, and (ii) the work of [ABSV14] allows for the two components in the mixture to have twodifferent noise parameters θ and θ whereas our mixture models allow for only one noise parameter θ across all the components.Very recently, Liu and Moitra [LM18] extended the result of [ABSV14] to any constant k . Inparticular, the running time of the [LM18] algorithm scales as n poly( k ) . It is interesting to contrastour results with those of [LM18]. Besides the obvious difference in the models treated (namelyKendall-Mallows in [LM18] versus Cayley-Mallows in this paper), another significant difference isthat our running time scales only quasipolynomially in k versus exponentially in k for [LM18]. (Infact, [LM18] shows that an exponential dependence on k is necessary for the problem they consider.)Another difference is that their algorithm allows each mixture component to have a different noiseparameter θ i whereas our result requires the same noise parameter θ across the mixture components.We observe that one curious feature of the algorithm of [LM18] is the following: When all the noiseparameters { θ i } ≤ i ≤ k are well-separated (meaning that for all i (cid:54) = j , | θ i − θ j | ≥ γ ), then the running5ime of [LM18] can be improved to poly( n ) · poly( k ) . This suggests that the case when all θ i arethe same might be the hardest for the Liu-Moitra [LM18] algorithm.Finally, we note that while the analysis in this paper does not immediately extend to theKendall-Mallows model (see Section 1.5 for more details), we point out that there is a sense in whichthe Kendall-Mallows and Cayley-Mallows models are fundamentally incomparable. This is because,while the results of [LM18] show that mixtures of Kendall-Mallows models are identifiable whenevereach θ i (cid:54) = 1, Theorem 1.5 shows that mixtures of Cayley-Mallows models are not identifiable atvarious larger values of θ such as ln 2 , ln 3 , . . . , even when all of the noise parameters are the samevalue θ which is provided to the algorithm. A key notion for our algorithmic approach is that of the marginal of a distribution f over S n : Definition 1.7.
Fix f : S n → [0 ,
1] to be some distribution over S n . Let t ∈ { , . . . , n } , let¯ i = ( i , . . . , i t ) be a vector of t distinct elements of { , . . . , n } and likewise ¯ j = ( j , . . . , j t ). We saythe (¯ i, ¯ j ) -marginal of f is the probability Pr σ ∼ f [ σ ( i ) = j and · · · and σ ( i t ) = j t ]that for all (cid:96) = 1 , . . . , t , the i (cid:96) -th element of a random σ drawn from f is j (cid:96) . When ¯ i and ¯ j are oflength t we refer to such a probability as a t -way marginal of f .The first key ingredient of our approach for learning from noisy rankings is a reduction from theproblem of learning f (the unknown distribution supported on k rankings σ , . . . , σ k ) given accessto samples from K ∗ f , to the problem of estimating t -way marginals (for a not-too-large value of t ). More precisely, in Section 2 we give an algorithm which, given the ability to efficiently estimate t -way marginals of f , efficiently computes a high-accuracy approximation for an unknown ε -heavydistribution f with support size at most k (see Theorem 2.1). This algorithm builds on ideas inthe population recovery literature, suitably extended to the domain S n rather than { , } n .With the above-described reduction in hand, in order to obtain a positive result for a specificnoise model K the remaining task is to develop an algorithm A marginal which, given access to noisysamples from K ∗ f , can reliably estimate the required marginals. In Section 3 we show that ifthe noise distribution K (a distribution over S n ) is efficiently samplable, then given samples from K ∗ f , the time required to estimate the required marginals essentially depends on the minimum,over a certain set of matrices arising from the Fourier transform (over the symmetric group S n )of the noise distribution, of the minimum singular value of the matrix. (See Theorem 3.1 for adetailed statement.) At this point, we have reduced the algorithmic problem of obtaining a learningalgorithm for a particular noise model to the analytic task of lower bounding the relevant singularvalues. We carry out the required analyses on a noise-model-by-noise-model basis in Sections 4, 5,and 6. These analyses employ ideas and results from the representation theory of the symmetricgroup and its connections to enumerative combinatorics; we give a brief overview of the necessarybackground in Appendix A.To establish our lower bound for the Cayley-Mallows model, Theorem 1.5, we exhibit two dis-tributions f and f over the symmetric group such that the distributions of noisy rankings M θ ∗ f and M θ ∗ f have very small statistical distance from each other. Not surprisingly, the inspiration6or this construction also comes from the representation theory of the symmetric group; more pre-cisely, the two above-mentioned distributions are obtained from the character (over the symmetricgroup) corresponding to a particular carefully chosen partition of [ n ]. A crucial ingredient in theproof is the fact that characters of the symmetric group are rational-valued functions, and henceany character can be split into a positive part and a negative part; details are given in Section 8. In this paper we have considered three particular noise models — symmetric noise, heat kernelnoise, and Cayley-Mallows noise — and given efficient algorithms for these noise models. Lookingbeyond these specific noise models, though, our approach provides a general framework for obtainingalgorithms for learning mixtures of noisy rankings. 
Indeed, for essentially any efficiently samplablenoise distribution K , given access to samples from K ∗ f our approach reduces the algorithmicproblem of learning f to the analytic problem of lower bounding the minimum singular values ofmatrices arising from the Fourier transform of K (see Theorem 3.1). We believe that this techniquemay be useful in a broader range of contexts, e.g. to obtain results analogous to ours for the originalKendall-Mallows model or for other noise models.As is made clear in Sections 4, 5, and 6, the representation-theoretic analysis that we requirefor our noise models is facilitated by the fact that each of the noise distributions considered inthose sections is a class function (in other words, the value of the distribution on a given inputpermutation depends only on the cycle structure of the permutation). Extending the kinds ofanalyses that we perform to other noise models which are not class functions is a technical challengethat we leave for future work. The main result of this section is the reduction alluded to in Section 1.4. In more detail, we givean algorithm which, given the ability to efficiently estimate t -way marginals, efficiently computes ahigh-accuracy approximation for an unknown ε -heavy distribution f with support size at most k : Theorem 2.1.
Let f be an unknown ε -heavy distribution over S n with | supp( f ) | ≤ k . Suppose thereis an algorithm A marginal with the following property: given as input a value δ > and two vectors ¯ i = ( i , . . . , i t ) and ¯ j = ( j , . . . , j t ) each composed of t distinct elements of { , . . . , n } , algorithm A marginal runs in time T ( δ, t, k, n ) and outputs an additively ± δ -accurate estimate of the (¯ i, ¯ j ) -marginal of f (recall Definition 1.7). Then there is an algorithm A learn with the following property:given the value of ε , algorithm A learn runs in time poly( n/ε, n log k ) · T ( ε k O (log k ) , k, k , n ) andreturns a function g : S n → R + such that (cid:107) f − g (cid:107) ≤ ε . Looking ahead, given Theorem 2.1, in order to obtain a positive result for a specific noise model K the remaining task is to develop an algorithm A marginal which, given access to noisy samplesfrom K ∗ f , can reliably estimate the required marginals. The algorithm is given in Section 3and the detailed analyses establishing its efficiency for each of the noise models (by boundingminimum singular values of certain matrices arising from each specific noise distribution) is givenin Sections 4, 5, and 6. 7 .1 A useful structural result The following structural result on functions from S n to R with small support will be useful for us: Claim 2.2 (Small-support functions are correlated with juntas) . Fix ≤ (cid:96) ≤ n and let g : [ n ] (cid:96) → R be such that (cid:107) g (cid:107) = 1 and | supp( g ) | ≤ k . There is a subset U ⊆ [ n ] and a list of values α , . . . , α | U | ∈ [ n ] such that | U | ≤ log k and (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) x ∈ [ n ] (cid:96) g ( x ) · [ x i = α i for all i ∈ U ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ k − O (log k ) . (1)Claim 2.2 is reminiscent of analogous structural results for functions over { , } (cid:96) which areimplicit in the work of [WY12] (specifically, Theorem 1.5 of that work), and indeed Claim 2.2 canbe proved by following the techniques of [WY12]. Michael Saks [Sak18] has communicated to usan alternative, and arguably simpler, argument for the relevant structural result over { , } (cid:96) ; herewe follow that alternative argument (extending it in the essentially obvious way to the domain [ n ] (cid:96) rather than { , } (cid:96) ). Proof.
Let the support of g be S ⊆ [ n ] (cid:96) . Note that since | S | ≤ k , there must exist some setof k (cid:48) := min { k, (cid:96) } coordinates such that any two elements of S differ in at least one of thosecoordinates. Without loss of generality, we assume that this set is the first k (cid:48) coordinates { , . . . , k (cid:48) } . We prove Claim 2.2 by analyzing an iterative process that iterates over the coordinates 1 , . . . , k (cid:48) .At the beginning of the process, we initialize a set Coord live of “live coordinates” to be [ k (cid:48) ], initializea set Constr of constraints to be initially empty, and initialize a set S live of “live support elements”to be the entire support S of g . We will see that the iterative process maintains the followinginvariants:(I1) The coordinates in Coord live are sufficient to distinguish between the elements in S live , i.e. anytwo distinct strings in S live have distinct projections onto the coordinates in Coord live ;(I2) The only elements of S that satisfy all the constraints in Constr are the elements of S live .Before presenting the iterative process we need to define some pertinent quantities. For eachcoordinate j ∈ Coord live and each index α ∈ [ n ], we define Wt ( j, α ) := (cid:88) x ∈ S live : x j = α | g ( x ) | , the weight under g of the live support elements x that have x j = α , and we define Num ( j, α ) := |{ x ∈ S live : x j = α }| , the number of live support elements x that have x j = α (note that Num ( j, α ) has nothing to dowith g ). It will also be useful to have notation for fractional versions of each of these quantities, sowe define FracWt ( j, α ) := Wt ( j, α ) (cid:80) x ∈ S live | g ( x ) | . and Frac ( j, α ) := Num ( j, α ) | S live | j ∈ Coord live we have that (cid:80) α Num ( j, α ) = | S live | , or equivalently (cid:80) α Frac ( j, α ) =1 . For each coordinate j ∈ Coord live , we write
MAJ ( j ) to denote the element β ∈ [ n ] whichis such that Num ( j, β ) ≥ Num ( j, α ) for all α ∈ [ n ] (we break ties arbitrarily). Finally, we let FracWtMaj ( j ) = FracWt ( j, MAJ ( j )).Now we are ready to present the iterative process:1. If every j ∈ Coord live has
FracWtMaj ( j ) > − k (cid:48) , then halt the process. Otherwise, let j be any element of Coord live for which FracWtMaj ( j ) ≤ − k (cid:48) .2. For this coordinate j , choose α ∈ [ n ] which maximizes the ratio FracWt ( j,α ) Frac ( j,α ) (or equivalently,maximizes FracWt ( j,α ) Num ( j,α ) ) subject to Frac ( j, α ) (cid:54) = 0 and α (cid:54) = MAJ ( j ).3. Add the constraint x j = α to Constr, remove j from Coord live , and remove all x such that x j (cid:54) = α from S live . Go to Step 1.When the iterative process ends, suppose that the set Constr is { x j = α , . . . , x j (cid:96) = α (cid:96) } . Thenwe claim that Equation (1) holds for U = { j , . . . , j (cid:96) } .To argue this, we first observe that both invariants (I1) and (I2) are clearly maintained by eachround of the iterative process. We next observe that each time a pair ( j, α ) is processed in Step 3,it holds that Frac ( j, α ) ≤ , and hence each round shrinks S live by a factor of at least 2. Thus, afterlog k steps, the set S live must be of size at most 1 and hence the process must halt. (Note that theclaimed bound | U | ≤ log k follows from the fact that the process runs for at most log k stages.)Next, note that when the process halts, by a union bound over the at most k (cid:48) coordinates inCoord live it holds that (cid:88) x ∈ S live : x j = MAJ ( j ) for all j ∈ Coord live | g ( x ) | ≥ · (cid:88) x ∈ S live | g ( x ) | . On the other hand, by the first invariant (I1), the cardinality of the set { x ∈ S live : x j = MAJ ( j )for all j ∈ Coord live } is precisely 1. This immediately implies that almost all of the weight of g ,across elements of S live , is on a single element; more precisely, that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) x ∈ S live g ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ · (cid:88) x ∈ S live | g ( x ) | , from which it follows that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) x ∈ [ n ] (cid:96) g ( x ) · [ x i = α i for all i ∈ U ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ · (cid:88) x ∈ S live | g ( x ) | . (2)So to establish Equation (1), it remains only to establish a lower bound on (cid:80) x ∈ S live | g ( x ) | whenthe process terminates. To do this, let us suppose that the process runs for T steps where in the Note that this means almost all of the weight under g of the live support elements is on elements that all agreewith the majority value on coordinate j . Note further that if Coord live is empty then this condition trivially holds. th step the coordinate chosen is j t . Now, at any stage t , we have (cid:80) β ∈ Coord live : β (cid:54) = MAJ ( j t ) FracWt ( j t , β ) (cid:80) β ∈ Coord live : β (cid:54) = MAJ ( j t ) Frac ( j t , β ) ≥ k (cid:48) . (because the denominator is at most 1 and since the process does not terminate, the numerator isat least k ). As a result, we get that if the constraint chosen at time t is x j t = α t , then FracWt ( j t , α t ) Frac ( j t , α t ) ≥ k (cid:48) . (3)By Equation (3), when the process halts we have (cid:88) x ∈ S live | g ( x ) | = T (cid:89) t =1 FracWt ( j t , α t ) ≥ k (cid:48) ) T T (cid:89) t =1 Frac ( j t , α t ) . But since at least one element remains, we have that (cid:81) Tt =1 Frac ( j t , α t ) ≥ k , and since T ≤ log k ,we conclude (recalling that k (cid:48) ≤ k ) that (cid:88) x ∈ S live | g ( x ) | ≥ k − O (log k ) . Combining with (2), this yields the claim.
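The iterative process used in this proof is itself a small algorithm, and a sketch may help. The following Python rendering is ours (the name correlated_junta and the dictionary representation of g are invented), and for simplicity it initializes the live set to all ℓ coordinates rather than to a distinguishing set of min{k, ℓ} coordinates as in the proof.

```python
from collections import Counter

def correlated_junta(g, ell):
    """Sketch of the iterative process from the proof of Claim 2.2.
    g: dict mapping length-ell tuples (the support) to nonzero real values.
    Returns constraints [(coordinate, value), ...]; at most ~log|supp(g)| of them."""
    s_live = set(g)                                  # live support elements
    coord_live = set(range(ell))                     # live coordinates (simplified; see above)
    kprime = max(1, min(len(g), ell))
    constr = []
    while len(s_live) > 1:
        total = sum(abs(g[x]) for x in s_live)
        frac_wt = lambda j, a: sum(abs(g[x]) for x in s_live if x[j] == a) / total
        frac = lambda j, a: sum(1 for x in s_live if x[j] == a) / len(s_live)
        maj = {j: Counter(x[j] for x in s_live).most_common(1)[0][0] for j in coord_live}
        light = [j for j in coord_live if frac_wt(j, maj[j]) <= 1 - 1.0 / (2 * kprime)]
        if not light:                                # every live coordinate is majority-heavy: halt
            break
        j = light[0]
        # pick alpha != MAJ(j) with Num(j, alpha) > 0 maximizing FracWt(j, alpha)/Frac(j, alpha)
        alpha = max(({x[j] for x in s_live} - {maj[j]}),
                    key=lambda b: frac_wt(j, b) / frac(j, b))
        constr.append((j, alpha))
        coord_live.discard(j)
        s_live = {x for x in s_live if x[j] == alpha}
    return constr
```

The list of constraints returned plays the role of the set U and the values α_1, . . . , α_{|U|} in Equation (1).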
The idea of the proof is quite similar to the algorithmic component of several recent works onpopulation recovery [MS13, WY12, LZ15, DST16]. Given any function f : S n → R and any integer i ∈ { , . . . , n } , we define the function f i : [ n ] i → R as follows: f i ( x , . . . , x i ) := (cid:88) σ ∈ S n f ( σ ) · [ σ (1) = x ∧ . . . ∧ σ ( i ) = x i ] . (4)At a high level, the algorithm A learn of Theorem 2.1 works in stages, by successively recon-structing f , . . . , f n . In each stage it uses the procedure described in the following claim, whichsays that high-accuracy approximations of the (log k )-marginals together with the support of f (cid:96) (ora not-too-large superset of it) suffices to reconstruct f (cid:96) : Claim 2.3.
Let f (cid:96) be an unknown distribution over [ n ] (cid:96) supported on a given set S of size k . Thereis an algorithm A one − stage which has the following guarantee: The algorithm is given as input δ > ,and parameters β J,y (for every set J ⊆ [ (cid:96) ] of size at most log k and every y ∈ [ n ] J ) which satisfy (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) β J,y − (cid:88) x ∈ S f ( x ) · [ x i = y i for all i ∈ J ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ δ.A one − stage runs in time poly ( n, (cid:96) log k ) and outputs a function ˜ f : [ n ] (cid:96) → [0 , such that (cid:107) f − ˜ f (cid:107) ≤ δ · k O (log k ) . roof. We consider a linear program which has a variable s x for each x ∈ S (representing theprobability that f puts on x ) and is defined by the following constraints:1. s x ≥ (cid:80) x ∈ S s x = 1.2. For each J ⊆ [ (cid:96) ] of size at most log k and each y ∈ [ n ] J , include the constraint (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) β J,y − (cid:88) x ∈ S s x · [ x i = y i for all i ∈ J ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ δ. (5)Algorithm A one − stage sets up and solves the above linear program (this can clearly be done in time poly ( n, (cid:96) log k )). We observe that the linear program is feasible since by definition s x = f (cid:96) ( x ) is afeasible solution. To prove the claim it suffices to show that every feasible solution is (cid:96) -close to f (cid:96) ;so let f ∗ ( x ) denote any other feasible solution to the linear program, and let η denote (cid:107) f ∗ − f (cid:96) (cid:107) . Define h ( x ) = f ∗ ( x ) − f (cid:96) ( x ) , so (cid:107) h (cid:107) = η. By Claim 2.2, we have that there is a subset J ⊆ [ (cid:96) ] ofsize at most log k and a y ∈ [ n ] (cid:96) such that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) x h ( x ) · [ x i = y i for all i ∈ J ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ η · k − O (log k ) . (6)On the other hand, since both f (cid:96) ( x ) and f ∗ ( x ) are feasible solutions to the linear program, by thetriangle inequality it must be the case that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) x h ( x ) · [ x i = y i for all i ∈ J ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ δ. (7)Equations 6 and 2.2 together give the desired upper bound on η , and the claim is proved.Essentially the only remaining ingredient required to prove Theorem 2.1 is a procedure to find (anot-too-large superset of) the support of f . This is given by the following claim, which inductivelyuses the algorithm A one − stage to successively construct suitable (approximations of) the supportsets for f , . . . , f n . Claim 2.4.
Under the assumptions of Theorem 2.1, there is an algorithm A support with the fol-lowing property: given as input a value δ > , algorithm A support runs in time poly( n/ε, n log k ) · T ( ε k O (log k ) , k, k , n ) and for each (cid:96) = 1 , . . . , n outputs a set S (cid:48) ( (cid:96) ) of size at most k which containsthe support of f (cid:96) .Proof. The algorithm A support works inductively, where at the start of stage (cid:96) (in which it willconstruct the set S (cid:48) ( (cid:96) ) ) it is assumed to have a set S (cid:48) ( (cid:96) − with | S (cid:48) ( (cid:96) − | ≤ k which contains thesupport of f (cid:96) − . (Note that at the start of the first stage (cid:96) = 1 this holds trivially since f triviallyhas empty support).Let us describe the execution of the (cid:96) -th stage of A support . For 1 ≤ (cid:96) ≤ n , we define the set S marg ,(cid:96) as follows: S marg ,(cid:96) = (cid:8) t : (cid:88) σ ∈ S n f ( σ ) · [ σ ( (cid:96) ) = t ] > (cid:9) . poly ( n/ε ) · T ( ε , , k, n ), we can compute f ( σ ) · [ σ ( (cid:96) ) = t ] up to error ± ε/ β (cid:96),t ) for all 1 ≤ t ≤ n . Since f is ε -heavy, we have that t ∈ S marg ,(cid:96) implies β (cid:96),t ≥ ε t (cid:54)∈ S marg ,(cid:96) implies β (cid:96),t ≤ ε . Consequently, we can compute the set S marg ,(cid:96) in time poly ( n/ε ) · T ( ε , , k, n ). The final observationis that the set S ∗ ( (cid:96) ) (of cardinality at most k ) obtained by appending each final (cid:96) -th characterfrom S marg ,(cid:96) to each element of S (cid:48) ( (cid:96) − must contain the support S ( (cid:96) ) of f (cid:96) . Set δ = ε k O (log k ) ; bythe assumption of Theorem 2.1, in time T ( ε k O (log k ) , k, k , n ) it is possible to obtain additively ± δ -accurate estimates of each of the (2 log k )-way marginals of f (cid:96) . In the (cid:96) -th stage, algorithm A support runs A one − stage using S ∗ ( (cid:96) ) and these estimates of the marginals; by Claim 2.3, this takes timepoly( n/ε, n log k ) and yields a function ˜ f (cid:96) : [ n ] (cid:96) → [0 ,
1] such that (cid:107) f (cid:96) − ˜ f (cid:96) (cid:107) ≤ δ k O (log k ) · k O (log k ) = ε/ . Since by assumption f is ε -heavy, it follows that any element x in the support of ˜ f (cid:96) such that˜ f (cid:96) ( x ) ≤ ε/ f (cid:96) ; so the algorithm removes all such elements x from S ∗ ( (cid:96) ) to obtain the set S (cid:48) ( (cid:96) ) . This resulting S (cid:48) ( (cid:96) ) is precisely the support of f (cid:96) , and is clearly of size atmost k. Finally, the overall algorithm A learn works by running A support to get the set S (cid:48) = S (cid:48) ( n ) of sizeat most k which is the support of f n = f , and then uses S (cid:48) and the algorithm A marginal fromthe assumptions of Theorem 2.1) to run algorithm A one − stage and obtain the required ε -accurateapproximator g of f . This concludes the proof of Theorem 2.1. Recall that the noisy ranking learning problems we consider are of the following sort: Thereis a known noise distribution K supported on S n , and an unknown k -sparse ε -heavy distribution f : S n → [0 , π ∼ K and σ ∼ f are obtained, and the sample givento the learner is ( πσ ) ∈ S n . By the reduction established in Theorem 2.1, in order to give analgorithm that learns the distribution f in the presence of a particular kind of noise K , it sufficesto give an algorithm that can efficiently estimate t -way marginals given samples πσ ∼ K ∗ f. The main result of this section, Theorem 3.1, gives such an algorithm. Before stating thetheorem we need some terminology and notation and we need to recall some necessary backgroundfrom representation theory of the symmetric group (see Appendix A for a detailed overview of allof the required background).First, let K be a distribution over S n (which should be thought of as a noise distribution asdescribed earlier). We say that K is efficiently samplable if there is a poly( n )-time randomizedalgorithm which takes no input and, each time it is invoked, returns an independent draw of π ∼ K . Next, we recall that a partition λ of the natural number n (written “ λ (cid:96) n ”) is a vector of naturalnumbers ( λ , . . . , λ k ) where λ ≥ λ ≥ . . . ≥ λ k > λ + . . . + λ k = n (see Appendix A.2for more detail). For two partitions λ and µ of n , we say that µ dominates λ , written µ (cid:3) λ , if (cid:80) j ≤ i µ j ≥ (cid:80) j ≤ i λ j for all i > λ (cid:96) n , let Up ( λ ) denote the set ofall partitions µ (cid:96) n such that µ (cid:3) λ.
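The dominance order and the set Up(λ) are simple enough to check mechanically; the following small Python check is ours and is included only as a concrete reading of the definition (the example partitions are arbitrary, chosen to match the hook-shaped partitions (n − ℓ, 1, . . . , 1) used below).

```python
def dominates(mu, lam):
    """Return True iff the partition mu dominates lam (both partitions of the same n,
    given as non-increasing tuples of positive integers)."""
    pm = pl = 0
    for i in range(max(len(mu), len(lam))):
        pm += mu[i] if i < len(mu) else 0
        pl += lam[i] if i < len(lam) else 0
        if pm < pl:
            return False
    return True

# Example with n = 10, ell = 3: a partition mu of n dominates the hook (n - ell, 1, 1, 1)
# exactly when its first part mu_1 is at least n - ell.
lam_hook = (7, 1, 1, 1)
assert dominates((8, 2), lam_hook) and not dominates((6, 4), lam_hook)
```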
We recall that a representation of the symmetric group S_n is a group homomorphism from S_n to C^{m×m} (see Appendix A). We further recall that for each partition λ ⊢ n there is a corresponding irreducible representation, denoted ρ_λ (see Appendix A.2). For a matrix M we write σ_min(M) to denote the smallest singular value of M. Given a partition λ ⊢ n we define the value σ_{min, Up(λ), K} to be

σ_{min, Up(λ), K} := min_{µ ∈ Up(λ)} σ_min(K̂(ρ_µ)),   (8)

the smallest singular value across all Fourier coefficients of the noise distribution at irreducible representations corresponding to partitions that dominate λ. (We recall that the Fourier coefficients of functions over the symmetric group, and indeed over any finite group, are matrices; see Appendix A.2.) Finally, for 0 ≤ ℓ ≤ n − 1 we define the partition λ_{hook,ℓ} ⊢ n to be λ_{hook,ℓ} := (n − ℓ, 1, . . . , 1), the hook-shaped partition with ℓ parts equal to 1. Now we can state the main result of this section:
Theorem 3.1.
Let K be an efficiently samplable distribution over S n . Let f be an unknowndistribution over S n . There is an algorithm A marginal with the following properties: A marginal receivesas input a parameter δ > , a confidence parameter τ > , a pair of (cid:96) -tuples ¯ i = ( i , . . . , i (cid:96) ) ∈ [ n ] (cid:96) , ¯ j = ( j , . . . , j (cid:96) ) ∈ [ n ] (cid:96) each composed of (cid:96) distinct elements, and has access to random samples from K ∗ f . Algorithm A marginal runs in time poly ( (cid:0) n(cid:96) (cid:1) , δ − , σ − , Up ( λ hook ,(cid:96) ) , K , log(1 /τ )) and outputs a value κ i,j which with probability at least − τ is a ± δ -accurate estimate of the ( i, j ) -marginal of f . We will use the following claim to prove Theorem 3.1:
Claim 3.2.
Let ρ : S n → C m × m be any unitary representation of S n , let K be any efficientlysamplable distribution over S n , and let σ min denote the smallest singular value of (cid:98) K ( ρ ) . Let f be anunknown distribution over S n . There is an algorithm which, given random samples from K ∗ f andan error parameter < δ < , runs in time poly ( m, n, σ − , δ − ) and with high probability outputsa matrix M f,ρ such that (cid:107) M f,ρ − (cid:98) f ( ρ ) (cid:107) ≤ δ .Proof. Let η , η > f is a distribution,the Fourier coefficient (cid:98) f ( ρ ) is equal to E σ ∼ f [ ρ ( σ )]. Consequently, since K is assumed to be efficientlysamplable and the algorithm is given samples from K ∗ f , by sampling from K and from K ∗ f itis straightforward to obtain matrices M , M in time poly( m, n, log(1 /τ )) which with probability1 − τ satisfy (cid:107) M − (cid:98) K ( ρ ) (cid:107) ≤ η and (cid:107) M − (cid:91) K ∗ f ( ρ ) (cid:107) ≤ η . Now we recall the following matrix perturbation inequality (see Theorem 2.2 of [Ste77]):
Lemma 3.3.
Let A ∈ R n × n be a non-singular matrix and further let ∆ A ∈ R n × n be such that (cid:107) ∆ A (cid:107) · (cid:107) A − (cid:107) < . Then A + ∆ A is non-singular. Further, if γ = 1 − (cid:107) A − (cid:107) (cid:107) ∆ A (cid:107) , then (cid:107) A − − ( A + ∆ A ) − (cid:107) ≤ (cid:107) A − (cid:107) (cid:107) ∆ A (cid:107) γ . η and η as follows (recall that δ < η = min (cid:8) δ · σ , δ · σ min (cid:9) and η = min { δ · σ min , } . (9)Applying Lemma 3.3 with (cid:98) K ( ρ ) in place of A and M − (cid:98) K ( ρ ) in place of ∆ A , using (9) (moreprecisely, the upper bound η ≤ δ · σ / η ≤ δ · σ min / (cid:107) M − − (cid:98) K ( ρ ) − (cid:107) ≤ (cid:107) (cid:98) K ( ρ ) − (cid:107) · (cid:107) M − (cid:98) K ( ρ ) (cid:107) − (cid:107) (cid:98) K ( ρ ) − (cid:107) · (cid:107) M − (cid:98) K ( ρ ) (cid:107) ≤ δ . (10)Now using (cid:91) K ∗ f ( ρ ) = (cid:98) K ( ρ ) · (cid:98) f ( ρ ), we get (cid:107) M − · M − (cid:98) f ( ρ ) (cid:107) = (cid:107) M − · M − (cid:98) K ( ρ ) − · (cid:91) K ∗ f ( ρ ) · (cid:107) ≤ (cid:107) M − · M − M − · (cid:91) K ∗ f ( ρ ) (cid:107) + (cid:107) M − · (cid:91) K ∗ f ( ρ ) − (cid:98) K ( ρ ) − · (cid:91) K ∗ f ( ρ ) (cid:107) ≤ (cid:107) M − (cid:107) · (cid:107) M − (cid:91) K ∗ f ( ρ ) (cid:107) + (cid:107) M − − (cid:98) K ( ρ ) − (cid:107) · (cid:107) (cid:91) K ∗ f ( ρ ) (cid:107) ≤ (cid:107) M − (cid:107) · η + (cid:107) (cid:91) K ∗ f ( ρ ) (cid:107) · δ . (using (10)) ≤ η (cid:16) (cid:107) (cid:98) K ( ρ ) − (cid:107) + (cid:107) M − − (cid:98) K ( ρ ) − (cid:107) (cid:17) + (cid:107) (cid:91) K ∗ f ( ρ ) (cid:107) · δ . (using (10)) ≤ σ − · η + δ · η + (cid:107) (cid:91) K ∗ f ( ρ ) (cid:107) · δ . (11)Next we use the following fact, which is an easy consequence of the triangle inequality and theassumption that ρ is unitary: Fact 3.4.
Let ρ : S n → C m × m be a unitary representation and let g : S n → R + . Then we havethat (cid:107) (cid:98) g ( ρ ) (cid:107) ≤ (cid:107) g (cid:107) . Combining this fact with (11) and (9), since (cid:107)K ∗ f (cid:107) = 1, we get that (cid:107) M − · M − (cid:98) f ( ρ ) (cid:107) ≤ σ − · η + δ · η + δ ≤ δ δ δ < δ. This concludes the proof of Claim 3.2.With Claim 3.2 in hand we are ready to prove Theorem 3.1: P roof of Theorem 3.1. Let τ λ hook ,(cid:96) be the permutation representation corresponding to the partition λ hook ,(cid:96) ; for conciseness we subsequently write ρ for τ λ hook ,(cid:96) . Definition A.10 immediately gives thatthe dimension of ρ is (cid:0) n(cid:96) (cid:1) . Observe that ρ is a unitary representation. Let σ min denote the smallestsingular value of (cid:98) K ( ρ ); applying Claim 3.2, we get an algorithm running in time poly ( (cid:0) n(cid:96) (cid:1) , σ − , δ )which outputs a matrix M f,ρ such that (cid:107) M f,ρ − (cid:98) f ( ρ ) (cid:107) ≤ δ . Next, we observe that the Youngtableaux corresponding to the partition λ hook ,(cid:96) (which, recalling Definition A.10, index the rowsand columns of ρ ( · )) correspond precisely to ordered t -tuples of distinct entries of [ n ]. If Y λ hook ,(cid:96) ,i = i and Y λ hook ,(cid:96) ,j = j , then it follows that (cid:98) f ( ρ )( i, j ) = (cid:88) σ ∈ S n f ( σ ) · [ f ( i ) = j and · · · and f ( i (cid:96) ) = j (cid:96) )] , i, j )-marginal of f as desired; so the output of the algorithm is M f,ρ ( i, j ).To finish the correctness argument it remains only to argue that σ − is at most poly( σ − , Up ( λ hook ,(cid:96) ) ) . To see that this is indeed, the case, we observe that by Theorem A.12, the permutation represen-tation τ λ hook ,(cid:96) block diagonalizes into a direct sum of irreducible representations ρ µ where each µ belongs to Up ( λ hook ,(cid:96) ). This finishes the proof of Theorem 3.1. In order to apply Theorem 3.1 to a particular noise distribution K we need to confirm that K isefficiently samplable; we now do this for each of the three noise models that we consider. It isimmediate from the definition that it is straightforward (given p ) to efficiently generate a random σ drawn from the symmetric noise distribution S p , and the same is true for the heat kernel noisedistribution H t .For the generalized Mallows model M θ , the characterization Pr σ ∼M θ [ σ = π ] = e − θd ( π,e ) /Z ( θ )given earlier does not directly yield an efficient sampling algorithm, since it may be hard to com-pute or approximate the normalizing factor Z ( θ ) = (cid:80) π ∈ S n e − θd ( π,e ) . Instead, we recall (see e.g. Sec-tion 2.1 of [DS98]) that the Metropolis algorithm can be used to efficiently perform a random walkon S n whose unique stationary distribution is the generalized Mallows distribution M θ . (Each stepof the random walk can be carried out efficiently because it is computationally easy to compute theCayley distance between two permutations: if π is the permutation that brings σ to τ , then theCayley distance d ( σ, τ ) is n − cycles( π ) where cycles( π ) is the number of cycles in π .) It is known(see e.g. Theorem 2 of [DH92]) that this random walk has rapid convergence, and consequently itis indeed possible to sample efficiently from M θ (up to an exponentially small statistical distancewhich can be ignored in our applications since our algorithms use a sub-exponential number ofsamples). 
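To make the samplability discussion concrete, the following Python sketch (ours, not the paper's implementation) spells out the Metropolis chain just described: the target distribution is M_θ, each proposal multiplies the current permutation by a uniformly random transposition, and the Cayley distance is tracked through the cycle count as noted above. The default number of burn-in steps is a heuristic assumption on our part, not a bound taken from [DH92].

```python
import math
import random

def num_cycles(sigma):
    """Number of cycles of a permutation given as a list: sigma[i] = image of i."""
    seen, count = [False] * len(sigma), 0
    for i in range(len(sigma)):
        if not seen[i]:
            count += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = sigma[j]
    return count

def sample_cayley_mallows(theta, n, steps=None, rng=random):
    """Metropolis chain whose stationary distribution is M_theta, i.e.
    Pr[sigma] proportional to exp(-theta * d(sigma, e)) = exp(theta * cycles(sigma)).
    `steps` is a burn-in heuristic (our assumption), not a mixing-time bound."""
    if steps is None:
        steps = 10 * n * (int(math.log(n + 1)) + 1)
    sigma = list(range(n))
    c = num_cycles(sigma)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        sigma[i], sigma[j] = sigma[j], sigma[i]      # propose multiplying by (i j)
        c_new = num_cycles(sigma)
        if c_new < c and rng.random() >= math.exp(theta * (c_new - c)):
            sigma[i], sigma[j] = sigma[j], sigma[i]  # reject: undo the swap
        else:
            c = c_new                                # accept
    return sigma
```

A noisy sample from M_θ ∗ f is then obtained by drawing σ from the mixture and outputting π·σ for π generated as above.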
In this section we establish lower bounds on the smallest singular value for the relevant matricescorresponding to “symmetric noise” S p on S n . In more detail, the main result of this section is thefollowing lower bound: Lemma 4.1.
Let (cid:96) ∈ { , . . . , n } and let p = ( p , . . . , p n ) ∈ ∆ n (i.e. p is a non-negative vectorwhose entries sum to 1) which is such that n − (cid:96) (cid:88) j =0 p j ≥ κ. Then (recalling Equation (8) ) we have that σ min , Up ( λ hook ,(cid:96) ) , S p ≥ κn (cid:96) . (12)15 .1 Setup To analyze the smallest singular value of (cid:99) S p ( ρ µ ) (as required by the definition of σ min , Up ( λ hook ,(cid:96) ) , S p ),we start by observing that symmetric noise is a class function (meaning that it is invariant underconjugation, see Definition A.6): Claim 4.2.
For any vector p = ( p , . . . , p n ) ∈ ∆ n , the distribution S p (viewed as a function from S n to [0 , ) is a class function ( i.e. S p ( π ) = S p ( τ πτ − ) for every π, τ ∈ S n ).Proof. For 0 ≤ j ≤ n , let e j denote the vector in R n +1 which has a 1 in the j -th position and a 0 inevery other position. By linearity, to prove Claim 4.2 it suffices to prove that S e j is invariant underconjugation for every j ; to establish this, it suffices to show that S e j is invariant under conjugationby any transposition τ . By symmetry, it suffices to consider the transposition τ = (1 , S e j is a uniform average of U A over all (cid:0) nj (cid:1) subsets A of [ n ] of size exactly j . Now we consider two cases: the first is that | A ∩ { , }| is 0 or 2. In this case it is easy tosee that U A does not change under conjugation by the transposition (1 , | A ∩ { , }| = 1; in this case it is easy to see that conjugation by (1 ,
2) converts U A into U A ∆ { , } . Since the collection of size- j sets A with A ∩ { , } = { } are in 1-1 correspondencewith the collection of size- j sets A with A ∩ { , } = { } , it follows that S e j is invariant underconjugation by τ = (1 , µ (cid:96) m, λ (cid:96) n where m ≤ n , we write Paths( µ, λ ) to denote the number of paths from µ to λ in Young’s lattice (see Ap-pendix A.2 and Theorem A.15). We write Triv j to denote the trivial partition ( j ) of j . Lemma 4.3.
Let λ (cid:96) n and let ρ λ be the corresponding irreducible representation of S n . Given p = ( p , . . . , p n ) ∈ ∆ n , we have that (cid:99) S p ( ρ λ ) = c ( p, λ ) · Id where c ( p, λ ) := (cid:80) nj =0 p j · Paths(
Triv j , λ ) dim ( ρ λ ) . (13) Proof.
By Claim 4.2, we have that S p is a class function, so we may apply Lemma A.9 to concludethat (cid:99) S p ( ρ λ ) = c ( p, λ ) · Id , where c ( p, λ ) = 1 dim ( ρ λ ) · (cid:18) (cid:88) σ ∈ S n S p ( σ ) · χ λ ( σ ) (cid:19) and χ λ denotes the character of the irreducible representation ρ λ . Thus it remains to show that (cid:80) σ ∈ S n S p ( σ ) · χ λ ( σ ) is equal to the numerator of Equation (13). By definition of S p , we have that (cid:88) σ ∈ S n S p ( σ ) · χ λ ( σ ) = (cid:88) ≤ j ≤ n p j E A : |A| = j E σ ∈ U A χ λ ( σ ) . (14)We proceed to analyze E σ ∈ U A χ λ ( σ ). Let ρ A λ denote the representation ρ λ restricted to the subgroup S A . By Theorem A.15, the representation ρ A λ splits as follows: ρ A λ = ⊕ µ (cid:96)|A| Paths( µ, λ ) ρ µ . E σ ∈ U A χ λ ( σ ) = (cid:88) µ (cid:96)|A| Paths( µ, λ ) E σ ∈ U A χ µ ( σ ) = Paths( Triv |A| , λ ) . The second equality follows from that fact that if µ is a non-trivial partition of |A| then E σ ∈ U A χ µ ( σ ) =0, while if µ = Triv |A| then E σ ∈ U A χ µ ( σ ) = 1. Plugging this into (14) we get that (cid:80) σ ∈ S n S p ( σ ) · χ λ ( σ ) = (cid:80) nj =0 p j · Paths(
Triv j , λ ), and the lemma is proved. We recall from Equation (8) that σ min , Up ( λ hook ,(cid:96) ) , S p := min µ ∈ Up ( λ hook ,(cid:96) ) σ min ( (cid:99) S p ( ρ µ )) . Fix any µ ∈ Up ( λ hook ,(cid:96) ), so µ is a partition of n of the form ( n − (cid:96) (cid:48) , (cid:96) , . . . , (cid:96) r ) where (cid:96) (cid:48) ≤ (cid:96). By Lemma 4.3 we have that the smallest singular value of (cid:99) S p ( ρ µ ) is c ( p, µ ) := (cid:80) nj =0 p j · Paths(
Triv j , µ ) dim ( ρ µ ) . (15)To upper bound dim ( ρ µ ), we observe thatdim( ρ µ ) ≤ dim( τ µ ) = (cid:18) nn − (cid:96) (cid:48) , (cid:96) , . . . , (cid:96) r (cid:19) ≤ n !( n − (cid:96) (cid:48) )! ≤ n (cid:96) (cid:48) ≤ n (cid:96) , where the first inequality is by Theorem A.12. For the numerator, we observe that if j ≤ n − (cid:96) then there is at least one path in the Young lattice from Triv j to µ , so under the assumptionsof Lemma 4.1 the numerator of Equation (15) is at least κ. This proves the lemma.
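The quantity Paths(Triv_j, λ) appearing in c(p, λ) is purely combinatorial and, for small shapes, can be computed by brute force. The following recursive sketch is ours (the helper names paths and triv are invented) and simply peels corner boxes off λ one at a time.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def paths(mu, lam):
    """Number of paths from partition mu up to partition lam in Young's lattice
    (each step adds one box; partitions are non-increasing tuples, () is empty)."""
    if lam == mu:
        return 1
    if sum(lam) <= sum(mu):
        return 0
    total = 0
    for i in range(len(lam)):                 # try removing a corner box from row i of lam
        if i == len(lam) - 1 or lam[i] > lam[i + 1]:
            smaller = list(lam)
            smaller[i] -= 1
            if smaller[-1] == 0:
                smaller.pop()
            total += paths(mu, tuple(smaller))
    return total

def triv(j):                                  # the trivial partition (j) of j
    return (j,) if j > 0 else ()

# Sanity checks: Paths(Triv_0, lam) is the number of standard Young tableaux of shape lam.
assert paths(triv(0), (2, 1)) == 2
assert paths(triv(2), (3, 1)) == 2            # (2)->(3)->(3,1) and (2)->(2,1)->(3,1)
```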
In this section, analogous to Section 4, we lower bound Equation (8) when the noise distribution K is H t , corresponding to “heat kernel noise” at temperature parameter t : Lemma 5.1.
Let t ≥ and let (cid:96) ∈ { , . . . , cn } for some suitably small universal constant c > .Then we have that σ min , Up ( λ hook ,(cid:96) ) , H t ≥ · e − O ( (cid:96)t ) /n . (16) Let trans : S n → [0 ,
1] be the following probability distribution over S n :trans( π ) = /n if π is the identity,2 /n if π is a transposition,0 otherwise.17ince trans( π ) depends only on the cycle structure of π , the function trans( · ) is a class function.Fix any µ ∈ Up ( λ hook ,(cid:96) ), so µ is a partition of n of the form ( µ , . . . , µ r ) where µ ≥ n − (cid:96). As inthe proof of Lemma 4.3 we may apply Lemma A.9 to conclude that (cid:91) trans( ρ µ ) = c trans ,µ · Id for some constant c trans ,µ . By Corollary 1 of Diaconis and Shahshahani [DS81], we have that c trans ,µ = 1 n + n − n · χ µ ( τ )dim( ρ µ ) , (17)where as before χ µ denotes the character of the irreducible representation ρ µ and τ is any trans-position. [DS81] further shows that for ρ µ an irreducible representation of S n with µ as above and τ any transposition, it holds that χ µ ( τ )dim( ρ µ ) = 1 n ( n − · r (cid:88) j =1 ( µ j − j )( µ j − j + 1) − j ( j − . (18)In our setting we have(18) ≥ ( n − (cid:96) )( n − (cid:96) − n ( n −
1) + 1 n ( n − r (cid:88) j =2 ( µ j − j )( µ j − j + 1) − j ( j − . (19)where the inequality holds because µ ≥ n − (cid:96). Now, we observe that for each summand in Equa-tion (19), we have ( µ j − j )( µ j − j + 1) − j ( j −
1) = µ j − µ j (2 j − ≥ − µ j (2 j − ≥ − (cid:96)j − · (2 j − ≥ − (cid:96). The second inequality above holds because µ + · · · + µ j ≤ (cid:96) and the µ j ’s are non-increasing, so µ j ≤ (cid:96)j − . Since r − ≤ (cid:96) , this means that(18) ≥ ( n − (cid:96) )( n − (cid:96) − n ( n − − (cid:96) n ( n − ≥ − O ( (cid:96) ) n , and recalling Equation (17) we get that1 ≥ c trans ,µ ≥ − O ( (cid:96) ) n . (20) As in Section 4 we recall from Equation (8) that σ min , Up ( λ hook ,(cid:96) ) , H t := min µ ∈ Up ( λ hook ,(cid:96) ) σ min ( (cid:99) H t ( ρ µ )) , µ ∈ Up ( λ hook ,(cid:96) ) (so µ is a partition of n of the form ( µ , . . . , µ r ) where µ ≥ n − (cid:96) ). Werecall that the function H t : S n → [0 ,
1] is defined by H t = ∞ (cid:88) j =0 Pr T ∼ Poi ( t ) [ T = j ](trans) j , where “(trans) T ” denotes T -fold convolution of trans. Since convolution corresponds to multipli-cation of Fourier coefficients, this gives that (cid:99) H t ( ρ µ ) = c ( t, µ ) · Id , where c ( t, µ ) := ∞ (cid:88) j =0 Pr T ∼ Poi ( t ) [ T = j ]( c trans ,µ ) j . (21)Recalling [Cho94] that the median of the Poisson distribution Poi ( t ) is at most t + 1 /
3, we get that c ( t, µ ) ≥ · ( c trans ,µ ) t +1 / ≥ · e − O ( (cid:96)t ) /n , (where the second inequality uses (cid:96) ≤ cn and t ≥ In this section we lower bound Equation (8) when the noise distribution K is M θ , correspondingto the Cayley-Mallows noise model with parameter θ : Lemma 6.1.
In this section we lower bound Equation (8) when the noise distribution $\mathcal{K}$ is $M_\theta$, corresponding to the Cayley-Mallows noise model with parameter $\theta$:

Lemma 6.1. Let $\theta > 0$, let $\ell \in \{1,\ldots,n\}$, and let $\eta := \mathrm{dist}(\theta,\ell) = \min_{j\in\{0,\ldots,\ell\}}\big|e^{\theta} - j\big|$. Then (recalling Equation (8)) we have that
$$\sigma_{\min,\,\mathrm{Up}(\lambda_{\mathrm{hook},\ell}),\,M_\theta} \;\ge\; (2n)^{-\ell}\,\eta^{2\sqrt{2\ell}}. \qquad (22)$$
Similar to the previous two sections, Lemma 6.1 follows immediately from the following lower bound on singular values of certain irreducible representations:

Lemma 6.2.
Let $\mu$ be a partition of $n$ of the form $(\mu_1,\ldots,\mu_r)$ where $\mu_1 \ge n-\ell$. Let $\theta > 0$ and let $\eta := \mathrm{dist}(\theta,\ell) = \min_{j\in\{0,\ldots,\ell\}}\big|e^{\theta} - j\big|$. Then we have that $\widehat{M_\theta}(\rho_\mu) = c_{\mu,\theta}\cdot\mathrm{Id}$, where
$$|c_{\mu,\theta}| \;\ge\; (2n)^{-\ell}\,\eta^{2\sqrt{2\ell}}.$$
To prove Lemma 6.2, we will need the notions of content and hook length for boxes in a Young diagram:
Definition 6.3.
Let $\mu$ be a partition $\mu \vdash n$. The hook length of a box $u$ in the Young diagram for $\mu$, denoted by $h(u)$, is the sum
(number of boxes to the right of $u$ in its row) + (number of boxes below $u$ in its column) + 1 (for $u$ itself).
The content $c(u)$ of a box $u$ is $c(u) := j - i$, where $j$ is its column number (from the left, starting with column 1) and $i$ is its row number (from the top, starting with row 1).

[Figure: an example Young diagram shown twice, with each box labeled by its hook length in the left copy and by its content in the right copy.]
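To make Definition 6.3 concrete, here is a short Python sketch (ours, not from the paper) that computes the hook length and content of every box of a Young diagram; the example partition is arbitrary.

```python
def hooks_and_contents(mu):
    """Map each box (i, j) of the Young diagram of mu (1-indexed row i, column j)
    to its hook length h(u) and its content c(u) = j - i."""
    mu = list(mu)
    # conjugate partition: number of rows whose length exceeds column index j (0-indexed)
    mu_conj = [sum(1 for row in mu if row > j) for j in range(mu[0])]
    hooks, contents = {}, {}
    for i, row_len in enumerate(mu, start=1):
        for j in range(1, row_len + 1):
            arm = row_len - j             # boxes strictly to the right in the same row
            leg = mu_conj[j - 1] - i      # boxes strictly below in the same column
            hooks[(i, j)] = arm + leg + 1
            contents[(i, j)] = j - i
    return hooks, contents

h, c = hooks_and_contents((4, 2, 1, 1))
print(h[(1, 1)], c[(1, 4)], c[(4, 1)])    # hook length 7, contents 3 and -3
```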
Lemma 6.4. Let $\mu \vdash n$ and let $\chi_\mu$ be the corresponding character of $S_n$. For any $q \in \mathbb{R}$,
$$\frac{1}{n!}\sum_{\sigma\in S_n} \chi_\mu(\sigma)\cdot q^{\mathrm{cycles}(\sigma)} \;=\; \prod_{u\in\mu}\frac{q + c(u)}{h(u)},$$
where the subscript "$u\in\mu$" means that $u$ ranges over all the boxes in the Young diagram corresponding to $\mu$.

Proof. The above identity is given as Exercise 7.50 in Stanley's book [Sta99]. For the sake of completeness, we provide the proof here.

For any $\bar{t} = (t_1,\ldots,t_n)$, we define the polynomial
$$a_{\bar t}(x_1,\ldots,x_n) \;:=\; \det\begin{pmatrix} x_1^{t_1} & x_2^{t_1} & \cdots & x_n^{t_1}\\ x_1^{t_2} & x_2^{t_2} & \cdots & x_n^{t_2}\\ \vdots & \vdots & \ddots & \vdots\\ x_1^{t_n} & x_2^{t_n} & \cdots & x_n^{t_n}\end{pmatrix}.$$
Given any partition $\mu \vdash n$, we now define the Schur polynomial $s_\mu(x_1,\ldots,x_n)$ as follows: define $\bar t_\mu = (\mu_1 + n - 1,\ \mu_2 + n - 2,\ \ldots,\ \mu_n + 0)$ and $\bar t_0 = (n-1, n-2, \ldots, 0)$, and set
$$s_\mu(x_1,\ldots,x_n) \;:=\; \frac{a_{\bar t_\mu}(x_1,\ldots,x_n)}{a_{\bar t_0}(x_1,\ldots,x_n)}.$$
The denominator is just the Vandermonde determinant of the variables $(x_1,\ldots,x_n)$. As the polynomial $a_{\bar t_\mu}(x_1,\ldots,x_n)$ is alternating, it follows that $s_\mu(x_1,\ldots,x_n)$ is a polynomial (as opposed to a rational function) and, further, it is symmetric.

The following is a fundamental fact connecting Schur polynomials and cycles: for any $0 \le k \le n$,
$$s_\mu(\underbrace{1,\ldots,1}_{k},\underbrace{0,\ldots,0}_{n-k}) \;=\; \sum_{\sigma\in S_n}\frac{1}{n!}\cdot\chi_\mu(\sigma)\cdot k^{\mathrm{cycles}(\sigma)} \qquad (23)$$
(see Equation 7.78 in [Sta99]). On the other hand, there are known explicit formulas for evaluations of the Schur polynomial at specific inputs. In particular, Corollary 7.21.4 of [Sta99] states that
$$s_\mu(\underbrace{1,\ldots,1}_{k},\underbrace{0,\ldots,0}_{n-k}) \;=\; \prod_{u\in\mu}\frac{k + c(u)}{h(u)}. \qquad (24)$$
Combining (23) and (24), we get that for any $0 \le k \le n$ we have
$$\frac{1}{n!}\sum_{\sigma\in S_n}\chi_\mu(\sigma)\cdot k^{\mathrm{cycles}(\sigma)} \;=\; \prod_{u\in\mu}\frac{k + c(u)}{h(u)}.$$
However, note that both the left- and right-hand sides can be seen as polynomials of degree at most $n$ in the variable $k$. Since they agree at the $n+1$ values $k = 0,\ldots,n$, they must be identical as formal polynomials. This concludes the proof.
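Lemma 6.4 can be checked by brute force for small $n$ in the two cases where the character is easy to write down, namely the trivial character of $\mu = (n)$ and the sign character of $\mu = (1,\ldots,1)$. The following Python sketch (ours, not from the paper) does exactly this.

```python
from itertools import permutations
from math import factorial, prod

def cycles(perm):
    """Number of cycles of a permutation of {0,...,n-1} given in one-line notation."""
    seen, count = set(), 0
    for start in range(len(perm)):
        if start not in seen:
            count += 1
            i = start
            while i not in seen:
                seen.add(i)
                i = perm[i]
    return count

n, q = 5, 2.7
# Left-hand side of Lemma 6.4 for the trivial character (mu = (n), chi = 1)
# and for the sign character (mu = (1,...,1), chi(sigma) = (-1)^(n - cycles(sigma))).
lhs_triv = sum(q ** cycles(p) for p in permutations(range(n))) / factorial(n)
lhs_sign = sum((-1) ** (n - cycles(p)) * q ** cycles(p)
               for p in permutations(range(n))) / factorial(n)
# Right-hand side: for (n) the contents are 0,...,n-1; for (1^n) they are 0,-1,...,-(n-1);
# in both cases the hook lengths are n, n-1, ..., 1, so their product is n!.
rhs_triv = prod(q + i for i in range(n)) / factorial(n)
rhs_sign = prod(q - i for i in range(n)) / factorial(n)
print(abs(lhs_triv - rhs_triv) < 1e-9, abs(lhs_sign - rhs_sign) < 1e-9)
```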
Proof of Lemma 6.2. Recall that the distribution $M_\theta$ over $S_n$ is defined by $M_\theta(\pi) = e^{-\theta d(\pi,e)}/Z(\theta)$, where $Z(\theta) = \sum_{\pi\in S_n} e^{-\theta d(\pi,e)}$ is a normalizing constant. Since the Cayley distance $d(\sigma,\tau)$ is equal to $n - \mathrm{cycles}(\sigma^{-1}\tau)$, where $\mathrm{cycles}(\pi)$ is the number of cycles in $\pi$, we have that
$$M_\theta(\pi) \;=\; \frac{e^{\theta\cdot\mathrm{cycles}(\pi)}}{C}, \qquad \text{where } C = \sum_{\pi\in S_n} e^{\theta\cdot\mathrm{cycles}(\pi)}.$$
Since the $\mathrm{cycles}(\cdot)$ function is a class function, so is $M_\theta$, so we can apply Lemma A.9 and we get that $\widehat{M_\theta}(\rho_\mu) = c_{\mu,\theta}\cdot\mathrm{Id}$, where
$$c_{\mu,\theta} \;=\; \frac{\sum_{\sigma\in S_n} M_\theta(\sigma)\cdot\chi_\mu(\sigma)}{\dim(\rho_\mu)} \;=\; \frac{\sum_{\sigma\in S_n} e^{\theta\cdot\mathrm{cycles}(\sigma)}\cdot\chi_\mu(\sigma)}{\dim(\rho_\mu)\cdot\big(\sum_{\sigma\in S_n} e^{\theta\cdot\mathrm{cycles}(\sigma)}\big)} \;=\; \frac{\sum_{\sigma\in S_n} q^{\mathrm{cycles}(\sigma)}\cdot\chi_\mu(\sigma)}{\dim(\rho_\mu)\cdot\big(\sum_{\sigma\in S_n} q^{\mathrm{cycles}(\sigma)}\big)},$$
where $q := e^{\theta}$. We re-express the numerator by applying Lemma 6.4 to get
$$\sum_{\sigma\in S_n} q^{\mathrm{cycles}(\sigma)}\cdot\chi_\mu(\sigma) \;=\; n!\cdot\prod_{u\in\mu}\frac{q + c(u)}{h(u)}. \qquad (25)$$
To analyze the denominator of $c_{\mu,\theta}$, applying Lemma 6.4 to the trivial partition $\mathrm{Triv}_n = (n)$ of $n$ (the character of which is identically 1), we get that
$$\sum_{\sigma\in S_n} q^{\mathrm{cycles}(\sigma)} \;=\; n!\cdot\prod_{u\in\mathrm{Triv}_n}\frac{q + c(u)}{h(u)} \;=\; q(q+1)\cdots(q+n-1). \qquad (26)$$
For the rest of the denominator, we recall the following well-known fact about the dimension of irreducible representations of the symmetric group:

Fact 6.5 (Hook length formula, see e.g. Theorem 3.41 of [Mél17]). For $\mu \vdash n$, $\dim(\rho_\mu) = \dfrac{n!}{\prod_{u\in\mu} h(u)}$.

Combining (25), (26) and Fact 6.5, we get
$$c_{\mu,\theta} \;=\; \frac{\prod_{u\in\mu}(q + c(u))}{q(q+1)\cdots(q+n-1)}. \qquad (27)$$
Let $\mathcal{A}$ denote the set consisting of the cells of the Young diagram of $\mu$ which are not in the first row. Since $n - \mu_1 = \ell'$ for some $\ell' \le \ell$, the above expression simplifies to
$$c_{\mu,\theta} \;=\; \frac{\prod_{u\in\mathcal{A}}(q + c(u))}{(q+n-\ell')\cdots(q+n-1)}. \qquad (28)$$
To bound this ratio, first observe that both the numerator and denominator are $\ell'$-way products. There are two possibilities now:

1. Case 1: $q \ge \ell+1$. In this case we observe that each cell $u \in \mathcal{A}$ satisfies $c(u) \ge -\ell' \ge -\ell$. Thus $c_{\mu,\theta}$ can be expressed as a product of $\ell'$ many fractions, each of which is at least
$$\frac{q-\ell}{q+n-1} \;\ge\; \frac{1}{\ell+n}.$$
This implies that $c_{\mu,\theta} \ge \big(\tfrac{1}{n+\ell}\big)^{\ell'} \ge (2n)^{-\ell}$.

2. Case 2: $q \le \ell+1$. In this case, the denominator of Equation (28) is at most $(2n)^{\ell}$. To lower bound the numerator, observe that for every cell $u$ of $\mathcal{A}$, the value of $c(u)$ is an integer in $\{-\ell,\ldots,\ell\}$. Let $j_1$ and $j_2$ denote the two values of $j$ in $\{-\ell,\ldots,\ell\}$ for which $|q+j|$ achieves its smallest value and its next smallest value (note that these two values are equal if $\eta = 1/2$). At most $\sqrt{2\ell}$ many cells of $\mathcal{A}$ can have content equal to any given fixed integer value. Since $j_1$ and $j_2$ are the only possible values of $j \in \{-\ell,\ldots,\ell\}$ for which $|q+j| < 1$, and each such factor is at least $\eta$, it follows that
$$\prod_{u\in\mathcal{A}} |q+c(u)| \;\ge\; \prod_{u\in\mathcal{A}:\,c(u)=j_1} |q+c(u)|\;\cdot \prod_{u\in\mathcal{A}:\,c(u)=j_2} |q+c(u)| \;\ge\; \eta^{2\sqrt{2\ell}}.$$
This finishes the proof.
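Equation (27) also gives a direct way to evaluate the multiplier $c_{\mu,\theta}$ numerically. The following Python sketch (ours, not from the paper) does so, and illustrates how the multiplier shrinks as $e^{\theta}$ approaches an integer in $\{0,\ldots,\ell\}$, in line with the $\mathrm{dist}(\theta,\ell)$ dependence of Lemma 6.2.

```python
from math import exp, log, prod

def c_mallows(mu, theta):
    """Fourier multiplier c_{mu,theta} of the Cayley-Mallows kernel at rho_mu,
    via Equation (27): prod_{u in mu} (q + c(u)) / (q (q+1) ... (q+n-1)), q = e^theta."""
    n, q = sum(mu), exp(theta)
    num = 1.0
    for i, row_len in enumerate(mu, start=1):
        for j in range(1, row_len + 1):
            num *= q + (j - i)             # the content of box (i, j) is j - i
    return num / prod(q + i for i in range(n))

# mu has first row of length n - l with l = 3; the multiplier degrades as q -> 2.
mu = (9, 2, 1)
for q_target in (2.5, 2.05, 2.005):
    print(q_target, c_mallows(mu, log(q_target)))
```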
In this brief section we put all the pieces together to obtain our main positive results, Theorems 1.2, 1.3 and 1.4, for the symmetric, heat kernel, and generalized Mallows noise models respectively.
Symmetric noise.
Under the assumptions of Theorem 1.2 (that $\sum_{j=0}^{n-\log k} p_j \ge n^{-O(\log k)}$), taking $\ell = \log k$ in Lemma 4.1, we have that $\sigma_{\min,\,\mathrm{Up}(\lambda_{\mathrm{hook},\log k}),\,S_p} \ge n^{-O(\log k)}$. Since (as discussed in Section 3.1) $S_p$ is efficiently samplable given $p$, by Theorem 3.1, in time $\mathrm{poly}(n^{\log k}, 1/\delta, \log(1/\tau))$ and with probability $1-\tau$, it is possible to obtain $\pm\delta$-accurate estimates of all of the $(\log k)$-way marginals of $f$. Setting $\delta = \varepsilon/k^{O(\log k)}$ and applying Theorem 2.1, we get Theorem 1.2.
Heat kernel noise. First observe that we may assume that the temperature parameter $t$ is at least 1 (since otherwise it is easy to artificially add noise to achieve $t = 1$). Under the assumptions of Theorem 1.3 (that $t = O(n\log n)$), taking $\ell = \log k$ in Lemma 5.1, we have that $\sigma_{\min,\,\mathrm{Up}(\lambda_{\mathrm{hook},\log k}),\,H_t} \ge n^{-O(\log k)}$. Theorem 1.3 follows as in the previous paragraph (this time using the efficient samplability of $H_t$ given $t$).
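Since the argument above only needs sample access to $H_t$, we note that such samples are easy to generate: draw $T \sim \mathrm{Poi}(t)$ and apply $T$ independent draws from $\mathrm{trans}$. The following Python sketch (ours, not from the paper) is one way to do this; the Poisson sampler used is Knuth's method, which is adequate for moderate $t$.

```python
import math
import random

def sample_trans(n, rng=random):
    """One draw from trans: identity with probability 1/n, otherwise a
    uniformly random transposition (each transposition has probability 2/n^2)."""
    if rng.random() < 1.0 / n:
        return None                      # identity: nothing to apply
    return tuple(rng.sample(range(n), 2))

def sample_heat_kernel(n, t, rng=random):
    """One draw pi ~ H_t over S_n, in one-line notation (0-indexed):
    compose T ~ Poisson(t) independent draws from trans."""
    T, p, L = 0, 1.0, math.exp(-t)       # Knuth's Poisson sampler
    while True:
        p *= rng.random()
        if p <= L:
            break
        T += 1
    pi = list(range(n))
    for _ in range(T):
        swap = sample_trans(n, rng)
        if swap is not None:
            i, j = swap
            pi[i], pi[j] = pi[j], pi[i]
    return pi

print(sample_heat_kernel(10, 3.0))
```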
Cayley-Mallows noise. Under the assumptions of Theorem 1.4, taking $\ell = \log k$ in Lemma 6.1 we get that $\sigma_{\min,\,\mathrm{Up}(\lambda_{\mathrm{hook},\log k}),\,M_\theta} \ge n^{-O(\log k)}\cdot\mathrm{dist}(\theta,\log k)^{2\sqrt{2\log k}}$. Theorem 1.4 follows as in the previous paragraph (this time using the efficient samplability of $M_\theta$ given $\theta$).
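Sample access to $M_\theta$ is equally easy to provide. Since $M_\theta(\pi) = q^{\mathrm{cycles}(\pi)}/\big(q(q+1)\cdots(q+n-1)\big)$ with $q = e^{\theta}$ (by Equation (26)), one standard route is the Chinese-restaurant-process construction of an Ewens($q$) random permutation; the following Python sketch (ours, not from the paper; the paper does not commit to a particular sampler) takes this route.

```python
import math
import random

def sample_cayley_mallows(n, theta, rng=random):
    """One draw pi ~ M_theta over S_n, using M_theta(pi) = q^cycles(pi) / (q(q+1)...(q+n-1))
    with q = e^theta: a Chinese-restaurant-process construction of an Ewens(q) permutation.
    Returns pi in one-line notation (0-indexed)."""
    q = math.exp(theta)
    succ = {}                              # succ[x] = pi(x), built cycle by cycle
    for i in range(n):
        if rng.random() < q / (q + i):
            succ[i] = i                    # element i starts a new cycle
        else:
            x = rng.randrange(i)           # uniform among the i already-placed elements
            succ[i] = succ[x]              # splice i into x's cycle, right after x
            succ[x] = i
    return [succ[x] for x in range(n)]

print(sample_cayley_mallows(8, 0.7))
```

Each insertion step contributes a factor $q/(q+i)$ when a new cycle is opened and $1/(q+i)$ otherwise, so a permutation with $c$ cycles is produced with probability exactly $q^{c}/\big(q(q+1)\cdots(q+n-1)\big)$.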
Recall that because of the $\mathrm{poly}\big(\mathrm{dist}(\theta,\log k)^{-2\sqrt{2\log k}}\big)$ dependence in Theorem 1.4, the algorithm of that theorem is inefficient if $e^{\theta}$ is very close to an integer. In this section we prove Theorem 1.5, which establishes that any algorithm for learning in the presence of Cayley-Mallows noise must be inefficient if $e^{\theta}$ is very close to an integer.

The following lemma is at the heart of our lower bound. It shows that if $e^{\theta}$ is close to an integer, then any partition $\mu$ of $n \ge m$ which extends a particular partition $\lambda_{\mathrm{sq}}$ of $m$ must be such that the Fourier coefficient $\widehat{M_\theta}(\rho_\mu)$ of Cayley-Mallows noise has small singular values.

Lemma 8.1. Let $\lambda_{\mathrm{sq}}$ denote the partition $(t,\ldots,t)$ of $m = t(t+j)$ whose Young diagram is a rectangle with $t+j$ rows and $t$ columns. Let $\theta > 0$ be such that $|e^{\theta} - j| \le \eta$ where $\eta \le 1/2$. Let $n \ge m$, $\mu \vdash n$ and $\lambda_{\mathrm{sq}} \Uparrow \mu$ (recall Definition A.13). Then $\widehat{M_\theta}(\rho_\mu) = c_{\mu,\theta}\cdot\mathrm{Id}$, where $|c_{\mu,\theta}| \le \eta^{t}$. Here $\rho_\mu$ denotes the irreducible representation of $S_n$ corresponding to the partition $\mu$.

Proof. Let $\mu = (\mu_1,\ldots,\mu_r)$. By Lemma 6.2, we have that $\widehat{M_\theta}(\rho_\mu) = c_{\mu,\theta}\cdot\mathrm{Id}$, where Equation (28) gives the precise value of $c_{\mu,\theta}$ as
$$c_{\mu,\theta} \;=\; \frac{\prod_{u\in\mathcal{A}}(q + c(u))}{\prod_{u\in\mathcal{B}}(q + c(u))}, \qquad \text{where } q = e^{\theta}. \qquad (29)$$
Here $\mathcal{A}$ denotes the set of cells of the Young diagram of $\mu$ which are not in the first row and $\mathcal{B}$ denotes the rightmost $n - \mu_1$ many cells in the Young diagram of the trivial partition $\mathrm{Triv}_n = (n)$. Note that in this lemma we are trying to upper bound Equation (29), whereas Lemma 6.2 was about lower bounding this quantity.

To upper bound Equation (29), we first observe that there is an obvious bijection $\Phi\colon\mathcal{A}\to\mathcal{B}$ such that if $\Phi(u) = v$, then $c(v) > |c(u)| \ge 0$.
Next, let $\mathcal{A}_{-j} \subseteq \mathcal{A}$ be
$$\mathcal{A}_{-j} \;:=\; \{(r,s) \,:\, s - r = -j \text{ and } (r,s)\in\mathcal{A}\}.$$
Since $\lambda_{\mathrm{sq}} \Uparrow \mu$, it follows that $|\mathcal{A}_{-j}| \ge t$. As a result, we can upper bound $c_{\mu,\theta}$ as follows:
$$|c_{\mu,\theta}| \;=\; \prod_{u\in\mathcal{A}}\frac{|q+c(u)|}{q+c(\Phi(u))} \;=\; \prod_{u\in\mathcal{A}_{-j}}\frac{|q+c(u)|}{q+c(\Phi(u))}\;\cdot \prod_{u\in\mathcal{A}\setminus\mathcal{A}_{-j}}\frac{|q+c(u)|}{q+c(\Phi(u))} \;\le\; \prod_{u\in\mathcal{A}_{-j}} |q+c(u)| \;\le\; \eta^{t},$$
where the first inequality uses $c(\Phi(u)) > |c(u)| \ge 0$ and $q > 1$, and the last inequality uses $|\mathcal{A}_{-j}| \ge t$, $|q - j| \le \eta$, and $\eta \le 1/2$.

Theorem 1.5 is an immediate consequence of the following result. It shows that if $e^{\theta}$ is close to an integer $j$, then it may be statistically impossible to learn a distribution $f$ supported on $k$ rankings without using many samples from $M_\theta * f$:

Theorem 8.2.
Given $j \in \mathbb{N}$, there are infinitely many values of $k$, with $m = m(k) \approx \frac{\log k}{\log\log k}$, such that the following holds: there are two distributions $f_1, f_2$ over $S_m$ with the following properties:
1. $d_{\mathrm{TV}}(f_1, f_2) = 1$ (i.e. the distributions $f_1$ and $f_2$ have disjoint support);
2. $|\mathrm{supp}(f_1)|, |\mathrm{supp}(f_2)| \le k$;
3. For any $\theta > 0$ such that $|e^{\theta} - j| \le \eta \le 1/2$, we have that
$$d_{\mathrm{TV}}(M_\theta * f_1,\, M_\theta * f_2) \;\le\; 2\cdot\eta^{\Theta\big(\sqrt{\log k/\log\log k}\big)}.$$
Proof. Let $t \ge j$ be any integer, let $m = t(t+j)$, and let $k = m!$. We first construct the two distributions $f_1, f_2$ over $S_m$ and argue that properties (1) and (2) hold.

Let $\lambda_{\mathrm{sq}} \vdash m$ be the partition whose Young diagram is a rectangle with $t+j$ rows and $t$ columns. Let us consider the character $\chi_{\mathrm{sq}}\colon S_m \to \mathbb{Q}$ corresponding to the partition $\lambda_{\mathrm{sq}}$. By Fact A.16 we have that $\chi_{\mathrm{sq}}$ is rational-valued, and by Theorem A.8 we have that $\sum_{\sigma\in S_m}\chi_{\mathrm{sq}}(\sigma) = 0$. Thus, we have that
$$\sum_{\sigma\in S_m} |\chi_{\mathrm{sq}}(\sigma)|\cdot\mathbf{1}[\chi_{\mathrm{sq}}(\sigma) > 0] \;=\; \sum_{\sigma\in S_m} |\chi_{\mathrm{sq}}(\sigma)|\cdot\mathbf{1}[\chi_{\mathrm{sq}}(\sigma) < 0] \;=:\; C_{\mathrm{sq}} \qquad (30)$$
for some $C_{\mathrm{sq}}$ (which is nonzero, again by Theorem A.8). We now define distributions $f_1$ and $f_2$ over $S_m$ as
$$f_1(\sigma) = \begin{cases}\tfrac{1}{C_{\mathrm{sq}}}\cdot\chi_{\mathrm{sq}}(\sigma) & \text{if }\chi_{\mathrm{sq}}(\sigma) > 0,\\[2pt] 0 & \text{otherwise,}\end{cases} \qquad f_2(\sigma) = \begin{cases}-\tfrac{1}{C_{\mathrm{sq}}}\cdot\chi_{\mathrm{sq}}(\sigma) & \text{if }\chi_{\mathrm{sq}}(\sigma) < 0,\\[2pt] 0 & \text{otherwise.}\end{cases}$$
From their definitions and Equation (30) it is immediate that $f_1$ and $f_2$ are distributions over $S_m$ which have disjoint support. Since $|S_m| = k$, this gives items 1 and 2 of the theorem.

To prove the third item, observe (recalling the comment immediately after Definition A.7) that the function $g\colon S_m \to \mathbb{C}$, defined as $g(\sigma) := f_1(\sigma) - f_2(\sigma) = \tfrac{1}{C_{\mathrm{sq}}}\cdot\chi_{\mathrm{sq}}(\sigma)$, is a class function. Choose any partition $\lambda \vdash m$ and the corresponding irreducible representation $\rho_\lambda$ of $S_m$. By applying Lemma A.9, we have that $\widehat{g}(\rho_\lambda) = c_\lambda\cdot\mathrm{Id}$, where
$$c_\lambda \;=\; \frac{\sum_{\sigma\in S_m} g(\sigma)\cdot\chi_\lambda(\sigma)}{\dim(\rho_\lambda)}. \qquad (31)$$
We analyze the multiplier $c_\lambda$ by noting that
$$c_\lambda \;=\; \frac{\sum_{\sigma\in S_m} g(\sigma)\cdot\chi_\lambda(\sigma)}{\dim(\rho_\lambda)} \;=\; \frac{\sum_{\sigma\in S_m}\chi_{\mathrm{sq}}(\sigma)\cdot\chi_\lambda(\sigma)}{\dim(\rho_\lambda)\cdot C_{\mathrm{sq}}} \;=\; \frac{m!\cdot\mathbf{1}[\lambda = \lambda_{\mathrm{sq}}]}{\dim(\rho_\lambda)\cdot C_{\mathrm{sq}}} \quad\text{(using Theorem A.8)}. \qquad (32)$$
Thus, we have
$$\begin{aligned}
\|M_\theta * f_1 - M_\theta * f_2\|_1 &= \sum_{\sigma\in S_m}|M_\theta * f_1(\sigma) - M_\theta * f_2(\sigma)| \\
&= \sum_{\sigma\in S_m}|M_\theta * g(\sigma)| && \text{(linearity and } g = f_1 - f_2\text{)}\\
&= \frac{1}{m!}\sum_{\sigma\in S_m}\Big|\sum_{\mu\vdash m}\dim(\rho_\mu)\,\mathrm{Tr}\big[\widehat{M_\theta * g}(\rho_\mu)\,\rho_\mu(\sigma^{-1})\big]\Big| && \text{(Definition A.5, inverse Fourier transform of } M_\theta * g\text{)}\\
&= \frac{1}{m!}\sum_{\sigma\in S_m}\Big|\sum_{\mu\vdash m}\dim(\rho_\mu)\,\mathrm{Tr}\big[\widehat{M_\theta}(\rho_\mu)\,\widehat{g}(\rho_\mu)\,\rho_\mu(\sigma^{-1})\big]\Big| && \text{(convolution identity)}\\
&= \frac{1}{\dim(\rho_{\lambda_{\mathrm{sq}}})\cdot C_{\mathrm{sq}}}\sum_{\sigma\in S_m}\Big|\dim(\rho_{\lambda_{\mathrm{sq}}})\,\mathrm{Tr}\big[\widehat{M_\theta}(\rho_{\lambda_{\mathrm{sq}}})\,\rho_{\lambda_{\mathrm{sq}}}(\sigma^{-1})\big]\Big| && \text{(Equations (31) and (32))}\\
&= \frac{1}{C_{\mathrm{sq}}}\sum_{\sigma\in S_m}\Big|\mathrm{Tr}\big[\widehat{M_\theta}(\rho_{\lambda_{\mathrm{sq}}})\,\rho_{\lambda_{\mathrm{sq}}}(\sigma^{-1})\big]\Big|. && (33)
\end{aligned}$$
To deal with $\widehat{M_\theta}(\rho_{\lambda_{\mathrm{sq}}})$, we apply Lemma 8.1. In particular, by setting $n = m$ and $\mu = \lambda_{\mathrm{sq}}$ in Lemma 8.1, we get that $\widehat{M_\theta}(\rho_{\lambda_{\mathrm{sq}}}) = c_{\lambda_{\mathrm{sq}},\theta}\cdot\mathrm{Id}$, where $|c_{\lambda_{\mathrm{sq}},\theta}| \le \eta^{t}$, and we thus get that
$$\|M_\theta * f_1 - M_\theta * f_2\|_1 \;\le\; \frac{\eta^{t}}{C_{\mathrm{sq}}}\cdot\sum_{\sigma\in S_m}\big|\mathrm{Tr}[\rho_{\lambda_{\mathrm{sq}}}(\sigma^{-1})]\big| \;=\; \frac{\eta^{t}}{C_{\mathrm{sq}}}\cdot\sum_{\sigma\in S_m}\big|\chi_{\mathrm{sq}}(\sigma^{-1})\big|. \qquad (34)$$
Finally, recalling from Equation (30) that $\sum_{\sigma\in S_m}|\chi_{\mathrm{sq}}(\sigma)| = 2C_{\mathrm{sq}}$, we get that the right-hand side of Equation (34) is $2\eta^{t}$, and hence $d_{\mathrm{TV}}(M_\theta * f_1, M_\theta * f_2) \le 2\eta^{t}$. Recalling that $t \ge \sqrt{m/2}$ and $m \approx \log k/\log\log k$, the theorem is proved.
Basics of representation theory over the symmetric group
Representation theory of the symmetric group $S_n$ is at the technical core of this paper. In this appendix we briefly review the definitions and results that we require, starting first with general groups and then specializing to $S_n$ as necessary. See Curtis and Reiner [CR66] (or many other sources) for an extensive reference on representation theory of finite groups, and James [Jam06] or Méliot [Mél17] for an extensive reference on representation theory of $S_n$.

A.1 General groups
We start by recalling the definition of a representation:
Definition A.1.
For any group $G$, a representation $\rho\colon G \to \mathbb{C}^{m\times m}$ is a group homomorphism, i.e. a function from $G$ to $\mathbb{C}^{m\times m}$ that satisfies $\rho(g)\cdot\rho(h) = \rho(g\cdot h)$ for all $g,h \in G$. The dimension of such a representation $\rho$ is $m$.

In this paper, unless otherwise mentioned, all representations $\rho$ are unitary; in other words, for every $g \in G$, $\rho(g)$ is a unitary matrix. Over finite groups, any representation can be made unitary by applying a similarity transformation; by this we mean that if $\rho$ is a representation, then there is an invertible matrix $Z$ such that the new map $\tilde\rho$ defined as $\tilde\rho(g) = Z^{-1}\cdot\rho(g)\cdot Z$ is a unitary representation. (The reader should verify that as long as $Z$ is invertible, the map $\tilde\rho$ is always a representation if $\rho$ is a representation.) Two such representations $\rho$ and $\tilde\rho$ are said to be equivalent.

Next we recall the notion of an irreducible representation:

Definition A.2.
A representation $\rho\colon G \to \mathbb{C}^{m\times m}$ is said to be reducible if there exists a proper subspace $V$ of $\mathbb{C}^m$ such that $\rho(g)\cdot V \subseteq V$ for all $g \in G$. If there is no such proper subspace $V$, then $\rho$ is said to be irreducible.

It is well known that any finite group has only finitely many irreducible representations, up to the above notion of equivalence, and that every representation of a finite group $G$ can be written as a direct sum of irreducible representations:

Theorem A.3 (Maschke's theorem, see e.g. Theorem 1.3 of [Mél17]). For $G$ a finite group, there is a finite set of distinct irreducible representations $\{\rho_1,\ldots,\rho_r\}$ such that for any representation $\rho\colon G \to \mathbb{C}^{m\times m}$, there is an invertible transformation $Z \in \mathbb{C}^{m\times m}$ such that $Z^{-1}\rho Z$ is block diagonal, where each block is one of $\{\rho_1,\ldots,\rho_r\}$. In other words, $Z^{-1}\rho Z$ is equal to the direct sum $\bigoplus_{\ell=1}^{M}\mu_\ell$ where each $\mu_\ell$ is an element of $\{\rho_1,\ldots,\rho_r\}$.

We remind the reader that elements $g, h$ in a group $G$ are said to be conjugates if there is an element $t \in G$ such that $tgt^{-1} = h$. Define $\mathrm{Cl}(g)$, the conjugacy class of $g$, to be $\{h : h \text{ is conjugate to } g\}$; it is easy to see that the different conjugacy classes form a partition of $G$.

We recall some very standard facts about irreducible representations:

Theorem A.4 (see e.g. Theorem 2.3.1 of [GW10]). Let $G$ be a finite group and let $\{\rho_1,\ldots,\rho_r\}$ be the set of its irreducible representations, where $\rho_i\colon G \to \mathbb{C}^{d_i\times d_i}$. Then:
1. $\sum_{i=1}^{r} d_i^2 = |G|$.
2. The number of conjugacy classes is equal to $r$, the number of distinct irreducible representations.
3. For $1 \le s,t \le d_i$, let $\rho_{i,s,t}\colon G \to \mathbb{C}$ be the $(s,t)$ entry of $\rho_i(g)$. Then, for $1 \le i_1, i_2 \le r$, $1 \le s_1,t_1 \le d_{i_1}$ and $1 \le s_2,t_2 \le d_{i_2}$,
$$\mathop{\mathbf{E}}_{g\in G}\big[\rho_{i_1,s_1,t_1}(g)\cdot\overline{\rho_{i_2,s_2,t_2}(g)}\big] \;=\; \begin{cases}\tfrac{1}{d_{i_1}} & \text{if } i_1 = i_2,\ s_1 = s_2 \text{ and } t_1 = t_2,\\[2pt] 0 & \text{otherwise.}\end{cases}$$
4. The representations $\rho_1,\ldots,\rho_r$ are unitary.

A restatement of (3) above is that the functions $\{\rho_{i,s,t}(\cdot)\}$ are orthogonal. Combining this with $\sum_{i=1}^{r} d_i^2 = |G|$ (given by (1)), we get that the functions $\{\rho_{i,s,t}\}_{1\le i\le r,\ 1\le s,t\le d_i}$ form an orthogonal basis for $\mathbb{C}^G$.

With an orthogonal basis for the set of complex-valued functions on $G$ in hand (in other words, a basis for the group algebra $\mathbb{C}[G]$), we are ready to define the Fourier transform of a function $f\colon G \to \mathbb{C}$:

Definition A.5.
Let $G$ be a finite group with irreducible representations given by $\{\rho_1,\ldots,\rho_r\}$ and let $f\colon G \to \mathbb{C}$. The
Fourier transform of $f$ is given by the matrices $\widehat{f}(\rho_1),\ldots,\widehat{f}(\rho_r)$, where
$$\widehat{f}(\rho_i) \;=\; \sum_{g\in G} f(g)\cdot\rho_i(g).$$
The inverse transform is given by
$$f(g) \;=\; \frac{1}{|G|}\sum_{i=1}^{r}\dim(\rho_i)\,\mathrm{Tr}\big[\widehat{f}(\rho_i)\,\rho_i(g^{-1})\big].$$
Parseval's identity states that for any $f$ as above, we have
$$\sum_{i=1}^{r}\dim(\rho_i)\,\|\widehat{f}(\rho_i)\|_F^2 \;=\; |G|\cdot\sum_{g\in G}|f(g)|^2. \qquad (35)$$
We next recall the definition of characters and class functions for a group $G$.
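As a concrete (abelian) illustration of Definition A.5, the following Python sketch (ours, not from this paper) instantiates the Fourier transform, the inversion formula, and Parseval's identity for the cyclic group $\mathbb{Z}_N$, all of whose irreducible representations are the one-dimensional characters $\rho_k(x) = e^{2\pi i kx/N}$.

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
f = rng.standard_normal(N)                 # a function f : Z_N -> R

# Irreducible representations of Z_N are 1-dimensional: rho[k, x] = exp(2*pi*i*k*x/N).
rho = np.exp(2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N)

# Fourier transform (Definition A.5): f_hat(rho_k) = sum_x f(x) rho_k(x).
f_hat = rho @ f

# Inversion: f(x) = (1/|G|) * sum_k dim(rho_k) * Tr[f_hat(rho_k) rho_k(-x)]; all dims are 1.
f_rec = (rho.conj().T @ f_hat).real / N
assert np.allclose(f_rec, f)

# Parseval (Equation (35)), again with all dimensions equal to 1.
assert np.isclose(np.sum(np.abs(f_hat) ** 2), N * np.sum(f ** 2))
print("inversion and Parseval hold for Z_%d" % N)
```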
Definition A.6. Given a finite group $G$, a function $f\colon G \to \mathbb{C}$ is said to be a class function of $G$ if $f(g)$ only depends on the conjugacy class of $g$, i.e. $f(g) = f(hgh^{-1})$ for every $h \in G$.

Definition A.7.
The character $\chi_\rho\colon G \to \mathbb{C}$ corresponding to a representation $\rho\colon G \to \mathbb{C}^{m\times m}$ is given by $\chi_\rho(g) := \mathrm{Tr}(\rho(g))$.

We observe that $\chi_\rho(\cdot)$ is a class function of $G$, and that if $\rho$ and $\tilde\rho$ are unitarily equivalent, then $\chi_\rho(\cdot) = \chi_{\tilde\rho}(\cdot)$. We recall some standard facts about characters and class functions:

Theorem A.8.
Let $G$ be a finite group and let $\{\rho_1,\ldots,\rho_r\}$ be its set of irreducible representations. Let $\chi_{\rho_1},\ldots,\chi_{\rho_r}$ be the corresponding characters. Then we have:
1. [Schur's lemma] $\mathop{\mathbf{E}}_{g\in G}\big[\chi_{\rho_i}(g)\cdot\overline{\chi_{\rho_j}(g)}\big] = \delta_{i,j}$.
2. The functions $\{\chi_{\rho_i}(\cdot)\}_{1\le i\le r}$ form an orthonormal basis for the space of all class functions of $G$.

An important fact for us is that for any class function $f$ and any irreducible representation $\rho$, the Fourier coefficient $\widehat{f}(\rho)$ is a diagonal matrix (in fact, a scalar multiple of the identity matrix):

Lemma A.9.
Let $f\colon G \to \mathbb{C}$ be a class function and let $\rho\colon G \to \mathbb{C}^{m\times m}$ be an irreducible representation of $G$. Then $\widehat{f}(\rho) = c\cdot\mathrm{Id}$, where $c = \frac{\sum_{g\in G} f(g)\chi_\rho(g)}{m}$ and $\mathrm{Id}$ is the identity matrix.

Proof. Choose any $h \in G$, and observe that
$$\rho(h)\cdot\widehat{f}(\rho) \;=\; \rho(h)\cdot\Big(\sum_{g\in G} f(g)\rho(g)\Big) \;=\; \rho(h)\cdot\Big(\sum_{g\in G} f(h^{-1}gh)\rho(h^{-1}gh)\Big) \;=\; \rho(h)\cdot\Big(\sum_{g\in G} f(g)\rho(h^{-1}gh)\Big) \;=\; \rho(h)\cdot\rho(h^{-1})\cdot\Big(\sum_{g\in G} f(g)\rho(g)\Big)\cdot\rho(h) \;=\; \widehat{f}(\rho)\cdot\rho(h).$$
As a consequence of Schur's lemma, we have that if a matrix $A$ is such that $A\cdot\rho(h) = \rho(h)\cdot A$ for all $h \in G$, then $A = c\cdot\mathrm{Id}$. Thus, we get that $\widehat{f}(\rho) = c\cdot\mathrm{Id}$. The lemma follows by taking the trace of both sides.

A.2 Representation theory of the symmetric group
Representation theory of the symmetric group has many applications to algebra, combinatorics and statistical physics, and has been intensively studied (as mentioned earlier, see e.g. [Jam06, Mél17] for detailed treatments). Below we only recall a few basics which we will need.

The first notion we require is that of a
Young diagram.
Consider a partition $\lambda = (\lambda_1,\ldots,\lambda_k)$ of $n$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k > 0$ and $\lambda_1 + \cdots + \lambda_k = n$. We indicate that $\lambda$ is such a partition by writing "$\lambda \vdash n$." The Young diagram corresponding to such a partition $\lambda$ is a two-dimensional left-justified array of empty cells in which the $i$th row has $\lambda_i$ cells. See the left portion of Figure 2 for an example of a Young diagram. A Young tableau corresponding to a partition $\lambda$ is obtained by filling in the $n$ cells of the Young diagram with the elements of $[n]$, using each element exactly once, where the ordering within rows of the Young diagram is irrelevant.

Figure 2: On the left is the Young diagram for the partition $\lambda = (4,2,2,1)$; on the right are two Young tableaus corresponding to $\lambda$ whose rows are given by the sets $\{1,7,2,8\}$, $\{5,3\}$, $\{4,6\}$, $\{9\}$.

For each partition $\lambda = (\lambda_1,\ldots,\lambda_k)$ of $n$, there is an associated representation, denoted $\tau_\lambda$, which we now define. Let $N_\lambda = \binom{n}{\lambda_1,\ldots,\lambda_k}$ be the number of Young tableaus corresponding to partition $\lambda$, and let $Y_{\lambda,1},\ldots,Y_{\lambda,N_\lambda}$ be an enumeration of these tableaus in some order.

Definition A.10. The permutation representation $\tau_\lambda$ corresponding to $\lambda$ is defined as follows: for each $g \in S_n$, $\tau_\lambda(g)$ is the $N_\lambda\times N_\lambda$ matrix (where we view rows and columns as indexed by Young tableaus corresponding to $\lambda$) which has $\tau_\lambda(g)(i,j) = 1$ iff $Y_{\lambda,i}$ maps to $Y_{\lambda,j}$ under the action of $g$.

It is easy to check that $\tau_\lambda\colon S_n \to \mathbb{C}^{N_\lambda\times N_\lambda}$ as defined above is indeed a representation. In fact, since each $\tau_\lambda(g)$ is a permutation matrix, $\tau_\lambda$ is also a unitary representation.
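Definition A.10 can be made concrete with a few lines of code. The following Python sketch (ours, not from the paper) enumerates the Young tableaus of a small shape (with row order forgotten), builds matrices $\tau_\lambda(g)$, and checks the homomorphism property on one example.

```python
from itertools import permutations
import numpy as np

def tabloids(lam):
    """All Young tableaus of shape lam with row order irrelevant, as ordered tuples of frozensets."""
    n, seen = sum(lam), set()
    for perm in permutations(range(1, n + 1)):
        rows, start = [], 0
        for length in lam:
            rows.append(frozenset(perm[start:start + length]))
            start += length
        seen.add(tuple(rows))
    return sorted(seen, key=lambda rows: [sorted(r) for r in rows])

def tau(lam, g):
    """Matrix tau_lam(g): entry (i, j) is 1 iff applying g to every entry of tableau Y_j
    gives tableau Y_i (g in one-line notation on {1,...,n}: g[x-1] is the image of x)."""
    Y = tabloids(lam)
    index = {y: i for i, y in enumerate(Y)}
    M = np.zeros((len(Y), len(Y)), dtype=int)
    for j, y in enumerate(Y):
        image = tuple(frozenset(g[x - 1] for x in row) for row in y)
        M[index[image], j] = 1
    return M

lam = (2, 1)
g, h = (2, 3, 1), (2, 1, 3)                        # two elements of S_3
gh = tuple(g[h[x - 1] - 1] for x in (1, 2, 3))     # composition: apply h first, then g
assert (tau(lam, g) @ tau(lam, h) == tau(lam, gh)).all()
print(len(tabloids(lam)), "tableaus; homomorphism property verified")
```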
It turns out that for $\lambda \neq (n)$, the permutation representation $\tau_\lambda$ is not an irreducible representation. However, it also turns out that all of the irreducible representations of $S_n$ can be obtained from the permutation representations. To explain this, we need to define a partial order over partitions of $n$:

Definition A.11. For two partitions $\lambda$ and $\mu$ of $n$, we say that $\lambda$ dominates $\mu$, written $\lambda \trianglerighteq \mu$, if $\sum_{j\le i}\lambda_j \ge \sum_{j\le i}\mu_j$ for all $i > 0$.
The partial order defined by $\trianglerighteq$ is said to be the dominance order over the partitions (equivalently, Young diagrams) of $n$.

Figure 3: The left part of the picture depicts the dominance order across the partitions of 4; it happens to be the case that the dominance order is a total order across the partitions of 4. This is not true in general; as depicted on the right, the two partitions $(4,1,1)$ and $(3,3)$ of 6 are incomparable under the dominance order.

The next result explains how the irreducible representations of $S_n$ can be obtained from the representations $\{\tau_\lambda\}_{\lambda\vdash n}$:

Theorem A.12 (James submodule theorem, see e.g. Theorem 3.34 of [Mél17]). The irreducible representations of $S_n$ are in one-to-one correspondence with the partitions $\lambda \vdash n$; we denote the irreducible representation corresponding to $\lambda$ by $\rho_\lambda$. In particular, when $\lambda = (n)$, then $\rho_\lambda$ is the trivial irreducible representation (which maps each $g \in G$ to 1). Moreover, each permutation representation $\tau_\lambda$ is a direct sum of irreducible representations corresponding to partitions which dominate $\lambda$, i.e.
$$\tau_\lambda \;=\; \bigoplus_{\mu\,\trianglerighteq\,\lambda}\ \bigoplus_{\ell=1}^{K_{\lambda,\mu}}\rho_\mu.$$
Here the $K_{\lambda,\mu}$'s are non-negative integers, known as the Kostka numbers, which are such that $K_{\lambda,\lambda} = 1$.
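The dominance check of Definition A.11 is a simple prefix-sum comparison; the following Python sketch (ours, not from the paper) implements it and reproduces the incomparable pair from Figure 3.

```python
from itertools import zip_longest

def dominates(lam, mu):
    """Return True iff the partition lam dominates mu (both partitions of the same n)."""
    s_lam = s_mu = 0
    for a, b in zip_longest(lam, mu, fillvalue=0):
        s_lam += a
        s_mu += b
        if s_lam < s_mu:
            return False
    return True

print(dominates((4, 2, 1, 1), (3, 3, 1, 1)))                        # True
print(dominates((4, 1, 1), (3, 3)), dominates((3, 3), (4, 1, 1)))   # False False: incomparable
```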
A.2.1 Restrictions of irreducible representations

Fix $\lambda \vdash n$ and consider the irreducible representation $\rho_\lambda$ of $S_n$. For any $m \le n$, $S_m$ can be viewed as the subgroup of $S_n$ in which the elements $\{m+1,\ldots,n\}$ are fixed. Hence $\rho_\lambda$ can also be viewed as a representation of $S_m$; this representation of $S_m$ is written $\rho^{m}_\lambda$ and is called the restriction of $\rho_\lambda$ to $S_m$. Note that $\rho^{m}_\lambda$ may not be an irreducible representation of $S_m$. By Theorem A.3, we have that $\rho^{m}_\lambda$ is equivalent to some direct sum
$$\bigoplus_{\mu\vdash m} M_{\lambda,\mu}\,\rho_\mu,$$
in which there are $M_{\lambda,\mu}$ many copies of $\rho_\mu$, for some non-negative integers $M_{\lambda,\mu}$. These integers are given by the so-called "branching rule" on Young's lattice, which we now describe.

Definition A.13.
Young's lattice is the partially ordered set of Young diagrams in which the partial order is given by inclusion in the following sense: given partitions $\mu$ and $\lambda$, we write "$\mu \uparrow \lambda$" if $\lambda$ can be obtained by adding one box to $\mu$ (in such a way that $\lambda$ is a valid partition, of course). If there are partitions $\mu_1,\ldots,\mu_r$ such that $\mu_1 \uparrow \mu_2 \uparrow \cdots \uparrow \mu_r$, we write "$\mu_1 \Uparrow \mu_r$."

It is convenient to draw Young's lattice in such a way that the $n$-th level contains all and only the Young diagrams with $n$ boxes. The diagram in Figure 4 depicts the first five levels of Young's lattice.

[Figure 4: The first five levels of Young's lattice.]

The next result, known as the "branching rule," states that for $\lambda \vdash n$, $\rho_\lambda$ splits into a direct sum of $\rho_\mu$ over all $\mu \uparrow \lambda$ when $\rho_\lambda$ is restricted to $S_{n-1}$:

Lemma A.14 (Branching rule). Let $\lambda$ be a partition of $n$ and let $\rho_\lambda$ be the corresponding irreducible representation of $S_n$. Then $\rho^{n-1}_\lambda$, the restriction of $\rho_\lambda$ to $S_{n-1}$, is equivalent to $\bigoplus_{\mu\vdash n-1,\ \mu\uparrow\lambda}\rho_\mu$.

By applying Lemma A.14 inductively we get a complete description of how $\rho_\lambda$ splits when it is restricted to any $S_m$, $m < n$:

Theorem A.15.
Let $\lambda \vdash n$ and let $\rho_\lambda$ be the corresponding irreducible representation of $S_n$. For $m < n$ we have that $\rho^{m}_\lambda$, the restriction of $\rho_\lambda$ to $S_m$, is equivalent to
$$\bigoplus_{\mu\vdash m}\mathrm{Paths}(\mu,\lambda)\,\rho_\mu,$$
where $\mathrm{Paths}(\mu,\lambda)$ denotes the number of paths in Young's lattice from $\mu$ to $\lambda$.
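The multiplicities $\mathrm{Paths}(\mu,\lambda)$ of Theorem A.15 can be computed directly from the covering relation $\mu \uparrow \lambda$ of Definition A.13. The following memoized Python sketch (ours, not from the paper) does this; as a sanity check, $\mathrm{Paths}(\emptyset,\lambda)$ counts the standard Young tableaux of shape $\lambda$, which equals $\dim(\rho_\lambda)$.

```python
from functools import lru_cache

def parents(lam):
    """All partitions mu with mu ↑ lam, i.e. lam with one removable corner box deleted."""
    lam, out = tuple(lam), []
    for i in range(len(lam)):
        if i == len(lam) - 1 or lam[i] - 1 >= lam[i + 1]:   # removal keeps rows non-increasing
            new = lam[:i] + (lam[i] - 1,) + lam[i + 1:]
            out.append(tuple(p for p in new if p > 0))
    return out

@lru_cache(maxsize=None)
def paths(mu, lam):
    """Number of paths in Young's lattice from mu up to lam (0 if mu is not contained in lam)."""
    if mu == lam:
        return 1
    if sum(lam) <= sum(mu):
        return 0
    return sum(paths(mu, p) for p in parents(lam))

# Paths from the empty partition count standard Young tableaux: shapes (2,1) and (2,2) each have 2.
print(paths((), (2, 1)), paths((), (2, 2)))
```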
Irreducible characters of the symmetric group. Finally, we recall the following fundamental fact (which is a consequence, e.g., of the Murnaghan-Nakayama rule) which we will use:
Fact A.16 (see e.g. Theorem 3.10 in [Mél17]). Let $\chi\colon S_m \to \mathbb{C}$ be a character of $S_m$. Then in fact $\chi$ is $\mathbb{Q}$-valued.

Acknowledgments
We thank Mike Saks for allowing us to include his proof of Claim 2.2 here. We also thank Vic Reiner and Yuval Roichman for answering several questions about representation theory. Anindya is grateful to Aravindan Vijayaraghavan for many useful discussions about ranking models.
References

[ABSV14] P. Awasthi, A. Blum, O. Sheffet, and A. Vijayaraghavan. Learning mixtures of ranking models. In Advances in Neural Information Processing Systems, pages 2609–2617, 2014.

[BM08] M. Braverman and E. Mossel. Noisy sorting without resampling. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 268–276, 2008.

[BOB07] L. Busse, P. Orbanz, and J. Buhmann. Cluster analysis of heterogeneous rank data. In Proceedings of the 24th ICML, pages 113–120, 2007.

[Cho94] K. P. Choi. On the medians of gamma distributions and an equation of Ramanujan. Proc. Amer. Math. Soc., 121:245–251, 1994.

[CR66] C. Curtis and I. Reiner. Representation Theory of Finite Groups and Associative Algebras, volume 356. American Mathematical Society, 1966.

[DH92] Persi Diaconis and Phil Hanlon. Eigen analysis for some examples of the Metropolis algorithm. Contemporary Mathematics, 138:99–117, 1992.

[Dia88a] P. Diaconis. Group representations in probability and statistics. Lecture Notes–Monograph Series, 11:i–192, 1988.

[Dia88b] Persi Diaconis. Chapter 6: Metrics on groups, and their statistical uses, volume 11 of Lecture Notes–Monograph Series, pages 102–130. Institute of Mathematical Statistics, 1988.

[DS81] Persi Diaconis and Mehrdad Shahshahani. Generating a random permutation with random transpositions. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 57:159–179, 1981.

[DS98] Persi Diaconis and Laurent Saloff-Coste. What do we know about the Metropolis algorithm? J. Comput. Syst. Sci., 57(1):20–36, 1998.

[DST16] A. De, M. Saks, and S. Tang. Noisy population recovery in polynomial time. In 57th Annual IEEE Symposium on Foundations of Computer Science, pages 675–684. IEEE, 2016.

[Ewe72] W. Ewens. The sampling theory of selectively neutral alleles. Theoretical Population Biology, 3:87–112, 1972.

[FV86] M. Fligner and J. Verducci. Distance based ranking models. Journal of the Royal Statistical Society, Series B (Methodological), pages 359–369, 1986.

[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[GP18] A. Gladkich and R. Peled. On the cycle structure of Mallows permutations. The Annals of Probability, 46(2):1114–1169, 2018.

[GW10] B. Green and A. Wigderson. Lecture notes for the 22nd McGill Invitational Workshop on Computational Complexity, 2010.

[Jam06] Gordon Douglas James. The Representation Theory of the Symmetric Groups, volume 682. Springer, 2006.

[JV18] Y. Jiao and J. Vert. The Kendall and Mallows kernels for permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1755–1769, 2018.

[KB10] R. Kondor and M. Barbosa. Ranking with kernels in Fourier space. In COLT 2010, pages 451–463, 2010.

[KL02] R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Machine Learning, Proceedings of the 19th International Conference (ICML 2002), 2002.

[KV10] R. Kumar and S. Vassilvitskii. Generalized distances between rankings. In WWW, pages 571–580, 2010.

[LB11] T. Lu and C. Boutilier. Learning Mallows models with pairwise preferences. In Proceedings of the 28th ICML, pages 145–152, 2011.

[LL02] G. Lebanon and J. Lafferty. Cranking: Combining rankings using conditional probability models on permutations. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 363–370, 2002.

[LM18] A. Liu and A. Moitra. Efficiently learning mixtures of Mallows models. In Proceedings of FOCS 2018, 2018.

[LZ15] Shachar Lovett and Jiapeng Zhang. Improved noisy population recovery, and reverse Bonami-Beckner inequality for sparse functions. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, pages 137–142, 2015.

[Mal57] C. Mallows. Non-null ranking models. I. Biometrika, 44(1/2):114–130, 1957.

[Mar14] J. Marden. Analyzing and Modeling Rank Data. Chapman and Hall/CRC, 2014.

[MC10] M. Meilă and H. Chen. Dirichlet process mixtures of generalized Mallows models. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 358–367, 2010.

[Mél17] P. Méliot. Representation Theory of Symmetric Groups. Chapman and Hall/CRC, 2017.

[MM03] T. Murphy and D. Martin. Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41(3-4):645–655, 2003.

[MM09] B. Mandhani and M. Meila. Tractable search for learning exponential models of rankings. In Artificial Intelligence and Statistics, pages 392–399, 2009.

[MPPB07] M. Meilă, K. Phadnis, A. Patterson, and J. Bilmes. Consensus ranking under the exponential model. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, pages 285–294, 2007.

[MS13] Ankur Moitra and Michael Saks. A polynomial time algorithm for lossy population recovery. In 54th Annual IEEE Symposium on Foundations of Computer Science, pages 110–116. IEEE, 2013.

[Muk16] S. Mukherjee. Estimation in exponential families on permutations. The Annals of Statistics, 44(2):853–875, 2016.

[Sak18] M. Saks. Personal communication, 2018.

[Sta99] Richard P. Stanley. Enumerative Combinatorics: Volume 2. Cambridge University Press, 1999.

[Ste77] G. W. Stewart. On the perturbation of pseudo-inverses, projections and linear least squares problems. SIAM Review, 19(4):634–662, 1977.

[WY12] Avi Wigderson and Amir Yehudayoff. Population recovery and partial identification. In 53rd Annual IEEE Symposium on Foundations of Computer Science, 2012.