Learning sparse mixtures of rankings from noisy information
Anindya De ∗ Northwestern University [email protected]
Ryan O’Donnell † Carnegie Mellon University [email protected]
Rocco A. Servedio ‡ Columbia University [email protected]
November 6, 2018
Abstract
We study the problem of learning an unknown mixture of k rankings over n elements, given access to noisy samples drawn from the unknown mixture. We consider a range of different noise models, including natural variants of the "heat kernel" noise framework and the Mallows model. For each of these noise models we give an algorithm which, under mild assumptions, learns the unknown mixture to high accuracy and runs in n^{O(log k)} time. The best previous algorithms for closely related problems have running times which are exponential in k.

∗ Supported by NSF grant CCF-1814706.
† Supported by NSF grants CCF-1618679 and CCF-1717606.
‡ Supported by NSF grants CCF-1563155 and CCF-1814873 and by the Simons Collaboration on Algorithms and Geometry. This material is based upon work supported by the National Science Foundation under the grant numbers listed above. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

1 Introduction
This paper considers the following natural scenario: there is a large heterogeneous populationwhich consists of k disjoint subgroups, and for each subgroup there is a “central preference order”specifying a ranking over a fixed set of n items (equivalently, specifying a permutation in thesymmetric group S n ). For each i ∈ { , . . . , k } , the preference order of each individual in subgroup i is assumed to be a noisy version of the central preference order (the permutation corresponding tosubgroup i ). A natural learning task which arises in this scenario is the following: given access to thepreference order of randomly selected members of the population, is it possible to learn the centralpreference orders of the k sub-populations, as well as the relative sizes of these k sub-populationswithin the overall population?Worst-case formulations of the above problem typically tend to be (difficult) variants of thefeedback arc set problem, which is known to be NP-complete [GJ79]. In view of the practicalimportance of problems of this sort, though, there has been considerable recent research interestin studying various generative models corresponding to the above scenario (we discuss some of therecent work which is most closely related to our results in Section 1.3). In this paper we will modelthe above general problem schema as follows: The k “central preference orders” of the subgroupsare given by k unknown permutations σ , . . . , σ k ∈ S n . The fraction of the population belonging tothe i -th subgroup, for 1 ≤ i ≤ k , is given by an unknown w i ≥ w + · · · + w k = 1). Finally,the noise is modeled by some family of distributions {K θ } , where each distribution K θ is supportedon S n , and the preference order of a random individual in the i -th subgroup is given by π σ i , where π ∼ K θ . Here θ is a model parameter capturing the “noise rate” (we will have much more to sayabout this for each of the specific noise models we consider below). The learning task is to recoverthe central rankings σ , . . . , σ k and their proportions w , . . . , w k , given access to preference ordersof randomly chosen individuals from the population. In other words, each sample provided to thelearner is independently generated by first choosing a random permutation σ , where σ is chosento be σ i with probability w i ; then independently drawing a random π ∼ K θ ; and finally, providingthe learner with the permutation πσ ∈ S n . Let f : S n → R ≥ denote the function which is w i at σ i and 0 otherwise. With this notation, we write “ K θ ∗ f ” to denote the distribution over noisysamples described above, and our goal is to approximately recover f given such noisy samples. Thereader may verify that the distribution defined by πσ is precisely given by the group convolution K θ ∗ f (and hence the notation). We consider a range of different noise models, corresponding to different choices for the parametricfamily {K θ } , and for each one we give an efficient algorithm for recovering the population in thepresence of that kind of noise. In this subsection we detail the three specific noise models that wewill work with (though as we discuss later, our general mode of analysis could be applied to othernoise models as well).(A.) Symmetric noise.
In the symmetric noise model, the parametric family of distributions over S_n is denoted {S_p}_{p ∈ ∆^n}. Given a vector p = (p_0, . . . , p_n) ∈ ∆^n (so each p_i ≥ 0 and ∑_{i=0}^{n} p_i = 1), a draw of π ∼ S_p is obtained as follows:

1. Choose 0 ≤ j ≤ n, where value j is chosen with probability p_j.

2. Choose a uniformly random subset A ⊆ [n] of size exactly j. Draw π uniformly from S_A; in other words, π is a uniformly random permutation over the set A and is the identity permutation on elements in [n] \ A. (We denote this uniform distribution over S_A by U_A.)

Note that in this model, if the noise vector p has p_n = 1, then every draw from S_p ∗ f is a uniform random permutation and there is no useful information available to the learner.
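To make the two-step procedure above concrete, here is a short Python sketch. It is our illustration rather than code from the paper; the function name sample_symmetric_noise and the 0-indexed ground set {0, ..., n−1} are our own choices.

```python
import random

def sample_symmetric_noise(p, n, rng=random):
    """Draw pi ~ S_p (ground set {0,...,n-1}): pick j with probability p[j], then
    apply a uniformly random permutation to a uniformly random size-j subset A,
    leaving every element outside A fixed."""
    j = rng.choices(range(n + 1), weights=p)[0]     # step 1: choose j with probability p_j
    A = rng.sample(range(n), j)                     # step 2: uniformly random subset of size j
    images = A[:]
    rng.shuffle(images)                             # a uniformly random permutation of A
    pi = list(range(n))                             # identity off of A
    for a, b in zip(A, images):
        pi[a] = b
    return pi                                       # pi[i] is the image of element i
```

A noisy sample from S_p ∗ f is then the composition of such a π with a permutation σ drawn from the mixture f.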
In order to define the next two noise models that we consider, let us recall the notion of a right-invariant metric on S_n. Such a metric d(·,·) is one that satisfies d(σ, π) = d(στ, πτ) for all σ, π, τ ∈ S_n. We note that a metric is right-invariant if and only if it is invariant under relabeling of the items 1, . . . , n, and that most metrics considered in the literature satisfy this condition (see [KV10, Dia88b] for discussions of this point). In this paper, for technical convenience, we restrict our attention to the metric d(·,·) being the Cayley distance over S_n (though see Section 1.5 for a discussion of how our methods and results could potentially be generalized to other right-invariant metrics):

Definition 1.1. Let G be the undirected graph with vertex set S_n and an edge between permutations σ and π if there is a transposition τ such that σ = τ·π. The Cayley distance over S_n is the metric induced by this graph; in other words, d(π, σ) = t where t is the smallest value such that there are transpositions τ_1, . . . , τ_t satisfying σ = τ_1 · · · τ_t·π.

Now we are ready to define the next two parameterized families of noise distributions that we consider. We note that each of the noise distributions K considered below has the natural property that the probability Pr_{π∼K}[π = π_0] decreases with d(π_0, e), where e is the identity permutation.

(B.) Heat kernel random walk under Cayley distance.
Let L be the Laplacian of the graph G from Definition 1.1. Given a "temperature" parameter t ∈ R_+, the heat kernel is the n! × n! matrix H_t = e^{−tL}. It is well known that H_t is the transition matrix of the random walk induced by choosing a Poisson-distributed time parameter T ∼ Poi(t) and then taking T steps of a uniform random walk in the graph G. With this motivation, we define the heat kernel noise model as follows: the parametric family of distributions is {H_t}_{t ∈ R_+}, where the probability weight that H_t assigns to permutation π is the probability that the above-described random walk, starting at the identity permutation e ∈ S_n, reaches π. (Observe that higher temperature parameters t correspond to higher rates of noise. More precisely, it is well known that the mixing time of a uniform random walk on G is Θ(n log n) steps, so if t grows larger than n log n then the distribution H_t converges rapidly to the uniform distribution on S_n; see [DS81] for detailed results along these lines.) We note that these probability distributions (or more precisely, the associated heat kernel H_t) have been previously studied in the context of learning rankings, see e.g. [KL02, KB10, JV18]. In some of this work, a different underlying distance measure was used over S_n rather than the Cayley distance; see our discussion of related work in Section 1.3.
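The random-walk description translates directly into a sampler; the following Python sketch is ours (not the paper's code) and works over the 0-indexed ground set {0, ..., n−1}.

```python
import numpy as np

def sample_heat_kernel(t, n, rng=None):
    """Draw pi ~ H_t: take T ~ Poisson(t) steps of the uniform random walk on the
    Cayley graph G, i.e. apply T uniformly random transpositions to the identity."""
    rng = rng or np.random.default_rng()
    pi = list(range(n))                     # start at the identity permutation
    for _ in range(rng.poisson(t)):
        i, j = rng.choice(n, size=2, replace=False)
        pi[i], pi[j] = pi[j], pi[i]         # one step: multiply by the transposition (i j)
    return pi
```

As noted above, once t grows well beyond n log n the output of this walk is essentially a uniformly random permutation, so no useful signal survives.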
(C.) Mallows-type model under Cayley distance (Cayley-Mallows / Ewens model). While the heat kernel noise model arises naturally from an analyst's perspective, a somewhat different model, called the
Mallows model, has been more popular in the statistics and machine learning literature. The Mallows model is defined using the "Kendall τ-distance" K(·,·) between permutations (defined in Section 1.3) rather than the Cayley distance d(·,·); the Mallows model with parameter θ > 0 assigns probability weight e^{−θK(π,e)}/Z_K(θ) to the permutation π, where Z_K(θ) = ∑_{π∈S_n} e^{−θK(π,e)} is a normalizing constant. As proposed by Fligner and Verducci [FV86], it is natural to consider generalizations of the Mallows model in which other distance measures take the place of the Kendall τ-distance. The model which we consider is one in which the Cayley distance is used as the distance measure; so given θ >
0, the noise distribution M θ which weconsider assigns weight e − θd ( π,e ) /Z ( θ ) to each permutation π ∈ S n , where Z ( θ ) = (cid:80) π ∈ S n e − θd ( π,e ) is a normalizing constant. In fact, this noise model was already proposed in 1972 by W. Ewensin the context of population genetics [Ewe72] and has been intensively studied in that field (wenote that [Ewe72] has been cited more than 2000 times according to Google Scholar). To alignour terminology with the strand of research in machine learning and theoretical computer sciencewhich deals with the Mallows model, in the rest of this paper we refer to M θ as the Cayley-Mallows model. For the same reason, we will also refer to the usual Mallows model (with the Kendall τ -disance) as the Kendall-Mallows model. We observe that for the Cayley-Mallows model M θ , incontrast with the heat kernel noise model now smaller values of θ correspond to higher levels ofnoise, and that when θ = 0 the distribution M θ is simply the uniform distribution over S n andthere is no useful information available to the learner. For each of the noise models defined above, we give algorithms which, under a mild technicalassumption (that no mixing weight w i is too small), provably recover the unknown central rankings σ , . . . , σ k and associated mixing weights w , . . . , w k up to high accuracy. A notable feature of ourresults is that the sample and running time dependence is only quasipolynomial in the numberof elements n and the number of sub-populations k ; as we detail in Section 1.3 below, this is incontrast with recent results for similar problems in which the dependence on k is exponential.Below we give detailed statements of our results. The following notation and terminology willbe used in these statements: for f a distribution over S n (or any function from S n to R ) we writesupp( f ) to denote the set of permutations σ ∈ S n that have f ( σ ) (cid:54) = 0. For a given noise model K , we write “ K ∗ f ” to denote the distribution over noisy samples that is provided to the learningalgorithm as described earlier. Given two functions f, g : S n → R , we write “ (cid:107) f − g (cid:107) ” to denote (cid:80) π ∈ S n | f ( π ) − g ( π ) | , the (cid:96) distance between f and g . If f and g are both distributions then wewrite d TV ( f, g ) to denote the total variation distance between f and g , which is (cid:107) f − g (cid:107) . Finally,if f is a distribution over S n in which f ( σ ) > ε for every σ such that f ( σ ) >
0, we say that f is ε -heavy . Learning from noisy rankings: Positive and negative results.
Our first algorithmic result is for the symmetric noise model (A) defined earlier. Theorem 1.2, stated below, gives an efficient algorithm as long as the vector p is "not too extreme" (i.e. not too biased towards putting almost all of its weight on large values very close to n):

Theorem 1.2 (Algorithm for symmetric noise). There is an algorithm with the following guarantee: Let f be an unknown ε-heavy distribution over S_n with |supp(f)| ≤ k. Let p = (p_0, . . . , p_n) ∈ ∆^n be such that

∑_{j=0}^{n − log k} p_j ≥ 1/n^{O(log k)}.

Given p, the value of ε > 0, a confidence parameter δ > 0, and access to random samples from S_p ∗ f, the algorithm runs in time poly(n^{log k}, 1/ε, log(1/δ)) and with probability 1 − δ outputs a distribution g : S_n → R such that d_TV(f, g) ≤ ε.

Theorem 1.3 (Algorithm for heat kernel noise). There is an algorithm with the following guarantee: Let f be an unknown ε-heavy distribution over S_n with |supp(f)| ≤ k. Let t ∈ R_+ be any value that is O(n log n). Given t, the value of ε > 0, a confidence parameter δ > 0, and access to random samples from H_t ∗ f, the algorithm runs in time poly(n^{log k}, 1/ε, log(1/δ)) and with probability 1 − δ outputs a distribution g : S_n → R such that d_TV(f, g) ≤ ε.

Recalling that the uniform random walk on the Cayley graph of S_n mixes in Θ(n log n) steps, we see that the algorithm of Theorem 1.3 is able to handle quite high levels of noise and still run quite efficiently (in quasi-polynomial time).

Our third positive result, for the Cayley-Mallows model, displays an intriguing qualitative difference from Theorems 1.2 and 1.3. To state our result, let us define the function dist : R_+ × N → R_+ as follows:

dist(θ, ℓ) := min_{j ∈ {1,...,ℓ}} |e^θ − j|,

so dist(θ, ℓ) measures the minimum distance between e^θ and any integer in {1, . . . , ℓ}. Theorem 1.4 gives an algorithm which can be quite efficient for the Cayley-Mallows noise model if the noise parameter θ is such that dist(θ, log k) is not too small:

Theorem 1.4 (Algorithm for the Cayley-Mallows model). There is an algorithm with the following guarantee: Let f be an unknown ε-heavy distribution over S_n with |supp(f)| ≤ k. Given θ > 0, the value of ε > 0, a confidence parameter δ > 0, and access to random samples from M_θ ∗ f, the algorithm runs in time poly(n^{log k}, 1/ε, log(1/δ), dist(θ, log k)^{−√(log k)}) and with probability 1 − δ outputs a distribution g : S_n → R such that d_TV(f, g) ≤ ε.

As alluded to earlier, as θ approaches 0 the difficulty of learning in the M_θ noise model increases (and indeed learning becomes impossible at θ = 0); since for small θ we have dist(θ, ℓ) ≈ θ, this is accounted for by the dist(θ, log k)^{−√(log k)} factor in our running time bound above. However, for larger values of θ the dist(θ, log k)^{−√(log k)} dependence may strike the reader as an unnatural artifact of our analysis: is it really hard to learn when θ is very close to ln 2 ≈ 0.69, easy when θ is very close to ln 2.5 ≈ 0.92, and hard again when θ is very close to ln 3 ≈ 1.10? It turns out that the dist(·,·) parameter does capture a fundamental barrier to learning in the Cayley-Mallows model. We establish this by proving the following lower bound for the Cayley-Mallows model, which shows that a dependence on dist as in Theorem 1.4 is in fact inherent in the problem:

Theorem 1.5.
Given j ∈ N , there are infinitely many values of k and m = m ( k ) such that thefollowing holds: Let θ > be such that | e θ − j | ≤ η ≤ / , and let A be any algorithm which, whengiven access to random samples from M θ ∗ f where f is a distribution over S m with | supp( f ) | ≤ k ,with probability at least 0.51 outputs a distribution h over S m that has d TV ( f, h ) ≤ . . Then A must use η − Ω (cid:16)(cid:113) log k log log k (cid:17) samples. Starting with the work of Mallows [Mal57], there is a rich line of work in machine learning andstatistics on probabilistic models of ranking data, see e.g. [Mar14, LL02, BOB07, MM09, MC10,4B11]. In order to describe the prior works which are most relevant to our paper, it will beuseful for us to define the
Kendall-Mallows model (referred to in the literature just as the Mallowsmodel) in slightly more detail than we gave earlier. Introduced by Mallows [Mal57], the Kendall-Mallows model is quite similar to the Cayley-Mallows model that we consider — it is specified bya parametric family of distributions {M τ,θ } θ ∈ R + and a central permutation σ ∈ S n , and a drawfrom the model is generated as follows: sample π ∼ M τ,θ and output π · σ . The distribution M τ,θ assigns probability weight e − θK ( π,e ) /Z K ( θ ) to the permutation π where Z K ( θ ) = (cid:80) π ∈ S n e − θK ( π,e ) is the normalizing constant and K ( · , · ) is the Kendall τ -distance (defined next): Definition 1.6.
The
Kendall τ-distance K : S_n × S_n → R_{≥0} is a distance metric on S_n defined as

K(π, π′) = |{(i, j) : i < j and ((π(i) < π(j)) ⊕ (π′(i) < π′(j))) = 1}|.

In other words, K(π, π′) is the number of inversions between π and π′. Like the Cayley distance, the Kendall τ-distance is also a right-invariant metric. Another equivalent way to define K(·,·) is to consider the undirected graph on S_n where vertices π_1 and π_2 share an edge if and only if π_1 = τ·π_2 where τ is an adjacent transposition, in other words τ = (i, i+1) for some 1 ≤ i < n. Then K(·,·) is defined as the shortest-path metric on this graph. From this perspective, the difference between the Kendall τ-distance and the Cayley distance is that the former only allows adjacent transpositions while the latter allows all transpositions.

Learning mixture models:
As mentioned earlier, probabilistic models of ranking data havebeen studied extensively in probability, statistics and machine learning. Models that have beenconsidered in this context include the Kendall-Mallows model [Mal57, LB11, MPPB07, GP18], theCayley-Mallows model (and generalizations of it) [FV86, MM03, Muk16, DH92, Dia88a, Ewe72]and the heat kernel random walk model [KL02, KB10, JV18], among others. In contrast, withintheoretical computer science interest in probabilistic models of ranking data is somewhat morerecent, and the best-studied model in this community is the Kendall-Mallows model. Bravermanand Mossel [BM08] initiated this study and (among other results) gave an efficient algorithm torecover a single Kendall-Mallows model from random samples. The question of learning mixturesof k Kendall-Mallows models was raised soon thereafter, and and Awasthi et al. [ABSV14] gave anefficient algorithm for the case k = 2. We note two key distinctions between our work and thatof [ABSV14]: (i) our results apply to the Cayley-Mallows model rather than the Kendall-Mallowsmodel, and (ii) the work of [ABSV14] allows for the two components in the mixture to have twodifferent noise parameters θ and θ whereas our mixture models allow for only one noise parameter θ across all the components.Very recently, Liu and Moitra [LM18] extended the result of [ABSV14] to any constant k . Inparticular, the running time of the [LM18] algorithm scales as n poly( k ) . It is interesting to contrastour results with those of [LM18]. Besides the obvious difference in the models treated (namelyKendall-Mallows in [LM18] versus Cayley-Mallows in this paper), another significant difference isthat our running time scales only quasipolynomially in k versus exponentially in k for [LM18]. (Infact, [LM18] shows that an exponential dependence on k is necessary for the problem they consider.)Another difference is that their algorithm allows each mixture component to have a different noiseparameter θ i whereas our result requires the same noise parameter θ across the mixture components.We observe that one curious feature of the algorithm of [LM18] is the following: When all the noiseparameters { θ i } ≤ i ≤ k are well-separated (meaning that for all i (cid:54) = j , | θ i − θ j | ≥ γ ), then the running5ime of [LM18] can be improved to poly( n ) · poly( k ) . This suggests that the case when all θ i arethe same might be the hardest for the Liu-Moitra [LM18] algorithm.Finally, we note that while the analysis in this paper does not immediately extend to theKendall-Mallows model (see Section 1.5 for more details), we point out that there is a sense in whichthe Kendall-Mallows and Cayley-Mallows models are fundamentally incomparable. This is because,while the results of [LM18] show that mixtures of Kendall-Mallows models are identifiable whenevereach θ i (cid:54) = 1, Theorem 1.5 shows that mixtures of Cayley-Mallows models are not identifiable atvarious larger values of θ such as ln 2 , ln 3 , . . . , even when all of the noise parameters are the samevalue θ which is provided to the algorithm. A key notion for our algorithmic approach is that of the marginal of a distribution f over S n : Definition 1.7.
Fix f : S n → [0 ,
1] to be some distribution over S n . Let t ∈ { , . . . , n } , let¯ i = ( i , . . . , i t ) be a vector of t distinct elements of { , . . . , n } and likewise ¯ j = ( j , . . . , j t ). We saythe (¯ i, ¯ j ) -marginal of f is the probability Pr σ ∼ f [ σ ( i ) = j and · · · and σ ( i t ) = j t ]that for all (cid:96) = 1 , . . . , t , the i (cid:96) -th element of a random σ drawn from f is j (cid:96) . When ¯ i and ¯ j are oflength t we refer to such a probability as a t -way marginal of f .The first key ingredient of our approach for learning from noisy rankings is a reduction from theproblem of learning f (the unknown distribution supported on k rankings σ , . . . , σ k ) given accessto samples from K ∗ f , to the problem of estimating t -way marginals (for a not-too-large value of t ). More precisely, in Section 2 we give an algorithm which, given the ability to efficiently estimate t -way marginals of f , efficiently computes a high-accuracy approximation for an unknown ε -heavydistribution f with support size at most k (see Theorem 2.1). This algorithm builds on ideas inthe population recovery literature, suitably extended to the domain S n rather than { , } n .With the above-described reduction in hand, in order to obtain a positive result for a specificnoise model K the remaining task is to develop an algorithm A marginal which, given access to noisysamples from K ∗ f , can reliably estimate the required marginals. In Section 3 we show that ifthe noise distribution K (a distribution over S n ) is efficiently samplable, then given samples from K ∗ f , the time required to estimate the required marginals essentially depends on the minimum,over a certain set of matrices arising from the Fourier transform (over the symmetric group S n )of the noise distribution, of the minimum singular value of the matrix. (See Theorem 3.1 for adetailed statement.) At this point, we have reduced the algorithmic problem of obtaining a learningalgorithm for a particular noise model to the analytic task of lower bounding the relevant singularvalues. We carry out the required analyses on a noise-model-by-noise-model basis in Sections 4, 5,and 6. These analyses employ ideas and results from the representation theory of the symmetricgroup and its connections to enumerative combinatorics; we give a brief overview of the necessarybackground in Appendix A.To establish our lower bound for the Cayley-Mallows model, Theorem 1.5, we exhibit two dis-tributions f and f over the symmetric group such that the distributions of noisy rankings M θ ∗ f and M θ ∗ f have very small statistical distance from each other. Not surprisingly, the inspiration6or this construction also comes from the representation theory of the symmetric group; more pre-cisely, the two above-mentioned distributions are obtained from the character (over the symmetricgroup) corresponding to a particular carefully chosen partition of [ n ]. A crucial ingredient in theproof is the fact that characters of the symmetric group are rational-valued functions, and henceany character can be split into a positive part and a negative part; details are given in Section 8. In this paper we have considered three particular noise models — symmetric noise, heat kernelnoise, and Cayley-Mallows noise — and given efficient algorithms for these noise models. Lookingbeyond these specific noise models, though, our approach provides a general framework for obtainingalgorithms for learning mixtures of noisy rankings. 
Indeed, for essentially any efficiently samplablenoise distribution K , given access to samples from K ∗ f our approach reduces the algorithmicproblem of learning f to the analytic problem of lower bounding the minimum singular values ofmatrices arising from the Fourier transform of K (see Theorem 3.1). We believe that this techniquemay be useful in a broader range of contexts, e.g. to obtain results analogous to ours for the originalKendall-Mallows model or for other noise models.As is made clear in Sections 4, 5, and 6, the representation-theoretic analysis that we requirefor our noise models is facilitated by the fact that each of the noise distributions considered inthose sections is a class function (in other words, the value of the distribution on a given inputpermutation depends only on the cycle structure of the permutation). Extending the kinds ofanalyses that we perform to other noise models which are not class functions is a technical challengethat we leave for future work. The main result of this section is the reduction alluded to in Section 1.4. In more detail, we givean algorithm which, given the ability to efficiently estimate t -way marginals, efficiently computes ahigh-accuracy approximation for an unknown ε -heavy distribution f with support size at most k : Theorem 2.1.
Let f be an unknown ε -heavy distribution over S n with | supp( f ) | ≤ k . Suppose thereis an algorithm A marginal with the following property: given as input a value δ > and two vectors ¯ i = ( i , . . . , i t ) and ¯ j = ( j , . . . , j t ) each composed of t distinct elements of { , . . . , n } , algorithm A marginal runs in time T ( δ, t, k, n ) and outputs an additively ± δ -accurate estimate of the (¯ i, ¯ j ) -marginal of f (recall Definition 1.7). Then there is an algorithm A learn with the following property:given the value of ε , algorithm A learn runs in time poly( n/ε, n log k ) · T ( ε k O (log k ) , k, k , n ) andreturns a function g : S n → R + such that (cid:107) f − g (cid:107) ≤ ε . Looking ahead, given Theorem 2.1, in order to obtain a positive result for a specific noise model K the remaining task is to develop an algorithm A marginal which, given access to noisy samplesfrom K ∗ f , can reliably estimate the required marginals. The algorithm is given in Section 3and the detailed analyses establishing its efficiency for each of the noise models (by boundingminimum singular values of certain matrices arising from each specific noise distribution) is givenin Sections 4, 5, and 6. 7 .1 A useful structural result The following structural result on functions from S n to R with small support will be useful for us: Claim 2.2 (Small-support functions are correlated with juntas) . Fix ≤ (cid:96) ≤ n and let g : [ n ] (cid:96) → R be such that (cid:107) g (cid:107) = 1 and | supp( g ) | ≤ k . There is a subset U ⊆ [ n ] and a list of values α , . . . , α | U | ∈ [ n ] such that | U | ≤ log k and (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) x ∈ [ n ] (cid:96) g ( x ) · [ x i = α i for all i ∈ U ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ k − O (log k ) . (1)Claim 2.2 is reminiscent of analogous structural results for functions over { , } (cid:96) which areimplicit in the work of [WY12] (specifically, Theorem 1.5 of that work), and indeed Claim 2.2 canbe proved by following the techniques of [WY12]. Michael Saks [Sak18] has communicated to usan alternative, and arguably simpler, argument for the relevant structural result over { , } (cid:96) ; herewe follow that alternative argument (extending it in the essentially obvious way to the domain [ n ] (cid:96) rather than { , } (cid:96) ). Proof.
Let the support of g be S ⊆ [ n ] (cid:96) . Note that since | S | ≤ k , there must exist some setof k (cid:48) := min { k, (cid:96) } coordinates such that any two elements of S differ in at least one of thosecoordinates. Without loss of generality, we assume that this set is the first k (cid:48) coordinates { , . . . , k (cid:48) } . We prove Claim 2.2 by analyzing an iterative process that iterates over the coordinates 1 , . . . , k (cid:48) .At the beginning of the process, we initialize a set Coord live of “live coordinates” to be [ k (cid:48) ], initializea set Constr of constraints to be initially empty, and initialize a set S live of “live support elements”to be the entire support S of g . We will see that the iterative process maintains the followinginvariants:(I1) The coordinates in Coord live are sufficient to distinguish between the elements in S live , i.e. anytwo distinct strings in S live have distinct projections onto the coordinates in Coord live ;(I2) The only elements of S that satisfy all the constraints in Constr are the elements of S live .Before presenting the iterative process we need to define some pertinent quantities. For eachcoordinate j ∈ Coord live and each index α ∈ [ n ], we define Wt ( j, α ) := (cid:88) x ∈ S live : x j = α | g ( x ) | , the weight under g of the live support elements x that have x j = α , and we define Num ( j, α ) := |{ x ∈ S live : x j = α }| , the number of live support elements x that have x j = α (note that Num ( j, α ) has nothing to dowith g ). It will also be useful to have notation for fractional versions of each of these quantities, sowe define FracWt ( j, α ) := Wt ( j, α ) (cid:80) x ∈ S live | g ( x ) | . and Frac ( j, α ) := Num ( j, α ) | S live | j ∈ Coord live we have that (cid:80) α Num ( j, α ) = | S live | , or equivalently (cid:80) α Frac ( j, α ) =1 . For each coordinate j ∈ Coord live , we write
MAJ ( j ) to denote the element β ∈ [ n ] whichis such that Num ( j, β ) ≥ Num ( j, α ) for all α ∈ [ n ] (we break ties arbitrarily). Finally, we let FracWtMaj ( j ) = FracWt ( j, MAJ ( j )).Now we are ready to present the iterative process:1. If every j ∈ Coord live has
FracWtMaj ( j ) > − k (cid:48) , then halt the process. Otherwise, let j be any element of Coord live for which FracWtMaj ( j ) ≤ − k (cid:48) .2. For this coordinate j , choose α ∈ [ n ] which maximizes the ratio FracWt ( j,α ) Frac ( j,α ) (or equivalently,maximizes FracWt ( j,α ) Num ( j,α ) ) subject to Frac ( j, α ) (cid:54) = 0 and α (cid:54) = MAJ ( j ).3. Add the constraint x j = α to Constr, remove j from Coord live , and remove all x such that x j (cid:54) = α from S live . Go to Step 1.When the iterative process ends, suppose that the set Constr is { x j = α , . . . , x j (cid:96) = α (cid:96) } . Thenwe claim that Equation (1) holds for U = { j , . . . , j (cid:96) } .To argue this, we first observe that both invariants (I1) and (I2) are clearly maintained by eachround of the iterative process. We next observe that each time a pair ( j, α ) is processed in Step 3,it holds that Frac ( j, α ) ≤ , and hence each round shrinks S live by a factor of at least 2. Thus, afterlog k steps, the set S live must be of size at most 1 and hence the process must halt. (Note that theclaimed bound | U | ≤ log k follows from the fact that the process runs for at most log k stages.)Next, note that when the process halts, by a union bound over the at most k (cid:48) coordinates inCoord live it holds that (cid:88) x ∈ S live : x j = MAJ ( j ) for all j ∈ Coord live | g ( x ) | ≥ · (cid:88) x ∈ S live | g ( x ) | . On the other hand, by the first invariant (I1), the cardinality of the set { x ∈ S live : x j = MAJ ( j )for all j ∈ Coord live } is precisely 1. This immediately implies that almost all of the weight of g ,across elements of S live , is on a single element; more precisely, that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) x ∈ S live g ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ · (cid:88) x ∈ S live | g ( x ) | , from which it follows that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) x ∈ [ n ] (cid:96) g ( x ) · [ x i = α i for all i ∈ U ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ · (cid:88) x ∈ S live | g ( x ) | . (2)So to establish Equation (1), it remains only to establish a lower bound on (cid:80) x ∈ S live | g ( x ) | whenthe process terminates. To do this, let us suppose that the process runs for T steps where in the Note that this means almost all of the weight under g of the live support elements is on elements that all agreewith the majority value on coordinate j . Note further that if Coord live is empty then this condition trivially holds. th step the coordinate chosen is j t . Now, at any stage t , we have (cid:80) β ∈ Coord live : β (cid:54) = MAJ ( j t ) FracWt ( j t , β ) (cid:80) β ∈ Coord live : β (cid:54) = MAJ ( j t ) Frac ( j t , β ) ≥ k (cid:48) . (because the denominator is at most 1 and since the process does not terminate, the numerator isat least k ). As a result, we get that if the constraint chosen at time t is x j t = α t , then FracWt ( j t , α t ) Frac ( j t , α t ) ≥ k (cid:48) . (3)By Equation (3), when the process halts we have (cid:88) x ∈ S live | g ( x ) | = T (cid:89) t =1 FracWt ( j t , α t ) ≥ k (cid:48) ) T T (cid:89) t =1 Frac ( j t , α t ) . But since at least one element remains, we have that (cid:81) Tt =1 Frac ( j t , α t ) ≥ k , and since T ≤ log k ,we conclude (recalling that k (cid:48) ≤ k ) that (cid:88) x ∈ S live | g ( x ) | ≥ k − O (log k ) . Combining with (2), this yields the claim.
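The iterative process used in this proof is itself a small algorithm, and a sketch may help. The following Python rendering is ours (the name correlated_junta and the dictionary representation of g are invented), and for simplicity it initializes the live set to all ℓ coordinates rather than to a distinguishing set of min{k, ℓ} coordinates as in the proof.

```python
from collections import Counter

def correlated_junta(g, ell):
    """Sketch of the iterative process from the proof of Claim 2.2.
    g: dict mapping length-ell tuples (the support) to nonzero real values.
    Returns constraints [(coordinate, value), ...]; at most ~log|supp(g)| of them."""
    s_live = set(g)                                  # live support elements
    coord_live = set(range(ell))                     # live coordinates (simplified; see above)
    kprime = max(1, min(len(g), ell))
    constr = []
    while len(s_live) > 1:
        total = sum(abs(g[x]) for x in s_live)
        frac_wt = lambda j, a: sum(abs(g[x]) for x in s_live if x[j] == a) / total
        frac = lambda j, a: sum(1 for x in s_live if x[j] == a) / len(s_live)
        maj = {j: Counter(x[j] for x in s_live).most_common(1)[0][0] for j in coord_live}
        light = [j for j in coord_live if frac_wt(j, maj[j]) <= 1 - 1.0 / (2 * kprime)]
        if not light:                                # every live coordinate is majority-heavy: halt
            break
        j = light[0]
        # pick alpha != MAJ(j) with Num(j, alpha) > 0 maximizing FracWt(j, alpha)/Frac(j, alpha)
        alpha = max(({x[j] for x in s_live} - {maj[j]}),
                    key=lambda b: frac_wt(j, b) / frac(j, b))
        constr.append((j, alpha))
        coord_live.discard(j)
        s_live = {x for x in s_live if x[j] == alpha}
    return constr
```

The list of constraints returned plays the role of the set U and the values α_1, . . . , α_{|U|} in Equation (1).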
The idea of the proof is quite similar to the algorithmic component of several recent works onpopulation recovery [MS13, WY12, LZ15, DST16]. Given any function f : S n → R and any integer i ∈ { , . . . , n } , we define the function f i : [ n ] i → R as follows: f i ( x , . . . , x i ) := (cid:88) σ ∈ S n f ( σ ) · [ σ (1) = x ∧ . . . ∧ σ ( i ) = x i ] . (4)At a high level, the algorithm A learn of Theorem 2.1 works in stages, by successively recon-structing f , . . . , f n . In each stage it uses the procedure described in the following claim, whichsays that high-accuracy approximations of the (log k )-marginals together with the support of f (cid:96) (ora not-too-large superset of it) suffices to reconstruct f (cid:96) : Claim 2.3.
Let f (cid:96) be an unknown distribution over [ n ] (cid:96) supported on a given set S of size k . Thereis an algorithm A one − stage which has the following guarantee: The algorithm is given as input δ > ,and parameters β J,y (for every set J ⊆ [ (cid:96) ] of size at most log k and every y ∈ [ n ] J ) which satisfy (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) β J,y − (cid:88) x ∈ S f ( x ) · [ x i = y i for all i ∈ J ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ δ.A one − stage runs in time poly ( n, (cid:96) log k ) and outputs a function ˜ f : [ n ] (cid:96) → [0 , such that (cid:107) f − ˜ f (cid:107) ≤ δ · k O (log k ) . roof. We consider a linear program which has a variable s x for each x ∈ S (representing theprobability that f puts on x ) and is defined by the following constraints:1. s x ≥ (cid:80) x ∈ S s x = 1.2. For each J ⊆ [ (cid:96) ] of size at most log k and each y ∈ [ n ] J , include the constraint (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) β J,y − (cid:88) x ∈ S s x · [ x i = y i for all i ∈ J ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ δ. (5)Algorithm A one − stage sets up and solves the above linear program (this can clearly be done in time poly ( n, (cid:96) log k )). We observe that the linear program is feasible since by definition s x = f (cid:96) ( x ) is afeasible solution. To prove the claim it suffices to show that every feasible solution is (cid:96) -close to f (cid:96) ;so let f ∗ ( x ) denote any other feasible solution to the linear program, and let η denote (cid:107) f ∗ − f (cid:96) (cid:107) . Define h ( x ) = f ∗ ( x ) − f (cid:96) ( x ) , so (cid:107) h (cid:107) = η. By Claim 2.2, we have that there is a subset J ⊆ [ (cid:96) ] ofsize at most log k and a y ∈ [ n ] (cid:96) such that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) x h ( x ) · [ x i = y i for all i ∈ J ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ η · k − O (log k ) . (6)On the other hand, since both f (cid:96) ( x ) and f ∗ ( x ) are feasible solutions to the linear program, by thetriangle inequality it must be the case that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) x h ( x ) · [ x i = y i for all i ∈ J ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ δ. (7)Equations 6 and 2.2 together give the desired upper bound on η , and the claim is proved.Essentially the only remaining ingredient required to prove Theorem 2.1 is a procedure to find (anot-too-large superset of) the support of f . This is given by the following claim, which inductivelyuses the algorithm A one − stage to successively construct suitable (approximations of) the supportsets for f , . . . , f n . Claim 2.4.
Under the assumptions of Theorem 2.1, there is an algorithm A support with the fol-lowing property: given as input a value δ > , algorithm A support runs in time poly( n/ε, n log k ) · T ( ε k O (log k ) , k, k , n ) and for each (cid:96) = 1 , . . . , n outputs a set S (cid:48) ( (cid:96) ) of size at most k which containsthe support of f (cid:96) .Proof. The algorithm A support works inductively, where at the start of stage (cid:96) (in which it willconstruct the set S (cid:48) ( (cid:96) ) ) it is assumed to have a set S (cid:48) ( (cid:96) − with | S (cid:48) ( (cid:96) − | ≤ k which contains thesupport of f (cid:96) − . (Note that at the start of the first stage (cid:96) = 1 this holds trivially since f triviallyhas empty support).Let us describe the execution of the (cid:96) -th stage of A support . For 1 ≤ (cid:96) ≤ n , we define the set S marg ,(cid:96) as follows: S marg ,(cid:96) = (cid:8) t : (cid:88) σ ∈ S n f ( σ ) · [ σ ( (cid:96) ) = t ] > (cid:9) . poly ( n/ε ) · T ( ε , , k, n ), we can compute f ( σ ) · [ σ ( (cid:96) ) = t ] up to error ± ε/ β (cid:96),t ) for all 1 ≤ t ≤ n . Since f is ε -heavy, we have that t ∈ S marg ,(cid:96) implies β (cid:96),t ≥ ε t (cid:54)∈ S marg ,(cid:96) implies β (cid:96),t ≤ ε . Consequently, we can compute the set S marg ,(cid:96) in time poly ( n/ε ) · T ( ε , , k, n ). The final observationis that the set S ∗ ( (cid:96) ) (of cardinality at most k ) obtained by appending each final (cid:96) -th characterfrom S marg ,(cid:96) to each element of S (cid:48) ( (cid:96) − must contain the support S ( (cid:96) ) of f (cid:96) . Set δ = ε k O (log k ) ; bythe assumption of Theorem 2.1, in time T ( ε k O (log k ) , k, k , n ) it is possible to obtain additively ± δ -accurate estimates of each of the (2 log k )-way marginals of f (cid:96) . In the (cid:96) -th stage, algorithm A support runs A one − stage using S ∗ ( (cid:96) ) and these estimates of the marginals; by Claim 2.3, this takes timepoly( n/ε, n log k ) and yields a function ˜ f (cid:96) : [ n ] (cid:96) → [0 ,
1] such that (cid:107) f (cid:96) − ˜ f (cid:96) (cid:107) ≤ δ k O (log k ) · k O (log k ) = ε/ . Since by assumption f is ε -heavy, it follows that any element x in the support of ˜ f (cid:96) such that˜ f (cid:96) ( x ) ≤ ε/ f (cid:96) ; so the algorithm removes all such elements x from S ∗ ( (cid:96) ) to obtain the set S (cid:48) ( (cid:96) ) . This resulting S (cid:48) ( (cid:96) ) is precisely the support of f (cid:96) , and is clearly of size atmost k. Finally, the overall algorithm A learn works by running A support to get the set S (cid:48) = S (cid:48) ( n ) of sizeat most k which is the support of f n = f , and then uses S (cid:48) and the algorithm A marginal fromthe assumptions of Theorem 2.1) to run algorithm A one − stage and obtain the required ε -accurateapproximator g of f . This concludes the proof of Theorem 2.1. Recall that the noisy ranking learning problems we consider are of the following sort: Thereis a known noise distribution K supported on S n , and an unknown k -sparse ε -heavy distribution f : S n → [0 , π ∼ K and σ ∼ f are obtained, and the sample givento the learner is ( πσ ) ∈ S n . By the reduction established in Theorem 2.1, in order to give analgorithm that learns the distribution f in the presence of a particular kind of noise K , it sufficesto give an algorithm that can efficiently estimate t -way marginals given samples πσ ∼ K ∗ f. The main result of this section, Theorem 3.1, gives such an algorithm. Before stating thetheorem we need some terminology and notation and we need to recall some necessary backgroundfrom representation theory of the symmetric group (see Appendix A for a detailed overview of allof the required background).First, let K be a distribution over S n (which should be thought of as a noise distribution asdescribed earlier). We say that K is efficiently samplable if there is a poly( n )-time randomizedalgorithm which takes no input and, each time it is invoked, returns an independent draw of π ∼ K . Next, we recall that a partition λ of the natural number n (written “ λ (cid:96) n ”) is a vector of naturalnumbers ( λ , . . . , λ k ) where λ ≥ λ ≥ . . . ≥ λ k > λ + . . . + λ k = n (see Appendix A.2for more detail). For two partitions λ and µ of n , we say that µ dominates λ , written µ (cid:3) λ , if (cid:80) j ≤ i µ j ≥ (cid:80) j ≤ i λ j for all i > λ (cid:96) n , let Up ( λ ) denote the set ofall partitions µ (cid:96) n such that µ (cid:3) λ.
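The dominance order and the set Up(λ) are simple enough to check mechanically; the following small Python check is ours and is included only as a concrete reading of the definition (the example partitions are arbitrary, chosen to match the hook-shaped partitions (n − ℓ, 1, . . . , 1) used below).

```python
def dominates(mu, lam):
    """Return True iff the partition mu dominates lam (both partitions of the same n,
    given as non-increasing tuples of positive integers)."""
    pm = pl = 0
    for i in range(max(len(mu), len(lam))):
        pm += mu[i] if i < len(mu) else 0
        pl += lam[i] if i < len(lam) else 0
        if pm < pl:
            return False
    return True

# Example with n = 10, ell = 3: a partition mu of n dominates the hook (n - ell, 1, 1, 1)
# exactly when its first part mu_1 is at least n - ell.
lam_hook = (7, 1, 1, 1)
assert dominates((8, 2), lam_hook) and not dominates((6, 4), lam_hook)
```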
We recall that a representation of the symmetric group S_n is a group homomorphism from S_n to C^{m×m} (see Appendix A). We further recall that for each partition λ ⊢ n there is a corresponding irreducible representation, denoted ρ_λ (see Appendix A.2). For a matrix M we write σ_min(M) to denote the smallest singular value of M. Given a partition λ ⊢ n we define the value σ_{min, Up(λ), K} to be

σ_{min, Up(λ), K} := min_{µ ∈ Up(λ)} σ_min(K̂(ρ_µ)),   (8)

the smallest singular value across all Fourier coefficients of the noise distribution at irreducible representations corresponding to partitions that dominate λ. (We recall that the Fourier coefficients of functions over the symmetric group, and indeed over any finite group, are matrices; see Appendix A.2.) Finally, for 0 ≤ ℓ ≤ n − 1 we define the partition λ_{hook,ℓ} ⊢ n to be λ_{hook,ℓ} := (n − ℓ, 1, . . . , 1), the hook-shaped partition with ℓ parts equal to 1. Now we can state the main result of this section:
Theorem 3.1.
Let K be an efficiently samplable distribution over S n . Let f be an unknowndistribution over S n . There is an algorithm A marginal with the following properties: A marginal receivesas input a parameter δ > , a confidence parameter τ > , a pair of (cid:96) -tuples ¯ i = ( i , . . . , i (cid:96) ) ∈ [ n ] (cid:96) , ¯ j = ( j , . . . , j (cid:96) ) ∈ [ n ] (cid:96) each composed of (cid:96) distinct elements, and has access to random samples from K ∗ f . Algorithm A marginal runs in time poly ( (cid:0) n(cid:96) (cid:1) , δ − , σ − , Up ( λ hook ,(cid:96) ) , K , log(1 /τ )) and outputs a value κ i,j which with probability at least − τ is a ± δ -accurate estimate of the ( i, j ) -marginal of f . We will use the following claim to prove Theorem 3.1:
Claim 3.2.
Let ρ : S n → C m × m be any unitary representation of S n , let K be any efficientlysamplable distribution over S n , and let σ min denote the smallest singular value of (cid:98) K ( ρ ) . Let f be anunknown distribution over S n . There is an algorithm which, given random samples from K ∗ f andan error parameter < δ < , runs in time poly ( m, n, σ − , δ − ) and with high probability outputsa matrix M f,ρ such that (cid:107) M f,ρ − (cid:98) f ( ρ ) (cid:107) ≤ δ .Proof. Let η , η > f is a distribution,the Fourier coefficient (cid:98) f ( ρ ) is equal to E σ ∼ f [ ρ ( σ )]. Consequently, since K is assumed to be efficientlysamplable and the algorithm is given samples from K ∗ f , by sampling from K and from K ∗ f itis straightforward to obtain matrices M , M in time poly( m, n, log(1 /τ )) which with probability1 − τ satisfy (cid:107) M − (cid:98) K ( ρ ) (cid:107) ≤ η and (cid:107) M − (cid:91) K ∗ f ( ρ ) (cid:107) ≤ η . Now we recall the following matrix perturbation inequality (see Theorem 2.2 of [Ste77]):
Lemma 3.3.
Let A ∈ R n × n be a non-singular matrix and further let ∆ A ∈ R n × n be such that (cid:107) ∆ A (cid:107) · (cid:107) A − (cid:107) < . Then A + ∆ A is non-singular. Further, if γ = 1 − (cid:107) A − (cid:107) (cid:107) ∆ A (cid:107) , then (cid:107) A − − ( A + ∆ A ) − (cid:107) ≤ (cid:107) A − (cid:107) (cid:107) ∆ A (cid:107) γ . η and η as follows (recall that δ < η = min (cid:8) δ · σ , δ · σ min (cid:9) and η = min { δ · σ min , } . (9)Applying Lemma 3.3 with (cid:98) K ( ρ ) in place of A and M − (cid:98) K ( ρ ) in place of ∆ A , using (9) (moreprecisely, the upper bound η ≤ δ · σ / η ≤ δ · σ min / (cid:107) M − − (cid:98) K ( ρ ) − (cid:107) ≤ (cid:107) (cid:98) K ( ρ ) − (cid:107) · (cid:107) M − (cid:98) K ( ρ ) (cid:107) − (cid:107) (cid:98) K ( ρ ) − (cid:107) · (cid:107) M − (cid:98) K ( ρ ) (cid:107) ≤ δ . (10)Now using (cid:91) K ∗ f ( ρ ) = (cid:98) K ( ρ ) · (cid:98) f ( ρ ), we get (cid:107) M − · M − (cid:98) f ( ρ ) (cid:107) = (cid:107) M − · M − (cid:98) K ( ρ ) − · (cid:91) K ∗ f ( ρ ) · (cid:107) ≤ (cid:107) M − · M − M − · (cid:91) K ∗ f ( ρ ) (cid:107) + (cid:107) M − · (cid:91) K ∗ f ( ρ ) − (cid:98) K ( ρ ) − · (cid:91) K ∗ f ( ρ ) (cid:107) ≤ (cid:107) M − (cid:107) · (cid:107) M − (cid:91) K ∗ f ( ρ ) (cid:107) + (cid:107) M − − (cid:98) K ( ρ ) − (cid:107) · (cid:107) (cid:91) K ∗ f ( ρ ) (cid:107) ≤ (cid:107) M − (cid:107) · η + (cid:107) (cid:91) K ∗ f ( ρ ) (cid:107) · δ . (using (10)) ≤ η (cid:16) (cid:107) (cid:98) K ( ρ ) − (cid:107) + (cid:107) M − − (cid:98) K ( ρ ) − (cid:107) (cid:17) + (cid:107) (cid:91) K ∗ f ( ρ ) (cid:107) · δ . (using (10)) ≤ σ − · η + δ · η + (cid:107) (cid:91) K ∗ f ( ρ ) (cid:107) · δ . (11)Next we use the following fact, which is an easy consequence of the triangle inequality and theassumption that ρ is unitary: Fact 3.4.
Let ρ : S n → C m × m be a unitary representation and let g : S n → R + . Then we havethat (cid:107) (cid:98) g ( ρ ) (cid:107) ≤ (cid:107) g (cid:107) . Combining this fact with (11) and (9), since (cid:107)K ∗ f (cid:107) = 1, we get that (cid:107) M − · M − (cid:98) f ( ρ ) (cid:107) ≤ σ − · η + δ · η + δ ≤ δ δ δ < δ. This concludes the proof of Claim 3.2.With Claim 3.2 in hand we are ready to prove Theorem 3.1: P roof of Theorem 3.1. Let τ λ hook ,(cid:96) be the permutation representation corresponding to the partition λ hook ,(cid:96) ; for conciseness we subsequently write ρ for τ λ hook ,(cid:96) . Definition A.10 immediately gives thatthe dimension of ρ is (cid:0) n(cid:96) (cid:1) . Observe that ρ is a unitary representation. Let σ min denote the smallestsingular value of (cid:98) K ( ρ ); applying Claim 3.2, we get an algorithm running in time poly ( (cid:0) n(cid:96) (cid:1) , σ − , δ )which outputs a matrix M f,ρ such that (cid:107) M f,ρ − (cid:98) f ( ρ ) (cid:107) ≤ δ . Next, we observe that the Youngtableaux corresponding to the partition λ hook ,(cid:96) (which, recalling Definition A.10, index the rowsand columns of ρ ( · )) correspond precisely to ordered t -tuples of distinct entries of [ n ]. If Y λ hook ,(cid:96) ,i = i and Y λ hook ,(cid:96) ,j = j , then it follows that (cid:98) f ( ρ )( i, j ) = (cid:88) σ ∈ S n f ( σ ) · [ f ( i ) = j and · · · and f ( i (cid:96) ) = j (cid:96) )] , i, j )-marginal of f as desired; so the output of the algorithm is M f,ρ ( i, j ).To finish the correctness argument it remains only to argue that σ − is at most poly( σ − , Up ( λ hook ,(cid:96) ) ) . To see that this is indeed, the case, we observe that by Theorem A.12, the permutation represen-tation τ λ hook ,(cid:96) block diagonalizes into a direct sum of irreducible representations ρ µ where each µ belongs to Up ( λ hook ,(cid:96) ). This finishes the proof of Theorem 3.1. In order to apply Theorem 3.1 to a particular noise distribution K we need to confirm that K isefficiently samplable; we now do this for each of the three noise models that we consider. It isimmediate from the definition that it is straightforward (given p ) to efficiently generate a random σ drawn from the symmetric noise distribution S p , and the same is true for the heat kernel noisedistribution H t .For the generalized Mallows model M θ , the characterization Pr σ ∼M θ [ σ = π ] = e − θd ( π,e ) /Z ( θ )given earlier does not directly yield an efficient sampling algorithm, since it may be hard to com-pute or approximate the normalizing factor Z ( θ ) = (cid:80) π ∈ S n e − θd ( π,e ) . Instead, we recall (see e.g. Sec-tion 2.1 of [DS98]) that the Metropolis algorithm can be used to efficiently perform a random walkon S n whose unique stationary distribution is the generalized Mallows distribution M θ . (Each stepof the random walk can be carried out efficiently because it is computationally easy to compute theCayley distance between two permutations: if π is the permutation that brings σ to τ , then theCayley distance d ( σ, τ ) is n − cycles( π ) where cycles( π ) is the number of cycles in π .) It is known(see e.g. Theorem 2 of [DH92]) that this random walk has rapid convergence, and consequently itis indeed possible to sample efficiently from M θ (up to an exponentially small statistical distancewhich can be ignored in our applications since our algorithms use a sub-exponential number ofsamples). 
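To make the samplability discussion concrete, the following Python sketch (ours, not the paper's implementation) spells out the Metropolis chain just described: the target distribution is M_θ, each proposal multiplies the current permutation by a uniformly random transposition, and the Cayley distance is tracked through the cycle count as noted above. The default number of burn-in steps is a heuristic assumption on our part, not a bound taken from [DH92].

```python
import math
import random

def num_cycles(sigma):
    """Number of cycles of a permutation given as a list: sigma[i] = image of i."""
    seen, count = [False] * len(sigma), 0
    for i in range(len(sigma)):
        if not seen[i]:
            count += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = sigma[j]
    return count

def sample_cayley_mallows(theta, n, steps=None, rng=random):
    """Metropolis chain whose stationary distribution is M_theta, i.e.
    Pr[sigma] proportional to exp(-theta * d(sigma, e)) = exp(theta * cycles(sigma)).
    `steps` is a burn-in heuristic (our assumption), not a mixing-time bound."""
    if steps is None:
        steps = 10 * n * (int(math.log(n + 1)) + 1)
    sigma = list(range(n))
    c = num_cycles(sigma)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        sigma[i], sigma[j] = sigma[j], sigma[i]      # propose multiplying by (i j)
        c_new = num_cycles(sigma)
        if c_new < c and rng.random() >= math.exp(theta * (c_new - c)):
            sigma[i], sigma[j] = sigma[j], sigma[i]  # reject: undo the swap
        else:
            c = c_new                                # accept
    return sigma
```

A noisy sample from M_θ ∗ f is then obtained by drawing σ from the mixture and outputting π·σ for π generated as above.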
In this section we establish lower bounds on the smallest singular value for the relevant matricescorresponding to “symmetric noise” S p on S n . In more detail, the main result of this section is thefollowing lower bound: Lemma 4.1.
Let (cid:96) ∈ { , . . . , n } and let p = ( p , . . . , p n ) ∈ ∆ n (i.e. p is a non-negative vectorwhose entries sum to 1) which is such that n − (cid:96) (cid:88) j =0 p j ≥ κ. Then (recalling Equation (8) ) we have that σ min , Up ( λ hook ,(cid:96) ) , S p ≥ κn (cid:96) . (12)15 .1 Setup To analyze the smallest singular value of (cid:99) S p ( ρ µ ) (as required by the definition of σ min , Up ( λ hook ,(cid:96) ) , S p ),we start by observing that symmetric noise is a class function (meaning that it is invariant underconjugation, see Definition A.6): Claim 4.2.
For any vector p = ( p , . . . , p n ) ∈ ∆ n , the distribution S p (viewed as a function from S n to [0 , ) is a class function ( i.e. S p ( π ) = S p ( τ πτ − ) for every π, τ ∈ S n ).Proof. For 0 ≤ j ≤ n , let e j denote the vector in R n +1 which has a 1 in the j -th position and a 0 inevery other position. By linearity, to prove Claim 4.2 it suffices to prove that S e j is invariant underconjugation for every j ; to establish this, it suffices to show that S e j is invariant under conjugationby any transposition τ . By symmetry, it suffices to consider the transposition τ = (1 , S e j is a uniform average of U A over all (cid:0) nj (cid:1) subsets A of [ n ] of size exactly j . Now we consider two cases: the first is that | A ∩ { , }| is 0 or 2. In this case it is easy tosee that U A does not change under conjugation by the transposition (1 , | A ∩ { , }| = 1; in this case it is easy to see that conjugation by (1 ,
2) converts U A into U A ∆ { , } . Since the collection of size- j sets A with A ∩ { , } = { } are in 1-1 correspondencewith the collection of size- j sets A with A ∩ { , } = { } , it follows that S e j is invariant underconjugation by τ = (1 , µ (cid:96) m, λ (cid:96) n where m ≤ n , we write Paths( µ, λ ) to denote the number of paths from µ to λ in Young’s lattice (see Ap-pendix A.2 and Theorem A.15). We write Triv j to denote the trivial partition ( j ) of j . Lemma 4.3.
Let λ (cid:96) n and let ρ λ be the corresponding irreducible representation of S n . Given p = ( p , . . . , p n ) ∈ ∆ n , we have that (cid:99) S p ( ρ λ ) = c ( p, λ ) · Id where c ( p, λ ) := (cid:80) nj =0 p j · Paths(
Triv j , λ ) dim ( ρ λ ) . (13) Proof.
By Claim 4.2, we have that S p is a class function, so we may apply Lemma A.9 to concludethat (cid:99) S p ( ρ λ ) = c ( p, λ ) · Id , where c ( p, λ ) = 1 dim ( ρ λ ) · (cid:18) (cid:88) σ ∈ S n S p ( σ ) · χ λ ( σ ) (cid:19) and χ λ denotes the character of the irreducible representation ρ λ . Thus it remains to show that (cid:80) σ ∈ S n S p ( σ ) · χ λ ( σ ) is equal to the numerator of Equation (13). By definition of S p , we have that (cid:88) σ ∈ S n S p ( σ ) · χ λ ( σ ) = (cid:88) ≤ j ≤ n p j E A : |A| = j E σ ∈ U A χ λ ( σ ) . (14)We proceed to analyze E σ ∈ U A χ λ ( σ ). Let ρ A λ denote the representation ρ λ restricted to the subgroup S A . By Theorem A.15, the representation ρ A λ splits as follows: ρ A λ = ⊕ µ (cid:96)|A| Paths( µ, λ ) ρ µ . E σ ∈ U A χ λ ( σ ) = (cid:88) µ (cid:96)|A| Paths( µ, λ ) E σ ∈ U A χ µ ( σ ) = Paths( Triv |A| , λ ) . The second equality follows from that fact that if µ is a non-trivial partition of |A| then E σ ∈ U A χ µ ( σ ) =0, while if µ = Triv |A| then E σ ∈ U A χ µ ( σ ) = 1. Plugging this into (14) we get that (cid:80) σ ∈ S n S p ( σ ) · χ λ ( σ ) = (cid:80) nj =0 p j · Paths(
Triv j , λ ), and the lemma is proved. We recall from Equation (8) that σ min , Up ( λ hook ,(cid:96) ) , S p := min µ ∈ Up ( λ hook ,(cid:96) ) σ min ( (cid:99) S p ( ρ µ )) . Fix any µ ∈ Up ( λ hook ,(cid:96) ), so µ is a partition of n of the form ( n − (cid:96) (cid:48) , (cid:96) , . . . , (cid:96) r ) where (cid:96) (cid:48) ≤ (cid:96). By Lemma 4.3 we have that the smallest singular value of (cid:99) S p ( ρ µ ) is c ( p, µ ) := (cid:80) nj =0 p j · Paths(
Triv j , µ ) dim ( ρ µ ) . (15)To upper bound dim ( ρ µ ), we observe thatdim( ρ µ ) ≤ dim( τ µ ) = (cid:18) nn − (cid:96) (cid:48) , (cid:96) , . . . , (cid:96) r (cid:19) ≤ n !( n − (cid:96) (cid:48) )! ≤ n (cid:96) (cid:48) ≤ n (cid:96) , where the first inequality is by Theorem A.12. For the numerator, we observe that if j ≤ n − (cid:96) then there is at least one path in the Young lattice from Triv j to µ , so under the assumptionsof Lemma 4.1 the numerator of Equation (15) is at least κ. This proves the lemma.
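The quantity Paths(Triv_j, λ) appearing in c(p, λ) is purely combinatorial and, for small shapes, can be computed by brute force. The following recursive sketch is ours (the helper names paths and triv are invented) and simply peels corner boxes off λ one at a time.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def paths(mu, lam):
    """Number of paths from partition mu up to partition lam in Young's lattice
    (each step adds one box; partitions are non-increasing tuples, () is empty)."""
    if lam == mu:
        return 1
    if sum(lam) <= sum(mu):
        return 0
    total = 0
    for i in range(len(lam)):                 # try removing a corner box from row i of lam
        if i == len(lam) - 1 or lam[i] > lam[i + 1]:
            smaller = list(lam)
            smaller[i] -= 1
            if smaller[-1] == 0:
                smaller.pop()
            total += paths(mu, tuple(smaller))
    return total

def triv(j):                                  # the trivial partition (j) of j
    return (j,) if j > 0 else ()

# Sanity checks: Paths(Triv_0, lam) is the number of standard Young tableaux of shape lam.
assert paths(triv(0), (2, 1)) == 2
assert paths(triv(2), (3, 1)) == 2            # (2)->(3)->(3,1) and (2)->(2,1)->(3,1)
```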
In this section, analogous to Section 4, we lower bound Equation (8) when the noise distribution K is H t , corresponding to “heat kernel noise” at temperature parameter t : Lemma 5.1.
Let t ≥ and let (cid:96) ∈ { , . . . , cn } for some suitably small universal constant c > .Then we have that σ min , Up ( λ hook ,(cid:96) ) , H t ≥ · e − O ( (cid:96)t ) /n . (16) Let trans : S n → [0 ,
1] be the following probability distribution over S n :trans( π ) = /n if π is the identity,2 /n if π is a transposition,0 otherwise.17ince trans( π ) depends only on the cycle structure of π , the function trans( · ) is a class function.Fix any µ ∈ Up ( λ hook ,(cid:96) ), so µ is a partition of n of the form ( µ , . . . , µ r ) where µ ≥ n − (cid:96). As inthe proof of Lemma 4.3 we may apply Lemma A.9 to conclude that (cid:91) trans( ρ µ ) = c trans ,µ · Id for some constant c trans ,µ . By Corollary 1 of Diaconis and Shahshahani [DS81], we have that c trans ,µ = 1 n + n − n · χ µ ( τ )dim( ρ µ ) , (17)where as before χ µ denotes the character of the irreducible representation ρ µ and τ is any trans-position. [DS81] further shows that for ρ µ an irreducible representation of S n with µ as above and τ any transposition, it holds that χ µ ( τ )dim( ρ µ ) = 1 n ( n − · r (cid:88) j =1 ( µ j − j )( µ j − j + 1) − j ( j − . (18)In our setting we have(18) ≥ ( n − (cid:96) )( n − (cid:96) − n ( n −
1) + 1 n ( n − r (cid:88) j =2 ( µ j − j )( µ j − j + 1) − j ( j − . (19)where the inequality holds because µ ≥ n − (cid:96). Now, we observe that for each summand in Equa-tion (19), we have ( µ j − j )( µ j − j + 1) − j ( j −
1) = µ j − µ j (2 j − ≥ − µ j (2 j − ≥ − (cid:96)j − · (2 j − ≥ − (cid:96). The second inequality above holds because µ + · · · + µ j ≤ (cid:96) and the µ j ’s are non-increasing, so µ j ≤ (cid:96)j − . Since r − ≤ (cid:96) , this means that(18) ≥ ( n − (cid:96) )( n − (cid:96) − n ( n − − (cid:96) n ( n − ≥ − O ( (cid:96) ) n , and recalling Equation (17) we get that1 ≥ c trans ,µ ≥ − O ( (cid:96) ) n . (20) As in Section 4 we recall from Equation (8) that σ min , Up ( λ hook ,(cid:96) ) , H t := min µ ∈ Up ( λ hook ,(cid:96) ) σ min ( (cid:99) H t ( ρ µ )) , µ ∈ Up ( λ hook ,(cid:96) ) (so µ is a partition of n of the form ( µ , . . . , µ r ) where µ ≥ n − (cid:96) ). Werecall that the function H t : S n → [0 ,
1] is defined by H t = ∞ (cid:88) j =0 Pr T ∼ Poi ( t ) [ T = j ](trans) j , where “(trans) T ” denotes T -fold convolution of trans. Since convolution corresponds to multipli-cation of Fourier coefficients, this gives that (cid:99) H t ( ρ µ ) = c ( t, µ ) · Id , where c ( t, µ ) := ∞ (cid:88) j =0 Pr T ∼ Poi ( t ) [ T = j ]( c trans ,µ ) j . (21)Recalling [Cho94] that the median of the Poisson distribution Poi ( t ) is at most t + 1 /
3, we get that c ( t, µ ) ≥ · ( c trans ,µ ) t +1 / ≥ · e − O ( (cid:96)t ) /n , (where the second inequality uses (cid:96) ≤ cn and t ≥ In this section we lower bound Equation (8) when the noise distribution K is M θ , correspondingto the Cayley-Mallows noise model with parameter θ : Lemma 6.1.
In this section we lower bound Equation (8) when the noise distribution $\mathcal{K}$ is $M_\theta$, corresponding to the Cayley-Mallows noise model with parameter $\theta$:

Lemma 6.1. Let $\theta > 0$, let $\ell \in \{1,\ldots,n\}$, and let $\eta := \mathrm{dist}(\theta,\ell) = \min_{j\in\{0,\ldots,\ell\}}\big|e^{\theta} - j\big|$. Then (recalling Equation (8)) we have that
$$\sigma_{\min,\,\mathrm{Up}(\lambda_{\mathrm{hook},\ell}),\,M_\theta} \;\ge\; (2n)^{-\ell}\,\eta^{2\sqrt{2\ell}}. \qquad (22)$$
Similar to the previous two sections, Lemma 6.1 follows immediately from the following lower bound on singular values of certain irreducible representations:

Lemma 6.2.
Let $\mu$ be a partition of $n$ of the form $(\mu_1,\ldots,\mu_r)$ where $\mu_1 \ge n-\ell$. Let $\theta > 0$ and let $\eta := \mathrm{dist}(\theta,\ell) = \min_{j\in\{0,\ldots,\ell\}}\big|e^{\theta} - j\big|$. Then we have that $\widehat{M_\theta}(\rho_\mu) = c_{\mu,\theta}\cdot\mathrm{Id}$, where
$$|c_{\mu,\theta}| \;\ge\; (2n)^{-\ell}\,\eta^{2\sqrt{2\ell}}.$$
To prove Lemma 6.2, we will need the notions of content and hook length for boxes in a Young diagram:
Definition 6.3.
Let $\mu$ be a partition $\mu \vdash n$. The hook length of a box $u$ in the Young diagram for $\mu$, denoted by $h(u)$, is the sum
(number of boxes to the right of $u$ in its row) + (number of boxes below $u$ in its column) + 1 (for $u$ itself).
The content $c(u)$ of a box $u$ is $c(u) := j - i$, where $j$ is its column number (from the left, starting with column 1) and $i$ is its row number (from the top, starting with row 1).

[Figure: an example Young diagram shown twice, with each box labeled by its hook length in the left copy and by its content in the right copy.]
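To make Definition 6.3 concrete, here is a short Python sketch (ours, not from the paper) that computes the hook length and content of every box of a Young diagram; the example partition is arbitrary.

```python
def hooks_and_contents(mu):
    """Map each box (i, j) of the Young diagram of mu (1-indexed row i, column j)
    to its hook length h(u) and its content c(u) = j - i."""
    mu = list(mu)
    # conjugate partition: number of rows whose length exceeds column index j (0-indexed)
    mu_conj = [sum(1 for row in mu if row > j) for j in range(mu[0])]
    hooks, contents = {}, {}
    for i, row_len in enumerate(mu, start=1):
        for j in range(1, row_len + 1):
            arm = row_len - j             # boxes strictly to the right in the same row
            leg = mu_conj[j - 1] - i      # boxes strictly below in the same column
            hooks[(i, j)] = arm + leg + 1
            contents[(i, j)] = j - i
    return hooks, contents

h, c = hooks_and_contents((4, 2, 1, 1))
print(h[(1, 1)], c[(1, 4)], c[(4, 1)])    # hook length 7, contents 3 and -3
```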
Lemma 6.4. Let $\mu \vdash n$ and let $\chi_\mu$ be the corresponding character of $S_n$. For any $q \in \mathbb{R}$,
$$\frac{1}{n!}\sum_{\sigma\in S_n} \chi_\mu(\sigma)\cdot q^{\mathrm{cycles}(\sigma)} \;=\; \prod_{u\in\mu}\frac{q + c(u)}{h(u)},$$
where the subscript "$u\in\mu$" means that $u$ ranges over all the boxes in the Young diagram corresponding to $\mu$.

Proof. The above identity is given as Exercise 7.50 in Stanley's book [Sta99]. For the sake of completeness, we provide the proof here.

For any $\bar{t} = (t_1,\ldots,t_n)$, we define the polynomial
$$a_{\bar t}(x_1,\ldots,x_n) \;:=\; \det\begin{pmatrix} x_1^{t_1} & x_2^{t_1} & \cdots & x_n^{t_1}\\ x_1^{t_2} & x_2^{t_2} & \cdots & x_n^{t_2}\\ \vdots & \vdots & \ddots & \vdots\\ x_1^{t_n} & x_2^{t_n} & \cdots & x_n^{t_n}\end{pmatrix}.$$
Given any partition $\mu \vdash n$, we now define the Schur polynomial $s_\mu(x_1,\ldots,x_n)$ as follows: define $\bar t_\mu = (\mu_1 + n - 1,\ \mu_2 + n - 2,\ \ldots,\ \mu_n + 0)$ and $\bar t_0 = (n-1, n-2, \ldots, 0)$, and set
$$s_\mu(x_1,\ldots,x_n) \;:=\; \frac{a_{\bar t_\mu}(x_1,\ldots,x_n)}{a_{\bar t_0}(x_1,\ldots,x_n)}.$$
The denominator is just the Vandermonde determinant of the variables $(x_1,\ldots,x_n)$. As the polynomial $a_{\bar t_\mu}(x_1,\ldots,x_n)$ is alternating, it follows that $s_\mu(x_1,\ldots,x_n)$ is a polynomial (as opposed to a rational function) and, further, it is symmetric.

The following is a fundamental fact connecting Schur polynomials and cycles: for any $0 \le k \le n$,
$$s_\mu(\underbrace{1,\ldots,1}_{k},\underbrace{0,\ldots,0}_{n-k}) \;=\; \sum_{\sigma\in S_n}\frac{1}{n!}\cdot\chi_\mu(\sigma)\cdot k^{\mathrm{cycles}(\sigma)} \qquad (23)$$
(see Equation 7.78 in [Sta99]). On the other hand, there are known explicit formulas for evaluations of the Schur polynomial at specific inputs. In particular, Corollary 7.21.4 of [Sta99] states that
$$s_\mu(\underbrace{1,\ldots,1}_{k},\underbrace{0,\ldots,0}_{n-k}) \;=\; \prod_{u\in\mu}\frac{k + c(u)}{h(u)}. \qquad (24)$$
Combining (23) and (24), we get that for any $0 \le k \le n$ we have
$$\frac{1}{n!}\sum_{\sigma\in S_n}\chi_\mu(\sigma)\cdot k^{\mathrm{cycles}(\sigma)} \;=\; \prod_{u\in\mu}\frac{k + c(u)}{h(u)}.$$
However, note that both the left- and right-hand sides can be seen as polynomials of degree at most $n$ in the variable $k$. Since they agree at the $n+1$ values $k = 0,\ldots,n$, they must be identical as formal polynomials. This concludes the proof.
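Lemma 6.4 can be checked by brute force for small $n$ in the two cases where the character is easy to write down, namely the trivial character of $\mu = (n)$ and the sign character of $\mu = (1,\ldots,1)$. The following Python sketch (ours, not from the paper) does exactly this.

```python
from itertools import permutations
from math import factorial, prod

def cycles(perm):
    """Number of cycles of a permutation of {0,...,n-1} given in one-line notation."""
    seen, count = set(), 0
    for start in range(len(perm)):
        if start not in seen:
            count += 1
            i = start
            while i not in seen:
                seen.add(i)
                i = perm[i]
    return count

n, q = 5, 2.7
# Left-hand side of Lemma 6.4 for the trivial character (mu = (n), chi = 1)
# and for the sign character (mu = (1,...,1), chi(sigma) = (-1)^(n - cycles(sigma))).
lhs_triv = sum(q ** cycles(p) for p in permutations(range(n))) / factorial(n)
lhs_sign = sum((-1) ** (n - cycles(p)) * q ** cycles(p)
               for p in permutations(range(n))) / factorial(n)
# Right-hand side: for (n) the contents are 0,...,n-1; for (1^n) they are 0,-1,...,-(n-1);
# in both cases the hook lengths are n, n-1, ..., 1, so their product is n!.
rhs_triv = prod(q + i for i in range(n)) / factorial(n)
rhs_sign = prod(q - i for i in range(n)) / factorial(n)
print(abs(lhs_triv - rhs_triv) < 1e-9, abs(lhs_sign - rhs_sign) < 1e-9)
```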
Proof of Lemma 6.2. Recall that the distribution $M_\theta$ over $S_n$ is defined by $M_\theta(\pi) = e^{-\theta d(\pi,e)}/Z(\theta)$, where $Z(\theta) = \sum_{\pi\in S_n} e^{-\theta d(\pi,e)}$ is a normalizing constant. Since the Cayley distance $d(\sigma,\tau)$ is equal to $n - \mathrm{cycles}(\sigma^{-1}\tau)$, where $\mathrm{cycles}(\pi)$ is the number of cycles in $\pi$, we have that
$$M_\theta(\pi) \;=\; \frac{e^{\theta\cdot\mathrm{cycles}(\pi)}}{C}, \qquad \text{where } C = \sum_{\pi\in S_n} e^{\theta\cdot\mathrm{cycles}(\pi)}.$$
Since the $\mathrm{cycles}(\cdot)$ function is a class function, so is $M_\theta$, so we can apply Lemma A.9 and we get that $\widehat{M_\theta}(\rho_\mu) = c_{\mu,\theta}\cdot\mathrm{Id}$, where
$$c_{\mu,\theta} \;=\; \frac{\sum_{\sigma\in S_n} M_\theta(\sigma)\cdot\chi_\mu(\sigma)}{\dim(\rho_\mu)} \;=\; \frac{\sum_{\sigma\in S_n} e^{\theta\cdot\mathrm{cycles}(\sigma)}\cdot\chi_\mu(\sigma)}{\dim(\rho_\mu)\cdot\big(\sum_{\sigma\in S_n} e^{\theta\cdot\mathrm{cycles}(\sigma)}\big)} \;=\; \frac{\sum_{\sigma\in S_n} q^{\mathrm{cycles}(\sigma)}\cdot\chi_\mu(\sigma)}{\dim(\rho_\mu)\cdot\big(\sum_{\sigma\in S_n} q^{\mathrm{cycles}(\sigma)}\big)},$$
where $q := e^{\theta}$. We re-express the numerator by applying Lemma 6.4 to get
$$\sum_{\sigma\in S_n} q^{\mathrm{cycles}(\sigma)}\cdot\chi_\mu(\sigma) \;=\; n!\cdot\prod_{u\in\mu}\frac{q + c(u)}{h(u)}. \qquad (25)$$
To analyze the denominator of $c_{\mu,\theta}$, applying Lemma 6.4 to the trivial partition $\mathrm{Triv}_n = (n)$ of $n$ (the character of which is identically 1), we get that
$$\sum_{\sigma\in S_n} q^{\mathrm{cycles}(\sigma)} \;=\; n!\cdot\prod_{u\in\mathrm{Triv}_n}\frac{q + c(u)}{h(u)} \;=\; q(q+1)\cdots(q+n-1). \qquad (26)$$
For the rest of the denominator, we recall the following well-known fact about the dimension of irreducible representations of the symmetric group:

Fact 6.5 (Hook length formula, see e.g. Theorem 3.41 of [Mél17]). For $\mu \vdash n$, $\dim(\rho_\mu) = \dfrac{n!}{\prod_{u\in\mu} h(u)}$.

Combining (25), (26) and Fact 6.5, we get
$$c_{\mu,\theta} \;=\; \frac{\prod_{u\in\mu}(q + c(u))}{q(q+1)\cdots(q+n-1)}. \qquad (27)$$
Let $\mathcal{A}$ denote the set consisting of the cells of the Young diagram of $\mu$ which are not in the first row. Since $n - \mu_1 = \ell'$ for some $\ell' \le \ell$, the above expression simplifies to
$$c_{\mu,\theta} \;=\; \frac{\prod_{u\in\mathcal{A}}(q + c(u))}{(q+n-\ell')\cdots(q+n-1)}. \qquad (28)$$
To bound this ratio, first observe that both the numerator and denominator are $\ell'$-way products. There are two possibilities now:

1. Case 1: $q \ge \ell+1$. In this case we observe that each cell $u \in \mathcal{A}$ satisfies $c(u) \ge -\ell' \ge -\ell$. Thus $c_{\mu,\theta}$ can be expressed as a product of $\ell'$ many fractions, each of which is at least
$$\frac{q-\ell}{q+n-1} \;\ge\; \frac{1}{\ell+n}.$$
This implies that $c_{\mu,\theta} \ge \big(\tfrac{1}{n+\ell}\big)^{\ell'} \ge (2n)^{-\ell}$.

2. Case 2: $q \le \ell+1$. In this case, the denominator of Equation (28) is at most $(2n)^{\ell}$. To lower bound the numerator, observe that for every cell $u$ of $\mathcal{A}$, the value of $c(u)$ is an integer in $\{-\ell,\ldots,\ell\}$. Let $j_1$ and $j_2$ denote the two values of $j$ in $\{-\ell,\ldots,\ell\}$ for which $|q+j|$ achieves its smallest value and its next smallest value (note that these two values are equal if $\eta = 1/2$). At most $\sqrt{2\ell}$ many cells of $\mathcal{A}$ can have content equal to any given fixed integer value. Since $j_1$ and $j_2$ are the only possible values of $j \in \{-\ell,\ldots,\ell\}$ for which $|q+j| < 1$, and each such factor is at least $\eta$, it follows that
$$\prod_{u\in\mathcal{A}} |q+c(u)| \;\ge\; \prod_{u\in\mathcal{A}:\,c(u)=j_1} |q+c(u)|\;\cdot \prod_{u\in\mathcal{A}:\,c(u)=j_2} |q+c(u)| \;\ge\; \eta^{2\sqrt{2\ell}}.$$
This finishes the proof.
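Equation (27) also gives a direct way to evaluate the multiplier $c_{\mu,\theta}$ numerically. The following Python sketch (ours, not from the paper) does so, and illustrates how the multiplier shrinks as $e^{\theta}$ approaches an integer in $\{0,\ldots,\ell\}$, in line with the $\mathrm{dist}(\theta,\ell)$ dependence of Lemma 6.2.

```python
from math import exp, log, prod

def c_mallows(mu, theta):
    """Fourier multiplier c_{mu,theta} of the Cayley-Mallows kernel at rho_mu,
    via Equation (27): prod_{u in mu} (q + c(u)) / (q (q+1) ... (q+n-1)), q = e^theta."""
    n, q = sum(mu), exp(theta)
    num = 1.0
    for i, row_len in enumerate(mu, start=1):
        for j in range(1, row_len + 1):
            num *= q + (j - i)             # the content of box (i, j) is j - i
    return num / prod(q + i for i in range(n))

# mu has first row of length n - l with l = 3; the multiplier degrades as q -> 2.
mu = (9, 2, 1)
for q_target in (2.5, 2.05, 2.005):
    print(q_target, c_mallows(mu, log(q_target)))
```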
In this brief section we put all the pieces together to obtain our main positive results, Theorems 1.2, 1.3 and 1.4, for the symmetric, heat kernel, and generalized Mallows noise models respectively.
Symmetric noise.
Under the assumptions of Theorem 1.2 (that $\sum_{j=0}^{n-\log k} p_j \ge n^{-O(\log k)}$), taking $\ell = \log k$ in Lemma 4.1, we have that $\sigma_{\min,\,\mathrm{Up}(\lambda_{\mathrm{hook},\log k}),\,S_p} \ge n^{-O(\log k)}$. Since (as discussed in Section 3.1) $S_p$ is efficiently samplable given $p$, by Theorem 3.1, in time $\mathrm{poly}(n^{\log k}, 1/\delta, \log(1/\tau))$ and with probability $1-\tau$, it is possible to obtain $\pm\delta$-accurate estimates of all of the $(\log k)$-way marginals of $f$. Setting $\delta = \varepsilon/k^{O(\log k)}$ and applying Theorem 2.1, we get Theorem 1.2.
Heat kernel noise. First observe that we may assume that the temperature parameter $t$ is at least 1 (since otherwise it is easy to artificially add noise to achieve $t = 1$). Under the assumptions of Theorem 1.3 (that $t = O(n\log n)$), taking $\ell = \log k$ in Lemma 5.1, we have that $\sigma_{\min,\,\mathrm{Up}(\lambda_{\mathrm{hook},\log k}),\,H_t} \ge n^{-O(\log k)}$. Theorem 1.3 follows as in the previous paragraph (this time using the efficient samplability of $H_t$ given $t$).
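Since the argument above only needs sample access to $H_t$, we note that such samples are easy to generate: draw $T \sim \mathrm{Poi}(t)$ and apply $T$ independent draws from $\mathrm{trans}$. The following Python sketch (ours, not from the paper) is one way to do this; the Poisson sampler used is Knuth's method, which is adequate for moderate $t$.

```python
import math
import random

def sample_trans(n, rng=random):
    """One draw from trans: identity with probability 1/n, otherwise a
    uniformly random transposition (each transposition has probability 2/n^2)."""
    if rng.random() < 1.0 / n:
        return None                      # identity: nothing to apply
    return tuple(rng.sample(range(n), 2))

def sample_heat_kernel(n, t, rng=random):
    """One draw pi ~ H_t over S_n, in one-line notation (0-indexed):
    compose T ~ Poisson(t) independent draws from trans."""
    T, p, L = 0, 1.0, math.exp(-t)       # Knuth's Poisson sampler
    while True:
        p *= rng.random()
        if p <= L:
            break
        T += 1
    pi = list(range(n))
    for _ in range(T):
        swap = sample_trans(n, rng)
        if swap is not None:
            i, j = swap
            pi[i], pi[j] = pi[j], pi[i]
    return pi

print(sample_heat_kernel(10, 3.0))
```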
Cayley-Mallows noise. Under the assumptions of Theorem 1.4, taking $\ell = \log k$ in Lemma 6.1 we get that $\sigma_{\min,\,\mathrm{Up}(\lambda_{\mathrm{hook},\log k}),\,M_\theta} \ge n^{-O(\log k)}\cdot\mathrm{dist}(\theta,\log k)^{2\sqrt{2\log k}}$. Theorem 1.4 follows as in the previous paragraph (this time using the efficient samplability of $M_\theta$ given $\theta$).
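Sample access to $M_\theta$ is equally easy to provide. Since $M_\theta(\pi) = q^{\mathrm{cycles}(\pi)}/\big(q(q+1)\cdots(q+n-1)\big)$ with $q = e^{\theta}$ (by Equation (26)), one standard route is the Chinese-restaurant-process construction of an Ewens($q$) random permutation; the following Python sketch (ours, not from the paper; the paper does not commit to a particular sampler) takes this route.

```python
import math
import random

def sample_cayley_mallows(n, theta, rng=random):
    """One draw pi ~ M_theta over S_n, using M_theta(pi) = q^cycles(pi) / (q(q+1)...(q+n-1))
    with q = e^theta: a Chinese-restaurant-process construction of an Ewens(q) permutation.
    Returns pi in one-line notation (0-indexed)."""
    q = math.exp(theta)
    succ = {}                              # succ[x] = pi(x), built cycle by cycle
    for i in range(n):
        if rng.random() < q / (q + i):
            succ[i] = i                    # element i starts a new cycle
        else:
            x = rng.randrange(i)           # uniform among the i already-placed elements
            succ[i] = succ[x]              # splice i into x's cycle, right after x
            succ[x] = i
    return [succ[x] for x in range(n)]

print(sample_cayley_mallows(8, 0.7))
```

Each insertion step contributes a factor $q/(q+i)$ when a new cycle is opened and $1/(q+i)$ otherwise, so a permutation with $c$ cycles is produced with probability exactly $q^{c}/\big(q(q+1)\cdots(q+n-1)\big)$.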
Recall that because of the $\mathrm{poly}\big(\mathrm{dist}(\theta,\log k)^{-2\sqrt{2\log k}}\big)$ dependence in Theorem 1.4, the algorithm of that theorem is inefficient if $e^{\theta}$ is very close to an integer. In this section we prove Theorem 1.5, which establishes that any algorithm for learning in the presence of Cayley-Mallows noise must be inefficient if $e^{\theta}$ is very close to an integer.

The following lemma is at the heart of our lower bound. It shows that if $e^{\theta}$ is close to an integer, then any partition $\mu$ of $n \ge m$ which extends a particular partition $\lambda_{\mathrm{sq}}$ of $m$ must be such that the Fourier coefficient $\widehat{M_\theta}(\rho_\mu)$ of Cayley-Mallows noise has small singular values.

Lemma 8.1. Let $\lambda_{\mathrm{sq}}$ denote the partition $(t,\ldots,t)$ of $m = t(t+j)$ whose Young diagram is a rectangle with $t+j$ rows and $t$ columns. Let $\theta > 0$ be such that $|e^{\theta} - j| \le \eta$ where $\eta \le 1/2$. Let $n \ge m$, $\mu \vdash n$ and $\lambda_{\mathrm{sq}} \Uparrow \mu$ (recall Definition A.13). Then $\widehat{M_\theta}(\rho_\mu) = c_{\mu,\theta}\cdot\mathrm{Id}$, where $|c_{\mu,\theta}| \le \eta^{t}$. Here $\rho_\mu$ denotes the irreducible representation of $S_n$ corresponding to the partition $\mu$.

Proof. Let $\mu = (\mu_1,\ldots,\mu_r)$. By Lemma 6.2, we have that $\widehat{M_\theta}(\rho_\mu) = c_{\mu,\theta}\cdot\mathrm{Id}$, where Equation (28) gives the precise value of $c_{\mu,\theta}$ as
$$c_{\mu,\theta} \;=\; \frac{\prod_{u\in\mathcal{A}}(q + c(u))}{\prod_{u\in\mathcal{B}}(q + c(u))}, \qquad \text{where } q = e^{\theta}. \qquad (29)$$
Here $\mathcal{A}$ denotes the set of cells of the Young diagram of $\mu$ which are not in the first row and $\mathcal{B}$ denotes the rightmost $n - \mu_1$ many cells in the Young diagram of the trivial partition $\mathrm{Triv}_n = (n)$. Note that in this lemma we are trying to upper bound Equation (29), whereas Lemma 6.2 was about lower bounding this quantity.

To upper bound Equation (29), we first observe that there is an obvious bijection $\Phi\colon\mathcal{A}\to\mathcal{B}$ such that if $\Phi(u) = v$, then $c(v) > |c(u)| \ge 0$.
Next, let $\mathcal{A}_{-j} \subseteq \mathcal{A}$ be
$$\mathcal{A}_{-j} \;:=\; \{(r,s) \,:\, s - r = -j \text{ and } (r,s)\in\mathcal{A}\}.$$
Since $\lambda_{\mathrm{sq}} \Uparrow \mu$, it follows that $|\mathcal{A}_{-j}| \ge t$. As a result, we can upper bound $c_{\mu,\theta}$ as follows:
$$|c_{\mu,\theta}| \;=\; \prod_{u\in\mathcal{A}}\frac{|q+c(u)|}{q+c(\Phi(u))} \;=\; \prod_{u\in\mathcal{A}_{-j}}\frac{|q+c(u)|}{q+c(\Phi(u))}\;\cdot \prod_{u\in\mathcal{A}\setminus\mathcal{A}_{-j}}\frac{|q+c(u)|}{q+c(\Phi(u))} \;\le\; \prod_{u\in\mathcal{A}_{-j}} |q+c(u)| \;\le\; \eta^{t},$$
where the first inequality uses $c(\Phi(u)) > |c(u)| \ge 0$ and $q > 1$, and the last inequality uses $|\mathcal{A}_{-j}| \ge t$, $|q - j| \le \eta$, and $\eta \le 1/2$.

Theorem 1.5 is an immediate consequence of the following result. It shows that if $e^{\theta}$ is close to an integer $j$, then it may be statistically impossible to learn a distribution $f$ supported on $k$ rankings without using many samples from $M_\theta * f$:

Theorem 8.2.
Given $j \in \mathbb{N}$, there are infinitely many values of $k$, with $m = m(k) \approx \frac{\log k}{\log\log k}$, such that the following holds: there are two distributions $f_1, f_2$ over $S_m$ with the following properties:
1. $d_{\mathrm{TV}}(f_1, f_2) = 1$ (i.e. the distributions $f_1$ and $f_2$ have disjoint support);
2. $|\mathrm{supp}(f_1)|, |\mathrm{supp}(f_2)| \le k$;
3. For any $\theta > 0$ such that $|e^{\theta} - j| \le \eta \le 1/2$, we have that
$$d_{\mathrm{TV}}(M_\theta * f_1,\, M_\theta * f_2) \;\le\; 2\cdot\eta^{\Theta\big(\sqrt{\log k/\log\log k}\big)}.$$
Proof. Let $t \ge j$ be any integer, let $m = t(t+j)$, and let $k = m!$. We first construct the two distributions $f_1, f_2$ over $S_m$ and argue that properties (1) and (2) hold.

Let $\lambda_{\mathrm{sq}} \vdash m$ be the partition whose Young diagram is a rectangle with $t+j$ rows and $t$ columns. Let us consider the character $\chi_{\mathrm{sq}}\colon S_m \to \mathbb{Q}$ corresponding to the partition $\lambda_{\mathrm{sq}}$. By Fact A.16 we have that $\chi_{\mathrm{sq}}$ is rational-valued, and by Theorem A.8 we have that $\sum_{\sigma\in S_m}\chi_{\mathrm{sq}}(\sigma) = 0$. Thus, we have that
$$\sum_{\sigma\in S_m} |\chi_{\mathrm{sq}}(\sigma)|\cdot\mathbf{1}[\chi_{\mathrm{sq}}(\sigma) > 0] \;=\; \sum_{\sigma\in S_m} |\chi_{\mathrm{sq}}(\sigma)|\cdot\mathbf{1}[\chi_{\mathrm{sq}}(\sigma) < 0] \;=:\; C_{\mathrm{sq}} \qquad (30)$$
for some $C_{\mathrm{sq}}$ (which is nonzero, again by Theorem A.8). We now define distributions $f_1$ and $f_2$ over $S_m$ as
$$f_1(\sigma) = \begin{cases}\tfrac{1}{C_{\mathrm{sq}}}\cdot\chi_{\mathrm{sq}}(\sigma) & \text{if }\chi_{\mathrm{sq}}(\sigma) > 0,\\[2pt] 0 & \text{otherwise,}\end{cases} \qquad f_2(\sigma) = \begin{cases}-\tfrac{1}{C_{\mathrm{sq}}}\cdot\chi_{\mathrm{sq}}(\sigma) & \text{if }\chi_{\mathrm{sq}}(\sigma) < 0,\\[2pt] 0 & \text{otherwise.}\end{cases}$$
From their definitions and Equation (30) it is immediate that $f_1$ and $f_2$ are distributions over $S_m$ which have disjoint support. Since $|S_m| = k$, this gives items 1 and 2 of the theorem.

To prove the third item, observe (recalling the comment immediately after Definition A.7) that the function $g\colon S_m \to \mathbb{C}$, defined as $g(\sigma) := f_1(\sigma) - f_2(\sigma) = \tfrac{1}{C_{\mathrm{sq}}}\cdot\chi_{\mathrm{sq}}(\sigma)$, is a class function. Choose any partition $\lambda \vdash m$ and the corresponding irreducible representation $\rho_\lambda$ of $S_m$. By applying Lemma A.9, we have that $\widehat{g}(\rho_\lambda) = c_\lambda\cdot\mathrm{Id}$, where
$$c_\lambda \;=\; \frac{\sum_{\sigma\in S_m} g(\sigma)\cdot\chi_\lambda(\sigma)}{\dim(\rho_\lambda)}. \qquad (31)$$
We analyze the multiplier $c_\lambda$ by noting that
$$c_\lambda \;=\; \frac{\sum_{\sigma\in S_m} g(\sigma)\cdot\chi_\lambda(\sigma)}{\dim(\rho_\lambda)} \;=\; \frac{\sum_{\sigma\in S_m}\chi_{\mathrm{sq}}(\sigma)\cdot\chi_\lambda(\sigma)}{\dim(\rho_\lambda)\cdot C_{\mathrm{sq}}} \;=\; \frac{m!\cdot\mathbf{1}[\lambda = \lambda_{\mathrm{sq}}]}{\dim(\rho_\lambda)\cdot C_{\mathrm{sq}}} \quad\text{(using Theorem A.8)}. \qquad (32)$$
Thus, we have
$$\begin{aligned}
\|M_\theta * f_1 - M_\theta * f_2\|_1 &= \sum_{\sigma\in S_m}|M_\theta * f_1(\sigma) - M_\theta * f_2(\sigma)| \\
&= \sum_{\sigma\in S_m}|M_\theta * g(\sigma)| && \text{(linearity and } g = f_1 - f_2\text{)}\\
&= \frac{1}{m!}\sum_{\sigma\in S_m}\Big|\sum_{\mu\vdash m}\dim(\rho_\mu)\,\mathrm{Tr}\big[\widehat{M_\theta * g}(\rho_\mu)\,\rho_\mu(\sigma^{-1})\big]\Big| && \text{(Definition A.5, inverse Fourier transform of } M_\theta * g\text{)}\\
&= \frac{1}{m!}\sum_{\sigma\in S_m}\Big|\sum_{\mu\vdash m}\dim(\rho_\mu)\,\mathrm{Tr}\big[\widehat{M_\theta}(\rho_\mu)\,\widehat{g}(\rho_\mu)\,\rho_\mu(\sigma^{-1})\big]\Big| && \text{(convolution identity)}\\
&= \frac{1}{\dim(\rho_{\lambda_{\mathrm{sq}}})\cdot C_{\mathrm{sq}}}\sum_{\sigma\in S_m}\Big|\dim(\rho_{\lambda_{\mathrm{sq}}})\,\mathrm{Tr}\big[\widehat{M_\theta}(\rho_{\lambda_{\mathrm{sq}}})\,\rho_{\lambda_{\mathrm{sq}}}(\sigma^{-1})\big]\Big| && \text{(Equations (31) and (32))}\\
&= \frac{1}{C_{\mathrm{sq}}}\sum_{\sigma\in S_m}\Big|\mathrm{Tr}\big[\widehat{M_\theta}(\rho_{\lambda_{\mathrm{sq}}})\,\rho_{\lambda_{\mathrm{sq}}}(\sigma^{-1})\big]\Big|. && (33)
\end{aligned}$$
To deal with $\widehat{M_\theta}(\rho_{\lambda_{\mathrm{sq}}})$, we apply Lemma 8.1. In particular, by setting $n = m$ and $\mu = \lambda_{\mathrm{sq}}$ in Lemma 8.1, we get that $\widehat{M_\theta}(\rho_{\lambda_{\mathrm{sq}}}) = c_{\lambda_{\mathrm{sq}},\theta}\cdot\mathrm{Id}$, where $|c_{\lambda_{\mathrm{sq}},\theta}| \le \eta^{t}$, and we thus get that
$$\|M_\theta * f_1 - M_\theta * f_2\|_1 \;\le\; \frac{\eta^{t}}{C_{\mathrm{sq}}}\cdot\sum_{\sigma\in S_m}\big|\mathrm{Tr}[\rho_{\lambda_{\mathrm{sq}}}(\sigma^{-1})]\big| \;=\; \frac{\eta^{t}}{C_{\mathrm{sq}}}\cdot\sum_{\sigma\in S_m}\big|\chi_{\mathrm{sq}}(\sigma^{-1})\big|. \qquad (34)$$
Finally, recalling from Equation (30) that $\sum_{\sigma\in S_m}|\chi_{\mathrm{sq}}(\sigma)| = 2C_{\mathrm{sq}}$, we get that the right-hand side of Equation (34) is $2\eta^{t}$, and hence $d_{\mathrm{TV}}(M_\theta * f_1, M_\theta * f_2) \le 2\eta^{t}$. Recalling that $t \ge \sqrt{m/2}$ and $m \approx \log k/\log\log k$, the theorem is proved.
Basics of representation theory over the symmetric group
Representation theory of the symmetric group $S_n$ is at the technical core of this paper. In this appendix we briefly review the definitions and results that we require, starting first with general groups and then specializing to $S_n$ as necessary. See Curtis and Reiner [CR66] (or many other sources) for an extensive reference on representation theory of finite groups, and James [Jam06] or Méliot [Mél17] for an extensive reference on representation theory of $S_n$.

A.1 General groups
We start by recalling the definition of a representation:
Definition A.1.
For any group $G$, a representation $\rho\colon G \to \mathbb{C}^{m\times m}$ is a group homomorphism, i.e. a function from $G$ to $\mathbb{C}^{m\times m}$ that satisfies $\rho(g)\cdot\rho(h) = \rho(g\cdot h)$ for all $g,h \in G$. The dimension of such a representation $\rho$ is $m$.

In this paper, unless otherwise mentioned, all representations $\rho$ are unitary; in other words, for every $g \in G$, $\rho(g)$ is a unitary matrix. Over finite groups, any representation can be made unitary by applying a similarity transformation; by this we mean that if $\rho$ is a representation, then there is an invertible matrix $Z$ such that the new map $\tilde\rho$ defined as $\tilde\rho(g) = Z^{-1}\cdot\rho(g)\cdot Z$ is a unitary representation. (The reader should verify that as long as $Z$ is invertible, the map $\tilde\rho$ is always a representation if $\rho$ is a representation.) Two such representations $\rho$ and $\tilde\rho$ are said to be equivalent.

Next we recall the notion of an irreducible representation:

Definition A.2.
A representation $\rho\colon G \to \mathbb{C}^{m\times m}$ is said to be reducible if there exists a proper subspace $V$ of $\mathbb{C}^m$ such that $\rho(g)\cdot V \subseteq V$ for all $g \in G$. If there is no such proper subspace $V$, then $\rho$ is said to be irreducible.

It is well known that any finite group has only finitely many irreducible representations, up to the above notion of equivalence, and that every representation of a finite group $G$ can be written as a direct sum of irreducible representations:

Theorem A.3 (Maschke's theorem, see e.g. Theorem 1.3 of [Mél17]). For $G$ a finite group, there is a finite set of distinct irreducible representations $\{\rho_1,\ldots,\rho_r\}$ such that for any representation $\rho\colon G \to \mathbb{C}^{m\times m}$, there is an invertible transformation $Z \in \mathbb{C}^{m\times m}$ such that $Z^{-1}\rho Z$ is block diagonal, where each block is one of $\{\rho_1,\ldots,\rho_r\}$. In other words, $Z^{-1}\rho Z$ is equal to the direct sum $\bigoplus_{\ell=1}^{M}\mu_\ell$ where each $\mu_\ell$ is an element of $\{\rho_1,\ldots,\rho_r\}$.

We remind the reader that elements $g, h$ in a group $G$ are said to be conjugates if there is an element $t \in G$ such that $tgt^{-1} = h$. Define $\mathrm{Cl}(g)$, the conjugacy class of $g$, to be $\{h : h \text{ is conjugate to } g\}$; it is easy to see that the different conjugacy classes form a partition of $G$.

We recall some very standard facts about irreducible representations:

Theorem A.4 (see e.g. Theorem 2.3.1 of [GW10]). Let $G$ be a finite group and let $\{\rho_1,\ldots,\rho_r\}$ be the set of its irreducible representations, where $\rho_i\colon G \to \mathbb{C}^{d_i\times d_i}$. Then:
1. $\sum_{i=1}^{r} d_i^2 = |G|$.
2. The number of conjugacy classes is equal to $r$, the number of distinct irreducible representations.
3. For $1 \le s,t \le d_i$, let $\rho_{i,s,t}\colon G \to \mathbb{C}$ be the $(s,t)$ entry of $\rho_i(g)$. Then, for $1 \le i_1, i_2 \le r$, $1 \le s_1,t_1 \le d_{i_1}$ and $1 \le s_2,t_2 \le d_{i_2}$,
$$\mathop{\mathbf{E}}_{g\in G}\big[\rho_{i_1,s_1,t_1}(g)\cdot\overline{\rho_{i_2,s_2,t_2}(g)}\big] \;=\; \begin{cases}\tfrac{1}{d_{i_1}} & \text{if } i_1 = i_2,\ s_1 = s_2 \text{ and } t_1 = t_2,\\[2pt] 0 & \text{otherwise.}\end{cases}$$
4. The representations $\rho_1,\ldots,\rho_r$ are unitary.

A restatement of (3) above is that the functions $\{\rho_{i,s,t}(\cdot)\}$ are orthogonal. Combining this with $\sum_{i=1}^{r} d_i^2 = |G|$ (given by (1)), we get that the functions $\{\rho_{i,s,t}\}_{1\le i\le r,\ 1\le s,t\le d_i}$ form an orthogonal basis for $\mathbb{C}^G$.

With an orthogonal basis for the set of complex-valued functions on $G$ in hand (in other words, a basis for the group algebra $\mathbb{C}[G]$), we are ready to define the Fourier transform of a function $f\colon G \to \mathbb{C}$:

Definition A.5.
Let $G$ be a finite group with irreducible representations given by $\{\rho_1,\ldots,\rho_r\}$ and let $f\colon G \to \mathbb{C}$. The
Fourier transform of $f$ is given by the matrices $\widehat{f}(\rho_1),\ldots,\widehat{f}(\rho_r)$, where
$$\widehat{f}(\rho_i) \;=\; \sum_{g\in G} f(g)\cdot\rho_i(g).$$
The inverse transform is given by
$$f(g) \;=\; \frac{1}{|G|}\sum_{i=1}^{r}\dim(\rho_i)\,\mathrm{Tr}\big[\widehat{f}(\rho_i)\,\rho_i(g^{-1})\big].$$
Parseval's identity states that for any $f$ as above, we have
$$\sum_{i=1}^{r}\dim(\rho_i)\,\|\widehat{f}(\rho_i)\|_F^2 \;=\; |G|\cdot\sum_{g\in G}|f(g)|^2. \qquad (35)$$
We next recall the definition of characters and class functions for a group $G$.
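As a concrete (abelian) illustration of Definition A.5, the following Python sketch (ours, not from this paper) instantiates the Fourier transform, the inversion formula, and Parseval's identity for the cyclic group $\mathbb{Z}_N$, all of whose irreducible representations are the one-dimensional characters $\rho_k(x) = e^{2\pi i kx/N}$.

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
f = rng.standard_normal(N)                 # a function f : Z_N -> R

# Irreducible representations of Z_N are 1-dimensional: rho[k, x] = exp(2*pi*i*k*x/N).
rho = np.exp(2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N)

# Fourier transform (Definition A.5): f_hat(rho_k) = sum_x f(x) rho_k(x).
f_hat = rho @ f

# Inversion: f(x) = (1/|G|) * sum_k dim(rho_k) * Tr[f_hat(rho_k) rho_k(-x)]; all dims are 1.
f_rec = (rho.conj().T @ f_hat).real / N
assert np.allclose(f_rec, f)

# Parseval (Equation (35)), again with all dimensions equal to 1.
assert np.isclose(np.sum(np.abs(f_hat) ** 2), N * np.sum(f ** 2))
print("inversion and Parseval hold for Z_%d" % N)
```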
Definition A.6. Given a finite group $G$, a function $f\colon G \to \mathbb{C}$ is said to be a class function of $G$ if $f(g)$ only depends on the conjugacy class of $g$, i.e. $f(g) = f(hgh^{-1})$ for every $h \in G$.

Definition A.7.
The character $\chi_\rho\colon G \to \mathbb{C}$ corresponding to a representation $\rho\colon G \to \mathbb{C}^{m\times m}$ is given by $\chi_\rho(g) := \mathrm{Tr}(\rho(g))$.

We observe that $\chi_\rho(\cdot)$ is a class function of $G$, and that if $\rho$ and $\tilde\rho$ are unitarily equivalent, then $\chi_\rho(\cdot) = \chi_{\tilde\rho}(\cdot)$. We recall some standard facts about characters and class functions:

Theorem A.8.
Let $G$ be a finite group and let $\{\rho_1,\ldots,\rho_r\}$ be its set of irreducible representations. Let $\chi_{\rho_1},\ldots,\chi_{\rho_r}$ be the corresponding characters. Then we have:
1. [Schur's lemma] $\mathop{\mathbf{E}}_{g\in G}\big[\chi_{\rho_i}(g)\cdot\overline{\chi_{\rho_j}(g)}\big] = \delta_{i,j}$.
2. The functions $\{\chi_{\rho_i}(\cdot)\}_{1\le i\le r}$ form an orthonormal basis for the space of all class functions of $G$.

An important fact for us is that for any class function $f$ and any irreducible representation $\rho$, the Fourier coefficient $\widehat{f}(\rho)$ is a diagonal matrix (in fact, a scalar multiple of the identity matrix):

Lemma A.9.
Let $f\colon G \to \mathbb{C}$ be a class function and let $\rho\colon G \to \mathbb{C}^{m\times m}$ be an irreducible representation of $G$. Then $\widehat{f}(\rho) = c\cdot\mathrm{Id}$, where $c = \frac{\sum_{g\in G} f(g)\chi_\rho(g)}{m}$ and $\mathrm{Id}$ is the identity matrix.

Proof. Choose any $h \in G$, and observe that
$$\rho(h)\cdot\widehat{f}(\rho) \;=\; \rho(h)\cdot\Big(\sum_{g\in G} f(g)\rho(g)\Big) \;=\; \rho(h)\cdot\Big(\sum_{g\in G} f(h^{-1}gh)\rho(h^{-1}gh)\Big) \;=\; \rho(h)\cdot\Big(\sum_{g\in G} f(g)\rho(h^{-1}gh)\Big) \;=\; \rho(h)\cdot\rho(h^{-1})\cdot\Big(\sum_{g\in G} f(g)\rho(g)\Big)\cdot\rho(h) \;=\; \widehat{f}(\rho)\cdot\rho(h).$$
As a consequence of Schur's lemma, we have that if a matrix $A$ is such that $A\cdot\rho(h) = \rho(h)\cdot A$ for all $h \in G$, then $A = c\cdot\mathrm{Id}$. Thus, we get that $\widehat{f}(\rho) = c\cdot\mathrm{Id}$. The lemma follows by taking the trace of both sides.

A.2 Representation theory of the symmetric group
Representation theory of the symmetric group has many applications to algebra, combinatorics and statistical physics, and has been intensively studied (as mentioned earlier, see e.g. [Jam06, Mél17] for detailed treatments). Below we only recall a few basics which we will need.

The first notion we require is that of a
Young diagram.
Consider a partition $\lambda = (\lambda_1,\ldots,\lambda_k)$ of $n$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k > 0$ and $\lambda_1 + \cdots + \lambda_k = n$. We indicate that $\lambda$ is such a partition by writing "$\lambda \vdash n$." The Young diagram corresponding to such a partition $\lambda$ is a two-dimensional left-justified array of empty cells in which the $i$th row has $\lambda_i$ cells. See the left portion of Figure 2 for an example of a Young diagram. A Young tableau corresponding to a partition $\lambda$ is obtained by filling in the $n$ cells of the Young diagram with the elements of $[n]$, using each element exactly once, where the ordering within rows of the Young diagram is irrelevant.

Figure 2: On the left is the Young diagram for the partition $\lambda = (4,2,2,1)$; on the right are two Young tableaus corresponding to $\lambda$ whose rows are given by the sets $\{1,7,2,8\}$, $\{5,3\}$, $\{4,6\}$, $\{9\}$.

For each partition $\lambda = (\lambda_1,\ldots,\lambda_k)$ of $n$, there is an associated representation, denoted $\tau_\lambda$, which we now define. Let $N_\lambda = \binom{n}{\lambda_1,\ldots,\lambda_k}$ be the number of Young tableaus corresponding to partition $\lambda$, and let $Y_{\lambda,1},\ldots,Y_{\lambda,N_\lambda}$ be an enumeration of these tableaus in some order.

Definition A.10. The permutation representation $\tau_\lambda$ corresponding to $\lambda$ is defined as follows: for each $g \in S_n$, $\tau_\lambda(g)$ is the $N_\lambda\times N_\lambda$ matrix (where we view rows and columns as indexed by Young tableaus corresponding to $\lambda$) which has $\tau_\lambda(g)(i,j) = 1$ iff $Y_{\lambda,i}$ maps to $Y_{\lambda,j}$ under the action of $g$.

It is easy to check that $\tau_\lambda\colon S_n \to \mathbb{C}^{N_\lambda\times N_\lambda}$ as defined above is indeed a representation. In fact, since each $\tau_\lambda(g)$ is a permutation matrix, $\tau_\lambda$ is also a unitary representation.
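Definition A.10 can be made concrete with a few lines of code. The following Python sketch (ours, not from the paper) enumerates the Young tableaus of a small shape (with row order forgotten), builds matrices $\tau_\lambda(g)$, and checks the homomorphism property on one example.

```python
from itertools import permutations
import numpy as np

def tabloids(lam):
    """All Young tableaus of shape lam with row order irrelevant, as ordered tuples of frozensets."""
    n, seen = sum(lam), set()
    for perm in permutations(range(1, n + 1)):
        rows, start = [], 0
        for length in lam:
            rows.append(frozenset(perm[start:start + length]))
            start += length
        seen.add(tuple(rows))
    return sorted(seen, key=lambda rows: [sorted(r) for r in rows])

def tau(lam, g):
    """Matrix tau_lam(g): entry (i, j) is 1 iff applying g to every entry of tableau Y_j
    gives tableau Y_i (g in one-line notation on {1,...,n}: g[x-1] is the image of x)."""
    Y = tabloids(lam)
    index = {y: i for i, y in enumerate(Y)}
    M = np.zeros((len(Y), len(Y)), dtype=int)
    for j, y in enumerate(Y):
        image = tuple(frozenset(g[x - 1] for x in row) for row in y)
        M[index[image], j] = 1
    return M

lam = (2, 1)
g, h = (2, 3, 1), (2, 1, 3)                        # two elements of S_3
gh = tuple(g[h[x - 1] - 1] for x in (1, 2, 3))     # composition: apply h first, then g
assert (tau(lam, g) @ tau(lam, h) == tau(lam, gh)).all()
print(len(tabloids(lam)), "tableaus; homomorphism property verified")
```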
It turns out that for $\lambda \neq (n)$, the permutation representation $\tau_\lambda$ is not an irreducible representation. However, it also turns out that all of the irreducible representations of $S_n$ can be obtained from the permutation representations. To explain this, we need to define a partial order over partitions of $n$:

Definition A.11. For two partitions $\lambda$ and $\mu$ of $n$, we say that $\lambda$ dominates $\mu$, written $\lambda \trianglerighteq \mu$, if $\sum_{j\le i}\lambda_j \ge \sum_{j\le i}\mu_j$ for all $i > 0$.
The partial order defined by $\trianglerighteq$ is said to be the dominance order over the partitions (equivalently, Young diagrams) of $n$.

Figure 3: The left part of the picture depicts the dominance order across the partitions of 4; it happens to be the case that the dominance order is a total order across the partitions of 4. This is not true in general; as depicted on the right, the two partitions $(4,1,1)$ and $(3,3)$ of 6 are incomparable under the dominance order.

The next result explains how the irreducible representations of $S_n$ can be obtained from the representations $\{\tau_\lambda\}_{\lambda\vdash n}$:

Theorem A.12 (James submodule theorem, see e.g. Theorem 3.34 of [Mél17]). The irreducible representations of $S_n$ are in one-to-one correspondence with the partitions $\lambda \vdash n$; we denote the irreducible representation corresponding to $\lambda$ by $\rho_\lambda$. In particular, when $\lambda = (n)$, then $\rho_\lambda$ is the trivial irreducible representation (which maps each $g \in G$ to 1). Moreover, each permutation representation $\tau_\lambda$ is a direct sum of irreducible representations corresponding to partitions which dominate $\lambda$, i.e.
$$\tau_\lambda \;=\; \bigoplus_{\mu\,\trianglerighteq\,\lambda}\ \bigoplus_{\ell=1}^{K_{\lambda,\mu}}\rho_\mu.$$
Here the $K_{\lambda,\mu}$'s are non-negative integers, known as the Kostka numbers, which are such that $K_{\lambda,\lambda} = 1$.
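The dominance check of Definition A.11 is a simple prefix-sum comparison; the following Python sketch (ours, not from the paper) implements it and reproduces the incomparable pair from Figure 3.

```python
from itertools import zip_longest

def dominates(lam, mu):
    """Return True iff the partition lam dominates mu (both partitions of the same n)."""
    s_lam = s_mu = 0
    for a, b in zip_longest(lam, mu, fillvalue=0):
        s_lam += a
        s_mu += b
        if s_lam < s_mu:
            return False
    return True

print(dominates((4, 2, 1, 1), (3, 3, 1, 1)))                        # True
print(dominates((4, 1, 1), (3, 3)), dominates((3, 3), (4, 1, 1)))   # False False: incomparable
```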
A.2.1 Restrictions of irreducible representations

Fix $\lambda \vdash n$ and consider the irreducible representation $\rho_\lambda$ of $S_n$. For any $m \le n$, $S_m$ can be viewed as the subgroup of $S_n$ in which the elements $\{m+1,\ldots,n\}$ are fixed. Hence $\rho_\lambda$ can also be viewed as a representation of $S_m$; this representation of $S_m$ is written $\rho^{m}_\lambda$ and is called the restriction of $\rho_\lambda$ to $S_m$. Note that $\rho^{m}_\lambda$ may not be an irreducible representation of $S_m$. By Theorem A.3, we have that $\rho^{m}_\lambda$ is equivalent to some direct sum
$$\bigoplus_{\mu\vdash m} M_{\lambda,\mu}\,\rho_\mu,$$
in which there are $M_{\lambda,\mu}$ many copies of $\rho_\mu$, for some non-negative integers $M_{\lambda,\mu}$. These integers are given by the so-called "branching rule" on Young's lattice, which we now describe.

Definition A.13.
Young's lattice is the partially ordered set of Young diagrams in which the partial order is given by inclusion in the following sense: given partitions $\mu$ and $\lambda$, we write "$\mu \uparrow \lambda$" if $\lambda$ can be obtained by adding one box to $\mu$ (in such a way that $\lambda$ is a valid partition, of course). If there are partitions $\mu_1,\ldots,\mu_r$ such that $\mu_1 \uparrow \mu_2 \uparrow \cdots \uparrow \mu_r$, we write "$\mu_1 \Uparrow \mu_r$."

It is convenient to draw Young's lattice in such a way that the $n$-th level contains all and only the Young diagrams with $n$ boxes. The diagram in Figure 4 depicts the first five levels of Young's lattice.

[Figure 4: The first five levels of Young's lattice.]

The next result, known as the "branching rule," states that for $\lambda \vdash n$, $\rho_\lambda$ splits into a direct sum of $\rho_\mu$ over all $\mu \uparrow \lambda$ when $\rho_\lambda$ is restricted to $S_{n-1}$:

Lemma A.14 (Branching rule). Let $\lambda$ be a partition of $n$ and let $\rho_\lambda$ be the corresponding irreducible representation of $S_n$. Then $\rho^{n-1}_\lambda$, the restriction of $\rho_\lambda$ to $S_{n-1}$, is equivalent to $\bigoplus_{\mu\vdash n-1,\ \mu\uparrow\lambda}\rho_\mu$.

By applying Lemma A.14 inductively we get a complete description of how $\rho_\lambda$ splits when it is restricted to any $S_m$, $m < n$:

Theorem A.15.
Let $\lambda \vdash n$ and let $\rho_\lambda$ be the corresponding irreducible representation of $S_n$. For $m < n$ we have that $\rho^{m}_\lambda$, the restriction of $\rho_\lambda$ to $S_m$, is equivalent to
$$\bigoplus_{\mu\vdash m}\mathrm{Paths}(\mu,\lambda)\,\rho_\mu,$$
where $\mathrm{Paths}(\mu,\lambda)$ denotes the number of paths in Young's lattice from $\mu$ to $\lambda$.
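The multiplicities $\mathrm{Paths}(\mu,\lambda)$ of Theorem A.15 can be computed directly from the covering relation $\mu \uparrow \lambda$ of Definition A.13. The following memoized Python sketch (ours, not from the paper) does this; as a sanity check, $\mathrm{Paths}(\emptyset,\lambda)$ counts the standard Young tableaux of shape $\lambda$, which equals $\dim(\rho_\lambda)$.

```python
from functools import lru_cache

def parents(lam):
    """All partitions mu with mu ↑ lam, i.e. lam with one removable corner box deleted."""
    lam, out = tuple(lam), []
    for i in range(len(lam)):
        if i == len(lam) - 1 or lam[i] - 1 >= lam[i + 1]:   # removal keeps rows non-increasing
            new = lam[:i] + (lam[i] - 1,) + lam[i + 1:]
            out.append(tuple(p for p in new if p > 0))
    return out

@lru_cache(maxsize=None)
def paths(mu, lam):
    """Number of paths in Young's lattice from mu up to lam (0 if mu is not contained in lam)."""
    if mu == lam:
        return 1
    if sum(lam) <= sum(mu):
        return 0
    return sum(paths(mu, p) for p in parents(lam))

# Paths from the empty partition count standard Young tableaux: shapes (2,1) and (2,2) each have 2.
print(paths((), (2, 1)), paths((), (2, 2)))
```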
Irreducible characters of the symmetric group. Finally, we recall the following fundamental fact (which is a consequence, e.g., of the Murnaghan-Nakayama rule) which we will use:
Fact A.16 (see e.g. Theorem 3.10 in [Mél17]). Let $\chi\colon S_m \to \mathbb{C}$ be a character of $S_m$. Then in fact $\chi$ is $\mathbb{Q}$-valued.

Acknowledgments
We thank Mike Saks for allowing us to include his proof of Claim 2.2 here. We also thank Vic Reiner and Yuval Roichman for answering several questions about representation theory. Anindya is grateful to Aravindan Vijayaraghavan for many useful discussions about ranking models.
References

[ABSV14] P. Awasthi, A. Blum, O. Sheffet, and A. Vijayaraghavan. Learning mixtures of ranking models. In Advances in Neural Information Processing Systems, pages 2609–2617, 2014.

[BM08] M. Braverman and E. Mossel. Noisy sorting without resampling. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 268–276, 2008.

[BOB07] L. Busse, P. Orbanz, and J. Buhmann. Cluster analysis of heterogeneous rank data. In Proceedings of the 24th ICML, pages 113–120, 2007.

[Cho94] K. P. Choi. On the medians of gamma distributions and an equation of Ramanujan. Proc. Amer. Math. Soc., 121:245–251, 1994.

[CR66] C. Curtis and I. Reiner. Representation Theory of Finite Groups and Associative Algebras, volume 356. American Mathematical Society, 1966.

[DH92] Persi Diaconis and Phil Hanlon. Eigen analysis for some examples of the Metropolis algorithm. Contemporary Mathematics, 138:99–117, 1992.

[Dia88a] P. Diaconis. Group representations in probability and statistics. Lecture Notes–Monograph Series, 11:i–192, 1988.

[Dia88b] Persi Diaconis. Chapter 6: Metrics on groups, and their statistical uses, volume 11 of Lecture Notes–Monograph Series, pages 102–130. Institute of Mathematical Statistics, 1988.

[DS81] Persi Diaconis and Mehrdad Shahshahani. Generating a random permutation with random transpositions. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 57:159–179, 1981.

[DS98] Persi Diaconis and Laurent Saloff-Coste. What do we know about the Metropolis algorithm? J. Comput. Syst. Sci., 57(1):20–36, 1998.

[DST16] A. De, M. Saks, and S. Tang. Noisy population recovery in polynomial time. In 57th Annual IEEE Symposium on Foundations of Computer Science, pages 675–684. IEEE, 2016.

[Ewe72] W. Ewens. The sampling theory of selectively neutral alleles. Theoretical Population Biology, 3:87–112, 1972.

[FV86] M. Fligner and J. Verducci. Distance based ranking models. Journal of the Royal Statistical Society, Series B (Methodological), pages 359–369, 1986.

[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[GP18] A. Gladkich and R. Peled. On the cycle structure of Mallows permutations. The Annals of Probability, 46(2):1114–1169, 2018.

[GW10] B. Green and A. Wigderson. Lecture notes for the 22nd McGill Invitational Workshop on Computational Complexity, 2010.

[Jam06] Gordon Douglas James. The Representation Theory of the Symmetric Groups, volume 682. Springer, 2006.

[JV18] Y. Jiao and J. Vert. The Kendall and Mallows kernels for permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1755–1769, 2018.

[KB10] R. Kondor and M. Barbosa. Ranking with kernels in Fourier space. In COLT 2010, pages 451–463, 2010.

[KL02] R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Machine Learning, Proceedings of the 19th International Conference (ICML 2002), 2002.

[KV10] R. Kumar and S. Vassilvitskii. Generalized distances between rankings. In WWW, pages 571–580, 2010.

[LB11] T. Lu and C. Boutilier. Learning Mallows models with pairwise preferences. In Proceedings of the 28th ICML, pages 145–152, 2011.

[LL02] G. Lebanon and J. Lafferty. Cranking: Combining rankings using conditional probability models on permutations. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 363–370, 2002.

[LM18] A. Liu and A. Moitra. Efficiently learning mixtures of Mallows models. In Proceedings of FOCS 2018, 2018.

[LZ15] Shachar Lovett and Jiapeng Zhang. Improved noisy population recovery, and reverse Bonami-Beckner inequality for sparse functions. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, pages 137–142, 2015.

[Mal57] C. Mallows. Non-null ranking models. I. Biometrika, 44(1/2):114–130, 1957.

[Mar14] J. Marden. Analyzing and Modeling Rank Data. Chapman and Hall/CRC, 2014.

[MC10] M. Meilă and H. Chen. Dirichlet process mixtures of generalized Mallows models. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 358–367, 2010.

[Mél17] P. Méliot. Representation Theory of Symmetric Groups. Chapman and Hall/CRC, 2017.

[MM03] T. Murphy and D. Martin. Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41(3-4):645–655, 2003.

[MM09] B. Mandhani and M. Meila. Tractable search for learning exponential models of rankings. In Artificial Intelligence and Statistics, pages 392–399, 2009.

[MPPB07] M. Meilă, K. Phadnis, A. Patterson, and J. Bilmes. Consensus ranking under the exponential model. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, pages 285–294, 2007.

[MS13] Ankur Moitra and Michael Saks. A polynomial time algorithm for lossy population recovery. In 54th Annual IEEE Symposium on Foundations of Computer Science, pages 110–116. IEEE, 2013.

[Muk16] S. Mukherjee. Estimation in exponential families on permutations. The Annals of Statistics, 44(2):853–875, 2016.

[Sak18] M. Saks. Personal communication, 2018.

[Sta99] Richard P. Stanley. Enumerative Combinatorics: Volume 2. Cambridge University Press, 1999.

[Ste77] G. W. Stewart. On the perturbation of pseudo-inverses, projections and linear least squares problems. SIAM Review, 19(4):634–662, 1977.

[WY12] Avi Wigderson and Amir Yehudayoff. Population recovery and partial identification. In 53rd Annual IEEE Symposium on Foundations of Computer Science, 2012.