Likelihood Maximization and Moment Matching in Low SNR Gaussian Mixture Models
ANYA E. KATSEVICH∗ AND AFONSO S. BANDEIRA†

∗ Department of Mathematics, Courant Institute of Mathematical Sciences, New York University, USA. AEK is supported by the DOE Computational Science Graduate Fellowship.
† Department of Mathematics, ETH Zurich, Switzerland. Part of this work was done while ASB was with the Department of Mathematics, Courant Institute of Mathematical Sciences and the Center for Data Science at NYU, and was supported partly by NSF grants DMS-1712730 and DMS-1719545 and by a grant from the Sloan Foundation.

Abstract.
We derive an asymptotic expansion for the log likelihood of Gaussian mixture models (GMMs) with equal covariance matrices in the low signal-to-noise regime. The expansion reveals an intimate connection between two types of algorithms for parameter estimation: the method of moments and likelihood optimizing algorithms such as Expectation-Maximization (EM). We show that likelihood optimization in the low SNR regime reduces to a sequence of least squares optimization problems that match the moments of the estimate to the ground truth moments one by one. This connection is a stepping stone toward the analysis of EM and maximum likelihood estimation in a wide range of models. A motivating application for the study of low SNR mixture models is cryo-electron microscopy data, which can be modeled as a GMM with algebraic constraints imposed on the mixture centers. We discuss the application of our expansion to algebraically constrained GMMs, among other example models of interest.

1. Introduction
Gaussian mixtures are a useful model to describe data in a wide variety of applications. Nevertheless, strong theoretical guarantees on the performance of classical algorithms for inference in Gaussian mixture models (GMMs) are lacking. This is primarily due to the complicated structure of the GMM log likelihood landscape. The most popular algorithm for inference is Expectation-Maximization (EM), an iterative algorithm which performs "soft assignment" of observations to mixture components. Although EM maximizes a surrogate function to the log likelihood at each step, it can nevertheless be viewed as gradient ascent on the log likelihood in the setting we study here. As such, analyzing it and other log likelihood optimizing algorithms is challenging.
Most existing guarantees are for the case in which the component distributions of the mixture are "well-separated". In [BWY17] the authors characterize the basin of attraction in which the EM algorithm is guaranteed to converge to the global maximum of the likelihood of a well-separated two component mixture. This is generalized in [XHM16], in which the authors provide a global analysis of the convergence of EM for two component mixtures. A further generalization is obtained in [YYS17], in which the basin of attraction for gradient EM in arbitrary mixtures with equal covariances is quantified, also under the assumption of some separation between component distributions.
For mixtures with three or more components, [JZB+16] shows that there are well-separated mixtures for which the log likelihood landscape has bad local maxima. Moreover, they show that in some cases the EM algorithm can converge to these bad critical points with high probability.
Other algorithms have been proposed for learning the parameters of poorly separated GMMs in polynomial time [BS10, KMV10] without relying on the log likelihood. The latter paper is based on the method of moments. While EM and its variants are the most widely used methods for inference in GMMs, the method of moments is another class of inference methods which bypasses the log likelihood entirely. This method was proposed by Karl Pearson in his 1894 paper [Pea94], which also introduces the Gaussian mixture inference problem for the first time. Pearson shows that the parameters of a mixture of two one-dimensional Gaussians can be deduced from the mixture's first six moments. In general, the approach is to form estimates from the data of enough moments of the distribution to uniquely specify it. The challenge is then to "invert" the moments to recover the ground truth parameters. As an example, for a K-component uniform mixture with centers μ_(1), ..., μ_(K) ∈ R^d, which we collectively denote μ, the moments are defined as

T_1(μ) = (1/K) μ_(1) + ... + (1/K) μ_(K) ∈ R^d,
T_2(μ) = (1/K) μ_(1) μ_(1)^T + ... + (1/K) μ_(K) μ_(K)^T ∈ R^{d×d},

with higher moments T_k(μ) given by higher order tensors. Given estimates of the ground truth moment tensors T_k(μ*), moment inversion amounts to finding μ such that T_k(μ) = T_k(μ*), k = 1, 2, .... In some models, the moment tensors take a particularly convenient form and can be inverted explicitly. When this is not possible, one alternative approach is to minimize the objective function

(1.1)    min_{μ ∈ R^{dK}}  Σ_k λ_k ||T_k(μ) − T_k(μ*)||²,

where the λ_k are regularizing weights.
In this paper, we study the log likelihood landscape of Gaussian mixture models in R^d with the following defining characteristics: (1) the covariance matrices of the mixture components are all the same, and (2) the center of each mixture component is small in norm relative to |Σ|^{1/2d}, where Σ is the covariance of each of the mixture components. We will think of Σ as being known (although this is not required for our main result), and the mixture centers as the "signal" we wish to estimate. Since this is made more difficult by larger variances, one can think of mixtures with this second feature as having low signal-to-noise ratio (SNR).
We show an intimate connection between log likelihood optimization and the method of moments in the low SNR regime. We do so by deriving an asymptotic series expansion of the GMM log likelihood with respect to a small parameter related to the SNR. This expansion illuminates the structure of the likelihood landscape. It shows that in the low SNR regime, log likelihood maximization reduces to a sequence of least squares minimization problems, in which successively higher moments are matched to those of the true distribution on the manifold on which all previous moments have been fixed to the ground truth values.
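To fix ideas, here is a minimal numerical sketch of these objects for a uniform mixture: it forms the moment tensors T_k(μ) and evaluates a weighted objective of the form (1.1). The helper names and the example weights are ours and not part of the original text.

```python
import numpy as np
from functools import reduce

def outer_power(v, k):
    """k-fold outer product v ⊗ ... ⊗ v of a vector v; a tensor of shape (d,) * k."""
    return reduce(np.multiply.outer, [v] * k)

def moment_tensor(mu, k):
    """T_k(mu) = (1/K) * sum_j mu_j^{⊗k} for a uniform mixture with centers mu of shape (K, d)."""
    return np.mean([outer_power(m, k) for m in mu], axis=0)

def weighted_moment_objective(mu, mu_star, weights):
    """Objective (1.1): sum_k lambda_k * ||T_k(mu) - T_k(mu_star)||^2 (Frobenius norms)."""
    return sum(
        lam * np.sum((moment_tensor(mu, k) - moment_tensor(mu_star, k)) ** 2)
        for k, lam in enumerate(weights, start=1)
    )

# toy example: K = 3 centers in R^2
rng = np.random.default_rng(0)
mu_star = rng.standard_normal((3, 2))
mu_init = rng.standard_normal((3, 2))
print(weighted_moment_objective(mu_init, mu_star, weights=[1.0, 0.5, 0.25]))
```

Taking the weights to decay rapidly (λ_1 ≫ λ_2 ≫ ...) makes the first discrepant moment dominate the objective, which is precisely the stagewise behavior formalized next.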
For the uniform mixture example, these minimization problems take the form

(1.2)    min_{μ ∈ V_{k−1}} ||T_k(μ) − T_k(μ*)||²,   k = 1, 2, ...,

where V_0 = R^{Kd} and V_k = {μ ∈ R^{Kd} | T_ℓ(μ) = T_ℓ(μ*), ℓ = 1, ..., k}. This is very similar to the strategy of moment inversion described above. Indeed, taking weights λ_1 ≫ λ_2 ≫ ... in (1.1) effectively reduces that minimization problem to the sequence of individual moment matching problems (1.2).
This connection allows one to relate the roughness of the log likelihood landscape to the roughness of the landscape of least squares moment matching objectives. In Section 4, we will classify the critical points of this moment matching landscape in two illustrative examples: a uniform mixture of two Gaussians in arbitrary dimension and an arbitrary (finite) mixture of Gaussians in one dimension. In general, however, understanding the roughness of this landscape can be a highly non-trivial task and is outside the scope of this paper.
The motivation for Taylor expanding the log likelihood comes from [BRW17]. In that paper, Taylor expansions for upper and lower bounds on the log likelihood are derived. However, in order to analyze algorithms which depend on the landscape of the log likelihood (i.e. on the function's derivatives), a Taylor expansion of the log likelihood itself is needed. For a certain class of models a recent paper [FSWW20], fruit of parallel research efforts, also establishes such an expansion, as we will discuss in more detail below.
A natural class of models to study in the low SNR regime are algebraically structured mixture models. A prime example is the orbit retrieval model, also known as multi-reference alignment (MRA). In this class of models, a known algebraic constraint relates the centers of the mixture components to one another. Specifically, the centers are all determined from any one center by applying to it the elements of a subgroup of rotations on R^d. In particular, the centers therefore all have the same norm. This class of models is motivated by problems arising in molecule imaging using cryo-electron microscopy (cryo-EM). The goal is to infer the density of a molecule from noisy observations of it in different unknown orientations. At a first approximation the data can be described by a GMM in which the centers are constrained to be observations of the same (unknown) molecule from different viewing directions. We describe this model in more detail in Section 4.3.
In a recent paper [FSWW20], the authors derive an asymptotic expansion for the log likelihood of the orbit retrieval model. Remarkably, the authors then leverage this expansion and the algebraic structure present in the orbit retrieval problem to analyze the critical points of the log likelihood landscape (via the critical points of the moment matching objectives (1.2)). The expansion we derive in the more general context of GMMs reduces to that of [FSWW20] when the model is of the orbit retrieval type. While an analysis of the complexity of the moment-matching landscape in the general case is beyond the scope of this paper, the results in [FSWW20] on the orbit retrieval model illustrate how such an analysis can be used to draw conclusions about the log likelihood landscape and maximum likelihood estimation.
(The authors learned of [FSWW20] at an earlier stage of preparing the current manuscript, and have since leveraged insights of [FSWW20] to help motivate and simplify some of our arguments. The derivation of the expansion in the case of general mixture models appears to require a different set of techniques, and our arguments are quite different overall.)
We note that our likelihood expansion applies to other important algebraically structured models as well, such as heterogeneous MRA, in which the centers constitute the orbits of several points in R^d under a group action. Cryo-EM data can be modeled this way, since one often observes a molecule in several different conformations. The distinct orbits are then the rotations of these distinct conformations.
The method of moments is a natural approach for inference in algebraically structured models in the low SNR regime, and a theoretical understanding of the method has been developed in this setting. With the help of our asymptotic expansion, we expect that some of this understanding can be transferred to draw conclusions about likelihood optimizing methods such as EM. We discuss this, as well as potential implications of the expansion beyond algebraically structured models, in Section 4.3.
We have alluded to the fact that in the model setting we study, EM is the same as gradient descent on the negative log likelihood. We make this precise in Section 3. Specifically, we show that for finite mixtures and orbit retrieval models, both standard EM and a variant known as gradient EM are given by gradient ascent on the log likelihood with respect to the centers of the mixture. This implies that an understanding of the likelihood landscape directly translates into an understanding of the fixed points of EM and their basins of attraction. However, we will also show that the standard EM algorithm is suboptimal in the low SNR regime, in that it corresponds to gradient descent with too small a step size. This was shown in [FSWW20] for the orbit retrieval model. Thus gradient EM is a better option, since the step size is user-specified.
We note that in order to use our expansion to draw conclusions about EM and maximum likelihood estimation, a finite sample analysis of the likelihood landscape is required. Here, we focus only on the population log likelihood. In [FSWW20], concentration of the sample log likelihood and its first two derivatives around their population analogues is established for the orbit retrieval model. We also note that while our asymptotic expansion does not require the ground truth mixture weights to be known, we assume this is the case in our discussion of the consequences of the expansion.

Acknowledgements.
We would like to thank Jonathan Niles-Weed, Matthias Loeffler, and Justin Finkel for insightful discussions. We also thank Zhou Fan for pointing us to his paper.
Paper Organization.
The paper is organized as follows. In Section 2, we introduce the general class of GMMs we will consider and some guiding example models. We then state our main result, the asymptotic expansion of the GMM log likelihood in the low SNR regime. In Section 3, we show that for this class of GMMs, EM is the same as gradient ascent on the log likelihood with respect to the centers. We also apply the asymptotic expansion to draw conclusions about the EM algorithm and its variants in the low SNR regime. In Section 4, we apply the expansion to several example models to draw conclusions about critical points of the corresponding log likelihood landscapes. We also discuss the implications of the expansion for models with algebraic structure motivated by the cryo-EM problem. In Section 5 we present the proof of the expansion, deferring technical parts to the appendix.
Notation.
For x ∈ R^d, we let g(x) denote the probability density function of the standard normal N(0, I):

g(x) = (2π)^{−d/2} e^{−||x||²/2}.

For a set K ⊂ R^d and a point x ∈ R^d, we define K − x = {y − x | y ∈ K}. If K is compact we define ||K||_∞ := sup_{x ∈ K} ||x||, where ||x|| denotes the Euclidean norm of x.
For a probability measure ρ on R^d, we write supp(ρ) to denote its support. We write θ ∼ ρ to denote that θ ∈ R^d is a random variable with distribution ρ (bold-font letters will always denote random variables). For θ ∼ ρ with ρ compactly supported, we define ||θ||_∞ = ||ρ||_∞ = ||supp(ρ)||_∞. For the moment tensors of θ ∼ ρ, we write

T_k(ρ) = T_k(θ) = E_{θ∼ρ}[θ^{⊗k}] = ∫_{R^d} x^{⊗k} ρ(dx) ∈ (R^d)^{⊗k},   k = 1, 2, 3, ....

We use T_{≤k} as shorthand for T_1, ..., T_k. For two tensors T, S ∈ (R^d)^{⊗k} with real entries, we let ⟨T, S⟩ denote the entrywise inner product of their vectorizations in R^{d^k}, and ||T|| = ⟨T, T⟩^{1/2}.

2. Model Description and Main Theorem
Let Y ∈ R^d be distributed according to a Gaussian mixture in which the component distributions have the same, nondegenerate covariance Σ. The assumption of equal covariances allows us to write Y as a Gaussian perturbation Σ^{1/2}Z of a random variable θ ∈ R^d encoding the centers of the mixture components and the mixture weights. For example, if Y is a uniform mixture of K Gaussian distributions N(μ_j, Σ), j = 1, ..., K, then θ is a discrete random variable taking the value μ_j with probability 1/K, j = 1, ..., K. In general, we have:

(2.1)    Y = Σ^{1/2} Z + θ,   θ ∼ ρ,   Z ∼ N(0, I),   Z ⊥⊥ θ.

If ρ is a sum of point masses, then Y is a discrete mixture of component distributions. If ρ has a density, then Y is a continuous mixture.
We will consider maximum likelihood estimation of ρ = ρ* given independent identically distributed observations y_i ∼ Y, i = 1, ..., N, in the case N → ∞. The asymptotic expansion of the log likelihood presented in the next section is valid for the family of compactly supported measures ρ, and we therefore present it in this most general setting. Importantly, this general setting also includes the parametric framework in which it is known that ρ* belongs to a set parameterized by a finite number of variables.
Note that if Σ is known, then we can transform (2.1) into a mixture of spherical distributions by multiplying Y by Σ^{−1/2}. Thus, the case in which the component distribution covariances are known, equal, and nondegenerate is equivalent to the model

(2.2)    Y = σZ + θ,   θ ∼ ρ,   Z ∼ N(0, I),   Z ⊥⊥ θ.

We therefore assume the covariance is σ²I from now on. (We do not set σ = 1 because it will be convenient to perform Taylor expansions in 1/σ.)
Now, the distribution ρ induces a density q_ρ(y) on Y. To compute q_ρ, note that

(2.3)    P(Y ∈ A) = E_{θ∼ρ}[P(σZ + θ ∈ A | θ)] = ∫_A E_{θ∼ρ}[σ^{−d} g(σ^{−1}(y − θ))] dy.

This gives

(2.4)    q_ρ(y) = σ^{−d} E_{θ∼ρ}[g(σ^{−1}(y − θ))] = (2πσ²)^{−d/2} E_{θ∼ρ}[exp(−||y − θ||²/(2σ²))].

The population log likelihood L(ρ; ρ*) is then given by

(2.5)    L(ρ; ρ*) = E_{Y∼q_{ρ*}} log q_ρ(Y) = E_{Y∼q_{ρ*}} log E_{θ∼ρ}[exp(−||Y − θ||²/(2σ²))],

where we have discarded the normalization constant. Writing Y = σZ + θ*, θ* ∼ ρ*, we can also express the log likelihood in the following form:

(2.6)    L(ρ; ρ*) = E_{Z, θ*∼ρ*} log E_{θ∼ρ}[exp(−||σZ + θ* − θ||²/(2σ²))].

Abusing notation, we will sometimes write L(θ; θ*) for L(ρ; ρ*). Note that ρ = ρ* is the unique global maximizer (up to measure zero) of L in the space of probability distributions on R^d. This is a consequence of the fact that L(ρ; ρ*) = −D_KL(q_{ρ*} || q_ρ) + const., where D_KL is the Kullback-Leibler divergence between q_{ρ*} and q_ρ and the constant term depends on ρ* only.
The GMM formulation (2.2) lends itself to the signal processing viewpoint of the statistical estimation problem. Namely, one can consider the observations y_i as draws from the "signal" distribution ρ corrupted by the additive noise σZ. This reasoning, as well as the likelihood expansion in the following section, motivates the definition of the signal-to-noise ratio (SNR) given in Definition 1 below.
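Before turning to the SNR, the following small Monte Carlo sketch estimates the population log likelihood (2.6) for a uniform finite mixture, discarding the normalization constant exactly as in (2.5). The function and variable names are our own.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

def population_loglik(centers, true_centers, sigma, n=100_000):
    """Monte Carlo estimate of L(theta; theta*) in (2.6) for a uniform finite mixture,
    with the additive normalization constant of (2.5) dropped as in the text."""
    K, d = true_centers.shape
    # draw Y = sigma * Z + theta*, theta* uniform over the true centers
    Y = true_centers[rng.integers(K, size=n)] + sigma * rng.standard_normal((n, d))
    # log E_theta exp(-||Y - theta||^2 / (2 sigma^2)), theta uniform over `centers`
    sq = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # shape (n, K)
    return np.mean(logsumexp(-sq / (2 * sigma**2), axis=1) - np.log(centers.shape[0]))
```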
Definition 1. Let ρ be a compactly supported measure on R^d. We define the SNR as

SNR(ρ, σ) = ||supp(ρ) − T_1(ρ)||_∞ / σ.

We note that this definition of SNR is not sensitive to how ||θ|| varies for θ ∈ supp(ρ*). For example, consider a discrete distribution ρ* concentrated on ±θ*_1, ±θ*_2, where ||θ*_1|| ≪ ||θ*_2||. Then SNR(ρ*, σ) = ||θ*_2||/σ. One could argue that the SNR should depend not just on ||θ*_2||/σ but also on how small ||θ*_1|| is relative to ||θ*_2||.
However, we will see that for our purposes this is a natural definition of SNR. Indeed, it is the scale parameter which emerges in the asymptotic expansion. The smaller this value, the more clear-cut the separation between successive moment-matching stages, as will be explained in the Discussion following the main theorem.
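For a finite mixture this quantity is immediate to compute; the sketch below (with names of our choosing) evaluates SNR(ρ, σ) of Definition 1 for a discrete measure and, in the example, reproduces the ±θ*_1, ±θ*_2 discussion above.

```python
import numpy as np

def snr(atoms, weights, sigma):
    """SNR of Definition 1 for a discrete measure: ||supp(rho) - T_1(rho)||_inf / sigma.
    `atoms` has shape (K, d) and `weights` sums to 1."""
    T1 = weights @ atoms                                    # first moment of rho
    return np.max(np.linalg.norm(atoms - T1, axis=1)) / sigma

atoms = np.array([[1.0, 0.0], [-1.0, 0.0], [0.1, 0.0], [-0.1, 0.0]])
weights = np.full(4, 0.25)
print(snr(atoms, weights, sigma=5.0))   # dominated by the larger pair of atoms: 1 / sigma
```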
Guiding Examples. It is helpful to keep in mind the following two classes of GMMs as examples of models to which the log likelihood expansion can be applied. Both classes (i.e. families of measures ρ) can be parameterized by a finite number of variables, and we write the SNR and moments as functions of these parameters.
Discrete Finite Mixture Model.
This class of models can be described by Y = σZ + θ ∈ R^d, where θ ∼ ρ, a finite sum of point masses. In other words, ρ is of the form

(2.7)    ρ(dx) = Σ_{j=1}^K α_j δ(x − θ_j),   θ_j ∈ R^d,   α_j > 0,   j = 1, ..., K,   Σ_j α_j = 1.

We have

(2.8)    T_k(θ, α) = Σ_{j=1}^K α_j θ_j^{⊗k},   k = 1, 2, ...,
         SNR(θ, α, σ) = max_{j=1,...,K} ||θ_j − T_1(θ, α)|| / σ,

where θ, α are shorthand for (θ_j, α_j)_{j=1}^K.
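For this model the lowest moments can also be estimated directly from data when σ is known, since (2.2) and the independence of Z and θ give E[Y] = T_1(θ, α) and E[YYᵀ] = T_2(θ, α) + σ²I. A minimal sketch, with all names ours:

```python
import numpy as np

rng = np.random.default_rng(1)

# finite mixture (2.7): K = 3 centers in R^2 with non-uniform weights
theta = np.array([[2.0, 0.0], [-1.0, 1.0], [0.0, -2.0]])
alpha = np.array([0.5, 0.3, 0.2])
sigma = 4.0

# draw samples y_i = sigma * z_i + theta_{j_i}, as in model (2.2)
n = 500_000
labels = rng.choice(len(alpha), size=n, p=alpha)
Y = theta[labels] + sigma * rng.standard_normal((n, 2))

# method-of-moments estimates: E[Y] = T_1, E[Y Y^T] = T_2 + sigma^2 I
T1_hat = Y.mean(axis=0)
T2_hat = (Y.T @ Y) / n - sigma**2 * np.eye(2)

T1 = alpha @ theta                                     # exact T_1(theta, alpha)
T2 = np.einsum('j,ji,jk->ik', alpha, theta, theta)     # exact T_2(theta, alpha)
print(np.abs(T1_hat - T1).max(), np.abs(T2_hat - T2).max())
```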
Orbit Retrieval. Let G ⊂ O(d) ⊂ R^{d×d} be a possibly infinite subgroup of the group of orthogonal rotations in R^d. Let γ be a measure on G, and g ∈ G denote the random variable with distribution γ. In the orbit retrieval model, we have

(2.9)    Y = σZ + θ,   where θ = g θ,   g ∼ γ,   θ ∈ R^d.

Here, gθ denotes the action of g on θ, in this case multiplication by a matrix. Note that θ ∈ R^d is deterministic.
In general, both the point θ whose orbit under G constitutes the centers of the GMM, and the distribution γ, can be unknown. We have

(2.10)    T_k(θ, γ) = E_{g∼γ}[(gθ)^{⊗k}],   k = 1, 2, ...,
          SNR(θ, γ, σ) = sup_{g∈G} ||gθ − T_1(θ, γ)|| / σ.

The term orbit retrieval is also sometimes used to denote the model in which γ is known and given by the Haar measure (the uniform distribution on G). Due to the invariance of the Haar measure under the action of G, we have T_1 = gT_1 for any g ∈ G, so that the SNR is given by SNR(θ, γ = Haar, σ) = ||θ − T_1||/σ.
An example of a discrete orbit retrieval model is multireference alignment (MRA). Here, G = {g_0, ..., g_{d−1}} is the group which acts on vectors in R^d by cyclically shifting their entries. In other words, we have

(g_j θ)_k = θ_{j+k mod d},   j, k = 0, ..., d − 1.

The measure γ is therefore a sum of point masses, and induces the following distribution on θ:

θ ∼ ρ,   ρ(dx) = Σ_{j=1}^K γ_j δ(x − g_j θ).

As an example of a continuous mixture, consider rotations in R², distributed uniformly over angles of rotation ω ∈ [0, 2π). Then the random variable θ ∈ R² is distributed as

θ = ( cos ω  −sin ω ; sin ω  cos ω ) θ,   ω ∼ Unif[0, 2π),   θ ∈ R².
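To make the MRA example concrete, the following sketch (names ours) builds the orbit of a vector under cyclic shifts and the first two moments under the uniform measure on the group; by the Haar invariance noted above, T_1 is a constant vector and T_2 is a circulant matrix.

```python
import numpy as np

def mra_orbit(theta):
    """All cyclic shifts of theta: row j is g_j theta, with (g_j theta)_k = theta_{(j+k) mod d}."""
    d = len(theta)
    return np.stack([np.roll(theta, -j) for j in range(d)])

theta = np.array([1.0, 0.5, -0.3, 0.2])
orbit = mra_orbit(theta)                                   # shape (d, d)

# moments under the uniform (Haar) measure on the cyclic group
T1 = orbit.mean(axis=0)                                    # constant vector: mean(theta) * ones
T2 = np.einsum('ji,jk->ik', orbit, orbit) / len(orbit)     # circulant matrix
```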
Main results.
In this section we state our main result, the asymptotic expansion of the log likelihood function. Recall that the log likelihood is given by

(2.11)    L(θ; θ*, σ) = E_{Z, θ*∼ρ*} log E_{θ∼ρ}[exp(−||σZ + θ* − θ||²/(2σ²))].

Theorem 2.1.
Let θ ∼ ρ and θ* ∼ ρ* be compactly supported random variables on R^d and define δ = δ(ρ, ρ*) = max{||θ − θ*|| : θ ∈ supp(ρ), θ* ∈ supp(ρ*)}. Let m be a positive integer. If T_k(θ) = T_k(θ*) for k = 1, ..., m − 1, then for any σ > 0 we have:

(2.12)    −L(θ; θ*, σ) = C_m(θ*) + (σ^{−2m} / (2 (m!))) ||T_m(θ) − T_m(θ*)||² + ε_m,

where C_m(θ*) is independent of θ and the error term ε_m = ε_m(θ, θ*) is bounded above by

(2.13)    |ε_m(θ, θ*)| ≤ (m + 1)! (Cδ/σ)^{2m+2} (1 ∨ δ/σ)^{2m+2},

where C is a d-dependent absolute constant.

From (2.11) it is clear that L(θ; θ*, σ) = L(θ − c; θ* − c, σ) for any constant c. Note that δ is also invariant to shifts of θ and θ* by the same amount. Thus, (2.12) remains true if we substitute θ − c, θ* − c on the right hand side. However, the size of ||T_m(θ) − T_m(θ*)|| is on the order δ(||θ||_∞ ∨ ||θ*||_∞)^{m−1}. It is therefore not invariant to shifts. It will be desirable for δ to be of the same scale as ||θ||_∞ ∨ ||θ*||_∞. In order to accomplish this, we will replace θ and θ* by θ − T_1* and θ* − T_1*, respectively. From now on we will let θ, θ* denote these shifted random variables (i.e. assume T_1* = 0).
We also have L(θ; θ*, σ) = L(λθ; λθ*, λσ) for any λ > 0.
We will therefore set ||θ*||_∞ = 1 in addition to assuming T_1* = 0. The ground truth SNR is then given by SNR(ρ*, σ) = 1/σ, and the low SNR regime is characterized by σ → ∞. Note that δ admits the upper bound δ ≤ ||θ||_∞ + 1 ≤ 2(||θ||_∞ ∨ 1).
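The centering and rescaling just described is a simple preprocessing step; a minimal sketch for discrete measures, with names of our choosing, is the following.

```python
import numpy as np

def normalize_pair(theta, theta_star, alphas_star, sigma):
    """Shift both sets of centers by T_1(rho*) and rescale so that ||theta*||_inf = 1.
    Returns the shifted, rescaled centers and the correspondingly rescaled sigma,
    using the scale invariance L(theta; theta*, sigma) = L(c*theta; c*theta*, c*sigma)."""
    T1_star = alphas_star @ theta_star                  # ground-truth first moment
    theta, theta_star = theta - T1_star, theta_star - T1_star
    scale = np.max(np.linalg.norm(theta_star, axis=1))  # ||theta*||_inf after centering
    return theta / scale, theta_star / scale, sigma / scale
```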
Discussion. Suppose the GMM lies in a parameterizable family

{Y = σZ + θ, θ ∼ ρ_θ | θ ∈ Θ},

with Θ a set in a finite dimensional space. This allows us to consider gradient based local search algorithms for likelihood optimization in Θ. Theorem 2.1 shows that in the low SNR regime σ → ∞, any such algorithm attempts to match the moments of ρ = ρ_θ to those of ρ* = ρ_{θ*} one by one, starting from the first moment. In other words, likelihood optimization reduces to the sequence of minimization problems

(2.14)    min_{θ ∈ V_{m−1}} ||T_m(θ) − T_m(θ*)||²,   m = 1, 2, 3, ...,

where V_0 = Θ and V_k ⊂ Θ, k = 1, 2, ..., are the varieties

(2.15)    V_k = {θ ∈ Θ | T_ℓ(θ) = T_ℓ(θ*), ℓ = 1, ..., k}.

This is a consequence of the fact that there is a scale separation between ||T_m − T_m*||²/σ^{2m} and ε_m. Indeed, provided ||θ||_∞ = O(1) relative to σ, the former is on the order σ^{−2m} and the latter is on the order σ^{−2m−2}.
Consider (2.12) when m = 1. Due to this scale separation, the algorithm willprioritize minimization of k T ( θ ) − T ( θ ∗ ) k over that of ǫ . If the minimizationis successful, θ will reach the variety V . On this variety, the objective functionto be minimized is now k T ( θ ) − T ( θ ∗ ) k / σ to highest order. The algorithmwill continue to step through these distinct minimization stages for m = 1 , , . . . ,provided it does not get stuck in a local minimum or saddle point θ of k T m ( θ ) − T m ( θ ∗ ) k (cid:12)(cid:12) V m − , i.e. a critical point for which T m ( θ ) = T m ( θ ∗ ).This suggests an intimate connection between likelihood optimizing algorithmssuch as EM and the method of moments in the low SNR regime. The connectionbetween these two classes of algorithms will be discussed further in Section 4.For nonparametric GMMs in which there is no knowledge of ρ ∗ beyond thecompact support assumption, the asymptotic expansion of the log likelihood reducesto a sequence of minimization problems in the space of measures, i.e.min ρ ∈V m − k T m ( ρ ) − T m ( ρ ∗ ) k , where V m − is defined analogously to the parametrizable case. We note that themoments are linear in ρ , so that the objective function is quadratic and the varietiesare given by linear constraints. The sequence of least squares moment matchingproblems is therefore a quadratic programming problem, albeit in an infinite di-mensional space. While an analysis of the non-parametric setting is outside thescope of this paper, it would be interesting to explore the connection between themethod of moments and maximum likelihood estimation in this context. For re-sults on maximum likelihood estimation and inference in non-parametric mixturemodels, see e.g. [SG20, FD18, Lai78].Theorem 2.1 is a direct consequence of the following key Lemma. To state it wewill need the following two definitions. Definition 2.
Let T k i = E θ ∼ ρ i (cid:2) θ ⊗ k i (cid:3) , i = 1 , . . . , n, i.e. T k i is the order k i moment tensor of some distribution ρ i . Consider T = n O i =1 T k i = T k ⊗ T k ⊗ · · · ⊗ T k n . We define the total moment order of T to be P ni =1 k i , i.e. the sum of all momentorders. We also say that the total moment order of each entry of T is P ni =1 k i ; inother words, the total moment order of products of entries of moment tensors isthe sum of all moment orders in the product.Let ρ and ρ ∗ be compactly supported measures on R d . In the definition andlemma below, we write T k , T ∗ k as shorthand for T k ( ρ ) , T k ( ρ ∗ ), respectively. Definition 3.
We define V k [ T m , T ∗ n ] as the set of all constant coefficient linearcombinations of outer products of moment tensors T j , j ≤ m, T ∗ ℓ , ℓ ≤ n , of totalmoment order k .We define R k [ T m , T ∗ n ] as the set of all constant coefficient linear combinations ofproducts of entries of moment tensors T j , j ≤ m, T ∗ ℓ , ℓ ≤ n , of total moment order k . Lemma 2.2.
Let ρ and ρ ∗ be compactly supported probability measures on R d . Forall m = 1 , , , . . . we have − L ( ρ ; ρ ∗ ) = C ( ρ ∗ ) + 12 k T − T ∗ k σ − + m X k =2 (cid:18) k !) k T k − T ∗ k k + h T k , Q k i + r k (cid:19) σ − k + ǫ m , (2.16) where C ( ρ ∗ ) is independent of ρ , and Q k = Q k ( T k − , T ∗ k − ) ∈ V k (cid:2) T k − , T ∗ k − (cid:3) ,r k = r k ( T k − , T ∗ k ) ∈ R k [ T k − , T ∗ k ] . Moreover, Q k is such that Q k ( T ∗ k − , T ∗ k − ) = 0 . The error term ǫ m = ǫ m ( ρ, ρ ∗ ) is the same as in (2.12) . The expansion (2.16) generalizes the log likelihood series expansion (4.10) of[FSWW20], which is specific to the orbit recovery model (2.9) in which the mea-sure γ on the group G is the Haar (uniform) measure. We note that our errorbound decays as (1 /σ ) m +2 when k θ k ∞ = O (1) as σ → ∞ ; this is a somewhattighter bound than that of [FSWW20], in which the error is shown to decay as(log σ/σ ) m +2 when σ → ∞ and k θ k /σ = o (1 / log σ ). (Note that k θ k ∞ = k θ k forthe orbit retrieval model).The expansion (4.10,[FSWW20]) is the same as (2.16) except that (4.10) has noterm of the form h T k , Q k i . The following proposition explains why this is so. Forthe proof, see Proposition B.2 in the appendix. Proposition 2.3.
Let θ, θ ∗ ∈ R d , and G ⊂ O ( d ) ⊂ R d × d be a group. Define therandom variable g ∈ G distributed according to g ∼ γ, where γ is the Haar measureon G . Let T k , T ∗ k be the moment tensors of the distributions g θ, g θ ∗ , g ∼ γ , i.e. (2.17) T k = E g ∼ γ h ( g θ ) ⊗ k i , T ∗ k = E g ∼ γ h ( g θ ∗ ) ⊗ k i . Then for every tensor Q ∈ V k [ T m , T ∗ n ] , we have h T k , Q i ∈ R k [ T m , T ∗ n ] . In particular, the inner product h T k , Q i depends only on moment tensors T j , j =1 . . . , m even if k > m . The proof relies crucially on the Haar property of γ , namely, that g d = h g ∀ h ∈ G. It follows from the proposition that for the orbit retrieval model, the σ − k coef-ficient (for k >
1) in the asymptotic expansion (2.16) of − L ( θ ; θ ∗ ) is given by12( k !) k T k − T ∗ k k + h T k , Q k i + r k = 12( k !) k T k − T ∗ k k + ˜ r k , where ˜ r k = h T k , Q k i + r k ∈ R k [ T k − , T ∗ k ] . IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 11 Expectation Maximization As Gradient Descent
In this section, we consider the EM algorithm for finite GMMs and the orbitretrieval model, assuming that the mixture weights are known. We show that inthese cases, both the standard and gradient EM algorithms reduce to gradientdescent on the negative log likelihood with respect to the centers. This equivalencehas been pointed out in the literature, in the context of particular models (see,for example, [WZ19, FSWW20]). In light of the structure of the log likelihoodlandscape given in Theorem 2.1, we show that the gradient descent step size ofstandard EM is unnecessarily small, leading to slow convergence.To present the EM algorithm, it will be helpful to slightly reformulate the model.3.1.
Model Reformulation.
We will represent mixture models by Y = σZ + θ χ , χ ∼ ρ, where χ is a latent membership variable defined on a set X which parameterizes thecomponent distributions of the mixture. We will use χ ∈ X (non bold) to denote asample of χ . Finite Mixture Model.
We have X = { , . . . , K } and χ ∼ ρ , where ρ ( dχ ) = P Kj =1 α j δ ( χ − j ), assumed known. We let θ = ( θ χ ) Kχ =1 denote the K centers in R d . Orbit Retrieval
Let G ⊂ O ( d ) ⊂ R d × d be a possibly infinite group with elements { g χ | χ ∈ X } , and χ ∼ ρ , arbitrary. We let θ ∈ R d denote the vector whichgenerates all the centers θ χ through the action of G , i.e. θ χ = g χ θ , χ ∈ X .Since both of these models are parameterized by θ , we denote the density of Y by q θ . It is given by q θ ( y ) = E χ ∼ ρ (cid:20) g (cid:18) y − θ χ σ (cid:19)(cid:21) = Z g (cid:18) y − θ χ σ (cid:19) ρ ( dχ ) . (3.1)In the next section we will need the conditional distribution χ | Y . It is given by q θ ( dχ | y ) = w θ ( y, χ ) ρ ( dχ ) , where we have defined w θ ( y, χ ) = g (cid:18) y − θ χ σ (cid:19) (cid:30) E χ ∼ ρ (cid:20) g (cid:18) y − θ χ σ (cid:19)(cid:21) . Finally, the log likelihood is given by L ( θ ; θ ∗ ) = E Y ∼ q θ ∗ log E χ ∼ ρ (cid:20) g (cid:18) Y − θ χ σ (cid:19)(cid:21) , (3.2)where we have discarded the normalization constant.3.2. Algorithm Description.
Assume θ* is the ground truth parameter. Define the function Q(θ′ | θ; θ*), which is a surrogate for the log likelihood. It is defined as follows:

(3.3)    Q(θ′ | θ; θ*) = E_{Y∼q_{θ*}} E_{χ∼q_θ(·|Y)} log g((Y − θ′_χ)/σ)
                       = −(1/(2σ²)) E_{Y∼q_{θ*}} ∫ ||Y − θ′_χ||² w_θ(Y, χ) ρ(dχ).

Note that if θ is an estimate of the ground truth parameter θ*, then the distribution q_θ(dχ | Y) = w_θ(Y, χ) ρ(dχ) is our best guess for the distribution of the latent membership variable χ given the observed data Y.
Given an initialization θ^(0), the standard and gradient EM updates are given by

(3.4)    θ^(t+1) = argmax_{θ′} Q(θ′ | θ^(t); θ*)   (standard EM),
         θ^(t+1) = θ^(t) + τ ∇_{θ′} Q(θ′ | θ^(t); θ*)|_{θ′ = θ^(t)}   (gradient EM),

where τ > 0 is a step size. For the finite mixture model, the standard EM update works out to

(3.5)    θ^(t+1)_χ = E_Y[w_{θ^(t)}(Y, χ) Y] / E_Y[w_{θ^(t)}(Y, χ)],   χ = 1, ..., K,

and for the orbit retrieval model

         θ^(t+1) = ∫ E_Y[w_{θ^(t)}(Y, χ) g_χ^{−1} Y] ρ(dχ).
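For the finite mixture model with known weights, both updates in (3.4)–(3.5) are straightforward to implement on a finite sample by replacing the population expectations E_Y with sample averages; the sketch below uses names of our choosing.

```python
import numpy as np
from scipy.special import softmax

def responsibilities(centers, alphas, Y, sigma):
    """r[i, chi] = alpha_chi * w_theta(y_i, chi): posterior membership probabilities."""
    sq = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)    # shape (n, K)
    return softmax(np.log(alphas)[None, :] - sq / (2 * sigma**2), axis=1)

def standard_em_update(centers, alphas, Y, sigma):
    """Sample version of (3.5): theta_chi <- E[w Y] / E[w], with expectations as averages."""
    r = responsibilities(centers, alphas, Y, sigma)
    return (r.T @ Y) / r.sum(axis=0)[:, None]

def gradient_em_update(centers, alphas, Y, sigma, tau):
    """Sample version of gradient EM in (3.4): a gradient ascent step on the log likelihood."""
    r = responsibilities(centers, alphas, Y, sigma)
    grad = (r[:, :, None] * (Y[:, None, :] - centers[None, :, :])).mean(axis=0) / sigma**2
    return centers + tau * grad
```

The gradient step coincides with gradient ascent on the sample log likelihood (cf. Proposition 3.1 below), with the step size τ left to the user; the discussion of EM in the low SNR regime below argues that the implicit step size of the standard update, of order σ², is much smaller than optimal in the directions orthogonal to the first moment.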
Proposition 3.1. We have

∇_{θ′} Q(θ′ | θ; θ*)|_{θ′ = θ} = ∇_θ L(θ; θ*)

for both the finite mixture and orbit retrieval models. Therefore, gradient EM with step size τ is the same as gradient ascent on L(θ; θ*) with step size τ.
For the finite mixture model, the standard EM update can be written as

(3.6)    θ^(t+1)_χ = θ^(t)_χ + τ_{tχ} ∇_{θ_χ} L(θ^(t); θ*),   τ_{tχ} = σ² / (α_χ E_Y[w_{θ^(t)}(Y, χ)]),

for χ = 1, ..., K. For the (possibly infinite) orbit retrieval model, the standard EM update can be written as

θ^(t+1) = θ^(t) + σ² ∇_θ L(θ^(t); θ*).

We remark that in standard EM for finite mixtures, the step size τ_{tχ} varies with time, and is also different for different centers θ_χ.

Proof.
For the finite mixture, we use the fact that ∇ θ χ log E χ ∼ ρ (cid:20) g (cid:18) Y − θ χ σ (cid:19)(cid:21) = − α χ σ ∇ θ χ k Y − θ χ k g (cid:18) Y − θ χ σ (cid:19) (cid:30) E χ ∼ ρ (cid:20) g (cid:18) Y − θ χ σ (cid:19)(cid:21) = − α χ σ ∇ θ χ k Y − θ χ k w θ ( Y, χ ) . (3.7)Thus, ∇ θ χ L ( θ ; θ ∗ ) = − α χ σ E Y (cid:2) ∇ θ χ k Y − θ χ k w θ ( Y, χ ) (cid:3) , ∇ θ ′ χ Q ( θ ′ | θ ; θ ∗ ) = − α χ σ E Y h ∇ θ ′ χ k Y − θ ′ χ k w θ ( Y, χ ) i . (3.8)For the orbit retrieval model, we use that k Y − g χ θ k = k g − χ Y − θ k , so that ∇ θ log E χ ∼ ρ (cid:20) g (cid:18) Y − θ χ σ (cid:19)(cid:21) = ∇ θ log E χ ∼ ρ " g g − χ Y − θσ ! = − σ E χ ∼ ρ (cid:2) w θ ( Y, χ ) ∇ θ k g − χ Y − θ k (cid:3) . (3.9) IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 13
Using this property to compute ∇ θ Q as well, we obtain ∇ θ L ( θ ; θ ∗ ) = − σ E Y E χ ∼ ρ (cid:2) w θ ( Y, χ ) ∇ θ k g − χ Y − θ k (cid:3) ∇ θ ′ Q ( θ ′ | θ ; θ ∗ ) = − σ E Y E χ ∼ ρ (cid:2) w θ ( Y, χ ) ∇ θ ′ k g − χ Y − θ ′ k (cid:3) . (3.10)We immediately see that in both cases the gradients of Q and L are equal if θ ′ = θ .To see why standard EM is also gradient ascent on L , note that Q is a quadraticfunction in the θ ′ χ for finite GMMs, and quadratic in θ ′ for orbit retrieval. Now,for a quadratic function f ( θ ′ ) = − c k θ ′ k + x T θ ′ + const . we can reach the globalmaximum in one step of gradient ascent from any point θ ′ by taking a step size c .In other words, θ ′ + c ∇ f ( θ ′ ) is the global maximizer of f . Taking θ ′ = θ ( t ) , wehavearg max θ ′ Q ( θ ′ | θ ( t ) ; θ ∗ ) = θ ( t ) + 1 c ∇ Q ( θ ( t ) | θ ( t ) ; θ ∗ ) = θ ( t ) + 1 c ∇ L ( θ ( t ) ; θ ∗ ) . It remains to compute c . For the finite mixture, considering Q as a function of θ χ we see that c = α χ σ E Y [ w θ ( t ) ( Y, χ )] . For orbit retrieval, we have c = 1 σ E Y Z ρ ( dχ ) w θ ( t ) ( Y, χ ) = 1 σ . (cid:3) EM in Low SNR Regime.
We will use the expansion (2.16) to informallydemonstrate that in the low SNR regime, the step size in the standard EM update(3.6) for the finite mixture model is much smaller than necessary, leading to slowconvergence. The same is true for the orbit retrieval model, as shown in [FSWW20].Let θ ∗ = ( θ ∗ , . . . , θ K ∗ ) be the centers of the ground truth model with θ j ∗ ∈ R d , j = 1 , . . . , K and θ = ( θ , . . . , θ K ) be the argument to the log likelihood. Wedefine k θ k ∞ = max j =1 ,...,K k θ j k . Recall that the ground truth mixture weights α j are considered known, and T k ( θ ) = P Kj =1 α j θ ⊗ kj , k = 1 , , . . . . As in Section 2, wewill assume T ( θ ∗ ) = 0, k θ ∗ k ∞ = 1 and σ ≫ T ( θ ) and inthe subspace orthogonal to it. First, we have Proposition 3.2.
Let θ G ( θ ) be the standard EM update, given by (3.5) . Fix aconstant R > . Then for all θ such that k θ k ∞ /σ ≤ R, we have k T ( G ( θ )) k ≤ C ( k θ k ∞ ∨ /σ, where C depends on d and R only. The proof is give in Proposition C.3 in the appendix. Proposition 3.2 shows thatif the EM iterates θ ( t ) remain in a radius O (1) ball, then starting with t = 1 theestimated first moment T (cid:0) θ ( t ) (cid:1) is order O ( σ − ) away from T ∗ = 0.While T nearly converges in one iteration of standard EM, the algorithm ismuch slower in the subspace orthogonal to T . To show this, we use the gradientdescent representation of EM, θ ( t +1) χ = θ ( t ) χ + τ tχ ∇ θ χ L ( θ ( t ) ; θ ∗ ) , χ = 1 , . . . , K. For θ such that k θ k ∞ = O (1) with respect to σ , we have(3.11) − L ( θ ; θ ∗ ) = const . + 12 k T ( θ ) k σ − + q ( θ, θ ∗ ) σ − + O ( σ − ) , where q is a homogeneous polynomial of order 4 with respect to the entries of θ, θ ∗ . This follows from the representation of the log likelihood given in (2.16).Now, consider the gradient of (3.11) in the subspace orthogonal to T . On thissubspace, the highest order term of L , given by k T ( θ ) k /σ , is constant (notoptimized), while q and its θ -derivatives are order O ( σ − ). It follows that theoptimal step size for gradient descent is O ( σ ). However, the actual step size is τ tχ = σ α χ E Y [ w θ ( t ) ( Y, χ )] − = O ( σ ), using that E Y [ w θ ( t ) ( Y, χ )] = 1 + O ( σ − ). Thisis shown in Lemma C.1 of the appendix.Recall that for the orbit recovery model, the standard EM update is a gradientdescent step on − L with step size σ exactly. Numerical experiments in [FSWW20]show that gradient descent on − L in the subspace orthogonal to T with step size O (cid:0) σ (cid:1) achieves much faster convergence than standard EM.4. Examples of Interest and Implications
Recall that Theorem 2.1 shows that in the low SNR regime, likelihood optimization for parameterizable GMMs reduces to the sequence of minimization problems

(4.1)    min_{θ ∈ V_{k−1}} ||T_k(θ) − T_k(θ*)||²,   k = 1, 2, ...,

where V_0 = Θ and

(4.2)    V_k = {θ ∈ Θ | T_ℓ(θ) = T_ℓ(θ*), ℓ = 1, ..., k},   k = 1, 2, ....

In the following two sections, we characterize the critical points of the minimization problems (4.1) for two GMMs: a uniform mixture of two Gaussians in R^d and an arbitrary finite mixture of Gaussians in R. We conclude the section with a discussion of the implications of the expansion for models with algebraic structure and GMMs with randomly chosen centers. We also discuss the necessary steps to make rigorous the connection between the moment matching and likelihood landscapes.

4.1. Uniform Mixture of Two Gaussians in R^d. Let Y = σZ + θ ∈ R^d, where θ ∼ ρ, which belongs to the family

{ρ(dv) = (1/2)δ(v − θ_1) + (1/2)δ(v − θ_2) | θ_1, θ_2 ∈ R^d}.

Motivated by [XHM16], we study the moment matching minimization problems in the following coordinates:

(4.3)    α = (1/2)θ_1 + (1/2)θ_2,   β = θ_1 − θ_2.

Define α*, β* analogously for the ground truth parameters. This is a natural reparameterization for the landscape, since

(4.4)    T_1(ρ) = α,   T_2(ρ) = αα^T + (1/4)ββ^T.

We see that the first moment T_1* determines α*, while β* is determined up to sign from T_2* given that α = α*. Swapping θ_1 and θ_2 does not change the mixture distribution (since it is uniform), so α and ±β uniquely specify the distribution. It therefore suffices to consider the first two moment-matching optimization problems.
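In these coordinates the two moment-matching stages of (4.1) are explicit; the short sketch below (names ours) evaluates them via (4.4).

```python
import numpy as np

def stage_objectives(alpha, beta, alpha_star, beta_star):
    """First- and second-moment matching objectives for the uniform two-component
    mixture, using T_1 = alpha and T_2 = alpha alpha^T + beta beta^T / 4 from (4.4)."""
    f1 = np.sum((alpha - alpha_star) ** 2)
    T2 = np.outer(alpha, alpha) + np.outer(beta, beta) / 4
    T2_star = np.outer(alpha_star, alpha_star) + np.outer(beta_star, beta_star) / 4
    f2 = np.sum((T2 - T2_star) ** 2)
    return f1, f2

# on the variety alpha = alpha*, the second stage reduces to ||b b^T - b* b*^T||^2 / 16,
# which vanishes at beta = +/- beta_star
a = np.zeros(2); b_star = np.array([1.0, 0.0])
print(stage_objectives(a, b_star, a, b_star))    # (0.0, 0.0)
print(stage_objectives(a, -b_star, a, b_star))   # (0.0, 0.0)
```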
The first unconstrained optimization problem, min α ∈ R d k α − α ∗ k has only a globalminimum at α = α ∗ . Now, on the manifold α = α ∗ , the second optimizationproblem reduces to min β ∈ R d k ββ T − β ∗ β T ∗ k . We see that the points β = ± β ∗ are global minima, while β = 0 is a saddle point.This aligns with the results of [XHM16] on the fixed points of EM. The authorsreformulate the EM updates in the α, β coordinates (4.3). Letting α ( t ) , β ( t ) , t =0 , , . . . be the EM iterates, they show that α ( t ) converges to α ∗ as t → ∞ , while β ( t ) converges to ± β ∗ if h β (0) , β ∗ i 6 = 0 and β ( t ) converges to 0 if h β (0) , β ∗ i = 0.4.2. Mixture of K Gaussians in R . Let Y = σZ + θ ∈ R , where θ ∈ R isdistributed according to ρ in the family R α -mix = { ρ θ ( dx ) = K X j =1 α j δ ( x − θ j ) | θ = ( θ , . . . , θ K ) ∈ R K } . Here, α j are positive weights summing to 1. They are assumed known, so that theunknown parameters are θ = ( θ , . . . , θ K ) ∈ R K . Interestingly, if the mixture isuniform ( α j = 1 /K ∀ j ), then in the low SNR regime this model is equivalent to thefollowing orbit retrieval model studied in [FSWW20]: Y = σZ + θ ∈ R K , where θ ∈ R K is distributed according to ν in the family R orbit = { ν θ ( dx ) = 1 K ! K ! X j =1 δ ( x − g j θ ) | θ ∈ R K } . Here, G = { g j , j = 1 , . . . , K ! } ⊂ O ( K ) ⊂ R K × K is a subgroup of the orthogonalgroup acting on vectors in R K by permuting their entries. In other words, the orbitof θ under G is the set of all permutations of the entries of θ .The two models are equivalent in the sense that there is a one-to-one mapping(4.5) { T k ( ρ ) | ρ ∈ R /K -mix } ←→ { T k ( ρ ) | ρ ∈ R orbit } . To show this, define the polynomials p ℓ ( θ ) = K P Kj =1 θ ℓj , so that T ℓ ( ρ θ ) = p ℓ ( θ ) for ρ θ ∈ R /K -mix . Now, let ν θ be the corresponding measure in R orbit . The entriesof the tensor T ℓ ( ν θ ) are ℓ -degree polynomials in R [ θ , . . . , θ K ] which are invariantunder permutation of the θ j . But the polynomials p j , j = 1 , . . . , ℓ generate thepermutation invariant polynomials of degree at most ℓ (see [FSWW20] and thereferences therein), showing that both sets in (4.5) are in one-to-one correspondencewith { ( p ( θ ) , . . . , p k ( θ )) | θ ∈ R K } .In particular, [FSWW20] shows that the moment-matching problem for the orbitretrieval model reduces to min x ∈V k ( p k +1 ( θ ) − p k +1 ( θ ∗ )) , where V k = { θ ∈ R K | p j ( θ ) = p j ( θ ∗ ) , j = 1 , . . . , k } . This is precisely the moment-matching problem for a uniform mixture on R .We now generalize results in [FSWW20] on critical points of the above momentmatching landscape to the case of non-uniform mixtures in R . Fix positive weights α = ( α , . . . , α K ) summing to 1, and define p ℓ ( θ ) = K X j =1 α j θ ℓj , so that p ℓ ( θ ) = T ℓ ( ρ θ ) for a distribution ρ θ ∈ R α -mix on R . For a fixed θ ∗ =( θ ∗ , . . . , θ ∗ K ) let V k be the variety V k = { θ ∈ R K | p j ( θ ) = p j ( θ ∗ ) , j = 1 , . . . , k } . The following result characterizes critical points of the moment matching objec-tive function that are not global minima.
Proposition 4.1.
The following holds for any generic θ ∗ : Define f n +1 : R K → R by f n +1 ( θ ) = 12 ( p n +1 ( θ ) − p ∗ n +1 ) , where p ∗ n +1 = p n +1 ( θ ∗ ) . Then(a) A point x = ( x , . . . , x K ) ∈ V n \ V n +1 is a critical point of f n +1 | V n if andonly if it is a critical point of p n +1 | V n , if and only if exactly n coordinates x j are distinct.(b) Let x ∈ V n \ V n +1 be a critical point of f n +1 | V n . Assume without lossof generality that x > x > · · · > x n are the distinct centers, and let ( m , . . . , m n ) be the multiplicity vector, i.e. m i is the number of times x i repeats. We have the following classification of x : • If the multiplicity vector has the form ( m , , m , , . . . ) then x is alocal minimum of f n +1 | V n if p n +1 ( x ) > p ∗ n +1 and a local maximum if p n +1 ( x ) < p ∗ n +1 . • If the multiplicity vector has the form (1 , m , , m , . . . ) , then x is alocal minimum of f n +1 | V n if p n +1 ( x ) < p ∗ n +1 and a local maximum if p n +1 ( x ) > p ∗ n +1 . • If the multiplicity vector is not of either form, then x is a saddle pointof f n +1 | V n and of p n +1 | V n (c) There are no local minima of f n +1 | V n on V n \ V n +1 if the weights α j areuniform. Example 4.2. No local minima of f on V Suppose x = ( x , . . . , x K ) is a critical point of f | V such that x / ∈ V . This implies x = · · · = x K = p ∗ , i.e. m = K >
1. But then p ( x ) = ( p ∗ ) < p ∗ (recall that p ∗ , p ∗ are the first and second moments of the distribution ρ x ∗ , respectively), so x is a local maximum. Proof of Proposition 4.1.
A point x ∈ V n is a critical point of f n +1 | V n if and onlyif ∇ f n +1 ( x ) lies in the span of ∇ p j ( x ) , j = 1 , . . . , n . We have ∇ f n +1 = ( p n +1 − p ∗ n +1 ) ∇ p n +1 , so if x / ∈ V n +1 then p n +1 ( x ) − p ∗ n +1 = 0, implying ∇ p n +1 ( x ) also lies in the span of ∇ p j ( x ) , j = 1 , . . . , n . Hence x is a critical point of p n +1 | V n .Now, by arguments analogous to those in Lemma 4.23 of [FSWW20], every pointin V n has at least n distinct entries (for generic x ∗ ), and V n is nonsingular (i.e. thegradients ∇ p j ( x ) , j = 1 , . . . , n are linearly independent for every x ∈ V n ).We show that a critical point x of p n +1 | V n can have at most n distinct entries.Note that ∂ j p k ( x ) = kα j x k − j . Since ∇ p n +1 ( x ) lies in the span of the gradients IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 17 ∇ p k ( x ), there exist λ , . . . , λ n such that(4.6) α ( n + 1) x n ... α K ( n + 1) x nK = λ α ... α K + λ α x ...2 α K x K + · · · + λ n − nα x n − ... nα K x n − K . Define the polynomial(4.7) q n ( x ) = ( n + 1) x n − ( nλ n − x n − + · · · + 2 λ x + λ ) . Now, q n is an n th order polynomial, and (4.6) gives that q n ( x ) = · · · = q n ( x K ) = 0(since the α j are nonzero). This implies that there are at most n distinct pointsamong x , . . . , x K .The second assertion follows from [Arn86], but we provide a proof for the sakeof completeness. We will use the following characterization of critical points onmanifolds, reviewed in Appendix D:Let f : R K → R and M ⊂ R K be the intersection of level sets of functions g , . . . , g n . Let x ∈ M be a critical point of f | M and c , . . . , c n ∈ R be such that(4.8) ∇ f ( x ) = n X j =1 c j ∇ g j ( x ) . Then x is a saddle, local minimum, or local maximum of f on M iff the quadraticform(4.9) ∇ f ( x ) − n X j =1 c j ∇ g j ( x )is indeterminate, positive definite, or negative definite, respectively, on the tangentplane to M at x .We apply this result with f = f n +1 , g j = p j and M = V n . Let x ∈ V n \ V n +1 bea critical point of f n +1 | V n . Without loss of generality, assume x > x > · · · > x n are the distinct points. Letting λ j , j = 1 , . . . , n be as in (4.6), we have ∇ f n +1 ( x ) = ( p n +1 ( x ) − p ∗ n +1 ) n X j =1 λ j ∇ p j , so that c j = ( p n +1 ( x ) − p ∗ n +1 ) λ j . Now the Hessian of f n +1 is given by(4.10) ∇ f n +1 = ( p n +1 − p ∗ n +1 ) ∇ p n +1 + ∇ p n +1 ∇ T p n +1 , where ∇ p n +1 is a column vector. Since ∇ p n +1 ( x ) is a linear combination of ∇ p j ( x ) , j = 1 , . . . , n , it is orthogonal to vectors in the tangent plane of V n +1 at x . We therefore drop it from the quadratic form and consider( p n +1 − p ∗ n +1 ) ∇ p n +1 − n X j =1 ( p n +1 − p ∗ n +1 ) λ j ∇ p j ( x )= ( p n +1 − p ∗ n +1 ) ∇ p n +1 ( x ) − n X j =1 λ j ∇ p j ( x ) = ( p n +1 − p ∗ n +1 )diag (cid:0) α q ′ n ( x ) , . . . , α K q ′ n ( x K ) (cid:1) , (4.11) where the polynomial q n is as in (4.7). We now characterize the vectors v ∈ R K in the tangent plane to V n at x , i.e. perpendicular to ∇ p j ( x ) , j = 1 , . . . , n . First,define the vectors u k ∈ R n , k = 1 , . . . , n by u k = ( x k − , . . . , x k − n ) T , k = 1 , . . . , n. Then the matrix V ∈ R n × n with columns u , . . . , u n is a Vandermonde matrix withdeterminant det( V ) = Y ≤ i
0, since x is the rightmost root on the real line. This finishes the proofof (b).Finally, (c) is shown in [FSWW20]. (cid:3) IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 19
Algebraically Structured Models and Discussion.
There are many im-portant inference problems that are naturally modelled as GMMs with algebraicstructure imposed on the centers. A motivating application is that of moleculeimaging using Cryo-Electron Microscopy (cryo-EM), in which the goal is to recon-struct the density of a molecule given partial observations of it. The imaging datacan be modeled by a GMM which generalizes the orbit recovery model in severalways. We describe the model in full generality, since it encapsulates most of thealgebraically structured models of interest.The observations in cryo-EM are given by noisy projections of the molecule takenfrom different unknown viewing directions. Moreover, the molecule may be observedin one of several conformations. A common model assumption is to consider thenoise to be Gaussian, in which case we can model the data by the GMM Y = σZ + θ , where Z is the additive Gaussian noise and θ encodes the projection, rotation, andconformation of the molecule.Specifically, let θ j ∈ R d , j = 1 , . . . , K represent the densities of the moleculein its K different conformations. These are the signals we wish to recover. Let χ ∈ { , , . . . , K } be a random variable representing the probability to observeconformation j = 1 , . . . , K , with P ( χ = j ) = α j , j = 1 , . . . , K . This distributionis also unknown. Next, let G be the group of rotations on R d , and g ∼ Haar( G )be a random variable which has uniform distribution over G . Finally, let Π : R d → R m , with m < d , be the tomographic projection, a linear projection operatorcorresponding to the imaging procedure. We can then write the GMM as(4.14) Y = σZ + θ , θ = Π( g θ χ ) . To summarize, the centers of this mixture, given by the support of θ , are theprojections of the orbits under the continuous group G of K points in R d . Thefollowing are simplifications of this general model:(1) Discrete Homogeneous Orbit Retrieval.
This is a type of orbit re-trieval model (2.9) described in Section 2. Here, there is no projectionoperator and only one orbit. One special case of interest is MultireferenceAlignment (MRA), in which the group G = { g , . . . , g d − } is the groupwhich acts on vectors in R d by cyclically shifting their entries, i.e.( g j θ ) k = θ j + k mod d , j, k = 0 , . . . , d − . (2) Orbit Retrieval with non-uniform weights.
In this case, the distribu-tion of g is not restricted to be uniform over G , and is unknown.(3) Continuous Orbit Retrieval.
Here, the group G may be infinite. As anexample, continuous MRA is a generalization of discrete MRA, in whichshifts of entries are generalized to continuous shifts of periodic functionson the torus, i.e. τ x θ ( y ) = θ ( y + x mod 1) , x, y ∈ [0 , Heterogeneous Orbit Retrieval.
There is no projection operator, butthe centers form
K > G .A connection between the log-likelihood and the moments of the mixture wasestablished in [BRW17, PWB +
17] for homogeneous MRA. In that paper, upperand lower bounds on the KL divergence (essentially the negative log likelihood)are given in the form of a series similar to ours, in which each term is the squarednorm of the difference between true and estimated moments. This was then used to understand the sample complexity of the orbit retrieval problem, heavily exploitingthe fact that the moments, due to the model’s algebraic structure, correspond toinvariant polynomials with respect to the group action. This showed that in the lowSNR regime, the sample complexity of MRA increases from the standard O (1 / SNR)to O (1 / SNR ), a previously unexplained phenomenon first observed in experimentsperformed in the context of Cryo-EM [Sig98]. This connection was then extendedto the general setting (4.14) in [BBSK + θ P k λ k k T k ( θ ) − T ∗ k k . This is one of the methodsused in a series of papers in which the moment-based approach was suggested foralgebraically structured mixture models [BBM +
18, BBL +
19, MBB +
20, LBBS20].In fact, numerical simulations in these papers demonstrate this connection. Theexperiments suggest that the two methods have similar performances for MRA andsome of its extensions mentioned above.A particularly interesting example is that of Heterogeneous MRA, in which thereare M = Kd mixture components corresponding to the orbits under cyclic shiftsof K vectors in R d . Statistically, it is known [BBSK +
17] that moments up to de-gree 3 are enough to resolve the model (provided the vectors are generic) even for K growing linearly with d . However, numerical experiments, in which the vectorsare chosen at random, suggest that algorithms start failing above K > ∼ √ d , and thatthe moment matching landscape has spurious local minima in this regime. It isconceivable that, for vectors chosen randomly from a Gaussian distribution, thethird moment matching landscape is benign for K ≪ √ d and riddled with spuriouscritical points when K > ∼ √ d . If such a phase transition is established, our expansioncould then be used as a vehicle to transfer such results into an understanding ofthe performance of EM and similar methods.4.3.1. Random centers.
Outside of algebraically structured GMMs, another inter-esting model is a GMM with “average-case” centers: take M randomly sampledvectors in R d from a Gaussian distribution and consider the GMM with thesevectors as centers (and fixed isotropic covariances). The question of whether themixture can be recovered from third moments is equivalent to low-rank tensor de-composition (the third moment tensor is a d × d × d tensor with rank ≤ M ). Thisproblem is believed to exhibit a statistical-to-computational gap: while the lowrank decomposition is decidable for M ≪ d it is believed to be computationallyhard for M ≫ d / [Wei18]. This is precisely the regime in which algorithms formoment inversion appear to fail in heterogeneous MRA, since the orbits of K ∼ √ d vectors form the centers of a GMM with M = Kd ∼ d / mixture components. A IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 21 characterization of the roughness of the landscape of low-rank tensor decompositionin these regimes could, with the help of our expansion, potentially be transferredto study the performance of EM in such a mixture model.4.4.
Towards finite sample guarantees.
To make the connection rigorous be-tween likelihood optimization and the series of minimization problems (4.1) in lowSNR models, one must prove that the path of gradient descent on the negative loglikelihood is well-approximated by the stagewise least squares moment minimiza-tion. In order to study these algorithms in the finite sample case, one must alsoquantify the deviation of the sample log likelihood and its first two derivatives fromthe population log likelihood and its first two derivatives, respectively.[FSWW20] carries out this program to draw conclusions about log likelihoodoptimization in the case of homogeneous orbit retrieval. The tools developed in thatpaper lay the groundwork for analysis of more general models. In particular, theauthors exploit the algebraic structure of the model to reparameterize the gradientdescent dynamics in a basis of invariant polynomials under the group action. Thevarieties (4.2) are then level sets of these polynomials, simplifying the analysis ofthe landscape of (4.1).We have not rigorously established the connection between the two landscapesfor general GMMs or performed a finite sample analysis here, but we expect thatdoing so should be possible with the help of techniques developed in [FSWW20], aswell as those used in the present paper for the derivation of the likelihood expansion.5.
Log Likelihood Asymptotic Expansion
In this section, we prove the asymptotic expansion of the population log like-lihood given in Lemma 2.2, highlighting key parts of the argument and deferringtechnical lemmas to the appendix. Recall that the log likelihood is given by L ( ρ ; ρ ∗ ) = E θ ∗ ,Z log E θ (cid:2) exp (cid:0) −k σZ + θ ∗ − θ k / σ (cid:1)(cid:3) , (5.1)where θ ∼ ρ, θ ∗ ∼ ρ ∗ , and ρ, ρ ∗ are compactly supported distributions on R d . Notethat in this section only, we use θ, θ ∗ to denote random variables, rather than θ , θ ∗ .We begin with the following key observation. Lemma 5.1.
Let Z ′ ∈ R d be a random vector independent of Z, θ, and θ ∗ suchthat Z ′ ∼ N (0 , I ) . Then (5.2) L ( ρ ; ρ ∗ ) = − d E θ ∗ ,Z log E θ,Z ′ (cid:20) exp (cid:18) σ ( θ − θ ∗ ) T ( Z + iZ ′ ) (cid:19)(cid:21) . whereProof. Consider the random variable ( θ − θ ∗ ) T Z ′ . For θ, θ ∗ fixed, it is a mean zeroGaussian with variance k θ ∗ − θ k and hence has characteristic function(5.3) E Z ′ h e it ( θ − θ ∗ ) T Z ′ i = exp (cid:18) − t k θ − θ ∗ k (cid:19) Now, we have − σ k σZ + θ ∗ − θ k = − k Z k + 1 σ ( θ − θ ∗ ) T Z − σ k θ ∗ − θ k . Using (5.3) with t = 1 /σ , we then have E θ (cid:2) exp (cid:18) − σ k σZ + θ ∗ − θ k (cid:19) (cid:3) = exp (cid:18) − k Z k (cid:19) E θ,Z ′ (cid:20) exp (cid:18) σ ( θ − θ ∗ ) T Z + 1 σ i ( θ − θ ∗ ) T Z ′ (cid:19)(cid:21) = exp (cid:18) − k Z k (cid:19) E θ,Z ′ (cid:20) exp (cid:18) σ ( θ − θ ∗ ) T ( Z + iZ ′ ) (cid:19)(cid:21) . (5.4)Taking the logarithm and expectation with respect to θ ∗ , Z gives (5.2) (cid:3) The remainder of the proof centers around a finite Taylor expansion about t = 0of(5.5) f ( t, Z, θ ∗ ) = log E θ,Z ′ (cid:2) exp (cid:0) t ( θ − θ ∗ ) T ( Z + iZ ′ ) (cid:1)(cid:3) . Note that f is C ∞ in t for every Z . Hence, for every m , f has a finite Taylorexpansion of the form(5.6) f ( t, Z, θ ∗ ) = m +1 X p =1 κ p t p p ! + ∂ m +2 t f ( ξ ) t m +2 (2 m + 2)! . Here, the κ j and ξ both depend on Z, θ ∗ , and | ξ | < | t | . This expansion is valid forevery t ∈ R and Z, θ ∗ ∈ R d . Substituting (5.6) into (5.2) with t = 1 /σ , we have(5.7) L ( ρ ; ρ ∗ ) = − d/ m +1 X p =1 E θ ∗ ,Z [ κ p ] σ − p p ! + E θ ∗ ,Z (cid:2) ∂ m +2 t f ( ξ ) (cid:3) σ − m − (2 m + 2)! . To prove Lemma 2.2, it remains to compute the expectations of κ p , p = 1 , . . . , m +1 and upper bound the expectation of the error term. The following theoremsummarizes the results of these computations. Theorem 5.2.
The following theorem summarizes the results of these computations.

Theorem 5.2. We have $\mathbb{E}_{\theta_*, Z}[\kappa_{2k+1}] = 0$ for $k = 0, 1, \ldots$, and
(5.8)   $\frac{1}{(2k)!}\,\mathbb{E}_{\theta_*, Z}[\kappa_{2k}] = -\frac{1}{2\,k!}\,\|T_k - T^*_k\|^2 + \mathbb{1}_{k>1}\left(\langle T_k, Q_k\rangle + r_k\right), \qquad k = 1, \ldots, m,$
where $r_k \in \mathcal{R}_{2k}\left[T_{k-1}, T^*_{2k}\right]$, $Q_k \in \mathcal{V}_k\left[T_{k-1}, T^*_{k-1}\right]$, and $Q_k$ is such that $Q_k(T^*_{k-1}, T^*_{k-1}) = 0$. The error term is bounded above by
$\left|\mathbb{E}_{\theta_*, Z}\left[\partial_t^{2m+2} f(\xi)\right]\right| \frac{\sigma^{-2m-2}}{(2m+2)!} \le (m+1)!\left(\frac{C\delta}{\sigma}\right)^{2m+2}\left(1 \vee \frac{\delta}{\sigma}\right)^{2m+2},$
where $C$ is a $d$-dependent constant and $\delta = \max\{\|x - x_*\| \mid x \in \operatorname{supp}(\theta),\ x_* \in \operatorname{supp}(\theta_*)\}$.
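The constant multiplying the squared moment mismatch in (5.8) traces back to the moment-cumulant coefficient $c_{k,k} = -\frac{1}{2}\binom{2k}{k}$ computed in Lemma A.2 of the appendix. The following symbolic computation (a quick sanity check, not part of the proof; it assumes SymPy is available) recovers that coefficient for small $k$ by expanding the logarithm of a formal moment generating function and reading off the coefficient of $\mu_k^2$ in $\kappa_{2k}$.

import sympy as sp

t = sp.symbols('t')
k_max = 3
mu = sp.symbols('mu1:%d' % (2 * k_max + 1))           # formal moments mu1, ..., mu6

# formal MGF truncated at order 2*k_max, and its log (the cumulant generating function)
M = 1 + sum(mu[j - 1] * t ** j / sp.factorial(j) for j in range(1, 2 * k_max + 1))
cgf = sp.expand(sp.log(M).series(t, 0, 2 * k_max + 1).removeO())

for k in range(1, k_max + 1):
    kappa_2k = sp.factorial(2 * k) * cgf.coeff(t, 2 * k)   # kappa_{2k} as a polynomial in the moments
    c_kk = sp.expand(kappa_2k).coeff(mu[k - 1] ** 2)        # coefficient of mu_k^2, i.e. c_{k,k}
    print(k, c_kk, -sp.binomial(2 * k, k) / 2)              # the last two columns should agree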
We now outline the main steps of the proof of Theorem 5.2. In Section 5.2, we obtain expressions for the $\kappa_p$. We do so by taking advantage of the generalized moment-cumulant relationships described below. We obtain
(5.9)   $f(t, Z, \theta_*) = \sum_{p=1}^{\infty} \kappa_p(Z, \theta_*)\, \frac{t^p}{p!} \qquad \forall\, |t| < R_Z,$
where $R_Z$ is a $Z$-dependent radius of convergence, within which the series converges uniformly in $t$. That the radius of convergence depends on $Z$ will not be an issue, as we only care about the coefficients $\kappa_p$ of (5.9). Indeed, the $\kappa_p$ of (5.6) and (5.9) are the same. In Section 5.3, we upper bound the error term by explicitly computing $\partial_t^{2m+2} f$ and bounding its $Z, \theta_*$-expectation. In Section 5.4, we compute the $Z$-expectation of the $\kappa_p$, and in Section 5.5, we compute the $\theta_*$-expectations of the $Z$-expectations.

In several key steps of the proof, we make use of the polynomials which express the cumulants of a distribution in terms of its moments. We will apply these moment-cumulant relations in a more general setting, in which the "moments" are the coefficients of any Taylor expansion satisfying certain conditions. Before proceeding with the proof, we describe these generalized moment-cumulant relations.

5.1. Generalized Moment-Cumulant Relations.
Let $X$ be a random variable with moment generating function
$M_X(t) = \mathbb{E}\left[e^{tX}\right] = 1 + \sum_{k=1}^{\infty} \frac{\mu_k}{k!}\, t^k, \qquad |t| < R,$
and cumulant generating function
$\kappa_X(t) = \log M_X(t) = \sum_{m=1}^{\infty} \frac{\kappa_m}{m!}\, t^m.$
The $\mu_k$ are the moments of $X$, $\mu_k = \mathbb{E}[X^k]$, and the cumulants $\kappa_m$ are given by the following polynomials $\kappa_m(\mu_1, \ldots, \mu_m)$:
(5.10)   $\kappa_m(\mu_1, \ldots, \mu_m) = \sum_{\lambda \in S_m} c_\lambda \prod_{k \in \lambda} \mu_k,$
where $S_m$ is the set of all finite lists $\lambda$ of positive integers whose sum is $m$. The $c_\lambda$ are universal constants, and we will rarely need to know their exact values. As an example,
$\kappa_4 = c_4\,\mu_4 + c_{3,1}\,\mu_3\mu_1 + c_{2,2}\,\mu_2^2 + c_{2,1,1}\,\mu_2\mu_1^2 + c_{1,1,1,1}\,\mu_1^4.$
The moment-cumulant relations (5.10) are typically applied in the context of random variables. However, they arise in a more general context: for a function $f$ with Taylor series coefficients $\mu_k$, (5.10) describes how the Taylor series coefficients $\kappa_m$ of $\log f$ relate to the $\mu_k$. More concretely, we have the following result, proved in Proposition A.1 of the Appendix.

Proposition 5.3.
Let M ( t ) be a real analytic function in the neighborhood | t − t | Recall that(5.12) f ( t, Z, θ ∗ ) = log E θ E Z ′ (cid:2) exp (cid:0) t ( θ − θ ∗ ) T ( Z + iZ ′ ) (cid:1)(cid:3) . In this section, we derive the series expansion f ( t, Z, θ ∗ ) = ∞ X p =1 κ p ( Z, θ ∗ ) p ! t p ∀ | t | < R Z , for R Z to be specified. Throughout the section, Z and θ ∗ are considered constant.Let W = Z + iZ ′ , w = θ − θ ∗ , and M ( t ) = M ( t, Z, θ ∗ ) = E θ,Z ′ (cid:2) exp (cid:0) tw T W (cid:1)(cid:3) so that f = log M . Recall that δ = sup {k x − x ∗ k | x ∈ supp( θ ) , x ∗ ∈ supp( θ ∗ ) } , sothat k w k ≤ δ . Now,exp (cid:0) tw T W (cid:1) = lim m →∞ m X k =1 ( w T W ) k t k k ! ! . IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 25 The partial sums are each bounded in absolute value by exp ( δt ( k Z k + k Z ′ k )) whichhas finite θ, Z ′ -expectation. We can therefore interchange summation and expecta-tion to get M ( t ) = E θ,Z ′ (cid:2) exp (cid:0) tw T W (cid:1)(cid:3) = 1 + ∞ X k =1 E θ,Z ′ (cid:2) ( w T W ) k (cid:3) t k k != 1 + ∞ X k =1 (cid:28) E θ (cid:2) w ⊗ k (cid:3) , E Z ′ (cid:2) W ⊗ k (cid:3) (cid:29) t k k ! ∀ t ∈ R . (5.13)Denote the coefficients in this expansion by µ k = (cid:10) E θ (cid:2) w ⊗ k (cid:3) , E Z ′ (cid:2) W ⊗ k (cid:3)(cid:11) = (cid:10) E θ (cid:2) w ⊗ k (cid:3) , E Im W (cid:2) W ⊗ k (cid:3)(cid:11) . We now apply the generalized moment-cumulant relations to Taylor expand thelogarithm of M ( t ). In order to do so, we first limit the range of t to ensure M ( t )remains in a small neighborhood of 1. Now, we can write M ( t ) as M ( t ) = E θ exp (cid:18) w T Zt − t k w k (cid:19) . Using that k w k ≤ δ , we have (cid:12)(cid:12)(cid:12)(cid:12) w T Zt − t k w k (cid:12)(cid:12)(cid:12)(cid:12) ≤ 12 if | t | < R Z = 1 δ max(4 k Z k , √ . Therefore, for | t | < R Z we have | M ( t ) − | ≤ E θ (cid:12)(cid:12)(cid:12)(cid:12) exp (cid:18) w T Zt − t k w k (cid:19) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ E θ (cid:12)(cid:12)(cid:12)(cid:12) w T Zt − t k w k (cid:12)(cid:12)(cid:12)(cid:12) ≤ , (5.14)where we have used the fact that | e x − | < | x | / | x | < 1. For | t | < R Z wecan thus make use of the generalized moment-cumulant relations of Proposition 5.3to write f ( t ) = log M ( t ) = ∞ X p =1 κ p t p p ! , where κ p = κ p ( Z, θ ∗ ) = X λ ∈ S p c λ Y ℓ ∈ λ µ ℓ = X λ ∈ S p c λ Y ℓ ∈ λ (cid:10) E θ (cid:2) w ⊗ ℓ (cid:3) , E Im W (cid:2) W ⊗ ℓ (cid:3)(cid:11) . = X λ ∈ S p c λ *O ℓ ∈ λ E θ (cid:2) w ⊗ ℓ (cid:3) , E Im W λ "O ℓ ∈ λ W ⊗ ℓℓ . (5.15)In the third line, W λ denotes the set { W ℓ | ℓ ∈ λ } . We replaced W = Z + iZ ′ with vectors W ℓ = Z + iZ ℓ , ℓ ∈ λ, where the Z ℓ ∈ R d are i.i.d. standard normaland independent of Z . This allowed us to write the product of expectations as theexpectation of a product. Error Term Upper Bound. In this section, the constant C depends on d only and may change value from line to line. Recall that the error term is given by E Z,θ ∗ (cid:2) ∂ m +2 t f ( ξ ) (cid:3) σ − m − (2 m + 2)! , for ξ such that 0 < ξ < /σ . To bound it, we first compute ∂ nt f at a generic point t (with n = 2 m + 2). Recall that w = θ − θ ∗ , W = Z + iZ ′ , and f = log M for M ( t, Z, θ ∗ ) = E θ,Z ′ (cid:2) exp (cid:0) tW T w (cid:1)(cid:3) . By (5.11), we have ∂ nt f = κ n (cid:0) ∂ t M/M, . . . , ∂ nt M/M (cid:1) = X λ ∈ S n c λ Y ℓ ∈ λ ∂ ℓt MM (5.16)at any point t , since M ( t ) is never zero. We therefore have the error bound(5.17) σ − n n ! E Z,θ ∗ | ∂ nt f | ≤ σ − n n ! 
E θ ∗ X λ ∈ S n | c λ | E Z Y ℓ ∈ λ (cid:12)(cid:12)(cid:12)(cid:12) ∂ ℓt MM (cid:12)(cid:12)(cid:12)(cid:12) . We now compute the t -derivative of M in the following indirect way, which willyield an expression that is simpler to bound. Let t = t + s ; we will write M interms of s and take its derivative at s = 0. We have M ( t + s ) = E θ h e ( t + s ) w T Z E Z ′ e i ( t + s ) w T Z ′ i , and note that E Z ′ h e i ( t + s ) w T Z ′ i = e − t s k w k E Z ′ h e it w T Z ′ i E Z ′ h e isw T Z ′ i . Therefore, M ( t + s ) = E θ h e − t s k w k E Z ′ h e t w T W i E Z ′ h e sw T W ii = E θ h E Z ′ h e t w T W i E Z ′ h e sw T ( W − t w ) ii . (5.18)We now take the derivative at s = 0, passing it inside both expectations. This isjustified since the resulting derivative is absolutely integrable. We obtain ∂ ℓt M ( t ) = E θ h E Z ′ h e t w T W i E Z ′ h(cid:0) w T ( W − t w ) (cid:1) ℓ ii . (5.19)Noting that E Z ′ h e tw T W i > , we have (cid:12)(cid:12)(cid:12)(cid:12) ∂ ℓt M ( t ) M ( t ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ E θ h E Z ′ e tw T W E Z ′ (cid:12)(cid:12) w T ( W − tw ) (cid:12)(cid:12) ℓ i E θ (cid:2) E Z ′ e tw T W (cid:3) ≤ sup θ E Z ′ (cid:12)(cid:12) w T ( W − tw ) (cid:12)(cid:12) ℓ ≤ δ ℓ E Z ′ ( k W k + | t | δ ) ℓ (5.20) IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 27 Now, fix λ ∈ S n , and let W ℓ = Z + iZ ′ ℓ , ℓ ∈ λ , where Z ′ ℓ are independent copies of Z ′ . We have E Z (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)Y ℓ ∈ λ ∂ ℓt MM (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ E Z Y ℓ ∈ λ δ ℓ E Im W ℓ ( k W ℓ k + | t | δ ) ℓ = δ n E Y ℓ ∈ λ ( k W ℓ k + | t | δ ) ℓ ≤ (2 δ ) n E 12 ( | t | δ ) n + 12 X ℓ ∈ λ ℓn k W ℓ k n ! = (2 δ ) n (cid:18) 12 ( | t | δ ) n + 12 E k W k n (cid:19) ≤ (cid:16) n (cid:17) !( Cδ ) n (1 ∨ | t | δ ) n (5.21)where expectation with no subscripts denotes expectation with respect to all ran-dom variables and C is a constant depending on d only. Now, note that this upperbound is independent of λ . We also have X λ ∈ S n | c λ | ≤ n n ≤ n ! e n . This follows from [FSWW20], in which it was shown that the sum of the absolutevalues of the coefficients arising in an n th order multivariate cumulant is boundedabove by n n . It is straightforward to see that this bound applies to univariatecumulants as well. Substituting these bounds in the error bound (5.17) and using | t | = | ξ | < /σ , we obtain σ − n n ! E Z,θ ∗ | ∂ nt f | ≤ σ − n n ! (cid:16) n (cid:17) !( Cδ ) n (1 ∨ | t | δ ) n X λ ∈ S n | c λ |≤ (cid:16) n (cid:17) ! (cid:18) C δσ (cid:19) n (cid:18) ∨ δσ (cid:19) n . (5.22)Taking n = 2 m + 2 gives the desired error bound.5.4. Z-Expectation of Cumulants. We will use the following notation in thissection: for a list λ , we define ℓ max as the maximum element in the list, and λ \ ℓ is the list with one copy of ℓ removed. Now, recall that w = θ − θ ∗ , and κ p = X λ ∈ S p c λ *O ℓ ∈ λ E θ (cid:2) w ⊗ ℓ (cid:3) , E Im W λ "O ℓ ∈ λ W ⊗ ℓℓ , where W λ denotes the set { W ℓ | ℓ ∈ λ } and W ℓ = Z + iZ ℓ , where Z, Z ℓ , ℓ ∈ λ areindependent standard normal vectors in R d . We therefore have E θ ∗ ,Z [ κ p ] = X λ ∈ S p c λ * E θ ∗ "O ℓ ∈ λ E θ (cid:2) w ⊗ ℓ (cid:3) , E W λ "O ℓ ∈ λ W ⊗ ℓℓ . We make two initial observations:(1) If p = 2 k + 1 is odd, then the expectation on the right side of the innerproduct is zero, so that E [ κ k +1 ] = 0. This follows from the fact that allGaussian random variables appearing in the expectation are mean zero,and the total order of the product is 2 k + 1. (2) Suppose p = 2 k . 
If λ is such that all ℓ ∈ λ are less than k , then the θ ∗ expectation on the left is a polynomial only of moment tensors T p with p < k (though it could depend on moment tensors T ∗ p for p up to 2 k .) Thisis proved in Lemma B.4 in the appendix. Therefore, the inner product forsuch λ s will contribute only to the remainder polynomial r k ( T k − , T ∗ k ) . We therefore consider p = 2 k , and discard from the sum those λ ∈ S k in whichall numbers are less than k . Let ℓ max denote the maximum number in a list λ . Wehave E θ ∗ ,Z [ κ k ] = E θ ∗ X λ ∈ S k ℓ max ≥ k c λ *O ℓ ∈ λ E θ (cid:2) w ⊗ ℓ (cid:3) , E "O ℓ ∈ λ ( Z + iZ ℓ ) ⊗ ℓ + r k ( T k − , T ∗ k ) . (5.23)Using results in [FPR19], we now prove the following proposition. Proposition 5.5. Let λ ∈ S k be such that ℓ max , the maximum of the list, is atleast k . Let Z, Z ℓ , ℓ ∈ λ be i.i.d. standard normal vectors in R d . Then (5.24) E "O ℓ ∈ λ ( Z + iZ ℓ ) ⊗ ℓ = ( , ℓ max > k − k E h W ⊗ k ⊗ W ⊗ k i , ℓ max = k, where W = Z + iZ ℓ max . The proof relies on the following result of [FPR19] Proposition 5.6. [Proposition 2 of [FPR19] ] Let V = (cid:0) V , . . . , V p (cid:1) ∈ C p be jointlycircularly symmetric. Then E h(cid:0) V (cid:1) k . . . ( V p ) k p V m . . . V pm p i = 0 only if k + · · · + k p = m + · · · + m p .Proof of Proposition 5.5. Note that each Z + iZ ℓ is a circularly symmetric randomvector. However, ( Z + iZ ℓ ) ℓ ∈ λ ∈ C | λ | d is not jointly circularly symmetric, becausethe real parts are correlated while the imaginary parts are independent. We cannevertheless take advantage of Proposition 5.6 as follows: first, define Z ′ ℓ ∈ R d , ℓ ∈ λ \ ℓ max to be standard normal and independent of Z, Z ℓ , ℓ ∈ λ and of each other.Define W = Z + iZ ℓ max , and W ℓ = Z ′ ℓ + iZ ℓ , ℓ ∈ λ \ ℓ max . Thus, ( W ; ( W ℓ ) ℓ ∈ λ \ λ max ) ∈ C | λ | d is circularly symmetric. Now, for ℓ ∈ λ \ ℓ max we write Z + iZ ℓ = 12 ( W + W ) + 12 ( W ℓ − W ℓ ) = 12 ( W + W ℓ + W − W ℓ ) . With this notation, we have O ℓ ∈ λ ( Z + iZ ℓ ) ⊗ ℓ = W ⊗ ℓ max ⊗ (cid:18) (cid:19) k − ℓ max O ℓ ∈ λ \ ℓ max ( W + W ℓ + W − W ℓ ) ⊗ ℓ . Note that each entry in this tensor is of the form given in the proposition, i.e.it is a product of some number of conjugated and unconjugated complex randomvariables which are jointly circularly symmetric Gaussian. We count how manyconjugated and unconjugated variables appear in a typical entry of this tensor.There are at least ℓ max unconjugated variables (from the ℓ max copies of W ) , andat most 2 k − ℓ max ≤ ℓ max conjugated variables. Thus, we immediately see that the IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 29 expectation is zero if ℓ max > k . If ℓ max = k , the expectations of all terms withfewer than k conjugated variables are zero. Writing k instead of ℓ max , we then have E W ⊗ k ⊗ O ℓ ∈ λ \ k ( W + W ℓ + W − W ℓ ) ⊗ ℓ = E W ⊗ k ⊗ O ℓ ∈ λ \ k ( W − W ℓ ) ⊗ ℓ . But note that W ℓ is independent of W , so the expectation of products involving bothentries of W and entries of W ℓ will split up into a product of two expectations, oneof which involves only entries of W ℓ . But this expectation is zero by the proposition,since it involves only conjugated variables. Hence only the products involving W alone survive. We obtain E W ⊗ k ⊗ O ℓ ∈ λ \ k ( W − W ℓ ) ⊗ ℓ = E W ⊗ k ⊗ O ℓ ∈ λ \ k W ⊗ ℓ = E h W ⊗ k ⊗ W ⊗ k i . We recall the additional factor (cid:0) (cid:1) k − ℓ max = 2 − k to conclude. 
(cid:3) Substituting this back into the cumulant formula, we have E Z [ κ k ] = 2 − k X λ ∈ S k ℓ max = k c λ * E θ (cid:2) w ⊗ k (cid:3) ⊗ O ℓ ∈ λ \ k E θ (cid:2) w ⊗ ℓ (cid:3) , E h W ⊗ k ⊗ W ⊗ k i+ = 2 − k X λ ∈ S k ℓ max = k c λ E W E θ (cid:2) ( w T W ) k (cid:3) Y ℓ ∈ λ \ k E θ (cid:2) ( w T W ) ℓ (cid:3) . (5.25)This expectation is straightforward to evaluate (see Proposition B.6 in the Appen-dix), and we obtain E Z [ κ k ] = k ! X λ ∈ S k ℓ max = k c λ * E θ (cid:2) w ⊗ k (cid:3) , O ℓ ∈ λ \ k E θ (cid:2) w ⊗ ℓ (cid:3)+ = k ! * E θ (cid:2) w ⊗ k (cid:3) , X λ ∈ S k ℓ max = k c λ O ℓ ∈ λ \ k E θ (cid:2) w ⊗ ℓ (cid:3)+ . (5.26)5.5. θ ∗ -Expectation of Cumulants. Recall that w = θ − θ ∗ , and that(5.27) T ℓ = E (cid:2) θ ⊗ ℓ (cid:3) , T ∗ ℓ = E (cid:2) θ ⊗ ℓ ∗ (cid:3) , ℓ = 1 , , . . . . We also remind the reader of the definitions of total moment order and the spaces V k , R k : Definition. Consider S = n O i =1 S k i = S k ⊗ S k ⊗ · · · ⊗ S k n , where S k i is either T k i or T ∗ k i . We define the total moment order of S to be P ni =1 k i , i.e. the sum of all moment orders. We also say that the total momentorder of each entry of S is P ni =1 k i ; in other words, the total moment order ofproducts of entries of moment tensors is the sum of all moment orders in theproduct. Definition. We define V k [ T m , T ∗ n ] as the set of all constant coefficient linearcombinations of outer products of moment tensors T j , j ≤ m, T ∗ ℓ , ℓ ≤ n , of totalmoment order k . We define R k [ T m , T ∗ n ] as the set of all constant coefficientlinear combinations of products of entries of moment tensors T j , T ∗ ℓ , j ≤ m, ℓ ≤ n ,of total moment order k .In the previous section, we have shown that(5.28) E Z [ κ k ] = k ! * E θ (cid:2) w ⊗ k (cid:3) , X λ ∈ S k ℓ max = k c λ O ℓ ∈ λ \ k E θ (cid:2) w ⊗ ℓ (cid:3)+ + s ( T k − , θ ∗ ) , where s is such that E θ ∗ [ s ( T k − , θ ∗ )] ∈ R k [ T k − , T ∗ k ]. For brevity, define thetensor on the right of (5.28) by J k , that is J k = X λ ∈ S k ℓ max = k c λ O ℓ ∈ λ \ k E θ (cid:2) w ⊗ ℓ (cid:3) . To get a sense of J k , let us write out J : J = c , E θ (cid:2) w ⊗ (cid:3) + c , , E θ (cid:2) w ⊗ (cid:3) ⊗ E θ [ w ] + c , , E θ [ w ] ⊗ E θ [ w ] ⊗ E θ [ w ]In this section, we take the θ ∗ -expectation of the inner product in (5.28), that is,of h E θ (cid:2) w ⊗ k (cid:3) , J k i . We are interested only in terms involving T k , the highest ordermoment tensor of θ appearing in this inner product. Therefore, we can separateout T k on each side of the inner product and discard terms in which T k appearson neither side. The discarded terms will collectively be denoted s ( T k − , θ ∗ ).Note that the θ -moment tensors which arise upon expanding E θ [( θ − θ ∗ ) ⊗ p ] are T j , j ≤ p . Therefore, T k appears only through E θ (cid:2) ( θ − θ ∗ ) ⊗ k (cid:3) , which arises on theleft of the inner product with coefficient 1 and on the right with coefficient c k,k .We have (cid:28) E θ (cid:2) w ⊗ k (cid:3) , J k (cid:29) = (cid:28) E θ (cid:2) θ ⊗ k (cid:3) , J k (cid:29) + (cid:28) E θ (cid:2) w ⊗ k − θ ⊗ k (cid:3) , c k,k E θ (cid:2) θ ⊗ k (cid:3) (cid:29) + s ( T k − , θ ∗ )= (cid:28) E θ (cid:2) θ ⊗ k (cid:3) , J k + c k,k E θ (cid:2) w ⊗ k − θ ⊗ k (cid:3) (cid:29) + s ( T k − , θ ∗ ) . (5.29)We may then write E θ ∗ E Z [ κ k ] = k ! E θ ∗ (cid:28) E θ (cid:2) w ⊗ k (cid:3) , J k (cid:29) = k ! E θ ∗ (cid:28) E θ (cid:2) θ ⊗ k (cid:3) , J k + c k,k E θ (cid:2) w ⊗ k − θ ⊗ k (cid:3) (cid:29) + E θ ∗ [ s ( T k − , θ ∗ )]= k ! 
E θ ′ E θ ∗ (cid:28) θ ′⊗ k , J k + c k,k E θ (cid:2) w ⊗ k − θ ⊗ k (cid:3) (cid:29) + r k ( T k − , T ∗ k ) . (5.30)To get the last line, we substitute E θ (cid:2) θ ⊗ k (cid:3) with E θ ′ (cid:2) θ ′⊗ k (cid:3) , where θ ′ is an in-dependent copy of θ . We then pull the θ ′ -expectation out of the inner product.The fact that E θ ∗ [ s ( T k − , θ ∗ )] = r k ( T k − , T ∗ k ) is proved in the appendix, seeLemma B.5.The following lemma will complete the proof of Theorem 5.2 . IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 31 Lemma 5.7. Let x ∈ R d be constant. Then (5.31) E θ ∗ (cid:28) x ⊗ k , J k + c k,k E θ (cid:2) w ⊗ k − θ ⊗ k (cid:3) (cid:29) = (cid:28) x ⊗ k , − c k,k T ∗ k + c k,k T k + Q k (cid:29) , where Q k ∈ V k (cid:2) T k − , T ∗ k − (cid:3) . Moreover, Q is independent of x and satisfies Q k ( T ∗ k − , T ∗ k − ) = 0 . To finish the proof of Theorem 5.2, note that substituting x = θ ′ and taking the θ ′ -expectation gives E θ ′ E θ ∗ (cid:28) θ ′⊗ k , J k + c k,k E θ (cid:2) w ⊗ k − θ ⊗ k (cid:3) (cid:29) = c k,k k T k k − c k,k h T k , T ∗ k i + h T k , Q k i . Substituting into (5.30) and dividing by (2 k )! gives1(2 k )! E θ ∗ E Z [ κ k ] = k !(2 k )! c k,k k T k − T ∗ k k + h T k , Q k i + r k , where we have absorbed c k,k k T ∗ k k into r k . In Lemma A.2 in the appendix, weshow that c k,k = − (cid:18) kk (cid:19) , so that k !(2 k )! c k,k = − k !) . Hence, the coefficient in front of the squared difference of k th moments is as inTheorem 5.2. Now, Lemma 5.7 will follow from the following Lemma: Lemma 5.8. Let X, X ∗ ∈ R be bounded random variables, and let W = X − X ∗ .Let µ j = E (cid:2) X j (cid:3) , µ ∗ j = E (cid:2) X j ∗ (cid:3) , ω j = E X (cid:2) W j (cid:3) . (Note that ω j is random with respect to X ∗ .) Let c λ be the universal coefficients ofthe moment-cumulant relations (5.10) . Then (5.32) E X ∗ (cid:20) c k,k ω k + (cid:18) X λ ∈ S k ℓ max = k c λ Y ℓ ∈ λ \ k ω ℓ (cid:19)(cid:21) = 2 c k,k ( µ k − µ ∗ k ) + π ( µ k − , µ ∗ k − ) , where π ∈ R [ µ k − , µ ∗ k − ] and satisfies π ( µ ∗ k − , µ ∗ k − ) = 0 . We show how Lemma 5.7 follows from Lemma 5.8 in the appendix; see Lemma A.4.The idea is to take X = x T θ and X ∗ = x T θ ∗ . Proof of Lemma 5.8. Define the polynomials q ( t ) = k X j =1 µ j t j /j ! , q ∗ ( t ) = k X j =1 µ ∗ j t j /j ! , q ω ( t ) = k X j =1 ω j t j /j ! . The proof consists of the following sequence of computations. A. We show that ω k (cid:18) c k,k ω k + X λ ∈ S k ℓ max = k c λ Y ℓ ∈ λ \ p ω ℓ (cid:19) = c k,k ω k − κ k ( M (1) (0) , . . . , M (2 k ) (0)) , where M ( t ) = 1 − ω k t k k ! (1 + q ω ( t )) − . B. We show that c k,k ω k − κ k ( M (1) (0) , . . . , M (2 k ) (0)) = − c k,k ω k d k dt k (1 + q ω ( t )) − (cid:12)(cid:12) t =0 , and hence(5.33) c k,k ω k + X λ ∈ S k ℓ max = k c λ Y ℓ ∈ λ \ p ω ℓ = − c k,k d k dt k (1 + q ω ( t )) − (cid:12)(cid:12) t =0 . This is true even if ω k = 0. Indeed, (5.33) is true if ω k = 0, and both sides of theequation are continuous with respect to ω k (the right hand side is also a polynomialin ω k ), so we can set ω k = ǫ and take ǫ → 0. Taking the X ∗ expectation, we get E X ∗ (cid:20) c k,k ω k + X λ ∈ S k ℓ max = k c λ Y ℓ ∈ λ \ p ω ℓ (cid:21) = − c k,k E X ∗ (cid:20) d k dt k (1 + q ω ( t )) − (cid:12)(cid:12) t =0 (cid:21) (5.34) C. We show that E X ∗ (cid:20) d k dt k (1 + q ω ( t )) − (cid:12)(cid:12) t =0 (cid:21) = d k dt k q ∗ ( t )1 + q ( t ) (cid:12)(cid:12)(cid:12)(cid:12) t =0 . D. 
We show that d k dt k q ∗ ( t )1 + q ( t ) (cid:12)(cid:12)(cid:12)(cid:12) t =0 = µ ∗ k − µ k + π ( µ k − , µ ∗ k − ) , where π has the desired properties.We prove A. here and delegate the rest to the appendix; see Lemma A.3. Con-sider the moment-cumulant relation κ k ( ω , . . . , ω k ) = P λ ∈ S k c λ Q ℓ ∈ λ ω ℓ . Wewrite it as follows: κ k ( ω , . . . , ω k ) = X λ ∈ S k λ max >k c λ Y ℓ ∈ λ ω ℓ + ω k X λ ∈ S k λ max = k c λ Y ℓ ∈ λ \ k ω ℓ + X λ ∈ S k λ max Functional Anal-ysis and Applica- tions , 20:125–127, 1986.[BBL + 19] Tamir Bendory, Nicolas Boumal, William Leeb, Eitan Levin, and Amit Singer.Multi-target detection with application to cryo-electron microscopy. Inverse Prob-lems , 35(10):104003, Sep 2019.[BBM + 18] Tamir Bendory, Nicolas Boumal, Chao Ma, Zhizhen Zhao, and Amit Singer. Bispec-trum inversion with application to multireference alignment. IEEE Transactions onSignal Processing , 66(4):10371050, Feb 2018.[BBSK + 17] Afonso S. Bandeira, Ben Blum-Smith, Joe Kileel, Amelia Perry, Jonathan Weed, andAlexander S. Wein. Estimation under group actions: recovering orbits from invariants. arXiv preprint arxiv:1712.10163 , 2017.[BRW17] Afonso S. Bandeira, Philippe Rigollet, and Jonathan Weed. Optimal rates of estima-tion for multi-reference alignment. arXiv preprint arxiv:1702.08546 , 2017.[BS10] Mikhail Belkin and Kaushik Sinha. Toward learning gaussian mixtures with arbitraryseparation. pages 407–419, 07 2010.[BWY17] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guaran-tees for the em algorithm: From population to sample-based analysis. Ann. Statist. ,45(1):77–120, 02 2017.[FD18] Long Feng and Lee H. Dicker. Approximate nonparametric maximum likelihood formixture models: A convex optimization approach to fitting arbitrary multivariatemixing distributions. Computational Statistics and Data Analysis , 122:80 – 91, 2018.[FPR19] Claudia Fassino, Giovanni Pistone, and Maria Rogantin. Computing the momentsof the complex gaussian: Full and sparse covariance matrix. Mathematics , 7:263, 032019.[FSWW20] Zhou Fan, Yi Sun, Tianhao Wang, and Yihong Wu. Likelihood landscape and max-imum likelihood estimation for the discrete orbit recovery model. arXiv preprintarxiv:2004.00041 , 2020.[JZB + 16] Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J. Wainwright, and MichaelJordan. Local maxima in the likelihood of gaussian mixture models: Structural resultsand algorithmic consequences. Advances in neural information processing systems ,pages 4116–4124, 2016.[KMV10] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mix-tures of two gaussians. In Proceedings of the Forty-Second ACM Symposium on The-ory of Computing , STOC 10, page 553562, New York, NY, USA, 2010. Associationfor Computing Machinery.[Lai78] Nan Laird. Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association , 73(364):805–811, 1978.[LBBS20] Ti-Yen Lan, Tamir Bendory, Nicolas Boumal, and Amit Singer. Multi-target detec-tion with an arbitrary spacing distribution. IEEE Transactions on Signal Processing ,68:1589–1601, 2020. [MBB + 20] C. Ma, T. Bendory, N. Boumal, F. Sigworth, and A. Singer. Heterogeneous multiref-erence alignment for images with application to 2-D classification in single particlereconstruction. IEEE Transactions on Image Processing , 9:1699–1710, 2020.[Pea94] Karl Pearson. Iii. contributions to the mathematical theory of evolution. PhilosophicalTransactions of the Royal Society of London. (A.) 
, 185:71–110, 1894.[PWB + 17] Amelia Perry, Jonathan Weed, Afonso S. Bandeira, Philippe Rigollet, andAmit Singer. The sample complexity of multi-reference alignment. arXiv preprintarxiv:1707.00943 , 2017.[SDCS10] Fred Sigworth, Peter Doerschuk, Jose-Maria Carazo, and Sjors Scheres. An introduc-tion to maximum-likelihood methods in cryo-em. Methods in enzymology , 482:263–94,12 2010.[SG20] Sujayam Saha and Adityanand Guntuboyina. On the nonparametric maximum like-lihood estimator for gaussian location mixture densities with application to gaussiandenoising. Ann. Statist. , 48(2):738–762, 04 2020.[Sig98] Fred Sigworth. A maximum-likelihood approach to single-particle image refinement. Journal of structural biology , 122:328–39, 02 1998.[Wei18] Alexander Wein. Statistical Estimation in the Presence of Group Actions . PhD thesis,Massachusetss Institute of Technology, June 2018.[WZ19] Yihong Wu and Harrison H. Zhou. Randomly initialized em algorithm for two-component gaussian mixture achieves near optimality in o ( √ n ) iterations. arXivpreprint arxiv:1908.10935 , 2019.[XHM16] Ji Xu, Daniel Hsu, and Arian Maleki. Global analysis of expectation maximizationfor mixtures of two gaussians. In Proceedings of the 30th International Conferenceon Neural Information Processing Systems , NIPS16, page 26842692, Red Hook, NY,USA, 2016. Curran Associates Inc.[YYS17] Bowei Yan, Mingzhang Yin, and Purnamrita Sarkar. Convergence of gradient emon multi-component mixture of gaussians. In I. Guyon, U. V. Luxburg, S. Bengio,H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in NeuralInformation Processing Systems 30 , pages 6956–6966. Curran Associates, Inc., 2017. Appendix A. Generalized Moment-Cumulant Relationship Recall the definition of the polynomials κ m ( µ , . . . , µ m ):(A.1) κ m ( µ , . . . , µ m ) = X λ ∈ S m c λ Y k ∈ λ µ k , where S m is the set of all finite lists λ of positive integers whose sum is m , and c λ ∈ R are univeral constants. Proposition A.1. Let M ( t ) be a real analytic function in the neighborhood | t − t | 1, so that u ( t ) = ∞ X k =1 µ k k ! ( t − t ) k , | t − t | < R. If sup | t − t | 1, then f ( t ) = log( M ( t ) /M ( t )) = log(1 + u ( t )) is alsoreal analytic in this neighborhood, and therefore has a convergent Taylor seriesexpansion f ( t ) = ∞ X m =1 f ( m ) ( t ) ( t − t ) m m ! , | t − t | < R. Note that f ( m ) ( t ) = d m dt m log M ( t ) (cid:12)(cid:12) t = t . On the other hand, since the series defining u ( t ) is absolutely convergent and since | u ( t ) | stays below 1, we also have f ( t ) = log(1 + u ( t )) = ∞ X j =1 ( − j − j u ( t ) j = ∞ X j =1 ( − j − j ∞ X k =1 µ k k ! ( t − t ) k ! j = ∞ X m =1 κ m m ! ( t − t ) m , (A.2)where in the last line, the κ m are obtained by expanding the powers of the series andrearranging terms to combine like powers of t − t . Note that the lowest power of t − t in u ( t ) k is k ; therefore, the term ( t − t ) m arises only in the series expansionsof u ( t ) , . . . , u ( t ) m . Moreover, the coefficient of ( t − t ) m in the expansions of u ( t ) k , k = 1 , . . . , m will depend only on µ , . . . , µ m . Therefore, κ m is a polynomialof µ , . . . , µ m , and it is clear that this polynomial should be of the form(A.1). Thecoefficients c λ clearly do not depend on the particular values of the µ k , so they areuniversal constants.Summarizing, we have shown that d m dt m log M ( t ) (cid:12)(cid:12) t = t = κ m ( µ , . . . , µ m ) = κ m (cid:18) M (1) ( t ) M ( t ) , M (2) ( t ) M ( t ) . . . 
, M ( m ) ( t ) M ( t ) (cid:19) . The second assertion about the Taylor expansion of log(1+ u ( t )) clearly follows. (cid:3) Lemma A.2. The coefficient c p in front of the term µ p in the polynomial defining κ p (see (A.1) ) is c p = 1 . The coefficient c p,p in front of the term µ p is given by c p,p = − (cid:0) pp (cid:1) . Proof. Let ǫ = P ∞ k =1 µ k t k /k !, so thatlog(1 + ǫ ) = ∞ X j =1 ( − j +1 j ! ǫ j . Note that the moments appear individually in ǫ . Thus, the moment µ p by itselfcan only appear in ǫ , where it has coefficient 1 / (2 p )!. Since κ p is (2 p )! times the t p coefficient, we must have c p = 1.Now, similarly, µ p is a product of two moments and can therefore only appearin ǫ . In the expansion of ǫ , µ p t p will appear with coefficient 1 / ( p !) . Multiplyingby − / 2, we have that the coefficient of µ p t p in the expansion of log(1 + ǫ ) is − / p !) . Multiplying by (2 p )!, we arrive at c p,p = − 12 (2 p )! p ! p ! = − (cid:18) pp (cid:19) . (cid:3) Lemma A.3. Let X, X ∗ ∈ R be bounded random variables, and let W = X − X ∗ .We denote the moments of X, X ∗ , W as µ j = E (cid:2) X j (cid:3) , µ ∗ j = E (cid:2) X j ∗ (cid:3) , ω j = E X (cid:2) W j (cid:3) . Note that ω j is random with respect to X ∗ . Let c λ be the universal coefficients ofthe moment-cumulant relationships (A.1) . Then (A.3) E X ∗ (cid:20)(cid:18) X λ ∈ S k ℓ max = k c λ Y ℓ ∈ λ \ k ω ℓ (cid:19) + c k,k ω k (cid:21) = 2 c k,k ( µ k − µ ∗ k ) + π ( µ k − , µ ∗ k − ) , where π is a polynomial in µ k − , µ ∗ k − with universal coefficients (independentof the values of the moments) such that each monomial has total order k and whichsatisfies π ( µ ∗ k − , µ ∗ k − ) = 0 . Define q ( t ) = k X j =1 µ j t j /j ! , q ∗ ( t ) = k X j =1 µ ∗ j t j /j ! , q ω ( t ) = k X j =1 ω j t j /j ! . The proof consists of the following steps (we only summarize them, for more detailsee the main text). A. We showed in the main text that ω k (cid:18) c k,k ω k + X λ ∈ S k ℓ max = k c λ Y ℓ ∈ λ \ p ω ℓ (cid:19) = c k,k ω k − κ k ( M (1) (0) , . . . , M (2 k ) (0)) , where M ( t ) = 1 − ω k t k k ! (1 + q ω ( t )) − . B. We show that c k,k ω k − κ k ( M (1) (0) , . . . , M (2 k ) (0)) = − c k,k ω k d k dt k (1 + q ω ( t )) − (cid:12)(cid:12) t =0 , IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 37 C. We show that E X ∗ (cid:20) d k dt k (1 + q ω ( t )) − (cid:12)(cid:12) t =0 (cid:21) = d k dt k q ∗ ( t )1 + q ( t ) (cid:12)(cid:12)(cid:12)(cid:12) t =0 . D. We show that d k dt k q ∗ ( t )1 + q ( t ) (cid:12)(cid:12)(cid:12)(cid:12) t =0 = µ ∗ k − µ k + π ( µ k − , µ ∗ k − ) , where π has the desired properties.In Lemma 5.8, we proved A. , so it remains to prove B., C., D. Proof. For B. , note that M ( p ) (0) = 0 , p = 1 , . . . , k − 1, so all products Q ℓ ∈ λ M ( ℓ ) (0)involving ℓ < k are zero. But the only lists λ ∈ S k involving only moments of order k and higher are λ = { k, k } and λ = { k } . We must therefore compute the k thand 2 k th derivatives of M at zero. By the product rule, and using that only the k th derivative of t k is nonzero at t = 0, we have M ( k ) (0) = − ω k q − (0) = − ω k , M (2 k ) (0) = − ω k (cid:18) kk (cid:19) d k dt k (1 + q ω ( t )) − (cid:12)(cid:12) t =0 . Therefore, κ k ( M (1) (0) , . . . 
, M (2 k ) (0)) = c k M (2 k ) (0) + c k,k (cid:16) M ( k ) (0) (cid:17) = − c k ω k (cid:18) kk (cid:19) d k dt k (1 + q ω ( t )) − (cid:12)(cid:12) t =0 + c k,k ω k = 2 c k,k ω k d k dt k (1 + q ω ( t )) − (cid:12)(cid:12) t =0 + c k,k ω k (A.4)(We used that c k = 1 and (cid:0) kk (cid:1) = − c k,k , proved in Lemma A.2). Subtracting c k,k ω k and multiplying by − B. For C. , define G ( t ) = E (cid:2) e Xt (cid:3) , G ∗ ( t ) = E (cid:2) e X ∗ t (cid:3) , G ω ( t ) = E X (cid:2) e W t (cid:3) . Note that G ω ( t ) = 1 + q ω ( t ) + t k +1 r ω ( t ) for some smooth function r ω ( t ). By Taylorexpanding (1 + q ω ) − and (1 + q ω + t k +1 r ω ) − in a neighborhood of 0, we get(1 + q ω ) − = 1 + ∞ X p =1 ( − p q pω , G − ω = 1 + ∞ X p =1 ( − p ( q ω + t k +1 r ω ) p . Combining like powers of t (justified by the absolute convergence of both series ina neighborhood of zero), we see that the order t k term in both series are the same,and hence d k dt k (1 + q ω ( t )) − (cid:12)(cid:12) t =0 = d k dt k G ω ( t ) − (cid:12)(cid:12) t =0 . We now take the X ∗ expectationand bring it inside the derivative. This is justified by the absolute convergence ofthe series and its derivatives, and the fact that X ∗ is bounded. Now, note thatsince W = X − X ∗ , we can write G ω ( t ) = e − tX ∗ G ( t ), so that E X ∗ (cid:2) G ω ( t ) − (cid:3) = G ∗ ( t ) /G ( t ) . Finally, G ∗ /G = (1 + q ∗ + t k +1 r ∗ )(1 + q + t k +1 r ) − for some smooth function r ∗ and r . By a similar Taylor expansion argument as before, the t k coefficient of theexpansion of G ∗ /G around t = 0 is the same as that of (1 + q ∗ ) / (1 + q ). This concludes C. , and we turn to D. We have1 + q ∗ q = 1 + ( q ∗ − q ) + ( q ∗ − q ) k X p =1 ( − q ) p + t k +1 c ( t ) ! = l . o . t . + µ ∗ k − µ k k ! t k + k X ℓ =1 µ ∗ ℓ − µ ℓ ℓ ! t ℓ k X p =1 ( − q ) p + h . o . t . (A.5)where t k +1 c ( t ) denotes the higher order terms in the expansion of (1 + q ) − , l.o.t.denotes terms t j for j < k and h.o.t. denotes terms t j for j > k . Now, note thatthe lowest order power of t appearing in P kp =1 ( − q ) p is t . Hence, the product( µ ∗ k − µ k ) t k P kp =1 ( − q ) p only involves terms t j , j > k , so we can discard ℓ = k from the sum. For each ℓ = 1 , . . . , k − 1, we collect the terms involving t k − ℓ in P kp =1 ( − q ) p . Since the coefficient of t j in q is µ j /j !, we collect products of µ j ’s withtotal moment order k − ℓ . Hence, the sum of all the t k − ℓ coefficients appearingin P kp =1 ( − q ) p can be written in the form P λ ∈ S k − ℓ d λ Q ℓ ′ ∈ λ µ ℓ ′ . Combining theseobservations, we have1 + q ∗ q = l . o . t . + (cid:18) µ ∗ k − µ k + k − X ℓ =1 ( µ ∗ ℓ − µ ℓ ) X λ ∈ S k − ℓ d λ Y ℓ ′ ∈ λ µ ℓ ′ (cid:19) t k k ! + h . o . t ., (A.6)where the expression in parenthesis is the k th derivative of (1 + q ∗ ) / (1 + q ) at zero,and π = k − X ℓ =1 ( µ ∗ ℓ − µ ℓ ) X λ ∈ S k − ℓ d λ Y ℓ ′ ∈ λ µ ℓ ′ . It is clear to see that π depends only on µ k − , µ ∗ k − and that π ( µ ∗ k − , µ ∗ k − ) =0, and that the total moment order of each term is k . (cid:3) Recall that w = θ − θ ∗ , where θ ∼ ρ and θ ∗ ∼ ρ ∗ have compact support. Recallalso the moment tensors(A.7) T k = E θ ∼ ρ (cid:2) θ ⊗ k (cid:3) , T ∗ k = E θ ∗ ∼ ρ ∗ (cid:2) θ ⊗ k ∗ (cid:3) . and the space V k (cid:2) T k − , T ∗ k − (cid:3) of tensors given by sums of tensor products of T i , T ∗ j , i, j < k where each product has total moment order k . Finally, recall thetensor J k = X λ ∈ S k ℓ max = k c λ O ℓ ∈ λ \ k E θ (cid:2) w ⊗ ℓ (cid:3) . Lemma A.4. 
Lemma A.3 implies that E θ ∗ (cid:28) x ⊗ k , J k + c k,k E θ (cid:2) w ⊗ k − θ ⊗ k (cid:3) (cid:29) = (cid:28) x ⊗ k , − c k,k T ∗ k + c k,k T k + Q k (cid:29) , where Q k ∈ V k (cid:2) T k − , T ∗ k − (cid:3) is x -independent.Proof. Take X = x T θ , X ∗ = x T θ ∗ , W = x T ( θ − θ ∗ ) = x T w . Then(A.8) µ j = (cid:10) x ⊗ j , T j (cid:11) , µ ∗ j = (cid:10) x ⊗ j , T ∗ j (cid:11) , ω j = (cid:10) x ⊗ j , E θ (cid:2) w ⊗ j (cid:3)(cid:11) . IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 39 Using (A.3), we then have E θ ∗ (cid:28) x ⊗ k , J k + c k,k E θ (cid:2) w ⊗ k − θ ⊗ k (cid:3) (cid:29) = E X ∗ (cid:20)(cid:18) X λ ∈ S k ℓ max = k c λ Y ℓ ∈ λ \ k ω ℓ (cid:19) + c k,k ( ω k − µ k ) (cid:21) = E X ∗ (cid:20)(cid:18) X λ ∈ S k ℓ max = k c λ Y ℓ ∈ λ \ k ω ℓ (cid:19) + c k,k ω k (cid:21) − c k,k µ k = c k,k µ k − c k,k µ ∗ k + π ( µ k − , µ ∗ k − )= (cid:28) x ⊗ k , − c k,k T ∗ k + c k,k T k (cid:29) + π ( µ k − , µ ∗ k − ) . (A.9)Now, we show in Lemma A.3 that π has the form π ( µ k − , µ ∗ k − ) = k − X ℓ =1 ( µ ∗ ℓ − µ ℓ ) X λ ∈ S k − ℓ d λ Y ℓ ′ ∈ λ µ ℓ ′ . Using the representation (A.8), we can write π as π ( µ k − , µ ∗ k − ) = k − X ℓ =1 (cid:10) x ⊗ ℓ , T ∗ ℓ − T ℓ (cid:11) X λ ∈ S k − ℓ d λ Y ℓ ′ ∈ λ D x ⊗ ℓ ′ , T ℓ ′ E = * x ⊗ k , k − X ℓ =1 ( T ∗ ℓ − T ℓ ) ⊗ X λ ∈ S k − ℓ d λ O ℓ ′ ∈ λ T ℓ ′ + (A.10)The tensor on the right hand side is the desired Q k . We clearly have Q ( T k − = T ∗ k − , T ∗ k − ) = 0. (cid:3) Appendix B. Moment Tensor Computations Let θ, θ ∗ ∈ R d be random vectors distributed according to distributions ρ and ρ ∗ respectively, each of which is a compactly supported probability measure. Recallthe moment tensors T k = T k ( ρ ) = E θ ∼ ρ (cid:2) θ ⊗ k (cid:3) ,T ∗ k = T k ( ρ ∗ ) = E θ ∗ ∼ ρ ∗ (cid:2) θ ⊗ k ∗ (cid:3) . (B.1)For a multi-index ℓ = ( ℓ , . . . , ℓ k ), we write T ℓk to denote the tensor entry T ℓ ,...,ℓ k k .Recall the spaces V k , R k from Definition 3. We define these spaces more formallyhere. Definition 4. Consider finite lists I, J of positive integers, which may includeseveral of the same number, satisfying(B.2) max i ∈ I i ≤ m, max j ∈ J j ≤ n, X i ∈ I i + X j ∈ J i = k. Consider the set of tensors S k,m,n = (cid:26)Q i ∈ I T i ⊗ Q j ∈ J T ∗ j (cid:12)(cid:12)(cid:12)(cid:12) I, J satisfy (B.2) (cid:27) . Wedefine the following spaces V k [ T m , T ∗ n ] = span S k,m,n ,R k [ T m , T ∗ n ] = span n h A, T i | A ∈ R d ⊗ k , T ∈ S k,m,n o (B.3) Example B.1. The set S , , is given by S , , = { T ⊗ T , T ⊗ T ∗ , T ∗ ⊗ T } . Hence, tensors in the space V [ T , T ∗ ] are of the form aT ⊗ T + bT ⊗ T ∗ + cT ∗ ⊗ T ∗ while polynomials in the space R [ T , T ∗ ] are of the form h A, T ⊗ T i + h B, T ⊗ T ∗ i + h C, T ∗ ⊗ T ∗ i . We begin with a result about moment tensors for the orbit recovery model. Proposition B.2. Let G ⊂ O ( d ) ⊂ R d × d be a subgroup of the orthogonal group indimension d , and θ, θ ∗ ∈ R d be deterministic vectors. Let γ be a Haar measure on G , and let T k , T ∗ k be the moment tensors of the distributions g θ, g θ ∗ , g ∼ γ , i.e. (B.4) T k = E g ∼ γ h ( g θ ) ⊗ k i , T ∗ k = E g ∼ γ h ( g θ ∗ ) ⊗ k i . Then for tensors Q ∈ V k [ T m , T ∗ n ] , we have (B.5) h T k , Q i ∈ R k [ T m , T ∗ n ] . The proposition will follow from the following lemma. Lemma B.3. Let I, J satisfy (B.2) . Then for moment tensors (B.4) , we have (B.6) * T k , Y i ∈ I T i ⊗ Y j ∈ J T ∗ j + = Y i ∈ I h T i , T i i Y j ∈ J (cid:10) T j , T ∗ j (cid:11) . Proof of Proposition B.2. 
To prove the proposition, it suffices to show h T k , Q i ∈ R k [ T m , T ∗ n ] for Q ∈ S k,m,n . Let Q = Q i ∈ I T i ⊗ Q j ∈ J T ∗ j . By the lemma, wehave h T k , Q i = Y i ∈ I h T i , T i i Y j ∈ J (cid:10) T j , T ∗ j (cid:11) . Now, for an order 2 p multi-index ℓ , write ℓ = ( ℓ , ℓ ), where ℓ and ℓ are order p multi-indices. Define the order 2 p tensor M p by M ℓ ,ℓ p = 1 if ℓ = ℓ and M ℓ ,ℓ p = 0 otherwise. We can then write h T i , T i i = h M i , T i ⊗ T i i and h T j , T ∗ j i = h M j , T j ⊗ T ∗ j i . Thus, Y i ∈ I h T i , T i i Y j ∈ J (cid:10) T j , T ∗ j (cid:11) = Y i ∈ I h M i , T i ⊗ T i i Y j ∈ J (cid:10) M j , T j ⊗ T ∗ j (cid:11) = *Y i ∈ I M i ⊗ Y j ∈ J M j , Y i ∈ I T i ⊗ T i ⊗ Y j ∈ J T j ⊗ T ∗ j . + (B.7)The tensor product on the right has total order 2 k and involves only the tensors T m , T ∗ n . It therefore lies in S k,m,n , so its inner product with a constant tensorbelongs to R k [ T m , T ∗ n ]. (cid:3) IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 41 Proof of Lemma B.3. Since the elements g of the group G act on vectors by or-thogonal transformation, we have h gθ, hθ i = (cid:10) θ, g − hθ (cid:11) = (cid:10) g ′ θ, g ′ g − hθ (cid:11) ∀ g, g ′ , h ∈ G. This implies D ( gθ ) ⊗ ℓ , T ℓ E = E h ∼ ρ h gθ, hθ i ℓ = E h ∼ ρ (cid:10) g ′ θ, g ′ g − hθ (cid:11) ℓ = D ( g ′ θ ) ⊗ ℓ , E h ∼ ρ h(cid:0) g ′ g − hθ (cid:1) ⊗ ℓ iE = (cid:10) ( g ′ θ ) ⊗ ℓ , T ℓ (cid:11) ∀ g, g ′ ∈ G, (B.8)since the measure ρ on G is Haar and therefore invariant under multiplication bygroup elements. Similarly, h ( gθ ) ⊗ ℓ , T ∗ ℓ i = h ( g ′ θ ) ⊗ ℓ , T ∗ ℓ i ∀ g, g ′ ∈ G . Averagingover g ′ ∈ G then gives D ( gθ ) ⊗ ℓ , T ℓ E = h T ℓ , T ℓ i , D ( gθ ) ⊗ ℓ , T ∗ ℓ E = h T ℓ , T ∗ ℓ i ∀ g ∈ G. We therefore have * T k , Y i ∈ I T i ⊗ Y j ∈ J T ∗ j + = E g ∼ ρ * ( gθ ) ⊗ k , Y i ∈ I T i ⊗ Y j ∈ J T ∗ j + = E g ∼ ρ Y i ∈ I (cid:10) ( gθ ) ⊗ i , T i (cid:11) Y j ∈ J (cid:10) ( gθ ) ⊗ j , T ∗ j (cid:11) = Y i ∈ I h T i , T i i Y j ∈ J (cid:10) T j , T ∗ j (cid:11) . (B.9) (cid:3) We return now to the general setting in which θ ∼ ρ, θ ∗ ∼ ρ ∗ are random.Recall that the entries of θ are denoted with superscripts, i.e. θ = ( θ , . . . , θ d ) andsimilarly for θ ∗ . Lemma B.4. Let v , . . . , v n be multi-indices, where the indices in each multi-indexare between and d . Let | v j | denote the number of indices in v j . Define L =max {| v | , | v | , . . . , | v n |} and M = | v | + | v | + · · · + | v n | , and let (B.10) a ( θ, θ ∗ ) = E θ ∗ n Y j =1 E θ Y k ∈ v j ( θ k − θ k ∗ ) Then a ( θ, θ ∗ ) ∈ R M [ T L , T ∗ M ] . Proof. For an arbitrary multi-index v , we have E θ "Y k ∈ v ( θ k − θ k ∗ ) = E θ X I ⊂ v Y i ∈ I θ i ( − | v |−| I | Y j ∈ v \ I θ j ∗ = X I ⊂ v E θ "Y i ∈ I θ i ( − | v |−| I | Y j ∈ v \ I θ j ∗ = X I ⊂ v T I | I | ( − | v |−| I | Y j ∈ v \ I θ j ∗ . (B.11) Note that | I | ≤ | v | . Now, applying (B.11) for v = v , . . . , v n and multiplyingthe results together yields entries of θ -moment tensors of order no higher than L = max ≤ i ≤ n | v i | as well as products of at most M = | v | + · · · + | v n | entries of θ ∗ .Upon taking the θ ∗ -expectation, these products will become entries of θ ∗ -momenttensors of order at most M . Noting that the sum of the moment orders in theproduct (B.10) is M , we see that a ( θ, θ ∗ ) ∈ R M [ T L , T ∗ M ], as desired. (cid:3) Recall the definition J p = X λ ∈ S p ℓ max = p c λ O ℓ ∈ λ \ p E θ (cid:2) w ⊗ ℓ (cid:3) , where w = θ − θ ∗ . Lemma B.5. 
Define r ( θ, θ ∗ ) = E θ ∗ (cid:28) E θ (cid:2) w ⊗ p (cid:3) − T p , J p − c p,p T p (cid:29) . Then r ( θ, θ ∗ ) ∈ R p (cid:2) T p − , T ∗ p (cid:3) . Proof. Define Q p = E θ [ w ⊗ p ] − T p and Q ′ p = J p − c p,p E θ [ w ⊗ p ], so that (cid:28) E θ (cid:2) w ⊗ p (cid:3) − T p , J p − c p,p T p (cid:29) = (cid:10) Q p , Q ′ p + c p,p Q p (cid:11) . Let v be a multi-index with | v | = p . By (B.11) of Lemma B.4,we have(B.12) Q vp = E θ "Y k ∈ v ( θ k − θ k ∗ ) − Y k ∈ v θ k = X I $ v T I | I | ( − | v |−| I | Y i ∈ v \ I θ ∗ i . Thus, we have E θ ∗ h Q p , Q p i = X v X I,J $ v c I,J T I | I | T J | J | E θ ∗ Y i ∈ v \ I θ ∗ i Y j ∈ v \ J θ ∗ j = X v X I,J $ v c I,J T I | I | T J | J | T ∗ v \ I,v \ J p −| I |−| J | = X v X I,J $ v D δ I,J,v , T | I | ⊗ T | J | ⊗ T ∗ p −| I |−| J | E , (B.13)where δ I,J,v ∈ (cid:0) R d (cid:1) ⊗ p is such that δ ℓI,J,v = 1 if ℓ = ( I, J, v \ I, v \ J ) and δ ℓI,J,v = 0otherwise. Thus, E θ ∗ h Q p , Q p i ∈ R p (cid:2) T p − , T ∗ p (cid:3) . Now, Q ′ p is given by(B.14) Q ′ p = J p − c p,p E θ (cid:2) w ⊗ p (cid:3) = X λ ∈ S p λ max = p, max( λ \ p )
1. We can therefore similarly showthat E θ ∗ (cid:10) Q p , Q ′ p (cid:11) ∈ R p (cid:2) T p − , T ∗ p (cid:3) . (cid:3) IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 43 Recall that to finish the proof of Proposition 5.5, we need to compute X λ ∈ S k ℓ max = k c λ E W E θ (cid:2) ( w T W ) k (cid:3) Y ℓ ∈ λ \ k E θ (cid:2) ( w T W ) ℓ (cid:3) , where W = Z + iZ ′ , and Z, Z ′ ∈ R d are i.i.d. standard normal vectors. We computethe W -expectation of an arbitrary summand in the next proposition. Note that if λ ∈ S k and ℓ max = k , then λ \ k ∈ S k . We therefore rename λ \ k as λ ∈ S k . Proposition B.6. Let W = Z + iZ ′ , where Z, Z ′ ∈ R d are i.i.d. standard normalrandom vectors, and let λ ∈ S k . We have (B.15) E W " E θ (cid:2) ( w T W ) k (cid:3) Y ℓ ∈ λ E θ (cid:2) ( w T W ) ℓ (cid:3) = 2 k k ! * E θ (cid:2) w ⊗ k (cid:3) , O ℓ ∈ λ E θ (cid:2) w ⊗ ℓ (cid:3)+ . Proof. First, let w ( ℓ ) = θ ( ℓ ) − θ ∗ , where θ ( ℓ ) ∼ ρ, ℓ ∈ λ ∪ { k } are i.i.d. and θ ∗ isconsidered fixed throughout the proof. Define V ( ℓ ) = w T ( ℓ ) W, ℓ ∈ λ ∪ { k } . Note that( V ( k ) ; ( V ℓ ) ℓ ∈ λ ) is jointly circularly symmetric for any fixed values of w ( k ) , w ( ℓ ) , ℓ ∈ λ and that E W [ V ( ℓ ) V ( ℓ ′ ) ] = 2 w T ( ℓ ) w ( ℓ ′ ) , ℓ, ℓ ′ ∈ λ ∪ { k } . We write E W " E θ (cid:2) ( w T W ) k (cid:3) Y ℓ ∈ λ E θ (cid:2) ( w T W ) ℓ (cid:3) = E W " E θ ( k ) h ( w T ( k ) W ) k i Y ℓ ∈ λ E θ ( ℓ ) h ( w T ( ℓ ) W ) ℓ i = E θ ( k ) ,θ ( ℓ ) ,ℓ ∈ λ E W (cid:20) V ( k ) Y ℓ ∈ λ V ( ℓ ) (cid:21) . (B.16)Using a result in [FPR19], we have that E W (cid:20) V ( k ) Y ℓ ∈ λ V ( ℓ ) (cid:21) = k ! Y ℓ ∈ λ E W (cid:2) V ( k ) V ( ℓ ) (cid:3) ℓ = 2 k k ! Y ℓ ∈ λ ( w T ( k ) w ( ℓ ) ) ℓ = 2 k k ! * w ⊗ k ( k ) , O ℓ ∈ λ w ⊗ ℓ ( ℓ ) + . (B.17)We take the expectation with respect to θ ( k ) , θ ( ℓ ) , ℓ ∈ λ , and bring the expectationsinside the products to conclude. (cid:3) Appendix C. Estimates for EM Let χ be a random variable supported in a set X . We denote samples from χ by χ . Let θ = ( θ χ ) χ ∈ X , where θ χ ∈ R d , χ ∈ X . For the purposes of this section X need not be finite. We define k θ k ∞ = sup χ ∈ X k θ χ k , and T k ( θ ) = E χ ∼ ρ (cid:2) θ ⊗ k χ (cid:3) .Recall from Section 3 that we may write the ground truth GMM as Y = σZ + θ ∗ χ , χ ∼ ρ. We will assume k θ ∗ k ∞ = 1 and that ρ is known. Recall that the standard EMupdate is given by θ ( t +1) χ = G χ ( θ ( t ) ), where G χ ( θ ) = E Y [ w θ ( Y, χ ) Y ] E Y [ w θ ( Y, χ )] , and w θ ( Y, χ ) = g (cid:18) Y − θ χ σ (cid:19) (cid:30) E χ ∼ ρ (cid:20) g (cid:18) Y − θ χ σ (cid:19)(cid:21) Here, g denotes the standard normal density in R d . We will write G to denote( G χ ) χ ∈ X . In the following two lemmas and proposition, we let R > k θ ∗ k ∞ = 1. Lemma C.1. For all θ such that ( k θ k ∞ ∨ /σ ≤ R and for all χ ∈ X , we have E Y [ w θ ( Y, χ )] = 1 + ǫ ( χ, θ, θ ∗ ) , where | ǫ ( χ, θ, θ ∗ ) | ≤ C (cid:18) k θ k ∞ ∨ σ (cid:19) , and C depends on R and d only.Proof. Let χ ′ be an independent copy of χ . Analogously to the proof of the maintheorem, we write Y = σZ + θ ∗ χ ′ . Since θ, θ ∗ , χ are fixed throughout the proof, andsince Y depends on σ , Z , and χ ′ , we rename w θ ( Y, χ ) as f ( t, Z, χ ′ ), where t = 1 /σ .Let δ = sup χ,χ ′ ∈ X k θ χ − θ ∗ χ ′ k , so that δ/σ ≤ R . Also, define v ( χ , χ ′ ) = θ χ − θ ∗ χ ′ .(This corresponds to w = θ − θ ∗ in the proof of the main theorem) and note that k v k ≤ δ . 
Now, let p ( t, v, Z ) = ( Z T v ) t − k v k t , and note that g ( Z − v/σ ) = exp (cid:18) − k Z k (cid:19) exp p (1 /σ, v, Z ) . We then have f ( t, Z, χ ′ ) = exp p (cid:18) t, v ( χ, χ ′ ) , Z (cid:19)(cid:30) E χ ∼ ρ (cid:20) exp p (cid:18) t, v ( χ , χ ′ ) , Z (cid:19)(cid:21) . Let M ( t, Z, χ ′ ) denote the denominator of f , so that f = e p /M . We expand f around t = 0, and take its expectation with respect to Z, χ ′ . In the following,we let f ′ , f ′′ denote the first and second partial derivative of f with respect to t ,respectively; t -derivatives of p and M are written analogously. We also suppressthe arguments Z, χ ′ on the right hand side. We have(C.1) E Z, χ ′ [ f ( t, Z, χ ′ )] = 1 + t E Z, χ ′ [ f ′ (0)] + 12 t E Z, χ ′ [ f ′′ ( ξ )] , | ξ | ≤ t. Now, f ′ = f ( p ′ − M ′ /M ). Note that p ′ = Z T v − k v k t and M ′ = E χ [ e p p ′ ] . Hence, p ′ (0) = Z T v , M (0) = 1, and M ′ (0) = Z T E χ [ v ] . Combining, we see that f ′ (0) = Z T ( v − E χ [ v ]), which has zero Z -expectation. We now compute the secondderivative of f . We have f ′′ = f (cid:0) p ′′ − M ′′ /M + ( M ′ /M ) (cid:1) + f ( p ′ − ( M ′ /M )) , so that | f ′′ | ≤ | p ′′ | + | M ′′ /M | + 3 | M ′ /M | + 2 | p ′ | since | f | < 1. Recall from (5.20) that (cid:12)(cid:12)(cid:12)(cid:12) ∂ ℓt M ( ξ ) M ( ξ ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ δ ℓ E Z ′ ( k W k + δ | t | ) ℓ , (C.2) IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 45 where W = Z + iZ ′ and Z ′ is an independent copy of Z . We therefore have | M ′ /M | ≤ | M ′′ /M | , and using that δ | t | = δ/σ ≤ R , we have | M ′′ /M | ≤ δ E Z ′ (cid:2) ( k W k + 2 R ) (cid:3) . Now, | p ′ ( ξ ) | ≤ δ ( k Z k + δ | t | ) ≤ δ E Z ′ h ( k W k + 2 R ) i and | p ′′ ( ξ ) | ≤ δ . Combin-ing these estimates we obtain(C.3) | f ′′ ( ξ ) | ≤ δ + 6 δ E Z ′ h ( k W k + 2 R ) i ≤ δ ( C + k Z k ) . Taking the expectation of both sides of the inequality gives (cid:12)(cid:12) E Z, χ ′ [ f ′′ ( ξ )] (cid:12)(cid:12) ≤ Cδ . Substituting these calculations into (C.1) and taking t = 1 /σ , we obtain E Y [ w θ ( Y, χ )] = E Z, χ ′ [ f (1 /σ, Z, χ ′ )] = 1 + ǫ ( χ, θ, θ ∗ ) , where | ǫ ( χ, θ, θ ∗ ) | ≤ σ − (cid:12)(cid:12) E Z, χ ′ [ f ′′ ( ξ )] (cid:12)(cid:12) ≤ C (cid:18) δσ (cid:19) ≤ C (cid:18) k θ k ∞ ∨ σ (cid:19) . (cid:3) Lemma C.2. For all θ such that ( k θ k ∞ ∨ /σ ≤ R and for all χ ∈ X , we have E Y [ w θ ( Y, χ ) Y ] = T ( θ ∗ ) + θ χ − T ( θ ) + ǫ ( χ, θ, θ ∗ ) , where k ǫ ( χ, θ, θ ∗ ) k ≤ C ( k θ k ∞ ∨ σ , and C depends on R and d only.Proof. Recall from the previous Lemma that δ = sup χ,χ ′ ∈ X k θ χ − θ ∗ χ ′ k , so that δ/σ ≤ R . Also, we defined t = 1 /σ and expressed w θ ( Y, χ ) for fixed χ and θ as f ( t, Z, χ ′ ). We Taylor expanded f around t = 0 to second order as: f ( t, Z, χ ′ ) = 1 + tf ′ (0) + 12 t f ′′ ( ξ ) = 1 + tZ T ( v − E χ [ v ]) + 12 t f ′′ ( ξ )= 1 + tZ T ( θ χ − T ( θ )) + 12 t f ′′ ( ξ )(C.4)where | f ′′ ( ξ ) | ≤ δ ( C + k Z k ) for an absolute constant C . We now multiply thisTaylor expansion by Y = (1 /t ) Z + θ ∗ χ ′ , and take its Z, χ ′ -expectation: E Z, χ ′ (cid:20) ( 1 t Z + θ ∗ χ ′ ) f ( t, Z, χ ′ ) (cid:21) = E Z, χ ′ (cid:20) t Z + θ ∗ χ ′ + ZZ T ( θ χ − T ( θ ))+ tZ T ( θ χ − T ( θ )) θ ∗ χ ′ + 12 tf ′′ ( ξ ) Z + θ ∗ χ ′ t f ′′ ( ξ ) (cid:21) = T ( θ ∗ ) + θ χ − T ( θ ) + ǫ ( χ, θ, θ ∗ ) , (C.5)where(C.6) ǫ ( χ, θ, θ ∗ ) = 12 t E Z, χ ′ [ f ′′ ( ξ ) Z ] + 12 t E Z, χ ′ [ f ′′ ( ξ ) θ ∗ χ ′ ] . 
Now, using t = 1 /σ, δ/σ ≤ R, and δ ≤ k θ k ∞ ∨ (cid:13)(cid:13) t E Z, χ ′ [ f ′′ ( ξ ) Z ] (cid:13)(cid:13) ≤ 12 ( δ /σ ) E Z (cid:2) k Z k ( C + k Z k ) (cid:3) ≤ C ( k θ k ∞ ∨ /σ and (cid:13)(cid:13) t E Z, χ ′ [ f ′′ ( ξ ) θ ∗ χ ′ ] (cid:13)(cid:13) ≤ ( δ/σ ) E Z (cid:2) C + k Z k (cid:3) ≤ C ( k θ k ∞ ∨ /σ. Adding the two upper bounds together, we have k ǫ ( χ, θ, θ ∗ ) k ≤ C ( k θ k ∞ ∨ /σ, as desired. (cid:3) Proposition C.3. For all θ such that ( k θ k ∞ ∨ /σ ≤ R , we have k T ( G ( θ )) − T ( θ ∗ ) kk θ k ∞ ∨ ≤ C ( k θ k ∞ ∨ /σ for some constant C that depends on d and R only.Proof. Recall that G χ ( θ ) = E Y [ w θ ( Y, χ ) Y ] (cid:14) E Y [ w θ ( Y, χ )] . From the previous twolemmas, we have E Y [ w θ ( Y, χ )] = 1 + ǫ ( χ ) , and E Y [ w θ ( Y, χ ) Y ] = T ∗ + θ χ − T ( θ ) + ǫ ( χ ). We then have(C.7) G χ ( θ ) = T ∗ + θ χ − T ( θ ) + ǫ ǫ = T ∗ + θ χ − T ( θ ) + ǫ ( χ ) , where ǫ = ǫ + ( T ( θ ∗ ) + θ χ − T ( θ )) ǫ ǫ . Using the bounds on ǫ , ǫ from the lemmas, we have k ǫ k ≤ k ǫ k + | ǫ |k T ( θ ∗ ) + θ χ − T ( θ ) k≤ k ǫ k + C | ǫ | ( k θ k ∞ ∨ ≤ C ( k θ k ∞ ∨ /σ. (C.8)Averaging (C.7) over χ then gives T ( G ( θ )) = E χ ∼ ρ G χ ( θ ) = T ∗ + T − T + E χ ǫ ( χ ) = T ∗ + E χ ǫ ( χ ) . But the bound (C.8) on ǫ ( χ ) is independent of χ , so we have k T ( G ( θ )) − T ∗ k ≤ E χ k ǫ ( χ ) k ≤ C ( k θ k ∞ ∨ /σ, as desired. (cid:3) Appendix D. Miscellany Lemma D.1. Let f : R K → R and M ⊂ R K be a smooth function and manifold,respectively, where M is defined as the intersection of the level surfaces of functions g , . . . , g n . Then x is a critical point of f | M iff there exist λ , . . . , λ n ∈ R such that (D.1) ∇ f ( x ) = n X j =1 λ j ∇ g j ( x ) . Moreover, a critical point x is a saddle, local minimum, or local maximum of f | M iff the quadratic form (D.2) ∇ f ( x ) − n X j =1 λ j ∇ g j ( x ) is indeterminate, positive definite, or negative definite, respectively, on the tangentplane to M at x . IKELIHOOD MAXIMIZATION & MOMENT MATCHING IN LOW SNR GMMS 47 Proof. The first condition is standard. To show the second condition, note thata critical point x is a local minimum (maximum) of f | M if and only if for everycurve u ( t ) ⊂ M going through x , the function t f ( u ( t )) has a local minimum(maximum) at t = t x , the point for which u ( t ) = x .Let u be such a curve. Now, ( f ◦ u ) ′ ( t x ) = h∇ f ( x ) , u ′ ( t x ) i = 0, since ∇ f ( x ) liesin the span of ∇ g j ( x ) , j = 1 , . . . , n while u ′ ( t x ) lies in the tangent plane to M at x . Hence t f ( u ( t )) has a critical point at t x . Note that d dt f ( u ( t )) (cid:12)(cid:12) t = t x = h∇ f ( x ) , u ′′ ( t x ) i + u ′ ( t x ) T ∇ f ( x ) u ′ ( t x )= n X j =1 λ j h∇ g j ( x ) , u ′′ ( t x ) i + u ′ ( t x ) T ∇ f ( x ) u ′ ( t x )= u ′ ( t x ) T ∇ f ( x ) − n X j =1 λ j ∇ g j ( x ) u ′ ( t x ) . (D.3)The last line follows from the fact that0 = d dθ g j ( u ( t x )) = h∇ g j ( x ) , u ′′ ( t x ) i + u ′ ( t x ) T ∇ g j ( x ) u ′ ( t x ) , j = 1 , . . . , n. Now, u ′ ( t x ) is any vector in the tangent plane of M at x , i.e. any vector perpen-dicular to ∇ g j ( x ) , j = 1 , . . . , n . Thus, for d dt f ( u ( t x )) to have the same sign for allcurves u , the quadratic form ∇ f ( x ) − P nj =1 λ j ∇ g j ( x ) must be determinate onthe tangent plane at x ..
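As a concrete illustration of Lemma D.1 (not part of the text), the following sketch classifies constrained critical points numerically: it solves for the Lagrange multiplier, forms the quadratic form (D.2) from the Hessians, and checks its definiteness on an orthonormal basis of the tangent plane of $M$. The objective and constraint here, $f(x) = x_1^2 + 2x_2^2 + 3x_3^2$ restricted to the unit sphere, are an arbitrary toy example.

import numpy as np

A = np.diag([1.0, 2.0, 3.0])
f_hess = 2 * A                                   # Hessian of f(x) = x^T A x
g_hess = 2 * np.eye(3)                           # Hessian of g(x) = ||x||^2 - 1

def classify(x):
    grad_f = 2 * A @ x
    grad_g = 2 * x
    lam = (grad_f @ grad_g) / (grad_g @ grad_g)  # Lagrange multiplier from (D.1)
    assert np.allclose(grad_f, lam * grad_g), "x is not a critical point of f restricted to M"
    H = f_hess - lam * g_hess                    # the quadratic form (D.2)
    # orthonormal basis of the tangent plane {v : grad_g . v = 0}
    Q, _ = np.linalg.qr(np.column_stack([grad_g, np.eye(3)]))
    tangent = Q[:, 1:]
    eigs = np.linalg.eigvalsh(tangent.T @ H @ tangent)
    if np.all(eigs > 0):
        return "local minimum"
    if np.all(eigs < 0):
        return "local maximum"
    return "saddle"

for x in (np.array([1.0, 0, 0]), np.array([0, 1.0, 0]), np.array([0, 0, 1.0])):
    print(x, classify(x))                        # expected: minimum, saddle, maximum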