Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences
Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J. Wainwright, Michael Jordan
Chi Jin†  Yuchen Zhang†  Sivaraman Balakrishnan∗  Martin J. Wainwright†,‡  Michael Jordan†,‡
† Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
‡ Department of Statistics, University of California, Berkeley
∗ Department of Statistics, Carnegie Mellon University
September 6, 2016
Abstract
We provide two fundamental results on the population (infinite-sample) likelihood function of Gaussian mixture models with M ≥ 3 components. Our first main result shows that the population likelihood function has bad local maxima even in the special case of equally-weighted mixtures of well-separated and spherical Gaussians. We prove that the log-likelihood value of these bad local maxima can be arbitrarily worse than that of any global optimum, thereby resolving an open question of Srebro [21]. Our second main result shows that the EM algorithm (or a first-order variant of it) with random initialization will converge to bad critical points with probability at least 1 − e^{−Ω(M)}. We further establish that a first-order variant of EM will not converge to strict saddle points almost surely, indicating that the poor performance of the first-order method can be attributed to the existence of bad local maxima rather than bad saddle points. Overall, our results highlight the necessity of careful initialization when using the EM algorithm in practice, even when applied in highly favorable settings.

1 Introduction

Finite mixture models are widely used in a variety of statistical settings: as models for heterogeneous populations, as flexible models for multivariate density estimation, and as models for clustering. Their ability to model data as arising from underlying subpopulations provides essential flexibility in a wide range of applications [23]. This combinatorial structure also creates challenges for statistical and computational theory, and many problems associated with the estimation of finite mixtures are still open. These problems are often studied in the setting of Gaussian mixture models (GMMs), reflecting the wide use of GMMs in applications, particularly in the multivariate setting, and this setting will also be our focus in the current paper.

Early work [22] studied the identifiability of finite mixture models, and this problem has continued to attract significant interest (see Allman et al. [1] for a recent overview). More recent theoretical work has focused on issues related to the use of GMMs for the density estimation problem [12, 13]. Focusing on rates of convergence for parameter estimation in GMMs, Chen [7] established the surprising result that when the number of mixture components is unknown, the standard √n-rate for regular parametric models is not achievable. Recent investigations [14] into exact-fitted, under-fitted and over-fitted GMMs have characterized the achievable rates of convergence in these settings.

From an algorithmic perspective, the dominant practical method for estimating GMMs is the Expectation-Maximization (EM) algorithm [9]. The EM algorithm is an ascent method for maximizing the likelihood, but it is only guaranteed to converge to a stationary point of the likelihood function. As such, there are no general guarantees for the quality of the estimate produced via the EM algorithm for Gaussian mixture models. This has led researchers to explore various alternative algorithms which are computationally efficient, and for which rigorous statistical guarantees can be given. Broadly, these algorithms are based either on clustering [3, 6, 8, 24] or on the method of moments [5, 15, 18].

Although general guarantees have not yet emerged, there has nonetheless been substantial progress on the theoretical analysis of EM and its variations. Dasgupta and Schulman [8] analyzed a two-round variant of EM, which involved over-fitting the mixture and then pruning extra centers.
They showed that this algorithm can be used to estimate Gaussian mixture components whose means are separated by at least Ω(d^{1/4}). Balakrishnan et al. [4] studied the local convergence of the EM algorithm for a mixture of two Gaussians with Ω(1)-separation. Their results show that global optima have relatively large regions of attraction, but still require that the EM algorithm be provided with a reasonable initialization in order to ensure convergence to a near globally optimal solution.

To date, computationally efficient algorithms for estimating a GMM provide guarantees under the strong assumption that the samples come from a mixture of Gaussians, i.e., that the model is well-specified. In practice, however, we never expect the data to exactly follow the generative model, and it is important to understand the robustness of our algorithms to this assumption. In fact, maximum likelihood has favorable properties in this regard: maximum-likelihood estimates are well known to be robust to perturbations of the generative model in the Kullback-Leibler metric [10]. This mathematical result motivates further study of EM and other likelihood-based methods from the computational point of view. It would be useful to characterize when efficient algorithms can be used to compute a maximum likelihood estimate, or a solution that is nearly as accurate, and which retains the robustness properties of the maximum likelihood estimate.

In this paper, we focus our attention on uniformly weighted mixtures of M isotropic Gaussians. For this favorable setting, Srebro [21] conjectured that any local maximum of the likelihood function is a global maximum in the limit of infinite samples; in other words, that there are no bad local maxima for the population GMM likelihood function. This conjecture, if true, would provide strong theoretical justification for EM, at least for large sample sizes. For suitably small sample sizes, it is known [2] that configurations of the samples can be constructed which lead to the likelihood function having an unbounded number of local maxima. The conjecture of Srebro [21] avoids this by requiring that the samples come from the specified GMM, as well as by considering the (infinite-sample-size) population setting. In the context of high-dimensional regression, it has been observed that in some cases, despite the non-convexity of the objective, all local optima are statistically comparable to the global optimum [17, 25].

Footnote: In addition to issues of convergence to non-maximal stationary points, solutions of infinite likelihood exist for GMMs in which both the location and scale parameters are estimated. In practice, several methods exist to avoid such solutions. In this paper, we avoid this issue by focusing on GMMs in which the scale parameters are fixed.
A mixture of two spherical Gaussians:
A Gaussian mixture model with a single component is simply a Gaussian, so the conjecture of Srebro [21] holds trivially in this case. The first interesting case is a Gaussian mixture with two components, for which empirical evidence supports the conjecture that there are no bad local optima. It is possible to visualize the setting when there are only two components and to develop a more detailed understanding of the population likelihood surface.

Consider for instance a one-dimensional, equally weighted, unit-variance GMM with true centers µ*_1 = −4 and µ*_2 = 4, and consider the log-likelihood as a function of the vector µ := (µ_1, µ_2). Figure 1 shows both the population log-likelihood, µ ↦ L(µ), and the negative 2-norm of its gradient, µ ↦ −‖∇L(µ)‖₂. Observe that the only local maxima are the vectors (−4, 4) and (4, −4), which are both also global maxima. The only remaining critical point is (0, 0), which is a saddle point. Although points of the form (0, R) and (R, 0) have small gradient when |R| is large, the gradient is not exactly zero for any finite R. Rigorously resolving the question of existence or non-existence of local maxima for the setting when M = 2 remains an open problem.

In the remainder of our paper, we focus our attention on the setting where there are more than two mixture components and attempt to develop a broader understanding of likelihood surfaces for these models, as well as the consequences for algorithms.

Figure 1. Illustration of the likelihood and gradient maps for a two-component Gaussian mixture. (a) Plot of the population log-likelihood map µ ↦ L(µ). (b) Plot of the negative Euclidean norm of the gradient map µ ↦ −‖∇L(µ)‖₂.
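The surface in panel (a) of Figure 1 is easy to reproduce numerically. The following sketch (ours, not taken from the paper) evaluates a Monte Carlo approximation of the population log-likelihood L(µ) on a grid for the two-component example above; the sample size, grid range, and resolution are arbitrary illustrative choices.

```python
import numpy as np

def pop_log_likelihood(mu, X):
    """Monte Carlo estimate of the population log-likelihood L(mu) for an
    equally weighted, unit-variance, two-component 1-D Gaussian mixture."""
    mu = np.asarray(mu, dtype=float)
    log_phi = -0.5 * (X[:, None] - mu[None, :]) ** 2 - 0.5 * np.log(2.0 * np.pi)
    return np.mean(np.logaddexp(log_phi[:, 0], log_phi[:, 1]) - np.log(2.0))

rng = np.random.default_rng(0)
true_centers = np.array([-4.0, 4.0])                     # the configuration shown in Figure 1
X = true_centers[rng.integers(0, 2, size=50_000)] + rng.standard_normal(50_000)

grid = np.linspace(-20.0, 20.0, 21)                      # coarse version of the plotted range
surface = np.array([[pop_log_likelihood((a, b), X) for b in grid] for a in grid])
print(surface.max(), pop_log_likelihood(true_centers, X))  # maxima sit near (-4, 4) and (4, -4)
```

Evaluating −‖∇L(µ)‖₂ on the same grid, for instance by finite differences, reproduces panel (b).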
Our first contribution is a negative answer to the open question of Srebro [21]. We construct a GMM that is a uniform mixture of three spherical, unit-variance, well-separated Gaussians whose population log-likelihood function contains local maxima. We further show that the log-likelihood of these local maxima can be arbitrarily worse than that of the global maxima. This result immediately implies that no local search algorithm can exhibit global convergence (meaning convergence to a global optimum from all possible starting points), even on well-separated mixtures of Gaussians.

The mere existence of bad local maxima is not a practical concern unless it turns out that natural algorithms are frequently trapped in these bad local maxima. Our second main result shows that the EM algorithm, as well as a variant thereof known as the first-order EM algorithm, with random initialization, converges to a bad critical point with exponentially high probability. In more detail, we consider the following practical scheme for parameter estimation in an M-component Gaussian mixture:

(a) Draw M i.i.d. points µ_1, ..., µ_M uniformly at random from the sample set.

(b) Run the EM or first-order EM algorithm to estimate the model parameters, using µ_1, ..., µ_M as the initial centers.

We note that in the limit of infinite samples, the initialization scheme we consider is equivalent to selecting M initial centers i.i.d. from the underlying mixture distribution. We show that for a universal constant c > 0, with probability at least 1 − e^{−cM}, the EM and first-order EM algorithms converge to a suboptimal critical point whose log-likelihood could be arbitrarily worse than that of the global maximum. Consequently, in order to find a solution with satisfactory log-likelihood via this initialization scheme, one needs to repeat the above scheme exponentially many (in M) times, and then select the solution with the highest log-likelihood. This result strongly indicates that repeated random initialization followed by local search (via either EM or its first-order variant) can fail to produce useful estimates under reasonable constraints on computational complexity.

We further prove that under the same random initialization scheme, the first-order EM algorithm with a suitable stepsize does not converge to a strict saddle point with probability one. This fact strongly suggests that the failure of local search methods for the GMM model is due mainly to the existence of bad local optima, and not due to the presence of (strict) saddle points.

Our proofs introduce new techniques to reason about the structure of the population log-likelihood, and in particular to show the existence of bad local optima. We expect that these general ideas will aid in developing a better understanding of the behavior of algorithms for non-convex optimization. From a practical standpoint, our results strongly suggest that careful initialization is required for local search methods, even in large-sample settings, and even for extremely well-behaved mixture models.

The remainder of this paper is organized as follows. In Section 2, we introduce GMMs, the EM algorithm and its first-order variant, and we formally set up the problem we consider. In Section 3, we state our main theoretical results and develop some of their implications. Section 4 is devoted to the proofs of our results, with some of the more technical aspects deferred to the appendices.
2 Background and problem set-up

In this section, we formally define the Gaussian mixture model that we study in the paper. We then describe the EM algorithm, the first-order EM algorithm, and the form of random initialization that we analyze. Throughout the paper, we use [M] to denote the set {1, 2, ..., M}, and N(µ, Σ) to denote the d-dimensional Gaussian distribution with mean vector µ and covariance matrix Σ. We use φ(· | µ, Σ) to denote the probability density function of the Gaussian distribution with mean vector µ and covariance matrix Σ:

φ(x | µ, Σ) := (1 / √((2π)^d det(Σ))) e^{−(x−µ)^⊤ Σ^{−1} (x−µ)/2}.   (1)

2.1 Gaussian Mixture Models

A d-dimensional Gaussian mixture model (GMM) with M components can be specified by a collection µ* = {µ*_1, ..., µ*_M} of d-dimensional mean vectors, a vector λ* = (λ*_1, ..., λ*_M) of non-negative mixture weights that sum to one, and a collection Σ* = {Σ*_1, ..., Σ*_M} of covariance matrices. Given these parameters, the density function of a Gaussian mixture model takes the form

p(x | λ*, µ*, Σ*) = Σ_{i=1}^M λ*_i φ(x | µ*_i, Σ*_i),

where the Gaussian density function φ was previously defined in equation (1). In this paper, we focus on the idealized situation in which every mixture component is equally weighted, and the covariance of each mixture component is the identity. This leads to a mixture model of the form

p(x | µ*) := (1/M) Σ_{i=1}^M φ(x | µ*_i, I),   (2)

which we denote by GMM(µ*). In this case, the only parameters to be estimated are the mean vectors µ* = {µ*_i}_{i=1}^M of the M components.

The difficulty of estimating a Gaussian mixture distribution depends on the amount of separation between the mean vectors. More precisely, for a given parameter ξ > 0, we say that the GMM(µ*)-model is ξ-separated if

‖µ*_i − µ*_j‖₂ ≥ ξ, for all distinct pairs i, j ∈ [M].   (3)

We say that the mixture is well-separated if condition (3) holds for some ξ = Ω(√d).

Suppose that we observe an i.i.d. sequence {x_ℓ}_{ℓ=1}^n drawn according to the distribution GMM(µ*), and our goal is to estimate the unknown collection of mean vectors µ*. The sample-based log-likelihood function L_n is given by

L_n(µ) := (1/n) Σ_{ℓ=1}^n log( (1/M) Σ_{i=1}^M φ(x_ℓ | µ_i, I) ).   (4a)

As the sample size n tends to infinity, this sample likelihood converges to the population log-likelihood function L given by

L(µ) = E_{µ*} [ log( (1/M) Σ_{i=1}^M φ(X | µ_i, I) ) ].   (4b)

Here E_{µ*} denotes expectation taken over the random vector X drawn according to the model GMM(µ*).

A straightforward implication of the positivity of the KL divergence is that the population likelihood function is in fact maximized at µ* (along with permutations thereof, depending on how we index the mixture components). On the basis of empirical evidence, Srebro [21] conjectured that this population log-likelihood is in fact well-behaved, in the sense of having no spurious local optima. In Theorem 1, we show that this intuition is false, and provide a simple example of a mixture of M = 3 well-separated Gaussians in dimension d = 1 whose population log-likelihood function has arbitrarily bad local optima.

2.2 Expectation-Maximization Algorithm

A natural way to estimate the mean vectors µ* is by attempting to maximize the sample log-likelihood defined by the samples {x_ℓ}_{ℓ=1}^n. For a non-degenerate Gaussian mixture model, the log-likelihood is non-concave.
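As a concrete reference point, the sample objective (4a) can be evaluated directly. The sketch below is ours, not code from the paper; the explicit log-sum-exp is only there to keep the computation numerically stable.

```python
import numpy as np

def sample_log_likelihood(mu, X):
    """Sample log-likelihood L_n(mu) of equation (4a) for the uniform,
    identity-covariance mixture GMM(mu).  mu: (M, d) centers, X: (n, d) samples."""
    mu, X = np.atleast_2d(mu), np.atleast_2d(X)
    M, d = mu.shape
    # log phi(x_l | mu_i, I) for every pair (l, i): shape (n, M)
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    log_phi = -0.5 * sq_dist - 0.5 * d * np.log(2.0 * np.pi)
    # stable log( (1/M) * sum_i phi(x_l | mu_i, I) ), averaged over the n samples
    top = log_phi.max(axis=1, keepdims=True)
    log_mix = top[:, 0] + np.log(np.exp(log_phi - top).sum(axis=1)) - np.log(M)
    return log_mix.mean()
```

Averaging this quantity over fresh samples drawn from GMM(µ*) gives a Monte Carlo approximation of the population objective (4b).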
Rather than attempting to maximize the log-likelihood directly, the EM algorithm proceeds by iteratively maximizing a lower bound on the log-likelihood. It does so by alternating between two steps:

1. E-step: For each i ∈ [M] and ℓ ∈ [n], compute the membership weight

w_i(x_ℓ) = φ(x_ℓ | µ_i, I) / Σ_{j=1}^M φ(x_ℓ | µ_j, I).
2. M-step: For each i ∈ [M], update the mean vector µ_i via

µ_i^{new} = ( Σ_{ℓ=1}^n w_i(x_ℓ) x_ℓ ) / ( Σ_{ℓ=1}^n w_i(x_ℓ) ).

In the population setting, the M-step becomes

µ_i^{new} = E_{µ*}[ w_i(X) X ] / E_{µ*}[ w_i(X) ].   (5)

Intuitively, the M-step updates the mean vector of each Gaussian component to be a weighted centroid of the samples, for appropriately chosen weights.
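A minimal sample-based implementation of one E-step/M-step pair for the uniform spherical model might look as follows. This is our sketch rather than the paper's code; the max-subtraction in the E-step is a standard stabilization device, not part of the algorithm's definition.

```python
import numpy as np

def em_step(mu, X):
    """One EM iteration for the uniform spherical mixture GMM(mu).
    mu: (M, d) current centers, X: (n, d) samples; returns the updated (M, d) centers."""
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # (n, M)
    # E-step: membership weights w_i(x_l); the common Gaussian normalizing constants cancel
    log_w = -0.5 * sq_dist
    log_w -= log_w.max(axis=1, keepdims=True)                        # numerical stability only
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: move each center to the weighted centroid of the samples
    return (w.T @ X) / w.sum(axis=0)[:, None]
```

Iterating em_step until the centers stop moving gives the usual sample-based EM fixed point; replacing the sample averages by expectations under GMM(µ*) gives the population update (5).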
First-order EM updates: For a general latent variable model with observed variables X = x, latent variables Z, and model parameters θ, by Jensen's inequality the log-likelihood function can be lower bounded as

log P(x | θ′) ≥ E_{Z∼P(·|x;θ)}[ log P(x, Z | θ′) ] − E_{Z∼P(·|x;θ)}[ log P(Z | x; θ′) ],

where we denote the first term on the right-hand side by Q(θ′ | θ). Each step of the EM algorithm can also be viewed as optimizing over this lower bound, which gives

θ^{new} := argmax_{θ′} Q(θ′ | θ).

There are many variants of the EM algorithm which rely on partial updates at each iteration instead of finding the exact optimum of Q(θ′ | θ). One important example, analyzed in the work of Balakrishnan et al. [4], is the first-order EM algorithm. The first-order EM algorithm takes a step along the gradient of the function Q(θ′ | θ) (with respect to its first argument) in each iteration. Concretely, given a step size s > 0, the first-order EM updates can be written as

θ^{new} = θ + s ∇_{θ′} Q(θ′ | θ) |_{θ′=θ}.

In the case of the model GMM(µ*), the gradient EM updates on the population objective take the form

µ_i^{new} = µ_i + s E_{µ*}[ w_i(X)(X − µ_i) ].   (6)

This update turns out to be equivalent to gradient ascent on the population likelihood L with step size s > 0 (see the paper [4] for details).

2.3 Random Initialization

Since the log-likelihood function is non-concave, the point to which the EM algorithm converges depends on the initial value of µ. In practice, it is standard to choose these values by some form of random initialization. For instance, one method is to initialize the mean vectors by sampling uniformly at random from the data set {x_ℓ}_{ℓ=1}^n. This scheme is intuitively reasonable, because it automatically adapts to the locations of the true centers. If the true centers have large mutual distances, then the initialized centers will also be scattered. Conversely, if the true centers concentrate in a small region of the space, then the initialized centers will also be close to each other. In practice, initializing µ by uniformly drawing from the data is often more reasonable than drawing µ from a fixed distribution.

In this paper, we analyze the EM algorithm and its variants at the population level. We focus on the above practical initialization scheme of selecting µ uniformly at random from the sample set. In the idealized population setting, this is equivalent to sampling the initial values of µ i.i.d. from the distribution GMM(µ*). Throughout this paper, we refer to this particular initialization strategy as random initialization.
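The two ingredients analyzed below, random initialization from the data and the first-order update (6), are short to write down in the sample setting. The following is our illustrative sketch, not code from the paper; the sample average stands in for the population expectation in (6), and the step size shown is an arbitrary choice.

```python
import numpy as np

def random_init(X, M, rng):
    """Random initialization of Section 2.3: draw the M initial centers as
    i.i.d. uniform picks from the sample set."""
    return X[rng.integers(0, len(X), size=M)].copy()

def first_order_em_step(mu, X, s=1.0):
    """One first-order (gradient) EM step, the sample analogue of update (6):
    mu_i <- mu_i + s * mean_l[ w_i(x_l) * (x_l - mu_i) ]."""
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-0.5 * (sq_dist - sq_dist.min(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)                                # membership weights, (n, M)
    grad = (w[:, :, None] * (X[:, None, :] - mu[None, :, :])).mean(axis=0)
    return mu + s * grad
```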
3 Main results

We now turn to the statements of our main results, along with a discussion of some of their consequences.

3.1 Structural results

In our first main result (Theorem 1), for any M ≥ 3, we exhibit an M-component mixture of Gaussians in dimension d = 1 for which the population log-likelihood has a bad local maximum whose log-likelihood is arbitrarily worse than that attained by the true parameters µ*. This result provides a negative answer to the conjecture of Srebro [21].

Theorem 1.
For any M ≥ 3 and any constant C_gap > 0, there is a well-separated uniform mixture of M unit-variance spherical Gaussians GMM(µ*) and a local maximum µ′ such that

L(µ′) ≤ L(µ*) − C_gap.

In order to illustrate the intuition underlying Theorem 1, we give a geometrical description of our construction for M = 3. Suppose that the true centers µ*_1, µ*_2 and µ*_3 are such that the distance between µ*_1 and µ*_2 is much smaller than the respective distances from µ*_1 to µ*_3, and from µ*_2 to µ*_3. Now, consider the point µ := (µ_1, µ_2, µ_3) where µ_1 = (µ*_1 + µ*_2)/2, and the points µ_2 and µ_3 are both placed at the true center µ*_3. This assignment does not maximize the population log-likelihood, because only one center is assigned to the two Gaussian components centered at µ*_1 and µ*_2, while two centers are assigned to the Gaussian component centered at µ*_3. However, when the components are well-separated we are able to show that there is a local maximum in the neighborhood of this configuration. In order to establish the existence of a local maximum, we first define a neighborhood of this configuration ensuring that it does not contain any global maximum, and then prove that the log-likelihood on the boundary of this neighborhood is strictly smaller than that of the sub-optimal configuration µ. Since the log-likelihood is bounded from above, this neighborhood must contain at least one maximum of the log-likelihood. Since the global maxima are not in this neighborhood by construction, any maximum in this neighborhood must be a local maximum. See Section 4 for a detailed proof.
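The likelihood gap in Theorem 1 is easy to observe numerically for a specific instance. The sketch below is our illustration, with arbitrarily chosen centers −1, 1 and 20 rather than the constants used in the proof; it compares a Monte Carlo estimate of L at the true parameters with its value at the merged configuration described above. It only exhibits the gap, and does not by itself certify that a nearby local maximum exists, which is what the boundary argument of Section 4 establishes.

```python
import numpy as np

def pop_log_lik(mu, X):
    """Monte Carlo estimate of L(mu) for a uniform, unit-variance 1-D mixture."""
    mu = np.asarray(mu, dtype=float)
    log_phi = -0.5 * (X[:, None] - mu[None, :]) ** 2 - 0.5 * np.log(2.0 * np.pi)
    top = log_phi.max(axis=1, keepdims=True)
    return np.mean(top[:, 0] + np.log(np.exp(log_phi - top).sum(axis=1)) - np.log(len(mu)))

rng = np.random.default_rng(0)
mu_star = np.array([-1.0, 1.0, 20.0])          # mu*_1, mu*_2 close together; mu*_3 far away
X = mu_star[rng.integers(0, 3, size=300_000)] + rng.standard_normal(300_000)

merged = np.array([0.0, 20.0, 20.0])           # one center between mu*_1 and mu*_2; two on mu*_3
print(pop_log_lik(mu_star, X))                 # value at the true parameters
print(pop_log_lik(merged, X))                  # strictly smaller log-likelihood
```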
3.2 Algorithmic consequences

An important implication of Theorem 1 is that any iterative algorithm, such as EM or gradient ascent, that attempts to maximize the likelihood based on local updates cannot be globally convergent; that is, it cannot converge to (near) globally optimal solutions from an arbitrary initialization. Indeed, if such an algorithm is initialized at the local maximum, then it will remain trapped there. However, one might argue that this conclusion is overly pessimistic, in that we have only shown that these algorithms fail when initialized at a certain (adversarially chosen) point. Indeed, the mere existence of bad local maxima need not be a practical concern unless it can be shown that a typical optimization algorithm will frequently converge to one of them. The following result shows that the EM algorithm, when applied to the population likelihood and initialized according to the random scheme described in Section 2.3, converges to a bad critical point with high probability.

Theorem 2.
Let µ^t be the t-th iterate of the EM algorithm initialized by the random initialization scheme described previously. There exists a universal constant c > 0 such that, for any M ≥ 3 and any constant C_gap > 0, there is a well-separated uniform mixture of M unit-variance spherical Gaussians GMM(µ*) with

P[ L(µ^t) ≤ L(µ*) − C_gap for all t ≥ 0 ] ≥ 1 − e^{−cM}.

Theorem 2 shows that, for the specified configuration µ*, the probability of success for the EM algorithm is exponentially small as a function of M. As a consequence, in order to guarantee recovering a global maximum with at least constant probability, the EM algorithm with random initialization must be executed at least e^{Ω(M)} times. This result strongly suggests that effective initialization schemes, such as those based on pilot estimators utilizing the method of moments [15, 18], are critical to finding good maxima in general GMMs.

The key idea in the proof of Theorem 2 is the following: suppose that all the true centers are grouped into two clusters that are extremely far apart, and suppose further that we initialize all the centers in the neighborhood of these two clusters, while ensuring that at least one center lies within each cluster. In this situation, all centers will remain trapped within the cluster in which they were first initialized, irrespective of how many steps we take in the EM algorithm. Intuitively, this suggests that the only favorable initialization schemes (from which convergence to a global maximum is possible) are those in which (1) every initialized center falls in the neighborhood of exactly one cluster of true centers, and (2) the number of centers initialized within each cluster of true centers exactly matches the number of true centers in that cluster. However, this observation alone only suffices to guarantee that the success probability is polynomially small in M.

In order to demonstrate that the success probability is exponentially small in M, we need to further refine this construction. In more detail, we construct a Gaussian mixture distribution with a recursive structure: at the top level, its true centers can be grouped into two clusters far apart, and then inside each cluster, the true centers can be further grouped into two mini-clusters which are well-separated, and so on. We can repeat this structure for Ω(log M) levels. For this GMM instance, even in the case where the number of true centers exactly matches the number of initialized centers in each cluster at the top level, we still need to consider the configuration of the initial centers within the mini-clusters, which further reduces the probability of success for a random initialization. A straightforward calculation then shows that the probability of a favorable random initialization is on the order of e^{−Ω(M)}. The full proof is given in Section 4.2.
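The counting at the heart of this argument is easy to simulate. The toy sketch below (ours, not part of the proof) assigns M initial centers uniformly at random to the 2^m leaves of the recursive construction and checks one reading of the "good initialization" conditions described in Section 4.2: at every level the draws must either all land in one urn or split exactly in half, with both halves recursively good. An exactly balanced top-level split alone is only polynomially unlikely, while the recursive requirement drives the estimated success probability down exponentially in M.

```python
import numpy as np
from math import comb

def is_good(leaves, depth):
    """Recursive 'good initialization' check: at each level the assigned leaves
    must all lie in one urn, or split exactly in half with both halves recursively good."""
    if len(leaves) <= 1 or depth == 0:
        return True
    half = 1 << (depth - 1)
    left, right = leaves[leaves < half], leaves[leaves >= half]
    if len(left) == 0 or len(right) == 0:
        return True
    if len(left) != len(right):
        return False
    return is_good(left, depth - 1) and is_good(right - half, depth - 1)

rng = np.random.default_rng(1)
trials = 50_000
for m in range(2, 6):                                    # M = 4, 8, 16, 32
    M = 2 ** m
    hits = sum(is_good(rng.integers(0, M, size=M), m) for _ in range(trials))
    balanced = comb(M, M // 2) / 2 ** M                  # balanced top-level split alone
    print(M, hits / trials, balanced)
```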
We devote the remainder of this section to a treatment of the first-order EM algorithm. Our first result in this direction shows that convergence to sub-optimal fixed points remains a problem for the first-order EM algorithm, provided the step size is not chosen too aggressively.

Theorem 3. Let µ^t be the t-th iterate of the first-order EM algorithm with stepsize s ∈ (0, 1], initialized by the random initialization scheme described previously. There exists a universal constant c > 0 such that, for any M ≥ 3 and any constant C_gap > 0, there is a well-separated uniform mixture of M unit-variance spherical Gaussians GMM(µ*) with

P( L(µ^t) ≤ L(µ*) − C_gap for all t ≥ 0 ) ≥ 1 − e^{−cM}.   (7)

We note that the restriction on the step size is weak, and is satisfied by the theoretically optimal choice for a mixture of two Gaussians in the setting studied by Balakrishnan et al. [4]. Recall that the first-order EM updates are identical to gradient ascent updates on the log-likelihood function. As a consequence, we can conclude that the most natural local search heuristics for maximizing the log-likelihood (EM and gradient ascent) fail to provide statistically meaningful estimates when initialized randomly, unless we repeat this procedure exponentially many (in M) times.

Our final result concerns the type of fixed points reached by the first-order EM algorithm in our setting. Pascanu et al. [20] argue that for high-dimensional optimization problems, the principal difficulty is the proliferation of saddle points, not the existence of poor local maxima. In our setting, however, we can leverage recent results on gradient methods [16, 19] to show that the first-order EM algorithm cannot converge to strict saddle points. More precisely:

Definition 1 (Strict saddle point [11]). For a maximization problem, we say that a critical point x^{ss} of a function f is a strict saddle point if the Hessian ∇²f(x^{ss}) has at least one strictly positive eigenvalue.

With this definition, we have the following:
Theorem 4.
Let µ^t be the t-th iterate of the first-order EM algorithm with constant stepsize s ∈ (0, 1], initialized by the random initialization scheme described previously. Then for any M-component mixture of spherical Gaussians:

(a) The iterates µ^t converge to a critical point of the log-likelihood.

(b) For any strict saddle point µ^{ss}, we have P( lim_{t→∞} µ^t = µ^{ss} ) = 0.

Theorems 3 and 4 provide strong support for the claim that the sub-optimal points to which the first-order EM algorithm frequently converges are bad local maxima. The algorithmic failure of the first-order EM algorithm is most likely due to the presence of bad local maxima, as opposed to (strict) saddle points.

The proof of Theorem 4 is based on recent work [16, 19] on the asymptotic performance of gradient methods. That work relies on the stable manifold theorem from dynamical systems theory and, applied directly to our setting, would require establishing that the population likelihood L is smooth. Our proof technique avoids such a smoothness argument; see Section 4.4 for the details. The proof technique makes use of specific properties of the first-order EM algorithm that do not hold for the EM algorithm. We conjecture that a similar result is true for the EM algorithm; however, we suspect that a generalized version of the stable manifold theorem will be needed to establish such a result.
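Definition 1 is straightforward to check numerically at a candidate critical point. As an illustration (ours, using the two-component example of Section 1 with centers ±4, not an instance from the theorems), the sketch below estimates the Hessian of a Monte Carlo approximation of L at the critical point (0, 0) by central differences and inspects its eigenvalues; one of them is strictly positive, so (0, 0) is a strict saddle point in the sense of Definition 1.

```python
import numpy as np

def L_hat(mu, X):
    """Sample-based estimate of L(mu) for a uniform, unit-variance 1-D mixture
    (X is held fixed, so L_hat is a smooth deterministic function of mu)."""
    log_phi = -0.5 * (X[:, None] - mu[None, :]) ** 2 - 0.5 * np.log(2.0 * np.pi)
    top = log_phi.max(axis=1, keepdims=True)
    return np.mean(top[:, 0] + np.log(np.exp(log_phi - top).sum(axis=1)) - np.log(len(mu)))

def num_hessian(f, x, h=1e-2):
    """Central-difference approximation of the Hessian of a scalar function f at x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = h * np.eye(n)[i], h * np.eye(n)[j]
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)
    return H

rng = np.random.default_rng(2)
X = np.array([-4.0, 4.0])[rng.integers(0, 2, size=400_000)] + rng.standard_normal(400_000)
H = num_hessian(lambda mu: L_hat(mu, X), np.zeros(2))
print(np.linalg.eigvalsh(H))   # one strictly positive eigenvalue: (0, 0) is a strict saddle
```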
4 Proofs

This section is devoted to the proofs of Theorems 1 through 4. Certain technical aspects of the proofs are deferred to the appendix.
4.1 Proof of Theorem 1

In this section, we prove Theorem 1. The proof consists of three parts: starting with the case M = 3, the first part shows the existence of a local maximum for certain GMMs, whereas the second part shows that this local maximum has a log-likelihood that is much worse than that of the global maximum. The third part provides the extension to the general case of M > 3 mixture components.

We prove the existence of a local maximum by first constructing a family of GMMs parametrized by a scalar γ, and then proving the existence of local maxima in the limiting case when γ → +∞. By continuity of the log-likelihood function, we can then conclude that there exists some finite γ whose corresponding log-likelihood has local maxima.

We begin by considering the special case of M = 3 components in dimension d = 1. For parameters R > 0 and γ ≫ 1, suppose that the true centers µ* are given by

µ*_1 = −R,  µ*_2 = R,  µ*_3 = γR.

By construction, the two centers µ*_1 and µ*_2 are relatively close together near the origin, while the third center µ*_3 is located far away from both of the first two centers.

We first claim that when γ is sufficiently large, there is a local maximum in the closed set

D = { (µ_1, µ_2, µ_3) ∈ R³ | µ_1 ≤ γR/2, µ_2 ≥ 2γR/3 and µ_3 ≥ 2γR/3 }.

To establish this claim, we consider the value of the population log-likelihood function L(µ̃) at an interior point µ̃ = (0, γR, γR) of D, and compare it to the log-likelihood on the boundary of the set D. We show that for a sufficiently large γ, the log-likelihood at the interior point is strictly larger than the log-likelihood on the boundary, and use this to argue that there must be a local maximum in the set D. Concretely, define v_0 := L(µ̃), and the maximum values of L(µ) on the three two-dimensional faces of D, i.e.,

v_1 := sup_{µ_1 = γR/2, µ_2 ≥ 2γR/3, µ_3 ≥ 2γR/3} L(µ),  v_2 := sup_{µ_1 ≤ γR/2, µ_2 = 2γR/3, µ_3 ≥ 2γR/3} L(µ),  and  v_3 := sup_{µ_1 ≤ γR/2, µ_2 ≥ 2γR/3, µ_3 = 2γR/3} L(µ).

The population log-likelihood function is given by the expression

L(µ) = E_{µ*} log( Σ_{i=1}^3 e^{−(X−µ_i)²/2} ) − log(3√(2π)).

As γ → ∞, it is easy to verify that

v_0 = L(µ̃) → −(2R² + 3)/6 + (log 2)/3 − log(3√(2π)).

We can similarly evaluate the limits of v_1, v_2 and v_3 as γ → ∞; i.e., a straightforward calculation shows that

lim_{γ→+∞} v_1 = −∞,
lim_{γ→+∞} v_2 = −(2R² + 3)/6 − log(3√(2π))  (the maximum is attained at µ_1 → 0 and µ_3 → γR),
lim_{γ→+∞} v_3 = −(2R² + 3)/6 − log(3√(2π))  (the maximum is attained at µ_1 → 0 and µ_2 → γR).

This gives the relation v_0 > max{v_1, v_2, v_3} as γ → ∞. Since L is a continuous function of γ, we know that v_0, v_1, v_2 and v_3 are also continuous functions of γ. Therefore, there exists a finite A such that, as long as γ > A, we will still have v_0 > max{v_1, v_2, v_3}. This in turn implies that the function value at an interior point is strictly greater than the function value on the boundary of D, which implies the existence of at least one local maximum inside D. On the other hand, the global maxima of the population likelihood function are (−R, R, γR) and its permutations, which are not in D. This shows the existence of at least one local maximum which is not a global maximum.

In order to prove that the log-likelihood of a local maximum can be arbitrarily worse than the log-likelihood of the global maximum, we consider the limit R → ∞. In this case, the limiting value of the global maximum will be

lim_{R→∞} L(µ*) = −1/2 − log(3√(2π)).

Let µ′ = (µ′_1, µ′_2, µ′_3) be one of the local maxima in the closed set D; we have previously established the existence of such a local maximum. Since µ*_2 − µ*_1 = 2R, we know that either |µ*_1 − µ′_1| > R or |µ*_2 − µ′_1| > R has to be true. Without loss of generality, we may assume that |µ*_2 − µ′_1| > R. From the definition of the set D, we can also see that |µ*_2 − µ′_2| > R and |µ*_2 − µ′_3| > R. Since the terms of the population log-likelihood corresponding to the first and third components are each at most log 3, putting together the pieces yields

lim_{R→+∞} L(µ′) ≤ lim_{R→+∞} (1/3) E_{X∼N(µ*_2, 1)} log( Σ_{i=1}^3 e^{−(X−µ′_i)²/2} ) + (2/3) log 3 − log(3√(2π)) = −∞.

Again, by the continuity of the function L with respect to R, we know that for any C_gap > 0 there always exists a large constant A′ such that if R > A′, we will have L(µ*) − L(µ′) > C_gap. This completes the proof for the case M = 3.

The case M > 3: We now provide an outline of how this argument can be extended to the general setting of M > 3.
Consider a GMM with true centers

µ*_i = (2i − k)R/(k − 1) for i = 1, ..., M − 1, and µ*_M = γR,

for some parameter γ > 0 to be chosen. We claim that when γ is sufficiently large, there is at least one local maximum in the closed set

D_M = { (µ_1, ..., µ_M) | µ_1 ≤ γR/2, µ_2 ≥ 2γR/3, ..., µ_M ≥ 2γR/3 }.

The proof follows from an argument identical to that for the M = 3 case.

4.2 Proof of Theorem 2

In this section, we prove Theorem 2. We first present an important technical lemma that addresses the behavior of the EM algorithm for a particular configuration of true and initial centers. We then prove the theorem by constructing a bad example and recursively applying this lemma. The proof of this lemma is given in Appendix A.

We focus on the one-dimensional setting throughout this proof. We use B_x(δ) to denote an interval centered at x with radius δ, that is, B_x(δ) = [x − δ, x + δ]. We also use B̄_x(δ) to represent the complement of the interval B_x(δ), i.e., B̄_x(δ) = (−∞, x − δ) ∪ (x + δ, ∞).

As a preliminary, let us define a class of GMMs, which we refer to as diffuse GMMs. We say that a mixture model GMM(µ*) consisting of M̃ components is (c, δ)-diffuse if:

(a) For some M ≤ M̃, there are M centers contained in B_{cδ}(δ) ∪ B_{−cδ}(δ);

(b) Each of the sets B_{cδ}(δ) and B_{−cδ}(δ) contains at least one center;

(c) The remaining M̃ − M centers are all in B̄_0(20cδ).

Consider the EM algorithm, and denote by M_1^{(t)}, M_2^{(t)} and M_3^{(t)} the number of centers the EM algorithm has at the t-th iteration in the sets B_{−cδ}(2δ), B_{cδ}(2δ) and B̄_0(20cδ) respectively, where c and δ are those specified in the definition of the diffuse GMM. To be clear, M_1^{(0)}, M_2^{(0)} and M_3^{(0)} denote the number of centers in these sets in the initial configuration specified to the EM algorithm. With these definitions in place, we can state our lemma.

Lemma 1.
Suppose that the true underlying distribution is a ( c, δ ) -diffuse GMM with c > and δ > log M + 3 , and that the EM algorithm is initialized so that M (0)1 , M (0)2 ≥ .(a) If M = f M , then M ( t )1 = M (0)1 and M ( t )2 = M (0)2 for every t ≥ . (8) (b) If M < f M , suppose further that for each center in µ ∗ j ∈ B (20 cδ ) , there is an initialcenter µ (0) j ′ such that | µ (0) j ′ − µ ∗ j | ≤ | µ ∗ j | / . Then the same conditions (8) hold. Intuitively, these results show that if the true centers are clustered together into two clustersthat are well separated, and the EM algorithm is initialized so that each cluster is accounted forby at least one initial center then the EM algorithm remains trapped in the initial configurationof centers. A concrete implication of part (a) is that if the true distribution is a ( c, δ ) -diffuseGMM with f M = M and M ∗ , M ∗ true clusters lie in B − cδ ( δ ) and B cδ ( δ ) respectively, thenthere are only three possible ways to initialize the EM algorithm that might possibly convergeto a global maximum of the log-likelihood function; i.e., the pair ( M (0)1 , M (0)2 ) must be one of { ( M, , ( M ∗ , M ∗ ) , (0 , M ) } , where M = M ∗ + M ∗ .We are now equipped to prove Theorem 2. We will first focus on the case M = 2 m forsome positive integer m ; the case of arbitrary M will be addressed later. At a high level, wewill first construct the distribution GMM ( µ ∗ ) that establishes the theorem, and then use theabove technical lemma in order to reason about the behavior of the EM algorithm on thisdistribution. 12 ase M = 2 m : First, define the collection of m binary vectors of the form ǫ = ( ǫ , ǫ , · · · , ǫ m ) where each ǫ i ∈ {− , } . Consider the distribution, GMM ( µ ∗ ) , with the locations of the truecenters indexed by these m vectors; i.e., each center is located at µ ( ǫ ) = m X i =1 ǫ i (cid:16) (cid:17) i − R, (9)where we choose R ≥ m +1 ( M + 1) . This in turn implies that the distance between theclosest pair of true centers is at least × ( M + 1) .Our random initialization strategy samples the initial centers µ , µ , · · · , µ M i.i.d. fromthe distribution GMM ( µ ∗ ) . We can view this sampling process as two separate steps:(i) Sample an integer Z i uniformly from [ M ] .(ii) Sample value µ i from the Gaussian distribution N ( µ ∗ Z i , I) .Concentration properties of the Gaussian distribution will ensure that µ i will not be too farfrom its expectation µ ∗ Z i . Formally, we define the following event: E M := n all M initial points µ i are contained in B µ ∗ Zi ( M ) o . (10)By standard Gaussian tail bounds, we have P ( E M ) = (1 − P X ∼N (0 , ( | X | > M )) M ≥ (1 − M e − M / ) ≥ − e − Ω( M ) This implies that the event E M will hold with high probability (when M is large). Conditionedon the event E M , we are guaranteed that all initialized points are relatively close to some truecenter.A key observation regarding the configuration of centers in the model GMM ( µ ∗ ) specifiedby equation (9) is that the true centers can be partitioned into two well separated regions.More precisely, it is easy to verify that there are M/ true centers in the interval B − R ( R/ while the remaining M/ true centers are contained in the interval B R ( R/ . In what follows,we refer to B − R (2 R/ as the left urn and to B R (2 R/ as the right urn .Conditioned on E M , each initial point lands in either the left urn or the right urn withequal probability. Suppose we initialize EM with ( M , M ) centers in the left and right urnrespectively. 
By Lemma 1(a), the only three possible values of the pair ( M , M ) for which theEM algorithm might converge to a global optimum are (0 , M ) , ( M, , ( M/ , M/ . A simplecalculation will show that the first and second possibilities occur with exponentially smallprobability. However, the third possibility occurs with only polynomially small probability,and so we need to further investigate this possibility.Consider, for example, the left urn: the true centers in the left urn can further be par-titioned into two intervals B − . R ( R/ and B − . R ( R/ with M/ true centers ineach. Thus, each urn can be further partitioned into a left urn and a right urn. Followingthe same analysis as above and now using part (b) of Lemma 1 instead of part (a), we seethat in order to ensure that the EM algorithm converges to a global optimum, the number ofinitial centers in B − . R (2 R/ and B − . R (2 R/ must be one of the following pairs { (0 , M/ , ( M/ , , ( M/ , M/ } . Our configuration of centers in equation (9) guarantees that this argument can be recur-sively applied until we reach an interval which contains only two true centers. For a configu-ration of M initial centers, we call these initial centers a good initialization for a collection oftrue centers if one of the following holds: 13a) M = 1 ,(b) the number of initial centers assigned to the left urn and the right urn of the collectionof true centers are either (0 , M ) or ( M, ,(c) the number of initial centers assigned to the left urn and the right urn of the collectionof true centers are ( M/ , M/ ; and further recursively the initialization in both the leftand the right urns are good initializations.Lemma 1 implies that the EM algorithm converges to a global maximum only if a goodinitialization is realized. We will now show that the probability of a good initialization isexponentially small.Let F M represent the event that a good initialization is generated on a mixture with M components, for our configuration of true centers. Let M and M represent the number ofinitial centers in the left urn and the right urn, respectively. Conditioning on the event E M from equation (10), we have P ( F M | E M ) ≤ P ( M = 0) + P ( M = M ) + P ( M = M/ · (cid:0) P ( F M/ | E M ) (cid:1) ≤ × (cid:18) M (cid:19) M + (cid:18) MM/ (cid:19) M · (cid:0) P ( F M/ | E M ) (cid:1) ≤ M − + 12 · (cid:0) P ( F M/ | E M ) (cid:1) . Since P ( F | E M ) = 1 , solving this recursive inequality implies that P ( F M | E M ) ≤ e − cM forsome universal constant c . Thus, the probability that the EM algorithm converges to a globalmaximum is upper bounded by: P ( F M ) ≤ P ( E M ) P ( F M | E M ) + P ( E M ) ≤ P ( F M | E M ) + P ( E M ) ≤ e − Ω( M ) . To complete the proof for the case when M = 2 m for a positive integer m , we need to arguethat on the event P ( F M ) , the log-likelihood of the solution reached by the EM algorithm canbe arbitrarily worse than that of the global maximum. We claim that when the event F M occurs, the EM algorithm returns a solution µ for which there is at least one urn containingtwo true centers which is assigned a single center by the EM algorithm at every iteration t ≥ .As a consequence, there is at least one true center µ ∗ j for which we have that | µ ∗ j − µ ′ i | ≥ R m for all i = 1 , . . . , M . Now, we claim that we can choose R to be large enough to ensure anarbitrarily large gap in the likelihood of the EM solution and the global maximum. 
Since the term of the population log-likelihood corresponding to each of the other M − 1 true components is at most (1/M) log M, as R → ∞ we have

lim_{R→+∞} L(µ) ≤ lim_{R→+∞} (1/M) E_{X∼N(µ*_j, 1)} log( Σ_{i=1}^M e^{−(X−µ_i)²/2} ) + ((M−1)/M) log M − log(M√(2π)) = −∞.

However, the global maximum µ* has log-likelihood

lim_{R→+∞} L(µ*) = −1/2 − log(M√(2π)).

Once again we can use the continuity of the log-likelihood as a function of R to conclude that there is a finite, sufficiently large R > 0 such that the conclusion of Theorem 2 holds.

Case 2^{m−1} < M ≤ 2^m: At a high level, we deal with this case by constructing a configuration with 2^m centers and pruning it down to M centers, while ensuring that the resulting urns are still approximately balanced, which in turn ensures that our previous calculations continue to hold.

Our configuration of true centers in equation (9) can be viewed as the 2^m leaves of a binary tree of depth m, where the vectors ε indexing the true centers represent the unique path from the root to the leaf: the value of ε_i indicates whether to go down to the left child or to the right child at the i-th level of the tree. We choose M true centers from the 2^m leaves by the following procedure. Starting from the root, we assign ⌈M/2⌉ true centers to the left sub-tree, and assign ⌊M/2⌋ true centers to the right sub-tree. For any sub-tree, suppose that it was assigned l true centers; then we assign ⌈l/2⌉ true centers to its left subtree and ⌊l/2⌋ true centers to its right subtree. This procedure is continued recursively until all the true centers are assigned to leaves. Each leaf corresponds to a point on the real line, and we choose this point as the location of the corresponding center.

The above construction has the following two properties: first, the locations of the true centers satisfy the separation requirements we used in dealing with the case M = 2^m, and second, the assignment of the centers to the left and right urns in each case is roughly balanced. By leveraging these two properties we can follow essentially the same steps as we did in the case M = 2^m, and we omit the remaining proof details here.

4.3 Proof of Theorem 3

We now embark on the proof of Theorem 3. The proof follows a similar outline to the proof of Theorem 2 and we only develop the main ideas here. Concretely, it is easy to verify that in order to prove the result we only need to establish the analogue of Lemma 1 for the first-order EM algorithm.

Intuitively, we first argue that the first-order EM updates can be viewed as less aggressive versions of the corresponding EM updates, and we use this fact to argue that Lemma 1 continues to hold for the first-order EM algorithm. Concretely, we can compare the update of the EM algorithm,

µ_i^{new, EM} = E_{µ*}[ w_i(X) · X ] / E_{µ*}[ w_i(X) ],

with the update of the first-order EM algorithm,

µ_i^{new, first-order EM} = µ_i + s E_{µ*}[ w_i(X)(X − µ_i) ].

If, for any parameter µ_i, we choose the stepsize s = 1/E_{µ*}[w_i(X)] for the first-order EM algorithm, then the two updates match for that parameter. For the first-order EM algorithm we always use a step size s ∈ (0, 1], while 1/E_{µ*}[w_i(X)] ≥ 1. Consequently, there must exist some θ_i ∈ [0, 1) such that

µ_i^{new, first-order EM} = θ_i µ_i + (1 − θ_i) µ_i^{new, EM}.

Thus, we see that the first-order EM update is a less aggressive version of the EM update. An examination of the proof of Lemma 1 reveals that this property suffices to ensure that its guarantees continue to hold for the first-order EM algorithm, which completes the proof of Theorem 3.
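The convex-combination identity above holds exactly for the sample-based updates as well, and is easy to verify numerically. The sketch below is our illustration; the mixture, the current iterate, and the step size are arbitrary choices.

```python
import numpy as np

def em_and_first_order(mu, X, s=1.0):
    """Compare the sample M-step update with the first-order EM update for a
    uniform, unit-variance 1-D mixture, and return the damping weights theta_i."""
    sq = (X[:, None] - mu[None, :]) ** 2
    w = np.exp(-0.5 * (sq - sq.min(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)                     # membership weights, shape (n, M)
    em = (w * X[:, None]).sum(axis=0) / w.sum(axis=0)     # full (sample) M-step
    first_order = mu + s * (w * (X[:, None] - mu[None, :])).mean(axis=0)
    theta = 1.0 - s * w.mean(axis=0)                      # claimed damping coefficients
    return em, first_order, theta

rng = np.random.default_rng(3)
X = np.array([-5.0, 0.0, 5.0])[rng.integers(0, 3, size=100_000)] + rng.standard_normal(100_000)
mu = np.array([-4.0, 1.0, 6.0])
em, fo, theta = em_and_first_order(mu, X, s=1.0)
print(np.allclose(fo, theta * mu + (1.0 - theta) * em))   # True: a damped version of the EM step
```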
4.4 Proof of Theorem 4

In this section, we prove Theorem 4. Throughout this proof, we use the fact that the first-order EM updates with step size s ∈ (0, 1] take the form

µ^{new} = µ + s ∇L(µ).   (11)

In order to reason about the behavior of the first-order EM algorithm, we first provide a result that concerns the Hessian of the log-likelihood.

Lemma 2.
For any scalar s ∈ (0, 1] and for any µ, we have s∇²L(µ) ≻ −I.

We prove this claim at the end of the section. Taking this lemma as given, we can now prove the theorem's claims. We first show that the first-order EM algorithm with stepsize s ∈ (0, 1] converges to a critical point. By a Taylor expansion of the log-likelihood function, we have

L(µ^{new}) = L(µ) + ⟨∇L(µ), µ^{new} − µ⟩ + (1/2)(µ^{new} − µ)^⊤ ∇²L(µ̃)(µ^{new} − µ),

for some µ̃ on the line joining µ and µ^{new}. Applying Lemma 2 guarantees that

L(µ^{new}) ≥ L(µ) + ⟨∇L(µ), µ^{new} − µ⟩ − (1/(2s)) ‖µ^{new} − µ‖²₂.

From the form (11) of the gradient EM updates, we then have

L(µ^{new}) ≥ L(µ) + (s − s/2) ‖∇L(µ)‖²₂.

Consequently, for any choice of step size s ∈ (0, 1] and any point µ for which ∇L(µ) ≠ 0, applying the gradient EM update leads to a strict increase in the value of the population likelihood L. Since L is upper bounded by a constant for a mixture of M spherical Gaussians, we can conclude that first-order EM must converge to some point. It is easy to further verify that it must converge to a point for which ∇L(µ) = 0, which concludes the first part of our proof.

Next we show that the first-order EM algorithm will not converge to strict saddle points almost surely. We do this via a technique that has been used in recent papers [16, 19], exploiting the stable manifold theorem from dynamical systems theory. For this portion of the proof, it will be convenient to view the first-order EM updates as a map from the parameter space to the parameter space; i.e., we define the first-order EM map by

g(µ) := µ + s ∇L(µ).   (12)

Recalling Definition 1 of strict saddle points, we denote by D^{ss} the set of initial points from which the first-order EM algorithm converges to a strict saddle point. With these definitions in place, we can state an intermediate result:

Lemma 3 ([16, 19]). If the map µ ↦ g(µ) defined by equation (12) is a local diffeomorphism for each µ, then D^{ss} has zero Lebesgue measure.

Denote the Jacobian matrix of the map g at a point µ by ∇g(µ), where [∇g(µ)]_{ij} = ∂g_i(µ)/∂µ_j. By Lemma 2, the Jacobian ∇g(µ) = I + s∇²L(µ) is strictly positive definite, and hence invertible, for all µ, which implies that the map g is a local diffeomorphism everywhere. Furthermore, our random initialization strategy specifies the distribution of the initial point µ^{(0)}, which is absolutely continuous with respect to Lebesgue measure. Combining these facts with Lemma 3, we have proved Theorem 4.

Finally, the only remaining detail is to prove Lemma 2. By definition, we have

I + s∇²L(µ) = D + sQ,  where  D := diag( (1 − s E w_1(X)) I_d, ..., (1 − s E w_M(X)) I_d ),

and where the matrix Q has d × d blocks of the form

Q_{ij} = E[ (w_i(X) − w_i(X)²)(X − µ_i)(X − µ_i)^⊤ ]  if i = j,  and  Q_{ij} = −E[ w_i(X) w_j(X)(X − µ_i)(X − µ_j)^⊤ ]  otherwise.

Since w_i(X) ≤ 1 for all i ∈ [M] and s ≤ 1, it follows that the diagonal matrix D is strictly positive definite. Consequently, in order to prove Lemma 2, it suffices to show that Q is positive semidefinite. Letting v = (v_1^⊤, ..., v_M^⊤)^⊤, where the v_i ∈ R^d are arbitrary vectors, we have

v^⊤ Q v = Σ_{i=1}^M E[ w_i(X) (v_i^⊤(X − µ_i))² ] − Σ_{i=1}^M Σ_{j=1}^M E[ w_i(X) w_j(X) (v_i^⊤(X − µ_i)) (v_j^⊤(X − µ_j)) ]
        ≥ Σ_{i=1}^M E[ w_i(X) (v_i^⊤(X − µ_i))² ] − Σ_{i=1}^M Σ_{j=1}^M (1/2) ( E[ w_i(X) w_j(X) (v_i^⊤(X − µ_i))² ] + E[ w_i(X) w_j(X) (v_j^⊤(X − µ_j))² ] )   (i)
        = Σ_{i=1}^M E[ w_i(X) (v_i^⊤(X − µ_i))² ] − Σ_{i=1}^M E[ w_i(X) (v_i^⊤(X − µ_i))² ] = 0,   (ii)

where step (i) uses the elementary inequality |ab| ≤ (a² + b²)/2, and step (ii) uses the fact that Σ_{i=1}^M w_i(X) = 1 for any X. This completes the proof.

Acknowledgements
This work was partially supported by Office of Naval Research MURI grant DOD-002888, AirForce Office of Scientific Research Grant AFOSR-FA9550-14-1-001, the Mathematical DataScience program of the Office of Naval Research under grant number N00014-15-1-2670, andNational Science Foundation Grant CIF-31712-23800.
References [1] Elizabeth S Allman, Catherine Matias, and John A Rhodes. Identifiability of parametersin latent structure models with many observed variables.
Annals of Statistics , 37(6A):3099–3132, 2009.[2] Carlos Améndola, Mathias Drton, and Bernd Sturmfels. Maximum likelihood estimatesfor Gaussian mixtures are transcendental. In
International Conference on MathematicalAspects of Computer and Information Sciences , pages 579–590. Springer, 2015.[3] Sanjeev Arora, Ravi Kannan, et al. Learning mixtures of separated nonspherical Gaus-sians.
The Annals of Applied Probability , 15(1A):69–92, 2005.174] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees forthe EM algorithm: From population to sample-based analysis.
Annals of Statistics , 2015.[5] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In , pages 103–112. IEEE,2010.[6] Kamalika Chaudhuri and Satish Rao. Learning mixtures of product distributions usingcorrelations and independence. In , volume 4,pages 9–1, 2008.[7] Jiahua Chen. Optimal rate of convergence for finite mixture models.
Annals of Statistics ,23(1):221–233, 1995.[8] Sanjoy Dasgupta and Leonard Schulman. A probabilistic analysis of EM for mixtures ofseparated, spherical Gaussians.
Journal of Machine Learning Research , 8:203–226, 2007.[9] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood fromincomplete data via the EM algorithm.
Journal of the Royal Statistical Society, Series B ,39(1):1–38, 1977.[10] David L Donoho and Richard C Liu. The “automatic” robustness of minimum distancefunctionals.
Annals of Statistics , 16(2):552–586, 1988.[11] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—onlinestochastic gradient for tensor decomposition. In , pages 797–842, 2015.[12] Christopher R Genovese and Larry Wasserman. Rates of convergence for the Gaussianmixture sieve.
Annals of Statistics , 28(4):1105–1127, 2000.[13] Subhashis Ghosal and Aad W Van Der Vaart. Entropies and rates of convergence formaximum likelihood and Bayes estimation for mixtures of normal densities.
Annals ofStatistics , 29(5):1233–1263, 2001.[14] Nhat Ho and XuanLong Nguyen. Identifiability and optimal rates of convergence forparameters of multiple types in finite mixtures. arXiv preprint arXiv:1501.02497 , 2015.[15] Daniel Hsu and Sham M Kakade. Learning mixtures of spherical Gaussians: Momentmethods and spectral decompositions. In
Proceedings of the 4th Conference on Innova-tions in Theoretical Computer Science , pages 11–20. ACM, 2013.[16] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descentconverges to minimizers. In , pages 1246–1257, 2016.[17] Po-Ling Loh and Martin J Wainwright. Regularized M-estimators with nonconvexity:Statistical and algorithmic theory for local optima. In
Advances in Neural InformationProcessing Systems , pages 476–484, 2013.[18] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures ofGaussians. In , pages93–102. IEEE, 2010. 1819] Ioannis Panageas and Georgios Piliouras. Gradient descent converges to minimizers: Thecase of non-isolated critical points. arXiv preprint arXiv:1605.00405 , 2016.[20] Razvan Pascanu, Yann N Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddlepoint problem for non-convex optimization. arXiv preprint arXiv:1405.4604 , 2014.[21] Nathan Srebro. Are there local maxima in the infinite-sample likelihood of Gaussianmixture estimation? In , pages 628–629,2007.[22] Henry Teicher. Identifiability of finite mixtures.
The Annals of Mathematical Statistics ,34(4):1265–1269, 1963.[23] D Michael Titterington.
Statistical Analysis of Finite Mixture Distributions . Wiley, 1985.[24] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixtures of dis-tributions. In
The 43rd Annual IEEE Symposium on Foundations of Computer Science ,pages 113–122. IEEE, 2002.[25] Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical ratesof convergence for sparse nonconvex learning problems.
Annals of Statistics , 42(6):2164,2014. 19
A Proofs of Technical Lemmas
The bulk of this section is devoted to the proof of Lemma 1, which is based on a number oftechnical lemmas.
A.1 Proof of Lemma 1
Underlying our proof is the following auxiliary result:
Lemma 4.
Suppose that:(a) The true distribution is a GMM ( µ ∗ ) with M components and that all true centers arelocated in ( −∞ , − a ) ∪ ( a, + ∞ ) with at least one center in ( a, a ) , with a > log M + 3 .(b) The current configuration of centers has the property that for any true center µ ∗ j in ( −∞ , − a ) , there exists a current center µ j ′ such that | µ j ′ − µ ∗ j | ≤ | µ ∗ j | / .Then, for any i ∈ [ M ] for which the current parameter µ i ∈ [0 , a ] , we have E w i ( X ) X ≥ . See Section A.2 for the proof of this claim.Using Lemma 4, let us now prove Lemma 1. Without loss of generality, we may assumethat µ i ∈ B cδ (2 δ ) , for some i ∈ [ M ] . Thus, in order to establish the claim, it suffices to showthat after one step of the EM algorithm, the new iterate µ new i belongs to B cδ (2 δ ) as well.In order to show that µ new i ∈ B cδ (2 δ ) , note that by the update equation (5), we have µ new i = E w i ( X ) X E w i ( X ) . Thus, it is equivalent to prove that E w i ( X )( X − ( c − δ ) ≥ , and E w i ( X )( X − ( c + 2) δ ) ≤ . The first inequality can be proved by substituting Z = X − ( c − δ and applying Lemma 4to Z . Similarly, the second inequality can be proved by defining Y : = ( c + 2) δ − X , and thenapplying Lemma 4 to Y . A.2 Proof of Lemma 4
Our proof of this claim hinges on two auxiliary lemmas, which we begin by stating. Intuitively, our first lemma shows that if the data are generated by a single Gaussian whose mean is at least Ω(log M) to the right of the origin, then it will affect any center µ_i ≥ 0 by forcing it to the right, no matter where the other centers {µ_j}_{j≠i} are.

Lemma 5.
Suppose that the true distribution is a unit variance Gaussian with mean µ ∗ ≥ a for some a > log M + 3 , and that the current configuration of centers, µ , · · · , µ M , has the i th center µ i ≥ . Then we have E w i ( X ) X ≥ . (13a) Furthermore, if µ ∗ ≤ a , and ≤ µ i ≤ a , then: E w i ( X ) X ≥ a M e − a / . (13b)20ee Section A.3 for the proof of this claim. In a similar vein, if the data is generated by asingle Gaussian far to the left of the origin, and some current center µ j is sufficiently close toit then this Gaussian will force µ i towards the negative direction, but will only have a smalleffect on µ i . More formally, we have the following result: Lemma 6.
Suppose that the true distribution is a unit variance Gaussian with mean µ ∗ = − r ,and that the current configuration of centers, µ , · · · , µ M , has the i th center µ i ≥ and furtherhas at least one µ j such that | µ j − µ ∗ | ≤ r . Then we have that: E w i ( X ) X ≥ − re − r / . (14)See Section A.4 for the proof of this claim.Equipped with these two auxiliary results, we can now prove Lemma 4. Without loss ofgenerality, suppose that the centers are sorted in ascending order, and that the ℓ th true centeris the smallest true center in (0 , + ∞ ) . From the assumptions of Lemma 4, we know µ ∗ ℓ belongsto the interval ( a, a ) . Thus, when X is drawn from a Gaussian mixture, we have E w i ( X ) X = 1 M M X j =1 E X ∼N ( µ ∗ j , w i ( X ) X = 1 M ℓ − X j =1 E X ∼N ( µ ∗ j , w i ( X ) X + 1 M E X ∼N ( µ ∗ ℓ , w i ( X ) X + 1 M M X j = ℓ +1 E X ∼N ( µ ∗ j , w i ( X ) X. We now use Lemma 6 to bound the first term. Since f ( y ) = − y · e − y / is monotonicallyincreasing in [3 , + ∞ ) , and from the assumptions of Lemma 4, we have | µ ∗ j | > − a > − (9 a +2) for all j < ℓ . Then: M ℓ − X j =1 E X ∼N ( µ ∗ j , ) w i ( X ) X ≥ − M ℓ − X j =1 | µ i | e − µ i / ≥ − a + 2) e − (9 a +2) / . By Lemma 5, we know that the third term is non-negative and that the second term can belower bounded by a sufficiently large quantity. Putting together the pieces, we find that E X w i ( X ) X ≥ − a + 2) e − a / − a − + a M e − a / ≥ e − a / h a M − a + 2) e − M − i ≥ e − a / − M [80 a − a + 2)] ≥ , which completes the proof. A.3 Proof of Lemma 5
Introducing the shorthand w ∗ : = min x ∈ [1 , w i ( x ) , we have E w i ( X ) X ≥ √ π Z −∞ w i ( x ) xe − ( x − µ ∗ ) / d x + 1 √ π Z w i ( x ) xe − ( x − µ ∗ ) / d x + 1 √ π Z aa w i ( x ) xe − ( x − µ ∗ ) / d x. We calculate the first two terms: for this purpose, the following lemma is useful:21 emma 7.
For any µ , · · · , µ M where µ i ≥ , we have following: min x ∈ [1 , w i ( x ) ≥ M e max x ∈ ( −∞ , w i ( x ) . (15)See Section A.3.1 for the proof of this claim. From Lemma 7, we have that: √ π Z −∞ w i ( x ) xe − ( x − µ ∗ ) / d x + 1 √ π Z w i ( x ) xe − ( x − µ ∗ ) / ≥ √ π (cid:20)Z −∞ M e w ∗ xe − ( x − µ ∗ ) / d x + Z w ∗ xe − ( x − µ ∗ ) / d x (cid:21) ≥ w ∗ √ π (cid:20)Z −∞ M e ( x − µ ∗ ) e − ( x − µ ∗ ) / d x + Z xe − ( x − µ ∗ ) / d x (cid:21) ≥ w ∗ √ π h − M e − ( µ ∗ ) / + e − ( µ ∗ − / i = w ∗ e − ( µ ∗ ) √ π h e µ ∗ − / − M e i ≥ . The last inequality holds since µ ∗ > a > log M + 3 . The third term is always positive, andthis finishes the proof of first claim.For second claim: if we further know that µ ∗ ≤ a , and µ i ≤ a , then for any x ∈ [ a, a ] , w i ( x ) ≥ e − a / M , we have: √ π Z aa w i ( x ) xe − ( x − µ ∗ ) / d x ≥ M √ π e − a / a Z aa e − ( x − µ ∗ ) / d x ≥ aM √ eπ e − a / ≥ a M e − a / . The last inequality is true by integrating over an interval of length 1 around µ ∗ contained in ( a, a ) . A.3.1 Proof of Lemma 7
We split the proof into two cases.
Case µ i ∈ [0 , : In this case, we are guaranteed that max x ∈ ( −∞ , w i ( x ) ≤ . Also, for any x ∈ [1 , , we have: w i ( x ) = e − ( x − µ i ) / P j e − ( x − µ j ) / ≥ M e , (16)which proves the required result. Case µ i > : In this case, we have w i ( x ) = e − ( x − µ i ) / P j e − ( x − µ j ) / = 1 P j = i M − + e [( x − µ i ) − ( x − µ j ) ] / = 1 P j = i M − + e ( µ i − µ j )( µ i + µ j − x ) / = 1 P j = i A ij ( x ) , A ij ( x ) := M − + e ( µ i − µ j )( µ i + µ j − x ) / . It suffices to show that A ij ( x ) ≤ M A ij ( x ′ ) for any x ∈ [1 , , x ′ ∈ ( −∞ , and j ∈ [ M ] . (17)Using this, we know: w i ( x ) = 1 P j = i A ij ( x ) ≥ P j = i M A ij ( x ′ ) = 1 M w i ( x ′ ) , (18)and the claim of Lemma 7 easily follows. In order to establish the claim of equation (17), wenote that if µ j ≤ µ i , then since x ′ < x , we have ( µ i − µ j )( µ i + µ j − x ) ≤ ( µ i − µ j )( µ i + µ j − x ′ ) , which implies that A ij ( x ) ≤ A ij ( x ′ ) . If µ i < µ j , then we know: ( µ i − µ j )( µ i + µ j − x ) < . (19)This implies A ij ( x ) ≤ M − + 1 = MM − . On the other hand, we always have A ij ( x ′ ) ≥ M − ,this gives A ij ( x ) ≤ M A ij ( x ′ ) , which finishes the proof. A.4 Proof of Lemma 6
We have E w i ( X ) X ≥ √ π Z − r/ −∞ w i ( x ) xe − ( x − µ ∗ ) / d x + 1 √ π Z − r/ w i ( x ) xe − ( x − µ ∗ ) / d x. For the first term, we know for any x ∈ ( −∞ , − r/ , we have: w i ( x ) ≤ e − x / e − ( x − µ j ) / = e − xµ j + µ j / ≤ e − r µ j + µ j / ≤ e − r ≤ e − r / . The second last inequality is true since µ j ≥ − r . Thus, we know: √ π Z − r/ −∞ w i ( x ) xe − ( x − µ ∗ ) / d x ≥ e − r / √ π Z − r/ −∞ xe − ( x − µ ∗ ) / d x ≥ e − r / √ π "Z − r/ −∞ ( x − µ ∗ ) e − ( x − µ ∗ ) / d x + µ ∗ √ π ≥ e − r / √ π (cid:20) − e − r / − r √ π (cid:21) ≥ − re − r / . For the second term, we have: √ π Z − r/ w i ( x ) xe − ( x − µ ∗ ) / d x ≥ − r √ π Z − r/ e − ( x − µ ∗ ) / d x ≥ − r √ π Z + ∞− r/ e − ( x − µ ∗ ) / d x ≥ − r e − r / . Putting the pieces together we obtain, E w i ( X ) X ≥ − ( 23 + 2) e − r / ≥ − re − r / ,,