Improved Convergence Guarantees for Learning Gaussian Mixture Models by EM and Gradient EM
aa r X i v : . [ c s . L G ] J a n Improved Convergence Guarantees forLearning Gaussian Mixture Models by EMand Gradient EM
Nimrod Segol and Boaz Nadler
Department of Computer Science and Applied MathematicsWeizmann Institute of ScienceRehovot, Israele-mail: [email protected] ; boaz.nadler.weizmann.ac.il Abstract:
We consider the problem of estimating the parameters a Gaussian MixtureModel with K components of known weights, all with an identity covariance matrix.We make two contributions. First, at the population level, we present a sharper anal-ysis of the local convergence of EM and gradient EM, compared to previous works.Assuming a separation of Ω( √ log K ) , we prove convergence of both methods to theglobal optima from an initialization region larger than those of previous works. Specif-ically, the initial guess of each component can be as far as (almost) half its distance tothe nearest Gaussian. This is essentially the largest possible contraction region. Our sec-ond contribution are improved sample size requirements for accurate estimation by EMand gradient EM. In previous works, the required number of samples had a quadraticdependence on the maximal separation between the K components, and the resultingerror estimate increased linearly with this maximal separation. In this manuscript weshow that both quantities depend only logarithmically on the maximal separation.
1. INTRODUCTION
Gaussian mixture models (GMMs) are a widely used statistical model going back toPearson [15]. In a GMM each sample x ∈ R d is drawn from one of K componentsaccording to mixing weights π , . . . , π K > with P Ki =1 π i = 1 . Each componentfollows a Gaussian distribution with mean µ ∗ i ∈ R d and covariance Σ i ∈ R d × d . In thiswork, we focus on the important special case of K spherical Gaussians with identitycovariance matrix, with a corresponding density function f X ( x ) = K X i =1 π i (2 π ) d e − k x − µ ∗ i k . (1)For simplicity, as in [22, 23], we assume the weights π i are known.Given n i.i.d. samples from the distribution (1), a fundamental problem is to esti-mate the vectors µ ∗ i of the K components. Beyond the number of components K andthe dimension d , the difficulty of this problem is characterized by the following keyquantities: The smallest and largest separation between the cluster centers, R min = min i = j k µ ∗ i − µ ∗ j k , R max = max i = j k µ ∗ i − µ ∗ j k , (2) egol et al./Improved Convergence Guarantees for EM the minimal and maximal weights and their ratio, π min = min i ∈ [ K ] π i , π max = max i ∈ [ K ] π i , θ = π max π min . (3)In principle, one could estimate µ ∗ i by maximizing the likelihood of the observeddata. However, as the log-likelihood is non-concave, this problem is computationallychallenging. A popular alternative approach is based on the EM algorithm [7], andvariants thereof, such as gradient EM. These iterative methods require an initial guess ( µ , . . . , µ K ) of the K cluster centers. Classical results show that regardless of theinitial guess, the values of the likelihood function after each EM iteration are non de-creasing. Furthermore, under fairly general conditions, the EM algorithm converges toa stationary point or a local optima [21, 19]. The success of these methods to convergeto an accurate solution depend critically on the accuracy of the initial guess [10].In this work we study the ability of the popular EM and gradient EM algorithmsto accurately estimate the parameters of the GMM in (1). Two quantities of particularinterest are: (i) the size of the initialization region and the minimal separation thatguarantee convergence to the global optima. Namely, how small can R min be and howlarge can k µ i − µ ∗ i k , and still have convergence of EM to the global optima in thepopulation setting; and (ii) the required sample size, and its dependence on the problemparameters, that guarantees EM to find accurate solutions, with high probability.We make the following contributions: First, we present an improved analysis of thelocal convergence of EM and gradient EM, at the population level, assuming an infinitenumber of samples. In Theorems 3.1 and 3.2 we prove their convergence under thelargest possible initialization region, while requiring a separation R min = Ω (cid:0) √ log K (cid:1) .For example, consider the case of equal weights π i = 1 /K , and an initial guess thatsatisfies k µ i − µ ∗ i k ≤ λ min j = i k µ ∗ j − µ ∗ i k for all i , with λ < / . Then, a separation R min ≥ C ( λ ) √ log K , with an explicit C ( λ ) , suffices to ensure that the population EMand gradient EM algorithms converge to the true means at a linear rate.Let us compare our results to several recent works that derived convergence guar-antees for EM and gradient EM. [23] and [22] proved local convergence to the globaloptima under a much larger minimal separation of R min ≥ C p min( d, K ) log K . 
Inaddition, the requirement on the initial estimates had a dependence on the maximalseparation, k µ i − µ ∗ i k ≤ R min − C p d log max( R max , K ) for a universal con-stant C . These results were significantly improved by [13], who proved the local con-vergence of the EM algorithm for the more general case of spherical Gaussians withunknown weights and variances. They required a far less restrictive minimal separation R min ≥ C √ log K , with a constant C ≥ , and their initialization was restricted to λ < . We should note that no particular effort was made to optimize these constants.In comparison to these works, we allow the largest possible initialization region λ < ,with no dependence on R max . Also, for small values λ ≤ / , our resulting constant C is roughly 6 times smaller that that of [13].Our second contribution concerns the required sample size to ensure accurate es-timation by the EM and gradient EM algorithms. Recently, [13] proved that with anumber of samples n = ˜Ω( d/π min ) , a sample splitting variant of EM is statisticallyoptimal. In this variant, the n samples are split into B distinct batches, with eachEM iteration using a separate batch. In contrast, for the standard EM and gradient egol et al./Improved Convergence Guarantees for EM EM algorithms, weaker results have been established so far. Currently, the best knownsample requirements for EM are n = ˜Ω( K dR /R ) , whereas for gradient EM, n = ˜Ω( K dR /R ) . In addition, the bounds for the resulting errors increase lin-early with R max , see [22, 23]. Note that in these two results, the required number ofsamples increases at least quadratically with the maximal separation between clusters,even though increasing R max should make the problem easier. In Theorems 3.3 and 3.4,we prove that for an initialization region with parameter λ strictly smaller than half, theEM and gradient EM algorithms yield accurate estimates with sample size ˜Ω( K d ) . Inparticular, both our sample size requirements and the bounds on the error of the EMand gradient EM have only a logarithmic dependence on R max .Our results on the initialization region and minimal separation stem from a carefulanalysis of the weights in the EM update and their effect on the estimated cluster cen-ters. Similarly to [13], we upper bound the expectation of the i -th weight when the datais drawn from a different component j = i and show that it is exponentially small in thedistance between the centers of the i and j components. We make use of the fact thatall Gaussians have the same covariance to reduce the expectation to one dimension anddirectly upper bound the one dimensional integral. This allows us to derive a sharperbound compared to [13] from which we obtain a larger contraction region for the pop-ulation EM and gradient EM algorithms. Our analysis of the finite sample behavior ofEM and gradient EM follows the general strategy of [23]. Our improved results rely ontighter bounds on the sub-Gaussian norm of the weights in the EM update which donot depend on the distance between the clusters. Over the past decades, several approaches to estimate the parameters of Gaussian mix-ture models were proposed. In addition, many works derived theoretical guaranteesfor these methods as well as information-theoretic lower bounds on the number ofsamples required for accurate estimation. 
Significant efforts were made in understand-ing whether GMMs can be learned efficiently both from a computational perspective,namely in polynomial run time, and from a statistical view, namely with a number ofsamples polynomial in the problem parameters.Method of moments approaches [11, 14, 8] can accurately estimate the parametersof general GMMs with R min arbitrarily small, at the cost of sample complexity, andthus also run time, that is exponential in the number of clusters. [9] showed that amethod of moments type algorithm can recover the parameters of spherical GMMs witharbitrarily close cluster centers in polynomial time, under the additional assumptionthat the components centers are affinely independent. This assumption implies that d ≥ K .Methods based on dimensionality reduction [4, 1, 2, 12, 17] can accurately esti-mate the parameters of a GMM in polynomial time in the dimension and numberof clusters, under conditions on the separation of the clusters’ centers. In particular,[17] proved that accurate recovery is possible with a minimal separation of R min =Ω(min( K, d ) ) .In general, it is not possible to learn the parameters of a GMM with number of sam-ples that is polynomial in the number of clusters, see [14] for an explicit example. [16] egol et al./Improved Convergence Guarantees for EM showed that for any function γ ( K ) = o ( √ log K ) one can find two spherical GMMs,both with R min = γ ( K ) such that no algorithm with polynomial sample complexitycan distinguish between them. [16] also presented a variant of the EM algorithm thatprovably learns the parameters of a GMM with separation Ω( √ log K ) , with polyno-mial sample complexity, but run time exponential in the number of components.More closely related to our manuscript, are several works that studied the ability ofEM and variants thereof to accurately estimate the parameters of a GMM. [5] showedthat with a separation of Ω( d ) , a two-round variant of EM produces accurate estimatesof the cluster centers. A significant advance was made by [3], who developed new tech-niques to analyze the local convergence of EM for rather general latent variable models.In particular, for a GMM with K = 2 components of equal weights, they proved thatthe EM algorithm converges locally at a linear rate provided that the distance betweenthe components is at least some universal constant. These results were extended in [20]and [6] where a full description of the initialization region for which the populationEM algorithm learns a mixture of any two equally weighted Gaussians was given. Asalready mentioned above, the three works that are directly related to our work, and towhich we compare in detail in Section 3 are [22], [23] and [13].
2. PROBLEM SETUP AND NOTATIONS
We write X ∼ GMM ( µ ∗ , π ) for a random variable with density given by Eq. (1).The distance between cluster means is denoted by R ij = k µ ∗ i − µ ∗ j k . We set R i =min j = i R ij . Expectation of a function f ( X ) with respect to X is denoted by E X [ f ( X )] ,or when clear from context simply by E [ f ( X )] . For simplicity of notation, we shallwrite E i [ f ( X )] = E X ∼N ( µ ∗ i ,I d ) [ f ( X )] . For a vector v we denote by k v k its Euclideannorm. For a matrix A , we denote its operator norm by k A k op = max k x k =1 k Ax k . Fi-nally, we denote by µ = ( µ ⊤ , . . . , µ ⊤ K ) ⊤ ∈ R Kd the concatenation of µ , . . . , µ K ∈ R d .As in previous works, we consider the following error measure for the quality of anestimate µ of the true means, E ( µ ) = max i ∈ [ K ] k µ i − µ ∗ i k . We will see that in the population case we can restrict our analysis to the space spannedby the K true cluster means and the K cluster estimates. It will therefore be convenientto define d = min( d, K ) . For any < λ < we define the region U λ = (cid:8) µ ∈ R Kd : k µ i − µ ∗ i k ≤ λR i ∀ i ∈ [ K ] (cid:9) . (4)For future use we define the following function which will play a key role in our anal-ysis, c ( λ ) = 18 (cid:18) − λ λ (cid:19) . (5) egol et al./Improved Convergence Guarantees for EM Given an estimate ( µ , . . . , µ K ) of the K centers, for any x ∈ R d and i ∈ [ K ] let w i ( x, µ ) = π i e − k x − µi k P Kj =1 π j e − k x − µj k . (6)The population EM update, denoted by µ + = ( µ +1 , . . . , µ + K ) is given by µ + i = E X [ w i ( X, µ ) X ] E X [ w i ( X, µ )] , ∀ i ∈ [ K ] . (7)The population gradient EM update with a step size s > is defined by µ + i = µ i + s E X [ w i ( X, µ )( X − µ i )] , ∀ i ∈ [ K ] . (8)Given an observed set of n samples X , . . . , X n ∼ X , the sample EM and samplegradient EM updates follow by replacing the expectations in (7) and (8) with theirempirical counterparts. For the EM, the update is µ + i = P nℓ =1 w i ( X ℓ , µ ) X ℓ P nℓ =1 w i ( X ℓ , µ ) , ∀ i ∈ [ K ] . (9)and for the gradient EM µ + i = µ i + s n n X ℓ =1 w i ( X ℓ , µ )( X ℓ − µ i ) , ∀ i ∈ [ K ] . (10)In this work, we study the convergence of EM and gradient EM, both in the pop-ulation setting and with a finite number of samples. In particular we are interested insufficient conditions on the initialization and on the separation of the GMM compo-nents that ensure convergence to accurate solutions.
3. LOCAL CONVERGENCE OF EM AND GRADIENT EM
As in previous works, we first study the convergence of EM in the population case andthen build upon this analysis to study the finite sample setting. Informally, our mainresult in this section is that for any fixed λ ∈ (0 , ) and an initial estimate µ ∈ U λ ,there exists a constant C ( λ ) such that for any mixture with R min & C ( λ ) q log π min the estimation error of a single population EM update (7) decreases by a multiplicativefactor strictly less than . This, in turn, implies convergence of the population EM tothe global optimal solution µ ∗ . Formally, our result is stated in the following theorem. egol et al./Improved Convergence Guarantees for EM Theorem 3.1.
Set λ ∈ (0 , ) . Let X ∼ GMM ( µ ∗ , π ) with R min ≥ s c ( λ ) log 32 ( K − p
14 (1 + θ )3 π min c ( λ ) (11) where c ( λ ) and θ are as defined in (5) and (3) , respectively. Then for any µ ∈ U λ itholds that E ( µ t ) ≤ t E ( µ ) where µ t is the t -th iterate of the population EM update (7) initialized at µ . We derive a similar result for gradient EM.
Theorem 3.2.
Set λ ∈ (0 , ) . Let X ∼ GMM ( µ ∗ , π ) with R min satisfying (11) . Thenfor any s ∈ (cid:16) , π min (cid:17) and any µ ∈ U λ it holds that E ( µ t ) ≤ γ t E ( µ ) where µ t is the t -th iterate of the population gradient EM update (8) with step size s and γ = 1 − sπ min . The proof of Theorem 3.1 appears in Section 4 with the technical details deferred tothe appendix. The proof of Theorem 3.2 is similar and appears in full in the appendix.It is interesting to compare Theorems 3.1 and 3.2 to several recent works, in termsof both the size of the initialization region, and the requirements on the minimal sep-aration. [22] and [23] assumed a separation R min = Ω( √ d log K ) and proved localconvergence of the gradient EM and of the EM algorithm, for an initialization regionof the following form, with C a universal constant, max i ∈ [ K ] k µ i − µ ∗ i k ≤ R min − C p d log max( R max , K ) . Recently, [13] significantly improved these works, proving convergence of populationEM with a much smaller separation R min ≥ p log( θK ) . Moreover, they consid-ered the more general and challenging case where the Gaussians may have differentvariances and the EM algorithm estimates not only the Gaussian centers, but also theirweights and variances. However, they proved convergence only for an initializationregion U λ with λ ≤ .Our results improve upon these works in several aspects. First, in comparison to thecontraction region of [22], our theorem allows the largest possible initialization region k µ i − µ ∗ i k < R i , with no dependence on the other problem parameters d , K and R max . This initialization region is optimal as there exists GMMs and initializations µ with k µ i − µ ∗ i k = R i such that the EM algorithm, even at the population level, willnot converge to values that are close to the true parameters.Second, in comparison to the result of [13], we allow λ to be as large as . Also,for λ < , our requirement on R min is nearly one order of magnitude smaller. Forexample, for a balanced mixture with π min = K , the right hand side of (11) reads vuut c ( λ ) log ( K ) + log 32 √ c ( λ ) ! . An initialization region k µ i − µ ∗ i k ≤ R i leads to a separation requirement R min ≥ . p log( K ) + 6 . , which is much smaller than √ log K . egol et al./Improved Convergence Guarantees for EM We now present our results on the EM and gradient EM algorithms for the finite samplecase.
Theorem 3.3.
Set λ ∈ (0 , ) , δ ∈ (0 , . Let X , . . . , X n i.i.d. ∼ GMM ( µ ∗ , π ) with R min satisfying (11) . Suppose that n is sufficiently large so that n log n > C Kd log (cid:16) ˜ Cδ (cid:17) π min max (cid:18) , − λ ) λ π min R (cid:19) . (12) where C is a universal constant and ˜ C = 100 K R max ( √ d + 2 R max ) . Assume aninitial estimate µ ∈ U λ and let µ t be the t -th iterate of the sample EM update (9) . Thenwith probability at least − δ , for all iterations t , µ t ∈ U λ and k µ ti − µ ∗ i k ≤ t E ( µ ) + C (1 − λ ) π i s Kd log ˜ Cnδ n (13) for a suitable absolute constant C . Theorem 3.4.
Set λ ∈ (0 , ) , δ ∈ (0 , . Let X , . . . , X n i.i.d. ∼ GMM ( µ ∗ , π ) with R min satisfying (11) . Set s ∈ (cid:16) , π min (cid:17) and suppose that n is sufficiently large so that n log n > CKd log ˜ Cδ π max i ∈ [ K ] max (cid:16) λ R i , − λ ) (cid:17) λ R i (14) where C is a universal constant and ˜ C = 36 K R max ( √ d + 2 R max ) . Assume aninitial estimate µ ∈ U λ and let µ t be the t -th iterate of the sample gradient EM update (10) with step size s . Then with probability at least − δ , µ t ∈ U λ for all t , and k µ ti − µ ∗ i k ≤ γ t E ( µ ) + C π i max (cid:18) − λ , λR i (cid:19) vuut Kd log (cid:16) ˜ Cnδ (cid:17) n (15) where γ = 1 − sπ min and C is a suitable absolute constant. The main idea in the proofs of Theorems 3.3 and 3.4 is to show the uniform conver-gence, inside the initialization region U λ , of the sample update to the population update.The sample size requirements (12) and (14) are such that the resulting error of a singleupdate of the EM and gradient EM algorithms is sufficiently small to ensure that theupdated means are in the contraction region U λ . This, combined with the convergenceof the population update, yields the required result. We outline the main steps of theproof in Section 5 with more technical details deferred to the appendix.Let us compare Theorems 3.3 and 3.4 to previous results, in terms of required sam-ple size and bounds on the estimation error. The strongest result to date, due to [13],considered a variant of the EM algorithm, whereby the samples are split into B sepa-rate batches, and at each iteration t (with ≤ t ≤ B ), the sample EM algorithm is run egol et al./Improved Convergence Guarantees for EM only using the data of the t -th batch. They showed that to achieve an error E ( µ B ) ≤ ǫ , the required sample size is ˜Ω( dπ min ǫ ) . The best known bounds without samplesplitting were derived by [22] and by [23]. The error guarantee for gradient EM is ˜ O ( n − / max( K R √ d, R max d )) , whereas for EM it is ˜ O ( n − / R max √ Kd/π min ) .The sample size requirements for gradient EM are n log n = ˜Ω(max( K R √ d, R max d ) /R ) and n log n = ˜Ω( Kdπ max(1 , R /R )) for EM. Note that these bounds have a de-pendence on the maximal separation R max . In particular, even though intuitively, as R max increases the problem should become easier, these error bounds increase linearlywith R max and the required sample size increases quadratically with R max . In contrast,in our two theorems above there is a dependence on / (1 − λ ) , which is strictly smallerthan R max by the separation condition (11). Thus, for λ bounded away from / , thereis only a logarithmic dependence on R max . We believe that with further effort, thedependence on R max can be fully eliminated.
4. PROOF FOR THE POPULATION EM
Our strategy is similar to [22] and [23]: We bound the error of a single update, k µ + i − µ ∗ i k in terms of E X [ w j ( X, µ )] and E X [ ∇ µ w j ( X, µ )( X − µ j )] , which in turn depend ontheir expectations with respect to individual Gaussian components. Our key result onthe latter expectation is the following Proposition, whose proof appears in the appendix. Proposition 4.1.
Set < λ < . Let X ∼ GMM ( µ ∗ , π ) with R min > r − λ log θ (16) where θ is defined in (3) . Then for any µ ∈ U λ and all j = i , with c ( λ ) defined in (5) , E i [ w j ( X, µ )] ≤ (cid:18) π j π i (cid:19) e − c ( λ ) R ij . (17)This proposition shows that E i [ w j ( X, µ )] is exponentially small in the separation R ij and is key to proving contraction of the EM and gradient EM updates. A similarresult was proven in [13]. The main differences are that they assumed a smaller re-gion with λ < and obtained a looser exponential bound exp( − R ij / . However,they considered a more challenging case where the weights π i and variances of the K Gaussian components are unknown and are also estimated by the EM procedure.The key idea in proving Proposition 4.1 is that for X ∼ N ( µ ∗ i , I d ) it suffices toanalyze the random variable w j ( X, µ ) on the one dimensional space spanned by µ i − µ j .Thus, the expectation over a d dimensional random vector is reduced to the expectationof some explicit function over a univariate standard Gaussian. An immediate corollaryis that under the same conditions as in Proposition 4.1, the following lower bound holdsfor the expectation E i [ w i ( X, µ )] . Corollary 4.1.1.
Set < λ < and suppose that R min satisfies (16) . Then ∀ µ ∈ U λ E i [ w i ( X, µ )] ≥ − ( K − θ ) e − c ( λ ) R i . (18) egol et al./Improved Convergence Guarantees for EM Next, note that for X ∼ GMM ( µ ∗ , π ) , it holds that E X [ w i ( X, µ ∗ )] = π i . Thus, forcenter estimates µ close to µ ∗ we expect that E X [ w i ( X, µ )] > π i . This intuition ismade precise in the following lemma which follows readily from Corollary 4.1.1. Lemma 4.2.
Fix < λ < . Let X ∼ GMM ( µ ∗ , π ) and suppose that R min ≥ p c ( λ ) − log(15( K − θ )) . (19) Then for any i ∈ [ K ] and any µ ∈ U λ , E X [ w i ( X, µ )] ≥ π i . (20)Next, we turn to the term E X [ ∇ µ w i ( X, µ )( X − µ i )] . By definition, ∇ µ w i ∈ R Kd has the following K components, each a vector in R d , ∂w i ( X, µ ) ∂µ i = − w i ( X, µ )(1 − w i ( X, µ ))( µ i − X ) (21)and for j = i ∂w i ( X, µ ) ∂µ j = w i ( X, µ ) w j ( X, µ )( µ j − X ) . (22)For future use we introduce the following quantities related to E X [ ∇ µ w i ( X, µ )( X − µ i )] . For any µ, v ∈ R Kd , define V i,j ( µ, v ) = k E X [ w i ( X, µ ) w j ( X, µ )( X − v i )( X − µ j ) ⊤ ] k op , (23) V i,i ( µ, v ) = k E X [ w i ( X, µ )(1 − w i ( X, µ ))( X − v i )( X − µ i ) ⊤ ] k op . (24)The following lemma, proved in the appendix, provides a bound on these quantities. Lemma 4.3.
Fix < λ < . Let X ∼ GMM ( µ ∗ , π ) with R min satisfying Eq. (16).Assume µ ∈ U λ and v = µ or v = µ ∗ . Then, for any i, j ∈ [ K ] with i = jV i,i ( µ, v ) ≤ p C ( K −
1) (1 + θ ) max (cid:0) d , R i (cid:1) e − c ( λ )2 R i , (25) V i,j ( µ, v ) ≤ p C (1 + θ ) max (cid:0) d , max( R i , R j ) (cid:1) e − c ( λ )2 max( R i ,R j ) , (26) where C is a universal constant, for example we can take C = 14 . Expressions related to V i,i and V i,j were also studied by [22]. They required a muchlarger separation, R min ≥ C √ d log K , and their resulting bounds involved also R max . Remark 4.1.
In proving the convergence of EM, the quantities of interest are V i,j ( µ, µ ∗ ) and V i,i ( µ, µ ∗ ) , whereas for the gradient EM algorithm the relevant quantities are V i,j ( µ, µ ) , V i,i ( µ, µ ) . The reason for the effective dimension d = min( d, K ) is thatfor d > K , in the population setting, the EM update of µ always remains in the sub-space spanned by the K vectors { µ i } Ki =1 and { µ ∗ i } Ki =1 . In the case of gradient EM,one may define a potentially smaller effective dimension d = min( d, K ) . egol et al./Improved Convergence Guarantees for EM Last but not least, the following auxiliary lemma shows that µ ∗ is a fixed point ofthe population EM update. Lemma 4.4.
Let X ∼ GMM ( µ ∗ , π ) . Then ∀ i ∈ [ K ] , E X [ w i ( X, µ ∗ )( X − µ ∗ i )] = 0 . With all the pieces in place, we are now ready to prove Theorem 3.1.
Proof of Theorem 3.1.
Consider a single EM update, as given by Eq. (7), k µ + i − µ ∗ i k = 1 E X [ w i ( X, µ )] · k E X [ w i ( X, µ )( X − µ ∗ i )] k , ∀ i ∈ [ K ] Using Lemma 4.4, we may write the numerator above as follows, E X [ w i ( X, µ )( X − µ ∗ i )] = E X [( w i ( X, µ ) − w i ( X, µ ∗ ))( X − µ ∗ i )] . (27)By the mean value theorem there exists µ τ on the line connecting µ and µ ∗ such that w i ( X, µ ) − w i ( X, µ ∗ ) = ∇ µ w i ( X, µ τ ) ⊤ ( µ − µ ∗ ) . (28)Inserting the expressions (21) and (22) for the gradient of w i into Eq. (28) gives w i ( X, µ ) − w i ( X, µ ∗ ) = w i ( X, µ τ )(1 − w i ( X, µ τ ))( X − µ τi ) ⊤ ( µ i − µ ∗ i ) − X j = i w i ( X, µ τ ) w j ( X, µ τ ))( X − µ τj ) ⊤ ( µ j − µ ∗ j ) dτ. Taking expectations, and using the definitions of V ii and V ij , Eqs. (23) and (24), gives k E [( w i ( X, µ ) − w i ( X, µ ∗ ))( X − µ ∗ i )] k ≤ k X j =1 V ij ( µ τ , µ ∗ ) k µ j − µ ∗ j k . (29)Since µ τ ∈ U λ , we may apply Lemma 4.3 to bound the terms on the right hand sideabove. Furthermore, given that x e − tx is monotonic decreasing for all x > p /t and R i ≥ p /c ( λ ) , we may replace all R i , R j in the bounds of Lemma 4.3 by R min .Defining U = K − √ C (1+ θ )3 π min , we thus have k E X [( w i ( X, µ ) − w i ( X, µ ∗ ))( X − µ ∗ i )] k ≤ π min U · e − c ( λ )2 R E ( µ ) . Next, note that condition (11) on R min implies that it also satisfies the weaker condition(19) of Lemma 4.2. Invoking this lemma yields that E X [ w i ( X, µ )] ≥ π min . Thus, k µ i − µ ∗ i k ≤ U max (cid:0) d , R (cid:1) e − c ( λ )2 R · E ( µ )2 . If d ≥ R , then for E ( µ + ) ≤ E ( µ ) to hold the minimal separation must satisfy c ( λ )2 R ≥ log( d U ) . (30) egol et al./Improved Convergence Guarantees for EM In contrast, if R ≥ d we obtain the following inequality for w = c ( λ )2 R , we − w ≤ c ( λ )2 U . (31)Note that for w > , the function we − w is monotonic decreasing. Also, consider thevalue w ∗ = 2 log(2 U/c ( λ )) which is larger than 1, given the definitions of U and of c ( λ ) . It is easy to show that w ∗ exp( − w ∗ ) ≤ c ( λ ) / U . Hence a sufficient conditionfor (31) to hold is that w > w ∗ , namely c ( λ )2 R ≥ Uc ( λ ) . (32)It is easy to verify that log U +log(4 /c ( λ )) > log d and thus the bound of (32) is morerestrictive than (30). Inserting the expression for U into Eq. (32) yields the conditionof the Theorem, Eq. (11). Finally, to complete the proof we need to show that for all i , k µ + i − µ ∗ i k ≤ λR i . This part is proven in auxiliary lemma A.5 in the appendix.
5. PROOF FOR THE SAMPLE EM
In this section we prove our results on the sample EM and gradient EM algorithms.The main idea is to show concentration results for both the denominator and the numer-ator of the EM update. Our strategy is similar to [23] but with several improvements.First, our result on the concentration of the denominator of the EM update, Lemma5.1, only considers samples from the i -th cluster. Thus, in Lemma 5.2, we obtain auniform lower bound for the weight w i with n = ˜Ω( Kd/π min ) compared to the larger n = ˜Ω( Kd/π ) in [23]. Second, while [23] bounded the sub-Gaussian norm ofthe numerator of the EM update by CR max , we derive in Lemma 5.3 a tighter bound,which does not depend on R max . This in turn, yields a tighter concentration for thenumerator of the EM update in Lemma 5.4. Lemma 5.1.
Fix δ ∈ (0 , , λ ∈ (0 , ) and let X , . . . , X n i i.i.d. ∼ N ( µ ∗ i , I d ) . Thenwith probability at least − δ , sup µ ∈U λ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n i n i X ℓ =1 w i ( X ℓ , µ ) − E i [ w i ( X, µ )] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ vuut ˜ c Kd log (cid:16) ˜ Cn i δ (cid:17) n i (33) where ˜ C = 18 K ( √ d + 2 R max ) R max and ˜ c is a suitable universal constant. As we saw in Lemma 4.2, the denominator in the population EM update for the i -th mean is lower bounded by π i . We use Lemma 5.1 to show that this lower boundholds also for the finite sample case. We remark that a version of the following lemmaappeared in [23], but with a larger sample size requirement of n = ˜Ω( Kd/π ) . Lemma 5.2.
Fix δ ∈ (0 , , λ ∈ (0 , ) . Let X , . . . , X n i.i.d. ∼ GMM ( µ ∗ , π ) , with R min that satisfies (19) . Assume a sufficiently large sample size n such that n log n > C Kd log ˜ Cδ π min (34) egol et al./Improved Convergence Guarantees for EM where ˜ C = 100 K π max ( √ d + 2 R max ) R max and C is a universal constant. For any i ∈ [ K ] , define the event D i = ( inf µ ∈U λ n n X ℓ =1 w i ( X ℓ , µ ) ≥ π i ) . (35) Then, the event D i occurs with probability at least − δ K . Next, we analyze the sub-Gaussian norm of w i ( X, µ )( X − µ ∗ i ) . [23] bounded thisquantity by CR max . We present an improved bound which does not depend on R max .For the definition of the sub-Gaussian norm k · k ψ , see the Appendix. Lemma 5.3.
Fix λ ∈ (0 , ) . Let X ∼ GMM ( µ ∗ , π ) with R min ≥ s max (cid:18) − λ log (cid:18) ) θ − λc ( λ ) (cid:19) , c ( λ ) log 2 (cid:19) . (36) Suppose that µ ∈ U λ . Then for any i ∈ [ K ] , k w i ( X, µ ) ( X − µ ∗ i ) k ψ ≤ − λ (37) and k w i ( X, µ ) ( X − µ i ) k ψ ≤
24 max (cid:18) − λ , λR i (cid:19) . (38)Using Lemma 5.3 we upper bound the concentration of the numerator in the expres-sion for the error in the sample EM update, Eq. (9). Lemma 5.4.
Fix δ ∈ (0 , , λ ∈ (0 , ) . Let X , . . . , X n i.i.d. ∼ GMM ( µ ∗ , π ) with R min satisfying (36) . For i ∈ [ K ] define S i = n P nℓ =1 w i ( X ℓ , µ )( X ℓ − µ ∗ i ) and the event N i = sup µ ∈U λ k S i − E [ w i ( X, µ )( X − µ ∗ i )] k ≤ C − λ s Kd log ˜ Cnδ n (39) Then, with ˜ C = 36 K R max ( √ d + 2 R max ) and with a suitable choice of a universalconstant C , the event N i occurs with probability at least − δ K . With all the pieces in place, we are now ready to prove Theorem 3.3.
Proof of Theorem 3.3.
Consider the error of a single the update of the from (9) of thesample EM algorithm, k µ + i − µ ∗ i k = k n P nℓ =1 w i ( X ℓ , µ )( X ℓ − µ ∗ i ) k n P nℓ =1 w i ( X ℓ , µ ) . Note that the requirement (11) on R min is more restrictive than (19). Also, the samplesize requirement (12) is more restrictive than (34). Thus, we may invoke Lemma 5.2and get that with probability at least − δ K , that event D i (35) occurs. Hence, k µ + i − µ ∗ i k ≤ π i k S i − E X [ w i ( X, µ )( X − µ ∗ i )] k + 43 π i k E X [ w i ( X, µ )( X − µ ∗ i )] k egol et al./Improved Convergence Guarantees for EM It follows from Theorem 3.1 that for R min satisfying (11), the second term above is up-per bounded by min( E ( µ ) , λR i ) . We thus continue by bounding the first term above.Note that our requirements on the minimal separation (11) is more restrictive than therequirement in (36). Thus, we may invoke Lemma 5.4 and obtain with probability atleast − δ K , that the event N i (39) occurs. Therefore, k µ + i − µ ∗ i k ≤
12 min( E ( µ ) , λR i ) + C (1 − λ ) π i s Kd log ˜ Cnδ n (40)where C is a universal constant and ˜ C = 36 K R max ( √ d +2 R max ) . For n sufficientlylarge so that (12) is satisfied, it holds that C − λ ) π i q Kd log ˜ Cnδ n ≤ λR i and therefore k µ + i − µ ∗ i k ≤ λR i . By a union bound over all i ∈ [ K ] , with probability at least − δ , µ + ∈ U λ . This allows us to iteratively apply (40) and obtain Eq. (13). Appendix A: PROOFS FOR SECTION 4
A.1. Proof of Proposition 4.1
Before proving Proposition 4.1 we state several auxiliary lemmas.
Lemma A.1.
Let g ( A, B ) be the following function of two variables, g ( A, B ) = Z √ π
11 + αe At + B e − t / dt where α > is a fixed constant. Then: (i) For any fixed A , g ( A, B ) is monotonicdecreasing in B ; and (ii) If in addition α > e − B and A > , then for any fixed B , g ( A, B ) is monotonic increasing in A .Proof. Since the function inside the integral is monotonically decreasing in B , part (i)directly follows. To prove part (ii), we take the derivative with respect to A , ∂∂A g ( A, B ) = Z − αte At + B (1 + αe At + B ) e − t / √ π dt. Denote the function inside the integral by f ( t ) . Note that f ( t ) > when t < and f ( t ) < when t > . To show that the integral is positive it suffices to show that forall t > it holds that − f ( t ) < f ( − t ) . This condition reads as e − At (1 + αe − At + B ) > e At (1 + αe At + B ) . Some algebraic manipulations give that this condition is equivalent to (cid:0) e At − e − At (cid:1) (cid:0) α e B − (cid:1) > which is indeed satisfied for A, t > and α > e − B . egol et al./Improved Convergence Guarantees for EM Lemma A.2.
Fix any two distinct vectors µ ∗ i , µ ∗ j ∈ R d and λ ∈ (0 , / . Denote theball of radius r about the origin in R d by B d (0 , r ) and define Ω = B d (0 , λ k µ ∗ i − µ ∗ j k ) × B d (0 , λ k µ ∗ i − µ ∗ j k ) ⊂ R d × R d . Consider the two functions
A, B : Ω → R A ( ξ i , ξ j ) = k µ ∗ i − ξ i − µ ∗ j + ξ j k (41) B ( ξ i , ξ j ) = 12 k µ ∗ i − µ ∗ j + ξ j k − k ξ i k . (42) Then for any ( ξ i , ξ j ) ∈ Ω , A ( ξ i , ξ j ) ≤ (1 + 2 λ ) k µ ∗ i − µ ∗ j k = A ∗ (43) B ( ξ i , ξ j ) ≥ − λ k µ ∗ i − µ ∗ j k = B ∗ . (44) Proof.
We first prove the upper bound on A . By the triangle inequality A ( ξ i , ξ j ) ≤ k ξ i k + k ξ j k + k µ ∗ i − µ ∗ j k ≤ (1 + 2 λ ) k µ ∗ i − µ ∗ j k As for the lower bound on B , clearly it is obtained when k ξ i k is maximal, i.e. k ξ i k = λ k µ ∗ i − µ ∗ j k . Finally, the vector ξ j = λ ( µ ∗ i − µ ∗ j ) minimizes (42) regardless of the valueof ξ i . This yields the lower bound of (44) for B . Proof of Proposition 4.1.
Recall the definition of the weight w j ( X, µ ) in Eq (6). Sinceall the terms in the denominator are positive, we may upper bound w j by taking intoaccount only the two terms with indices k = i and k = j . Hence, w j ( X, µ ) ≤ π j e − k X − µj k π j e − k X − µj k + π i e − k X − µi k = 11 + π i π j e k X − µj k − k X − µi k . (45)Next, since X ∼ N ( µ ∗ i , I d ) we may write X = µ ∗ i + η = µ i + η + ξ i where η ∼N (0 , I d ) and ξ i = µ ∗ i − µ i . Therefore, k X − µ j k − k X − µ i k = 2 η ⊤ ( µ i − µ j ) + k µ ∗ i − µ ∗ j + ξ j k − k ξ i k . (46)Note that by definition η ⊤ ( µ i − µ j ) is a univariate Gaussian random variable with meanzero and variance k µ i − µ j k . Hence, we may write η ⊤ ( µ i − µ j ) = k µ i − µ j k ν where ν ∼ N (0 , . Defining ˜ w ( A, B, ν ) = 1 / (1 + π i π j e Aν + B ) , we therefore have E i [ w j ( X, µ )] ≤ E ν [ ˜ w ( A, B, ν )] = 1 √ π Z ˜ w ( A, B, t ) e − t dt = g ( A, B ) (47)with A = A ( ξ i , ξ j ) and B = B ( ξ i , ξ j ) as defined in (41) and (42), respectively. Since k ξ i k , k ξ j k ≤ λ k µ ∗ i − µ ∗ j k , then A ≥ (1 − λ ) k µ ∗ i − µ ∗ j k . Therefore, A > for λ < . By Lemma A.2, B ≥ B ∗ with B ∗ given in (44). The condition (16) impliesthat π i π j > e − B ∗ ≥ e − B . Hence, the conditions of Lemma A.1 are satisfied and we can egol et al./Improved Convergence Guarantees for EM upper bound g ( A, B ) in (47), by g ( A ∗ , B ∗ ) with A ∗ and B ∗ respectively, as given inEquations (43) and (44) of Lemma A.2. Therefore, E i [ w j ( X, µ )] ≤ √ π Z ˜ w ( A ∗ , B ∗ , t ) e − t dt = I. To upper bound the integral I we split it into two parts based on the sign of A ∗ t + B ∗ . I = 1 √ π − B ∗ /A ∗ Z −∞ ˜ w ( A ∗ , B ∗ , t ) e − t dt + 1 √ π ∞ Z − B ∗ /A ∗ ˜ w ( A ∗ , B ∗ , t ) e − t dt = I + I . For I , where A ∗ t + B ∗ < , we upper bound ˜ w ( A ∗ , B ∗ , t ) ≤ . Since both A ∗ and B ∗ are positive, we have that − B ∗ A ∗ < . We can therefore use Chernoff’s bound to get I ≤ − B ∗ /A ∗ Z −∞ √ π e − t dt ≤ e − ( B ∗ A ∗ ) . (48)For I , where A ∗ t + B ∗ > we upper bound the integral by ignoring the constant inthe denominator. Completing the square and changing variables by z = t + A ∗ we get I ≤ ∞ Z − B ∗ /A ∗ √ π π j π i e − t − A ∗ t − B ∗ dt = e A ∗ − B ∗ ∞ Z − B ∗ /A ∗ √ π π j π i e − ( t + A ∗ )22 dt = e A ∗ − B ∗ ∞ Z A ∗ − B ∗ /A ∗ √ π π j π i e − z dz. Using the definitions of A ∗ and B ∗ in (43) and (44) we note that for λ > , A ∗ − B ∗ A ∗ = 2 (1 + 2 λ ) − (1 − λ )2 k µ ∗ i − µ ∗ j k > . We can therefore apply Chernoff’s bound on the above and obtain I ≤ π j π i e A ∗ − B ∗ − ( A ∗ − B ∗ A ∗ ) = π j π i e − ( B ∗ A ∗ ) . (49)Combining the two bounds (48) and (49) yields Eq (17). Proof of Corollary 4.1.1.
By definition, the sum of all weights is one. Thus, w i ( X, µ ) = 1 − X j = i w j ( X, µ ) . By Proposition 4.1 and the linearity of expectation E i [ w i ( X, µ )] ≥ − X j = i (cid:18) π j π i (cid:19) e − c ( λ ) R ij ≥ − ( K − θ ) e − c ( λ ) R i . egol et al./Improved Convergence Guarantees for EM A.2. Proof of Lemma 4.2
Proof.
Since X is distributed as a GMM with K components and w i ( X, µ ) > , theexpected value is greater than if we consider only the i -th component of the GMM. E [ w i ( X, µ )] = K X j =1 π j E j [ w i ( X, µ )] ≥ π i E i [ w i ( X, µ )] . Since the requirement (19) on R min implies (16), it follows from Corollary 4.1.1 that E X [ w i ( X, µ )] ≥ π i (cid:16) − ( K − θ )) e − c ( λ ) R i (cid:17) . Furthermore, Eq. (19) implies that ( K − θ ) e − c ( λ ) R i ≤ . A.3. Proof of Lemma 4.3
The proof consists of several steps. First, in Lemma A.3 we reduce the dimension to d = min( d, K ) . Next, in Lemma A.4 we bound k E k [( X − v i )( X − µ j ) ⊤ ] k op interms of d , R i , R j and R i j . We then present the proof of the Lemma.We first introduce notations. For µ, v ∈ R Kd and i, j, k ∈ [ K ] with i = j we define, V kij ( µ, v ) = k E k [ w i ( X, µ ) w j ( X, µ )( X − v i )( x − µ j ) ⊤ ] k op (50) V kii ( µ, v ) = k E k [ w i ( X, µ )(1 − w i ( X, µ ))( X − v i )( x − µ i ) ⊤ ] k op . (51)Suppose that X ∼ N ( µ ∗ k , I d ) . Let Γ be a rotation matrix such that (Γ µ i ) ⊤ =( µ i ⊤ , ⊤ [ d − d ] + ) and (Γ v i ) ⊤ = ( v i ⊤ , ⊤ [ d − d ] + ) , for all i ∈ [ K ] . Write X d for thefirst d coordinates of Γ X and X d − d for the remaining coordinates. We define V kij ( µ, v ) = k E k [ w i ( X d , µ )( w j ( X d , µ ))( X d − v i )( X d − µ j ) ⊤ ] k op (52) V kii ( µ, v ) = k E k [ w i ( X d , µ )(1 − w i ( X d , µ ))( X d − v i )( X d − µ i ) ⊤ ] k op . (53) Lemma A.3.
For any i, j, k ∈ [ K ] with i = j , V kij ≤ max (cid:16) V kij , E k [ w i ( X, µ ) w j ( X, µ )] (cid:17) ,V ki,j ≤ max (cid:16) V ki,j , E k [ w i ( X, µ )(1 − w i ( X, µ ))] (cid:17)
The proof is similar to the one in [22]. We include it for our paper to be self con-tained.
Proof.
We prove only the first inequality. The proof of the second inequality is similar.Note that ( X − v i ) ( X − µ j ) ⊤ is equal to Γ ⊤ (cid:16) X d − v i (cid:17) (cid:16) X d − µ j (cid:17) ⊤ (cid:16) X d − v i (cid:17) (cid:16) X [ d − d ] + (cid:17) ⊤ X [ d − d ] + (cid:16) X d − µ j (cid:17) ⊤ X [ d − d ] + (cid:16) X [ d − d ] + (cid:17) ⊤ Γ . egol et al./Improved Convergence Guarantees for EM Now, since Γ is a rotation matrix and the last [ d − d ] + coordinates of Γ µ i are weget that w i ( X, µ ) = w i ( X d , µ ) . Therefore w i ( X, µ ) , (cid:16) X d − v i (cid:17) , (cid:16) X d − µ j (cid:17) areindependent of X [ d − d ] + . Thus, V kij ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V ijk C ijk (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ max (cid:16) V kij ( µ, v ) , k C ijk k op (cid:17) where V ijk and V ijk are defined in (50) and (52), respectively and C ijk = E k (cid:20) w i (cid:16) X d , µ (cid:17) w j (cid:16) X d , µ (cid:17) X [ d − d ] + (cid:16) X [ d − d ] + (cid:17) ⊤ (cid:21) = E X d h w i (cid:16) X d , µ (cid:17) w j (cid:16) X d , µ (cid:17)i E X [ d − d (cid:20) X [ d − d ] + (cid:16) X [ d − d ] + (cid:17) ⊤ (cid:21) = E X d h w i (cid:16) X d , µ (cid:17) w j (cid:16) X d , µ (cid:17)i I [ d − d ] + Since w i ( X, µ ) = w i ( X d , µ ) , we may return to the original variables and write k C ijk k op = E k [ w i ( X, µ ) w j ( X, µ )] . Hence, the lemma follows.
Lemma A.4. fix λ ∈ (0 , ) . Let ( µ ∗ , . . . , µ ∗ K ) be the centers of a K component GMM.Let X ∼ N ( µ ∗ k , I d ) for some k ∈ [ K ] . Then, for any i, j ∈ [ K ] and any µ, v ∈ U λ , E k [ k ( X − v i )( X − µ j ) ⊤ k op ] ≤ C max (cid:0) d , R i (cid:1) i = j = k max (cid:16) d , R ik R jk (cid:17) k = i, k = j max (cid:0) d , R ij (cid:1) k = i, k = j (54) where C is a universal constant, for example we can take C = 14 .Proof. First, for any rank matrix uv ⊤ it holds that k uv T k op = k u k · k v k . Thus, E k (cid:2) k ( X − v i )( X − µ j ) ⊤ k op (cid:3) = E k (cid:2) k X − v i k · k X − µ j k (cid:3) . Next, since X ∼ N ( µ ∗ k , I d ) we may write X = µ ∗ k + η , where η ∼ N (0 , I d ) . Thus, k X − v i k · k X − µ j k = k µ ∗ k − v i + η k · k µ ∗ k − µ j + η k . Let Γ be a rotation matrix such that Γ ( µ ∗ k − v i ) = R ∗ ik e , Γ ( µ ∗ k − µ j ) = R ∗ kj cos ( α ) e + R ∗ kj sin ( α ) e where R ∗ ik = k µ ∗ k − v i k , R ∗ kj = k µ ∗ k − µ j k and α is the angle between e and Γ( µ ∗ k − µ j ) .Then by applying Γ to and using the rotation invariance of the Gaussian distribution, k X − v i k = ( R ∗ ik + η ) + η + X q> η q egol et al./Improved Convergence Guarantees for EM and k X − µ j k = ( R ∗ kj cos α + η ) + ( R ∗ kj sin α + η ) + X q> η q . It is easy to show that the expectation of the above expression is maximal when α = 0 .In this case, we can write the expectation as follows E [ k X − v i k · k X − µ j k ] ≤ E [( A + C ) · ( B + C )] where A = ( R ∗ ik + η ) follows a non-central χ distribution with one degree offreedom and non-centrality parameter ( R ∗ ik ) , C = P q ≥ η q follows a central χ distribution with d − degrees of freedom, and B = ( R ∗ kj + η ) . Using known resultson the moments of central and non-central χ random variables, E [ AB ] = E [(( R ∗ ik ) + 2 R ∗ ik η + η )(( R ∗ kj ) + 2 R ∗ kj η + η )]= ( R ∗ ik R ∗ kj ) + ( R ∗ ik ) + ( R ∗ kj ) + 4 R ∗ ik R ∗ kj + 3 . and E [( A + C ) · ( B + C )] = E [ AB ] + ( E [ A ] + E [ B ]) E [ C ] + E [ C ]= E [ AB ] + [( R ∗ ik ) + ( R ∗ kj ) + 2]( d −
1) + d − . Since µ, v ∈ U λ it holds that R ∗ ik ≤ R ik + R i , R ∗ kj ≤ R kj + R j .Now we consider several different cases. First, for i = j = k we have R ik = R jk =0 and R i = R j . Hence E i [ k ( X − v i )( X − µ i ) T k op ] ≤ C max( d , R i ) . Next, if k is distinct from both i and j , then R i ≤ R ik and R j ≤ R kj . Hence, E k [ k ( X − v i )( X − µ j ) ⊤ k op ] ≤ C max( d , R ik R kj ) . Finally, we consider the case where j = i but k is not distinct from both i and j ,without loss of generality k = i . Then R ik = 0 , R kj = R ij . By definition, R j ≤ R ij and R i ≤ R ij . Thus, E i [ k ( X − v i )( X − µ j ) ⊤ k op ] ≤ C max( d , R ij ) . We are now ready to prove the lemma. For clarity we present in two separate partsthe proof of Eq. (26) and of Eq. (25).
Proof of Eq. (26) in Lemma 4.3.
The first step is to separate the expectation over theGMM to its K components. By the triangle inequality, V ij ( µ, v ) ≤ X k π k V kij ( µ, v ) (55) egol et al./Improved Convergence Guarantees for EM with V kij as defined in (50). By Lemma A.3, for each k , V kij ( µ, v ) ≤ max( V kij ( µ, v ) , E k [ w i ( X, µ ) w j ( X, µ )]) (56)with V kij as defined in Eq. (52). We now separately analyze each of the two terms onthe right hand size of (56). We start with the second term. When k = i , by Proposition4.1 E i [ w i ( X, µ ) w j ( X, µ )] ≤ E i [ w j ( X, µ )] ≤ (1 + θ ) e − c ( λ ) R ij . By symmetry, the same bound holds also for k = j .Next we bound V kij and we shall later see that it is the largest of the two quantitiesin (56). Note that by the Cauchy-Schwarz inequality V kij ≤ r E k h ( w i ( X, µ ) w j ( X, µ )) is E k (cid:20)(cid:13)(cid:13)(cid:13) ( X d − v i )( X d − µ j ) ⊤ (cid:13)(cid:13)(cid:13) op (cid:21) . By Lemma A.4 there exists a universal constant C such that Eq (54) holds with dimen-sion d . Thus, for the first term in Eq. (55), with k = i , V iij ( µ, v ) ≤ q E i [( w i ( X, µ ) w j ( X, µ )) ] √ C max( d , R ij ) ≤ q E i [ w j ( X, µ )] √ C max( d , R ij ) and by Proposition 4.1 V iij ( µ, v ) ≤ p C (1 + θ ) e − c ( λ )2 R ij max (cid:0) d , R ij (cid:1) . (57)Similarly, for k = j , V jij ( µ, v ) ≤ p C (1 + θ ) e − c ( λ )2 R ij max (cid:0) d , R ij (cid:1) . (58)Hence, in these two cases indeed V ijk is the dominant term.Finally we consider the case k = i, j . Again by Proposition 4.1 E k [ w i ( X, µ ) w j ( X, µ )] ≤ (1 + θ ) e − c ( λ ) max( R ik ,R jk ) . As for the first term, V kij ≤ r E k h ( w i ( X, µ ) w j ( X, µ )) i √ C max ( d , R ik R jk ) ≤ p C (1 + θ ) e − c ( λ )2 max( R ik ,R jk ) max( d , max( R ik , R jk ) ) . (59)Inserting (57), (58) and (59) into (55) and summing over the components gives V i,j ( µ, v ) ≤√ C √ θ h ( π i + π j ) max( d , R ij ) e − c ( λ )2 R ij + X k = i,j π k max( d , max( R ik , R jk ) ) e − c ( λ )2 max( R ik ,R jk ) i . egol et al./Improved Convergence Guarantees for EM Since the function x e − tx is monotonic decreasing for x > p /t , and R min > p /c ( λ ) we may replace R ij by max( R i , R j ) in the first term. Similarly, we mayreplace R ik by R i and R jk by R j in the second sum above. This yields Eq. (26). Proof of Eq. (25) in Lemma 4.3.
The first step is to separate the expectation over theGMM to its K components. By the triangle inequality, V ii ( µ, v ) ≤ K X k =1 π k V kii ( µ, v ) (60)with V kii as defined in (51). We now bound each V kii separately. First, by Lemma A.3, V kii ≤ max( V kii , E k [(1 − w i ( X, µ )) w i ( X, µ )]) (61)with V kii as defined in (53). We now analyze each component separately. We first bound V kii and we shall later see that it is the largest of the two quantities in (61). By theCauchy-Schwarz inequality, V kii ( µ, v ) ≤ E k h ( w i ( X, µ )(1 − w i ( X, µ ))) i · E k (cid:20)(cid:13)(cid:13)(cid:13) ( X d − v i )( X d − µ i ) ⊤ (cid:13)(cid:13)(cid:13) op (cid:21) . By Lemma A.4, there exists a universal constant C such that Eq. (54) holds with di-mension d . Thus for k = i , V iii ( µ, v ) ≤ r E i h ( w i ( X, µ )(1 − w i ( X, µ ))) i √ C max (cid:0) d , R i (cid:1) ≤ p E i [1 − w i ( X, µ )] √ C max (cid:0) d , R i (cid:1) and by Corollary 4.1.1, V iii ( µ, v ) ≤ p C ( K − θ ) e − c ( λ )2 R i max (cid:0) d , R i (cid:1) . (62)Similarly, we upper bound the second quantity on the right hand side of (61) as follows, E i [ w i ( X, µ )(1 − w i ( X, µ ))] ≤ ( K − θ ) e − c ( λ ) R ij . Thus V kij ( µ, v ) is the dominant term in the maximum in (61).Now, for k = i , V kij ( µ, v ) ≤ r E k h ( w i ( X, µ )(1 − w i ( X, µ ))) i √ C max (cid:0) d , R ik (cid:1) ≤ √ θe − c ( λ ) R ik √ C max (cid:0) d , R ik (cid:1) The function x e − tx is monotonic decreasing for x > √ t − . Since R i ≥ p c ( λ ) − ,so does R ik , and we may replace it in the equation above by R i . Namely, V kij ( µ, v ) ≤ √ θ √ C max( d , R i ) e − c ( λ ) R i / (63)Inserting (62) and (63) into (60), and summing over all components yields Eq. (25). egol et al./Improved Convergence Guarantees for EM A.4. Completing the Proof of Theorem 3.1
The following lemma completes the proof of the Theorem.
Lemma A.5.
Let X ∼ GMM ( µ ∗ , π ) with R min satisfying (11) . Let µ + be the popula-tion EM update (7) . Then for every i ∈ [ K ] it holds that k µ + i − µ ∗ i k ≤ λR i .Proof. Our starting point is Eq. (29), k E X [( w i ( X, µ ) − w i ( X, µ ∗ ))( X − µ ∗ i )] k ≤ K X j =1 sup µ ∈U λ V i,j ( µ, µ ∗ ) k µ j − µ ∗ j k . We insert the bounds (25) and (26) on V i,i and V i,j , respectively, to the above.Since µ ∈ U λ we may replace all k µ k − µ ∗ k k in the expressions above by λR k . Since x e − tx is monotonic decreasing for all x > p / t , we may replace all R i , R j aboveby R min . Defining U = K − √ C (1+ θ )3 π min , this gives k E X [( w i ( X, µ ) − w i ( X, µ ∗ ))( X − µ ∗ i )] k ≤ λR min · π min U max (cid:0) d , R (cid:1) e − c ( λ ) R . By Eqs. (27) and (20), it follows that k µ + i − µ ∗ i k ≤ λR min · e − c ( λ )2 R U max( d , R ) The separation condition (11) suffices to ensure that k µ + i − µ ∗ i k ≤ λR min . Appendix B: PROOFS FOR SECTION 5
B.1. Preliminaries
We recall basic definitions and results on sub-Gaussian random variables. See e.g. [18].
Definition B.1.
1. A random variable X is called sub-Gaussian if there exists t > such that E h e X t i ≤ . Its norm is defined as k X k ψ = inf t> E h e X t i ≤ .2. A random vector X ∈ R d is called sub-Gaussian if sup v ∈ S d − k X ⊤ v k ψ < ∞ .Its sub-Gaussian norm is defined as sup v ∈ S d − k X ⊤ v k ψ . Lemma B.1.
Let X ∈ R d be a sub-Gaussian random vector with sub-Gaussian normat most R . Let X , . . . , X n be n i.i.d. copies of X . Define S n = n P nℓ =1 X ℓ . Then,there exists a universal constant c such that for any t > , Pr ( k S n − E [ X ] k > t ) ≤ e − cnt R + d log 3 .Proof. Let N be a -net of S d − and fix v ∈ N . By definition X ⊤ v is sub-Gaussianwith k X ⊤ v k ψ ≤ R . Write X v,n = n P nℓ =1 ( X ℓ − E [ X ]) ⊤ v . Then by Hoeffding’sinequality there exists a universal constant c such that, Pr ( | X v,n | > t ) ≤ e − cnt k X ⊤ v k ψ ≤ e − cnt R . egol et al./Improved Convergence Guarantees for EM Next, we note that for any x ∈ R d it holds that k x k ≤ v ∈ N v ⊤ x . As is wellknown, the size of an ε -net is bounded by | N | ≤ e d log 3 [18, Corollary 4.2.13]. Thelemma therefore follows from a union bound.The following lemma is key for proving uniform convergence. A version of thislemma appears in [23]. Lemma B.2.
Fix < δ < . Let B , . . . , B K ⊂ R d be Euclidean balls of radii r , . . . , r K ≥ . Define B = ⊗ Kk =1 B k ⊂ R Kd and r = max k ∈ [ K ] r k . Let X be arandom vector in R d and W R d × B → R k where k ≤ d . Assume the following hold:1. There exists a constant L ≥ such that for any µ ∈ B , ε > , and µ ε ∈ B whichsatisfies max i ∈ [ K ] k µ i − µ εi k ≤ ε , then E X (cid:2) sup µ ∈B k W ( X, µ ) − W ( X, µ ε ) k (cid:3) ≤ Lε .2. There exists a constant R such that for any µ ∈ B , k W ( X, µ ) k ψ ≤ R .Let X , . . . , X n be i.i.d. random vectors with the same distribution as X . Then thereexists a universal condstant ˜ c such that with probability at least − δ , sup µ ∈B (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ) − E X [ W ( X, µ )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ R s ˜ c Kd log (cid:0) nLrδ (cid:1) n . (64) Proof.
For any ε > , let N i be an ε -net of B i and define N ε = ⊗ N i . Then, (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ) − E [ W ( X, µ )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ k E [ W ( X, µ )] − E [ W ( X, µ ε )] k + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 ( W ( X ℓ , µ ) − W ( X ℓ , µ ε )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ε ) − E [ W ( X, µ ε )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . Therefore for any t > , Pr sup µ ∈B (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ) − E [ W ( X, µ )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > t ! ≤ Pr ( A ) + Pr ( B ) + Pr ( C ) where the three events A, B, C are given by A = ( sup µ ∈B (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X ℓ =1 n W ( X ℓ , µ ) − n X ℓ =1 n W ( X ℓ , µ ε ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > t ) B = (cid:26) sup µ ∈B k E [ W ( X, µ )] − E [ W ( X, µ ε )] k > t (cid:27) C = ( sup µ ε ∈ N ε (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ε ) − E [ W ( X, µ ε )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > t ) . egol et al./Improved Convergence Guarantees for EM We first bound
Pr( A ) . By Markov’s inequality and the first condition of the lemma, Pr ( A ) ≤ t E " sup µ ∈B (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X ℓ =1 n W ( X ℓ , µ ) − n X ℓ =1 n W ( X ℓ , µ ε ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ t E (cid:20) sup µ ∈B k W ( X, µ ) − W ( X, µ ε ) k (cid:21) ≤ εLt . Thus, for εLt ≤ δ we have that Pr( A ) < δ . Note also that for t satisfying εLt ≤ δ , sup µ ∈B k E [ W ( X, µ )] − E [ W ( X, µ ε )] k ≤ E (cid:20) sup µ ∈B k W ( X, µ ) − W ( X, µ ε ) k (cid:21) ≤ t and hence Pr( B ) = 0 .Finally, we bound the probability of C . Here we use the second condition of thelemma, that k W ( X, µ ) k ψ ≤ R . It follows from Lemma B.1, that for any fixed µ ε , Pr (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ε ) − E [ W ( X, µ ε )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > t ! ≤ e k log 3 − cnt R ≤ e d log 3 − cnt R where c is a universal constant. Since all the balls B , . . . , B K ⊂ R d are of radius atmost r , it holds that | N ε | ≤ e log( rε ) Kd . Hence, taking a union bound, Pr sup µ ε ∈ N ε (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ε ) − E [ W ( X, µ ε )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > t ! ≤ e Kd log rε + d log 3 − cnt R . The requirement that the right hand side of the above is smaller than δ implies t ≥ R s Kd log rε + d log 3 + log δ cn . Setting ε = δ Ln , the condition Pr( A ) ≤ δ implies t > n , which holds if t satisfies theinequality above. Hence, for t > R q ˜ c Kd log ( nLrδ ) n with a suitable universal constant ˜ c it holds that Pr( A ) + Pr( B ) + Pr( C ) ≤ δ . B.2. Proof of Lemma 5.1
Proof.
The lemma will follow from Lemma B.2 by setting X ∼ N ( µ ∗ i , I d ) , B = U λ and W = w i ( X, µ ) . To this end we show that the conditions of Lemma B.2 hold: (i)There exists L > such that for any ε > , E i (cid:2) sup µ | w i ( X, µ ) − w i ( X, µ ε ) | (cid:3) ≤ Lε for all µ ε ∈ U λ with max i ∈ [ K ] k µ i − µ εi k ≤ ε . (ii) The sub-Gaussian norm of w i ( X, µ ) for X ∼ N ( µ ∗ i , I d ) is bounded by a constant. The latter is clear as w i is bounded. Forthe former we use the mean value theorem. There exists a point ˜ µ such that | w i ( X, µ ) − w i ( X, µ ε ) | = |∇ w i ( X, ˜ µ ) ⊤ ( µ − µ ε ) | egol et al./Improved Convergence Guarantees for EM Using the expressions (21) and (22) for the gradient of w i with respect to µ , | w i ( X, µ ) − w i ( X, µ ε ) | ≤ sup ˜ µ ∈U λ k w i ( X, ˜ µ )(1 − w i ( X, ˜ µ ))( X − ˜ µ i ) k ε + X j = i sup ˜ µ ∈U λ k w i ( X, ˜ µ ) w j ( X, ˜ µ )( X − ˜ µ j ) k ε Since ≤ w i ( X, µ ) ≤ we get E i (cid:20) sup µ ∈U λ | w i ( X, µ ) − w i ( X, µ ε ) | (cid:21) ≤ ε K X j =1 E i (cid:20) sup µ ∈U λ k X − µ j k (cid:21) . Since X ∼ N ( µ ∗ i , I d ) , we may write X = η + µ ∗ i where X ∼ N (0 , I d ) . Therefore, E i (cid:20) sup µ ∈U λ k X − µ j k (cid:21) ≤ E k η k + sup µ ∈U λ k µ ∗ i − µ j k ≤ √ d + 2 R max . It follows that E i (cid:20) sup µ ∈U λ | w i ( X, µ ) − w i ( X, µ ε ) | (cid:21) ≤ K ( √ d + 2 R max ) ε. The lemma follows by plugging L = K ( √ d + 2 R max ) and r = R max into (64). B.3. Proof of Lemma 5.2
Proof.
We denote the set of all samples X ℓ generated from the i -th component by I i , n i = | I i | and ˜ π i = n i /n . Since w i ( X, µ ) ≥ for any X, µ we can lower bound thesum in the event D i by considering only the terms w i ( X ℓ , µ ) with X ℓ ∈ I i , n n X ℓ =1 w i ( X ℓ , µ ) ≥ n X X ℓ ∈ I i w i ( X ℓ , µ ) = ˜ π i n i X X ℓ ∈ I i w i ( X ℓ , µ ) . With a suitably large constant C , the sample size requirement (34) implies that n > log Kδ π min . By the multiplicative form of the Chernoff bound for Bernoulli randomvariables, see e.g. [18, Exercise 2.3.5], we have for any δ ∈ (0 , that | ˜ π i − π i | ≤ π i .Therefore, ˜ π i ≥ π i and thus n n X ℓ =1 w i ( X ℓ , µ ) ≥ π i
10 1 n i X X ℓ ∈ I i w i ( X ℓ , µ ) . Now, defining d i = sup µ ∈U λ (cid:16) n i P X ℓ ∈ I i w i ( X ℓ , µ ) − E i [ w i ( X, µ )] (cid:17) we have inf µ ∈U λ n i X X ℓ ∈ I i w i ( X ℓ , µ ) ≥ inf µ ∈U λ E i [ w i ( X, µ )] − d i . egol et al./Improved Convergence Guarantees for EM Note that by Lemma 5.1, with probability at least − δ K , Eq. (33) holds. Since n i ≥ nπ i , we may replace n i in Eq. (33) by nπ i , and increase the relevant constants to ˜ c = ˜ c and ˜ C = C . It thus follows that inf µ ∈U λ n n X ℓ =1 w i ( X, µ ) ≥ π i inf µ ∈U λ E i [ w i ( X, µ )] − s ˜ c Kd log( ˜ C nδ ) nπ i . The condition on the sample size (34) implies that r ˜ c Kd log( ˜ C nδ ) nπ i ≤ and therefore, inf µ ∈U λ n n X ℓ =1 w i ( X, µ ) ≥ π i (cid:18) inf µ ∈U λ E i [ w i ( X, µ )] − (cid:19) . By Corollary 4.1.1, inf µ ∈U λ n n X ℓ =1 w i ( X, µ ) ≥ π i (cid:18) − ( K − θ ) e − c ( λ ) R (cid:19) . Under the separation requirement (19), it holds that ( K − θ ) e − c ( λ ) R ≤ .Hence, the event D i (35) occurs with probability at least − δ K . B.4. Proof of Lemma 5.3
B.4. Proof of Lemma 5.3

Proof.
Write $X=\eta+Z$ where $\eta\sim\mathcal N(0,I_d)$ and $Z\in\{\mu_1^*,\ldots,\mu_K^*\}$ has distribution $\Pr(Z=\mu_j^*)=\pi_j$. First we prove (37). By the triangle inequality,
\[
\|w_i(X,\mu)(X-\mu_i^*)\|_{\psi_2} \;\le\; \|w_i(X,\mu)\,\eta\|_{\psi_2} + \|w_i(X,\mu)(Z-\mu_i^*)\|_{\psi_2}.
\]
Since $w_i\le1$ and $\eta\sim\mathcal N(0,I_d)$, it follows that $\|w_i(X,\mu)\eta\|_{\psi_2}\le\|\eta\|_{\psi_2}$ [23, Lemma B.1 part 5]. Using the explicit formula for the moment generating function of a chi-squared distribution with one degree of freedom, $\mathbb{E}\bigl[\exp\bigl(\tfrac{1}{t^2}(\eta^\top s)^2\bigr)\bigr]=(1-2/t^2)^{-1/2}$ for $t^2>2$. It follows that $\|\eta\|_{\psi_2}\le2$ and hence $\|w_i(X,\mu)\eta\|_{\psi_2}\le2$. Next, we analyze the sub-Gaussian norm of the second term. We show that $\|w_i(X,\mu)(Z-\mu_i^*)\|_{\psi_2}\le\frac{8(1+2\lambda)}{1-2\lambda}$. To this end we show that for $t=\frac{8(1+2\lambda)}{1-2\lambda}$ and any $s\in S^{d-1}$,
\[
\sum_{j=1}^K\pi_j\,\mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(\mu_j^*-\mu_i^*)^\top s\bigr)^2\Bigr)\Bigr] \;\le\; 2. \tag{65}
\]
First, for $j=i$, $Z-\mu_i^*=0$ and thus the expectation equals $1$. Now consider any $j\ne i$. It holds that $|(\mu_j^*-\mu_i^*)^\top s|\le R_{ij}$. Therefore,
\[
\mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(\mu_j^*-\mu_i^*)^\top s\bigr)^2\Bigr)\Bigr] \;\le\; \mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(R_{ij}\,w_i(X,\mu)\bigr)^2\Bigr)\Bigr].
\]
By Equations (45) and (46),
\[
w_i\bigl(\eta+\mu_j^*,\mu\bigr) \;\le\; \frac{1}{1+\frac{\pi_j}{\pi_i}e^{A\nu+B}} \;=:\; \tilde w_i(A,B,\nu),
\]
where $\nu\sim\mathcal N(0,1)$, $A=\|\mu_i-\mu_j\|$ and $B=\tfrac12\bigl(\|\mu_j^*-\mu_i\|^2-\|\mu_j-\mu_j^*\|^2\bigr)$. This allows bounding the expectation over the $d$-dimensional random vector $\eta$ by an expectation over a univariate random variable $\nu$,
\[
\mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(\mu_j^*-\mu_i^*)^\top s\bigr)^2\Bigr)\Bigr] \;\le\; \mathbb{E}_\nu\Bigl[\exp\Bigl(\tfrac{1}{t^2}R_{ij}^2\,\tilde w_i(A,B,\nu)^2\Bigr)\Bigr] \;=:\; E.
\]
Next, we split the expectation over $\nu$ into two cases as follows,
\[
E \;=\; \mathbb{E}\Bigl[\exp\Bigl(\tfrac{1}{t^2}R_{ij}^2\tilde w_i^2\Bigr)\,\Big|\,\nu<-\tfrac{B}{2A}\Bigr]\Pr\Bigl(\nu<-\tfrac{B}{2A}\Bigr)
+ \mathbb{E}\Bigl[\exp\Bigl(\tfrac{1}{t^2}R_{ij}^2\tilde w_i^2\Bigr)\,\Big|\,\nu>-\tfrac{B}{2A}\Bigr]\Pr\Bigl(\nu>-\tfrac{B}{2A}\Bigr) \;=\; E_1+E_2. \tag{66}
\]
We now show that $E_1\le\frac12$ and $E_2\le\frac32$, from which it follows that $E\le2$.

First, consider the term $E_1$ in (66). Note that $\Pr\bigl(\nu<-\frac{B}{2A}\bigr)\le e^{-B^2/(8A^2)}$. By Lemma A.2, $A\le(1+2\lambda)R_{ij}$ and $B\ge\frac{(1-2\lambda)R_{ij}^2}{2}$. It follows that $\Pr\bigl(\nu<-\frac{B}{2A}\bigr)\le\exp\bigl(-\frac{(1-2\lambda)^2R_{ij}^2}{32(1+2\lambda)^2}\bigr)$. Since $\tilde w_i(A,B,\nu)\le1$, we thus obtain, inserting $t=\frac{8(1+2\lambda)}{1-2\lambda}$,
\[
E_1 \;\le\; e^{R_{ij}^2/t^2}\,e^{-\frac{(1-2\lambda)^2R_{ij}^2}{32(1+2\lambda)^2}} \;=\; e^{-\frac{(1-2\lambda)^2R_{ij}^2}{64(1+2\lambda)^2}}.
\]
Therefore, for $R_{\min}$ satisfying (36), $E_1\le\frac12$.

Second, consider the term $E_2$ in (66). Since $\nu>-\frac{B}{2A}$, then $A\nu+B>\frac{B}{2}$. Thus $1/\tilde w_i(A,B,\nu)=1+\frac{\pi_j}{\pi_i}e^{A\nu+B}>\frac{\pi_j}{\pi_i}e^{B/2}$, that is, $\tilde w_i(A,B,\nu)\le\frac{\pi_i}{\pi_j}e^{-B/2}$. By Lemma A.2, $B\ge\frac{(1-2\lambda)R_{ij}^2}{2}$, hence $\tilde w_i(A,B,\nu)\le\frac{\pi_i}{\pi_j}e^{-\frac{(1-2\lambda)R_{ij}^2}{4}}$. Since $\Pr\bigl(\nu>-\frac{B}{2A}\bigr)\le1$, it follows, plugging in $t=\frac{8(1+2\lambda)}{1-2\lambda}$, that
\[
E_2 \;\le\; \exp\Biggl(\frac{R_{ij}^2}{t^2}\,\frac{\pi_i^2}{\pi_j^2}\,e^{-\frac{(1-2\lambda)R_{ij}^2}{2}}\Biggr)
\;=\; \exp\Biggl(\frac{(1-2\lambda)^2R_{ij}^2}{64(1+2\lambda)^2}\,\frac{\pi_i^2}{\pi_j^2}\,e^{-\frac{(1-2\lambda)R_{ij}^2}{2}}\Biggr).
\]
The condition that the right hand side of the above is at most $\frac32$ can be written as $we^{-w}\le a$, where $w=\frac{(1-2\lambda)R_{ij}^2}{2}$ and $a=\frac{32(1+2\lambda)^2\log\frac32}{1-2\lambda}\,\frac{\pi_j^2}{\pi_i^2}$. Since $we^{-w}\le e^{-w/2}$ for all $w\ge0$, it suffices that $w\ge2\log(1/a)$, which holds for $R_{\min}$ satisfying (36); hence $E_2\le\frac32$.

Since $E_1+E_2\le2$ for every $j\ne i$, Eq. (65) holds. Therefore $\|w_i(X,\mu)(Z-\mu_i^*)\|_{\psi_2}\le\frac{8(1+2\lambda)}{1-2\lambda}$. Since $\|w_i(X,\mu)\eta\|_{\psi_2}\le2$, we get Eq. (37).

The proof of Eq. (38) is similar. We analyze the sub-Gaussian norm of $w_i(X,\mu)(Z-\mu_i)$. Similarly to Eq. (65), we decompose the expectation into the $K$ components. First consider the $i$-th component. Since $\|\mu_i^*-\mu_i\|\le\lambda R_i$, we have for all $s\in S^{d-1}$ that $(\mu_i^*-\mu_i)^\top s\le\lambda R_i$. Thus,
\[
\mathbb{E}_i\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(\mu_i^*-\mu_i)^\top s\bigr)^2\Bigr)\Bigr]
\;\le\; \mathbb{E}_i\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)\lambda R_i\bigr)^2\Bigr)\Bigr]
\;\le\; e^{\lambda^2R_i^2/t^2}.
\]
Hence for $t\ge\frac{\lambda R_i}{\sqrt{\log2}}$, the last expression is at most $2$. Next, for any component $j\ne i$ and any $s\in S^{d-1}$, we have $(\mu_j^*-\mu_i)^\top s\le R_{ij}+\|\mu_i-\mu_i^*\|\le\frac32R_{ij}$. Hence,
\[
\mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(\mu_j^*-\mu_i)^\top s\bigr)^2\Bigr)\Bigr]
\;\le\; \mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(\tfrac32R_{ij}\,w_i(X,\mu)\bigr)^2\Bigr)\Bigr].
\]
Since, as shown above, for $t\ge\frac{8(1+2\lambda)}{1-2\lambda}$ we have $\mathbb{E}_{\eta\sim\mathcal N(0,I_d)}\bigl[\exp\bigl(\tfrac{1}{t^2}(w_i(\eta+\mu_j^*,\mu)R_{ij})^2\bigr)\bigr]\le2$, it follows that for $t\ge\frac{12(1+2\lambda)}{1-2\lambda}$ the above is at most $2$. Thus, for any $s\in S^{d-1}$ and $t\ge t_1:=\max\bigl(\frac{12(1+2\lambda)}{1-2\lambda},\frac{\lambda R_i}{\sqrt{\log2}}\bigr)$,
\[
\mathbb{E}_X\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(Z-\mu_i)^\top s\bigr)^2\Bigr)\Bigr] \;\le\; 2.
\]
Thus, $\|w_i(X,\mu)(Z-\mu_i)\|_{\psi_2}\le t_1$. Since $\|w_i(X,\mu)\eta\|_{\psi_2}\le2$, Eq. (38) follows.
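The following sketch (again illustrative; the direction $s$, the separation, the guess $\mu$ and the grid of $t$ values are arbitrary) estimates the moment generating function appearing in the definition of the $\psi_2$ norm for $Y=w_i(X,\mu)(X-\mu_i^*)^\top s$ with $X$ drawn from the mixture. It illustrates the content of Lemma 5.3: the smallest $t$ with $\mathbb{E}[\exp(Y^2/t^2)]\le2$ stays of constant order, even though $\|X-\mu_i^*\|$ is of order $R_{\max}$ when $X$ comes from another component.

    # Monte Carlo estimate of E[exp(Y^2 / t^2)] for Y = w_i(X,mu) <X - mu_i^*, s> (illustrative).
    import numpy as np

    rng = np.random.default_rng(3)
    d, K, n = 10, 3, 200_000
    mu_star = 6.0 * rng.normal(size=(K, d))
    pi = np.full(K, 1.0 / K)

    def responsibilities(X, mu, pi):
        logits = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1) + np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)
        W = np.exp(logits)
        return W / W.sum(axis=1, keepdims=True)

    z = rng.choice(K, size=n, p=pi)
    X = mu_star[z] + rng.normal(size=(n, d))
    mu = mu_star + 0.3 * rng.normal(size=(K, d))        # a guess in U_lambda

    i = 0
    s = rng.normal(size=d); s /= np.linalg.norm(s)      # one fixed unit direction
    Y = responsibilities(X, mu, pi)[:, i] * ((X - mu_star[i]) @ s)

    for t in (1.0, 2.0, 4.0, 8.0):
        print(f"t = {t}:  E[exp(Y^2/t^2)] ~ {np.exp((Y / t) ** 2).mean():.3f}")
    # the smallest t with value <= 2 estimates ||Y||_{psi_2}; thanks to the separation it stays O(1)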
B.5. Proof of Lemma 5.4

We first present the following auxiliary lemma.
Lemma B.3.
Let $X\sim\mathrm{GMM}(\mu^*,\pi)$ with $R_{\min}$ satisfying (36). Fix $\lambda\in(0,\tfrac12)$. For each $\mu\in\mathcal U_\lambda$ let $\mu_\varepsilon\in\mathcal U_\lambda$ be such that $\max_{i\in[K]}\|\mu_i-\mu_i^\varepsilon\|<\varepsilon$. Then, for $v\in\{\mu,\mu^*\}$,
\[
\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\bigl\|(w_i(X,\mu)-w_i(X,\mu_\varepsilon))(X-v_i)\bigr\|\Bigr] \;\le\; K(\sqrt d+2R_{\max})^2\,\varepsilon. \tag{67}
\]

Proof.

By the mean value theorem and the expressions (21) and (22) for $\nabla_\mu w_i(X,\mu)$,
\[
\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\bigl\|(w_i(X,\mu)-w_i(X,\mu_\varepsilon))(X-v_i)\bigr\|\Bigr] \;\le\; \sum_{j=1}^K\sup_{\mu\in\mathcal U_\lambda}V_{ij}(\mu,v)\,\varepsilon.
\]
Since $0\le w_i(X,\mu)\le1$, we get
\[
\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\bigl\|(w_i(X,\mu)-w_i(X,\mu_\varepsilon))(X-v_i)\bigr\|\Bigr] \;\le\; \sum_{j=1}^K\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-\mu_j\|\,\|X-v_i\|\Bigr]\varepsilon.
\]
By the Cauchy--Schwarz inequality,
\[
\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-\mu_j\|\,\|X-v_i\|\Bigr] \;\le\; \Bigl(\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-\mu_j\|^2\Bigr]\Bigr)^{1/2}\Bigl(\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-v_i\|^2\Bigr]\Bigr)^{1/2}.
\]
Now, writing $X=\eta+\mu_k^*$ with $\eta\sim\mathcal N(0,I_d)$ under $\mathbb{E}_k$, and using $\sup_{\mu\in\mathcal U_\lambda}\|\mu_k^*-\mu_j\|\le2R_{\max}$,
\[
\Bigl(\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-\mu_j\|^2\Bigr]\Bigr)^{1/2} \;=\; \Bigl(\sum_{k=1}^K\pi_k\,\mathbb{E}_k\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-\mu_j\|^2\Bigr]\Bigr)^{1/2} \;\le\; \sqrt d+2R_{\max}.
\]
Similarly, $\bigl(\mathbb{E}_X[\sup_{\mu\in\mathcal U_\lambda}\|X-v_i\|^2]\bigr)^{1/2}\le\sqrt d+2R_{\max}$. Eq. (67) now follows.

Proof of Lemma 5.4.
The lemma will follow from Lemma B.2 applied with $X\sim\mathrm{GMM}(\mu^*,\pi)$, $\mathcal B=\mathcal U_\lambda$, $W=w_i(X,\mu)(X-\mu_i^*)$ and confidence parameter $\delta/K$. To this end we need to show that the two conditions of Lemma B.2 hold: (i) for any $\varepsilon>0$, Eq. (67) holds for all $\mu_\varepsilon\in\mathcal U_\lambda$ with $\max_{i\in[K]}\|\mu_i-\mu_i^\varepsilon\|\le\varepsilon$; (ii) the sub-Gaussian norm of $w_i(X,\mu)(X-\mu_i^*)$ for $X\sim\mathrm{GMM}(\mu^*,\pi)$ is bounded by $\frac{C}{1-2\lambda}$. The former follows from Lemma B.3. The latter follows from Lemma 5.3 for $R_{\min}$ satisfying (36).
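To illustrate the concentration statement of Lemma 5.4, the following experiment (not from the paper; the population term is approximated by a large independent sample, and $\mu$ is held fixed rather than taking a supremum over $\mathcal U_\lambda$, so the observed deviations are somewhat smaller than the uniform ones) tracks the deviation of $\frac1n\sum_\ell w_i(X_\ell,\mu)(X_\ell-\mu_i^*)$ from its mean as $n$ grows, alongside the rate $\sqrt{Kd\log n/n}$.

    # Deviation of the sample EM statistic from its population value (illustrative experiment).
    import numpy as np

    rng = np.random.default_rng(4)
    d, K = 10, 3
    mu_star = 5.0 * rng.normal(size=(K, d))
    pi = np.full(K, 1.0 / K)
    mu = mu_star + 0.3 * rng.normal(size=(K, d))        # a fixed guess in U_lambda
    i = 0

    def responsibilities(X, mu, pi):
        logits = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1) + np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)
        W = np.exp(logits)
        return W / W.sum(axis=1, keepdims=True)

    def sample(n):
        z = rng.choice(K, size=n, p=pi)
        return mu_star[z] + rng.normal(size=(n, d))

    def S_i(X):
        w = responsibilities(X, mu, pi)[:, i]
        return (w[:, None] * (X - mu_star[i])).mean(axis=0)

    S_pop = S_i(sample(300_000))                        # proxy for E[w_i(X,mu)(X - mu_i^*)]
    for n in (1_000, 10_000, 100_000):
        dev = np.linalg.norm(S_i(sample(n)) - S_pop)
        print(f"n = {n:7d}:  deviation = {dev:.4f},  sqrt(Kd log n / n) = "
              f"{np.sqrt(K * d * np.log(n) / n):.4f}")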
Appendix C: PROOFS FOR THE GRADIENT EM ALGORITHM

Proof of Theorem 3.2.
Consider the error of the estimate for the $i$-th center after a single gradient EM update (8). By the triangle inequality,
\[
\|\mu_i^+-\mu_i^*\| \;\le\; \bigl\|\mu_i-\mu_i^*+s\,\mathbb{E}_X[w_i(X,\mu^*)(X-\mu_i)]\bigr\| \;+\; s\,\bigl\|\mathbb{E}_X[(w_i(X,\mu)-w_i(X,\mu^*))(X-\mu_i)]\bigr\|.
\]
We now separately upper bound each of the two terms above. For the first term, recall that $\mathbb{E}_X[w_i(X,\mu^*)]=\pi_i$ and, by Lemma 4.4, $\mathbb{E}_X[w_i(X,\mu^*)(X-\mu_i)]=\pi_i(\mu_i^*-\mu_i)$. Hence, for any step size $s<1/\pi_i$,
\[
\bigl\|\mu_i-\mu_i^*+s\,\mathbb{E}_X[w_i(X,\mu^*)(X-\mu_i)]\bigr\| \;\le\; (1-s\pi_i)\,\|\mu_i-\mu_i^*\|.
\]
Next, to bound the second term we use the expressions in Eqs. (21) and (22),
\[
\bigl\|\mathbb{E}_X[(w_i(X,\mu)-w_i(X,\mu^*))(X-\mu_i)]\bigr\| \;\le\; \sum_{j=1}^K\sup_{\tilde\mu\in\mathcal U_\lambda}V_{i,j}(\tilde\mu,\mu)\,\|\mu_j-\mu_j^*\| \tag{68}
\]
with $V_{i,j}$ and $V_{i,i}$ as defined in (23) and (24), respectively. The proof proceeds similarly to that of the original EM algorithm. First, using the bounds (25) and (26) in Lemma 4.3, for $R_{\min}$ satisfying the separation condition (11), it holds that
\[
\bigl\|\mathbb{E}_X[(w_i(X,\mu)-w_i(X,\mu^*))(X-\mu_i)]\bigr\| \;\le\; \tfrac12\,\pi_{\min}\,E(\mu).
\]
Therefore,
\[
\|\mu_i^+-\mu_i^*\| \;\le\; \bigl(1-s\pi_i+\tfrac12 s\pi_{\min}\bigr)E(\mu) \;\le\; \bigl(1-\tfrac12 s\pi_{\min}\bigr)E(\mu).
\]
We finish by showing that $\mu^+\in\mathcal U_\lambda$. Replacing $\|\mu_j-\mu_j^*\|$ by $\lambda R_j$ in (68), and replacing $R_k$ by $R_{\min}$ in the bounds (25) and (26), we get
\[
\bigl\|\mathbb{E}_X[(w_i(X,\mu)-w_i(X,\mu^*))(X-\mu_i)]\bigr\| \;\le\; \lambda R_{\min}\,\pi_{\min}\,U\,\max\bigl(d,R_{\min}^2\bigr)\,e^{-c(\lambda)R_{\min}^2},
\qquad U=\frac{(K-1)\sqrt C(1+\theta)}{3\pi_{\min}}.
\]
Under the separation condition (11), the right hand side of the above is at most $\frac38\pi_{\min}\lambda R_{\min}$. Therefore,
\[
\|\mu_i^+-\mu_i^*\| \;\le\; \bigl(1-s\pi_i+\tfrac38 s\pi_{\min}\bigr)\lambda R_i \;<\; \lambda R_i.
\]
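For reference, a minimal sketch of the gradient EM iteration analysed above is given next (the sample version, which approximates the population update for large $n$; the step size, dimensions, separation and initialization radius are arbitrary choices and this is not the paper's code). The printed error $E(\mu)=\max_i\|\mu_i-\mu_i^*\|$ contracts geometrically until it reaches the statistical noise floor.

    # Sample gradient EM for a spherical GMM with known weights (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(5)
    d, K, n, s = 10, 3, 100_000, 1.0
    mu_star = 6.0 * rng.normal(size=(K, d))
    pi = np.full(K, 1.0 / K)

    def responsibilities(X, mu, pi):
        logits = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1) + np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)
        W = np.exp(logits)
        return W / W.sum(axis=1, keepdims=True)

    z = rng.choice(K, size=n, p=pi)
    X = mu_star[z] + rng.normal(size=(n, d))

    mu = mu_star + 0.4 * rng.normal(size=(K, d))        # initial guess inside U_lambda
    for it in range(10):
        W = responsibilities(X, mu, pi)                 # (n, K)
        grad = (W[:, :, None] * (X[:, None, :] - mu[None, :, :])).mean(axis=0)  # (K, d)
        mu = mu + s * grad                              # gradient EM step with step size s
        err = np.max(np.linalg.norm(mu - mu_star, axis=1))
        print(f"iter {it}:  E(mu) = {err:.4f}")
    # the error decreases geometrically until it hits a floor of order sqrt(Kd/n)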
The next lemma presents a concentration result for the sample gradient EM update.

Lemma C.1.
Fix $\delta\in(0,1)$ and $\lambda\in(0,\tfrac12)$. Let $X_1,\ldots,X_n\sim\mathrm{GMM}(\mu^*,\pi)$ with $R_{\min}$ satisfying (36). For $i\in[K]$ define $S_i^g=\frac1n\sum_{\ell=1}^n w_i(X_\ell,\mu)(X_\ell-\mu_i)$ and the event
\[
N_i^g \;=\; \Biggl\{\sup_{\mu\in\mathcal U_\lambda}\bigl\|S_i^g-\mathbb{E}[w_i(X,\mu)(X-\mu_i)]\bigr\| \;\le\; C\max\Bigl(\frac{1}{1-2\lambda},\lambda R_i\Bigr)\sqrt{\frac{Kd\log(\tilde Cn/\delta)}{n}}\Biggr\} \tag{69}
\]
where $C$ is a suitable universal constant and $\tilde C=18\,KR_{\max}(\sqrt d+2R_{\max})^2$. Then $N_i^g$ occurs with probability at least $1-\frac{\delta}{K}$.

Proof. The lemma will follow from Lemma B.2 by setting $X\sim\mathrm{GMM}(\mu^*,\pi)$, $\mathcal B=\mathcal U_\lambda$ and $W=w_i(X,\mu)(X-\mu_i)$. To this end we need to show that the two conditions of Lemma B.2 hold: (i) for any $\varepsilon>0$, Eq. (67) holds for all $\mu_\varepsilon$ with $\max_{i\in[K]}\|\mu_i-\mu_i^\varepsilon\|\le\varepsilon$; (ii) the sub-Gaussian norm of $w_i(X,\mu)(X-\mu_i)$ for $X\sim\mathrm{GMM}(\mu^*,\pi)$ is bounded by $C\max\bigl(\frac{1}{1-2\lambda},\lambda R_i\bigr)$. The former follows from Lemma B.3. The latter follows from Lemma 5.3 for $R_{\min}$ satisfying (36).

With the pieces in place we now prove Theorem 3.4.

Proof of Theorem 3.4.
Consider the error of the $i$-th cluster center after a single sample gradient EM update (10). By the triangle inequality,
\[
\|\mu_i^*-\mu_i^+\| \;\le\; \bigl\|\mu_i-\mu_i^*+s\,\mathbb{E}[w_i(X,\mu)(X-\mu_i)]\bigr\|
\;+\; s\,\Bigl\|\mathbb{E}[w_i(X,\mu)(X-\mu_i)]-\frac1n\sum_{\ell=1}^n w_i(X_\ell,\mu)(X_\ell-\mu_i)\Bigr\|.
\]
Theorem 3.2 implies that for $R_{\min}$ satisfying (11),
\[
\bigl\|\mu_i-\mu_i^*+s\,\mathbb{E}[w_i(X,\mu)(X-\mu_i)]\bigr\| \;\le\; \gamma\,\min\bigl(E(\mu),\lambda R_i\bigr),
\]
with $\gamma=1-\frac{s\pi_{\min}}{2}$. We therefore bound the second term above. Since the requirement (11) on $R_{\min}$ is more restrictive than (36), we may invoke Lemma C.1 and obtain that with probability at least $1-\frac{\delta}{K}$ the event $N_i^g$ (69) occurs. Thus,
\[
\|\mu_i^*-\mu_i^+\| \;\le\; \gamma\,\min\bigl(E(\mu),\lambda R_i\bigr) + sC\max\Bigl(\frac{1}{1-2\lambda},\lambda R_i\Bigr)\sqrt{\frac{Kd\log(\tilde Cn/\delta)}{n}}. \tag{70}
\]
The sample size condition (14) implies that
\[
Cs\max\Bigl(\frac{1}{1-2\lambda},\lambda R_i\Bigr)\sqrt{\frac{Kd\log(\tilde Cn/\delta)}{n}} \;\le\; \frac{s\pi_{\min}}{2}\,\lambda R_i \;=\; \lambda(1-\gamma)R_i.
\]
Taking a union bound over the $K$ components, $\mu^+\in\mathcal U_\lambda$ with probability at least $1-\delta$. We can therefore iteratively apply (70) and obtain
\[
\|\mu_i^t-\mu_i^*\| \;\le\; \gamma^t E(\mu^0) + \frac{sC}{1-\gamma}\max\Bigl(\frac{1}{1-2\lambda},\lambda R_i\Bigr)\sqrt{\frac{Kd\log(\tilde Cn/\delta)}{n}}.
\]
Since $\frac{s}{1-\gamma}=\frac{2}{\pi_{\min}}$, we get Eq. (15).

References
[1] Achlioptas, D. and McSherry, F. (2005). On spectral learning of mixtures of distributions. In International Conference on Computational Learning Theory.
[2] Arora, S., Kannan, R. et al. (2005). Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability.
[3] Balakrishnan, S., Wainwright, M. J., Yu, B. et al. (2017). Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics.
[4] Dasgupta, S. (1999). Learning mixtures of Gaussians. In IEEE Symposium on Foundations of Computer Science (FOCS).
[5] Dasgupta, S. and Schulman, L. (2007). A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians. Journal of Machine Learning Research.
[6] Daskalakis, C., Tzamos, C. and Zampetakis, M. (2017). Ten steps of EM suffice for mixtures of two Gaussians. In Conference on Learning Theory.
[7] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological).
[8] Hardt, M. and Price, E. (2015). Tight bounds for learning a mixture of two gaussians. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing.
[9] Hsu, D. and Kakade, S. M. (2013). Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science.
[10] Jin, C., Zhang, Y., Balakrishnan, S., Wainwright, M. J. and Jordan, M. I. (2016). Local maxima in the likelihood of gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems.
[11] Kalai, A. T., Moitra, A. and Valiant, G. (2010). Efficiently learning mixtures of two Gaussians. In Proceedings of the forty-second ACM symposium on Theory of computing.
[12] Kannan, R., Salmasian, H. and Vempala, S. (2008). The spectral method for general mixture models. SIAM Journal on Computing.
[13] Kwon, J. and Caramanis, C. (2020). The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians. In Proceedings of Thirty Third Conference on Learning Theory (J. Abernethy and S. Agarwal, eds.). Proceedings of Machine Learning Research.
[14] Moitra, A. and Valiant, G. (2010). Settling the polynomial learnability of mixtures of gaussians. In IEEE Symposium on Foundations of Computer Science (FOCS).
[15] Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A.
[16] Regev, O. and Vijayaraghavan, A. (2017). On learning mixtures of well-separated gaussians. In IEEE Symposium on Foundations of Computer Science (FOCS).
[17] Vempala, S. and Wang, G. (2004). A spectral algorithm for learning mixture models. Journal of Computer and System Sciences.
[18] Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science. Cambridge University Press.
[19] Wu, C. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics.
[20] Xu, J., Hsu, D. J. and Maleki, A. (2016). Global analysis of expectation maximization for mixtures of two gaussians. In Advances in Neural Information Processing Systems.
[21] Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation.
[22] Yan, B., Yin, M. and Sarkar, P. (2017). Convergence analysis of gradient EM for multi-component gaussian mixture. arXiv preprint arXiv:1705.08530.
[23] Zhao, R., Li, Y., Sun, Y. et al. (2020). Statistical convergence of the EM algorithm on Gaussian mixture models. Electronic Journal of Statistics.