Improved Convergence Guarantees for Learning Gaussian Mixture Models by EM and Gradient EM
aa r X i v : . [ c s . L G ] J a n Improved Convergence Guarantees forLearning Gaussian Mixture Models by EMand Gradient EM
Nimrod Segol and Boaz Nadler
Department of Computer Science and Applied MathematicsWeizmann Institute of ScienceRehovot, Israele-mail: [email protected] ; boaz.nadler.weizmann.ac.il Abstract:
We consider the problem of estimating the parameters a Gaussian MixtureModel with K components of known weights, all with an identity covariance matrix.We make two contributions. First, at the population level, we present a sharper anal-ysis of the local convergence of EM and gradient EM, compared to previous works.Assuming a separation of Ω( √ log K ) , we prove convergence of both methods to theglobal optima from an initialization region larger than those of previous works. Specif-ically, the initial guess of each component can be as far as (almost) half its distance tothe nearest Gaussian. This is essentially the largest possible contraction region. Our sec-ond contribution are improved sample size requirements for accurate estimation by EMand gradient EM. In previous works, the required number of samples had a quadraticdependence on the maximal separation between the K components, and the resultingerror estimate increased linearly with this maximal separation. In this manuscript weshow that both quantities depend only logarithmically on the maximal separation.
1. INTRODUCTION
Gaussian mixture models (GMMs) are a widely used statistical model going back toPearson [15]. In a GMM each sample x ∈ R d is drawn from one of K componentsaccording to mixing weights π , . . . , π K > with P Ki =1 π i = 1 . Each componentfollows a Gaussian distribution with mean µ ∗ i ∈ R d and covariance Σ i ∈ R d × d . In thiswork, we focus on the important special case of K spherical Gaussians with identitycovariance matrix, with a corresponding density function f X ( x ) = K X i =1 π i (2 π ) d e − k x − µ ∗ i k . (1)For simplicity, as in [22, 23], we assume the weights π i are known.Given n i.i.d. samples from the distribution (1), a fundamental problem is to esti-mate the vectors µ ∗ i of the K components. Beyond the number of components K andthe dimension d , the difficulty of this problem is characterized by the following keyquantities: The smallest and largest separation between the cluster centers, R min = min i = j k µ ∗ i − µ ∗ j k , R max = max i = j k µ ∗ i − µ ∗ j k , (2) egol et al./Improved Convergence Guarantees for EM the minimal and maximal weights and their ratio, π min = min i ∈ [ K ] π i , π max = max i ∈ [ K ] π i , θ = π max π min . (3)In principle, one could estimate µ ∗ i by maximizing the likelihood of the observeddata. However, as the log-likelihood is non-concave, this problem is computationallychallenging. A popular alternative approach is based on the EM algorithm [7], andvariants thereof, such as gradient EM. These iterative methods require an initial guess ( µ , . . . , µ K ) of the K cluster centers. Classical results show that regardless of theinitial guess, the values of the likelihood function after each EM iteration are non de-creasing. Furthermore, under fairly general conditions, the EM algorithm converges toa stationary point or a local optima [21, 19]. The success of these methods to convergeto an accurate solution depend critically on the accuracy of the initial guess [10].In this work we study the ability of the popular EM and gradient EM algorithmsto accurately estimate the parameters of the GMM in (1). Two quantities of particularinterest are: (i) the size of the initialization region and the minimal separation thatguarantee convergence to the global optima. Namely, how small can R min be and howlarge can k µ i − µ ∗ i k , and still have convergence of EM to the global optima in thepopulation setting; and (ii) the required sample size, and its dependence on the problemparameters, that guarantees EM to find accurate solutions, with high probability.We make the following contributions: First, we present an improved analysis of thelocal convergence of EM and gradient EM, at the population level, assuming an infinitenumber of samples. In Theorems 3.1 and 3.2 we prove their convergence under thelargest possible initialization region, while requiring a separation R min = Ω (cid:0) √ log K (cid:1) .For example, consider the case of equal weights π i = 1 /K , and an initial guess thatsatisfies k µ i − µ ∗ i k ≤ λ min j = i k µ ∗ j − µ ∗ i k for all i , with λ < / . Then, a separation R min ≥ C ( λ ) √ log K , with an explicit C ( λ ) , suffices to ensure that the population EMand gradient EM algorithms converge to the true means at a linear rate.Let us compare our results to several recent works that derived convergence guar-antees for EM and gradient EM. [23] and [22] proved local convergence to the globaloptima under a much larger minimal separation of R min ≥ C p min( d, K ) log K . 
Inaddition, the requirement on the initial estimates had a dependence on the maximalseparation, k µ i − µ ∗ i k ≤ R min − C p d log max( R max , K ) for a universal con-stant C . These results were significantly improved by [13], who proved the local con-vergence of the EM algorithm for the more general case of spherical Gaussians withunknown weights and variances. They required a far less restrictive minimal separation R min ≥ C √ log K , with a constant C ≥ , and their initialization was restricted to λ < . We should note that no particular effort was made to optimize these constants.In comparison to these works, we allow the largest possible initialization region λ < ,with no dependence on R max . Also, for small values λ ≤ / , our resulting constant C is roughly 6 times smaller that that of [13].Our second contribution concerns the required sample size to ensure accurate es-timation by the EM and gradient EM algorithms. Recently, [13] proved that with anumber of samples n = ˜Ω( d/π min ) , a sample splitting variant of EM is statisticallyoptimal. In this variant, the n samples are split into B distinct batches, with eachEM iteration using a separate batch. In contrast, for the standard EM and gradient egol et al./Improved Convergence Guarantees for EM EM algorithms, weaker results have been established so far. Currently, the best knownsample requirements for EM are n = ˜Ω( K dR /R ) , whereas for gradient EM, n = ˜Ω( K dR /R ) . In addition, the bounds for the resulting errors increase lin-early with R max , see [22, 23]. Note that in these two results, the required number ofsamples increases at least quadratically with the maximal separation between clusters,even though increasing R max should make the problem easier. In Theorems 3.3 and 3.4,we prove that for an initialization region with parameter λ strictly smaller than half, theEM and gradient EM algorithms yield accurate estimates with sample size ˜Ω( K d ) . Inparticular, both our sample size requirements and the bounds on the error of the EMand gradient EM have only a logarithmic dependence on R max .Our results on the initialization region and minimal separation stem from a carefulanalysis of the weights in the EM update and their effect on the estimated cluster cen-ters. Similarly to [13], we upper bound the expectation of the i -th weight when the datais drawn from a different component j = i and show that it is exponentially small in thedistance between the centers of the i and j components. We make use of the fact thatall Gaussians have the same covariance to reduce the expectation to one dimension anddirectly upper bound the one dimensional integral. This allows us to derive a sharperbound compared to [13] from which we obtain a larger contraction region for the pop-ulation EM and gradient EM algorithms. Our analysis of the finite sample behavior ofEM and gradient EM follows the general strategy of [23]. Our improved results rely ontighter bounds on the sub-Gaussian norm of the weights in the EM update which donot depend on the distance between the clusters. Over the past decades, several approaches to estimate the parameters of Gaussian mix-ture models were proposed. In addition, many works derived theoretical guaranteesfor these methods as well as information-theoretic lower bounds on the number ofsamples required for accurate estimation. 
Significant efforts were made in understand-ing whether GMMs can be learned efficiently both from a computational perspective,namely in polynomial run time, and from a statistical view, namely with a number ofsamples polynomial in the problem parameters.Method of moments approaches [11, 14, 8] can accurately estimate the parametersof general GMMs with R min arbitrarily small, at the cost of sample complexity, andthus also run time, that is exponential in the number of clusters. [9] showed that amethod of moments type algorithm can recover the parameters of spherical GMMs witharbitrarily close cluster centers in polynomial time, under the additional assumptionthat the components centers are affinely independent. This assumption implies that d ≥ K .Methods based on dimensionality reduction [4, 1, 2, 12, 17] can accurately esti-mate the parameters of a GMM in polynomial time in the dimension and numberof clusters, under conditions on the separation of the clusters’ centers. In particular,[17] proved that accurate recovery is possible with a minimal separation of R min =Ω(min( K, d ) ) .In general, it is not possible to learn the parameters of a GMM with number of sam-ples that is polynomial in the number of clusters, see [14] for an explicit example. [16] egol et al./Improved Convergence Guarantees for EM showed that for any function γ ( K ) = o ( √ log K ) one can find two spherical GMMs,both with R min = γ ( K ) such that no algorithm with polynomial sample complexitycan distinguish between them. [16] also presented a variant of the EM algorithm thatprovably learns the parameters of a GMM with separation Ω( √ log K ) , with polyno-mial sample complexity, but run time exponential in the number of components.More closely related to our manuscript, are several works that studied the ability ofEM and variants thereof to accurately estimate the parameters of a GMM. [5] showedthat with a separation of Ω( d ) , a two-round variant of EM produces accurate estimatesof the cluster centers. A significant advance was made by [3], who developed new tech-niques to analyze the local convergence of EM for rather general latent variable models.In particular, for a GMM with K = 2 components of equal weights, they proved thatthe EM algorithm converges locally at a linear rate provided that the distance betweenthe components is at least some universal constant. These results were extended in [20]and [6] where a full description of the initialization region for which the populationEM algorithm learns a mixture of any two equally weighted Gaussians was given. Asalready mentioned above, the three works that are directly related to our work, and towhich we compare in detail in Section 3 are [22], [23] and [13].
2. PROBLEM SETUP AND NOTATIONS
We write X ∼ GMM ( µ ∗ , π ) for a random variable with density given by Eq. (1).The distance between cluster means is denoted by R ij = k µ ∗ i − µ ∗ j k . We set R i =min j = i R ij . Expectation of a function f ( X ) with respect to X is denoted by E X [ f ( X )] ,or when clear from context simply by E [ f ( X )] . For simplicity of notation, we shallwrite E i [ f ( X )] = E X ∼N ( µ ∗ i ,I d ) [ f ( X )] . For a vector v we denote by k v k its Euclideannorm. For a matrix A , we denote its operator norm by k A k op = max k x k =1 k Ax k . Fi-nally, we denote by µ = ( µ ⊤ , . . . , µ ⊤ K ) ⊤ ∈ R Kd the concatenation of µ , . . . , µ K ∈ R d .As in previous works, we consider the following error measure for the quality of anestimate µ of the true means, E ( µ ) = max i ∈ [ K ] k µ i − µ ∗ i k . We will see that in the population case we can restrict our analysis to the space spannedby the K true cluster means and the K cluster estimates. It will therefore be convenientto define d = min( d, K ) . For any < λ < we define the region U λ = (cid:8) µ ∈ R Kd : k µ i − µ ∗ i k ≤ λR i ∀ i ∈ [ K ] (cid:9) . (4)For future use we define the following function which will play a key role in our anal-ysis, c ( λ ) = 18 (cid:18) − λ λ (cid:19) . (5) egol et al./Improved Convergence Guarantees for EM Given an estimate ( µ , . . . , µ K ) of the K centers, for any x ∈ R d and i ∈ [ K ] let w i ( x, µ ) = π i e − k x − µi k P Kj =1 π j e − k x − µj k . (6)The population EM update, denoted by µ + = ( µ +1 , . . . , µ + K ) is given by µ + i = E X [ w i ( X, µ ) X ] E X [ w i ( X, µ )] , ∀ i ∈ [ K ] . (7)The population gradient EM update with a step size s > is defined by µ + i = µ i + s E X [ w i ( X, µ )( X − µ i )] , ∀ i ∈ [ K ] . (8)Given an observed set of n samples X , . . . , X n ∼ X , the sample EM and samplegradient EM updates follow by replacing the expectations in (7) and (8) with theirempirical counterparts. For the EM, the update is µ + i = P nℓ =1 w i ( X ℓ , µ ) X ℓ P nℓ =1 w i ( X ℓ , µ ) , ∀ i ∈ [ K ] . (9)and for the gradient EM µ + i = µ i + s n n X ℓ =1 w i ( X ℓ , µ )( X ℓ − µ i ) , ∀ i ∈ [ K ] . (10)In this work, we study the convergence of EM and gradient EM, both in the pop-ulation setting and with a finite number of samples. In particular we are interested insufficient conditions on the initialization and on the separation of the GMM compo-nents that ensure convergence to accurate solutions.
3. LOCAL CONVERGENCE OF EM AND GRADIENT EM
As in previous works, we first study the convergence of EM in the population case andthen build upon this analysis to study the finite sample setting. Informally, our mainresult in this section is that for any fixed λ ∈ (0 , ) and an initial estimate µ ∈ U λ ,there exists a constant C ( λ ) such that for any mixture with R min & C ( λ ) q log π min the estimation error of a single population EM update (7) decreases by a multiplicativefactor strictly less than . This, in turn, implies convergence of the population EM tothe global optimal solution µ ∗ . Formally, our result is stated in the following theorem. egol et al./Improved Convergence Guarantees for EM Theorem 3.1.
Set λ ∈ (0 , ) . Let X ∼ GMM ( µ ∗ , π ) with R min ≥ s c ( λ ) log 32 ( K − p
14 (1 + θ )3 π min c ( λ ) (11) where c ( λ ) and θ are as defined in (5) and (3) , respectively. Then for any µ ∈ U λ itholds that E ( µ t ) ≤ t E ( µ ) where µ t is the t -th iterate of the population EM update (7) initialized at µ . We derive a similar result for gradient EM.
Theorem 3.2.
Set λ ∈ (0 , ) . Let X ∼ GMM ( µ ∗ , π ) with R min satisfying (11) . Thenfor any s ∈ (cid:16) , π min (cid:17) and any µ ∈ U λ it holds that E ( µ t ) ≤ γ t E ( µ ) where µ t is the t -th iterate of the population gradient EM update (8) with step size s and γ = 1 − sπ min . The proof of Theorem 3.1 appears in Section 4 with the technical details deferred tothe appendix. The proof of Theorem 3.2 is similar and appears in full in the appendix.It is interesting to compare Theorems 3.1 and 3.2 to several recent works, in termsof both the size of the initialization region, and the requirements on the minimal sep-aration. [22] and [23] assumed a separation R min = Ω( √ d log K ) and proved localconvergence of the gradient EM and of the EM algorithm, for an initialization regionof the following form, with C a universal constant, max i ∈ [ K ] k µ i − µ ∗ i k ≤ R min − C p d log max( R max , K ) . Recently, [13] significantly improved these works, proving convergence of populationEM with a much smaller separation R min ≥ p log( θK ) . Moreover, they consid-ered the more general and challenging case where the Gaussians may have differentvariances and the EM algorithm estimates not only the Gaussian centers, but also theirweights and variances. However, they proved convergence only for an initializationregion U λ with λ ≤ .Our results improve upon these works in several aspects. First, in comparison to thecontraction region of [22], our theorem allows the largest possible initialization region k µ i − µ ∗ i k < R i , with no dependence on the other problem parameters d , K and R max . This initialization region is optimal as there exists GMMs and initializations µ with k µ i − µ ∗ i k = R i such that the EM algorithm, even at the population level, willnot converge to values that are close to the true parameters.Second, in comparison to the result of [13], we allow λ to be as large as . Also,for λ < , our requirement on R min is nearly one order of magnitude smaller. Forexample, for a balanced mixture with π min = K , the right hand side of (11) reads vuut c ( λ ) log ( K ) + log 32 √ c ( λ ) ! . An initialization region k µ i − µ ∗ i k ≤ R i leads to a separation requirement R min ≥ . p log( K ) + 6 . , which is much smaller than √ log K . egol et al./Improved Convergence Guarantees for EM We now present our results on the EM and gradient EM algorithms for the finite samplecase.
Theorem 3.3.
Set λ ∈ (0 , ) , δ ∈ (0 , . Let X , . . . , X n i.i.d. ∼ GMM ( µ ∗ , π ) with R min satisfying (11) . Suppose that n is sufficiently large so that n log n > C Kd log (cid:16) ˜ Cδ (cid:17) π min max (cid:18) , − λ ) λ π min R (cid:19) . (12) where C is a universal constant and ˜ C = 100 K R max ( √ d + 2 R max ) . Assume aninitial estimate µ ∈ U λ and let µ t be the t -th iterate of the sample EM update (9) . Thenwith probability at least − δ , for all iterations t , µ t ∈ U λ and k µ ti − µ ∗ i k ≤ t E ( µ ) + C (1 − λ ) π i s Kd log ˜ Cnδ n (13) for a suitable absolute constant C . Theorem 3.4.
Set λ ∈ (0 , ) , δ ∈ (0 , . Let X , . . . , X n i.i.d. ∼ GMM ( µ ∗ , π ) with R min satisfying (11) . Set s ∈ (cid:16) , π min (cid:17) and suppose that n is sufficiently large so that n log n > CKd log ˜ Cδ π max i ∈ [ K ] max (cid:16) λ R i , − λ ) (cid:17) λ R i (14) where C is a universal constant and ˜ C = 36 K R max ( √ d + 2 R max ) . Assume aninitial estimate µ ∈ U λ and let µ t be the t -th iterate of the sample gradient EM update (10) with step size s . Then with probability at least − δ , µ t ∈ U λ for all t , and k µ ti − µ ∗ i k ≤ γ t E ( µ ) + C π i max (cid:18) − λ , λR i (cid:19) vuut Kd log (cid:16) ˜ Cnδ (cid:17) n (15) where γ = 1 − sπ min and C is a suitable absolute constant. The main idea in the proofs of Theorems 3.3 and 3.4 is to show the uniform conver-gence, inside the initialization region U λ , of the sample update to the population update.The sample size requirements (12) and (14) are such that the resulting error of a singleupdate of the EM and gradient EM algorithms is sufficiently small to ensure that theupdated means are in the contraction region U λ . This, combined with the convergenceof the population update, yields the required result. We outline the main steps of theproof in Section 5 with more technical details deferred to the appendix.Let us compare Theorems 3.3 and 3.4 to previous results, in terms of required sam-ple size and bounds on the estimation error. The strongest result to date, due to [13],considered a variant of the EM algorithm, whereby the samples are split into B sepa-rate batches, and at each iteration t (with ≤ t ≤ B ), the sample EM algorithm is run egol et al./Improved Convergence Guarantees for EM only using the data of the t -th batch. They showed that to achieve an error E ( µ B ) ≤ ǫ , the required sample size is ˜Ω( dπ min ǫ ) . The best known bounds without samplesplitting were derived by [22] and by [23]. The error guarantee for gradient EM is ˜ O ( n − / max( K R √ d, R max d )) , whereas for EM it is ˜ O ( n − / R max √ Kd/π min ) .The sample size requirements for gradient EM are n log n = ˜Ω(max( K R √ d, R max d ) /R ) and n log n = ˜Ω( Kdπ max(1 , R /R )) for EM. Note that these bounds have a de-pendence on the maximal separation R max . In particular, even though intuitively, as R max increases the problem should become easier, these error bounds increase linearlywith R max and the required sample size increases quadratically with R max . In contrast,in our two theorems above there is a dependence on / (1 − λ ) , which is strictly smallerthan R max by the separation condition (11). Thus, for λ bounded away from / , thereis only a logarithmic dependence on R max . We believe that with further effort, thedependence on R max can be fully eliminated.
4. PROOF FOR THE POPULATION EM
Our strategy is similar to [22] and [23]: We bound the error of a single update, k µ + i − µ ∗ i k in terms of E X [ w j ( X, µ )] and E X [ ∇ µ w j ( X, µ )( X − µ j )] , which in turn depend ontheir expectations with respect to individual Gaussian components. Our key result onthe latter expectation is the following Proposition, whose proof appears in the appendix. Proposition 4.1.
Set < λ < . Let X ∼ GMM ( µ ∗ , π ) with R min > r − λ log θ (16) where θ is defined in (3) . Then for any µ ∈ U λ and all j = i , with c ( λ ) defined in (5) , E i [ w j ( X, µ )] ≤ (cid:18) π j π i (cid:19) e − c ( λ ) R ij . (17)This proposition shows that E i [ w j ( X, µ )] is exponentially small in the separation R ij and is key to proving contraction of the EM and gradient EM updates. A similarresult was proven in [13]. The main differences are that they assumed a smaller re-gion with λ < and obtained a looser exponential bound exp( − R ij / . However,they considered a more challenging case where the weights π i and variances of the K Gaussian components are unknown and are also estimated by the EM procedure.The key idea in proving Proposition 4.1 is that for X ∼ N ( µ ∗ i , I d ) it suffices toanalyze the random variable w j ( X, µ ) on the one dimensional space spanned by µ i − µ j .Thus, the expectation over a d dimensional random vector is reduced to the expectationof some explicit function over a univariate standard Gaussian. An immediate corollaryis that under the same conditions as in Proposition 4.1, the following lower bound holdsfor the expectation E i [ w i ( X, µ )] . Corollary 4.1.1.
Set < λ < and suppose that R min satisfies (16) . Then ∀ µ ∈ U λ E i [ w i ( X, µ )] ≥ − ( K − θ ) e − c ( λ ) R i . (18) egol et al./Improved Convergence Guarantees for EM Next, note that for X ∼ GMM ( µ ∗ , π ) , it holds that E X [ w i ( X, µ ∗ )] = π i . Thus, forcenter estimates µ close to µ ∗ we expect that E X [ w i ( X, µ )] > π i . This intuition ismade precise in the following lemma which follows readily from Corollary 4.1.1. Lemma 4.2.
Fix < λ < . Let X ∼ GMM ( µ ∗ , π ) and suppose that R min ≥ p c ( λ ) − log(15( K − θ )) . (19) Then for any i ∈ [ K ] and any µ ∈ U λ , E X [ w i ( X, µ )] ≥ π i . (20)Next, we turn to the term E X [ ∇ µ w i ( X, µ )( X − µ i )] . By definition, ∇ µ w i ∈ R Kd has the following K components, each a vector in R d , ∂w i ( X, µ ) ∂µ i = − w i ( X, µ )(1 − w i ( X, µ ))( µ i − X ) (21)and for j = i ∂w i ( X, µ ) ∂µ j = w i ( X, µ ) w j ( X, µ )( µ j − X ) . (22)For future use we introduce the following quantities related to E X [ ∇ µ w i ( X, µ )( X − µ i )] . For any µ, v ∈ R Kd , define V i,j ( µ, v ) = k E X [ w i ( X, µ ) w j ( X, µ )( X − v i )( X − µ j ) ⊤ ] k op , (23) V i,i ( µ, v ) = k E X [ w i ( X, µ )(1 − w i ( X, µ ))( X − v i )( X − µ i ) ⊤ ] k op . (24)The following lemma, proved in the appendix, provides a bound on these quantities. Lemma 4.3.
Fix < λ < . Let X ∼ GMM ( µ ∗ , π ) with R min satisfying Eq. (16).Assume µ ∈ U λ and v = µ or v = µ ∗ . Then, for any i, j ∈ [ K ] with i = jV i,i ( µ, v ) ≤ p C ( K −
1) (1 + θ ) max (cid:0) d , R i (cid:1) e − c ( λ )2 R i , (25) V i,j ( µ, v ) ≤ p C (1 + θ ) max (cid:0) d , max( R i , R j ) (cid:1) e − c ( λ )2 max( R i ,R j ) , (26) where C is a universal constant, for example we can take C = 14 . Expressions related to V i,i and V i,j were also studied by [22]. They required a muchlarger separation, R min ≥ C √ d log K , and their resulting bounds involved also R max . Remark 4.1.
In proving the convergence of EM, the quantities of interest are V i,j ( µ, µ ∗ ) and V i,i ( µ, µ ∗ ) , whereas for the gradient EM algorithm the relevant quantities are V i,j ( µ, µ ) , V i,i ( µ, µ ) . The reason for the effective dimension d = min( d, K ) is thatfor d > K , in the population setting, the EM update of µ always remains in the sub-space spanned by the K vectors { µ i } Ki =1 and { µ ∗ i } Ki =1 . In the case of gradient EM,one may define a potentially smaller effective dimension d = min( d, K ) . egol et al./Improved Convergence Guarantees for EM Last but not least, the following auxiliary lemma shows that µ ∗ is a fixed point ofthe population EM update. Lemma 4.4.
Let X ∼ GMM ( µ ∗ , π ) . Then ∀ i ∈ [ K ] , E X [ w i ( X, µ ∗ )( X − µ ∗ i )] = 0 . With all the pieces in place, we are now ready to prove Theorem 3.1.
Proof of Theorem 3.1.
Consider a single EM update, as given by Eq. (7), k µ + i − µ ∗ i k = 1 E X [ w i ( X, µ )] · k E X [ w i ( X, µ )( X − µ ∗ i )] k , ∀ i ∈ [ K ] Using Lemma 4.4, we may write the numerator above as follows, E X [ w i ( X, µ )( X − µ ∗ i )] = E X [( w i ( X, µ ) − w i ( X, µ ∗ ))( X − µ ∗ i )] . (27)By the mean value theorem there exists µ τ on the line connecting µ and µ ∗ such that w i ( X, µ ) − w i ( X, µ ∗ ) = ∇ µ w i ( X, µ τ ) ⊤ ( µ − µ ∗ ) . (28)Inserting the expressions (21) and (22) for the gradient of w i into Eq. (28) gives w i ( X, µ ) − w i ( X, µ ∗ ) = w i ( X, µ τ )(1 − w i ( X, µ τ ))( X − µ τi ) ⊤ ( µ i − µ ∗ i ) − X j = i w i ( X, µ τ ) w j ( X, µ τ ))( X − µ τj ) ⊤ ( µ j − µ ∗ j ) dτ. Taking expectations, and using the definitions of V ii and V ij , Eqs. (23) and (24), gives k E [( w i ( X, µ ) − w i ( X, µ ∗ ))( X − µ ∗ i )] k ≤ k X j =1 V ij ( µ τ , µ ∗ ) k µ j − µ ∗ j k . (29)Since µ τ ∈ U λ , we may apply Lemma 4.3 to bound the terms on the right hand sideabove. Furthermore, given that x e − tx is monotonic decreasing for all x > p /t and R i ≥ p /c ( λ ) , we may replace all R i , R j in the bounds of Lemma 4.3 by R min .Defining U = K − √ C (1+ θ )3 π min , we thus have k E X [( w i ( X, µ ) − w i ( X, µ ∗ ))( X − µ ∗ i )] k ≤ π min U · e − c ( λ )2 R E ( µ ) . Next, note that condition (11) on R min implies that it also satisfies the weaker condition(19) of Lemma 4.2. Invoking this lemma yields that E X [ w i ( X, µ )] ≥ π min . Thus, k µ i − µ ∗ i k ≤ U max (cid:0) d , R (cid:1) e − c ( λ )2 R · E ( µ )2 . If d ≥ R , then for E ( µ + ) ≤ E ( µ ) to hold the minimal separation must satisfy c ( λ )2 R ≥ log( d U ) . (30) egol et al./Improved Convergence Guarantees for EM In contrast, if R ≥ d we obtain the following inequality for w = c ( λ )2 R , we − w ≤ c ( λ )2 U . (31)Note that for w > , the function we − w is monotonic decreasing. Also, consider thevalue w ∗ = 2 log(2 U/c ( λ )) which is larger than 1, given the definitions of U and of c ( λ ) . It is easy to show that w ∗ exp( − w ∗ ) ≤ c ( λ ) / U . Hence a sufficient conditionfor (31) to hold is that w > w ∗ , namely c ( λ )2 R ≥ Uc ( λ ) . (32)It is easy to verify that log U +log(4 /c ( λ )) > log d and thus the bound of (32) is morerestrictive than (30). Inserting the expression for U into Eq. (32) yields the conditionof the Theorem, Eq. (11). Finally, to complete the proof we need to show that for all i , k µ + i − µ ∗ i k ≤ λR i . This part is proven in auxiliary lemma A.5 in the appendix.
5. PROOF FOR THE SAMPLE EM
In this section we prove our results on the sample EM and gradient EM algorithms.The main idea is to show concentration results for both the denominator and the numer-ator of the EM update. Our strategy is similar to [23] but with several improvements.First, our result on the concentration of the denominator of the EM update, Lemma5.1, only considers samples from the i -th cluster. Thus, in Lemma 5.2, we obtain auniform lower bound for the weight w i with n = ˜Ω( Kd/π min ) compared to the larger n = ˜Ω( Kd/π ) in [23]. Second, while [23] bounded the sub-Gaussian norm ofthe numerator of the EM update by CR max , we derive in Lemma 5.3 a tighter bound,which does not depend on R max . This in turn, yields a tighter concentration for thenumerator of the EM update in Lemma 5.4. Lemma 5.1.
Fix δ ∈ (0 , , λ ∈ (0 , ) and let X , . . . , X n i i.i.d. ∼ N ( µ ∗ i , I d ) . Thenwith probability at least − δ , sup µ ∈U λ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n i n i X ℓ =1 w i ( X ℓ , µ ) − E i [ w i ( X, µ )] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ vuut ˜ c Kd log (cid:16) ˜ Cn i δ (cid:17) n i (33) where ˜ C = 18 K ( √ d + 2 R max ) R max and ˜ c is a suitable universal constant. As we saw in Lemma 4.2, the denominator in the population EM update for the i -th mean is lower bounded by π i . We use Lemma 5.1 to show that this lower boundholds also for the finite sample case. We remark that a version of the following lemmaappeared in [23], but with a larger sample size requirement of n = ˜Ω( Kd/π ) . Lemma 5.2.
Fix δ ∈ (0 , , λ ∈ (0 , ) . Let X , . . . , X n i.i.d. ∼ GMM ( µ ∗ , π ) , with R min that satisfies (19) . Assume a sufficiently large sample size n such that n log n > C Kd log ˜ Cδ π min (34) egol et al./Improved Convergence Guarantees for EM where ˜ C = 100 K π max ( √ d + 2 R max ) R max and C is a universal constant. For any i ∈ [ K ] , define the event D i = ( inf µ ∈U λ n n X ℓ =1 w i ( X ℓ , µ ) ≥ π i ) . (35) Then, the event D i occurs with probability at least − δ K . Next, we analyze the sub-Gaussian norm of w i ( X, µ )( X − µ ∗ i ) . [23] bounded thisquantity by CR max . We present an improved bound which does not depend on R max .For the definition of the sub-Gaussian norm k · k ψ , see the Appendix. Lemma 5.3.
Fix λ ∈ (0 , ) . Let X ∼ GMM ( µ ∗ , π ) with R min ≥ s max (cid:18) − λ log (cid:18) ) θ − λc ( λ ) (cid:19) , c ( λ ) log 2 (cid:19) . (36) Suppose that µ ∈ U λ . Then for any i ∈ [ K ] , k w i ( X, µ ) ( X − µ ∗ i ) k ψ ≤ − λ (37) and k w i ( X, µ ) ( X − µ i ) k ψ ≤
24 max (cid:18) − λ , λR i (cid:19) . (38)Using Lemma 5.3 we upper bound the concentration of the numerator in the expres-sion for the error in the sample EM update, Eq. (9). Lemma 5.4.
Fix δ ∈ (0 , , λ ∈ (0 , ) . Let X , . . . , X n i.i.d. ∼ GMM ( µ ∗ , π ) with R min satisfying (36) . For i ∈ [ K ] define S i = n P nℓ =1 w i ( X ℓ , µ )( X ℓ − µ ∗ i ) and the event N i = sup µ ∈U λ k S i − E [ w i ( X, µ )( X − µ ∗ i )] k ≤ C − λ s Kd log ˜ Cnδ n (39) Then, with ˜ C = 36 K R max ( √ d + 2 R max ) and with a suitable choice of a universalconstant C , the event N i occurs with probability at least − δ K . With all the pieces in place, we are now ready to prove Theorem 3.3.
Proof of Theorem 3.3.
Consider the error of a single the update of the from (9) of thesample EM algorithm, k µ + i − µ ∗ i k = k n P nℓ =1 w i ( X ℓ , µ )( X ℓ − µ ∗ i ) k n P nℓ =1 w i ( X ℓ , µ ) . Note that the requirement (11) on R min is more restrictive than (19). Also, the samplesize requirement (12) is more restrictive than (34). Thus, we may invoke Lemma 5.2and get that with probability at least − δ K , that event D i (35) occurs. Hence, k µ + i − µ ∗ i k ≤ π i k S i − E X [ w i ( X, µ )( X − µ ∗ i )] k + 43 π i k E X [ w i ( X, µ )( X − µ ∗ i )] k egol et al./Improved Convergence Guarantees for EM It follows from Theorem 3.1 that for R min satisfying (11), the second term above is up-per bounded by min( E ( µ ) , λR i ) . We thus continue by bounding the first term above.Note that our requirements on the minimal separation (11) is more restrictive than therequirement in (36). Thus, we may invoke Lemma 5.4 and obtain with probability atleast − δ K , that the event N i (39) occurs. Therefore, k µ + i − µ ∗ i k ≤
12 min( E ( µ ) , λR i ) + C (1 − λ ) π i s Kd log ˜ Cnδ n (40)where C is a universal constant and ˜ C = 36 K R max ( √ d +2 R max ) . For n sufficientlylarge so that (12) is satisfied, it holds that C − λ ) π i q Kd log ˜ Cnδ n ≤ λR i and therefore k µ + i − µ ∗ i k ≤ λR i . By a union bound over all i ∈ [ K ] , with probability at least − δ , µ + ∈ U λ . This allows us to iteratively apply (40) and obtain Eq. (13). Appendix A: PROOFS FOR SECTION 4
A.1. Proof of Proposition 4.1
Before proving Proposition 4.1 we state several auxiliary lemmas.
Lemma A.1.
Let g ( A, B ) be the following function of two variables, g ( A, B ) = Z √ π
11 + αe At + B e − t / dt where α > is a fixed constant. Then: (i) For any fixed A , g ( A, B ) is monotonicdecreasing in B ; and (ii) If in addition α > e − B and A > , then for any fixed B , g ( A, B ) is monotonic increasing in A .Proof. Since the function inside the integral is monotonically decreasing in B , part (i)directly follows. To prove part (ii), we take the derivative with respect to A , ∂∂A g ( A, B ) = Z − αte At + B (1 + αe At + B ) e − t / √ π dt. Denote the function inside the integral by f ( t ) . Note that f ( t ) > when t < and f ( t ) < when t > . To show that the integral is positive it suffices to show that forall t > it holds that − f ( t ) < f ( − t ) . This condition reads as e − At (1 + αe − At + B ) > e At (1 + αe At + B ) . Some algebraic manipulations give that this condition is equivalent to (cid:0) e At − e − At (cid:1) (cid:0) α e B − (cid:1) > which is indeed satisfied for A, t > and α > e − B . egol et al./Improved Convergence Guarantees for EM Lemma A.2.
Fix any two distinct vectors µ ∗ i , µ ∗ j ∈ R d and λ ∈ (0 , / . Denote theball of radius r about the origin in R d by B d (0 , r ) and define Ω = B d (0 , λ k µ ∗ i − µ ∗ j k ) × B d (0 , λ k µ ∗ i − µ ∗ j k ) ⊂ R d × R d . Consider the two functions
A, B : Ω → R A ( ξ i , ξ j ) = k µ ∗ i − ξ i − µ ∗ j + ξ j k (41) B ( ξ i , ξ j ) = 12 k µ ∗ i − µ ∗ j + ξ j k − k ξ i k . (42) Then for any ( ξ i , ξ j ) ∈ Ω , A ( ξ i , ξ j ) ≤ (1 + 2 λ ) k µ ∗ i − µ ∗ j k = A ∗ (43) B ( ξ i , ξ j ) ≥ − λ k µ ∗ i − µ ∗ j k = B ∗ . (44) Proof.
We first prove the upper bound on A . By the triangle inequality A ( ξ i , ξ j ) ≤ k ξ i k + k ξ j k + k µ ∗ i − µ ∗ j k ≤ (1 + 2 λ ) k µ ∗ i − µ ∗ j k As for the lower bound on B , clearly it is obtained when k ξ i k is maximal, i.e. k ξ i k = λ k µ ∗ i − µ ∗ j k . Finally, the vector ξ j = λ ( µ ∗ i − µ ∗ j ) minimizes (42) regardless of the valueof ξ i . This yields the lower bound of (44) for B . Proof of Proposition 4.1.
Recall the definition of the weight w j ( X, µ ) in Eq (6). Sinceall the terms in the denominator are positive, we may upper bound w j by taking intoaccount only the two terms with indices k = i and k = j . Hence, w j ( X, µ ) ≤ π j e − k X − µj k π j e − k X − µj k + π i e − k X − µi k = 11 + π i π j e k X − µj k − k X − µi k . (45)Next, since X ∼ N ( µ ∗ i , I d ) we may write X = µ ∗ i + η = µ i + η + ξ i where η ∼N (0 , I d ) and ξ i = µ ∗ i − µ i . Therefore, k X − µ j k − k X − µ i k = 2 η ⊤ ( µ i − µ j ) + k µ ∗ i − µ ∗ j + ξ j k − k ξ i k . (46)Note that by definition η ⊤ ( µ i − µ j ) is a univariate Gaussian random variable with meanzero and variance k µ i − µ j k . Hence, we may write η ⊤ ( µ i − µ j ) = k µ i − µ j k ν where ν ∼ N (0 , . Defining ˜ w ( A, B, ν ) = 1 / (1 + π i π j e Aν + B ) , we therefore have E i [ w j ( X, µ )] ≤ E ν [ ˜ w ( A, B, ν )] = 1 √ π Z ˜ w ( A, B, t ) e − t dt = g ( A, B ) (47)with A = A ( ξ i , ξ j ) and B = B ( ξ i , ξ j ) as defined in (41) and (42), respectively. Since k ξ i k , k ξ j k ≤ λ k µ ∗ i − µ ∗ j k , then A ≥ (1 − λ ) k µ ∗ i − µ ∗ j k . Therefore, A > for λ < . By Lemma A.2, B ≥ B ∗ with B ∗ given in (44). The condition (16) impliesthat π i π j > e − B ∗ ≥ e − B . Hence, the conditions of Lemma A.1 are satisfied and we can egol et al./Improved Convergence Guarantees for EM upper bound g ( A, B ) in (47), by g ( A ∗ , B ∗ ) with A ∗ and B ∗ respectively, as given inEquations (43) and (44) of Lemma A.2. Therefore, E i [ w j ( X, µ )] ≤ √ π Z ˜ w ( A ∗ , B ∗ , t ) e − t dt = I. To upper bound the integral I we split it into two parts based on the sign of A ∗ t + B ∗ . I = 1 √ π − B ∗ /A ∗ Z −∞ ˜ w ( A ∗ , B ∗ , t ) e − t dt + 1 √ π ∞ Z − B ∗ /A ∗ ˜ w ( A ∗ , B ∗ , t ) e − t dt = I + I . For I , where A ∗ t + B ∗ < , we upper bound ˜ w ( A ∗ , B ∗ , t ) ≤ . Since both A ∗ and B ∗ are positive, we have that − B ∗ A ∗ < . We can therefore use Chernoff’s bound to get I ≤ − B ∗ /A ∗ Z −∞ √ π e − t dt ≤ e − ( B ∗ A ∗ ) . (48)For I , where A ∗ t + B ∗ > we upper bound the integral by ignoring the constant inthe denominator. Completing the square and changing variables by z = t + A ∗ we get I ≤ ∞ Z − B ∗ /A ∗ √ π π j π i e − t − A ∗ t − B ∗ dt = e A ∗ − B ∗ ∞ Z − B ∗ /A ∗ √ π π j π i e − ( t + A ∗ )22 dt = e A ∗ − B ∗ ∞ Z A ∗ − B ∗ /A ∗ √ π π j π i e − z dz. Using the definitions of A ∗ and B ∗ in (43) and (44) we note that for λ > , A ∗ − B ∗ A ∗ = 2 (1 + 2 λ ) − (1 − λ )2 k µ ∗ i − µ ∗ j k > . We can therefore apply Chernoff’s bound on the above and obtain I ≤ π j π i e A ∗ − B ∗ − ( A ∗ − B ∗ A ∗ ) = π j π i e − ( B ∗ A ∗ ) . (49)Combining the two bounds (48) and (49) yields Eq (17). Proof of Corollary 4.1.1.
By definition, the sum of all weights is one. Thus, w i ( X, µ ) = 1 − X j = i w j ( X, µ ) . By Proposition 4.1 and the linearity of expectation E i [ w i ( X, µ )] ≥ − X j = i (cid:18) π j π i (cid:19) e − c ( λ ) R ij ≥ − ( K − θ ) e − c ( λ ) R i . egol et al./Improved Convergence Guarantees for EM A.2. Proof of Lemma 4.2
Proof.
Since X is distributed as a GMM with K components and w i ( X, µ ) > , theexpected value is greater than if we consider only the i -th component of the GMM. E [ w i ( X, µ )] = K X j =1 π j E j [ w i ( X, µ )] ≥ π i E i [ w i ( X, µ )] . Since the requirement (19) on R min implies (16), it follows from Corollary 4.1.1 that E X [ w i ( X, µ )] ≥ π i (cid:16) − ( K − θ )) e − c ( λ ) R i (cid:17) . Furthermore, Eq. (19) implies that ( K − θ ) e − c ( λ ) R i ≤ . A.3. Proof of Lemma 4.3
The proof consists of several steps. First, in Lemma A.3 we reduce the dimension to d = min( d, K ) . Next, in Lemma A.4 we bound k E k [( X − v i )( X − µ j ) ⊤ ] k op interms of d , R i , R j and R i j . We then present the proof of the Lemma.We first introduce notations. For µ, v ∈ R Kd and i, j, k ∈ [ K ] with i = j we define, V kij ( µ, v ) = k E k [ w i ( X, µ ) w j ( X, µ )( X − v i )( x − µ j ) ⊤ ] k op (50) V kii ( µ, v ) = k E k [ w i ( X, µ )(1 − w i ( X, µ ))( X − v i )( x − µ i ) ⊤ ] k op . (51)Suppose that X ∼ N ( µ ∗ k , I d ) . Let Γ be a rotation matrix such that (Γ µ i ) ⊤ =( µ i ⊤ , ⊤ [ d − d ] + ) and (Γ v i ) ⊤ = ( v i ⊤ , ⊤ [ d − d ] + ) , for all i ∈ [ K ] . Write X d for thefirst d coordinates of Γ X and X d − d for the remaining coordinates. We define V kij ( µ, v ) = k E k [ w i ( X d , µ )( w j ( X d , µ ))( X d − v i )( X d − µ j ) ⊤ ] k op (52) V kii ( µ, v ) = k E k [ w i ( X d , µ )(1 − w i ( X d , µ ))( X d − v i )( X d − µ i ) ⊤ ] k op . (53) Lemma A.3.
For any i, j, k ∈ [ K ] with i = j , V kij ≤ max (cid:16) V kij , E k [ w i ( X, µ ) w j ( X, µ )] (cid:17) ,V ki,j ≤ max (cid:16) V ki,j , E k [ w i ( X, µ )(1 − w i ( X, µ ))] (cid:17)
The proof is similar to the one in [22]. We include it for our paper to be self con-tained.
Proof.
We prove only the first inequality. The proof of the second inequality is similar.Note that ( X − v i ) ( X − µ j ) ⊤ is equal to Γ ⊤ (cid:16) X d − v i (cid:17) (cid:16) X d − µ j (cid:17) ⊤ (cid:16) X d − v i (cid:17) (cid:16) X [ d − d ] + (cid:17) ⊤ X [ d − d ] + (cid:16) X d − µ j (cid:17) ⊤ X [ d − d ] + (cid:16) X [ d − d ] + (cid:17) ⊤ Γ . egol et al./Improved Convergence Guarantees for EM Now, since Γ is a rotation matrix and the last [ d − d ] + coordinates of Γ µ i are weget that w i ( X, µ ) = w i ( X d , µ ) . Therefore w i ( X, µ ) , (cid:16) X d − v i (cid:17) , (cid:16) X d − µ j (cid:17) areindependent of X [ d − d ] + . Thus, V kij ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V ijk C ijk (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ max (cid:16) V kij ( µ, v ) , k C ijk k op (cid:17) where V ijk and V ijk are defined in (50) and (52), respectively and C ijk = E k (cid:20) w i (cid:16) X d , µ (cid:17) w j (cid:16) X d , µ (cid:17) X [ d − d ] + (cid:16) X [ d − d ] + (cid:17) ⊤ (cid:21) = E X d h w i (cid:16) X d , µ (cid:17) w j (cid:16) X d , µ (cid:17)i E X [ d − d (cid:20) X [ d − d ] + (cid:16) X [ d − d ] + (cid:17) ⊤ (cid:21) = E X d h w i (cid:16) X d , µ (cid:17) w j (cid:16) X d , µ (cid:17)i I [ d − d ] + Since w i ( X, µ ) = w i ( X d , µ ) , we may return to the original variables and write k C ijk k op = E k [ w i ( X, µ ) w j ( X, µ )] . Hence, the lemma follows.
Lemma A.4. fix λ ∈ (0 , ) . Let ( µ ∗ , . . . , µ ∗ K ) be the centers of a K component GMM.Let X ∼ N ( µ ∗ k , I d ) for some k ∈ [ K ] . Then, for any i, j ∈ [ K ] and any µ, v ∈ U λ , E k [ k ( X − v i )( X − µ j ) ⊤ k op ] ≤ C max (cid:0) d , R i (cid:1) i = j = k max (cid:16) d , R ik R jk (cid:17) k = i, k = j max (cid:0) d , R ij (cid:1) k = i, k = j (54) where C is a universal constant, for example we can take C = 14 .Proof. First, for any rank matrix uv ⊤ it holds that k uv T k op = k u k · k v k . Thus, E k (cid:2) k ( X − v i )( X − µ j ) ⊤ k op (cid:3) = E k (cid:2) k X − v i k · k X − µ j k (cid:3) . Next, since X ∼ N ( µ ∗ k , I d ) we may write X = µ ∗ k + η , where η ∼ N (0 , I d ) . Thus, k X − v i k · k X − µ j k = k µ ∗ k − v i + η k · k µ ∗ k − µ j + η k . Let Γ be a rotation matrix such that Γ ( µ ∗ k − v i ) = R ∗ ik e , Γ ( µ ∗ k − µ j ) = R ∗ kj cos ( α ) e + R ∗ kj sin ( α ) e where R ∗ ik = k µ ∗ k − v i k , R ∗ kj = k µ ∗ k − µ j k and α is the angle between e and Γ( µ ∗ k − µ j ) .Then by applying Γ to and using the rotation invariance of the Gaussian distribution, k X − v i k = ( R ∗ ik + η ) + η + X q> η q egol et al./Improved Convergence Guarantees for EM and k X − µ j k = ( R ∗ kj cos α + η ) + ( R ∗ kj sin α + η ) + X q> η q . It is easy to show that the expectation of the above expression is maximal when α = 0 .In this case, we can write the expectation as follows E [ k X − v i k · k X − µ j k ] ≤ E [( A + C ) · ( B + C )] where A = ( R ∗ ik + η ) follows a non-central χ distribution with one degree offreedom and non-centrality parameter ( R ∗ ik ) , C = P q ≥ η q follows a central χ distribution with d − degrees of freedom, and B = ( R ∗ kj + η ) . Using known resultson the moments of central and non-central χ random variables, E [ AB ] = E [(( R ∗ ik ) + 2 R ∗ ik η + η )(( R ∗ kj ) + 2 R ∗ kj η + η )]= ( R ∗ ik R ∗ kj ) + ( R ∗ ik ) + ( R ∗ kj ) + 4 R ∗ ik R ∗ kj + 3 . and E [( A + C ) · ( B + C )] = E [ AB ] + ( E [ A ] + E [ B ]) E [ C ] + E [ C ]= E [ AB ] + [( R ∗ ik ) + ( R ∗ kj ) + 2]( d −
1) + d − . Since µ, v ∈ U λ it holds that R ∗ ik ≤ R ik + R i , R ∗ kj ≤ R kj + R j .Now we consider several different cases. First, for i = j = k we have R ik = R jk =0 and R i = R j . Hence E i [ k ( X − v i )( X − µ i ) T k op ] ≤ C max( d , R i ) . Next, if k is distinct from both i and j , then R i ≤ R ik and R j ≤ R kj . Hence, E k [ k ( X − v i )( X − µ j ) ⊤ k op ] ≤ C max( d , R ik R kj ) . Finally, we consider the case where j = i but k is not distinct from both i and j ,without loss of generality k = i . Then R ik = 0 , R kj = R ij . By definition, R j ≤ R ij and R i ≤ R ij . Thus, E i [ k ( X − v i )( X − µ j ) ⊤ k op ] ≤ C max( d , R ij ) . We are now ready to prove the lemma. For clarity we present in two separate partsthe proof of Eq. (26) and of Eq. (25).
Proof of Eq. (26) in Lemma 4.3.
The first step is to separate the expectation over theGMM to its K components. By the triangle inequality, V ij ( µ, v ) ≤ X k π k V kij ( µ, v ) (55) egol et al./Improved Convergence Guarantees for EM with V kij as defined in (50). By Lemma A.3, for each k , V kij ( µ, v ) ≤ max( V kij ( µ, v ) , E k [ w i ( X, µ ) w j ( X, µ )]) (56)with V kij as defined in Eq. (52). We now separately analyze each of the two terms onthe right hand size of (56). We start with the second term. When k = i , by Proposition4.1 E i [ w i ( X, µ ) w j ( X, µ )] ≤ E i [ w j ( X, µ )] ≤ (1 + θ ) e − c ( λ ) R ij . By symmetry, the same bound holds also for k = j .Next we bound V kij and we shall later see that it is the largest of the two quantitiesin (56). Note that by the Cauchy-Schwarz inequality V kij ≤ r E k h ( w i ( X, µ ) w j ( X, µ )) is E k (cid:20)(cid:13)(cid:13)(cid:13) ( X d − v i )( X d − µ j ) ⊤ (cid:13)(cid:13)(cid:13) op (cid:21) . By Lemma A.4 there exists a universal constant C such that Eq (54) holds with dimen-sion d . Thus, for the first term in Eq. (55), with k = i , V iij ( µ, v ) ≤ q E i [( w i ( X, µ ) w j ( X, µ )) ] √ C max( d , R ij ) ≤ q E i [ w j ( X, µ )] √ C max( d , R ij ) and by Proposition 4.1 V iij ( µ, v ) ≤ p C (1 + θ ) e − c ( λ )2 R ij max (cid:0) d , R ij (cid:1) . (57)Similarly, for k = j , V jij ( µ, v ) ≤ p C (1 + θ ) e − c ( λ )2 R ij max (cid:0) d , R ij (cid:1) . (58)Hence, in these two cases indeed V ijk is the dominant term.Finally we consider the case k = i, j . Again by Proposition 4.1 E k [ w i ( X, µ ) w j ( X, µ )] ≤ (1 + θ ) e − c ( λ ) max( R ik ,R jk ) . As for the first term, V kij ≤ r E k h ( w i ( X, µ ) w j ( X, µ )) i √ C max ( d , R ik R jk ) ≤ p C (1 + θ ) e − c ( λ )2 max( R ik ,R jk ) max( d , max( R ik , R jk ) ) . (59)Inserting (57), (58) and (59) into (55) and summing over the components gives V i,j ( µ, v ) ≤√ C √ θ h ( π i + π j ) max( d , R ij ) e − c ( λ )2 R ij + X k = i,j π k max( d , max( R ik , R jk ) ) e − c ( λ )2 max( R ik ,R jk ) i . egol et al./Improved Convergence Guarantees for EM Since the function x e − tx is monotonic decreasing for x > p /t , and R min > p /c ( λ ) we may replace R ij by max( R i , R j ) in the first term. Similarly, we mayreplace R ik by R i and R jk by R j in the second sum above. This yields Eq. (26). Proof of Eq. (25) in Lemma 4.3.
The first step is to separate the expectation over theGMM to its K components. By the triangle inequality, V ii ( µ, v ) ≤ K X k =1 π k V kii ( µ, v ) (60)with V kii as defined in (51). We now bound each V kii separately. First, by Lemma A.3, V kii ≤ max( V kii , E k [(1 − w i ( X, µ )) w i ( X, µ )]) (61)with V kii as defined in (53). We now analyze each component separately. We first bound V kii and we shall later see that it is the largest of the two quantities in (61). By theCauchy-Schwarz inequality, V kii ( µ, v ) ≤ E k h ( w i ( X, µ )(1 − w i ( X, µ ))) i · E k (cid:20)(cid:13)(cid:13)(cid:13) ( X d − v i )( X d − µ i ) ⊤ (cid:13)(cid:13)(cid:13) op (cid:21) . By Lemma A.4, there exists a universal constant C such that Eq. (54) holds with di-mension d . Thus for k = i , V iii ( µ, v ) ≤ r E i h ( w i ( X, µ )(1 − w i ( X, µ ))) i √ C max (cid:0) d , R i (cid:1) ≤ p E i [1 − w i ( X, µ )] √ C max (cid:0) d , R i (cid:1) and by Corollary 4.1.1, V iii ( µ, v ) ≤ p C ( K − θ ) e − c ( λ )2 R i max (cid:0) d , R i (cid:1) . (62)Similarly, we upper bound the second quantity on the right hand side of (61) as follows, E i [ w i ( X, µ )(1 − w i ( X, µ ))] ≤ ( K − θ ) e − c ( λ ) R ij . Thus V kij ( µ, v ) is the dominant term in the maximum in (61).Now, for k = i , V kij ( µ, v ) ≤ r E k h ( w i ( X, µ )(1 − w i ( X, µ ))) i √ C max (cid:0) d , R ik (cid:1) ≤ √ θe − c ( λ ) R ik √ C max (cid:0) d , R ik (cid:1) The function x e − tx is monotonic decreasing for x > √ t − . Since R i ≥ p c ( λ ) − ,so does R ik , and we may replace it in the equation above by R i . Namely, V kij ( µ, v ) ≤ √ θ √ C max( d , R i ) e − c ( λ ) R i / (63)Inserting (62) and (63) into (60), and summing over all components yields Eq. (25). egol et al./Improved Convergence Guarantees for EM A.4. Completing the Proof of Theorem 3.1
The following lemma completes the proof of the Theorem.
Lemma A.5.
Let X ∼ GMM ( µ ∗ , π ) with R min satisfying (11) . Let µ + be the popula-tion EM update (7) . Then for every i ∈ [ K ] it holds that k µ + i − µ ∗ i k ≤ λR i .Proof. Our starting point is Eq. (29), k E X [( w i ( X, µ ) − w i ( X, µ ∗ ))( X − µ ∗ i )] k ≤ K X j =1 sup µ ∈U λ V i,j ( µ, µ ∗ ) k µ j − µ ∗ j k . We insert the bounds (25) and (26) on V i,i and V i,j , respectively, to the above.Since µ ∈ U λ we may replace all k µ k − µ ∗ k k in the expressions above by λR k . Since x e − tx is monotonic decreasing for all x > p / t , we may replace all R i , R j aboveby R min . Defining U = K − √ C (1+ θ )3 π min , this gives k E X [( w i ( X, µ ) − w i ( X, µ ∗ ))( X − µ ∗ i )] k ≤ λR min · π min U max (cid:0) d , R (cid:1) e − c ( λ ) R . By Eqs. (27) and (20), it follows that k µ + i − µ ∗ i k ≤ λR min · e − c ( λ )2 R U max( d , R ) The separation condition (11) suffices to ensure that k µ + i − µ ∗ i k ≤ λR min . Appendix B: PROOFS FOR SECTION 5
B.1. Preliminaries
We recall basic definitions and results on sub-Gaussian random variables. See e.g. [18].
Definition B.1.
1. A random variable X is called sub-Gaussian if there exists t > such that E h e X t i ≤ . Its norm is defined as k X k ψ = inf t> E h e X t i ≤ .2. A random vector X ∈ R d is called sub-Gaussian if sup v ∈ S d − k X ⊤ v k ψ < ∞ .Its sub-Gaussian norm is defined as sup v ∈ S d − k X ⊤ v k ψ . Lemma B.1.
Let X ∈ R d be a sub-Gaussian random vector with sub-Gaussian normat most R . Let X , . . . , X n be n i.i.d. copies of X . Define S n = n P nℓ =1 X ℓ . Then,there exists a universal constant c such that for any t > , Pr ( k S n − E [ X ] k > t ) ≤ e − cnt R + d log 3 .Proof. Let N be a -net of S d − and fix v ∈ N . By definition X ⊤ v is sub-Gaussianwith k X ⊤ v k ψ ≤ R . Write X v,n = n P nℓ =1 ( X ℓ − E [ X ]) ⊤ v . Then by Hoeffding’sinequality there exists a universal constant c such that, Pr ( | X v,n | > t ) ≤ e − cnt k X ⊤ v k ψ ≤ e − cnt R . egol et al./Improved Convergence Guarantees for EM Next, we note that for any x ∈ R d it holds that k x k ≤ v ∈ N v ⊤ x . As is wellknown, the size of an ε -net is bounded by | N | ≤ e d log 3 [18, Corollary 4.2.13]. Thelemma therefore follows from a union bound.The following lemma is key for proving uniform convergence. A version of thislemma appears in [23]. Lemma B.2.
Fix < δ < . Let B , . . . , B K ⊂ R d be Euclidean balls of radii r , . . . , r K ≥ . Define B = ⊗ Kk =1 B k ⊂ R Kd and r = max k ∈ [ K ] r k . Let X be arandom vector in R d and W R d × B → R k where k ≤ d . Assume the following hold:1. There exists a constant L ≥ such that for any µ ∈ B , ε > , and µ ε ∈ B whichsatisfies max i ∈ [ K ] k µ i − µ εi k ≤ ε , then E X (cid:2) sup µ ∈B k W ( X, µ ) − W ( X, µ ε ) k (cid:3) ≤ Lε .2. There exists a constant R such that for any µ ∈ B , k W ( X, µ ) k ψ ≤ R .Let X , . . . , X n be i.i.d. random vectors with the same distribution as X . Then thereexists a universal condstant ˜ c such that with probability at least − δ , sup µ ∈B (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ) − E X [ W ( X, µ )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ R s ˜ c Kd log (cid:0) nLrδ (cid:1) n . (64) Proof.
For any ε > , let N i be an ε -net of B i and define N ε = ⊗ N i . Then, (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ) − E [ W ( X, µ )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ k E [ W ( X, µ )] − E [ W ( X, µ ε )] k + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 ( W ( X ℓ , µ ) − W ( X ℓ , µ ε )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ε ) − E [ W ( X, µ ε )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . Therefore for any t > , Pr sup µ ∈B (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ) − E [ W ( X, µ )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > t ! ≤ Pr ( A ) + Pr ( B ) + Pr ( C ) where the three events A, B, C are given by A = ( sup µ ∈B (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X ℓ =1 n W ( X ℓ , µ ) − n X ℓ =1 n W ( X ℓ , µ ε ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > t ) B = (cid:26) sup µ ∈B k E [ W ( X, µ )] − E [ W ( X, µ ε )] k > t (cid:27) C = ( sup µ ε ∈ N ε (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ε ) − E [ W ( X, µ ε )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > t ) . egol et al./Improved Convergence Guarantees for EM We first bound
Pr( A ) . By Markov’s inequality and the first condition of the lemma, Pr ( A ) ≤ t E " sup µ ∈B (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X ℓ =1 n W ( X ℓ , µ ) − n X ℓ =1 n W ( X ℓ , µ ε ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ t E (cid:20) sup µ ∈B k W ( X, µ ) − W ( X, µ ε ) k (cid:21) ≤ εLt . Thus, for εLt ≤ δ we have that Pr( A ) < δ . Note also that for t satisfying εLt ≤ δ , sup µ ∈B k E [ W ( X, µ )] − E [ W ( X, µ ε )] k ≤ E (cid:20) sup µ ∈B k W ( X, µ ) − W ( X, µ ε ) k (cid:21) ≤ t and hence Pr( B ) = 0 .Finally, we bound the probability of C . Here we use the second condition of thelemma, that k W ( X, µ ) k ψ ≤ R . It follows from Lemma B.1, that for any fixed µ ε , Pr (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ε ) − E [ W ( X, µ ε )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > t ! ≤ e k log 3 − cnt R ≤ e d log 3 − cnt R where c is a universal constant. Since all the balls B , . . . , B K ⊂ R d are of radius atmost r , it holds that | N ε | ≤ e log( rε ) Kd . Hence, taking a union bound, Pr sup µ ε ∈ N ε (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X ℓ =1 W ( X ℓ , µ ε ) − E [ W ( X, µ ε )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > t ! ≤ e Kd log rε + d log 3 − cnt R . The requirement that the right hand side of the above is smaller than δ implies t ≥ R s Kd log rε + d log 3 + log δ cn . Setting ε = δ Ln , the condition Pr( A ) ≤ δ implies t > n , which holds if t satisfies theinequality above. Hence, for t > R q ˜ c Kd log ( nLrδ ) n with a suitable universal constant ˜ c it holds that Pr( A ) + Pr( B ) + Pr( C ) ≤ δ . B.2. Proof of Lemma 5.1
Proof.
The lemma will follow from Lemma B.2 by setting X ∼ N ( µ ∗ i , I d ) , B = U λ and W = w i ( X, µ ) . To this end we show that the conditions of Lemma B.2 hold: (i)There exists L > such that for any ε > , E i (cid:2) sup µ | w i ( X, µ ) − w i ( X, µ ε ) | (cid:3) ≤ Lε for all µ ε ∈ U λ with max i ∈ [ K ] k µ i − µ εi k ≤ ε . (ii) The sub-Gaussian norm of w i ( X, µ ) for X ∼ N ( µ ∗ i , I d ) is bounded by a constant. The latter is clear as w i is bounded. Forthe former we use the mean value theorem. There exists a point ˜ µ such that | w i ( X, µ ) − w i ( X, µ ε ) | = |∇ w i ( X, ˜ µ ) ⊤ ( µ − µ ε ) | egol et al./Improved Convergence Guarantees for EM Using the expressions (21) and (22) for the gradient of w i with respect to µ , | w i ( X, µ ) − w i ( X, µ ε ) | ≤ sup ˜ µ ∈U λ k w i ( X, ˜ µ )(1 − w i ( X, ˜ µ ))( X − ˜ µ i ) k ε + X j = i sup ˜ µ ∈U λ k w i ( X, ˜ µ ) w j ( X, ˜ µ )( X − ˜ µ j ) k ε Since ≤ w i ( X, µ ) ≤ we get E i (cid:20) sup µ ∈U λ | w i ( X, µ ) − w i ( X, µ ε ) | (cid:21) ≤ ε K X j =1 E i (cid:20) sup µ ∈U λ k X − µ j k (cid:21) . Since X ∼ N ( µ ∗ i , I d ) , we may write X = η + µ ∗ i where X ∼ N (0 , I d ) . Therefore, E i (cid:20) sup µ ∈U λ k X − µ j k (cid:21) ≤ E k η k + sup µ ∈U λ k µ ∗ i − µ j k ≤ √ d + 2 R max . It follows that E i (cid:20) sup µ ∈U λ | w i ( X, µ ) − w i ( X, µ ε ) | (cid:21) ≤ K ( √ d + 2 R max ) ε. The lemma follows by plugging L = K ( √ d + 2 R max ) and r = R max into (64). B.3. Proof of Lemma 5.2
Proof.
We denote the set of all samples X ℓ generated from the i -th component by I i , n i = | I i | and ˜ π i = n i /n . Since w i ( X, µ ) ≥ for any X, µ we can lower bound thesum in the event D i by considering only the terms w i ( X ℓ , µ ) with X ℓ ∈ I i , n n X ℓ =1 w i ( X ℓ , µ ) ≥ n X X ℓ ∈ I i w i ( X ℓ , µ ) = ˜ π i n i X X ℓ ∈ I i w i ( X ℓ , µ ) . With a suitably large constant C , the sample size requirement (34) implies that n > log Kδ π min . By the multiplicative form of the Chernoff bound for Bernoulli randomvariables, see e.g. [18, Exercise 2.3.5], we have for any δ ∈ (0 , that | ˜ π i − π i | ≤ π i .Therefore, ˜ π i ≥ π i and thus n n X ℓ =1 w i ( X ℓ , µ ) ≥ π i
10 1 n i X X ℓ ∈ I i w i ( X ℓ , µ ) . Now, defining d i = sup µ ∈U λ (cid:16) n i P X ℓ ∈ I i w i ( X ℓ , µ ) − E i [ w i ( X, µ )] (cid:17) we have inf µ ∈U λ n i X X ℓ ∈ I i w i ( X ℓ , µ ) ≥ inf µ ∈U λ E i [ w i ( X, µ )] − d i . egol et al./Improved Convergence Guarantees for EM Note that by Lemma 5.1, with probability at least − δ K , Eq. (33) holds. Since n i ≥ nπ i , we may replace n i in Eq. (33) by nπ i , and increase the relevant constants to ˜ c = ˜ c and ˜ C = C . It thus follows that inf µ ∈U λ n n X ℓ =1 w i ( X, µ ) ≥ π i inf µ ∈U λ E i [ w i ( X, µ )] − s ˜ c Kd log( ˜ C nδ ) nπ i . The condition on the sample size (34) implies that r ˜ c Kd log( ˜ C nδ ) nπ i ≤ and therefore, inf µ ∈U λ n n X ℓ =1 w i ( X, µ ) ≥ π i (cid:18) inf µ ∈U λ E i [ w i ( X, µ )] − (cid:19) . By Corollary 4.1.1, inf µ ∈U λ n n X ℓ =1 w i ( X, µ ) ≥ π i (cid:18) − ( K − θ ) e − c ( λ ) R (cid:19) . Under the separation requirement (19), it holds that ( K − θ ) e − c ( λ ) R ≤ .Hence, the event D i (35) occurs with probability at least − δ K . B.4. Proof of Lemma 5.3
B.4. Proof of Lemma 5.3

Proof.
Write $X=\eta+Z$ where $\eta\sim\mathcal N(0,I_d)$ and $Z\in\{\mu_1^*,\ldots,\mu_K^*\}$ has distribution $\Pr(Z=\mu_j^*)=\pi_j$. First we prove (37). By the triangle inequality,
\[
\|w_i(X,\mu)(X-\mu_i^*)\|_{\psi_2} \;\le\; \|w_i(X,\mu)\,\eta\|_{\psi_2} + \|w_i(X,\mu)(Z-\mu_i^*)\|_{\psi_2}.
\]
Since $w_i\le1$ and $\eta\sim\mathcal N(0,I_d)$, it follows that $\|w_i(X,\mu)\eta\|_{\psi_2}\le\|\eta\|_{\psi_2}$ [23, Lemma B.1 part 5]. Using the explicit formula for the moment generating function of a chi-squared distribution with one degree of freedom, $\mathbb{E}\bigl[\exp\bigl(\tfrac{1}{t^2}(\eta^\top s)^2\bigr)\bigr]=(1-2/t^2)^{-1/2}$ for $t^2>2$. It follows that $\|\eta\|_{\psi_2}\le2$ and hence $\|w_i(X,\mu)\eta\|_{\psi_2}\le2$. Next, we analyze the sub-Gaussian norm of the second term. We show that $\|w_i(X,\mu)(Z-\mu_i^*)\|_{\psi_2}\le\frac{8(1+2\lambda)}{1-2\lambda}$. To this end we show that for $t=\frac{8(1+2\lambda)}{1-2\lambda}$ and any $s\in S^{d-1}$,
\[
\sum_{j=1}^K\pi_j\,\mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(\mu_j^*-\mu_i^*)^\top s\bigr)^2\Bigr)\Bigr] \;\le\; 2. \tag{65}
\]
First, for $j=i$, $Z-\mu_i^*=0$ and thus the expectation equals $1$. Now consider any $j\ne i$. It holds that $|(\mu_j^*-\mu_i^*)^\top s|\le R_{ij}$. Therefore,
\[
\mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(\mu_j^*-\mu_i^*)^\top s\bigr)^2\Bigr)\Bigr] \;\le\; \mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(R_{ij}\,w_i(X,\mu)\bigr)^2\Bigr)\Bigr].
\]
By Equations (45) and (46),
\[
w_i\bigl(\eta+\mu_j^*,\mu\bigr) \;\le\; \frac{1}{1+\frac{\pi_j}{\pi_i}e^{A\nu+B}} \;=:\; \tilde w_i(A,B,\nu),
\]
where $\nu\sim\mathcal N(0,1)$, $A=\|\mu_i-\mu_j\|$ and $B=\tfrac12\bigl(\|\mu_j^*-\mu_i\|^2-\|\mu_j-\mu_j^*\|^2\bigr)$. This allows bounding the expectation over the $d$-dimensional random vector $\eta$ by an expectation over a univariate random variable $\nu$,
\[
\mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(\mu_j^*-\mu_i^*)^\top s\bigr)^2\Bigr)\Bigr] \;\le\; \mathbb{E}_\nu\Bigl[\exp\Bigl(\tfrac{1}{t^2}R_{ij}^2\,\tilde w_i(A,B,\nu)^2\Bigr)\Bigr] \;=:\; E.
\]
Next, we split the expectation over $\nu$ into two cases as follows,
\[
E \;=\; \mathbb{E}\Bigl[\exp\Bigl(\tfrac{1}{t^2}R_{ij}^2\tilde w_i^2\Bigr)\,\Big|\,\nu<-\tfrac{B}{2A}\Bigr]\Pr\Bigl(\nu<-\tfrac{B}{2A}\Bigr)
+ \mathbb{E}\Bigl[\exp\Bigl(\tfrac{1}{t^2}R_{ij}^2\tilde w_i^2\Bigr)\,\Big|\,\nu>-\tfrac{B}{2A}\Bigr]\Pr\Bigl(\nu>-\tfrac{B}{2A}\Bigr) \;=\; E_1+E_2. \tag{66}
\]
We now show that $E_1\le\frac12$ and $E_2\le\frac32$, from which it follows that $E\le2$.

First, consider the term $E_1$ in (66). Note that $\Pr\bigl(\nu<-\frac{B}{2A}\bigr)\le e^{-B^2/(8A^2)}$. By Lemma A.2, $A\le(1+2\lambda)R_{ij}$ and $B\ge\frac{(1-2\lambda)R_{ij}^2}{2}$. It follows that $\Pr\bigl(\nu<-\frac{B}{2A}\bigr)\le\exp\bigl(-\frac{(1-2\lambda)^2R_{ij}^2}{32(1+2\lambda)^2}\bigr)$. Since $\tilde w_i(A,B,\nu)\le1$, we thus obtain, inserting $t=\frac{8(1+2\lambda)}{1-2\lambda}$,
\[
E_1 \;\le\; e^{R_{ij}^2/t^2}\,e^{-\frac{(1-2\lambda)^2R_{ij}^2}{32(1+2\lambda)^2}} \;=\; e^{-\frac{(1-2\lambda)^2R_{ij}^2}{64(1+2\lambda)^2}}.
\]
Therefore, for $R_{\min}$ satisfying (36), $E_1\le\frac12$.

Second, consider the term $E_2$ in (66). Since $\nu>-\frac{B}{2A}$, then $A\nu+B>\frac{B}{2}$. Thus $1/\tilde w_i(A,B,\nu)=1+\frac{\pi_j}{\pi_i}e^{A\nu+B}>\frac{\pi_j}{\pi_i}e^{B/2}$, that is, $\tilde w_i(A,B,\nu)\le\frac{\pi_i}{\pi_j}e^{-B/2}$. By Lemma A.2, $B\ge\frac{(1-2\lambda)R_{ij}^2}{2}$, hence $\tilde w_i(A,B,\nu)\le\frac{\pi_i}{\pi_j}e^{-\frac{(1-2\lambda)R_{ij}^2}{4}}$. Since $\Pr\bigl(\nu>-\frac{B}{2A}\bigr)\le1$, it follows, plugging in $t=\frac{8(1+2\lambda)}{1-2\lambda}$, that
\[
E_2 \;\le\; \exp\Biggl(\frac{R_{ij}^2}{t^2}\,\frac{\pi_i^2}{\pi_j^2}\,e^{-\frac{(1-2\lambda)R_{ij}^2}{2}}\Biggr)
\;=\; \exp\Biggl(\frac{(1-2\lambda)^2R_{ij}^2}{64(1+2\lambda)^2}\,\frac{\pi_i^2}{\pi_j^2}\,e^{-\frac{(1-2\lambda)R_{ij}^2}{2}}\Biggr).
\]
The condition that the right hand side of the above is at most $\frac32$ can be written as $we^{-w}\le a$, where $w=\frac{(1-2\lambda)R_{ij}^2}{2}$ and $a=\frac{32(1+2\lambda)^2\log\frac32}{1-2\lambda}\,\frac{\pi_j^2}{\pi_i^2}$. Since $we^{-w}\le e^{-w/2}$ for all $w\ge0$, it suffices that $w\ge2\log(1/a)$, which holds for $R_{\min}$ satisfying (36); hence $E_2\le\frac32$.

Since $E_1+E_2\le2$ for every $j\ne i$, Eq. (65) holds. Therefore $\|w_i(X,\mu)(Z-\mu_i^*)\|_{\psi_2}\le\frac{8(1+2\lambda)}{1-2\lambda}$. Since $\|w_i(X,\mu)\eta\|_{\psi_2}\le2$, we get Eq. (37).

The proof of Eq. (38) is similar. We analyze the sub-Gaussian norm of $w_i(X,\mu)(Z-\mu_i)$. Similarly to Eq. (65), we decompose the expectation into the $K$ components. First consider the $i$-th component. Since $\|\mu_i^*-\mu_i\|\le\lambda R_i$, we have for all $s\in S^{d-1}$ that $(\mu_i^*-\mu_i)^\top s\le\lambda R_i$. Thus,
\[
\mathbb{E}_i\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(\mu_i^*-\mu_i)^\top s\bigr)^2\Bigr)\Bigr]
\;\le\; \mathbb{E}_i\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)\lambda R_i\bigr)^2\Bigr)\Bigr]
\;\le\; e^{\lambda^2R_i^2/t^2}.
\]
Hence for $t\ge\frac{\lambda R_i}{\sqrt{\log2}}$, the last expression is at most $2$. Next, for any component $j\ne i$ and any $s\in S^{d-1}$, we have $(\mu_j^*-\mu_i)^\top s\le R_{ij}+\|\mu_i-\mu_i^*\|\le\frac32R_{ij}$. Hence,
\[
\mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(\mu_j^*-\mu_i)^\top s\bigr)^2\Bigr)\Bigr]
\;\le\; \mathbb{E}_j\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(\tfrac32R_{ij}\,w_i(X,\mu)\bigr)^2\Bigr)\Bigr].
\]
Since, as shown above, for $t\ge\frac{8(1+2\lambda)}{1-2\lambda}$ we have $\mathbb{E}_{\eta\sim\mathcal N(0,I_d)}\bigl[\exp\bigl(\tfrac{1}{t^2}(w_i(\eta+\mu_j^*,\mu)R_{ij})^2\bigr)\bigr]\le2$, it follows that for $t\ge\frac{12(1+2\lambda)}{1-2\lambda}$ the above is at most $2$. Thus, for any $s\in S^{d-1}$ and $t\ge t_1:=\max\bigl(\frac{12(1+2\lambda)}{1-2\lambda},\frac{\lambda R_i}{\sqrt{\log2}}\bigr)$,
\[
\mathbb{E}_X\Bigl[\exp\Bigl(\tfrac{1}{t^2}\bigl(w_i(X,\mu)(Z-\mu_i)^\top s\bigr)^2\Bigr)\Bigr] \;\le\; 2.
\]
Thus, $\|w_i(X,\mu)(Z-\mu_i)\|_{\psi_2}\le t_1$. Since $\|w_i(X,\mu)\eta\|_{\psi_2}\le2$, Eq. (38) follows.
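The following sketch (again illustrative; the direction $s$, the separation, the guess $\mu$ and the grid of $t$ values are arbitrary) estimates the moment generating function appearing in the definition of the $\psi_2$ norm for $Y=w_i(X,\mu)(X-\mu_i^*)^\top s$ with $X$ drawn from the mixture. It illustrates the content of Lemma 5.3: the smallest $t$ with $\mathbb{E}[\exp(Y^2/t^2)]\le2$ stays of constant order, even though $\|X-\mu_i^*\|$ is of order $R_{\max}$ when $X$ comes from another component.

    # Monte Carlo estimate of E[exp(Y^2 / t^2)] for Y = w_i(X,mu) <X - mu_i^*, s> (illustrative).
    import numpy as np

    rng = np.random.default_rng(3)
    d, K, n = 10, 3, 200_000
    mu_star = 6.0 * rng.normal(size=(K, d))
    pi = np.full(K, 1.0 / K)

    def responsibilities(X, mu, pi):
        logits = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1) + np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)
        W = np.exp(logits)
        return W / W.sum(axis=1, keepdims=True)

    z = rng.choice(K, size=n, p=pi)
    X = mu_star[z] + rng.normal(size=(n, d))
    mu = mu_star + 0.3 * rng.normal(size=(K, d))        # a guess in U_lambda

    i = 0
    s = rng.normal(size=d); s /= np.linalg.norm(s)      # one fixed unit direction
    Y = responsibilities(X, mu, pi)[:, i] * ((X - mu_star[i]) @ s)

    for t in (1.0, 2.0, 4.0, 8.0):
        print(f"t = {t}:  E[exp(Y^2/t^2)] ~ {np.exp((Y / t) ** 2).mean():.3f}")
    # the smallest t with value <= 2 estimates ||Y||_{psi_2}; thanks to the separation it stays O(1)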
B.5. Proof of Lemma 5.4

We first present the following auxiliary lemma.
Lemma B.3.
Let $X\sim\mathrm{GMM}(\mu^*,\pi)$ with $R_{\min}$ satisfying (36). Fix $\lambda\in(0,\tfrac12)$. For each $\mu\in\mathcal U_\lambda$ let $\mu_\varepsilon\in\mathcal U_\lambda$ be such that $\max_{i\in[K]}\|\mu_i-\mu_i^\varepsilon\|<\varepsilon$. Then, for $v\in\{\mu,\mu^*\}$,
\[
\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\bigl\|(w_i(X,\mu)-w_i(X,\mu_\varepsilon))(X-v_i)\bigr\|\Bigr] \;\le\; K(\sqrt d+2R_{\max})^2\,\varepsilon. \tag{67}
\]

Proof.

By the mean value theorem and the expressions (21) and (22) for $\nabla_\mu w_i(X,\mu)$,
\[
\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\bigl\|(w_i(X,\mu)-w_i(X,\mu_\varepsilon))(X-v_i)\bigr\|\Bigr] \;\le\; \sum_{j=1}^K\sup_{\mu\in\mathcal U_\lambda}V_{ij}(\mu,v)\,\varepsilon.
\]
Since $0\le w_i(X,\mu)\le1$, we get
\[
\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\bigl\|(w_i(X,\mu)-w_i(X,\mu_\varepsilon))(X-v_i)\bigr\|\Bigr] \;\le\; \sum_{j=1}^K\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-\mu_j\|\,\|X-v_i\|\Bigr]\varepsilon.
\]
By the Cauchy--Schwarz inequality,
\[
\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-\mu_j\|\,\|X-v_i\|\Bigr] \;\le\; \Bigl(\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-\mu_j\|^2\Bigr]\Bigr)^{1/2}\Bigl(\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-v_i\|^2\Bigr]\Bigr)^{1/2}.
\]
Now, writing $X=\eta+\mu_k^*$ with $\eta\sim\mathcal N(0,I_d)$ under $\mathbb{E}_k$, and using $\sup_{\mu\in\mathcal U_\lambda}\|\mu_k^*-\mu_j\|\le2R_{\max}$,
\[
\Bigl(\mathbb{E}_X\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-\mu_j\|^2\Bigr]\Bigr)^{1/2} \;=\; \Bigl(\sum_{k=1}^K\pi_k\,\mathbb{E}_k\Bigl[\sup_{\mu\in\mathcal U_\lambda}\|X-\mu_j\|^2\Bigr]\Bigr)^{1/2} \;\le\; \sqrt d+2R_{\max}.
\]
Similarly, $\bigl(\mathbb{E}_X[\sup_{\mu\in\mathcal U_\lambda}\|X-v_i\|^2]\bigr)^{1/2}\le\sqrt d+2R_{\max}$. Eq. (67) now follows.

Proof of Lemma 5.4.
The lemma will follow from Lemma B.2 applied with $X\sim\mathrm{GMM}(\mu^*,\pi)$, $\mathcal B=\mathcal U_\lambda$, $W=w_i(X,\mu)(X-\mu_i^*)$ and confidence parameter $\delta/K$. To this end we need to show that the two conditions of Lemma B.2 hold: (i) for any $\varepsilon>0$, Eq. (67) holds for all $\mu_\varepsilon\in\mathcal U_\lambda$ with $\max_{i\in[K]}\|\mu_i-\mu_i^\varepsilon\|\le\varepsilon$; (ii) the sub-Gaussian norm of $w_i(X,\mu)(X-\mu_i^*)$ for $X\sim\mathrm{GMM}(\mu^*,\pi)$ is bounded by $\frac{C}{1-2\lambda}$. The former follows from Lemma B.3. The latter follows from Lemma 5.3 for $R_{\min}$ satisfying (36).
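To illustrate the concentration statement of Lemma 5.4, the following experiment (not from the paper; the population term is approximated by a large independent sample, and $\mu$ is held fixed rather than taking a supremum over $\mathcal U_\lambda$, so the observed deviations are somewhat smaller than the uniform ones) tracks the deviation of $\frac1n\sum_\ell w_i(X_\ell,\mu)(X_\ell-\mu_i^*)$ from its mean as $n$ grows, alongside the rate $\sqrt{Kd\log n/n}$.

    # Deviation of the sample EM statistic from its population value (illustrative experiment).
    import numpy as np

    rng = np.random.default_rng(4)
    d, K = 10, 3
    mu_star = 5.0 * rng.normal(size=(K, d))
    pi = np.full(K, 1.0 / K)
    mu = mu_star + 0.3 * rng.normal(size=(K, d))        # a fixed guess in U_lambda
    i = 0

    def responsibilities(X, mu, pi):
        logits = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1) + np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)
        W = np.exp(logits)
        return W / W.sum(axis=1, keepdims=True)

    def sample(n):
        z = rng.choice(K, size=n, p=pi)
        return mu_star[z] + rng.normal(size=(n, d))

    def S_i(X):
        w = responsibilities(X, mu, pi)[:, i]
        return (w[:, None] * (X - mu_star[i])).mean(axis=0)

    S_pop = S_i(sample(300_000))                        # proxy for E[w_i(X,mu)(X - mu_i^*)]
    for n in (1_000, 10_000, 100_000):
        dev = np.linalg.norm(S_i(sample(n)) - S_pop)
        print(f"n = {n:7d}:  deviation = {dev:.4f},  sqrt(Kd log n / n) = "
              f"{np.sqrt(K * d * np.log(n) / n):.4f}")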
Appendix C: PROOFS FOR THE GRADIENT EM ALGORITHM

Proof of Theorem 3.2.
Consider the error of the estimate for the $i$-th center after a single gradient EM update (8). By the triangle inequality,
\[
\|\mu_i^+-\mu_i^*\| \;\le\; \bigl\|\mu_i-\mu_i^*+s\,\mathbb{E}_X[w_i(X,\mu^*)(X-\mu_i)]\bigr\| \;+\; s\,\bigl\|\mathbb{E}_X[(w_i(X,\mu)-w_i(X,\mu^*))(X-\mu_i)]\bigr\|.
\]
We now separately upper bound each of the two terms above. For the first term, recall that $\mathbb{E}_X[w_i(X,\mu^*)]=\pi_i$ and, by Lemma 4.4, $\mathbb{E}_X[w_i(X,\mu^*)(X-\mu_i)]=\pi_i(\mu_i^*-\mu_i)$. Hence, for any step size $s<1/\pi_i$,
\[
\bigl\|\mu_i-\mu_i^*+s\,\mathbb{E}_X[w_i(X,\mu^*)(X-\mu_i)]\bigr\| \;\le\; (1-s\pi_i)\,\|\mu_i-\mu_i^*\|.
\]
Next, to bound the second term we use the expressions in Eqs. (21) and (22),
\[
\bigl\|\mathbb{E}_X[(w_i(X,\mu)-w_i(X,\mu^*))(X-\mu_i)]\bigr\| \;\le\; \sum_{j=1}^K\sup_{\tilde\mu\in\mathcal U_\lambda}V_{i,j}(\tilde\mu,\mu)\,\|\mu_j-\mu_j^*\| \tag{68}
\]
with $V_{i,j}$ and $V_{i,i}$ as defined in (23) and (24), respectively. The proof proceeds similarly to that of the original EM algorithm. First, using the bounds (25) and (26) in Lemma 4.3, for $R_{\min}$ satisfying the separation condition (11), it holds that
\[
\bigl\|\mathbb{E}_X[(w_i(X,\mu)-w_i(X,\mu^*))(X-\mu_i)]\bigr\| \;\le\; \tfrac12\,\pi_{\min}\,E(\mu).
\]
Therefore,
\[
\|\mu_i^+-\mu_i^*\| \;\le\; \bigl(1-s\pi_i+\tfrac12 s\pi_{\min}\bigr)E(\mu) \;\le\; \bigl(1-\tfrac12 s\pi_{\min}\bigr)E(\mu).
\]
We finish by showing that $\mu^+\in\mathcal U_\lambda$. Replacing $\|\mu_j-\mu_j^*\|$ by $\lambda R_j$ in (68), and replacing $R_k$ by $R_{\min}$ in the bounds (25) and (26), we get
\[
\bigl\|\mathbb{E}_X[(w_i(X,\mu)-w_i(X,\mu^*))(X-\mu_i)]\bigr\| \;\le\; \lambda R_{\min}\,\pi_{\min}\,U\,\max\bigl(d,R_{\min}^2\bigr)\,e^{-c(\lambda)R_{\min}^2},
\qquad U=\frac{(K-1)\sqrt C(1+\theta)}{3\pi_{\min}}.
\]
Under the separation condition (11), the right hand side of the above is at most $\frac38\pi_{\min}\lambda R_{\min}$. Therefore,
\[
\|\mu_i^+-\mu_i^*\| \;\le\; \bigl(1-s\pi_i+\tfrac38 s\pi_{\min}\bigr)\lambda R_i \;<\; \lambda R_i.
\]
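For reference, a minimal sketch of the gradient EM iteration analysed above is given next (the sample version, which approximates the population update for large $n$; the step size, dimensions, separation and initialization radius are arbitrary choices and this is not the paper's code). The printed error $E(\mu)=\max_i\|\mu_i-\mu_i^*\|$ contracts geometrically until it reaches the statistical noise floor.

    # Sample gradient EM for a spherical GMM with known weights (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(5)
    d, K, n, s = 10, 3, 100_000, 1.0
    mu_star = 6.0 * rng.normal(size=(K, d))
    pi = np.full(K, 1.0 / K)

    def responsibilities(X, mu, pi):
        logits = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1) + np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)
        W = np.exp(logits)
        return W / W.sum(axis=1, keepdims=True)

    z = rng.choice(K, size=n, p=pi)
    X = mu_star[z] + rng.normal(size=(n, d))

    mu = mu_star + 0.4 * rng.normal(size=(K, d))        # initial guess inside U_lambda
    for it in range(10):
        W = responsibilities(X, mu, pi)                 # (n, K)
        grad = (W[:, :, None] * (X[:, None, :] - mu[None, :, :])).mean(axis=0)  # (K, d)
        mu = mu + s * grad                              # gradient EM step with step size s
        err = np.max(np.linalg.norm(mu - mu_star, axis=1))
        print(f"iter {it}:  E(mu) = {err:.4f}")
    # the error decreases geometrically until it hits a floor of order sqrt(Kd/n)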
The next lemma presents a concentration result for the sample gradient EM update.

Lemma C.1.
Fix $\delta\in(0,1)$ and $\lambda\in(0,\tfrac12)$. Let $X_1,\ldots,X_n\sim\mathrm{GMM}(\mu^*,\pi)$ with $R_{\min}$ satisfying (36). For $i\in[K]$ define $S_i^g=\frac1n\sum_{\ell=1}^n w_i(X_\ell,\mu)(X_\ell-\mu_i)$ and the event
\[
N_i^g \;=\; \Biggl\{\sup_{\mu\in\mathcal U_\lambda}\bigl\|S_i^g-\mathbb{E}[w_i(X,\mu)(X-\mu_i)]\bigr\| \;\le\; C\max\Bigl(\frac{1}{1-2\lambda},\lambda R_i\Bigr)\sqrt{\frac{Kd\log(\tilde Cn/\delta)}{n}}\Biggr\} \tag{69}
\]
where $C$ is a suitable universal constant and $\tilde C=18\,KR_{\max}(\sqrt d+2R_{\max})^2$. Then $N_i^g$ occurs with probability at least $1-\frac{\delta}{K}$.

Proof. The lemma will follow from Lemma B.2 by setting $X\sim\mathrm{GMM}(\mu^*,\pi)$, $\mathcal B=\mathcal U_\lambda$ and $W=w_i(X,\mu)(X-\mu_i)$. To this end we need to show that the two conditions of Lemma B.2 hold: (i) for any $\varepsilon>0$, Eq. (67) holds for all $\mu_\varepsilon$ with $\max_{i\in[K]}\|\mu_i-\mu_i^\varepsilon\|\le\varepsilon$; (ii) the sub-Gaussian norm of $w_i(X,\mu)(X-\mu_i)$ for $X\sim\mathrm{GMM}(\mu^*,\pi)$ is bounded by $C\max\bigl(\frac{1}{1-2\lambda},\lambda R_i\bigr)$. The former follows from Lemma B.3. The latter follows from Lemma 5.3 for $R_{\min}$ satisfying (36).

With the pieces in place we now prove Theorem 3.4.

Proof of Theorem 3.4.
Consider the error of the $i$-th cluster center after a single sample gradient EM update (10). By the triangle inequality,
\[
\|\mu_i^*-\mu_i^+\| \;\le\; \bigl\|\mu_i-\mu_i^*+s\,\mathbb{E}[w_i(X,\mu)(X-\mu_i)]\bigr\|
\;+\; s\,\Bigl\|\mathbb{E}[w_i(X,\mu)(X-\mu_i)]-\frac1n\sum_{\ell=1}^n w_i(X_\ell,\mu)(X_\ell-\mu_i)\Bigr\|.
\]
Theorem 3.2 implies that for $R_{\min}$ satisfying (11),
\[
\bigl\|\mu_i-\mu_i^*+s\,\mathbb{E}[w_i(X,\mu)(X-\mu_i)]\bigr\| \;\le\; \gamma\,\min\bigl(E(\mu),\lambda R_i\bigr),
\]
with $\gamma=1-\frac{s\pi_{\min}}{2}$. We therefore bound the second term above. Since the requirement (11) on $R_{\min}$ is more restrictive than (36), we may invoke Lemma C.1 and obtain that with probability at least $1-\frac{\delta}{K}$ the event $N_i^g$ (69) occurs. Thus,
\[
\|\mu_i^*-\mu_i^+\| \;\le\; \gamma\,\min\bigl(E(\mu),\lambda R_i\bigr) + sC\max\Bigl(\frac{1}{1-2\lambda},\lambda R_i\Bigr)\sqrt{\frac{Kd\log(\tilde Cn/\delta)}{n}}. \tag{70}
\]
The sample size condition (14) implies that
\[
Cs\max\Bigl(\frac{1}{1-2\lambda},\lambda R_i\Bigr)\sqrt{\frac{Kd\log(\tilde Cn/\delta)}{n}} \;\le\; \frac{s\pi_{\min}}{2}\,\lambda R_i \;=\; \lambda(1-\gamma)R_i.
\]
Taking a union bound over the $K$ components, $\mu^+\in\mathcal U_\lambda$ with probability at least $1-\delta$. We can therefore iteratively apply (70) and obtain
\[
\|\mu_i^t-\mu_i^*\| \;\le\; \gamma^t E(\mu^0) + \frac{sC}{1-\gamma}\max\Bigl(\frac{1}{1-2\lambda},\lambda R_i\Bigr)\sqrt{\frac{Kd\log(\tilde Cn/\delta)}{n}}.
\]
Since $\frac{s}{1-\gamma}=\frac{2}{\pi_{\min}}$, we get Eq. (15).

References
[1] Achlioptas, D. and McSherry, F. (2005). On spectral learning of mixtures of distributions. In International Conference on Computational Learning Theory.
[2] Arora, S., Kannan, R. et al. (2005). Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability.
[3] Balakrishnan, S., Wainwright, M. J., Yu, B. et al. (2017). Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics.
[4] Dasgupta, S. (1999). Learning mixtures of Gaussians. In IEEE Symposium on Foundations of Computer Science (FOCS).
[5] Dasgupta, S. and Schulman, L. (2007). A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians. Journal of Machine Learning Research.
[6] Daskalakis, C., Tzamos, C. and Zampetakis, M. (2017). Ten steps of EM suffice for mixtures of two Gaussians. In Conference on Learning Theory.
[7] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological).
[8] Hardt, M. and Price, E. (2015). Tight bounds for learning a mixture of two gaussians. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing.
[9] Hsu, D. and Kakade, S. M. (2013). Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science.
[10] Jin, C., Zhang, Y., Balakrishnan, S., Wainwright, M. J. and Jordan, M. I. (2016). Local maxima in the likelihood of gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems.
[11] Kalai, A. T., Moitra, A. and Valiant, G. (2010). Efficiently learning mixtures of two Gaussians. In Proceedings of the forty-second ACM symposium on Theory of computing.
[12] Kannan, R., Salmasian, H. and Vempala, S. (2008). The spectral method for general mixture models. SIAM Journal on Computing.
[13] Kwon, J. and Caramanis, C. (2020). The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians. In Proceedings of Thirty Third Conference on Learning Theory (J. Abernethy and S. Agarwal, eds.). Proceedings of Machine Learning Research.
[14] Moitra, A. and Valiant, G. (2010). Settling the polynomial learnability of mixtures of gaussians. In IEEE Symposium on Foundations of Computer Science (FOCS).
[15] Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A.
[16] Regev, O. and Vijayaraghavan, A. (2017). On learning mixtures of well-separated gaussians. In IEEE Symposium on Foundations of Computer Science (FOCS).
[17] Vempala, S. and Wang, G. (2004). A spectral algorithm for learning mixture models. Journal of Computer and System Sciences.
[18] Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science. Cambridge University Press.
[19] Wu, C. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics.
[20] Xu, J., Hsu, D. J. and Maleki, A. (2016). Global analysis of expectation maximization for mixtures of two gaussians. In Advances in Neural Information Processing Systems.
[21] Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation.
[22] Yan, B., Yin, M. and Sarkar, P. (2017). Convergence analysis of gradient EM for multi-component gaussian mixture. arXiv preprint arXiv:1705.08530.
[23] Zhao, R., Li, Y., Sun, Y. et al. (2020). Statistical convergence of the EM algorithm on Gaussian mixture models. Electronic Journal of Statistics.