On the Minimax Optimality of the EM Algorithm for Learning Two-Component Mixed Linear Regression
Jeong Yeol Kwon ⋄   Nhat Ho †   Constantine Caramanis ⋄

⋄ Department of Electrical and Computer Engineering, University of Texas at Austin
† Department of Electrical Engineering and Computer Sciences, University of California, Berkeley

June 5, 2020
Abstract
We study the convergence rates of the EM algorithm for learning two-component mixed linear regression under all regimes of signal-to-noise ratio (SNR). We resolve a long-standing question that many recent results have attempted to tackle: we completely characterize the convergence behavior of EM, and show that the EM algorithm achieves minimax optimal sample complexity under all SNR regimes. In particular, when the SNR is sufficiently large, the EM updates converge to the true parameter $\theta^*$ at the standard parametric convergence rate $\mathcal{O}((d/n)^{1/2})$ after $\mathcal{O}(\log(n/d))$ iterations. In the regime where the SNR is above $\mathcal{O}((d/n)^{1/4})$ and below some constant, the EM iterates converge to a $\mathcal{O}(\mathrm{SNR}^{-1}(d/n)^{1/2})$ neighborhood of the true parameter when the number of iterations is of the order $\mathcal{O}(\mathrm{SNR}^{-2}\log(n/d))$. In the low SNR regime, where the SNR is below $\mathcal{O}((d/n)^{1/4})$, we show that EM converges to a $\mathcal{O}((d/n)^{1/4})$ neighborhood of the true parameters after $\mathcal{O}((n/d)^{1/2})$ iterations. Notably, these results are achieved under mild conditions of either random initialization or an efficiently computable local initialization. By providing tight convergence guarantees of the EM algorithm in middle-to-low SNR regimes, we fill the remaining gap in the literature, and significantly, reveal that in low SNR, EM changes rate, matching the $n^{-1/4}$ rate of the MLE, a behavior that previous work had been unable to show.

1 Introduction

The expectation-maximization (EM) algorithm is a general-purpose heuristic for computing a maximum-likelihood estimator (MLE) in problems with missing information [9, 32, 27]. In general, computing the MLE is intractable due to the non-concave nature of log-likelihood functions in the presence of missing data. The EM algorithm iteratively computes a tighter lower bound on the log-likelihood function, with each iteration no more complex than solving a maximum-likelihood (ML) problem without missing data. Due to its simplicity and broad success in practice, EM is one of the most popular methods of choice in a variety of applications [18, 26, 24, 4].

Recent years have witnessed remarkable progress in establishing theory describing the non-asymptotic convergence of EM to the true parameters in canonical examples such as mixtures of Gaussian distributions and mixed linear regression (see Prior art below). In such models, a key factor in the analysis is the separation between components, or the "signal strength". Most prior work has studied strongly separated instances (high SNR) and established linear convergence of the EM algorithm with the standard parametric statistical rate $n^{-1/2}$. In contrast, the understanding of the EM algorithm in weakly separated settings (low SNR), especially for mixed linear regression, remains incomplete.

Our contributions:
In this paper, we fill the remaining gap in the literature by establishing the minimax optimal sample complexity of the EM algorithm for learning two-component mixed linear regression in the weakly separated regime. In so doing, we provide a complete picture of the EM algorithm under all signal-to-noise ratio (SNR) regimes for symmetric two-component mixed linear regression, namely $\frac{1}{2}\mathcal{N}(-X^\top\theta^*, (\sigma^*)^2) + \frac{1}{2}\mathcal{N}(X^\top\theta^*, (\sigma^*)^2)$, where $\sigma^* = 1$ is given and $X$ follows the standard multivariate normal distribution in $d$ dimensions. We define the SNR as $\eta := \|\theta^*\|$, since $\sigma^* = 1$. Notably, our results are obtained under mild conditions of either random initialization or an efficiently computable local initialization. While simplified, the model is complex enough to capture the most interesting behaviors of the EM algorithm for learning a mixed linear regression with two components, and it reveals statistical behaviors in the low-to-middle SNR regimes that previous analyses had missed. In summary, our contributions are as follows.

1. High-to-middle SNR regimes: when $(d/n)^{1/4} \lesssim \|\theta^*\|$ (up to logarithmic factors), the EM updates converge to within a neighborhood of radius $\mathcal{O}(\max\{1, \|\theta^*\|^{-1}\}(d/n)^{1/2})$ of $\theta^*$ after $\mathcal{O}(\max\{1, \|\theta^*\|^{-2}\}\log(n/d))$ iterations.

2. Low SNR regime: when $\|\theta^*\| \lesssim (d/n)^{1/4}$ (up to logarithmic factors), the EM algorithm converges to within a neighborhood of radius $\mathcal{O}((d/n)^{1/4})$ of $\theta^*$ when the number of iterations is of the order $\mathcal{O}((n/d)^{1/2})$.

3. Global convergence: we demonstrate that EM converges from any randomly initialized point with high probability. Furthermore, we do not require sample-splitting in our analysis.

While we discuss the tightness of our results in great detail in Section 2.3, we briefly explain their significance here. We focus primarily on two aspects of the EM algorithm: (i) the statistical rate, and (ii) the computational complexity. In the high SNR regime, we have linear convergence to the true parameters at the $\sqrt{d/n}$ rate, as noted previously in the literature. In contrast, in the low SNR regime, when $\|\theta^*\| \lesssim (d/n)^{1/4}$, the statistical rate is $(d/n)^{1/4}$. We explain this transition in the statistical rate through a convergence property of the population EM operator in the middle-to-low SNR regimes. The upper bound achieved by EM matches the known lower bound for this problem in all SNR regimes [6]. As for computational complexity, the number of iterations increases quadratically in the inverse of the SNR until the SNR reaches $(d/n)^{1/4}$. Interestingly, the number of iterations interpolates naturally at $\mathrm{SNR} = (d/n)^{1/4}$, from $\|\theta^*\|^{-2}\log(n/d)$ to $\sqrt{n/d}$. More in-depth discussion of the results (e.g., detailed comparison to previous work, proof techniques, etc.) is provided in Section 2.3.
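As a compact reference, the following Python sketch (ours, not part of the paper's formal results; constants and all logarithmic factors are dropped) tabulates the error rates and iteration counts claimed above as a function of the SNR:

```python
def em_theory_rates(snr, n, d):
    """Predicted statistical error and iteration count for EM on symmetric
    two-component mixed linear regression, per the regimes described above.
    Constants and log factors are dropped; purely illustrative."""
    if snr <= (d / n) ** 0.25:                     # low SNR regime
        return (d / n) ** 0.25, (n / d) ** 0.5
    error = max(1.0, 1.0 / snr) * (d / n) ** 0.5   # middle-to-high SNR
    iterations = max(1.0, 1.0 / snr ** 2)          # times log(n/d), dropped
    return error, iterations
```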
Prior art: While classical results on the EM algorithm only guaranteed asymptotic convergence to stationary points [32], the seminal work [1] proposed a general framework for studying the non-asymptotic convergence of the EM algorithm to the true parameters. Motivated by this work, there has been a flurry of work studying the convergence of the EM algorithm to the true parameters for various kinds of regular mixture models (see, e.g., [36, 37, 34, 35, 7, 20, 12, 21]). Most of the work in this line requires strong separation relative to the noise level, i.e., considers the high SNR regime. Using this condition, it establishes linear convergence of EM to parameter estimates that lie within a $(d/n)^{1/2}$-radius around the true location parameters. In contrast, relatively little is understood when the different components of a mixture model are weakly separated (i.e., middle-to-low SNR). In particular, even for the simple setting of two-component mixed linear regression that we consider in this work, our understanding of the EM algorithm remains incomplete, for as we show, not only the techniques, but also the conclusions of past analyses no longer hold in the weakly separated regime.

The first convergence guarantees for EM under mixed linear regression were established in a noise-free setting [36, 37]. Subsequent results succeeded in treating the noisy setting (see [1]) for a mixture of two linear regressions, when the signal strength $\|\theta^*\|$ is significantly larger than the noise level $\sigma^*$ (high SNR). Work in [20] extended the results in [1] and [37] to the more general setting of learning a mixture of $k$ linear regressions when the SNR is $\Omega(k)$. However, it has not been obvious how to extend any of these results to the weakly separated regimes.

Recently, [22] established the global convergence of the EM algorithm for learning a mixture of two linear regressions in all SNR regimes. While their result guarantees convergence of EM in all SNR regimes, the characterization of this convergence falls short in two respects: (i) their analysis relies on sample-splitting, and (ii) their result is sub-optimal in terms of the SNR in the low SNR regime. To elaborate on the second aspect, the statistical rate in [22] is $\mathcal{O}(\eta^{-2}n^{-1/2})$, provided the sample size $n \gtrsim \eta^{-4}$ is sufficiently large. However, it is known that in the limit $\eta \to 0$, the rate of the MLE slows down to $n^{-1/4}$ [3, 16, 17]. The result in [22] fails to capture this important property in relation to EM, and gives little insight into what happens when there is a large overlap between the components. Our results tighten the sub-optimal analysis for the middle SNR regime in [22] and fill the remaining gap in the literature by providing a tight convergence guarantee for the EM algorithm in the low SNR regime.

In the closely related problem of learning a mixture of two Gaussians, [11, 10, 12] recently studied the extreme case of over-specified mixture models, i.e., no separation between the two components. However, their analysis is restricted to strictly over-specified settings, and it has not been obvious how to extend their results to weakly separated models. In other recent work, [33] studied the EM algorithm for learning a mixture of two weakly separated location Gaussians, establishing a minimax rate for the EM algorithm after $\mathcal{O}(\sqrt{n/d})$ iterations in the middle-to-low SNR regimes. However, their result requires the initialization to already lie within a small Euclidean ball of $(d/n)^{1/4}$-radius, which is very restrictive. Our result does not suffer from the small-initialization issue of [33]. Furthermore, our proof strategy can be applied to resolve the open issue of small initialization in [33].

We note in passing that the problem of solving mixed linear regressions is an interesting problem in itself. It arises in a number of applications [8, 14], and has been extensively studied, with various algorithms proposed (see, e.g., [2, 6, 28, 37, 25, 5, 19]). The special case of a mixture of two linear regressions is by now well understood [36, 6, 22, 13]. In this work, rather than solving mixed linear regression itself, we focus on a rigorous study of the EM algorithm.

Organization:
The remainder of the paper is organized as follows. In Section 2, we first present the setup of the EM algorithm for learning symmetric two-component mixed linear regression. Then, we present the convergence rates of the EM iterates under all regimes of SNR, with either random initialization or an efficiently computable local initialization. Finally, we discuss the tightness of the results. We present the proof sketch of the results in Section 3. We conclude the paper in Section 4, deferring the proofs of the main results to the appendices.
2 Convergence rates of the EM algorithm
We first formulate symmetric mixed linear regression with two components and the EM updates for this model in Section 2.1. Then, we state our main results on the convergence behavior of the EM algorithm under all regimes of SNR in Section 2.2. Finally, we provide a detailed discussion of the tightness of the results in Section 2.3.

2.1 Problem setup
We assume that the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ are generated from a symmetric two-component mixed linear regression, whose density function has the following form:

$$g_{\mathrm{true}}(x, y) := \Big(\tfrac{1}{2} f\big(y \,|\, -(\theta^*)^\top x, (\sigma^*)^2\big) + \tfrac{1}{2} f\big(y \,|\, (\theta^*)^\top x, (\sigma^*)^2\big)\Big)\,\bar{f}(x), \qquad (1)$$

where $\sigma^* = 1$ is given and $\theta^*$ is an unknown parameter. Furthermore, we assume that $\bar{f}(x)$ is the density of the standard multivariate Gaussian distribution, i.e., $X \sim \mathcal{N}(0, I_d)$. In order to estimate $\theta^*$, we fit the data using a symmetric two-component mixed linear regression, which is given by:

$$g_{\mathrm{fit}}(x, y; \theta) := \Big(\tfrac{1}{2} f\big(y \,|\, -\theta^\top x, (\sigma^*)^2\big) + \tfrac{1}{2} f\big(y \,|\, \theta^\top x, (\sigma^*)^2\big)\Big)\,\bar{f}(x). \qquad (2)$$

It is clear that $g_{\mathrm{fit}}(x, y; \theta^*) = g_{\mathrm{true}}(x, y)$. A common approach to obtaining an estimator of $\theta^*$ is maximum likelihood estimation (MLE). However, given that the log-likelihood function of symmetric two-component mixed linear regression is highly non-concave, the MLE does not have a closed-form expression. EM is a popular iterative algorithm for approximating the MLE. Given the fitted model (2), simple algebra shows that the EM update for $\theta$ can be written as follows:

$$\theta_n^{t+1} = \Big(\frac{1}{n}\sum_{i=1}^n X_i X_i^\top\Big)^{-1}\Big(\frac{1}{n}\sum_{i=1}^n \tanh\Big(\frac{Y_i X_i^\top \theta_n^t}{(\sigma^*)^2}\Big) Y_i X_i\Big), \qquad (3)$$

where the hyperbolic function $\tanh(x) := (\exp(x) - \exp(-x))/(\exp(x) + \exp(-x))$ for all $x \in \mathbb{R}$. In order to facilitate the ensuing argument, let us denote the population and finite-sample EM operators by equations (4) and (5), respectively:

$$M_{\mathrm{mlr}}(\theta) := \mathbb{E}\big[XY\tanh(YX^\top\theta)\big], \qquad (4)$$

$$M_{n,\mathrm{mlr}}(\theta) := \Big(\sum_{i=1}^n X_i X_i^\top\Big)^{-1}\Big(\sum_{i=1}^n X_i Y_i \tanh(Y_i X_i^\top \theta)\Big). \qquad (5)$$
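For concreteness, here is a minimal Python sketch of one finite-sample EM update (3), assuming NumPy; the function name and interface are ours, not the paper's:

```python
import numpy as np

def em_step(theta, X, Y):
    """One finite-sample EM update (3) for symmetric 2-MLR with sigma* = 1.

    theta: (d,) current estimate; X: (n, d) covariates; Y: (n,) responses.
    """
    n = X.shape[0]
    # E-step: soft labels enter only through the weights tanh(Y_i <X_i, theta>).
    w = np.tanh(Y * (X @ theta))
    # M-step: weighted least squares, (1/n sum X_i X_i^T)^{-1} (1/n sum w_i Y_i X_i).
    gram = X.T @ X / n
    rhs = X.T @ (w * Y) / n
    return np.linalg.solve(gram, rhs)
```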
Motivation from experiments: In Figure 1, we present the statistical rate and the optimization complexity of the EM algorithm under different regimes of SNR. We set $d = 5$ and initialize the estimator in a neighborhood of the true parameter, $\theta^0 = \theta^* + ru$, where $r$ is a small multiple of $\max\{1, \|\theta^*\|\}$ and $u$ is a random unit vector. For measuring the statistical rate, the EM algorithm is run with sample sizes $n$ on a geometric grid, and the estimation error is averaged over several thousand independent runs. The stopping criterion is the change in the estimator falling below a fixed small threshold in $\ell_2$ norm. In Figure 1 (a), we observe the standard $n^{-1/2}$ rate in the high SNR regime, and the $n^{-1/4}$ rate in the low SNR regime. Interestingly, we can see a clear transition in the statistical rate around $\mathrm{SNR} = 0.3$ as $n$ increases.

Figure 1. Convergence behavior of the EM algorithm for the fitted model (1) when $d = 5$: (a) statistical rate ($\|\theta_n^t - \theta^*\|$ at the last iteration) for various SNRs; (b) linear convergence in the high SNR regime; (c) slow convergence in the low SNR regime.

This explains how the low SNR regime is defined, $\|\theta^*\| \lesssim (d/n)^{1/4}$: whether the SNR is low depends on how many samples we have, not on an absolute value that can be computed from the problem instance alone. We also examine the optimization complexity in Figure 1 (b, c). We run the EM algorithm with a fixed sample size $n = 32768$. The estimation error $\|\theta_n^t - \theta^*\|$ at every iteration step is averaged over 5,000 independent runs.
In the high SNR regime, note that the $y$-axis is on a log scale, and we can see the linear convergence. In contrast, in the middle-to-low SNR regimes, we can observe that the convergence of the EM algorithm is no longer linear, and is significantly slowed down.
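The experiment just described can be reproduced in a few lines, reusing `em_step` from Section 2.1 and drawing data from the model (1); the grid of sample sizes, the seed, and the iteration budget below are illustrative choices of ours, not the paper's exact protocol:

```python
import numpy as np

def sample_2mlr(n, d, theta_star, rng):
    """Draw n samples from the symmetric two-component MLR model (1)."""
    X = rng.standard_normal((n, d))
    signs = rng.choice([-1.0, 1.0], size=n)
    Y = signs * (X @ theta_star) + rng.standard_normal(n)
    return X, Y

rng = np.random.default_rng(0)
d, snr = 5, 0.3
theta_star = snr * np.eye(d)[0]               # ||theta*|| = SNR since sigma* = 1
for n in [2 ** k for k in range(10, 16)]:
    X, Y = sample_2mlr(n, d, theta_star, rng)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                    # random unit direction
    theta = theta_star + 0.1 * max(1.0, snr) * u
    for _ in range(2000):                     # run EM to (near) convergence
        new = em_step(theta, X, Y)
        if np.linalg.norm(new - theta) < 1e-6:
            theta = new
            break
        theta = new
    print(n, np.linalg.norm(theta - theta_star))
```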
2.2 Main results

In this section, we state our main results on the convergence behavior of the EM algorithm under different regimes of SNR. Our first result assumes a good initialization and focuses on the statistical optimality of the EM algorithm in the last iterations. We can use the standard spectral method to obtain such a good initialization (see Appendix F.1 for the guarantees given by spectral initialization). Then, with a mild condition on the SNR and permission to use a simple variant of EM, our second result shows that EM converges globally to the true parameter with the same optimal statistical rates.

Throughout the paper, we assume that $n \ge Cd$ for a sufficiently large constant $C > 0$. Our analysis is divided into two cases, according to whether we are in the middle-to-high SNR regimes or in the low SNR regime. We state our first main theorem:
Theorem 1. (a) (Middle-to-high SNR regimes) Suppose $\|\theta^*\| \ge C(d\log^2(n/\delta)/n)^{1/4}$ for some large universal constant $C > 0$. In this regime, suppose we run the EM algorithm starting from a well-initialized $\theta_n^0$ such that $\|\theta_n^0\| \ge 0.9\|\theta^*\|$ and $\cos\angle(\theta^*, \theta_n^0) \ge 0.95$. Then, for any $\delta > 0$, there exist universal constants $C_1, C_2 > 0$ such that the EM updates (3) give $\theta_n^t$ which satisfies

$$\|\theta_n^t - \theta^*\| \le C_1 \max\{1, \|\theta^*\|^{-1}\}\big(d\log^2(n\|\theta^*\|/\delta)/n\big)^{1/2},$$

with probability at least $1 - \delta$, after $t \ge C_2 \max\{1, \|\theta^*\|^{-2}\}\log(n\|\theta^*\|/d)$ iterations.

(b) (Low SNR regime) When $\|\theta^*\| \le C(d\log^2(n/\delta)/n)^{1/4}$, there exist universal constants $C_1, C_2 > 0$ such that the EM updates (3), initialized with $\|\theta_n^0\| \le 0.2$, return $\theta_n^t$ which satisfies

$$\|\theta_n^t - \theta^*\| \le C_1 \big(d\log^2(n/\delta)/n\big)^{1/4},$$

with probability at least $1 - \delta$, after $t \ge C_2 \log(\log(n/d))\sqrt{n/(d\log^2(n/\delta))}$ iterations.

The proof sketch of Theorem 1 is in Section 3, while the full proof is in Appendix B. Interestingly, the upper bound given by Theorem 1 matches the known lower bounds for all SNR regimes in [6], and explains the detailed behavior that interpolates between the different separation regimes. Note that the additional requirement $\|\theta_n^0\| \ge 0.9\|\theta^*\|$ in the middle-to-high SNR regimes is only to prevent the analysis from becoming over-complicated (see Appendix C.3 for the arguments for starting from well-aligned small estimators). Furthermore, the initialization condition $\|\theta_n^0\| \le 0.2$ in the low SNR regime is not restrictive: if we start from $\|\theta_n^0\| \ge 0.2$, then in a finite number of steps the norm of the EM updates becomes smaller than 0.2.
Next, we present our second result, which does not rely on a warm start but requires slightly more involved mechanisms. We call the following variant of EM the "Easy-EM" operator [22]:

$$M_{\mathrm{easy}}(\theta) := \frac{1}{n}\sum_{i=1}^n X_i Y_i \tanh(Y_i X_i^\top \theta). \qquad (6)$$

Note that the only difference is the absence of the inverse of the sample covariance matrix. Our second theorem guarantees the global convergence of the EM algorithm with minimax optimality:
Theorem 2. Given $C_1 > 0$, suppose that $\|\theta^*\| \le C_1$. Let $\theta_n^0$ be a randomly initialized vector in $\mathbb{R}^d$ whose direction is sampled from the uniform distribution on the unit sphere. The norm of the initial estimator can be any non-zero constant such that $\|\theta_n^0\| \ge c(d\log^2(n/\delta)/n)^{1/4}$ for some universal constant $c > 0$.

(a) In the middle-to-high SNR regimes, there exist universal constants $C_2, C_3, C_4 > 0$ such that when $C_2(d\log^2(n/\delta)/n)^{1/4} \le \|\theta^*\| \le C_1$, with probability at least $1 - \delta$, we have

$$\|\theta_n^t - \theta^*\| \le C_3 \max\{1, \|\theta^*\|^{-1}\}\big(d\log^2(n/\delta)/n\big)^{1/2},$$

after we first run the Easy-EM algorithm (6) for $C_4\max\{1, \|\theta^*\|^{-2}\}\log(d)$ iterations, and then run the standard EM algorithm (3) for $C_4\max\{1, \|\theta^*\|^{-2}\}\log(n/d)$ iterations.

(b) In the low SNR regime, when $\|\theta^*\| \le C_2(d\log^2(n/\delta)/n)^{1/4}$, there exist universal constants $C_3, C_4 > 0$ such that with probability at least $1 - \delta$, we have

$$\|\theta_n^t - \theta^*\| \le C_3 \big(d\log^2(n/\delta)/n\big)^{1/4},$$

after we run either Easy-EM or standard EM for $t \ge C_4 \log(\log(n/d))\sqrt{n/(d\log^2(n/\delta))}$ iterations.

The proof sketch of Theorem 2 is in Section 3, while the full proof is in Appendix C. A few comments are in order. First, compared to Theorem 1, we have the additional assumption that $\|\theta^*\|$ is bounded. This is required for a technical reason that arises when giving uniform control on the deviation of the Easy-EM operator in one direction when $\|\theta^*\|$ can be arbitrarily large (see Remark 2 in Appendix C.2 for details). Second, in order to estimate how many iterations we must run Easy-EM, we can check the value of $\frac{1}{n}\sum_{i=1}^n Y_i^2 - 1$, which concentrates around $\|\theta^*\|^2$. We note that Easy-EM is only introduced for theoretical justification, and in practice we can simply run the EM algorithm from a randomly initialized point. Finally, our condition on the norm of the initial estimator is only to ensure that the initial point is sufficiently far from zero; in practice, we use any constant $\Omega(1)$ norm for the initial estimator. This is in stark contrast to the initialization of [33], in which only a very small initialization of order $\Theta((d/n)^{1/4})$ is allowed, which goes to 0 as $n \to \infty$.
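The two-phase procedure of Theorem 2 is easy to state in code. The sketch below reuses `em_step` from Section 2.1; the function names and the fixed iteration budgets are illustrative assumptions of ours, not the theorem's exact schedule:

```python
import numpy as np

def easy_em_step(theta, X, Y):
    """One Easy-EM update (6): em_step without the sample-covariance inverse."""
    n = X.shape[0]
    return X.T @ (np.tanh(Y * (X @ theta)) * Y) / n

def two_phase_em(X, Y, rng, t_easy=100, t_em=300):
    """Random init on the unit sphere, Easy-EM warm-up, then standard EM."""
    d = X.shape[1]
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)     # constant-norm, uniformly random direction
    for _ in range(t_easy):            # phase 1: boost the angle with theta*
        theta = easy_em_step(theta, X, Y)
    for _ in range(t_em):              # phase 2: standard EM to the final rate
        theta = em_step(theta, X, Y)
    return theta
```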
2.3 Tightness of the results

In this section, we discuss in detail the tightness of our results in Theorem 1 and Theorem 2.

Tightness of the result in the high SNR regime:
In the high SNR regime, a minimax rate should guarantee exact recovery when the noise variance goes to zero. Our results obtain a statistical rate of $\sqrt{d\log^2(n\|\theta^*\|/\delta)/n}$. Note that, since we have rescaled to $\sigma^* = 1$, we should interpret the statistical rate of the EM algorithm in the original scale, where it translates to $(\sigma^*\log(1/\sigma^*))\sqrt{d\log^2(n\|\theta^*\|/\delta)/n}$. Therefore, we still guarantee exact recovery as $\sigma^* \to 0$. We suspect the logarithmic dependency on $\|\theta^*\|$ is an artifact of our analysis, and we leave removing it as future work. As mentioned earlier, there has been much recent interest in establishing linear convergence and tight finite-sample error in the high SNR regime [36, 37, 22, 20]. While all previous results are also minimax optimal in all parameters (up to logarithmic factors), as an artifact of their analysis, these results rely on sample-splitting, and thus do not in fact analyze the algorithm that is used in practice. Our results remove this artifact.

A very recent work [13] established super-linear convergence of the EM algorithm in the noiseless setting (a.k.a. Alternating Minimization). In fact, we conjecture that their result can be extended to the noisy setting when the SNR is high enough (i.e., $\|\theta^*\| \gg 1$):

Lemma 1. If $C\sqrt{\log\|\theta^*\|} \le \|\theta - \theta^*\| \le \|\theta^*\|/2$ for a sufficiently large constant $C > 0$, then there exists a constant $c < 1$ such that

$$\|M_{\mathrm{mlr}}(\theta) - \theta^*\| \le c\,\|\theta - \theta^*\|^2/\|\theta^*\|.$$

The proof of Lemma 1 is in Appendix F.2. This lemma implies that until $\|\theta - \theta^*\|$ drops from $O(\|\theta^*\|)$ to $O(\sqrt{\log\|\theta^*\|})$, the population EM updates converge at a super-linear rate. While the super-linear convergence behavior is a very interesting phenomenon that deserves further exploration, it is beyond the scope of this paper.

Tightness of the result in the middle-to-low SNR regimes:
As discussed in the introduction, [22] recently established convergence of the EM algorithm in all SNR regimes for model (2). In particular, according to the result in [22], the EM algorithm can achieve arbitrary $\epsilon$ accuracy if the sample size $n$ is large enough to compensate for a low SNR $\eta := \|\theta^*\|/\sigma^*$, i.e., $\eta^{-4}/\epsilon^2 \lesssim n$. This sub-optimal result is an artifact of the technical approach used to relate the population and finite-sample EM operators. Specifically, the convergence rate of the population EM operator is given by $1 - \eta^2$. The finite-sample analysis then follows by analyzing the uniform deviation of the finite-sample operators from the population operators, which is of order $\sqrt{d/n}$. In order to guarantee progress toward $\theta^*$ in each step, as well as to control the accumulation of statistical errors over all iterations, [22] required $n \gtrsim \eta^{-4}$ per iteration. The sample-splitting results in an even worse total sample complexity of $n \gtrsim \eta^{-6}$ in terms of the SNR. Furthermore, nothing can be said when the sample size is less than the threshold $\eta^{-4}$. This calls for a more refined and tighter analysis of the EM algorithm in the middle-to-low SNR regimes.

We adopt the localization argument used in [11, 12], which established the convergence behavior of the EM algorithm for over-specified Gaussian mixtures, namely, with no separation of the parameters. Unlike these previous studies, our analysis is not restricted to strictly over-specified instances, but spans all possible configurations of the parameters. The core of the analysis has three parts: (i) a refined convergence rate of the population EM operator, $1 - \max\{\|\theta\|^2 - \eta^2, \eta^2\}$; (ii) a multi-level application of the uniform deviation of the finite-sample operators, which is proportional to $\|\theta\|\sqrt{d/n}$; and (iii) localization arguments applied at different levels of $\|\theta\|$. The threshold that separates the middle-SNR and low-SNR regimes is naturally found at $\eta = (d/n)^{1/4}$.

Global convergence of (Easy-)EM:
Global convergence of the EM algorithm for model (1) was established in [22] using a two-phase analysis in which EM first converges in angle, and then converges in $\ell_2$ norm. For the initial stage of the EM iterations with a random initialization, [22] proposed a simple variant of the EM update (6) to encourage boosting the angle from $\cos\angle(\theta_n^0, \theta^*) = O(1/\sqrt{d})$. Our result removes the use of sample-splitting in [22], and tightens the sub-optimal statistical rate in the middle-to-low SNR regimes as in Theorem 1.

In [33], the authors employed a similar idea of analyzing the growth of the signal strength in the $\theta^*$ direction for learning a symmetric mixture of two Gaussian distributions. However, in general, the value in the $\theta^*$ direction can indeed decrease if EM starts from a large initialization. Therefore, they restricted the initialization to a very small radius, $\|\theta_n^0\| \approx (d/n)^{1/4}$, in all SNR (separation) regimes. While this does not degrade the overall computational complexity of the finite-sample EM algorithm, a convergence guarantee with such small initialization is not global in a true sense: if $n$ grows to infinity (i.e., approaching the population setting), the initialization must be at 0, which is a saddle point of the log-likelihood. Theorem 2 resolves the open issue of small initialization in [33] by analyzing the convergence in angle.

3 Proof sketch

In this section, we give a proof sketch of Theorem 1; the full proof of Theorem 1 is in Appendix B. We need the following uniform deviation bound between the sample and population EM operators:
Lemma 2.
Given the population and finite-sample EM operators $M_{\mathrm{mlr}}$ and $M_{n,\mathrm{mlr}}$ in equations (4) and (5), for any given $r > 0$, there exists a universal constant $c > 0$ such that

$$\mathbb{P}\Big(\sup_{\|\theta\| \le r}\|M_{n,\mathrm{mlr}}(\theta) - M_{\mathrm{mlr}}(\theta)\| \le cr\sqrt{d\log^2(n/\delta)/n}\Big) \ge 1 - \delta. \qquad (7)$$

While the lemma is a straightforward consequence of Lemma 11 given in Appendix E, it is the first key result toward a tight statistical rate. The proof of Lemma 2 can be found in Appendix D.3.

High SNR regime: $\|\theta^*\| \ge C$. The high-level proof in the high SNR regime follows a specialized proof strategy exploited in [20]. The core idea is that in high SNR, most "good" samples are assigned correct (soft but almost hard) labels in the E-step, and the fraction of "bad" samples is negligibly small. Such an argument first appeared informally in [1], and was then formalized in [20, 21] to establish linear convergence and a tight statistical rate. The full proof for the high SNR regime is given in Appendix B.1.
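The "almost hard labels" phenomenon is easy to visualize empirically. The snippet below (our own illustration, with arbitrary constants and seed) measures the fraction of samples whose E-step weight is not saturated, which shrinks as the SNR grows:

```python
import numpy as np

# The fraction of samples whose E-step weight tanh(Y_i <X_i, theta*>) is not
# saturated (the genuinely "soft" labels) shrinks as the SNR grows.
rng = np.random.default_rng(3)
n, d = 100_000, 5
for snr in [1.0, 3.0, 10.0, 30.0]:
    theta_star = snr * np.eye(d)[0]
    X = rng.standard_normal((n, d))
    Y = rng.choice([-1.0, 1.0], n) * (X @ theta_star) + rng.standard_normal(n)
    w = np.tanh(Y * (X @ theta_star))
    print(snr, np.mean(np.abs(w) < 0.99))   # fraction of soft labels
```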
Middle SNR regime: $C(d\log^2(n/\delta)/n)^{1/4} \le \|\theta^*\| \le C$. We consider two cases: $\|\theta^*\| \ge 1$ and $\|\theta^*\| \le 1$.

Case (i), $1 \le \|\theta^*\| \le C$: Given the initialization conditions in Theorem 1, we can show that $\|M_{\mathrm{mlr}}(\theta) - \theta^*\| \le \rho\|\theta - \theta^*\|$ for a universal constant $\rho < 1$. Furthermore, from the uniform concentration Lemma 11 in Appendix E, we have $\|M_{n,\mathrm{mlr}}(\theta) - M_{\mathrm{mlr}}(\theta)\| \le \sqrt{d\log^2(n/\delta)/n}$ with probability at least $1 - \delta$. From here, we can check that

$$\|\theta_n^t - \theta^*\| \lesssim \rho^t\|\theta_n^0 - \theta^*\| + \sqrt{d\log^2(n/\delta)/n}.$$

Case (ii), $C(d\log^2(n/\delta)/n)^{1/4} \le \|\theta^*\| \le 1$: In this case, the result of Lemma 3 in Appendix B shows that

$$\|M_{\mathrm{mlr}}(\theta) - \theta^*\| \le \big(1 - O(\|\theta^*\|^2)\big)\|\theta - \theta^*\|. \qquad (8)$$

As Lemma 2 and Corollary 2 in the appendix make precise, we can infer that in order for the EM algorithm to make progress toward $\theta^*$, we need $\|\theta^*\|^2\|\theta - \theta^*\| \gtrsim \|\theta\|\sqrt{d/n}$. Intuitively, EM converges toward $\theta^*$ as long as this relation holds, and until $\theta$ gets close enough to $\theta^*$ that it fails. In other words, in the last iterations, when $\|\theta\| \approx \|\theta^*\|$, we have

$$\|\theta^*\|^2\|\theta - \theta^*\| \approx \|\theta^*\|\sqrt{d/n},$$

which implies that the statistical rate should be of order $\|\theta^*\|^{-1}\sqrt{d/n}$. The full proof is given in Appendix B.2.

Low SNR regime: $\|\theta^*\| \le C(d\log^2(n/\delta)/n)^{1/4}$. In this case, even the standard spectral methods would not give a good initialization, since the eigenspace is perturbed too much to be aligned with $\theta^*$ (see Lemma 13 in Appendix F.1 for the guarantees given by spectral methods). Instead, we only assume the initial estimator satisfies $\|\theta_n^0\| \le 0.2$. The key insight is that in this regime, EM essentially cannot distinguish between $\theta^* = 0$ and $\theta^* \ne 0$. Therefore, we aim to investigate $\|\theta\|$ instead of the estimation error $\|\theta - \theta^*\|$: if we can show that $\|\theta_n^t\| \le c_1(d/n)^{1/4}$, then given the condition of the low SNR regime, we have $\|\theta_n^t - \theta^*\| \le c_2(d/n)^{1/4}$, where $c_1, c_2$ are some positive constants.

In the low SNR regime, there exist universal constants $c_l, c_u > 0$ such that when $\|\theta\| \le 0.2$,

$$\|\theta\|\big(1 - \|\theta\|^2 - c_l\|\theta^*\|^2\big) \le \|M_{\mathrm{mlr}}(\theta)\| \le \|\theta\|\big(1 - \|\theta\|^2 + c_u\|\theta^*\|^2\big).$$

The statistical fluctuation of the finite-sample EM operator given in Lemma 2 shows that $\|M_{n,\mathrm{mlr}}(\theta) - M_{\mathrm{mlr}}(\theta)\| \le c\|\theta\|\sqrt{d\log^2(n/\delta)/n}$ for some universal constant $c$. It is now clear that, since $\|\theta^*\|^2 \lesssim \sqrt{d/n}$, the above statistical error subsumes the extra $O(\|\theta^*\|^2)$ term in the contraction rate of the population EM operator. Therefore, the convergence behaviors of the finite-sample EM operator are essentially the same whether $\theta^* = 0$ or $\theta^* \ne 0$.

The EM iterations stop improving the estimator when the statistical error becomes larger than the per-step progress of the population EM, i.e., when

$$\|\theta\|^2 \approx \sqrt{d\log^2(n/\delta)/n}.$$

Therefore, the statistical rate of the EM algorithm is achieved at $\|\theta\| \lesssim (d/n)^{1/4}$.
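To make the plateau concrete, here is a toy one-dimensional simulation (entirely our own illustration, not part of the proof): the iterate contracts by $(1 - \theta^2)$ per step and receives noise of order $\theta\sqrt{d/n}$, and it stalls at the predicted $(d/n)^{1/4}$ scale:

```python
import numpy as np

# Toy scalar recursion mimicking the low-SNR dynamics described above:
# population contraction theta * (1 - theta^2), plus statistical noise of
# order theta * sqrt(d/n). The iterate plateaus near (d/n)**0.25.
rng = np.random.default_rng(1)
d, n = 5, 100_000
eps = np.sqrt(d / n)
theta = 0.2                                 # initialization ||theta^0|| <= 0.2
for t in range(int(np.sqrt(n / d))):        # ~ sqrt(n/d) iterations
    theta = abs(theta * (1 - theta**2) + theta * eps * rng.standard_normal())
print(theta, (d / n) ** 0.25)               # comparable magnitudes
```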
The rest of the proof in the low SNR regime is reminiscent of the localization arguments used in [10, 12], and can be found in Appendix B.3.

Proof sketch of Theorem 2: The global convergence statement is subsumed into Theorem 1 once the estimator $\theta$ enters the initialization region that Theorem 1 requires. Therefore, we can focus on the iterations during which $\theta$ stays outside the initialization region. The key idea is to adopt the angle-convergence argument presented in [22]. Note that in the low SNR regime we do not need such an involved argument, since the initialization only requires $\|\theta_n^0\| \le 0.2$. In the middle-to-high SNR regimes, where $(d/n)^{1/4} \lesssim \|\theta^*\| \le C$, the first key step is to show that the population operator boosts the angle:

$$\cos\angle(M_{\mathrm{mlr}}(\theta), \theta^*) \ge (1 + c\|\theta^*\|^2)\cos\angle(\theta, \theta^*),$$

for some universal constant $c > 0$.
We again see that the rate of increase is $1 + O(\|\theta^*\|^2)$; however, the cosine value is very small, $\Theta(1/\sqrt{d})$, at the initial stage. The second key step is then to show that the statistical error perturbs the cosine by at most

$$\big|\cos\angle(M_{\mathrm{easy}}(\theta), \theta^*) - \cos\angle(M_{\mathrm{mlr}}(\theta), \theta^*)\big| \le \epsilon_f/\sqrt{d},$$

for sufficiently small $\epsilon_f \lesssim \sqrt{d/n}$ (up to logarithmic factors). At a high level, if it holds that $c\|\theta^*\|^2\cos\angle(\theta, \theta^*) \ge 2\epsilon_f/\sqrt{d}$, then we can guarantee that $\cos\angle(M_{\mathrm{easy}}(\theta), \theta^*) \ge (1 + c\|\theta^*\|^2/2)\cos\angle(\theta, \theta^*)$. We can conclude that this is true in the middle-SNR regime, since $\|\theta^*\| \gtrsim (d/n)^{1/4}$. The argument in the high-SNR regime is similar to the middle-SNR regime. The formal proof is a bit more involved, since we need to ensure that the statistical error in orthogonal directions does not dominate the angle (see Appendix C.2 for more detail).

4 Conclusion

In this paper, we completely characterize the convergence behavior of EM under all SNR regimes of symmetric two-component mixed linear regression. We view our results for this model as a first step toward a comprehensive understanding of the EM algorithm for learning weakly separated latent variable models. We now discuss a few future directions that arise naturally from our work. First, in more general settings of weakly separated mixture models with $k$ components, it is known that the rate of the MLE can be $n^{-O(1/k)}$ in the worst case [15]. Furthermore, EM is known to suffer from very slow convergence in practice on instances with large overlaps. It is an important future direction to characterize the convergence behavior of the EM algorithm in such settings. Second, our results demonstrate that the EM algorithm converges sub-linearly to $\theta^*$ in the middle and low SNR regimes. This leads, respectively, to $\|\theta^*\|^{-2}\log(n/d)$ and $\sqrt{n/d}$ iterations in the middle-to-low SNR regimes, which results in high computational complexity. An important direction is to develop an alternative to the EM algorithm that achieves much cheaper computational complexity while still obtaining minimax optimal sample complexity under all SNR regimes of mixed linear regression.

A Additional Notations
We sometimes use a transformed coordinate system in which the first two coordinates span $\theta$ and $\theta^*$. That is, let $\{v_1, \ldots, v_d\}$ be the standard basis in the transformed coordinates, such that $v_1 = \theta/\|\theta\|$ and $\mathrm{span}(v_1, v_2) = \mathrm{span}(\theta, \theta^*)$. Since the Gaussian distribution is invariant to rotation, we often work in the transformed space in the proofs. Let $\alpha = \angle(\theta, \theta^*)$, $\eta = \|\theta^*\|/\sigma^*$, and $\sigma_2^2 = 1 + \|\theta^*\|^2\sin^2\alpha$.

We define a few more quantities to simplify the notation throughout the proofs. Let $x_1, x_2$ be $X^\top v_1, X^\top v_2$, respectively. Following the notation in [22], we denote $b_1^* = \theta^{*\top} v_1 = \|\theta^*\|\cos\angle(\theta, \theta^*)$ and $b_2^* = \theta^{*\top} v_2 = \|\theta^*\|\sin\angle(\theta, \theta^*)$. Note that in this transformed coordinate system, due to the symmetry of the distribution, $M_{\mathrm{mlr}}(\theta)^\top v_j = 0$ for all $j \ge 3$. Hence we focus on bounding the values in the first two coordinates.

Using the coordinate transformation and the new notation defined here, we can write the population operator in the new coordinates as:

$$M_{\mathrm{mlr}}(\theta) = \mathbb{E}_{X,Y}\big[\tanh(YX^\top\theta)YX\big] = \mathbb{E}_{x_1,x_2,y}\big[\tanh(yx_1\|\theta\|)x_1y\big]v_1 + \mathbb{E}_{x_1,x_2,y}\big[\tanh(yx_1\|\theta\|)x_2y\big]v_2, \qquad (9)$$

where $y \,|\, (x_1, x_2) \sim \mathcal{N}(x_1b_1^* + x_2b_2^*, 1)$; we can consider $y$ as a single Gaussian due to the symmetry in the signs of $y$ and the Gaussian noise.

B Proof of Theorem 1
We first consider the middle-to-high SNR regimes, and then the low SNR regime. In the middle-to-high SNR regimes, we assume that we start from an initialization where $\cos\alpha \ge 0.95$. We note that the additional requirement $\|\theta_n^0\| \ge 0.9\|\theta^*\|$ is only to prevent the analysis from becoming over-complicated (see Appendix C.3 for the arguments for starting from well-aligned small estimators).

We will frequently use the fact that $\|\theta^*\|\sin\alpha \le \|\theta - \theta^*\|$. We can check that $\theta$ remains in this good initialization region using the convergence property of the angles (see the arguments for the sine values in Appendix C.3). Before getting into the detailed proof, we state some useful lemmas from previous work. We need the following lemma for the contraction rate of the population EM operator (4):

Lemma 3 (Theorem 4 in [22]). Assume $\alpha < \pi/2$. Then, we have

$$\|M_{\mathrm{mlr}}(\theta) - \theta^*\| \le \max\{\kappa, \rho\}\|\theta - \theta^*\| + \kappa(16\sin\alpha)\|\theta^*\|\frac{\eta^2}{1 + \eta^2}, \qquad (10)$$

where $\rho < 1$ is a universal constant and $\kappa = \Big(\sqrt{1 + \min\{\sigma_2^2\|\theta\|^2, \|\theta^*\|^2\cos^2\alpha\}/\sigma_2^2}\Big)^{-1}$.

B.1 High SNR Regime
First, we rearrange the sample operator as follows:

$$M_{n,\mathrm{mlr}}(\theta) - \theta^* = \Big(\frac{1}{n}\sum_i X_iX_i^\top\Big)^{-1}\Big(\frac{1}{n}\sum_i X_iY_i\tanh(Y_iX_i^\top\theta)\Big) - \theta^*$$
$$= \Big(\frac{1}{n}\sum_i X_iX_i^\top\Big)^{-1}\Big(\frac{1}{n}\sum_i X_iY_i\tanh(Y_iX_i^\top\theta) - \frac{1}{n}\sum_i X_iY_i\tanh(Y_iX_i^\top\theta^*) + \frac{1}{n}\sum_i X_iY_i\tanh(Y_iX_i^\top\theta^*) - \frac{1}{n}\sum_i X_iX_i^\top\theta^*\Big)$$
$$= \Big(\frac{1}{n}\sum_i X_iX_i^\top\Big)^{-1}\Big(\underbrace{\mathbb{E}_{X,Y}\big[XY\Delta_{(X,Y)}(\theta)\big]}_{:=A_1} + \underbrace{\frac{1}{n}\sum_i X_iY_i\Delta_{(X_i,Y_i)}(\theta) - \mathbb{E}_{X,Y}\big[XY\Delta_{(X,Y)}(\theta)\big]}_{:=A_2} + \underbrace{\frac{1}{n}\sum_i X_iY_i\tanh(Y_iX_i^\top\theta^*) - \mathbb{E}_{Y_i|X_i}\Big[\frac{1}{n}\sum_i X_iY_i\tanh(Y_iX_i^\top\theta^*)\Big]}_{:=A_3}\Big), \qquad (11)$$

where $\Delta_{(X,Y)}(\theta) := \tanh(YX^\top\theta) - \tanh(YX^\top\theta^*)$. In the term $A_3$, the expectation is taken over $Y_i \,|\, X_i \sim \frac{1}{2}\mathcal{N}(X_i^\top\theta^*, 1) + \frac{1}{2}\mathcal{N}(-X_i^\top\theta^*, 1)$ with $X_i$ fixed. Note that the true parameters are fixed points of the EM operators, and it is easy to check that the expectation in $A_3$ is equivalent to $\frac{1}{n}\sum_i X_iX_i^\top\theta^*$.

Now, we claim the following bounds on $A_1$, $A_2$, and $A_3$ in equation (11):

$$\|A_1\| \le \rho\,\|\theta - \theta^*\| \text{ for a universal constant } \rho < 1, \qquad (12)$$
$$\|A_2\| \le (\|\theta - \theta^*\| + 1)\sqrt{d\log^2(n\|\theta^*\|/\delta)/n}, \qquad (13)$$
$$\|A_3\| \le C\sqrt{d\log(1/\delta)/n}, \qquad (14)$$

with probability at least $1 - \delta$; here, $C$ is some universal constant.

Assuming these claims for the moment, we proceed to finish the proof of the convergence of the EM algorithm in the high SNR regime. Plugging the results from equations (12), (13), and (14) into equation (11), we find that

$$\|M_{n,\mathrm{mlr}}(\theta) - \theta^*\| \le \Big(\rho + \sqrt{d\log^2(n\|\theta^*\|/\delta)/n}\Big)\|\theta - \theta^*\| + C\sqrt{d\log^2(n\|\theta^*\|/\delta)/n} \le \gamma\|\theta - \theta^*\| + C\sqrt{d\log^2(n\|\theta^*\|/\delta)/n},$$

for some $\gamma < 1$. From here, let $\epsilon_n := C\sqrt{d\log^2(n\|\theta^*\|/\delta)/n}$ and iterate over $t$ to bound the estimation error at the $t$-th step:

$$\|\theta_n^{t+1} - \theta^*\| \le \gamma\|\theta_n^t - \theta^*\| + \epsilon_n \le \gamma^2\|\theta_n^{t-1} - \theta^*\| + (1 + \gamma)\epsilon_n \le \cdots \le \gamma^t\|\theta_n^0 - \theta^*\| + \frac{1}{1 - \gamma}\epsilon_n.$$

After $t \ge c_1\log(n\|\theta^*\|/d)$ iterations, we have $\|\theta_n^t - \theta^*\| \le c_2\sqrt{d/n}$ (up to logarithmic factors), where $c_1$ and $c_2$ are universal constants. As a consequence, we reach the conclusion of the theorem for the high SNR regime.

Proof of claim (12): In order to bound $A_1$, we can use the result of Corollary 1 in Appendix B.2. Observe that

$$\mathbb{E}\big[XY\tanh(YX^\top\theta^*)\big] = \theta^*, \qquad \mathbb{E}\big[XY\tanh(YX^\top\theta)\big] = M_{\mathrm{mlr}}(\theta).$$

From Corollary 1, we conclude that $\|A_1\| = \|\mathbb{E}_{X,Y}[XY\Delta_{(X,Y)}(\theta)]\| \le \rho\|\theta - \theta^*\|$. Therefore, we reach the conclusion of claim (12).
Proof of claim (13): Next, we bound $A_2$. We first discretize the parameter space for $\theta$ as follows:

$$\mathbb{P}\Big(\sup_{\theta \in B(\theta^*, r)}\Big\|\frac{1}{n}\sum_{i=1}^n X_iY_i\Delta_i(\theta) - \mathbb{E}[XY\Delta(\theta)]\Big\| \ge t\Big) \le \underbrace{\mathbb{P}\Big(\sup_{j \in [N_\epsilon]}\Big\|\frac{1}{n}\sum_{i=1}^n X_iY_i\Delta_i(\theta_j) - \mathbb{E}[X_iY_i\Delta_i(\theta_j)]\Big\| \ge t/2\Big)}_{\text{finite-sample error}} + \underbrace{\mathbb{P}\Big(\sup_{\|\theta - \theta'\| \le \epsilon}\Big\|\frac{1}{n}\sum_{i=1}^n X_iY_i\big(\Delta_i(\theta) - \Delta_i(\theta')\big)\Big\| + \big\|\mathbb{E}\big[XY(\Delta(\theta) - \Delta(\theta'))\big]\big\| \ge t/2\Big)}_{\text{discretization error}},$$

where $\Delta_i(\theta)$ is shorthand for $\Delta_i(\theta) := \tanh(Y_iX_i^\top\theta) - \tanh(Y_iX_i^\top\theta^*)$, $\Delta(\theta)$ is shorthand for $\Delta(\theta) := \tanh(YX^\top\theta) - \tanh(YX^\top\theta^*)$, $N_\epsilon$ is the $\epsilon$-covering number of $B(\theta^*, r)$, and $\{\theta_j, j \in [N_\epsilon]\}$ is the corresponding $\epsilon$-covering set.

The discretization error can be bounded by the Lipschitz continuity of the function $\Delta_i$, namely, $|\Delta_i(\theta) - \Delta_i(\theta')| \le |Y_i||X_i^\top\theta - X_i^\top\theta'|$ for all $\theta, \theta'$. It follows that

$$\Big\|\frac{1}{n}\sum_{i=1}^n X_iY_i\big(\Delta_i(\theta) - \Delta_i(\theta')\big)\Big\| \le \Big\|\frac{1}{n}\sum_{i=1}^n Y_i^2X_iX_i^\top(\theta - \theta')\Big\| \le \epsilon\,\Big|\Big|\Big|\frac{1}{n}\sum_{i=1}^n Y_i^2X_iX_i^\top\Big|\Big|\Big|_{\mathrm{op}}.$$

Note that $\mathbb{E}[Y^2XX^\top] = (1 + \|\theta^*\|^2)I + 2\theta^*\theta^{*\top}$, hence $|||\mathbb{E}[Y^2XX^\top]|||_{\mathrm{op}} \le 3\|\theta^*\|^2 + 1$. Furthermore, from Lemma 10, we have $|||\frac{1}{n}\sum_{i=1}^n Y_i^2X_iX_i^\top|||_{\mathrm{op}} \lesssim \|\theta^*\|^2 + 1$ with probability at least $1 - \delta$. We conclude that the discretization error is at most of order $\epsilon(\|\theta^*\|^2 + 1)$ with probability at least $1 - \delta$.

In order to bound the finite-sample error for each fixed $\theta_j$, we adopt the per-sample decomposition argument used in the previous works [20] and [21]. To simplify the notation, let $Z_i$ be the noise such that $Y_i = \nu_iX_i^\top\theta^* + Z_i$, where $\nu_i$ is an independent Rademacher variable. We define the good events as follows:

$$E_1 = \big\{|X^\top(\theta^* - \theta)| \le |X^\top\theta^*|/2\big\}, \qquad E_2 = \big\{|X^\top\theta^*| \ge \tau\big\}, \qquad E_3 = \big\{|Z| \le \tau/2\big\},$$

where we choose $\tau$ later. Let the good event be $E_{\mathrm{good}} := E_1 \cap E_2 \cap E_3$. Then we have the following lemma:

Lemma 4. Under the event $E_{\mathrm{good}}$, we have $|\Delta_{(X,Y)}(\theta)| \le 4\exp(-\tau^2/2)$.

Proof. Without loss of generality, let $\nu = +1$. We can check that

$$YX^\top\theta = (\nu X^\top\theta^* + Z)(X^\top\theta^*) + (\nu X^\top\theta^* + Z)\big(X^\top(\theta - \theta^*)\big) = (\nu X^\top\theta^* + Z)\big(X^\top\theta^* + X^\top(\theta - \theta^*)\big) \ge \frac{\tau}{2}\cdot\frac{\tau}{2} = \frac{\tau^2}{4}.$$

Since $\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \ge 1 - 2\exp(-2x)$ for $x \ge 0$, we have $\tanh(YX^\top\theta) \ge 1 - 2\exp(-\tau^2/2)$. Similarly, $\tanh(YX^\top\theta^*) \ge 1 - 2\exp(-\tau^2/2)$. On the other hand, $\tanh(x) \le 1$ for all $x$. We can conclude that $|\Delta_{(X,Y)}(\theta)| \le 4\exp(-\tau^2/2)$. For the other sign, $\nu = -1$, we can argue similarly.
To simplify the notation, we denote $W_i := \nu_iX_iX_i^\top\theta^*\Delta_i(\theta)$. Then, we can decompose $A_2$ as follows:

$$A_2 = \underbrace{\Big(\frac{1}{n}\sum_{i=1}^n X_iZ_i\Delta_i(\theta) - \mathbb{E}[XZ\Delta(\theta)]\Big)}_{:=T_1} + \underbrace{\Big(\frac{1}{n}\sum_{i=1}^n W_i - \mathbb{E}[W]\Big)}_{:=T_2}. \qquad (15)$$

We first claim the following high-probability bound on $T_1$:

$$\mathbb{P}(\|T_1\| \ge t) \le \exp\Big(-\frac{nt^2}{K_1} + K_1'd\Big), \qquad (16)$$

for some universal constants $K_1, K_1' > 0$, where we assume $n \gg d$ so as to ignore the sub-exponential tail part. The proof of claim (16) is deferred to the end of the proof for the high SNR regime.

For the term $T_2$ in equation (15), we apply a per-sample decomposition:

$$\frac{1}{n}\sum_i W_i - \mathbb{E}[W] = \frac{1}{n}\sum_i\big(W_i\mathbb{1}_{E_{\mathrm{good}}} - \mathbb{E}[W\mathbb{1}_{E_{\mathrm{good}}}]\big) + \frac{1}{n}\sum_i\big(W_i\mathbb{1}_{E_1^c} - \mathbb{E}[W\mathbb{1}_{E_1^c}]\big) + \frac{1}{n}\sum_i\big(W_i\mathbb{1}_{E_1\cap E_2^c} - \mathbb{E}[W\mathbb{1}_{E_1\cap E_2^c}]\big) + \frac{1}{n}\sum_i\big(W_i\mathbb{1}_{E_1\cap E_2\cap E_3^c} - \mathbb{E}[W\mathbb{1}_{E_1\cap E_2\cap E_3^c}]\big).$$

In the sequel, we will show that

$$\mathbb{P}\Big(\Big\|\frac{1}{n}\sum_i\big(W_i\mathbb{1}_{E_{\mathrm{good}}} - \mathbb{E}[W\mathbb{1}_{E_{\mathrm{good}}}]\big)\Big\| \ge t\Big) \le \exp\Big(-\frac{nt^2}{K_2\|\theta^*\|^2\exp(-\tau^2)} + K_2'd\Big), \qquad (17)$$

$$\mathbb{P}\Big(\Big\|\frac{1}{n}\sum_i\big(W_i\mathbb{1}_{E_1^c} - \mathbb{E}[W\mathbb{1}_{E_1^c}]\big)\Big\| \ge t\Big) \le \exp\Big(-\frac{nt^2}{K_3\|\theta - \theta^*\|^2} + K_3'd\Big), \qquad (18)$$

$$\mathbb{P}\Big(\Big\|\frac{1}{n}\sum_i\big(W_i\mathbb{1}_{E_1\cap E_2^c} - \mathbb{E}[W\mathbb{1}_{E_1\cap E_2^c}]\big)\Big\| \ge t\Big) \le \exp\Big(-\frac{nt^2}{K_4\tau^2} + K_4'd\Big), \qquad (19)$$

$$\mathbb{P}\Big(\sup_{\theta \in B(\theta^*, r)}\Big\|\frac{1}{n}\sum_i W_i\mathbb{1}_{E_1\cap E_2\cap E_3^c} - \mathbb{E}\big[W\mathbb{1}_{E_1\cap E_2\cap E_3^c}\big]\Big\| \text{ is negligible}\Big) \ge 1 - \delta, \qquad (20)$$

where the $K_{(\cdot)}$ are all universal constants. The last probability is due to our choice $\tau = \Theta(\sqrt{\log(n\|\theta^*\|/\delta)})$, under which no sample falls in the event $E_3^c$ with probability at least $1 - \delta$. We set $t$ and $\epsilon$ as follows:

$$t = O\Big((\|\theta - \theta^*\| + 1)\sqrt{d\log^2(n\|\theta^*\|/\delta)/n}\Big), \qquad \epsilon = O\Big(\|\theta^*\|^{-2}\sqrt{d\log^2(n\|\theta^*\|/\delta)/n}\Big).$$

The overall finite-sample error term is bounded by taking a union bound over the $\epsilon$-covering set. Note that $\log(N_\epsilon) \le c\,d\log(n\|\theta^*\|/d)$ for some universal constant $c$. Hence the total probability that $\|T_2\| \ge t$ is dominated by

$$\exp\Big(-\frac{nt^2}{K_5\|\theta - \theta^*\|^2} + K_5'd\log(n\|\theta^*\|/d)\Big) + \exp\Big(-\frac{nt^2}{K_6\tau^2} + K_6'd\log(n\|\theta^*\|/d)\Big),$$

for some (new) universal constants $K_5, K_5', K_6, K_6' > 0$.
Our choice of $t$ gives a total probability bound of $5\delta$ for the finite-sample error. We can conclude that

$$\|A_2\| \le t \lesssim (\|\theta - \theta^*\| + 1)\sqrt{d\log^2(n\|\theta^*\|/\delta)/n}$$

with probability at least $1 - 5\delta$; rescaling $\delta$ gives claim (13).

Proof of claim (14): Finally, to bound $A_3$, we use Proposition 11 in [22], which targets exactly this quantity.

Lemma 5 (Proposition 11 in [22]). For each fixed $\theta$, with probability at least $1 - \exp(-cn) - d\exp(-nt^2/2)$,

$$\Big\|\frac{1}{n}\sum_i X_iY_i\tanh(Y_iX_i^\top\theta) - \frac{1}{n}\sum_i \mathbb{E}_{Y_i|X_i}\big[Y_iX_i\tanh(Y_iX_i^\top\theta)\big]\Big\| \le t, \qquad (21)$$

for some absolute constant $c > 0$.

Applying the above lemma with $\theta = \theta^*$, we can show that $\|A_3\| \le C\sqrt{d\log(1/\delta)/n}$ with probability at least $1 - \delta$. As a consequence, we obtain claim (14).

Proof of Equation (16): We use the notion of the sub-exponential Orlicz norm. It is easy to see that $X_iZ_i\Delta_i$ is a sub-exponential random vector with Orlicz norm $O(1)$. Using the standard concentration result in [30], we get the result.

Proof of Equation (17): Similarly to the previous case, we need to bound the sub-exponential norm of the quantity:

$$\|W_i\mathbb{1}_{E_{\mathrm{good}}}\|_{\psi_1} = \sup_{u \in S^{d-1}}\sup_{p \ge 1} p^{-1}\,\mathbb{E}\big[|(X_i^\top u)(X_i^\top\theta^*)\Delta_i\mathbb{1}_{E_{\mathrm{good}}}|^p\big]^{1/p} \le 4\exp(-\tau^2/2)\sup_{u \in S^{d-1}}\sup_{p \ge 1} p^{-1}\,\mathbb{E}\big[|(X_i^\top u)(X_i^\top\theta^*)|^p\big]^{1/p}$$
$$\le 4\exp(-\tau^2/2)\sup_{u \in S^{d-1}}\sup_{p \ge 1} p^{-1}\Big(\mathbb{E}\big[(X^\top u)^{2p}\big]\,\mathbb{E}\big[(X_i^\top\theta^*)^{2p}\big]\Big)^{1/(2p)} \le K_2\|\theta^*\|\exp(-\tau^2/2).$$

We use the fact that $|\Delta_i(\theta)| \le 4\exp(-\tau^2/2)$ under the good event, the Cauchy–Schwarz inequality, and the fact that the $p$-th moment of a standard Gaussian is $O((2p)^{p/2})$. Similarly using the result in [30], we obtain equation (17).

Proof of Equation (18): We check the sub-exponential $\psi_1$-Orlicz norm again:

$$\|W_i\mathbb{1}_{E_1^c}\|_{\psi_1} = \sup_{u \in S^{d-1}}\sup_{p \ge 1} p^{-1}\,\mathbb{E}\big[|(X_i^\top u)(X_i^\top\theta^*)\Delta_i\mathbb{1}_{E_1^c}|^p\big]^{1/p} \le 4\sup_{u \in S^{d-1}}\sup_{p \ge 1} p^{-1}\,\mathbb{E}\big[|(X_i^\top u)(X_i^\top(\theta^* - \theta))|^p\big]^{1/p} \le K_3\|\theta^* - \theta\|,$$

where we used that $|X^\top\theta^*| \le 2|X^\top(\theta^* - \theta)|$ on $E_1^c$ and $|\Delta_i| \le 2$; from this, we again use the standard result to get (18).

Proof of Equation (19):

$$\|W_i\mathbb{1}_{E_1\cap E_2^c}\|_{\psi_1} = \sup_{u \in S^{d-1}}\sup_{p \ge 1} p^{-1}\,\mathbb{E}\big[|(X_i^\top u)(X_i^\top\theta^*)\Delta_i\mathbb{1}_{E_1\cap E_2^c}|^p\big]^{1/p} \le 2\tau\sup_{u \in S^{d-1}}\sup_{p \ge 1} p^{-1}\,\mathbb{E}\big[|X_i^\top u|^p\big]^{1/p} \le K_4\tau,$$

since $|X^\top\theta^*| \le \tau$ on $E_2^c$, which gives the desired result.

Proof of Equation (20): For this quantity, note that

$$\mathbb{P}\big(\forall i \in [n],\ |Z_i| \lesssim \sqrt{\log(n/\delta)}\big) \ge 1 - n\exp(-c\tau^2).$$

Hence it is very likely that no sample falls into this category. Meanwhile, we can bound the expectation term:

$$\sup_{u \in S^{d-1}}\mathbb{E}\big[W^\top u\,\mathbb{1}_{E_1\cap E_2\cap E_3^c}\big] \le \sup_{u \in S^{d-1}}\mathbb{E}\big[|W^\top u|\,\mathbb{1}_{E_1\cap E_2}\,\big|\,E_3^c\big]\,\mathbb{P}(E_3^c) \le 2\sup_{u \in S^{d-1}}\mathbb{E}\big[|(X_i^\top u)(X_i^\top\theta^*)|\big]\,\mathbb{P}(E_3^c) \le K_5\|\theta^*\|\exp(-c\tau^2).$$

Since $\tau^2 = \Theta(\log(n\|\theta^*\|/\delta))$, we have the result.

B.2 Middle SNR Regime
We consider two cases: $\|\theta^*\| \ge 1$ and $\|\theta^*\| \le 1$.

Case (i), $1 \le \|\theta^*\| \le C$: Given the initialization conditions in Theorem 1, we can derive the following corollary of Lemma 3.
Corollary 1.
When $\|\theta^*\| \ge 1$ and $\sin\alpha$ is smaller than a sufficiently small universal constant, we have $\|M_{\mathrm{mlr}}(\theta) - \theta^*\| \le \rho\|\theta - \theta^*\|$ for a universal constant $\rho < 1$.

The proof of Corollary 1 is in Appendix D.2.1. Furthermore, from the uniform concentration Lemma 11 in Appendix E, for all $\theta$ with $\|\theta - \theta^*\| \le O(\|\theta^*\|)$, we have $\|M_{n,\mathrm{mlr}}(\theta) - M_{\mathrm{mlr}}(\theta)\| \le C\sqrt{d\log^2(n/\delta)/n}$ with probability at least $1 - \delta$. From here, we can check that

$$\|\theta_n^t - \theta^*\| \lesssim \rho^t\|\theta_n^0 - \theta^*\| + O\Big(\sqrt{d\log^2(n/\delta)/n}\Big).$$

Case (ii), $C(d\log^2(n/\delta)/n)^{1/4} \le \|\theta^*\| \le 1$: In this case, the result of Lemma 3 shows that:
Corollary 2.
When $\|\theta^*\| \le 1$ and $\sin\alpha$ is smaller than a sufficiently small universal constant, we have

$$\|M_{\mathrm{mlr}}(\theta) - \theta^*\| \le \big(1 - c_0\|\theta^*\|^2\big)\|\theta - \theta^*\|, \qquad (22)$$

for a universal constant $c_0 > 0$.

In order to analyze the convergence of the finite-sample EM operator, we first divide the iterations into several epochs. Let $\bar{C} = \|\theta_n^0 - \theta^*\|$. We consider that in the $l$-th epoch, $\theta$ satisfies $\bar{C}2^{-l-1} \le \|\theta - \theta^*\| \le \bar{C}2^{-l}$. Note that this division into epochs is only conceptual, and does not affect the implementation of the EM algorithm.

Suppose we are in the $l$-th epoch, so that $\bar{C}2^{-l-1} \le \|\theta - \theta^*\| \le \bar{C}2^{-l}$. The key idea is that in each epoch, EM makes progress toward the ground truth as long as the improvement of the population operator overcomes the statistical error, i.e.,

$$c_0\|\theta^*\|^2\|\theta - \theta^*\| \ge 2c_1 r\sqrt{d\log^2(n/\delta)/n},$$

where $c_1$ is the constant in Lemma 2. Here, since $\|\theta\| \le \|\theta^*\| + \|\theta - \theta^*\|$, we can set $r = \|\theta^*\| + \bar{C}2^{-l}$. This in turn implies that in the $l$-th epoch, if

$$c_0\|\theta^*\|^2\bar{C}2^{-l-1} \ge 2c_1 r\sqrt{d\log^2(n/\delta)/n} \ge 2c_1\big(\|\theta^*\| + \bar{C}2^{-l}\big)\sqrt{d\log^2(n/\delta)/n},$$

then we have

$$\|M_{n,\mathrm{mlr}}(\theta) - \theta^*\| \le \Big(1 - \frac{c_0}{2}\|\theta^*\|^2\Big)\|\theta - \theta^*\|.$$

Rearranging the terms, we require that

$$\bar{C}2^{-l}\Big(\frac{c_0}{2}\|\theta^*\|^2 - 2c_1\sqrt{d\log^2(n/\delta)/n}\Big) \ge 2c_1\|\theta^*\|\sqrt{d\log^2(n/\delta)/n}.$$

Recall that we are in the middle SNR regime, where (with appropriately set constants) $\|\theta^*\|^2 \ge (c_1 + 1)\sqrt{d\log^2(n/\delta)/n}$. Therefore, $\theta$ is guaranteed to move closer to $\theta^*$ as long as $\bar{C}2^{-l} \gtrsim \|\theta^*\|^{-1}\sqrt{d\log^2(n/\delta)/n}$. Note that each epoch takes $O(\|\theta^*\|^{-2})$ iterations to enter the next epoch. We can conclude that after $l = O(\log(n/d))$ epochs, we enter the region where

$$\|\theta - \theta^*\| \le c_2\|\theta^*\|^{-1}\sqrt{d\log^2(n/\delta)/n},$$

for some absolute constant $c_2 > 0$. To keep the overall probability bound at $\delta$, we can replace $\delta$ with $\delta/\log(n/d)$ and take a union bound of the uniform deviation of the finite-sample EM operators given in Lemma 11 over all epochs. This does not change the complexity of the final statistical error.

Finally, the number of iterations required in each epoch to halve $\|\theta - \theta^*\|$ is $O(\|\theta^*\|^{-2})$. Since the total number of epochs required is $O(\log(n/d))$, the total number of iterations is at most $O(\|\theta^*\|^{-2}\log(n/d))$, concluding the proof in the middle SNR regime.

Remark 1. After $O(\log(n/d))$ epochs, studying the property of the Hessian in a very small neighborhood of $\theta^*$ may lead to a guarantee that EM indeed converges to the empirical MLE; see Section 6 in [33] for an example.

B.3 Low SNR Regime
As mentioned in the main text, the core idea in the low SNR regime is that EM essentially cannot distinguish between the cases $\theta^* = 0$ and $\theta^* \ne 0$. Therefore, instead of studying the contraction of the population EM operator toward $\theta^*$, we study its contraction toward 0. Given that insight, we have the following result on the norm of the population EM operator:

Lemma 6.
There exists a universal constant $c_u > 0$ such that

$$\|\theta\|\big(1 - \|\theta\|^2 - c_u\|\theta^*\|^2\big) \le \|M_{\mathrm{mlr}}(\theta)\| \le \|\theta\|\big(1 - \|\theta\|^2 + c_u\|\theta^*\|^2\big).$$

The proof of Lemma 6 is in Section D.1.1. The result of Lemma 6 shows that the contraction coefficient of the population operator $M_{\mathrm{mlr}}$ consists of two terms: the non-expansive term, which is of order $1 - O(\|\theta\|^2)$, and the quadratic term $\|\theta^*\|^2$ (up to some constant). Since we are in the low SNR regime, the contraction coefficient is close to 1. This demonstrates that the updates of the population EM operator suffer from a sub-linear convergence rate, instead of the geometric convergence rate of the high SNR regime.

From Lemma 2, we immediately have that

$$\sup_{\|\theta\| \le r}\|M_{n,\mathrm{mlr}}(\theta) - M_{\mathrm{mlr}}(\theta)\| \le cr\sqrt{d\log^2(n/\delta)/n},$$

for some universal constant $c > 0$. Let $\epsilon_n := C_3\sqrt{d\log^2(n/\delta)/n}$ for some sufficiently large absolute constant $C_3 > 0$. We assume that we start from the initialization region where $\|\theta\| \le \epsilon_n^{\alpha_0}$ for some $\alpha_0 \in [0, 1/2)$, and we say that the iterate is in the $l$-th epoch if $\epsilon_n^{\alpha_{l+1}} \le \|\theta\| \le \epsilon_n^{\alpha_l}$, for $l \ge 0$. We let $C_3 > 0$ be large enough that

$$\epsilon_n \ge c_u\|\theta^*\|^2 + 4\sup_{\theta \in B(\theta^*, r_l)}\|M_{n,\mathrm{mlr}}(\theta) - M_{\mathrm{mlr}}(\theta)\|/r_l, \quad \text{with } r_l = \epsilon_n^{\alpha_l}.$$

During this period, from Lemma 6 on the contraction of the population EM and Lemma 2 on the concentration of the finite-sample EM, we can check that

$$\|M_{n,\mathrm{mlr}}(\theta)\| \le \|\theta\| - \|\theta\|^3 + c_u\|\theta\|\|\theta^*\|^2 + \sup_{\theta' \in B(\theta^*, r_l)}\|M_{n,\mathrm{mlr}}(\theta') - M_{\mathrm{mlr}}(\theta')\| \le \|\theta\| - \epsilon_n^{\alpha_l + 1} + \frac{1}{4}\epsilon_n^{\alpha_l + 1},$$

where the last step uses $\|\theta\|^3 \ge \epsilon_n^{3\alpha_{l+1}} = \epsilon_n^{\alpha_l + 1}$, with the recursion (23) defined below. Note that this inequality is valid as long as $\epsilon_n^{\alpha_{l+1}} \le \|\theta\| \le \epsilon_n^{\alpha_l}$. Now we define the sequence $\alpha_l$ using the following recursion:

$$\alpha_{l+1} = \frac{1}{3}(\alpha_l + 1). \qquad (23)$$

The limit point of this recursion is $1/2$, which gives $\epsilon_n^{\alpha_\infty} \approx (d/n)^{1/4}$, as argued in the main text. Hence, during the $l$-th epoch we have

$$\|M_{n,\mathrm{mlr}}(\theta)\| \le \|\theta\| - \frac{3}{4}\epsilon_n^{\alpha_l + 1}.$$

Furthermore, the number of iterations required in the $l$-th epoch is

$$t_l := \big(\epsilon_n^{\alpha_l} - \epsilon_n^{\alpha_{l+1}}\big)\Big/\Big(\frac{3}{4}\epsilon_n^{\alpha_l + 1}\Big) \lesssim \epsilon_n^{-1}.$$

After leaving the $l$-th epoch, the iterate enters the $(l+1)$-th epoch, which can be analyzed in the same way. From this, we can conclude that after going through $l$ epochs in total, we have $\|\theta\| \le \epsilon_n^{\alpha_{l+1}}$; note that the number of EM iterations taken up to this point is of order $l\epsilon_n^{-1}$.

It is easy to check that $\alpha_l = (1/3^l)(\alpha_0 - 1/2) + 1/2$, so we can take $l_0 = C\log(1/\beta)$ for some universal constant $C$, such that $\alpha_{l_0} \ge 1/2 - \beta$ for arbitrarily small $\beta > 0$.
In conclusion, $\|\theta_n^t\| \le \epsilon_n^{1/2 - \beta} \le c\big(d\log^2(n/\delta)/n\big)^{1/4 - \beta/2}$ with high probability, as long as $t \ge \epsilon_n^{-1}l_0 \gtrsim \sqrt{n/d}\,\log(1/\beta)$, where $c$ is some universal constant. Hence we can set $\beta = C/\log(n/d)$ to get the desired result, $\|\theta_n^t\| \le c\big(d\log^2(n/\delta)/n\big)^{1/4}$. Since $\|\theta^*\| \le C\big(d\log^2(n/\delta)/n\big)^{1/4}$, this implies $\|\theta_n^t - \theta^*\| \le c'\big(d\log^2(n/\delta)/n\big)^{1/4}$, where $c'$ is some universal constant.

Note that we need a union bound of the concentration of the sample EM operators over all epochs $l = 1, \ldots, C\log(1/\beta)$, so that the argument holds for every epoch. For this purpose, we can replace $\delta$ by $\delta/\log(1/\beta)$. This does not change the order of $\epsilon_n$, hence the proof is complete.
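As a small sanity check (ours, not part of the proof), the epoch exponents from recursion (23) can be tabulated directly; they approach $1/2$ geometrically, which is what drives the final $(d/n)^{1/4}$ radius:

```python
# The epoch exponents alpha_l from recursion (23) approach 1/2 geometrically,
# so the epoch radius eps_n**alpha_l approaches eps_n**(1/2) ~ (d/n)**(1/4).
alpha = 0.0               # alpha_0 = 0 corresponds to a constant initial radius
for l in range(12):
    alpha = (alpha + 1.0) / 3.0
    print(l + 1, alpha)   # alpha_l = (alpha_0 - 1/2) / 3**l + 1/2 -> 1/2
```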
C Global Convergence of the (Easy) EM

This appendix gives the full proof of Theorem 2. We prove the result for bounded instances $\{\theta^* : \|\theta^*\| \le C_1\}$ for some universal constant $C_1 > 0$. The global convergence property of the (Easy-)EM algorithm is used to provide the initialization for Theorem 1, hence we focus on the iterations during which the estimator stays outside the initialization region. While we start with Easy-EM when $\cos\angle(\theta_n^0, \theta^*)$ is of order $O(1/\sqrt{d})$, note that we can safely switch back to the standard EM algorithm as soon as $\cos\angle(\theta_n^t, \theta^*)$ becomes $\Theta(1)$ (see Section 4 in [22] for more details).

C.1 Decreasing Norm with Large Initialization in Low SNR Regime
In the low SNR regime, we require that $\|\theta_n^0\| \le 0.2$. Here, when we initialize with a large norm, $\|\theta_n^0\| \ge 0.2$, we show that in a finite number of steps the iterate satisfies $\|\theta_n^t\| \le 0.2$. Recall that $\|\theta^*\| \ll 1$ in this regime. First, suppose $\|\theta\| \ge 4/3$. Then,

$$\|M_{\mathrm{mlr}}(\theta)\| \le \sup_{u \in S^{d-1}} \mathbb{E}\big[(X^\top\theta^*)(X^\top u)\tanh(YX^\top\theta)\big] + \mathbb{E}\big[Z(X^\top u)\tanh(YX^\top\theta)\big]$$
$$\le \sup_{u \in S^{d-1}} \sqrt{\mathbb{E}\big[(X^\top\theta^*)^2\big]\,\mathbb{E}\big[(X^\top u)^2\big]} + \mathbb{E}\big[|Z(X^\top u)|\big] \le \|\theta^*\| + 2/\pi,$$

where $Z \sim \mathcal{N}(0, 1)$ is such that $Y = X^\top\theta^* + Z$.
1) such that Y = X (cid:62) θ ∗ + Z . Since the uniform deviation in Easy-EM isgiven by Lemma 11 as (cid:113) d log ( n/δ ) /n , we can conclude that (cid:107) M n, mlr ( θ ) (cid:107) ≤ (cid:107) M mlr ( θ ) (cid:107) + O (cid:18)(cid:113) d log ( n/δ ) /n (cid:19) (cid:107) θ ∗ (cid:107) + 2 /π + O (cid:18)(cid:113) d log ( n/δ ) /n (cid:19) ≤ / . Next, suppose 0 . ≤ (cid:107) θ (cid:107) ≤ /
3. Following the notation in Appendix A, we recall equa-tion (9), M mlr ( θ ) = E [ yx tanh( yx (cid:107) θ (cid:107) )] v + E [ yx tanh( yx (cid:107) θ (cid:107) )] v , where y = X (cid:62) θ ∗ + z where z ∼ N (0 , x = X (cid:62) v and x = X (cid:62) v . We will see in AppendixD.1.1 that M mlr ( θ ) (cid:62) v ≤ (cid:107) θ (cid:107)(cid:107) θ ∗ (cid:107) ≤ c (cid:113) d log ( n/δ ) /n for some absolute constant c > a = 4, and define event E := { x + z ≤ a } . We expand M mlr ( θ ) as follows: M mlr ( θ ) (cid:62) v ≤ (cid:107) θ (cid:107) E [ y x E ] + E [ | yx | E c ] ≤ (cid:107) θ (cid:107) E [ z x E ] + E [ | zx | E c ] + O ( (cid:107) θ ∗ (cid:107) ) . By converting the above expression to Rayleigh distribution with x = r cos w, z = r sin w , wecan more explicitly find the values of the expectations in the above equation. That is, E [ z x E ] = 12 π (cid:90) π cos w sin wdw (cid:90) r exp( − r / dr ≈ − . , and E [ | zx | E c ] = 12 π (cid:90) π | cos w sin w | dw (cid:90) ∞ r exp( − r / dr ≤ . , Now using the condition that (cid:107) θ (cid:107) ≤ .
2, we have M mlr ( θ ) (cid:62) v ≤ (cid:107) θ (cid:107) (1 − . O ( (cid:107) θ ∗ (cid:107) ) ≤ γ (cid:107) θ (cid:107) + O ( (cid:107) θ ∗ (cid:107) ) , where γ = 0 . <
1. Since the deviation of finite-sample EM operator is in order (cid:113) d log ( n/δ ) /n ,we can conclude that (cid:107) M mlr ( θ ) (cid:107) ≤ γ (cid:107) θ (cid:107) + O (cid:18)(cid:113) d log ( n/δ ) /n + (cid:107) θ ∗ (cid:107) (cid:19) . Hence we can conclude that after t = O (1) iterations, (cid:107) θ tn (cid:107) ≤ . C.2 Angle Convergence in Middle-to-High SNR Regime
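The two truncated moments computed above in polar coordinates can be reproduced by direct quadrature. A minimal sketch (ours), with the truncation radius $a = 4$ as in the text:

```python
import numpy as np
from scipy import integrate

a = 4.0  # radius of the event E = {x1^2 + z^2 <= a^2}

# E[z^2 x1^2 1_E]: density (1/2pi) e^{-r^2/2} r dr dw, integrand r^4 cos^2 sin^2.
ang1, _ = integrate.quad(lambda w: np.cos(w)**2 * np.sin(w)**2, 0.0, 2*np.pi)
rad1, _ = integrate.quad(lambda r: r**5 * np.exp(-r**2 / 2), 0.0, a)
print(ang1 * rad1 / (2*np.pi))   # ~0.986

# E[|z x1| 1_{E^c}]: the complementary tail integral.
ang2, _ = integrate.quad(lambda w: abs(np.cos(w) * np.sin(w)), 0.0, 2*np.pi)
rad2, _ = integrate.quad(lambda r: r**3 * np.exp(-r**2 / 2), a, np.inf)
print(ang2 * rad2 / (2*np.pi))   # ~0.002
```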
C.2 Angle Convergence in Middle-to-High SNR Regime

Now we work in the regime where $\|\theta^*\| = \eta \ge c_\eta(d\log^2(n/\delta)/n)^{1/4}$ for some sufficiently large constant $c_\eta > 0$. We first focus on the convergence of the angle from random initialization. Let us denote $\alpha_t := \angle(\theta_n^t, \theta^*)$. Note that since we initialize with a random vector sampled uniformly from the unit sphere, $\cos\alpha_0 = O(1/\sqrt{d})$. We bring the following lemma from [22] on the change in angles for a fixed estimator $\theta_n^t$:

Lemma 7 (Theorem 8 in [22]). Let $\epsilon_f := c\max(1, \eta^{-1})\sqrt{d/n}$ be the statistical fluctuation, with some universal constant $c > 0$, in a one-step iteration of Easy-EM. Suppose the norm of the current estimator $\|\theta_n^t\|$ is larger than $\|\theta^*\|/10$. Then we have
$$\cos\alpha_{t+1} \ge \kappa_t(1 - \epsilon_f)\cos\alpha_t - \epsilon_f/\sqrt{d}, \qquad (24)$$
$$\sin\alpha_{t+1} \le \kappa'_t\sin\alpha_t + \epsilon_f, \qquad (25)$$
where
$$\kappa_t = \sqrt{1 + \frac{\sin^2\alpha_t}{\sin^2\alpha_t\cos^2\alpha_t + (1 + \eta^{-2})}} \ge 1, \qquad \kappa'_t = \Big(1 + \frac{\eta^2}{1+\eta^2}\cos^2\alpha_t\Big)^{-1} < 1.$$

The factor $\kappa_t$ comes from Theorem 2 in [22] on the convergence rate of the cosine values under the population EM operator. The key idea in the above lemma is that, when we bound the statistical error of the cosine value, we only need to bound the error in one fixed direction $u := \theta^*/\|\theta^*\|$, instead of over all directions in $R^d$ as would be required to bound the $\ell_2$ norm. More specifically, they show that
$$\Big|\frac{1}{n}\sum_i (X_i^\top u)Y_i\tanh(Y_iX_i^\top\theta) - M_{mlr}(\theta)^\top u\Big| \lesssim (1 + \|\theta^*\|^2)\sqrt{1/n} \lesssim (1 + \|\theta^*\|^2)\,\epsilon_f/\sqrt{d}.$$

Remark 2. [22] requires the sample-splitting scheme in which we draw a new batch of samples at every step. The main challenge when we try to remove sample-splitting is to show that the above argument holds uniformly for all $\theta: \|\theta\| \le r$ where $r = O(\max\{1, \|\theta^*\|\})$. For large $\|\theta^*\|$, getting the right order of uniform statistical error is challenging: discretization of $\theta$ results in an extra $\sqrt{d}$ factor, while the Ledoux–Talagrand type approach as in Lemma 11 results in an extra $O(\|\theta^*\|)$ factor. Therefore, here we show the result only for bounded instances with $\|\theta^*\| \le C$, and leave the analysis for arbitrarily large $\|\theta^*\|$ as future work.

Now we adopt their approach to work without sample-splitting and get the right order of sample complexity. First, when we work with bounded $\theta^*$, we can follow the steps in Lemma 11 while skipping the union bound over a covering of the unit sphere, since only the fixed direction $u$ is needed. This yields that
$$\sup_{\|\theta\| \le r}\Big|\frac{1}{n}\sum_i (X_i^\top u)Y_i\tanh(Y_iX_i^\top\theta) - M_{mlr}(\theta)^\top u\Big| \le cr\sqrt{\log^2(n/\delta)/n}, \qquad (26)$$
for an absolute constant $c > 0$. In the remainder of this subsection, we set $\epsilon_f := c\sqrt{d\log^2(n/\delta)/n}$.
The cosine value can be bounded as follows:
$$\cos\alpha_{t+1} = \frac{(\theta^*)^\top\theta_n^{t+1}}{\|\theta_n^{t+1}\|\|\theta^*\|} = \frac{u^\top(\theta_n^{t+1} - M_{mlr}(\theta_n^t))}{\|\theta_n^{t+1}\|} + \frac{u^\top M_{mlr}(\theta_n^t)}{\|M_{mlr}(\theta_n^t)\|}\cdot\frac{\|M_{mlr}(\theta_n^t)\|}{\|\theta_n^{t+1}\|}$$
$$\ge -\frac{\epsilon_f}{\sqrt{d}}\cdot\frac{r}{\|\theta_n^{t+1}\|} + \frac{u^\top M_{mlr}(\theta_n^t)}{\|M_{mlr}(\theta_n^t)\|}\cdot\frac{\|M_{mlr}(\theta_n^t)\|}{\|M_{mlr}(\theta_n^t)\| + r\epsilon_f} \ge \kappa_t\cos\alpha_t\Big(1 - \frac{r\epsilon_f}{\|M_{mlr}(\theta_n^t)\|}\Big) - \frac{\epsilon_f}{\sqrt{d}}\cdot\frac{r}{\|M_{mlr}(\theta_n^t)\| - r\epsilon_f},$$
where the last inequality comes from Theorem 2 in [22].

Finally, we need to show that $r/\|M_{mlr}(\theta_n^t)\| = O(1)$, so that we can take $\epsilon_f$ to be a sufficiently small absolute constant (that does not depend on $\eta$). We first need the following lemma on the norm of the next estimator:

Lemma 8. If $\|\theta\| \le \|\theta^*\|/10$, then
$$\|M_{mlr}(\theta)\| \ge \|\theta\|(1 + d_1\cdot\min\{1, \|\theta\|^2\}).$$
Otherwise, if $\|\theta\| \ge \|\theta^*\|/10$, we have
$$\|M_{mlr}(\theta)\| \ge \frac{\|\theta^*\|}{10}(1 + d_2\cdot\min\{1, \|\theta^*\|^2\}),$$
for some universal constants $d_1, d_2 > 0$.

We defer the proof of this lemma to Appendix D.4. We need the uniform concentration (26) for several values of $r = C_0, C_02^{-1}, \ldots, C_02^{-l+1}, C_02^{-l}$, where $C_0 = 3C$ and $l = O(\log(n/d))$. We can replace $\delta$ by $\delta/\log(n/d)$ for the union bound, which does not change the order of the statistical error. Pick $k$ such that $C_02^{-k} \le \|\theta_n^t\| \le C_02^{-k+1} = r$. When $\|\theta_n^t\| \le \|\theta^*\|/10$, we can apply Lemma 8 to see
$$r/\|M_{mlr}(\theta_n^t)\| \le C_02^{-k+1}/(C_02^{-k}) = 2,$$
where we used $r = C_02^{-k+1}$. Therefore $r/\|M_{mlr}(\theta_n^t)\| = O(1)$. On the other hand, if $\|\theta_n^t\| \ge \|\theta^*\|/10$, then we split into cases according to whether $\|\theta^*\| \ge 1/\max(3, c_5)$, where $c_5 > 0$ is the absolute constant in equation (31), i.e., $\|M_{mlr}(\theta)\| \ge \|\theta\|(1 - 3\|\theta\|^2) - c_5\|\theta\|\|\theta^*\|^2$. When $\|\theta^*\| \ge 1/\max(3, c_5)$ and $\|\theta_n^t\| \ge \|\theta^*\|/10$, by Lemma 8 we have
$$r/\|M_{mlr}(\theta_n^t)\| \le C'\max(3, c_5) = O(1)$$
for a universal constant $C'$, since all parameters here are universal constants. On the other hand, if $\|\theta^*\| \le 1/\max(3, c_5)$ and $\|\theta_n^t\| \ge \|\theta^*\|/10$, then from equation (31) we have
$$\|M_{mlr}(\theta)\| \ge \|\theta\|(1 - 3\|\theta\|^2) - c_5\|\theta\|\|\theta^*\|^2 \ge \|\theta\|/2.$$
Therefore, $r/\|M_{mlr}(\theta_n^t)\| \le C_02^{-k+1}/(C_02^{-k-1}) = 4 = O(1)$. From the above case study, we have that
$$\cos\alpha_{t+1} \ge \kappa_t\cos\alpha_t(1 - c_1\epsilon_f) - c_2\epsilon_f/\sqrt{d},$$
for some absolute constants $c_1, c_2 > 0$. Now observe that as long as $\sin\alpha_t > c_\alpha$, we have $\kappa_t \ge 1 + c_3\min\{1, \eta^2\}$ for some sufficiently small constants $c_\alpha, c_3 > 0$. Also, recall that we are considering the middle-to-high SNR regime, where $\eta \ge c_\eta(d\log^2(n/\delta)/n)^{1/4}$ for a sufficiently large constant $c_\eta > 0$, whereas $\epsilon_f \le c\sqrt{d\log^2(n/\delta)/n}$ for another fixed constant $c$; taking $c_\eta$ large enough relative to $c$, the fluctuation terms are dominated. Hence, as long as $\cos\alpha_t \ge 1/\sqrt{d}$, we have
$$\cos\alpha_{t+1} \ge (1 + c\min(1, \eta^2))\cos\alpha_t.$$
After $t = O(\eta^{-2}\log d)$ iterations starting from $\cos\alpha_0 = 1/\sqrt{d}$, we have $\cos\alpha_t \ge 0.95$, or equivalently $\sin\alpha_t \le 0.32$.
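The iteration count $t = O(\eta^{-2}\log d)$ can be read off by iterating the simplified one-step growth $\cos\alpha_{t+1} \ge (1 + c\min(1,\eta^2))\cos\alpha_t$. A small sketch (ours; the constant `c = 0.1` is an assumed placeholder, not from the paper):

```python
import numpy as np

def iterations_to_align(d, eta, c=0.1, target=0.95):
    """Iterate cos a_{t+1} = (1 + c*min(1, eta^2)) * cos a_t from 1/sqrt(d)."""
    cos_a, t = 1.0 / np.sqrt(d), 0
    rate = 1.0 + c * min(1.0, eta**2)
    while cos_a < target:
        cos_a = min(1.0, rate * cos_a)
        t += 1
    return t

# Iteration counts grow like eta^{-2} * log(d) as eta decreases:
for eta in (1.0, 0.3, 0.1):
    print(eta, iterations_to_align(d=100, eta=eta))
```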
C.3 Stability and Convergence in Middle-to-High SNR Regime after Alignment

In this subsection, we show how the alignment is stabilized and how the norm increases in case we start from a small initialization.

Sine stays below some threshold. Once $\theta_n^t$ and $\theta^*$ are well-aligned, using $\sin^2\alpha_t = 1 - \cos^2\alpha_t$, similar arguments can be applied to the sine values:
$$\sin\alpha_{t+1} \le (1 - c\min(1, \eta^2))\sin\alpha_t \quad \text{if } \sin\alpha_t \ge c_1; \qquad \sin\alpha_{t+1} \le c_1 \quad \text{if } \sin\alpha_t \le c_1,$$
for some absolute constants $c > 0$ and $0 < c_1 < 0.01$, given that $\cos\alpha_t > 0.95$.

Initialization from small estimators after alignment. Suppose the angle is aligned such that $\sin\alpha_t \le c_1$. We now see how fast $\|\theta_n^t\|$ enters the desired initialization region that Theorem 1 requires, when $\|\theta_n^t\| \le 0.9\|\theta^*\|$. Let us first consider the case $0.1\|\theta^*\| \le \|\theta_n^t\| \le 0.9\|\theta^*\|$. We recall Lemma 3, which gives
$$\|\theta^* - M_{mlr}(\theta_n^t)\| \le \kappa\|\theta_n^t - \theta^*\| + 16\kappa\sin\alpha_t\|\theta_n^t - \theta^*\|\frac{\eta^2}{1+\eta^2} \le \kappa\Big(1 + 16\sin\alpha_t\frac{\eta^2}{1+\eta^2}\Big)\|\theta_n^t - \theta^*\|,$$
where $\kappa \le 1 - c_0\min(1, \eta^2)$ for some absolute constant $c_0 > 0$. By taking the alignment threshold $c_1$ small enough relative to $c_0$, we have
$$\|\theta^* - M_{mlr}(\theta_n^t)\| \le (1 - c\min(1, \eta^2))\|\theta_n^t - \theta^*\|,$$
for some constant $c > 0$. Since we are in the regime $\eta \ge c_\eta(d\log^2(n/\delta)/n)^{1/4}$ for sufficiently large $c_\eta$, by appropriately setting the constants we have $\|M_{n,mlr}(\theta_n^t) - \theta^*\| \le (1 - c\min(1, \eta^2))\|\theta_n^t - \theta^*\|$ for some absolute constant $c > 0$, as long as we are in the region $0.1\|\theta^*\| \le \|\theta_n^t\| \le 0.9\|\theta^*\|$. Hence after $O(\max(1, \eta^{-2}))$ iterations, we reach the desired initialization region.

Now we consider the case $\|\theta\| \le 0.1\|\theta^*\|$. In this case, by Lemma 8, we can show that
$$\|M_{mlr}(\theta)\| \ge \|\theta\|(1 + c\min\{1, \|\theta\|^2, \|\theta^*\|^2\}),$$
for some universal constant $c > 0$. After $O(\max\{\|\theta^0\|^{-2}, \|\theta^*\|^{-2}\})$ iterations, we enter $\|\theta\| \ge \|\theta^*\|/10$. Note that when we start with $\|\theta_n^0\| = \Omega(1)$, $\|\theta_n^t\|$ stays above $\min\{\Omega(1), \|\theta^*\|/10\}$ throughout all iterations due to Lemma 8 and Lemma 7.
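The phases established in this appendix (angle alignment first, then norm stabilization and entry into the initialization region) can be observed in a toy simulation. A self-contained sketch (ours; the sample size, dimension, and SNR are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 20000, 10, 1.5
theta_star = eta * np.eye(d)[0]

X = rng.standard_normal((n, d))
Y = rng.choice([-1.0, 1.0], size=n) * (X @ theta_star) + rng.standard_normal(n)

theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)        # random initialization on the unit sphere

for t in range(41):
    theta = (X.T @ (Y * np.tanh(Y * (X @ theta)))) / n   # Easy-EM step
    if t % 10 == 0:
        cos = theta @ theta_star / (np.linalg.norm(theta) * eta)
        print(t, round(cos, 3), round(np.linalg.norm(theta), 3))
```

One should see the cosine climb first, with the norm growing toward $\|\theta^*\|$ only after alignment, in line with Lemma 7 and Lemma 8.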
D Deferred Lemmas

In this appendix, we collect the proofs of auxiliary lemmas that were postponed in the proofs of the main theorems: the contraction of the population EM operator under the middle and low SNR regimes, the uniform deviation of the finite-sample EM operator, and the lower bounds on the norm of the population EM operator.
D.1 Contraction of the Population EM Operator under Low SNR Regime
D.1.1 Proof of Lemma 6
We use the notation and definitions stated in Appendix A.
Upper Bound:
We first bound the first coordinate of the population operator from equation (9):
$$M_{mlr}(\theta)^\top v_1 = E_{x_1,x_2,y}[\tanh(yx_1\|\theta\|)x_1y].$$
We will expand the above equation using the Taylor series bound on $x\tanh(x)$:
$$x^2 - \frac{x^4}{3} \le x\tanh(x) \le x^2 - \frac{x^4}{3} + \frac{2x^6}{15}. \qquad (27)$$
Unfolding the equation above, we have
$$M_{mlr}(\theta)^\top v_1 = \frac{1}{\|\theta\|}E_{x_1,x_2,y}[\tanh(yx_1\|\theta\|)\cdot yx_1\|\theta\|] \le \frac{1}{\|\theta\|}E_{x_1,x_2,y}\Big[(yx_1\|\theta\|)^2 - \frac{(yx_1\|\theta\|)^4}{3} + \frac{2(yx_1\|\theta\|)^6}{15}\Big]$$
$$= \frac{1}{\|\theta\|}E_{x_1,x_2,z}\Big[\big(x_1\|\theta\|(z + x_1b_1^* + x_2b_2^*)\big)^2 - \frac{\big(x_1\|\theta\|(z + x_1b_1^* + x_2b_2^*)\big)^4}{3} + \frac{2\big(x_1\|\theta\|(z + x_1b_1^* + x_2b_2^*)\big)^6}{15}\Big],$$
where $z \sim N(0,1)$. Collecting the terms that do not involve $b_1^*, b_2^*$ and bounding the remaining cross terms, we get
$$M_{mlr}(\theta)^\top v_1 \le \frac{1}{\|\theta\|}E_{x_1,z}\Big[(x_1\|\theta\|z)^2 - \frac{(x_1\|\theta\|z)^4}{3} + \frac{2(x_1\|\theta\|z)^6}{15}\Big] + c\|\theta\|\|\theta^*\|^2 = \|\theta\|(1 - 3\|\theta\|^2 + 30\|\theta\|^4) + c\|\theta\|\|\theta^*\|^2, \qquad (28)$$
for some universal constant $c > 0$. Since we assumed $\|\theta\| < 0.2$, we have $3\|\theta\|^2 - 30\|\theta\|^4 \ge \|\theta\|^2$. We conclude that
$$M_{mlr}(\theta)^\top v_1 \le \|\theta\|(1 - \|\theta\|^2 + c\|\theta^*\|^2).$$
Then we bound the value in the second coordinate of the population operator:
$$M_{mlr}(\theta)^\top v_2 = E_{x_1,x_2,y}[\tanh(yx_1\|\theta\|)yx_2],$$
where $y|(x_1,x_2) \sim N(x_1b_1^* + x_2b_2^*, 1)$. Applying Stein's lemma (twice, as shown at the end of this subsection), we have
$$E[\tanh(yx_1\|\theta\|)yx_2] = b_2^*E\big[x_1^2\tanh(x_1\|\theta\|(z + x_1b_1^*)) - \|\theta\|b_1^*x_1^2\tanh'(x_1\|\theta\|(z + x_1b_1^*))\big], \qquad (29)$$
where $z \sim N(0, 1 + (b_2^*)^2)$, with $x_2b_2^*$ subsumed into the noise. From (29), we can check that
$$E[\tanh(yx_1\|\theta\|)yx_2] \le b_2^*E\big[x_1^2\tanh(x_1\|\theta\|(z + x_1b_1^*))\big] = \frac{b_2^*}{2}E\big[x_1^2\tanh(x_1\|\theta\|(z + x_1b_1^*)) + x_1^2\tanh(x_1\|\theta\|(-z + x_1b_1^*))\big]$$
$$\le b_2^*E\big[x_1^2\tanh(x_1^2\|\theta\|b_1^*)\big] \le \|\theta\|b_1^*b_2^*E[x_1^4] \le 2\|\theta\|\|\theta^*\|^2,$$
where we used $\tanh(a + x) + \tanh(a - x) \le 2\tanh(a)$ for any $a > 0$ and $x \in R$. From the above results, we have shown that
$$\|M_{mlr}(\theta)\| \le |M_{mlr}(\theta)^\top v_1| + |M_{mlr}(\theta)^\top v_2| \le \|\theta\|\big(1 - \|\theta\|^2 + c\|\theta^*\|^2\big), \qquad (30)$$
for some universal constant $c > 0$.
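Both the Taylor envelope (27) and the product-Gaussian moments behind (28) can be verified directly; a minimal sketch (ours):

```python
import numpy as np

# (27): x^2 - x^4/3 <= x*tanh(x) <= x^2 - x^4/3 + 2x^6/15, checked on a grid.
x = np.linspace(-3.0, 3.0, 2001)
f = x * np.tanh(x)
assert np.all(x**2 - x**4 / 3 <= f + 1e-12)
assert np.all(f <= x**2 - x**4 / 3 + 2 * x**6 / 15 + 1e-12)

# Moments used in (28): for x1, z iid N(0,1), E[(x1 z)^2] = 1,
# E[(x1 z)^4] = 9, E[(x1 z)^6] = 225, which yield the coefficients in
# 1 - 3||theta||^2 + 30||theta||^4 (via 9/3 = 3 and 2*225/15 = 30).
u = np.prod(np.random.default_rng(1).standard_normal((2, 10**7)), axis=0)
print((u**2).mean(), (u**4).mean(), (u**6).mean())   # ~1, ~9, ~225
```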
Lower Bound: To prove the lower bound on the population EM operator, we again expand the equation using the Taylor series (27):
$$\|M_{mlr}(\theta)\| \ge |M_{mlr}(\theta)^\top v_1| \ge \|\theta\|(1 - 3\|\theta\|^2) - c\|\theta\|\|\theta^*\|^2. \qquad (31)$$
The result follows immediately with some absolute constant $c > 0$.

Proof of equation (29): For the left-hand side, we apply Stein's lemma with respect to $x_2$. It gives that
$$E[\tanh(\|\theta\|x_1y)yx_2] = E\Big[\frac{d}{dx_2}\big(\tanh(\|\theta\|x_1y)\,y\big)\Big] = E\Big[\frac{d}{dx_2}\big(\tanh(\|\theta\|x_1(\bar z + x_1b_1^* + x_2b_2^*))(\bar z + x_1b_1^* + x_2b_2^*)\big)\Big]$$
$$= E\big[b_2^*\tanh(\|\theta\|x_1(\bar z + x_1b_1^* + x_2b_2^*)) + \|\theta\|x_1b_2^*(\bar z + x_1b_1^* + x_2b_2^*)\tanh'(\|\theta\|x_1(\bar z + x_1b_1^* + x_2b_2^*))\big]$$
$$= b_2^*E\big[\tanh(\|\theta\|x_1(z + x_1b_1^*)) + \|\theta\|x_1(z + x_1b_1^*)\tanh'(\|\theta\|x_1(z + x_1b_1^*))\big],$$
where $\bar z \sim N(0,1)$ and $z \sim N(0, 1 + (b_2^*)^2)$. For the right-hand side, we apply Stein's lemma with respect to $x_1$. First, we check for the first term of the right-hand side that
$$E[x_1^2\tanh(\|\theta\|x_1(z + x_1b_1^*))] = E\Big[\frac{d}{dx_1}\big(x_1\tanh(\|\theta\|x_1(z + x_1b_1^*))\big)\Big] = E\Big[\tanh(\|\theta\|x_1(z + x_1b_1^*)) + x_1\frac{d}{dx_1}\tanh(\|\theta\|x_1(z + x_1b_1^*))\Big]$$
$$= E\big[\tanh(\|\theta\|x_1(z + x_1b_1^*)) + \|\theta\|x_1(z + 2x_1b_1^*)\tanh'(\|\theta\|x_1(z + x_1b_1^*))\big].$$
Plugging this into (29) and subtracting the remaining term gives the result that matches the left-hand side.
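The Stein's lemma manipulations above are easy to spot-check by Monte Carlo, using $E[xf(x)] = E[f'(x)]$ for $x \sim N(0,1)$. A small sketch (ours; the values for $b_1^*$, $b_2^*$, and $\|\theta\|$ are arbitrary test points):

```python
import numpy as np

rng = np.random.default_rng(2)
b1, b2, t = 0.7, 0.4, 0.3                 # test values for b1*, b2*, ||theta||
x1, x2, zb = rng.standard_normal((3, 10**7))
y = zb + b1 * x1 + b2 * x2

lhs = np.mean(np.tanh(t * x1 * y) * y * x2)
# Stein w.r.t. x2 applied to g(x2) = tanh(t x1 y) * y, with dy/dx2 = b2:
rhs = np.mean(b2 * np.tanh(t * x1 * y)
              + t * x1 * b2 * y / np.cosh(t * x1 * y) ** 2)
print(lhs, rhs)    # the two estimates agree up to Monte Carlo error
```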
D.2 Contraction of the Population EM Operator under Middle SNR Regime
In this appendix, we provide the proofs for the contraction of the population EM operator under the middle SNR regime.
D.2.1 Proof of Corollary 1
In Lemma 3, note that $\kappa \le 1 - \min\{\|\theta\|^2, \|\theta^*\|^2\}/(\|\theta^*\|^2 + 1)$ and $(\|\theta^*\|\sin\alpha)^2 < \|\theta - \theta^*\|^2$, where $\sin\alpha < 1/10$. Therefore, whenever $\|\theta^*\| \ge 1$, with the initialization condition $\|\theta\| \ge 0.9\|\theta^*\|$ we get $\kappa \le 1 - 0.81\|\theta^*\|^2/(\|\theta^*\|^2 + 1) \le 0.6$, and hence
$$\|M_{mlr}(\theta) - \theta^*\| \le \kappa(1 + 16\sin^2\alpha)\|\theta - \theta^*\| \le 0.6 \times 1.16\,\|\theta - \theta^*\| \le 0.9\|\theta - \theta^*\|,$$
which completes the proof.
D.2.2 Proof of Corollary 2

From Lemma 3, note that $\eta^2/(1 + \eta^2) \le \eta^2 = \|\theta^*\|^2$. We again use $\kappa \le 1 - \min\{\|\theta\|^2, \|\theta^*\|^2\}/(\|\theta^*\|^2 + 1)$, $(\|\theta^*\|\sin\alpha)^2 < \|\theta - \theta^*\|^2$, and $\sin\alpha < 1/10$. With the initialization condition $\|\theta\| \ge 0.9\|\theta^*\|$, we have
$$\|M_{mlr}(\theta) - \theta^*\| \le \Big(1 - \frac{\|\theta^*\|^2}{2}\Big)\|\theta - \theta^*\| + \frac{1}{8}\|\theta^*\|^2\|\theta - \theta^*\| \le \Big(1 - \frac{\|\theta^*\|^2}{4}\Big)\|\theta - \theta^*\|.$$
D.3 Uniform deviation of finite-sample EM operator: Proof of Lemma 2

Proof. Let us assume that $n \ge Cd$ for a sufficiently large constant $C > 0$. To simplify the notation, we write $\hat\Sigma_n = \frac{1}{n}\sum_i X_iX_i^\top$. Observe that
$$\|M_{n,mlr}(\theta) - M_{mlr}(\theta)\| \le |||\hat\Sigma_n^{-1}|||_{op}\,\Big\|\frac{1}{n}\sum_{i=1}^n Y_iX_i\tanh(Y_iX_i^\top\theta) - M_{mlr}(\theta)\Big\| + |||\hat\Sigma_n^{-1} - I|||_{op}\,\|M_{mlr}(\theta)\|.$$
The first term can be bounded by $c_1r\sqrt{d\log^2(n/\delta)/n}$ with some absolute constant $c_1 > 0$ by Lemma 11, and
$$|||\hat\Sigma_n^{-1} - I|||_{op} \le |||\hat\Sigma_n^{-1}|||_{op}\,|||\hat\Sigma_n - I|||_{op} \le c_2\sqrt{d/n}$$
for some universal constant $c_2 > 0$. If we can show that $\|M_{mlr}(\theta)\| \le O(r)$, then we are done. To see this, first we check that
$$\|M_{mlr}(\theta)\| = \|E[YX\tanh(YX^\top\theta)]\| \le \|\theta\|\,|||E[Y^2XX^\top]|||_{op}.$$
It is easy to check that $E[Y^2XX^\top] = (1 + \|\theta^*\|^2)I + 2\theta^*\theta^{*\top}$, hence $|||E[Y^2XX^\top]|||_{op} = 1 + 3\|\theta^*\|^2 \le 1 + 3C^2 = O(1)$. Therefore, $\|M_{mlr}(\theta)\| \le c_3\|\theta\| \le c_3r$ with the constant $c_3 = 1 + 3C^2$. This completes the proof of Lemma 2.
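The second-moment identity used in the proof (and again in Lemma 13) can be confirmed numerically; a quick Monte Carlo sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10**6, 4
theta_star = np.array([0.8, 0.0, 0.0, 0.0])

X = rng.standard_normal((n, d))
Y = rng.choice([-1.0, 1.0], size=n) * (X @ theta_star) + rng.standard_normal(n)

emp = (X * (Y**2)[:, None]).T @ X / n      # (1/n) sum_i Y_i^2 x_i x_i^T
target = ((1 + theta_star @ theta_star) * np.eye(d)
          + 2 * np.outer(theta_star, theta_star))
print(np.abs(emp - target).max())          # ~0: E[Y^2 X X^T] matches the identity
```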
This lemma is in fact a more refined version of Lemma 23 in [22], where a lower bound on the norms is derived for the same purpose; we give a sharper statement here. Let $\alpha = \angle(\theta, \theta^*)$. We use the notation defined in Appendix A, and recall that $b_1^* = \|\theta^*\|\cos\alpha$ and $b_2^* = \|\theta^*\|\sin\alpha$. We consider the same cases as in [22].

Case (i): $\cos\alpha \le 0.2$. In this case we essentially give a norm bound as if $\cos\alpha = 0$. Suppose that $\|\theta\| \le \|\theta^*\|/10$. We can first check that
$$\|M_{mlr}(\theta)\| \ge |M_{mlr}(\theta)^\top v_1| = E_{x_1,x_2,y}[\tanh(yx_1\|\theta\|)yx_1] = E_{x_1,x_2,z}[\tanh((x_1b_1^* + x_2b_2^* + z)x_1\|\theta\|)(x_1b_1^* + x_2b_2^* + z)x_1],$$
where $x_1, x_2, z \sim N(0,1)$ are independent. Since $\cos\alpha$ is small, it suffices to consider the $b_1^* = 0$ case (see Lemma 23 in [22] for details):
$$E_{x_1,x_2,z}[\tanh((x_2b_2^* + z)x_1\|\theta\|)(x_2b_2^* + z)x_1] = E_{x_1,\bar z}[\tanh(\bar zx_1\|\theta\|)\bar zx_1],$$
where $\bar z \sim N(0, 1 + (b_2^*)^2) = N(0, \sigma^2)$. We can lower bound this quantity as
$$E_{x_1,\bar z}[\tanh(\bar zx_1\|\theta\|)\bar zx_1] = \sigma E_{x_1,z}[\tanh(\sigma zx_1\|\theta\|)zx_1] \ge \sigma E_{x_1,z}[\tanh(zx_1\|\theta\|)zx_1].$$
If $\|\theta\| > 0.5$, then through numerical integration we can check that $E_{x_1,z}[\tanh(0.5\,zx_1)zx_1] > 1/\pi$. Hence, we immediately have that
$$|M_{mlr}(\theta)^\top v_1| \ge \frac{\sigma}{\pi} \ge \frac{\sin\alpha}{\pi}\|\theta^*\| \ge \frac{\|\theta^*\|}{4},$$
since $\sin\alpha > 0.97$. As $\|\theta\| \le \|\theta^*\|/10$, clearly we then have
$$\|M_{mlr}(\theta)\| \ge \|\theta\|(1 + 1\cdot\min(1, \|\theta\|^2)).$$
If $\|\theta\| < 0.5$, then we get a lower bound using the Taylor expansion (27):
$$E_{x_1,\bar z}[\tanh(\bar zx_1\|\theta\|)\bar zx_1] \ge E[(\bar zx_1)^2]\|\theta\| - \frac{E[(\bar zx_1)^4]}{3}\|\theta\|^3 = \sigma^2\|\theta\|(1 - 3\sigma^2\|\theta\|^2),$$
where $\sigma^2 = 1 + \sin^2\alpha\,\eta^2 \ge 1 + 0.9\eta^2$ and $\|\theta^*\| = \eta$. Here, we consider three cases: $\eta \ge 5$, $5 \ge \eta \ge 1$, and $1 \ge \eta$. When $\eta \ge 5$, we get $|M_{mlr}(\theta)^\top v_1| \ge 1.2\|\theta\|$. In the case $5 \ge \eta \ge 1$, we first note that since $\|\theta\| \le \|\theta^*\|/10$, it suffices to check the value of the factor $\sigma^2(1 - 3\sigma^2\|\theta\|^2)$ over this range; using the monotonicity of $M_{mlr}(\theta)^\top v_1$ in $\|\theta\|$, one can again check numerically that this factor is at least $1.25$. Finally, when $\eta \le 1$, simple algebra shows that
$$\sigma^2\|\theta\|(1 - 3\sigma^2\|\theta\|^2) \ge \|\theta\|(1 + 0.1\,\eta^2).$$
Combining all of the above, we can conclude that when $\|\theta\| \le \|\theta^*\|/10$,
$$\|M_{mlr}(\theta)\| \ge \|\theta\|(1 + 0.1\cdot\min(1, \|\theta^*\|^2)) \ge \|\theta\|(1 + 0.1\cdot\min(1, \|\theta\|^2)).$$
Now note that $M_{mlr}(\theta)^\top v_1$ increases in $\|\theta\|$, hence for all $\|\theta\| \ge \|\theta^*\|/10$ it holds that
$$\|M_{mlr}(\theta)\| \ge \frac{\|\theta^*\|}{10}\big(1 + 0.1\cdot\min(1, \|\theta^*\|^2)\big).$$
Case (ii): $\cos\alpha \ge 0.2$. Again, we only consider the case $\|\theta\| \le \|\theta^*\|/10$, since the other case follows immediately as before. The claim here is that
$$|M_{mlr}(\theta)^\top v_1| \ge \min\big(\sigma\|\theta\|,\ b_1^*\big).$$
Hence we consider two cases, according to whether $\sigma\|\theta\| = (1 + \eta^2\sin^2\alpha)^{1/2}\|\theta\| \le b_1^* = \|\theta^*\|\cos\alpha$ or not.

In the first case, when $\sigma\|\theta\| \le b_1^*$, it can be shown (see equation (50) in [22] for details) that
$$b_1^* - M_{mlr}(\theta)^\top v_1 \le \kappa\,(b_1^* - \sigma\|\theta\|),$$
where $\kappa \le (1 + (b_1^*)^2)^{-1/2} < 1$. Rearranging this inequality, and using $b_1^* \ge 0.2\|\theta^*\| \ge 2\|\theta\|$, we have
$$M_{mlr}(\theta)^\top v_1 \ge \|\theta^*\|(1 - \kappa)\cos\alpha + \kappa(1 + \eta^2\sin^2\alpha)^{1/2}\|\theta\| \ge 2\|\theta\|(1 - \kappa) + \kappa\|\theta\| \ge \|\theta\| + (1 - \kappa)\|\theta\|.$$
Note that $1 - \kappa \ge c_1\min(1, (b_1^*)^2)$ for some constant $c_1 > 0$; together with $b_1^* \ge 2\|\theta\|$, this gives
$$\|M_{mlr}(\theta)\| \ge \|\theta\|(1 + c_1\cdot\min(1, \|\theta\|^2)).$$
On the other side, if $\sigma\|\theta\| \ge b_1^*$, then we immediately have
$$M_{mlr}(\theta)^\top v_1 \ge b_1^* \ge \|\theta^*\|/5 \ge \frac{\|\theta^*\|}{10}\big(1 + 1\cdot\min(1, \|\theta^*\|^2)\big) \ge \|\theta\|\big(1 + 1\cdot\min(1, \|\theta\|^2)\big).$$
Now, similarly to Case (i), since $M_{mlr}(\theta)^\top v_1$ is increasing in $\|\theta\|$, when $\|\theta\| \ge \|\theta^*\|/10$ we have
$$\|M_{mlr}(\theta)\| \ge \frac{\|\theta^*\|}{10}\big(1 + c_2\cdot\min(1, \|\theta^*\|^2)\big),$$
where $c_2 = c_1/2$.
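The numerical-integration claim in Case (i), $E_{x_1,z}[\tanh(0.5\,zx_1)zx_1] > 1/\pi$, can be reproduced with a short Monte Carlo estimate (a sketch, ours):

```python
import numpy as np

rng = np.random.default_rng(4)
u = np.prod(rng.standard_normal((2, 10**7)), axis=0)   # u = z * x1
print(np.mean(u * np.tanh(0.5 * u)), 1 / np.pi)        # estimate vs 1/pi ~ 0.318
```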
E Concentration of Measures in Finite-Sample EM

In all lemmas that follow, we assume that $n \ge Cd$ for a sufficiently large constant $C > 0$, so that the tail probability of a sum of $n$ independent sub-exponential random variables is in the sub-Gaussian decay regime.

Lemma 9. Suppose $X \sim N(0, I_d)$ and $Y|X \sim \frac{1}{2}N(X^\top\theta^*, 1) + \frac{1}{2}N(-X^\top\theta^*, 1)$. Then, with probability at least $1 - \delta$,
$$\Big|\frac{1}{n}\sum_{i=1}^n Y_i^2 - (1 + \|\theta^*\|^2)\Big| = O\Big((\|\theta^*\|^2 + 1)\sqrt{\ln(1/\delta)/n}\Big), \qquad (32)$$
$$\Big|\Big|\Big|\,\frac{1}{n}\sum_{i=1}^n X_iX_i^\top - I\,\Big|\Big|\Big|_{op} = O\Big(\sqrt{d\ln(1/\delta)/n}\Big). \qquad (33)$$
The above are standard concentration results for Gaussian distributions.
Lemma 10. Let $X, Y$ be the random variables as in Lemma 9. With probability at least $1 - \delta$, we have
$$\Big|\Big|\Big|\,\frac{1}{n}\sum_{i=1}^n Y_i^2X_iX_i^\top - E[Y^2XX^\top]\,\Big|\Big|\Big|_{op} = O\Big((\|\theta^*\|^2 + 1)\sqrt{\frac{d\ln^2(n/\delta)}{n}}\Big). \qquad (34)$$

Proof. Let $\nu_i$ be an independent Rademacher variable and $Z_i \sim N(0,1)$, so that $Y_i = \nu_iX_i^\top\theta^* + Z_i$. We use a truncation argument for the concentration of the higher-order moments. First define the good event
$$\mathcal{E} := \{\forall i \in [n],\ |Z_i| \le \tau_1,\ |X_i^\top\theta^*| \le \tau_2\}.$$
We will decide the order of $\tau_1, \tau_2$ later such that $P(\mathcal{E}) \ge 1 - \delta$. Let $\tilde Y \sim Y|\mathcal{E}$, $\tilde X \sim X|\mathcal{E}$, and let $(\tilde Y_i, \tilde X_i)$ be independent samples of $(\tilde Y, \tilde X)$. It is easy to check that $\tilde Y\tilde X$ is a sub-Gaussian vector with Orlicz norm $O(\tau_1 + \tau_2)$ [30]. To see this,
$$\|\tilde Y\tilde X\|_{\psi_2} = \sup_{u \in S^{d-1}}\sup_{p \ge 1} p^{-1/2}E[|Y(X^\top u)|^p\,|\,\mathcal{E}]^{1/p} \qquad (35)$$
$$\le (\tau_1 + \tau_2)\sup_{u \in S^{d-1}}\sup_{p \ge 1} p^{-1/2}E[|X^\top u|^p 1_{\mathcal{E}}]^{1/p}/P(\mathcal{E})^{1/p} \qquad (36)$$
$$\le 2(\tau_1 + \tau_2)K, \qquad (37)$$
for some universal constant $K > 0$, where we used the fact that the $p$-th moment of a Gaussian is $O((2p)^{p/2})$ and $P(\mathcal{E}) \ge 1 - \delta \ge 1/2$. Now we decompose the probability as follows:
$$P\Big(|||\tfrac{1}{n}\textstyle\sum_i Y_i^2X_iX_i^\top - E[Y^2XX^\top]|||_{op} \ge t\Big) \le \underbrace{P\Big(|||\tfrac{1}{n}\textstyle\sum_i \tilde Y_i^2\tilde X_i\tilde X_i^\top - E[\tilde Y^2\tilde X\tilde X^\top]|||_{op} \ge t/2\Big)}_{(a)} + \underbrace{P\Big(|||E[\tilde Y^2\tilde X\tilde X^\top] - E[Y^2XX^\top]|||_{op} \ge t/2\Big)}_{(b)} + \underbrace{P(\mathcal{E}^c)}_{(c)}.$$
We can use a standard concentration-of-measure result for random matrices to bound (a), given that $n \ge Cd$: it is at most $2\exp\big(-nt^2/(C(\tau_1 + \tau_2)^4) + C'd\big)$ for some constants $C, C' > 0$. The bound for (c) is given by $2n\exp(-\tau_1^2/2)$, hence we set
$$\tau_1 = \Theta\big(\sqrt{\log(n/\delta)}\big), \qquad \tau_2 = \|\theta^*\|\tau_1.$$
Finally, for (b), we first note that $E[Y^2XX^\top] = E[\tilde Y^2\tilde X\tilde X^\top]P(\mathcal{E}) + E[Y^2XX^\top 1_{\mathcal{E}^c}]$. Rearranging the terms,
$$|||E[\tilde Y^2\tilde X\tilde X^\top] - E[Y^2XX^\top]|||_{op} \le |||E[\tilde Y^2\tilde X\tilde X^\top]|||_{op}P(\mathcal{E}^c) + \sqrt{\sup_{u \in S^{d-1}}E[Y^4(X^\top u)^4]}\sqrt{P(\mathcal{E}^c)}$$
$$\le (\tau_1 + \tau_2)^2\,2n\exp(-\tau_1^2/2) + 3(\tau_1 + \tau_2)^2\sqrt{2n}\exp(-\tau_1^2/4) \le \sqrt{1/n}.$$
We can set $t = O\big((\|\theta^*\|^2 + 1)\sqrt{d\log^2(n/\delta)/n}\big)$ to get the desired result.
Lemma 11. Let $X, Y$ be the random variables as in Lemma 9. Suppose $\|\theta^*\| \le C$ for some universal constant $C > 0$. Then for any given $r > 0$, with probability at least $1 - \delta$, we have
$$\sup_{\theta: \|\theta\| \le r}\Big\|\frac{1}{n}\sum_{i=1}^n Y_iX_i\tanh\big(Y_iX_i^\top\theta\big) - M_{mlr}(\theta)\Big\| \le cr\sqrt{\frac{d\ln^2(n/\delta)}{n}}, \qquad (38)$$
for some universal constant $c > 0$.

Proof. We start with the standard discretization argument for bounding the concentration of measures in the $\ell_2$ norm. Let $Z(\theta) := \frac{1}{n}\sum_{i=1}^n Y_iX_i\tanh(Y_iX_i^\top\theta) - M_{mlr}(\theta)$. The standard symmetrization argument gives [29, 31]
$$P\Big(\sup_{\|\theta\| \le r}\|Z(\theta)\| \ge t\Big) \le 4P\Big(\sup_{\|\theta\| \le r}\Big\|\frac{1}{n}\sum_{i=1}^n\varepsilon_iY_iX_i\tanh\big(Y_iX_i^\top\theta\big)\Big\| \ge t/2\Big), \qquad (39)$$
where the $\varepsilon_i$ are independent Rademacher random variables. We define a good event $\mathcal{E} := \{\forall i \in [n],\ |Y_i| \le \tau,\ |X_i^\top\theta^*| \le C\tau\}$ as before, where $\tau = \Theta(\sqrt{\log(n/\delta)})$. Then the probability on the right-hand side of (39) can be decomposed as
$$P\Big(\sup_{\|\theta\| \le r}\Big\|\frac{1}{n}\sum_{i=1}^n\varepsilon_iY_iX_i\tanh\big(Y_iX_i^\top\theta\big)\Big\| \ge t/2\,\Big|\,\mathcal{E}\Big) + P(\mathcal{E}^c).$$
For the Chernoff bound with some $\lambda > 0$, we are interested in bounding the quantity
$$E\Big[\exp\Big(\sup_{\|\theta\| \le r}\frac{\lambda}{n}\Big\|\sum_{i=1}^n\varepsilon_iY_iX_i\tanh\big(Y_iX_i^\top\theta\big)\Big\|\Big)\,\Big|\,\mathcal{E}\Big].$$
In the following, it will be convenient to write $f_i(\theta) := \tanh(|Y_i|X_i^\top\theta)$. First, we use the discretization argument to remove the $\ell_2$ norm inside the expectation:
$$E\Big[\exp\Big(\sup_{\|\theta\| \le r}\frac{\lambda}{n}\Big\|\sum_i\varepsilon_iY_iX_i\tanh(Y_iX_i^\top\theta)\Big\|\Big)\Big|\mathcal{E}\Big] \le E\Big[\exp\Big(\sup_{u \in S^{d-1}}\sup_{\|\theta\| \le r}\frac{\lambda}{n}\sum_i\varepsilon_iY_i(X_i^\top u)\tanh(Y_iX_i^\top\theta)\Big)\Big|\mathcal{E}\Big]$$
$$\le E\Big[\exp\Big(\sup_{j \in [M]}\sup_{\|\theta\| \le r}\frac{2\lambda}{n}\sum_i\varepsilon_iY_i(X_i^\top u_j)\tanh(Y_iX_i^\top\theta)\Big)\Big|\mathcal{E}\Big] \le \sum_{j=1}^M E\Big[\exp\Big(\sup_{\|\theta\| \le r}\frac{2\lambda}{n}\sum_i\varepsilon_iY_i(X_i^\top u_j)\tanh(Y_iX_i^\top\theta)\Big)\Big|\mathcal{E}\Big],$$
where $M$ is the $1/2$-covering number of the unit sphere and $\{u_1, \ldots, u_M\}$ is the corresponding covering set.

Now for each $u_j$, we can apply the Ledoux–Talagrand contraction lemma [23], since $|f_i(\theta_1) - f_i(\theta_2)| \le |Y_i||X_i^\top\theta_1 - X_i^\top\theta_2|$ for $\theta_1, \theta_2 \in B(0, r)$:
$$E\Big[\exp\Big(\sup_{\|\theta\| \le r}\frac{2\lambda}{n}\sum_i\varepsilon_iY_i(X_i^\top u_j)\tanh(Y_iX_i^\top\theta)\Big)\Big|\mathcal{E}\Big] = E\Big[\exp\Big(\sup_{\|\theta\| \le r}\frac{2\lambda}{n}\sum_i\varepsilon_i|Y_i|(X_i^\top u_j)\tanh(|Y_i|X_i^\top\theta)\Big)\Big|\mathcal{E}\Big]$$
$$\le E\Big[\exp\Big(\sup_{\|\theta\| \le r}\frac{4\lambda}{n}\sum_i\varepsilon_iY_i^2(X_i^\top\theta)(X_i^\top u_j)\Big)\Big|\mathcal{E}\Big] \le E\Big[\exp\Big(\frac{4\lambda r}{n}\sum_i\varepsilon_iY_i^2(X_i^\top v)(X_i^\top u_j)\Big)\Big|\mathcal{E}\Big], \qquad (40)$$
where we define $v := \theta/\|\theta\|$. We have already seen in (35) that $Y_i(X_i^\top u_j)|\mathcal{E}$ is sub-Gaussian with Orlicz norm $O(\tau(1 + \|\theta^*\|)) = O(\tau)$. Since the product of two sub-Gaussian variables is sub-exponential, it follows that $Y_i^2(X_i^\top v)(X_i^\top u)|\mathcal{E}$ is sub-exponential with Orlicz norm $O(\tau^2)$ [30]. Now we need the following lemma on the exponential moment of sub-exponential random variables from [30].

Lemma 12 (Lemma 5.15 in [30]). Let $X$ be a centered sub-exponential random variable. Then, for $t$ such that $|t| \le c/\|X\|_{\psi_1}$, one has
$$E[\exp(tX)] \le \exp\big(Ct^2\|X\|_{\psi_1}^2\big),$$
for some universal constants $c, C > 0$.

Finally, note that $\varepsilon_iY_i^2(X_i^\top v)(X_i^\top u)$ is a centered sub-exponential random variable with the same Orlicz norm. Equipped with the lemma, we can obtain
$$E\Big[\exp\Big(\frac{\lambda r}{n}\sum_{i=1}^n\varepsilon_iY_i^2(X_i^\top v)(X_i^\top u)\Big)\Big|\mathcal{E}\Big] \le \exp(C\lambda^2r^2\tau^4/n), \qquad \forall\,|\lambda r/n| \le c/\tau^2,$$
which yields
$$E\Big[\exp\Big(\sup_{\|\theta\| \le r}\frac{\lambda}{n}\Big\|\sum_{i=1}^n\varepsilon_iY_iX_i\tanh(Y_iX_i^\top\theta)\Big\|\Big)\Big|\mathcal{E}\Big] \le \exp\big(C\lambda^2r^2\tau^4/n + C'd\big), \qquad \forall\,|\lambda| \le n/(c\tau^2r),$$
where we used $\log M = O(d)$, with some constants $C, C', c > 0$. Combining all of the above, we have that
$$P\Big(\sup_{\|\theta\| \le r}\|Z(\theta)\| \ge t\Big) \le \exp\big(C_1\lambda^2r^2\tau^4/n + C_2d - \lambda t/2\big) + P(\mathcal{E}^c).$$
From here, we can optimize over $\lambda = O(tn/(r^2\tau^4))$ with the setting $t = O\big(r\sqrt{d\tau^4/n}\big)$. Since $t = O\big(r\sqrt{d\log^2(n/\delta)/n}\big)$, this concludes the proof.
F Supplementary Results

In this appendix, we collect an additional result clarifying the initialization in Theorem 1, and the proof of the super-linear convergence of the population EM operator in the very high SNR regime.
F.1 Initialization with Spectral Methods
Lemma 13.
Let $M = \frac{1}{n}\sum_{i=1}^n Y_i^2X_iX_i^\top - \big(\frac{1}{n}\sum_{i=1}^n Y_i^2\big)I$, where $X, Y$ are as given in Lemma 9, and let the largest eigenvalue and the corresponding eigenvector of $M$ be $(\lambda_1, v_1)$. Then there exist universal constants $c_1, c_2 > 0$ such that
$$|\lambda_1 - 2\|\theta^*\|^2| \le c_1(\|\theta^*\|^2 + 1)\sqrt{\frac{d\log^2(n/\delta)}{n}}.$$
Furthermore, if $\|\theta^*\| \ge c_2(d\log^2(n/\delta)/n)^{1/4}$, then
$$\sin\angle(v_1, \theta^*) \le c_1\Big(\frac{\|\theta^*\|^2 + 1}{\|\theta^*\|^2}\Big)\sqrt{\frac{d\log^2(n/\delta)}{n}} \le \frac{1}{10}.$$

Proof. The lemma is a direct consequence of Lemma 9, Lemma 10 and matrix perturbation theory [31]. Note that $E[Y_i^2X_iX_i^\top] = (1 + \|\theta^*\|^2)I + 2\theta^*\theta^{*\top}$ (e.g., see Lemma 1 in [37]), so that $E[M] \approx 2\theta^*\theta^{*\top}$.

The above lemma states that when $\|\theta^*\|$ is not too small, we can always start from a well-initialized point that is well aligned with the ground truth $\theta^*$. In the low SNR regime where $\|\theta^*\| \lesssim (d/n)^{1/4}$, we cannot guarantee such an alignment with $\theta^*$, since the eigenvector is perturbed too much. However, the largest eigenvalue can still serve as an indicator that $\|\theta^*\|$ is small. Hence in all cases, we can initialize the estimator with
$$\theta_n^0 = \max\big\{0.2,\ \sqrt{\lambda_1/2}\big\}\,v_1$$
to satisfy the initialization condition that we required in Theorem 1.
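A sketch of the resulting spectral initializer (ours, following the construction of $M$ and the initialization rule above; the `0.2` floor is the low-SNR default from Theorem 1):

```python
import numpy as np

def spectral_init(X, Y):
    """theta_n^0 = max{0.2, sqrt(lambda_1 / 2)} * v_1, where (lambda_1, v_1)
    is the top eigenpair of M = (1/n) sum Y_i^2 x_i x_i^T - mean(Y^2) * I."""
    n, d = X.shape
    M = (X * (Y**2)[:, None]).T @ X / n - np.mean(Y**2) * np.eye(d)
    evals, evecs = np.linalg.eigh(M)                  # ascending eigenvalues
    lam1, v1 = evals[-1], evecs[:, -1]
    return max(0.2, np.sqrt(max(lam1, 0.0) / 2.0)) * v1
```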
F.2 Super-Linear Convergence of Population EM Operator in Very High SNR Regime

In this appendix, we prove Lemma 1 on the super-linear convergence behavior of the population EM operator in the very high SNR regime.
Proof.
We start from the following equation:
$$M_{mlr}(\theta) - \theta^* = E[XY(\tanh(YX^\top\theta) - \tanh(YX^\top\theta^*))] = E[XY\Delta_{(X,Y)}(\theta)],$$
where $\Delta_{(X,Y)}(\theta) := \tanh(YX^\top\theta) - \tanh(YX^\top\theta^*)$. We define good events as follows:
$$\mathcal{E}_1 = \{|X^\top(\theta^* - \theta)| \le |X^\top\theta^*|\}, \qquad \mathcal{E}_2 = \{|X^\top\theta^*| \ge \tau\}, \qquad \mathcal{E}_3 = \{|Z| \le \tau\}, \qquad (41)$$
where we set $\tau = \Theta(\sqrt{\log\|\theta^*\|})$. Let the good event be $\mathcal{E}_{good} = \mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3$. From Lemma 4, under the good event we have $|\Delta_{(X,Y)}(\theta)| \le 2\exp(-\tau^2)$. To simplify the notation, let $\Delta(\theta) = \Delta_{(X,Y)}(\theta)$ and $W = \nu X(X^\top\theta^*)$, where $\nu$ is the Rademacher label, so that $XY = XZ + W$. Then we can decompose the estimation error as follows:
$$\|M_{mlr}(\theta) - \theta^*\| = \|E[XZ\Delta(\theta)] + E[W\Delta(\theta)]\| \le \sup_{u \in S^{d-1}}|E[(X^\top u)Z\Delta(\theta)]| + |E[(W^\top u)\Delta(\theta)]|$$
$$\le \sup_{u \in S^{d-1}}\sqrt{E[(X^\top u)^2|\Delta(\theta)|]}\sqrt{E[Z^2|\Delta(\theta)|]} + \sqrt{E[(X^\top u)^2|\Delta(\theta)|]}\sqrt{E[(X^\top\theta^*)^2|\Delta(\theta)|]}.$$
We again use the event-wise decomposition strategy. For the population EM, note that we set $\tau = \Theta(\sqrt{\log\|\theta^*\|})$, unlike in the finite-sample EM case in Appendix B.1. We need to prove the following lemma:

Lemma 14. For any $u \in S^{d-1}$, we have
$$E\big[(X^\top u)^2|\Delta(\theta)|\big] \le 2\exp(-\tau^2/2) + 2(\tau^2 + 2\|\theta - \theta^*\|^2)/\|\theta^*\|^2. \qquad (42)$$
Furthermore, we have
$$E\big[(X^\top\theta^*)^2|\Delta(\theta)|\big] \le 3\|\theta^*\|^2\exp(-\tau^2/4) + 8\tau^2/\|\theta^*\|^2 + 4\|\theta - \theta^*\|^2/\|\theta^*\|^2. \qquad (43)$$
On the other hand, we have
$$E\big[Z^2|\Delta(\theta)|\big] \le 2\exp(-\tau^2/4) + 2(\tau^2 + \|\theta - \theta^*\|^2)/\|\theta^*\|^2. \qquad (44)$$

Equipped with the above lemma, whenever $\|\theta - \theta^*\| \ge C\tau$ with $\tau = c\sqrt{\log\|\theta^*\|}$ for sufficiently large constants $C, c > 0$, we have
$$E[(X^\top u)^2|\Delta(\theta)|] \le 5\|\theta - \theta^*\|^2/\|\theta^*\|^2, \quad E[(X^\top\theta^*)^2|\Delta(\theta)|] \le 5\|\theta - \theta^*\|^2/\|\theta^*\|^2, \quad E[Z^2|\Delta(\theta)|] \le 5\|\theta - \theta^*\|^2/\|\theta^*\|^2,$$
which yields
$$\|M_{mlr}(\theta) - \theta^*\| \le \|\theta - \theta^*\|^2/\|\theta^*\|,$$
given that $\|\theta^*\|$ is sufficiently large and $\|\theta - \theta^*\| \le \|\theta^*\|/4$.
Proof of Lemma 14: For equation (42), we can check that
$$E[(X^\top u)^2|\Delta(\theta)|] \le E[(X^\top u)^2|\Delta(\theta)|\,|\,\mathcal{E}_{good}]P(\mathcal{E}_{good}) + \sum_{k=1}^3 E[(X^\top u)^2|\Delta(\theta)|\,|\,\mathcal{E}_k^c]P(\mathcal{E}_k^c)$$
$$\le 2\exp(-\tau^2)E[(X^\top u)^2 1_{\mathcal{E}_{good}}] + 2E[(X^\top u)^2|\mathcal{E}_1^c]P(\mathcal{E}_1^c) + 2E[(X^\top u)^2|\mathcal{E}_2^c]P(\mathcal{E}_2^c) + 2E[(X^\top u)^2|\mathcal{E}_3^c]P(\mathcal{E}_3^c),$$
where we used $|\Delta(\theta)| \le 2$. We now recall Lemma 1 in [36]:

Lemma 15 (Lemma 1 in [36]). Given vectors $u, v \in R^d$ and a Gaussian random vector $X \sim N(0, I_d)$, the matrix $\Sigma = E[XX^\top\,|\,(X^\top v)^2 > (X^\top u)^2]$ has singular values
$$\Big(\frac{\alpha + \sin\alpha\cos\alpha}{\alpha},\ \frac{\alpha - \sin\alpha\cos\alpha}{\alpha},\ 1,\ \ldots,\ 1\Big), \qquad \text{where } \alpha = \cos^{-1}\Big(\frac{(u - v)^\top(u + v)}{\|u - v\|\|u + v\|}\Big).$$
Furthermore, if $\|v\| \le \|u\|$, then we have $P\big((X^\top v)^2 > (X^\top u)^2\big) \le \|v\|/\|u\|$.

Based on the results of Lemma 15 (applied with $v = \theta^* - \theta$ and $u = \theta^*$), we obtain
$$|||E[XX^\top|\mathcal{E}_1^c]|||_{op} \le 2, \qquad P(\mathcal{E}_1^c) \le \|\theta - \theta^*\|/\|\theta^*\|.$$
From a standard property of the Gaussian distribution (see also Lemma 9 in [1]), we also have
$$|||E[XX^\top|\mathcal{E}_2^c]|||_{op} \le 1, \qquad P(\mathcal{E}_2^c) \le 2\tau/\|\theta^*\|.$$
Finally, from the standard Gaussian tail bound, $P(\mathcal{E}_3^c) \le 2\exp(-\tau^2/2)$. Putting these bounds together gives equation (42). Similarly, for equation (43),
$$E[(X^\top\theta^*)^2|\Delta(\theta)|] \le 2\exp(-\tau^2)E[(X^\top\theta^*)^2 1_{\mathcal{E}_{good}}] + 2E[(X^\top(\theta^* - \theta))^2|\mathcal{E}_1^c]P(\mathcal{E}_1^c) + 8\tau^2 P(\mathcal{E}_2^c) + 2E[(X^\top\theta^*)^2|\mathcal{E}_3^c]P(\mathcal{E}_3^c)$$
$$\le 2\exp(-\tau^2)\|\theta^*\|^2 + 4\|\theta^* - \theta\|^2/\|\theta^*\|^2 + 8\tau^2/\|\theta^*\|^2 + 2\|\theta^*\|^2\exp(-\tau^2/4),$$
which gives equation (43); here we used that on $\mathcal{E}_1^c$ we have $(X^\top\theta^*)^2 \le (X^\top(\theta^* - \theta))^2$, and on $\mathcal{E}_2^c$ we have $(X^\top\theta^*)^2 \le \tau^2 \cdot 4$. Finally, for equation (44),
$$E[Z^2|\Delta(\theta)|] \le 2\exp(-\tau^2)E[Z^2 1_{\mathcal{E}_{good}}] + 2E[Z^2]P(\mathcal{E}_1^c) + 2E[Z^2]P(\mathcal{E}_2^c) + 2\sqrt{E[Z^4]}\sqrt{P(\mathcal{E}_3^c)} \le 2\exp(-\tau^2/4) + 2(\tau^2 + \|\theta - \theta^*\|^2)/\|\theta^*\|^2,$$
where we used the independence between $Z$ and $\mathcal{E}_1, \mathcal{E}_2$. This concludes the proof of Lemma 14. $\Box$
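The tail bound from Lemma 15 that controls $P(\mathcal{E}_1^c)$ can be spot-checked by simulation; a small sketch (ours, with an arbitrary perturbation of $\theta^*$):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 5, 10**6
theta_star = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
theta = theta_star + 0.1 * rng.standard_normal(d)    # a nearby estimate

X = rng.standard_normal((n, d))
v, u = X @ (theta_star - theta), X @ theta_star
p_emp = np.mean(v**2 > u**2)          # P(E_1^c) = P(|X^T(theta*-theta)| > |X^T theta*|)
bound = np.linalg.norm(theta_star - theta) / np.linalg.norm(theta_star)
print(p_emp, bound)                   # empirically p_emp stays below the bound
```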
References

[1] S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics, 45:77–120, 2017. (Cited on pages 2, 3, 9, and 33.)

[2] A. T. Chaganty and P. Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048, 2013. (Cited on page 3.)

[3] J. Chen. Optimal rate of convergence for finite mixture models. Annals of Statistics, 23(1):221–233, 1995. (Cited on page 3.)

[4] J. Chen and P. Li. Hypothesis test for normal mixture models: The EM approach. Annals of Statistics, 37:2523–2542, 2009. (Cited on page 1.)

[5] S. Chen, J. Li, and Z. Song. Learning mixtures of linear regressions in subexponential time via Fourier moments. arXiv preprint arXiv:1912.07629, 2019. (Cited on page 3.)

[6] Y. Chen, X. Yi, and C. Caramanis. A convex formulation for mixed regression with two components: Minimax optimal rates. In Conference on Learning Theory, pages 560–604, 2014. (Cited on pages 2, 3, and 6.)

[7] C. Daskalakis, C. Tzamos, and M. Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. In Proceedings of the 2017 Conference on Learning Theory, 2017. (Cited on page 2.)

[8] R. D. De Veaux. Mixtures of linear regressions. Computational Statistics & Data Analysis, 8(3):227–245, 1989. (Cited on page 3.)

[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 39:1–38, 1977. (Cited on page 1.)

[10] R. Dwivedi, N. Ho, K. Khamaru, M. J. Wainwright, and M. I. Jordan. Theoretical guarantees for EM under misspecified Gaussian mixture models. In NeurIPS 31, 2018. (Cited on pages 3 and 10.)

[11] R. Dwivedi, N. Ho, K. Khamaru, M. J. Wainwright, M. I. Jordan, and B. Yu. Singularity, misspecification, and the convergence rate of EM. arXiv preprint arXiv:1810.00828, 2018. (Cited on pages 3 and 7.)

[12] arXiv preprint arXiv:1902.00194, 2019. (Cited on pages 2, 3, 7, and 10.)

[13] A. Ghosh and K. Ramchandran. Alternating minimization converges super-linearly for mixed linear regression. arXiv preprint arXiv:2004.10914, 2020. (Cited on pages 3 and 7.)

[14] B. Grün, F. Leisch, et al. Applications of finite mixtures of regression models. 2007. (Cited on page 3.)

[15] P. Heinrich and J. Kahn. Strong identifiability and optimal minimax rates for finite mixture estimation. Annals of Statistics, 46:2844–2870, 2018. (Cited on page 10.)

[16] N. Ho and X. Nguyen. Convergence rates of parameter estimation for some weakly identifiable finite mixtures. Annals of Statistics, 44:2726–2755, 2016. (Cited on page 3.)

[17] N. Ho, C.-Y. Yang, and M. I. Jordan. Convergence rates for Gaussian mixtures of experts. arXiv preprint arXiv:1907.04377, 2019. (Cited on page 3.)

[18] M. I. Jordan and L. Xu. Convergence results for the EM approach to mixtures of experts architectures. Neural Networks, 8, 1995. (Cited on page 1.)

[19] S. Karmalkar, A. Klivans, and P. Kothari. List-decodable linear regression. In Advances in Neural Information Processing Systems, pages 7423–7432, 2019. (Cited on page 3.)

[20] J. Kwon and C. Caramanis. EM converges for a mixture of many linear regressions. arXiv preprint arXiv:1905.12106, 2019. (Cited on pages 2, 3, 7, 8, 9, and 13.)

[21] J. Kwon and C. Caramanis. EM algorithm is sample-optimal for learning mixtures of well-separated Gaussians. arXiv preprint arXiv:2002.00329, 2020. (Cited on pages 2, 9, and 13.)

[22] J. Kwon, W. Qian, C. Caramanis, Y. Chen, and D. Davis. Global convergence of the EM algorithm for mixtures of two component linear regression. In Conference on Learning Theory, pages 2055–2110, 2019. (Cited on pages 3, 6, 7, 8, 10, 11, 15, 19, 20, 21, 26, and 27.)

[23] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, New York, NY, 1991. (Cited on page 30.)

[24] P. Li, J. Chen, and P. Marriott. Non-finite Fisher information and homogeneity: an EM approach. Biometrika, 96:411–426, 2009. (Cited on page 1.)

[25] Y. Li and Y. Liang. Learning mixtures of linear regressions with nearly optimal complexity. In Conference On Learning Theory, pages 1125–1144, 2018. (Cited on page 3.)

[26] J. Ma, L. Xu, and M. I. Jordan. Asymptotic convergence rate of the EM algorithm for Gaussian mixtures. Neural Computation, 12:2881–2907, 2000. (Cited on page 1.)

[27] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984. (Cited on page 1.)

[28] H. Sedghi, M. Janzamin, and A. Anandkumar. Provable tensor methods for learning mixtures of generalized linear models. In Artificial Intelligence and Statistics, pages 1223–1231, 2016. (Cited on page 3.)

[29] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, NY, 1996. (Cited on page 29.)

[30] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027v7. (Cited on pages 15, 28, 29, 30, and 31.)

[31] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019. (Cited on pages 29 and 31.)

[32] C. F. J. Wu. On the convergence properties of the EM algorithm. Annals of Statistics, 11:95–103, 1983. (Cited on pages 1 and 2.)

[33] Y. Wu and H. H. Zhou. Randomly initialized EM algorithm for two-component Gaussian mixture achieves near optimality in O(√n) iterations. arXiv preprint arXiv:1908.10935, 2019. (Cited on pages 3, 6, 8, and 18.)

[34] J. Xu, D. Hsu, and A. Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems 29, 2016. (Cited on page 2.)

[35] B. Yan, M. Yin, and P. Sarkar. Convergence of gradient EM on multi-component mixture of Gaussians. In Advances in Neural Information Processing Systems 30, 2017. (Cited on page 2.)

[36] X. Yi, C. Caramanis, and S. Sanghavi. Alternating minimization for mixed linear regression. In International Conference on Machine Learning, pages 613–621, 2014. (Cited on pages 2, 3, 7, and 33.)

[37] X. Yi, C. Caramanis, and S. Sanghavi. Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749, 2016. (Cited on pages 2, 3, 7, and 31.)