Training Linear Neural Networks: Non-Local Convergence and Complexity Results
Armin Eftekhari

Abstract
Linear networks provide valuable insights into the workings of neural networks in general. This paper identifies conditions under which the gradient flow provably trains a linear network, in spite of the non-strict saddle points present in the optimization landscape. This paper also provides the computational complexity of training linear networks with gradient flow. To achieve these results, this work develops a machinery to provably identify the stable set of gradient flow, which then enables us to improve over the state of the art in the literature of linear networks (Bah et al., 2019; Arora et al., 2018a). Crucially, our results appear to be the first to break away from the lazy training regime which has dominated the literature of neural networks. This work requires the network to have a layer with one neuron, which subsumes the networks with a scalar output, but extending the results of this theoretical work to all linear networks remains a challenging open problem.
1. Introduction and Overview
Consider the training samples and their labels {x_i, y_i}_{i=1}^m ⊂ R^{d_x} × R^{d_y}, respectively. By concatenating {x_i}_i and {y_i}_i, we form the data matrices

X ∈ R^{d_x × m},  Y ∈ R^{d_y × m}.   (1)

Department of Mathematics and Mathematical Statistics, Umeå University, Sweden. AE is indebted to Holger Rauhut, Ulrich Terstiege and Gongguo Tang for insightful discussions. Correspondence to: Armin Eftekhari <[email protected]>. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Consider also a linear network, i.e., a neural network where the nonlinear activation functions are replaced with the identity map. To be specific, with N layers and the corresponding weight matrices {W_i}_{i=1}^N, this network is characterized by the linear map

R^{d_x} → R^{d_y} : x ↦ W x,   (2)

and the matrix W ∈ R^{d_y × d_x} in (2) is itself specified with the nonlinear (and often over-parametrized) map

R^{d_1 × d_0} × ⋯ × R^{d_N × d_{N−1}} → R^{d_y × d_x} : (W_1, ⋯, W_N) ↦ W := W_N ⋯ W_1,   (3)

where we set d_0 = d_x and d_N = d_y for consistency.

In foregoing the full generality of nonlinear neural networks, linear networks afford us a level of insight and technical rigor that is out of reach for nonlinear networks, at least with our current technical tools (Arora et al., 2018b; Yan et al., 1994; Kawaguchi, 2016; Chitour et al., 2018; Trager et al., 2019; Saxe et al., 2013; Lu & Kawaguchi, 2017; Yun et al., 2017).

Indeed, despite the absence of activation functions, the matrix W in the linear network (2,3) is a nonlinear function of {W_i}_i, and training this linear network thus involves solving a nonconvex optimization problem in {W_i}_i, which shares many interesting features of nonlinear neural networks. Simply put, we cannot claim to understand neural networks in general without understanding linear networks.

Training the linear network (2,3) with the data (X, Y) in (1) can be cast as the optimization problem

min_{W_1, ⋯, W_N} (1/2) ||Y − W_N W_{N−1} ⋯ W_1 X||_F^2
subject to W_j ∈ R^{d_j × d_{j−1}}, ∀ j ∈ [N],   (4)

which is nonconvex when N ≥ 2. Above, [N] = {1, ⋯, N}. Let us introduce the shorthand

W_N := (W_1, ⋯, W_N) ∈ R^{d_1 × d_0} × ⋯ × R^{d_N × d_{N−1}} =: R^{d_N},   (5)

which allows us to rewrite problem (4) more compactly as

min_{W_N} L_N(W_N)  subject to  W_N ∈ R^{d_N},   (6)

where L_N(W_N) := (1/2) ||Y − W_N ⋯ W_1 X||_F^2. With this setup and before turning to the details, let us highlight the contributions of this paper, in the order of appearance.

• Theorem 2.8 in Section 2 provides a new analysis of the optimization landscape of linear networks, where we uncover a previously-unknown link to the celebrated Eckart-Young-Mirsky theorem and the geometry of principal component analysis (PCA).

• Theorem 3.8 in Section 3 identifies the conditions under which gradient flow successfully trains a linear network, despite the presence of non-strict saddle points in the optimization landscape.

Theorem 3.8 thus improves the state of the art in (Bah et al., 2019) as the first convergence result outside of the lazy training regime, reviewed later. This improvement is achieved with a new argument that provably identifies the stable set of gradient flow, in the language of dynamical systems theory.

Theorem 3.8 applies to linear networks that have a layer with a single neuron, see Assumption 3.6.
This case corresponds to the popular spiked covariance model in statistics, and subsumes networks with a scalar output. Extension of Theorem 3.8 to all linear networks remains a challenging open problem, as there is no natural notion of stable set in general. We will however speculate about how Theorem 3.8 might serve as the natural building block for a more general result in the future.

• Theorem 4.4 in Section 4 quantifies the computational complexity of training a linear network by establishing non-local convergence rates for gradient flow. This result also quantifies for the first time how the (faraway) convergence rate benefits from increasing the network depth.

Theorem 4.4 presents the first convergence rate for linear networks outside of the lazy training regime, thus improving the state of the art in (Arora et al., 2018a). Indeed, of the dozens of related works, virtually all belong to this lazy regime, to the best of our knowledge, thus signifying the importance of this breakthrough. Theorem 4.4 also corresponds to the spiked covariance model.
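Before turning to the landscape analysis, the following minimal NumPy sketch makes the setup (1)-(6) concrete. It is not part of the original paper: the dimensions, data and helper names are illustrative only. It evaluates the end-to-end matrix (3), the loss L_N in (6), and the gradient of L_N with respect to a single layer, which is the quantity that drives the gradient flow studied in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative architecture: d_0 = d_x, ..., d_N = d_y, with smallest width r = 1.
dims = [4, 3, 1, 3, 2]                       # d_0, d_1, d_2, d_3, d_4 (so N = 4)
m = 50                                       # number of training samples

X = rng.standard_normal((dims[0], m))        # data matrix, d_x x m, see (1)
Y = rng.standard_normal((dims[-1], m))       # label matrix, d_y x m, see (1)

# Weight matrices W_j in R^{d_j x d_{j-1}}, j = 1, ..., N, as in (4).
Ws = [rng.standard_normal((dims[j], dims[j - 1])) for j in range(1, len(dims))]

def end_to_end(factors):
    """Return the product of a (sub)tuple of weight matrices, e.g. W_N ... W_1 in (3)."""
    W = np.eye(factors[0].shape[1])
    for F in factors:
        W = F @ W
    return W

def loss(Ws, X, Y):
    """L_N(W_1, ..., W_N) = 0.5 * ||Y - W_N ... W_1 X||_F^2, see (6)."""
    return 0.5 * np.linalg.norm(Y - end_to_end(Ws) @ X, "fro") ** 2

def layer_gradient(Ws, X, Y, j):
    """Gradient of L_N with respect to W_{j+1} (0-indexed entry j of Ws)."""
    A = end_to_end(Ws[j + 1:]) if j + 1 < len(Ws) else np.eye(Ws[-1].shape[0])
    B = end_to_end(Ws[:j]) if j > 0 else np.eye(Ws[0].shape[1])
    R = end_to_end(Ws) @ X - Y               # residual W X - Y
    return A.T @ R @ X.T @ B.T               # (W_N..W_{j+2})^T (WX - Y) X^T (W_j..W_1)^T

print("loss L_N at the random tuple:", loss(Ws, X, Y))
print("gradient w.r.t. W_2 has shape:", layer_gradient(Ws, X, Y, 1).shape)
```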
2. Landscape of Linear Networks
The landscape of nonconvex program (6) has been widely studied in the literature, with contributions from (Arora et al., 2018b;a; Chitour et al., 2018; Bartlett et al., 2019; Saxe et al., 2013; Hardt & Ma, 2016; Laurent, 2018; Trager et al., 2019; Baldi & Hornik, 1989; Zhu et al., 2019; He et al., 2016; Nguyen, 2019). The state of the art here is Proposition 31 in (Bah et al., 2019), reviewed in Appendix E, which itself improves over Theorem 2.3 in (Kawaguchi, 2016).

The main result of this section, Theorem 2.8 below, is a variation of Proposition 31 in (Bah et al., 2019) with an additional assumption. In Section 3, we will use Theorem 2.8 to improve the state of the art for training linear networks, under this new assumption. Crucially, the proof of Theorem 2.8 uncovers a previously-unknown link to the celebrated Eckart-Young-Mirsky theorem (Eckart & Young, 1936; Mirsky, 1966) and the geometry of PCA.

To begin, let us concretely define the notion of optimality for problem (6).
Definition 2.1 (First-order stationarity for (6)). We say that W_N ∈ R^{d_N} is a first-order stationary point (FOSP) of problem (6) if

∇L_N(W_N) = 0,   (7)

where ∇L_N(W_N) is the gradient of L_N at W_N.

Definition 2.2 (Second-order stationarity for (6)). We say that W_N ∈ R^{d_N} is a second-order stationary point (SOSP) of problem (6) if, in addition to (7), it holds that

∇²L_N(W_N)[∆_N] ≥ 0,   ∀ ∆_N ∈ R^{d_N},   (8)

where ∇²L_N(W_N)[∆_N] is the second derivative of L_N at W_N along the direction ∆_N.

Definition 2.3 (Strict saddles of (6)). Any FOSP of problem (6) which is not an SOSP is a strict saddle point of (6).

Any SOSP of problem (6) is either a local minimizer of (6) or a non-strict saddle point of problem (6). Unlike a non-strict saddle point, there always exists a descent direction to escape from a strict saddle point (Lee et al., 2017). To continue, let

r := min_{j ≤ N} d_j   (smallest width of the network)   (9)

denote the smallest width of the linear network (2,3). As shown in Appendix A, we can reformulate problem (6) as

min_{W_N} L_N(W_N)
= { min_W (1/2) ||Y − W X||_F^2 =: L(W),  subject to rank(W) ≤ r }   (10a)
= { min_{P,Q} (1/2) ||Y − P Q X||_F^2 =: L(P, Q),  subject to P ∈ R^{d_y × r}, Q ∈ R^{r × d_x} }.   (10b)

In particular, the notion of optimality for problem (10b) is defined similarly to Definitions 2.1-2.3. There is a correspondence between the stationary points of problems (6) and (10b), proved in Appendix B.

Lemma 2.4 (Pairwise correspondence between SOSPs). Any FOSP W_N = (W_1, ⋯, W_N) of problem (6) corresponds to an FOSP (P, Q) of problem (10b), provided that W = W_N ⋯ W_1 is rank-r. Moreover, any SOSP W_N of problem (6) corresponds to an SOSP (P, Q) of problem (10b), provided that rank(W) = r.

Let P_X := X†X and P_X⊥ := I_m − P_X denote the orthogonal projections onto the row span of X and its orthogonal complement, respectively. Here, X† is the pseudo-inverse of X and I_m ∈ R^{m × m} is the identity matrix. Using the decomposition Y = Y P_X + Y P_X⊥, we can in turn rewrite problem (10b) as

min_{P,Q} L(P, Q)
= (1/2) ||Y P_X⊥||_F^2 + { min_{P,Q,Q'} (1/2) ||Y P_X − P Q'||_F^2,  subject to Q' = Q X }
≥ (1/2) ||Y P_X⊥||_F^2 + min_{P,Q'} (1/2) ||Y P_X − P Q'||_F^2.   (11)

The relaxation above is tight, and there is a correspondence between the stationary points, proved in Appendix C.

Lemma 2.5 (Pairwise correspondence between SOSPs). Suppose that X X^T is invertible. Then it holds that

−(1/2) ||Y P_X⊥||_F^2 + min_{P,Q} L(P, Q)
= { min_{P,Q'} (1/2) ||Y P_X − P Q'||_F^2 =: L'(P, Q'),  subject to P ∈ R^{d_y × r}, Q' ∈ R^{r × m} }.   (12)

Any FOSP (P, Q) of problem (10b) corresponds to an FOSP (P, Q') of problem (12). Moreover, any SOSP (P, Q) of problem (10b) corresponds to an SOSP (P, Q') of problem (12).

Note that solving problem (12) involves finding a best rank-r approximation of Y P_X or, equivalently, finding the r leading principal components of Y P_X (Murphy, 2012). By combining Lemmas 2.4 and 2.5, we immediately reach the following conclusion.

Lemma 2.6 (Pairwise correspondence between SOSPs). Suppose that X X^T is invertible. Then any FOSP W_N of problem (6) corresponds to an FOSP (P, Q') of problem (12), provided that W = W_N ⋯ W_1 is rank-r.
Moreover, any SOSP W_N of problem (6) corresponds to an SOSP (P, Q') of (12), provided that rank(W) = r.

We next recall a variant of the celebrated EYM theorem (Eckart & Young, 1936; Mirsky, 1966; Hauser & Eftekhari, 2018; Hauser et al., 2018), which specifies the landscape of the PCA problem (12).
Theorem 2.7 (EYM theorem). Any SOSP (P, Q') of the PCA problem (12) is also a global minimizer of problem (12).

With Lemma 2.6 at hand, we invoke Theorem 2.7 to uncover the landscape of problem (6); see Appendix D for the proof.
Theorem 2.8 (Landscape of linear networks). Suppose that X X^T is invertible and that rank(Y X†X) ≥ r. Then any SOSP W_N = (W_1, ⋯, W_N) of problem (6) is a global minimizer of problem (6), provided that W_N ⋯ W_1 is a rank-r matrix.

In words, Theorem 2.8 identifies certain SOSPs of problem (6) which are global minimizers of problem (6). A few important remarks are in order.

(1) The proof of Theorem 2.8 establishes a pairwise correspondence between the stationary points of problem (6) and the geometry of the PCA problem. This connection was previously unknown, to the best of our knowledge.

(2) Any rank-degenerate SOSP of problem (6) is excluded from Theorem 2.8, i.e., any SOSP W_N = (W_1, ⋯, W_N) such that rank(W_N ⋯ W_1) < r is excluded from our result. For example, the zero matrix is a spurious SOSP of problem (6) when the network depth N ≥ 3, as observed in (Bah et al., 2019; Kawaguchi, 2016; Trager et al., 2019). The landscape of problem (6) in general is therefore more complicated than the special case of N = 2, corresponding to the Eckart-Young-Mirsky theorem, see Theorem 2.7.

(3) Theorem 2.8 is a variation of Proposition 31 in (Bah et al., 2019) with a new assumption on Y X†X, which will be necessary shortly. Similar assumptions have been used in the context of PCA, for example in (Helmke & Shayman, 1995).

(4) For completeness, we also prove Theorem 2.8 using Proposition 31 in (Bah et al., 2019) as the starting point, see Appendix E.
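To illustrate the EYM connection behind Theorem 2.8, the short sketch below (illustrative sizes and synthetic data, not from the paper) computes the global minimum value of problem (6) predicted by the reduction (11)-(12), namely (1/2)||Y P_X⊥||_F^2 plus the best rank-r approximation error of Y P_X, and verifies that an explicit rank-r factorization attains this value.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_y, m, r = 6, 4, 40, 2                 # illustrative sizes; r is the smallest width (9)

X = rng.standard_normal((d_x, m))            # X X^T is invertible almost surely
Y = rng.standard_normal((d_y, m))

P_X = np.linalg.pinv(X) @ X                  # projection onto the row span of X
YPX = Y @ P_X

U, s, Vt = np.linalg.svd(YPX, full_matrices=False)
YPX_r = (U[:, :r] * s[:r]) @ Vt[:r, :]       # best rank-r approximation (EYM, Theorem 2.7)

# Global minimum value of (6) predicted by (11)-(12):
opt_val = (0.5 * np.linalg.norm(Y @ (np.eye(m) - P_X), "fro") ** 2
           + 0.5 * np.linalg.norm(YPX - YPX_r, "fro") ** 2)

# An explicit rank-r end-to-end matrix attaining this value, factored as W2 @ W1,
# i.e. a two-layer factorization (P, Q) in the sense of (10b).
W_star = YPX_r @ np.linalg.pinv(X)
U2, s2, Vt2 = np.linalg.svd(W_star, full_matrices=False)
W2, W1 = U2[:, :r], s2[:r, None] * Vt2[:r, :]
achieved = 0.5 * np.linalg.norm(Y - W2 @ W1 @ X, "fro") ** 2

print(f"EYM prediction : {opt_val:.6f}")
print(f"value attained : {achieved:.6f}")    # the two numbers agree up to round-off
```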
3. Convergence of Gradient Flow
In view of Theorem 2.8 above, even though nonconvex, the landscape of problem (6) has certain favourable properties. On the other hand, problem (6) fails to satisfy the strict saddle property that enables first-order algorithms to avoid saddle points (Lee et al., 2016; Ge et al., 2015). For example, the zero matrix is a non-strict saddle point of problem (6) when the network depth N ≥ 3, as discussed earlier.

Against this mixed background, it is natural to ask if first-order algorithms can successfully train linear neural networks. This fundamental question has remained unanswered in the literature, to our knowledge. Indeed, outside the lazy training regime, reviewed in Section 4, it is not known if gradient flow can successfully solve problem (6).

In fact, the state of the art here, Theorem 35(a) in (Bah et al., 2019), guarantees the convergence of gradient flow to a minimizer of L_N, when restricted to one of a few regions in R^{d_N}. Even though these regions are known in advance, their result cannot predict which region would contain the limit point of gradient flow, for a given initialization. In other words, Theorem 35(a) in (Bah et al., 2019) does not guarantee the convergence of gradient flow to a global minimizer of problem (6), and gradient flow might indeed converge to a spurious SOSP of problem (6), such as the zero matrix. For completeness, Theorem 35(a) in (Bah et al., 2019) is reviewed in Appendix F.

In an important setting, this section answers the open question of convergence of gradient flow for training linear networks. This is achieved with a new argument, which enables us to provably identify the stable set of gradient flow, in the language of dynamical systems theory. Let us begin with the necessary preparations.

Consider gradient flow applied to program (6), specified as

Ẇ_j(t) = dW_j(t)/dt = −∇_{W_j} L_N(W_N(t)),   ∀ j ∈ [N], ∀ t ≥ 0,   (gradient flow)   (13)

and initialized at W_{N,0} ∈ R^{d_N}. Above, ∇_{W_j} L_N is the gradient of L_N with respect to W_j, the weight matrix for the j-th layer of the linear network, see (4,5,6).

A consequence of Lojasiewicz's theorem is the following convergence result for gradient flow. See Appendix G for the proof, similar to Theorem 11 in (Bah et al., 2019).

Lemma 3.1 (Convergence, uninformative). If X X^T is invertible, then gradient flow (13) converges. Moreover, the limit point is an SOSP W_N ∈ R^{d_N} of problem (6), for almost every initialization W_{N,0} with respect to the Lebesgue measure in R^{d_N}.

To study this limit point, we focus here on a common initialization technique for linear networks (Hardt & Ma, 2016; Bartlett et al., 2019; Arora et al., 2018a;b).

Definition 3.2 (Balanced initialization). For gradient flow (13), we call W_{N,0} = (W_{1,0}, ⋯, W_{N,0}) ∈ R^{d_N} a balanced initialization if

W_{j+1,0}^T W_{j+1,0} = W_{j,0} W_{j,0}^T,   ∀ j ∈ [N − 1].   (14)

Claim 4 in (Arora et al., 2018a) underscores the necessity of a (nearly) balanced initialization for linear networks. More generally, (Sutskever et al., 2013) highlights the importance of initialization in deep neural networks. The main result of this section, Theorem 3.8 below, thus requires an (exactly) balanced initialization.

Assuming exact balancedness in (14) is sufficient here because the focus of this theoretical work is continuous-time optimization. More generally, approximate balancedness is necessary for discretized algorithms, such as gradient descent (Arora et al., 2018a). We avoid this additional layer of complexity here as it does not seem to add any key theoretical insights to this paper.
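Exactly balanced tuples in the sense of (14) are straightforward to construct. The sketch below shows one standard construction (our own illustration, not a procedure from the paper): it factors a given end-to-end matrix into balanced layers through its thin SVD and then checks the balance condition (14).

```python
import numpy as np

rng = np.random.default_rng(2)

def balanced_factors(W0, widths):
    """Factor W0 = W_N ... W_1 into a balanced tuple (Definition 3.2).

    widths = [d_0, ..., d_N]; assumes rank(W0) <= min(widths).  Construction:
    W_j = B_j S^(1/N) B_{j-1}^T, with B_0 = V and B_N = U from the thin SVD
    W0 = U S V^T, and B_1, ..., B_{N-1} arbitrary matrices with orthonormal columns.
    """
    N = len(widths) - 1
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    r = int(np.sum(s > 1e-12))
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
    root = np.diag(s ** (1.0 / N))
    Bs = [Vt.T]                                                    # B_0 = V, d_0 x r
    for d in widths[1:-1]:
        Bs.append(np.linalg.qr(rng.standard_normal((d, r)))[0])    # B_j, d_j x r
    Bs.append(U)                                                   # B_N = U, d_N x r
    return [Bs[j] @ root @ Bs[j - 1].T for j in range(1, N + 1)]

widths = [5, 3, 1, 2]                                              # d_0, ..., d_N; width 1
W0 = rng.standard_normal((widths[-1], 1)) @ rng.standard_normal((1, widths[0]))
Ws = balanced_factors(W0, widths)

# Balance condition (14): W_{j+1}^T W_{j+1} = W_j W_j^T for j = 1, ..., N-1.
for j in range(len(Ws) - 1):
    gap = np.linalg.norm(Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T)
    print(f"balance gap at layer {j + 1}: {gap:.2e}")

prod = Ws[0]
for W in Ws[1:]:
    prod = W @ prod
print("reconstruction error of W_N ... W_1:", np.linalg.norm(prod - W0))
```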
A useful observation is that, if the initialization is balanced, gradient flow remains balanced afterwards; see for example Lemma 2 in (Bah et al., 2019). More formally, gradient flow (13) satisfies

W_{j+1,0}^T W_{j+1,0} = W_{j,0} W_{j,0}^T, ∀ j ∈ [N − 1]
⟹ W_{j+1}(t)^T W_{j+1}(t) = W_j(t) W_j(t)^T,   (15)

for every j ∈ [N − 1] and every t ≥ 0. Above, the weight matrix W_j(t) is the j-th component of W_N(t), see (5).

Alongside gradient flow (13), it is convenient to introduce another flow (Bah et al., 2019; Arora et al., 2018a), which dictates the evolution of the end-to-end product of the weight matrices of the linear network. Concretely, for a matrix W ∈ R^{d_y × d_x}, consider the linear operator A_W specified as

A_W : R^{d_y × d_x} → R^{d_y × d_x}
∆ ↦ Σ_{j=1}^N (W W^T)^{(N−j)/N} ∆ (W^T W)^{(j−1)/N}.   (16)

For a balanced initialization W_{N,0} = (W_{1,0}, ⋯, W_{N,0}), gradient flow (13) in R^{d_N} induces a flow in R^{d_y × d_x}, initialized at W_0 = W_{N,0} ⋯ W_{1,0} ∈ R^{d_y × d_x} and specified as

Ẇ(t) = −A_{W(t)}(∇L(W(t)))   ∀ t ≥ 0,
     = −A_{W(t)}(W(t) X X^T − Y X^T),   (see (10a))   (induced flow)   (17)

see for example Equation (26) in (Bah et al., 2019). Above,

W(t) = W_N(t) ⋯ W_1(t) ∈ R^{d_y × d_x}.   (18)

We will refer to (17) as the induced flow, which governs the evolution of the end-to-end product of the weight matrices of the linear network.

It is known that induced flow (17) admits an analytic singular value decomposition (SVD), see for example Lemma 1 and Theorem 3 in (Arora et al., 2019a) or (Illashenko & Yakovenko, 2008). More specifically, it holds that

W(t) = U(t) S(t) V(t)^T   (SVD),   ∀ t ≥ 0,   (19)

provided that the network depth N ≥ 2. In (19), U(t), V(t), S(t) are analytic functions of t (Parks & Krantz, 1992). Moreover, U(t) ∈ R^{d_y × d_y} and V(t) ∈ R^{d_x × d_x} are orthonormal bases, and S(t) ∈ R^{d_y × d_x} contains the singular values of W(t) in no specific order.
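The next sketch (again ours, with illustrative sizes) checks the induced flow numerically: for an exactly balanced rank-1 tuple, the time derivative of the end-to-end product under gradient flow (13), written out with the chain rule, coincides with the right-hand side of (17).

```python
import numpy as np

rng = np.random.default_rng(3)

def psd_power(M, p):
    """Fractional power of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0.0, None) ** p) @ V.T

def A_W(W, Delta, N):
    """The linear operator A_W in (16), applied to Delta."""
    WWt, WtW = W @ W.T, W.T @ W
    return sum(psd_power(WWt, (N - j) / N) @ Delta @ psd_power(WtW, (j - 1) / N)
               for j in range(1, N + 1))

def unit(d):
    x = rng.standard_normal(d)
    return x / np.linalg.norm(x)

# A rank-1 balanced tuple: W_1 = s^(1/N) b_1 v^T, W_j = s^(1/N) b_j b_{j-1}^T,
# W_N = s^(1/N) u b_{N-1}^T, with unit vectors v, b_1, ..., b_{N-1}, u.
widths = [5, 3, 1, 2]                              # d_0, ..., d_N (illustrative)
N = len(widths) - 1
s = 2.0
bs = [unit(d) for d in widths]                     # b_0 = v, ..., b_N = u
Ws = [s ** (1.0 / N) * np.outer(bs[j], bs[j - 1]) for j in range(1, N + 1)]

X = rng.standard_normal((widths[0], 30))
Y = rng.standard_normal((widths[-1], 30))

def prod(factors):
    out = np.eye(factors[0].shape[1])
    for F in factors:
        out = F @ out
    return out

W = prod(Ws)                                       # end-to-end matrix, here s * u v^T
grad = (W @ X - Y) @ X.T                           # gradient of L(W) in (10a)

# d/dt of the end-to-end product under gradient flow (13), via the chain rule.
lhs = np.zeros_like(W)
for j in range(N):
    A = prod(Ws[j + 1:]) if j + 1 < N else np.eye(widths[-1])
    B = prod(Ws[:j]) if j > 0 else np.eye(widths[0])
    lhs += A @ (-(A.T @ grad @ B.T)) @ B           # A * (dW_{j+1}/dt) * B

# Right-hand side of the induced flow (17); valid because the tuple is balanced.
rhs = -A_W(W, grad, N)
print("agreement with (17):", np.linalg.norm(lhs - rhs))   # ~ 1e-15
```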
The evolution of the singular values of W(t) in (19) is also known (Arora et al., 2019a; Townsend, 2016). In particular, the following byproduct about the rank of W(t) is important for us; see Appendix H for the proof.

Lemma 3.3 (Rank invariance). For induced flow (17), rank(W(t)) = rank(W_0) for all t ≥ 0, provided that X X^T is invertible and the network depth N ≥ 2.

Let us henceforth assume that X X^T is invertible, and that gradient flow (13) is initialized at W_{N,0} ∈ M_{N,r}, where

M_{N,r} := { W_N : rank(W_N ⋯ W_1) = r } ⊂ R^{d_N},   (20)

see (5). We now make the following observations about the set M_{N,r}, proved in Appendix I.

Lemma 3.4 (Properties of M_{N,r}). M_{N,r} is not a closed subset of R^{d_N}. The complement of M_{N,r} in R^{d_N} has Lebesgue measure zero. (In particular, M_{N,r} is a dense subset of R^{d_N}.)

In view of Lemma 3.4, almost every initialization W_{N,0} ∈ R^{d_N} of gradient flow (13) falls into the set M_{N,r}, i.e.,

W_{N,0} ∈ M_{N,r},   almost surely.   (21)

Moreover, once initialized in M_{N,r} with a balanced initialization, induced flow (17) remains rank-r at all times by (18,20) and Lemma 3.3. Consequently, gradient flow (13) remains in M_{N,r} at all times, see again (18,20). We combine this last observation with (21) to conclude that

W_N(t) ∈ M_{N,r},   ∀ t ≥ 0,   almost surely,   (22)

over the choice of balanced initialization W_{N,0} ∈ R^{d_N}.

Despite (22), the limit point W_N of gradient flow (13) might not belong to M_{N,r}, because M_{N,r} is not closed, see Lemma 3.4. That is, even though the limit point W_N of gradient flow is almost surely an SOSP of problem (6) by Lemma 3.1, we cannot apply Theorem 2.8, and W_N might be a spurious SOSP of problem (6), such as the zero matrix. Indeed, Remark 39 in (Bah et al., 2019) constructs an example where W_N ∉ M_{N,r}, see also (Yan et al., 1994). To avoid this unwanted behaviour, it is necessary to restrict the initialization of the gradient flow and impose additional assumptions.

Our first assumption is that the data is statistically whitened, which is common in the analysis of linear networks (Arora et al., 2018a; Bartlett et al., 2019).

Definition 3.5 (Whitened data). We say that the data matrix X ∈ R^{d_x × m} is whitened if

X X^T / m = (1/m) Σ_{i=1}^m x_i x_i^T = I_{d_x},   (23)

where I_{d_x} ∈ R^{d_x × d_x} is the identity matrix.

Our second assumption is that r = 1 in (9). This case is significant as it corresponds to the popular spiked covariance model in statistics and signal processing (Eftekhari et al., 2019; Johnstone et al., 2001; Vershynin, 2012; Berthet et al., 2013; Deshpande & Montanari, 2014), to name a few. Moreover, r = 1 subsumes the important case of networks with a scalar output.

Lastly, the case r = 1 appears to be the natural building block for the case r > 1 via a deflation argument (Mackey, 2009; Zhang & Ghaoui, 2011). Indeed, gradient flow (13) moves orthogonal to the principal directions that it has previously discovered or "peeled". Extending our results to the case r > 1 remains a challenging open problem.

From (10a) with r = 1, recall that problem (6) for training a linear neural network is closely related to the problem

min_W (1/2) ||Y P_X − W X||_F^2  subject to rank(W) ≤ r = 1
= min_W (m/2) ||Z − W||_F^2  subject to rank(W) ≤ 1,   (24)

where the second line above is obtained using (23), and

Z := Y X^T / m.   (25)

We are in position to collect all the assumptions made in this section in one place.
Assumption 3.6.
In this section, we assume that the linear network (2,3) has depth N ≥ 2, and one of the layers has only one neuron, i.e., r = 1 in (9). Moreover, the data matrix X in (1) is whitened as in (23), and Z = Y X^T / m in (25) satisfies

rank(Z) ≥ r = 1,   γ_Z := s_{Z,2} / s_Z < 1,   (26)

where s_Z and s_{Z,2} are the two largest singular values of Z. Lastly, we assume that the initialization of gradient flow (13) is balanced, see Definition 3.2.

In view of (26), let us define

Z_1 = u_Z · s_Z · v_Z^T   (target matrix)   (27)

to be the best rank-1 approximation of Z, obtained via SVD. Here, ||u_Z|| = ||v_Z|| = 1, and s_Z appeared in (26). Note that Z_1 is the unique solution of problem (24), because Z has a nontrivial spectral gap in (26), see for example Section 1 of (Golub et al., 1987).

Let us fix α ∈ [γ_Z, 1]. To exclude the zero matrix as the limit point of gradient flow (13), the key is to restrict the initialization to a particular subset of the feasible set of problem (6) with r = 1, specified as

N_{N,α} := { W_N = (W_1, ⋯, W_N) : W_N ⋯ W_1 = u_W · s_W · v_W^T (tSVD),
             s_W > (α − γ_Z) s_Z,   u_W^T Z v_W > α s_Z } ⊂ R^{d_N},   (28)

where s_Z, γ_Z were defined in (26). Above, tSVD stands for the thin SVD. A simple observation is that the set N_{N,α} has infinite Lebesgue measure in R^{d_N}.

Such a restriction of the feasible set of problem (6) is necessary as described earlier, see also the negative example constructed in Remark 39 of (Bah et al., 2019). Crucially, note that the end-to-end matrices in N_{N,α} are positively correlated with Z, and also bounded away from the origin. An important observation is that, once initialized in N_{N,α}, gradient flow (13) avoids the zero matrix, see Appendix J.

Lemma 3.7 (Stable set). For gradient flow (13) initialized at W_{N,0} ∈ N_{N,α}, the limit point exists and satisfies W_N ∈ M_{N,1}. Above, α ∈ (γ_Z, 1], and Assumption 3.6 and its notation are in force, see also (20,28).

Combining Lemma 3.7 with Lemma 3.1, we find that the limit point W_N ∈ M_{N,1} of gradient flow (13) is an SOSP of problem (6), for every balanced initialization W_{N,0} ∈ N_{N,α} outside a subset with Lebesgue measure zero. We finally invoke Theorem 2.8 to conclude that this SOSP W_N ∈ M_{N,1} is in fact a global minimizer of L_N in R^{d_N}. This conclusion is summarized below.

Theorem 3.8 (Convergence). Gradient flow (13) converges to a global minimizer of problem (6) from every balanced initialization in N_{N,α} ⊂ R^{d_N}, outside of a subset with Lebesgue measure zero, see (28). Above, α ∈ (γ_Z, 1], and Assumption 3.6 and its notation are in force.

A few important remarks are in order.

(1) Outside the lazy training regime reviewed in Section 4, to our knowledge, Theorem 3.8 is the first convergence result for linear networks, answering the fundamental question of when gradient flow successfully trains a linear network.

(2) Indeed, under Assumption 3.6, Theorem 3.8 improves over Theorem 35 in (Bah et al., 2019), which does not guarantee the convergence of gradient flow (13) to a solution of problem (6), as discussed earlier and in Appendix F.

(3) This improvement was achieved by provably restricting the gradient flow (13) to its stable set N_{N,α} in (28), and such a restriction is indeed necessary as detailed earlier.

(4) Note that Theorem 3.8 sheds light on the theoretical aspects of the training of neural networks, and should not be viewed as an initialization technique for linear networks.
In turn, linear networks only serve to improve our theoretical understanding of neural networks in general.

Let us also examine the content of Assumption 3.6.

(1) The case r = 1 in (9) corresponds to the spiked covariance model in statistics, and covers the important case of networks with a scalar output. Lastly, r = 1 appears to be the natural building block for extension to r > 1, which remains an open problem, see the discussion after (23).

(2) The assumption of whitened data in (23) is commonly used in the context of linear networks, see for example (Arora et al., 2018a; Bartlett et al., 2019).

(3) The requirement that rank(Z) = rank(Y P_X) ≥ r = 1 in Assumption 3.6 is evidently necessary to avoid the limit point of zero.

(4) Finally, it is known that the induced flow (17) for an unbalanced initialization deviates rapidly from its balanced counterpart. It is therefore not clear if an unbalanced flow would provably avoid rank-degenerate limit points. However, we suspect that any disadvantage of an unbalanced initialization will disappear asymptotically as the network depth N grows larger, see Equation 8 in (Bah et al., 2019).
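To make the quantities in Assumption 3.6 and the stable set (28) tangible, the sketch below (synthetic whitened data; every constant, including the choice of α, is illustrative and ours) computes Z, its leading singular value s_Z, the inverse spectral gap γ_Z and the target Z_1, and then evaluates the two conditions of (28) for a candidate rank-1 end-to-end initialization that is positively correlated with Z yet far from the origin.

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, d_y, m = 5, 1, 50                        # scalar-output network, so r = 1

# Whitened data as in (23): X X^T / m = I_{d_x}.
X = np.sqrt(m) * np.linalg.svd(rng.standard_normal((d_x, m)), full_matrices=False)[2]
Y = rng.standard_normal((d_y, m))
assert np.allclose(X @ X.T / m, np.eye(d_x))

Z = Y @ X.T / m                               # the matrix Z in (25)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
s_Z = s[0]
s_Z2 = s[1] if len(s) > 1 else 0.0            # second singular value (0 here, since d_y = 1)
gamma_Z = s_Z2 / s_Z                          # inverse spectral gap, see (26)
Z1 = s_Z * np.outer(U[:, 0], Vt[0, :])        # target matrix Z_1 in (27)

# A candidate end-to-end initialization W0 = u_W s_W v_W^T: roughly aligned with Z,
# but with ||W0|| = 10 s_Z, i.e. far from both the origin and the target.
alpha = 0.5 * (gamma_Z + 1.0)                 # some alpha in (gamma_Z, 1]
u_W = U[:, 0]
v_W = Vt[0, :] + 0.3 * rng.standard_normal(d_x)
v_W /= np.linalg.norm(v_W)
s_W = 10 * s_Z
corr = u_W @ Z @ v_W                          # the correlation u_W^T Z v_W

print(f"s_Z = {s_Z:.3f}, gamma_Z = {gamma_Z:.3f}")
print("condition s_W > (alpha - gamma_Z) s_Z :", s_W > (alpha - gamma_Z) * s_Z)
print("condition u_W^T Z v_W > alpha s_Z     :", corr > alpha * s_Z)
```

For such an initialization, both conditions of (28) typically hold even though the initial loss is large, which is exactly the regime that a lazy-training analysis cannot reach.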
4. Convergence Rate of Gradient Flow
In view of Theorem 3.8, it is natural to ask how fast we can train a linear network with gradient flow. However, Theorem 3.8 is notably silent about the convergence rate of gradient flow (13) to a solution of problem (6). In short, is it possible for gradient flow to efficiently solve problem (6)? As we review now, this fundamental question has not been answered in the literature beyond the lazy training regime.

Indeed, several works have contributed to our understanding here, including (Shamir, 2018; Bartlett et al., 2019; Du & Hu, 2019; Wu et al., 2019; Hu et al., 2020), and (Gunasekar et al., 2017; Soudry et al., 2018; Ji & Telgarsky, 2018; Arora et al., 2019a; Rahaman et al., 2018; Du et al., 2018a) in the related area of implicit regularization. For our purposes, (Arora et al., 2018a) exemplifies the current state of the art and its shortcomings. Loosely speaking, Theorem 1 in (Arora et al., 2018a) states that, when the initial loss is small, gradient flow (13) solves problem (6) to an accuracy of ε > 0 in the order of

C^{−(1−1/N)} log(1/ε)   (29)

time units, where C is independent of the depth N of the linear network. For completeness, Theorem 1 in (Arora et al., 2018a) is reviewed in Appendix K.

Theorem 1 in (Arora et al., 2018a) might disappoint researchers. For one, (29) suggests that increasing the network depth N only marginally speeds up the training. More concerning is that Theorem 1 in (Arora et al., 2018a) requires a close initialization, which is not necessary for convergence, see Theorem 3.8. Indeed, Theorem 1 in (Arora et al., 2018a) hinges on a perturbation argument, whereby the initialization W_{N,0} of gradient flow (13) must satisfy

L_N(W_{N,0}) = sufficiently small.   (see (6))   (30)

In this sense, (Arora et al., 2018a) joins the growing body of literature that quantifies the behavior of neural networks when the learning trajectory is short (Du et al., 2018b; Li & Liang, 2018; Allen-Zhu et al., 2018b;a; Zou et al., 2018; Arora et al., 2019b; Tian, 2017; Brutzkus & Globerson, 2017; Brutzkus et al., 2017; Du & Lee, 2018; Zhong et al., 2017; Zhang et al., 2018; Wu et al., 2019; Shin & Karniadakis, 2019; Zhang et al., 2019; Su & Yang, 2019; Cao & Gu, 2019; Chen et al., 2019; Oymak & Soltanolkotabi, 2018), to name a few.

To be sure, restricting the initialization is necessary for successful training (Sutskever et al., 2013). For example, gradient flow would stall when initialized at a saddle point. However, it is widely believed that first-order algorithms can successfully train neural networks far beyond the lazy training regime considered by (Arora et al., 2018a) and others, and the line of research exemplified by (Arora et al., 2018a) is, while valuable, highly over-represented in the literature. Indeed, the learning trajectory of neural networks is in general not short, and the learning is often not local.
We refer to (Chizat et al., 2019; Yehudai & Shamir, 2019) for a detailed critique of lazy training, see also Appendix K. Let us dub this more general regime non-local training. Quantifying the non-local convergence rate of linear networks is a vital step towards understanding the non-local training of neural networks in general.

In an important setting, this section indeed quantifies the non-local training of linear networks, and addresses both of the shortcomings of Theorem 1 in (Arora et al., 2018a). More specifically, for the case r = 1 in (9), Theorem 4.4 below quantifies the convergence rate of gradient flow (13) to a solution of problem (6), even when (30) is violated. Moreover, Theorem 4.4 establishes that the faraway convergence rate of gradient flow improves by increasing the network depth. All assumptions for this section are collected in Assumption 3.6. Let us turn to the details now.

Instead of the convergence rate of gradient flow (13) to a solution of problem (6), we equivalently study the convergence rate of induced flow (17), as detailed next. The following result is a consequence of Theorem 3.8, proved in Appendix L.

Lemma 4.1 (Convergence of induced flow). In the setting of Theorem 3.8, if gradient flow (13) converges to a solution of problem (6), then induced flow (17) converges to the solution Z_1 of problem (24). Here, Z_1 was defined in (27).

To quantify the convergence rate of induced flow (17), let us define the new loss function

L_{1,1}(W) := (1/2) ||Z_1 − W||_F^2.   (see (27))   (31)

In this section, we often opt for subscripts to compactly show the dependence of variables on time t, for example, W_t as a shorthand for the induced flow W(t). With r = 1, recall that induced flow (17) satisfies rank(W_t) ≤ 1, see (9,18). Assuming that rank(W_0) = 1 at initialization, the induced flow remains rank-1 by Lemma 3.3. Recall also the analytic SVD of the induced flow in (19). The induced flow thus admits the analytic thin SVD

W_t = u_t · s_t · v_t^T,   ∀ t ≥ 0,   (32)

where u_t ∈ R^{d_y} and v_t ∈ R^{d_x} have unit norm, and s_t > 0 is the only nonzero singular value of W_t.

A simple calculation using (32), deferred to Appendix M, upper bounds the loss function L_{1,1} in (31) as
12 ( s t − u (cid:62) t Z v t ) + s Z ( s Z − u (cid:62) t Z v t ) (cid:124) (cid:123)(cid:122) (cid:125) T ,t . (33)Roughly speaking, T ,t above gauges the error in estimat-ing the (only) nonzero singular value s Z of the target Z ,whereas T ,t gauges the misalignment between W t and Z .Both T ,t , T ,t are nonnegative for all t ≥ , see (27,32).To quantify the convergence rate of induced flow (17) to theglobal minimizer Z of problem (24), we next write downthe evolution of the loss function L , in (31) asd L , ( W t ) d t = (cid:68) ∇ L , ( W t ) , ˙ W t (cid:69) (chain rule) = − m (cid:104) W t − Z , A W t ( W t − Z ) (cid:105) , (see (17,25)) (34)where the last line also uses the whitened data in (23). Start-ing with the definition of A W t in (16), we can bound thelast line of (34), see in Appendix N for the proof. Lemma 4.2 (Evolution of loss) . For induced flow (17) and raining Linear Neural Networks the loss function L , in (31) , it holds thatd L , ( W t ) d t ≤ − mN s − N t T ,t − ms − N t ( u (cid:62) t Z v t ) T ,t + √ mN s − N t γ Z (cid:112) T ,t T ,t + 2 ms − N t s Z, T ,t , (35) see (26,32,33) for the notation involved. Loosely speaking, the two nonpositive terms on the right-hand side of (35) are the contribution of the target matrix Z in (27), whereas the two nonnegative terms there are thecontribution of the residual matrix Z − Z . The (unwanted)nonnegative terms in (35) vanish if Z = Z is rank- and,consequently, γ Z = s Z, = 0 , see (26). In view of (33,35),we make two observations:1 Both T ,t and T ,t in (33) appear with negative factorsin the dynamics of (35). For loss L , to reduce rapidly, wemust ensure that s t and u (cid:62) t Z v t both remain bounded awayfrom zero for all t ≥ .2 T ,t has a large negative factor of − N in the evolutionof loss function in (35), and is therefore expected to reducemuch faster with time for deeper linear networks.Let us fix α ∈ [ γ Z , and β > . Given the first observationabove, it is natural to restrict the initialization of gradientflow to a subset of the feasible set of problem (24), specifiedas N α,β ( Z ) := (cid:110) W tSVD = u W · s W · v (cid:62) W :( α − γ Z ) s Z < s W < βs Z ,u (cid:62) W Z v W > αs Z (cid:111) ⊂ R d y × d x , (36)where s Z , γ Z were defined in (26).The necessity of such a restriction was discussed after (28),and the (new) upper bound on s W in (36) controls the (un-wanted) positive terms in (35). Note that N α,β ( Z ) is aneighborhood of Z , i.e., Z ∈ N α,β ( Z ) by (27).Once initialized in N α,β ( Z ) , induced flow (17) remains in N α,β ( Z ) , see Appendix O, closely related to Lemma 3.7. Lemma 4.3 (Stable set) . Fix α ∈ [ γ Z , and β > . Forinduced flow (17) , W ∈ N α,β ( Z ) implies that W t ∈N α,β ( Z ) for all t ≥ . Above, Assumption and thenotation therein are in force. In view of Lemma 4.3, we can now use (36) to bound s t and u (cid:62) t Z v t in (35). We can then distinguish two regimes (fastand slow convergence) in the dynamics of the loss functionin (35) depending on the dominant term on the right-handside of (33). The remaining technical details are deferred toAppendix P and we finally arrive at the following result. Theorem 4.4 (Convergence rate) . With Assumption andits notation in force, fix α ∈ ( γ Z , and β > . 
Supposethat the inverse spectral gap γ Z is small enough so that theexponents below are both negative.Consider gradient flow (13) with the balanced initializa-tion W N , = ( W , , · · · , W N, ) ∈ R d N such that W := W N, · · · W , ∈ R d y × d x satisfiesrank ( W ) = 1 , W tSVD = u s v (cid:62) , ( α − γ Z ) s Z < s < βs Z , u (cid:62) Z v > αs Z . (37) Let W N ( t ) = ( W ( t ) , · · · , W N ( t )) be the output of gradi-ent flow (13) at time t , and set W ( t ) := W N ( t ) · · · W ( t ) ,which satisfies rank ( W ( t )) = 1 for every t ≥ .Let τ ≥ be the first time when s ( τ ) ≤ √ s Z , where s ( τ ) is the (only) nonzero singular value of W ( τ ) . Then thedistance to the target matrix Z in (27) evolves as ∀ t ≤ τ, (cid:107) Z − W ( t ) (cid:107) F ≤ (cid:107) Z − W (cid:107) F (38a) · e − mNs − NZ (cid:16) ( α − γ Z ) − N − γ Z β − N (cid:17) t . ∀ t ≥ τ, (cid:107) Z − W ( t ) (cid:107) F ≤ (cid:107) Z − W ( τ ) (cid:107) F (38b) · e − ms − NZ (cid:16) α ( α − γ Z ) − N − γ Z Nβ − N (cid:17) ( t − τ ) . Under Assumption 3.6, Theorem 4.4 states that gradientflow successfully trains a linear network with linear rate,when initialized in the stable set.As we will see shortly, Theorem 4.4 is the first result toquantify the convergence rate of gradient flow beyond thewidely-studied lazy training regime. The remarks after The-orem 3.8 again apply here about Assumption 3.6 and thecase r > . A few additional remarks are in order.1 Rephrasing (38), gradient flow (13) solves problem (6)to an accuracy of (cid:15) > in the order of mNs Z (cid:0) ( α − γ Z ) − γ Z β (cid:1) − log( C/(cid:15) ) (cid:15) > (cid:15) ms Z (cid:0) α ( α − γ Z ) − γ Z N β (cid:1) − log( C/(cid:15) ) − τ (cid:16) N ( α − γ Z ) − γ Z β α ( α − γ Z ) − γ Z Nβ − (cid:17) (cid:15) ≤ (cid:15) time units. Above, (cid:15) is the right-hand side of (38b), evalu-ated at t = τ .2 In Theorem 4.4, the end-to-end initialization matrix W in (37) is positively correlated with the target matrix Z ,and away from the origin, see our earlier discussions forthe necessity of such restricted initialization. Note also that(37) should not be seen as an initialization scheme but as atheoretical result.3 The faraway convergence rate in (38) improves withincreasing the network depth N , whereas the nearby con-vergence rate does not appear to benefit from increasing raining Linear Neural Networks N . This improved faraway convergence rate should be con-trasted with Arora’s result in (29).4 Crucially, the lazy training results fail to apply here. Tosee this, with the initialization W N , = ( W , , · · · , W N, ) ,Claim 1 in (Arora et al., 2018a) uses a perturbation argu-ment, which requires that (cid:107) Z − W N, · · · W , (cid:107) F < s min ( Z ) , (39)which is impossible unless trivially rank ( Z ) ≤ r . Indeed,the network architecture forces that rank ( W N, · · · W , ) ≤ r , see (9).In contrast, Theorem 4.4 applies even when (39) is violated,as it does away entirely with the limitations of a perturbationargument. Theorem 4.4 thus ventures beyond the reach ofthe lazy training regime in (Arora et al., 2018a), which hasdominated the recent literature of neural networks, thussignifying the importance of this breakthrough.Thorough numerics for linear networks are abound, see forexample (Bah et al., 2019; Arora et al., 2018a;b), and werefrain from lengthy simulations and only provide a numeri-cal example in Figure 1. to visualize the (gradual) change ofregimes from fast to slow convergence, see (38a,38b). 
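For concreteness, here is a minimal sketch of an experiment in the spirit of Figure 1 below (synthetic data; the step size, horizon and random seed are ours, not the paper's): it runs the forward-Euler discretization of induced flow (17) for several depths N and records the distance ||Z − W(t)||_F at a few checkpoints.

```python
import numpy as np

rng = np.random.default_rng(5)
d_x, d_y, m = 5, 1, 50

# Whitened data (23); the labels are chosen so that Z = Y X^T / m has unit norm,
# hence s_Z = 1 and Z_1 = Z (all of these choices are illustrative only).
X = np.sqrt(m) * np.linalg.svd(rng.standard_normal((d_x, m)), full_matrices=False)[2]
z = rng.standard_normal((d_y, d_x))
z /= np.linalg.norm(z)
Y = z @ X
Z = Y @ X.T / m

def psd_power(M, p):
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0.0, None) ** p) @ V.T

def A_W(W, Delta, N):
    """Operator A_W in (16)."""
    WWt, WtW = W @ W.T, W.T @ W
    return sum(psd_power(WWt, (N - j) / N) @ Delta @ psd_power(WtW, (j - 1) / N)
               for j in range(1, N + 1))

# Far-away, positively correlated initialization with ||W0|| = 10 ||Z||.
dir0 = Z + 0.3 * rng.standard_normal((d_y, d_x))
W0 = 10 * np.linalg.norm(Z) * dir0 / np.linalg.norm(dir0)

eta, checkpoints = 2e-5, (500, 2000, 10000)   # illustrative step size and horizon
for N in (2, 4, 8):
    W, errs = W0.copy(), []
    for k in range(1, max(checkpoints) + 1):
        grad = W @ X @ X.T - Y @ X.T          # gradient of L(W) in (10a)
        W = W - eta * A_W(W, grad, N)         # forward-Euler step of induced flow (17)
        if k in checkpoints:
            errs.append(np.linalg.norm(Z - W))
    print(f"N = {N}:", ", ".join(f"{e:.2e}" for e in errs))
```

In runs of this kind, deeper networks tend to make faster progress while W(t) is still far from the target, in line with the faraway rate in Theorem 4.4.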
This example also suggests new research questions about linear networks.

Figure 1. Suppose that the sample size is m = 50, and consider a randomly-generated whitened training dataset (X, Y) ∈ R^{d_x × m} × R^{d_y × m}, with d_x = 5 and d_y = 1. For this dataset, the above figure depicts the distance from induced flow (17) to the target vector Z = Z_1 = Y X^T / m in (25,27), plotted versus time t, for training a linear network with d_x inputs and d_y outputs, as the network depth N varies. The direction of the initial end-to-end vector W_0 ∈ R^{d_y × d_x} is obtained by randomly rotating the direction of the target vector Z_1 by a fixed angle. We also set ||W_0|| = 10 ||Z_1||. Instead of induced flow (17), we implemented the discretization of (17) obtained from the explicit (or forward) Euler method with a small fixed step size.
This simple numerical example visualizes the (gradual) slow-down in the convergence rate of gradient flow with time, see (38), and also shows the faster faraway convergence rate for deeper networks, see Theorem 4.4. The above figure also suggests that the nearby convergence rate of gradient flow (13) might actually be slower for deeper networks. It is however difficult to theoretically infer this from Theorem 4.4, because (38b) is an upper bound for the nearby error. The precise nearby convergence rates of linear networks (and any trade-offs associated with the network depth) thus remain open questions. Note also that the local analysis of (Arora et al., 2018a) cannot be applied here, as discussed after Theorem 4.4.

References
Absil, P.-A., Mahony, R., and Andrews, B. Convergence ofthe iterates of descent methods for analytic cost functions.
SIAM Journal on Optimization , 16(2):531–547, 2005.Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generaliza-tion in overparameterized neural networks, going beyondtwo layers. arXiv preprint arXiv:1811.04918 , 2018a.Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory fordeep learning via over-parameterization. arXiv preprintarXiv:1811.03962 , 2018b.Arora, S., Cohen, N., Golowich, N., and Hu, W. A conver-gence analysis of gradient descent for deep linear neuralnetworks. arXiv preprint arXiv:1810.02281 , 2018a.Arora, S., Cohen, N., and Hazan, E. On the optimization ofdeep networks: Implicit acceleration by overparameteri-zation. arXiv preprint arXiv:1802.06509 , 2018b.Arora, S., Cohen, N., Hu, W., and Luo, Y. Implicit regu-larization in deep matrix factorization. arXiv preprintarXiv:1905.13655 , 2019a.Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization foroverparameterized two-layer neural networks. arXivpreprint arXiv:1901.08584 , 2019b.Bah, B., Rauhut, H., Terstiege, U., and Westdickenberg,M. Learning deep linear neural networks: Riemanniangradient flows and convergence to global minimizers. arXiv preprint arXiv:1910.05505 , 2019.Baldi, P. and Hornik, K. Neural networks and principalcomponent analysis: Learning from examples withoutlocal minima.
Neural networks , 2(1):53–58, 1989.Bartlett, P. L., Helmbold, D. P., and Long, P. M. Gradi-ent descent with identity initialization efficiently learnspositive-definite linear transformations by deep residualnetworks.
Neural computation , 31(3):477–502, 2019.Berthet, Q., Rigollet, P., et al. Optimal detection of sparseprincipal components in high dimension.
The Annals ofStatistics , 41(4):1780–1815, 2013.Brutzkus, A. and Globerson, A. Globally optimal gradientdescent for a convnet with gaussian inputs. In
Proceed-ings of the 34th International Conference on MachineLearning-Volume 70 , pp. 605–614. JMLR. org, 2017.Brutzkus, A., Globerson, A., Malach, E., and Shalev-Shwartz, S. Sgd learns over-parameterized networksthat provably generalize on linearly separable data. arXivpreprint arXiv:1710.10174 , 2017. Cao, Y. and Gu, Q. Generalization bounds of stochasticgradient descent for wide and deep neural networks. In
Advances in Neural Information Processing Systems , pp.10835–10845, 2019.Chen, Z., Cao, Y., Zou, D., and Gu, Q. How much over-parameterization is sufficient to learn deep relu networks? arXiv preprint arXiv:1911.12360 , 2019.Chitour, Y., Liao, Z., and Couillet, R. A geometric approachof gradient descent algorithms in neural networks. arXivpreprint arXiv:1811.03568 , 2018.Chizat, L., Oyallon, E., and Bach, F. On lazy training indifferentiable programming. 2019.Deshpande, Y. and Montanari, A. Information-theoreticallyoptimal sparse pca. In , pp. 2197–2201. IEEE,2014.Du, S. S. and Hu, W. Width provably matters in opti-mization for deep linear neural networks. arXiv preprintarXiv:1901.08572 , 2019.Du, S. S. and Lee, J. D. On the power of over-parametrization in neural networks with quadratic ac-tivation. arXiv preprint arXiv:1803.01206 , 2018.Du, S. S., Hu, W., and Lee, J. D. Algorithmic regularizationin learning deep homogeneous models: Layers are auto-matically balanced. In
Advances in Neural InformationProcessing Systems , pp. 384–395, 2018a.Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradientdescent provably optimizes over-parameterized neuralnetworks. arXiv preprint arXiv:1810.02054 , 2018b.Eckart, C. and Young, G. The approximation of one matrixby another of lower rank.
Psychometrika , 1:211–218,1936. doi: 10.1007/BF02288367.Eftekhari, A., Hauser, R., and Grammenos, A. Moses: Astreaming algorithm for linear dimensionality reduction.
IEEE transactions on pattern analysis and machine intel-ligence , 2019.Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping fromsaddle pointsonline stochastic gradient for tensor decom-position. In
Conference on Learning Theory , pp. 797–842,2015.Golub, G. H., Hoffman, A., and Stewart, G. W. A general-ization of the eckart-young-mirsky matrix approximationtheorem.
Linear Algebra and its Applications, 88:317–327, 1987.
Gunasekar, S., Woodworth, B. E., Bhojanapalli, S.,Neyshabur, B., and Srebro, N. Implicit regularizationin matrix factorization. In
Advances in Neural Informa-tion Processing Systems , pp. 6151–6159, 2017.Hardt, M. and Ma, T. Identity matters in deep learning. arXiv preprint arXiv:1611.04231 , 2016.Hauser, R. A. and Eftekhari, A. Pca by optimisation ofsymmetric functions has no spurious local optima. arXivpreprint arXiv:1805.07459 , 2018.Hauser, R. A., Eftekhari, A., and Matzinger, H. F. Pca bydeterminant optimisation has no spurious local optima.In
Proceedings of the 24th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining ,pp. 1504–1511, 2018.He, K., Zhang, X., Ren, S., and Sun, J. Identity mappingsin deep residual networks. In
European conference oncomputer vision , pp. 630–645. Springer, 2016.Helmke, U. and Shayman, M. A. Critical points of matrixleast squares distance functions.
Linear Algebra and itsApplications , 215:1–19, 1995.Hu, W., Xiao, L., and Pennington, J. Provable benefit of or-thogonal initialization in optimizing deep linear networks. arXiv preprint arXiv:2001.05992 , 2020.Illashenko and Yakovenko.
Lectures on analytic differentialequations , volume 86. American Mathematical Soc.,2008.Ji, Z. and Telgarsky, M. Gradient descent aligns the layers ofdeep linear networks. arXiv preprint arXiv:1810.02032 ,2018.Johnstone, I. M. et al. On the distribution of the largesteigenvalue in principal components analysis.
The Annalsof statistics , 29(2):295–327, 2001.Kawaguchi, K. Deep learning without poor local minima.In
Advances in neural information processing systems ,pp. 586–594, 2016.Laurent, T. Deep linear networks with arbitrary loss: Alllocal minima are global. 2018.Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B.Gradient descent converges to minimizers. arXiv preprintarXiv:1602.04915 , 2016.Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M.,Jordan, M. I., and Recht, B. First-order methodsalmost always avoid saddle points. arXiv preprintarXiv:1710.07406 , 2017. Li, Y. and Liang, Y. Learning overparameterized neuralnetworks via stochastic gradient descent on structureddata. In
Advances in Neural Information ProcessingSystems , pp. 8157–8166, 2018.Lojasiewicz, S. On the trajectories of the gradient of ananalytical function. 1983:115–117, 1982.Lu, H. and Kawaguchi, K. Depth creates no bad localminima. arXiv preprint arXiv:1702.08580 , 2017.Mackey, L. W. Deflation methods for sparse pca. In
Ad-vances in neural information processing systems , pp.1017–1024, 2009.Mirsky, L. Symmetric gauge functions and unitarily in-variant norms.
Quart. J. Math. Oxford , pp. 1156–1159,1966.Murphy, K. P.
Machine learning: a probabilistic perspective .MIT press, 2012.Nguyen, Q. On connected sublevel sets in deep learning. arXiv preprint arXiv:1901.07417 , 2019.Oymak, S. and Soltanolkotabi, M. Overparameterized non-linear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004 , 2018.Parks, H. R. and Krantz, S.
A primer of real analytic func-tions . Birkh¨auser Verlag Boston (MA), 1992.Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M.,Hamprecht, F. A., Bengio, Y., and Courville, A. Onthe spectral bias of neural networks. arXiv preprintarXiv:1806.08734 , 2018.Saxe, A. M., McClelland, J. L., and Ganguli, S. Exactsolutions to the nonlinear dynamics of learning in deeplinear neural networks. arXiv preprint arXiv:1312.6120 ,2013.Shamir, O. Exponential convergence time of gradient de-scent for one-dimensional deep linear neural networks. arXiv preprint arXiv:1809.08587 , 2018.Shin, Y. and Karniadakis, G. E. Trainability and data-dependent initialization of over-parameterized relu neuralnetworks. arXiv preprint arXiv:1907.09696 , 2019.Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., andSrebro, N. The implicit bias of gradient descent on sepa-rable data.
The Journal of Machine Learning Research, 19(1):2822–2878, 2018. Su, L. and Yang, P. On learning over-parameterized neural networks: A functional approximation perspective. arXiv preprint arXiv:1905.10826, 2019.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On theimportance of initialization and momentum in deep learn-ing. In
International conference on machine learning , pp.1139–1147, 2013.Tian, Y. An analytical formula of population gradient fortwo-layered relu network and its applications in con-vergence and critical point analysis. In
Proceedings ofthe 34th International Conference on Machine Learning-Volume 70 , pp. 3404–3413. JMLR. org, 2017.Townsend, J. Differentiating the singular value decomposi-tion. Technical report, Technical report, 2016.Trager, M., Kohn, K., and Bruna, J. Pure and spurious criti-cal points: a geometric study of linear networks. arXivpreprint arXiv:1910.01671 , 2019.Vershynin, R. How close is the sample covariance matrixto the actual covariance matrix?
Journal of TheoreticalProbability , 25(3):655–686, 2012.Wu, X., Du, S. S., and Ward, R. Global convergence of adap-tive gradient methods for an over-parameterized neuralnetwork. arXiv preprint arXiv:1902.07111 , 2019.Yan, W.-Y., Helmke, U., and Moore, J. B. Global analysisof oja’s flow for neural networks.
IEEE Transactions onNeural Networks , 5(5):674–683, 1994.Yehudai, G. and Shamir, O. On the power and limitationsof random features for understanding neural networks. arXiv preprint arXiv:1904.00687 , 2019.Yun, C., Sra, S., and Jadbabaie, A. Global optimalityconditions for deep neural networks. arXiv preprintarXiv:1707.02444 , 2017.Zhang, G., Martens, J., and Grosse, R. Fast convergenceof natural gradient descent for overparameterized neuralnetworks. arXiv preprint arXiv:1905.10961 , 2019.Zhang, X., Yu, Y., Wang, L., and Gu, Q. Learning one-hidden-layer relu networks via gradient descent. arXivpreprint arXiv:1806.07808 , 2018.Zhang, Y. and Ghaoui, L. E. Large-scale sparse principalcomponent analysis with application to text data. In
Advances in Neural Information Processing Systems , pp.532–539, 2011.Zhong, K., Song, Z., and Dhillon, I. S. Learning non-overlapping convolutional neural networks with multiplekernels. arXiv preprint arXiv:1711.03440 , 2017.Zhu, Z., Soudry, D., Eldar, Y. C., and Wakin, M. B. Theglobal optimization geometry of shallow linear neuralnetworks.
Journal of Mathematical Imaging and Vision, pp. 1–14, 2019. Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.
A. Derivation of (10a,10b)
To show (10a,10b), it suffices to show that the map Π( d N ) : R d N → M d N × d , ··· ,r W N = ( W , · · · , W N ) → W = W N · · · W , (40)is surjective, which we now set out to do.Above, d N = ( d , · · · , d N ) and R d N = R d × · · · × R d N is the domain of the function. Also, M d N × d , ··· ,r ⊂ R d N × d is the set of all d N × d matrices of rank at most r . As aside note, M d N × d , ··· ,r is the closure of the manifold of rank- r matrices. Lastly, the network architecture dictates thatsatisfies min j d j = r. (see (9)) (41)The proof of this surjective property is by induction.The base of induction for N = 1 is trivial because Π( d , d ) is simply the identity map by (40) and thus surjective, inparticular for any pair of integers ( d , d ) that satisfies (41).For the step of induction, suppose that Π( d N ) is surjectivefor every tuple d N = ( d , · · · , d N ) that satisfies (41).For an arbitrary integer d N +1 , consider also an arbitrarymatrix W ∈ M d N +1 × d , ··· ,r , (42)with the SVD W SVD = (cid:101) U · (cid:101) S (cid:101) V (cid:62) =: (cid:101) U · (cid:101) Q, (43)where (cid:101) U ∈ R d N +1 × d N +1 and (cid:101) V ∈ R d × d are orthonor-mal bases, and (cid:101) S ∈ R d N +1 × d contains the singular valuesof W .In particular, note that (cid:101) Q ∈ R d N +1 × d . Note also that (42)implies that rank ( (cid:101) Q ) = rank ( W ) ≤ r, (44)because (cid:101) U is an orthonormal basis.Combining (41) and (44), we reachrank ( (cid:101) Q ) ≤ r ≤ d N . (45)In view of (45), it is therefore possible (by padding withzero columns or removing some columns from (cid:101) U and thecorresponding rows from (cid:101) Q ) to create (cid:101) U (cid:48) and (cid:101) Q (cid:48) such that W = (cid:101) U (cid:48) · (cid:101) Q (cid:48)(cid:62) , where (cid:101) U (cid:48) ∈ R d N +1 × d N , (cid:101) Q (cid:48) ∈ R d N × d . (46) In this construction, rank ( (cid:101) Q (cid:48) ) = rank ( (cid:101) Q ) ≤ r and (cid:101) Q (cid:48) ∈ R d N × d . Consequently, the step of induction guarantees theexistence of W N = ( W N , · · · , W ) ∈ R d N such that (cid:101) Q (cid:48) = W N · · · W . (47)It follows that W = (cid:101) U (cid:48) · (cid:101) Q (cid:48) (see (46)) = (cid:101) U (cid:48) W N · · · W . (see (47)) (48)That is, W = Π( d , · · · , d N +1 )[ W , · · · , W N , (cid:101) U (cid:48) ] , (49)which completes the induction. We thus proved that Π( d N ) is a surjective map for every tuple d N that satisfies (41). B. Proof of Lemma 2.4
Let W N = ( W , · · · , W N ) ∈ R d N be an FOSP ofproblem (6). For an infinitesimally small perturbation ∆ N = (∆ , · · · , ∆ N ) ∈ R d N , we can expand L N in (6) as L N ( W N + ∆ N )= L N ( W N ) + ∇ L N ( W N )[∆ N ]+ 12 ∇ L N ( W N )[∆ N ] + o = L N ( W N ) + 12 ∇ L N ( W N )[∆ N ] + o. (50)where o represents (negligible) higher order terms, and thesecond identity above holds because W N is assumed tobe an FOSP in Lemma 2.4, see Definition 2.1. Above, ∇ L N ( W N )[∆ N ] contains all second order terms in thevariables ∆ N .Let j correspond to a layer with the smallest width withinthe linear network (2,3), i.e., r = min j ≤ N d j (see (9)) = d j . (51)We also set P := W N · · · W j +1 ∈ R d y × r ,Q := W j · · · W ∈ R r × d x , (52)for short, and note that W := P · Q = W N · · · W , rank ( W ) = rank ( P ) = rank ( Q ) = r, (53)where the second line above holds by the assumption ofLemma 2.4. raining Linear Neural Networks Indeed, W = P · Q implies that min( rank ( P ) , rank ( Q )) ≥ r. (54)Note also that P has r columns and Q has r rows, thus max( rank ( P ) , rank ( Q )) ≤ r. (55)Together, (54) and (55) give the second line of (53).On the one hand, for an arbitrary (∆ P , ∆ Q ) , we can relatethe perturbation of W N to the perturbation of ( P , Q ) as ( W N + ∆ N ) · · · ( W + ∆ )= ( P + ∆ P )( Q + ∆ Q ) , (56)where ∆ = W Q † ∆ Q , ∆ i = 0 , ≤ i ≤ N − , ∆ N = ∆ P P † W N , (57)and † denotes the pseudo-inverse, and we used the secondidentity in (53).Indeed, for the choice of ∆ N in (57), it holds that ( W N + ∆ N ) · · · ( W + ∆ )= ( W N + ∆ P P † W N ) W N − · · ·· · · W ( W + W Q † ∆ Q ) (see (57)) = ( I d y + ∆ P P † ) W N W N − · · ·· · · W W ( I d x + Q † ∆ Q )= ( I d y + ∆ P P † ) P · Q ( I d x + Q † ∆ Q ) (see (53)) = ( P + ∆ P )( Q + ∆ Q ) , (58)which agrees with (56). The last line above uses the secondidentity in (53), i.e., rank ( P ) = rank ( Q ) = r . Above, I d y ∈ R d y × d y is the identity matrix.On the other hand, we can expand L in (10a) as L ( P + ∆ P , Q + ∆ Q )= L ( P , Q ) + ∇ L ( P , Q )[∆ P , ∆ Q ]+ ∇ L ( P , Q )[∆ P , ∆ Q ] + o, (59)where o again higher order terms. Above, ∇ L ( P , Q )[∆ P , ∆ Q ] collects all first order terms inthe variables (∆ P , ∆ Q ) . Likewise, ∇ L ( P , Q )[∆ P , ∆ Q ] contains all second order terms in (∆ P , ∆ Q ) .For convenience, let us define the map L : R d y × d x → R W → (cid:107) Y − W X (cid:107) F , (60) and note that L ( P, Q ) = L ( P Q ) , (see (10b)) L N ( W N ) = L ( W N · · · W ) , (see (6)) (61)for every P, Q, W N .In view of (61), we now write that L ( P + ∆ P , Q + ∆ Q )= L (( P + ∆ P )( Q + ∆ Q )) (see (61)) = L (( W N + ∆ N ) · · · ( W + ∆ )) (see (56)) = L N ( W N + ∆ N ) , (see (61)) (62)for ∆ N = (∆ , · · · , ∆ N ) specified in (57).As a result of (62), the expansions in (50) and (59) mustmatch. That is, for an arbitrary (∆ P , ∆ Q ) and the corre-sponding choice of ∆ N in (56), it holds that ∇ L ( P , Q )[∆ P , ∆ Q ]= ∇ L N ( W N )[∆ N ] = 0 , (see (50,59)) (63)and ∇ L ( P , Q )[∆ P , ∆ Q ]= ∇ L N ( W N )[∆ N ] . (see (50,59)) (64)It follows from (63) that ( P , Q ) is an FOSP of problem (10a)if W N is an FOSP of problem (6).Moreover, if W N is an SOSP of problem (6), then the lastline of (64) is nonnegative, see Definition 2.2. That is, ∇ L ( P , Q )[∆ P , ∆ Q ]= ∇ L N ( W N )(∆ N ) ≥ , (65)for an arbitrary (∆ P , ∆ Q ) and the corresponding choice of ∆ N in (56). Therefore, ( P , Q ) is an SOSP of problem (10a)if W N is an SOSP of problem (6). This completes the proofof Lemma 2.4. C. Proof of Lemma 2.5
Recall that P X and P X ⊥ denote the orthogonal projectionsonto the row span of X and its complement, respectively.Using the decomposition Q (cid:48) = Q (cid:48) P X + Q (cid:48) P X ⊥ , the lastprogram in (12) can be written as min P,Q (cid:48) (cid:107) Y P X − P Q (cid:48) (cid:107) F = min P,Q (cid:48) (cid:107) Y P X − P Q (cid:48) P X (cid:107) F + 12 (cid:107) P Q (cid:48) P X ⊥ (cid:107) F . (66) raining Linear Neural Networks From the above decomposition, it is evident that the mini-mum above is achieved when the last term in (66) vanishes.This observation allows us to write that min
P,Q (cid:48) (cid:107) Y P X − P Q (cid:48) (cid:107) F = min P,Q (cid:48) (cid:107) Y P X − P Q (cid:48) P X (cid:107) F (see (66)) = (cid:40) min P,Q (cid:48) (cid:107) Y P X − P Q (cid:48)(cid:48) (cid:107) F subject to Q (cid:48)(cid:48) = Q (cid:48) P X = (cid:40) min P,Q (cid:48)(cid:48) (cid:107) Y P X − P Q (cid:48)(cid:48) (cid:107) F subject to row span ( Q (cid:48)(cid:48) ) ⊆ row span ( X )= (cid:40) min P,Q (cid:48)(cid:48) (cid:107) Y P X − P Q (cid:48)(cid:48) (cid:107) F subject to Q (cid:48)(cid:48) = QX = min P,Q (cid:107) Y P X − P QX (cid:107) F , (67)which proves the tight relaxation claimed in (12). The thirdidentity above uses the fact that the map R r × m → row span ( X ) Q (cid:48) → Q (cid:48)(cid:48) = Q (cid:48) P X (68)is surjective.To prove the second claim in Lemma 2.5, let ( P , Q ) be anFOSP of problem (10b), which satisfies Y − P · QX ) X (cid:62) Q (cid:62) , P (cid:62) ( Y − P · QX ) X (cid:62) . (69)After setting Q (cid:48) = QX, (70)the above identities read as Y − P · QX ) X (cid:62) Q (cid:62) (see (69)) = ( Y P X − P · QX ) X (cid:62) Q (cid:62) = ( Y P X − P · Q (cid:48) ) Q (cid:48)(cid:62) , (71)and P (cid:62) ( Y − P · QX ) X (cid:62) (see (69)) = P (cid:62) ( Y P X − P · QX ) X (cid:62) = P (cid:62) ( Y P X − P · Q (cid:48) ) X (cid:62) . (72)Recall thatrow span ( Q (cid:48) ) ⊆ row span ( X ) . (see (70)) (73) With this in mind, (72) implies that P (cid:62) ( Y P X − P · Q (cid:48) ) X (cid:62) (see (72)) = P (cid:62) ( Y P X − P · Q (cid:48) ) , (see (73)) (74)where we also used the assumption that XX (cid:62) is invertible.By combining (71,74), we conclude that ( P , Q (cid:48) ) is an FOSPof problem (12) if ( P , Q ) is an FOSP of problem (10b).To prove the last claim of Lemma 2.5, let ( P , Q ) be anSOSP of problem (10b), which satisfies (cid:107) ∆ P QX + P ∆ Q X (cid:107) F + (cid:104) P · QX − Y, ∆ P ∆ Q X (cid:105) =12 (cid:107) ∆ P QX + P ∆ Q X (cid:107) F + (cid:104) P · QX − Y P X , ∆ P ∆ Q X (cid:105)≥ , ∀ (∆ P , ∆ Q ) . (75)Let us set Q (cid:48) = QX as before, and also note that the map R r × d x → row span ( X )∆ Q → ∆ Q (cid:48) = ∆ Q X (76)is evidently surjective. Then we may rewrite (75) as (cid:107) ∆ P Q (cid:48) + P ∆ Q (cid:48) (cid:107) F + (cid:104) P · Q (cid:48) − Y P X , ∆ P ∆ Q (cid:48) (cid:105) ≥ , ∀ (∆ P , ∆ Q (cid:48) ) ∈ R d y × r × row span ( X ) , (77)On the other hand, recall again (73). When ∆ Q (cid:48) ⊥ row span ( X ) , (78)we have that (cid:107) ∆ P Q (cid:48) + P ∆ Q (cid:48) (cid:107) F + (cid:104) P · Q (cid:48) − Y P X , ∆ P ∆ Q (cid:48) (cid:105) = 12 (cid:107) ∆ P Q (cid:48) (cid:107) F + (cid:107) P ∆ Q (cid:48) (cid:107) F ≥ , ∀ (∆ P , ∆ Q (cid:48) ) ∈ R d y × r × row span ( X ) ⊥ , (79)where the identity above uses (73,78). By combin-ing (77,79), we reach (cid:107) ∆ P Q (cid:48) + P ∆ Q (cid:48) (cid:107) F + (cid:104) P · Q (cid:48) − Y P X , ∆ P ∆ Q (cid:48) (cid:105)≥ , ∀ (∆ P , ∆ Q (cid:48) ) . (80)It is evident from (80) that ( P , Q (cid:48) ) is an SOSP of prob-lem (12) if ( P , Q ) is an SOSP of problem (10b). Thiscompletes the proof of Lemma 2.5. D. Proof of Theorem 2.8
We begin with a technical lemma below, proved with the aid of EYM Theorem 2.7. This result is standard, but a proof is included for completeness.
Lemma D.1.
If rank ( Y P X ) ≥ r , then any SOSP ( P , Q (cid:48) ) of problem (12) is a global minimizer of problem (12) andsatisfies rank ( P ) = rank ( Q (cid:48) ) = rank ( W ) = r, (81) where W = P · Q (cid:48) . Before proving the above lemma in the next appendix, letus show how it can be used to prove Theorem 2.8.Let us assume that rank ( Y P X ) ≥ r , so that Lemma D.1is in force. Then any SOSP ( P , Q (cid:48) ) of problem (12) is aglobal minimizer of problem (12) and satisfies (81).Let us also assume that XX (cid:62) is invertible, so that Lemma2.6 is in force. Lemma 2.6 then implies that any SOSP W N of problem (6) corresponds to an SOSP ( P , Q (cid:48) ) ofproblem (12), provided that W = W N · · · W is rank- r .The relationship between these quantities is W N · · · W X = W X = P · Q (cid:48) . (see (53,70) (82)In light of the preceding paragraph, we observe that anySOSP W N of problem (12) corresponds to a global mini-mizer ( P , Q (cid:48) ) of problem (12), provided that W is rank- r .Using the decomposition Y = Y P X + Y P X ⊥ , we cantherefore write that (cid:107) Y − W N · · · W X (cid:107) F = 12 (cid:107) Y P X − W N · · · W X (cid:107) F + 12 (cid:107) Y P X ⊥ (cid:107) F = 12 (cid:107) Y P X − P · Q (cid:48) (cid:107) F + 12 (cid:107) Y P X ⊥ (cid:107) F (see (82)) = min P,Q (cid:48) (cid:107) Y P X − P Q (cid:48) (cid:107) F + 12 (cid:107) Y P X ⊥ (cid:107) F = min P,Q (cid:107) Y − P QX (cid:107) F (see (12)) = min W , ··· ,W N (cid:107) Y − W N · · · W X (cid:107) F . (see (10b)) (83)That is, any SOSP W N of problem (6) is a global minimizerof problem (6), provided that W is rank- r . This completesthe proof of Theorem 2.8. D.1. Proof of Lemma D.1
We conveniently assume thatrank ( Y P X ) = r, (84)but the same argument is valid also when rank ( Y P X ) > r .Let Y P X tSVD = (cid:101) U (cid:101) S (cid:101) V (85) denote the thin SVD of Y P X , where (cid:101) U ∈ R d y × r has or-thonormal columns, (cid:101) V ∈ R r × m has orthonormal rows, andthe diagonal matrix (cid:101) S ∈ R r × r contains the singular valuesof Y P X .By the way of contradiction, suppose that ( P , Q (cid:48) ) is anSOSP of problem (12) such thatrank ( P · Q (cid:48) ) < r. (86)Without loss of generality, we can in fact replace (86) withrank ( P ) = rank ( Q (cid:48) ) = rank ( P · Q (cid:48) ) < r. (87)(Indeed, for example if rank ( P ) < rank ( Q (cid:48) ) < r , then ( P P S , P S Q (cid:48) ) takes the same objective value in prob-lem (12) as ( P , Q (cid:48) ) . Here, P S is the orthogonal projectiononto the subspace S = row span ( P ) ∩ column span ( Q (cid:48) ) .On the other hand, by EYM Theorem 2.7, the SOSP ( P , Q (cid:48) ) is in fact a global minimizer of problem (12). Therefore, ( P P S , P S Q (cid:48) ) too is a global minimizer of problem (12)and a fortiori an SOSP of problem (12). We can thus re-place ( P , Q (cid:48) ) with the SOSP ( P P S , P S Q (cid:48) ) which satisfiesrank ( P P S ) = rank ( P S Q (cid:48) ) < r . That is, the assumptionmade in (87) indeed does not reduce the generality of thefollowing argument.)Assuming (87), next note that ( P , Q (cid:48) ) satisfies P = (cid:2) U S P d y × (cid:3) ∈ R d y × r Q (cid:48) = (cid:20) S Q (cid:48) V × d x (cid:21) ∈ R r × m , (88)where U ∈ R d y × ( r − and V ∈ R ( r − × m correspond tothose left and right singular vectors of Y P X that mightbe present in P · Q (cid:48) , see for example Lemma 5.1 (Item 5)in (Hauser & Eftekhari, 2018).In (88), S P , S Q (cid:48) ∈ R r × r are (not necessarily diagonal)matrices, and we note that U and V are column and rowsubmatrices of (cid:101) U and (cid:101) V , respectively.In view of (84,85,86), there exists a unique pair ( u, v ) of leftand right singular vectors of Y P X that is absent from (88),i.e., U (cid:62) u = 0 , V (cid:62) v = 0 . (89)To match the representation in (85), note that u above is acolumn-vector whereas v is a row-vector. In particular, (cid:101) U = (cid:2) U u (cid:3) , (cid:101) V = (cid:20) Vv (cid:21) ,Y P X = (cid:101) U (cid:101) S (cid:101) V = U SV + usv, (see (85)) (90) raining Linear Neural Networks where S ∈ R r − and s ∈ R collect the singular valuescorresponding to ( U, V ) and ( u, v ) , respectively.To proceed, consider inifinetsimally small scalars δ u and δ v . Consider also an infinitesimally small perturbation (∆ P , ∆ Q (cid:48) ) in ( P , Q (cid:48) ) , specified as P + ∆ P = (cid:2) U S P δ u u (cid:3) Q (cid:48) + ∆ Q (cid:48) = (cid:20) S Q (cid:48) Vδ v v (cid:21) . (91)It immediately follows that ( P + ∆ P )( Q (cid:48) + ∆ Q (cid:48) )= (cid:2) U S P δ u u (cid:3) · (cid:20) S Q (cid:48) Vδ v v (cid:21) (see (91)) = U S P S Q (cid:48) V + δ u δ v uv = P · Q (cid:48) + δ u δ v uv. (see (88)) (92)From (89), it is evident that the perturbation in (92) is or-thogonal to P · Q (cid:48) .To continue, let us define the orthogonal projections P U = U U (cid:62) and P u = uu (cid:62) , and define P V , P v similarly. Inparticular, we can decompose Y P X as Y P X = ( P U + P u )( Y P X )( P V + P v )= P U ( Y P X ) P V + P u ( Y P X ) P v , (93)where the cross terms above vanish by properties of theSVD, see (85,89). 
Indeed, for example, P U ( Y P X ) P v = U U (cid:62) ( (cid:101) U (cid:101) S (cid:101) V ) v (cid:62) v (see (85)) = U U (cid:62) ( U SV + usv ) v (cid:62) v (see (90)) = U U (cid:62) usv (see (89)) = 0 , (see (89)) (94)where S and s collect the corresponding singular valuesfor the singular vectors collected in ( U, V ) and ( u, v ) , re-spectively. We will use the decomposition (93) immediatelybelow.Under the perturbation in (91), the objective function of problem (12) becomes (cid:107) Y P X − ( P + ∆ P )( Q (cid:48) + ∆ Q (cid:48) ) (cid:107) F (95) = 12 (cid:107) Y P X − ( P + δ u u )( Q (cid:48) + δ v v ) (cid:107) F (see (91)) = 12 (cid:107) Y P X − P · Q (cid:48) − δ u δ v uv (cid:107) F (see (92)) = 12 (cid:107) ( P U + P u ) · ( Y P X − P · Q (cid:48) − δ u δ v uv )( P V + P v ) (cid:107) F = 12 (cid:107) ( P U ( Y P X ) P V − P · Q (cid:48) )+ ( P u ( Y P X ) P v − δ u δ v uv ) (cid:107) F (see (93)) = 12 (cid:107)P U ( Y P X ) P V − P · Q (cid:48) (cid:107) F + 12 (cid:107)P u ( Y P X ) P v − δ u δ v uv (cid:107) F (see (88)) = 12 (cid:107)P U ( Y P X ) P V − P · Q (cid:48) (cid:107) F + 12 | u (cid:62) ( Y P X ) v (cid:62) − δ u δ v | . (96)It is now clear from (96) that the perturbation in (91) de-creases the objective function of problem (12) if we choosethe signs of δ u and δ v carefully.Indeed, we can upper bound the last line of (96) as (cid:107) Y P X − ( P + δu )( Q (cid:48) + δv ) (cid:107) F = 12 (cid:107)P U ( Y P X ) P V − P · Q (cid:48) (cid:107) F + 12 | u (cid:62) ( Y P X ) v (cid:62) − δ u δ v | (see (96)) < (cid:107)P U ( Y P X ) P V − P · Q (cid:48) (cid:107) F + 12 (cid:107) u (cid:62) ( Y P X ) v (cid:107) F = 12 (cid:107)P U ( Y P X ) P V − P · Q (cid:48) (cid:107) F + 12 (cid:107)P u ( Y P X ) P v (cid:107) F = 12 (cid:107) Y P X − P · Q (cid:48) (cid:107) F , (see (88,93)) (97)where we chose δ u δ v above such that sign ( δ u δ v ) = sign ( u (cid:62) ( Y P X ) v (cid:62) ) .Note that (97) contradicts the assumption that ( P , Q (cid:48) ) is anSOSP of problem (12), see Definition 2.2. In fact, ( P , Q (cid:48) ) is a strict saddle point of problem (12) because (∆ P , ∆ Q (cid:48) ) is a descent direction, see Definition 2.3.Provided that rank ( Y P X ) ≥ r , we conclude that any SOSP ( P , Q (cid:48) ) of problem (12) satisfiesrank ( W ) = r, where W = P · Q (cid:48) . (98)We can in fact replace the conclusion in (98) withrank ( P ) = rank ( Q (cid:48) ) = rank ( W ) = r. (99) raining Linear Neural Networks Indeed, max( rank ( P ) , rank ( Q (cid:48) )) ≤ r because P has r columns and Q (cid:48) has r rows, see (12). On the other hand, min( P , Q (cid:48) ) ≥ r because of (98). These two observationsimply that rank ( P ) = rank ( Q (cid:48) ) = r , as claimed in (99).Lastly, by EYM Theorem 2.7, any SOSP of the PCA prob-lem (12) is also a global minimizer of problem (12). Thiscompletes the proof of Lemma D.1. E. Another Proof for Theorem 2.8
Here, we establish Theorem 2.8 with Proposition 32 in (Bah et al., 2019) as the starting point. For completeness, let us first recall their result, adapted to our notation.
Proposition E.1.
For every $r' \in [r]$, an SOSP $\mathbf{W}_N \in \mathcal{M}_{N,r'}$ of problem (6) is a global minimizer of $L_N$ restricted to the set $\mathcal{M}_{N,r'}$.

Next, let us recall from (10b,12) that
\[
\min_{\mathbf{W}_N} L_N(\mathbf{W}_N)
= \min_{W_1,\cdots,W_N} \frac{1}{2}\left\|Y - W_N\cdots W_1X\right\|_F^2 \quad \text{(see (4,6))}
\]
\[
= \min_{P,Q} \frac{1}{2}\left\|Y - PQX\right\|_F^2 \quad \text{(see (10b))}
= \frac{1}{2}\left\|YP_{X^\perp}\right\|_F^2 + \min_{P,Q'} \frac{1}{2}\left\|YP_X - PQ'\right\|_F^2, \tag{100}
\]
where the last line above is from (12).

In view of (100), a global minimizer $\mathbf{W}_N^o = (W_1^o,\cdots,W_N^o)$ of problem (6) (the first program in (100)) corresponds to a global minimizer $(P^o, Q'^o)$ of problem (12) (the last program in (100)) such that
\[
W_N^o\cdots W_1^o =: W^o, \qquad W^o X = P^o Q'^o. \tag{101}
\]
By assumption of Theorem 2.8, it holds that $\mathrm{rank}(YP_X) \ge r$. We can therefore invoke Lemma D.1 in Appendix D to find that
\[
\mathrm{rank}(W_N^o\cdots W_1^o) = \mathrm{rank}(W^o) \quad \text{(see (101))} \quad = r. \quad \text{(see Lemma D.1)} \tag{102}
\]
It is convenient to rewrite (102) as
\[
(W_1^o,\cdots,W_N^o) \in \mathcal{M}_{N,r}, \tag{103}
\]
where
\[
\mathcal{M}_{N,r} := \Big\{ \mathbf{W}_N = (W_1,\cdots,W_N) : \mathrm{rank}(W_N\cdots W_1) = r \Big\} \subset \mathbb{R}^{\mathbf{d}_N}. \tag{104}
\]
On the other hand, with $k = r$, Proposition 32 in (Bah et al., 2019) states that an SOSP $\mathbf{W}_N \in \mathcal{M}_{N,r}$ of problem (6) is almost surely a global minimizer of $L_N$ restricted to the set $\mathcal{M}_{N,r}$. In view of (103), we see that $\mathbf{W}_N$ is in fact a global minimizer of $L_N$ in $\mathbb{R}^{\mathbf{d}_N}$. This completes our alternative proof for Theorem 2.8.
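Theorem 2.8 is easy to probe numerically. The sketch below is added for illustration only; it is not the paper's code, and all dimensions, the rank $r$, the initialization scale, and the step size are hypothetical choices. It computes the optimal value of the last program in (100) via the EYM theorem, and then runs plain gradient descent on the two-factor problem (12) from a small random initialization; consistent with Lemma D.1 and Theorem 2.8, the local method reaches the globally optimal value.

```python
import numpy as np

rng = np.random.default_rng(3)
d_y, d_x, m, r = 5, 4, 8, 2                     # hypothetical dimensions
X = rng.standard_normal((d_x, m))
Y = rng.standard_normal((d_y, m))

# Orthogonal projection onto the row span of X, acting on the right of Y.
P_X = X.T @ np.linalg.pinv(X @ X.T) @ X
YPX = Y @ P_X

# Eckart-Young-Mirsky value of problem (12): best rank-r approximation of Y P_X.
sv = np.linalg.svd(YPX, compute_uv=False)
eym_value = 0.5 * np.sum(sv[r:] ** 2)

# Gradient descent on the two-factor problem (12) from a random initialization;
# by Lemma D.1 / Theorem 2.8, its second-order stationary points are global
# minimizers, so this local method should reach eym_value.
P = 0.1 * rng.standard_normal((d_y, r))
Q = 0.1 * rng.standard_normal((r, m))
eta = 1e-2                                       # hypothetical step size
for t in range(20000):
    Res = P @ Q - YPX
    P, Q = P - eta * Res @ Q.T, Q - eta * P.T @ Res   # simultaneous update
print("value found by gradient descent:  ", 0.5 * np.linalg.norm(P @ Q - YPX) ** 2)
print("Eckart-Young-Mirsky optimal value:", eym_value)
print("optimal value of problem (6):     ",
      eym_value + 0.5 * np.linalg.norm(Y @ (np.eye(m) - P_X)) ** 2)
```

The simultaneous update of $P$ and $Q$ mimics gradient flow on problem (12); any sufficiently small step size should behave similarly.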
F. Theorem 35(a) in (Bah et al., 2019)

For completeness, here we recall Theorem 35(a) in (Bah et al., 2019), adapted to our notation.
Theorem 35(a) (Bah et al., 2019). Suppose that $X$ has full column-rank and that $\mathrm{rank}(YX^\dagger X) \ge r$. Then gradient flow (13) converges to a global minimizer of $L_N$ restricted to $\mathcal{M}_{N,r'}$, for some $r' \le r$, from any initialization outside of a subset with Lebesgue measure zero.

The key drawback of Theorem 35(a) above is that it cannot ensure the convergence of gradient flow (13) to a global minimizer of $L_N$. For example, if $r' = 0$ above, then gradient flow converges to the zero matrix, which is known to be a non-strict saddle point of problem (6) when $N \ge 3$.
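To make the issue concrete, consider the following short expansion, added here only for illustration (it is not part of the original argument). Around the all-zero tuple and along a perturbation $(\Delta_1,\cdots,\Delta_N)$,
\[
L_N(\Delta_1,\cdots,\Delta_N)
= \frac{1}{2}\left\|Y - \Delta_N\cdots\Delta_1X\right\|_F^2
= \frac{1}{2}\left\|Y\right\|_F^2 - \langle Y, \Delta_N\cdots\Delta_1X\rangle + \frac{1}{2}\left\|\Delta_N\cdots\Delta_1X\right\|_F^2,
\]
where the two perturbation-dependent terms are of order $N$ and $2N$ in the perturbation, respectively. For $N \ge 3$, every first- and second-order term therefore vanishes: the gradient and the Hessian of $L_N$ are both zero at the origin. At the same time, whenever $YX^\top \ne 0$, scaling a direction with $\langle Y, \Delta_N\cdots\Delta_1X\rangle > 0$ by a small $t > 0$ decreases $L_N$ at order $t^N$, so the origin is a saddle point, but not a strict one.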
G. Proof of Lemma 3.1

On the one hand, note that the objective function $L_N(\mathbf{W}_N)$ of problem (6) is analytic in $\mathbf{W}_N$. On the other hand, recall the assumption that $XX^\top$ is invertible. Then, regardless of initialization, gradient flow (13) is bounded, i.e., contained in a finite ball centered at the origin; see Step 1 in the proof of Theorem 11 in (Bah et al., 2019).

We can now invoke the Łojasiewicz theorem, see for example Theorem 10 in (Bah et al., 2019) or (Absil et al., 2005; Lojasiewicz, 1982), to conclude that gradient flow converges to an FOSP $\mathbf{W}_N$ of problem (6), regardless of initialization.

Lastly, gradient flow (13) avoids strict saddle points of $L_N$ for almost every initialization $\mathbf{W}_{N,0}$, with respect to the Lebesgue measure in $\mathbb{R}^{\mathbf{d}_N}$; see Theorem 4.1 in (Lee et al., 2016). (Strictly speaking, Theorem 4.1 in (Lee et al., 2016) is stated for gradient descent with a sufficiently small step size. However, their claim also holds in the limit case of gradient flow, as the step size of gradient descent goes to zero.)

We conclude that the limit point $\mathbf{W}_N$ of gradient flow (13) is in fact an SOSP of problem (6), for almost every initialization with respect to the Lebesgue measure in $\mathbb{R}^{\mathbf{d}_N}$. This completes the proof of Lemma 3.1. Part of this argument is identical to the one in Theorem 11 of (Bah et al., 2019).
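The following sketch (not the paper's code; the widths, data, initialization scale, and step size are hypothetical) runs gradient descent with a small step size as a proxy for gradient flow (13). At the all-zero tuple the gradient of every layer vanishes and the iterate never moves, whereas from a generic small random initialization the loss decreases and the gradient norm decays, consistent with convergence to a first-order stationary point (and, for almost every initialization, an SOSP) as claimed in Lemma 3.1.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_y, m, N = 4, 3, 10, 3
widths = [d_x, 5, 5, d_y]                       # hypothetical layer widths d_0, ..., d_N
X = rng.standard_normal((d_x, m))                # X X^T is invertible almost surely
Y = rng.standard_normal((d_y, m))

def prod(mats, eye_dim):
    """Product of the listed factors, multiplied right-to-left; identity if empty."""
    out = np.eye(eye_dim)
    for M in mats:
        out = M @ out
    return out

def loss_and_grads(Ws):
    W = prod(Ws, d_x)                            # end-to-end matrix W_N ... W_1
    R = (W @ X - Y) @ X.T                        # gradient of the loss w.r.t. W
    grads = []
    for j in range(N):
        left = prod(Ws[j + 1:], widths[j + 1])   # W_N ... W_{j+2}
        right = prod(Ws[:j], d_x)                # W_j ... W_1
        grads.append(left.T @ R @ right.T)
    return 0.5 * np.linalg.norm(W @ X - Y) ** 2, grads

# At the all-zero tuple every layer gradient vanishes, so gradient descent
# initialized there never moves; for N >= 3 this point is a non-strict saddle.
_, g0 = loss_and_grads([np.zeros((widths[j + 1], widths[j])) for j in range(N)])
print("gradient norm at the all-zero initialization:",
      sum(np.linalg.norm(g) for g in g0))

# From a generic small random initialization, the loss decreases and the
# gradient norm decays, consistent with convergence to an FOSP (Lemma 3.1).
Ws = [0.1 * rng.standard_normal((widths[j + 1], widths[j])) for j in range(N)]
eta = 1e-3                                       # hypothetical step size
for t in range(20001):
    val, grads = loss_and_grads(Ws)
    if t % 4000 == 0:
        gnorm = sum(np.linalg.norm(g) for g in grads)
        print(f"t={t:6d}  loss={val:.6f}  grad norm={gnorm:.2e}")
    Ws = [W - eta * g for W, g in zip(Ws, grads)]
```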
H. Proof of Lemma 3.3
In the SVD of $W(t)$ in (19), we let $\{s_i(t)\}_{i=1}^{\min(d_y,d_x)}$ denote the singular values of $W(t)$, in no particular order, with the corresponding left and right singular vectors denoted by $\{u_i(t), v_i(t)\}_i$, for every $t \ge 0$.

On the one hand, the evolution of the singular values of $W(t)$ in (19) is described by Theorem 3 in (Arora et al., 2019a) as
\[
\dot{s}_i(t) = -N\, s_i(t)^{2-\frac{2}{N}} \cdot u_i(t)^\top \nabla L(W(t))\, v_i(t), \tag{105}
\]
for every $t \ge 0$, where $L$ was defined in (10a). Moreover, since the network depth $N \ge 2$, $\{s_i(t)\}_i$ remain nonnegative for every $t \ge 0$.

On the other hand, the singular values of $W(t)$ are bounded, i.e., $\max_i \sup_t s_i(t) < \infty$. Indeed, if $XX^\top$ is invertible, then gradient flow (13) is bounded regardless of initialization, see Step 1 in the proof of Theorem 11 in (Bah et al., 2019). Recall also that gradient flow (13) and induced flow (17) are related through the map
\[
\mathbb{R}^{\mathbf{d}_N} \to \mathbb{R}^{d_y \times d_x}, \qquad \mathbf{W}_N = (W_1,\cdots,W_N) \to W = W_N\cdots W_1. \tag{106}
\]
Consequently, induced flow (17), and a fortiori its singular values, are bounded.

We finally apply Lemma 4 in (Arora et al., 2019a) to (105) and find that each $s_i(t)$ is either zero for all $t \ge 0$, or positive for all $t \ge 0$, provided that the network depth $N \ge 2$. In other words, $\mathrm{rank}(W(t))$ is invariant in $t$, i.e.,
\[
\mathrm{rank}(W(t)) = \mathrm{rank}(W_0), \qquad \forall t \ge 0, \tag{107}
\]
which completes the proof of Lemma 3.3.

I. Proof of Lemma 3.4
It is easy to see that M N ,r is not a closed set for any inte-ger r . For example, one can construct a sequence of rank- matrices that converge to the zero matrix.For the second claim in Lemma 3.4, the proof is by inductionover the depth N of the linear network.For the base of induction, when N = 1 , note that W N = W ∈ R d × d , see (3). It now follows from (9) that min( d , d ) = r. (see (9)) (108)In turn, it follows from (108) that almost every W is rank- r ,with respect to the Lebesgue measure in R d N = R d y × d x .For the step of induction, suppose thatrank ( W ) = r, with W = W N · · · W , (109) for almost every W N = ( W , · · · , W N ) , with respect tothe Lebesgue measure in R d N . In particular, it followsfrom (109) that range ( W ) is almost surely an r -dimensionalsubspace in R d N , i.e., dim( range ( W )) = r, almost surely . (110)Consider a generic matrix W N +1 , with respect to theLebesgue measure in R d N +1 × d N . We distinguish two cases.In the first case, suppose that d N +1 ≥ d N . Then, W N +1 ∈ R d N +1 × d N has a trivial null space almost surely. With nullstanding for null space of a matrix, it follows thatnull ( W N +1 W ) = null ( W ) , (111)and, consequently,rank ( W N +1 W N · · · W )= rank ( W N +1 W ) (see (109)) = d − dim( null ( W N +1 W ))= d − dim( null ( W )) (see (111)) = rank ( W ) = r, (112)almost surely with respect to the Lebesgue measure in R d N +1 × d N . Above, the third and last lines use the fun-damental theorem of linear algebra.In the second case, suppose that d N +1 < d N . Then the nullspace of W N +1 ∈ R d N +1 × d N is a generic ( d N − d N +1 ) -dimensional subspace of R d N . It follows that dim( null ( W N +1 )) = d N − d N +1 ≤ d N − r, (113)where the inequality above holds by (9). Note that dim( range ( W )) + dim( null ( W N +1 )) ≤ r + ( d N − r ) (see (110,113)) = d N . (114)Since null ( W N +1 ) is a generic subspace in R d N that satis-fies (114), it almost surely holds thatrange ( W ) ∩ null ( W N +1 ) = { } . (115)Consequently,null ( W N +1 W ) = null ( W ) , (see (115)) (116)and it follows identically to (112) thatrank ( W N +1 · · · W ) = r, (117)almost surely with respect to the Lebesgue measurein R d N +1 .We conclude from (112,117) that the induction is complete,and this in turn completes the proof of the second and finalclaim in Lemma 3.4. raining Linear Neural Networks J. Proof of Lemma 3.7
Recall that the initialization of gradient flow (13) is balanced by Assumption 3.6, and consider induced flow (17). Let us define the set
\[
\mathcal{N}_\alpha(Z) := \Big\{ W \overset{\mathrm{tSVD}}{=} u_W\, s_W\, v_W^\top :\; s_W > (\alpha - \gamma_Z)\, s_Z,\;\; u_W^\top Z_1 v_W > \alpha\, s_Z \Big\} \subset \mathbb{R}^{d_y \times d_x}, \tag{118}
\]
for $\alpha \in [\gamma_Z, 1)$. Once initialized in $\mathcal{N}_\alpha(Z)$, induced flow remains there, as detailed in the next technical lemma.

Lemma J.1.
For induced flow (17) and $\alpha \in [\gamma_Z, 1)$, $W_0 \in \mathcal{N}_\alpha(Z)$ implies that $W_t \in \mathcal{N}_\alpha(Z)$ for all $t \ge 0$. That is,
\[
W_0 \in \mathcal{N}_\alpha(Z) \;\Longrightarrow\; W_t \in \mathcal{N}_\alpha(Z), \qquad \forall t \ge 0. \tag{119}
\]
Above, Assumption 3.6 and the notation therein are in force.
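As a numerical illustration of Lemma J.1 (a sketch added here; it is not the paper's code, and the dimensions, spectrum of $Z$, initialization scale, and step size are hypothetical), the snippet below simulates gradient descent with a small step size as a stand-in for gradient flow (13), with whitened data, one layer of width one, and a balanced rank-one initialization aligned with the leading singular pair of $Z$. The two quantities that define $\mathcal{N}_\alpha(Z)$ in (118), namely the singular value $s_t$ of the end-to-end matrix and the alignment $u_t^\top Z_1 v_t$, stay bounded away from zero and in fact increase toward $s_Z$, in line with the limit identified later in Lemma 4.1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, N = 6, 5, 3
m = d_x
X = np.sqrt(m) * np.eye(d_x)                    # whitened data, X X^T = m I, see (23)

# Build a target Z with a clear spectral gap: s_Z = 3 and s_{Z,2} = 1.
U, _ = np.linalg.qr(rng.standard_normal((d_y, d_y)))
V, _ = np.linalg.qr(rng.standard_normal((d_x, d_x)))
Z = U @ np.diag([3.0, 1.0, 0.7, 0.5, 0.3]) @ V[:, :d_y].T
Y = Z @ X                                        # so Z = Y X^T / m, see (25)
uZ, sZ, vZt = np.linalg.svd(Z)
Z1 = sZ[0] * np.outer(uZ[:, 0], vZt[0])          # leading rank-one part of Z

# Balanced rank-one initialization W_j = c a_j a_{j-1}^T with unit vectors a_j,
# so that W_{j+1}^T W_{j+1} = W_j W_j^T; the end vectors align with (u_Z, v_Z).
widths = [d_x, 3, 1, d_y]                        # one layer has a single neuron (r = 1)
a = [vZt[0]] + [rng.standard_normal(w) for w in widths[1:-1]] + [uZ[:, 0]]
a = [v / np.linalg.norm(v) for v in a]
c = 0.1 ** (1 / N)                               # overall scale 0.1 for the product
Ws = [c * np.outer(a[j + 1], a[j]) for j in range(N)]

def prod(mats, eye_dim):
    out = np.eye(eye_dim)
    for M in mats:
        out = M @ out
    return out

eta = 1e-3                                       # small step size, a proxy for (13)
for t in range(20001):
    W = prod(Ws, d_x)
    R = (W @ X - Y) @ X.T
    grads = [prod(Ws[j + 1:], widths[j + 1]).T @ R @ prod(Ws[:j], d_x).T
             for j in range(N)]
    if t % 4000 == 0:
        u, s, vt = np.linalg.svd(W)
        print(f"t={t:6d}  s_t={s[0]:.4f}  u^T Z1 v={u[:, 0] @ Z1 @ vt[0]:.4f}"
              f"  s_Z={sZ[0]:.4f}")
    Ws = [Wj - eta * g for Wj, g in zip(Ws, grads)]
```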
Before proving Lemma J.1 in the next appendix, we show how it helps us prove Lemma 3.7.

Indeed, from Lemma J.1 and the balanced initialization of gradient flow (13), it follows that
\[
\mathbf{W}_{N,0} \in \mathcal{N}_{N,\alpha} \;\Longrightarrow\; \mathbf{W}_{N,t} \in \mathcal{N}_{N,\alpha}, \qquad \forall t \ge 0, \tag{120}
\]
under Assumption 3.6 and for $\alpha \in [\gamma_Z, 1)$, where we used the definition of $\mathcal{N}_{N,\alpha}$ in (28).

Recall that the limit point $\mathbf{W}_N$ of gradient flow (13) exists by Lemma 3.1, since $XX^\top$ is invertible by Assumption 3.6. A byproduct of (120) about the limit point $\mathbf{W}_N$ of gradient flow (13) is that
\[
\mathbf{W}_{N,0} \in \mathcal{N}_{N,\alpha} \;\Longrightarrow\; \mathbf{W}_N \in \mathrm{closure}(\mathcal{N}_{N,\alpha}) \subset \mathcal{M}_{N,1}, \tag{121}
\]
where the set inclusion above holds provided that $\alpha \in (\gamma_Z, 1)$, see (20,28,120). In words, (121) indicates that gradient flow does not converge to the zero matrix. This completes the proof of Lemma 3.7.

J.1. Proof of Lemma J.1
The proof relies on the following technical lemma which, roughly speaking, states that the (rank-one) induced flow (17) always points in a similar direction as the (rank-one) target matrix $Z_1$.

Lemma J.2.
Under Assumption and for α ∈ [ γ Z , , u (cid:62) Z v > αs Z implies that u (cid:62) t Z v t > αs Z for every t ≥ . That is, u (cid:62) Z v > αs Z = ⇒ u (cid:62) t Z v t > αs Z , ∀ t ≥ . (122) Above, W t tSVD = u t s t v (cid:62) t is the rank- induced flow in (17,32) ,and s Z , γ Z were defined in (26) . Before proving Lemma J.2 in the next appendix, let us seehow Lemma J.2 can be used to prove Lemma J.1.Let us fix α ∈ [ γ Z , . If W tSVD = u s v (cid:62) ∈ N α ( Z ) ,then u (cid:62) Z v (cid:62) > αs Z by definition of N α ( Z ) in (118).Lemma J.2 then implies that u (cid:62) t Z v t > αs Z , ∀ t ≥ . (123)To prove Lemma J.1, by the way of contradiction, let τ > be the first time that the induced flow (17) leaves the set N α ( Z ) . It thus holds that s τ = αs Z − s Z, (see (118)) < u (cid:62) τ Z v τ − s Z, , (see (123)) (124)where the first line above uses the continuity of s t as afunction of time t . Indeed, we know s t to be an analyticfunction of t , see (32).On the other hand, let us recall the evolution of the nonzerosingular value of flow (17) from (105), which we repeathere for convenience: ˙ s τ = − N s − N τ · u (cid:62) τ ∇ L ( W τ ) v τ . (see (105)) (125)Recalling the definition of L from (10a) and the whiteneddata assumption in (23), we simplify the above gradient as ∇ L ( W τ ) = W τ XX (cid:62) − Y X (cid:62) (see (10a)) = m ( W τ − Z ) . (see (23,25)) (126)Substituting (126) back into (125) and using the thin SVDof W τ in (32), we write at ˙ s τ = − mN s − N τ · u (cid:62) τ ( W τ − Z ) v τ (see (125,126)) = − mN s − N τ · ( s τ − u (cid:62) τ Zv τ ) (see (32)) > − mN s − N τ ( u (cid:62) τ Z v τ − s Z, − u (cid:62) τ Zv τ ) (see (124)) = − mN s − N τ (cid:0) u (cid:62) τ Z v τ − s Z, − u (cid:62) τ Z v τ − u (cid:62) τ Z + v τ (cid:1) (see (170)) = − mN s − N τ ( − s Z, − u (cid:62) τ Z + v τ ) ≥ , (127)which pushes the singular value up and thus pushes theinduced flow back into N α ( Z ) . That is, the induced flowcannot escape from N α ( Z ) .In the last line of (127), we used the fact that s Z, is thesecond largest singular value of Z and hence the largestsingular value of the residual matrix Z + , see (171). In thesame line, we also used the fact that u τ , v τ are unit-lengthvectors by construction, so that u (cid:62) τ Z + v τ ≥ − s Z, . Thiscompletes the proof of Lemma J.1. raining Linear Neural Networks J.2. Proof of Lemma J.2
From (32), recall the thin SVD of induced flow (17), i.e., W t tSVD = u t s t v (cid:62) t , ∀ t ≥ , (128)where (cid:107) u t (cid:107) = (cid:107) v t (cid:107) = 1 , (129)and the only nonzero singular value is s t > .By taking the derivative of the identities in (129) with re-spect to t , we find that u (cid:62) t ˙ u t = 0 ,v (cid:62) t ˙ v t = 0 , ∀ t ≥ . (130)By taking derivative of both sides of the thin SVD (128),we also find that ˙ W t = ˙ u t s t v (cid:62) t + u t ˙ s t v (cid:62) t + u t s t ˙ v (cid:62) t , ∀ t ≥ . (131)Let U t ∈ R d y × ( d y − with orthonormal columns be orthog-onal to u t . By multiplying both sides of (131) by U (cid:62) t , wefind that U (cid:62) t ˙ W t = U (cid:62) t ˙ u t s t v (cid:62) t , ∀ t ≥ , (132)which after rearranging yields that U (cid:62) t ˙ u t = s − t U (cid:62) t ˙ W t v t , ∀ t ≥ . (133)Combining (130,133) yields that ˙ u t = s − t P U t ˙ W t v t , ∀ t ≥ , (134)where P U t = U t U (cid:62) t is the orthogonal projection onto thesubspace orthogonal to u t .Similarly, let V t ∈ R d x × ( d x − with orthonormal columnsbe orthogonal to v t . As before, by multiplying both sidesof (131) by V t , we find that ˙ W t V t = u t s t ˙ v (cid:62) t V t , ∀ t ≥ , (135)which after rearranging yields V (cid:62) t ˙ v t = s − t V (cid:62) t ˙ W (cid:62) t u t , ∀ t ≥ . (136)Then, combining (130,136) leads us to ˙ v t = s − t P V t ˙ W (cid:62) t u t , ∀ t ≥ , (137)where P V t = V t V (cid:62) t .Both expressions (134,137) involve ˙ W t . Under the assump-tion of whitened data in (23), we express ˙ W t as ˙ W t = −A W t ( W t XX (cid:62) − Y X (cid:62) ) (see (17)) = − m A W t ( W t − Z ) (see (23,25)) = − mN s − N t ( s t − u (cid:62) t Zv t ) W t (see (173)) + ms − N t P u t Z P V t + ms − N t P U t Z P v t , ∀ t ≥ , (138) where P u t = u t u (cid:62) t and P v t = v t v (cid:62) t . The last identityabove invokes the first part of Lemma N.1, which collectssome basic properties of the operator A W .Substituting ˙ W t back into (134,137), we reach ˙ u t = ms − N t P U t Zv t , (see (138)) ˙ v t = ms − N t P V t Z (cid:62) u t , ∀ t ≥ . (139)It immediately follows from the first identity in (139) that u (cid:62) Z ˙ u t = ms − N t u (cid:62) Z P U t Zv t (see (139)) = ms − N t u (cid:62) Z P U t ( u Z s Z v (cid:62) Z + Z + ) v t (see (170)) = ms − N t s Z u (cid:62) Z P U t u Z · v (cid:62) Z v t + ms − N t u (cid:62) Z P U t Z + v t = ms − N t s Z (cid:107)P U t u Z (cid:107) · v (cid:62) Z v t + ms − N t u (cid:62) Z P U t Z + v t , ∀ t ≥ . (140)To bound the last term above, note that | u (cid:62) Z P U t Z + v t |≤ (cid:107)P U t u Z (cid:107) · (cid:107) Z + v t (cid:107) (Cauchy-Schawrz ineq.) ≤ (cid:107)P U t u Z (cid:107) · s Z, (cid:107)P V Z v t (cid:107) , (see (170,171)) (141)where s Z, is the second largest singular value of Z and thusthe largest singular value of the residual matrix Z + . Above, V Z ∈ R d x × ( d x − with orthonormal columns is orthogonalto v Z , see (170,171).Similarly, it follows from the second identity in (139) that v (cid:62) Z ˙ v t = ms − N t v (cid:62) Z P V t Z (cid:62) u t (see (139)) = ms − N t v (cid:62) Z P V t ( v Z s Z u (cid:62) Z + Z (cid:62) + ) u t (see (170)) = ms − N t s Z (cid:107)P V t v Z (cid:107) · u (cid:62) Z u t + ms − N t v (cid:62) Z P V t Z (cid:62) + u t , ∀ t ≥ . 
(142)To bound the last term above, we write that | v (cid:62) Z P V t Z (cid:62) + u t |≤ (cid:107)P V t v Z (cid:107) · (cid:107) Z (cid:62) + u t (cid:107) ≤ (cid:107)P V t v Z (cid:107) · s Z, (cid:107)P U Z u t (cid:107) , (143)where U Z ∈ R d y × ( d y − with orthonormal columns is or-thogonal to u Z , see (170). raining Linear Neural Networks All these calculations in (140-143) allow us to write thatd ( u (cid:62) t Z v t ) d t = s Z d ( u (cid:62) t u Z v (cid:62) Z v t ) d t (see (170)) = s Z ( u (cid:62) Z ˙ u t )( v (cid:62) Z v t )+ s Z ( u (cid:62) Z u t )( v (cid:62) Z ˙ v t ) (product rule) = ms Z s − N t (cid:107)P U t u Z (cid:107) · ( v (cid:62) Z v t ) + ms Z s − N t (cid:107)P V t v Z (cid:107) · ( u (cid:62) Z u t ) + R t , ∀ t ≥ , (144)where the residual R t satisfies | R t | ≤ ms Z s Z, s − N t (cid:107)P U t u Z (cid:107) (cid:107)P V Z v t (cid:107) | v (cid:62) Z v t | + ms Z s Z, s − N t (cid:107)P V t v Z (cid:107) (cid:107)P U Z u t (cid:107) | u (cid:62) Z u t | . (145)Let us set a t := u (cid:62) Z u t , b t := v (cid:62) Z v t , (146)for short. Then we can rewrite (144,145) asd ( u (cid:62) t Z v t ) d t = ms Z s − N t (cid:0) (1 − a t ) b t + a t (1 − b t ) (cid:1) + R t , ∀ t ≥ , (147)where | R t | ≤ ms Z s Z, s − N t (cid:113) − a t (cid:113) − b t ( | a t | + | b t | ) ≤ ms Z s Z, s − N t (cid:113) (1 − a t )(1 − b t ) ≤ ms Z s Z, s − N t (1 − a t b t ) , (148)and the second above uses the fact that | a t | = | u (cid:62) Z u t | ≤ (cid:107) u Z (cid:107) · (cid:107) u t (cid:107) ≤ , (149)for every t ≥ , and similarly | b t | ≤ . The third linein (148) uses the inequality (cid:113) (1 − a t )(1 − b t ) = (cid:113) − a t − b t + a t b t ≤ (cid:113) − a t b t + a t b t = 1 − a t b t , (150)where the last line above again uses (149).The residual R t is small when the spectral gap of Z is large.Indeed, note that | R t | ≤ ms Z s Z, s − N t (1 − a t b t ) (see (148)) < ms Z s − N t a t b t (1 − a t b t ) , (151) where the last line above holds provided that a t b t > s Z, s Z = γ Z . (see (26)) (152)We continue and bound the last line of (151) as | R t | < ms Z s − N t a t b t (1 − a t b t ) (see (151)) = 2 ms Z s − N t ( a t b t − a t b t ) ≤ ms Z s − N t (cid:18) a t + b t − a t b t (cid:19) = ms Z s − N t ( a t + b t − a t b t )= ms Z s − N t ((1 − a t ) b t + a t (1 − b t )) . (153)By comparing the above bound on the residual R t with (147), for a fixed time t , we conclude thatd ( u (cid:62) t Z v t ) d t > , (154)provided that u (cid:62) t Z v t = s Z · u (cid:62) t u Z v (cid:62) Z v t (see (170)) = s Z · a t b t (see (146)) > s Z γ Z . (see (152)) (155)For α ∈ [ γ Z , , it immediately follows from (155) that u (cid:62) Z v > αs Z = ⇒ u (cid:62) t Z v t > αs Z , (156)for every t ≥ , which completes the proof of Lemma J.2. K. Lazy Training
For completeness, here we verify that Theorem 1 in (Arora et al., 2018a) suffers from lazy training (Chizat et al., 2019). To begin, let us recall Theorem 1 in (Arora et al., 2018a), adapted to our setting.
Theorem K.1.
For c > , suppose that gradient flow isinitialized at W N , = ( W , , · · · , W N, ) where W = W N, · · · W , satisfies (cid:107) W − Z (cid:107) F ≤ σ min ( Z ) − c . Here, σ min ( Z ) stands for the smallest singular value of Z , definedin (25) . Suppose also that W N , is balanced, see Defini-tion 3.2. Then, for a target accuracy (cid:15) > , it holds that l ( t ) = 12 (cid:107) W ( t ) − Z (cid:107) F ≤ (cid:15), ∀ t ≥ c − N ) log( l (0) /(cid:15) ) , (157) where W ( t ) = W N ( t ) · · · W ( t ) . For the sake of clarity, let us assume that d x = m and X = √ mI d x , which satisfies the whitened requirement raining Linear Neural Networks in (23) and (Arora et al., 2018a). Here, I d x ∈ R d x × d x is theidentity matrix.Recalling the loss function L N in (6) and the initialization W N , ∈ R d N of gradient flow (13), we write that L N ( W N , )= 12 (cid:107) Y − W N, · · · W , X (cid:107) F (see (6)) = 12 (cid:107) Y − √ mW N, · · · W , (cid:107) F ( X = √ mI d x )= m (cid:13)(cid:13)(cid:13)(cid:13) Y X (cid:62) m − W n, · · · W , (cid:13)(cid:13)(cid:13)(cid:13) F ( X = √ mI d x )= m (cid:107) Z − W N, · · · W , (cid:107) F . (see (25)) (158)Definition 2 in (Arora et al., 2018a) requires the last lineabove and, consequently, L N ( W N , ) to be small. In turn, L N ( W N , ) appears in Equation (1) in (Chizat et al., 2019).Definition 2 in (Arora et al., 2018a) thus requires the factor κ in Equation (1) in (Chizat et al., 2019) to be small, whichis how the authors define the lazy training regime there. L. Proof of Lemma 4.1
From Theorem 3.8, recall that gradient flow (13) convergesto a solution of (6) from almost every balanced initializationin the set N N ,α . That is, lim t →∞ (cid:107) Y − W N ( t ) · · · W ( t ) X (cid:107) F = min W , ··· ,W N (cid:107) Y − W N · · · W X (cid:107) F . (159)On the other hand, recall that gradient flow (13) induces theflow (17) under the surjective map R d N → M , ··· ,r W N = ( W , · · · , W N ) → W = W N · · · W , (160)where M , ··· ,r is the set of all d y × d x matrices of rankat most r , see Appendix A for the proof of the surjectiveproperty.In view of (159), induced flow (17) therefore satisfies lim t →∞ (cid:107) Y − W ( t ) X (cid:107) F = min rank ( W ) ≤ r (cid:107) Y − W X (cid:107) F , (161)where W ( t ) = W N ( t ) · · · W ( t ) .Let P X = X † X and P X ⊥ = I m − P X denote the orthogo-nal projections onto the row span of X and its orthogonalcomplement, respectively. We can decompose Y as Y = Y P X + Y P X . (162) Using this decomposition, we can rewrite (161) as lim t →∞ (cid:107) Y P X − W ( t ) X (cid:107) F = min rank ( W ) ≤ r (cid:107) Y P X − W X (cid:107) F . (163)That is, in words, a linear network can only learn the com-ponent of Y within the row span of X .Under Assumption 3.6, the data matrix X is whitened, sothat P X = X † X = m X (cid:62) X , see (23). We can thereforerevise (163) as lim t →∞ m (cid:107) Z − W ( t ) (cid:107) F = min rank ( W ) ≤ r m (cid:107) Z − W (cid:107) F , (164)where above we also used the definition of Z in (25). Toprove Lemma 4.1, we continue by setting r = 1 in (164).Recall also from Assumption 3.6 and specifically (26) that Z has a nontrivial spectral gap, i.e., s Z > s Z, . Therefore, Z = u Z s Z v (cid:62) Z is the unique solution of the optimizationproblem in (164), where the vectors u Z , v Z are the corre-sponding leading left and right singular vectors of Z , seefor example Section 1 in (Golub et al., 1987). In view ofthis, it now follows from (164) with r = 1 that lim t →∞ (cid:107) Z − W ( t ) (cid:107) F = 0 , (165)which completes the proof of Lemma 4.1. M. Derivation of (33)
From Appendix N, we will use the orthonormal bases (cid:101) U t =[ u t , U t ] and (cid:101) V t = [ v t , V t ] . We decompose the loss functionin these two bases as L , ( W t ) = 12 (cid:107) W t − Z (cid:107) F (see (31)) = 12 (cid:107)P u t ( W t − Z ) P v t (cid:107) F + 12 (cid:107)P u t ( W t − Z ) P V t (cid:107) F + 12 (cid:107)P U t ( W t − Z ) P v t (cid:107) F + 12 (cid:107)P U t ( W t − Z ) P V t (cid:107) F , (166)where P u t = u t u (cid:62) t is the orthogonal projection onto thespan of u t , and the remaining projection operators aboveare defined similarly.Recalling the thin SVD W t = u t s t v (cid:62) t from (32) allows us raining Linear Neural Networks to simplify (166) as L , ( W t ) = 12 ( s t − u (cid:62) t Z v t ) + 12 (cid:107)P u t Z P V t (cid:107) F + 12 (cid:107)P U t Z P v t (cid:107) F + 12 (cid:107)P U t Z P V t (cid:107) F = 12 ( s t − u (cid:62) t Z v t ) + 12 (cid:107) Z (cid:107) F − (cid:107)P u t Z P v t (cid:107) F . (167)Using the thin SVD Z = u Z s Z v (cid:62) Z from (170) simplifiesthe last line above as (cid:107) Z (cid:107) F − (cid:107)P u t Z P v t (cid:107) F = s Z − s Z ( u (cid:62) t u Z ) ( v (cid:62) Z v t ) . (see (170)) (168)Substituting the above identity back into (167) yields that L , ( W t )= 12 ( s t − u (cid:62) t Z v t ) + s Z − ( u (cid:62) t u Z ) ( v (cid:62) Z v t ) ) (see (167,168)) = 12 ( s t − u (cid:62) t Z v t ) + 12 ( s Z − ( u (cid:62) t Z v t ) ) (see (170)) = 12 ( s t − u (cid:62) t Z v t ) + 12 ( s Z + u (cid:62) t Z v t )( s Z − u (cid:62) t Z v t ) ≤
12 ( s t − u (cid:62) t Z v t ) + s Z ( s Z − u (cid:62) t Z v t ) , (169)where the second identity and the inequality above useagain the thin SVD of Z . The inequality above alsouses u (cid:62) t Z v t ≤ s Z twice, which holds true because u t , v t are unit-length vectors by construction and s Z is the onlynonzero singular of Z , see (170). This completes the deriva-tion of (33). N. Proof of Lemma 4.2
To begin, recall from (27) that Z is the leading rank- component of Z , and let Z + = Z − Z denote the corre-sponding residual. We thus decompose Z as Z = Z + Z + SVD = u Z · s Z · v (cid:62) Z + U Z S Z V (cid:62) Z , (170) where U Z , V Z contain the remaining left and right singularvectors of Z , and S Z contains the remaining singular valuesof of Z . In particular, let us repeat that s Z = (cid:107) Z (cid:107) = (cid:107) Z (cid:107) , s Z, = (cid:107) Z + (cid:107) , (see (26)) (171)where (cid:107) · (cid:107) stands for spectral norm.In this appendix, we compute the evolution of loss function L , with time, which we recall from (34) asd L ,r ( W t ) d t = − m (cid:104) W t − Z , A W t ( W t − Z ) (cid:105) (see (34)) = − m (cid:104) W t − Z , A W t ( W t − Z ) (cid:105) + m (cid:104) W t − Z , A W t ( Z + ) (cid:105) . (see (170)) (172)To proceed, we will recall some basic properties of theoperator A W , proved by algebraic manipulation of (16) andincluded in Appendix N.1 for completeness. The secondidentity below appears also in Lemma 5 of (Bah et al., 2019). Lemma N.1.
For an arbitrary W ∈ R d y × d x , let W SVD = (cid:101) U (cid:101) S (cid:101) V (cid:62) denote its SVD, where (cid:101) U , (cid:101) V are orthonormal bases and (cid:101) S contains the singular values of W . Then, for an arbitrary ∆ ∈ R d y × d x , it holds that A W (∆) = (cid:101) U N (cid:88) j =1 ( (cid:101) S (cid:101) S (cid:62) ) N − jN ( (cid:101) U (cid:62) ∆ (cid:101) V )( (cid:101) S (cid:62) (cid:101) S ) j − N (cid:101) V (cid:62) , (cid:104) ∆ , A W (∆) (cid:105) = N (cid:88) j =1 (cid:13)(cid:13)(cid:13) ( (cid:101) S (cid:101) S (cid:62) ) N − j N ( (cid:101) U (cid:62) ∆ (cid:101) V )( (cid:101) S (cid:62) (cid:101) S ) j − N (cid:13)(cid:13)(cid:13) F . (173)For the first inner product in the last line of (172), we invokethe second identity in (173) to write that (cid:104) W t − Z , A W t ( W t − Z ) (cid:105) (174) = N (cid:88) j =1 (cid:13)(cid:13)(cid:13) ( (cid:101) S t (cid:101) S (cid:62) t ) N − j N (cid:101) U (cid:62) t ( W t − Z ) (cid:101) V t ( (cid:101) S (cid:62) t (cid:101) S t ) j − N ) (cid:13)(cid:13)(cid:13) F = N (cid:88) j =1 (cid:13)(cid:13)(cid:13) ( (cid:101) S t (cid:101) S (cid:62) t ) N − j N ( (cid:101) S t − (cid:101) U (cid:62) t Z (cid:101) V t )( (cid:101) S (cid:62) t (cid:101) S t ) j − N ) (cid:13)(cid:13)(cid:13) F , where the second line above uses the SVD W t SVD = (cid:101) U t (cid:101) S t (cid:101) V (cid:62) t . (see (19)) (175)We next simplify the last line of (174).In view of the thin SVD W t tSVD = u t s t v (cid:62) t in (32), we let U t ∈ R d y × ( d y − and V t ∈ R d x × ( d x − be orthogonal raining Linear Neural Networks complements for u t and v t , respectively. This allows us todecompose W t as W t SVD = (cid:101) U t (cid:101) S t (cid:101) V (cid:62) t (see (175)) = (cid:2) u t U t (cid:3) (cid:20) s t (cid:21) (cid:20) v (cid:62) t V (cid:62) t (cid:21) , (176)for every t ≥ , where above is the ( d y − × ( d x − zero matrix. Using (176), we simplify (174) to read (cid:104) W t − Z , A W t ( W t − Z ) (cid:105) = N s − N t ( s t − u (cid:62) t Z v t ) + s − N t (cid:107) u (cid:62) t Z V t (cid:107) + s − N t (cid:107) U (cid:62) t Z v t (cid:107) . (177)The two norms above can be further simplified. Let us set a t := u (cid:62) t u Z , b t := v (cid:62) t v Z , (178)for short. Then we expand the first norm in (177) as (cid:107) u (cid:62) t Z V t (cid:107) = s Z (cid:107) u (cid:62) t u Z v (cid:62) Z V t (cid:107) (see (170)) = s Z | u (cid:62) t u Z | · (cid:107) v (cid:62) Z V t (cid:107) = s Z a t (cid:113) − b t , (see (178)) (179)where the last line above follows because V t spans the or-thogonal complement of v t . Likewise, the second normin (177) is expanded as (cid:107) U (cid:62) t Z v t (cid:107) = s Z (cid:107) U (cid:62) t u Z (cid:107) · | v (cid:62) z v t | = s Z b t (cid:113) − a t . (see (178)) (180)In particular, by combining (179,180), we find that (cid:107) u (cid:62) t Z V t (cid:107) + (cid:107) U (cid:62) t Z v t (cid:107) = s Z (cid:0) a t (1 − b t ) + (1 − b t ) a t (cid:1) (see (179,180)) = s Z (cid:0) a t + b t − a t b t (cid:1) ≥ s Z (cid:0) a t b t − a t b t (cid:1) = 2 s Z a t b t (1 − a t b t ) , (181)where the penultimate line above uses the inequality a t + b t ≥ a t b t . Plugging (181) back into (177), we arrive at (cid:104) W t − Z , A W t ( W t − Z ) (cid:105)≥ N s − N t ( s t − u (cid:62) t Z v t ) + 2 s − N t s Z a t b t (1 − a t b t ) . 
(182)For the second inner product in the last line of (172), we invoke the first identity in (173) to write that (cid:104) W t − Z, A W t ( Z + ) (cid:105) = N s − N t ( s t − u (cid:62) t Z v t )( u (cid:62) t Z + v t ) − s − N t (cid:10) u (cid:62) t Z V t , u (cid:62) t Z + V t (cid:11) − s − N t (cid:10) U (cid:62) t Z v t , U (cid:62) t Z + v t (cid:11) , (183)and, consequently, |(cid:104) W t − Z, A W t ( Z + ) (cid:105)|≤ N s − N t | s t − u (cid:62) t Z v t | · | u (cid:62) t Z + v t | + s − N t (cid:107) u (cid:62) t Z V t (cid:107) (cid:107) u (cid:62) t Z + V t (cid:107) + s − N t (cid:107) U (cid:62) t Z v t (cid:107) (cid:107) U (cid:62) t Z + v t (cid:107) , (184)where we twice used the Cauchy-Schwarz inequality above.Recall the decomposition of Z in (170). The three termsin (184) that involve the residual matrix Z + can be simpli-fied as follows. For the first term, we write that | u (cid:62) t Z + v t | = | u (cid:62) t U Z S Z V (cid:62) Z v t | (see (170)) ≤ (cid:107) u (cid:62) t U Z (cid:107) · (cid:107) S Z (cid:107) · (cid:107) V (cid:62) Z v t (cid:107) = s Z, (cid:107) u (cid:62) t U Z (cid:107) · (cid:107) V (cid:62) Z v t (cid:107) = s Z, (cid:113) − a t (cid:113) − b t (see (178)) = s Z, (cid:113) − a t − b t + a t b t ≤ s Z, (cid:113) − a t b t + a t b t = s Z, (1 − a t b t ) , (185)where (cid:107) S Z (cid:107) denotes the spectral norm of the matrix S Z .The second line above uses the fact that s Z, is the sec-ond largest singular value of Z and thus the largest singu-lar value of the residual matrix Z + , see (170,171). Thelast line above uses the observation that a t b t ≤ since u t , v t , u Z , v Z all have unit norm by construction, see (178).Likewise, another term in (184) can be bounded as (cid:107) u (cid:62) t Z + V t (cid:107) = (cid:107) u (cid:62) t U Z S Z V (cid:62) Z V t (cid:107) (see (170)) ≤ (cid:107) u (cid:62) t U Z (cid:107) · (cid:107) S Z (cid:107) · (cid:107) V (cid:62) Z V t (cid:107) = s Z, (cid:107) u (cid:62) t U Z (cid:107) · (cid:107) V (cid:62) Z V t (cid:107)≤ s Z, (cid:107) u (cid:62) t U Z (cid:107) · (cid:107) V Z (cid:107) · (cid:107) V t (cid:107)≤ s Z, (cid:113) − a t , (see (178)) (186)where the last line above uses the fact that both V Z and V t have orthonormal columns.For the last term involving Z + in (184), we similarly writethat (cid:107) U (cid:62) t Z + v t (cid:107) ≤ s Z, (cid:107) V (cid:62) Z v t (cid:107) = s Z, (cid:113) − b t . (187) raining Linear Neural Networks Plugging back (179,180,185,186,187) into (184), we reach |(cid:104) W t − Z, A W t ( Z + ) (cid:105)|≤ N s − N t s Z, | s t − u (cid:62) t Z v t | (1 − a t b t )+ s − N t s Z s Z, ( a t + b t ) (cid:113) − a t (cid:113) − b t ≤ N s − N t s Z, | s t − u (cid:62) t Z v t | (1 − a t b t )+ 2 s − N t s Z s Z, (1 − a t b t ) , (188)where the last line above follows from the chain of inequali-ties ( a t + b t ) (cid:113) − a t (cid:113) − b t ≤ (cid:113) − a t − b t + a t b t (see (178)) ≤ (cid:113) − a t b t + a t b t = 2(1 − a t b t ) . 
(see (178)) (189)Above, in the second and last lines, we used the fact that a t ≤ and b t ≤ , see their definition in (178).By combining (182,188), we can upper bound the evolutionof the loss function asd L , ( W t ) d t ≤ − mN s − N t ( s t − u (cid:62) t Z v t ) − ms − N t s Z a t b t (1 − a t b t )+ mN s − N t s Z, | s t − u (cid:62) t Z v t | (1 − a t b t )+ 2 ms − N t s Z s Z, (1 − a t b t )= − mN s − N t ( s t − u (cid:62) t Z v t ) − ms − N t ( u (cid:62) t Z v t )( s Z − u (cid:62) t Z v )+ mN s − N t γ Z | s t − u (cid:62) t Z v t | ( s Z − u (cid:62) t Z v t )+ 2 ms − N t s Z, ( s Z − u (cid:62) t Z v t ) , (190)where the identity above uses the fact that s Z a t b t = u (cid:62) t Z v t , see (170,178). The inverse spectral gap γ Z = s Z, /s Z was introduced in (26). This completes the proofof Lemma 4.2. N.1. Proof of Lemma N.1
Let $W \overset{\mathrm{SVD}}{=} \widetilde{U}\widetilde{S}\widetilde{V}^\top$ denote the SVD of $W$, where $\widetilde{U}, \widetilde{V}$ are orthonormal bases and $\widetilde{S}$ contains the singular values of $W$. Using the definition of $\mathcal{A}_W$, we write that
\[
\mathcal{A}_W(\Delta) = \sum_{j=1}^{N} \big(WW^\top\big)^{\frac{N-j}{N}}\, \Delta\, \big(W^\top W\big)^{\frac{j-1}{N}} \quad \text{(see (16))}
= \sum_{j=1}^{N} \big(\widetilde{U}\widetilde{S}\widetilde{S}^\top\widetilde{U}^\top\big)^{\frac{N-j}{N}}\, \Delta\, \big(\widetilde{V}\widetilde{S}^\top\widetilde{S}\widetilde{V}^\top\big)^{\frac{j-1}{N}}
\]
\[
= \sum_{j=1}^{N} \widetilde{U}\big(\widetilde{S}\widetilde{S}^\top\big)^{\frac{N-j}{N}}\widetilde{U}^\top \Delta\, \widetilde{V}\big(\widetilde{S}^\top\widetilde{S}\big)^{\frac{j-1}{N}}\widetilde{V}^\top
=: \sum_{j=1}^{N} \mathcal{A}_{W,j}(\Delta), \tag{191}
\]
which proves the first claim in Lemma N.1.

For every $j \in [N]$, it also holds that
\[
\langle \Delta, \mathcal{A}_{W,j}(\Delta)\rangle
= \Big\langle \Delta,\; \widetilde{U}\big(\widetilde{S}\widetilde{S}^\top\big)^{\frac{N-j}{N}}\widetilde{U}^\top \Delta\, \widetilde{V}\big(\widetilde{S}^\top\widetilde{S}\big)^{\frac{j-1}{N}}\widetilde{V}^\top \Big\rangle \quad \text{(see (191))}
= \Big\| \big(\widetilde{S}\widetilde{S}^\top\big)^{\frac{N-j}{2N}}\, \widetilde{U}^\top \Delta \widetilde{V}\, \big(\widetilde{S}^\top\widetilde{S}\big)^{\frac{j-1}{2N}} \Big\|_F^2, \tag{192}
\]
where the last line uses the fact that $\widetilde{U}, \widetilde{V}$ are orthonormal bases. The proof of Lemma N.1 is complete after summing up the above identity over $j$.
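The two identities in Lemma N.1 are straightforward to verify numerically; the sketch below (added for illustration, with hypothetical dimensions) evaluates $\mathcal{A}_W$ through the SVD of $W$ as in (191) and checks the inner-product identity in (173), which in particular shows that $\langle \Delta, \mathcal{A}_W(\Delta)\rangle \ge 0$.

```python
import numpy as np

rng = np.random.default_rng(2)
d_y, d_x, N = 4, 5, 3
W = rng.standard_normal((d_y, d_x))
Delta = rng.standard_normal((d_y, d_x))

U, s, Vt = np.linalg.svd(W)                     # full SVD: U is d_y x d_y, Vt is d_x x d_x

def frac_pow(eigvals, basis, p):
    """(basis diag(eigvals) basis^T)^p for a PSD matrix given in eigen-form."""
    return basis @ np.diag(eigvals ** p) @ basis.T

# Eigenvalues of W W^T and W^T W: squared singular values, padded with zeros.
lam_left = np.concatenate([s ** 2, np.zeros(d_y - len(s))])
lam_right = np.concatenate([s ** 2, np.zeros(d_x - len(s))])

def A_W(D):
    """The operator in (16): sum_j (W W^T)^{(N-j)/N} D (W^T W)^{(j-1)/N}."""
    out = np.zeros_like(D)
    for j in range(1, N + 1):
        left = frac_pow(lam_left, U, (N - j) / N)
        right = frac_pow(lam_right, Vt.T, (j - 1) / N)
        out += left @ D @ right
    return out

# Second identity of Lemma N.1: <Delta, A_W(Delta)> equals the sum of squared
# Frobenius norms of the "half-power" terms, hence A_W is positive semidefinite.
lhs = np.sum(Delta * A_W(Delta))
rhs = 0.0
M = U.T @ Delta @ Vt.T                          # \tilde U^T Delta \tilde V
for j in range(1, N + 1):
    half_left = np.diag(lam_left ** ((N - j) / (2 * N)))
    half_right = np.diag(lam_right ** ((j - 1) / (2 * N)))
    rhs += np.linalg.norm(half_left @ M @ half_right) ** 2
print(lhs, rhs)                                  # the two values should agree
```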
O. Proof of Lemma 4.3

The proof is similar to that of Lemma J.1. Let us fix $\alpha \in [\gamma_Z, 1)$ and $\beta > 1$. If
\[
W_0 \overset{\mathrm{tSVD}}{=} u_0 s_0 v_0^\top \in \mathcal{N}_{\alpha,\beta}(Z), \tag{193}
\]
then $u_0^\top Z_1 v_0 > \alpha s_Z$ by the definition of $\mathcal{N}_{\alpha,\beta}(Z)$ in (36). Lemma J.2 then implies that
\[
u_t^\top Z_1 v_t > \alpha s_Z, \qquad \forall t \ge 0. \tag{194}
\]
To prove Lemma 4.3, by way of contradiction, let $\tau > 0$ be the first time that induced flow (17) leaves the set $\mathcal{N}_{\alpha,\beta}(Z)$. It thus holds that
\[
s_\tau = \alpha s_Z - s_{Z,2}, \tag{195}
\]
or
\[
s_\tau = \beta s_Z, \tag{196}
\]
where both of the identities above use the continuity of $s_t$ as a function of $t$. Indeed, we know $s_t$ to be an analytic function of time $t$, see (32).

The case where (195) happens is handled identically to the proof of Lemma J.1. We therefore focus on the second case, i.e., (196). Recalling the second identity in (127), we bound the evolution of the singular value of induced flow (17) as
\[
\dot{s}_\tau = -mN\, s_\tau^{2-\frac{2}{N}} \cdot \big(s_\tau - u_\tau^\top Z v_\tau\big) \quad \text{(see (127))}
= -mN\, s_\tau^{2-\frac{2}{N}} \big(\beta s_Z - u_\tau^\top Z v_\tau\big) \quad \text{(see (196))}
\]
\[
\le -mN\, s_\tau^{2-\frac{2}{N}} \big(\beta s_Z - s_Z\big) \quad \text{(see (170))}
< 0, \tag{197}
\]
which pushes the singular value down and thus pushes the induced flow back into $\mathcal{N}_{\alpha,\beta}(Z)$. That is, the induced flow cannot escape from $\mathcal{N}_{\alpha,\beta}(Z)$.

In the third line of (197), we used the fact that $s_Z$ is the leading singular value of $Z$, and $u_\tau, v_\tau$ are unit-norm vectors, see (170,128), thus $u_\tau^\top Z v_\tau \le s_Z$. The last line in (197) holds because Lemma 4.3 assumes that $\beta > 1$. This completes the proof of Lemma 4.3.

P. Proof of Theorem 4.4
Let us fix α ∈ [ γ Z , and β > . In view of Lemma 4.3,we assume henceforth that induced flow (17) is initializedwithin N α,β ( Z ) and thus remains there forever, i.e., W t ∈ N α,β ( Z ) , ∀ t ≥ . (198)Using the definition of N α,β ( Z ) in (36) and (198), we canupdate (35) asd L , ( W t ) d t ≤ − mN (( α − γ Z ) s Z ) − N ( s t − u (cid:62) t Z v t ) − αms Z (( α − γ Z ) s Z ) − N ( s Z − u (cid:62) t Z v t )+ mN ( βs Z ) − N γ Z | s t − u (cid:62) t Z v t | ( s Z − u (cid:62) t Z v t )+ 2 m ( βs Z ) − N s Z, ( s Z − u (cid:62) t Z v t ) . (199)Recalling the upper bound on the loss function in (33), wecan distinguish two regimes in the dynamics of (199), de-pending on the dominant term on the right-hand side of (33),as detailed next. (Fast convergence) When
12 ( s t − u (cid:62) t Z v t ) ≥ s Z ( s Z − u (cid:62) t Z v t ) , (200)the loss function can be bounded as L , ( W t ) ≤ ( s t − u (cid:62) t Z v t ) , (see (33,200)) (201)and the evolution of loss in (199) thus simplifies tod L , ( W t ) d t ≤ − mN s − N Z · (202) (cid:16) ( α − γ Z ) − N − γ Z β − N (cid:17) · L , ( W t ) , see Appendix P.1 for the detailed derivation of (202). (Slow convergence) On the other hand, when
12 ( s t − u (cid:62) t Z v t ) ≤ s Z ( s Z − u (cid:62) t Z v t ) , (203)the loss function can be bounded as L , ( W t ) ≤ s Z ( s Z − u (cid:62) t Z v t ) , (see (33,203)) (204)and the evolution of loss in (199) simplifies tod L , ( W t ) d t ≤ − ms − N Z · (205) (cid:16) α ( α − γ Z ) − N − γ Z N β − N (cid:17) · L , ( W t ) , see Appendix P.1 again for the detailed derivation of (205).In view of (200,203), the key transition between fast andslow convergence rates happens when T t := T ,t − s Z T ,t = 12 ( s t − u (cid:62) t Z v t ) − s Z ( s Z − u (cid:62) t Z v t ) (206)changes sign. Above, we used the definition of T ,t and T ,t in (33).Instead of the first time such a sign change happens, it isconvenient to consider the more conservative choice of time τ ≥ when s τ ≤ √ s Z (207)for the first time. Indeed, if (207) does not hold, then T τ > and thus the fast convergence is in force. This claim isverified in Appendix P.2 for completeness.With the definition of τ at hand from (207), we can com-bine (202) and (205) to obtain thatd L , ( W t ) d t ≤ − ms − N L , ( W t ) · N (cid:16) ( α − γ Z ) − N − γ Z β − N (cid:17) t ≤ τ (cid:16) α ( α − γ Z ) − N − γ Z N β − N (cid:17) t ≥ τ. (208)Suppose that inverse spectral gap γ Z is small enough suchthat the right-hand side of (208) is negative, see (26). Usingthis observation that L , ( W t ) is decreasing in t and byapplying the Gronwalls inequality to (208), we arrive atTheorem 4.4. raining Linear Neural Networks P.1. Derivation of (202,205)
We begin with the detailed derivation of (202). Let us re-peat (199) for convenience:d L , ( W t ) d t ≤ − mN (( α − γ Z ) s Z ) − N ( s t − u (cid:62) t Z v t ) − αms Z (( α − γ Z ) s Z ) − N ( s Z − u (cid:62) t Z v t )+ mN ( βs Z ) − N γ Z | s t − u (cid:62) t Z v t | ( s Z − u (cid:62) t Z v t )+ 2 m ( βs Z ) − N s Z, ( s Z − u (cid:62) t Z v t ) . (see (199))Recall also that (200) is in force. By ignoring the nonposi-tive term in the third line above, we arrive atd L , ( W t ) d t (209) ≤ − mN (( α − γ Z ) s Z ) − N ( s t − u (cid:62) t Z v t ) + mN ( βs Z ) − N γ Z | s t − u (cid:62) t Z v t |· (cid:113) s Z ( s Z − u (cid:62) t Z v t )+ 2 m ( βs Z ) − N γ Z s Z ( s Z − u (cid:62) t Z v t ) . To obtain the first inequality in (209), note that W t ∈N α,β ( Z ) by (198) and, in particular, u (cid:62) t Z v t ≥ , ∀ t ≥ , (210)by definition of N α,β ( Z ) in (36). In turn, (210) impliesthat s Z − u (cid:62) t Z v t ≤ s Z .We continue to bound the right-hand side of (209) asd L , ( W t ) d t ≤ − mN (( α − γ Z ) s Z ) − N ( s t − u (cid:62) t Z v t ) + mN √ βs Z ) − N γ Z ( s t − u (cid:62) t Z v t ) + m ( βs Z ) − N γ Z ( s t − u (cid:62) t Z v t ) = − ms − N Z · (cid:18) ( α − γ Z ) − N N − γ Z β − N (cid:18) N √ (cid:19)(cid:19) · ( s t − u (cid:62) t Z v t ) ≤ − ms − N Z · (cid:18) ( α − γ Z ) − N N − γ Z β − N (cid:18) N √ (cid:19)(cid:19) · L , ( W t ) . The first inequality above uses (200) and the last inequalityabove uses (201). We next derive (205). Let us repeat (199) for convenience:d L , ( W t ) d t ≤ − mN (( α − γ Z ) s Z ) − N ( s t − u (cid:62) t Z v t ) − αms Z (( α − γ Z ) s Z ) − N ( s Z − u (cid:62) t Z v t )+ mN ( βs Z ) − N γ Z | s t − u (cid:62) t Z v t | ( s Z − u (cid:62) t Z v t )+ 2 m ( βs Z ) − N s Z, ( s Z − u (cid:62) t Z v t ) . Recall that now (203) is in force. By ignoring the nonpos-itive term in the second line above, we then simplify theabove bound asd L , ( W t ) d t ≤ − αm (( α − γ Z ) s Z ) − N · s Z ( s Z − u (cid:62) t Z v t )+ mN √ βs Z ) − N γ Z · s Z ( s Z − u (cid:62) t Z v t )+ m ( βs Z ) − N γ Z · s Z ( s Z − u (cid:62) t Z v t )= − ms − N Z (cid:18) α ( α − γ Z ) − N − β − N γ Z (cid:18) N √ (cid:19)(cid:19) · s Z ( s Z − u (cid:62) t Z v t ) ≤ − ms − N Z (cid:18) α ( α − γ Z ) − N − β − N γ Z (cid:18) N √ (cid:19)(cid:19) · L , ( W t ) . (211)To obtain the first inequality above, note that
12 ( s t − u (cid:62) t Z v t ) ≤ s Z ( s Z − u (cid:62) t Z v t ) (see (203)) ≤ s Z (see (210)) , which, after rearranging, reads as | s t − u (cid:62) t Z v t | ≤ √ s Z . (212)The last inequality in (211) uses (204). P.2. Derivation of (207)
In the slow convergence regime in (203), it holds that
\[
\frac{1}{2}\big(s_t - u_t^\top Z_1 v_t\big)^2 \le s_Z\big(s_Z - u_t^\top Z_1 v_t\big) \quad \text{(see (203))} \quad \le s_Z^2. \quad \text{(see (210))} \tag{213}
\]
On the other hand, we can also lower bound the first term in (213) as $s_Z^2 \ge$
12 ( s t − u (cid:62) t Z v t ) (see (213)) ≥ s t −