Normal Approximation and Confidence Region of Singular Subspaces
aarXiv:
Normal Approximation and ConfidenceRegion of Singular Subspaces
Dong Xia
Department of MathematicsHong Kong University of Science and TechnologyClear Water Bay, Kowloon, Hong Kong.e-mail: [email protected] url: ∼ madxia/ Abstract:
This paper is on the normal approximation of singular sub-spaces when the noise matrix has i.i.d. entries. Our contributions are three-fold. First, we derive an explicit representation formula of the empiricalspectral projectors. The formula is neat and holds for deterministic ma-trix perturbations. Second, we calculate the expected projection distancebetween the empirical singular subspaces and true singular subspaces. Ourmethod allows obtaining arbitrary k -th order approximation of the expectedprojection distance. Third, we prove the non-asymptotical normal approx-imation of the projection distance with different levels of bias corrections.By the (cid:100) log( d + d ) (cid:101) -th order bias corrections, the asymptotical normalityholds under optimal signal-to-noise ration (SNR) condition where d and d denote the matrix sizes. In addition, it shows that higher order approxima-tions are unnecessary when | d − d | = O (( d + d ) / ). Finally, we providecomprehensive simulation results to merit our theoretic discoveries.Unlike the existing results, our approach is non-asymptotical and theconvergence rates are established. Our method allows the rank r to divergeas fast as o (( d + d ) / ). Moreover, our method requires no eigen-gapcondition (except the SNR) and no constraints between d and d . MSC 2010 subject classifications:
Primary 62H10, 62H25; secondary62G20.
Keywords and phrases: singular value decomposition, projection dis-tance, normal approximation, random matrix theory, spectral perturbation.
1. Introduction
Matrix singular value decomposition (SVD) is a powerful tool for various pur-poses across diverse fields. In numerical linear algebra, SVD has been success-fully applied for solving linear inverse problems, low-rank matrix approximationand etc. See, e.g., (Golub and Van Loan, 2012), for more examples. In manymachine learning tasks, SVD is crucial for designing computationally efficientalgorithms, such as matrix and tensor completion ((Cai et al., 2010), (Keshavanet al., 2010), (Cand`es and Tao, 2010), (Xia and Yuan, 2018), (Xia et al., 2017)),and phase retrieval ((Ma et al., 2017), (Candes et al., 2015)), where SVD isoften applied for generating a warm initial point for non-convex optimizationalgorithms. In statistical data analysis, SVD is superior for denoising and di-mension reduction. For instance, SVD, as a dimension reduction tool, is usedfor text classification in (Kim et al., 2005). See also (Li and Wang, 2007). In a r X i v : . [ m a t h . S T ] J u l . Xia/Normal Approximation of SVD (Shabalin and Nobel, 2013), SVD shows appealing performances in low rankmatrix denoising. More specifically, in (Donoho and Gavish, 2014), they provedthat statistically minimax optimal matrix denoising can be attained via precisesingular value thresholding. Recently, matrix SVD is generalized to tensor SVDfor tensor denoising, see (Xia and Zhou, 2019) and (Zhang and Xia, 2018).The perturbation analysis is critical for advancing the theoretical develop-ments of SVD for low-rank matrix denoising where the observed data matrixoften equals a low-rank information matrix plus a noise matrix. The determin-istic perturbation bounds of matrix SVD have been well established by Davis-Kahan ((Davis and Kahan, 1970), (Yu et al., 2014)) and Wedin ((Wedin, 1972))many years ago. Among those deterministic perturbation bounds, one simpleyet useful bound shows that the perturbation of singular vectors is governed bythe so-called signal-to-noise ratio (SNR) where ”signal” refers to the smallestnon-zero singular value of the information matrix and the ”noise” refers to thespectral norm of the noise matrix. It is a quite general result since the bounddoes not rely on the wellness of alignments between the singular subspaces ofthe information and of the noise matrices. Such a general bound turns out tobe somewhat satisfactorily sharp when the noise matrix contains i.i.d. randomentries. However, more refined characterizations of singular vectors are neededon the frontiers of statistical inference for matrix SVD. The Davis-Kahan The-orem and Wedin’s perturbation bounds are illustrated by the non-zero smallestsingular value of the information matrix, where the effects of those large singularvalues are usually missing. Moreover, the exact numerical factor is also not wellrecognized.The behavior of singular values and singular vectors of low rank perturba-tions of large rectangular random matrices is popular in recent years. They playa key role in statistical inference with diverse applications. See Li and Li (2018),Naumov et al. (2017), Tang et al. (2018) for some examples in network testing.The asymptotic limits of singular values and singular vectors were firstly devel-oped by (Benaych-Georges and Nadakuditi, 2012), where the convergence rateof the largest singular value was also established. Recently, by (Ding, 2017),more precise non-asymptotic concentration bounds for empirical singular valueswere obtained. 
Meanwhile, (Ding, 2017) also proved non-asymptotic perturba-tion bounds of empirical singular vector when the associated singular value hasmultiplicity 1. In a recent work (Bao et al., 2018), the authors studied theasymptotic limit distributions of the empirical singular subspaces when (scaled)singular values are bounded. Specifically, they showed that if the noise matrixhas Gaussian distribution, then the limit distribution of the projection distanceis also Gaussian. Unlike these prior arts (Ding (2017), Bao et al. (2018)), we focuson the non-asymptotical normal approximations of the joint singular subspacesin a different regime. Our approach allows the rank to diverge, and imposes noconstraints between d and d . In addition, we establish the convergence ratesand impose no eigen-gap conditions (except SNR).In (Xia, 2019), the low rank matrix regression model is investigated wherethe author proposed a de-biased estimator built on nuclear normal penalizedleast squares estimator. The de-biased estimator ends up with an analogous . Xia/Normal Approximation of SVD form of the low rank perturbation of rectangular random matrices. Then, non-asymptotical normal approximation theory of the projection distance is proved,under near optimal sample size requirement. The paramount observation is thatthe mean value in the limit normal distribution is significantly larger than itsstandard deviation. As a result, a much larger than regular sample size require-ment is necessary to tradeoff the estimation error of the expected projectiondistance. Most recently, (Chen et al., 2018) revealed an interesting phenomenonof the perturbation of eigenvalues and eigenvectors of such non-asymmetric ran-dom perturbations, showing that the perturbation of eigen structures is muchsmaller than the singular structures. In addition, some non-asymptotic pertur-bation bounds of empirical singular vectors can be found in (Koltchinskii andXia, 2016),(Bloemendal et al., 2016) and (Abbe et al., 2017). The minimax opti-mal bounds of singular subspace estimation for low rank perturbations of largerectangular random matrices are established in (Cai and Zhang, 2018).Our goal is to investigate the central limit theorems of singular subspacesin the low rank perturbation model of large rectangular random matrices. Asillustrated in (Xia, 2019), the major difficulty arises from how to precisely de-termine the expected projection distance. One conclusive contribution of thispaper is an explicit representation formula of the empirical spectral projector.This explicit representation formula allows us to obtain precise characterizationof the (non-asymptotical) expected projection distance. After those higher orderbias corrections, we prove normal approximation of the singular subspaces withoptimal (in the consistency regime) SNR requirement. For better presentingthe results and highlighting the contributions, let’s begin with introducing thestandard notations. We denote M = U Λ V T the unknown d × d matrix where U ∈ R d × r and V ∈ R d × r are its left and right singular vectors. The diago-nal matrix Λ = diag( λ , · · · , λ r ) contains M ’s non-increasing positive singularvalues. The observed data matrix ˆ M ∈ R d × d satisfies the additive model:ˆ M = M + Z where Z j j i.i.d. ∼ N (0 ,
1) for 1 ≤ j ≤ d , ≤ j ≤ d . (1)Here, we fix the noise variance to be 1, just for simplicity. Let ˆ U ∈ R d × r and ˆ V ∈ R d × r be the top- r left and right singular vectors of ˆ M . Let ˆΛ = diag(ˆ λ , · · · , ˆ λ r )denote the top- r singular values of ˆ M . We focus on the projection distancebetween the empirical and true singular subspaces which is defined bydist [( ˆ U , ˆ V ) , ( U, V )] := (cid:107) ˆ U ˆ U T − U U T (cid:107) + (cid:107) ˆ V ˆ V T − V V T (cid:107) . (2)By Davis-Kahan Theorem ((Davis and Kahan, 1970)) or Wedin’s sin Θ theorem((Wedin, 1972)), dist [( ˆ U , ˆ V ) , ( U, V )] is non-trivial on the event { λ r > (cid:107) Z (cid:107)} .It is well-known that (cid:107) Z (cid:107) = O P ( √ d max ) where (cid:107) · (cid:107) denotes the spectral normand d max = max { d , d } . Therefore, it is convenient to consider λ r (cid:38) √ d max . Inthis paper, we focus on the consistency regime so that the empirical singular We note that, in RMT literature (see, e.g., (Bao et al., 2018),(Ding, 2017)), many worksstudied the problem when λ r = O ( √ d max ) and λ r (cid:38) ( d d ) / . In this paper, we focus on theregime when empirical singular subspaces are consistent, i.e., E dist [( ˆ U, ˆ V ) , ( U, V )] → d max → ∞ . As shown in (Cai and Zhang, 2018), such consistency requires √ rd max /λ r → . Xia/Normal Approximation of SVD subspaces are consistent which requires λ r (cid:29) √ rd max . See, e.g., (Tao, 2012),(Koltchinskii and Xia, 2016), (Cai and Zhang, 2018) and (Vershynin, 2010).Our contributions are summarized as follows.1. An explicit representation formula of ˆ U ˆ U T and ˆ V ˆ V T is derived. In partic-ular, ˆ U ˆ U T and ˆ V ˆ V T can be completely determined by a sum of a series ofmatrix product involving only Λ , U U T , U ⊥ U T ⊥ , V V T , V ⊥ V T ⊥ and Z , where U ⊥ ∈ R d × ( d − r ) and V ∈ R d × ( d − r ) are chosen so that ( U, U ⊥ ) and( V, V ⊥ ) are orthonormal matrices. To derive such a useful representationformula, we apply the Reisz formula, combinatoric formulas, contour inte-grals, residue theorem and generalized Leibniz rule. It worths to point outthat the representation formula is deterministic as long as (cid:107) Z (cid:107) < λ r / ε := (cid:0) dist [( ˆ U , ˆ V ) , ( U, V )] − E dist [( ˆ U , ˆ V ) , ( U, V )] (cid:1) / (cid:0) √ d (cid:63) (cid:107) Λ − (cid:107) F (cid:1) where d (cid:63) = d + d − r . In particular, we show that ˆ ε converges to a stan-dard normal distribution as long as √ rd max /λ r → r /d max → d , d → ∞ . The required SNR is optimal in the consistency regime. Notethat our result allows r to diverge as fast as o (( d + d ) / ). In addition,no conditions on the eigen-gaps (except λ r ) are required. The convergencerate is also established. The proof strategy is based on the Gaussianisoperimetric inequality and Berry-Esseen theorem.3. The unknown E dist [( ˆ U , ˆ V ) , ( U, V )] plays the role of centering in ˆ ε . Toderive user-friendly normal approximations of dist [( ˆ U , ˆ V ) , ( U, V )], it suf-fices to explicitly calculate its expectation (non-asymptotically). By therepresentation formula of ˆ U ˆ U T and ˆ V ˆ V T , we obtain approximations of E dist [( ˆ U , ˆ V ) , ( U, V )]. Different levels of approximating E dist [( ˆ U , ˆ V ) , ( U, V )]ends up with different levels of bias corrections. These levels of approxi-mations are(a) Level-1 approximation: B = 2 d (cid:63) (cid:107) Λ − (cid:107) . 
The approximation erroris (cid:12)(cid:12)(cid:12) E dist [( ˆ U , ˆ V ) , ( U, V )] − B (cid:12)(cid:12)(cid:12) = O (cid:16) rd λ r (cid:17) . (b) Level-2 approximation: B = 2( d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) ) where ∆ d = d − d . Then, (cid:12)(cid:12)(cid:12) E dist [( ˆ U , ˆ V ) , ( U, V )] − B (cid:12)(cid:12)(cid:12) = O (cid:16) rd λ r (cid:17) . (c) Level- k approximation: B k = 2 d (cid:63) (cid:107) Λ − (cid:107) − (cid:80) kk =2 ( − k ∆ d ( d k − − − d k − − ) (cid:107) Λ − k (cid:107) where d − = d − r and d − = d − r . Then, for all k ≥ (cid:12)(cid:12)(cid:12) E dist [( ˆ U , ˆ V ) , ( U, V )] − B k (cid:12)(cid:12)(cid:12) . Xia/Normal Approximation of SVD = O (cid:16) r d max λ r + r √ d max · (cid:16) d max λ r (cid:17) + r (cid:16) C d max λ r (cid:17) k +1 (cid:17) where C > C d max /λ r < B ∞ is also derived. An intriguing fact is that if | d − d | = O ( √ d max ), i.e., the two dimensions of M are comparable, thenhigher level approximations have similar effects as the Level-1 approxima-tion. Simulation results show that Level-1 approximation by B is indeedsatisfactorily accurate when d = d .4. By replacing E dist [( ˆ U , ˆ V ) , ( U, V )] with B k , we prove the normal approx-imation of dist [( ˆ U , ˆ V ) , ( U, V )]. Different levels of bias corrections requiredifferent levels of SNR conditions for the asymptotical normality. For in-stance, we prove the normal approximation of ˆ ε := (cid:0) dist [( ˆ U , ˆ V ) , ( U, V )] − B (cid:100) log d max (cid:101) (cid:1) / (cid:0) √ d (cid:63) (cid:107) Λ − (cid:107) F (cid:1) with the (cid:100) log d max (cid:101) -th order bias correction.More exactly, we show the asymptotical normality of ˆ ε when √ rd max /λ r → r /d max → d , d → ∞ . As far as we know, this is the first re-sult about the limiting distribution of singular subspaces which allows therank r to diverge. Meanwhile, no eigen-gap conditions (except SNR) areneeded. Since our normal approximation is non-asymptotical, we imposeno constraints on the relation between d and d .The rest of the paper is organized as follows. In Section 2, we derive the ex-plicit representation formula of empirical spectral projector. The representa-tion formula is established under deterministic perturbation. We prove normalapproximation of dist [( ˆ U , ˆ V ) , ( U, V )] in Section 3. Especially, we show thatdist [( ˆ U , ˆ V ) , ( U, V )] is asymptotically normal under optimal SNR conditions.In Section 4 and 5, we develop the arbitrarily k -th level approximations of E dist [( ˆ U , ˆ V ) , ( U, V )] and its corresponding normal approximation, where re-quirements for SNR are specifically developed. In Section 6, we propose con-fidence regions and discuss about data-adaptive shrinkage estimator of sin-gular values. We then display comprehensive simulation results in Section 7,where, for instance, we show the importance of higher order approximationsof E dist [( ˆ U , ˆ V ) , ( U, V )] when the matrix has unbalanced sizes and the effec-tiveness of shrinkage estimation of singular values. The proofs are collected inSection 9 and Appendix 9.
2. Representation formula of spectral projectors
Let A and X be d × d symmetric matrices. The matrix A has rank r = rank( A ) ≤ d . Denote the eigen-decomposition of A , A = ΘΛΘ T = r (cid:88) j =1 λ j θ j θ T j where Λ = diag( λ , · · · , λ r ) contains the non-zero non-increasing eigenvalues of A . The d × r matrix Θ = ( θ , · · · , θ r ) consists of A ’s eigenvectors. The noise . Xia/Normal Approximation of SVD matrix X satisfies (cid:107) X (cid:107) < min ≤ i ≤ r | λ i | where (cid:107) · (cid:107) denotes the matrix operatornorm. Given ˆ A = A + X where A and X are unknown, our goal is to estimateΘ. We denote ˆΘ = (ˆ θ , · · · , ˆ θ r ) the d × r matrix containing the eigenvectors of ˆ A with largest r eigenvalues in absolute values. Therefore, ˆΘ represents the empir-ical version of Θ. We derive the representation formula of ˆΘ ˆΘ T for deterministic X . The formula is useful for various of purposes.To this end, define Θ ⊥ = ( θ r +1 , · · · , θ d ) the d × ( d − r ) matrix such that(Θ , Θ ⊥ ) is orthonormal. Define the spectral projector, P ⊥ = d (cid:88) j = r +1 θ j θ T j = Θ ⊥ Θ T ⊥ . Also, define P − := r (cid:88) j =1 λ − j θ j θ T j = ΘΛ − Θ T . Meanwhile, we write P − k = ΘΛ − k Θ T for all k ≥
1. For notational simplicity,we denote P = P ⊥ and denote the k -th order perturbation term S A,k ( X ) = (cid:88) s : s + ··· + s k +1 = k ( − τ ( s ) · P − s X P − s X · · · X P − s k +1 (3)where s = ( s , · · · , s k +1 ) contains non-negative integer indices and τ ( s ) = k +1 (cid:88) j =1 I ( s j > s . For instance, if k = 1, we have S A, ( X ) = P − X P ⊥ + P ⊥ X P − . If k = 2, by considering s + s + s = 2 for s , s , s ≥ S A, ( X ) = (cid:0) P − X P ⊥ X P ⊥ + P ⊥ X P − X P ⊥ + P ⊥ X P ⊥ X P − (cid:1) − (cid:0) P ⊥ X P − X P − + P − X P ⊥ X P − + P − X P − X P ⊥ (cid:1) . Theorem 1. If (cid:107) X (cid:107) < min ≤ i ≤ r | λ i | , then ˆΘ ˆΘ T − ΘΘ T = (cid:88) k ≥ S A,k ( X ) where S A,k ( X ) is defined in (3) and we set P = P ⊥ = Θ ⊥ Θ T ⊥ for notationalsimplicity. Apparently, by eq. (3), a simple fact is (cid:13)(cid:13) S A,k ( X ) (cid:13)(cid:13) ≤ (cid:18) kk (cid:19) · (cid:107) X (cid:107) k λ kr ≤ (cid:16) (cid:107) X (cid:107) λ r (cid:17) k , ∀ k ≥ . . Xia/Normal Approximation of SVD
3. Normal approximation of spectral projectors
Recall from (1) that ˆ M = M + Z ∈ R d × d with M = U Λ V T where U ∈ R d × r and V ∈ R d × r satisfying U T U = I r and V T V = I r . The diagonal matrixΛ = diag( λ , · · · , λ r ) contains non-increasing positive singular values of M . Letˆ U and ˆ V be ˆ M ’s top- r left and right singular vectors. We derive the normalapproximation ofdist [( ˆ U , ˆ V ) , ( U, V )] = (cid:107) ˆ U ˆ U T − U U T (cid:107) + (cid:107) ˆ V ˆ V T − V V T (cid:107) , which is often called the (squared) projection distance on Grassmannians. Tothis end, we clarify important notations which shall appear frequently through-out the paper.To apply the representation formula from Theorem 1, we turn ˆ M , M and Z into symmetric matrices. For notational consistency, we create ( d + d ) × ( d + d ) symmetric matrices asˆ A = (cid:18) M ˆ M T (cid:19) , A = (cid:18) MM T (cid:19) and X = (cid:18) ZZ T (cid:19) . The model (1) is thus translated into ˆ A = A + X . The symmetric matrix A haseigenvalues λ ≥ · · · ≥ λ r ≥ λ − r ≥ · · · ≥ λ − where λ − i = − λ i for 1 ≤ i ≤ r .The eigenvectors corresponding to λ i and λ − i are, respectively, θ i = 1 √ (cid:18) u i v i (cid:19) and θ − i = 1 √ (cid:18) u i − v i (cid:19) for 1 ≤ i ≤ r , where { u i } ri =1 and { v i } ri =1 are the columns of U and V . Here, { θ i } ri =1 may not be uniquely defined if the singular value λ i has multiplicitylarger than 1. However, the spectral projector U U T and V V T are unique re-gardless of the multiplicities of M ’s singular values.Following the same routine of notations, we denoteΘ = ( θ , · · · , θ r , θ − r , · · · , θ − ) ∈ R ( d + d ) × r and Θ ⊥ ∈ R ( d + d ) × ( d + d − r ) such that (Θ , Θ ⊥ ) is an orthonormal matrix.Then, ΘΘ T = (cid:88) ≤| j |≤ r θ j θ T j = (cid:18) U U T V V T (cid:19) and ˆΘ ˆΘ T = (cid:88) ≤| j |≤ r ˆ θ j ˆ θ T j = (cid:18) ˆ U ˆ U T
00 ˆ V ˆ V T (cid:19) where ˆ U and ˆ V represent ˆ M ’s top- r left and right singular vectors. Similarly, . Xia/Normal Approximation of SVD for all k ≥
1, denote P − k = (cid:88) ≤| j |≤ r λ kj θ j θ T j = (cid:32) U Λ − k V T V Λ − k U T (cid:33) if k is odd (cid:32) U Λ − k U T V Λ − k V T (cid:33) if k is even . The orthogonal spectral projector is written as P ⊥ = Θ ⊥ Θ T ⊥ = (cid:18) U ⊥ U T ⊥ V ⊥ V T ⊥ (cid:19) where ( U, U ⊥ ) and ( V, V ⊥ ) are orthonormal matrices. Actually, the columns ofΘ ⊥ can be explicitly expressed by the columns of U ⊥ and V ⊥ . Indeed, if wedenote the columns of Θ ⊥ ∈ R ( d + d ) × ( d + d − r ) byΘ ⊥ = ( θ r +1 , · · · , θ d , θ − r − , · · · , θ − d ), then we can write θ j = (cid:18) u j (cid:19) and θ − j = (cid:18) v j (cid:19) for r + 1 ≤ j ≤ d and r + 1 ≤ j ≤ d .By the above notations, it is clear thatdist [( ˆ U , ˆ V ) , ( U, V )] = (cid:107) ˆΘ ˆΘ T − ΘΘ T (cid:107) . It suffices to prove the normal approximation of (cid:107) ˆΘ ˆΘ T − ΘΘ T (cid:107) . Observe that (cid:107)| ˆΘ ˆΘ T − ΘΘ T (cid:107) =4 r − (cid:10) ΘΘ T , ˆΘ ˆΘ T (cid:11) = − (cid:10) ΘΘ T , ˆΘ ˆΘ T − ΘΘ T (cid:11) . By Theorem 1 and ΘΘ T P ⊥ = 0, we can writedist [( ˆ U , ˆ V ) , ( U, V )] = − (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) =2 (cid:107) P ⊥ X P − (cid:107) − (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) . (4)where we used the fact P ⊥ P ⊥ = P ⊥ so that − (cid:10) ΘΘ T , S A, (cid:11) =2 (cid:10) ΘΘ T , P − X P ⊥ X P − (cid:11) =2tr (cid:0) P − X P ⊥ X P − (cid:1) = 2 (cid:107) P ⊥ X P − (cid:107) . We prove CLT of dist [( ˆ U , ˆ V ) , ( U, V )] with an explicit normalizing factor. . Xia/Normal Approximation of SVD Theorem 2.
Suppose d max ≥ r where d max = max { d , d } . There exist abso-lute constants C , C , c > such that if λ r ≥ C √ d max , then for any s ≥ , sup x ∈ R (cid:12)(cid:12)(cid:12)(cid:12) P (cid:18) dist [( ˆ U , ˆ V ) , ( U, V )] − E dist [( ˆ U , ˆ V ) , ( U, V )] √ d (cid:63) (cid:107) Λ − (cid:107) F ≤ x (cid:19) − Φ( x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ C s / (cid:16) √ r (cid:107) Λ − (cid:107) F λ r (cid:17) · ( rd max ) / λ r + e − s + e − c d max + C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d max , where d (cid:63) = d + d − r and Φ( x ) denotes the c.d.f. of standard normal distri-butions. By setting s = λ r √ rd max , we conclude that sup x ∈ R (cid:12)(cid:12)(cid:12)(cid:12) P (cid:18) dist [( ˆ U , ˆ V ) , ( U, V )] − E dist [( ˆ U , ˆ V ) , ( U, V )] √ d (cid:63) (cid:107) Λ − (cid:107) F ≤ x (cid:19) − Φ( x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ C (cid:16) √ r (cid:107) Λ − (cid:107) F λ r (cid:17) · (cid:115) ( rd max ) / λ r + e − λ r / √ rd max + e − c d max + C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d max . By Theorem 2, the asymptotical normality holds as long as (cid:16) √ r (cid:107) Λ − (cid:107) F λ r (cid:17) · (cid:115) ( rd max ) / λ r → (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d max → d , d → ∞ . If √ r = O ( λ r (cid:107) Λ − (cid:107) F ), then the first condition in (5) is equiva-lent to √ rd max λ r →
0. Such SNR condition is optimal in the consistency regime.In addition, Cauchy-Schwartz inequality implies that (cid:107) Λ − (cid:107) ≤ r · (cid:107) Λ − (cid:107) .Therefore, the second condition in (5) holds when r d max → d , d → ∞ . Therefore, r is allowed to grow as fast as o (cid:0) ( d + d ) / (cid:1) . Remark 1.
The normalization factor √ d (cid:63) (cid:107) Λ − (cid:107) F comes from the fact Var (cid:0) (cid:107) P − X P ⊥ (cid:107) (cid:1) = 8 d (cid:63) (cid:107) Λ − (cid:107) . We remark that Theorem 2 is non-asymptotical and no constraints between d and d are needed. Note that E dist [( ˆ U , ˆ V ) , ( U, V )] in Theorem 2 is not transparent yet. Calcu-lating E dist [( ˆ U , ˆ V ) , ( U, V )] needs delicate analysis. If we approximate E dist [( ˆ U , ˆ V ) , ( U, V )]by its leading term 2 E (cid:107) P − X P ⊥ (cid:107) , we obtain E dist [( ˆ U , ˆ V ) , ( U, V )] = [2 + o (1)] · d (cid:63) (cid:107) Λ − (cid:107) . The primary subject of section 4 is to approximate E dist [( ˆ U , ˆ V ) , ( U, V )] to ahigher accuracy. . Xia/Normal Approximation of SVD
4. Approximating the bias
Recall (4), we have E dist [( ˆ U , ˆ V ) , ( U, V )] = 2 E (cid:107) P ⊥ X P − (cid:107) − (cid:88) k ≥ E (cid:10) ΘΘ T , S A, k ( X ) (cid:11) where we used the fact E S A, k +1 ( X ) = 0 for any positive integer k ≥
1. We aimto determine E (cid:107) P ⊥ X P − (cid:107) and E (cid:10) ΘΘ T , S A, k ( X ) (cid:11) for all k ≥
2. Apparently,by obtaining explicit formulas of E (cid:10) ΘΘ T , S A, k ( X ) (cid:11) for larger k s, we end upwith more precise approximation of E dist [( ˆ U , ˆ V ) , ( U, V )]. In Lemma 1-3, weprovide arbitrarily k -th order approximation of the bias. Lemma 1 (First order approximation) . The following equation holds E (cid:107) P ⊥ X P ⊥ (cid:107) = d (cid:63) (cid:107) Λ − (cid:107) where d (cid:63) = d + d − r . Moreover, if λ r ≥ C √ d max for some large enoughconstant C > , then (cid:12)(cid:12)(cid:12) E dist [( ˆ U , ˆ V ) , ( U, V )] − d (cid:63) (cid:107) Λ − (cid:107) (cid:12)(cid:12)(cid:12) ≤ C r (cid:16) d max λ r (cid:17) where C > is an absolute constant (depending on the constant C ). In Lemma 2, we calculate E (cid:10) ΘΘ T , S A, ( X ) (cid:11) . It yields the second order ap-proximation of E dist [( ˆ U , ˆ V ) , ( U, V )].
Lemma 2 (Second order approximation) . The following fact holds (cid:12)(cid:12)(cid:12) E (cid:10) ΘΘ T , S A, ( X ) (cid:11) − ∆ d (cid:107) Λ − (cid:107) (cid:12)(cid:12)(cid:12) ≤ C r d max λ r where d (cid:63) = d + d − r and ∆ d = d − d and C is an absolute constant.Moreover, if λ r ≥ C √ d max for some large enough constant C > , then (cid:12)(cid:12)(cid:12) E dist [( ˆ U , ˆ V ) , ( U, V )] − (cid:0) d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) (cid:1)(cid:12)(cid:12)(cid:12) ≤ C r d max λ r + C r (cid:16) d max λ r (cid:17) where C , C > are absolute constants (depending on C ). In general, we calculate the arbitrary k -th order approximation in Lemma 3. Lemma 3 (Arbitrary k -th order approximation) . For a positive integer k ≥ and √ d max ≥ log d max and e − c d max ≤ √ d max , the following fact holds (cid:12)(cid:12)(cid:12) E (cid:10) ΘΘ T , S A, k ( X ) (cid:11) − ( − k ( d k − − − d k − − )( d − − d − ) (cid:107) Λ − k (cid:107) (cid:12)(cid:12) . Xia/Normal Approximation of SVD ≤ C ( r + k ) √ d max · (cid:16) C d max λ r (cid:17) k where c , C , C > are some absolute constants. Then, the following boundholds (cid:12)(cid:12)(cid:12) E dist [( ˆ U , ˆ V ) , ( U, V )] − B k (cid:12)(cid:12)(cid:12) ≤ C r d max λ r + C r √ d max · (cid:16) d max λ r (cid:17) + C r (cid:16) C d max λ r (cid:17) k +1 where C , C , C , C are some absolute constants and B k is defined by B k = 2 d (cid:63) (cid:107) Λ − (cid:107) − k (cid:88) k =2 ( − k ( d k − − − d k − − )( d − − d − ) (cid:107) Λ − k (cid:107) . (6)The second and higher order terms involve the dimension difference ∆ d = d − d . If d = d , these higher order approximations essentially have similareffects as the first order approximation. Remark 2.
By choosing k = (cid:100) log d max (cid:101) so that ( C d max /λ r ) k +1 (cid:46) ( d max /λ r ) / √ d max ,we get (cid:12)(cid:12)(cid:12) E dist [( ˆ U , ˆ V ) , ( U, V )] − B (cid:100) log d max (cid:101) (cid:12)(cid:12)(cid:12) ≤ C r d max λ r + C r √ d max · (cid:16) d max λ r (cid:17) for some absolute constants C , C > . In addition, for each ≤ j ≤ r , wehave d − λ − j − ∞ (cid:88) k =2 ( − k ( d − − d − ) d k − − λ − kj = 2 d − ( λ j + d − ) λ j ( λ j + d − ) which matches E (cid:107) ˆ u j ˆ u T j − u j u T j (cid:107) developed in (Bao et al., 2018, Theorem 2.9)if min { λ j − λ j +1 , λ j − − λ j } is bounded away from and r is fixed. Similarly,we have d − λ − j − ∞ (cid:88) k =2 ( − k ( d − − d − ) d k − − λ − kj = 2 d − ( λ j + d − ) λ j ( λ j + d − ) which matches E (cid:107) ˆ v j ˆ v T j − v j v T j (cid:107) developed in (Bao et al., 2018, Theorem 2.3).Compared with Bao et al. (2018), our results are non-asymptotical. We imposeno eigen-gap conditions and no upper bounds on r . Remark 3.
The proof of Lemma 3 imply that if λ r ≥ C √ d max , then E (cid:107) ˆ U ˆ U T − U U T (cid:107) = 2 r (cid:88) j =1 d − ( λ j + d − ) λ j ( λ j + d − ) + O (cid:16) r d max λ r + r √ d max · d λ r (cid:17) and E (cid:107) ˆ V ˆ V T − V V T (cid:107) = 2 r (cid:88) j =1 d − ( λ j + d − ) λ j ( λ j + d − ) + O (cid:16) r d max λ r + r √ d max · d λ r (cid:17) . . Xia/Normal Approximation of SVD
5. Normal approximation after bias corrections
In this section, we prove the normal approximation of dist [( ˆ U , ˆ V ) , ( U, V )] withexplicit centering and normalizing terms. By Theorem 2, it suffices to substitute E dist [( ˆ U , ˆ V ) , ( U, V )] with the explicit formulas from Lemma 1-3.Similarly as in Section 4, we consider arbitrarily k -th levels of bias correctionsfor dist [( ˆ U , ˆ V ) , ( U, V )]. Higher order bias corrections, while involving morecomplicate bias reduction terms, require lower levels of SNR to guarantee theasymptotical normality. For instance, the first order bias correction in Theorem 3requires λ r (cid:29) (cid:113) rd / for asymptotical normality, while the (cid:100) log d max (cid:101) -th orderbias correction in Theorem 4 only requires optimal λ r (cid:29) √ rd max for asymptot-ical normality. Again, the rank r is allowed to diverge as fast as o (cid:0) ( d + d ) / (cid:1) . Theorem 3 (First order CLT) . Suppose d max ≥ r . There exist absolute con-stants C , C , C , c > such that if λ r ≥ C √ d max , then, sup x ∈ R (cid:12)(cid:12)(cid:12)(cid:12) P (cid:18) dist [( ˆ U , ˆ V ) , ( U, V )] − B √ d (cid:63) (cid:107) Λ − (cid:107) F ≤ x (cid:19) − Φ( x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ C (cid:16) √ r (cid:107) Λ − (cid:107) F λ r (cid:17) · (cid:115) ( rd max ) / λ r + e − c d max + e − λ r / √ rd max + C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d max + C rd / λ r , where d (cid:63) = d + d − r and B is defined by (6). By Theorem 3, we conclude thatdist [( ˆ U , ˆ V ) , ( U, V )] − d (cid:63) (cid:107) Λ − (cid:107) √ d (cid:63) (cid:107) Λ − (cid:107) F d −→ N (0 , d , d → ∞ if √ r = O ( (cid:107) Λ − (cid:107) F λ r ) and √ rd max + (cid:113) rd / λ r → (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d max → . The above conditions require λ r (cid:29) (cid:113) rd / and r (cid:28) d max . The order d / islarger than the optimal rate √ d max . It is improvable if we apply higher orderbias corrections. Theorem 4 (Arbitrary k -th order CLT) . Suppose that d max ≥ r and k ≥ .There exist absolute constants C , C , C , C , c > such that if λ r ≥ C √ d max ,then, sup x ∈ R (cid:12)(cid:12)(cid:12)(cid:12) P (cid:18) dist [( ˆ U , ˆ V ) , ( U, V )] − B k √ d (cid:63) (cid:107) Λ − (cid:107) F ≤ x (cid:19) − Φ( x ) (cid:12)(cid:12)(cid:12)(cid:12) . Xia/Normal Approximation of SVD ≤ C (cid:16) √ r (cid:107) Λ − (cid:107) F λ r (cid:17) · (cid:115) ( rd max ) / λ r + e − c d max + e − λ r / √ rd max + C r √ d max λ r + C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d max + C r d λ r + C r (cid:112) d max · (cid:16) C d max λ r (cid:17) k , where B k is defined by (6). By Theorem 4, the asymptotical normality of (cid:0) dist [( ˆ U , ˆ V ) , ( U, V )] − B k (cid:1) / √ d (cid:63) (cid:107) Λ − (cid:107) F requires √ rd max + rd / + √ d max · ( r d max ) / k λ r → d , d → ∞ when √ r = O ( (cid:107) Λ − (cid:107) F λ r ). By choosing k = (cid:100) log d max (cid:101) , it boilsdown to √ rd max /λ r → (cid:107) Λ − (cid:107) / (cid:107) Λ − (cid:107) ) / / √ d max → r /d max → d , d → ∞ . Remark 4.
To avoid computing the sum of k terms in B k (6), it suffices toapply B ∞ which by Remark 2 is B ∞ = 2 r (cid:88) j =1 λ j (cid:16) d − · λ j + d − λ j + d − + d − · λ j + d − λ j + d − (cid:17) . By setting k = ∞ in Theorem 4, we obtain dist [( ˆ U , ˆ V ) , ( U, V )] − B ∞ √ d (cid:63) (cid:107) Λ − (cid:107) F → N (0 , as long as √ rd max /λ r → and r /d max → when d , d → ∞ .
6. Confidence regions of singular subspaces
By the normal approximation of dist [( ˆ U , ˆ V ) , ( U, V )] in Theorem 4, we constructconfidence regions of U and V . The confidence regions of ( U, V ) attain the pre-determined confidence level asymptotically. In the asymptotic scheme, we shallconsider d , d → ∞ . Therefore, the parameters r ( d ,d ) , U ( d ,d ) , V ( d ,d ) andΛ ( d ,d ) also depend on d , d . For notational simplicity, we omit the superscripts( d , d ) without causing confusions.In particular, we set k = (cid:100) log d max (cid:101) in Theorem 4 and getdist [( ˆ U , ˆ V ) , ( U, V )] − B (cid:100) log d max (cid:101) √ d (cid:63) (cid:107) Λ − (cid:107) F d −→ N (0 , d , d → + ∞ when √ r = O ( λ r (cid:107) Λ − (cid:107) F ) andlim d ,d →∞ max (cid:26) √ rd max + rd / λ r + (cid:18) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:19) / · √ d max (cid:27) = 0 . (7) . Xia/Normal Approximation of SVD We define the confidence region based on ( ˆ
U , ˆ V ) by M α ( ˆ U , ˆ V ) := (cid:26) ( L, R ) : L ∈ R d × r , R ∈ R d × r , L T L = R T R = I r , (cid:12)(cid:12) dist [( L, R ) , ( ˆ U , ˆ V )] − B (cid:100) log d max (cid:101) (cid:12)(cid:12) ≤ (cid:112) d (cid:63) z α/ (cid:107) Λ − (cid:107) F (cid:27) where z α denotes the critical value of standard normal distribution, i.e., z α =Φ − (1 − α ). Theorem 5 follows immediately from Theorem 4. Theorem 5.
Suppose that conditions in Theorem 4 hold. Then, for any α ∈ (0 , , we get (cid:12)(cid:12)(cid:12) P (cid:0) ( U, V ) ∈ M α ( ˆ U , ˆ V ) (cid:1) − (1 − α ) (cid:12)(cid:12)(cid:12) ≤ C √ rλ r (cid:107) Λ − (cid:107) F · (cid:115) ( rd max ) / λ r + 2 e − c d max + e − λ r / √ rd max + C (cid:18) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:19) / · √ d max + C r √ d max λ r + C r d λ r for some absolute constants C , C , C , C , c > . If condition (7) holds, then lim d ,d →∞ P (cid:16) ( U, V ) ∈ M α ( ˆ U , ˆ V ) (cid:17) = 1 − α. Remark 5.
We can also simply replace B (cid:100) log d max (cid:101) with B ∞ and Theorem 5still holds under the same conditions. Remark 6.
Note that Λ is usually unknown. An immediate choice is the em-pirical singular values ˆΛ = diag(ˆ λ , · · · , ˆ λ r ) , i.e., top- r singular values of ˆ M .It is well known that { ˆ λ j } rj =1 are biased estimators of { λ j } rj =1 . See (Benaych-Georges and Nadakuditi, 2012) and (Ding, 2017) for more details.By (Ding, 2017, Theorem 2.2), if λ r = O ( √ d max ) and some eigen-gap con-ditions hold, then with probability at least − d − , for all ≤ j ≤ r , (cid:12)(cid:12)(cid:12) ˆ λ j − (cid:16) λ j + ( d + d ) + d d λ j (cid:17)(cid:12)(cid:12)(cid:12) ≤ C d / λ / j (8) where C > is some absolute constant. In the non-asymptotical settings, (8)suggests that | ˆ λ j − λ j | ≥ c ( d + d ) . Then, d (cid:63) (cid:12)(cid:12) (cid:107) Λ − (cid:107) − (cid:107) ˆΛ − (cid:107) (cid:12)(cid:12) √ d (cid:63) (cid:107) Λ − (cid:107) F ≥ c d / λ r for some absolute constants c , c > . If we directly use (cid:107) ˆΛ − (cid:107) in Theorem 3,the non-asymptotical convergence rate reads d / /λ r . It is indeed observed insimulations. See Section 7.2 for more details. . Xia/Normal Approximation of SVD Bound (8) inspires the following shrinkage estimator of λ j : ˜ λ j = ˆ λ j − ( d + d )2 + (cid:113) (ˆ λ j − ( d + d )) − d d for all ≤ j ≤ r. (9) By replacing Λ with data-dependent estimates ˜Λ = diag(˜ λ , · · · , ˜ λ r ) , it worksextremely well in simulations. See Section 7.2 for more details.However, in order to theoretically justify these data-dependent estimates, weshall prove bound (8) in the regime λ r (cid:29) √ d max and for divergent r . It isbeyond the scope of this paper and we leave it as a future work. Note that wecan still apply (9) in practice since real-world applications are always in thenon-asymptotic settings.
7. Numerical experiments
For all the simulation cases considered below, we choose the rank r = 6 andthe singular values are set as λ i = 2 r − i · λ for i = 1 , · · · , r for some positivenumber λ . As a result, the signal strength is determined by λ . The true singularvectors U ∈ R d × r and V ∈ R d × r are computed from the left and right singularsubspaces of a d × d Gaussian random matrix. In Simulation
1, we show the effectiveness of approximating E dist [( ˆ U , ˆ V ) , ( U, V )]by the first order approximation 2 d (cid:63) (cid:107) Λ − (cid:107) where d (cid:63) = d + d − r . Mean-while, we show the inefficiency of first order approximation when | d − d | (cid:38) min( d , d ). In Simulation
2, we demonstrate the benefits of higher order ap-proximations when | d − d | (cid:38) min( d , d ). Simulation . In this simulation, we study the accuracy of first order ap-proximation and its relevance with ∆ d = d − d . First, we set d = d = d where d = 100 , , λ is chosen as 30 , . , · · · ,
40. Foreach given λ , the first order approximation 2 d (cid:63) (cid:107) Λ − (cid:107) is recorded. To obtain E dist [( ˆ U , ˆ V ) , ( U, V )], we repeat the experiments for 500 times for each λ and theaverage of dist [( ˆ U , ˆ V ) , ( U, V )] is recorded, which denotes the simulated valueof E dist [( ˆ U , ˆ V ) , ( U, V )]. We compare the simulated E dist [( ˆ U , ˆ V ) , ( U, V )] with2 d (cid:63) (cid:107) Λ − (cid:107) , which is displayed in Figure 1(a). Since d = d = d , the first orderapproximation has similar effect as higher order approximation which is verifiedby Figure 1(a). Second, we set d = d = d for d = 100 , , d = d − d = d which is significantly large. Similar experiments are conductedand the results are displayed in Figure 1(b), which clearly shows that first or-der approximation is insufficient to estimate E dist [( ˆ U , ˆ V ) , ( U, V )]. Therefore,if | d − d | (cid:29)
0, we need higher order approximation of E dist [( ˆ U , ˆ V ) , ( U, V )].
Simulation . In this simulation, we study the effects of higher order ap-proximations when | d − d | (cid:29)
0. More specifically, we choose d = 500 and d = 1000. The signal strength λ = 50 , , · · · ,
60. For each λ , we repeat the . Xia/Normal Approximation of SVD
30 32 34 36 38 400.20.40.60.811.21.41.61.8 (a) First order approximation 2 d (cid:63) (cid:107) Λ − (cid:107) isaccurate when ∆ d = d − d = 0 and rank r = 6. Here d (cid:63) = d + d − r . There is noneed for higher order approximations.
30 32 34 36 38 4000.511.522.53 (b) First order approximation 2 d (cid:63) (cid:107) Λ − (cid:107) isnot sufficiently accurate when | d − d | (cid:29) d (cid:63) = d + d − r and rank r = 6. Thehigher order approximations are indeed neces-sary. Fig 1 . Comparison between E dist [( ˆ U, ˆ V ) , ( U, V )] and the first order approximation: d (cid:63) (cid:107) Λ − (cid:107) . It verifies that the accuracy of first order approximation depends on the di-mension difference ∆ d = d − d . Here the red curves represent the simulated mean E dist [( ˆ U, ˆ V ) , ( U, V )] based on realizations of dist [( ˆ U, ˆ V ) , ( U, V )] . The blue curves arethe theoretical first order approximations d (cid:63) (cid:107) Λ − (cid:107) based on Lemma 1. The above left figureclearly shows that first order approximation is accurate if d = d .. Xia/Normal Approximation of SVD experiments for 500 times producing 500 realizations of dist [( ˆ U , ˆ V ) , ( U, V )]whose average is recorded as the simulated E dist [( ˆ U , ˆ V ) , ( U, V )]. Meanwhile,for each λ , we record the 1st-4th order approximations B , B , B and B whichare defined by (6) . All the results are displayed in Figure 2. It verifies thathigher order bias corrections indeed improve the accuracy of approximating E dist [( ˆ U , ˆ V ) , ( U, V )]. It also shows that the 1st and 3rd order approximationsover-estimate E dist [( ˆ U , ˆ V ) , ( U, V )]; while, the 2nd and 4th order approxima-tions under-estimate E dist [( ˆ U , ˆ V ) , ( U, V )].
Simulation . We apply higher order approximations and show the normalapproximation of (cid:0) dist [( ˆ U , ˆ V ) , ( U, V )] − B k (cid:1) / √ d (cid:63) (cid:107) Λ − (cid:107) F when d = 100 , d =600 and rank r = 6. We fixed the signal strength λ = 50. The density histogramis based on 5000 realizations from independent experiments. We consider 1st-4thorder approximations, denoted by { B k } k =1 . More specifically, B = 2 d (cid:63) (cid:107) Λ − (cid:107) , and B = 2( d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) )and B = 2( d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) + d (cid:63) ∆ d (cid:107) Λ − (cid:107) )and B = 2 (cid:0) d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) + d (cid:63) ∆ d (cid:107) Λ − (cid:107) − ( d − − d − )∆ d (cid:107) Λ − (cid:107) (cid:1) . The results are shown in Figure 3. This experiment aims to demonstrate thenecessity of higher order bias corrections. Indeed, by the density histograms inFigure 3, the first and second order bias corrections are not sufficiently strong toguarantee the normal approximations, at least when λ ≤
50, where the densityhistograms either shift leftward or rightward compared with the standard normalcurve. On the other hand, after third or fourth order corrections, the normalapproximation is very satisfactory at the same level of signal strength λ = 50. Next, we show normal approximations of dist [( ˆ U , ˆ V ) , ( U, V )] with data-dependentbias corrections and normalization factors.
Simulation . We apply the 1st order approximation and show normal ap-proximation of (cid:0) dist [( ˆ U , ˆ V ) , ( U, V )] − d (cid:63) (cid:107) ˆΛ − (cid:107) (cid:1) / √ d (cid:63) (cid:107) ˆΛ − (cid:107) F when d = d = 100 and r = 6. Here, ˆΛ = diag(ˆ λ , · · · , ˆ λ r ) denotes the top- r empirical sin-gular values of ˆ M . The signal strength λ = 25 , , ,
75. For each λ , we record (cid:0) dist [( ˆ U , ˆ V ) , ( U, V )] − d (cid:63) (cid:107) ˆΛ − (cid:107) (cid:1) / √ d (cid:63) (cid:107) ˆΛ − (cid:107) F from 5000 thousand indepen-dent experiments and draw the density histogram. The p.d.f. of standard normaldistribution is displayed by the red curve. The results are shown in Figure 4.Since each ˆ λ j over-estimates the true λ j , the bias correction 2 d (cid:63) (cid:107) ˆΛ − (cid:107) is notsufficiently significant. It explains why the density histograms shift rightwardcompared with the standard normal curve, especially when signal strength λ ismoderately strong. . Xia/Normal Approximation of SVD
50 52 54 56 58 6011.11.21.31.41.51.6 Simulated mean1st approx2nd approx3rd approx4th approx
Fig 2 . The higher order approximations of E dist [( ˆ U, ˆ V ) , ( U, V )] . The simulatedmean represents E dist [( ˆ U, ˆ V ) , ( U, V )] calculated by the average of realizations of dist [( ˆ U, ˆ V ) , ( U, V )] . The st order approximation is d (cid:63) (cid:107) Λ − (cid:107) ; nd order approxima-tion is d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) ) , rd order approximation is d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) + d (cid:63) ∆ d (cid:107) Λ − (cid:107) ) and th order approximation is d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) + d (cid:63) ∆ d (cid:107) Λ − (cid:107) − ( d (cid:63) − d − d − )∆ d (cid:107) Λ − (cid:107) ) where ∆ d = d − d , d − = d − r , d − = d − r and d (cid:63) = d − + d − with r = 6 . Clearly, the rd and th order approximations are alreadyclose to the simulated mean. We observe that the st and rd order approximations over-estimate E dist [( ˆ U, ˆ V ) , ( U, V )] ; while, the nd and th order approximations under-estimate E dist [( ˆ U, ˆ V ) , ( U, V )] .. Xia/Normal Approximation of SVD B = 2 d (cid:63) (cid:107) Λ − (cid:107) (b) B = 2( d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) )(c) B = 2( d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) + d (cid:63) ∆ d (cid:107) Λ − (cid:107) (d) B = 2( d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) + d (cid:63) ∆ d (cid:107) Λ − (cid:107) − ( d − d )( d − d ) (cid:107) Λ − (cid:107) ) Fig 3 . Normal approximation of dist [( ˆ U, ˆ V ) , ( U,V )] − B k √ d (cid:63) (cid:107) Λ − (cid:107) F with higher order bias corrections when d = 100 , d = 600 and r = 6 . The density histogram is based on realizations fromindependent experiments. The red curve presents p.d.f. of standard normal distributions. Since | d − d | (cid:29) , this experiment demonstrates the necessity of higher order bias corrections.The bias correction ˆ B k can be st - th order bias corrections.. Xia/Normal Approximation of SVD λ = 25 (b) λ = 50(c) λ = 65 (d) λ = 75 Fig 4 . Normal approximation of dist [( ˆ U, ˆ V ) , ( U,V )] − d (cid:63) (cid:107) ˆΛ − (cid:107) √ d (cid:63) (cid:107) ˆΛ − (cid:107) F with d = d = 100 and r =6 . The density histogram is based on realizations from independent experiments. Theempirical singular values ˆΛ = diag(ˆ λ , · · · , ˆ λ r ) are calculated from ˆ M . The red curve presentsp.d.f. of standard normal distributions. Since ˆ λ j over-estimates λ j , it explains why the densityhistogram shifts to the right compared with the standard normal curve, especially when signalstrength λ is not significantly strong.. Xia/Normal Approximation of SVD Simulation . We apply the 1st order approximation and show normal ap-proximation of (cid:0) dist [( ˆ U , ˆ V ) , ( U, V )] − d (cid:63) (cid:107) ˜Λ − (cid:107) (cid:1) / √ d (cid:63) (cid:107) ˜Λ − (cid:107) F when d = d = 100 and r = 6. Here, ˜Λ = diag(˜ λ , · · · , ˜ λ r ) denotes the top- r shrinkageestimators of λ j s as in (9). The signal strength λ = 25 , , ,
75. For each λ , werecord (cid:0) dist [( ˆ U , ˆ V ) , ( U, V )] − d (cid:63) (cid:107) ˜Λ − (cid:107) (cid:1) / √ d (cid:63) (cid:107) ˜Λ − (cid:107) F from 5000 thousand in-dependent experiments and draw the density histogram. The results are shownin Figure 5. In comparison with Simulation d (cid:63) (cid:107) ˜Λ − (cid:107) works better than 2 d (cid:63) (cid:107) ˆΛ − (cid:107) for bias corrections. Indeed, normalapproximation of (cid:0) dist [( ˆ U , ˆ V ) , ( U, V )] − d (cid:63) (cid:107) ˜Λ − (cid:107) (cid:1) / √ d (cid:63) (cid:107) ˜Λ − (cid:107) F is alreadysatisfactory when signal strength λ = 35, compared with λ ≥
75 when ˆΛ isused.
8. Acknowledgement
The author would like to thank Yik-Man Chiang for the insightful recommen-dations on applying the Residue theorem, and Jeff Yao for the encouragementson improving the former results.
9. Proofs
We only provide the proof of Theorem 1 in this section. Proofs of other theoremsare collected in the supplementary file.
For notational simplicity., we assume λ i > ≤ i ≤ r , i.e., the matrix A is positively semidefinite. The proof is almost identical if A has negativeeigenvalues.Since A is positively semidefinite, we have min ≤ i ≤ r | λ i | = λ r . The conditionin Theorem 1 is equivalent to λ r > (cid:107) X (cid:107) . Recall that { ˆ λ i , ˆ θ i } di =1 denote thesingular values and singular vectors of ˆ A . Define the following contour plot γ A on the complex plane (shown as in Figure 6): Fig 6 . The contour plot γ A which includes { ˆ λ i , λ i } ri =1 leaving out and { ˆ λ i } di = r +1 . , where the contour γ A is chosen such that min η ∈ γ A min ≤ i ≤ r | η − λ i | = λ r . . Xia/Normal Approximation of SVD λ = 25 (b) λ = 35(c) λ = 45 (d) λ = 55 Fig 5 . Normal approximation of dist [( ˆ U, ˆ V ) , ( U,V )] − d (cid:63) (cid:107) ˜Λ − (cid:107) √ d (cid:63) (cid:107) ˜Λ − (cid:107) F with d = d = 100 and r = 6 . The density histogram is based on realizations from independent experiments.The shrinkage estimators ˜Λ = diag(˜ λ , · · · , ˜ λ r ) are calculated as eq. (9). The red curvepresents p.d.f. of standard normal distributions. Since d = d , we apply first order biascorrections to dist [( ˆ U, ˆ V ) , ( U, V )] . In comparison with Simulation and Figure 4 where ˆΛ is used instead of ˜Λ , we conclude that d (cid:63) (cid:107) ˜Λ − (cid:107) is more accurate than d (cid:63) (cid:107) ˆΛ − (cid:107) forbias corrections. Indeed, we see that normal approximation of dist [( ˆ U, ˆ V ) , ( U,V )] − d (cid:63) (cid:107) ˜Λ − (cid:107) √ d (cid:63) (cid:107) ˜Λ − (cid:107) F is already satisfactory when signal strength λ = 35 .. Xia/Normal Approximation of SVD Weyl’s lemma implies that max ≤ i ≤ r | ˆ λ i − λ i | ≤ (cid:107) X (cid:107) . We observe that, when (cid:107) X (cid:107) < λ r , all { ˆ λ i } ri =1 are inside the contour γ A while 0 and { ˆ λ i } di = r +1 areoutside of the contour γ A . By Cauchy’s integral formula, we get12 πi (cid:73) γ A ( ηI − ˆ A ) − dη = r (cid:88) i =1 πi (cid:73) γ A dηη − ˆ λ i (ˆ θ i ˆ θ T i ) + d (cid:88) i = r +1 πi (cid:73) γ A dηη − ˆ λ i (ˆ θ i ˆ θ T i )= r (cid:88) i =1 ˆ θ i ˆ θ T i = ˆΘ ˆΘ T . As a result, we have ˆΘ ˆΘ T = 12 πi (cid:73) γ A ( ηI − ˆ A ) − dη. (10)Note that( ηI − ˆ A ) − = ( ηI − A − X ) − = (cid:2) ( ηI − A ) (cid:0) I − R A ( η ) X (cid:1)(cid:3) − = (cid:0) I − R A ( η ) X (cid:1) − R A ( η )where R A ( η ) := ( ηI − A ) − . clearly (cid:13)(cid:13) R A ( η ) X (cid:13)(cid:13) ≤ (cid:107)R A ( η ) (cid:107)(cid:107) X (cid:107) ≤ (cid:107) X (cid:107) λ r < . Therefore, we write the Neumann series: (cid:0) I − R A ( η ) X (cid:1) − = I + (cid:88) j ≥ [ R A ( η ) X ] j . (11)By (11) and (10), we getˆΘ ˆΘ T = 12 πi (cid:73) γ A ( ηI − ˆ A ) − dη = 12 πi (cid:73) γ A R A ( η ) dη + (cid:88) j ≥ πi (cid:73) γ A (cid:2) R A ( η ) X (cid:3) j R A ( η ) dη. Clearly, πi (cid:72) γ A R A ( η ) dη = ΘΘ T , we end up withˆΘ ˆΘ T − ΘΘ T = S A ( X ) := (cid:88) j ≥ πi (cid:73) γ A (cid:2) R A ( η ) X (cid:3) j R A ( η ) dη. For k ≥
1, we define S A,k ( X ) = 12 πi (cid:73) γ A (cid:2) R A ( η ) X (cid:3) k R A ( η ) dη (12) . Xia/Normal Approximation of SVD which is essentially the k -th order perturbation. Therefore, we obtainˆΘ ˆΘ T − ΘΘ T = (cid:88) k ≥ S A,k ( X ) . (13)By (13), it suffices to derive explicit expression formulas for {S A,k ( X ) } k ≥ .Before dealing with general k , let us derive S A,k ( X ) for k = 1 , I r the r × r identity matrix and write R A ( η ) = Θ( η · I r − Λ) − Θ T + η − Θ ⊥ Θ T ⊥ = d (cid:88) j =1 η − λ j θ j θ T j where we set λ j = 0 for all r + 1 ≤ j ≤ d . Denote P j = θ j θ T j for all 1 ≤ j ≤ d which represents the spectral projector onto θ j . Derivation of S A, ( X ) . By the definition of S A, ( X ), S A, ( X ) = 12 πi (cid:73) γ A R A ( η ) X R A ( η ) dη = d (cid:88) j =1 d (cid:88) j =1 πi (cid:73) γ A dη ( η − λ j )( η − λ j ) P j XP j . (14) Case 1: both j and j are greater than r . In this case, the contour integral in(14) is zero by Cauchy integral formula. Case 2 : only one of j and j is greater than r . W.L.O.G, let j > r , we get r (cid:88) j =1 d (cid:88) j >r πi (cid:73) γ A η − dηη − λ j P j XP j = r (cid:88) j =1 (cid:88) j >r λ − j P j XP j = P − X P ⊥ . Case 3 : none of j and j is greater than r . Clearly, the contour integral in (14)is zero.To sum up, we conclude with S A, ( X ) = P − X P ⊥ + P ⊥ X P − . Derivation of S A, ( X ) . By the definition of S A, ( X ), S A, ( X ) = 12 πi (cid:73) γ A R A ( η ) X R A ( η ) X R A ( η ) dη = d (cid:88) j =1 d (cid:88) j =1 d (cid:88) j =1 πi (cid:73) γ A dη ( η − λ j )( η − λ j )( η − λ j ) P j XP j XP j . (15) Case 1 : all j , j , j are greater than r . The contour integral in (15) is zero byCauchy integral formula. . Xia/Normal Approximation of SVD Case 2 : two of j , j , j are greater than r . W.L.O.G., let j ≤ r and j , j > r ,we get r (cid:88) j =1 d (cid:88) j ,j >r πi (cid:73) γ A η − dηη − λ j P j XP j XP j = r (cid:88) j =1 d (cid:88) j ,j >r λ j P j XP j XP j = P − X P ⊥ X P ⊥ . Case 3 : one of j , j , j is greater than r . W.L.O.G., let j , j ≤ r and j > r ,we get r (cid:88) j ,j =1 d (cid:88) j >r πi (cid:73) γ A η − dη ( η − λ j )( η − λ j ) P j XP j XP j = r (cid:88) j = j =1 d (cid:88) j >r πi (cid:73) γ A η − dη ( η − λ j ) P j XP j XP j + r (cid:88) j (cid:54) = j ≥ d (cid:88) j >r πi (cid:73) γ A η − dη ( η − λ j )( η − λ j ) P j XP j XP j = − r (cid:88) j =1 λ − j P j XP j X P ⊥ − r (cid:88) j (cid:54) = j ≥ ( λ j λ j ) − P j XP j X P ⊥ = − P − X P − X P ⊥ . Case 4: none of j , j , j is greater than r . Clearly, the contour integral in (15)is zero.To sum up, we obtain S A, ( X ) = (cid:0) P − X P ⊥ X P ⊥ + P ⊥ X P − X P ⊥ + P ⊥ X P ⊥ X P − (cid:1) − (cid:0) P ⊥ X P − X P − + P − X P ⊥ X P − + P − X P − X P ⊥ (cid:1) . Derivation of S A,k ( X ) for general k. Recall the definition of S A,k ( X ), wewrite S A,k ( X ) = d (cid:88) j , ··· ,j k +1 ≥ πi (cid:73) γ A (cid:16) k +1 (cid:89) i =1 η − λ j i (cid:17) dηP j XP j X · · · P j k XP j k +1 . (16)We consider components of summations in (16). For instance, consider the casesthat some k indices from { j , · · · , j k +1 } are not larger than r . W.L.O.G., let j , · · · , j k ≤ r and j k +1 , · · · , j k +1 > r . By Cauchy integral formula, the inte-gral in (16) is zero if k = 0 or k = k + 1. Therefore, we only focus on the casesthat 1 ≤ k ≤ k . Then, r (cid:88) j , ··· ,j k ≥ d (cid:88) j k , ··· ,j k +1 >r πi (cid:73) γ A (cid:16) k (cid:89) i =1 η − λ j i (cid:17) η k − k − dηP j XP j X · · · P j k XP j k +1 . 
Xia/Normal Approximation of SVD = r (cid:88) j , ··· ,j k ≥ πi (cid:73) γ A (cid:16) k (cid:89) i =1 η − λ j i (cid:17) η k − k − dηP j XP j X · · · P j k X P ⊥ X · · · X P ⊥ . Recall that our goal is to prove S A,k ( X ) = (cid:88) s : s + ··· + s k +1 = k ( − τ ( s ) · P − s X P − s X · · · X P − s k +1 . Accordingly, in the above summations, we consider the components, where s , · · · , s k ≥ s k +1 = · · · = s k +1 = 0, namely, (cid:88) s + ··· + s k = ks j ≥ ( − k +1 P − s X · · · X P − s k X P ⊥ · · · X P ⊥ . It turns out that we need to prove r (cid:88) j , ··· ,j k ≥ πi (cid:73) γ A (cid:16) k (cid:89) i =1 η − λ j i (cid:17) η k − k − dηP j XP j X · · · P j k = r (cid:88) j , ··· ,j k ≥ (cid:88) s + ··· + s k = ks j ≥ ( − k +1 λ s j · · · λ s k j k P j XP j X · · · XP j k . It suffices to prove that for all j = ( j , . . . , j k ) ∈ { , · · · , r } k ,12 πi (cid:73) γ A dη ( η − λ j ) · · · ( η − λ j k ) η k +1 − k = (cid:88) s + ··· + s k = ks j ≥ ( − k +1 λ s j · · · λ s k j k . (17)To prove (17), we rewrite its right hand side. Given any j = ( j , · · · , j k ) ∈{ , · · · , r } k , define v i ( j ) := (cid:8) ≤ t ≤ k : j t = i (cid:9) for 1 ≤ i ≤ r , that is, v i ( j ) contains the location s such that λ j s = λ i . Meanwhile, denote v i ( j ) = Card (cid:0) v i ( j ) (cid:1) . Then, the right hand side of (17) is written as (cid:88) s + ··· + s k = ks j ≥ ( − k +1 λ s j · · · λ s k j k = ( − k +1 (cid:88) s + ··· + s k = ks j ≥ λ − (cid:80) p ∈ v j ) s p · · · λ − (cid:80) p ∈ v r ( j ) s p r . Now, we denote t i ( j ) = (cid:80) p ∈ v i ( j ) s p for 1 ≤ i ≤ r , we can write the aboveequation as (cid:88) s + ··· + s k = ks j ≥ ( − k +1 λ s j · · · λ s k j k = ( − k +1 (cid:88) t ( j )+ ··· + t r ( j )= kt i ( j ) ≥ v i ( j ) t i ( j )=0 if v i ( j )=0 (cid:89) i : v i ( j ) ≥ (cid:18) t i ( j ) − v i ( j ) − (cid:19) λ − t i ( j ) i . Xia/Normal Approximation of SVD =( − k +1 (cid:88) t ( j )+ ··· + t r ( j )= k − k t i ( j )=0 if v i ( j )=0 (cid:89) i : v i ( j ) ≥ (cid:18) t i ( j ) + v i ( j ) − v i ( j ) − (cid:19) λ − t i ( j ) − v i ( j ) i where the last equality is due to the fact v ( j ) + · · · + v r ( j ) = k . Similarly, theleft hand side of (17) can be written as12 πi (cid:73) γ A dη ( η − λ j ) · · · ( η − λ j k ) η k +1 − k = 12 πi (cid:73) γ A dη ( η − λ ) v ( j ) · · · ( η − λ j r ) v r ( j ) η k +1 − k . Therefore, in order to prove (17), it suffices to prove that for any j = ( j , · · · , j k )the following equality holds12 πi (cid:73) γ A dη ( η − λ ) v · · · ( η − λ j r ) v r η k +1 − k = ( − k +1 (cid:88) t + ··· + t r = k − k t i =0 if v i =0 r (cid:89) i : v i ≥ (cid:18) t i + v i − v i − (cid:19) λ − t i − v i i (18)where we omitted the index j in definitions of v i ( j ) and t i ( j ) without causingany confusions. The non-negative numbers v + · · · + v r = k . We define thefunction ϕ ( η ) = 1( η − λ ) v · · · ( η − λ r ) v r η k +1 − k and we will calculate πi (cid:72) γ A ϕ ( η ) dη by Residue theorem. Indeed, by Residuetheorem, 12 πi (cid:73) γ A ϕ ( η ) dη = − Res( ϕ, η = ∞ ) − Res( ϕ, η = 0) . Clearly, Res( ϕ, η = ∞ ) = 0 and it suffices to calculate Res( ϕ, η = 0). To thisend, let γ be a contour plot around 0 where none of { λ k } rk =1 is inside it. Then,Res( ϕ, η = 0) = 12 πi (cid:73) γ ϕ ( η ) dη. By Cauchy integral formula, we obtainRes( ϕ, η = 0) = 1( k − k )! 
(cid:104) r (cid:89) i : v i ≥ ( η − λ i ) − v i (cid:105) ( k − k ) (cid:12)(cid:12)(cid:12) η =0 where we denote by f ( x ) ( k − k ) the k − k -th order differentiation of f ( x ). Then,we use general Leibniz rule and getRes( ϕ, η = 0) = 1( k − k )! (cid:88) t + ··· + t r = k − k t i =0 if v i =0 ( k − k )! t ! t ! · · · t r ! r (cid:89) i : v i ≥ (cid:104) ( η − λ i ) − v i (cid:105) ( t i ) (cid:12)(cid:12)(cid:12) η =0 =( − k − k (cid:88) t + ··· + t r = k − k t i =0 if v i =0 r (cid:89) i : v i ≥ v i ( v i + 1) · · · ( v i + t i − t i ! ( − λ i ) − v i − t i . Xia/Normal Approximation of SVD =( − k − k (cid:88) t + ··· + t r = k − k t i =0 if v i =0 r (cid:89) i : v i ≥ (cid:18) t i + v i − v i − (cid:19) ( − λ i ) − v i − t i =( − k − k (cid:88) t + ··· + t r = k − k t i =0 if v i =0 r (cid:89) i : v i ≥ (cid:18) t i + v i − v i − (cid:19) λ − v i − t i i . Therefore,12 πi (cid:73) γ A ϕ ( η ) dη = ( − k +1 (cid:88) t + ··· + t r = k − k t i =0 if v i =0 r (cid:89) i : v i ≥ (cid:18) t i + v i − v i − (cid:19) λ − v i − t i i which proves (18). We conclude the proof of Theorem 1. References
Abbe, E., Fan, J., Wang, K., and Zhong, Y. (2017). Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565.
Bao, Z., Ding, X., and Wang, K. (2018). Singular vector and singular subspace distribution for the matrix denoising model. arXiv preprint arXiv:1809.10476.
Benaych-Georges, F. and Nadakuditi, R. R. (2012). The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:120–135.
Berry, A. C. (1941). The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49(1):122–136.
Bloemendal, A., Knowles, A., Yau, H.-T., and Yin, J. (2016). On the principal components of sample covariance matrices. Probability Theory and Related Fields, 164(1-2):459–552.
Cai, J.-F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982.
Cai, T. T. and Zhang, A. (2018). Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89.
Candès, E. J., Li, X., and Soltanolkotabi, M. (2015). Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007.
Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.
Chen, Y., Cheng, C., and Fan, J. (2018). Asymmetry helps: Eigenvalue and eigenvector analyses of asymmetrically perturbed low-rank matrices. arXiv preprint arXiv:1811.12804.
Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46.
Ding, X. (2017). High dimensional deformed rectangular matrices with applications in matrix denoising. arXiv preprint arXiv:1702.06975.
Donoho, D. and Gavish, M. (2014). Minimax risk of matrix denoising by singular value thresholding. The Annals of Statistics, 42(6):2413–2440.
Esseen, C.-G. (1942). On the Liapunoff limit of error in the theory of probability. Arkiv för Matematik, Astronomi och Fysik, A28:1–19.
Golub, G. H. and Van Loan, C. F. (2012). Matrix Computations, volume 3. JHU Press.
Götze, F. and Tikhomirov, A. (2011). On the rate of convergence to the Marchenko–Pastur distribution. arXiv preprint arXiv:1110.1284.
Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998.
Kim, H., Howland, P., and Park, H. (2005). Dimension reduction in text classification with support vector machines. Journal of Machine Learning Research, 6(Jan):37–53.
Koltchinskii, V. and Lounici, K. (2016). Asymptotics and concentration bounds for bilinear forms of spectral projectors of sample covariance. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 52, pages 1976–2013. Institut Henri Poincaré.
Koltchinskii, V. and Lounici, K. (2017). Normal approximation and concentration of spectral projectors of sample covariance. The Annals of Statistics, 45(1):121–157.
Koltchinskii, V. and Xia, D. (2016). Perturbation of linear forms of singular vectors under Gaussian noise. In High Dimensional Probability VII, pages 397–423. Springer.
Li, B. and Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479):997–1008.
Li, Y. and Li, H. (2018). Two-sample test of community memberships of weighted stochastic block models. arXiv preprint arXiv:1811.12593.
Ma, C., Wang, K., Chi, Y., and Chen, Y. (2017). Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. arXiv preprint arXiv:1711.10467.
Mingo, J. A. and Speicher, R. (2017). Free Probability and Random Matrices, volume 35. Springer.
Naumov, A., Spokoiny, V., and Ulyanov, V. (2017). Bootstrap confidence sets for spectral projectors of sample covariance. Probability Theory and Related Fields, pages 1–42.
Shabalin, A. A. and Nobel, A. B. (2013). Reconstruction of a low-rank matrix in the presence of Gaussian noise. Journal of Multivariate Analysis, 118:67–76.
Tang, M., Priebe, C. E., et al. (2018). Limit theorems for eigenvectors of the normalized Laplacian for random graphs. The Annals of Statistics, 46(5):2360–2415.
Tao, T. (2012). Topics in Random Matrix Theory, volume 132. American Mathematical Society.
Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.
Wedin, P.-Å. (1972). Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111.
Xia, D. (2019). Confidence interval of singular subspaces for high-dimensional and low-rank matrix regression. IEEE Transactions on Information Theory.
Xia, D. and Yuan, M. (2018). On polynomial time methods for exact low rank tensor completion. Foundations of Computational Mathematics.
Xia, D., Yuan, M., and Zhang, C.-H. (2017). Statistically optimal and computationally efficient low rank tensor completion from noisy entries. arXiv preprint arXiv:1711.04934.
Xia, D. and Zhou, F. (2019). The sup-norm perturbation of HOSVD and low rank tensor denoising. Journal of Machine Learning Research, 20(61):1–42.
Yu, Y., Wang, T., and Samworth, R. J. (2014). A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323.
Zhang, A. and Xia, D. (2018). Tensor SVD: Statistical and computational limits. IEEE Transactions on Information Theory.

Supplement to "Normal Approximation and Confidence Region of Singular Subspaces"

Dong Xia
Hong Kong University of Science and Technology
Appendix A: Proofs
A.1. Proof of Theorem 2
By rank( ˆΘ) = rank(Θ) = 2 r , we getdist [( ˆ U , ˆ V ) , ( U, V )] = (cid:107) ˆΘ ˆΘ T − ΘΘ T (cid:107) = 4 r − (cid:10) ˆΘ ˆΘ T , ΘΘ T (cid:11) . Since X is random, we shall take care of the “size” of X . Observe that (cid:107) X (cid:107) = (cid:107) Z (cid:107) and the operator norm of Z is well-known (see, e.g., (Tao, 2012) and (Ver-shynin, 2010) ). Indeed, there exist some absolute constants C , C , c > E (cid:107) X (cid:107) ≤ C (cid:112) d max and P (cid:0) (cid:107) X (cid:107) ≥ C (cid:112) d max (cid:1) ≤ e − c d max (19)where d max = max { d , d } . Meanwhile, E /p (cid:107) X (cid:107) p ≤ C √ d max for all integer p ≥
1. See (Koltchinskii and Xia, 2016, Lemma 3).Denote the event E := {(cid:107) X (cid:107) ≤ C √ d max } so that P ( E ) ≥ − e − c d max .Assume that λ r > C √ d max , our analysis is conditioned on E . By Theorem 1,on event E , we haveˆΘ ˆΘ T = ΘΘ T + S A, ( X ) + S A, ( X ) + (cid:88) k ≥ S A, ( X )where S A, ( X ) = P − X P ⊥ + P ⊥ X P − and S A, ( X ) = (cid:0) P − X P ⊥ X P ⊥ + P ⊥ X P − X P ⊥ + P ⊥ X P ⊥ X P − (cid:1) − (cid:0) P ⊥ X P − X P − + P − X P ⊥ X P − + P − X P − X P ⊥ (cid:1) . Therefore, we get (cid:107) ˆΘ ˆΘ T − ΘΘ T (cid:107) =2tr (cid:0) P − X P ⊥ X P − (cid:1) − (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) =2 (cid:107) P − X P ⊥ (cid:107) − (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) . Dong Xia is an Assistant Professor in Department of Mathematics at Hong Kong Univer-sity of Science and Technology, Kowloon, Hong Kong. E-mail: [email protected]. . Xia/Normal Approximation of SVD Then,dist [( ˆ U , ˆ V ) , ( U, V )] − E dist [( ˆ U , ˆ V ) , ( U, V )]= (cid:16) (cid:107) P − X P ⊥ (cid:107) − E (cid:107) P − X P ⊥ (cid:107) (cid:17) − (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) − E S A,k ( X ) (cid:11) . We investigate the normal approximation of2 (cid:107) P − X P ⊥ (cid:107) − E (cid:107) P − X P ⊥ (cid:107) (cid:112) d + d − r ) · (cid:107) Λ − (cid:107) F and show that 2 (cid:80) k ≥ (cid:10) ΘΘ T , S A,k ( X ) − E S A,k ( X ) (cid:11)(cid:112) d + d − r ) · (cid:107) Λ − (cid:107) F is ignorable when signal strength λ r is sufficiently strong. For some t > f t ( X ) = 2 (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) · φ (cid:16) (cid:107) X (cid:107) t · √ d max (cid:17) (20)where we view X as a variable in R ( d + d ) × ( d + d ) and the function φ ( · ) : R + (cid:55)→ R + is defined by φ ( s ) := s ≤ , − s if 1 < s ≤ , s > . Clearly, φ ( s ) is Lipschitz with constant 1. Lemma 4 shows that f ( · ) is Lipschitzwhen λ r ≥ C √ d max . The proof of Lemma 4 is in Appendix, Section B.1. Lemma 4.
There exist absolute constants C , C > so that if λ r ≥ C t √ d max ,then (cid:12)(cid:12) f t ( X ) − f t ( X ) (cid:12)(cid:12) ≤ C t rd max λ r · (cid:107) X − X (cid:107) F where f t ( X ) is defined by (20). By Lemma 4 and Gaussian isoperimetric inequality (see, e.g., (Koltchinskiiand Lounici, 2016, 2017)), it holds with probability at least 1 − e − s for any s ≥ (cid:12)(cid:12)(cid:12)(cid:12) (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) · φ (cid:16) (cid:107) X (cid:107) t · √ d max (cid:17) − E (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) · φ (cid:16) (cid:107) X (cid:107) t · √ d max (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C √ st rd max λ r (21)for some absolute constant C >
0. Now, set t = C where C is defined in (19).Therefore, φ (cid:16) (cid:107) X (cid:107) C ·√ d max (cid:17) = 1 on event E . Meanwhile, the following fact holds (cid:12)(cid:12)(cid:12)(cid:12) E (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) · φ (cid:16) (cid:107) X (cid:107) C · √ d max (cid:17) − E (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11)(cid:12)(cid:12)(cid:12)(cid:12) . Xia/Normal Approximation of SVD ≤ (cid:12)(cid:12)(cid:12)(cid:12) E (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) · φ (cid:16) (cid:107) X (cid:107) C · √ d max (cid:17) I E c1 (cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12) E (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) I E c1 (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:88) k ≥ E (cid:12)(cid:12)(cid:10) ΘΘ T , S A,k ( X ) (cid:11)(cid:12)(cid:12) I E c1 ≤ r (cid:88) k ≥ E / (cid:107)S A,k ( X ) (cid:107) · e − c d max / ≤ e − c d max / · r (cid:88) k ≥ E / k (cid:107) X (cid:107) k λ kr ≤ e − c d max / · r (cid:88) r ≥ (cid:16) C · d / λ r (cid:17) k ≤ e − c d max / · C rd / λ r ≤ C rd max λ r where the last inequality holds as long as e − c d max / ≤ √ d max and we used thefact E /p (cid:107) X (cid:107) p ≤ C √ d max for some absolute constant C > p . (See, e.g., (Koltchinskii and Xia, 2016), (Vershynin, 2010) and (Tao,2012)). Together with (21), it holds with probability at least 1 − e − s − e − c d max for any s ≥ (cid:12)(cid:12)(cid:12) (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) − E (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11)(cid:12)(cid:12)(cid:12) ≤ C s / · rd max λ r for some absolute constant C >
0. Therefore, for any s ≥
1, with probabilityat least 1 − e − s − e − c d max , (cid:12)(cid:12)(cid:12) (cid:80) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) − E (cid:80) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11)(cid:12)(cid:12)(cid:12)(cid:112) d + d − r ) (cid:107) Λ − (cid:107) F ≤ C s / (cid:16) √ r (cid:107) Λ − (cid:107) F λ r (cid:17) · √ rd max λ r (22)where we assumed d max ≥ r .We next prove the normal approximation of 2 (cid:107) P − X P ⊥ (cid:107) . Similar as in(Xia, 2019), by the definition of P − , X and P ⊥ , we could write P − X P ⊥ = (cid:18) U Λ − V T Z T U ⊥ U T ⊥ V Λ − U T ZV ⊥ V T ⊥ (cid:19) . Then, (cid:107) P − X P ⊥ (cid:107) = (cid:107) U Λ − V T Z T U ⊥ U T ⊥ (cid:107) + (cid:107) V Λ − U T ZV ⊥ V T ⊥ (cid:107) . Denote z j ∈ R d the j -th column of Z for 1 ≤ j ≤ d . Then, z , · · · , z d areindependent Gaussian random vector and E z j z T j = I d for all j . Therefore, U T Z = d (cid:88) j =1 ( U T z j ) e T j . Xia/Normal Approximation of SVD where { e j } d j =1 represent the standard basis vectors in R d . Similarly, U T ⊥ Z = d (cid:88) j =1 ( U T ⊥ z j ) e T j . Sincet U T z j and U T ⊥ z j are Gaussian random vectors and E U T z j (cid:0) U T ⊥ z j (cid:1) T = U T U ⊥ = 0, we know that { U T z j } d j =1 are independent with { U T ⊥ z j } d j =1 . Therefore, (cid:107) U Λ − V T Z T U ⊥ U T ⊥ (cid:107) is independent with (cid:107) V Λ − U T ZV ⊥ V T ⊥ (cid:107) . Denote by ˜ Z an independent copy of Z , we conclude that ( Y = Y denotes equivalence of Y and Y in distribution) (cid:107) P − X P ⊥ (cid:107)
2F d = (cid:107) U Λ − V T Z T U ⊥ U T ⊥ (cid:107) + (cid:107) V Λ − U T ˜ ZV ⊥ V T ⊥ (cid:107) = d (cid:88) j = r +1 (cid:107) U Λ − V T Z T u j (cid:107) (cid:96) + d (cid:88) j = r +1 (cid:107) V Λ − U T ˜ Zv j (cid:107) (cid:96) = d (cid:88) j = r +1 (cid:107) Λ − V T Z T u j (cid:107) (cid:96) + d (cid:88) j = r +1 (cid:107) Λ − U T ˜ Zv j (cid:107) (cid:96) where { u j } d j = r +1 and { v j } d j = r +1 denote the columns of U ⊥ and V ⊥ , respectively.Observe that Z T u j ∼ N (0 , I d ) for all r + 1 ≤ j ≤ d and E ( Z T u j )( Z T u j ) T = 0 for all r + 1 ≤ j (cid:54) = j ≤ d . Therefore, { Z T u j } d j = r +1 are independent normal random vectors. Similarly,˜ Zv j ∼ N (0 , I d ) are independent for all r + 1 ≤ j ≤ d . Clearly, V T Z T u j ∼N (0 , I r ) and U T ˜ Zv j ∼ N (0 , I r ) are all independent for r + 1 ≤ j ≤ d and r + 1 ≤ j ≤ d .As a result, let d (cid:63) = d + d − r , we conclude that (cid:107) P − X P ⊥ (cid:107)
2F d = d (cid:63) (cid:88) j =1 (cid:107) Λ − z j (cid:107) (cid:96) (23)where we abuse the notations and denote { z j } d (cid:63) j =1 where z j i.i.d. ∼ N (0 , I r ). ByBerry-Esseen theorem ((Berry, 1941) and (Esseen, 1942)), it holds for someabsolute constant C > x ∈ R (cid:12)(cid:12)(cid:12)(cid:12) P (cid:18) (cid:107) P − X P ⊥ (cid:107) − E (cid:107) P − X P ⊥ (cid:107) (cid:112) d + d − r ) (cid:107) Λ − (cid:107) F ≤ x (cid:19) − Φ( x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d (cid:63) (24)where we used the fact Var (cid:0) (cid:107) Λ − z j (cid:107) (cid:96) (cid:1) = 2 (cid:107) Λ − (cid:107) and E (cid:13)(cid:13) Λ − z j (cid:13)(cid:13) (cid:96) ≤ C r (cid:88) j ,j ,j ≥ λ j λ j λ j ≤ C (cid:107) Λ − (cid:107) . . Xia/Normal Approximation of SVD In (24), the function Φ( x ) denotes the c.d.f. of standard normal distributions.Recall that, on event E ,dist [( ˆ U , ˆ V ) , ( U, V )] − E dist [( ˆ U , ˆ V ) , ( U, V )] (cid:112) d + d − r ) (cid:107) Λ − (cid:107) F = 2 (cid:107) P − X P ⊥ (cid:107) − E (cid:107) P − X P ⊥ (cid:107) (cid:112) d + d − r ) (cid:107) Λ − (cid:107) F + 2 (cid:80) k ≥ (cid:10) ΘΘ T , S A,k ( X ) − E S A,k ( X ) (cid:11)(cid:112) d + d − r ) (cid:107) Λ − (cid:107) F where normal approximation of the first term is given in (24) and upper boundof the second term is given in (22). Based on (22), we get for any x ∈ R andany s ≥ P (cid:18) dist [( ˆ U , ˆ V ) , ( U, V )] − E dist [( ˆ U , ˆ V ) , ( U, V )] (cid:112) d + d − r ) (cid:107) Λ − (cid:107) F ≤ x (cid:19) ≤ P (cid:18) (cid:107) P − X P ⊥ (cid:107) − E (cid:107) P − X P ⊥ (cid:107) (cid:112) d + d − r ) (cid:107) Λ − (cid:107) F ≤ x + C s / · √ r (cid:107) Λ − (cid:107) F λ r · √ rd max λ r (cid:19) + e − s + e − c d max ≤ Φ (cid:18) x + C s / · √ r (cid:107) Λ − (cid:107) F λ r · √ rd max λ r (cid:19) + e − s + e − c d max + C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d (cid:63) ≤ Φ( x ) + C s / · √ r (cid:107) Λ − (cid:107) F λ r · √ rd max λ r + e − s + e − c d max + C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d (cid:63) where the last inequality is due to (24) and the Lipschitz property of Φ( x ).Similarly, for any x ∈ R and any s ≥ P (cid:18) dist [( ˆ U , ˆ V ) , ( U, V )] − E dist [( ˆ U , ˆ V ) , ( U, V )] (cid:112) d + d − r ) (cid:107) Λ − (cid:107) F ≤ x (cid:19) ≥ P (cid:18) (cid:107) P − X P ⊥ (cid:107) − E (cid:107) P − X P ⊥ (cid:107) (cid:112) d + d − r ) (cid:107) Λ − (cid:107) F ≤ x − C s / · √ r (cid:107) Λ − (cid:107) F λ r · √ rd max λ r (cid:19) − e − s − e − c d max ≥ Φ (cid:18) x − C s / · √ r (cid:107) Λ − (cid:107) F λ r · √ rd max λ r (cid:19) − e − s − e − c d max − C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d (cid:63) ≥ Φ( x ) − C s / · √ r (cid:107) Λ − (cid:107) F λ r · √ rd max λ r − e − s − e − c d max − C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d (cid:63) . Finally, we conclude that for any s ≥ x ∈ R (cid:12)(cid:12)(cid:12)(cid:12) P (cid:18) dist [( ˆ U , ˆ V ) , ( U, V )] − E dist [( ˆ U , ˆ V ) , ( U, V )] (cid:112) d + d − r ) (cid:107) Λ − (cid:107) F ≤ x (cid:19) − Φ( x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ C s / (cid:16) √ r (cid:107) Λ − (cid:107) F λ r (cid:17) · √ rd max λ r + e − s + e − c d max + C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d (cid:63) where d (cid:63) = d + d − r and C , C , c are absolute positive constants. . 
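Before turning to the lemmas of Section 4, the normal approximation just established can be probed numerically. The following is a minimal Monte Carlo sketch, not part of the paper: the sizes d1, d2, r, the signal strength lam and the number of replications n_rep are illustrative assumptions. It simulates the model $A+Z$ with i.i.d. standard normal noise, computes the squared projection distance, and standardizes it by $\sqrt{8(d_1+d_2-2r)}\,\|\Lambda^{-2}\|_{\rm F}$ as in Theorem 2, with the Monte Carlo mean standing in for $\mathbb{E}\,{\rm dist}^2$.

```python
import numpy as np

# Minimal Monte Carlo sketch of the normal approximation in Theorem 2.
# All sizes and the signal strength below are illustrative assumptions.
rng = np.random.default_rng(0)
d1, d2, r = 150, 120, 3
lam = 30.0 * np.sqrt(max(d1, d2))            # singular values well above the noise level
Lam = lam * np.ones(r)                       # Lambda = diag(lambda_1, ..., lambda_r)
U, _ = np.linalg.qr(rng.standard_normal((d1, r)))
V, _ = np.linalg.qr(rng.standard_normal((d2, r)))
A = (U * Lam) @ V.T                          # rank-r signal matrix

def proj_dist2(Uh, Vh):
    """||Uh Uh^T - U U^T||_F^2 + ||Vh Vh^T - V V^T||_F^2."""
    return (np.linalg.norm(Uh @ Uh.T - U @ U.T, 'fro') ** 2
            + np.linalg.norm(Vh @ Vh.T - V @ V.T, 'fro') ** 2)

n_rep, stats = 2000, []
for _ in range(n_rep):
    Uh, _, Vht = np.linalg.svd(A + rng.standard_normal((d1, d2)))
    stats.append(proj_dist2(Uh[:, :r], Vht[:r, :].T))
stats = np.array(stats)

d_star = d1 + d2 - 2 * r
scale = np.sqrt(8 * d_star) * np.linalg.norm(Lam ** -2.0)   # sqrt(8 d*) ||Lambda^{-2}||_F
T = (stats - stats.mean()) / scale           # empirical mean stands in for E dist^2
print("sample std of the standardized distance (should be close to 1):", T.std())
```

Weakening the signal strength in this sketch makes the remainder terms controlled in (22) visible, in line with the SNR condition of the theorem.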
A.2. Proof of lemmas in Section 4
Observe that $\mathcal{S}_{A,k}(X)$ involves the product of $k$ copies of $X$. If $k$ is odd, we immediately get $\mathbb{E}\,\mathcal{S}_{A,k}(X) = 0$ since $Z$ has i.i.d. standard normal entries. Therefore, it suffices to investigate $\mathbb{E}\langle \Theta\Theta^{T}, \mathcal{S}_{A,k}(X)\rangle$ when $k$ is even.

Proof of Lemma 1.
By the definitions of P ⊥ , X and P − , E (cid:107) P ⊥ X P − (cid:107) = E (cid:107) U Λ − V T Z T U ⊥ U T ⊥ (cid:107) + E (cid:107) V Λ − U T ZV ⊥ V T ⊥ (cid:107) = E (cid:107) Λ − V T Z T U ⊥ (cid:107) + E (cid:107) Λ − U T ZV ⊥ (cid:107) . By the proof of Theorem 2, we obtain E (cid:107) P ⊥ X P − (cid:107) = ( d + d − r ) (cid:107) Λ − (cid:107) which is the first claim. To prove the second claim, it holds by Theorem 1 that (cid:12)(cid:12)(cid:12) E (cid:107) ˆΘ ˆΘ T − ΘΘ T (cid:107) − d (cid:63) (cid:107) Λ − (cid:107) (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) (cid:88) k ≥ E (cid:10) ΘΘ T , S A, k ( X ) (cid:11)(cid:12)(cid:12)(cid:12) ≤ (cid:88) k ≥ (cid:12)(cid:12)(cid:12)(cid:12) E (cid:28) ΘΘ T , (cid:88) s : s + ··· + s k +1 =2 k ( − τ ( s ) · P − s X P − s X · · · X P − s k X P − s k +1 (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) =2 (cid:88) k ≥ (cid:12)(cid:12)(cid:12)(cid:12) E (cid:28) ΘΘ T , (cid:88) s : s + ··· + s k +1 =2 ks ,s k +1 ≥ ( − τ ( s ) · P − s X P − s X · · · X P − s k X P − s k +1 (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) where we used the fact ΘΘ T P = P ΘΘ T = 0. Then, (cid:12)(cid:12)(cid:12) E (cid:107) ˆΘ ˆΘ T − ΘΘ T (cid:107) − d (cid:63) (cid:107) Λ − (cid:107) (cid:12)(cid:12)(cid:12) ≤ r (cid:88) k ≥ (cid:88) s : s + ··· + s k +1 =2 ks ,s k +1 ≥ E (cid:13)(cid:13)(cid:13) P − s X P − s X · · · X P − s k X P − s k +1 (cid:13)(cid:13)(cid:13) ≤ r (cid:88) k ≥ (cid:88) s : s + ··· + s k +1 =2 ks ,s k +1 ≥ E (cid:107) X (cid:107) k λ kr ≤ r (cid:88) k ≥ (cid:18) k k (cid:19) E (cid:107) X (cid:107) k λ kr ≤ C r (cid:88) k ≥ k E (cid:107) X (cid:107) k λ kr . for some absolute constant C >
0. Therefore, (cid:12)(cid:12)(cid:12) E (cid:107) ˆΘ ˆΘ T − ΘΘ T (cid:107) − d (cid:63) (cid:107) Λ − (cid:107) (cid:12)(cid:12)(cid:12) ≤ C r (cid:88) k ≥ (cid:16) C d max λ r (cid:17) k ≤ C rd λ r where the last inequality holds as long as λ r ≥ C √ d max . Property 1: only even order terms matter.
In order to calculate higherorder approximations, we need the following useful property of E S k ( X ). . Xia/Normal Approximation of SVD By Theorem 1, (cid:10) ΘΘ T , S A, k ( X ) (cid:11) = (cid:88) s : s + ··· + s k +1 =2 ks ,s k +1 ≥ ( − τ ( s ) · tr (cid:16) P − s X · · · X P − s k +1 (cid:17) . For any τ ( s ) = τ ≥
2, there exists positive integers s j , s j , · · · , s j τ and positiveintegers t , t , · · · , t τ − so that we can write P − s X · · · X P − s k +1 = P − s j X P ⊥ · · · P ⊥ X (cid:124) (cid:123)(cid:122) (cid:125) t of X P − s j · · · P − s jτ − X P ⊥ · · · P ⊥ X (cid:124) (cid:123)(cid:122) (cid:125) t τ − of X P − s jτ where s j + · · · + s j τ = 2 k and t + · · · + t τ − = 2 k. Therefore, for positive integers s , · · · , s k +1 , t , · · · , t k ≥ (cid:10) ΘΘ T , E S A, k ( X ) (cid:11) = (cid:88) τ ≥ ( − τ (cid:88) s + ··· + s τ =2 k (cid:88) t + ··· + t τ − =2 k E tr (cid:0) Q ( s s ··· s τ ) t t ··· t τ − (cid:1) where the matrix Q ( s s ··· s τ ) t t ··· t τ − is defined by Q ( s s ··· s τ ) t t ··· t τ − = P − s X P ⊥ · · · P ⊥ X (cid:124) (cid:123)(cid:122) (cid:125) t of X P − s · · · P − s τ − X P ⊥ · · · P ⊥ X (cid:124) (cid:123)(cid:122) (cid:125) t τ − of X P − s τ . (25) Case 1 : if any of t , t , · · · , t τ − equals one. W.L.O.G., let t = 1. Then, Q ( s s ··· s τ ) t t ··· t τ − involves the product of P − s X P − s . Then, (cid:12)(cid:12) E tr (cid:0) Q ( s s ··· s τ ) t t ··· t τ − (cid:1)(cid:12)(cid:12) ≤ √ r · E (cid:107) P − s X P − s (cid:107) F · (cid:107) X (cid:107) k − λ k − s − s r ≤√ r · E (cid:107) ΘΘ T X ΘΘ T (cid:107) F · (cid:107) X (cid:107) k − λ kr ≤ √ rλ kr · E (cid:107) ΘΘ T X ΘΘ T (cid:107) F (cid:107) X (cid:107) k − ≤ √ rλ kr · E / (cid:107) ΘΘ T X ΘΘ T (cid:107) E / (cid:107) X (cid:107) k − ≤ C C k − r / d k − max λ kr where we used the fact ΘΘ T X ΘΘ T = (cid:18) U U T ZV V T V V T Z T U U T (cid:19) which isof rank at most 2 r and E / (cid:107) U T ZV (cid:107) = O ( r ). We also used the fact E /p (cid:107) X (cid:107) p ≤ C √ d max for some absolute constant C > p ≥ t , · · · , t τ − equals one, then the magnitude of (cid:12)(cid:12) E tr (cid:0) Q ( s s ··· s τ ) t t ··· t τ − (cid:1)(cid:12)(cid:12) is of the order O (cid:16) r / √ d max · C k d k max λ kr (cid:17) . . Xia/Normal Approximation of SVD Case 2 : if any of t , · · · , t τ − is an odd number greater than 1. W.L.O.G., let t be an odd number and t ≥
3. More specifically, let t = 2 p + 3 for somenon-negative integer p ≥
0. Then, (cid:12)(cid:12) E (cid:10) ΘΘ T ,Q ( s s ··· s τ ) t t ··· t τ − (cid:11)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) tr (cid:16) P − s X (cid:0) P ⊥ X (cid:1) t − P − s X (cid:0) P ⊥ X (cid:1) t − P − s X · · · P − s τ − X (cid:0) P ⊥ X (cid:1) t τ − − P − s τ (cid:17)(cid:12)(cid:12)(cid:12) ≤ E (cid:13)(cid:13)(cid:13) P − s X ( P ⊥ X P ⊥ ) p +1 X P − s (cid:13)(cid:13)(cid:13) F · √ r (cid:107) X (cid:107) k − t λ k − s − s r ≤ E (cid:13)(cid:13)(cid:13) P − s X ( P ⊥ X P ⊥ ) p +1 X P − s (cid:13)(cid:13)(cid:13) F · √ r (cid:107) X (cid:107) k − t λ k − s − s r I E + E (cid:13)(cid:13)(cid:13) P − s X ( P ⊥ X P ⊥ ) p +1 X P − s (cid:13)(cid:13)(cid:13) F · √ r (cid:107) X (cid:107) k − t λ k − s − s r I E c1 where, as in the proof of Theorem 2, define the event E = {(cid:107) X (cid:107) ≤ C · √ d max } for some absolute constant C > P ( E ) ≥ − e − c d max . As a result,we get (cid:12)(cid:12) E (cid:10) ΘΘ T ,Q ( s s ··· s τ ) t t ··· t τ − (cid:11)(cid:12)(cid:12) ≤ E (cid:13)(cid:13)(cid:13) P − s X ( P ⊥ X P ⊥ ) p +1 X P − s (cid:13)(cid:13)(cid:13) F · √ rd (2 k − t ) / λ k − s − s r I E + C k · rd k max λ kr · e − c d max ≤ E / (cid:13)(cid:13)(cid:13) P − s X ( P ⊥ X P ⊥ ) p +1 X P − s (cid:13)(cid:13)(cid:13) · √ rd (2 k − t ) / λ k − s − s r + C k · rd k max λ kr · e − c d max ≤ C √ rd (2 k − t ) / λ kr · E / (cid:13)(cid:13)(cid:13) Θ T X ( P ⊥ X P ⊥ ) p +1 X Θ (cid:13)(cid:13)(cid:13) + C k · rd k max λ kr · e − c d max where Θ = ( θ , · · · , θ r , θ − r , · · · , θ − ) ∈ R ( d + d ) × (2 r ) . In addition, we can write E (cid:13)(cid:13)(cid:13) Θ T X ( P ⊥ X P ⊥ ) p +1 X Θ (cid:13)(cid:13)(cid:13) = (cid:88) ≤| j | , | j |≤ r E (cid:0) θ T j X ( P ⊥ X P ⊥ ) p +1 Xθ j (cid:1) . Observe that, for any integer p ≥ P ⊥ X P ⊥ ) p = (cid:18) (cid:0) U ⊥ U T ⊥ ZV ⊥ V T ⊥ Z T U ⊥ U T ⊥ (cid:1) p (cid:0) V ⊥ V T ⊥ Z T U ⊥ U T ⊥ ZV ⊥ V T ⊥ (cid:1) p (cid:19) . W.L.O.G, let j , j ≥
1. Then, we write θ T j X ( P ⊥ X P ⊥ ) p +1 Xθ j = 12 v T j Z T (cid:0) U ⊥ U T ⊥ ZV ⊥ V T ⊥ Z T U ⊥ U T ⊥ (cid:1) p U ⊥ U T ⊥ ZV ⊥ V T ⊥ Z T u j + 12 u T j Z (cid:0) V ⊥ V T ⊥ Z T U ⊥ U T ⊥ ZV ⊥ V T ⊥ (cid:1) p V ⊥ V T ⊥ Z T U ⊥ U T ⊥ Zv j and get the simple bound E (cid:0) θ T j X ( P ⊥ X P ⊥ ) p +1 Xθ j (cid:1) . Xia/Normal Approximation of SVD ≤ − E (cid:16) v T j Z T (cid:0) U ⊥ U T ⊥ ZV ⊥ V T ⊥ Z T U ⊥ U T ⊥ (cid:1) p U ⊥ U T ⊥ ZV ⊥ V T ⊥ Z T u j (cid:17) + 2 − E (cid:16) u T j Z (cid:0) V ⊥ V T ⊥ Z T U ⊥ U T ⊥ ZV ⊥ V T ⊥ (cid:1) p V ⊥ V T ⊥ Z T U ⊥ U T ⊥ Zv j (cid:17) . Observe that Zv j is independent with ZV ⊥ and Z T u j is independent with Z T U ⊥ . Therefore, E (cid:0) θ T j X ( P ⊥ X P ⊥ ) p +1 Xθ j (cid:1) ≤ − E (cid:13)(cid:13)(cid:13)(cid:0) U ⊥ U T ⊥ ZV ⊥ V T ⊥ Z T U ⊥ U T ⊥ (cid:1) p U ⊥ U T ⊥ ZV ⊥ V T ⊥ Z T u j (cid:13)(cid:13)(cid:13) (cid:96) + 2 − E (cid:13)(cid:13)(cid:13)(cid:0) V ⊥ V T ⊥ Z T U ⊥ U T ⊥ ZV ⊥ V T ⊥ (cid:1) p V ⊥ V T ⊥ Z T U ⊥ U T ⊥ Zv j (cid:13)(cid:13)(cid:13) (cid:96) ≤ − E (cid:13)(cid:13)(cid:13)(cid:0) U ⊥ U T ⊥ ZV ⊥ V T ⊥ Z T U ⊥ U T ⊥ (cid:1) p U ⊥ U T ⊥ ZV ⊥ V T ⊥ (cid:13)(cid:13)(cid:13) (cid:96) + 2 − E (cid:13)(cid:13)(cid:13)(cid:0) V ⊥ V T ⊥ Z T U ⊥ U T ⊥ ZV ⊥ V T ⊥ (cid:1) p V ⊥ V T ⊥ Z T U ⊥ U T ⊥ (cid:13)(cid:13)(cid:13) (cid:96) ≤ C p +22 d p +1max = C p +22 d t − , where the last inequality is due to the independence between Z T u j and Z T U ⊥ ,the independence between Zv j and ZV ⊥ . We conclude that (cid:12)(cid:12) E (cid:10) ΘΘ T ,Q ( s s ··· s τ ) t t ··· t τ − (cid:11)(cid:12)(cid:12) ≤ C k · r / d k − λ kr + r (cid:16) C d max λ r (cid:17) k · e − c d max ≤ r / d max · (cid:16) C d max λ r (cid:17) k where C > e − c d max ≤ d − .We now finalize the proof. If there exists one odd t i , then there exists at leastanother t j which is also odd since the sum of t i s is even. Following the sameanalysis, we conclude (cid:12)(cid:12) E (cid:10) ΘΘ T , Q ( s s ··· s τ ) t t ··· t τ − (cid:11)(cid:12)(cid:12) ≤ r d max · (cid:16) C d max λ r (cid:17) k whenever any of t , · · · , t τ − is an odd number. Therefore, it suffices to considerthe cases that all of t , · · · , t τ − are even numbers. Proof of Lemma 2.
From the above analysis, to calculate E (cid:10) ΘΘ T , S A, ( X ) (cid:11) ,it suffices to calculate (cid:88) τ =2 ( − τ (cid:88) s + ··· + s τ =4 (cid:88) t + ··· + t τ − =4 E (cid:10) ΘΘ T , Q ( s s ··· s τ ) t t ··· t τ − (cid:11) where t , · · · , t τ − are positive even numbers and s , · · · , s τ are positive num-bers. Case 1: τ = 2 . In this case, t = 4 and s + s = 4. Therefore, for any s , s such that s + s = 4, we shall calculate Q ( s s )4 = P − s X ( P ⊥ X P ⊥ ) X P − s . Xia/Normal Approximation of SVD = E tr (cid:0) Q ( s s )4 (cid:1) = E tr (cid:0) ΘΘ T X ( P ⊥ X P ⊥ ) X ΘΘ T P − (cid:1) . Clearly, we haveΘΘ T X ( P ⊥ X P ⊥ ) X ΘΘ T = (cid:18) U U T ZV ⊥ V ⊥ Z T U ⊥ U T ⊥ ZV ⊥ V T ⊥ Z T U U T V V T Z T U ⊥ U T ⊥ ZV ⊥ V T ⊥ Z T U ⊥ U T ⊥ ZV V T (cid:19) . By the independence between U T Z and U T ⊥ Z , independence between V T Z T and V T ⊥ Z T , we immediately obtain E ΘΘ T X ( P ⊥ X P ⊥ ) X ΘΘ T = E (cid:18) d − U U T ZV ⊥ V T ⊥ Z T U U T d − V V T Z T U ⊥ U T ⊥ ZV V T (cid:19) = d − d − ΘΘ T where d − = d − r and d − = d − r . Then, E (cid:10) ΘΘ T , Q ( s s )4 (cid:11) = 2 d − d − (cid:107) Λ − (cid:107) for all ( s , s ) = (1 , s , s ) = (2 ,
2) and ( s , s ) = (3 , Case 2: τ = 3 . In this case, the only possible even numbers are t = 2 and t = 2.There are three pairs of ( s , s , s ) ∈ (cid:8) (1 , , , (1 , , , (2 , , (cid:9) . W.L.O.G.,consider s = 1 , s = 1 , s = 2, we have Q (112)22 = P − X P ⊥ X P − X P ⊥ X P − . Similarly, we can write E tr( Q (112)22 ) = E tr (cid:0) U Λ − V T Z T U ⊥ U T ⊥ ZV Λ − U T ZV ⊥ V T ⊥ Z T U Λ − U T (cid:1) + E tr (cid:0) V Λ − U T ZV ⊥ V T ⊥ Z T U Λ − V T Z T U ⊥ U T ⊥ ZV Λ − V T (cid:1) = d − E tr (cid:0) U Λ − V T Z T U ⊥ U T ⊥ ZV Λ − U T (cid:1) + d − E tr (cid:0) V Λ − U T ZV ⊥ V T ⊥ Z T U Λ − V T (cid:1) =2 d − d − (cid:107) Λ − (cid:107) . By symmetricity, the same equation holds for E tr( Q (211)22 ). Next, we consider( s , s , s ) = (1 , , E tr( Q (121)22 ) = E tr (cid:0) U Λ − V T Z T U ⊥ U T ⊥ ZV Λ − V T Z T U ⊥ U T ⊥ ZV Λ − U T (cid:1) + E tr (cid:0) V Λ − U T ZV ⊥ V T ⊥ Z T U Λ − U T ZV ⊥ V T ⊥ Z T U Λ − V T (cid:1) = E (cid:107) Λ − ˜ Z ˜ Z T Λ − (cid:107) + E (cid:107) Λ − ˜ Z ˜ Z T Λ − (cid:107) where ˜ Z ∈ R r × d − and ˜ Z ∈ R r × d − contain i.i.d. standard normal entries. ByLemma 6 in the Appendix, we obtain E tr (cid:0) Q (121)22 (cid:1) = ( d − + d − ) (cid:107) Λ − (cid:107) + ( d − + d − ) (cid:0) (cid:107) Λ − (cid:107) + (cid:107) Λ − (cid:107) (cid:1) . . Xia/Normal Approximation of SVD Therefore, we conclude that (cid:12)(cid:12)(cid:12) − E (cid:10) ΘΘ T , S A, ( X ) (cid:11) + ( d − − d − ) (cid:107) Λ − (cid:107) (cid:12)(cid:12)(cid:12) ≤ C · r d max λ r for some absolute constant C > t i is odd as discussed in Property 1 . Together with the proof ofLemma 1, we conclude that (cid:12)(cid:12)(cid:12) E (cid:107) ˆΘ ˆΘ T − ΘΘ T (cid:107) − (cid:0) d (cid:63) (cid:107) Λ − (cid:107) − ∆ d (cid:107) Λ − (cid:107) (cid:1)(cid:12)(cid:12)(cid:12) ≤ C · r d max λ r + C · rd λ r where ∆ d = d − d and C , C > A.3. Proof of Lemma 3.
To characterize $\mathbb{E}\langle \Theta\Theta^{T}, \mathcal{S}_{A,2k}(X)\rangle$ more easily, we observe the following property.

Property 2: effects from distinct singular values are negligible.
Recallthat E (cid:10) ΘΘ T , S A, k ( X ) (cid:11) = (cid:88) s : s + ··· + s k +1 =2 k ( − τ ( s ) · E (cid:10) ΘΘ T , P − s X · · · X P − s k +1 (cid:11) . As proved in
Property 1 , we have (cid:12)(cid:12)(cid:12) E (cid:10) ΘΘ T , S A, k ( X ) (cid:11) − (cid:88) τ ≥ ( − τ (cid:88) s + ··· + s τ =2 k (cid:88) t + ··· + t τ − =2 k E (cid:10) ΘΘ T , Q ( s s ··· s τ ) t t ··· t τ − (cid:11)(cid:12)(cid:12)(cid:12) ≤ r d max · (cid:16) C d max λ r (cid:17) k where the matrix Q ( s s ··· s τ ) t t ··· t τ − is defined as in (25) and t , · · · , t τ − are positiveeven numbers. Recall that ΘΘ T = (cid:80) rj =1 ( P j + P − j ) and P − s = (cid:80) rj =1 (cid:2) λ − s j P j +( λ − j ) − s j P − j (cid:3) where λ − j = − λ j . For each fixed ( s , · · · , s τ ) and ( t , · · · , t τ − )where t j s are even numbers, we write (cid:10) ΘΘ T , Q ( s s ··· s τ ) t t ··· t τ − (cid:11) = r (cid:88) | j | , | j | , ··· , | j τ − |≥ λ − ( s + s τ ) j λ − s j · · · λ − s τ − j τ − ( θ T j W t θ j )( θ T j W t θ j ) · · · ( θ T j τ − W t τ − θ j )where the matrix W t = X P ⊥ X P ⊥ · · · P ⊥ X (cid:124) (cid:123)(cid:122) (cid:125) t of X for positive even numbers t . Ob-serve that θ T j W t θ j = θ T j W t θ j = θ T j X ( P ⊥ X P ⊥ ) t − Xθ j . . Xia/Normal Approximation of SVD We show that if there exists 1 ≤ k ≤ τ − | j k | (cid:54) = | j k +1 | , then | θ T j k W t k θ j k | is a negligibly smaller term. W.L.O.G., assume | j | (cid:54) = | j | andthen (cid:12)(cid:12)(cid:12)(cid:12) E r (cid:88) | j | , | j | , ··· , | j τ − |≥ | j |(cid:54) = | j | λ − ( s + s τ ) j λ − s j · · · λ − s τ − j τ − ( θ T j W t θ j )( θ T j W t θ j ) · · · ( θ T j τ − W t τ − θ j ) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) E (cid:88) | j |(cid:54) = | j | λ − ( s + s τ ) j λ − s j ( θ T j W t θ j ) θ T j W t P − s W t P − s · · · P − s τ − W t τ − θ j (cid:12)(cid:12)(cid:12) ≤ λ kr (cid:88) | j |(cid:54) = | j | E (cid:12)(cid:12) θ T j W t θ j (cid:12)(cid:12) (cid:107) X (cid:107) k − t Since θ j and θ j are orthogonal, we conclude that Xθ j and Xθ j are inde-pendent normal vectors from which we get that θ T j W t θ j | P ⊥ X P ⊥ is sub-exponential and E | θ T j W t θ j | = O (cid:0) (cid:107) ( P ⊥ X P ⊥ ) t − (cid:107) F (cid:1) . Therefore, we get E (cid:12)(cid:12) θ T j W t θ j (cid:12)(cid:12) (cid:107) X (cid:107) k − t = E (cid:12)(cid:12) θ T j W t θ j (cid:12)(cid:12) (cid:107) X (cid:107) k − t ( (cid:107) X (cid:107) ≤ C (cid:112) d max ) + E (cid:12)(cid:12) θ T j W t θ j (cid:12)(cid:12) (cid:107) X (cid:107) k − t ( (cid:107) X (cid:107) ≥ C (cid:112) d max ) ≤ E / (cid:12)(cid:12) θ T j W t θ j (cid:12)(cid:12) · ( C d max ) k − t / ( (cid:107) X (cid:107) ≤ C d / ) + e − d max / ( C d max ) k (cid:46) √ d max · ( C d max ) k + e − d max / ( C d max ) k . As a result, we conclude that (cid:12)(cid:12)(cid:12)(cid:12) E r (cid:88) | j | , | j | , ··· , | j τ − |≥ λ − ( s + s τ ) j λ − s j · · · λ − s τ − j τ − ( θ T j W t θ j )( θ T j W t θ j ) · · · ( θ T j τ − W t τ − θ j ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ C r √ d max · (cid:16) C d max λ r (cid:17) k + C e − d max / · (cid:16) C d max λ r (cid:17) k ≤ C r √ d max · (cid:16) C d max λ r (cid:17) k for some absolute constants C , C > j , j , · · · , j τ − )such that | j | = | j | = · · · = | j τ − | . Now, we define P j = λ j P j + λ − j P − j . 
Tothis end, we conclude (cid:12)(cid:12)(cid:12) E (cid:10) ΘΘ T , S A, k ( X ) (cid:11) − r (cid:88) j =1 (cid:88) τ ≥ ( − τ (cid:88) s : s + ··· + s τ =2 k,s ,s τ > t : t + ··· + t τ − =2 k E tr (cid:0) P − s j W t P − s j W t · · · P − s τ j (cid:1)(cid:12)(cid:12)(cid:12) ≤ C r √ d max · (cid:16) C d max λ r (cid:17) k (26)for some absolute constants C , C >
0. The above fact suggests that it sufficesto focus on the effect from individual singular values (i.e., for any fixed 1 ≤ j ≤ r ). Moreover, it is easy to check that P − s j W t P − s j W t · · · P − s τ j = 1 λ kj · ˜ P − s j W t ˜ P − s j W t · · · ˜ P − s τ j . Xia/Normal Approximation of SVD where ˜ P − sj = P j + ( − s P − j implying that the k -th order error term has dom-inator λ kj . To this end, we prove the following lemma in the Appendix. Lemma 5.
For any ≤ j ≤ r and k ≥ , we obtain (cid:12)(cid:12)(cid:12) (cid:88) τ ≥ ( − τ (cid:88) s : s + ··· + s τ =2 k,s ,s τ > t : t + ··· + t τ − =2 k E tr (cid:0) P − s j W t P − s j · · · P − s τ j (cid:1) − ( − k ( d k − − − d k − − )( d − − d − ) λ kj (cid:12)(cid:12)(cid:12) ≤ C k √ d max · (cid:16) C d max λ r (cid:17) k for some absolute constants C , C > . By Lemma 5 and (26), it holds for all k ≥ (cid:12)(cid:12) E (cid:10) ΘΘ T , S A, k ( X ) (cid:11) − ( − k ( d k − − − d k − − )( d − − d − ) (cid:107) Λ − k (cid:107) (cid:12)(cid:12) ≤ C ( r + k ) √ d max · (cid:16) C d max λ r (cid:17) k for some absolute constants C , C >
0, which concludes the proof.
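As a sanity check on the bias expansion developed in Lemmas 1–3, the following sketch compares the Monte Carlo mean of the squared projection distance against a first- and a second-order approximation. The reference values are written for the distance as computed in the code, namely $\|\hat U\hat U^T-UU^T\|_{\rm F}^2+\|\hat V\hat V^T-VV^T\|_{\rm F}^2$; the absolute constants in front of the two terms are assumptions consistent with that normalization rather than formulas quoted from the lemmas, and all sizes are illustrative.

```python
import numpy as np

# Sketch comparing E dist^2 (Monte Carlo) with first- and second-order bias
# approximations in the spirit of Lemmas 1-3. The constants in `first_order`
# and `second_order` are assumptions matching the distance computed below;
# d1 and d2 are taken far apart so that the (d1 - d2)^2 correction is visible.
rng = np.random.default_rng(1)
d1, d2, r = 200, 80, 2
lam = 6.0 * np.sqrt(max(d1, d2))
Lam = lam * np.ones(r)
U, _ = np.linalg.qr(rng.standard_normal((d1, r)))
V, _ = np.linalg.qr(rng.standard_normal((d2, r)))
A = (U * Lam) @ V.T

vals = []
for _ in range(4000):                        # several thousand replications are needed
    Uh, _, Vht = np.linalg.svd(A + rng.standard_normal((d1, d2)))
    Uh, Vh = Uh[:, :r], Vht[:r, :].T
    vals.append(np.linalg.norm(Uh @ Uh.T - U @ U.T, 'fro') ** 2
                + np.linalg.norm(Vh @ Vh.T - V @ V.T, 'fro') ** 2)

d_star = d1 + d2 - 2 * r
first_order = 2 * d_star * np.sum(Lam ** -2.0)
second_order = first_order - 2 * (d1 - d2) ** 2 * np.sum(Lam ** -4.0)
print("Monte Carlo mean :", np.mean(vals))
print("1st-order approx :", first_order)
print("2nd-order approx :", second_order)
```

With these deliberately unbalanced dimensions the Monte Carlo mean should sit noticeably closer to the second-order value, illustrating why the correction matters once $|d_1-d_2|$ is large.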
A.4. Proof of CLT theorems in Section 5
Proof of Theorem 3
Recall Theorem 2, we end up withsup x ∈ R (cid:12)(cid:12)(cid:12)(cid:12) P (cid:18) dist [( ˆ U , ˆ V ) , ( U, V )] − E dist [( ˆ U , ˆ V ) , ( U, V )] (cid:112) d + d − r ) (cid:107) Λ − (cid:107) F ≤ x (cid:19) − Φ( x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ C (cid:16) √ r (cid:107) Λ − (cid:107) F λ r (cid:17) · (cid:115) ( rd max ) / λ r + e − c d max + C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d max + e − λ r / √ rd max . By Lemma 1, we get (cid:12)(cid:12) E dist [( ˆ U , ˆ V ) , ( U, V )] − d (cid:63) (cid:107) Λ − (cid:107) (cid:12)(cid:12) ≤ C rd λ r . Therefore, (cid:12)(cid:12)(cid:12) E dist [( ˆ U , ˆ V ) , ( U, V )] − d (cid:63) (cid:107) Λ − (cid:107) √ d (cid:63) (cid:107) Λ − (cid:107) F (cid:12)(cid:12)(cid:12) ≤ C rd / λ r . By the Lipschitz property of Φ( x ) and applying similar technical as in proof ofTheorem 2, we can getsup x ∈ R (cid:12)(cid:12)(cid:12)(cid:12) P (cid:18) dist [( ˆ U , ˆ V ) , ( U, V )] − d (cid:63) (cid:107) Λ − (cid:107) √ d (cid:63) (cid:107) Λ − (cid:107) F ≤ x (cid:19) − Φ( x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ C (cid:16) √ r (cid:107) Λ − (cid:107) F λ r (cid:17) · (cid:115) ( rd max ) / λ r + e − c d max + C (cid:16) (cid:107) Λ − (cid:107) (cid:107) Λ − (cid:107) (cid:17) / · √ d max + C rd / λ r + e − λ r / √ rd max . . Xia/Normal Approximation of SVD Proof of Theorem 4
By Lemma 3, we have (cid:12)(cid:12)(cid:12) E dist [( ˆ U , ˆ V ) , ( U, V )] − B k (cid:12)(cid:12)(cid:12) ≤ C r d max λ r + C r √ d max · (cid:16) d max λ r (cid:17) + C r (cid:16) C d max λ r (cid:17) k +1 . The rest of the proof is the same as in the proof of Theorem 3.
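The theorems of this section can also be read as a recipe for a confidence statement. The sketch below, again with illustrative sizes and with constants matching the distance as computed in the code rather than quoted from the statements, takes $d_1 = d_2$ so that the $(d_1-d_2)^2$ bias term vanishes and the first-order centering suffices, and reports the empirical one-sided 95% coverage implied by the normal approximation.

```python
import numpy as np

# Rough coverage check of the normal approximation with explicit centering.
# `center` and `scale` are assumptions matching the distance computed below;
# sizes, rank and signal strength are illustrative. Taking d1 = d2 removes the
# (d1 - d2)^2 bias term, so no higher-order correction is applied.
rng = np.random.default_rng(2)
d1 = d2 = 150
r = 3
lam = 10.0 * np.sqrt(d1)
Lam = lam * np.ones(r)
U, _ = np.linalg.qr(rng.standard_normal((d1, r)))
V, _ = np.linalg.qr(rng.standard_normal((d2, r)))
A = (U * Lam) @ V.T

d_star = d1 + d2 - 2 * r
center = 2 * d_star * np.sum(Lam ** -2.0)                  # first-order centering
scale = np.sqrt(8 * d_star) * np.linalg.norm(Lam ** -2.0)  # sqrt(8 d*) ||Lambda^{-2}||_F
z95 = 1.6449                                               # N(0,1) 95% quantile

hits, n_rep = 0, 2000
for _ in range(n_rep):
    Uh, _, Vht = np.linalg.svd(A + rng.standard_normal((d1, d2)))
    Uh, Vh = Uh[:, :r], Vht[:r, :].T
    dist2 = (np.linalg.norm(Uh @ Uh.T - U @ U.T, 'fro') ** 2
             + np.linalg.norm(Vh @ Vh.T - V @ V.T, 'fro') ** 2)
    hits += dist2 <= center + z95 * scale
print("empirical one-sided coverage (target 0.95):", hits / n_rep)
```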
Appendix B: Appendix
B.1. Supporting lemmas
Proof of Lemma 4.
Recall that f t ( X ) = (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) φ (cid:16) (cid:107) X (cid:107) t · √ d max (cid:17) . Case 1 : if (cid:107) X (cid:107) > t √ d max and (cid:107) X (cid:107) > t √ d max , then f t ( X ) = f t ( X ) = 0 bydefinition of φ ( · ) where the claimed inequality holds trivially. Case 2 : if (cid:107) X (cid:107) ≤ t √ d max and (cid:107) X (cid:107) > t √ d max , then f t ( X ) = 0. We get, byLipschitz property of φ ( · ), that (cid:12)(cid:12)(cid:12) f t ( X ) − f t ( X ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) (cid:88) k ≥ (cid:10) ΘΘ T , S A,k ( X ) (cid:11) · (cid:18) φ (cid:16) (cid:107) X (cid:107) t · √ d max (cid:17) − φ (cid:16) (cid:107) X (cid:107) t · √ d max (cid:17)(cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:88) k ≥ r (cid:13)(cid:13) S A,k ( X ) (cid:13)(cid:13) · (cid:107) X − X (cid:107) F t · √ d max ≤ r (cid:107) X − X (cid:107) F t · √ d max · (cid:88) k ≥ (cid:88) s : s + ··· + s k +1 = k (cid:13)(cid:13)(cid:13) P − s X P − s X · · · X P − s k +1 (cid:13)(cid:13)(cid:13) ≤ r (cid:107) X − X (cid:107) F t · √ d max · (cid:88) k ≥ (cid:88) s : s + ··· + s k +1 = k (cid:107) X (cid:107) k λ kr ≤ r (cid:107) X − X (cid:107) F t · √ d max · (cid:88) k ≥ (cid:16) (cid:107) X (cid:107) λ r (cid:17) k ≤ C t r (cid:107) X − X (cid:107) F √ d max · d / λ r where the last inequality holds as long as λ r ≥ t √ d max . Case 3 : if (cid:107) X (cid:107) ≤ td / and (cid:107) X (cid:107) ≤ td / . Then, (cid:12)(cid:12)(cid:12) f t ( X ) − f t ( X ) (cid:12)(cid:12)(cid:12) ≤ r (cid:88) k ≥ (cid:13)(cid:13)(cid:13) S A,k ( X ) φ (cid:16) (cid:107) X (cid:107) t · √ d max (cid:17) − S A,k ( X ) φ (cid:16) (cid:107) X (cid:107) t · √ d max (cid:17)(cid:13)(cid:13)(cid:13) ≤ r (cid:88) k ≥ (cid:88) s : s + ··· + s k +1 = k (cid:13)(cid:13)(cid:13) P − s X · · · X P − s k +1 φ (cid:16) (cid:107) X (cid:107) t · √ d max (cid:17) − P − s X · · · X P − s k +1 φ (cid:16) (cid:107) X (cid:107) t · √ d max (cid:17)(cid:13)(cid:13)(cid:13) . Xia/Normal Approximation of SVD ≤ r (cid:88) k ≥ (cid:88) s : s + ··· + s k +1 = k ( k + 2) · (2 td / ) k − λ kr (cid:107) X − X (cid:107) F ≤ C t · rd max λ r (cid:107) X − X (cid:107) F where the last inequality holds as long as λ r ≥ t √ d max . Therefore, we concludethe proof of Lemma 4. Proof of Lemma 5.
Based on
Property 2 and eq. (26), it suffices to calculatethe quantities E tr (cid:0) P − s j W t P − s j W t · · · P − s τ j (cid:1) which relies on singular values λ j and singular vectors u j , v j only. Moreover, the actual forms of u j , v j does notaffect the values. By choosing { u j } rj =1 and { v j } rj =1 as the first r canonical basisvectors in R d and R d , it is easy to check that we can reduce the calculations tothe rank-one spiked model with singular value λ j . To leverage the dimensionalityeffect where U T ⊥ ZV ⊥ ∈ R d − × d − has i.i.d. standard normal entries, we considerthe rank-one spiked model withˆ M = λ ( u ⊗ v ) + Z ∈ R ( d − +1) × ( d − +1) (27)where Z has i.i.d. standard normal entries and d − = d − r, d − = d − r . Letˆ u and ˆ v denote the leading left and right singular vectors of ˆ M . By fact (26),it suffices to calculate the k -th order approximation of (cid:107) ˆ u ˆ u T − uu T (cid:107) + (cid:107) ˆ v ˆ v T − vv T (cid:107) . In the proof, we calculate the errors (cid:107) ˆ u ˆ u T − uu T (cid:107) and (cid:107) ˆ v ˆ v T − vv T (cid:107) separately. W.L.O.G., we just deal with (cid:107) ˆ u ˆ u T − uu T (cid:107) and consider d ≤ d .Recall that we aim to calculate the k -th order error term in (cid:107) uu T − ˆ u ˆ u T (cid:107) .To this end, we write the error terms as E (cid:107) ˆ u ˆ u T − uu T (cid:107) = 2 ∞ (cid:88) k =1 E k λ k . (28)We show that E k = ( − k d k − − ( d − − d − ) · (cid:104) O (cid:16) C k √ d max (cid:17)(cid:105) for some absoluteconstant C >
0. To this end, we consider the second-order (see (Xia and Zhou,2019)) moment trick (denote T = λ ( u ⊗ u ))ˆ M ˆ M T = λ ( u ⊗ u ) + ∆ ∈ R ( d − +1) × ( d − +1) (29)where ∆ = λuv T Z T + λZvu T + ZZ T . By eq. (4), we can write (cid:107) ˆ u ˆ u T − uu T (cid:107) = − (cid:88) k ≥ (cid:10) uu (cid:62) , S T,k (∆) (cid:11) where we define P u = λ ( u ⊗ u ) and P u = P ⊥ u = U ⊥ U (cid:62)⊥ ∈ R ( d − +1) × d − and S T,k (∆) = (cid:88) s : s + ··· + s k +1 = k ( − τ ( s )+1 · P − s u ∆ P s u ∆ · · · ∆ P − s k +1 u . This condition just simplifies our calculation when dealing with the Marchenko Pasturlaw. Our results do not rely on the condition d ≤ d . Xia/Normal Approximation of SVD Now, we investigate (cid:10) uu (cid:62) , S T,k (∆) (cid:11) for all k ≥
2. Denote W t = ∆ P ⊥ u ∆ · · · P ⊥ u ∆ (cid:124) (cid:123)(cid:122) (cid:125) t of ∆ and we can write (cid:10) uu (cid:62) , S T,k (∆) (cid:11) = k (cid:88) τ =2 ( − τ +1 (cid:88) t + ··· + t τ − = k,t j ≥ s + ··· + s τ = k,s j ≥ tr (cid:0) P − s u W t P − s u W t · · · P − s τ − u W t τ − P − s τ u (cid:1) = 1 λ k k (cid:88) τ =2 ( − τ (cid:18) k − τ − (cid:19) (cid:88) t + ··· + t τ − = k,t j ≥ ( u T W t u )( u T W t u ) · · · ( u T W t τ − u ) . Denote β ∆ t = u T W t u , we can write concisely E (cid:10) uu (cid:62) , S T,k (∆) (cid:11) = 1 λ k k (cid:88) τ =2 ( − τ (cid:18) k − τ − (cid:19) (cid:88) t + ··· + t τ − = k,t j ≥ E (cid:0) β ∆ t β ∆ t · · · β ∆ t τ − (cid:1) . (30)Now, we investigate the concentration property of β ∆ t = u T W t u . Clearly, we canwrite β ∆1 = 2 λ · ( u T Zv ) (cid:124) (cid:123)(cid:122) (cid:125) β ∆1 , + u T ZZ T u (cid:124) (cid:123)(cid:122) (cid:125) β ∆1 , and for all t ≥
2, we write β ∆ t = β ∆ t, + β ∆ t, where β ∆ t, = u T ZZ T U ⊥ ( U T ⊥ ZZ T U ⊥ ) t − U T ⊥ ZZ T u + λ uv T Z T U ⊥ ( U T ⊥ ZZ T U ⊥ ) t − U T ⊥ Zvu T β ∆ t, =2 λuv T Z T U ⊥ ( U T ⊥ ZZ T U ⊥ ) t − U T ⊥ ZZ T u. As a result, we can calculate E ( β ∆ t β ∆ t · · · β ∆ t τ − ) = E (cid:0) ( β ∆ t , + β ∆ t , )( β ∆ t , + β ∆ t , ) · · · ( β ∆ t τ − , + β ∆ t τ − , ) (cid:1) . It is easy to check that E β ∆1 = d − + 1 and for t ≥ E β ∆ t = λ E (cid:0) v T Z T U ⊥ ( U T ⊥ ZZ T U ⊥ ) t − U T ⊥ Zv (cid:1) + E tr (cid:0) Z T U ⊥ ( U T ⊥ ZZ T U ⊥ ) t − U T ⊥ Z (cid:1) = (cid:16) λ d − + 1 (cid:17) · E tr (cid:0) Z T U ⊥ ( U T ⊥ ZZ T U ⊥ ) t − U T ⊥ Z (cid:1) = (cid:16) λ d − + 1 (cid:17) · E tr (cid:0) ( U T ⊥ ZZ T U ⊥ ) t − (cid:1) where the second equality can be checked by choosing v = e ∈ R d − +1 . Since Z T u and Z T U ⊥ are independent, it is easy to check that E β ∆ t ,i β ∆ t ,i · · · β ∆ t τ − ,i τ − = 0 , if τ − (cid:88) j =1 i j is an odd numberfor all i , i , · · · , i τ − ∈ { , } . As a result, we observe that E (cid:10) uu T , S T,k (cid:11) hascontributions to E k , E k − , E k − , · · · , E (cid:100) k/ (cid:101) . (Recall that E k is the coeffi-cient for λ k .) . Xia/Normal Approximation of SVD Moreover, since Z T u and Z T U ⊥ are independent, we can conclude that β ∆1 , ∼ N (0 , λ )and for all t ≥ β ∆ t, (cid:12)(cid:12) U T ⊥ Z ∼ N (cid:0) , λ (cid:107) Z T U ⊥ ( U T ⊥ ZZ T U ⊥ ) t − U T ⊥ Zv (cid:107) (cid:96) (cid:1) . We can get, for all t ≥
2, that E / (cid:2) ( β ∆ t, ) (cid:12)(cid:12) U T ⊥ Z (cid:3) (cid:46) E / (cid:2) ( β ∆ t, ) (cid:12)(cid:12) U T ⊥ Z (cid:3) (cid:46) λ (cid:107) U T ⊥ Z (cid:107) t − Therefore, it is easy to check that for any ( i , i , · · · , i τ − ) ∈ { , } τ − wherethere exists some i j ≥
1, then E β ∆ t ,i β ∆ t ,i · · · β ∆ t τ − ,i τ − ’s contribution to any E k is bounded by d max · (cid:16) C d max λ (cid:17) k for some absolute constant C > (cid:100) k/ (cid:101) ≤ k ≤ k . To show this, w.l.o.g, let i = i = 1 and observe that E β ∆ t , β ∆ t , β ∆ t ,i · · · β ∆ t τ − ,i τ − = E / ( β ∆ t , β ∆ t , ) E / (cid:0) β ∆ t ,i · · · β ∆ t τ − ,i τ − (cid:1) ≤ E / ( β ∆ t , ) E / ( β ∆ t , ) E / (cid:0) β ∆ t ,i · · · β ∆ t τ − ,i τ − (cid:1) ≤ λ d t + t − E / (cid:0) β ∆ t ,i · · · β ∆ t τ − ,i τ − (cid:1) (31)and then we get1 λ k E β ∆ t , β ∆ t , β ∆ t ,i · · · β ∆ t τ − ,i τ − ≤ d max · (cid:16) d max λ (cid:17) t + t − · E / (cid:0) β ∆ t ,i · · · β ∆ t τ − ,i τ − (cid:1) λ k − t − t ) . The claim follows immediately since E / (cid:0) β ∆ t ,i · · · β ∆ t τ − ,i τ − (cid:1) λ k − t − t ) ≤ k (cid:88) k = (cid:100) ( k − t − t ) / (cid:101) C k E / (cid:107) Z (cid:107) k λ k ≤ k (cid:88) k = (cid:100) ( k − t − t ) / (cid:101) (cid:16) C d max λ (cid:17) k for some absolute constant C , C > E (cid:107) Z (cid:107) k ≤ C k d k max for some absolute constant C > λ k k (cid:88) τ =2 ( − τ (cid:18) k − τ − (cid:19) (cid:88) t + ··· + t τ − = k,t j ≥ E (cid:0) β ∆ t , β ∆ t , · · · β ∆ t τ − , (cid:1) . (32)Next, we will replace E (cid:0) β ∆ t , β ∆ t , · · · β ∆ t τ − , (cid:1) with E β ∆ t , E β ∆ t , · · · E β ∆ t τ − , forwhich we shall investigate the concentrations of β ∆ t, . To this end, we have thesub-exponential inequality P (cid:0)(cid:12)(cid:12) u T ZZ T u − d − (cid:12)(cid:12) ≥ C (cid:112) αd − + C α (cid:1) ≤ C e − α , ∀ α > . Xia/Normal Approximation of SVD for some constants C , C >
0. Again, by Gaussian isoperimetric inequality andthe proof of Theorem 3 , we can show, for all α > P (cid:0)(cid:12)(cid:12) u T Z ( Z T U ⊥ U T ⊥ Z ) t Z T u − E u T Z ( Z T U ⊥ U T ⊥ Z ) t Z T u (cid:12)(cid:12) ≥ C αd t +1 / + C e − c d max d t +1max (cid:1) ≤ C e − α + C e − c d max and P (cid:0)(cid:12)(cid:12) v T ( Z T U ⊥ U T ⊥ Z ) t − v − E v T ( Z T U ⊥ U T ⊥ Z ) t − v (cid:12)(cid:12) ≥ C tαd t − / + C e − c d max d t − (cid:1) ≤ C e − α + C e − c d max . Therefore, we can show that (cid:12)(cid:12) E ( β ∆ t , β ∆ t , · · · β ∆ t τ − , ) − ( E β ∆ t , )( E β ∆ t , ) · · · ( E β ∆ t τ − , ) (cid:12)(cid:12) ’scontribution to any E k is bounded by √ d max · ( C d max /λ ) k for some constant C > (cid:100) k/ (cid:101) ≤ k ≤ k . Indeed, the above concentration inequalities of β ∆ t, imply E / ( β ∆ t, − E β ∆ t, ) (cid:46) d t − / + λ d t − / , ∀ t ≥ . The claim can be proved as in eq. (31). Indeed, we can write1 λ k (cid:12)(cid:12)(cid:12) E β ∆ t , β ∆ t , · · · β ∆ t τ − , − ( E β ∆ t , )( E β ∆ t , ) · · · E ( β ∆ t τ − , ) (cid:12)(cid:12)(cid:12) ≤ λ k τ − (cid:88) i =1 (cid:16) i − (cid:89) j =1 E β ∆ t j , (cid:17)(cid:12)(cid:12)(cid:12) E (cid:0) β ∆ t i , − E β ∆ t i , (cid:1)(cid:16) τ − (cid:89) j = i +1 β ∆ t j , (cid:17)(cid:12)(cid:12)(cid:12) ≤ τ − (cid:88) i =1 E / (cid:0) β ∆ t i , − E β ∆ t i , (cid:1) λ t i · λ k − t i ) i − (cid:89) j =1 (cid:16) E β ∆ t j , (cid:17) E / (cid:16) τ − (cid:89) j = i +1 β ∆ t j , (cid:17) ≤ τ − (cid:88) i =1 √ d max (cid:16) ( d max /λ ) t i + ( d max /λ ) t i − (cid:17) · λ k − t i ) i − (cid:89) j =1 (cid:16) E β ∆ t j , (cid:17) E / (cid:16) τ − (cid:89) j = i +1 β ∆ t j , (cid:17) which concludes the proof since λ k − ti ) (cid:81) i − j =1 (cid:16) E β ∆ t j , (cid:17) E / (cid:16) (cid:81) τ − j = i +1 β ∆ t j , (cid:17) ≤ (cid:16) C d max λ (cid:17) k − t i .To this end, to calculate eq. (30), it suffices to calculate1 λ k k (cid:88) τ =2 ( − τ (cid:18) k − τ − (cid:19) (cid:88) t + ··· + t τ − = k,t j ≥ E β ∆ t , E β ∆ t , · · · E β ∆ t τ − , Now, we compute E β ∆ t, = (cid:0) λ / ( d − + 1) (cid:1) · E tr (cid:0) ( U T ⊥ ZZ T U ⊥ ) t − (cid:1) . Notethat the matrix U T ⊥ Z ∈ R d − × ( d − +1) has i.i.d. standard normal entries. By the We just need to study the Lipschitz property of the function f ( Z ) = u T Z ( Z T U ⊥ U T ⊥ Z ) t Z T u · ( (cid:107) Z (cid:107) ≤ C √ d max ) . Xia/Normal Approximation of SVD moment of Marchenko-Pastur law ((Mingo and Speicher, 2017)), for all t ≥ β = d − + 1) β t λ / (1 + d − ) = 1 t − t − (cid:88) r =0 d r +11 − ( d − + 1) t − − r (cid:18) t − r + 1 (cid:19)(cid:18) t − r (cid:19) . (33)Note that E tr (cid:0) ( U T ⊥ ZZ T U ⊥ ) t − (cid:1) = E tr (cid:0) ( Z T U ⊥ U T ⊥ Z ) t − (cid:1) for all t ≥
2. By therate of convergence of Marchenko Pastur law ((G¨otze and Tikhomirov, 2011,Theorem 1.1)), we have (as long as √ d max ≥ log d max ) (cid:12)(cid:12) β t − E β ∆ t, (cid:12)(cid:12) λ / ( d − + 1) ≤ √ d max · (cid:0) C d max (cid:1) t − for all t ≥ C > t + · · · + t τ − = k , the contribution to E k from (cid:12)(cid:12) E β ∆ t , E β ∆ t , · · · E β ∆ t τ − , − β t β t · · · β t τ − (cid:12)(cid:12) is bounded by √ d max · (cid:16) C d max λ (cid:17) k .Therefore, by eq. (30), to calculate E (cid:10) uu T , S T,k (∆) (cid:11) , we consider the followingterm 1 λ k k (cid:88) τ =2 ( − τ (cid:18) k − τ − (cid:19) (cid:88) t + ··· + t τ − = k,t j ≥ β t β t · · · β t τ − which is the k -th order derivative of the function λ k · ( k !) (1 − g ( α )) k − at α = 0where g ( α ) = β α + α β + α β + · · · = (cid:88) k ≥ β k α k . (34)Now, we calculate the explicit form of the function g ( α ). Denote γ = d − d − +1 and Y the random variable obeying the Marchenko-Pastur distribution, i.e., its pdfis given by f Y ( y ) = 12 π (cid:112) ( γ + − y )( y − γ − ) γy · ( y ∈ [ γ − , γ + ])where γ + = (1 + √ γ ) and γ − = (1 − √ γ ) . It is easy to check that ((Mingoand Speicher, 2017)) β t = (cid:16) λ d − (cid:17) d − ( d − + 1) t − E Y t − , ∀ t ≥ . For notational simplicity, we just write d − instead of 1 + d − . As a result, weget for α (cid:28) d , g ( α ) = β α + (cid:16) λ d − (cid:17) d − α E (cid:88) t ≥ d t − ( αY ) t = β α + (cid:16) λ d − (cid:17) E d − αd − αY − d − αY . Xia/Normal Approximation of SVD = αd − + (cid:16) λ d − (cid:17) · (cid:0)(cid:112) − αd − γ − − (cid:112) − αd − γ + (cid:1) Y according to the p.d.f. F Y ( y ).Therefore, we get 1 − g ( α ) = 12 (cid:104) g + ( α ) − λ d − g − ( α ) (cid:105) where g − ( α ) = 1 − ( d − + d − ) α − (cid:112) (1 − αd − γ − )(1 − αd − γ + )and g + ( α ) = 1 − ( d − − d − ) α + (cid:112) (1 − αd − γ − )(1 − αd − γ + ) . Therefore, in order to calculate E (cid:10) uu (cid:62) , S T,k (∆) (cid:11) , it suffices to calculate the k -thorder derivative of function (1 − g ( α )) k − λ k · ( k !) at α = 0. Write (cid:104)(cid:0) − g ( α ) (cid:1) k − (cid:105) ( k ) λ k ( k !) (cid:12)(cid:12)(cid:12) α =0 = 1 λ k · k − · ( k !) k − (cid:88) t =0 (cid:18) k − t (cid:19)(cid:16) − λ d − (cid:17) t (cid:104) g t − ( α ) g k − − t + ( α ) (cid:105) ( k ) (cid:12)(cid:12)(cid:12) α =0 . (35)Note that g − ( α ) = O ( α ). The terms in eq. (35) with t > k are all 0. Re-call that we are interested in the k -th order term in the error (cid:107) ˆ u ˆ u T − uu T (cid:107) whose denominator is λ k . By eq. (35), the k -th order error term λ k can becontributed from E (cid:10) uu (cid:62) , S T,k (∆) (cid:11) for k = k , k = k + 1 , · · · , k = 2 k .By the above analysis, we conclude that the k -th error term (except thenegligible error terms from translating E ( β ∆ t β ∆ t · · · β ∆ t τ − ) into β t β t · · · β t τ − )of (cid:107) ˆ u ˆ u T − uu T (cid:107) is given by E k = (cid:80) k t =0 E k ,t where (we change k in (35) to k + t ) E k ,t = 1 λ k k + t − k + t )! (cid:18) k + t − t (cid:19)(cid:16) − d − (cid:17) t (cid:104) g t − ( α ) g k − ( α ) (cid:105) ( k + t ) (cid:12)(cid:12)(cid:12) α =0 When t = k , we have g k − ( α ) = (4 d − d − ) k α k (cid:2) − α ( d − + d − ) + (cid:112) (1 − αd − γ − )(1 − αd − γ + ) (cid:3) k implying that (cid:104) g k − ( α ) g k − ( α ) (cid:105) (2 k ) (cid:12)(cid:12)(cid:12) α =0 = (2 k )! (4 d − d − ) k . Therefore, we get E k ,k = ( − k d k − (cid:0) k − k (cid:1) . 
Now, we consider t ≤ k − − α ( d − + d − ) + (cid:112) (1 − αd − γ − )(1 − αd − γ + ) = g + ( α ) − d − α . Xia/Normal Approximation of SVD so that g − ( α ) = d − d − α g + ( α ) − αd − , Then, we get (cid:104) g t − ( α ) g k − ( α ) (cid:105) ( k + t ) (cid:12)(cid:12)(cid:12) α =0 = (cid:104) (4 d − d − α ) t (cid:0) g + ( α ) − αd − (cid:1) t · g k − ( α ) (cid:105) ( k + t ) (cid:12)(cid:12)(cid:12) α =0 = (cid:18) k + t t (cid:19) (2 t )!(4 d − d − ) t (cid:104) g k − ( α )( g + ( α ) − αd − ) t (cid:105) ( k − t ) (cid:12)(cid:12)(cid:12) α =0 . It suffices to calculate the ( k − t )-th derivative of function g k − ( α ) / (cid:0) g + ( α ) − αd − (cid:1) t at α = 0. We write (cid:104) g k − ( α )( g + ( α ) − αd − ) t (cid:105) ( k − t ) (cid:12)(cid:12)(cid:12) α =0 = (cid:104) k − (cid:88) t =0 (cid:18) k − t (cid:19) (2 αd − ) k − − t (cid:0) g + ( α ) − αd − (cid:1) t − t (cid:105) ( k − t ) (cid:12)(cid:12)(cid:12) α =0 . Observe that (cid:2) (2 αd − ) k − − t (cid:3) ( k − t ) (cid:12)(cid:12) α =0 = 0 for all t < t −
1. Then, we get (cid:104) g k − ( α )( g + ( α ) − αd − ) t (cid:105) ( k − t ) (cid:12)(cid:12)(cid:12) α =0 = (cid:104) k − (cid:88) t = t − (cid:18) k − t (cid:19) (2 αd − ) k − − t (cid:0) g + ( α ) − αd − (cid:1) t − t (cid:105) ( k − t ) (cid:12)(cid:12)(cid:12) α =0 . If t = t −
1, then (cid:104)(cid:18) k − t (cid:19) (2 αd − ) k − − t (cid:0) g + ( α ) − αd − (cid:1) t − t (cid:105) ( k − t ) (cid:12)(cid:12)(cid:12) α =0 = (cid:18) k − t − (cid:19) (2 d ) k − t ( k − t )! · t ≥ t , we have (cid:104) (2 αd − ) k − − t (cid:0) g + ( α ) − αd − (cid:1) t − t (cid:105) ( k − t ) (cid:12)(cid:12)(cid:12) α =0 = (cid:18) k − tk − − t (cid:19) (2 d ) k − − t ( k − − t )! (cid:104)(cid:0) g + ( α ) − αd − (cid:1) t − t (cid:105) ( t +1 − t ) (cid:12)(cid:12)(cid:12) α =0 . Clearly, if t = t , then (cid:2)(cid:0) g + ( α ) − αd − (cid:1) t − t (cid:3) ( t +1 − t ) (cid:12)(cid:12) α =0 = 0. For t ≥ t + 1,recall that g + ( α ) − αd = 1 − ( d − + d − ) α + (cid:112) (1 − αd − γ − )(1 − αd − γ + ) . It is easy to check that (cid:2)(cid:0) g + ( α ) − αd − (cid:1) t − t (cid:3) ( t +1 − t ) (cid:12)(cid:12) α =0 = − (cid:2)(cid:0) − ( d − + d − ) α − (cid:112) (1 − αd − γ − )(1 − αd − γ + ) (cid:1) t − t (cid:3) ( t +1 − t ) (cid:12)(cid:12) α =0 = − (cid:104)(cid:16) d − d − α g + ( α ) − αd − (cid:17) t − t (cid:105) ( t +1 − t ) (cid:12)(cid:12)(cid:12) α =0 which is non-zero only when t = t + 1. In fact, when t = t + 1, we get (cid:2)(cid:0) g + ( α ) − αd − (cid:1) t − t (cid:3) ( t +1 − t ) (cid:12)(cid:12) α =0 = − d − d − . . Xia/Normal Approximation of SVD Therefore, we conclude that (cid:104) g k − ( α )( g + ( α ) − αd − ) t (cid:105) ( k − t ) (cid:12)(cid:12)(cid:12) α =0 = (cid:18) k − t − (cid:19) (2 d − ) k − t ( k − t )! · − (cid:18) k − t + 1 (cid:19)(cid:18) k − t (cid:19) (2 d − ) k − t − ( k − − t )!(4 d − d − ) . As a result, for t ≤ k −
1, we get E k ,t = d k − · ( − t (cid:18) k + t − t (cid:19)(cid:18) k − t − (cid:19) − d k − − d − · ( − t (cid:18) k + t − t (cid:19)(cid:18) k − t + 1 (cid:19) . Clearly, it also holds for t = k . Therefore, we have E k = k (cid:88) t =0 E k ,t = d k − k (cid:88) t =0 ( − t (cid:18) k + t − t (cid:19)(cid:18) k − t − (cid:19) − d k − − d − k − (cid:88) t =0 ( − t (cid:18) k + t − t (cid:19)(cid:18) k − t + 1 (cid:19) . It is easy to check that k (cid:88) t =0 ( − t (cid:18) k + t − t (cid:19)(cid:18) k − t − (cid:19) = k (cid:88) t =1 ( − t (cid:18) k + t − t (cid:19)(cid:18) k − t − (cid:19) =( − k − (cid:88) t =0 ( − t (cid:18) k + tt + 1 (cid:19)(cid:18) k − t (cid:19) . It is interesting to observe that (cid:80) k − t =0 ( − t (cid:0) k + tt +1 (cid:1)(cid:0) k − t (cid:1) equals the coefficientof x k − in the polynomial (1 + x ) k (cid:2) − (1 + x ) (cid:3) k − . Then, it is easy to checkthat (cid:80) k − t =0 ( − t (cid:0) k + tt +1 (cid:1)(cid:0) k − t (cid:1) = ( − k − . Similarly, we can observe that k − (cid:88) t =0 ( − t (cid:18) k + t − t (cid:19)(cid:18) k − t + 1 (cid:19) = k − (cid:88) t =1 ( − t − (cid:18) k + t − t − (cid:19)(cid:18) k − t (cid:19) =( − k − (cid:88) t =1 ( − t (cid:18) k + t − t − (cid:19)(cid:18) k − t (cid:19) . Again, it is easy to check that (cid:80) k − t =1 ( − t (cid:0) k + t − t − (cid:1)(cid:0) k − t (cid:1) equals the coefficientof x k − in the polynomial (1 + x ) k − [1 − (1 + x )] k − . As a result, we get (cid:80) k − t =1 ( − t (cid:0) k + t − t − (cid:1)(cid:0) k − t (cid:1) = ( − k − . To this end, we conclude that E k = ( − k d k − − ( d − − d − ), i.e., the k -th error term in E (cid:107) ˆ u ˆ u T − uu T (cid:107) is given by ( − k d k − − ( d − − d − ) λ k (except the negligible error terms). In a similar fashion, we can show that the k -th error term in E (cid:107) ˆ v ˆ v T − vv T (cid:107) is given by ( − k d k − − ( d − − d − ) λ k . Meanwhile, the . Xia/Normal Approximation of SVD negligible error terms from translating E ( β ∆ t β ∆ t · · · β ∆ t τ − ) into β t β t · · · β t τ − areupper bounded by k √ d max · (cid:16) C d max λ r (cid:17) k which concludes the proof. Lemma 6.
Let $\Lambda = {\rm diag}(\lambda_1, \cdots, \lambda_r)$ and $Z \in \mathbb{R}^{r\times d}$ be a random matrix containing i.i.d. standard normal entries. Then, for any positive numbers $j_1, j_2$, we have
$$
\mathbb{E}\big\|\Lambda^{-j_1} Z Z^{T} \Lambda^{-j_2}\big\|_{\rm F}^2
= d^2 \big\|\Lambda^{-j_1-j_2}\big\|_{\rm F}^2
+ d\Big(\big\|\Lambda^{-j_1-j_2}\big\|_{\rm F}^2 + \big\|\Lambda^{-j_1}\big\|_{\rm F}^2 \big\|\Lambda^{-j_2}\big\|_{\rm F}^2\Big).
$$

Proof of Lemma 6. Let $z_1, \cdots, z_r \in \mathbb{R}^{d}$ denote the columns of $Z^{T}$. Therefore, we can write
$$
\big\|\Lambda^{-j_1} Z Z^{T} \Lambda^{-j_2}\big\|_{\rm F}^2
= \sum_{i=1}^{r} \frac{(z_i^{T} z_i)^2}{\lambda_i^{2(j_1+j_2)}}
+ \sum_{1 \le i_1 \ne i_2 \le r} \frac{(z_{i_1}^{T} z_{i_2})^2}{\lambda_{i_1}^{2j_1} \lambda_{i_2}^{2j_2}}.
$$
Since $\mathbb{E}(z_i^{T} z_i)^2 = d^2 + 2d$ and $\mathbb{E}(z_{i_1}^{T} z_{i_2})^2 = d$ for $i_1 \ne i_2$, we get
$$
\mathbb{E}\big\|\Lambda^{-j_1} Z Z^{T} \Lambda^{-j_2}\big\|_{\rm F}^2
= \sum_{i=1}^{r} \frac{d^2 + 2d}{\lambda_i^{2(j_1+j_2)}}
+ \sum_{1 \le i_1 \ne i_2 \le r} \frac{d}{\lambda_{i_1}^{2j_1} \lambda_{i_2}^{2j_2}}
= d^2 \big\|\Lambda^{-j_1-j_2}\big\|_{\rm F}^2
+ d\Big(\big\|\Lambda^{-j_1-j_2}\big\|_{\rm F}^2 + \big\|\Lambda^{-j_1}\big\|_{\rm F}^2 \big\|\Lambda^{-j_2}\big\|_{\rm F}^2\Big).
$$
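The identity in Lemma 6 is easy to confirm numerically. The short sketch below does so for an arbitrary choice of $r$, $d$, the exponents $j_1, j_2$ and the spectrum, all of which are assumptions of the sketch rather than values from the paper.

```python
import numpy as np

# Monte Carlo check of Lemma 6; r, d, j1, j2 and the spectrum are illustrative.
rng = np.random.default_rng(3)
r, d = 4, 60
j1, j2 = 1, 2
lam = np.array([5.0, 4.0, 3.0, 2.0])         # diagonal of Lambda

def frob2(diag_vals):
    """Squared Frobenius norm of a diagonal matrix given its diagonal."""
    return float(np.sum(diag_vals ** 2))

acc, n_rep = 0.0, 20000
for _ in range(n_rep):
    Z = rng.standard_normal((r, d))          # i.i.d. standard normal entries
    M = np.diag(lam ** -j1) @ Z @ Z.T @ np.diag(lam ** -j2)
    acc += np.linalg.norm(M, 'fro') ** 2
lhs = acc / n_rep
rhs = (d ** 2) * frob2(lam ** -(j1 + j2)) + d * (frob2(lam ** -(j1 + j2))
                                                 + frob2(lam ** -j1) * frob2(lam ** -j2))
print(lhs, rhs)   # the two values should agree up to Monte Carlo error
```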