Optimal Structured Principal Subspace Estimation: Metric Entropy and Minimax Rates
Journal of Machine Learning Research 1 (2000) 1-43 Submitted 00/00; Published 00/00
Tony Cai [email protected]
Department of Statistics, University of Pennsylvania, Philadelphia, PA 19104, USA
Hongzhe Li [email protected]
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
Rong Ma [email protected]
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
Editor:
Abstract
Driven by a wide range of applications, several principal subspace estimation problems have been studied individually under different structural constraints. This paper presents a unified framework for the statistical analysis of a general structured principal subspace estimation problem which includes as special cases sparse PCA/SVD, non-negative PCA/SVD, subspace constrained PCA/SVD, and spectral clustering. General minimax lower and upper bounds are established to characterize the interplay between the information-geometric complexity of the structural set for the principal subspaces, the signal-to-noise ratio (SNR), and the dimensionality. The results yield interesting phase transition phenomena concerning the rates of convergence as a function of the SNR and the fundamental limit for consistent estimation. Applying the general results to the specific settings yields the minimax rates of convergence for those problems, including the previously unknown optimal rates for sparse SVD, non-negative PCA/SVD, and subspace constrained PCA/SVD.
Keywords:
Low-rank matrix; Metric entropy; Minimax risk; Principal component analysis; Singular value decomposition
1. Introduction
Spectral methods such as principal component analysis (PCA) and singular value decomposition (SVD) are ubiquitous techniques in modern data analysis, with a wide range of applications in many fields including statistics, machine learning, applied mathematics, and engineering. As fundamental tools for dimension reduction, spectral methods aim to extract the low-dimensional structures embedded in high-dimensional data. In many modern applications, the complexity of the datasets and the need to incorporate existing knowledge from the subject areas require data analysts to take into account prior structural information on the statistical objects of interest. In particular, many interesting problems in high-dimensional data analysis can be formulated as a structured principal subspace estimation problem, where one has the prior knowledge that the underlying principal subspace satisfies certain structural conditions (see Section 1.2 for a list of related problems).

The present paper aims to provide a unified treatment of the structured principal subspace estimation problems that have attracted much recent interest in both theory and practice. To fix ideas, we consider two generic models that have been extensively studied in the literature, namely, the matrix denoising model and the spiked Wishart model (see, for example, Johnstone (2001); Baik and Silverstein (2006); Paul (2007); Bai and Yao (2008); Cai et al. (2013); Donoho and Gavish (2014); Wang and Fan (2017); Choi et al. (2017); Donoho et al. (2018); Perry et al. (2018); Bao et al. (2018), among many others).
Definition 1 (Matrix Denoising Model)
Let $Y \in \mathbb{R}^{p_1 \times p_2}$ be the observed data matrix generated from the model $Y = U \Gamma V^\top + Z$, where $Z \in \mathbb{R}^{p_1 \times p_2}$ has i.i.d. entries from $N(0, \sigma^2)$, $\Gamma \in \mathbb{R}^{r \times r}$ is a diagonal matrix with ordered diagonal entries $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0$ for $1 \le r \le \min\{p_1, p_2\}$, $U \in O(p_1, r)$, and $V \in O(p_2, r)$, with $O(p, r) = \{W \in \mathbb{R}^{p \times r} : W^\top W = I_r\}$ being the set of all $p \times r$ orthonormal matrices.

Definition 2 (Spiked Wishart Model)
Let $Y \in \mathbb{R}^{n \times p}$ be the observed data matrix whose rows $Y_i \in \mathbb{R}^p$, $i = 1, \dots, n$, are independently generated from $N(\mu, U \Gamma U^\top + \sigma^2 I_p)$, where $U \in O(p, r)$ with $1 \le r \le p$, and $\Gamma \in \mathbb{R}^{r \times r}$ is diagonal with ordered diagonal entries $\lambda_1 \ge \dots \ge \lambda_r > 0$. Equivalently, $Y_i$ can be viewed as $Y_i = X_i + \epsilon_i$, where $X_i \sim N(\mu, U \Gamma U^\top)$, $\epsilon_i \sim N(0, \sigma^2 I_p)$, and $X_1, \dots, X_n$ and $\epsilon_1, \dots, \epsilon_n$ are independent.

In the past decades, these two models have attracted substantial practical and theoretical interest and have been studied in different contexts in statistics, probability, and machine learning. This paper addresses the problem of optimal estimation of the principal (eigen/singular) subspace spanned by the orthonormal columns of $U$ (denoted as span($U$)), based on the data matrix $Y$ and the prior structural knowledge on $U$. Specifically, we aim to uncover the deep connections between the statistical limit of the estimation problem as measured by the minimax risk and the geometric complexity of the parameter space as characterized by functions of certain entropy measures.

Since the principal subspaces can be uniquely identified with their associated projection matrices, estimating span($U$) is equivalent to estimating $U U^\top$. A commonly used metric for gauging the distance between two linear subspaces span($U_1$) and span($U_2$) is $d(U_1, U_2) = \|U_1 U_1^\top - U_2 U_2^\top\|_F$. In this paper, we use $d(\cdot, \cdot)$ as the loss function and measure the performance of an estimator $\widehat{U}$ of $U$ by the risk $R(\widehat{U}, U) = \mathbb{E}\, d^2(\widehat{U}, U)$.

The problem considered in this paper can be viewed as a generalization and unification of many interesting problems in high-dimensional statistics and machine learning.
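To make the two models and the loss $d(\cdot,\cdot)$ concrete, here is a minimal numpy simulation; all dimensions, spike strengths, and the noise level are arbitrary illustrative choices, and the unconstrained SVD/eigendecomposition estimates stand in for structured estimators:

```python
import numpy as np

rng = np.random.default_rng(0)

def subspace_dist(U1, U2):
    """d(U1, U2) = ||U1 U1^T - U2 U2^T||_F, the loss used throughout the paper."""
    return np.linalg.norm(U1 @ U1.T - U2 @ U2.T, "fro")

# --- Matrix denoising model: Y = U Gamma V^T + Z, Z_ij ~ N(0, sigma^2) ---
p1, p2, r, sigma = 60, 40, 2, 0.5
U, _ = np.linalg.qr(rng.standard_normal((p1, r)))   # U in O(p1, r)
V, _ = np.linalg.qr(rng.standard_normal((p2, r)))   # V in O(p2, r)
Gamma = np.diag([10.0, 8.0])                        # lambda_1 >= lambda_2 > 0
Y = U @ Gamma @ V.T + sigma * rng.standard_normal((p1, p2))

# With no structural constraint, a natural estimate is the top-r left
# singular subspace of Y.
U_hat_denoise = np.linalg.svd(Y)[0][:, :r]
print("denoising loss:", subspace_dist(U_hat_denoise, U))

# --- Spiked Wishart model: Y_i ~ N(0, U Gamma U^T + sigma^2 I_p) ---
n, p = 200, 50
Uw, _ = np.linalg.qr(rng.standard_normal((p, r)))
Sigma = Uw @ Gamma @ Uw.T + sigma**2 * np.eye(p)
Yw = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Sigma_hat = np.cov(Yw, rowvar=False)
U_hat_wishart = np.linalg.eigh(Sigma_hat)[1][:, -r:]  # top-r sample eigenvectors
print("Wishart loss:", subspace_dist(U_hat_wishart, Uw))
```

Note that the loss is invariant to right-multiplication of either argument by an orthogonal matrix, matching the identification of span($U$) with $UU^\top$.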
We first present a few examples to demonstrate the richness of the structured principal subspace estimation problem and its connections to the existing literature.

1. Sparse PCA/SVD.
The goal of sparse PCA/SVD is to recover span($U$) under the assumption that the columns of $U$ are sparse. Sparse PCA has been extensively studied in the past two decades under the spiked Wishart model (see, for example, d'Aspremont et al. (2005); Zou et al. (2006); Shen and Huang (2008); Witten et al. (2009); Yang et al. (2011); Vu and Lei (2012); Cai et al. (2013); Ma (2013); Birnbaum et al. (2013); Cai et al. (2015), among many others). In particular, the exact minimax rates of convergence under the loss $d(\cdot, \cdot)$ were established by Cai et al. (2013) in the general rank-$r$ setting. In contrast, theoretical analysis of sparse SVD is relatively scarce, and the minimax rate of convergence remains unknown.

2. Non-negative PCA/SVD.
Non-negative PCA/SVD aims to estimate span($U$) under the assumption that the entries of $U$ are non-negative. This problem has been studied by Deshpande et al. (2014) and Montanari and Richard (2015) under the rank-one matrix denoising model ($r = 1$), where the statistical limit and certain sharp asymptotics were carefully established. However, it is still unclear what the minimax rates of convergence are for estimating span($U$) in either the rank-one or the general rank-$r$ setting, under either the spiked Wishart model or the matrix denoising model.

3. Subspace Constrained PCA/SVD.
Subspace constrained PCA/SVD assumes that the columns of $U$ lie in some low-dimensional linear subspace of $\mathbb{R}^p$. In other words, $U \in \mathcal{C}_A(p, k) = \{U \in O(p, r) : A^\top U_{\cdot j} = 0 \text{ for all } 1 \le j \le r\}$ for some rank-$(p - k)$ matrix $A \in \mathbb{R}^{p \times (p - k)}$, where $r < k < p$. Estimating principal subspaces under various linear subspace constraints has been considered in many applications such as network clustering (Wang and Davidson, 2010; Kawale and Boley, 2013; Kleindessner et al., 2019). However, the minimax rates of convergence for subspace constrained PCA/SVD remain unknown.

4. Spectral Clustering.
Suppose we observe $Y_i \sim N(\theta_i, \sigma^2 I_p)$ independently, where $\theta_i \in \{\theta, -\theta\} \subset \mathbb{R}^p$ for $i = 1, \dots, n$. Let $Y \in \mathbb{R}^{n \times p}$ be such that $Y_i$ is the $i$-th row of $Y$. We have $Y = h \theta^\top + Z$, where $h \in \{\pm 1\}^n$ and $Z$ has i.i.d. entries from $N(0, \sigma^2)$. Spectral clustering of $\{Y_i\}_{1 \le i \le n}$ aims to recover the class labels in $h$. Equivalently, spectral clustering can be treated as estimating the leading left singular vector $u = h / \|h\|_2$ in the matrix denoising model with $u \in \mathcal{C}^n_{\pm} = \{u \in \mathbb{R}^n : \|u\|_2 = 1, u_i \in \{\pm n^{-1/2}\}\}$. See Azizyan et al. (2013); Jin and Wang (2016); Lu and Zhou (2016); Jin et al. (2017); Cai and Zhang (2018); Giraud and Verzelen (2018); Ndaoud (2018); Löffler et al. (2019) and references therein for recent theoretical results.

In addition to the aforementioned problems, there are many other interesting problems that share the same generic form as the structured principal subspace estimation problem. For example, motivated by applications in the statistical analysis of metagenomics data, Ma et al. (2019, 2020) considered an approximately rank-one matrix denoising model where the leading singular vector satisfies a monotonicity constraint. As another example, in a special case of the matrix denoising model, namely the Gaussian Wigner model $Y = \lambda u u^\top + Z \in \mathbb{R}^{n \times n}$, where $Z$ has i.i.d. entries (up to symmetry) drawn from a Gaussian distribution, the Gaussian $\mathbb{Z}_2$ synchronization problem amounts to estimating $u$ with $u \in \{u \in \mathbb{R}^n : \|u\|_2 = 1, u_i \in \{\pm n^{-1/2}\}\}$.
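As a toy illustration of the spectral clustering example above (cluster signal, sizes, and noise level are arbitrary choices), clustering by the signs of the leading left singular vector can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 30, 1.0

# Two clusters: Y_i ~ N(h_i * theta, sigma^2 I_p) with h_i in {+1, -1}, i.e.
# Y = h theta^T + Z, the rank-one matrix denoising model with u = h / ||h||_2.
theta = rng.standard_normal(p)
theta *= 3.0 / np.linalg.norm(theta)          # signal strength ||theta||_2 = 3
h = np.sign(rng.standard_normal(n))
Y = np.outer(h, theta) + sigma * rng.standard_normal((n, p))

# Spectral clustering: cluster by the signs of the leading left singular vector.
u_hat = np.linalg.svd(Y)[0][:, 0]
labels = np.sign(u_hat)

# Misclustering rate, up to the global sign ambiguity of the singular vector.
err = min(np.mean(labels != h), np.mean(labels != -h))
print("misclustering rate:", err)
```

The sign ambiguity reflects the fact that span($u$) rather than $u$ itself is identifiable, which is exactly why the loss is defined through projection matrices.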
These important applications provide motivation for a unified framework to study the fundamental difficulty and optimality of these estimation problems.

On the other hand, the investigation of metric entropy as a measure of statistical complexity has been one of the central topics in theoretical statistics, ranging from nonparametric function estimation (Yatracos, 1988; Haussler and Opper, 1997b; Yang and Barron, 1999; Yang, 1999; Wu and Yang, 2016) and high-dimensional statistical inference (Raskutti et al., 2011; Verzelen, 2012; Vu and Lei, 2012; Cai et al., 2013; Ma, 2013) to statistical learning theory (Haussler and Opper, 1997a; Lugosi and Nobel, 1999; Bousquet et al., 2002; Bartlett and Mendelson, 2002; Koltchinskii, 2006; Lecué and Mendelson, 2009; Cai et al., 2016; Rakhlin et al., 2017). In these works, interesting connections between the complexity of the parameter space and the fundamental difficulty of the statistical problem as quantified by certain minimax risks have been carefully established. In this sense, the current work stands as a step along this direction in the context of principal subspace estimation under some general random matrix models.

The main contribution of this paper is three-fold. Firstly, a unified framework is introduced for the study of structured principal subspace estimation problems under both the matrix denoising model and the spiked Wishart model. Novel generic minimax lower bounds and risk upper bounds are established to characterize explicitly the interplay between the information-geometric complexity of the structural set for the principal subspaces, the signal-to-noise ratio (SNR), and the dimensionality of the parameter spaces. The results yield interesting phase transition phenomena concerning the rates of convergence as functions of the SNR and the fundamental limit for consistent estimation.
The general lower and upper bounds reduce the determination of the minimax optimal rates for many interesting problems to mere calculations of certain information-geometric quantities. Secondly, to obtain the general risk upper bounds, new technical tools are developed for the analysis of the proposed estimators in their general forms. In addition, the minimax lower bounds rely on careful constructions of multiple composite hypotheses about the structured parameter spaces, and non-trivial calculations of the Kullback-Leibler (KL) divergence between certain mixture probability measures, which can be of independent interest. Thirdly, by directly applying our general results to the specific problems discussed in Section 1.2, we establish the minimax optimal rates for those problems. Among them, the minimax rates for sparse SVD, non-negative PCA/SVD, and subspace constrained PCA/SVD are, to our knowledge, previously unknown.

The rest of the paper is organized as follows. After introducing the notation at the end of this section, we characterize in Section 2 a minimax lower bound under the matrix denoising model using local metric entropy measures. A general estimator is introduced in Section 3 and its risk upper bound is obtained via certain global metric entropy measures. In Section 4, the spiked Wishart model is discussed in detail and generic risk lower and upper bounds are obtained. The general results are applied in Section 5 to specific settings, and minimax optimal rates are established by explicitly calculating the local and global metric-entropic quantities.
In Section 6, we address the computational issues of the proposed estimators, discuss some extensions, and make connections to some other interesting problems.

For a vector $a = (a_1, \dots, a_n)^\top \in \mathbb{R}^n$, we denote by $\mathrm{diag}(a_1, \dots, a_n) \in \mathbb{R}^{n \times n}$ the diagonal matrix whose $i$-th diagonal entry is $a_i$, and define the $\ell_p$ norm $\|a\|_p = \big( \sum_{i=1}^n a_i^p \big)^{1/p}$. We write $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. For a matrix $A = (a_{ij}) \in \mathbb{R}^{p_1 \times p_2}$, we define its Frobenius norm as $\|A\|_F = \sqrt{\sum_{i=1}^{p_1} \sum_{j=1}^{p_2} a_{ij}^2}$ and its spectral norm as $\|A\| = \sup_{\|x\|_2 \le 1} \|Ax\|_2$; we also denote by $A_{\cdot i} \in \mathbb{R}^{p_1}$ its $i$-th column and by $A_{i \cdot} \in \mathbb{R}^{p_2}$ its $i$-th row. Let $O(p, k) = \{V \in \mathbb{R}^{p \times k} : V^\top V = I_k\}$ be the set of all $p \times k$ orthonormal matrices and $O_p = O(p, p)$ the set of $p$-dimensional orthogonal matrices. For a rank-$r$ matrix $A \in \mathbb{R}^{p_1 \times p_2}$ with $1 \le r \le p_1 \wedge p_2$, its SVD is denoted as $A = U \Gamma V^\top$, where $U \in O(p_1, r)$, $V \in O(p_2, r)$, and $\Gamma = \mathrm{diag}(\lambda_1(A), \lambda_2(A), \dots, \lambda_r(A))$ with $\lambda_{\max}(A) = \lambda_1(A) \ge \lambda_2(A) \ge \dots \ge \lambda_{p_1 \wedge p_2}(A) = \lambda_{\min}(A) \ge 0$. The columns of $U$ and the columns of $V$ are the left and right singular vectors associated with the non-zero singular values of $A$, respectively. For a given set $S$, we denote its cardinality by $|S|$. For sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = o(b_n)$ or $a_n \ll b_n$ if $\lim_n a_n / b_n = 0$, and write $a_n = O(b_n)$, $a_n \lesssim b_n$, or $b_n \gtrsim a_n$ if there exists a constant $C$ such that $a_n \le C b_n$ for all $n$. We write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $a_n \gtrsim b_n$. Lastly, $c, C, C_1, C_2, \dots$ denote constants that may vary from place to place.
2. Minimax Lower Bounds via Local Packing
We start with the matrix denoising model. Without loss of generality, we focus on estimating the structured left singular subspace span($U$). Specifically, for a given subset $\mathcal{C} \subset O(p_1, r)$, we consider the parameter space
$$\mathcal{Y}(\mathcal{C}, t, p_1, p_2, r) = \left\{ (\Gamma, U, V) : \begin{array}{l} \Gamma = \mathrm{diag}(\lambda_1, \dots, \lambda_r), \; U \in \mathcal{C}, \; V \in O(p_2, r), \\ Lt \ge \lambda_1 \ge \dots \ge \lambda_r \ge t/L > 0 \end{array} \right\}, \quad (1)$$
for some fixed constant $L > 1$. For any $U_0 \in O(p_1, r)$ and $\epsilon > 0$, the $\epsilon$-ball centered at $U_0$ is defined as
$$B(U_0, \epsilon) = \{ U' \in O(p_1, r) : d(U', U_0) \le \epsilon \},$$
and for any given subset $\mathcal{C} \subset O(p_1, r)$, we define $\mathrm{diam}(\mathcal{C}) = \sup_{U_1, U_2 \in \mathcal{C}} d(U_1, U_2)$. We introduce the concepts of packing and covering of a given set before stating a general minimax lower bound.

Definition 3 ($\epsilon$-packing and $\epsilon$-covering)
Let $(V, d)$ be a metric space and $M \subset V$. We say that $G(M, d, \epsilon) \subset M$ is an $\epsilon$-packing of $M$ if for any $m_i, m_j \in G(M, d, \epsilon)$ with $m_i \ne m_j$, it holds that $d(m_i, m_j) > \epsilon$. We say that $H(M, d, \epsilon) \subset M$ is an $\epsilon$-covering of $M$ if for any $m \in M$, there exists $m' \in H(M, d, \epsilon)$ such that $d(m, m') < \epsilon$. We denote by $\mathcal{M}(M, d, \epsilon) = \max\{|G(M, d, \epsilon)|\}$ and $\mathcal{N}(M, d, \epsilon) = \min\{|H(M, d, \epsilon)|\}$ the $\epsilon$-packing number and the $\epsilon$-covering number of $M$, respectively.

Following Yang and Barron (1999), we also define the metric entropy of a given set.
Definition 4 (packing and covering (cid:15) -entropy)
Let $\mathcal{M}(M, d, \epsilon)$ and $\mathcal{N}(M, d, \epsilon)$ be the $\epsilon$-packing and $\epsilon$-covering numbers of $M$, respectively. We call $\log \mathcal{M}(M, d, \epsilon)$ the packing $\epsilon$-entropy and $\log \mathcal{N}(M, d, \epsilon)$ the covering $\epsilon$-entropy of $M$.

The following theorem gives a minimax lower bound for estimating span($U$) over $\mathcal{Y}(\mathcal{C}, t, p_1, p_2, r)$, as a function of the cardinality of a local packing set of $\mathcal{C}$, the magnitude of the leading singular values ($t$), the noise level ($\sigma$), the rank ($r$), and the dimension ($p_2$) of the right singular vectors in $V$.

Theorem 5
Under the matrix denoising model $Y = U \Gamma V^\top + Z$ with $(\Gamma, U, V) \in \mathcal{Y}(\mathcal{C}, t, p_1, p_2, r)$, suppose there exist some $U_0 \in \mathcal{C}$, $\epsilon_0 > 0$, and $\alpha \in (0, 1)$ such that a local packing set $G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)$ satisfies
$$\epsilon_0 = \frac{\sqrt{c}\, \sigma \sqrt{t^2 + \sigma^2 p_2}}{t^2} \sqrt{\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|} \wedge \mathrm{diam}(\mathcal{C}) \quad (2)$$
for some $c \in (0, 1/2]$. Then, as long as $|G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)| \ge 2$, it holds that, for $\theta = (\Gamma, U, V)$,
$$\inf_{\widehat{U}} \sup_{\theta \in \mathcal{Y}(\mathcal{C}, t, p_1, p_2, r)} R(\widehat{U}, U) \gtrsim \left( \frac{\sigma \sqrt{t^2 + \sigma^2 p_2}}{t^2} \sqrt{\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|} \wedge \mathrm{diam}(\mathcal{C}) \right)^2, \quad (3)$$
where the infimum is over all estimators based on the observation $Y$.

The above theorem, to the best of our knowledge, is the first minimax lower bound result for the matrix denoising model under the general parameter space (1). Its proof is separated into two parts. In the strong signal regime ($t \gtrsim \sigma \sqrt{p_2}$), the minimax lower bound can be obtained by generalizing the ideas in Vu and Lei (2012, 2013) and Cai et al. (2013), where a general lower bound for testing multiple hypotheses (Lemma 30) is applied to obtain (3). In contrast, the analysis is much more complicated in the weak signal regime ($t \lesssim \sigma \sqrt{p_2}$) due to the asymmetry between $U$ and $V$: the dependence on $p_2$ needs to be captured with extra effort in the lower bound construction (Cai and Zhang, 2018), which is different from the aforementioned works on sparse PCA.
To achieve this, our analysis relies on a generalized Fano's method for testing multiple composite hypotheses (Lemma 31) and a nontrivial calculation of the pairwise KL divergence between certain mixture probability measures (Lemma 32).

A key observation from the above theorem is the role of the local packing set $G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)$ and its entropy measure $\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|$ in characterizing the fundamental difficulty of the estimation problem. Similar phenomena connecting local packing numbers to minimax lower bounds have been observed in, for example, nonparametric function estimation (Yang and Barron, 1999), high-dimensional linear regression (Raskutti et al., 2011; Verzelen, 2012), and sparse principal component analysis (Vu and Lei, 2012; Cai et al., 2013).

By Cai and Zhang (2018), a sharp minimax lower bound for estimating span($U$) under the unstructured matrix denoising model (i.e., $\mathcal{C} = O(p_1, r)$) is
$$\inf_{\widehat{U}} \sup_{(\Gamma, U, V) \in \mathcal{Y}(O(p_1, r), t, p_1, p_2, r)} R(\widehat{U}, U) \gtrsim \left( \frac{\sigma \sqrt{(t^2 + \sigma^2 p_2)\, r p_1}}{t^2} \wedge \sqrt{r} \right)^2, \quad (4)$$
which, in light of the packing number estimates for the orthogonal group (Lemma 1 of Cai et al. (2013)), is a direct consequence of our lower bound (3) for any $U_0 \in O(p_1, r)$. In addition, comparing the lower bounds (3) and (4), we observe that the information-geometric quantity $\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|$ essentially quantifies the intrinsic statistical dimension (which is $r p_1$ in the case of $\mathcal{C} = O(p_1, r)$) of the set $\mathcal{C}$.
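For intuition about the packing and covering quantities used above, the sketch below bounds them by greedy constructions on a finite toy set (random points on the unit circle; the greedy values are one-sided bounds, not the exact extremal numbers):

```python
import numpy as np

def greedy_packing(points, dist, eps):
    """Greedy eps-packing: kept points are pairwise more than eps apart, so its
    size is a lower bound on the packing number M(points, dist, eps)."""
    pack = []
    for x in points:
        if all(dist(x, y) > eps for y in pack):
            pack.append(x)
    return pack

def greedy_covering(points, dist, eps):
    """Greedy eps-covering (a maximal packing at scale eps is also a covering);
    its size is an upper bound on the covering number N(points, dist, eps)."""
    cover = []
    for x in points:
        if all(dist(x, y) >= eps for y in cover):
            cover.append(x)
    return cover

# Toy set: 1000 random points on the unit circle, Euclidean distance.
rng = np.random.default_rng(2)
ang = rng.uniform(0, 2 * np.pi, 1000)
pts = list(np.column_stack([np.cos(ang), np.sin(ang)]))
dist = lambda x, y: float(np.linalg.norm(x - y))

eps = 0.2
M_hat = len(greedy_packing(pts, dist, eps))
N_hat = len(greedy_covering(pts, dist, eps))
print(M_hat, N_hat)   # both scale like the circumference divided by eps
```

The same greedy idea applied inside a ball $B(U_0, \epsilon_0)$ at scale $\alpha\epsilon_0$ is what the local packing sets in Theorem 5 formalize.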
3. Risk Upper Bound using Dudley’s Entropy Integral
In this section, we consider a general singular subspace estimator and study its theoretical properties. Specifically, we obtain a risk upper bound which, analogous to the minimax lower bound, can be expressed as a function of certain entropic measures related to the structural constraint $\mathcal{C}$.

Under the matrix denoising model, with the parameters $(\Gamma, U, V) \in \mathcal{Y}(\mathcal{C}, t, p_1, p_2, r)$ for some given set $\mathcal{C} \subset O(p_1, r)$, we consider the structured singular subspace estimator
$$\widehat{U} = \arg\max_{U \in \mathcal{C}} \mathrm{tr}(U^\top Y Y^\top U). \quad (5)$$
Before stating our main theorem, we introduce further definitions of quantities that play important roles in our subsequent discussions.

Definition 6
For a given $\mathcal{C} \subset O(p_1, r)$ and any $U \in \mathcal{C}$, we define the set
$$T(\mathcal{C}, U) = \left\{ \frac{W W^\top - U U^\top}{\|W W^\top - U U^\top\|_F} \in \mathbb{R}^{p_1 \times p_1} : W \in \mathcal{C} \setminus \{U\} \right\},$$
equipped with the Frobenius distance $d$, where for any $D_1, D_2 \in T(\mathcal{C}, U)$, we define $d(D_1, D_2) = \|D_1 - D_2\|_F$.
For a metric space $(T, d)$ and a subset $A \subset T$, Dudley's entropy integral of $A$ is defined as $D(A, d) = \int_0^\infty \sqrt{\log \mathcal{N}(A, d, \epsilon)}\, d\epsilon$. Moreover, we define $D'(A, d) = \int_0^\infty \log \mathcal{N}(A, d, \epsilon)\, d\epsilon$.

Theorem 8
Under the matrix denoising model, for any given subset $\mathcal{C} \subset O(p_1, r)$ and the parameter space $\mathcal{Y}(\mathcal{C}, t, p_1, p_2, r)$, if $t^2/\sigma^2 \gtrsim \sup_{U \in \mathcal{C}} [D'(T(\mathcal{C}, U), d) / D(T(\mathcal{C}, U), d)]$, it holds that
$$\sup_{(\Gamma, U, V) \in \mathcal{Y}(\mathcal{C}, t, p_1, p_2, r)} R(\widehat{U}, U) \lesssim \left( \frac{\sigma \Delta(\mathcal{C}) \sqrt{t^2 + \sigma^2 p_2}}{t^2} \wedge \mathrm{diam}(\mathcal{C}) \right)^2, \quad (6)$$
where $\Delta(\mathcal{C}) = \sup_{U \in \mathcal{C}} D(T(\mathcal{C}, U), d)$.

The proof of the above theorem, as it concerns the generic estimator (5) under an arbitrary structural set $\mathcal{C}$, is involved and very different from the existing works such as Cai et al. (2013), Deshpande et al. (2014), Cai and Zhang (2018), and Zhang et al. (2018), where specific examples of $\mathcal{C}$ are considered. The argument relies on a careful analysis of the supremum of a Gaussian chaos of order 2 and the supremum of a Gaussian process. In the latter case, we applied Dudley's integral inequality (Theorem 22) and the invariance property of covering numbers with respect to Lipschitz maps (Lemma 23), whereas in the former case, the Arcones-Giné decoupling inequality (Theorem 24) as well as a generic chaining argument (Theorem 27) were used to obtain the desired upper bounds. Many technical tools combined for the proof of this theorem can be of independent interest. See Section A.1 for more details.

Interestingly, both the risk upper bound (6) and the minimax lower bound (3) exhibit two phase transitions when treated as functions of the SNR $t/\sigma$, with the first critical point
$$\frac{t}{\sigma} \asymp \sqrt{p_2}, \quad (7)$$
and the second critical point
$$\frac{t}{\sigma} \asymp \left[ \frac{\zeta}{\mathrm{diam}^2(\mathcal{C})} + \frac{\sqrt{\zeta p_2}}{\mathrm{diam}(\mathcal{C})} \right]^{1/2}, \quad (8)$$
where in the upper bound $\zeta = \Delta^2(\mathcal{C})$ and in the lower bound $\zeta = \log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|$. Specifically, the phase transition at the first critical point highlights the role of the dimensionality of the right singular vectors ($V$) and the change of the rate of convergence from an inverse quadratic function ($\sigma^2 \sqrt{p_2 \zeta} / t^2$) to an inverse linear function ($\sigma \sqrt{\zeta} / t$) of $t/\sigma$.
The message from the second phase transition concerns the statistical limit of the estimation problem: consistent estimation is possible only when the SNR asymptotically exceeds the critical point (8). See Figure 1 (left) for a graphical illustration. As for the implications of the condition
$$t^2/\sigma^2 \gtrsim \sup_{U \in \mathcal{C}} [D'(T(\mathcal{C}, U), d) / D(T(\mathcal{C}, U), d)] \quad (9)$$
required by Theorem 8, it can be seen in Section 5 that, for many specific problems, a sufficient condition for (9) is that $t/\sigma$ is above the second critical point (8), which is mild and natural since the latter condition characterizes the region where $\widehat{U}$ is consistent and, more generally, where consistent estimation is possible.

Comparing our risk upper bound (6) to the minimax lower bound (3), we observe the similar roles played by the information-geometric quantities that characterize the intrinsic statistical dimension of the sets $\mathcal{C}$ or $T(\mathcal{C}, U)$. Specifically, in (6), the quantity $\Delta(\mathcal{C})$ is related to the global covering entropy, whereas in (3), the quantity $\sqrt{\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|}$ is associated with the local packing entropy. To obtain the minimax optimal rate of convergence, we need to compare the above two quantities and show
$$\Delta^2(\mathcal{C}) \asymp \log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|. \quad (10)$$
Proving the above equation in its general form is difficult. Alternatively, we briefly discuss the affinity between these two geometric quantities yielded by information theory and leave more detailed discussions to the context of some specific examples in Section 5.
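As a purely numerical aside, the entropy integrals $D$ and $D'$ of Definition 7 can be mimicked on a finite point set by combining a greedy covering-number upper bound with the trapezoidal rule; the point cloud, grid, and all sizes below are arbitrary illustrative choices, and no claim is made about the continuous sets $T(\mathcal{C}, U)$ themselves:

```python
import numpy as np

def covering_number(points, eps):
    """Greedy upper bound on the eps-covering number of a finite point set."""
    cover = []
    for x in points:
        if all(np.linalg.norm(x - y) >= eps for y in cover):
            cover.append(x)
    return len(cover)

def dudley_integral(points, eps_grid):
    """Trapezoidal approximation of D(A, d) = int_0^inf sqrt(log N(A, d, eps)) d(eps).
    The integrand vanishes once a single ball covers the whole set."""
    vals = np.array([np.sqrt(np.log(covering_number(points, e))) for e in eps_grid])
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(eps_grid)))

rng = np.random.default_rng(3)
pts = list(rng.standard_normal((150, 2)))   # a bounded random cloud in R^2
grid = np.linspace(1e-3, 8.0, 50)           # grid extends beyond the diameter
D_hat = dudley_integral(pts, grid)
print("approximate Dudley integral:", D_hat)
```

Replacing `sqrt(log N)` by `log N` in the integrand gives the analogous approximation of $D'$.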
Figure 1: A graphical illustration of the phase transitions in risks as a function of the SNRs under the matrix denoising model (left) and the spiked Wishart model (right).

By the definition of the packing numbers, we have the relationship
$$\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)| \le \log \mathcal{M}(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0), \quad (11)$$
which links $\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|$ to the local packing entropy. A well-known fact about the equivalence between the packing and covering numbers of a set $M$ is that
$$\mathcal{M}(M, d, 2\epsilon) \le \mathcal{N}(M, d, \epsilon) \le \mathcal{M}(M, d, \epsilon). \quad (12)$$
Moreover, Yang and Barron (1999) obtained a very interesting result connecting the local and the global (covering) metric entropies. Specifically, for a suitably chosen $U_0 \in M$,
$$\log \mathcal{M}(M, d, \epsilon/2) - \log \mathcal{M}(M, d, \epsilon) \le \log \mathcal{M}(B(U_0, \epsilon) \cap M, d, \epsilon/2) \le \log \mathcal{M}(M, d, \epsilon/2). \quad (13)$$
In Section 5, by focusing on some specific examples of $\mathcal{C}$ that are widely considered in practice, we show that equation (10) holds, which along with our generic lower and upper bounds recovers some existing minimax rates and, more importantly, helps to establish some previously unknown rates.
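To make the generic estimator (5) concrete, the following sketch computes $\arg\max_{u \in \mathcal{C}} \mathrm{tr}(u^\top Y Y^\top u)$ by exhaustive search over the discrete set $\mathcal{C}^n_{\pm}$ from the spectral clustering example (toy sizes and signal strength; brute-force enumeration is exponential and is used here only for illustration, as computational issues are deferred to Section 6):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 10, 20, 1.0

# Rank-one truth: u in C_{pm} (entries +-n^{-1/2}), unit right singular vector v
u_true = np.ones(n) / np.sqrt(n)
u_true[: n // 2] *= -1
v = rng.standard_normal(p)
v /= np.linalg.norm(v)
t = 8.0                                     # singular value (signal strength)
Y = t * np.outer(u_true, v) + sigma * rng.standard_normal((n, p))

# Estimator (5): u_hat = argmax_{u in C} u^T (Y Y^T) u, by exhaustive search
G = Y @ Y.T
best, u_hat = -np.inf, None
for signs in itertools.product([-1.0, 1.0], repeat=n):
    u = np.array(signs) / np.sqrt(n)
    val = u @ G @ u
    if val > best:
        best, u_hat = val, u

# Loss d(u_hat, u_true); note u and -u give the same objective and the same loss
loss = np.linalg.norm(np.outer(u_hat, u_hat) - np.outer(u_true, u_true), "fro")
print("loss:", loss)
```

Because the objective and the loss depend on $u$ only through $uu^\top$, the global sign ambiguity of the search is harmless.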
4. Structured Eigen Subspace Estimation in the Spiked Wishart Model
We turn our focus in this section to the spiked Wishart model, where one has i.i.d. observations $Y_i \sim N(\mu, \Sigma)$ with $\Sigma = U \Gamma U^\top + \sigma^2 I_p$, which is usually referred to as the spiked covariance matrix. Similar to the matrix denoising model, a minimax lower bound based on a local packing set and a risk upper bound based on Dudley's entropy integral can be obtained.

For any given subset $\mathcal{C} \subset O(p, r)$, we consider the parameter space
$$\mathcal{Z}(\mathcal{C}, t, p, r) = \{ (\Gamma, U) : \Gamma = \mathrm{diag}(\lambda_1, \dots, \lambda_r), \; Lt \ge \lambda_1 \ge \dots \ge \lambda_r \ge t/L > 0, \; U \in \mathcal{C} \},$$
where $L > 1$ is a fixed constant. The following theorem gives a minimax lower bound for estimating span($U$) over $\mathcal{Z}(\mathcal{C}, t, p, r)$ under the spiked Wishart model.

Theorem 9
Under the spiked Wishart model with $(\Gamma, U) \in \mathcal{Z}(\mathcal{C}, t, p, r)$, suppose there exist some $U_0 \in \mathcal{C}$, $\epsilon_0 > 0$, and $\alpha \in (0, 1)$ such that a local packing set $G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)$ satisfies
$$\epsilon_0 = \frac{\sigma \sqrt{c(\sigma^2 + t)}}{t \sqrt{n}} \sqrt{\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|} \wedge \mathrm{diam}(\mathcal{C}), \quad (14)$$
for some $c \in (0, 1/2]$. Then, as long as $|G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)| \ge 2$, it holds that
$$\inf_{\widehat{U}} \sup_{(\Gamma, U) \in \mathcal{Z}(\mathcal{C}, t, p, r)} R(\widehat{U}, U) \gtrsim \left( \frac{\sigma \sqrt{\sigma^2 + t}}{t \sqrt{n}} \sqrt{\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|} \wedge \mathrm{diam}(\mathcal{C}) \right)^2, \quad (15)$$
where the infimum is over all estimators based on the observation $Y$.

In Zhang et al. (2018), a sharp minimax lower bound for estimating span($U$) under the unstructured spiked Wishart model was obtained as
$$\inf_{\widehat{U}} \sup_{(\Gamma, U) \in \mathcal{Z}(O(p, r), t, p, r)} R(\widehat{U}, U) \gtrsim \left( \frac{\sigma \sqrt{(\sigma^2 + t)\, r p}}{t \sqrt{n}} \wedge \sqrt{r} \right)^2. \quad (16)$$
Comparing the general lower bound (15) with (16), we observe that the local entropic quantity $\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|$ again characterizes the intrinsic statistical dimension (which is $rp$ in the case of $\mathcal{C} = O(p, r)$) of the set $\mathcal{C}$. See Section 5 for more examples.

Under the spiked Wishart model, to estimate the eigen subspace span($U$) under the structural constraint $U \in \mathcal{C}$, we start with the sample covariance matrix
$$\widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^n (Y_i - \bar{Y})(Y_i - \bar{Y})^\top,$$
where $\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i$ and $Y_i$ is the $i$-th row of the observed data matrix $Y \in \mathbb{R}^{n \times p}$. Since $\widehat{\Sigma}$ is invariant to any translation of $Y$, we assume $\mu = 0$ without loss of generality. Similar to the matrix denoising model, for the spiked Wishart model, with a slight abuse of notation, we define the eigen subspace estimator as
$$\widehat{U} = \arg\max_{U \in \mathcal{C}} \mathrm{tr}(U^\top \widehat{\Sigma} U). \quad (17)$$
The following theorem provides the risk upper bound of $\widehat{U}$.

Theorem 10
Under the spiked Wishart model, for any given $\mathcal{C} \subset O(p, r)$ and the parameter space $\mathcal{Z}(\mathcal{C}, t, p, r)$, suppose $n \gtrsim \max\{\log(t/\sigma^2), r\}$ and $t/\sigma^2 \gtrsim \sup_{U \in \mathcal{C}} [D'(T(\mathcal{C}, U), d) / D(T(\mathcal{C}, U), d)]$. Then
$$\sup_{(\Gamma, U) \in \mathcal{Z}(\mathcal{C}, t, p, r)} R(\widehat{U}, U) \lesssim \left( \frac{\sigma \Delta(\mathcal{C}) \sqrt{t + \sigma^2}}{t \sqrt{n}} \wedge \mathrm{diam}(\mathcal{C}) \right)^2,$$
where $\Delta(\mathcal{C})$ is defined in Theorem 8.

Similar to the matrix denoising model, the above risk upper bound has a great affinity to the minimax lower bound (15), up to a difference in the information-geometric (metric-entropic) measures of $\mathcal{C}$, and the sharpness of our results relies on the relative magnitude of the pair of quantities $\Delta^2(\mathcal{C})$ and $\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|$. In addition, phase transitions in the rates of the lower and upper bounds as functions of the SNR $t/\sigma^2$ can be observed, with the first critical point at
$$\frac{t}{\sigma^2} \asymp 1, \quad (18)$$
and the second critical point at
$$\frac{t}{\sigma^2} \asymp \frac{\zeta}{n \cdot \mathrm{diam}^2(\mathcal{C})} + \sqrt{\frac{\zeta}{n \cdot \mathrm{diam}^2(\mathcal{C})}}, \quad (19)$$
where in the lower bound $\zeta = \log |G(B(U_0, \epsilon_0) \cap \mathcal{C}, d, \alpha \epsilon_0)|$ and in the upper bound $\zeta = \Delta^2(\mathcal{C})$. Again, the phase transition at the first critical point reflects the change in the speed of the rates of convergence, whereas the phase transition at the second critical point characterizes the statistical limit of the estimation problem. See Figure 1 (right) for a graphical illustration. Finally, it will be seen in Section 5 that for many specific problems, the condition $t/\sigma^2 \gtrsim \sup_{U \in \mathcal{C}} [D'(T(\mathcal{C}, U), d) / D(T(\mathcal{C}, U), d)]$ required by Theorem 10 is mild and in fact necessary for consistent estimation.
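The qualitative behavior described above can be seen empirically in a small Monte Carlo sketch for the rank-one case, using the unconstrained top sample eigenvector as the estimator (all dimensions and SNR values below are arbitrary choices): the empirical risk drops from near its maximal value of 2 to a small value as the SNR $t/\sigma^2$ grows.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma, reps = 100, 40, 1.0, 20

def risk_at(t):
    """Monte Carlo estimate of E d^2(u_hat, u) in the rank-one spiked Wishart
    model, using the (unconstrained) top sample eigenvector as u_hat."""
    losses = []
    for _ in range(reps):
        u = rng.standard_normal(p)
        u /= np.linalg.norm(u)
        Sigma = t * np.outer(u, u) + sigma**2 * np.eye(p)
        Y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        S_hat = np.cov(Y, rowvar=False)
        u_hat = np.linalg.eigh(S_hat)[1][:, -1]        # top sample eigenvector
        losses.append(np.linalg.norm(np.outer(u_hat, u_hat)
                                     - np.outer(u, u), "fro") ** 2)
    return float(np.mean(losses))

risks = {t: risk_at(t) for t in [0.1, 1.0, 10.0]}      # SNR = t / sigma^2
for t, R in risks.items():
    print(f"t/sigma^2 = {t:5.1f}   empirical risk = {R:.3f}")
```

This only illustrates the unstructured case; under a structural constraint the estimator (17) restricts the maximization to $\mathcal{C}$.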
5. Applications
In the following, building upon the minimax lower bounds and the risk upper bounds established in the previous sections, we obtain minimax rates and fundamental limits for various structured principal subspace estimation problems of broad interest. Specifically, in light of our generic results, we show the asymptotic equivalence of various local and global entropic measures associated with some specific examples of $\mathcal{C}$. The previous discussions under the general settings, such as the phase transition phenomena, also apply to each of the examples.

We start with sparse PCA/SVD, where the columns of $U$ are sparse vectors. Let $\mathcal{C}_S(p, r, k)$ be the $k$-sparse subset of $O(p, r)$ for some $k \le p$, i.e., $\mathcal{C}_S(p, r, k) = \{U \in O(p, r) : \max_{1 \le i \le r} \|U_{\cdot i}\|_0 \le k\}$. The following proposition concerns estimates of the local and global entropic quantities associated with the set $\mathcal{C}_S(p, r, k)$. For simplicity, we write $\mathcal{C}_S(k) = \mathcal{C}_S(p, r, k)$ when there is no confusion.

Proposition 11
Under the matrix denoising model with $(\Gamma, U, V) \in \mathcal{Y}(\mathcal{C}_S(k), t, p_1, p_2, r)$, where $k = o(p_1)$ and $r = O(1)$, there exist some $(U_0, \epsilon_0, \alpha)$ and a local packing set $G(B(U_0, \epsilon_0) \cap \mathcal{C}_S(p_1, r, k), d, \alpha \epsilon_0)$ satisfying (2) such that
$$\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}_S(p_1, r, k), d, \alpha \epsilon_0)| \asymp \Delta^2(\mathcal{C}_S(p_1, r, k)) \asymp k \log(e p_1 / k) + k.$$
Similarly, under the spiked Wishart model with $(\Gamma, U) \in \mathcal{Z}(\mathcal{C}_S(k), t, p, r)$, where $k = o(p)$ and $r = O(1)$, there exist some $(U_0, \epsilon_0, \alpha)$ and a local packing set $G(B(U_0, \epsilon_0) \cap \mathcal{C}_S(p, r, k), d, \alpha \epsilon_0)$ satisfying (14) such that
$$\log |G(B(U_0, \epsilon_0) \cap \mathcal{C}_S(p, r, k), d, \alpha \epsilon_0)| \asymp \Delta^2(\mathcal{C}_S(p, r, k)) \asymp k \log(e p / k) + k.$$
In light of our lower and upper bounds under both the matrix denoising model (Theorems 5 and 8) and the spiked Wishart model (Theorems 9 and 10), Proposition 11 enables us to establish sharp minimax rates of convergence for sparse PCA/SVD.
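As an illustration of the sparse setting (toy dimensions chosen arbitrarily; the exhaustive support scan is exponential in general and is shown only to make the constrained estimator concrete), the rank-one version of estimator (17) over $\mathcal{C}_S(p, 1, k)$ can be computed exactly as follows:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n, p, k, sigma, t = 200, 12, 3, 1.0, 5.0

# k-sparse leading eigenvector: support {0, 1, 2}, equal entries 1/sqrt(k)
u = np.zeros(p)
u[:k] = 1.0 / np.sqrt(k)
Sigma = t * np.outer(u, u) + sigma**2 * np.eye(p)
Y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S_hat = np.cov(Y, rowvar=False)

# argmax_{u in C_S(p,1,k)} u^T S_hat u: for each support of size k, the optimum
# is the leading eigenvector of the corresponding k x k principal submatrix
best, u_hat = -np.inf, None
for supp in itertools.combinations(range(p), k):
    idx = list(supp)
    w, V = np.linalg.eigh(S_hat[np.ix_(idx, idx)])
    if w[-1] > best:
        best = w[-1]
        u_hat = np.zeros(p)
        u_hat[idx] = V[:, -1]

loss = np.linalg.norm(np.outer(u_hat, u_hat) - np.outer(u, u), "fro")
print("support found:", np.nonzero(u_hat)[0], " loss:", loss)
```

For these toy sizes the scan visits only $\binom{12}{3} = 220$ supports; in high dimensions this exact search is infeasible, which is part of the computational discussion deferred to Section 6.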
Theorem 12
Under the matrix denoising model with $U\in\mathcal{C}_S(p_1,r,k)$, where $k=o(p_1)$ and $r=O(1)$, it holds that
$$\inf_{\widehat{U}}\sup_{\mathcal{Y}(\mathcal{C}_S(k),t,p_1,p_2,r)}R(\widehat{U},U)\asymp\left(\frac{\sigma\sqrt{t+\sigma^2p_2}}{t}\left(\sqrt{k\log\frac{ep_1}{k}}+\sqrt{kr}\right)\right)\wedge 1,\qquad(20)$$
where the estimator (5) is rate-optimal whenever consistent estimation is possible. Similarly, under the spiked Wishart model with $U\in\mathcal{C}_S(p,r,k)$, where $k=o(p)$ and $r=O(1)$, if $n\gtrsim\max\{\log(t/\sigma^2),r\}$, then
$$\inf_{\widehat{U}}\sup_{\mathcal{Z}(\mathcal{C}_S(k),t,p,r)}R(\widehat{U},U)\asymp\left(\frac{\sigma\sqrt{t+\sigma^2}}{t\sqrt{n}}\left(\sqrt{k\log\frac{ep}{k}}+\sqrt{kr}\right)\right)\wedge 1,\qquad(21)$$
where the estimator (17) is rate-optimal whenever consistent estimation is possible.

The minimax rate (21) for the spiked Wishart model (sparse PCA) recovers the rates obtained by Vu and Lei (2012) and Cai et al. (2013) under the rank-one and finite-rank-$r$ settings, respectively. In contrast, the rate (20) for the matrix denoising model (sparse SVD) has, to the best of our knowledge, not been previously established.

We now turn to non-negative PCA/SVD under either the matrix denoising model (SVD) or the spiked Wishart model (PCA), where $U\in\mathcal{C}_N(p,r)=\{U=(u_{ij})\in\mathbb{O}(p,r):u_{ij}\ge 0\ \text{for all }i,j\}$. The following proposition provides estimates of the local and global entropic quantities related to the set $\mathcal{C}_N(p,r)$.

Proposition 13
Under the matrix denoising model where $(\Gamma,U,V)\in\mathcal{Y}(\mathcal{C}_N(p_1,r),t,p_1,p_2,r)$ and $r=O(1)$, there exist some $(U_0,\epsilon_0,\alpha)$ and a local packing set $\mathcal{G}(B(U_0,\epsilon_0)\cap\mathcal{C}_N(p_1,r),d,\alpha\epsilon_0)$ satisfying (2) such that
$$\Delta^2(\mathcal{C}_N(p_1,r))\asymp\log|\mathcal{G}(B(U_0,\epsilon_0)\cap\mathcal{C}_N(p_1,r),d,\alpha\epsilon_0)|\asymp p_1.$$
Similarly, under the spiked Wishart model where $(\Gamma,U)\in\mathcal{Z}(\mathcal{C}_N(p,r),t,p,r)$ and $r=O(1)$, there exist some $(U_0,\epsilon_0,\alpha)$ and a local packing set $\mathcal{G}(B(U_0,\epsilon_0)\cap\mathcal{C}_N(p,r),d,\alpha\epsilon_0)$ satisfying (14) such that
$$\Delta^2(\mathcal{C}_N(p,r))\asymp\log|\mathcal{G}(B(U_0,\epsilon_0)\cap\mathcal{C}_N(p,r),d,\alpha\epsilon_0)|\asymp p.$$

Proposition 13 enables us to establish sharp minimax rates of convergence for non-negative PCA/SVD using the general lower and upper bounds from the previous sections.

Theorem 14
Under the matrix denoising model with $U\in\mathcal{C}_N(p_1,r)$ where $r=O(1)$, it holds that
$$\inf_{\widehat{U}}\sup_{\mathcal{Y}(\mathcal{C}_N(p_1,r),t,p_1,p_2,r)}R(\widehat{U},U)\asymp\frac{\sigma\sqrt{(t+\sigma^2p_2)p_1}}{t}\wedge 1,\qquad(22)$$
and the estimator (5) is rate-optimal whenever consistent estimation is possible. Similarly, under the spiked Wishart model with $U\in\mathcal{C}_N(p,r)$ where $r=O(1)$, if $n\gtrsim\max\{\log(t/\sigma^2),r\}$, then
$$\inf_{\widehat{U}}\sup_{\mathcal{Z}(\mathcal{C}_N(p,r),t,p,r)}R(\widehat{U},U)\asymp\frac{\sigma\sqrt{(t+\sigma^2)p}}{t\sqrt{n}}\wedge 1,\qquad(23)$$
where the estimator (17) is rate-optimal whenever consistent estimation is possible.

The minimax rates for non-negative PCA/SVD, which were previously unknown, turn out to be the same as those for ordinary unstructured SVD (Cai and Zhang, 2018) and PCA (Zhang et al., 2018). This is a consequence of the fact, established in Proposition 13, that under the finite-rank scenario the set $\mathcal{C}_N(p,r)$, although a much smaller subset of $\mathbb{O}(p,r)$, has asymptotically the same geometric complexity as $\mathbb{O}(p,r)$.

Remark 15
Deshpande et al. (2014) considered the rank-one Gaussian Wigner model $Y=\lambda uu^\top+Z\in\mathbb{R}^{p\times p}$, which can be treated as a special case of the matrix denoising model. Specifically, it was shown that the estimator $\widehat{u}=\arg\max_{u\in\mathcal{C}_N(p,1)}u^\top Yu$ satisfies
$$\sup_{(\lambda,u)\in\mathcal{Z}(\mathcal{C}_N(p,1),t,p,1)}\mathbb{E}\big[1-|\widehat{u}^\top u|\big]\lesssim\frac{\sigma\sqrt{p}}{t}\wedge 1,$$
which, by the fact that $1-|\widehat{u}^\top u|\le d^2(\widehat{u},u)$, is implied by our result (see also Section 6.2). Similar problems were studied in Montanari and Richard (2015) under the setting where $p_1/p_2\to\alpha\in(0,\infty)$. However, their focus is on unveiling the asymptotic behavior of $\widehat{u}^\top u$ and on the analysis of an approximate message passing algorithm, which is different from ours.

In some applications such as network clustering (Wang and Davidson, 2010; Kawale and Boley, 2013; Kleindessner et al., 2019), it is of interest to estimate principal subspaces under certain linear subspace constraints. For example, under the matrix denoising model, for some fixed $A\in\mathbb{R}^{p_1\times(p_1-k)}$ of rank $p_1-k$, where $r<k<p_1$, a $k$-dimensional subspace constraint on the singular subspace $\mathrm{span}(U)$ can be expressed as $U\in\mathcal{C}_A(p_1,r,k)=\{U\in\mathbb{O}(p_1,r):A^\top U_{\cdot i}=0,\ \forall\,1\le i\le r\}$. Again, subspace constrained PCA/SVD can be solved based on the general results obtained in the previous sections.

Proposition 16
For given $A\in\mathbb{R}^{p_1\times(p_1-k)}$ of rank $p_1-k$, under the matrix denoising model where $(\Gamma,U,V)\in\mathcal{Y}(\mathcal{C}_A(p_1,r,k),t,p_1,p_2,r)$ and $r=O(1)$, there exist some $(U_0,\epsilon_0,\alpha)$ and a local packing set $\mathcal{G}(B(U_0,\epsilon_0)\cap\mathcal{C}_A(p_1,r,k),d,\alpha\epsilon_0)$ satisfying (2) such that
$$\Delta^2(\mathcal{C}_A(p_1,r,k))\asymp\log|\mathcal{G}(B(U_0,\epsilon_0)\cap\mathcal{C}_A(p_1,r,k),d,\alpha\epsilon_0)|\asymp kr.$$
Similarly, for given $B\in\mathbb{R}^{p\times(p-k)}$ of rank $p-k$, under the spiked Wishart model with $(\Gamma,U)\in\mathcal{Z}(\mathcal{C}_B(p,r,k),t,p,r)$ and $r=O(1)$, there exist some $(U_0,\epsilon_0,\alpha)$ and a local packing set $\mathcal{G}(B(U_0,\epsilon_0)\cap\mathcal{C}_B(p,r,k),d,\alpha\epsilon_0)$ satisfying (14) such that
$$\Delta^2(\mathcal{C}_B(p,r,k))\asymp\log|\mathcal{G}(B(U_0,\epsilon_0)\cap\mathcal{C}_B(p,r,k),d,\alpha\epsilon_0)|\asymp kr.$$

Theorem 17
Under the matrix denoising model with $U\in\mathcal{C}_A(p_1,r,k)$, where $r<k<p_1$, $r=O(1)$, and $A\in\mathbb{R}^{p_1\times(p_1-k)}$ is of rank $p_1-k$, it holds that
$$\inf_{\widehat{U}}\sup_{\mathcal{Y}(\mathcal{C}_A(p_1,r,k),t,p_1,p_2,r)}R(\widehat{U},U)\asymp\left(\frac{\sigma\sqrt{(t+\sigma^2p_2)k}}{t}\right)\wedge 1,\qquad(24)$$
and the estimator (5) is rate-optimal whenever consistent estimation is possible. Similarly, under the spiked Wishart model with $U\in\mathcal{C}_B(p,r,k)$, where $r<k<p$, $r=O(1)$, and $B\in\mathbb{R}^{p\times(p-k)}$ is of rank $p-k$, if $n\gtrsim\max\{\log(t/\sigma^2),r\}$, then
$$\inf_{\widehat{U}}\sup_{\mathcal{Z}(\mathcal{C}_B(p,r,k),t,p,r)}R(\widehat{U},U)\asymp\left(\frac{\sigma\sqrt{(t+\sigma^2)k}}{t\sqrt{n}}\right)\wedge 1,\qquad(25)$$
where the estimator (17) is rate-optimal whenever consistent estimation is possible.

As discussed in Section 1.2, spectral clustering can be treated as estimation of the structured eigenvector under the rank-one matrix denoising model $Y=\lambda uv^\top+Z\in\mathbb{R}^{n\times p}$, where $\lambda=\|h\|_2\|\theta\|_2$ is the global signal strength, $u=h/\|h\|_2\in\mathcal{C}^n_{\pm}=\{u\in\mathbb{R}^n:\|u\|_2=1,\ u_i\in\{\pm n^{-1/2}\}\}$ indicates the group labels, and $Z$ has i.i.d. entries from $N(0,\sigma^2)$. As a result, important insights about the clustering problem can be obtained by calculating the entropic quantities related to $\mathcal{C}^n_{\pm}$ and applying the general results from the previous sections.

Proposition 18
Under the matrix denoising model where $(\lambda,u,v)\in\mathcal{Y}(\mathcal{C}^n_{\pm},t,n,p,1)$, it holds that $\Delta^2(\mathcal{C}^n_{\pm})\lesssim n$. In addition, if $t=C\sigma^2(\sqrt{pn}+n)$ for some constant $C>0$, then there exist some $(u_0,\epsilon_0,\alpha)$ and a local packing set $\mathcal{G}(B(u_0,\epsilon_0)\cap\mathcal{C}^n_{\pm},d,\alpha\epsilon_0)$ satisfying (2) such that $\log|\mathcal{G}(B(u_0,\epsilon_0)\cap\mathcal{C}^n_{\pm},d,\alpha\epsilon_0)|\asymp n$.

Theorem 19
Under the spectral clustering model defined in Section 1.2, or equivalently the matrix denoising model $Y=\lambda uv^\top+Z\in\mathbb{R}^{n\times p}$ with $u\in\mathcal{C}^n_{\pm}$, the estimator $\widehat{u}=\arg\max_{u\in\mathcal{C}^n_{\pm}}u^\top YY^\top u$ satisfies
$$\sup_{(\lambda,u,v)\in\mathcal{Y}(\mathcal{C}^n_{\pm},t,n,p,1)}R(\widehat{u},u)\lesssim\left(\frac{\sigma\sqrt{(t+\sigma^2p)n}}{t}\right)\wedge 1.\qquad(26)$$
In addition, if $t\lesssim\sigma^2(n+\sqrt{np})$, then
$$\inf_{\widehat{u}}\sup_{(\lambda,u,v)\in\mathcal{Y}(\mathcal{C}^n_{\pm},t,n,p,1)}R(\widehat{u},u)\gtrsim C\qquad(27)$$
for some absolute constant $C>0$.

Intuitively, the fundamental difficulty of clustering is governed by the interplay between the global signal strength $\lambda$, which reflects both the sample size $n$ and the distance $\|\theta\|_2$ between the two clusters, the noise level $\sigma$, and the dimensionality $p$. In particular, the lower bound in the above theorem shows that one needs $\lambda\gtrsim\sigma(\sqrt{n}+(np)^{1/4})$ in order to achieve consistent clustering. Moreover, the risk upper bound implies that, whenever $\lambda\gtrsim\sigma(\sqrt{n}+(np)^{1/4})$, the estimator $\widehat{u}$ is consistent. Theorem 19 thus establishes the fundamental statistical limit on the minimal global signal strength for consistent clustering. Similar phenomena have also been observed by Azizyan et al. (2013) and Cai and Zhang (2018). Nevertheless, it should be noted that, despite the fundamental limits for consistent recovery yielded by Theorem 19, the estimator $\widehat{u}$ is itself sub-optimal and can be further improved through a variant of Lloyd's iterations. See Lu and Zhou (2016) and Ndaoud (2018) for more details.
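The combinatorial maximizer over $\mathcal{C}^n_{\pm}$ in Theorem 19 is intractable to compute exactly, but its behavior is easy to probe numerically with the usual spectral relaxation: take the top left singular vector of $Y$ and round it onto $\mathcal{C}^n_{\pm}$ by the sign projection $u\mapsto\mathrm{sgn}(u)/\sqrt{n}$. The following self-contained simulation is illustrative only; the constant $6.0$ and the rounding step are our choices, not prescriptions from the paper, and the instance is placed well above the consistency threshold identified in Theorem 19:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 300, 1.0

# Rank-one matrix denoising instance Y = lambda * u v^T + Z with u in C^n_{+/-}.
labels = rng.choice([-1.0, 1.0], size=n)
u = labels / np.sqrt(n)          # u_i in {+/- n^(-1/2)}, so ||u||_2 = 1
v = rng.standard_normal(p)
v /= np.linalg.norm(v)
lam = 6.0 * sigma * np.sqrt(n + np.sqrt(n * p))   # strong-signal regime
Y = lam * np.outer(u, v) + sigma * rng.standard_normal((n, p))

# Spectral relaxation of argmax_{u in C^n_{+/-}} u^T Y Y^T u:
# top left singular vector of Y, rounded by the sign projection sgn(.)/sqrt(n).
u1 = np.linalg.svd(Y, full_matrices=False)[0][:, 0]
u_hat = np.sign(u1) / np.sqrt(n)

# Misclustering rate, up to the global sign ambiguity of u.
err = min(np.mean(np.sign(u_hat) != labels), np.mean(np.sign(u_hat) != -labels))
print(f"misclustering rate: {err:.3f}")
```

Shrinking the leading factor toward the critical scaling degrades the misclustering rate, in line with the phase transition in Theorem 19; the sign-rounded estimator can be further refined by Lloyd-type iterations as noted above.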
6. Discussions
In this paper, we studied a collection of structured principal subspace estimation problems in a unified framework by exploring the deep connections between the difficulty of statistical estimation and the geometric complexity of the parameter spaces. Minimax optimal rates of convergence for a collection of structured PCA/SVD problems were established. In this section, we discuss the computational aspects of the proposed estimators as well as extensions and connections to other problems.
In general, the constrained optimization problems that define the estimators in (5) and (17) are computationally intractable. In practice, however, many iterative algorithms have been developed to approximate such estimators. For example, under the matrix denoising model, given the data matrix $Y$, the set $\mathcal{C}$, and an initial estimator $U_0\in\mathbb{O}(p_1,r)$, an iterative algorithm for the constrained optimization problem $\arg\max_{U\in\mathcal{C}}\mathrm{tr}(U^\top YY^\top U)$ can be realized through iterations over the following updates for $t\ge 0$:

1. $G_t=YY^\top U_t$;
2. QR factorization: $U'_{t+1}W_{t+1}=G_t$, where $U'_{t+1}$ is $p_1\times r$ orthonormal and $W_{t+1}$ is $r\times r$ upper triangular;
3. Projection: $U_{t+1}=P_{\mathcal{C}}(U'_{t+1})$.

Here the projection operator $P_{\mathcal{C}}(\cdot)$ is defined as $P_{\mathcal{C}}(U)=\arg\min_{G\in\mathcal{C}}d(U,G)$. The above algorithm generalizes the ideas of the projected power method (see, for example, Boumal (2016); Chen and Candès (2018); Onaran and Villar (2017)) and the orthogonal iteration method (Golub and Van Loan, 2012; Ma, 2013).

The computational efficiency of this iterative algorithm relies on the complexity of the projection operator $P_{\mathcal{C}}$ for a given $\mathcal{C}$. In the rank-one case ($r=1$), Ferreira et al. (2013) pointed out that, whenever the set $\mathcal{C}$ is the intersection of a convex cone with the unit sphere, the projection operator $P_{\mathcal{C}}(\cdot)$ admits an explicit formula and can be computed efficiently.
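As a concrete illustration, the three updates above can be sketched in a few lines. The sketch below is our own minimal implementation (the function names are not from the paper), specialized to rank-one non-negative PCA, for which $\mathcal{C}$ is the intersection of the non-negative cone with the unit sphere and the exact projection is clipping followed by renormalization (Ferreira et al., 2013):

```python
import numpy as np

def projected_orthogonal_iteration(Y, project, r, iters=100, seed=0):
    """Iterate: power step G_t = Y Y^T U_t, QR re-orthonormalization,
    then projection onto the constraint set C."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((Y.shape[0], r)))
    for _ in range(iters):
        G = Y @ (Y.T @ U)        # 1. power step
        U, _ = np.linalg.qr(G)   # 2. QR factorization
        U = project(U)           # 3. projection onto C
    return U

def project_nonneg(U):
    """Exact projection onto (non-negative cone) intersect (unit sphere) per
    column when r = 1: clip negative entries, then renormalize.  The sign
    flip resolves the orientation ambiguity left by the QR step."""
    U = U * np.where(U.sum(axis=0) >= 0.0, 1.0, -1.0)
    V = np.clip(U, 0.0, None)
    norms = np.linalg.norm(V, axis=0)
    return V / np.where(norms > 0.0, norms, 1.0)

# Simulated rank-one denoising instance with a non-negative leading direction.
rng = np.random.default_rng(1)
p1, p2 = 100, 120
u = np.abs(rng.standard_normal(p1))
u /= np.linalg.norm(u)
v = rng.standard_normal(p2)
v /= np.linalg.norm(v)
Y = 50.0 * np.outer(u, v) + rng.standard_normal((p1, p2))
U_hat = projected_orthogonal_iteration(Y, project_nonneg, r=1)
print("correlation with truth:", float(U_hat[:, 0] @ u))
```

For the subspace constraint $A^\top U_{\cdot i}=0$, `project` would instead multiply by the orthogonal projector onto the null space of $A^\top$ and re-orthonormalize; for $r>1$ the clip-and-renormalize step is only a heuristic, since it does not enforce orthonormality across columns.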
As for sparse PCA/SVD, the computational side of the problem is much more involved and has been extensively studied in the literature (Shen and Huang, 2008; d'Aspremont et al., 2008; Witten et al., 2009; Journée et al., 2010; Ma, 2013; Vu et al., 2013; Yuan and Zhang, 2013; Deshpande and Montanari, 2014).

In addition to the iterative projection method discussed above, several other computationally efficient algorithms, such as convex (in particular, semidefinite) relaxations (Singer, 2011; Deshpande et al., 2014; Bandeira et al., 2017) and approximate message passing algorithms (Deshpande and Montanari, 2014; Deshpande et al., 2014; Montanari and Richard, 2015; Rangan and Fletcher, 2012), have been considered for solving structured eigenvector problems. However, these algorithms still focus on rank-one matrices, and it remains to be understood how well they generalize to the general rank-$r$ case. We leave further investigations along these directions to future work.

As mentioned in Section 1.2, an important special case of the matrix denoising model is the Gaussian Wigner model (Deshpande et al., 2014; Montanari and Richard, 2015; Perry et al., 2018), where the data matrix $Y=U\Gamma U^\top+Z\in\mathbb{R}^{p\times p}$ is symmetric and the noise matrix $Z$ has i.i.d. entries (up to symmetry) drawn from $N(0,\sigma^2)$. Consider the parameter space $\mathcal{Z}(\mathcal{C},t,p,r)$ defined in Section 4.1. It can be shown that, under conditions similar to those of Theorem 5,
$$\inf_{\widehat{U}}\sup_{(\Gamma,U)\in\mathcal{Z}(\mathcal{C},t,p,r)}R(\widehat{U},U)\gtrsim\left(\frac{\sigma}{t}\sqrt{\log|\mathcal{G}(B(U_0,\epsilon_0)\cap\mathcal{C},d,\alpha\epsilon_0)|}\right)\wedge\mathrm{diam}(\mathcal{C}).\qquad(28)$$
Moreover, if we define $\widehat{U}=\arg\max_{U\in\mathcal{C}}\mathrm{tr}(U^\top YU)$, then its risk upper bound can be obtained as
$$\sup_{(\Gamma,U)\in\mathcal{Z}(\mathcal{C},t,p,r)}R(\widehat{U},U)\lesssim\left(\frac{\sigma\Delta(\mathcal{C})}{t}\wedge\mathrm{diam}(\mathcal{C})\right).\qquad(29)$$
These general bounds, combined with the entropic quantities calculated in Section 5, yield many other interesting optimality results. For instance, recall the Gaussian Wigner model $Y=\lambda uu^\top+Z$ with $u\in\mathcal{C}^n_{\pm}$. In this case we have, for $t\lesssim\sigma\sqrt{n}$,
$$\inf_{\widehat{u}}\sup_{(\lambda,u)\in\mathcal{Z}(\mathcal{C}^n_{\pm},t,n,1)}R(\widehat{u},u)\gtrsim C,$$
and, for $\widehat{u}=\arg\max_{u\in\mathcal{C}^n_{\pm}}u^\top Yu$,
$$\sup_{(\lambda,u)\in\mathcal{Z}(\mathcal{C}^n_{\pm},t,n,1)}R(\widehat{u},u)\lesssim\left(\frac{\sigma\sqrt{n}}{t}\right)\wedge 1.$$
This implies that, for the Gaussian Wigner model, consistent estimation of $u$ is impossible unless $\lambda\gtrsim\sigma\sqrt{n}$, and the estimator $\widehat{u}$ is consistent whenever $\lambda\gtrsim\sigma\sqrt{n}$. These results make interesting connections to the existing works (Javanmard et al., 2016; Perry et al., 2018) concerning the so-called critical threshold or fundamental limit for the SNRs in $\mathbb{Z}_2$-synchronization.

Throughout the paper, it has been assumed that prior structural knowledge on span($U$) is available. However, in some applications, structural knowledge on the other singular subspace span($V$) can also be available. An interesting question is whether and how much the prior knowledge on span($V$) helps in the estimation of span($U$). Some preliminary thinking suggests that novel phenomena might exist in such settings. For example, in an extreme case, if $V$ is completely known a priori, then after the simple transform $YV=U\Gamma+ZV$, estimation of span($U$) reduces to a Gaussian mean estimation problem, whose minimax rate is clearly independent of the dimension of the columns of $V$ and therefore quite different from the rates obtained in this paper. This problem again bears important concrete examples in statistics and machine learning. The present work provides a theoretical foundation for studying these problems.

Appendix A. Proof of the Main Theorems
In this section, we prove Theorems 5, 8, 9 and 10.
A.1 Risk Upper Bounds
This section proves Theorems 8 and 10. Throughout, for any $X,Y\in\mathbb{R}^{p_1\times p_2}$, we denote $\langle X,Y\rangle=\mathrm{tr}(X^\top Y)$. We recall Lemma 1 in Cai and Zhang (2018), which concerns the relationships between different distance measures.

Lemma 20
For $H_1,H_2\in\mathbb{O}(p,r)$,
$$\|H_1H_1^\top-H_2H_2^\top\|_F=\sqrt{2\big(r-\|H_1^\top H_2\|_F^2\big)},$$
and
$$\frac{1}{\sqrt{2}}\|H_1H_1^\top-H_2H_2^\top\|_F\le\inf_{O\in\mathbb{O}(r)}\|H_1-H_2O\|_F\le\|H_1H_1^\top-H_2H_2^\top\|_F.$$

Proof of Theorem 8.
We begin by stating a useful lemma, whose proof is deferred to Section C.
Lemma 21
Let $U\in\mathbb{O}(p_1,r)$ and $\Gamma=\mathrm{diag}(\lambda_1,\dots,\lambda_r)$. Then for any $W\in\mathbb{O}(p_1,r)$, we have
$$\frac{\lambda_r^2}{2}\|UU^\top-WW^\top\|_F^2\le\langle U\Gamma^2U^\top,UU^\top-WW^\top\rangle\le\frac{\lambda_1^2}{2}\|UU^\top-WW^\top\|_F^2.$$

By Lemma 21 and the fact that $\mathrm{tr}(\widehat{U}^\top YY^\top\widehat{U})\ge\mathrm{tr}(U^\top YY^\top U)$, or equivalently $\langle YY^\top,UU^\top-\widehat{U}\widehat{U}^\top\rangle\le 0$, we have
$$\|\widehat{U}\widehat{U}^\top-UU^\top\|_F^2\le\frac{2}{\lambda_r^2}\langle U\Gamma^2U^\top-YY^\top,UU^\top-\widehat{U}\widehat{U}^\top\rangle.$$
Since $Y=U\Gamma V^\top+Z$, we have $YY^\top=U\Gamma^2U^\top+ZV\Gamma U^\top+U\Gamma V^\top Z^\top+ZZ^\top$. Thus
$$\|\widehat{U}\widehat{U}^\top-UU^\top\|_F^2\le\frac{2}{\lambda_r^2}\big[\langle U\Gamma V^\top Z^\top,\widehat{U}\widehat{U}^\top-UU^\top\rangle+\langle ZV\Gamma U^\top,\widehat{U}\widehat{U}^\top-UU^\top\rangle+\langle ZZ^\top,\widehat{U}\widehat{U}^\top-UU^\top\rangle\big]\equiv\frac{2}{\lambda_r^2}(H_1+H_2+H_3).$$
For $H_1$, if we set
$$G_W=\frac{WW^\top-UU^\top}{\|WW^\top-UU^\top\|_F},\qquad W\in\mathbb{O}(p_1,r)\setminus\{U\},\qquad(30)$$
we can write
$$H_1=\langle U\Gamma V^\top Z^\top,\widehat{U}\widehat{U}^\top-UU^\top\rangle=\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\cdot\langle U\Gamma V^\top Z^\top,G_{\widehat{U}}\rangle\le\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\cdot\sup_{W\in\mathcal{C}}\mathrm{tr}(ZV\Gamma U^\top G_W).$$
Similarly, we have
$$H_2\le\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\cdot\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma V^\top Z^\top G_W),\qquad H_3\le\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\cdot\sup_{W\in\mathcal{C}}\mathrm{tr}(Z^\top G_WZ).$$
It then follows that
$$\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\le\frac{2}{\lambda_r^2}\Big(\sup_{W\in\mathcal{C}}\mathrm{tr}(ZV\Gamma U^\top G_W)+\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma V^\top Z^\top G_W)+\sup_{W\in\mathcal{C}}\mathrm{tr}(Z^\top G_WZ)\Big).\qquad(31)$$
The rest of the proof is separated into three parts. In the first two parts, we obtain upper bounds for the right-hand side of (31). In the third part, we derive the desired risk upper bound.

Part I.
For the term $\sup_{W\in\mathcal{C}}\mathrm{tr}(ZV\Gamma U^\top G_W)$, we have
$$\sup_{W\in\mathcal{C}}\mathrm{tr}(ZV\Gamma U^\top G_W)=\sup_{W\in\mathcal{C}}\mathrm{tr}(U^\top G_WZV\Gamma)=\sup_{W\in\mathcal{C}}\sum_{i=1}^r\lambda_i(U^\top G_WZV)_{ii}\le\lambda_1\sup_{W\in\mathcal{C}}\mathrm{tr}(VU^\top G_WZ)\le\lambda_1\sup_{G\in T'(\mathcal{C},U,V)}\langle G,Z\rangle,$$
where we define $T'(\mathcal{C},U,V)=\{G_WUV^\top\in\mathbb{R}^{p_1\times p_2}:W\in\mathcal{C}\setminus\{U\}\}$. To control the expected supremum of the Gaussian process $\sup_{G\in T'(\mathcal{C},U,V)}\langle G,Z\rangle$, we use the following Dudley integral inequality (see, for example, Vershynin 2018, pp. 188).

Theorem 22 (Dudley's Integral Inequality)
Let $\{X_t\}_{t\in T}$ be a Gaussian process, that is, a jointly Gaussian family of centered random variables indexed by $T$, where $T$ is equipped with the canonical distance $d(s,t)=\sqrt{\mathbb{E}(X_s-X_t)^2}$. For some universal constant $L$, we have
$$\mathbb{E}\sup_{t\in T}X_t\le L\int_0^\infty\sqrt{\log N(T,d,\epsilon)}\,d\epsilon.$$

For the Gaussian process $\sup_{G\in T'(\mathcal{C},U,V)}\langle G,Z\rangle$, the canonical distance over the set $T'(\mathcal{C},U,V)$ can be obtained as follows. For any $G_1,G_2\in T'(\mathcal{C},U,V)$, the canonical distance between $G_1$ and $G_2$ is, by definition, $\sqrt{\mathbb{E}\langle G_1-G_2,Z\rangle^2}=\sigma\|G_1-G_2\|_F$, a multiple of $d_F(G_1,G_2)\equiv\|G_1-G_2\|_F$. Theorem 22 thus yields
$$\mathbb{E}\sup_{G\in T'(\mathcal{C},U,V)}\langle G,Z\rangle\le C\sigma\int_0^\infty\sqrt{\log N(T'(\mathcal{C},U,V),d_F,\epsilon)}\,d\epsilon,\qquad(32)$$
for some universal constant $C>0$. Next, for any $G_1,G_2\in T'(\mathcal{C},U,V)$, without loss of generality, if we assume $G_1=G_{W_1}UV^\top$ and $G_2=G_{W_2}UV^\top$, where $W_1,W_2\in\mathcal{C}\setminus\{U\}$, then it holds that
$$d_F(G_1,G_2)=\|G_1-G_2\|_F\le\|G_{W_1}-G_{W_2}\|_F\|U\|\|V\|\le\|G_{W_1}-G_{W_2}\|_F=d_F(G_{W_1},G_{W_2}),\qquad(33)$$
where we used the fact that $\|HG\|_F\le\|H\|_F\|G\|$. The next lemma, obtained by Szarek (1998), concerns the invariance property of covering numbers with respect to Lipschitz maps.

Lemma 23 (Szarek (1998))
Let $(M,d)$ and $(M_1,d_1)$ be metric spaces, $K\subseteq M$, and $\Phi:M\to M_1$, and let $L>0$. If $\Phi$ satisfies $d_1(\Phi(x),\Phi(y))\le L\,d(x,y)$ for all $x,y\in M$, then, for every $\epsilon>0$, we have $N(\Phi(K),d_1,L\epsilon)\le N(K,d,\epsilon)$.

Define the set $T(\mathcal{C},U)=\{G_W:W\in\mathcal{C}\setminus\{U\}\}$. Equation (33) and Lemma 23 imply
$$\log N(T'(\mathcal{C},U,V),d_F,\epsilon)\le\log N(T(\mathcal{C},U),d_F,\epsilon),\qquad(34)$$
which means
$$\mathbb{E}\sup_{W\in\mathcal{C}}\mathrm{tr}(ZV\Gamma U^\top G_W)\le C\lambda_1\sigma\int_0^\infty\sqrt{\log N(T(\mathcal{C},U),d_F,\epsilon)}\,d\epsilon.\qquad(35)$$
Applying the same argument to $\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma V^\top Z^\top G_W)$ leads to
$$\mathbb{E}\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma V^\top Z^\top G_W)\le C\lambda_1\sigma\int_0^\infty\sqrt{\log N(T(\mathcal{C},U),d_F,\epsilon)}\,d\epsilon.\qquad(36)$$

Part II. To bound $\sup_{W\in\mathcal{C}}\mathrm{tr}(Z^\top G_WZ)$, note that $\mathrm{tr}(Z^\top G_WZ)=\mathrm{vec}(Z)^\top D_W\mathrm{vec}(Z)$, where $\mathrm{vec}(Z)=(Z_{11},\dots,Z_{p_11},Z_{12},\dots,Z_{p_12},\dots,Z_{1p_2},\dots,Z_{p_1p_2})^\top$ and
$$D_W=\begin{pmatrix}G_W&&\\&\ddots&\\&&G_W\end{pmatrix}\in\mathbb{R}^{p_1p_2\times p_1p_2}.\qquad(37)$$
It suffices to control the expected supremum of the following Gaussian chaos of order 2:
$$\sup_{D\in P(\mathcal{C},U)}\mathrm{vec}(Z)^\top D\,\mathrm{vec}(Z),\qquad(38)$$
where $P(\mathcal{C},U)=\{D_W\in\mathbb{R}^{p_1p_2\times p_1p_2}:W\in\mathcal{C}\setminus\{U\}\}$. To analyze this Gaussian chaos, a powerful tool from empirical process theory is the decoupling technique. In particular, we apply the following decoupling inequality obtained by Arcones and Giné (1993) (see also Theorem 2.5 of Krahmer et al. (2014)).

Theorem 24 (Arcones–Giné Decoupling Inequality)
Let $\{g_i\}_{1\le i\le n}$ be a sequence of independent standard Gaussian variables and let $\{g'_i\}_{1\le i\le n}$ be an independent copy of $\{g_i\}_{1\le i\le n}$. Let $\mathcal{B}$ be a collection of $n\times n$ symmetric matrices. Then for all $p\ge 1$, there exists an absolute constant $C$ such that
$$\mathbb{E}\sup_{B\in\mathcal{B}}\bigg|\sum_{1\le j\ne k\le n}B_{jk}g_jg_k+\sum_{j=1}^nB_{jj}(g_j^2-1)\bigg|^p\le C^p\,\mathbb{E}\sup_{B\in\mathcal{B}}\bigg|\sum_{1\le j,k\le n}B_{jk}g_jg'_k\bigg|^p.$$

From Theorem 24 and the fact that for any given $W\in\mathcal{C}\setminus\{U\}$ we have $\mathbb{E}\,\mathrm{vec}(Z)^\top D_W\mathrm{vec}(Z)=0$, it follows that
$$\mathbb{E}\sup_{D\in P(\mathcal{C},U)}\big[\mathrm{vec}(Z)^\top D\,\mathrm{vec}(Z)\big]\le C\,\mathbb{E}\sup_{D\in P(\mathcal{C},U)}\big[\mathrm{vec}(Z)^\top D\,\mathrm{vec}(Z')\big],\qquad(39)$$
where $Z'$ is an independent copy of $Z$. An upper bound for the right-hand side of (39) can be obtained using the generic chaining argument developed by Talagrand (2014). To state the result, we make the following definitions, which characterize the complexity of a set in a metric space.

Definition 25 (admissible sequence)
Given a set $T$ in the metric space $(S,d)$, an admissible sequence is an increasing sequence $\{\mathcal{A}_n\}$ of partitions of $T$ such that $|\mathcal{A}_0|=1$ and $|\mathcal{A}_n|\le 2^{2^n}$ for $n\ge 1$.

Definition 26 ($\gamma_\alpha(T,d)$) Given $\alpha>0$ and a set $T$ in the metric space $(S,d)$, we define $\gamma_\alpha(T,d)=\inf\sup_{t\in T}\sum_{n\ge 0}2^{n/\alpha}\,\mathrm{diam}(A_n(t))$, where $A_n(t)$ is the unique element of $\mathcal{A}_n$ that contains $t$ and the infimum is taken over all admissible sequences.

The following theorem from (Talagrand, 2014, pp. 246) provides an important upper bound for general decoupled Gaussian chaoses of order 2.
Theorem 27 (Talagrand (2014))
Let $g,g'\in\mathbb{R}^n$ be independent standard Gaussian vectors, and let $T\subseteq\mathbb{R}^{n\times n}$ be a set of matrices $Q=\{q_{ij}\}_{1\le i,j\le n}$ equipped with the two distances $d_\infty(Q_1,Q_2)=\|Q_1-Q_2\|$ and $d_2(Q_1,Q_2)=\|Q_1-Q_2\|_F$. Then
$$\mathbb{E}\sup_{Q\in T}g^\top Qg'\le L\big(\gamma_1(T,d_\infty)+\gamma_2(T,d_2)\big)$$
for some absolute constant $L\ge 0$.

A direct consequence of Theorem 27 is
$$\mathbb{E}\sup_{D\in P(\mathcal{C},U)}\big[\mathrm{vec}(Z)^\top D\,\mathrm{vec}(Z')\big]\le C\sigma^2\big(\gamma_1(P(\mathcal{C},U),d_\infty)+\gamma_2(P(\mathcal{C},U),d_2)\big).\qquad(40)$$
Our next lemma provides estimates of the functionals $\gamma_1(P(\mathcal{C},U),d_\infty)$ and $\gamma_2(P(\mathcal{C},U),d_2)$.

Lemma 28
Let $T(\mathcal{C},U)=\{G_W\in\mathbb{R}^{p_1\times p_1}:W\in\mathcal{C}\setminus\{U\}\}$ be equipped with the distances $d_\infty$ and $d_2$ defined in Theorem 27. It holds that
$$\gamma_1(P(\mathcal{C},U),d_\infty)\le C\int_0^\infty\log N(T(\mathcal{C},U),d_F,\epsilon)\,d\epsilon,\qquad(41)$$
$$\gamma_2(P(\mathcal{C},U),d_2)\le C\sqrt{p_2}\int_0^\infty\sqrt{\log N(T(\mathcal{C},U),d_F,\epsilon)}\,d\epsilon.\qquad(42)$$

Combining the above results, we have
$$\mathbb{E}\sup_{W\in\mathcal{C}}\mathrm{tr}(Z^\top G_WZ)\lesssim\sigma^2\sqrt{p_2}\int_0^\infty\sqrt{\log N(T(\mathcal{C},U),d_F,\epsilon)}\,d\epsilon+\sigma^2\int_0^\infty\log N(T(\mathcal{C},U),d_F,\epsilon)\,d\epsilon.\qquad(43)$$

Part III. By (31), (35), (36) and (43), we have, for any $(\Gamma,U,V)\in\mathcal{Y}(\mathcal{C},t,p_1,p_2,r)$, whenever $t\gtrsim\sigma^2D'(T(\mathcal{C},U),d_F)/D(T(\mathcal{C},U),d_F)$,
$$\mathbb{E}\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\lesssim\frac{\sigma\lambda_1D(T(\mathcal{C},U),d_F)+\sigma^2\sqrt{p_2}\,D(T(\mathcal{C},U),d_F)+\sigma^2D'(T(\mathcal{C},U),d_F)}{\lambda_r^2}\lesssim\frac{\sigma\Delta(\mathcal{C})\sqrt{t+\sigma^2p_2}}{t}.$$
The final result then follows by noticing the trivial upper bound $\mathrm{diam}(\mathcal{C})$.

Proof of Theorem 10.
We first state a useful lemma (Lemma 3 in Cai et al. (2013)).
Lemma 29
Let $\Sigma=\sigma^2I_p+U\Gamma U^\top$, where $U\in\mathbb{O}(p,r)$ and $\Gamma=\mathrm{diag}(\lambda_1,\dots,\lambda_r)$. Then for any $W\in\mathbb{O}(p,r)$, we have
$$\frac{\lambda_r}{2}\|UU^\top-WW^\top\|_F^2\le\langle\Sigma,UU^\top-WW^\top\rangle\le\frac{\lambda_1}{2}\|UU^\top-WW^\top\|_F^2.$$

Note that $Y=X\Gamma^{1/2}U^\top+Z\in\mathbb{R}^{n\times p}$, where $\Gamma^{1/2}=\mathrm{diag}(\lambda_1^{1/2},\dots,\lambda_r^{1/2})$, $X\in\mathbb{R}^{n\times r}$ has i.i.d. entries from $N(0,1)$, and $Z$ has i.i.d. entries from $N(0,\sigma^2)$. We can write
$$\hat{\Sigma}=\frac{1}{n}Y^\top Y-\bar{Y}\bar{Y}^\top=\frac{1}{n}\big(U\Gamma^{1/2}X^\top X\Gamma^{1/2}U^\top+Z^\top X\Gamma^{1/2}U^\top+U\Gamma^{1/2}X^\top Z+Z^\top Z\big)-\big(U\Gamma^{1/2}\bar{X}\bar{X}^\top\Gamma^{1/2}U^\top+U\Gamma^{1/2}\bar{X}\bar{Z}^\top+\bar{Z}\bar{X}^\top\Gamma^{1/2}U^\top+\bar{Z}\bar{Z}^\top\big),$$
where $\bar{X}=\frac{1}{n}\sum_{i=1}^nX_{i\cdot}\in\mathbb{R}^r$ and $\bar{Z}=\frac{1}{n}\sum_{i=1}^nZ_{i\cdot}\in\mathbb{R}^p$. Now since $\mathrm{tr}(\widehat{U}^\top\hat{\Sigma}\widehat{U})\ge\mathrm{tr}(U^\top\hat{\Sigma}U)$, or equivalently $\langle\hat{\Sigma},UU^\top-\widehat{U}\widehat{U}^\top\rangle\le 0$, we have
$$\|\widehat{U}\widehat{U}^\top-UU^\top\|_F^2\le\frac{2}{\lambda_r}\langle\Sigma-\hat{\Sigma},UU^\top-\widehat{U}\widehat{U}^\top\rangle.$$
Hence,
$$\|\widehat{U}\widehat{U}^\top-UU^\top\|_F^2\le\frac{2}{\lambda_r}\big[\langle n^{-1}Z^\top X\Gamma^{1/2}U^\top,\widehat{U}\widehat{U}^\top-UU^\top\rangle+\langle n^{-1}U\Gamma^{1/2}X^\top Z,\widehat{U}\widehat{U}^\top-UU^\top\rangle+\langle n^{-1}U\Gamma^{1/2}X^\top X\Gamma^{1/2}U^\top-U\Gamma U^\top,\widehat{U}\widehat{U}^\top-UU^\top\rangle+\langle n^{-1}Z^\top Z-\sigma^2I_p,\widehat{U}\widehat{U}^\top-UU^\top\rangle-\langle U\Gamma^{1/2}\bar{X}\bar{X}^\top\Gamma^{1/2}U^\top,\widehat{U}\widehat{U}^\top-UU^\top\rangle-\langle U\Gamma^{1/2}\bar{X}\bar{Z}^\top,\widehat{U}\widehat{U}^\top-UU^\top\rangle-\langle\bar{Z}\bar{X}^\top\Gamma^{1/2}U^\top,\widehat{U}\widehat{U}^\top-UU^\top\rangle-\langle\bar{Z}\bar{Z}^\top,\widehat{U}\widehat{U}^\top-UU^\top\rangle\big]\equiv\frac{2}{\lambda_r}(H_1+H_2+H_3+H_4-H_5-H_6-H_7-H_8).$$
To control $H_1$, using the same notation as in (30), we have
$$H_1\le\frac{1}{n}\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\cdot\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma^{1/2}X^\top ZG_W).$$
Similarly, it holds that
$$H_2\le\frac{1}{n}\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\cdot\sup_{W\in\mathcal{C}}\mathrm{tr}(Z^\top X\Gamma^{1/2}U^\top G_W),$$
$$H_3=\langle\Gamma^{1/2}(n^{-1}X^\top X-I_r)\Gamma^{1/2},U^\top\widehat{U}\widehat{U}^\top U-I_r\rangle\le\|\Gamma^{1/2}(n^{-1}X^\top X-I_r)\Gamma^{1/2}\|\cdot\big|\mathrm{tr}(U^\top\widehat{U}\widehat{U}^\top U-I_r)\big|\le\frac{\lambda_1}{2}\|n^{-1}X^\top X-I_r\|\,\|UU^\top-\widehat{U}\widehat{U}^\top\|_F^2,$$
$$H_4\le\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\cdot\sup_{W\in\mathcal{C}}\mathrm{tr}\big((n^{-1}Z^\top Z-\sigma^2I_p)G_W\big),$$
$$H_5\le\|\Gamma^{1/2}\bar{X}\bar{X}^\top\Gamma^{1/2}\|\cdot\big|\mathrm{tr}(U^\top\widehat{U}\widehat{U}^\top U-I_r)\big|\le\frac{\lambda_1}{2}\|\bar{X}\bar{X}^\top\|\,\|UU^\top-\widehat{U}\widehat{U}^\top\|_F^2,$$
$$H_6\le\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\cdot\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma^{1/2}\bar{X}\bar{Z}^\top G_W),\qquad H_7\le\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\cdot\sup_{W\in\mathcal{C}}\mathrm{tr}(\bar{Z}\bar{X}^\top\Gamma^{1/2}U^\top G_W),$$
$$H_8\le\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\cdot\sup_{W\in\mathcal{C}}\mathrm{tr}(\bar{Z}\bar{Z}^\top G_W).$$
Combining the above inequalities, we have
$$\|\widehat{U}\widehat{U}^\top-UU^\top\|_F\le\frac{2}{\lambda_r\big(1-\frac{\lambda_1}{\lambda_r}\|n^{-1}X^\top X-I_r\|-\frac{\lambda_1}{\lambda_r}\|\bar{X}\bar{X}^\top\|\big)}\Big(n^{-1}\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma^{1/2}X^\top ZG_W)+n^{-1}\sup_{W\in\mathcal{C}}\mathrm{tr}(Z^\top X\Gamma^{1/2}U^\top G_W)+\sup_{W\in\mathcal{C}}\mathrm{tr}\big((n^{-1}Z^\top Z-\sigma^2I_p)G_W\big)+\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma^{1/2}\bar{X}\bar{Z}^\top G_W)+\sup_{W\in\mathcal{C}}\mathrm{tr}(\bar{Z}\bar{X}^\top\Gamma^{1/2}U^\top G_W)+\sup_{W\in\mathcal{C}}\mathrm{tr}(\bar{Z}\bar{Z}^\top G_W)\Big).\qquad(44)$$
The rest of the proof is separated into four parts, with the first three parts controlling the right-hand side of (44) and the last part deriving the final risk upper bound.

Part I.
Note that
$$\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma^{1/2}X^\top ZG_W)=\sup_{W\in\mathcal{C}}\mathrm{tr}(X^\top ZG_WU\Gamma^{1/2})\le\lambda_1^{1/2}\sup_{W\in\mathcal{C}}\mathrm{tr}\big(ZG_WUX^\top/\|X\|\big)\|X\|\le\lambda_1^{1/2}\|X\|\sup_{G\in T(\mathcal{C},U,X)}\langle Z^\top,G\rangle,$$
where $T(\mathcal{C},U,X)=\{G_WUX^\top/\|X\|:W\in\mathcal{C}\setminus\{U\}\}$. By Theorem 22, we have
$$\mathbb{E}\bigg[\sup_{G\in T(\mathcal{C},U,X)}\langle Z^\top,G\rangle\,\bigg|\,X\bigg]\le C\sigma\int_0^\infty\sqrt{\log N(T(\mathcal{C},U,X),d_F,\epsilon)}\,d\epsilon.$$
For any $G_1,G_2\in T(\mathcal{C},U,X)$, without loss of generality, if we assume $G_1=\|X\|^{-1}G_{W_1}UX^\top$ and $G_2=\|X\|^{-1}G_{W_2}UX^\top$, where $W_1,W_2\in\mathcal{C}\setminus\{U\}$, then
$$d_F(G_1,G_2)\le\|G_{W_1}-G_{W_2}\|_F\|U\|\le\|G_{W_1}-G_{W_2}\|_F=d_F(G_{W_1},G_{W_2}).\qquad(45)$$
Again, recalling the set $T(\mathcal{C},U)$ defined in the proof of Theorem 8, Lemma 23 gives $\log N(T(\mathcal{C},U,X),d_F,\epsilon)\le\log N(T(\mathcal{C},U),d_F,\epsilon)$, which implies
$$\mathbb{E}\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma^{1/2}X^\top ZG_W)\le C\lambda_1^{1/2}\,\mathbb{E}\|X\|\,\sigma\int_0^\infty\sqrt{\log N(T(\mathcal{C},U),d_F,\epsilon)}\,d\epsilon.\qquad(46)$$
Now by Theorem 5.32 of Vershynin (2010), we have $\mathbb{E}\|X\|\le\sqrt{n}+\sqrt{r}$, so that
$$\mathbb{E}\,n^{-1}\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma^{1/2}X^\top ZG_W)\le C\lambda_1^{1/2}\sigma\big(1/\sqrt{n}+\sqrt{r}/n\big)\int_0^\infty\sqrt{\log N(T(\mathcal{C},U),d_F,\epsilon)}\,d\epsilon.\qquad(47)$$
Similarly, we can derive
$$\mathbb{E}\,n^{-1}\sup_{W\in\mathcal{C}}\mathrm{tr}(Z^\top X\Gamma^{1/2}U^\top G_W)\le C\lambda_1^{1/2}\sigma\big(1/\sqrt{n}+\sqrt{r}/n\big)\int_0^\infty\sqrt{\log N(T(\mathcal{C},U),d_F,\epsilon)}\,d\epsilon.\qquad(48)$$
On the other hand, since
$$\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma^{1/2}\bar{X}\bar{Z}^\top G_W)=\sup_{W\in\mathcal{C}}\mathrm{tr}(\bar{X}\bar{Z}^\top G_WU\Gamma^{1/2})\le\lambda_1^{1/2}\sup_{W\in\mathcal{C}}\mathrm{tr}\big(\bar{Z}^\top G_WU\bar{X}/\|\bar{X}\|\big)\|\bar{X}\|\le\lambda_1^{1/2}\|\bar{X}\|\sup_{g\in T_1(\mathcal{C},U,\bar{X})}\langle\bar{Z},g\rangle,$$
where $T_1(\mathcal{C},U,\bar{X})=\{G_WU\bar{X}/\|\bar{X}\|:W\in\mathcal{C}\setminus\{U\}\}$ is equipped with the Euclidean $\ell_2$ distance, Theorem 22 yields
$$\mathbb{E}\bigg[\sup_{g\in T_1(\mathcal{C},U,\bar{X})}\langle\bar{Z},g\rangle\,\bigg|\,X\bigg]\le\frac{C\sigma}{\sqrt{n}}\int_0^\infty\sqrt{\log N(T_1(\mathcal{C},U,\bar{X}),d_2,\epsilon)}\,d\epsilon.$$
Now for any $g_1,g_2\in T_1(\mathcal{C},U,\bar{X})$, without loss of generality, if we assume $g_1=\|\bar{X}\|^{-1}G_{W_1}U\bar{X}$ and $g_2=\|\bar{X}\|^{-1}G_{W_2}U\bar{X}$, where $W_1,W_2\in\mathcal{C}\setminus\{U\}$, then
$$\|g_1-g_2\|_2\le\|\bar{X}\|^{-1}\|G_{W_1}U\bar{X}-G_{W_2}U\bar{X}\|_2\le d_\infty(G_{W_1},G_{W_2})\le d_F(G_{W_1},G_{W_2}).$$
Lemma 23 implies $\log N(T_1(\mathcal{C},U,\bar{X}),d_2,\epsilon)\le\log N(T(\mathcal{C},U),d_F,\epsilon)$, which, along with the fact that $\mathbb{E}\|\bar{X}\|_2\lesssim\sqrt{r/n}$, implies
$$\mathbb{E}\sup_{W\in\mathcal{C}}\mathrm{tr}(U\Gamma^{1/2}\bar{X}\bar{Z}^\top G_W)\le\frac{C\sigma\sqrt{r}\,\lambda_1^{1/2}}{n}\int_0^\infty\sqrt{\log N(T(\mathcal{C},U),d_F,\epsilon)}\,d\epsilon.\qquad(49)$$
Similarly, we have
$$\mathbb{E}\sup_{W\in\mathcal{C}}\mathrm{tr}(\bar{Z}\bar{X}^\top\Gamma^{1/2}U^\top G_W)\le\frac{C\sigma\sqrt{r}\,\lambda_1^{1/2}}{n}\int_0^\infty\sqrt{\log N(T(\mathcal{C},U),d_F,\epsilon)}\,d\epsilon.\qquad(50)$$

Part II. Note that
$$\mathrm{tr}\big((n^{-1}Z^\top Z-\sigma^2I_p)G_W\big)=\mathrm{tr}(n^{-1}Z^\top ZG_W)-\sigma^2\mathrm{tr}(G_W)=n^{-1}\mathrm{vec}(Z)^\top D_W\mathrm{vec}(Z),$$
since $\mathrm{tr}(G_W)=0$, where $D_W$ is defined as in (37).
By the similar chaining argument in Part II of the proof ofTheorem 8, we have E sup W ∈C tr(( n − Z (cid:62) Z − I p ) G W ) (cid:46) σ √ n (cid:90) ∞ (cid:112) log N ( T ( C , U ) , d , (cid:15) ) d(cid:15) + σ n (cid:90) ∞ log N ( T ( C , U ) , d , (cid:15) ) d(cid:15) (51)Similarly, since sup W ∈C tr( ¯ Z ¯ Z (cid:62) G W ) = sup W ∈C ¯ Z (cid:62) G W ¯ Z , we also have E sup W ∈C tr( ¯ Z ¯ Z (cid:62) G W ) (cid:46) σ n (cid:90) ∞ (cid:112) log N ( T ( C , U ) , d , (cid:15) ) d(cid:15) + σ n (cid:90) ∞ log N ( T ( C , U ) , d , (cid:15) ) d(cid:15). (52) ai, Li and Ma Part III. Define the event E = {(cid:107) n − X (cid:62) X − I r (cid:107) ≤ / (4 L ) , (cid:107) ¯ X ¯ X (cid:62) (cid:107) ≤ / (4 L ) } , where L is the constant in Z ( C , t, p, r ). By Proposition D.1 in the Supplementary Material of Ma(2013), P ( (cid:107) n − X (cid:62) X − I r (cid:107) ≤ (cid:112) r/n + t ) + ( (cid:112) r/n + t ) ) ≥ − e − nt / , which implies P ( (cid:107) n − X (cid:62) X − I r (cid:107) ≤ / (4 L )) ≥ − e − cn . In addition, since (cid:107) ¯ X ¯ X (cid:62) (cid:107) ≤(cid:107) ¯ X (cid:107) = n (cid:80) ri =1 g i , where g i ∼ i.i.d. N (0 , P ( (cid:107) ¯ X ¯ X (cid:62) (cid:107) ≤ / (4 L )) ≥ − e − cn . Thus, it follows that P ( E c ) ≤ P ( (cid:107) n − X (cid:62) X − I r (cid:107) ≥ / (4 L )) + P ( (cid:107) ¯ X ¯ X (cid:62) (cid:107) ≥ / (4 L )) ≤ e − cn . Part IV. Note that E d ( U , (cid:98) U ) = E [ d ( U , (cid:98) U ) | E ] + E [ d ( U , (cid:98) U ) | E c ] . 
It follows from (44) andthe inequalities (47)-(52) from Parts I and II thatsup ( Γ , U ) ∈Z ( C ,t,p,t ) E [ d ( U , (cid:98) U ) | E ] ≤ Ct (cid:20) √ tσ (cid:18) √ n + √ rn (cid:19) D ( T ( C , U ) , d ) + σ D ( T ( C , U ) , d ) √ n + σ D (cid:48) ( T ( C , U ) , d ) n (cid:21) ≤ Cσ ∆( C ) (cid:112) t (1 + r/n ) + σ √ nt , where the last inequality holds whenever t/σ (cid:38) sup U ∈C [ D (cid:48) ( T ( C , U ) , d ) /D ( T ( C , U ) , d )].On the other hand, by Part III, E [ d ( U , (cid:98) U ) | E c ] ≤ diam( C ) · P ( E c ) ≤ C √ re − cn . Consequentlyas long as n (cid:38) max { log tσ , r } and t/σ (cid:38) sup U ∈C [ D (cid:48) ( T ( C , U ) , d ) /D ( T ( C , U ) , d )], wehave sup ( Γ , U ) ∈Z ( C ,t,p,t ) E d ( U , (cid:98) U ) ≤ Cσ ∆( C ) √ t + σ √ nt . The final result then follows by noticing the trivial upper bound of diam( C ). A.2 Minimax Lower BoundsProof of Theorem 5.
The proof is divided into two parts: the strong signal regime ($t \ge \sigma^2 p_2/4$) and the weak signal regime ($t < \sigma^2 p_2/4$).

Part I. Strong Signal Regime. The proof relies on the following lemma from Tsybakov (2009).

Lemma 30 (Tsybakov (2009))
Assume that $M \ge 2$ and suppose that $(\Theta, d)$ contains elements $\theta_0, \theta_1, \dots, \theta_M$ such that: (i) $d(\theta_j, \theta_k) \ge 2s > 0$ for any $0 \le j < k \le M$; (ii) it holds that $\frac{1}{M}\sum_{j=1}^M D(P_j, P_0) \le \alpha \log M$ with $0 < \alpha < 1/8$ and $P_j = P_{\theta_j}$ for $j = 0, 1, \dots, M$, where $D(P_j, P_0) = \int \log\frac{dP_j}{dP_0}\, dP_j$ is the KL divergence between $P_j$ and $P_0$. Then
$$\inf_{\hat\theta}\sup_{\theta \in \Theta} P_\theta\bigl(d(\hat\theta, \theta) \ge s\bigr) \ge \frac{\sqrt{M}}{1+\sqrt{M}}\Bigl(1 - 2\alpha - \sqrt{\frac{2\alpha}{\log M}}\Bigr).$$

Let $V_0 \in O(p_2, r)$ be fixed and $U_0 \in \mathcal{C}$. Denote the $\epsilon$-ball $B(U_0, \epsilon) = \{U \in O(p_1, r) : d(U, U_0) \le \epsilon\}$. For some $\delta < \epsilon$, we consider the local $\delta$-packing set $G_\delta = G(B(U_0, \epsilon) \cap \mathcal{C}, d, \delta)$ such that for any pair $U, U' \in G_\delta$, it holds that $d(U, U') = \|UU^\top - U'U'^\top\|_F \ge \delta$. We denote the elements of $G_\delta$ as $U_i$ for $1 \le i \le |G_\delta|$. Lemma 20 shows that, for any $i$, we can find $O_i \in O(r)$ such that $\|U_0 - U_i O_i\|_F \le d(U_0, U_i) \le \epsilon$. Set $U_i' = U_i O_i$ and denote $G_\delta' = \{U_i'\}$. For given $t >$
0, we consider the subset X ( t, (cid:15), δ, U , V ) = { ( Γ , U , V ) : U ∈ G (cid:48) δ , V = V , Γ = t I r } ⊂ Y ( C , t, p , p , r ) . In particular, the above construction admits |X ( t, (cid:15), δ, U , V ) | = | G δ | . Moreover, for any ( Γ , U i , V ) ∈ X ( t, (cid:15), δ, U , V ), let P i be the probability measure of Y = U i ΓV (cid:62) + Z where Z has i.i.d. entries from N (0 , σ ). We have, for 1 ≤ i (cid:54) = j ≤ | G δ | , D ( P i , P j ) = (cid:107) ( U (cid:48) i − U (cid:48) j ) ΓV (cid:62) (cid:107) F σ ≤ t (cid:107) U (cid:48) i − U (cid:48) j (cid:107) F σ ≤ t (cid:15) σ . Now set (cid:15) = (cid:15) and δ = α(cid:15) for some α ∈ (0 , (cid:18) cσ t log | G α(cid:15) | ∧ diam ( C ) (cid:19) ≤ (cid:15) ≤ (cid:18) σ t log | G α(cid:15) | ∧ diam ( C ) (cid:19) (53)for some c ∈ (0 , / D ( P i , P j ) ≤ log | G α(cid:15) | . Now by Lemma 30, it holdsthat, for θ = ( Γ , U , V ),inf (cid:98) U sup θ ∈X ( t,(cid:15),δ, U , V ) P θ ( d ( (cid:98) U , U ) ≥ α(cid:15) / ≥ (cid:112) | G α(cid:15) | (cid:112) | G α(cid:15) | (cid:18) − (cid:112) | G α(cid:15) | (cid:19) . By Markov’s inequality, we haveinf (cid:98) U sup θ ∈X ( t,(cid:15),δ, U , V ) E θ d ( (cid:98) U , U ) ≥ α(cid:15) (cid:112) | G α(cid:15) | (cid:112) | G α(cid:15) | ) (cid:18) − (cid:112) | G α(cid:15) | (cid:19) ≥ Cα(cid:15) , for some C > | G α(cid:15) | ≥
2. Therefore, it holds thatinf (cid:98) U sup θ ∈Y ( C ,t,p ,p ,r ) E θ d ( (cid:98) U , U ) ≥ inf (cid:98) U sup θ ∈X ( t,(cid:15),δ, U , V ) E θ d ( (cid:98) U , U ) (cid:38) ( σt − (cid:112) log | G α(cid:15) | ∧ diam( C )) (cid:38) (cid:18) σ (cid:112) t + σ p t (cid:112) log | G α(cid:15) | ∧ diam( C ) (cid:19) . Part II. Weak Signal Regime. The proof relies on the following generalized Fano’s method,obtained by Ma et al. (2019), about testing multiple composite hypotheses.
Lemma 31 (Generalized Fano’s Method)
Let µ , µ , ..., µ M be M + 1 priors on theparameter spaces Θ of the family { P θ } , and let P j be the posterior probability measures on ( X , A ) such that P j ( S ) = (cid:90) P θ ( S ) µ j ( dθ ) , ∀ S ∈ A , j = 0 , , ..., M. ai, Li and Ma Let F : Θ → ( R d , d ) . If (i) there exist some sets B , B , ..., B M ⊂ R d such that d ( B i , B j ) ≥ s for some s > for all ≤ i (cid:54) = j ≤ M and µ j ( θ ∈ Θ : F ( θ ) ∈ B j ) = 1 ; and (ii) it holdsthat M (cid:80) Mj =1 D ( P j , P ) ≤ α log M with < α < / . Then inf ˆ F sup θ ∈ Θ P θ ( d ( ˆ F , F ( θ )) ≥ s ) ≥ √ M √ M (cid:18) − α − (cid:114) α log M (cid:19) . To use the above lemma, we need to construct a collection of priors over the set Y ( C , t, p , p , r ). Specifically, recall the previously constructed δ -packing set G δ = { U i :1 ≤ i ≤ | G δ |} . Inspired by Cai and Zhang (2018), we consider the prior probability measure µ i over Y ( C , t, p , p , r ), whose definition is given as follows. Let W be a random matrix on R p × r , whose probability density is given by p ( W ) = C (cid:18) p π (cid:19) rp / exp( − p (cid:107) W (cid:107) F / · { / ≤ λ min ( W ) ≤ λ max ( W ) ≤ } , where C is a normalizing constant; then, if we denote ˜ U i ˜ Γ i ˜ V (cid:62) i as the SVD of t U i W (cid:62) ∈ R p × p where U i ∈ G δ and W ∼ p ( W ), then µ i is defined as the joint distribution of( ˜ Γ i , ˜ U i , ˜ V i ). By definition of U i , one can easily verify that µ i is a well-defined probabilitymeasure on Y ( C , t, p , p , r ). 
Note that, for any θ i = ( ˜ Γ i , ˜ U i , ˜ V i ) ∈ supp( µ i ) and θ j =( ˜ Γ j , ˜ U j , ˜ V j ) ∈ supp( µ j ) with 1 ≤ i (cid:54) = j ≤ | G δ | , it holds that d ( ˜ U i , ˜ U j ) = d ( U i , U j ) ≥ δ .Consequently, the joint distribution of Y = UΓV (cid:62) + Z with ( Γ , U , V ) ∼ µ i and Z ij ∼ N (0 , σ ) can be expressed as P i ( Y ) = C (cid:90) / ≤ λ min ( W ) ≤ λ max ( W ) ≤ σ − p p (2 π ) p p / exp( −(cid:107) Y − t U i W (cid:62) (cid:107) F / (2 σ )) × (cid:18) p π (cid:19) rp / exp( − p (cid:107) W (cid:107) F / d W , and it remains to control the pairwise KL divergence D ( P i , P j ) for any 1 ≤ i (cid:54) = j ≤ | G δ | . This is done by the next lemma, whose proof, which is involved, is delayed to Section C.
Lemma 32
Under the assumption of the theorem, for any ≤ i (cid:54) = j ≤ | G δ | , we have D ( P i , P j ) ≤ C t d ( U i , U j ) σ (4 t + σ p ) + C where C , C > are some uniform constant and { U i } areelements of G δ . Again, set (cid:15) = (cid:15) and δ = α(cid:15) for some α ∈ (0 , (cid:18) cσ ( t + σ p ) t log | G δ | ∧ diam( C ) (cid:19) ≤ (cid:15) ≤ (cid:18) σ ( t + σ p )640 t log | G δ | ∧ diam( C ) (cid:19) , for some c ∈ (0 , / D ( P i , P j ) ≤ C log | G α(cid:15) | + C . Now let X (cid:48) ( t, (cid:15), δ, U ) = (cid:83) ≤ i ≤| G α(cid:15) | supp( µ i ) . By Lemma 31 and Markov’s inequality, we have,for θ = ( Γ , U , V ),inf (cid:98) U sup θ ∈X (cid:48) ( t,(cid:15) ,α(cid:15) , U ) E θ d ( (cid:98) U , U ) ≥ α(cid:15) (cid:112) | G α(cid:15) | (cid:112) | G α(cid:15) | ) (cid:18) − (cid:112) | G α(cid:15) | (cid:19) ≥ Cα(cid:15) , ptimal Structured Principal Subspace Estimation for some C > | G α(cid:15) | ≥
2. Hence,inf (cid:98) U sup θ ∈Y ( C ,t,p ,p ,r ) E θ d ( (cid:98) U , U ) (cid:38) inf (cid:98) U sup θ ∈X (cid:48) ( t,(cid:15) ,α(cid:15) , U ) E θ d ( (cid:98) U , U ) (cid:38) (cid:18) σ (cid:112) t + σ p t (cid:112) log | G α(cid:15) | ∧ diam( C ) (cid:19) . Proof of Theorem 9.
For some $U_0 \in \mathcal{C}$, similar to the proof of Theorem 5, we consider the $\delta$-packing set $G_\delta = G(B(U_0, \epsilon) \cap \mathcal{C}, d, \delta)$, where for any $U_i, U_j \in G_\delta$, $d(U_i, U_j) = \|U_i U_i^\top - U_j U_j^\top\|_F \ge \delta$. Then, for given $t >$
0, we consider the subset Z (cid:48) ( t, (cid:15), δ, U ) = { ( Γ , U ) ∈ Z ( C , t, p, r ) : U ∈ G δ , Γ = t I r } , so that |Z (cid:48) ( t, (cid:15), δ, U ) | = | G δ | . Let P i be the jointprobability measure of Y k ∼ i.i.d. N (0 , Σ i ) with k = 1 , ..., n and Σ i = t U i U (cid:62) i + σ I p . Wehave, for any 1 ≤ i (cid:54) = j ≤ | G δ | , D ( P i , P j ) = n (cid:18) tr( Σ − j Σ i ) − p + log (cid:18) det Σ i det Σ j (cid:19)(cid:19) = n (cid:18) − tt + σ U i U (cid:62) i + tσ U j U (cid:62) j − t σ ( t + σ ) U i U (cid:62) i U j U (cid:62) j (cid:19) = nt σ ( σ + t ) ( r − (cid:107) U (cid:62) i U j (cid:107) F ) ≤ nt d ( U i , U j ) σ ( σ + t ) ≤ nt (cid:15) σ ( σ + t ) , where the second equation follows from the Woodbury matrix identity and the second lastinequality follows from Lemma 20. Now let (cid:15) = (cid:15) and δ = α(cid:15) for some α ∈ (0 , (cid:18) cσ ( σ + t ) nt log | G α(cid:15) | ∧ diam ( C ) (cid:19) ≤ (cid:15) ≤ (cid:18) σ ( σ + t )32 nt log | G α(cid:15) | ∧ diam ( C ) (cid:19) , for some c ∈ (0 , / D ( P i , P j ) ≤ log | G α(cid:15) | . Now by Lemma 30, it holdsthat, for θ = ( Γ , U ),inf (cid:98) U sup θ ∈Z (cid:48) ( t,(cid:15) ,α(cid:15) , U ) P θ ( d ( (cid:98) U , U ) ≥ α(cid:15) / ≥ (cid:112) | G α(cid:15) | (cid:112) | G α(cid:15) | (cid:18) − (cid:112) | G α(cid:15) | (cid:19) . By Markov’s inequality, as long as | G α(cid:15) | ≥
2, we haveinf (cid:98) U sup θ ∈Z (cid:48) ( t,(cid:15) ,α(cid:15) , U ) E θ d ( (cid:98) U , U ) ≥ Cα(cid:15) , for some C >
0. Therefore, since Z (cid:48) ( t, (cid:15) , α(cid:15) , U ) ⊂ Z ( C , t, p, r ),inf (cid:98) U sup θ ∈Z ( C ,t,p,r ) R ( (cid:98) U , U ) ≥ inf (cid:98) U sup θ ∈Z (cid:48) ( t,(cid:15) ,α(cid:15) , U ) E θ d ( (cid:98) U , U ) (cid:38) (cid:18) σ √ σ + tt √ n (cid:112) log | G α(cid:15) | ∧ diam( C ) (cid:19) . ai, Li and Ma Appendix B. Calculation of Metric Entropies
In this section, we prove the results in Section 5 by calculating the metric entropies of some specific sets. The calculation relies on the following useful lemmas.
Lemma 33 (Varshamov-Gilbert Bound)
Let $\Omega = \{0,1\}^n$ and $1 \le d \le n/4$. Then there exists a subset $\{\omega^{(1)}, \dots, \omega^{(M)}\}$ of $\Omega$ such that $\|\omega^{(j)}\|_0 = d$ for all $1 \le j \le M$ and $\|\omega^{(j)} - \omega^{(k)}\|_0 \ge d/2$ for $1 \le j < k \le M$, and $\log M \ge c\, d \log(n/d)$ where $c \ge 0.233$.

The proof of the above version of the Varshamov-Gilbert bound can be found, for example, in Lemma 4.10 of Massart (2007). The next two lemmas concern estimates of the covering/packing numbers of the orthogonal group.
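As a quick numerical illustration of the Varshamov-Gilbert bound, the sketch below greedily collects $d$-sparse binary vectors (represented by their supports) with pairwise Hamming distance at least $d/2$, then compares $\log M$ against $d\log(n/d)$. The function name `greedy_sparse_packing` and the parameters $n = 60$, $d = 6$ are illustrative choices; the randomized greedy construction only certifies a lower bound on the packing size and is not the combinatorial argument of Massart (2007).

```python
import math
import random

def greedy_sparse_packing(n, d, max_tries=1000, seed=0):
    """Greedily collect supports of d-sparse vectors in {0,1}^n whose
    pairwise Hamming distance is at least d/2 (the separation in Lemma 33)."""
    rng = random.Random(seed)
    packing = []
    for _ in range(max_tries):
        support = frozenset(rng.sample(range(n), d))
        # Hamming distance between two d-sparse vectors = 2*(d - |overlap|)
        if all(2 * (d - len(support & s)) >= d / 2 for s in packing):
            packing.append(support)
    return packing

n, d = 60, 6
P = greedy_sparse_packing(n, d)
M = len(P)
# The bound predicts log M >= c * d * log(n/d) with c >= 0.233.
print(M, math.log(M), 0.233 * d * math.log(n / d))
```

Even this naive construction already achieves $\log M$ well above $0.233\, d\log(n/d)$ for these parameters, consistent with the lemma.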
Lemma 34 (Candes and Plan 2011)
Define $\mathcal{P} = \{\bar U \bar\Gamma \bar V^\top : \bar U, \bar V \in O(p, r), \|(\bar\Gamma_{ii})_{1 \le i \le r}\|_2 = 1\}$. Then for any $\epsilon \in (0, \sqrt{2}]$, there exists an $\epsilon$-covering set $H(\mathcal{P}, d, \epsilon)$ such that $|H(\mathcal{P}, d, \epsilon)| \le (c/\epsilon)^{(2p+1)r}$ for some constant $c > 0$.

Lemma 35
For any $V \in O(k, r)$, identifying the subspace $\mathrm{span}(V)$ with its projection matrix $VV^\top$, define the metric on the Grassmannian manifold $G(k, r)$ by $\rho(VV^\top, UU^\top) = \|VV^\top - UU^\top\|_F$. Then for any $\epsilon \in (0, \sqrt{2(r \wedge (k - r))})$,
$$\Bigl(\frac{c_0}{\epsilon}\Bigr)^{r(k-r)} \le N(G(k, r), \rho, \epsilon) \le \Bigl(\frac{c_1}{\epsilon}\Bigr)^{r(k-r)},$$
where $N(E, \rho, \epsilon)$ is the $\epsilon$-covering number of $E$ and $c_0, c_1$ are absolute constants. Moreover, for any $V \in O(k, r)$ and any $\alpha \in (0, 1)$, it holds that
$$\Bigl(\frac{c_0}{\alpha c_1}\Bigr)^{r(k-r)} \le M(B(V, \epsilon), \rho, \alpha\epsilon) \le \Bigl(\frac{c_1}{\alpha c_0}\Bigr)^{r(k-r)}.$$

Proof
We only prove the entropy upper bound M ( B ( V, (cid:15) ) , d, α(cid:15) ) ≤ (cid:18) c αc (cid:19) r ( k − r ) , (54)as the other results has been proved in Lemma 1 of Cai et al. (2013). Specifically, Let G (cid:15) be the (cid:15) -packing set of O ( k, r ). It then holds that M ( O ( k, r ) , d, α(cid:15) ) ≥ (cid:88) V ∈ G (cid:15) M ( B ( V, (cid:15) ) , d, α(cid:15) ) ≥ | G (cid:15) |M ( B ( V ∗ , (cid:15) ) , d, α(cid:15) )= M ( O ( k, r ) , d, (cid:15) ) M ( B ( V ∗ , (cid:15) ) , d, α(cid:15) )for some V ∗ ∈ O ( k, r ). Hence, M ( B ( V ∗ , (cid:15) ) , d, α(cid:15) ) ≤ M ( O ( k, r )) , d, α(cid:15) ) M ( O ( k, r )) , d, (cid:15) ) . By the equivalence between the packing and the covering numbers, it holds that M ( B ( V ∗ , (cid:15) ) , d, α(cid:15) ) ≤ N ( O ( k, r )) , d, α(cid:15)/ N ( O ( k, r )) , d, (cid:15) ) ≤ (cid:18) c αc (cid:19) r ( k − r ) , ptimal Structured Principal Subspace Estimation where the last inequality follows from the first statement of the lemma. Then (54) holdssince the metric d is unitarily invariant.The following lemma is an estimate of the Dudley’s entropy integral for the orthogonalgroup O ( p, r ). Lemma 36
For any given $U \in O(p, r)$, there exists some constant $C > 0$ such that $\int_0^\infty \sqrt{\log N(T(O(p, r), U), d, \epsilon)}\, d\epsilon \le C\sqrt{pr}$. Therefore, we have $\Delta^2(O(p, r)) \le Cpr$.

Proof
By definition, for any G ∈ T ( O ( p, r ) , U ), it is at most rank 2 r , and suppose its SVDis G = ¯ U ¯ Γ ¯ V (cid:62) , then ¯ Γ is a diagonal matrix with nonnegative diagonal entries and Frobeniusnorm equal to one. Thus, if we define P = { ¯ U ¯ Γ ¯ V (cid:62) : ¯ U , ¯ V ∈ O ( p, r ) , (cid:107) ( ¯ Γ ii ) ≤ i ≤ r (cid:107) = 1 } ,then by Lemma 23, N ( T ( O ( p, r ) , U ) , d , (cid:15) ) ≤ N ( P , d , (cid:15) ) . By Lemma 34, we can calculate that (cid:90) ∞ (cid:112) log N ( T ( O ( p, r ) , U ) , d , (cid:15) ) d(cid:15) ≤ (cid:90) ∞ (cid:112) log N ( P , d , (cid:15) ) d(cid:15) ≤ C √ pr (cid:90) √ (cid:112) log( c/(cid:15) ) d(cid:15) ≤ C √ pr. (55)The second statement follows directly from the definition of ∆ ( O ( p, r )). B.1 Sparse PCA/SVD: Proof of Proposition 11 and Theorem 12Matrix denoising model with C S ( p , r, k ) , or sparse SVD. By Lemma 33, we canconstruct a subset Θ (cid:15) ( k ) ⊂ C S ( p , r, k ) as follows. Let Ω M = { ω (1) , ..., ω ( M ) } ⊂ { , } p − r − be the set obtained from Lemma 33 where n = p − r − d = k/e < ( p − r − / M is the smallest integer such that log M ≥ cd log n/d , i.e., M = (cid:100) exp( ck log e ( p − r − k ) (cid:101) . Wedefine Θ (cid:15) = (cid:26) (cid:20) v 00 I r − (cid:21) : v = ( (cid:112) − (cid:15) , (cid:15)ω/ √ d ) ∈ S p − r − , ω ∈ Ω M (cid:27) , (cid:15) ∈ (0 , . Then Θ (cid:15) is a (cid:15) -packing set of B ( U , √ (cid:15) ) ∩ C S ( p , r, k ) with U = (cid:20) v
00 I r − (cid:21) where v =(1 , , ..., (cid:62) , | Θ (cid:15) | = M . Now we set (cid:15) = c ( t + σ p ) σ k log( e ( p − r − /k ) t ∧ , for some sufficiently small c >
0. It follows that (cid:18) c σ ( t + σ p ) t log | Θ (cid:15) | ∧ (cid:19) ≤ (cid:15) ≤ (cid:18) σ ( t + σ p )640 t log | Θ (cid:15) | ∧ (cid:19) ai, Li and Ma for some c ∈ (0 , / (cid:15) = √ (cid:15) , α = 1 / (2 √ | Θ (cid:15) | (cid:16) k log( ep /k ). Moreover, for any U (cid:48) ∈ O ( k, r ), suppose M (cid:15) ⊂ O ( k, r ) is an α(cid:15) -packing set of B ( U (cid:48) , (cid:15) ) constructed as in Lemma 35, then the setΘ (cid:48) (cid:15) = (cid:26) U = (cid:20) W0 (cid:21) , W ∈ M (cid:15) (cid:27) ⊂ C S ( p , r, k ) , (56)is an α(cid:15) -packing set of C S ( p , r, k ) ∩ B ( U , (cid:15) ) where U = (cid:20) U (cid:48) (cid:21) , and | Θ (cid:48) (cid:15) | ≥ ( c/α ) r ( k − r ) .Now we set (cid:15) = c ( t + σ p ) σ r ( k − r ) t ∧ r , for some sufficiently small c >
0. It follows that (cid:18) c σ ( t + σ p ) t log | Θ (cid:48) (cid:15) | ∧ r (cid:19) ≤ (cid:15) ≤ (cid:18) σ ( t + σ p )640 t log | Θ (cid:48) (cid:15) | ∧ r (cid:19) for some c ∈ (0 , / | Θ (cid:48) (cid:15) | (cid:16) r ( k − r ).To obtain an upper bound for ∆( C S ( p , r, k )), we notice that any element H ∈ C S ( p , r, k )satisfies H = H (cid:62) and max ≤ i ≤ p (cid:107) H i. (cid:107) ≤ k, max ≤ i ≤ p (cid:107) H .i (cid:107) ≤ k. Then T ( C S ( p , r, k ) , U ) can be covered by the union of its (cid:0) p k (cid:1) disjoint subsets, with eachsubset corresponding to a fixed sparsity configuration. Each of the above subsets can beidentified with T ( O ( k, r ) , U (cid:48) ) for some U (cid:48) ∈ O ( k, r ), and by Lemma 34 and the proof ofLemma 36, N ( T ( O ( k, r ) , U (cid:48) ) , d , (cid:15) ) ≤ ( c/(cid:15) ) r (2 k +1) . for any (cid:15) ∈ (0 , √ N ( T ( C S ( p , r, k ) , U ) , d , (cid:15) ) ≤ (cid:18) p k (cid:19) ( c /(cid:15) ) r (2 k +1) ≤ ( ep /k ) k ( c /(cid:15) ) r (2 k +1) . As a result, (cid:90) ∞ (cid:112) log N ( T ( C S ( p , r, k ) , U ) , d , (cid:15) ) d(cid:15) ≤ (cid:112) k log( ep /k ) + (cid:112) r (2 k + 1) (cid:90) √ (cid:114) log c (cid:15) d(cid:15) ≤ C ( (cid:112) k log( ep /k ) + √ rk ) . In addition, we also have (cid:90) ∞ log N ( T ( C S ( p , r, k ) , U ) , d , (cid:15) ) d(cid:15) ≤ C ( k log( ep /k ) + rk ) . The validity of Theorem 8 reduces to the condition t σ (cid:38) k log( ep /k ) + rk. Note that when r = O (1), this condition is satisfied whenever σ (cid:112) t + σ p t (cid:18)(cid:114) k log ep k + √ k (cid:19) (cid:46) . In other words, in light of the minimax lower bound (from Theorem 5), whenever consistentestimation is possible, the condition t σ (cid:38) k log( ep /k ) + k is satisfied and the proposedestimator is minimax optimal. The final results follows by combining Theorems 5 and 8. ptimal Structured Principal Subspace Estimation Spiked Wishart model with C S ( p, r, k ) , or sparse PCA. 
We omit the proof of this case, as it is similar to that of the sparse SVD.
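To make the sparse-SVD setting above concrete, here is a toy simulation under the matrix denoising model $Y = t\,uv^\top + Z$ with a $k$-sparse left singular vector. The screening-plus-SVD procedure in `simulate_sparse_svd` (all dimensions are arbitrary illustrative choices) is a simplified proxy, not the constrained estimator analyzed in the paper; it merely shows the projection distance $d(\hat u, u)$ shrinking as the signal strength $t$ grows.

```python
import numpy as np

def simulate_sparse_svd(p1=200, p2=150, k=10, t=50.0, sigma=1.0, seed=1):
    rng = np.random.default_rng(seed)
    u = np.zeros(p1)
    u[:k] = 1.0 / np.sqrt(k)              # k-sparse unit left singular vector
    v = rng.standard_normal(p2)
    v /= np.linalg.norm(v)
    Y = t * np.outer(u, v) + sigma * rng.standard_normal((p1, p2))
    # Crude support selection: keep the k rows with the largest norms,
    # then take the leading left singular vector of that submatrix.
    rows = np.argsort(-np.linalg.norm(Y, axis=1))[:k]
    uu = np.linalg.svd(Y[rows], full_matrices=False)[0][:, 0]
    u_hat = np.zeros(p1)
    u_hat[rows] = uu
    # Projection distance d(u_hat, u) = ||u_hat u_hat^T - u u^T||_F
    return np.linalg.norm(np.outer(u_hat, u_hat) - np.outer(u, u))

print(simulate_sparse_svd(t=50.0), simulate_sparse_svd(t=200.0))
```

With the same noise realization, quadrupling $t$ visibly reduces the estimation error, in line with the $1/t$ dependence of the rate.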
B.2 Non-Negative PCA/SVD: Proof of Proposition 13 and Theorem 14Matrix denoising model with C N ( p , r ) , or non-negative SVD. On the one hand,with Lemma 33, we can construct a subset Θ (cid:15) ⊂ O ( p , r ) as follows. Let Ω M = { ω (1) , ..., ω ( M ) } ⊂{ , } n be the set obtained from Lemma 33 where n = p − r − d = ( p − r − / M isthe smallest integer such that log M ≥ cd log n/d , i.e., M = (cid:100) exp( c ( p − r −
1) log 22 ) (cid:101) . Followingthe idea of Vu and Lei (2012) and Cai et al. (2013), we defineΘ (cid:15) = (cid:26) (cid:20) v 00 I r − (cid:21) : v = ( (cid:112) − (cid:15) , (cid:15)ω/ √ d ) ∈ S p − r − , ω ∈ Ω M (cid:27) , (cid:15) ∈ (0 , . Then it holds that Θ (cid:15) ⊂ B ( U , √ (cid:15) ) for U = (cid:20) v
00 I r − (cid:21) where v = (1 , , ..., (cid:62) , | Θ (cid:15) | = M , and that for any U (cid:54) = U (cid:48) ∈ Θ (cid:15) , d ( U , U (cid:48) ) ≥ √ · (cid:112) − (1 − (cid:15) / ≥ (cid:15) . In other words, Θ (cid:15) is a (cid:15) -packing set of B ( U , √ (cid:15) ) ∩ C NN ( p , r ). Now we set (cid:15) = c ( t + σ p ) σ ( p − r − t ∧ , for some sufficiently small c >
0. It follows that (cid:18) c σ ( t + σ p ) t log | Θ (cid:15) | ∧ (cid:19) ≤ (cid:15) ≤ (cid:18) σ ( t + σ p )640 t log | Θ (cid:15) | ∧ (cid:19) for some c ∈ (0 , / (cid:15) = √ (cid:15) , α = 1 / (2 √ | Θ (cid:15) | (cid:16) p .On the other hand, we need to obtain an upper bound for ∆( C N ( p , r )). To bound theDudley’s entropy integral (cid:82) ∞ (cid:112) log N ( T ( C N ( p , r ) , U ) , d , (cid:15) ) d(cid:15) , we simply use the fact that C N ( p , r ) ⊂ O ( p , r ) and N ( T ( C N ( p , r ) , U ) , d , (cid:15) ) ≤ N ( T ( O ( p , r ) , U ) , d , (cid:15) ) . Then by Lemma 36, we have ∆ ( C NN ( p , r )) (cid:46) p r . Combining Theorems 5 and 8, wehave ∆ ( C NN ( p , r )) (cid:38) log | Θ (cid:15) | , which implies ∆ ( C NN ( p , r )) (cid:16) log | Θ (cid:15) | (cid:16) p if r = O (1).Again, Theorem 8 requires t σ (cid:38) rp . Note that when r = O (1), this condition is satisfiedwhenever σ (cid:112) p ( t + σ p ) t (cid:46) . In other words, in light of the minimax lower bound (from Theorem 5), whenever consistentestimation is possible, the condition t σ (cid:38) p is satisfied and the proposed estimator isminimax optimal. ai, Li and Ma Spiked Wishart model with C N ( p, r ) , or non-negative PCA. Similarly, let Ω M = { ω (1) , ..., ω ( M ) } ⊂ { , } p − r − be the set obtained from Lemma 33 where d = ( p − r − / M is the smallest integer such that log M ≥ cd log( p − r − /d , i.e., M = (cid:100) exp( c ( p − r −
1) log 22 ) (cid:101) .We defineΘ (cid:15) = (cid:26) (cid:20) v 00 I r − (cid:21) : v = ( (cid:112) − (cid:15) , (cid:15)ω/ √ d ) ∈ S p − r − , ω ∈ Ω M (cid:27) , (cid:15) ∈ (0 , . Then it holds that Θ (cid:15) ⊂ B ( U , √ (cid:15) ) for U = (cid:20) v
00 I r − (cid:21) where v = (1 , , ..., (cid:62) , | Θ (cid:15) | = M , and that for any U (cid:54) = U (cid:48) ∈ Θ (cid:15) , d ( U , U (cid:48) ) ≥ √ · (cid:112) − (1 − (cid:15) / ≥ (cid:15) . In other words, Θ (cid:15) is a (cid:15) -packing set of B ( U , √ (cid:15) ) ∩ C NN ( p, r ). Now we set (cid:15) = c σ ( σ + t )( p − r − nt ∧ , for some sufficiently small c >
0. It follows that (cid:18) c σ ( σ + t ) nt log | Θ (cid:15) |∧ (cid:19) ≤ (cid:15) ≤ (cid:18) σ ( σ + t ) nt ( p − r −
1) log 210 ∧ (cid:19) ≤ (cid:18) σ ( σ + t )32 nt log | Θ (cid:15) |∧ (cid:19) for some c ∈ (0 , / | Θ (cid:15) | (cid:16) p . The restof the arguments such as the calculation of Dudley’s entropy integral are the same as theabove proof of the non-negative SVD. B.3 Subspace PCA/SVD: Proof of Proposition 16 and Theorem 17
To prove this proposition, in light of Lemmas 34, 35 and 36, it suffices to establish theisometry between ( C A ( p, r, k ) , d ) and ( O ( k, r ) , d ). Let Q ∈ O ( p, k ) has its columns beingthe basis of the null space of A . We consider the map F : O ( k, r ) → C A ( p, r, k ) where F ( W ) = QW . To show that F is a bijection, we notice that1. For any G ∈ C A ( p, r, k ), for each of its columns Q .i , there exists some v i ∈ S k − suchthat G .i = Qv i and v (cid:62) i v j = v (cid:62) i Q (cid:62) Qv j = G (cid:62) .i G .j = 0. Then let W = [ v , ..., v r ] ∈ O ( k, r ), apparently, we have F ( W ) = G . This proves that the map is onto.2. For any W (cid:54) = W ∈ O ( k, r ), it follows that F ( W ) (cid:54) = F ( W ). This proves theinjection.To show the map F is isometric, we notice that1. For any G = F ( W ) , G = F ( W ) ∈ C A ( p, r, k ), d ( F ( W ) , F ( W )) = (cid:107) QW W (cid:62) Q (cid:62) − QW W (cid:62) Q (cid:62) (cid:107) F ≤ (cid:107) Q (cid:107) (cid:107) W W (cid:62) − W W (cid:62) (cid:107) F ≤ d ( W , W ) . ptimal Structured Principal Subspace Estimation
2. For any $W_1, W_2 \in O(k, r)$,
$$d(W_1, W_2) = \|Q^\top Q W_1 W_1^\top Q^\top Q - Q^\top Q W_2 W_2^\top Q^\top Q\|_F \le d(F(W_1), F(W_2)).$$
Thus $d(F(W_1), F(W_2)) = d(W_1, W_2)$.

B.4 Spectral Clustering: Proof of Proposition 18 and Theorem 19
The upper bound ∆ ( C n ± ) (cid:46) n follows from the same argument as in the proof of Proposition15. For the second statement, by Lemma 33, we can construct a subset Θ( d ) ⊂ S n − asfollows. Let Ω M = { ω (1) , ..., ω ( M ) } ⊂ { , } n be the set obtained from Lemma 33 where (cid:107) ω ( j ) (cid:107) = d ≤ n/ ≤ j ≤ n and M is the smallest integer such that log M ≥ cd ,i.e., M = (cid:100) exp( cd log nd ) (cid:101) . We defineΘ( d ) = (cid:26) | ω − . · |√ n ∈ C n ± : ω ∈ Ω M ∪ { (0 , ..., } (cid:27) , where = (1 , ..., (cid:62) ∈ R n . Then since for u = ( − / √ n, ..., − / √ n ) (cid:62) and any u ∈ Θ( d ), d ( u , u ) ≤ (cid:107) u − u (cid:107) ≤ (cid:114) dn , it holds that Θ( d ) ⊂ B ( u , (cid:112) d/n ) with and that for any u (cid:54) = u (cid:48) ∈ Θ( d ), d ( u , u (cid:48) ) ≥ √ (cid:107) u − u (cid:48) (cid:107) ≥ (cid:114) dn so that Θ( d ) is a (cid:113) dn -packing set of B ( u , (cid:112) d/n ) ∩ C n ± . Now since t = Cσ ( n + √ np ), wecan set (cid:15) = (cid:114) dn , where d = c n, for some sufficiently small c >
0, and thus it follows that (cid:18) c σ ( t + σ p ) t log | Θ( d ) | ∧ (cid:19) ≤ (cid:15) ≤ (cid:18) σ ( t + σ p )128 t log | Θ( d ) | ∧ (cid:19) for some c ∈ (0 , / α = 1 / | Θ( d ) | (cid:16) n . Appendix C. Proof of Technical Lemmas
Proof of Lemma 21.
The first inequality can be proved by
$$\begin{aligned}
\langle U\Gamma U^\top,\, UU^\top - WW^\top\rangle
&= \mathrm{tr}(U\Gamma U^\top) - \mathrm{tr}(W^\top U\Gamma U^\top W)\\
&= \mathrm{tr}(\Gamma) - \mathrm{tr}(\Gamma U^\top WW^\top U)\\
&= \sum_{i=1}^{r} \lambda_i \bigl(1 - (U^\top WW^\top U)_{ii}\bigr)\\
&\ge \lambda_r \bigl(r - \mathrm{tr}(U^\top WW^\top U)\bigr)\\
&= \frac{\lambda_r}{2}\, \|UU^\top - WW^\top\|_F^2.
\end{aligned}$$
The other inequality follows from the same rationale.
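The chain of (in)equalities above is easy to check numerically. The sketch below draws random orthonormal frames $U, W \in O(p, r)$ and eigenvalues $\lambda_1 \ge \dots \ge \lambda_r > 0$, and verifies $\langle U\Gamma U^\top, UU^\top - WW^\top\rangle \ge \frac{\lambda_r}{2}\|UU^\top - WW^\top\|_F^2$ (the dimensions and number of trials are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 12, 3
for _ in range(50):
    # Random orthonormal frames via QR factorization
    U = np.linalg.qr(rng.standard_normal((p, r)))[0]
    W = np.linalg.qr(rng.standard_normal((p, r)))[0]
    # Eigenvalues lambda_1 >= ... >= lambda_r > 0
    lam = np.sort(rng.uniform(1.0, 5.0, size=r))[::-1]
    D = U @ U.T - W @ W.T
    lhs = np.sum((U @ np.diag(lam) @ U.T) * D)           # <U Gamma U^T, UU^T - WW^T>
    rhs = 0.5 * lam[-1] * np.linalg.norm(D, "fro") ** 2  # (lambda_r / 2) ||UU^T - WW^T||_F^2
    assert lhs >= rhs - 1e-9, (lhs, rhs)
print("Lemma 21 inequality holds on 50 random draws")
```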
Proof of Lemma 28.
Throughout the proof, for simplicity, we write P = P ( C , U ) and T = T ( C , U ). By Corollary 2.3.2 of Talagrand (2014), for any metric space ( T, d ), if wedefine e n ( T ) = inf { (cid:15) : N ( T, d, (cid:15) ) ≤ N n } , where N = 1; N n = 2 n for n ≥ , (57)then there exists some constant K ( α ) only depending on α such that γ α ( T, d ) ≤ K ( α ) (cid:88) n ≥ n/α e n ( T ) . (58)The following inequalities establish the correspondence between e n and the Dudley’s entropyintegral, (cid:88) n ≥ n/ e n ( T ) ≤ C (cid:90) ∞ (cid:112) log N ( T, d, (cid:15) ) d(cid:15), (cid:88) n ≥ n e n ( T ) ≤ C (cid:90) ∞ log N ( T, d, (cid:15) ) d(cid:15), (59)whose derivation is delayed to the end of this proof. Combining (58) and (59), it followsthat γ α ( T, d ) ≤ K ( α ) (cid:90) ∞ log /α N ( T, d, (cid:15) ) d(cid:15). (60)By (60), it suffices to obtain estimates of the metric entropies log N ( P , d ∞ , (cid:15) ) and (cid:112) log N ( P , d , (cid:15) ).By definition of T , apparently ( P , d ∞ ) is isomorphic to ( T , d ∞ ), then by Lemma 23, it holdsthat N ( P , d ∞ , (cid:15) ) = N ( T , d ∞ , (cid:15) ) . Along with the fact that, for any G , G ∈ T , d ∞ ( G , G ) ≤ d ( G , G ) and therefore N ( T , d ∞ , (cid:15) ) ≤ N ( T , d , (cid:15) ) , we prove the first statement of the lemma. On the other hand, consider the map F :( P , d ) → ( T , d ) where for any D ∈ P , F ( D ) ∈ R p × p is the submatrix of D by extractingits entries in the first p columns and rows. Then, for any D , D ∈ P , it holds that d ( F ( D ) , F ( D )) = (cid:107) F ( D ) − F ( D ) (cid:107) F = 1 √ p d ( D , D ) . Again, applying Lemma 6, we have N ( P , d , (cid:15) ) = N ( T , d , (cid:15)/ √ p ) . The second statement of the lemma then follows simply from the change of variable γ ( P , d ) ≤ C (cid:90) ∞ (cid:113) log N ( T , d , (cid:15)/ √ p ) d(cid:15) = C √ p (cid:90) ∞ (cid:112) log N ( T , d , (cid:15) ) d(cid:15). ptimal Structured Principal Subspace Estimation Proof of (59).
The proof of the first inequality can be found, for example, on page 22 ofTalagrand (2014). Nevertheless, we provide a detailed proof for completeness. By definitionof e n , if (cid:15) < e n ( T ), we have N ( T, d, (cid:15) ) > N n and N ( T, d, (cid:15) ) ≥ N n + 1. Then (cid:112) log(1 + N n )( e n ( T ) − e n +1 ( T )) ≤ (cid:90) e n ( T ) e n +1 ( T ) (cid:112) log N ( T, d, (cid:15) ) . Since log(1 + N n ) ≥ n log 2 for n ≥
0, summation over n ≥ (cid:112) log 2 (cid:88) n ≥ n/ ( e n − e n +1 ( T )) ≤ (cid:90) e ( T )0 (cid:112) log N ( T, d, (cid:15) ) . Then the final inequality (59) follows by noting that (cid:88) n ≥ n/ ( e n − e n +1 ( T )) = (cid:88) n ≥ n/ e n ( T ) − (cid:88) n ≥ ( n − / e n ( T ) ≥ (1 − / √ (cid:88) n ≥ n/ e n ( T ) . The second inequality can be obtained similarly by working with the inequalitylog(1 + N n )( e n ( T ) − e n +1 ( T )) ≤ (cid:90) e n ( T ) e n +1 ( T ) log N ( T, d, (cid:15) ) . Proof of Lemma 32.
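For intuition about the correspondence (59) between the sequence $e_n(T)$ and Dudley's entropy integral, the sketch below evaluates both sides for the toy metric space $T = [0, 1]$ with the Euclidean metric (an illustrative assumption; there the covering number $N(T, d, \epsilon) = \lceil 1/(2\epsilon) \rceil$ is explicit) and confirms the two quantities agree up to a small constant factor.

```python
import math

# Covering number of T = [0, 1] under the Euclidean metric:
# N(T, eps) = ceil(1 / (2 * eps)) for eps < 1/2, and 1 otherwise.
def covering(eps):
    return 1 if eps >= 0.5 else math.ceil(1.0 / (2.0 * eps))

# e_n(T) = inf{eps : N(T, eps) <= N_n} with N_0 = 1 and N_n = 2^(2^n).
def e_n(n):
    N = 1 if n == 0 else 2 ** (2 ** n)
    return 0.5 if N == 1 else 1.0 / (2.0 * N)

partial_sum = sum(2 ** (n / 2.0) * e_n(n) for n in range(10))

# Riemann-sum approximation of Dudley's integral: int_0^inf sqrt(log N) d eps
h = 1e-4
integral = sum(math.sqrt(math.log(covering(k * h))) * h
               for k in range(1, int(0.5 / h) + 1))

print(partial_sum, integral)
```

Both quantities are finite and of the same order, as (59) asserts in general.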
The proof of this lemma generalizes the ideas in Cai and Zhang(2018) and Ma et al. (2019). In general, direct calculation of D ( P i , P j ) is difficult. Wedetour by introducing an approximate density of P i as˜ P i ( Y ) = σ − p p (2 π ) p p / (cid:90) exp( −(cid:107) Y − t U i W (cid:62) (cid:107) F / (2 σ )) (cid:18) p π (cid:19) rp / exp( − p (cid:107) W (cid:107) F / d W . Now for Y ∼ ˜ P i , if Y k is the k -th column of Y , we have Y k | U i ∼ i.i.d. N (cid:18) , σ (cid:18) I n − t t + σ p U i U (cid:62) i (cid:19) − (cid:19) = N (cid:18) , σ I n + 4 t p U i U (cid:62) i (cid:19) , (61)for k = 1 , ..., p . It is well-known that the KL-divergence between two p -dimensional multi-variate Gaussian distribution is D ( N ( µ , Σ ) (cid:107) N ( µ , Σ )) = 12 (cid:18) tr( Σ − Σ ) + ( µ − µ ) (cid:62) Σ − ( µ − µ ) − p + log (cid:18) det Σ det Σ (cid:19)(cid:19) . As a result, we can calculate that for any ˜ P i and ˜ P j , D ( ˜ P i , ˜ P j ) = p (cid:26) tr (cid:18)(cid:18) I p − t t + σ p U i U (cid:62) i (cid:19)(cid:18) I p + 4 t σ p U j U (cid:62) j (cid:19)(cid:19) − p (cid:27) ≤ Ct t + σ p ( r − (cid:107) U (cid:62) i U j (cid:107) F )= Ct d ( U i , U j )4 t + σ p (62) ai, Li and Ma where the last inequality follows from Lemma 20. Hence, the proof of this proposition iscomplete if we can show that there exist some constant C > D ( P i , P j ) ≤ D ( ˜ P i , ˜ P j ) + C. (63)The rest of the proof is devoted to the proof of (63). Proof of (63).
Define the event G = { W ∈ R r × p : 1 / ≤ λ min ( W ) ≤ λ max ( W ) ≤ } .For any given u , P i ˜ P i = 1(2 π ) rp ( σ t + σ p ) rp exp (cid:18) σ p (cid:88) k =1 Y (cid:62) k ( I p − t t + σ p U i U (cid:62) i ) Y k (cid:19) × C U i ,t (cid:90) G exp( −(cid:107) Y − t U i W (cid:62) (cid:107) F / (2 σ ) − p (cid:107) W (cid:107) F / d W = (cid:18) t + σ p πσ (cid:19) p r/ exp (cid:18) − (4 t + σ p ) (cid:13)(cid:13)(cid:13)(cid:13) W − t t + σ p U (cid:62) i Y (cid:13)(cid:13)(cid:13)(cid:13) F / (cid:19) d W = C U i ,t P (cid:18) W (cid:48) ∈ G (cid:12)(cid:12)(cid:12)(cid:12) W (cid:48) ∼ N (cid:18) t t + σ p U (cid:62) i Y , σ t + σ p I p (cid:19)(cid:19) ≤ C U i ,t . (64)Recall that C − U i ,t = P (cid:0) W = ( w jk ) ∈ G| w jk ∼ N (0 , /p ) (cid:1) . By concentration of measure inequalities for Gaussian random matrices (see, for example,Corollary 5.35 of Vershynin (2010)), we have, for sufficiently large ( p , r ), P ( W ∈ G ) ≥ − − cp ) , (65)for some constant c >
0. In other words, we have C − U i ,t ≥ − p − c (66)and P i ˜ P i ≤ p − c (67)uniformly for some constant c > . Thus, for some constant δ >
0, we have D ( P i , P j ) = (cid:90) P i (cid:20) log (cid:18) P i ˜ P i (cid:19) + log (cid:18) ˜ P i ˜ P j (cid:19) + log (cid:18) ˜ P j P j (cid:19)(cid:21) d Y ≤ log(1 + δ ) + D ( ˜ P i , ˜ P j ) + (cid:90) ( P i − ˜ P i ) log (cid:18) ˜ P i ˜ P j (cid:19) d Y + (cid:90) P i log (cid:18) ˜ P i P j (cid:19) d Y ≤ log(1 + δ ) + D ( ˜ P i , ˜ P j ) + (cid:90) ˜ P i (cid:18) P i ˜ P i − (cid:19) log (cid:18) ˜ P i ˜ P j (cid:19) d Y + (1 + δ ) (cid:90) ˜ P i (cid:12)(cid:12)(cid:12)(cid:12) log (cid:18) ˜ P j P j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) d Y ≤ log(1 + δ ) + D ( ˜ P i , ˜ P j ) + p − c (cid:90) ˜ P i (cid:12)(cid:12)(cid:12)(cid:12) log (cid:18) ˜ P i ˜ P j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) d Y + (1 + δ ) (cid:90) ˜ P i (cid:12)(cid:12)(cid:12)(cid:12) log (cid:18) ˜ P j P j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) d Y . (68) ptimal Structured Principal Subspace Estimation Now since (cid:90) ˜ P i (cid:12)(cid:12)(cid:12)(cid:12) log (cid:18) ˜ P i ˜ P j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) d Y = 12 σ (cid:90) ˜ P i (cid:12)(cid:12)(cid:12)(cid:12) t t + σ p p (cid:88) k =1 Y (cid:62) k ( U i U (cid:62) i − U j U (cid:62) j ) Y k (cid:12)(cid:12)(cid:12)(cid:12) d Y ≤ σ E (cid:20) t t + σ p p (cid:88) k =1 Y (cid:62) k ( U i U (cid:62) i + U j U (cid:62) j ) Y k (cid:21) = 4 t p σ (4 t + σ p ) tr (cid:18) ( U i U (cid:62) i + U j U (cid:62) j ) (cid:0) σ I p + 4 t p U i U (cid:62) i (cid:1)(cid:19) ≤ t p t + σ p tr (cid:18) U (cid:62) i (cid:0) I p + 4 t σ p (cid:1) U i (cid:19) = 4 rt σ ≤ rp , where in the second row the expectation is with respect to Y k ∼ N (cid:0) , σ I p + t σ σ p U i U (cid:62) i (cid:1) .we know that the third term in (68) can be bounded by p − c (cid:90) ˜ P i (cid:12)(cid:12)(cid:12)(cid:12) log (cid:18) ˜ P i ˜ P j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) d Y ≤ rp · p − c ≤ C for some constants C, c >
0. Finally, by (64), we have
$$\int \tilde P_i \left| \log\left(\frac{\tilde P_j}{P_j}\right) \right| dY \le \int \tilde P_i \left| \log \frac{1}{C_{U_j,t}} \right| dY + \int \tilde P_i \left| \log \frac{1}{P(W' \in G \mid E)} \right| dY,$$
where we denoted
$$E = \left\{ W' \sim N\left( \frac{2t}{4t^2 + \sigma^2 p} U_i^\top Y, \ \frac{\sigma^2}{4t^2 + \sigma^2 p} I_p \right) \right\}.$$
Now on the one hand,
$$\int \tilde P_i \left| \log \frac{1}{C_{U_j,t}} \right| dY \le \log(1+\delta) \vee |\log(1-\delta)|.$$
On the other hand, for fixed $Y$ and $U_i^\top Y \in \mathbb{R}^{r \times p}$, we can find $Q \in O(p, p-r)$ which is orthogonal to $U_i^\top Y$, i.e., $U_i^\top Y Q = 0$. Then the entries of $W' Q \in \mathbb{R}^{r \times (p-r)}$ are i.i.d. normally distributed with mean 0 and variance $\frac{\sigma^2}{4t^2 + \sigma^2 p}$. Then again by a standard result in random matrix theory (e.g., Corollary 5.35 in Vershynin (2010)), we have
$$\lambda_{\min}(W') = \lambda_r(W') \ge \lambda_r(W' Q) \ge \frac{\sigma}{\sqrt{4t^2 + \sigma^2 p}} \left( \sqrt{p-r} - \sqrt{r} - x \right)$$
with probability at least $1 - 2e^{-x^2/2}$. Since $t < \sigma\sqrt{p}/2$, for $p$ sufficiently large, we can find $c_0$ such that, by setting $x = c_0 \sqrt{p}$,
$$P\left( \lambda_{\min}(W') \ge 1/2 \right) \ge 1 - e^{-cp}. \qquad (69)$$
Analogous to the argument on $\lambda_{\min}(W')$, we also have
$$P\left( \lambda_{\max}(W') \le 2 \right) \ge 1 - e^{-cp}. \qquad (70)$$
Thus, by the union bound, we have $P(W' \in G) \ge 1 - e^{-cp}$, and consequently,
$$\int \tilde P_i \left| \log \frac{1}{P(W' \in G \mid E)} \right| dY \le \left| \log \frac{1}{1 - p^{-c}} \right| \le p^{-c}.$$
This bounds the last term of (68). Combining the above results, we have proven inequality (63) and therefore completed the proof.
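The non-asymptotic singular value bound invoked above (Corollary 5.35 of Vershynin (2010)) states that an $N \times n$ matrix $A$ with i.i.d. $N(0,1)$ entries satisfies $\sqrt{N} - \sqrt{n} - x \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + \sqrt{n} + x$ with probability at least $1 - 2e^{-x^2/2}$. A quick simulation illustrates how tightly the extreme singular values concentrate in the tall regime $n \ll N$ used in the proof (the dimensions and the value of $x$ below are illustrative choices, not the ones in the argument):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, x = 2000, 5, 5.0   # tall Gaussian matrix; failure prob <= 2*exp(-x^2/2) ~ 7e-6
trials, hits = 200, 0

lower = np.sqrt(N) - np.sqrt(n) - x
upper = np.sqrt(N) + np.sqrt(n) + x

for _ in range(trials):
    A = rng.standard_normal((N, n))
    s = np.linalg.svd(A, compute_uv=False)   # singular values, descending
    hits += (s[-1] >= lower) and (s[0] <= upper)

# Essentially every trial should satisfy both one-sided bounds.
frac = hits / trials
```

In the proof the entries have variance $\sigma^2/(4t^2 + \sigma^2 p)$ rather than 1, which simply rescales both bounds by the entry standard deviation.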
References
Miguel A Arcones and Evarist Giné. On decoupling, series expansions, and tail behavior of chaos processes. J. Theor. Probab., 6(1):101–122, 1993.

Martin Azizyan, Aarti Singh, and Larry Wasserman. Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. In NIPS, pages 2139–2147, 2013.

Zhidong Bai and Jian-feng Yao. Central limit theorems for eigenvalues in a spiked population model. In Annales de l'IHP Probabilités et Statistiques, volume 44, pages 447–474, 2008.

Jinho Baik and Jack W Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. J. Multiv. Anal., 97(6):1382–1408, 2006.

Afonso S Bandeira, Nicolas Boumal, and Amit Singer. Tightness of the maximum likelihood semidefinite relaxation for angular synchronization. Mathematical Programming, 163(1-2):145–167, 2017.

Zhigang Bao, Xiucai Ding, and Ke Wang. Singular vector and singular subspace distribution for the matrix denoising model. arXiv preprint arXiv:1809.10476, 2018.

Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3(Nov):463–482, 2002.

Aharon Birnbaum, Iain M Johnstone, Boaz Nadler, and Debashis Paul. Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Statist., 41(3):1055–1084, 2013.

Nicolas Boumal. Nonconvex phase synchronization. SIAM J. Optimiz., 26(4):2355–2377, 2016.

Olivier Bousquet, Vladimir Koltchinskii, and Dmitriy Panchenko. Some local measures of complexity of convex hulls and generalization bounds. In International Conference on Computational Learning Theory, pages 59–73. Springer, 2002.

T Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. Ann. Statist., 46(1):60–89, 2018.

T Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and adaptive estimation. Ann. Statist., 41(6):3074–3110, 2013.

T Tony Cai, Zongming Ma, and Yihong Wu. Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields, 161:781–815, 2015.

T Tony Cai, Tengyuan Liang, and Alexander Rakhlin. Geometric inference for general high-dimensional linear inverse problems. Ann. Statist., 44(4):1536–1563, 2016.

Emmanuel J Candès and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory, 57(4):2342–2359, 2011.

Yuxin Chen and Emmanuel J Candès. The projected power method: An efficient algorithm for joint alignment from pairwise differences. Comm. Pure Appl. Math., 71(8):1648–1714, 2018.

Yunjin Choi, Jonathan Taylor, and Robert Tibshirani. Selecting the number of principal components: Estimation of the true rank of a noisy matrix. Ann. Statist., 45(6):2590–2617, 2017.

Alexandre d'Aspremont, Laurent E Ghaoui, Michael I Jordan, and Gert R Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In NIPS, pages 41–48, 2005.

Yash Deshpande and Andrea Montanari. Information-theoretically optimal sparse PCA. In 2014 IEEE International Symposium on Information Theory, pages 2197–2201. IEEE, 2014.

Yash Deshpande, Andrea Montanari, and Emile Richard. Cone-constrained principal component analysis. In NIPS, pages 2717–2725, 2014.

David Donoho and Matan Gavish. Minimax risk of matrix denoising by singular value thresholding. Ann. Statist., 42(6):2413–2440, 2014.

David L Donoho, Matan Gavish, and Iain M Johnstone. Optimal shrinkage of eigenvalues in the spiked covariance model. Ann. Statist., 46(4):1742, 2018.

Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res., 9(Jul):1269–1294, 2008.

Orizon Pereira Ferreira, Alfredo N Iusem, and Sandor Z Németh. Projections onto convex sets on the sphere. Journal of Global Optimization, 57(3):663–676, 2013.

Christophe Giraud and Nicolas Verzelen. Partial recovery bounds for clustering with the relaxed K-means. arXiv preprint arXiv:1807.07547, 2018.

Gene H Golub and Charles F Van Loan. Matrix Computations, volume 3. JHU Press, 2012.

David Haussler and Manfred Opper. Metric entropy and minimax risk in classification. In Structures in Logic and Computer Science, pages 212–235. Springer, 1997a.

David Haussler and Manfred Opper. Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25(6):2451–2492, 1997b.

Adel Javanmard, Andrea Montanari, and Federico Ricci-Tersenghi. Phase transitions in semidefinite relaxations. P. Natl. Acad. Sci., 113(16):E2218–E2223, 2016.

Jiashun Jin and Wanjie Wang. Influential features PCA for high dimensional clustering. Ann. Statist., 44(6):2323–2359, 2016.

Jiashun Jin, Zheng Tracy Ke, and Wanjie Wang. Phase transitions for high dimensional clustering and related problems. Ann. Statist., 45(5):2151–2189, 2017.

Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295–327, 2001.

Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res., 11(Feb):517–553, 2010.

Jaya Kawale and Daniel Boley. Constrained spectral clustering using L1 regularization. In Proceedings of the 2013 SIAM International Conference on Data Mining, pages 103–111. SIAM, 2013.

Matthäus Kleindessner, Samira Samadi, Pranjal Awasthi, and Jamie Morgenstern. Guarantees for spectral clustering with fairness constraints. arXiv preprint arXiv:1901.08668, 2019.

Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593–2656, 2006.

Felix Krahmer, Shahar Mendelson, and Holger Rauhut. Suprema of chaos processes and the restricted isometry property. Comm. Pure Appl. Math., 67(11):1877–1904, 2014.

Guillaume Lecué and Shahar Mendelson. Aggregation via empirical risk minimization. Probab. Theory Related Fields, 145(3-4):591–613, 2009.

Matthias Löffler, Anderson Y Zhang, and Harrison H Zhou. Optimality of spectral clustering for Gaussian mixture model. arXiv preprint arXiv:1911.00538, 2019.

Yu Lu and Harrison H Zhou. Statistical and computational guarantees of Lloyd's algorithm and its variants. arXiv preprint arXiv:1612.02099, 2016.

Gábor Lugosi and Andrew B Nobel. Adaptive model selection using empirical complexities. Ann. Statist., 27(6):1830–1864, 1999.

Rong Ma, T Tony Cai, and Hongzhe Li. Optimal and adaptive estimation of extreme values in the permuted monotone matrix model. arXiv preprint arXiv:1911.12516, 2019.

Rong Ma, T Tony Cai, and Hongzhe Li. Optimal permutation recovery in permuted monotone matrix model. J. Amer. Statist. Assoc., 2020.

Zongming Ma. Sparse principal component analysis and iterative thresholding. Ann. Statist., 41(2):772–801, 2013.

Pascal Massart. Concentration Inequalities and Model Selection: Ecole d'Été de Probabilités de Saint-Flour XXXIII-2003. Springer, 2007.

Andrea Montanari and Emile Richard. Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. IEEE Trans. Inform. Theory, 62(3):1458–1484, 2015.

Mohamed Ndaoud. Sharp optimal recovery in the two component Gaussian mixture model. arXiv preprint arXiv:1812.08078, 2018.

Efe Onaran and Soledad Villar. Projected power iteration for network alignment. In Wavelets and Sparsity XVII, volume 10394, page 103941C. International Society for Optics and Photonics, 2017.

Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist. Sin., pages 1617–1642, 2007.

Amelia Perry, Alexander S Wein, Afonso S Bandeira, and Ankur Moitra. Optimality and sub-optimality of PCA I: Spiked random matrix models. Ann. Statist., 46(5):2416–2451, 2018.

Alexander Rakhlin, Karthik Sridharan, and Alexandre B Tsybakov. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017.

Sundeep Rangan and Alyson K Fletcher. Iterative estimation of constrained rank-one matrices in noise. In 2012 IEEE International Symposium on Information Theory, pages 1246–1250. IEEE, 2012.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Trans. Inform. Theory, 57(10):6976–6994, 2011.

Haipeng Shen and Jianhua Z Huang. Sparse principal component analysis via regularized low rank matrix approximation. J. Multiv. Anal., 99(6):1015–1034, 2008.

Amit Singer. Angular synchronization by eigenvectors and semidefinite programming. Applied and Computational Harmonic Analysis, 30(1):20–36, 2011.

Stanisław Szarek. Metric entropy of homogeneous spaces. Banach Center Publications, 43(1):395–410, 1998.

Michel Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer Science & Business Media, 2014.

Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Nicolas Verzelen. Minimax risks for sparse regressions: Ultra-high dimensional phenomenons. Electronic Journal of Statistics, 6:38–90, 2012.

Vincent Vu and Jing Lei. Minimax rates of estimation for sparse PCA in high dimensions. In Artificial Intelligence and Statistics, pages 1278–1286, 2012.

Vincent Q Vu and Jing Lei. Minimax sparse principal subspace estimation in high dimensions. Ann. Statist., 41(6):2905–2947, 2013.

Vincent Q Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In NIPS, pages 2670–2678, 2013.

Weichen Wang and Jianqing Fan. Asymptotics of empirical eigenstructure for high dimensional spiked covariance. Ann. Statist., 45(3):1342, 2017.

Xiang Wang and Ian Davidson. Flexible constrained spectral clustering. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 563–572. ACM, 2010.

Daniela M Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inform. Theory, 62(6):3702–3720, 2016.

Dan Yang, Zongming Ma, and Andreas Buja. A sparse SVD method for high-dimensional data. arXiv preprint arXiv:1112.2433, 2011.

Yuhong Yang. Minimax nonparametric classification I: Rates of convergence. IEEE Trans. Inform. Theory, 45(7):2271–2284, 1999.

Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of convergence. Ann. Statist., pages 1564–1599, 1999.

Yannis G Yatracos. A lower bound on the error in nonparametric regression type problems. Ann. Statist., pages 1180–1187, 1988.

Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res., 14(Apr):899–925, 2013.

Anru Zhang, T Tony Cai, and Yihong Wu. Heteroskedastic PCA: Algorithm, optimality, and applications. arXiv preprint arXiv:1810.08316, 2018.

Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. J. Comput. Graph. Stat., 15(2):265–286, 2006.