Subspace Perspective on Canonical Correlation Analysis: Dimension Reduction and Minimax Rates
aa r X i v : . [ m a t h . S T ] J a n Subspace Perspective on Canonical Correlation Analysis:Dimension Reduction and Minimax Rates
Zhuang Ma and Xiaodong Li
Abstract
Canonical correlation analysis (CCA) is a fundamental statistical tool for exploring thecorrelation structure between two sets of random variables. In this paper, motivated by therecent success of applying CCA to learn low dimensional representations of high dimensionalobjects, we propose two losses based on the principal angles between the model spaces spannedby the sample canonical variates and their population correspondents, respectively. We furthercharacterize the non-asymptotic error bounds for the estimation risks under the proposed errormetrics, which reveal how the performance of sample CCA depends adaptively on key quantitiesincluding the dimensions, the sample size, the condition number of the covariance matrices andparticularly the population canonical correlation coefficients. The optimality of our uniformupper bounds is also justified by lower-bound analysis based on stringent and localized parameterspaces. To the best of our knowledge, for the first time our paper separates p and p for thefirst order term in the upper bounds without assuming the residual correlations are zeros. Moresignificantly, our paper derives p ´ λ k qp ´ λ k ` q{p λ k ´ λ k ` q for the first time in the non-asymptotic CCA estimation convergence rates, which is essential to understand the behavior ofCCA when the leading canonical correlation coefficients are close to 1. Canonical correlation analysis (CCA), first introduced by Hotelling (1936), is a fundamentalstatistical tool to characterize the relationship between two groups of random variables and findsa wide range of applications across many different fields. For example, in genome-wide associationstudy (GWAS), CCA is used to discover the genetic associations between the genotype dataof single nucleotide polymorphisms (SNPs) and the phenotype data of gene expression levels(Witten et al., 2009; Chen et al., 2012). In information retrieval, CCA is used to embed boththe search space (e.g. images) and the query space (e.g. text) into a shared low dimensionallatent space such that the similarity between the queries and the candidates can be quantified(Rasiwasia et al., 2010; Gong et al., 2014). In natural language processing, CCA is applied tothe word co-occurrence matrix and learns vector representations of the words which capture thesemantics (Dhillon et al., 2011; Faruqui and Dyer, 2014). Other applications, to name a few,include fMRI data analysis (Friman et al., 2003), computer vision (Kim et al., 2007) and speechrecognition (Arora and Livescu, 2013; Wang et al., 2015).The enormous empirical success motivates us to revisit the estimation problem of canonicalcorrelation analysis. Two theoretical questions are naturally posed: What are proper error metricsto quantify the discrepancy between population CCA and its sample estimates? And under suchmetrics, what are the quantities that characterize the fundamental statistical limits?1he justification of loss functions, in the context of CCA, has seldom appeared in the literature.From first principles that the proper metric to quantify the estimation loss should depend on thespecific purpose of using CCA, we find that the applications discussed above mainly fall into twocategories: identifying variables of interest and dimension reduction.The first category, mostly in genomic research (Witten et al., 2009; Chen et al., 2012), treatsone group of variables as responses and the other group of variables as covariates. 
The goal is todiscover the specific subset of the covariates that are most correlated with the responses. Suchapplications are featured by low signal-to-noise ratio and the interpretability of the results is themajor concern.In contrast, the second category is investigated extensively in statistical machine learning andengineering community where CCA is used to learn low dimensional latent representations ofcomplex objects such as images (Rasiwasia et al., 2010), text (Dhillon et al., 2011) and speeches(Arora and Livescu, 2013). These scenarios are usually accompanied with relatively high signal-to-noise ratio and the prediction accuracy, using the learned low dimensional embeddings as thenew set of predictors, is of primary interest. In recent years, there has been a series of publicationsestablishing fundamental theoretical guarantees for CCA to achieve sufficient dimension reduction(Kakade and Foster (2007); Foster et al. (2008); Sridharan and Kakade (2008); Fukumizu et al.(2009); Chaudhuri et al. (2009) and many others).In this paper, we aim to address the problems raised above by treating CCA as a tool fordimension reduction.
Suppose x “ r X , . . . , X p s J P R p and y “ r Y , . . . , Y p s J P R p are two sets of variates with thejoint covariance matrix Cov ˆ„ xy ˙ “ Σ : “ „ Σ x Σ xy Σ J xy Σ y . (1.1)For simplicity, we assume E p X i q “ , i “ , . . . , p , E p Y j q “ , j “ , . . . , p . On the population level, CCA is designed to extract the most correlated linear combinationsbetween two sets of random variables sequentially: The i th pair of canonical variables U i “ φ J i x and V i “ ψ J i y maximizes λ i “ Corr p U i , V i q such that U i and V i have unit variances and they are uncorrelated to all previous pairs of canonicalvariables. Here p φ i , ψ i q is called the i th pair of canonical loadings and λ i is the i th canonicalcorrelation .It is well known in multivariate statistical analysis that the canonical loadings can be foundrecursively by the following criterion: p φ i , ψ i q “ arg max φ J Σ xy ψ subject to φ J Σ x φ “ , ψ J Σ y ψ “ φ J Σ x φ j “ , ψ J Σ y ψ j “ , @ ď j ď i ´ . (1.2)2lthough this criterion is a nonconvex optimization, it can be obtained easily by spectral methods:Define Φ : “ r φ , ¨ ¨ ¨ , φ p ^ p s , Ψ : “ r ψ , ¨ ¨ ¨ , ψ p ^ p s and Λ : “ diag p λ , ¨ ¨ ¨ , λ p ^ p q . Then λ , . . . , λ p ^ p are singular values of Σ ´ { x Σ xy Σ ´ { y , and Σ { x Φ , Σ { y Ψ are actually left and rightsingular vectors of Σ ´ { x Σ xy Σ ´ { y , respectively. For any given estimates of the leading k canonical loadings, denoted by tp p φ i , p ψ i qu ki “ , thecorresponding estimates for the canonical variables can be represented by p U i “ p φ J i x , p V i “ p ψ J i y , i “ . . . , p ^ p . To quantify the estimation loss, generally speaking, we can either focus on measuring the differencebetween the canonical loadings tp φ i , ψ i qu ki “ and tp p φ i , p ψ i qu ki “ or measuring the difference betweenthe canonical variables tp U i , V i qu ki “ and tp p U i , p V i qu ki “ . Here x , y in the definition of tp U i , V i qu ki “ and tp p U i , p V i qu ki “ are independent of the samples based on which tp p φ i , p ψ i qu ki “ are constructed.Therefore, for the discrepancy between the canonical variables, there is an extra layer of randomness.As discussed above, in modern machine learning applications such as natural languageprocessing and information retrieval, the leading sample canonical loadings are used for dimensionreduction, i.e., for a new observation p x , y q , ideally we hope to use the corresponding values ofthe canonical variables p u i “ φ J i x q ki “ and p v i “ ψ J i y q ki “ to represent the observation in a lowdimension space. Empirically, the actual low dimensional representations are p ˆ u i “ p φ J i x q ki “ and p ˆ v i “ p ψ J i y q ki “ . Therefore, the discrepancy between the ideal dimension reduction andactual dimension reduction should be explained by how well tp p U i , p V i qu ki “ approximate tp U i , V i qu ki “ .Consequently, we choose to quantify the difference between the sample and population canonicalvariables instead of the canonical loadings. However, there are still many options to quantify how well the sample canonical variablesapproximate their population correspondents. To choose suitable losses, it is convenient to comeback to specific applications to get some inspiration.Motivated by applications in natural language processing and information retrieval, the modelof multi-view sufficient dimension reduction has been studied in Foster et al. (2008). Roughlyspeaking, a statistical model was proposed by Foster et al. 
(2008) to study how to predict Z usingtwo sets of predictors denoted by x “ r X , . . . , X p s J and y “ r Y , . . . , Y p s J , where the jointcovariance of p Z, x , y q is Cov ¨˝»– xy Z fifl˛‚ “ »– Σ x Σ xy σ xz Σ J xy Σ y σ yz σ J xz σ J yz σ z fifl . It was proven in Foster et al. (2008) that under certain assumptions, the leading k canonicalvariables U , . . . U k are sufficient dimension reduction for the linear prediction of Z ; That is, thebest linear predictor of Z based on X , . . . , X p is the same as the best linear predictor based on3 , . . . U k . (Similarly, the best linear predictor of Z based on Y , . . . , Y p is the same as the bestlinear predictor based on V , . . . V k .)Notice that the best linear predictor is actually determined by the set of all linear combinationsof U , . . . , U k (referred to as the “model space” in the literature of linear regression for prediction),which we denote as span p U , . . . , U k q . Inspired by Foster et al. (2008), we propose to quantify thediscrepancy between t U i u ki “ and t p U i u ki “ by the discrepancy between the corresponding subspacesspan p p U , . . . , p U k q and span p U , . . . , U k q (and similarly measure the difference between t V i u ki “ and t p V i u ki “ by the distance between span p p V , . . . , p V k q and span p V , . . . , V k q ). In this section, we define the discrepancy between x M p U,k q “ span p p U , . . . , p U k q and M p U,k q “ span p U , . . . , U k q by introducing a Hilbert space. Noting that for any given sample tp x i , y i qu ni “ ,both x M p U,k q and M p U,k q are composed by linear combinations of X , . . . , X p . Denote the set of allpossible linear combinations as H “ span p X , . . . , X p q . (1.3)Moreover, for any X , X P H , we define a bilinear function x X , X y : “ Cov p X , X q “ E p X X q .It is easy to show that x¨ , ¨y is an inner product and p H , x¨ , ¨yq is a p -dimensional Hilbert space,which is isomorphic to R p .With the natural covariance-based inner product, we know both x M p U,k q and M p U,k q aresubspaces of H , so it is natural to define their discrepancy based on their principal angles π ě θ ě . . . ě θ k ě
0. In the literature of statistics and linear algebra, two loss functionsare usually used L max p span p p U , . . . , p U k q , span p U , . . . , U k qq “ sin p θ q and L ave p span p p U , . . . , p U k q , span p U , . . . , U k qq “ k p sin p θ q ` . . . ` sin p θ k qq In spite of a somewhat abstract definition, we have the following clean formula for these two losses:
Theorem 1.1.
Suppose for any p ˆ k matrix A , P A represents the orthogonal projector onto thecolumn span of A . Assume the observed sample is fixed. Then L ave p span p p U , . . . , p U k q , span p U , . . . , U k qq “ k ››› P Σ { x p Φ k ´ P Σ { x Φ k ››› F “ k ›››´ I p ´ P Σ { x Φ k ¯ P Σ { x p Φ k ››› F (1.4) “ k min Q P R k ˆ k E “ } u J ´ p u J Q } ‰ : “ L ave p Φ k , p Φ k q nd L max p span p p U , . . . , p U k q , span p U , . . . , U k qq “ ››› P Σ { x p Φ k ´ P Σ { x Φ k ››› “ ›››´ I p ´ P Σ { x Φ k ¯ P Σ { x p Φ k ››› (1.5) “ max g P R k min Q P R k ˆ k E ”`` u J ´ p u J Q ˘ g ˘ ı : “ L max p Φ k , p Φ k q . Here Φ k “ r φ , . . . , φ k s is a p ˆ k matrix consisting of the leading k population canonicalloadings for x , and p Φ k is its estimate based on a given sample. Moreover u J : “ p U , . . . , U k q and ˆ u J : “ p p U , . . . , p U k q . The most important contribution of this paper is to establish sharp upper bounds for theestimation/prediction of CCA based on the proposed subspace losses L max p Φ k , p Φ k q and L ave p Φ k , p Φ k q . It is noteworthy that both upper bounds hold uniformly for all invertible Σ x , Σ y provided n ą C p p ` p q for some numerical constant C . Furthermore, in order to justify thesharpness of these bounds, we also establish minimax lower bounds under a family of stringent andlocalized parameter spaces. These results will be detailed in Section 2. Throughout the paper, we use lower-case and upper-case non-bolded letters to represent fixed andrandom variables, respectively. We also use lower-case and upper-case bold letters to representvectors (which could be either deterministic or random) and matrices, respectively. For any matrix U P R n ˆ p and vector u P R p , } U } , } U } F denotes operator (spectral) norm and Frobenius normrespectively, } u } denotes the vector l norm, U k denotes the submatrix consisting of the first k columns of U , and P U stands for the projection matrix onto the column space of U . Moreover, weuse σ max p U q and σ min p U q to represent the largest and smallest singular value of U respectively, and κ p U q “ σ max p U q{ σ min p U q to denote the condition number of the matrix. We use I p for the identitymatrix of dimension p and I p,k for the submatrix composed of the first k columns of I p . Further, O p m, n q (and simply O p n q when m “ n ) stands for the set of m ˆ n matrices with orthonormalcolumns and S p ` denotes the set of p ˆ p strictly positive definite matrices. For a random vector x P R p , span p x J q “ t x J w , w P R p u denotes the subspace of all the linear combinations of x . Othernotations will be specified within the corresponding context.In the following, we will introduce our main upper and lower bound results in Section 2. Tohighlight our contributions in the new loss functions and theoretical results, we will compare ourresults to existing work in the literature in Section 3. All proofs are deferred to Section 4. In this section, we introduce our main results on non-asymptotic upper and lower bounds forestimating CCA under the proposed loss functions. It is worth recalling that λ , . . . , λ p ^ p aresingular values of Σ ´ { x Σ xy Σ ´ { y . 5t is natural to estimate population CCA by its sample counterparts. Similar to equation (1.2),the sample canonical loadings are defined recursively by p p φ i , p ψ i q “ arg max φ J p Σ xy ψ subject to φ J p Σ x φ “ , ψ J p Σ y ψ “ φ J p Σ x φ j “ , ψ J p Σ y ψ j “ , @ ď j ď i ´ . (2.1)where p Σ x , p Σ y , p Σ xy are the sample covariance matrices. 
The sample canonical variables aredefined as the following linear combinations by the sample canonical loadings: p U i “ p φ J i x , p V i “ p ψ J i y , i “ . . . , p ^ p . We prove the following upper bound for the estimate based on sample CCA.
Theorem 2.1. (Upper bound) Suppose „ xy „ N p , Σ q where Σ is defined as in (1.1) . Assume Σ x and Σ y are invertible. Moreover, assume λ k ą λ k ` for some predetermined k . Then thereexist universal positive constants γ, C, C such that if n ě C p p ` p q , the top- k sample canonicalcoefficients matrix p Φ k satisfies E ” L max p Φ k , p Φ k q ı ď C « p ´ λ k qp ´ λ k ` qp λ k ´ λ k ` q p n ` p p ` p q n p λ k ´ λ k ` q ` e ´ γ p p ^ p q ff E ” L ave p Φ k , p Φ k q ı ď C « p ´ λ k qp ´ λ k ` qp λ k ´ λ k ` q p ´ kn ` p p ` p q n p λ k ´ λ k ` q ` e ´ γ p p ^ p q ff The upper bounds for p Ψ k can be obtained by switching p and p . Since we pursue a nonasymptotic theoretical framework for CCA estimates, and the lossfunctions we propose are nonstandard in the literature, the standard minimax lower bound resultsin parametric maximum likelihood estimates do not apply straightforwardly. Instead, we turn tothe nonparametric minimax lower bound frameworks, particularly those in PCA and CCA; See,e.g., Vu et al. (2013); Cai et al. (2013); Gao et al. (2015). Compared to these existing works, thetechnical novelties of our results and proofs are summarized in Sections 3.3 and 6.We define the parameter space F p p , p , k, λ k , λ k ` , κ , κ q as the collection of joint covariancematrices Σ satisfying1. κ p Σ x q “ κ and κ p Σ y q “ κ ;2. 0 ď λ p ^ p ď ¨ ¨ ¨ ď λ k ` ă λ k ď ¨ ¨ ¨ ď λ ď κ p Σ x q “ κ , κ p Σ y q “ κ to demonstrate that the lower bound is independentof the condition number. For the rest of the paper, we will use the shorthand F to represent thisparameter space for simplicity. 6 heorem 2.2. (Lower bound) There exists a universal constant c independent of n, p , p and Σ such that inf p Φ k sup Σ P F E ” L max p Φ k , p Φ k q ı ě c p ´ λ k qp ´ λ k ` qp λ k ´ λ k ` q p ´ kn ¸ ^ ^ p ´ kk + inf p Φ k sup Σ P F E ” L ave p Φ k , p Φ k q ı ě c p ´ λ k qp ´ λ k ` qp λ k ´ λ k ` q p ´ kn ¸ ^ ^ p ´ kk + . The lower bounds for p Ψ k can be obtained by replacing p with p . Corollary 2.3.
When p , p ě p k q _ C p log n q and n ě C p p ` p qp ` p { p qp λ k ´ λ k ` q p ´ λ k qp ´ λ k ` q (2.2) for some universal positive constant c , the minimax rates can be characterized by inf p Φ k sup Σ P F E ” L max p Φ k , p Φ k q ı — p ´ λ k qp ´ λ k ` qp λ k ´ λ k ` q p n , inf p Φ k sup Σ P F E ” L ave p Φ k , p Φ k q ı — p ´ λ k qp ´ λ k ` qp λ k ´ λ k ` q p n . Recently, the non-asymptotic rate of convergence of CCA has been studied by Gao et al. (2015,2017) under a sparse setup and by Cai and Zhang (2017) under the usual non-sparse setup.Cai and Zhang (2017) appeared on arXiv almost at the same time as the first version of our paperwas posted. In this section, we state our contributions by detailed comparison with these works.
We proposed new loss functions based on the principal angles between the subspace spanned bythe population canonical variates and the subspace spanned by the estimated canonical variates.In contrast, Gao et al. (2017) proposed and studied the loss L ave ; Cai and Zhang (2017) proposed L max and studied both L ave and L max , where L ave p Φ k , p Φ k q “ min Q P O p k,k q E ” } x J Φ k ´ x J p Φ k Q } ˇˇˇ p Φ k ı , L max p Φ k , p Φ k q “ max g P R k , | g |“ min Q P O p k,k q E „´´ x J Φ k ´ x J p Φ k Q ¯ g ¯ ˇˇˇ p Φ k . L ave and L max resemble our loss functions L ave and L max respectively. By Theorem 1.1, we alsohave L ave p Φ k , p Φ k q “ Q P R k ˆ k E ” } x J Φ k ´ x J p Φ k Q } ˇˇˇ p Φ k ı L max p Φ k , p Φ k q “ max g P R k , | g |“ min Q P R k ˆ k E „´´ x J Φ k ´ x J p Φ k Q ¯ g ¯ ˇˇˇ p Φ k
7y these two expressions, we can easily obtain L ave p Φ k , p Φ k q ď L ave p Φ k , p Φ k q L max p Φ k , p Φ k q ď L max p Φ k , p Φ k q (3.1)However, L ave p Φ k , p Φ k q and L ave p Φ k , p Φ k q are not equivalent up to a constant. Neither are L max p Φ k , p Φ k q and L max p Φ k , p Φ k q . In fact, we can prove that as long as n ą max p p , p q , if λ k “ ą λ k ` , then L ave p Φ k , p Φ k q “ L max p Φ k , p Φ k q “ , while almost surely L ave p Φ k , p Φ k q ‰ L max p Φ k , p Φ k q ‰ p “ p “ n “ Σ x “ „ and Σ y “ „ and Σ xy “ „ . . In this setup, weknow the population canonical correlation coefficients are λ “ λ “ .
5, and the leadingcanonical loadings are φ “ „ and ψ “ „ . In our simulation, we generated the following datamatrices X “ »– . . . ´ . . ´ . fifl and Y “ »– . . . ´ . . . fifl . Furthermore, we can obtain the sample canonical correlations p λ “ p λ “ . p φ “ „ ´ . and p ψ “ „ ´ . . Then L ave p φ , p φ q “ L max p φ , p φ q “ L ave p φ , p φ q ‰ , L max p φ , p φ q ‰ X and X and all linear combinations of Y and Y , aX and bY are mostlycorrelated. Our loss functions L ave and L max do characterize this exact identification, whereas L ave and L max do not.Moreover, the following joint loss was studied in Gao et al. (2015): L joint ´ p Φ k , Ψ k q , ´ p Φ k , p Ψ k ¯¯ “ E „››› p Φ k p Ψ J k ´ Φ k Ψ J k ››› . Similarly, L joint ´ p Φ k , Ψ k q , ´ p Φ k , p Ψ k ¯¯ ‰ λ k “ ą λ k ` . Regardless of loss functions, we explain in the following why Theorem 2.1 implies sharper upperbounds than the existing rates in Gao et al. (2015), Gao et al. (2017) and Cai and Zhang (2017)8nder the nonsparse case. Our discussion is focused on L ave in the following discussion while thediscussion for L max is similar.Notice that if we only apply Wedin’s sin-theta law, i.e., replacing the fine bound Lemma 5.4with the rough bound Lemma 5.2 (also see Gao et al. (2015) for similar ideas), we can obtain thefollowing rough bound: E ” L ave p Φ k , p Φ k q ı ď C „ p ` p n p λ k ´ λ k ` q . (3.2)In order to decouple the estimation error bound of p Φ k from p , both Gao et al. (2017) andCai and Zhang (2017) assume the residual canonical correlations are zero, i.e., λ k ` “ . . . “ λ p ^ p “ . This assumption is essential for proofs in both Gao et al. (2017) and Cai and Zhang (2017) undercertain sample size conditions. We got rid of this assumption by developing new proof techniquesand these techniques actually work for L ave , L max as well. A detailed comparison between ourresult and that in Cai and Zhang (2017) is summarized in Table 3.2 (The results of Gao et al.(2017) in the non-sparse regime can be implied by Cai and Zhang (2017) under milder sample sizeconditions). Cai and Zhang 2016 Our workLoss function L ave pě L ave q L ave Sample size n ą C ˆ p `? p p λ k ` p λ { k ˙ n ą C p p ` p q λ k ` “ ¨ ¨ ¨ “ λ p “ p nλ k ` p p n λ k p ´ λ k qp ´ λ k ` qp λ k ´ λ k ` q p ´ kn ` p p ` p q n p λ k ´ λ k ` q ` e ´ γ p p ^ p q Perhaps the most striking contribution of our upper bound is that we first derive the factors p ´ λ k q and p ´ λ k ` q in the literature of nonasymptotic CCA estimate. We now explain whythese factors are essential when leading canonical correlation coefficients are close to 1. Example 1: λ k “ and λ k ` “ k “ p “ p : “ p " log n , λ “ λ “
0. Then our bound rates p ´ λ k qp ´ λ k ` qp λ k ´ λ k ` q p ´ kn ` p p ` p q n p λ k ´ λ k ` q ` e ´ γ p p ^ p q actually imply that E L ave p φ , p φ q ď C p n , while the rates in Gao et al. (2017) and Cai and Zhang (2017) imply that E L ave p φ , p φ q ď E L ave p φ , p φ q ď C pn .
This shows that even under the condition λ k ` “
0, under our loss L ave p φ , p φ q , our result couldimply sharper convergence rates than that in Gao et al. (2017) and Cai and Zhang (2017) if λ k “ λ k “
1, we can actually prove E L ave p φ , p φ q “ Example 2: Both λ k and λ k ` are close to k “ p “ p : “ p " log n , λ “ ´ b pn and λ “ ´ b pn . Thenour bound rates p ´ λ k qp ´ λ k ` qp λ k ´ λ k ` q p ´ kn ` p p ` p q n p λ k ´ λ k ` q ` e ´ γ p p ^ p q actually imply that E L ave p φ , p φ q ď C pn , while the rough rates (3.2) by Wedin’s sin-theta law implies E L ave p φ , p φ q ď C c pn . This shows that our upper bound rates could be much sharper than the rough rates (3.2) whenboth λ k and λ k ` are close to 1. New proof techniques and connection to asymptotic theory
To the best of our knowledge, none of the analysis in Gao et al. (2015), Gao et al. (2017),Cai and Zhang (2017) can be used to obtain the multiplicative factor p ´ λ k qp ´ λ k ` q{p λ k ´ λ k ` q in the first order term of the upper bound, even under the strong condition that λ k ` “ ¨ ¨ ¨ “ λ p ^ p “ λ k ě and λ k ă in the proof ofLemma 5.4) to decompose the target matrices into simple-structure matrices where we can applythe tools developed in Lemma 5.6.The asymptotic distribution of the canonical loadings tp p φ i , p ψ i qu p ^ p i “ has been studied byAnderson (1999) under the assumption that all the canonical correlations are distinct and λ ‰ λ k ą λ k ` for the given k . Both Anderson (1999)and our work are based on analyzing the estimating equations ((5.5)) of CCA. Our analysis is moreinvolved because completely novel techniques are required to obtain the factor p ´ λ k qp ´ λ k ` q in the nonasymptotic framework. λ k and λ k ` The minimax lower bounds for the estimation rates of CCA were first established by Gao et al.(2015, 2017) under the losses L joint and L ave . However, the parameter space discussed in Gao et al.(2017) requires λ k ` “
0. Moreover, the parameter space in Gao et al. (2015) is parameterized by λ satisfying λ k ě λ , but λ k ` is not specified. In fact, they also constructed the hypothesis classwith λ k ` “ λ .10owever, this minimax lower bound is not sharp when λ k and λ k ` are close. Suppose p “ p : “ p , k “ λ “ and λ “ ´ b pn . Our minimax lower bound in Theorem 2.2 leads toinf p Φ k sup Σ P F E ” L ave p Φ k , p Φ k q ı ě O p q . In contrast, to capture the fundamental limit of CCA estimates in this scenario under the frameworkof Gao et al. (2015), one needs to choose λ to capture both λ k and λ k ` , i.e., λ k ď λ ď λ k ` andhence λ « {
2. Then the resulting minimax lower bound rate will be pnλ “ O p pn q , which is muchlooser than O p q .Technically speaking, we follow the analytical framework of Gao et al. (2015) and Gao et al.(2017), but the hypothesis classes construction requires any given λ k ` ą λ k ` “ Suppose the observed sample of p x , y q is fixed and consider the correlation betweenthe two subspaces of H (defined in (1.3)): span p U , . . . , U k q and span p p U , . . . , p U k q . Let p W , x W q , p W , x W q , . . . , p W k , x W k q be the first, second, ..., and k th pair of canonicalvariates between U , . . . , U k and p U , . . . , p U k . Then span p W , . . . , W k q “ span p U , . . . , U k q ,span p x W , . . . , x W k q “ span p p U , . . . , p U k q and x W i , W j y “ x W i , x W j y “ x x W i , x W j y “
0, for any i ‰ j andVar p W i q “ Var p x W i q “
1, for i “ , . . . , k .By the definition of principal angles, we know = p W i , x W i q is actually the i th principal anglebetween span p U , . . . , U k q and span p p U , . . . , p U k q , i.e., θ i : “ = p W i , x W i q . This implies that L ave p Φ k , p Φ k q : “ k ÿ i “ sin θ i “ k ÿ i “ ˆ ´ ˇˇˇA W i , x W i Eˇˇˇ ˙ . Since U , . . . , U k , p U , . . . , p U k are linear combinations of X , . . . , X p , we can denote w J : “ p W , . . . , W k q “ x J Σ ´ { x B , and ˆ w J : “ p x W , . . . , x W k q “ x J Σ ´ { x p B , where B : “ r b , . . . , b k s , p B : “ r p b , . . . , p b k s P R p ˆ k .By the definition of w , we have I k “ Cov p w q “ B J Σ ´ { x Cov p x q Σ ´ { x B “ B J B and similarly I k “ p B J p B . Then B , p B are p ˆ k basis matrices. Moreover, we have b J i p b j “x W i , x W j y “
0, for all i ‰ j . Moreover, we haveDiag p cos p θ q , . . . , cos p θ k qq “ Cov p w , ˆ w q “ B J Σ ´ { x Cov p x q Σ ´ { x p B “ B J p B . Notice that span p U , . . . , U k q “ span p W , . . . , W k q , p U , . . . , U k q “ x J Φ k , and p W , . . . , W k q “ x J Σ ´ { x B . Then Φ k “ Σ ´ { x BC ñ Σ { x Φ k “ BC k ˆ k matrix C . This implies that B and Σ { x Φ k have the same columnspace. Since B P R p ˆ k is a basis matrix, we have BB J “ P Σ { x Φ k . Similarly, we have p B p B J “ P Σ { x p Φ k . Straightforward calculation gives ››› BB J ´ p B p B J ››› F “ trace ´ BB J BB J ´ BB J p B p B J ´ p B p B J BB J ` p B p B J p B p B J ¯ “ k ´ p B J p B p B J B q“ k ´ p Diag p cos p θ q , . . . , cos p θ k qqq“ p sin p θ q ` . . . ` sin p θ k qq “ k L ave p Φ k , p Φ k q and ›››` I p ´ BB J ˘ p B p B J ››› F “ trace ´` I p ´ BB J ˘ p B p B J p B p B J ` I p ´ BB J ˘¯ “ k ´ trace p B J p B p B J B q“ k L ave p Φ k , p Φ k q . The above equalities yield the first two equalities in (1.4).Notice that both U , . . . , U k and W , . . . W k are both orthonormal bases of span p U , . . . , U k q .(Similarly, p U , . . . , p U k and x W , . . . x W k are both orthonormal bases of span p p U , . . . , p U k qq .) Then wehave u J “ w J R where R is a k ˆ k orthogonal matrix. Thenmin Q P R k ˆ k E } u J ´ ˆ u J Q } “ min Q P R k ˆ k E } u J ´ ˆ w J Q } “ min Q P R k ˆ k E } w J R ´ ˆ w J Q } “ min Q P R k ˆ k E } w J ´ ˆ w J QR J } “ min Q P R k ˆ k E } w J ´ ˆ w J Q } “ min q i P R k , i “ ,...,k E k ÿ i “ p W i ´ ˆ w J q i q “ min q i P R k , i “ ,...,k k ÿ i “ E p W i ´ ˆ w J q i q “ k ÿ i “ min q i P R k E p W i ´ ˆ w J q i q Notice that min q i P R k E p W i ´ ˆ w J q i q is obtained by the best linear predictor, somin q i P R k E p W i ´ ˆ w J q i q “ Var p W i q ´ Cov p ˆ w , W i q J Cov ´ p ˆ w q Cov p ˆ w , W i q“ ´ cos θ i “ sin θ i . Q P R k ˆ k E } u J ´ ˆ u J Q } “ k ÿ i “ sin θ i “ k L ave p Φ k , p Φ k q , which implies the third equality in (1.4). Similarly,max g P R k , } g }“ min Q P R k ˆ k E `` u J ´ ˆ u J Q ˘ g ˘ “ max g P R k , } g }“ min Q P R k ˆ k E `` u J ´ ˆ w J Q ˘ g ˘ “ max g P R k , } g }“ min Q P R k ˆ k E `` w J R ´ ˆ w J Q ˘ R J g ˘ “ max g P R k , } g }“ min Q P R k ˆ k E `` w J ´ ˆ w J Q ˘ g ˘ “ max g P R k , } g }“ min q i P R k , i “ ,...,k E k ÿ i “ g i p W i ´ ˆ w J q i q “ max g P R k , } g }“ k ÿ i “ g i sin θ i “ sin θ Finally, we prove (1.5). By Wedin (1983), we have ››› BB J ´ p B p B J ››› “ ›››` I p ´ BB J ˘ p B p B J ››› “ ›››` I p ´ BB J ˘ p B ››› “ λ max ´ p B J ` I p ´ BB J ˘ J ` I p ´ BB J ˘ p B ¯ “ λ max ` I k ´ Diag p cos p θ q , . . . , cos p θ k qq ˘ “ ´ cos p θ q “ sin p θ q “ L max p Φ k , p Φ k q , which implies the the equalities in (1.5). Throughout this proof, we denote ∆ : “ λ k ´ λ k ` . Without loss of generality, we assume p ě p : “ p . By the definition of canonical variables, we knowthat U , . . . , U p and V , . . . , V p are only determined by span p X , . . . , X p q and span p Y , . . . , Y p q . Inother words, for any invertible C P R p ˆ p and C P R p ˆ p , the canonical pairs of p X , . . . , X p q C and p Y , . . . , Y p q C are still p U , V q , . . . , p U p , V p q . Therefore, we can consider the followingorthonormal bases U , . . . , U p P span p X , . . . , X p q V , . . . , V p , V p ` , . . . , V p P span p Y , . . . , Y p q . Here p V , . . . , V p , V p ` , . . . , V p q is an orthonormal extension of V , . . . , V p . Therefore, we knowthat p U , V q , . . . 
, p U p , V p q are also the the canonical pairs between U , . . . , U p and V , . . . , V p .Similarly, for a fixed sample of the variables of x and y , the sample canonical pairs p p U , p V q , . . . , p p U p , p V p q are also sample canonical pairs of the corresponding sample of p X , . . . , X p q C and p Y , . . . , Y p q C . This can be easily seen from the concept of samplecanonical variables. For example, p U and p V are respectively the linear combinations of X , . . . , X p and Y , . . . , Y p , such that their corresponding sample variance are both 1 and samplecorrelation is maximized. If we replace p X , . . . , X p q and p Y , . . . , Y p q with p X , . . . , X p q C and p Y , . . . , Y p q C respectively and seek for the first sample canonical pair, the constraints(linear combinations of the two sets of variables and unit sample variances) and the objective(sample correlation is maximized) are the same as before, so p p U , p V q is still the answer. Similarly, p p U , p V q , . . . , p p U p , p V p q are the sample canonical pairs of p X , . . . , X p q C and p Y , . . . , Y p q C . Inparticular, they are the sample canonical pairs of U , . . . , U p and V , . . . , V p .The above argument gives the following convenient fact: In order to bound L ave { max p span p p U , . . . , p U k q , span p U , . . . , U k qq we can replace X , . . . , X p , Y , . . . , Y p with U , . . . , U p , V , . . . , V p . In other words, we can assume x and y satisfy the standard form Σ x “ I p , Σ y “ I p , Σ xy “ r Λ , p ˆp p ´ p q s : “ r Λ where Λ “ Diag p λ , λ , . . . , λ p q P R p ˆ p . Moreover Φ p “ I p , Ψ p “ „ I p p p ´ p qˆ p , which implies that Φ k “ „ I k p p ´ k qˆ k , Ψ k “ „ I k p p ´ k qˆ k . Under the standard form, by (1.4) and (1.5), we have L ave p span p p U , . . . , p U k q , span p U , . . . , U k qq “ k ››› p I p ´ P Φ k q P p Φ k ››› F (5.1)and L max p span p p U , . . . , p U k q , span p U , . . . , U k qq “ ››› p I p ´ P Φ k q P p Φ k ››› . (5.2)14enote p Φ k “ « p Φ u k p Φ l k ff where p Φ u k and p Φ l k are the upper k ˆ k and lower p p ´ k q ˆ k sub-matrices of p Φ k respectively. Then ››› p I p ´ P Φ k q P p Φ k ››› F “ trace ´ p I p ´ P Φ k q p Φ k p p Φ J k p Φ k q ´ p Φ J k p I p ´ P Φ k q ¯ , ››› p I p ´ P Φ k q P p Φ k ››› “ λ max ´ p I p ´ P Φ k q p Φ k p p Φ J k p Φ k q ´ p Φ J k p I p ´ P Φ k q ¯ Since p I p ´ P Φ k q p Φ k p p Φ J k p Φ k q ´ p Φ J k p I p ´ P Φ k q ĺ σ k p p Φ k q p I p ´ P Φ k q p Φ k p Φ J k p I p ´ P Φ k q “ σ k p p Φ k q „ k ˆ k p Φ l k ” k ˆ k p p Φ l k q J ı , we have ››› p I p ´ P Φ k q P p Φ k ››› F ď trace ˜ σ k p p Φ k q „ k ˆ k p Φ l k ” k ˆ k p Φ J k ı¸ “ } p Φ l k } F σ k p p Φ k q , (5.3)and ››› p I p ´ P Φ k q P p Φ k ››› ď λ max ˜ σ k p p Φ k q „ k ˆ k p Φ l k ” k ˆ k p Φ J k ı¸ “ } p Φ l k } σ k p p Φ k q . (5.4)Therefore, it suffices to give upper bounds of } p Φ l k } F and } p Φ l k } , as well as a lower bound of σ k p p Φ k q . Recall that Σ x “ I p , Σ y “ I p , Σ xy “ r Λ , p ˆp p ´ p q s : “ r Λ . Then Cov ˆ„ xy ˙ : “ Σ “ « I p r Λ r Λ J I p ff and y Cov ˆ„ xy ˙ : “ p Σ “ « p Σ x p Σ xy p Σ yx p Σ y ff . Moreover, we can define p Σ p as the left upper p p q ˆ p p q principal submatrix of p Σ . We cansimilarly define Σ p . Lemma 5.1.
There exist universal constants γ , C and C such that when n ě C p , then withprobability at least ´ e ´ γp , the following inequalities hold } Σ p ´ p Σ p } , } I p ´ p Σ x } , ››› p Σ { x ´ I p ››› ď C c p n . roof. It is obvious that } Σ p } ď
2. By Lemma 5.9, there exist constants γ , C and C , such that when n ě C p , with probability at least 1 ´ e ´ γp there holds } p Σ p ´ Σ p } ď C c p n . As submatrices, we have } I p ´ p Σ x } ď C b p n . Moreover, } I p ´ p Σ x } “ }p I p ´ p Σ { x qp I p ` p Σ { x q} ě σ min p I p ` p Σ { x q} I p ´ p Σ { x } ě } I p ´ p Σ { x } , which implies } I p ´ p Σ { x } ď C b p ` p n . Lemma 5.2.
There exist universal constants c , C and C such that when n ě C p p ` p q , thenwith probability at least ´ e ´ c p p ` p q , the following inequalities hold } Σ ´ p Σ } , } I p ´ p Σ y } , } Σ xy ´ p Σ xy } , ››› p Σ { y ´ I p ››› ď C c p ` p n , } p Λ ´ Λ } ď } p Σ ´ { x p Σ xy p Σ ´ { y ´ Σ xy } ď C c p ` p n ,σ k p p Φ k q ě , } p Φ k } ď , σ k p p Ψ k q ě , } p Ψ k } ď , } p Φ l k } , } p Ψ l k } ď C ∆ c p ` p n , where ∆ “ λ k ´ λ k ` is the eigen-gap. The proof is deferred to Section 5.7. } p Φ l k } In this section, we aim to give a sharp upper bound for } p Φ l k } . Notice that we have alreadyestablished an upper bound in Lemma 5.2, where Wedin’s sin θ law plays the essential role. However,this bound is actually too loose for our purpose. Therefore, we need to develop new techniques tosharpen the results.Recall that p Φ P R p ˆ p , p Ψ P R p ˆ p consist of the sample canonical coefficients. By definition,the sample canonical coefficients satisfy the following two estimating equations (because p Σ { x p Φ and p Σ { y p Ψ are left and right singular vectors of p Σ ´ { x p Σ xy p Σ ´ { y respectively), p Σ xy p Ψ “ p Σ x p Φ p Λ p Σ yx p Φ “ p Σ y p Ψ p Λ . (5.5)If we define define Λ “ „ Λ Λ P R p ˆ p , p Λ “ « p Λ p Λ ff P R p ˆ p , (5.6)16here Λ , p Λ are k ˆ k diagonal matrices while Λ , p Λ are p p ´ k q ˆ p p ´ k q diagonal matrices.Then (5.5) imply p Σ xy p Ψ k “ p Σ x p Φ k p Λ p Σ yx p Φ k “ p Σ y p Ψ k p Λ . (5.7)Divide the matrices into blocks, p Σ x “ « p Σ x p Σ x p Σ x p Σ x ff , p Σ y “ « p Σ y p Σ y p Σ y p Σ y ff , p Σ xy “ « p Σ xy p Σ xy p Σ xy p Σ xy ff , p Σ yx “ « p Σ yx p Σ yx p Σ yx p Σ yx ff where p Σ x , p Σ y , p Σ xy , p Σ yx are k ˆ k matrices. Finally, we define p Ψ u k P R k ˆ k , p Ψ l k P R p p ´ k qˆ k inthe same way as p Φ u k , p Φ l k . With these blocks, (5.7) can be rewritten as p Σ xy p Ψ u k ` p Σ xy p Ψ l k “ p Σ x p Φ u k p Λ ` p Σ x p Φ l k p Λ , (5.8) p Σ yx p Φ u k ` p Σ yx p Φ l k “ p Σ y p Ψ u k p Λ ` p Σ y p Ψ l k p Λ , (5.9) p Σ xy p Ψ u k ` p Σ xy p Ψ l k “ p Σ x p Φ u k p Λ ` p Σ x p Φ l k p Λ , (5.10) p Σ yx p Φ u k ` p Σ yx p Φ l k “ p Σ y p Ψ u k p Λ ` p Σ y p Ψ l k p Λ . (5.11)Define the zero-padding of Λ : r Λ : “ r Λ , s “ Σ xy P R p p ´ k qˆp p ´ k q . The above equations imply the following lemma:
Lemma 5.3.
The equality (5.7) gives the following result p Φ l k Λ ´ Λ p Φ l k “ B p Φ u k ` R (5.12) “ p p Σ xy ´ p Σ x Λ q p Ψ u k Λ ` r Λ p p Σ yx ´ p Σ y Λ q p Φ u k ` r R (5.13) where B : “ p Σ xy Λ ` r Λ p Σ yx ´ p Σ x Λ ´ r Λ p Σ y Λ , r R : “ p p Σ x R ´ R q Λ ´ r Λ p p Σ y R ` R q , R : “ r R ´ p p Σ xy ´ p Σ x Λ q R . and R : “ p Φ u k p p Λ ´ Λ q ` p p Σ x ´ I k q p Φ u k p Λ ` p Σ x p Φ l k p Λ ´ p p Σ xy ´ Λ q p Ψ u k ´ p Σ xy p Ψ l k , R : “ p Ψ u k p p Λ ´ Λ q ` p p Σ y ´ I k q p Ψ u k p Λ ` p Σ y p Ψ l k p Λ ´ p p Σ yx ´ Λ q p Φ u k ´ p Σ yx p Φ l k , R : “ p Σ x p Φ u k p p Λ ´ Λ q ` p p Σ x p Φ l k p Λ ´ p Φ l k Λ q ´ p p Σ xy ´ r Λ q p Ψ l k , R : “ p Σ y p Ψ u k p p Λ ´ Λ q ` p p Σ y p Ψ l k p Λ ´ p Ψ l k Λ q ´ p p Σ yx ´ r Λ J q p Φ l k . } R } , } R } ď C c p ` p n . Recall that R : “ p Σ x p Φ u k p p Λ ´ Λ q ` p p Σ x p Φ l k p Λ ´ p Φ l k Λ q ´ p p Σ xy ´ r Λ q p Ψ l k By Lemma 5.2, we have } p Σ x p Φ u k p p Λ ´ Λ q} ď C p ` p n , }p p Σ xy ´ r Λ q p Ψ l k } ď C p ` p ∆ n , and } p Σ x p Φ l k p Λ ´ p Φ l k Λ } ď }p p Σ x ´ I p ´ k q p Φ l k p Λ ` p Φ l k p p Λ ´ Λ q}ď }p p Σ x ´ I p ´ k q p Φ l k p Λ } ` } p Φ l k p p Λ ´ Λ q} ď C p ` p ∆ n . Therefore, we get } R } ď C p ` p ∆ n . Similarly, } R } ď C p ` p ∆ n .Combined with Lemma 5.2, we have } r R } “ }p p Σ x R ´ R q Λ ´ r Λ p p Σ y R ` R q} ď C p ` p ∆ n and } R } ď } r R } ` } p Σ xy ´ p Σ x Λ }} R } ď C p ` p ∆ n . The proof of the following lemma is deferred to Section 5.7:
Lemma 5.4. If n ě C p p ` p q , then with probability ´ c exp p´ γp q , } p Φ l k } ď C »–d p p ´ λ k qp ´ λ k ` q n ∆ ` p p ` p q n ∆ fifl . Notice that the inequality (5.4) yields ››› p I p ´ P Φ k q P p Φ k ››› ď } p Φ l k } σ k p p Φ k q . By Lemma 5.4 and Lemma 5.2, we know on an event G with probability at least 1 ´ Ce ´ γp , ››› p I p ´ P Φ k q P p Φ k ››› ď C « p p ´ λ k qp ´ λ k ` q n ∆ ` p p ` p q n ∆ ff . ››› p I p ´ P Φ k q P p Φ k ››› ď
1, by (5.2), we have E L max p Φ k , p Φ k q “ E ››› p I p ´ P Φ k q P p Φ k ››› ď C « p p ´ λ k qp ´ λ k ` q n ∆ ` p p ` p q n ∆ ` e ´ γp ff . Since p I p ´ P Φ k q P p Φ k is of at most rank- k , we have1 k ››› p I p ´ P Φ k q P p Φ k ››› F ď ››› p I p ´ P Φ k q P p Φ k ››› Then by (5.1) and the previous inequality, we have E L ave p Φ k , p Φ k q “ E ››› p I p ´ P Φ k q P p Φ k ››› “ E k ››› p I p ´ P Φ k q P p Φ k ››› F ď E ››› p I p ´ P Φ k q P p Φ k ››› ď C « p p ´ λ k qp ´ λ k ` q n ∆ ` p p ` p q n ∆ ` e ´ γp ff . In fact, the factor p in the main term can be reduced to p ´ k by similar arguments as donefor the operator norm. The Frobenius norm version of Lemma 5.4 is actually much simpler. Weomit the proof to avoid unnecessary redundancy and repetition. Definition 5.5. (Hadamard Operator Norm) For A P R m ˆ n , define the Hadamard operator normas ||| A ||| “ sup } A ˝ B } : } B } ď , B P R m ˆ n ( Let α , ¨ ¨ ¨ , α m and β , ¨ ¨ ¨ , β n be arbitrary positive numbers lower bounded by a positive constant δ . Lemma 5.6.
Let t α i u mi “ and t β i u ni “ be two sequences of positive numbers. for any X P R m ˆ n ,there hold ›››››« a α i β j α i ` β j ff ˝ X ››››› ď } X } , (5.14) and ››››„ min p α i , β j q α i ` β j ˝ X ›››› ď } X } , ››››„ max p α i , β j q α i ` β j ˝ X ›››› ď } X } . (5.15) Proof.
The proof of (5.14) can be found in “Norm Bounds for Hadamard Products and anArithmetic-Geometric Mean Inequality for Unitarily Invariant Norms” by Horn.Denote G “ „ max p α i , β j q α i ` β j , G “ „ min p α i , β j q α i ` β j The proof of (5.15) relies on the following two results.19 emma 5.7. (Theorem 5.5.18 of Hom and Johnson (1991)) If A , B P R n ˆ n and A is positivesemidefinite. Then, } A ˝ B } ď ˆ max ď i ď n A ii ˙ } B } , where } ¨ } is the operator norm. Lemma 5.8. (Theorem 3.2 of Mathias (1993)) The symmetric matrix ´ min p a i , a j q a i ` a j ¯ ď i,j ď n is positive semidefinite if a i ą , ď i ď n . Define γ i “ β i , ď i ď n and γ i “ α i ´ n , n ` ď i ď m ` n . Define M P R p m ` n qˆp m ` n q by M ij “ min t γ i , γ j u γ i ` γ j . By Lemma 5.8, M is also positive semidefinite. Again, apply Lemma 5.7 and notice that G is thelower left sub-matrix of M , It is easy to obtain ||| G ||| ď ||| M ||| ď . Finally, since G ˝ B “ B ´ G ˝ B for any B , we have } G ˝ B } ď } B } ` } G ˝ B } , which implies, ||| G ||| ď ` ||| G ||| ď . Lemma 5.9. (Covariance Matrix Estimation, Remark 5.40 of Vershynin (2010)) Assume A P R n ˆ p has independent sub-gaussian random rows with second moment matrix Σ . Then there existsuniversal constant C such that for every t ě , the following inequality holds with probability atleast ´ e ´ ct , } n A J A ´ Σ } ď max t δ, δ u} Σ } δ “ C c pn ` t ? n . Lemma 5.10. (Bernstein inequality, Proposition 5.16 of Vershynin (2010)) Let X , ¨ ¨ ¨ , X n beindependent centered sub-exponential random variables and K “ max i } X i } ψ . Then for every a “ p a , ¨ ¨ ¨ , a n q P R n and every t ě , we have P | n ÿ i “ a i X i | ě t + ď exp " ´ c min ˆ t K } a } , tK } a } ˙* . emma 5.11. (Hanson-Wright inequality, Theorem 1.1 of Rudelson and Vershynin (2013)) Let x “ p x , ¨ ¨ ¨ , x p q be a random vector with independent components x i which satisfy E x i “ and } x i } ψ ď K , Let A P R p ˆ p . Then there exists universal constant c such that for every t ě , P | x J Ax ´ E x J Ax | ě t ( ď exp " ´ c min ˆ t K } A } F , tK } A } ˙* . Lemma 5.12. (Covering Number of the Sphere, Lemma 5.2 of Vershynin (2010)). The unitEuclidean sphere S n ´ equipped with the Euclidean metric satisfies for every ǫ ą that | N p S n ´ , ǫ q| ď p ` ǫ q n , where N p S n ´ , ǫ q is the ǫ -net of S n ´ with minimal cardinality. The following variant of Wedin’s sin θ law (Wedin, 1972) is proved in Proposition 1 of Cai et al.(2015). Lemma 5.13.
For A , E P R m ˆ n and p A “ A ` E , define the singular value decompositions of A and p A as A “ U DV J , p A “ p U p D p V J . Then the following perturbation bound holds, ››› p I ´ P U k q P p U k ››› “ ››› P U k ´ P p U k ››› ď } E } σ k p A q ´ σ k ` p A q , where σ k p A q , σ k ` p A q are the k th and p k ` q th singular values of A . (1) The proof of } Σ ´ p Σ } , } I p ´ p Σ y } , } Σ xy ´ p Σ xy } , ››› p Σ { y ´ I p ››› ď C c p ` p n is exactly the same as that of Lemma 5.1.(2) Observe that p Σ ´ { x p Σ xy p Σ ´ { y ´ Σ xy “ p I p ´ p Σ { x q p Σ ´ { x p Σ xy p Σ ´ { y ` p Σ { x p Σ ´ { x p Σ xy p Σ ´ { y p I p ´ p Σ { y q ` p p Σ xy ´ Σ xy q . and } p Σ ´ { x p Σ xy p Σ ´ { y } “ p λ ď
1. Then } p Σ ´ { x p Σ xy p Σ ´ { y ´ Σ xy } ď } I p ´ p Σ { x } ` } p Σ x }} I p ´ p Σ { y } ` } p Σ xy ´ Σ xy } . p Λ and Λ are singular values of p Σ ´ { x p Σ xy p Σ ´ { y and Σ xy respectively. Hence by thefamous Weyl’s inequality for singular values, } p Λ ´ Λ } ď } p Σ ´ { x p Σ xy p Σ ´ { y ´ Σ xy }ď } I p ´ p Σ x } ` } p Σ x }} I p ´ p Σ { y } ` } p Σ xy ´ Σ xy }ď ˜ ` C c p ` p n ¸ C c p ` p n ď C c p ` p n . (3) Since p Σ { x p Φ are left singular vectors of p Σ ´ { x p Σ xy p Σ ´ { y , we have } p Σ { x p Φ } “ p Φ J p Σ x p Φ “ I p and p Φ J p Φ ´ I p “ ´ p Φ J p p Σ x ´ I p q p Φ . Then we have, } p Φ J p Φ ´ I p } “ } p Φ J p p Σ x ´ I p q p Φ } ď } p Φ J p Σ { x }} p Σ ´ { x p p Σ x ´ I p q p Σ ´ { x }} p Σ { x p Φ }“ } p Σ ´ { x p p Σ x ´ I p q p Σ ´ { x } . As a submatrix, } p Φ J k p Φ k ´ I k } ď } p Σ ´ { x p p Σ x ´ I p q p Σ ´ { x } ď } p Σ ´ x }} p Σ x ´ I p }ď ´ } p Σ x ´ I p } } p Σ x ´ I p } ď } p Σ ´ Σ } ´ } p Σ ´ Σ } ď n ě C p p ` p q for sufficiently large C . In this case, σ k p p Φ k q ě { , } p Φ k } ď { . By the same argument, σ k p p Ψ k q ě { , } p Ψ k } ď { . (4) Recall that Φ k “ „ I k p p ´ k qˆ k , Ψ k “ „ I k p p ´ k qˆ k . The last inequality in the lemma relies on the fact that p Σ { x p Φ k and Φ k are leading k singularvectors of p Σ ´ { x p Σ xy p Σ ´ { y and Σ xy respectively. By a variant of Wedin’s sin θ law as stated inLemma 5.13, ››› P p Σ { x p Φ k p I p ´ P Φ k q ››› ď } p Σ ´ { x p Σ xy p Σ ´ { y ´ Σ xy } ∆ ď C ∆ c p ` p n . On the other hand, ››› P p Σ { x p Φ k p I p ´ P Φ k q ››› “ ››› p Σ { x p Φ k p p Σ { x p Φ k q J p I p ´ P Φ k q ››› “ ››› p p Σ { x p Φ k q J p I p ´ P Φ k q ››› “ ››› p p Σ { x p Φ k q l ››› , p Σ { x p Φ k has orthonormal columns. Moreover, p p Σ { x p Φ k q l denotes the lower p p ´ k q ˆ k sub-matrix of p Σ { x p Φ k . Again, by triangle inequality, ››› p Φ l k ››› “ ›››› p p Σ { x p Φ k q l ´ ´ p p Σ { x ´ I p q p Φ k ¯ l ›››› ď ››› p p Σ { x p Φ k q l ››› ` ››› p p Σ { x ´ I p q ››› ››› p Φ k ››› ď C ∆ c p ` p n ` c C c p ` p n ď C ∆ c p ` p n . The last inequality is due to ∆ ď
1. Let C “ max p C , C , C q , the proof is done. The equality (5.10) implies Λ p Ψ u k ´ p Φ u k Λ “ p Φ u k p p Λ ´ Λ q ` p p Σ x ´ I k q p Φ u k p Λ ` p Σ x p Φ l k p Λ ´ p p Σ xy ´ Λ q p Ψ u k ´ p Σ xy p Ψ l k : “ R . (5.16)Similarly, (5.11) implies Λ p Φ u k ´ p Ψ u k Λ “ p Ψ u k p p Λ ´ Λ q ` p p Σ y ´ I k q p Ψ u k p Λ ` p Σ y p Ψ l k p Λ ´ p p Σ yx ´ Λ q p Φ u k ´ p Σ yx p Φ l k : “ R . (5.17)The equality (5.8) is equivalent to p Σ xy p Ψ u k ` r Λ p Ψ l k ` p p Σ xy ´ r Λ q p Ψ l k “ p Σ x p Φ u k Λ ` p Σ x p Φ u k p p Λ ´ Λ q` p Φ l k Λ ` p p Σ x p Φ l k p Λ ´ p Φ l k Λ q , which can be written as p Σ xy p Ψ u k ` r Λ p Ψ l k ´ p Σ x p Φ u k Λ ´ p Φ l k Λ “ p Σ x p Φ u k p p Λ ´ Λ q` p p Σ x p Φ l k p Λ ´ p Φ l k Λ q ´ p p Σ xy ´ r Λ q p Ψ l k : “ R . (5.18)Apply the same argument to (5.9), we obtain p Σ yx p Φ u k ` r Λ J p Φ l k ´ p Σ y p Ψ u k Λ ´ p Ψ l k Λ “ p Σ y p Ψ u k p p Λ ´ Λ q` p p Σ y p Ψ l k p Λ ´ p Ψ l k Λ q ´ p p Σ yx ´ r Λ J q p Φ l k : “ R . (5.19)Consider (5.18) ˆ p´ Λ q ´ r Λ ˆ (5.19), then p Φ l k Λ ´ Λ p Φ l k ` p Σ x p Φ u k Λ ´ p Σ xy p Ψ u k Λ ´ r Λ p Σ yx p Φ u k ` r Λ p Σ y p Ψ u k Λ “ ´p R Λ ` r Λ R q , that is p Φ l k Λ ´ Λ p Φ l k “ p Σ xy p Ψ u k Λ ` r Λ p Σ yx p Φ u k ´ p Σ x p Φ u k Λ ´ r Λ p Σ y p Ψ u k Λ ´ p R Λ ` r Λ R q . (5.20)23ombined with (5.16) and (5.17), p Φ l k Λ ´ Λ p Φ l k “ p Σ xy p Ψ u k Λ ` r Λ p Σ yx p Φ u k ´ p Σ x Λ p Ψ u k Λ ` p Σ x R Λ ´ r Λ p Σ y Λ p Φ u k ´ r Λ p Σ y R ´ p R Λ ` r Λ R q“ p p Σ xy ´ p Σ x Λ q p Ψ u k Λ ` r Λ p p Σ yx ´ p Σ y Λ q p Φ u k ` p p Σ x R ´ R q Λ ´ r Λ p p Σ y R ` R q . (5.21)This finishes the proof of (5.13).Plug (5.17) into (5.21), we get p Φ l k Λ ´ r Λ p Φ l k “ p p Σ xy ´ p Σ x Λ qp Λ p Φ u k ´ R q ` r Λ p p Σ yx ´ p Σ y Λ q p Φ u k ` r R “ B p Φ u k ` p r R ´ p p Σ xy ´ p Σ x Λ q R q . This finishes the proof of (5.12).
First, we discuss two quite different cases: λ k ě and λ k ă . Case 1: λ k ě Let δ : “ λ k ´ λ k ` “ p λ k ´ λ k ` qp λ k ` λ k ` q ě
12 ∆ . Define the p p ´ k q ˆ k matrices A by A ij “ b λ j ´ λ k ` δ b λ k ` ´ λ k ` i ` δ λ j ´ λ k ` i , ď i ď p ´ k, ď j ď k By (5.12) in Lemma 5.3, there holds p Φ l k “ A ˝ p D B p Φ u k D q ` A ˝ p D RD q , where D “ diag ¨˝ b δ , ¨ ¨ ¨ , b λ k ` ´ λ p ` δ ˛‚ and D “ diag ¨˝ b λ ´ λ k ` δ , ¨ ¨ ¨ , b δ ˛‚ . By Lemma 5.6, we have } p Φ l k } ď } D B p Φ u k D } ` }p D RD q}ď } D B }} p Φ u k }} D } ` } D }} R }} D } . } p Φ u k } ď } p Φ k } ď b and it is obvious that } D } , } D } ď b δ . Moreover, in theprevious section, we also have shown that } R } ď C p p ` p q n ∆ . It suffices to bound } D B } and to thisend we apply the standard covering argument. Step 1. Reduction.
Denote by N ǫ p S d q the d -dimensional unit ball surface. For ǫ ą u P R p ´ k , v P R k , we can choose u ǫ P N ǫ p S p ´ k ´ q , v ǫ P N ǫ p S k ´ q such that } u ´ u ǫ } , } v ´ v ǫ } ď ǫ . Then u J D Bv “ u J D Bv ´ u J ǫ D Bv ` u J ǫ D Bv ´ u J ǫ D Bv ǫ ` u J ǫ D Bv ǫ ď } u ´ u ǫ }} D Bv } ` } u J ǫ D B }} v ´ v ǫ } ` u J ǫ D Bv ǫ ď ǫ } D B } ` u J ǫ D Bv ǫ ď ǫ } D B } ` max u ǫ , v ǫ u J ǫ D Bv ǫ . Maximize over u and v , we obtain } D B } ď ǫ } D B } ` max u ǫ , v ǫ u J ǫ D Bv ǫ . Therefore, } D B } ď p ´ ǫ q ´ max u ǫ , v ǫ u J ǫ D Bv ǫ . Let ǫ “ {
4. Then it suffices to give an upperbound max u ǫ , v ǫ u J ǫ D Bv ǫ with high probability. Step 2. Concentration.
Let Z α,l “ Y α,l ´ λ l X l ? ´ λ l for all 1 ď α ď n and 1 ď l ď p . Then for1 ď i ď p ´ k and 1 ď j ď k r D B s i,j “ b λ k ` ´ λ k ` i ` δ n n ÿ α “ p λ j X α,k ` i Y α,j ´ λ j X α,k ` i X α,j ` λ k ` i Y α,k ` i X α,j ´ λ k ` i λ j Y α,k ` i Y α,j q“ b λ k ` ´ λ k ` i ` δ n n ÿ α “ ! p ´ λ j q λ k ` i λ j X α,k ` i X α,j ´ λ j p Y α,k ` i ´ λ k ` i X α,k ` i qp Y α,j ´ λ j X α,j q` p ´ λ j q λ j p Y α,k ` i ´ λ k ` i X α,k ` i q X α,j ` p ´ λ j q λ k ` i p Y α,j ´ λ j X α,j q X α,k ` i ) . “ b λ k ` ´ λ k ` i ` δ n n ÿ α “ ! p ´ λ j q λ k ` i λ j X α,k ` i X α,j ´ λ j b ´ λ k ` i b ´ λ j Z α,k ` i Z α,j ` p ´ λ j q λ j b ´ λ k ` i Z α,k ` i X α,j ` p ´ λ j q λ k ` i b ´ λ k ` i X α,k ` i Z α,j ) . In this way, t X α,k ` i , Z α,k ` i , ď i ď p , ď α ď n u are mutually independent standard gaussian25andom variables. For any given pair of vectors u P R p ´ k , v P R k , u J D Bv “ n n ÿ α “ p ´ k ÿ i “ k ÿ j “ u i v j b λ k ` ´ λ k ` i ` δ ! p ´ λ j q λ k ` i λ j X α,k ` i X α,j ´ λ j b ´ λ k ` i b ´ λ j Z α,k ` i Z α,j ` p ´ λ j q λ j b ´ λ k ` i Z α,k ` i X α,j ` p ´ λ j q λ k ` i b ´ λ k ` i X α,k ` i Z α,j ) . “ n n ÿ α “ w J α A α w α , where w J α “ r x J α , z J α s “ r X α, , . . . , X α,p , Z α, , . . . , Z α,p s and A α P R p p qˆp p q is symmetric and determined by the corresponding quadratic form. Thisyields } A α } F “ p ´ k ÿ i “ k ÿ j “ u i v j λ k ` ´ λ k ` i ` δ ! p ´ λ j q λ k ` i λ j ` λ j p ´ λ k ` i qp ´ λ j q` p ´ λ j q λ j p ´ λ k ` i q ` p ´ λ j q λ k ` i p ´ λ k ` i q ) “ p ´ k ÿ i “ k ÿ j “ u i v j λ k ` ´ λ k ` i ` δ ` ´ λ j ˘ ` λ k ` i ` λ j ´ λ k ` i λ j ˘ ď ˜ p ´ k ÿ i “ u i ¸ ˜ k ÿ j “ v j ¸ max ď i ď p ´ k ď j ď k p ´ λ j qp λ k ` i ` λ j ´ λ k ` i λ j q λ k ` ´ λ k ` i ` δ ď
12 max ď i ď p ´ k ď j ď k p ´ λ k qp λ j ´ λ k ` i λ j q λ k ` ´ λ k ` i ` δ ď p ´ λ k q max ď i ď p ´ k ď j ď k λ j p ´ λ k ` i q δ ` λ k ` ´ λ i ` k ď p ´ λ k q max ď i ď p ´ k ď j ď k p ´ λ k ` q δ ď p ´ λ k qp ´ λ k ` q δ . “ K , where the second last inequality is due to the facts that λ j ď p ´ λ k ` i q δ ` λ k ` ´ λ i ` k ď p ´ λ k ` q δ p δ ` λ k ` ă λ k ď q . Moreover } A α } ď } A α } F ď K . 26ow define w J : “ r w J , . . . , w J n s and A “ »———– A A . . . A n fiffiffiffifl . Then we have } A } ď max ď α ď n } A α } ď K, } A } F ď n ÿ α “ } A α } F ď nK and u J D Bv “ n w J Aw , where w P N p n p , I p n q . Therefore, By the classic Hanson-Wright inequality (Lemma 5.11), there holds P n | u J D Bv | ě t ( ď " ´ c min ˆ t nK , tK ˙* for some numerical constant c ą
0. Without loss of generality, we can also assume c ď
1. Let t “ c ? np K . By n ě p , straightforward calculation gives P " n | u J D Bv | ě c ? np K * ď e ´ p . Step 3. Union Bound.
By Lemma 5.12, we choose 1 { P $’&’% max u ǫ P N ǫ p S p ´ k ´ q v ǫ P N ǫ p S k ´ q u J ǫ D Bv ǫ ě ˆ ? c ˙ c p n d p ´ λ k qp ´ λ k ` q δ ,/./- ď p ´ k k ˆ e ´ p ď e ´ p . In other words, with probability at least 1 ´ e ´ p , we have } D B } ď p ´ ǫ q ´ max u ǫ , v ǫ u J ǫ D Bv ǫ ď ˆ ? c ˙ c p n d p ´ λ k qp ´ λ k ` q δ . In summary, we have as long as n ě C p p ` p q , with probability 1 ´ c exp p´ γp q , } p Φ l k } ď C »–d p p ´ λ k qp ´ λ k ` q nδ ` p p ` p q n ∆ δ fifl ď C »–d p p ´ λ k qp ´ λ k ` q n ∆ ` p p ` p q n ∆ fifl . Here the last inequality is due to δ “ p λ k ` λ k ` q ∆ ě ∆. Here C , C, c , γ are absolute constants.27 ase 2: λ k ď By (5.13), we have p Φ l k Λ ´ Λ p Φ l k “ G Λ ` Λ F , where G : “ p p Σ xy ´ p Σ x Λ q p Ψ u k ` p p Σ x R ´ R q and F : “ r I p , p ˆp p ´ p q s ” p p Σ yx ´ p Σ y Λ q p Φ u k ´ p p Σ y R ` R q ı . Notice that p Σ xy and p Σ x are submatrices of p Σ p . By Lemma 5.1, we have } p Σ xy ´ p Σ x Λ } ď C c p n . Moreover, by } R } ď C b p ` p n , } R } ď C p ` p n ∆ and Lemma 5.2, there holds } G } ď C ˆc p n ` p ` p n ∆ ˙ . Similarly, r I p , p ˆp p ´ p q s p Σ yx and r I p , p ˆp p ´ p q s p Σ x are submatrices of p Σ p . By a similarargument, } F } ď C ˆc p n ` p ` p n ∆ ˙ . Then p Φ l k “ „ λ j λ k ` i ` λ j ˝ „ λ j ´ λ k ` i ˝ G ` „ λ k ` i λ k ` i ` λ j ˝ „ λ j ´ λ k ` i ˝ F Here 1 ď i ď p ´ k and 1 ď j ď k . By Lemma 5.6, there holds for any X , ››››„ λ j λ k ` i ` λ j X ›››› “ ››››„ max p λ k ` i , λ j q λ k ` i ` λ j X ›››› ď } X } and ››››„ λ k ` i λ k ` i ` λ j X ›››› “ ››››„ min p λ k ` i , λ j q λ k ` i ` λ j X ›››› ď } X } . Finally, for any X , „ λ j ´ λ k ` i X “ A ˝ p D XD q where A : “ »– b λ j ´ λ k ` ∆2 b λ k ` ´ λ k ` i ` ∆2 λ j ´ λ k ` i fifl , “ diag ¨˝ b ∆2 , ¨ ¨ ¨ , b λ k ` ´ λ p ` ∆2 ˛‚ , and D “ diag ¨˝ b λ ´ λ k ` ∆2 , ¨ ¨ ¨ , b ∆2 ˛‚ . Since } D } , } D } ď b , by Lemma 5.6, ››››„ λ j ´ λ k ` i X ›››› ď } D XD } ď } X } . In summary, we have } p Φ l k } ď C ˆc p n ∆ ` p ` p n ∆ ˙ . Since ě λ k ě λ k ` , there holds } p Φ l k } ď C »–d p p ´ λ k qp ´ λ k ` q n ∆ ` p p ` p q n ∆ fifl . To establish the minimax lower bounds of CCA estimates for our proposed losses, we follow theanalytical frameworks in the literature of PCA and CCA, e.g., Vu et al. (2013); Cai et al. (2013);Gao et al. (2015), where the calculation is focused on the construction of the hypothesis class towhich the packing lemma and Fano’s inequality are applied. However, since we fix both λ k and λ k ` in the localized parameter spaces, new technical challenges arise and consequently we constructhypothesis classes based on the equality (6.1). In this section we also denote ∆ : “ λ k ´ λ k ` . The following lemma can be viewed as an extension of Lemma 14 in Gao et al. (2015) from λ k ` “ λ k ` . The proof of the lemma can be found in Section 6.4. Lemma 6.1.
For i “ , and p ě p ě k , let “ U p i q , W p i q ‰ P O p p , p q , “ V p i q , Z p i q ‰ P O p p , p q where U p i q P R p ˆ k , V p i q P R p ˆ k . For ď λ ă λ ă , let ∆ “ λ ´ λ and define Σ p i q “ « Σ x Σ { x p λ U p i q V Jp i q ` λ W p i q Z Jp i q q Σ { y Σ { y p λ V p i q U Jp i q ` λ Z p i q W Jp i q q Σ { x Σ y ff i “ , , Let P p i q denote the distribution of a random i.i.d. sample of size n from N p , Σ p i q q . If we furtherassume r U p q , W p q s « V Jp q Z Jp q ff “ r U p q , W p q s « V Jp q Z Jp q ff , (6.1)29 hen one can show that D p P p q || P p q q “ n ∆ p ` λ λ q p ´ λ qp ´ λ q } U p q V Jp q ´ U p q V Jp q } F . Remark 6.2.
The conditon in (6.1) is crucial for obtaining the eigen-gap factor { ∆ in the lowerbound and is the key insight behind the construction of the hypothesis class in the proof. Gao et al.(2015) has a similar lemma but only deals with the case that the residual canonical correlationsare zero. To the best of our knowledge, the proof techniques in Gao et al. (2015, 2017) cannot bedirectly used to obtain our results. The following result on the packing number is based on the metric entropy of the Grassmannianmanifold G p k, r q due to Szarek (1982). We use the version adapted from Lemma 1 of Cai et al.(2013) which is also used in Gao et al. (2015). Lemma 6.3.
For any fixed U P O p p, k q and B ǫ “ t U P O p p, k q : } U U J ´ U U J } F ď ǫ u with ǫ P p , a r k ^ p p ´ k qs q . Define the semi-metric ρ p¨ , ¨q on B ǫ by ρ p U , U q “ } U U J ´ U U J } F . Then there exists universal constant C such that for any α P p , q , the packing number M p B ǫ , ρ, αǫ q satisfies M p B ǫ , ρ, αǫ q ě ˆ Cα ˙ k p p ´ k q . The following corollary is used to prove the lower bound.
Corollary 6.4.
If we change the set in Lemma 6.3 to r B ǫ “ t U P O p p, k q : } U ´ U } F ď ǫ u , thenwe still have M p r B ǫ , ρ, αǫ q ě ˆ Cα ˙ k p p ´ k q . Proof.
Apply Lemma 6.3 to B ǫ , there exists U , ¨ ¨ ¨ , U n with n ě p { Cα q k p p ´ k q such that } U i U J i ´ U U J } F ď ǫ , ď i ď n, } U i U J i ´ U j U J j } F ě αǫ , ď i ď j ď n. Define r U i “ arg min U Pt U i Q , Q P O p k qu } U ´ U } F , by Lemma 6.5, } r U i ´ U } F ď } r U i r U J i ´ U U J } F ď ǫ . Therefore, r U , ¨ ¨ ¨ , r U n P r B ǫ and } r U i r U J i ´ r U j r U J j } F “ } U i U J i ´ U j U J j } F ě αǫ . which implies, M p r B ǫ , ρ, αǫ q ě n ě ˆ Cα ˙ k p p ´ k q . emma 6.5. For any matrices U , U P O p p, k q , inf Q P O p k,k q } U ´ U Q } F ď } P U ´ P U } F Proof.
By definition } U ´ U Q } F “ k ´ tr p U J U Q q Let U J U “ U DV J be the singular value decomposition. Then V U J P O p k, k q andinf Q P O p k,k q } U ´ U Q } F ď k ´ tr p U J U V U J q“ k ´ tr p U DU J q“ k ´ tr p D q . On the other hand, } P U ´ P U } F “ } U U J ´ U U J } F “ k ´ tr p U U J U U J q“ k ´ tr p U J U U J U q“ k ´ tr p D q . Since U , U P O p p, k q , } U J U } ď D is less than 1,which implies that tr p D q ě tr p D q andinf Q P O p k,k q } U ´ U Q } F ď } P U ´ P U } F . Lemma 6.6 (Fano’s Lemma Yu (1997)) . Let p Θ , ρ q be a (semi)metric space and t P θ : θ P Θ u acollection of probability measures. For any totally bounded T Ă Θ , denote M p T, ρ, ǫ q the ǫ -packingnumber of T with respect to the metric ρ , i.e. , the maximal number of points in T whoese pairwiseminimum distance in ρ is at least ǫ . Define the Kullback-Leibler diameter of T by d KL p T q “ sup θ,θ P T D p P θ || P θ q . Then, inf p θ sup θ P Θ E θ ” ρ p p θ, θ q ı ě sup T Ă Θ sup ǫ ą ǫ ´ ´ d KL p T q ` log log M p T, ρ, ǫ q ¯ .3 Proof of Lower Bound For any fixed “ U p q , W p q ‰ P O p p , p q and “ V p q , Z p q ‰ P O p p , p q where U p q P R p ˆ k , V p q P R p ˆ k , W p q P R p ˆp p ´ k q , V p q P R p ˆp p ´ k q , define H ǫ “ !` U , W , V , Z ˘ : “ U , W ‰ P O p p , p q with U P R p ˆ k , “ V , Z ‰ P O p p , p q with V P R p ˆ k , } U ´ U p q } F ď ǫ , r U , W s « V J Z J ff “ r U p q , W p q s « V Jp q Z Jp q ff ) . For any fixed Σ x P S p ` , Σ y P S p ` with κ p Σ x q “ κ x , κ p Σ y q “ κ y , consider the parametrization Σ xy “ Σ x Φ Λ Ψ J Σ y , for 0 ď λ k ` ă λ k ă
Lemma 6.6 (Fano's Lemma, Yu (1997)). Let $(\Theta,\rho)$ be a (semi)metric space and $\{P_\theta:\theta\in\Theta\}$ a collection of probability measures. For any totally bounded $T\subset\Theta$, denote by $M(T,\rho,\epsilon)$ the $\epsilon$-packing number of $T$ with respect to the metric $\rho$, i.e., the maximal number of points in $T$ whose pairwise minimum distance in $\rho$ is at least $\epsilon$. Define the Kullback–Leibler diameter of $T$ by

$$d_{KL}(T)=\sup_{\theta,\theta'\in T}D\big(P_\theta\,\|\,P_{\theta'}\big).$$

Then

$$\inf_{\hat\theta}\sup_{\theta\in\Theta}\mathbb{E}_\theta\big[\rho^2(\hat\theta,\theta)\big]\ge\sup_{T\subset\Theta}\sup_{\epsilon>0}\frac{\epsilon^2}{4}\Big(1-\frac{d_{KL}(T)+\log 2}{\log M(T,\rho,\epsilon)}\Big).$$

6.3 Proof of Lower Bound

For any fixed $[U^{(0)},W^{(0)}]\in O(p_1,p_2)$ and $[V^{(0)},Z^{(0)}]\in O(p_2,p_2)$, where $U^{(0)}\in\mathbb{R}^{p_1\times k}$, $V^{(0)}\in\mathbb{R}^{p_2\times k}$, $W^{(0)}\in\mathbb{R}^{p_1\times(p_2-k)}$ and $Z^{(0)}\in\mathbb{R}^{p_2\times(p_2-k)}$, define

$$H_\epsilon=\Big\{(U,W,V,Z):\ [U,W]\in O(p_1,p_2)\ \text{with}\ U\in\mathbb{R}^{p_1\times k},\ [V,Z]\in O(p_2,p_2)\ \text{with}\ V\in\mathbb{R}^{p_2\times k},\ \|U-U^{(0)}\|_F\le\epsilon,\ [U,W]\begin{bmatrix}V^\top\\ Z^\top\end{bmatrix}=[U^{(0)},W^{(0)}]\begin{bmatrix}V^{(0)\top}\\ Z^{(0)\top}\end{bmatrix}\Big\}.$$

For any fixed $\Sigma_x\in\mathcal{S}_{p_1}^{+}$ and $\Sigma_y\in\mathcal{S}_{p_2}^{+}$ with $\kappa(\Sigma_x)=\kappa_x$ and $\kappa(\Sigma_y)=\kappa_y$, consider the parametrization $\Sigma_{xy}=\Sigma_x\Phi\Lambda\Psi^\top\Sigma_y$. For $0\le\lambda_{k+1}<\lambda_k<1$, define

$$T_\epsilon=\Big\{\Sigma=\begin{bmatrix}\Sigma_x & \Sigma_x^{1/2}\big(\lambda_k UV^\top+\lambda_{k+1}WZ^\top\big)\Sigma_y^{1/2}\\[2pt] \Sigma_y^{1/2}\big(\lambda_k VU^\top+\lambda_{k+1}ZW^\top\big)\Sigma_x^{1/2} & \Sigma_y\end{bmatrix}:\ \Phi=\Sigma_x^{-1/2}[U,W],\ \Psi=\Sigma_y^{-1/2}[V,Z],\ (U,W,V,Z)\in H_\epsilon\Big\}.$$

It is straightforward to verify that $T_\epsilon\subset\mathcal{F}(p_1,p_2,k,\lambda_k,\lambda_{k+1},\kappa_x,\kappa_y)$. For any $\Sigma^{(i)}\in T_\epsilon$, $i=1,2$,

$$\Sigma^{(i)}=\begin{bmatrix}\Sigma_x & \Sigma_x^{1/2}\big(\lambda_k U^{(i)}V^{(i)\top}+\lambda_{k+1}W^{(i)}Z^{(i)\top}\big)\Sigma_y^{1/2}\\[2pt] \Sigma_y^{1/2}\big(\lambda_k V^{(i)}U^{(i)\top}+\lambda_{k+1}Z^{(i)}W^{(i)\top}\big)\Sigma_x^{1/2} & \Sigma_y\end{bmatrix},$$

where $(U^{(i)},W^{(i)},V^{(i)},Z^{(i)})\in H_\epsilon$, and the leading-$k$ canonical vectors are $\Phi_k^{(i)}=\Sigma_x^{-1/2}U^{(i)}$ and $\Psi_k^{(i)}=\Sigma_y^{-1/2}V^{(i)}$. We define a semi-metric on $T_\epsilon$ by

$$\rho\big(\Sigma^{(1)},\Sigma^{(2)}\big)=\big\|P_{\Sigma_x^{1/2}\Phi_k^{(1)}}-P_{\Sigma_x^{1/2}\Phi_k^{(2)}}\big\|_F=\big\|P_{U^{(1)}}-P_{U^{(2)}}\big\|_F.$$

By Lemma 6.1,

$$D\big(P_{\Sigma^{(1)}}\,\|\,P_{\Sigma^{(2)}}\big)=\frac{n\Delta^2(1+\lambda_k\lambda_{k+1})}{2(1-\lambda_k^2)(1-\lambda_{k+1}^2)}\big\|U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big\|_F^2.$$

Further, by the definition of $d_{KL}(T_\epsilon)$,

$$d_{KL}(T_\epsilon)=\frac{n\Delta^2(1+\lambda_k\lambda_{k+1})}{2(1-\lambda_k^2)(1-\lambda_{k+1}^2)}\sup_{\Sigma^{(1)},\Sigma^{(2)}\in T_\epsilon}\big\|U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big\|_F^2.\qquad(6.2)$$

To bound the Kullback–Leibler diameter, note that for any $\Sigma^{(1)},\Sigma^{(2)}\in T_\epsilon$, by definition,

$$[U^{(1)},W^{(1)}]\begin{bmatrix}V^{(1)\top}\\ Z^{(1)\top}\end{bmatrix}=[U^{(2)},W^{(2)}]\begin{bmatrix}V^{(2)\top}\\ Z^{(2)\top}\end{bmatrix},$$

which implies that the two sides are singular value decompositions of the same matrix. Therefore, there exists $Q\in O(p_2,p_2)$ such that

$$[U^{(1)},W^{(1)}]=[U^{(2)},W^{(2)}]Q,\qquad [V^{(1)},Z^{(1)}]=[V^{(2)},Z^{(2)}]Q.\qquad(6.3)$$

Decompose $Q$ into four blocks,

$$Q=\begin{bmatrix}Q_{11} & Q_{12}\\ Q_{21} & Q_{22}\end{bmatrix}.$$

Substituting into (6.3),

$$U^{(1)}=U^{(2)}Q_{11}+W^{(2)}Q_{21},\qquad V^{(1)}=V^{(2)}Q_{11}+Z^{(2)}Q_{21}.$$

Then

$$\|U^{(1)}-U^{(2)}\|_F^2=\|U^{(2)}(Q_{11}-I_k)+W^{(2)}Q_{21}\|_F^2=\|U^{(2)}(Q_{11}-I_k)\|_F^2+\|W^{(2)}Q_{21}\|_F^2=\|Q_{11}-I_k\|_F^2+\|Q_{21}\|_F^2.$$

The second equality is due to the fact that $U^{(2)}$ and $W^{(2)}$ have orthogonal column spaces, and the third equality is valid because $U^{(2)}$ and $W^{(2)}$ have orthonormal columns. By the same argument,

$$\|V^{(1)}-V^{(2)}\|_F^2=\|Q_{11}-I_k\|_F^2+\|Q_{21}\|_F^2.$$

Notice that

$$\big\|U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big\|_F=\big\|(U^{(1)}-U^{(2)})V^{(1)\top}+U^{(2)}(V^{(1)}-V^{(2)})^\top\big\|_F\le\|U^{(1)}-U^{(2)}\|_F+\|V^{(1)}-V^{(2)}\|_F=2\|U^{(1)}-U^{(2)}\|_F\le 2\big(\|U^{(1)}-U^{(0)}\|_F+\|U^{(2)}-U^{(0)}\|_F\big)\le 4\epsilon.$$

Substituting into (6.2),

$$d_{KL}(T_\epsilon)\le\frac{8n\Delta^2(1+\lambda_k\lambda_{k+1})}{(1-\lambda_k^2)(1-\lambda_{k+1}^2)}\epsilon^2.\qquad(6.4)$$

Let $\widetilde{B}_\epsilon=\{U\in O(p_1,k):\|U-U^{(0)}\|_F\le\epsilon\}$. Under the semi-metric $\widetilde\rho(U^{(1)},U^{(2)})=\|U^{(1)}U^{(1)\top}-U^{(2)}U^{(2)\top}\|_F$, we claim that the packing number of $H_\epsilon$ is lower bounded by the packing number of $\widetilde{B}_\epsilon$. To prove this claim, it suffices to show that for any $U\in\widetilde{B}_\epsilon$ there exist corresponding $W,V,Z$ such that $(U,W,V,Z)\in H_\epsilon$. First of all, by definition, $\|U-U^{(0)}\|_F\le\epsilon$. Let $W\in O(p_1,p_2-k)$ be an orthogonal complement of $U$ such that $[U,W]\in O(p_1,p_2)$; then there exists $Q\in O(p_2,p_2)$ such that $[U,W]=[U^{(0)},W^{(0)}]Q$. Set $[V,Z]=[V^{(0)},Z^{(0)}]Q\in O(p_2,p_2)$; then

$$[U,W]\begin{bmatrix}V^\top\\ Z^\top\end{bmatrix}=[U^{(0)},W^{(0)}]\begin{bmatrix}V^{(0)\top}\\ Z^{(0)\top}\end{bmatrix},$$

so $(U,W,V,Z)\in H_\epsilon$.

Let

$$\epsilon_0=\alpha\epsilon=c\left(\sqrt{k\wedge(p_2-k)}\;\wedge\;\sqrt{\frac{(1-\lambda_k^2)(1-\lambda_{k+1}^2)}{n\Delta^2(1+\lambda_k\lambda_{k+1})}\,k(p_1-k)}\right),$$

where $c\in(0,1)$ depends on $\alpha$ and is chosen small enough that $\epsilon=\epsilon_0/\alpha\in\big(0,\sqrt{2[k\wedge(p_1-k)]}\,\big]$. By Corollary 6.4,

$$M(T_\epsilon,\rho,\alpha\epsilon)=M(H_\epsilon,\widetilde\rho,\alpha\epsilon)\ge M(\widetilde{B}_\epsilon,\widetilde\rho,\alpha\epsilon)\ge\Big(\frac{C}{\alpha}\Big)^{k(p_1-k)}.$$

Applying Lemma 6.6 with $T_\epsilon$, $\rho$ and $\epsilon_0$,

$$\inf_{\widehat\Phi_k}\sup_{\Sigma\in\mathcal F}\mathbb{E}\Big[\big\|P_{\Sigma_x^{1/2}\widehat\Phi_k}-P_{\Sigma_x^{1/2}\Phi_k}\big\|_F^2\Big]\ge\frac{\epsilon_0^2}{4}\left(1-\frac{8c^2\alpha^{-2}k(p_1-k)+\log 2}{k(p_1-k)\log(C/\alpha)}\right).$$

Choose $\alpha$ and then $c$ small enough that

$$1-\frac{8c^2\alpha^{-2}k(p_1-k)+\log 2}{k(p_1-k)\log(C/\alpha)}\ge\frac12.$$
Then the lower bound reduces to

$$\inf_{\widehat\Phi_k}\sup_{\Sigma\in\mathcal F}\mathbb{E}\Big[\big\|P_{\Sigma_x^{1/2}\widehat\Phi_k}-P_{\Sigma_x^{1/2}\Phi_k}\big\|_F^2\Big]\ge\frac{\epsilon_0^2}{8}\ge c\left\{\frac{(1-\lambda_k^2)(1-\lambda_{k+1}^2)}{n\Delta^2(1+\lambda_k\lambda_{k+1})}k(p_1-k)\wedge k\wedge(p_2-k)\right\}\ge Ck\left\{\frac{(1-\lambda_k^2)(1-\lambda_{k+1}^2)}{\Delta^2}\cdot\frac{p_1-k}{n}\wedge 1\wedge\frac{p_2-k}{k}\right\}.$$

By symmetry,

$$\inf_{\widehat\Psi_k}\sup_{\Sigma\in\mathcal F}\mathbb{E}\Big[\big\|P_{\Sigma_y^{1/2}\widehat\Psi_k}-P_{\Sigma_y^{1/2}\Psi_k}\big\|_F^2\Big]\ge Ck\left\{\frac{(1-\lambda_k^2)(1-\lambda_{k+1}^2)}{\Delta^2}\cdot\frac{p_2-k}{n}\wedge 1\wedge\frac{p_2-k}{k}\right\}.$$

The lower bound for the operator norm error is obtained immediately by noticing that $P_{\Sigma_x^{1/2}\widehat\Phi_k}-P_{\Sigma_x^{1/2}\Phi_k}$ (and likewise its $\Psi$ counterpart) has rank at most $2k$, so that

$$\big\|P_{\Sigma_x^{1/2}\widehat\Phi_k}-P_{\Sigma_x^{1/2}\Phi_k}\big\|^2\ge\frac{1}{2k}\big\|P_{\Sigma_x^{1/2}\widehat\Phi_k}-P_{\Sigma_x^{1/2}\Phi_k}\big\|_F^2.$$
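The last step is a generic norm comparison: any matrix $A$ with $\mathrm{rank}(A)\le r$ satisfies $\|A\|_F^2\le r\|A\|^2$. A quick sketch (ours; numpy only) on a random difference of two rank-$k$ projections, which indeed has rank at most $2k$:

```python
import numpy as np

rng = np.random.default_rng(3)

def orth(p, q, rng):
    return np.linalg.qr(rng.standard_normal((p, q)))[0]

p, k = 10, 2
A, B = orth(p, k, rng), orth(p, k, rng)
D = A @ A.T - B @ B.T            # difference of two rank-k projections

print(np.linalg.matrix_rank(D) <= 2 * k)                                  # True
print(np.linalg.norm(D, 2) >= np.linalg.norm(D, 'fro') / np.sqrt(2 * k))  # True
```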
6.4 Proof of Lemma 6.1

By simple algebra, the Kullback–Leibler divergence between two multivariate Gaussian distributions satisfies

$$D\big(P_{\Sigma^{(1)}}\,\|\,P_{\Sigma^{(2)}}\big)=\frac{n}{2}\Big\{\mathrm{Tr}\big(\Sigma^{(2)-1}(\Sigma^{(1)}-\Sigma^{(2)})\big)-\log\det\big(\Sigma^{(2)-1}\Sigma^{(1)}\big)\Big\}.$$

Notice that

$$\Sigma^{(i)}=\begin{bmatrix}\Sigma_x^{1/2} & \\ & \Sigma_y^{1/2}\end{bmatrix}\Omega^{(i)}\begin{bmatrix}\Sigma_x^{1/2} & \\ & \Sigma_y^{1/2}\end{bmatrix},\qquad \Omega^{(i)}=\begin{bmatrix}I_{p_1} & \lambda_1 U^{(i)}V^{(i)\top}+\lambda_2 W^{(i)}Z^{(i)\top}\\ \lambda_1 V^{(i)}U^{(i)\top}+\lambda_2 Z^{(i)}W^{(i)\top} & I_{p_2}\end{bmatrix}.$$

Then

$$D\big(P_{\Sigma^{(1)}}\,\|\,P_{\Sigma^{(2)}}\big)=\frac{n}{2}\Big\{\mathrm{Tr}\big(\Omega^{(2)-1}\Omega^{(1)}\big)-(p_1+p_2)-\log\det\big(\Omega^{(2)-1}\Omega^{(1)}\big)\Big\}.$$

Also notice that

$$\Omega^{(i)}=\begin{bmatrix}I_{p_1} & \\ & I_{p_2}\end{bmatrix}+\frac{\lambda_1}{2}\begin{bmatrix}U^{(i)}\\ V^{(i)}\end{bmatrix}\begin{bmatrix}U^{(i)\top} & V^{(i)\top}\end{bmatrix}-\frac{\lambda_1}{2}\begin{bmatrix}U^{(i)}\\ -V^{(i)}\end{bmatrix}\begin{bmatrix}U^{(i)\top} & -V^{(i)\top}\end{bmatrix}+\frac{\lambda_2}{2}\begin{bmatrix}W^{(i)}\\ Z^{(i)}\end{bmatrix}\begin{bmatrix}W^{(i)\top} & Z^{(i)\top}\end{bmatrix}-\frac{\lambda_2}{2}\begin{bmatrix}W^{(i)}\\ -Z^{(i)}\end{bmatrix}\begin{bmatrix}W^{(i)\top} & -Z^{(i)\top}\end{bmatrix}.$$

Therefore $\Omega^{(1)}$ and $\Omega^{(2)}$ share the same set of eigenvalues: $1+\lambda_1$ with multiplicity $k$, $1-\lambda_1$ with multiplicity $k$, $1+\lambda_2$ with multiplicity $p_2-k$, $1-\lambda_2$ with multiplicity $p_2-k$, and $1$ with multiplicity $p_1-p_2$. This implies $\log\det\big(\Omega^{(2)-1}\Omega^{(1)}\big)=0$. On the other hand, by the block inversion formula we can compute

$$\Omega^{(2)-1}=\begin{bmatrix}I_{p_1}+\frac{\lambda_1^2}{1-\lambda_1^2}U^{(2)}U^{(2)\top}+\frac{\lambda_2^2}{1-\lambda_2^2}W^{(2)}W^{(2)\top} & -\frac{\lambda_1}{1-\lambda_1^2}U^{(2)}V^{(2)\top}-\frac{\lambda_2}{1-\lambda_2^2}W^{(2)}Z^{(2)\top}\\[2pt] -\frac{\lambda_1}{1-\lambda_1^2}V^{(2)}U^{(2)\top}-\frac{\lambda_2}{1-\lambda_2^2}Z^{(2)}W^{(2)\top} & I_{p_2}+\frac{\lambda_1^2}{1-\lambda_1^2}V^{(2)}V^{(2)\top}+\frac{\lambda_2^2}{1-\lambda_2^2}Z^{(2)}Z^{(2)\top}\end{bmatrix}.$$

Divide $\Omega^{(2)-1}\Omega^{(1)}-I_{p_1+p_2}$ into blocks,

$$\Omega^{(2)-1}\Omega^{(1)}-I_{p_1+p_2}=\begin{bmatrix}J_{11} & J_{12}\\ J_{21} & J_{22}\end{bmatrix},$$

where $J_{11}\in\mathbb{R}^{p_1\times p_1}$, $J_{22}\in\mathbb{R}^{p_2\times p_2}$, and

$$J_{11}=\frac{\lambda_1^2}{1-\lambda_1^2}\big(U^{(2)}U^{(2)\top}-U^{(2)}V^{(2)\top}V^{(1)}U^{(1)\top}\big)+\frac{\lambda_2^2}{1-\lambda_2^2}\big(W^{(2)}W^{(2)\top}-W^{(2)}Z^{(2)\top}Z^{(1)}W^{(1)\top}\big)-\frac{\lambda_1\lambda_2}{1-\lambda_1^2}U^{(2)}V^{(2)\top}Z^{(1)}W^{(1)\top}-\frac{\lambda_1\lambda_2}{1-\lambda_2^2}W^{(2)}Z^{(2)\top}V^{(1)}U^{(1)\top},$$

$$J_{22}=\frac{\lambda_1^2}{1-\lambda_1^2}\big(V^{(2)}V^{(2)\top}-V^{(2)}U^{(2)\top}U^{(1)}V^{(1)\top}\big)+\frac{\lambda_2^2}{1-\lambda_2^2}\big(Z^{(2)}Z^{(2)\top}-Z^{(2)}W^{(2)\top}W^{(1)}Z^{(1)\top}\big)-\frac{\lambda_1\lambda_2}{1-\lambda_1^2}V^{(2)}U^{(2)\top}W^{(1)}Z^{(1)\top}-\frac{\lambda_1\lambda_2}{1-\lambda_2^2}Z^{(2)}W^{(2)\top}U^{(1)}V^{(1)\top}.$$

We spell out the algebra for $\mathrm{tr}(J_{11})$; $\mathrm{tr}(J_{22})$ can be computed in exactly the same fashion. First,

$$\mathrm{tr}\big(U^{(2)}U^{(2)\top}-U^{(2)}V^{(2)\top}V^{(1)}U^{(1)\top}\big)=\frac12\mathrm{tr}\big(U^{(1)}V^{(1)\top}V^{(1)}U^{(1)\top}+U^{(2)}V^{(2)\top}V^{(2)}U^{(2)\top}-2U^{(2)}V^{(2)\top}V^{(1)}U^{(1)\top}\big)=\frac12\big\|U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big\|_F^2.$$

Similarly,

$$\mathrm{tr}\big(W^{(2)}W^{(2)\top}-W^{(2)}Z^{(2)\top}Z^{(1)}W^{(1)\top}\big)=\frac12\big\|W^{(1)}Z^{(1)\top}-W^{(2)}Z^{(2)\top}\big\|_F^2.$$
By assumption (6.1), i.e., $U^{(1)}V^{(1)\top}+W^{(1)}Z^{(1)\top}=U^{(2)}V^{(2)\top}+W^{(2)}Z^{(2)\top}$, we have $W^{(1)}Z^{(1)\top}-W^{(2)}Z^{(2)\top}=-\big(U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big)$, and hence

$$\mathrm{tr}\big(W^{(2)}W^{(2)\top}-W^{(2)}Z^{(2)\top}Z^{(1)}W^{(1)\top}\big)=\frac12\big\|U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big\|_F^2.$$

Further, since $V^{(2)\top}Z^{(2)}=0$,

$$\mathrm{tr}\big(U^{(2)}V^{(2)\top}Z^{(1)}W^{(1)\top}\big)=\mathrm{tr}\Big(U^{(2)}V^{(2)\top}\big(U^{(2)}V^{(2)\top}+W^{(2)}Z^{(2)\top}-U^{(1)}V^{(1)\top}\big)^\top\Big)=\mathrm{tr}\Big(U^{(2)}V^{(2)\top}\big(U^{(2)}V^{(2)\top}-U^{(1)}V^{(1)\top}\big)^\top\Big)=\frac12\big\|U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big\|_F^2,$$

and by the same argument,

$$\mathrm{tr}\big(W^{(2)}Z^{(2)\top}V^{(1)}U^{(1)\top}\big)=\frac12\big\|U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big\|_F^2.$$

Summing these equations,

$$\mathrm{tr}(J_{11})=\left\{\frac{\lambda_1^2}{1-\lambda_1^2}+\frac{\lambda_2^2}{1-\lambda_2^2}-\frac{\lambda_1\lambda_2}{1-\lambda_1^2}-\frac{\lambda_1\lambda_2}{1-\lambda_2^2}\right\}\cdot\frac12\big\|U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big\|_F^2=\frac{\Delta^2(1+\lambda_1\lambda_2)}{2(1-\lambda_1^2)(1-\lambda_2^2)}\big\|U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big\|_F^2.$$

Repeating the argument for $J_{22}$, one can show that

$$\mathrm{tr}(J_{22})=\mathrm{tr}(J_{11})=\frac{\Delta^2(1+\lambda_1\lambda_2)}{2(1-\lambda_1^2)(1-\lambda_2^2)}\big\|U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big\|_F^2.$$

Therefore,

$$D\big(P_{\Sigma^{(1)}}\,\|\,P_{\Sigma^{(2)}}\big)=\frac{n}{2}\mathrm{Tr}\big(\Omega^{(2)-1}\Omega^{(1)}-I_{p_1+p_2}\big)=\frac{n}{2}\big(\mathrm{tr}(J_{11})+\mathrm{tr}(J_{22})\big)=\frac{n\Delta^2(1+\lambda_1\lambda_2)}{2(1-\lambda_1^2)(1-\lambda_2^2)}\big\|U^{(1)}V^{(1)\top}-U^{(2)}V^{(2)\top}\big\|_F^2.$$
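The eigenvalue bookkeeping above, which is what makes the log-determinant term vanish, can be confirmed numerically. A minimal sketch (ours; numpy only), assuming $\Sigma_x=I_{p_1}$, $\Sigma_y=I_{p_2}$ and arbitrary illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(4)

def orth(p, q, rng):
    return np.linalg.qr(rng.standard_normal((p, q)))[0]

p1, p2, k = 7, 5, 2                      # illustrative values (ours)
lam1, lam2 = 0.8, 0.3

UW, VZ = orth(p1, p2, rng), orth(p2, p2, rng)
U, W = UW[:, :k], UW[:, k:]
V, Z = VZ[:, :k], VZ[:, k:]

C = lam1 * U @ V.T + lam2 * W @ Z.T
Omega = np.block([[np.eye(p1), C], [C.T, np.eye(p2)]])

# The spectrum claimed in the proof: 1 +/- lam1 with multiplicity k each,
# 1 +/- lam2 with multiplicity p2 - k each, and 1 with multiplicity p1 - p2.
expected = np.sort(np.concatenate([
    np.full(k, 1 + lam1), np.full(k, 1 - lam1),
    np.full(p2 - k, 1 + lam2), np.full(p2 - k, 1 - lam2),
    np.full(p1 - p2, 1.0),
]))
print(np.allclose(np.sort(np.linalg.eigvalsh(Omega)), expected))  # True
```

Since the spectrum does not depend on $U,W,V,Z$, any two such $\Omega^{(1)},\Omega^{(2)}$ have the same determinant, which is exactly why $\log\det\big(\Omega^{(2)-1}\Omega^{(1)}\big)=0$.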
References

Anderson, T. W. (1999). Asymptotic theory for canonical correlation analysis. Journal of Multivariate Analysis 70(1), 1–29.

Arora, R. and K. Livescu (2013). Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 7135–7139. IEEE.

Cai, T., Z. Ma, and Y. Wu (2015). Optimal estimation and rank detection for sparse spiked covariance matrices. Probability Theory and Related Fields 161(3-4), 781–815.

Cai, T. T., Z. Ma, and Y. Wu (2013). Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics 41(6), 3074–3110.

Cai, T. T. and A. Zhang (2017). Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, to appear.

Chaudhuri, K., S. M. Kakade, K. Livescu, and K. Sridharan (2009). Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 129–136. ACM.

Chen, X., H. Liu, and J. G. Carbonell (2012). Structured sparse canonical correlation analysis. In International Conference on Artificial Intelligence and Statistics, pp. 199–207.

Dhillon, P. S., D. Foster, and L. Ungar (2011). Multi-view learning of word embeddings via CCA. In Advances in Neural Information Processing Systems (NIPS), Volume 24.

Faruqui, M. and C. Dyer (2014). Improving vector space word representations using multilingual correlation. Association for Computational Linguistics.

Foster, D. P., R. Johnson, S. M. Kakade, and T. Zhang (2008). Multi-view dimensionality reduction via canonical correlation analysis. Technical report.

Friman, O., M. Borga, P. Lundberg, and H. Knutsson (2003). Adaptive analysis of fMRI data. NeuroImage 19(3), 837–845.

Fukumizu, K., F. R. Bach, and M. I. Jordan (2009). Kernel dimension reduction in regression. The Annals of Statistics, 1871–1905.

Gao, C., Z. Ma, Z. Ren, and H. H. Zhou (2015). Minimax estimation in sparse canonical correlation analysis. The Annals of Statistics 43(5), 2168–2197.

Gao, C., Z. Ma, and H. H. Zhou (2017). Sparse CCA: Adaptive estimation and computational barriers. The Annals of Statistics, to appear.

Gong, Y., Q. Ke, M. Isard, and S. Lazebnik (2014). A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision 106(2), 210–233.

Horn, R. A. and C. R. Johnson (1991). Topics in Matrix Analysis. Cambridge University Press, New York.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika 28, 321–377.

Kakade, S. M. and D. P. Foster (2007). Multi-view regression via canonical correlation analysis. In Proceedings of the Conference on Learning Theory.

Kim, T.-K., S.-F. Wong, and R. Cipolla (2007). Tensor canonical correlation analysis for action classification. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pp. 1–8. IEEE.

Mathias, R. (1993). The Hadamard operator norm of a circulant and applications. SIAM Journal on Matrix Analysis and Applications 14(4), 1152–1167.

Rasiwasia, N., J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos (2010). A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM International Conference on Multimedia, pp. 251–260. ACM.

Rudelson, M. and R. Vershynin (2013). Hanson–Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability 18(82), 1–9.

Sridharan, K. and S. M. Kakade (2008). An information theoretic framework for multi-view learning. In R. A. Servedio and T. Zhang (Eds.), COLT, pp. 403–414. Omnipress.

Szarek, S. J. (1982). Nets of Grassmann manifold and orthogonal group. In Proceedings of Research Workshop on Banach Space Theory (Iowa City, Iowa, 1981), pp. 169–185.

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

Vu, V. Q. and J. Lei (2013). Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics 41(6), 2905–2947.

Wang, W., R. Arora, K. Livescu, and J. Bilmes (2015). On deep multi-view representation learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1083–1092.

Wedin, P.-Å. (1972). Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics 12(1), 99–111.

Wedin, P.-Å. (1983). On angles between subspaces of a finite dimensional inner product space. In Matrix Pencils, pp. 263–285. Springer.

Witten, D. M., R. Tibshirani, and T. Hastie (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, kxp008.

Yu, B. (1997). Assouad, Fano, and Le Cam. In