Eigenvector-based sparse canonical correlation analysis: Fast computation for estimation of multiple canonical vectors
Imbalanced Sparse Canonical Correlation Analysis
Wenjia Wang
The Statistical and Applied Mathematical Sciences Institute, Durham, NC, U.S.A.
[email protected]

Yihui Zhou
Department of Biological Sciences, North Carolina State University, Raleigh, NC, U.S.A.
yihui [email protected]
Abstract
Classical canonical correlation analysis (CCA) requires matrices to be low dimensional, i.e., the number of features cannot exceed the sample size. Recent developments in CCA have mainly focused on the high-dimensional setting, where the number of features in both matrices under analysis greatly exceeds the sample size. However, these approaches make considerable sparsity assumptions and impose penalties that may be unnecessary for some datasets. We consider a commonly encountered imbalanced setting, where one matrix is high dimensional and the other is low dimensional. We provide an explicit link between sparse multiple regression and sparse canonical correlation analysis, together with an efficient algorithm that exploits the imbalanced data structure and estimates multiple canonical pairs simultaneously rather than sequentially. We provide theoretical results on the consistency of the canonical pairs. Simulation results and the analysis of several real datasets support the improved performance of the proposed approach.
1 Introduction

Canonical correlation analysis (CCA) is a widely used method to determine the relationship between two sets of variables. In CCA, the objective is to find a linear combination of variables from each set such that the correlation between the two combinations is maximized. The pair of coefficient vectors of the two linear combinations is called a canonical pair. Originally proposed by Hotelling (1936), CCA has been applied to numerous problems, including those of large scale. In large scale problems, including genomic studies, researchers are often faced with high dimensional data. For example, Chen et al. (2012) studied the association between nutrient intake and human gut microbiome composition, and Wang et al. (2015) studied the group interactions among genes. Projects such as GTEx (Aguet et al., 2019) also provide rich datasets for which CCA might be used to identify important genetic modules relevant to disease. In these works, classical CCA cannot be used to analyze the high dimensional data, where the number of variables exceeds the number of observations.

To study the relationship between two sets of high dimensional variables, many extensions of classical CCA have been proposed. One popular approach, sparse canonical correlation analysis, imposes sparse structure on the canonical vectors. An incomplete list of sparse CCA methods is Parkhomenko et al. (2009); Witten and Tibshirani (2009); Waaijenborg et al. (2008); Lê Cao et al. (2009); Witten et al. (2009); Chen et al. (2012), and references therein. In these works, the dimensions of the two variables being compared are both high.

In some cases, the dimensions of the two variables can be quite different. For example, in the application to gut microbiome data analysis (Chen et al., 2012), the dimensions of the two variables were 214 and 40, while the sample size was 99.
Such datasets have a partially low dimensional structure; that is, the dimension of one variable is larger than the sample size, while the other variable has dimension lower than the sample size. Sparse canonical correlation analysis does not utilize this partially low dimensional structure, and may provide biased estimates of the canonical pairs.

In order to estimate the canonical pairs in problems with partially low dimensional structure, in this work we propose imbalanced sparse canonical correlation analysis. Specifically, we link sparse multiple regression with sparse canonical correlation analysis, and apply Lasso regularization to the estimation of the canonical pairs. We propose an efficient algorithm that provides K canonical pairs simultaneously for K ≥
1. This advantage significantly differentiates our method from other sparse canonical correlation analysis methods, which usually estimate multiple canonical pairs sequentially. Importantly, we also provide theoretical guarantees on the consistency of the estimated canonical pairs.

We note that the relationship between multiple regression and CCA has been considered previously. In Glahn (1968), multiple regression was considered as a special case of CCA, but the high dimensional situation was not considered. Lutz and Eckert (1994) analyzed the relationship between multiple regression and CCA via eigenstructure. Yamamoto et al. (2008) applied CCA to multivariate regression. Song et al. (2016) assumed that the responses have a linear relationship with some underlying signals. However, we are not aware of any works that apply sparse multiple regression to canonical correlation analysis.

The rest of this paper is arranged as follows. In Section 2, we introduce classical canonical correlation analysis and sparse canonical correlation analysis, and propose an imbalanced sparse canonical correlation analysis approach, with attendant theoretical properties. In Section 3, we compare imbalanced sparse canonical correlation analysis to existing methods, and highlight the potential advantages and importance of the proposed method. In Section 4, we conduct numerical simulation studies. In Section 5, we apply imbalanced sparse canonical correlation analysis and competing methods to three applied problems, including human gut microbiome data, GTEx thyroid imaging/expression data, and GTEx liver genotype/expression data. A discussion with conclusions is provided in Section 6. Technical proofs are provided in the Appendix.
2 Background and Proposed Methods
In this section, we provide a brief introduction to classical canonical correlation analysis and sparse canonical correlation analysis, propose imbalanced sparse canonical correlation analysis, and study its theoretical properties.
2.1 Classical Canonical Correlation Analysis

Suppose we are interested in studying the correlation between two sets of random variables $x = (x_1, \ldots, x_p)^T \in \mathbb{R}^p$ and $y = (y_1, \ldots, y_d)^T \in \mathbb{R}^d$. The goal of canonical correlation analysis (CCA) is to find $a \in \mathbb{R}^d$ and $b \in \mathbb{R}^p$ such that $(a, b)$ is the solution to
$$\max_{a \in \mathbb{R}^d,\, b \in \mathbb{R}^p} \mathrm{corr}(a^T y, b^T x). \quad (1)$$
Without loss of generality, we assume $x$ and $y$ have mean zero, for otherwise we can shift the mean. Let $\Sigma_{xx}$ and $\Sigma_{yy}$ be the covariance matrices of $x$ and $y$, respectively, and let $\Sigma_{xy}$ be the covariance matrix between $x$ and $y$. The optimization problem (1) is the same as
$$\max_{a \in \mathbb{R}^d,\, b \in \mathbb{R}^p} \frac{a^T \Sigma_{yx} b}{\sqrt{a^T \Sigma_{yy} a}\,\sqrt{b^T \Sigma_{xx} b}}. \quad (2)$$
The solution to (2), denoted by $a_1$ and $b_1$, is called the first pair of canonical vectors, and the new variables $y' = a_1^T y$ and $x' = b_1^T x$ are called the first pair of canonical variables (Mardia et al., 1979). Once the $(k-1)$ pairs $a_1, \ldots, a_{k-1}$ and $b_1, \ldots, b_{k-1}$ are obtained, the $k$-th pair of canonical vectors is the solution to the optimization problem
$$\max_{a \in \mathbb{R}^d,\, b \in \mathbb{R}^p} \frac{a^T \Sigma_{yx} b}{\sqrt{a^T \Sigma_{yy} a}\,\sqrt{b^T \Sigma_{xx} b}} \quad \mathrm{s.t.}\ a^T \Sigma_{yy} a_l = 0,\ b^T \Sigma_{xx} b_l = 0,\ 1 \le l \le k-1, \quad (3)$$
for $k = 2, \ldots, \min\{d, p\}$.

By basic matrix computation, one can obtain that the solution $a_k$ to the optimization problem (3) is the $k$-th eigenvector of
$$\Sigma_{yy}^{-1} \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy}, \quad (4)$$
and $b_k$ is proportional to
$$\Sigma_{xx}^{-1} \Sigma_{xy} a_k. \quad (5)$$
Note that the solution to (3) is not unique, because for any constant $C \in \mathbb{R}$ with $C \ne 0$, if $(a_k, b_k)$ is a solution to (3), then so is $(C a_k, C b_k)$. Therefore, we restrict the norms of $a_k$ and $b_k$ such that $\|a_k\|_2 = \|b_k\|_2 = 1$, and require the first nonzero element of $a_k$ (resp. $b_k$) to be positive, which makes the solution to (3) unique, where $\|\cdot\|_2$ is the Euclidean norm.

Let $X_i, Y_i$, $i = 1, \ldots, n$, be observations, where $X_i = (x_{i1}, \ldots, x_{ip})^T$ and $Y_i = (y_{i1}, \ldots, y_{id})^T$. Let $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$ be the sample matrices.
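At the population level, (4) and (5) translate directly into a few lines of linear algebra. The sketch below is illustrative rather than the authors' implementation; the joint covariance is built from an arbitrary valid model so that the identity $\mathrm{corr}(a_1^T y, b_1^T x) = \sqrt{\lambda_1}$, with $\lambda_1$ the leading eigenvalue in (4), can be checked.

```python
import numpy as np

def classical_cca(S_xx, S_yy, S_xy):
    """All canonical pairs at once from equations (4)-(5):
    a_k is the k-th eigenvector of S_yy^{-1} S_yx S_xx^{-1} S_xy,
    and b_k is proportional to S_xx^{-1} S_xy a_k."""
    S_yx = S_xy.T
    M = np.linalg.solve(S_yy, S_yx @ np.linalg.solve(S_xx, S_xy))  # d x d
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)                  # decreasing eigenvalues
    vals, A = vals.real[order], vecs.real[:, order]
    A = A / np.linalg.norm(A, axis=0)               # columns a_1, ..., a_d
    B = np.linalg.solve(S_xx, S_xy @ A)             # columns prop. to b_k
    B = B / np.linalg.norm(B, axis=0)
    return vals, A, B

# A toy (assumed) joint covariance: it is valid because the Schur
# complement S_yy - S_yx S_xx^{-1} S_xy equals the identity by construction.
rng = np.random.default_rng(0)
p, d = 8, 3
C = rng.standard_normal((p, p))
S_xx = C @ C.T + p * np.eye(p)
B_star = rng.standard_normal((d, p))
S_xy = S_xx @ B_star.T
S_yy = B_star @ S_xx @ B_star.T + np.eye(d)

vals, A, B = classical_cca(S_xx, S_yy, S_xy)
a1, b1 = A[:, 0], B[:, 0]
rho = (a1 @ S_xy.T @ b1) / np.sqrt((a1 @ S_yy @ a1) * (b1 @ S_xx @ b1))
# rho agrees with sqrt(vals[0]), and a_1, a_2 are Sigma_yy-orthogonal
# as required by the constraints in (3).
```

The uniqueness convention in the text (unit norm, first nonzero element positive) can be enforced afterwards by flipping the sign of each column.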
In classical CCA, the covariance matrices $\Sigma_{xx}$, $\Sigma_{yy}$, and $\Sigma_{yx}$ are replaced by $XX^T/n$, $YY^T/n$, and $YX^T/n$, respectively (González et al., 2008). This approach does not work if the dimension of $x$ or $y$ is larger than the sample size $n$, because $XX^T/n$ or $YY^T/n$ is singular. To address the case when $p$ or $d$ is larger than $n$, several modifications of classical CCA have been proposed. One widely used method, sparse canonical correlation analysis, restricts the estimated canonical vectors to be sparse. In Parkhomenko et al. (2009), the sparse canonical vectors are obtained by an algorithm where a threshold is used to control the sparsity of the canonical vectors. Another way to obtain sparse canonical vectors is via regularization (Witten and Tibshirani, 2009; Witten et al., 2009; Chen et al., 2012), where the $k$-th pair of canonical vectors is obtained by solving
$$\max_{a,b} \frac{1}{n} a^T Y X^T b \quad \mathrm{s.t.}\ \|a\|_2^2 \le 1,\ \|b\|_2^2 \le 1,\ P_1(a) \le c_1,\ P_2(b) \le c_2,\ a^T YY^T a_l = 0,\ b^T XX^T b_l = 0,\ 1 \le l < k, \quad (6)$$
where $P_1$ and $P_2$ are two convex penalty functions, and $c_1$ and $c_2$ are two constants. In this model, sparsity is imposed on the canonical vectors through the penalty functions. In Section 3, we provide a more detailed introduction to sparse canonical correlation analysis.

2.2 Imbalanced Sparse Canonical Correlation Analysis

In some cases, the dimensions of $x$ and $y$ are very different. Without loss of generality for our application domain, we assume the dimension of $x$ is much larger than the sample size, while the dimension of $y$ is relatively small. In this work, we propose imbalanced sparse canonical correlation analysis, which can be used to estimate the canonical vectors under the setting $p \gg n > d$. This imbalance allows us to use the low dimensional structure of $y$ to compute canonical vectors efficiently. One may still use sparse CCA (6) in this setting to estimate the canonical vectors, but may lose some efficiency or incur substantial bias, as we will see in the numerical studies.
In contrast with existing sparse CCA methods (6), we do not approach the problem directly in terms of the CCA correlation maximization. First, we establish a relationship between multivariate regression and CCA. This relationship allows us to apply existing methodologies from regression, which makes the algorithm for estimating canonical vectors more efficient.

Consider a multivariate linear regression of $y$ on the variables $x$,
$$y = B_* x + \epsilon_y, \quad (7)$$
where $B_* \in \mathbb{R}^{d \times p}$ is the coefficient matrix. The variable $\epsilon_y$ is the residual, and $E(\epsilon_y^T B_* x) = 0$. With the relationship (7), we can compute the covariance matrices $\Sigma_{xy}$ and $\Sigma_{yy}$ by
$$\Sigma_{xy} = \Sigma_{xx} B_*^T, \quad \Sigma_{yy} = B_* \Sigma_{xx} B_*^T + \Sigma_{\epsilon_y \epsilon_y}, \quad (8)$$
where the second equality follows from $E(\epsilon_y^T B_* x) = 0$. By the results in classical CCA, the first canonical vector of the $k$-th pair, $a_k$, is the $k$-th eigenvector of
$$\Sigma_{yy}^{-1} \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy} = \Sigma_{yy}^{-1} B_* \Sigma_{xx} B_*^T, \quad (9)$$
where the equality is because of (8). The second canonical vector $b_k$ is proportional to
$$\Sigma_{xx}^{-1} \Sigma_{xy} a_k = \Sigma_{xx}^{-1} \Sigma_{xx} B_*^T a_k = B_*^T a_k. \quad (10)$$
Note that in (9) and (10), we do not need to compute $\Sigma_{xx}^{-1}$. Therefore, we avoid the problem that $XX^T/n$ is singular. Thus, if $B_*$ were known, we could replace $\Sigma_{xx}$ and $\Sigma_{yy}$ in (9) by $XX^T/n$ and $YY^T/n$, respectively, to estimate the canonical vectors.

In practice, $B_*$ is rarely known. Therefore, we need to estimate $B_*$ in order to use (9) and (10) to obtain the canonical vectors. Note that $B_* \in \mathbb{R}^{d \times p}$ with $p \gg d$. One natural idea is to assume the coefficient matrix $B_*$ has some sparse structure, and to use the elementwise $l_1$ penalty as regularization, as in the Lasso (Tibshirani, 1996). To be specific, let $\hat{B}$ be an estimator of $B_*$.
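Identities (9) and (10) are the computational core of the approach, since they eliminate $\Sigma_{xx}^{-1}$, and they are easy to confirm numerically. The snippet below is purely illustrative, with all model pieces generated at random.

```python
import numpy as np

# Numerical check of identities (9)-(10): under the regression model
# y = B x + eps, no inverse of Sigma_xx is needed.
rng = np.random.default_rng(1)
p, d = 10, 3
C = rng.standard_normal((p, p))
S_xx = C @ C.T + np.eye(p)                 # covariance of x (pos. def.)
B = rng.standard_normal((d, p))            # plays the role of B_*
S_eps = np.eye(d)                          # covariance of the residual
S_xy = S_xx @ B.T                          # eq. (8)
S_yy = B @ S_xx @ B.T + S_eps              # eq. (8)

lhs = np.linalg.solve(S_yy, S_xy.T @ np.linalg.solve(S_xx, S_xy))  # matrix in (4)
rhs = np.linalg.solve(S_yy, B @ S_xx @ B.T)                        # matrix in (9)
print(np.allclose(lhs, rhs))               # True: the same d x d matrix

a = rng.standard_normal(d)                 # any vector in place of a_k
print(np.allclose(np.linalg.solve(S_xx, S_xy @ a), B.T @ a))       # (10): True
```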
We compute $\hat{B}$ by solving the optimization problem
$$\min_{B \in \mathbb{R}^{d \times p}} \sum_{i=1}^n \|Y_i - B X_i\|_2^2 + \lambda \|B\|_1, \quad (11)$$
where $\|B\|_1 = \sum_{j=1}^d \|\beta_j\|_1$ for $B = (\beta_1, \ldots, \beta_d)^T$, $\|\cdot\|_1$ is the $l_1$ norm, and $\lambda > 0$. Since
$$\sum_{i=1}^n \|Y_i - B X_i\|_2^2 + \lambda \|B\|_1 = \sum_{i=1}^n \sum_{j=1}^d (y_{ij} - \beta_j^T X_i)^2 + \lambda \sum_{j=1}^d \|\beta_j\|_1 = \sum_{j=1}^d \left( \sum_{i=1}^n (y_{ij} - \beta_j^T X_i)^2 + \lambda \|\beta_j\|_1 \right),$$
we can decompose (11) into $d$ Lasso problems,
$$\min_{\beta_j} \sum_{i=1}^n (y_{ij} - \beta_j^T X_i)^2 + \lambda \|\beta_j\|_1, \quad (12)$$
for $j = 1, \ldots, d$. Let $\hat{\beta}_j$ be the solution to (12), and let $\hat{B} = (\hat{\beta}_1, \ldots, \hat{\beta}_d)^T$. Then $\hat{B}$ is an estimator of $B_*$. By replacing $B_*$, $\Sigma_{xx}$, and $\Sigma_{yy}$ in (9) and (10) with $\hat{B}$, $XX^T/n$, and $YY^T/n$, respectively, we can obtain the $k$-th pair of estimated canonical vectors $\hat{a}_k$ and $\hat{b}_k$ as follows. The first estimated canonical vector $\hat{a}_k$ is the $k$-th eigenvector of
$$(YY^T)^{-1} \hat{B} X (\hat{B} X)^T,$$

Algorithm 1: Imbalanced sparse CCA
Input:
Observed data $X$ and $Y$.
Step 1: Solve (12) for $j = 1, \ldots, d$ to obtain the estimated coefficient matrix $\hat{B}$.
Step 2: Calculate the eigenvectors of $(
YY^T)^{-1} \hat{B} X (\hat{B} X)^T$. The $k$-th eigenvector $\hat{a}_k$ is the first estimated canonical vector in the $k$-th pair of estimated canonical vectors. The second estimated canonical vector is $\hat{b}'_k = \hat{B}^T \hat{a}_k$.
Step 3: Normalize $\hat{b}'_k$ as $\hat{b}_k$ such that $\|\hat{b}_k\|_2 = 1$.
Output:
The $k$-th pair of estimated canonical vectors $\hat{a}_k$ and $\hat{b}_k$.

and the second canonical vector $\hat{b}_k$ is proportional to $\hat{B}^T \hat{a}_k$. Algorithm 1 describes the procedure to obtain the $k$-th pair of estimated canonical vectors. Compared with existing methods, the proposed algorithm has the following advantages. First, we do not use any iteration in our algorithm, except in solving the Lasso, which has been well studied and optimized in the literature (Friedman et al., 2010). Therefore, Algorithm 1 is efficient, since we assume $d$ is small. Second, in Algorithm 1, it can be seen that we obtain multiple pairs of estimated canonical vectors simultaneously. This implies that we do not need to estimate multiple pairs of canonical vectors sequentially. In particular, if one is only interested in the $k$-th pair of canonical vectors, one does not need to know the $l$-th canonical vectors for $l < k$. We discuss the comparison between imbalanced sparse CCA and existing methodologies in greater detail in Section 3.

2.3 Theoretical Properties

In this subsection, we present theoretical results for imbalanced sparse CCA. We mainly focus on the consistency of the estimated canonical vectors. We first introduce some technical assumptions. In the rest of this work, we use the following definitions. For notational simplicity, we use
$C, C', C_1, C_2, \ldots$ and $K, K_1, K_2, \ldots$ to denote constants, whose values can change from line to line. For two positive sequences $s_n$ and $t_n$, we write $s_n \asymp t_n$ if, for some constants $C, C' > 0$, $C \le s_n/t_n \le C'$. Similarly, we write $s_n \gtrsim t_n$ if $s_n \ge C t_n$ for some constant $C >$
$0$, and $s_n \lesssim t_n$ if $s_n \le C' t_n$ for some constant $C' > 0$.

The first assumption concerns the covariance matrices $\Sigma_{xx}$ and $\Sigma_{\epsilon_y \epsilon_y}$.

Assumption 1.
Let $\lambda_{\max}(U)$ and $\lambda_{\min}(U)$ be the maximum and minimum eigenvalues of a matrix $U$, respectively. Assume there exist positive constants $K_1$ and $K_2$ such that
$$K_1 \le \min\{\lambda_{\min}(\Sigma_{xx}), \lambda_{\min}(\Sigma_{\epsilon_y \epsilon_y})\} \le \max\{\lambda_{\max}(\Sigma_{xx}), \lambda_{\max}(\Sigma_{\epsilon_y \epsilon_y})\} \le K_2.$$

Assumption 1 assures that the eigenvalues of the covariance matrices are bounded, which is a typical condition in high dimensional analysis; see Gao et al. (2017) and Chen et al. (2013) for example. The second assumption is on the coefficient matrix $B_*$.

Assumption 2. Suppose $B_*$ satisfies $\sigma_{\max}(B_*) \le K_3$ for some constant $K_3 > 0$, where $\sigma_{\max}(B_*)$ is the maximum singular value of $B_*$.

As a simple consequence of Assumptions 1 and 2, the eigenvalues of the covariance matrix $\Sigma_{yy}$ are bounded above by a constant, and bounded away from zero, as shown in the following proposition.

Proposition 1.
Suppose Assumptions 1 and 2 hold. Then there exist positive constants $K_4$ and $K_5$ such that $K_4 \le \lambda_{\min}(\Sigma_{yy}) \le \lambda_{\max}(\Sigma_{yy}) \le K_5$.

The following assumption is on the matrix $\Sigma_{yy}^{-1} B_* \Sigma_{xx} B_*^T$.

Assumption 3.
Let
$\Gamma = \Sigma_{yy}^{-1} B_* \Sigma_{xx} B_*^T$. Suppose the Schur decomposition of $\Gamma$ with respect to the $k$-th eigenvalue and eigenvector is
$$Q_k^T \Gamma Q_k = \begin{pmatrix} \lambda_k & v_k^T \\ 0 & T_k \end{pmatrix},$$
where $Q_k = [q_k, Q'_k] \in \mathbb{R}^{d \times d}$ is orthogonal (thus, $q_k$ is the $k$-th eigenvector of $\Gamma$). Assume there exist constants $\sigma_0 > 0$ and $K_6 > 0$ such that, for all $k = 1, \ldots, d$, $\sigma_k = \sigma_{\min}(T_k - \lambda_k I) > \sigma_0$ and $\|v_k\|_2 < K_6$, where $\sigma_{\min}(T_k - \lambda_k I)$ is the minimum singular value of $T_k - \lambda_k I$.

Assumption 3 imposes conditions on the matrix $\Sigma_{yy}^{-1} B_* \Sigma_{xx} B_*^T$ that ensure the numerical stability of the calculation of its eigenvectors. The singular value condition $\sigma_{\min}(T_k - \lambda_k I) > \sigma_0$ is necessary because, if $\lambda_k$ is a nondefective, repeated eigenvalue of $\Gamma$, there exist infinitely many eigenvectors corresponding to $\lambda_k$, and thus the consistency of eigenvectors cannot hold. Roughly speaking, Assumption 3 requires that the eigenvalues of $\Sigma_{yy}^{-1} B_* \Sigma_{xx} B_*^T$ are well separated.

The next assumption concerns the tail behavior of the variables $x$, $y$, and $\epsilon_y$.

Definition 1.
A vector $v = (v_1, \ldots, v_p)^T$ is sub-Gaussian if there exist positive constants $K$ and $\sigma$ such that $K^2 (E e^{v_i^2/K^2} - 1) \le \sigma^2$ holds for all $i \in \{1, \ldots, p\}$.

Assumption 4.
The random variables $x$, $y$, and $\epsilon_y$ are all sub-Gaussian. Furthermore, $\epsilon_y$ is independent of $x$.

The sub-Gaussian assumption in Assumption 4 is also typical in high dimensional analysis. As a simple example, $x \sim N(0, \Sigma_{xx})$ and $y \sim N(0, \Sigma_{yy})$ are sub-Gaussian, where $N(0, \Sigma)$ is a multivariate normal distribution with mean zero and covariance matrix $\Sigma$. The independence assumption between $\epsilon_y$ and $x$ is slightly stronger than the condition $E(\epsilon_y^T B_* x) = 0$, which can always be achieved by projecting $y$ onto $x$ if $x$ and $y$ are normally distributed.

Under Assumptions 1-4, we have the following consistency result, stated in Theorem 1. The proof of Theorem 1 can be found in Appendix C.

Theorem 1. Let $B_* = (\beta^*_1, \ldots, \beta^*_d)^T$ with $\beta^*_k = (\beta^*_{k1}, \ldots, \beta^*_{kp})^T$. Suppose Assumptions 1-4 hold. Furthermore, assume $\max_k \mathrm{supp}(\beta^*_k) = s^*$, $n^{-1/2} s^* \log p = o(1)$, and $\lambda \asymp \sqrt{n \log p}$, where $\mathrm{supp}(\beta^*_k) = \mathrm{card}(\{j \mid \beta^*_{kj} \ne 0\})$ and $\mathrm{card}(A)$ is the cardinality of a set $A$. Then, with probability at least $1 - C_1 d/p$,
$$\max\{\|a_k - \hat{a}_k\|_2, \|b_k - \hat{b}_k\|_2\} \lesssim \sqrt{d(d + s^*) \log p / n}, \quad (13)$$
for all $k = 1, \ldots, d$, where $C_1$ is a positive constant not depending on $n$.

From Theorem 1, it can be seen that if $d$ is small, then imbalanced sparse CCA provides consistent estimators of the canonical vectors, under a high dimensional setting with respect to the second random variable.

3 Comparison with Existing Methods

In this section, we first review some existing methods, and then compare imbalanced sparse CCA with them. As introduced in Section 2.1, when the dimension of $x$ or $y$ is larger than the sample size $n$, classical CCA cannot be directly applied.
One naive method to estimate the canonical vectors is to add diagonal matrices $\mu_Y I_d$ and $\mu_X I_p$, with $\mu_Y, \mu_X > 0$, such that $\Sigma_{YY} + \mu_Y I_d$ and $\Sigma_{XX} + \mu_X I_p$ are invertible, where $I_d$ and $I_p$ are identity matrices of size $d$ and $p$, respectively, $\Sigma_{YY} = YY^T/n$, and $\Sigma_{XX} = XX^T/n$. Following the terminology in spatial statistics (Stein, 1999) and computer experiments (Peng and Wu, 2014), we call $\mu_X$ and $\mu_Y$ "nugget" parameters, and call the corresponding method CCA with a nugget parameter. CCA with a nugget parameter provides the first canonical vector $a_\mu$ as an eigenvector of
$$(\Sigma_{YY} + \mu_Y I_d)^{-1} \Sigma_{YX} (\Sigma_{XX} + \mu_X I_p)^{-1} \Sigma_{XY}, \quad (14)$$
and the second canonical vector $b_\mu$ is proportional to
$$(\Sigma_{XX} + \mu_X I_p)^{-1} \Sigma_{XY} a_\mu, \quad (15)$$
where $\Sigma_{XY} = XY^T/n$ and $\Sigma_{YX} = \Sigma
_{XY}^T$. Although using a nugget parameter enables the matrix inverse, it may produce non-sparse canonical vectors, which may be hard to interpret. If the dimension of $Y$ is not large, then the non-sparse canonical vector $a_\mu$ is reasonable. However, since the dimension of $X$ is large, it is desirable to have a sparse canonical vector $b_\mu$, and CCA with a nugget parameter may not be appropriate.

Many other approaches to generalize classical CCA to high dimensional settings have been proposed. In these works, thresholding or regularization is introduced into the optimization problem (1). For example, Parkhomenko et al. (2009, 2007) and Waaijenborg et al. (2008) introduced soft-thresholding for each element of the canonical vectors; elements with small absolute value are forced to zero, and a sparse solution is obtained. Chen et al. (2013) introduced iterative thresholding to estimate the canonical vectors, and showed that the consistency of the estimated canonical vectors holds under the assumption that $\Sigma_{xx}$ and $\Sigma_{yy}$ (or their inverses) are sparse. Tenenhaus and Tenenhaus (2011) proposed a regularized generalized CCA, where the constraints on the canonical vectors are changed to $\tau_1 a^T \Sigma_{YY} a + (1 - \tau_1)\|a\|_2^2 = 1$ and $\tau_2 b^T \Sigma_{XX} b + (1 - \tau_2)\|b\|_2^2 = 1$, where $\tau_1, \tau_2 \in [0,
1] are two tuning parameters.

Regularization-based sparse CCA usually has the form (6). This method was proposed by Witten et al. (2009), and has been extended by Witten and Tibshirani (2009). An algorithm based on Witten and Tibshirani (2009) was proposed by Lee et al. (2011b). In Waaijenborg et al. (2008), the elastic net was also used to obtain sparsity of the estimated canonical vectors. Chen et al. (2012) modified sparse CCA as in Witten and Tibshirani (2009) by adding a structure-based constraint on the canonical vectors. Gao et al. (2017) proposed a method called convex program with group-Lasso refinement, which is a two-stage method based on the group Lasso, and they proved the consistency of the estimated canonical variables.

Another type of sparse CCA method works via reformulation. In Hardoon and Shawe-Taylor (2011), it was shown that, based on a primal-dual framework, (2) is equivalent to the problem
$$\min_{w,e} \|X^T w - Y^T Y e\|_2^2, \quad (16)$$
subject to $\|Y^T Y e\|_2 = 1$, in the sense that $(w, e)$ is the solution to (16) if and only if there exist $\mu, \gamma$ such that $(\mu w, \gamma Y e)$ is the solution to (2). Then, by imposing $l_1$ regularization on $w$ and $e$, sparse canonical vectors can be obtained. Recent work by Mai and Zhang (2019) reformulated (6) into a constrained quadratic optimization problem, and proposed an iterative penalized least squares algorithm to solve it. Theoretical guarantees on the consistency of the estimated canonical vectors were also presented in Mai and Zhang (2019).

Our proposed method, imbalanced sparse CCA, is substantially different from all existing methods in the following respects. First, the scope of imbalanced sparse CCA differs from that of sparse CCA: in imbalanced sparse CCA, we target the estimation of canonical vectors under the setting $d < n \ll p$.
Although the case of $d < n \ll p$ can be covered by the methods mentioned above, these methods do not utilize the partially low dimensional structure; imbalanced sparse CCA does. Many of the methods mentioned above need to solve a Lasso problem in each iteration of the algorithm until convergence; see Mai and Zhang (2019) and Waaijenborg et al. (2008) for example. In imbalanced sparse CCA, we only need to solve $d$ Lasso optimization problems, which is fast, since $d$ is small and the Lasso can be solved efficiently by existing algorithms.

Second, because of this low dimensional structure, imbalanced sparse CCA can provide $K$ pairs of canonical vectors simultaneously, and does not need to obtain the $K$ pairs one by one. In particular, if one is only interested in the $k$-th pair of canonical vectors for $k >$
1, one does not need to know the $l$-th canonical vectors for $l < k$. The other methods mentioned above usually need to solve a separate optimization problem to obtain each additional pair of canonical vectors, and must know the $l$-th pairs for all $l < k$ in order to get the $k$-th pair. This advantage makes imbalanced sparse CCA particularly powerful when researchers need to estimate a relatively large number of pairs of canonical vectors.

Third, Mai and Zhang (2019) mention that one advantage of their method is that it requires no assumptions on the covariance matrices. In our algorithm, we only use the estimated covariance matrix in the last step. Therefore, our approach also does not require assumptions on the covariance matrices.

Fourth, because we solve the $d$ Lasso problems separately in imbalanced sparse CCA, our method naturally allows parallel computing, which can make our algorithm even more efficient. Parallel computing makes imbalanced sparse CCA useful when $p$, $n$, and $d$ are extremely large. This trivially parallel computation is not available to most of the methods mentioned above, where iteration is required.

Fifth, imbalanced sparse CCA has theoretical guarantees, and its theoretical development differs considerably from that of the other methods mentioned above. The theory underlying imbalanced sparse CCA involves not only statistical theory, but also results from numerical linear algebra. In our theoretical development, we deal with the eigenvalues and eigenvectors directly. Because we can obtain multiple eigenvectors at one time, we can obtain multiple canonical vectors simultaneously. Therefore, our theory and method may inspire new directions of analysis and new methodologies for sparse CCA, and the theory itself may be of interest to researchers studying sparse CCA.
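To make the comparison concrete, Algorithm 1 in its entirety is $d$ Lasso fits followed by a single $d \times d$ eigendecomposition. The following is an illustrative Python sketch (the paper's experiments use R packages; the simple coordinate-descent Lasso solver and the tuning constant for $\lambda$ below are our own stand-ins):

```python
import numpy as np

def lasso_cd(D, y, lam, n_sweeps=100):
    """Coordinate descent for min_b ||y - D b||^2 + lam * ||b||_1,
    i.e., one of the d separate Lasso problems in (12)."""
    n, p = D.shape
    beta, r = np.zeros(p), y.astype(float).copy()   # r = y - D @ beta
    col_sq = (D ** 2).sum(axis=0) + 1e-12
    for _ in range(n_sweeps):
        for j in range(p):
            c = D[:, j] @ r + col_sq[j] * beta[j]
            new = np.sign(c) * max(abs(c) - lam / 2.0, 0.0) / col_sq[j]
            r += D[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def imbalanced_scca(X, Y, lam):
    """Algorithm 1 (isCCA).  X is p x n, Y is d x n, columns are
    observations.  Returns all d estimated pairs at once."""
    d, n = Y.shape
    B_hat = np.vstack([lasso_cd(X.T, Y[j], lam) for j in range(d)])  # Step 1
    Z = B_hat @ X
    M = np.linalg.solve(Y @ Y.T, Z @ Z.T)           # (Y Y^T)^{-1} BX (BX)^T
    vals, vecs = np.linalg.eig(M)
    A_hat = vecs.real[:, np.argsort(-vals.real)]    # columns: a_1, ..., a_d
    B_vecs = B_hat.T @ A_hat                        # b'_k = B^T a_k  (Step 2)
    B_vecs /= np.maximum(np.linalg.norm(B_vecs, axis=0), 1e-12)      # Step 3
    return A_hat, B_vecs

# Imbalanced toy data, p >> d, with a sparse coefficient matrix.
rng = np.random.default_rng(2)
p, d, n, sigma = 100, 4, 300, 0.1
B_star = np.zeros((d, p))
B_star[:, :d] = np.diag([1.0, 0.8, 0.6, 0.4])       # assumed sparse B_*
X = rng.standard_normal((p, n))
Y = B_star @ X + sigma * rng.standard_normal((d, n))
lam = 2 * sigma * np.sqrt(2 * n * np.log(p))        # assumed lambda ~ sqrt(n log p)
A_hat, B_hat_vecs = imbalanced_scca(X, Y, lam)
rho = np.corrcoef(Y.T @ A_hat[:, 0], X.T @ B_hat_vecs[:, 0])[0, 1]
```

In practice $\lambda$ would be selected on a validation set, as described in Section 4, and the $d$ Lasso fits can be run in parallel.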
4 Simulation Studies

In this section, we conduct numerical studies of imbalanced sparse CCA and sparse CCA methods. We compare imbalanced sparse CCA (isCCA) with CCA with a nugget parameter (nCCA) as in (14) and (15), CCA with an $l_1$ penalty ($l_1$-CCA) (Witten et al., 2009), sCCA (Lee et al., 2011b), and regularized and sparse generalized CCA (rgCCA) (Tenenhaus and Tenenhaus, 2011).

In our simulation study, we first simulate $X \sim N(0, \Sigma_{xx})$ and $Y \sim N(0, \Sigma_{yy})$, with
$$\Sigma_{xx} = A A^T + 0.1 I_p, \quad \Sigma_{yy} = B_* \Sigma_{xx} B_*^T + \sigma^2 I_d, \quad \Sigma_{xy} = \Sigma_{xx} B_*^T, \quad (17)$$
where $B_* \in \mathbb{R}^{d \times p}$ is a sparse matrix, $A \in \mathbb{R}^{p \times d}$, $\sigma^2 > 0$, and $I_p$ and $I_d$ are identity matrices of size $p$ and $d$, respectively.

Given $\Sigma_{xx}$, $B_*$, and $\sigma^2$, we can calculate the $k$-th pair of true canonical vectors by (4) and (5), denoted by $a_k$ and $b_k$, respectively. In the numerical simulation, we mainly focus on the first pair of canonical vectors $a_1$ and $b_1$. We use the R packages PMA (Witten and Tibshirani, 2020), sCCA (Lee et al., 2011a), and
RGCCA (Tenenhaus and Guillemot, 2017) to implement $l_1$-CCA, sCCA, and rgCCA, respectively. For isCCA and nCCA, we generate an independent validation set $X_v$ and $Y_v$ with the same sample size as the training set. The validation set is used to select the tuning parameters. Specifically, let $\lambda_1, \ldots, \lambda_m$ be candidate tuning parameters, and let $(\hat{a}'_{1,1}, \hat{b}'_{1,1}), \ldots, (\hat{a}'_{1,m}, \hat{b}'_{1,m})$ be the canonical vectors obtained using $\lambda_1, \ldots, \lambda_m$, respectively. We compute $\mathrm{corr}(Y_v^T \hat{a}'_{1,j}, X_v^T \hat{b}'_{1,j})$ for $j = 1, \ldots, m$, and choose $k^* = \mathrm{argmax}_{1 \le j \le m}\, \mathrm{corr}(Y_v^T \hat{a}'_{1,j}, X_v^T \hat{b}'_{1,j})$. The tuning parameter is then chosen to be $\lambda_{k^*}$, and the first pair of estimated canonical vectors is $\hat{a}_1 = \hat{a}'_{1,k^*}$ and $\hat{b}_1 = \hat{b}'_{1,k^*}$. After obtaining the estimated canonical vectors, we compare the $l_2$ errors $\|\hat{a}_1 - a_1\|_2$ and $\|\hat{b}_1 - b_1\|_2$ for all five methods.

Note that in (17), the parameter $\sigma^2$ controls the correlation between $x$ and $y$; roughly speaking, a larger $\sigma^2$ leads to a smaller correlation between $x$ and $y$. Therefore, we choose $\sigma^2 = 0.1k$, for $k = 3, 4, \ldots,$
30, to see how $\|\hat{a}_1 - a_1\|_2$ and $\|\hat{b}_1 - b_1\|_2$ change as the correlation between $x$ and $y$ changes. For each $k$, we run $N = 50$ replicates. For the $j$-th replicate, we compute the estimated canonical vectors $\hat{a}_{1,j}$ and $\hat{b}_{1,j}$, and use
$$(\hat{E}\|\hat{a}_1 - a_1\|_2^2)^{1/2} = \sqrt{\frac{1}{N}\sum_{j=1}^N \|\hat{a}_{1,j} - a_1\|_2^2}, \quad (\hat{E}\|\hat{b}_1 - b_1\|_2^2)^{1/2} = \sqrt{\frac{1}{N}\sum_{j=1}^N \|\hat{b}_{1,j} - b_1\|_2^2}$$
to approximate the root mean squared prediction errors (RMSE) $(E\|\hat{a}_1 - a_1\|_2^2)^{1/2}$ and $(E\|\hat{b}_1 - b_1\|_2^2)^{1/2}$, respectively.

We consider two cases, in which the matrix $B_*$ differs. In both cases, we use the sample size $n = 500$. The matrix $A = (\alpha_{jk})_{jk}$ is randomly generated, with each entry drawn independently as $\alpha_{jk} \sim$ Unif(0,
2) with probability 0.3, and $\alpha_{jk} = 0$ with probability 0.7, where Unif(0,
2) is the uniform distribution on the interval $[0, 2]$.

Case 1:
In (17), we choose $B_* = (B_1, B_2)$, where $B_1 \in \mathbb{R}^{d \times d}$ is a tridiagonal matrix and $B_2 \in \mathbb{R}^{d \times (p-d)}$ is a zero matrix. The results under different $(p, d)$ are shown in Figures 1-3.
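Both simulation cases share the generating mechanism (17); a minimal sketch of the data-generating step follows (illustrative only: the paper's tridiagonal entries for $B_1$ are not reproduced here, so placeholder off-diagonal values of 0.5 are used).

```python
import numpy as np

def make_covariances(p, d, sigma2, rng):
    """Simulation design (17): Sigma_xx = A A^T + 0.1 I_p,
    Sigma_yy = B Sigma_xx B^T + sigma2 I_d, Sigma_xy = Sigma_xx B^T,
    with entries of A drawn from Unif(0, 2) with probability 0.3."""
    A = rng.uniform(0.0, 2.0, size=(p, d)) * (rng.random((p, d)) < 0.3)
    S_xx = A @ A.T + 0.1 * np.eye(p)
    # Case-1-style B_*: a d x d tridiagonal block next to a zero block.
    # The off-diagonal value 0.5 is a placeholder, not the paper's choice.
    B1 = np.eye(d) + 0.5 * np.eye(d, k=1) + 0.5 * np.eye(d, k=-1)
    B = np.hstack([B1, np.zeros((d, p - d))])
    S_yy = B @ S_xx @ B.T + sigma2 * np.eye(d)
    S_xy = S_xx @ B.T
    return S_xx, S_yy, S_xy, B

rng = np.random.default_rng(4)
S_xx, S_yy, S_xy, B = make_covariances(p=100, d=5, sigma2=0.5, rng=rng)
# X ~ N(0, S_xx) and Y ~ N(0, S_yy) are then sampled, and the true
# canonical pairs follow from (4)-(5).
```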
Case 2: We choose $B_* = (B_1, B_2)$ in (17), where $B_1 \in \mathbb{R}^{d \times d}$ and $B_2 \in \mathbb{R}^{d \times (p-d)}$ is a zero matrix. The results under different $(p, d)$ are shown in Figures 4-6.

From Figures 1-6, we can see that sCCA performs well in most cases when estimating the first canonical vector $a_1$. isCCA is comparable to rgCCA in the estimation of $a_1$. $l_1$-CCA cannot provide a consistent estimator of $a_1$. nCCA does not perform well in Case 2. However,

Figure 1:
The approximated root mean squared prediction error $(\hat{E}\|\hat{a}_1 - a_1\|_2^2)^{1/2}$ for Case 1. Panel 1: $p = 100$, $d = 5$. Panel 2: $p = 100$, $d = 10$. Panel 3: $p = 500$, $d = 3$. Panel 4: $p = 1{,}000$, $d = 3$.

when we turn to the estimation of the second canonical vector $b_1$ in the first pair of canonical vectors, we can see that nCCA, $l_1$-CCA, sCCA, and rgCCA cannot provide a consistent estimator. Therefore, the average errors of these methods shown in Figure 3 and Figure 6 are large. This indicates that these methods are not appropriate when the dimensions of $x$ and $y$ are quite different, because they do not utilize the low dimensional structure of $y$. isCCA works well in the estimation of $b_1$. We can also see that the prediction error of isCCA increases as $\sigma^2$ increases, which is natural because the accuracy of the estimation of the coefficients using (11) is influenced by the variance $\sigma^2$.

Figure 2:
The approximated root mean squared prediction error $(\hat{E}\|\hat{b}_1 - b_1\|_2^2)^{1/2}$ for Case 1. Panel 1: $p = 100$, $d = 5$. Panel 2: $p = 100$, $d = 10$. Panel 3: $p = 500$, $d = 3$. Panel 4: $p = 1{,}000$, $d = 3$.

5 Real Data Applications

We applied imbalanced sparse CCA to a microbiome study conducted at the University of Pennsylvania (Chen et al., 2012). The study profiled 16S rRNA in the human gut and measured components of nutrient intake using a food frequency questionnaire for 99 healthy people. Microbiome OTUs were consolidated at the genus level, with $d = 40$ relatively

Figure 3:
The average of the approximated root mean squared prediction errors $((\hat{E}\|\hat{a}_1 - a_1\|_2^2)^{1/2} + (\hat{E}\|\hat{b}_1 - b_1\|_2^2)^{1/2})/2$ for Case 1. Panel 1: $p = 100$, $d = 5$. Panel 2: $p = 100$, $d = 10$. Panel 3: $p = 500$, $d = 3$. Panel 4: $p = 1{,}000$, $d = 3$.

common genera considered (i.e., $Y$ was a $40 \times$
99 OTU abundance matrix). Following Chenet al. (2012), the daily intake for p = 214 nutrients was calculated for each person, andregressed upon energy consumption, and the residuals used as a processed nutrient intake214 ×
99 matrix X .ssCCA (Chen et al., 2012) identified 24 nutrients and 14 genera whose linear combinationsgave a cross-validated canonical correlation of 0.42 between gut bacterial abundance andnutrients. Imbalanced sparse CCA reached a canonical correlation of 0.60. To test thecanonical correlation between gut bacterial abundance and nutrients, we permuted columns of14 .5 1.0 1.5 2.0 2.5 3.0 . . . . . variance s R M SE isCCAnCCAl_1−CCAsCCArgCCA . . . . . variance s R M SE isCCAnCCAl_1−CCAsCCArgCCA . . . . . variance s R M SE isCCAnCCAl_1−CCAsCCArgCCA . . . . . variance s R M SE isCCAnCCAl_1−CCAsCCArgCCA Figure 4:
The approximated root mean squared prediction error $(\hat E\|\hat a - a\|_2^2)^{1/2}$ for Case 2. Panel 1: $p = 100$, $d = 5$. Panel 2: $p = 100$, $d = 10$. Panel 3: $p = 500$, $d = 3$. Panel 4: $p = 1{,}000$, $d = 3$.

the nutrient matrix 1,000 times and calculated the canonical correlation between the permuted matrices using the five CCA methods described in Section 4. These correlations constitute a null distribution for each method, against which we compared the corresponding observed canonical correlation. The isCCA method was significant at the 0.05 level, with $p$-value 0.025. Of the remaining methods, only sCCA and nCCA (with a large nugget parameter) also provided significant $p$-values. However, the results from nCCA appeared highly sensitive to the nugget parameter, and a range of nugget parameter choices produced nonsignificant $p$-values. $l_1$-CCA and rgCCA did not appear to provide insightful results for this dataset. The heatmap of the covariance matrix

Figure 5:
The approximated root mean squared prediction error $(\hat E\|\hat b - b\|_2^2)^{1/2}$ for Case 2. Panel 1: $p = 100$, $d = 5$. Panel 2: $p = 100$, $d = 10$. Panel 3: $p = 500$, $d = 3$. Panel 4: $p = 1{,}000$, $d = 3$.

$(XY^T)$ is shown in Figure 7. The marginal plots of the absolute values of $\hat a$ and $\hat b$ provide insight into the relative weighting of OTUs and nutritional components toward the overall canonical correlation, i.e., larger values correspond to greater weight for that component.

The GTEx project offers an opportunity to explore the relationship between imaging and gene expression, while also considering the effect of a clinically relevant trait. We obtained

Figure 6:
The average of the approximated root mean squared prediction errors $\{(\hat E\|\hat a - a\|_2^2)^{1/2} + (\hat E\|\hat b - b\|_2^2)^{1/2}\}/2$ for Case 2. Panel 1: $p = 100$, $d = 5$. Panel 2: $p = 100$, $d = 10$. Panel 3: $p = 500$, $d = 3$. Panel 4: $p = 1{,}000$, $d = 3$.

Figure 7: The heatmap of the covariance matrix between gut bacterial abundance and nutrients.

processed image feature data ($Y$) and gene expression data ($X$) on these 570 overlapping subjects, with 37 cases of Hashimoto's thyroiditis.

Figure 8:
Examples of Hashimoto’s thyroiditis negative/positive GTEx samples.
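For concreteness, the permutation test used in the microbiome analysis above (and again for the eQTL analyses below) can be sketched as follows. This is an illustrative sketch, not the implementation used in the paper; `cca_corr` is a hypothetical placeholder for any of the five CCA fits (isCCA, nCCA, sCCA, $l_1$-CCA, rgCCA), each returning the first canonical correlation.

```python
import numpy as np

def permutation_pvalue(X, Y, cca_corr, n_perm=1000, seed=0):
    """Permutation null for a canonical correlation.

    X : (p, n) feature matrix (e.g., nutrients); Y : (d, n) matrix
    (e.g., OTU abundances); columns index the n subjects.
    `cca_corr` is a stand-in for any CCA fit returning the first
    canonical correlation.
    """
    rng = np.random.default_rng(seed)
    observed = cca_corr(X, Y)
    null = np.empty(n_perm)
    for b in range(n_perm):
        # permuting the columns of X breaks the X-Y pairing across subjects
        null[b] = cca_corr(X[:, rng.permutation(X.shape[1])], Y)
    # one-sided p-value with the usual +1 correction for the observed statistic
    pval = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, pval
```

The observed correlation is compared against the null distribution formed by the permuted fits, exactly as described for the 1,000 column permutations above.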
We applied imbalanced sparse CCA, along with the other four methods in Section 4, to study the correlation between the processed image feature matrix and the gene expression data. Since imbalanced sparse CCA works well when $Y$ is low dimensional, we first applied principal component analysis (PCA) to reduce the dimension of $Y$. We used the first half of the principal components, denoted by $U = (u_1, \ldots, u_d)$, and performed imbalanced sparse canonical correlation analysis on the transformed variables $UY$ and $X$. We randomly split the data into a training set $(Y^{train}, X^{train})$ and a testing set $(Y^{test}, X^{test})$ with ratio 5:1, repeated 500 times. For each run, we obtained the first pair of estimated canonical vectors $\hat a$ and $\hat b$ by the methods mentioned in Section 4, and compared the correlations on the testing set, $corr((Y^{test})^T\hat a, (X^{test})^T\hat b)$. We found that nCCA is very sensitive to the value of the nugget parameter, so we did not include it in the comparison. The results obtained by isCCA, $l_1$-CCA, sCCA, and rgCCA are shown in Figure 9.

From Figure 9, we can see that rgCCA does not provide reliable estimates of the canonical variables in this study, and sCCA yields a smaller correlation between the estimated canonical variables on the testing data. isCCA is slightly better than $l_1$-CCA. In some cases, isCCA produces relatively small correlations between the estimated canonical variables on the testing data. This may be because we fixed the number of principal components; a further study on adaptively choosing the number of principal components is needed.

To explore the effect of a clinical phenotype on isCCA, we performed isCCA separately on the set of individuals without Hashimoto's thyroiditis (median isCCA of 0.578, nearly the same as for the full dataset) and on the individuals with Hashimoto's thyroiditis (median isCCA of 0.375).
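The repeated train/test evaluation described above can be sketched as follows. This is a minimal sketch under stated assumptions: `fit_cca` is a hypothetical placeholder for any of the compared methods, and the PCA preprocessing of $Y$ is assumed to have been done beforehand.

```python
import numpy as np

def held_out_correlation(X, Y, fit_cca, n_splits=500, seed=0):
    """Repeated 5:1 train/test evaluation of an estimated canonical pair.

    X : (p, n), Y : (d, n); `fit_cca` is a stand-in returning the first
    estimated canonical pair (a_hat in R^d, b_hat in R^p), fitted on the
    training columns only.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    n_test = n // 6                      # 5:1 train/test ratio
    out = np.empty(n_splits)
    for s in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        a_hat, b_hat = fit_cca(X[:, train], Y[:, train])
        # correlation of the canonical variates on held-out subjects
        u, v = a_hat @ Y[:, test], b_hat @ X[:, test]
        out[s] = np.corrcoef(u, v)[0, 1]
    return out
```

The returned vector of 500 held-out correlations is what the boxplots in Figure 9 summarize for each method.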
The dramatic change in estimated correlation by case/control status provides a window into potential additional uses of sparse CCA methods, e.g., using the contrast in canonical correlation by phenotype to improve omics-based phenotype prediction.

Figure 9:
Boxplots of the GTEx thyroid cross-validated canonical correlations between the processed image feature matrix and the gene expression data.

We considered a genotype matrix $X$ with $p$ SNPs and $n$ samples and an expression matrix $Y$ with $d$ genes and $n$ samples, with $p > n > d$. For each method, a permutation-based $p$-value was computed; the $p$-values were 0. , 0. , 0. , 0.14, and 0.81 for isCCA, nCCA, sCCA, $l_1$-CCA, and rgCCA, respectively. Three genes, KRT13, KRT4, and KRT5, showed values of $|\hat a|$ much larger than those of the remaining genes, while the values of $\hat b$ are spread more uniformly across the SNPs. Keratins are important for the mechanical stability and integrity of epithelial cells and liver tissues. They play a role in protecting liver cells from apoptosis, against stress, and from injury, and keratin defects may predispose to liver diseases (Moll et al., 2008). Figure 10 shows heatmap plots and Manhattan-style line plots of the absolute values of $\hat a$ and $\hat b$. For a second analysis, the permutation-based $p$-values were 0.003, 0.16, 0.52, 0.35, and 0.66 for isCCA, nCCA, sCCA, $l_1$-CCA, and rgCCA, respectively. Again, three genes, ACSS3, BDH2, and BDH1, showed $|\hat a|$ of greater magnitude than the others. Ketone bodies are metabolites derived from fatty and amino acids and are mainly produced in the liver. Inborn errors of metabolism are known both in the biosynthesis of ketone bodies (ketogenesis) and in ketone body utilization (ketolysis), resulting in various metabolic diseases (Sass, 2012).

In this work, we proposed imbalanced sparse canonical correlation analysis, which can be applied to data where the dimensions of the two variables are very different. Our method provides $K$ pairs of canonical vectors simultaneously for $K >$
1, and can be implemented by an efficient algorithm based on the Lasso. The implementation is straightforward, and the computation time can be reduced substantially when parallel computing is available. We show the consistency of the estimated canonical pairs in the case where the dimension of one variable can increase at an exponential rate relative to the sample size; the dimension of the other variable should be smaller than the sample size. We also present numerical studies and real data analyses to validate our methodology.

We consider the imbalanced case, where the dimensions of the two variables differ greatly. In practice, there are also balanced cases, i.e., the dimensions of the two variables are comparable but both much larger than the sample size. One straightforward potential extension is to apply two sets of Lasso problems. Specifically, we might set $Y$ as the dependent variable and $X$ as the independent variable in the first set of Lasso problems, and set $X$ as the dependent variable and $Y$ as the independent variable in the second set. However, the number of Lasso optimizations is then very large, which makes the algorithm inefficient. One possible remedy is to apply principal component analysis to reduce the dimension of one variable; this approach showed its potential in the real data analysis, although theoretical justification is currently lacking. Imbalanced sparse canonical correlation analysis also shows great potential for prediction problems, since it provides $K$ pairs of canonical vectors simultaneously for $K > 1$. These possible extensions of imbalanced sparse canonical correlation analysis to the balanced case will be pursued in future work.
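As a concrete illustration of the estimation pipeline ($d$ Lasso regressions of the rows of the low-dimensional $Y$ on $X$ to form $\hat B$, then $\hat a$ as the leading eigenvector of $(YY^T)^{-1}\hat BX(\hat BX)^T$ and $\hat b = \hat B^T\hat a/\|\hat B^T\hat a\|_2$, the construction appearing in the proofs below), the first canonical pair can be sketched as follows. This is a minimal sketch, not the authors' released implementation; the penalty level `alpha` and the use of scikit-learn's `Lasso` are illustrative choices, and in practice `alpha` would be chosen by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def iscca_first_pair(X, Y, alpha=0.1):
    """Sketch of the first imbalanced sparse CCA pair.

    X : (p, n) high-dimensional block; Y : (d, n) low-dimensional block;
    columns are the n samples, and d < n << p is assumed.
    """
    d, n = Y.shape
    # Step 1: d independent Lasso regressions of each row of Y on X
    # (these are embarrassingly parallel, as noted in the discussion)
    B_hat = np.vstack([
        Lasso(alpha=alpha, fit_intercept=False).fit(X.T, Y[k]).coef_
        for k in range(d)
    ])                                   # shape (d, p)
    # Step 2: leading eigenvector of (Y Y^T)^{-1} (B X)(B X)^T, a d x d problem
    BX = B_hat @ X                       # shape (d, n)
    M = np.linalg.solve(Y @ Y.T, BX @ BX.T)
    eigvals, eigvecs = np.linalg.eig(M)
    a_hat = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    # Step 3: b_hat = B^T a / ||B^T a||
    b_hat = B_hat.T @ a_hat
    b_hat /= np.linalg.norm(b_hat)
    return a_hat, b_hat
```

Because the eigendecomposition is only $d \times d$, the cost is dominated by the $d$ Lasso fits, which is what makes the imbalanced structure computationally attractive.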
Appendix

A Notation
We first present some notation used in the Appendix. Let $A = (a_{ij})_{ij} \in \mathbb{R}^{m \times n}$. Let $\|A\|_p$ be the $p$-norm of the matrix $A$, defined by
$$\|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p},$$
where $\|x\|_p$ is the $l_p$ norm of a vector $x \in \mathbb{R}^n$. With an abuse of notation, we use $\|\cdot\|_p$ both for the $p$-norm of a matrix and for the $l_p$ norm of a vector. Let $\|A\|_{\max} = \max_{i,j} |a_{ij}|$. In the special cases $p = 1, 2, \infty$,
$$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^m |a_{ij}|, \qquad \|A\|_2 = \sigma_{\max}(A), \qquad \|A\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^n |a_{ij}|,$$
where $\sigma_{\max}(A)$ is the largest singular value of the matrix $A$. These matrix norms are equivalent, as implied by the following inequalities:
$$\frac{1}{\sqrt{n}}\|A\|_\infty \le \|A\|_2 \le \sqrt{m}\|A\|_\infty, \qquad \frac{1}{\sqrt{m}}\|A\|_1 \le \|A\|_2 \le \sqrt{n}\|A\|_1, \qquad \|A\|_{\max} \le \|A\|_2 \le \sqrt{mn}\|A\|_{\max}.$$

B Proof of Proposition 1
Recall from (8) that
$$\Sigma_{yy} = B_*\Sigma_{xx}B_*^T + \Sigma_{\epsilon_y\epsilon_y}.$$
By Weyl's theorem (Horn and Johnson (2012), Theorem 4.3.1), we have
$$\lambda_{\min}(\Sigma_{yy}) \ge \lambda_{\min}(\Sigma_{\epsilon_y\epsilon_y}) + \lambda_{\min}(B_*\Sigma_{xx}B_*^T) \ge \lambda_{\min}(\Sigma_{\epsilon_y\epsilon_y}) \ge K_1,$$
where the last inequality is because of Assumption 1. Using Weyl's theorem again, we can bound the largest eigenvalue $\lambda_{\max}(\Sigma_{yy})$ by
$$\lambda_{\max}(\Sigma_{yy}) \le \lambda_{\max}(\Sigma_{\epsilon_y\epsilon_y}) + \lambda_{\max}(B_*\Sigma_{xx}B_*^T) \le \lambda_{\max}(\Sigma_{\epsilon_y\epsilon_y}) + \|B_*\|_2^2\,\lambda_{\max}(\Sigma_{xx}) \le K_2,$$
where the last inequality is because of Assumptions 1 and 2. This finishes the proof.

C Proof of Theorem 1
We first present some lemmas used in this proof. Lemma C.1 states the consistency of the $\hat\beta_k$ obtained by (12). Lemma C.2 is the Bernstein inequality. Lemma C.3 is a concentration inequality for sub-Gaussian random vectors. Lemma C.4 describes the accuracy of solving perturbed linear systems; see Theorem 2.7.3 in Van Loan and Golub (1983). Lemma C.5 states the eigenvector sensitivity under a perturbation of a matrix, which is a slight recasting of Theorem 4.11 in Stewart (1973); see also Corollary 7.2.6 in Van Loan and Golub (1983).

Lemma C.1.
Suppose the conditions of Theorem 1 hold. Then with probability at least $1 - C_1 d/p$,
$$\max_{1 \le k \le d} \|\hat\beta_k - \beta^*_k\|_1 \le C_2 s^* \sqrt{\frac{\log p}{n}}, \qquad \max_{1 \le k \le d} \|\hat\beta_k - \beta^*_k\|_2 \le C_3 \sqrt{\frac{s^*\log p}{n}}.$$
In addition,
$$\max_{1 \le k \le d} (\hat\beta_k - \beta^*_k)^T H_X (\hat\beta_k - \beta^*_k) \lesssim s^*\log p/n,$$
where $H_X = n^{-1}\sum_{i=1}^n X_iX_i^T$.

Proof. By Lemma B.3 of Ning and Liu (2017), we have for a fixed $k$, with probability at least $1 - C_4/p$,
$$\|\hat\beta_k - \beta^*_k\|_1 \le C_2 s^*\sqrt{\frac{\log p}{n}}, \qquad \|\hat\beta_k - \beta^*_k\|_2 \le C_3\sqrt{\frac{s^*\log p}{n}}, \qquad (\hat\beta_k - \beta^*_k)^T H_X(\hat\beta_k - \beta^*_k) \lesssim \frac{s^*\log p}{n}.$$
The results then follow from the union bound inequality.
Lemma C.2.
Let the $X_i$'s be independent mean-zero sub-Gaussian variables. There exists a constant $C > 0$ such that for any $t > 0$,
$$P\left(\frac{1}{n}\left|\sum_{i=1}^n X_i\right| \ge t\right) \le 2\exp(-Cnt^2).$$

Lemma C.3.
Let $Q_i \in \mathbb{R}^d$ be sub-Gaussian random vectors for $i = 1, \ldots, n$. We have
$$P\big(\|H_Q - E(QQ^T)\|_{\max} \ge t\big) \le C_1 d^2\exp(-C_2 nt^2),$$
for some constants $C_1, C_2 > 0$, where $H_Q = n^{-1}\sum_{i=1}^n Q_iQ_i^T$.

Proof. The results follow from the union bound inequality and Lemma C.2.
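As a quick numerical illustration of Lemma C.3 (not part of the proof), the following sketch draws standard Gaussian vectors, a particular sub-Gaussian case with $E(QQ^T) = I_d$, and shows the entrywise deviation of $H_Q$ shrinking as $n$ grows.

```python
import numpy as np

def max_entry_deviation(n, d=5, seed=0):
    """Max-norm deviation of H_Q = n^{-1} sum_i Q_i Q_i^T from E[QQ^T] = I_d
    for n standard Gaussian vectors in R^d."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((n, d))       # rows are the Q_i
    H = Q.T @ Q / n
    return np.abs(H - np.eye(d)).max()

# the deviation shrinks roughly like sqrt(log d / n) as n grows
small_n, large_n = max_entry_deviation(50), max_entry_deviation(5000)
```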
Lemma C.4.
Let $A, \tilde A \in \mathbb{R}^{d \times d}$ and $b, \tilde b \in \mathbb{R}^d$. Suppose $Ax = b$ and $\tilde A\tilde x = \tilde b$ with $\|\tilde A - A\|_2 \le \delta\|A\|_2$, $\|\tilde b - b\|_2 \le \delta\|b\|_2$, and $\delta\,\kappa(A) = r < 1$ for some $\delta > 0$. Then $\tilde A$ is non-singular,
$$\frac{\|\tilde x\|_2}{\|x\|_2} \le \frac{1+r}{1-r}, \qquad \frac{\|\tilde x - x\|_2}{\|x\|_2} \le \frac{2\delta}{1-r}\,\kappa(A),$$
where $\kappa(A) = \|A\|_2\|A^{-1}\|_2$.

Lemma C.5.
Let $A, E \in \mathbb{R}^{d \times d}$ and let $Q = [q_1, Q_2] \in \mathbb{R}^{d \times d}$ be orthogonal, where $q_1 \in \mathbb{R}^d$. Let
$$Q^TAQ = \begin{bmatrix} \lambda & v^T \\ 0 & T \end{bmatrix}, \qquad Q^TEQ = \begin{bmatrix} \epsilon & r^T \\ \delta & E_{22} \end{bmatrix}.$$
If $\sigma = \sigma_{\min}(T - \lambda I) > 0$ and $\|E\|_2\,(1 + \|v\|_2/\sigma) \le \sigma/5$, then there exists $u \in \mathbb{R}^{d-1}$ with $\|u\|_2 \le 4\|\delta\|_2/\sigma$ such that $\tilde q_1 = (q_1 + Q_2u)/\sqrt{1 + u^Tu}$ is a unit 2-norm eigenvector for $A + E$.

Now we are ready to prove Theorem 1. We first show that $(YY^T)^{-1}\hat{B}X(\hat{B}X)^T$ is close to $\Sigma_{yy}^{-1}B_*\Sigma_{xx}B_*^T$; we then apply Lemma C.5 to show the consistency of the canonical vectors. Without loss of generality, let $k = 1$. If (13) holds for $k = 1$, then the results of Theorem 1 follow from the union bound inequality.

By the triangle inequality, the 2-norm of $(YY^T)^{-1}\hat{B}X(\hat{B}X)^T - \Sigma_{yy}^{-1}B_*\Sigma_{xx}B_*^T$ can be bounded by
$$\begin{aligned}
&\big\|(YY^T)^{-1}\hat{B}X(\hat{B}X)^T - \Sigma_{yy}^{-1}B_*\Sigma_{xx}B_*^T\big\|_2 \\
&\quad\le \Big\|\Big(\tfrac1n YY^T\Big)^{-1}\tfrac1n\hat{B}X(\hat{B}X)^T - \Sigma_{yy}^{-1}\tfrac1n\hat{B}X(\hat{B}X)^T\Big\|_2
+ \Big\|\Sigma_{yy}^{-1}\tfrac1n\hat{B}X(\hat{B}X)^T - \Sigma_{yy}^{-1}\tfrac1n B_*X(B_*X)^T\Big\|_2 \\
&\qquad+ \Big\|\Sigma_{yy}^{-1}\tfrac1n B_*X(B_*X)^T - \Sigma_{yy}^{-1}B_*\Sigma_{xx}B_*^T\Big\|_2 \\
&\quad= I_1 + I_2 + I_3. \qquad \text{(C.1)}
\end{aligned}$$

We consider $I_2$ first.
By Assumption 1, we have
$$\begin{aligned}
I_2 &\le \big\|\Sigma_{yy}^{-1}\big\|_2\,\Big\|\tfrac1n\hat{B}X(\hat{B}X)^T - \tfrac1n B_*X(B_*X)^T\Big\|_2 \\
&\le \frac{1}{K_1}\big\|B_*H_XB_*^T - \hat{B}H_XB_*^T + \hat{B}H_XB_*^T - \hat{B}H_X\hat{B}^T\big\|_2 \\
&= \frac{1}{K_1}\big\|(B_*-\hat{B})H_X(B_*+\hat{B})^T\big\|_2 \\
&\le \frac{1}{K_1}\sqrt{\big\|(B_*-\hat{B})H_X(B_*-\hat{B})^T\big\|_2\,\big\|(B_*+\hat{B})H_X(B_*+\hat{B})^T\big\|_2}, \qquad \text{(C.2)}
\end{aligned}$$
where $H_X = \frac1n\sum_{i=1}^n X_iX_i^T$, and the third inequality is because of the Cauchy-Schwarz inequality.

The first term on the right-hand side of (C.2) can be bounded by
$$\big\|(B_*-\hat{B})H_X(B_*-\hat{B})^T\big\|_2 \le \mathrm{tr}\big((B_*-\hat{B})H_X(B_*-\hat{B})^T\big) \le d\max_k(\hat\beta_k-\beta^*_k)^TH_X(\hat\beta_k-\beta^*_k) \lesssim ds^*\log p/n, \qquad \text{(C.3)}$$
where $\mathrm{tr}(A)$ is the trace of a matrix $A$, and the last inequality is by Lemma C.1. The second term in (C.2) can be bounded by
$$\begin{aligned}
\big\|(B_*+\hat{B})H_X(B_*+\hat{B})^T\big\|_2 &= \big\|(\hat{B}-B_*+2B_*)H_X(\hat{B}-B_*+2B_*)^T\big\|_2 \\
&\le \big\|2(\hat{B}-B_*)H_X(\hat{B}-B_*)^T + 8B_*H_XB_*^T\big\|_2 \\
&\le 2\big\|(\hat{B}-B_*)H_X(\hat{B}-B_*)^T\big\|_2 + 8\big\|B_*H_XB_*^T\big\|_2 \\
&\lesssim ds^*\log p/n + 8\big\|B_*H_XB_*^T\big\|_2, \qquad \text{(C.4)}
\end{aligned}$$
where the first inequality is by the Cauchy-Schwarz inequality, the second is by the triangle inequality, and the third is by (C.3).

Now consider bounding $\|B_*H_XB_*^T\|_2$. By the triangle inequality, we have $\|B_*H_XB_*^T\|_2 \le \|B_*\Sigma_{xx}B_*^T\|_2 + \|B_*\Sigma_{xx}B_*^T - B_*H_XB_*^T\|_2$. Therefore, we need to show that $\|B_*\Sigma_{xx}B_*^T - B_*H_XB_*^T\|_2$ is small, which follows directly from Lemma C.3. To see this, note that $B_*X_i$ is still a sub-Gaussian random vector.
Letting $t = C\sqrt{\log(d+p)/n}$ for some constant $C > 0$, with probability at least $1 - d^2/p$ we have
$$\|B_*\Sigma_{xx}B_*^T - B_*H_XB_*^T\|_{\max} \lesssim \sqrt{\frac{\log(d+p)}{n}},$$
which implies
$$\|B_*\Sigma_{xx}B_*^T - B_*H_XB_*^T\|_2 \le d\,\|B_*\Sigma_{xx}B_*^T - B_*H_XB_*^T\|_{\max} \lesssim d\sqrt{\frac{\log(d+p)}{n}}. \qquad \text{(C.5)}$$
By Assumption 1, Proposition 1, and (8), we have $\|B_*\Sigma_{xx}B_*^T\|_2 = \|\Sigma_{yy} - \Sigma_{\epsilon_y\epsilon_y}\|_2 \le \|\Sigma_{yy}\|_2 + \|\Sigma_{\epsilon_y\epsilon_y}\|_2 \le C_0$ for some constant $C_0 >$
0. Therefore, with probability at least $1 - d^2/p$, the right-hand side of (C.4) can be further bounded by
$$\big\|(B_*+\hat{B})H_X(B_*+\hat{B})^T\big\|_2 \lesssim ds^*\log p/n + d\sqrt{\frac{\log(d+p)}{n}} + C_0. \qquad \text{(C.6)}$$
Plugging (C.3) and (C.6) into (C.2), we have
$$I_2 \lesssim \sqrt{ds^*\log p/n}. \qquad \text{(C.7)}$$
The first term $I_1$ in (C.1) can be bounded by
$$I_1 \le \Big\|\Big(\tfrac1n YY^T\Big)^{-1} - \Sigma_{yy}^{-1}\Big\|_2\,\Big\|\tfrac1n\hat{B}X(\hat{B}X)^T\Big\|_2. \qquad \text{(C.8)}$$
By letting $t = C\sqrt{\log(d+p)/n}$ for some constant $C > 0$, with probability at least $1 - d^2/p$ we have
$$\Big\|\tfrac1n YY^T - \Sigma_{yy}\Big\|_2 \le d\,\Big\|\tfrac1n YY^T - \Sigma_{yy}\Big\|_{\max} \lesssim d\sqrt{\log(d+p)/n}.$$
For any unit vector $u$, by Lemma C.4 and noting that Proposition 1 implies $\kappa(\Sigma_{yy}) \le C$, we have
$$\Big\|\Big(\tfrac1n YY^T\Big)^{-1}u - \Sigma_{yy}^{-1}u\Big\|_2 \lesssim d\sqrt{\log(d+p)/n},$$
which implies
$$\Big\|\Big(\tfrac1n YY^T\Big)^{-1} - \Sigma_{yy}^{-1}\Big\|_2 \lesssim d\sqrt{\log(d+p)/n}. \qquad \text{(C.9)}$$
The second term on the right-hand side of (C.8) can be bounded by
$$\Big\|\tfrac1n\hat{B}X(\hat{B}X)^T\Big\|_2 \le \Big\|\tfrac1n\hat{B}X(\hat{B}X)^T - \tfrac1n B_*X(B_*X)^T\Big\|_2 + \Big\|\tfrac1n B_*X(B_*X)^T\Big\|_2,$$
which can be bounded by a constant using a similar approach as in bounding $I_2$. Together with (C.8) and (C.9), we have
$$I_1 \lesssim d\sqrt{\log(d+p)/n}.$$
(C.10)

By Proposition 1 and (C.5), it can be verified that the term $I_3$ can be bounded by
$$I_3 = \Big\|\Sigma_{yy}^{-1}\tfrac1n B_*X(B_*X)^T - \Sigma_{yy}^{-1}B_*\Sigma_{xx}B_*^T\Big\|_2 \le \big\|\Sigma_{yy}^{-1}\big\|_2\,\Big\|\tfrac1n B_*X(B_*X)^T - B_*\Sigma_{xx}B_*^T\Big\|_2 \lesssim d\sqrt{\log(d+p)/n}. \qquad \text{(C.11)}$$
Plugging (C.7), (C.10), and (C.11) into (C.1), we have
$$\big\|(YY^T)^{-1}\hat{B}X(\hat{B}X)^T - \Sigma_{yy}^{-1}B_*\Sigma_{xx}B_*^T\big\|_2 \lesssim d\sqrt{\log(d+p)/n} + \sqrt{ds^*\log p/n} \lesssim d\sqrt{\log p/n} + \sqrt{ds^*\log p/n}, \qquad \text{(C.12)}$$
where the last inequality is because $d < p$.

To apply Lemma C.5, we need to show that $\|\delta\|_2$ in Lemma C.5 is small. Let $E = (YY^T)^{-1}\hat{B}X(\hat{B}X)^T - \Sigma_{yy}^{-1}B_*\Sigma_{xx}B_*^T$. By (C.12), we have
$$\|\delta\|_2 = \|Q_2^TEq_1\|_2 \le \|Q_2^TE\|_2 \le \|Q_2\|_2\|E\|_2 \lesssim \sqrt{d^2\log p/n} + \sqrt{ds^*\log p/n} \lesssim \sqrt{d(d+s^*)\log p/n},$$
where $Q_2$ is as in Assumption 3, and the last inequality follows from the fact that $\sqrt{w_1} + \sqrt{w_2} \le \sqrt{2w_1 + 2w_2}$ for $w_1, w_2 >$
0. By Assumption 3, we can apply Lemma C.5. This yields
$$\|a - \hat a\|_2 \lesssim \sqrt{d(d+s^*)\log p/n}. \qquad \text{(C.13)}$$
For the second estimated canonical vector $\hat b$, we have
$$\|b - \hat b\|_2 = \left\|\frac{B_*^Ta}{\|B_*^Ta\|_2} - \frac{\hat{B}^T\hat a}{\|\hat{B}^T\hat a\|_2}\right\|_2 \lesssim \|B_*^Ta - \hat{B}^T\hat a\|_2 \le \|B_*^Ta - B_*^T\hat a\|_2 + \|B_*^T\hat a - \hat{B}^T\hat a\|_2 \le \|B_*^T\|_2\|a - \hat a\|_2 + \|B_*^T - \hat{B}^T\|_2\|\hat a\|_2. \qquad \text{(C.14)}$$
By Assumption 2 and (C.13),
$$\|B_*^T\|_2\|a - \hat a\|_2 \lesssim \sqrt{d(d+s^*)\log p/n}. \qquad \text{(C.15)}$$
Note that
$$\|B_*^T - \hat{B}^T\|_2 \le \|B_*^T - \hat{B}^T\|_F = \|B_* - \hat{B}\|_F \lesssim \sqrt{\frac{ds^*\log p}{n}}, \qquad \text{(C.16)}$$
where the last inequality is because of Lemma C.1. By (C.13), $\|\hat a\|_2 \le \|a - \hat a\|_2 + \|a\|_2 \lesssim 1$. Combining (C.14), (C.15), and (C.16), we obtain
$$\|b - \hat b\|_2 \lesssim \sqrt{d(d+s^*)\log p/n} + \sqrt{\frac{ds^*\log p}{n}} \lesssim \sqrt{d(d+s^*)\log p/n},$$
with probability at least $1 - Cd^2/p$. The results of Theorem 1 follow from the union bound inequality. Thus, we finish the proof.

References
Aguet, F., Barbeira, A. N., Bonazzola, R., Brown, A., Castel, S. E., Jo, B., Kasela, S., Kim-Hellmuth, S., Liang, Y., Oliva, M., et al. (2019). The GTEx consortium atlas of genetic regulatory effects across human tissues. BioRxiv, page 787903.

Barry, J. D., Fagny, M., Paulson, J. N., Aerts, H. J., Platig, J., and Quackenbush, J. (2018). Histopathological image QTL discovery of immune infiltration variants. iScience, 5:80-89.

Chen, J., Bushman, F. D., Lewis, J. D., Wu, G. D., and Li, H. (2012). Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis. Biostatistics, 14(2):244-258.

Chen, M., Gao, C., Ren, Z., and Zhou, H. H. (2013). Sparse CCA via precision adjusted iterative thresholding. arXiv preprint arXiv:1311.6186.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.

Gao, C., Ma, Z., Zhou, H. H., et al. (2017). Sparse CCA: Adaptive estimation and computational barriers. The Annals of Statistics, 45(5):2074-2101.

Glahn, H. R. (1968). Canonical correlation and its relationship to discriminant analysis and multiple regression. Journal of the Atmospheric Sciences, 25(1):23-31.

González, I., Déjean, S., Martin, P. G., and Baccini, A. (2008). CCA: An R package to extend canonical correlation analysis. Journal of Statistical Software, 23(12):1-14.

Hardoon, D. R. and Shawe-Taylor, J. (2011). Sparse canonical correlation analysis. Machine Learning, 83(3):331-353.

Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika.

Lê Cao, K.-A., Martin, P. G., Robert-Granié, C., and Besse, P. (2009). Sparse canonical methods for biological data integration: Application to a cross-platform study. BMC Bioinformatics, 10(1):34.

Lee, W., Lee, D., Lee, Y., and Pawitan, Y. (2011a). scca: Sparse Canonical Covariance Analysis. R package version 1.1.1.

Lee, W., Lee, D., Lee, Y., and Pawitan, Y. (2011b). Sparse canonical covariance analysis for high-throughput data. Statistical Applications in Genetics and Molecular Biology, 10(1).

Lutz, J. G. and Eckert, T. L. (1994). The relationship between canonical correlation analysis and multivariate multiple regression. Educational and Psychological Measurement, 54(3):666-675.

Mai, Q. and Zhang, X. (2019). An iterative penalized least squares approach to sparse canonical correlation analysis. Biometrics, 75(3):734-744.

Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis. Academic Press.

Moll, R., Divo, M., and Langbein, L. (2008). The human keratins: Biology and pathology. Histochemistry and Cell Biology, 129(6):705.

Ning, Y. and Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158-195.

Parkhomenko, E., Tritchler, D., and Beyene, J. (2007). Genome-wide sparse canonical correlation of gene expression with genotypes. In BMC Proceedings, volume 1, page S119. Springer.

Parkhomenko, E., Tritchler, D., and Beyene, J. (2009). Sparse canonical correlation analysis with application to genomic data integration. Statistical Applications in Genetics and Molecular Biology, 8(1):1-34.

Pau, G., Fuchs, F., Sklyar, O., Boutros, M., and Huber, W. (2010). EBImage - an R package for image processing with applications to cellular phenotypes. Bioinformatics, 26(7):979-981.

Peng, C.-Y. and Wu, C. J. (2014). On the choice of nugget in kriging modeling for deterministic computer experiments. Journal of Computational and Graphical Statistics, 23(1):151-168.

Sass, J. O. (2012). Inborn errors of ketogenesis and ketone body utilization. Journal of Inherited Metabolic Disease, 35(1):23-28.

Song, Y., Schreier, P. J., Ramírez, D., and Hasija, T. (2016). Canonical correlation analysis of high-dimensional data with very small sample support. Signal Processing, 128:449-458.

Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media.

Stewart, G. W. (1973). Error and perturbation bounds for subspaces associated with certain eigenvalue problems. SIAM Review, 15(4):727-764.

Tenenhaus, A. and Guillemot, V. (2017). RGCCA: Regularized and Sparse Generalized Canonical Correlation Analysis for Multiblock Data. R package version 2.1.2.

Tenenhaus, A. and Tenenhaus, M. (2011). Regularized generalized canonical correlation analysis. Psychometrika, 76(2):257.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288.

Van Loan, C. F. and Golub, G. H. (1983). Matrix Computations. Johns Hopkins University Press.

Waaijenborg, S., de Witt Hamer, P. C. V., and Zwinderman, A. H. (2008). Quantifying the association between gene expressions and DNA markers by penalized canonical correlation analysis. Statistical Applications in Genetics and Molecular Biology, 7(1).

Wang, Y. R., Jiang, K., Feldman, L. J., Bickel, P. J., Huang, H., et al. (2015). Inferring gene-gene interactions and functional modules using sparse canonical correlation analysis. The Annals of Applied Statistics, 9(1):300-323.

Witten, D. and Tibshirani, R. (2020). PMA: Penalized Multivariate Analysis. R package version 1.2.1.

Witten, D. M., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534.

Witten, D. M. and Tibshirani, R. J. (2009). Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Applications in Genetics and Molecular Biology, 8(1):1-27.

Yamamoto, H., Yamaji, H., Fukusaki, E., Ohno, H., and Fukuda, H. (2008). Canonical correlation analysis for multivariate regression and its application to metabolic fingerprinting. Biochemical Engineering Journal, 40(2):199-204.

Figure 10:
The heatmap plots and Manhattan-style line plots showing the absolute values of $\hat a$ and $\hat b$, in which SNPs (rows) and genes (columns) are ordered by genomic position.