Grouping effects of sparse CCA models in variable selection
Kefei Liu, Qi Long, Li Shen
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA. Email: [email protected].
Abstract—The sparse canonical correlation analysis (SCCA) is a bi-multivariate association model that finds sparse linear combinations of two sets of variables that are maximally correlated with each other. In addition to the standard SCCA model, a simplified SCCA criterion, which maximizes the cross-covariance between a pair of canonical variables instead of their cross-correlation, is widely used in the literature due to its computational simplicity. However, the behaviors and properties of the solutions of these two models remain unknown in theory. In this paper, we analyze the grouping effect of the standard and simplified SCCA models in variable selection. In high-dimensional settings, the variables often form groups with high within-group correlation and low between-group correlation. Our theoretical analysis shows that for grouped variable selection, the simplified SCCA jointly selects or deselects a group of variables together, while the standard SCCA randomly selects a few dominant variables from each relevant group of correlated variables. Empirical results on synthetic data and real imaging genetics data verify the finding of our theoretical analysis.
Index Terms—canonical correlation analysis (CCA), sparse CCA, grouped variables, dimensionality reduction, imaging genetics
I. INTRODUCTION
Canonical correlation analysis (CCA) [1], [2] is a multivariate statistical method which investigates the associations between two sets of variables. It has found applications in statistics [3], data mining and machine learning [2], [4], functional magnetic resonance imaging [5], [6], genomics [7] and other fields [8]. Given two data sets X ∈ R^{n×p} and Y ∈ R^{n×q} measured on the same set of n samples, CCA seeks linear combinations of the variables in X and those in Y that are maximally correlated with each other:

maximize_{u,v} u^T X^T Y v  subject to  u^T X^T X u ≤ 1, v^T Y^T Y v ≤ 1,

where X and Y are column-centered to zero mean.

Compared with multivariate multiple regression, CCA is "symmetric" and more flexible in finding variables from both X and Y that predict each other well. However, in the high-dimensional setting (n < p) such as linking imaging to genomics [9], [10], CCA breaks down because it has infinitely many solutions. In particular, the solution can have any support of cardinality greater than or equal to n, which means that CCA can select an arbitrary set of n or more variables. To handle this, sparse CCA (SCCA) [11], [12], [13], [14], [15] uses L1 sparsity regularization to select a subset of variables, which can improve interpretability, stability, and variable selection performance.

A main drawback of SCCA is that it is computationally expensive. To reduce the computational load, a common practice is to replace the covariance matrices X^T X and Y^T Y in the L2 constraints with diagonal matrices [16], [17], [18], [19], [20]. The resulting simplified SCCA model admits a closed-form solution for each subproblem (update of u with v fixed, or vice versa) and is thus computationally more efficient.

However, the fundamental difference between the standard and simplified SCCA in variable selection remains unclear, particularly regarding the theoretical properties of their solutions. In [17], [20], the use of the simplified SCCA model is justified only by the empirical observation that "in high-dimensional classification problems [21], [22], treating the covariance matrix as diagonal can yield good results". In this paper, we attempt to close this gap by investigating the properties of the solutions of the standard and simplified SCCA models.

Our main contributions are summarized as follows.

● The behaviors of the standard and simplified SCCA models in grouped variable selection are theoretically characterized. In high-dimension, small-sample-size problems, the variables often form groups of various sizes with high within-group correlation and low between-group correlation. We show that the simplified SCCA jointly selects or deselects a group of correlated variables together, while the standard SCCA tends to select a few dominant variables from each relevant group of correlated variables. This finding can help practitioners choose the proper method for their tasks.

● Lemma 2.2 of [17] is extended from c ∈ [√|S|, ∞) to c ∈ (0, ∞), where S = {i : i ∈ argmax_j |a_j|}. Lemma 2.2 of [17], which solves maximize_u a^T u subject to ‖u‖_2 ≤ 1, ‖u‖_1 ≤ c, is a key component of the simplified SCCA algorithm used to solve the subproblems at each iteration of the alternating optimization algorithm. However, that lemma fails to provide a solution to the above problem for c ∈ (0, √|S|).
● Greedy algorithms to sequentially compute multiple canonical components for standard and simplified SCCA are derived and presented. To the best of our knowledge, these algorithms are new.
Notation: Scalars are denoted by italic letters, column vectors by boldface lowercase letters, and matrices by boldface capitals. The j-th column vector of a matrix X is denoted as x_j. The superscript T stands for the transpose. ‖u‖_2 and ‖u‖_1 denote the Euclidean norm and ℓ1 norm of a vector u, respectively. σ_max(A) and λ_max(A) denote the largest singular value and largest eigenvalue of a matrix A, respectively. For a set S, its cardinality is denoted as |S|. The soft-thresholding operator is defined as

S(a, Δ) = a − Δ if a > Δ;  a + Δ if a < −Δ;  0 if −Δ ≤ a ≤ Δ,

where Δ is a non-negative constant.
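For reference, a direct NumPy rendering of this operator is given below; it is a minimal sketch, and the vectorized element-wise form is an assumption of convenience for later use.

```python
import numpy as np

def soft_threshold(a, delta):
    """Element-wise soft-thresholding S(a, delta), with delta >= 0."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)
```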
II. SPARSE CCA MODEL

Assume that X and Y are column-centered to zero mean. SCCA aims to find linear combinations of variables in X and Y that maximize their correlation [15], [13]:

maximize_{u,v} u^T X^T Y v
subject to u^T X^T X u ≤ 1, ‖u‖_1 ≤ c_1,
           v^T Y^T Y v ≤ 1, ‖v‖_1 ≤ c_2,    (1)

where Xu and Yv are the canonical variables, u and v are the canonical loadings/weights measuring the contribution of each feature in the identified association, and c_1 > 0, c_2 > 0 are the regularization parameters that control the sparsity of the solution.

Problem (1) is not convenient to solve due to the quadratic constraints. To save computational cost, it is common practice to treat the covariance matrices X^T X and Y^T Y as diagonal [16], [17], [18], [19], [20], [14], [23]. This yields the following simplified formulation of SCCA:

maximize_{u,v} u^T X^T Y v
subject to ‖u‖_2 ≤ 1, ‖u‖_1 ≤ c_1,
           ‖v‖_2 ≤ 1, ‖v‖_1 ≤ c_2,    (2)

where c_1, c_2 > 0.

In Section IV and Supplementary Materials Section B, we describe algorithms to fit the two models and explain how to obtain multiple canonical components.
III. GROUPING EFFECT ANALYSIS

In high-dimensional problems such as imaging genomics, grouped variables are common, and how to properly select them is an important research problem [10], [24], [25], [26]. For a sparse CCA model, we say it exhibits the grouping effect if it jointly selects or deselects each group of highly correlated variables together.

To gain initial insight, we start with the simplest case in which all p X variables are fully correlated with each other.
Lemma III.1.
Let x_1 = x_2 = ⋯ = x_p have unit L2 norm. The optimal solution u* to problem (1) is (i) any point on the segment of the line u_1 + u_2 + ⋯ + u_p = 1 that lies inside the L1 ball:

u*_1 + u*_2 + ⋯ + u*_p = 1    (3)
‖u*‖_1 ≤ c_1    (4)

when c_1 ≥ 1, and (ii) any u*_1 ≥ 0, u*_2 ≥ 0, ..., u*_p ≥ 0 that satisfy

u*_1 + u*_2 + ⋯ + u*_p = c_1    (5)

when 0 < c_1 < 1.

The optimal solution u* to problem (2) is (i) u*_1 = u*_2 = ⋯ = u*_p = 1/√p when c_1 ≥ √p, and (ii) any u*_1 ≥ 0, u*_2 ≥ 0, ..., u*_p ≥ 0 that satisfy

u*_1 + u*_2 + ⋯ + u*_p = c_1    (6)
u*_1² + u*_2² + ⋯ + u*_p² ≤ 1    (7)

when 1 ≤ c_1 < √p.

Proof. We first prove the result for problem (1), i.e., the SCCA model. When x_1 = x_2 = ⋯ = x_p ≜ x, problem (1) reduces to

maximize_{u,v} (u_1 + u_2 + ⋯ + u_p) x^T Y v
subject to |u_1 + u_2 + ⋯ + u_p| ≤ 1, ‖u‖_1 ≤ c_1,
           v^T Y^T Y v ≤ 1, ‖v‖_1 ≤ c_2,    (8)

where c_1 ≥ 0, c_2 ≥ 0.

Note that the optimal solution to problem (8) is not unique because the objective function remains the same after we reverse the signs of both u and v. To resolve this, we assume u_1 + u_2 + ⋯ + u_p ≥ 0. Note also that the optimal value of problem (8) is larger than zero when c_1 > 0, c_2 > 0. As a result, u and v can be independently optimized:

u* = argmax_u (u_1 + u_2 + ⋯ + u_p) subject to |u_1 + u_2 + ⋯ + u_p| ≤ 1, ‖u‖_1 ≤ c_1    (9)
v* = argmax_v x^T Y v subject to v^T Y^T Y v ≤ 1, ‖v‖_1 ≤ c_2.    (10)

Solving (9) yields the optimal solution u* shown in (3)-(5).

We next prove the result for problem (2), i.e., the simplified SCCA model. When x_1 = x_2 = ⋯ = x_p ≜ x, problem (2) reduces to

maximize_{u,v} (u_1 + u_2 + ⋯ + u_p) x^T Y v
subject to ‖u‖_2 ≤ 1, ‖u‖_1 ≤ c_1,
           ‖v‖_2 ≤ 1, ‖v‖_1 ≤ c_2,    (11)

where c_1 ≥ 0, c_2 ≥ 0.

To resolve the sign ambiguity, we assume u_1 + u_2 + ⋯ + u_p ≥ 0. Therefore, u and v can be independently optimized:

u* = argmax_u (u_1 + u_2 + ⋯ + u_p) subject to ‖u‖_2 ≤ 1, ‖u‖_1 ≤ c_1    (12)
v* = argmax_v x^T Y v subject to ‖v‖_2 ≤ 1, ‖v‖_1 ≤ c_2.    (13)

Solving (12) yields the optimal solution shown in Lemma III.1 (simplified SCCA part).

We next provide a formal proof of the grouping effect in variable selection for the simplified SCCA.
Theorem III.2.
Given data (X, Y), with columns standardized to zero mean and unit norm, and regularization parameters (c_1, c_2), let (u*, v*) be an optimal solution to problem (2). Assume that at (u*, v*) the L2 inequality constraint on u is strongly active. We have:

● when u*_i u*_j > 0,

|u*_i − u*_j| ≤ (2/α) min( σ_max(Y), c_2 √(Σ_{ℓ=1}^n max_{1≤j≤q} y_{ℓj}²) ) √((1 − r_ij)/2);    (14)

● when u*_i u*_j < 0,

|u*_i + u*_j| ≤ (2/α) min( σ_max(Y), c_2 √(Σ_{ℓ=1}^n max_{1≤j≤q} y_{ℓj}²) ) √((1 + r_ij)/2),    (15)

where r_ij = x_i^T x_j ∈ [−1, 1] is the Pearson correlation coefficient between x_i and x_j, and α > 0 is a constant that only depends on (X, Y, c_1, c_2).

Likewise, if at (u*, v*) the L2 inequality constraint on v is strongly active, we have:

● when v*_i v*_j > 0,

|v*_i − v*_j| ≤ (2/α) min( σ_max(X), c_1 √(Σ_{ℓ=1}^n max_{1≤i≤p} x_{ℓi}²) ) √((1 − r′_ij)/2);    (16)

● when v*_i v*_j < 0,

|v*_i + v*_j| ≤ (2/α) min( σ_max(X), c_1 √(Σ_{ℓ=1}^n max_{1≤i≤p} x_{ℓi}²) ) √((1 + r′_ij)/2),    (17)

where r′_ij = y_i^T y_j ∈ [−1, 1] is the Pearson correlation coefficient between y_i and y_j, and α > 0 is a constant that only depends on (X, Y, c_1, c_2).

Proof. Each subproblem (solve for u with v fixed, or solve for v with u fixed) is a convex optimization problem with differentiable objective and constraint functions (the L1 inequality constraint can be written as 2^p linear inequality constraints) and is strictly feasible (Slater's condition holds), so the KKT conditions provide necessary and sufficient conditions for optimality [27].

The KKT conditions for the optimality of u* consist of the following:

α u* + λ s = X^T Y v*,    (18)

where s_i = sign(u*_i) if u*_i ≠ 0 and s_i ∈ [−1, 1] otherwise, and

α ≥ 0, ‖u*‖_2 ≤ 1, α(‖u*‖_2 − 1) = 0,    (19)
λ ≥ 0, ‖u*‖_1 ≤ c_1, λ(‖u*‖_1 − c_1) = 0.    (20)

If u*_i u*_j > 0, then both u*_i and u*_j are non-zero with sign(u*_i) = sign(u*_j). From (18), it follows that

α u*_i + λ sign(u*_i) = x_i^T Y v*,    (21)
α u*_j + λ sign(u*_j) = x_j^T Y v*.    (22)

Subtracting (22) from (21) gives

α (u*_i − u*_j) = (x_i − x_j)^T Y v*.    (23)

Therefore, we have

|u*_i − u*_j| = (1/α) |(x_i − x_j)^T Y v*| ≤ (1/α) ‖x_i − x_j‖_2 ‖Y v*‖_2.    (24)

Since X is column standardized, we have

‖x_i − x_j‖_2 = √(‖x_i‖_2² + ‖x_j‖_2² − 2 x_i^T x_j) = √(2(1 − r_ij)),    (25)

where r_ij = x_i^T x_j is the sample Pearson correlation coefficient between x_i and x_j.

In the domain of problem (2), it holds that

‖Y v*‖_2 ≤ σ_max(Y) ‖v*‖_2 ≤ σ_max(Y)    (26)

and

‖Y v*‖_2 = √(Σ_{ℓ=1}^n (Σ_{j=1}^q y_{ℓj} v*_j)²) ≤ √(Σ_{ℓ=1}^n max_{1≤j≤q} y_{ℓj}² (Σ_{j=1}^q |v*_j|)²) = √(Σ_{ℓ=1}^n max_{1≤j≤q} y_{ℓj}²) ‖v*‖_1 ≤ c_2 √(Σ_{ℓ=1}^n max_{1≤j≤q} y_{ℓj}²),    (27)

where in (26) and (27) we have used the L2 and L1 constraints in problem (2), respectively.

Substituting (25)-(27) into (24), we arrive at

|u*_i − u*_j| ≤ (2/α) min( σ_max(Y), c_2 √(Σ_{ℓ=1}^n max_{1≤j≤q} y_{ℓj}²) ) √((1 − r_ij)/2).    (28)

Since the L2 inequality constraint on u is strongly active at (u*, v*), we have α > 0. Specifically, combining conditions (18)-(20) yields

α = ‖S(X^T Y v*, λ)‖_2,    (29)

where λ = 0 if this results in ‖X^T Y v*‖_1 / ‖X^T Y v*‖_2 ≤ c_1; otherwise, λ is the smallest positive number for which ‖S(X^T Y v*, λ)‖_1 / ‖S(X^T Y v*, λ)‖_2 = c_1. Thus we obtain (14).

Using a similar line of argument, we can prove (15) and (16)-(17).
Fig. 1. The optimal solution set u* with p = 2 identical variables: (a) the SCCA problem; (b)-(c) the simplified SCCA with two different values of c_1. The feasible set of points is shown lightly shaded. The optimal points are highlighted in orange.

Fig. 1 illustrates the optimal solution u* to problems (1) and (2) with p = 2 identical X variables. We see that for SCCA (Fig. 1(a)), the optimal solution set is a line segment that crosses the axes (i.e., it includes sparse solutions). For the simplified SCCA (Figs. 1(b)-1(c)), the optimal solution set does not intersect the axes (i.e., it does not include sparse or nearly sparse solutions); in particular, when the L2 constraint on u is strongly active at the optimal solution, i.e., when c_1 ≥ √2, the optimal solution set contains a single point with equal coordinates: (u*_1, u*_2) = (1/√2, 1/√2).
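To make the bound (14) concrete, the following sketch (Python/NumPy) evaluates its right-hand side for a pair of highly correlated, standardized columns. The synthetic data, the value of c_2, and the placeholder value of α are assumptions made only for this illustration; in practice α is the dual variable obtained from (29) for a fitted model.

```python
import numpy as np

# Illustrative evaluation of the grouping-effect bound (14); the data, c2 and
# the placeholder alpha below are assumptions made only for this sketch.
rng = np.random.default_rng(0)
n, q = 100, 20
base = rng.standard_normal(n)
x_i = base - base.mean()
x_i /= np.linalg.norm(x_i)
x_j = base + 0.05 * rng.standard_normal(n)
x_j -= x_j.mean()
x_j /= np.linalg.norm(x_j)                     # two highly correlated, standardized columns

Y = rng.standard_normal((n, q))
Y -= Y.mean(axis=0)
Y /= np.linalg.norm(Y, axis=0)                 # standardized Y columns

r_ij = float(x_i @ x_j)                        # sample Pearson correlation
c2, alpha = 2.0, 1.0                           # alpha would come from (29) in practice
sigma_max_Y = np.linalg.svd(Y, compute_uv=False)[0]
l1_term = c2 * np.sqrt((Y ** 2).max(axis=1).sum())
bound = (2.0 / alpha) * min(sigma_max_Y, l1_term) * np.sqrt((1.0 - r_ij) / 2.0)
print(f"r_ij = {r_ij:.4f}; bound on |u*_i - u*_j| = {bound:.4f}")
```

The closer r_ij is to 1, the smaller the bound, forcing the simplified SCCA weights of the two correlated variables to be nearly equal.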
IV. OPTIMIZATION ALGORITHMS

Both problems (1) and (2) are bi-convex, i.e., convex in u with v fixed and in v with u fixed, but not jointly convex in u and v. A standard method to solve the SCCA models is alternating optimization [28]: first update u while holding v fixed, then update v while holding u fixed, and repeat this process until convergence.

A. SCCA model (1)

The SCCA model fitting algorithm is shown in Algorithm 1.
Algorithm 1
SCCA algorithm
Initialize v;
repeat
  Update u with v fixed:
    maximize_u u^T X^T Y v subject to ‖Xu‖_2 ≤ 1, ‖u‖_1 ≤ c_1    (30)
  Update v with u fixed:
    maximize_v u^T X^T Y v subject to v^T Y^T Y v ≤ 1, ‖v‖_1 ≤ c_2    (31)
until convergence.

Both problems (30) and (31) are convex optimization problems, and in [15] the linearized alternating direction method of multipliers (ADMM) [29] was proposed to solve each of them. Since [15] uses a slightly different formulation (therein the L1 regularizer appears in the objective function), we present a new linearized ADMM algorithm to solve problem (30) in Supplementary Materials A.

B. Simplified SCCA model (2)

We first introduce the following lemma, which will be used as a building block in the simplified SCCA algorithm.
Lemma IV.1.
Consider the quadratically constrained linear program (QCLP):

maximize_u a^T u subject to ‖u‖_2 ≤ 1, ‖u‖_1 ≤ c,    (32)

where c > 0 is a constant. Define S = {i : i ∈ argmax_j |a_j|}. The optimal solution u* to (32) is as follows.

● Case 1: c < √|S|

[u*]_i = (c/|S|) sign(a_i) if i ∈ S, and 0 if i ∉ S.    (33)

● Case 2: c ≥ √|S|

u* = S(a, Δ) / ‖S(a, Δ)‖_2,    (34)

where Δ = 0 if this results in ‖u*‖_1 ≤ c; otherwise, Δ > 0 satisfies ‖u*‖_1 = c. Here the soft-thresholding S(a, Δ) is applied to a coordinate-wise.

The above lemma extends Lemma 2.2 of [17] from c ∈ [√|S|, ∞) to c ∈ (0, ∞). See Supplementary Materials Section A for the proof of Lemma IV.1 and for how it extends Lemma 2.2 of [17]. In Case 1, the solution is generally not unique. Specifically, the optimal solution has the form

[u*]_i = w_i sign(a_i) if i ∈ S, and 0 if i ∉ S,

where the w_i, i ∈ S, can be any non-negative numbers that satisfy Σ_{i∈S} w_i² ≤ 1 and Σ_{i∈S} w_i = c. The presented solution is the one that minimizes Σ_{i∈S} w_i².

For the simplified SCCA in (2), each subproblem (solving u with v fixed, or solving v with u fixed) is a QCLP of the form (32), which results in Algorithm 2.
Simplified SCCA algorithm
Initialize v;
repeat
  Update u according to Lemma IV.1, with a = X^T Y v and c = c_1;
  Update v according to Lemma IV.1, with a = Y^T X u and c = c_2;
until convergence.

Note that by repeatedly applying Algorithms 1 and 2, we can obtain multiple canonical components, as described in Section B of the Supplementary Materials.
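For illustration, a minimal sketch of Algorithm 2 in Python/NumPy is given below. The closed-form subproblem solver follows Lemma IV.1; the bisection tolerance, the SVD-based warm start for v, and the fixed iteration count are assumptions, and convergence checks are omitted.

```python
import numpy as np

def soft_threshold(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def qclp_max(a, c, tol=1e-8):
    """Maximize a'u subject to ||u||_2 <= 1, ||u||_1 <= c (sketch of Lemma IV.1; assumes a != 0)."""
    a = np.asarray(a, dtype=float)
    S = np.isclose(np.abs(a), np.abs(a).max())
    if c < np.sqrt(S.sum()):                    # Case 1: spread weight c equally over the ties
        u = np.zeros_like(a)
        u[S] = (c / S.sum()) * np.sign(a[S])
        return u
    u = a / np.linalg.norm(a)                   # Case 2 with Delta = 0
    if np.linalg.norm(u, 1) <= c:
        return u
    lo, hi = 0.0, np.abs(a).max()               # bisection for Delta such that ||u||_1 ~ c
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        u = soft_threshold(a, mid)
        u /= np.linalg.norm(u)
        if np.linalg.norm(u, 1) > c:
            lo = mid
        else:
            hi = mid
    return u

def simplified_scca(X, Y, c1, c2, n_iter=100):
    """Alternating updates of Algorithm 2 (illustrative sketch, no convergence check)."""
    v = np.linalg.svd(X.T @ Y, full_matrices=False)[2][0]   # warm start: top right singular vector
    u = None
    for _ in range(n_iter):
        u = qclp_max(X.T @ (Y @ v), c1)
        v = qclp_max(Y.T @ (X @ u), c2)
    return u, v
```

The bisection exploits the fact that the L1 norm of the normalized soft-thresholded vector decreases as Δ grows, so a Δ matching the budget c exists whenever c ≥ √|S|.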
V. EXPERIMENTAL RESULTS AND DISCUSSION

We perform a comparative study of the two SCCA models using both synthetic data and real imaging genetics data.
A. Simulation study on synthetic data
Assume the data X ∈ R^{n×p} and Y ∈ R^{n×q} collect n i.i.d. observations/samples of random vectors x ∈ R^{p×1} and y ∈ R^{q×1} (with slight abuse of notation), respectively, with p ≈ 2000 and q = 100. We consider two simulation setups, one with uncorrelated variables and the other with grouped variables. For simplicity, we focus on the simulation and analysis of the X variables only.
1) Setup 1: uncorrelated variables:
The random vector x is modeled as standard normal: x ∼ N(0, I_p). Define z as z = c^T x, where c ∈ R^{p×1} is a sparse vector. The random vector y is then modeled as

y = d z + σ n,    (35)

where d ∈ R^{q×1} is a sparse vector and n ∼ N(0, I_q) models random noise. We set σ to achieve a signal-to-noise ratio of 1.
2) Setup 2: grouped variables:
We assume that the variables in x form G = 20 non-overlapping groups:

x = [ x_1 ⋯ x_1 | x_2 ⋯ x_2 | ⋯ | x_G ⋯ x_G ]^T,

where the g-th block contains p_g identical copies of x_g. The group sizes p_g are drawn independently from a Poisson distribution with mean 100. The total number of variables in x is p = Σ_{g=1}^G p_g. For g = 1, 2, ..., G, x_g ∼ N(0, 1).

Define c ∈ R^{p×1} as a sparse vector collecting the weights of the variables in x. We assume that the elements of c are grouped in the same way as x. Five of the G = 20 groups of variables in x are randomly selected and their weights are set to 1 (alternating in sign group-wise for visual clarity), while the remaining groups of variables in x are uninformative and their weights are set to 0. The vector c is shown in the top row of Fig. 2(c). Define a random variable z as z = c^T x. The random vector y is modeled in the same way as described in Section V-A1.
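A minimal sketch of this generative process (Python/NumPy) follows; the sample size n, the support of d, and the exact SNR scaling are assumptions made only for illustration.

```python
import numpy as np

# Illustrative generation of the grouped-variable setup; n, the support of d and
# the SNR scaling below are assumptions, not the paper's exact values.
rng = np.random.default_rng(1)
n, q, G = 500, 100, 20
group_sizes = rng.poisson(100, size=G)
p = group_sizes.sum()

# Each group is one latent N(0,1) variable replicated p_g times (full within-group correlation).
latents = rng.standard_normal((n, G))
X = np.repeat(latents, group_sizes, axis=1)

# Sparse weight vector c: five informative groups with weights +/-1, the rest zero.
c = np.zeros(p)
starts = np.concatenate(([0], np.cumsum(group_sizes)[:-1]))
informative = rng.choice(G, size=5, replace=False)
for k, g in enumerate(informative):
    c[starts[g]:starts[g] + group_sizes[g]] = (-1.0) ** k

z = X @ c                                   # latent score z = c^T x per sample
d = np.zeros(q)
d[:10] = 1.0                                # assumed sparse y-side weights
signal = np.outer(z, d)
sigma = np.sqrt((signal ** 2).mean())       # rough SNR ~ 1 scaling (assumption)
Y = signal + sigma * rng.standard_normal((n, q))
```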
3) Hyperparameter tuning & performance estimation:
To tune the hyperparameters (c_1, c_2), we partition the data into training (50%), validation (25%), and testing (25%) sets. After fitting the SCCA model on the training data, the canonical correlation on the validation data is estimated over a two-dimensional grid of (c_1, c_2) values equally spaced on the log scale between c_{ℓ,min} and c_{ℓ,max}, ℓ = 1, 2, the minimum and maximum effective values of c_ℓ. The (c_1, c_2) pair yielding the maximum validation canonical correlation is selected. Then, we train the model with the selected regularization parameters on the full training data (training + validation) and report the canonical correlation on the testing set as the performance. For the simplified SCCA, the same procedure is used except that the canonical covariance is used as the metric for hyperparameter tuning. A more detailed description of the procedure to select c_1, c_2 and to assess performance, including how to determine c_{ℓ,min} and c_{ℓ,max}, ℓ = 1, 2, is provided in Supplementary Materials F.
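A schematic of this selection procedure is sketched below (Python/NumPy); the grid size and the `fit`/`score` callables are hypothetical placeholders, and the exact grid spacing used in the paper is not reproduced here.

```python
import numpy as np

# Hypothetical grid-search helper for (c1, c2); `fit` returns (u, v) for given data
# and parameters, `score` is the validation metric (correlation or covariance).
def log_grid(c_min, c_max, num=10):
    return np.logspace(np.log10(c_min), np.log10(c_max), num)

def select_params(fit, score, Xtr, Ytr, Xval, Yval, grid1, grid2):
    best, best_score = None, -np.inf
    for c1 in grid1:
        for c2 in grid2:
            u, v = fit(Xtr, Ytr, c1, c2)
            s = score(Xval @ u, Yval @ v)
            if s > best_score:
                best, best_score = (c1, c2), s
    return best, best_score
```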
4) Simulation study results:
Fig. 2 shows the canonical weight vectors estimated by SCCA and simplified SCCA on the entire training data. In Supplementary Materials Tables S2-S3, we also summarize the variable selection performance in terms of recall, precision, F1 score, accuracy (ACC), balanced accuracy (bACC), Matthews correlation coefficient (MCC), precision-recall area under the curve (PR AUC), and relative absolute error (RAE). The canonical correlation/covariance values on the training and testing sets are reported in Table I.

In Experimental setup 1, where the variables in x are uncorrelated, the standard SCCA consistently outperforms the simplified SCCA in both the selection of variables in x and the identification of strong canonical correlation.

In Experimental setup 2, where the variables in x form groups with full correlation within each group, the simplified SCCA always assigns the same weight to each group of variables in x. In contrast, for the standard SCCA, the weights of variables in x within the same group are randomly assigned, which leads to a few variables with large weights while the remaining variables have weights close to zero. Although the simplified SCCA can falsely detect variables group-wise, it outperforms the standard SCCA in the selection of variables in x. Note that, compared to the standard SCCA, the simplified SCCA has slightly lower canonical correlation but much higher canonical covariance. This is not surprising because the standard SCCA maximizes the canonical correlation while the simplified SCCA maximizes the canonical covariance.

Regarding the selection of variables in y, the simplified SCCA performs better than the standard SCCA in both experimental setups. This is expected considering that the variables in y in (35) are highly correlated.

B. Application to real imaging genetic data
We applied the two SCCA models to a real imaging genetics data set to compare their performances. The genotyping and baseline AV-45 PET data of 757 non-Hispanic Caucasian subjects (age 72.26 ± ...) were used.
Fig. 2. The actual and estimated canonical weight vectors u and v for (a-b) Experiment 1 and (c-d) Experiment 2. In each subfigure, the top row shows the actual weights used in the generative model, and the bottom two rows show the weights estimated by SCCA and simplified SCCA on all training data, respectively. To facilitate comparison, the estimated weight vectors are scaled to have the same Euclidean norm as the actual weight vector.

... APOE, PICALM and ABCA7. However, in each gene, the simplified SCCA selects a cluster of SNPs while the standard SCCA selects only one or very few SNPs which dominate. Together with the correlation among the SNPs within each gene (Fig. S4, middle), this verifies that the simplified SCCA has the grouping effect in feature selection while the standard SCCA does not.

For imaging feature selection (Figure S7 and Table S6), although high correlation is prevalent among the 116 imaging features (Fig. S4, right), the standard SCCA selects only about 20 features while the simplified SCCA selects more than 60 features, which confirms that the simplified SCCA is prone to selecting correlated features together.
TABLE I
PERFORMANCE COMPARISON ON CANONICAL CORRELATION COEFFICIENTS ON SYNTHETIC DATA.

Model              (c_opt1, c_opt2)   Cov@Val*   Corr@Val*   Cov**     Corr**   Cov (test)   Corr (test)
Experimental setup 1
Simplified SCCA    (11.763, 4.513)    3.304      —           7.396     0.893    4.104        0.749
Experimental setup 2
Simplified SCCA    (21.354, 4.154)    461.295    —           522.153   0.983    545.016      0.985

* Cov@Val/Corr@Val: canonical covariance/correlation on the validation data during the training (model selection) stage. The reported value is the maximum canonical covariance/correlation over all candidate (c_1, c_2) (i.e., at the optimal regularization parameters (c_opt1, c_opt2)).
** Cov/Corr: canonical covariance/correlation when the optimal model is fit to the combined training and validation data.

TABLE II
PERFORMANCE COMPARISON ON CANONICAL CORRELATION COEFFICIENTS ON REAL DATA.

Fold index                    (c_opt1, c_opt2)   Cov@Val*   Corr@Val*   Cov**    Corr**   Cov (test)   Corr (test)
SCCA, full data               (2, 4)             —          —           1.1274   0.6331   —            —
Simplified SCCA, full data    (4, 16)            —          —           7.1326   0.4248   —            —

* Cov@Val/Corr@Val: mean canonical covariance/correlation for the left-out folds in the inner cross-validation used to select the regularization parameters. The reported value is the maximum mean canonical covariance/correlation over all candidate (c_1, c_2) (i.e., at the optimal regularization parameters (c_opt1, c_opt2)).
** Cov/Corr: mean canonical covariance/correlation when the optimal model is fit to the whole training data.
VI. CONCLUSION
The sparse canonical correlation analysis (SCCA) is a bi-multivariate model that maximizes the multivariate correlation between two sets of variables. Since SCCA is computationally expensive, a simplified SCCA model, which maximizes the multivariate covariance, has been widely used as its surrogate. The fundamental properties of the solutions of these two models remain unknown. Through theoretical analysis, we show that these two models behave differently regarding the grouping effects in variable selection. The simplified SCCA jointly selects or deselects a group of correlated variables together, while the standard SCCA randomly selects one or a few representatives from a group of correlated variables. Empirical results on both synthetic and real data confirm our theoretical finding. This result can guide users to choose the right SCCA model in practice.

REFERENCES

[1] H. Hotelling, "Relations between two sets of variates,"
Biometrika , vol. 28,pp. 321–377, 1936.[2] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlationanalysis: An overview with application to learning methods,”
Neuralcomputation , vol. 16, no. 12, pp. 2639–2664, 2004.[3] A. Klami, S. Virtanen et al. , “Bayesian canonical correlation analysis,”
J. Mach. Learn. Res. , vol. 14, no. Apr, pp. 965–1003, 2013.[4] L. Sun, S. Ji, and J. Ye, “Canonical correlation analysis for multilabelclassification: A least-squares formulation, extensions, and analysis,”
IEEE Trans Pattern Anal Mach Intell , vol. 33, no. 1, pp. 194–200, 2010.[5] K. J. Worsley, J.-B. Poline, K. J. Friston, and A. Evans, “Characterizingthe response of PET and fMRI data using multivariate linear models,”
Neuroimage , vol. 6, no. 4, pp. 305–319, 1997.[6] O. Friman, J. Cedefamn et al. , “Detection of neural activity in functionalMRI using canonical correlation analysis,”
Magnetic Resonance inMedicine , vol. 45, no. 2, pp. 323–330, 2001.[7] Y. Yamanishi, J.-P. Vert et al. , “Extraction of correlated gene clustersfrom multiple genomic data by generalized kernel canonical correlationanalysis,”
Bioinformatics , vol. 19, no. suppl 1, pp. i323–i330, 2003.[8] J. Via, I. Santamaria, and J. P´erez, “Canonical correlation analysis (CCA)algorithms for multiple data sets: Application to blind SIMO equalization,”in
IEEE European Signal Proc. Conf.
IEEE, 2005, pp. 1–4.[9] A. R. Hariri and D. R. Weinberger, “Imaging genomics,”
British medicalbulletin , vol. 65, no. 1, pp. 259–270, 2003.[10] L. Shen and P. M. Thompson, “Brain imaging genomics: Integratedanalysis and machine learning,”
Proceedings of the IEEE , vol. 108, no. 1,pp. 125–162, Jan 2020.[11] S. Waaijenborg, P. C. V. de Witt Hamer, and A. H. Zwinderman,“Quantifying the association between gene expressions and DNA-markersby penalized canonical correlation analysis,”
Statistical applications ingenetics and molecular biology , vol. 7, no. 1, 2008.[12] D. R. Hardoon and J. Shawe-Taylor, “Sparse canonical correlationanalysis,”
Machine Learning , vol. 83, no. 3, pp. 331–353, 2011.[13] D. Chu, L.-Z. Liao, M. K. Ng, and X. Zhang, “Sparse canonicalcorrelation analysis: New formulation and algorithm,”
IEEE Trans PatternAnal Mach Intell , vol. 35, no. 12, pp. 3050–3065, 2013.[14] E. C. Chi, G. I. Allen et al. , “Imaging genetics via sparse canonicalcorrelation analysis,” in
IEEE 10th Int Sym on Biomedical Imaging (ISBI) ,San Francisco, CA, 2013, pp. 740–743.[15] X. Suo, V. Minden, B. Nelson, R. Tibshirani, and M. Saunders, “Sparsecanonical correlation analysis,” arXiv preprint arXiv:1705.10865 , 2017.[16] E. Parkhomenko, D. Tritchler, and J. Beyene, “Sparse canonical correla-tion analysis with application to genomic data integration,”
StatisticalApplications in Genetics and Molecular Biology , vol. 8, pp. 1–34, 2009.[17] D. Witten, R. Tibshirani, and T. Hastie, “A penalized matrix decompo-sition, with applications to sparse principal components and canonicalcorrelation analysis,”
Biostatistics , vol. 10, no. 3, pp. 515–34, 2009.[18] D. M. Witten and R. J. Tibshirani, “Extensions of sparse canonicalcorrelation analysis with applications to genomic data,”
Stat Appl GenetMol Biol , vol. 8, no. 1, pp. 1–27, 2009.[19] X. Chen, H. Liu, and J. G. Carbonell, “Structured sparse canonicalcorrelation analysis,” in
International Conference on Artificial Intelligenceand Statistics , vol. 12, La Palma, Canary Islands, 2012, pp. 199–207.[20] J. Chen, F. D. Bushman, J. D. Lewis, G. D. Wu, and H. Li, “Structure-constrained sparse canonical correlation analysis with an application tomicrobiome data analysis,”
Biostatistics , vol. 14, no. 2, pp. 244–258,2013.[21] S. Dudoit, J. Fridlyand, and T. P. Speed, “Comparison of discriminationmethods for the classification of tumors using gene expression data,”
J.Am. Stat. Assoc. , vol. 97, no. 457, pp. 77–87, 2002.[22] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu, “Class predictionby nearest shrunken centroids, with applications to DNA microarrays,”
Statistical Science , pp. 104–117, 2003.[23] J. Fang, D. Lin et al. , “Joint sparse canonical correlation analysis fordetecting differential imaging genetics modules,”
Bioinformatics , vol. 32,no. 22, pp. 3480–3488, 2016.[24] C. B. MikeWest, H. Dressman et al. , “Predicting the clinical status ofhuman breast cancer using gene expression profiles,”
PNAS , 2001.[25] H. Zou and T. Hastie, “Regularization and variable selection via theelastic net,”
Journal of the royal statistical society: series B (statisticalmethodology) , vol. 67, no. 2, pp. 301–320, 2005.[26] P. M. Thompson, N. G. Martin, and M. J. Wright, “Imaging genomics,”
Curr Opin Neurol , vol. 23, no. 4, pp. 368–73, 2010.[27] S. Boyd and L. Vandenberghe,
Convex optimization . Cambridgeuniversity press, 2004. [28] J. C. Bezdek and R. J. Hathaway, “Some notes on alternating optimiza-tion,” in
AFSS International Conference on Fuzzy Systems . Berlin,Heidelberg: Springer, 2002, pp. 288–300.[29] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributedoptimization and statistical learning via the alternating direction methodof multipliers,”
Foundations and Trends® in Machine Learning , vol. 3,no. 1, pp. 1–122, 2011.[30] M. W. Weiner, D. P. Veitch et al. , “The Alzheimer’s disease neuroimag-ing initiative 3: Continued innovation for clinical trial improvement,”
Alzheimer’s & Dementia , vol. 13, no. 5, pp. 561–571, 2017.[31] C. Eckart and G. Young, “The approximation of one matrix by anotherof lower rank,”
Psychometrika, vol. 1, no. 3, pp. 211–218, 1936.

STUDY ON THE GROUPING EFFECTS OF TWO SPARSE CCA MODELS IN VARIABLE SELECTION

SUPPLEMENTARY MATERIALS

APPENDIX
We present how to solve problem (30) using the linearized alternating direction method of multipliers (ADMM) [29], [15]. Problem (31) can be solved in a similar manner.

First, we write problem (30) in the form

minimize_u −u^T X^T Y v + ι(‖Xu‖_2 ≤ 1) + ι(‖u‖_1 ≤ c_1),    (36)

where ι(·) is the indicator function defined as ι(x ∈ A) = 0 if x ∈ A and ∞ if x ∉ A.

To apply the ADMM, problem (36) is reformulated as

minimize_{u,z} −u^T X^T Y v + ι(‖z‖_2 ≤ 1) + ι(‖u‖_1 ≤ c_1) subject to Xu = z.    (37)

The augmented Lagrangian of problem (37) is

L_ρ(u, z, λ) = −u^T X^T Y v + ι(‖z‖_2 ≤ 1) + ι(‖u‖_1 ≤ c_1) + ⟨λ, Xu − z⟩ + (ρ/2) ‖Xu − z‖_2².    (38)

ADMM consists of the iterations

u^{ℓ+1} = argmin_u L_ρ(u, z^ℓ, λ^ℓ)    (39)
z^{ℓ+1} = argmin_z L_ρ(u^{ℓ+1}, z, λ^ℓ)    (40)
λ^{ℓ+1} = λ^ℓ + ρ (X u^{ℓ+1} − z^{ℓ+1}).    (41)

That is,

u^{ℓ+1} = argmin_u −u^T X^T Y v + ι(‖u‖_1 ≤ c_1) + ⟨λ^ℓ, Xu − z^ℓ⟩ + (ρ/2) ‖Xu − z^ℓ‖_2²    (42)
z^{ℓ+1} = argmin_z ι(‖z‖_2 ≤ 1) + ⟨λ^ℓ, X u^{ℓ+1} − z⟩ + (ρ/2) ‖X u^{ℓ+1} − z‖_2²    (43)
λ^{ℓ+1} = λ^ℓ + ρ (X u^{ℓ+1} − z^{ℓ+1}).    (44)

Problem (42) is not easy to solve due to the term (1/2)‖Xu − z^ℓ‖_2² ≜ f(u). To handle this, we construct a quadratic approximation of f(u) near the estimate u^ℓ of u from the previous iteration ℓ:

F(u) ≜ f(u^ℓ) + ⟨∇f(u^ℓ), u − u^ℓ⟩ + (L_X/2) ‖u − u^ℓ‖_2²
     = (1/2)‖X u^ℓ − z^ℓ‖_2² + ⟨X^T(X u^ℓ − z^ℓ), u − u^ℓ⟩ + (L_X/2) ‖u − u^ℓ‖_2²,    (45)

where L_X = λ_max(X^T X), with λ_max(·) the largest eigenvalue of its argument. Note that F(u) ≥ f(u) for any u ∈ R^{p×1} and F(u^ℓ) = f(u^ℓ).

The linearized ADMM solves the approximate version of problem (42) with the term f(u) = (1/2)‖Xu − z^ℓ‖_2² replaced by F(u):

u^{ℓ+1} = argmin_u −u^T X^T Y v + ι(‖u‖_1 ≤ c_1) + ⟨λ^ℓ, Xu − z^ℓ⟩ + ρ F(u)
        = argmin_u (ρ L_X / 2) ‖u − u^ℓ + (1/L_X) X^T (X u^ℓ − z^ℓ + (1/ρ) λ^ℓ − (1/ρ) Y v)‖_2² + ι(‖u‖_1 ≤ c_1)
        = prox_L( u^ℓ − (1/L_X) X^T (X u^ℓ − z^ℓ + (1/ρ) λ^ℓ − (1/ρ) Y v); c_1 ),    (46)

where the proximal operator prox_L(·; ·) is defined as

prox_L(a; c) = argmin_x (1/2)‖x − a‖_2² + ι(‖x‖_1 ≤ c) = a if ‖a‖_1 ≤ c, and S(a, Δ) if ‖a‖_1 > c,    (47)

where Δ is a positive constant that satisfies ‖S(a, Δ)‖_1 = c.

The update formula for problem (43) is

z^{ℓ+1} = X u^{ℓ+1} + λ^ℓ/ρ if ‖X u^{ℓ+1} + λ^ℓ/ρ‖_2 ≤ 1, and (X u^{ℓ+1} + λ^ℓ/ρ) / ‖X u^{ℓ+1} + λ^ℓ/ρ‖_2 otherwise.    (48)

Taken together, the updates at each ADMM iteration are

u^{ℓ+1} = prox_L( u^ℓ − (1/L_X) X^T (X u^ℓ − z^ℓ + ξ^ℓ − (1/ρ) Y v); c_1 )    (49)
z^{ℓ+1} = X u^{ℓ+1} + ξ^ℓ if ‖X u^{ℓ+1} + ξ^ℓ‖_2 ≤ 1, and (X u^{ℓ+1} + ξ^ℓ) / ‖X u^{ℓ+1} + ξ^ℓ‖_2 otherwise    (50)
ξ^{ℓ+1} = ξ^ℓ + X u^{ℓ+1} − z^{ℓ+1},    (51)

where ξ^ℓ = λ^ℓ/ρ.

The linearized ADMM algorithm for fitting the SCCA model is summarized in Algorithm 3.
Sparse CCA fitting algorithm: Linearized ADMM
Require: X ∈ R^{n×p}, Y ∈ R^{n×q}, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero); regularization parameters c_1 and c_2.
Calculate Lipschitz constants: L_X = λ_max(X^T X), L_Y = λ_max(Y^T Y);
Initialization: u^(0) ∈ R^{p×1}, v^(0) ∈ R^{q×1};
Set the penalty parameter ρ [29];
k = 0;
repeat
  Update u: a = Y v^(k)
  Initialization: u^0 = u^(k) ∈ R^{p×1}, z^0 = ξ^0 = 0 ∈ R^{n×1};
  for ℓ = 0, 1, 2, ... do
    u^{ℓ+1} = prox_L( u^ℓ − (1/L_X) X^T (X u^ℓ − z^ℓ + ξ^ℓ − (1/ρ) a); c_1 )
    z^{ℓ+1} = X u^{ℓ+1} + ξ^ℓ if ‖X u^{ℓ+1} + ξ^ℓ‖_2 ≤ 1, else (X u^{ℓ+1} + ξ^ℓ) / ‖X u^{ℓ+1} + ξ^ℓ‖_2
    ξ^{ℓ+1} = ξ^ℓ + X u^{ℓ+1} − z^{ℓ+1}
  end for
  Output: u^(k+1) = u^{ℓ+1}.
  Update v: b = X u^(k+1)
  Initialization: v^0 = v^(k) ∈ R^{q×1}, ζ^0 = ψ^0 = 0 ∈ R^{n×1};
  for ℓ = 0, 1, 2, ... do
    v^{ℓ+1} = prox_L( v^ℓ − (1/L_Y) Y^T (Y v^ℓ − ζ^ℓ + ψ^ℓ − (1/ρ) b); c_2 )
    ζ^{ℓ+1} = Y v^{ℓ+1} + ψ^ℓ if ‖Y v^{ℓ+1} + ψ^ℓ‖_2 ≤ 1, else (Y v^{ℓ+1} + ψ^ℓ) / ‖Y v^{ℓ+1} + ψ^ℓ‖_2
    ψ^{ℓ+1} = ψ^ℓ + Y v^{ℓ+1} − ζ^{ℓ+1}
  end for
  Output: v^(k+1) = v^{ℓ+1}.
  k ← k + 1.
until convergence.

A. Proof of Lemma IV.1

Proof.
Since problem (32) is a convex optimization problem with differentiable objective and constraint functions (note that the L1 inequality constraint can be written as 2^p linear inequality constraints), and is strictly feasible (Slater's condition holds), the KKT conditions provide necessary and sufficient conditions for optimality [27].

The Lagrangian function is

L(u, α, Δ) = −a^T u + α(‖u‖_2 − 1) + Δ(‖u‖_1 − c),

where α and Δ are the Lagrange multipliers (dual variables) for the L2 and L1 constraints, respectively. Setting the differential of L(u, α, Δ) with respect to u equal to zero yields

α u + Δ s = a,    (52)

where s is the subgradient of ‖u‖_1 with respect to u, with s_i = sign(u_i) if u_i ≠ 0 and s_i ∈ [−1, 1] otherwise. The KKT conditions for optimality consist of (52) and

α ≥ 0, ‖u‖_2 ≤ 1, α(‖u‖_2 − 1) = 0,    (53)
Δ ≥ 0, ‖u‖_1 ≤ c, Δ(‖u‖_1 − c) = 0.    (54)

● Case 1: α = 0, Δ > 0. The KKT conditions (52)-(54) simplify to

Δ s = a, Δ > 0,    (55)
‖u‖_2 ≤ 1,    (56)
‖u‖_1 = c.    (57)

From (55), it follows that Δ = max_{1≤i≤p} |a_i| and u_i = 0 for any i ∉ S, where S = {i : |a_i| = Δ}. Therefore, an optimal solution can be written in the following form:

[u*]_i = w_i sign(a_i) if i ∈ S, and 0 if i ∉ S,    (58)

with w_i ≥ 0 satisfying Σ_{i∈S} w_i² ≤ 1 and Σ_{i∈S} w_i = c. When c ≤ √|S|, the set of solutions defined above is non-empty, and among them the solution with minimum Euclidean norm is shown in (33).

● Case 2: α > 0.

– Case 2.1: Δ = 0. The KKT conditions (52)-(54) simplify to

α u = a, α > 0,    (59)
‖u‖_2 = 1,    (60)
‖u‖_1 ≤ c.    (61)

From (59)-(60), it follows that u = a/‖a‖_2. When c ≥ ‖a‖_1/‖a‖_2, this u also satisfies (61) and is therefore the optimal solution.

– Case 2.2: Δ > 0. Conditions (53)-(54) become

α > 0, ‖u‖_2 = 1,    (62)
Δ > 0, ‖u‖_1 = c.    (63)

Combining conditions (52) and (62)-(63), we obtain the optimal solution shown in Eq. (34). This corresponds to the range √|S| ≤ c < ‖a‖_1/‖a‖_2.

Fig. S1. Two particular scenarios of Case 1 in Lemma IV.1 that the optimization problem of Lemma 2.2 in [17] does not cover. The dimension is p = 2. The shaded area shows the feasible set (defined by the L2 and L1 constraints) of problem (32). (a) c = 0.8 < 1; (b) c < √2 and a_1 = a_2 = a. In both cases, the optimal solution in (a) (the point u* = [0.8; 0]) and the optimal solution in (b) (any point on the chord of the circle) do not have the form shown in Lemma 2.2 of [17].
Fig. S1 illustrates two particular scenarios of Case 1 in Lemma IV.1 for p = 2 where Lemma 2.2 of [17] fails. Essentially, the expression presented in Lemma 2.2 of [17] is the solution to

maximize_u a^T u subject to ‖u‖_2 = 1, ‖u‖_1 ≤ c,    (64)

while in problem (32), which Lemma IV.1 solves, the L2 equality constraint is replaced by an L2 inequality constraint, resulting in a convex problem.

Note that in order for problem (64) to have an optimal solution of the form presented in Lemma 2.2 of [17], c must be larger than or equal to √|S|, where S = {i : i ∈ argmax_j |a_j|} (see the proof of Lemma IV.1). Otherwise,

● when 0 ≤ c < 1, problem (64) is infeasible (there are no feasible points that satisfy the constraints);
● when 1 ≤ c < √|S|, the optimal solution to problem (64) is

[u*]_i = w_i sign(a_i) if i ∈ S, and 0 if i ∉ S,    (65a)

where the w_i ≥ 0, i ∈ S, satisfy

Σ_{i∈S} w_i² = 1, Σ_{i∈S} w_i = c.    (65b)

Note that solution (65) cannot be written in the form shown in Lemma 2.2 of [17]. By contrast, problem (32) has an optimal solution for every c > 0.

In the following sections, we show how to apply the single-canonical-component SCCA algorithms to sequentially compute multiple canonical components of the standard and simplified SCCA models. Note that except for Algorithm 8, which was described in [17], all algorithms (Algorithms 4-5 for the standard SCCA model and Algorithm 9 for the simplified SCCA model) and their theoretical justifications in Sections C1 and D1 are new, to the best of our knowledge.
The SCCA model for computing R canonical components ismaximize U , V trace ( U T ˆΣ xy V ) subject to U T ˆΣ xx U = I R , ∥ u r ∥ ≤ c r , r = , , . . . , R V T ˆΣ yy V = I R , ∥ v r ∥ ≤ c r , r = , , . . . , R (66) where ˆΣ xy = n − X T Y (67) ˆΣ xx = n − X T X (68) ˆΣ yy = n − Y T Y (69)are the sample cross-covariance between random vectors x and y , sample auto-covariance matrix within random vector x andsample auto-covariance matrix within random vector y , respectively. Here we assume that the columns of X and Y have beencentered to zero mean.For clarity, we first present two algorithms (Algorithms 4 and 5) to sequentially compute multiple canonical components ofSCCA: one is based on deflation of the cross-covariance matrix, and the other one is based on deflation of the data matrices.Then we provide theoretical explanations of both algorithms in the subsequent sections. Algorithm 4
Sequential computation of R canonical components of SCCA via deflation of the cross-covariance matrix. Let ˆΣ xy = n − X T Y ∈ R p × q , ˆΣ xx = n − X T X ∈ R p × p and ˆΣ yy = n − Y T Y ∈ R q × q . for r = , , . . . , R do Find the r -th pair of canonical weight vectors ˆu r and ˆv r by applying the single-canonical-component SCCA algorithmto ( ˆΣ r − xy , ˆΣ xx , ˆΣ yy ) : maximize u r , v r u T r ˆΣ r − xy v r subject to u T r ˆΣ xx u r ≤ , ∥ u r ∥ ≤ c r v T r ˆΣ yy v r ≤ , ∥ v r ∥ ≤ c r ˆΣ r xy ← ˆΣ r − xy − ˆΣ xx ˆ d r ˆu r ˆv T r ˆΣ yy , where ˆ d r = ˆu T r ˆΣ r − xy ˆv r ˆu T r ˆΣ xx ˆu r ⋅ ˆv T r ˆΣ yy ˆv r . end forAlgorithm 5 Sequential computation of R canonical components of SCCA via deflation of the data matrices. Let X = X ∈ R n × p , Y = Y ∈ R n × q . for r = , , . . . , R do Find the r -th pair of canonical weight vectors ( ˆu r , ˆv r ) by applying Algorithm 3 to solvemaximize u , v n − u T r X r − Y r − v r subject to n − u T r X T Xu r ≤ , ∥ u r ∥ ≤ c r n − v T r Y T Yv r ≤ , ∥ v r ∥ ≤ c r Calculate the residual data: X r ← X r − − X r − ˆu r ˆu T r X T Xˆu T r X T Xˆu r (70) Y r ← Y r − − Y r − ˆv r ˆv T r Y T Yˆv T r Y T Yˆv r (71) end for Remark
A.1 . The deflated data in Eqs. (70)-(71) can also be interpreted as the residual matrix of linear least squaresregression: minimize z ∈ R n ∥ X r − ( X T X ) − / − z ⋅ [( X T X ) / ˆu r ] T ∥ and minimize ζ ∈ R n ∥ Y r − ( Y T Y ) − / − ζ ⋅ [( Y T Y ) / ˆv r ] T ∥ ,respectively.
1) Sequential calculation of multiple SCCA canonical components in the large-sample-size asymptotic regime:
To compute R canonical components sequentially/greedily, we consider the asymptotic regime of n → ∞ in which case model (66) becomesmaximize U , V trace ( U T Σ xy V ) subject to U T Σ xx U = I R V T Σ yy V = I R (72)where Σ xy , Σ xx and Σ yy are the population cross-covariance matrix between random vectors x and y , population auto-covariance matrix within random vector x and population auto-covariance matrix within random vector y , respectively. Notethat in model (72) we have dropped the L1 regularizers: since we have infinite amount of data available for use, the L1regularizations are no longer necessary.The Lagrangian function of problem (72) is defined as L ( U , V , Ψ , Φ ) = − U T Σ xy V + ⟨ Ψ , U T Σ xx U − I R ⟩ + ⟨ Φ , V T Σ yy V − I R ⟩ where Ψ ∈ R R × R is a symmetric matrix of Lagrange multipliers for the R ( R + )/ constraints on U in problem (72), and Φ ∈ R R × R is a symmetric matrix of Lagrange multipliers for the R ( R + )/ constraints on V .Denote the optimal primal and dual solutions of problem (72) as ( ˆU , ˆV ) and ( ˆΨ , ˆΦ ) , respectively. According to the KKTconditions, we have Σ xx ˆU ˆΨ = Σ xy ˆV (73) Σ yy ˆV ˆΦ = Σ T xy ˆU (74)Combining Eqs. (73)-(74) with the quadratic constraints in problem (72) yields ˆΨ = ˆU T Σ xy ˆV ˆΦ = ˆV T Σ T xy ˆU Note that problem (72) does not have a unique solution due to the rotational ambiguity: if ( ˆU , ˆV ) is an optimal solution ofproblem (72), then ( ˆˆU , ˆˆV ) = ( ˆUQ , ˆVQ ) for any orthogonal matrix Q ∈ R R × R is also an optimal solution. Since ˆΨ and thus ˆU T Σ xy ˆV is a symmetric matrix, we can choose the optimal solution ( ˆU , ˆV ) for which ˆU T Σ xy ˆV is a diagonal matrix. As aresult, ˆΨ = ˆΦ =∶ D is a diagonal matrix. Assuming both Σ xx and Σ yy are nonsingular, Eqs. (73)-(74) can be rewritten as Σ / xx ˆUD = Σ − / xx Σ xy Σ − / yy ⋅ Σ / yy ˆV (75) Σ / yy ˆVD = Σ − / yy Σ T xy Σ − / xx ⋅ Σ / xx ˆU (76)Note that the objective of problem (72) is to maximize trace ( D ) under the constraints that Σ / xx U and Σ / yy V both haveorthonormal columns. It follows that D contains the R largest singular values of Σ − / xx Σ xy Σ − / yy , and ˆE = Σ / xx ˆU and ˆF = Σ / yy ˆV contain the corresponding R left and right singular vectors, respectively. According to the Eckart-Young-Mirskytheorem [31], the columns of ˆU and ˆV can be obtained by successive rank-one SVDs of the residual covariance matrix.Specifically, let S = Σ − / xx Σ xy Σ − / yy ∈ R p × q . For r = , , . . . , R , we have ( ˆ d r , ˆu r , ˆv r ) = argmin d r , u r , v r ∥ Σ / xx u r ∥= ∥ Σ / yy v r ∥= ∥ S r − − Σ / xx d r u r v T r Σ / yy ∥ (77) S r = S r − − Σ / xx ˆ d r ˆu r ˆv T r Σ / yy (78)Suppose we have obtained the estimate of the r -th pair of canonical weight vectors ( ˆu r , ˆv r ) . We then estimate d r as ˆ d r = argmin d r ∥ S r − − Σ / xx d r ˆu r ˆv T r Σ / yy ∥ = ˆu T r Σ / xx S r − Σ / yy ˆv r ˆu T r Σ xx ˆu r ⋅ ˆv T r Σ yy ˆv r Taken all together, to compute multiple canonical components sequentially in the large-sample-size asymptotic regime, theresidual covariance matrix is updated as below: S = Σ − / xx Σ xy Σ − / yy (79) S r = S r − − Σ / xx ˆu r ˆu T r Σ / xx S r − Σ / yy ˆv r ˆv T r Σ / yy ˆu T r Σ xx ˆu r ⋅ ˆv T r Σ yy ˆv r , r = , , . . . , R (80) or equivalently Σ xy = Σ xy (81) Σ r xy = Σ r − xy − Σ xx ˆu r ˆu T r Σ r − xy ˆv r ˆv T r Σ yy ˆu T r Σ xx ˆu r ⋅ ˆv T r Σ yy ˆv r , r = , , . . . 
, R (82)which results in Algorithm 6.Let x ∈ R p × and y ∈ R q × be random vectors generating the X ∈ R n × p and Y ∈ R n × q , respectively. For notational simplicity,assume E [ x ] = , E [ y ] = . It can be shown that the residual covariance matrix update formulas (79)-(80) can be rewritten interms of random vectors x and y as x = x , y = y (83) x r = Σ / xx ⎛⎝ I p − Σ / xx ˆu r ˆu T r Σ / xx ˆu T r Σ xx ˆu r ⎞⎠ Σ − / xx x r − = x r − − Σ xx ˆu r ˆu T r ˆu T r Σ xx ˆu r x r − (84) y r = Σ / yy ⎛⎝ I q − Σ / yy ˆv r ˆv T r Σ / yy ˆv T r Σ yy ˆv r ⎞⎠ Σ − / yy y r − = y r − − Σ yy ˆv r ˆv T r ˆv T r Σ yy ˆv r y r − (85)which results in Algorithm 7. Algorithm 6
Sequential computation of R canonical components of SCCA in asymptotic regime via deflation of the populationcross-covariance matrix. Σ xy = E [ xy T ] , Σ xx = E [ xx T ] and Σ yy = E [ yy T ] . for r = , , . . . , R do Find the estimate of the r -th pair of canonical weight vectors ˆu r and ˆv r :maximize u r , v r u T r Σ r − xy v r subject to u T r Σ xx u r = v T r Σ yy v r = Σ r xy ← Σ r − xy − Σ xx ˆ d r ˆu r ˆv T r Σ yy , where ˆ d r = ˆu T r Σ r − xy ˆv r ˆu T r Σ xx ˆu r ⋅ ˆv T r Σ yy ˆv r . end forAlgorithm 7 Sequential computation of R canonical components of SCCA in asymptotic regime via deflation of randomvectors. Let x = x ∈ R p × , y = y ∈ R q × . for r = , , . . . , R do Find the estimate of the r -th pair of canonical weight vectors ˆu r and ˆv r :maximize u r , v r u T r E [ x r − y r − ] v r subject to u T r E [ xx T ] u r = v T r E [ yy T ] v r = Calculate the residual random vectors: x r ← x r − − Σ xx ˆu r ˆu T r ˆu T r Σ xx ˆu r x r − y r ← y r − − Σ yy ˆv r ˆv T r ˆv T r Σ yy ˆv r y r − end for The Algorithms 4 and 5 are implementations of Algorithms 6 and 7 in finite-sample settings, respectively.
D. Sequential calculation of multiple canonical components of simplified SCCA
The simplified SCCA model for computing R canonical components ismaximize U , V trace ( U T ˆΣ xy V ) subject to U T U = I R , ∥ u r ∥ ≤ c r , r = , , . . . , R V T V = I R , ∥ v r ∥ ≤ c r , r = , , . . . , R (86)where ˆΣ xy is the sample cross-covariance matrix between random vectors x and y .For clarity, we first present two algorithms (Algorithms 8 and 9) to sequentially compute multiple canonical components ofsimplified SCCA: one is based on deflation of the cross-covariance matrix, and the other one is based on deflation of the datamatrices. Then we provide theoretical explanations of both algorithms in the subsequent sections. Algorithm 8
Sequential computation of R canonical components of simplified SCCA via deflation of the cross-covariancematrix. Let ˆΣ xy = n − X T Y ∈ R p × q . for r = , , . . . , R do Find the r -th pair of canonical weight vectors ˆu r and ˆv r by applying Algorithm 2 to ˆΣ r − xy :maximize u r , v r u T r ˆΣ r − xy v r subject to ∥ u r ∥ ≤ , ∥ u r ∥ ≤ c r ∥ v r ∥ ≤ , ∥ v r ∥ ≤ c r ˆΣ r xy ← ˆΣ r − xy − ˆ d r ˆu r ˆv T r , where ˆ d r = ˆu T r ˆΣ r − xy ˆv r ∥ ˆu r ∥ ⋅∥ ˆv r ∥ . end forAlgorithm 9 Sequential computation of R canonical components of simplified SCCA via deflation of the data matrices. Let X = X ∈ R n × p , Y = Y ∈ R n × q . for r = , , . . . , R do Find the r -th pair of canonical weight vectors ( ˆu r , ˆv r ) by applying Algorithm 2:maximize u , v n − u T r X r − Y r − v r subject to ∥ u r ∥ ≤ , ∥ u ∥ ≤ c r ∥ v r ∥ ≤ , ∥ v ∥ ≤ c r Calculate the residual data: X r ← X r − ⎛⎝ I p − ˆu r ˆu T r ∥ ˆu r ∥ ⎞⎠ (87) Y r ← Y r − ⎛⎝ I q − ˆv r ˆv T r ∥ ˆv r ∥ ⎞⎠ (88) end for Remark
A.2 . The deflated data in Eqs. (87)-(88) can also be interpreted as the residual matrix of linear least squares regression:minimize z ∈ R n ∥ X r − − z ⋅ ˆu T r ∥ and minimize ζ ∈ R n ∥ Y r − − ζ ⋅ ˆv T r ∥ , respectively.
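For illustration, the data-deflation scheme of Eqs. (87)-(88) can be sketched as follows (Python/NumPy). Here `simplified_scca_single` is a hypothetical placeholder for any single-component solver such as Algorithm 2, and reusing one (c1, c2) pair across components is an assumption of the sketch (per-component parameters c_{1r}, c_{2r} are equally possible).

```python
import numpy as np

# Sketch of Algorithm 9: sequential components of simplified SCCA via data deflation.
def simplified_scca_multi(X, Y, c1, c2, R, simplified_scca_single):
    Xr, Yr = X.copy(), Y.copy()
    components = []
    for _ in range(R):
        u, v = simplified_scca_single(Xr, Yr, c1, c2)
        components.append((u, v))
        # Project the found directions out of the data (Eqs. (87)-(88)).
        Xr = Xr - np.outer(Xr @ u, u) / (u @ u)
        Yr = Yr - np.outer(Yr @ v, v) / (v @ v)
    return components
```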
1) Sequential calculation of multiple SCCA canonical components in the large-sample-size asymptotic regime:
To compute R canonical components sequentially/greedily, we consider the asymptotic regime of n → ∞ in which case model (86) becomesmaximize U , V trace ( U T Σ xy V ) subject to U T U = I R V T V = I R (89)where Σ xy is the population cross-covariance matrix between random vectors x and y . Note that in model (89) we havedropped the L1 regularizers: since we have infinite amount of data available for use, the L1 regularizations are no longernecessary. The Lagrangian function of problem (89) is defined as
L ( U , V , Ψ , Φ ) = − U T Σ xy V + ⟨ Ψ , U T U − I R ⟩ + ⟨ Φ , V T V − I R ⟩ where Ψ ∈ R R × R is a symmetric matrix of Lagrange multipliers for the R ( R + )/ constraints on U in problem (89), and Φ ∈ R R × R is a symmetric matrix of Lagrange multipliers for the R ( R + )/ constraints on V .Denote the optimal primal and dual solutions of problem (89) as ( ˆU , ˆV ) and ( ˆΨ , ˆΦ ) , respectively. According to the KKTconditions, we have ˆU ˆΨ = Σ xy ˆV (90) ˆV ˆΦ = Σ T xy ˆU (91)Combining Eqs. (90)-(91) with the quadratic constraints in problem (89) yields ˆΨ = ˆU T Σ xy ˆV ˆΦ = ˆV T Σ T xy ˆU Note that problem (89) does not have a unique solution due to the rotational ambiguity: if ( ˆU , ˆV ) is an optimal solution ofproblem (89), then ( ˆˆU , ˆˆV ) = ( ˆUQ , ˆVQ ) for any orthogonal matrix Q ∈ R R × R is also an optimal solution. Since ˆΨ and thus ˆU T Σ xy ˆV is a symmetric matrix, we can choose the optimal solution ( ˆU , ˆV ) for which ˆU T Σ xy ˆV is a diagonal matrix. As aresult, ˆΨ = ˆΦ =∶ D is a diagonal matrix. Assuming both Σ xx and Σ yy are nonsingular, Eqs. (90)-(91) can be rewritten as ˆUD = Σ xy ⋅ ˆV (92) ˆVD = Σ T xy ⋅ ˆU (93)Note that the objective of problem (89) is to maximize trace ( D ) under the constraints that U and V both have orthonormalcolumns. It follows that D contains the R largest singular values of Σ xy , and ˆU and ˆV contain the corresponding R left andright singular vectors, respectively. According to the Eckart-Young-Mirsky theorem [31], the columns of ˆU and ˆV can beobtained by successive rank-one SVDs of the residual covariance matrix. Specifically, let Σ xy = Σ xy ∈ R p × q . For r = , , . . . , R ,we have ( ˆ d r , ˆu r , ˆv r ) = argmin d r , u r , v r ∥ u r ∥= ∥ v r ∥= ∥ Σ r − xy − d r u r v T r ∥ (94) Σ r xy = Σ r − xy − ˆ d r ˆu r ˆv T r (95)Suppose we have obtained the estimate of the r -th pair of canonical weight vectors ( ˆu r , ˆv r ) . We then estimate d r as ˆ d r = argmin d r ∥ Σ r − xy − d r ˆu r ˆv T r ∥ = ˆu T r Σ r − xy ˆv r ∥ ˆu r ∥ ⋅ ∥ ˆv r ∥ Taken all together, to compute multiple canonical components sequentially in the large-sample-size asymptotic regime, theresidual covariance matrix is updated as below: Σ xy = Σ xy (96) Σ r xy = Σ r − xy − ˆu r ˆu T r Σ r − xy ˆv r ˆv T r ∥ ˆu r ∥ ⋅ ∥ ˆv r ∥ , r = , , . . . , R (97)This results in Algorithm 10.For notational simplicity, assume E [ x ] = , E [ y ] = . It can be shown that the residual covariance matrix update formulas(96)-(97) can be rewritten in terms of random vectors x and y as x = x , y = y (98) x r = ⎛⎝ I p − ˆu r ˆu T r ∥ ˆu r ∥ ⎞⎠ x r − (99) y r = ⎛⎝ I q − ˆv r ˆv T r ∥ ˆv r ∥ ⎞⎠ y r − (100)which results in Algorithm 11. Algorithm 10
Sequential computation of R canonical components of simplified SCCA in asymptotic regime via deflation ofthe population cross-covariance matrix. Let Σ xy = E [ xy T ] . for r = , , . . . , R do Solve for the r -th pair of canonical weight vectors ˆu r and ˆv r :maximize u r , v r u T r Σ r − xy v r subject to ∥ u r ∥ = ∥ v r ∥ = Σ r xy ← Σ r − xy − ˆ d r ˆu r ˆv T r , where ˆ d r = ˆu T r Σ r − xy ˆv r ∥ ˆu r ∥ ⋅∥ ˆv r ∥ . end forAlgorithm 11 Sequential computation of R canonical components of simplified SCCA in asymptotic regime via deflation ofrandom vectors. Let x = x ∈ R p × , y = y ∈ R q × . for r = , , . . . , R do Solve for the r -th pair of canonical weight vectors ˆu r and ˆv r :maximize u r , v r u T r E [ x r − y r − ] v r subject to ∥ u r ∥ = ∥ v r ∥ = Calculate the residual random vectors: x r ← ⎛⎝ I p − ˆu r ˆu T r ∥ ˆu r ∥ ⎞⎠ x r − y r ← ⎛⎝ I q − ˆv r ˆv T r ∥ ˆv r ∥ ⎞⎠ y r − end for In finite-sample settings, the covariance matrix deflation based Algorithm 10 becomes Algorithm 8 to sequentially compute R canonical components of simplified SCCA, while the random vector deflation based Algorithm 11 becomes Algorithm 9. E. Covariance structure of the synthetic data
The sample cross- and auto-covariance matrices among random vectors x and y are defined as ˆΣ xy = n X T Y (101) ˆΣ xx = n X T X (102) ˆΣ yy = n Y T Y (103)
1) Experimental setup 1: uncorrelated variables:
The population cross- and auto-covariance matrices among random vectors x and y are E [ x ] = , E [ y ] = xx = E [ xx T ] = I p (104) Σ yy = E [ yy T ] = ∥ c ∥ dd T + σ I q (105) Σ xy = E [ xy T ] = cd T (106) Sample covariances between X and Y variables
20 40 60 80 100
Y variable j X v a r i ab l e i -0.3-0.2-0.100.10.20.3 Population covariances between X and Y variables
20 40 60 80 100
Y variable j X v a r i ab l e i -0.3-0.2-0.100.10.20.3 Sample covariances among X variables
500 1000 1500 2000
X variable i X v a r i ab l e i Population covariances among X variables
500 1000 1500 2000
X variable i X v a r i ab l e i Sample covariances among Y variables
20 40 60 80 100
Y variable j Y v a r i ab l e j -3-2-101234 Population covariances among Y variables
20 40 60 80 100
Y variable j Y v a r i ab l e j -3-2-101234 Fig. S2. Experimental setup 1: Heatmaps showing the sample (left) and population (right) cross-covariances between X and Y variables (top), auto-covarianceswithin X variables (middle), and auto-covariances within Y variables (bottom).
2) Experimental setup 2: grouped variables:
The population cross- and auto-covariance matrices among the random vectors x and y are

E[x] = 0, E[y] = 0,
Σ_xx = E[xx^T] = blkdiag(Σ_1, ..., Σ_R, Σ_{R+1}, ..., Σ_G),    (107)
Σ_yy = E[yy^T] = σ_z² dd^T + σ² I_q,    (108)
Σ_xy = E[xy^T] = Σ_xx c d^T,    (109)

where Σ_g = (σ_{g,ij}) ∈ R^{p_g × p_g}, with σ_{g,ii} = 1 and σ_{g,ij} = ρ_{g,i} ρ_{g,j} for any i ≠ j and g = 1, 2, ..., G, and σ_z² := E[z²] = c^T Σ_xx c. Here R is the number of relevant/informative groups.

TABLE S1
GROUP SIZES OF VARIABLES IN x

Group ID:   G1  G2  G3  G4  G5  G6  G7  G8  G9  G10 G11 G12 G13 G14 G15 G16 G17 G18 G19 G20
Group size: 89  112 92  88  88  99  130 103 94  91  99  91  90  112 96  100 96  91  103 100
20 40 60 80 100
Y variable j X v a r i ab l e i -10-50510 Population covariances between X and Y variables
20 40 60 80 100
Y variable j X v a r i ab l e i -10-50510 Sample covariances among X variables
200 400 600 800 1000 1200 1400 1600 1800
X variable i X v a r i ab l e i Population covariances among X variables
200 400 600 800 1000 1200 1400 1600 1800
X variable i X v a r i ab l e i Sample covariances among Y variables
20 40 60 80 100
Y variable j Y v a r i ab l e j -600-400-2000200400600800 Population covariances among Y variables
20 40 60 80 100
Y variable j Y v a r i ab l e j -600-400-2000200400600800 Fig. S3. Experimental setup 2: Heatmaps showing the sample (left) and population (right) cross-covariances between X and Y variables (top), auto-covarianceswithin X variables (middle), and auto-covariances within Y variables (bottom). F. Hyperparameter tuning and performance estimation
To select the regularization parameters (c_1, c_2) and estimate the generalization performance, we partition the data into training (50%, n_s samples), validation (25%, n_v samples), and testing (25%, n_t = n − n_s − n_v samples) data sets:

[X Y] = [X_train Y_train; X_val Y_val; X_test Y_test] ∈ R^{(n_s + n_v + n_t) × (p + q)}

The training and validation data are used to tune the regularization parameters (c_1, c_2), and the test data are used to estimate the performance.

To select the regularization parameters (c_1, c_2), we fit the (simplified) SCCA model on the training data using each candidate value of (c_1, c_2), where c_1 and c_2 are chosen from sequences of values equally spaced on the log scale between ⌊log c_{1,min}⌋ and ⌈log c_{1,max}⌉ and between ⌊log c_{2,min}⌋ and ⌈log c_{2,max}⌉, respectively. Here, c_{ℓ,min} and c_{ℓ,max}, ℓ = 1, 2, are the minimum and maximum values of c_ℓ, which are calculated for the standard and simplified SCCA models in Section F1.

Denote the solution of the model fitted with (c_1, c_2) as (û_train(c_1, c_2), v̂_train(c_1, c_2)). For the standard SCCA model, the optimal (c_1, c_2) are chosen as

(c_1^opt, c_2^opt) = argmax_{c_1, c_2} Corr(X_val û_train, Y_val v̂_train)                                           (110)
                   = argmax_{c_1, c_2} ⟨X_val û_train, Y_val v̂_train⟩ / (∥X_val û_train∥_2 ∥Y_val v̂_train∥_2)       (111)

For the simplified SCCA model, the optimal (c_1, c_2) are chosen as

(c_1^opt, c_2^opt) = argmax_{c_1, c_2} Cov(X_val û_train / ∥û_train∥_2, Y_val v̂_train / ∥v̂_train∥_2)                (112)
                   = argmax_{c_1, c_2} (1/n_v) ⟨X_val û_train, Y_val v̂_train⟩ / (∥û_train∥_2 ∥v̂_train∥_2)           (113)

Then, we refit the SCCA model with (c_1^opt, c_2^opt) on all training data (combined training and validation data) to get the solution (û_trainval, v̂_trainval). The canonical covariance and correlation on the test data are reported as the generalization performance:

Cov(X_test û_trainval, Y_test v̂_trainval) = (1/n_t) ⟨X_test û_trainval, Y_test v̂_trainval⟩ / (∥û_trainval∥_2 ∥v̂_trainval∥_2)   (114)
Corr(X_test û_trainval, Y_test v̂_trainval) = ⟨X_test û_trainval, Y_test v̂_trainval⟩ / (∥X_test û_trainval∥_2 ∥Y_test v̂_trainval∥_2)   (115)

The sample covariance in (114) has n_t rather than n_t − 1 in the denominator because the population mean is assumed to be known and equal to 0.
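The selection rules (110)-(113) amount to a plain grid search; the routine below is a minimal sketch of that search. It assumes a user-supplied fitting function fit_scca(X, Y, c1, c2) returning (û, v̂), which is hypothetical and not tied to any particular implementation.

```python
import numpy as np

def tune_scca(X_tr, Y_tr, X_val, Y_val, c1_grid, c2_grid, fit_scca, simplified=False):
    """Pick (c1, c2) by validation correlation (standard SCCA) or covariance (simplified SCCA)."""
    best, best_score = None, -np.inf
    n_val = X_val.shape[0]
    for c1 in c1_grid:
        for c2 in c2_grid:
            u, v = fit_scca(X_tr, Y_tr, c1, c2)   # hypothetical SCCA fitting routine
            a, b = X_val @ u, Y_val @ v
            if simplified:   # canonical covariance, as in (112)-(113)
                score = (a @ b) / (n_val * np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
            else:            # canonical correlation, as in (110)-(111)
                score = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
            if score > best_score:
                best, best_score = (c1, c2), score
    return best, best_score
```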
1) Effective range of c_1 and c_2: To determine the range of the parameters (c_1, c_2) for the standard SCCA model (1), we replace its L2 inequality constraints with L2 equality constraints:

maximize_{u, v}  u^T X^T Y v   subject to   u^T X^T X u = 1, ∥u∥_1 ≤ c_1;   v^T Y^T Y v = 1, ∥v∥_1 ≤ c_2   (116)

We note that for valid L1 regularization, the L1 inequality constraints need to be active (i.e., satisfied as equalities) at the optimal solution. This implies that

c_1 ≥ minimize_u ∥u∥_1 subject to ∥Xu∥_2 = 1   (117)
c_2 ≥ minimize_v ∥v∥_1 subject to ∥Yv∥_2 = 1   (118)

and

c_1 ≤ maximize_u ∥u∥_1 subject to ∥Xu∥_2 = 1   (119)
c_2 ≤ maximize_v ∥v∥_1 subject to ∥Yv∥_2 = 1   (120)

Simple analysis shows that a sufficient condition for (117)-(118) to hold is

c_1 ≥ max( 1/σ_max(X), 1/√(max_{1≤i≤p} Σ_{ℓ=1}^{n} x_{ℓi}^2) ) =: c_{1,min}   (121)
c_2 ≥ max( 1/σ_max(Y), 1/√(max_{1≤j≤q} Σ_{ℓ=1}^{n} y_{ℓj}^2) ) =: c_{2,min}   (122)

Note, however, that the objective in (119) (resp., (120)) is unbounded above when n < p (resp., n < q), and thus it cannot be used to find an effective maximum of c_1 (resp., c_2). To find effective maximum values of c_1 and c_2, we instead solve problem (116) in the absence of the L1 constraints:

maximize_{u, v}  u^T X^T Y v   subject to   u^T X^T X u = 1,  v^T Y^T Y v = 1   (123)

Denote the optimal solution of problem (123) as (u*, v*). We set c_{1,max} = ∥u*∥_1 and c_{2,max} = ∥v*∥_1. It can be shown that u* = (X^T X)^{-1/2} u_1 and v* = (Y^T Y)^{-1/2} v_1, where u_1 and v_1 are respectively the left and right singular vectors of (X^T X)^{-1/2} X^T Y (Y^T Y)^{-1/2} associated with the largest singular value. If X^T X is singular, we can approximate it by X^T X + εI_p; likewise for Y^T Y.

By a similar line of reasoning, to determine the range of the parameters (c_1, c_2) for the simplified SCCA model, consider

maximize_{u, v}  u^T X^T Y v   subject to   ∥u∥_2 = 1, ∥u∥_1 ≤ c_1;   ∥v∥_2 = 1, ∥v∥_1 ≤ c_2   (124)

We note that effective values of c_1 and c_2 should be such that the L1 inequality constraints are active (i.e., satisfied as equalities) at the optimal solution. To this end, they should satisfy

c_1 ≥ minimize_u ∥u∥_1 subject to ∥u∥_2 = 1   (125)
c_2 ≥ minimize_v ∥v∥_1 subject to ∥v∥_2 = 1   (126)

and

c_1 ≤ maximize_u ∥u∥_1 subject to ∥u∥_2 = 1   (127)
c_2 ≤ maximize_v ∥v∥_1 subject to ∥v∥_2 = 1   (128)

From (125)-(128), it follows that

c_{1,min} := 1 ≤ c_1 ≤ √p   (129)
c_{2,min} := 1 ≤ c_2 ≤ √q   (130)

The upper bounds √p for c_1 and √q for c_2 are too loose. To find tighter bounds, we instead solve problem (124) in the absence of the L1 constraints:

maximize_{u, v}  u^T X^T Y v   subject to   ∥u∥_2 = 1, ∥v∥_2 = 1   (131)

The optimal solution is u* = u_1, v* = v_1, where u_1 and v_1 are respectively the left and right singular vectors of X^T Y associated with the largest singular value. We set c_{1,max} = ∥u*∥_1 and c_{2,max} = ∥v*∥_1.
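A sketch of how these ranges could be computed in practice is given below. It follows (121)-(123) and (129)-(131), with the matrix inverse square root replaced by a Cholesky-based change of variables and a small ridge εI used to keep X^T X and Y^T Y invertible when n < p or n < q; this is an implementation choice for illustration, not the authors' code.

```python
import numpy as np

def c_ranges(X, Y, eps=1e-6, simplified=False):
    """Candidate (c_min, c_max) for the L1 radii c1 and c2 (a sketch of Section F1)."""
    n, p = X.shape
    q = Y.shape[1]
    C = X.T @ Y
    if simplified:
        c1_min, c2_min = 1.0, 1.0                          # (129)-(130)
        U, _, Vt = np.linalg.svd(C)                        # top singular vectors of X^T Y, as in (131)
        u_star, v_star = U[:, 0], Vt[0, :]
    else:
        # (121)-(122): 1 / (largest column norm), which also dominates 1/sigma_max.
        c1_min = 1.0 / np.sqrt((X**2).sum(axis=0).max())
        c2_min = 1.0 / np.sqrt((Y**2).sum(axis=0).max())
        # Whitened problem (123); the ridge eps*I keeps X^T X and Y^T Y positive definite.
        Lx = np.linalg.cholesky(X.T @ X + eps * np.eye(p))
        Ly = np.linalg.cholesky(Y.T @ Y + eps * np.eye(q))
        M = np.linalg.solve(Lx, C)                         # Lx^{-1} X^T Y
        M = np.linalg.solve(Ly, M.T).T                     # ... times Ly^{-T}
        U, _, Vt = np.linalg.svd(M)
        u_star = np.linalg.solve(Lx.T, U[:, 0])            # back-transform to the original variables
        v_star = np.linalg.solve(Ly.T, Vt[0, :])
    return (c1_min, np.abs(u_star).sum()), (c2_min, np.abs(v_star).sum())
```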
G. Variable selection performance

The balanced accuracy (bACC) and Matthews correlation coefficient (MCC) are defined as

bACC = (1/2) ( TP/(TP + FN) + TN/(TN + FP) ),   (132)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),   (133)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. The bACC and MCC are overall measures of variable selection accuracy, and a larger score indicates better variable selection performance. The relative absolute error (RAE) for the selection of the X variables is defined as

RAE = ∥û − u*∥_2 / ∥u*∥_2,   (134)

where u* and û denote the true and estimated canonical vectors, respectively. Our variable selection performance on the synthetic data is shown in Table S2 and Table S3.

TABLE S2
The X variable selection performance of the SCCA and simplified SCCA on the whole training data.

Model                 Recall  Precision  F1 score  ACC    bACC   MCC    PR AUC  RAE
Experimental setup 1
  SCCA
  Simp. SCCA          0.430   0.306      0.358     0.846  0.661  0.278  0.429   1.044
Experimental setup 2
  SCCA                                                                  0.998   0.320
  Simp. SCCA          1.000   0.602      0.751     0.846  0.899  0.693  0.800   0.110

TABLE S3
The Y variable selection performance of the SCCA and simplified SCCA on the whole training data.

Model                 Recall  Precision  F1 score  ACC    bACC   MCC    PR AUC  RAE
Experimental setup 1
  SCCA                                                                  0.896   0.823
  Simp. SCCA          1.000   0.566      0.723     0.770  0.836  0.616  0.957   0.030
Experimental setup 2
  SCCA                                                                  0.844   0.503
  Simp. SCCA          0.967   0.558      0.707     0.760  0.819  0.585  0.948   0.050
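For reference, a minimal sketch of these selection metrics is given below. Treating a variable as "selected" when its estimated weight is nonzero is an assumption, as is the requirement that û already be sign-aligned with u*.

```python
import numpy as np

def selection_metrics(u_true, u_hat):
    """bACC, MCC, and RAE for variable selection, as in (132)-(134).
    Assumes u_hat has been sign-aligned with u_true."""
    sel_true = u_true != 0
    sel_hat = u_hat != 0
    tp = np.sum(sel_hat & sel_true)
    tn = np.sum(~sel_hat & ~sel_true)
    fp = np.sum(sel_hat & ~sel_true)
    fn = np.sum(~sel_hat & sel_true)
    bacc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + 1e-12)
    rae = np.linalg.norm(u_hat - u_true) / np.linalg.norm(u_true)
    return bacc, mcc, rae
```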
H. Subject characteristics

Participant characteristics of our real imaging genetics data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort are shown in Table S4.

TABLE S4
Subject characteristics.

                          HC       SMC      EMCI      LMCI     AD
Num                       183      75       218       184      97
Gender (M/F)              89/94    29/46    113/105   96/88    54/43
Handedness (R/L)          163/20   65/10    195/23    165/19   89/8
Age (mean ± std)          73.96
Education (mean ± std)    16.44
I. Correlation structure of the real imaging genetic data
Correlation structure of the real ADNI imaging genetics data used in this study is shown in Fig. S4.

Fig. S4. Heatmaps showing the pairwise sample Pearson correlation coefficients between genetic and imaging features (left), within genetic features (middle), and within imaging features (right).
J. Hyperparameter tuning and generalization performance estimation
We employ a nested cross-validation method, which is an extension of the procedure described in Section F. We first randomly divide each category of subjects into five roughly equal-sized subgroups and combine the data across categories to form five outer folds. The first fold is used for testing and the remaining folds for training/validating the model. The test set is put aside, and the following steps are carried out on the training+validation data:

(1) We employ stratified cross-validation (CV) to choose (c_1, c_2). The samples/subjects from each category are randomly divided into five roughly equal-sized subgroups and then combined to form five folds I = ∪_{k=1}^{5} I_k. Denote by X_trainval[I_k] and Y_trainval[I_k], k = 1, 2, ..., 5, the submatrices formed by the rows of X_trainval and Y_trainval indexed by I_k.

(2) The SCCA model is fitted to the normalized (X_trainval[I∖I_1], Y_trainval[I∖I_1]) to obtain the solution (û_(−1), v̂_(−1)). Then, the performance on the validation fold is recorded as Corr(X_trainval[I_1] û_(−1), Y_trainval[I_1] v̂_(−1)). This process is repeated five times, with each fold of samples/subjects used once as the validation set.

(3) The cross-validation criterion used to select the regularization parameters is

(c_1^opt, c_2^opt) = argmax_{c_1, c_2} Σ_{k=1}^{5} Corr(X_trainval[I_k] û_(−k), Y_trainval[I_k] v̂_(−k))   (135)

where Corr(·, ·) is the correlation function and (û_(−k), v̂_(−k)) are the estimates of (u, v) obtained by the standard SCCA on the training+validation data (X_trainval[I∖I_k], Y_trainval[I∖I_k]) with (c_1, c_2) as regularization parameters.

(4) The SCCA model is then fitted to the entire training+validation set at (c_1^opt, c_2^opt) to estimate the canonical weights (û_opt, v̂_opt). The canonical correlation on the test data, Corr(X_test û_opt, Y_test v̂_opt), is reported as the generalization performance. For the simplified SCCA, the canonical covariance is used as the metric both to tune the regularization parameters and to measure performance.

This process is repeated five times, with each outer fold of samples/subjects used once as the testing set.
K. Genetic and Imaging Marker Selection

TABLE S5
Genetic features (ordered by absolute values of the estimated canonical weights) selected by SCCA and simplified SCCA. For each model, the columns are: SNP, closest gene, estimated weight û_i, and p-value. The closest genes of the selected SNPs include APOE, CR1, PICALM, ABCA7, BIN1, MS4A6A, SORL1, FERMT2, MEF2C, CLU, EPHA1, SLC24A4, RIN3, DSG2, CELF1, INPP5D, CD33, CASS4, NME8, ZCWPW1, and HLA-DRB1; the top-ranked SNP for the standard SCCA is rs4420638 (closest gene APOE).
TABLE S6
Imaging features (brain ROIs, ordered by absolute values of the estimated canonical weights) selected by SCCA and simplified SCCA.

Standard SCCA                               Simplified SCCA
brain ROI              v̂_j      p-value     brain ROI              v̂_j     p-value
Hippocampus L          -0.403   1.25e-08    Frontal Med Orb L      0.138   9.65e-26
Frontal Mid R           0.279   4.84e-18    Frontal Sup Medial L   0.135   8.66e-21
Frontal Mid L           0.261   1.67e-18    Cingulum Ant L         0.133   2.32e-19
Precentral L           -0.249   7.67e-07    Frontal Med Orb R      0.133   1.04e-24
Rolandic Oper L        -0.238   4.63e-10    Frontal Sup Medial R   0.132   4.47e-20
Frontal Sup Medial L    0.235   8.66e-21    Rectus L               0.132   3.33e-25
Cerebelum 6 R           0.219   5.71e-10    Frontal Mid R          0.130   4.84e-18
Calcarine R            -0.216   5.11e-13    Frontal Mid Orb R      0.129   5.09e-21
Insula R                0.206   1.67e-16    Frontal Mid L          0.129   1.67e-18
Cingulum Ant L          0.188   2.32e-19    Temporal Mid R         0.128   2.15e-20
Temporal Pole Mid R    -0.187   1.39e-06    Rectus R               0.128   4.53e-22
Caudate L               0.185   1.38e-01    Frontal Sup Orb R      0.128   3.23e-20
Precentral R           -0.179   1.25e-05    Insula R               0.127   1.67e-16
Vermis 8                0.171   9.35e-01    Temporal Inf R         0.127   6.79e-19
Temporal Inf R          0.169   6.79e-19    Frontal Sup Orb L      0.127   7.41e-20
Cuneus R               -0.165   9.59e-07    Frontal Mid Orb L      0.126   3.34e-21
Olfactory L             0.130   1.46e-13    Frontal Inf Orb R      0.126   2.57e-14
Heschl R                0.130   6.06e-17    Olfactory L            0.125   1.46e-13
Occipital Inf L         0.112   2.30e-13    Cingulum Mid L         0.125   8.86e-22
Cerebelum 9 L          -0.112   1.18e-03    Cingulum Mid R         0.125   2.12e-19
Thalamus R              0.108   9.22e-01    Frontal Inf Orb L      0.124   1.64e-17
Cerebelum 3 L          -0.107   2.41e-05    Cingulum Ant R         0.123   6.76e-15
Putamen L               0.105   2.11e-17    Frontal Sup R          0.123   1.35e-14
Frontal Med Orb L       0.098   9.65e-26    Temporal Sup R         0.123   2.06e-20
Temporal Mid R          0.097   2.15e-20    Temporal Mid L         0.121   1.94e-21
Occipital Mid L         0.090   1.55e-09    Precuneus L            0.121   8.66e-22
Frontal Inf Orb R       0.081   2.57e-14    Olfactory R            0.120   7.48e-11
Frontal Mid Orb R       0.073   5.09e-21    Precuneus R            0.120   8.93e-23
Olfactory R             0.069   7.48e-11    Frontal Inf Tri L      0.120   4.56e-16
Vermis 3                0.059   1.05e-01    Temporal Inf L         0.119   5.09e-19
Cerebelum 3 R          -0.053   1.33e-05    Temporal Sup L         0.119   8.89e-17
Cerebelum 4 5 R        -0.052   5.88e-09    Frontal Sup L          0.119   6.30e-15
Cuneus L               -0.052   5.56e-06    Parietal Inf L         0.119   2.94e-15
Frontal Sup R           0.051   1.35e-14    SupraMarginal R        0.118   7.04e-15
Cerebelum 10 L          0.044   1.02e-04    Frontal Inf Tri R      0.117   4.50e-13
Cerebelum 7b L          0.042   4.67e-07    Angular R              0.117   5.56e-16
Hippocampus R          -0.034   4.66e-08    Angular L              0.116   6.30e-17
Cerebelum 4 5 L        -0.033   1.29e-04    Parietal Inf R         0.115   7.79e-14
Standard SCCA Simplified SCCAbrain ROI ˆ v j p-value brain ROI ˆ v j p-valueCerebelum 6 L 0.032 1.69e-09 Insula L 0.115 4.36e-14Cingulum Mid R 0.023 2.12e-19 Heschl R 0.114 6.06e-17Supp Motor Area R -0.010 7.83e-15 SupraMarginal L 0.113 2.29e-11Cingulum Post R -0.009 3.18e-04 Frontal Inf Oper R 0.112 2.22e-14Fusiform R -0.009 1.94e-20 Rolandic Oper R 0.111 2.84e-13Postcentral L -0.009 2.91e-09 Supp Motor Area L 0.110 1.13e-15Frontal Mid Orb L 0.008 3.34e-21 Fusiform R 0.110 1.94e-20Postcentral R -0.008 1.85e-08 Cingulum Post L 0.108 5.66e-13Calcarine L -0.008 3.33e-17 Fusiform L 0.108 4.12e-19Frontal Inf Oper R -0.008 2.22e-14 Frontal Inf Oper L 0.106 4.89e-12Lingual L -0.008 2.68e-15 Putamen L 0.106 2.11e-17Cingulum Mid L 0.007 8.86e-22 Putamen R 0.106 4.16e-15Parietal Inf L 0.007 2.94e-15 Heschl L 0.105 1.05e-14Frontal Sup Medial R 0.007 4.47e-20 Temporal Pole Sup L 0.104 3.30e-08Temporal Sup L 0.007 8.89e-17 Temporal Pole Sup R 0.104 4.33e-09ParaHippocampal R -0.007 4.46e-01 Occipital Mid L 0.101 1.55e-09Temporal Sup R 0.006 2.06e-20 Occipital Inf L 0.100 2.30e-13Paracentral Lobule R -0.006 3.84e-12 Supp Motor Area R 0.098 7.83e-15Lingual R -0.006 7.85e-17 Rolandic Oper L 0.095 4.63e-10Temporal Pole Sup L 0.005 3.30e-08 Parietal Sup L 0.094 3.29e-10Paracentral Lobule L -0.005 5.85e-08 Temporal Pole Mid L 0.093 2.11e-07Vermis 1 2 0.005 6.11e-04 Occipital Mid R 0.092 1.56e-08Occipital Sup L -0.005 1.43e-02 Occipital Inf R 0.091 1.26e-09Occipital Inf R 0.005 1.26e-09 Calcarine L 0.091 3.33e-17Cingulum Post L -0.004 5.66e-13 Temporal Pole Mid R 0.088 1.39e-06Temporal Pole Mid L -0.004 2.11e-07 Postcentral R 0.087 1.85e-08Fusiform L -0.004 4.12e-19 Postcentral L 0.086 2.91e-09Pallidum R -0.004 3.78e-03 Paracentral Lobule R 0.084 3.84e-12Parietal Sup R -0.003 3.11e-05 Lingual R 0.081 7.85e-17Pallidum L -0.002 4.55e-02 Precentral L 0.078 7.67e-07Caudate R -0.002 5.42e-01 Lingual L 0.078 2.68e-15Vermis 9 0.002 9.51e-02 Amygdala L 0.078 8.42e-07Vermis 7 -0.002 4.63e-02 Cingulum Post R 0.076 3.18e-04Cerebelum Crus1 R -0.002 3.23e-03 Precentral R 0.074 1.25e-05Cerebelum 8 R -0.001 3.11e-06 Amygdala R 0.073 1.09e-02Cerebelum 7b R 0.001 6.70e-05 Calcarine R 0.072 5.11e-13Frontal Inf Oper L -0.001 4.89e-12 Parietal Sup R 0.071 3.11e-05Frontal Inf Orb L 0.001 1.64e-17 Cuneus L 0.071 5.56e-06ParaHippocampal L -0.001 4.57e-01 Paracentral Lobule L 0.068 5.85e-08Thalamus L 0.001 2.74e-01 Occipital Sup R 0.067 4.50e-05Supp Motor Area L -0.000 1.13e-15 ParaHippocampal R 0.064 4.46e-01Frontal Sup L -0.000 6.30e-15 Cerebelum 6 R 0.062 5.71e-10Frontal Sup Orb L -0.000 7.41e-20 Occipital Sup L 0.059 1.43e-02Frontal Sup Orb R -0.000 3.23e-20 Cuneus R 0.059 9.59e-07Frontal Inf Tri L -0.000 4.56e-16 Caudate R 0.057 5.42e-01Frontal Inf Tri R -0.000 4.50e-13 Pallidum R 0.056 3.78e-03Rolandic Oper R -0.000 2.84e-13 Caudate L 0.054 1.38e-01Frontal Med Orb R -0.000 1.04e-24 Cerebelum 6 L 0.051 1.69e-09Rectus L -0.000 3.33e-25 ParaHippocampal L 0.051 4.57e-01Rectus R -0.000 4.53e-22 Pallidum L 0.047 4.55e-02Insula L -0.000 4.36e-14 Cerebelum 3 L -0.044 2.41e-05Cingulum Ant R -0.000 6.76e-15 Cerebelum 8 R -0.042 3.11e-06Amygdala L -0.000 8.42e-07 Cerebelum 8 L -0.042 2.15e-05Amygdala R -0.000 1.09e-02 Cerebelum 4 5 R 0.041 5.88e-09Occipital Sup R -0.000 4.50e-05 Cerebelum 7b L -0.040 4.67e-07Occipital Mid R -0.000 1.56e-08 Cerebelum Crus2 L -0.039 3.35e-06Parietal Sup L -0.000 3.29e-10 Cerebelum 9 L -0.038 1.18e-03Parietal Inf R -0.000 7.79e-14 Cerebelum Crus2 R -0.038 8.42e-06SupraMarginal L -0.000 2.29e-11 Thalamus R 
0.035 9.22e-01SupraMarginal R -0.000 7.04e-15 Cerebelum 9 R -0.033 3.81e-04Angular L -0.000 6.30e-17 Cerebelum 3 R -0.032 1.33e-05Angular R -0.000 5.56e-16 Cerebelum 10 R -0.030 7.99e-05Precuneus L -0.000 8.66e-22 Thalamus L 0.029 2.74e-01Precuneus R -0.000 8.93e-23 Cerebelum 7b R -0.029 6.70e-05Putamen R -0.000 4.16e-15 Vermis 1 2 -0.023 6.11e-04Heschl L -0.000 1.05e-14 Vermis 7 -0.022 4.63e-02Temporal Pole Sup R -0.000 4.33e-09 Vermis 4 5 0.021 5.31e-04Temporal Mid L -0.000 1.94e-21 Cerebelum Crus1 R -0.021 3.23e-03Temporal Inf L -0.000 5.09e-19 Vermis 8 0.016 9.35e-01Cerebelum Crus1 L -0.000 7.15e-02 Cerebelum 10 L -0.016 1.02e-04Cerebelum Crus2 L -0.000 3.35e-06 Vermis 10 -0.014 9.29e-02Continued on next page TABLE S6 – continued from previous page
Standard SCCA                               Simplified SCCA
brain ROI              v̂_j      p-value     brain ROI              v̂_j     p-value
Cerebelum Crus2 R      -0.000   8.42e-06    Cerebelum Crus1 L     -0.012   7.15e-02
Cerebelum 8 L          -0.000   2.15e-05    Vermis 9               0.011   9.51e-02
Cerebelum 9 R          -0.000   3.81e-04    Vermis 3              -0.008   1.05e-01
Cerebelum 10 R         -0.000   7.99e-05    Hippocampus R          0.007   4.66e-08
Vermis 4 5             -0.000   5.31e-04    Vermis 6               0.003   1.60e-01
Vermis 6               -0.000   1.60e-01    Hippocampus L         -0.003   1.25e-08
Vermis 10              -0.000   9.29e-02    Cerebelum 4 5 L        0.000   1.29e-04
Fig. S5. Canonical genetic weights estimated by SCCA (top figure) and simplified SCCA (bottom figure). In each figure, the results on each of the four training folds (rows 1-4) and on the entire data (bottom row) are shown.
Fig. S6. Canonical imaging weights estimated by SCCA (top figure) and simplified SCCA (bottom figure). In each figure, the results on each of the four training folds (rows 1-4) and on the entire data (bottom row) are shown.