[PDF] Finding Stable Groups of Cross-Correlated Features in Multi-View data

Abstract

Multi-view data, in which data of different types are obtained from a common set of samples, is now common in many scientific problems. An important problem in the analysis of multi-view data is identifying interactions between groups of features from different data types. A bimodule is a pair (A,B) of feature sets from two different data types such that the aggregate cross-correlation between the features in A and those in B is large. A bimodule (A,B) is stable if A coincides with the set of features having significant aggregate correlation with the features in B , and vice-versa. At the population level, stable bimodules correspond to connected components of the cross-correlation network, which is the bipartite graph whose edges are pairs of features with non-zero cross-correlations. We develop an iterative, testing-based procedure, called BSP, to identify stable bimodules in two moderate- to high-dimensional data sets. BSP relies on permutation-based p-values for sums of squared cross-correlations. We efficiently approximate the p-values using tail probabilities of gamma distributions that are fit using analytical estimates of the permutation moments of the test statistic. Our moment estimates depend on the eigenvalues of the intra-correlation matrices of A and B and as a result, the significance of observed cross-correlations accounts for the correlations within each data type. We carry out a thorough simulation study to assess the performance of BSP, and present an extended application of BSP to the problem of expression quantitative trait loci (eQTL) analysis using recent data from the GTEx project. In addition, we apply BSP to climatology data in order to identify regions in North America where annual temperature variation affects precipitation.

Full PDF

FFINDING STABLE GROUPS OF CROSS-CORRELATED FEATURESIN MULTI-VIEW DATA

By Miheer Dewaskar , John Palowitch , Mark He , Michael I.Love , and Andrew Nobel Department of Statistics and Operations Research, UNC Chapel Hill. Department of Biostatistics, UNC Chapel Hill. Department of Genetics, UNC Chapel Hill. Google Research.

Multi-view data, in which data of diﬀerent types are obtainedfrom a common set of samples, is now common in many applied sci-entiﬁc problems. An important problem in the analysis of multi-viewdata is identifying interactions between groups of features from dif-ferent data types. A bimodule is a pair (

A, B ) of feature sets from twodiﬀerent data types such that the aggregate cross-correlation betweenthe features in A and those in B is large. A bimodule ( A, B ) is stable if A coincides with the set of features having signiﬁcant aggregatecorrelation with the features in B , and vice-versa. At the populationlevel, stable bimodules correspond to connected components of thecross-correlation network, which is the bipartite graph whose edgesare pairs of features with non-zero cross-correlations.We develop and investigate an iterative, testing-based procedure,called BSP, to identify stable bimodules in two moderate- to high-dimensional data sets. BSP relies on permutation-based p-values fortest statistics equal to sums of squared cross-correlations. These p-values are approximated using tail probabilities of gamma distribu-tions that are ﬁt using estimates of the permutation moments of thetest statistic. Our moment estimates depend on the eigenvalues ofthe intra-correlation matrices of A and B , and as a result the sig-niﬁcance of observed cross-correlations accounts for the correlations within each data type.We carry out a thorough simulation study to assess the perfor-mance of BSP, and present an extended application of BSP to theproblem of expression quantitative trait loci (eQTL) analysis usingrecent data from the GTEx project. In addition, we apply BSP toclimatology data in order to identify regions in North America whereannual temperature variation aﬀects precipitation.The method is available as an R package at https://github.com/miheerdew/cbce .

1. Introduction.

With the ongoing development and application of moderate- to high-throughput measurement technologies in ﬁelds such as genomics, neuroscience, ecology, andatmospheric science, researchers are often faced with the task of analyzing and comparingtwo or more data sets derived from a common set of samples. In most cases, diﬀerent tech-nologies measure diﬀerent features, and capture diﬀerent information about the samples athand. While one may analyze the data arising from diﬀerent technologies separately, addi-tional and potentially important insights can often be gained from the joint (or integrated)analysis of the data sets. Joint analysis, also called multi-view or multi-modal analysis, has

MSC 2010 subject classiﬁcations:

Primary 62-04, 62H20; secondary 62J15, 62P10

Keywords and phrases:

Bipartite correlation networks, eQTL-analysis, Multi-modal, Multi-view, Bipartiteclustering, Nash equilibrium, Permutation distribution a r X i v : . [ s t a t . M E ] S e p M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL received considerable attention in the literature, see Lahat, Adali and Jutten (2015); Menget al. (2016); Tini et al. (2019); Pucher, Zeleznik and Thallinger (2019); McCabe, Lin andLove (2019); Sankaran and Holmes (2019) and the references therein for more details.In what follows we will refer to the measurements arising from a particular technology asa data type, and will restrict our attention to problems in which two data types, referredto as Type 1 and Type 2, are under study. Our primary interest is in exploring associationsbetween the measured features of the two data types. In particular, we wish to identify pairs(

A, B ), where A is a set of features of Type 1 and B is a set of features of Type 2, such thatthe aggregate correlation between the features in A and those in B is large. Throughout weconsider the unsupervised setting, in which the analysis does not make use of an externalresponse. The problem of identifying sets of highly correlated features within a single datatype has been widely studied, typically through clustering and related methods. Borrowingfrom the use in genomics of the term “module” to refer to a set of correlated genes, werefer to the feature set pairs ( A, B ) of interest to us as bimodules . The term bimodule hasalso appeared, with somewhat diﬀerent meaning, in Wu, Liu and Jiang (2009), Patel et al.(2010), and Pan et al. (2019).We will refer to the correlations between features from diﬀerent data types as cross-correlations , and note that this usage diﬀers from that in time-series analysis. Correla-tions among features of the same data type will be referred to as intra-correlations . Cross-correlations provide information about interactions between features from the two datatypes. These interactions are of interest in many applications, for example, in studying therelationship between the characteristics of species and those of their environment (see, e.g.,Dol´edec and Chessel, 1994) in ecology, identifying brain regions associated with diﬀerentbehaviors (McIntosh et al., 1996) in neuroscience, and studying the relationship betweentemperature and precipitation in climate science (discussed in Section 6 below). A bimod-ule provides evidence for the coordinated activity of features from diﬀerent data types.Coordination may arise, for example, from common function or functional relationships, orcausal interactions. Bimodules can identify potentially informative downstream analyses,or suggest the more targeted acquisition and analysis of new data. Importantly, bimodulescan capture aggregate behavior, which may be signiﬁcant even when no individual pair offeatures has high cross-correlation. As such, the search for bimodules can eﬀectively leveragelow-level signals across multiple features.1.1.

Bimodule Search Procedure.

In this paper we propose and analyze a method calledthe Bimodule Search Procedure (BSP) for identifying bimodules in moderate to high dimen-sional data sets. Importantly, BSP is not based on formulating or ﬁtting a statistical model,or on detailed distributional assumptions. Instead, the method relies on general multipletesting principles.A key feature of BSP is that it seeks stable bimodules. A bimodule (

A, B ) is stable if A coincides with the features that are signiﬁcantly associated in aggregate with the featuresin B , and vice versa. We examine stable bimodules in the population and sample settings.In the population setting, stability has connections with Nash equilibria (Nash, 1950) in asimple two-player game, and with the connectivity of the bipartite graph representing thecross-correlations of the Type 1 and Type 2 variables. The latter connection with bipartitenetworks is pursued throughout the paper, and in particular, provides a principled way toextract an association network from a bimodule.BSP is based on an iterative testing framework that has found application in otherexploratory problems, see Palowitch, Bhamidi and Nobel (2016); Bodwin et al. (2017, 2018). IMODULE SEARCH PROCEDURE In the present setting, we employ fast, moment-based approximations to permutation p-values for sums of squared correlations. We emphasize that these p-values explicitly accountfor the intra-correlations of the features in A and B , attenuating signiﬁcance when theseintra-correlations are high..1.2. Expression Quantitative Trait Loci Analysis.

Much of the existing work on bimod-ule discovery is focused on the integrated analysis of genomic data. To motivate and providecontext for this work, and the methods introduced here, we brieﬂy discuss the problem ofexpression quantitative trait loci (eQTL) analysis in genomics. An application of BSP toeQTL analysis is given in Section 5.Genetic variation within a population is commonly studied by considering single nu-cleotide polymorphisms, called SNPs. A SNP is a single base pair site in the genome wherethere is allelic variation in the population. The dosage of the SNP for an individual in termsof one of the alleles can be considered taking values 0, 1, or 2, which we will refer to as the“value” at the SNP. After normalization and covariate correction, the value of a SNP mayno longer be discrete.eQTL analysis seeks to identify SNPs that aﬀect the expression of one or more genes; aSNP-gene pair for which the expression of the gene is correlated with the value of the SNPis referred to as an eQTL. Identiﬁcation of eQTLs is an important ﬁrst step in the studyof genomic pathways and networks that underlie disease and development in human andother populations (see Nica and Dermitzakis, 2013; Albert and Kruglyak, 2015).In modern eQTL studies it is common to have measurements of 10-20 thousand genesand 2-5 million SNPs on hundreds (or in some cases thousands) of samples. Identiﬁcationof putative eQTLs or genomic “hot spots” is carried out by evaluating the correlation ofnumerous SNP-gene pairs, and identifying those meeting an appropriate multiple testingbased threshold. In studies with larger sample sizes it may be feasible to carry out trans -eQTL analyses, which consider all SNP-gene pairs regardless of genomic location. However,it is more common to carry out cis -eQTL analyses, in which one restricts attention toSNP-gene pairs for which the SNP is within some ﬁxed genomic distance (often 1 millionbase pairs) of the gene’s transcription start site, and in particular, on the same chromosome(c.f. Westra and Franke, 2014; GTEx Consortium, 2017). We use the preﬁxes cis - and trans - to refer to the type of eQTL analysis, while using adjectives local and distal todenote the proximity of the discovered SNP-gene pairs. In particular, although cis -eQTLanalysis focuses of ﬁnding local eQTLs, trans -eQTL analysis can discover both local anddistal eQTLs.As a result of multiple testing correction needed to address the large number of SNP-genepairs under study, both trans - and cis -eQTL analyses can suﬀer from low power. Severalmethods have been proposed to improve the power of standard eQTL analysis, includingpenalized regression schemes that try to account for intra-gene or intra-SNP interactionnetworks (Tian et al., 2014, and references therin) and methods that consider gene modulesas high-level phenotypes to reduce the burden of multiple-testing (Kolberg et al., 2020).Alternatively, one may also shift attention from individual SNP-gene pairs to SNP-gene bi-modules, that is, to sets of SNPs and sets of genes having large aggregate cross-correlation.A number of bimodule search methods have been proposed and developed in the context ofeQTL analysis. These include methods based on Gaussian graphical models (Cheng et al.,2012, 2015), bipartite community detection (Platig et al., 2016; Fagny et al., 2017), pe-nalized regression (Chen et al., 2012), and sparse canonical correlation analysis (sCCA)(Parkhomenko, Tritchler and Beyene, 2007, 2009). These methods have the potential to

M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL enhance and improve standard eQTL approaches that focus primarily on identifying signif-icant SNP-gene pairs. As SNPs and genes often act in concert with one another, bimodulediscovery methods can gain statistical power from group-wise interactions by borrowingstrength across individual SNP-gene pairs.1.3.

Bipartite-network based approach to ﬁnd bimodules.

Since bimodules are deﬁned interms of cross-correlations, it is natural to investigate them in the context of the bipartite cross-correlation network, which is formed by connecting pairs of features from diﬀerentdata types with a weighted edge, where the weight is equal to the square of the (sample orpopulation) cross-correlation between the features. CONDOR (Platig et al., 2016) identiﬁesbimodules by applying a community detection method to an unweighted bipartite graphobtained by thresholding the sample cross-correlations. One could, in principle, extend thisapproach by leveraging other community detected methods (Beckett, 2016; Barber, 2007;Liu and Murata, 2010; Costa and Hansen, 2014; Pesantez-Cabrera and Kalyanaraman, 2016)for weighted and unweighted bipartite networks. In Huang et al. (2009), the authors ﬁndgene-SNP bimodules by searching for bipartite cliques in a network having nodes derivedfrom progeny strain data in addition to genes and SNPs.The approach taken here is network based, but diﬀers from community-detection basedapproaches such as CONDOR. While stable population bimodules can be deﬁned in termsof the population cross-correlation network, the sample cross-correlation network is not asuﬃcient statistic for stable sample bimodules, which also account for the intra-correlationsbetween features of the same type.1.4.

Other approaches to discover bimodules.

One might search for bimodules by ap-plying a standard clustering method such as k-means to a joint data matrix containingstandardized features from the two data types, and treating any cluster with features fromboth data types as a bimodule. While appropriate as a “ﬁrst look”, this approach requiresspecifying the number of clusters, imposes the constraint that every feature be part of one,and only one, bimodule, and, most importantly, does not distinguish between cross- andintra-correlations.Bimodules can also be found using sCCA methods (Waaijenborg, de Witt Hamer andZwinderman, 2008; Witten, Tibshirani and Hastie, 2009; Parkhomenko, Tritchler and Beyene,2009), which ﬁnd pairs of correlated sparse linear combinations of features from the two datatypes. Each such canonical covariate pair can then be regarded as a bimodule consisting ofthe features appearing in the linear combination.Finally, in the context of eQTL analysis, methods based on Gaussian graphical modelsCheng et al. (2012, 2015, 2016) and penalized multi-task regression Chen et al. (2012) havealso been used to ﬁnd bimodules. In the former work, the authors ﬁt a sparse graphicalmodel with a hidden variables that model interactions between sets of genes and sets ofSNPs. In Chen et al. (2012), the gene and SNP networks derived from the respective intra-correlations matrices are used in a penalized regression setup to ﬁnd a network-to-networkmapping in the same spirit as bimodules.1.5.

Overview of the Paper.

The next section introduces basic notation and presentsthe deﬁnition and properties of stable bimodules at the population level. In particular,we establish connections between stable population bimodules, Nash equilibria in a simpletwo-player game, and the connected components of the bipartite network of populationcross-correlations. Section 3 is devoted to the testing-based deﬁnition of stable bimodules

IMODULE SEARCH PROCEDURE in the sample setting, and a description of the Bimodule Search Procedure. In addition,we outline the computation of p-values for BSP and discuss the network interpretation ofsample bimodules. Section 4 is devoted to a simulation study that makes use of a complexmodel to capture some of the features observed in real multi-view data. We compare theperformance of BSP with CONDOR and sCCA. Section 5 describes and evaluates theresults of BSP and CONDOR applied to an eQTL dataset from the GTEx consortium. Inparticular, we examine the bimodules produced by BSP using a variety of descriptive andbiological metrics, including comparisons with, and potential extensions of, standard eQTLanalysis. In Section 6, we present the results of BSP applied to inter-annual temperatureand precipitation measurements in North America.

2. Stable Population Bimodules.

In this section we deﬁne and study stable bimod-ules at the population level. We begin with notation and basic assumptions.2.1.

Notation and stochastic setting.

Suppose that we have acquired data sets of twodiﬀerent types from a common set of n samples. Let X be an n × p matrix containing thedata of Type 1, and let Y be an n × q matrix containing the data of Type 2. The i th row of X and Y contain the measurements of Type 1 and Type 2, respectively, on the i th sample.The columns of X and Y correspond to the measured features of each type. Denote featuresof Type 1 by S = { s , s , . . . , s p } , and those of Type 2 by T = { t , t , . . . , t q } . We assumethat the rows of the joint matrix [ X , Y ] are independent copies of a random (row) vector( X , Y ) = ( X s , . . . , X s p , Y t , . . . Y t q ) . For each s ∈ S let X s be the column of X corresponding to feature s ; for each t ∈ T let Y t be the column of Y corresponding to feature t . For s ∈ S and t ∈ T let ρ ( s, t ) be thepopulation correlation between the random variables X s and Y t , and let r ( s, t ) denote thesample correlation between X s and Y t . For A ⊆ S and B ⊆ T we deﬁne the aggregatesquared (population and sample) correlation between A and B by ρ ( A, B ) . = (cid:88) s ∈ A,t ∈ B ρ ( s, t ) , and(1) r ( A, B ) . = (cid:88) s ∈ A,t ∈ B r ( s, t ) . (2)For singleton sets we will omit brackets, writing ρ ( s, B ) and ρ ( A, t ) instead of ρ ( { s } , B )and ρ ( A, { t } ).2.2. Stable population bimodules.

Recall that our goal is to identify pairs (

A, B ) with A ⊆ S and B ⊆ T such that the aggregate cross-correlation between features in A and B is large. We begin our analysis of this problem at the population level, where the aggregatecross-correlation between A and B can be measured by the quantity ρ ( A, B ) deﬁned in(1). One might rank pairs (

A, B ) using a score based on ρ ( A, B ), but it is not immediatelyclear how such a score should be deﬁned. For example, ρ ( A, B ) itself favors larger featuresets, while the average ρ ( A, B ) / | A | | B | might favor smaller feature sets. More importantly,it is not immediately clear how a score based on ρ ( A, B ) can be eﬀectively translated to,and evaluated in, the sample setting.To address these issues, we shift our assessment of pairs (

A, B ) from global numericalperformance measures to internal stability criteria that are based on the structure of thepopulation cross-correlations. The basic idea is contained in the following deﬁnition.

M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

Definition . A pair (

A, B ) of non-empty sets A ⊆ S and B ⊆ T is a stable popula-tion bimodule if1. A = { s ∈ S | ρ ( s, B ) > } and2. B = { t ∈ T | ρ ( A, t ) > } .In words, the deﬁnition says that A is exactly the set of features in S that are correlatedin aggregate with the features in B , while B is exactly the set of features in T that arecorrelated in aggregate with the features in A . It is useful to consider stable bimodules inthe context of the population network of cross-correlations. Definition . The population cross-correlation network G p is the weighted bipartitenetwork with vertex set S ∪ T , edge set E p = { ( s, t ) ∈ S × T | ρ ( s, t ) (cid:54) = 0 } , and weightfunction w p : E → [ − ,

1] given by w p ( s, t ) = ρ ( s, t ).The following elementary lemma shows that stable bimodules are closely related to theconnected components of G p . Lemma . A pair ( A, B ) of non empty sets with A ⊆ S and B ⊆ T is a populationbimodule if and only if A ∪ B is a union of non-trivial connected components of G p . Proof.

For any subsets F ⊆ S and G ⊆ T note that ρ ( F, G ) > F × G ) ∩ E p (cid:54) = ∅ . Let Nb( s ) . = { t (cid:48) | ( s, t (cid:48) ) ∈ E p } and Nb( t ) . = { s (cid:48) | ( s (cid:48) , t ) ∈ E p } and denotethe neighborhoods of s and t in the graph G p . The two conditions in Deﬁnition 2.1 areequivalent to saying A = ∪ t ∈ B Nb( t ) and B = ∪ s ∈ A Nb( s ), respectively. Equivalently, since G p is a bipartite graph, the set of nodes H = A ∪ B satisﬁes the property(3) H = Nb( H )where Nb( C ) . = ∪ v ∈ C Nb( v ) for any subset of vertices C ⊆ S ∪ T .Using (3), let us show that H is a union of non-trivial connected components. For any r ∈ S ∪ T , note that the connected component containing r is deﬁned by C ( r ) . = ∪ ∞ i =0 C i where C = { r } and C i = Nb( C i − ) for each i ≥

1. For r ∈ H , repeatedly applying (3)shows that r ∈ C ( r ) ⊆ H and hence(4) H = ∪ r ∈ H C ( r ) . Since (3) holds and G p is a simple graph, each r ∈ H has at least one other neighbor. Hence | C ( r ) | > r ∈ H .Finally, since Nb( C ( r )) ⊆ C ( r ) for any r , if H satisﬁes (4) then Nb( H ) ⊆ H . Moreover ifall the connected components in (4) are non-trivial then Nb( H ) ⊇ H and (3) is satisﬁed.As the lemma shows, stable population bimodules depend only on the edges of G p ; theydo not depend on the edge weights, or on correlations between features of the same type.As we will see below, the situation for sample bimodules is substantially diﬀerent.2.2.1. Bimodules are Nash Equilibria.

The notion of stability in Deﬁnition 2.1 has closeconnections with Nash equilibrium (Nash, 1950) in game theory. To make this precise, ﬁxan (cid:15) >

0, and consider the reward function Φ (cid:15) that for any A ⊆ S and B ⊆ T takes thevalue(5) Φ (cid:15) ( A, B ) . = (cid:88) s ∈ A (cid:88) t ∈ B ρ ( s, t ) − (cid:15) | A | | B | . IMODULE SEARCH PROCEDURE Consider a two player game in which player 1 chooses a subset A ⊆ S , player 2 choosesa subset B ⊆ T , and the payoﬀ to both the players is Φ (cid:15) ( A, B ). In this setting, a pair ofsubsets ( A ∗ , B ∗ ) is called a Nash equilibrium ifmax A ⊆ S Φ (cid:15) ( A, B ∗ ) = Φ (cid:15) ( A ∗ , B ∗ ) = max B ⊆ T Φ (cid:15) ( A ∗ , B ) . The following elementary lemma shows that bimodules are the just Nash equilibria in thisgame. The proof appears in Appendix A.

Lemma . Let δ = min (cid:8) ρ ( s, t ) | s ∈ S, t ∈ T, ρ ( s, t ) (cid:54) = 0 (cid:9) and (cid:15) = δ (max( | S | , | T | )) − .If (cid:15) ∈ (0 , (cid:15) ) then the non-empty Nash equilibria of the game with reward function Φ (cid:15) coincides with the family of stable population bimodules. The connection between stable bimodules and two player games in the population set-ting suggests a simple iterative scheme to ﬁnd stable bimodules when the cross-correlations( ρ ( s, t )) s ∈ S,t ∈ T are known. Begin by ﬁxing a non-empty subset B ⊆ T and any (cid:15) ∈ (0 , (cid:15) ),where (cid:15) is chosen as in Lemma 2. Then repeatedly update A k +1 = arg max A Φ (cid:15) ( A, B k )and B k +1 = arg max B Φ (cid:15) ( A k +1 , B ) for k ≥

0. As the value of the objective function strictlyincreases at every update, the sets ( A k , B k ) are guaranteed to convergence to a Nash equi-librium after ﬁnitely many steps. If B is such that that Φ (cid:15) ( A , B ) >

0, then the Nashequilibrium will be non-empty and hence a population bimodule.It is illustrative to view this iterative update procedure in terms of the cross-correlationgraph G p . The proof of Lemma 2 shows that the update steps are equivalent to(6) A k +1 = { s ∈ S | ρ ( s, B k ) > } and B k +1 = { t ∈ T | ρ ( A k +1 , t ) > } . In other words A k +1 is set of neighbors of B k , and B k +1 is the set of neighbors of A k +1 . Ifthe iterative update procedure begins from a singleton set B = { t } for some t ∈ T , then itcorresponds to the breadth ﬁrst search algorithm for ﬁnding the connected component of t in G p (see, e.g., Cormen et al., 2009). This connection shows that consideration of singletonsets B = { t } for t ∈ T ﬁnds all the connected components of G p , which by Lemma 1, arethe minimal stable population bimodules.

3. Stable Sample Bimodules and the Bimodule Search Procedure.

We nowextend the notion of a stable bimodule and the iterative search procedure described aboveto the sample (empirical) setting using ideas and methods from multiple testing. Whilethe empirical setting involves a number of additional complications, the motivation behindstability is essentially the same.In practice, the population cross-correlations ρ ( s, t ) are unknown, and the search forbimodules is based on the observed data matrices [ X , Y ]. One may simply replace the popu-lation correlations with their sample counterparts r ( s, t ) in Deﬁnition 2.1, but when workingwith continuous data r ( s, t ) (cid:54) = 0 (even if ρ ( s, t ) = 0), and in this case the only stable bi-module is the full index set ( S, T ). To address this, we replace the conditions ρ ( s, B ) > r ( s, B ) > ˆ γ ( s, B ), where the threshold ˆ γ ( s, B ) is derived from the application of anFDR-controlling multiple testing procedure to approximate permutation p-values for thestatistics { r ( s, B ) : s ∈ S } . An analogous approach is taken for the conditions ρ ( A, t ) > Permutation null distribution and p-values.

M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

Definition . Let [ X , Y ] be given, and let P , P ∈ { , } n × n be chosen indepen-dently and uniformly from the set of all n × n permutation matrices. The permutation nulldistribution of [ X , Y ] is the distribution of the data matrix[ ˜ X , ˜ Y ] . = [ P X , P Y ] . Let P π and E π denote probability and expectation, respectively, under the permutation null.For s ∈ S and t ∈ T let R ( s, t ) be the (random) sample-correlation of ˜ X s and ˜ Y t .The permutation null distribution is obtained by randomly reordering the rows of X and independently doing the same for the rows of Y . Permutation preserves the samplecorrelation between features in S , and between features in T , but it nulliﬁes the cross-correlations between features in S and T . Indeed, as shown in Zhou, Barry and Wright(2013), E π [ R ( s, t )] = 0 for each s ∈ S and t ∈ T . Definition . For A ⊆ S and B ⊆ T deﬁne the permutation p-value(7) p ( A, B ) . = P π ( R ( A, B ) ≥ r ( A, B ) )where R ( A, B ) . = (cid:80) s ∈ A,t ∈ B R ( s, t ) and the observed sum of squares r ( A, B ) is ﬁxed.The permutation p-value p ( A, B ) is the probability under the permutation null distri-bution that the aggregate cross-correlation between the features in A and B exceeds itsobserved value in the data. Small values of p ( A, B ) provide evidence in favor of the hypoth-esis that ρ ( A, B ) >

0. As the permutation distribution preserves the correlations betweenfeatures from A and between features from B , p ( A, B ) accounts for the presence of thesecorrelations while assessing the signiﬁcance of r ( A, B ).3.2.

Stable sample bimodules.

Let p = ( p , . . . , p m ) be a sequence of p-values and let α ∈ (0 ,

1) be a target false discovery rate. Recall that the Benjamini and Yekutieli (2001)rejection threshold at level α is deﬁned by(8) τ α ( p ) . = max (cid:26) p ( j ) : m p ( j ) j ≤ α (cid:80) mi =1 i − (cid:27) where p (1) ≤ p (2) . . . ≤ p ( m ) are the ordered values of p . Definition . (Stable Sample Bimodule) Let [ X , Y ] and α ∈ (0 ,

1) be given. A pair(

A, B ) of non-empty sets A ⊆ S and B ⊆ T is a stable sample bimodule at level α if1. A = { s ∈ S | p ( s, B ) ≤ τ α ( p · ,B ) } and2. B = { t ∈ T | p ( A, t ) ≤ τ α ( p A, · ) } where p · ,B = { p ( s, B ) } s ∈ S and p A, · = { p ( A, t ) } t ∈ T .Thus ( A, B ) is a sample stable bimodule if A is exactly the set of features in S that aresigniﬁcantly correlated in aggregate with the features in B , and at the same time B is exactlythe set of features in T that are signiﬁcantly correlated in aggregate with the features in A . When no ambiguity will arise, we will refer to stable sample bimodules simply as stablebimodules.Although the deﬁnition above parallels that of Deﬁnition 2.1, critical diﬀerences emergein the sample setting. One key diﬀerence is the aggregation of small eﬀects. As noted above IMODULE SEARCH PROCEDURE that the condition p ( s, B ) ≤ τ α ( p · ,B ) is equivalent to requiring r ( s, B ) ≥ ˆ γ ( s, B ) whereˆ γ ( s, B ) depending on τ α ( p · ,B ). The latter condition may be satisﬁed even if the feature s is not signiﬁcantly correlated with any individual feature in B . Similar remarks apply to p ( A, t )Another, more important, diﬀerence between the population and sample settings is therole of intra-correlations. A likely side-eﬀect of any empirical search for pairs (

A, B ) withhigh cross-correlations is that the intra-correlations of the features in A and B will alsobe large, often signiﬁcantly larger than the intra-correlations of a randomly selected setof features with the same cardinality. Failure to account for inﬂated intra-correlations canlead to anti-conservative (optimistic) assessments of signiﬁcance, false discoveries, and over-sized feature sets. As noted above, the permutation distribution leaves intra-correlationsunchanged, while ensuring that cross-correlations are equal to zero. In this way the permu-tation p-values p ( s, B ) and p ( A, t ) directly account for the intra-correlations among featuresin A and B .3.3. The Bimodule Search Procedure (BSP).

We adapt the iterative search procedurefor population bimodules described at the end of Section 2 using the p-value based char-acterization of sample bimodules in Deﬁnition 3.3. The result is an iterative, testing-basedsearch procedure for stable bimodules. Iterative-testing based procedures have been usedin single data-type settings for community detection Palowitch, Bhamidi and Nobel (2016),diﬀerential correlation mining Bodwin et al. (2018), and association mining Bodwin et al.(2017). The deﬁnition and approximation of the permutation based p-values used here dif-fers substantially from this existing work.

Input :

Data matrices X and Y and parameter α ∈ (0 , Result:

A stable bimodule (

A, B ) at level α , if found. Initialize A (cid:48) = { s } ⊆ S and B (cid:48) = ∅ ; do ( A, B ) ← ( A (cid:48) , B (cid:48) ); Compute p ( A, t ) for each t ∈ T and let p T ← ( p ( A, t )) t ∈ T ; B (cid:48) ← { t ∈ T | p ( A, t ) ≤ τ α ( p T ) } ; // The indices rejected by the B.Y. procedure Compute p ( s, B (cid:48) ) for each s ∈ S and let p S ← ( p ( s, B (cid:48) )) s ∈ S ; A (cid:48) ← { s ∈ S | p ( s, B (cid:48) ) ≤ τ α ( p S ) } ; // The indices rejected by the B.Y. procedure while ( A (cid:48) , B (cid:48) ) (cid:54) = ( A, B ); if | A || B | > then return ( A, B ); end Algorithm 1:

Bimodule Search Procedure (BSP)An overview of BSP is given in Algorithm 1. If BSP terminates at a non-empty ﬁxedpoint, then its output is a stable bimodule at level α . Unlike its population counterpart,BSP is not guaranteed to terminate in a ﬁnite number of steps: as the procedure operates ina deterministic manner, and the number of feature set pairs is ﬁnite, BSP will terminate ata (possibly empty) ﬁxed point or enter a limiting cycle. To limit computation time, the loopat Line 2 is stopped after 20 iterations. In our simulations and real-data analyses (describedbelow) the 20 iteration limit was enforced in only a handful of cases. Further details on howBSP deals with cycles and limits large sets can be found in Appendix B.1.In practice, BSP is initialized with each singleton pair ( s, ∅ ) for s ∈ S , and each singletonpair ( ∅ , t ) for t ∈ T . While this initialization guarantees the recovery of all minimal stablebimodules in the population setting, no such guarantees are available in the sample setting. M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

Nevertheless, we have found this initialization strategy to be eﬀective in practice. Wheneither of the sets S or T is large, we use additional strategies to speed up computation, likerandomly selecting a smaller subset of features for initialization (Appendix B.2).The constant α ∈ (0 ,

1) is the only free parameter of BSP. We will refer to α as thefalse discovery parameter. While α controls the false discovery rate at each step of thesearch procedure, this does not guarantee control on the false associations (i.e. ( s, t ) suchthat ρ ( s, t ) = 0) within the stable bimodules. In general, BSP will ﬁnd fewer and smallerbimodules when α is small, and ﬁnd more numerous and larger bimodules when α is large.In practice, we employ a permutation based procedure to select α from a ﬁxed grid of valuesbased on the notion of edge-error introduced in Section 4.2.1. See Appendix B.3 for details.Simulations and theoretical calculations suggest that singleton bimodules ( { s } , { t } ) at agiven level α ∈ (0 ,

1) can occur even in completely random data if | S | and | T | are largeenough. To minimize the detection of spurious singleton bimodules, we discard bimodules( A, B ) with p ( A, B ) > α | S || T | , where the threshold is the Bonferroni correction at level α for singleton bimodules. Alternatively, one can simply discard singleton bimodules withp-values exceeding the Bonferroni threshold.The BSP search procedure often ﬁnds the same bimodule starting from multiple initial-izations, and in some cases there are numerous bimodules having substantial overlap. In thelatter case, we assess the eﬀective number of distinct bimodules and select an equal numberof representative bimodules for subsequent analysis. Details can be found in Appendix B.4.3.3.1. Approximation of p-values.

Recall that BSP is not based on an underlying gen-erative or distributional model. The method relies on the assumption that the samples areindependent and identically distributed and on Deﬁnition 3.1, the permutation based p-values p ( s, B ) and p ( A, t ). A total of | S | + | T | p-values are calculated in each iteration ofthe loop at Line 2 in Algorithm 1. Accounting for multiple initializations, several billionp-value calculations are required for typical genomic data sets. Moreover, the resolution ofthese p-values must be small enough to account for multiple-testing correction.When | S | or | T | is large, calculating the p-values p ( s, B ) and p ( A, t ) using a standardMonte Carlo permutation scheme is not feasible.As an alternative, we make of use ideasfrom Zhou, Barry and Wright (2013) and Zhou, Gallins and Wright (2019) to approximatethe permutation p-values p ( A, t ) and p ( s, B ) using the tails of a location-shifted Gammadistribution having same ﬁrst three moments as the sampling distribution of R ( A, t ) underthe permutation null.Although the ﬁrst three moments of R ( A, t ) can be computed exactly (Zhou, Barryand Wright, 2013), to further speed computation we use instead the eigenvalue conditionalmoments of R ( A, t ) (see Zhou, Gallins and Wright, 2019), which depend only on the eigen-values of the intra-correlation matrix of the features in A , and not on t . The analyticalformula for the eigenvalue conditional moments is based on a normality assumption forthe data generating distribution, but one may show that the weaker assumption of spher-ical symmetry is suﬃcient. In practice, the additional assumptions used in the momentapproximation do not appear to limit the applicability of BSP. Accuracy of the p-valueapproximations is brieﬂy discussed Appendix B.6.3.4. Network interpretation of sample stable bimodules.

As discussed in Section 2.2,stable population bimodules can be studied in terms of the correlation network G p , and itis useful to study sample bimodules in a similar manner. To this end, we deﬁne the samplecross-correlation network G s to be the weighted bipartite network with vertex set S ∪ T , IMODULE SEARCH PROCEDURE full edge set E = S × T , and weight function w ( s, t ) = r ( s, t ). Definition . For each τ > G τs = ( S ∪ T, E ( τ ))where E ( τ ) = { ( s, t ) ∈ S × T | | w ( s, t ) | ≥ τ } . For each feature set pair ( A, B ) we deﬁne the connectivity-threshold (9) τ ∗ ( A, B ) = max { τ ∈ [0 ,

1] : A ∪ B is connected in G τs } . It follows from Lemma 1 that minimal population bimodules (those obtained when start-ing the iterative search from singletons) correspond to connected components of G p . Ac-cordingly, we deﬁne the essential-edges of a bimodule ( A, B ) to be those that are presentat the connectivity threshold(10) essential-edges ( A, B ) = ( A × B ) ∩ E ( τ ∗ ( A, B )) . Note that E ( τ ) ∩ ( A × B ) is a set-estimate of the edges ( s, t ) ∈ A × B with ρ ( s, t ) (cid:54) = 0,and that the choice of τ > τ = τ ∗ ( A, B ) is the most conservative threshold subject to the constraint that A ∪ B is connected in G τs , and the essential edges are those of the resulting graph. Assuming thatthe bimodule ( A, B ) is connected in the population network, we expect the essential-edgesto be a conservative estimate of the true edges in the population network.

4. Simulation Study.

To assess the eﬀectiveness of BSP, we carried out a simulationstudy in which a variety of true bimodules of diﬀerent strengths and sizes were present inthe underlying distribution of the samples. In this section, we provide an overview of thestudy, and an assessment of the results from BSP and competing methods CONDOR andsCCA (which were described in sections 1.3 and 1.4).Simulation studies incorporating fewer than ten embedded bimodules have been con-ducted for methods based on sCCA (Waaijenborg, de Witt Hamer and Zwinderman, 2008;Parkhomenko, Tritchler and Beyene, 2009; Witten, Tibshirani and Hastie, 2009) and graph-ical models (Cheng et al., 2016, 2015). Existing studies are relatively simple, and do notemphasize the network structure of many applications. In order to emulate the complex-ity of eQTL analysis and similar applications, we designed a simulation study in which K = 500 bimodules of various strengths, sizes, network structures, and intra-correlationswere planted in a single large dataset. The planted bimodules were then connected by con-founding edges to make their recovery more challenging. We emphasize that BSP is notbased on an underlying generative model: the model used in the simulation study is forassessment purposes only.4.1. Details of the simulated data.

We generated a single large dataset having n =200 samples and two measurement types, with p = 100 ,

000 and q = 20 ,

000 features,respectively. The number of features is of the same order of magnitude as in the eQTLdataset considered in Section 5. Following the notation at the beginning of Section 2, wedenote the two types of features by index sets S = { s , s . . . s p } and T = { t , t . . . t q } . Foreach individual, the joint p + q dimensional measurement vector is independently drawnfrom a multivariate normal distribution with mean 0 ∈ R p + q and ( p + q ) × ( p + q ) covariancematrix Σ. The covariance matrix Σ is designed so that it has K = 500 true bimodules ofvarious sizes, network structures, signal strengths and intra-correlations. As it is diﬃcultto generate structured covariance matrices while maintaining non-negative deﬁniteness, we M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL instead specify a generative model for the p + q dimensional random row vector ( X, Y ) ∼N p + q (0 , Σ).We now describe how we generated the block cross-correlation signal between the twofeature types, representing observed bimodules. To begin, we partitioned the ﬁrst-half of the S -indices { s , . . . , s (cid:100) p/ (cid:101) } into K disjoint subsets A , A , . . . , A K with sizes chosen accordingto a Dirichlet distribution with parameter (1 , , . . . , ∈ R K . In the same way, we generateda Dirichlet partition B , B , . . . , B K of the ﬁrst-half of T indices { t , . . . , t (cid:100) q/ (cid:101) } independentof the previous partition. The feature-set pairs ( A i , B i ) constitute the true bimodules, whilethe features in second-half of the S - and T -indices are not part of true bimodules. Next,the random sub-vectors ( X A i , Y B i ) corresponding to the true bimodules were generatedindependently for each i ∈ [ K ] using a graph based regression model described below.Let ( A, B ) be a feature set pair, and suppose that ρ ∈ [0 ,

1) and σ > D ∈ { , } | A |×| B | be a binary matrix, which we regard as the adjacency matrix of aconnected bipartite network with vertex set A ∪ B . Then the random row-vector ( X A , Y B )is generated as follows:(11) X A ∼ N | A | (0 , (1 − ρ ) I + ρU ) and Y B = X A D + (cid:15), where (cid:15) ∼ N | B | (0 , σ I ) and U is a matrix of all ones. To understand the bimodule signalproduced by this model, note that ρ governs the intra-correlation between features in A and that for any t ∈ B , the variable Y t is inﬂuenced by features X s such that ( s, t ) is anedge in adjacency matrix D . For each of the true bimodules ( A i , B i ) in the simulation,we independently chose parameters ρ i , σ i , and D i to produce a variety of behaviors whilemaintaining the inherent constraints between them (see Appendix C.1).Among features that are not part of bimodules, features X s j with j > (cid:100) p/ (cid:101) are inde-pendent N (0 ,

1) noise variables and features Y t r with r > (cid:100) q/ (cid:101) are either noise variables(generated independently as N (0 , A k , B k ) and ( A l , B l ) with1 ≤ k < l ≤ K , with probability q = . K , we connect the two bimodules by selecting atrandom (and without replacement) an index r > (cid:100) q/ (cid:101) and making it a bridge variable bydeﬁning(12) Y t r = X s + X s (cid:48) + (cid:15) with (cid:15) ∼ N (0 , σ r ) , for a randomly chosen s ∈ A k and s (cid:48) ∈ A l . The noise variance σ r in (12) is chosen so thatthe correlation strength between Y t r and X s (and X s (cid:48) ) is equal to the average strength ofthe bimodules that are being connected. If Y t r is not a bridge variable, it is taken to benoise.Prior to the addition of bridge variables, the connected components of the populationcross-correlation network are just the bimodules ( A k , B k ). Once bridge variables have beenadded, the population cross-correlation network will have a so-called giant connected compo-nent comprising a substantial portion of the underlying index space S × T . While theoreticalsupport for the presence of giant component in our simulation model comes from the studyof Erd¨os-Renyi random graphs (Bollob´as, 2001), such components have also been observedin empirical eQTL networks (Fagny et al., 2017; Platig et al., 2016). Although the giantcomponent is itself a stable population bimodule, since we only add a small number (348) ofbridge variables, the majority of the cross-correlation signal is in the more densely connectedsets ( A k , B k ), which we continue to refer to as the true bimodules. IMODULE SEARCH PROCEDURE Running BSP and related methods.

We applied BSP to the simulated data usingthe false discovery parameter α = 0 .

01, which was selected to keep the edge-error estimatesunder 0 .

05 (Appendix B.3). The search was initialized from singletons consisting of all thefeatures in T and 1% of the features in S , chosen at random. In what follows, feature-setpairs identiﬁed by BSP (or some other method, when clear from context) will be referredto as detected bimodules. BSP detected 319 unique bimodules while the eﬀective number(Appendix B.4) of detected bimodules was 301.5.To obtain bimodules via CONDOR (Platig et al., 2016), we applied Matrix-eQTL (Sha-balin, 2012) to the simulated dataset with S considered as the set of SNPs and T consideredas the set of genes, to extract feature pairs ( s, t ) ∈ S × T with q-value less than α = 0 . S ∪ T with edges given by the sig-niﬁcant feature pairs found in the previous step. The largest connected component of thisgraph, made up of 28,876 features from S and 6,455 features from T , was passed througha bipartite community detection software (Platig, 2016) which partitioned the nodes of thesub-graph into 112 bimodules.We applied the sCCA method of Witten, Tibshirani and Hastie (2009) to the simulateddata to ﬁnd 100 bimodules. More precisely, for various penalty parameters λ ∈ [0 , (cid:96) normconstraint of λ √ p and λ √ q on the coeﬃcients of the linear combinations correspondingto S and T respectively. Initially, we considered λ = 0 . λ ∈ { . , . , . , . , . } to obtain smaller bimodules.4.2.1. Comparing performance of the methods.

In the simulation study described above,we measure the recovery of a true bimodule ( A t , B t ) by a detected bimodule ( A d , B d ) usingthe two metrics:recall = | A t ∩ A d || B t ∩ B d || A t || B t | and Jaccard = | A t ∩ A d || B t ∩ B d || A t × B t ∪ A d × B d | . Recall captures how well the true bimodule is contained inside the detected bimodule, whileJaccard measures how well the two bimodules match . When assessing the recovery of a truebimodule under a collection of detected bimodules (like the output of BSP), we choosethe detected bimodule with the best recall or Jaccard, depending on the metric underconsideration.As shown in Figure 1, the BSP Jaccard for true bimodules was inﬂuenced primarily bythe cross-correlation strength (cid:113) r ( A,B ) | A || B | of the true bimodule, though the intra-correlationparameter ρ used in the simulation (11) was also seen to have an eﬀect (Figure 1, left). Mostbimodules with cross-correlation strength above 0.4 were completely recovered, while thosewith strength below 0.2 were not recovered. For strengths between 0.2 to 0.4, there was avariation in Jaccard, with smaller Jaccard for bimodules having larger values of ρ (Figure 1,left). The eﬀect of ρ on Jaccard was expected since BSP accounts for the intra-correlationamong features of the same type.The intra-correlation parameter ρ did not have signiﬁcant eﬀect on CONDOR Jaccard,since the method does not account for intra-correlations. Hence, here we only consider theeﬀects of the cross-correlation strength of true bimodules on CONDOR Jaccard (Figure 1,green curve on the right). Regardless of the cross-correlation strength, CONDOR Jaccard M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

Fig 1: Recovery of true bimodules. Left: dependence of cross-correlation strength and intra-correlation parameter of true bimodules on BSP Jaccard. Right: the averaged recoverycurves (recall and Jaccard) for true bimodules under CONDOR and BSP.remained low. This was because CONDOR bimodules often overlapped multiple true bi-modules; indeed, 102 of the 112 CONDOR bimodules overlapped with two or more (up to19) true bimodules, compared with only 21 of the 319 BSP bimodules. However, the resultsfor CONDOR recall (Figure 1, purple curve on the right) show that most true bimoduleswith signiﬁcant cross-correlation strengths were contained inside some CONDOR bimodule.To assess the false discoveries in detected bimodules, we measured the edge-error ofdetected bimodules. The edge-error is the fraction of the essential-edges (10) of a detectedbimodule that are not part of the simulation model, that is, edges not contained in anytrue bimodule and not in the set of bridge edges. The average edge-error for BSP bimoduleswas 0.03, and 90% of the detected bimodules had edge-error under 0.05. In contrast, theaverage edge-error for CONDOR bimodules was 0.08, and 90% of the detected bimoduleshad edge-error under 0.14. The larger edge-error among CONDOR bimodules may havearisen because the method does not account for intra-correlations.Concerning sCCA, the sizes of the detected bimodules were at least an order of magnitudelarger than sizes of the true bimodules when λ exceeded 0 .

04 (Figure 7, Appendix C.2).Thus we only considered λ ≤ .

04. For λ = 0 .

03 and 0 .

04, the detected bimodules hadlarge edge-error (average error 0.47 and 0.65, respectively), while for λ = 0 .

01 and 0 . λ for each of the 100 bimodules. We expect that the results of sCCA would improve if onechose a diﬀerent penalty parameter for each bimodule. However Witten, Tibshirani andHastie (2009) do not provide explicit guidelines to chose diﬀerent penalty parameters foreach component (bimodule), and directly doing a permutation-based grid search each timewould be exceedingly slow.We also studied the performance of BSP and CONDOR on a simulation study withlarger sample size n = 600. As expected, both methods were able to recall bimodules withlower cross-correlation strengths than earlier. However, both BSP and CONDOR had lowerJaccard than in the n = 200 simulation (see Appendix C.3). We discuss this behavior in IMODULE SEARCH PROCEDURE Section 7.

5. Application of BSP to eQTL Analysis.

Here we describe the application ofbimodules to the problem of expression quantitative trait loci (eQTL) analysis discussed inSection 1. The NIH funded GTEx Project has collected and created a large eQTL databasecontaining genotype and expression data from postmortem tissues of human donors. Aunique feature of this database is that it contains expression data from many tissues. Weapplied BSP, CONDOR and standard eQTL-analysis to p = 556 ,

304 SNPs and q = 26 , n = 574 individuals. A detailed account of dataacquisition, preprocessing, and covariate correction can be found in Appendix D.1.The 556K SNPs considered were a representative subset chosen from 4.9 million (directlyobserved and imputed) autosomal SNPs with minor allele frequency greater than 0.1. Usinga representative set decreased computation time and reduced the multiple testing burdenin each iteration of BSP. As SNPs exhibit local correlation due to linkage disequilibrium(LD), the selection process should not reduce the statistical power of BSP. We used an LDpruning software SNPRelate (Zheng, 2015) to select the representative subset of SNPs (seeAppendix D.1 for details).5.1.

Results of BSP.

We applied BSP to the thyroid eQTL data with false discoveryparameter α = 0 .

03 selected to keep the edge-error under 0 .

05 (details in Appendix D.2).The search was initialized from singleton sets of all genes and half of the available SNPs,chosen at random. Thus the search procedure in Section 3.3 was run p/ q ∼ αpq = 3 . × − (see Section 3.3). The majority(277K) of the searches terminated in the empty set after the ﬁrst step; of the remaining27K searches, the great majority identiﬁed a non-empty ﬁxed point within 20 steps. Only 20searches cycled and did not terminate in a ﬁxed point. Among the searches taking more thanone iteration, 94% terminated by the ﬁfth step. Among searches that found a non-emptyﬁxed point, 92.3% of the ﬁxed points contained the seed singleton set of the search.Among the unique bimodules discovered by BSP, some bimodules were similar to others;hence the eﬀective number (Appendix B.4) of bimodules was 3304, slightly smaller than thenumber of unique bimodules. We then applied the ﬁltering procedure described in SectionB.4 to select a sub-collection of 3304 bimodules that were roughly disjoint. The selectedbimodules had SNP sets ranging in size from 1 to 1000, and gene sets ranging in size from1 to 100 (Figure 2); the median size of the gene and SNP sets was 1 and 7, respectively.If required, BSP can be run in a faster (less exhaustive) or slower (more exhaustive)fashion by selecting a smaller or larger fraction of SNPs from which to initialize the searchprocedure. The eﬀective number of discovered bimodules was only slightly smaller (3258)when initializing with 10% of the SNPs.5.2. Running other methods.

Standard eQTL analysis was performed by applying Matrix-eQTL (Shabalin, 2012) twice to the data, ﬁrst to perform a cis -eQTL analysis within a dis-tance of 1MB and next to perform a trans -eQTL analysis. In each case, SNP-gene pairs withBH (Benjamini and Hochberg, 1995) q -value less than 0 .

05 were identiﬁed as signiﬁcant.Matrix-eQTL identiﬁed 186K cis -eQTLs and 73K trans -eQTLs.To obtain CONDOR bimodules (Platig et al., 2016), we applied Matrix-eQTL to identifyboth cis- and trans- eQTLs with BH q-value under the threshold .

2, chosen as in Fagny et al. M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL (2017). The resulting gene-SNP bipartite graph formed by these eQTLs was passed throughCONDOR’s bipartite community detection pipeline (Platig et al., 2016), which partitionedthe nodes of largest connected component of this graph into 6 bimodules.We also applied the sCCA method of Witten, Tibshirani and Hastie (2009) using thepermutation based parameter selection procedure (Witten and Tibshirani, 2020) on thecovariate-corrected genotype and expression matrices to identify 50 bimodules. The identi-ﬁed bimodules were large, containing roughly 100K SNPs and 4K-8K genes (Fig. 2), makingthem diﬃcult to analyze and interpret. The identiﬁed bimodules also exhibited moderateoverlap: the eﬀective number was 25. As such, we excluded the sCCA bimodules from sub-sequent comparisons. Analysis of sCCA on the simulated data (Section 4.2.1) suggests thatthe method may be able to recover smaller bimodules with a more tailored choice of itsparameters. While this is an interesting topic for future research, it is beyond the scope ofthe present paper.5.3.

Quantitative validation.

In this subsection, we apply several objective measures tovalidate and understand the bimodules found by BSP and CONDOR.5.3.1.

Permuted data.

In order to assess the propensity of each method to detect spu-rious bimodules, we applied BSP and CONDOR to ﬁve data sets obtained by jointly per-muting the sample labels for the expression measurements and most covariates (all exceptthe ﬁve genotype PCs), while keeping the labels for genotype measurements and genotypecovariates unchanged. Each data set obtained in this way is a realization of the permuta-tion null deﬁned in Deﬁnition 3.1. BSP found very few (5-12) bimodules in the permuteddatasets compared to the real data (3344). CONDOR found no bimodules in any of thepermuted datasets.5.3.2.

Bimodule sizes.

Most (89%) bimodules found by BSP have fewer than 4 genesand 50 SNPs, but BSP also identiﬁed moderately sized bimodules having 10-100 genes and30-1000 SNPs (see Figure 2). The bimodules found by CONDOR were moderately sized,with 10-100 genes and several hundred SNPs, except for a smaller bimodule with 5 genesand 43 SNPs. On the permuted data, most bimodules found by BSP have fewer than 2genes and 2 SNPs.As a one dimensional measure, we deﬁne the geometric size of a bimodule (

A, B ) to bethe geometric mean (cid:112) | A || B | of its gene and counts, or equivalently, the square root of thenumber of gene-SNP pairs in the bimodule.5.3.3. Connectivity threshold and network sparsity.

Stable bimodules capture aggregateassociation between groups of SNPs and genes, however it is unclear how to recover individ-ual SNP-gene associations within these bimodules. Motivated by the network perspective,in Section 3.4 we proposed evaluating for each bimodule (

A, B ), the connectivity threshold(9) and the corresponding network of essential edges (10) between A and B . To understandthe structure of the network of essential edges, we further calculated the tree-multiplicity (13) TreeMul ( A, B ) . = | essential-edges ( A, B ) || A | + | B | − , which measures the number of essential edges relative to the number of edges in a tree onthe same node set. TreeMul ( A, B ) is never less than 1, and takes the value 1 exactly whenthe essential edges form a tree.

IMODULE SEARCH PROCEDURE Fig 2:

The sizes of bimodules detected by BSP,CONDOR and sCCA, and sizes of bimodules de-tected by BSP under the 5 permuted datasets.

Fig 3:

Correlations corresponding to SNP-genepairs that appear as essential-edges (Section 3.4)in one or more BSP bimodules with geometricsize above 10. Local pairs to the left of the blueline ( cis -analysis threshold) and distal pairs tothe left red line ( trans -analysis threshold) showimportance at the network level but were not dis-covered by standard eQTL analysis.

For bimodules found by BSP, the connectivity thresholds ranged from 0.14 to 0.59 andtree-multiplicities ranged from 1 to 10; the smaller values of the former and larger valuesof latter were associated with bimodules of larger geometric size (Fig. 9, Appendix D.4).Smaller bimodules had large connectivity thresholds and a tree-like essential edge network;in other words, such bimodules were connected under a small number of strong and local(see Section 5.4.2) SNP-gene associations. On the other hand larger bimodules had lowerconnectivity thresholds, meaning that we had to include weaker and often distal (see Sec-tion 5.4.2) SNP-gene associations to connect such bimodules. After including the weakerSNP-gene edges, although the association network for large bimodules had tree-multiplicityaround 10 (Fig. 9, Appendix D.4), these networks were still sparsely connected comparedto the complete bipartite graph on the same nodes.5.4.

Biological Validation.

In order to assess potential biological utility of bimodulesfound by BSP, we compared the SNP-gene pairs in bimodules to those found by standard cis- and trans- eQTL analysis, studied the locations of the SNPs, and examined the genesets for enrichment of known functional categories.5.4.1.

Comparison with standard eQTL analysis.

As described earlier, the bimodulesproduced by CONDOR are derived directly from SNP-gene pairs identiﬁed by cis - and trans -eQTL analysis. Table 1 compares these eQTL pairs with those found in bimodulesidentiﬁed by BSP. Recall that cis -eQTL analysis considers only local SNP-gene pairs (im-proving detection power by reducing multiple testing), while trans -eQTL analysis and BSPdo not use any information about locations of the SNPs and genes. We ﬁnd that half ofthe pairs identiﬁed by cis -eQTL analysis and most of the pairs identiﬁed by trans -eQTLanalysis appear in at least one bimodule.Bimodules capture sub-networks of SNP-gene associations rather than individual eQTLs,and as such individual SNP-gene pairs in a bimodule need not be eQTLs. In fact, the resultsof Section 5.3.3 suggest that the association networks underlying large bimodules may besparse. Deﬁne a bimodule (

A, B ) to be connected by a set of eQTLs if the bipartite graphwith vertex set A ∪ B and edges corresponding to the eQTLs is connected. As shown in M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBELAnalysis type % eQTLs found among bimodules % bimodules connected by eQTLs trans -eQTL analysis 84% 70% cis -eQTL analysis 51% 88%

Table 1

Comparison of BSP and standard eQTL analysis. A gene-SNP pair is said to be found among a collectionbimodules if the gene and SNP are both part of some common bimodule. On the other hand, we say that abimodule is connected by a collection of eQTLs if under the gene-SNP pairs from the collection, thebimodule forms a connected graph.

Table 1, a signiﬁcant fraction of BSP bimodules are not connected by either cis- or trans- eQTLs. The discovery of such bimodules suggests that the sub-networks identiﬁed by BSPcannot be found by standard eQTL analysis, and that these sub-networks can provide newinsights and hypotheses for further study.To identify potentially new eQTLs using BSP, we examine bimodule connectivity underthe combined set of cis- and trans- eQTLs. All of the bimodules with one SNP or one geneare connected by the combined set of eQTLs (Appendix D.5), and therefore all edges inthese bimodules are discovered by standard analyses. On the other hand, 224 out of the 358bimodules with geometric size larger than 10 were not connected by the combined set ofeQTLs. In Figure 3, we plot the correlations corresponding to SNP-gene pairs that appearas essential-edges (Section 3.4) in one or more bimodules with geometric size above 10, alongwith the correlation thresholds for cis -eQTL (blue line) and trans -eQTL (red line) analysis.Around 300 local edges (i.e. the SNP is located within 1MB of the gene transcription startsite) and 8.8K distal edges do not meet the correlation thresholds for cis - and trans -eQTLanalysis, respectively, but show evidence of importance at the network level, and may beworthy of further study.5.4.2. Genomic locations.

We studied the chromosomal location and proximity of SNPsand genes from bimodules found by BSP and CONDOR. While CONDOR uses genomiclocations as part of the cis -eQTL analysis in its ﬁrst stage, BSP does not make use oflocation information. Genetic control of expression is often enriched in a region local tothe gene (GTEx Consortium, 2017). All CONDOR bimodules, and almost all (99.3%) BSPbimodules, have at least one local SNP-gene pair (the SNP is located within 1MB of thegene transcription start site). In 93.5% of the smaller BSP bimodules (geometric size 10 orsmaller) and 54.8% of the medium to large BSP bimodules (geometric size above 10) eachgene and each SNP had a local counterpart SNP or gene within the bimodule.For each bimodule, we examined the chromosomal locations of its SNPs and genes. AllSNPs and many of the genes from the six CONDOR bimodules were located on Chromosome6; two CONDOR bimodules also had genes located on Chromosome 8 and Chromosome 9.The SNPs and genes from the BSP bimodules were distributed across all 23 chromosomes:170 of the 2947 small bimodules spanned 2 to 5 chromosomes and 152 of the 358 mediumto large bimodules spanned 2 to 11 chromosomes; however the remaining bimodules werelocalized to a chromosome each.Figure 4 illustrates the genomic locations of two bimodules found by BSP, with SNPlocation on the left and gene location on the right (only active chromosomes are shown).In addition, the ﬁgure illustrates the essential edges (Section 3.4) of each bimodule. Theresulting bipartite graph provides insight into the underlying associations between SNPsand genes that constitute the bimodule. See Appendix D.6 for more such illustrations.

IMODULE SEARCH PROCEDURE Fig 4: The gene-SNP association network for two BSP bimodules mapped onto the genome.The network of essential edges was formed by thresholding the cross-correlation matrix forthe bimodule at the connectivity threshold (Section 3.4).5.4.3.

Gene Ontology enrichment for bimodules.

The Gene Ontology (GO) databasecontains a curated collection of gene sets that are known to be associated with diﬀerentbiological functions (c.f. Gene Ontology Consortium, 2014; Botstein et al., 2000; Rhee et al.,2008). The topGO (Alexa and Rahnenfuhrer, 2018) package assesses whether sets in theGO database are enriched for a given gene set using Fisher’s test. For each of the 145BSP bimodules having a gene set B with 8 or more elements, we used topGO to assess theenrichment of B in 6463 GO gene sets of size more than 10, representing biological processes;however these signiﬁcant sets were not apparently related to thyroid-speciﬁc function. Weretained results with signiﬁcant BH q -values ( α = .

6. Application of BSP to North American Temperature-Precipitation Data.

Introduction.

The relationship between temperature and precipitation over NorthAmerica has been well documented (Madden and Williams, 1978; Berg et al., 2015; Adleret al., 2008; Livneh and Hoerling, 2016; Hao et al., 2018) and is of agricultural impor-tance. For example, Berg et al. (2015) noted widespread correlation between summertimemean temperature and precipitation at the same location over various land regions. Weexplore these relationships using the Bimodule Search Procedure. In particular, the methodallows us to search for clusters of distal temperature-precipitation relationships, known asteleconnections, whereas previous work has mostly focused on analyzing spatially proximalcorrelations.We applied BSP to ﬁnd pairs of geographic regions such that summer temperature in theﬁrst region is signiﬁcantly correlated in aggregate with summer precipitation in the secondregion one year later. We will refer to such region pairs as T-P (temperature-precipitation) M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL bimodules. T-P bimodules reﬂect mesoscale analysis of region-speciﬁc climatic patterns,which can be useful for predicting impact of climatic changes on practical outcomes likeagricultural output.6.2.

Data Description and Processing.

The Climatic Research Unit (CRU TS version4.01) data (Harris et al., 2014) contains daily global measurements of temperature (T) andprecipitation (P) levels on land over a . o × . o (360 pixels by 720 pixels) resolution gridfrom 1901 to 2016. We reduced the resolution of the data to 2 . o × . o (72 by 144 pixels) byaveraging over neighboring pixels and restricted to 427 pixels corresponding to the latitude-longitude pairs within North America. For each available year and each pixel/location weaveraged temperature (T) and precipitation (P) over the summer months of June, July, andAugust. Each feature of the resulting time series was centered and scaled to have zero meanand unit variance. The data matrix X , reﬂecting temperature, had 115 rows containing theannual summer-aggregated temperatures from 1901 to 2015 for each of the 427 locations.The data matrix Y , reﬂecting precipitation, had 115 rows containing the annual summer-aggregated precipitation from 1902 to 2016 (lagged by one year from temperature) for eachof the 427 locations.Analysis of summer precipitation versus summer temperatures lagged by 2 years, andtemperatures from diﬀerent seasons (winter T; summer P of the same year) in the sameyear did not yield any bimodules.6.3. Bimodules Search Procedure and Diagnostics.

We ran BSP on the processed datawith the false discovery parameter α = 0 . { . , . , . , . . . . } to keep edge-error under 0.1 (see Figure 12, Appendix E). BSP searches for groups of temper-ature and precipitation pixels that have signiﬁcant aggregate cross-correlation. Temperatureand precipitation are known to be spatial and temporally auto-correlated. Although BSPdoes not use spatial locations of the pixels, it directly accounts for spatial-correlations. Thepermutation null (Deﬁnition 3.1) used in BSP imposes an exchangeablilty assumption onsamples which may fail under temporal auto-correlation. The temporal auto-correlation inour data was moderate, ranging from 0.10 to 0.30 for various features.BSP found ﬁve distinct bimodules, while the eﬀective number (Appendix B.4) of bimod-ules was three. After the ﬁltering step (Appendix B.4), the two bimodules illustrated inFigure 5 and another bimodule with 80 temperature pixels and 5 precipitation pixels re-mained. We omitted a further analysis of the last bimodule since its precipitation pixelswere same as those of bimodule B in Figure 5 and its temperature pixels were geographicallyscattered.Temperature pixels in the two bimodules are situated distally from the precipitationpixels, but the temperature and precipitation pixels within a bimodule form blocks of con-tiguous geographical regions. Since BSP did not use any location information while searchingfor these bimodules, these eﬀects might have a common spatial origin.The locations from the bimodules occupy large geographical areas on the map. Theprecipitation pixels from the bimodule on the left in Fig. 5 form a vertical stretch aroundthe eastern edge of the Great Plains and are correlated with temperature pixels in large areasof land in the Paciﬁc Northwest, Alaska, and Mexico. In the second bimodule Fig. 5 (right)precipitation pixels in the southern Great Plains around Oklahoma are strongly correlatedwith temperature pixels in the Northwestern Great Plains. An anomalously hot summerOregon in one year in the Northwest suggests an anomalously rainy growing season in thefollowing year in the Southern Great Plains. Pixel-wise positive correlations are conﬁrmed IMODULE SEARCH PROCEDURE CRU: T(JJA)-P(JJA, oﬀset), 1901-2016 , α = . Fig 5:

Bimodules of summer temperature and precipitation in North America from CRU observations from1901-2016. The left bimodule ( A ) contains 149 temperature locations (pixels) and 6 precipitation locations.The right bimodule ( B ) contains 53 temperature and 5 precipitation locations. in Appendix E.The coastal proximity in all the temperature clusters suggest inﬂuences of oscillations insea surface temperatures. Aforementioned patterns from both bimodules map to locationsof agricultural productivity, such as in Oklahoma and Missouri (ﬁgure 5).The bimodules found by BSP only consider the magnitudes of correlations between thetemperature and precipitation pixels. Upon further analysis of these bimodules we see thatthe signiﬁcantly correlated temperature and precipitation pixels are positively correlated inthe Great Plains region. These results agree with ﬁndings on concurrent T-P correlations inthe Great Plains (Zhao and Khalil, 1993; Berg et al., 2015; Wang et al., 2019), which notedwidespread correlations between summertime mean temperatures and precipitation at thesame location over land in various parts of North America, notably the Great plains. Ourﬁndings show strong correlations between northwestern (coastal) temperatures and GreatPlains precipitation and generally agree with ﬁndings in the literature. For example, Livnehand Hoerling (2016) considered the relationship between hot temperatures and droughts inthe Great Plains, noting that hot temperatures in the summer are related to droughts inthe following year on the overall global scale. The results of Livneh and Hoerling (2016)preface the results contained within the bimodules, but BSP is able to ﬁnd regions wherethis eﬀect is signiﬁcant.Our ﬁndings demonstrate the utility of BSP in ﬁnding insights into remote correlationsbetween precipitation and temperature in North America. Further research may build onthese exploratory ﬁndings and create a model that can forecast precipitation in agricultur-ally productive regions around the world.

7. Discussion.

The Bimodule Search Procedure (BSP) is an exploratory tool thatsearches for groups of features with signiﬁcant aggregate cross-correlation, which we referto as bimodules. Rather than relying on an underlying generative model, BSP makes useof iterative hypothesis-testing to identify stable bimodules, which satisfy a natural stabilitycondition. The false discovery threshold α ∈ (0 ,

1) is the only free parameter of the proce-dure. Eﬃcient approximation of the p-values used for iterative testing allow BSP to run onlarge datasets.At the population level, stable bimodules can be characterized in terms of the connected M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL components of the population cross-correlation network. At the sample level, stable bimod-ules depend on both cross-correlations and intra-correlations, which are not part of thecross-correlation network. Nevertheless, the network perspective provides insights in boththe simulation study and the real data analysis.Using a complex, network-based simulation study, we found that BSP was able to recovermost true bimodules with signiﬁcant cross-correlation strength, while simultaneously con-trolling the false discovery of edges having network-level importance. Among true bimoduleswith similar cross-correlation strengths, those with lower intra-correlations were more likelyto be recovered than those with higher intra-correlations, reﬂecting the incorporation ofintra-correlations in the calculation of p-values; the eﬀects of intra-correlations were mostpronounced when the cross-correlation strength was moderate.When applied to eQTL data, BSP bimodules identiﬁed both local and distal eﬀects,capturing half of the eQTLs found by standard cis-analysis and most of the eQTLs found bystandard trans-analysis. Further, a substantial proportion of bimodules contained SNP-genepairs that were important at the network level but not deemed signiﬁcant under pairwisetrans-analysis.At root, the discovery of bimodules by BSP and CONDOR is driven by the presenceor absence of correlations between features of diﬀerent types. A key issue for these, andrelated, methods is how they behave with increasing sample size. In general, increasingsample size will yield greater power to detect cross-correlations, and therefore one expectsthe sizes of bimodule to increase. While this is often a desirable outcome, in applicationswhere non-zero cross-correlations (possibly of small size) are the norm, this increased powermay yield very large bimodules with little interpretive value. Evidence of this phenomenais found in the simulation study where, due to the presence of cross-edges, increasing thesample size from n = 200 to n = 600 yields larger BSP bimodules, which often containmultiple true bimodules (Appendix C.3). This may well reﬂect the underlying biology ofgenetic regulation: the omni-genic hypothesis of Boyle, Li and Pritchard (2017) suggeststhat a substantial portion of the gene-SNP cross-correlation network might be connectedat the population level.An obvious way to address “super connectivity” of the cross-correlation network is tochange the deﬁnition of bimodule to account for the magnitude of cross-correlations, ratherthan their mere presence or absence. Incorporating a more stringent deﬁnition of connec-tivity in BSP would require modifying the permutation null distribution and addressingthe theory and computation behind such a modiﬁcation, both of which are areas of futureresearch. Acknowledgment.

M.D., J.P., and A.B.N. were supported by NIH grant R01 HG009125-01 and NSF grant DMS-1613072. M.H. was awarded the Department of Defense, Air ForceOﬃce of Scientiﬁc Research, National Defense Science and Engineering Graduate (NDSEG)Fellowship, 32 CFR 168a and funded by government support under contract FA9550-11-C-0028. M.I.L. was supported by NIH grants R01 HG009125-01 and R01-HG009937. Theauthors wish to acknowledge numerous helpful conversations with Fred Wright and RichardSmith.

References.

Adler, R. F. , Gu, G. , Wang, J.-J. , Huffman, G. J. , Curtis, S. and

Bolvin, D. (2008). Relationshipsbetween global precipitation and surface temperature on interannual and longer timescales (19792006).

Journal of Geophysical Research: Atmospheres .IMODULE SEARCH PROCEDURE Albert, F. W. and

Kruglyak, L. (2015). The role of regulatory variation in complex traits and disease.

Nature Reviews Genetics Alexa, A. and

Rahnenfuhrer, J. (2018). topGO: Enrichment Analysis for Gene Ontology R packageversion 2.34.0.

Barber, M. J. (2007). Modularity and community detection in bipartite networks.

Physical Review E Beckett, S. J. (2016). Improved community detection in weighted bipartite networks.

Royal Society openscience Benjamini, Y. and

Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerfulapproach to multiple testing.

Journal of the Royal statistical society: series B (Methodological) Benjamini, Y. and

Yekutieli, D. (2001). The control of the false discovery rate in multiple testing underdependency.

Ann. Statist. Berg, A. , Lintner, B. R. , Findell, K. , Seneviratne, S. I. , van den Hurk, B. , Ducharne, A. , Chruy, F. , Hagemann, S. , Lawrence, D. M. , Malyshev, S. , Meier, A. and

Gentine, P. (2015).Interannual Coupling between Summertime Surface Temperature and Precipitation over Land: Processesand Implications for Climate Change.

Journal of Climate Bodwin, K. , Chakraborty, S. , Zhang, K. and

Nobel, A. B. (2017). Latent Association Mining in BinaryData. arXiv preprint arXiv:1711.10427 . Bodwin, K. , Zhang, K. , Nobel, A. et al. (2018). A testing based approach to the discovery of diﬀerentiallycorrelated variable sets.

The Annals of Applied Statistics Bollob´as, B. (2001). The evolution of random graphsthe giant component. In

Random graphs

Botstein, D. , Cherry, J. M. , Ashburner, M. , Ball, C. , Blake, J. , Butler, H. , Davis, A. , Dolin-ski, K. , Dwight, S. , Eppig, J. et al. (2000). Gene Ontology: tool for the uniﬁcation of biology.

Nat genet Boyle, E. A. , Li, Y. I. and

Pritchard, J. K. (2017). An expanded view of complex traits: from polygenicto omnigenic.

Cell

Chen, X. , Shi, X. , Xu, X. , Wang, Z. , Mills, R. , Lee, C. and

Xu, J. (2012). A two-graph guided multi-tasklasso approach for eQTL mapping.

Journal of Machine Learning Research Cheng, W. , Zhang, X. , Wu, Y. , Yin, X. , Li, J. , Heckerman, D. and

Wang, W. (2012). Inferring novelassociations between SNP sets and gene sets in eQTL study using sparse graphical model. In

Proceedingsof the ACM Conference on Bioinformatics, Computational Biology and Biomedicine

Cheng, W. , Shi, Y. , Zhang, X. and

Wang, W. (2015). Fast and robust group-wise eQTL mapping usingsparse graphical models.

BMC bioinformatics Cheng, W. , Shi, Y. , Zhang, X. and

Wang, W. (2016). Sparse regression models for unraveling group andindividual associations in eQTL mapping.

BMC bioinformatics Gene Ontology Consortium (2014). Gene ontology consortium: going forward.

Nucleic acids research D1049–D1056.

GTEx Consortium (2017). Genetic eﬀects on gene expression across human tissues.

Nature

Cormen, T. H. , Leiserson, C. E. , Rivest, R. L. and

Stein, C. (2009).

Introduction to algorithms . MITpress.

Costa, A. and

Hansen, P. (2014). A locally optimal hierarchical divisive heuristic for bipartite modularitymaximization.

Optimization letters Dol´edec, S. and

Chessel, D. (1994). Co-inertia analysis: an alternative method for studying species–environment relationships.

Freshwater biology Fagny, M. , Paulson, J. N. , Kuijjer, M. L. , Sonawane, A. R. , Chen, C.-Y. , Lopes-Ramos, C. M. , Glass, K. , Quackenbush, J. and

Platig, J. (2017). Exploring regulation in tissues with eQTL networks.

Proceedings of the National Academy of Sciences

E7841–E7850.

Hao, Z. , Hao, F. , Singh, V. P. and

Zhang, X. (2018). Quantifying the relationship between compounddry and hot events and El Niosouthern Oscillation (ENSO) at the global scale.

Journal of Hydrology

332 - 338.

Harris, I. , Jones, P. D. , Osborn, T. J. and

Lister, D. H. (2014). Updated high-resolution grids ofmonthly climatic observations the CRU TS3.10 Dataset.

International Journal of Climatology Hastie, T. , Tibshirani, R. and

Friedman, J. (2009).

The elements of statistical learning: data mining,inference, and prediction . Springer Science & Business Media.

Huang, Y. , Wuchty, S. , Ferdig, M. T. and

Przytycka, T. M. (2009). Graph theoretical approach tostudy eQTL: a case study of Plasmodium falciparum.

Bioinformatics i15–i20. M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

Kolberg, L. , Kerimov, N. , Peterson, H. and

Alasoo, K. (2020). Co-expression analysis reveals inter-pretable gene modules controlled by trans -acting genetic variants. eLife e58705. Lahat, D. , Adali, T. and

Jutten, C. (2015). Multimodal Data Fusion: An Overview of Methods, Chal-lenges, and Prospects.

Proceedings of the IEEE

Liu, X. and

Murata, T. (2010). An eﬃcient algorithm for optimizing bipartite modularity in bipartitenetworks.

Journal of Advanced Computational Intelligence and Intelligent Informatics Livneh, B. and

Hoerling, M. P. (2016). The Physics of Drought in the U.S. Central Great Plains.

Journalof Climate Madden, R. A. and

Williams, J. (1978). The Correlation between Temperature and Precipitation in theUnited States and Europe.

Monthly Weather Review

McCabe, S. D. , Lin, D.-Y. and

Love, M. I. (2019). Consistency and overﬁtting of multi-omics methodson experimental data.

Brief Bioinform . McIntosh, A. , Bookstein, F. , Haxby, J. V. and

Grady, C. (1996). Spatial pattern analysis of functionalbrain images using partial least squares.

Neuroimage Meng, C. , Zeleznik, O. A. , Thallinger, G. G. , Kuster, B. , Gholami, A. M. and

Culhane, A. C. (2016). Dimension reduction techniques for the integrative analysis of multi-omics data.

Brieﬁngs inbioinformatics Nash, J. F. (1950). Equilibrium points in n-person games.

Proceedings of the National Academy of Sciences Nica, A. C. and

Dermitzakis, E. T. (2013). Expression quantitative trait loci: present and future.

Philo-sophical Transactions of the Royal Society B: Biological Sciences

Palowitch, J. , Bhamidi, S. and

Nobel, A. B. (2016). The continuous conﬁguration model: A null forcommunity detection on weighted networks. arXiv preprint arXiv:1601.05630 . Pan, C. , Luo, J. , Zhang, J. and

Li, X. (2019). BiModule: biclique modularity strategy for identifyingtranscription factor and microRNA co-regulatory modules.

IEEE/ACM Transactions on ComputationalBiology and Bioinformatics . Parkhomenko, E. , Tritchler, D. and

Beyene, J. (2007). Genome-wide sparse canonical correlation ofgene expression with genotypes. In

BMC proceedings S119. BioMed Central.

Parkhomenko, E. , Tritchler, D. and

Beyene, J. (2009). Sparse canonical correlation analysis withapplication to genomic data integration.

Statistical applications in genetics and molecular biology . Patel, P. V. , Gianoulis, T. A. , Bjornson, R. D. , Yip, K. Y. , Engelman, D. M. and

Gerstein, M. B. (2010). Analysis of membrane proteins in metagenomics: Networks of correlated environmental featuresand protein families.

Genome Research Pesantez-Cabrera, P. and

Kalyanaraman, A. (2016). Detecting communities in biological bipartitenetworks. In

Proceedings of the 7th ACM International Conference on Bioinformatics, ComputationalBiology, and Health Informatics

Platig, J. (2016). condor: COmplex Network Description Of Regulators R package version 1.1.1.

Platig, J. , Castaldi, P. J. , DeMeo, D. and

Quackenbush, J. (2016). Bipartite community structure ofeQTLs.

PLoS computational biology e1005033. Pucher, B. M. , Zeleznik, O. A. and

Thallinger, G. G. (2019). Comparison and evaluation of integrativemethods for the analysis of multilevel omics data: a study based on simulated and experimental cancerdata.

Brieﬁngs in bioinformatics Rhee, S. Y. , Wood, V. , Dolinski, K. and

Draghici, S. (2008). Use and misuse of the gene ontologyannotations.

Nature Reviews Genetics Sankaran, K. and

Holmes, S. P. (2019). Multitable methods for microbiome data integration.

Frontiersin genetics . Shabalin, A. A. (2012). Matrix eQTL: ultra fast eQTL analysis via large matrix operations.

Bioinformatics Shabalin, A. A. , Weigman, V. J. , Perou, C. M. , Nobel, A. B. et al. (2009). Finding large averagesubmatrices in high dimensional data.

The Annals of Applied Statistics Tian, L. , Quitadamo, A. , Lin, F. and

Shi, X. (2014). Methods for population-based eQTL analysis inhuman genetics.

Tsinghua Science and Technology Tini, G. , Marchetti, L. , Priami, C. and

Scott-Boyer, M.-P. (2019). Multi-omics integrationa compar-ison of unsupervised clustering methodologies.

Brieﬁngs in bioinformatics Waaijenborg, S. , de Witt Hamer, P. C. V. and Zwinderman, A. H. (2008). Quantifying the associa-tion between gene expressions and DNA-markers by penalized canonical correlation analysis.

Statisticalapplications in genetics and molecular biology . Wang, B. , Luo, X. , Yang, Y.-M. , Sun, W. , Cane, M. A. , Cai, W. , Yeh, S.-W. and

Liu, J. (2019).Historical change of El Ni˜no properties sheds light on future changes of extreme El Ni˜no.

Proceedings of

IMODULE SEARCH PROCEDURE the National Academy of Sciences Westra, H.-J. and

Franke, L. (2014). From genome to function by studying eQTLs.

Biochimica et Bio-physica Acta (BBA)-Molecular Basis of Disease

Witten, D. M. , Tibshirani, R. and

Hastie, T. (2009). A penalized matrix decomposition, with applica-tions to sparse principal components and canonical correlation analysis.

Biostatistics Witten, D. and

Tibshirani, R. (2020). PMA: Penalized Multivariate Analysis R package version 1.2.1.

Wu, X. , Liu, Q. and

Jiang, R. (2009). Align human interactome with phenome to identify causative genesand networks underlying disease families.

Bioinformatics Zhao, W. and

Khalil, M. A. K. (1993). The Relationship between Precipitation and Temperature overthe Contiguous United States.

Journal of Climate Zheng, X. (2015). A Tutorial for the R/Bioconductor Package SNPRelate.

Zhou, Y.-H. , Barry, W. T. and

Wright, F. A. (2013). Empirical pathway analysis, without permutation.

Biostatistics Zhou, Y.-H. , Gallins, P. and

Wright, F. (2019). Marker-Trait Complete Analysis. bioRxiv

APPENDIX A: PROOF OF LEMMA 2

Proof of Lemma 2.

Suppose 0 < (cid:15) < (cid:15) . Fix A ⊆ S and observe that for any B ⊆ T Φ (cid:15) ( A, B ) = (cid:88) t ∈ B (cid:88) s ∈ A (cid:0) ρ ( s, t ) − (cid:15) (cid:1) = (cid:88) t ∈ B ( ρ ( A, t ) − (cid:15) | A | ) . (14)Since (cid:15) | A | < (cid:15) | A | ≤ δ , for any t ∈ T , if ρ ( A, t ) > ρ ( A, t ) − (cid:15) | A | >

0. Hence themaximum over subsets B ⊆ T will be uniquely attained at(15) arg max B ⊆ T Φ (cid:15) ( A, B ) = (cid:8) t ∈ T | ρ ( A, t ) − (cid:15) | A | > (cid:9) = (cid:8) t ∈ T | ρ ( A, t ) > (cid:9) Similarly, if we ﬁx A ⊆ T , we can show(16) arg max A ⊆ S Φ (cid:15) ( A, B ) = (cid:8) s ∈ S | ρ ( s, B ) > (cid:9) Hence the pair of non-empty sets ( A ∗ , B ∗ ) is a Nash equilibrium if and only if A ∗ = (cid:8) s ∈ S | ρ ( s, B ∗ ) > (cid:9) and B ∗ = (cid:8) t ∈ T | ρ ( A ∗ , t ) > (cid:9) . These conditions are the same as those required for ( A ∗ , B ∗ ) to be a population bimodule(Deﬁnition 2.1). APPENDIX B: BSP IMPLEMENTATION DETAILS B.1. Dealing with cycles and large sets.

In practice, we do not want the sizes ofthe sets ( A k , B k ) in the iteration to grow too large as this slows computation, and largebimodules are diﬃcult to interpret. Therefore the search procedure is terminated when thegeometric size of ( A k , B k ) exceeds 5000. In some cases, the sequence of iterates ( A k , B k )for k ∈ { , . . . , k max } will form a cycle of length greater than 1, and will therefore fail toreach a ﬁxed point. To search for a nearby ﬁxed point instead, when we encounter the cycle( A k , B k ) = ( A l , B l ) for some l < k −

1, we set ( A l +1 , B l +1 ) to ( A k ∩ A k − , B k ∩ B k − ) andcontinue the iteration. M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

B.2. Initialization heuristics for large datasets.

When S is large, we initializeBSP from all the features in T , but initialize only from a subset of randomly chosen fea-tures in S . We also note that identical resultant bimodules are repeatedly discovered byBSP starting from diﬀerent initializations, often from features within said bimodule. Thisproblem is particularly prominent for large bimodules which may be rediscovered by uptoseveral thousand initializations. Hence, to avoid some of this redundant computation, wemay skip initializing BSP from features in the bimodules that have already been discovered. B.3. Choice of α using half-permutation based edge-error estimates. To selectthe false discovery parameter α for BSP, we estimate the edge-error for each value of α froma pre-speciﬁed grid. The edge-error is an edge-based false discovery notion for bimodules,deﬁned as the average fraction of erroneous essential-edges (deﬁned in Section 3.4) amongbimodules. Since we do not know the ground truth, we estimate the edge-error for BSP byrunning it on instances of the half-permuted dataset in which the sample labels for half ofthe features from each data type have been permuted. Further details are given below.B.3.1. Half-permutation.

Comparing results between the original and permuted data(Deﬁnition 3.1) allow us to assess the false discoveries from BSP when there are no trueassociations between features from S and T . However, we often expect associations betweenat least some variables from S and T (in fact, these are the ones that we want to ﬁnd). Tocreate a null distribution where some of pairs of features from S and T are correlated andsome are not, we use the following half-permutation scheme. Let ( X , Y ) denote the originaldata, where X and Y are measurements matrices for the two data types. We generate a half-permuted dataset ( ˜ X , ˜ Y ) as follows:1. Randomly select half the features, ˆ S ⊆ S and ˆ T ⊆ T , from each data type.2. Randomly permute the rows of the submatrix of X that corresponding to the columnsˆ S , and call the resulting matrix ˜ X . In other words, the submatrix corresponding tothe features S \ ˆ S is the same in X and ˜ X , while the sample labels of the submatrixof ˜ X corresponding to features in ˆ S have been permuted X ˆ S = P X ˆ S by a randompermutation matrix P .3. Similarly, permute the rows of matrix Y corresponding to the features ˆ T using anotherindependent permutation matrix P . Call the resulting matrix ˜ Y .Note that, together, the “half-permutation” steps 2 and 3 temper the cross-correlationbetween pairs of features in ˆ S × T ∪ S × ˆ T . Let B = { ( A , B ) , ( A , B ) . . . ( A K , B K ) } be thecollection of bimodules in the half-permuted data ( ˜ X , ˜ Y ). We will assume that the collection B is already ﬁltered for overlaps (Section B.4 below). We deﬁne the edge-error estimate forthe collection B as(17) (cid:92) edge-error ( B ) = 1 |B| (cid:88) ( A,B ) ∈B (cid:12)(cid:12)(cid:12) essential-edges ( A, B ) ∩ (cid:16) ˆ S × T ∪ S × ˆ T (cid:17)(cid:12)(cid:12)(cid:12) | essential-edges ( A, B ) | . In practice, we generate half-permuted datasets and use the edge-error estimate (17)to choose α as follows. First, we generate a pre-speciﬁed number N of instances of thehalf-permuted dataset. If the covariates are present, we correct for them after the half-permutation step. Next, for each α among a range of values, e.g. { . , . , . . . , . } , werun BSP with false discovery parameter α over the each of the half-permuted datasets andcalculate the average edge-error (17) of the resulting collection of bimodules for that value IMODULE SEARCH PROCEDURE of α , averaged over all the N half-permuted instances. We can then choose an α from thegrid that has average edge-error smaller a pre-speciﬁed value like 0.05. Generally smallervalues of α tend to have smaller edge error, so we choose the largest value of α from the gridthat has acceptable edge error. However, we may chose a smaller value of α if the bimodulesare too large.A caveat with the above procedure to select α is that the edge-error estimates may bequite variable even when averaged over a large number N of half-permuted datasets. Oneexplanation for this variability is that diﬀerent ˆ S and ˆ T are chosen for each instance ofthe half-permutation. Nevertheless, if we observe variability we choose a more conservativevalue of α . As seen in Section 4.2.1, even without access to the ground truth, we were ableto keep the true edge-error under 0 .

05 by using the above strategy to select α . B.4. Filtering overlaping bimodules.

Running BSP starting from diﬀerent initial-izations may lead to a collection B = { ( A ∗ , B ∗ ) , ( A ∗ , B ∗ ) , . . . ( A ∗ J , B ∗ J ) } of J , potentiallyrepeating and overlapping, bimodules. A typical bimodule might occur multiple times in B , and distinct bimodules in B might have substantial overlap. To reduce duplication, wecount the eﬀective number of disjoint bimodules in the collection B and propose a way toselect a subset of those many bimodules from B that are most disjoint. Details follow.We use the following deﬁnition N e ( B ) of eﬀective number of disjoint bimodules amongthe collection B , adapted from Shabalin et al. (2009): N e ( B ) . = (cid:88) ( A,B ) ∈B | A || B | (cid:88) s ∈ A,t ∈ B C B ( s, t ) , (18)where C B ( s, t ) = (cid:88) ( A,B ) ∈B I { s ∈ A,t ∈ B } (19)is the number of bimodules that contain the pair ( s, t ). As noted in Shabalin et al. (2009),the measure N e has the property that if there were r distinct bimodules in B , all disjointwhen considered as collection of feature pairs, then N e ( B ) = r .When there is substantial overlap between the bimodules in B , N e ( B ) is typically smallerthan |B| . We can then choose N = (cid:100) N e ( B ) (cid:101) most distinct representative bimodules among B , as follows. Cluster the collection of bimodules B into N groups We use hierarchical agglomer-ative clustering using the average linkage method (see Hastie, Tibshirani and Fried-man, 2009) based on the distance metric given by d J (( A , B ) , ( A , B )) = 1 − | A × B ∩ A × B || A × B ∪ A × B | (20)When we consider the bimodules as a collection of feature pairs, the metric d J issimply the Jaccard distance between the bimodules. Select one representative bimodule from each cluster

Let

C ⊆ B be one of the N bimodule clusters obtained in the previous step. We select a bimodule representative( A, B ) ∈ C from this cluster cluster that maximizes the following importance score :(21) is C ( A, B ) = (cid:88) s ∈ A,t ∈ B (cid:88) ( A (cid:48) ,B (cid:48) ) ∈C I { s ∈ A (cid:48) ,t ∈ B (cid:48) } . M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

The score is C ( A, B ) aggregates the importance of each feature pair within (

A, B ),measured using the number of (potentially repeating) bimodules from C the feature-pair was a part of. Bimodules that are larger and have more overlapping pairs will bepreferred as representatives under this scheme. B.5. Covariate correction.

In some cases the data matrices [ X , Y ] ∈ R n × ( p + q ) areaccompanied by one or more covariates like sex, platform details and PEER factors thatmust be accounted for by removing their eﬀects before discovering bimodules. Suppose weare given m such linearly independent covariates v , . . . , v m ∈ R n . Here we describe howto modify BSP to remove their eﬀects. First, we residualize each column of the originaldata [ X , Y ] by setting up a linear model with explanatory variables v , . . . , v m . Denote theresulting matrix by [ X (cid:48) , Y (cid:48) ] ∈ R n × ( p + q ) that has columns which are projections of thoseof [ X , Y ] onto the subspace orthogonal to v , . . . v m . We would like to now run BSP on[ X (cid:48) , Y (cid:48) ], however since the columns of [ X (cid:48) , Y (cid:48) ] lie on an n (cid:48) = n − m (cid:48) dimensional subspace,the independence assumption of p-value calculations in Section 3.3.1 would fail. However,as argued in Zhou, Barry and Wright (2013), it is enough to replace the sample size n withthe eﬀective sample size of n (cid:48) in the p-value calculations. B.6. Checking our p-value approximation.

To check the uniformity of our approx-imate p-value under the permutation null, we chose a bimodule (

A, B ) found in Section 5and a t ∈ B . Then we randomly permuted the labels of gene t (10 times), computing ourp-value approximation ˆ p ( A, t ) in each case. Hence we are assessing the uniformity of ˆ p ( A, t )under the permutation null distribution. The result in Figure 6 shows that the computedp-values are almost uniform but extremely small p-values show anti-conservative behavior.A potential reason for this anti-conservative behavior is that the tails of test statistic underthe permutation distribution may be heavier compared to the tails of the location-shiftedGamma distribution that we use to approximate it, since the permutation distribution isdiscrete distribution which explicitly depends on the exact entries of the data matrices.APPENDIX C: SIMULATION DETAILS

C.1. Simulation model for each true bimodule.

As described in Section 4, given ρ, σ ∈ [0 ,

1] and a binary adjacency matrix D ∈ { , } | A |×| B | representing a connectedbipartite graph on vertices A and B (called the regressor-graph), the variables ( X A , Y A ) fora bimodule ( A, B ) can be simulated as(22) X A ∼ N | A | (0 , (1 − ρ ) I + ρU ) and Y tB = D t X tA + (cid:15), where U is the matrix of all ones and (cid:15) ∼ N (0 , σ ). The parameters ρ, σ ∈ [0 ,

1] and D appearing in (22) are chosen independently for each true bimodule ( A, B ) as follows:1. Choose a constant β ∈ [0 ,

1] uniformly at random. With d . = (cid:100) β | A |(cid:101) , let D be theadjacency matrix of the d -regular bipartite connected graph on vertex sets A and B formed by independently connecting each vertex t ∈ B to d randomly chosen verticesfrom A . If the resulting graph is not connected, set β to β + ∆ β where ∆ β = 0 . ρ ∈ [0 ,

1] and η ∈ [0 , .

8] subject to the constraint δ . = 1 + ρ ( d − ≥ η d . We satisfy this constraint by ﬁrst uniformly generating ρ and then generating η uniformly from [0 , min( √ δd − , . σ = √ δ ( δ − η d ) η . IMODULE SEARCH PROCEDURE Fig 6: Assessing the accuracy of our p-value estimate ˆ p ( A, t ): We used the eQTL data fromSection 5 and chose a bimodule with 24 SNPs (used as A ) and selected t to be a gene fromthe same bimodule. We then performed 10 random permutation of the sample labels forthe gene t and repeatedly estimated ˆ p ( A, t ) for each permutation after removing the eﬀectsof covariates (Appendix B.5).The constants ( ρ, β, η ) in the above procedure have the following intuitive role: ρ is theintra-correlation between any two features from the set A , β ∈ [0 ,

1] controls the edgedensity of the of the regressor-graph D , and η is the cross-correlation between features from B and adjacent features from A in the regressor-graph. The following Lemma shows thatour choice of parameters indeed results in population cross-correlation of η between featuresconnected by the regressor-graph: Lemma . Fix ρ, η ∈ [0 , , a, b ∈ N and d ∈ { , . . . a } so that δ . = 1 + ρ ( d − ≥ η d .Suppose X is an a -dimensional random vector with covariance matrix Cov( X ) = ρU a +(1 − ρ ) I a , where U a ∈ R a × a is the matrix of all ones and I a ∈ R a × a is the identity matrix.Next suppose D is a { , } valued a × b dimensional matrix that has exactly d ones in eachcolumn. Finally let σ = (cid:112) δ ( δ − η d ) /η and suppose the b -dimensional random vector Y isgiven by Y = D t X + (cid:15) where (cid:15) is another b -dimensional random vector independent of X with Cov( (cid:15) ) = σ I b .Then (23) Cor( X , Y ) (cid:12) D = ηD where Cor( X , Y ) ∈ R a × b is the cross-correlation matrix between random vectors X and Y ,and (cid:12) represents the element-wise product of matrices (i.e., the Hadamard product). Proof.

Since we are concerned with covariances, we can assume by mean centering that E X = 0 ∈ R a and E Y = E (cid:15) = 0 ∈ R b . Note that D t e a = de b and U a = e a e ta , where M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL e r . = (1 , . . . , t ∈ R r for r ∈ { a, b } . Hence using independence of X and (cid:15) :Cov( Y ) = E ( YY t ) = D t E ( XX t ) D + E ( (cid:15)(cid:15) t )= D t Cov( X ) D + Cov( (cid:15) ) = D t ( ρe a e ta + (1 − ρ ) I a ) D + σ I b = ρ ( D t e a ) t ( D t e a ) + (1 − ρ ) D t D + σ I b = ρd e b e tb + (1 − ρ ) D t D + σ I b Since all the diagonal entries of D t D have the value d ,(24) diag[Cov( Y )] = ( ρd + (1 − ρ ) d + σ ) I b = ( dδ + σ ) I b = (cid:18) δη (cid:19) I b where for any square matrix A , diag[ A ] denotes the diagonal matrix obtained from A bysetting all the oﬀ-diagonal entries of A to 0.We can similarly calculate the cross-covariance between X and Y (25) Cov( X , Y ) = E ( XY t ) = E ( XX t ) D = ( ρe a e ta + (1 − ρ ) I a ) D = ρde a e tb + (1 − ρ ) D, and also ﬁnally the cross-correlation between X and Y using (25), (24) and diag[Cov( X )] = I a : Cor( X , Y ) = diag[Cov( X )] − Cov( X , Y ) diag[Cov( Y )] − = ηδ (cid:0) ρde a e tb + (1 − ρ ) D (cid:1) = ηδ (cid:0) ρd ¯ D + (1 − ρ + ρd ) D (cid:1) = ηD + ηρdδ − ¯ D. where ¯ D . = 1 − D . In particular this shows (23). C.2. Results from sCCA.

We ran sCCA on the simulated dataset to search for 100canonical covariates for a range of values of the penalty parameter λ . The sizes of the bi-modules for various values of λ can be seen in Figure 7. For λ ∈ { . , . , . , . , . } ,the ﬁrst two columns of the following table show the number of true bimodules (TB) thatoverlapped with each detected bimodule (DB) and the edge-error of each DB averaged overall DBs. The last column shows the top 5 (or bottom 95) percentile recall among the truebimodules. λ λ = 0 .

01 has small edge-error, but poor recall. The recall improveson increasing λ , but the edge-error degrades. C.3. Performance of BSP and CONDOR on increasing sample size.

We in-creased that sample size of the simulation study in Section 4 to n = 600, and re-ran BSPand CONDOR with the same parameters as earlier. The average edge-error for BSP andCONDOR was 0.05 and 0.10 respectively. As seen in Figure 8, BSP and CONDOR bothrecall most bimodules with cross-correlation strength above 0.3, however Jaccard for BSPand CONDOR has degraded. This can be explained by noting that 25% or BSP bimodulesnow overlapped with two or more true bimodules compared to 6% when n = 200. IMODULE SEARCH PROCEDURE Fig 7: The sizes of sCCA bimodules for various values of the penalty parameter λ alongwith sizes of the true bimodules.Fig 8: Average recall and Jaccard for true bimodules in the simulation with 600 samples. M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

APPENDIX D: GTEX RESULTS

D.1. Data acquisition and preprocessing.

We obtained genotype and thyroid ex-pression data for 574 individuals from the dbGap website (accession number: phs000424.v8.p1).We directly used the ﬁltered and normalized gene expression data and covariates providedfor eQTL analysis but ﬁltered the SNPs in the genotype data using the LD pruning soft-ware

SNPRelate

Zheng (2015). The software retained 556K autosomal SNPs with minorallele frequency above 0 . . . The latter threshold was chosen to balancethe number of retained SNPs and information loss.There were 68 covariates provided for the Thyroid tissue consisting of the top 5 genotypeprinciple components; 60 PEER covariates, and 3 additional covariates for sequencing plat-form, sequencing protocol, and sex. We accounted for these covariates by the modiﬁcationto BSP mentioned in Appendix B.5. D.2. Choice of false discovery parameter to BSP.

We chose the false discoverparameter α for BSP from the grid { . , . , . , . , . } by ﬁnding the largest α thatkept the average edge-error estimates based on N = 5 half-permutations under 0.05 (seeAppendix B.3). However our error estimates were variable as we obtained α = 0 .

05 in oneinstance and α = 0 .

03 in another. We conservatively chose α = 0 . D.3. Hardware and software stack.

The various methods used is this analysis wererun on a dedicated computer that had Intel (R) Xeon (R) E5-2640 CPU with 20 parallelcores at 2.50 Hz base frequency, and a 512 GB random access memory along with L1, L2and L3 caches of sizes 1.3, 5 and 50 MB respectively. The computer ran Windows server2012 R2 operating system and we used the Microsoft R Open 3.5.3 software to performmost of our analysis, since it has multi-core implementations of linear algebra routines.

D.4. Bimodule connectivity thresholds and network sparsity.

Figure 9 showstwo network statistics for bimodules found by BSP – connectivity threshold and tree-multiplicity. All bimodules have tree multiplicity under 10. This shows that the associationnetwork for large bimodules, particularly having low connectivity-thresholds, is sparse.

D.5. Connectivity of bimodules under edges from combined eQTL analysis.

Here we examine which bimodules are connected under the combined edges from cis -eQTLand trans -eQTL analysis, based on geometric size of the bimodule. Figure 10 (left) showsthat all the bimodules that have either one gene or one SNP are connected. Hence, thesebimodules could have been recovered using standard eQTL analysis. On the other hand ifwe restrict to bimodules with two or more genes and SNPs, we see that (Figure 10; right)the fraction of connected bimodules tends to decrease as the geometric size of the bimodulesincreases.

D.6. Bimodule association networks.

See the plots in Figure 11.

D.7. Gene Ontology.

18 out of the 145 BSP gene sets, and 1 out of the 5 CONDORgene sets that we considered had signiﬁcant overlap with GO categories. Among the 40 GOterms detected by the CONDOR, 27 terms were also found among the 135 terms detectedby BSP. The GO terms that were discovered for the two methods did not seem speciﬁc tothyroid, the tissue under investigation. Complete list of the GO terms for the two methodsis as follows. Signiﬁcant GO terms for BSP:

IMODULE SEARCH PROCEDURE Fig 9: Connectivity-threshold and tree-multiplicity for BSP bimodules compared to theirgeometric size. The horizontal dotted line represents the threshold obtained from standardtrans-analysis.Fig 10: Connectivity of BSP bimodules under combined edges from cis -eQTL and trans -eQTL analysis. Left: the number of bimodules that are connected and are one bimodules ,i.e. have one gene or one SNP. Right: Among bimodules having two or more genes and SNPs(i.e. are not one bimodules), the connectivity and geometric size of the bimodules. M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

Fig 11: Out of 31 BSP bimodules that had genes on 3 or more chromosomes and SNPs on 2or more chromosomes, we selected 9 bimodules that looked interesting. The bipartite graphfor each bimodule is formed out of the essential edges (Section 5.3.3).

IMODULE SEARCH PROCEDURE GO.ID Term bimod.1 GO:0060333 interferon-gamma-mediated signaling path... 12 GO:0002478 antigen processing and presentation of e... 13 GO:0019884 antigen processing and presentation of e... 14 GO:0048002 antigen processing and presentation of p... 15 GO:0019882 antigen processing and presentation 16 GO:0071346 cellular response to interferon-gamma 17 GO:0034341 response to interferon-gamma 18 GO:0019886 antigen processing and presentation of e... 19 GO:0002495 antigen processing and presentation of p... 110 GO:0002504 antigen processing and presentation of p... 111 GO:0045087 innate immune response 112 GO:0050776 regulation of immune response 113 GO:0006952 defense response 114 GO:0031295 T cell costimulation 115 GO:0031294 lymphocyte costimulation 116 GO:0050852 T cell receptor signaling pathway 117 GO:0002768 immune response-regulating cell surface ... 118 GO:0002764 immune response-regulating signaling pat... 119 GO:0050851 antigen receptor-mediated signaling path... 120 GO:0002682 regulation of immune system process 121 GO:0022409 positive regulation of cell-cell adhesio... 122 GO:0002253 activation of immune response 123 GO:0002429 immune response-activating cell surface ... 124 GO:0006950 response to stress 125 GO:0006955 immune response 126 GO:0019221 cytokine-mediated signaling pathway 127 GO:0002757 immune response-activating signal transd... 128 GO:0050870 positive regulation of T cell activation 129 GO:0002479 antigen processing and presentation of e... 130 GO:1903039 positive regulation of leukocyte cell-ce... 131 GO:0042590 antigen processing and presentation of e... 132 GO:0045806 negative regulation of endocytosis 233 GO:0050911 detection of chemical stimulus involved ... 334 GO:0007608 sensory perception of smell 335 GO:0050907 detection of chemical stimulus involved ... 336 GO:0009593 detection of chemical stimulus 337 GO:0007606 sensory perception of chemical stimulus 338 GO:0035459 cargo loading into vesicle 339 GO:0050906 detection of stimulus involved in sensor... 340 GO:0000038 very long-chain fatty acid metabolic pro... 441 GO:0006732 coenzyme metabolic process 442 GO:0006417 regulation of translation 543 GO:0034248 regulation of cellular amide metabolic p... 544 GO:0010608 posttranscriptional regulation of gene e... 545 GO:0046597 negative regulation of viral entry into ... 646 GO:0035455 response to interferon-alpha 6 M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

47 GO:0035456 response to interferon-beta 648 GO:0046596 regulation of viral entry into host cell 649 GO:0045071 negative regulation of viral genome repl... 650 GO:1903901 negative regulation of viral life cycle 651 GO:0060337 type I interferon signaling pathway 652 GO:0071357 cellular response to type I interferon 653 GO:0034340 response to type I interferon 654 GO:0045069 regulation of viral genome replication 655 GO:0048525 negative regulation of viral process 656 GO:0019079 viral genome replication 657 GO:0046718 viral entry into host cell 658 GO:1903900 regulation of viral life cycle 659 GO:0030260 entry into host cell 660 GO:0044409 entry into host 661 GO:0051806 entry into cell of other organism involv... 662 GO:0051828 entry into other organism involved in sy... 663 GO:0043901 negative regulation of multi-organism pr... 664 GO:0034341 response to interferon-gamma 665 GO:0050792 regulation of viral process 666 GO:0051607 defense response to virus 667 GO:0051701 interaction with host 668 GO:0043903 regulation of symbiosis, encompassing mu... 669 GO:0009615 response to virus 670 GO:0051225 spindle assembly 771 GO:0007030 Golgi organization 772 GO:0007051 spindle organization 773 GO:0010256 endomembrane system organization 774 GO:0000226 microtubule cytoskeleton organization 775 GO:0007017 microtubule-based process 776 GO:0070925 organelle assembly 777 GO:0007010 cytoskeleton organization 778 GO:0007156 homophilic cell adhesion via plasma memb... 879 GO:0098742 cell-cell adhesion via plasma-membrane a... 880 GO:0098609 cell-cell adhesion 881 GO:0007155 cell adhesion 882 GO:0022610 biological adhesion 883 GO:0007416 synapse assembly 884 GO:0007267 cell-cell signaling 885 GO:0006355 regulation of transcription, DNA-templat... 986 GO:1903506 regulation of nucleic acid-templated tra... 987 GO:2001141 regulation of RNA biosynthetic process 988 GO:0006351 transcription, DNA-templated 989 GO:0097659 nucleic acid-templated transcription 990 GO:0032774 RNA biosynthetic process 991 GO:0051252 regulation of RNA metabolic process 992 GO:2000112 regulation of cellular macromolecule bio... 993 GO:0010556 regulation of macromolecule biosynthetic... 994 GO:0019219 regulation of nucleobase-containing comp... 9

IMODULE SEARCH PROCEDURE

95 GO:0031326 regulation of cellular biosynthetic proc... 996 GO:0034654 nucleobase-containing compound biosynthe... 997 GO:0009889 regulation of biosynthetic process 998 GO:0018130 heterocycle biosynthetic process 999 GO:0019438 aromatic compound biosynthetic process 9100 GO:0010468 regulation of gene expression 9101 GO:1901362 organic cyclic compound biosynthetic pro... 9102 GO:0016070 RNA metabolic process 9103 GO:0001580 detection of chemical stimulus involved ... 10104 GO:0050912 detection of chemical stimulus involved ... 10105 GO:0050913 sensory perception of bitter taste 10106 GO:0050909 sensory perception of taste 10107 GO:0050907 detection of chemical stimulus involved ... 10108 GO:0009593 detection of chemical stimulus 10109 GO:0007606 sensory perception of chemical stimulus 10110 GO:0050906 detection of stimulus involved in sensor... 10111 GO:0007600 sensory perception 10112 GO:0051606 detection of stimulus 10113 GO:0050877 nervous system process 10114 GO:0003008 system process 10115 GO:0007186 G-protein coupled receptor signaling pat... 10116 GO:0006355 regulation of transcription, DNA-templat... 11117 GO:1903506 regulation of nucleic acid-templated tra... 11118 GO:2001141 regulation of RNA biosynthetic process 11119 GO:0006351 transcription, DNA-templated 11120 GO:0097659 nucleic acid-templated transcription 11121 GO:0032774 RNA biosynthetic process 11122 GO:0051252 regulation of RNA metabolic process 11123 GO:2000112 regulation of cellular macromolecule bio... 11124 GO:0010556 regulation of macromolecule biosynthetic... 11125 GO:0019219 regulation of nucleobase-containing comp... 11126 GO:0031326 regulation of cellular biosynthetic proc... 11127 GO:0034654 nucleobase-containing compound biosynthe... 11128 GO:0009889 regulation of biosynthetic process 11129 GO:0018130 heterocycle biosynthetic process 11130 GO:0019438 aromatic compound biosynthetic process 11131 GO:0010468 regulation of gene expression 11132 GO:1901362 organic cyclic compound biosynthetic pro... 11133 GO:0016070 RNA metabolic process 11134 GO:1901685 glutathione derivative metabolic process 12135 GO:1901687 glutathione derivative biosynthetic proc... 12136 GO:0006749 glutathione metabolic process 12137 GO:0042178 xenobiotic catabolic process 12138 GO:0042537 benzene-containing compound metabolic pr... 12139 GO:0006575 cellular modiﬁed amino acid metabolic p... 12140 GO:0044272 sulfur compound biosynthetic process 12141 GO:0046854 phosphatidylinositol phosphorylation 13142 GO:0046834 lipid phosphorylation 13 M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

143 GO:0048015 phosphatidylinositol-mediated signaling 13144 GO:0048017 inositol lipid-mediated signaling 13145 GO:0006882 cellular zinc ion homeostasis 14146 GO:0055069 zinc ion homeostasis 14147 GO:0010273 detoxiﬁcation of copper ion 14148 GO:1990169 stress response to copper ion 14149 GO:0061687 detoxiﬁcation of inorganic compound 14150 GO:0097501 stress response to metal ion 14151 GO:0071294 cellular response to zinc ion 14152 GO:0071280 cellular response to copper ion 14153 GO:0046916 cellular transition metal ion homeostasi... 14154 GO:0071276 cellular response to cadmium ion 14155 GO:0046688 response to copper ion 14156 GO:0055076 transition metal ion homeostasis 14157 GO:0072488 ammonium transmembrane transport 15158 GO:0006089 lactate metabolic process 16159 GO:0006882 cellular zinc ion homeostasis 17160 GO:0055069 zinc ion homeostasis 17161 GO:0006882 cellular zinc ion homeostasis 18162 GO:0055069 zinc ion homeostasis 18Signiﬁcant GO terms for CONDORGO.ID Term bimod1 GO:0050852 T cell receptor signaling pathway 12 GO:0050851 antigen receptor-mediated signaling path... 13 GO:0006355 regulation of transcription, DNA-templat... 14 GO:1903506 regulation of nucleic acid-templated tra... 15 GO:2001141 regulation of RNA biosynthetic process 16 GO:0060333 interferon-gamma-mediated signaling path... 27 GO:0002478 antigen processing and presentation of e... 38 GO:0019884 antigen processing and presentation of e... 39 GO:0048002 antigen processing and presentation of p... 310 GO:0019886 antigen processing and presentation of e... 311 GO:0002495 antigen processing and presentation of p... 312 GO:0002504 antigen processing and presentation of p... 313 GO:0019882 antigen processing and presentation 314 GO:0031295 T cell costimulation 315 GO:0031294 lymphocyte costimulation 316 GO:0060333 interferon-gamma-mediated signaling path... 317 GO:0050852 T cell receptor signaling pathway 318 GO:0050870 positive regulation of T cell activation 319 GO:1903039 positive regulation of leukocyte cell-ce... 320 GO:0050778 positive regulation of immune response 321 GO:0002253 activation of immune response 322 GO:0050851 antigen receptor-mediated signaling path... 323 GO:0022409 positive regulation of cell-cell adhesio... 3

IMODULE SEARCH PROCEDURE E d g e - e rr o r e s t i m a t e False discovery parameter ( α ) Fig 12:

Average edge-error estimates for BSP results for the climate data based on 100 half-permutations(Section B.3.1) for α ranging from 0.01 to 0.05. The edge-error estimates exceed 0 .

05 for the ﬁrst time at α = 0 .

24 GO:0071346 cellular response to interferon-gamma 325 GO:0051251 positive regulation of lymphocyte activa... 326 GO:1903037 regulation of leukocyte cell-cell adhesi... 327 GO:0034341 response to interferon-gamma 328 GO:0002696 positive regulation of leukocyte activat... 329 GO:0050863 regulation of T cell activation 330 GO:0050867 positive regulation of cell activation 331 GO:0007159 leukocyte cell-cell adhesion 332 GO:0050776 regulation of immune response 333 GO:0002429 immune response-activating cell surface ... 334 GO:0002684 positive regulation of immune system pro... 335 GO:0006955 immune response 336 GO:0022407 regulation of cell-cell adhesion 337 GO:0045087 innate immune response 338 GO:0002768 immune response-regulating cell surface ... 339 GO:0045785 positive regulation of cell adhesion 340 GO:0002455 humoral immune response mediated by circ... 341 GO:0051249 regulation of lymphocyte activation 342 GO:0019221 cytokine-mediated signaling pathway 343 GO:0042110 T cell activation 3APPENDIX E: CLIMATE ANALYSIS APPENDIXFigure 12 shows the edge-error estimates we used to choose α . Table 4 show a summaryof cross-correlations for each precipitation pixel from the two BSP bimodules. Address of First and Second authors,E-mail: [email protected] ; [email protected] Address of last authorE-mail: [email protected] M. DEWASKAR, J. PALOWITCH, M. HE, M.I. LOVE, AND A.B. NOBEL

A P

Pixel Mean SD1 0.28 0.072 0.27 0.063 0.28 0.084 0.27 0.085 0.31 0.066 0.30 0.08

B P

Pixel Mean SD1 0.31 0.042 0.35 0.033 0.29 0.04

Table 4